
Microexpression Spotting in Video Using Optical Strain

by

Sridhar Godavarthy

A thesis submitted in partial fulfillment of the requirements for the degree of
Master of Science in Computer Science
Department of Computer Science and Engineering
College of Engineering
University of South Florida

Major Professor: Dmitry B. Goldgof, Ph.D.
Sudeep Sarkar, Ph.D.
Rangachar Kasturi, Ph.D.

Date of Approval: July 1, 2010

Keywords: Computer vision, FACS, Optical flow, Facial deformation, Datasets

Copyright © 2010, Sridhar Godavarthy

DEDICATION

To the Omnipotent, Omnipresent and Omniscient one... What are we, but for him?

ACKNOWLEDGEMENTS

First and foremost: God - for allowing me to tread this path. My parents Subrahmanya

Sarma and Raja Rajeswari, brother Dr. Srinivasa Sastry and sister-in-law Sarada - but for whose support, trust and perseverance, I would have been just another baccalaureate.

A special thanks to Dr. Dmitry Goldgof for accepting me as his student - I have never worked under a more understanding professor. My sincere gratitude goes to Dr. Rangachar

Kasturi - the first person from USF I spoke to - if not for his encouraging words, I might have missed this golden opportunity of working with so many intelligent minds. I am also grateful to Dr.

Sudeep Sarkar for his timely suggestions and ideas. I am greatly indebted to all three of them - each of them a brilliant mind overflowing with ideas, of which, I was able to glean a few- for their guidance and motivation.

Dr. Vasant Manohar - whose spark started this whole work, Matthew Shreve for sharing his knowledge, his technical support and guidance. Joshua Candamo – my inspiration.

I must thank my supervisor Daniel Prieto, an amazing person - for understanding my constraints and allowing me to work flexibly.

Parameswaran Krishnan and Ramya Parameswaran - both unrelated to each other - my prime movers and pillars of support, who will never willingly accept my thanks. Ravikiran

Krishnan- for being my constant companion, those midnight trips to Starbucks and brainstorming sessions. Lucy Puentes for her pep talks and motivation in my times of need. Ekta Karia for her delightful company and more importantly, sticking by me despite all the pestering I subjected her to. Pavithra Bonam for her confidence in me and all those delightful dinners. Ravi Panchumarthy, a gem of a person- for his support and company. Valentine Gnana Selvi for being “that” friend who blindly believes in you. Last but by no means the least, Aruna Pandian – for showing me what life is all about.

There are several others who helped me to get to where I am. I am thankful to every one of them.

Portions of the research in this paper use the Canal 9 Political Debates Database made available by A. Vinciarelli at the Idiap Research Institute, Switzerland.

TABLE OF CONTENTS

LIST OF TABLES iii

LIST OF FIGURES iv

ABSTRACT v

CHAPTER 1 INTRODUCTION 1
  1.1 Microexpression 1
  1.2 Applications of Automatic Microexpression Detection 2
  1.3 FACS 3
  1.4 Expression Segmentation 4
  1.5 Aim of This Research 5
  1.6 Proposed Method 5
  1.7 Thesis Organization 6

CHAPTER 2 PREVIOUS WORK 7
  2.1 Microexpressions 7
  2.2 Automatic Recognition Systems 8
    2.2.1 Face Detection 9
    2.2.2 Feature Extraction 10
    2.2.3 Expression Analysis 10
  2.3 Datasets 11

CHAPTER 3 BACKGROUND 14
  3.1 Optical Flow 14
  3.2 Stress and Strain 16
  3.3 Strain Magnitude Characteristics 17
  3.4 Computation of Strain 20
  3.5 The OpenCV Haar Classifier 21

CHAPTER 4 ALGORITHM AND IMPLEMENTATION 24
  4.1 Overview 24
  4.2 Face Alignment 26
  4.3 Optical Flow 28
  4.4 Computation of Optical Strain 28
  4.5 Split into Regions 33
  4.6 Thresholding 34

CHAPTER 5 DATASETS 37
  5.1 USF Dataset 37
    5.1.1 Feigned Dataset 37
    5.1.2 Live Questioning Dataset 37
  5.2 Canal 9 Dataset 38
  5.3 Found Videos 39

CHAPTER 6 EXPERIMENTAL RESULTS 41
  6.1 Determination of Threshold 41
  6.2 Results with Only Macroexpressions 43
  6.3 Microexpression Spotting Results 44

CHAPTER 7 CONCLUSION AND FUTURE WORK 48
  7.1 Conclusions 48
  7.2 Future Work 49

REFERENCES 50


LIST OF TABLES

Table 1.1 Comparison of various facial expression coding schemes 4

Table 2.1 Select automatic facial expression recognition systems in literature 13

Table 4.1 List of parameters used 24

Table 5.1 Characteristics of datasets used for evaluation 40

Table 6.1 Results of thresholding on the training set with 22 microexpressions 42

Table 6.2 Microexpression spotting results 44


LIST OF FIGURES

Figure 1.1 Sample microexpressions 2

Figure 3.1 Errors in optical flow caused by change in illumination 15

Figure 3.2 Errors in optical flow due to rotation of the object 15

Figure 3.3 Types of motion in an object when acted upon by an external force 16

Figure 3.4 Sample frames from the stability test video 18

Figure 3.5 Optical flow and optical strain with out of plane movement 19

Figure 3.6 Representations of some Haar like features from the OpenCV library 22

Figure 4.1 Flowchart of proposed algorithm 25

Figure 4.2 Illustration of the face alignment process 27

Figure 4.3 Dense optical flow on two frames of the Alex Rodriguez video 29

Figure 4.4 Computed dense optical flow on two frames of the USF dataset 30

Figure 4.5 Strain maps for the optical flow computed in Figure 4.3 31

Figure 4.6 Strain map for the optical flow computed in Figure 4.4 32

Figure 4.7 Split of face into separate regions 33

Figure 4.8 Algorithm for thresholding strain magnitudes 35

Figure 4.9 Example strain thresholding 36

Figure 5.1 Canal 9 dataset 39

Figure 5.2 Found video dataset 40

Figure 6.1 ROC curve used for threshold determination 41

Figure 6.2 Typical strain pattern for a microexpression 42

Figure 6.3 Strain map and thresholding with only macroexpressions 44

Figure 6.4 Strain plots for select USF sequences 45

Figure 6.5 Microexpression spotting results 46


Microexpression Spotting in Video Using Optical Strain

Sridhar Godavarthy

ABSTRACT

Microexpression detection plays a vital role in applications such as lie detection and psychological consultations. Current research is progressing in the direction of automating microexpression recognition, aiming to classify the microexpressions in terms of FACS Action Units. Although high detection rates are being achieved, the datasets used to evaluate these systems are highly restricted: they are limited in size (usually still pictures or extremely short videos), motion constrained, contain only a single microexpression, and do not contain negative cases where microexpressions are absent. Only a few of these systems run in real time and even fewer have been tested on real-life videos.

This work proposes a novel method for automated spotting of facial microexpressions as a preprocessing step to existing microexpression recognition systems. By identifying and rejecting sequences that do not contain microexpressions, longer sequences can be converted into shorter, constrained, relevant sequences, each containing a single microexpression, which can then be passed as input to existing systems, improving their performance and efficiency.

This method utilizes the small temporal extent of microexpressions for their identification.

The extent is determined by the period for which the strain, caused by the non-rigid motion of the facial skin during facial movement, persists. The subject's face is divided into sub-regions, and facial strain is calculated for each of these regions. The strain patterns in individual regions are used to identify subtle changes which facilitate the detection of microexpressions. The strain magnitude is calculated using the central difference method over the robust, dense optical flow field of each subject's face. The computed strain is then thresholded using a variable threshold. If the duration for which the strain stays above the threshold corresponds to the duration of a microexpression, a detection is reported.

The datasets used for algorithm evaluation comprise a mix of natural and enacted microexpressions. The results were promising, with up to an 80% true detection rate. The increased number of false positive spots in the Canal 9 dataset can be attributed to the subjects talking, which causes fine movements in the mouth region. Performing speech detection to identify sequences where the subject is talking, and excluding the mouth region during those periods, could help reduce the number of false positives.


CHAPTER 1

INTRODUCTION

Decision making is an integral part of our day-to-day life. Some of the decisions we make are based on discernible facts and are well thought out, whilst others are made instantaneously, considering only the immediately available evidence. In the latter scenario, the human mind does not consciously weigh the various contributing factors; instead, it just recognizes that there is something "right" or "wrong" with the situation. A significant number of these instantaneous decisions are related to humans and are made when in direct visual contact with them – most often when listening to them. When individuals are lying, trying to deceive, or attempting to hide their natural emotions, their bodies exhibit subtle anomalies that can be perceived and used by our minds in the decision making [1]. Some of these 'visible cues' may include, but are not limited to, a partial scorn, a shrug of the shoulder, or a raised eyebrow. These anomalies are so consistent and so full of information that researchers in psychology have even christened them – microexpressions.

1.1 Microexpression

A microexpression can be described as an involuntary pattern of the human body that is significant enough to be observable, but is too brief to convey an emotion [2]. They are very subtle and are easy to miss for the untrained eye. A typical microexpression lasts only a fraction of a second – less than 1/5th of a second [3] – as opposed to a regular expression that can last for several seconds; hence the term "micro"expression. Microexpressions are not restricted to the face and can occur in any part of the body. Microexpressions can be classified, based on the type of modification to the genuine expression, into three types [3] as follows:

i. Simulated expressions: when a microexpression is not accompanied by a genuine expression.
ii. Neutralized expressions: when a genuine expression is suppressed and the face remains neutral.
iii. Masked expressions: when a genuine expression is completely masked by a falsified expression.

Of the above, type 1 is the easiest to detect, as such expressions are straightforward and do not carry any other interference. An example of a type 1 microexpression would be a half scorn, which would be indicative of contempt (Figure 1.1 a). Type 3 expressions are the hardest to detect, as the microexpression can easily be missed in the presence of a larger and longer expression.

Figure 1.1 Sample microexpressions. a) A partial, clearly visible scorn lasting about 0.06 seconds; b) A microexpression of thrill-duping delight visible along with a partial smile

Microexpressions can span the entire human body, but in this work we focus only on the microexpressions exhibited in the facial region, also known as facial microexpressions.

1.2 Applications of Automatic Microexpression Detection

The most prominent application of automatic microexpression detection is in security, as a lie detecting aid. Although microexpression detection as a lie detection technology is not yet a reality, it is finding use in a variety of other fields. Researchers are working on using microexpressions to detect pain [4] in patients, especially autistic and anaesthetized ones, and for measuring nerve damage in facial regions. Social signal processing research is progressing towards using microexpressions in detecting boredom [5], which may be used by commercial market researchers to measure the impact of their television commercials [6]. Research has also begun on using automated microexpression detection in psychological counseling, such as marriage counseling. Another interesting application is in classrooms, to determine the effectiveness of a lecture from the faces of students.

1.3 FACS

As opposed to the expressions for the six standard emotions (happiness, sadness, anger, fear, surprise and disgust), which are universally recognizable and can be irrefutably associated with their corresponding emotions, microexpressions are harder to discern and even harder to decipher. This can be attributed to the fact that microexpressions are not mere visible cues, but can contain much more information which, if properly analyzed, could even describe the mood or the intentions of the individual exhibiting it [7]. The human mind is a very subjective observer of expressions whereas, to derive maximum information from a microexpression, we need a system of quantitative representation and objective measurement. Several such systems of measurement exist [8], viz. the Facial Affect Scoring Technique (FAST), the Facial Action Coding System (FACS), the Maximally Discriminative Facial Movement Coding System (MAX), facial electromyography (EMG) and Emotion FACS (EMFACS). A comparison of these coding schemes is given in Table 1.1. For a detailed description of the terminology involved, the reader is referred to [9].

FACS is the only coding system that is both observational and anatomical, and it is hence capable of representing, and thereby measuring, any visible facial movement. Because of this, FACS is one of the most extensively used coding systems today. FACS represents each movement of the human face in terms of an Action Unit (AU). The system comprises 24 unique Action Units defined in terms of muscular movements and 19 Action Units defined in terms of actions performed. Every action performed by the human face is expressed as a combination of AUs. For example, a genuine smile is always a combination of AU 12 (Lip Corner Puller) + AU 6 (Cheek Raiser). The AU combinations for known emotional expressions have already been defined. Any deviation from these implies that the individual's expression is altered and not natural. This is the primary principle behind any method used in the recognition of microexpressions, be it manual or automated. In addition to the standard emotions, several AU combinations for deceptive behavior have also been determined, simplifying the detection of some frequently occurring microexpressions.
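To make the AU-combination principle concrete, the short sketch below checks an observed set of Action Units against a table of known prototypes. Only the AU 6 + AU 12 pairing for a genuine smile comes from the text above; the table layout, the function name and any other entries would be illustrative placeholders, not part of FACS itself.

```python
# Illustrative sketch: flag expressions whose Action Units deviate from known prototypes.
# Only the AU 6 + AU 12 (genuine smile) pairing is taken from the text; everything else is a placeholder.
PROTOTYPES = {
    "genuine smile": {6, 12},  # AU 6 (Cheek Raiser) + AU 12 (Lip Corner Puller)
}

def matching_prototype(observed_aus):
    """Return the prototype name whose AU set exactly matches the observed AUs, else None."""
    for name, aus in PROTOTYPES.items():
        if set(observed_aus) == aus:
            return name
    return None

print(matching_prototype({6, 12}))  # "genuine smile"
print(matching_prototype({12}))     # None: a smile without AU 6 deviates from the prototype
```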

Table 1.1 Comparison of various facial expression coding schemes

FAST: measurement method - observational; derivation - theoretical; measurement content - selective. Advantages: one of the first methods to be developed. Disadvantages: used only for negative emotions.
FACS: measurement method - observational; derivation - anatomical; measurement content - comprehensive. Advantages: covers all muscles; allows for discovery of new configurations. Disadvantages: none noted.
MAX: measurement method - observational; derivation - theoretical; measurement content - selective. Advantages: faster performance. Disadvantages: codes only predefined configurations.
EMG: measurement method - obtrusive; derivation - anatomical; measurement content - comprehensive. Advantages: able to detect muscular activity invisible to the naked eye. Disadvantages: self-consciousness; interference from nearby muscles.
EMFACS: measurement method - observational; derivation - theoretical; measurement content - selective. Advantages: selective version of FACS; faster performance. Disadvantages: can only be applied to certain emotion expressions.

1.4 Expression Segmentation

Although there has been a lot of research in the area of spontaneous expression recognition, and many advances have been made towards a clearer definition and recognition of microexpressions, most of this research has been in the field of psychology. All the deciphering and FACS coding has been done manually by experts while observing video recordings. Despite the innumerable possible benefits, efforts to automate expression recognition have only recently started gaining momentum.

Expression segmentation is the first step in any expression recognition system and is vital for many applications that analyze temporal facial changes, including micro- and macro-expression recognition, genuine and feigned emotion detection, animation, and artificial expression generation, among others. A number of problems, such as movements of the head (translation and out-of-plane rotations) and uneven lighting conditions, pose challenges to expression segmentation. Solutions to these problems have been presented for specific scenarios, such as extremely short sequences, restricted movement and constant lighting conditions, but a panacea is yet to be discovered [10]. Further, only a few of the algorithms developed run in real time, and most of them involve preprocessing and training. Present algorithms take far too long to recognize Action Units, even in clips that are just a couple of seconds long, and have not been tested on longer sequences that may or may not contain microexpressions.

1.5 Aim of This Research

The aim of this research is to design a preprocessing system that is capable of spotting microexpressions by filtering out non-microexpressions, thereby helping existing recognition algorithms perform better on video sequences of longer durations. It is unrealistic to assume that a subject's face will remain still during a sequence, so the system should be able to handle small movements of the subject's face.

1.6 Proposed Method

A two-step approach to segmenting temporal microexpressions from face videos is proposed. First, facial strain maps are calculated based on the facial deformation observed in video sequences, using the methods described in [11, 12]. Next, the strain magnitude is calculated and used to segment microexpressions from the video. This approach has the advantage of eliminating motion vectors due to in-plane head movement, thus making the method more stable in real-world scenarios.

One way to analyze the observed facial tissue deformation is to calculate optical strain patterns under the application of a force, a condition that naturally happens during facial expressions [12]. The reasoning behind this approach is threefold:

i. The strain pattern can easily be used in conjunction with existing face recognition methods,
ii. The strain pattern is related to the biomechanical properties of facial tissues, and
iii. As has been demonstrated empirically in [13], the strain pattern is more stable under lighting variations, compared to other methods that use only single frames.

1.7 Thesis Organization

Chapter 2 describes the work so far on automated facial expression recognition, along with the constraints and advantages of each method. Chapter 3 gives a brief introduction to the concepts involved in the proposed method. The proposed algorithm, the assumptions made, the reasoning behind the assumptions and the actual implementation are described in detail in Chapter 4. The datasets we have used and the characteristics of the videos in our dataset are described in Chapter 5. Chapter 6 discusses the results of our approach and, finally, Chapter 7 gives a brief insight into the road ahead.


CHAPTER 2

PREVIOUS WORK

2.1 Microexpressions

Microexpressions are exhibited when people try to control a natural expression, and they may have been in place even before the beginning of verbal communication; humans have been able to identify the meaning behind them, most often subconsciously. Although the discovery of microexpressions and the first published work on them is attributed to Haggard and Isaacs [14], Ekman et al. [15] note that Darwin, in his seminal work [16], had inadvertently mentioned the features of a microexpression and their characteristic involuntary nature.

Research, on microexpressions, has progressed in two major directions:

i. Research in psychology is progressing towards understanding the reasoning behind the occurrence of microexpressions and the implications of an exhibited microexpression.
ii. Research in computer vision and pattern recognition is progressing towards automated expression recognition and its applications.

Dr. Paul Ekman is widely considered the father of microexpressions, and his inspiration to work on microexpressions may well have started with an interest in face reading [7]. Since then, Dr. Ekman and his collaborator Dr. Wallace Friesen have published several works on expressions, discussing their occurrence, their implications and their meaning in a cross-cultural context [1, 17-19]. Studies have also shown that although the emotions may be the same, the expression of emotion is hereditary [20] and depends on upbringing; hence some expressions may vary based on geographic location. Research has also shown that the extent of microexpressions, or even their presence or absence, is highly dependent on the consequences involved [1]. For example, a subject who has been caught lying to a parent is less likely to exhibit discernible microexpressions when compared to another subject who may be involved in illegal activities. Training and the frequency of deceit also affect the expression of microexpressions – a trained spy is less likely to be caught exhibiting microexpressions than an amateur.

The prime characteristic of a microexpression is the fact that it is involuntary and cannot be forced. Hence, collecting datasets of microexpressions is a very complex task. Informed consent would nullify the actual purpose, and even if videos were recorded first and consent requested afterwards, obtaining consent for the use of that data is difficult, since microexpressions usually occur in situations that could be embarrassing to the subject. Several psychologists use data collected from their counseling sessions and hence are unable to share datasets for automation of expression recognition by engineers.

2.2 Automatic Facial Expression Recognition Systems

With the increasing requirement for faster, easier and more intuitive interfaces between humans and computers, fueled by the ever-improving computing field, automatic analysis of the human face is being looked to for a solution. Analysis of human facial expressions has been around for a long time, albeit performed manually. Earlier attempts at developing an automatic facial expression recognition system were not greatly successful because of the unavailability of sufficient technology and expertise in various supporting areas such as image processing, computer vision and pattern analysis, and also because of the lack of powerful hardware capable of handling the data.

According to [21], the entire process of automatic facial expression recognition can be divided into three main phases as described below:

i. Detection of face in the image

ii. Extraction of the required features from the face detected in (i) above

iii. Analysis of the extracted features


2.2.1 Face Detection

Stan Z. Li et al. [22] give a detailed insight into the various methods of face detection. Although the book deals with face detection in the context of face recognition, the methods used are the same for facial expression recognition. Face detection comes with its own challenges – internal and external. Internal factors are those that are inherent in the face, such as facial hair, expressions and texture, while external factors are those that are not under the control of the subject, such as illumination, occlusions, camera quality, pose and rotation of the face [23]. Several approaches exist for face detection:

Knowledge-based methods emulate the human method of face detection by converting the properties that identify a face into rules, which are then used for detection. These methods are further classified into top-down and bottom-up approaches. Several works have been published based on this kind of method using varying rules; Guangzheng and Thomas propose one such hierarchical rule-based classifier in [24].

Feature-invariant methods work on the principle that there are several features in the human face that are invariant to outside factors and are used by humans for detection. Yow and Cipolla [25-27] use multiple facial features, while [28, 29] use facial texture. Some works use various color representations to determine skin models.

Template matching is another approach to face detection, in which a predefined face model is available and the correlation between the template model and the presented image is used to detect the location of the face. Several works have been published based on this method, but it has been shown that template-based methods do not produce the best results [23].

In appearance-based methods, there is usually a classifier, such as a neural network or an SVM, which is trained on a set of input images and is then used to classify test images. These methods vary in the features selected for use in the classifier and may even include features from the knowledge-based and feature-invariant methods. The Viola-Jones face detector [30] – arguably the fastest and most used face detector [21] – belongs to this category.


2.2.2 Feature Extraction

In order to automate the process of feature extraction, a model of the face is developed and the available visual information from the detected face is correlated to the developed model.

Several models are in existence today. The most prominent among them are the point-based model, where certain critical points on the face are identified and the distances between them are used to identify the expression on the face [31], and 3D mesh mapping of the whole face [32, 33]. There are also approaches which combine the two. A comprehensive and detailed discussion of this topic is provided in [34]. Most of the challenges presented in the face detection step are obstacles in this step too, and care should be taken to overcome them without altering the available information.

2.2.3 Expression Analysis

Once the features have been extracted, the next step is to interpret the expression. Recent works follow one of two paths. The first is to classify the expression into an emotion based on a prior definition of the expression. However, this is not as easy as it sounds, because the six basic emotions described by Dr. Ekman in [18] are not objectively defined, leaving the interpretation to the individual. Another concern is that these six emotions are by no means all-encompassing; several other expressions, such as pain and boredom, have also been described in recent works [4, 5, 35-37], and an attempt to differentiate between pain and crying in newborns is made in [38]. The other path is to code the expression in terms of Action Units as defined in the Facial Action Coding System (FACS) [39]. This is an objective measurement, and the resulting Action Units can then be interpreted or classified into emotions or non-emotions. Most recent works attempt to automatically code the expression in terms of Action Units [4, 40-43].

The system of Essa and Pentland [44] was one of the first automatic facial expression recognition systems to use optical flow, with a very high accuracy rate of 98%. The only concern is that the method requires the absence of rigid body movements.


2.3 Datasets

The choice of dataset is an important factor in the evaluation of any algorithm. Initially, due to the lack of availability of suitable datasets, most algorithms used face recognition datasets, despite these being ill suited for the task [45].

As research progressed, the algorithms became too sophisticated to provide meaningful results when tested on these generic datasets, and researchers soon came up with their own datasets to suit their testing scenarios. The Cohn-Kanade [46] and JAFFE [47] datasets are some of the earliest datasets generated specifically for facial expression analysis. Both comprise static images of subjects. While all the expressions in the JAFFE dataset are enacted, the Cohn-Kanade dataset comprises both enacted and natural expressions that have been FACS coded. The HID [48] and CMU-PIE [49] are other static image datasets that have been used in the literature. The MMI dataset [50] is among the most recent datasets for expression evaluation and comprises FACS-coded static images and video sequences of prolonged enacted expressions.

According to [51], at least 50 examples are needed to train an automatic facial expression recognition system (AFERS) even to achieve moderate accuracy levels of 80%. However, most of the datasets discussed above do not have enough samples of expressions to meet this requirement. Further, research has also shown that enacted expressions are significantly different from natural, spontaneously emitted expressions – not only because they are generated by different parts of the brain, but also because enacted expressions differ from spontaneous ones in the extent of deformation and in duration [51]. Most of the datasets discussed above are enacted sequences and would not produce accurate evaluations of an AFERS. Further, systems that are trained on synthetic expressions would not fare well when used for detecting natural expressions.

The final hurdle in using existing datasets to evaluate the current algorithm is the availability of spontaneous microexpressions. The Rochester/UCSD Facial Action Coding System Database 1 (RUFACS1) [40] has been claimed to consist of several video sequences, each comprising several FACS-coded microexpressions. The videos are recordings of subjects attempting to lie for monetary rewards; the subjects are unaware that their expressions are the primary focus. Although this dataset seems very promising, it has not yet been made publicly available.


Table 2.1 Select automatic facial expression recognition systems in literature

Pantic and Patras [41]: dataset MMI, posed image sequences, 19 subjects / 740 sequences; face detection - manual; features - 15 facial points; analysis - fast direct chaining inference procedure (Action Units); accuracy 87%.
Bartlett et al. [40]: dataset Cohn-Kanade, posed videos, 100/313; face detection - enhanced Viola-Jones method; features - holistic face; analysis - SVM (Action Units); accuracy 94.8%.
Rothwell, Bandar et al. [52]: self-collected dataset, spontaneous image sequences, 39/39; face detection - artificial neural networks; features - holistic face and shoulders; analysis - artificial neural networks (emotion); accuracy 80%, real time.
Pantic and Rothkrantz [42]: self-collected dataset, posed still images (frontal and profile), */40; face detection - skin color based segmentation; features - fiducial points on the face; analysis - fast direct chaining inference procedure (Action Units); accuracy 86%.
Cohen et al. [53]: dataset Cohn-Kanade, posed image sequences, 100/313; face detection - piecewise Bezier volume deformation [54]; features - head motion and local deformation; analysis - HMM (Action Units and emotion); accuracy about 90%.
Tsalakanidou and Malassiotis [55]: self-collected dataset, 3D image sequences, 52/832; face detection - active shape models; features - geometric distances from 81 landmarks and surface deformation; analysis - rule based (Action Units and emotion); accuracy 85.5%.
Tian et al. [56]: datasets Cohn-Kanade and Ekman-Hager, posed image sequences, 210/1917; face detection - neural network based; features - hybrid; analysis - neural networks (Action Units); accuracy 93.3%.
Essa and Pentland [44]: self-collected dataset, image sequences, 8/52; face detection - view-based and modular eigenspace; features - holistic face; analysis - optical flow (Action Units + temporal); accuracy 98%.

* Not applicable / not provided


CHAPTER 3

BACKGROUND

3.1 Optical Flow

Optical flow can be defined as "the velocity of observed or apparent 2D motion vectors" [57]. The motion in question may be caused by the actual motion of the object, a change in camera position, or a change in illumination that mimics object motion. Optical flow is a popular motion estimation technique that represents temporal changes in an image using multiple vectors: each vector begins at the location of a pixel in the initial image and ends at that pixel's location in the time-shifted image. Optical flow utilizes spatial and temporal changes of intensity to find a matching pixel in the target image [58]. Because the method depends heavily on intensities, it is highly sensitive to changes in lighting, and any such changes would be perceived as actual movement. Hence, optical flow only works under the assumption of constant lighting.

Optical flow can be determined by several techniques. One well known technique is the gradient based technique. In this approach, the optical flow (brightness constancy) equation can be expressed as:

I(x, y, t) = I(x + dx, y + dy, t + dt)    (3-1)

where I(x, y, t) is the intensity of a pixel at location (x, y) at time t, and dx and dy are the displacements of the pixel along the x and y directions over the elapsed time dt.


A direct consequence of the above equation is that flow vectors cannot be determined in regions of constant brightness.


Figure 3.1 Errors in optical flow caused by change in illumination. a) Original Image b) Same image with one source of illumination changed c) the generated optical flow between a) and b). Note how there is a huge optic flow due to the apparent motion caused by the change in illumination, even though there is no motion between the two figures.


Figure 3.2 Errors in optical flow due to rotation of the object. a) Original Image b) The object in a) rotated by 180° c) the generated optical flow between a) and b). Note how the only optical flow is corresponding to the center because of the lettering, whereas there should have been a rotational flow over the entire object.

Again, several implementations of the gradient-based technique exist; a good insight into, and comparison of, some of these methods is given in [59]. Among the methods discussed are the Horn-Schunck method [60], the Lucas-Kanade method [61] and the Nagel method [62]. After experimenting with the various methods available, we decided upon Black and Anandan's robust optical flow [63] for its reliable and consistent results.


3.2 Stress and Strain

To completely define a system in which a force acts on a body, we need to define two components: The force acting on the body and the changes undergone by the object while experiencing the force.

The force acting on the object could be external or internal. Stress is defined as the force per unit area and is made up of normal and shear components. Normal stress acts in the normal direction to the plane of the object and causes a change in volume. Shear stress acts in the plane of the object in both directions and it tends to deform the object being acted upon.

The changes undergone by a system are of two types – rigid body motion and deformation. Rigid body motion is one in which the body is either translated or rotated without any change in the shape of the object, i.e., the distance between any two points on the surface of the object remains the same before and after the application of the force. Deformation is when there is a change in the shape of the object, i.e., the separation between at least two points on the surface of the object is different after the application of the force.


Figure 3.3 Types of motion in an object when acted upon by an external force. The original object is shown in dashed lines in the background. a) Original position b) Translation c) Rotation d) Deformation


The relative amount of deformation of the material is known as the strain of the material.

Strain is a property of the material and is independent of the shape of the object. If the amount of displacement of a point on the material is known, the strain can be computed as:

\varepsilon = \frac{l - l_0}{l_0}    (3-2)

where l_0 is the original length and l is the deformed length.

If \mathbf{u} = [u, v]^T is the displacement vector that describes a deformed object, then the infinitesimal strain tensor can be defined as

\varepsilon = \frac{1}{2}\left[ \nabla \mathbf{u} + (\nabla \mathbf{u})^T \right]    (3-3)

which can be expanded to:

\varepsilon = \begin{bmatrix} \varepsilon_{xx} = \frac{\partial u}{\partial x} & \varepsilon_{xy} = \frac{1}{2}\left(\frac{\partial u}{\partial y} + \frac{\partial v}{\partial x}\right) \\ \varepsilon_{yx} = \frac{1}{2}\left(\frac{\partial v}{\partial x} + \frac{\partial u}{\partial y}\right) & \varepsilon_{yy} = \frac{\partial v}{\partial y} \end{bmatrix}    (3-4)

where (\varepsilon_{xx}, \varepsilon_{yy}) are the normal strain components and (\varepsilon_{xy}, \varepsilon_{yx}) are the shear strain components.

3.3 Strain Magnitude Characteristics

As was discussed in Section 3.1, using optical strain to determine deformation results in a more stable system compared to other methods that use static images. To verify this effectiveness of optical strain over optical flow, videos were shot with subjects moving back and forth while enacting an expression. These videos are not a part of the dataset and were used only for verifying the effectiveness of Strain over optical flow. Figure 3.4 shows a sample of these sequences and Figure 3.5 shows the resulting optical flow and optical strain. While an inspection of the optical flow plots does not reveal much information about the start or end of the expressions, a simple visual examination of the optical strain plot clearly shows the peaks


corresponding to the three expressions. By selecting the appropriate threshold, we could easily detect the beginning, peak and end of each expression.


Figure 3.4 Sample frames from the stability test video. a) Neutral frame. b) Face comes forward while performing an expression. c) Face goes back to neutral position d) Face remains in the neutral position, but becomes expressive.



Figure 3.5 Optical flow and optical strain with out of plane movement. a) The optical flow is highly erratic and inconclusive. b) The optical strain shows peaks where the expression occurs and is more readily interpretable.


3.4 Computation of Strain

The strain described in Section 3.2 varies depending on the material being deformed.

The human skin is an elastic material – an elastic material being one which returns to its original state once the external force has been removed. An important property of an elastic material is its modulus of elasticity, defined as the ratio of stress to strain:

E = \frac{\sigma}{\varepsilon}    (3-5)

Once stress is known, we can calculate strain by solving the inverse problem. Solving inverse problems is computationally expensive owing to the fact that inverse problems are ill posed and highly non linear [10].

Another approach is to estimate the strain from (3-3). Although this approach, in its nascent state, involves computing higher-order derivatives of the displacement vector, we can simplify it by approximating with first-order derivatives. If [p_x, p_y]^T is the flow vector obtained from the optical flow as described in (3-1), the displacement vector can be written as

\mathbf{u} = [u, v]^T = [p_x \Delta t, \; p_y \Delta t]^T    (3-6)

where \Delta t is the elapsed time between two image frames. For a constant frame interval, \Delta t is a constant and hence we can obtain the partial derivatives for (3-6) as

\frac{\partial u}{\partial x} = \Delta t \frac{\partial p_x}{\partial x}, \quad \frac{\partial u}{\partial y} = \Delta t \frac{\partial p_x}{\partial y}    (3-7)

\frac{\partial v}{\partial x} = \Delta t \frac{\partial p_y}{\partial x}, \quad \frac{\partial v}{\partial y} = \Delta t \frac{\partial p_y}{\partial y}    (3-8)

The displacement derivatives are computed by finding the slope of the tangent of the displacement function using the finite difference approximation:

\frac{\partial u}{\partial x} \approx \frac{u(x + \Delta x, y) - u(x - \Delta x, y)}{2\Delta x}, \quad \frac{\partial u}{\partial y} \approx \frac{u(x, y + \Delta y) - u(x, y - \Delta y)}{2\Delta y}    (3-9)

\frac{\partial v}{\partial x} \approx \frac{v(x + \Delta x, y) - v(x - \Delta x, y)}{2\Delta x}, \quad \frac{\partial v}{\partial y} \approx \frac{v(x, y + \Delta y) - v(x, y - \Delta y)}{2\Delta y}    (3-10)

where (\Delta x, \Delta y) are preset distances of 2-3 pixels.

The magnitude of the strain is given by:

\varepsilon_m = \sqrt{\varepsilon_{xx}^2 + \varepsilon_{yy}^2 + \varepsilon_{xy}^2 + \varepsilon_{yx}^2}    (3-11)
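A minimal NumPy sketch of this per-pixel computation, assuming the dense displacement components u and v are already available as 2D arrays from the optical flow step; the wrap-around handling at the image borders and the toy example are illustrative choices, not the thesis implementation.

```python
import numpy as np

def central_diff(f, delta, axis):
    """(f(. + delta) - f(. - delta)) / (2 * delta) along one axis, as in (3-9)/(3-10).
    np.roll wraps at the borders, so pixels within `delta` of the edge need separate care."""
    return (np.roll(f, -delta, axis=axis) - np.roll(f, delta, axis=axis)) / (2.0 * delta)

def strain_magnitude(u, v, delta=2):
    """Optical strain magnitude per pixel from a dense displacement field (u, v),
    following (3-9)-(3-11); delta is the preset spacing of 2-3 pixels."""
    du_dx = central_diff(u, delta, axis=1)  # x varies along columns
    du_dy = central_diff(u, delta, axis=0)  # y varies along rows
    dv_dx = central_diff(v, delta, axis=1)
    dv_dy = central_diff(v, delta, axis=0)

    e_xx, e_yy = du_dx, dv_dy               # normal strain components
    e_xy = 0.5 * (du_dy + dv_dx)            # shear strain (e_xy = e_yx)
    return np.sqrt(e_xx ** 2 + e_yy ** 2 + 2.0 * e_xy ** 2)

# Toy check: a pure horizontal stretch u = 0.01 * x gives e_xx = 0.01 away from the borders.
yy, xx = np.mgrid[0:64, 0:64].astype(float)
print(strain_magnitude(0.01 * xx, np.zeros((64, 64)))[8:-8, 8:-8].mean())  # ~0.01
```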

3.5 The OpenCV Haar Classifier

The OpenCV Haar classifier [64] is an efficient implementation of the Viola-Jones boosted cascade classifier [65]. It is a supervised classifier, capable of detecting a specific class of objects, based on an expressive, over-complete set of basis functions. The features used are thresholds applied to sums and differences of rectangular image regions of varying sizes. A basis set is complete if there is no linear dependence between the basis features and it has the same number of elements as the image space [65]; if the number of elements exceeds the image space, the basis set is over-complete. In this case, the number of elements is much greater than the image space.

OpenCV uses the integral image representation [30, 65] which is an intermediate image representation that assists in the faster computation of the rectangular features. The value at a location i, j of the integral image is equal to the sum of all the pixels above and to the left of i, j inclusive such that

ii(i, j) = \sum_{i' \le i, \; j' \le j} I(i', j')    (3-12)

Using the integral image helps speed up the computation by reducing the number of references needed to calculate the sums and differences of pixels within the rectangular region.


Further, the calculation of the integral image is a one-time process and can be done in a single pass over the image.
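A small sketch of the integral image idea using NumPy cumulative sums; OpenCV's cv2.integral produces an equivalent table (padded with an extra leading row and column of zeros). The toy 4x4 image is only for illustration.

```python
import numpy as np

def integral_image(img):
    """ii[i, j] = sum of all pixels above and to the left of (i, j), inclusive, as in (3-12)."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom+1, left:right+1] recovered from only four table references."""
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total

img = np.arange(16, dtype=np.int64).reshape(4, 4)
ii = integral_image(img)
print(rect_sum(ii, 1, 1, 2, 2), img[1:3, 1:3].sum())  # both 30
```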

Figure 3.6 Representations of some Haar like features from the OpenCV library

Although the adopted approach provides for fast computation of the features, the total number of rectangular features for the whole image is prohibitively large. To reduce the feature size and to speed up the process, the AdaBoost learning algorithm [66] is used to select a small subset of features that best separates the positive and negative samples. These classifiers are known as weak classifiers as they have a true positive selection rate of about 99.9% with a false positive rate of about 50%. These weak classifiers are organized in the form of a rejection cascade wherein a series of classifiers are applied to sub regions of varying size in the image.

Only regions that make it through the entire cascade are considered detections. Although the false positive rate for a single classifier is higher than normal, the overall detection rate after the cascade of classifiers is about 98%, with a false positive rate of about 0.0001%. The nodes are arranged such that the initial classifier is the least complex, with complexity increasing along the cascade. This further reduces the computation, as the number of probable regions forwarded to the more complex classifiers shrinks at each node level.
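As a usage illustration (separate from the thesis pipeline, which is described in Chapter 4), the pretrained frontal-face cascade that ships with recent opencv-python builds can be applied as follows; the input file name and the detectMultiScale parameter values are arbitrary choices for this sketch.

```python
import cv2

# Pretrained frontal-face Haar cascade bundled with OpenCV.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("frame.png")          # hypothetical input frame
assert img is not None, "frame.png not found"
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Each detection is an (x, y, width, height) rectangle; scaleFactor and minNeighbors
# trade speed against the false-positive behaviour of the rejection cascade.
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("frame_faces.png", img)
```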


CHAPTER 4

ALGORITHM AND IMPLEMENTATION

4.1 Overview

The primary input to the system is a video sequence of the subject's face, which need not necessarily contain microexpressions. Constant illumination for the duration of the sequence is assumed. The system is capable of handling small rotational and translational movements of the face, but out-of-plane rotation must be avoided. The system expects only one face at a time, but can be modified to handle more than one face by adjusting the face alignment step.

The input video sequence is split into frames. The initial frame is assumed to be a neutral expression, and all deformations are measured with respect to this image. The algorithm comprises the following steps:

i. Face alignment and registration.

ii. Optical flow computation between the neutral face and all other images.

iii. Computation of optical strain from the displacements obtained in step ii above.

iv. Split face into regions based on face visibility.

v. Compute and normalize strain for each region.

vi. Threshold based on magnitude, duration and spatial locality of the strain pattern.

Table 4.1 provides a list of parameters used and Figure 4.1 shows the flowchart of the proposed algorithm.

Table 4.1 List of parameters used

1. Threshold - implemented as a percentage of the local peak strain magnitude (automated)
2. Neutral frame - the first frame in the sequence (manual)
3. Maximum number of frames - 9 (manual)
4. Minimum number of frames - 2 (manual)


[Flowchart: input video sequence → split into individual frames → detect and crop the face (Viola-Jones detector) → translate/rotate to align with the neutral image (Haar cascade, skin detection) → calculate robust optical flow against the neutral frame (Black and Anandan method) → displacement vectors → calculate optical strain (finite difference method) → strain per pixel → split into regions (eight or fewer, depending on visibility) → compute and normalize strain per region → thresholding on magnitude, duration and spatial locality (minimum/maximum interval) → microexpression]

Figure 4.1 Flowchart of proposed algorithm


4.2 Face Alignment

The images in the input sequence are usually not aligned, because of movement of the face. It is essential that the faces be aligned, to ensure that the computed flow and strain are true values arising from deformation and not caused by rigid body movement. We use the Viola-Jones face detector [30] to detect and crop faces from the whole frames. This offers the dual advantages of faster computation and removal of translational motion. If the input comprises more than one face, or other objects that may prove a hindrance, they may be handled in this step. All our videos comprise only a single, full frontal close-up of the face, and hence this has not been incorporated in the implementation. Next, to remove rotation of the face, we perform the following steps (Figure 4.2):

i. Determine location of the eyes using the OpenCV Haar classifier.

ii. Calculate centroid of the eye locations.

iii. Construct the line joining the centroids of the two detected eyes.

iv. Rotate the image so as to align this line with the line corresponding to the neutral image.

Every subsequent image is cropped to the same size as the initial (neutral) image after matching the first detected skin pixel on the top left corner of the image. We use the skin detection implementation in [67] for the skin pixel determination. Manual oversight is required to correct errors which may occur in the process. Some issues to look out for are:

i. The eye detection algorithm may produce erroneous eye detections, such as a wrong location, more than two eyes, or fewer than two eyes.
ii. Some dress or background pixels may be thresholded as skin pixels, or a change in angle may impede the visibility of skin pixels in subsequent frames.
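A condensed sketch of steps i-iv under simplifying assumptions: the face is already cropped, exactly two eyes are detected, and the stock OpenCV eye cascade is used. The helper name and the rotation convention are illustrative, not the exact implementation used in this work.

```python
import cv2
import numpy as np

eye_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_eye.xml")

def align_to_neutral(face_bgr, neutral_eye_angle_deg=0.0):
    """Rotate a cropped face so the line joining its eye centroids matches the neutral image's line."""
    gray = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2GRAY)
    eyes = eye_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(eyes) != 2:
        raise ValueError("expected exactly two eye detections; manual oversight needed")

    # Steps ii/iii: centroid of each eye rectangle and the angle of the line joining them.
    (x1, y1, w1, h1), (x2, y2, w2, h2) = eyes
    c1 = (x1 + w1 / 2.0, y1 + h1 / 2.0)
    c2 = (x2 + w2 / 2.0, y2 + h2 / 2.0)
    angle = np.degrees(np.arctan2(c2[1] - c1[1], c2[0] - c1[0]))

    # Step iv: rotate about the image centre so this line matches the neutral frame's eye line.
    h, w = face_bgr.shape[:2]
    rot = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle - neutral_eye_angle_deg, 1.0)
    return cv2.warpAffine(face_bgr, rot, (w, h))
```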



Figure 4.2 Illustration of the face alignment process. The detected eyes are in red rectangles. The centroids of the detections are in green and the line joining the two centroids is in white. The lines and centroids have been drawn for the purpose of illustration and are not exact measurements. a)Initial image. b) Face is rotated c) Cropped face from a. d) Cropped face from b. e),f) Interim images in the alignment process. g) The second image is rotated to align the two lines.


4.3 Optical Flow

The dense optical flow between the neutral frame and the target frame is computed using the Black and Anandan method described in Section 3.1. Computing the optical flow only after aligning the images prevents the rigid motion of the head from corrupting the flow estimate. Determination of optical flow involves solving robust formulations using optimization techniques [13]. It is known that, when the spatial information used for optical flow determination spans a motion boundary, it might violate the constant illumination assumption and also the assumption that the motion between two frames is smooth. These issues are overcome by using the Lorentzian and Geman-McClure error norm functions, as they provide a gradual transition across motion boundaries. Further, the approach uses a coarse-to-fine strategy [68], determining the optical flow in levels: the algorithm begins at the coarse, spatially filtered level and uses the flow estimate to warp the image to the next, finer level, until full resolution is reached. Hence the Black and Anandan method generates a robust, dense field of flow vectors corresponding to the facial skin deformation. Despite the robust computation, the optical flow can still fail for a few pixels; these pixels must be handled properly and not taken into account in the strain computation in the next step. Figure 4.3 and Figure 4.4 show the dense optical flow for some input frames.
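The flow itself is computed with Black and Anandan's robust method (the Matlab implementation credited in the figure captions). As a stand-in sketch with the same input/output contract – a per-pixel displacement field between the neutral frame and a target frame – OpenCV's Farneback dense flow can be used; the parameter values below are ordinary defaults, not tuned for this task.

```python
import cv2

def dense_flow(neutral_bgr, target_bgr):
    """Dense optical flow from the neutral frame to the target frame.
    Farneback is only a stand-in here for the Black and Anandan robust method."""
    prev = cv2.cvtColor(neutral_bgr, cv2.COLOR_BGR2GRAY)
    curr = cv2.cvtColor(target_bgr, cv2.COLOR_BGR2GRAY)
    # Positional arguments: pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags.
    flow = cv2.calcOpticalFlowFarneback(prev, curr, None, 0.5, 3, 15, 3, 5, 1.2, 0)
    u, v = flow[..., 0], flow[..., 1]   # per-pixel x and y displacements
    return u, v

# Non-finite or otherwise unreliable flow vectors should be masked out before the
# strain computation, as noted above.
```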

4.4 Computation of Optical Strain

Once we have the displacement vectors from the optical flow data, we can compute the optical strain. The optical strain is calculated based on the Finite Difference Method as described in Section 3.4. The strain maps for the optical flow computed in Figure 4.3 are shown in Figure

4.5 and those for Figure 4.4 are shown in Figure 4.6.



Figure 4.3 Dense optical flow on two frames of the Alex Rodriguez video. a) The neutral frame. b) The peak frame corresponding to the microexpression. c) Illustration of the computed dense optical flow obtained by the Matlab implementation provided by MJ Black, where each colour corresponds to a different direction and the saturation corresponds to the intensity of the motion. d) The normalized optical flow showing only magnitude. As can be seen, the peak flow is at the mouth area to the right.



Figure 4.4 Computed dense optical flow on two frames of the USF dataset. a) The neutral frame. b) The peak frame corresponding to the microexpression. c) Illustration of the computed dense optical flow obtained by the Matlab implementation provided by MJ Black, where each colour corresponds to a different direction and the saturation corresponds to the intensity of the motion. d) The normalized optical flow showing only magnitude. As can be seen, the peak flow is at both sides of the mouth area.



Figure 4.5 Strain maps for the optical flow computed in Figure 4.3. a) The neutral frame. b) The peak frame corresponding to the microexpression. c) The vertical strain computed along the y axis d) The horizontal strain computed along the x-axis and e) The combined strain magnitude.



Figure 4.6 Strain map for the optical flow computed in Figure 4.4. a) The neutral frame. b) The peak frame corresponding to the microexpression. c) The combined optical strain magnitude


4.5 Split into Regions

As discussed in Section 1.3, FACS – the most commonly used facial expression coding system – represents expressions in terms of Action Units, most of which are based on muscular movement. We divide the face into eight regions: forehead, left of eye, right of eye, left cheek, right cheek, left of mouth, right of mouth and below mouth. Together, these regions cover 37 of the 43 Action Units. Including the eye region would have enabled us to cover more of the Action Units, but the blinking of the eyes needs to be accounted for before it can be used. In order to account for every possible Action Unit, we would have to measure the deformation at every facial muscle. However, we have observed in our experiments that the regions we cover are sufficient to spot a majority of the microexpressions.

Also, for faces with only partial visibility, not all eight regions are available. Presently, we watch the video sequence and decide which regions to utilize; the region classification is otherwise automated, with manual intervention as required. We gather the strain components calculated in the previous step for each region, normalize them within the region, and arrive at a single strain magnitude value for each region.

Figure 4.7 Split of face into separate regions. Each colored area corresponds to a region. We do not use the eye and nose area corresponding to the bricked region in the figure.
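A simplified sketch of this step, assuming the regions are given as rectangular boxes in normalized face coordinates; the box coordinates, the three regions listed and the use of a simple per-region mean are placeholders for the manually assisted region handling and normalization described above.

```python
import numpy as np

# Hypothetical rectangular regions as (top, bottom, left, right) fractions of the cropped face;
# the thesis uses eight regions (forehead, left/right of eye, cheeks, left/right of mouth, below mouth).
REGIONS = {
    "forehead":    (0.00, 0.20, 0.15, 0.85),
    "left cheek":  (0.45, 0.70, 0.05, 0.35),
    "right cheek": (0.45, 0.70, 0.65, 0.95),
    # ... remaining regions omitted for brevity
}

def region_strain(strain_map, regions=REGIONS):
    """One strain value per region - here simply the mean per-pixel strain over the region."""
    h, w = strain_map.shape
    return {name: float(strain_map[int(t * h):int(b * h), int(l * w):int(r * w)].mean())
            for name, (t, b, l, r) in regions.items()}

print(region_strain(np.random.rand(240, 180)))
```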


4.6 Thresholding

The strain plots obtained for each region are then subjected to thresholding in order to spot the microexpressions. There are two criteria to select a frame sequence as a region of microexpression:

i. The strain must be significantly larger than the surrounding regions.
ii. The duration for which the strain is larger than the surrounding regions must be less than 1/5th of a second.

In order to arrive at a threshold that satisfies both conditions, we extracted a portion of the dataset. This set comprised entirely of USF data, as it is more pristine and devoid of movement compared to the other datasets. We ran the strain computation algorithm on the ground-truthed sequences and examined the regions of the strain plots which satisfied the above conditions. We concluded that the best results were achieved when the threshold was selected such that:

i. Strain Magnitude > upper_threshold * Local Peak Strain Magnitude and

ii. Strain Magnitude > lower_threshold * Global Average Strain Magnitude.

The first condition ensures that the duration is not measured at the base of the peak (which would result in missing the microexpression), and the second condition ensures that the strain is large enough to signify a deformation. If both conditions are met for a sequence of no more than nine frames, we declare that sequence a microexpression. The algorithm for the thresholding is given in Figure 4.8.

Figure 4.9 shows the plot for the Alex Rodriguez sequence. The strain magnitude for each region is plotted in a different color, with each color corresponding to a region of the face. The hexagons at the top indicate the peaks of the microexpressions detected by the proposed algorithm. When a microexpression is detected in more than one region, the detection is shown for the first detected region only. Multiple detections within an interval of six frames are consolidated into a single detection.


Input: strain magnitude for all regions of the face
Output: peak frames of sequences corresponding to microexpressions

for each region:
    avg <- average strain for the current region
    detect peak
    while peak_mag > 2 x avg:
        counter <- 0
        while strain_mag > upper_threshold x peak_mag and strain_mag > lower_threshold x avg:
            increment counter; move to next frame
        move back to peak
        while strain_mag > upper_threshold x peak_mag and strain_mag > lower_threshold x avg:
            increment counter; move to previous frame
        if counter > 1 and counter < 10:
            declare microexpression at peak
        remove visited frames
        detect next peak
merge and manage results from all regions

Figure 4.8 Algorithm for thresholding strain magnitudes
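A minimal Python rendering of Figure 4.8 for a single region's strain-magnitude time series. The 2-9 frame window follows Table 4.1; the default upper threshold of 35% of the local peak follows Chapter 6, while the lower threshold (a multiple of the region's average strain) is an illustrative placeholder.

```python
import numpy as np

def spot_microexpressions(strain, upper=0.35, lower=1.0, min_frames=2, max_frames=9):
    """Return peak frame indices whose above-threshold duration fits a microexpression."""
    strain = np.asarray(strain, dtype=float)
    avg = strain.mean()
    remaining = strain.copy()
    detections = []

    while True:
        peak = int(np.argmax(remaining))
        peak_mag = remaining[peak]
        if peak_mag <= 2 * avg:                       # stop once no significant peak remains
            break

        def above(f):
            return strain[f] > upper * peak_mag and strain[f] > lower * avg

        # Walk forward, then backward, from the peak while both thresholds are exceeded.
        right = peak
        while right + 1 < len(strain) and above(right + 1):
            right += 1
        left = peak
        while left - 1 >= 0 and above(left - 1):
            left -= 1

        if min_frames <= right - left + 1 <= max_frames:
            detections.append(peak)                   # declare a microexpression at the peak
        remaining[left:right + 1] = 0.0               # remove visited frames, find the next peak
    return detections
```

Per-region detections would then be merged, as in the final step of Figure 4.8 and the consolidation rule described for Figure 4.9.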



Figure 4.9 Example strain thresholding. a) Strain plots for various regions of the face and the detected microexpression corresponding to frame number 802 for the right of mouth region (shown in black in the graph), marked with a hexagon at the top of the graph. The dashed lines have been added for illustration. The legend shows the regions as described in Section 4.5 b) The neutral image for the sequence and c) The frame showing the peak of the detected microexpression.


CHAPTER 5

DATASETS

Due to the lack of availability of appropriate databases for evaluating our algorithm, we gathered our own database of microexpressions. Our database comprises three datasets. This section provides a brief description of each dataset, along with its features and collection methodology. Table 5.1 gives the statistics of the datasets used.

5.1 USF Dataset

The USF dataset comprises two subsets, which are described below.

5.1.1 Feigned Dataset

This dataset comprises 12 videos, each containing eight microexpressions, for a total of 96 microexpressions. The microexpressions span the entire facial region. The subjects were aware of the theory behind microexpressions and were shown some real-life videos and TV shows containing microexpressions. They were then asked to repeat them in any order they liked, while avoiding out-of-plane rotation of the head. Sample frames from the USF dataset are shown in Figure 4.2.

5.1.2 Live Questioning Dataset

Another experimental dataset comprises four videos, each containing the questioning sequence of one subject. This dataset contains four sequences with one microexpression each, for a total of four microexpressions. The following instructions were given to the subjects:

i. There are four envelopes on the table in front of you.
ii. One of them contains a letter. We do not know which one contains the letter. The contents of the letter are irrelevant.
iii. You are to pick one envelope and check the contents without revealing them.
iv. You will later be questioned. If you are able to convince us that the envelope you picked contained the letter, you will receive a gift card for $10.

The subjects were later questioned to see if they had received the letter. Each session was video recorded and lasted 55-65 seconds. The subjects all claimed to have received the letter, for the monetary benefit, thereby generating spontaneous microexpressions. The subjects were unaware that their expressions were the point of focus in this experiment.

All the videos in the USF dataset were recorded using a Panasonic camcorder in either SD or HD. The lighting and background vary slightly between subjects, but are constant for the entirety of each video. The videos show the full frontal face, with the camera at a distance of 2 meters from the subject. The subjects were asked to avoid out-of-plane rotations of the face. The ground-truthing involved identifying microexpressions and their region of occurrence from the videos; the task was performed by two students in the group who were well versed in microexpressions, though not formally trained.

5.2 Canal 9 Dataset

The Canal 9 political debate corpus [69] is a collection of 72 political debates recorded by the Canal 9 local TV station and broadcast in Valais, Switzerland. The corpus includes roughly 42 hours of edited, high-quality audiovisual recordings. Each debate involves up to five French-speaking participants (one moderator and up to four guests) and is focused on a single question with a straight yes/no answer, e.g. “are you favorable to the new education laws?” The guest participants firmly state their positions at the beginning of the debate, forming two clearly defined groups of opponents. The videos were carefully screened for microexpressions and divided into sequences of about 6 seconds each, with each sequence containing one microexpression, for a total of 24 microexpressions. The videos contain faces at a variety of angles and rotations; the clips were chosen so that the position of the face did not change much within a sequence.


Figure 5.1 Canal 9 dataset

5.3 Found Videos

Dr. Paul Ekman, a renowned researcher in microexpressions, has mentioned, in his books, talks and blogs, several examples of microexpressions in public scenarios and speeches. These are classic examples of microexpressions, and we have obtained some of them from the internet. They include English spy Kim Philby's last public interview and Alex Rodriguez's interview with Katie Couric in which he denies taking drugs. This dataset comprises a total of 4 microexpressions. Each of these clips is about two seconds long. The use of these videos is covered by the "fair use" doctrine of the copyright act, which allows use of 10% of publicly available videos for non-commercial research.



Figure 5.2 Found video dataset. a) A microexpression of scorn from Kato Kaelin's testimony in the O.J. Simpson trial. b) Kim Philby's interview with a microexpression of duping delight. c) Alex Rodriguez's interview with Katie Couric.

Table 5.1 Characteristics of datasets used for evaluation

Dataset Name        Number of   Approximate Duration   Microexpressions   Total   Resolution
                    Sequences   per sequence (s)       per sequence
USF – feigned       12          140                    8                  96      SD / HD
USF – questioning   4           65                     1                  4       HD
Canal9 dataset      6           300-400                4                  24      HD
Found videos        3           30-40                  1                  4       Low


CHAPTER 6

EXPERIMENTAL RESULTS

6.1 Determination of Threshold

The USF dataset was used to determine the threshold. About 20% of the ground-truthed microexpressions were used in the determination. The threshold selection process is fully automated: varying thresholds are applied to the training set, and the strain threshold value that detects the maximum number of microexpressions is selected as the threshold value. The Receiver Operating Characteristic (ROC) curve for the applied threshold values and the resulting true detections is shown in Figure 6.1, and the values are tabulated in Table 6.1.

[Plot: Threshold Selection. % True Positive (y-axis) vs. threshold as % of peak strain (x-axis).]

Figure 6.1 ROC curve used for threshold determination. The % true positive detection peaks at 77.2% and corresponds to a threshold of 35% of the peak strain value.


Table 6.1 Results of thresholding on the training set with 22 microexpressions

Sl. No.   Threshold as percentage of peak strain   % True positives   % False positives
1         75                                       31.8               0
2         50                                       50                 0
3         35                                       77.2               22.7
4         30                                       54.5               36.4
5         25                                       13.6               0
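
The automated sweep over the candidate thresholds in Table 6.1 might look like the following. This is a rough sketch under assumed names: spot_microexpressions is the per-region search sketched after Figure 4.8, and the 6-frame tolerance for matching a detection to a ground-truthed peak is an assumption, not a value specified in this work.

def select_threshold(training_strains, ground_truth,
                     candidates=(0.75, 0.50, 0.35, 0.30, 0.25), tolerance=6):
    # training_strains : list of 1-D strain-magnitude arrays, one per training sequence
    # ground_truth     : list of lists of ground-truthed peak frame numbers, one per sequence
    # candidates       : thresholds tried, expressed as fractions of the peak strain
    best_fraction, best_hits = None, -1
    for fraction in candidates:
        hits = 0
        for strain, truth in zip(training_strains, ground_truth):
            detected = spot_microexpressions(strain, upper=fraction)
            # A ground-truthed microexpression counts as spotted if some detection
            # lies within the tolerance window around its peak frame.
            hits += sum(any(abs(d - t) <= tolerance for d in detected) for t in truth)
        if hits > best_hits:
            best_fraction, best_hits = fraction, hits
    return best_fraction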

Figure 6.2 Typical strain pattern for a microexpression. Too high a threshold and too low a threshold result in erroneous detection. A threshold of 35% gives the optimum detection rate.


When the threshold is set high (75% of the detected peak), we spot only about 32% of the microexpressions, with no false spots. The low number of spots occurs because at most one or two frames reach the peak strain, and these correspond to the peak of the expression; the remaining frames have lower strain values. This is supported by the fact that, as the threshold is reduced, the number of spotted microexpressions increases. The spotted microexpression count peaks at a threshold of 35%, albeit with a significant number of false spots. The false spots are caused by noise introduced into the image by slightly imperfect alignment of the frames, which appears as false rigid-body motion to the optical flow computation. Further reduction of the threshold results in an insignificant number of spots, because the threshold falls almost to the average strain level, which stretches the duration of the detected sequence well beyond the limits of a microexpression.

Figure 6.2 shows a sample strain pattern of a single microexpression extracted from the training set to illustrate the impact of threshold selection. It shows various thresholds and the corresponding number of frames covered by each. As can be seen, the duration is too short at high threshold values and too long when the threshold is very low. An optimal value of 35% was chosen for the threshold.

6.2 Results with Only Macroexpressions

In order to test the ability of the proposed algorithm to reject macroexpressions, we ran our algorithm on two videos from the USF expression dataset [10]. The algorithm was run on the entire face instead of splitting it into regions. The algorithm did not spot any microexpressions, showing that when an expression is exhibited exclusively and explicitly as a macroexpression, the proposed algorithm successfully rejects that sequence. Figure 6.3 shows the strain map and thresholding results for one of the sequences.


Figure 6.3 Strain map and thresholding with only macroexpressions. As can be seen, no sequence is flagged as a microexpression, validating this aspect of the proposed algorithm.

6.3 Microexpression Spotting Results

Until now, we have examined the suitability and stability of the proposed strain-based approach as compared to other methods. We now present the results of evaluating the proposed algorithm on the datasets described in the previous chapter.

Table 6.2 Microexpression spotting results

Dataset Name       Number of   Number of          % True     % Missed   % False
                   sequences   Microexpressions   positive              Positive
USF enacted        12          96                 80.2       19.8       37.5
USF questioning    4           4                  75         25         0
Political          6           24                 50         50         75
Found videos       3           4                  75         25         0
Total              25          128                70.2       29.8       42.18


Figure 6.4 Strain plots for select USF sequences. Each color corresponds to a region of the face. The hexagons at the top indicate the peak of the microexpression as detected by the proposed algorithm. When detected in more than one region, the detection is shown for the first detected region only. Multiple detections within an interval of 6 frames are consolidated into a single detection.


Figure 6.4 shows strain plots for select sequences in the USF dataset. Each color corresponds to a region of the face. The hexagons at the top indicate the peak of the microexpression as detected by the proposed algorithm. When detected in more than one region, the detection is shown for the first detected region only. Multiple detections within an interval of 6 frames are consolidated into a single detection. Table 6.2 gives the results of the proposed spotting algorithm on various datasets. The true positive detection percentage and the false positive detection percentage were calculated from the obtained results (Figure 6.5).
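
The consolidation rule described above (multiple detections within 6 frames collapse into a single detection, reported for the first detected region) could be implemented along these lines; the dictionary layout and function name are assumptions for illustration only.

def consolidate_detections(per_region_peaks, window=6):
    # per_region_peaks : dict mapping region name -> list of detected peak frames
    # window           : detections closer than this many frames are merged
    pooled = sorted((frame, region)
                    for region, frames in per_region_peaks.items()
                    for frame in frames)
    merged = []
    for frame, region in pooled:
        if merged and frame - merged[-1][0] <= window:
            continue  # within the window of an accepted detection: same event
        merged.append((frame, region))
    return merged

For example, consolidate_detections({'right of mouth': [802, 805], 'forehead': [804]}) would report a single detection at frame 802 for the right of mouth region.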

[Bar chart: Spotting Results. % True Positive and % False Positive for USF-enacted, USF-questioning, Canal9 and Found Videos.]

Figure 6.5 Microexpression spotting results

The USF enacted dataset has the highest true positive rate with the lowest corresponding false positive rate. This was expected due to the constant illumination and the stable posture with very little head movement. In addition, the only movements were the macro- and microexpressions; the subjects were not talking. The questioning dataset showed accurate results with a very low false positive rate; however, the number of subjects is low. The political dataset resulted in a large number of false positives, owing to talking and excessive head movements that occlude regions of the face. Talking causes quick and fine movements of the skin along the jaw line in the mouth region, and these were misclassified as microexpressions. The found video dataset shows promising results with a 75% detection rate.

The worst performance of the algorithm resulted when the subject was talking, leading to a high false positive rate. Performing voice detection to identify when the subject is talking and excluding the mouth region during talking sequences would help reduce the number of false positives.


CHAPTER 7

CONCLUSION AND FUTURE WORK

7.1 Conclusions

A novel method for spotting facial microexpressions in video sequences using optical strain has been presented. The method determines the optical flow between two frames and uses it to compute the strain in the facial skin tissue. This strain has been shown to be a measure of deformation. The principle behind using deformation is that maximum deformation occurs at the peak of the microexpression. By detecting deformation that lasts for the duration of a microexpression, we are able to identify the boundaries of the microexpression. Because we compute dense optical flow, we are able to use the Finite Difference Method approach described in Section 3.4 to determine the optical strain.
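
As a rough illustration of this step, a strain map can be obtained from a dense flow field with finite differences as sketched below; the function name is hypothetical, and the Frobenius-type magnitude used here is one common choice that may differ in detail from the formulation in Section 3.4.

import numpy as np

def optical_strain_magnitude(u, v):
    # u, v : horizontal and vertical dense optical flow components, HxW arrays
    # Spatial derivatives via central finite differences.
    du_dy, du_dx = np.gradient(u)
    dv_dy, dv_dx = np.gradient(v)

    # Infinitesimal strain tensor: eps = 0.5 * (grad(p) + grad(p)^T) for p = (u, v).
    eps_xx = du_dx
    eps_yy = dv_dy
    eps_xy = 0.5 * (du_dy + dv_dx)

    # Per-pixel strain magnitude (Frobenius-type norm of the tensor).
    return np.sqrt(eps_xx ** 2 + eps_yy ** 2 + 2.0 * eps_xy ** 2)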

Finding appropriate datasets containing spontaneous microexpressions was a major challenge. We conducted experiments on three separate datasets: the USF dataset, containing both spontaneous and enacted sequences; the Canal 9 dataset, which was screened for microexpressions; and the found video dataset, which consists of videos from the internet.

The results on the USF dataset are very promising, with very good detection accuracy. There is a significant false positive rate, but this was anticipated: the threshold arrived at during the training phase was selected liberally, so as not to miss any true microexpressions, at the cost of a high false positive rate. The found video dataset consists of real-world videos and hence is the true test of the system. Most sequences contained only one microexpression, and none had more than two. The detection rate was high for this dataset. The algorithm performed poorly on the Canal 9 dataset because of frequent talking and out-of-plane rotation of the head resulting in partial occlusion of the face.


Resolution and frame size had no impact on the accuracy of the system as long as the face was clearly visible. Processing time was affected, though, as the optical flow took longer to compute for the HD images.

Adding macroexpression spotting reduced the processing time, as it helped shorten the sequence to be tested. This was run only as a trial and needs to be fully integrated into the system to yield significant and meaningful results.

7.2 Future Work

Availability of datasets was the biggest challenge faced. Hence, future work will begin with collecting an appropriate number of spontaneous video sequences. The approach described in this work as well as those used by other researchers [40] will be used for this purpose.

The main reasons for failure on the Canal 9 dataset are the frequent talking and out-of-plane movement of the face. Performing voice detection to identify when the subject is talking and excluding the region around the mouth for those durations could help reduce the number of false positives. Increasing the effectiveness of the alignment should also help increase the overall accuracy of the algorithm. An interesting approach of using the optical flow vectors of rigid points on the face to determine rotation/translation and using them to register and align the face is discussed in [13]; a rough sketch of this idea follows.
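
This is a minimal sketch of that registration idea, assuming flow vectors at a few rigid facial points (e.g., nose bridge, inner eye corners) are already available; the use of OpenCV's estimateAffinePartial2D is an illustrative choice here, not the method of [13].

import cv2
import numpy as np

def register_with_rigid_points(frame, rigid_pts, flow):
    # frame     : image to be aligned with the previous frame (HxW or HxWx3)
    # rigid_pts : Nx2 array of (x, y) locations of rigid facial points
    # flow      : HxWx2 dense optical flow from the previous frame to this one
    src = rigid_pts.astype(np.float32)
    # Where the rigid points moved to, according to the dense flow.
    dst = src + np.array([flow[int(y), int(x)] for x, y in src], dtype=np.float32)

    # Similarity transform (rotation, translation, scale) explaining the rigid motion.
    transform, _ = cv2.estimateAffinePartial2D(src, dst)

    # Warping with the inverse removes the rigid head motion, so the residual flow
    # between registered frames is due to facial deformation only.
    h, w = frame.shape[:2]
    return cv2.warpAffine(frame, cv2.invertAffineTransform(transform), (w, h))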

Detection and splitting of the face into regions currently involves some manual intervention. An algorithm for automatic partitioning of faces into regions is presented in [70]. Incorporating this method could help reduce the amount of manual intervention in our algorithm.


REFERENCES

[1] P. Ekman, Telling Lies: Clues to Deceit in the Marketplace, Politics, and Marriage (Revised and Updated Edition). W. W. Norton & Company, 2001.

[2] P. Ekman, E. T. Rolls, D. I. Perrett and H. D. Ellis, "Facial Expressions of Emotion: An Old Controversy and New Findings [and Discussion]," Philosophical Transactions: Biological Sciences, vol. 335, pp. 63-69, Jan. 29, 1992.

[3] S. Porter and L. ten Brinke, "Reading Between the Lies," Psychological Science, vol. 19, pp. 508-514, 05, 2008.

[4] G. C. Littlewort, M. S. Bartlett and K. Lee, "Automatic coding of facial expressions displayed during posed and genuine pain," Image Vision Comput., vol. 27, pp. 1797-1803, 11, 2009.

[5] B. Schuller, R. Muller, F. Eyben, J. Gast, B. Hornler, M. Wollmer, G. Rigoll, A. Hothker and H. Konosu, "Being bored? Recognising natural interest by extensive audiovisual integration for real-life application," Image Vision Comput., vol. 27, pp. 1760-1774, 2009.

[6] T. Ehrenfeld, "What's In Your Face: Are 'Microexpressions' The Key To Better Security?" Newsweek, Jun 9, 2003.

[7] M. Gladwell, "The naked face," The New Yorker, pp. 38, 2002.

[8] P. Ekman and E. L. Rosenberg, What the Face Reveals: Basic and Applied Studies of Spontaneous Expression using the Facial Action Coding System(FACS). Oxford University Press, 1997.

[9] K. Scherer and P. Ekman, "Methods for measuring facial action," in Handbook of Methods in Nonverbal Behavior Research. Cambridge University Press, 1982, pp. 45-90.

[10] M. Shreve, S. Godavarthy, V. Manohar, D. Goldgof and S. Sarkar, "Towards macro- and micro-expression spotting in video using strain patterns," in Applications of Computer Vision (WACV), 2009 Workshop on, 2009, pp. 1-6.

[11] V. Manohar, D. Goldgof, S. Sarkar and Y. Zhang, "Facial strain pattern as a soft forensic evidence," in Applications of Computer Vision, 2007. WACV '07. IEEE Workshop on, 2007, pp. 42.

[12] V. Manohar, M. Shreve, D. Goldgof and S. Sarkar, "Finite element modeling of facial deformation in videos for computing strain pattern," in Pattern Recognition, 2008. ICPR 2008. 19th International Conference on, 2008, pp. 1-4.

[13] V. Manohar. Facial skin motion properties from video: Modeling and applications. Ph.D. dissertation, University of South Florida, 2009


[14] E. A. Haggard and K. S. Isaacs, "Micro-momentary facial expressions as indicators of ego mechanisms in psychotherapy," Methods of Research in Psychotherapy, pp. 154, 1966.

[15] P. Ekman, "Darwin, Deception, and Facial Expression," Ann. N. Y. Acad. Sci., vol. 1000, pp. 205-221, 2003.

[16] C. Darwin, The Expression of the Emotions in Man and Animals. Oxford University Press Inc, 2002.

[17] P. Ekman and W. Friesen, "Constants across cultures in the face and emotion," J. Pers. Soc. Psychol., vol. 17, pp. 124-129, 1971.

[18] P. Ekman and W. V. Friesen, Unmasking the Face: A Guide to Recognizing Emotions from Facial Clues. Prentice-Hall, 1975.

[19] P. Ekman, "Lie catching and micro expressions," in The Philosophy of Deception, C. Martin, Ed. Oxford University Press, 2009.

[20] G. Peleg, G. Katzir, O. Peleg, M. Kamara, L. Brodsky, H. Hel-Or, D. Keren and E. Nevo, "Hereditary family signature of facial expression," Proceedings of the National Academy of Sciences, vol. 103, pp. 15921-15926, October 24, 2006.

[21] M. Pantic, A. Pentland, A. Nijholt and T. Huang, "Human computing and machine understanding of human behavior: A survey," in ACM SIGCHI Proceedings Eighth International Conference on Multimodal Interfaces, 2006, pp. 239-248.

[22] S. Z. Li and A. K. Jain, Handbook of Face Recognition. Springer, Berlin, 2005.

[23] M.-H. Yang, D. J. Kriegman and N. Ahuja, "Detecting faces in images: a survey," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 24, pp. 34-58, 2002.

[24] G. Yang and T. S. Huang, "Human face detection in a complex background," Pattern Recognit, vol. 27, pp. 53-63, 1, 1994.

[25] K. C. Yow and R. Cipolla, "A probabilistic framework for perceptual grouping of features for human face detection," in Automatic Face and Gesture Recognition, 1996. Proceedings of the Second International Conference on, 1996, pp. 16-21.

[26] S. A. Sirohey. Human face segmentation and identification. Master's thesis, University of Maryland, 1993.

[27] K. C. Yow and R. Cipolla, "Feature-Based Human Face Detection," Image Vision Comput., vol. 15, pp. 713-735, 1996.

[28] S. Fahlman and C. Lebiere, "The cascade-correlation learning architecture," in Advances in Neural Information Processing Systems 2, 1990, pp. 524-532.

[29] Y. Dai, "Face-texture model based on SGLD and its application in face detection in a color scene," Pattern Recognit, vol. 29, pp. 1007-1017, 06, 1996.

[30] P. Viola and M. Jones, "Robust Real-Time Face Detection," International Journal of Computer Vision, vol. 57, pp. 137-154, 05/01, 2004.


[31] H. Kobayashi and F. Hara, "Facial interaction between animated 3D face robot and human beings," in Systems, Man, and Cybernetics, 1997. 'Computational Cybernetics and Simulation', 1997 IEEE International Conference on, 1997, pp. 3732-3737 vol.4.

[32] D. Terzopoulos and K. Waters, "Analysis and synthesis of facial image sequences using physical and anatomical models," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 15, pp. 569-579, 1993.

[33] M. Black and Y. Yacoob, "Recognizing Facial Expressions in Image Sequences Using Local Parameterized Models of Image Motion," International Journal of Computer Vision, vol. 25, pp. 23-48, 1997.

[34] M. Pantic, "Automatic Analysis of Facial Expressions: The State of the Art," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, pp. 1424-1445, 12/01, 2000.

[35] A. B. Ashraf, S. Lucey, J. F. Cohn, T. Chen, Z. Ambadar, K. M. Prkachin and P. E. Solomon, "The painful face - Pain expression recognition using active appearance models," Image Vision Comput., vol. 27, pp. 1788-1796, 2009.

[36] P. Lucey, J. Cohn, S. Lucey, S. Sridharan and K. M. Prkachin, "Automatically detecting action units from faces of pain: Comparing shape and appearance features," in 2009, pp. 12-18.

[37] P. Lucey, J. Cohn, S. Lucey, I. Matthews, S. Sridharan and K. M. Prkachin, "Automatically detecting pain using facial actions," in Affective Computing and Intelligent Interaction and Workshops, 2009. ACII 2009. 3rd International Conference on, 2009, pp. 1-8.

[38] G. Lu, X. Li and H. Li, "Facial expression recognition for neonatal pain assessment," in Neural Networks and Signal Processing, 2008 International Conference on, 2008, pp. 456-460.

[39] P. Ekman and W. Friesen, Facial Action Coding System: A Technique for the Measurement of Facial Movement. Palo Alto: Consulting Psychologists Press, 1978.

[40] M. S. Bartlett, G. Littlewort, M. Frank, C. Lainscsek, I. Fasel and J. Movellan, "Recognizing facial expression: Machine learning and application to spontaneous behavior," in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, 2005, pp. 568-573 vol. 2.

[41] M. Pantic and I. Patras, "Dynamics of facial expression: recognition of facial actions and their temporal segments from face profile image sequences," Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, vol. 36, pp. 433-449, 2006.

[42] M. Pantic and L. Rothkrantz, "Facial Action Recognition for Facial Expression Analysis from Static Face Images," IEEE Trans Syst Man Cybern., vol. 34, pp. 1449-1461, 2004.

[43] G. Donato, M. Bartlett, J. Hager, P. Ekman and T. Sejnowski, "Classifying Facial Actions," IEEE Trans. Pattern Anal. Mach. Intell., vol. 21, pp. 974-989, 1999.

[44] I. A. Essa and A. P. Pentland, "Coding, analysis, interpretation, and recognition of facial expressions," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 19, pp. 757-763, 1997.


[45] J. P. Skelly. Experiments in expression recognition. Master's thesis, Massachusetts Institute of Technology, 2005.

[46] T. Kanade, J. F. Cohn and Y. Tian, "Comprehensive database for facial expression analysis," in Automatic Face and Gesture Recognition, 2000. Proceedings. Fourth IEEE International Conference on, 2000, pp. 46-53.

[47] M. Lyons, S. Akamatsu, M. Kamachi and J. Gyoba, "Coding facial expressions with gabor wavelets," in Automatic Face and Gesture Recognition, 1998. Proceedings. Third IEEE International Conference on, 1998, pp. 200-205.

[48] A. J. O'Toole, J. Harms, S. L. Snow, D. R. Hurst, M. R. Pappas, J. H. Ayyad and H. Abdi, "A video database of moving faces and people," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 27, pp. 812-816, 2005.

[49] T. Sim, S. Baker and M. Bsat, "The CMU pose, illumination, and expression database," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 25, pp. 1615-1618, 2003.

[50] M. Pantic, M. Valstar, R. Rademaker and L. Maat, "Web-based database for facial expression analysis," in Multimedia and Expo, 2005. ICME 2005. IEEE International Conference on, 2005, pp. 5.

[51] M. Pantic, "Machine analysis of facial behaviour: naturalistic and dynamic behaviour," Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, pp. 3505-3513, 12/12, 2009.

[52] J. Rothwell, Z. Bandar, J. O'Shea and D. McLean, "Silent talker: a new computer-based system for the analysis of facial cues to deception," Applied Cognitive Psychology, vol. 20, pp. 757-777, 2006.

[53] I. Cohen, N. Sebe, L. Chen, A. Garg and T. Huang, "Facial expression recognition from video sequences: Temporal and static modelling," in Computer Vision and Image Understanding, 2003, pp. 160-187.

[54] H. Tao and T. S. Huang, "Connected vibrations: A modal analysis approach for non-rigid motion tracking," in Computer Vision and Pattern Recognition, 1998. Proceedings. 1998 IEEE Computer Society Conference on, 1998, pp. 735-740.

[55] F. Tsalakanidou and S. Malassiotis, "Real-time 2D+3D facial action and expression recognition," Pattern Recognit, vol. 43, pp. 1763-1775, 5, 2010.

[56] Y. Tian, T. Kanade and J. F. Cohn, "Recognizing action units for facial expression analysis," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 23, pp. 97-115, 2001.

[57] Y. Wang, J. Ostermann and Y. Zhang, Video Processing and Communications. Prentice Hall, 2002.

[58] R. Jain, R. Kasturi and B. G. Schunck, Machine Vision. McGraw Hill Inc., 1995.

[59] J. L. Barron, D. J. Fleet, S. S. Beauchemin and T. A. Burkitt, "Performance Of Optical Flow Techniques," International Journal of Computer Vision, vol. 12, pp. 43-77, 1994.


[60] B. Horn and B. Schunck, "Determining Optical Flow," Artificial Intelligence, vol. 17, pp. 185-203, 1981.

[61] B. D. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," in Vancouver, BC, Canada, 1981, pp. 674-679.

[62] H. Nagel, "On the estimation of optical flow: Relations between different approaches and some new results," Artif. Intell., vol. 33, pp. 299-324, 11, 1987.

[63] M. Black and P. Anandan, "The robust estimation of multiple motions: parametric and piecewise-smooth flow fields," Comput. Vis. Image Underst., vol. 63, pp. 75-104, 1996.

[64] G. Bradski and A. Kaehler, Learning OpenCV. O'Reilly Media, 2008.

[65] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, 2001, pp. I-511-I-518 vol.1.

[66] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, pp. 119-139, 1997.

[67] C. O. Conaire, N. E. O'Connor and A. F. Smeaton, "Detector adaptation by maximising agreement between independent data sources," in Computer Vision and Pattern Recognition, 2007. CVPR '07. IEEE Conference on, 2007, pp. 1-6.

[68] P. Anandan, "A computational framework and an algorithm for the measurement of visual motion," International Journal of Computer Vision, vol. 2, pp. 283-310, 01/21, 1989.

[69] A. Vinciarelli, A. Dielmann, S. Favre and H. Salamin, "Canal9: A database of political debates for analysis of social interactions," in Affective Computing and Intelligent Interaction and Workshops, 2009. ACII 2009. 3rd International Conference on, 2009, pp. 1-4.

[70] A. S. M. Sohail and P. Bhattacharya, "Detection of Facial Feature Points Using Anthropometric Face Model," Signal Processing for Image Enhancement and Multimedia Processing, pp. 189-200, 2008.
