Weakly Supervised Learning for Facial Behavior Analysis: A Review

Gnana Praveen R, Eric Granger, Member, IEEE, and Patrick Cardinal, Member, IEEE

The authors are with the Imagery Vision and Artificial Intelligence Laboratory (LIVIA), École de technologie supérieure, Montreal, CA, H3C 1K3. E-mail: gnanapraveen.rajasekar.1, eric.granger, [email protected]

Abstract—In recent years, facial behavior analysis has shifted from laboratory-controlled conditions to challenging in-the-wild conditions, driven by the strong performance of deep learning based approaches in many real-world applications. However, the performance of deep learning approaches relies on the amount of training data, and one of the major problems with data acquisition is the requirement of annotations for a large amount of training data. Labeling huge training sets demands substantial human support with strong domain expertise in facial expressions or action units, which is difficult to obtain in real-world settings. Moreover, the labeling process is highly vulnerable to the ambiguity of expressions or action units, especially for intensities, due to the bias induced by the domain experts. Therefore, there is an imperative need to address the problem of facial behavior analysis with weak annotations. In this paper, we provide a comprehensive review of weakly supervised learning (WSL) approaches for facial behavior analysis with both categorical and dimensional labels, along with the associated challenges and potential research directions. First, we introduce the various types of weak annotations in the context of facial behavior analysis and the corresponding challenges. We then systematically review the existing state-of-the-art approaches and provide a taxonomy of these approaches along with their insights and limitations. In addition, widely used data-sets in the reviewed literature and the performance of these approaches, along with the evaluation principles, are summarized. Finally, we discuss the remaining challenges, opportunities, and potential research directions for applying facial behavior analysis with weak labels in real-life situations.

Index Terms—Weakly Supervised Learning, Facial Expression Recognition, Classification, Regression, Facial Action Units

1 INTRODUCTION

Facial behavior analysis has been a growing area of interest in computer vision and affective computing, with great potential for many applications in human-computer interaction, sociable robots, autonomous-driving cars, etc. It was shown that only one-third of human communication is conveyed through verbal components and two-thirds through non-verbal components [1]. Among the available nonverbal components, facial behavior plays a major role in conveying the mental state of a person, which is reflected by the movements of the facial muscles. Ekman and Friesen [2] conducted a cross-cultural study on facial expressions and showed that there are six basic universal facial emotions regardless of culture, i.e., Angry, Disgust, Fear, Happy, Sad and Surprise, as shown in Figure 1. Subsequently, contempt was added to these basic emotions, which were found to be universal across human ethnicities and cultures. Due to the simplicity of the discrete representation, these seven prototypical emotions are the most widely used categorical model for the classification of facial emotions. Though the terms "expression" and "emotion" are used interchangeably in many research papers, the primary difference is that facial emotions convey the mental state of a person whereas a facial expression is the indicator of the emotion being felt, i.e., facial expressions display a wide range of facial modulations but facial emotions are limited.

Fig. 1: Example of primary universal emotions. From left to right: neutral, happy, sad, fear, angry, surprise and disgust [3]

In order to have a common standard that defines the wide range of facial expressions, the Facial Action Coding System (FACS) was developed by Ekman [4]. FACS is a taxonomy of facial expressions that defines all observable facial movements for every emotion. It comprises 32 action units (AUs) and 14 additional action descriptors (ADs), where action units are described by the fundamental actions of individual muscles or groups of muscles that form a specific movement. Some of the action units are shown in Figure 2. FACS has been used as a standard for manually annotating facial expressions, as it defines a set of rules to express any possible facial expression in terms of specific action units (AUs) and to measure the intensity of facial expressions at five discrete levels (A < B < C < D < E), where A is the minimum intensity level and E the maximum. In order to further extend the range of facial expressions, continuous models over affect dimensions have been proposed [5]. Due to the immense potential of AUs in interpreting expressions and deriving high-level information, we focus on both expressions and AUs, as well as on categorical and dimensional labels.

Fig. 2: Examples of Action Units [7]

Most traditional approaches use hand-crafted features such as the Scale Invariant Feature Transform (SIFT), Local Binary Patterns (LBP), LBP on three orthogonal planes (LBP-TOP), etc., which are deterministic and shallow in nature. Therefore, they fail to capture the wide range of intra-class variance, i.e., expressions of the same class may vary from person to person due to factors such as age, gender, race, cultural background and other person-specific characteristics in uncontrolled environments, which restricts the deployment of FER systems to controlled environments. With the advancement of deep learning architectures and computing capability, there has been a breakthrough in this field, which has made it possible to cope with a wide range of variations, to develop intelligent systems for uncontrolled real-time environments, and to perform on par with human ability.

With the increase in training data, labeling the data remains a tedious task that demands a lot of human support and time, and is thereby not feasible in real-time applications. In order to achieve minimal competency as a FACS coder, it takes over 100 hours of training, and each minute of video takes approximately one hour to score [6]. Moreover, the intensity labels provided by annotators are subjective in nature, resulting in ambiguous annotations due to annotator bias. Annotating continuous labels is even more challenging, as it increases the range of subjectivity in the labeled annotations. Therefore, there is an immense need to deal with weak annotations pertinent to facial analysis in order to fully leverage the potential of deep learning models.

In recent years, though exhaustive surveys including deep learning approaches have been published on facial behavior analysis [7], [8], [9], [10], [11], to the best of our knowledge no survey has covered WSL-based approaches for analyzing facial behavior, despite the immense potential of WSL approaches in real-life situations. Though deep learning models exhibit significant feature learning capability, they require large quantities of annotated training data to avoid over-fitting. In this survey, we conduct a comprehensive review of facial behavior analysis with weak annotations, and consolidate and provide a taxonomy of existing work over the last decade, primarily focusing on classification or regression of facial expressions or action units. The major contributions of this review are as follows:

• We present the relevance of facial behavior analysis in the context of the various types of weak annotations and the corresponding challenges, for both classification and regression.
• State-of-the-art approaches for expression and AU analysis in the framework of WSL are extensively reviewed and consolidated, and their insights, advantages and limitations are discussed.
• Widely used data-sets in the context of weak annotations are presented, along with comparative results and the corresponding evaluation strategies.
• Challenges and opportunities associated with the development of robust FER systems pertinent to WSL scenarios in real-life situations, along with insights into future research directions, are discussed.

The rest of the paper is organized as follows. An overview of machine learning techniques for weakly annotated data is presented in Section 2. The databases widely considered in the framework of WSL for evaluating these approaches are described in Section 3.4.1 and Section 4.3.1. The comprehensive survey of existing approaches for facial behavior analysis with weak annotations related to classification and regression is presented in Section 3 and Section 4, respectively. The challenges of the existing state-of-the-art approaches, and the opportunities and potential research directions, are discussed in Section 5.
2 WEAKLY SUPERVISED LEARNING FOR FACIAL BEHAVIOR ANALYSIS

The category of machine learning approaches that deals with weakly annotated data is termed "weakly supervised learning (WSL)". Unlike supervised learning, accurate labels are not provided for the entire data in most real-world applications due to the tedious process of obtaining annotations. Therefore, weakly supervised learning has been gaining attention in recent years, as it has immense potential in many vision applications such as object detection, image categorization, etc. Depending on the mode of availability of labels (annotations), WSL can be classified into three categories: inexact supervision, incomplete supervision and inaccurate supervision. The details of each of these categories are discussed elaborately by Zhou [12]. In this section, we briefly introduce these categories and their relevance to the recognition of facial expressions and action units, as depicted in Figure 3 and Figure 4, respectively. We primarily focus on four specific problems pertinent to facial behavior analysis in the context of the various categories of weakly supervised learning: expression detection, expression intensity estimation, AU detection and AU intensity estimation. The objective of expression or AU detection is to classify the various expressions or action units in a given image or video, whereas expression or AU intensity estimation refers to estimating the intensities of expressions or AUs.

2.1 Inexact Supervision

In this category, coarse-grained labels are provided for the data samples instead of exact labels. The goal is to predict the accurate labels of unknown test data using the coarsely labeled training data. One of the major approaches to tackle this problem is Multiple Instance Learning (MIL) [13]. It has a wide range of applications in machine learning, where fully labeled information is difficult to obtain due to the high cost of the labeling process. Due to the ubiquity of problems that are naturally formulated in the MIL setting, such as image and video classification [14], document classification [15], object detection [16], etc., it has emerged as a highly useful tool in many real-world applications.

The coarsely labeled data is denoted as a "bag" and the data contributing to the coarse annotation are referred to as "instances". In most computer vision applications, the bag is an image or a video, and the instances of the bag are patches of an image, or segments of frames or individual frames of a video. Even though the primary goal is bag-level prediction, many techniques have been proposed to perform instance-level prediction, where each instance of the bag is predicted; for example, in a classification task, the instances of the bags can be classified along with the bag-level classification. Several factors influence the performance of MIL algorithms, such as bag composition, data distribution and label ambiguity, which are discussed in detail by Carbonneau et al. [17]. MIL algorithms can be broadly classified into instance-level and bag-level prediction (a minimal sketch of the bag-level decision rule follows this list):

• Instance Level Algorithms: These algorithms predict the label of each instance of the bag, which is in turn used to perform bag-level classification.
• Bag Level Algorithms: These methods represent the instances of the entire bag as a single feature, thereby transforming the problem into supervised learning. In order to predict the label of a new bag, distance metrics are used to discriminate between bags.
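To make the bag/instance terminology concrete, the following is a minimal sketch (not taken from any of the surveyed methods) of the instance-level route with a max-pooling decision rule: instance scores produced by some frame- or segment-level classifier are pooled into a bag-level decision, mirroring the single-concept assumption of MILBoost-style approaches reviewed in Section 3.1.1. The scores and the threshold are illustrative assumptions.

```python
import numpy as np

def bag_prediction(instance_scores, threshold=0.5):
    """Max-pooling MIL rule: a bag (video) is positive if its most
    confident instance (frame/segment) exceeds the threshold."""
    instance_scores = np.asarray(instance_scores, dtype=float)
    bag_score = instance_scores.max()          # bag-level score
    bag_label = int(bag_score >= threshold)    # bag-level decision
    instance_labels = (instance_scores >= threshold).astype(int)
    return bag_label, bag_score, instance_labels

# Toy example: scores produced by some instance-level classifier
# for the segments of one video (values are invented).
scores = [0.12, 0.08, 0.71, 0.33]
label, score, frame_labels = bag_prediction(scores)
print(label, score, frame_labels)   # 1, 0.71, [0 0 1 0]
```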
In the case of expression detection, the task is to localize and predict the expressions of short video clips or frames of a video using training data with sequence-level labels. Even though acquiring videos of various expressions is not very difficult, the major bottleneck is obtaining exact label information for all frames of the training videos, which can be circumvented using MIL algorithms, where sequence-level predictions can be estimated along with frame-level predictions as shown in Figure 3b. In the context of multiple instance regression (MIR) for expression intensity estimation, the sequence-level label is obtained as the maximum or average of the individual frames of a given sequence [18], and the aim is to predict frame-level intensity levels from training data with sequence-level intensity labels. The state-of-the-art approaches for detecting facial expressions and expression intensities from sequence-level labels are discussed in Section 3.1.1 and Section 4.1.1, respectively.

The task of AU detection in the context of inexact annotations is formulated as the problem of AU label estimation from the expression labels of training data without AU labels, where expression labels act as weak supervision for the AU labels, as shown in Figure 4b. The relevant approaches for AU detection from training data with expression labels are elaborated in Section 3.1.2. Similar to the framework of MIR for expression intensity estimation, AU intensity estimation can also be formulated for inexact annotations.

Fig. 3: Illustration of WSL scenarios for expression recognition in videos. (a) Supervised learning with accurate frame-level expression labels. (b) Multiple instance learning with sequence-level expression labels. (c) Semi-supervised learning with partial expression labels. (d) Inaccurate supervised learning with noisy expression labels. y1, y2, y3, ..., yn represent the accurate frame-level expression labels; ŷ1, ŷ2, ŷ3, ..., ŷn denote frame-level predictions; ỹ1, ỹ2, ỹ3, ..., ỹn refer to noisy frame-level expression labels; Y1 and Ŷ1 represent the sequence-level expression label and the sequence-level prediction, respectively; "?" implies "no labels".

2.2 Incomplete Supervision

This category refers to the family of ML algorithms that deals with the situation where only a small amount of labeled data is provided, which is not sufficient to obtain a good learner, despite the availability of abundant unlabeled data. For instance, it is easy to collect a huge number of videos of facial expressions; however, labeling the entire dataset with the corresponding expressions or action units remains a tedious task that demands a lot of human labor. One of the major solutions to this problem is semi-supervised learning (SSL) [19], where the data distribution is assumed to occur in clusters. Semi-supervised learning relies on the assumption of local smoothness, i.e., samples that lie close to each other are assumed to have similar labels, based on which it makes use of unlabeled data by modeling the distribution of the data. Therefore, the labels of unlabeled data can be predicted using generative methods, which assume that the labeled and unlabeled data are generated from the same model. The model assumption for the data distribution plays a crucial role in the performance of the system, as it influences the label assignment of the unlabeled data.

In the context of incomplete supervision for expression detection in videos, annotations are provided for different percentages of frames inside each sequence of the training data, which can drastically reduce the time required to annotate the data-set by labeling only a subset of frames. The objective is to predict the labels of all individual frames in a test sequence using the partial frame-level labels of the training data. For expression intensity estimation, the intensity levels are provided only for a subset of frames of every training sequence, as shown in Figure 3c, and the relevant approaches are discussed in Section 4.2.1.

For AU detection, the problem of semi-supervised learning can be formulated in two ways: missing labels and incomplete labels, as shown in Figure 4c. In the first case, each sample of the training data is provided with multiple AU labels, but some of the labels are missing. The task is to train an AU classifier with such training samples to predict multiple labels for a test sample. On the other hand, for incomplete labels, the entire set of multiple AU labels is provided only for a subset of images in the training data, while the rest of the training samples have no AU labels but the entire data may have expression labels. The problem of incomplete annotation can be extended further to AU intensity estimation, where AU intensity levels are provided only for the key frames within a video sequence of the training data, or only for a subset of a data-set of images; the relevant approaches for AU detection and AU intensity estimation are discussed in Section 3.2.2 and Section 4.2.2.
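As an illustration of how the local-smoothness assumption lets unlabeled samples be absorbed into training, the sketch below implements a generic self-training loop: a classifier is fitted on the labeled frames, assigns pseudo-labels to unlabeled frames on which it is confident, and is refitted. It is a simplified stand-in for the generative and confidence-based SSL strategies reviewed later (e.g., in Section 3.2.1); the logistic-regression model, confidence threshold and toy features are assumptions, not the setup of any cited work.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_training(X_lab, y_lab, X_unlab, conf_thresh=0.9, rounds=5):
    """Simple self-training: fit on labelled frames, pseudo-label the
    unlabelled frames the model is confident about, and refit."""
    X_lab, y_lab, X_unlab = X_lab.copy(), y_lab.copy(), X_unlab.copy()
    clf = LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        clf.fit(X_lab, y_lab)
        if len(X_unlab) == 0:
            break
        proba = clf.predict_proba(X_unlab)
        conf = proba.max(axis=1)
        keep = conf >= conf_thresh            # confident pseudo-labels only
        if not keep.any():
            break
        X_lab = np.vstack([X_lab, X_unlab[keep]])
        y_lab = np.concatenate([y_lab, proba[keep].argmax(axis=1)])
        X_unlab = X_unlab[~keep]
    return clf

# Toy data: two Gaussian clusters standing in for two expression classes.
rng = np.random.default_rng(0)
X_lab = np.vstack([rng.normal(0, 1, (5, 2)), rng.normal(4, 1, (5, 2))])
y_lab = np.array([0] * 5 + [1] * 5)
X_unlab = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
model = self_training(X_lab, y_lab, X_unlab)
```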

Fig. 4: Pictorial illustration of WSL scenarios for AU recognition in images. (a) Supervised learning with accurate AU annotations. (b) Inexact supervised learning with image-level expression annotations. (c) Incomplete learning with partial AU annotations. (d) Inaccurate supervised learning with noisy AU annotations. y1, y2, y3, ..., yn represent the accurate AU labels; ŷ1, ŷ2, ŷ3, ..., ŷn denote AU predictions; ỹ1, ỹ2, ỹ3, ..., ỹn refer to noisy AU labels; Y1 represents the image-level expression label; "?" implies "no annotations".

2.3 Inaccurate Supervision

This category refers to the scenario where the provided labeling information is highly noisy, unlike inexact and incomplete supervision. Though labeling information is provided for the entire dataset, in a similar way as in supervised learning, the labels tend to be noisy. Since inaccurate labels degrade the performance of the prediction model, the goal is to overcome the challenges imposed by noisy labels and to build a robust predictive model that estimates the accurate labels of the test data. Some of the widely used approaches to handle this problem are data-editing [20] and crowd-sourcing [21]. In data-editing, the training samples are considered as nodes that are connected to each other by edges, forming a graph-like structure. The edges connecting samples with different labels are termed cut-edges. A weight statistic is estimated for the cut-edges, with the intuition that a sample is considered as potentially mislabeled if it is associated with many cut-edges. Crowd-sourcing is a simple and effective way of compensating for mislabeled samples, where the same labeling task is given to a group of independent non-expert annotators and the correct labels of the samples are computed by aggregating the labels provided by the different individuals using a majority-voting strategy.

In the case of expression detection, labeling the frames of a video with the corresponding expressions is a laborious task, and annotators are prone to mislabeling frames of the videos. For instance, the expression of "sadness" may look similar to the expression of "neutral". Therefore, there is an imperative need to handle the problem of noisy expression labels and to predict the accurate expressions of test data. The ambiguity is even more pronounced for compound expressions, which are combinations of more than one expression. The problem of inaccurate annotations is more prevalent in the case of regression, i.e., estimating the level of the expressions. Regression can be done in two ways: ordinal regression, where the expression levels are assumed to be discrete, and continuous regression, where the expression levels are continuous. Since more variation is possible for continuous regression than for ordinal regression, annotations for continuous regression tend to be noisier. The scenario of inaccurate annotations for expression recognition in videos is shown in Figure 3d, and the relevant approaches for expression detection are discussed in Section 3.3.1. The framework of inaccurate annotations for expression recognition can be further extended to AU recognition, where the AU labels are considered to be noisy. The framework of inaccurate annotations for AU recognition is shown in Figure 4d, and the relevant approaches for AU detection are discussed in Section 3.3.2.
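A minimal sketch of the crowd-sourcing aggregation described above: labels from several independent non-expert annotators are merged by majority voting, and the agreement ratio can serve as a simple indicator of ambiguous samples. The annotator votes below are invented for illustration.

```python
from collections import Counter

def majority_vote(annotations):
    """Aggregate labels from independent non-expert annotators.
    `annotations` maps sample_id -> list of labels from different taggers."""
    consensus = {}
    for sample_id, labels in annotations.items():
        label, votes = Counter(labels).most_common(1)[0]
        consensus[sample_id] = {
            "label": label,
            "agreement": votes / len(labels),  # low agreement flags ambiguous samples
        }
    return consensus

# Toy example: three taggers label two frames with basic expressions.
votes = {
    "frame_001": ["sad", "neutral", "sad"],
    "frame_002": ["happy", "happy", "happy"],
}
print(majority_vote(votes))
```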

3 WSL FOR CLASSIFICATION

The classification of facial expressions is a well-explored research problem in the field of affective computing. In spite of the immense potential of classifying facial behavior with weak annotations, only a few researchers have addressed the problem of classifying facial expressions or AUs in the setting of WSL. In this section, we categorize the works in the literature based on the type of weak annotation, namely inexact, incomplete and inaccurate annotations, which are further sub-classified into expression detection and AU detection. Moreover, we summarize the results of these approaches along with the evaluation strategies.

3.1 Inexact Annotations

This section reviews the existing approaches related to WSL with inexact annotations, where annotations are provided at a global, coarse level instead of finer low-level annotations. Typically, facial expressions are related to facial AUs, which cover a wide range from basic expressions to compound expressions.

3.1.1 Detection of Expressions

In the case of inexact annotations, expression labels are provided at sequence level and the goal of the task is to localize and predict the expressions of test sequences.

Instance Level Approaches: The detection of expressions with MIL was initiated by Sikka et al. [22], where automatic pain localization was achieved by considering each video sequence as a bag comprising multiple segments (sub-sequences or instances), and segment-level features are obtained by temporally pooling a Bag of Words (BoW) representation of the frames. An instance-level classifier is developed using MILBoost [16], and bag-level labels are predicted based on the maximum probabilities of the instances of the corresponding bag. Chongliang et al. [23] further enhanced this approach by incorporating a discriminative Hidden Markov Model (HMM) based instance-level classifier with MIL instead of MILBoost to efficiently capture the temporal dynamics. Moreover, the segment-level representation (instance) is obtained based on the displacement of facial landmarks between the current frame and the last one, and the proposed approach is further extended to expression localization. Chen et al. [24] explored pain classification with MIL using the relationship between AUs and the pain expression under two strategies of feature representation: compact and clustered representation. In the compact representation, each entry of the feature vector denotes the probability estimate of the corresponding AU, whereas the clustered representation is obtained by clustering the co-occurrence of AUs. They have shown improvement using the clustered representation of AUs while the instance-level (sub-sequence) classifier is modeled using MILBoost.

Bag Level Approaches: Unlike the aforementioned approaches that rely on a single-concept assumption, Adria et al. [25] proposed a multi-concept MIL framework based on a multi-concept assumption, i.e., multiple expressions in a video, for estimating high-level (bag-level) semantic labels of videos, which are influenced by multiple discriminative expressions. A set of k hyper-planes is modelled to discriminate k concepts (facial expressions) in the instance space, and the bag-level representation is obtained using the probability of the bag for each concept, which is further classified using a linear classifier. Sikka et al. [26] also investigated the temporal dynamics of facial expressions in videos, where the ordering of the discriminative templates (neutral, onset, apex or offset) in a video sequence is associated with a cost function that captures the likelihood of occurrence of different temporal orders. Each video is assigned a score based on the ordering of the templates using the model, which is learnt with Stochastic Gradient Descent (SGD) by minimizing a regularized max-margin hinge loss.

Unlike conventional MIL methods, Huang et al. [27] developed a framework of Personal Affect Detection with Minimal Annotation (PADMA) for handling user-specific differences based on the association between key facial gestures and affect labels, i.e., if an instance occurs frequently in bags of a particular class but not in others, the instance has a strong association with that label. The facial feature vectors in a video are clustered as key facial gestures (expressions), and the video is represented as a sequence of the corresponding affect labels. The sequence-level affects are predicted by affect frequency analysis using the association between facial gestures and facial affects. Liefei et al. [28] represented a sequence with a set of 20 dominating motion fields, and a dictionary of motion words is obtained from the motion fields of the training sequences, which are labeled at sequence level. Each frame (motion field) is then represented as a histogram of motion words, and labels are obtained using a nearest neighbor classifier. The sequence-level label is then predicted using a majority vote of its frame labels.
Yanfei et al. [29] proposed an automatic depression detection system using landmarks of facial expressions through the framework of multiple instance learning. An LSTM is used to model the relationship between the instances (sub-sequences) of a bag (video), and global max pooling is deployed to identify depression-related instances and to generate the depression label of a test sequence.

3.1.2 Detection of Action Units (AUs)

Since any expression can be characterized as a combination of action units, many psychological studies have shown that there exists a strong relation between expressions and AUs [30]. Due to the expensive and laborious task of obtaining AU annotations compared to expressions, and the close relation between expressions and AUs as shown in Table 1, many researchers have explored the problem of AU detection in the framework of weakly supervised learning, where expression labels are considered as weak, coarse labels for AUs.

TABLE 1: List of AUs observed in expressions
Expression | AUs
Anger | 4, 5, 7, 10, 17, 22-26
Disgust | 9, 10, 16, 17, 25, 26
Fear | 1, 2, 4, 5, 20, 25, 26, 27
Happiness | 6, 12, 25
Sadness | 1, 4, 6, 11, 15, 17
Surprise | 1, 2, 5, 26, 27
Pain | 4, 6, 7, 9, 10, 12, 20, 25, 26, 27, 43

Ruiz et al. [31] investigated the prospect of learning AU classifiers using expression labels by exploiting the relationship between expressions and AUs. Each input sample is mapped to an action unit, which is in turn mapped to an expression, and action unit classification is considered as a hidden task due to the lack of AU labels. Each expression classifier is learnt prior to training using an empirical study of the relationship between expressions and AUs [32], which is in turn used to learn the AU classifiers. Instead of using only the relationship between basic expressions and the corresponding AU probabilities, Wang et al. [33] exploited both expression-dependent and expression-independent AU probabilities without using any extra large-scale expression-labeled facial images. First, the domain knowledge of expressions and AUs is summarized, based on which pseudo AU labels are generated for each expression. Then a Restricted Boltzmann Machine (RBM) is used to model the prior joint AU distribution from the pseudo AU labels. Finally, the AU classifiers are assumed to be linear functions with a sigmoid output layer and are learned using Maximum Likelihood Estimation (MLE) with regard to the learned AU label prior. Using a similar approach, Peng et al. [34] explored adversarial training for AU recognition instead of maximizing the log-likelihood of the AU classifier with regard to the learned AU label prior. Inspired by generative adversarial networks (GANs), AU classifiers are learned by minimizing the difference between the AU output distribution of the AU classifiers and the pseudo AU label distribution derived from the summarized domain knowledge.

Unlike the prior approaches, Wang et al. [35] modelled the relation between expressions and AU probabilities as inequalities instead of exact probabilities, i.e., AUs with higher probabilities of occurrence have higher rankings than those with lower probabilities. The expression-dependent ranking order among AUs is then exploited to train the AU classifiers as a multi-label ranking problem by minimizing a rank loss. Similarly, Zhang et al. [36] comprehensively summarized the domain knowledge and exploited both expression-dependent and expression-independent AU probabilities to model the relationship among AU probabilities. Multiple AU classifiers are then jointly learned by leveraging the prior probabilities of AUs, i.e., the derived relationships among AUs are represented in terms of inequality constraints among AU probabilities, which are incorporated into the objective function instead of the multi-label ranking framework as in [35].

Inspired by the idea of dual learning, Wang et al. [37] integrated the task of face synthesis along with AU recognition, where the latter is considered as the main task and the former as an auxiliary task. Specifically, AU labels are predicted using AU classifiers learned from the domain knowledge, which is formulated as the first objective term, and a synthetic face is generated using the predicted AU labels, which is optimized using a second objective term indicating the difference between the original and generated faces. Two conditional distributions are learned, one for each task, which are jointly optimized using the stochastic gradient descent method. All of the above-mentioned approaches for AU detection with weak expression labels have been extended to the problem of incomplete AU annotations by deploying an additional loss component for the available partial AU annotations.
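As a schematic illustration of how such domain knowledge is injected, the snippet below converts an expression label into a hard, binary pseudo AU target using the associations in Table 1. The reviewed methods use richer machinery (probabilistic expression-dependent and expression-independent priors, RBMs, ranking constraints) rather than this hard 0/1 encoding, so the sketch should be read only as the shared starting point; the AU vocabulary construction and the expansion of the anger range 22-26 into individual AUs are assumptions made for illustration.

```python
# Expression -> AU associations taken from Table 1 (the anger range 22-26
# expanded to individual AUs).
EXPRESSION_TO_AUS = {
    "anger":     [4, 5, 7, 10, 17, 22, 23, 24, 25, 26],
    "disgust":   [9, 10, 16, 17, 25, 26],
    "fear":      [1, 2, 4, 5, 20, 25, 26, 27],
    "happiness": [6, 12, 25],
    "sadness":   [1, 4, 6, 11, 15, 17],
    "surprise":  [1, 2, 5, 26, 27],
    "pain":      [4, 6, 7, 9, 10, 12, 20, 25, 26, 27, 43],
}
AU_VOCABULARY = sorted({au for aus in EXPRESSION_TO_AUS.values() for au in aus})

def pseudo_au_labels(expression):
    """Binary pseudo AU target vector for an image that only carries
    an expression label (a hard version of the AU priors)."""
    active = set(EXPRESSION_TO_AUS[expression])
    return [1 if au in active else 0 for au in AU_VOCABULARY]

print(AU_VOCABULARY)
print(pseudo_au_labels("happiness"))
```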
3.2 Incomplete Annotations

In this subsection, we review the existing algorithms that fall under the category of WSL with incomplete annotations.

3.2.1 Detection of Expressions

Cohen et al. [38] investigated the impact of unlabeled data for improving the classification performance of the system and proposed a structure learning algorithm for Bayesian networks. The facial features are extracted as motion features of the nonrigid regions of the face, which capture its deformation, and are fed to the proposed Bayesian classifier for recognizing expressions. Happy et al. [39] addressed the problem of limited data with incomplete annotations for classifying facial expressions even at low intensity, as in real-life data. Label smoothing is used to prevent the model from producing overly confident scores, thereby retaining expressions with low intensities. Initially, they train a CNN model with the limited labelled data in a supervised manner until adequate performance is achieved. Subsequently, the model parameters are further updated by fine-tuning with a portion of the unlabelled data that has high-confidence predictions, obtained by the current model in every epoch.

3.2.2 Detection of Action Units (AUs)

Missing Labels: Song et al. [40] developed a Bayesian framework for AU recognition by encoding the sparsity of AUs, i.e., only very few AUs are active at any moment, and the co-occurrence structure of AUs, i.e., the overlapping of AUs in multiple groups, via compressed sensing and group-wise sparsity-inducing priors. Wu et al. [41] proposed a multi-label learning framework with missing labels to learn multi-label classifiers by enforcing the constraints of consistency between predicted labels and provided labels (label consistency) as well as label smoothness, i.e., labels of similar features should be close to each other, along with modelling the co-occurrence relationships among AUs. However, [42] found that the constraint of label smoothness with a shared feature space among AUs is violated for the task of AU recognition due to the diverse nature of the occurrence of AUs, i.e., different AUs occur in different face regions, so features selected for one AU classifier are not discriminative for another AU classifier. Therefore, discriminative features are learned for each AU class before deploying the constraint of label smoothness. Li et al. [43] further extended the idea of [42] to address the problem of class imbalance in two aspects: the number of positive AUs being much smaller than negative AUs in each sample (image), and the rate of positive samples of different AUs being significantly different. They explored class cardinality bounds, where the model is learnt by imposing lower and upper bounds, obtained using the histogram of positive AUs, in the objective function.
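The missing-label setting can be grounded with a minimal base loss: a multi-label binary cross-entropy that is masked on the unobserved AU entries, so that only the provided labels drive the gradient. The approaches above add label-consistency, smoothness, co-occurrence or cardinality terms on top of such a base term; the -1 convention for missing entries and the toy values below are assumptions.

```python
import numpy as np

def masked_bce(probs, labels, missing_value=-1):
    """Binary cross-entropy over a multi-label AU target where some
    entries are unobserved; missing entries are excluded from the loss."""
    probs = np.clip(np.asarray(probs, float), 1e-7, 1 - 1e-7)
    labels = np.asarray(labels, float)
    mask = labels != missing_value
    y, p = labels[mask], probs[mask]
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy example: predictions for 5 AUs; the 2nd and 4th AUs are unlabelled (-1).
pred = [0.9, 0.2, 0.6, 0.1, 0.8]
target = [1, -1, 0, -1, 1]
print(masked_bce(pred, target))
```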
Incomplete Labels: Wang et al. [44] deal with incomplete AU labels but complete expression labels by modeling the dependencies among AUs and the relationship between expressions and AUs with a Bayesian Network (BN) using Maximum Likelihood Estimation, while Structural Expectation Maximization (SEM) is used to learn the parameters of the BN for data with no AU labels. They further extended the approach to estimate AU intensities. Peng et al. [45] explored an adversarial GAN based approach with dual learning by leveraging the domain knowledge of expressions and AUs along with facial image synthesis from the predicted AUs. Specifically, the probabilistic duality between the tasks and the dependencies among facial features, AUs and expressions are explored in an adversarial learning framework. Next, a reconstruction loss is deployed by considering the constraints of the dual task of AU prediction from facial images and facial image synthesis from predicted AUs, in addition to the standard supervised losses pertinent to AU labels and facial features.

Unlike the above two approaches, Wu et al. [46] used only incomplete AU annotations without expression labels and model the prior relationships among AUs using a Restricted Boltzmann Machine (RBM). Multiple SVMs are used to learn AU classifiers from deep features by minimizing the error between predicted labels and ground-truth labels while simultaneously maximizing the log-likelihood of the AU label distribution model to be consistent with the learnt AU label distributions. Inspired by the idea of co-training, Niu et al. [47] further improved the performance using a novel approach of multi-label co-regularization for semi-supervised AU recognition without expression labels. Specifically, a multi-view loss is designed to ensure that the features generated from the two views are conditionally independent by orthogonalizing the weights of the AU classifiers of the two views. Next, a co-regularization loss is designed to enforce the AU classifiers of the two views to have similar predictions by minimizing the distance between the two predicted probability distributions. Subsequently, a graph convolutional network (GCN) is also used to model the strong relationships among different AUs and to fine-tune the network.

3.3 Inaccurate Annotations

Since the annotation of expressions or AU labels is a complex process, the annotations are highly vulnerable to noise. Therefore, there is an immense need to refine the inaccurate annotations of expression or AU labels.

3.3.1 Detection of Expressions

Ali et al. [48] investigated the consistency of noisy annotations of images crawled from the web and the performance of deep networks in handling noisy labels. AlexNet [49] and WACV-Net [50] are trained using training data with true labels only, true labels with noisy labels, and true labels with noisy labels along with noise modeling [51]. All three scenarios are evaluated using a test set with true labels. It was observed that AlexNet outperforms WACV-Net in all cases, with the scenario of training with true labels having the best performance, followed by noisy labels with noise modeling and noisy labels. Barsoum et al. [52] trained a deep CNN with noisy labels obtained from 10 taggers, which is analyzed using four different strategies for effective label assignment: majority voting, multi-label learning, probabilistic label drawing and cross-entropy loss. It was observed that the strategies of multi-label learning fully exploit the label distribution and outperform the single-label strategy. Jiabei et al. [53] proposed a 3-step framework, Inconsistent Pseudo Annotations to Latent Truth (IPA2LT), to discover the latent true labels of noisy data with multiple inconsistent annotations. First, predictive models are trained on the individual labelled data-sets. Next, pseudo annotations are generated from the trained predictive models to obtain multiple labels for each image of the labelled data-sets as well as of large-scale unlabelled data. Finally, LTNet is trained to estimate the latent true labels by maximizing the log-likelihood of the observed multiple pseudo annotations.
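The difference between hard and soft use of multiple noisy taggers can be sketched as follows: the raw votes are either collapsed into a single majority label or kept as a label distribution and used as the target of a cross-entropy loss. This is only a schematic contrast of the strategies compared in [52], not their exact formulation; the class list, vote counts and network output below are invented.

```python
import numpy as np

CLASSES = ["neutral", "happy", "sad", "angry"]

def tagger_distribution(votes):
    """Turn raw tagger votes into a probability distribution over classes."""
    counts = np.array([votes.count(c) for c in CLASSES], dtype=float)
    return counts / counts.sum()

def cross_entropy(pred_probs, target_dist):
    """Cross-entropy against the (soft) crowd label distribution."""
    pred_probs = np.clip(pred_probs, 1e-7, 1.0)
    return float(-(target_dist * np.log(pred_probs)).sum())

votes = ["sad", "neutral", "sad", "sad", "neutral"]   # 5 hypothetical taggers
soft_target = tagger_distribution(votes)              # soft label distribution
hard_target = CLASSES[int(np.argmax(soft_target))]    # majority vote: "sad"

model_output = np.array([0.2, 0.1, 0.6, 0.1])          # some network's softmax
print(hard_target, soft_target, cross_entropy(model_output, soft_target))
```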
3.3.2 Detection of Action Units (AUs)

Zhao et al. [54] explored weakly supervised clustering on a large scale of web images with inaccurate annotations to derive a weakly-supervised spectral algorithm that learns an embedding space coupling image appearance and semantics. Next, the noisy annotations are refined using rank-order clustering by identifying groups of visually and semantically similar images. They further enhanced the approach by invoking a stochastic extension to deal with large-scale image collections. Fabian et al. [55] proposed a global-local loss function that combines a local loss, which emphasizes accurate detection by focusing on salient regions but requires very accurate labels for good convergence, with a global loss that captures the global structure of images, yielding consistent results for AU recognition.

3.4 Experimental Results

In this section, we present the data-sets used for validating the WSL models for classification proposed in the framework of weakly supervised learning, along with a critical analysis of the results.

3.4.1 Data-sets

UNBC-McMaster [56]: The database contains 200 video sequences of 48,398 frames captured from 25 participants who self-identified with shoulder pain. Each frame of the video sequences is labeled with 5 discrete intensity levels of the AUs pertinent to pain (A < B < C < D < E) obtained by three certified FACS coders. Only the action units related to pain are considered: brow-lowering (AU4), cheek-raising (AU6), eyelid tightening (AU7), nose wrinkling (AU9), upper-lip raising (AU10), oblique lip raising (AU12), horizontal lip stretch (AU20), lips parting (AU25), jaw dropping (AU26), mouth stretching (AU27) and eye-closure (AU43); however, AU43 is assigned only binary labels. In addition to the annotations based on FACS, labels of discrete pain intensities are provided both at sequence level and at frame level: Prkachin and Solomon Pain Intensity (PSPI) scores are labeled at frame level with 16 discrete levels from 0-15, and Observer Pain Intensity (OPI) ratings are provided at sequence level on a scale of 0-5.

CK+ [57]: The database consists of 593 video sequences captured in controlled laboratory conditions, where the emotions are spontaneously performed by 123 participants. All the video sequences vary from a neutral face to the peak formation of the facial expression, and the duration of the sequences varies from 10 to 60 frames. The video sequences are labeled based on FACS with seven basic expression labels (including contempt), and each emotion is defined by a prototypical combination of specific action units (AUs). Out of the 593 sequences, only 327 sequences satisfy the labelling strategy, where each video sequence is labeled with the corresponding emotion label. Since the emotion labels are provided at video level, static approaches assign the emotion label of the video sequence to the last one to three frames that exhibit the peak formation of the expression, and the first frame is considered as the neutral frame.

MMI [58], [59]: The database contains 326 video sequences spontaneously captured in laboratory-controlled conditions from 32 subjects, though it includes challenging variations such as large inter-personal variations, pose, etc., compared to the CK+ database. The sequences are captured as onset-apex-offset, i.e., the sequence starts with a neutral expression (onset), reaches the peak (apex) and returns again to a neutral expression (offset). As per the FACS standard, 213 sequences are labeled with the six basic facial expressions (excluding contempt) at video level, of which 205 sequences are captured in frontal view. Frame-level annotations are also provided for some of the sequences. For approaches based on static images, only the first frame (neutral expression) and the peak frames (apex of the expression) are considered.

DISFA [60]: The Denver Intensity of Spontaneous Facial Action (DISFA) dataset is created using 9 short video clips from YouTube, where 27 adult participants watch the short video clips pertaining to various emotions. The facial expressions of each participant are captured as high-resolution video of 1024x768 pixels at a frame rate of 20 fps, resulting in 130,754 frames in total. Each of these frames is annotated with action units along with discrete intensity levels by FACS expert raters. The action units related to the expressions in the database are AU1, AU2, AU4, AU5, AU6, AU9, AU12, AU15, AU17, AU20, AU25 and AU26, whose intensities are provided on a six-point ordinal scale (neutral < A < B < C < D < E).

BP4D [61]: The data-set is captured from 41 participants, where each subject is requested to exhibit 8 spontaneous expressions, and 2D and 3D videos are obtained for each task. A total of 328 video sequences are obtained, where the frames exhibiting a high density of facial expressions are annotated with facial AUs. Due to the intensive process of FACS coding, only the most expressive temporal segments, i.e., 20-second segments, are encoded. A total of 27 AUs are coded for the expression sequences, and AU intensities are also coded on an ordinal scale of 0-5 for AU12 and AU14. Similar to BU-4DFE, this data-set is also used for analyzing facial expressions in dynamic 3D space.

TABLE 2: Comparative evaluation of performance measures for classification of expressions or action units under various modes of WSL setting on the most widely evaluated data-sets.

Data-set | WSL Problem | Reference | Task | Model-type | Learning Model | Validation | Performance
UNBC-McMaster | Inexact | Sikka et al (2014) [22] | Pain Detection | Dynamic | Gradient Boosting | LOSO | 83.70
UNBC-McMaster | Inexact | Chong et al (2015) [23] | Expression Detection | Dynamic | HMM | LOSO | 85.23
UNBC-McMaster | Inexact | Chen et al (2019) [24] | Pain Detection | Dynamic | Gradient Boosting | 10-fold | 86.84
UNBC-McMaster | Inexact | Adria et al (2014) [25] | Expression Detection | Static | Gradient Descent | LOSO | 85.70
UNBC-McMaster | Inexact | Sikka et al (2016) [26] | Expression Detection | Dynamic | SGD | LOSO | 87.00
UNBC-McMaster | Inexact | Huang et al (2016) [27] | Expression Detection | Static | RF-IAF | LOSO | 84.40
UNBC-McMaster | Inexact | Ruiz et al (2015) [31] | AU Detection | Static | Gradient Descent | 5-fold | 0.235
UNBC-McMaster | Inexact | Wang et al (2018) [33] | AU Detection | Static | RBM | 5-fold | 0.351
UNBC-McMaster | Inexact | Wang et al (2019) [37] | AU Detection | Static | RBM | 5-fold | 0.400
UNBC-McMaster | Inexact | Peng et al (2018) [34] | AU Detection | Static | RAN | 5-fold | 0.376
UNBC-McMaster | Inexact | Zhang et al (2018) [36] | AU Detection | Static | LBFGS [62] | 5-fold | 0.510
UNBC-McMaster | Incomplete | Wang et al (2017) [44] | AU Detection | Static | Bayesian Network | 5-fold | 0.183
UNBC-McMaster | Incomplete | Song et al (2015) [40] | AU Detection | Static | Bayesian Model | 5-fold | 0.450
UNBC-McMaster | Incomplete | Wu et al (2015) [41] | AU Detection | Static | Gradient Descent | 5-fold | 0.146
UNBC-McMaster | Incomplete | Ruiz et al (2015) [31] | AU Detection | Static | Gradient Descent | 5-fold | 0.292
UNBC-McMaster | Incomplete | Wang et al (2018) [33] | AU Detection | Static | RBM | 5-fold | 0.502
UNBC-McMaster | Incomplete | Wang et al (2019) [37] | AU Detection | Static | RBM | 5-fold | 0.514
UNBC-McMaster | Incomplete | Peng et al (2018) [34] | AU Detection | Static | RAN | 5-fold | 0.472
UNBC-McMaster | Incomplete | Wang et al (2018) [35] | AU Detection | Static | Classical | 5-fold | 0.521
UNBC-McMaster | Incomplete | Peng et al (2019) [45] | AU Detection | Static | DSGAN | 5-fold | 0.520
CK+ | Inexact | Chong et al (2015) [23] | Expression Detection | Dynamic | HMM | LOSO | 98.54
CK+ | Inexact | Sikka et al (2016) [26] | Expression Detection | Dynamic | SGD | 10-fold | 95.19
CK+ | Inexact | Ruiz et al (2015) [31] | AU Detection | Static | Gradient Descent | 5-fold | 0.469
CK+ | Inexact | Wang et al (2018) [33] | AU Detection | Static | RBM | 5-fold | 0.705
CK+ | Inexact | Wang et al (2019) [37] | AU Detection | Static | RBM | 5-fold | 0.740
CK+ | Inexact | Peng et al (2018) [34] | AU Detection | Static | RAN | 5-fold | 0.715
CK+ | Inexact | Zhang et al (2018) [36] | AU Detection | Static | LBFGS [62] | 5-fold | 0.732
CK+ | Incomplete | Wang et al (2017) [44] | AU Detection | Static | Bayesian Network | 5-fold | 0.781
CK+ | Incomplete | Song et al (2015) [40] | AU Detection | Static | Bayesian Model | 5-fold | 0.696
CK+ | Incomplete | Wu et al (2015) [41] | AU Detection | Static | Gradient Descent | 5-fold | 0.652
CK+ | Incomplete | Ruiz et al (2015) [31] | AU Detection | Static | Gradient Descent | 5-fold | 0.594
CK+ | Incomplete | Wang et al (2018) [33] | AU Detection | Static | RBM | 5-fold | 0.787
CK+ | Incomplete | Wang et al (2019) [37] | AU Detection | Static | RBM | 5-fold | 0.806
CK+ | Incomplete | Peng et al (2018) [34] | AU Detection | Static | RAN | 5-fold | 0.792
CK+ | Incomplete | Peng et al (2019) [45] | AU Detection | Static | DSGAN | 5-fold | 0.792
CK+ | Incomplete | Zhang et al (2018) [36] | AU Detection | Static | LBFGS [62] | 5-fold | 0.754
MMI | Inexact | Ruiz et al (2015) [31] | AU Detection | Static | Gradient Descent | 5-fold | 0.431
MMI | Inexact | Wang et al (2018) [33] | AU Detection | Static | RBM | 5-fold | 0.516
MMI | Inexact | Wang et al (2019) [37] | AU Detection | Static | RBM | 5-fold | 0.530
MMI | Inexact | Peng et al (2018) [34] | AU Detection | Static | RAN | 5-fold | 0.520
MMI | Inexact | Zhang et al (2018) [36] | AU Detection | Static | LBFGS [62] | 5-fold | 0.481
MMI | Incomplete | Wang et al (2017) [44] | AU Detection | Static | Bayesian Network | 5-fold | 0.438
MMI | Incomplete | Song et al (2015) [40] | AU Detection | Static | Bayesian Model | 5-fold | 0.447
MMI | Incomplete | Wu et al (2015) [41] | AU Detection | Static | Gradient Descent | 5-fold | 0.432
MMI | Incomplete | Ruiz et al (2015) [31] | AU Detection | Static | Gradient Descent | 5-fold | 0.530
MMI | Incomplete | Wang et al (2018) [33] | AU Detection | Static | RBM | 5-fold | 0.531
MMI | Incomplete | Wang et al (2019) [37] | AU Detection | Static | RBM | 5-fold | 0.561
MMI | Incomplete | Peng et al (2018) [34] | AU Detection | Static | RAN | 5-fold | 0.529
MMI | Incomplete | Peng et al (2019) [45] | AU Detection | Static | DSGAN | 5-fold | 0.537
DISFA | Inexact | Ruiz et al (2015) [31] | AU Detection | Static | Gradient Descent | 5-fold | 0.371
DISFA | Inexact | Wang et al (2018) [33] | AU Detection | Static | RBM | 5-fold | 0.424
DISFA | Incomplete | Wang et al (2017) [44] | AU Detection | Static | Bayesian Network | 5-fold | 0.430
DISFA | Incomplete | Song et al (2015) [40] | AU Detection | Static | Bayesian Model | 5-fold | 0.426
DISFA | Incomplete | Wu et al (2015) [41] | AU Detection | Static | Gradient Descent | 5-fold | 0.382
DISFA | Incomplete | Ruiz et al (2015) [31] | AU Detection | Static | Gradient Descent | 5-fold | 0.428
DISFA | Incomplete | Wang et al (2018) [33] | AU Detection | Static | RBM | 5-fold | 0.522
BP4D | Incomplete | Song et al (2015) [40] | AU Detection | Static | Bayesian Model | 60/20/20 subject split | 0.400
BP4D | Incomplete | Wang et al (2017) [46] | AU Detection | Static | RBM | 60/20/20 subject split | 0.452
3.4.2 Experimental Protocol

Depending on the mode of annotation, the data-sets are further modified to match the corresponding task at hand in order to validate the techniques. For the task of classification, the performance for expressions and AUs is expressed in terms of percentage accuracy and F1-score, respectively.

Inexact Annotations: Though the UNBC-McMaster data-set is primarily used for both regression and classification, it has been explored for expression classification by converting the ordinal labels (OPI ratings) of the data-set to binary labels based on a threshold, i.e., OPI > 3 is treated as pain and OPI = 0 as no pain, which results in a total of 149 sequences with 57 positive bags and 92 negative bags. For AU classification, 7,319 frames are chosen from 30 video sequences of 17 subjects that exhibit the expression of pain with PSPI > 4. Six AU labels are associated with the chosen frames, i.e., AU4, AU6, AU7, AU9, AU10 and AU43, which have a dependency on the expression label of pain. The bag label of each pain sequence is considered as the maximum of its frame labels. Out of the 25 subjects, 15 are used for training, 9 for validation and 1 for testing.

In the case of the CK+ data-set, 327 sequences are considered for expression classification, whereas for AU classification, 309 sequences of 106 subjects are chosen from the 593 sequences of 123 subjects based on the occurrence of AU labels, i.e., AU labels that are available for more than 10% of all frames are chosen, resulting in 12 AU labels. The MMI data-set is used for AU classification by considering sequences where AUs are available for more than 10% of all samples, resulting in 171 sequences from 27 subjects with 13 labels. For classification on the DISFA data-set, 482 apex frames are chosen based on AU intensity levels, for which expression labels are obtained by FACS. Similar to CK+ and MMI, 9 AUs are considered, whose occurrence is greater than 10%. 5-fold cross-validation is deployed, where 20% of the whole database is used as a validation set according to subjects. All the experiments are conducted under a subject-independent protocol.

Incomplete Annotations: In the context of incomplete annotations, the UNBC-McMaster, CK+, MMI and DISFA data-sets are used for AU detection, where the training data is obtained with missing AU labels by dropping some of the AU labels. In the case of the BP4D data-set for AU classification, only partial AU annotations are considered, without expression labels. Images of 60% of the subjects are used for training, 20% for validation and the last 20% for testing. For all the data-sets, the AU labels of 50% of the training samples are randomly removed in order to incorporate the setting of incomplete annotations. The experiment is repeated several times and the average F1-score is used for validation.
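The UNBC-McMaster relabelling used in the inexact protocol can be summarized in a few lines. Only the thresholding rule described above (OPI > 3 treated as pain, OPI = 0 as no pain, other sequences excluded) is taken from the protocol; the sequence identifiers and ratings below are hypothetical.

```python
def opi_to_bag_label(opi):
    """Sequence-level (bag) label from the Observer Pain Intensity rating:
    OPI > 3 -> positive (pain), OPI == 0 -> negative, otherwise excluded."""
    if opi > 3:
        return 1
    if opi == 0:
        return 0
    return None  # sequences with 0 < OPI <= 3 are left out of the MIL setup

# Hypothetical OPI ratings for a handful of sequences.
ratings = {"seq01": 5, "seq02": 0, "seq03": 2, "seq04": 4}
bags = {seq: opi_to_bag_label(r) for seq, r in ratings.items()}
positives = [seq for seq, lab in bags.items() if lab == 1]
negatives = [seq for seq, lab in bags.items() if lab == 0]
print(bags, len(positives), len(negatives))
```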
3.5 Critical Analysis

Although the classification of facial expressions or action units is well explored in the framework of supervised learning, it is still an under-researched problem in the setting of WSL. Recently, action unit detection has drawn relatively more attention than expression detection in the context of WSL. This could be due to the fact that facial AUs cover a wide range of facial expressions rather than the limited six basic universal expressions, which has huge potential in a lot of real-time applications. Facial expressions and AUs are dynamic processes which evolve over time, and thereby the temporal information of expressions or AUs conveys significant information about facial behavior. However, temporal information is not well explored for the problem of AU detection, though it has been investigated for pain detection in the framework of WSL [63], [26]. Current works on expression detection in WSL have used max-pooling or the displacement of facial landmarks for extracting the dynamic information of video sequences pertinent to expressions. The displacement of facial landmarks was found to capture the temporal dynamics better than max-pooling, which is reflected in the work of [23], where the displacement of facial landmarks is used in conjunction with an HMM instead of max-pooling.

Most of the current research on facial behavior analysis pertinent to WSL is based on classical machine learning approaches (shallow networks), though deep learning based approaches are gaining attention in recent years. One of the major bottlenecks in using deep learning for FER is the lack of sufficient data for facial expressions or AUs, as they are sparse in nature, i.e., facial expressions or AUs occur only for a limited duration of time in a video sequence. Deep networks were proven to be more robust than shallow networks in handling the wide range of intra-class variations such as illumination, pose, identity, etc., when a large amount of data is provided [64]. Therefore, networks pretrained on face recognition [65] are used for expression or AU recognition by retaining the lower layers and fine-tuning only the higher layers, as they represent the task-relevant features, in order to handle the problem of limited facial expression data-sets [66]. However, it was found that the abstract feature representation of the higher layers still holds expression-unrelated information pertinent to subject identity even after fine-tuning the face verification nets [67]. To tackle this problem, large-scale in-the-wild FER data-sets [68], [69] have been captured and made available to the research community in recent years, where the noisy annotations are refined using WSL approaches related to inaccurate annotations [53], [70]. Recently, a few works [33], [37] have explored RBMs for AU detection. To the best of our knowledge, no work has been done using deep learning architectures for expression detection in the framework of inexact annotations, though it has recently been explored in WSL with incomplete annotations [39].

Most of the current research on AU detection has focused on exploiting the domain knowledge of the dependencies among AUs and between basic expressions and AUs in the framework of WSL, where expression labels act as weak supervision for AU detection. Similarly, expression detection has been widely explored in the context of pain detection in the framework of WSL. Since the experimental strategy of [31], [44], [40] and [41] is different from that of [33], [37] and [34], [33] re-conducted the experiments of [31], [44], [40] and [41] with the setup of [33] in order to have a fair comparison. The detailed comparison of results of the current state-of-the-art methods on the widely used data-sets is shown in Table 2. For expression detection, the performance measure reflects sequence-level classification for data with inexact annotations and frame-level classification for data with incomplete annotations. The performance metric of accuracy is used for expression detection and the average F1-score for AU detection.

4 WSL FOR REGRESSION

Regression can be formulated as ordinal regression or continuous regression. Ordinal regression algorithms are the class of machine learning algorithms that deal with the task of recognizing patterns on a categorical scale that reflects an ordering between the labels. Continuous regression algorithms deal with estimating the intensities of continuous labels such as valence or arousal. Though the problem of regression is not as well explored as the classification task, it has recently been gaining attention due to its immense potential in many real-world applications. In the context of facial behavior analysis, regression deals with the estimation of the intensity levels of facial expressions or AUs. Similar to the classification of facial expressions, regression follows the same methodology of major building blocks, i.e., preprocessing, feature extraction and model generation. However, the model is trained to capture the intensities for the task of regression rather than to classify the behavior patterns. Even though a few algorithms have been proposed for estimating the intensities of facial expressions or AUs [64], [71] in the fully-supervised setting, the problem is still at a rudimentary stage in the framework of WSL. We have classified the existing approaches relevant to the regression of facial expressions or AUs in the framework of WSL as inexact, incomplete and inaccurate techniques based on the mode of WSL setting, similar to that of classification as described in Section 3.

4.1 Inexact Annotations

In the case of regression with inexact annotations, the intensity levels of expressions or AUs are provided at the global level, i.e., at the sequence level, and the goal is to estimate the intensity level of individual frames or sub-sequences using the sequence-level labels.

4.1.1 Expression Intensity Estimation

Adria et al. [72] proposed multi-instance dynamic ordinal random fields (MI-DORF) for estimating the ordinal intensity levels of frames, where the ordinal variables are modelled as normal distributions and the relationship between a given observation (frame) and its latent ordinal value is obtained by projecting the observation onto the ordinal line, which is divided by the consecutive overlapping cutoff points of the normal distributions. Next, the temporal information is modeled across the consecutive latent ordinal variables to ensure smoothness of the latent ordinal states. Praveen et al. [73] further improved the performance using a deep 3D CNN model (I3D [74]) by integrating multiple instance learning into an adversarial deep domain adaptation [75] framework for pain intensity estimation, where the source domain is assumed to have fully annotated videos and the target domain has periodically annotated weak labels. Jianfei et al. [76] proposed an approach for student engagement prediction in-the-wild using multiple-instance regression, where the input video (bag) is divided into segments (instances) and the spatiotemporal features of each segment are fed to an LSTM network followed by 3 fully connected layers to obtain the regressed value of engagement intensity.

4.1.2 AU Intensity Estimation

Adria et al. [77] extended the idea of [72] for AU intensity estimation by modeling the relationship between the weak sequence-level label and the instance labels using two strategies: maximum or relative values of the instance labels. They also further extended the approach to partially labeled data, where the sequence-level labels are provided along with partial instance-level labels. Unlike the conventional MIL framework, Zhang et al. [78] explored domain knowledge of relevance using two labels, the peak and valley frames. Specifically, they considered three major factors, ordinal relevance, intensity smoothness and relevance smoothness, based on the gradually evolving process of facial behavior. Ordinal relevance ensures relevant intensity levels based on the proximity of neighbouring frames, whereas intensity and relevance smoothness constrain the smooth evolution of facial appearance and ordinal relevance, respectively.

Due to the relevance of local patches for AUs, Zhang et al. [79] further improved the performance by developing a patch-based deep model using attention mechanisms for feature fusion and label fusion in order to capture the spatial relationships among local patches and the temporal dynamics pertinent to each AU, respectively. Since the contribution of local patches and temporal dynamics varies for different AUs, the attention mechanism was further augmented with a learnable task-related context.
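The inexact-regression setting can be written compactly: a frame-level intensity estimator is trained so that an aggregate (mean or max) of its frame outputs matches the sequence-level label, in line with the MIR formulation introduced in Section 2.1. The linear model, squared loss and gradient updates below are illustrative choices and not the estimators of [72], [73] or [76].

```python
import numpy as np

def train_frame_regressor(bags, seq_labels, agg="mean", lr=0.01, epochs=200):
    """Fit a linear frame-level intensity model w.x + b so that the
    aggregated (mean or max) frame prediction of each sequence matches
    its sequence-level intensity label."""
    dim = bags[0].shape[1]
    w, b = np.zeros(dim), 0.0
    for _ in range(epochs):
        for X, y in zip(bags, seq_labels):        # X: frames x features
            frame_pred = X @ w + b
            if agg == "max":
                i = int(np.argmax(frame_pred))     # subgradient through the max frame
                err, grad_x = frame_pred[i] - y, X[i]
            else:
                err, grad_x = frame_pred.mean() - y, X.mean(axis=0)
            w -= lr * err * grad_x
            b -= lr * err
    return w, b

# Toy sequences: per-frame features whose first dimension drives intensity,
# with only the sequence-level maximum available as a weak label.
rng = np.random.default_rng(1)
bags = [rng.uniform(0, 1, (20, 3)) for _ in range(30)]
labels = [float((X[:, 0] * 5).max()) for X in bags]
w, b = train_frame_regressor(bags, labels, agg="max")
print(np.round(w, 2), round(b, 2))
```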
4.2 Incomplete Annotations

In this framework, the intensities of the frames are provided only for a subset of the training data. The goal of the task is to generate a robust training model for predicting intensity values of test data at frame level using partially labeled data along with unlabeled data.

4.2.1 Expression Intensity Estimation

To the best of our knowledge, only one work has been done related to this problem, where Rui et al [80] proposed a max-margin based ordinal support vector regression using ordinal relationships, which is flexible and generic in handling varying levels of annotation; a linear model is learned by solving the optimization problem using the Alternating Direction Method of Multipliers (ADMM) to predict the frame-level intensity of a test image.

4.2.2 AU Intensity Estimation

Yong et al [81] designed a deep convolutional neural network for intensity estimation of Action Units (AUs) using annotations of only peak and valley frames. The parameters of the CNN are learned by encoding domain knowledge of facial symmetry, temporal intensity ordering, relative appearance similarity and contrastive appearance difference. The CNN is designed as 3 convolutional layers followed by 3 max-pooling layers and 1 fully connected layer. Zhang et al [82] further extended [81] to joint estimation of multiple AU intensities by introducing a task index to update the corresponding parameters of the fully connected layer. They also used a large number of unlabelled frames in addition to labelled key frames for training to handle over-fitting under the framework of semi-supervised learning. Shangfei et al [83] extended the idea of [46] for AU intensity estimation, where an RBM is used to model the AU intensity distribution and regularize the prediction model of AU intensities. Zhang et al [84] further improved the performance by jointly learning the representation and estimator using partially labeled frames and incorporating human knowledge as soft and hard constraints. Specifically, the sequences are segmented into monotonically increasing segments; temporal label ranking and positive AU intensity levels are considered as hard constraints, while temporal label smoothness and temporal feature smoothness are considered as soft constraints.
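The incomplete-annotation setting of Section 4.2 can be sketched in a similar way: a supervised loss is applied only to the few annotated frames (e.g., peak and valley frames), while a temporal-smoothness term regularizes the unlabeled frames, loosely following the soft constraints described above. This is a generic sketch and not the training objective of [81], [82] or [84]; the weighting and toy data are arbitrary.

    import torch

    def partial_label_loss(frame_pred, frame_labels, label_mask, smooth_weight=0.1):
        """frame_pred:   (T,) predicted intensities for every frame
           frame_labels: (T,) intensity labels, valid only where label_mask is True
           label_mask:   (T,) bool, True for the few annotated (e.g. peak/valley) frames."""
        # supervised term on the sparse labelled frames only
        sup = ((frame_pred[label_mask] - frame_labels[label_mask]) ** 2).mean()
        # temporal smoothness on all frames, including unlabelled ones
        smooth = ((frame_pred[1:] - frame_pred[:-1]) ** 2).mean()
        return sup + smooth_weight * smooth

    # toy usage
    pred = torch.randn(100, requires_grad=True)
    labels = torch.zeros(100); labels[20] = 4.0; labels[60] = 1.0
    mask = torch.zeros(100, dtype=torch.bool); mask[[20, 60]] = True
    loss = partial_label_loss(pred, labels, mask)
    loss.backward()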
4.3 Experimental Results

In this section, the data-sets used for validating the WSL models proposed for ordinal regression are presented, along with a critical analysis of the results pertinent to the WSL approaches.

4.3.1 Data-sets

With the advancement of state-of-the-art FER systems, regression based approaches are gaining attention as humans make use of a wide range of intensities of facial expressions to convey their feelings. In order to cover this wide range of facial expressions, several databases have been developed with intensity levels of facial expressions such as pain, action units, etc.
FERA 2015 Challenge [85]: The data-set is drawn from the BP4D [61] and SEMAINE [86] databases for the task of AU occurrence and intensity estimation, where only five AUs from BP4D, i.e., AU6, AU10, AU12, AU14 and AU17, are considered for AU intensity estimation, and 14 AUs from both SEMAINE and BP4D for occurrence detection, i.e., AU1, AU2, AU4, AU6, AU7, AU10, AU12, AU14, AU15, AU17, AU23, AU25, AU28 and AU45. The original BP4D data-set is used as the training set, with training data drawn from 21 subjects and the development set from 20 subjects, and the data-set is further extended with a test set captured from 20 subjects, resulting in 75,586 images in the training partition, 71,261 images in the development partition and 75,726 in the testing partition. Similarly, for the SEMAINE data-set, 48,000 images are used for training, 45,000 for development and 37,695 for testing. The entire data-set is annotated frame-wise for AU occurrence and intensity level for the corresponding subset of AUs. For the BP4D-extended set, the onsets and offsets are treated as B-level intensity. In both data-sets, the most facially-expressive segments are coded for AUs and AU intensities. The intensity levels of AUs are coded on an ordinal scale of 0-5.
BU-4DFE [87]: The database is an extended version of BU-3DFE, where the facial behavior of the static 3D space is further extended to include the dynamic space. The data-set contains 606 3D video sequences obtained from 101 subjects by allowing each subject to exhibit six prototypical facial expressions. Each video sequence has 100 frames with a resolution of 1040x1329, which results in a total of approximately 60,600 frames. This data-set is widely used for multi-view 3D facial expression analysis.

4.3.2 Experimental Protocol

In case of regression, the performance for expressions is expressed in terms of Mean Square Error (MSE). The intensities of pain and AUs are evaluated in terms of Mean Absolute Error (MAE), Pearson Correlation Coefficient (PCC) and Intraclass Correlation Coefficient (ICC).
Inexact Annotations: In case of regression on the UNBC-McMaster data-set, PSPI labels of the frames are converted to 6 ordinal levels: 0(0), 1(1), 2(2), 3(3), 4-5(4), 6-15(5). The bag label of each pain sequence is considered as the maximum of its frame labels. Out of 25 subjects, 15 are used for training, 9 for validation and 1 for testing. For the EmotiW data-set, the training and validation sets provided by the organizers are used for training by manually splitting the data to compensate for class imbalance. The performance is evaluated on the test set provided by the organizers.
Incomplete Annotations: For regression, only 8.8% of the total annotations are considered for the UNBC-McMaster data-set for the task of pain regression. Similarly, the CK+ and BU-4DFE data-sets are used for expression regression, where 327 and 120 sequences with a total of 5876 and 2289 frames, respectively, are considered, out of which annotations are provided for onset and apex frames. For the task of AU regression on the UNBC-McMaster and DISFA data-sets, only 10% of annotated frames are considered in [77], [83] and [80] in order to incorporate the setting of incomplete annotations, whereas [78] and [81] considered only the annotations of peak and valley frames. In case of the FERA 2015 data-set, the official training and development sets provided by the FERA 2015 challenge [85] are deployed. Similar to UNBC-McMaster and DISFA, [83] and [44] consider only 10% of annotated frames while [80], [78] and [81] consider annotations of peak and valley frames.
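For reference, the three measures used throughout this section can be computed as in the short Python sketch below. The ICC variant shown is the two-way consistency form ICC(3,1), treating ground truth and prediction as two raters; individual papers may report a different ICC variant.

    import numpy as np
    from scipy.stats import pearsonr

    def mae(y_true, y_pred):
        return np.mean(np.abs(y_true - y_pred))

    def pcc(y_true, y_pred):
        return pearsonr(y_true, y_pred)[0]

    def icc_3_1(y_true, y_pred):
        """Two-way, single-measure consistency ICC(3,1), treating
        ground truth and prediction as two 'raters'."""
        ratings = np.stack([y_true, y_pred], axis=1)   # (n, 2)
        n, k = ratings.shape
        mean_per_target = ratings.mean(axis=1)
        mean_per_rater = ratings.mean(axis=0)
        grand_mean = ratings.mean()
        bms = k * np.sum((mean_per_target - grand_mean) ** 2) / (n - 1)
        rms = n * np.sum((mean_per_rater - grand_mean) ** 2) / (k - 1)
        total_ss = np.sum((ratings - grand_mean) ** 2)
        ems = (total_ss - (n - 1) * bms - (k - 1) * rms) / ((n - 1) * (k - 1))
        return (bms - ems) / (bms + (k - 1) * ems)

    y_true = np.array([0, 1, 2, 3, 4, 5], dtype=float)
    y_pred = np.array([0.2, 0.8, 2.5, 2.9, 4.2, 4.8])
    print(mae(y_true, y_pred), pcc(y_true, y_pred), icc_3_1(y_true, y_pred))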

4.4 Critical Analysis

The intensity estimation of facial expressions or AUs is more challenging than the task of classification due to the complexity of capturing the subtle variations of facial appearance and of obtaining the annotations of intensity levels, as they are scarce and expensive. Therefore, in general, the task of intensity estimation is still an under-researched problem compared to the task of classification. A similar phenomenon has been observed in the framework of WSL, for which the problem of intensity estimation is rarely explored. Since temporal dynamics plays a crucial role in conveying significant information for the task of intensity level estimation, [78] and [77] modeled the relevant ordinal relationship across temporal frames by incorporating intensity and relevance smoothness into the objective function, while [81] and [82] invoked facial symmetry and contrastive appearance difference in addition to the temporal relevance and regression in a deep learning framework, which was found to outperform the former approaches. [88] and [76] formulated the task of AU intensity estimation in the setting of deep multi-instance regression and captured the temporal dynamics of frames using LBP-TOP features, representing the video as the maximum or average of the corresponding instance frames.
With the advent of deep learning architectures and their break-through performance in many applications under real-time uncontrolled conditions, a few approaches have explored deep models for AU intensity estimation in recent days. However, one of the major challenges in using deep models for intensity estimation of facial expressions or AUs is the requirement of a large amount of intensity annotations, which is very expensive and demands strong domain expertise. Therefore, estimation of intensity levels using deep models in the framework of WSL still remains an open problem, though a few approaches have explored the problem in the fully supervised setting [89], [90]. To the best of our knowledge, only two works have exploited deep models for AU intensity estimation, i.e., [81] and [83], and no work has been done for estimating expression intensity levels with deep models.
In order to compare the work of [77] with the conventional approach of pain detection [22], MILBOOST is deployed and the output probabilities of pain detection are normalized between 0 and 5 to have a fair comparison with that of [77]. The performance metrics used for the validation of the state-of-the-art approaches for AU intensity estimation are Mean Absolute Error (MAE), Pearson Correlation Coefficient (PCC) and Intra-class Correlation (ICC). PCC is normally used to measure the linear association between predicted values and ground truth, and ICC is used for correlation among annotators. MAE measures the error between predictions and ground truth, which is typically used for applications pertinent to ordinal regression. The expression intensity values for student engagement level prediction in [76] and [88] are validated with the performance measure of Mean Square Error (MSE).

TABLE 3: Comparative evaluation of performance measures for regression of expressions or action units under various modes of WSL setting on the most widely evaluated data-sets.

Dataset | WSL Problem | Reference | Task | Data Type | Learning Model | Validation | MAE | PCC | ICC
UNBC-McMaster | Inexact | Adria et al (2018) [77] | Pain | Static | DORF | LOSO | 0.710 | 0.360 | 0.340
UNBC-McMaster | Inexact | Praveen et al (2020) [73] | Pain | Dynamic | WSDA | LOSO | 0.714 | 0.630 | 0.567
UNBC-McMaster | Inexact | Zhang et al (2018) [78] | Pain | Static | BORMIR | LOSO | 0.821 | 0.605 | 0.531
UNBC-McMaster | Incomplete | Adria et al (2018) [77] | Pain | Static | DORF | LOSO | 0.510 | 0.460 | 0.460
UNBC-McMaster | Incomplete | Rui et al (2016) [80] | Pain | Static | OSVR | LOSO | 0.951 | 0.544 | 0.495
DISFA | Inexact | Adria et al (2018) [77] | Action Unit | Static | DORF | 5-fold | 1.130 | 0.400 | 0.260
DISFA | Inexact | Rui et al (2016) [80] | Action Unit | Static | OSVR | 5-fold | 1.380 | 0.350 | 0.150
DISFA | Inexact | Zhang et al (2018) [78] | Action Unit | Static | BORMIR | 5-fold | 0.789 | 0.353 | 0.283
DISFA | Inexact | Zhang et al (2019) [79] | Action Unit | Dynamic | CFLF | 3-fold | 0.329 | - | 0.408
DISFA | Incomplete | Adria et al (2018) [77] | Action Unit | Static | DORF | 5-fold | 0.480 | 0.420 | 0.380
DISFA | Incomplete | Rui et al (2016) [80] | Action Unit | Static | OSVR | 5-fold | 0.800 | 0.370 | 0.290
DISFA | Incomplete | Zhang et al (2018) [81] | Action Unit | Dynamic | CNN | 3-fold | 0.330 | - | 0.360
DISFA | Incomplete | Zhang et al (2019) [82] | Action Unit | Dynamic | CNN | 3-fold | 0.330 | - | 0.350
DISFA | Incomplete | Wang et al (2019) [83] | Action Unit | Static | RBM | 3-fold | 0.431 | 0.592 | 0.549
DISFA | Incomplete | Zhang et al (2019) [84] | Action Unit | Dynamic | CNN | 5-fold | 0.910 | 0.370 | 0.350
FERA 2015 | Inexact | Zhang et al (2018) [78] | Action Unit | Static | BORMIR | FERA 2015 challenge [85] | 0.852 | 0.635 | 0.620
FERA 2015 | Inexact | Zhang et al (2019) [79] | Action Unit | Dynamic | CFLF | FERA 2015 challenge [85] | 0.741 | - | 0.661
FERA 2015 | Incomplete | Wang et al (2019) [83] | Action Unit | Static | RBM | FERA 2015 challenge [85] | 0.728 | 0.605 | 0.585
FERA 2015 | Incomplete | Wang et al (2017) [44] | Action Unit | Static | Bayesian Network | FERA 2015 challenge [85] | - | 0.638 | 0.610
FERA 2015 | Incomplete | Rui et al (2016) [80] | Action Unit | Static | OSVR | FERA 2015 challenge [85] | 1.077 | 0.545 | 0.544
FERA 2015 | Incomplete | Zhang et al (2019) [82] | Action Unit | Dynamic | CNN | FERA 2015 challenge [85] | 0.640 | - | 0.670
FERA 2015 | Incomplete | Zhang et al (2018) [81] | Action Unit | Dynamic | CNN | FERA 2015 challenge [85] | 0.660 | - | 0.670
FERA 2015 | Incomplete | Zhang et al (2019) [84] | Action Unit | Dynamic | CNN | FERA 2015 challenge [85] | 0.870 | 0.620 | 0.600
CK+ | Incomplete | Rui et al (2016) [80] | Action Unit | Static | OSVR | 10-fold | 1.981 | 0.729 | 0.716
BU-4DFE | Incomplete | Rui et al (2016) [80] | Action Unit | Static | OSVR | LOSO | 2.242 | 0.545 | 0.503

5 CHALLENGES AND OPPORTUNITIES

In addition to the challenges related to data labeling, there are other challenges for developing a robust FER system. Though they have been discussed extensively in the existing literature [91], [7], we briefly review some of the generic challenges along with the potential research directions pertinent to FER.

5.1 Challenges

In general, the task of facial analysis suffers from a lot of challenges, which are prevalent in most face related applications such as identity recognition, attribute recognition, etc.

5.1.1 Dataset Bias

Though there has been a shift in data capture from laboratory-controlled conditions to in-the-wild uncontrolled environments, the data-sets are generally captured in a specific environment, which may vary across data-sets and thereby results in different data distributions of the various data-sets. Typically, state-of-the-art approaches are evaluated on limited data-sets and show superior performance. However, when deployed on different data-sets, these algorithms may fail to retain their superior performance due to the differences in distribution of the data-sets, often termed data-set bias, which is a prevalent problem in the field of machine learning. In order to address the problem of data-set bias, a few approaches [92] have used multiple data-sets for training by merging the data-sets and evaluated on different data-sets. Even though merging multiple data-sets may yield better generalization, it may suffer from label subjectivity as discussed in 5.1.3. A few more approaches conducted cross-database experiments to validate the generalizability of the algorithm by evaluating it on a data-set different from the training data [31], [33].
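A cross-database evaluation of the kind mentioned above can be scripted as follows; the classifier, features and corpus names are placeholders, and the sketch only illustrates the train-on-one, test-on-another protocol used to expose data-set bias.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.metrics import f1_score

    def cross_database_eval(datasets, make_model=lambda: SVC(kernel="linear")):
        """datasets: dict name -> (features, labels). Trains on one corpus and
        reports macro F1 on every other corpus."""
        results = {}
        for train_name, (x_tr, y_tr) in datasets.items():
            model = make_model()
            model.fit(x_tr, y_tr)
            for test_name, (x_te, y_te) in datasets.items():
                if test_name == train_name:
                    continue
                results[(train_name, test_name)] = f1_score(
                    y_te, model.predict(x_te), average="macro")
        return results

    # toy usage with random features standing in for two hypothetical corpora
    rng = np.random.default_rng(0)
    toy = {name: (rng.normal(size=(50, 16)), rng.integers(0, 3, 50))
           for name in ["corpus_a", "corpus_b"]}
    print(cross_database_eval(toy))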
5.1.2 Data Sparsity and Class Imbalance

Since the visual appearance of the face varies from person to person due to age, civilization, ethnicity, cosmetics, eye-glasses, etc., the detection of facial expressions is a challenging task. In addition to the personal attributes, variations due to pose, occlusion and illumination are prevalent in unconstrained scenarios of facial expressions, which leads to high intra-class variability. Therefore, there is an immense need for large scale data-sets with a wide range of intra-class variation. In most machine-learning based applications, the performance of the system is highly influenced by the quality and quantity of data, which has been reflected in the superior performance of deep learning architectures. Though humans are capable of exhibiting a wide range of facial expressions, most of the existing data-sets are developed based on the basic universal expressions and limited AUs as they are more frequent in our everyday life.
However, these facial expressions or AUs are generally sparse in nature as they are expressed in only a few frames, resulting in huge class imbalance, which is also reflected in the UNBC-McMaster pain database [56]. For instance, eliciting a smile is a frequently occurring expression, whereas expressions such as disgust and anger are less common, which results in limited data for those less frequent expressions. He et al [93] investigated and provided a comprehensive review of state-of-the-art approaches for learning from imbalanced data along with metrics used for evaluating the performance of the systems. Shashank et al [94] used cumulative attributes with a deep learning model as a two-stage cascaded network: in the first stage, the original labels are converted to cumulative attributes and a CNN model is trained to output the cumulative attribute vector; next, a regression layer is used to convert the cumulative attribute vectors to a real-valued output. They also evaluated the system with Euclidean loss and log-loss and found that the latter outperforms the former.
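The cumulative-attribute encoding used in [94] can be illustrated with the following small sketch, under the assumption of integer intensity levels; the learned regression layer of the second stage is replaced here by a simple sum for clarity.

    import numpy as np

    def to_cumulative(labels, num_levels=5):
        """Encode an integer intensity y in {0,...,num_levels} as a binary vector
        a where a[i] = 1 iff y > i (standard cumulative-attribute encoding)."""
        labels = np.asarray(labels)
        thresholds = np.arange(num_levels)                # 0, 1, ..., num_levels-1
        return (labels[:, None] > thresholds[None, :]).astype(float)

    def from_cumulative(cum):
        """Decode by counting the thresholds that are passed; a learned
        regression layer can replace this simple sum."""
        return cum.sum(axis=1)

    y = np.array([0, 2, 5])
    a = to_cumulative(y)            # shape (3, 5)
    assert np.allclose(from_cumulative(a), y)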
5.1.3 Label Subjectivity and Identity Bias

Label subjectivity and identity bias are two major factors induced by the subjective nature of the annotators and the varied response of expressions. Compared to other problems of computer vision, labeling of facial expressions is a highly complex process as it is subjective in nature. Manual annotation of AUs is even more challenging compared to the prototypical categorical expressions due to the increased range of facial behavior. Moreover, the annotation of AUs requires domain expertise certified by the FACS coding system, which is a time consuming and laborious task and thereby highly prone to errors induced by annotators. The process of annotation becomes more complex for intensities of expressions or AUs as the differences between intensity levels are very subtle, which is challenging even for expert annotators. The continuous dimensional model further complicates the process of labeling, especially when annotators are asked to label every frame of video sequences, as a continuous range of values for the intensities is more sensitive than discrete values, which results in differences in the labels for the same intensity of facial expression. Another major factor in obtaining annotations for the continuous dimensional model is the reaction time of annotators.
In order to alleviate the impact of label subjectivity, the data-set is typically labeled by the strategy of crowd-sourcing, where labels are refined from several annotators [70]. In addition to label subjectivity, there can be variations in the facial appearance due to the heterogeneity of subjects, termed identity bias, i.e., ambiguity induced by the subjective nature of humans. For instance, the expression of sadness is often misinterpreted as a neutral expression as the visual appearance of sadness is very close to that of the neutral expression.

5.1.4 Tool for semi-automatic annotation

Due to the above mentioned challenges in Section 5.1.3, manual annotations are extremely challenging and even impossible to obtain for large scale data-sets. Though WSL based approaches reduce the need for exact and complete annotations, the need for minimal annotations for large scale data-sets has motivated many researchers to automate the process of annotation, which can be further refined by expert annotators to minimize the burden on human labor [95]. Since a fully automatic tool for obtaining annotations of expressions or AUs is neither feasible nor reliable, a semi-automatic tool seems to be a more plausible approach to obtain reliable annotations for large scale data-sets, especially for the continuous affect model, which is more vulnerable to noise and where discriminating the subtle variations across frames is a highly complex process.

5.1.5 Efficient Feature Representation

In the field of machine learning, one of the major characteristics of a feature representation is to retain the relevant information on target labels while minimizing the entropy of the features. For the task of FER, the features are expected to be robust to face appearance variations such as pose, occlusion, illumination, blur, etc., while retaining the relevancy of expressions. Sariyanidi et al [8] provided a comprehensive analysis of local as well as global hand-crafted features such as Gabor, SIFT, LBP, etc., revealing their advantages and limitations with respect to various key challenging factors. They further analyzed feature selection approaches to refine the feature representation and showed that fusion based representations outperform individual feature representations. Another promising line of approach is to incorporate the temporal dynamics in the feature representation, which has outperformed the approaches based on static representations. Typically, Three Orthogonal Planes (TOP) features are extended with handcrafted features such as LBP-TOP [96], LGBP [97], etc., to incorporate the dynamic information in the static feature representations.
With the ubiquity of deep learning based approaches for various computer vision problems, there has been a shift from handcrafted features to learned features, where pretrained face verification models are used for fine-tuning with the facial expression or AU data due to the limited data pertinent to facial expressions or AUs. However, as mentioned in [67], the abstract feature representation of higher layers still retains identity-relevant information even after fine-tuning the face verification net with FER data. In order to overcome the problem of limited data and exploit the success of deep learning for FER, [98] explored fusion of handcrafted and learned features for automatic pain intensity estimation and showed superior performance over state-of-the-art approaches. RNN based approaches are found to be promising in capturing the temporal dynamics and have shown robust performance in various computer vision, speech and NLP applications. Dae Kim et al [99] explored an efficient feature representation robust to expression intensity variations by encoding the facial expressions in two stages. First, spatial features are obtained via a CNN using five objective terms to enhance the expression class separability. Second, the obtained spatial features are fed to an LSTM to learn the temporal features. Wang et al [100] studied the contribution of the spatio-temporal relationship among facial muscles for efficient FER. They modeled the facial expression as a complex activity of temporally overlapping facial events, where they proposed an Interval Temporal Bayesian Network to capture the temporal relations of facial events for FER.
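A generic two-stage spatial-temporal encoder of the kind described above (per-frame CNN features followed by an LSTM) can be sketched as follows; this is not the architecture of [99], and the feature dimension, hidden size and number of classes are placeholders.

    import torch
    import torch.nn as nn

    class SpatioTemporalFER(nn.Module):
        """Per-frame spatial features (e.g. from a pretrained CNN) are summarized
        by an LSTM whose last hidden state predicts the expression class."""
        def __init__(self, feat_dim=512, hidden=256, num_classes=7):
            super().__init__()
            self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
            self.classifier = nn.Linear(hidden, num_classes)

        def forward(self, feats):                # feats: (B, T, feat_dim)
            _, (h_n, _) = self.lstm(feats)       # h_n: (1, B, hidden)
            return self.classifier(h_n[-1])      # (B, num_classes) logits

    model = SpatioTemporalFER()
    logits = model(torch.randn(4, 30, 512))      # 4 clips of 30 frames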
5.2 Potential Research Directions

In this section, we present some of the potential research directions for the advancement of facial behavior analysis in the framework of WSL.

5.2.1 Exploiting deep networks in WSL

With the massive success of deep learning architectures, many researchers have leveraged deep models for various applications in computer vision such as object detection, face recognition, etc., and showed significant improvement in performance over the traditional approaches in real world conditions. However, the performance of deep models is not fully explored in the context of facial behavior analysis due to the limited training data and the laborious task of annotation, which demands human expertise. Despite the complex process of obtaining annotations for facial behavior, most of the existing approaches on FER based on deep learning have been focused on the fully supervised setting. Shan et al [11] provided a comprehensive survey on deep learning based approaches for FER in the framework of supervised learning and gave insight on the advantages and limitations of deploying deep models for FER.
To the best of our knowledge, only five works have used deep networks with domain knowledge for AU recognition and only one work on prediction of engagement level in videos. [33], [37] and [34] have explored the domain knowledge of dependencies among AUs and relationships between expressions and AUs using deep networks, where [33] and [37] used RBM for modeling the domain knowledge and [34] used a Recognition Adversarial Network (RAN) to match the distribution of predicted labels with the pseudo AU labels obtained from domain knowledge. Only two works, i.e., [83] and [81], used unlabeled and partially labeled data for AU intensity estimation, where RBM is used for modeling global dependencies among AUs and CNN for modeling the ordinal relevance and regression, respectively. [88] and [101] used deep multi-instance learning for engagement level prediction of sequences, where [88] outperforms [101]. To the best of our knowledge, no work has been done on expression detection in the WSL framework using deep learning models.

5.2.2 Exploiting SpatioTemporal Dynamics for FER in WSL

In most of the existing approaches for FER in WSL, only short-term dynamics across the temporal frames are exploited. [88] and [101] used LBP-TOP features [96] for capturing the temporal dynamics of the sub-sequences, while [24], [22] and [102] used frame aggregation, i.e., the maximum of the feature vectors of the frames is treated as the spatiotemporal feature. [23] explored displacement of facial landmarks with HMM for capturing temporal dynamics and showed that it outperforms simple max-based frame aggregation techniques. [26] captures the temporal order of the templates of the sequence, where temporal dynamics is obtained by appending frame-level features of the sequence, while [81] captures temporal dynamics using contrastive appearance difference, i.e., the difference between apex frame and neutral frame.
In recent days, LSTM was found to achieve superior performance in capturing both short-term and long-term temporal dynamics by exploiting the semantic connection across the frames. Another promising technique based on CNN, i.e., C3D [103], is also gaining much attention in the field of computer vision for modeling the temporal information across frames. Compared to RNN, C3D is efficient in capturing short-term temporal information. Though LSTM and C3D techniques are widely explored for FER in the fully supervised setting [99], [104], they are not yet explored in the framework of WSL for FER. Therefore, LSTM and C3D techniques, when deployed in the WSL setting for FER, are expected to further enhance the performance of existing state-of-the-art approaches.
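For illustration, a drastically reduced 3D-convolutional backbone in the spirit of C3D [103] is sketched below; it is not the original C3D architecture, and the channel sizes and clip resolution are placeholders.

    import torch
    import torch.nn as nn

    class Tiny3DConvNet(nn.Module):
        """Two 3D-conv blocks followed by global pooling and a regressor;
        a heavily reduced stand-in for a C3D-style backbone."""
        def __init__(self, out_dim=1):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool3d((1, 2, 2)),
                nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool3d(1))
            self.head = nn.Linear(64, out_dim)

        def forward(self, clip):                  # clip: (B, 3, T, H, W)
            z = self.features(clip).flatten(1)    # (B, 64)
            return self.head(z)                   # e.g. sequence-level intensity

    model = Tiny3DConvNet()
    out = model(torch.randn(2, 3, 16, 112, 112))  # two 16-frame clips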
5.2.3 Dimensional affect model with inaccurate annotations

The problem of FER in the framework of inaccurate annotations is mostly explored in the context of classification as described in Section 3.3. Though the problem of noisy annotations is more pronounced in case of the dimensional model, as ordinal annotations have more subtle variations across consecutive frames, not much work has been done on the problem of inaccurate annotations in the dimensional model. Due to the complexity of obtaining annotations in the dimensional model and the lack of techniques to handle noisy dimensional annotations, most of the data-sets are developed for the task of classification of facial expressions or AUs, as mentioned in Section 3.4.1. Though a few data-sets have been explored for the task of ordinal regression as described in Section 4.3.1, the development of data-sets for the continuous dimensional model is rarely explored. As far as we know, only two data-sets, i.e., [105] and [69], have been developed for FER in the continuous dimensional model, though a few multi-modal data-sets are available [106], [107].
The ordinal annotations of DISFA [60] are obtained by two FACS certified experts and noisy annotations are reduced by evaluating the correlation between the annotations provided by the two FACS certified experts. For UNBC-McMaster [56], the ordinal annotations are obtained from three FACS coders, which were then reviewed by a fourth FACS coder and validated using the Ekman-Friesen formulae [108]. Similarly, [69] hired 12 expert annotators and the annotations were further reviewed by two independent annotators. [105] obtained annotations from six trained experts, doubly reviewed by two more annotators. Finally, the final labels are considered as the mean of the annotations for each sample. They also conducted statistical analysis to evaluate the inter-annotator correlations.
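The label-aggregation strategy described above (final label as the mean over annotators, with inter-annotator correlations used to assess reliability) can be sketched as follows; this is a generic illustration, not the exact pipeline of [105] or [69].

    import numpy as np
    from itertools import combinations
    from scipy.stats import pearsonr

    def aggregate_annotations(ratings):
        """ratings: (num_annotators, num_frames) continuous labels.
        Returns the mean label per frame and the pairwise Pearson correlations."""
        final = ratings.mean(axis=0)
        corr = {(i, j): pearsonr(ratings[i], ratings[j])[0]
                for i, j in combinations(range(len(ratings)), 2)}
        return final, corr

    ratings = np.array([[0.1, 0.4, 0.9, 0.7],
                        [0.0, 0.5, 0.8, 0.6],
                        [0.2, 0.3, 1.0, 0.8]])
    labels, inter_annotator = aggregate_annotations(ratings)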
5.2.4 WSL for continuous affect model

The continuous dimensional model conveys a wider range of expressions than ordinal regression and plays a crucial role in capturing the subtle changes and context sensitivity of emotions. Compared to the tasks of classification and ordinal regression, WSL based approaches for facial expressions or AUs are hardly explored in the context of the continuous dimensional space, though a few endeavors have been made in the context of fully supervised learning [109], [110]. Hatice et al [5] investigated the potential of the continuous dimensional model and gave insights on existing state-of-the-art approaches and challenges associated with automatic continuous analysis and synthesis of emotional behavior. Due to the intricate and error-prone process of obtaining annotations, there is an imperative need to formulate the problem of the continuous dimensional model in the framework of WSL to handle noisy annotations and alleviate the negative impact of unreliable annotations. Huang et al [111] investigated the impact of annotation delay compensation and other post-processing operations for continuous emotion prediction of multi-modal data.
As far as we know, only one work [112] has been done based on WSL for the prediction of the continuous dimensional model, i.e., valence and arousal, in a multi-modal framework using audio and visual features. They reduced the noise of unreliable labels by introducing a temporal label, which incorporates contextual information by considering the labels within a temporal window for every time step. They further used a robust loss function that ignores small errors between predictions and labels in order to further reduce the impact of noisy labels. Therefore, there is a lot of room for improvement to augment the performance of FER systems in the continuous dimensional model using WSL based approaches.

5.2.5 Domain Adaptation for FER

Domain adaptation is another promising line of research to handle data with limited annotations, although it requires a source domain with accurate annotations as it exploits the knowledge of the source domain for modeling the target domain. It has been widely used for many applications related to facial analysis such as face recognition, facial expression recognition, smile detection, etc.
Wang et al [113] proposed an unsupervised domain adaptation approach for small target data-sets using GANs, where GAN-generated samples are used to fine-tune the model pretrained on the source data-set. Zhu et al [114] explored an unsupervised domain adaptation approach in the feature space, where the mismatch between the feature distributions of the source and target domains is minimized while retaining the discrimination among the face images related to facial expressions. Shao et al [115] exploited a collection of constrained images from the source domain with both AU and landmark labels, and an unpaired collection of unconstrained wild images from the target domain with only landmark labels. The rich features of source and target images are disentangled into shape and texture features, and the shape features (AU label information) of source images are fused with the texture information of the target features. Therefore, the performance of FER can be enhanced using domain adaptation along with deep learning for handling data with limited annotations while still harnessing the potential of deep networks.

5.2.6 WSL based Localization of AU patches

Attention mechanisms have been gaining attention for capturing the most relevant features and have achieved great success for various computer vision problems such as object detection, semantic segmentation, etc. [116] proposed a deep learning based framework for FER using aggregation of local and global features, where local features capture the expression-relevant details and global features model the high-level semantic information of the expression.
In recent days, a few approaches have been proposed in FER for localizing the AU regions based on facial landmarks [117], [118], [119]. However, these approaches rely on prior knowledge of predefined AU attentions, which restricts their capacity to predefined AUs and fails to capture the wide range of non-rigid AUs. As far as we know, only two approaches [120], [121] have addressed the problem of AU localization in the WSL framework, which is discussed in Section 3.1. [120] integrated the relations among AUs with an attention mechanism in an end-to-end deep learning framework to capture more accurate attentions, while [121] used BoVW to capture the pixel-wise attention for emotion detection using the MIL framework. Due to the immense potential of expression-relevant local features and the tedious task of predefining attentions, there is an imperative need to formulate the problem of capturing AU attentions in the WSL framework.
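A minimal sketch of attention pooling over local patch features, illustrating how expression-relevant regions can be weighted without predefined AU locations, is given below; it is not the model of [120] or [121], and the patch count and feature dimension are placeholders.

    import torch
    import torch.nn as nn

    class PatchAttentionPooling(nn.Module):
        """Scores each local patch feature and returns their attention-weighted sum,
        so the most expression-relevant patches dominate the representation."""
        def __init__(self, feat_dim=256):
            super().__init__()
            self.score = nn.Linear(feat_dim, 1)

        def forward(self, patches):                               # patches: (B, P, feat_dim)
            weights = torch.softmax(self.score(patches), dim=1)   # (B, P, 1)
            pooled = (weights * patches).sum(dim=1)               # (B, feat_dim)
            return pooled, weights.squeeze(-1)

    pool = PatchAttentionPooling()
    pooled, attn = pool(torch.randn(2, 49, 256))                  # 49 patches per face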
5.2.7 WSL for Multimodal Affective Modeling

Humans exhibit emotions through a diverse range of modalities such as facial expressions, vocal expressions, physiological signals, etc. Multimodal analysis has drawn much attention over the past few years as it enhances the overall performance of the system over isolated mono-modal approaches. The most effective way of using a multimodal framework is to use different modalities such as face, speech, ECG, etc., in a complementary fashion to provide a comprehensive feature representation, resulting in higher accuracy of the system. Inspired by the performance of multimodal approaches, several multimodal datasets have been developed to advance systems that handle the problems of real-world challenging scenarios [106], [122]. Recently, deep learning architectures are found to outperform the state-of-the-art techniques by capturing the complex non-linear interactions in multimodal data. Philip et al [123] provided an exhaustive review on the role of deep architectures for affect recognition using audio, visual and physiological signals.
Of all the modalities through which emotions can be expressed, facial images and vocal expressions play a crucial role in conveying the emotions. In order to foster the progress in multimodal emotion recognition, several audio-visual challenges such as AVEC 2014 - AVEC 2017 [124], [125], [126], etc., have been conducted. Tzirakis et al [127] extracted visual and audio features using a deep residual network of 50 layers (ResNet-50) [128] and CNN models, respectively, and deployed LSTM architectures for handling the outliers for better classification. However, most of these works have focused on the setting of supervised learning. As far as we know, only one work [112] has been done on a WSL based approach for multimodal affect recognition using audio and visual features.

5.2.8 WSL for FER with Infrared and Thermal Images

Infrared and thermal images are found to be efficient in capturing texture in images even under low illumination conditions, which has achieved success in applications such as image dehazing, low light imaging, etc. Inspired by the invariance of thermal images to illumination, a few approaches have been proposed to exploit thermal images with RGB images in a complementary fashion to augment the performance of FER systems. Wang et al [129] explored thermal images for better feature representation by extracting features from visible and thermal images using two deep networks, which are further trained with two SVM models for expression recognition. The whole architecture is jointly refined using a similarity constraint on the mapping of thermal and visible representations to expressions. Bowen et al [130] further enhanced the performance of the approach by introducing a discriminator module to differentiate visible and thermal representations and by enforcing the similarity between the mapping functions of visible and thermal representations to expression labels through adversarial learning. Though fusion of thermal images with RGB images was expected to be a promising line of research for FER, it still remains an under-researched problem.

5.2.9 WSL for FER with 3D and Depth Images

Despite the advancement of FER systems based on 2D images, pose variance still remains a challenging problem. To overcome the problem of pose variance and occlusion, 3D data has been explored to obtain the comprehensive information displayed by the face and capture the subtle changes of facial AUs in detail using the depth of the facial surface. For instance, AU18 (lip pucker) is hard to differentiate from AU10+AU17+AU24 in a 2D frontal view. Sandach et al [131] provided a comprehensive survey on the existing data-sets and FER systems pertinent to 3D or 4D data. [132] extracted six types of 2D facial attributes from textured 3D face scans and jointly fed them to the feature extraction and feature fusion subnets to learn the optimal 2D and 3D facial representations. The proposed approach is further enhanced by extracting deep features from different facial parts of texture and depth images, fused together with a feedback mechanism [133]. Chen et al [134] restored 3D facial models from 2D images and proposed a novel random forest based algorithm to simultaneously estimate 3D facial tracking and continuous emotion intensities. Although a few approaches have exploited 3D data in the fully supervised setting, we believe that 3D or depth images have not been explored for FER in the WSL framework.

6 CONCLUSION

In this paper, we have introduced various categories of weakly supervised learning approaches and provided a taxonomy of approaches for facial behavior analysis based on the various modes of annotation. A comprehensive review of state-of-the-art approaches pertinent to WSL is provided along with a comparative evaluation of the results. We have further provided insights on the limitations of the existing approaches and the challenges associated with them. Finally, we have presented potential research directions based on our analysis for the future development of facial behavior analysis in the framework of WSL.

REFERENCES

[1] A. Mehrabian, Nonverbal Communication. Routledge, 09 2017, p. 235.
[2] P. Ekman and W. V. Friesen, Pictures of Facial Affect. Consulting Psychologists Press, 1976.
[3] R. Breuer and R. Kimmel, "A deep learning perspective on the origin of facial expressions," CoRR, vol. abs/1705.01842, 2017.
[4] P. Ekman, "Facial action coding system (FACS)," A Human Face, 2002.
[5] H. Gunes and B. Schuller, "Categorical and dimensional affect analysis in continuous input: Current trends and future directions," Image Vision Comput., vol. 31, no. 2, pp. 120–136, Feb. 2013.
[6] P. Ekman and W. Friesen, Facial Action Coding System: A Technique for the Measurement of Facial Movement. Consulting Psychologists Press, 1978.
[7] B. Martinez and M. F. Valstar, "Advances, challenges, and opportunities in automatic facial expression recognition," Advances in Face Detection and Facial Image Analysis, pp. 63–100, 2016.
[8] E. Sariyanidi, H. Gunes, and A. Cavallaro, "Automatic analysis of facial affect: A survey of registration, representation, and recognition," IEEE Tran. on Pattern Analysis and Machine Intelligence, vol. 37, no. 6, pp. 1113–1133, June 2015.
[9] B. Fasel and J. Luettin, "Automatic facial expression analysis: a survey," Pattern Recognition, vol. 36, no. 1, pp. 259–275, 2003.
[10] A. Samal and P. A. Iyengar, "Automatic recognition and analysis of human faces and facial expressions: a survey," Pattern Recognition, vol. 25, no. 1, pp. 65–77, 1992.
[11] S. Li and W. Deng, "Deep facial expression recognition: A survey," IEEE Tran. on Affective Computing, pp. 1–1, 2020.
[12] Z.-H. Zhou, "A brief introduction to weakly supervised learning," National Science Review, vol. 5, no. 1, pp. 44–53, 2018.
[13] J. Foulds and E. Frank, "A review of multi-instance learning assumptions," The Knowledge Engineering Review, vol. 25, no. 1, pp. 1–25, 2010.
[14] Y. Chen and J. Z. Wang, "Image categorization by learning and reasoning with regions," J. Mach. Learn. Res., vol. 5, pp. 913–939, Dec. 2004.
[15] B. Settles, M. Craven, and S. Ray, "Multiple-instance active learning," in Proceedings of the 20th International Conference on Neural Information Processing Systems, ser. NIPS'07. Red Hook, NY, USA: Curran Associates Inc., 2007, pp. 1289–1296.
[16] P. A. Viola, J. C. Platt, and C. Zhang, "Multiple instance boosting for object detection," in Proc. of NIPS, 2006, vol. 18, pp. 1417–1424.
[17] M. A. Carbonneau, V. Cheplygina, E. Granger, and G. Gagnon, "Multiple instance learning: A survey of problem characteristics and applications," Pattern Recognition, vol. 77, pp. 329–353, 2018.
[18] S. Ray and D. Page, "Multiple instance regression," in Proc. of the 18th ICML, 2001.
[19] O. Chapelle, B. Schölkopf, and A. Zien, Semi-Supervised Learning, 1st ed. The MIT Press, 2010.
[20] F. Muhlenbach, S. Lallich, and D. A. Zighed, "Identifying and handling mislabelled instances," J. Intell. Inf. Syst., vol. 22, no. 1, pp. 89–109, Jan. 2004.

[21] D. C. Brabham, “Crowdsourcing as a model for problem solving: [45] G. Peng and S. Wang, “Dual semi-supervised learning for facial An introduction and cases,” Convergence, vol. 14, no. 1, pp. 75–90, action unit recognition,” in Proc. of the AAAI, vol. 33, 2019, pp. 2008. 8827–8834. [22] K. Sikka, A. Dhall, and M. S. Bartlett, “Classification and weakly [46] S. Wu, S. Wang, B. Pan, and Q. Ji, “Deep facial action unit supervised pain localization using multiple segment representa- recognition from partially labeled data,” in Proc. of IEEE ICCV, tion,” Image and Vision Computing, vol. 32, no. 10, pp. 659 – 670, 2017, pp. 3971–3979. 2014. [47] X. Niu, H. Han, S. Shan, and X. Chen, “Multi-label co- [23] C. Wu, S. Wang, and Q. Ji, “Multi-instance hidden markov model regularization for semi-supervised facial action unit recognition,” for facial expression recognition,” in Proc. of 11th IEEE FG, vol. 1, in NIPS, 2019, vol. 32, pp. 909–919. 2015, pp. 1–6. [48] A. Mollahosseini, B. Hassani, M. J. Salvador, H. Abdollahi, [24] Z. Chen, R. Ansari, and D. Wilkie, “Learning pain from action D. Chan, and M. H. Mahoor, “Facial expression recognition from unit combinations: A weakly supervised approach via multiple world wild web,” in Proc. of IEEE CVPRW, 2016, pp. 1509–1516. instance learning,” IEEE Tran. on Affective Computing, pp. 1–1, [49] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classi- 2019. fication with deep convolutional neural networks,” in Proc. of the [25] A. Ruiz, J. Van de Weijer, and X. Binefa, “Regularized multi- 25th NIPS, vol. 1, 2012, pp. 1097–1105. concept mil for weakly-supervised facial behavior categoriza- [50] A. Mollahosseini, D. Chan, and M. H. Mahoor, “Going deeper tion,” in Proc. of the BMVC, 2014. in facial expression recognition using deep neural networks,” in [26] K. Sikka, G. Sharma, and M. Bartlett, “Lomo: Latent ordinal Proc. of IEEE WACV, March 2016, pp. 1–10. model for facial analysis in videos,” in Proc. of IEEE CVPR, June [51] Tong Xiao, Tian Xia, Yi Yang, Chang Huang, and Xiaogang 2016, pp. 5580–5589. Wang, “Learning from massive noisy labeled data for image [27] M. X. Huang, G. Ngai, K. A. Hua, S. C. F. Chan, and H. V. classification,” in Proc. of IEEE CVPR, June 2015, pp. 2691–2699. Leong, “Identifying user-specific facial affects from spontaneous [52] E. Barsoum, C. Zhang, C. C. Ferrer, and Z. Zhang, “Training deep expressions with minimal annotation,” IEEE Tran. on Affective networks for facial expression recognition with crowd-sourced Computing, vol. 7, no. 4, pp. 360–373, Oct 2016. label distribution,” in Proc. of 18th ACM ICMI, 2016, pp. 279–283. [28] L. Xu and P. Mordohai, “Automatic facial expression recognition [53] J. Zeng, S. Shan, and X. Chen, “Facial expression recognition with using bags of motion words,” in Proc. of BMVC, 2010, pp. 13.1– inconsistently annotated datasets,” in Proc. of ECCV, 2018, pp. 13.13. 227–243. [29] Y. Wang, J. Ma, B. Hao, P. Hu, X. Wang, J. Mei, and S. Li, [54] K. Zhao, W.-S. Chu, and A. M. Martinez, “Learning facial action “Automatic depression detection via facial expressions using units from web images with scalable weakly supervised cluster- multiple instance learning,” in Proc. of IEEE 17th Int. Symp. on ing,” in Proc. of IEEE Conf. on CVPR, June 2018. Biomedical Imaging (ISBI), 2020, pp. 1933–1936. [55] C. F. Benitez-Quiroz, Y. Wang, and A. M. Martinez, “Recognition [30] M. Lewis, J. Haviland-Jones, and L. Barrett, Handbook of Emotions. 
of action units in the wild with deep nets and a new global-local Guilford press, 2010, ch. 13. loss,” in Proc. of IEEE ICCV, Oct 2017, pp. 3990–3999. [31] A. Ruiz, J. V. d. Weijer, and X. Binefa, “From emotions to action units with hidden and semi-hidden-task learning,” in Proc. of [56] P. Lucey, J. F. Cohn, K. M. Prkachin, P. E. Solomon, and IEEE ICCV, Dec 2015, pp. 3703–3711. I. Matthews, “Painful data: The unbc-mcmaster shoulder pain expression archive database,” in Proc. of IEEE FG, 2011, pp. 57– [32] P. Gosselin, G. Kirouac, and F. Dore, “Virtual adversarial training: 64. A regularization method for supervised and semi-supervised learning,” Journal of personality and social psychology, pp. 1–1, 1995. [57] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and [33] S. Wang, G. Peng, and Q. Ji, “Exploring domain knowledge I. Matthews, “The extended cohn-kanade dataset (ck+): A com- for facial expression-assisted action unit activation recognition,” plete dataset for action unit and emotion-specified expression,” IEEE Conf. on CVPRW IEEE Tran. on Affective Computing, pp. 1–1, 2018. in , June 2010, pp. 94–101. [34] G. Peng and S. Wang, “Weakly supervised facial action unit [58] M. Pantic, M. Valstar, R. Rademaker, and L. Maat, “Web-based recognition through adversarial training,” in Proc. of IEEE CVPR, database for facial expression analysis,” in Proc. of IEEE Int. Conf. June 2018, pp. 2188–2196. on Multimedia and Expo, July 2005, p. 5. [35] S. Wang, G. Peng, S. Chen, and Q. Ji, “Weakly supervised facial [59] M. F. Valstar and M. Pantic, “Induced disgust, happiness and action unit recognition with domain knowledge,” IEEE Tran. on surprise: an addition to the mmi facial expression database,” Cybernetics, vol. 48, no. 11, pp. 3265–3276, Nov 2018. in Proceedings of Int’l Conf. Language Resources and Evaluation, [36] Y. Zhang, W. Dong, B. Hu, and Q. Ji, “Classifier learning with Workshop on EMOTION, Malta, May 2010, pp. 65–70. prior probabilities for facial action unit recognition,” in Proc. of [60] S. M. Mavadati, M. H. Mahoor, K. Bartlett, P. Trinh, and J. F. IEEE CVPR, June 2018, pp. 5108–5116. Cohn, “Disfa: A spontaneous facial action intensity database,” [37] S. Wang and G. Peng, “Weakly supervised dual learning for facial IEEE Tran. on Affective Computing, vol. 4, no. 2, pp. 151–160, April action unit recognition,” IEEE Tran. on Multimedia, pp. 1–1, 2019. 2013. [38] I. Cohen, N. Sebe, F. G. Cozman, and T. S. Huang, “Semi- [61] X. Zhang, L. Yin, J. F. Cohn, S. Canavan, M. Reale, A. Horowitz, supervised learning for facial expression recognition,” in Proc. P. Liu, and J. M. Girard, “Bp4d-spontaneous: a high-resolution of 5th ACM SIGMM Int. Workshop on Multimedia Information spontaneous 3d dynamic facial expression database,” Image and Retrieval, 2003, pp. 17–22. Vision Computing, vol. 32, no. 10, pp. 692 – 706, 2014. [39] S. L. Happy, A. Dantcheva, and F. Bremond, “A weakly su- [62] M. F. Moller, “A scaled conjugate gradient algorithm for fast pervised learning technique for classifying facial expressions,” supervised learning,” Neural Networks, vol. 6, no. 4, pp. 525 – Pattern Recognition Letters, vol. 128, pp. 162 – 168, 2019. 533, 1993. [40] Y. Song, D. McDuff, D. Vasisht, and A. Kapoor, “Exploiting [63] K. Sikka, “Facial expression analysis for estimating pain sparsity and co-occurrence structure for action unit recognition,” in clinical settings,” in Proceedings of the 16th International in Proc. of 11th IEEE FG, vol. 1, May 2015, pp. 1–8. 
Conference on Multimodal Interaction, ser. ICMI ’14. New [41] B. Wu, S. Lyu, B.-G. Hu, and Q. Ji, “Multi-label learning with York, NY, USA: ACM, 2014, pp. 349–353. [Online]. Available: missing labels for image annotation and facial action unit recog- http://doi.acm.org/10.1145/2663204.2666282 nition,” Pattern Recognition, vol. 48, no. 7, pp. 2279 – 2289, 2015. [64] F. Wang, X. Xiang, C. Liu, T. D. Tran, A. Reiter, G. D. Hager, [42] Y. Li, B. Wu, B. Ghanem, Y. Zhao, H. Yao, and Q. Ji, “Facial H. Quon, J. Cheng, and A. L. Yuille, “Regularizing face verifica- action unit recognition under incomplete data based on multi- tion nets for pain intensity regression,” in Proc. of IEEE ICIP, Sept label learning with missing labels,” Pattern Recognition, vol. 60, 2017, pp. 1087–1091. pp. 890 – 900, 2016. [65] O. M. Parkhi, A. Vedaldi, and A. Zisserman, “Deep face recogni- [43] Y. Li, B. Wu, Y. Zhao, H. Yao, and Q. Ji, “Handling missing labels tion,” in British Machine Vision Conference, 2015. and class imbalance challenges simultaneously for facial action [66] H. Kaya, F. Grpnar, and A. A. Salah, “Video-based emotion unit recognition,” Multimedia Tools and Applications, vol. 78, pp. recognition in the wild using deep and score 20 309–20 332, 2019. fusion,” Image Vision Comput., vol. 65, no. C, pp. 66–75, Sep. 2017. [44] S. Wang, Q. Gan, and Q. Ji, “Expression-assisted facial action unit [67] H. Ding, S. K. Zhou, and R. Chellappa, “Facenet2expnet: Regu- recognition under incomplete au annotation,” Pattern Recognition, larizing a deep face recognition net for expression recognition,” vol. 61, pp. 78 – 91, 2017. in Proc. of 12th IEEE FG, May 2017, pp. 118–126. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 19

[68] C. F. Benitez-Quiroz, R. Srinivasan, Q. Feng, Y. Wang, and A. M. Martínez, "Emotionet challenge: Recognition of facial expressions of emotion in the wild," CoRR, vol. abs/1703.01210, 2017.
[69] A. Mollahosseini, B. Hasani, and M. H. Mahoor, "Affectnet: A database for facial expression, valence, and arousal computing in the wild," IEEE Tran. on Affective Computing, vol. 10, no. 1, pp. 18–31, Jan 2019.
[70] S. Li, W. Deng, and J. Du, "Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild," in Proc. of IEEE CVPR, July 2017, pp. 2584–2593.
[71] M. Tavakolian and A. Hadid, "Deep spatiotemporal representation of the face for automatic pain intensity estimation," in Proc. of 24th ICPR, 2018, pp. 350–354.
[72] A. Ruiz, O. Rudovic, X. Binefa, and M. Pantic, "Multi-instance dynamic ordinal random fields for weakly-supervised pain intensity estimation," in Proc. of ACCV, S.-H. Lai, V. Lepetit, K. Nishino, and Y. Sato, Eds., 2016, pp. 171–186.
[73] R. G. Praveen, E. Granger, and P. Cardinal, "Deep weakly supervised domain adaptation for pain localization in videos," in Proc. of 15th IEEE FG, 2020, pp. 883–890.
[74] J. Carreira and A. Zisserman, "Quo vadis, action recognition? a new model and the kinetics dataset," in Proc. of IEEE CVPR, 2017, pp. 4724–4733.
[75] Y. Ganin and V. Lempitsky, "Unsupervised domain adaptation by backpropagation," in Proc. of the 32nd ICML, vol. 37, 2015, pp. 1180–1189.
[76] J. Yang, K. Wang, X. Peng, and Y. Qiao, "Deep recurrent multi-instance learning with spatio-temporal features for engagement intensity prediction," in Proc. of the 20th ACM ICMI, 2018, pp. 594–598.
[77] A. Ruiz, O. Rudovic, X. Binefa, and M. Pantic, "Multi-instance dynamic ordinal random fields for weakly supervised facial behavior analysis," IEEE Tran. on Image Processing, vol. 27, no. 8, pp. 3969–3982, Aug 2018.
[78] Y. Zhang, R. Zhao, W. Dong, B. Hu, and Q. Ji, "Bilateral ordinal relevance multi-instance regression for facial action unit intensity estimation," in Proc. of IEEE Conf. on CVPR, 2018, pp. 7034–7043.
[79] Y. Zhang, H. Jiang, B. Wu, Y. Fan, and Q. Ji, "Context-aware feature and label fusion for facial action unit intensity estimation with partially labeled data," in Proc. of IEEE ICCV, 2019, pp. 733–742.
[80] R. Zhao, Q. Gan, S. Wang, and Q. Ji, "Facial expression intensity estimation using ordinal information," in Proc. of IEEE Conf. on CVPR, June 2016, pp. 3466–3474.
[81] Y. Zhang, W. Dong, B.-G. Hu, and Q. Ji, "Weakly-supervised deep convolutional neural network learning for facial action unit intensity estimation," in Proc. of IEEE Conf. on CVPR, June 2018.
[82] Y. Zhang, Y. Fan, W. Dong, B. Hu, and Q. Ji, "Semi-supervised deep neural network for joint intensity estimation of multiple facial action units," IEEE Access, vol. 7, pp. 150743–150756, 2019.
[83] S. Wang, B. Pan, S. Wu, and Q. Ji, "Deep facial action unit recognition and intensity estimation from partially labelled data," IEEE Tran. on Affective Computing, pp. 1–1, 2019.
[84] Y. Zhang, B. Wu, W. Dong, Z. Li, W. Liu, B. Hu, and Q. Ji, "Joint representation and estimator learning for facial action unit intensity estimation," in Proc. of IEEE Conf. on CVPR, 2019, pp. 3452–3461.
[85] M. F. Valstar, T. Almaev, J. M. Girard, G. McKeown, M. Mehu, L. Yin, M. Pantic, and J. F. Cohn, "Fera 2015 - second facial expression recognition and analysis challenge," in Proc. of 11th IEEE Int. Conf. on FG, vol. 06, May 2015, pp. 1–8.
[86] G. McKeown, M. Valstar, R. Cowie, M. Pantic, and M. Schroder, "The semaine database: Annotated multimodal records of emotionally colored conversations between a person and a limited agent," IEEE Tran. on Affective Computing, vol. 3, no. 1, pp. 5–17, Jan 2012.
[87] L. Yin, X. Chen, Y. Sun, T. Worm, and M. Reale, "A high-resolution 3d dynamic facial expression database," in Proc. of 8th IEEE Int. Conf. on FG, Sep. 2008, pp. 1–6.
[88] A. Kaur, A. Mustafa, L. Mehta, and A. Dhall, "Prediction and localization of student engagement in the wild," in Digital Image Computing: Techniques and Applications (DICTA), Dec 2018, pp. 1–8.
[89] A. Gudi, H. E. Tasli, T. M. den Uyl, and A. Maroulis, "Deep learning based facs action unit occurrence and intensity estimation," in Proc. of 11th IEEE Int. Conf. on FG, vol. 06, May 2015, pp. 1–5.
[90] R. Walecki, O. Rudovic, V. Pavlovic, B. Schuller, and M. Pantic, "Deep structured learning for facial action unit intensity estimation," in Proc. of IEEE Conf. on CVPR, July 2017, pp. 5709–5718.
[91] T. Gehrig and H. K. Ekenel, "Why is facial expression analysis in the wild challenging?" in Proc. of EmotiW Challenge and Workshop, 2013, pp. 9–16.
[92] C. F. Benitez-Quiroz, R. Srinivasan, and A. M. Martinez, "Emotionet: An accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild," in Proc. of IEEE Conf. on CVPR, June 2016, pp. 5562–5570.
[93] H. He and E. A. Garcia, "Learning from imbalanced data," IEEE Tran. on Knowledge and Data Engineering, vol. 21, no. 9, pp. 1263–1284, Sep. 2009.
[94] S. Jaiswal, J. Egede, and M. Valstar, "Deep learned cumulative attribute regression," in Proc. of 13th IEEE FG, 2018, pp. 715–722.
[95] A. Dhall, R. Goecke, S. Lucey, and T. Gedeon, "Collecting large, richly annotated facial-expression databases from movies," IEEE MultiMedia, vol. 19, no. 3, pp. 34–41, July 2012.
[96] G. Zhao and M. Pietikainen, "Dynamic texture recognition using local binary patterns with an application to facial expressions," IEEE Tran. on Pattern Analysis and Machine Intelligence, vol. 29, no. 6, pp. 915–928, June 2007.
[97] T. R. Almaev and M. F. Valstar, "Local gabor binary patterns from three orthogonal planes for automatic facial expression recognition," in 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction, Sep. 2013, pp. 356–361.
[98] J. Egede, M. Valstar, and B. Martinez, "Fusing deep learned and hand-crafted features of appearance, shape, and dynamics for automatic pain estimation," in Proc. of 12th IEEE Int. Conf. on FG, 2017, pp. 689–696.
[99] D. H. Kim, W. Baddar, J. Jang, and Y. M. Ro, "Multi-objective based spatio-temporal feature representation learning robust to expression intensity variations for facial expression recognition," IEEE Tran. on Affective Computing, pp. 1–1, 2017.
[100] Z. Wang, S. Wang, and Q. Ji, "Capturing complex spatio-temporal relations among facial muscles for facial expression recognition," in Proc. of IEEE Conf. on CVPR, June 2013, pp. 3422–3429.
[101] A. Dhall, A. Kaur, R. Goecke, and T. Gedeon, "Emotiw 2018: Audio-video, student engagement and group-level affect prediction," in Proc. of 20th ACM ICMI, 2018, pp. 653–656.
[102] L. Xie, D. Tao, and H. Wei, "Early expression detection via online multi-instance learning with nonlinear extension," IEEE Tran. on Neural Networks and Learning Systems, vol. 30, no. 5, pp. 1486–1496, May 2019.
[103] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3d convolutional networks," in Proc. of the IEEE ICCV, 2015, pp. 4489–4497.
[104] B. Hasani and M. H. Mahoor, "Facial expression recognition using enhanced deep 3d convolutional neural networks," in Proc. of IEEE Conf. on CVPRW, July 2017, pp. 2278–2288.
[105] D. Kollias, P. Tzirakis, M. A. Nicolaou, A. Papaioannou, G. Zhao, B. Schuller, I. Kotsia, and S. Zafeiriou, "Deep affect prediction in-the-wild: Aff-wild database and challenge, deep architectures, and beyond," International Journal of Computer Vision, Feb 2019.
[106] F. Ringeval, A. Sonderegger, J. Sauer, and D. Lalanne, "Introducing the recola multimodal corpus of remote collaborative and affective interactions," in Proc. of 10th IEEE FG, 2013, pp. 1–8.
[107] M. S. H. Aung, S. Kaltwang, B. Romera-Paredes, B. Martinez, A. Singh, M. Cella, M. Valstar, H. Meng, A. Kemp, M. Shafizadeh, A. C. Elkins, N. Kanakam, A. de Rothschild, N. Tyler, P. J. Watson, A. C. d. C. Williams, M. Pantic, and N. Bianchi-Berthouze, "The automatic detection of chronic pain-related expression: Requirements, challenges and the multimodal emopain dataset," IEEE Tran. on Affective Computing, vol. 7, no. 4, pp. 435–451, Oct 2016.
[108] P. Ekman, W. Friesen, and J. Hager, "Facial action coding system: Research nexus," Network Research Information, vol. 1, 2002.
[109] F. Zhou, S. Kong, C. C. Fowlkes, T. Chen, and B. Lei, "Fine-grained facial expression analysis using dimensional emotion model," Neurocomputing, vol. 392, pp. 38–49, 2020.
[110] D. Kollias and S. Zafeiriou, "A multi-component CNN-RNN approach for dimensional emotion recognition in-the-wild," CoRR, vol. abs/1805.01452, 2018.
[111] Z. Huang, T. Dang, N. Cummins, B. Stasak, P. Le, V. Sethu, and J. Epps, "An investigation of annotation delay compensation and output-associative fusion for multimodal continuous emotion prediction," in Proc. of the 5th Int. Workshop on Audio/Visual Emotion Challenge, 2015, pp. 41–48.

[112] E. Pei, D. Jiang, M. Alioscha-Perez, and H. Sahli, "Continuous affect recognition with weakly supervised learning," Multimedia Tools and Applications, vol. 78, pp. 19387–19412, Feb 2019.
[113] X. Wang, X. Wang, and Y. Ni, "Unsupervised domain adaptation for facial expression recognition using generative adversarial networks," Computational Intelligence and Neuroscience, 2018.
[114] R. Zhu, G. Sang, and Q. Zhao, "Discriminative feature adaptation for cross-domain facial expression recognition," in Proc. of Int. Conf. on Biometrics, June 2016, pp. 1–7.
[115] Z. Shao, J. Cai, T. Cham, X. Lu, and L. Ma, "Weakly-supervised unconstrained action unit detection via feature disentanglement," CoRR, vol. abs/1903.10143, 2019.
[116] S. Xie and H. Hu, "Facial expression recognition using hierarchical features with deep comprehensive multipatches aggregation convolutional neural networks," IEEE Tran. on Multimedia, vol. 21, no. 1, pp. 211–220, Jan 2019.
[117] W. Li, F. Abtahi, Z. Zhu, and L. Yin, "Eac-net: Deep nets with enhancing and cropping for facial action unit detection," IEEE Tran. on Pattern Analysis and Machine Intelligence, vol. 40, no. 11, pp. 2583–2596, Nov 2018.
[118] W. Li, F. Abtahi, and Z. Zhu, "Action unit detection with region adaptation, multi-labeling learning and optimal temporal fusing," in Proc. of IEEE Conf. on CVPR, July 2017, pp. 6766–6775.
[119] Z. Shao, Z. Liu, J. Cai, and L. Ma, "Deep adaptive attention for joint facial action unit detection and face alignment," in Proc. of ECCV, September 2018.
[120] Z. Shao, Z. Liu, J. Cai, Y. Wu, and L. Ma, "Facial action unit detection using attention and relation learning," IEEE Tran. on Affective Computing, pp. 1–1, 2019.
[121] H. Liu, M. Xu, J. Wang, T. Rao, and I. Burnett, "Improving visual saliency computing with emotion intensity," IEEE Tran. on Neural Networks and Learning Systems, vol. 27, no. 6, pp. 1201–1213, June 2016.
[122] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, "Iemocap: interactive emotional dyadic motion capture database," Language Resources and Evaluation, vol. 42, no. 4, p. 335, Nov 2008.
[123] P. V. Rouast, M. Adam, and R. Chiong, "Deep learning for human affect recognition: Insights and new developments," IEEE Tran. on Affective Computing, pp. 1–1, 2019.
[124] F. Ringeval, B. Schuller, M. Valstar, S. Jaiswal, E. Marchi, D. Lalanne, R. Cowie, and M. Pantic, "Av+ec 2015: The first affect recognition challenge bridging across audio, video, and physiological data," in Proc. of 5th Int. Workshop on Audio/Visual Emotion Challenge, 2015, pp. 3–8.
[125] F. Ringeval, B. Schuller, M. Valstar, R. Cowie, and M. Pantic, "Avec 2015: The 5th international audio/visual emotion challenge and workshop," in Proc. of 23rd ACMM, 2015, pp. 1335–1336.
[126] F. Ringeval, M. Pantic, B. Schuller, M. Valstar, J. Gratch, R. Cowie, S. Scherer, S. Mozgai, N. Cummins, and M. Schmitt, "Avec 2017: Real-life depression, and affect recognition workshop and challenge," in Proc. of 7th Annual Workshop on Audio/Visual Emotion Challenge, Oct 2017, pp. 3–9.
[127] P. Tzirakis, G. Trigeorgis, M. A. Nicolaou, B. W. Schuller, and S. Zafeiriou, "End-to-end multimodal emotion recognition using deep neural networks," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1301–1309, Dec 2017.
[128] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, "Inception-v4, inception-resnet and the impact of residual connections on learning," in Proc. of 31st AAAI, 2017, pp. 4278–4284.
[129] S. Wang, B. Pan, H. Chen, and Q. Ji, "Thermal augmented expression recognition," IEEE Tran. on Cybernetics, vol. 48, no. 7, pp. 2203–2214, July 2018.
[130] B. Pan and S. Wang, "Facial expression recognition enhanced by thermal images through adversarial learning," in Proc. of 26th ACMM, 2018, pp. 1346–1353.
[131] G. Sandbach, S. Zafeiriou, M. Pantic, and L. Yin, "Static and dynamic 3d facial expression recognition: A comprehensive survey," Image and Vision Computing, vol. 30, no. 10, pp. 683–697, 2012.
[132] H. Li, J. Sun, Z. Xu, and L. Chen, "Multimodal 2d+3d facial expression recognition with deep fusion convolutional neural network," IEEE Tran. on Multimedia, vol. 19, no. 12, pp. 2816–2831, Dec 2017.
[133] A. Jan, H. Ding, H. Meng, L. Chen, and H. Li, "Accurate facial parts localization and deep learning for 3d facial expression recognition," in Proc. of 13th IEEE FG, 2018, pp. 466–472.
[134] H. Chen, J. Li, F. Zhang, Y. Li, and H. Wang, "3d model-based continuous emotion recognition," in Proc. of IEEE Conf. on CVPR, 2015, pp. 1836–1845.