
Automated Pain Detection from Facial Expressions using FACS: A Review

Zhanli Chen, Student Member, IEEE, Rashid Ansari, Fellow, IEEE, and Diana J. Wilkie, Fellow, AAN

Abstract—Facial pain expression is an important modality for assessing pain, especially when the patient’s verbal ability to communicate is impaired. The facial muscle-based action units (AUs), which are defined by the Facial Action Coding System (FACS), have been widely studied and are highly reliable as a method for detecting facial expressions (FE), including valid detection of pain. Unfortunately, FACS coding by humans is a very time-consuming task that makes its clinical use prohibitive. Significant progress in automated facial expression recognition (AFER) has led to numerous successful applications in FACS-based problems. However, only a handful of studies have been reported on automated pain detection (APD), and its application in clinical settings is still far from being a reality. In this paper, we review the progress in research that has contributed to automated pain detection, with a focus on 1) the framework-level similarity between spontaneous AFER and APD problems; 2) the evolution of system design, including the recent development of deep learning methods; 3) the strategies and considerations in developing a FACS-based pain detection framework drawn from existing research; and 4) an introduction of the most relevant databases available for AFER and APD studies. We attempt to present the key considerations in extending a general AFER framework to an APD framework in clinical settings. In addition, we highlight the performance metrics used in evaluating an AFER or an APD system.

Index Terms—FACS, Action Unit, Automated Pain Detection, Facial Action Coding System, Automated Facial Expression Recognition, Deep Learning

• Z. Chen and R. Ansari are with the Department of Electrical and Computer Engineering, University of Illinois at Chicago. E-mail: [email protected], [email protected]
• D. Wilkie is with the Department of Biobehavioral Nursing, University of Florida. E-mail: diwilkie@ufl.edu
Manuscript received May 28, 2018.

1 INTRODUCTION

PAIN assessment is essential to providing proper patient care and assessing its efficacy in clinical settings. A patient’s self-report is commonly used as the ’gold standard’ for reporting pain through manual pain measurement tools, including the verbal numerical scale (VNS) and the visual analogue scale (VAS). However, human sensing and judgement of pain is subjective, and the scale report may vary significantly among individuals. Behavioral observation of a patient, in particular the use of facial expression as a key behavioral indicator of pain, has been identified as an important modality for assessing pain [99], especially when the patient’s ability to communicate pain is impaired [36]. Patients who are dying, intellectually disabled [63], critically ill and sedated [68] [1], or have dementia [61], head and neck cancer, or brain metastasis [35] [31] [71] are particularly vulnerable and in need of technology that could provide reliable and valid alerts about their pain to busy clinicians. The American Society for Pain Management Nursing (ASPMN), in its position statement on pain assessment in the nonverbal patient [35], describes a hierarchy of pain assessment in which the observation of behavior, including facial expressions, is noted to be a valid approach to pain assessment. McGuire et al [63] concluded that pain in the intellectual disability population may be under-recognized and under-treated, especially in those with impaired capacity to communicate about their pain. A study of patients undergoing procedural pain [71] showed a strong relationship between procedural pain and behavioral responses, and it identified specific procedural pain behaviors that included facial expressions of grimacing, wincing, and shutting of eyes. It was relatively rare for facial expressions to be absent during a painful procedure. Findings by Payen et al [68] strongly support the use of facial expression as a pain indicator in critically ill sedated patients. A study of pain assessment [61] in elderly patients with severe dementia provides evidence that patients with pain show a more pronounced behavioral response compared with patients without pain, and notes that clinician observation of facial expressions is an accurate means for assessing the presence of pain in patients unable to communicate verbally because of advanced dementia. Research has shown that facial expressions can provide reliable measures of pain across the human lifespan and across cultures [99], and there is also good consistency of facial expressions corresponding to painful stimuli. Assessment of the facial expression of pain not only adds value when a verbal report is available but also serves as a key behavioral indicator of pain for non-communicating patients.

Detailed coding of facial activity provides a mechanism for understanding different parameters of pain that are unavailable in self-report measures [17]. The Facial Action Coding System (FACS) [26] [24] [98], in which pain behaviors are observed while a patient undergoes a series of structured activities [46], has been found to be useful in assessing pain among seniors with and without cognitive impairments in research.

FACS provides an objective description of facial expressions with 30 action units (AUs) based on the unique changes produced by muscular movements. Several studies using FACS conducted by Prkachin et al [16] [17] identified six action units associated with pain: brow lowering (AU4), cheek raising/lid tightening (AU6/AU7), nose wrinkling and upper lip raising (AU9 and AU10), and eye closing (AU43). Prkachin et al [70] developed the Prkachin and Solomon Pain Intensity (PSPI) metric based on this observation, a 16-level scale built from the contributions of the individual intensities of the pain-related AUs and defined as:

Pain = AU4 + max(AU6, AU7) + max(AU9, AU10) + AU43

The list of pain-related AUs has been further expanded in more extensive research [99] to include lip corner puller (AU12), lip stretch (AU20), lips part (AU25), jaw drop (AU26) and mouth stretch (AU27). This collection of 11 AUs, occurring singly or in combination, is considered the set of core Action Units that specifies the prototype of pain, both acute and chronic, in most FACS-based pain studies. However, the full FACS coding procedure is laborious, requiring 100 units of coding time for every unit of real time, which makes it unlikely to be useful in busy clinical settings [31]. The development of an automated system that identifies a configuration of facial actions associated with pain would be a significant innovation for enhanced patient care and clinical practice efficiency.
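To make the arithmetic concrete, the short Python sketch below computes the PSPI score from per-frame AU intensities. It only illustrates the formula above and is not code from any of the reviewed systems; the dictionary-based interface, and the assumption that AU4, AU6, AU7, AU9, and AU10 are coded on the 0-5 FACS intensity scale while AU43 (eye closure) is coded 0/1, are ours.

```python
# Illustrative only: PSPI for one frame, following
# Pain = AU4 + max(AU6, AU7) + max(AU9, AU10) + AU43.
# Assumes AU4/6/7/9/10 are coded 0-5 and AU43 (eye closure) is 0/1.

def pspi_score(au_intensity):
    """Return the Prkachin and Solomon Pain Intensity for one frame.

    au_intensity: dict mapping AU number -> coded intensity;
    missing AUs are treated as absent (intensity 0).
    """
    get = lambda k: au_intensity.get(k, 0)
    return (get(4)
            + max(get(6), get(7))
            + max(get(9), get(10))
            + get(43))

# Example: brow lowering at intensity 3, cheek raising at 2, eyes closed.
print(pspi_score({4: 3, 6: 2, 43: 1}))  # -> 6
```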
Over the past two decades, significant progress has been achieved in automated facial expression recognition. Advances in computer vision (CV) techniques have established robust facial image modeling under naturalistic environments, and state-of-the-art machine learning (ML) methods have been adopted in research on spontaneous facial expression recognition. These developments in Computer Vision and Machine Learning (CVML) have greatly motivated researchers seeking practical solutions to various affect-related problems, where the term ”affect” in this context is associated with the experience of feeling or emotion. However, only a handful of studies have focused on automated pain detection from facial expressions. A literature survey conducted from 1982 to June 2014 with the keywords ”facial action coding system pain” by Rojo et al [73] found only 26 relevant references. Prkachin notes that information technology-based systems are currently relatively primitive, but that there is little doubt that their use in assessing pain expression is feasible and that their capacities will become more powerful with further research [69]. However, the challenges impeding progress on clinical applications of automated pain detection arise largely from the following aspects: 1) considerable difficulty in collecting facial pain data from critically ill patients, 2) insufficient well-annotated pain data in existing datasets, 3) imbalanced datasets with a limited portion of positive (pain) samples, 4) rigid head motions and low AU/pain intensity in most spontaneous facial expression datasets, and 5) inadequate methodology for using CVML models to mimic human observers making pain assessments. As a result, most existing APD research has access to a very limited selection of pain datasets and has mainly focused on acute pain elicited by a single type of stimulus, typically in a laboratory environment [34]. On the other hand, the human ability to perceive pain in others is significantly developed by the age of five to six, and sensitivity to more subtle facial signs of pain generally increases with age [20], which suggests that an APD system mimicking human decision procedures for assessing pain is within the scope of present CVML techniques.

Identifying pain from facial expressions can be viewed as an extension of the spontaneous facial expression recognition problem, which is a more general research area for affect-related learning. An APD framework can be built naturally atop an AFER system, as they can share most modules of system design. In addition, pain-related facial action units are also associated with other affects, such as disgust [99]. In this survey, we present a systematic review of APD development, with focus on how it can benefit from advances in AFER research as well as from the progress in deep learning technology. In particular, we refer to AFER/APD methods that use only classic CVML techniques, and are not deep learning related, as conventional approaches. The paper is organized as follows: Section 2 summarizes important facial expression datasets for AFER and APD. Section 3 reviews an AFER framework in terms of block customization, which serves as the basis for describing APD research. Section 4 presents a systematic overview of APD development that builds on the available body of AFER research. Section 5 discusses the evolution of AFER and APD with the progress of deep learning. Finally, Section 6 summarizes the state-of-the-art work and concludes with a discussion of future work on APD in clinical settings.

2 PUBLICLY ACCESSIBLE FACIAL EXPRESSION DATASETS

Availability of representative datasets is the fundamental basis for training and testing an AFER system. The performance evaluation of, and comparison among, different AFER approaches would not be fair or meaningful without the context of the datasets used in the evaluation. Before proceeding to an overview of specific AFER and APD approaches, we present an introduction to selected publicly available facial expression datasets in this section. The datasets are grouped depending on the context of the problem and presented in chronological order of availability. These datasets were all used in the studies covered in this paper.

2.1 Classic Facial Expression Datasets for Conventional AFER

MMI (2005): The MMI dataset [66] contains 740 static images and 780 video sequences captured in frontal and profile views from 19 subjects drawn from students and research staff. Two thirds of the image samples and the frames of 169 video sequences have been annotated by two certified FACS coders for 30 AUs. The temporal phases of an AU event (onset, apex, offset) are also coded for the 169 sequences. MMI is a dataset of posed (not spontaneous) facial expressions under controlled settings.

RU-FACS (2005): The RU-FACS (or M3) dataset [8] captures spontaneous facial expressions from video-recorded interviews of 100 young adults of varying ethnicity. Each video segment lasts about 2 minutes on average, and the head pose is close to frontal with small to moderate out-of-plane rotation. 33 subjects have been coded by two certified FACS coders for 19 AUs.

CK+ (2010): The extended Cohn-Kanade dataset [55] was constructed upon Cohn and Kanade's DFAT-504 dataset [CK (2000)] [45] and contains 593 videos captured under controlled conditions from 123 subjects with various ethnic backgrounds. Each brief video segment (approximately 20 frames on average) begins with a neutral expression and finishes at the apex phase. The entire dataset is annotated for 30 AUs at frame level. The CK+ dataset has been available since 2010 and is widely used for prototyping and benchmarking AFER systems.

GEMEP-FERA (2011): The GEMEP-FERA dataset [92] is a fraction of the Geneva Multimodal Emotion Portrayal (GEMEP) corpus [7] employed as a benchmark dataset for the first facial expression recognition challenge (FERA 2011). The original GEMEP corpus consists of over 7000 portrayals with 18 posed facial expressions portrayed by 10 actors. The GEMEP-FERA dataset employs 158 portrayals for the AU detection sub-challenge and 289 portrayals for the emotion sub-challenge. Ground truth is provided as frame-by-frame AU coding for 12 AUs and event coding of 5 discrete emotions.

GFT (2012): The Sayette Group Formation Task (GFT) dataset [77] captures spontaneous facial expressions from multiple interacting participants in a psychological study evaluating the socioemotional effects of alcohol. The dataset includes 172,800 video frames from 96 participants in 32 three-person groups [28]. Multiple FACS coders were recruited for binary occurrence coding of 19 AUs at frame level, and 5 AUs were selected for intensity coding (levels A-E). Baseline results are presented in a recent publication [28] for 10 AUs using a maximal-margin framework (linear SVM) and a deep learning framework (AlexNet).

2.2 Facial Expression Dataset for Automated Pain Detection

Establishing a pain-oriented facial expression dataset is much more challenging than establishing a general AFER dataset.

UNBC-McMaster (2011): The UNBC-McMaster Shoulder Pain Expression Archive Dataset [58] is the only publicly available spontaneous facial expression dataset targeting pain. It contains 48,398 FACS-coded frames in 200 video sequences. The video segments are captured from patients suffering from shoulder pain, and spontaneous facial expressions are triggered by the patients moving their affected and unaffected limbs. All frames are coded by certified FACS coders for 10 single pain-related AUs, and the frame-level pain score is rated by the Prkachin and Solomon Pain Intensity (PSPI). A sequence-level pain label is assigned by self-reported VAS and Observer-rated Pain Intensity (OPI). In addition, 66-point facial landmarks from an Active Appearance Model (AAM) are also provided for each frame to facilitate the development of a user-customized AFER system. It is the only publicly available pain-oriented facial expression video dataset in which the spontaneous facial expressions are evoked solely by acute pain.

2.3 Facial Expression Datasets for Deep Learning Based AFER

SEMAINE (2012): Created as part of the Sustained Emotionally colored Machine-human Interaction using Nonverbal Expression project, the SEMAINE dataset [64] provides high-quality, multimodal recordings to study social signals occurring during interaction between human beings and virtual human avatars. The video frames were recorded from 150 participants of varying age and partitioned into subsets for training (48,000), development (45,000), and testing (37,695). Spontaneous facial expressions were annotated by three certified FACS coders for 6 AUs at frame level. SEMAINE was selected as a benchmark dataset in the FERA 2015 challenge [91].

DISFA (2013): The Denver Intensity of Spontaneous Facial Action (DISFA) database [62] captures spontaneous facial expressions elicited from 27 young adults viewing a 4-minute video clip, via stereo video recording under uniform illumination conditions. 4845 frames were recorded for each participant and coded by a single FACS coder for 12 AUs (5 upper-face AUs and 7 lower-face AUs) with six intensity levels (0-5). The number of events (from onset to offset) and the number of frames at each intensity level were also recorded. A second FACS coder was recruited to annotate video from 10 randomly selected participants, which resulted in high interobserver reliability (ICC) ranging from 0.80 to 0.94.

BP4D (2014): The Binghamton-Pittsburgh 4D Spontaneous Expression Database (BP4D) [110] contains spontaneous facial expressions elicited by well-validated emotion instructions associated with eight tasks, captured from a diverse group of young adults. A total of 639,224 frames were produced from forty-one participants and annotated for 27 AUs by FACS-certified coders in 20-second segment intervals. The video is available in both 3D and 2D, where geometric features are represented by an 83-point 3D temporal deformable shape model (3D-TDSM) and a 49-point 2D shape model tracked by a constrained local model (CLM). BP4D was also selected as a benchmark dataset in FERA 2015 [91].

3 FACS BASED AU DETECTION

3.1 FACS Coding

Message-based and sign-based judgements are the two major approaches to studying facial behavior [25]. While message-based judgements make inferences about the emotions that underlie facial expressions, e.g. basic emotions or pain, sign-based judgements focus on the facial expression itself by describing the surface behavior created by muscular movement. FACS provides a standard technique for measuring facial movement by specifying 12 action units in the upper face and 18 in the lower face. The action units are defined on the anatomical basis of muscular movements that are common to all people. Depending on the research question, observers commonly use two schemes to score AUs. The comprehensive scheme codes all AUs in a chosen video segment, and the selective scheme codes only predetermined AUs. The reliability of AU scoring, also referred to as inter-observer agreement, is assessed by considering four criteria: the occurrence of an AU, the intensity of an AU, the AU temporal precision defined by the onset, apex, offset and neutral stages, and the aggregates that code certain AUs in combination. While the occurrence of an AU is the most widely used criterion to evaluate an AFER system, the latter three are evaluated depending on the specific research problem. For example, an event can be decomposed into a set of action units that overlap in time, and thus a link can be created between coding of pain via a message-based mechanism and a FACS-based system via AU aggregate coding.
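As a simple illustration of how sign-based AU coding can be linked to a message-based pain judgement through AU aggregates, the sketch below flags the frames of a coded sequence in which a chosen combination of AUs co-occurs. The particular aggregate rule (AU4 together with AU6 or AU7) and the data layout are hypothetical examples, not a rule taken from FACS or from the studies reviewed here.

```python
# Hypothetical sketch: flag frames whose active-AU set satisfies an
# AU aggregate (here: AU4 present together with AU6 or AU7).
# The rule and the data layout are illustrative, not prescribed by FACS.

def frames_matching_aggregate(frame_aus, required={4}, any_of={6, 7}):
    """Return indices of frames whose set of active AUs matches the aggregate."""
    return [i for i, aus in enumerate(frame_aus)
            if required <= aus and aus & any_of]

sequence = [{1, 2}, {4, 6, 43}, {4, 7}, {12, 25}]   # per-frame active AUs
print(frames_matching_aggregate(sequence))          # -> [1, 2]
```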
3.2 General AFER Framework

Over the past two decades, AFER research has generally adopted a framework that is comprised of four core modules: face detection, face alignment, feature extraction, and facial expression classification.

3.2.1 Face Detection

The face detection module locates face regions with a rectangular bounding box in a video frame so that only the localized patches are passed on for further processing. One example of a state-of-the-art face detector is the real-time Viola-Jones detection framework [95], which employs the Adaboost learning algorithm in a cascade structure for classification. The ready-to-use Viola-Jones detector is capable of robustly detecting multiple faces over a wide range of poses and illumination conditions, as a result of which it is widely employed in AFER research in practice. An extensive review of available face detection approaches is presented in [109].
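For readers who want a concrete starting point, the snippet below shows Viola-Jones-style face detection using the Haar cascade shipped with OpenCV. It is a minimal sketch rather than the configuration used by any system reviewed here; the file name and detection parameters are placeholders.

```python
# Minimal Viola-Jones-style face detection with OpenCV's bundled Haar cascade.
# The frame file name and parameter values are placeholders, not settings
# taken from the reviewed AFER/APD systems.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

frame = cv2.imread("frame_0001.png")           # hypothetical input video frame
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:
    face_patch = frame[y:y + h, x:x + w]       # localized patch for alignment
```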
3.2.2 Face Alignment in 2D

The face alignment module tracks a set of fiducial points on the detected face area in every video frame. The fiducial points are typically defined along the cues that best describe the features of a face, including the jawline, brows, eyes, nose, and lips. The set of fiducial points is also referred to as the shape model when a parametric model is applied to the human face. While facial expressions produce non-rigid facial surface changes, head pose rotations account for rigid motions. Although rigid motions are negligible in the controlled experimental settings of posed facial expression datasets, the occurrence of spontaneous facial expressions in natural environments is frequently accompanied by obvious rigid motions, which create challenges for accurate feature extraction in the next step. Rigid motion normalization can be embedded within an alignment framework via a similarity or affine transform. Alternatively, an aligned face model can be rotated and frontalized with the estimated head pose parameters and registered to a front-view mean face template with a neutral facial expression.

Faces can be modelled as deformable objects that vary in terms of shape and appearance. The active appearance model (AAM) [15] [6] is one of the most commonly used face alignment methods; its optimization process relies on jointly fitting a shape model and an appearance model via the steepest descent algorithm over the holistic face texture. The original AAM is subject-dependent, with the model tuned specifically to the environmental conditions of the training dataset. This, however, may cause performance degradation on face images of an ’unseen’ subject. This problem can be addressed by using the constrained local model (CLM) [19] [18], which is comprised of a shape model and an appearance model similar to AAM. It utilizes only rectangular local texture patches around the shape feature points to optimize the shape model through a non-linear constrained local search. The appearance templates in CLM are represented by a small set of parameters that can be effectively used to fit unseen face images. Therefore a CLM framework can be conveniently modified into a subject-independent model [3] for generic face alignment applications. The results in [13] indicate that CLM is generally more computationally efficient than AAM but is slightly outperformed by AAM when rigid motions are present. Follow-up research has led to continuous improvement of the family of AAM and CLM methods in terms of fitting accuracy and computational efficiency, involving adaptation of the fitting methodology [90], feeding richly annotated training data captured in unconstrained conditions [114], combination with new appearance descriptors [105], and fine tuning with cascaded regression techniques [4]. This evolution has boosted the performance of face alignment to handle illumination variations, out-of-plane rigid motions, and partial occlusions. It is worth noting that the typical AU coding scenario involves a close-to-frontal (±30°) view, which can be adequately handled by recent 2D alignment methods. Although 3D alignment provides much denser geometric modeling as well as depth information, its computational inefficiency and the difficulty of data acquisition make it currently infeasible in many practical AFER and APD applications. Therefore we focus on automated pain detection from 2D facial expression recognition in this paper. For a survey of facial expression modeling in 3D, readers are referred to [76].

3.2.3 Feature extraction

Facial expressions are revealed by non-rigid facial muscular movement. A feature extraction module is added between the raw input pixels and the classifiers and serves as an important interpreter that extracts the non-rigid information relevant to manifesting Action Units from a computer vision perspective; this module is highly customizable within the framework structure, with a rich choice of feature descriptors. Features can be categorized into geometric-based and appearance-based. Geometric features extract geometric measurements related to the coordinates of fiducial points, while appearance features are derived from pixel intensities. Predefined distances, curvatures and angles derived from displacements of the fiducial points were an intuitive way to extract geometric features in early AFER research [87] [65]. Static geometric measurements can be used as mid-level representations to extract dynamic features by encoding them with temporal information for a more detailed description of neuromuscular facial activities [93]. Geometric features are typically low in dimension, simple to compute, and focused on describing the deformation of the conspicuous feature cues. However, geometric features are insufficient to model muscular actions away from the feature cues, and they are sensitive to registration errors from shape alignment. Therefore geometric features are typically applied in conjunction with other feature descriptors in the spontaneous AFER scenario. On the other hand, appearance features can be extracted directly from the alignment modality [2], either by similarity normalized appearance (S-APP), which normalizes the rigid motion, or by canonical appearance (C-APP), which

warps the texture to the mean shape template. The result- and local feature extraction. Each AU is activated by a ing appearance descriptor is then concatenated in a high- combination of facial muscles, which can be geometrically dimensional feature vector based on raw pixel intensity located in a specific region on the face. Appearance fea- for classification. In this case, facial-expression-related mus- tures extracted from corresponding region of interests (ROI) cular activities are encoded together with a large portion bounded by geometric features potentially help to reduce of person specific features without discrimination, which the feature dimensionality and boost computation efficiency. potentially attenuates the discriminating power of the fea- Hamm et al [32] divided a face image into 14 rectangular ture vector towards target AUs. Subtle texture changes like regions based on geometric landmarks layout, from which deepening the nasolabial furrow, raising the nostril wing, a 72-dimensional Gabor response is pooled to generate a cheek rising, pushing the infraorbital triangle up, provide histogram based feature representation. If multiple feature important reference for coding of certain AUs [26]. These descriptors are employed in an AFER framework, a feature non-rigid motions are nonstructural, which are infeasible to fusion problem should also be considered. Wu et al [103] be effectively described by raw geometric or pixel-based fea- evaluate the performance multi-layer architectures for fea- tures. Descriptors developed for object detection are intro- ture extraction involving a combination of two texture de- duced in AFER research in order to exploit the features more scriptors, the Gabor Energy Filter (GEF) and LBP. However, in depth. These descriptors detect local intensity variations a more common way for feature fusion is to perform it in as feature points, edges, phase or magnitudes and assemble the classification stage, which will be addressed in the next the low level appearance features into high-level feature section. representations, from which a meaningful classification can be made. A series of studies [8] [102] [51] [103] have applied 3.2.4 AU classification and regression a Gabor filter bank to extract features in 8 directions and 9 FACS-based AFER studies typically focus on two core prob- spatial frequencies and the output magnitudes are concate- lems in Action Unit recognition: 1) which AUs are displayed nated into a single feature vector for AU classification. The in the input images or videos, and 2) how to measure the local binary pattern (LBP) encodes local intensity variation intensity of the observed AUs. Conventional AFER research in the neighbourhood and as an extension to LBP, the local frequently treats the detection of AU occurrence as a binary phase quantization (LPQ) captures local phase information classification problem, where a binary classifier is trained from Fourier coefficients, and a histogram is assembled from for each of the target AU independently, and a separate set the local descriptors as a high-level feature representation. If of classifiers or regressors are trained to discriminate AU the facial video volume is sliced by three orthogonal planes intensity of different levels. Static classification approaches (TOP), both LBP and LPQ static feature descriptors are then serve as the basis for AU recognition, which rely on features expanded to encode temporal information from x − t and extracted from one static image frame. 
However, psycho- y − t planes, which are known as LBP-TOP and LPQ-TOP logical studies [9] suggest that facial behavior can be more [44]. A detailed survey of the application of the family reliably recognized from an image sequence than from a of LBP descritpors to facial image analysis can be found still image. Hence dynamic approaches include additional in [39]. More geometric-invariant feature descriptors have analysis steps on static predictions by utilizing temporal been successfully applied to AFER including the histogram information from adjacent frames to improve reliability of of oriented gradient (HOG) [13] and scale-invariant feature frame-level AU estimations or perform high-level AU event transform (SIFT) [29]. The appearance descriptors can also coding across the video sequences. be associated with an image pyramid representation to Static Approach: Support vector machine (SVM) and extract feature with multiple scales and resolutions [21] [85]. boosting-based classifiers are commonly employed for the In addition to extracting static features from a single AU classification task. A binary support vector machine frame, it is also beneficial to extract dynamic features com- seeks a subset from training data known as support vectors, prised of spatially repetive, time-varying patterns [12] to which defines a hyperplane that maximizes the margin model motion-based information from an image sequence. between the hyperplane and closest points of both classes. Koelstra et al [49] employ two representation methods, In practice, the feature spaces can be lifted to a higher- motion history image (MHI) and free-form deformation dimensional feature space using kernels [86], where linear (FFD), to derive non-rigid motions from consecutive frames. and radial basis function (RBF) kernels are frequently used A MHI descritpor computes a motion vector field from a set in SVM settings for AU classification. In fact, SVM was of weighted binary difference images within a predefined used as part of the baseline system in most facial expression time window, while a FFD descriptor outlines non-rigid datasets [8] [55] [58] [62] [28]. A boosting-based approach motions via a B-spline interpolation that computes local learns a set of weak binary classifiers in a sequential order, variations of a control point lattice. A quad-tree decompo- where mis-classified samples are assigned higher weights sition method is then applied to decompose the face region and are considered as ’hard’ problems to be handled with in a non-uniform grid representation, where most grids are the next weak classifier. The final decision is made by a placed on areas with high motion response. The quad-tree strong classifier that is generated from a linear combination decomposition is performed in three orthogonal planes to of the set of weak classifiers. extract features in terms of magnitude, horizontal motion, Bartlett et al [8] assessed both SVM (linear and RBF and vertical motions. This work [49] provides an example on kernels) and Adaboost (up to 200 features selected per AU) feature customization in terms of geometric and appearance classifiers for facial action classification on 20 AUs, where feature combination, dynamic feature extraction by utilizing Gabor wavelet features were extracted from the RU-FACS temporal information, as well as a balance between global (spontaneous) and CK (posed) datasets. Chew et al [13] 6

employ a linear SVM to test pixel-based features and more the outputs of the GentleBoost classifiers. The proposed complex appearance features (HOG, Gabor Magnitudes) ex- system was tested for 27 AUs in MMI and 18 AUs in tracted from multiple datasets (CK+, M3, UNBC-McMaster CK+. A follow-up study in [93] proposed a hybrid SVM- and GEMEP-FERA) under different face alignment accuracy. HMM framework to estimate the temporal phases of an AU The experiment suggests that the more complex appearance event. A one-versus-one multi-class SVM was trained for descriptors are robust to alignment errors on AU detection, each of the four temporal phases using 2520-dimensional but their advantages are limited under close-to-perfect- geometric features. The outputs of SVMs were converted alignment. Jiang et al [44] performed recognition of 23 upper to probability measures via a sigmoid function, which were and lower face AUs on MMI dataset with four different used to estimate the emission matrix of a HMM. Rudovic SVM kernel settings, including Linear, Polynomial, RBF and et al noticed the temporal phase of AUs are correlated Histogram intersection. The SVM classifiers were trained with their intensities and proposed an approach based on with a set of features from LBP family with optional Block Conditional Ordinal Random Field (CORF) to utilize the or- and Pyramid extensions, including LBP, LPQ, B-PLBP and dinal relations embedded in intensity levels. The proposed B-PLPQ. Experimental results indicated that a SVM with LAP-KCORF framework [74] extends the CORF model by histogram intersection provided slightly better performance introducing a kernel-induced Laplacian regularizer to the than other kernel settings, and the block-pyramid based B- optimization process. The Composite Histogram Intersec- PLPQ feature outperformed other features but at the cost of tion (CHI) kernel was selected in the framework settings, increased computational complexity. Jeni et al [43] proposed which takes the input of LBP features from aligned training a continuous AU intensity estimation framework using sup- images. The LAP-KCORF model was trained for each AU port vector regression (SVR). Sparse features were learned separately. While neutral and apex phases can be discrim- from local patches via personal mean texture normalization inated solely from the ordinal score, the onset and offset followed by non-negative maxtrix factorization. A L2-loss phases have to be discriminated from the dynamic features. regularized least-squares SVM (LS-SVM) model was trained Experimental results on 9 upper face AUs drawn from MMI for AU intensity regression in 6 ordinal levels (0-5). The pro- datasets demonstrated the advantages of the ordinal model posed AU intensity estimation system was tested on 14 AUs over the SVM-HMM model on both AU detection and tem- on posed data from CK+ as well as on spontaneous data poral phase estimation. Hamm et al [32] proposed an AFER from BP4D for AU12 and AU14. Chu et al [14] investigated system to perform dynamic analysis of facial expressions on the problem of personalizing a classifier trained on a generic a private video dataset featuring patients with neuropsy- facial expression dataset with respect to unlabeled person- chiatric disorders and healthy controls. A one-versus-all specific data through a transductive learning method, which GentleBoost classifier was trained for each of the 15 selected is referred to as Selective Transfer Machine (STM). 
The idea AUs using Histogram of Gabor Features extracted from behind STM is to re-weight the samples in the generic multiple pre-defined local regions. The dynamic analysis training set by minimizing the training and person-specific was measured by single and combined AU frequencies and test distribution mismatch, such that a person-specific SVM affective flatness and inappropriateness, which was derived classifier can be generated from the reweighted training from the temporal profiles of AU predictions on frame level. data. The proposed system was evaluated for cross-subject A complete AU event refers to activation of an AU in AU detection on CK+ and RU-FACS, and cross-dataset terms of its duration and start/stop boundaries. Realizing AU detetion in terms of RU-FACS→GEMEP-FERA and the event-based detection is more desirable than frame- GFT→RU-FACS. The experimental results demonstrated based detection in many application scenarios in practice, that STM consistently outperformed generic SVM on both Ding et al [22] carried out AU detection on frame-level, tasks. segment-level and transition encoding in a sequential order, Dynamic Approach: A dynamic approach for AU and integrated the results from three detectors to Cascade of classification takes input from static AU predictions of mul- Tasks (COT) architecture to perform AU event coding. The tiple consecutive frames to perform temporal analysis that frame-level detector was trained with SVM on SIFT features, targets three problems in general: 1) improving prediction where the frame-level detection were used to augment reliability on the current frame using information from past segment-level training data in both feature representation frames, 2) generate segment-level AU detection from frame- and sample weight. The segment-level data was represented level prediction, and 3) locate AU events in a video sequence by a Bag of Words (BOW) structure with geometric features. by modeling temporal phase transition (i.e. onset, apex, A weighted-margin SVM was employed for segemnt-level offset, and neutral) of an Action Unit. detection, where samples (segments) containing many posi- Koelstra et al [49] trained a one-versus-all GentleBoost tive frames (scored by frame-level detector) were associated classifier for each AU in onset or offset phase indepdently with higher weights. Two transition detectors was then with spatiotemporal features derived from Free-Form De- trained to refine the onset and offset boundaries from the formations (FFD) or Motion History Image (MHI). The detected AU segments using the segment-level features. The outputs of each pair of onset and offset GentleBoost clas- scores for segment-level prediction and transition predic- sifier were combined into one AU recognizer through a tions were linearly combined to produce event prediction continuous (HMM). The four-state of AU. The start and end frames of an AU event were de- HMM performed as a temporal filter as well as AU event termined based on the highest event score. Multiple events encoder by modeling temporal phase transition for each AU, could be scored in a given video sequence using dynamic where the prior and transition matrices were estimated from programming (DP) via a branching-and-bounding strategy. training set and the emission matrix was estimated from The proposed system was trained and tested for different set 7 of AUs on CK+, RU-FACS, FERA and GFT datasets respec- ered among 11 AUs. tively. 
Sikka et al modeled an affective event as a sequence of discriminative sub-events, and solved the event detection 3.2.5 Performance Metrics as a weakly supervised problem with ordinal constraints by After training a proposed AFER system on selected datasets, the proposed Latent Ordinal Model (LOMo) [81]. A LOMo the next step is to evaluate the system performance with model seeks a prototypical appearance template to represent proper metrics. The criteria to select measure metrics de- each sub-event, and the order of all sub-event permutations pends on the type of classification problem, e.g. binary AU is associated with a cost value, where permutations far from classification or multi-level AU intensity estimation. the ground truth order was penalized by higher cost. A Accuracy, F1 score and Area Under the receiver operating score function originating from a linear SVM was defined Characteristic (AUC) curve are three most frequently used to measure the correlations between a weakly labeled frame metrics to evaluate a binary classifier. Accuracy is the per- (with only sequence-level label) and the templates with con- centage of correctly classified positive and negative samples. sideration of the ordinal cost. The entire model was learned F1 score is the harmonic mean of between precision (PR) and using an algorithm based on stochastic gradient descent recall (RC), so that the two measures are reflected through (SGD) with respect to a margin hinge loss minimization. one score. The metric AUC is equal to the probability that The proposed method selected multiple frame-level feature a classifier will rank a randomly chosen positive instance descriptors (SIFT, LBP, geometric and deep-learned features) higher than a randomly chosen negative one [27]. and was trained and evaluated on four facial expression Evaluation of multi-level intensity classification and re- datasets including CK+ and UNBC-McMaster. gression depends on two types of performance metrics Discover AU Relations: The aforementioned meth- in general. The first type includes the root mean square ods consider single AU detection as one-versus-all classifi- error (RMSE) and mean absolute error (MAE) metrics that cation problem. There are also efforts that attempt to make effectively capture the difference between predicted value better use of FACS by exploiting dependencies and relations and the ground truth. While MAE is more intuitive for among AUs, which are potentially helpful in improving interpretation, RMSE possesses the benefit of penalizing detection on highly skewed AUs in spontaneous facial large errors more, and is easier to be manipulated in many video datasets. Tong et al [88] proposed a system based mathematical calculations. The second type of metric is cor- on Dynamic (DBN) to model relations relation based including Pearson’s correlation coefficients among different AUs in a probabilistic manner taking into (PCC) and Intraclass correlation (ICC), which measure the consideration of temporal changes in an AU event. An trend of prediction following the ground truth without con- Adaboost classifier trained on Gabor features was employed sidering a potential absolute bias [23]. The ICC is similar to as a baseline static AU detector. A Bayesian Network (BN) PCC except the data are centered and scaled using a pooled was used to model the relations of AUs in terms of their mean and standard deviation [29], which require raters (e.g. co-occurrence and mutually exclusivity. 
A DBN network a human coder and an AFER) to provide the same rating is comprised of interconnected time slices of static BNs, without multiplicative difference to achieve agreement on where the state transitions between the consecutive time ICC. RMSE/MAE and ICC are often used simultaneously slices is modeled by a HMM. The structure and parameters as complimentary metrics in AFER studies, as a low ICC is of the DBN were obtained by training it offline and AU possible with deceptively low RMSE/MAE scores and vice- recognition under temporal evolution was performed online versa [23]. with the trained DBN network. The DBN-based system Most spontaneous facial expression datasets contain a was learned for 14 AUs from the CK+ dataset and the significantly larger fraction of negative samples than the validation was further generalized using selected samples fraction of positive samples in practice, where the data from the MMI dataset. Zhao et al [111] formulated patch imbalance can be defined by the skew ratio as Skew = negative examples learning and multi-label learning problems in AU detection positive examples . Jeni et al [42] studied the influence of skew as a joint learning framework known as Joint Patch and on multiple metrics and reported Accuracy, F1, Cohen’s κ Multi-label Learning (JPML). Patch learning sought local and Krippendorf’s α were attenuated by skewed distribution. feature dependencies for each AU, where adaptive patch Although AUC was unaffected by skew ratio, it may mask descriptors were placed around 49 facial fiducial points poor performance due to data imbalance. The author sug- to extract 128D SIFT features from each patch. Multi- gested to report skew-normalized scores along with original learning exploited strong correlations among AUs from a ones to minimize skew-biased performance evaluation. This AU relation matrix that was learned on over 350, 000 FACS finding was also taken into account by some studies [89] annotated frames from multiple datasets. Positive correlation [111] [72] presented in this paper. It is worth mentioning and negative competeition were defined for relations of likely that all metrics discussed above are also applicable to pain and rarely co-occurring AU pairs based on the correlation detection and deep learning based AFER approaches. We coefficients in the AU relation matrix. The JPML framework shall use the same abbreviations (as highlighted) when integrated the patch learning and multi-label learning into referring to these metrics in the following sections. a logistic loss in the form of a patch regularizer and a relational regularizer, which were then jointly solved as an 4 AUTOMATED PAIN DETECTIONFROM FACIAL unconstrained problem. The JPML framework was trained and tested on CK+, GPT, BP4D datasets, from which specific EXPRESSIONS active patch regions were identified for 11 AUs. 8 positive Present APD studies generally follow two modalities to correlation and 14 negative competition pairs were discov- detect pain from facial expressions. In the indirect way video 8 frames are first encoded with FACS based Action Units and 4.2 Binary Pain Classification pain is then identified from pain-related AU predictions. Following the first study based on UNBC-McMaster dataset, Alternatively, a mapping can be learned directly from high- [2] proposed a framework to detect pain at both frame and dimensional features to assign pain labels without going sequence level. The face video was aligned by AAM and through the AU coding procedure. 
The progress in APD has a combination of geometric features (similarity normalized benefited vastly from advances in AFER techniques which shape points - S-PTS) and appearance features (canonical applicable in both modalities, as AFER is the core module in appearance - C-APP) was extracted. A linear SVM was the indirect method and most functional blocks in an AFER used to perform binary classification of ”pain” vs. ”no- framework can be applied within a direct pain detection pain” , and prediction of pain score at frame level was architecture with minor modifications. In fact, with few defined as the distance of the test observation from the exceptions, all APD studies reviewed in this section are separating hyperplane of SVM. Pain scores for every frame linked to some preliminary AFER research. The correlation are summed up and normalized for the duration of the between the frameworks used in addressing the AFER and sequence to produce a cumulative score for sequence level APD problems is illustrated in figure 1. pain measure, and the decision threshold is determined by Aside from the similarity to the AFER problem, there Equal Error Rate (EER) derived from ROC. In the follow- are additional considerations that should be factored in the up research Lucey et al [56] reported on recovering 3D design of an APD system. These stem from the nature of the pose from 2D AAM [104] and performing statistical analysis pain data, methods used in labeling pain data, the system based on ground truth to reveal that patients suffering performance metrics used, and availability of supplemental from pain displayed larger variance in head positions. A pain data from non-facial sensing. For example, pain data set of binary linear SVM classifiers was trained for 10 pain may be collected under controlled, spontaneous, or clinical related AUs to replace the single binary pain SVM classifier conditions; the ground truth may be obtained from patients’ in the previous work [2]. Frame-level pain measurement self-report, or certified FACS coder; a pain event may be was then derived from the output scores of AU detectors scored per frame or per segment for the binary pain deci- that were fused using linear logistical regression (LLR). A sion or pain intensity estimation; distinction between acute combination of S-PTS, S-APP and C-APP yields the best and chronic pain from onset phase duration [34]; suitable average AU recognition performance of 0.78 measured by metrics could be used for performance evaluation and for AUC [2], and the performance was further improved to comparison with human coders’ decision. Last but not the 0.81 by compressing the spatial information with a discrete least, signals from channels other than facial expressions cosine transform (DCT) [59]. Continuous research based on may also help in pain detection. In this section, we will this framework was conducted in [57], where OPI label was review existing APD research in a manner by employed for sequence level pain evaluation to mimic the addressing these considerations. decision making of human observers. The OPI labels are segmented into three classes based on intensity (0-1, 2-3, 4.1 Genuine Pain Vs. Posed Pain and 4-5), and a one-versus-all binary SVM classifier was A pioneering APD study examined the problem of distin- trained for each OPI group to generate a matrix. guishing posed pain from genuine pain through AU coding The frame-level pain label assigned by the classifier model [52] [53]. 
Facial expressions induced by posed pain and produced the highest probability score and a sequence level genuine pain are mediated by two neural pathway and thus pain label is generated from all its member frames via a have subtle differences in terms of muscular movements majority vote scheme. Although reasonable classification and dynamics. The system implementation was comprised rate was obtained for no pain class (OPI 0-1), the system of a two-level machine learning module. The first stage did not report desired performance to discriminate low solved a general AFER problem by training independant pain level (OPI 2-3) from high pain level (OPI 4-5). Further linear SVM binary classifiers for a list of 18 single AUs findings from the testing results suggests that rigid head and 2 AU combinations about prototypical expression of motions do not necessarily contribute to prediction of OPI fear and distress. Features were extracted by Gabor filters intensities. This series of studies is valuable as it paves the and selected by Adaboost. Differences of Z-scores were way for APD research to replicate the job of a care-giver computed for three pain conditions consisting of real pain, when monitoring the pain of a patient. fake pain, and absence of pain (baseline). AU 1, 4, and the distress brow combination were considered statistically sig- nificant for real vs. fake pain discrimination. Window-based 4.3 Pain Intensity Estimation statistics and event-based statistics were obtained from the Hammel et al [33] built a pain intensity classification system first stage of AU prediction, and a second layer non-linear for the UNBC-McMaster dataset using the AAM-CAPP- SVM was trained to discriminate genuine pain from faked SVM framework. A Log-Normal filter bank tuned to 15 pain. The AFER system was trained on 3 datasets (2 posed orientations and 7 central frequencies was applied to the and 1 spontaneous), while the pain data were collected from C-APP features to enhance the modeling of deepening and 26 subjects subjected to cold pressure pain. The system was orientation change of appearance features that character- evaluated by AUC and 2FAC, and the reported performance ize pain expressions. The pain metric is measured with found to be superior to naive human judges. Although the PSPI scale and the intensities are categorized into 4 accuracy of individual AU classifier is still below that of classes (0, 1, 2, ≥ 3) and four linear SVM are trained with human coders, this is among the first efforts to extend AFER the Log-Normal filtered C-APP feature map for the four to APD study. levels of pain intensity. The pain intensity is estimated on 9

Fig. 1. Correlation between the frameworks used in addressing the AFER and APD problems

a frame-by-frame basis and the performance is evaluated ratings from patients’ self-report, parents, and nurses were by F1 for 5-fold (91% − 96%) and leave-one-subject-out recorded independently in a 11-point Numerical Rating (45% − 67%) cross-validation. Moderate to high consistency Scale (NRS), where self-report pain rates were considered as between manual and automatic PSPI pain intensity was the subjective ground truth and time since surgery provided measured by ICC for the two cross-validation schemes. Irani the objective ground truth alternatively. Video segments et al [40] adopted spatiotemporal oriented energy features to with NRS rating ≥ 4 are defined as trials with pain, while model facial muscular motions in a pain intensity detection those with NRS ratings of 0 are defined as trials without framework on the same pain dataset. The face sequence pain. The computer expression recognition toolbox (CERT) was segmented and aligned using AAM, and then a face was employed to independently score 10 pain-related single patch was further divided into 3 regions based on the prior AUs, one smile-related AU combination, and rigid head knowledge of location of pain expression muscular activi- motion in 3D (yaw, pitch and roll) per frame. Three statistics ties. Facial muscular activities were extracted at the frame (mean, 75th percentile, and 25th percentile) were computed level by steerable and separable energy filters that were for each of the 14 frame-level raw features across the du- implemented by a second derivative Gaussian followed by ration of a pain event (ongoing and transient), which were a Hilbert transform tuned to 4 directions. A histogram was comprised of a 42-dimensional feature vector for the pain generated from pixel-based energy features as per directions event. Two linear regression models with L1 regularization and regions to form region-based energy descriptors. Ver- are trained on the same set of feature vectors for binary pain tical motion and horizontal motion were computed from classification and pain intensity estimation tasks. the histogram of spatial features in up-down and left-right directions respectively. Frame-level motion features were Performance of a binary pain model was evaluated via a 10 summed in time domain to generate a final spatiotemporal -fold cross validation with two metrics, where AUC (in the descriptor for each region. Pain intensity was estimated by range of 0.84-0.94) demonstrates good-to-excellent detection κ a weighted linear combination of the vertical and horizontal rate and Cohen’s provides phycological measurement on motion scores. Experimental results reveal an improvement categorical agreement between raters with consideration in no-pain and weak-pain recognition in the work of [33]. of chance agreement. Categorical agreement evaluated for κ = Lundtoft et al [60] modified the pain intensity framework in transient and ongoing pain was fair to substantial ( 0.36 − 0.61 [40] by introducing a super pixel approach for face region ) for the model trained on self-report ratings κ = 0.61 − 0.72 segmentation. They argued a better recognition rate for no and substantial ( ) for the model trained on pain category can be achieved by only exploiting vertical objective ground truth, where the prediction of automated motions of the spatiotemporal descriptor from the middle system was more consistent than human observers on ongo- κ region of the face. ing pain according to the measurement. 
The pain intensity estimation model was evaluated by the Pearson correlations metric (both within and across subjects) using a leave-one- 4.4 Pain Detection in Clinical Settings subject-out cross validation scheme. The best performance An encouraging clinical application of automated pain de- was achieved when the PIE system was trained with objec- tection was reported in [79] [78]. This study applied CVML tive ground truth for both pain conditions, and the within- models to assess pediatric postoperative pain and demon- subject correlation (r = 0.80 − 0.86) was consistently higher strated good-to-excellent classification accuracy and strong than the correlation for all subjects (r = 0.55 − 0.59). The correlation with pain ratings from human observers (parents CVML model performance was at least equivalently to or and nurses). In the data collection stage, a single camera exceeded that of a nurse for both BPC and PIE tasks, but recorded video of 50 youth, 5 to 18 years old, for transient was only equivalent to that of parents in PIE task. The and ongoing pain conditions during 3 study visits after analysis of results from PIE also supports the tendency for laparoscopic appendectomy. Facial activities were recorded nurses to underestimate pain severity [99] when compared for 5 minutes as a measure of ongoing pain. Then the with children’s self-report, especially under the ongoing surgical site was manually pressed for two 10-second pe- pain condition. Model performances was not affected by riods to stimulate transient pain, and video recorded. Pain including demographic information in the training features. 10

This research outlines general procedures for applying a CVML model in clinical settings, involving experimental design and data collection, CVML tools, model evaluation from both engineering and psychological aspects, as well as performance comparison with human observers. The methodology in this research therefore serves as an important reference for future APD research.

4.5 Detecting Pain Events with 'Weakly Labelled' Data

A patient's self-report is the gold standard for pain evaluation in patient care. A pain label is commonly available for a video sequence but not for every single frame. Such a situation is encountered frequently in computer vision, since it is easier to obtain group labels for the data than individual labels, and is known as a weakly supervised learning problem. Sikka et al [80] modeled a video sequence with a bag of words (BOW) structure comprised of multiple segments to handle the 'weakly labelled' data, and solved the 'weakly supervised' pain detection problem with a multiple instance learning (MIL) framework. Two partitioning schemes, normalized cuts (Ncuts) and scanning window (scan-wind), were employed to generate multiple segments (referred to as instances) from a video sequence (referred to as a bag). According to the BOW definition, a positive (pain) bag contains at least one positive instance and a negative (no pain) bag contains only negative instances. The frame-level feature extraction learns a mapping from the pixel space to a d-dimensional feature vector. The choice of frame-level feature descriptor is highly flexible; the multi-scale dense SIFT (MSDF) with a 4-level spatial pyramid was selected in this work [80]. The segment-level feature representation was then generated from all frames in the segment via a max pooling strategy, such that an instance in a bag is also represented by a d-dimensional vector. A MILBoost [108] framework based on gradient boosting was used to learn the posterior probabilities for both bags and instances, which were directly related to pain event prediction at the segment or sequence level. The frame-level pain score was also derived from the posterior probabilities of segments by assigning a frame the maximum score among all the segments it belongs to. The frame-level pain score can be used for pain localization in the time domain. The proposed APD framework was trained and validated on the UNBC-McMaster dataset, where sequences with an OPI rating ≥ 3 were selected as positive samples and those with an OPI rating of 0 were selected as negative samples. The binary pain detection performance was evaluated at the equal error rate (EER) operating point of the ROC (83.7%). The pain localization task was evaluated as pain intensity estimation at the frame level with PSPI as ground truth, and the performance was evaluated by F1 score (.471) and Spearman's rank correlation [47] (.432).

The MIL framework detects pain (pain vs. no pain) in a Multi-Instance-Classification (MIC) setting, where the sequence-level intensity labels (VAS, OPI) of pain are binarized. It is, therefore, not suitable for a multi-level classification/regression problem like the pain intensity detection task. Ruiz et al embedded an ordinal structure of labels into the bag representation, which naturally corresponds to various rating or intensity estimation tasks. In the proposed Multi-Instance Ordinal Regression (MIOR) weakly supervised setting, a video sequence was modeled as a bag where instances are consecutive video frames in the sequence. The bag label belongs to an ordinal set and instance labels were treated as latent ordinal states, inferred from a learned mapping from the feature space to the structured output space. A multi-instance dynamic ordinal random fields (MI-DORF) framework, built upon the hidden conditional ordinal random fields (HCORF) [48] framework, was employed to handle the multiple instance learning problem while incorporating both ordinal and temporal data structures, where the temporal dynamics within the instances were encoded by transitions between ordinal latent states. The proposed MI-DORF framework was then applied to the pain intensity estimation task on the UNBC-McMaster dataset, where a total of 157 sequences from 25 subjects were selected based on a low-discrepancy criterion between the sequence- and frame-level pain labels. The sequence-level pain label on an ordinal scale (OPI label) between 0 and 5 was used to train the system, and the frame-level PSPI label was also normalized to the same scale and used for result validation. The frame-level (instance) features are represented by 49 fiducial points tracked by AAM. For a fair comparison with MIL in [80], this study followed a similar experimental setup as in [80], while the output probability for binary pain classification in MIC methods was also normalized to 0−5 as a representation of pain intensity. The performance on sequence-level prediction was measured by PCC (0.67), MAE (0.80), ICC (0.66), Accuracy (0.52), and F1 (0.34), which outperformed other MIL methods [80] [38] as well as the baseline model [48]. Furthermore, the proposed approach achieved a performance on frame-level pain intensity estimation (ICC = 0.40), using only sequence-level ground truth, comparable to the state-of-the-art fully supervised (using frame-level ground truth) Context-sensitive Dynamic Ordinal Regression method [75] (ICC = 0.59−0.67). This result suggests a good trade-off between system performance and labor-intensive frame-level annotation.
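The bag/instance construction underlying the MIL formulations above can be sketched as follows. This is only an illustration of the max-pooled segment representation and a noisy-OR style bag aggregation commonly used in MILBoost-type learners, not the authors' MILBoost or MI-DORF implementations, and all data and dimensions are made up.

```python
import numpy as np

def segment_instance(frame_features):
    """Max-pool frame-level descriptors (n_frames, d) into a single
    d-dimensional instance vector for the segment."""
    return frame_features.max(axis=0)

def bag_probability(instance_probs):
    """Noisy-OR aggregation: a (pain) bag is positive if at least one
    of its instances is positive."""
    p = np.asarray(instance_probs)
    return 1.0 - np.prod(1.0 - p)

rng = np.random.default_rng(1)
# A video (bag) split into 6 segments (instances) by Ncuts or a scanning window,
# each segment holding 30 frames of 128-D frame-level descriptors.
segments = [rng.random(size=(30, 128)) for _ in range(6)]
instances = np.stack([segment_instance(s) for s in segments])

# Pretend an instance-level classifier produced these pain probabilities.
p_inst = rng.random(size=len(instances))
p_bag = bag_probability(p_inst)   # sequence-level pain prediction
p_loc = p_inst.max()              # a frame inherits the maximum score of the segments containing it
print(instances.shape, round(float(p_bag), 3), round(float(p_loc), 3))
```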

4.6 Automated Detection of Cancer Pain

Study of pain other than acute pain, such as chronic pain caused by cancer, is an important yet almost unexplored area in the current state of APD research. A unique dataset was created by D. Wilkie [98], containing videos of 43 patients suffering from chronic pain caused by lung cancer. The videos were captured in natural settings in the subjects' homes. The patients were required to repeat a standard set of randomly ordered actions as instructed, such as sit, stand up, walk, and recline, in a 10-minute video with a camera focused on the face area to record their facial expressions. Each video was partitioned into 30 equal-duration, 20-second subsequences. The subsequences were reviewed and scored for 9 AUs, possibly occurring in combination, by three trained human FACS coders independently, and the results were entered in a scoresheet that served as a basis for the ground truth. Pain was declared to occur in a video subsequence if at least two coders agreed with each other on the occurrence of a set of specific AU combinations listed on the scoresheet. The intensity of pain for the entire video was measured by the total number of subsequences associated with pain labels. In view of the highly imbalanced data samples captured in spontaneous conditions and the lack of available ground truth from patients' self-report, Chen et al [10] employed a rule-based method to analyze pain in the Wilkie dataset. The rules for AU identification were developed upon geometric features extracted from trajectories of a subset of fiducial points in an AAM shape model. Decision trees were used to encode AU combinations, and pain was identified from AU combination scores.

Most existing video datasets containing pain-related facial expressions are developed for targeted studies. These datasets are typically small in size, without public access, and lack sufficient diversity to train a robust APD system in general. Chen et al [11] decoupled an APD framework into a cascade of two machine learning modules: an AFER system that computes frame-level confidence scores for single AUs, and a MIL system that performs sequence-level pain prediction based on contributions from a pain-relevant set of AU combinations. The AFER system was implemented with a commercial Emotient system (formerly the CERT) to handle general environmental challenges for spontaneous facial expression recognition. AU combinations are described by two distinct low-dimensional novel feature structures: (i) a compact structure encodes all AU combinations of interest into a single vector, which is then analyzed in the MIL framework [108]; (ii) a clustered structure uses a sparse representation to encode each AU combination separately, followed by analysis in the multiple clustered instance learning (MCIL) framework [106], which is an extension of MIL. The 'weakly supervised' binary pain detection task was performed on the UNBC-McMaster dataset and evaluated in terms of Accuracy (86.8%) and AUC (0.94). A trans-dataset validation applying the trained system (on UNBC-McMaster) to the Wilkie dataset also demonstrates good consistency between system predictions and human coders' decisions in both pain (82.9%, 2+ coders agreed) and no-pain (88.9%, three coders agreed) cases. This study suggests a potential way to address the insufficiency of pain-labelled samples by fusing data from multiple small-scale pain datasets.

4.7 Multi-Modality in Pain Detection

Pain is a subjective experience, which is possibly caused by various stimuli (e.g., spinal stenosis, rotator cuff tear, cardiac conditions or sickle-cell disease) and manifested through signals from multiple modalities including voice, facial and body expressions [34]. Present efforts in APD largely focus on acute pain elicited by a single stimulus in a controlled environment, which poses a gap between existing APD research and the requirements of practical pain-related clinical applications. The UNBC-McMaster dataset is the only publicly available dataset with well-annotated data for acute pain and has facilitated APD research since its first availability in 2011. Although it has been widely employed to benchmark the performance of both AFER and APD systems, its contribution is limited by its inability to lend itself to the study of other types of pain such as chronic pain. A recent Emotion and Pain project extends the options for chronic pain analysis by introducing a fully labeled multi-modal dataset (named EmoPain) [5] targeting chronic lower back pain (CLBP). The data collected in EmoPain contain multiple signals within and across modalities, including high-resolution multi-view face videos, full body 3D motion capture, head-mounted and room audio signals, and electromyography signals from back muscles. Participants were requested to perform both instructed and non-instructed exercises to simulate therapist-directed and home-based self-directed therapy scenarios. The level of pain from facial expressions was labelled by 8 raters, while six pain-related body behaviors were annotated by 4 experts. 22 CLBP patients together with 28 healthy control subjects were recruited as final participants for the CLBP study. The baseline experiment on facial expressions of pain used the full length of 34 unsegmented trials captured by the frontal-view camera from 17 patients, comprising a total of 317,352 frames (33.3% contained pain). The baseline system tracked 49 inner facial landmarks that were aligned by a similarity transform. Two local appearance descriptors, the local binary pattern (LBP) and discrete cosine transform (DCT), were employed to extract features from local patches with a radius of 10 pixels centered on 30 fiducial points. The two sets of appearance features, together with geometric features from the landmark points, were fed separately into a linear SVM for the binary pain recognition task. The best performance was achieved by the geometric features, measured by AUC (.658) and F1 (.446). Given facial expression as a communicative modality and body behavior as both communicative and protective [84], the authors further performed a qualitative analysis of the relation between pain-related facial and body expressions. The results supported the finding that facial expressions of pain (70%) occur in connection with a protective behavior, which may help the pain research community to better understand the relation among movement, exercise, and pain experience.

Werner et al [97] addressed multi-modality pain detection by introducing the BioVid Heat Pain dataset, which includes 90 participants subjected to induced thermode pain at 4 levels. Multi-view face videos were recorded from three synchronized optical cameras in conjunction with a depth map from a Kinect sensor. The physiological signals captured are the skin conductance level (SCL), the electrocardiogram (ECG), the electromyogram (EMG), and the electroencephalogram (EEG). The pain detection experiment in this study was based on the video data only, with analysis of the physiological data left for future research. A 6D (3D position and 3D orientation) head pose vector was computed from a 3D point cloud recovered from the Kinect depth map. The facial expression was uncoupled from rigid motion by projecting the tracked fiducial points (from 2D video frames) via a pinhole camera model onto the surface of a generic 3D face model according to the current head pose. Geometric features are defined by 8 distance measures from the fiducial points. Some texture changes, including nasal wrinkles, nasolabial furrows, and eye closure, are synthesized by mean gradient magnitudes extracted from local rectangular regions centered at 5 selected fiducial points. The final feature descriptor per frame was 13-dimensional, a combination of geometric and appearance features. Temporal features were extracted from the 6D head pose vector and the 13D facial expression feature vector within a 6 s temporal window. This resulted in a 21D descriptor per signal defined by 7 statistical measures (mean, median, range, standard and median absolute deviation, interquartile and interdecile range) for each of the smoothed signal and its first and second derivatives. A binary SVM classifier with a radial basis function (RBF) kernel to discriminate pain from no pain was trained for each of the 4 pain levels, and the performance was measured by the F1 metric. Further experiments on the highest pain level data suggested that including head pose features would improve system performance for both person-specific and general pain classifiers. However, this finding may be dataset-dependent, as a contradictory conclusion was reported by studies [56] using the UNBC-McMaster dataset.
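As an illustration of the temporal descriptor used in the BioVid baseline described above, the sketch below computes the seven statistics over a signal and its first and second temporal derivatives to obtain the 21-dimensional per-signal descriptor; the assumed frame rate of 25 fps, the random data, and the omission of smoothing are our simplifications.

```python
import numpy as np

def seven_stats(x):
    """mean, median, range, standard deviation, median absolute deviation,
    interquartile range, interdecile range."""
    return np.array([
        x.mean(),
        np.median(x),
        x.max() - x.min(),
        x.std(),
        np.median(np.abs(x - np.median(x))),
        np.percentile(x, 75) - np.percentile(x, 25),
        np.percentile(x, 90) - np.percentile(x, 10),
    ])

def temporal_descriptor(signal):
    """21-D descriptor for one signal in a temporal window: the seven
    statistics of the signal and of its first and second derivatives."""
    return np.concatenate([
        seven_stats(signal),
        seven_stats(np.diff(signal, n=1)),
        seven_stats(np.diff(signal, n=2)),
    ])

# One 6-second window (assumed 25 fps) of a single head-pose or facial-expression signal.
window = np.random.default_rng(2).random(150)
print(temporal_descriptor(window).shape)   # (21,)
```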

A literature survey on machine-based pain assessment for infants [107] reviewed progress on multimodal tools including behavior-based approaches (e.g. facial expression, crying sound, body movements) and physiological-based approaches (e.g. vital signs and brain dynamics). However, this is outside the scope of our review, which is focused on pain experienced primarily by adult patients.

5 EVOLUTION WITH DEEP LEARNING

5.1 AFER with Deep Learning

Recent advances in deep learning techniques, especially the convolutional neural network (CNN), have provided an 'end-to-end' solution to learn facial action unit labels directly from raw input images, which significantly reduces the dependency on designing functional modules in a conventional CVML framework. The high-level deep features are learned from mass-scale simple features through a pipeline architecture, which facilitates developing pose-invariant applications for AU detection. In fact, spontaneous facial expression datasets with large rigid motions that are challenging to conventional AFER studies, for example BP4D and DISFA, have been widely employed in recent deep learning based research. On the other hand, the structure of multi-class outputs in a deep neural network (DNN) can be conveniently modified at the output layer to handle a multi-label AU detection problem. In this case, AUs are jointly predicted with consideration of their co-occurrence and dependencies, which are insufficiently exploited in previous one-versus-all binary classification settings. However, training a DNN from scratch relies on large labeled training datasets (100K+ samples in [67]), which are generally not available in most facial expression datasets due to the labor-intensive FACS coding procedure. Therefore, deep learning based AFER with partially labelled data has also drawn attention in recent studies. The literature review in this section will focus on the three problems presented above, as these are important common problems for APD studies with deep learning.

5.1.1 AU Recognition Model with a Deep Learning Architecture

A deep learning based system generally follows a different paradigm than a conventional CVML system in performing the AU detection task, which focuses on three problems involving pre-processing, the DNN configuration, and post-processing. The input images are cropped, normalized, and aligned with image processing and computer vision tools so that they have a uniform format when applied to the input layer of a DNN. The DNN combines the feature extraction and classification modules into a single network, where the architecture, training scheme, and optimization settings need to be specified. Finally, the post-processing step determines how the raw data from the output layer are interpreted into the targeted AUs.

An example of a deep learning based framework [30] for FACS AU occurrence and intensity estimation was proposed in the FG 2015 Facial Expression Recognition and Analysis challenge (FERA 2015). The core architecture of the 7-layer CNN was composed of 3 convolutional layers and a max-pooling layer as the feature descriptor and a fully connected layer to provide the classification output. In the pre-processing step, faces were detected, aligned and cropped to a fixed size from the input images, followed by a global contrast normalization, before they were fed to the CNN. The raw output values from the activation of the output layer were used as confidence scores to determine AU occurrence, where the optimal decision threshold was learned from a validation set based on the highest F1 measure. The proposed deep learning AFER system was trained from scratch on both the BP4D and SEMAINE datasets, where the performance for binary AU detection was measured by F1 (0.522 for BP4D, 0.341 for SEMAINE) and the intensity estimation was evaluated on BP4D only with MSE (1.181), PCC (0.621) and ICC (0.613).
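The post-processing step in [30], thresholding raw output activations so as to maximize F1 on a validation set, can be illustrated with the short sketch below; the scores and labels here are synthetic and the function name is ours.

```python
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(scores, labels):
    """Pick the decision threshold on raw network outputs (confidence
    scores) for one AU that maximizes F1 on a validation set."""
    candidates = np.unique(scores)
    f1 = [f1_score(labels, (scores >= t).astype(int)) for t in candidates]
    return candidates[int(np.argmax(f1))]

rng = np.random.default_rng(3)
val_scores = rng.random(500)                                                 # raw output-layer activations
val_labels = ((val_scores + 0.3 * rng.normal(size=500)) > 0.6).astype(int)   # synthetic AU occurrence labels
t = best_threshold(val_scores, val_labels)
print("declare AU present where score >=", round(float(t), 3))
```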
A more extensive combination of features and learning frameworks was carried out in [89] for frame-level AU detection under large head pose variations. Three types of features were extracted in the pre-processing step, including histograms of oriented gradients (HOG), similarity-normalized facial images, and a mosaic representation (cut and put together from patches around landmark positions). The AU detection was formulated either as a single-label or as a multi-label classification problem. Four types of network settings based on an SVM and a customized small-scale CNN were investigated: 1) a single-label SVM with HOG features, 2) a single-label CNN with the normalized images, 3) a multi-label CNN with the normalized images, and 4) a multi-label CNN with the mosaic features. All the systems were trained on the BP4D dataset from the FERA 2015 challenge and tested on an augmented dataset created using 3D information and renderings of faces from BP4D with large pose variations. The performance was evaluated for the binary detection task of 11 AUs with the F1 score, skew-normalized F1 score, and AUC. In general, the single-label CNN with similarity-normalized features yielded the best performance over all four framework settings. Further testing with large head pose variations (0° to −72° for yaw and −36° to 36° for pitch) indicated that the F1 score is a weak function of pose, which suggests that the deep-learning technique is capable of handling facial data under large pose variations.

Zhou et al addressed AU intensity estimation under large pose variations by developing a multi-task deep network [113] for the sub-challenge of FERA 2017. They investigated a pose-invariant model and a pose-dependent model, both built upon the bottom five layers of a pre-trained VGG16 network [83] via transfer learning as the feature descriptor. In the pose-invariant model, a two-layer fully connected regressor was trained for each AU independently, using data from one AU across all poses, as the AU intensity estimator. In the pose-dependent model, the data of each AU were categorized into three groups based on the pitch pose. Three pose-dependent regressors were trained jointly with an auxiliary pose estimator under the assumption that the output of the pose estimator equals the pose ground truth, and the final AU estimates were computed as the dot product between the estimated pose vector and the three pose-dependent AU intensity estimates. Both models were trained on the BP4D dataset and evaluated with the metrics of RMSE, PCC and ICC. The pose-invariant models using appearance features significantly outperform the baseline model in [94], which relies on geometric features. Two likely reasons for the performance difference are (i) the high failure rate in tracking geometric features under extreme pose changes, and (ii) distortion of the facial structure introduced by the warping and alignment operations applied to the tracked landmarks. Comparison between the pose-dependent model and the pose-invariant model in terms of ICC revealed improvement on AU 1, 4, 12, and 17, but no improvement on AU 6, 10, and 14. Hence a mixed model (pose-invariant or pose-dependent) can be selected for each specific AU to further improve the performance.

5.1.2 Multi-label AU Detection

In practice, AUs frequently occur in combinations, so that a single image frame may carry positive labels from multiple AUs. The one-versus-all binary classification in conventional AFER studies either performs region-specific AU detection or ignores the dependencies among co-occurring AUs. As a result, AUs are detected independently in most cases, and certain AU combinations are also treated as a single variable for classification. On the other hand, the architecture of a DNN is naturally designed for multi-class classification, and can be conveniently converted to handle multi-label learning with a proper loss function (e.g. multi-class sigmoid cross-entropy), such that the sum of probabilities over all the output classes is not necessarily equal to one. The multi-label setting combined with deep learning provides improvement in handling imbalanced samples in many spontaneous facial expression datasets. Furthermore, the joint estimation of multiple AUs provides high flexibility in describing various AU combinations, which is advantageous in encoding high-level events like pain or emotions.
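A minimal sketch of such a multi-label output head is given below: independent sigmoid outputs trained with a multi-label cross-entropy loss, so that per-AU probabilities need not sum to one. The tiny backbone, image size, and AU count are illustrative placeholders rather than any published architecture.

```python
import torch
import torch.nn as nn

n_aus = 12                                    # e.g., the 12 BP4D AUs
backbone = nn.Sequential(                     # stand-in convolutional feature extractor
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
head = nn.Linear(16, n_aus)                   # one logit per AU, predicted jointly

images = torch.randn(8, 3, 96, 96)            # a mini-batch of aligned face crops
targets = torch.randint(0, 2, (8, n_aus)).float()   # multi-label AU occurrence ground truth

logits = head(backbone(images))
loss = nn.BCEWithLogitsLoss()(logits, targets)      # independent sigmoids, not a softmax
probs = torch.sigmoid(logits)                 # per-AU probabilities; they need not sum to one
loss.backward()
print(probs.shape, float(loss))
```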
The deep region and multi-label (DRML) [112] framework solved the problems of region-based feature extraction and joint estimation of multiple AUs in a unified deep learning network. The core architecture of DRML was a CNN comprised of five convolutional layers and one max-pooling layer, and the multi-label option was enabled by configuring the output layer with a multi-label sigmoid cross-entropy loss. An additional region layer was attached after the first convolutional layer and partitioned the feature map from that layer with a fixed 8 by 8 grid. Each sub-region was processed by the same sub-network architecture, formed by a batch normalization (BN) layer with ReLU activation and a convolutional layer. The coefficients of each sub-network in the region layer were updated independently with respect to local features and AU correlations, and the output of the region layer is a concatenation of all re-weighted patches. However, this setting requires good face alignment such that local features can be consistently captured in the same sub-regions. The entire network was trained from scratch with data from BP4D (12 AUs) and DISFA (8 AUs) respectively, and the performance was evaluated with F1 (0.483 on BP4D, 0.267 on DISFA) and AUC (0.560 on BP4D, 0.523 on DISFA). The experimental results indicated that the detection of AUs with higher skewness benefited from multi-label learning, and the region layer also improved the detection rate of most AUs by better exploiting structural information from local facial regions.

The region segmentation paradigm based on geometric features in conventional AFER can also be extended to deep learning based approaches. Li et al [50] built a deep learning frame-level binary AU detection framework featuring region of interest (ROI) adaptation, multi-label learning, and LSTM based temporal fusion. Twenty regions of interest (14 × 14), related to the local muscular motions that activate corresponding AUs, were generated from and centered around 20 fiducial points derived from the shape model. A VGG16 network was used as the base feature descriptor, where the output of the 12th convolutional layer (224 × 224) was chosen as the feature map and cropped by the pre-defined sub-regions. A CNN-based ROI cropping net (ROI Net) was trained for each individual sub-region as an AU-specific feature descriptor. The AU detection was formulated as a multi-label problem at the output layer to jointly predict all AUs per frame. The static predictions were then fed into an LSTM network to refine the AU estimates using temporal information from the past 24 frames. The proposed framework was trained for 12 AUs on the BP4D dataset and tested on both the BP4D and DISFA datasets. Experimental results demonstrated contributions from all three aspects: region cropping, multi-label learning, and the LSTM. The setting with multi-label ROI nets followed by a single-layer LSTM yielded the best performance based on F1 score (0.661 on BP4D, 0.523 on DISFA). The performance improvement over the existing DRML approach was attributed to the adaptive way of sub-region selection and the adoption of a deeper pre-trained network (VGG) as the base feature descriptor.
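The ROI-based feature extraction in [50] can be approximated, in much simplified form, by cropping fixed-size windows from a convolutional feature map around landmark locations, as sketched below; the feature-map size, landmark coordinates, and crop logic are illustrative and do not reproduce the authors' ROI Nets.

```python
import torch

def crop_rois(feature_map, centers, size=14):
    """Crop size x size patches from a (C, H, W) feature map, each centered
    on a landmark given in feature-map coordinates (clamped at borders)."""
    c, h, w = feature_map.shape
    half = size // 2
    patches = []
    for (y, x) in centers:
        y0 = int(min(max(y - half, 0), h - size))
        x0 = int(min(max(x - half, 0), w - size))
        patches.append(feature_map[:, y0:y0 + size, x0:x0 + size])
    return torch.stack(patches)               # (n_landmarks, C, size, size)

fmap = torch.randn(64, 224, 224)              # e.g., an intermediate VGG feature map
landmarks = [(60, 80), (60, 144), (120, 112), (170, 90), (170, 134)]   # toy coordinates
rois = crop_rois(fmap, landmarks)
print(rois.shape)                             # torch.Size([5, 64, 14, 14])
```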

Walecki et al [96] integrated a conditional graph model with the output of a CNN to learn the co-occurrence structure of AU intensity levels from deep features in a statistical manner. The CNN was comprised of 3 convolutional layers, each with ReLU activation followed by max-pooling, and was used as a feature descriptor to extract deep features from input image frames. Each image frame in the dataset carried a label vector of all AUs of interest, and each entry in a label vector assumed a value on a six-point ordinal scale. The ordinal representation of AU intensities and their pairwise co-occurrence relations were defined by unary (ordinal) and binary (copula) cliques in the output graph, which was learned by a Conditional Random Field (CRF) model. The CRF framework estimated the posterior probability of a label vector conditioned on the deep features in the form of a normalized energy function, defined by a set of unary and pairwise potential functions. The parameters of the CNN and of the unary and pairwise potential functions were then jointly optimized through a 3-step iterative balanced batch learning algorithm. A data augmentation learning approach was also applied to leverage information from multiple datasets, such that a single 'robust' model is trained across multiple datasets instead of training an individual 'weak' model on each dataset. Under the data-augmentation setting, the CNN was trained using data from multiple datasets simultaneously; however, different CRF pairwise connections for AU dependencies were learned separately for each dataset, as these datasets may contain non-overlapping AUs with considerable variation in dynamics. The training process of the frame-level CRF-CNN AU intensity framework involved data from BP4D in FERA 2015, DISFA, and the UNBC-McMaster dataset, where the UNBC-McMaster dataset was only used for data augmentation purposes. The setting of CRF-CNN with the iterative balanced batch approach and the data-augmentation option achieved the best performance, measured by ICC (0.63 on BP4D, 0.45 on DISFA) and MAE (1.23 on BP4D, 0.61 on DISFA). The DISFA dataset benefited from the augmentation with a 7% improvement in ICC.

5.1.3 AU Detection with Partially Labeled Data

In general, FACS-based data annotation is far more labor intensive than facial expression data acquisition, such that a complete label assignment for all the training images in a dataset is not always available. By assuming intra-class similarity in feature extraction or a distribution depicting the dependencies among AUs, labels of unannotated images may be inferred using the set of labeled data. Learning with partially labeled data can enrich the feature space by improving utilization of all available data, which is especially beneficial to deep learning oriented approaches requiring training with extensive data.

In addition to the binary label set {−1, 1} that indicates negative and positive samples, a 0 label is assigned to an unlabeled sample in the partial label learning problem. Wu et al proposed a multi-label learning with missing labels (MLML) framework [100] to solve this problem in a multi-label setting, which is applicable to AU recognition. The goal of this method is to learn a mapping, specified by a set of coefficients, between the extracted frame-level feature and the predicted label. The missing labels were handled based on two local label assumptions: an instance-level smoothness assumes that two frames with similar features should have label vectors that are close, and a class-level smoothness assumes that two classes with close semantic meanings should have similar instantiations across the whole dataset. The consistency between predicted labels and provided labels was enforced in the MLML objective function, which is comprised of a Hinge loss and constraints from two smoothness matrices generated from the label smoothness assumptions. The AU classification task was tested on the CK+ dataset with the available portion of labels at four levels ranging from 20% to 100%. The performance was evaluated by average precision (PR) (0.81−0.92) and AUC (0.892−0.949).

The label smoothness assumption in [100] could be invalid, as the closeness of two samples in feature space may arise because the subject is the same rather than being an indication of the same AU. Wu et al argued that regular spatial and temporal patterns embedded in AU labels can be described probabilistically as a label distribution, and AU labels, including the missing ones, can be modeled as samples from such underlying AU distributions [101]. In their method, a deep neural network was used as a feature descriptor to extract high-level features from a raw input image, where only the top hidden layer and the output layer were involved in learning the AU label distributions and AU classifiers. A Restricted Boltzmann Machine (RBM) was employed to learn the AU distributions with respect to the partially annotated ground truth labels, from the bias of the AU labels in the output layer, the bias of the hidden layer, and the weights between the output layer and the hidden layer. More specifically, the weights between the two layers were assumed to capture the mutually exclusive and co-occurrence dependencies among multiple AUs. Multiple SVMs were used as classifiers for multiple AU classification based on the high-level features. The objective loss function was a Hinge loss constrained by the log-likelihood of the learned AU label distributions. The loss function was modified such that all data are utilized in the training process but only the annotated data contribute to the loss optimization. The two-step procedure first trained the DNN with the partially annotated data in an unsupervised manner and then fine-tuned the parameters of both the DNN and the SVMs with respect to the objective function. The system was first trained and tested with fully annotated data from BP4D and DISFA respectively, and a minor improvement over other state-of-the-art methods (e.g. DRML) was reported in terms of the F1 measure. Further tests with partially labeled data also demonstrated that the proposed system outperformed MLML with missing data ratios ranging from 0.1 to 0.5.
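One simple way to let partially labeled frames participate in training, in the spirit of the missing-label setting above (labels in {−1, 0, +1} with 0 meaning unannotated), is to mask the unannotated entries out of a multi-label loss; the sketch below shows this masking only and is not the MLML or RBM formulation described above.

```python
import torch
import torch.nn.functional as F

def masked_multilabel_loss(logits, labels):
    """labels: +1 = AU present, -1 = AU absent, 0 = not annotated.
    Only annotated entries contribute to the loss, so partially
    labeled frames can still be used during training."""
    mask = (labels != 0).float()
    targets = (labels > 0).float()
    per_entry = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    return (per_entry * mask).sum() / mask.sum().clamp(min=1)

logits = torch.randn(4, 8, requires_grad=True)        # 4 frames, 8 AUs
labels = torch.tensor([[1, -1, 0, 0, 1, -1, 0, 1],
                       [0, 0, 0, -1, 1, 0, 0, 0],
                       [1, 1, -1, -1, 0, 0, 1, -1],
                       [0, -1, 0, 1, 0, 0, -1, 0]])
loss = masked_multilabel_loss(logits, labels)
loss.backward()
print(float(loss))
```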

5.2 Applications of Deep Learning in Pain Detection

Deep learning techniques have also been adopted in recent APD research, with a focus on improving feature representation and exploiting temporal dynamics by utilizing corresponding deep neural network architectures. To the best of our knowledge, the handful of studies on APD involving deep learning are all based on the UNBC-McMaster dataset.

An end-to-end deep learning pain detector presented in [72] is an upgrade of the classic pain detection framework based on hand-crafted features. The proposed two-stage deep learning framework is comprised of a convolutional neural network (CNN) and a long short-term memory (LSTM) network. The CNN took raw images as input and predicted pain at the frame level; it was built upon a pre-trained VGG-16 CNN and served as a baseline system. The feature frames (4096 dimensions) from the fully connected layer (fc6) before the output layer of the CNN were grouped into sequences of length 17, and all the frames in a sequence were assigned the label of the preceding frame in the sequence. The frame-level features were fed into the LSTM network to improve the pain prediction of a frame by taking into account the past 16 frames. The input images were preprocessed under several settings, including: 1) alignment with generalized Procrustes Analysis followed by cropping to a fixed size, 2) background removal with a binary mask, and 3) frontalization with a down-sampled shape model (C-APP). The proposed system was trained on the UNBC-McMaster dataset with frame-level pain labels, and the best performance on the binary pain detection task (Accuracy = 90.3%, AUC = 0.913) was reported for the aligned-crop scheme combined with exploiting temporal information using the LSTM. The authors argued that removing background information caused worse performance, as the zero background values would be converted to nonzero values while training the CNN. A further test with skew-normalization on the dataset showed that the Accuracy measurement varied significantly with skew in the testing data while the AUC was less affected, which is in accordance with the findings in [42].

One obstacle in applying deep learning to the APD problem is that the size of a pain-oriented dataset is too limited to support training a deep neural network from scratch. Egede et al [23] addressed this issue by combining hand-crafted and deep-learned features to estimate pain intensity as a multi-modality fusion problem tailored to small sample settings on the UNBC-McMaster dataset. The deep-learned features were obtained from two pre-trained CNN models for the AU detection task on the BP4D dataset [41], which are associated with the eye and mouth regions of the face respectively. A difference descriptor was applied to the local patch sequences with a temporal window of 5 frames to obtain frame-level features in the form of image intensity and a binary mask, which were combined and fed as input to the CNN. The output of the last fully connected layer (3072D) of each CNN was retrieved and combined into a 6144D deep-learned feature vector. In addition, a 218D geometric feature vector extracted from the 49-point shape model and 2376D HOG feature vectors extracted from local patches around facial points were employed as hand-crafted features at the frame level. A relevance vector regressor (RVR) was trained for each of the geometric, HOG, and deep features, and a second-level RVR was trained to fuse the RVR predictions from each modality in the first level. All RVR learners were trained on the frame-level ground truth annotated by PSPI. System performance was evaluated by RMSE (0.99) and PCC (0.67), with the most significant contributions from HOG, followed by the geometric features, and the least from the deep-learned features. The authors attribute these findings to the likelihood that the deep features were fine-tuned to the original AU detection problem in [41], whereas the hand-crafted features were more generic in nature. The authors further note that RMSE effectively measures the difference between the predictions and the ground truth, while Pearson's correlation measures how well the prediction follows the ground truth trend. Although a good system should possess both low RMSE and high Pearson's correlation, a good rating on either metric is not necessarily accompanied by an equally good value on the other.

Liu et al [54] proposed a direct estimation framework based on the UNBC-McMaster dataset to predict the subjective self-reported pain intensity score represented by the VAS via hand-crafted personal features and multi-task learning. The proposed DeepFaceLIFT (LIFT = Learning Important Features) framework is a two-stage personalized model, comprised of a 4-layer neural network (NN) with ReLU activation and Gaussian process (GP) regression models. A fully connected NN was trained on frame-level AAM facial landmark features to predict a frame-level VAS score in a weakly supervised manner, where all frames within a sequence carried the sequence VAS label. Personal features (complexion, age, and gender) were also involved in the training stage via three separate settings (not used, appended to the 3rd layer of the NN, appended to the input features), and the multi-task learning referred to training with data labeled with VAS only, or labeled with both VAS and OPI. Instead of using a temporal model as in typical APD settings, the authors solved the sequence-level pain estimation as a static problem based on sequence-level statistics. A set of ten statistics (mean, median, minimum, maximum, variance, 3rd moment, 4th moment, 5th moment, sum of all values, interquartile range) was generated from the frame-level predictions of the pain score, as these sufficient statistics [37] capture important information that helps to infer unknown parameters in distribution models. The ten-dimensional statistical sequence-level feature vectors were fed into a GP regression model to obtain a personalized estimate of the VAS at the sequence level. The best system performance, evaluated by MAE (2.18) and ICC (0.35), was obtained when the first-stage NN was trained with both VAS and OPI labels in conjunction with personal features appended to the 3rd NN layer, and the second-stage GP input used only the predicted VAS feature. This result suggests that using OPI scores from an external observer contributes to the estimation of the subjective VAS score by exploiting the benefits of multi-task learning.
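The two-stage, statistics-based design of DeepFaceLIFT can be sketched as follows, assuming frame-level landmark features and sequence-level VAS labels; the stand-in MLP, the synthetic data, and the plain GP regressor are our illustrative substitutes for the authors' 4-layer network and personalized GP model.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.gaussian_process import GaussianProcessRegressor

def sequence_stats(frame_preds):
    """Ten sequence-level statistics of the frame-level VAS predictions:
    mean, median, min, max, variance, 3rd-5th central moments, sum, IQR."""
    x = np.asarray(frame_preds, dtype=float)
    m = x.mean()
    return np.array([m, np.median(x), x.min(), x.max(), x.var(),
                     ((x - m) ** 3).mean(), ((x - m) ** 4).mean(), ((x - m) ** 5).mean(),
                     x.sum(), np.percentile(x, 75) - np.percentile(x, 25)])

rng = np.random.default_rng(4)
sequences = [rng.normal(size=(120, 98)) for _ in range(40)]   # 49 landmarks x 2 coords per frame
vas = rng.uniform(0, 10, size=40)                             # sequence-level self-reported VAS

# Stage 1: weakly supervised frame-level regressor; every frame inherits its sequence's VAS.
X1 = np.vstack(sequences)
y1 = np.repeat(vas, [len(s) for s in sequences])
stage1 = MLPRegressor(hidden_layer_sizes=(200, 100, 50), max_iter=200).fit(X1, y1)

# Stage 2: the ten statistics of the frame-level predictions feed a GP regressor for sequence VAS.
X2 = np.stack([sequence_stats(stage1.predict(s)) for s in sequences])
stage2 = GaussianProcessRegressor().fit(X2, vas)
print(np.round(stage2.predict(X2[:3]), 2))
```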
6 DISCUSSION AND FUTURE WORK

Advances in FACS-based automated pain detection can be realized to a large extent from three aspects. Psychological studies provide the theoretical basis for a modality that links the subjective experience of pain to the observation of facial expressions. Progress in computer vision and machine learning research leads to technical implementations of pain detection from facial expressions by mining relevant features from raw input image frames. The system architectures proposed to solve the automated pain detection problem have to be trained and validated with data collected from experiments in clinical settings to demonstrate their effectiveness in practice. The APD problem can be viewed as an extension of the AFER problem, not only because pain is associated with a set of facial action units and their combinations (e.g. PSPI), but also because AFER is the core of an APD framework. In this review we began with a description of a general AFER framework prior to addressing the APD problem due to the similarity and inheritance of methodologies between the two problems. It is worth noting that in recent AFER studies, the face detection and alignment modules are emphasized less than the feature extraction and classification modules. This is likely due to the fact that the fiducial points have been well annotated in most publicly available datasets, and state-of-the-art face detection and alignment algorithms are also available as convenient off-the-shelf toolboxes. Although the choices of customized combinations of geometric, appearance, and motion-based features are highly flexible, SVM or boosting-based classifiers are consistently selected in conventional approaches, where single AUs are generally detected independently using a one-versus-all paradigm. Dynamic approaches are often employed as additional steps to smooth the noisy static AU predictions at the frame level by exploiting temporal information. Dynamic methods are also required in segment-level AU event encoding and in localizing AU events by modeling the temporal phase transitions.

Deep neural networks have received increasing attention in AFER research since 2015, being able to provide an end-to-end solution for AU detection from raw input images. A DNN possesses an advantage over conventional machine learning classifiers in handling multi-class classification naturally, and the DNN architecture can be further modified to jointly detect action units as a multi-label problem. A deep neural network is able to synthesize high-level features from large-scale simple low-level features, and hence it is capable of substituting for the conventional feature extraction module in an AFER framework with an improved modeling of the non-structural facial muscular movement. On the other hand, deep feature descriptors are often used in conjunction with other hand-crafted feature descriptors to enrich facial feature extraction in the current state of AFER research. Although training a DNN from scratch requires a large amount of data, which is not always available in a facial expression dataset, such a constraint may be alleviated via transfer learning from pre-trained state-of-the-art DNN models. However, a DNN framework behaves like a black box in feature selection, in contrast to hand-crafted feature descriptors, because the complexity of its architecture has so far eluded insightful analysis. To address this issue, a saliency map [82] that reflects the gradient change at the input layer caused by a disturbance at the output layer via back-propagation is used to visualize the active regions for feature extraction related to a particular AU.

The primary goal of automated pain detection research in clinical settings is to design an automated system that outperforms human observers in terms of both accuracy and efficiency in recognizing pain from facial expressions. The objective of pain detection is to encode segment-level events in the form of the occurrence or intensity of pain. The ground truth of pain is available from patients' self-report at the segment level in practice, and identification of pain usually requires observation of co-occurring pain-related AUs. Therefore, clinically oriented APD may be best described as a weakly supervised problem based on multi-label AU detection. Although the development of an APD system fully benefits from progress in AFER performance, as virtually any AFER technique can be extended to APD applications, the reported studies on APD are far fewer than those on AFER. A few studies using automated analysis of facial expressions in clinical settings [32] [79] [54] were more focused on psychological studies of interaction with human subjects than on elaborate CVML implementations. The establishment of the UNBC-McMaster Shoulder Pain Expression Archive Dataset has indeed motivated APD research since 2009. However, the UNBC-McMaster dataset is still the only publicly available facial expression dataset about pain so far, which constrains most APD studies to acute shoulder pain. It is worth noting that most present APD models are still in their infancy, with limited consideration of psychological experiments. There is still a big gap between available APD models and the critical demand for pain monitoring applications for non-communicative patients. The major challenge impeding the progress of current APD research is primarily the insufficiency of pain-annotated data for training and validating advanced CVML architectures.

Aside from the continuing evolution of CVML technology, a high priority goal in future APD work will be collecting copious pain-annotated facial expression data by following well-designed protocols in clinical settings and making the new facial expression datasets publicly available. Mining AU relations from big data analysis will in turn help to optimize the theoretical basis by providing more details for developing APD systems tailored to specific pain types.

REFERENCES
[1] Mamoona Arif-Rahu and Mary Jo Grap. Facial expression and pain in the critically ill non-communicative patient: state of science review. Intensive and Critical Care Nursing, 26(6):343–352, 2010.
[2] Ahmed Bilal Ashraf, Simon Lucey, Jeffrey F Cohn, Tsuhan Chen, Zara Ambadar, Kenneth M Prkachin, and Patricia E Solomon. The painful face–pain expression recognition using active appearance models. Image and Vision Computing, 27(12):1788–1796, 2009.
[3] Akshay Asthana, Stefanos Zafeiriou, Shiyang Cheng, and Maja Pantic. Robust discriminative response map fitting with constrained local models. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 3444–3451. IEEE, 2013.
[4] Akshay Asthana, Stefanos Zafeiriou, Shiyang Cheng, and Maja Pantic. Incremental face alignment in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1859–1866, 2014.
[5] Min SH Aung, Sebastian Kaltwang, Bernardino Romera-Paredes, Brais Martinez, Aneesha Singh, Matteo Cella, Michel Valstar, Hongying Meng, Andrew Kemp, Moshen Shafizadeh, et al. The automatic detection of chronic pain-related expression: requirements, challenges and the multimodal EmoPain dataset. IEEE Transactions on Affective Computing, 7(4):435–451, 2016.
[6] Simon Baker and Iain Matthews. Lucas-Kanade 20 years on: A unifying framework. International Journal of Computer Vision, 56(3):221–255, 2004.
[7] Tanja Bänziger and Klaus R Scherer. Introducing the Geneva multimodal emotion portrayal (GEMEP) corpus. Blueprint for Affective Computing: A Sourcebook, pages 271–294, 2010.
[8] Marian Stewart Bartlett, Gwen Littlewort, Mark G Frank, Claudia Lainscsek, Ian R Fasel, and Javier R Movellan. Automatic recognition of facial actions in spontaneous expressions. Journal of Multimedia, 1(6):22–35, 2006.
[9] John N Bassili. Emotion recognition: the role of facial movement and the relative importance of upper and lower areas of the face. Journal of Personality and Social Psychology, 37(11):2049, 1979.
[10] Zhanli Chen, Rashid Ansari, and Diana J Wilkie. Automated detection of pain from facial expressions: a rule-based approach using AAM. In Medical Imaging 2012: Image Processing, volume 8314, page 83143O. International Society for Optics and Photonics, 2012.
[11] Zhanli Chen, Rashid Ansari, and Diana J Wilkie. Learning pain from action unit combinations: A weakly supervised approach via multiple instance learning. arXiv preprint arXiv:1712.01496, 2017.
[12] Dmitry Chetverikov and Renaud Péteri. A brief survey of dynamic texture description and recognition. In Computer Recognition Systems, pages 17–26. Springer, 2005.
[13] Sien W Chew, Patrick Lucey, Simon Lucey, Jason Saragih, Jeffrey F Cohn, Iain Matthews, and Sridha Sridharan. In the pursuit of effective affective computing: The relationship between features and registration. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 42(4):1006–1016, 2012.
[14] Wen-Sheng Chu, Fernando De la Torre, and Jeffrey F Cohn. Selective transfer machine for personalized facial expression analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(3):529–545, 2017.
[15] Timothy F. Cootes, Gareth J. Edwards, and Christopher J. Taylor. Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6):681–685, 2001.
[16] KD Craig. Ontogenetic and cultural influences on the expression of pain in man. In Pain and Society, pages 37–52. Verlag Chemie Weinheim, 1980.

[17] Kenneth D Craig. The facial expression of pain better than a [39] Di Huang, Caifeng Shan, Mohsen Ardabilian, Yunhong Wang, thousand words? APS Journal, 1(3):153–162, 1992. and Liming Chen. Local binary patterns and its application to [18] David Cristinacce and Tim Cootes. Automatic feature localisation facial image analysis: a survey. IEEE Transactions on Systems, Man, with constrained local models. Pattern Recognition, 41(10):3054– and Cybernetics, Part C (Applications and Reviews), 41(6):765–781, 3067, 2008. 2011. [19] David Cristinacce and Timothy F Cootes. Feature detection and [40] Ramin Irani, Kamal Nasrollahi, and Thomas B Moeslund. Pain tracking with constrained local models. In Bmvc, volume 1, recognition using spatiotemporal oriented energy of facial mus- page 3, 2006. cles. In Proceedings of the IEEE Conference on Computer Vision and [20] Kathleen S Deyo, Kenneth M Prkachin, and Susan R Mercer. Pattern Recognition Workshops, pages 80–87, 2015. Development of sensitivity to facial expression of pain. Pain, [41] Shashank Jaiswal and Michel Valstar. Deep learning the dynamic 107(1-2):16–21, 2004. appearance and shape of facial action units. In Applications of [21] Abhinav Dhall, Akshay Asthana, Roland Goecke, and Tom Computer Vision (WACV), 2016 IEEE Winter Conference on, pages Gedeon. Emotion recognition using phog and lpq features. In 1–8. IEEE, 2016. Automatic Face & Gesture Recognition and Workshops (FG 2011), [42]L aszl´ o´ A Jeni, Jeffrey F Cohn, and Fernando De La Torre. Facing 2011 IEEE International Conference on, pages 878–883. IEEE, 2011. imbalanced data–recommendations for the use of performance [22] Xiaoyu Ding, Wen-Sheng Chu, Fernando De la Torre, Jeffery F metrics. In Affective Computing and Intelligent Interaction (ACII), Cohn, and Qiao Wang. Cascade of tasks for facial expression 2013 Humaine Association Conference on, pages 245–251. IEEE, analysis. Image and Vision Computing, 51:36–48, 2016. 2013. [23] Egede, Michel Valstar, and Brais Martinez. Fusing deep [43]L aszl´ o´ A Jeni, Jeffrey M Girard, Jeffrey F Cohn, and Fernando learned and hand-crafted features of appearance, shape, and dy- De La Torre. Continuous au intensity estimation using localized, namics for automatic pain estimation. In Automatic Face & Gesture sparse facial feature space. In 2013 10th IEEE international con- Recognition (FG 2017), 2017 12th IEEE International Conference on, ference and workshops on automatic face and gesture recognition (FG), pages 689–696. IEEE, 2017. pages 1–7. IEEE, 2013. [24] . Facial action coding system (facs). A human face, [44] Bihan Jiang, Michel F Valstar, and Maja Pantic. Action unit de- 2002. tection using sparse appearance descriptors in space-time video [25] Paul Ekman and Wallace V Friesen. The repertoire of nonver- volumes. In Automatic Face & Gesture Recognition and Workshops bal behavior: Categories, origins, usage, and coding. semiotica, (FG 2011), 2011 IEEE International Conference on, pages 314–321. 1(1):49–98, 1969. IEEE, 2011. [26] Paul Ekman and Wallace V Friesen. Manual for the facial action [45] Takeo Kanade, Yingli Tian, and Jeffrey F Cohn. Comprehensive coding system. Consulting Press, 1978. database for facial expression analysis. In fg, page 46. IEEE, 2000. [27] Tom Fawcett. An introduction to roc analysis. Pattern recognition [46] Francis J Keefe and Andrew R Block. Development of an letters, 27(8):861–874, 2006. 
observation method for assessing pain behavior in chronic low [28] Jeffrey M Girard, Wen-Sheng Chu, Laszl´ o´ A Jeni, and Jeffrey F back pain patients. Behavior Therapy, 1982. Cohn. Sayette group formation task (gft) spontaneous facial [47] Maurice G Kendall. A new measure of rank correlation. expression database. In Automatic Face & Gesture Recognition (FG Biometrika, 30(1/2):81–93, 1938. 2017), 2017 12th IEEE International Conference on, pages 581–588. [48] Minyoung Kim and Vladimir Pavlovic. Hidden conditional IEEE, 2017. ordinal random fields for sequence classification. In Joint Eu- [29] Jeffrey M Girard, Jeffrey F Cohn, Laszlo A Jeni, Michael A ropean Conference on Machine Learning and Knowledge Discovery in Sayette, and Fernando De la Torre. Spontaneous facial expression Databases, pages 51–65. Springer, 2010. in unscripted social interactions can be measured automatically. Behavior research methods, 47(4):1136–1147, 2015. [49] Sander Koelstra, Maja Pantic, and Ioannis Patras. A dynamic [30] Amogh Gudi, H Emrah Tasli, Tim M Den Uyl, and Andreas texture-based approach to recognition of facial actions and their IEEE transactions on pattern analysis and machine Maroulis. Deep learning based facs action unit occurrence and temporal models. intelligence intensity estimation. In Automatic Face and Gesture Recognition , 32(11):1940–1954, 2010. (FG), 2015 11th IEEE International Conference and Workshops on, [50] Wei Li, Farnaz Abtahi, and Zhigang Zhu. Action unit detec- volume 6, pages 1–5. IEEE, 2015. tion with region adaptation, multi-labeling learning and optimal [31] Thomas Hadjistavropoulos, Keela Herr, Dennis C Turk, Perry G temporal fusing. In 2017 IEEE Conference on Computer Vision and Fine, Robert H Dworkin, Robert Helme, Kenneth Jackson, Pa- Pattern Recognition (CVPR), pages 6766–6775. IEEE, 2017. tricia A Parmelee, Thomas E Rudy, B Lynn Beattie, et al. An [51] Gwen Littlewort, Jacob Whitehill, Tingfan Wu, Ian Fasel, Mark interdisciplinary expert consensus statement on assessment of Frank, Javier Movellan, and Marian Bartlett. The computer pain in older persons. The Clinical journal of pain, 23:S1–S43, 2007. expression recognition toolbox (cert). In Automatic Face & Ges- [32] Jihun Hamm, Christian G Kohler, Ruben C Gur, and Ragini ture Recognition and Workshops (FG 2011), 2011 IEEE International Verma. Automated facial action coding system for dynamic anal- Conference on, pages 298–305. IEEE, 2011. ysis of facial expressions in neuropsychiatric disorders. Journal of [52] Gwen C Littlewort, Marian Stewart Bartlett, and Kang Lee. neuroscience methods, 200(2):237–256, 2011. Faces of pain: automated measurement of spontaneousallfacial [33] Zakia Hammal and Jeffrey F Cohn. Automatic detection of pain expressions of genuine and posed pain. In Proceedings of the intensity. In Proceedings of the 14th ACM international conference on 9th international conference on Multimodal interfaces, pages 15–21. , pages 47–52. ACM, 2012. ACM, 2007. [34] Zakia Hammal and Jeffrey F Cohn. Towards multimodal pain [53] Gwen C Littlewort, Marian Stewart Bartlett, and Kang Lee. assessment for research and clinical use. In Proceedings of the Automatic coding of facial expressions displayed during posed 2014 Workshop on Roadmapping the Future of Multimodal Interaction and genuine pain. Image and Vision Computing, 27(12):1797–1803, Research including Business Opportunities and Challenges, pages 13– 2009. 17. ACM, 2014. [54] Dianbo Liu, Fengjiao Peng, Andrew Shea, , et al. 
[35] Keela Herr, Patrick J Coyne, Tonya Key, Renee Manworren, Deepfacelift: Interpretable personalized models for automatic Margo McCaffery, Sandra Merkel, Jane Pelosi-Kelly, and Lori estimation of self-reported pain. arXiv preprint arXiv:1708.04670, Wild. Pain assessment in the nonverbal patient: position state- 2017. ment with clinical practice recommendations. Pain Management [55] Patrick Lucey, Jeffrey F Cohn, Takeo Kanade, Jason Saragih, Zara Nursing, 7(2):44–52, 2006. Ambadar, and Iain Matthews. The extended cohn-kanade dataset [36] Abu-Saad HH. Challenge of pain in the cognitively impaired. (ck+): A complete dataset for action unit and emotion-specified Lancet, 2(356):1867–1868, 2000. expression. In Computer Vision and Pattern Recognition Workshops [37] Robert V Hogg and Allen T Craig. Introduction to mathematical (CVPRW), 2010 IEEE Computer Society Conference on, pages 94– statistics.(5”” edition). Upper Saddle River, New Jersey: Prentice 101. IEEE, 2010. Hall, 1995. [56] Patrick Lucey, Jeffrey F Cohn, Iain Matthews, Simon Lucey, [38] Kuang-Jui Hsu, Yen-Yu Lin, and Yung-Yu Chuang. Aug- Sridha Sridharan, Jessica Howlett, and Kenneth M Prkachin. mented multiple instance regression for inferring object con- Automatically detecting pain in video through facial action units. tours in bounding boxes. IEEE Transactions on Image Processing, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cyber- 23(4):1722–1736, 2014. netics), 41(3):664–674, 2011. 18

[57] Patrick Lucey, Jeffrey F Cohn, Kenneth M Prkachin, Patricia E Solomon, Sien Chew, and Iain Matthews. Painful monitoring: Automatic pain monitoring using the UNBC-McMaster shoulder pain expression archive database. Image and Vision Computing, 30(3):197–205, 2012.
[58] Patrick Lucey, Jeffrey F Cohn, Kenneth M Prkachin, Patricia E Solomon, and Iain Matthews. Painful data: The UNBC-McMaster shoulder pain expression archive database. In Automatic Face & Gesture Recognition and Workshops (FG 2011), 2011 IEEE International Conference on, pages 57–64. IEEE, 2011.
[59] Patrick Lucey, Jessica Howlett, Jeff Cohn, Simon Lucey, Sridha Sridharan, and Zara Ambadar. Improving pain recognition through better utilisation of temporal information. In International Conference on Auditory-Visual Speech Processing, volume 2008, page 167. NIH Public Access, 2008.
[60] Dennis H Lundtoft, Kamal Nasrollahi, Thomas B Moeslund, and Sergio Escalera. Spatiotemporal facial super-pixels for pain detection. In International Conference on Articulated Motion and Deformable Objects, pages 34–43. Springer, 2016.
[61] Paolo L Manfredi, Brenda Breuer, Diane E Meier, and Leslie Libow. Pain assessment in elderly patients with severe dementia. Journal of Pain and Symptom Management, 25(1):48–52, 2003.
[62] S Mohammad Mavadati, Mohammad H Mahoor, Kevin Bartlett, Philip Trinh, and Jeffrey F Cohn. DISFA: A spontaneous facial action intensity database. IEEE Transactions on Affective Computing, 4(2):151–160, 2013.
[63] BE McGuire, P Daly, and F Smyth. Chronic pain in people with an intellectual disability: under-recognised and under-treated? Journal of Intellectual Disability Research, 54(3):240–245, 2010.
[64] Gary McKeown, Michel Valstar, Roddy Cowie, Maja Pantic, and Marc Schröder. The SEMAINE database: Annotated multimodal records of emotionally colored conversations between a person and a limited agent. IEEE Transactions on Affective Computing, 3(1):5–17, 2012.
[65] Maja Pantic and Ioannis Patras. Dynamics of facial expression: recognition of facial actions and their temporal segments from face profile image sequences. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 36(2):433–449, 2006.
[66] Maja Pantic, Michel Valstar, Ron Rademaker, and Ludo Maat. Web-based database for facial expression analysis. In 2005 IEEE International Conference on Multimedia and Expo, page 5. IEEE, 2005.
[67] Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, et al. Deep face recognition. In BMVC, volume 1, page 6, 2015.
[68] Jean-François Payen, Olivier Bru, Jean-Luc Bosson, Anna Lagrasta, Eric Novel, Isabelle Deschaux, Pierre Lavagne, and Claude Jacquot. Assessing pain in critically ill sedated patients by using a behavioral pain scale. Critical Care Medicine, 29(12):2258–2263, 2001.
[69] Kenneth M Prkachin. Assessing pain by facial expression: facial expression as nexus. Pain Research and Management, 14(1):53–58, 2009.
[70] Kenneth M Prkachin and Patricia E Solomon. The structure, reliability and validity of pain expression: Evidence from patients with shoulder pain. Pain, 139(2):267–274, 2008.
[71] Kathleen A Puntillo, Ann B Morris, Carol L Thompson, Julie Stanik-Hutt, Cheri A White, and Lorie R Wild. Pain behaviors observed during six common procedures: results from Thunder Project II. Critical Care Medicine, 32(2):421–427, 2004.
[72] Pau Rodriguez, Guillem Cucurull, Jordi Gonzàlez, Josep M Gonfaus, Kamal Nasrollahi, Thomas B Moeslund, and F Xavier Roca. Deep pain: Exploiting long short-term memory networks for facial expression classification. IEEE Transactions on Cybernetics, 2017.
[73] Rosa Rojo, Juan Carlos Prados-Frutos, and Antonio López-Valverde. Pain assessment using the facial action coding system. A systematic review. Medicina Clínica (English Edition), 145(8):350–355, 2015.
[74] Ognjen Rudovic, Vladimir Pavlovic, and Maja Pantic. Kernel conditional ordinal random fields for temporal segmentation of facial action units. In European Conference on Computer Vision, pages 260–269. Springer, 2012.
[75] Ognjen Rudovic, Vladimir Pavlovic, and Maja Pantic. Context-sensitive dynamic ordinal regression for intensity estimation of facial action units. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(5):944–958, 2015.
[76] Georgia Sandbach, Stefanos Zafeiriou, Maja Pantic, and Lijun Yin. Static and dynamic 3D facial expression recognition: A comprehensive survey. Image and Vision Computing, 30(10):683–697, 2012.
[77] Michael A Sayette, Kasey G Creswell, John D Dimoff, Catharine E Fairbairn, Jeffrey F Cohn, Bryan W Heckman, Thomas R Kirchner, John M Levine, and Richard L Moreland. Alcohol and group formation: A multimodal investigation of the effects of alcohol on emotion and social bonding. Psychological Science, 23(8):869–878, 2012.
[78] Karan Sikka. Facial expression analysis for estimating pain in clinical settings. In Proceedings of the 16th International Conference on Multimodal Interaction, pages 349–353. ACM, 2014.
[79] Karan Sikka, Alex A Ahmed, Damaris Diaz, Matthew S Goodwin, Kenneth D Craig, Marian S Bartlett, and Jeannie S Huang. Automated assessment of children's postoperative pain using computer vision. Pediatrics, 136(1):e124–e131, 2015.
[80] Karan Sikka, Abhinav Dhall, and Marian Stewart Bartlett. Classification and weakly supervised pain localization using multiple segment representation. Image and Vision Computing, 32(10):659–670, 2014.
[81] Karan Sikka, Gaurav Sharma, and Marian Bartlett. LOMO: Latent ordinal model for facial analysis in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5580–5589, 2016.
[82] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
[83] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[84] Michael JL Sullivan, Heather Adams, and Maureen E Sullivan. Communicative dimensions of pain catastrophizing: social cueing effects on pain behaviour and coping. Pain, 107(3):220–226, 2004.
[85] Bo Sun, Liandong Li, Tian Zuo, Ying Chen, Guoyan Zhou, and Xuewen Wu. Combining multimodal features with hierarchical classifier fusion for emotion recognition in the wild. In Proceedings of the 16th International Conference on Multimodal Interaction, pages 481–486. ACM, 2014.
[86] Richard Szeliski. Computer Vision: Algorithms and Applications. Springer Science & Business Media, 2010.
[87] Y-I Tian, Takeo Kanade, and Jeffrey F Cohn. Recognizing action units for facial expression analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(2):97–115, 2001.
[88] Yan Tong, Wenhui Liao, and Qiang Ji. Facial action unit recognition by exploiting their dynamic and semantic relationships. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(10), 2007.
[89] Zoltán Tősér, László A Jeni, András Lőrincz, and Jeffrey F Cohn. Deep learning for facial action unit detection under large head poses. In European Conference on Computer Vision, pages 359–371. Springer, 2016.
[90] Georgios Tzimiropoulos and Maja Pantic. Fast algorithms for fitting active appearance models to unconstrained images. International Journal of Computer Vision, 122(1):17–33, 2017.
[91] Michel F Valstar, Timur Almaev, Jeffrey M Girard, Gary McKeown, Marc Mehu, Lijun Yin, Maja Pantic, and Jeffrey F Cohn. FERA 2015 - second facial expression recognition and analysis challenge. In Automatic Face and Gesture Recognition (FG), 2015 11th IEEE International Conference and Workshops on, volume 6, pages 1–8. IEEE, 2015.
[92] Michel F Valstar, Marc Mehu, Bihan Jiang, Maja Pantic, and Klaus Scherer. Meta-analysis of the first facial expression recognition challenge. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 42(4):966–979, 2012.
[93] Michel F Valstar and Maja Pantic. Fully automatic recognition of the temporal phases of facial actions. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 42(1):28–43, 2012.
[94] Michel F Valstar, Enrique Sánchez-Lozano, Jeffrey F Cohn, László A Jeni, Jeffrey M Girard, Zheng Zhang, Lijun Yin, and Maja Pantic. FERA 2017 - addressing head pose in the third facial expression recognition and analysis challenge. In Automatic Face & Gesture Recognition (FG 2017), 2017 12th IEEE International Conference on, pages 839–847. IEEE, 2017.
[95] Paul Viola and Michael Jones. Rapid object detection using a boosted cascade of simple features. In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, volume 1, pages I–I. IEEE, 2001.
[96] Robert Walecki, Vladimir Pavlovic, Björn Schuller, Maja Pantic, et al. Deep structured learning for facial action unit intensity estimation. arXiv preprint arXiv:1704.04481, 2017.
[97] Philipp Werner, Ayoub Al-Hamadi, Robert Niese, Steffen Walter, Sascha Gruss, and Harald C Traue. Towards pain monitoring: Facial expression, head pose, a new database, an automatic system and remaining challenges. In Proceedings of the British Machine Vision Conference, pages 119–1, 2013.
[98] Diana J Wilkie. Facial expressions of pain in lung cancer. Analgesia, 1(2):91–99, 1995.
[99] Amanda C de C Williams. Facial expression of pain: an evolutionary account. Behavioral and Brain Sciences, 25(4):439–455, 2002.
[100] Baoyuan Wu, Siwei Lyu, Bao-Gang Hu, and Qiang Ji. Multi-label learning with missing labels for image annotation and facial action unit recognition. Pattern Recognition, 48(7):2279–2289, 2015.
[101] Shan Wu, Shangfei Wang, Bowen Pan, and Qiang Ji. Deep facial action unit recognition from partially labeled data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3951–3959, 2017.
[102] Tingfan Wu, Marian S Bartlett, and Javier R Movellan. Facial expression recognition using Gabor motion energy filters. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on, pages 42–47. IEEE, 2010.
[103] Tingfan Wu, Nicholas J Butko, Paul Ruvolo, Jacob Whitehill, Marian S Bartlett, and Javier R Movellan. Multilayer architectures for facial action unit recognition. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 42(4):1027–1038, 2012.
[104] Jing Xiao, Simon Baker, Iain Matthews, and Takeo Kanade. Real-time combined 2D+3D active appearance models. In CVPR (2), pages 535–542, 2004.
[105] Xuehan Xiong and Fernando De la Torre. Supervised descent method and its applications to face alignment. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 532–539. IEEE, 2013.
[106] Yan Xu, Jun-Yan Zhu, Eric I-Chao Chang, Maode Lai, and Zhuowen Tu. Weakly supervised histopathology cancer image segmentation and classification. Medical Image Analysis, 18(3):591–604, 2014.
[107] Ghada Zamzmi, Chih-Yun Pai, Dmitry Goldgof, Rangachar Kasturi, Yu Sun, and Terri Ashmeade. Machine-based multimodal pain assessment tool for infants: a review. arXiv preprint arXiv:1607.00331, 2016.
[108] Cha Zhang, John C Platt, and Paul A Viola. Multiple instance boosting for object detection. In Advances in Neural Information Processing Systems, pages 1417–1424, 2006.
[109] Cha Zhang and Zhengyou Zhang. A survey of recent advances in face detection. 2010.
[110] Xing Zhang, Lijun Yin, Jeffrey F Cohn, Shaun Canavan, Michael Reale, Andy Horowitz, Peng Liu, and Jeffrey M Girard. BP4D-Spontaneous: a high-resolution spontaneous 3D dynamic facial expression database. Image and Vision Computing, 32(10):692–706, 2014.
[111] Kaili Zhao, Wen-Sheng Chu, Fernando De la Torre, Jeffrey F Cohn, and Honggang Zhang. Joint patch and multi-label learning for facial action unit detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2207–2216, 2015.
[112] Kaili Zhao, Wen-Sheng Chu, and Honggang Zhang. Deep region and multi-label learning for facial action unit detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3391–3399, 2016.
[113] Yuqian Zhou, Jimin Pi, and Bertram E Shi. Pose-independent facial action unit intensity regression based on multi-task deep transfer learning. In Automatic Face & Gesture Recognition (FG 2017), 2017 12th IEEE International Conference on, pages 872–877. IEEE, 2017.
[114] Xiangxin Zhu and Deva Ramanan. Face detection, pose estimation, and landmark localization in the wild. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2879–2886. IEEE, 2012.

Zhanli Chen is a Ph.D. student in the Department of Electrical and Computer Engineering at the University of Illinois at Chicago (UIC). He received the M.Sc. degree in electrical engineering from the Hong Kong University of Science and Technology (HKUST) and the B.Eng. degree in electrical engineering from the Nanjing University of Posts and Telecommunications (NUPT). He is a student member of the IEEE, and his research interests involve automated pain detection from facial expressions in clinical conditions.

Rashid Ansari received the Ph.D. degree in electrical engineering and computer science in 1981 from Princeton University, Princeton, NJ, and the B.Tech. and M.Tech. degrees in electrical engineering from the Indian Institute of Technology, Kanpur, India. He is currently Professor and Head of the Department of Electrical and Computer Engineering at the University of Illinois at Chicago (UIC). Before joining UIC he served as a Research Scientist at Bell Communications Research and on the faculty of Electrical Engineering at the University of Pennsylvania. His research interests are in the general areas of signal processing and communications, with recent focus on image and video analysis, multimedia communication, and medical imaging. In the past he served as Associate Editor of the IEEE Transactions on Image Processing, IEEE Signal Processing Letters, and IEEE Transactions on Circuits and Systems. He has served as a member of the Digital Signal Processing Technical Committee of the IEEE Circuits and Systems Society and as a member of the Image, Video, and Multidimensional Signal Processing Technical Committee. He was a member of the organizing and program committees of several past IEEE conferences. He served on the organizing and executive committees of the Visual Communications and Image Processing (VCIP) conferences and was General Chair of the 1996 VCIP Conference. He was elected a Fellow of the IEEE in 1999.

Diana Wilkie, Ph.D., R.N., FAAN, came to the University of Florida from the University of Illinois at Chicago, where she served as professor and Harriet J. Werley Endowed Chair for Nursing Research. She joined UF in 2015. She is a fellow of the American Academy of Nursing and served for five years as an American Cancer Society professor of oncology nursing. She has authored or co-authored more than 210 scholarly publications and serves as a reviewer for a number of journals, including the Journal of Palliative Medicine, Cancer Nursing, and the Journal of Pain. As a member of the National Academy of Medicine (formerly the Institute of Medicine), she has devoted her research program to the management of cancer pain and to end-of-life issues. She has been continuously funded since 1986 by numerous organizations, such as the National Institutes of Health, the National Cancer Institute, and the Robert Wood Johnson Foundation, with awards totaling more than $48 million. She now directs the UF Center for Palliative Care Research and Education.