Arxiv:1805.05563V1 [Cs.CV] 15 May 2018 Sion

International Journal on Computer Vision https://doi.org/10.1007/s11263-018-1097-z

Facial Landmark Detection: a Literature Survey

Yue Wu Qiang Ji ·

Received: Nov. 2016 / Accepted: Apr. 2018

Abstract The locations of the fiducial facial landmark points around facial components and facial contour capture the rigid and non-rigid facial deformations due to head movements and facial expressions. They are hence important for various facial analysis tasks. Many facial landmark detection algorithms have been developed to automatically detect those key points over the (a) (b) years, and in this paper, we perform an extensive review of them. We classify the facial landmark detection Fig. 1: (a) Facial landmarks defining the face shape. (b) algorithms into three major categories: holistic meth- Sample images [10] with annotated facial landmarks. ods, Constrained Local Model (CLM) methods, and the regression-based methods. They differ in the ways to utilize the facial appearance and shape information. arate section to review the latest deep learning based The holistic methods explicitly build models to repre- algorithms. sent the global facial appearance and shape informa- The survey also includes a listing of the benchmark tion. The CLMs explicitly leverage the global shape databases and existing software. Finally, we identify model but build the local appearance models. The re- future research directions, including combining meth- gression based methods implicitly capture facial shape ods in different categories to leverage their respective and appearance information. For algorithms within each strengths to solve landmark detection “in-the-wild”. category, we discuss their underlying theories as well as Keywords Facial landmark detection survey their differences. We also compare their performances · on both controlled and in the wild benchmark datasets, under varying facial expressions, head poses, and occlu- 1 Introduction arXiv:1805.05563v1 [cs.CV] 15 May 2018 sion. Based on the evaluations, we point out their respective strengths and weaknesses. There is also a sep- The face plays an important role in visual communi- cation. By looking at the face, human can automatically extract many nonverbal messages, such as hu- Yue Wu mans’ identity, intent, and emotion. In computer vision, Department of Electrical, Computer, and Systems Engineer- to automatically extract those facial information, the ing, Rensselaer Polytechnic Institute, 110 8th Street, Troy, localizations of the fiducial facial key points (Figure 1) NY 12180-3590 E-mail: [email protected] are usually a key step and many facial analysis methods are built up on the accurate detection of those landmark Qiang Ji Department of Electrical, Computer, and Systems Engineer- points. For example, facial expression recognition [68] ing, Rensselaer Polytechnic Institute, 110 8th Street, Troy, and head pose estimation algorithms [66] may heavily NY 12180-3590 rely on the facial shape information provided by the E-mail: [email protected] landmark locations. The facial landmark points around 2 Yue Wu, Qiang Ji

Table 1: Comparison of the major facial landmark detection algorithms.

Algorithms Appearance Shape Performance Speed Holistic Method whole face explicit poor generalization/good slow/fast Constrained Local Method (CLM) local patch explicit good slow/fast Regression-based Method local patch/whole face implicit good/very good fast/very fast eyes can provide the initial guess of the pupil center po- Facial landmark detection algorithms can be clas- sitions for eye detection and eye gaze tracking [41]. For sified into three major categories: the holistic meth- facial recognition, the landmark locations on 2D image ods, the Constrained Local Model (CLM) methods, and are usually combined with 3D head model to “frontal- regression-based methods, depending on how they model ize” the face and help reduce the significant within- the facial appearance and facial shape patterns. The fa- subject variations to improve recognition accuracy [92]. cial appearance refers to the distinctive pixel intensity The facial information gained through the facial land- patterns around the facial landmarks or in the whole mark locations can provide important information for face region, while face shape patterns refer to the pat- human and computer interaction, entertainment, secu- terns of the face shapes as defined by the landmark loca- rity surveillance, and medical applications. tions and their spatial relationships. As summarized in Facial landmark detection algorithms aim to auto- Table 1, the holistic methods explicitly model the holis- matically identify the locations of the facial key land- tic facial appearance and global facial shape patterns. mark points on facial images or videos. Those key points CLMs rely on the explicit local facial appearance and are either the dominant points describing the unique lo- explicit global facial shape patterns. The regression- cation of a facial component (e.g., eye corner) or an based methods use holistic or local appearance infor- interpolated point connecting those dominant points mation and they may embed the global facial shape around the facial components and facial contour. For- patterns implicitly for joint landmark detection. In gen- mally, given a facial image denoted as , a landmark eral, the regression-based methods show better perfor- I detection algorithm predicts the locations of D land- mances recently (details will be discussed later). Note marks x = x , y , x , y , ..., x , y , where x and y that, some recent methods combine the deep learning { 1 1 2 2 D D} . . resentment the image coordinates of the facial land- models and global 3D shape models for detection and marks. they are outside the scope of the three major categories. They will be discussed in detail in Section 4.3. Facial landmark detection is challenging for sev- The remaining parts of the paper are organized as eral reasons. First, facial appearance changes signifi- follows. In Sections 2, 3, and 4, we discuss methods in cantly across subjects under different facial expressions the three major categories: the holistic methods, the and head poses. Second, the environmental conditions Constrained Local Model methods, and the regression- such as the illumination would affect the appearance based methods. Section 4.3 is dedicated to the review of the faces on the facial images. Third, facial occlu- of the recent deep learning based methods. In Section sion by other objects or self-occlusion due to extreme 5, we discus the relationships among methods in the head poses would lead to incomplete facial appearance three major categories. In Section 6, we discuss the lim- information. itations of the existing algorithms in “in-the-wild” con- Over the past few decades, there have been signifi- ditions and some advanced algorithms that are specifi- cant developments of the facial landmark detection al- cally designed to handle those challenges. In Section 7, gorithms. The early works focus on the less challenging we discuss related topics, such as face detection, facial facial images without the aforementioned facial varia- landmark tracking, and 3D facial landmark detection. tions. Later, the facial landmark detection algorithms In Section 8, we discuss facial landmark annotations, aim to handle several variations within certain cate- the popular facial landmark detection databases, soft- gories, and the facial images are usually collected with ware, and the evaluation of the leading algorithms. Fi- “controlled” conditions. For example, in “controlled” nally, we summarize the paper in Section 9, where we conditions, the facial poses and facial expressions can point out future directions. only be in certain categories. More recently, the research focuses on the challenging “in-the-wild” conditions, in which facial images can undergo arbitrary facial expres- 2 Holistic methods sions, head poses, illumination, facial occlusions, etc. In general, there is still a lack of a robust method that can Holistic methods explicitly leverage the holistic fa- handle all those variations. cial appearance information as well as the global facial Facial Landmark Detection: a Literature Survey 3 shape patterns for facial landmark detection (Figure 2). In the following, we first introduce the classic holistic method: the Active Appearance Model (AAM) [18]. Then, we introduce its several extensions. Fig. 3: Learned shape variations using AAM model, adapted from [18].

Shape model

Appearance model Fig. 4: Learned appearance variations using AAM, adapted from [18]. Training image Shape free textures

Fig. 2: Holistic Model. Third, to learn the appearance model, image wrap- ping is applied to register the image to the mean shape and generate the shape normalized facial image, denoted as ( (x0 )), where (.) indicates the wrap- Ii W i W ping operation. Then, PCA is applied again on the N 2.1 Active Appearance Model shape normalized facial images ( (x0 )) to gen- {Ii W i }i=1 erate a mean appearance A0 and Ka appearance bases 1 The Active Appearance Model (AAM) was introduced = A Ka , as shown in Figure 4. Given the ap- A { m}m=1 by Taylor and Cootes [27][18]. It is a statistical model pearance model = A Ka , each shape normalized A { m}m=0 that fits the facial images with a small number of coeffi- facial image can be represented using the appearance cients, controlling both the facial appearance and shape Ka coefficients λ = λm m=1. variations. During model construction, AAM builds the { } global facial shape model and the holistic facial ap- Ka ( (x0)) = + λ (2) pearance model sequentially based on Principal Com- I W A0 mAm m=1 ponent Analysis (PCA). During detection, it identifies X the landmark locations by fitting the learned appear- An optional third model may be applied to learn the ance and shape models to the testing images correlations among the shape coefficients p and appear- There are a few steps for AAM to construct the ance coefficients λ. appearance and shape models, given the training facial In landmark detection, AAM finds the shape and N images with landmark annotations, denoted as i, xi , appearance coefficients p and λ, as well as the affine {I }i=1 where N is the number of training images. First, Pro- transformation parameters c, θ, t , t that best fit the { c r} crustes Analysis [36] is applied to register all the train- testing image, which determine the landmark locations: ing facial shapes. It removes the affine transformation (c, θ, tc, tr denote the scale, rotation and translation Ks parameters) of each face shape xi and generates the x = cR2d(θ)(s0 + pn sn) + t. (3) normalized training facial shapes x . Second, given the ∗ i0 n=1 N normalized training facial shapes x0 , PCA is ap- X { i}i=1 plied to learn the mean shape s and a few orthonor- Here, R2d(θ) denotes the rotation matrix, and t = tc, tr . 0 { } mal bases s Ks that capture the shape variations, To simplify the notation, in the following the shape co- { n}n=1 where Ks is the number of bases (Figure 3). Given the efficients would include both PCA coefficients and affine learned facial shape bases s Ks , a normalized facial transformation parameters. { n}n=0 shape x0 can be represented using the shape coefficients In general, the fitting procedure can be formulated Ks p = pn n=1: by minimizing the distance between the reconstructed { } images + Ka λ and the shape normalized A0 m=1 mAm Ks testing image ( (p)). The difference is usually re- x0 = s + p s . (1) PI W 0 n ∗ n ferred to as the error image, denoted as ∆ : n=1 A X Ka 1 In this paper, we refer Active Appearance Model to the ∆ (λ, p) = Diff( 0 + λm m, ( (p))) (4) model, independent of the fitting algorithms. A A A I W m=1 X 4 Yue Wu, Qiang Ji

Recently, more advanced analytic fitting methods only estimate the shape coefficients, which fully deter- λ∗, p∗ = arg min ∆ (λ, p) (5) mine the landmark locations as in Equation 3. For ex- λ,p A ample, in [4], the Bayesian Active Appearance Model In the conventional AAM [27][18], model coefficients formulates AAM as a probabilistic PCA problem. It are estimated by iterative calculation of the error im- treats the texture coefficients as hidden variables and age based on the current model coefficients and model marginalizes them out to solve for the shape coeffi- coefficient update prediction based on the error image. cients:

p˜ = arg max lnp(p) = arg max ln p(p λ)p(λ)dλ. (7) p p | 2.2 Fitting algorithms Zλ But exactly integrating out λ can be computationally Most of the holistic methods focus on improving the expensive. To alleviate this problem, in [99], Tzimiropou- fitting algorithms, which involves solving Equation 5. los and Pantic propose the fast-SIC algorithm and the They can be classified into the analytic fitting method fast-forward algorithm. In both algorithms, the appear- and learning-based fitting method. ance coefficient updates ∆λ are represented in terms of the shape coefficient updates ∆p, and they are plugged 2.2.1 Analytic fitting methods into the nonlinear least square formulation, which is then directly minimized to solve for ∆p. Different from The analytic fitting methods formulate AAM fitting [4], [99] follows a deterministic approach. problem as a nonlinear optimization problem and solve it analytically. In particular, the algorithm searches the 2.2.2 Learning-based fitting methods best set of shape and appearance coefficients p, λ that minimize the difference between reconstructed image Instead of directly solving the fitting problem analyti- and the testing image with a nonlinear least squares cally, the learning-based fitting methods learn to pre- formulation: dict the shape and appearance coefficients from the im- M age appearances. They can be further classified into lin- p˜, λ˜ = arg min + λ ( (p)) 2. (6) ear regression fitting methods, nonlinear regression fit- kA0 mAm − I W k2 p,λ m ting methods, and other learning-based fitting methods. X Linear regression fitting methods: The linear Here, + M λ represents the reconstructed face A0 m mAm regression fitting methods assume that there is linear in the shape normalized frame depending on the shape P relationship between model coefficient updates and the and appearance coefficients, and the whole objective error image ∆ (λ, p) or image features (λ, p). They A I function represents the reconstruction error. learn linear regression function for the prediction, which One natural way to solve the optimization problem follows the conventional AAM as illustrated in the last is to use the Gaussian-Newton methods. However, since section. the Jacobin and Hessian matrix for both p and λ need Linear Regression to be calculated for each iteration [48], the fitting pro- ∆ (λ, p) or (λ, p) ∆λ, ∆p (8) cedure is usually very slow. To address this problem, A I −−−−−−−−−−−−→ Baker and Matthews proposed a series of algorithms, They therefore estimate the model coefficients by iter- among which the Project Out Inverse Compositional atively estimating the model coefficient updates, and algorithm (POIC) [63] and the Simultaneous Inverse add them to the currently estimated coefficients for Compositional (SIC) algorithm [7] are two popular the prediction in the next iteration. For example, in works. In POIC [63], the errors are projected into space [26], Canonical Correlation Analysis (CCA) is applied spanned by the appearance eigenvectors Ka , and to model the correlation between the error image and {Am}m=1 its orthogonal complement space. The shape coefficients the model coefficient updates. It then learns the linear are firstly searched in the appearance space, and the ap- regression function to map the canonical projections of pearance coefficients are then searched in the orthogo- the error image to the coefficient updates. Similarly, in nal space, given the shape coefficients. In SIC [7], the Direct Appearance Model [43], linear model is applied appearance and shape coefficients are estimated jointly to directly predict the shape coefficient updates from with gradient descent algorithm. Compared to POIC, the principal components of the error images. The lin- SIC is more computationally intensive, but generalizes ear regression fitting methods usually differ in the used better than POIC [38]. image features, linear regression models, and whether Facial Landmark Detection: a Literature Survey 5 to go to a different feature space to learn the mapping The learned correlation between shape and appearance [26] [43]. coefficients can reduce the number of parameters. Such Nonlinear regression fitting methods: The lin- learned correlation may not generalize well to different ear regression fitting methods assume that the relation- images. The joint estimation of shape and appearance ship between features and the error image as shown in coefficients using Equation 6 can be more accurate. But Equation 4 around the true solution of model coeffi- they are more difficult. cients are close to quadratic, which ensures that an iterative procedure with linear updates and adaptive step size would lead to convergence. However, this linear as- 2.3 Other extensions sumption is only true when the initialization is around the true solution which makes the linear regression fit- 2.3.1 Feature representation ting methods sensitive to the initialization. To tackle There are other extensions of the conventional AAM this problem, the nonlinear regression fitting methods methods. One particular direction is to improve the use nonlinear models to learn the relationship among feature representations. It is well known that the AAM the image features and the model coefficient updates: model has limited generalization ability and it has diffi- Nonlinear Regression ∆ (λ, p) or (λ, p) ∆λ, ∆p (9) culty fitting the unseen face variations (e.g. across sub- A I −−−−−−−−−−−−−−−→ jects, illumination, partial occlusion, etc.) [37][38]. This For example, in [83], boosting algorithm is proposed limitation is partially due to the usage of raw pixel in- to predict the coefficient updates from the appearance. tensity as features. To tackle this problem, some algo- It combines a set of weak learners to form a strong rithms use more robust image features. For example, regressor, and the weak learners are developed based in [45], instead of using the raw pixel intensity, the on the Haar-like features and the one-dimensional de- wavelet features are used to model the facial appear- cision stump. The strong nonlinear regressor can be ance. In addition, only the local appearance information considered as an additive piecewise function. In [95], is used to improve the robustness to partial occlusion Tresadern et al. compared the linear and nonlinear re- and illumination. In [47], Gabor wavelet with Gaussian gression algorithms. The used nonlinear regression al- Mixture model is used to model the local image ap- gorithms include the additive piecewise function devel- pearance, which enables fast local point search. Both oped with boosting algorithm in [83] and the Relevance methods improve the performances of the conventional Vector Machine [104]. They empirically showed that AAM method. nonlinear regression method is better at the first few iterations to avoid the local minima, while linear re- 2.3.2 Ensemble of AAM models gression is better when the estimation is close to the true solution. A single AAM model inherently assumes linearity in face shape and appearance variation. Realizing this lim- 2.2.3 Discussion: analytic fitting methods vs. itation, some methods utilize ensemble models to im- learning-based fitting methods prove the performance. For example, in [71], sequential regression AAM model is proposed, which trains a seri- Compared to the analytic fitting methods solved with als of AAMs for sequential model fitting in a cascaded gradient descent algorithm with explicit calculation of manner. AAM in the early stage takes into account of the Hessian and Jacobian matrices, the learning-based the large variations (e.g. pose), while those in the later fitting methods use constant linear or nonlinear regres- stage fits to the small variations. In this work, both sion functions to approximate the steepest descent di- independent ensemble AAMs and coupled sequential rection. As a result, the learning-based fitting methods AAMs are used. The independent ensemble AAMs use are generally fast but they may not be accurate. The independently perturbed model coefficients with differ- analytic methods do not need training images, while the ent settings, while the coupled ensemble AAMs apply fitting methods do. The learning-based fitting methods the learned prediction model in the first few levels to usually use a third PCA to learn the joint correlations generate the perturbed training data in the later level. among the shape and appearance coefficients and fur- Both boosting regression and random forest regressions ther reduce the number of unknown coefficients, while are utilized predict the model updates. Similar to the the analytic fitting methods usually do not. But, for coupled ensemble AAMs in [71], in [82], AAM fitting the analytic fitting methods, the interaction among ap- problem is formulated as an optimization problem with pearance and shape coefficients can be embedded in stochastic gradient descent solution. It leads to an ap- the joint fitting objective function as in Equation 6. proximated algorithm that iteratively learns the linear 6 Yue Wu, Qiang Ji

Here, each landmark location xd is determined by p as in Equation 3. In the probabilistic view point, CLM can be inter- preted as maximizing the product of the prior proba- Facial image Local appearance model Global face shape model Facial landmark bility of the facial shape patterns p(x; η) of all points detect each landmark captures the face shape detection results independently variations to refine the and the local appearance likelihoods p(xd ; θd) of each local detection results |I point:

Fig. 5: Constrained Local Method. D x˜ = arg max p(x; η) p(xd ; θd) (12) x |I d=1 prediction model from the training data with perturbed Y model coefficients at the first iteration, and then up- Similar to the deterministic formulation, the prior can dates the model coefficients for training data to be used also be applied to the shape coefficients p, and Equa- in the next iteration following a cascaded manner. Dif- tion 12 becomes: ferent from [71], [82] uses different subsets of training D data in different stages to escape from the local minima. p˜ = arg max p(p; ηp) p(xd(p) ; θd). (13) p |I dY=1 3 Constrained local methods For both the deterministic and the probabilistic CLMs, there are two major components. The first component is the local appearance model embedded in D (x , ) or As shown in Figure 5, the Constrained Local Model d d I p(x ; θ ) in Equations 10, 11, 12, and 13. The second (CLM) methods [24][84] infer the landmark locations x d|I d based on the global facial shape patterns as well as the component refers to the facial shape pattern constraints independent local appearance information around each either applied to the shape model coefficients p or the landmark, which is easier to capture and more robust shape x itself, as penalty terms or probabilistic prior to illumination and occlusion, comparing to the holistic distributions. The two components are usually learned appearance. separately during training and they are combined to infer landmark locations during landmark detection. In the following, we will discuss each component, and how 3.1 Problem formulation to combine them for landmark detection.

In general, CLM can be formulated either as a deterministic or probabilistic method. In the deterministic 3.2 Local appearance model view point, CLM [24][84] finds the landmarks by minimizing the misalignment errors subject to the shape The local appearance model assigns confidence score patterns: Dd(xd, ) or probability p(xd ; θd) that the landmark I |I with index d is located at a specific pixel location xd D based on the local appearance information around xd x˜ = arg min Q(x) + Dd(xd, ) (10) x I of image . The local appearance models can be cate- d=1 I X gorized into classifier-based local appearance model and Here, xd represents the positions of different landmarks the regression-based local appearance model. in x. D (x , ) represents the local confidence score d d I around xd . Q(x) represents a regularization term to 3.2.1 Classifier-based local appearance model penalize the infeasible or anti-anthropology face shapes in a global sense. The intuition is that we want to find The classifier-based local appearance model trains bi- the best set of landmark locations that have strong in- nary classifier to distinguish the positive patches that dependent local support for each landmark and satisfy are centered at the ground truth locations and the neg- the global shape constraint. ative patches that are far away from the ground truth The shape regularization can be applied to the shape locations. During detection, the classifier can be applied coefficients p. If we denote the regularization term as to different pixel locations to generate the confidence Qp(p), Equation 10 becomes: scores D (x , ) or probabilities p(x ; θ ) through vot- d d I d|I d ing. Different image features and classifiers are used. D For example, in the original CLM work [24] and FPLL p˜ = arg min Qp(p) + Dd(xd(p), ) (11) p I [125], template based method is used to construct the Xd=1 Facial Landmark Detection: a Literature Survey 7 classifier. The original CLM uses the raw image patch, while FPLL uses the HOG feature descriptor. In [11][10], SIFT feature descriptor and SVM classifier are used to learn the appearance model. In [22], Gentle Boost classifier is used. One issue with the classifier-based local appearance model is that it is unclear which feature representation and classifier to use. Even though SIFT Fig. 6: Regression-based local appearance model [19]. and HOG features and SVM classifier are the popular choice, there is some work[105] that learns the features using the deep learning methods, and it is shown that For example, a large local patch is more robust, while the learned features are comparable to HOG and SIFT. it is less accurate for precise landmark localization. A small patch with more distinctive appearance informa- 3.2.2 Regression-based local appearance model tion would lead to more accurate detection results. To tackle this problem, some algorithms [76] combine the During training, the goal of the regression-based local large patch and small patch for estimation and adapt appearance model is to predict the displacement vector the sizes of the patches or searching regions across iterations. ∆xd∗ = xd∗ x, which is the difference between any pixel − Second, it is unclear which approach to follow, among location x and the ground truth landmark location xd∗ from the local appearance information around x using the classifier-based methods and the regression-based the regression models. During detection, the regression methods. One advantage of the regression-based ap- model can be applied to patches at different locations proach is that it only needs to calculate the features x in a region of interest to predict ∆xd, which can be and predict the displacement vectors for a few sample added to the current location to calculate xd. patches in testing. It is more efficient than the classifier- based approach that scans all the pixel locations in the Regression : (x) ∆x (14) I → d region of interest. It is empirically shown in [22] that the Gentleboost regressor as a regression-based appearance model is better than the Gentleboost classifier as xd = x + ∆xd; (15) a classifier-based local appearance model. Predictions from multiple patches can be merged to calculate the final prediction of the confidence score 3.3 Face shape models or probability through voting (Figure 6). Different image features and regression models are used. In [22], The face shape model captures the spatial relationships Cristinacce and Cootes proposed to use the Gentleboost among facial landmarks, which constrain and refine the as the regression function. In Cootes’s later work [19], landmark location search. In general, they can be clas- he extended the method and used the random forests sified into deterministic face shape models and proba- as the regressor, which shows better performance. In bilistic face shape models. [102], Adaboost feature selection method is combined with SVM regressor to learn the regression function. In 3.3.1 Deterministic face shape models [88], nonparametric appearance model is used for location voting based on the contextual features around the The deterministic face shape models utilize determinis- landmark. Similar to the classifier-based methods, it’s tic models to capture the face shape patterns. They as- unclear which feature and regression function to use. sign low fitting errors to the feasible face shapes and pe- It is empirically shown in [61] that the LBP features nalize infeasible face shapes. For example, Active shape are better than the Haar features and LPQ descriptor. model (ASM) [20] is the most popular and conventional Another issue is that since the regression-based local face shape model. It learns the linear subspaces of the appearance model performs one-step prediction. The training face shapes using Principal Component Anal- prediction may not be accurate if the current positions ysis as in Equation 1. The face can be evaluated by are far away from the true target. the fitness to the subspaces. It has been used both in the holistic AAM method and the CLM methods. 3.2.3 Discussion: local appearance model Since one linear ASM may not be effective to model the global face shape variations, in [54], two levels of There are several issues related to the local appearance ASMs are constructed. One level of ASMs are used to model. First, there exists accuracy-robustness tradeoffs. capture the shape patterns of each facial component 8 Yue Wu, Qiang Ji independently, and the other level of ASM is used for Algorithm 1: Iterative methods modeling the joint spatial relationships among facial Data: The initial searching regions components. In [125], Zhu and Ramanan built pose- (1) (1) (1) (1) Ω = Ω1 ,Ω2 , ..., ΩD for all D dependent tree structure face shape model to capture landmarks.{ } the local nonlinear shape patterns, in which each land- Result: The detected landmark locations x or the shape coefficients p that determine the mark is represented as a tree node. Improving upon landmark locations. [125], the method in [44] builds two levels of tree struc- 1 for t=1 until convergence do (t) tured models focusing on different numbers of landmark 2 Within the searching region Ωd , detect each points on images with different resolutions. In [9], in- facial landmark point independently using the stead of using the 2D facial shape models, Baltrusaitis local appearance models, and treat them as the measurements mt. et al. proposed to embed the facial shape patterns into 3 Refine the measurements m jointly with the face a 3D facial deformable model to handle pose variations. shape model constraint, and output the estimated During detection, both the 3D model coefficients and locations xt or shape coefficients pt. (t+1) the head pose parameters are jointly estimated for land- 4 Modify each searching region Ωd to be around mark detection. This method, however, requires to learn currently estimated landmark location. 3D deformable model, and to estimate head pose. 3D landmark detection will be further discussed in Section 7.3. will be discussed in Section 6 ). In addition, it’s time- consuming to generate the facial landmark annotations 3.3.2 Probabilistic face shape models under different facial shape variations, and more complex model requires more data to train. For example, The probabilistic face shape models capture the facial due to facial occlusion, complete landmark annotation shape patterns in a probabilistic manner. They assign is infeasible on the self-occluded profile faces. high probabilities to face shapes that satisfy the anthro- pological constraint learned from training data and low probabilities to other infeasible face shapes. The early 3.4 Landmark point detection via optimization probabilistic face shape model is the switching model proposed in [94]. It can automatically switch the states Given the local appearance models and the face shape of facial components (e.g. mouth open and close) to models described above, CLMs combine them for de- handle different facial expressions. In [102][61], a gen- tection using Equations 10, 11, 12 or 12. This is a erative Boosted Regression and graph Models based non-trivial task, since the analytic representation of method (BoRMaN) constructed based on the Markov the local appearance model is usually not directly com- Random Field is proposed. Each node in the MRF cor- putable from the local appearance model and the whole responds to the relative positions of three points, and objective function is usually non-convex. To solve this the MRF as a whole can model the joint relationships problem, there are two sets of approaches, including the among all landmarks. In [11][10], Belhumeur et al. pro- iterative methods and the joint optimization methods. posed a non-parametric probabilistic face shape model with optimization strategy to fit the facial images. In 3.4.1 Iterative methods [108][105], Wu and Ji proposed a discriminative deep face shape model based on the Restricted Boltzmann The iterative methods decompose the optimization prob- Machine model. It explicitly handles face pose and ex- lem into two steps: landmark detection by local ap- pression variation, by decoupling the face shapes into pearance model and location refinement by the shape head pose related part and expression related part. Com- model. They occur alternately until the estimation con- pared to the other probabilistic face shape models, it verges. Specifically, it estimates the optimal landmark can better handle facial expressions and poses within a positions that best fit the local appearance models for unified model. each landmark in the local region independently, and then refines them jointly with the face shape model. 3.3.3 Discussion: face shape model The detailed algorithm is shown in Algorithm 1. There are a few algorithms [24][22][19][102][61][108][109] that Despite of the recent developments of the face shape follow the iterative framework. Those methods differ in model, its construction is still an unsolved problem. the used local appearance models and facial shape mod- There is a lack of a unified shape model that can cap- els, and we have discussed their particular techniques ture all natural facial shape variations (some methods in the above sections. Facial Landmark Detection: a Literature Survey 9

3.4.2 Joint optimization methods 4 Regression-based methods

Different from the iterative methods, the joint optimiza- The regression-based methods directly learn the map- tion methods aim to perform joint inference. The chal- ping from image appearance to the landmark locations. lenge is that they have to find a way to represent the Different from the Holistic Methods and Constrained detection results from independent local point detec- Local Model methods, they usually do not explicitly tors and combine them with the face shape model for build any global face shape model. Instead, the face joint inference. To tackle this problem, in [84], the inde- shape constraints may be implicitly embedded. In gen- pendent detection results are represented with Kernel eral, the regression-based methods can be classified into Dense Estimation (KDE) and the optimization problem direct regression methods, cascaded regression methods, is solved with EM algorithm subject to the shape con- and deep-learning based regression methods. Direct re- straint, which treats the true landmark location as hid- gression methods predict the landmarks in one iteration den variables. It also discusses some other methods to without any initialization, while the cascaded regression represent the local detection results, such as using the methods perform cascaded prediction and they usually Isotropic Gaussian Model [20], the Anisotropic Gaus- require initial landmark locations. The deep-learning sian model [67], and Gaussian Mixture Model [40]. In based methods follow either the direct regression or the the Consensus of exemplars work [11][10], in a Bayesian cascaded regression. Since they use unique deep learn- formulation, the local detector is combined with the ing methods, we discuss them separately. nonparametric global model. Because of the special probabilistic formulation, the objective function can be optimized in a “brutal-force” way with RANSAC strategy. 4.1 Direct regression methods There are also algorithms that simplify the shape model to use efficient inference methods. For example, The direct regression methods learn the direct mapping in [125], due to the usage of simple tree structure face from the image appearance to the facial landmark loca- shape model, dynamic programming can be applied to tions without any initialization of landmark locations. solve the optimization problem efficiently. In [23], by They are typically carried out in one step. They can converting the objective function into a linear program- be further classified into local approaches and global ap- ming problem, Nelder-Meade simplex method is applied proaches. The local approaches use image patches, while to solve the optimization problem. the global approaches use the holistic facial appearance. Local approaches: The local approaches sample 3.4.3 Discussion: optimization different patches from the face region, and build structured regression models to predict the displacement The iterative methods and joint optimization meth- vectors (target face shape to the locations of the ex- ods have their own benefits and disadvantages. On one tracted patches), which can be added to the current hand, the iterative methods are generally more efficient, patch location to produce all landmark locations jointly. but it may fail into the local minima due to the itera- The final facial landmark locations can be calculated by tive procedure. On the other hand, the joint optimiza- merging the prediction results from multiple sampled tion methods are usually more difficult to solve and are patches. Note that, this is different from the regression- computationally expensive. based local appearance model that predicts each point Note that, as shown in Equations 10, 11, 12 or 12, independently (Section 3.2.2), while the local approaches in CLM, we would either infer the exact landmark lo- here predict the updates for all points simultaneously. cations x or the shape coefficients that can fully deter- For example, in [25], conditional Regression Forests are mine the landmark locations (e.g. shape coefficients p used to learn the mapping from randomly sampled patches when using ASM as the face shape model as in Equa- in the face region to the face shape updates. In ad- tion 3). There is a dilemma. On one hand, since small dition, several head pose dependent models are built shape model errors may lead to large landmark detec- and they are combined together for detection. Simi- tion errors, directly predicting the landmark locations larly, privileged Information-based conditional random may be a better choice. On the other hand, it may be forests model [113] uses additional facial attributes (e.g. easier to design the cost function for the model coeffi- head pose, gender, etc.) to train regression forests to cients than the shapes. For example, in ASM [20], it is predict the face shape updates. Different from [25] which relatively easier to set up the range for the model coef- merges the prediction from different pose-dependent ficients while it is difficult to directly set the constraint models, in testing, it predicts the attributes first and for the face shapes. then performs attribute dependent landmark location 10 Yue Wu, Qiang Ji

Algorithm 2: Cascaded regression detection 1 Initialize the landmark locations x0 (e.g. mean face). 2 for t=1, 2, ..., T or convergence do 3 Update the landmark locations, given the image and the current landmark location.

t 1 t Iteration 1Iteration 2Iteration t ft : , x − ∆x I → t t 1 t x = x − + ∆x Fig. 7: Cascaded regression methods.

4 Output the estimated landmark locations xT . Different shape-indexed image appearance and regression models are used. For example, in [14], Cao et estimation. In this case, the landmark prediction accu- al. proposed the shape-indexed pixel intensity features racy will be affected by attribute prediction. One issue which are the pixel intensity differences between pairs of with the local regression methods is that the indepen- pixels whose locations are defined by their relative po- dent local patches may not convey enough information sitions to the current shape. In [76], Ren et al. proposed for global shape estimation. In addition, for images with to learn discriminative binary features by the regression occlusion, the randomly sampled patches may lead to forests for each landmark independently. Then, the bi- bad estimations. nary features from all landmarks are concatenated and Global approaches: The global approaches learn a linear regression function is used to learn the joint the mapping from the global facial image to landmark mapping from appearance to the global shape updates. locations directly. Different from the local approaches, In [52], ensemble of regression trees are used as regres- the holistic face conveys more information for landmark sion models for face alignment. By modifying the ob- detection. But, the mapping from the global facial ap- jective function, the algorithm can use training images pearance to the landmark locations is more difficult to with partially labeled facial landmark locations. learn, since the global facial appearance has significant Among different cascaded regression methods, the variations, and they are more susceptible to facial occlu- Supervised Descent Method (SDM) in [110] achieves sion. The leading approaches [91][119] all use the deep promising performances. It formulates face alignment learning methods to learn the mapping, which we will as a nonlinear least squares problem. In particular, as- discuss in details in Section 4.3. Note that, since the suming the appearance features (e.g. SIFT) of the local global approaches directly predict landmark locations, patches around the true landmark locations x∗ are de- they are different from the holistic methods in Section noted as Φ( (x∗)), the goal of landmark detection is to I 2 that construct the shape and appearance models and estimate the location updates δx starting from an ini- predict the model coefficients. tial shape x0, so that the feature distance is minimized:

δx˜ = arg min f(x0 + δx) 4.2 Cascaded regression methods δx 2 (16) = arg min Φ( (x∗)) Φ( (x0 + δx)) 2 In contrast to the direct regression methods that per- δx k I − I k form one-step prediction, the cascaded regression meth- By applying the second order Taylor expansion with ods start from an initial guess of the facial landmark lo- Newton-type method, the shape updates are calculated: cations (e.g. mean face), and they gradually update the landmark locations across stages with different regres- 1 sion functions learned for different stages (Figure 7). f(x +δx) f(x )+J (x )T δx δxT H (x )δx (17) 0 ≈ 0 f 0 − 2 f 0 Specifically, in training, in each stage, regression models are applied to learn the mapping between shape- 1 indexed image appearances (e.g., local appearance ex- δx = H (x )− J (x ) − f 0 f 0 tracted based on the currently estimated landmark lo- 1 T (18) = 2Hf (x0)− J (Φ( (x0)) Φ( (x∗))) cations) to the shape updates. The learned model from − Φ I − I the early stage will be used to update the training data To directly calculate δx analytically is difficult, since for the training in the next stage. During testing, the it requires the calculation of the Jacobin and Hessian learned regression models are sequentially applied to matrix for different x0. Therefore, supervised descent update the shapes across iterations. Algorithm 2 sum- method is proposed to learn the descent direction with marizes the detection process. regression method. It is then simplified as the cascaded Facial Landmark Detection: a Literature Survey 11 regression method with linear regression function, which can predict the landmark location updates from shape- indexed local appearance.

1 T R = 2H (x )− J (19) − f 0 Φ

1 T b 2H (x )− J Φ( (x∗)) (20) Fig. 8: CNN model structure, adapted from [91]. ≈ f 0 Φ I

minima while the iteration continues. It is also possi- δx = RΦ( (x )) + b (21) I 0 ble that the prediction is already close to the optimum The regression functions are different for different iter- after a few stages. In the existing methods [76], it is ations, but they ignore different possible starting shape only shown that the cascaded regression methods can x0 within one iteration. improve the performance over different cascaded stages, There are some other variations of the cascaded re- but it doesn’t know when to stop. gression methods. For example, in [6], instead of learning the regression functions in a cascaded manner (the later level depends on the former level), a parallel learn- 4.3 Deep learning based methods ing method is proposed, so that the later level only needs the statistic information from the previous level. Recently, deep learning methods become popular tools Based on the parallel learning framework, it’s possi- for computer vision problems. For facial landmark de- ble to incrementally update the model parameters in tection and tracking, there is a trend to shift from tradi- each level by adding a few more training samples, which tional methods to deep learning based methods. In the achieves fast training. early work [108], the deep boltzmann machine model, The cascaded regression methods are more effective which is a probabilistic deep model, was used to capture than the direct regression since they follow the coarse- the facial shape variations due to poses and expressions to-fine strategy. The regression functions in the early for facial landmark detection and tracking. More re- stage can focus on the large variations while the regres- cently, the Convolutional Neural Network (CNN) mod- sion functions in the later stage may focus on the fine els become the dominate deep learning models for fa- search. cial landmark detections, and most of them follow the However, for cascaded regression methods, it is un- global direct regression framework or cascaded regres- clear how to generate the initial landmark locations. sion framework. Those methods can be broadly clas- The popular choice is to use the mean face, which may sified into pure-learning methods and hybrid methods. be sub-optimal for images with large head poses. To The pure-learning methods directly predict the facial tackle this problem, there are some hybrid methods landmark locations, while the hybrid methods combine that use the direct regression methods to generate the deep learning methods with computer vision projection initial estimation for cascaded regression methods. For model for prediction. example, in [118], a model based on auto-encoder is Pure-learning methods: Methods in this cate- proposed. It first performs direct regression on down- gory use the powerful CNN models to directly predict sampled lower-resolution images, and then refine the the landmark locations from facial images. [91] is the prediction in a cascaded manner with higher resolution early work and it predicts five facial key points in a cas- images. In [123], a coarse to fine searching strategy is caded manner. In the first level, it applies a CNN model employed and the initial face shape is continuously up- with four convolution layers (Figure 8) to predict the dated based on the estimation from last stage. There- landmark locations given the facial image determined fore, a more close-to-solution initialization will be gen- by the face bounding box. Then, several shallow net- erated and fine facial landmark detection results are works refine each individual point locally. easier to get. Ever since then, there are several improvements over Another issue about the cascaded regression method [91] in two directions. In the first direction, [119][120] is that the algorithms apply a fixed number of cascaded and [75] leverage multi-task learning idea to improve prediction, and there is no way to judge the quality of the performance. The intuition is that multiple tasks landmark prediction and adapt the necessary cascaded could share the same representation and their joint re- stages for different testing images. In this case, it is pos- lationships would improve the performances of individ- sible that the prediction is already trapped in a local ual tasks. For example, in [119][120], multi-task learning 12 Yue Wu, Qiang Ji is combined with CNN model to jointly predict facial landmarks, facial head pose, facial attributes etc. A similar multi-task CNN framework is proposed in [75] to jointly perform face detection, landmark localization, pose estimation, and gender recognition. Different from [119] [120], it combines features from multiple convolutional layers to leverage both the coarse and fine feature representations.

In the second direction, some works improve the cas- Fig. 9: 3D face model and its projection based on the caded procedure of method [91]. For example, in [122], head pose parameters (i.e. pitch, yaw, roll angles) similar cascaded CNN model is constructed to predict many more points (68 landmarks instead of 5). It starts from the prediction of all 68 points and gradually de- Table 2 2 summarizes the CNN structures of the couples the prediction into local facial components. In leading methods. We list their numbers of convolutional [118], the deep auto-encoder model is used to perform layers, the numbers of fully connected layer, whether a the same cascaded landmark search. In [96], instead of 3D model is used, and whether the cascaded method training multiple networks in a cascaded manner, Trige- is used. For the cascaded methods, if different model orgis et. al trained a deep convolutional Recurrent Neu- structures are used for different layers, we only list the ral Network (RNN) for end-to-end facial landmark de- model in the first level. As can be seen, the models tection to mimic the cascaded behavior. The cascaded proposed for facial landmark detection usually contain stage is embedded into the different time slices of RNN. around four convolutional layers and one fully connected layer. The model complexity is on par with the deep Hybrid deep methods: The hybrid deep methods models used for other related face analysis tasks, such combine the CNN with 3D vision, such as the projec- as head pose estimation [70], age and gender estimation tion model and 3D deformable shape model (Figure 9). [55], and facial expression recognition [58], which usu- Instead of directly predicting the 2D facial landmark ally have similar or smaller numbers of convolutional locations, they predict 3D shape deformable model co- layers. For the face recognition problem, the CNN mod- efficients and the head poses. Then, the 2D landmark els are usually more complex with more convolutional locations can be determined through the computer vi- layers (e.g. eight layers) and fully connected layers [90] sion projection model. For example, in [124], a dense [85] [92]. It is in part due to the fact that there is much 3D face shape model is construct. Then, an iterative more training data (e.g., 10M+, 100M+ images) for cascaded regression framework and deep CNN models face recognition comparing to the data set used for are used to update the coefficients of 3D face shape facial landmark detections (e.g., 20K+ images) [75]. and pose parameters. In each iteration, to incorporate It is still an open question whether adding more data the currently estimated 3D parameters, the 3D shape is would improve the performances of facial landmark de- projected to 2D using the vision projection model and tection. Another promising direction is to leverage the the 2D shape is used as additional input of the CNN multi-task learning idea to jointly predict related tasks model for regression prediction. Similarly, in [50], in a (e.g., landmark detection, pose, age and gender) with cascaded manner, the whole facial appearance is used in a deeper model to boost the performances for all tasks the first cascaded CNN model to predict the updates of [75]. 3D shape parameters and pose, while the local patches are used in the later cascaded CNN models to refine the landmarks. 4.4 Discussion: regression-based methods

Compared to the pure-learning methods, the 3D Among diﬀerent regression methods, cascaded regres- shape deformable model and pose parameters of the sion method achieves better results than direct regres- hybrid methods are more compact ways to represent sion. Cascaded regression with deep learning can fur- the 2D landmark locations. Therefore, there are fewer ther improve the performance. One issue for the regression- parameters to estimate in CNN and shape constraint based methods is that since they learn the mapping can be explicitly embedded in the prediction. Further- from the facial appearance within the face bounding more, due to the introduction of 3D pose parameters, 2 For [75], we list the landmark prediction model instead of they can better handle pose variations. the multi-task prediction model for fair comparison. Facial Landmark Detection: a Literature Survey 13

Table 2: CNN model structures of the leading methods.

methods # convolutional layer # fully connected layer, # features 3D model cascaded method Sun et. al [91] 4 1 (120) N Y Zhang et. al [119] 4 1 (100) N N Ranjan et. al [75] 5 + Dim. reduction 1 (3072) N N Zhou et. al [122] 4 1 (120) N Y Zhu et. al [124] 4 2 (256, 234) Y Y Jourabloo and Liu [50] 3 1 (150) Y Y box region to the landmarks, they may be sensitive predict the landmarks directly by fitting the local ap- to the used face detector and the quality of the face pearances without explicit 2D shape model. The fitting bounding box. Because the size and location of the ini- problem of holistic methods can be solved with learn- tial face is determined by the face bounding box, al- based approaches or analytically as discussed in section gorithms trained with one face detector may not work 2.2, while all the cascaded regression methods perform well if a different biased face detector is used in testing. estimation by learning. While the learning-based fit- This issue has been studied in [78]. ting methods for holistic models usually use the same Even though we mentioned that the regression-based model for coefficient updates in an iterative manner, methods do not explicitly build the facial shape model, the cascaded regression methods learn different regres- the facial shape patterns are usually implicitly embed- sion models in a cascaded manner. The AAM model ded. In particular, since the regression-based methods [82] discussed in section 2.3.2 as one particular type of predict all the facial landmark locations jointly, the holistic method is very similar to the Supervised De- structured information as well as the shape constraint scent Methods (SDM) [110] as one particular type of are implicitly learned through the process. the cascaded regression method. Both train cascaded models to learn the mapping from shape-indexed features to shape (coefficient) updates. The trained model 5 Discussions: relationships among methods in in the current cascaded stage will modify the training three major categories. data to train the regression model in the next state. While the former holistic method fits the holistic ap- In the previous three sections, we discussed the facial pearance and predicts the model coefficients, SDM fits landmark detection methods in three major categories: the local appearance and predicts the landmark loca- the holistic methods, the Constrained Local Methods tions. (CLM), and the regression-based methods as summa- Third, there are similarities among the regression- rized in Figure 10. There exist similarities and relation- based local appearance model used in CLM in section ships among the three major approaches. 3.2.2 and the regression-based methods in section 4. First, both the holistic methods and CLMs would Both of them predict the location updates from an ini- capture the global facial shape patterns using the ex- tial guess of the landmark locations. The former ap- plicitly constructed facial shape models, which are usu- proach predicts each landmark location independently, ally shared between them. CLMs improve over the holis- while the later approach predicts them jointly, so that tic methods in that they use the local appearance around shape constraint can be embedded implicitly. The for- landmarks instead of the holistic facial appearance. The mer approach usually performs one-step prediction with motivation is that it’s more difficult to model the holis- the same regression model, while the later approach can tic facial appearances, and the local image patches are apply different regression functions in a cascaded man- more robust to illumination changes and facial occlu- ner. sion compared to the holistic appearance models. Fourth, compared to the holistic methods and con- Second, the regression-based methods, especially for strained local methods, the regression-based methods the cascaded regression methods [110] share similar in- may be more promising. The regression-based methods tuitions as the holistic AAM [7][82]. For example, both bypass the explicit face shape modeling and embed the of them estimate the landmarks by fitting the appear- face shape pattern constraint implicitly. The regression- ance and they all can be formulated as a nonlinear least based methods directly predict the landmarks, instead squares problem as shown in Equation 6 and 16. How- of the model coefficients as in the holistic methods and ever, the holistic methods predict the 2D shape and some CLMs. Directly predicting the shape usually can appearance model coefficients by fitting the holistic ap- achieve better accuracy since small model coefficient pearance model, while the cascaded regression methods errors may lead to large landmark errors. 14 Yue Wu, Qiang Ji

Basic model – AAM (Sec. 2.1) Analytic fitting methods (Sec. 2.2.1) Fitting algorithm (Sec. 2.2) Holistic methods Learning‐based fitting methods (Sec. 2.2.2) (Sec. 2) Feature representation (Sec. 2.3.1) Other extensions (Sec. 2.3) Model construction (Sec. 2.3.2)

General formulations (Sec. 3.1) Classifier‐based local appearance models (Sec. 3.2.1) Local appearance model (Sec. 3.2) Regression‐based local appearance models (Sec. 3.2.2) Constrained Local Facial landmark methods (Sec. 3) Deterministic face shape models (Sec. 3.3.1) detection methods Face shape model (Sec. 3.3) Probabilistic face shape models (Sec. 3.3.2)

Iterative estimation methods (Sec. 3.4.1) Detection by optimization (Sec. 3.4) Joint optimization methods (Sec. 3.4.2)

Regression‐based Direct regression (Sec. 4.1) landmark detection methods (Sec. 4) Cascaded regression (Sec. 4.2)

Deep learning based methods (Sec. 4.3)

Fig. 10: Major categories of facial landmark detection algorithms

6 Facial landmark detection “in-the-wild” limited training data with large head poses, and it may need extra efforts to annotate the head pose labels to Most of the aforementioned algorithms focus on fa- train the algorithms. cial images in “controlled conditions” without signifi- To handle large head poses, one direction is to train cant variations. However, in real world, the facial im- pose dependent models, and these methods differ in the ages would undergo varying facial expressions, head detection procedures [112][17][125][25]. They either se- poses, illuminations, facial occlusion etc., which are gen- lect the best model or merge the results from all models. erally referred to as “in-the-wild” conditions. Some of There are two ways to select the model. The first way the aforementioned algorithms may be able to handle is to estimate the head poses using existing head pose those variations implicitly (e.g., deep learning based estimation methods. For example, in the early work methods), while some others may fail. In this section, [112], multiple pose dependent AAM models are built in we focus on algorithms that are explicitly designed to training and the model is selected from the multi-view handle those challenging conditions. face detector during testing. In [115], the head pose is first estimated based on the detection of a few facial key 6.1 Head Poses points. Then, head pose dependent fitting algorithm is applied to further refine the landmark detection results Significant head pose (e.g. profile face) is one of the using the selected pose model. The head pose can also major cause of the failure of the facial landmark detec- be selected based on the confidence scores using differ- tion algorithms (Figure 11). There are a few difficulties. ent pose dependent models. For example, in the early First, the 3D rigid head movement affects the 2D facial work [17], three AAM models are built for faces in dif- appearance and face shape. There would be significant ferent head poses (e.g. left profile, frontal, and right pro- facial appearance and shape variations caused by dif- file) during training. During detection, the result with ferent head poses. Traditional shape models such as the the smallest fitting error is considered as the final out- PCA-based shape model used in AAM and ASM can put. In [125], multiple models are built for each discrete no longer model the large facial shape variations since head pose and the best fit during testing is outputted they are linear in nature and large facial pose shape as the final result. variation is non-linear. Second, large head pose may lead to self-occlusion. Due to the missing of the facial The algorithms that select the best head pose de- landmark points, some facial landmark detection algo- pendent model would fail, if the model is not selected rithms may not be directly applicable. Third, there is correctly. Therefore, it may be better to merge the re- Facial Landmark Detection: a Literature Survey 15 sults from different pose dependent models. For ex- mask or nose may be occluded by hand). Third, the ample, in [25], a probabilistic head pose estimator is occlusion region is usually locally consistent (e.g. it is trained and the facial landmark detection results from unlikely that every other point is occluded), but it is different pose dependent models are merged through difficult to embed this property as a constraint for oc- Bayesian integration. clusion prediction and landmark detection. More recently, there are a few algorithms that build Due to these difficulties, there is limited work that one unified model to handle all head poses. For exam- can handle occlusion. Most of the current algorithms ple, in [106], self-occlusion caused by large head poses build occlusion dependent models by assuming that some is considered as the general facial occlusion, and a uni- parts of the face are occluded, and merge those models fied model is proposed to handle facial occlusion, which for detection. For example, in [13], face is divided into explicitly predicts the landmark occlusion along with nine parts and it is assumed that only one part is not the landmark locations. In [111], landmark detection occluded. Therefore, the facial appearance information follows the cascaded iterative procedure and the pose from one particular part is utilized to predict the facial dependent model is automatically selected based on the landmark locations and facial occlusions for all parts. estimation from the last iteration. In [124] and [50], The predictions from all nine parts are merged together Convolutional Neural Network (CNN) is combined with based on the predicted facial occlusion probability for 3D deformable facial shape model to jointly estimate each part. Similarly, it is assumed in [116] that a few the head pose and facial landmarks on images with large manually pre-designed regions (e.g. one facial compo- head poses, following the cascaded regression frame- nent) are occluded, and different models are trained for work. In summary, methods handling head poses in- prediction based on the non-occluded parts. clude pose dependent shape models, unified pose mod- The aforementioned occlusion dependent models may els, and pose invariant features. They all have their be sub-optimal, since they assume that a few pre-defined strengths and weaknesses. It depends on the applica- regions are occluded, while the facial occlusion could be tions to choose which one to follow. Also, for some ap- arbitrary. Therefore, those algorithms may not cover all plications, it may be best to combine different types of rich and complex occlusion cases in real world scenario. methods. Another issue related to the aforementioned methods is that the limited facial appearance information from one part of the face (e.g. from the mouth region) maybe insufficient for the prediction of the facial landmarks in the whole face region. To alleviate this problem, some algorithms handle facial occlusion in a unified framework. For example, in [32], a probabilistic model is pro- Fig. 11: Facial images [39] with different head poses. posed to predict the facial landmark locations by joint modeling the local facial appearance around each facial landmark, the landmark visibility for each landmark, occlusion pattern, and hidden states of facial components. It jointly predicts the landmark locations and 6.2 Occlusion landmark occlusions through inference in the probabilistic model. In [106], a constrained cascaded regres- Facial occlusion is another cause of the failure of the sion model is proposed to iteratively predict the facial facial landmark detection algorithms. Facial occlusion landmark locations and landmark visibility probabili- could be caused by objects or self-occlusion due to large ties, based on local appearance around currently pre- head poses. Figure 12 shows facial images with object dicted landmark points iteratively. For landmark vis- occlusion. Some images in Figure 11 also contain facial ibility prediction, it gradually updates the landmark occlusion. There are a few difficulties to handle facial visibility probabilities and it explicitly adds occlusion occlusions. First, the algorithm should rely more on the patterns as a constraint in the prediction. For landmark facial parts without occlusion than the part with oc- location prediction, it assigns weights to facial appear- clusion. However, it’s difficult to predict which facial ance around different facial landmarks based on their part or which facial landmarks are occluded. Second, landmark visibility probabilities, so that the algorithm since arbitrary facial part could be occluded by objects relies more on the local appearance from visible land- with arbitrary appearances and shapes, the facial landmarks than that from occluded landmarks. Different mark detection algorithms should be flexible enough to from [32], the model can handle facial occlusion caused handle different cases (e.g. mouth may be occluded by by both object occlusion and large head poses. 16 Yue Wu, Qiang Ji

facial landmark detection. Similarly, in [107], a constrained joint cascaded regression framework is proposed for simultaneous facial action unit recognition and facial landmark detection. It first learns the joint relationships among facial action units and facial shapes, Fig. 12: Facial images [13] with different occlusions. and then uses the relationship as a constraint to iteratively update the action unit activation probabilities and landmark locations in a cascaded iterative manner. 6.3 Facial expression There are also some works that handle both facial expression and head pose variations simultaneously for Facial expression would lead to non-rigid facial motions facial landmark detection. For example, In [72][9], 3D which would affect facial landmark detection and track- face models are used to handle facial expression and ing. For example, as shown in Figure 13, the six basic pose variations. [9] can predict both the landmarks and facial expressions, including happy, surprise, sadness, the head poses. In [108][105], the face shape model is angry, fear and disgust, would cause the changes of fa- constructed to handle the variations due to facial ex- cial appearance and shape. In more natural conditions, pression changes. The model decouples the face shape facial images would undergo more spontaneous facial into expression related parts and head pose related parts. expressions other than the six basic expressions. Gen- In [121], not only the facial landmark locations, but erally speaking, recent facial landmark detection algo- also the expression and pose are estimated jointly in rithms have been able to handle facial expressions to a cascaded manner with Random Forests model. In some extent. addition, in each cascade level, the poses and expressions are firstly updated and they are used further for the estimation of facial landmarks. In [109], a hierarchical probabilistic model is proposed to automatically exploit the relationships among facial expression, head pose, and facial shape changes of different facial components for facial landmark detection. Some multi-task (a) surprised (b) sadness (c) disgust deep learning methods discussed in Section 4.3 also can be included in this category.

7 Related topics (d) anger (e) happy (f) fear 7.1 Face detection for facial landmark detection Fig. 13: Facial images [51][59] with different facial expressions. It is usually assumed that face is already detected and given for most of the existing facial landmark detection algorithms. The detected face would provide the initial Even though most algorithms handle facial expres- guess of the face location and face scale. However, there sions implicitly, there are some algorithms that are ex- are some issues. First, face detection is still an unsolved plicitly designed to handle significant facial expression problem and it would fail especially on images with variations. For example, In [94], a hierarchical dynamic large variations. The failure of the face detection would probabilistic model is proposed, which can automati- directly lead to the failure of most facial landmark de- cally switch between specific states of the facial compo- tection algorithms. Second, facial landmark detection nents caused by different facial expressions. Due to the accuracy may be significantly affected by the face de- correlation between facial expression and facial shape, tectors accuracy. For example, in the regression-based some algorithms also perform joint facial expression and methods, the initial shape is generated by placing the facial landmark detection. For example, in [56], a dy- mean shape in the center of the face bounding box, namic bayesian network model is proposed to model the where the scale is also estimated from the bounding dependencies among facial action units, facial expres- box. In CLM, the initial regions of interest for each in- sions, and face shapes for joint facial behavior analy- dependent local point detector are determined by the sis and facial landmark tracking. It shows that exploit- face bounding box. Third, to ensure real-time facial ing the joint relationships and interactions improves the landmark detection, fast face detector is usually pre- performances of both facial expression recognition and ferred. Facial Landmark Detection: a Literature Survey 17

The most popular face detector has been the Viola- each frame satisfies the face shape pattern constraint Jones face detector (VJ)[103]. The usage of the integral [12][94][108]. The joint tracking methods perform facial image and the adaboost learning ensures both fast com- landmark points update jointly, and they initialize the putation and effective face detection. The part-based model parameters or landmark locations based on the approaches [42][29][62] use a slightly different frame- information from the last frame. For example, in [3], work. They consider the face as a object consisting of AAM model coefficients estimated in the last frame are several parts with spatial constraints. More recently, used to initialize the model coefficients in the current the region-based convolutional neural networks (RCNN) frame. In the cascaded regression framework [110], the methods [34][33][77] have been used for face detection. tracked facial landmark locations in the last frame are They are based on a region proposal component which used to determine the location and size of the initial identifies the possible face regions and a detection com- facial shape for the cascaded regression in the current ponent, which further refines the proposal regions for frame. Since detection and tracking are initialized dif- face detection. Generally, the RCNN based methods are ferently (detection uses face bounding box to determine more accurate especially for images with large pose, il- the face size and location, while it uses landmark loca- lumination, occlusion variations. However, their com- tions in the last frame in tracking), the method needs putational costs (about 5 frame/second with GPU) are train different sets of regression functions for detection much higher than the traditional face detectors which and tracking. The probabilistic graphical model based provide real-time detection. A more detailed study and methods build dynamic models to jointly embed the survey of the face detection algorithms can be found in spatial and temporal relationships among facial land- [117]. mark points for facial landmark tracking. For example, There are some algorithms that perform joint face Markov Random Field (MRF) model is used in [21], detection and landmark localization. For example, in and Dynamic Bayesian Network is used in [56]. The dy- [125], deformable part model is applied to jointly per- namic probabilistic models capture both the temporal form face detection, face alignment and head pose es- coherence as well as shape dependencies among land- timation. In [87], face detection and face alignment are mark points. More evaluation and review about facial formulated as an image retrieval problem. Similarly, landmark tracking can be found here [16]. In [15], face detection and face alignment are performed jointly in a cascaded regression framework. The face shape is iteratively updated and the bounding box would 7.3 3D facial landmark detection be rejected if the confidence based on the current face shape is less than a threshold. 3D facial landmark detection algorithms detect the 3D facial landmark locations. The existing works can be 7.2 Facial landmark tracking classified into 2D-data based methods using 2D images and 3D-data based methods using 3D face scan data. Facial landmark detection algorithms are generally de- Given the limited information, the detection of the signed to handle individual facial images, and they can 3D facial landmark locations from 2D image is a ill- be extended to handle facial image sequences or video. posed problem. To solve this issue, existing methods ei- The simplest way is tracking by detection, where fa- ther leverage 3D training data or a pre-trained 3D facial cial landmark detection is applied. However, methods shape model and combine them with machine learning. in this category ignore the dependency and temporal For example, in [97], Tulyakov and Sebe extended the smoothness among consecutive frames, which are sub- cascaded regression method from 2D to 3D by adding optimal. the prediction of the depth information. The method There are three types of works that perform fa- directly learns the regressors to predict the 3D facial cial landmark tracking by leveraging the temporal re- landmark locations from 2D image given 3D training lationship: tracker based independent tracking meth- data. Since 3D data is difficult to generate, some algo- ods, joint tracking methods, and probabilistic graphi- rithms learn a 3D facial shape model instead with lim- cal model based methods. The tracker based indepen- ited training data. For example, in [35], 2D facial land- dent tracking methods [12][94][108] perform facial landmarks are firstly detected using the cascaded regression mark tracking on the individual points based on the method, and they are combined with a 3D deformable general object trackers, such as the Kalman Filter and model to determine the face pose and coefficients of Kanade-Lucas-Tomasti tracker (KLT). In each frame, the deformable model, based on which they can then the face shape model is applied to restrict the inde- recover the positions of the 3D landmark points. Sim- pendently tracked points, so that the face shape in ilarly, in [46], cascaded regression method is used to 18 Yue Wu, Qiang Ji predict both dense 2D facial landmarks and their visi- are annotated, while there are 68 and 194 landmarks bilities. An iterative method is then applied to fit the annotated in ibug and Heledominantn databases. 2D shape to a pre-trained dense 3D model to estimate the 3D model parameters. There are some methods that use both 3D deformable model and 3D training data. For example, in [49], cascaded regression method is used to estimate the 3D model coefficients and pose parameters from 2D images, which determine both the 2D and the 3D facial landmark locations. There are a few algorithms that perform 3D facial landmark detection on 3D face scan. For example, in (a) 20 points (b) 68 points (c) 194 points [69], eight 3D facial landmark locations are estimated. Fig. 14: Facial landmark annotations. Images are The algorithm first uses two 3D local shape descriptors, adapted from [1][79][54] including the shape index feature and the spin image to generate landmark location candidates for each landmark. Then the final landmark locations are selected by fitting the 3D face model. Similarly, in [57], 17 dominate landmarks are firstly detected on 3D face scan There are some issues with the existing landmark using their particular geometric properties. Then, a 3D annotations. First, the landmark annotations are inher- template is matched to the testing face to estimate the ently bias and they are inconsistent across databases. 3D locations of 20 landmarks. In [69], dense 3D features As a result, it’s difficult to combine multiple databases are estimated from the 3D point cloud generated with for evaluations. The annotation inconsistency also ex- depth sensor. In particular, a triangular surface patch ists for individual landmark annotation. For example, descriptor is designed to select and match the training for the annotation of eye corners, some databases tend patches to the randomly generated patches from the to provide annotation within the eye region, while the testing image. Then, the associated 3D face shapes of others may annotate the point outside the eye region. the training patches are used to vote the 3D shape of To solve this issue, in [89], a method is proposed to the testing image. combine databases with different facial landmark anno- Compared to the 2D facial landmark detection, 3D tations. It generates a union of landmark annotations facial landmark detection is still new. There is lack of a by transferring the landmark annotations from source large 3D face database with abundant 3D annotations. database to target database. Compared to largely available 2D images, 3D face scans The second issue is that manual annotation is a are difficult to obtain. Labeling 3D facial landmarks on time-consuming process. There are some works improv- 2D images or 3D face scan are usually more difficult ing the annotation process. In [30], 3D face scan and than 2D landmarks. projection models are used to generate the synthetic 2D facial landmarks and corresponding landmark anno- 8 Databases and evaluations tations. The synthetic images are then combined with real images to train the facial landmark detector. In 8.1 Landmark annotations [80], an iterative semi-automatic landmark annotation method is proposed. A facial landmark detector is ini- Facial landmark annotations refer to the manual an- tially trained with a small number of training data, notations of the groundtruth facial landmark locations and it is used to fit new testing images, which are se- on facial images. There are usually two types of facial lected by user to retrain the detector. Similarly, in [93], landmarks: the facial key points and interpolated land- a semi-supervised facial landmark annotation method marks. The facial key points are the dominant land- is proposed. Even though the aforementioned methods marks on face, such as the eye corners, nose tip, mouth improve the facial landmark annotation process, the an- corners, etc. They possess unique local appearance/shape notation is still time-consuming and expensive. Overall, patterns. The interpolated landmark points either de- the existing training images and databases may still not scribe the facial contour or connect the key points (Fig- be adequate for some landmark detection algorithms, ure 14). In the early research, only sparse key landmark such as the deep learning based methods. Finally, to points are annotated and detected (Figure 14 (a)). Re- scale up annotation to large datasets, online crowd- cently, more points are annotated in the new databases sourcing such Amazon Mechanical Turk may be a po- (Figure 14 (b)(c)). For example, in BioID, 20 landmarks tential method for facial landmark annotation. Facial Landmark Detection: a Literature Survey 19

Table 3: Summary of the databases. We use the following notations to represent diﬀerent variations. e: expression, i: illumination, o: occlusion, p: pose. “(.)” represents moderate/spontaneous variations.

databases video(v)/ gray(g)/ amount of variations number of image(i) color(c) data landmark points BioID [1] i g 1521 e, i 20 AR [60] i c 4000+ e, i, o 22 [2] Extended YaleB [31] i g 16128 i, p 3 FERET [74] i g 14501 e, i, p 11 [106] CK/CK+ [51][59] v g&c 486 (593) e 68 MultiPIE [39] i c 750,000 e, i, p 68 or 39 XM2VTSDB [64] i c 1180 p 68 [79] FRGC v2[73] i c 50,000 e, i 68 [79] BU-4DFE[114] i c 3000 e 68 [97] AFLW [53] i c 25,000 (e, i, o, p) 21 LFPW [10] i c 1,432 (e, i, o, p) 29 (68 [79]) Helen [54] i c 2330 (e, i, o, p) 168 (68 [79]) AFW [125] i c 205 (e, i, o), p 6 (68 [79]) ibug300 [79] i c 135 e, i, o, p 68 ibug300-VW [86] v c 114 (e, i, o, p) 68 COFW [13] i c 1852 (e, i, p), o 29

8.2 Databases video sequences of frontal faces from 97+ subjects with 6 basic expressions, including happy, surprised, There are two types of databases: databases collected sadness, disgust, fear and anger. The videos starts under the “controlled” conditions or databases with from the neural expression and goes to apex. CK+ “in-the-wild” images. See Table 3 for the summary. is an extended version of CK database. It includes both posed and spontaneous expressions. AAM land- 8.2.1 Databases under “controlled” conditions mark tracking results are provided by the database. – Multi-PIE database [39]: The Multi-PIE face database Databases under “controlled” conditions refer to databases contains more than 750,000 images of 337 subjects. with video/images collected indoor with certain restric- The facial images are taken under 15 view points tions (e.g. pre-defined expressions, head poses etc.). and 19 illumination conditions. A few facial expres- – BioID [1]: The data set contains 1521 gray scale in- sions are included, such as neutral, smile, surprise, door images with a resolution of 384 286 from 23 × squint, disgust, and scream. 68 or 39 facial land- subjects. Images are taken under different illumina- marks are annotated, depending on the head poses. tion and backgrounds. Subjects may show moderate – XM2VTSDB [64]: The Extended M2VTS database expression variations. It contains landmark annota- contains videos of 295 subjects with speech and ro- tions of 20 points. tation head movements. 3D head model of each sub- – AR [60]: The set contains 4,000 frontal color im- ject is also provided. 68 facial landmark annotations ages of 126 people with expressions, illumination, are provided by [79]. and facial occlusions (e.g. sun glasses and scarf). 22 – FRGC v2 [73]: The Face Recognition Grand Chal- landmark annotations are provided [2]. lenge (FRGC) database contains 50,000 facial im- – Extended YaleB [31]: The extended Yale Face database ages from 4,003 subject sessions with different light- B contains 16,128 images of 28 subjects under 9 ing conditions and two facial expressions (smile and poses and 64 illumination conditions. The database neutral). 3D images acquired by special sensor (Mi- provides original images, the cropped facial images, nolta Vivid 900/910) consisting of both range and and three annotated landmarks. texture images are also provided. 68 facial landmark – FERET [74]: The Facial Recognition Technology annotations on selected images are provided by [79]. (FERET) database contains 14,051 gray scale facial – BU-4DFE [114]: The Binghamton University 4D Fa- images, covering about 20 discrete head poses that cial Expression database (BU-4DFE) contains 2D differ in yaw angles. Frontal faces also have illumina- and 3D videos for six prototypic facial expressions tion and facial expression variations. 11 landmarks (e.g., anger, disgust, happiness, fear, sadness, and on selected profile faces are provided by [106]. surprise) from 101 subjects (58 female and 43 male). – CK/CK+ [51][59]: The Cohn-Kanade AU-coded ex- There are approximately 60k+ images. 68 2D and pression database (CK) contains 486 (593 in CK+) 20 Yue Wu, Qiang Ji

3D facial landmark annotations on selected images and 507 testing images. There are annotations of 29 are provided by [97] landmark locations and landmark occlusions.

8.2.2 “In-the-wild” databases 8.3 Evaluation and discussion

Recently, researchers focus on developing more robust 8.3.1 Evaluation criteria and effective algorithms to handle facial landmark detection in real day-life situations. To evaluate the algo- Facial landmark detection and tracking algorithms out- rithms in those conditions, a few “in-the-wild” databases put the facial landmark locations in the facial images are collected from the webs, such as Flicks, facebook or videos. The accuracy is evaluated by comparing the etc. They contain all sorts of variations, including head detected landmark locations to the groundtruth facial pose, facial expression, illumination, ethnicity, occlu- landmark locations. In particular, if we denote the de- sion, etc. They are much more difficult than images with tected and groundtruth landmark locations for land- “controlled” conditions. Those databases are listed as mark i as d = d , d and g = g , g , the i { x,i y,i} i { x,i y,i} follows: detection error for the ith point is: – AFLW [53]: The Annotated Facial Landmark in the errori = di gi 2 (22) Wild (AFLW) database contains about 25K images. k − k The annotations include up to 21 landmarks based One issue with the above criteria is that the error on their visibility. could change significantly for faces with different sizes. – LFPW [10]: The Labeled face parts in the wild (LFPW) To handle this issue, there are several ways to normalize database contains 1,432 facial images. Since only the the error. The inter-ocular distance is the most popular URLs are provided, some images are no longer avail- criteria. If we denote the left and right pupil centers as able. 29 landmark annotations are provided by the gle and gre, we can calculate the normalized error as original database. Re-annotations of 68 facial land- follows: marks for 1,132 training images and 300 testing im- di gi 2 norm error = k − k (23) ages are provided by [79]. i g g k le − rek2 – Helen database [54]: The Helen database contains Besides the inter-ocular distance, some works [80] may 2,330 high resolution images with dense 194 facial choose the distance between outer eye corners as a nor- landmark annotations. Re-annotations of 68 land- malization constant. For particular images, such as im- marks are also provided by [79]. ages with extreme head poses ( 60 degree) or occlusion – AFW [125]: The Annotated Faces in the Wild (AFW) ≥ (e.g., Figure 11 and 12), the eyes may not be visible. database contains about 205 images with relatively Therefore, some other normalization constants, such as larger pose variations than the other “in-the-wild” the face size from the face bounding box [125] or the dis- databases. 6 facial landmark annotations are pro- tance between outer eye corner and outer mouth corner vided by the database, and re-annotations of 68 (same side of the face) [106] can be used as the normal- landmarks are provided by [79]. ization constants. – Ibug 300-W [79][81]: The ibug dataset from 300 faces To accumulate the errors of multiple landmarks for in the Wild (300-W) database 3 is the most chal- one image, the average normalized errors are used: lenging database so far with significant variations. It only contains 114 faces of 135 images with anno- 1 N d g norm error image = k i − ik2 (24) tations of 68 landmarks. N g g i le re 2 – Ibug 300-VW [86]: The 300 Video in the Wild (300) X k − k database contains 114 video sequences for three dif- To calculate the performances on multiple images, the ferent scenarios from easy to difficult. 68 facial land- mean error or the cumulative distribution error are mark annotations are provided. used. The mean error calculates the mean of the nor- – COFW [13]: The Caltech Occluded Faces in the malized errors of multiple images. The cumulative dis- Wild (COFW) database contains images with sig- tribution error calculates the percentages of images that nificant occlusions. There are 1345 training images lie under certain thresholds (see Figure 15). To evaluate the efficiency, the number of processed 3 Ibug 300-W database contains public available training frames is used. Normally, facial landmark detection al- images and private testing images. The training images in- gorithms are evaluated on regular PC (e.g., laptop) clude the annotations of public available databases and several newly collected images. Here, we name the newly col- without powerful GPU or parallel computing imple- lected images as Ibug 300-W database mentation etc. Facial Landmark Detection: a Literature Survey 21

8.3.2 Evaluation of existing algorithms see that the traditional cascaded regression methods [52][76] are faster than the other methods. In Table 4, we list the performances of leading algorithms on the benchmark databases, their categories Table 5: Efficiency comparison of leading algorithms. and the landmark detection errors. In Figure 15, we Types: H: holistic methods, C: constrained local meth- show the cumulative distribution curves of some algo- ods, R: regression based methods, DR: deep learning rithms on LFPW dtabase. Note that, in this paper, we based regression methods. focus on the reported results from the existing litera- tures. There are additional detailed references [16] [78] methods type # points fps [86] that provide original evaluations by running the TCDCN [119] DR 5 58 software and implementations of known algorithms on HyperFace [75] DR 21 5 Consensus of exem- C 29 1 [76] different databases. plars [10] There are several observations. First, generally, the 3DDFA [124] DR 68 13 regression based methods achieve much better perfor- CFAN [118] DR 68 40 CFSS [123] R 68 25 mances than the holistic methods and the constrained SDM [110] R 68 30 local model methods, especially on images with signifi- 3D Regression [97] R 68 111 cant variations (e.g., ibug 300-w). Second, deep learning Explicit shape regres- R 87 345 based regression methods (e.g., [119]) are the leading sion [14] RCPR [13] R 194 6 techniques and they achieve the state-of-the-art perfor- One millisecond face R 194 1000 mances on several databases. Third, the performances alignment [52] of the same algorithm are different across database, but Face alignment 3000 R 194 200/1500 the rank of multiple algorithms is generally consistent. fps [76]

The results shown here are generally consistent with the ﬁndings in [16] [78]. In [16], Chrysos et al. show that the one millisecond face alignment [52], the Supervised Descent method [110], and CFSS [123] are good options considering both the speed and accuracy.

8.4 Software

In Table 6 and 7, we list a few academic software and commercial software. The academic software refers to the implementations of the existing methods with paper publications. The commercial software is usually only available in a limited sense. For commercial software, visage SDK covers many applications, including facial Fig. 15: Comparison of the cumulative distribution landmark detection, head pose estimation, and facial curves of several algorithms: SDM [110], Discriminative expression recognition, which is a good option. deep model [105], Consensus of exemplars [10], AAM in the wild [99], and FPLL [125] on LFPW databases [10]. Figure is adapted from [105]. 9 Conclusion

In this paper, we reviewed the facial landmark detection The efficiencies of leading algorithms are shown in algorithms in three major categories: the holistic meth- Table 5. Note that, the computational speeds of differ- ods, the constrained local methods, and the regression- ent algorithms are reported from their original papers based methods. In addition, we specifically discussed and their evaluation methods may vary. For example, a few recent algorithms that try to handle facial land- they have different implementation choices (matlab vs. mark detection “in-the-wild” under different variations C++), and they run on different computers. Some al- caused by head poses, facial expressions, facial occlu- gorithms may only report the processing time by ex- sion, strong illumination, low resolution etc. Further- cluding the image loading time etc. Generally, we can more, we discussed the popular benchmark databases, 22 Yue Wu, Qiang Ji

Table 4: Accuracy comparison of leading algorithms. Types: H: holistic methods, C: constrained local methods, R: regression based methods, DR: deep learning based regression methods. The number provided by the original paper is marked as “*”. The error normalized by face size is indicated as “fs”.

databases # points methods type normalized error One millisecond face alignment [52] R 5.22 [97] BU-4DFE [114] 68[97] 3D regression[97] R 5.15* Explicit shape regression[14] R 8.24 [124] (fs) RCPR[13] R 7.85 [124] (fs) AFLW [53] 21 SDM[110] R 6.55 [124] (fs) 3DDFA [124] DR 5.32* (fs) HyperFace [75] DR 4.26* (fs) Explicit shape regression [14] R 10.4 [119] RCPR[13] R 9.3 [119] AFW[125] 5 SDM[110] R 8.8 [119] TCDCN[119] DR 8.2* Consensus of exemplars [10] C 3.99 [14] Robust facial landmark detection [106] R 3.93* One millisecond face alignment [52] R 3.8* 29 RCPR[13] R 3.50* SDM[110] R 3.47* Explicit shape regression [14] R 3.43* LFPW [10] Face alignment 3000 fps [76] R 3.35* FPLL[125] C 8.29 [123] DRMF[5] C 6.57 [123] RCPR[13] R 6.56 [123] 68 Gaussian-Newton DPM[100] C 5.92 [123] SDM[110] R 5.67 [123] CFAN[118] DR 5.44 [123] CFSS [123] R 4.87* FPLL[125] C 8.166 [123] DRMF[5] C 6.70 [123] RCPR[13] R 5.93 [123] 68 Gaussian-Newton DPM [100] C 5.69 [123] CFAN[118] DR 5.53 [123] SDM[110] R 5.50 [123] CFSS[123] R 4.63* Helen [54] Stasm(ASM)[65] C 11.1 [54] Component-based ASM[54] C 9.1* RCPR[13] R 6.50* SDM[110] R 5.85 [76] 194 Explicit shape regression [14] R 5.7* Robust facial landmark detection [106] R 5.49* Face alignment 3000 fps [76] R 5.41* One millisecond face alignment [52] R 4.9* CFSS[123] R 4.74* FPLL[125] C 10.20 [123] DRMF[5] C 9.22 [123] RCPR[13] R 8.35 [123] CFAN[118] DR 7.69 [28] Explicit shape regression [14] R 7.58 [76] SDM[110] R 7.52 [76] Ibug 300-W [79][81] 68 One millisecond face alignment [52] R 6.4 [123] Face alignment 3000 fps [76] R 6.32* 3DDFA [124] DR 6.31* CFSS[123] R 5.76* TCDCN[119] DR 5.54 [28] Facial Landmark Detection: a Literature Survey 23

Table 6: Summary of the academic software

methods detection realtime source number links (d) or (y) or code of points tracking not (n) (sc) or (t) binary code (bc) Stasm (ASM) [65] d y sc 77 http://www.milbo.users.sonic.net/stasm/ DeMoLib (AAM, d sc http://staff.estem-uc.edu.au/roland/ ASM etc.) research/demolib-home/ Generic AAM [98] d n sc 68 http://ibug.doc.ic.ac.uk/resources/ aoms-generic-face-alignment/ AAM in the wild [99] d n sc 68 http://ibug.doc.ic.ac.uk/resources/ fitting-aams-wild-iccv-2013/ FPLL [125] d n sc 68 or 39 http://www.ics.uci.edu/~xzhu/face/ Pose-free [115] d n bc 66 http://www.research.rutgers.edu/~xiangyu/ face_align.html Flandmark [101] d y sc 8 http://cmp.felk.cvut.cz/~uricamic/flandmark/ BoRMaN [102], d n bc 20 http://ibug.doc.ic.ac.uk/resources/ LEAR [61] facial-point-detector-2010/ DRMF [5] d n bc 66 http://ibug.doc.ic.ac.uk/resources/ drmf-matlab-code-cvpr-2013/; SDM [110] d & t y bc 49 http://www.humansensing.cs.cmu.edu/ intraface/index.php Face alignment 3000 d y sc 68 https://github.com/jwyang/face-alignment fps [76] RCPR [13] d&t y sc 29 http://www.vision.caltech.edu/xpburgos/ ICCV13/ One millisecond face d y sc 194 http://www.csc.kth.se/~vahidk/face_ert.html alignment [52] CNN [91] d y bc 5 http://mmlab.ie.cuhk.edu.hk/archive/CNN_ FacePoint.htm Incremental face d&t y bc 49 http://ibug.doc.ic.ac.uk/resources/ alignment [6] chehra-tracker-cvpr-2014/ Conditional regres- d sc 10 http://www.dantone.me/projects-2/ sion forest [25] facial-feature-detection/ CLMZ (OpenFace) d&t y sc 49 https://github.com/TadasBaltrusaitis/ [9] OpenFace/ CCNF [8] d y sc 30 https://www.cl.cam.ac.uk/~tb346/res/ccnf. html

Table 7: Summary of the commercial software. The trial version refers to the free evaluation/download without full access/functionalities.

methods detection realtime trial ver- number links (d) or (y) or sion of points tracking not (n) (t) Face++ d y y 83/25/5 http://www.faceplusplus.com/demo-landmark/ Betaface d n y 101 http://betaface.com/wpa/index.php/ demo-gallery Lambda Labs d y y 6 https://lambdal.com/face-recognition-api# src Visage d&t y y 51 http://visagetechnologies.com/ products-and-services/visagesdk/ LUXAND d&t y y 66 https://www.luxand.com/facesdk/ 24 Yue Wu, Qiang Ji performances of leading algorithms and a few existing such a large image requires a hybrid annotation meth- software. ods, including human annotation, online crowd sourc- There are still a few open questions about facial ing, and automatic annotation algorithms. landmark detection. First, the current facial landmark detection and tracking algorithms still have problems on facial images under challenging “in-the-wild” conditions, including extreme head poses, facial occlusion, References strong illumination, etc. The existing algorithms focus 1. BioID. https://www.bioid.com/About/ on solving one or a few conditions. There is still lack BioID-Face-Database. Accessed: 2015-08-30 of a facial landmark detection and tracking algorithm 2. Landmark annotation for ar database. http: that can handle all those cases. Second, there is a lack //personalpages.manchester.ac.uk/staff/timothy. f.cootes/data/tarfd_markup/tarfd_markup.html of a large facial image database that covers all differ- 3. Ahlberg, J.: An active model for facial feature track- ent conditions with facial landmark annotations, which ing. EURASIP Journal on Advances in Signal Process- may significantly speed up the development of the al- ing 2002(6), 569,028 (2002) gorithms. The existing databases only cover a few con- 4. Alabort-I-Medina, J., Zafeiriou, S.: Bayesian active appearance models. In: IEEE Conference on Computer ditions (e.g. head poses and expressions). Third, facial Vision and Pattern Recognition (2014) landmark detection still heavily relies on the face detec- 5. Asthana, A., Zafeiriou, S., Cheng, S., Pantic, M.: Robust tion accuracy, which may still fail in certain conditions. discriminative response map fitting with constrained lo- Fourth, the computational cost for some landmark de- cal models. In: IEEE Conference on Computer Vi- sion and Pattern Recognition, CVPR ’13, pp. 3444–3451 tection and tracking algorithms is still high. The fa- (2013) cial landmark detection and tracking algorithms should 6. Asthana, A., Zafeiriou, S., Cheng, S., Pantic, M.: Incre- meet the real-time processing requirement. mental face alignment in the wild. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1859– There are a few future research directions. First, 1866 (2014) since there are similarities as well as unique proper- 7. Baker, S., Gross, R., Matthews, I.: Lucas-kanade 20 ties about the methods in three major approaches, it years on: A unifying framework: Part 3. International would be beneficial to have a hybrid approach that com- Journal of Computer Vision 56, 221–255 (2002) 8. Baltrusaitis, T., Robinson, P., Morency, L.P.: Contin- bines all three approaches. For example, it would be in- uous Conditional Neural Fields for Structured Regres- teresting to see how and whether the appearance and sion. In: European Conference on Computer Vision, pp. shape models used in holistic methods and CLM can 593–608. Springer (2014) help the regression based methods. It is also interest- 9. Baltruˇsaitis,T., Robinson, P., Morency, L.P.: 3d constrained local model for rigid and non-rigid facial tracking to study whether the analytic solutions used for the ing. In: IEEE Conference on Computer Vision and Pat- holistic methods can be applied to solve the cascaded tern Recognition (2012) regression at each stage, since they share similar ob- 10. Belhumeur, P., Jacobs, D., Kriegman, D., Kumar, N.: ject functions as discussed in section 5. Vice versa, the Localizing parts of faces using a consensus of exemplars. IEEE Transactions on Pattern Analysis and Machine cascaded regression idea may be applied to the holistic Intelligence 35(12), 2930–2940 (2013) methods to predict the model coefficients in a cascaded 11. Belhumeur, P.N., Jacobs, D.W., Kriegman, D.J., Ku- manner. Second, currently, the dynamic information is mar, N.: Localizing parts of faces using a consensus of utilized in a limited sense. The facial motion informa- exemplars. In: IEEE Conference on Computer Vision and Pattern Recognition (2011) tion should be combined with the facial appearance and 12. Bourel, F., Chibelushi, C., Low, A.: Robust facial fea- facial shape for facial landmark tracking. For exam- ture tracking. In: British Machine Vision Conference, ple, it would be interesting to see how and whether the pp. 24.1–24.10 (2000) dynamic features would help facial landmark tracking. 13. Burgos-Artizzu, X.P., Perona, P., Dollar, P.: Robust face Landmark tracking with facial structure information is landmark estimation under occlusion. In: IEEE Interna- tional Conference on Computer Vision, pp. 1513–1520 also an interesting direction. Third, since there are re- (2013) lationships among facial landmark detection and other 14. Cao, X., Wei, Y., Wen, F., Sun, J.: Face alignment by facial behavior analysis tasks, including head pose esti- explicit shape regression. International Journal of Com- mation and facial expression recognition, their interac- puter Vision pp. 177–190 (2014) 15. Chen, D., Ren, S., Wei, Y., Cao, X., Sun, J.: Joint cas- tions should be utilized for joint analysis. By leveraging cade face detection and alignment. In: D. Fleet, T. Pa- their dependencies, we can incorporate the computer vi- jdla, B. Schiele, T. Tuytelaars (eds.) European Confer- sion projection models and improve the performances ence on Computer Vision, Lecture Notes in Computer for all tasks. Finally, to fully exploit the power of deep Science, vol. 8694, pp. 109–122. Springer International Publishing (2014) learning, a large annotated database of millions of im- 16. Chrysos, G.G., Antonakos, E., Snape, P., Asthana, A., ages under different conditions is needed. Annotation of Zafeiriou, S.: A comprehensive performance evaluation Facial Landmark Detection: a Literature Survey 25

of deformable face tracking “in-the-wild”. International 34. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich Journal of Computer Vision (2017) feature hierarchies for accurate object detection and 17. Cootes, T., Walker, K., Taylor, C.: View-based active semantic segmentation. In: The IEEE Conference on appearance models. In: IEEE International Conference Computer Vision and Pattern Recognition (CVPR) on Automatic Face and Gesture Recognition, pp. 227– (2014) 232 (2000) 35. Gou, C., Wu, Y., Wang, F.Y., Ji, Q.: Shape Augmented 18. Cootes, T.F., Edwards, G.J., Taylor, C.J.: Active ap- Regression for 3D Face Alignment, pp. 604–615. Cham pearance models. IEEE Transactions on Pattern Anal- (2016) ysis and Machine Intelligence 23(6), 681–685 (2001) 36. Gower, J.C.: Generalized procrustes analysis. Psy- 19. Cootes, T.F., Ionita, M.C., Lindner, C., Sauer, P.: Ro- chometrika 40(1), 33–51 (1975) bust and accurate shape model fitting using random for- 37. Gross, R., Matthews, I., Baker, S.: Appearance-based est regression voting. In: European Conference on Com- face recognition and light-fields. IEEE Transactions on puter Vision - Volume Part VII, pp. 278–291 (2012) Pattern Analysis and Machine Intelligence 26(4), 449– 20. Cootes, T.F., Taylor, C.J., Cooper, D.H., Graham, J.: 465 (2004) Active shape models their training and application. 38. Gross, R., Matthews, I., Baker, S.: Generic vs. person Computer Vision and Image Understanding 61(1), 38– specific active appearance models. Image Vision Com- 59 (1995) puting 23(12), 1080–1093 (2005) 21. Cosar, S., Cetin, M.: A graphical model based solution 39. Gross, R., Matthews, I., Cohn, J., Kanade, T., Baker, to the facial feature point tracking problem. Image and S.: Multi-pie. Image Vision Computing 28(5), 807–813 Vision Computing 29(5), 335 – 350 (2011) (2010) 22. Cristinacce, D., Cootes, T.: Boosted regression active 40. Gu, L., Kanade, T.: A generative shape regularization shape models. In: British Machine Vision Conference, model for robust face alignment. In: European Confer- pp. 880–889 (2007) ence on Computer Vision: Part I, pp. 413–426. Springer- 23. Cristinacce, D., Cootes, T.F.: A comparison of shape Verlag, Berlin, Heidelberg (2008) constrained facial feature detectors. In: International 41. Hansen, D.W., Ji, Q.: In the eye of the beholder: A sur- Conference on Automatic Face and Gesture Recogni- vey of models for eyes and gaze. IEEE Transactions on tion, pp. 375–380 (2004) Pattern Analysis and Machine Intelligence 32(3), 478– 500 (2010) 24. Cristinacce, D., Cootes, T.F.: Feature detection and 42. Heisele, B., Serre, T., Poggio, T.: A component-based tracking with constrained local models. In: British Ma- framework for face detection and identification. Inter- chine Vision Conference (2006) national Journal of Computer Vision 74(2), 167–181 25. Dantone, M., Gall, J., Fanelli, G., Gool, L.V.: Real- (2007) time facial feature detection using conditional regression 43. Hou, X., Li, S., Zhang, H., Cheng, Q.: Direct appearance forests. In: IEEE Conference on Computer Vision and models. In: IEEE Conference on Computer Vision and Pattern Recognition (2012) Pattern Recognition, vol. 1 (2001) 26. Donner, R., Reiter, M., Langs, G., Peloschek, P., 44. Hsu, G.S., Chang, K.H., Huang, S.C.: Regressive tree Bischof, H.: Fast active appearance model search us- structured model for facial landmark localization. In: ing canonical correlation analysis. IEEE Transactions IEEE International Conference on Computer Vision, pp. on Pattern Analysis and Machine Intelligence 28(10), 3855–3861 (2015) 1690–1694 (2006) 45. Hu, C., Feris, R., Turk, M.: Real-time view-based face 27. Edwards, G.J., Taylor, C.J., Cootes, T.F.: Interpreting alignment using active wavelet networks. In: IEEE face images using active appearance models. In: IEEE International Workshop on Analysis and Modeling of International Conference on Face and Gesture Recogni- Faces and Gestures, pp. 215–221 (2003) tion, pp. 300–305. IEEE Computer Society (1998) 46. Jeni, L.A., Cohn, J.F., Kanade, T.: Dense 3d face 28. Fan, H., Zhou, E.: Approaching human level facial land- alignment from 2d videos in real-time. In: 2015 11th mark localization by deep learning. Image Vision Com- IEEE International Conference and Workshops on Au- put. 47(C), 27–35 (2016) tomatic Face and Gesture Recognition (FG) (2015). 29. Felzenszwalb, P.F., Girshick, R.B., McAllester, D., URL articles/Jeni15FG_ZFace.pdf Ramanan, D.: Object detection with discriminatively 47. Jiao, F., Li, S., Shum, H., Schuurmans, D.: Face align- trained part-based models. IEEE Transactions on Pat- ment using statistical models and wavelet features. In: tern Analysis and Machine Intellgence 32(9), 1627–1645 IEEE Conference on Computer Vision and Pattern (2010) Recognition (2003) 30. Feng, Z.H., Huber, P., Kittler, J., Christmas, W., Wu, 48. Jones, M., Poggio, T.: Multidimensional morphable X.J.: Random cascaded-regression copse for robust fa- models: A framework for representing and matching ob- cial landmark detection. Signal Processing Letters, ject classes. International Journal of Computer Vision IEEE 22(1), 76–80 (2015) 29(2), 107–131 (1998) 31. Georghiades, A., Belhumeur, P., Kriegman, D.: From 49. Jourabloo, A., Liu, X.: Pose-invariant 3d face alignment. few to many: Illumination cone models for face recog- In: 2015 IEEE International Conference on Computer nition under variable lighting and pose. IEEE Trans- Vision (ICCV), pp. 3694–3702 (2015) actions on Pattern Analysis and Machine Intelligence 50. Jourabloo, A., Liu, X.: Large-pose face alignment via 23(6), 643–660 (2001) cnn-based dense 3d model fitting. In: IEEE Conference 32. Ghiasi, G., Fowlkes, C.: Occlusion coherence: Localiz- on Computer Vision and Pattern Recognition. Las Ve- ing occluded faces with a hierarchical deformable part gas, NV (2016) model. In: IEEE Conference on Computer Vision and 51. Kanade, T., Cohn, J.F., Tian, Y.: Comprehensive Pattern Recognition, pp. 1899–1906 (2014) database for facial expression analysis. In: IEEE In- 33. Girshick, R.: Fast r-cnn. In: The IEEE International ternational Conference on Automatic Face and Gesture Conference on Computer Vision (ICCV) (2015) Recognition, pp. 46–53 26 Yue Wu, Qiang Ji

52. Kazemi, V., Sullivan, J.: One millisecond face alignment 69. Papazov, C., Marks, T., Jones, M.: Real-time head pose with an ensemble of regression trees. In: IEEE Con- and facial landmark estimation from depth images using ference on Computer Vision and Pattern Recognition triangular surface patch features. In: IEEE Conference (CVPR), pp. 1867–1874 (2014) on Computer Vision and Pattern Recognition (CVPR), 53. Koestinger, M., Wohlhart, P., Roth, P.M., Bischof, H.: pp. 4722–4730. IEEE (2015) Annotated facial landmarks in the wild: A large-scale, 70. Patacchiola, M., Cangelosi, A.: Head pose estimation in real-world database for facial landmark localization. In: the wild using convolutional neural networks and adap- First IEEE International Workshop on Benchmarking tive gradient methods. Pattern Recognition 71, 132 – Facial Image Analysis Technologies (2011) 143 (2017) 54. Le, V., Brandt, J., Lin, Z., Bourdev, L., Huang, T.S.: In- 71. Patrick Sauer, T.C., Taylor, C.: Accurate regression pro- teractive facial feature localization. In: European Con- cedures for active appearance models. In: British Ma- ference on Computer Vision - Volume Part III, pp. 679– chine Vision Conference (2011) 692 (2012) 72. Perakis, P., Passalis, G., Theoharis, T., Kakadiaris, 55. Levi, G., Hassncer, T.: Age and gender classification us- I.A.: 3d facial landmark detection under large yaw and ing convolutional neural networks. In: 2015 IEEE Con- expression variations. IEEE Transactions on Pattern ference on Computer Vision and Pattern Recognition Analysis and Machine Intelligence 35(7), 1552–1564 Workshops (CVPRW), pp. 34–42 (2015) (2013) 56. Li, Y., Wang, S., Zhao, Y., Ji, Q.: Simultaneous fa- 73. Phillips, P.J., Flynn, P.J., Scruggs, T., Bowyer, K.W., cial feature tracking and facial expression recognition. Chang, J., Hoffman, K., Marques, J., Min, J., Worek, IEEE Transactions on Image Processing 22(7), 2559– W.: Overview of the face recognition grand challenge. 2573 (2013) In: IEEE Conference on Computer Vision and Pattern 57. Liang S Wu J, W.S.S.L.: Improved detection of land- Recognition, CVPR ’05, pp. 947–954. IEEE Computer marks on 3d human face data. In: Annual International Society, Washington, DC, USA (2005) Conference of the IEEE Engineering in Medicine and 74. Phillips, P.J., Moon, H., Rauss, P., Rizvi, S.A.: The feret Biology Society (2013) evaluation methodology for face-recognition algorithms. 58. Lopes, A.T., de Aguiar, E., Souza, A.F.D., Oliveira- In: IEEE Conference on Computer Vision and Pattern Santos, T.: Facial expression recognition with convolu- Recognition, CVPR ’97, pp. 137–. IEEE Computer So- tional neural networks: Coping with few data and the ciety, Washington, DC, USA (1997) training sample order. Pattern Recognition 61, 610 – 75. Ranjan, R., Patel, V.M., Chellappa, R.: Hyperface: 628 (2017) A deep multi-task learning framework for face detec- 59. Lucey, P., Cohn, J., Kanade, T., Saragih, J., Ambadar, tion, landmark localization, pose estimation, and gen- Z., Matthews, I.: The extended cohn-kanade dataset der recognition. CoRR abs/1603.01249 (2016). URL (ck+): A complete dataset for action unit and emotion- http://arxiv.org/abs/1603.01249 specified expression. In: IEEE Conference on Computer 76. Ren, S., Cao, X., Wei, Y., Sun, J.: Face alignment at Vision and Pattern Recognition Workshops, pp. 94–101 3000 fps via regressing local binary features. In: IEEE (2010) Conference on Computer Vision and Pattern Recogni- 60. Mart´ınez, A., Benavente, R.: The ar face database tion (CVPR), pp. 1685–1692 (2014) (1998) 77. Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: To- 61. Martinez, B., Valstar, M.F., Binefa, X., Pantic, M.: wards real-time object detection with region proposal Local evidence aggregation for regression-based facial networks. In: NIPS (2015) point detection. IEEE Transactions on Pattern Analy- sis and Machine Intelligence 35(5), 1149–1163 (2013) 78. Sagonas, C., Antonakos, E., Tzimiropoulos, G., 62. Mathias, M., Benenson, R., Pedersoli, M., Van Gool, L.: Zafeiriou, S., Pantic, M.: 300 faces in-the-wild challenge: Face detection without bells and whistles. In: European database and results. Image and Vision Computing 47, Conference on Computer Vision (2014) 3 – 18 (2016). 300-W, the First Automatic Facial Land- 63. Matthews, I., Baker, S.: Active appearance models revis- mark Detection in-the-Wild Challenge ited. International Journal of Computer Vision 60(2), 79. Sagonas, C., Tzimiropoulos, G., Zafeiriou, S., Pantic, 135–164 (2004) M.: 300 faces in-the-wild challenge: The first facial land- 64. Messer, K., Matas, J., Kittler, J., Jonsson, K.: mark localization challenge. In: IEEE International Xm2vtsdb: The extended m2vts database. In: Interna- Conference on Computer Vision, 300 Faces in-the-Wild tional Conference on Audio and Video-based Biometric Challenge (300-W). Sydney, Australia (2013) Person Authentication, pp. 72–77 (1999) 80. Sagonas, C., Tzimiropoulos, G., Zafeiriou, S., Pantic, 65. Milborrow, S., Nicolls, F.: Locating facial features with M.: A semi-automatic methodology for facial landmark an extended active shape model. In: European Con- annotation. In: 2013 IEEE Conference on Computer ference on Computer Vision: Part IV, pp. 504–513. Vision and Pattern Recognition Workshops, pp. 896– Springer-Verlag, Berlin, Heidelberg (2008) 903 (2013) 66. Murphy-Chutorian, E., Trivedi, M.: Head pose estima- 81. Sagonas, C., Tzimiropoulos, G., Zafeiriou, S., Pantic, tion in computer vision: A survey. IEEE Transactions on M.: A semi-automatic methodology for facial landmark Pattern Analysis and Machine Intelligence 31(4), 607– annotation. In: IEEE Conference on Computer Vision 626 (2009) and Pattern Recognition Workshop. Portland Oregon, 67. Nickels, K., Hutchinson, S.: Estimating uncertainty in USA (2013) ssd-based feature tracking. Image and Vision Comput- 82. Saragih, J., Gocke, R.: Learning aam fitting through ing 20, 47–58 (2002) simulation. Pattern Recognition 42(11), 2628 – 2636 68. Pantic, M., Rothkrantz, L.J.M.: Automatic analysis of (2009) facial expressions: The state of the art. IEEE Tran- 83. Saragih, J., Goecke, R.: A nonlinear discriminative ap- sanctions on Pattern Analysis and Machine Intellgence proach to aam fitting. In: International Conference on 22(12), 1424–1445 (2000) Computer Vision, pp. 1–8 (2007) Facial Landmark Detection: a Literature Survey 27

84. Saragih, J.M., Lucey, S., Cohn, J.F.: Deformable model International Conference on Computer Vision Theory fitting by regularized landmark mean-shift. Inter- and Applications, vol. 1, pp. 547–556. Portugal (2012) national Journal of Computer Vision 91(2), 200–215 102. Valstar, M., Martinez, B., Binefa, V., Pantic, M.: Fa- (2011) cial point detection using boosted regression and graph 85. Schroff, F., Kalenichenko, D., Philbin, J.: Facenet: A models. In: IEEE Conference on Computer Vision and unified embedding for face recognition and clustering Pattern Recognition, pp. 13–18 (2010) (2015) 103. Viola, P., Jones, M.: Rapid object detection using a 86. Shen, J., Zafeiriou, S., Chrysos, G.G., Kossaifi, J., Tz- boosted cascade of simple features. In: IEEE Conference imiropoulos, G., Pantic, M.: The first facial landmark on Computer Vision and Pattern Recognition, vol. 1, pp. tracking in-the-wild challenge: Benchmark and results. I–511–I–518 vol.1 (2001) In: The IEEE International Conference on Computer 104. Williams, O., Blake, A., Cipolla, R.: Sparse Bayesian Vision (ICCV) Workshops (2015) learning for efficient visual tracking. IEEE Transactions 87. Shen, X., Lin, Z., Brandt, J., Wu, Y.: Detecting and on Pattern Analysis and Machine Intelligence 27(8), aligning faces by image retrieval. In: IEEE Conference 1292–1304 (2005) 105. Wu, Y., Ji, Q.: Discriminative deep face shape model on Computer Vision and Pattern Recognition (2013) for facial point detection. International Journal of Com- 88. Smith, B., Brandt, J., Lin, Z., Zhang, L.: Nonparamet- puter Vision 113(1), 37–53 (2015) ric context modeling of local appearance for pose- and 106. Wu, Y., Ji, Q.: Robust facial landmark detection under expression-robust facial landmark localization. In: IEEE significant head poses and occlusion. In: International Conference on Computer Vision and Pattern Recogni- Conference on Computer Vision (2015) tion, pp. 1741–1748 (2014) 107. Wu, Y., Ji, Q.: Constrained joint cascade regression 89. Smith, B.M., Zhang, L.: Collaborative Facial Land- framework for simultaneous facial action unit recogni- mark Localization for Transferring Annotations Across tion and facial landmark detection. In: IEEE Conference Datasets, pp. 78–93. Springer International Publishing, on Computer Vision and Pattern Recognition (2016) Cham (2014) 108. Wu, Y., Wang, Z., Ji, Q.: Facial feature tracking under 90. Sun, Y., Liang, D., Wang, X., Tang, X.: Deepid3: Face varying facial expressions and face poses based on re- recognition with very deep neural networks. CoRR stricted boltzmann machines. In: IEEE Conference on abs/1502.00873 (2015) Computer Vision and Pattern Recognition, pp. 3452– 91. Sun, Y., Wang, X., Tang, X.: Deep convolutional net- 3459 (2013) work cascade for facial point detection. In: IEEE Con- 109. Wu, Y., Wang, Z., Ji, Q.: A hierarchical probabilistic ference on Computer Vision and Pattern Recognition, model for facial feature detection. In: IEEE Conference pp. 3476–3483 (2013) on Computer Vision and Pattern Recognition, pp. 1781– 92. Taigman, Y., Yang, M., Ranzato, M., Wolf, L.: Deep- 1788 (2014) face: Closing the gap to human-level performance in face 110. Xiong, X., De la Torre Frade, F.: Supervised descent verification (2014) method and its applications to face alignment. In: IEEE 93. Tong, Y., Liu, X., Wheeler, F.W., Tu, P.H.: Semi- International Conference on Computer Vision and Pat- supervised facial landmark annotation. Computer Vi- tern Recognition (2013) sion and Image Understanding 116(8), 922 – 935 (2012) 111. Xiong, X., la Torre, F.D.: Global supervised descent 94. Tong, Y., Wang, Y., Zhu, Z., Ji, Q.: Robust facial feature method. In: IEEE Conference on Computer Vision and tracking under varying face pose and facial expression. Pattern Recognition, pp. 2664–2673 (2015) Pattern Recogn 40(11), 3195–3208 (2007) 112. Yan, S., Hou, X., Li, S.Z., Zhang, H., Cheng, Q.: Face 95. Tresadern, P., Sauer, P., Cootes, T.: Additive update alignment using view-based direct appearance models. predictors in active appearance models. In: British Ma- Special Issue on Facial Image Processing, Analysis and chine Vision Conference, pp. 91.1–91.12. BMVA Press Synthesis. International Journal of Imaging Systems and (2010) Technology. 2003 13, 106–112 (2003) 96. Trigeorgis, G., Snape, P., Nicolaou, M.A., Antonakos, 113. Yang, H., Patras, I.: Privileged information-based con- E., Zafeiriou, S.: Mnemonic descent method: A recurrent ditional regression forest for facial feature detection. In: process applied for end-to-end face alignment. In: IEEE IEEE International Conference and Workshops on Au- Conference on Computer Vision and Pattern Recog- tomatic Face and Gesture Recognition, pp. 1–6 (2013) 114. Yin, L., Chen, X., Sun, Y., Worm, T., Reale, M.: A nition (CVPR), pp. 4177–4187. Las Vegas, NV, USA high-resolution 3d dynamic facial expression database. (2016) FG 2,3,5 (2008) 97. Tulyakov, S., Sebe, N.: Regressing a 3d face shape from 115. Yu, X., Huang, J., Zhang, S., Yan, W., Metaxas, D.: a single image. In: IEEE International Conference on Pose free facial landmark fitting via optimized part mix- Computer Vision, pp. 3748–3755 (2015) tures and cascaded deformable shape model. In: IEEE 98. Tzimiropoulos, G., i medina, J.A., Zafeiriou, S., Pan- International Conference on Computer Vision (2013) tic, M.: Generic active appearance models revisited. In: 116. Yu, X., Lin, Z., Brandt, J., Metaxas, D.N.: Consensus Asian Conference on Computer Vision, pp. 650–663. of regression for occlusion-robust facial feature localiza- Daejeon, Korea (2012) tion. In: D. Fleet, T. Pajdla, B. Schiele, T. Tuytelaars 99. Tzimiropoulos, G., Pantic, M.: Optimization problems (eds.) European Conference on Computer Vision, Lec- for fast aam fitting in-the-wild. In: IEEE International ture Notes in Computer Science, vol. 8692, pp. 105–118. conference on Computer Vision, pp. 593–600 Springer (2014) 100. Tzimiropoulos, G., Pantic, M.: Gauss-newton de- 117. Zhang, C., Zhang, Z.: A survey of recent advances in formable part models for face alignment in-the-wild. face detection. Tech. Rep. MSR-TR-2010-66 (2010) In: IEEE Conference on Computer Vision and Pattern 118. Zhang, J., Shan, S., Kan, M., Chen, X.: Coarse-to-fine Recognition, pp. 1851–1858 (2014) auto-encoder networks (CFAN) for real-time face align- 101. Uˇriˇcáˇr, M., Franc, V., Hlaváˇc,V.: Detector of facial ment. In: European Conference on Computer Vision, landmarks learned by the structured output SVM. In: Part II, pp. 1–16 (2014) 28 Yue Wu, Qiang Ji

119. Zhang, Z., Luo, P., Loy, C., Tang, X.: Facial landmark detection by deep multi-task learning. In: European Conference on Computer Vision, Part II, pp. 94–108 (2014) 120. Zhang, Z., Luo, P., Loy, C.C., Tang, X.: Learning deep representation for face alignment with auxiliary attributes. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(5), 918–930 (2016) 121. Zhao, X., Kim, T.K., Luo, W.: Unified face analysis by iterative multi-output random forests. In: IEEE Con- ference on Computer Vision and Pattern Recognition, pp. 1765–1772 (2014) 122. Zhou, E., Fan, H., Cao, Z., Jiang, Y., Yin, Q.: Extensive facial landmark localization with coarse-to-fine convolutional network cascade. In: IEEE International Con- ference on Computer Vision Workshops, pp. 386–391 (2013) 123. Zhu, S., Li, C., Change Loy, C., Tang, X.: Face alignment by coarse-to-fine shape searching. In: IEEE Con- ference on Computer Vision and Pattern Recognition (2015) 124. Zhu, X., Lei, Z., Liu, X., Shi, H., Li, S.: Face alignment across large poses: A 3d solution. In: IEEE Conference on Computer Vision and Pattern Recognition. Las Ve- gas, NV (2016) 125. Zhu, X., Ramanan, D.: Face detection, pose estimation, and landmark localization in the wild. In: IEEE Con- ference on Computer Vision and Pattern Recognition, pp. 2879–2886 (2012)