LDDMM-Face: Large Deformation Diffeomorphic Metric Learning for Flexible and Consistent Face Alignment

LDDMM-Face: Large Deformation Diffeomorphic Metric Learning for Flexible and Consistent Face Alignment Huilin Yang *1,2, Junyan Lyu *1,3, Pujin Cheng1, and Xiaoying Tang1 1Southern University of Science and Technology [email protected] 2The University of British Columbia 3The University of Queensland Abstract We innovatively propose a flexible and consistent face alignment framework, LDDMM-Face, the key contribution of which is a deformation layer that naturally embeds facial geometry in a diffeomorphic way. Instead of predicting facial landmarks via heatmap or coordinate regression, we formulate this task in a diffeomorphic registration manner and predict momenta that uniquely parameterize the deformation between initial boundary and true boundary, and then perform large deformation diffeomorphic metric mapping (LDDMM) simultaneously for curve and landmark to localize the facial landmarks. Due to the embedding of LD- DMM into a deep network, LDDMM-Face can consistently annotate facial landmarks without ambiguity and flexibly Figure 1: Representative LDDMM-Face results. Top: train- handle various annotation schemes, and can even predict ing with partial annotations on the WFLW, HELEN and dense annotations from sparse ones. Our method can 300W datasets (from left to right); bottom: predictions of be easily integrated into various face alignment networks. full annotations from LDDMM-Face. We extensively evaluate LDDMM-Face on four benchmark datasets: 300W, WFLW, HELEN and COFW-68. LDDMM- Face is comparable or superior to state-of-the-art meth- 1. Introduction arXiv:2108.00690v1 [cs.CV] 2 Aug 2021 ods for traditional within-dataset and same-annotation settings, but truly distinguishes itself with outstanding per- Face alignment refers to identifying the geometric struc- formance when dealing with weakly-supervised learning ture of a human face in a digital image, through localizing (partial-to-full), challenging cases (e.g., occluded faces), key landmarks that are usually predefined and characteriz- and different training and prediction datasets. In addition, able of the face’s geometry. Face alignment is a prerequi- LDDMM-Face shows promising results on the most chal- site in many computer vision tasks, such as face recognition lenging task of predicting across datasets with different an- [53], facial expression recognition [36, 52], face verification notation schemes. [39], face reconstruction [33] and face reenactment [29]. For different datasets, there exist different face alignment annotation schemes. For example, COFW [4] annotates 29 landmarks, 300W [34] annotates 68 landmarks, WFLW [46] annotates 98 landmarks, and HELEN [23] annotates 194 landmarks. Most existing face alignment methods can *Equal contribution only deal with the specific annotation scheme adopted by the training dataset of interest, but cannot flexibly accom- most face alignment networks to effectively predict fa- modate multiple annotation schemes. Namely, if a model cial landmarks with different annotation schemes. is trained on a dataset with a specific annotation scheme, it can then only predict landmarks of the specific scheme; • Our approach for the first time identifies the feasibil- a model trained on 300W with a 68-landmark annotation ity of predicting consistent facial boundaries and full scheme can only predict the learned 68 landmarks but not facial landmarks supervised with only partial annota- other annotation schemes such as a 194-landmark scheme. tions of training data, in a novel manner of weakly- In addition, to date, there is no such work that can fully uti- supervised learning. lize partially-annotated data. In other words, it is infeasible • We comprehensively evaluate the performance of our to make full predictions based on only partially-annotated proposed LDDMM-Face on multiple widely-used face data; for example, predicting 194 landmarks when train- alignment benchmark datasets [23, 46, 34], in terms of ing with only 97 or even less landmarks. Even in tradi- not only landmark prediction accuracy but also overall tional fully-supervised settings, most existing works cannot facial geometry matching degree. very well handle challenging cases such as faces with oc- clusion. This is because those existing works usually han- • We demonstrate the effectiveness of LDDMM-Face dle each landmark individually or predict landmarks with in being superior in handling challenging cases even non-diffeomorphic deformations. In this way, the predicted across datasets, being excellent in making partial-to- landmarks could be inconsistent, resulting in incorrect fa- full predictions, being adaptable to various deep net- cial geometry topology. work settings, being capable of predicting consistent In such context, we formulate face alignment into a facial boundaries with different training annotations, diffeomorphic registration framework. Specifically, we and being flexible in handling multiple annotation use boundary curves to represent facial geometry [11]. schemes either within or across datasets. Then, large deformation diffeomorphic metric mapping (LDDMM) simultaneously for curve and landmark [16, 18] 2. Related Works between an initial face and the true face is encoded into a neural network for landmark localization. LDDMM de- Face alignment has been widely researched with fruitful livers a non-linear smooth transformation with a favorable outcomes. From early Active Appearance Models [7, 25, topology-preserving one-to-one mapping property. Once 19, 35], Active Shape Models [26] to recently-developed the diffeomorphism, which is parameterized by momenta Cascaded Shape Regression [49, 38, 20, 32, 24, 58, 5, 43] [11, 16, 18], between the initial face and the true face is ob- and deep learning methods [56, 55, 41, 48, 12, 3, 51, 21, 28], tained, all points on/around the initial face have unique cor- accuracy has been improved significantly. Equipped with respondence on/around the true face through the acquired powerful image feature extraction capability of Convolu- diffeomorphism. This property makes it possible to pre- tional Neural Network (CNN), deep learning methods hold dict facial landmarks of different annotation schemes with state-of-the-art (SOTA) results. The proposed method of a model trained only on landmarks from a single annotation this work, LDDMM-Face, fits in the deep learning scope. scheme. Utilizing both landmark and curve enables LD- Coordinate regression models Coordinate regression DMM to handle shape deformations both locally and glob- models use neural networks to directly predict coordinates ally; the role of the landmark term is to match the corre- of facial landmarks. Zhang et al. [55] incorporates a vari- sponding landmarks whereas the role of the curve term is to ety of facial attributes like gender, expression and appear- make the corresponding facial curves be close to each other ance in a multi-task learning framework. Kowalski et al. and to preserve facial topology such that consistent land- [21] presents the Deep Alignment Network (DAN) that em- mark predictions can be made. Noteworthily, we predict ploys landmark heatmap and transformation of face and momenta instead of increments between the initial face and landmarks to a canonical plane in CNN. Qian et al. [31] the true face, which gives it the additional flexibility and is leverages disentangled style and shape space of each indi- the key novelty of the proposed method. This is the first vidual to augment existing structures via style translation, time that face alignment is formulated as a diffeomorphic which can be regarded as a data augmentation approach. registration problem, providing novel insights into the face Heatmap regression models Heatmap regression mod- alignment realm. els use neural networks to regress a set of heatmaps, one for In this work, our contributions include: each individual landmark, and then indirectly estimate landmarks from the heatmaps. Wu et al. [46] uses facial bound- • We propose a novel face alignment network by inte- ary heatmap to supervise landmark prediction, utilizing a grating LDDMM into deep neural networks to handle hourglass module [28] and a message passing scheme [6]. various facial annotation schemes. Our proposed ap- Zou et al. [60] employs hierarchically structured landmark proach, LDDMM-Face, can be easily integrated into ensembles to depict holistic and local structures of facial landmarks for robust facial landmark detection. Wang et features are passed through a deep LDDMM head which al. [45] proposes an adaptive wing loss for heatmap regres- consists of a momenta estimator and a deformation layer. sion, with an ability to adapt its shape to various types of The momenta estimator contains fully-connected layers and ground truth heatmap pixels. Kumar et al. [22] uses a mean predicts vertical and horizontal momenta for each land- estimator for heatmap and jointly predicts landmark loca- mark. Suppose the geometry of a face is characterized by tions, associating uncertainties of the predicted locations N boundary curves, the deformation layer has N sublayers and landmark visibilities. Browatzki et al. [2] leverages im- (flow 1 to flow N). Each sublayer separately deforms the plicit knowledge by training an autoencoder on unannotated corresponding initial curve, the procedure of which is de- facial datasets to perform few-shot face alignment. tailed in subsection 3.1. Two inputs, the mean face serving Weakly supervised learning for face alignment as the initial face and the estimated momenta, are fed into Weakly

Load more