
International Journal of Computer Vision manuscript No. (will be inserted by the editor)

Real-Time 3D Head Pose Tracking Through 2.5D Constrained Local Models with Local Neural Fields

Stephen Ackland · Francisco Chiclana · Howell Istance · Simon Coupland

Received: date / Accepted: date

Abstract Tracking the head in a video stream is a common thread seen within computer vision literature, supplying the research community with a large number of challenging and interesting problems. Head pose estimation from monocular cameras is often considered an extended application after the face tracking task has already been performed. This often involves passing the resultant 2D data through a simpler algorithm that best fits the data to a static 3D model to determine the 3D pose estimate. This work describes the 2.5D Constrained Local Model, combining a deformable 3D point model with 2D texture information to provide direct estimation of the pose parameters, avoiding the need for additional optimization strategies. It achieves this through an analytical derivation of a Jacobian matrix describing how changes in the parameters of the model create changes in the shape within the image through a full-perspective camera model. In addition, the model has very low computational complexity and can run in real-time on modern mobile devices such as tablets and laptops. The Point Distribution Model of the face is built in a unique way, so as to minimize the effect of changes in facial expressions on the estimated head pose and hence make the solution more robust. Finally, the texture information is trained via Local Neural Fields (LNFs) – a deep learning approach that utilizes small discriminative patches to exploit spatial relationships between the pixels and provide strong peaks at the optimal locations.

Keywords Head Pose Estimation · 2.5D CLM · LNF · Real-time

S. Ackland
De Montfort University, Leicester, UK
E-mail: [email protected]

F. Chiclana
De Montfort University, Leicester, UK
E-mail: [email protected]

H. Istance
University of Tampere, Finland
E-mail: howell.istance@staff.uta.fi

S. Coupland
De Montfort University, Leicester, UK
E-mail: [email protected]

1 Introduction

Face tracking and 3D pose estimation from monocular video streams is an interesting yet challenging task, with novel approaches continually being published in the literature. This paper investigates the prospect of acquiring the 3D head pose from consumer cameras such as webcams and those attached to modern computing devices such as smart phones and tablets, with the processing power and memory available on such devices now allowing for computationally intensive applications to run in real-time. The head-pose estimate involves the approximation of the translation and rotation of the head, typically relative to the camera. This process involves some element of detection and tracking within the 2D image in order to obtain a 3D output. Of course, since the 2D image has effectively lost a dimension of data, it is vital that realistic models of the head and camera are known to achieve a reliable estimate of real-world pose.

Having a predefined model of shape and texture can be beneficial in constraining the search process in order to reduce false positives. The analysis of the image data can take two main forms: generative models, which are parameterized textures of the whole face (a holistic texture model) that when combined synthesize a new facial appearance, and discriminative models, which combine a number of local feature detectors (patch texture models) representing each point on the face. The discriminative approach comes down to a simple classification problem that can be solved using applications of deep learning. For example, does this new image patch represent an eye corner, or not? By combining the responses from all of the patches, we can optimize our shape parameters to best solve the problem.

The main contribution of this work is the design and development of the 2.5D Constrained Local Model (2.5D CLM) for head pose estimation from a monocular camera, particularly on devices with limited resources such as a tablet or laptop. The model combines a 3D shape with 2D texture information that provides direct estimation of the pose parameters, avoiding the need for additional optimization strategies. The texture information is evaluated via Local Neural Fields (LNFs) – a deep learning paradigm that utilizes small discriminative patches to exploit spatial relationships between the pixels and provide strong peaks at the optimal location. The model is carefully constructed with the alignment of the training shapes via the inner eye-corners, reducing the complexity of the model and making it more robust to changes in facial expression. The model is complete with an analytical derivation of a Jacobian matrix, which describes how changes in the parameters of the model create changes in the point positions within the image through a full-perspective camera model. The model competes well with other state-of-the-art models on 3D face alignment and head-pose estimation tasks, while running with very low computational complexity.
2 Related Work

Evaluating the 3D head pose from a monocular camera involves a number of other areas, including the detection and tracking of the head through the evaluation of facial landmarks in an image. This has been a well-researched area in recent years, with many examples within the literature of how to solve these problems. This section summarizes these contributions and provides the basis of the 2.5D CLM with Local Neural Fields.

2.1 Head tracking models

Tracking the user's head within a series of images from a video stream typically involves two tasks: first to detect the initial location within an image of the head, and secondly to track that head through a series of images. The most successful and well-documented head detector in recent times is the Haar classifier from (Viola and Jones, 2001), which provides a good initial estimate of the head location. A common modern approach to detect face points in an image is to utilize deep neural networks (Bulat and Tzimiropoulos, 2018; Merget et al., 2018). In particular, the use of cascades of Convolutional Neural Networks has shown excellent results on 3D face alignment problems (Bulat and Tzimiropoulos, 2017; Zhu et al., 2016). The variety of tagged training data now freely available, along with the computing power of multicore GPUs, allows for remarkable inference from image to face points. However, this accuracy comes at a high computational cost and is currently not suitable for domains where the hardware is limited and there is a need for real-time performance from video. For such problems, efficient tracking can be performed by searching the local area of the current solution using simple face models (Ackland et al., 2014; Baltrušaitis et al., 2016; Saragih et al., 2011). This is usually sufficient provided the relative movement is not too great and the refresh rate is quick enough, i.e. the number of frames processed per second is high enough.

One of the most important issues when dealing with head tracking relates to choosing a representation of the head. Many of the representations involve capturing the texture from the image and capturing the relative movement over time. One of the simplest representations comprises a texture mapped cylinder (Cylindrical Head Model (CHM)) (Xiao et al., 2003), which is formed by instantiating a cylinder into the camera scene and mapping the current image frame onto the cylinder surface. Since only the relative movement is determined, capturing the relationship between the head shape and other items in the environment requires further calibration. Additionally, like many tracking methods of this variety, the tracking tends to degenerate over time and requires re-instantiation regularly. Other simple shapes may approximate the head shape slightly better with a relatively small computational cost, including ellipsoidal models (Choi and Kim, 2008) and sinusoidal models (Cheung and Peng, 2015).

Alternatively, models of the head shape and texture can be built beforehand, where we attempt to acquire a best-fit solution for the model to the new image. By approaching the problem with pre-learned knowledge about the face, we can estimate relative distances between in-scene objects better, and also reduce cumulative tracking errors by always optimizing the fitting to the learned data and using the previous tracking iteration as a guide only. Specifically, if we wish to track the shape and subsequently determine pose, we need to acquire more information about the distribution of the individual face components or features, such as the eye corners and bridge of the nose. Predetermined knowledge about a face shape often comes in the form of a Point Distribution Model (PDM) (Cootes et al., 2001). The PDM is capable of creating plausible shapes from a sequence of deformable points that are statistically learnt from a training set of marked-up images of the shape. It is a simple linear parametric model in either 2D or 3D that, given enough training data, generalizes well to even unseen data. Furthermore, by restricting the parameters to fall between plausible boundaries, the model can overcome issues with noisy data and occlusion.

The PDM is typically built by first aligning the collection of annotated training shapes. In 2D this is often achieved via a generalized Procrustes algorithm (Goodall, 1991), which normalizes the shapes to a common reference shape (with common scale, translation and rotation). In 3D other solutions are often incorporated, such as Iterative Closest Point (Rusinkiewicz and Levoy, 2001). When the shapes are aligned, Principal Component Analysis (PCA) (Kirby and Sirovich, 1990) can be applied, which provides the mean shape along with a set of eigenvectors capturing the statistical variation of the shapes in the training set. PDMs form the basis of many tracking techniques in the literature, from simple Active Shape Models (ASMs) (Cootes et al., 2001) built from a relatively small number of points, to 3D Morphable Models (3DMMs) (Blanz and Vetter, 1999) that are usually constructed via 3D range scanners and are therefore denser, comprising potentially thousands of points.
2.2 Constrained Local Models

The Constrained Local Model (CLM) (Cristinacce and Cootes, 2006) is a discriminative model that involves training small texture patches that correspond to parts of the face such as an eye-corner, or nose-bridge. Each patch searches its local area for a 'best-fit' and the pose update of the face is determined via a least squares approach from all the patch results. The result is further enhanced by ensuring that the face is still constrained by the learned shape-model that describes how a face can move. To fit a PDM to an image we need to look at finding the joint deformation d and transformation p parameter values that minimize the misalignment of each vertex in the shape model to the estimated image location x for that vertex. Concatenating the pose and deformation parameter vectors to form a new vector q = [p|d], we define the error function E as

E(q) = R(q) + Σ_{i=1}^{n} Di(xi; I)    (1)

where Di is a data term representing the incoming data and evaluates how each landmark xi is misaligned within the image I. Note, as in Saragih et al. (2011), the extension of the error term with a regularization term R to punish complex deformations of the shape model (i.e. shapes that deviate too far from the mean shape).

An alternative way of viewing the problem is via a probabilistic interpretation (Saragih et al., 2011). Assuming conditional independence of all the landmarks, let us observe that the probability (p) of the shape model being correctly aligned with q parameters within the image I is proportional to the product of each individual landmark probability being correctly aligned at xi. The Regularization and Data terms are now

R(q) = − ln {p(q)}    (2)
Di(xi; I) = − ln {p(li = 1 | xi, I)}    (3)

where li = 1 represents the likelihood of a correct match.

The regularization term is dependent on the probability of our prior beliefs about the shape model, and thus our problem becomes 'Given a collection of n ordered 3D points, how likely is it they represent a face shape?' The rigid component of the q-parameters, p (representing the pose), can be any translation and rotation, so a non-informative prior can be used. As such, the regularizer only depends on the deformation parameters d and a regularization weight r that allows us to bias the result in favour of either the prior knowledge or the incoming observed data. The regularization term then becomes¹

R(q) = r ‖q‖²_{Λ̃⁻¹}    (4)

where a non-informative prior is kept on the initial parameters representing the transformation of the face

Λ̃⁻¹ = diag{0, 1/λ1, 1/λ2, ..., 1/λν}    (5)

¹ ‖q‖²_{Λ⁻¹} is shorthand for the squared Mahalanobis distance qᵀΛ⁻¹q

The data term D represents the incoming information from the camera and is specifically tasked with finding the most likely location of the head within the image frame (assuming there is one within the image). To accomplish this, it is important to know what we are looking for and to minimize false positives. Since each landmark can be evaluated independently, the error each landmark contributes to the overall error term can, in its simplest form, be obtained simply through calculating the squared Euclidean distance from the projected shape model points to the observed optimal locations y within the image.

D(x, I) = Σ_{i=1}^{n} (yi − W(Xi, q))² = ‖y − P(q)‖²    (6)

The warping function W defines how a single 3D point can be projected into the image, and P defines the full model projection. The details of these functions can be found in sections (3.2) and (3.3).

Combining equations (4) and (6) gives us

q̂ = argmin_q {R(q) + D(x, I)} = argmin_q { r‖q‖²_{Λ̃⁻¹} + ‖y − P(q)‖² }    (7)

Since this minimization formulation is in the form of a non-linear least squares problem, the problem is solved incrementally with linear update approximations added to the current parameters: q ← q + ∆q (with a slight adjustment for the rotation parameters – see section 3.6). Taking a first order Taylor series expansion of the projected landmarks we obtain

P(q + ∆q) ≈ P(q) + J∆q    (8)

where J is the Jacobian matrix containing the derivatives of the projected shape points P(q) with respect to the q parameters, i.e.

J = ∂P(q)/∂q    (9)

Thus we are looking to minimize

∆q̂ = argmin_{∆q} { r‖q + ∆q‖²_{Λ̃⁻¹} + ‖y − (P(q) + J∆q)‖² }    (10)

2.3 Local Patch Filters

The local patch filters work independently at first and have a simple goal – to search their local image space and provide a probability of how well the pixels in the vicinity resemble the tagged training sample, with closer resemblances achieving a higher response rate. A pixel here in fact represents the centre of a sliding window of some width and height, and is therefore ideally suited for convolution or cross-correlation, which can be efficiently applied within the Fourier domain. The model itself does not care which discriminator is used, and so several different types can easily be substituted in and compared for speed, accuracy and robustness. Much of the literature investigating improvements to the CLM fitting process has been based on the type of patch used, and how best to amalgamate all the local response maps to globally optimize the solution. The format of the trained patch model can be any classification tool as long as it produces a high response at the correct location. Typically a patch is trained on upright faces, and during fitting the image patch is rotated to match the previous known face position. The first CLMs trained their discriminative patches via a Support Vector Machine (SVM) (Wang et al., 2008). Asthana et al. (2015) process the image with low-level filters and ensure sparsity in the estimation by reconstructing the most likely response texture from a library of pretrained responses using Principal Component Analysis (PCA) (Kirby and Sirovich, 1990). Other notable examples include correlation filters such as Minimum Output Sum of Squared Error (MOSSE) (Bolme et al., 2010) and Local Neural Fields (LNFs), which learn non-linear and spatial relationships between the pixels (Baltrusaitis et al., 2013). These filters tend to produce more robust classifiers over large pose variations, although non-linear filters tend to be slightly more computationally intensive. A version of the LNF classifier is contained within the OpenFace toolkit (Baltrušaitis et al., 2016), which is freely available for comparison metrics.

Specifically, LNFs are trained as undirected graph models that take a series of vectorized image patches Ξ = {ξ1, ..., ξn} to predict a conditional probability vector ψ = {ψ1, ..., ψn} representing the likelihood of a pixel being the desired output, such as an eye corner or the tip of the nose. It is similar to a Convolutional Neural Network where vertex features represent a mapping between Ξ and ψ through a one-layer neural network. In addition, two further constraints are placed on the model in the form of edge features g and l that enforce smoothness and sparsity respectively. For a neuron k aligned in a grid structure with edges linking neighboring nodes

gk(ψi, ψj) = 0.5 Sg (ψi − ψj)²
lk(ψi, ψj) = 0.5 Sl (ψi + ψj)²    (11)

where Sg = 1 when the nodes ψi and ψj are immediate neighbors (0 otherwise), and Sl = 1 on nodes between 4 and 6 edges apart (0 otherwise).
pixel here in fact represents the centre of a sliding win- After training, a number of ‘neurons’ are generated dow of some width and height, and therefore is ideally where each pixel has a corresponding value (see Figure suited for convolution or cross-correlation which can 1). When vectorized, the values from the k-th neuron be efficiently applied within the Fourier domain. The form the weights wk, which along with a bias term bk model itself does not care which discriminator is used attempt to discriminate between the positive and neg- and so several different types can easily be substituted ative samples. A response value γ can be generated by in and compared for speed, accuracy and robustness. each neuron k for an incoming image pixel Ip by Much of the literature investigating improvements to T the CLM fitting process has been based on the type of γk (Ip) = wk V(p, I) + bk (12) Real-Time 3D Head Pose Tracking Through 2.5D Constrained Local Models with Local Neural Fields 5

The final response is then obtained by summing each neuron's individual response to the pixel location. A response map Γ(x, I) can be generated by evaluating all the pixels within a localized image area around a point x. The map can be formed using a sliding window approach over the area, but can also be obtained more efficiently through convolutions within the Fourier domain.

Fig. 1 The LNF comprises multiple filters for a single point of the PDM (here showing the outer eye corner). Each of the filters needs to be convolved with the new input image before their responses are summed. This makes them more robust, although there is a cost in computation time.
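For illustration, the evaluation step can be sketched in a few lines of numpy; this is not the OpenFace implementation, and the function names, the brute-force sliding window, and the assumption that the search region lies fully inside a grayscale image are ours (in practice the responses would be computed via Fourier-domain convolution as noted above).

```python
import numpy as np

def lnf_response_map(img, center, neurons, biases, search=11, win=11):
    """Evaluate eq. (12) over a (search x search) region around `center`.

    img:     2D grayscale image (float).
    neurons: list of flattened neuron weight vectors w_k.
    biases:  list of scalar bias terms b_k.
    Returns the summed neuron responses Gamma(x, I) per candidate pixel.
    """
    half_w, half_s = win // 2, search // 2
    cy, cx = center
    resp = np.zeros((search, search))
    for dy in range(-half_s, half_s + 1):
        for dx in range(-half_s, half_s + 1):
            y, x = cy + dy, cx + dx
            patch = img[y - half_w:y + half_w + 1, x - half_w:x + half_w + 1]
            v = patch.astype(np.float64).ravel()
            v = (v - v.mean()) / (v.std() + 1e-8)   # zero mean, unit variance
            # sum of per-neuron responses gamma_k = w_k^T v + b_k (eq. 12)
            resp[dy + half_s, dx + half_s] = sum(w @ v + b
                                                 for w, b in zip(neurons, biases))
    return resp
```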
To obtain the response in probabilistic terms to match our model, a logistic regressor is trained on each of the filter responses on the dataset. The regressor is calculated by taking the true positive match from the naive response of the filter Γ(x, I), as well as many randomly chosen negative samples from misaligned regions of the response maps. The regressor in turn estimates a gain α and bias β term to manipulate the response map results. At runtime the standard logistic regression curve then provides the likelihood l of a correct match at a landmark x

p(li = 1|x, I) = 1 / (1 + e^−(αiΓi(x,I) + βi))    (13)

This has the effect of turning each likelihood into a probability mass function through a sigmoid curve, which peaks at 1 for a positive match and 0 for a poor match. Further details on the training and implementation details of LNFs can be found in Baltrusaitis et al. (2013); Baltrušaitis et al. (2016).

2.4 Optimizing the response

The response maps are determined from small image regions that can have a large degree of variation. As such, there is often noise and ambiguity, with multiple peaks giving high response. While many previous CLMs optimized the response maps with simple parametric forms such as a Gaussian response (Paquet, 2009) or via Convex Quadratic Fitting (Wang et al., 2008), Saragih et al. (2011) described how these were error-prone due to the multi-modal nature of the response map. Instead, a non-parametric form was shown to give more accurate results in Regularized Landmark Mean-Shift (RLMS). The mean-shift algorithm relies on a Kernel Density Estimator (KDE) to gradually smooth the response map and 'push' the mean-shift vector toward the greatest concentration of high response points.

Allowing each data point in the response map Γ to exhibit some Gaussian noise (with variance ρ), the data term becomes

Di = p(li = 1|xi, I) = Σ_{yi∈Γi} p(li = 1|yi, I) N(xi, yi, ρI)    (14)

Normalising these posterior likelihoods gives us weighted values wyi that sum to 1, where

wyi = p(li = 1|yi, I) N(xi, yi, ρI) / Σ_{zi∈Γi} p(li = 1|zi, I) N(xi, zi, ρI)    (15)

The mean-shift vector vi is then the normalized weighted pull wyi from each location in the response map yi on the current estimate xi

vi = ( Σ_{yi∈Γi} wyi yi ) − xi    (16)

Saragih et al. (2011) showed that the mean-shift algorithm is equivalent to employing an Expectation Maximization (EM) strategy, which is a bound optimization that converges to an optimal solution through iteration.
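The following is an illustrative numpy sketch of one mean-shift step over a single landmark's probability map, following equations (14)-(16); the function and argument names are assumptions rather than code from any cited implementation, and the Gaussian normalization constants are omitted since they cancel in the ratio of equation (15).

```python
import numpy as np

def mean_shift_update(prob_map, x_est, rho=1.0):
    """One mean-shift step (eqs. 14-16) on a landmark's probability map.

    prob_map: 2D array of p(l=1 | y, I) values from the logistic regressor.
    x_est:    current (x, y) estimate in the map's local coordinates.
    rho:      variance of the Gaussian KDE kernel.
    Returns the mean-shift vector v_i pointing toward the response mass.
    """
    h, w = prob_map.shape
    ys, xs = np.mgrid[0:h, 0:w]
    cand = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float64)
    # Gaussian kernel around the current estimate (constants cancel below)
    d2 = np.sum((cand - np.asarray(x_est)) ** 2, axis=1)
    kern = np.exp(-0.5 * d2 / rho)
    wts = prob_map.ravel() * kern
    wts /= wts.sum() + 1e-12                 # normalized weights w_y (eq. 15)
    return wts @ cand - np.asarray(x_est)    # v_i (eq. 16)
```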
2.5 Solving for the new parameters

The optimal movement projections for all the points within the image frame can be stored within a matrix v, where

v = y − P(q)    (17)

and therefore from equation (10) we are looking for the change in parameters ∆q̂ that minimizes the following

∆q̂ = argmin_{∆q} { r‖q + ∆q‖²_{Λ̃⁻¹} + ‖v − J∆q‖² }    (18)

Since our problem is a non-linear least squares problem involving the sum of square residuals, the Gauss-Newton algorithm is used, giving the parameter update as

∆q = (rΛ̃⁻¹ + JᵀJ)⁻¹ (Jᵀv − rΛ̃⁻¹q)    (19)
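A hedged numpy sketch of the update in equation (19) is given below; the parameter layout (6 pose values followed by ν deformation weights, with zeros in the prior over the pose block as in equation (5)) follows the text, while the names and the default regularization weight are illustrative assumptions.

```python
import numpy as np

def parameter_update(J, v, q, eigvals, r=25.0):
    """Regularized Gauss-Newton step of eq. (19).

    J:       (2n x m) Jacobian of projected points w.r.t. q = [p | d].
    v:       (2n,) stacked mean-shift vectors, one (x, y) pair per landmark.
    q:       current parameters: 6 pose values then nu deformation weights.
    eigvals: PCA eigenvalues lambda_1..lambda_nu of the shape model.
    r:       regularization weight biasing toward the shape prior.
    """
    # non-informative prior over the 6 pose parameters (eq. 5)
    lam_inv = np.concatenate([np.zeros(6), 1.0 / np.asarray(eigvals)])
    Lam_inv = np.diag(lam_inv)
    lhs = r * Lam_inv + J.T @ J
    rhs = J.T @ v - r * Lam_inv @ q
    return np.linalg.solve(lhs, rhs)   # delta_q
```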

2.6 Obtaining the 3D head pose

One simple way to extract an estimated 3D pose from the 2D PDMs is through the Pose from Orthography and Scaling with ITerations (POSIT) algorithm (Dementhon and Davis, 1995), which estimates pose via a given 3D rigid model and 2D point correspondences. Since it is based on an orthographic model with scaling (a weak-perspective view frustum), the full camera perspective is not taken into account. Xiao et al. (2004) attempt to jointly constrain the shape in 2D and 3D via a 'combined 2D + 3D AAM'. The 2D shape generated from the model is further constrained to satisfy the limit of a 3D PDM. The solution described works with a weak-perspective model on the principle that any 3D shape can be represented in 2D by adding 6 additional parameters and a balancing weight. Baltrušaitis et al. (2012) describe the CLM-Z, which matches a 3D PDM to 3D depth data coming from an RGB-D camera, while Martins et al. (2012) simplify the optimization process by proposing a 2.5D AAM that works with a 3D PDM and 2D texture model under a full perspective camera. The 3D PDM decreases the number of eigenvectors needed to statistically represent the face (since a 3D shape remains relatively constant, unlike a 2D representation of a 3D shape). Additionally, the full perspective model better represents the likely shapes seen from cameras positioned reasonably close to the head. However, since AAMs are generative methods they typically require texture information from the specific user to perform with high accuracy, which means they won't generalize as well to the average user or even to different lighting conditions. The user would likely be required to annotate parts of their face and build a new model.

2.7 Measuring accuracy of head pose estimates

Due to the complexities of acquiring the ground truth data in real-world scenarios, measuring the accuracy of many applications within the computer vision domain is often very tricky. This is certainly true for the area of 3D head-pose estimation. Since the head shape is not static it can be very difficult to define what its exact head pose is. Some works such as the Boston University (BU) (La Cascia et al., 2000) and Public University of Navarre (UPNA) (Ariz et al., 2016) head pose datasets acquire their estimate through the user wearing a tracking device (such as a Flock Of Birds real-time tracker). This often takes the form of a head band firmly affixed over the head so as to not obscure the face. Of course there is no default position for it to be placed in, and so its output will be some (hopefully fixed) offset in translation and rotation relative to the real head pose. Additionally, it is difficult to ensure that its absolute translation values are measured to be correct and reliable. This may be because the camera itself has not been calibrated to determine its focal parameters. As such, many researchers have taken to measuring the accuracy of a head pose methodology through the rotation parameters alone.

Other researchers have taken to using combined color and depth (RGB-D) cameras such as the Microsoft Kinect or Intel Realsense to try to obtain the measurements without attaching devices to the participant. The information captured from these cameras tends to be noisy, and since the incoming data is in raw pixel format, acquiring the output pose data has to be performed through another estimation process such as Random Forests (Fanelli et al., 2011; Tang et al., 2011) or Particle Swarm Optimization (Padeleris et al., 2012). Thus the veracity of the 'ground truth' can be questioned. The measures produced are typically the Euler rotation angles of pitch, yaw and roll, where the mean error for each, along with the standard deviation, can be acquired over all frames (Ariz et al., 2016). It is important to remember that these angle triplets are highly dependent on the order they are performed in, and since there is no consistent way of performing these operations in the literature, the Mean Angular Error (MAE) is reported. This value obtains the mean of all three rotation angles and allows models to be compared fairly. Frames of the image sequence where tracking is completely lost are usually removed as outliers. These frames can be a result of obfuscation or of the head being partially removed from the image. An example of this issue is shown in Figure 2. State-of-the-art methods (Ariz et al., 2016) can achieve errors within 3 ± 3° using a generic shape model from 720p RGB video at 30 frames-per-second (fps), which is a realistic minimum target to allow the acquired information to be further utilized in an application.
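As a concrete illustration, the MAE over a sequence could be computed as below; this is a hedged sketch of the commonly reported measure, not code from any of the cited works, and the rule for rejecting lost-tracking frames beforehand is an assumption.

```python
import numpy as np

def mean_angular_error(pred, truth):
    """MAE: mean absolute error over pitch, yaw and roll across a sequence.

    pred, truth: (frames, 3) arrays of Euler angles in degrees.
    Frames where tracking was completely lost should be removed beforehand.
    """
    err = np.abs(np.asarray(pred) - np.asarray(truth))
    return err.mean()   # averaged over all frames and all three angles
```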
3 2.5D Constrained Local Model

To track the head efficiently with a stream of video data, a novel approach utilizing 3D shape models with 2D texture information is needed. The model is named the 2.5D Constrained Local Model (2.5D CLM) and emphasizes the unique combination of 3D and 2D information. This work extends previously published work that investigated the potential for this kind of tracker in determining the gaze from a mobile device (Ackland et al., 2014). The model takes inspiration from the 2.5D Active Appearance Model (Martins et al., 2012) but deviates in a number of key areas, such as in the use of a discriminative model that uses small texture patches around important areas of the face rather than a generative texture model. As such, the model contains a 3D PDM which holds knowledge about the shape of the face, and 2D texture patches around small sub-regions of the face which replace the holistic texture. The texture patches are built so as to work with as many people as possible by utilizing robust non-linear filters.

The approach builds the shape model directly in 3D with a full-perspective camera model, which firstly constrains the number of possible face combinations in the 2D image (creating a lower dimensional problem space) and secondly, by definition, provides the 3D head pose. The full-perspective camera allows us to 'project' the 3D PDM into the camera image space, allowing for real-world distance estimation. The model can also correctly accommodate radial and tangential distortion that is often prevalent on low-cost cameras such as webcams.

Since the optimization method is a local search, it relies heavily on good estimates from previous frames to correctly find the face pose in subsequent frames. One of the benefits of taking a relatively small number of pixels for evaluation around each point, rather than taking a holistic view of the face region, is to allow the 2.5D CLM to run very quickly on devices without graphics cards and with a low memory footprint. However, local search approaches can accumulate errors easily when points fail to track effectively, for example, when they are obscured, or the image is noisy. This issue is addressed in this work with Local Neural Fields (Baltrusaitis et al., 2013) trained at multiple angles, which provide strong and sparse peaks at the correct locations.

Fig. 2 It is frequently the case that when a user holds a tablet in a comfortable position, their head is only partially visible within the image of the on-board camera

Most model-based approaches to pose estimation problems require the computation of the Jacobian matrix, which describes how changes in the model parameters manipulate the points within the image (Pons-Moll and Rosenhahn, 2011). The formulation of the Jacobian matrix depends on the parameters of the model, but other works often use weak-perspective models with scale terms (Ariz et al., 2016; Baltrusaitis et al., 2013). These methods then require an additional step and further optimization to determine the full 3D pose. Other methods simplify the Jacobian to evaluate only relative changes that are summed over time, accumulating errors from noise (Martins et al., 2010). The formulation described in this work avoids all these issues and allows for the direct estimation of pose and deformation together. This direct estimation of the pose and deformation parameters forgoes the need for multiple optimization steps from face alignment to pose, further reducing the method's computational cost.

Fig. 3 2.5D Constrained Local Model flowchart. (Recoverable stages: precompute – calibrate camera, construct face model (shape & texture); fitting loop – capture next image frame; on the first frame or lost tracking, initialise the face estimate (Haar) and global face tracking; otherwise estimate the global face shift and check for tracking loss; then estimate local point optimal image locations, calculate the image Jacobian, and optimise pose and deformation parameters until converged, tracking lost, or max iterations.)

3.1 The pose parameters
The 'pose' of an object defines some measure of both translation and rotation. In order to measure the pose in absolute terms, it is imperative that any trained models are built using 'real-world' measurements. This implies that by having a correctly measured 3D head model, we can accurately track its translation and rotation from any camera (provided we have calibrated it and obtained the camera's intrinsic values). Furthermore, this means that unlike other tracking algorithms which are effectively only tracking relative movements, no scalar or balancing weight parameters are required.

The 'world' coordinates are defined from the point-of-view of the camera, that is to say, the camera acts as the origin with the camera always pointing down the positive z-axis. From the camera's perspective, the positive y-axis points downwards and the x-axis points to the right, creating a right-handed coordinate system as shown in Figure (4). Of course, under this definition the camera pose remains fixed at all times and any 'mobile' interactions involve all other objects in the world moving relative to it. The presented image frame is dependent on the focal length f in pixels between the camera pinhole point and the image plane. If the pixels are non-square, perhaps due to a small camera defect, the focal length can be defined in terms of its horizontal and vertical components fx and fy, which are approximately equal. The camera has a 'principal axis' that passes through the camera pinhole and is perpendicular to the image plane at the 'principal point' (cx, cy), often at or near the centre of the image.

Fig. 4 The perspective camera model

The pose P of the head can be described at any time via a 3x4 matrix, with 9 parameters forming a traditional 3D rotation matrix R and 3 parameters representing the translation values t = [tx, ty, tz]ᵀ along the x, y and z axes, measured in millimetres (mm).

P = [R00 R01 R02 tx; R10 R11 R12 ty; R20 R21 R22 tz] = [R | t]    (20)

To simplify our problem, we recognize that there are in fact only 6 degrees of freedom (three dimensions each in rotation and translation), so we formulate our problem in these terms using the Rodrigues rotation formula. This formula defines the rotation as an 'Axis-Angle' relationship where the rotation is defined as an angle θ around an arbitrary unit axis ω = [ω1, ω2, ω3] in the following way:

R = I + sinθ Ω + (1 − cosθ) Ω²    (21)

where Ω is the cross-product matrix for the vector ω:

Ω = [0 −ω3 ω2; ω3 0 −ω1; −ω2 ω1 0]    (22)

The three rotation parameters can be efficiently stored in a vector r = θω where

ω = r/θ,  θ = √(rx² + ry² + rz²)    (23)

3.2 The Perspective Camera Model

Different webcams have a wide spread of varying intrinsic camera parameters (such as focal length) under which our head tracking model would produce vastly differing and inaccurate estimates. To capture as much of this variability as possible, a full-perspective camera model is used to project the information from 3D to 2D.

A 2D point in the image (x̂, ŷ) can actually represent an infinite number of potential 3D points. This relationship can be captured through the use of homogeneous coordinates, where 3 values [xH : yH : wH] are used to represent the 2D Euclidean coordinate. The function E used to convert homogeneous coordinates to Euclidean coordinates is defined as

E([xH, yH, wH]ᵀ) = [xH/wH, yH/wH]ᵀ = [x̂, ŷ]ᵀ    (24)

The camera intrinsic parameters play a key role in determining the homogeneous coordinates of a particular 3D point X observed in the world from the camera location. These parameters can be represented by a matrix K

K = [fx α cx; 0 fy cy; 0 0 1]    (25)

where fx, fy represent the camera focal length; cx, cy represent the camera principal point; and α the skew parameter. These values can be determined via a standard calibration procedure where a checkerboard pattern with known dimensions is observed multiple times within the image frame.

In addition, real camera lenses are likely to have some minor tangential or radial distortion. An extreme example of this would be a fisheye lens, which causes extreme distortion around the edges of the image. Provided the distortion values are known through some calibration procedure, they can be easily supplemented into the camera projection model without loss of generality. A discussion about how to do this, alongside the expected projection errors, can be seen in (Weng et al., 1992).

The camera intrinsic matrix can be multiplied with any 3D point X = [x, y, z]ᵀ to determine its homogeneous coordinates. By utilizing homogeneous coordinates, the 'warping' process W that transfers this 3D point to 2D image coordinates is then

W(X) = E(KX)    (26)
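As an illustration, equations (21)-(26) can be realized in a few lines of numpy; this sketch omits lens distortion and is an assumption-laden rendering rather than the paper's implementation.

```python
import numpy as np

def rodrigues(r):
    """Axis-angle vector r = theta * omega to a 3x3 rotation matrix (eq. 21)."""
    theta = np.linalg.norm(r)
    if theta < 1e-12:
        return np.eye(3)
    w = r / theta
    Omega = np.array([[0, -w[2], w[1]],
                      [w[2], 0, -w[0]],
                      [-w[1], w[0], 0]])   # cross-product matrix (eq. 22)
    return np.eye(3) + np.sin(theta) * Omega + (1 - np.cos(theta)) * Omega @ Omega

def project(X, K, r, t):
    """Warp W: a world point X (mm) to image pixels under intrinsics K (eq. 26)."""
    Xc = rodrigues(r) @ X + t   # point in camera space
    xh = K @ Xc                 # homogeneous image coordinates
    return xh[:2] / xh[2]       # E(): divide through by w (eq. 24)
```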
3.3 Constructing the shape model

The shape model is based on the well known Point Distribution Model (PDM), where the 3D shape is defined as a vector s

s = [x1, x2, ..., xn, y1, y2, ..., yn, z1, z2, ..., zn]ᵀ    (27)

The shape s is made up of n individual vertex points in space located at suitable locations on the head, such as the nose tip and eye corners. For a successful 3D shape tracker, face points should be chosen that tend to remain static, so as to limit the amount of deformation of the model, and that have surrounding textures that are well defined, i.e. limiting the reliance on landmarks that are defined by shadows or outlines. Many models observed within the literature are often focused on face-alignment rather than pose estimation. As such, they do not account for the fact that while landmarks such as those on the outline of the face (e.g. the cheek) are relatively easy to detect, they do not represent a single 3D point on the face, but instead a whole region of possible points as the user orients their head about in space.

For the creation of a 3D parametric model it is necessary to capture the training data as actual 3D data measured in millimetres (mm), to give us the benefits of obtaining real-world pose estimation values. This is directly opposed to more traditional 2D approaches where tagged images are required to be scaled, translated and rotated via a generalized Procrustes alignment, a practice that is frequently carried needlessly to the 3D case (Baltrušaitis et al., 2012; Martins et al., 2012; Saragih et al., 2011). The origin of the model then becomes the mean of all points, as shown in Figure (5(a)), and implies that a deformation of the model can change the pose, which is unsatisfactory. Instead, one of the novel contributions of this work is that the shape model is designed with pose evaluation in mind by first fixing a local origin, located midway between the inner eye corners. Additionally, it is rotated such that the inner eye corners lie on the x-axis and the underside of the nose lies on the y-axis, as shown in Figure (5(b)).

To see how this could improve the robustness of the pose estimation, suppose we had multiple good quality images of a head taken simultaneously from 2 or more cameras. Running the algorithm on each image may output the correct face alignment as tagged by a human. However, the 3D point estimations around the face outline may evaluate quite differently from the various perspectives, as the model allows for deformation. Taking the mean of the points as the pose estimate from each camera would provide different world-space results, whereas an identical pose would be generated from an origin defined between the eye corners (which do not deform, from the PDM construction). This prevents heavy fluctuations in the pose estimate as the participant rotates their head in a video stream. Crucially, it also prevents the more deformable parts of the face, such as the mouth, from changing the pose when the participant simply talks or smiles.

The 3D training data can be acquired directly, for example, by tagging vertices in 3D modelling software on data captured from a range scanner, which measures depth through lasers or other means. Since this is a laborious process, an alternative can be acquired indirectly through methods such as Non-Rigid Structure from Motion (NRSfM) (Torresani et al., 2008), whereby the 3D data is estimated from several tagged 2D images simultaneously captured of the object in question. The MultiPIE dataset (Gross et al., 2010) is a large annotated image collection carried out over numerous sessions, capturing a large number of people from synchronized cameras set at 15° intervals around the y-axis. The dataset has been shown to provide a suitable 3D model of the face (Saragih et al., 2011). In this work, the NRSfM MATLAB implementation by (Torresani et al., 2008) is applied to the 68 annotated points within the images in order to obtain the 3D vertex information. Although there is a small amount of noise from the process, due to the fact that it is difficult to hand-select identical points from multiple images, the algorithm does a good job in capturing the variability of the faces in the dataset, and produces believable face representations.

Fig. 5 Visualization of the 3D training shapes (xy-plane shown) aligned via (a) a Generalized Procrustes method and (b) the new method aligned relative to the inner eye corners. Note how the statistical point distribution around the eyes becomes tightly packed, with the inner eye corners only able to move symmetrically along the x-axis.
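A possible realization of this eye-corner alignment in numpy is sketched below; the landmark indices are hypothetical placeholders, and the construction assumes an approximately symmetric face so that the nose base falls near the y-axis.

```python
import numpy as np

def align_to_eye_corners(pts, i_left, i_right, i_nose):
    """Place a 3D training shape in the local frame used by the PDM.

    pts: (n, 3) landmark array in mm; i_left/i_right index the inner eye
    corners and i_nose the underside of the nose (indices are illustrative).
    The origin goes midway between the inner eye corners, the x-axis runs
    through them, and the y-axis points toward the nose base.
    """
    origin = 0.5 * (pts[i_left] + pts[i_right])
    x_ax = pts[i_right] - pts[i_left]
    x_ax /= np.linalg.norm(x_ax)
    nose = pts[i_nose] - origin
    y_ax = nose - (nose @ x_ax) * x_ax   # nose direction, x-component removed
    y_ax /= np.linalg.norm(y_ax)
    z_ax = np.cross(x_ax, y_ax)
    R = np.stack([x_ax, y_ax, z_ax])     # rows form the new basis
    return (pts - origin) @ R.T          # note: no scaling is applied
```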

The objective is to obtain a linearized parametric model of the head,

s = s0 + dΦ    (28)

where s0 is the mean shape and Φ a set of ν orthogonal linear basis vectors (eigenvectors) describing the directions in which the shape can deform, parameterized by weighted values d = {d1, d2, ..., dν}. To obtain the mean-shape and eigenvectors, a Principal Component Analysis (PCA) method is applied to the training shape vectors once the data has been rotated and translated (not scaled) via the eye corners and nose (Figure 6). Additionally, the process obtains the equivalent eigenvalues Λ, which give the statistical variance along the vectors. These values can act as limits for how far the eigenvectors are allowed to stretch and squash the model. There is a clear benefit here from the change in shape alignment for the PCA process. When keeping 95% of the shape variation from the training set (to remove noise from misplaced tags), using Procrustes alignment the shapes produce a deformable model with 24 eigenvectors. The alternative alignment (described in section 3.3) produces a PDM with only 14 eigenvectors, massively simplifying the calculations required to optimize the shape.

The shape model is completed with a set of 6 pose parameters p = [rx, ry, rz, tx, ty, tz]ᵀ, which describe the 6 degrees of freedom for the position of the head in the world derived in section (3.1). In particular, the rotation parameters allow for the subsequent calculation of the 3 × 3 rotation matrix R (using equation 21). With parameters d and p, the kth 3D point of the shape Xk = [xk, yk, zk]ᵀ can be determined in world space as

Xk(d, p) = R [s0_xk + Σ_{i=1}^{ν} di Φi_xk;  s0_yk + Σ_{i=1}^{ν} di Φi_yk;  s0_zk + Σ_{i=1}^{ν} di Φi_zk] + [tx; ty; tz]    (29)

Using the warping function W (equation 26), all the points of the model can be located within the image using a whole model projection P(d, p)

P(d, p) = [W(X1(d, p)); W(X2(d, p)); ...; W(Xn(d, p))]    (30)
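The following numpy sketch shows how equations (28)-(30) could be evaluated for a given parameter set; the vector layout follows equation (27), and the names and the 3-standard-deviation clipping default are illustrative assumptions.

```python
import numpy as np

def synthesize_shape(s0, Phi, d, eigvals, clip=3.0):
    """Generate a plausible 3D face: s = s0 + d*Phi (eq. 28).

    s0: (3n,) mean shape; Phi: (nu, 3n) eigenvector rows; d: (nu,) weights.
    Deformation weights are clipped to +/- clip standard deviations.
    """
    sd = np.sqrt(np.asarray(eigvals))
    d = np.clip(d, -clip * sd, clip * sd)
    return s0 + d @ Phi

def project_model(s, K, R, t):
    """Whole-model projection P(d, p) of eq. (30): one (x, y) row per vertex."""
    n = s.size // 3
    X = s.reshape(3, n).T          # eq. (27) layout: [x1..xn, y1..yn, z1..zn]
    xh = (X @ R.T + t) @ K.T       # rotate, translate, apply intrinsics (eq. 29)
    return xh[:, :2] / xh[:, 2:3]  # perspective divide (eq. 24)
```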

3.4 Constructing the texture model

When developing a model for tracking, it is imperative that the texture information we are using is truly representative of the underlying shape model. Like many deep learning models, creating a general pattern matching algorithm requires variety within the dataset, reflecting a diverse range of people with differing face characteristics such as eye and nose shape, eyebrow thickness and color, skin tone, and whether or not they wear glasses. To train the patch filters, the Multi-PIE training set (Gross et al., 2010) is again used since it has 337 people captured under 19 different illumination conditions. Another issue to consider is the restriction on how well a 2D texture can represent a 3D object from many different angles. If our model only uses a limited number of well-defined points, even small changes in texture due to rotation can become the source of large errors or loss of tracking completely. The Multi-PIE dataset also has another big advantage in that it simultaneously captures images from various views. To accommodate different head orientations (where the texture patch can fluctuate significantly), 9 different sets of patches are trained, which capture the likely observed head angles from the camera. These are

yaw = {−90°, −75°, −45°, −20°, 0°, 20°, 45°, 75°, 90°}    (31)

This work utilizes Local Neural Fields (LNFs), which use positive image samples along with negative samples taken from areas close to the correct area to find a non-linear mapping between the image data and the ideal response (Baltrusaitis et al., 2013). To obtain the positive filters, small patches were taken from the annotated training images and were affine warped to slightly varying scales, rotations, and translations to create a robust filter. For efficiency purposes the patch filters were kept small, with the underlying image scaled to cover a region of the face large enough for discrimination. Negative samples were taken from other regions of the face, making sure to cover the regions near to the positive samples that are likely to come up if the tracking starts to drift. Additionally, the patches were trained at multiple different scales and patch sizes such that they could be used in a pyramid refinement sense, starting with a wider search region with lower accuracy. Setting the default trained head width to 150 pixels, the trained regions and scales can be seen in Table 1.

Table 1 Trained patch sizes and scales, where a scale of 1.0 gives a head width of 150 pixels.

Size     Scale
11 x 11  1.0
9 x 9    0.8
7 x 7    0.6

Fig. 6 The first 5 principal components of the head model are shown, with each row representing an eigenvector (columns: −3√λ, −1.5√λ, mean shape s0, +1.5√λ, +3√λ). The output is constrained to within 3 standard deviations to ensure a realistic face shape is produced. All shapes have their origin equidistant between the inner eye corners, which lie on the x-axis. Note that since the cheeks are not well-defined, the first principal component has to take into account a large range of possible cheek outlines.

The implementation of LNF used in this work is freely available as part of the OpenFace framework (Baltrušaitis et al., 2016). An example of the typical neuron generation can be seen in Figure (1), where 7 neurons are shown, each having a support region of 11 × 11 pixels (with an overall head width of 150 pixels). At run-time the set of patches closest to the estimated head pose is used. The patch data is collected at some standard set size so that the image patch we are looking for can be identified during run-time. The standard approach (Baltrušaitis et al., 2012; Martins et al., 2012; Saragih et al., 2011) is to adjust the incoming patch via a Procrustes alignment, which aligns the model through scale, rotation and translation of the observed 2D projected points. As the head rotates and is observed from different angles, this has the effect of actually changing the overall size of the head, which can cause the model to 'jump' when switching between different trained patch sets. A novel feature of this work is ensuring that the transitions between patch sets are smooth; rather than depending on a Procrustes alignment of the 2D points, this is achieved through a simple technique.

The goal is to rotate the incoming image to make the face appear as 'upright' as possible. This is done by first projecting the head's local y-axis (the up vector) into the image and rotating the image such that the axis becomes vertical.
The size of the face within the image is another factor that needs to be considered. Let o represent the real width of a 3D object and r represent its width in pixels within an image. The relationship between the two is defined as

r = of/d    (32)

where f is the camera focal length and d is the distance of the object from the camera. This relationship can be exploited to scale an incoming image to a standard reference size r0 by evaluating its situated distance d0 away from the camera. The scaling term s for each frame is then evaluated as a ratio of the constant reference distance and the current distance d of the head from the camera

s = d0/d    (33)

3.5 Calculating the Jacobian

The Jacobian matrix (J) from equation (9) is the derivative of the projected points in the shape model with respect to the q parameters.

J = [∂P(q)/∂p1  ∂P(q)/∂p2  ...  ∂P(q)/∂p6  ∂P(q)/∂d1  ∂P(q)/∂d2  ...  ∂P(q)/∂dν]    (34)

where

∂P(q)/∂pi = [∂W(X1,q)/∂pi; ∂W(X2,q)/∂pi; ...; ∂W(Xn,q)/∂pi],  ∂P(q)/∂di = [∂W(X1,q)/∂di; ∂W(X2,q)/∂di; ...; ∂W(Xn,q)/∂di]    (35)

Intuitively, it describes how a small change in each of the parameters qi is reflected in a change of the image coordinates. Notice that because the projection is non-linear, the derivative has the additional complexity that for each pose in 3D space this 2D Jacobian is only locally applicable. This means the Jacobian matrix has to be calculated for each frame.

Combining equations (26) and (28), we know that the warp projection of the kth 3D point of the shape Xk = [xk, yk, zk]ᵀ to the 2D image plane xk = [x̂k, ŷk]ᵀ with parameters q is defined as

W(Xk, q) = [x̂k; ŷk] = E([wk x̂k; wk ŷk; wk]) = E( K ( R [s0_xk + Σ_{i=1}^{ν} di Φi_xk;  s0_yk + Σ_{i=1}^{ν} di Φi_yk;  s0_zk + Σ_{i=1}^{ν} di Φi_zk] + [tx; ty; tz] ) )    (36)

where the function E converts the 3D homogeneous coordinates to 2D Euclidean coordinates (equation 24). Ignoring the skew parameter α and any radial or tangential distortion, which have a negligible effect on the derivative, the warp is approximately

W(Xk, q) ≈ [fx(xk/zk) + cx;  fy(yk/zk) + cy] = (1/zk) [fx xk + cx zk;  fy yk + cy zk]    (37)

The derivatives for the warp can be calculated through the quotient rule

∂W(Xk,q)/∂qi = [fx ∂(xk/zk)/∂qi;  fy ∂(yk/zk)/∂qi] = (1/zk²) [fx(zk ∂xk/∂qi − xk ∂zk/∂qi);  fy(zk ∂yk/∂qi − yk ∂zk/∂qi)]    (38)

and hence with respect to each parameter in q we simply need to find the derivative of the 3D point and substitute it into equation (38).

The derivative of the point Xk defined in equation (29) with respect to the ith deformable parameter di is simply the relevant eigenvectors rotated to match the model rotation.

∂Xk/∂di = R [Φi_xk; Φi_yk; Φi_zk]    (39)

The derivative of the 3D point with respect to the pose parameters p is more complicated. Since we require a linear representation of a rotation around the current point, a matrix R̃ representing an infinitesimally small rotation is introduced to the 3D model. Since the rotation is so small (i.e. θ̃ → 0), we can simplify the calculations by making the transformation linear using the fact that cos θ̃ ≈ 1 and sin θ̃ ≈ θ̃. The Rodrigues axis-angle rotation derived in equation (21) then becomes

R̃ ≈ I + θ̃ [0 −ω̃z ω̃y; ω̃z 0 −ω̃x; −ω̃y ω̃x 0] = [1 −r̃z r̃y; r̃z 1 −r̃x; −r̃y r̃x 1]    (40)

A 3D point on the model is defined as

Xk(q) = R R̃ [xdk; ydk; zdk] + t = R [xdk − ydk r̃z + zdk r̃y;  xdk r̃z + ydk − zdk r̃x;  −xdk r̃y + ydk r̃x + zdk] + [tx; ty; tz]    (41)

where for clarity the kth point before pose adjustments is (s0k + dΦk) = [xdk, ydk, zdk]ᵀ.

Finally, the translation calculations are straightforward enough, since they are real-numbered variables giving a derivative of 1 along each respective axis. The final derivatives can be summarized in the following way

∂X̃k/∂r̃ = [∂(xk)/∂r̃x  ∂(xk)/∂r̃y  ∂(xk)/∂r̃z;  ∂(yk)/∂r̃x  ∂(yk)/∂r̃y  ∂(yk)/∂r̃z;  ∂(zk)/∂r̃x  ∂(zk)/∂r̃y  ∂(zk)/∂r̃z] = R [0 zdk −ydk; −zdk 0 xdk; ydk −xdk 0]    (42)

∂X̃k/∂t = [∂(xk)/∂tx  ∂(xk)/∂ty  ∂(xk)/∂tz;  ∂(yk)/∂tx  ∂(yk)/∂ty  ∂(yk)/∂tz;  ∂(zk)/∂tx  ∂(zk)/∂ty  ∂(zk)/∂tz] = [1 0 0; 0 1 0; 0 0 1]    (43)
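Putting equations (37)-(43) together, the 2 × (6 + ν) block that a single landmark contributes to J could be assembled as in the following hedged numpy sketch; the names and argument conventions are illustrative, and the full Jacobian is formed by vertically stacking one such block per landmark.

```python
import numpy as np

def jacobian_row_block(Xd, R, t, K, Phi_k):
    """2 x (6 + nu) Jacobian block for one landmark (eqs. 38, 39, 42, 43).

    Xd:    the landmark before pose adjustment, [xd, yd, zd] (from s0 + d*Phi).
    R, t:  current rotation and translation; K: camera intrinsics.
    Phi_k: (3, nu) slice of the eigenvectors belonging to this landmark.
    """
    xd, yd, zd = Xd
    dX_dr = R @ np.array([[0., zd, -yd],
                          [-zd, 0., xd],
                          [yd, -xd, 0.]])     # eq. (42)
    dX_dt = np.eye(3)                          # eq. (43)
    dX_dd = R @ Phi_k                          # eq. (39)
    dX_dq = np.hstack([dX_dr, dX_dt, dX_dd])   # (3, 6 + nu), order [r | t | d]

    x, y, z = R @ Xd + t                       # point in camera space
    fx, fy = K[0, 0], K[1, 1]
    # quotient rule of eq. (38) applied to the approximate warp of eq. (37)
    Jx = (fx / z**2) * (z * dX_dq[0] - x * dX_dq[2])
    Jy = (fy / z**2) * (z * dX_dq[1] - y * dX_dq[2])
    return np.vstack([Jx, Jy])
```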

3.6 The rotation update

It is important to note here that while q ← q + ∆q holds for the majority of parameters, this is not actually correct for the rotation update, because we are using a linear approximation of the rotation which is only valid for small rotations where sinθ ≈ θ. Instead, successive rotations need to be multiplied to obtain the new rotation. To do this, the axis-angle rotation parameters are converted back to a rotation matrix R∆, and the new rotation R′ = RR∆. Unfortunately, this rotation matrix is no longer guaranteed to be orthogonal due to the linear approximation of the rotation eigenvectors. Similar to the work by Baltrušaitis (2014), a Singular Value Decomposition (SVD) is used that factorizes the matrix into 2 orthogonal matrices U and Vᵀ and a diagonal matrix S.

USVᵀ = SVD(R∆)    (44)

The diagonal matrix can be discarded, giving an orthogonal rotation matrix

R̂∆ = U · det(UVᵀ) · Vᵀ    (45)

where det(UVᵀ) = ±1, ensuring against the case of a reflection in the parameters. The final rotation matrix is then

R′ = RR̂∆    (46)
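A minimal numpy sketch of this correction is given below; it uses the diag(1, 1, det(UVᵀ)) form, a common way to realize equation (45) that also maps a reflection back to a proper rotation, so the details are our assumption rather than the paper's exact code.

```python
import numpy as np

def orthogonalize_and_compose(R, R_delta):
    """Eqs. (44)-(46): re-orthogonalize the update and compose rotations.

    R_delta may drift from orthogonality after the linearized update; an
    SVD projects it back onto the set of valid rotations before composing.
    """
    U, _, Vt = np.linalg.svd(R_delta)          # discard the diagonal S (eq. 44)
    d = np.linalg.det(U @ Vt)                  # +/-1; guards against reflection
    R_hat = U @ np.diag([1.0, 1.0, d]) @ Vt    # realization of eq. (45)
    return R @ R_hat                           # eq. (46)
```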

3.7 Global Face Alignment

On the initial frame, or whenever we have lost tracking, the model is (re)initialized with a simple Haar classifier (Viola and Jones, 2001), which is a quick and reliable method of approximating the initial start region for face detection and is widely used in the literature.

Since during each frame only the local regions are searched, there is a risk that during large movements between frames (either through a low number of frames per second or a high velocity of the user) the head may move out of the local search regions, making tracking difficult. In an attempt to combat this, a global temporal movement tracker is built, providing a larger search region to determine approximate movement in x and y from the previous frame.

Fig. 7 The global tracker gradually adjusts to increase its robustness, particularly during rotations of the head.

Correlation filters are highly suited to this kind of problem and can be implemented in linear (Bolme et al., 2010) and, more recently, non-linear (Henriques et al., 2015) forms. To limit computational cost a linear MOSSE filter is used, but it is dynamically built and updated each frame. The region around the whole face is cropped and scaled to the correct size, giving a 128×128 patch model of the face. On the first frame during initialization, 10 samples are taken by affine transforming the patch. Then, as in Bolme et al. (2010), a learning rate is introduced (here denoted λ) that puts additional weight on more recent image data to let the filter template accommodate reasonable changes to the model that occur naturally under different environment conditions and pose changes. The MOSSE equation takes the form of a fraction, where both the numerator A and denominator B are calculated via a sum of all training samples. For the ith frame the constituent parts of the MOSSE filter can be stored separately as

Hi* = Ai / Bi    (47)

This allows the filter to be updated for each subsequent frame (i + 1) as follows

Ai+1 = λ Gi+1 F(Ii+1)* + (1 − λ)Ai    (48)
Bi+1 = λ F(Ii+1) F(Ii+1)* + (1 − λ)Bi    (49)

where F denotes a transfer to the Fourier domain, G is the desired output image with peak response as a Gaussian with 2 pixels of standard deviation, and * denotes the complex conjugate. A learning rate of λ = 0.0125 allows the filter to adapt quickly while remaining robust. The large size of the patch allows for a lot of detail within the filter, and since it is dynamic, the peak response on correct tracking tends to be strong and dense. The MOSSE filter detector output Γ for each new incoming image patch Ii+1 is given by:

Γ = F⁻¹(F(Ii+1) Hi*)    (50)
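As an illustration, the running update of equations (47)-(50) could be written as the following numpy sketch; the preprocessing (log transform, windowing) used in Bolme et al. (2010) is omitted, and the small epsilon term is an assumption to avoid division by zero.

```python
import numpy as np

def mosse_update(A, B, frame, target, lam=0.0125):
    """Running MOSSE numerator/denominator update (eqs. 48-49).

    frame:  preprocessed 128x128 face patch for frame i+1.
    target: desired Gaussian response G centred on the face.
    """
    F = np.fft.fft2(frame)
    G = np.fft.fft2(target)
    A = lam * G * np.conj(F) + (1 - lam) * A
    B = lam * F * np.conj(F) + (1 - lam) * B
    return A, B

def mosse_response(A, B, frame, eps=1e-5):
    """Correlation response of eq. (50), with H* = A / B as in eq. (47)."""
    H_conj = A / (B + eps)   # eps guards against division by zero
    return np.real(np.fft.ifft2(np.fft.fft2(frame) * H_conj))
```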

Incorrect tracking produces a weak and noisy response map. This allows for a simple measure to be used to detect tracking loss, called the Peak-Sidelobe Ratio (PSR) (Bolme et al., 2010). The pixel location of the maximum response value ΓMAX is found, and a region designated the sidelobe is defined, which includes all pixel values in the response map apart from a small area around the peak. The ratio is then defined as

PSR = (ΓMAX − μSL) / σSL    (51)

where μSL and σSL are the statistically calculated mean value and standard deviation of the sidelobe respectively. An example of the response map under successful and failed tracking can be seen in Figure (8).

Fig. 8 Input images and their responses. The top row shows a strong PSR rate with strong response values isolated around the peak. The bottom row shows a failed tracking example – the PSR drops below 15 due to a noisy response map that does not have a strong isolated peak.

Fig. 9 The green box represents a 32 × 32 window around the maximum point. The remaining area represents the sidelobe, in which a region of Gaussian noise has been added to represent a second peak. The larger the amount of noise, the lower the PSR value becomes and the less sure we can be that we have found the correct target.

The PSR can vary widely depending on the complexities of the incoming image and the learning rate λ, so the value chosen is specific to the implementation and can be adjusted depending on how important it is to avoid false positive scenarios. Figure (9) demonstrates the PSR under differing amounts of noise away from the true peak at the centre of the response map, with a strong second peak or large amount of noise decreasing the PSR significantly. In this work it was empirically observed that a value below 15 suggested that the tracking was struggling, when a 32 × 32 window was removed from the response map around the maximum point. To allow for short term obfuscation of the face, the tracking is only deemed lost with consecutive failures over 10 frames. This triggers a new search from the Haar classifier and a reset of the model parameters.

Using a mean-shift approximation to obtain the change in image coordinates (∆x̂, ∆ŷ), and assuming the simultaneous shift of all points in the same direction does not include a rotation of the head model, the global shift in the face points can let us adjust the translation pose parameters as

∆tx = tz∆x̂/fx,  ∆ty = tz∆ŷ/fy    (52)
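The tracking-loss test of equation (51) and the translation nudge of equation (52) are equally compact; the sketch below masks a 32 × 32 window around the peak, mirroring Figure 9, with illustrative names throughout (the threshold of 15 and the 10-frame rule from the text would be applied by the caller).

```python
import numpy as np

def peak_sidelobe_ratio(resp, exclude=32):
    """PSR of eq. (51): peak height against the sidelobe statistics."""
    peak = resp.max()
    py, px = np.unravel_index(resp.argmax(), resp.shape)
    mask = np.ones(resp.shape, dtype=bool)
    h = exclude // 2                     # 32x32 exclusion window, as in Fig. 9
    mask[max(0, py - h):py + h, max(0, px - h):px + h] = False
    side = resp[mask]
    return (peak - side.mean()) / side.std()

def global_translation_shift(dx, dy, tz, fx, fy):
    """Eq. (52): convert a whole-face pixel shift into a translation update."""
    return tz * dx / fx, tz * dy / fy
```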

The final fitting process for the 2.5D Constrained Local Model is described in Algorithm (1).

Algorithm 1 2.5D CLM Fitting

 1: Precompute
 2:   Calibrate camera intrinsic matrix K
 3:   Construct 3D mean shape s_0 and eigenvectors Φ
 4:   Train 2D texture filters H_k around the n PDM points
 5: End
 6: procedure Fit(deformation d, pose p, image I)
 7:   Use p to crop, rotate and scale I, isolating the head I_face
 8:   if first frame or tracking lost then
 9:     Detect face using Haar classifier
10:     Build global correlation filter
11:   else
12:     Generate global response map
13:     Check for tracking failure
14:     Estimate global shift between frames
15:     Update global filter
16:   repeat
17:     for k ← 1, ..., n points in head model do
18:       Generate Γ_k via local regions of I_face and H_k
19:       Obtain mean-shift estimate v_k from Γ_k
20:       Project the 3D point X_k into the image, W(X_k, q)
21:       Calculate deformable derivatives ∂X_k/∂d
22:       Evaluate pose derivatives ∂X̃_k/∂r̃ and ∂X̃_k/∂t
23:       Evaluate derivatives of the projection ∂W(X_k, q)/∂q
24:       Combine derivatives into the final Jacobian matrix J
25:     Compute the parameter updates Δq = [Δd; Δp]
26:     Update deformation parameters d ← d + Δd
27:     Update pose p ← p + Δp, with rotation adjustment
28:     Determine the new 3D points X
29:   until (‖Δq‖ < ε) or (iter = maxIter)
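For completeness, the face detection in step 9 can be realized with OpenCV's stock Viola-Jones cascade. A minimal sketch, assuming the standard frontal-face cascade file shipped with OpenCV is available on disk; the function name is illustrative:

#include <opencv2/opencv.hpp>
#include <vector>

// (Re)initialization: locate candidate face regions with a Haar cascade
// (Viola and Jones, 2001). Returns bounding boxes in image coordinates.
std::vector<cv::Rect> detectFaces(const cv::Mat& frame) {
    static cv::CascadeClassifier cascade("haarcascade_frontalface_alt.xml");
    cv::Mat gray;
    cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
    cv::equalizeHist(gray, gray);  // normalize lighting before detection
    std::vector<cv::Rect> faces;
    cascade.detectMultiScale(gray, faces);
    return faces;
}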

4 Experiments

4.1 Head Pose Test Dataset

For the comparison of head pose algorithms, a dataset was recently established by Ariz et al. (2016) from the Public University of Navarre (UPNA). There are 10 individuals in the dataset, each with 12 videos performing various head movements over a 10 second period. The users are directed so as to isolate movements in different dimensions and around different axes. To collect the data, each individual had a Flock-Of-Birds 3D Guidance TrakSTAR firmly fitted to the top of their head. The device and user were carefully calibrated to ensure correct rotation and translation from the camera. To accomplish this, the 3D points of the head model were recorded relative to the tracking device before each recording by attaching a second TrakSTAR sensor to the end of a plastic marker and holding it at each location on the user's face for one second. Previously, a common dataset for pose estimation was the Boston University dataset (La Cascia et al., 2000); however, it is nearly two decades old and features very low camera resolutions by modern standards (320 × 240). In contrast, the UPNA dataset has a high-definition resolution of 1280 × 720.

As the UPNA dataset is publicly available, a number of models in the literature have been tested on it. For head tracking, state-of-the-art methods such as the Active Shape Model (ASM) (Cootes et al., 2001) and Active Appearance Model (AAM) (Cootes et al., 2001) have been available for a number of years. A vast number of varieties have been documented over the years, but for generality and fair comparison it is important that the training and testing data do not overlap. While ASM and AAM are very common, they are typically only used for 2D feature tracking. Head pose is often an afterthought, and frequently methods such as the POSIT algorithm (Dementhon and Davis, 1995) or similar are subsequently applied to the result. They can therefore be considered a two-stage process, with the 2D features first being optimized, followed by the fitting of a static 3D model to the points through an image projection method. The choice of 3D model is as important as the underlying method itself, and fortunately the authors of the UPNA dataset have tested an ASM and AAM with a number of head models, including a cylinder and a generic 'Basel head model' (Ariz et al., 2016). Each has shown impressive results, with the data for roll, pitch and yaw angles individually published alongside the Mean Angular Error (MAE) in degrees.

4.2 Head pose estimation results

Since the UPNA head pose dataset was measured with a 'flock-of-birds' tracker and care was taken when calibrating the device, robust ground-truth values for both rotation and translation are available. A separate 'model' file was included for each participant, in which 3D coordinates on the participant (including, crucially, the inner eye corners) are known. The local coordinate system of the supplied ground truth is actually located on the top of the participant's head and needs to be transformed through a constant pose adjustment to the inner eye corners so that it can be compared correctly. By following the same alignment process that was performed on the creation of the PDM (section 3.3), the supplied tagged point model can be used to determine an estimate of the pose offset, which can subsequently be applied on all frames, although there is no guarantee of its accuracy. It is important to stress that this does not in any way affect the impartiality of the proposed tracking model, as it remains independent at all times with no additional training taking place.

The 2.5D CLM head pose estimation data was acquired over all video frames in the dataset.

Table 2 Mean rotation errors on the UPNA dataset. Mean ± Standard Deviation displayed where known. All POSIT results taken from (Ariz et al., 2016) and Cascade Regression Tree result from (Tulyakov et al., 2018)

                                                              Head Rotation Error (°)
Tracker                          Head Model                   Roll          Yaw           Pitch         Mean
ASM & POSIT                      2D PDM & Basel Face Model    1.12          2.97          4.04          2.71 ± 2.82
ASM & POSIT                      2D PDM & CHM                 1.14          3.56          5.52          3.40 ± 3.02
AAM & POSIT                      2D PDM & Basel Face Model    1.74          2.30          6.01          3.35 ± 4.29
AAM & POSIT                      2D PDM & CHM                 1.74          3.68          8.83          4.75 ± 4.84
Cascade Regression Tree          3D PDM                       -             4.33          3.41          -
2.5D CLM with MOSSE (this work)  3D PDM                       1.32 ± 3.63   3.81 ± 9.78   3.14 ± 3.77   2.76 ± 6.49
2.5D CLM with LNF (this work)    3D PDM                       0.81 ± 0.59   1.88 ± 1.18   2.47 ± 2.09   1.72 ± 1.58

Fig. 10 Examples of model fitting on the UPNA dataset. While the fitting is generally good, facial hair can be the source of some inaccuracies, as seen in the bottom-right image.

No assumptions were made about starting locations, with the initial frame head pose first being estimated by a Haar classifier as outlined in section 3.7. All of the videos start with a frontal face and therefore all faces were detected successfully. Table (2) compares the new model with the state-of-the-art models tested in Ariz et al. (2016). The authors first estimated the 2D face shape through both an ASM and an AAM, which are both commonly seen in the literature. Then the POSIT method (Dementhon and Davis, 1995) was applied, which attempts to best fit the 2D points with a known 3D model. Two face models from the authors are also compared: a generic 3D head model (the Basel Face Model (Paysan et al., 2009)) and a cylindrical head model (CHM). The state-of-the-art results observed in recent literature using a Cascade of Regression Trees (Tulyakov et al., 2018) are also shown for comparison. Like the 2.5D CLM, their method uses a 3D model of the face and evaluates the pose parameters directly. To evaluate the effectiveness of both the 2.5D CLM and the LNF filters directly, another form of discriminator was utilized for each point individually, based on the MOSSE correlation filters described in section 3.7. The MOSSE filters were the same size and were trained on the same data as the LNF filters for suitable comparison.

The 2.5D CLM with LNF texture filters outperforms all other models tested on the mean accuracy of rotation. In addition, it performs better than all the other models on the individual rotation angles. The standard deviation on all three rotation axes is low, showing its robust nature. It does not require a fixed 3D model like the POSIT algorithm, instead deforming the original PDM directly to estimate the pose parameters. It is thus a one-stage process and can therefore be implemented more efficiently than the other methods.

As there is no prior on the pose estimation, outliers can have a significant negative effect on the mean. The median errors in degrees are 0.75, 1.73, 2.04 and 1.20 for roll, yaw, pitch and all angles respectively. This shows that for the 2.5D CLM the errors are generally consistently low, with larger errors accruing on a small number of occasions due to minor tracking failure.
Larger errors can be seen for the MOSSE filters, which benefited from the direct evaluation of the 3D parameters from the 2.5D CLM but were not as robust as their LNF counterparts. This kind of tracking failure is very difficult to detect, since no 'global' failure has occurred and the facial features are still being tracked. However, since the face is believed to be at a different rotation than it actually is, the 'wrong' set of patches from the multiple view choices is being used. It is likely that the non-linear nature of the LNF filters gives them a larger tolerance to the variability of the textures from different view angles.

As expected, the pitch angle is the most difficult to determine, since most of the frontal face points appear approximately planar to one another, with only the position of the tip of the nose in relation to the other points around it providing any strong evidence of the pitch angle. One other notable instance of an inaccurate pitch being obtained was when estimating the values of users with beards, as seen in the bottom-right image of Figure (10). Further evidence of the difficulty in acquiring the pitch can be seen in Figure (11), where the tracking results for videos isolating the roll, yaw and pitch are shown over time for four participants. The graphs do show, however, that while tracking large rotations can produce inaccurate results, the 2.5D CLM is able to recover quickly once the rotation values return to less extreme orientations.

Since it is not standard to report translation errors within the literature, there is no comparison metric available. However, it is interesting to see that while the mean errors in both the x and y translations are very small (1.5mm and 1.26mm respectively), the z-translation has significantly larger errors of 20.49mm. This is perhaps to be expected, as monocular systems, even in humans who have lost the sight in one eye, struggle with depth perception. The problem is made significantly more complex by the fact that the model is allowed to deform, and thus a small change in depth can be misattributed to a facial deformation.

4.3 Efficiency on a mobile device

It is of interest to know how efficiently the algorithm performs on a tablet device using the internal battery only. The code-base uses the OpenCV library alongside a collection of self-developed libraries in C++. No significant optimization steps were taken during development and there are likely to be many opportunities for parallelization that were not taken, for example when obtaining the local texture responses for each point. The frames-per-second (fps) computation time was calculated using high performance timers on a Microsoft Surface Pro 3 (released in 2014) with an Intel i5-4300U CPU and 8GB RAM. The timer ran over the length of all videos of the UPNA dataset, giving an average of 29.1 fps including the first-frame face detection. The disparity between frames comes mostly from the loading of the larger images and the cropping and rescaling of the various target features; that is to say, when the head is larger within the image frame, computation time increases. The tablet reported during the test that it was utilising under 40% of the available system resources. As tablets are becoming more powerful each year, this shows that the proposed model is perfectly suited to a mobile device and sufficiently meets the minimum requirements for real-time use.
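As a sketch of the measurement itself, the per-frame timing can be gathered with the standard C++ clock; frameFn is a hypothetical stand-in for one full detect-and-fit iteration and is not part of the paper's code-base:

#include <chrono>

// Average frames-per-second of a per-frame routine over nFrames calls.
template <typename FrameFn>
double averageFps(FrameFn&& frameFn, int nFrames) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < nFrames; ++i)
        frameFn();  // one full detect/fit step on the next frame
    const std::chrono::duration<double> elapsed = clock::now() - start;
    return nFrames / elapsed.count();
}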

[Figure 11 here: a 4 × 3 grid of line plots, one row per participant (Users 7-10) and one column per rotation axis (Roll, video 4; Yaw, video 5; Pitch, video 6), each plotting angle in degrees (-50 to 50) against frame number (0 to 300).]

Fig. 11 Sample errors (in degrees) of roll, yaw and pitch over time on the UPNA dataset. The blue and orange lines show the ground truth and estimated values respectively, with the yellow area representing the error on each of the 300 frames of video. The participants were directed to perform isolated movements of roll, yaw and pitch on separate videos, and these are compared here for four users. The largest errors occur for all 3 rotation axes when furthest from the front-facing position (0°), although it is clear that the tracking is able to recover well after these large rotations. The pitch is particularly difficult to determine when the participant has facial hair obscuring the outline of the face, as with User 8.

4.4 3D face alignment in the wild

While the 2.5D CLM was not specifically designed for face alignment within images, it is interesting to see how it compares with other state-of-the-art methods on challenging in-the-wild data. Face alignment differs from head pose estimation in that it measures how close each estimated point on the face is to a tagged image landmark when projected into the image. While there are many datasets for measuring these errors, the markups have been annotated strictly in 2D. In particular, side profiles often have many points untagged, since they are not visible within the 2D image. It has been shown that these 2D markups often do not actually satisfy the additional constraints of a projected 3D face shape; instead, they are stretched and squashed to fit the incoming image and hence are often not viewpoint-consistent (Tulyakov et al., 2018).

The viewpoint-consistency approach tries to ensure that the tagged points would be correct from other camera angles and would therefore provide an anatomically correct face. The difficulty of evaluating a 3D model comes from the lack of 3D-tagged image data. One dataset that attempts to solve this problem is AFLW2000-3D (Zhu et al., 2016). The dataset is an extension of the AFLW (Annotated Facial Landmarks in the Wild) dataset, in which the ground-truth 3D landmarks representing 68 points on the face have been reevaluated for 2000 AFLW samples, specifically for 3D face alignment evaluation. The dataset is very challenging, with a wide variety of extreme poses, often along with any number of occluding items and textured environments under a variety of lighting conditions.

Since the images do not come with any information regarding the camera intrinsic parameters, the face alignment error is typically determined from the 2D Euclidean distance between the estimated and ground-truth 3D points when projected to the image. This Normalized Mean Error (NME) is defined as

\mathrm{NME} = \frac{1}{n} \sum_{k=1}^{n} \frac{\| x_k - y_k \|_2}{d_k}    (53)

Table 3 Normalized Mean Errors (%) for 3D face alignment on AFLW2000-3D. For 3D methods it is important to evaluate all 68 points of the model, even when they are not visible in the picture. Evaluations for RCPR, ESR, SDM and 3DDFA are from (Zhu et al., 2016). The Hierarchical binary CNNs result is taken from (Bulat and Tzimiropoulos, 2018) and 3DFAN (Bulat and Tzimiropoulos, 2017) was evaluated using source code provided by the authors (https://github.com/1adrianb/face-alignment). 'Real-time on CPU' here denotes the potential suitability of the method for use on a typical mobile device without a GPU at over 30 fps, taken from the performance evaluations in the respective papers.

Method                    Reference                         [0°,30°]   [30°,60°]   [60°,90°]   Mean    Real-time on CPU
RCPR(300W)                Burgos-Artizzu et al. (2013)      4.16       9.88        22.58       12.21
ESR(300W)                 Cao et al. (2014)                 4.38       10.47       20.31       11.72   X
SDM(300W)                 Xiong and De la Torre (2013)      3.56       7.08        17.48       9.37    X
ESR(300W-LP)              Cao et al. (2014)                 4.60       6.70        12.67       7.99    X
RCPR(300W-LP)             Burgos-Artizzu et al. (2013)      4.26       5.96        13.18       7.80
SDM(300W-LP)              Xiong and De la Torre (2013)      3.67       4.94        9.76        6.12    X
2.5D CLM with LNF         This work                         4.56       6.56        7.17        6.10    X
3DDFA                     Zhu et al. (2016)                 3.78       4.54        7.93        5.42
3DDFA+SDM                 Zhu et al. (2016)                 3.43       4.24        7.17        4.94
3DFAN                     Bulat and Tzimiropoulos (2017)    3.64       4.79        5.99        4.81
Hierarchical binary CNNs  Bulat and Tzimiropoulos (2018)    2.47       3.01        4.31        3.26

where x_k and y_k are the projected ground-truth and estimated points of the k-th face image respectively. The value d_k attempts to normalize the error for different face and image sizes and is computed here for each image as

d = \sqrt{w_b \, h_b}    (54)

where w_b and h_b represent the width and height of the bounding box enclosing all face points.
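A minimal sketch of equations (53) and (54), assuming the 68 ground-truth and estimated landmarks have already been projected into the image, and reading ‖x_k − y_k‖ as the mean point-to-point distance within image k; all names are illustrative:

#include <opencv2/opencv.hpp>
#include <cmath>
#include <vector>

// Per-image error: mean landmark distance normalized by d = sqrt(w_b * h_b)
// of the box enclosing all ground-truth face points (equation 54).
double normalizedError(const std::vector<cv::Point2f>& groundTruth,
                       const std::vector<cv::Point2f>& estimated) {
    const cv::Rect box = cv::boundingRect(groundTruth);
    const double d = std::sqrt(double(box.width) * double(box.height));
    double sum = 0.0;
    for (size_t i = 0; i < groundTruth.size(); ++i)
        sum += std::hypot(double(groundTruth[i].x - estimated[i].x),
                          double(groundTruth[i].y - estimated[i].y));
    return (sum / groundTruth.size()) / d;
}

// Equation (53): average the normalized error over all n face images.
double nme(const std::vector<std::vector<cv::Point2f>>& gt,
           const std::vector<std::vector<cv::Point2f>>& est) {
    double total = 0.0;
    for (size_t k = 0; k < gt.size(); ++k)
        total += normalizedError(gt[k], est[k]);
    return total / gt.size();  // typically reported as a percentage
}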
The results summarizing how the 2.5D CLM compares in accuracy to other, more computationally demanding methods can be seen in Table (3). Since the 2.5D CLM only searches its local vicinity for good point matches, it needs a good initial prediction to successfully find the face points. Utilizing a standard Haar classifier for face detection (Viola and Jones, 2001) provides poor results, as the dataset contains 'in-the-wild' images that are designed to be challenging. Instead, in line with the other methods in Table (3), the ground-truth region-of-interest is used to set the initial translation of the head. For the initial rotation estimate, a heuristic was implemented whereby 9 initial poses were assessed (representing the 9 sets of patches trained with yaw angles between [−90°, 90°], as detailed in equation 31). The yaw angle with the highest sum of Peak-Sidelobe Ratios (PSR, see section 3.7) from each point was then taken as the initial pose estimate. While the 2.5D CLM is by no means the most accurate, possibly due to initialization difficulties or the absence of information about the camera focal length, the strongest methods rely on deep learning approaches computing in parallel on one or more GPUs to get anywhere near real-time performance. While only our approach has been shown to work in real-time on a mobile device, we have indicated the other methods that have the potential to do the same within the table, based on the performance evaluations from their respective papers.

The results also show that while the 2.5D CLM suffers as the yaw rotation of the head increases towards a profile view, it is consistent overall compared to the older models. This is likely due to a strong pose estimate, clearly showcasing the benefits of designing the PDM specifically for pose evaluation rather than the simpler problem of face alignment to arbitrarily tagged 2D points.

5 Conclusion

The head-pose estimation model, the 2.5D CLM, was successfully able to outperform 2D ASM and AAM trackers supplemented with the POSIT algorithm on the UPNA dataset. Additionally, it has been shown to outperform a recently published viewpoint-consistent 3D model using Cascade Regression Trees. It has shown that it can achieve robust results under many variations of head pose movement. The 2.5D head pose estimator has a significant advantage over other models in that it directly optimizes the rotation and translation parameters, rather than being a two-stage process like many other models. The weakest rotation angle was consistently the pitch of the head, perhaps due to the lack of relative depth around the nose region. Although a person's nose shape remains relatively static, the 3D model is capable of a large amount of deformation around the nose to accommodate many different people. The z-translation component suffers from a similar problem, where the optimization process can stretch and squash the face as it needs, rather than move correctly along the z-dimension. Both of these issues could potentially be alleviated by learning about the user's face shape over time, so that the inter-person deformation is removed and only the deformation specific to that individual is kept.

The head-pose estimator was successfully able to run on a commodity tablet in real-time. The work can be extended by using the many opportunities for parallelization within the code to optimize the algorithm. Additionally, since the model does not require a specific texture model, the 2.5D CLM can take advantage of the explosion of many new and exciting deep learning models being explored within the research community.

References

Ackland S, Istance H, Coupland S, Vickers S (2014) An investigation into determining head pose for gaze estimation on unmodified mobile devices. In: Proceedings of the Symposium on Eye Tracking Research and Applications, ACM, pp 203–206
Ariz M, Bengoechea JJ, Villanueva A, Cabeza R (2016) A novel 2d/3d database with automatic face annotation for head tracking and pose estimation. Computer Vision and Image Understanding 148:201–210
Asthana A, Zafeiriou S, Tzimiropoulos G, Cheng S, Pantic M, et al. (2015) From pixels to response maps: Discriminative image filtering for face alignment in the wild. IEEE Transactions on Pattern Analysis and Machine Intelligence 37(6):1312–1320
Baltrušaitis T, Robinson P, Morency LP (2012) 3d constrained local model for rigid and non-rigid facial tracking. In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE, pp 2610–2617
Baltrušaitis T, Robinson P, Morency LP (2013) Constrained local neural fields for robust facial landmark detection in the wild. In: Proceedings of the IEEE International Conference on Computer Vision Workshops, pp 354–361
Baltrušaitis T (2014) Automatic facial expression analysis. PhD thesis, University of Cambridge
Baltrušaitis T, Robinson P, Morency LP (2016) Openface: an open source facial behavior analysis toolkit. In: Applications of Computer Vision (WACV), 2016 IEEE Winter Conference on, IEEE, pp 1–10
Blanz V, Vetter T (1999) A morphable model for the synthesis of 3d faces. In: Proceedings of the 26th annual conference on Computer graphics and interactive techniques, ACM Press/Addison-Wesley Publishing Co., pp 187–194
Bolme DS, Beveridge JR, Draper BA, Lui YM (2010) Visual object tracking using adaptive correlation filters. In: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, IEEE, pp 2544–2550
Bulat A, Tzimiropoulos G (2017) How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks). In: Computer Vision (ICCV), 2017 IEEE International Conference on, IEEE, pp 1021–1030
Bulat A, Tzimiropoulos G (2018) Hierarchical binary cnns for landmark localization with limited resources. IEEE Transactions on Pattern Analysis and Machine Intelligence
Burgos-Artizzu XP, Perona P, Dollár P (2013) Robust face landmark estimation under occlusion. In: Proceedings of the IEEE International Conference on Computer Vision, pp 1513–1520
Cao X, Wei Y, Wen F, Sun J (2014) Face alignment by explicit shape regression. International Journal of Computer Vision 107(2):177–190
Cheung YM, Peng Q (2015) Eye gaze tracking with a web camera in a desktop environment. Human-Machine Systems, IEEE Transactions on 45(4):419–430
Choi S, Kim D (2008) Robust head tracking using 3d ellipsoidal head model in particle filter. Pattern Recognition 41(9):2901–2915
Cootes TF, Edwards GJ, Taylor CJ (2001) Active appearance models. Pattern Analysis and Machine Intelligence, IEEE Transactions on 23(6):681–685
Cristinacce D, Cootes T (2006) Feature detection and tracking with constrained local models. In: Proc. British Machine Vision Conference, vol 3, pp 929–938
Dementhon DF, Davis LS (1995) Model-based object pose in 25 lines of code. International Journal of Computer Vision 15(1):123–141
Fanelli G, Weise T, Gall J, Van Gool L (2011) Real time head pose estimation from consumer depth cameras. Pattern Recognition pp 101–110
Goodall C (1991) Procrustes methods in the statistical analysis of shape. Journal of the Royal Statistical Society Series B (Methodological) pp 285–339
Gross R, Matthews I, Cohn J, Kanade T, Baker S (2010) Multi-pie. Image and Vision Computing 28(5):807–813
Henriques JF, Caseiro R, Martins P, Batista J (2015) High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence 37(3):583–596
Kirby M, Sirovich L (1990) Application of the karhunen-loeve procedure for the characterization of human faces. Pattern Analysis and Machine Intelligence, IEEE Transactions on 12(1):103–108
La Cascia M, Sclaroff S, Athitsos V (2000) Fast, reliable head tracking under varying illumination: An approach based on registration of texture-mapped 3d models. Pattern Analysis and Machine Intelligence, IEEE Transactions on 22(4):322–336
Martins P, Caseiro R, Batista J (2010) Face alignment through 2.5d active appearance models. International Journal of Computer Vision 56(1):221–255
Martins P, Caseiro R, Batista J (2012) Generative face alignment through 2.5d active appearance models. Computer Vision and Image Understanding
Merget D, Rock M, Rigoll G (2018) Robust facial landmark detection via a fully-convolutional local-global context network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 781–790
Padeleris P, Zabulis X, Argyros AA (2012) Head pose estimation on depth data based on particle swarm optimization. In: Computer Vision and Pattern Recognition Workshops (CVPRW), 2012 IEEE Computer Society Conference on, IEEE, pp 42–49
Paquet U (2009) Convexity and bayesian constrained local models. In: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, IEEE, pp 1193–1199
Paysan P, Knothe R, Amberg B, Romdhani S, Vetter T (2009) A 3d face model for pose and illumination invariant face recognition. In: Advanced Video and Signal Based Surveillance, 2009. AVSS'09. Sixth IEEE International Conference on, IEEE, pp 296–301
Pons-Moll G, Rosenhahn B (2011) Model-based pose estimation. In: Visual Analysis of Humans, Springer, pp 139–170
Rusinkiewicz S, Levoy M (2001) Efficient variants of the icp algorithm. In: 3-D Digital Imaging and Modeling, 2001. Proceedings. Third International Conference on, IEEE, pp 145–152
Saragih J, Lucey S, Cohn J (2011) Deformable model fitting by regularized landmark mean-shift. International Journal of Computer Vision 91(2):200–215
Tang Y, Sun Z, Tan T (2011) Real-time head pose estimation using random regression forests. Biometric Recognition pp 66–73
Torresani L, Hertzmann A, Bregler C (2008) Nonrigid structure-from-motion: Estimating shape and motion with hierarchical priors. Pattern Analysis and Machine Intelligence, IEEE Transactions on 30(5):878–892
Tulyakov S, Jeni LA, Cohn JF, Sebe N (2018) Viewpoint-consistent 3d face alignment. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(9):2250–2264
Viola P, Jones M (2001) Rapid object detection using a boosted cascade of simple features. In: Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, IEEE, vol 1, pp I-511
Wang Y, Lucey S, Cohn JF (2008) Enforcing convexity for improved alignment with constrained local models. In: Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, IEEE, pp 1–8
Weng J, Cohen P, Herniou M, et al. (1992) Camera calibration with distortion models and accuracy evaluation. IEEE Transactions on Pattern Analysis and Machine Intelligence 14(10):965–980
Xiao J, Moriyama T, Kanade T, Cohn J (2003) Robust full-motion recovery of head by dynamic templates and re-registration techniques. International Journal of Imaging Systems and Technology 13(1):85–94
Xiao J, Baker S, Matthews I, Kanade T (2004) Real-time combined 2d+3d active appearance models. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol 2
Xiong X, De la Torre F (2013) Supervised descent method and its applications to face alignment. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 532–539
Zhu X, Lei Z, Liu X, Shi H, Li SZ (2016) Face alignment across large poses: A 3d solution. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 146–155