
International Journal of Computer Vision manuscript No. (will be inserted by the editor)

Real-Time 3D Head Pose Tracking Through 2.5D Constrained Local Models with Local Neural Fields

Stephen Ackland · Francisco Chiclana · Howell Istance · Simon Coupland

Received: date / Accepted: date

Abstract Tracking the head in a video stream is a common thread seen within computer vision literature, supplying the research community with a large number of challenging and interesting problems. Head pose estimation from monocular cameras is often considered an extended application after the face tracking task has already been performed. This often involves passing the resultant 2D data through a simpler algorithm that best fits the data to a static 3D model to determine the 3D pose estimate. This work describes the 2.5D Constrained Local Model, combining a deformable 3D point model with 2D texture information to provide direct estimation of the pose parameters, avoiding the need for additional optimization strategies. It achieves this through an analytical derivation of a Jacobian matrix describing how changes in the parameters of the model create changes in the shape within the image through a full-perspective camera model. In addition, the model has very low computational complexity and can run in real-time on modern mobile devices such as tablets and laptops. The Point Distribution Model of the face is built in a unique way, so as to minimize the effect of changes in facial expressions on the estimated head pose and hence make the solution more robust. Finally, the texture information is trained via Local Neural Fields (LNFs) – a deep learning approach that utilizes small discriminative patches to exploit spatial relationships between the pixels and provide strong peaks at the optimal locations.

Keywords Head Pose Estimation · 2.5D CLM · LNF · Real-time

S. Ackland
De Montfort University, Leicester, UK
E-mail: [email protected]

F. Chiclana
De Montfort University, Leicester, UK
E-mail: [email protected]

H. Istance
University of Tampere, Finland
E-mail: howell.istance@staff.uta.fi

S. Coupland
De Montfort University, Leicester, UK
E-mail: [email protected]

1 Introduction

Face tracking and 3D pose estimation from monocular video streams is an interesting yet challenging task, with novel approaches continually being published in the literature. This paper investigates the prospect of acquiring the 3D head pose from consumer cameras such as webcams and those attached to modern computing devices such as smart phones and tablets, with the processing power and memory available on such devices now allowing for computationally intensive applications to run in real-time. The head-pose estimate involves the approximation of the translation and rotation of the head, typically relative to the camera. This process involves some element of detection and tracking within the 2D image in order to obtain a 3D output. Of course, since the 2D image has effectively lost a dimension of data, it is vital that realistic models of the head and camera are known to achieve a reliable estimate of real-world pose.

Having a predefined model of shape and texture can be beneficial in constraining the search process in order to reduce false positives. The analysis of the image data can take two main forms: generative models, which are parameterized textures of the whole face (a holistic texture model) that when combined synthesize a new facial appearance, and discriminative models, which combine a number of local feature detectors (patch texture models) representing each point on the face. The discriminative approach comes down to a simple classification problem that can be solved using applications of deep learning. For example, does this new image patch represent an eye corner, or not? By combining the responses from all of the patches, we can optimize our shape parameters to best solve the problem.

The main contribution of this work is the design and development of the 2.5D Constrained Local Model (2.5D CLM) for head pose estimation from a monocular camera, particularly on devices with limited resources such as a tablet or laptop. The model combines a 3D shape with 2D texture information that provides direct estimation of the pose parameters, avoiding the need for additional optimization strategies. The texture information is evaluated via Local Neural Fields (LNFs) – a deep learning paradigm that utilizes small discriminative patches to exploit spatial relationships between the pixels and provide strong peaks at the optimal location. The model is carefully constructed with the alignment of the training shapes via the inner eye-corners, reducing the complexity of the model and making it more robust to changes in facial expression. The model is complete with an analytical derivation of a Jacobian matrix, which describes how changes in the parameters of the model create changes in the point positions within the image through a full-perspective camera model. The model competes well with other state-of-the-art models on 3D face alignment and head-pose estimation tasks, while running with very low computational complexity.
2 Related Work

Evaluating the 3D head pose from a monocular camera involves a number of other areas, including the detection and tracking of the head through the evaluation of facial landmarks in an image. This has been a well-researched area in recent years, with many examples within the literature of how to solve these problems. This section summarizes these contributions and provides the basis of the 2.5D CLM with Local Neural Fields.

2.1 Head tracking models

Tracking the user's head within a series of images from a video stream typically involves two tasks: first to detect the initial location within an image of the head, and secondly to track that head through a series of images. The most successful and well-documented head detector in recent times is the Haar classifier from (Viola and Jones, 2001), which provides a good initial estimate of the head location. A common modern approach to detect face points in an image is to utilize deep neural networks (Bulat and Tzimiropoulos, 2018; Merget et al., 2018). In particular, the use of cascades of Convolutional Neural Networks has shown excellent results on 3D face alignment problems (Bulat and Tzimiropoulos, 2017; Zhu et al., 2016). The variety of tagged training data now freely available, along with the computing power of multicore GPUs, allows for remarkable inference from image to face points. However, this accuracy comes at a high computational cost and is currently not suitable for domains where the hardware is limited and there is a need for real-time performance from video. For such problems, efficient tracking can be performed by searching the local area of the current solution using simple face models (Ackland et al., 2014; Baltrušaitis et al., 2016; Saragih et al., 2011). This is usually sufficient provided the relative movement is not too great and the refresh rate is quick enough, i.e. the number of frames processed per second is high enough.

One of the most important issues when dealing with head tracking relates to choosing a representation of the head. Many of the representations involve capturing the texture from the image and capturing the relative movement over time. One of the simplest representations comprises a texture mapped cylinder (Cylindrical Head Model (CHM)) (Xiao et al., 2003), which is formed by instantiating a cylinder into the camera scene and mapping the current image frame onto the cylinder surface. Since only the relative movement is determined, capturing the relationship between the head shape and other items in the environment requires further calibration. Additionally, like many tracking methods of this variety, the tracking tends to degenerate over time and requires re-instantiation regularly. Other simple shapes may approximate the head shape slightly better with a relatively small computational cost, including ellipsoidal models (Choi and Kim, 2008) and sinusoidal models (Cheung and Peng, 2015).

Alternatively, models of the head shape and texture can be built beforehand, where we attempt to acquire a best-fit solution for the model to the new image. By approaching the problem with pre-learned knowledge about the face, we can estimate relative distances between in-scene objects better, and also reduce cumulative tracking errors by always optimizing the fitting to the learned data and using the previous tracking iteration as a guide only. Specifically, if we wish to track the shape and subsequently determine pose, we need to acquire more information about the distribution of the individual face components or features, such as the eye corners and bridge of the nose. Predetermined knowledge about a face shape often comes in the form of a Point Distribution Model (PDM) (Cootes et al., 2001). The PDM is capable of creating plausible shapes from a sequence of deformable points that are statistically learnt from a training set of marked-up images of the shape. It is a simple linear parametric model in either 2D or 3D that, given enough training data, generalizes well to even unseen data. Furthermore, by restricting the parameters to fall between plausible boundaries, the model can overcome issues with noisy data and occlusion.

The PDM is typically built by first aligning the collection of annotated training shapes. In 2D this is often achieved via a generalized Procrustes algorithm (Goodall, 1991), which normalizes the shapes to a common reference shape (with common scale, translation and rotation). In 3D other solutions are often incorporated, such as Iterative Closest Point (Rusinkiewicz and Levoy, 2001). When the shapes are aligned, Principal Component Analysis (PCA) (Kirby and Sirovich, 1990) can be applied, which provides the mean shape along with a set of eigenvectors capturing the statistical variation of the shapes in the training set. PDMs form the basis of many tracking techniques in the literature, from simple Active Shape Models (ASMs) (Cootes et al., 2001) built from a relatively small number of points, to 3D Morphable Models (3DMMs) (Blanz and Vetter, 1999) that are usually constructed via 3D range scanners and are therefore denser, comprising potentially thousands of points.
2.2 Constrained Local Models

The Constrained Local Model (CLM) (Cristinacce and Cootes, 2006) is a discriminative model that involves training small texture patches that correspond to parts of the face such as an eye-corner, or nose-bridge. Each patch searches its local area for a 'best-fit' and the pose update of the face is determined via a least squares approach from all the patch results. The result is further enhanced by ensuring that the face is still constrained by the learned shape-model that describes how a face can move. To fit a PDM to an image we need to look at finding the joint deformation d and transformation p parameter values that minimize the misalignment of each vertex in the shape model to the estimated image location x for that vertex. Concatenating the pose and deformation parameter vectors to form a new vector q = [p|d], we define the error function E as

E(q) = R(q) + Σ_{i=1}^{n} Di(xi; I)    (1)

where Di is a data term representing the incoming data and evaluates how each landmark xi is misaligned within the image I. Note, as in Saragih et al. (2011), the extension of the error term with a regularization term R to punish complex deformations of the shape model (i.e. shapes that deviate too far from the mean shape).

An alternative way of viewing the problem is via a probabilistic interpretation (Saragih et al., 2011). Assuming conditional independence of all the landmarks, let us observe that the probability (p) of the shape model being correctly aligned with q parameters within the image I is proportional to the product of each individual landmark probability being correctly aligned at xi. The Regularization and Data terms are now

R(q) = − ln {p(q)}    (2)
Di(xi; I) = − ln {p(li = 1 | xi, I)}    (3)

where li = 1 represents the likelihood of a correct match.

The regularization term is dependent on the probability of our prior beliefs about the shape model, and thus our problem becomes 'Given a collection of n ordered 3D points, how likely is it they represent a face shape?' The rigid component of the q-parameters, p (representing the pose), can be any translation and rotation, so a non-informative prior can be used. As such, the regularizer only depends on the deformation parameters d and a regularization weight r that allows us to bias the result in favour of either the prior knowledge or the incoming observed data. The regularization term then becomes¹

R(q) = r ‖q‖²_{Λ̃⁻¹}    (4)

where a non-informative prior is kept on the initial parameters representing the transformation of the face

Λ̃⁻¹ = diag{0, 1/λ1, 1/λ2, ..., 1/λν}    (5)

¹ ‖q‖²_{Λ⁻¹} is shorthand for the squared Mahalanobis distance qᵀΛ⁻¹q

The data term D represents the incoming information from the camera and is specifically tasked with finding the most likely location of the head within the image frame (assuming there is one within the image). To accomplish this, it is important to know what we are looking for and to minimize false positives. Since each landmark can be evaluated independently, the error each landmark contributes to the overall error term can, in its simplest form, be obtained simply through calculating the squared Euclidean distance from the projected shape model points to the observed optimal locations y within the image.

D(x, I) = Σ_{i=1}^{n} (yi − W(Xi, q))² = ‖y − P(q)‖²    (6)

The warping function W defines how a single 3D point can be projected into the image, and P defines the full model projection. The details of these functions can be found in sections (3.2) and (3.3).

Combining equations (4) and (6) gives us

q̂ = argmin_q {R(q) + D(x, I)} = argmin_q { r‖q‖²_{Λ̃⁻¹} + ‖y − P(q)‖² }    (7)

Since this minimization formulation is in the form of a non-linear least squares problem, the problem is solved incrementally with linear update approximations added to the current parameters: q ← q + ∆q (with a slight adjustment for the rotation parameters – see section 3.6). Taking a first order Taylor series expansion of the projected landmarks we obtain

P(q + ∆q) ≈ P(q) + J∆q    (8)

where J is the Jacobian matrix containing the derivatives of the projected shape points P(q) with respect to the q parameters, i.e.

J = ∂P(q)/∂q    (9)

Thus we are looking to minimize

∆q̂ = argmin_{∆q} { r‖q + ∆q‖²_{Λ̃⁻¹} + ‖y − (P(q) + J∆q)‖² }    (10)

2.3 Local Patch Filters

The local patch filters work independently at first and have a simple goal – to search their local image space and provide a probability of how well the pixels in the vicinity resemble the tagged training sample, with closer resemblances achieving a higher response rate. A pixel here in fact represents the centre of a sliding window of some width and height, and is therefore ideally suited for convolution or cross-correlation, which can be efficiently applied within the Fourier domain. The model itself does not care which discriminator is used, and so several different types can easily be substituted in and compared for speed, accuracy and robustness. Much of the literature investigating improvements to the CLM fitting process has been based on the type of patch used, and how best to amalgamate all the local response maps to globally optimize the solution. The format of the trained patch model can be any classification tool as long as it produces a high response at the correct location. Typically a patch is trained on upright faces, and during fitting the image patch is rotated to match the previous known face position. The first CLMs trained their discriminative patches via a Support Vector Machine (SVM) (Wang et al., 2008). Asthana et al. (2015) process the image with low-level filters and ensure sparsity in the estimation by reconstructing the most likely response texture from a library of pretrained responses using Principal Component Analysis (PCA) (Kirby and Sirovich, 1990). Other notable examples include correlation filters such as Minimum Output Sum of Squared Error (MOSSE) (Bolme et al., 2010) and Local Neural Fields (LNFs), which learn non-linear and spatial relationships between the pixels (Baltrusaitis et al., 2013). These filters tend to produce more robust classifiers over large pose variations, although non-linear filters tend to be slightly more computationally intensive. A version of the LNF classifier is contained within the OpenFace toolkit (Baltrušaitis et al., 2016), which is freely available for comparison metrics.

Specifically, LNFs are trained as undirected graph models that take a series of vectorized image patches Ξ = {ξ1, ..., ξn} to predict a conditional probability vector ψ = {ψ1, ..., ψn} representing the likelihood of a pixel being the desired output, such as an eye corner or the tip of the nose. It is similar to a Convolutional Neural Network where vertex features represent a mapping between Ξ and ψ through a one-layer neural network. In addition, two further constraints are placed on the model in the form of edge features g and l that enforce smoothness and sparsity respectively. For a neuron k aligned in a grid structure with edges linking neighboring nodes

gk(ψi, ψj) = 0.5 Sg (ψi − ψj)²
lk(ψi, ψj) = 0.5 Sl (ψi + ψj)²    (11)

where Sg = 1 when the nodes ψi and ψj are immediate neighbors (0 otherwise), and Sl = 1 on nodes between 4 and 6 edges apart (0 otherwise).
pixel here in fact represents the centre of a sliding win- After training, a number of ‘neurons’ are generated dow of some width and height, and therefore is ideally where each pixel has a corresponding value (see Figure suited for convolution or cross-correlation which can 1). When vectorized, the values from the k-th neuron be efficiently applied within the Fourier domain. The form the weights wk, which along with a bias term bk model itself does not care which discriminator is used attempt to discriminate between the positive and neg- and so several different types can easily be substituted ative samples. A response value γ can be generated by in and compared for speed, accuracy and robustness. each neuron k for an incoming image pixel Ip by Much of the literature investigating improvements to T the CLM fitting process has been based on the type of γk (Ip) = wk V(p, I) + bk (12) Real-Time 3D Head Pose Tracking Through 2.5D Constrained Local Models with Local Neural Fields 5

The final response is then obtained by summing each neuron's individual response to the pixel location. A response map Γ(x, I) can be generated by evaluating all the pixels within a localized image area around a point x. The map can be formed using a sliding window approach over the area, but can also be obtained more efficiently through convolutions within the Fourier domain.

Fig. 1 The LNF comprises multiple filters for a single point of the PDM (here showing the outer eye corner). Each of the filters needs to be convolved with the new input image before their responses are summed. This makes them more robust, although there is a cost in computation time.
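For illustration, the evaluation step can be sketched in a few lines of numpy; this is not the OpenFace implementation, and the function names, the brute-force sliding window, and the assumption that the search region lies fully inside a grayscale image are ours (in practice the responses would be computed via Fourier-domain convolution as noted above).

```python
import numpy as np

def lnf_response_map(img, center, neurons, biases, search=11, win=11):
    """Evaluate eq. (12) over a (search x search) region around `center`.

    img:     2D grayscale image (float).
    neurons: list of flattened neuron weight vectors w_k.
    biases:  list of scalar bias terms b_k.
    Returns the summed neuron responses Gamma(x, I) per candidate pixel.
    """
    half_w, half_s = win // 2, search // 2
    cy, cx = center
    resp = np.zeros((search, search))
    for dy in range(-half_s, half_s + 1):
        for dx in range(-half_s, half_s + 1):
            y, x = cy + dy, cx + dx
            patch = img[y - half_w:y + half_w + 1, x - half_w:x + half_w + 1]
            v = patch.astype(np.float64).ravel()
            v = (v - v.mean()) / (v.std() + 1e-8)   # zero mean, unit variance
            # sum of per-neuron responses gamma_k = w_k^T v + b_k (eq. 12)
            resp[dy + half_s, dx + half_s] = sum(w @ v + b
                                                 for w, b in zip(neurons, biases))
    return resp
```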
To obtain the response in probabilistic terms to match our model, a logistic regressor is trained on each of the filter responses on the dataset. The regressor is calculated by taking the true positive match from the naive response of the filter Γ(x, I), as well as many randomly chosen negative samples from misaligned regions of the response maps. The regressor in turn estimates a gain α and bias β term to manipulate the response map results. At runtime the standard logistic regression curve then provides the likelihood l of a correct match at a landmark x

p(li = 1|x, I) = 1 / (1 + e^−(αiΓi(x,I) + βi))    (13)

This has the effect of turning each likelihood into a probability mass function through a sigmoid curve, which peaks at 1 for a positive match and 0 for a poor match. Further details on the training and implementation details of LNFs can be found in Baltrusaitis et al. (2013); Baltrušaitis et al. (2016).

2.4 Optimizing the response

The response maps are determined from small image regions that can have a large degree of variation. As such, there is often noise and ambiguity, with multiple peaks giving high response. While many previous CLMs optimized the response maps with simple parametric forms such as a Gaussian response (Paquet, 2009) or via Convex Quadratic Fitting (Wang et al., 2008), Saragih et al. (2011) described how these were error-prone due to the multi-modal nature of the response map. Instead, a non-parametric form was shown to give more accurate results in Regularized Landmark Mean-Shift (RLMS). The mean-shift algorithm relies on a Kernel Density Estimator (KDE) to gradually smooth the response map and 'push' the mean-shift vector toward the greatest concentration of high response points.

Allowing each data point in the response map Γ to exhibit some Gaussian noise (with variance ρ), the data term becomes

Di = p(li = 1|xi, I) = Σ_{yi∈Γi} p(li = 1|yi, I) N(xi, yi, ρI)    (14)

Normalising these posterior likelihoods gives us weighted values wyi that sum to 1, where

wyi = p(li = 1|yi, I) N(xi, yi, ρI) / Σ_{zi∈Γi} p(li = 1|zi, I) N(xi, zi, ρI)    (15)

The mean-shift vector vi is then the normalized weighted pull wyi from each location in the response map yi on the current estimate xi

vi = ( Σ_{yi∈Γi} wyi yi ) − xi    (16)

Saragih et al. (2011) showed that the mean-shift algorithm is equivalent to employing an Expectation Maximization (EM) strategy, which is a bound optimization that converges to an optimal solution through iteration.
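The following is an illustrative numpy sketch of one mean-shift step over a single landmark's probability map, following equations (14)-(16); the function and argument names are assumptions rather than code from any cited implementation, and the Gaussian normalization constants are omitted since they cancel in the ratio of equation (15).

```python
import numpy as np

def mean_shift_update(prob_map, x_est, rho=1.0):
    """One mean-shift step (eqs. 14-16) on a landmark's probability map.

    prob_map: 2D array of p(l=1 | y, I) values from the logistic regressor.
    x_est:    current (x, y) estimate in the map's local coordinates.
    rho:      variance of the Gaussian KDE kernel.
    Returns the mean-shift vector v_i pointing toward the response mass.
    """
    h, w = prob_map.shape
    ys, xs = np.mgrid[0:h, 0:w]
    cand = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float64)
    # Gaussian kernel around the current estimate (constants cancel below)
    d2 = np.sum((cand - np.asarray(x_est)) ** 2, axis=1)
    kern = np.exp(-0.5 * d2 / rho)
    wts = prob_map.ravel() * kern
    wts /= wts.sum() + 1e-12                 # normalized weights w_y (eq. 15)
    return wts @ cand - np.asarray(x_est)    # v_i (eq. 16)
```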
2.5 Solving for the new parameters

The optimal movement projections for all the points within the image frame can be stored within a matrix v, where

v = y − P(q)    (17)

and therefore from equation (10) we are looking for the change in parameters ∆q̂ that minimizes the following

∆q̂ = argmin_{∆q} { r‖q + ∆q‖²_{Λ̃⁻¹} + ‖v − J∆q‖² }    (18)

Since our problem is a non-linear least squares problem involving the sum of square residuals, the Gauss-Newton algorithm is used, giving the parameter update as

∆q = (rΛ̃⁻¹ + JᵀJ)⁻¹ (Jᵀv − rΛ̃⁻¹q)    (19)
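A hedged numpy sketch of the update in equation (19) is given below; the parameter layout (6 pose values followed by ν deformation weights, with zeros in the prior over the pose block as in equation (5)) follows the text, while the names and the default regularization weight are illustrative assumptions.

```python
import numpy as np

def parameter_update(J, v, q, eigvals, r=25.0):
    """Regularized Gauss-Newton step of eq. (19).

    J:       (2n x m) Jacobian of projected points w.r.t. q = [p | d].
    v:       (2n,) stacked mean-shift vectors, one (x, y) pair per landmark.
    q:       current parameters: 6 pose values then nu deformation weights.
    eigvals: PCA eigenvalues lambda_1..lambda_nu of the shape model.
    r:       regularization weight biasing toward the shape prior.
    """
    # non-informative prior over the 6 pose parameters (eq. 5)
    lam_inv = np.concatenate([np.zeros(6), 1.0 / np.asarray(eigvals)])
    Lam_inv = np.diag(lam_inv)
    lhs = r * Lam_inv + J.T @ J
    rhs = J.T @ v - r * Lam_inv @ q
    return np.linalg.solve(lhs, rhs)   # delta_q
```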

2.6 Obtaining the 3D head pose

One simple way to extract an estimated 3D pose from the 2D PDMs is through the Pose from Orthography and Scaling with ITerations (POSIT) algorithm (Dementhon and Davis, 1995), which estimates pose via a given 3D rigid model and 2D point correspondences. Since it is based on an orthographic model with scaling (a weak-perspective view frustum), the full camera perspective is not taken into account. Xiao et al. (2004) attempt to jointly constrain the shape in 2D and 3D via a 'combined 2D + 3D AAM'. The 2D shape generated from the model is further constrained to satisfy the limit of a 3D PDM. The solution described works with a weak-perspective model on the principle that any 3D shape can be represented in 2D by adding 6 additional parameters and a balancing weight. Baltrušaitis et al. (2012) describe the CLM-Z, which matches a 3D PDM to 3D depth data coming from an RGB-D camera, while Martins et al. (2012) simplify the optimization process by proposing a 2.5D AAM that works with a 3D PDM and 2D texture model under a full perspective camera. The 3D PDM decreases the number of eigenvectors needed to statistically represent the face (since a 3D shape remains relatively constant, unlike a 2D representation of a 3D shape). Additionally, the full perspective model better represents the likely shapes seen from cameras positioned reasonably close to the head. However, since AAMs are generative methods they typically require texture information from the specific user to perform with high accuracy, which means they won't generalize as well to the average user or even to different lighting conditions. The user would likely be required to annotate parts of their face and build a new model.

2.7 Measuring accuracy of head pose estimates

Due to the complexities of acquiring the ground truth data in real-world scenarios, measuring the accuracy of many applications within the computer vision domain is often very tricky. This is certainly true for the area of 3D head-pose estimation. Since the head shape is not static it can be very difficult to define what its exact head pose is. Some works such as the Boston University (BU) (La Cascia et al., 2000) and Public University of Navarre (UPNA) (Ariz et al., 2016) head pose datasets acquire their estimate through the user wearing a tracking device (such as a Flock Of Birds real-time tracker). This often takes the form of a head band firmly affixed over the head so as to not obscure the face. Of course there is no default position for it to be placed in, and so its output will be some (hopefully fixed) offset in translation and rotation relative to the real head pose. Additionally, it is difficult to ensure that its absolute translation values are measured to be correct and reliable. This may be because the camera itself has not been calibrated to determine its focal parameters. As such, many researchers have taken to measuring the accuracy of a head pose methodology through the rotation parameters alone.

Other researchers have taken to using combined color and depth (RGB-D) cameras such as the Microsoft Kinect or Intel Realsense to try to obtain the measurements without attaching devices to the participant. The information captured from these cameras tends to be noisy, and since the incoming data is in raw pixel format, acquiring the output pose data has to be performed through another estimation process such as Random Forests (Fanelli et al., 2011; Tang et al., 2011) or Particle Swarm Optimization (Padeleris et al., 2012). Thus the veracity of the 'ground truth' can be questioned. The measures produced are typically the Euler rotation angles of pitch, yaw and roll, where the mean error for each, along with the standard deviation, can be acquired over all frames (Ariz et al., 2016). It is important to remember that these angle triplets are highly dependent on the order they are performed in, and since there is no consistent way of performing these operations in the literature, the Mean Angular Error (MAE) is reported. This value obtains the mean of all three rotation angles and allows models to be compared fairly. Frames of the image sequence where tracking is completely lost are usually removed as outliers. These frames can be a result of obfuscation or of the head being partially removed from the image. An example of this issue is shown in Figure 2. State-of-the-art methods (Ariz et al., 2016) can achieve errors within 3 ± 3° using a generic shape model from 720p RGB video at 30 frames-per-second (fps), which is a realistic minimum target to allow the acquired information to be further utilized in an application.
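As a concrete illustration, the MAE over a sequence could be computed as below; this is a hedged sketch of the commonly reported measure, not code from any of the cited works, and the rule for rejecting lost-tracking frames beforehand is an assumption.

```python
import numpy as np

def mean_angular_error(pred, truth):
    """MAE: mean absolute error over pitch, yaw and roll across a sequence.

    pred, truth: (frames, 3) arrays of Euler angles in degrees.
    Frames where tracking was completely lost should be removed beforehand.
    """
    err = np.abs(np.asarray(pred) - np.asarray(truth))
    return err.mean()   # averaged over all frames and all three angles
```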
3 2.5D Constrained Local Model

To track the head efficiently with a stream of video data, a novel approach utilizing 3D shape models with 2D texture information is needed. The model is named the 2.5D Constrained Local Model (2.5D CLM) and emphasizes the unique combination of 3D and 2D information. This work extends previously published work that investigated the potential for this kind of tracker in determining the gaze from a mobile device (Ackland et al., 2014). The model takes inspiration from the 2.5D Active Appearance Model (Martins et al., 2012) but deviates in a number of key areas, such as in the use of a discriminative model that uses small texture patches around important areas of the face rather than a generative texture model. As such, the model contains a 3D PDM which holds knowledge about the shape of the face, and 2D texture patches around small sub-regions of the face which replace the holistic texture. The texture patches are built so as to work with as many people as possible by utilizing robust non-linear filters.

The approach builds the shape model directly in 3D with a full-perspective camera model, which firstly constrains the number of possible face combinations in the 2D image (creating a lower dimensional problem space) and secondly, by definition, provides the 3D head pose. The full-perspective camera allows us to 'project' the 3D PDM into the camera image space, allowing for real-world distance estimation. The model can also correctly accommodate radial and tangential distortion that is often prevalent on low-cost cameras such as webcams.

Since the optimization method is a local search, it relies heavily on good estimates from previous frames to correctly find the face pose in subsequent frames. One of the benefits of taking a relatively small number of pixels for evaluation around each point, rather than taking a holistic view of the face region, is to allow the 2.5D CLM to run very quickly on devices without graphics cards and with a low memory footprint. However, local search approaches can accumulate errors easily when points fail to track effectively, for example, when they are obscured, or the image is noisy. This issue is addressed in this work with Local Neural Fields (Baltrusaitis et al., 2013) trained at multiple angles, which provide strong and sparse peaks at the correct locations.

Fig. 2 It is frequently the case that when a user holds a tablet in a comfortable position, their head is only partially visible within the image of the on-board camera

Most model-based approaches to pose estimation problems require the computation of the Jacobian matrix, which describes how changes in the model parameters manipulate the points within the image (Pons-Moll and Rosenhahn, 2011). The formulation of the Jacobian matrix depends on the parameters of the model, but other works often use weak-perspective models with scale terms (Ariz et al., 2016; Baltrusaitis et al., 2013). These methods then require an additional step and further optimization to determine the full 3D pose. Other methods simplify the Jacobian to evaluate only relative changes that are summed over time, accumulating errors from noise (Martins et al., 2010). The formulation described in this work avoids all these issues and allows for the direct estimation of pose and deformation together. This direct estimation of the pose and deformation parameters forgoes the need for multiple optimization steps from face alignment to pose, further reducing the method's computational cost.

Fig. 3 2.5D Constrained Local Model flowchart. (Recoverable stages: precompute – calibrate camera, construct face model (shape & texture); fitting loop – capture next image frame; on the first frame or lost tracking, initialise the face estimate (Haar) and global face tracking; otherwise estimate the global face shift and check for tracking loss; then estimate local point optimal image locations, calculate the image Jacobian, and optimise pose and deformation parameters until converged, tracking lost, or max iterations.)

3.1 The pose parameters
The 'pose' of an object defines some measure of both translation and rotation. In order to measure the pose in absolute terms, it is imperative that any trained models are built using 'real-world' measurements. This implies that by having a correctly measured 3D head model, we can accurately track its translation and rotation from any camera (provided we have calibrated it and obtained the camera's intrinsic values). Furthermore, this means that unlike other tracking algorithms which are effectively only tracking relative movements, no scalar or balancing weight parameters are required.

The 'world' coordinates are defined from the point-of-view of the camera, that is to say, the camera acts as the origin with the camera always pointing down the positive z-axis. From the camera's perspective, the positive y-axis points downwards and the x-axis points to the right, creating a right-handed coordinate system as shown in Figure (4). Of course, under this definition the camera pose remains fixed at all times and any 'mobile' interactions involve all other objects in the world moving relative to it. The presented image frame is dependent on the focal length f in pixels between the camera pinhole point and the image plane. If the pixels are non-square, perhaps due to a small camera defect, the focal length can be defined in terms of its horizontal and vertical components fx and fy, which are approximately equal. The camera has a 'principal axis' that passes through the camera pinhole and is perpendicular to the image plane at the 'principal point' (cx, cy), often at or near the centre of the image.

Fig. 4 The perspective camera model

The pose P of the head can be described at any time via a 3x4 matrix, with 9 parameters forming a traditional 3D rotation matrix R and 3 parameters representing the translation values t = [tx, ty, tz]ᵀ along the x, y and z axes, measured in millimetres (mm).

P = [R00 R01 R02 tx; R10 R11 R12 ty; R20 R21 R22 tz] = [R | t]    (20)

To simplify our problem, we recognize that there are in fact only 6 degrees of freedom (three dimensions each in rotation and translation), so we formulate our problem in these terms using the Rodrigues rotation formula. This formula defines the rotation as an 'Axis-Angle' relationship where the rotation is defined as an angle θ around an arbitrary unit axis ω = [ω1, ω2, ω3] in the following way:

R = I + sinθ Ω + (1 − cosθ) Ω²    (21)

where Ω is the cross-product matrix for the vector ω:

Ω = [0 −ω3 ω2; ω3 0 −ω1; −ω2 ω1 0]    (22)

The three rotation parameters can be efficiently stored in a vector r = θω where

ω = r/θ,  θ = √(rx² + ry² + rz²)    (23)

3.2 The Perspective Camera Model

Different webcams have a wide spread of varying intrinsic camera parameters (such as focal length) under which our head tracking model would produce vastly differing and inaccurate estimates. To capture as much of this variability as possible, a full-perspective camera model is used to project the information from 3D to 2D.

A 2D point in the image (x̂, ŷ) can actually represent an infinite number of potential 3D points. This relationship can be captured through the use of homogeneous coordinates, where 3 values [xH : yH : wH] are used to represent the 2D Euclidean coordinate. The function E used to convert homogeneous coordinates to Euclidean coordinates is defined as

E([xH, yH, wH]ᵀ) = [xH/wH, yH/wH]ᵀ = [x̂, ŷ]ᵀ    (24)

The camera intrinsic parameters play a key role in determining the homogeneous coordinates of a particular 3D point X observed in the world from the camera location. These parameters can be represented by a matrix K

K = [fx α cx; 0 fy cy; 0 0 1]    (25)

where fx, fy represent the camera focal length; cx, cy represent the camera principal point; and α the skew parameter. These values can be determined via a standard calibration procedure where a checkerboard pattern with known dimensions is observed multiple times within the image frame.

In addition, real camera lenses are likely to have some minor tangential or radial distortion. An extreme example of this would be a fisheye lens, which causes extreme distortion around the edges of the image. Provided the distortion values are known through some calibration procedure, they can be easily supplemented into the camera projection model without loss of generality. A discussion about how to do this, alongside the expected projection errors, can be seen in (Weng et al., 1992).

The camera intrinsic matrix can be multiplied with any 3D point X = [x, y, z]ᵀ to determine its homogeneous coordinates. By utilizing homogeneous coordinates, the 'warping' process W that transfers this 3D point to 2D image coordinates is then

W(X) = E(KX)    (26)
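As an illustration, equations (21)-(26) can be realized in a few lines of numpy; this sketch omits lens distortion and is an assumption-laden rendering rather than the paper's implementation.

```python
import numpy as np

def rodrigues(r):
    """Axis-angle vector r = theta * omega to a 3x3 rotation matrix (eq. 21)."""
    theta = np.linalg.norm(r)
    if theta < 1e-12:
        return np.eye(3)
    w = r / theta
    Omega = np.array([[0, -w[2], w[1]],
                      [w[2], 0, -w[0]],
                      [-w[1], w[0], 0]])   # cross-product matrix (eq. 22)
    return np.eye(3) + np.sin(theta) * Omega + (1 - np.cos(theta)) * Omega @ Omega

def project(X, K, r, t):
    """Warp W: a world point X (mm) to image pixels under intrinsics K (eq. 26)."""
    Xc = rodrigues(r) @ X + t   # point in camera space
    xh = K @ Xc                 # homogeneous image coordinates
    return xh[:2] / xh[2]       # E(): divide through by w (eq. 24)
```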
3.3 Constructing the shape model

The shape model is based on the well known Point Distribution Model (PDM), where the 3D shape is defined as a vector s

s = [x1, x2, ..., xn, y1, y2, ..., yn, z1, z2, ..., zn]ᵀ    (27)

The shape s is made up of n individual vertex points in space located at suitable locations on the head, such as the nose tip and eye corners. For a successful 3D shape tracker, face points should be chosen that tend to remain static, so as to limit the amount of deformation of the model, and that have surrounding textures that are well defined, i.e. limiting the reliance on landmarks that are defined by shadows or outlines. Many models observed within the literature are often focused on face-alignment rather than pose estimation. As such, they do not account for the fact that while landmarks such as those on the outline of the face (e.g. the cheek) are relatively easy to detect, they do not represent a single 3D point on the face, but instead a whole region of possible points as the user orients their head about in space.

For the creation of a 3D parametric model it is necessary to capture the training data as actual 3D data measured in millimetres (mm), to give us the benefits of obtaining real-world pose estimation values. This is directly opposed to more traditional 2D approaches where tagged images are required to be scaled, translated and rotated via a generalized Procrustes alignment, a practice that is frequently carried needlessly to the 3D case (Baltrušaitis et al., 2012; Martins et al., 2012; Saragih et al., 2011). The origin of the model then becomes the mean of all points, as shown in Figure (5(a)), and implies that a deformation of the model can change the pose, which is unsatisfactory. Instead, one of the novel contributions of this work is that the shape model is designed with pose evaluation in mind by first fixing a local origin, located midway between the inner eye corners. Additionally, it is rotated such that the inner eye corners lie on the x-axis and the underside of the nose lies on the y-axis, as shown in Figure (5(b)).

To see how this could improve the robustness of the pose estimation, suppose we had multiple good quality images of a head taken simultaneously from 2 or more cameras. Running the algorithm on each image may output the correct face alignment as tagged by a human. However, the 3D point estimations around the face outline may evaluate quite differently from the various perspectives, as the model allows for deformation. Taking the mean of the points as the pose estimate from each camera would provide different world-space results, whereas an identical pose would be generated from an origin defined between the eye corners (which do not deform, from the PDM construction). This prevents heavy fluctuations in the pose estimate as the participant rotates their head in a video stream. Crucially, it also prevents the more deformable parts of the face, such as the mouth, from changing the pose when the participant simply talks or smiles.

The 3D training data can be acquired directly, for example, by tagging vertices in 3D modelling software on data captured from a range scanner, which measures depth through lasers or other means. Since this is a laborious process, an alternative can be acquired indirectly through methods such as Non-Rigid Structure from Motion (NRSfM) (Torresani et al., 2008), whereby the 3D data is estimated from several tagged 2D images simultaneously captured of the object in question. The MultiPIE dataset (Gross et al., 2010) is a large annotated image collection carried out over numerous sessions, capturing a large number of people from synchronized cameras set at 15° intervals around the y-axis. The dataset has been shown to provide a suitable 3D model of the face (Saragih et al., 2011). In this work, the NRSfM MATLAB implementation by (Torresani et al., 2008) is applied to the 68 annotated points within the images in order to obtain the 3D vertex information. Although there is a small amount of noise from the process, due to the fact that it is difficult to hand-select identical points from multiple images, the algorithm does a good job in capturing the variability of the faces in the dataset, and produces believable face representations.

Fig. 5 Visualization of the 3D training shapes (xy-plane shown) aligned via (a) a Generalized Procrustes method and (b) the new method aligned relative to the inner eye corners. Note how the statistical point distribution around the eyes becomes tightly packed, with the inner eye corners only able to move symmetrically along the x-axis.
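A possible realization of this eye-corner alignment in numpy is sketched below; the landmark indices are hypothetical placeholders, and the construction assumes an approximately symmetric face so that the nose base falls near the y-axis.

```python
import numpy as np

def align_to_eye_corners(pts, i_left, i_right, i_nose):
    """Place a 3D training shape in the local frame used by the PDM.

    pts: (n, 3) landmark array in mm; i_left/i_right index the inner eye
    corners and i_nose the underside of the nose (indices are illustrative).
    The origin goes midway between the inner eye corners, the x-axis runs
    through them, and the y-axis points toward the nose base.
    """
    origin = 0.5 * (pts[i_left] + pts[i_right])
    x_ax = pts[i_right] - pts[i_left]
    x_ax /= np.linalg.norm(x_ax)
    nose = pts[i_nose] - origin
    y_ax = nose - (nose @ x_ax) * x_ax   # nose direction, x-component removed
    y_ax /= np.linalg.norm(y_ax)
    z_ax = np.cross(x_ax, y_ax)
    R = np.stack([x_ax, y_ax, z_ax])     # rows form the new basis
    return (pts - origin) @ R.T          # note: no scaling is applied
```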

The objective is to obtain a linearized parametric model of the head,

s = s0 + dΦ    (28)

where s0 is the mean shape and Φ a set of ν orthogonal linear basis vectors (eigenvectors) describing the directions in which the shape can deform, parameterized by weighted values d = {d1, d2, ..., dν}. To obtain the mean-shape and eigenvectors, a Principal Component Analysis (PCA) method is applied to the training shape vectors once the data has been rotated and translated (not scaled) via the eye corners and nose (Figure 6). Additionally, the process obtains the equivalent eigenvalues Λ, which give the statistical variance along the vectors. These values can act as limits for how far the eigenvectors are allowed to stretch and squash the model. There is a clear benefit here from the change in shape alignment for the PCA process. When keeping 95% of the shape variation from the training set (to remove noise from misplaced tags), using Procrustes alignment the shapes produce a deformable model with 24 eigenvectors. The alternative alignment (described in section 3.3) produces a PDM with only 14 eigenvectors, massively simplifying the calculations required to optimize the shape.

The shape model is completed with a set of 6 pose parameters p = [rx, ry, rz, tx, ty, tz]ᵀ, which describe the 6 degrees of freedom for the position of the head in the world derived in section (3.1). In particular, the rotation parameters allow for the subsequent calculation of the 3 × 3 rotation matrix R (using equation 21). With parameters d and p, the kth 3D point of the shape Xk = [xk, yk, zk]ᵀ can be determined in world space as

Xk(d, p) = R [s0_xk + Σ_{i=1}^{ν} di Φi_xk;  s0_yk + Σ_{i=1}^{ν} di Φi_yk;  s0_zk + Σ_{i=1}^{ν} di Φi_zk] + [tx; ty; tz]    (29)

Using the warping function W (equation 26), all the points of the model can be located within the image using a whole model projection P(d, p)

P(d, p) = [W(X1(d, p)); W(X2(d, p)); ...; W(Xn(d, p))]    (30)
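The following numpy sketch shows how equations (28)-(30) could be evaluated for a given parameter set; the vector layout follows equation (27), and the names and the 3-standard-deviation clipping default are illustrative assumptions.

```python
import numpy as np

def synthesize_shape(s0, Phi, d, eigvals, clip=3.0):
    """Generate a plausible 3D face: s = s0 + d*Phi (eq. 28).

    s0: (3n,) mean shape; Phi: (nu, 3n) eigenvector rows; d: (nu,) weights.
    Deformation weights are clipped to +/- clip standard deviations.
    """
    sd = np.sqrt(np.asarray(eigvals))
    d = np.clip(d, -clip * sd, clip * sd)
    return s0 + d @ Phi

def project_model(s, K, R, t):
    """Whole-model projection P(d, p) of eq. (30): one (x, y) row per vertex."""
    n = s.size // 3
    X = s.reshape(3, n).T          # eq. (27) layout: [x1..xn, y1..yn, z1..zn]
    xh = (X @ R.T + t) @ K.T       # rotate, translate, apply intrinsics (eq. 29)
    return xh[:, :2] / xh[:, 2:3]  # perspective divide (eq. 24)
```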

3.4 Constructing the texture model

When developing a model for tracking, it is imperative that the texture information we are using is truly representative of the underlying shape model. Like many deep learning models, creating a general pattern matching algorithm requires variety within the dataset, reflecting a diverse range of people with differing face characteristics such as eye and nose shape, eyebrow thickness and color, skin tone, and whether or not they wear glasses. To train the patch filters, the Multi-PIE training set (Gross et al., 2010) is again used since it has 337 people captured under 19 different illumination conditions. Another issue to consider is the restriction on how well a 2D texture can represent a 3D object from many different angles. If our model only uses a limited number of well-defined points, even small changes in texture due to rotation can become the source of large errors or loss of tracking completely. The Multi-PIE dataset also has another big advantage in that it simultaneously captures images from various views. To accommodate different head orientations (where the texture patch can fluctuate significantly), 9 different sets of patches are trained, which capture the likely observed head angles from the camera. These are

yaw = {−90°, −75°, −45°, −20°, 0°, 20°, 45°, 75°, 90°}    (31)

This work utilizes Local Neural Fields (LNFs), which use positive image samples along with negative samples taken from areas close to the correct area to find a non-linear mapping between the image data and the ideal response (Baltrusaitis et al., 2013). To obtain the positive filters, small patches were taken from the annotated training images and were affine warped to slightly varying scales, rotations, and translations to create a robust filter. For efficiency purposes the patch filters were kept small, with the underlying image scaled to cover a region of the face large enough for discrimination. Negative samples were taken from other regions of the face, making sure to cover the regions near to the positive samples that are likely to come up if the tracking starts to drift. Additionally, the patches were trained at multiple different scales and patch sizes such that they could be used in a pyramid refinement sense, starting with a wider search region with lower accuracy. Setting the default trained head width to 150 pixels, the trained regions and scales can be seen in Table 1.

Table 1 Trained patch sizes and scales, where a scale of 1.0 gives a head width of 150 pixels.

Size     Scale
11 x 11  1.0
9 x 9    0.8
7 x 7    0.6

Fig. 6 The first 5 principal components of the head model are shown, with each row representing an eigenvector (columns: −3√λ, −1.5√λ, mean shape s0, +1.5√λ, +3√λ). The output is constrained to within 3 standard deviations to ensure a realistic face shape is produced. All shapes have their origin equidistant between the inner eye corners, which lie on the x-axis. Note that since the cheeks are not well-defined, the first principal component has to take into account a large range of possible cheek outlines.

The implementation of LNF used in this work is freely available as part of the OpenFace framework (Baltrušaitis et al., 2016). An example of the typical neuron generation can be seen in Figure (1), where 7 neurons are shown, each having a support region of 11 × 11 pixels (with an overall head width of 150 pixels). At run-time the set of patches closest to the estimated head pose is used. The patch data is collected at some standard set size so that the image patch we are looking for can be identified during run-time. The standard approach (Baltrušaitis et al., 2012; Martins et al., 2012; Saragih et al., 2011) is to adjust the incoming patch via a Procrustes alignment, which aligns the model through scale, rotation and translation of the observed 2D projected points. As the head rotates and is observed from different angles, this has the effect of actually changing the overall size of the head, which can cause the model to 'jump' when switching between different trained patch sets. A novel feature of this work is ensuring that the transitions between patch sets are smooth; rather than depending on a Procrustes alignment of the 2D points, this is achieved through a simple technique.

The goal is to rotate the incoming image to make the face appear as 'upright' as possible. This is done by first projecting the head's local y-axis (the up vector) into the image and rotating the image such that the axis becomes vertical.
The size of the face within the image is another factor that needs to be considered. Let o represent the real width of a 3D object and r represent its width in pixels within an image. The relationship between the two is defined as

r = of/d    (32)

where f is the camera focal length and d is the distance of the object from the camera. This relationship can be exploited to scale an incoming image to a standard reference size r0 by evaluating its situated distance d0 away from the camera. The scaling term s for each frame is then evaluated as a ratio of the constant reference distance and the current distance d of the head from the camera

s = d0/d    (33)

3.5 Calculating the Jacobian

The Jacobian matrix (J) from equation (9) is the derivative of the projected points in the shape model with respect to the q parameters.

J = [∂P(q)/∂p1  ∂P(q)/∂p2  ...  ∂P(q)/∂p6  ∂P(q)/∂d1  ∂P(q)/∂d2  ...  ∂P(q)/∂dν]    (34)

where

∂P(q)/∂pi = [∂W(X1,q)/∂pi; ∂W(X2,q)/∂pi; ...; ∂W(Xn,q)/∂pi],  ∂P(q)/∂di = [∂W(X1,q)/∂di; ∂W(X2,q)/∂di; ...; ∂W(Xn,q)/∂di]    (35)

Intuitively, it describes how a small change in each of the parameters qi is reflected in a change of the image coordinates. Notice that because the projection is non-linear, the derivative has the additional complexity that for each pose in 3D space this 2D Jacobian is only locally applicable. This means the Jacobian matrix has to be calculated for each frame.

Combining equations (26) and (28), we know that the warp projection of the kth 3D point of the shape Xk = [xk, yk, zk]ᵀ to the 2D image plane xk = [x̂k, ŷk]ᵀ with parameters q is defined as

W(Xk, q) = [x̂k; ŷk] = E([wk x̂k; wk ŷk; wk]) = E( K ( R [s0_xk + Σ_{i=1}^{ν} di Φi_xk;  s0_yk + Σ_{i=1}^{ν} di Φi_yk;  s0_zk + Σ_{i=1}^{ν} di Φi_zk] + [tx; ty; tz] ) )    (36)

where the function E converts the 3D homogeneous coordinates to 2D Euclidean coordinates (equation 24). Ignoring the skew parameter α and any radial or tangential distortion, which have a negligible effect on the derivative, the warp is approximately

W(Xk, q) ≈ [fx(xk/zk) + cx;  fy(yk/zk) + cy] = (1/zk) [fx xk + cx zk;  fy yk + cy zk]    (37)

The derivatives for the warp can be calculated through the quotient rule

∂W(Xk,q)/∂qi = [fx ∂(xk/zk)/∂qi;  fy ∂(yk/zk)/∂qi] = (1/zk²) [fx(zk ∂xk/∂qi − xk ∂zk/∂qi);  fy(zk ∂yk/∂qi − yk ∂zk/∂qi)]    (38)

and hence with respect to each parameter in q we simply need to find the derivative of the 3D point and substitute it into equation (38).

The derivative of the point Xk defined in equation (29) with respect to the ith deformable parameter di is simply the relevant eigenvectors rotated to match the model rotation.

∂Xk/∂di = R [Φi_xk; Φi_yk; Φi_zk]    (39)

The derivative of the 3D point with respect to the pose parameters p is more complicated. Since we require a linear representation of a rotation around the current point, a matrix R̃ representing an infinitesimally small rotation is introduced to the 3D model. Since the rotation is so small (i.e. θ̃ → 0), we can simplify the calculations by making the transformation linear using the fact that cos θ̃ ≈ 1 and sin θ̃ ≈ θ̃. The Rodrigues axis-angle rotation derived in equation (21) then becomes

R̃ ≈ I + θ̃ [0 −ω̃z ω̃y; ω̃z 0 −ω̃x; −ω̃y ω̃x 0] = [1 −r̃z r̃y; r̃z 1 −r̃x; −r̃y r̃x 1]    (40)

A 3D point on the model is defined as

Xk(q) = R R̃ [xdk; ydk; zdk] + t = R [xdk − ydk r̃z + zdk r̃y;  xdk r̃z + ydk − zdk r̃x;  −xdk r̃y + ydk r̃x + zdk] + [tx; ty; tz]    (41)

where for clarity the kth point before pose adjustments is (s0k + dΦk) = [xdk, ydk, zdk]ᵀ.

Finally, the translation calculations are straightforward enough, since they are real-numbered variables giving a derivative of 1 along each respective axis. The final derivatives can be summarized in the following way

∂X̃k/∂r̃ = [∂(xk)/∂r̃x  ∂(xk)/∂r̃y  ∂(xk)/∂r̃z;  ∂(yk)/∂r̃x  ∂(yk)/∂r̃y  ∂(yk)/∂r̃z;  ∂(zk)/∂r̃x  ∂(zk)/∂r̃y  ∂(zk)/∂r̃z] = R [0 zdk −ydk; −zdk 0 xdk; ydk −xdk 0]    (42)

∂X̃k/∂t = [∂(xk)/∂tx  ∂(xk)/∂ty  ∂(xk)/∂tz;  ∂(yk)/∂tx  ∂(yk)/∂ty  ∂(yk)/∂tz;  ∂(zk)/∂tx  ∂(zk)/∂ty  ∂(zk)/∂tz] = [1 0 0; 0 1 0; 0 0 1]    (43)
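Putting equations (37)-(43) together, the 2 × (6 + ν) block that a single landmark contributes to J could be assembled as in the following hedged numpy sketch; the names and argument conventions are illustrative, and the full Jacobian is formed by vertically stacking one such block per landmark.

```python
import numpy as np

def jacobian_row_block(Xd, R, t, K, Phi_k):
    """2 x (6 + nu) Jacobian block for one landmark (eqs. 38, 39, 42, 43).

    Xd:    the landmark before pose adjustment, [xd, yd, zd] (from s0 + d*Phi).
    R, t:  current rotation and translation; K: camera intrinsics.
    Phi_k: (3, nu) slice of the eigenvectors belonging to this landmark.
    """
    xd, yd, zd = Xd
    dX_dr = R @ np.array([[0., zd, -yd],
                          [-zd, 0., xd],
                          [yd, -xd, 0.]])     # eq. (42)
    dX_dt = np.eye(3)                          # eq. (43)
    dX_dd = R @ Phi_k                          # eq. (39)
    dX_dq = np.hstack([dX_dr, dX_dt, dX_dd])   # (3, 6 + nu), order [r | t | d]

    x, y, z = R @ Xd + t                       # point in camera space
    fx, fy = K[0, 0], K[1, 1]
    # quotient rule of eq. (38) applied to the approximate warp of eq. (37)
    Jx = (fx / z**2) * (z * dX_dq[0] - x * dX_dq[2])
    Jy = (fy / z**2) * (z * dX_dq[1] - y * dX_dq[2])
    return np.vstack([Jx, Jy])
```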

3.6 The rotation update

It is important to note here that while q ← q + ∆q holds for the majority of parameters, this is not actually correct for the rotation update, because we are using a linear approximation of the rotation which is only valid for small rotations where sinθ ≈ θ. Instead, successive rotations need to be multiplied to obtain the new rotation. To do this, the axis-angle rotation parameters are converted back to a rotation matrix R∆, and the new rotation R′ = RR∆. Unfortunately, this rotation matrix is no longer guaranteed to be orthogonal due to the linear approximation of the rotation eigenvectors. Similar to the work by Baltrušaitis (2014), a Singular Value Decomposition (SVD) is used that factorizes the matrix into 2 orthogonal matrices U and Vᵀ and a diagonal matrix S.

USVᵀ = SVD(R∆)    (44)

The diagonal matrix can be discarded, giving an orthogonal rotation matrix

R̂∆ = U · det(UVᵀ) · Vᵀ    (45)

where det(UVᵀ) = ±1, ensuring against the case of a reflection in the parameters. The final rotation matrix is then

R′ = RR̂∆    (46)
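A minimal numpy sketch of this correction is given below; it uses the diag(1, 1, det(UVᵀ)) form, a common way to realize equation (45) that also maps a reflection back to a proper rotation, so the details are our assumption rather than the paper's exact code.

```python
import numpy as np

def orthogonalize_and_compose(R, R_delta):
    """Eqs. (44)-(46): re-orthogonalize the update and compose rotations.

    R_delta may drift from orthogonality after the linearized update; an
    SVD projects it back onto the set of valid rotations before composing.
    """
    U, _, Vt = np.linalg.svd(R_delta)          # discard the diagonal S (eq. 44)
    d = np.linalg.det(U @ Vt)                  # +/-1; guards against reflection
    R_hat = U @ np.diag([1.0, 1.0, d]) @ Vt    # realization of eq. (45)
    return R @ R_hat                           # eq. (46)
```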

3.7 Global Face Alignment

On the initial frame, or whenever we have lost tracking, the model is (re)initialized with a simple Haar classifier (Viola and Jones, 2001), which is a quick and reliable method of approximating the initial start region for face detection and is widely used in the literature.

Since during each frame only the local regions are searched, there is a risk that during large movements between frames (either through a low number of frames per second or a high velocity of the user) the head may move out of the local search regions, making tracking difficult. In an attempt to combat this, a global temporal movement tracker is built, providing a larger search region to determine approximate movement in x and y from the previous frame.

Fig. 7 The global tracker gradually adjusts to increase its robustness, particularly during rotations of the head.

Correlation filters are highly suited to this kind of problem and can be implemented in linear (Bolme et al., 2010) and, more recently, non-linear (Henriques et al., 2015) forms. To limit computational cost a linear MOSSE filter is used, but it is dynamically built and updated each frame. The region around the whole face is cropped and scaled to the correct size, giving a 128×128 patch model of the face. On the first frame during initialization, 10 samples are taken by affine transforming the patch. Then, as in Bolme et al. (2010), a learning rate is introduced (here denoted λ) that puts additional weight on more recent image data to let the filter template accommodate reasonable changes to the model that occur naturally under different environment conditions and pose changes. The MOSSE equation takes the form of a fraction, where both the numerator A and denominator B are calculated via a sum of all training samples. For the ith frame the constituent parts of the MOSSE filter can be stored separately as

Hi* = Ai / Bi    (47)

This allows the filter to be updated for each subsequent frame (i + 1) as follows

Ai+1 = λ Gi+1 F(Ii+1)* + (1 − λ)Ai    (48)
Bi+1 = λ F(Ii+1) F(Ii+1)* + (1 − λ)Bi    (49)

where F denotes a transfer to the Fourier domain, G is the desired output image with peak response as a Gaussian with 2 pixels of standard deviation, and * denotes the complex conjugate. A learning rate of λ = 0.0125 allows the filter to adapt quickly while remaining robust. The large size of the patch allows for a lot of detail within the filter, and since it is dynamic, the peak response on correct tracking tends to be strong and dense. The MOSSE filter detector output Γ for each new incoming image patch Ii+1 is given by:

Γ = F⁻¹(F(Ii+1) Hi*)    (50)
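As an illustration, the running update of equations (47)-(50) could be written as the following numpy sketch; the preprocessing (log transform, windowing) used in Bolme et al. (2010) is omitted, and the small epsilon term is an assumption to avoid division by zero.

```python
import numpy as np

def mosse_update(A, B, frame, target, lam=0.0125):
    """Running MOSSE numerator/denominator update (eqs. 48-49).

    frame:  preprocessed 128x128 face patch for frame i+1.
    target: desired Gaussian response G centred on the face.
    """
    F = np.fft.fft2(frame)
    G = np.fft.fft2(target)
    A = lam * G * np.conj(F) + (1 - lam) * A
    B = lam * F * np.conj(F) + (1 - lam) * B
    return A, B

def mosse_response(A, B, frame, eps=1e-5):
    """Correlation response of eq. (50), with H* = A / B as in eq. (47)."""
    H_conj = A / (B + eps)   # eps guards against division by zero
    return np.real(np.fft.ifft2(np.fft.fft2(frame) * H_conj))
```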

Incorrect tracking produces a weak and noisy response map. This allows for a simple measure to be used to detect tracking loss, called the Peak-Sidelobe Ratio (PSR) (Bolme et al., 2010). The pixel location of the maximum response value ΓMAX is found, and a region designated the sidelobe is defined, which includes all pixel values in the response map apart from a small area around the peak. The ratio is then defined as

PSR = (ΓMAX − μSL) / σSL    (51)

where μSL and σSL are the statistically calculated mean value and standard deviation of the sidelobe respectively. An example of the response map under successful and failed tracking can be seen in Figure (8).

Fig. 8 Input images and their responses. The top row shows a strong PSR rate with strong response values isolated around the peak. The bottom row shows a failed tracking example – the PSR drops below 15 due to a noisy response map that does not have a strong isolated peak.

Fig. 9 The green box represents a 32 × 32 window around the maximum point. The remaining area represents the sidelobe, in which a region of Gaussian noise has been added to represent a second peak. The larger the amount of noise, the lower the PSR value becomes and the less sure we can be that we have found the correct target.

The PSR can vary widely depending on the complexities of the incoming image and the learning rate λ, so the value chosen is specific to the implementation and can be adjusted depending on how important it is to avoid false positive scenarios. Figure (9) demonstrates the PSR under differing amounts of noise away from the true peak at the centre of the response map, with a strong second peak or large amount of noise decreasing the PSR significantly. In this work it was empirically observed that a value below 15 suggested that the tracking was struggling, when a 32 × 32 window was removed from the response map around the maximum point. To allow for short term obfuscation of the face, the tracking is only deemed lost with consecutive failures over 10 frames. This triggers a new search from the Haar classifier and a reset of the model parameters.

Using a mean-shift approximation to obtain the change in image coordinates (∆x̂, ∆ŷ), and assuming the simultaneous shift of all points in the same direction does not include a rotation of the head model, the global shift in the face points can let us adjust the translation pose parameters as

∆tx = tz∆x̂/fx,  ∆ty = tz∆ŷ/fy    (52)
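The tracking-loss test of equation (51) and the translation nudge of equation (52) are equally compact; the sketch below masks a 32 × 32 window around the peak, mirroring Figure 9, with illustrative names throughout (the threshold of 15 and the 10-frame rule from the text would be applied by the caller).

```python
import numpy as np

def peak_sidelobe_ratio(resp, exclude=32):
    """PSR of eq. (51): peak height against the sidelobe statistics."""
    peak = resp.max()
    py, px = np.unravel_index(resp.argmax(), resp.shape)
    mask = np.ones(resp.shape, dtype=bool)
    h = exclude // 2                     # 32x32 exclusion window, as in Fig. 9
    mask[max(0, py - h):py + h, max(0, px - h):px + h] = False
    side = resp[mask]
    return (peak - side.mean()) / side.std()

def global_translation_shift(dx, dy, tz, fx, fy):
    """Eq. (52): convert a whole-face pixel shift into a translation update."""
    return tz * dx / fx, tz * dy / fy
```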

The final fitting process for the 2.5D Constrained Local Model is described in Algorithm (1).

Algorithm 1 2.5D CLM Fitting

 1: Precompute
 2:   Calibrate camera intrinsic matrix K
 3:   Construct 3D mean shape s_0 and eigenvectors Φ
 4:   Train 2D texture filters H_k around the n PDM points
 5: End
 6: procedure Fit(deformation d, pose p, image I)
 7:   Use p to crop, rotate and scale I, isolating the head I_face
 8:   if first frame or tracking lost then
 9:     Detect face using Haar classifier
10:     Build global correlation filter
11:   else
12:     Generate global response map
13:     Check for tracking failure
14:     Estimate global shift between frames
15:     Update global filter
16:   repeat
17:     for k ← 1, ..., n points in head model do
18:       Generate Γ_k via local regions of I_face and H_k
19:       Obtain mean-shift estimate v_k from Γ_k
20:       Project the 3D point X_k into the image, W(X_k, q)
21:       Calculate deformable derivatives ∂X_k/∂d
22:       Evaluate pose derivatives ∂X̃_k/∂r̃ and ∂X̃_k/∂t
23:       Evaluate derivatives of the projection ∂W(X_k, q)/∂q
24:       Combine derivatives into the final Jacobian matrix J
25:     Compute the parameter updates Δq = [Δd; Δp]
26:     Update deformation parameters d ← d + Δd
27:     Update pose p ← p + Δp, with rotation adjustment
28:     Determine the new 3D points X
29:   until (‖Δq‖ < ε) or (iter = maxIter)
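For completeness, the face detection in step 9 can be realized with OpenCV's stock Viola-Jones cascade. A minimal sketch, assuming the standard frontal-face cascade file shipped with OpenCV is available on disk; the function name is illustrative:

#include <opencv2/opencv.hpp>
#include <vector>

// (Re)initialization: locate candidate face regions with a Haar cascade
// (Viola and Jones, 2001). Returns bounding boxes in image coordinates.
std::vector<cv::Rect> detectFaces(const cv::Mat& frame) {
    static cv::CascadeClassifier cascade("haarcascade_frontalface_alt.xml");
    cv::Mat gray;
    cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
    cv::equalizeHist(gray, gray);  // normalize lighting before detection
    std::vector<cv::Rect> faces;
    cascade.detectMultiScale(gray, faces);
    return faces;
}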

4 Experiments

4.1 Head Pose Test Dataset

For the comparison of head pose algorithms, a dataset was recently established by Ariz et al. (2016) from the Public University of Navarre (UPNA). There are 10 individuals in the dataset, each with 12 videos performing various head movements over a 10 second period. The users are directed so as to isolate movements in different dimensions and around different axes. To collect the data, each individual had a Flock-Of-Birds 3D Guidance TrakSTAR firmly fitted to the top of their head. The device and user were carefully calibrated to ensure correct rotation and translation from the camera. To accomplish this, the 3D points of the head model were recorded relative to the tracking device before each recording by attaching a second TrakSTAR sensor to the end of a plastic marker and holding it at each location on the user's face for one second. Previously, a common dataset for pose estimation was the Boston University dataset (La Cascia et al., 2000); however, it is nearly two decades old and features very low camera resolutions by modern standards (320 × 240). In contrast, the UPNA dataset has a high-definition resolution of 1280 × 720.

As the UPNA dataset is publicly available, a number of models in the literature have been tested on it. For head tracking, state-of-the-art methods such as the Active Shape Model (ASM) (Cootes et al., 2001) and Active Appearance Model (AAM) (Cootes et al., 2001) have been available for a number of years. A vast number of varieties have been documented over the years, but for generality and fair comparison it is important that the training and testing data do not overlap. While ASM and AAM are very common, they are typically only used for 2D feature tracking. Head pose is often an afterthought, and frequently methods such as the POSIT algorithm (Dementhon and Davis, 1995) or similar are subsequently applied to the result. They can therefore be considered a two-stage process, with the 2D features first being optimized, followed by the fitting of a static 3D model to the points through an image projection method. The choice of 3D model is as important as the underlying method itself, and fortunately the authors of the UPNA dataset have tested an ASM and AAM with a number of head models, including a cylinder and a generic 'Basel head model' (Ariz et al., 2016). Each has shown impressive results, with the data for roll, pitch and yaw angles individually published alongside the Mean Angular Error (MAE) in degrees.

4.2 Head pose estimation results

Since the UPNA head pose dataset was measured with a 'flock-of-birds' tracker and care was taken when calibrating the device, robust ground-truth values for both rotation and translation are available. A separate 'model' file was included for each participant, in which 3D coordinates on the participant (including, crucially, the inner eye corners) are known. The local coordinate system of the supplied ground truth is actually located on the top of the participant's head and needs to be transformed through a constant pose adjustment to the inner eye corners so that it can be compared correctly. By following the same alignment process that was performed on the creation of the PDM (section 3.3), the supplied tagged point model can be used to determine an estimate of the pose offset, which can subsequently be applied on all frames, although there is no guarantee of its accuracy. It is important to stress that this does not in any way affect the impartiality of the proposed tracking model, as it remains independent at all times with no additional training taking place.

The 2.5D CLM head pose estimation data was acquired over all video frames in the dataset.

Table 2 Mean rotation errors on the UPNA dataset. Mean ± Standard Deviation displayed where known. All POSIT results taken from (Ariz et al., 2016) and Cascade Regression Tree result from (Tulyakov et al., 2018)

                                                              Head Rotation Error (°)
Tracker                          Head Model                   Roll          Yaw           Pitch         Mean
ASM & POSIT                      2D PDM & Basel Face Model    1.12          2.97          4.04          2.71 ± 2.82
ASM & POSIT                      2D PDM & CHM                 1.14          3.56          5.52          3.40 ± 3.02
AAM & POSIT                      2D PDM & Basel Face Model    1.74          2.30          6.01          3.35 ± 4.29
AAM & POSIT                      2D PDM & CHM                 1.74          3.68          8.83          4.75 ± 4.84
Cascade Regression Tree          3D PDM                       -             4.33          3.41          -
2.5D CLM with MOSSE (this work)  3D PDM                       1.32 ± 3.63   3.81 ± 9.78   3.14 ± 3.77   2.76 ± 6.49
2.5D CLM with LNF (this work)    3D PDM                       0.81 ± 0.59   1.88 ± 1.18   2.47 ± 2.09   1.72 ± 1.58

Fig. 10 Examples of model fitting on the UPNA dataset. While the fitting is generally good, facial hair can be the source of some inaccuracies, as seen in the bottom-right image.

No assumptions were made about starting locations, with the initial frame head pose first being estimated by a Haar classifier as outlined in section 3.7. All of the videos start with a frontal face and therefore all faces were detected successfully. Table (2) compares the new model with the state-of-the-art models tested in Ariz et al. (2016). The authors first estimated the 2D face shape through both an ASM and an AAM, which are both commonly seen in the literature. Then the POSIT method (Dementhon and Davis, 1995) was applied, which attempts to best fit the 2D points with a known 3D model. Two face models from the authors are also compared: a generic 3D head model (the Basel Face Model (Paysan et al., 2009)) and a cylindrical head model (CHM). The state-of-the-art results observed in recent literature using a Cascade of Regression Trees (Tulyakov et al., 2018) are also shown for comparison. Like the 2.5D CLM, their method uses a 3D model of the face and evaluates the pose parameters directly. To evaluate the effectiveness of both the 2.5D CLM and the LNF filters directly, another form of discriminator was utilized for each point individually, based on the MOSSE correlation filters described in section 3.7. The MOSSE filters were the same size and were trained on the same data as the LNF filters for suitable comparison.

The 2.5D CLM with LNF texture filters outperforms all other models tested on the mean accuracy of rotation. In addition, it performs better than all the other models on the individual rotation angles. The standard deviation on all three rotation axes is low, showing its robust nature. It does not require a fixed 3D model like the POSIT algorithm, instead deforming the original PDM directly to estimate the pose parameters. It is thus a one-stage process and can therefore be implemented more efficiently than the other methods.

As there is no prior on the pose estimation, outliers can have a significant negative effect on the mean. The median errors in degrees are 0.75, 1.73, 2.04 and 1.20 for roll, yaw, pitch and all angles respectively. This shows that for the 2.5D CLM the errors are generally consistently low, with larger errors accruing on a small number of occasions due to minor tracking failure.
Larger errors can be seen for the MOSSE filters, which benefited from the direct evaluation of the 3D parameters from the 2.5D CLM but were not as robust as their LNF counterparts. This kind of tracking failure is very difficult to detect, since no 'global' failure has occurred and the facial features are still being tracked. However, since the face is believed to be at a different rotation than it actually is, the 'wrong' set of patches from the multiple view choices is being used. It is likely that the non-linear nature of the LNF filters gives them a larger tolerance to the variability of the textures from different view angles.

As expected, the pitch angle is the most difficult to determine, since most of the frontal face points appear approximately planar to one another, with only the position of the tip of the nose in relation to the other points around it providing any strong evidence of the pitch angle. One other notable instance of an inaccurate pitch being obtained was when estimating the values of users with beards, as seen in the bottom-right image of Figure (10). Further evidence of the difficulty in acquiring the pitch can be seen in Figure (11), where the tracking results for videos isolating the roll, yaw and pitch are shown over time for four participants. The graphs do show, however, that while tracking large rotations can produce inaccurate results, the 2.5D CLM is able to recover quickly once the rotation values return to less extreme orientations.

Since it is not standard to report translation errors within the literature, there is no comparison metric available. However, it is interesting to see that while the mean errors in both the x and y translations are very small (1.5mm and 1.26mm respectively), the z-translation has significantly larger errors of 20.49mm. This is perhaps to be expected, as monocular systems, even in humans who have lost the sight in one eye, struggle with depth perception. The problem is made significantly more complex by the fact that the model is allowed to deform, and thus a small change in depth can be misattributed to a facial deformation.

4.3 Efficiency on a mobile device

It is of interest to know how efficiently the algorithm performs on a tablet device using the internal battery only. The code-base uses the OpenCV library alongside a collection of self-developed libraries in C++. No significant optimization steps were taken during development and there are likely to be many opportunities for parallelization that were not taken, for example when obtaining the local texture responses for each point. The frames-per-second (fps) computation time was calculated using high performance timers on a Microsoft Surface Pro 3 (released in 2014) with an Intel i5-4300U CPU and 8GB RAM. The timer ran over the length of all videos of the UPNA dataset, giving an average of 29.1 fps including the first-frame face detection. The disparity between frames comes mostly from the loading of the larger images and the cropping and rescaling of the various target features; that is to say, when the head is larger within the image frame, computation time increases. The tablet reported during the test that it was utilising under 40% of the available system resources. As tablets are becoming more powerful each year, this shows that the proposed model is perfectly suited to a mobile device and sufficiently meets the minimum requirements for real-time use.
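As a sketch of the measurement itself, the per-frame timing can be gathered with the standard C++ clock; frameFn is a hypothetical stand-in for one full detect-and-fit iteration and is not part of the paper's code-base:

#include <chrono>

// Average frames-per-second of a per-frame routine over nFrames calls.
template <typename FrameFn>
double averageFps(FrameFn&& frameFn, int nFrames) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < nFrames; ++i)
        frameFn();  // one full detect/fit step on the next frame
    const std::chrono::duration<double> elapsed = clock::now() - start;
    return nFrames / elapsed.count();
}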

[Figure 11 here: a 4 × 3 grid of line plots, one row per participant (Users 7-10) and one column per rotation axis (Roll, video 4; Yaw, video 5; Pitch, video 6), each plotting angle in degrees (-50 to 50) against frame number (0 to 300).]

Fig. 11 Sample errors (in degrees) of roll, yaw and pitch over time on the UPNA dataset. The blue and orange lines show the ground truth and estimated values respectively, with the yellow area representing the error on each of the 300 frames of video. The participants were directed to perform isolated movements of roll, yaw and pitch on separate videos, and these are compared here for four users. The largest errors occur for all 3 rotation axes when furthest from the front-facing position (0°), although it is clear that the tracking is able to recover well after these large rotations. The pitch is particularly difficult to determine when the participant has facial hair obscuring the outline of the face, as with User 8.

4.4 3D face alignment in the wild

While the 2.5D CLM was not specifically designed for face alignment within images, it is interesting to see how it compares with other state-of-the-art methods on challenging in-the-wild data. Face alignment differs from head pose estimation in that it measures how close each estimated point on the face is to a tagged image landmark when projected into the image. While there are many datasets for measuring these errors, the markups have been annotated strictly in 2D. In particular, side profiles often have many points untagged, since they are not visible within the 2D image. It has been shown that these 2D markups often do not actually satisfy the additional constraints of a projected 3D face shape; instead, they are stretched and squashed to fit the incoming image and hence are often not viewpoint-consistent (Tulyakov et al., 2018).

The viewpoint-consistency approach tries to ensure that the tagged points would be correct from other camera angles and would therefore provide an anatomically correct face. The difficulty of evaluating a 3D model comes from the lack of 3D-tagged image data. One dataset that attempts to solve this problem is AFLW2000-3D (Zhu et al., 2016). The dataset is an extension of the AFLW (Annotated Facial Landmarks in the Wild) dataset, in which the ground-truth 3D landmarks representing 68 points on the face have been reevaluated for 2000 AFLW samples, specifically for 3D face alignment evaluation. The dataset is very challenging, with a wide variety of extreme poses, often along with any number of occluding items and textured environments under a variety of lighting conditions.

Since the images do not come with any information regarding the camera intrinsic parameters, the face alignment error is typically determined from the 2D Euclidean distance between the estimated and ground-truth 3D points when projected to the image. This Normalized Mean Error (NME) is defined as

\mathrm{NME} = \frac{1}{n} \sum_{k=1}^{n} \frac{\| x_k - y_k \|_2}{d_k}    (53)

Table 3 Normalized Mean Errors (%) for 3D face alignment on AFLW2000-3D. For 3D methods it is important to evaluate all 68 points of the model, even when they are not visible in the picture. Evaluations for RCPR, ESR, SDM and 3DDFA are from (Zhu et al., 2016). The Hierarchical binary CNNs result is taken from (Bulat and Tzimiropoulos, 2018) and 3DFAN (Bulat and Tzimiropoulos, 2017) was evaluated using source code provided by the authors (https://github.com/1adrianb/face-alignment). 'Real-time on CPU' here denotes the potential suitability of the method for use on a typical mobile device without a GPU at over 30 fps, taken from the performance evaluations in the respective papers.

Method                    Reference                         [0°,30°]   [30°,60°]   [60°,90°]   Mean    Real-time on CPU
RCPR(300W)                Burgos-Artizzu et al. (2013)      4.16       9.88        22.58       12.21
ESR(300W)                 Cao et al. (2014)                 4.38       10.47       20.31       11.72   X
SDM(300W)                 Xiong and De la Torre (2013)      3.56       7.08        17.48       9.37    X
ESR(300W-LP)              Cao et al. (2014)                 4.60       6.70        12.67       7.99    X
RCPR(300W-LP)             Burgos-Artizzu et al. (2013)      4.26       5.96        13.18       7.80
SDM(300W-LP)              Xiong and De la Torre (2013)      3.67       4.94        9.76        6.12    X
2.5D CLM with LNF         This work                         4.56       6.56        7.17        6.10    X
3DDFA                     Zhu et al. (2016)                 3.78       4.54        7.93        5.42
3DDFA+SDM                 Zhu et al. (2016)                 3.43       4.24        7.17        4.94
3DFAN                     Bulat and Tzimiropoulos (2017)    3.64       4.79        5.99        4.81
Hierarchical binary CNNs  Bulat and Tzimiropoulos (2018)    2.47       3.01        4.31        3.26

where x_k and y_k are the projected ground-truth and estimated points of the k-th face image respectively. The value d_k attempts to normalize the error for different face and image sizes and is computed here for each image as

d = \sqrt{w_b \, h_b}    (54)

where w_b and h_b represent the width and height of the bounding box enclosing all face points.
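A minimal sketch of equations (53) and (54), assuming the 68 ground-truth and estimated landmarks have already been projected into the image, and reading ‖x_k − y_k‖ as the mean point-to-point distance within image k; all names are illustrative:

#include <opencv2/opencv.hpp>
#include <cmath>
#include <vector>

// Per-image error: mean landmark distance normalized by d = sqrt(w_b * h_b)
// of the box enclosing all ground-truth face points (equation 54).
double normalizedError(const std::vector<cv::Point2f>& groundTruth,
                       const std::vector<cv::Point2f>& estimated) {
    const cv::Rect box = cv::boundingRect(groundTruth);
    const double d = std::sqrt(double(box.width) * double(box.height));
    double sum = 0.0;
    for (size_t i = 0; i < groundTruth.size(); ++i)
        sum += std::hypot(double(groundTruth[i].x - estimated[i].x),
                          double(groundTruth[i].y - estimated[i].y));
    return (sum / groundTruth.size()) / d;
}

// Equation (53): average the normalized error over all n face images.
double nme(const std::vector<std::vector<cv::Point2f>>& gt,
           const std::vector<std::vector<cv::Point2f>>& est) {
    double total = 0.0;
    for (size_t k = 0; k < gt.size(); ++k)
        total += normalizedError(gt[k], est[k]);
    return total / gt.size();  // typically reported as a percentage
}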
The results summarizing how the 2.5D CLM compares in accuracy to other, more computationally demanding methods can be seen in Table (3). Since the 2.5D CLM only searches its local vicinity for good point matches, it needs a good initial prediction to successfully find the face points. Utilizing a standard Haar classifier for face detection (Viola and Jones, 2001) provides poor results, as the dataset contains 'in-the-wild' images that are designed to be challenging. Instead, in line with the other methods in Table (3), the ground-truth region-of-interest is used to set the initial translation of the head. For the initial rotation estimate, a heuristic was implemented whereby 9 initial poses were assessed (representing the 9 sets of patches trained with yaw angles between [−90°, 90°], as detailed in equation 31). The yaw angle with the highest sum of Peak-Sidelobe Ratios (PSR, see section 3.7) from each point was then taken as the initial pose estimate. While the 2.5D CLM is by no means the most accurate, possibly due to initialization difficulties or the absence of information about the camera focal length, the strongest methods rely on deep learning approaches computing in parallel on one or more GPUs to get anywhere near real-time performance. While only our approach has been shown to work in real-time on a mobile device, we have indicated the other methods that have the potential to do the same within the table, based on the performance evaluations from their respective papers.

The results also show that while the 2.5D CLM suffers as the yaw rotation of the head increases towards a profile view, it is consistent overall compared to the older models. This is likely due to a strong pose estimate, clearly showcasing the benefits of designing the PDM specifically for pose evaluation rather than the simpler problem of face alignment to arbitrarily tagged 2D points.

5 Conclusion

The head-pose estimation model, the 2.5D CLM, was successfully able to outperform 2D ASM and AAM trackers supplemented with the POSIT algorithm on the UPNA dataset. Additionally, it has been shown to outperform a recently published viewpoint-consistent 3D model using Cascade Regression Trees. It has shown that it can achieve robust results under many variations of head pose movement. The 2.5D head pose estimator has a significant advantage over other models in that it directly optimizes the rotation and translation parameters, rather than being a two-stage process like many other models. The weakest rotation angle was consistently the pitch of the head, perhaps due to the lack of relative depth around the nose region. Although a person's nose shape remains relatively static, the 3D model is capable of a large amount of deformation around the nose to accommodate many different people. The z-translation component suffers from a similar problem, where the optimization process can stretch and squash the face as it needs, rather than move correctly along the z-dimension. Both of these issues could potentially be alleviated by learning about the user's face shape over time, so that the inter-person deformation is removed and only the deformation specific to that individual is kept.

The head-pose estimator was successfully able to run on a commodity tablet in real-time. The work can be extended by using the many opportunities for parallelization within the code to optimize the algorithm. Additionally, since the model does not require a specific texture model, the 2.5D CLM can take advantage of the explosion of many new and exciting deep learning models being explored within the research community.

References

Ackland S, Istance H, Coupland S, Vickers S (2014) An investigation into determining head pose for gaze estimation on unmodified mobile devices. In: Proceedings of the Symposium on Eye Tracking Research and Applications, ACM, pp 203–206
Ariz M, Bengoechea JJ, Villanueva A, Cabeza R (2016) A novel 2d/3d database with automatic face annotation for head tracking and pose estimation. Computer Vision and Image Understanding 148:201–210
Asthana A, Zafeiriou S, Tzimiropoulos G, Cheng S, Pantic M, et al. (2015) From pixels to response maps: Discriminative image filtering for face alignment in the wild. IEEE Transactions on Pattern Analysis and Machine Intelligence 37(6):1312–1320
Baltrušaitis T, Robinson P, Morency LP (2012) 3d constrained local model for rigid and non-rigid facial tracking. In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE, pp 2610–2617
Baltrušaitis T, Robinson P, Morency LP (2013) Constrained local neural fields for robust facial landmark detection in the wild. In: Proceedings of the IEEE International Conference on Computer Vision Workshops, pp 354–361
Baltrušaitis T (2014) Automatic facial expression analysis. PhD thesis, University of Cambridge
Baltrušaitis T, Robinson P, Morency LP (2016) Openface: an open source facial behavior analysis toolkit. In: Applications of Computer Vision (WACV), 2016 IEEE Winter Conference on, IEEE, pp 1–10
Blanz V, Vetter T (1999) A morphable model for the synthesis of 3d faces. In: Proceedings of the 26th annual conference on Computer graphics and interactive techniques, ACM Press/Addison-Wesley Publishing Co., pp 187–194
Bolme DS, Beveridge JR, Draper BA, Lui YM (2010) Visual object tracking using adaptive correlation filters. In: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, IEEE, pp 2544–2550
Bulat A, Tzimiropoulos G (2017) How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks). In: Computer Vision (ICCV), 2017 IEEE International Conference on, IEEE, pp 1021–1030
Bulat A, Tzimiropoulos G (2018) Hierarchical binary cnns for landmark localization with limited resources. IEEE Transactions on Pattern Analysis and Machine Intelligence
Burgos-Artizzu XP, Perona P, Dollár P (2013) Robust face landmark estimation under occlusion. In: Proceedings of the IEEE International Conference on Computer Vision, pp 1513–1520
Cao X, Wei Y, Wen F, Sun J (2014) Face alignment by explicit shape regression. International Journal of Computer Vision 107(2):177–190
Cheung YM, Peng Q (2015) Eye gaze tracking with a web camera in a desktop environment. Human-Machine Systems, IEEE Transactions on 45(4):419–430
Choi S, Kim D (2008) Robust head tracking using 3d ellipsoidal head model in particle filter. Pattern Recognition 41(9):2901–2915
Cootes TF, Edwards GJ, Taylor CJ (2001) Active appearance models. Pattern Analysis and Machine Intelligence, IEEE Transactions on 23(6):681–685
Cristinacce D, Cootes T (2006) Feature detection and tracking with constrained local models. In: Proc. British Machine Vision Conference, vol 3, pp 929–938
Dementhon DF, Davis LS (1995) Model-based object pose in 25 lines of code. International Journal of Computer Vision 15(1):123–141
Fanelli G, Weise T, Gall J, Van Gool L (2011) Real time head pose estimation from consumer depth cameras. Pattern Recognition pp 101–110
Goodall C (1991) Procrustes methods in the statistical analysis of shape. Journal of the Royal Statistical Society Series B (Methodological) pp 285–339
Gross R, Matthews I, Cohn J, Kanade T, Baker S (2010) Multi-pie. Image and Vision Computing 28(5):807–813
Henriques JF, Caseiro R, Martins P, Batista J (2015) High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence 37(3):583–596
Kirby M, Sirovich L (1990) Application of the karhunen-loeve procedure for the characterization of human faces. Pattern Analysis and Machine Intelligence, IEEE Transactions on 12(1):103–108
La Cascia M, Sclaroff S, Athitsos V (2000) Fast, reliable head tracking under varying illumination: An approach based on registration of texture-mapped 3d models. Pattern Analysis and Machine Intelligence, IEEE Transactions on 22(4):322–336
Martins P, Caseiro R, Batista J (2010) Face alignment through 2.5d active appearance models. International Journal of Computer Vision 56(1):221–255
Martins P, Caseiro R, Batista J (2012) Generative face alignment through 2.5d active appearance models. Computer Vision and Image Understanding
Merget D, Rock M, Rigoll G (2018) Robust facial landmark detection via a fully-convolutional local-global context network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 781–790
Padeleris P, Zabulis X, Argyros AA (2012) Head pose estimation on depth data based on particle swarm optimization. In: Computer Vision and Pattern Recognition Workshops (CVPRW), 2012 IEEE Computer Society Conference on, IEEE, pp 42–49
Paquet U (2009) Convexity and bayesian constrained local models. In: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, IEEE, pp 1193–1199
Paysan P, Knothe R, Amberg B, Romdhani S, Vetter T (2009) A 3d face model for pose and illumination invariant face recognition. In: Advanced Video and Signal Based Surveillance, 2009. AVSS'09. Sixth IEEE International Conference on, IEEE, pp 296–301
Pons-Moll G, Rosenhahn B (2011) Model-based pose estimation. In: Visual Analysis of Humans, Springer, pp 139–170
Rusinkiewicz S, Levoy M (2001) Efficient variants of the icp algorithm. In: 3-D Digital Imaging and Modeling, 2001. Proceedings. Third International Conference on, IEEE, pp 145–152
Saragih J, Lucey S, Cohn J (2011) Deformable model fitting by regularized landmark mean-shift. International Journal of Computer Vision 91(2):200–215
Tang Y, Sun Z, Tan T (2011) Real-time head pose estimation using random regression forests. Biometric Recognition pp 66–73
Torresani L, Hertzmann A, Bregler C (2008) Nonrigid structure-from-motion: Estimating shape and motion with hierarchical priors. Pattern Analysis and Machine Intelligence, IEEE Transactions on 30(5):878–892
Tulyakov S, Jeni LA, Cohn JF, Sebe N (2018) Viewpoint-consistent 3d face alignment. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(9):2250–2264
Viola P, Jones M (2001) Rapid object detection using a boosted cascade of simple features. In: Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, IEEE, vol 1, pp I-511
Wang Y, Lucey S, Cohn JF (2008) Enforcing convexity for improved alignment with constrained local models. In: Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, IEEE, pp 1–8
Weng J, Cohen P, Herniou M, et al. (1992) Camera calibration with distortion models and accuracy evaluation. IEEE Transactions on Pattern Analysis and Machine Intelligence 14(10):965–980
Xiao J, Moriyama T, Kanade T, Cohn J (2003) Robust full-motion recovery of head by dynamic templates and re-registration techniques. International Journal of Imaging Systems and Technology 13(1):85–94
Xiao J, Baker S, Matthews I, Kanade T (2004) Real-time combined 2d+3d active appearance models. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol 2
Xiong X, De la Torre F (2013) Supervised descent method and its applications to face alignment. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 532–539
Zhu X, Lei Z, Liu X, Shi H, Li SZ (2016) Face alignment across large poses: A 3d solution. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 146–155