Deformable Templates for Face Recognition
Alan L. Yuille
Division of Applied Science, Harvard University

Journal of Cognitive Neuroscience, Volume 3, Number 1
© 1991 Massachusetts Institute of Technology

Abstract

We describe an approach for extracting facial features from images and for determining the spatial organization between these features using the concept of a deformable template. This is a parameterized geometric model of the object to be recognized together with a measure of how well it fits the image data. Variations in the parameters correspond to allowable deformations of the object and can be specified by a probabilistic model. After the extraction stage the parameters of the deformable template can be used for object description and recognition.

INTRODUCTION

The sophistication of the human visual system is often taken for granted until we try to design artificial systems with similar capabilities. In particular humans have an amazing ability to recognize faces seen from different viewpoints, with various expressions and under a variety of lighting conditions. It is hoped that current attempts to build computer vision face recognition systems will shed light on how humans perform this task and the difficulties they overcome.

In this article we describe one promising approach toward building a face recognition system. Some alternative approaches are reviewed in Turk and Pentland (1989).

A standard way of describing a face consists of representing the features and the spatial relations between them. Indeed two of the earliest face recognition systems (Goldstein, Harmon, & Lesk, 1972; Kanade, 1977) used features and spatial relations, respectively. Such representations are independent of lighting conditions, may be able to characterize different spatial expressions, and have some limited viewpoint invariance.

These representations, however, are extremely difficult to compute reliably from a photograph of a face. It is very hard to extract features, such as the eyes, by using current computer vision techniques. Deformable templates, however, offer a promising way of using a priori knowledge about features, and the spatial relationships between them, in the detection stage. Once the feature has been extracted the parameters of the deformable template can be used for description and recognition.

Designing a deformable template to detect a feature, or an object, falls into two parts: (1) providing a geometric model for the template, and (2) specifying a model, the imaging model, for how a template of specific geometry will appear in the image and a corresponding measure of fitness, or matching criterion, to determine how the template interacts with the image.

For many objects it is straightforward to specify a plausible geometric model. Intuition is often a useful guide and one can draw on the experience of artists (for example, Bridgman, 1973). Once the form of the model is specified it can be evaluated and the probability distribution of the parameters determined by statistical tests, given enough instances of the object.

Specifying the imaging model and the matching criterion is often considerably harder. The way the object reflects light depends both on the reflectance function of the object, which in principle can be modeled, and on the scene illumination, which is usually unknown. The matching criterion, however, should aim to be relatively independent of the lighting conditions. An even more serious problem arises when part of the object is invisible, perhaps due to occlusion by another object or by lying in deep shadow. Most matching criteria will break down in this situation but recent work, described in the fourth section, shows promise of dealing with such problems.

The plan of this chapter is as follows. The second section gives a simple introduction to deformable templates by describing the pioneering work of Fischler and Elschlager (1973) on extracting global descriptions of faces. The third section gives a more detailed description of using deformable templates to extract facial features (Yuille, Cohen, & Hallinan, 1989). The fourth section shows a way to put deformable templates in a more robust framework (Hallinan & Mumford, 1990), which promises to make them more reliable and effective.

Deformable templates, and the closely related elastic models (Burr, 1981a,b; Durbin & Willshaw, 1987; Durbin, Szeliski, & Yuille, 1989) and snakes (Kass, Witkin, & Terzopoulos, 1987; Terzopoulos, Witkin, & Kass, 1987), have been extensively used in recent years in vision and related areas. Much of this work fits into the general Bayesian statistical framework developed by Grenander and his collaborators (Chow, Grenander, & Keenan, 1989; Grenander, 1989; Knoerr, 1989) for synthesizing and recognizing biological shapes.

GLOBAL TEMPLATES

In Fischler and Elschlager's approach (1973) the face is modeled by a set of basic features connected by springs (see Fig. 1). The set of features includes the eyes, hair, mouth, nose, and left and right edges. The individual features do not deform and each feature has a local measure of fit to the image. During the matching stage the entire structure is deformed, rather like a rubber sheet, until all features have a good local fit and the spring forces are balanced. The springs help ensure that the spatial relations between the features are reasonable (i.e., the nose is not above the hair).

Figure 1. The face model consists of the hair, eyes, nose, mouth, and the left and right edges joined together by springs.

More specifically, this approach defines a local fitness measure $l_i(x_i)$ that indicates how strongly the $i$th feature fits at location $x_i$. Fischler and Elschlager define simple fitness measures for each feature. For example, for the left edge of the face the fitness measure at $x_i$ is the difference between the sums of the four intensity values to the left and right of $x_i$. Thus this measure is large for a straight vertical line with low intensity on the left and high intensity on the right.

The spring joining the $i$th and $j$th features is given a cost function $g_{ij}(x_i, x_j)$, where $x_i$ and $x_j$ are the positions of the features. In most cases $g_{ij}$ is assumed to be symmetric and to depend only on the relative positions of the features, hence it can be written as $g_{ij}(x_i - x_j)$. Not all features are joined by springs (see Fig. 1), so for each feature $i$ we let $N_i$ denote the set of features to which it is connected.

The total measure of fitness is the sum of the local fitness measures for each feature and the cost of the spring terms. Let $X_7 = \{x_1, x_2, \ldots, x_7\}$ be the positions of the features. The fitness is

$$E(X_7) = \sum_{i=1}^{7} l_i(x_i) + \sum_{i=1}^{7} \sum_{j \in N_i} g_{ij}(x_i - x_j).$$

The theory suggests localizing the features in a specific face by adjusting $X_7$ to maximize $E(X_7)$. This is a hard combinatoric problem and Fischler and Elschlager suggest an algorithm, the linear embedding algorithm, to solve it. We will not, however, be concerned here with the details of the algorithm, which are related to dynamic programming.

The system was shown to work on a number of 35 by 40 images. The authors reported that the main errors occurred in the location of the mouth/nose complex, although the other features were correctly located. In these situations the spring forces put the mouth/nose complex in roughly the right position but were unable to localize it correctly. They also mentioned that in the presence of noise the simple local fitness measures seemed inadequate to accurately locate the nose and mouth.

Computer simulations at the Harvard Robotics Lab (U. Wehmeier, personal communication) show that closely related techniques work well for images of faces with constant size and with little noise. In these simulations five features were chosen: (1) the eyes, (2) the nostrils of the nose, and (3) the sides of the mouth. These features were attracted to valleys, dark blobs, in the image intensity and were connected by springs. A steepest descent algorithm was used (see Fig. 2). Once the positions of the features have been located approximately, more accurate detectors can be used to find them more precisely.

Fischler and Elschlager's approach is very attractive but it has several weaknesses in its present form. The main problem is the simplicity of the local fitness measures. At present they are strongly scale dependent and may fail under certain noise conditions. The spring forces might also be improved, by a systematic study of the spatial relations, to make sure they give a more accurate bias.

Systems of this type seem to work best on small pictures with coarse levels of resolution. At coarse scales the local fitness measures may be very simple (Wehmeier was able to locate many features in terms of valleys). At finer scales, however, the local fitness measures must become more complex to take into account the greater variability of the feature. This suggests a coarse to fine, or pyramidal (Burt & Adelson, 1983), approach in which a global template is used to identify likely positions in the coarse image that can be located and (we hope) verified by more sophisticated techniques at a finer scale.

We will now describe a system that uses deformable templates to locate the features themselves.
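As an illustration of the global template idea in this section, the following Python sketch evaluates a total fitness made of local fitness measures plus spring terms and improves it by a simple pixel-wise hill climb. It is a minimal sketch under stated assumptions, not Fischler and Elschlager's linear embedding algorithm and not the steepest descent procedure used in the Harvard simulations. The function names, the precomputed local fitness maps, the quadratic form of the spring term, and its negative sign (so that a stretched spring lowers the fitness being maximized) are all assumptions introduced here for illustration.

```python
import numpy as np


def left_edge_fitness(image, row, col, width=4):
    """Local fitness for a left face edge: sum of `width` intensities to the
    right of (row, col) minus the sum of `width` intensities to its left.
    Large for a vertical edge that is dark on the left and bright on the right."""
    if col - width < 0 or col + width >= image.shape[1]:
        raise ValueError("not enough pixels on both sides of col")
    left = image[row, col - width:col].sum()
    right = image[row, col + 1:col + 1 + width].sum()
    return float(right - left)


def total_fitness(positions, local_maps, springs):
    """Sum of local fitness values plus spring terms.

    positions  : (n, 2) integer array of (row, col) feature locations
    local_maps : list of n 2-D arrays; local_maps[i][r, c] plays the role of l_i
    springs    : dict {(i, j): (rest_offset, stiffness)}; ASSUMED quadratic,
                 contributing -stiffness * |x_i - x_j - rest_offset|^2
    """
    local = sum(float(local_maps[i][r, c]) for i, (r, c) in enumerate(positions))
    spring = 0.0
    for (i, j), (rest, stiffness) in springs.items():
        d = positions[i] - positions[j] - rest
        spring += -stiffness * float(d @ d)
    return local + spring


def greedy_fit(positions, local_maps, springs, max_sweeps=100):
    """Hill climbing: move one feature at a time to whichever of its 8 pixel
    neighbours most increases the total fitness; stop when no move helps.
    This finds a local maximum only."""
    pos = np.array(positions, dtype=int)
    moves = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
    best = total_fitness(pos, local_maps, springs)
    for _ in range(max_sweeps):
        improved = False
        for i in range(len(pos)):
            height, width = local_maps[i].shape
            start = pos[i].copy()
            best_move = start
            for dr, dc in moves:
                cand = start + (dr, dc)
                if not (0 <= cand[0] < height and 0 <= cand[1] < width):
                    continue  # candidate falls outside the image
                pos[i] = cand
                e = total_fitness(pos, local_maps, springs)
                if e > best:
                    best, best_move, improved = e, cand, True
            pos[i] = best_move
        if not improved:
            break
    return pos, best
```

In use, one would build one local fitness map per feature (for example by evaluating `left_edge_fitness` at every admissible pixel), list the connected feature pairs in `springs`, and call `greedy_fit` from a rough initial guess. Because the search is local, the result depends on that initial guess, which is one reason a coarse-to-fine strategy is attractive.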
Figure 2.