Fusion Based Face Recognition Using Statistics of Shaded Subregions

ECE 533 Final Project

December 21, 2006

John A. Boehm

Introduction:

Face recognition is a growing field in image processing and machine learning, with important applications in surveillance, authorization, and many other security settings. Recent advances in the state of the art, such as techniques operating on 3-D range data [1] and image capture under near-infrared lighting [2], offer great promise for the future but do little to help build a system today using existing databases of uncontrolled 2-D images. Uncontrolled images suffer from varying lighting, which introduces shadowing effects; pose variations that constrain a system to performing recognition on a mere fraction of the face; changes in expression that can contort a face into an unrecognizable subject; and poor-quality, noisy images that simply do not contain enough information for a recognition system to operate. The most common of these problems is uncontrolled lighting.

In unprocessed images, changes between images of the same person under different lighting conditions are larger than those between images of two different people under the same lighting conditions [3]. Given the extensive number of such images that exist, robust pre-processing methods and recognition algorithms are clearly worthwhile topics for investigation. I have been working with the extensive set of images provided in the FRGC [4] database.

The Face Recognition Grand Challenge (FRGC), sponsored by NIST in 2005-06, was conceived in an attempt to improve the current state of the art of face recognition systems by an order of magnitude. Six different challenges were issued, ranging from performing recognition on controlled still images to using combinations of 2-D texture and 3-D range scan data. Of the six challenges issued, experiment 4 is arguably the most challenging, as its data set contains 2-D images with uncontrolled pose, expression, and lighting variations. In an effort to standardize the metrics used for comparing algorithms and to ensure reproducibility of results, the FRGC offered the Biometric Experiment Environment (BEE) as an operating platform.

The baseline recognition algorithm for experiment 4 provided with the BEE was only able to produce a ~14% successful recognition rate (True Acceptance Rate, or TAR) when the False Acceptance Rate (FAR) threshold was set at 0.1% (the standard used to judge all algorithms in the competition). My project attempts to improve the existing baseline algorithm by gathering statistics of sub-regions in each image, using those statistics to create weights that emphasize the “more important regions”, and generating a new matrix of similarity scores that improves the overall performance of the baseline algorithm.

Background:

The baseline algorithm first pre-processes the images in the database using meta-data that provides, for each image, the locations of critical landmarks such as the corners of the eyes. This facilitates the rotation and translation of the image for proper registration. It then extracts an oval-shaped region containing just the face and performs histogram equalization on the region before storing it as a 1×19500 vector. A typical pre-processed controlled image is shown in fig. 1, while fig. 2 is a pre-processed uncontrolled image of the same person. The shading of the areas around the inside corners of the eyes, as well as the regions below the nose and lips, is quite evident.

Figs. 1 and 2: Normalized controlled image and uncontrolled image of the same person.
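The pre-processing described above can be sketched with standard tools. Below is a minimal, hypothetical version using OpenCV; it assumes the 1×19500 vector corresponds to a 150×130 crop (150 × 130 = 19500) and uses illustrative landmark targets, inter-eye distance, and mask size rather than the actual BEE parameters.

```python
import numpy as np
import cv2

def normalize_face(gray, left_eye, right_eye, out_shape=(150, 130)):
    """Rotate, scale, and translate a grayscale (uint8) face so the eyes land
    on fixed positions, mask an oval face region, histogram-equalize it, and
    flatten to a 1 x 19500 vector.  Landmark targets and crop size are
    assumptions, not the actual BEE values."""
    h, w = out_shape
    # Angle and scale that level the eye line and bring it to a fixed width.
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    angle = np.degrees(np.arctan2(dy, dx))
    scale = (0.40 * w) / np.hypot(dx, dy)          # assumed target eye distance
    center = ((left_eye[0] + right_eye[0]) / 2.0,
              (left_eye[1] + right_eye[1]) / 2.0)
    M = cv2.getRotationMatrix2D(center, angle, scale)
    # Shift the eye midpoint to its assumed target location in the crop.
    M[0, 2] += 0.50 * w - center[0]
    M[1, 2] += 0.35 * h - center[1]
    aligned = cv2.warpAffine(gray, M, (w, h))

    # Oval mask covering just the face, then global histogram equalization.
    mask = np.zeros((h, w), np.uint8)
    cv2.ellipse(mask, (w // 2, h // 2), (w // 2 - 5, h // 2 - 5),
                0, 0, 360, 255, -1)
    face = cv2.equalizeHist(aligned)
    face[mask == 0] = 0
    return face.reshape(1, -1)                      # 1 x 19500 vector
```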

The next module in the BEE is the biobox, which takes the normalized images from the pre-processing step as input and performs the actual recognition algorithm on them. The output of the biobox is an M×N matrix of similarity scores, where M is the number of Query images (unknown identity) and N is the number of known Target images (known identity). Therefore, the (i, j)th similarity score is the distance measure between the ith Query image and the jth Target image; the lower the score, the closer the match between the two. Finally, after normalizing the similarity matrix, a Receiver Operating Characteristic (ROC) curve is generated, which plots the True Acceptance Rate (TAR) as a function of the False Acceptance Rate (FAR). The results of the baseline algorithm are shown in figs. 3 and 4 below for intra-semester and inter-semester recognition, respectively. The standard used in the FRGC for comparing algorithms was the True Acceptance Rate at a False Acceptance Rate threshold of 0.1%; thus the scores for the two curves below are 13.6% and 12.0%, respectively.

Figs. 3 and 4: Baseline intra-semester and inter-semester ROC curves.
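For reference, a TAR-at-FAR operating point such as the 13.6% figure above can be computed directly from a similarity matrix and a same-identity mask. The sketch below is my own illustration, not the BEE scoring code; it assumes lower scores mean closer matches, as described above.

```python
import numpy as np

def tar_at_far(scores, same_id, far_target=0.001):
    """scores  : (M, N) similarity matrix, lower = closer match.
    same_id : (M, N) boolean mask, True where Query i and Target j
              are the same person.
    Returns the True Acceptance Rate at the distance threshold whose
    False Acceptance Rate equals far_target (0.1% here)."""
    genuine = scores[same_id]                 # same-person distances
    impostor = scores[~same_id]               # different-person distances
    threshold = np.quantile(impostor, far_target)   # 0.1% of impostors accepted
    return float(np.mean(genuine <= threshold))
```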

Approach:

Many of the images in the uncontrolled image set contain the effects of several variants simultaneously. In fact, the pair of images with the worst match score for the same person taken in the same semester is shown below in figs. 5 and 6. The query image suffers from poor quality, has shading around the eyes, has a different expression than the target image, and has a different pose as well (the head is tilted up in the target image).

Figs. 5 and 6: Worst-scoring same-person pair (query image and target image).

It is unclear how much each of these variants affects the overall score. My project only attempts to address the shading issue.

My hypothesis is that if the face images are broken down into regions of varying sizes and locations, it may be possible to determine whether shading is present and to use weighted combinations (fusion) of the regions to develop a new similarity score that improves the performance of the baseline algorithm. The fusion approach is not new to pattern recognition [3] and will generally improve the results of a decision process as long as the weights of the fusion regions are determined by “experts”. In other words, if two or more classifiers perform well on a data set, then a combination of their scores should perform better. Of course, there is no telling what will happen if all of the classifiers perform poorly.

Work Performed:

My first task was to extract regions from the normalized faces and run the baseline algorithm on them to see if there were any regions that were “good” classifiers by themselves. I ran the baseline algorithm on the 21 different sub-regions shown in figs. 7 and 8; the TAR scores at 0.1% FAR are listed in Table 1 for each of the regions. My premise has been that regions with similar statistics should perform better in recognition tasks than those that are significantly different. I performed histogram equalization on the first ten regions and ran the experiments again to see if there would be an improvement in the overall score of any of the sub-regions. Table 2 shows the results of the histogram-equalized regions using the baseline algorithm. Looking at regions 2 and 3 (left eye and right eye), there is a 4.2% performance difference between the two eyes in the unprocessed sub-regions, while the same sub-regions after histogram equalization differ by only 2.4%. While the differences are small, given the size of the data sets (8014 Query images and 16028 Target images) it seems that there may be some useful insight to be gleaned from the results.

Figs. 7 and 8: The 21 sub-regions extracted from the normalized face.

Table 1: Sub-Region Scores (baseline algorithm)

Region         TAR @ 0.1% FAR    Region         TAR @ 0.1% FAR
Brow 1               9.8         Brow                 9.9
L. Eye               2.6         Eyes                 8.1
R. Eye               5.8         Nose                 4.6
Mid Nose             5.6         Mid                  4.8
Whole Nose           7.0         Mouth                5.1
Nose Tip             2.5         L. Upper             5.4
L. Cheek lg          0.9         R. Upper            10.1
R. Cheek lg          4.8         L. Mid               6.0
L. Cheek sm          0.2         R. Mid              10.7
R. Cheek sm          0.3         L. Lower             4.3
                                 R. Lower             7.0

Table 2: Histogram Equalized Scores

Region         TAR @ 0.1% FAR
Brow 1              14.7
L. Eye               3.7
R. Eye               6.1
Mid Nose             6.2
Whole Nose           6.8
Nose Tip             1.8
L. Cheek lg          0.7
R. Cheek lg          3.2
L. Cheek sm          0.3
R. Cheek sm          0.3
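A sketch of how sub-regions might be cut from the normalized 1×19500 vectors and optionally histogram-equalized before being fed back through the baseline matcher. The 150×130 reshape and the region coordinates are illustrative assumptions; the actual 21 regions of figs. 7 and 8 are not reproduced here.

```python
import numpy as np
import cv2

H, W = 150, 130          # assumed layout of the 19500-element face vector

# Illustrative bounding boxes (row0, row1, col0, col1); not the real regions.
REGIONS = {
    "brow1": (20, 50, 15, 115),
    "l_eye": (45, 75, 15, 65),
    "r_eye": (45, 75, 65, 115),
}

def extract_region(face_vec, name, equalize=False):
    """Cut one rectangular sub-region out of a normalized face vector,
    optionally histogram-equalize it, and return it flattened so it can
    be run through the same baseline matcher."""
    face = face_vec.reshape(H, W).astype(np.uint8)
    r0, r1, c0, c1 = REGIONS[name]
    patch = face[r0:r1, c0:c1]
    if equalize:
        patch = cv2.equalizeHist(patch)
    return patch.reshape(1, -1)
```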

I next developed an algorithm that would use the information from various sub-regions to weight the original scores of the corresponding similarity matrices and combine them to create a new score, i.e.:

$$S(i, j) = \sum_{k} w_k\, S_k(i, j), \qquad k = 1, 2, \ldots;\; i = 1, 2, \ldots, 8014;\; j = 1, 2, \ldots, 16028 \tag{1}$$

where k indexes the similarity matrices used. The size of the similarity matrices posed a great problem, as each one was ~490 MB; therefore careful selection of the proper regions was in order if I were to expect an algorithm to run within the last weeks of the semester. Moreover, since the weights $w_k$ were dependent on statistics of several regions and there were 8,014 + 16,028 = 24,042 images to gather statistics from for each sub-region, the computational limitations of the algorithm were always a great concern.

My first attempt was to use the mean pixel value of each of the first 10 sub-regions of each Query image and compare the values with the corresponding mean values of every Target image that the Query was to be compared against. I would then subtract the absolute value of the difference from 1 to determine the weighting of that region's similarity score in the overall score. If the difference was greater than 1, I set the weight equal to zero, since a change in sign of the similarity value [-1.7, 1.7] could introduce erroneous results. Then:

1  k : 0   k  1 wk   where,  k  mean{Qsubk } mean{Tsubk }  0 :  k 1

This method proved infeasible using 10 sub-regions: after completing just two Query comparisons (of the 8014), I found that the algorithm would take ~11 days to complete. This led to a more detailed investigation of the statistics of the sub-regions in order to find the best regions to use in my classifier. In figs. 9, 10 and 11, the results of comparing the mean values of the Query vs. Target images for pairs that were classified correctly (blue) and incorrectly (red) are shown for the first three sub-regions.

Figs. 9, 10 and 11: Query vs. Target mean values for the first three sub-regions (correct matches in blue, incorrect in red).

It appears that regions 1 and 2 (figs. 9 and 10) exhibit a pattern suggesting that image pairs with similar, lower (darker) mean values may be classified better than those with higher (lighter) mean values. We must also keep in mind that the right eye (fig. 11) was generally a better classifier than the left (fig. 10), so clearly there is no simple explanation for the data observed.

Given the somewhat ambiguous nature of the data, I decided to run a simpler version of the first experiment with a pared-down data set. I compared the mean values of the eye regions of the Query and Target images and used the differences to generate weights for just three similarity matrices, corresponding to the three regions shown in fig. 12. The “Brow” and “Right Mid” regions were chosen since they had better performance than nearly all of the other regions. The left and right eyes were chosen for determining the weights since there was a noticeable difference between the left- and right-eye performance. It was hoped that when one side had a large difference in mean pixel values between the Query and Target, the other side's difference would not be as great and would thus provide a better match.

Fig. 12: The three regions used in the fusion experiment.

Results:

It took nearly four days for the experiment to run, and the results were not as good as anticipated. The ROC curve (intra-semester) for the fusion experiment is shown in fig. 13, and the original baseline result is shown again in fig. 14 for convenient comparison. The result for the fusion approach is 14.77% TAR @ 0.1% FAR, while the baseline result was 13.6%.

Figs. 13 and 14: Fusion results and baseline results (intra-semester ROC curves).

Discussion:

Though there was a small improvement, it appears that the fusion-based approach was severely limited from the start, as I did not have any known “expert” regions that offered good classification on their own. It is also quite possible that my concerns regarding the influence of the pose and expression variations have been validated, and those variations are impacting the performance of the overall system more than anticipated. Despite these concerns, it is still worthwhile to further examine the eye regions used for generating the weights to see if any additional insight can be gained from the performance of this system.

In fig. 15 we can see examples of the left eye region for both Query and Target images of the same person, before and after histogram equalization. The histogram-equalized version of the controlled (Query) image is a very even-looking image. This is because the controlled image had pixel values that were all in the same relative range, i.e., not too dark and not too light. The uncontrolled (Target) image, however, is very dark inside the eye and very bright in the bottom left of the image. Since this image already has a fairly wide dynamic range, histogram equalization does not do as good a job of stretching the range of the uncontrolled image.

Fig. 15: Left eye region of the Query and Target images before and after histogram equalization.

There may have been a similar problem in using just the mean pixel value of the eye regions to determine the weights of the similarity matrices. Since the regions around the eyes contain relatively large areas of both bright and dark pixels, the mean value may not be a good indicator of differences between the eye regions of two separate images. This is compounded by the fact that the normalized image has already been histogram equalized once, which tends to make nearly all eye regions darker and all cheek regions lighter. As long as the eye regions used have such a wide range of pixel values, the mean will not be a good metric to use in the classifier. If the eye region were resized to better fit only the eye, then histogram equalization would be more uniform over the eye and the mean value would likely be more significant for classification. The problem with making the eye region smaller is that eyes do not typically make good classifiers (unless the resolution allows for iris classification [5]), since they are prone to significant variation due to expression changes.
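A small made-up example of the point above: a heavily shaded eye patch and an evenly lit one can have identical means even though they look nothing alike, so a mean-difference weight treats the pair as a perfect match. The pixel values here are invented purely for illustration.

```python
import numpy as np

even_patch   = np.full((4, 4), 128.0)                  # evenly lit region
shaded_patch = np.hstack([np.full((4, 2), 20.0),       # dark inner corner
                          np.full((4, 2), 236.0)])     # bright opposite side

print(even_patch.mean(), shaded_patch.mean())   # 128.0 128.0  -- identical means
print(even_patch.std(),  shaded_patch.std())    # 0.0   108.0  -- very different spread
```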

Finally, I note that the most significant single improvement in performance occurred between the “Brow 1” region and the histogram-equalized “Brow 1” region, which showed an increase of nearly 5% (Tables 1 and 2), from 9.9% to 14.7%. Yet many of the “Brow 1” sub-region images suffered from the same large variation in pixel values. Moreover, the larger sub-regions also tended to perform better than the smaller ones. This suggests to me that improvements to this system may be possible by using local histogram equalization on smaller sub-regions of the sub-regions. Then, through a recombination of those sub-regions, a more uniform global histogram could be created that adequately compensates for shading in all sub-regions.
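The recombination idea suggested above could look roughly like the following: equalize small tiles independently and stitch them back together. The tile size is an assumption for illustration, and the face is assumed to be a uint8 150×130 crop. OpenCV's CLAHE does essentially the same thing with blending across tile borders.

```python
import numpy as np
import cv2

def local_equalize(face, tile=(30, 26)):
    """Histogram-equalize each tile of a 150 x 130 uint8 face crop
    independently, then recombine the tiles into one image."""
    out = face.copy()
    th, tw = tile
    for r in range(0, face.shape[0], th):
        for c in range(0, face.shape[1], tw):
            out[r:r + th, c:c + tw] = cv2.equalizeHist(face[r:r + th, c:c + tw])
    return out

# CLAHE performs block-wise equalization with interpolation between tiles,
# which avoids visible seams at the tile borders:
# clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(5, 5))
# smoothed = clahe.apply(face)
```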

Conclusions:

Face recognition is a challenging application of image processing with many existing approaches to a solution. Shadows in 2-D images due to lighting variations pose a significant obstacle to the development of an effective, robust algorithm. The effects of global and local histogram equalization on images containing shadows need to be carefully considered, as what works well for one image may make another image more difficult to classify. Larger sub-regions of images tend to work better than smaller ones in general, but regions whose pixel values cover too broad a range due to shading effects are not good candidates for global histogram equalization. This suggests that performing histogram equalization on smaller sub-regions and then re-combining them in the pre-processing steps may offer further improvement in the performance of a face recognition algorithm.

Fusion-based approaches to classification using multiple systems can often improve overall performance, but the decision algorithm for choosing a weighting scheme must be considered carefully. Face recognition data sets can be very large, and the computational requirements of a practical system should not be taken lightly. Only rules using classifiers that can be considered “experts” for a given region (or problem) will offer robust improvements to the system, and decision rules and weighting values should be chosen accordingly.

References:

[1] Wei-Yang Lin, Kin-Chung Wong, Nigel Boston, Yu Hen Hu, “Fusion of Summation Invariants in 3D Human Face Recognition,” IEEE, 2006.
[2] Stan Z. Li, RuFeng Chu, ShenCai Liao, Lun Zhang, “Illumination Invariant Face Recognition Using Near Infrared Images,” IEEE Transactions on Pattern Analysis and Machine Intelligence (accepted for publication), May 2007.
[3] Y. Adini, Y. Moses, S. Ullman, “Face Recognition: The Problem of Compensating for Changes in Illumination Direction,” IEEE Transactions on Pattern Analysis and Machine Intelligence.
[4] NIST, “Face Recognition Grand Challenge,” http://www.frvt.org/FRGC/
[5] John Daugman, “How Iris Recognition Works,” IEEE Transactions on Circuits and Systems for Video Technology, January 2004.
