The AAAI 2017 Spring Symposium on Science of Intelligence: Computational Principles of Natural and Artificial Intelligence Technical Report SS-17-07

Analysis of Attentions for Face Recognition on Natural Videos and Comparison with CV Algorithm on Performance

Mona Ragab Sayed,1,2 Rosary Yuting Lim,3 Bappaditya Mandal,1 Liyuan Li,1 Joo Hwee Lim,1 Terence Sim2
1 Institute for Infocomm Research, A*STAR, Singapore 138632
2 School of Computing, National University of Singapore, Singapore 117417
3 Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139-4307

Abstract

Researchers have conducted many studies on human attention and eye gaze patterns for face recognition (FR), hoping to inspire new ideas for developing computer vision (CV) algorithms that perform as well as or even better than humans. Yet, while these studies have been performed on still face images, human attention on natural videos has not been investigated. In this work, we conducted a human psycho-physics experiment to study human FR performance and attention, and to analyze eye gaze patterns on challenging natural videos. We compared human and machine FR performance on natural videos by applying a state-of-the-art deep convolutional neural network (dCNN). Our findings show a significant performance gap between machine and human, especially with respect to the humans who achieve higher FR performance (high-performers). Learning from the cognitive capabilities of humans, in particular the high-performers, may aid in reducing this gap. Hence, we investigated people's attention to unfamiliar faces in a challenging face recognition task and studied their eye gaze patterns. We propose a new and effective computational approach for eye gaze pattern analysis on natural videos, which produces reliable results and reduces the time and manual effort needed. Our findings reveal that humans in general attend more to the regions around the centre of the face than to the other facial regions. The high-performers in particular attend to the lower part of the face in addition to the centre regions.

Since faces provide important cues in social interaction, face perception and recognition is an interesting problem not only for psychologists and neuroscientists (Little, Jones, and DeBruine 2011), but also for computer vision (CV) researchers (Grainger et al. 2015; Ito and Sakurai 2014; Liao et al. 2014; Suneetha 2014; Wolf 2014), who aim at developing computational face recognition (FR) algorithms with an accuracy equivalent to or even surpassing that of humans (Dhamecha et al. 2014; O'Toole et al. 2007; Sinha et al. 2005). By investigating how humans recognize faces, findings about human FR strategies could inspire computer vision researchers to develop advanced computational techniques with better performance (Mandal et al. 2015; Dhamecha et al. 2014; Shipley and Kellman 2001). Some successes have been reported in previous studies; however, the results of those human studies were obtained under restricted conditions: the researchers used still images with constrained illuminations, viewing perspectives (poses), expressions, and ages. How humans perform on challenging videos in natural conditions has not been studied systematically. In addition, whether a state-of-the-art computational algorithm can surpass human FR under unconstrained conditions and face variations such as those of natural environments has also not been well investigated.

Humans are experts at recognizing faces in natural environments. They are able to exploit the advantages of perceiving and predicting facial movements in such environments for FR (Knight and Johnston 1997; Xiao et al. 2013). It has been observed that the cognitive ability to perceive faces moving at their natural speed and in the correct order helps to enhance face encoding and recognition (Schultz et al. 2012). Therefore, it is beneficial to study human attention to face appearances in natural environments. New findings about the cues humans use may help inspire computational methods that emulate the competence of the human face recognition system and enhance state-of-the-art face recognition technologies.
Contributions. First, we show that the performance of a state-of-the-art CV algorithm for FR is still clearly below the average performance of humans, and well below the performance of the best human performers (high-performers), for face recognition under challenging natural conditions. This observation is obtained by conducting a human psycho-physics experiment on very challenging FR tasks using natural videos and comparing against a state-of-the-art CV algorithm (a deep CNN, dCNN) on the same dataset (Parkhi, Vedaldi, and Zisserman 2015). Second, our study suggests that face recognition algorithms may yet benefit from the strategies used by humans, especially the high-performers, in natural conditions. For this purpose, we present detailed statistics of human attention for FR, i.e. fixation distributions and eye gaze patterns on natural videos. Our findings reveal that the high-performers give more attention to some facial regions than others, which might provide helpful cues for the further development of CV algorithms. Third, we propose a novel automatic approach for eye gaze pattern analysis on natural videos. This approach reduces the time, effort, and variability due to subjectivity compared to the traditional method based on manual annotations.

Human Experiment

Experiment Methods

Participants. We recruited 42 Asian volunteers (22 males) from the National University of Singapore (NUS). The majority of the volunteers were in their twenties. All of them reported normal or corrected-to-normal vision. They were paid 10 dollars in vouchers for one hour of participation. We obtained informed consent from each subject prior to the study. All procedures in our experiments were approved by the NUS IRB.

Dataset Collection. We aimed to evaluate human FR performance on natural videos containing high-resolution faces. Since no existing public dataset meets this requirement, we collected a new dataset for this study. Although the YouTube Faces (YTF) dataset has a large number of natural face videos (Wolf, Hassner, and Maoz 2011), many are manually cropped and re-sized, which deteriorates the visual quality. Using face images of poor quality as stimuli in our experiment would affect the reliability of eye gaze localization. We therefore used the YTF identities to search for new high-resolution videos (at least 720 pixels wide) recorded in natural environments. Sixty videos consisting of 45 unique Caucasian identities were downloaded from the YouTube website. Caucasian faces were employed to avoid own-race bias, as all our subjects were Asian (Hills and Lewis 2006). Half of the videos contain frontal faces according to viewing perspective, whereas the other half contain a mixture of frontal and non-frontal faces. We defined frontal faces as those directly facing the camera or rotated within a range of approximately (−45°, 45°); non-frontal faces were defined as faces rotated out of this range. An equal number of females and males were used. We clipped these videos to approximately 3 seconds (within ±0.15 seconds on average) each. We then converted each video into an image sequence of 96 frames on average. In each frame, the face is cropped with a small surrounding area. On average, the frame size is 769 × 765 pixels (see Figure 2 for an example).

Stimuli. We introduced 30 pairs of video stimuli for the initial learning and subsequent recognition tasks for each subject. Each video was shown for 3 seconds (30 frames per second), with a 2-second inter-stimulus interval. In each pair, a mixture face video (i.e. a video containing a mixture of frontal and non-frontal faces) was displayed as the first stimulus, and the second stimulus was a frontal face video. The identities in the two videos may be the same (Genuine) or different (Foil). The pairs were randomly selected without replacement. However, within each pair, we chose face videos that were not easy for face recognition (i.e. similar faces of different identities, or faces of the same identity that look different) (Rice et al. 2013). This was done to make the FR tasks more challenging for the subjects. We retained the [...], some parts of the [...] and [...] in the videos to preserve the natural appearance of the faces in the real world. Eye movement fixation data were collected using an SMI RED 250 remote eye-tracker. After each stimulus pair was shown, the subject was required to answer a matching question, choosing one of these options: the presented pair of face videos are of 'same identity', 'different identity', or 'not sure'. The time for answering the question was capped at 5 seconds. We recorded all answers, response times, profile data and eye tracking data for each subject. Recognition performance scores were calculated by dividing the number of correct answers by the total number of tested face pairs.
Human vs. Machines

Methods

In this experiment, we investigated how large the gap between machine and human FR performance is on challenging natural videos and, in more detail, the performance gaps between the machine and humans at different performance levels. To evaluate dCNN performance, we extracted dCNN features from the 35th layer of the trained Oxford deep learning network for face recognition (Parkhi, Vedaldi, and Zisserman 2015). A distance matrix for all video frames was generated, where each element is a Euclidean distance (dCNNeucd35) or a Cosine distance (dCNNcos35) between a pair of frames represented by the dCNN feature vectors of the faces. The organization of the dataset used in this experiment and the frame matching statistics are given in Table 1. For humans, the average performance of all subjects was computed (42 subjects, excluding two subjects who achieved accuracy rates below chance, i.e. ≤ 50%). In addition, we clustered the valid subjects into three performance-level clusters of 'high', 'mid' and 'low' performers by applying Kmeans clustering to their recognition performance scores (Arthur and Vassilvitskii 2007; Lloyd 1982). The average performance of each cluster was also calculated.
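As a rough illustration of the evaluation pipeline described above, the sketch below computes the frame-to-frame distance matrices from already-extracted dCNN descriptors and clusters the subjects' accuracy scores into three performance levels with k-means. It is a minimal sketch, not the authors' code: the array names and file paths are hypothetical, and it assumes the layer-35 features and per-subject scores are available as NumPy arrays.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

# frame_feats: (n_frames, d) dCNN feature vectors, one row per video frame
frame_feats = np.load("dcnn_layer35_features.npy")      # hypothetical file

# Distance matrix between every pair of frames (dCNNeucd35 / dCNNcos35)
dist_euclidean = cdist(frame_feats, frame_feats, metric="euclidean")
dist_cosine = cdist(frame_feats, frame_feats, metric="cosine")

# subject_scores: per-subject recognition accuracy (correct / total pairs),
# with the two below-chance subjects already removed
subject_scores = np.load("subject_accuracy.npy")         # hypothetical file

# Cluster the valid subjects into 'high', 'mid' and 'low' performers
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(subject_scores.reshape(-1, 1))

# Order the clusters by their mean accuracy so they read low / mid / high
order = np.argsort(km.cluster_centers_.ravel())
for cluster_id, name in zip(order, ("low", "mid", "high")):
    print(name, subject_scores[labels == cluster_id].mean())
```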

Results and Discussions

We report FR results using ROC curves. For human performance, ROC curves were generated as in (Best-Rowden et al. 2014). We mapped the subject responses of 'same identity', 'different identity' and 'not sure' to numerical values. For each shown stimulus pair, these values were averaged across all subjects to compute a human confidence score for the pair. This score was then used as a threshold value to classify human responses into two categories, i.e. correct or incorrect responses, for the stimulus pair. A false acceptance rate (FAR) and a correct verification rate (CVR) were calculated, where FAR indicates how many Foil face pairs were wrongly recognized as Genuine pairs, and CVR denotes how many Genuine face pairs were recognized correctly. For the dCNN, we extracted the feature vector from the 35th layer (located after the second-last fully-connected layer) of the deep network (Parkhi, Vedaldi, and Zisserman 2015), as mentioned earlier. We tested features from other nearby layers, but no significant difference in performance was observed. Figure 1 shows the ROC curves of humans in general (Avg-Human) and the dCNN with 800 features. Varying the number of dCNN features did not significantly change the performance, as reported by (Hu et al. 2015). Also, the ROC curves of the dCNN obtained with the Euclidean and Cosine distance metrics were almost the same. Our results revealed better performance of the human subjects on average than that of the state-of-the-art CV algorithm on our natural video dataset of challenging FR tasks.

Figure 1: ROC of human and machine face recognition performances on our human experiment dataset for the challenging FR tasks, on average and at different performance levels.

Comparing the dCNN to the human clusters at the different performance levels, Figure 1 shows that the dCNN has a performance similar to the low-performers. It also approximates the performance of the mid-performers in the high FAR range (FAR > 30%). The high-performers show a clear superiority in performance over the dCNN. These findings reveal that although dCNN models have shown superiority over humans on datasets of still images (Parkhi, Vedaldi, and Zisserman 2015), there is still a large gap between humans and machines in FR performance on challenging natural videos. Learning the cognitive capabilities and attention strategies of the high-performers might lead to the development of superior CV algorithms for FR in natural conditions.
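The following sketch illustrates how an ROC of CVR versus FAR can be swept from per-pair scores and Genuine/Foil labels, in the spirit of the procedure above. It is an assumption-laden illustration rather than the exact protocol of (Best-Rowden et al. 2014); pair_scores and is_genuine are hypothetical inputs (averaged human confidences per pair, or negated dCNN distances).

```python
import numpy as np

def roc_points(pair_scores, is_genuine):
    """Return (FAR, CVR) points for every threshold over the pair scores."""
    pair_scores = np.asarray(pair_scores, dtype=float)
    is_genuine = np.asarray(is_genuine, dtype=bool)
    fars, cvrs = [], []
    for t in np.sort(np.unique(pair_scores)):
        accepted = pair_scores >= t            # pairs declared 'same identity'
        far = np.mean(accepted[~is_genuine])   # Foil pairs wrongly accepted
        cvr = np.mean(accepted[is_genuine])    # Genuine pairs correctly accepted
        fars.append(far)
        cvrs.append(cvr)
    return np.array(fars), np.array(cvrs)
```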

Table 1: The proposed dataset organization and frame matching statistics for machine evaluation.

    Gallery images (Training)         1,623 images   30 subjects
    Probe Genuine images (Testing)    2,552 images   30 subjects
    Probe Imposter images (Testing)   2,316 images   30 subjects
    No. of Genuine matches          141,577
    No. of Imposter matches       3,758,868

Human Attention Analysis

In the experiments described above, eye-gaze fixation data of the 42 subjects on the videos of frontal faces were collected. Among them, the records from 26 subjects are valid. The remaining subjects were excluded because either their fixation data were incomplete or their FR accuracy rates were lower than chance, i.e. less than 50%. The fixation records were used to analyze human attention and eye gaze patterns in FR. Three clusters of performance levels (i.e. high, mid and low performers) were created for these subjects using the Kmeans clustering algorithm (Arthur and Vassilvitskii 2007; Lloyd 1982) according to their FR performance scores. We extracted the fixation data of these subjects and mapped them to image locations on the corresponding frames of each video. Two methods were then used to generate and analyze the eye gaze patterns.

Eye Gaze Analysis Methods

Manual Approach. An eye gaze pattern for each subject on a video was generated by projecting all the fixation points on its consecutive frames onto the first frame of that video (see Figure 2 (a)). This was done by manually annotating 3 facial points (the centres of the eyes and the tip of the nose) on each frame of the video. The mean coordinate of these 3 points was then calculated and defined as the reference point of the frame. The reference point of each consecutive frame was then aligned to the reference point of the first frame. A shift vector between the two reference points was generated and used to recover the face displacement through the video caused by head movement. The fixation point on each frame was then projected onto the first frame according to the shift vector.

To analyze the eye gaze pattern, we followed the same strategy as in (Shipley and Kellman 2001). We manually partitioned the face in the first video frame into 9 non-overlapping regions of interest (ROIs), consisting of the forehead, left eye, right eye, nose, left cheek, right cheek, mouth, chin, and hair (see Figure 2 (b)). Then, we assigned each fixation point of the eye gaze to one of these ROIs according to its aligned position on the first frame.

Figure 2: Eye gaze analysis using the manual approach. a) Example of an eye gaze pattern, represented by fixation points; the numbers refer to the order of these fixations. The filled circles are the annotation points. b) Face partitioning and fixation point accumulation.
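A minimal sketch of the projection step of the manual approach is given below, under the assumption that the three annotated points per frame and the raw fixation coordinates are already available as arrays. The variable names are hypothetical, and the ROI test uses simple rectangles purely for illustration, whereas the actual face partitioning in Figure 2 (b) was drawn by hand.

```python
import numpy as np

def project_fixations(annotations, fixations):
    """annotations: (n_frames, 3, 2) manually marked points per frame
                    (eye centres and nose tip).
    fixations:      (n_frames, 2) fixation point per frame (image coords).
    Returns the fixations re-expressed in the coordinates of frame 0."""
    ref_points = annotations.mean(axis=1)      # per-frame reference point
    shifts = ref_points[0] - ref_points        # shift vector back to frame 0
    return fixations + shifts                  # compensate head movement

def assign_to_roi(point, rois):
    """rois: {name: (x0, y0, x1, y1)} rectangular stand-ins for the 9 regions.
    Returns the first ROI containing the projected fixation point, if any."""
    for name, (x0, y0, x1, y1) in rois.items():
        if x0 <= point[0] <= x1 and y0 <= point[1] <= y1:
            return name
    return None
```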

Automatic Approach. The manual operations in the previous approach, such as facial point annotation and face region partitioning, require huge labour efforts if the scale of the video data is large. In addition, subjective biases may be introduced. Previous literature reported that manual marking of ROIs leads to subjective biases in the selection of facial regions, sizes and positions (Hessels et al. 2015). We therefore propose an automatic approach to save time and effort in similar studies and to avoid the biases caused by subjectivity in manual annotations.

In the automatic approach, the eye gaze pattern is generated as a probability distribution of fixation points over ROIs. First, the face in each frame is automatically detected using the Viola-Jones face detector (Viola and Jones 2004), see Figure 3. Then, facial feature points are extracted by applying the method in (Asthana et al. 2014). From the extracted facial feature points, we compute the centre points (i.e. the mean coordinates of the corresponding extracted facial feature points) of ROIs such as the eyes, eyebrows, nose and mouth (see Figure 3).

Figure 3: Eye gaze analysis using the automatic approach: automatic face detection using the Viola-Jones method (Viola and Jones 2004) and extraction of the eye, eyebrow, nose and mouth regions using the method in (Asthana et al. 2014). The cheeks, chin and forehead are derived according to their relations with the extracted ROIs in size and location. Centre points of the ROIs are shown as + and the fixation point on the same video frame is shown as a filled circle.

Other ROI regions, like the cheeks, forehead and chin, are derived according to their relations with the extracted ROIs in size and location. The probability of assigning a fixation point to an ROI is computed based on the normalized Euclidean distance between the fixation point x_FP and the centre point x_r of the ROI on that frame. It is defined as

    P_r(x_FP) = exp( −‖x_FP − x_r‖² / s² )    (1)

where the subscript r indicates the r-th ROI, and s is a scalar used for normalization (the distance between the centres of the two eyes is used in this study). The fixation point is then assigned to the ROI with the highest probability value.
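The sketch below applies Eq. (1) to a single fixation point: each ROI centre receives a Gaussian-like weight based on its squared distance to the fixation, normalized by the inter-ocular distance, and the fixation is assigned to the ROI with the highest value. The ROI centres shown are made-up coordinates for illustration only; in the actual pipeline they come from the detected landmarks and the derived regions.

```python
import numpy as np

def assign_fixation(fix_point, roi_centres, s):
    """fix_point:   (x, y) fixation on the current frame.
    roi_centres:    {roi_name: (x, y)} centre point of each ROI on that frame.
    s:              normalization scalar (here the inter-ocular distance).
    Returns the ROI with the highest probability under Eq. (1)."""
    fix_point = np.asarray(fix_point, dtype=float)
    probs = {}
    for name, centre in roi_centres.items():
        d2 = np.sum((fix_point - np.asarray(centre, dtype=float)) ** 2)
        probs[name] = np.exp(-d2 / s ** 2)      # P_r(x_FP) of Eq. (1)
    return max(probs, key=probs.get), probs

# Made-up geometry for a quick check of the assignment rule
centres = {"left_eye": (110, 120), "right_eye": (170, 120),
           "nose": (140, 160), "mouth": (140, 200)}
s = np.linalg.norm(np.subtract(centres["right_eye"], centres["left_eye"]))
roi, _ = assign_fixation((150, 165), centres, s)
print(roi)   # -> 'nose' for this made-up geometry
```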
Results and Discussions

After assigning each fixation point to one of the ROIs, the data were accumulated to generate two fixation distribution patterns over the ROIs. The first is the mean percentage entered (MPE), which is the average percentage of face stimuli in which subjects fixated on a specific ROI at least once during the FR tasks, with respect to the number of all face stimuli. The second is the average time duration (ATD), which is the average time the subjects spent fixating on each ROI (Shipley and Kellman 2001; Henderson, Williams, and Falk 2005). In general, the obtained patterns revealed that the subjects frequently paid attention to the two eyes, the nose and the mouth. These findings are consistent with studies on still images (Shipley and Kellman 2001; Henderson, Williams, and Falk 2005). The ROIs of the nose, mouth and two eyes attracted the longest fixation durations, respectively, which is consistent with (Tan, Sheppard, and Stephen 2015). Consistency with other studies such as (Kelly et al. 2011; Chan et al. 2015) is also observed.

To validate the proposed automatic approach, the correlation between the two approaches on the fixation distribution patterns was evaluated. We found a high correlation between the corresponding patterns (R² = 0.75 for MPE and R² = 0.67 for ATD). Individual t-tests for each facial region separately revealed insignificant differences between the two approaches for all regions in the distribution patterns. This means that the proposed automatic approach is reliable and can be used in similar studies.

MPE and ATD patterns were generated for the three clusters of performance levels (i.e. high-performers, mid-performers and low-performers). Figure 4 shows the MPE and ATD of the high-performers. The subjects of this group tend to fixate frequently on the nose, mouth and two eyes. Moreover, their fixations on the chin stay longer than those of the other groups.

Figure 4: Fixation distribution patterns of the high-performers. a) Mean percentage entered, the average percentage of face stimuli in which subjects fixated on a specific ROI at least once during the FR tasks, with respect to the number of all face stimuli. b) Average time duration the subjects spent fixating on each region.
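For completeness, a small sketch of how the two statistics defined above (MPE and ATD) can be computed from per-stimulus ROI assignments for one subject. The nested input structure and field names are hypothetical; group-level patterns would be obtained by averaging these values across the subjects of a cluster.

```python
import numpy as np

def mpe_atd(fixations, roi_names):
    """fixations: {stimulus_id: [(roi_name, duration_ms), ...]} for one subject.
    Returns per-ROI MPE (percent of stimuli entered) and ATD (mean duration)."""
    n_stimuli = len(fixations)
    entered = {r: 0 for r in roi_names}      # stimuli in which ROI was fixated
    durations = {r: [] for r in roi_names}   # fixation durations per ROI
    for fix_list in fixations.values():
        seen = set()
        for roi, dur in fix_list:
            seen.add(roi)
            durations[roi].append(dur)
        for roi in seen:
            entered[roi] += 1
    mpe = {r: 100.0 * entered[r] / n_stimuli for r in roi_names}
    atd = {r: float(np.mean(durations[r])) if durations[r] else 0.0
           for r in roi_names}
    return mpe, atd
```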

Conclusions

In this study, we conducted a psycho-physics experiment on human face recognition on challenging natural videos to analyze human performance and attention to faces using eye gaze patterns. An experimental comparison between humans and a state-of-the-art CV algorithm on natural videos for a challenging face recognition task was conducted. The results revealed that machine performance was at the same level as the low-performers (subjects who achieved lower performance in our human experiment). There was a large gap between the machine and the mid-performers (subjects who approached the average performance in our human experiment). The gap was further enlarged when compared to the high-performers (subjects who achieved higher performance in our human experiment). These findings suggest that machine FR might be improved if we could learn from the cognitive capabilities and attention of the high-performers and discover helpful cues to develop superior algorithms. Human attention in terms of eye gaze patterns was analyzed in detail. An automatic computational approach for eye gaze analysis was proposed, which was found to be reliable and to significantly reduce the time, labour effort and effects of subjectivity compared with the existing methods based on manual annotations. The results of the eye gaze analysis revealed a preference for subjects to fixate their attention on facial regions such as the nose, mouth and two eyes. We found that the high-performers additionally showed fixation interest in the chin region. We hope that these findings provide helpful cues for the further development of computational face recognition technology that can match or exceed human capabilities.

References

Arthur, D., and Vassilvitskii, S. 2007. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 1027–1035. Society for Industrial and Applied Mathematics.
Asthana, A.; Zafeiriou, S.; Cheng, S.; and Pantic, M. 2014. Incremental face alignment in the wild. In CVPR, 1859–1866. IEEE.
Best-Rowden, L.; Bisht, S.; Klontz, J. C.; and Jain, A. K. 2014. Unconstrained face recognition: Establishing baseline human performance via crowdsourcing. In IJCB, 1–8. IEEE.
Chan, C. Y.; Chan, A. B.; Lee, T. M.; and Hsiao, J. H. 2015. Eye movement pattern in face recognition is associated with cognitive decline in the elderly.
Dhamecha, T. I.; Singh, R.; Vatsa, M.; and Kumar, A. 2014. Recognizing disguised faces: Human and machine evaluation. PLoS ONE 9(7):e99212.
Grainger, S. A.; Henry, J. D.; Phillips, L. H.; Vanman, E. J.; and Allen, R. 2015. Age deficits in facial affect recognition: The influence of dynamic cues. The Journals of Gerontology Series B: Psychological Sciences and Social Sciences gbv100.
Henderson, J. M.; Williams, C. C.; and Falk, R. J. 2005. Eye movements are functional during face learning. Memory & Cognition 33(1):98–106.
Hessels, R. S.; Kemner, C.; van den Boomen, C.; and Hooge, I. T. 2015. The area-of-interest problem in eyetracking research: A noise-robust solution for face and sparse stimuli. Behavior Research Methods 1–19.
Hills, P. J., and Lewis, M. B. 2006. Reducing the own-race bias in face recognition by shifting attention. The Quarterly Journal of Experimental Psychology 59(6):996–1002.
Hu, G.; Yang, Y.; Yi, D.; Kittler, J.; Christmas, W.; Li, S. Z.; and Hospedales, T. 2015. When face recognition meets with deep learning: An evaluation of convolutional neural networks for face recognition. In Proceedings of the IEEE International Conference on Computer Vision Workshops, 142–150.
Ito, H., and Sakurai, A. 2014. Familiar and unfamiliar face recognition in a crowd. Psychology 5(9):1011.
Kelly, D. J.; Liu, S.; Rodger, H.; Miellet, S.; Ge, L.; and Caldara, R. 2011. Developing cultural differences in face processing. Developmental Science 14(5):1176–1184.
Knight, B., and Johnston, A. 1997. The role of movement in face recognition. Visual Cognition 4(3):265–273.
Liao, Q.; Leibo, J. Z.; Mroueh, Y.; and Poggio, T. 2014. Can a biologically-plausible hierarchy effectively replace face detection, alignment, and recognition pipelines? arXiv:1311.4082.
Little, A. C.; Jones, B. C.; and DeBruine, L. M. 2011. The many faces of research on face perception. Philosophical Transactions of the Royal Society of London B: Biological Sciences 366(1571):1634–1637.
Lloyd, S. 1982. Least squares quantization in PCM. IEEE Transactions on Information Theory 28(2):129–137.
Mandal, B.; Lim, R.; Dai, P.; Ragab, M.; Li, L.; and Lim, J. 2015. Trends in machine and human face recognition. In Advances in Face Detection and Facial Image Analysis. Springer.
O'Toole, A. J.; Phillips, P. J.; Jiang, F.; Ayyad, J.; Penard, N.; and Abdi, H. 2007. Face recognition algorithms surpass humans matching faces over changes in illumination. IEEE Transactions on PAMI 29(9):1642–1646.
Parkhi, O. M.; Vedaldi, A.; and Zisserman, A. 2015. Deep face recognition. In British Machine Vision Conference, volume 1, 6.
Rice, A.; Phillips, P. J.; Natu, V.; An, X.; and O'Toole, A. J. 2013. Unaware person recognition from the body when face identification fails. Psychological Science 0956797613492986.
Schultz, J.; Brockhaus, M.; Bülthoff, H. H.; and Pilz, K. S. 2012. What the human brain likes about facial motion. Cerebral Cortex bhs106.
Shipley, T., and Kellman, P. 2001. Gaze control for face learning and recognition by humans and machines. In From Fragments to Objects: Segmentation and Grouping in Vision, 463.
Sinha, P.; Balas, B.; Ostrovsky, Y.; and Russell, R. 2005. Face recognition by humans: 20 results all computer vision researchers should know about. Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA.
Suneetha, J. 2014. A survey on video-based face recognition approaches. IJAIEM 3(2).
Tan, C. B.; Sheppard, E.; and Stephen, I. D. 2015. Recognizing dynamic faces in Malaysian Chinese participants. Perception.
Viola, P., and Jones, M. J. 2004. Robust real-time face detection. International Journal of Computer Vision 57(2):137–154.

Wolf, L.; Hassner, T.; and Maoz, I. 2011. Face recognition in unconstrained videos with matched background similarity. In CVPR, 529–534. IEEE.
Wolf, L. 2014. DeepFace: Closing the gap to human-level performance in face verification. In CVPR. IEEE.
Xiao, N. G.; Quinn, P. C.; Ge, L.; and Lee, K. 2013. Elastic facial movement influences part-based but not holistic processing. Journal of Experimental Psychology: Human Perception and Performance 39(5):1457.
