Mean-Variance Loss for Deep Age Estimation from a Face
1,2 ,1 1,2,3 1,2 Hongyu Pan , Hu Han∗ , Shiguang Shan , and Xilin Chen 1Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, 100190, China 2University of Chinese Academy of Sciences, Beijing, 100049, China 3CAS Center for Excellence in Brain Science and Intelligence Technology [email protected], {hanhu,sgshan,xlchen}@ict.ac.cn
Abstract before mean-variance loss after
Age estimation has wide applications in video surveil- Estimated age distribution Estimated age distribution (µ == 15 15.5,.5, σ =4.4) = 4.4 (µ = 18 18,.0, σ =3 =.0) 3.0 lance, social networking, and human-computer interac- Probabolity tion. Many of the published approaches simply treat age Probabolity estimation as an exact age regression problem, and thus do 0 5 10 15 20 25 30 35 40 0 5 10 15 20 25 30 35 40 Age not leverage a distribution’s robustness in representing la- age: 18 Age bels with ambiguity such as ages. In this paper, we propose Figure 1. An example of age distribution learning using the pro- posed mean-variance loss. Our mean-variance loss aims to learn a new loss function, called mean-variance loss, for robust an age distribution which has not only a mean value close to the age estimation via distribution learning. Specifically, the ground-truth age (red dotted line), but also a concentrated shape. mean-variance loss consists of a mean loss, which penalizes difference between the mean of the estimated age distribu- tion and the ground-truth age, and a variance loss, which bor [8], and SIFT [32]. In recent years, CNNs have shown penalizes the variance of the estimated age distribution to great success on various computer vision tasks, such as face ensure a concentrated distribution. The proposed mean- recognition [44, 36], object detection [37, 31], and scene variance loss and softmax loss are jointly embedded into segmentation [19]. Age estimation using CNNs has also Convolutional Neural Networks (CNNs) for age estimation. obtained increasing attentions [23, 5], particularly with the Experimental results on the FG-NET, MORPH Album II, promotion of the ChaLearn looking at people challenge1. CLAP2016, and AADB databases show that the proposed Existing approaches for age estimation can be grouped approach outperforms the state-of-the-art age estimation into three categories: classification based methods, regres- methods by a large margin, and generalizes well to image sion based methods, and ranking based methods. Classifica- 1 aesthetics assessment. tion based methods are often used to estimate the age group of the subject in a face image [29, 48], which treat different ages or age groups as independent classes; therefore, the 1. Introduction costs of classifying a young subject as middle-aged subject and old subject are the same. Apparently, such a modeling Age estimation from facial images has broad application method is not optimum for the age estimation task. scenarios, such as video surveillance, social networking, Regression based methods are widely used to estimate and human-computer interaction. Studies on age estimation the exact age of the subject in a face image [13, 7, 45]. from face images can be dated back to 1994 [28], in which Many of the existing regression based methods use a Eu- different facial regions and skin wrinkles were used to esti- clidean loss (L loss), which penalizes the differences be- mate age group. A widely used feature representation by the 2 tween the estimated ages and the ground-truth ages. How- early age estimation approaches is the biologically inspired ever, this type of loss defined based on a single image does features (BIF) [14], which reported much better accuracy not explicitly make use of the ordinal relationship among than the traditional descriptors, such as LBP [34], Ga- face images with individual ages.
∗H. Han is the corresponding author. In recent years, a few ranking based methods were pro- 1We plan to put the code into public domain: http://www. escience.cn/people/hhan/publication.html 1http://chalearnlap.cvc.uab.es
15285 posed for age estimation from a face image [4, 2, 5]. These age estimation. AAM was also used in [29], and multiple approaches treat the age value as a rank ordered data, and classifiers, such as shortest distance, quadratic functions, use multiple binary classifiers to determine the rank of the and artificial neural networks were used for age estimation. age in a face image. Different from the L2 loss commonly Guo et al. [14] used multi-directional and multi-scale Ga- used in regression methods, ranking based methods could bor filters followed by feature pooling to extract BIF fea- explicitly make use of the ordinal relationship among face tures for age estimation. While BIF was reported to have images with individual ages. promising age estimation results on several public-domain Despite a large amount of work on age estimation, their face databases such as FG-NET [35], MORPH II [25], it accuracy in unconstrained scenarios is still not sufficiently remains a hand-crafted feature representation, and thus may high in real application scenarios. This is due to the nature not be optimum for the age estimation task. of the complicated face aging processing caused by both With the great success of deep learning methods in a internal factors such as gene and external factors, such as number of computer vision tasks, such as object detection living environment and lifestyle, as well as the ambiguity [37, 31, 19], image classification [27, 41], and face recog- issue in the age label space. For age estimation by hu- nition [44, 22], deep learning methods are also being used mans, it is relatively easy to give an age estimate with a in age estimation. Similar to [28], Yi et al. [49] used CNNs particular confidence interval, such as a Gaussian distribu- models to extract features from several facial regions, and tion with a particular mean age and a standard deviation used a square loss for age estimation. Niu et al. [33] utilized (see Fig. 1). Inspired by this observation, in this paper, we the ordinal information of ages to learn a network with mul- propose a mean-variance loss for age estimation which pe- tiple binary outputs for age estimation. Ordinal informa- nalizes not only the difference between the mean of an es- tion was also used in [5] to learn multiple binary CNNs, timated age distribution and the ground-truth age, but also and the outputs were aggregated. Rothe et al. [38] used the the variance of the estimated age distribution. As a result, weights of the softmax classifier to calculate a weighted av- the estimated age distribution is expected to have a mean erage age for age estimation, which was found to have bet- value as close to the ground-truth age as possible, and take ter performance than using softmax for age classification. a concentrated distribution as sharp as possible. The pro- Han et al. [16, 43] proposed an effective deep multi-task posed approach is evaluated on a number of challenging learning approach for joint estimation of a large number databases (e.g., FG-NET [35], MORPH Album II [25] and of attributes, which consists of shared feature learning and CLAP2016 [6]), and it achieves much better age estimation attribute group specific feature learning. Yang et al. [46] accuracy than the state-of-the-art methods. The main con- and Huo et al. [23] proposed to perform age estimation via tributions of this work are three-fold: (i) Different from the distribution learning, in which each age was represented as existing methods which aim to estimate an exact age for a a distribution, and KL divergence was used to measure the face image [14, 17, 49], we propose a new loss, named as similarity between the estimated and ground-truth distribu- mean-variance loss, aiming at the estimate of an age dis- tions. However, in real applications, the mean and variance tribution with its mean as close to the ground-truth age as of a distribution are usually not available for a face image possible, and its variance as small as possible; (ii) Different except for apparent ages collected via crowdsourcing. from the age distribution learning methods such as [12, 46], the proposed approach does not require that each training 2.2. Distribution Learning image must have a mean age and a variance (neither real In our approach, we also use the weights of softmax to nor assumed) labels during model training, but it can still calculate a weighted average age, but different from [38], give a distribution estimate for a face image; (iii) The pro- we use the method in both the training and the testing posed loss can be easily embedded into different CNNs, and phases. Besides, we penalize the variance so that the es- the network can be optimized via SGD [30] end-to-end. timated age distribution could take a sharp shape, which is important to obtain an age estimate as accurate as possi- 2. Related Work ble. In addition, the proposed approach does not require 2.1. Age Estimation each training face image to have a mean age and a variance value. Kwon and Lobo [28] did the very early work on age Distribution learning is proposed to address problems estimation from a face, in which the ages are divided into due to label ambiguity [10]. Different from single label only three groups (i.e., babies, young adults, and senior learning or multi-label learning [42], which assigns a single adults). After that, age estimation from a face image label or multiple labels to an object, distribution learning has attracted increasing attention. Lanitis et al. [25] em- assigns a label distribution to an object. Compared to single ployed an Active Appearance Model (AAM) [35] to com- label learning and multi-label learning, distribution learning bine the shape and texture (e.g., wrinkles) information for is able to leverage the relative relationship of a sequence of
5286
softmax 0 � 1� � loss 2 3 … 10 11 12 convolution Fully connected softmax 13 … … … … 14 mean loss 15 �� = ∗ �� 16 17 18 … 97 98 variance ��� � �� = 99 2 loss 100 − �� ∗ �� predicted age =