Loss Function Search for Face Recognition

Xiaobo Wang * 1, Shuo Wang * 1, Cheng Chi 2, Shifeng Zhang 2, Tao Mei 1

*Equal contribution. 1 JD AI Research. 2 Institute of Automation, Chinese Academy of Science. Correspondence to: Shifeng Zhang <[email protected]>.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract

In face recognition, designing margin-based (e.g., angular, additive, additive angular margin) softmax loss functions plays an important role in learning discriminative features. However, these hand-crafted heuristic methods are sub-optimal because they require much effort to explore the large design space. Recently, an AutoML method for loss function search, AM-LFS, has been derived, which leverages reinforcement learning to search loss functions during the training process. However, its search space is complex and unstable, which hinders its performance. In this paper, we first show that the key to enhancing feature discrimination is actually how to reduce the softmax probability. We then design a unified formulation for the current margin-based softmax losses. Accordingly, we define a novel search space and develop a reward-guided search method to automatically obtain the best candidate. Experimental results on a variety of face recognition benchmarks demonstrate the effectiveness of our method over the state-of-the-art alternatives.

1. Introduction

Face recognition is a fundamental task of great practical value in the pattern recognition and machine learning communities. It comprises two categories: face identification, which classifies a given face to a specific identity, and face verification, which determines whether a pair of face images belong to the same identity. In recent years, advanced face recognition methods (Simonyan & Andrew, 2014; Guo et al., 2018; Wang et al., 2019b; Deng et al., 2019) have been built upon convolutional neural networks (CNNs), and the learned high-level discriminative features are adopted for evaluation. To train CNNs with discriminative features, the loss function plays an important role.

Generally, the CNNs are equipped with classification loss functions (Liu et al., 2017; Wang et al., 2018f;e; 2019a; Yao et al., 2018; 2017; Guo et al., 2020), metric learning loss functions (Sun et al., 2014; Schroff et al., 2015), or both (Sun et al., 2015; Wen et al., 2016; Zheng et al., 2018b). Metric learning loss functions such as the contrastive loss (Sun et al., 2014) or the triplet loss (Schroff et al., 2015) usually suffer from high computational cost. To avoid this problem, they require well-designed sample mining strategies, so their performance is very sensitive to these strategies. Increasingly, researchers have shifted their attention to constructing deep face recognition models by re-designing the classical classification loss functions.

Intuitively, face features are discriminative if their intra-class compactness and inter-class separability are well maximized. However, as pointed out by (Wen et al., 2016; Liu et al., 2017; Wang et al., 2018b; Deng et al., 2019), the classical softmax loss lacks the power of feature discrimination. To address this issue, Wen et al. (2016) develop a center loss that learns a center for each identity to enhance the intra-class compactness. Wang et al. (2017) and Ranjan et al. (2017) propose to use a scale parameter to control the temperature of the softmax loss, producing higher gradients for the well-separated samples to reduce the intra-class variance. Recently, several margin-based softmax loss functions (Liu et al., 2017; Chen et al., 2018; Wang et al., 2018c;b; Deng et al., 2019) that increase the feature margin between different classes have also been proposed. Chen et al. (2018) insert virtual classes between different classes to enlarge the inter-class margins. Liu et al. (2017) introduce an angular margin (A-Softmax) between the ground truth class and other classes to encourage larger inter-class variance. However, it is usually unstable and the optimal parameters need to be carefully adjusted for different settings. To enhance the stability of the A-Softmax loss, Liang et al. (2017) and Wang et al. (2018b;c) propose an additive margin (AM-Softmax) loss to stabilize the optimization. Deng et al. (2019) develop an additive angular margin (Arc-Softmax) loss, which has a clear geometric interpretation. However, although great progress has been made, all of them are hand-crafted heuristic methods that rely on great effort from experts to explore the large design space, which is usually sub-optimal in practice.

Recently, Li et al. (2019) propose an AutoML method for loss function search (AM-LFS) from a hyper-parameter optimization perspective. Specifically, they formulate the hyper-parameters of loss functions as samples from a parameterized probability distribution and achieve promising results on several different vision tasks. However, they attribute the success of margin-based softmax losses to the relative significance of intra-class distance to inter-class distance, which is not directly used to guide the design of the search space. As a consequence, the search space is complex and unstable, and it is hard to obtain the best candidate.

To overcome the aforementioned shortcomings of both the hand-crafted heuristic methods and the AutoML method AM-LFS, we analyze the success of margin-based softmax losses and conclude that the key to enhancing feature discrimination is to reduce the softmax probability. According to this analysis, we develop a unified formulation and define a novel search space. We also design a new reward-guided schedule to search for the optimal solution. The main contributions of this paper can be summarized as follows:

• We identify that for margin-based softmax losses, the key to enhancing feature discrimination is actually how to reduce the softmax probability. Based on this understanding, we develop a unified formulation for the prevalent margin-based softmax losses, which involves only one parameter to be determined.

• We define a simple but very effective search space, which can sufficiently guarantee the feature discrimination for face recognition. Accordingly, we design a random and a reward-guided method to search for the best candidate. Moreover, for the reward-guided one, we develop an efficient optimization framework to dynamically optimize the distribution from which losses are sampled.

• We conduct extensive experiments on face recognition benchmarks, including LFW, SLLFW, CALFW, CPLFW, AgeDB, CFP, RFW, MegaFace and Trillion-Pairs, which verify the superiority of our new approach over the baseline Softmax loss, the hand-crafted heuristic margin-based Softmax losses, and the AutoML method AM-LFS. To allow more experimental verification, our code is available at http://www.cbsr.ia.ac.cn/users/xiaobowang/.

2.

where $w_k \in \mathbb{R}^d$ is the $k$-th classifier ($k \in \{1, 2, \ldots, K\}$) and $K$ is the number of classes. $x \in \mathbb{R}^d$ denotes the feature belonging to the $y$-th class and $d$ is the feature dimension. In face recognition, the weights $\{w_1, w_2, \ldots, w_K\}$ and the feature $x$ of the last fully connected layer are usually normalized, and their magnitudes are replaced by a scale parameter $s$ (Wang et al., 2017; Deng et al., 2019; Wang et al., 2019b). Consequently, given an input feature vector $x$ with its ground truth label $y$, the original softmax loss Eq. (1) can be re-formulated as follows (Wang et al., 2017):

$$\mathcal{L}_2 = -\log \frac{e^{s\cos(\theta_{w_y,x})}}{e^{s\cos(\theta_{w_y,x})} + \sum_{k \neq y}^{K} e^{s\cos(\theta_{w_k,x})}}, \tag{2}$$

where $\cos(\theta_{w_k,x}) = w_k^T x$ is the cosine similarity and $\theta_{w_k,x}$ is the angle between $w_k$ and $x$. As pointed out by a great many studies (Liu et al., 2016; 2017; Wang et al., 2018b; Deng et al., 2019; Wang et al., 2019b), the features learned with the softmax loss tend to be separable rather than discriminative for face recognition.

Margin-based Softmax. To enhance the feature discrimination for face recognition, several margin-based softmax loss functions (Liu et al., 2017; Wang et al., 2018f;b; Deng et al., 2019) have been proposed in recent years. In summary, they can be defined as follows:

$$\mathcal{L}_3 = -\log \frac{e^{s f(m,\theta_{w_y,x})}}{e^{s f(m,\theta_{w_y,x})} + \sum_{k \neq y}^{K} e^{s\cos(\theta_{w_k,x})}}, \tag{3}$$

where $f(m, \theta_{w_y,x}) \leq \cos(\theta_{w_y,x})$ is a carefully designed margin function. Basically, $f(m_1, \theta_{w_y,x}) = \cos(m_1 \theta_{w_y,x})$ is the motivation of the A-Softmax loss (Liu et al., 2017), where $m_1 \geq 1$ is an integer. $f(m_2, \theta_{w_y,x}) = \cos(\theta_{w_y,x} + m_2)$ with $m_2 > 0$ is the Arc-Softmax loss (Deng et al., 2019). $f(m_3, \theta_{w_y,x}) = \cos(\theta_{w_y,x}) - m_3$ with $m_3 > 0$ is the AM-Softmax loss (Wang et al., 2018c;b). More generally, the margin function can be summarized into a combined version: $f(m, \theta_{w_y,x}) = \cos(m_1 \theta_{w_y,x} + m_2) - m_3$.

AM-LFS. Previous methods rely on hand-crafted heuristics that require much effort from experts to explore the large design space. To address this issue, Li et al. (2019) propose a new AutoML for Loss Function Search (AM-LFS) to automatically determine the search space. Specifically, the formulation of AM-LFS is written as follows:

$$\mathcal{L}_4 = -\log \left( a_i \, \frac{e^{s\cos(\theta_{w_y,x})}}{e^{s\cos(\theta_{w_y,x})} + \sum_{k \neq y}^{K} e^{s\cos(\theta_{w_k,x})}} + b_i \right), \tag{4}$$
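To make the margin-based formulation of Eq. (3) concrete, the following is a minimal single-sample sketch in plain Python, not the authors' implementation. The function names `margin_function` and `margin_softmax_loss`, the default scale `s=32.0`, and the example cosine values are assumptions chosen for illustration. The sketch applies the combined margin $f(m, \theta) = \cos(m_1\theta + m_2) - m_3$ directly and therefore only satisfies $f \leq \cos(\theta)$ while $m_1\theta + m_2$ stays within $[0, \pi]$; handling larger angles is left out.

```python
import math


def margin_function(theta, m1=1, m2=0.0, m3=0.0):
    """Combined margin f(m, theta) = cos(m1 * theta + m2) - m3.

    m1=1, m2=0, m3=0 recovers cos(theta), i.e. Eq. (2).
    """
    return math.cos(m1 * theta + m2) - m3


def margin_softmax_loss(cosines, y, s=32.0, m1=1, m2=0.0, m3=0.0):
    """Eq. (3) for one sample.

    cosines[k] is cos(theta_{w_k, x}) for normalized w_k and x,
    y is the ground-truth class index, s is the scale (assumed value).
    """
    # Recover the ground-truth angle, clamping against rounding error.
    theta_y = math.acos(max(-1.0, min(1.0, cosines[y])))
    target = s * margin_function(theta_y, m1, m2, m3)
    # Non-target classes keep the plain scaled cosine.
    others = [s * c for k, c in enumerate(cosines) if k != y]
    logits = [target] + others
    # Log-sum-exp with max subtraction for numerical stability.
    mx = max(logits)
    lse = mx + math.log(sum(math.exp(v - mx) for v in logits))
    return lse - target  # equals -log(softmax probability of class y)


if __name__ == "__main__":
    cos_sims = [0.8, 0.1, -0.2]  # hypothetical cosine similarities
    print("softmax     :", margin_softmax_loss(cos_sims, 0))
    print("A-Softmax   :", margin_softmax_loss(cos_sims, 0, m1=2))
    print("Arc-Softmax :", margin_softmax_loss(cos_sims, 0, m2=0.5))
    print("AM-Softmax  :", margin_softmax_loss(cos_sims, 0, m3=0.35))
```

Because the loss is the negative log of the ground-truth softmax probability, any margin with $f \leq \cos(\theta)$ lowers that probability and raises the loss relative to Eq. (2), which is consistent with the observation that these losses work by reducing the softmax probability.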
