Circle Loss: a Unified Perspective of Pair Similarity Optimization

Circle Loss: A Unified Perspective of Pair Similarity Optimization Yifan Sun1,∗ Changmao Cheng1,∗ Yuhan Zhang2,∗ Chi Zhang1, Liang Zheng3, Zhongdao Wang4, Yichen Wei1† 1MEGVII Technology 2Beihang University 3Australian National University 4Tsinghua University {peter, chengchangmao, zhangchi, weiyichen}@megvii.com Abstract 1 1 This paper provides a pair similarity optimization view- T B T B point on deep feature learning, aiming to maximize the T’ T’ C C within-class similarity sp and minimize the between-class 푠" 푠" similarity sn. We find a majority of loss functions, in- cluding the triplet loss and the softmax cross-entropy loss, A A embed sn and sp into similarity pairs and seek to reduce 푠 (s s ). Such an optimization manner is inflexible, be- 0! 1 0푠! 1 n p (a) (b) cause− the penalty strength on every single similarity score is restricted to be equal. Our intuition is that if a similarity score deviates far from the optimum, it should be empha- Figure 1: Comparison between the popular optimization manner of reducing (sn sp) and the proposed optimization sized. To this end, we simply re-weight each similarity to − manner of reducing (αnsn αpsp). (a) Reducing (sn sp) highlight the less-optimized similarity scores. It results in − − a Circle loss, which is named due to its circular decision is prone to inflexible optimization (A, B and C all have boundary. The Circle loss has a unified formula for two equal gradients with respect to sn and sp), as well as am- ′ elemental deep feature learning paradigms, i.e., learning biguous convergence status (both T and T on the decision boundary are acceptable). (b) With (αnsn αpsp), the Cir- with class-level labels and pair-wise labels. Analytically, − we show that the Circle loss offers a more flexible optimiza- cle loss dynamically adjusts its gradients on sp and sn, and tion approach towards a more definite convergence target, thus benefits from a flexible optimization process. For A, it emphasizes on increasing sp; for B, it emphasizes on reduc- compared with the loss functions optimizing (sn sp). Ex- perimentally, we demonstrate the superiority of− the Circle ing sn. Moreover, it favors a specified point T on the circu- loss on a variety of deep feature learning tasks. On face lar decision boundary for convergence, setting up a definite recognition, person re-identification, as well as several fine- convergence target. grained image retrieval datasets, the achieved performance is on par with the state of the art. between-class similarity sn, as well as to maximize within- class similarity sp. From this viewpoint, we find that many popular loss 1. Introduction functions (e.g., triplet loss [9, 22], softmax cross-entropy loss and its variants [25, 16, 36, 29, 32, 2]) share a similar This paper holds a similarity optimization view towards optimization pattern. They all embed sn and sp into sim- two elemental deep feature learning paradigms, i.e., learn- ilarity pairs and seek to reduce (sn sp). In (sn sp), − − ing from data with class-level labels and from data with increasing sp is equivalent to reducing sn. We argue that pair-wise labels. The former employs a classification loss this symmetric optimization manner is prone to the follow- function (e.g., softmax cross-entropy loss [25, 16, 36]) to ing two problems. optimize the similarity between samples and weight vec- Lack of flexibility for optimization. The penalty tors. The latter leverages a metric loss function (e.g., triplet • strength on sn and sp is restricted to be equal. Given the loss [9, 22]) to optimize the similarity between samples. In specified loss functions, the gradients with respect to sn our interpretation, there is no intrinsic difference between and sp are of same amplitudes (as detailed in Section 2). these two learning approaches. They both seek to minimize In some corner cases, e.g., sp is small and sn already ap- ∗Equal contribution. proaches 0 (“A” in Fig. 1 (a)), it keeps on penalizing sn †Corresponding author. with a large gradient. It is inefficient and irrational. 6398 Ambiguous convergence status. Optimizing (sn sp) slight modifications. • − usually leads to a decision boundary of sp sn = m (m is the margin). This decision boundary allows− ambiguity We conduct extensive experiments on a variety of deep • feature learning tasks, e.g. face recognition, person re- (e.g.,“T ” and “T ′” in Fig. 1 (a)) for convergence. For ex- identification, car image retrieval and so on. On all ample, T has s ,s = 0.2, 0.5 and T ′ has s′ ,s′ = n p n p these tasks, we demonstrate the superiority of Circle 0.4, 0.7 . They{ both} obtain{ the margin} m = 0.3{. However,} comparing{ } them against each other, we find the gap between loss with performance on par with the state of the art. ′ sn and sp is only 0.1. Consequently, the ambiguous conver- 2. A Unified Perspective gence compromises the separability of the feature space. With these insights, we reach an intuition that different Deep feature learning aims to maximize the within-class similarity scores should have different penalty strength. If similarity sp, as well as to minimize the between-class sim- a similarity score deviates far from the optimum, it should ilarity sn. Under the cosine similarity metric, for example, receive a strong penalty. Otherwise, if a similarity score we expect sp 1 and sn 0. already approaches the optimum, it should be optimized To this end,→learning with→ class-level labels and learn- mildly. To this end, we first generalize (sn sp) into ing with pair-wise labels are two elemental paradigms. − (αnsn αpsp), where αn and αp are independent weight- They are conventionally considered separately and signif- − ing factors, allowing sn and sp to learn at different paces. icantly differ from each other w.r.t to the loss functions. We then implement αn and αp as linear functions w.r.t. sn Given class-level labels, the first one basically learns to and sp respectively, to make the learning pace adaptive to classify each training sample to its target class with a clas- the optimization status: The farther a similarity score de- sification loss, e.g. L2-Softmax [21], Large-margin Soft- viates from the optimum, the larger the weighting factor max [15], Angular Softmax [16], NormFace [30], AM- will be. Such optimization results in the decision boundary Softmax [29], CosFace [32], ArcFace [2]. These methods αnsn αpsp = m, yielding a circle shape in the (sn,sp) are also known as proxy-based learning, as they optimize − space, so we name the proposed loss function Circle loss. the similarity between samples and a set of proxies rep- Being simple, Circle loss intrinsically reshapes the char- resenting each class. In contrast, given pair-wise labels, acteristics of the deep feature learning from the following the second one directly learns pair-wise similarity (i.e., the three aspects: similarity between samples) in the feature space and thus First, a unified loss function. From the unified simi- requires no proxies, e.g., constrastive loss [5, 1], triplet larity pair optimization perspective, we propose a unified loss [9, 22], Lifted-Structure loss [19], N-pair loss [24], His- loss function for two elemental learning paradigms, learn- togram loss [27], Angular loss [33], Margin based loss [38], ing with class-level labels and with pair-wise labels. Multi-Similarity loss [34] and so on. Second, flexible optimization. During training, the This paper views both learning approaches from a uni- gradient back-propagated to sn (sp) will be amplified by fied perspective, with no preference for either proxy-based αn (αp). Those less-optimized similarity scores will have or pair-wise similarity. Given a single sample x in the fea- larger weighting factors and consequentially get larger gra- ture space, let us assume that there are K within-class sim- dients. As shown in Fig. 1 (b), the optimization on A, B ilarity scores and L between-class similarity scores associ- i and C are different to each other. ated with x. We denote these similarity scores as sp (i = j { } Third, definite convergence status. On the circular de- 1, 2, ,K) and sn (j = 1, 2, ,L), respectively. ··· { } j ··· i cision boundary, Circle loss favors a specified convergence To minimize each sn as well as to maximize sp, ( i status (“T ” in Fig. 1 (b)), as to be demonstrated in Sec- 1, 2, ,K , j 1, 2, ,L ), we propose a unified∀ ∈ tion 3.3. Correspondingly, it sets up a definite optimization loss{ function··· } by:∀ ∈ { ··· } target and benefits the separability. K L L j − i The main contributions of this paper are summarized as uni = log 1+ exp(γ(sn sp + m)) i=1 j=1 follows: h X X i (1) L K j − i We propose Circle loss, a simple loss function for deep = log 1+ exp(γ(sn + m)) exp(γ( sp)) , • j=1 i=1 feature learning. By re-weighting each similarity score h X X i under supervision, Circle loss benefits the deep feature in which γ is a scale factor and m is a margin for better learning with flexible optimization and definite conver- similarity separation. gence target. Eq. 1 is intuitive. It iterates through every similarity pair j i to reduce (sn sp). We note that it degenerates to triplet We present Circle loss with compatibility to both class- loss or classification− loss, through slight modifications. • level labels and pair-wise labels. Circle loss degener- Given class-level labels, we calculate the similarity ates to triplet loss or softmax cross-entropy loss with scores between x and weight vectors wi (i = 1, 2, , N) ··· 6399 푑퐿 푑퐿 푑퐿 푑퐿 푑퐿 푑퐿 푑푠$ A B 푑푠% A B 푑푠$ A B 푑푠% A B 푑푠$ 푑푠% A B B A 푠 푠 푠 푠 % % % % 푠% 푠% 푠$ 푠$ 푠$ 푠$ 푠$ 푠$ (a) Triplet loss (b) AMSoftmax loss (c) Circle loss Figure 2: The gradients of the loss functions.

Load more