INTERSPEECH 2020 October 25–29, 2020, Shanghai, China

On Parameter Adaptation in Softmax-based Cross-Entropy Loss for Improved Convergence Speed and Accuracy in DNN-based Speaker Recognition

Magdalena Rybicka and Konrad Kowalczyk

AGH University of Science and Technology, Department of Electronics, 30-059 Krakow, Poland
[email protected], [email protected]

Abstract

In various classification tasks the major challenge lies in generating discriminative representations of classes. By proper selection of the deep neural network (DNN) loss function we can encourage the network to produce embeddings with increased inter-class separation and smaller intra-class distances. In this paper, we develop a softmax-based cross-entropy loss function which adapts its parameters to the current training phase. The proposed solution yields relative improvements of up to 24% in terms of Equal Error Rate (EER) and minimum Detection Cost Function (minDCF). In addition, our proposal also accelerates network convergence compared with other state-of-the-art softmax-based losses. As an additional contribution of this paper, we adopt and subsequently modify the ResNet DNN structure for the speaker recognition task. The proposed ResNet network achieves relative gains of up to 32% and 15% in terms of EER and minDCF, respectively, compared with the well-established Time Delay Neural Network (TDNN) architecture for x-vector extraction.

Index Terms: speaker recognition, deep neural networks, softmax activation functions, speaker embedding, ResNet

1. Introduction

Recently we have seen a rapid increase in the popularity of speaker modeling using deep neural networks (DNNs) [1, 2, 3]. State-of-the-art solutions consist in extracting a speaker embedding from the Time Delay Neural Network (TDNN), commonly referred to as the x-vector [2]. Extensions of the basic TDNN structure have been presented e.g. in [4, 5], while different network structures for the extraction of speaker embeddings have very recently been proposed e.g. in [1, 3, 6, 7, 8, 9]. Finding an appropriate network structure which facilitates notable improvement in speaker recognition performance is thus still an on-going research topic.

In speaker recognition, it is common to use a cross-entropy loss function with softmax activations in the last DNN layer [1, 4]. This loss function is also widely used in other tasks such as image recognition [10] or speech emotion recognition [11]. Recent modifications to such loss functions include increasing inter-class separation by introducing various types of margins in the angular functions [12, 13, 14]; these have been successfully applied in speaker recognition e.g. in [15, 16]. The convergence of the training process and the resulting model performance strongly depend on the selection of hyperparameters of the modified loss function, which often need to be tuned by repeating the training with different hyperparameter values. To address this issue, a cosine-based loss function called AdaCos has been proposed in [17] for the face recognition application, which adapts the scale parameter in the angular softmax representation to improve training effectiveness.

In this paper, we aim to develop a softmax-based cross-entropy loss function which adapts its hyperparameters so that they strengthen supervision at different neural network training phases. The proposed approach adapts the scale and additive angular margin parameters in a joint softmax-based cross-entropy loss function, resulting in a notable improvement in the convergence speed of network training and in speaker recognition accuracy. In addition, we propose a modification of the Residual Network (ResNet) [18] architecture which shows significant improvements in accuracy over the standard TDNN architecture. In evaluations, we compare the proposed parameter adaptation (ParAda) in the softmax-based loss function in terms of convergence speed and accuracy for two speaker recognition systems based on the TDNN and the proposed modified ResNet.

This research received financial support from the Foundation for Polish Science under grant number First TEAM/2017-3/23 (POIR.04.04.00-00-3FC4/17-00), which is co-financed by the European Union, and was supported in part by PLGrid Infrastructure.

2. Softmax-based cross-entropy loss functions

In this section, we present an overview of several recently proposed modifications of the standard cross-entropy loss with softmax activation functions for DNN training, and then we propose a softmax-based loss function with adaptive parameters (ParAda) which improves the discriminative capabilities of the network and accelerates its convergence. Let us first observe that the dot product between the softmax input vector and the weight vector can be written as w_k^T x_i = ||w_k|| ||x_i|| cos(θ_{i,k}), where ||w_k|| denotes the norm of the weight vector for the kth class, k = 1, 2, ..., K, with K being the number of speakers in the entire training set, ||x_i|| denotes the norm of the input vector for the ith minibatch example, i = 1, 2, ..., N, with N denoting the minibatch size, and cos(θ_{i,k}) denotes the cosine angular distance between vectors w_k and x_i. Next, we can express the general equation for the softmax-based cross-entropy loss function as

L = -\frac{1}{N} \sum_{i=1}^{N} \log P_{y_i} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{f_{y_i}}}{e^{f_{y_i}} + \sum_{k=1, k \neq y_i}^{K} e^{f_{y_i,k}}},   (1)

where y_i is the ground-truth label of a training example, P_{y_i} denotes the predicted classification probability for the ith minibatch sample, while f_{y_i} and f_{y_i,k} denote the target and non-target logits, given respectively by

f_{y_i} = s(\theta_{y_i}) \, \psi(\theta_{y_i}),   (2)

f_{y_i,k} = s(\theta_{y_i}) \cos(\theta_{y_i,k}).   (3)
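For concreteness, the generalized loss in (1)-(3) can be expressed in a few lines of NumPy. The following is our illustrative sketch, not the authors' TensorFlow implementation; the function and variable names are ours.

```python
import numpy as np

def general_softmax_xent(theta, labels, s, psi):
    """Generalized softmax cross-entropy loss of Eq. (1).

    theta  : (N, K) angles between minibatch embeddings and class weights
    labels : (N,)   ground-truth class indices y_i
    s      : scalar scale s(theta_{y_i})
    psi    : angular function applied to the target angle
    """
    N = theta.shape[0]
    rows = np.arange(N)
    logits = s * np.cos(theta)                           # non-target logits, Eq. (3)
    logits[rows, labels] = s * psi(theta[rows, labels])  # target logits, Eq. (2)
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_p[rows, labels].mean()                   # Eq. (1)
```

Setting psi = np.cos recovers the standard softmax loss, while e.g. psi = lambda t: np.cos(t) - 0.2 would give the additive-margin variant of (4) below.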

In the standard softmax-based cross-entropy loss, ψ(θ_{y_i}) is defined as the cosine of the angle between the ith minibatch input vector and the weight vector corresponding to its ground-truth label. Note that, for the convenience of comparing various softmax modifications, in (2) and (3) we normalize the weight vectors for all classes such that ||w_k|| = 1 and replace the norm of an input vector for the true class ||x_i|| with a new scale variable s(θ_{y_i}).

2.1. State-of-the-art fixed angular and scale functions

There are three types of modifications of the standard angular function ψ(θ_{y_i}) in (2), namely the so-called Angular Softmax (AS) [12], Additive Angular Softmax (AAS) [13], and Additive Margin Softmax (AMS) [14], which can all be presented in the general form

\psi(\theta_{y_i}) = \cos(m_{AS} \, \theta_{y_i} + m_{AAS}) - m_{AMS},   (4)

where m_{AS}, m_{AAS} and m_{AMS} are real numbers for each modification. Although proper setting of these parameters has been shown to improve the accuracy of DNN-based speaker recognition [15, 16], the disadvantage of these approaches is that parameter tuning requires time-consuming repetitions of the network training. Another approach is taken in [17], in which a fixed scale function s(θ_{y_i}) = s_{Fix} is given by the constant

s_{Fix} = \sqrt{2} \, \log(K - 1),   (5)

which depends only on the number of speakers K in the training set and thus avoids scale parameter tuning.

2.2. Adaptation of the scaling parameter

In this section, we discuss a method to adapt the scale parameter s(θ_{y_i}) depending on network convergence at the current training iteration. This method is based on the recently proposed Adaptively Scaling Cosine Logits (AdaCos) introduced in [17] in the context of face recognition, which relies on adapting the scale function s(θ_{y_i}) during network training. As derived in [17], the scale adaptation (SAda) is given by

s_{Ada}(\theta_{y_i}) = \begin{cases} \sqrt{2} \, \log(K - 1), & \text{iter} = 0 \\ \dfrac{\log(B_{SAda})}{\cos(\min(\pi/4, \Theta))}, & \text{iter} \geq 1 \end{cases}   (6)

where Θ = median(θ_{y_1}, θ_{y_2}, ..., θ_{y_N}) denotes the median of the angles θ_{y_i} over the entire minibatch of length N, and iter denotes the iteration index. B_{SAda} denotes the sum of exponential functions of logits for all non-corresponding classes, averaged over the entire minibatch of size N, which is given by

B_{SAda} = \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1, k \neq y_i}^{K} e^{\tilde{s}_{Ada} \cos(\theta_{i,k})},   (7)

where \tilde{s}_{Ada} = s_{Ada}(\text{iter} - 1) denotes the scale parameter value calculated according to (6) in the previous iteration.

2.3. Adaptation of the margin-based angular function

In this section, we propose a method to adapt the margin parameter (MAda) in the angular function depending on the network convergence state at the current training iteration. The considered angular function is given by ψ(θ_{y_i}) = cos(θ_{y_i} + m_{Ada}), where m_{Ada} denotes the adaptive margin parameter. Since we aim to find a margin parameter which significantly changes the predicted classification probability P_{y_i} of all samples in the minibatch, similarly to [17] we calculate the point θ_0 at which the absolute gradient of the predicted probability reaches its maximum. It is found as the point at which the second-order derivative of P_{y_i} is equal to zero, which yields the approximated relation for the angular function

\cos(\theta_0 + m_{Ada}) = \frac{1}{s_m} \log \Big( \sum_{k=1, k \neq y_i}^{K} e^{s_m \cos(\theta_{i,k})} \Big),   (8)

where s_m denotes the fixed scale parameter in margin adaptation and θ_0 ∈ [0, π/2]. In order to reflect the convergence state of the network in the current minibatch, we replace θ_0 with the median of the angles θ_{y_i} over the entire minibatch, i.e., θ_0 = Θ = median(θ_{y_1}, θ_{y_2}, ..., θ_{y_N}), which yields the following update for margin adaptation (MAda):

m_{Ada} = \arccos \Big( \frac{1}{s_m} \log(B_{MAda}) \Big) - \Theta,   (9)

B_{MAda} = \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1, k \neq y_i}^{K} e^{s_m \cos(\theta_{i,k})}.   (10)

2.3.1. Annealing strategy in the margin-based angular function

Similarly to the procedure presented in [15] for stabilizing network training with a fixed margin-based softmax, an annealing strategy can also be incorporated into the proposed softmax function with adaptive margin. The angular function with margin adaptation (MAda) then takes the form

\psi_{MAda}(\theta_{y_i}) = \frac{1}{1 + \gamma} \cos(\theta_{y_i} + m_{Ada}) + \frac{\gamma}{1 + \gamma} \cos(\theta_{y_i}),   (11)

where γ = max{γ_min, γ_b (1 + β · iter)^{−α}}, iter is the training iteration, γ_min is the minimum value, while γ_b, β and α are hyperparameters that control the annealing speed.

2.3.2. Lower bound on the scale parameter value

By noting that the arccos(·) function in (9) is only defined for arguments in the range [−1, 1], and that θ_{i,k} ∈ [0, π/2], we can find the lower bound on the value of the scale parameter in the proposed MAda approach by solving the inequality

-1 \leq \frac{1}{s_m} \log \Big( \sum_{k=1, k \neq y_i}^{K} e^{s_m \cos(\theta_{i,k})} \Big) \leq 1.   (12)

Assuming that all θ_{i,k} are approximately equal, we can set θ_{i,k} = θ̄ for all k, i, and by substituting into (12) and solving we obtain the lower bound on the scale parameter s_m:

s_m \geq [1 - \cos(\bar{\theta})]^{-1} \log(K - 1).   (13)

In an ideal case, after network training, θ̄ → π/2, which would yield s_m → log(K − 1); in the worst case, θ̄ → 0, which would impose s_m → ∞. A reasonable assumption for the practical parameter setting is θ̄ = π/4, taking a slightly higher value of s_m than the one obtained from (13).
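The two adaptation rules can be summarized compactly in NumPy; the sketch below is our reading of (5)-(10), not the authors' code, and the helper names are ours. The clip before arccos is a numerical safeguard; the paper instead guarantees a valid argument through the bound in (13).

```python
import numpy as np

def _avg_nontarget_expsum(theta, labels, scale):
    """Minibatch average of the summed non-target exponentials, i.e. the
    common form of B_SAda in Eq. (7) and B_MAda in Eq. (10)."""
    N, K = theta.shape
    mask = np.ones((N, K), dtype=bool)
    mask[np.arange(N), labels] = False                     # exclude target classes
    e = np.exp(scale * np.cos(theta[mask])).reshape(N, K - 1)
    return e.sum(axis=1).mean()

def sada_update(theta, labels, s_prev, it):
    """AdaCos-style scale adaptation, Eqs. (5)-(7)."""
    N, K = theta.shape
    if it == 0:
        return np.sqrt(2.0) * np.log(K - 1)                # initial value, Eq. (5)
    B = _avg_nontarget_expsum(theta, labels, s_prev)       # Eq. (7)
    theta_med = np.median(theta[np.arange(N), labels])     # minibatch median angle
    return np.log(B) / np.cos(min(np.pi / 4, theta_med))   # Eq. (6)

def mada_update(theta, labels, s_m):
    """Proposed margin adaptation, Eqs. (9)-(10)."""
    N = theta.shape[0]
    B = _avg_nontarget_expsum(theta, labels, s_m)          # Eq. (10)
    theta_med = np.median(theta[np.arange(N), labels])
    arg = np.clip(np.log(B) / s_m, -1.0, 1.0)              # safeguard; cf. Sec. 2.3.2
    return np.arccos(arg) - theta_med                      # Eq. (9)
```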

2.4. Proposed softmax loss with parameter adaptation

This section presents the proposed method for parameter adaptation (ParAda) in the softmax-based cross-entropy loss function. We focus on adapting the scale and margin parameters, which affect the shape of the predicted classification probability P_{y_i} as a function of θ_{y_i}, namely its range and the position of its inflection point. To facilitate the training process, we take the approach of gradually increasing the network training supervision. In the initial training phase, we aim to ease learning by shifting the inflection point towards π/2, so that the probabilities are high even for relatively high values of the angles θ_{y_i}. As learning progresses, the curve adaptively shifts towards lower angles, thereby increasing the discriminative capability of the classification probability. We realize this by emphasizing margin adaptation (which in the adaptive version starts from negative values) and, as training progresses, putting more emphasis on scale adaptation. To achieve this goal, we propose the following modifications of the logits of the softmax-based loss function, which facilitate the parameter adaptation (ParAda):

f_{y_i} = \lambda \, s_m \, \psi_{MAda}(\theta_{y_i}) + (1 - \lambda) \, s_{Ada}(\theta_{y_i}) \cos(\theta_{y_i}),   (14)

f_{y_i,k} = \lambda \, s_m \cos(\theta_{y_i,k}) + (1 - \lambda) \, s_{Ada}(\theta_{y_i}) \cos(\theta_{y_i,k}).   (15)

A strongly desirable, integral property of ParAda is that the value of the adaptive parameter λ(m_{Ada}) can be adjusted depending on the level of convergence of the trained network. The desired smooth transition between margin adaptation (MAda) and scale adaptation (SAda) is obtained when

\lambda(m_{Ada}) = \big[ 1 + e^{a \cdot (m_{Ada} - b)} \big]^{-1},   (16)

where a and b are hyperparameters of λ(m_{Ada}), whose shape for example parameter values is presented in Fig. 1. Parameter a controls the speed of switching between the two loss types, while parameter b shifts the inflection point of the function.

Figure 1: The λ function with respect to the m_{Ada} margin value for b = 0 (left plot) and a = 20 (right plot).
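Putting (11) and (14)-(16) together, a single minibatch of ParAda logits can be sketched as follows; this is our NumPy illustration under the definitions above, with variable names of our choosing.

```python
import numpy as np

def lam(m_ada, a=20.0, b=0.0):
    """Transition function of Eq. (16): close to 1 while the adaptive margin
    is still negative (margin adaptation dominates), decaying towards 0 as
    m_ada grows past b (scale adaptation takes over)."""
    return 1.0 / (1.0 + np.exp(a * (m_ada - b)))

def parada_logits(theta, labels, m_ada, s_m, s_ada, gamma):
    """ParAda logits of Eqs. (14)-(15), with the annealed angular function
    psi_MAda of Eq. (11) applied to the target classes."""
    N = theta.shape[0]
    rows = np.arange(N)
    l = lam(m_ada)
    tgt = theta[rows, labels]
    psi = (np.cos(tgt + m_ada) + gamma * np.cos(tgt)) / (1.0 + gamma)      # Eq. (11)
    logits = l * s_m * np.cos(theta) + (1 - l) * s_ada * np.cos(theta)     # Eq. (15)
    logits[rows, labels] = l * s_m * psi + (1 - l) * s_ada * np.cos(tgt)   # Eq. (14)
    return logits
```

One training step would then compute m_ada and s_ada from the previous minibatch via the updates of Secs. 2.2-2.3 and feed the resulting logits to the cross-entropy of (1).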

3. Deep neural network architectures

In this section, we outline the baseline TDNN architecture for x-vector extraction [2] and propose a set of modifications that adapt the original ResNet structure to speaker embedding extraction.

3.1. Time Delay Neural Network for x-vector extraction

The Time Delay Neural Network [2] consists of 5 delay-type layers which operate on 1D speech frames, extracting a temporal context of 15 frames. Next, the statistics pooling layer computes the mean and standard deviation of the aggregated outputs of the 5th layer over the entire utterance. The final part of the network consists of two fully connected layers followed by a softmax layer. During testing, the output of the first fully connected layer is used as the x-vector embedding.

3.2. Modified ResNet for speaker embedding extraction

In this work, we propose modifications of the Residual Network 18 (ResNet18) architecture for speaker embedding extraction with improved performance. The proposed architecture is composed of the following elements. A 2D feature vector is fed into a single 2D convolution layer with 7x7 filter size and stride of 2x2. Next, the network is composed of 4 large segments, each containing a number of blocks which consist of 2 consecutive convolutional layers with the so-called identity shortcut connection that skips the entire block. The number of such 2-layer blocks is equal to {2, 2, 2, 2} for the consecutive segments. The first convolutional layer of a segment downsamples the input along the feature dimension by 2, while the output sizes of the convolutional layers of the consecutive segments are respectively given as {64, 128, 256, 512}. Then the statistics pooling layer computes the mean and standard deviation over the time dimension of the outputs from the convolutional segments. The pooling is followed by 2 fully connected layers with an output size of 512 each, which are added before the softmax classification layer. Hereafter, we will refer to this structure as modified ResNet18 (mR18).

With regard to existing ResNet-based speaker recognition systems, the original ResNet34 and ResNet50 network structures [18], with a fully connected layer added right before the global mean pooling layer, are used in [1]. An analogous solution has been presented in [19] for ResNet18 and ResNet34. In [7] all TDNN layers have been replaced with ResNet34 residual blocks, with a learnable dictionary encoding (LDE) layer [20] used instead of the pooling layer. In [8] a fully connected layer has been inserted after the pooling operation in the ResNet50 structure.
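A compact Keras sketch of the mR18 topology as we read it from the description above is given below. Block internals (batch normalization placement, shortcut projections) follow the original ResNet practice [18] and are our assumptions, as is the handling of the variable-length time axis; the final Dense layer stands in for the cosine-logit classification layer of Sec. 2.

```python
import tensorflow as tf
from tensorflow.keras import layers

def basic_block(x, ch, stride=(1, 1)):
    """Two 3x3 convolutions with an identity shortcut; a 1x1 projection
    is used when the shape changes (standard ResNet practice [18])."""
    y = layers.Conv2D(ch, 3, strides=stride, padding="same", use_bias=False)(x)
    y = layers.ReLU()(layers.BatchNormalization()(y))
    y = layers.Conv2D(ch, 3, padding="same", use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    if stride != (1, 1) or x.shape[-1] != ch:
        x = layers.Conv2D(ch, 1, strides=stride, use_bias=False)(x)
    return layers.ReLU()(layers.Add()([x, y]))

def build_mr18(n_mels=64, n_speakers=1211):
    inp = layers.Input(shape=(n_mels, None, 1))        # (freq, time, 1)
    x = layers.Conv2D(64, 7, strides=2, padding="same")(inp)
    for ch in (64, 128, 256, 512):                     # 4 segments, 2 blocks each
        x = basic_block(x, ch, stride=(2, 1))          # halve the frequency axis only
        x = basic_block(x, ch)
    # statistics pooling: mean and standard deviation over the time axis
    mu = layers.Lambda(lambda t: tf.reduce_mean(t, axis=2))(x)
    sd = layers.Lambda(lambda t: tf.math.reduce_std(t, axis=2))(x)
    x = layers.Flatten()(layers.Concatenate()([mu, sd]))
    x = layers.Dense(512)(x)                           # embedding layer
    x = layers.Dense(512)(x)
    out = layers.Dense(n_speakers)(x)                  # placeholder for cosine logits
    return tf.keras.Model(inp, out)
```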
4. Experimental results and evaluation

4.1. Experiments, datasets and evaluation measures

In this section, we present the systems and datasets for the two performed experiments. The first experiment is carried out on the VoxCeleb1 corpus [21], with the training part used for DNN, LDA and PLDA training, and the test part of VoxCeleb1 used for system evaluation. In Experiment 2, the entire VoxCeleb2 corpus [1] is added to the training part of VoxCeleb1 for network training. Each training dataset is extended by data augmentation of four types: reverberation (simulated RIRs [22] from small and medium sized rooms), and additive music, ambient noise, and overlapping speech of 3-7 randomly selected speakers from the MUSAN corpus [23]. The final training dataset consists of the original training data and a randomly selected subset of the augmented data, which results in 348 642 utterances (1 211 speakers) for Experiment 1 and 2 276 888 utterances (7 323 speakers) for Experiment 2.

Feature extraction, LDA, and PLDA training are performed in the Kaldi toolkit [24], while DNN training and embedding extraction are performed using a TensorFlow implementation [15, 25]. Input features are 64-band Mel filter bank coefficients computed using frames of 25 ms length with a 10 ms shift and mean-normalized over a 3 s window. The threshold in the Kaldi VAD is set to 3.5. The system backend consists of length normalization, centering with the mean of the training data, LDA dimensionality reduction from 512 to 200, and PLDA scoring.

In the two performed experiments, we compare the performance of the existing and the proposed softmax-based loss functions using two speaker recognition systems based on the existing TDNN and the proposed mR18 architectures. In particular, in Experiment 1 we evaluate the existing standard softmax, Additive Angular Softmax (AAS) [13] with scale equal to 30 and margin set to 0.3, the Fixed Scale parameter given by (5) [17], and the adaptive scale (SAda) that is equivalent to AdaCos [17], and compare their results with the proposed adaptive margin (MAda) in the angular function as described in Sec. 2.3 and the proposed parameter adaptation (ParAda) described in Sec. 2.4 for 4 different transitions in the loss function. In Experiment 2, we evaluate only 4 selected methods, namely the standard softmax, AAS [13], AdaCos [17], and the proposed ParAda with a = 20 and b = 0.0, on the larger training dataset. The maximum number of epochs for network training is set to 4 in Experiment 1 and to 8 in Experiment 2. In order to satisfy (13), s_m is set to 30 and 35 in Experiments 1 and 2, respectively, while in both experiments the annealing function parameters are set as γ_min = 0, γ_b = 1000, β = 0.00001 and α = 5.
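The chosen values of s_m can be checked against the lower bound (13) under the θ̄ = π/4 assumption of Sec. 2.3.2; the short computation below uses the speaker counts of the two training sets given above.

```python
import numpy as np

# Lower bound of Eq. (13) with theta_bar = pi/4, for the speaker counts
# of the two experiments (1211 and 7323 training speakers).
for K, s_m in ((1211, 30), (7323, 35)):
    bound = np.log(K - 1) / (1.0 - np.cos(np.pi / 4))
    print(f"K = {K}: bound = {bound:.1f}, chosen s_m = {s_m}")
# K = 1211: bound = 24.2, chosen s_m = 30
# K = 7323: bound = 30.4, chosen s_m = 35
```

Both chosen values sit slightly above the respective bounds, consistent with the recommendation at the end of Sec. 2.3.2.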

As evaluation metrics, we select the Equal Error Rate (EER) and the minimum Detection Cost Function (minDCF) with parameters set as C_miss = C_fa = 1 and P_tar = 0.01. The phase of neural network training is reported in epochs, one epoch being defined as one pass of all training data through the network.

4.2. Results and discussion

Table 1: Accuracy in terms of EER and minDCF, and approximate network convergence time (in epochs) for the TDNN- and mR18-based speaker recognition systems in Experiment 1.

Network  Softmax                EER [%]  minDCF  Epoch
TDNN     Standard                 5.23    0.479   1.94
TDNN     AAS [13]                 4.61    0.432   3.52
TDNN     Fixed Scale [17]         4.75    0.512   3.11
TDNN     SAda (AdaCos [17])       4.75    0.453   2.72
TDNN     MAda                     4.70    0.439   3.11
TDNN     ParAda (a=20, b=0)       4.29    0.422   1.09
TDNN     ParAda (a=25, b=0)       4.30    0.455   1.56
TDNN     ParAda (a=20, b=0.1)     4.32    0.418   2.14
TDNN     ParAda (a=25, b=0.1)     4.47    0.439   1.29
mR18     Standard                 4.17    0.445   2.87
mR18     AAS [13]                 3.73    0.407   2.77
mR18     Fixed Scale [17]         4.23    0.463   2.01
mR18     SAda (AdaCos [17])       3.93    0.446   1.68
mR18     MAda                     3.83    0.389   3.16
mR18     ParAda (a=20, b=0)       3.62    0.418   1.34
mR18     ParAda (a=25, b=0)       3.69    0.408   1.85
mR18     ParAda (a=20, b=0.1)     3.54    0.377   1.67
mR18     ParAda (a=25, b=0.1)     3.49    0.395   1.14

Table 2: Accuracy in terms of EER and minDCF, and approximate network convergence time (in epochs) for the TDNN- and mR18-based speaker recognition systems in Experiment 2.

Network  Softmax        EER [%]  minDCF  Epoch
TDNN     Standard          3.06   0.338   3.10
TDNN     AAS [13]          2.57   0.289   7.83
TDNN     AdaCos [17]       2.44   0.276   7.53
TDNN     ParAda            2.32   0.257   5.76
mR18     Standard          2.07   0.286   2.82
mR18     AAS [13]          2.12   0.274   6.77
mR18     AdaCos [17]       1.94   0.335   5.11
mR18     ParAda            1.72   0.280   3.51

Table 1 presents the results of Experiment 1 for the TDNN- and the proposed mR18-based speaker recognition systems. As can clearly be observed, the existing modifications to the softmax-based loss function improve the EER and minDCF results; however, the increase in accuracy comes at the cost of a lower convergence speed. On the other hand, the proposed ParAda approach clearly outperforms the existing methods in terms of both accuracy and convergence speed. In particular, the convergence speed can be increased twofold, while the gains in EER and minDCF also increase significantly when compared with the gains of the existing methods over the standard softmax function. Concerning the existing methods, the adaptive method known as AdaCos offers faster network convergence; however, its accuracy is lower than that of the AAS method with a fixed margin value. Additional insight into the network convergence can be obtained from Fig. 2, which shows that AAS and AdaCos change the predicted probability curves only slightly during network training. In contrast, the proposed ParAda (for both considered cases) eases network training at the early training stages by shifting the probability curves to the right, and then makes the logits more discriminative at the latter stages of training to enhance classification ability. The less strict supervision results from the negative margin value that controls the λ parameter at the initial stage; once the margin exceeds zero, stricter classification requirements are imposed.

Figure 2: Predicted probability curves P_{y_i}(θ_{y_i}) at different training stages (in epochs) for the TDNN system in Experiment 1.

Table 2 presents the results of Experiment 2 for the four selected methods. In general, a similar trend can be observed for both network types, with ParAda clearly outperforming the other approaches in terms of accuracy and network convergence speed for the TDNN, and clearly outperforming the existing approaches in terms of EER and convergence speed for the mR18 (note that in the case of this network, AAS achieved a comparable minDCF result). For the methods studied in Experiment 2, Fig. 3 shows the speaker recognition accuracy at different stages of network training. As can be observed, the proposed ParAda facilitates much faster network convergence in the very early part of training, which allows the network to reach high accuracy very quickly. In contrast, the compared existing approaches converge much more slowly, often not reaching the accuracy of the proposed ParAda approach.

Figure 3: EER and minDCF results at different stages of training (in epochs) for the TDNN-based system in Experiment 2.

In both experiments, the mR18 significantly outperforms the TDNN. Therefore, we can assess the proposed structure as an interesting alternative for DNN-based speaker recognition.

5. Conclusions

In this paper, we have proposed a softmax-based cross-entropy loss function with adaptive parameters (ParAda) which significantly improves speaker recognition accuracy and neural network training convergence speed compared with state-of-the-art alternatives. In addition, we have shown that the proposed modified ResNet-based architecture brings about a large improvement in speaker recognition over the TDNN-based x-vector system.

6. References

[1] J. S. Chung, A. Nagrani, and A. Zisserman, "VoxCeleb2: Deep Speaker Recognition," Proc. Interspeech 2018, pp. 1086-1090, 2018.
[2] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-Vectors: Robust DNN Embeddings for Speaker Recognition," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5329-5333.
[3] M. K. Nandwana, J. van Hout, C. Richey, M. McLaren, M. A. Barrios, and A. Lawson, "The VOiCES from a Distance Challenge 2019," in Proc. Interspeech 2019, 2019, pp. 2438-2442.
[4] D. Snyder, D. Garcia-Romero, G. Sell, A. McCree, D. Povey, and S. Khudanpur, "Speaker Recognition for Multi-speaker Conversations Using X-vectors," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5796-5800.
[5] D. Povey, G. Cheng, Y. Wang, K. Li, H. Xu, M. Yarmohammadi, and S. Khudanpur, "Semi-Orthogonal Low-Rank Matrix Factorization for Deep Neural Networks," in Proc. Interspeech 2018, 2018, pp. 3743-3747.
[6] S. Novoselov, A. Gusev, A. Ivanov, T. Pekhovsky, A. Shulipa, G. Lavrentyeva, V. Volokhov, and A. Kozlov, "STC Speaker Recognition Systems for the VOiCES from a Distance Challenge," in Proc. Interspeech 2019, 2019, pp. 2443-2447.
[7] D. Snyder, J. Villalba, N. Chen, D. Povey, G. Sell, N. Dehak, and S. Khudanpur, "The JHU Speaker Recognition System for the VOiCES 2019 Challenge," in Proc. Interspeech 2019, 2019, pp. 2468-2472.
[8] J. Huang and T. Bocklet, "Intel Far-Field Speaker Recognition System for VOiCES Challenge 2019," in Proc. Interspeech 2019, 2019, pp. 2473-2477.
[9] D. Cai, X. Qin, W. Cai, and M. Li, "The DKU System for the Speaker Recognition Task of the 2019 VOiCES from a Distance Challenge," in Proc. Interspeech 2019, 2019, pp. 2493-2497.
[10] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1-9, 2015.
[11] K. Han, D. Yu, and I. Tashev, "Speech Emotion Recognition Using Deep Neural Network and Extreme Learning Machine," in INTERSPEECH, 2014, pp. 223-227.
[12] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, "SphereFace: Deep Hypersphere Embedding for Face Recognition," 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul 2017.
[13] J. Deng, J. Guo, N. Xue, and S. Zafeiriou, "ArcFace: Additive Angular Margin Loss for Deep Face Recognition," 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[14] F. Wang, J. Cheng, W. Liu, and H. Liu, "Additive Margin Softmax for Face Verification," IEEE Signal Processing Letters, vol. 25, no. 7, pp. 926-930, Jul 2018.
[15] Y. Liu, L. He, and J. Liu, "Large Margin Softmax Loss for Speaker Verification," in Proc. Interspeech 2019, 2019, pp. 2873-2877.
[16] Y. Li, F. Gao, Z. Ou, and J. Sun, "Angular Softmax Loss for End-to-end Speaker Verification," 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 190-194, 2018.
[17] X. Zhang, R. Zhao, Y. Qiao, X. Wang, and H. Li, "AdaCos: Adaptively Scaling Cosine Logits for Effectively Learning Deep Face Representations," 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[18] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2016.
[19] Z. Chen, Z. Ren, and S. Xu, "A Study on Angular Based Embedding Learning for Text-independent Speaker Verification," 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Nov 2019.
[20] W. Cai, J. Chen, and M. Li, "Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System," Odyssey 2018 The Speaker and Language Recognition Workshop, Jun 2018.
[21] A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: A Large-Scale Speaker Identification Dataset," Proc. Interspeech 2017, pp. 2616-2620, 2017.
[22] Room Impulse Response and Noise Database, accessed March 9, 2020. [Online]. Available: http://www.openslr.org/28/
[23] D. Snyder, G. Chen, and D. Povey, "MUSAN: A Music, Speech, and Noise Corpus," 2015, arXiv:1510.08484v1.
[24] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., "The Kaldi speech recognition toolkit," IEEE Signal Processing Society, Tech. Rep., 2011.
[25] H. Zeinali, L. Burget, J. Rohdin, T. Stafylakis, and J. H. Cernocky, "How to Improve Your Speaker Embeddings Extractor in Generic Toolkits," ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6141-6145, 2019.
