Rethinking Feature Distribution for Loss Functions in Image Classification

Weitao Wan 1*   Yuanyi Zhong 1,2*†   Tianpeng Li 1   Jiansheng Chen 1‡
1 Department of Electronic Engineering, Tsinghua University, Beijing, China
2 Department of Computer Science, University of Illinois at Urbana-Champaign, Illinois, USA
[email protected] [email protected] [email protected] [email protected]

Abstract

We propose a large-margin Gaussian Mixture (L-GM) loss for deep neural networks in classification tasks. Different from the softmax cross-entropy loss, our proposal is established on the assumption that the deep features of the training set follow a Gaussian Mixture distribution. By involving a classification margin and a likelihood regularization, the L-GM loss facilitates both a high classification performance and an accurate modeling of the training feature distribution. As such, the L-GM loss is superior to the softmax loss and its major variants in the sense that besides

and the class label is used as the loss function. The softmax loss has its probabilistic interpretation in that, for a large class of distributions, the posterior distribution complies with the softmax transformation of linear functions of the feature vectors [1]. It can also be derived from a binary Markov Random Field or a Boltzmann Machine model [3]. However, the relationship between the affinity score and the probability distribution of the training feature space is vague. In other words, for an extracted feature, its likelihood to the training feature distribution is not well formulated.

Several variants have been proposed to enhance the effectiveness of the softmax loss.
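As a side illustration of the probabilistic interpretation cited from [1] (this sketch is ours, not part of the paper): if the class-conditional densities of the features are Gaussians sharing one covariance matrix, Bayes' rule yields a posterior that is exactly a softmax over linear functions of the feature vector, with weights w_k = Σ⁻¹μ_k and biases b_k = −½ μ_kᵀΣ⁻¹μ_k + ln π_k. All variable names and values below are hypothetical, chosen only to check the identity numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 4, 3                       # feature dimension, number of classes (arbitrary)
mus = rng.normal(size=(K, d))     # hypothetical class means
A = rng.normal(size=(d, d))
Sigma = A @ A.T + d * np.eye(d)   # shared covariance, made positive definite
pri = np.array([0.2, 0.5, 0.3])   # hypothetical class priors

Sinv = np.linalg.inv(Sigma)

def gauss_pdf(x, mu):
    # Multivariate normal density N(x; mu, Sigma)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ Sinv @ diff) / norm

x = rng.normal(size=d)            # a test feature vector

# Posterior via Bayes' rule on the Gaussian class-conditionals
lik = np.array([gauss_pdf(x, mus[k]) for k in range(K)])
post_bayes = pri * lik / np.sum(pri * lik)

# The same posterior as a softmax of linear scores w_k^T x + b_k
W = mus @ Sinv                                        # w_k = Sigma^{-1} mu_k
b = -0.5 * np.einsum('kd,dc,kc->k', mus, Sinv, mus) + np.log(pri)
scores = W @ x + b
post_softmax = np.exp(scores) / np.sum(np.exp(scores))

print(np.allclose(post_bayes, post_softmax))          # → True
```

The quadratic term −½ xᵀΣ⁻¹x is shared by all classes and cancels in the softmax ratio, which is why only linear functions of x remain; when the covariances differ per class (as a general Gaussian mixture allows), this cancellation no longer happens, and the posterior is not a plain softmax of linear scores.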