
Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19)

Classification with Label Distribution Learning

Jing Wang and Xin Geng∗
MOE Key Laboratory of Computer Network and Information Integration,
School of Computer Science and Engineering, Southeast University, Nanjing 210096, China
{wangjing91, xgeng}@seu.edu.cn

∗Corresponding author.

Abstract

Label Distribution Learning (LDL) is a novel learning paradigm whose aim is to minimize the distance between the model output and the ground-truth label distribution. We notice that, in real-world applications, the learned label distribution model is generally treated as a classification model, with the label corresponding to the highest model output as the predicted label, which unfortunately prompts an inconsistency between the training phase and the test phase. To solve this inconsistency, we propose in this paper a new Label Distribution Learning algorithm for Classification (LDL4C). Firstly, instead of KL-divergence, absolute loss is applied as the measure for LDL4C. Secondly, samples are re-weighted with information entropy. Thirdly, a large margin classifier is adapted to boost discrimination precision. We then reveal that, theoretically, LDL4C seeks a balance between generalization and discrimination. Finally, we compare LDL4C with existing LDL algorithms on 17 real-world datasets, and the experimental results demonstrate the effectiveness of LDL4C in classification.

1 Introduction

Learning with ambiguity, especially label ambiguity [Gao et al., 2017], has become one of the hot topics in the machine learning community. Traditional supervised learning paradigms include single-label learning (SLL) and multi-label learning (MLL) [Zhang and Zhou, 2014], which label each instance with one or multiple labels. MLL assumes that each instance is associated with multiple labels, which, compared with SLL, takes label ambiguity into consideration to some extent. Essentially, both SLL and MLL consider the relation between instance and label to be binary, i.e., whether or not the label is relevant to the instance. However, there are a variety of real-world tasks in which instances are associated with labels of different importance degrees, e.g., image annotation [Zhou and Zhang, 2006], emotion recognition [Zhou et al., 2015], and age estimation [Geng et al., 2013]. Consequently, a soft label instead of a hard one seems to be a reasonable solution. Inspired by this, a novel learning paradigm called Label Distribution Learning (LDL) [Geng, 2016] has recently been proposed. LDL tackles label ambiguity with the definition of the label description degree. Formally speaking, given an instance $x$, LDL assigns each $y \in \mathcal{Y}$ a real value $d_x^y$ (the label description degree), which indicates the importance of $y$ to $x$. To make the definition proper, [Geng, 2016] suggests that $d_x^y \in [0, 1]$ and $\sum_{y\in\mathcal{Y}} d_x^y = 1$, and the real-valued function $d$ is called the label distribution function. LDL thus extends the supervision from binary relevance to a label distribution, which is more applicable to real-world scenarios.

Due to its utility in dealing with ambiguity explicitly, LDL has been extensively applied to a variety of real-world problems. According to the source of the label distribution, applications can be mainly classified into three classes. The first class includes emotion recognition [Zhou et al., 2015] and pre-release rating prediction on movies [Geng and Hou, 2015], where the label distribution comes from the data. Applications of the second class include head pose estimation [Huo and Geng, 2017] and crowd counting [Zhang et al., 2015], whose label distributions are generated from prior knowledge. Representative applications of the third class include beauty sensing [Ren and Geng, 2017] and label enhancement [Xu et al., 2018], where the label distribution is learned from the data. Notice that the aforementioned applications of LDL fall into the scope of classification, and we find that a learned LDL model is generally treated as a classification model, with the label corresponding to the highest model output as the prediction, which unfortunately draws an inconsistency between the aim of the training phase and the goal of the test phase of LDL. In the training phase, the aim of LDL is to minimize the distance between the model output and the ground-truth label distribution, while in the test phase, the objective is to minimize the 0/1 error.

We tackle this inconsistency in this paper. In order to alleviate the inconsistency, we come up with three improvements, i.e., absolute loss, re-weighting with information entropy, and large margin. The proposals of applying absolute loss and introducing a large margin are inspired by theoretical arguments, and re-weighting samples with information entropy is based on observing the gap between the metric of LDL and that of the corresponding classification model. Since the improvements mainly originate from well-established theoretical tools, we are able to provide a theoretical analysis of the proposed method. Consequently, the proposed algorithm LDL4C is well-designed and theoretically sound.
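As a concrete illustration of these definitions, the following minimal sketch (Python with NumPy; the toy values are hypothetical, not from the paper) builds a label distribution that satisfies the constraints $d_x^y \in [0, 1]$ and $\sum_y d_x^y = 1$, and shows the top label that the classification view of LDL would report:

```python
import numpy as np

# A hypothetical label distribution over Y = {y1, ..., y5} for one instance x.
d_x = np.array([0.05, 0.10, 0.50, 0.25, 0.10])

# The two constraints on a label distribution: non-negative degrees summing to one.
assert np.all(d_x >= 0) and np.isclose(d_x.sum(), 1.0)

# When the LDL model is used as a classifier, the predicted label is the one
# with the highest description degree (here index 2, i.e., y3).
print("predicted label index:", int(np.argmax(d_x)))
```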

The main contributions of this paper are summarized as follows:

1) We propose three key components for improving the classification performance of LDL, i.e., absolute loss, re-weighting samples with information entropy, and large margin.

2) We establish upper bounds on the 0/1 error and the error probability of the proposed algorithm. Theoretical and empirical studies show that LDL4C enjoys both generalization and discrimination.

The rest of the paper is organized as follows. Firstly, related works are briefly reviewed. Secondly, details of the proposed method are presented. Then we give the theoretical analysis of the proposed method. Next, experimental results are reported. Finally, we conclude the paper.

2 Related Work

Existing studies on LDL are primarily concentrated on designing new algorithms for LDL, and many algorithms have already been designed. [Geng, 2016] suggests three strategies for designing LDL algorithms. The first one is Problem Transformation (PT), which generates a single-label dataset according to the label distribution and then learns from the transformed dataset with single-label algorithms. Algorithms of this strategy include PT-SVM and PT-Bayes, which apply SVM and the Bayes classifier respectively. The second one is Algorithm Adaptation (AA), algorithms of which adapt existing learning algorithms to deal with label distributions directly. Two representative algorithms are AA-kNN and AA-BP. For AA-kNN, the mean of the label distributions of the $k$ nearest neighbors is calculated as the predicted label distribution, and for AA-BP, a one-hidden-layer neural network with multiple outputs is adapted to minimize the sum-squared loss between the output of the network and the ground-truth label distribution. The last one is Specialized Algorithm (SA). This category of algorithms takes the characteristics of LDL into consideration. Two representative approaches of SA are IIS-LDL and BFGS-LDL, which apply the maximum entropy model to learn the label distribution. Besides, [Geng and Hou, 2015] treats LDL as a regression problem and proposes LDL-SVR, which embraces SVR to deal with the label distribution. Furthermore, [Shen et al., 2017] proposes LDLF, which extends random forests to learn label distributions. In addition, [Gao et al., 2017] provides the first deep LDL model, DLDL. Notice that, compared with the classic LDL algorithms, LDLF and DLDL support end-to-end training, which makes them suitable for computer vision tasks. Notice also that none of the aforementioned algorithms considers the inconsistency discussed above.

There is little work on the inconsistency between the training and the test phase of LDL. [Gao et al., 2018] first recognizes the inconsistency in the application of age estimation, and designs a lightweight network to jointly learn the age distribution and the ground-truth age to bridge the gap. However, the method is only suitable for a real-valued label space, and no theoretical guarantee is provided. Besides, one recent work [Wang and Geng, 2019] provides a theoretical analysis of the relation between the risk of LDL (with absolute loss) and the error probability of the corresponding classification function, discovering that LDL with absolute loss dominates classification. However, the major motivation of that paper is to understand LDL theoretically, and no algorithm designed to improve classification precision is provided.

3 The Proposed Method

3.1 Preliminary

Let $\mathcal{X}$ be the input space, and $\mathcal{Y} = \{y_1, y_2, \dots, y_m\}$ be the label space. Denote the instance by $x$. The label distribution function $d \in \mathbb{R}^{\mathcal{X}\times\mathcal{Y}}$ is defined as $d : \mathcal{X}\times\mathcal{Y} \mapsto \mathbb{R}$, satisfying the constraints $d_x^y > 0$ and $\sum_{y\in\mathcal{Y}} d_x^y = 1$, where $d_x^y = d(x, y)$ for convenience of notation. Given a training set $S = \{(x_1, [d_{x_1}^{y_1},\dots,d_{x_1}^{y_m}]),\dots,(x_n, [d_{x_n}^{y_1},\dots,d_{x_n}^{y_m}])\}$, the goal of LDL is to learn the unknown function $d$, and the objective of LDL4C is to make a prediction.

3.2 Classification via LDL

In real-world applications, we generally regard a learned LDL model as a classification model. Formally speaking, denote the learned LDL function by $h$ and the corresponding classification function by $\hat{h}$; for a given instance $x$, we have

$$\hat{h}(x) = \arg\max_{y\in\mathcal{Y}} h_x^y,$$

i.e., the prediction is the label with the highest output. Intuitively, if $h$ is near the ground-truth label distribution function $d$, then the corresponding classification function $\hat{h}$ is close to the Bayes classifier (we assume $d$ is the conditional probability distribution function), and thereby LDL is definitely related to classification. The mathematical tool which quantifies the relation between LDL and classification is the plug-in decision theorem [Devroye et al., 1996]. Let $h^*$ be the Bayes classification function, i.e.,

$$h^*(x) = \arg\max_{y\in\mathcal{Y}} d_x^y,$$

then we have

Theorem 1. [Devroye et al., 1996; Wang and Geng, 2019] The error probability difference between $\hat{h}$ and $h^*$ satisfies

$$P(\hat{h}(x)\neq y) - P(h^*(x)\neq y) \le E_x\!\left[\sum_{y\in\mathcal{Y}}\left|h_x^y - d_x^y\right|\right].$$

The theorem says that if $h$ is close to $d$ in terms of absolute loss, then the error probability of $\hat{h}$ is close to that of $h^*$. Theoretically, LDL with absolute loss is therefore directly relevant to classification. Moreover, absolute loss is tighter than KL-divergence, since $\|p - q\|_1 \le \sqrt{2\,\mathrm{KL}(p, q)}$ for probability vectors $p, q \in \mathbb{R}^m$ [Cover and Thomas, 2012]. Accordingly, we propose to use absolute loss, instead of KL-divergence, in LDL for classification. Similar to [Geng, 2016], we use the maximum entropy model to learn the label distribution. The maximum entropy model is defined as

$$h_x^{y_j} = \frac{1}{Z}\exp(w_j\cdot x), \qquad (1)$$

where $Z = \sum_j \exp(w_j\cdot x)$ is the normalization factor.
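To make the classification view of the maximum entropy model concrete, the following sketch (Python with NumPy; the function names and toy data are illustrative assumptions, not part of the paper) computes the model output of Eq. (1), the plug-in prediction $\hat{h}(x)$, and the per-instance absolute loss $\sum_y |h_x^y - d_x^y|$ that appears on the right-hand side of Theorem 1:

```python
import numpy as np

def maxent_output(W, x):
    """Eq. (1): h_x^{y_j} = exp(w_j . x) / sum_k exp(w_k . x), i.e. a softmax over linear scores."""
    scores = W @ x                      # shape (m,), one score per label
    scores = scores - scores.max()      # numerical stability; does not change the softmax
    e = np.exp(scores)
    return e / e.sum()

def plug_in_predict(W, x):
    """The classification rule induced by the LDL model: the label with the highest output."""
    return int(np.argmax(maxent_output(W, x)))

def absolute_loss(h, d):
    """Per-instance absolute loss sum_y |h_x^y - d_x^y|, the quantity bounded in Theorem 1."""
    return float(np.abs(h - d).sum())

# Toy usage with assumed sizes: m = 3 labels, feature dimension 4.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
x = rng.normal(size=4)
d = np.array([0.2, 0.5, 0.3])           # a hypothetical ground-truth label distribution
h = maxent_output(W, x)
print(plug_in_predict(W, x), absolute_loss(h, d))
```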


Absolute loss is applied as the measure for LDL, and consequently the problem can be formulated as

$$\min_{W} \sum_{i=1}^{n}\sum_{j=1}^{m}\left|h_{x_i}^{y_j} - d_{x_i}^{y_j}\right| + \frac{C}{2}\sum_{j=1}^{m}\|w_j\|^2, \qquad (2)$$

where $W = [w_1, w_2, \dots, w_m]$ is the set of parameters.

[Figure 1: two bar plots over the label space $\{y_1,\dots,y_5\}$, showing the ground-truth label distribution $d_x$ (red) and the learned label distribution $h_x$ (green); panel (a): $\ell_1 = 0.5$, $\ell_{ep} = 0.5$; panel (b): $\ell_1 = 0.2$, $\ell_{ep} = 0.8$.]

Figure 1: An example to illustrate the gap between absolute loss for LDL and error probability for classification. The red bars represent the ground-truth label distribution, and the green ones represent the learned label distribution.

3.3 Re-weighting with Information Entropy

Taking a second look at Theorem 1, we find that model (2) actually optimizes an upper bound on the error probability. This brings a gap between absolute loss for LDL and 0/1 loss for classification. Fig. 1 is an illustrative example of the gap, where $\ell_1$ denotes the absolute loss and $\ell_{ep}$ denotes the error probability defined in Eq. (15). As we can see from the figure, (b) is superior to (a) from the perspective of absolute loss. However, (b) is inferior to (a) in terms of error probability. For (a), $y_3$ has the highest label description degree under both $d$ and $h$, thus the prediction of $\hat{h}$ coincides with that of the Bayes classification function $h^*$. For (b), $y_3$ has the highest degree under $d$ while $y_2$ has the highest degree under $h$, thus the corresponding classification result differs from that of the Bayes classifier. Essentially, for a label distribution which is multimodal (like (b) of Fig. 1), the corresponding classification function is much more sensitive to the LDL loss, while for a label distribution which is unimodal (like (a) of Fig. 1), there is more room to manoeuvre.

Information entropy seems to be a reasonable tool to quantify this sensitivity, since a multimodal distribution generally has higher information entropy than a unimodal one. Precisely, for a label distribution with higher information entropy (i.e., multimodal), the corresponding classification model is much more sensitive to the LDL loss, and vice versa. In other words, samples with high information entropy deserve more attention, which motivates us to re-weight samples with information entropy. Re-weighting with information entropy penalizes large losses on samples with high information entropy, and leaves more room for samples with low information entropy. Recall the definition of information entropy; for $x$,

$$E_x = -\sum_{y\in\mathcal{Y}} d_x^y \ln d_x^y.$$

Accordingly, the problem is then formulated as

$$\min_{W} \sum_{i=1}^{n} E_{x_i}\sum_{j=1}^{m}\left|h_{x_i}^{y_j} - d_{x_i}^{y_j}\right| + \frac{C}{2}\sum_{j=1}^{m}\|w_j\|^2. \qquad (3)$$

3.4 LDL with Large Margin

To further boost the classification precision, we borrow from margin theory. Formally speaking, we let the model output corresponding to the Bayes prediction have a large margin over the other labels, which pushes the corresponding classification model to be consistent with the Bayes classification model. Let $y_i^*$ be the Bayes prediction for $x_i$, i.e., $y_i^* = h^*(x_i)$. Then we assume a margin $\rho$ between $h_{x_i}^{y_i^*}$ and $\max_{y\in\{\mathcal{Y}-y_i^*\}} h_{x_i}^{y}$, and the problem can be formulated as

$$\min_{W}\ \sum_{i=1}^{n} E_{x_i}\sum_{j=1}^{m}\left|h_{x_i}^{y_j} - d_{x_i}^{y_j}\right| + C_1\sum_{i=1}^{n}\xi_i + \frac{C_2}{2}\sum_{j=1}^{m}\|w_j\|^2,$$
$$\text{s.t.:}\ h_{x_i}^{y_i^*} - \max_{y\in\{\mathcal{Y}-y_i^*\}} h_{x_i}^{y} > \rho - \xi_i,\quad \xi_i \ge 0,\ \forall i\in[n], \qquad (4)$$

where $C_1, C_2$ are the balance coefficients for the margin loss and the regularization respectively. Large margin turns out to be a reasonable assumption since the ultimate goal of LDL4C is to pursue the Bayes classifier. Overall, model (4) seeks a balance between LDL and a large margin classifier.

3.5 Optimization

To start, for $x_i$ define

$$\alpha_i = h_{x_i}^{y_i^*} - \max_{y\in\{\mathcal{Y}-y_i^*\}} h_{x_i}^{y}, \qquad (5)$$

and the $\rho$-margin loss as $\ell^\rho(\alpha) = \max\{0, 1 - \alpha/\rho\}$. By introducing the margin loss, model (4) can be reformulated as

$$\min_{W} \sum_{i=1}^{n} E_{x_i}\sum_{j=1}^{m}\left|h_{x_i}^{y_j} - d_{x_i}^{y_j}\right| + C_1\sum_{i=1}^{n}\ell^\rho(\alpha_i) + \frac{C_2}{2}\sum_{j=1}^{m}\|w_j\|^2, \qquad (6)$$

which can be optimized efficiently by gradient-based methods ($\ell^\rho$ is differentiable almost everywhere). We optimize the target function by BFGS [Nocedal and Wright, 2006]. Define $\phi$ as the step function,

$$\phi(x) = \begin{cases} 1 & \text{if } x \ge 0,\\ 0 & \text{otherwise},\end{cases}$$

and denote the target function by $T$, whose gradient is obtained through

$$\frac{\partial T}{\partial w_k} = \sum_{i,j} E_{x_i}\,\mathrm{sign}\!\left(h_{x_i}^{y_j} - d_{x_i}^{y_j}\right)\frac{\partial h_{x_i}^{y_j}}{\partial w_k} + C_2 w_k + C_1\sum_{i=1}^{n}\phi(\rho - \alpha_i)\left(\frac{\partial \max_{y\in\{\mathcal{Y}-y_i^*\}} h_{x_i}^{y}}{\partial w_k} - \frac{\partial h_{x_i}^{y_i^*}}{\partial w_k}\right).$$

Moreover, the gradient of $h$ is obtained through

$$\frac{\partial h_x^{y_j}}{\partial w_k} = \left(\mathbb{1}_{\{j=k\}}\frac{\exp(w_k\cdot x)}{\sum_i \exp(w_i\cdot x)} - \frac{\exp(w_j\cdot x)\exp(w_k\cdot x)}{\left(\sum_i \exp(w_i\cdot x)\right)^2}\right)x,$$

i.e., $\partial h_x^{y_j}/\partial w_k = h_x^{y_j}\left(\mathbb{1}_{\{j=k\}} - h_x^{y_k}\right)x$.
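As an illustration of how Eqs. (3)-(6) fit together, here is a minimal sketch (Python/NumPy plus SciPy; the flattened-parameter layout, the toy data, and the use of numerical gradients via SciPy's BFGS are our assumptions rather than the paper's implementation, which uses the analytic gradient given above) that evaluates the LDL4C objective of Eq. (6): entropy-weighted absolute loss, $\rho$-margin loss on the Bayes label, and L2 regularization:

```python
import numpy as np
from scipy.optimize import minimize

def softmax(S):
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def ldl4c_objective(w_flat, X, D, rho, C1, C2):
    """Eq. (6): entropy-weighted absolute loss + C1 * margin loss + (C2/2) * ||W||^2."""
    n, p = X.shape
    m = D.shape[1]
    W = w_flat.reshape(m, p)
    H = softmax(X @ W.T)                                    # model outputs h_{x_i}^{y_j}, Eq. (1)
    ent = -(D * np.log(np.clip(D, 1e-12, None))).sum(1)     # information entropy E_{x_i}
    abs_term = (ent * np.abs(H - D).sum(1)).sum()           # entropy-weighted absolute loss, Eq. (3)
    y_star = D.argmax(1)                                    # Bayes prediction on the training data
    h_star = H[np.arange(n), y_star]
    H_rest = H.copy()
    H_rest[np.arange(n), y_star] = -np.inf
    alpha = h_star - H_rest.max(1)                          # margins alpha_i, Eq. (5)
    margin_term = np.maximum(0.0, 1.0 - alpha / rho).sum()  # rho-margin loss
    reg = 0.5 * (W ** 2).sum()
    return abs_term + C1 * margin_term + C2 * reg

# Toy usage with assumed sizes; gradients are approximated numerically by BFGS here.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
D = rng.dirichlet(np.ones(4), size=50)
res = minimize(ldl4c_objective, np.zeros(4 * 8),
               args=(X, D, 0.1, 1.0, 1.0), method="BFGS")
W_hat = res.x.reshape(4, 8)
```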


4 Theoretical Results

In this section, we demonstrate that LDL4C theoretically seeks a trade-off between generalization and discrimination, where generalization is due to the LDL process, and discrimination is credited to the large margin classifier.

4.1 Generalization

Here, by generalization we mean that the error probability of LDL4C converges to that of the Bayes classifier, which we show by developing an upper bound on the error probability of LDL4C. The basic steps are to provide an upper bound on the risk of LDL4C (with absolute loss) first, and then to convert the risk bound into an error probability bound via Theorem 1.

Notice that the maximum entropy model, i.e., Eq. (1), can be regarded as the composition of the softmax function and a multi-output linear regression function. Formally, let $\mathrm{SF}$ be the softmax function and $\mathcal{F}$ be a family of functions for multi-output linear regression. Denote the family of functions for the maximum entropy model by $\mathcal{H}$; then $\mathcal{H} = \{x \mapsto (\mathrm{SF}\circ f)(x) : f \in \mathcal{F}\}$. To bound the risk of LDL4C, it suffices to derive an upper bound on the Rademacher complexity [Bartlett and Mendelson, 2003] of the model.

Theorem 2. Define $\mathcal{F}$ as $\mathcal{F} = \{x \mapsto [w_1\cdot x,\dots,w_m\cdot x]^{\mathrm{T}} : \|w_j\| \le 1\}$. The Rademacher complexity of $\ell_1\circ\mathcal{H}$ satisfies

$$\hat{R}_n(\ell_1\circ\mathcal{H}) \le \frac{2\sqrt{2}\,m^2\max_{i\in[n]}\|x_i\|_2}{\sqrt{n}}. \qquad (7)$$

Proof. Firstly, according to [Wang and Geng, 2019], $\ell_1\circ\mathrm{SF}$ is shown to be $2m$-Lipschitz. And according to [Maurer, 2016], with $\ell_1\circ\mathrm{SF}$ being $2m$-Lipschitz,

$$\hat{R}_n(\ell_1\circ\mathcal{H}) \le 2\sqrt{2}\,m\sum_{j=1}^{m}\hat{R}_n(\mathcal{F}_j), \qquad (8)$$

where $\mathcal{F}_j = \{x\mapsto w_j\cdot x : \|w_j\|\le 1\}$. Moreover, according to [Kakade et al., 2009], the Rademacher complexity of $\mathcal{F}_j$ satisfies

$$\hat{R}_n(\mathcal{F}_j) \le \frac{\max_{i\in[n]}\|x_i\|_2}{\sqrt{n}}. \qquad (9)$$

Finally, substituting Eq. (9) into Eq. (8) finishes the proof of Theorem 2.

The data-dependent error probability bound for LDL4C is then as follows.

Theorem 3. Define $\hat{\mathcal{H}} = \{x\mapsto \max_{y\in\mathcal{Y}} h_x^y : h\in\mathcal{H}\}$. Then for any $\delta > 0$, with probability at least $1-\delta$, for any $\hat{h}\in\hat{\mathcal{H}}$,

$$P(\hat{h}(x)\neq y) - P(h^*(x)\neq y) \le \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{m}\left|h_{x_i}^{y_j} - d_{x_i}^{y_j}\right| + \frac{4\sqrt{2}\,m^2\max_{i\in[n]}\|x_i\|_2}{\sqrt{n}} + 6\sqrt{\frac{\log(2/\delta)}{2n}}. \qquad (10)$$

Proof. Firstly, according to [Bartlett and Mendelson, 2003; Mohri et al., 2012], and since $\sum_y |d_x^y - h_x^y| \le 2$ (triangle inequality), the data-dependent risk bound for LDL4C is

$$E_x\!\left[\sum_{j=1}^{m}\left|h_x^{y_j} - d_x^{y_j}\right|\right] \le \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{m}\left|h_{x_i}^{y_j} - d_{x_i}^{y_j}\right| + \frac{4\sqrt{2}\,m^2\max_{i\in[n]}\|x_i\|_2}{\sqrt{n}} + 6\sqrt{\frac{\log(2/\delta)}{2n}}. \qquad (11)$$

Secondly, combining Theorem 1 and Eq. (11) completes the proof of the theorem.

4.2 Discrimination

In this part, we borrow margin theory to demonstrate that LDL4C enjoys discrimination. By discrimination, we mean the ability to output the same prediction as the Bayes prediction. According to [Bartlett and Mendelson, 2003], with the margin loss, we have

Theorem 4. Define $\tilde{\mathcal{H}}$ as $\tilde{\mathcal{H}} = \{(x, y^*)\mapsto h(x, y^*) - \max_{y\in\{\mathcal{Y}-y^*\}} h(x, y) : h\in\mathcal{H}\}$. Fix $\rho > 0$. Then, for any $\delta > 0$, with probability at least $1-\delta$, for all $h\in\mathcal{H}$,

$$E[\ell^\rho(\alpha)] \le \frac{1}{n}\sum_{i=1}^{n}\ell^\rho(\alpha_i) + 2\hat{R}_n(\ell^\rho\circ\tilde{\mathcal{H}}) + 3\sqrt{\frac{\log(2/\delta)}{2n}}. \qquad (12)$$

Since $\ell^\rho$ is $1/\rho$-Lipschitz, we have $\hat{R}_n(\ell^\rho\circ\tilde{\mathcal{H}}) \le \frac{1}{\rho}\hat{R}_n(\tilde{\mathcal{H}})$. And according to [Mohri et al., 2012], $\hat{R}_n(\tilde{\mathcal{H}}) \le 2m\hat{R}_n(\Pi_1(\mathcal{H}))$, where $\Pi_1(\mathcal{H})$ is defined as

$$\Pi_1(\mathcal{H}) = \{x\mapsto h(x, y) : y\in\mathcal{Y}, h\in\mathcal{H}\}.$$

Since $\mathbb{1}_{\{h(x)\neq y^*\}} \le \ell^\rho(\alpha)$, we thus have

$$E\!\left[\mathbb{1}_{\{h(x)\neq y^*\}}\right] \le \frac{1}{n}\sum_{i=1}^{n}\ell^\rho(\alpha_i) + \frac{4m\hat{R}_n(\Pi_1(\mathcal{H}))}{\rho} + 3\sqrt{\frac{\log(2/\delta)}{2n}}.$$

And $\hat{R}_n(\Pi_1(\mathcal{H}))$ can be bounded as

$$\hat{R}_n(\Pi_1(\mathcal{H})) \le \frac{(m-1)\gamma\exp(\gamma)}{\sqrt{n}}, \qquad (13)$$

where $\gamma = 2\max_{i\in[n]}\|x_i\|$, which leads to

Theorem 5. Fix $\rho > 0$. Then for any $\delta > 0$, with probability at least $1-\delta$, for all $h\in\mathcal{H}$,

$$E\!\left[\mathbb{1}_{\{h(x)\neq y^*\}}\right] \le \frac{1}{n}\sum_{i=1}^{n}\ell^\rho(\alpha_i) + \frac{4(m-1)m\gamma\exp(\gamma)}{\rho\sqrt{n}} + 3\sqrt{\frac{\log(2/\delta)}{2n}}. \qquad (14)$$

5 Experiments

5.1 Experimental Configuration

Real-world datasets. The experiments are conducted on seventeen datasets in total, among which fifteen are from [Geng, 2016], M2B is from [Nguyen et al., 2012], and SCUT-FBP is from [Xie et al., 2015]. Note that M2B and SCUT-FBP are transformed to label distributions as in [Ren and Geng, 2017]. The datasets come from a variety of fields, and statistics about them are listed in Table 1.


Dataset         # Examples   # Features   # Labels
SJAFFE               213         243          6
M2B                 1240         250          5
SCUT-FBP            1500         300          5
Natural Scene       2000         294          9
Yeast-alpha         2465          24         18
Yeast-cdc           2465          24         15
Yeast-cold          2465          24          4
Yeast-diau          2465          24          7
Yeast-dtt           2465          24          4
Yeast-elu           2465          24         14
Yeast-heat          2465          24          6
Yeast-spo           2465          24          6
Yeast-spo5          2465          24          3
Yeast-spoem         2465          24          2
SBU 3DFE            2500         243          6
Movie               7755        1869          5
Human Gene         17892          36         68

Table 1: Statistics of the 17 real-world datasets used in the experiments.

Comparing algorithms. We compare LDL4C with six popular LDL algorithms, i.e., PT-Bayes, PT-SVM, AA-BP, AA-kNN, LDL-SVR [Geng and Hou, 2015], and BFGS-LDL, and two state-of-the-art LDL algorithms, i.e., StructRF [Chen et al., 2018] and LDL-SCL [Zheng et al., 2018]. For LDL4C, the balance parameters $C_1$ and $C_2$ are selected from {0.001, 0.01, 0.1, 1, 10, 100} and $\rho$ is chosen from {0.001, 0.01, 0.1} by cross validation. Moreover, for AA-BP, the number of hidden-layer neurons is set to 64, and for AA-kNN, the number of nearest neighbors $k$ is selected from {3, 5, 7, 9, 11}. Furthermore, for BFGS-LDL, we follow the same settings as [Geng, 2016]. Besides, the insensitivity parameter $\epsilon$ is set to 0.1 as suggested by [Geng and Hou, 2015], and an RBF kernel is used for both PT-SVM and LDL-SVR. For StructRF, the depth is set to 20, the number of trees is set to 50, and the sampling ratio is set to 0.8 as suggested by [Chen et al., 2018]. In addition, for LDL-SCL, $\lambda_1 = 0.001$, $\lambda_2 = 0.001$, $\lambda_3 = 0.001$, and the number of clusters is 6, as given in [Zheng et al., 2018]. Finally, all algorithms are examined on the 17 datasets with 10-fold cross validation, and the average performance is reported.

Evaluation metrics. We test the algorithms in terms of two measures, 0/1 loss and error probability. For 0/1 loss, we regard the label with the highest label description degree as the ground-truth label. Given an instance $x$ with prediction $y$, the error probability is defined as

$$\ell_{ep}(x, y) = 1 - d_x^y. \qquad (15)$$
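Both measures are straightforward to compute from the predicted and ground-truth distributions; the following sketch (Python/NumPy, with assumed array shapes in which rows are instances and columns are labels) is one way to implement them:

```python
import numpy as np

def zero_one_loss(H, D):
    """0/1 loss: predicted top label vs. the label with the highest ground-truth degree."""
    return float(np.mean(H.argmax(axis=1) != D.argmax(axis=1)))

def error_probability(H, D):
    """Eq. (15): l_ep(x, y) = 1 - d_x^y for the predicted label y, averaged over instances."""
    pred = H.argmax(axis=1)
    return float(np.mean(1.0 - D[np.arange(len(D)), pred]))
```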
5.2 Experimental Results

Firstly, to validate the three key components (i.e., absolute loss, re-weighting with information entropy, and large margin) for boosting classification precision, we add each of the components to BFGS-LDL and compare the 0/1 loss with that of BFGS-LDL. The performance is reported in Table 2. Due to the page limitation, we only report part of the results. Here "LDL" is short for BFGS-LDL, "LDL+$\ell_1$" represents BFGS-LDL with absolute loss, "LDL+RW" represents BFGS-LDL with information entropy re-weighting, and "LDL+LM" stands for BFGS-LDL with large margin. As we can see from the table, armed with the three key components, the corresponding LDL model achieves better classification performance compared with the original BFGS-LDL model.

Dataset         LDL    LDL+ℓ1  LDL+RW  LDL+LM
SJAFFE          .437   .417    .394    .426
M2B             .486   .486    .482    .485
SCUT-FBP        .467   .470    .465    .467
Natural Scene   .597   .420    .421    .425
Yeast-alpha     .891   .787    .787    .787
Yeast-cdc       .823   .820    .821    .821
Yeast-cold      .580   .576    .576    .576
Yeast-diau      .694   .665    .668    .669
Yeast-dtt       .629   .630    .627    .627
Yeast-heat      .702   .676    .675    .677
Yeast-spo       .548   .547    .547    .547
Yeast-spoem     .435   .400    .416    .422
Movie           .416   .416    .410    .414

Table 2: Experimental results of comparing BFGS-LDL with BFGS-LDL + {ℓ1, RW, LM} in terms of 0/1 loss. The best results are highlighted in bold face.

Secondly, we conduct extensive experiments with LDL4C and the comparing methods on the 17 real-world datasets in terms of 0/1 loss and error probability, and the experimental results are presented in Table 3 and Table 4 respectively. A pairwise t-test at the 0.05 significance level is conducted, with the best performance marked in bold face. From the tables, we can see that:

• In terms of 0/1 loss, LDL4C ranks 1st on 13 datasets, and is significantly superior to the other comparing methods in 70% of the cases, which validates the ability of LDL4C to perform classification.

• In terms of error probability, LDL4C ranks 1st on 12 datasets, and achieves significantly better performance in 46% of the cases, which discloses that LDL4C has better generalization.

6 Conclusions

This paper tackles the inconsistency between the training and the test phase of LDL in applications. In the training phase of LDL, the aim is to learn the given label distribution, i.e., to minimize (maximize) the distance (similarity) between the model output and the given label distribution. However, in applications of LDL, a learned LDL model is generally treated as a classification model, with the label corresponding to the highest output as the prediction. The proposed algorithm LDL4C alleviates the inconsistency with three key improvements. The first one is inspired by the plug-in decision theorem (Theorem 1), which states that absolute loss is directly related to the classification error; thereby, absolute loss is applied as the loss function of LDL4C. The second one stems from noticing the gap between absolute loss and classification error, which suggests re-weighting samples with information entropy to pay more attention to multimodal distributions. The last one is due to margin theory, which introduces a margin between $h_x^{y^*}$ and $\max_{y\in\{\mathcal{Y}-y^*\}} h_x^y$ to enforce the corresponding classification model to approach the Bayes decision rule. Furthermore, we explore the effectiveness of LDL4C theoretically and empirically. Theoretically, LDL4C enjoys both generalization and discrimination, and empirically, extensive comparative experiments clearly manifest the advantage of LDL4C in classification.


Dataset        PT-Bayes    PT-SVM      AA-BP       AA-kNN      LDL-SVR     StructRF    LDL-SCL     BFGS-LDL    LDL4C
SJAFFE         .783±.016•  .785±.003•  .807±.003•  .483±.012•  .686±.004•  .569±.011•  .789±.002•  .437±.018   .394±.010
M2B            .645±.003•  .552±.003•  .487±.001   .484±.001   .495±.002   .483±.002   .498±.001   .486±.002   .481±.001
SCUT-FBP       .893±.002•  .517±.001   .463±.001   .457±.001   .467±.001   .467±.001   .502±.001•  .467±.000   .465±.001
Natural Scene  .801±.002•  .649±.000•  .667±.001•  .521±.001•  .568±.001•  .520±.001•  .696±.001•  .597±.002•  .420±.001
Yeast-alpha    .945±.000•  .940±.000•  .886±.000•  .882±.000•  .925±.000•  .882±.001•  .903±.000•  .891±.000•  .787±.001
Yeast-cdc      .935±.000•  .926±.000•  .825±.001   .830±.001•  .858±.000•  .824±.001   .825±.000   .823±.001   .818±.001
Yeast-cold     .730±.000•  .733±.001•  .580±.001   .570±.001◦  .609±.000•  .585±.001•  .586±.001•  .580±.001   .575±.001
Yeast-diau     .866±.000•  .848±.001•  .711±.001•  .688±.000•  .694±.001•  .678±.000   .703±.001•  .694±.001•  .665±.000
Yeast-dtt      .743±.001•  .740±.001•  .641±.001   .637±.001   .651±.001•  .622±.001   .636±.001   .629±.001   .627±.001
Yeast-elu      .930±.000•  .924±.000•  .917±.000•  .867±.001•  .892±.000•  .875±.000•  .903±.000•  .899±.000•  .803±.000
Yeast-heat     .824±.001•  .827±.000•  .707±.001•  .692±.001   .710±.001•  .680±.001   .717±.001•  .703±.001•  .675±.001
Yeast-spo      .826±.001•  .817±.001•  .548±.000   .565±.001•  .616±.000•  .566±.001•  .584±.000•  .548±.000   .547±.000
Yeast-spo5     .667±.001•  .637±.000•  .543±.001   .542±.000   .559±.001   .524±.001   .622±.000•  .559±.001•  .534±.001
Yeast-spoem    .482±.000•  .479±.001•  .441±.001•  .410±.001   .409±.000   .401±.002   .439±.002•  .435±.002•  .401±.000
SBU 3DFE       .818±.001•  .810±.001•  .675±.001•  .616±.001•  .633±.002•  .516±.001◦  .514±.001◦  .582±.001   .569±.001
Movie          .819±.000•  .677±.000•  .423±.001•  .427±.000•  .415±.000•  .410±.000   .454±.000•  .416±.000•  .409±.000
Human Gene     .986±.000•  .981±.000•  .951±.001•  .966±.000•  .977±.000•  .965±.000•  .976±.000•  .955±.000•  .927±.000

Table 3: Experimental results (mean ± std.) of LDL4C and the comparing algorithms in terms of 0/1 loss on the 17 real-world datasets. •/◦ indicates whether LDL4C is statistically superior/inferior to the comparing algorithm.

Dataset        PT-Bayes     PT-SVM       AA-BP        AA-kNN       LDL-SVR      StructRF     LDL-SCL      BFGS-LDL     LDL4C
SJAFFE         .8080±3e-4•  .8305±4e-4•  .8188±2e-4•  .7693±6e-4•  .7951±1e-4•  .7766±5e-4•  .8309±2e-4•  .7583±6e-4   .7573±4e-4
M2B            .6438±9e-4•  .5441±4e-4   .5436±4e-4   .5424±4e-4   .5465±3e-4   .5427±6e-4   .5494±4e-4   .5419±6e-4   .5358±6e-4
SCUT-FBP       .8905±6e-4•  .5414±2e-4   .5410±2e-4   .5379±3e-4   .5420±2e-4   .5486±3e-4•  .5494±4e-4   .5426±2e-4   .5414±2e-4
Natural Scene  .8013±9e-4•  .6478±1e-4   .6620±3e-4•  .6658±2e-4•  .6400±1e-4   .6200±2e-4◦  .6648±2e-4•  .6471±2e-4   .6450±3e-4
Yeast-alpha    .9444±0e-4•  .9442±0e-4•  .9427±0e-4   .9428±0e-4   .9435±0e-4•  .9429±0e-4•  .9429±0e-4•  .9426±0e-4   .9426±0e-4
Yeast-cdc      .9329±0e-4•  .9320±0e-4•  .9287±0e-4   .9290±0e-4   .9298±0e-4•  .9288±0e-4   .9289±0e-4•  .9287±0e-4   .9287±0e-4
Yeast-cold     .7465±0e-4•  .7402±0e-4•  .7297±0e-4   .7297±0e-4   .7350±0e-4•  .7291±0e-4   .7304±0e-4•  .7297±0e-4   .7296±0e-4
Yeast-diau     .8524±0e-4•  .8478±0e-4•  .8432±0e-4   .8424±0e-4   .8432±0e-4   .8424±0e-4   .8427±0e-4   .8428±0e-4   .8428±0e-4
Yeast-dtt      .7489±0e-4•  .7486±0e-4•  .7421±0e-4•  .7415±0e-4   .7429±0e-4•  .7430±0e-4•  .7418±0e-4   .7416±0e-4   .7412±0e-4
Yeast-elu      .9285±0e-4•  .9277±0e-4•  .9261±0e-4   .9261±0e-4   .9266±0e-4•  .9260±0e-4   .9262±0e-4   .9261±0e-4   .9260±0e-4
Yeast-heat     .8299±0e-4•  .8309±0e-4   .8238±0e-4   .8237±0e-4   .8250±0e-4   .8234±0e-4   .8250±0e-4•  .8234±0e-4   .8233±0e-4
Yeast-spo      .8324±0e-4•  .8165±0e-4•  .8101±0e-4   .8109±0e-4   .8137±0e-4•  .8105±0e-4   .8115±0e-4   .8101±0e-4   .8000±0e-4
Yeast-spo5     .6601±0e-4•  .6551±0e-4   .6511±0e-4   .6473±0e-4   .6487±0e-4   .6430±0e-4◦  .6533±0e-4   .6527±0e-4   .6526±0e-4
Yeast-spoem    .4909±0e-4   .4756±1e-4   .4717±1e-4   .4689±0e-4   .4720±0e-4   .4670±1e-4   .4713±1e-4   .4705±0e-4   .4700±0e-4
SBU 3DFE       .7991±1e-4•  .8093±1e-4•  .8041±1e-4•  .7875±0e-4•  .7937±0e-4•  .7745±0e-4   .7769±0e-4•  .7746±0e-4   .7734±0e-4
Movie          .8179±0e-4•  .6815±0e-4•  .6760±0e-4•  .6781±0e-4•  .6760±0e-4•  .6745±0e-4   .6844±0e-4•  .6755±0e-4•  .6743±0e-4
Human Gene     .9848±0e-4•  .9834±0e-4•  .9824±0e-4•  .9831±0e-4•  .9822±0e-4•  .9626±0e-4•  .9822±0e-4•  .9816±0e-4   .9816±0e-4

Table 4: Experimental results (mean ± std.) of LDL4C and the comparing algorithms in terms of error probability on the 17 real-world datasets. •/◦ indicates whether LDL4C is statistically superior/inferior to the comparing algorithm.

Acknowledgments

This research was supported by the National Key Research & Development Plan of China (No. 2017YFB1002801), the National Science Foundation of China (61622203), the Collaborative Innovation Center of Novel Software Technology and Industrialization, and the Collaborative Innovation Center of Wireless Communications Technology.

A Proof for Eq. (13)

Proof. Recall the definition of Rademacher complexity,

$$\hat{R}(\Pi_1(\mathcal{H})) = E_\sigma\!\left[\sup_{w_y, W}\frac{1}{n}\sum_{i=1}^{n}\sigma_i\frac{\exp(x_i\cdot w_y)}{\sum_j \exp(w_j\cdot x_i)}\right]
= E_\sigma\!\left[\sup_{w_y, W}\frac{1}{n}\sum_{i=1}^{n}\frac{\sigma_i}{\sum_j e^{(w_j-w_y)\cdot x_i}}\right]
\le \frac{1}{n}E_\sigma\!\left[\sup_{w_y, W}\sum_{i,\,j\neq y} e^{(w_j-w_y)\cdot x_i}\sigma_i\right]
\le \frac{1}{n}E_\sigma\!\left[\sup_{w_y, w_j}\sum_{j\neq y}\sum_{i=1}^{n} e^{(w_j-w_y)\cdot x_i}\sigma_i\right],$$

where $\sigma_1, \sigma_2, \dots, \sigma_n$ are $n$ independent random variables with $P(\sigma_i = -1) = P(\sigma_i = 1) = 1/2$, $W = [w_1, w_2, \dots, w_m]$ with $\|w_j\| < 1$, and the third inequality is according to the 1-Lipschitzness of the function $1/x$ for $x > 1$. And for each $j \neq y$,

$$E_\sigma\!\left[\sup_{w_y, w_j}\sum_{i=1}^{n}\frac{\sigma_i}{n} e^{(w_j-w_y)\cdot x_i}\right] \le e^{\gamma}\, E_\sigma\!\left[\sup_{w_y, w_j}\sum_{i=1}^{n}\frac{\sigma_i (w_j-w_y)\cdot x_i}{n}\right] \le \frac{\gamma e^{\gamma}}{\sqrt{n}},$$

where the first inequality is according to the $e^a$-Lipschitzness of the function $e^x$ for $x < a$, and the second one is according to [Kakade et al., 2009] by noticing that $\|w_j - w_y\| < 2$, which concludes the proof for Eq. (13).

References

[Bartlett and Mendelson, 2003] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463-482, 2003.

[Chen et al., 2018] Mengting Chen, Xinggang Wang, Bin Feng, and Wenyu Liu. Structured random forest for label distribution learning. Neurocomputing, 320:171-182, 2018.

[Cover and Thomas, 2012] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, 2012.

[Devroye et al., 1996] Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition, volume 31. Springer Science & Business Media, 1996.

[Gao et al., 2017] Binbin Gao, Chao Xing, Chenwei Xie, Jianxin Wu, and Xin Geng. Deep label distribution learning with label ambiguity. IEEE Transactions on Image Processing, 26(6):2825-2838, 2017.

[Gao et al., 2018] Binbin Gao, Hongyu Zhou, Jianxin Wu, and Xin Geng. Age estimation using expectation of label distribution learning. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pages 712-718, 2018.

[Geng and Hou, 2015] Xin Geng and Peng Hou. Pre-release prediction of crowd opinion on movies by label distribution learning. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI'15, pages 3511-3517, 2015.

[Geng et al., 2013] Xin Geng, Chao Yin, and Zhihua Zhou. Facial age estimation by learning from label distributions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(10):2401-2412, 2013.

[Geng, 2016] Xin Geng. Label distribution learning. IEEE Transactions on Knowledge and Data Engineering, 28(7):1734-1748, 2016.

[Huo and Geng, 2017] Zengwei Huo and Xin Geng. Ordinal zero-shot learning. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 1916-1922, 2017.

[Kakade et al., 2009] Sham M. Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems, NIPS'09, pages 793-800, 2009.

[Maurer, 2016] Andreas Maurer. A vector-contraction inequality for Rademacher complexities. In Algorithmic Learning Theory, pages 3-17, 2016.

[Mohri et al., 2012] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2012.

[Nguyen et al., 2012] Tam V. Nguyen, Si Liu, Bingbing Ni, Jun Tan, Yong Rui, and Shuicheng Yan. Sense beauty via face, dressing, and/or voice. In Proceedings of the 20th ACM International Conference on Multimedia, MM '12, pages 239-248, New York, USA, 2012. ACM.

[Nocedal and Wright, 2006] Jorge Nocedal and Stephen Wright. Numerical Optimization. Springer Science & Business Media, 2006.

[Ren and Geng, 2017] Yi Ren and Xin Geng. Sense beauty by label distribution learning. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 2648-2654, 2017.

[Shen et al., 2017] Wei Shen, Kai Zhao, Yilu Guo, and Alan L. Yuille. Label distribution learning forests. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'30, pages 834-843. Curran Associates, Inc., 2017.

[Wang and Geng, 2019] Jing Wang and Xin Geng. Theoretical analysis of label distribution learning (in press). In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, AAAI'19, 2019.

[Xie et al., 2015] Duorui Xie, Lingyu Liang, Lianwen Jin, Jie Xu, and Mengru Li. SCUT-FBP: A benchmark dataset for facial beauty perception. In 2015 IEEE International Conference on Systems, Man, and Cybernetics, 2015.

[Xu et al., 2018] Ning Xu, An Tao, and Xin Geng. Label enhancement for label distribution learning. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pages 2926-2932, 2018.

[Zhang and Zhou, 2014] Minling Zhang and Zhihua Zhou. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 26(8):1819-1837, 2014.

[Zhang et al., 2015] Zhaoxiang Zhang, Mo Wang, and Xin Geng. Crowd counting in public video surveillance by label distribution learning. Neurocomputing, 166(1):151-163, 2015.

[Zheng et al., 2018] Xiang Zheng, Xiuyi Jia, and Weiwei Li. Label distribution learning by exploiting sample correlations locally, 2018.

[Zhou and Zhang, 2006] Zhihua Zhou and Minling Zhang. Multi-instance multi-label learning with application to scene classification. In Proceedings of the 19th International Conference on Neural Information Processing Systems, NIPS'06, pages 1609-1616, Cambridge, MA, USA, 2006. MIT Press.

[Zhou et al., 2015] Ying Zhou, Hui Xue, and Xin Geng. Emotion distribution recognition from facial expressions. In Proceedings of the 23rd ACM International Conference on Multimedia, MM '15, pages 1247-1250, New York, NY, USA, 2015. ACM.
