
Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19)

Classification with Label Distribution Learning

Jing Wang and Xin Geng∗
MOE Key Laboratory of Computer Network and Information Integration,
School of Computer Science and Engineering, Southeast University, Nanjing 210096, China
{wangjing91, xgeng}@seu.edu.cn

∗Corresponding author.

Abstract

Label Distribution Learning (LDL) is a novel learning paradigm whose aim is to minimize the distance between the model output and the ground-truth label distribution. We notice that, in real-world applications, the learned label distribution model is generally treated as a classification model, with the label corresponding to the highest model output as the predicted label, which unfortunately prompts an inconsistency between the training phase and the test phase. To solve this inconsistency, we propose in this paper a new Label Distribution Learning algorithm for Classification (LDL4C). Firstly, instead of KL-divergence, absolute loss is applied as the measure for LDL4C. Secondly, samples are re-weighted with information entropy. Thirdly, a large margin classifier is adapted to boost discrimination precision. We then reveal that, theoretically, LDL4C seeks a balance between generalization and discrimination. Finally, we compare LDL4C with existing LDL algorithms on 17 real-world datasets, and the experimental results demonstrate the effectiveness of LDL4C in classification.

1 Introduction

Learning with ambiguity, especially label ambiguity [Gao et al., 2017], has become one of the hot topics in the machine learning community. Traditional supervised learning paradigms include single-label learning (SLL) and multi-label learning (MLL) [Zhang and Zhou, 2014], which label each instance with one or multiple labels. MLL assumes that each instance is associated with multiple labels, which, compared with SLL, takes label ambiguity into consideration to some extent. Essentially, both SLL and MLL consider the relation between instance and label to be binary, i.e., whether or not the label is relevant to the instance. However, there are a variety of real-world tasks in which instances are associated with labels of different importance degrees, e.g., image annotation [Zhou and Zhang, 2006], emotion recognition [Zhou et al., 2015], and age estimation [Geng et al., 2013]. Consequently, a soft label instead of a hard one seems to be a reasonable solution. Inspired by this, a novel learning paradigm called Label Distribution Learning (LDL) [Geng, 2016] has recently been proposed. LDL tackles label ambiguity with the definition of the label description degree. Formally speaking, given an instance $x$, LDL assigns each $y \in \mathcal{Y}$ a real value $d_x^y$ (the label description degree), which indicates the importance of $y$ to $x$. To make the definition proper, [Geng, 2016] suggests that $d_x^y \in [0, 1]$ and $\sum_{y\in\mathcal{Y}} d_x^y = 1$, and the real-valued function $d$ is called the label distribution function. LDL thus extends the supervision from binary relevance to a label distribution, which is more applicable to real-world scenarios.

Due to its utility in dealing with ambiguity explicitly, LDL has been extensively applied to a variety of real-world problems. According to the source of the label distribution, applications can be mainly classified into three classes. The first class includes emotion recognition [Zhou et al., 2015] and pre-release rating prediction on movies [Geng and Hou, 2015], where the label distribution comes from the data. Applications of the second class include head pose estimation [Huo and Geng, 2017] and crowd counting [Zhang et al., 2015], whose label distributions are generated from prior knowledge. Representative applications of the third class include beauty sensing [Ren and Geng, 2017] and label enhancement [Xu et al., 2018], where the label distribution is learned from the data. Notice that the aforementioned applications of LDL fall into the scope of classification, and we find that a learned LDL model is generally treated as a classification model, with the label corresponding to the highest model output as the prediction, which unfortunately draws an inconsistency between the aim of the training phase and the goal of the test phase of LDL. In the training phase, the aim of LDL is to minimize the distance between the model output and the ground-truth label distribution, while in the test phase, the objective is to minimize the 0/1 error.

We tackle this inconsistency in this paper. In order to alleviate the inconsistency, we come up with three improvements, i.e., absolute loss, re-weighting with information entropy, and large margin. The proposals of applying absolute loss and introducing a large margin are inspired by theoretical arguments, and re-weighting samples with information entropy is based on observing the gap between the metric of LDL and that of the corresponding classification model. Since the improvements mainly originate from well-established theoretical tools, we are able to provide a theoretical analysis of the proposed method. Consequently, the proposed algorithm LDL4C is well-designed and theoretically sound.
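As a concrete illustration of these definitions, the following minimal sketch (Python with NumPy; the toy values are hypothetical, not from the paper) builds a label distribution that satisfies the constraints $d_x^y \in [0, 1]$ and $\sum_y d_x^y = 1$, and shows the top label that the classification view of LDL would report:

```python
import numpy as np

# A hypothetical label distribution over Y = {y1, ..., y5} for one instance x.
d_x = np.array([0.05, 0.10, 0.50, 0.25, 0.10])

# The two constraints on a label distribution: non-negative degrees summing to one.
assert np.all(d_x >= 0) and np.isclose(d_x.sum(), 1.0)

# When the LDL model is used as a classifier, the predicted label is the one
# with the highest description degree (here index 2, i.e., y3).
print("predicted label index:", int(np.argmax(d_x)))
```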

The main contributions of this paper are summarized as follows:

1) We propose three key components for improving the classification performance of LDL, i.e., absolute loss, re-weighting samples with information entropy, and large margin.

2) We establish upper bounds on the 0/1 error and the error probability of the proposed algorithm. Theoretical and empirical studies show that LDL4C enjoys both generalization and discrimination.

The rest of the paper is organized as follows. Firstly, related works are briefly reviewed. Secondly, details of the proposed method are presented. Then we give the theoretical analysis of the proposed method. Next, experimental results are reported. Finally, we conclude the paper.

2 Related Work

Existing studies on LDL are primarily concentrated on designing new algorithms for LDL, and many algorithms have already been designed. [Geng, 2016] suggests three strategies for designing LDL algorithms. The first one is Problem Transformation (PT), which generates a single-label dataset according to the label distribution and then learns from the transformed dataset with single-label algorithms. Algorithms of this strategy include PT-SVM and PT-Bayes, which apply SVM and the Bayes classifier respectively. The second one is Algorithm Adaptation (AA), algorithms of which adapt existing learning algorithms to deal with label distributions directly. Two representative algorithms are AA-kNN and AA-BP. For AA-kNN, the mean of the label distributions of the $k$ nearest neighbors is calculated as the predicted label distribution, and for AA-BP, a one-hidden-layer neural network with multiple outputs is adapted to minimize the sum-squared loss between the output of the network and the ground-truth label distribution. The last one is Specialized Algorithm (SA). This category of algorithms takes the characteristics of LDL into consideration. Two representative approaches of SA are IIS-LDL and BFGS-LDL, which apply the maximum entropy model to learn the label distribution. Besides, [Geng and Hou, 2015] treats LDL as a regression problem and proposes LDL-SVR, which embraces SVR to deal with the label distribution. Furthermore, [Shen et al., 2017] proposes LDLF, which extends random forests to learn label distributions. In addition, [Gao et al., 2017] provides the first deep LDL model, DLDL. Notice that, compared with the classic LDL algorithms, LDLF and DLDL support end-to-end training, which makes them suitable for computer vision tasks. Notice also that none of the aforementioned algorithms considers the inconsistency discussed above.

There is little work on the inconsistency between the training and the test phase of LDL. [Gao et al., 2018] first recognizes the inconsistency in the application of age estimation, and designs a lightweight network to jointly learn the age distribution and the ground-truth age to bridge the gap. However, the method is only suitable for a real-valued label space, and no theoretical guarantee is provided. Besides, one recent work [Wang and Geng, 2019] provides a theoretical analysis of the relation between the risk of LDL (with absolute loss) and the error probability of the corresponding classification function, discovering that LDL with absolute loss dominates classification. However, the major motivation of that paper is to understand LDL theoretically, and no algorithm designed to improve classification precision is provided.

3 The Proposed Method

3.1 Preliminary

Let $\mathcal{X}$ be the input space, and $\mathcal{Y} = \{y_1, y_2, \dots, y_m\}$ be the label space. Denote the instance by $x$. The label distribution function $d \in \mathbb{R}^{\mathcal{X}\times\mathcal{Y}}$ is defined as $d : \mathcal{X}\times\mathcal{Y} \mapsto \mathbb{R}$, satisfying the constraints $d_x^y > 0$ and $\sum_{y\in\mathcal{Y}} d_x^y = 1$, where $d_x^y = d(x, y)$ for convenience of notation. Given a training set $S = \{(x_1, [d_{x_1}^{y_1},\dots,d_{x_1}^{y_m}]),\dots,(x_n, [d_{x_n}^{y_1},\dots,d_{x_n}^{y_m}])\}$, the goal of LDL is to learn the unknown function $d$, and the objective of LDL4C is to make a prediction.

3.2 Classification via LDL

In real-world applications, we generally regard a learned LDL model as a classification model. Formally speaking, denote the learned LDL function by $h$ and the corresponding classification function by $\hat{h}$; for a given instance $x$, we have

$$\hat{h}(x) = \arg\max_{y\in\mathcal{Y}} h_x^y,$$

i.e., the prediction is the label with the highest output. Intuitively, if $h$ is near the ground-truth label distribution function $d$, then the corresponding classification function $\hat{h}$ is close to the Bayes classifier (we assume $d$ is the conditional probability distribution function), and thereby LDL is definitely related to classification. The mathematical tool which quantifies the relation between LDL and classification is the plug-in decision theorem [Devroye et al., 1996]. Let $h^*$ be the Bayes classification function, i.e.,

$$h^*(x) = \arg\max_{y\in\mathcal{Y}} d_x^y,$$

then we have

Theorem 1. [Devroye et al., 1996; Wang and Geng, 2019] The error probability difference between $\hat{h}$ and $h^*$ satisfies

$$P(\hat{h}(x)\neq y) - P(h^*(x)\neq y) \le E_x\!\left[\sum_{y\in\mathcal{Y}}\left|h_x^y - d_x^y\right|\right].$$

The theorem says that if $h$ is close to $d$ in terms of absolute loss, then the error probability of $\hat{h}$ is close to that of $h^*$. Theoretically, LDL with absolute loss is therefore directly relevant to classification. Moreover, absolute loss is tighter than KL-divergence, since $\|p - q\|_1 \le \sqrt{2\,\mathrm{KL}(p, q)}$ for probability vectors $p, q \in \mathbb{R}^m$ [Cover and Thomas, 2012]. Accordingly, we propose to use absolute loss, instead of KL-divergence, in LDL for classification. Similar to [Geng, 2016], we use the maximum entropy model to learn the label distribution. The maximum entropy model is defined as

$$h_x^{y_j} = \frac{1}{Z}\exp(w_j\cdot x), \qquad (1)$$

where $Z = \sum_j \exp(w_j\cdot x)$ is the normalization factor.
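To make the classification view of the maximum entropy model concrete, the following sketch (Python with NumPy; the function names and toy data are illustrative assumptions, not part of the paper) computes the model output of Eq. (1), the plug-in prediction $\hat{h}(x)$, and the per-instance absolute loss $\sum_y |h_x^y - d_x^y|$ that appears on the right-hand side of Theorem 1:

```python
import numpy as np

def maxent_output(W, x):
    """Eq. (1): h_x^{y_j} = exp(w_j . x) / sum_k exp(w_k . x), i.e. a softmax over linear scores."""
    scores = W @ x                      # shape (m,), one score per label
    scores = scores - scores.max()      # numerical stability; does not change the softmax
    e = np.exp(scores)
    return e / e.sum()

def plug_in_predict(W, x):
    """The classification rule induced by the LDL model: the label with the highest output."""
    return int(np.argmax(maxent_output(W, x)))

def absolute_loss(h, d):
    """Per-instance absolute loss sum_y |h_x^y - d_x^y|, the quantity bounded in Theorem 1."""
    return float(np.abs(h - d).sum())

# Toy usage with assumed sizes: m = 3 labels, feature dimension 4.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
x = rng.normal(size=4)
d = np.array([0.2, 0.5, 0.3])           # a hypothetical ground-truth label distribution
h = maxent_output(W, x)
print(plug_in_predict(W, x), absolute_loss(h, d))
```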


Absolute loss is applied as the measure for LDL, and consequently the problem can be formulated as

$$\min_{W} \sum_{i=1}^{n}\sum_{j=1}^{m}\left|h_{x_i}^{y_j} - d_{x_i}^{y_j}\right| + \frac{C}{2}\sum_{j=1}^{m}\|w_j\|^2, \qquad (2)$$

where $W = [w_1, w_2, \dots, w_m]$ is the set of parameters.

[Figure 1: two bar plots over the label space $\{y_1,\dots,y_5\}$, showing the ground-truth label distribution $d_x$ (red) and the learned label distribution $h_x$ (green); panel (a): $\ell_1 = 0.5$, $\ell_{ep} = 0.5$; panel (b): $\ell_1 = 0.2$, $\ell_{ep} = 0.8$.]

Figure 1: An example to illustrate the gap between absolute loss for LDL and error probability for classification. The red bars represent the ground-truth label distribution, and the green ones represent the learned label distribution.

3.3 Re-weighting with Information Entropy

Taking a second look at Theorem 1, we find that model (2) actually optimizes an upper bound on the error probability. This brings a gap between absolute loss for LDL and 0/1 loss for classification. Fig. 1 is an illustrative example of the gap, where $\ell_1$ denotes the absolute loss and $\ell_{ep}$ denotes the error probability defined in Eq. (15). As we can see from the figure, (b) is superior to (a) from the perspective of absolute loss. However, (b) is inferior to (a) in terms of error probability. For (a), $y_3$ has the highest label description degree under both $d$ and $h$, thus the prediction of $\hat{h}$ coincides with that of the Bayes classification function $h^*$. For (b), $y_3$ has the highest degree under $d$ while $y_2$ has the highest degree under $h$, thus the corresponding classification result differs from that of the Bayes classifier. Essentially, for a label distribution which is multimodal (like (b) of Fig. 1), the corresponding classification function is much more sensitive to the LDL loss, while for a label distribution which is unimodal (like (a) of Fig. 1), there is more room to manoeuvre.

Information entropy seems to be a reasonable tool to quantify this sensitivity, since a multimodal distribution generally has higher information entropy than a unimodal one. Precisely, for a label distribution with higher information entropy (i.e., multimodal), the corresponding classification model is much more sensitive to the LDL loss, and vice versa. In other words, samples with high information entropy deserve more attention, which motivates us to re-weight samples with information entropy. Re-weighting with information entropy penalizes large losses on samples with high information entropy, and leaves more room for samples with low information entropy. Recall the definition of information entropy; for $x$,

$$E_x = -\sum_{y\in\mathcal{Y}} d_x^y \ln d_x^y.$$

Accordingly, the problem is then formulated as

$$\min_{W} \sum_{i=1}^{n} E_{x_i}\sum_{j=1}^{m}\left|h_{x_i}^{y_j} - d_{x_i}^{y_j}\right| + \frac{C}{2}\sum_{j=1}^{m}\|w_j\|^2. \qquad (3)$$

3.4 LDL with Large Margin

To further boost the classification precision, we borrow from margin theory. Formally speaking, we let the model output corresponding to the Bayes prediction have a large margin over the other labels, which pushes the corresponding classification model to be consistent with the Bayes classification model. Let $y_i^*$ be the Bayes prediction for $x_i$, i.e., $y_i^* = h^*(x_i)$. Then we assume a margin $\rho$ between $h_{x_i}^{y_i^*}$ and $\max_{y\in\{\mathcal{Y}-y_i^*\}} h_{x_i}^{y}$, and the problem can be formulated as

$$\min_{W}\ \sum_{i=1}^{n} E_{x_i}\sum_{j=1}^{m}\left|h_{x_i}^{y_j} - d_{x_i}^{y_j}\right| + C_1\sum_{i=1}^{n}\xi_i + \frac{C_2}{2}\sum_{j=1}^{m}\|w_j\|^2,$$
$$\text{s.t.:}\ h_{x_i}^{y_i^*} - \max_{y\in\{\mathcal{Y}-y_i^*\}} h_{x_i}^{y} > \rho - \xi_i,\quad \xi_i \ge 0,\ \forall i\in[n], \qquad (4)$$

where $C_1, C_2$ are the balance coefficients for the margin loss and the regularization respectively. Large margin turns out to be a reasonable assumption since the ultimate goal of LDL4C is to pursue the Bayes classifier. Overall, model (4) seeks a balance between LDL and a large margin classifier.

3.5 Optimization

To start, for $x_i$ define

$$\alpha_i = h_{x_i}^{y_i^*} - \max_{y\in\{\mathcal{Y}-y_i^*\}} h_{x_i}^{y}, \qquad (5)$$

and the $\rho$-margin loss as $\ell^\rho(\alpha) = \max\{0, 1 - \alpha/\rho\}$. By introducing the margin loss, model (4) can be reformulated as

$$\min_{W} \sum_{i=1}^{n} E_{x_i}\sum_{j=1}^{m}\left|h_{x_i}^{y_j} - d_{x_i}^{y_j}\right| + C_1\sum_{i=1}^{n}\ell^\rho(\alpha_i) + \frac{C_2}{2}\sum_{j=1}^{m}\|w_j\|^2, \qquad (6)$$

which can be optimized efficiently by gradient-based methods ($\ell^\rho$ is differentiable almost everywhere). We optimize the target function by BFGS [Nocedal and Wright, 2006]. Define $\phi$ as the step function,

$$\phi(x) = \begin{cases} 1 & \text{if } x \ge 0,\\ 0 & \text{otherwise},\end{cases}$$

and denote the target function by $T$, whose gradient is obtained through

$$\frac{\partial T}{\partial w_k} = \sum_{i,j} E_{x_i}\,\mathrm{sign}\!\left(h_{x_i}^{y_j} - d_{x_i}^{y_j}\right)\frac{\partial h_{x_i}^{y_j}}{\partial w_k} + C_2 w_k + C_1\sum_{i=1}^{n}\phi(\rho - \alpha_i)\left(\frac{\partial \max_{y\in\{\mathcal{Y}-y_i^*\}} h_{x_i}^{y}}{\partial w_k} - \frac{\partial h_{x_i}^{y_i^*}}{\partial w_k}\right).$$

Moreover, the gradient of $h$ is obtained through

$$\frac{\partial h_x^{y_j}}{\partial w_k} = \left(\mathbb{1}_{\{j=k\}}\frac{\exp(w_k\cdot x)}{\sum_i \exp(w_i\cdot x)} - \frac{\exp(w_j\cdot x)\exp(w_k\cdot x)}{\left(\sum_i \exp(w_i\cdot x)\right)^2}\right)x,$$

i.e., $\partial h_x^{y_j}/\partial w_k = h_x^{y_j}\left(\mathbb{1}_{\{j=k\}} - h_x^{y_k}\right)x$.
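As an illustration of how Eqs. (3)-(6) fit together, here is a minimal sketch (Python/NumPy plus SciPy; the flattened-parameter layout, the toy data, and the use of numerical gradients via SciPy's BFGS are our assumptions rather than the paper's implementation, which uses the analytic gradient given above) that evaluates the LDL4C objective of Eq. (6): entropy-weighted absolute loss, $\rho$-margin loss on the Bayes label, and L2 regularization:

```python
import numpy as np
from scipy.optimize import minimize

def softmax(S):
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def ldl4c_objective(w_flat, X, D, rho, C1, C2):
    """Eq. (6): entropy-weighted absolute loss + C1 * margin loss + (C2/2) * ||W||^2."""
    n, p = X.shape
    m = D.shape[1]
    W = w_flat.reshape(m, p)
    H = softmax(X @ W.T)                                    # model outputs h_{x_i}^{y_j}, Eq. (1)
    ent = -(D * np.log(np.clip(D, 1e-12, None))).sum(1)     # information entropy E_{x_i}
    abs_term = (ent * np.abs(H - D).sum(1)).sum()           # entropy-weighted absolute loss, Eq. (3)
    y_star = D.argmax(1)                                    # Bayes prediction on the training data
    h_star = H[np.arange(n), y_star]
    H_rest = H.copy()
    H_rest[np.arange(n), y_star] = -np.inf
    alpha = h_star - H_rest.max(1)                          # margins alpha_i, Eq. (5)
    margin_term = np.maximum(0.0, 1.0 - alpha / rho).sum()  # rho-margin loss
    reg = 0.5 * (W ** 2).sum()
    return abs_term + C1 * margin_term + C2 * reg

# Toy usage with assumed sizes; gradients are approximated numerically by BFGS here.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
D = rng.dirichlet(np.ones(4), size=50)
res = minimize(ldl4c_objective, np.zeros(4 * 8),
               args=(X, D, 0.1, 1.0, 1.0), method="BFGS")
W_hat = res.x.reshape(4, 8)
```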


4 Theoretical Results

In this section, we demonstrate that LDL4C theoretically seeks a trade-off between generalization and discrimination, where generalization is due to the LDL process, and discrimination is credited to the large margin classifier.

4.1 Generalization

Here, by generalization we mean that the error probability of LDL4C converges to that of the Bayes classifier, which we show by developing an upper bound on the error probability of LDL4C. The basic steps are to provide an upper bound on the risk of LDL4C (with absolute loss) first, and then to convert the risk bound into an error probability bound via Theorem 1.

Notice that the maximum entropy model, i.e., Eq. (1), can be regarded as the composition of the softmax function and a multi-output linear regression function. Formally, let $\mathrm{SF}$ be the softmax function and $\mathcal{F}$ be a family of functions for multi-output linear regression. Denote the family of functions for the maximum entropy model by $\mathcal{H}$; then $\mathcal{H} = \{x \mapsto (\mathrm{SF}\circ f)(x) : f \in \mathcal{F}\}$. To bound the risk of LDL4C, it suffices to derive an upper bound on the Rademacher complexity [Bartlett and Mendelson, 2003] of the model.

Theorem 2. Define $\mathcal{F}$ as $\mathcal{F} = \{x \mapsto [w_1\cdot x,\dots,w_m\cdot x]^{\mathrm{T}} : \|w_j\| \le 1\}$. The Rademacher complexity of $\ell_1\circ\mathcal{H}$ satisfies

$$\hat{R}_n(\ell_1\circ\mathcal{H}) \le \frac{2\sqrt{2}\,m^2\max_{i\in[n]}\|x_i\|_2}{\sqrt{n}}. \qquad (7)$$

Proof. Firstly, according to [Wang and Geng, 2019], $\ell_1\circ\mathrm{SF}$ is shown to be $2m$-Lipschitz. And according to [Maurer, 2016], with $\ell_1\circ\mathrm{SF}$ being $2m$-Lipschitz,

$$\hat{R}_n(\ell_1\circ\mathcal{H}) \le 2\sqrt{2}\,m\sum_{j=1}^{m}\hat{R}_n(\mathcal{F}_j), \qquad (8)$$

where $\mathcal{F}_j = \{x\mapsto w_j\cdot x : \|w_j\|\le 1\}$. Moreover, according to [Kakade et al., 2009], the Rademacher complexity of $\mathcal{F}_j$ satisfies

$$\hat{R}_n(\mathcal{F}_j) \le \frac{\max_{i\in[n]}\|x_i\|_2}{\sqrt{n}}. \qquad (9)$$

Finally, substituting Eq. (9) into Eq. (8) finishes the proof of Theorem 2.

The data-dependent error probability bound for LDL4C is then as follows.

Theorem 3. Define $\hat{\mathcal{H}} = \{x\mapsto \max_{y\in\mathcal{Y}} h_x^y : h\in\mathcal{H}\}$. Then for any $\delta > 0$, with probability at least $1-\delta$, for any $\hat{h}\in\hat{\mathcal{H}}$,

$$P(\hat{h}(x)\neq y) - P(h^*(x)\neq y) \le \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{m}\left|h_{x_i}^{y_j} - d_{x_i}^{y_j}\right| + \frac{4\sqrt{2}\,m^2\max_{i\in[n]}\|x_i\|_2}{\sqrt{n}} + 6\sqrt{\frac{\log(2/\delta)}{2n}}. \qquad (10)$$

Proof. Firstly, according to [Bartlett and Mendelson, 2003; Mohri et al., 2012], and since $\sum_y |d_x^y - h_x^y| \le 2$ (triangle inequality), the data-dependent risk bound for LDL4C is

$$E_x\!\left[\sum_{j=1}^{m}\left|h_x^{y_j} - d_x^{y_j}\right|\right] \le \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{m}\left|h_{x_i}^{y_j} - d_{x_i}^{y_j}\right| + \frac{4\sqrt{2}\,m^2\max_{i\in[n]}\|x_i\|_2}{\sqrt{n}} + 6\sqrt{\frac{\log(2/\delta)}{2n}}. \qquad (11)$$

Secondly, combining Theorem 1 and Eq. (11) completes the proof of the theorem.

4.2 Discrimination

In this part, we borrow margin theory to demonstrate that LDL4C enjoys discrimination. By discrimination, we mean the ability to output the same prediction as the Bayes prediction. According to [Bartlett and Mendelson, 2003], with the margin loss, we have

Theorem 4. Define $\tilde{\mathcal{H}}$ as $\tilde{\mathcal{H}} = \{(x, y^*)\mapsto h(x, y^*) - \max_{y\in\{\mathcal{Y}-y^*\}} h(x, y) : h\in\mathcal{H}\}$. Fix $\rho > 0$. Then, for any $\delta > 0$, with probability at least $1-\delta$, for all $h\in\mathcal{H}$,

$$E[\ell^\rho(\alpha)] \le \frac{1}{n}\sum_{i=1}^{n}\ell^\rho(\alpha_i) + 2\hat{R}_n(\ell^\rho\circ\tilde{\mathcal{H}}) + 3\sqrt{\frac{\log(2/\delta)}{2n}}. \qquad (12)$$

Since $\ell^\rho$ is $1/\rho$-Lipschitz, we have $\hat{R}_n(\ell^\rho\circ\tilde{\mathcal{H}}) \le \frac{1}{\rho}\hat{R}_n(\tilde{\mathcal{H}})$. And according to [Mohri et al., 2012], $\hat{R}_n(\tilde{\mathcal{H}}) \le 2m\hat{R}_n(\Pi_1(\mathcal{H}))$, where $\Pi_1(\mathcal{H})$ is defined as

$$\Pi_1(\mathcal{H}) = \{x\mapsto h(x, y) : y\in\mathcal{Y}, h\in\mathcal{H}\}.$$

Since $\mathbb{1}_{\{h(x)\neq y^*\}} \le \ell^\rho(\alpha)$, we thus have

$$E\!\left[\mathbb{1}_{\{h(x)\neq y^*\}}\right] \le \frac{1}{n}\sum_{i=1}^{n}\ell^\rho(\alpha_i) + \frac{4m\hat{R}_n(\Pi_1(\mathcal{H}))}{\rho} + 3\sqrt{\frac{\log(2/\delta)}{2n}}.$$

And $\hat{R}_n(\Pi_1(\mathcal{H}))$ can be bounded as

$$\hat{R}_n(\Pi_1(\mathcal{H})) \le \frac{(m-1)\gamma\exp(\gamma)}{\sqrt{n}}, \qquad (13)$$

where $\gamma = 2\max_{i\in[n]}\|x_i\|$, which leads to

Theorem 5. Fix $\rho > 0$. Then for any $\delta > 0$, with probability at least $1-\delta$, for all $h\in\mathcal{H}$,

$$E\!\left[\mathbb{1}_{\{h(x)\neq y^*\}}\right] \le \frac{1}{n}\sum_{i=1}^{n}\ell^\rho(\alpha_i) + \frac{4(m-1)m\gamma\exp(\gamma)}{\rho\sqrt{n}} + 3\sqrt{\frac{\log(2/\delta)}{2n}}. \qquad (14)$$

5 Experiments

5.1 Experimental Configuration

Real-world datasets. The experiments are conducted on seventeen datasets in total, among which fifteen are from [Geng, 2016], M2B is from [Nguyen et al., 2012], and SCUT-FBP is from [Xie et al., 2015]. Note that M2B and SCUT-FBP are transformed to label distributions as in [Ren and Geng, 2017]. The datasets come from a variety of fields, and statistics about them are listed in Table 1.


Dataset         # Examples   # Features   # Labels
SJAFFE               213         243          6
M2B                 1240         250          5
SCUT-FBP            1500         300          5
Natural Scene       2000         294          9
Yeast-alpha         2465          24         18
Yeast-cdc           2465          24         15
Yeast-cold          2465          24          4
Yeast-diau          2465          24          7
Yeast-dtt           2465          24          4
Yeast-elu           2465          24         14
Yeast-heat          2465          24          6
Yeast-spo           2465          24          6
Yeast-spo5          2465          24          3
Yeast-spoem         2465          24          2
SBU 3DFE            2500         243          6
Movie               7755        1869          5
Human Gene         17892          36         68

Table 1: Statistics of the 17 real-world datasets used in the experiments.

Comparing algorithms. We compare LDL4C with six popular LDL algorithms, i.e., PT-Bayes, PT-SVM, AA-BP, AA-kNN, LDL-SVR [Geng and Hou, 2015], and BFGS-LDL, and two state-of-the-art LDL algorithms, i.e., StructRF [Chen et al., 2018] and LDL-SCL [Zheng et al., 2018]. For LDL4C, the balance parameters $C_1$ and $C_2$ are selected from {0.001, 0.01, 0.1, 1, 10, 100} and $\rho$ is chosen from {0.001, 0.01, 0.1} by cross validation. Moreover, for AA-BP, the number of hidden-layer neurons is set to 64, and for AA-kNN, the number of nearest neighbors $k$ is selected from {3, 5, 7, 9, 11}. Furthermore, for BFGS-LDL, we follow the same settings as [Geng, 2016]. Besides, the insensitivity parameter $\epsilon$ is set to 0.1 as suggested by [Geng and Hou, 2015], and an RBF kernel is used for both PT-SVM and LDL-SVR. For StructRF, the depth is set to 20, the number of trees is set to 50, and the sampling ratio is set to 0.8 as suggested by [Chen et al., 2018]. In addition, for LDL-SCL, $\lambda_1 = 0.001$, $\lambda_2 = 0.001$, $\lambda_3 = 0.001$, and the number of clusters is 6, as given in [Zheng et al., 2018]. Finally, all algorithms are examined on the 17 datasets with 10-fold cross validation, and the average performance is reported.

Evaluation metrics. We test the algorithms in terms of two measures, 0/1 loss and error probability. For 0/1 loss, we regard the label with the highest label description degree as the ground-truth label. Given an instance $x$ with prediction $y$, the error probability is defined as

$$\ell_{ep}(x, y) = 1 - d_x^y. \qquad (15)$$
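Both measures are straightforward to compute from the predicted and ground-truth distributions; the following sketch (Python/NumPy, with assumed array shapes in which rows are instances and columns are labels) is one way to implement them:

```python
import numpy as np

def zero_one_loss(H, D):
    """0/1 loss: predicted top label vs. the label with the highest ground-truth degree."""
    return float(np.mean(H.argmax(axis=1) != D.argmax(axis=1)))

def error_probability(H, D):
    """Eq. (15): l_ep(x, y) = 1 - d_x^y for the predicted label y, averaged over instances."""
    pred = H.argmax(axis=1)
    return float(np.mean(1.0 - D[np.arange(len(D)), pred]))
```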
5.2 Experimental Results

Firstly, to validate the three key components (i.e., absolute loss, re-weighting with information entropy, and large margin) for boosting classification precision, we add each of the components to BFGS-LDL and compare the 0/1 loss with that of BFGS-LDL. The performance is reported in Table 2. Due to the page limitation, we only report part of the results. Here "LDL" is short for BFGS-LDL, "LDL+$\ell_1$" represents BFGS-LDL with absolute loss, "LDL+RW" represents BFGS-LDL with information entropy re-weighting, and "LDL+LM" stands for BFGS-LDL with large margin. As we can see from the table, armed with the three key components, the corresponding LDL model achieves better classification performance compared with the original BFGS-LDL model.

Dataset         LDL    LDL+ℓ1  LDL+RW  LDL+LM
SJAFFE          .437   .417    .394    .426
M2B             .486   .486    .482    .485
SCUT-FBP        .467   .470    .465    .467
Natural Scene   .597   .420    .421    .425
Yeast-alpha     .891   .787    .787    .787
Yeast-cdc       .823   .820    .821    .821
Yeast-cold      .580   .576    .576    .576
Yeast-diau      .694   .665    .668    .669
Yeast-dtt       .629   .630    .627    .627
Yeast-heat      .702   .676    .675    .677
Yeast-spo       .548   .547    .547    .547
Yeast-spoem     .435   .400    .416    .422
Movie           .416   .416    .410    .414

Table 2: Experimental results of comparing BFGS-LDL with BFGS-LDL + {ℓ1, RW, LM} in terms of 0/1 loss. The best results are highlighted in bold face.

Secondly, we conduct extensive experiments with LDL4C and the comparing methods on the 17 real-world datasets in terms of 0/1 loss and error probability, and the experimental results are presented in Table 3 and Table 4 respectively. A pairwise t-test at the 0.05 significance level is conducted, with the best performance marked in bold face. From the tables, we can see that:

• In terms of 0/1 loss, LDL4C ranks 1st on 13 datasets, and is significantly superior to the other comparing methods in 70% of the cases, which validates the ability of LDL4C to perform classification.

• In terms of error probability, LDL4C ranks 1st on 12 datasets, and achieves significantly better performance in 46% of the cases, which discloses that LDL4C has better generalization.

6 Conclusions

This paper tackles the inconsistency between the training and the test phase of LDL in applications. In the training phase of LDL, the aim is to learn the given label distribution, i.e., to minimize (maximize) the distance (similarity) between the model output and the given label distribution. However, in applications of LDL, a learned LDL model is generally treated as a classification model, with the label corresponding to the highest output as the prediction. The proposed algorithm LDL4C alleviates the inconsistency with three key improvements. The first one is inspired by the plug-in decision theorem (Theorem 1), which states that absolute loss is directly related to the classification error; thereby, absolute loss is applied as the loss function of LDL4C. The second one stems from noticing the gap between absolute loss and classification error, which suggests re-weighting samples with information entropy to pay more attention to multimodal distributions. The last one is due to margin theory, which introduces a margin between $h_x^{y^*}$ and $\max_{y\in\{\mathcal{Y}-y^*\}} h_x^y$ to enforce the corresponding classification model to approach the Bayes decision rule. Furthermore, we explore the effectiveness of LDL4C theoretically and empirically. Theoretically, LDL4C enjoys both generalization and discrimination, and empirically, extensive comparative experiments clearly manifest the advantage of LDL4C in classification.


Dataset        PT-Bayes    PT-SVM      AA-BP       AA-kNN      LDL-SVR     StructRF    LDL-SCL     BFGS-LDL    LDL4C
SJAFFE         .783±.016•  .785±.003•  .807±.003•  .483±.012•  .686±.004•  .569±.011•  .789±.002•  .437±.018   .394±.010
M2B            .645±.003•  .552±.003•  .487±.001   .484±.001   .495±.002   .483±.002   .498±.001   .486±.002   .481±.001
SCUT-FBP       .893±.002•  .517±.001   .463±.001   .457±.001   .467±.001   .467±.001   .502±.001•  .467±.000   .465±.001
Natural Scene  .801±.002•  .649±.000•  .667±.001•  .521±.001•  .568±.001•  .520±.001•  .696±.001•  .597±.002•  .420±.001
Yeast-alpha    .945±.000•  .940±.000•  .886±.000•  .882±.000•  .925±.000•  .882±.001•  .903±.000•  .891±.000•  .787±.001
Yeast-cdc      .935±.000•  .926±.000•  .825±.001   .830±.001•  .858±.000•  .824±.001   .825±.000   .823±.001   .818±.001
Yeast-cold     .730±.000•  .733±.001•  .580±.001   .570±.001◦  .609±.000•  .585±.001•  .586±.001•  .580±.001   .575±.001
Yeast-diau     .866±.000•  .848±.001•  .711±.001•  .688±.000•  .694±.001•  .678±.000   .703±.001•  .694±.001•  .665±.000
Yeast-dtt      .743±.001•  .740±.001•  .641±.001   .637±.001   .651±.001•  .622±.001   .636±.001   .629±.001   .627±.001
Yeast-elu      .930±.000•  .924±.000•  .917±.000•  .867±.001•  .892±.000•  .875±.000•  .903±.000•  .899±.000•  .803±.000
Yeast-heat     .824±.001•  .827±.000•  .707±.001•  .692±.001   .710±.001•  .680±.001   .717±.001•  .703±.001•  .675±.001
Yeast-spo      .826±.001•  .817±.001•  .548±.000   .565±.001•  .616±.000•  .566±.001•  .584±.000•  .548±.000   .547±.000
Yeast-spo5     .667±.001•  .637±.000•  .543±.001   .542±.000   .559±.001   .524±.001   .622±.000•  .559±.001•  .534±.001
Yeast-spoem    .482±.000•  .479±.001•  .441±.001•  .410±.001   .409±.000   .401±.002   .439±.002•  .435±.002•  .401±.000
SBU 3DFE       .818±.001•  .810±.001•  .675±.001•  .616±.001•  .633±.002•  .516±.001◦  .514±.001◦  .582±.001   .569±.001
Movie          .819±.000•  .677±.000•  .423±.001•  .427±.000•  .415±.000•  .410±.000   .454±.000•  .416±.000•  .409±.000
Human Gene     .986±.000•  .981±.000•  .951±.001•  .966±.000•  .977±.000•  .965±.000•  .976±.000•  .955±.000•  .927±.000

Table 3: Experimental results (mean ± std.) of LDL4C and the comparing algorithms in terms of 0/1 loss on the 17 real-world datasets. •/◦ indicates whether LDL4C is statistically superior/inferior to the comparing algorithm.

Dataset        PT-Bayes     PT-SVM       AA-BP        AA-kNN       LDL-SVR      StructRF     LDL-SCL      BFGS-LDL     LDL4C
SJAFFE         .8080±3e-4•  .8305±4e-4•  .8188±2e-4•  .7693±6e-4•  .7951±1e-4•  .7766±5e-4•  .8309±2e-4•  .7583±6e-4   .7573±4e-4
M2B            .6438±9e-4•  .5441±4e-4   .5436±4e-4   .5424±4e-4   .5465±3e-4   .5427±6e-4   .5494±4e-4   .5419±6e-4   .5358±6e-4
SCUT-FBP       .8905±6e-4•  .5414±2e-4   .5410±2e-4   .5379±3e-4   .5420±2e-4   .5486±3e-4•  .5494±4e-4   .5426±2e-4   .5414±2e-4
Natural Scene  .8013±9e-4•  .6478±1e-4   .6620±3e-4•  .6658±2e-4•  .6400±1e-4   .6200±2e-4◦  .6648±2e-4•  .6471±2e-4   .6450±3e-4
Yeast-alpha    .9444±0e-4•  .9442±0e-4•  .9427±0e-4   .9428±0e-4   .9435±0e-4•  .9429±0e-4•  .9429±0e-4•  .9426±0e-4   .9426±0e-4
Yeast-cdc      .9329±0e-4•  .9320±0e-4•  .9287±0e-4   .9290±0e-4   .9298±0e-4•  .9288±0e-4   .9289±0e-4•  .9287±0e-4   .9287±0e-4
Yeast-cold     .7465±0e-4•  .7402±0e-4•  .7297±0e-4   .7297±0e-4   .7350±0e-4•  .7291±0e-4   .7304±0e-4•  .7297±0e-4   .7296±0e-4
Yeast-diau     .8524±0e-4•  .8478±0e-4•  .8432±0e-4   .8424±0e-4   .8432±0e-4   .8424±0e-4   .8427±0e-4   .8428±0e-4   .8428±0e-4
Yeast-dtt      .7489±0e-4•  .7486±0e-4•  .7421±0e-4•  .7415±0e-4   .7429±0e-4•  .7430±0e-4•  .7418±0e-4   .7416±0e-4   .7412±0e-4
Yeast-elu      .9285±0e-4•  .9277±0e-4•  .9261±0e-4   .9261±0e-4   .9266±0e-4•  .9260±0e-4   .9262±0e-4   .9261±0e-4   .9260±0e-4
Yeast-heat     .8299±0e-4•  .8309±0e-4   .8238±0e-4   .8237±0e-4   .8250±0e-4   .8234±0e-4   .8250±0e-4•  .8234±0e-4   .8233±0e-4
Yeast-spo      .8324±0e-4•  .8165±0e-4•  .8101±0e-4   .8109±0e-4   .8137±0e-4•  .8105±0e-4   .8115±0e-4   .8101±0e-4   .8000±0e-4
Yeast-spo5     .6601±0e-4•  .6551±0e-4   .6511±0e-4   .6473±0e-4   .6487±0e-4   .6430±0e-4◦  .6533±0e-4   .6527±0e-4   .6526±0e-4
Yeast-spoem    .4909±0e-4   .4756±1e-4   .4717±1e-4   .4689±0e-4   .4720±0e-4   .4670±1e-4   .4713±1e-4   .4705±0e-4   .4700±0e-4
SBU 3DFE       .7991±1e-4•  .8093±1e-4•  .8041±1e-4•  .7875±0e-4•  .7937±0e-4•  .7745±0e-4   .7769±0e-4•  .7746±0e-4   .7734±0e-4
Movie          .8179±0e-4•  .6815±0e-4•  .6760±0e-4•  .6781±0e-4•  .6760±0e-4•  .6745±0e-4   .6844±0e-4•  .6755±0e-4•  .6743±0e-4
Human Gene     .9848±0e-4•  .9834±0e-4•  .9824±0e-4•  .9831±0e-4•  .9822±0e-4•  .9626±0e-4•  .9822±0e-4•  .9816±0e-4   .9816±0e-4

Table 4: Experimental results (mean ± std.) of LDL4C and the comparing algorithms in terms of error probability on the 17 real-world datasets. •/◦ indicates whether LDL4C is statistically superior/inferior to the comparing algorithm.

Acknowledgments

This research was supported by the National Key Research & Development Plan of China (No. 2017YFB1002801), the National Science Foundation of China (61622203), the Collaborative Innovation Center of Novel Software Technology and Industrialization, and the Collaborative Innovation Center of Wireless Communications Technology.

A Proof for Eq. (13)

Proof. Recall the definition of Rademacher complexity,

$$\hat{R}(\Pi_1(\mathcal{H})) = E_\sigma\!\left[\sup_{w_y, W}\frac{1}{n}\sum_{i=1}^{n}\sigma_i\frac{\exp(x_i\cdot w_y)}{\sum_j \exp(w_j\cdot x_i)}\right]
= E_\sigma\!\left[\sup_{w_y, W}\frac{1}{n}\sum_{i=1}^{n}\frac{\sigma_i}{\sum_j e^{(w_j-w_y)\cdot x_i}}\right]
\le \frac{1}{n}E_\sigma\!\left[\sup_{w_y, W}\sum_{i,\,j\neq y} e^{(w_j-w_y)\cdot x_i}\sigma_i\right]
\le \frac{1}{n}E_\sigma\!\left[\sup_{w_y, w_j}\sum_{j\neq y}\sum_{i=1}^{n} e^{(w_j-w_y)\cdot x_i}\sigma_i\right],$$

where $\sigma_1, \sigma_2, \dots, \sigma_n$ are $n$ independent random variables with $P(\sigma_i = -1) = P(\sigma_i = 1) = 1/2$, $W = [w_1, w_2, \dots, w_m]$ with $\|w_j\| < 1$, and the third inequality is according to the 1-Lipschitzness of the function $1/x$ for $x > 1$. And for each $j \neq y$,

$$E_\sigma\!\left[\sup_{w_y, w_j}\sum_{i=1}^{n}\frac{\sigma_i}{n} e^{(w_j-w_y)\cdot x_i}\right] \le e^{\gamma}\, E_\sigma\!\left[\sup_{w_y, w_j}\sum_{i=1}^{n}\frac{\sigma_i (w_j-w_y)\cdot x_i}{n}\right] \le \frac{\gamma e^{\gamma}}{\sqrt{n}},$$

where the first inequality is according to the $e^a$-Lipschitzness of the function $e^x$ for $x < a$, and the second one is according to [Kakade et al., 2009] by noticing that $\|w_j - w_y\| < 2$, which concludes the proof for Eq. (13).

References

[Bartlett and Mendelson, 2003] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463-482, 2003.

[Chen et al., 2018] Mengting Chen, Xinggang Wang, Bin Feng, and Wenyu Liu. Structured random forest for label distribution learning. Neurocomputing, 320:171-182, 2018.

[Cover and Thomas, 2012] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, 2012.

[Devroye et al., 1996] Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition, volume 31. Springer Science & Business Media, 1996.

[Gao et al., 2017] Binbin Gao, Chao Xing, Chenwei Xie, Jianxin Wu, and Xin Geng. Deep label distribution learning with label ambiguity. IEEE Transactions on Image Processing, 26(6):2825-2838, 2017.

[Gao et al., 2018] Binbin Gao, Hongyu Zhou, Jianxin Wu, and Xin Geng. Age estimation using expectation of label distribution learning. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pages 712-718, 2018.

[Geng and Hou, 2015] Xin Geng and Peng Hou. Pre-release prediction of crowd opinion on movies by label distribution learning. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI'15, pages 3511-3517, 2015.

[Geng et al., 2013] Xin Geng, Chao Yin, and Zhihua Zhou. Facial age estimation by learning from label distributions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(10):2401-2412, 2013.

[Geng, 2016] Xin Geng. Label distribution learning. IEEE Transactions on Knowledge and Data Engineering, 28(7):1734-1748, 2016.

[Huo and Geng, 2017] Zengwei Huo and Xin Geng. Ordinal zero-shot learning. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 1916-1922, 2017.

[Kakade et al., 2009] Sham M. Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems, NIPS'09, pages 793-800, 2009.

[Maurer, 2016] Andreas Maurer. A vector-contraction inequality for Rademacher complexities. In Algorithmic Learning Theory, pages 3-17, 2016.

[Mohri et al., 2012] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2012.

[Nguyen et al., 2012] Tam V. Nguyen, Si Liu, Bingbing Ni, Jun Tan, Yong Rui, and Shuicheng Yan. Sense beauty via face, dressing, and/or voice. In Proceedings of the 20th ACM International Conference on Multimedia, MM '12, pages 239-248, New York, USA, 2012. ACM.

[Nocedal and Wright, 2006] Jorge Nocedal and Stephen Wright. Numerical Optimization. Springer Science & Business Media, 2006.

[Ren and Geng, 2017] Yi Ren and Xin Geng. Sense beauty by label distribution learning. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 2648-2654, 2017.

[Shen et al., 2017] Wei Shen, Kai Zhao, Yilu Guo, and Alan L. Yuille. Label distribution learning forests. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'30, pages 834-843. Curran Associates, Inc., 2017.

[Wang and Geng, 2019] Jing Wang and Xin Geng. Theoretical analysis of label distribution learning (in press). In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, AAAI'19, 2019.

[Xie et al., 2015] Duorui Xie, Lingyu Liang, Lianwen Jin, Jie Xu, and Mengru Li. SCUT-FBP: A benchmark dataset for facial beauty perception. In 2015 IEEE International Conference on Systems, Man, and Cybernetics, 2015.

[Xu et al., 2018] Ning Xu, An Tao, and Xin Geng. Label enhancement for label distribution learning. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pages 2926-2932, 2018.

[Zhang and Zhou, 2014] Minling Zhang and Zhihua Zhou. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 26(8):1819-1837, 2014.

[Zhang et al., 2015] Zhaoxiang Zhang, Mo Wang, and Xin Geng. Crowd counting in public video surveillance by label distribution learning. Neurocomputing, 166(1):151-163, 2015.

[Zheng et al., 2018] Xiang Zheng, Xiuyi Jia, and Weiwei Li. Label distribution learning by exploiting sample correlations locally, 2018.

[Zhou and Zhang, 2006] Zhihua Zhou and Minling Zhang. Multi-instance multi-label learning with application to scene classification. In Proceedings of the 19th International Conference on Neural Information Processing Systems, NIPS'06, pages 1609-1616, Cambridge, MA, USA, 2006. MIT Press.

[Zhou et al., 2015] Ying Zhou, Hui Xue, and Xin Geng. Emotion distribution recognition from facial expressions. In Proceedings of the 23rd ACM International Conference on Multimedia, MM '15, pages 1247-1250, New York, NY, USA, 2015. ACM.
