
The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Theoretical Analysis of Label Distribution Learning

Jing Wang, Xin Geng∗
MOE Key Laboratory of Computer Network and Information Integration
School of Computer Science and Engineering, Southeast University, Nanjing 210096, China
{wangjing91, xgeng}@seu.edu.cn

∗Corresponding author.
Copyright © 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

As a novel learning paradigm, label distribution learning (LDL) explicitly models label ambiguity with the definition of label description degree. Although much work has been done on real-world applications, theoretical results on LDL remain unexplored. In this paper, we rethink LDL from a theoretical perspective, towards analyzing the learnability of LDL. Firstly, risk bounds for three representative LDL algorithms (AA-kNN, AA-BP and SA-ME) are provided. For AA-kNN, Lipschitzness of the label distribution function is assumed to bound the risk, and for AA-BP and SA-ME, Rademacher complexity is utilized to give data-dependent risk bounds. Secondly, a generalized plug-in decision theorem is proposed to understand the relation between LDL and classification, uncovering that approximating the conditional probability distribution function in absolute loss guarantees approaching the optimal classifier; data-dependent error probability bounds are also presented for the corresponding LDL algorithms when used for classification. As far as we know, this is perhaps the first research on the theory of LDL.

Introduction

Traditional learning paradigms include single-label learning (SLL) and multi-label learning (MLL) (Zhang and Zhou 2014). SLL assumes that an instance is associated with one label, while MLL assumes that multiple labels are assigned to an instance. Essentially, both SLL and MLL aim at finding the related label or labels to describe the instance, while neither SLL nor MLL supports relative importance of the labels assigned to the instance. However, in some real-world applications, labels are related to the instance with different degrees of relative importance. Thus it seems reasonable to label an instance with a soft label rather than a hard one (e.g., a single label or a set of labels). Inspired by this, (Geng 2016) proposes a novel learning paradigm, Label Distribution Learning (LDL), which labels an instance with a distribution of description degrees over the label space, called the label distribution, and learns a mapping from instance to label distribution directly. Compared with MLL, where positive labels are commonly treated equally (with description degree equal to 1/c implicitly for positive labels), LDL allows explicit modeling of the different relative importance of the labels assigned to the instance, which is more suitable for many real scenarios.

Recently, LDL has been extensively applied in many real-world applications, which can be classified into three classes according to the source of the label distribution. The first class features label distributions that come from the data itself, which includes pre-release rating prediction on movies (Geng and Hou 2015), emotion recognition (Zhou, Xue, and Geng 2015), etc. The second class is characterized by label distributions originating from prior knowledge, with applications including age estimation (Geng, Yin, and Zhou 2013) and head pose estimation (Geng and Xia 2014). The third class covers label distributions learned from data automatically, with applications including label-importance-aware multi-label learning (Li, Zhang, and Geng 2015), beauty sensing (Ren and Geng 2017), and video parsing (Geng and Ling 2017). The secret of the success of LDL in such a variety of fields is that the explicit introduction of label ambiguity through label distributions boosts the performance of real-world applications.

In this paper, we re-examine LDL from a theoretical perspective, towards analyzing the generalization of LDL algorithms. Precisely, there are at least two questions related to the generalization of LDL. The first one concerns the generalization of LDL itself, and the second one concerns the generalization of LDL when used for classification. On one hand, LDL can be regarded as one kind of multi-output regression (Borchani et al. 2015), and generalization results for multi-output regression algorithms (Kakade, Sridharan, and Tewari 2009) can be transferred to LDL to some extent. However, the specializations of LDL should not be neglected. The first thing to note is the probability simplex constraint of the label distribution. As suggested by (Geng 2016), the target label distribution is real-valued and satisfies distribution constraints, i.e., η_x^y ∈ [0, 1] and Σ_{y∈Y} η_x^y = 1 (formally defined in the Preliminary). This constraint is often satisfied by applying a softmax function to each output of a multi-output regression model, which complicates the complexity analysis of the corresponding multi-output regression model. The second thing to note is the specialization of measures for LDL. Although Lipschitzness of the loss function is a common assumption when bounding the risk, the Kullback-Leibler (KL) divergence used as a loss function for LDL (e.g., by SA-ME) does not satisfy Lipschitzness. On the other hand, we find that LDL is usually adopted to perform classification implicitly. For example, in applications such as age estimation (Geng, Yin, and Zhou 2013), head-pose estimation (Geng and Xia 2014), and pre-release rating prediction on movies (Geng and Hou 2015), the label corresponding to the maximum predicted description degree is treated as the predicted label. As far as we know, the generalization of such a framework has never been touched.

The main contributions of this paper are summarized as follows:

1) We establish risk bounds for three representative LDL algorithms, i.e., AA-BP, SA-ME, and AA-kNN, where the bounds for the first two are data-dependent risk bounds, and the bound for AA-kNN is a consistency bound under a Lipschitz assumption.

2) We generalize the binary plug-in decision theorem to multi-class classification, showing that LDL dominates classification.

3) We provide, to the best of our knowledge, the first data-dependent error probability bounds for three representative LDL algorithms when used for classification.

The rest of this paper is organized as follows. Firstly, related work is briefly reviewed. Secondly, notations are introduced in the Preliminary. Then risk bounds for three representative LDL algorithms are provided. Next, the generalization of LDL when used for classification is examined. Finally, we conclude the paper.

Related Work

LDL (Geng 2016) is a novel learning paradigm which labels an instance with a label distribution and learns a mapping from instance to label distribution directly. Existing studies mainly focus on algorithm design and improvement.

Three strategies are used to design algorithms for LDL (Geng 2016). The first one is Problem Transformation (PT). Algorithms of this class first transform the dataset equipped with label distributions into a single-label dataset via re-sampling, and then SLL algorithms are employed to learn from the transformed single-label dataset. Two representative algorithms are PT-SVM and PT-Bayes, which apply SVM and a Bayes classifier respectively. The second one is Algorithm Adaptation (AA), which extends certain existing learning algorithms to handle label distributions seamlessly. Concretely, k-NN is adapted such that, for an instance, the mean of the label distributions of its k nearest neighbors is taken as the predicted label distribution; the resulting algorithm is denoted by AA-kNN. Besides, a three-layer back-propagation neural network with multiple outputs is adapted to minimize the sum-squared loss between the network output and the real label distribution; this algorithm is denoted by AA-BP. The third one is Specialized Algorithms (SA), which match the characteristics of LDL. Two specialized algorithms, i.e., SA-IIS and SA-BFGS, are proposed by applying the maximum entropy model (Berger, Pietra, and Pietra 1996) with KL divergence as the loss function to learn the label distribution. Notice that all the above algorithms are designed without learnability guarantees.

Recently, two lines of improvement on LDL are noteworthy. The first one tackles the shortage of label distribution datasets. Compared with SLL and MLL, LDL datasets are more difficult to acquire. (Xu, Tao, and Geng 2018) proposes to boost logical labels (from SLL/MLL datasets) into real-valued label distributions by graph Laplacian label enhancement. (Hou et al. 2017) utilizes semi-supervised learning, with abundant unlabeled data, to adaptively learn label distributions. (Xu and Zhou 2017) deals with the situation where the label distribution is only partially observed, and jointly recovers and learns it by matrix completion with a trace-norm regularizer. The second line exploits label correlations in LDL. (Zhou, Xue, and Geng 2015) represents label correlations by Pearson's correlation coefficient. (Jia et al. 2018) addresses label correlations by adding a regularizer which encodes label correlations into distances between labels. (Zheng, Jia, and Li 2018) considers label correlation to be local, encodes it into a local correlation vector, and learns the optimal encoding vector and the LDL model simultaneously. These algorithmic improvements are mainly guided by practice, without taking theoretical feasibility into consideration.

There is little theoretical research on LDL. One recent paper (Zhao and Zhou 2018) partially studies the generalization of LDL. However, the major motivation of that paper is to take the structure of the label space into consideration by applying the optimal transport distance (Villani 2008) instead of traditional LDL measures. Moreover, the proposed risk bound is too broad, as it does not provide a bound on the Rademacher complexity of the hypothesis space, and the characteristics of LDL measures are not taken into consideration at all. The risk bounds developed in this paper match the specializations of LDL (i.e., the probability simplex constraint and the loss function characteristics) and are broadly applicable.

For the generalization of LDL when used for classification, one related topic is the plug-in decision theorem (Devroye, Györfi, and Lugosi 1996) (formally described in Theorem 8). It states that the difference in error probability between a decision function and the optimal one is bounded by the expectation of the absolute difference between the plug-in function and the conditional probability distribution function. The plug-in decision theorem has been widely used to prove the consistency of many decision rules, such as Stone's Theorem (Stone 1977), the strong consistency of the k-NN rule (Biau and Devroye 2015), and the consistency of the kernel rule (Devroye and Krzyzak 1989). Note that the classic plug-in decision theorem is only established in the binary setting, which limits its applications. In this paper, we generalize the plug-in decision theorem and develop data-dependent error probability bounds for the corresponding LDL algorithms.

Preliminary

Denote the input space by X ⊆ R^d and the label space by Y = {y_1, y_2, ..., y_m}. Define the label distribution function η : X × Y → R, which satisfies η(x, y) ≥ 0 and Σ_y η(x, y) = 1. Let the training set be S = {(x_1, η(x_1)), ..., (x_n, η(x_n))}, where x_i is sampled according to an underlying probability distribution D, and η(x_i) = {η_{x_i}^{y_1}, ..., η_{x_i}^{y_m}} is determined by the unknown label distribution function η, with η_{x_i}^{y_j} = η(x_i, y_j) for convenience of notation. The goal of LDL is to learn the unknown function η.

Given a function class H and a loss function ℓ : R^m × R^m → R_+, for a function h ∈ H, its risk and empirical risk are defined as L_D(h) = E_{x∼D}[ℓ(h(x), η(x))] and L_S(h) = (1/n) Σ_{i=1}^n ℓ(h(x_i), η(x_i)), respectively. Recall the definition of the empirical Rademacher complexity w.r.t. S and ℓ,

R̂_n(ℓ ∘ H ∘ S) = E_{ε_1,...,ε_n} [ sup_{h∈H} (1/n) Σ_{i=1}^n ε_i ℓ(h(x_i), η(x_i)) ],

where ε_1, ..., ε_n are n independent Rademacher random variables with P(ε_i = 1) = P(ε_i = −1) = 1/2. Then we have the following standard bound.

Lemma 1. (Bartlett and Mendelson 2003; Mohri, Rostamizadeh, and Talwalkar 2012) Let H be a family of functions and let ℓ be a loss function bounded by µ. For any δ > 0, with probability at least 1 − δ, for all h ∈ H,

L_D(h) ≤ L_S(h) + 2 R̂_n(ℓ ∘ H ∘ S) + 3µ √(log(2/δ) / (2n)).   (1)
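To make the objects just defined concrete, here is a minimal sketch in Python/NumPy (our own illustration with made-up numbers; the variable names are not from the paper). It builds a toy label-distribution training set S, checks the simplex constraints on each η(x_i), and evaluates the empirical risk L_S(h) of a trivial hypothesis under the absolute loss that reappears later in Theorem 9.

```python
import numpy as np

# Toy label-distribution training set S = {(x_i, eta(x_i))}: n instances in R^d,
# each annotated with a distribution over m labels (illustrative numbers only).
rng = np.random.default_rng(0)
n, d, m = 5, 3, 4
X = rng.random((n, d))                                            # instances x_1, ..., x_n
logits = rng.normal(size=(n, m))
Eta = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # eta(x_i), rows on the simplex

# Description degrees must be non-negative and sum to one for every instance.
assert np.all(Eta >= 0) and np.allclose(Eta.sum(axis=1), 1.0)

def absolute_loss(p, q):
    """l(p, q) = sum_j |p_j - q_j|, the absolute loss used later in Theorem 9."""
    return np.abs(p - q).sum()

def h_uniform(x):
    """A trivial hypothesis that always predicts the uniform distribution."""
    return np.full(m, 1.0 / m)

# Empirical risk L_S(h) = (1/n) * sum_i l(h(x_i), eta(x_i)); the risk L_D(h)
# replaces the average over S by an expectation over x ~ D.
L_S = np.mean([absolute_loss(h_uniform(x), eta) for x, eta in zip(X, Eta)])
print(f"empirical risk of the uniform predictor: {L_S:.3f}")
```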

Risk Bounds for LDL Algorithms

We now provide risk bounds for three representative algorithms from two classes: AA-kNN and AA-BP from algorithm adaptation, and SA-ME (maximum entropy) from the specialized algorithms. Notice that (Geng 2016) proposes two specialized algorithms, SA-IIS and SA-BFGS, which differ only in the underlying optimization method. In this paper we focus on the generalization ability of the algorithms, and SA-ME represents SA-IIS and SA-BFGS collectively from the perspective of the underlying model, i.e., the maximum entropy model. Algorithms from problem transformation are not covered, because this type of algorithm circumvents learning the label distribution directly, relying on re-sampling and single-label learning algorithms instead.

AA-kNN

AA-kNN extends kNN to deal with label distributions. Given a new instance x, its k nearest neighbors are first selected from the training set. Then the mean of the label distributions of the k nearest neighbors is taken as the label distribution of x, i.e.,

η̃_x^{y_i} = (1/k) Σ_{j∈N_k(x)} η_{x_j}^{y_i},   (i = 1, 2, ..., m),

where η̃ : X × Y → R is the output function of AA-kNN, with η̃_x^{y_i} = η̃(x, y_i) for simplicity of notation, and N_k(x) is the index set of the k nearest neighbors of x. For convenience of analysis, we assume X = [0, 1]^d.

Theorem 2. Assume η(·, y_i) is c_i-Lipschitz w.r.t. X and let c = Σ_{i=1}^m c_i. Then for any δ > 0, with probability at least 1 − δ,

E_{x∼D} [ Σ_{i=1}^m |η_x^{y_i} − η̃_x^{y_i}| ] ≤ (4c√d / δ) (2k/n)^{1/(d+1)}.   (2)

Proof. With the Lipschitz assumption on η(·, y_i), we have

E_{S,x} [ Σ_{i=1}^m |η_x^{y_i} − η̃_x^{y_i}| ] ≤ E_{S,x} [ Σ_{i=1}^m (1/k) Σ_{j∈N_k(x)} |η_x^{y_i} − η_{x_j}^{y_i}| ] ≤ E_{S,x} [ (c/k) Σ_{j∈N_k(x)} ||x_j − x||_2 ].

Thus, to prove Theorem 2, it suffices to prove

E_{S,x} [ (1/k) Σ_{j∈N_k(x)} ||x_j − x||_2 ] ≤ 4√d (2k/n)^{1/(d+1)},   (3)

which is tricky and left to the appendices. Applying Markov's inequality to Eq. (3), Theorem 2 follows directly.
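The AA-kNN predictor analyzed in Theorem 2 fits in a few lines. The sketch below (Python/NumPy, brute-force neighbor search on synthetic data; the function and variable names are our own) returns η̃(x) as the mean of the label distributions of the k nearest training instances, which automatically satisfies the simplex constraint.

```python
import numpy as np

def aa_knn_predict(x, X_train, Eta_train, k=5):
    """AA-kNN: average the label distributions of the k nearest neighbors of x.

    X_train   : (n, d) array of training instances.
    Eta_train : (n, m) array, row i is the label distribution eta(x_i).
    Returns the predicted distribution eta_tilde(x) of shape (m,).
    """
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances ||x_j - x||_2
    nn_idx = np.argsort(dists)[:k]                # index set N_k(x)
    return Eta_train[nn_idx].mean(axis=0)         # averaging keeps the result on the simplex

# Tiny usage example with synthetic data in X = [0, 1]^d, as assumed in Theorem 2.
rng = np.random.default_rng(1)
n, d, m, k = 200, 2, 3, 5
X_train = rng.random((n, d))
logits = rng.normal(size=(n, m))
Eta_train = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

x = rng.random(d)
eta_tilde = aa_knn_predict(x, X_train, Eta_train, k)
print(eta_tilde, eta_tilde.sum())   # a valid distribution summing to 1
```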

AA-BP

AA-BP adapts a back-propagation neural network to perform label distribution learning. The three-layer neural network has m output units, each of which outputs the description degree of one label. To make sure the output of the network satisfies the probability simplex constraint, a softmax activation is applied over the output units. Similar to multi-output regression, AA-BP minimizes the sum-squared loss between the output of the network and the real label distribution. Observing that AA-BP can be regarded as the composition of the softmax function with the function family of a three-layer neural network, its Rademacher complexity w.r.t. S for a loss function ℓ is bounded as follows.

Theorem 3. Denote the softmax function by SF. Let H be a family of functions for three-layer neural networks with m outputs (identity activation on the output layer), and let H_j be the family of functions for the j-th output. For a loss function ℓ with Lipschitz constant L_ℓ, we have

R̂_n(ℓ ∘ SF ∘ H ∘ S) ≤ 2√2 m L_ℓ Σ_{j=1}^m R̂_n(H_j ∘ S).   (4)

Proof. Define the function φ(·, ·) as ℓ(SF(·), ·). We first show that φ is Lipschitz. For probability distributions p, q ∈ R^m,

|φ(p, ·) − φ(q, ·)| ≤ L_ℓ ||SF(p) − SF(q)||_1,

where the inequality follows from the Lipschitzness of ℓ. The right-hand side of the preceding inequality equals

L_ℓ Σ_{i=1}^m | 1/(1 + Σ_{j≠i} e^{p_j − p_i}) − 1/(1 + Σ_{j≠i} e^{q_j − q_i}) |,

where p_i, q_i are the i-th elements of p and q, respectively. Observing that for v ∈ R^m the function 1/(1 + Σ_i exp(v_i)) is 1-Lipschitz, the preceding expression is bounded by

L_ℓ Σ_{i=1}^m ||p − 1·p_i − q + 1·q_i||_2
≤ L_ℓ Σ_{i=1}^m ( ||p − q||_2 + √m |p_i − q_i| )
≤ L_ℓ ( m ||p − q||_2 + √m Σ_{i=1}^m |p_i − q_i| )
≤ 2m L_ℓ ||p − q||_2,

where the last inequality follows from the Cauchy-Schwarz inequality. Recall the definition of the Rademacher complexity,

R̂_n(ℓ ∘ SF ∘ H ∘ S) = E [ sup_{h∈H} (1/n) Σ_{i=1}^n ε_i φ(h(x_i), η(x_i)) ],

and according to (Maurer 2016), with φ being 2mL_ℓ-Lipschitz, the right-hand side of the above equation is bounded by

√2 (2m L_ℓ) E [ sup_{h∈H} (1/n) Σ_{i=1}^n Σ_{j=1}^m ε_{i,j} h_j(x_i) ],   (5)

where h_j(x_i) is the j-th component of h(x_i), and the ε_{i,j} are n×m i.i.d. Rademacher random variables. Notice that the function class of a multi-output neural network can be regarded as the direct sum of the function classes of multiple scalar-output neural networks. Let H_1, ..., H_m be classes of functions for three-layer neural networks with scalar output, so that H = ⊕_{j∈[m]} H_j = { x → [h_1(x) ... h_m(x)]^T : h_j ∈ H_j }. Then Eq. (5) equals

2√2 m L_ℓ E [ sup_{h∈⊕_{j∈[m]} H_j} (1/n) Σ_{i=1}^n Σ_{j=1}^m ε_{i,j} h_j(x_i) ]
≤ 2√2 m L_ℓ Σ_{j=1}^m E [ sup_{h_j∈H_j} (1/n) Σ_{i=1}^n ε_{i,j} h_j(x_i) ]
≤ 2√2 m L_ℓ Σ_{j=1}^m R̂_n(H_j ∘ S),

which finishes the proof of Theorem 3.

The Rademacher complexity of the class of functions for a three-layer neural network with scalar output satisfies the following.

Lemma 4. (Bartlett and Mendelson 2003; Gao and Zhou 2016) Let σ be Lipschitz with constant L_σ. Define the class of functions H_j = { x ↦ Σ_i w_{j,i} σ(v_i · x) : ||w_j||_2 ≤ B_1, ||v_i||_2 ≤ B_0 }. Its Rademacher complexity satisfies

R̂_n(H_j ∘ S) ≤ (L_σ B_0 B_1 / √n) max_{i∈[n]} ||x_i||_2.

Accordingly, the right-hand side of Eq. (4) is bounded as

R̂_n(ℓ ∘ SF ∘ H ∘ S) ≤ (2√2 m² L_ℓ L_σ B_0 B_1 / √n) max_{i∈[n]} ||x_i||_2.

For AA-BP, the sum-squared loss is 2-Lipschitz and bounded by 2. Finally, the data-dependent risk bound for AA-BP is as follows.

Theorem 5. Let F be the family of functions for AA-BP defined above, with sum-squared loss and the weight constraints of Lemma 4. For any δ > 0, with probability at least 1 − δ, for all f ∈ F,

E_{x∼D} [ Σ_{i=1}^m (f_x^{y_i} − η_x^{y_i})² ] ≤ (1/n) Σ_{i=1}^n Σ_{j=1}^m (η_{x_i}^{y_j} − f_{x_i}^{y_j})² + (8√2 m² L_σ B_0 B_1 / √n) max_{i∈[n]} ||x_i||_2 + 6 √(log(2/δ) / (2n)).   (6)

SA-ME

SA-ME applies the maximum entropy model to learn the label distribution, i.e.,

η̃_x^{y_j} = (1/Z) exp(w_j · x),   (7)

where Z = Σ_j exp(w_j · x) is the normalization factor. Eq. (7) can be regarded as the composition of the softmax function with multi-output linear regression, namely SF ∘ H, where H represents a class of multi-output linear regression functions. SA-ME uses the KL divergence as its loss function, denoted by KL : R^m × R^m → R_+. The Rademacher complexity of SA-ME w.r.t. S for the loss function KL satisfies the following.

Theorem 6. Let H be a family of functions for multi-output linear regression, and let H_j be the family of functions for the j-th output. The Rademacher complexity of SA-ME with the KL loss satisfies

R̂_n(KL ∘ SF ∘ H ∘ S) ≤ (√2 m + √2) Σ_{j=1}^m R̂_n(H_j ∘ S).   (8)

Proof. Note that KL(u, ·) is not ρ-Lipschitz over R^m for any ρ ∈ R and u ∈ R^m, so Theorem 3 cannot be applied directly. Define the function φ(·, ·) as KL(·, SF(·)). We show that φ(u, ·) satisfies Lipschitzness. For p, q ∈ R^m,

|φ(u, p) − φ(u, q)| = |KL(u, SF(p)) − KL(u, SF(q))|,

which equals

| Σ_{i=1}^m u_i ( ln( exp(p_i) / Σ_{j=1}^m exp(p_j) ) − ln( exp(q_i) / Σ_{j=1}^m exp(q_j) ) ) |
≤ Σ_{i=1}^m | ln(1 + Σ_{j≠i} e^{p_j − p_i}) − ln(1 + Σ_{j≠i} e^{q_j − q_i}) | u_i.

Observing that ln(1 + Σ_j exp(v_j)) is 1-Lipschitz for v ∈ R^m, the right-hand side of the preceding expression is bounded by

Σ_{i=1}^m u_i ||p − 1·p_i − q + 1·q_i||_2
≤ ||p − q||_2 + √m Σ_{i=1}^m u_i |p_i − q_i|
≤ (√m + 1) ||p − q||_2,

namely, φ is (√m + 1)-Lipschitz. By the same argument used to bound the Rademacher complexity for AA-BP, we obtain

R̂_n(KL ∘ SF ∘ H ∘ S) ≤ (√2 m + √2) Σ_{j=1}^m R̂_n(H_j ∘ S),

which concludes the proof.

As discussed above, although KL alone does not satisfy Lipschitzness, KL ∘ SF is (√m + 1)-Lipschitz. Define the class of functions for the j-th output with weight constraints as H_j = { x → w_j · x : ||w_j||_2 ≤ 1 }. According to (Kakade, Sridharan, and Tewari 2009), the Rademacher complexity of H_j satisfies

R̂_n(H_j ∘ S) ≤ max_{i∈[n]} ||x_i||_2 / √n.

Then the right-hand side of Eq. (8) is bounded as

R̂_n(KL ∘ SF ∘ H ∘ S) ≤ ((√2 m + √2) m / √n) max_{i∈[n]} ||x_i||_2.
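Both AA-BP and SA-ME share the composition SF ∘ H: a multi-output score function followed by a softmax, trained with the sum-squared loss (AA-BP) or the KL divergence (SA-ME). The sketch below (Python/NumPy; forward pass and losses only, no training loop, and all names are ours) spells out that composition for the maximum entropy model of Eq. (7); replacing the linear scores W·x with a one-hidden-layer network would give the AA-BP architecture.

```python
import numpy as np

def softmax(z):
    """SF: map scores z in R^m to the probability simplex."""
    z = z - z.max()                  # numerical stabilization
    e = np.exp(z)
    return e / e.sum()

def sa_me_forward(x, W):
    """Maximum entropy model of Eq. (7): eta_tilde(x, y_j) = exp(w_j . x) / Z."""
    return softmax(W @ x)            # W has shape (m, d)

def sum_squared_loss(p, q):
    """Loss minimized by AA-BP: sum_j (p_j - q_j)^2."""
    return np.sum((p - q) ** 2)

def kl_loss(eta, eta_tilde, eps=1e-15):
    """KL divergence loss used by SA-ME; eps plays the role of gamma in the text."""
    eta_tilde = np.clip(eta_tilde, eps, None)
    return np.sum(eta * np.log(np.clip(eta, eps, None) / eta_tilde))

# Usage: one instance, random weights, a ground-truth distribution eta(x).
rng = np.random.default_rng(2)
d, m = 4, 3
x = rng.random(d)
W = rng.normal(size=(m, d))
eta = np.array([0.6, 0.3, 0.1])

pred = sa_me_forward(x, W)
print("prediction:", pred)
print("sum-squared loss:", sum_squared_loss(pred, eta))
print("KL loss:", kl_loss(eta, pred))
```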

Accordingly, the data-dependent risk bound for SA-ME is as follows.

Theorem 7. Let F be the family of functions for SA-ME defined above, with the KL divergence as loss function bounded by a constant b. For any δ > 0, with probability at least 1 − δ, for all f ∈ F,

E_{x∼D} [ Σ_{j=1}^m η_x^{y_j} ln( η_x^{y_j} / f_x^{y_j} ) ] ≤ (1/n) Σ_{i=1}^n Σ_{j=1}^m η_{x_i}^{y_j} ln( η_{x_i}^{y_j} / f_{x_i}^{y_j} ) + 3b √(log(2/δ) / (2n)) + (2(√2 m + √2) m / √n) max_{i∈[n]} ||x_i||_2.   (9)

As (Cha 2007) suggests, 0 is replaced by a very small value, say γ > 0, to avoid division by 0 when implementing the KL divergence. Then for probability distributions p, q ∈ R^m with p_i ≥ γ and q_i ≥ γ,

KL(p, q) = Σ_{i=1}^m p_i ln(p_i / q_i) ≤ Σ_{i=1}^m p_i ln(1/γ) ≤ − ln γ,

thus there exists a constant b ≥ − ln γ such that KL(·, ·) ≤ b (e.g., b = 35 for γ = 1 × 10^{−15}).

Relation between LDL and Classification

LDL aims at learning the unknown label distribution function η by minimizing a distance (or maximizing a similarity) between the given distribution and the output distribution. In practice, however, LDL is usually applied to perform classification. Firstly, a label distribution function η̃ is learned from the training sample S with label distributions. Secondly, a given instance x is classified by η̃, with the label corresponding to the maximum predicted description degree taken as the predicted label, i.e., ỹ = argmax_{y∈Y} η̃(x, y). In this part we examine the feasibility of performing classification with LDL algorithms. To make the analysis possible, we stay in the probabilistic setting, where the underlying label distribution function η is the conditional probability distribution function, i.e., η(x, y) = P(y|x).

Unlike in LDL, where the unknown label distribution function η is deterministic though unknown, the label variable in classification is stochastic and sampled according to the conditional probability distribution. Thus we start with the plug-in decision theorem to bound the error probability by the risk of LDL (with absolute loss), and then obtain the upper bound on the error probability using a union bound.

Generalized Plug-in Decision Theorem

As a preliminary, we first introduce the well-known plug-in decision theorem. The classic plug-in decision theorem applies to binary decisions. For a given x with label y, the Bayes decision function h* is

h*(x) = y_0 if η_1(x) ≤ 1/2, and y_1 otherwise,

where η_1(x) = P(y = y_1 | x). Consider another decision h with plug-in decision function

h(x) = y_0 if η̃_1(x) ≤ 1/2, and y_1 otherwise,

where η̃_1(x) (0 ≤ η̃_1(x) ≤ 1) is an approximation of η_1(x). Then we have:

Theorem 8. (Devroye, Györfi, and Lugosi 1996) For the plug-in decision function h defined above, the difference in error probability between h and h* satisfies

P(h(x) ≠ y) − P(h*(x) ≠ y) ≤ 2 E_x [ |η̃_1(x) − η_1(x)| ].

The theorem states that if η̃_1(x) is close to η_1(x) in absolute value, then the 0/1 risk of the decision h is near that of the optimal decision function h*. The preceding plug-in decision theorem only applies to binary classification, and we generalize it to multi-class classification. Formally, given an instance x with (multi-class) conditional probability distribution function η, the corresponding Bayes decision is

h*(x) = argmax_{y∈Y} η(x, y).

Similarly, assume we have access to an η̃ that approximates η; the plug-in decision is defined as

h(x) = argmax_{y∈Y} η̃(x, y).

Theorem 9. For the plug-in decision function h defined above, we have

P(h(x) ≠ y) − P(h*(x) ≠ y) ≤ E_x [ Σ_{i=1}^m |η_i(x) − η̃_i(x)| ],

where η_i(x) and η̃_i(x) denote η(x, y_i) and η̃(x, y_i), respectively.

Proof. Given an instance x, the conditional error probability of h is

P(h(x) ≠ y | x) = 1 − P(h(x) = y | x) = 1 − Σ_{i=1}^m P(h(x) = y_i, y = y_i | x) = 1 − Σ_{i=1}^m Kr(h(x), y_i) η_i(x),

where Kr(·, ·) is the Kronecker delta function. Then the difference in conditional error probability between h and h* is

P(h(x) ≠ y | x) − P(h*(x) ≠ y | x) = Σ_{i=1}^m ( Kr(h*(x), y_i) − Kr(h(x), y_i) ) η_i(x).

Without loss of generality, let y_k be the prediction of h and y_j be the prediction of h*. Then

P(h(x) ≠ y | x) − P(h*(x) ≠ y | x) = η_j(x) − η_k(x).   (10)

Lemma 10. Let a ≥ b and d ≥ c. Then |a − b| ≤ |a − c| + |b − d|.

Proof. If c ≤ b or d ≥ a, the inequality is obvious. If c ≥ b and a ≥ d, then |a − b| = |a − c| + |c − b| ≤ |a − c| + |b − d|.

For k ≠ j we have η_j(x) ≥ η_k(x) and η̃_k(x) ≥ η̃_j(x), so by Lemma 10, |η_j(x) − η_k(x)| ≤ |η_j(x) − η̃_j(x)| + |η_k(x) − η̃_k(x)|. For k = j, the left-hand side of (10) reduces to 0. Thus we have

P(h(x) ≠ y | x) − P(h*(x) ≠ y | x) ≤ Σ_{i=1}^m |η_i(x) − η̃_i(x)|.   (11)

Taking the expectation of both sides of Eq. (11) finishes the proof of Theorem 9.
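The content of Theorem 9 can be probed numerically straight from its statement: draw a random conditional distribution η(x), perturb it into an approximation η̃(x), and compare the excess conditional error of the plug-in decision with Σ_i |η_i(x) − η̃_i(x)|. The following sketch (Python/NumPy on synthetic distributions; purely an illustration, not part of the paper) checks inequality (11) over many random draws.

```python
import numpy as np

rng = np.random.default_rng(3)
m, trials = 5, 10000
violations = 0

for _ in range(trials):
    eta = rng.dirichlet(np.ones(m))              # conditional distribution eta(x)
    noise = rng.normal(scale=0.2, size=m)
    eta_tilde = np.abs(eta + noise)
    eta_tilde /= eta_tilde.sum()                 # an approximation eta_tilde(x)

    bayes = np.argmax(eta)                       # h*(x) = argmax_y eta(x, y)
    plugin = np.argmax(eta_tilde)                # h(x)  = argmax_y eta_tilde(x, y)

    # Conditional excess error of Eq. (10): eta_j(x) - eta_k(x).
    excess = eta[bayes] - eta[plugin]
    bound = np.abs(eta - eta_tilde).sum()        # right-hand side of (11)
    if excess > bound + 1e-12:
        violations += 1

print("violations of inequality (11):", violations)   # expected to be 0
```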

Measures for LDL

As discussed for Theorem 9, in the probabilistic setting LDL with absolute loss is sufficient for classification, because approximating the conditional probability distribution function in absolute loss guarantees approximating the optimal classifier. (Geng 2016) suggests six measures for LDL, i.e., the Chebyshev distance, Clark distance, Canberra metric, KL divergence, cosine coefficient, and intersection similarity, which belong to the Minkowski family, the χ² family, the L1 family, the Shannon's entropy family, the inner product family, and the intersection family, respectively (Cha 2007). For probability distributions p, q ∈ R^m, the six measures are summarized in Table 1, where ↓ after a distance measure indicates "the smaller the better", and ↑ after a similarity measure indicates "the larger the better".

Table 1: Measures for LDL

Measure | Formula
Chebyshev ↓ | Dis1(p, q) = max_i |p_i − q_i|
Clark ↓ | Dis2(p, q) = √( Σ_i (p_i − q_i)² / (p_i + q_i)² )
Canberra ↓ | Dis3(p, q) = Σ_i |p_i − q_i| / (p_i + q_i)
Kullback-Leibler ↓ | Dis4(p, q) = Σ_i p_i ln(p_i / q_i)
Cosine ↑ | Sim1(p, q) = Σ_i p_i q_i / (||p||_2 ||q||_2)
Intersection ↑ | Sim2(p, q) = Σ_i min(p_i, q_i)

Theorem 11. All measures in Table 1 are sufficient for LDL to perform classification.

Proof. Denote the absolute loss by Dis(p, q) = Σ_i |p_i − q_i|. For the Chebyshev, Clark, and Canberra distances, it is straightforward to verify that Dis1 ≥ (1/m) Dis, Dis3 ≥ (1/2) Dis, and Dis2 ≥ (1/(2√m)) Dis. For the KL divergence, we have Dis4 ≥ (1/2) Dis². For the cosine similarity, 1 − Sim1 is lower-bounded by a constant (depending on m) multiple of Dis², and for the intersection similarity, 1 − Sim2 = (1/2) Dis. In conclusion, all measures in Table 1 bound the absolute loss in some way. Details of the proof are left to the appendix.

Data-dependent Error Probability Bounds for LDL

As discussed above, the error probability is directly related to the absolute loss, and Theorem 11 states that the measures in Table 1 bound the absolute loss. Accordingly, the risk bounds for AA-kNN (absolute loss), AA-BP (sum-squared loss), and SA-ME (KL loss) can be extended to error probability bounds. Formally, let f be a learned label distribution function, and define the corresponding decision function as g(f(x)) = argmax_{y∈Y} f(x, y).

Notice that the optimal error probability L* appears on the left-hand side of Theorem 9 and can be bounded with Hoeffding's inequality. The optimal classifier h* outputs the label corresponding to the maximum conditional probability, so the empirical error probability of the optimal classifier is L_S(h*) = (1/n) Σ_{i=1}^n (1 − max_{y∈Y} η_{x_i}^y). By Hoeffding's inequality, for any ε > 0,

P{ |L_S(h*) − L*| ≥ ε } ≤ 2 exp(−2nε²),

namely, for any δ > 0, with probability at least 1 − δ,

|L_S(h*) − L*| ≤ √(ln(2/δ) / (2n)).   (12)

Finally, combining Theorem 9 and Eq. (12) with the results of Theorem 2, Theorem 5, and Theorem 7 respectively, we conclude with the following theorems.

Theorem 12. Let η_i be c_i-Lipschitz and let η̃ be the output function of AA-kNN. Then for any δ > 0, with probability at least 1 − δ, we have

P(g(η̃(x)) ≠ y) ≤ (8c√d / δ) (2k/n)^{1/(d+1)} + √(ln(4/δ) / (2n)) + 1 − (1/n) Σ_{i=1}^n max_{y∈Y} η_{x_i}^y.

Different from AA-kNN, AA-BP minimizes the sum-squared loss, and note that for a random variable z ∈ R^m,

E_z ||z||_1 ≤ √m E_z ||z||_2 ≤ √m √( E ||z||_2² ),

where the first inequality follows from the Cauchy-Schwarz inequality and the second from Jensen's inequality. Thus the error probability for AA-BP satisfies the following.

Theorem 13. Let F be the family of functions for AA-BP defined above. For any δ > 0, with probability at least 1 − δ, for all f ∈ F,

P(g(f(x)) ≠ y) ≤ √m √( (1/n) Σ_{i,j} (η_{x_i}^{y_j} − f_{x_i}^{y_j})² + 7 √(log(4/δ) / (2n)) + (8√2 m² L_σ B_0 B_1 / √n) max_{i∈[n]} ||x_i||_2 ) + 1 − (1/n) Σ_{i=1}^n max_{y∈Y} η_{x_i}^y,

where √· denotes the square root.

Observing that for probability distributions p, q ∈ R^m,

||p − q||_1 ≤ √( 2 KL(p, q) ),

we similarly obtain the following.

Theorem 14. Let F be the family of functions for SA-ME defined above, and let b′ = 3(b + 1). For any δ > 0, with probability at least 1 − δ, for all f ∈ F,

P(g(f(x)) ≠ y) ≤ √2 √( (1/n) Σ_{i,j} η_{x_i}^{y_j} ln( η_{x_i}^{y_j} / f_{x_i}^{y_j} ) + b′ √(log(4/δ) / (2n)) + (2(√2 m + √2) m / √n) max_{i∈[n]} ||x_i||_2 ) + 1 − (1/n) Σ_{i=1}^n max_{y∈Y} η_{x_i}^y.
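The six measures of Table 1 are one-liners, and several of the relations used in the proof of Theorem 11, as well as the Pinsker-type bound ||p − q||_1 ≤ √(2 KL(p, q)) invoked before Theorem 14, can be spot-checked numerically. The sketch below (Python/NumPy on random points of the simplex; our own illustration, not the paper's appendix proof) does so for the relations whose constants are stated explicitly above.

```python
import numpy as np

def chebyshev(p, q):    return np.max(np.abs(p - q))                          # Dis1
def clark(p, q):        return np.sqrt(np.sum((p - q) ** 2 / (p + q) ** 2))   # Dis2
def canberra(p, q):     return np.sum(np.abs(p - q) / (p + q))                # Dis3
def kl(p, q):           return np.sum(p * np.log(p / q))                      # Dis4
def cosine(p, q):       return p @ q / (np.linalg.norm(p) * np.linalg.norm(q))  # Sim1
def intersection(p, q): return np.sum(np.minimum(p, q))                       # Sim2
def absolute(p, q):     return np.sum(np.abs(p - q))                          # Dis

rng = np.random.default_rng(4)
m = 6
for _ in range(1000):
    p, q = rng.dirichlet(np.ones(m)), rng.dirichlet(np.ones(m))
    dis = absolute(p, q)
    assert chebyshev(p, q) >= dis / m - 1e-9                # Dis1 >= Dis / m
    assert canberra(p, q) >= dis / 2 - 1e-9                 # Dis3 >= Dis / 2
    assert clark(p, q) >= dis / (2 * np.sqrt(m)) - 1e-9     # Dis2 >= Dis / (2 sqrt(m))
    assert kl(p, q) >= dis ** 2 / 2 - 1e-9                  # Dis4 >= Dis^2 / 2, i.e. Dis <= sqrt(2 KL)
    assert abs((1 - intersection(p, q)) - dis / 2) < 1e-9   # 1 - Sim2 = Dis / 2
    _ = cosine(p, q)   # Sim1 listed for completeness; its relation to Dis is in the paper's appendix
print("all spot checks passed")
```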

Conclusion

This paper studies the learnability of LDL from two aspects: the generalization of LDL itself, and the generalization of LDL when used for classification. On one hand, for the generalization of LDL, risk bounds for three representative LDL algorithms, i.e., AA-kNN, AA-BP and SA-ME, are provided, with convergence rates O((k/n)^{1/(d+1)}), O(1/√n), and O(1/√n) respectively, which indicates the learnability of LDL. On the other hand, for the generalization of LDL when used for classification, a generalized plug-in decision theorem is proposed, showing that minimizing the absolute loss is sufficient for the corresponding LDL decision function to approach the optimal classifier. Furthermore, the six commonly used LDL measures are also shown to be sufficient for classification, because all six measures bound the absolute loss. Besides, data-dependent error probability bounds for LDL are given to demonstrate the feasibility of using LDL to perform classification.

Appendices

Details of Proof of Theorem 11

For discrete probability distributions p, q ∈ R^m, the missing proofs of the relations between the measures in Table 1 and the absolute loss are as follows.

Intersection Similarity. For the intersection similarity and the absolute distance, it holds that 1 − Sim2 = Dis/2.

Proof. Firstly, for a, b ∈ R, we have min(a, b) = (a + b)/2 − |a − b|/2. Summing this identity over the components and using Σ_i (p_i + q_i) = 2 gives Sim2(p, q) = 1 − Dis(p, q)/2, which is the claim.

Proof of Equation (3)

The proof follows the work of (Shalev-Shwartz and Ben-David 2014), which bounds the expected distance between a random x and its closest neighbor in the training set.

Proof. For k = 1, according to (Shalev-Shwartz and Ben-David 2014),

E_{x,S} [ ||x_{N_1(x)} − x||_2 ] ≤ 4√d n^{−1/(d+1)},

which satisfies Eq. (3): one can easily check that for k = 1 the right-hand side of Eq. (3) is 2^{1/(d+1)} · 4√d n^{−1/(d+1)} > 4√d n^{−1/(d+1)}, so Eq. (3) holds. Furthermore, for k ≥ 2:

Lemma 15. (Shalev-Shwartz and Ben-David 2014) Let C_1, C_2, ..., C_r be a collection of subsets of X. Then for any k ≥ 2,

E_S [ Σ_{i : |C_i ∩ S| < k} P[C_i] ] ≤ 2rk/n.
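Equation (3) can also be probed empirically under its sampling assumptions: draw a training set uniformly from [0, 1]^d, draw a fresh x, and compare the average distance to its k nearest neighbors with 4√d (2k/n)^{1/(d+1)}. The sketch below (Python/NumPy; a rough Monte Carlo check, not a proof) does exactly that.

```python
import numpy as np

def mean_knn_distance(n, d, k, trials, rng):
    """Monte Carlo estimate of E_{S,x}[ (1/k) * sum_{j in N_k(x)} ||x_j - x||_2 ]
    for S and x drawn uniformly from [0, 1]^d."""
    total = 0.0
    for _ in range(trials):
        S = rng.random((n, d))
        x = rng.random(d)
        dists = np.sort(np.linalg.norm(S - x, axis=1))
        total += dists[:k].mean()
    return total / trials

rng = np.random.default_rng(5)
n, d, k, trials = 2000, 3, 5, 200
lhs = mean_knn_distance(n, d, k, trials, rng)
rhs = 4 * np.sqrt(d) * (2 * k / n) ** (1 / (d + 1))
print(f"estimated E[mean kNN distance] = {lhs:.4f}, bound of Eq. (3) = {rhs:.4f}")
```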

Acknowledgments

This research was supported by the National Key Research & Development Plan of China (No. 2017YFB1002801), the National Science Foundation of China (61622203), the Jiangsu Natural Science Funds for Distinguished Young Scholar (BK20140022), the Collaborative Innovation Center of Novel Software Technology and Industrialization, and the Collaborative Innovation Center of Wireless Communications Technology.

References

Bartlett, P. L., and Mendelson, S. 2003. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research 3:463–482.

Berger, A. L.; Pietra, V. J. D.; and Pietra, S. A. D. 1996. A maximum entropy approach to natural language processing. Computational Linguistics 22(1):39–71.

Biau, G., and Devroye, L. 2015. Lectures on the Nearest Neighbor Method. Springer.

Borchani, H.; Varando, G.; Bielza, C.; and Larrañaga, P. 2015. A survey on multi-output regression. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 5(5):216–233.

Cha, S.-H. 2007. Comprehensive survey on distance/similarity measures between probability density functions.

Cover, T. M., and Thomas, J. A. 2012. Elements of Information Theory. John Wiley & Sons.

Devroye, L., and Krzyzak, A. 1989. An equivalence theorem for L1 convergence of the kernel regression estimate. Journal of Statistical Planning and Inference 23(1):71–82.

Devroye, L.; Györfi, L.; and Lugosi, G. 1996. A Probabilistic Theory of Pattern Recognition, volume 31. Springer Science & Business Media.

Gao, W., and Zhou, Z.-H. 2016. Dropout Rademacher complexity of deep neural networks. Science China Information Sciences 59(7):072104.

Geng, X., and Hou, P. 2015. Pre-release prediction of crowd opinion on movies by label distribution learning. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI'15, 3511–3517.

Geng, X., and Ling, M. 2017. Soft video parsing by label distribution learning. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, AAAI'17, 1331–1337.

Geng, X., and Xia, Y. 2014. Head pose estimation based on multivariate label distribution. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR'14, 1837–1842.

Geng, X.; Yin, C.; and Zhou, Z. 2013. Facial age estimation by learning from label distributions. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(10):2401–2412.

Geng, X. 2016. Label distribution learning. IEEE Transactions on Knowledge and Data Engineering 28(7):1734–1748.

Hou, P.; Geng, X.; Huo, Z.-W.; and Lv, J.-Q. 2017. Semi-supervised adaptive label distribution learning for facial age estimation. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, AAAI'17.

Jia, X.; Li, W.; Liu, J.; and Zhang, Y. 2018. Label distribution learning by exploiting label correlations. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI'18.

Kakade, S. M.; Sridharan, K.; and Tewari, A. 2009. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems, NIPS'09, 793–800.

Li, Y.; Zhang, M.; and Geng, X. 2015. Leveraging implicit relative labeling-importance information for effective multi-label learning. In 2015 IEEE International Conference on Data Mining, 251–260.

Maurer, A. 2016. A vector-contraction inequality for Rademacher complexities. In Algorithmic Learning Theory, 3–17.

Mohri, M.; Rostamizadeh, A.; and Talwalkar, A. 2012. Foundations of Machine Learning. MIT Press.

Ren, Y., and Geng, X. 2017. Sense beauty by label distribution learning. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI'17, 2648–2654.

Shalev-Shwartz, S., and Ben-David, S. 2014. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.

Stone, C. J. 1977. Consistent nonparametric regression. The Annals of Statistics 5(4):595–620.

Villani, C. 2008. Optimal Transport: Old and New, volume 338. Springer Science & Business Media.

Xu, M., and Zhou, Z.-H. 2017. Incomplete label distribution learning. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI'17, 3175–3181.

Xu, N.; Tao, A.; and Geng, X. 2018. Label enhancement for label distribution learning. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI'18, 2926–2932.

Zhang, M., and Zhou, Z. 2014. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering 26(8):1819–1837.

Zhao, P., and Zhou, Z.-H. 2018. Label distribution learning by optimal transport. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, AAAI'18.

Zheng, X.; Jia, X.; and Li, W. 2018. Label distribution learning by exploiting sample correlations locally. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, AAAI'18.

Zhou, Y.; Xue, H.; and Geng, X. 2015. Emotion distribution recognition from facial expressions. In Proceedings of the 23rd ACM International Conference on Multimedia, MM'15, 1247–1250. New York, NY, USA: ACM.
