Polytree-Augmented Classifier Chains for Multi-Label Classification
Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015)

Lu Sun and Mineichi Kudo
Graduate School of Information Science and Technology
Hokkaido University, Sapporo, Japan
{sunlu, mine}@[email protected]

Abstract

Multi-label classification is a challenging and appealing supervised learning problem where a subset of labels, rather than the single label seen in traditional classification problems, is assigned to a single test instance. Classifier chains based methods are a promising strategy for tackling multi-label classification problems, as they model label correlations at acceptable complexity. However, these methods find it difficult to approximate the underlying dependency in the label space, and suffer from the problems of a poorly ordered chain and error propagation. In this paper, we propose a novel polytree-augmented classifier chains method to remedy these problems. A polytree is used to model reasonable conditional dependence between labels over attributes, under which the directional relationship between labels within causal basins can be appropriately determined. In addition, based on the max-sum algorithm, exact inference can be performed on polytrees at reasonable cost, preventing error propagation. Experiments performed on both artificial and benchmark multi-label data sets demonstrate that the proposed method is competitive with state-of-the-art multi-label classification methods.

[Figure 1: A multi-label example to show label dependency for inference. (a) A fish in the sea, labeled "fish", "sea"; (b) carp-shaped windsocks (koinobori) in the sky, labeled "windsock", "sky".]

1 Introduction

Unlike traditional single-label classification problems, where an instance is associated with a single label, multi-label classification (MLC) attempts to allocate multiple labels to any unseen input instance by means of a multi-label classifier learned from a training set. Obviously, such a generalization greatly raises the difficulty of obtaining a desirable prediction accuracy at a tractable complexity. Nowadays, MLC has drawn a lot of attention in a wide range of real-world applications, such as text categorization, semantic image classification, music emotion detection and bioinformatics analysis. Fig. 1¹ shows a multi-label example, where Fig. 1(a) belongs to labels "fish" and "sea" and Fig. 1(b) is assigned labels "windsock" and "sky." Note that both objects are fish, but the contexts (backgrounds), in other words the label dependencies, distinguish them clearly.

The existing MLC methods can be cast into two strategies: problem transformation and algorithm adaptation [Tsoumakas and Katakis, 2007]. A convenient and straightforward way for MLC is to conduct problem transformation, in which an MLC problem is transformed into one or more single-label classification problems. Problem transformation mainly comprises three kinds of methods: binary relevance (BR) [Zhang and Zhou, 2007], pairwise (PW) [Fürnkranz et al., 2008] and label combination (LC) [Tsoumakas et al., 2011]. BR simply trains a binary classifier for each label, ignoring the label correlations, which are apparently crucial to making accurate predictions for MLC. For example, it is quite difficult to distinguish the labels "fish" and "windsock" in Fig. 1 from the visual features alone, unless we consider the label correlations with "sea" and "sky." PW and LC are designed to capture label dependence directly. PW learns a classifier for each pair of labels, and LC transforms MLC into the largest possible single-label classification problem by treating each possible label combination as a meta-label. At the expense of their high ability to model label correlations, the complexity increases quadratically and exponentially with the number of labels for PW and LC, respectively; thus they typically become impracticable even for a small number of labels.

In the second strategy, algorithm adaptation, multi-label problems are solved by modifying conventional machine learning algorithms, such as support vector machines [Elisseeff and Weston, 2001], k-nearest neighbors [Zhang and Zhou, 2007], AdaBoost [Schapire and Singer, 2000], neural networks [Zhang and Zhou, 2006], decision trees [Comite et al., 2003] and probabilistic graphical models [Ghamrawi and McCallum, 2005; Qi et al., 2007; Guo and Gu, 2011; Alessandro et al., 2013]. They achieve performances competitive with those of problem transformation based methods. However, they have several limitations, such as the difficulty of choosing parameters, high complexity in the prediction phase, and sensitivity to the statistics of the data.

Recently, a BR based MLC method, namely classifier chains (CC) [Read et al., 2009], was proposed to overcome the intrinsic drawback of BR and to achieve higher predictive accuracy. It succeeded in modeling label correlations at low computational complexity, and furthermore produced two variants: probabilistic classifier chains (PCC) [Dembczynski et al., 2010] and Bayesian classifier chains (BCC) [Zaragoza et al., 2011]. PCC attempts to avoid the error propagation problem possessed by CC; however, its applicability is greatly limited by the number of labels, due to the high prediction cost of its exhaustive search. Benefiting from mining marginal dependence between labels as a directed tree, BCC can establish classifier chains in particular chain orderings, but its performance is limited because it models only second-order correlations of labels. Moreover, these CC based methods cannot model label correlations in a reasonable way, since CC and PCC randomly establish fully connected Bayesian networks, which risk leading to trivial label dependence. In addition, the expressive ability of BCC is strongly restricted by the adopted tree structure.

To overcome these limitations of CC based methods, this paper proposes a novel polytree-augmented classifier chains method (PACC). In contrast to CC and PCC, PACC seeks a reasonable ordering by a polytree that represents the underlying dependence among labels. Unlike BCC, PACC is derived from conditional dependence between labels and is able to model both low- and high-order label correlations by virtue of polytrees. The polytree structure is ubiquitous in real-world MLC problems; for example, we can readily observe such a structure in the popular Wikipedia² and Gene Ontology³ data sets. In addition, the polytree structure makes it possible to perform exact inference in reasonable time, avoiding the problem of error propagation. Moreover, by the introduction of causal basins, the directionality between labels is mostly determined, which further decreases the number of possible orderings.

The rest of this paper is organized as follows. Section 2 gives the mathematical definition of MLC. Section 3 presents related works on CC based methods. Section 4 gives an overview of PACC and presents its implementation. Section 5 reports experimental results on both artificial and benchmark multi-label data sets. Finally, Section 6 concludes the paper and discusses future work.

2 Multi-label classification

In the scenario of MLC, given a finite set of labels L = {λ_1, ..., λ_d}, an instance is typically represented by a pair (x, y), which contains a feature vector x = (x_1, ..., x_m) as a realization of the random vector X = (X_1, ..., X_m) drawn from the input feature space 𝒳 = R^m, and the corresponding label vector y = (y_1, ..., y_d) drawn from the output label space 𝒴 = {0, 1}^d, where y_j = 1 if and only if label λ_j is associated with instance x, and 0 otherwise. In other words, y = (y_1, ..., y_d) can also be viewed as a realization of the corresponding random vector Y = (Y_1, ..., Y_d), Y ∈ 𝒴.

Assume that we are given a data set of N instances D = {(x^(i), y^(i))}_{i=1}^N, where y^(i) is the label assignment of the i-th instance. The task of MLC is to find an optimal classifier h : 𝒳 → 𝒴 which assigns a label vector y to each instance x while minimizing a loss function. Given a loss function L(Y, h(X)), the optimal h* is

    h* = arg min_h E_{P(x,y)} L(Y, h(X)).    (1)

In the BR context, a classifier h is comprised of d binary classifiers h_1, ..., h_d, where each h_j predicts ŷ_j ∈ {0, 1}, forming a vector ŷ ∈ {0, 1}^d. Fig. 2(a) shows the probabilistic graphical model of BR. Here, the model

    P(Y|X) = ∏_{j=1}^d P(Y_j | X)    (2)

means that the class labels are mutually independent.

[Figure 2: Probabilistic graphical models of BR and CC based methods. (a) BR; (b) CC and PCC; (c) BCC.]

3 Related works

3.1 Classifier chains (CC)

Classifier chains (CC) [Read et al., 2009] models label correlations in a randomly ordered chain based on (3):

    P(Y|X) = ∏_{j=1}^d P(Y_j | pa(Y_j), X).    (3)

Here pa(Y_j) represents the parent labels of Y_j. Obviously, |pa(Y_j)| = p, where p denotes the number of labels prior to Y_j following the chain order.

In the training phase, according to a predefined chain order, CC builds d binary classifiers h_1, ..., h_d such that each classifier correctly predicts the value of y_j by referring to pa(y_j) in addition to x. In the testing phase, it predicts the value of y_j in a greedy manner:

    ŷ_j = arg max_{y_j} P(y_j | p̂a(y_j), x),  j = 1, ..., d.    (4)

---
¹ http://www.yunphoto.net
² https://www.kaggle.com/c/lshtc
³ http://geneontology.org/
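The training rule (3) and greedy prediction rule (4) can be sketched in a few lines. The code below is a minimal illustration, not the authors' implementation: it assumes a fixed chain order 1, ..., d, uses a tiny gradient-descent logistic regression as the base learner, and all class and variable names are our own.

```python
import numpy as np

class LogReg:
    """Minimal logistic-regression base learner, fit by gradient descent."""
    def fit(self, X, y, lr=0.5, iters=500):
        X = np.hstack([X, np.ones((len(X), 1))])   # append a bias column
        self.w = np.zeros(X.shape[1])
        for _ in range(iters):
            p = 1.0 / (1.0 + np.exp(-X @ self.w))  # P(y=1 | x)
            self.w -= lr * X.T @ (p - y) / len(y)  # logistic-loss gradient step
        return self

    def prob(self, x):
        x = np.append(x, 1.0)                      # same bias term at test time
        return 1.0 / (1.0 + np.exp(-x @ self.w))

class ClassifierChain:
    """CC sketch: link j is trained on [x, y_1, ..., y_{j-1}] (Eq. 3)
    and prediction proceeds greedily down the chain (Eq. 4)."""
    def fit(self, X, Y):
        self.links = []
        for j in range(Y.shape[1]):
            Xj = np.hstack([X, Y[:, :j]])          # augment with parent labels
            self.links.append(LogReg().fit(Xj, Y[:, j]))
        return self

    def predict(self, x):
        y_hat = []
        for h in self.links:
            p = h.prob(np.concatenate([x, y_hat])) # condition on predicted parents
            y_hat.append(1.0 if p >= 0.5 else 0.0) # greedy arg max over {0, 1}
        return np.array(y_hat, dtype=int)

# Toy data: label 2 duplicates label 1, so the chain's parent feature helps.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y1 = (X[:, 0] + X[:, 1] > 0).astype(float)
Y = np.column_stack([y1, y1])
cc = ClassifierChain().fit(X, Y)
print(cc.predict(np.array([2.0, 2.0, 0.0])))       # prints [1 1]
```

Note how the greedy pass feeds each link's *predicted* parents, p̂a(y_j), into the next link; a mistake early in the chain is passed downstream, which is exactly the error-propagation problem discussed above.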
The computational complexity of CC is O(d × T (m; N)), where T (m; N) is the complexity of constructing a learner for m attributes and N instances. Its complexity is identical with BR’s, if a linear baseline learner is used.