
A Hierarchical Tree Distance Measure for Classification

Kent Munthe Caspersen, Martin Bjeldbak Madsen, Andreas Berre Eriksen and Bo Thiesson
Department of Computer Science, Aalborg University, Aalborg, Denmark
{swallet, martinbmadsen}@gmail.com, {andreasb, thiesson}@cs.aau.dk

Keywords: Machine Learning, Multi-class Classification, Hierarchical Classification, Tree Distance Measures, Multi-output Regression, Multidimensional Scaling, Process Automation, UNSPSC.

Abstract: In this paper, we explore the problem of classification where class labels exhibit a hierarchical structure. Many multiclass classification algorithms assume a flat label space, where hierarchical structures are ignored. We take advantage of hierarchical structures and the interdependencies between labels. In our setting, labels are structured in a product and service taxonomy, with a focus on spend analysis. We define a novel distance measure between classes in a hierarchical label tree. This measure penalizes paths through high levels in the hierarchy. We use a known classification algorithm that aims to minimize the distance between labels, given any symmetric distance measure. The approach is global in that it constructs a single classifier for an entire hierarchy by embedding hierarchical distances into a lower-dimensional space. Results show that combining our novel distance measure with the classifier induces a trade-off between accuracy and lower hierarchical distances on misclassifications. This is useful in a setting where erroneous predictions vastly change the context of a label.

1 INTRODUCTION

With the increasing advancement of technologies developed to gather and store vast quantities of data, interesting applications arise. Many kinds of business processes are supported by classifying data into one of multiple categories. In addition, as the quantity of data grows, structured organizations of assigned categories are often created to describe interdependencies between categories. Spend analysis systems are an example domain where such a hierarchical structure can be beneficial.

In a spend analysis system, one is interested in being able to drill down on the types of purchases across levels of specificity to aid in planning of procurements. Such tools also provide processes to gain insights in how much and to whom spending is going, supporting spend visibility. For example, in the UNSPSC¹ taxonomy, a procured computer mouse would belong to the following categories of increasing specificity: “Information Technology Broadcasting and Telecommunications”, “Computer Equipment and Accessories”, “Computer data input devices”, “Computer mouse or trackballs”. For more information on the UNSPSC standard, see (Programme, 2016).

¹United Nations Standard Products and Services Code®, a cross-industry taxonomy for product and service classification.

Many classification problems have a hierarchical structure, but few multiclass classification algorithms take advantage of this fact. Traditional multiclass classification algorithms ignore any hierarchical structure, essentially flattening the hierarchy such that the classification problem is solved as a flat multiclass classification problem. Such problems are often solved by combining the output of multiple binary classifiers, using techniques such as One-vs-One and One-vs-Rest to provide predictions (Bishop, 2006).

Hierarchical multiclass classification (HMC) algorithms are a variant of multiclass classification algorithms which take advantage of labels organized in a hierarchical structure. Depending on the label space, hierarchical structures can be in the shape of a tree or a directed acyclic graph (DAG). Figure 1 shows an example of a tree-based label structure. In this paper, we focus on tree structures.

Silla and Freitas (Silla and Freitas, 2011) describe hierarchical classification problems as a 3-tuple ⟨Υ, Ψ, Φ⟩, where Υ is the type of graph representing the hierarchical structure of classes, Ψ specifies whether a datapoint can have multiple labels in the class hierarchy, and Φ specifies whether the labeling of datapoints only includes leaves or if nodes within the hierarchy are included as well. Using this definition, we are concerned with problems of the form




• Υ = T (tree), meaning classes are organized in a tree structure.
• Ψ = SPL (single path of labels), meaning the problems we consider are not hierarchically multi-label.
• Φ = PD (partial depth) labeling, meaning datapoints do not always have a leaf class.

In this paper, a novel distance measure is introduced, with respect to the label tree. The purpose of the distance measure is to capture similarity between labels and to penalize errors at high levels in the hierarchy more than errors at lower levels. This distance measure leads to a trade-off between accuracy and the distance of misclassifications. Intuitively, this trade-off makes sense for UNSPSC codes as, for example, classifying an apple as a fresh fruit should be penalized less than classifying an apple as toxic waste. Training a classifier for such distance measures is not straightforward, therefore a classification method is presented which copes with a distance measure defined between two labels.

The rest of this paper is structured as follows. Section 2 discusses existing HMC approaches in the literature. Section 3 introduces hierarchical classification. In Section 4, we define properties a hierarchical tree distance measure should comply to, and describe our concrete implementation of these properties. Section 5 details how to embed the distance measure in a hierarchical multiclass classifier. The experiments of Section 6 compare this classifier with other classifiers. Finally, Section 7 presents our ideas for further research. Section 8 concludes.

2 RELATED WORK

(Dumais and Chen, 2000) explore hierarchical classification of web content by ordering SVMs in a hierarchical fashion, and classifying based on user-specified thresholds. The authors focus on a two-level label hierarchy, as opposed to the 4-level UNSPSC hierarchy we utilize in this paper. Assigning an instance to a class requires using the posterior probabilities propagated from the SVMs through the hierarchy. The authors conclude that exploiting the hierarchical structure of an underlying problem can, in some cases, produce a better classifier, especially in situations with a large number of labels.

(Labrou and Finin, 1999) use a global classifier based system to classify web pages into a 2-level DAG-based hierarchy of Yahoo! categories by computing the similarity between documents. The authors conclude that their system is not accurate enough to be suitable for automatic classification, and should be used in conjunction with active learning. This deviates from the method introduced in this paper in that the model we introduce does not support DAGs and can be used without the aid of active learning, with promising results.

(Wang et al., 1999) identify issues in local-approach hierarchical classification and propose a global-classifier based approach, aiming for closeness of hierarchy labels. The authors realize that the concern of simply being correct or wrong in hierarchical classification is not enough, and that only focusing on the broader, higher levels is where the structure, and thus accuracy, diminishes. To mitigate these issues, the authors implement a multilabel classifier based upon rules from features to classes found during training. These rules minimize a distance measure between two classes, and are deterministically found. Their distance measure is application-dependent, and the authors use the shortest distance between two labels. In this paper, we also construct a global classifier which aims to minimize distances between hierarchy labels.

(Weinberger and Chapelle, 2009) introduce a label embedding with respect to the hierarchical structure of the label tree. They build a global multiclass classifier based on the embedding. We utilize their method of classification with our novel distance measure.

3 HIERARCHICAL CLASSIFICATION

The hierarchical structure among labels allows us to reason about different degrees of misclassification. We are concerned with predicting the label of datapoints within a hierarchical taxonomy. We define the input data as a set of tuples, such that a dataset D is defined by

D = {(x, y) | x ∈ X, y ∈ Y},    (1)

where x is a q-dimensional datapoint in feature space X and y is a label in a hierarchically structured set of labels Y = {1, 2, . . . , m}.

Assume we have a datapoint x with label y = U from the label tree in Figure 1. It makes sense that a prediction ŷ = V should be penalized less than a prediction ŷ′ = Z, since it is closer to the true label y in the label tree. We capture this notion of distance between any two labels with our hierarchy embracing distance measure, properties of which are defined in Section 4.

One commonly used distance measure is to count the number of edges on a path between two labels in the node hierarchy. We call this method the Edges Between (EB) distance. In Section 4.3, we construct a new distance measure based upon properties introduced in the following section. We denote this distance measure as the AKM distance. The main purpose of the AKM distance is to minimize the number of root crossing paths on misclassifications, as we are working with “is-a” relationships.
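To make the EB baseline concrete, the following minimal sketch (our own illustration, not code from the paper) counts the edges on the path between two labels in a tree given as a child-to-parent map; the names parent, ancestors and eb_distance are ours, and the example tree is the one from Figure 1.

```python
def ancestors(parent, node):
    """Path from node up to the root, inclusive (the root has no parent entry)."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def eb_distance(parent, a, b):
    """Edges Between (EB) distance: number of edges on the path between a and b."""
    anc_a, anc_b = ancestors(parent, a), ancestors(parent, b)
    common = set(anc_a) & set(anc_b)
    # steps from a and from b up to their lowest common ancestor
    up_a = next(i for i, n in enumerate(anc_a) if n in common)
    up_b = next(i for i, n in enumerate(anc_b) if n in common)
    return up_a + up_b

# Label tree from Figure 1: R -> {S, X}, S -> {T, W}, X -> {Y, Z}, T -> {U, V}.
parent = {"S": "R", "X": "R", "T": "S", "W": "S",
          "Y": "X", "Z": "X", "U": "T", "V": "T"}
assert eb_distance(parent, "U", "W") == 3   # edges U-T, T-S, S-W
assert eb_distance(parent, "U", "Z") == 5   # path crosses the root
```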


Figure 1: An example of a 3-level label tree (root R with children S and X; S has children T and W, X has children Y and Z, and T has children U and V).

4 HIERARCHICAL DISTANCE MEASURE

In the following, the hierarchical distance measure used for hierarchical classification is introduced. Section 4.1 introduces the notation used in this section. In Section 4.2 we reason about the properties of hierarchical distance measures. Finally, in Section 4.3, the AKM distance measure is formalized.

4.1 Notation

In the interest of concise property definitions, we introduce the following notation for hierarchical label trees:
• R is the root node of a label tree.
• ρⁱ(A) is the i'th ancestor of A, such that ρ⁰(A) is A itself, ρ¹(A) is the parent of A, and ρ²(A) is the grandparent of A, etc. We use ρ(A) as shorthand for ρ¹(A).
• ch(A) is the set of children of A.
• sib(A) is the set of siblings of A.
• h(A) is the tree level of node A, where h(R) = 0 and h(A) = h(ρ(A)) + 1 when A ≠ R.
• σ(A, B) is the set of nodes on the path between nodes A and B, including both A and B. If A = B, σ(A, B) = {A} = {B}.
• α(A) = σ(A, R) defines the ancestors of node A.
• π(A, B) is the set of edges on the path between A and B. Notice that for any edge, we always write the parent node first. For example, for the tree in Figure 1, we have π(U, W) = {(T, U), (S, T), (S, W)}.

We define sign(x) for x ∈ ℝ to return either −1, 0, or 1, depending on whether x is smaller than, equal to, or larger than 0, respectively.

Finally, we define a notion of structural equivalence between two nodes in a label tree, denoted A ≡ B, such that the root is structurally equivalent to itself, and

A ≡ B  ⟺  (|sib(A)| = |sib(B)| ∧ ρ(A) ≡ ρ(B)).

This recursive definition causes two nodes A and B to be structurally equivalent if all nodes met on the path from A to the root pair-wise have the same number of children as the nodes on the path from B to the root. For example, in Figure 1 we have that T ≡ Z. Notice in Figure 2 how B ≢ F, due to the different number of siblings.
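The notation above translates directly into code. The following sketch (our own, not the authors' implementation) encodes the Figure 1 tree as a child-to-parent map and implements ρ, ch, sib and the recursive structural equivalence test; all function names are our own.

```python
# Tree from Figure 1 encoded as child -> parent; the parent -> children map is derived.
parent = {"S": "R", "X": "R", "T": "S", "W": "S",
          "Y": "X", "Z": "X", "U": "T", "V": "T"}
children = {}
for c, p in parent.items():
    children.setdefault(p, []).append(c)

def rho(a):
    """rho(A): the parent of A, or None if A is the root."""
    return parent.get(a)

def ch(a):
    """ch(A): the children of A."""
    return children.get(a, [])

def sib(a):
    """sib(A): the siblings of A."""
    p = rho(a)
    return [] if p is None else [c for c in ch(p) if c != a]

def struct_equiv(a, b):
    """A ≡ B: the root is equivalent only to itself; otherwise the two nodes must
    have the same number of siblings and structurally equivalent parents."""
    if rho(a) is None or rho(b) is None:
        return rho(a) is None and rho(b) is None
    return len(sib(a)) == len(sib(b)) and struct_equiv(rho(a), rho(b))

assert struct_equiv("T", "Z")        # as stated for Figure 1
assert not struct_equiv("T", "U")    # the nodes sit at different depths
```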


4.2 Properties

In the following, we reason about properties we think a tree distance measure should possess. We break the properties a distance measure should adhere to into two types: metric and hierarchical.

4.2.1 Metric Properties

A distance function d is a metric if it satisfies the following four properties.

Property 1 (non-negativity). d(A, B) ≥ 0

Property 2 (identity of indiscernibles). d(A, B) = 0 ⟺ A = B

Property 3 (symmetry). d(A, B) = d(B, A)

Property 4 (triangle inequality). d(A, C) ≤ d(A, B) + d(B, C)

It can be shown that both the EB and the AKM distances satisfy these properties, and are thus metrics.

4.2.2 Hierarchical Properties

Besides the standard metric properties above, we propose three additional properties a distance measure should satisfy for the UNSPSC hierarchy.

Property 5 (subpath). If a path can be split into two subpaths, its length is equal to the sum of the two subpaths' lengths. Formally, this property is stated as

B ∈ σ(A, C)  ⟹  d(A, C) = d(A, B) + d(B, C).

To exemplify the subpath property, in Figure 1 we have that d(U, W) = d(U, S) + d(S, W). Notice that this property is different from Property 4 (triangle inequality), since it requires the two subpaths π(A, B) and π(B, C) to be non-overlapping. It also implies a stronger result, namely that the distances d(A, C) and d(A, B) + d(B, C) are strictly equal.

Property 6 (child relatedness). Consider a node in a label hierarchy with k children, and a datapoint x for which we wish to predict a label that is known to be among one of the k children. Intuitively, it should be easier to predict the correct label if there are fewer children (labels) to choose from. In other words, we say that the distance between two siblings should decrease with an increasing number of siblings. Formally, we capture this intuition with the following property between nodes in a label tree:

A ≡ X ∧ A = ρ(B) ∧ X = ρ(Y)  ⟹  sign(|ch(A)| − |ch(X)|) = sign(d(X, Y) − d(A, B)).

Notice that if we let X = A, we get that a node is equally distant from any of its children. By the subpath property, this also implies that a node is equally distant from all of its siblings. For two structurally equivalent nodes A and X, the node with the most children will have the shortest distance to any of its children. If they have the same number of children, the distance between any of the two nodes and a child is the same. For example, in Figure 1 we have that S and X are structurally equivalent, having the same number of children. Thus, the property implies that d(S, T) = d(X, Y).

Property 7 (common ancestor). A prediction error that occurs at a higher level in the tree should be more significant than an error occurring at a lower level. This is because once an error occurs at some level, every level below will also contain errors. The levels above it may however still be a match. Therefore, it is desirable to have the first error occur as far as possible down the tree. Formally defined as

X ∈ α(A) ∩ α(B) ∧ X ∉ α(C)  ⟹  d(A, B) < d(A, C).

In other words, if the nodes A and B match at some level indicated by X, for which A and C do not match, then A and B are more similar than A and C. For example, in Figure 1 we have that U and W are more similar than U and X because U and W share an ancestor further down in the hierarchy than U and X.

Figure 2: Example of edge weights used by d_AKM (root R with children A and E; A has children B, C and D, and E has the single child F; the edges from R have weight 0.77, the edges from A have weight 0.52 · (1/τ), and the edge from E has weight 0.77 · (1/τ)).

4.3 AKM Distance

In the following we propose a new distance measure, the AKM distance, that satisfies the seven properties mentioned in Section 4.2.

We define the AKM distance as a distance measure between nodes in a label tree:

d_AKM(A, B) = Σ_{(X,Y) ∈ π(A,B)} w(X, Y),

where

w(X, Y) = 1/(log|ch(X)| + 1)                           if X is the root,
w(X, Y) = (1/τ) · w(ρ(X), X)/(log|ch(X)| + 1)          otherwise.

This implies that dissimilarities at lower levels in the tree are deemed less significant for values of τ greater than 1. Also, due to the term log|ch(X)| + 1, the distance between two siblings decreases logarithmically as more siblings are added. This prevents nodes with many children from having a deciding impact on the weight between two nodes. We use the base 10 logarithm, which causes Property 7 to be satisfied only when τ > 2.54. Figure 2 shows an example label tree with edge weights as defined by AKM. The distance between nodes B and F is

d_AKM(B, F) = w(R, A) + w(A, B) + w(R, E) + w(E, F)
            = 1/(log 2 + 1) + (1/τ) · (1/(log 2 + 1))/(log 3 + 1) + 1/(log 2 + 1) + (1/τ) · (1/(log 2 + 1))/(log 1 + 1).
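As a sanity check of the definition, w and d_AKM can be transcribed directly. The sketch below is our own code (not the authors' implementation); it uses the Figure 2 tree and base-10 logarithms as stated above, and the helper names are ours.

```python
from math import log10

# Label tree from Figure 2: R -> {A, E}, A -> {B, C, D}, E -> {F}.
parent = {"A": "R", "E": "R", "B": "A", "C": "A", "D": "A", "F": "E"}
children = {}
for c, p in parent.items():
    children.setdefault(p, []).append(c)

def weight(x, tau):
    """w(X, Y): weight of an edge from X down to any of its children.
    It depends only on X, so all children of X share the same edge weight."""
    base = 1.0 / (log10(len(children[x])) + 1.0)
    if x not in parent:                        # X is the root
        return base
    return (1.0 / tau) * weight(parent[x], tau) * base

def _ancestors(node):
    """Path from node up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def akm_distance(a, b, tau):
    """d_AKM(A, B): sum of w over the edges on the path between a and b."""
    anc_a, anc_b = _ancestors(a), _ancestors(b)
    common = set(anc_a) & set(anc_b)
    dist = 0.0
    for path in (anc_a, anc_b):
        for node in path:
            if node in common:                 # stop at the lowest common ancestor
                break
            dist += weight(parent[node], tau)  # edge (parent(node), node)
    return dist

print(akm_distance("B", "F", tau=3.0))         # ≈ 1.97 = w(R,A) + w(A,B) + w(R,E) + w(E,F)
```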


5 EMBEDDING CLASSIFICATION

As mentioned in Section 2, Weinberger et al. propose a method for classification that aims to minimize an arbitrary hierarchical distance measure between predicted and actual labels. In this section, we show how this embedding is created, and how it can be used for classification. For details, we refer to (Weinberger and Chapelle, 2009).

The hierarchical distance between labels is captured in a distance matrix C. C is embedded such that the Euclidean distance between two embedded labels is close to their hierarchical distance. This embedding, P, is defined as follows

P_mds = arg min_P Σ_{α,β ∈ Y} (‖p_α − p_β‖ − C_{α,β})²,    (2)

where the matrix P = [p_1, . . . , p_c] ∈ ℝ^{k×c}, the vector p_α represents the embedding of label α ∈ Y, and k ≤ c is the number of dimensions.

From the embedding, a multi-output regressor can be learned which maps datapoints x ∈ X with label y ∈ Y to an embedded label using

W = arg min_W Σ_{(x,y) ∈ D} ‖Wx − p_y‖ + λ‖W‖.    (3)

Given the embedding and the regressor, future datapoints x̂ can be classified in the following way

ŷ = arg min_{α ∈ Y} ‖p_α − Wx̂‖.    (4)

Note that the way we classify differs slightly from the method of Weinberger et al., as we reduce the dimensions of P_mds to k. In the following, we will refer to the above type of classifier, which can embed a metric distance measure, as an embedding classifier (EC).
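To illustrate Equations (2)–(4), the sketch below (our own, not the authors' code) uses scikit-learn's metric MDS with a precomputed dissimilarity matrix for the label embedding and a multi-output ridge regressor for W. Unlike the paper, it embeds directly into k dimensions rather than reducing a full embedding, and the hyperparameters k and λ are placeholders; the distance matrix C would be filled with pairwise AKM or EB distances between all labels.

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.linear_model import Ridge

def fit_embedding_classifier(X, y, C, k=150, lam=1.0):
    """X: (n, q) feature matrix; y: (n,) integer label indices 0..m-1;
    C: (m, m) symmetric hierarchical distance matrix (e.g. AKM or EB distances)."""
    # Eq. (2): embed the labels so Euclidean distances approximate C.
    mds = MDS(n_components=k, dissimilarity="precomputed", random_state=0)
    P = mds.fit_transform(C)                 # row alpha of P is the embedding p_alpha
    # Eq. (3): multi-output regression from datapoints to the embeddings of their labels.
    reg = Ridge(alpha=lam).fit(X, P[y])
    return P, reg

def predict(P, reg, X_new):
    # Eq. (4): choose the label whose embedding is closest to the regressed point.
    Z = reg.predict(X_new)                                   # (n_new, k)
    dists = np.linalg.norm(Z[:, None, :] - P[None, :, :], axis=-1)
    return dists.argmin(axis=1)                              # predicted label indices
```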
6 EXPERIMENTS

In the following, we compare three different types of classifiers. The first is a standard multiclass logistic regression classifier that, given a datapoint, predicts a label without accounting for any hierarchy amongst labels. Two other classifiers are built, both of which are based on distance matrices as described in Section 5. We call these classifiers the EB-EC and AKM-EC, where their corresponding C matrices represent the EB and AKM distances, respectively.

6.1 Dataset

Through our collaboration with Enversion A/S and North Denmark Region we have access to a dataset of 805,574 invoice lines, each representing the purchase of an item. The dataset contains sensitive information and is unfortunately not public. Each invoice line is represented by 10 properties, such as issue date, due date, item name, description, price, quantity, seller name, etc. In total, 721,663 of the invoice lines have been assigned a useful UNSPSC version 7.0401 label. UNSPSC is a 4-level hierarchical taxonomy, consisting of the levels Segment, Family, Class, and Commodity, mentioned in order of increasing specificity. For simplicity, we will refer to these levels as level 1, level 2, level 3, and level 4, respectively. Not all datapoints contain a label at the lowest hierarchy level (level 4). The distribution of labels among the 4 levels is roughly as follows: at least on level 1: 100 %, at least on level 2: 96 %, at least on level 3: 83 %, at least on level 4: 60 %.

We split the dataset into a 70/30 training and test set, and then randomly shuffle each. Datapoints from the test set have been removed if their label is not present in the training set, resulting in a training and test set split of 522,000 and 199,663 datapoints, respectively, which we use for all further experiments. Even though UNSPSC version 7.0401 contains 20,739 unique labels (including non-leaves), the dataset includes only 3400 unique labels.

Figure 3: Histogram over the distribution of AKM-distances between any two UNSPSC labels for τ = 3 (number of label pairs per AKM distance).

Figure 3 shows a histogram over d_AKM(y_i, y_j) distances for all pairs of labels in the dataset. There are a total of 50 buckets each distance can fall into. Increasing τ simply narrows the interval of the distances, which is expected, due to τ appearing in the denominator in the definition of the AKM distance. The figure shows that the distances between labels fall into two groups. A path between two labels that passes through the root incurs an AKM distance of at least 0.73, independently of τ, as level 1 contains 55 labels. This means that label pairs in the smaller, leftmost group share an ancestor different from the root. Those in the rightmost, larger group have a path that crosses through the root. For τ = 3 the mean AKM distance for non-root crossing paths is 0.129 and 0.905 for root crossing paths.


Figure 4: Accuracy across feature percentages (accuracy (%) of logistic regression over the top k % highest scoring features).

Figure 5: Level matches of AKM-EC in relation to τ (matches (%) at levels 1 to 4).

6.2 Results on UNSPSC Dataset

In this section, we formulate and discuss results from experiments on the UNSPSC-labeled invoice lines dataset.

6.2.1 Choosing Features

We use a univariate feature selection method as described in (Chen and Lin, 2006) to test the discriminative power of different subsets of features. The method calculates an F-score for each feature and, by choosing the top k scoring features (those with the highest F-values), a feature set is constructed from which a predictive model is built. Figure 4 shows the accuracy achieved on different feature sets using a standard multiclass logistic regression classifier on the test set. All logistic regression classifiers used in this paper have been made with the scikit-learn package (Pedregosa et al., 2011).

It is evident that greater accuracies are achieved with more features. However, it does not seem like the accuracy will improve much beyond the top 40 % of features. Therefore, as a trade-off between high accuracy and time spent running tests, we use the top 40 % of features for further experiments.

6.2.2 τ Test

The τ value used in the definition of the AKM distance impacts to which extent higher level errors are considered more costly than lower level errors. We define the term level k matches to be the number of datapoints that are correctly predicted at the k'th level, out of the total number of datapoints that have a label at the k'th level or below in the hierarchy. For example, if a classifier predicts the label correctly at the 2nd level for 4500 out of 10,000 samples, the amount of level 2 matches is 4500/(10,000 · 0.96) = 43.2 %, since only 96 % of the dataset has a label at the 2nd level or below. Figure 5 uses this measure to show that a larger value of τ does in fact decrease the prediction accuracy at lower levels, but slightly increases accuracy at the highest level. This is expected, as a higher value of τ lowers the importance of accuracy at lower levels.

6.2.3 Proof of Concept

The embedding classifiers are constructed to minimize their respective distances between predicted and actual labels in the dataset. Figure 6 compares the logistic regression classifier to AKM-EC and EB-EC, according to their respective distance measures. The figure shows that they are very similar, and this is due to the fact that logistic regression has an accuracy of 93.8 % as shown in Figure 4, compared to 88.6 % for AKM-EC with dimension 150. The high accuracy of logistic regression results in a low average AKM distance, as a correct prediction yields a distance of 0.

To compare how the classifiers deviate on errors, the misclassifications are isolated and evaluated, as the hierarchical properties proposed in Section 4.2.2 aim to avoid root crossing paths. Figure 7 plots the average AKM and EB distances for each classifier across dimensions on misclassified datapoints only. In this figure, we see that AKM-EC minimizes the average AKM distance, and the EB-EC minimizes the average EB distance, as expected. Notice how the EB and AKM distances follow a similar pattern. This is also expected, as the two distance measures are similar (see Section 4). AKM-EC and EB-EC have lower distances than logistic regression, because they optimize for their hierarchical distance measures. Since logistic regression receives a loss of 1 on misclassifications, we see that it performs worse on each distance measure.
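The quantities used in this subsection (the average hierarchical distance on misclassified samples and the share of root-crossing errors) can be computed as in the sketch below. This is our own helper, not the authors' evaluation code; it assumes a pairwise distance function such as the akm_distance sketch above and a label tree given as a child-to-parent map.

```python
def crosses_root(parent, a, b):
    """True if the path between labels a and b passes through the root,
    i.e. their only common ancestor is the root itself."""
    def anc(n):
        nodes = {n}
        while n in parent:
            n = parent[n]
            nodes.add(n)
        return nodes
    return len(anc(a) & anc(b)) == 1

def misclassification_stats(y_true, y_pred, dist, crosses):
    """Average distance and % root-crossing paths over misclassified samples only."""
    errors = [(t, p) for t, p in zip(y_true, y_pred) if t != p]
    if not errors:
        return {"avg_dist_on_miss": 0.0, "pct_root_crossing": 0.0}
    return {
        "avg_dist_on_miss": sum(dist(t, p) for t, p in errors) / len(errors),
        "pct_root_crossing": 100.0 * sum(crosses(t, p) for t, p in errors) / len(errors),
    }

# Usage: misclassification_stats(y_true, y_pred,
#                                dist=lambda t, p: akm_distance(t, p, tau=3.0),
#                                crosses=lambda t, p: crosses_root(parent, t, p))
```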


Figure 6: Average AKM and EB distances for four different classifiers on the test dataset, plotted against the number of dimensions. (a) Average AKM distance comparison for τ = 3. (b) Average EB distance comparison.

Figure 7: Average AKM and EB distances of embedding classifiers on the test dataset for misclassified samples, plotted against the number of dimensions. (a) Average AKM distances for τ = 3. (b) Average EB distances across classifiers.

For misclassifications, the average AKM distance between the predicted and correct class for the AKM-EC classifier is 0.213, while the average AKM distance for logistic regression is 0.348. In order to better understand the meaning of these numbers, we calculate the percentage of non-root crossing paths on misclassifications; this is calculated using the mean AKM distance of non-root crossing paths and root crossing paths from Figure 3. For AKM-EC, this is 89 %, meaning only 11 % of misclassifications have a path from the predicted class to the actual class that crosses the root node in the hierarchy. For logistic regression this is 72 %, meaning 28 % of misclassifications have a path through the root node. This test shows the AKM-EC comes with a trade-off between accuracy and lower hierarchical distances on misclassifications.

6.3 Verification

In order to verify the properties of the AKM distance measure, we compare the AKM-EC to the logistic regression classifier and the EB-EC on the 20 Newsgroups dataset (20News, 2008). The AKM-EC and the EB-EC are trained using 20 dimensions, with τ = 3 for the AKM-EC. The average AKM distance of a root crossing path using τ = 3 for this dataset is 1.33, and 0.15 for a non-root crossing path; these numbers are used to calculate the % root-crossing on misclassified samples, which we aim to minimize with the AKM-EC. The dataset contains 11,314 datapoints for training and 7532 datapoints for testing, organized into 20 classes. All the labels are at leaf nodes and the depth of the hierarchy is 3.

Table 1: Classifier comparison on the Twenty Newsgroups dataset between the AKM-EC and logistic regression classifiers.

  Metric            AKM-EC   Log. reg.   EB-EC
  Avg. AKM          0.219    0.24        0.203
  AKM on miss       0.72     0.88        0.84
  % correct         70 %     73 %        75 %
  % root-crossing²  49 %     62 %        58 %

  ²Calculated based on misclassified samples.


Table 1 clearly shows that AKM-EC outperforms both logistic regression and the EB-EC when it comes to minimizing the number of root crossing paths on misclassification. The fact that the EB-EC has a lower average AKM distance than AKM-EC is due to it having a higher overall accuracy than the AKM-EC.

7 FUTURE WORK

In this section, we consider topics that could be addressed in future work.

Implementing a weighted algorithm when embedding the C matrix could possibly improve the accuracy of the embedding classifiers. This is due to the fact that, currently, label embeddings are independent of how many datapoints of that label exist, possibly resulting in labels with few samples causing noise. Since the embedding classifier uses many linear regressors, weighting the importance of each regressor's output in relation to its loss could possibly benefit classification.

There are also ways of formulating distance measures that are vastly different from the AKM and EB measures and that take into account properties of other hierarchies. It would be interesting to evaluate how well the embedding classifier manages to embed these distance measures.

One could also consider expanding the 4-level UNSPSC tree to five or more levels.

8 CONCLUSION

We have introduced a novel hierarchical distance measure that aims to minimize the number of root crossing paths. This measure fulfills the intuitive properties for the UNSPSC product and service taxonomy. To take advantage of this distance measure, we use an embedding classifier that embeds matrices representing hierarchical distances into a lower-dimensional space. In this space, datapoints are mapped to an embedded class, and predictions are made.
This classifier can be combined with other distance measures. The results presented in Section 6.2.3 and Section 6.3 show that using this distance measure lowers the number of root crossing paths, at the cost of a slightly lower accuracy when compared to logistic regression.

REFERENCES

20News (2008). 20 newsgroups. http://qwone.com/~jason/20Newsgroups/. Accessed: 2016-12-13.

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.

Chen, Y.-W. and Lin, C.-J. (2006). Combining SVMs with various feature selection strategies. In Feature Extraction: Foundations and Applications, pages 315–324. Springer Berlin Heidelberg, Berlin, Heidelberg.

Dumais, S. and Chen, H. (2000). Hierarchical classification of web content. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 256–263. ACM.

Labrou, Y. and Finin, T. (1999). Yahoo! as an ontology: Using Yahoo! categories to describe documents. In Proceedings of the Eighth International Conference on Information and Knowledge Management, pages 180–187. ACM.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Programme, U. N. D. (2016). United Nations Standard Products and Services Code homepage.

Silla Jr., C. N. and Freitas, A. A. (2011). A survey of hierarchical classification across different application domains. Data Mining and Knowledge Discovery, 22(1–2):31–72.

Wang, K., Zhou, S., and Liew, S. C. (1999). Building hierarchical classifiers using class proximity. In Proceedings of the 25th International Conference on Very Large Data Bases, pages 363–374, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Weinberger, K. Q. and Chapelle, O. (2009). Large margin taxonomy embedding for document categorization. In Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L., editors, Advances in Neural Information Processing Systems 21, pages 1737–1744. Curran Associates, Inc.
