Inter-Node Hellinger Distance Based Decision Tree
Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19)

Pritom Saha Akash1, Md. Eusha Kadir1, Amin Ahsan Ali2 and Mohammad Shoyaib1
1Institute of Information Technology, University of Dhaka, Bangladesh
2Department of Computer Science & Engineering, Independent University, Bangladesh
{bsse0604 & [email protected]}, [email protected], [email protected]

Abstract

This paper introduces a new splitting criterion called Inter-node Hellinger Distance (iHD) and a weighted version of it (iHDw) for constructing decision trees. iHD measures the distance between the parent and each of the child nodes in a split using Hellinger distance. We prove that this ensures the mutual exclusiveness between the child nodes. The weight term in iHDw is concerned with the purity of the individual child nodes, considering the class imbalance problem. The combination of the distance and weight terms in iHDw thus favors a partition whose child nodes are purer and mutually exclusive, and is skew insensitive. We perform an experiment over twenty balanced and twenty imbalanced datasets. The results show that decision trees based on iHD win against six other state-of-the-art methods on at least 14 balanced and 10 imbalanced datasets. We also observe that adding the weight to iHD improves the performance of decision trees on imbalanced datasets. Moreover, according to the result of the Friedman test, this improvement is statistically significant compared to other methods.

1 Introduction

Three of the major tasks in machine learning are Feature Extraction, Feature Selection, and Classification [Iqbal et al., 2017; Sharmin et al., 2019; Kotsiantis et al., 2007]. In this study, we focus only on the classification task. One of the simplest and most easily interpretable classification methods (also known as classifiers) is the decision tree (DT) [Quinlan, 1986]. It is a tree-like representation of possible outcomes to a problem. Learning a DT is a greedy approach. Nodes in a DT can be categorized into two types: decision nodes and leaf nodes. At each decision node, a locally best feature is selected to split the data into child nodes. This process is repeated until a leaf node is reached, where further splitting is not possible. The best feature is selected based on a splitting criterion which measures the goodness of a split. One of the most popular splitting criteria is Information Gain (IG) [Quinlan, 1986; Breiman et al., 1984], which is an impurity based splitting criterion (i.e., entropy and gini). DTs based on IG perform quite well for balanced datasets where the class distribution is uniform. However, as the class prior probability is used to calculate the impurity of a node, in an imbalanced dataset IG becomes biased towards the majority class, which is also called skew sensitivity [Drummond and Holte, 2000].

To improve the performance of standard DTs, several splitting criteria have been proposed for constructing DTs: the Distinct Class based Splitting Measure (DCSM) [Chandra et al., 2010], the Hellinger Distance Decision Tree (HDDT) [Cieslak and Chawla, 2008] and the Class Confidence Proportion Decision Tree (CCPDT) [Liu et al., 2010]. Besides these, to deal with the class imbalance problem in Lazy DT construction, two skew insensitive split criteria based on Hellinger distance and K-L divergence are proposed in [Su and Cao, 2019]. Since Lazy DTs use the test instance to make splitting decisions, we omit them from our discussion.

In DCSM, the number of distinct classes in a partition is incorporated. Trees generated with DCSM are smaller in size; however, DCSM is still skew sensitive because of its use of class prior probability. HDDT and CCPDT propose new splitting criteria to address the class imbalance problem. However, the Hellinger distance based criterion proposed in HDDT can perform poorly when training samples are more balanced [Cieslak and Chawla, 2008]. At the same time, HDDT fails more often to differentiate between two different splits (specifically for multiclass problems), which is illustrated in the following example.

Assume there are 80 samples: 40 of class A, 20 of class B, 10 of class C and the rest of class D. Two splits (Split X and Split Y) are compared, where each split has 50 observations in the left child and the rest in the right child. Split X channels all the samples of classes A and D into the left child, and the rest to the right child, while Split Y places all the samples of class A and 5 each of classes C and D into the left child, and the rest to the right child. It is easily observable that Split X is more exclusive than Split Y. However, HDDT cannot differentiate between these two splits and assigns both the same measure.
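To make the tie concrete, the sketch below encodes the two candidate splits and scores them with pairwise binary Hellinger distances between the per-class partition distributions. The aggregation over class pairs (taking the maximum) is an assumption made here purely for illustration; the excerpt does not specify how HDDT handles more than two classes. Under that assumption, both splits receive the identical score of sqrt(2), consistent with the behavior described above.

```python
from itertools import combinations
from math import sqrt

# (left-child count, right-child count) per class for the 80-sample example.
split_x = {"A": (40, 0), "B": (0, 20), "C": (0, 10), "D": (10, 0)}
split_y = {"A": (40, 0), "B": (0, 20), "C": (5, 5), "D": (5, 5)}

def binary_hellinger(counts, c1, c2):
    """Hellinger distance between the partition distributions of two classes."""
    n1, n2 = sum(counts[c1]), sum(counts[c2])
    return sqrt(sum((sqrt(counts[c1][j] / n1) - sqrt(counts[c2][j] / n2)) ** 2
                    for j in range(2)))

def max_pairwise_hellinger(counts):
    """Assumed multiclass aggregation: maximum over all class pairs."""
    return max(binary_hellinger(counts, a, b) for a, b in combinations(counts, 2))

print(max_pairwise_hellinger(split_x))  # 1.4142... = sqrt(2)
print(max_pairwise_hellinger(split_y))  # 1.4142... -- the two splits tie
```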
On the other hand, instead of using class probability, CCPDT calculates splitting criteria such as entropy and gini using a new measure called Class Confidence Proportion (CCP), which is skew insensitive. However, it uses HDDT to break ties when splits based on two different features produce the same split measure. Hence, CCPDT exhibits the same limitation as HDDT.

To address the limitations of the above methods, we propose a new splitting criterion called Inter-node Hellinger Distance (iHD) which is skew insensitive. We then propose iHDw, which adds a weight to iHD to make sure child nodes are purer without forgoing skew insensitivity. Both iHD and iHDw exhibit the exclusivity preference property defined in [Taylor and Silverman, 1993]. Rigorous experiments over a large number of datasets and statistical tests are performed to show the superiority of iHD and iHDw for the construction of DTs.

The rest of the paper is organized in the following manner. In Section 2, several related node splitting criteria for DTs are discussed. The new node splitting criteria are presented in Section 3. In Section 4, datasets, performance measures, and experimental results are described. Finally, Section 5 concludes the paper.

2 Related Work

In this section, we discuss several split criteria for DTs related to our proposed measure.

2.1 Information Gain

Information Gain (IG) calculates how much "information" a feature gives about the class, and measures the decrease of impurity in a collection of examples after splitting into child nodes. IG for splitting a dataset (X, y) with attribute A and threshold T is calculated using (1).

Gain(A, T : X, y) = Imp(y|X) - \sum_{i=1}^{V} \frac{|X_i|}{|X|} Imp(y|X_i)   (1)

where V is the number of partitions and Imp is the impurity measure. Widely used impurity metrics are Entropy [Quinlan, 1986] and the Gini Index [Breiman et al., 1984], calculated using (2) and (3) respectively.

Entropy(y|X) = - \sum_{j=1}^{k} p(y_j|X) \log p(y_j|X)   (2)

Gini(y|X) = 1 - \sum_{j=1}^{k} p(y_j|X)^2   (3)

where k is the number of classes. When data are balanced, IG gives a reasonably good splitting boundary. However, when there is an imbalanced distribution of classes in a dataset, IG becomes biased towards the majority classes [Drummond and Holte, 2000]. Another drawback of IG is that it favors attributes with a large number of distinct values.
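As a quick illustration of (1)-(3), the following sketch computes entropy, Gini, and the resulting gain for a single binary split. The class counts are made up for illustration and do not come from the paper.

```python
from math import log2

def entropy(counts):
    """Entropy of a node given per-class counts, as in (2)."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def gini(counts):
    """Gini index of a node given per-class counts, as in (3)."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def information_gain(parent, children, impurity=entropy):
    """Gain as in (1): parent impurity minus the size-weighted
    impurity of the child partitions."""
    n = sum(parent)
    return impurity(parent) - sum(
        (sum(child) / n) * impurity(child) for child in children)

# Hypothetical two-class node (60/40) split into two children of 50 samples each.
parent = [60, 40]
children = [[45, 5], [15, 35]]
print(information_gain(parent, children, entropy))  # ~0.30
print(information_gain(parent, children, gini))     # ~0.18
```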
2.2 DCSM

Chandra et al. proposed a splitting criterion called the Distinct Class based Split Measure (DCSM) that emphasizes the number of distinct classes in a partition [Chandra et al., 2010]. For a given attribute x_j and V partitions, the measure M(x_j) is defined as (6).

M(x_j) = \sum_{v=1}^{V} \left[ \frac{N^{(v)}}{N^{(u)}} D^{(v)} \exp(D^{(v)}) \sum_{k=1}^{C} a^{(v)}_{\omega_k} \exp\left( \delta^{(v)} ( 1 - (a^{(v)}_{\omega_k})^2 ) \right) \right]   (6)

where C is the number of distinct classes in the dataset, u is the splitting node (parent) representing x_j and v represents a partition (child node). D^{(v)} denotes the number of distinct classes in partition v, \delta^{(v)} is the ratio of the number of distinct classes in partition v to that of u, i.e., D^{(v)}/D^{(u)}, and a^{(v)}_{\omega_k} is the probability of class \omega_k in partition v.

The first term, D^{(v)} \exp(D^{(v)}), deals with the number of distinct classes in a partition. It increases as the number of distinct classes in a partition increases, so that, since the measure is minimized, purer partitions are preferred. The second term is a^{(v)}_{\omega_k} \exp(\delta^{(v)}(1 - (a^{(v)}_{\omega_k})^2)). Here, impurity decreases when \delta^{(v)} decreases. On the other hand, (1 - (a^{(v)}_{\omega_k})^2) decreases when a class accounts for a larger share of the total number of examples in a partition. Hence, minimizing DCSM is intended to reduce the impurity of each partition.

The main difference between DCSM and other splitting criteria is that DCSM introduces the concept of distinct classes. Its limitation is the same as that of IG: it cannot deal with imbalanced class distributions and is thus biased towards the majority classes.

2.3 HDDT

The Hellinger Distance Decision Tree (HDDT) uses Hellinger distance as the splitting criterion to address the problem of class imbalance [Cieslak and Chawla, 2008; Cieslak et al., 2012]. The details of Hellinger distance are presented in Section 3. In HDDT, the Hellinger distance (d_H) is used as a split criterion to construct a DT. Assume a two-class problem (class + and class -), and let X_+ and X_- be the sets of samples of classes + and - respectively. Then d_H between the distributions X_+ and X_- is calculated as (7).

d_H(X_+, X_-) = \sqrt{ \sum_{j=1}^{V} \left( \sqrt{\frac{|X_{+j}|}{|X_+|}} - \sqrt{\frac{|X_{-j}|}{|X_-|}} \right)^2 }   (7)
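The excerpt ends at equation (7). As a minimal sketch of how this binary criterion could be evaluated for a candidate split, the code below assumes that |X_{+j}| denotes the number of class + samples falling into partition j of the split and |X_+| the total number of class + samples (and likewise for class -); the counts themselves are illustrative and not taken from the paper.

```python
from math import sqrt

def hellinger_split_distance(pos_counts, neg_counts):
    """d_H between the class + and class - distributions over the partitions
    of a split, following (7); pos_counts[j] and neg_counts[j] are the
    per-partition counts of the two classes."""
    n_pos, n_neg = sum(pos_counts), sum(neg_counts)
    return sqrt(sum((sqrt(p / n_pos) - sqrt(n / n_neg)) ** 2
                    for p, n in zip(pos_counts, neg_counts)))

# Illustrative imbalanced node: 10 positives, 90 negatives.
# Split A separates the classes well; Split B mirrors the parent in both children.
split_a_pos, split_a_neg = [9, 1], [5, 85]
split_b_pos, split_b_neg = [5, 5], [45, 45]

print(hellinger_split_distance(split_a_pos, split_a_neg))  # ~0.97, preferred
print(hellinger_split_distance(split_b_pos, split_b_neg))  # 0.0, no separation
```

Because the distance is computed from within-class proportions rather than class priors, proportionally up- or down-sampling one class leaves the score unchanged, which is consistent with the skew insensitivity attributed to Hellinger distance based criteria above.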