A Fast Decision Tree Learning Algorithm

Jiang Su and Harry Zhang
Faculty of Computer Science, University of New Brunswick, NB, Canada, E3B 5A3
{jiang.su, hzhang}@unb.ca

Abstract

There is growing interest in scaling up the widely-used decision-tree learning algorithms to very large data sets. Although numerous diverse techniques have been proposed, a fast tree-growing algorithm without substantial decrease in accuracy and substantial increase in space complexity is essential. In this paper, we present a novel, fast decision-tree learning algorithm that is based on a conditional independence assumption. The new algorithm has a time complexity of O(m · n), where m is the size of the training data and n is the number of attributes. This is a significant asymptotic improvement over the time complexity O(m · n²) of the standard decision-tree learning algorithm C4.5, with an additional space increase of only O(n). Experiments show that our algorithm performs competitively with C4.5 in accuracy on a large number of UCI benchmark data sets, and performs even better and significantly faster than C4.5 on a large number of text classification data sets. The time complexity of our algorithm is as low as naive Bayes'. Indeed, it is as fast as naive Bayes but outperforms naive Bayes in accuracy according to our experiments. Our algorithm is a core tree-growing algorithm that can be combined with other scaling-up techniques to achieve further speedup.

Introduction and Related Work

Decision-tree learning is one of the most successful learning algorithms, due to its various attractive features: simplicity, comprehensibility, no parameters, and being able to handle mixed-type data. In decision-tree learning, a decision tree is induced from a set of labeled training instances represented by a tuple of attribute values and a class label. Because of the vast search space, decision-tree learning is typically a greedy, top-down and recursive process starting with the entire training data and an empty tree. An attribute that best partitions the training data is chosen as the splitting attribute for the root, and the training data are then partitioned into disjoint subsets satisfying the values of the splitting attribute. For each subset, the algorithm proceeds recursively until all instances in a subset belong to the same class. A typical tree-growing algorithm, such as C4.5 (Quinlan 1993), has a time complexity of O(m · n²), where m is the size of the training data and n is the number of attributes.

Since large data sets with millions of instances and thousands of attributes are not rare today, the interest in developing fast decision-tree learning algorithms is rapidly growing. The major reason is that decision-tree learning works comparably well on large data sets, in addition to its general attractive features. For example, decision-tree learning outperforms naive Bayes on larger data sets, while naive Bayes performs better on smaller data sets (Kohavi 1996; Domingos & Pazzani 1997). A similar observation has been noticed in comparing decision-tree learning with logistic regression (Perlich, Provost, & Simonoff 2003).

Numerous techniques have been developed to speed up decision-tree learning, such as designing a fast tree-growing algorithm, parallelization, and data partitioning. Among them, a large amount of research work has been done on reducing the computational cost related to accessing the secondary storage, such as SLIQ (Mehta, Agrawal, & Rissanen 1996), SPRINT (Shafer, Agrawal, & Mehta 1996), or Rainforest (Gehrke, Ramakrishnan, & Ganti 2000). An excellent survey is given in (Provost & Kolluri 1999). Apparently, however, developing a fast tree-growing algorithm is more essential. There are basically two approaches to designing a fast tree-growing algorithm: searching in a restricted model space, and using a powerful search heuristic.

Learning a decision tree from a restricted model space achieves the speedup by avoiding searching the vast model space. One-level decision trees (Holte 1993) are a simple structure in which only one attribute is used to predict the class variable. A one-level tree can be learned quickly with a time complexity of O(m · n), but its accuracy is often much lower than C4.5's. Auer et al. (1995) present an algorithm T2 for learning two-level decision trees. However, it has been noticed that T2 is no more efficient than even C4.5 (Provost & Kolluri 1999).

Learning restricted decision trees often leads to performance degradation in some complex domains. Using a powerful heuristic to search the unrestricted model space is another realistic approach. Indeed, most standard decision-tree learning algorithms are based on heuristic search. Among them, the decision tree learning algorithm C4.5 (Quinlan 1993) has been well recognized as the reigning standard. C4.5 adopts information gain as the criterion (heuristic) for splitting attribute selection and has a time complexity of O(m · n²).

Note that the number n of attributes corresponds to the depth of the decision tree, which is an important factor contributing to the computational cost of tree-growing. Although one empirical study suggests that on average the learning time of ID3 is linear in the number of attributes (Shavlik, Mooney, & Towell 1991), it has also been noticed that C4.5 does not scale well when there are many attributes (Dietterich 1997).

The motivation of this paper is to develop a fast algorithm searching the unrestricted model space with a powerful heuristic that can be computed efficiently. Our work is inspired by naive Bayes, which is based on an unrealistic assumption: all attributes are independent given the class. Because of the assumption, it has a very low time complexity of O(m · n), and still performs surprisingly well (Domingos & Pazzani 1997). Interestingly, if we introduce a similar assumption in decision-tree learning, the widely used information-gain heuristic can be computed more efficiently, which leads to a more efficient tree-growing algorithm with the same asymptotic time complexity of O(m · n) as naive Bayes and one-level decision trees. That is the key idea of this paper.

A Fast Tree-Growing Algorithm

To simplify our discussion, we assume that all attributes are non-numeric, and each attribute then occurs exactly once on each path from leaf to root. We will specify how to cope with numeric attributes later. In the algorithm analysis in this paper, we assume that both the number of classes and the number of values of each attribute are much less than m and are then discarded. We also assume that all training data are loaded into main memory.

Conditional Independence Assumption

In tree-growing, the heuristic plays a critical role in determining both classification performance and computational cost. Most modern decision-tree learning algorithms adopt an (im)purity-based heuristic, which essentially measures the purity of the resulting subsets after applying the splitting attribute to partition the training data. Information gain, defined as follows, is widely used as a standard heuristic:

IG(S, X) = Entropy(S) - \sum_{x} \frac{|S_x|}{|S|} Entropy(S_x),   (1)

where S is a set of training instances, X is an attribute and x is its value, S_x is the subset of S consisting of the instances with X = x, and Entropy(S) is defined as

Entropy(S) = - \sum_{i=1}^{|C|} P_S(c_i) \log P_S(c_i),   (2)

where P_S(c_i) is estimated by the percentage of instances belonging to c_i in S, and |C| is the number of classes. Entropy(S_x) is defined similarly.

Note that tree-growing is a recursive process of partitioning the training data, and S is the training data associated with the current node. Then, P_S(c_i) is actually P(c_i | x_p) on the entire training data, where X_p is the set of attributes along the path from the current node to the root, called the path attributes, and x_p is an assignment of values to the variables in X_p. Similarly, P_{S_x}(c_i) is P(c_i | x_p, x) on the entire training data.

In the tree-growing process, each candidate attribute (an attribute not in X_p) is examined using Equation 1, and the one with the highest information gain is selected as the splitting attribute. The most time-consuming part of this process is evaluating P(c_i | x_p, x) for computing Entropy(S_x). It must pass through each instance in S_x, for each of which it iterates through each candidate attribute X. This results in a time complexity of O(|S| · n). Note that the union of the subsets on each level of the tree is the entire training data of size m, and the time complexity for each level is thus O(m · n). Therefore, the standard decision-tree learning algorithm has a time complexity of O(m · n²).

Our key observation is that we may not need to pass through S for each candidate attribute to estimate P(c_i | x_p, x). According to probability theory, we have

P(c_i | x_p, x) = \frac{P(c_i | x_p) P(x | x_p, c_i)}{P(x | x_p)} = \frac{P(c_i | x_p) P(x | x_p, c_i)}{\sum_{i=1}^{|C|} P(c_i | x_p) P(x | x_p, c_i)}.   (3)

Assume that each candidate attribute is independent of the path attribute assignment x_p given the class, i.e.,

P(X | x_p, C) = P(X | C).   (4)

Then we have

P(c_i | x_p, x) \approx \frac{P(c_i | x_p) P(x | c_i)}{\sum_{i=1}^{|C|} P(c_i | x_p) P(x | c_i)}.   (5)

The information gain obtained by Equation 5 and Equation 1 is called independent information gain (IIG) in this paper.

Note that in Equation 5, P(x | c_i) is the percentage of instances with X = x and C = c_i on the entire training data, which can be pre-computed and stored with a time complexity of O(m · n) before the tree-growing process, with an additional space increase of O(n), and P(c_i | x_p) is the percentage of instances belonging to class c_i in S, which can be computed by passing through S once, taking O(|S|). Thus, at each level, the time complexity for computing P(c_i | x_p, x) using Equation 5 is O(m), while C4.5 takes O(m · n) using Equation 3.

To compute IIG, |S_x|/|S| in Equation 1 must also be computed. If we examined the partition for each candidate attribute X, the corresponding time complexity would be O(m · n). Fortunately, |S_x|/|S| = P(x | x_p), which can be approximated by the denominator of Equation 5, taking O(m). In C4.5, it takes O(m · n).
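To make the contrast concrete, the following sketch (a minimal illustration of ours, not the authors' implementation; the dictionary-based instances and the names entropy, information_gain, iig, p_x_given_c and label are our own choices) computes Equation 1 directly and via the approximation of Equation 5, which touches only the pre-computed table P(x | c_i) and the class distribution at the current node.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(class_counts):
    """Entropy of a class distribution given as a dict of counts (Equation 2)."""
    total = sum(class_counts.values())
    return -sum((k / total) * log2(k / total)
                for k in class_counts.values() if k > 0)

def information_gain(S, X, label):
    """Equation 1 computed directly: one pass over S per candidate attribute X."""
    parent = Counter(label(e) for e in S)
    subsets = defaultdict(Counter)
    for e in S:
        subsets[e[X]][label(e)] += 1
    return entropy(parent) - sum(
        sum(sub.values()) / len(S) * entropy(sub) for sub in subsets.values())

def iig(S, X, x_values, classes, p_x_given_c, label):
    """Equations 1 and 5: P(c_i | x_p, x) and |S_x|/|S| are approximated from the
    pre-computed table p_x_given_c[(X, x, c)] = P(x | c) and the class
    distribution of S, so S is scanned once rather than once per attribute."""
    parent = Counter(label(e) for e in S)
    p_c_given_xp = {c: parent[c] / len(S) for c in classes}
    gain = entropy(parent)
    for x in x_values:
        joint = {c: p_c_given_xp[c] * p_x_given_c[(X, x, c)] for c in classes}
        p_x = sum(joint.values())            # approximates |S_x| / |S|
        if p_x == 0:
            continue
        cond_entropy = -sum((joint[c] / p_x) * log2(joint[c] / p_x)
                            for c in classes if joint[c] > 0)   # Equation 5
        gain -= p_x * cond_entropy
    return gain
```

A single O(m · n) pass before tree-growing fills p_x_given_c; inside the recursion each node only needs its own class counts, which is where the O(m) cost per level comes from.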

The time complexity for selecting the splitting attribute using IIG is similar to that of using information gain in C4.5. For the current node, IIG(S, X) must be calculated for each candidate attribute, which takes O(n) per node. The total time complexity for splitting attribute selection on the entire tree is then O(k · n), where k is the number of internal nodes in the tree. In the worst case, each leaf of the tree contains a single instance, and the number of internal nodes is O(m). Thus, the total time complexity for splitting attribute selection is O(m · n). Note that C4.5 does the same thing and has the exact same time complexity for splitting attribute selection.

The total time for tree-growing is the sum of the time for probability estimation, partitioning, and splitting attribute selection. Therefore, the time complexity for tree-growing using IIG is O(m · n), as low as the time complexity for learning a naive Bayes or a one-level tree. Compared with the time complexity O(m · n²) of C4.5 based on information gain, it is indeed a significant asymptotic improvement.

The conditional independence assumption in Equation 4 seems similar to the well-known conditional independence assumption in naive Bayes, which assumes that all attributes are independent given the class, defined as follows:

P(X_1, X_2, \ldots, X_n | C) = \prod_{i=1}^{n} P(X_i | C).   (6)

However, the assumption in Equation 4 is weaker than the assumption in Equation 6. First, Equation 6 defines an assumption on the entire instance space (a global independence assumption), but Equation 4 defines an assumption on a subspace specified by a particular assignment x_p. It is essentially context-specific independence (Friedman & Goldszmidt 1996). Apparently, context-specific independence has finer granularity than global independence, in the sense that some attributes are not globally independent but are independent in a certain context. Second, in Equation 4, we do not assume that the candidate attributes not in X_p are independent of one another. Naive Bayes, however, assumes that all attributes are independent, including candidate attributes.

It is well known that the independence assumption for naive Bayes is unrealistic, but it still works surprisingly well even in domains containing dependencies. Domingos and Pazzani (1997) ascribe that to the zero-one loss function in classification. That is, for a given instance e, as long as the class with the largest posterior probability P(c | e) in naive Bayes remains the same as the true class, the classification of naive Bayes is correct. Interestingly, their explanation also applies to decision-tree learning. In tree-growing, one attribute is selected at each step. Similarly, as long as the attribute with the largest IIG is the same as the one with the largest information gain, attribute selection is not affected. So an accurate information gain does not matter. To see how this could happen, let us consider the following example.

Given the Boolean concept C = (A1 ∧ A2) ∨ A3 ∨ A4, assume that the training data consist of all 16 instances. The decision tree generated by C4.5 is shown in Figure 1. It is interesting that the tree generated using IIG is the same. When choosing the root, IIG is equal to information gain. In the other steps of tree-growing, IIG might not be equal to information gain, but the attribute with the highest information gain remains the same. For example, when choosing the attribute immediately under the root, the information gain values for A1, A2 and A4 are 0.04, 0.04 and 0.54, and the IIG values for A1, A2 and A4 are 0.02, 0.02 and 0.36. Apparently, A4 is the attribute with both the highest information gain and the highest IIG, although IIG is not equal to information gain. Note that the independence assumption for IIG is not true in this example.

[Figure 1: Decision tree generated for the Boolean concept (A1 ∧ A2) ∨ A3 ∨ A4, in which e(c) is the number of instances in class c falling into a leaf. The root splits on A3; on the A3 = 0 branch, A4 is chosen (Gain(A4) = 0.54, Gain(A1) = Gain(A2) = 0.04), followed by A1 and then A2.]
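As a sanity check on this example, the short sketch below (illustrative only; it reuses the information_gain and iig helpers from the earlier sketch and the same dict-based instance representation) enumerates the 16 instances of the concept and recomputes both scores at the node under the root, i.e., on the A3 = 0 branch. Exact figures depend on rounding and estimation details the paper does not spell out, so small deviations from the reported values are to be expected.

```python
from collections import Counter
from itertools import product

# All 16 instances of the Boolean concept C = (A1 and A2) or A3 or A4,
# represented as dicts with the class stored under the key "C".
data = [{"A1": a1, "A2": a2, "A3": a3, "A4": a4,
         "C": int((a1 and a2) or a3 or a4)}
        for a1, a2, a3, a4 in product([0, 1], repeat=4)]

label = lambda e: e["C"]
classes = [0, 1]
attributes = ["A1", "A2", "A3", "A4"]

# Pre-compute P(x | c) on the entire training data (no smoothing here).
class_counts = Counter(label(e) for e in data)
p_x_given_c = {(X, x, c): sum(1 for e in data if e[X] == x and label(e) == c)
               / class_counts[c]
               for X in attributes for x in (0, 1) for c in classes}

# Node under the root: path attribute A3 = 0, candidates A1, A2 and A4.
S = [e for e in data if e["A3"] == 0]
for X in ["A1", "A2", "A4"]:
    print(X, information_gain(S, X, label),
          iig(S, X, (0, 1), classes, p_x_given_c, label))
```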
Another reason to support the independence assumption in Equation 4 is the accuracy of probability estimation. It is well known that decision-tree learning suffers from the fragmentation problem: as the splitting process proceeds, the data associated with each descendant node become small. Eventually, when the depth of the tree is large, there is very little data at each leaf node, and so the estimate of P(c_i | x_p, x) could be inaccurate. In computing IIG, since P(x | c_i) is estimated on the entire training data and P(c_i | x_p) is estimated before the partitioning using attribute X, this problem could be less serious.

Actually, the assumption in Equation 4 can be relaxed. To simplify our discussion, let us consider just the binary classes {+, −}. Then, we have

P(+ | x_p, x) = \frac{P(+ | x_p) P(x | x_p, +)}{P(+ | x_p) P(x | x_p, +) + P(- | x_p) P(x | x_p, -)} = \frac{P(+ | x_p)}{P(+ | x_p) + P(- | x_p) \frac{P(x | x_p, -)}{P(x | x_p, +)}}.

Obviously, the probability estimate generated from Equation 5 is equal to the one from Equation 3 if

\frac{P(x | x_p, +)}{P(x | x_p, -)} = \frac{P(x | +)}{P(x | -)}.   (7)

The Naive Tree Algorithm

The tree-growing algorithm based on the conditional independence assumption, called the naive tree algorithm (NT), is illustrated as follows.

Algorithm NT(Π, S)
Input: Π is a set of candidate attributes, and S is a set of labeled instances.
Output: A decision tree T.
1. If (S is pure or empty) or (Π is empty), return T.
2. Compute P_S(c_i) on S for each class c_i.
3. For each attribute X in Π, compute IIG(S, X) based on Equations 1 and 5.
4. Use the attribute X_max with the highest IIG for the root.
5. Partition S into disjoint subsets S_x using X_max.
6. For each value x of X_max:
   - T_x = NT(Π − {X_max}, S_x)
   - Add T_x as a child of X_max.
7. Return T.
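For readers who prefer running code to pseudocode, here is a minimal sketch of the steps above (our own rendering, not the authors' released implementation; the Node class, the dict-based instances, the Laplace smoothing in the pre-computation, and returning a majority-class leaf in step 1 are our choices, and iig is the helper sketched earlier). Pruning and the numeric-attribute handling discussed below are omitted.

```python
from collections import Counter

class Node:
    """Internal nodes carry a splitting attribute; leaves carry a class label."""
    def __init__(self, attribute=None, label=None):
        self.attribute = attribute
        self.label = label
        self.children = {}        # attribute value -> subtree

def precompute_p_x_given_c(data, attributes, values, classes, label):
    """One O(m*n) pass over the training data: table[(X, x, c)] ~ P(X = x | C = c).
    Laplace smoothing is our addition to avoid zero probabilities."""
    counts = Counter((X, e[X], label(e)) for e in data for X in attributes)
    class_counts = Counter(label(e) for e in data)
    return {(X, x, c): (counts[(X, x, c)] + 1) / (class_counts[c] + len(values[X]))
            for X in attributes for x in values[X] for c in classes}

def nt(candidates, S, values, classes, p_x_given_c, label):
    """Algorithm NT, steps 1-7. `candidates` is a set of attribute names."""
    class_dist = Counter(label(e) for e in S)
    majority = class_dist.most_common(1)[0][0] if S else None
    if not S or len(class_dist) == 1 or not candidates:        # step 1
        return Node(label=majority)
    best = max(candidates,                                      # steps 2-4
               key=lambda X: iig(S, X, values[X], classes, p_x_given_c, label))
    root = Node(attribute=best)
    for x in values[best]:                                      # steps 5-6
        S_x = [e for e in S if e[best] == x]
        root.children[x] = (nt(candidates - {best}, S_x, values, classes,
                               p_x_given_c, label)
                            if S_x else Node(label=majority))
    return root                                                 # step 7
```

Here `values` maps each attribute to its set of observed values, and the table returned by precompute_p_x_given_c is built once on the entire training data before the first call to nt.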

Before we call the NT algorithm, a set of probabilities P(X | C) should be computed on the entire training data for each attribute and each class, which takes O(m · n). According to the analysis in the preceding section, the total time complexity of the NT algorithm is O(m · n). Note that the major difference between NT and C4.5 is in Step 3. C4.5 computes the information gain using P(c_i | x_p, x) for each candidate attribute X, which needs to pass through each instance for each candidate attribute, taking O(m · n). In NT, P(c_i | x_p, x) is approximated using Equation 5, which takes only O(m). Consequently, the time complexity of NT is as low as the time complexity of naive Bayes and one-level trees.

Note that in the NT algorithm above, we do not cope with numeric attributes, for simplicity. In the implementation of the NT algorithm we process numeric attributes in the following way.
1. In preprocessing, all numeric attributes are discretized by k-bin discretization, where k = √m (see the sketch below).
2. In selecting the splitting attribute, numeric attributes are treated the same as non-numeric attributes.
3. Once a numeric attribute is chosen, a splitting point is found in the same way as in C4.5. Then, the corresponding partitioning is done in Step 5, and a recursive call is made for each subset in Step 6. Note that a numeric attribute could be chosen again in the attribute selection on descendant nodes.

Note that we do not change the way the splitting attribute is chosen in NT when there are numeric attributes, and that we use the same way as in C4.5 to choose a splitting point only after a numeric attribute has been selected. Thus the time complexity for selecting splitting attributes in NT stays the same, and its advantage over C4.5 also remains the same. However, the time complexity of NT is then higher than O(m · n). In fact, the time complexity of C4.5 is also higher than O(m · n²) in the presence of numeric attributes.
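As an illustration of the preprocessing step 1 only, the sketch below performs an equal-width variant of k-bin discretization with k = √m; the paper does not say which k-bin scheme is used, so the equal-width choice and the function name are assumptions.

```python
import math

def k_bin_discretize(column, m):
    """Equal-width k-bin discretization with k = floor(sqrt(m)) bins, where m is
    the number of training instances; returns the bin index for each value."""
    k = max(1, math.isqrt(m))
    lo, hi = min(column), max(column)
    width = (hi - lo) / k or 1.0          # guard against a constant column
    return [min(int((v - lo) / width), k - 1) for v in column]

# Example: with m = 16 training instances, k = 4 bins.
# k_bin_discretize([0.1, 0.4, 2.5, 3.9, 1.7], m=16) -> [0, 0, 2, 3, 1]
```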

In a traditional decision-tree learning algorithm, an attribute X is more likely to be selected if it is independent of the path attributes X_p, because it provides more discriminative power than an attribute depending on X_p. In other words, an attribute Y dependent on X_p would have less chance of being chosen than X. Under the independence assumption in Equation 4, such attributes Y might have more chance of being chosen. In our tree-growing algorithm, we keep checking for this kind of attribute. In our implementation, if an attribute leads to all subsets except one being empty, it will not be chosen as the splitting attribute and will be removed from the candidate attribute set. Intuitively, the tree sizes generated by NT tend to be larger than C4.5's. This issue can be overcome using post-pruning. In our implementation, the pruning method of C4.5 is applied after a tree is grown. Interestingly, the tree sizes generated by NT are not larger than C4.5's after pruning, according to the experimental results reported in the next section.

Experiments

Experiment Setup

We conducted experiments under the framework of Weka (version 3.4.7) (Witten & Frank 2000). All experiments were performed on a Pentium 4 with a 2.8GHz CPU and 1GB of RAM. We designed two experiments. The first one is on 33 UCI benchmark data sets, which are selected by Weka and represent a wide range of domains and data characteristics. The data sets are listed in descending order of size in Table 1. Missing values are processed using the mechanism in Weka, which replaces all missing values with the modes and means from the training data. The purpose of this experiment is to observe the general performance of NT and the influence of its independence assumption on accuracy in practical environments.

The second experiment is on 19 text classification benchmark data sets (Forman & Cohen 2004). These data sets are mainly from Reuters, TREC and OHSUMED, and are widely used in text classification research. In fact, decision-tree learning is one of the competitive text classification algorithms (Dumais et al. 1998), and its classification time is very low. From the preceding sections, we know that when the number of attributes is large, NT with O(m · n) should be significantly faster than C4.5 with O(m · n²). Text classification usually involves thousands of attributes, from which we can observe the advantage of NT over C4.5 in running time (training time).

We compare NT with C4.5 and naive Bayes. Naive Bayes is well known as one of the most practical text classification algorithms, with very low time complexity and considerably good accuracy as well. We use the implementations of C4.5 and naive Bayes in Weka, and implement NT under the framework of Weka. The source code of NT is available at: http://www.cs.unb.ca/profs/hzhang/naivetree.rar.

In our experiments, we use three evaluation measures: accuracy, running time, and tree size. Certainly, we want the new algorithm to be competitive with C4.5 in accuracy with a substantial decrease in running time. In decision-tree learning, a compact tree usually offers more insightful results than a larger tree, and tree size is also a widely used evaluation criterion in decision-tree learning.

We use the following abbreviations in the tables below.
NT: the NT algorithm. NT(T) denotes its training time.
C4.5: the traditional decision-tree learning algorithm (Quinlan 1993). C4.5(T) denotes its training time.
NB: naive Bayes. NB(T) denotes its training time.
R(S): the ratio of the tree size learned by NT to C4.5's. If the ratio is greater than 1, it means that NT grows a larger tree than does C4.5.
R(T): the ratio of the running time of NT to the running time of C4.5. If the ratio is greater than 1, it means that the running time of NT is longer than C4.5's.

Experimental Results on UCI Data Sets

Table 1 shows the accuracy, tree size ratio and training time ratio of each algorithm on each data set, obtained via 10 runs of 10-fold stratified cross validation; the average accuracy and standard deviation over all data sets are summarized at the bottom of the table. Note that we report the running time ratio, instead of the running time, because the running time on small data sets is very close to zero. Table 2 shows the results of the two-tailed t-test in accuracy with a 95% confidence interval, in which each entry w/t/l means that the algorithm in the corresponding row wins in w data sets, ties in t data sets, and loses in l data sets, compared to the algorithm in the corresponding column.

Table 1: Experimental results on UCI data sets.

Data set      NT             C4.5           NB             R(S)  R(T)
Letter        85.66 ± 0.76   88.03 ± 0.71   64.07 ± 0.91   1.18  0.25
Mushroom      100 ± 0        100 ± 0        95.52 ± 0.78   1.10  1.02
Waveform      76.78 ± 1.74   75.25 ± 1.9    80.01 ± 1.45   1.13  0.23
Sick          98.54 ± 0.62   98.81 ± 0.56   92.89 ± 1.37   0.71  0.27
Hypothyroid   99.34 ± 0.41   99.58 ± 0.34   95.29 ± 0.74   0.81  0.76
Chess         97.34 ± 1      99.44 ± 0.37   87.79 ± 1.91   0.89  0.99
Splice        94.24 ± 1.23   94.08 ± 1.29   95.42 ± 1.14   0.81  0.31
Segment       95.34 ± 1.24   96.79 ± 1.29   80.17 ± 2.12   1.40  0.33
German-C      73.01 ± 3.93   71.13 ± 3.19   75.16 ± 3.48   0.42  0.36
Vowel         76.14 ± 4.17   80.08 ± 4.34   62.9 ± 4.38    1.15  0.45
Anneal        98.47 ± 1.22   98.57 ± 1.04   86.59 ± 3.31   1.11  0.16
Vehicle       69.36 ± 4.24   72.28 ± 4.32   44.68 ± 4.59   0.94  0.49
Diabetes      73.96 ± 4.62   74.49 ± 5.27   75.75 ± 5.32   0.77  0.41
Wisconsin-b   94.09 ± 2.77   94.89 ± 2.49   96.12 ± 2.16   0.93  0.44
Credit-A      85.17 ± 3.93   85.75 ± 4.01   77.81 ± 4.21   1.30  0.37
Soybean       91.9 ± 2.74    92.55 ± 2.61   92.2 ± 3.23    1.22  0.72
Balance       78.92 ± 4.04   77.82 ± 3.42   90.53 ± 1.67   0.94  0.62
Vote          95.17 ± 2.93   96.27 ± 2.79   90.21 ± 3.95   0.62  0.76
Horse-Colic   85.27 ± 5.04   83.28 ± 6.05   77.39 ± 6.34   0.47  0.47
Ionosphere    88.89 ± 4.74   89.74 ± 4.38   82.17 ± 6.14   1.00  0.25
Primary-t     38.76 ± 6.15   41.01 ± 6.59   47.2 ± 6.02    0.86  0.91
Heart-c       78.98 ± 7.31   76.64 ± 7.14   82.65 ± 7.38   0.88  0.56
Breast-Can    72.84 ± 5.49   75.26 ± 5.04   72.94 ± 7.71   2.30  0.95
Heart-s       82.89 ± 7.29   78.15 ± 7.42   83.59 ± 5.98   0.63  0.34
Audiology     75.67 ± 6.56   76.69 ± 7.68   71.4 ± 6.37    1.19  1.14
Glass         70.94 ± 9.51   67.63 ± 9.31   49.45 ± 9.5    1.02  0.47
Sonar         71.81 ± 10.74  73.61 ± 9.34   67.71 ± 8.66   1.35  0.24
Autos         74.02 ± 9.18   78.64 ± 9.94   57.67 ± 10.91  1.48  0.54
Hepatitis     80.16 ± 8.89   77.02 ± 10.03  83.29 ± 10.26  0.53  0.67
Iris          95.4 ± 4.51    94.73 ± 5.3    95.53 ± 5.02   1.03  1.07
Lymph         75.17 ± 9.57   75.84 ± 11.05  83.13 ± 8.89   0.85  0.70
Zoo           93.07 ± 6.96   92.61 ± 7.33   94.97 ± 5.86   1.20  0.63
Labor         83.77 ± 14.44  81.23 ± 13.68  93.87 ± 10.63  0.83  0.79
Mean          83.37 ± 4.79   83.57 ± 4.86   79.58 ± 4.92   1.00  0.56

Table 2: T-test summary in accuracy on UCI data sets.

        C4.5     NB
NT      2-27-4   15-11-7
C4.5             15-9-9

Tables 1 and 2 show that the classification performance of NT is better than naive Bayes', and very close to C4.5's. We summarize the highlights briefly as follows:

1. NT performs reasonably well compared with C4.5. The average accuracy of NT is 83.37%, very close to C4.5's 83.57%, and the average tree size of NT is equal to C4.5's (the average ratio is 1.0).

2. On the data sets "Letter", "Vowel" and "Chess", strong and high-order attribute dependencies have been observed (Kohavi 1996). We can see that naive Bayes performs severely worse than C4.5. However, the accuracy decrease of NT is only 2%-4%. This evidence supports our argument that the independence assumption in NT is weaker than naive Bayes', and shows that NT performs considerably well even when the independence assumption is seriously violated.

3. NT achieves higher accuracy and smaller tree size than C4.5 on the data sets "German-C", "Heart-s" and "Labor". This means that C4.5 is not always superior to NT. We will see that NT even outperforms C4.5 in text classification in the next section.

4. NT inherits the superiority of C4.5 over naive Bayes in accuracy on larger data sets. Since it has the same time complexity as naive Bayes, NT could be a more practical learning algorithm in large-scale applications.

5. NT is faster than C4.5. The average training time ratio is 0.56, but the advantage is not very clear here. According to our discussion in the preceding section, NT should have a clearer advantage in running time when the data set is large and has a large number of attributes.

Experimental Results in Text Classification

We conducted the second experiment on the text classification benchmark data sets. Table 3 shows the accuracy, training time in seconds and tree size ratio for each algorithm on each data set, obtained via 10-fold stratified cross validation. Table 4 shows the results of the two-tailed t-test in accuracy with a 95% confidence interval.

Table 3: Experimental results on text classification data sets.

Data set  NT     C4.5   NB     NT(T)  NB(T)  C4.5(T)  R(S)
Ohscal    72.48  71.35  70.86  560    643    32694    0.42
La1       75.5   77.22  83.21  419    414    7799     0.55
La2       77.33  76.39  84.55  467    390    6820     0.58
Fbis      76.74  74.06  67.4   24     21     457      0.63
Cora36    39.39  46.94  61.78  106    25     1764     0.88
Re1       83.47  79     75.14  30     15     512      0.74
Re0       76.01  74.73  71.02  16     10     252      0.74
Wap       67.44  62.44  80.45  70     45     2180     0.73
Oh0       81.86  82.56  87.34  7      7      129      0.69
Oh10      78.19  73.62  80.19  8      8      161      0.54
Oh15      72.19  73.7   80.81  7      6      138      0.80
Oh5       84.3   81.58  82.45  5      6      88       0.68
Tr31      94.07  92.77  88.45  12     28     188      0.96
Tr11      86.95  77.58  67.63  5      7      73       0.74
Tr12      81.44  83.05  72.85  3      4      33       0.85
Tr21      88.68  82.44  52.34  4      7      81       0.54
Tr23      97.57  91.12  59.29  1      3      12       0.98
Tr41      93.73  91.35  91.81  8      18     153      0.89
Tr45      90.29  90.14  79.42  10     16     101      0.98
Mean      79.88  78.00  75.63  93     88     2823     0.73

Table 4: T-test summary in accuracy on text classification data sets.

        C4.5     NB
NT      4-14-1   9-4-6
C4.5             7-6-6

We summarize the highlights briefly as follows:

1. NT performs slightly better than C4.5, with 4 wins and 1 loss. The average accuracy of NT is 79.88%, higher than C4.5's 78%. On most data sets (14 out of 19), NT achieves higher accuracy. In addition, the tree size learned by NT is smaller than C4.5's on every data set, and the average tree size ratio is only 0.73. Note that, generally speaking, the conditional independence assumption is violated in text classification.

2. It is clear that NT is remarkably faster than C4.5. The average training time of NT is 93 seconds, thirty times less than C4.5's 2823 seconds.

3. It is not a surprise that NT performs better than naive Bayes, with an average accuracy of 79.88% vs. naive Bayes' 75.63%. The equally important point is that NT is as fast as naive Bayes in training. Note that decision-tree learning has a significant advantage over naive Bayes in terms of classification time.

In summary, our experimental results in text classification show:

1. NT scales well on problems with a large number of attributes.

2. NT could be a practical text classification algorithm, because it is as fast as naive Bayes in training and significantly faster in testing, and it still enjoys higher classification accuracy.

Conclusion

In this paper, we present a novel, efficient algorithm for decision-tree learning. Its time complexity is O(m · n), as low as that of learning a naive Bayes or a one-level tree. It is also a significant asymptotic reduction from the time complexity O(m · n²) of C4.5. Our new algorithm is based on a conditional independence assumption, similar to but weaker than the conditional independence assumption of naive Bayes. Experiments show that the new algorithm scales up well to large data sets with a large number of attributes: it performs significantly faster than C4.5 while maintaining competitive accuracy. One drawback of the new algorithm is the way it handles missing values, which is clearly inferior to C4.5's and will be a topic for our future research.

Although we focus on information gain-based decision-tree learning in this paper, estimating the purity of the subsets generated by a splitting attribute X, i.e., computing P_{S_x}(c_i), is essential in many other decision-tree learning algorithms. So the method proposed in this paper is also applicable to those algorithms. Note also that the NT algorithm is a core tree-growing algorithm, which can be combined with other scaling-up techniques to achieve further speedup. In addition, NT could be a practical algorithm for various applications with large amounts of data, such as text classification.

References

Auer, P.; Holte, R. C.; and Maass, W. 1995. Theory and applications of agnostic PAC-learning with small decision trees. In Proceedings of the 12th International Conference on Machine Learning. Morgan Kaufmann. 21–29.

Dietterich, T. G. 1997. Machine learning research: Four current directions. AI Magazine 18(4):97–136.

Domingos, P., and Pazzani, M. 1997. Beyond independence: Conditions for the optimality of the simple Bayesian classifier. Machine Learning 29:103–130.

Dumais, S.; Platt, J.; Heckerman, D.; and Sahami, M. 1998. Inductive learning algorithms and representations for text categorization. In Proceedings of the Seventh International Conference on Information and Knowledge Management. 148–155.

Forman, G., and Cohen, I. 2004. Learning from little: Comparison of classifiers given little training. In Proceedings of PKDD 2004. 161–172.

Friedman, N., and Goldszmidt, M. 1996. Learning Bayesian networks with local structure. In Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence. 252–262.

Gehrke, J. E.; Ramakrishnan, R.; and Ganti, V. 2000. RainForest - A framework for fast decision tree construction of large datasets. Data Mining and Knowledge Discovery 4(2/3):127–162.

Holte, R. 1993. Very simple classification rules perform well on most commonly used datasets. Machine Learning 11:63–91.

Kohavi, R. 1996. Scaling up the accuracy of naive-Bayes classifiers: A decision-tree hybrid. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining. AAAI Press. 202–207.

Mehta, M.; Agrawal, R.; and Rissanen, J. 1996. SLIQ: A fast scalable classifier for data mining. In Proceedings of the Fifth International Conference on Extending Database Technology. 18–32.

Perlich, C.; Provost, F.; and Simonoff, J. S. 2003. Tree induction vs. logistic regression: A learning-curve analysis. Journal of Machine Learning Research 4:211–255.

Provost, F. J., and Kolluri, V. 1999. A survey of methods for scaling up inductive algorithms. Data Mining and Knowledge Discovery 3(2):131–169.

Quinlan, J. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann: San Mateo, CA.

Shafer, J. C.; Agrawal, R.; and Mehta, M. 1996. SPRINT: A scalable parallel classifier for data mining. In Proceedings of the 22nd International Conference on Very Large Databases. Morgan Kaufmann. 544–555.

Shavlik, J.; Mooney, R.; and Towell, G. 1991. Symbolic and neural network learning algorithms: An experimental comparison. Machine Learning 6:111–143.

Witten, I. H., and Frank, E. 2000. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann.
