

A decision-tree-based symbolic rule induction system for text categorization

by D. E. Johnson, F. J. Oles, T. Zhang, and T. Goetz

We present a decision-tree-based symbolic rule induction system for categorizing text documents automatically. Our method for rule induction involves the novel combination of (1) a fast decision tree induction algorithm especially suited to text data and (2) a new method for converting a decision tree to a rule set that is simplified, but still logically equivalent to, the original tree. We report experimental results on the use of this system on some practical problems.

The text categorization problem is to determine predefined categories for an incoming unlabeled message or document containing text, based on information extracted from a training set of labeled messages or documents. Text categorization is an important practical problem for companies that wish to use computers to categorize incoming electronic mail, thereby either enabling an automatic machine response to the e-mail or simply assuring that the e-mail reaches the correct human recipient. But, beyond e-mail, text items to be categorized may come from many sources, including the output of voice recognition software, collections of documents (e.g., news stories, patents, or case summaries), and the contents of Web pages.

Previous text categorization methods have used decision trees (with or without boosting), 1 naive Bayes classifiers, 2 nearest-neighbor methods, 3 support vector machines, 4,5 and various kinds of direct symbolic rule induction. 6 Among all these methods, we are particularly interested in systems that can produce symbolic rules, because rules comprehensible to humans often provide valuable insights in many practical problems.

In a symbolic rule system, text is represented as a vector in which the components are the numbers of occurrences of certain words in the text. The system induces rules from the training data, and the generated rules can then be used to categorize arbitrary data that are similar to the training data. Each rule ultimately produced by such a system states that a condition, which is usually a conjunction of simpler conditions, implies membership in a particular category. The condition forms the antecedent of the rule, and the conclusion posited as true when the condition is satisfied is the consequent of the rule. Usually, the antecedent of a rule is a combination of tests to be done on various components. For example:

share > 3 & year < 1 & acquire > 2 → acq

may be read as "if the word share occurs more than three times in the document and the word year does not occur in the document and the word acquire occurs more than twice in the document, then classify the document in the category acq."
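To make the representation concrete, here is a minimal sketch (not from the paper; the tokenizer, the toy document, and the helper names are illustrative assumptions) of how a document becomes a word-count vector and how a conjunctive rule like the one above fires.

```python
import re
from collections import Counter

def count_vector(text):
    """Bag-of-words representation: map each word to its occurrence count."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def rule_fires(counts, conditions):
    """A rule antecedent is a conjunction of threshold tests on word counts.
    Each condition is (word, op, threshold) with op in {'>', '<'}."""
    for word, op, threshold in conditions:
        value = counts.get(word, 0)
        if op == ">" and not value > threshold:
            return False
        if op == "<" and not value < threshold:
            return False
    return True

# Hypothetical document and the rule from the introduction:
# share > 3 & year < 1 & acquire > 2 -> acq
doc = ("The firm will acquire more share capital; share prices rose as "
       "share holders approved the share acquire plan to acquire assets.")
rule = [("share", ">", 3), ("year", "<", 1), ("acquire", ">", 2)]

counts = count_vector(doc)
print(counts["share"], counts["acquire"], counts["year"])  # 4 3 0
print("acq" if rule_fires(counts, rule) else "no match")   # acq
```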


Figure 1  The KitCat system architecture

[Figure: a pipeline driven by a configuration file, with labeled stages for data preparation (data examination, data cleaning, data splitting), feature extraction (feature definition, feature selection), building decision trees, converting trees to rule sets and computing confidence levels, and testing; a category table also appears.]

In this paper, we describe a system that produces such symbolic rules from a decision tree. We emphasize the novel aspects of our work: a fast decision tree construction algorithm that takes advantage of the sparsity of text data, and a rule simplification method that converts the decision tree into a logically equivalent rule set. Compared with a standard C4.5 tree algorithm, 7 our algorithm not only can be as much as a hundred times faster on some text data, but also can handle larger feature sets, which in turn leads to more accurate rules.

The prototype system described in this paper is available in product form as the IBM Text Analyzer, an IBM WebSphere* Business Component, and it is also embedded in the IBM Enterprise Information Portal, version 8, an IBM content-management product offering.

System architecture

All of the methods described in this paper have been implemented in a prototype text categorization system that we call KitCat, an abbreviation of "tool kit for text categorization." The overall KitCat system architecture is shown in Figure 1.

In order to support multiple categorization (a common requirement in text categorization applications), the KitCat system induces a separate rule set for each category by treating categorization vis-à-vis each category as a separate binary-classification problem. This has the secondary advantage that a person who analyzes the rules can then concentrate on one category at a time. Also, we associate with each rule a confidence measure that is the estimated in-class probability that a document satisfies the rule. In settings where it is desirable to assign a single category to a document, confidence levels enable the user to resolve the ambiguity resulting from the multiple firing of rules for different categories.

The decision tree induction algorithm

DTREE is our implementation of the decision tree induction algorithm, the component of KitCat used to carry out the "build decision trees" function in Figure 1. In this section we describe the novel features of the DTREE algorithm.


We start with a set of training documents indexed by consecutive integers. For each integer i in the index set, we transform the ith document into a numeric vector x_i = (x_{i,1}, ..., x_{i,d}) whose components are counts of the numbers of occurrences of words or features in the ith document. For each category, we would like to determine a label y_i ∈ {0, 1} so that y_i = 1 indicates that the ith document belongs to the category, and y_i = 0 indicates that the ith document does not belong to the category.

The dimension d of the input vector space is huge for text categorization problems. Usually a few hundred to a few thousand words (features) are required to achieve good performance, and tens of thousands of features may be present, depending both on the size of the data set and on the details of feature definition, such as tokenization and stemming. Despite the large number of features potentially present in text, the lengths of individual documents, especially for e-mail and Web applications, can be rather short on average. The result is that each data vector x_i is typically very sparse.

However, traditional tree construction algorithms (such as C4.5 7 or CART 8) were not designed for sparse data. In the following, we briefly describe the fast decision tree algorithm that we developed specifically for text categorization applications. We assume that the reader has a basic knowledge of decision tree generation algorithms, so that we only elaborate on issues that are different from standard procedures, such as methods described by Quinlan 7 and Breiman et al. 8 For a discussion of various issues in developing scalable decision tree algorithms, see Gehrke et al. 9 While the construction of a tunable decision tree induction system especially designed for text categorization was mentioned in Yang et al., 10 few details were given there, so we cannot compare our system to theirs.

Similar to a standard decision tree construction algorithm, our algorithm contains two phases: tree growing and tree smoothing. There are three major aspects of our algorithm that we describe in more detail: (1) taking advantage of the sparse structure, (2) using a modified entropy as the impurity measure, and (3) using smoothing to do pruning.

Figure 2  Functions that can be used to define impurity criteria
[Plot of the entropy (dotted), Gini index (dash-dotted), modified entropy (dashed), and classification error (solid) criteria as functions of p over [0, 1]; the vertical axis runs from 0 to 0.5.]

Tree growing. We employ a greedy algorithm to grow the tree. At each node corresponding to a subset T of the training data, we select a feature f and a value v so that data in T are partitioned into two subsets T^1_{f,v} and T^2_{f,v} based on whether or not x_{i,f} ≤ v:

T^1_{f,v} = { x_i ∈ T : x_{i,f} ≤ v }  and  T^2_{f,v} = { x_i ∈ T : x_{i,f} > v }.

This partition corresponds to a decision rule based on whether or not the word corresponding to feature f occurs more than v times in document d_i represented by vector x_i.

Let p^1_{f,v} = P(y_i = 1 | x_i ∈ T^1_{f,v}), p^2_{f,v} = P(y_i = 1 | x_i ∈ T^2_{f,v}), and p_{f,v} = P(x_i ∈ T^1_{f,v} | x_i ∈ T). We consider the following transform of probability estimates, r: [0, 1] → [0, 1], defined as

r(p) = \begin{cases} \tfrac{1}{2}\bigl(1 + \sqrt{2p - 1}\bigr) & \text{if } p > 0.5, \\ \tfrac{1}{2}\bigl(1 - \sqrt{1 - 2p}\bigr) & \text{if } p \le 0.5. \end{cases}   (1)

We then define a modified entropy for p ∈ [0, 1] as

g(p) = -r(p) \log\bigl(r(p)\bigr) - \bigl(1 - r(p)\bigr) \log\bigl(1 - r(p)\bigr).   (2)

For each possible split (f, v), we associate with it a cost function

Q(f, v) = p_{f,v}\, g\bigl(p^1_{f,v}\bigr) + \bigl(1 - p_{f,v}\bigr)\, g\bigl(p^2_{f,v}\bigr)   (3)

that measures the impurity of the split. The modified entropy criterion is plotted as a function of p in Figure 2, where we compare it with classification error, the Gini index (we use 2p(1 − p) as its definition), and the standard entropy metric.
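As a small illustration, the following sketch (not the DTREE code; natural logarithms are an assumption, since the paper does not state the base) implements r, g, and the split cost Q, and evaluates the two candidate splits discussed just below.

```python
import math

def r(p):
    """Transform of a probability estimate, Equation (1)."""
    if p > 0.5:
        return 0.5 * (1.0 + math.sqrt(2.0 * p - 1.0))
    return 0.5 * (1.0 - math.sqrt(1.0 - 2.0 * p))

def g(p):
    """Modified entropy, Equation (2); natural log assumed here."""
    q = r(p)
    if q <= 0.0 or q >= 1.0:          # limit value: 0 * log 0 = 0
        return 0.0
    return -q * math.log(q) - (1.0 - q) * math.log(1.0 - q)

def split_cost(p_left, p1, p2):
    """Cost Q(f, v) of a split, Equation (3):
    p_left = P(x in T1 | x in T); p1, p2 = P(y = 1 | left/right subset)."""
    return p_left * g(p1) + (1.0 - p_left) * g(p2)

# An even split with class probabilities 0.51 and 0.89 versus a trivial split
# (everything on one side, overall class probability 0.7).
print(split_cost(0.5, 0.51, 0.89))   # about 0.45: the informative split costs less
print(split_cost(1.0, 0.70, 0.0))    # about 0.48: the trivial split costs g(0.7)
```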


Note that the strict concavity of the modified entropy criterion indicates that it favors a split that enhances the purity of the partition, and, in this respect, it is similar to the Gini index and the entropy criterion.

Due to its flatness (nonstrict concavity), the classification error function does not favor a partition that enhances the purity. Consider an example such that p_{f,v} = 0.5, p^1_{f,v} = 0.51, and p^2_{f,v} = 0.89, while p_{f',v'} = 1, p^1_{f',v'} = 0.7, and p^2_{f',v'} = 0. The (f', v') partition is equivalent to no partition. Clearly, the classification error criterion gives equal scores for both partitions, since they both predict y = 1 for all data in T. However, in reality, (f, v) should be preferable, because it makes some progress: it is likely that we can further partition T^1_{f,v} so that part of the data will be associated with a probability of y = 1 smaller than 0.51, and hence we change the prediction from 1 to 0 for those data. This future prospect of making a different prediction (increased purity of part of the data) is not reflected in the classification error criterion. This is why it is a bad measure for building a decision tree.

This problem can be remedied by using a strictly concave, bell-shaped function. However, the widely adopted entropy and Gini index criteria have shapes too different from the classification error curve. The final criterion for evaluating a tree is its classification performance; therefore we know that a tree with small entropy or Gini index does not automatically guarantee a small classification error. The modified entropy is introduced to balance the advantage of classification error, which reflects the final performance measurement of the tree, and the advantage of the entropy criterion, which tries to enhance the prospect of good future partitions within a greedy algorithm setting. It is designed to be a strictly concave function that closely resembles classification error. Note that, similar to the classification error function, the modified entropy curve is also nondifferentiable at p = 0.5. From our experience, this criterion outperforms the entropy and the Gini index criteria for the text categorization problems in which we are interested. For example, on the standard Reuters-21578 data, 11 the new criterion reduces the overall number of errors by 6 percent compared with the entropy splitting criterion.

We now show how we can take advantage of the sparse structure present in text documents to select the best partition. This aspect of our decision tree algorithm is crucial for its speed. Note that a standard decision tree algorithm such as CART or C4.5 does not take advantage of sparse structure. Since we are not aware of any previous work in the existing literature that carefully studied sparse decision tree methods, we describe this aspect of our algorithm in detail with analysis.

In order to speed up the algorithm, we also make an important modification to the data: that is, we truncate the word count x_{i,f} to be at most a value V (typically, we choose this value to be 3). Note that this is a rather reasonable simplification for text categorization problems. For example, a ten-time occurrence of the word "car" is hardly a more informative predictor that a document is in a car-related category than a nine-time occurrence. Let d be the dimension of x, which indicates the number of features (words) under consideration. We create an array inclass-count[1...d][0...V], where inclass-count[f][v] is the number of documents x_i ∈ T such that y_i = 1 and x_{i,f} = v, and an array total-count[1...d][0...V], where total-count[f][v] is the number of documents x_i ∈ T such that x_{i,f} = v. The time required to create and fill the tables is O(|T| · l̄_T + dV), where l̄_T is the average number of nonzeros of x_i ∈ T. This can be achieved by going through all the nonzeros of the vectors x_i ∈ T and appropriately increasing the counts of the corresponding entries in the inclass-count and total-count arrays.

After this step, we loop through f = 1, ..., d: for each f, we let v go from 0 to V. We accumulate the total number of x_i ∈ T such that x_{i,f} ≤ v, and the total number of x_i ∈ T such that y_i = 1 and x_{i,f} ≤ v. The probabilities p^1_{f,v}, p^2_{f,v}, and p_{f,v} can then be estimated to compute the cost Q(f, v) defined in (3). We keep the partition (f, v) that gives the minimum cost. This step requires O(dV) operations.

For the tree growing stage, we recursively split the tree starting from the root node that contains all documents, until no progress can be made. Assuming the documents roughly have equal length l̄, the total time required to grow the tree is roughly O(n h_l l̄ + dVM), where M is the number of nodes in the tree, n is the total number of documents, and h_l is the average depth of the tree per document. In our experience with text categorization, the dominating factor in O(n h_l l̄ + dVM) is the first term, O(n h_l l̄). As a comparison, a dense tree growing algorithm will have complexity at least of O(n h_l d), which is usually at least ten times slower.
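The counting scheme above is straightforward to sketch. The following is a minimal illustration (not the DTREE implementation; the data structures and names are assumptions) of selecting the best split at one node from sparse count vectors, using the truncation value V and the cost Q of Equation (3).

```python
import math
from collections import defaultdict

def modified_entropy(p):
    """g(p) from Equation (2), using the transform r(p) of Equation (1)."""
    q = 0.5 * (1 + math.sqrt(2 * p - 1)) if p > 0.5 else 0.5 * (1 - math.sqrt(1 - 2 * p))
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log(q) - (1 - q) * math.log(1 - q)

def best_split(docs, labels, V=3):
    """docs: sparse count vectors as {feature: count} dicts; labels: 0/1.
    Returns (feature, v, cost) minimizing Q(f, v) over thresholds v in 0..V,
    where the split is x[f] <= v versus x[f] > v and counts are truncated at V."""
    n = len(docs)
    n_pos = sum(labels)
    total_count = defaultdict(lambda: [0] * (V + 1))    # total-count[f][v]
    inclass_count = defaultdict(lambda: [0] * (V + 1))  # inclass-count[f][v]
    nonzero_docs = defaultdict(int)                     # documents with x[f] > 0
    nonzero_pos = defaultdict(int)                      # of those, how many have y = 1
    for x, y in zip(docs, labels):                      # one pass over the nonzeros
        for f, c in x.items():
            c = min(c, V)                               # truncate the word count at V
            total_count[f][c] += 1
            inclass_count[f][c] += y
            nonzero_docs[f] += 1
            nonzero_pos[f] += y
    best = None
    for f in total_count:
        # Documents where f does not occur belong in the v = 0 bucket.
        total_count[f][0] += n - nonzero_docs[f]
        inclass_count[f][0] += n_pos - nonzero_pos[f]
        left, left_pos = 0, 0
        for v in range(V + 1):
            left += total_count[f][v]
            left_pos += inclass_count[f][v]
            right, right_pos = n - left, n_pos - left_pos
            if left == 0 or right == 0:
                continue                                # trivial split, no progress
            p_left = left / n
            cost = p_left * modified_entropy(left_pos / left) + \
                   (1 - p_left) * modified_entropy(right_pos / right)
            if best is None or cost < best[2]:
                best = (f, v, cost)
    return best

# Tiny hypothetical node: the three in-class documents all mention "acquire".
docs = [{"acquire": 3, "share": 1}, {"acquire": 2}, {"acquire": 4, "year": 1},
        {"share": 2}, {"year": 1}, {}]
labels = [1, 1, 1, 0, 0, 0]
print(best_split(docs, labels))   # ('acquire', 0, 0.0): a pure split on "acquire" > 0
```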


Tree smoothing. Many algorithms for constructing a decision tree from data involve a two-stage process: the first stage is to grow a tree as large as possible to fit the data, using procedures similar to what we have described in the previous section. After this stage, the tree typically "over-fits" the data in the sense that it may perform poorly on the test set. Therefore a second phase is required to prune the large tree so that the smaller tree gives more stable probability estimates, which lead to better performance on the test set.

In this section, we describe a procedure which, instead of pruning the fully grown tree, re-estimates the probability of every leaf node by averaging the probability estimates along the path leading from the root node to this leaf node. To achieve this, we borrow the "tree-weighting" idea from data compression. 12,13 If we use the tree for compression of the binary class indicator string y_i based on x_i, then the tree-weighting scheme guarantees that the re-estimated probability achieves a compression ratio not much worse than that of the best pruned tree. 12,13 Because we apply this method to the transformed probability estimate r(p) rather than to p directly, this theoretical result can be interpreted nonrigorously as follows: by using the re-estimated probability, we can achieve an expected classification performance on the test set not much worse than that of the best pruned tree.

Note that in statistics, this technique is also called shrinkage, because it shrinks the estimate from a node deeper in the tree (which has a lower bias but higher variance because there are fewer data points) toward estimates from nodes shallower in the tree (which have higher biases but lower variance). In Bayesian statistics and machine learning, this method is also called model averaging. See Carlin and Louis 14 for more discussion from the statistical point of view.

The basic idea of our algorithm can be described as follows: consider sibling nodes T_1 and T_2 with a common parent T. Let p(T_1), p(T_2), and p(T) be the corresponding probability estimates. The local re-estimated probability is w_T p(T) + (1 − w_T) p(T_1) for T_1 and w_T p(T) + (1 − w_T) p(T_2) for T_2. The local weight w_T and an accompanying function G(T) are computed by mutual recursion, based on the following formulas:

\frac{w_T}{1 - w_T} = \frac{c \cdot \exp\bigl(-|T|\, g(p(T))\bigr)}{\exp\bigl(-|T_1|\, G(T_1) - |T_2|\, G(T_2)\bigr)}

G(T) = \begin{cases} g(p(T)) + \dfrac{1}{|T|} \log\!\left(\Bigl(1 + \dfrac{1}{c}\Bigr) w_T\right) & \text{if } w_T > 0.5, \\[6pt] \dfrac{|T_1|}{|T|}\, G(T_1) + \dfrac{|T_2|}{|T|}\, G(T_2) + \dfrac{1}{|T|} \log\bigl((1 + c)(1 - w_T)\bigr) & \text{otherwise.} \end{cases}

The parameter c is chosen a priori to reflect the Bayesian "cost" of a split. For a leaf node T, we define G(T) = g(p(T)) and w_T = 1. We choose c = 16 in our applications, but the performance is not very sensitive to this parameter.

These formulas may seem a little mysterious at first. They are in fact derived from applying the Bayesian model averaging method locally at each node (note that we replace the probability p at a node with its transformed estimate r(p)). From the bottom up, we re-estimate the local probability at each node and feed it into a new probability estimate by shrinking toward its parent estimate. The quality of the re-estimate at each node T is measured by G(T), which is updated at each step based on the information gathered from its children. Therefore, based on G(T), we can compute w_T, the local relative importance of the current node compared to its children. Due to the limitation of space, we shall skip the detailed derivations and theoretical consequences. Interested readers are referred to Willems et al. 12 and Zhang. 13

After calculating the weight w_T for each node recursively (the computation is done bottom-up), we compute the global re-estimated probability for each tree node from the top down. This step averages all the estimates r(p) from the root node T_0 to a leaf node T_h down a path T_0 ... T_h, based on the weights w_{T_i}. Note that the weight w_T is only the local relative importance, which means that the global relative importance of a node T_k along the path is

\tilde{w}_k = \prod_{i < k} (1 - w_i)\, w_k.

By definition, the weights \tilde{w}_i along any path leading to a leaf sum to 1. The following recursive procedure efficiently computes the global re-estimates of the children T_1 and T_2 at the parent node T:

\hat{w}_{T_i} = \hat{w}_T (1 - w_T)   (4)

\tilde{r}(T_i) = \tilde{r}(T) + \hat{w}_{T_i} w_{T_i}\, r\bigl(p(T_i)\bigr),   (5)

where r(p(T)) is the transformation, using Equation (1), of the probability estimate p(T) at node T. At the root node, we let \hat{w} = 1. After \tilde{r}(T_h) has been computed for a leaf node T_h, we can use r^{-1}(\tilde{r}(T_h)) as its probability estimate. The label of T_i is 1 if \tilde{r}(T_i) > 0.5 and 0 otherwise.
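Before turning to how the smoothed tree is pruned, here is a minimal sketch of the bottom-up computation of w_T and G(T) and the top-down recursion of Equations (4) and (5). It is an illustration, not the KitCat code; the Node class, the toy counts, and the use of empirical frequencies as the probability estimates p(T) are assumptions.

```python
import math

class Node:
    """A node of a fully grown binary tree: n documents reach it, n_pos of them in class."""
    def __init__(self, n, n_pos, left=None, right=None):
        self.n, self.n_pos, self.left, self.right = n, n_pos, left, right
        self.w = None        # local relative importance w_T
        self.G = None        # quality of the re-estimate at this node (G in the text)
        self.r_tilde = None  # globally re-estimated (transformed) probability
        self.label = None

def r(p):
    return 0.5 * (1 + math.sqrt(2 * p - 1)) if p > 0.5 else 0.5 * (1 - math.sqrt(1 - 2 * p))

def g(p):
    q = r(p)
    return 0.0 if q <= 0 or q >= 1 else -q * math.log(q) - (1 - q) * math.log(1 - q)

def bottom_up(node, c=16.0):
    """Compute w and G by mutual recursion, leaves first."""
    p = node.n_pos / node.n
    if node.left is None:                       # leaf: G = g(p), w = 1
        node.G, node.w = g(p), 1.0
        return
    bottom_up(node.left, c)
    bottom_up(node.right, c)
    leaf_score = math.log(c) - node.n * g(p)    # log of c * exp(-|T| g(p(T)))
    split_score = -node.left.n * node.left.G - node.right.n * node.right.G
    node.w = 1.0 / (1.0 + math.exp(split_score - leaf_score))
    if node.w > 0.5:
        node.G = g(p) + math.log((1 + 1 / c) * node.w) / node.n
    else:
        node.G = (node.left.n * node.left.G + node.right.n * node.right.G) / node.n \
                 + math.log((1 + c) * (1 - node.w)) / node.n

def top_down(node, w_hat=1.0, r_parent=0.0):
    """Accumulate the path-averaged estimate r~ and assign labels, root first."""
    node.r_tilde = r_parent + w_hat * node.w * r(node.n_pos / node.n)
    node.label = 1 if node.r_tilde > 0.5 else 0
    for child in (node.left, node.right):
        if child is not None:
            top_down(child, w_hat * (1 - node.w), node.r_tilde)

# Hypothetical fully grown tree: a small, noisy leaf (2 of 3 positive) hangs off
# a large, mostly negative parent; smoothing pulls its estimate toward the parent.
tree = Node(100, 10, left=Node(97, 8), right=Node(3, 2))
bottom_up(tree)
top_down(tree)
print(round(tree.right.r_tilde, 3), tree.right.label)   # 0.267 0: the noisy leaf is relabeled
```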


We prune the tree from the bottom up by checking whether two siblings give identical labels; if so, we remove them and use the label for their parent. This procedure continues until we cannot proceed. Typically, for a large fully grown tree from the first stage, the pruned tree can be as little as 10 percent of the unpruned tree.

The smoothing procedure consistently enhances the classification performance of the tree. The running time is O(M), where M is the number of nodes in the unpruned tree. Therefore the total complexity of our decision tree construction algorithm is O(n h_l l̄ + dVM). As mentioned before, in practice, we observe a complexity of O(n h_l l̄).

Turning decision trees into symbolic rule sets

An integral part of our system writes the decision tree as an equivalent set of human-interpretable rules. The value of converting decision trees into simplified, logically equivalent symbolic rule sets, instead of employing a text categorization system based solely on decision trees, is threefold:

1. A system based solely on generating decision trees from training data is useless for categories for which there are no training data. However, it is not hard for an intelligent person to write logical rules to cover such a situation. Creating a decision tree by hand for the purpose of text categorization would be much harder, particularly for a person who is not mathematically sophisticated. We envision that handwritten rule sets could be seamlessly incorporated with machine-generated rule sets, either permanently or as a stopgap measure pending the collection of additional training data.

2. A human user can understand and modify a rule set much more easily than he or she can understand and modify a decision tree. There are a number of scenarios in which we envision the need for such modifications. For instance, there may be a discrepancy between the training data and the anticipated application that requires hand modification of an automatically created system, and, in a system such as the one we are describing, this modification may be accomplished by editing a rule file. Another scenario involves a user's desire to update an existing system in order to improve performance that may have degraded due to a changing environment (perhaps due to the introduction of new terminology or new products related to a category), without fully recreating the system from scratch.

3. The fact that a rule set is logically equivalent to a corresponding decision tree for a particular text categorization problem guarantees that any mathematical analysis of the overall performance of the decision tree (as opposed to the performance of individual rules) with respect to text categorization carries over to the rule set. This would not be true if a rule set only approximated a decision tree, as would be the case if the rule set were derived from the decision tree by heuristics.

The most straightforward way to convert a tree into an equivalent set of rules is to create a set with one rule for each leaf by forming the logical conjunction of the tests on the unique path from the root of the tree to the leaf. This starting point is laid out in detail by Quinlan 7 in the discussion of how C4.5 creates rule sets from trees. In C4.5, the initial rule set derived from a decision tree is modified based on the way changes in the rule set affect how the training data are classified. This can be a computational burden, but the burden is lessened by the use of plausible heuristics. However, one should note that the resulting rule set of C4.5 need not be logically equivalent to the rule set initially derived from the tree.

Our approach to deriving rule sets from decision trees differs markedly from that of C4.5. We want a fast algorithm that produces a simplified rule set that is logically equivalent to the rule set initially extracted from a decision tree. As long as the new rule set is logically equivalent to the original rule set, it does not need experimental validation to compare the new set's overall performance with that of the original set. We do not propose to get a provably minimal set of rules. Instead, our algorithm "picks the low-hanging fruit" by

1. Carrying out, within rules, all elementary logical simplifications related to greater than and less than, e.g., converting (A < 3) ∧ (A < 2) to (A < 2).

2. Removing tests that are logically superfluous in the context of the entire rule set, when those tests are readily identifiable from the structure of our decision trees.

The former simplification is standard, but the latter is not. It changes the meaning of rules associated with a particular leaf of a decision tree, while preserving the overall meaning of the rule set.
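The first of these simplifications amounts to keeping, for each feature appearing in a conjunction, only the tightest lower and upper bounds. A minimal sketch follows (illustrative only; the condition encoding is an assumption); the second simplification is implemented by the algorithm described next.

```python
def simplify_conjunction(conditions):
    """Keep only the tightest threshold test per feature and direction.
    Each condition is (feature, op, value) with op '>' (count > value)
    or '<' (count < value), as in the rules shown in this paper."""
    lower, upper = {}, {}                  # tightest '>' and '<' bounds per feature
    for feature, op, value in conditions:
        if op == ">":
            lower[feature] = max(lower.get(feature, value), value)
        else:
            upper[feature] = min(upper.get(feature, value), value)
    simplified = [(f, ">", v) for f, v in lower.items()]
    simplified += [(f, "<", v) for f, v in upper.items()]
    return simplified

# (A < 3) & (A < 2) & (B > 0) & (B > 2)  simplifies to  (A < 2) & (B > 2)
print(simplify_conjunction([("A", "<", 3), ("A", "<", 2), ("B", ">", 0), ("B", ">", 2)]))
```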


Figure 3  Example of a decision tree

Here is the precise algorithm that implements the second rule simplification. It works for a decision tree for a two-class problem (i.e., each leaf node is labeled by a class X or its complement Y), where our aim is to produce rules for membership in the class X.

For each leaf labeled X, create a rule for membership in the class X by forming a conjunction of conditions obtained by traversing the path from the root to X, but use those conditions only in conformance with the following:

For each node N on a path from the root to a leaf labeled X, the condition attached to the parent of N is to be part of the conjunction only if the sibling criterion holds for N,

where the sibling criterion for N is

The node N is not the root, and the sibling node of N is not a leaf node labeled by X.

The resulting set of rules is then logically equivalent to the original tree.

It should be noted that the sibling criterion, and its use here, is novel.

For example, consider the decision tree shown in Figure 3, in which we assume each feature count is integer-valued, so that the negation of A < 2, for example, is A ≥ 2. If one applies the algorithm to the decision tree in Figure 3, one sees that for each leaf labeled X, there is only one condition on the path from the root to the leaf that is not to be omitted from the conjunction, and so one immediately obtains the equivalent rule set

(B < 2) → X
(B ≥ 4) → X
(A ≥ 2) → X

Why does this algorithm yield a set of rules equivalent to the standard set derived from a binary decision tree for a two-class problem? Each time a condition C is a candidate for elimination, there must be two leaf nodes labeled X, giving rise to two rules of the form

P ∧ C → X
P ∧ ¬C ∧ Q → X

where P and Q are conjunctions of conditions. These two rules are equivalent to the single rule

(P ∧ C) ∨ (P ∧ ¬C ∧ Q) → X

However,

(P ∧ C) ∨ (P ∧ ¬C ∧ Q) = P ∧ (C ∨ (¬C ∧ Q)) = P ∧ (C ∨ Q) = (P ∧ C) ∨ (P ∧ Q)

Hence, the two rules above are logically equivalent to the two rules

P ∧ C → X
P ∧ Q → X

This verifies correctness of the algorithm.

In practice, decision trees obtained from text categorization problems are often very unbalanced, which may enable the procedure outlined to accomplish significant simplification.
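The sibling criterion is easy to exercise in code. Below is a minimal sketch (not the KitCat implementation) that extracts simplified rules from a small two-class tree; because Figure 3 is not reproduced here, the tree encoded below is only a hypothetical one chosen to be consistent with the rule set derived in the text.

```python
class TreeNode:
    """Internal node: a test (feature, threshold) with a yes branch (test holds)
    and a no branch (test fails); tests are count(feature) <= threshold.
    Leaf node: a class label 'X' or 'Y'."""
    def __init__(self, label=None, feature=None, threshold=None, yes=None, no=None):
        self.label, self.feature, self.threshold = label, feature, threshold
        self.yes, self.no = yes, no

    def is_leaf(self):
        return self.label is not None

def rules_for_class(node, target="X", path=()):
    """One rule per leaf labeled `target`, keeping the edge condition into a child
    only when the child's sibling is not a leaf labeled `target` (the sibling criterion)."""
    if node.is_leaf():
        return [list(path)] if node.label == target else []
    rules = []
    for child, sibling, holds in ((node.yes, node.no, True), (node.no, node.yes, False)):
        cond = (node.feature, "<=" if holds else ">", node.threshold)
        keep = not (sibling.is_leaf() and sibling.label == target)
        rules += rules_for_class(child, target, path + ((cond,) if keep else ()))
    return rules

# A hypothetical tree consistent with the example in the text (the real Figure 3 may differ):
#   B <= 1 ? X : (B <= 3 ? (A <= 1 ? Y : X) : X)
tree = TreeNode(feature="B", threshold=1,
                yes=TreeNode(label="X"),
                no=TreeNode(feature="B", threshold=3,
                            yes=TreeNode(feature="A", threshold=1,
                                         yes=TreeNode(label="Y"),
                                         no=TreeNode(label="X")),
                            no=TreeNode(label="X")))

for rule in rules_for_class(tree):
    print(" & ".join(f"{f} {op} {v}" for f, op, v in rule), "-> X")
# B <= 1 -> X   (i.e., B < 2)
# A > 1 -> X    (i.e., A >= 2)
# B > 3 -> X    (i.e., B >= 4)
```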


Examples

One of the most common data sets used for comparing categorizers is the Reuters-21578 collection of categorized newswires. 11 We used the Mod-Apte data split, which gives a training set with 9603 items and a test set with 3299 items. Since these data are multiply classified, the performance is usually measured by precision and recall for each binary classification problem:

precision = true positive / (true positive + false positive) × 100
recall = true positive / (true positive + false negative) × 100

The overall performance can be measured by the micro-averaged precision, recall, and the break-even point computed from the overall confusion matrix, defined as the sum of the individual confusion matrices corresponding to the categories. Training on 93 categories, our results were a micro-averaged precision of 87.0 percent and a micro-averaged recall of 80.5 percent, the average of these two values being 83.8 percent. In this run, 500 features were selected by using the IG criterion in Yang and Pedersen, 15 and no stopwords list was used. The training time for the DTREE decision tree induction algorithm on the Reuters-21578 data set was 80 seconds on a 400 megahertz Pentium II processor, so it is quite fast. (As a comparison, C4.5 takes hours to finish training on the same data.)

DTREE compares very favorably with results using the C4.5 decision tree induction algorithm, which has a micro-averaged precision/recall of 78.9 percent. 4 If one were willing to make the major sacrifice of giving up human-readable and human-modifiable rules, one could do better than DTREE alone by using boosting (i.e., adaptive resampling, with the accompanying generation of multiple trees). For example, boosted DTREE with 10 trees and 1000 features yields a precision of 89.0 percent and a recall of 84.0 percent. With more than ten trees, additional improvement can also be obtained.

We have also used KitCat-generated, human-comprehensible rules to study structures of corporate Web sites. In one instance, we obtained around 7000 Web pages using a Web crawler on the www.ibm.com Web site. These Web pages are partitioned into 42 categories based on the IBM Web site directory structure. Our goal was to analyze the Web site structure and see whether it would be possible to set up a good category scheme based on the existing directory structure. Such a category scheme could be used to help users navigate through the Web site. To see how well KitCat works on these data, we studied its prediction accuracy under a three-fold cross-validation setting. We observed that the KitCat system worked very well on this Web site structure, with a prediction accuracy of about 86 percent. This compares favorably with our implementation of a naive Bayes method, which gave an accuracy of about 75 percent, and with a support vector machine (SVM) style classifier, which had an accuracy of about 87 percent. Note that the KitCat rule-based system has the added advantage of human interpretability. We also made some interesting observations in this study. For example, we found that the meta-text (for example, the keywords and the description sections) in an HTML (Hypertext Markup Language) file was not as valuable as we originally expected; rules we obtained often did not contain these meta-data. In addition, removing them from the text representation does not have any significant impact on classification accuracy. In the following, we list a few example rules obtained from this data set:

room > 0 → press @ 0.79
room < 1 & press > 0 → press @ 0.16

webcasts > 0 & ibm > 0 & letter > 0 → news @ 0.95
newswire > 0 & letter > 0 → news @ 0.9
newswire > 0 & letter < 1 → news @ 0.78

job > 0 → employment @ 0.87
resume > 0 → employment @ 0.91
employment > 0 → employment @ 0.6

For each rule, the word before "@" is the category name, and the number after "@" is the confidence measure (estimated probability). It is interesting to see that the word "room" is more indicative than "press" for the press category; "letter" is an indicative word for the news category; and both "job" and "resume" are more indicative than "employment" for the employment category. This shows that, in general, it can be difficult for a human to produce handwritten rules without sufficient knowledge of the data. On the other hand, machine-learned rules can give people valuable insights into a category scheme.
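Rule sets like the ones above are straightforward to apply. The sketch below is illustrative only: the rule encoding and the sample text are assumptions, the confidences are the ones listed above, and picking the highest-confidence firing rule is one reasonable way to use the confidence levels when a single category is needed (the exact resolution policy is not spelled out here).

```python
import re
from collections import Counter

# (category, conditions, confidence); conditions test word counts.
RULES = [
    ("press",      [("room", ">", 0)],                                          0.79),
    ("press",      [("room", "<", 1), ("press", ">", 0)],                       0.16),
    ("news",       [("webcasts", ">", 0), ("ibm", ">", 0), ("letter", ">", 0)], 0.95),
    ("news",       [("newswire", ">", 0), ("letter", ">", 0)],                  0.90),
    ("news",       [("newswire", ">", 0), ("letter", "<", 1)],                  0.78),
    ("employment", [("job", ">", 0)],                                           0.87),
    ("employment", [("resume", ">", 0)],                                        0.91),
    ("employment", [("employment", ">", 0)],                                    0.60),
]

def fires(counts, conditions):
    return all((counts.get(w, 0) > t) if op == ">" else (counts.get(w, 0) < t)
               for w, op, t in conditions)

def categorize(text):
    """Return every (category, confidence) whose rules fire, best first."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    hits = {}
    for category, conditions, confidence in RULES:
        if fires(counts, conditions):
            hits[category] = max(hits.get(category, 0.0), confidence)
    return sorted(hits.items(), key=lambda kv: -kv[1])

# A hypothetical page that mentions both a press room and job openings:
print(categorize("Visit our press room for releases; see the job openings page."))
# [('employment', 0.87), ('press', 0.79)] -> pick 'employment' if one label is needed
```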


As another example, KitCat was also applied to categorize e-mail sent to a large U.S. bank. We had 4319 e-mail documents, with 3943 used for training and 970 used for testing. We used nine categories, covering 86 percent of the data. The rules produced by KitCat had a micro-averaged precision of 92.8 percent and a micro-averaged recall of 89.1 percent.

Our applications of KitCat have also taught us that it works very well on "dirty" data. Detailed analysis shows that the rules produced for classifying e-mail documents are quite good, even when the tester initially reports the contrary. On four categories in a customer data set, where we were surprised by less than stellar results as measured by the test data, KitCat rules actually were seen to have 100 percent precision, after the test data were carefully re-examined. The initial human categorization had precision ranging from 50.0 to 83.3 percent on these categories. There were 84 instances of these categories out of 8895 documents in the test set. Thus, misclassified test data can make even good rules look bad. However, bad data are a fact of life, and robustness in its presence is one of KitCat's advantages.

Conclusion

In this paper, we describe a decision-tree-based symbolic rule generation algorithm for text categorization. Our decision tree algorithm has three main characteristics:

1. Taking advantage of the sparse structure
2. Using a modified entropy as the impurity measure
3. Using smoothing to do pruning

We also have a procedure for turning a decision tree into a symbolic rule set by transforming a standard rule set into a simpler equivalent one based on an examination of the structure of the decision tree, and we give a proof of correctness for this procedure and some text categorization examples of the use of the resulting KitCat system. We created an industrial-strength, state-of-the-art prototype system that eventually evolved into the IBM Text Analyzer product offering.

*Trademark or registered trademark of International Business Machines Corporation.

Accepted for publication March 26, 2002.

Cited references

1. S. M. Weiss, C. Apte, F. Damerau, D. E. Johnson, F. J. Oles, T. Goetz, and T. Hampp, "Maximizing Text-Mining Performance," IEEE Intelligent Systems 14, 63–69 (1999).
2. A. McCallum and K. Nigam, "A Comparison of Event Models for Naive Bayes Text Classification," Proceedings, AAAI Workshop on Learning for Text Categorization, Madison, WI (July 26–27, 1998), pp. 41–48.
3. Y. Yang, "An Evaluation of Statistical Approaches to Text Categorization," Information Retrieval Journal 1, 69–90 (1999).
4. T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features," Proceedings, 10th European Conference on Machine Learning, Chemnitz, Germany (April 21–24, 1998), pp. 137–142.
5. S. Dumais, J. Platt, D. Heckerman, and M. Sahami, "Inductive Learning Algorithms and Representations for Text Categorization," Proceedings, 7th ACM International Conference on Information and Knowledge Management, Washington, DC (November 3–7, 1998), pp. 148–155.
6. C. Apte, F. Damerau, and S. M. Weiss, "Automated Learning of Decision Rules for Text Categorization," ACM Transactions on Information Systems 12, 233–251 (1994).
7. J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, San Francisco, CA (1993).
8. L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees, Wadsworth Advanced Books and Software, Belmont, CA (1984).
9. J. Gehrke, R. Ramakrishnan, and V. Ganti, "Rainforest—A Framework for Fast Decision Tree Construction of Large Datasets," Data Mining and Knowledge Discovery 12, No. 2/3, 127–162 (2000).
10. Y. Yang, J. Carbonell, R. Brown, T. Pierce, B. Archibald, and X. Liu, "Learning Approaches for Detecting and Tracking News Events," IEEE Intelligent Systems 14, No. 4, 32–43 (1999).
11. See http://www.research.att.com/~lewis.
12. F. M. J. Willems, Y. M. Shtarkov, and T. J. Tjalkens, "The Context Tree Weighting Method: Basic Properties," IEEE Transactions on Information Theory 41, No. 3, 653–664 (1995).
13. T. Zhang, "Compression by Model Combination," Proceedings, IEEE Data Compression Conference, Snowbird, Utah (March 30–April 1, 1998), pp. 319–328.
14. B. Carlin and T. Louis, Bayes and Empirical Bayes Methods for Data Analysis, Chapman and Hall, New York (1996).
15. Y. Yang and J. P. Pedersen, "A Comparative Study on Feature Selection in Text Categorization," Proceedings, 14th International Conference on Machine Learning, Nashville, TN (July 8–12, 1997).

David E. Johnson IBM Research Division, Thomas J. Watson Research Center, P.O. Box 218, Yorktown Heights, New York (electronic mail: [email protected]). Dr. Johnson is a research staff member at IBM's Watson Research Center, where he manages the computational linguistics and text mining group. His research interests include natural language processing, the syntax and semantics of natural language, and the foundations of linguistic theory. He holds a Ph.D. degree in theoretical linguistics from the University of Illinois at Urbana-Champaign, is a past president of the Association for the Mathematics of Language, and serves on several editorial boards.

Frank J. Oles IBM Research Division, Thomas J. Watson Research Center, P.O. Box 218, Yorktown Heights, New York (electronic mail: [email protected]). Dr. Oles has been a research staff member at the IBM Watson Research Center since 1983. He has a Ph.D. degree in computer and information science from Syracuse University.


His interests include text categorization, information extraction from text, inductive learning of patterns, knowledge representation, the semantics of programming languages, and the mathematical foundations of computer science.

Tong Zhang IBM Research Division, Thomas J. Watson Research Center, P.O. Box 218, Yorktown Heights, New York (electronic mail: [email protected]). Dr. Zhang received a B.A. degree in mathematics and computer science from Cornell University in 1994 and a Ph.D. degree in computer science from Stanford University in 1998. Since 1998, he has been with the IBM Watson Research Center, where he is now a research staff member in the Knowledge Management department. His research interests include machine learning, numerical algorithms, and their applications.

Thilo Goetz IBM Research Division, Thomas J. Watson Research Center, P.O. Box 218, Yorktown Heights, New York (electronic mail: [email protected]). Dr. Goetz has been a research associate and later a research staff member at IBM's Watson Research Center since 1997. He has a Ph.D. degree in computational linguistics from the University of Tübingen, Germany. His areas of interest include the intersection of computer science and computational linguistics, such as parsing, finite state methods, and compiler construction for computational linguistics, as well as complexity theory and logic for computer science.
