A Decision-Tree-Based Symbolic Rule Induction System for Text Categorization
by D. E. Johnson, F. J. Oles, T. Zhang, and T. Goetz

We present a decision-tree-based symbolic rule induction system for categorizing text documents automatically. Our method for rule induction involves the novel combination of (1) a fast decision tree induction algorithm especially suited to text data and (2) a new method for converting a decision tree to a rule set that is simplified, but still logically equivalent to, the original tree. We report experimental results on the use of this system on some practical problems.

The text categorization problem is to determine predefined categories for an incoming unlabeled message or document containing text, based on information extracted from a training set of labeled messages or documents. Text categorization is an important practical problem for companies that wish to use computers to categorize incoming electronic mail, thereby either enabling an automatic machine response to the e-mail or simply assuring that the e-mail reaches the correct human recipient. But, beyond e-mail, text items to be categorized may come from many sources, including the output of voice recognition software, collections of documents (e.g., news stories, patents, or case summaries), and the contents of Web pages.

Previous text categorization methods have used decision trees (with or without boosting),¹ naive Bayes classifiers,² nearest-neighbor methods,³ support vector machines,⁴,⁵ and various kinds of direct symbolic rule induction.⁶ Among all these methods, we are particularly interested in systems that can produce symbolic rules, because rules comprehensible to humans often provide valuable insights in many practical problems.

In a symbolic rule system, text is represented as a vector in which each component is the number of occurrences of a certain word in the text. The system induces rules from the training data, and the generated rules can then be used to categorize arbitrary data that are similar to the training data. Each rule ultimately produced by such a system states that a condition, which is usually a conjunction of simpler conditions, implies membership in a particular category. The condition forms the antecedent of the rule, and the conclusion posited as true when the condition is satisfied is the consequent of the rule. Usually, the antecedent of a rule is a combination of tests to be done on various components. For example:

    share > 3 & year ≤ 1 & acquire > 2 → acq

may be read as "if the word share occurs more than three times in the document and the word year occurs at most one time in the document and the word acquire occurs more than twice in the document, then classify the document in the category acq."
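To make the rule format concrete, the following sketch shows how a conjunctive word-count rule such as the one above could be represented and applied to a document. It is illustrative only: the Condition and Rule classes are stand-ins of our own, not components of the system described in this paper.

    # Illustrative sketch: applying a conjunctive symbolic rule to a
    # document represented by its word counts. Condition and Rule are
    # hypothetical helpers, not part of the system described here.
    from dataclasses import dataclass
    from collections import Counter

    @dataclass
    class Condition:
        word: str
        op: str          # ">" or "<="
        threshold: int

        def holds(self, counts):
            c = counts[self.word]
            return c > self.threshold if self.op == ">" else c <= self.threshold

    @dataclass
    class Rule:
        conditions: list   # conjunction of simple tests (the antecedent)
        category: str      # the consequent

        def fires(self, counts):
            return all(cond.holds(counts) for cond in self.conditions)

    # The example rule from the text: share > 3 & year <= 1 & acquire > 2 -> acq
    rule = Rule([Condition("share", ">", 3),
                 Condition("year", "<=", 1),
                 Condition("acquire", ">", 2)],
                "acq")

    counts = Counter("share share share share acquire acquire acquire".split())
    print(rule.fires(counts))   # True: the document is classified into acq

A document for which any one condition fails (for example, one in which the word year occurs twice) would not fire the rule.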
In this paper, we describe a system that produces such symbolic rules from a decision tree. We emphasize the novel aspects of our work: a fast decision tree construction algorithm that takes advantage of the sparsity of text data, and a rule simplification method that converts the decision tree into a logically equivalent rule set. Compared with a standard C4.5 tree algorithm,⁷ our algorithm not only can be as much as a hundred times faster on some text data, but also can handle larger feature sets, which in turn leads to more accurate rules.

The prototype system described in this paper is available in product form as the IBM Text Analyzer, an IBM WebSphere* Business Component, and it is also embedded in the IBM Enterprise Information Portal, version 8, an IBM content-management product offering.

System architecture

All of the methods described in this paper have been implemented in a prototype text categorization system that we call KitCat, an abbreviation of "tool kit for text categorization." The overall KitCat system architecture is shown in Figure 1.

Figure 1  The KitCat system architecture (stages shown: configuration file; data preparation: data examination, data cleaning, data splitting; feature extraction: feature definition, feature selection; machine learning: build decision trees, convert trees to rule sets, compute confidence levels; testing; category table)

In order to support multiple categorization (a common requirement in text categorization applications), the KitCat system induces a separate rule set for each category by treating categorization vis-à-vis each category as a separate binary-classification problem. This has the secondary advantage that a person who analyzes the rules can then concentrate on one category at a time. Also, we associate with each rule a confidence measure that is the estimated in-class probability that a document satisfies the rule. In settings where it is desirable to assign a single category to a document, confidence levels enable the user to resolve the ambiguity resulting from the multiple firing of rules for different categories.

The decision tree induction algorithm

DTREE is our implementation of the decision tree induction algorithm, the component of KitCat used to carry out the "build decision trees" function in Figure 1. In this section we describe the novel features of the DTREE algorithm.

We start with a set of training documents indexed by consecutive integers. For each integer i in the index set, we transform the ith document into a numeric vector x_i = (x_{i,1}, ..., x_{i,d}) whose components are counts of the numbers of occurrences of words or features in the ith document. For each category, we would like to determine a label y_i ∈ {0, 1} so that y_i = 1 indicates that the ith document belongs to the category, and y_i = 0 indicates that the ith document does not belong to the category.
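As a small illustration of this representation, the sketch below builds count vectors and per-category binary labels for two toy documents. The tokenizer, the feature list, and the helper name to_count_vector are assumptions made for the example; KitCat's actual feature definition and selection steps are more involved.

    # Illustrative sketch: documents -> word-count vectors x_i and, for one
    # category ("acq"), binary labels y_i. Feature list and tokenizer are
    # simplified stand-ins.
    from collections import Counter

    def to_count_vector(text, features):
        counts = Counter(text.lower().split())
        return [counts[w] for w in features]   # component f = occurrences of feature f

    docs = ["the share price rose after the acquire offer",
            "heavy rain is expected later this year"]
    doc_categories = [{"acq"}, {"weather"}]    # categories attached to each document

    features = ["share", "year", "acquire"]    # assumed selected features
    X = [to_count_vector(d, features) for d in docs]
    y_acq = [1 if "acq" in cats else 0 for cats in doc_categories]

    print(X)       # [[1, 0, 1], [0, 1, 0]]
    print(y_acq)   # [1, 0]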
The dimension d of the input vector space is huge for text categorization problems. Usually a few hundred to a few thousand words (features) are required to achieve good performance, and tens of thousands of features may be present, depending both on the size of the data set and on the details of feature definition, such as tokenization and stemming. Despite the large number of features potentially present in text, the lengths of individual documents, especially for e-mail and Web applications, can be rather short on average. The result is that each data vector x_i is typically very sparse.

However, traditional tree construction algorithms (such as C4.5⁷ or CART⁸) were not designed for sparse data. In the following, we briefly describe the fast decision tree algorithm that we developed specifically for text categorization applications. We assume that the reader has a basic knowledge of decision tree generation algorithms, so that we only elaborate on issues that are different from standard procedures, such as methods described by Quinlan⁷ and Breiman et al.⁸ For a discussion of various issues in developing scalable decision tree algorithms, see Gehrke et al.⁹ While the construction of a tunable decision tree induction system especially designed for text categorization was mentioned in Yang et al.,¹⁰ few details were given there, so we cannot compare our system to theirs.

Similar to a standard decision tree construction algorithm, our algorithm contains two phases: tree

Figure 2  Functions that can be used to define impurity criteria (entropy: dotted; Gini index: dash-dotted; modified entropy: dashed; classification error: solid; plotted for p from 0 to 1)

A candidate split on a feature f at a threshold value v partitions T into T^1_{f,v} = {x_i ∈ T : x_{i,f} ≤ v} and T^2_{f,v} = {x_i ∈ T : x_{i,f} > v}. This partition corresponds to a decision rule based on whether or not the word corresponding to feature f occurs more than v times in document d_i represented by vector x_i.

Let p^1_{f,v} = P(y_i = 1 | x_i ∈ T^1_{f,v}), p^2_{f,v} = P(y_i = 1 | x_i ∈ T^2_{f,v}), and p_{f,v} = P(x_i ∈ T^1_{f,v} | x_i ∈ T). We consider the following transform of probability estimate r: [0, 1] → [0, 1] defined as

    r(p) = \begin{cases} \tfrac{1}{2}\bigl(1 + \sqrt{2p - 1}\bigr) & \text{if } p > 0.5, \\ \tfrac{1}{2}\bigl(1 - \sqrt{1 - 2p}\bigr) & \text{if } p \le 0.5. \end{cases}    (1)

We then define a modified entropy for p ∈ [0, 1] as

    g(p) = -r(p)\log\bigl(r(p)\bigr) - \bigl(1 - r(p)\bigr)\log\bigl(1 - r(p)\bigr).
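For concreteness, the sketch below implements the transform r and the modified entropy g, and shows one common way such an impurity function can be used: scoring a candidate split (f, v) by the weighted impurity of the two sides. The split_score helper and the choice of base-2 logarithm are illustrative assumptions; they are not details given in the text up to this point.

    import math

    def r(p):
        # Transform of a probability estimate p in [0, 1], equation (1).
        if p > 0.5:
            return 0.5 * (1.0 + math.sqrt(2.0 * p - 1.0))
        return 0.5 * (1.0 - math.sqrt(1.0 - 2.0 * p))

    def g(p):
        # Modified entropy: binary entropy applied to r(p) (base-2 log assumed).
        q = r(p)
        if q in (0.0, 1.0):
            return 0.0
        return -q * math.log2(q) - (1.0 - q) * math.log2(1.0 - q)

    def split_score(T, f, v):
        # Hypothetical helper: T is a list of (x, y) pairs, where x maps
        # feature names to counts and y is 0 or 1. The split x[f] <= v versus
        # x[f] > v is scored by the weighted modified entropy of the two
        # sides (lower is better); this weighting is an assumed convention.
        left = [(x, y) for x, y in T if x.get(f, 0) <= v]
        right = [(x, y) for x, y in T if x.get(f, 0) > v]
        if not left or not right:
            return float("inf")                     # degenerate split
        p1 = sum(y for _, y in left) / len(left)    # estimate of p^1_{f,v}
        p2 = sum(y for _, y in right) / len(right)  # estimate of p^2_{f,v}
        w = len(left) / len(T)                      # estimate of p_{f,v}
        return w * g(p1) + (1.0 - w) * g(p2)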