Classifying Text with ID3 and C4.5

By Lynn Monson, Dr. Dobb's Journal, October 1, 1997
URL: http://www.ddj.com/184410304

Lynn is a software architect with Novell. He can be reached at [email protected].

E-mail programs that offer automatic mail sorting almost always fall prey to the same problem -- it's easier to manually sort your mail than to figure out how to program the mail client to handle it for you. Similar problems afflict many user-interface designs. But what if the mail program simply watched you manually sort your mail and learned by example how to do it for you? The algorithm Lynn Monson describes this month may be one way to accomplish exactly that. -- Tim Kientzle

With the rise of the Internet, the ability to search effectively for information is becoming increasingly important. Web users, for example, routinely use search engines, catalogs, and Internet directories to find data of interest. However, the information on the Internet is too large, too diverse, and changes too rapidly for these methods to be very effective. One technology that may help involves software agents that search for data on a user's behalf. Such an agent sifts through a set of data and determines which particular items are likely to be interesting, using criteria it has learned to make that judgment. In this article, I'll describe a variation of the ID3 and C4.5 algorithms that can be used to classify textual data. With this algorithm as a basis, you'll be able to understand the related literature and begin building your own information-gathering agents.

ID3

ID3 is a supervised learning algorithm, examined by Andrew Colin in "Building Decision Trees with the ID3 Algorithm" (DDJ, June 1996). It is explicitly trained on a series of examples drawn from several classes, and it builds a theory that allows it to predict the class of a new item. ID3 attempts to identify properties (or features) that differentiate one class of examples from the others. ID3 requires that all features be known in advance and that each feature is "well behaved" (that is, all possible values are known in advance). This means a given property must be either a continuous number or drawn from a fixed set of options. Age, height, temperature, and country of citizenship are all well-behaved features.

ID3 uses entropy to determine which features of the training cases are important. Entropy is used to construct a decision tree, which is then used for classifying future cases. In addition, the decision tree is usually optimized using one of several techniques; the techniques discussed in this article are taken from the C4.5 algorithm.

The information, or entropy, of a set of examples is defined as follows: from the set of examples, construct a probability distribution P = {p1, p2, ..., pn} using a classification scheme. Given that distribution, the information required to identify a given case in the distribution is E(P) = -(p1*log2(p1) + p2*log2(p2) + ... + pn*log2(pn)), as shown in Figure 1(a). This metric is the number of bits required to identify a given class from the probability distribution. For example, if I have 12 marbles, 3 of which are blue, 7 red, and 2 green, the distribution is {3/12, 7/12, 2/12} and requires 1.38 bits to represent.
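To make the entropy calculation concrete, here is a short sketch (my own, not code from the article) that computes E(P) for an arbitrary distribution and reproduces the 1.38-bit figure for the marble example; the class and method names are my own.

```java
public class EntropyDemo {
    // E(P) = -sum(p_i * log2(p_i)): the number of bits needed to
    // identify a class drawn from the distribution P.
    static double entropy(double[] p) {
        double e = 0.0;
        for (double pi : p) {
            if (pi > 0.0) {                           // treat 0*log2(0) as 0
                e -= pi * Math.log(pi) / Math.log(2.0);
            }
        }
        return e;
    }

    public static void main(String[] args) {
        // 12 marbles: 3 blue, 7 red, 2 green.
        double[] marbles = { 3.0 / 12, 7.0 / 12, 2.0 / 12 };
        System.out.printf("%.2f bits%n", entropy(marbles));  // prints 1.38 bits
    }
}
```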
Conditional Entropy

If you partition data in some meaningful way, the total entropy of the parts will be lower than the entropy of the set you started with. ID3 uses a calculation called "conditional entropy" to determine which of several different partitions is most effective. Start with a distribution of examples -- P, as before -- but divide the examples into groups named Xi. The conditional entropy, shown in Figure 1(b), is then the weighted sum E(P|X) = (|X1|/N)*E(P1) + (|X2|/N)*E(P2) + ... , where N is the total number of examples and Pi is the class distribution within group Xi. To determine which features are most important, you use each feature in turn to partition your data and compute the conditional entropy. The feature that gives the lowest conditional entropy is the most important.

In Table 1, you have three features -- lot size, income level, and age -- that you can use to predict the type of lawn mower used by a homeowner. Since there are three riding lawn mowers and four push mowers, the probability distribution for the examples is P = {3/7, 4/7}. This gives a total entropy value of -(3/7)*log2(3/7) - (4/7)*log2(4/7) = 0.98522. Now suppose you divide up the examples based on lot size. This gives three new probability distributions, one for each possible lot size: P1 = {0/3, 3/3} for Small lots, P2 = {1/2, 1/2} for Medium lots, and P3 = {2/2, 0/2} for Large lots. The conditional entropy value is E(P|Lot Size) = (3/7)*E(P1) + (2/7)*E(P2) + (2/7)*E(P3) = 0.286. The conditional entropy for age would require separate tests -- one for each range of values in the training data: E(P|Age<=27), E(P|Age<=30), and so on.

ID3 tests every possible feature using conditional entropy. The feature with the lowest entropy value is taken to be the best test. ID3 then builds a node in a decision tree that identifies the given feature. The branches of the tree coming from the node are associated with the possible outcomes of the test. If the node is labeled "Lot Size," the branches from the node would be "Small," "Medium," and "Large." Having identified a test on a feature, ID3 then invokes itself recursively -- the list of examples is partitioned based on the identified test, the feature already tested is removed from consideration, and the algorithm is invoked for each branch of the tree. This process continues until either the remaining examples are all of one class or there are too few remaining examples. At that point, a leaf is added to the tree identifying the class of the examples.

C4.5

One limitation of ID3 is that it is overly sensitive to features with large numbers of values. This must be overcome if you are going to use ID3 as an Internet search agent. I address this difficulty by borrowing from the C4.5 algorithm, an extension of ID3. ID3's sensitivity to features with many values is illustrated by Social Security numbers. Since Social Security numbers are unique to each individual, testing on their values will always yield low conditional entropy. However, this is not a useful test. (Social Security numbers do not help you predict whether a future medical patient needs surgical intervention.) To overcome this problem, C4.5 uses a metric called "information gain," which is defined by subtracting the conditional entropy from the base entropy; that is, Gain(P|X) = E(P) - E(P|X). This computation does not, in itself, produce anything new. However, it allows you to measure a gain ratio. The gain ratio, defined as GainRatio(P|X) = Gain(P|X)/E(X), where E(X) is the entropy of the examples relative only to the attribute X, measures the information gain of feature X relative to the "raw" information of the X distribution. By using the gain ratio instead of the plain conditional entropy, C4.5 reduces the problems caused by artificially low entropy values such as those produced by Social Security numbers.
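The following sketch (again my own, not listing code from the article) reproduces the lawn-mower numbers: the base entropy of {3/7, 4/7}, the conditional entropy of the lot-size split, and the C4.5 gain and gain ratio. The arrays encode Table 1's {riding, push} counts for each lot size; the class and variable names are assumptions of mine.

```java
public class LotSizeSplit {
    static double log2(double x) { return Math.log(x) / Math.log(2.0); }

    // Entropy of a class distribution given as raw counts.
    static double entropy(double[] counts) {
        double total = 0.0, e = 0.0;
        for (double c : counts) total += c;
        for (double c : counts) {
            if (c > 0.0) {
                double p = c / total;
                e -= p * log2(p);
            }
        }
        return e;
    }

    public static void main(String[] args) {
        double[] all = { 3, 4 };                    // 3 riding mowers, 4 push mowers
        double base = entropy(all);                 // ~0.985

        // Lot-size partition: Small {0 riding, 3 push}, Medium {1, 1}, Large {2, 0}.
        double[][] byLotSize = { { 0, 3 }, { 1, 1 }, { 2, 0 } };
        double conditional = 0.0;
        for (double[] group : byLotSize) {
            double size = group[0] + group[1];
            conditional += (size / 7.0) * entropy(group);   // weight by group size
        }

        // Gain(P|X) = E(P) - E(P|X); GainRatio divides by E(X), the entropy of
        // the lot-size attribute itself (group sizes 3, 2, 2 out of 7).
        double gain = base - conditional;
        double splitInfo = entropy(new double[] { 3, 2, 2 });
        System.out.printf("E(P)=%.3f E(P|LotSize)=%.3f Gain=%.3f GainRatio=%.3f%n",
                base, conditional, gain, gain / splitInfo);
    }
}
```

Run against Table 1's counts, this prints the same 0.985 and 0.286 values derived above, along with the corresponding gain and gain ratio.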
Extending ID3 and C4.5

While simple and expressive, ID3 and C4.5 aren't enough for searching the Web. ID3 assumes that all properties of a test case are known in advance and that each property has a known range of values. That definition doesn't fit text: text is open ended and can take on an effectively unlimited number of values. Fortunately, developments in the relevant literature illustrate how ID3 (and hence C4.5) can be extended for searching text. The ideas are loosely based on concepts from information retrieval. In many information-retrieval algorithms, a text document is compressed into a form known as a "bag of words" -- the bag contains every word in the document, but information about word ordering and sentence structure is not preserved. Sometimes each word also carries a count. The assumption is that the relative frequencies of words in two given word bags can be compared to determine whether the documents are similar.

To extend ID3 to support text comparison, I adjust the test criteria. First, the test cases presented to ID3 are allowed to have features that are textual. To test such a feature, it is interpreted as a "document" and put into the form of a bag of words. You can then use ID3 to classify documents based on true/false tests of the form "Does bag X contain word Y?" This test -- whether a word is present in the feature -- fits into the notion of conditional entropy tests (see the sketch below), and the ID3 algorithm can then proceed in the usual manner. It sometimes makes sense to reverse the sense of the test when testing for an outcome. For example, if all examples are drawn from two classes, you may want to construct the decision tree such that all left branches correspond to one class, while right branches correspond to the other. In the case of searching web data, ID3 can be simplified to learning examples that have a single feature. Since that feature can now be text, ...
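To make the word-presence test concrete, here is a minimal sketch (my own construction, not from the article) that reduces each document to a bag of words and scores a single test of the form "does the bag contain word Y?" by conditional entropy; the sample documents, labels, and the word being tested are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

public class WordBagTest {
    // Reduce a document to its bag of words: every distinct word,
    // with ordering and sentence structure discarded.
    static Set<String> bagOfWords(String text) {
        Set<String> bag = new HashSet<>();
        for (String w : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!w.isEmpty()) bag.add(w);
        }
        return bag;
    }

    // Entropy of a two-class split given the two class counts.
    static double entropy(int a, int b) {
        double total = a + b, e = 0.0;
        for (int c : new int[] { a, b }) {
            if (c > 0) {
                double p = c / total;
                e -= p * Math.log(p) / Math.log(2.0);
            }
        }
        return e;
    }

    public static void main(String[] args) {
        // Hypothetical training set: documents marked interesting or not.
        String[] docs = { "java decision tree tutorial", "decision tree entropy basics",
                          "cheap vacation offer", "vacation photo album" };
        boolean[] interesting = { true, true, false, false };

        // Score the true/false test "does the bag contain the word 'tree'?"
        String word = "tree";
        int[] with = new int[2], without = new int[2];   // index 0: interesting, 1: not
        for (int i = 0; i < docs.length; i++) {
            int cls = interesting[i] ? 0 : 1;
            if (bagOfWords(docs[i]).contains(word)) with[cls]++; else without[cls]++;
        }
        int n = docs.length;
        double conditional =
                ((with[0] + with[1]) / (double) n) * entropy(with[0], with[1])
              + ((without[0] + without[1]) / (double) n) * entropy(without[0], without[1]);
        System.out.printf("E(P | contains \"%s\") = %.3f bits%n", word, conditional);
    }
}
```

An agent following the approach described above would repeat this scoring for every candidate word, branch on the word with the lowest conditional entropy (or, per C4.5, the best gain ratio), and then recurse on each branch as ID3 normally does.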
