Decision Tree Pruning Using Expert Knowledge
DECISION TREE PRUNING USING EXPERT KNOWLEDGE

A Dissertation
Presented to
The Graduate Faculty of The University of Akron

In Partial Fulfillment
of the Requirements for the Degree
Doctor of Philosophy

Jingfeng Cai
December, 2006

DECISION TREE PRUNING USING EXPERT KNOWLEDGE

Jingfeng Cai

Dissertation Approved:

  Advisor: Dr. John Durkin
  Committee Member: Dr. Chien-Chung Chan
  Committee Member: Dr. James Grover
  Committee Member: Dr. Narender P. Reddy
  Committee Member: Dr. Shiva Sastry
  Committee Member: Dr. John Welch

Accepted:

  Department Chair: Dr. Alex De Abreu Garcia
  Dean of the College: Dr. George K. Haritos
  Dean of the Graduate School: Dr. George R. Newkome
  Date:

ABSTRACT

Decision tree technology has proven to be a valuable way of capturing human decision making within a computer, and it has long been a popular artificial intelligence (AI) technique. During the 1980s it was one of the primary ways of creating an AI system. During the early 1990s it fell somewhat out of favor, as did the AI field in general, but with the emergence of data mining technology in the later 1990s it resurfaced as a powerful method for creating decision-making programs.

How to prune a decision tree is one of the main research directions in decision tree technology, yet cost-sensitive pruning has received much less attention than other pruning techniques, even though it offers additional flexibility and increased performance. This dissertation reports on a study of cost-sensitive methods for decision tree pruning. A decision tree pruning algorithm called KBP1.0, which includes four cost-sensitive methods, is developed. Intelligent inexact classification is used for the first time in KBP1.0 to prune a decision tree, and the use of expert knowledge in decision tree pruning is discussed for the first time. By comparing the cost-sensitive pruning methods in KBP1.0 with traditional pruning methods, such as reduced error pruning, pessimistic error pruning, cost complexity pruning, and C4.5, on benchmark data sets, the advantages and disadvantages of the cost-sensitive methods in KBP1.0 are summarized. This research will enhance our understanding of the theory, design, and implementation of decision tree pruning using expert knowledge.

In the future, the cost-sensitive pruning methods can be integrated with other pruning methods, such as minimum error pruning and critical value pruning, and new pruning methods can be added to KBP. Another direction of our future work is to use KBP to prune a decision tree and extract rules from the pruned tree to help build an expert system.

ACKNOWLEDGEMENTS

I want to express my sincerest gratitude to my advisor, Dr. John Durkin, for his guidance throughout my doctoral studies. Thanks to his encouraging and forbearing attitude I was able to finish this dissertation. I learned a lot from him, and he is the reason I came to The University of Akron.

Thanks to all my doctoral dissertation committee members; you taught me so much over the years. Thank you to Dr. Chien-Chung Chan, Dr. James Grover, Dr. Narender P. Reddy, Dr. Shiva Sastry, and Dr. John Welch. Thanks to the other faculty in our Electrical and Computer Engineering Department and College of Engineering, especially Dr. George K. Haritos and Dr. S.I. Hariharan. During my Ph.D.
studies, I was financially supported by our Electrical and Computer Engineering Department and College of Engineering.

Thanks to my wonderful parents, Zixing and Huan, my parents-in-law, Haiquan and Zhanru, and my brother and sister-in-law, Yufeng and Xiaoxue, for their continuous support during my studies. This Ph.D. is a result of their extraordinary will and sacrifices.

My final, and most heartfelt, acknowledgment must go to my wife, Qingbo. Her patience, encouragement, love, and guidance were essential for my joining and smooth sailing through the Ph.D. program. For all that, and for being everything I am not, she has my everlasting love. Qingbo and I also want to thank our baby on the way, who is our angel and brings us luck.

TABLE OF CONTENTS

LIST OF TABLES ........................................................ viii
LIST OF FIGURES ....................................................... ix

CHAPTER

I. INTRODUCTION ....................................................... 1
   1. Motivation ...................................................... 1
   2. Contributions of Research ....................................... 3
   3. Dissertation Outline ............................................ 4

II. OVERVIEW OF DECISION TREE TECHNOLOGY .............................. 6
   1. Definition of Terminology ....................................... 6
   2. What Is a Decision Tree ......................................... 7
   3. Decision Tree Construction ...................................... 8
   4. Pruning a Decision Tree ......................................... 12
   5. Other Work on Decision Tree Technology .......................... 18
   6. Decision Tree Applications ...................................... 26
   7. Intelligent Inexact Classification in Expert Systems ............ 29

III. KBP1.0: A NEW DECISION TREE PRUNING METHOD ....................... 33
   1. Using Intelligent Inexact Classification in Cost-Sensitive Pruning .. 33
   2. How to Determine α .............................................. 37
   3. Using Cost and Error Rate in Decision Tree Pruning .............. 38
   4. Pessimistic Cost Pruning ........................................ 43
   5. Expert Knowledge Used in Decision Tree Pruning .................. 44
   6. Difference between KBP1.0 and C4.5 .............................. 45
   7. An Example of KBP1.0 ............................................ 47

IV. COMPARATIVE ANALYSIS .............................................. 59
   1. The Design of the Experiment .................................... 59
   2. Criteria and Data Set Partition ................................. 62
   3. Cost Matrix ..................................................... 63
   4. Cost and Error Weights, Cost and Error Threshold Values ......... 66
   5. Experimental Results ............................................ 66
   6. Statistical Measures ............................................ 90
   7. The Cost Matrix Sensitivity ..................................... 91

V. THE FRAMEWORK FOR KNOWLEDGE-BASE PRUNING ........................... 93
   1. Definitions ..................................................... 93
   2. Strategies for KBP1.0 ........................................... 96
   3. Discussion ...................................................... 98

VI. SUMMARY ........................................................... 100
   1. Summary ......................................................... 100
   2. Discussion ...................................................... 103
   3. Future Work ..................................................... 105

BIBLIOGRAPHY .......................................................... 106
APPENDIX .............................................................. 114

LIST OF TABLES

Table                                                                  Page

1. Attribute Information .............................................. 49
2. Major properties of the data sets considered in the experimentation .. 60
3. Values for I1, I2, C_th, and E_th .................................. 66
4. Test results on cost between the cost-sensitive pruning methods KBP1.0-1, KBP1.0-2, KBP1.0-3, and KBP1.0-4 and the other non-cost-sensitive pruning methods .. 67
5. Test results on error rate (percent) between the cost-sensitive pruning methods KBP1.0-1, KBP1.0-2, KBP1.0-3, and KBP1.0-4 and the other non-cost-sensitive pruning methods .. 73
6. Test results on size between the cost-sensitive pruning methods KBP1.0-1, KBP1.0-2, KBP1.0-3, and KBP1.0-4 and the other non-cost-sensitive pruning methods .. 77
7. Test results on cost between the cost-sensitive pruning methods KBP1.0-1, KBP1.0-2, KBP1.0-3, and KBP1.0-4 and the C4.5 pruning method .. 80
8. Test results on error rate (percent) between the cost-sensitive pruning methods KBP1.0-1, KBP1.0-2, KBP1.0-3, and KBP1.0-4 and the C4.5 pruning method .. 84
9. Test results on size between the cost-sensitive pruning methods KBP1.0-1, KBP1.0-2, KBP1.0-3, and KBP1.0-4 and the C4.5 pruning method .. 88
10. Experimental results on cost according to different cost matrices .. 91
11. Experimental results on the percentage of cost changes ............ 92
12. Values for I1 and I2 in KBP1.0 .................................... 96

LIST OF FIGURES
Figure                                                                 Page

1. Test results on cost between the cost-sensitive pruning methods KBP1.0-1, KBP1.0-2, KBP1.0-3, and KBP1.0-4 and the other non-cost-sensitive pruning methods for all the data sets .. 68
2. Test results on cost between the cost-sensitive pruning methods KBP1.0-1, KBP1.0-2, KBP1.0-3, and KBP1.0-4 and the other non-cost-sensitive pruning methods for the Hepatitis data set .. 69
3. Test results on cost between the cost-sensitive pruning methods KBP1.0-1, KBP1.0-2, KBP1.0-3, and KBP1.0-4 and the other non-cost-sensitive pruning methods for the Iris data set .. 70
4. Test results on cost between the cost-sensitive pruning methods KBP1.0-1, KBP1.0-2, KBP1.0-3, and KBP1.0-4 and the other non-cost-sensitive pruning methods for the Vote data set .. 71
5. Test results on error rate (percent) between the cost-sensitive pruning methods KBP1.0-1, KBP1.0-2, KBP1.0-3, and KBP1.0-4 and the other non-cost-sensitive pruning methods for all the data sets .. 74
6. Test results on error rate (percent) between the cost-sensitive pruning methods KBP1.0-1, KBP1.0-2, KBP1.0-3, and KBP1.0-4 and the other non-cost-sensitive pruning methods for the Hypothyroid data set .. 75
7. Test results on size between the cost-sensitive pruning methods KBP1.0-1, KBP1.0-2, KBP1.0-3, and KBP1.0-4 and the other non-cost-sensitive pruning methods for all the data sets .. 78
8. Test results on cost between the cost-sensitive pruning methods KBP1.0-1, KBP1.0-2, KBP1.0-3, and KBP1.0-4 and the C4.5 pruning method for all the data sets .. 81
9. Test results on cost between the cost-sensitive pruning methods KBP1.0-1, KBP1.0-2, KBP1.0-3, and KBP1.0-4 and the C4.5 pruning method for the Hypothyroid data set .. 82
10. Test results on cost between the cost-sensitive pruning methods KBP1.0-1, KBP1.0-2, and KBP1.0-3, and the C4.5 pruning method for the Vote data set .. 83
11. Test results on error rate between the cost-sensitive pruning methods KBP1.0-1, KBP1.0-2, KBP1.0-3, and KBP1.0-4 and the C4.5 pruning method for all the data sets .. 85
12. Test results on error rate between the cost-sensitive pruning methods KBP1.0-1, KBP1.0-2, KBP1.0-3, and KBP1.0-4 and the C4.5 pruning method for the Hypothyroid data set