Pruning Decision Trees and Lists
Department of Computer Science
Hamilton, New Zealand

Pruning Decision Trees and Lists

A thesis submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy at the University of Waikato

by Eibe Frank

January 2000

© 2000 Eibe Frank
All Rights Reserved

Abstract

Machine learning algorithms are techniques that automatically build models describing the structure at the heart of a set of data. Ideally, such models can be used to predict properties of future data points and people can use them to analyze the domain from which the data originates.

Decision trees and lists are potentially powerful predictors and embody an explicit representation of the structure in a dataset. Their accuracy and comprehensibility depend on how concisely the learning algorithm can summarize this structure. The final model should not incorporate spurious effects—patterns that are not genuine features of the underlying domain. Given an efficient mechanism for determining when a particular effect is due to chance alone, non-predictive parts of a model can be eliminated or "pruned." Pruning mechanisms require a sensitive instrument that uses the data to detect whether there is a genuine relationship between the components of a model and the domain. Statistical significance tests are theoretically well-founded tools for doing exactly that.

This thesis presents pruning algorithms for decision trees and lists that are based on significance tests. We explain why pruning is often necessary to obtain small and accurate models and show that the performance of standard pruning algorithms can be improved by taking the statistical significance of observations into account. We compare the effect of parametric and non-parametric tests, analyze why current pruning algorithms for decision lists often prune too aggressively, and review related work—in particular existing approaches that use significance tests in the context of pruning.
The main outcome of this investigation is a set of simple pruning algorithms that should prove useful in practical data mining applications.

Acknowledgments

This thesis was written in a very supportive and inspiring environment, and I would like to thank all the people in the Department of Computer Science at the University of Waikato, especially the members of the Machine Learning Group, for making it so much fun to do my research here in New Zealand. Most of all I would like to thank my supervisor Ian Witten for his generous support, for the freedom that allowed me to pursue several interesting little projects during the course of my PhD, for his insightful comments on all aspects of my work, and for his persistent attempts at teaching me how to write English prose.

Many thanks also to current and former members of the WEKA Machine Learning Group, in particular Geoff Holmes, Mark Hall, Len Trigg, and Stuart Inglis, and to my fellow PhD students Gordon Paynter and Yong Wang, for being such a great team to work with.

Contents

Abstract
Acknowledgments

1 Introduction
   1.1 Basic Concepts
   1.2 Decision Trees and Lists
   1.3 Thesis Structure
   1.4 Experimental Methodology
   1.5 Historical Remarks

2 Why prune?
   2.1 Sampling variance
   2.2 Overfitting
   2.3 Pruning
   2.4 Related Work
   2.5 Conclusions

3 Post-pruning
   3.1 Decision tree pruning and statistical significance
   3.2 Significance tests on contingency tables
      3.2.1 Testing significance
      3.2.2 Parametric tests
      3.2.3 Non-parametric tests
      3.2.4 Approximation
      3.2.5 Test statistics
      3.2.6 Sparseness
   3.3 Experiments on artificial data
   3.4 Experiments on practical datasets
   3.5 Minimizing tree size
   3.6 Related work
      3.6.1 Cost-complexity pruning
      3.6.2 Pessimistic error pruning
      3.6.3 Critical value pruning
      3.6.4 Minimum-error pruning
      3.6.5 Error-based pruning
      3.6.6 Underpruning
      3.6.7 Other pruning methods
   3.7 Conclusions

4 Pre-pruning
   4.1 Pre-pruning with significance tests
      4.1.1 Significance tests on attributes
      4.1.2 Multiple tests
   4.2 Experiments
      4.2.1 Parametric vs. non-parametric tests
      4.2.2 Adjusting for multiple tests
      4.2.3 Pre- vs. post-pruning
   4.3 Related work
      4.3.1 Pre-pruning in statistics and machine learning
      4.3.2 Pre-pruning with permutation tests
      4.3.3 Statistical tests for attribute selection
   4.4 Conclusions

5 Pruning decision lists
   5.1 Pruning algorithms
      5.1.1 Reduced-error pruning for rules
      5.1.2 Incremental reduced-error pruning
      5.1.3 Incremental tree-based pruning
   5.2 Improved rule sets with significance tests
      5.2.1 Using statistical significance for rule selection
      5.2.2 Examples for underpruning
      5.2.3 Integrating significance testing into pruning
   5.3 Rule learning with partial trees
      5.3.1 Altering the basic pruning operation
      5.3.2 Building a partial tree
      5.3.3 Remarks
   5.4 Experiments
      5.4.1 Tree-based and standard incremental reduced-error pruning
      5.4.2 Using significance testing
      5.4.3 Building partial trees
   5.5 Related work
      5.5.1 RIPPER
      5.5.2 C4.5
      5.5.3 Inducing decision lists
      5.5.4 Significance testing in rule learners
   5.6 Conclusions

6 Conclusions
   6.1 Results
   6.2 Contributions
   6.3 Future work

Appendix A Results for post-pruning
   A.1 Parametric vs. non-parametric tests
   A.2 Reduced-overlap pruning

Appendix B Results for pre-pruning
   B.1 Parametric vs. non-parametric tests
   B.2 Adjusting for multiple tests
   B.3 Pre- vs. post-pruning

Appendix C Results for pruning decision lists

List of Figures

1.1 A decision tree for classifying Iris flowers
1.2 A decision list for classifying Iris flowers
2.1 Expected error of disjuncts for different sizes and different values of p
2.2 Average error of random classifier and minimum error classifier
2.3 Average error of different size disjuncts
2.4 Average number of different size disjuncts
3.1 A decision tree with two classes A and B
3.2 An example pruning set
3.3 Reduced-error pruning example
3.4 Tree for no-information dataset before pruning
3.5 Tree for no-information dataset after pruning
3.6 Size of pruned tree relative to training set size
3.7 Structure of a contingency table
3.8 Example pruning set with predicted class values
3.9 Example confusion matrix
3.10 Contingency tables for permutations in Table 3.2
3.11 Sequential probability ratio test
3.12 Three tables and their p-values
3.13 Performance of tests for no-information dataset
3.14 Performance of tests for parity dataset
3.15 Significance test A uniformly dominates test B
3.16 Comparison of significance tests
4.1 Generic pre-pruning algorithm
4.2 Example contingency tables for attribute selection
4.3 Permutations of a small dataset
4.4 Contingency tables for the permutations
4.5 Comparison of significance tests
4.6 Using the Bonferroni correction
4.7 Pre- vs. post-pruning on the parity dataset
4.8 Pre- vs. post-pruning
4.9 A split that does not reduce the classification error
5.1 An example of subtree replication
5.2 A simple decision list
5.3 Separate-and-conquer algorithm
5.4 Incremental reduced-error pruning
5.5 A hypothetical target concept for a noisy domain
5.6 Tree-based incremental reduced-error pruning
5.7 Decision tree pruning in TB-IREP
5.8 Example illustrating the two pruning procedures
5.9 Contingency table representing a rule
5.10 Contingency table representing two example rules
5.11 Two examples where TB-IREP does not prune sufficiently
5.12 Contingency tables for the examples from Figure 5.11
5.13 Decision tree pruning with significance testing in TB-IREP
5.14 Method that expands a given set of instances into a partial tree
5.15 Example of how a partial tree is built
5.16 Runtime and accuracy of TB-IREPsig and PARTsig
A.1 Comparison of significance tests
B.1 Comparison of significance tests
B.2 Using the Bonferroni correction
B.3 Pre- vs. post-pruning

List of Tables

1.1 Datasets used for the experiments
3.1 Predicted and true class labels for a small dataset
3.2 Permutations of the dataset from Table 3.1 and corresponding χ²-values
3.3 Accuracies for REP compared to SIG and SIG-SE
3.4 Results of paired t-tests