Correlation-Based Feature Selection for Machine Learning
Department of Computer Science
Hamilton, New Zealand

Correlation-based Feature Selection for Machine Learning

Mark A. Hall

This thesis is submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy at The University of Waikato.

April 1999
© 1999 Mark A. Hall

Abstract

A central problem in machine learning is identifying a representative set of features from which to construct a classification model for a particular task. This thesis addresses the problem of feature selection for machine learning through a correlation-based approach. The central hypothesis is that good feature sets contain features that are highly correlated with the class, yet uncorrelated with each other. A feature evaluation formula, based on ideas from test theory, provides an operational definition of this hypothesis. CFS (Correlation-based Feature Selection) is an algorithm that couples this evaluation formula with an appropriate correlation measure and a heuristic search strategy.

CFS was evaluated by experiments on artificial and natural datasets. Three machine learning algorithms were used: C4.5 (a decision tree learner), IB1 (an instance-based learner), and naive Bayes. Experiments on artificial datasets showed that CFS quickly identifies and screens irrelevant, redundant, and noisy features, and identifies relevant features as long as their relevance does not strongly depend on other features. On natural domains, CFS typically eliminated well over half the features. In most cases, classification accuracy using the reduced feature set equaled or bettered accuracy using the complete feature set. Feature selection degraded machine learning performance in cases where some features were eliminated which were highly predictive of very small areas of the instance space.

Further experiments compared CFS with a wrapper, a well-known approach to feature selection that employs the target learning algorithm to evaluate feature sets. In many cases CFS gave comparable results to the wrapper and, in general, outperformed the wrapper on small datasets. CFS executes many times faster than the wrapper, which allows it to scale to larger datasets.

Two methods of extending CFS to handle feature interaction are presented and experimentally evaluated. The first considers pairs of features and the second incorporates feature weights calculated by the RELIEF algorithm. Experiments on artificial domains showed that both methods were able to identify interacting features. On natural domains, the pairwise method gave more reliable results than using weights provided by RELIEF.
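The evaluation formula mentioned in the abstract is developed in Chapter 4; the sketch below is included here only as an informal illustration, and its notation is assumed rather than quoted from the thesis. For a subset S of k features, the heuristic merit balances how predictive the features are against how redundant they are:

\[
\mathrm{Merit}_S = \frac{k \, r_{cf}}{\sqrt{k + k(k-1) \, r_{ff}}}
\]

where r_cf is the average feature-class correlation over the features in S and r_ff is the average feature-feature inter-correlation. The numerator grows when features are individually predictive of the class; the denominator grows when features are correlated with each other, so redundant subsets are penalised.

A minimal Python sketch of this idea, assuming symmetrical uncertainty as the correlation measure and a simple greedy forward search (the thesis pairs the formula with several correlation measures and search strategies; the function names and toy data below are illustrative, not code from the thesis):

from collections import Counter
import math

def entropy(values):
    # Shannon entropy (in bits) of a sequence of nominal values
    counts = Counter(values)
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def symmetrical_uncertainty(x, y):
    # 2 * information gain / (H(X) + H(Y)) for two nominal variables
    hx, hy, hxy = entropy(x), entropy(y), entropy(list(zip(x, y)))
    return 0.0 if hx + hy == 0 else 2.0 * (hx + hy - hxy) / (hx + hy)

def merit(subset, features, cls):
    # k * mean(feature-class corr) / sqrt(k + k(k-1) * mean(feature-feature corr))
    k = len(subset)
    r_cf = sum(symmetrical_uncertainty(features[f], cls) for f in subset) / k
    pairs = [(f, g) for i, f in enumerate(subset) for g in subset[i + 1:]]
    r_ff = (sum(symmetrical_uncertainty(features[f], features[g])
                for f, g in pairs) / len(pairs)) if pairs else 0.0
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

def select_features(features, cls):
    # greedy forward search: keep adding the feature that most improves the merit
    selected, remaining, best = [], set(features), 0.0
    while remaining:
        score, f = max((merit(selected + [f], features, cls), f) for f in remaining)
        if score <= best:
            break
        selected.append(f)
        remaining.remove(f)
        best = score
    return selected

# toy data: f1 predicts the class, f2 duplicates f1 (redundant), f3 is noise
features = {
    "f1": [0, 0, 1, 1, 0, 1, 0, 1],
    "f2": [0, 0, 1, 1, 0, 1, 0, 1],
    "f3": [1, 0, 1, 0, 0, 1, 1, 0],
}
cls = [0, 0, 1, 1, 0, 1, 0, 1]
print(select_features(features, cls))  # expected: a single copy of the predictive feature

In this toy example f2 merely duplicates f1, so adding it raises r_ff without raising r_cf; the merit drops and the redundant copy is left out, mirroring the screening of redundant features described in the abstract.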
Acknowledgements

First and foremost I would like to acknowledge the tireless and prompt help of my supervisor, Lloyd Smith. Lloyd has always allowed me complete freedom to define and explore my own directions in research. While this proved difficult and somewhat bewildering to begin with, I have come to appreciate the wisdom of his way: it encouraged me to think for myself, something that is unfortunately all too easy to avoid as an undergraduate.

Lloyd and the Department of Computer Science have provided me with much appreciated financial support during my degree. They have kindly provided teaching assistantship positions and travel funds to attend conferences.

I thank Geoff Holmes, Ian Witten and Bill Teahan for providing valuable feedback and reading parts of this thesis. Stuart Inglis (super-combo!), Len Trigg, and Eibe Frank deserve thanks for their technical assistance and helpful comments. Len convinced me (rather emphatically) not to use MS Word for writing a thesis. Thanks go to Richard Littin and David McWha for kindly providing the University of Waikato thesis style and assistance with LaTeX.

Special thanks must also go to my family and my partner Bernadette. They have provided unconditional support and encouragement through both the highs and lows of my time in graduate school.

Contents

Abstract
Acknowledgements
List of Figures
List of Tables

1 Introduction
  1.1 Motivation
  1.2 Thesis Statement
  1.3 Thesis Overview

2 Supervised Machine Learning: Concepts and Definitions
  2.1 The Classification Task
  2.2 Data Representation
  2.3 Learning Algorithms
    2.3.1 Naive Bayes
    2.3.2 C4.5 Decision Tree Generator
    2.3.3 IB1 Instance Based Learner
  2.4 Performance Evaluation
  2.5 Attribute Discretization
    2.5.1 Methods of Discretization

3 Feature Selection for Machine Learning
  3.1 Feature Selection in Statistics and Pattern Recognition
  3.2 Characteristics of Feature Selection Algorithms
  3.3 Heuristic Search
  3.4 Feature Filters
    3.4.1 Consistency Driven Filters
    3.4.2 Feature Selection Through Discretization
    3.4.3 Using One Learning Algorithm as a Filter for Another
    3.4.4 An Information Theoretic Feature Filter
    3.4.5 An Instance Based Approach to Feature Selection
  3.5 Feature Wrappers
    3.5.1 Wrappers for Decision Tree Learners
    3.5.2 Wrappers for Instance Based Learning
    3.5.3 Wrappers for Bayes Classifiers
    3.5.4 Methods of Improving the Wrapper
  3.6 Feature Weighting Algorithms
  3.7 Chapter Summary

4 Correlation-based Feature Selection
  4.1 Rationale
  4.2 Correlating Nominal Features
    4.2.1 Symmetrical Uncertainty
    4.2.2 Relief
    4.2.3 MDL
  4.3 Bias in Correlation Measures between Nominal Features
    4.3.1 Experimental Measurement of Bias
    4.3.2 Varying the Level of Attributes
    4.3.3 Varying the Sample Size
    4.3.4 Discussion
  4.4 A Correlation-based Feature Selector
  4.5 Chapter Summary

5 Datasets Used in Experiments
  5.1 Domains
  5.2 Experimental Methodology

6 Evaluating CFS with 3 ML Algorithms
  6.1 Artificial Domains
    6.1.1 Irrelevant Attributes
    6.1.2 Redundant Attributes
    6.1.3 Monk's Problems
    6.1.4 Discussion
  6.2 Natural Domains
  6.3 Chapter Summary

7 Comparing CFS to the Wrapper
  7.1 Wrapper Feature Selection
  7.2 Comparison
  7.3 Chapter Summary

8 Extending CFS: Higher Order Dependencies
  8.1 Related Work
  8.2 Joining Features
  8.3 Incorporating RELIEF into CFS
  8.4 Evaluation
  8.5 Discussion

9 Conclusions
  9.1 Summary
  9.2 Conclusions
  9.3 Future Work

Appendices
  A Graphs for Chapter 4
  B Curves for Concept A3 with Added Redundant Attributes
  C Results for CFS-UC, CFS-MDL, and CFS-Relief on 12 Natural Domains
  D 5×2cv Paired t Test Results
  E CFS Merit Versus Accuracy
  F CFS Applied to 37 UCI Domains

Bibliography

List of Figures

2.1 A decision tree for the “Golf” dataset. Branches correspond to the values of attributes; leaves indicate classifications.
3.1 Filter and wrapper feature selectors.
3.2 Feature subset space for the “golf” dataset.
4.1 The effects on the correlation between an outside variable and a composite variable of the number of components, the inter-correlations among the components, and the correlations between the components and the outside variable.
4.2 The effects of varying the attribute and class level on symmetrical uncertainty (a & b), symmetrical relief (c & d), and normalized symmetrical MDL (e & f) when attributes are informative (graphs on the left) and non-informative (graphs on the right). Curves are shown for varying numbers of classes.
4.3 The effect of varying the training set size on symmetrical uncertainty (a & b), symmetrical relief (c & d), and normalized symmetrical MDL (e & f) when attributes are informative and non-informative. The number of classes is fixed; curves are shown for attributes with varying numbers of values.
4.4 The components of CFS. Training and testing data is reduced to contain only the features selected by CFS. The dimensionally reduced data can then be passed to a machine learning algorithm for induction and prediction.
5.1 Effect of CFS feature selection on accuracy of naive Bayes classification. Dots show results that are statistically significant.
5.2 The learning curve for IB1 on the dataset A with added irrelevant attributes.
6.1 Number of irrelevant attributes selected on concept A1 (with added irrelevant features) by CFS-UC, CFS-MDL, and CFS-Relief as a function of training set size.
6.2 Number of relevant attributes selected on concept A1 (with added irrelevant features) by CFS-UC, CFS-MDL, and CFS-Relief as a function of training set size.
6.3 Learning curves for IB1, CFS-UC-IB1, CFS-MDL-IB1, and CFS-Relief-IB1 on concept A1 (with added irrelevant features).
6.4 Number of irrelevant attributes selected on concept A2 (with added irrelevant features) by CFS-UC, CFS-MDL, and CFS-Relief as a function of training set size. Note: CFS-UC and CFS-Relief produce the same result.
6.5 Number of relevant attributes selected on concept A (with added irrelevant features) by CFS-UC, CFS-MDL, and CFS-Relief as a function of training set size.
6.6 Learning curves for IB1, CFS-UC-IB1, CFS-MDL-IB1, and CFS-Relief-IB1 on concept A (with added irrelevant features).
6.7 Number of irrelevant attributes selected on concept A (with added irrelevant features) by CFS-UC, CFS-MDL, and CFS-Relief as a function of training set size.
6.8 Number of relevant attributes selected on concept A (with added irrelevant features) by CFS-UC, CFS-MDL, and CFS-Relief as a function of training set size.
6.9 Number of irrelevant multi-valued attributes selected on concept A (with added irrelevant features) by CFS-UC, CFS-MDL, and CFS-Relief as a function of training set size.