Pruning Decision Trees and Lists

Total Pages: 16

File Type: PDF, Size: 1020 KB

Pruning Decision Trees and Lists

Department of Computer Science, Hamilton, New Zealand

A thesis submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy at the University of Waikato by Eibe Frank, January 2000. © 2000 Eibe Frank. All Rights Reserved.

Abstract

Machine learning algorithms are techniques that automatically build models describing the structure at the heart of a set of data. Ideally, such models can be used to predict properties of future data points, and people can use them to analyze the domain from which the data originates.

Decision trees and lists are potentially powerful predictors and embody an explicit representation of the structure in a dataset. Their accuracy and comprehensibility depend on how concisely the learning algorithm can summarize this structure. The final model should not incorporate spurious effects—patterns that are not genuine features of the underlying domain. Given an efficient mechanism for determining when a particular effect is due to chance alone, non-predictive parts of a model can be eliminated or "pruned." Pruning mechanisms require a sensitive instrument that uses the data to detect whether there is a genuine relationship between the components of a model and the domain. Statistical significance tests are theoretically well-founded tools for doing exactly that.

This thesis presents pruning algorithms for decision trees and lists that are based on significance tests. We explain why pruning is often necessary to obtain small and accurate models and show that the performance of standard pruning algorithms can be improved by taking the statistical significance of observations into account. We compare the effect of parametric and non-parametric tests, analyze why current pruning algorithms for decision lists often prune too aggressively, and review related work—in particular existing approaches that use significance tests in the context of pruning. The main outcome of this investigation is a set of simple pruning algorithms that should prove useful in practical data mining applications.

Acknowledgments

This thesis was written in a very supportive and inspiring environment, and I would like to thank all the people in the Department of Computer Science at the University of Waikato, especially the members of the Machine Learning Group, for making it so much fun to do my research here in New Zealand. Most of all I would like to thank my supervisor Ian Witten for his generous support, for the freedom that allowed me to pursue several interesting little projects during the course of my PhD, for his insightful comments on all aspects of my work, and for his persistent attempts at teaching me how to write English prose. Many thanks also to current and former members of the WEKA Machine Learning Group, in particular Geoff Holmes, Mark Hall, Len Trigg, and Stuart Inglis, and to my fellow PhD students Gordon Paynter and Yong Wang, for being such a great team to work with.

Contents

Abstract
Acknowledgments
1 Introduction
  1.1 Basic Concepts
  1.2 Decision Trees and Lists
  1.3 Thesis Structure
  1.4 Experimental Methodology
  1.5 Historical Remarks
2 Why prune?
  2.1 Sampling variance
  2.2 Overfitting
  2.3 Pruning
  2.4 Related Work
  2.5 Conclusions
3 Post-pruning
  3.1 Decision tree pruning and statistical significance
  3.2 Significance tests on contingency tables
    3.2.1 Testing significance
    3.2.2 Parametric tests
    3.2.3 Non-parametric tests
    3.2.4 Approximation
    3.2.5 Test statistics
    3.2.6 Sparseness
  3.3 Experiments on artificial data
  3.4 Experiments on practical datasets
  3.5 Minimizing tree size
  3.6 Related work
    3.6.1 Cost-complexity pruning
    3.6.2 Pessimistic error pruning
    3.6.3 Critical value pruning
    3.6.4 Minimum-error pruning
    3.6.5 Error-based pruning
    3.6.6 Underpruning
    3.6.7 Other pruning methods
  3.7 Conclusions
4 Pre-pruning
  4.1 Pre-pruning with significance tests
    4.1.1 Significance tests on attributes
    4.1.2 Multiple tests
  4.2 Experiments
    4.2.1 Parametric vs. non-parametric tests
    4.2.2 Adjusting for multiple tests
    4.2.3 Pre- vs. post-pruning
  4.3 Related work
    4.3.1 Pre-pruning in statistics and machine learning
    4.3.2 Pre-pruning with permutation tests
    4.3.3 Statistical tests for attribute selection
  4.4 Conclusions
5 Pruning decision lists
  5.1 Pruning algorithms
    5.1.1 Reduced-error pruning for rules
    5.1.2 Incremental reduced-error pruning
    5.1.3 Incremental tree-based pruning
  5.2 Improved rule sets with significance tests
    5.2.1 Using statistical significance for rule selection
    5.2.2 Examples for underpruning
    5.2.3 Integrating significance testing into pruning
  5.3 Rule learning with partial trees
    5.3.1 Altering the basic pruning operation
    5.3.2 Building a partial tree
    5.3.3 Remarks
  5.4 Experiments
    5.4.1 Tree-based and standard incremental reduced-error pruning
    5.4.2 Using significance testing
    5.4.3 Building partial trees
  5.5 Related work
    5.5.1 RIPPER
    5.5.2 C4.5
    5.5.3 Inducing decision lists
    5.5.4 Significance testing in rule learners
  5.6 Conclusions
6 Conclusions
  6.1 Results
  6.2 Contributions
  6.3 Future work
Appendix A Results for post-pruning
  A.1 Parametric vs. non-parametric tests
  A.2 Reduced-overlap pruning
Appendix B Results for pre-pruning
  B.1 Parametric vs. non-parametric tests
  B.2 Adjusting for multiple tests
  B.3 Pre- vs. post-pruning
Appendix C Results for pruning decision lists

List of Figures

1.1 A decision tree for classifying Iris flowers
1.2 A decision list for classifying Iris flowers
2.1 Expected error of disjuncts for different sizes and different values of p
2.2 Average error of random classifier and minimum error classifier
2.3 Average error of different size disjuncts
2.4 Average number of different size disjuncts
3.1 A decision tree with two classes A and B
3.2 An example pruning set
3.3 Reduced-error pruning example
3.4 Tree for no-information dataset before pruning
3.5 Tree for no-information dataset after pruning
3.6 Size of pruned tree relative to training set size
3.7 Structure of a contingency table
3.8 Example pruning set with predicted class values
3.9 Example confusion matrix
3.10 Contingency tables for permutations in Table 3.2
3.11 Sequential probability ratio test
3.12 Three tables and their p-values
3.13 Performance of tests for no-information dataset
3.14 Performance of tests for parity dataset
3.15 Significance test A uniformly dominates test B
3.16 Comparison of significance tests
4.1 Generic pre-pruning algorithm
4.2 Example contingency tables for attribute selection
4.3 Permutations of a small dataset
4.4 Contingency tables for the permutations
4.5 Comparison of significance tests
4.6 Using the Bonferroni correction
4.7 Pre- vs. post-pruning on the parity dataset
4.8 Pre- vs. post-pruning
4.9 A split that does not reduce the classification error
5.1 An example of subtree replication
5.2 A simple decision list
5.3 Separate-and-conquer algorithm
5.4 Incremental reduced-error pruning
5.5 A hypothetical target concept for a noisy domain
5.6 Tree-based incremental reduced-error pruning
5.7 Decision tree pruning in TB-IREP
5.8 Example illustrating the two pruning procedures
5.9 Contingency table representing a rule
5.10 Contingency table representing two example rules
5.11 Two examples where TB-IREP does not prune sufficiently
5.12 Contingency tables for the examples from Figure 5.11
5.13 Decision tree pruning with significance testing in TB-IREP
5.14 Method that expands a given set of instances into a partial tree
5.15 Example of how a partial tree is built
5.16 Runtime and accuracy of TB-IREPsig and PARTsig
A.1 Comparison of significance tests
B.1 Comparison of significance tests
B.2 Using the Bonferroni correction
B.3 Pre- vs. post-pruning

List of Tables

1.1 Datasets used for the experiments
3.1 Predicted and true class labels for a small dataset
3.2 Permutations of the dataset from Table 3.1 and corresponding χ²-values
3.3 Accuracies for REP compared to SIG and SIG-SE
3.4 Results of paired t-tests
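The thesis's core instrument is a significance test applied to a contingency table that cross-tabulates a subtree's predictions against the true classes on a pruning set. As a rough illustration of that idea (not Frank's exact procedure), the sketch below compares a parametric chi-squared test with a non-parametric permutation test and prunes when no significant association is found; the toy arrays, helper names, and the 0.05 threshold are illustrative assumptions.

```python
import numpy as np
from scipy.stats import chi2_contingency

def contingency(preds, labels):
    """Cross-tabulate a subtree's predictions against the true classes."""
    p_vals, l_vals = np.unique(preds), np.unique(labels)
    return np.array([[np.sum((preds == p) & (labels == l)) for l in l_vals]
                     for p in p_vals])

def chi2_statistic(table):
    stat, _, _, _ = chi2_contingency(table, correction=False)
    return stat

def permutation_p_value(preds, labels, n_perm=1000, seed=0):
    """Non-parametric p-value: the fraction of random relabelings whose
    table is at least as extreme (by the chi-squared statistic)."""
    rng = np.random.default_rng(seed)
    observed = chi2_statistic(contingency(preds, labels))
    hits = sum(chi2_statistic(contingency(preds, rng.permutation(labels)))
               >= observed for _ in range(n_perm))
    return hits / n_perm

# Toy pruning-set data for one subtree.
preds = np.array(["A", "A", "B", "B", "A", "B", "A", "B"])
truth = np.array(["A", "B", "B", "B", "A", "A", "A", "B"])

_, parametric_p, _, _ = chi2_contingency(contingency(preds, truth), correction=False)
nonparametric_p = permutation_p_value(preds, truth)
print(f"chi-squared p = {parametric_p:.3f}, permutation p = {nonparametric_p:.3f}")
if nonparametric_p > 0.05:          # no significant association detected
    print("prune: replace this subtree by a leaf")
```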
Recommended publications
  • Decision Tree Classification and Forecasting of Pricing Time Series
Decision Tree Classification and Forecasting of Pricing Time Series Data. Emil Lundkvist. Master's Degree Project, Stockholm, Sweden, July 2014. TRITA-XR-EE-RT 2014:017.

Abstract: Many companies today, in different fields of operation and sizes, have access to a vast amount of data which was not available only a couple of years ago. This situation gives rise to questions regarding how to organize and use the data in the best way possible. In this thesis a large database of pricing data for products within various market segments is analysed. The pricing data comes from both external and internal sources and is therefore confidential. Because of this confidentiality, the labels from the database are substituted with generic ones in this thesis and the company is not referred to by name, but the analysis is carried out on the real data set. The data is initially unstructured and difficult to get an overview of. Therefore, it is first classified: manual training data is fed into an algorithm which builds a decision tree, and the decision tree is used to divide the rest of the products in the database into classes. Then, for each class, a multivariate time series model is built and the future price of each product within the class can be predicted. In order to interact with the classification and price prediction, a front end is also developed. The results show that the classification algorithm is fast enough to operate in real time and performs well. The time series analysis shows that it is possible to use the information within each class to make predictions, and a simple vector autoregressive model shows good predictive results.
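A minimal sketch of the two-stage pipeline the abstract describes: a decision tree assigns products to classes, then a vector autoregressive (VAR) model fitted per class forecasts prices. The feature names, labels, and synthetic price series are stand-ins for the thesis's confidential data.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from statsmodels.tsa.api import VAR

# Stage 1: classify products from manually labeled training data.
features = pd.DataFrame({"segment_code": [0, 0, 1, 1],   # hypothetical features
                         "unit_cost":   [10, 12, 50, 55]})
labels = ["bulk", "bulk", "premium", "premium"]           # manual labels
tree = DecisionTreeClassifier(random_state=0).fit(features, labels)
classes = tree.predict(features)                          # class per product

# Stage 2: for one class, fit a multivariate time series model on the
# price histories of its products and forecast the next prices.
rng = np.random.default_rng(0)
prices = pd.DataFrame(rng.normal(0, 1, size=(60, 2)).cumsum(axis=0) + 100,
                      columns=["product_1", "product_2"])  # synthetic series
var = VAR(prices).fit(2)                                   # lag order 2
forecast = var.forecast(prices.values[-2:], steps=5)       # next 5 price points
print(forecast)
```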
  • Network Classification and Categorization Paper
Network Classification and Categorization Paper. James P. Canning, Emma E. Ingram, Adriana M. Ortiz and Karl R. B. Schmitt. June 2017.

Abstract: Networks are often labeled according to the underlying phenomena that they represent, such as retweets, protein interactions, or web page links. Our research seeks to determine whether we can use machine learning techniques to gain a better understanding of the categories of networks on the Network Repository (www.networkrepository.com) and then classify the unlabeled networks into categories that make sense. It is generally believed that networks from different categories have inherently unique network characteristics. Our research provides conclusive evidence to validate this belief by presenting the results of global network clustering and classification into common categories using machine learning algorithms. The machine learning techniques of Decision Trees, Random Forests, Linear Support Vector Classification and Gaussian Naïve Bayes were applied to a 14-feature 'identifying vector' for each graph. During cross-validation, the best technique, Gaussian Naïve Bayes, achieved an accuracy of 92.8%. After training, the machine learning algorithm was applied to a collection of initially unlabeled graphs from the Network Repository. Results were then manually checked by determining (when possible) the original sources of these graphs. We conclude by examining the accuracy of our results and discussing how future researchers can make use of this process.

1 Introduction: Network science is a field which seeks to understand the world by representing interactions and relationships between objects as graphs. By organizing data into networks, we can better understand how objects of interest are connected.
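The paper's pipeline can be sketched as follows: compute a feature vector per graph, then cross-validate a Gaussian Naïve Bayes classifier on the labeled graphs. The random graph families, labels, and the four features below stand in for the Network Repository data and the paper's 14-feature vector.

```python
import networkx as nx
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

def identifying_vector(g):
    """A small 'identifying vector' of global features (the paper uses 14)."""
    degrees = [d for _, d in g.degree()]
    return [g.number_of_nodes(), g.number_of_edges(),
            float(np.mean(degrees)), nx.transitivity(g)]

# Hypothetical labeled corpus: two generative families as two categories.
graphs = ([nx.erdos_renyi_graph(50, 0.1, seed=i) for i in range(20)] +
          [nx.barabasi_albert_graph(50, 3, seed=i) for i in range(20)])
y = [0] * 20 + [1] * 20

X = np.array([identifying_vector(g) for g in graphs])
scores = cross_val_score(GaussianNB(), X, y, cv=5)   # cross-validated accuracy
print(f"mean accuracy: {scores.mean():.3f}")
```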
  • A Convenient and Low-Cost Model of Depression Screening and Early Warning Based on Voice Data Using for Public Mental Health
International Journal of Environmental Research and Public Health, Article: A Convenient and Low-Cost Model of Depression Screening and Early Warning Based on Voice Data for Public Mental Health. Xin Chen (1,2,3) and Zhigeng Pan (1,2,3,*). 1: School of Medicine, Hangzhou Normal University, Hangzhou 311121, China ([email protected]); 2: Engineering Research Center of Mobile Health Management System, Ministry of Education, Hangzhou Normal University; 3: Institute of VR and Intelligent System, Hangzhou Normal University. * Correspondence: [email protected].

Abstract: Depression is a common mental health disease which does great harm to public health. At present, the diagnosis of depression mainly depends on interviews between doctors and patients, which is subjective, slow and expensive. Voice data are easy to obtain and have the advantage of low cost, and it has been shown that they can be used in the diagnosis of depression. The voice data used for modeling in this study came from an authoritative public data set that had passed ethical review. The voice features were extracted with Python and stored as CSV files. Through data processing, a database containing 1479 voice feature samples was generated for modeling. Then, a decision tree screening model for depression was established using 10-fold cross-validation and algorithm selection. The experiment achieved 83.4% prediction accuracy on the voice data set. According to the prediction results of the model, patients can be given early warning and timely intervention, so as to realize personal health management of depression.
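Once the features are in CSV form, the modeling step reduces to a few lines. The sketch below assumes a hypothetical voice_features.csv with one row per sample and a binary label column named "depressed"; the file name, column name, and tree depth are illustrative, not the paper's.

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical CSV of extracted voice features, one row per sample.
df = pd.read_csv("voice_features.csv")
X = df.drop(columns=["depressed"])        # acoustic features
y = df["depressed"]                       # binary screening label

clf = DecisionTreeClassifier(max_depth=5, random_state=0)
scores = cross_val_score(clf, X, y, cv=10)   # 10-fold cross-validation
print(f"mean accuracy: {scores.mean():.3f}")
```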
  • Combining Machine Learning and Social Network Analysis to Reveal the Organizational Structures
Applied Sciences, Article: Combining Machine Learning and Social Network Analysis to Reveal the Organizational Structures. Mateusz Nurek and Radosław Michalski * (Department of Computational Intelligence, Faculty of Computer Science and Management, Wrocław University of Science and Technology, 50-370 Wrocław, Poland; [email protected]). * Correspondence: [email protected]. Received: 7 January 2020; Accepted: 21 February 2020; Published: 2 March 2020.

Abstract: Formation of a hierarchy within an organization is a natural way of assigning duties, delegating responsibilities and optimizing the flow of information. Only the smallest companies can do without a hierarchy, that is, remain flat; if they grow, the introduction of a hierarchy is inevitable. Most often, its existence results in tasks and duties of a different nature for members located at various organizational levels or in distant parts of the organization. On the other hand, employees often send dozens of emails each day, and by doing so, and also by being engaged in other activities, they naturally form an informal social network where nodes are individuals and edges are the actions linking them. At first, such a social network seems distinct from the organizational one. However, analysis of this network may lead to reproducing the organizational hierarchy of companies, because people holding a similar position in the hierarchy are likely to also share a similar way of behaving and communicating attributed to their role. The key concept of this work is to evaluate how well social network measures, combined with other features obtained through feature engineering, align with the classification of members of the organizational social network.
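A minimal version of the idea, under toy assumptions rather than the paper's data: compute standard social network measures per member of a communication graph and train a classifier against known group labels. Zachary's karate club and its two-faction split stand in here for an email network and its organizational units.

```python
import networkx as nx
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy stand-in for an email network; the built-in two-group split
# plays the role of the organizational labels.
g = nx.karate_club_graph()
labels = pd.Series({n: g.nodes[n]["club"] for n in g.nodes})

# Social network measures as per-member features.
features = pd.DataFrame({
    "degree": dict(g.degree()),
    "betweenness": nx.betweenness_centrality(g),
    "clustering": nx.clustering(g),
})

scores = cross_val_score(RandomForestClassifier(random_state=0),
                         features, labels, cv=5)
print(f"mean accuracy: {scores.mean():.3f}")
```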
  • Surrogate Decision Tree Visualization Interpreting and Visualizing Black-Box Classification Models with Surrogate Decision Tree
Surrogate Decision Tree Visualization: Interpreting and Visualizing Black-Box Classification Models with Surrogate Decision Trees. Federica Di Castro (Università di Roma La Sapienza, Rome, Italy, [email protected]) and Enrico Bertini (New York University Tandon School of Engineering, Brooklyn, New York, [email protected]).

Figure 1: Understanding the behavior of a black-box machine learning model using the explanatory visual interface of the proposed technique. Interacting with the user interaction panel (A), a user can choose the combination of model and dataset that they wish to analyze. The built model is a surrogate decision tree whose structure can be analyzed in the Tree Panel (B); the details of the leaves are featured in the Rule Panel (C). The user can interact with the tree, collapsing any node they see fit and automatically updating the performance Overview (D).

Abstract: With the growing interest towards the application of machine learning techniques to many application domains, the need for transparent and interpretable ML is getting stronger. Visualization methods can help model developers understand and refine ML models by making the logic of a given model visible and interactive. In this paper we describe a visual analytics tool we developed to support developers and domain experts (with little to no expertise in ML) in understanding the logic of a ML model without having access to the internal structure of the model (i.e., a model-agnostic approach), together with an analysis of the level of fidelity the surrogate decision tree can reach with respect to the original model.

CCS Concepts: Human-centered computing → Dendrograms; User interface toolkits. Information systems → Data analytics.
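The surrogate idea itself is independent of the visual interface and fits in a few lines: train the surrogate tree on the black box's predictions rather than on the true labels, then measure fidelity as agreement between the two models. The synthetic data, the random forest standing in as the black box, and the depth limit are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

black_box = RandomForestClassifier(random_state=0).fit(X, y)
y_bb = black_box.predict(X)                       # black-box outputs, not truth

# The surrogate mimics the black box: train it on the model's predictions.
surrogate = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y_bb)

# Fidelity = how often the surrogate agrees with the black box.
fidelity = accuracy_score(y_bb, surrogate.predict(X))
print(f"fidelity: {fidelity:.3f}")
```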
  • Extracting Decision Trees from Diagnostic Bayesian Networks to Guide Test Selection
Extracting Decision Trees from Diagnostic Bayesian Networks to Guide Test Selection. Scott Wahl and John W. Sheppard (Department of Computer Science, Montana State University, Bozeman, MT, 59717, USA; [email protected], [email protected]).

Abstract: In this paper, we present a comparison of five different approaches to extracting decision trees from diagnostic Bayesian nets, including an approach based on the dependency structure of the network itself. With this approach, attributes used in branching the decision tree are selected by a weighted information gain metric computed based upon an associated D-matrix. Using these trees, tests are recommended for setting evidence within the diagnostic Bayesian nets for use in a PHM application. We hypothesized that this approach would yield effective decision trees and test selection and greatly reduce the amount of evidence required for obtaining accurate classification with the associated Bayesian networks.

This procedure is a natural extension of the general process used in troubleshooting systems: given no prior knowledge, tests are performed sequentially, continuously narrowing down the ambiguity group of likely faults. The resulting decision trees are called "fault trees" in the system maintenance literature. In recent years, tools have emerged that apply an alternative approach to fault diagnosis using diagnostic Bayesian networks. One early example of such a network can be found in the creation of the QMR knowledge base (Shwe et al., 1991), used in medical diagnosis. Bayesian networks provide a means for incorporating uncertainty into the diagnostic process; however, Bayesian networks by themselves provide no guidance.
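The paper's branching criterion is a weighted information gain derived from a D-matrix; the sketch below shows only the generic building block, information gain with per-instance weights, as a stand-in rather than the authors' exact metric. The toy test outcomes, fault labels, and uniform weights are assumptions.

```python
import numpy as np

def entropy(labels, weights):
    """Weighted Shannon entropy of a vector of class labels."""
    total = weights.sum()
    probs = [weights[labels == c].sum() / total for c in np.unique(labels)]
    return -sum(p * np.log2(p) for p in probs if p > 0)

def weighted_information_gain(attribute, labels, weights):
    """Information gain of splitting on a discrete attribute, with
    per-instance weights (a generic stand-in for the D-matrix metric)."""
    total = weights.sum()
    gain = entropy(labels, weights)
    for v in np.unique(attribute):
        m = attribute == v
        gain -= weights[m].sum() / total * entropy(labels[m], weights[m])
    return gain

# Toy diagnostic data: one test's outcomes and the true fault labels.
test_outcome = np.array([0, 0, 1, 1, 0, 1])
fault = np.array(["f1", "f1", "f2", "f2", "f2", "f1"])
w = np.ones(len(fault))            # uniform weights; fault priors would go here
print(weighted_information_gain(test_outcome, fault, w))
```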
  • Decision Trees
Chapter 9: Decision Trees. Lior Rokach and Oded Maimon (Department of Industrial Engineering, Tel-Aviv University; [email protected], [email protected]).

Abstract: Decision trees are considered to be one of the most popular approaches for representing classifiers. Researchers from various disciplines such as statistics, machine learning, pattern recognition, and data mining have dealt with the issue of growing a decision tree from available data. This paper presents an updated survey of current methods for constructing decision tree classifiers in a top-down manner. The chapter suggests a unified algorithmic framework for presenting these algorithms and describes various splitting criteria and pruning methodologies.

Keywords: Decision tree, Information Gain, Gini Index, Gain Ratio, Pruning, Minimum Description Length, C4.5, CART, Oblivious Decision Trees.

1. Decision Trees: A decision tree is a classifier expressed as a recursive partition of the instance space. The decision tree consists of nodes that form a rooted tree, meaning it is a directed tree with a node called "root" that has no incoming edges. All other nodes have exactly one incoming edge. A node with outgoing edges is called an internal or test node. All other nodes are called leaves (also known as terminal or decision nodes). In a decision tree, each internal node splits the instance space into two or more sub-spaces according to a certain discrete function of the input attribute values. In the simplest and most frequent case, each test considers a single attribute, such that the instance space is partitioned according to the attribute's value.
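The structure just described maps directly onto a small recursive data type. The sketch below is a minimal illustration (not from the chapter) of a tree whose internal nodes test a single discrete attribute and whose leaves hold class labels.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    """One node of a decision tree: an internal (test) node splits on a
    single attribute's value; a leaf carries a class label."""
    attribute: Optional[str] = None      # None for leaves
    children: Optional[dict] = None      # attribute value -> child Node
    label: Optional[str] = None          # set only for leaves

def classify(node, instance):
    """Walk from the root to a leaf, following the test outcomes."""
    while node.attribute is not None:
        node = node.children[instance[node.attribute]]
    return node.label

# Tiny example tree: the root tests "outlook".
tree = Node(attribute="outlook", children={
    "sunny": Node(label="no"),
    "overcast": Node(label="yes"),
})
print(classify(tree, {"outlook": "sunny"}))   # -> "no"
```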
  • Decision Tree Pruning Using Expert Knowledge A
Decision Tree Pruning Using Expert Knowledge. A dissertation presented to the Graduate Faculty of The University of Akron in partial fulfillment of the requirements for the degree Doctor of Philosophy. Jingfeng Cai, December 2006. Advisor: Dr. John Durkin. Committee members: Dr. Chien-Chung Chan, Dr. James Grover, Dr. Narender P. Reddy, Dr. Shiva Sastry, Dr. John Welch. Department Chair: Dr. Alex De Abreu Garcia. Dean of the College: Dr. George K. Haritos. Dean of the Graduate School: Dr. George R. Newkome.

Abstract: Decision tree technology has proven to be a valuable way of capturing human decision making within a computer. It has long been a popular artificial intelligence (AI) technique. During the 1980s, it was one of the primary ways of creating an AI system. During the early part of the 1990s, it somewhat fell out of favor, as did the entire AI field in general. However, during the late 1990s, with the emergence of data mining technology, the technique resurfaced as a powerful method for creating a decision-making program. How to prune the decision tree is one of the research directions of the decision tree technique, but the idea of cost-sensitive pruning has received much less attention than other pruning techniques, even though additional flexibility and increased performance can be obtained from it. This dissertation reports on a study of cost-sensitive methods for decision tree pruning. A decision tree pruning algorithm called KBP1.0, which includes four cost-sensitive methods, is developed.
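Cost-sensitive pruning turns the usual error comparison into a cost comparison. As a hedged illustration (not KBP1.0 itself), the sketch below computes the cheapest single-class leaf for a node under a misclassification cost matrix and prunes when that leaf is no more costly than the subtree it would replace; the counts, costs, and subtree cost are toy assumptions.

```python
def leaf_cost(counts, cost, predicted):
    """Total cost of labelling every instance in a node `predicted`,
    where cost[true][pred] is the misclassification cost."""
    return sum(n * cost[true][predicted] for true, n in counts.items())

# Toy node from a pruning set: class counts and an asymmetric cost matrix.
counts = {"ok": 40, "faulty": 10}
cost = {"ok":     {"ok": 0, "faulty": 1},   # a false alarm costs 1
        "faulty": {"ok": 5, "faulty": 0}}   # a missed fault costs 5

best_leaf = min(counts, key=lambda c: leaf_cost(counts, cost, c))
subtree_cost = 42.0                          # would be summed over subtree leaves
if leaf_cost(counts, cost, best_leaf) <= subtree_cost:
    print(f"prune: collapse the node to a '{best_leaf}' leaf")
```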
  • A Decision-Tree-Based Symbolic Rule Induction System for Text Categorization
A decision-tree-based symbolic rule induction system for text categorization. By D. E. Johnson, F. J. Oles, T. Zhang and T. Goetz.

Abstract: We present a decision-tree-based symbolic rule induction system for categorizing text documents automatically. Our method for rule induction involves the novel combination of (1) a fast decision tree induction algorithm especially suited to text data and (2) a new method for converting a decision tree to a rule set that is simplified, but still logically equivalent to, the original tree. We report experimental results on the use of this system on some practical problems.

The text categorization problem is to determine predefined categories for an incoming unlabeled message. Humans often provide valuable insights in many practical problems. In a symbolic rule system, text is represented as a vector in which the components are the numbers of occurrences of certain words in the text. The system induces rules from the training data, and the generated rules can then be used to categorize arbitrary data that are similar to the training data. Each rule ultimately produced by such a system states that a condition, which is usually a conjunction of simpler conditions, implies membership in a particular category. The condition forms the antecedent of the rule, and the conclusion posited as true when the condition is satisfied is the consequent of the rule. Usually, the antecedent of a rule is a combination of tests to be done on various components.
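The tree-to-rules conversion can be illustrated with scikit-learn (not the authors' system): fit a tree on word-count vectors, then read each root-to-leaf path off as a rule whose antecedent is a conjunction of word-count tests and whose consequent is a category. The toy documents and labels are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

# Toy corpus: word-count vectors, as in the paper's text representation.
docs = ["cheap pills buy now", "meeting agenda attached",
        "buy cheap watches", "project meeting notes"]
labels = ["spam", "ham", "spam", "ham"]

vec = CountVectorizer()
X = vec.fit_transform(docs)
clf = DecisionTreeClassifier(random_state=0).fit(X, labels)

def tree_to_rules(clf, feature_names):
    """One rule per root-to-leaf path: condition -> category."""
    t, rules = clf.tree_, []
    def walk(node, conds):
        if t.children_left[node] == -1:                  # leaf node
            category = clf.classes_[t.value[node][0].argmax()]
            body = " AND ".join(conds) if conds else "TRUE"
            rules.append(f"{body} -> {category}")
            return
        name = feature_names[t.feature[node]]
        walk(t.children_left[node], conds + [f"count({name}) <= {t.threshold[node]:.1f}"])
        walk(t.children_right[node], conds + [f"count({name}) > {t.threshold[node]:.1f}"])
    walk(0, [])
    return rules

for rule in tree_to_rules(clf, vec.get_feature_names_out()):
    print(rule)
```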
  • Classification with Decision Trees
Classification with decision trees. Lecture 2.

Lecture outline: decision trees; supervised learning; the tree induction algorithm; algorithm design issues; applications of decision trees.

A decision tree is a decision support tool. Input: a situation or an object described by a set of attributes. Output: a decision.

Decision tree as a mental model. Situation: a restaurant. Question: to leave or to stay? The set of important attributes: Alternative restaurants (yes, no); Am I hungry? (yes, no); Clients (none, some, full); Is it raining? (yes, no); Wait estimate (0-10, 10-30, 30-60, >60 minutes).

The slides build the corresponding tree step by step:

Clients?
  none → Leave
  some → Stay
  full → Wait estimate?
    >60 → Leave
    30-60 → Alternative? (no → Stay; yes → Leave)
    10-30 → Hungry? (no → Stay; yes → Alternative? (no → Stay; yes → Raining? (no → Leave; yes → Stay)))
    1-10 → Stay

We have mental models for such situations. Decision tree structure: the tree starts at the root; each internal node tests an attribute, each branch corresponds to an attribute value, and each leaf holds a decision.
  • Classification: Basic Concepts, Decision Trees, and Model Evaluation
4 Classification: Basic Concepts, Decision Trees, and Model Evaluation.

Classification, which is the task of assigning objects to one of several predefined categories, is a pervasive problem that encompasses many diverse applications. Examples include detecting spam email messages based upon the message header and content, categorizing cells as malignant or benign based upon the results of MRI scans, and classifying galaxies based upon their shapes (see Figure 4.1).

[Figure 4.1. Classification of galaxies: (a) a spiral galaxy; (b) an elliptical galaxy. The images are from the NASA website.]

[Figure 4.2. Classification as the task of mapping an input attribute set x into its class label y.]

This chapter introduces the basic concepts of classification, describes some of the key issues such as model overfitting, and presents methods for evaluating and comparing the performance of a classification technique. While it focuses mainly on a technique known as decision tree induction, most of the discussion in this chapter is also applicable to other classification techniques, many of which are covered in Chapter 5.

4.1 Preliminaries: The input data for a classification task is a collection of records. Each record, also known as an instance or example, is characterized by a tuple (x, y), where x is the attribute set and y is a special attribute designated as the class label (also known as the category or target attribute). Table 4.1 shows a sample data set used for classifying vertebrates into one of the following categories: mammal, bird, fish, reptile, or amphibian.
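In code, the (x, y) formulation is simply a fit from attribute sets to labels. The sketch below, with made-up vertebrate-style attributes, shows a classification model learning the mapping from x to y.

```python
from sklearn.tree import DecisionTreeClassifier

# Each record is a tuple (x, y): x an attribute set, y the class label.
# Made-up attributes: [gives_birth, aquatic, has_legs] (1 = yes, 0 = no).
X = [[1, 0, 1],    # gives birth, not aquatic, has legs
     [0, 1, 0],    # lays eggs, aquatic, no legs
     [1, 1, 0],    # gives birth, aquatic (e.g., a whale)
     [0, 0, 1]]    # lays eggs, not aquatic, has legs
y = ["mammal", "fish", "mammal", "bird"]

model = DecisionTreeClassifier(random_state=0).fit(X, y)   # learn x -> y
print(model.predict([[0, 1, 0]]))                          # -> ['fish']
```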
  • Graphical Representation and Exploratory Visualization for Decision Trees in the KDD Process
Graphical Representation and Exploratory Visualization for Decision Trees in the KDD Process. Wilson A. Castillo Rojas (Arturo Prat University, Engineering - Area Computing and Informatics, Av. Arturo Prat 2120, Iquique, Chile) and Claudio J. Meneses Villegas (North Catholic University, Systems and Computer Engineering, Av. Angamos 0610, Antofagasta, Chile). Procedia - Social and Behavioral Sciences 73 (2013) 136-144. The 2nd International Conference on Integrated Information.

Abstract: This article presents a proposal for the representation and exploratory visualization of decision trees in the KDD (Knowledge Discovery in Databases) process, specifically in the data mining stage, with the aim of improving the understandability of the model's internal operation. The exploratory visualization is based on the well-known Treemap technique, which can represent hierarchical structures such as decision trees, using grids to represent the nodes of the tree. The proposed visualization represents the number of instances, or weight, associated with a node by a graduated color scale, which highlights the rules of the decision tree in 2D and 3D graphical renderings of the visualization. Finally, a first attempt at a subjective evaluation of the proposed visualization, based on four criteria, is made. In this sense, this work seeks to introduce new visualization schemes that help users understand how data mining models work internally.