Bachelor Thesis Computer Science Thesis no: 06 2016

MLID: A multi-label extension of ID3

Henrik Starefors

Rasmus Persson

Dept. Computer Science & Engineering
Blekinge Institute of Technology
SE–371 79 Karlskrona, Sweden

This thesis is submitted to the Department of Computer Science & Engineering at Blekinge Institute of Technology in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science. The thesis is equivalent to 10 weeks of full-time studies.

Contact Information:

Author(s):
Henrik Starefors
E-mail: [email protected]

Rasmus Persson
E-mail: [email protected]

University advisor:
Prof. Håkan Grahn
Dept. Computer Science & Engineering

Dept. Computer Science & Engineering
Blekinge Institute of Technology
SE–371 79 Karlskrona, Sweden

Internet: www.bth.se/didd
Phone: +46 455 38 50 00
Fax: +46 455 38 50 57

Abstract

Context: Machine learning is a subfield within artificial intelligence that revolves around constructing systems that can learn from, and make predictions on, patterns in data. Instead of following strict and static instructions, the system operates by adapting and learning from input data in order to make predictions and decisions. This work focuses on a subcategory of machine learning called "multi-label classification", in which data introduced to the system is categorized by an analytical model, learned through what is called supervised learning, and each instance of the dataset can belong to multiple labels, or classes.

Objectives: This paper presents the task of implementing a multi-label classifier based on the ID3 algorithm, which we call MLID (Multi-label Iterative Dichotomiser). The solution is presented both in a sequentially executed version and in a parallelized one. We also present a comparison, based on accuracy and execution time, against algorithms of a similar nature, in order to evaluate the viability of using ID3 as a base to further expand and build upon with regard to multi-label classification.

Methods: In order to evaluate the performance of the MLID algorithm, we have measured execution time and accuracy, and summarized precision and recall into what is called the F-measure, the harmonic mean of the precision and sensitivity of the algorithm. These results are then compared to already defined and established algorithms on a range of datasets of varying sizes, in order to assess the viability of the MLID algorithm.

Conclusions: The results produced when comparing MLID against other multi-label algorithms such as binary relevance, classifier chains and random trees show that MLID does produce superior results in terms of accuracy and F-measure, but does so in an extensive amount of time compared to the other algorithms. Through these results, we can conclude that MLID is a viable option to use as a multi-label classifier. Although some constraints inherited from the original ID3 algorithm impede the full utility of the algorithm, we are confident that following the same path of development and improvement as ID3 experienced would allow MLID to develop into a suitable choice of algorithm for a diverse range of multi-label classification problems.

Keywords: Machine learning, Multi-label, Classification, ID3

Contents

Abstract

1 Introduction
  1.1 Motivation and scope of this thesis
  1.2 Problem Statement

2 Background and Related Work
  2.1 Machine Learning
  2.2 Basic concepts of multi-label learning
  2.3 Decision trees
  2.4 Entropy
  2.5 Attribute relation file format
  2.6 The ID3 algorithm
  2.7 C4.5
  2.8 CLARE
  2.9 Noise

3 Approach
  3.1 Ensemble of classifier chains
  3.2 The proposed algorithm
    3.2.1 General
    3.2.2 Finding the order
    3.2.3 Creation of each tree
    3.2.4 Classification
    3.2.5 Classification statistics
    3.2.6 Threading

4 Method
  4.1 Environment
  4.2 WEKA/MEKA
  4.3 Datasets
  4.4 Evaluation measurements
    4.4.1 Example based evaluation measures
    4.4.2 Evaluations for execution time

5 Results
  5.1 Accuracy and F-Measure
  5.2 Execution time

6 Analysis
  6.1 Accuracy and F-Measure
  6.2 Execution Time

7 Conclusions and Future Work
  7.1 Is it a viable multi-label classifier?
  7.2 How is performance affected by parallelisation?
  7.3 How is performance affected by large datasets in comparison to small datasets?
  7.4 Is this a viable approach for extending ID3?
  7.5 Future work

References

Chapter 1 Introduction

Our current era is one of technology, and particularly one of information. Each day, a vast amount of data is collected from an array of different sources, and the amount of data is expanding rapidly[1]. Alongside this development, the field of artificial intelligence, and especially machine learning, has seen significant growth, as artificial intelligence and machine learning are integrated into more and more systems[7].

Machine learning[1] is a field of study that focuses on creating computer systems that are able to learn and improve themselves. These systems are often used to perform tasks such as making predictions from unknown datasets and finding patterns within them. The patterns can then be used to deduce information related to the data. Learning is accomplished by creating a model based on example input, often using a training set with known inputs and corresponding outputs.

The efficiency of the algorithm is improved by using these initial models and further adapting them to unknown data, in order to make predictions and decisions based on historical relationships and trends found within the data. One of these methods is called decision trees, where a tree is built based on "questions" asked in order to categorize the data into one or more labels, based on the conditions each item fulfills. This labeling can be viewed as fulfilling the following statement: if condition1 and condition2 and condition3, then label x.

Artificial intelligence and machine learning are fields where improvements are made every day, and they are relevant for many different kinds of businesses. Examples of machine learning in use are random-forest trees [3] in the systems controlling Microsoft's Cortana and Kinect camera[10], or Google's self-driving car[2].


1.1 Motivation and scope of this thesis

Making improvements to algorithms is always relevant in any form of application. The general problem with learning-based algorithms is the performance impact they can have on an application, depending on the size of the dataset to be analyzed.

ID3[12] is an algorithm that achieves its classification by splitting data, based on the current object's attributes, into leaf nodes in a tree. When this tree has been created, an object passing through the decision tree will end up in one of the final leaves of the tree, and thereby become classified. ID3 has no explicit multi-label extension, even though some existing tree algorithms are based on ID3[13]. There are, however, implementations of a multi-label algorithm based on C4.5, which in turn is based on ID3[12].

The first challenge encountered is to modify the ID3 algorithm. As the original algorithm is only able to classify each instance into a single label, we have expanded its capabilities and created a multi-label version of the algorithm. This was done by defining a new hybrid node that acts as both a decision node and a class node simultaneously. This was in itself a challenge, together with finding an appropriate entropy calculation for splitting on feature values, since regular ID3 does not take multiple classes into the calculation.

1.2 Problem Statement

By using machine learning, it is possible to create an analytical model that can be used to make decisions based on historical relationships and trends in data. Since ID3 is adapted to binary classification[13] but can be extended to multi-label classification, it is interesting to see what benefits MLID can exhibit compared to other algorithms, and how well it performs in terms of execution time and accuracy.

Research questions

• Can the MLID algorithm be a viable option as a multi-label classifier in comparison to already established algorithms?
• How will parallelization affect accuracy and execution time of MLID in comparison to a sequential execution?
• How will accuracy and execution time be affected by large datasets in comparison to smaller datasets for MLID?
• How can the ID3 algorithm be extended to handle multi-label classification problems?

Chapter 2 Background and Related Work

2.1 Machine Learning

Machine learning is a subfield of the broader field of artificial intelligence. The term was coined by Arthur Samuel (1959) as "a field of study that gives computers the ability to learn without being explicitly programmed"[16]. The field is often divided into three broad categories[11]:

• Supervised learning: Example inputs are presented to the algorithm together with the desired outputs in order for the algorithm to create a general model that maps inputs to outputs.

• Unsupervised learning: The algorithm is left to its own to find rules and structures in its provided inputs.

• Reinforcement learning: The algorithm interacts with a dynamic environment and is given certain goals to reach. The algorithm has to analyze the actions taken in order to find the optimal set of actions for reaching the goal.

This study focuses on supervised learning, through improving and expanding the functionality of the decision tree algorithm known as ID3. During supervised learning, the dataset is commonly divided into a training set and a test set. The training data consists of training examples, where each example contains an input object, typically a vector of attributes, and a desired, predetermined output value. The chosen learning algorithm then analyzes this data and generates from it a model with which it can map new, unseen examples.

In order to test the accuracy of this model, the remaining part of the dataset, called the test set, is used. The ratio of training to test data is determined by two competing concerns: with less training data, the parameter estimates have greater variance, and with less testing data, the performance statistics will have greater variance. The goal is to divide the data such that neither variance is too high. To this end, the Pareto principle[8], or the "80/20 rule", can be applied as a baseline and modified as necessary based on the specific dataset or algorithm used.
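As a concrete illustration, a minimal sketch of such a split (assuming the instances are already shuffled; the type and function names here are ours, not from any particular library):

// Minimal sketch of an 80/20 train/test split following the Pareto
// baseline. Assumes the instances have already been shuffled.
#include <cstddef>
#include <vector>

struct Instance {
    std::vector<double> attributes;  // input vector
    std::vector<bool>   labels;      // desired outputs (multi-label)
};

void splitDataset(const std::vector<Instance>& data,
                  double trainRatio,  // e.g. 0.8 for the 80/20 rule
                  std::vector<Instance>& train,
                  std::vector<Instance>& test) {
    std::size_t cut = static_cast<std::size_t>(data.size() * trainRatio);
    train.assign(data.begin(), data.begin() + cut);
    test.assign(data.begin() + cut, data.end());
}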


Supervised learning includes two categories of algorithms:

Classification: Categorical response values, where the data can be separated into specific classes. Common classification algorithms include:

• Support vector machines (SVM)
• Neural networks
• Naive Bayes classifier
• Decision trees
• Discriminant analysis
• Nearest neighbors (kNN)

Regression: Continuous response values. Common regression algorithms include:

• Linear regression
• Nonlinear regression
• Generalized linear models
• Decision trees
• Neural networks

The main difference between the two approaches is that classification groups the output into a class, i.e., it predicts which class a data object belongs to. Regression, on the other hand, tries to predict an output value using training data, i.e., it predicts unknown or missing values.

Classification algorithms can be divided into two separate problems: binary or single-label classification, and multi-label classification. During this work we have focused on classification problems and algorithms, specifically on developing features for the classification algorithm ID3, in order to investigate the viability of expanding the capabilities of an established single-label classifier.

Binary classification algorithms like ID3 classify the elements given in a dataset into two groups, based on the models generated from the training set. The output is specified by the dataset and the desired output, but most commonly simple true or false classifications are used.

A typical application for this kind of algorithm is determining whether a sample has a qualitative property, that is, whether it does or does not possess specific characteristics. For example, whether a patient has a disease or not; here the property classified on is the presence of the disease. The result of classifying a sample can only be a boolean value.

Multi-label classification, on the other hand, assigns each sample a set of labels that are not mutually exclusive. The result of a multi-label classification can therefore be a range of output labels for each sample. For example, classifying a text based on topic can result in multiple labels, such as politics, religion, economics and education, all at the same time, as these properties are not deemed mutually exclusive in this context.

2.2 Basic concepts of multi-label learning

Let D be a dataset containing N examples E_i = (x_i, Y_i), i = 1, ..., N. Each instance E_i comprises an attribute vector x_i = (x_{i1}, x_{i2}, ..., x_{iM}), described by M attributes X_j, j = 1, ..., M, as well as a subset of labels Y_i ⊆ L, where L = {y_1, y_2, ..., y_q} is the set of q labels. This is illustrated in Table 2.1. The task of multi-label classification is to generate a classifier H that, when given an unknown instance E = (x, ?), is able to accurately predict its subset of labels Y [17]:

H(E) → Y

Table 2.1: Multi-label data structure

       X_1    X_2    ...   X_M    Y_1    Y_2    ...   Y_q
E_1    x_11   x_12   ...   x_1M   y_11   y_12   ...   y_1q
E_2    x_21   x_22   ...   x_2M   y_21   y_22   ...   y_2q
...
E_N    x_N1   x_N2   ...   x_NM   y_N1   y_N2   ...   y_Nq

2.3 Decision trees

A decision tree is a set of nodes, logically arranged in a tree-like structure, used to classify data into different categories. The variables in the dataset are examined, and the analytical model determines which of the variables are the most important, based on entropy and information gain calculations. Using this information, a tree is created by splitting the data up by variables and counting the number of observations in each node after each split. The main feature of a decision tree is its recursiveness. For a set S of observations, the following algorithm is applied:

1. If every observation in S is of the same class, or if S is very small, the node becomes an endpoint, labeled with the most frequent class.

2. If S is too large and contains more than one class, find the best rule based on one feature to split S into subsets, one for each class.

A decision tree consists of three different types of nodes:

• Decision nodes – A location on a decision tree where a decision between at least two possible alternatives can be made.
• Chance nodes – Identifies an event in a decision tree where uncertainty exists; this node represents at least two possible outcomes.
• End nodes – A node that terminates the current branch.
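As a sketch, such a tree can be represented with a single node type whose role is determined by its fields (the names and layout are illustrative only, not the thesis's actual implementation):

// Illustrative decision tree node. A node with children acts as a
// decision node (it tests splitAttribute); a node without children
// acts as an end node carrying the class label for its branch.
#include <map>
#include <string>

struct TreeNode {
    std::string splitAttribute;  // attribute tested at a decision node
    std::string classLabel;      // class assigned at an end node
    bool isLeaf = false;         // true for end nodes
    std::map<std::string, TreeNode*> children;  // one child per attribute value
};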

2.4 Entropy

In the realm of information theory, entropy is a measure of the uncertainty associated with random variables, or "disorder" in a system. In the context of this thesis, we have been working with a specific kind of entropy called "Shannon entropy"[15].

Shannon entropy is a common calculation within information theory and is used in many areas where some form of information gain is calculated, for example machine learning and statistical calculations. The algorithms described later in the thesis all use Shannon entropy for their calculations, which motivates the choice of utilising entropy.

Shannon entropy was first introduced in "The Mathematical Theory of Communication" by Claude E. Shannon in 1948[15]. Shannon entropy is the expected average value of the information contained in each instance of data in a flow of information. This is achieved by calculating the probability of events, together with the information gain of each event. This creates a probability distribution whose expected value is the average amount of information, or entropy, generated by the current distribution. In order to measure entropy, Shannon used units called shannons, commonly referred to as bits.

For instance, the entropy of a coin toss is 1 bit, and m tosses equal m bits. If the events are equally likely to happen, the entropy is equal to the number of bits. However, if one or more of the events are more likely to occur, those events will contribute a lower rate of entropy, or information gained by observing them.

The simplified example of a coin toss results in an entropy of one bit, as can be shown by Shannon's equation. The general equation calculates the entropy over events i, each with probability p_i. In order to determine the information gained by observing event i, Shannon's solution follows the fundamental properties of information[15].

1. I(p) ≥ 0 - information is a non-negative quantity

2. I(1) = 0 - events that always occur do not communicate information

3. I(p1p2)=I(p1)+I(p2) - information due to independent events is additive

The last property states that the joint probability of independent events conveys as much information as the two events separately. This means that if log2(n) bits are needed to encode the first value and log2(m) bits for the second, then log2(mn) = log2(m) + log2(n) bits are needed for both, leading to the Shannon entropy equation that calculates the average information gained over M events:

H(X) = -\sum_{i=1}^{M} p_i \log_2(p_i)

A coin toss with a 50% chance for each outcome results in the following calculation:

H(X) = -0.5 \log_2(0.5) - 0.5 \log_2(0.5) = 1 \text{ bit}

A weighted coin which has a 75% probability of tails and only a 25% chance of heads will instead yield a lower information gain; there is some information in knowing the outcome of the toss, but not as much as for a fair coin, because of the high probability of the outcome being tails.

H(X) = -0.75 \log_2(0.75) - 0.25 \log_2(0.25) \approx 0.811 \text{ bits}
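These calculations can be reproduced with a small routine; a minimal sketch (the function name is ours):

// Shannon entropy, in bits, of a discrete probability distribution.
// Terms with p = 0 contribute nothing, since p*log2(p) -> 0 as p -> 0.
#include <cmath>
#include <cstdio>
#include <vector>

double shannonEntropy(const std::vector<double>& probabilities) {
    double h = 0.0;
    for (double p : probabilities)
        if (p > 0.0)
            h -= p * std::log2(p);
    return h;
}

int main() {
    std::printf("fair coin:     %.3f bits\n", shannonEntropy({0.5, 0.5}));    // 1.000
    std::printf("weighted coin: %.3f bits\n", shannonEntropy({0.75, 0.25}));  // 0.811
    return 0;
}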

2.5 Attribute relation file format

ARFF[18] stands for attribute-relation file format and is a file format where data is presented in two separate sections. It was introduced by the University of Waikato during their machine learning project, as a way to logically present data to machine learning algorithms for analysis. The first section is the header, where the attributes and the relation of the dataset are declared. The second is the data section, where the data to be read is listed. Such files are used both when training the classifier and when testing it.
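A minimal example of the two sections (the relation and attribute names here are invented for illustration):

% Header section: the relation and its attributes
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}

% Data section: one instance per line, values in attribute order
@data
sunny,85,no
overcast,83,yes
rainy,70,yes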

2.6 The ID3 algorithm

The choice of the ID3 algorithm has influenced most of the subsequent decisions in this thesis, since it was the basis for choosing this field of study. This includes, for example, the choice of using classifier chains as a base to build upon, since they fit in well as an extension when applied to decision trees.

The ID3 algorithm (Iterative Dichotomiser 3) is an algorithm developed by Ross Quinlan, used to generate a decision tree from a dataset[4].

ID3 starts with the full dataset as the root node. It iterates over every unused attribute of the set and calculates the entropy, or information gain, of that attribute. It then selects the attribute with the smallest entropy, or largest information gain. The set is then split on the selected attribute, which creates subsets of the data. This continues recursively on each subset, until one of the following cases occurs:

• Every element in the subset belongs to the same class

• There are no more attributes to be selected, but the examples do not belong to the same class

• There are no examples in the subset.

During runtime, the tree is constructed with each non-terminal node representing the attribute that splits the data, and each terminal node representing the final class label for that branch's subset.

ID3 cannot guarantee an optimal solution; situations exist where it can get stuck in a local optimum, as it uses a greedy approach when selecting the best attribute to split the dataset on at each iteration. There is also a problem with ID3 overfitting the training data, which results in very accurate results on the dataset used for training, but the tree can be built to fit the training data "too perfectly". This leads to a tree that may not perform as well on unknown, real-world instances.

The ID3 algorithm utilises the entropy calculation described in the previous section. The entropy is first calculated using the number of positives and negatives for the label/class whose tree is being built. This is then used for each attribute to calculate that attribute's information gain.

S - the current set of instances for which the calculation is made.
X - the set of class variants for the label.
p(x) - the proportion of the number of elements in class x to the number of elements in set S.
H(S) - the entropy of set S.
T - the subsets created from splitting set S on attribute A.
p(t) - the proportion of the number of elements in subset t to the number of elements in set S.
H(t) - the entropy of subset t.

H(S) = -\sum_{x \in X} p(x) \log_2 p(x)

\mathrm{InformationGain}(A, S) = H(S) - \sum_{t \in T} p(t) H(t)

Information gain is used to evaluate each attribute for the class being classified. The attribute with the largest information gain is the most suitable to split on for the set S.
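A sketch of how these two formulas translate into code (a straightforward implementation with names of our choosing, not the thesis's actual source):

// Entropy of a set of class labels: H(S) = -sum_x p(x) log2 p(x).
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

double entropy(const std::vector<std::string>& classes) {
    std::map<std::string, int> counts;
    for (const auto& c : classes) ++counts[c];
    double h = 0.0;
    for (const auto& kv : counts) {
        double p = static_cast<double>(kv.second) / classes.size();
        h -= p * std::log2(p);
    }
    return h;
}

// InformationGain(A, S) = H(S) - sum_t p(t) H(t), where subset t
// holds the class labels of the instances sharing one value of
// attribute A; values[i] is instance i's value for A.
double informationGain(const std::vector<std::string>& values,
                       const std::vector<std::string>& classes) {
    std::map<std::string, std::vector<std::string>> subsets;
    for (std::size_t i = 0; i < values.size(); ++i)
        subsets[values[i]].push_back(classes[i]);
    double remainder = 0.0;
    for (const auto& kv : subsets) {
        double weight = static_cast<double>(kv.second.size()) / classes.size();
        remainder += weight * entropy(kv.second);
    }
    return entropy(classes) - remainder;
}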

2.7 C4.5

The C4.5 algorithm was developed by Ross Quinlan[12] and is an extension of his earlier ID3 algorithm. C4.5 is similar to ID3 in that it also generates a decision tree for single-label classification, although several improvements have been made. Some of these are:

• Handling both continuous and discrete attributes. C4.5 creates a threshold and splits the list into attribute values that are above the threshold and those that are less than or equal to it.
• Handling training data with missing attribute values. Missing attribute values are simply skipped in the information gain and entropy calculations.
• Pruning trees after creation. C4.5 goes through the tree once it has been created and removes branches that do not contribute to the decision making, by replacing them with leaf nodes.

2.8 CLARE

The CLARE algorithm is a multi-label algorithm based on Quinlan's C4.5 algorithm, developed by Amanda Clare and Ross D. King[4]. The algorithm was adapted from C4.5 for the analysis of phenotype data, and to achieve this, the functionality of C4.5 was extended to handle multi-label classification. CLARE does this by modifying the C4.5 entropy formula described below:

H(S) = -\sum_{i=1}^{N} p(C_i) \log p(C_i)

where p(C_i) is the probability of class C_i in the set. CLARE modifies this formula to sum the number of bits needed to describe membership or non-membership of each class. In the case of N classes, where membership of each class C_i has probability p(C_i), the formula describing the total number of bits needed is given by

H(S) = -\sum_{i=1}^{N} \left( p(C_i) \log p(C_i) + q(C_i) \log q(C_i) \right)

where p(C_i) is the probability of class C_i, and q(C_i) = 1 - p(C_i) is the probability of not being a member of class C_i. With this, the new information can be calculated as a weighted sum of the entropy for each subset. If a sample appears twice in a subset because it belongs to two classes, it is counted twice.

This allows for multiple labels per example, which in turn allows the classification outcome to be a set of classes. The decision nodes of the tree have a special case where a leaf is a set of classes; here, a separate rule is generated for each class.
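A sketch of the modified per-class term (an illustrative helper, not CLARE's actual code): summing memberEntropy(p(C_i)) over all N classes yields the modified H(S) above.

// Bits needed to describe membership and non-membership of one
// class: -( p log2 p + q log2 q ) with q = 1 - p.
#include <cmath>

double memberEntropy(double p) {
    double q = 1.0 - p;
    double h = 0.0;
    if (p > 0.0) h -= p * std::log2(p);
    if (q > 0.0) h -= q * std::log2(q);
    return h;
}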

2.9 Noise

Most real-world datasets are unlikely to produce a training set that is entirely accurate. It is not uncommon that descriptions of objects include attributes based on subjective or inaccurate measurements, which poses a risk of introducing erroneous attribute values, or even misclassified objects, into the training set.

These kinds of non-systematic errors are usually referred to as noise. Noise can cause problems during the tree-building procedure: consider an arbitrary dataset and suppose that the values of one of the objects are corrupted or incorrectly recorded during input. This situation might result in two or more identical objects that belong to different classes. Errors of this kind may lead to decision trees of enlarged, false complexity, or cause the attributes provided to become inadequate for making decisions.

In order to cope with noisy datasets, Quinlan proposed two modifications[13].

1. The algorithm must be able to decide that testing further attributes will not improve the predictive accuracy of the decision tree.

2. The algorithm must be able to work with inadequate attributes, because noise can cause even the most comprehensive set of attributes to appear inadequate.

To illustrate the first modification, imagine a collection C containing representatives from two classes, and let A be an attribute with random values that produces subsets C_1, C_2, ..., C_v. If the proportion of class P objects in each C_i differs from the proportion of P objects in C itself, branching on attribute A will appear to yield an information gain, and a seemingly sensible step is therefore to test on attribute A, despite the fact that the values of A are generated randomly and therefore cannot help with classifying the objects in collection C.

The solution to this dilemma is to require that the information gain of a tested attribute exceeds some absolute or percentage-based threshold. In order to ensure that this threshold does not exclude relevant attributes, a method based on the chi-square test is used.

The previously mentioned attribute A produces subsets C_1, C_2, ..., C_v, where C_i contains p_i and n_i objects of classes P and N. If the value of A is irrelevant with regard to the objects in C, the expected values p'_i and n'_i are

p'_i = p \cdot \frac{p_i + n_i}{p + n}

n'_i = n \cdot \frac{p_i + n_i}{p + n}

Provided that neither p'_i nor n'_i is very small, the statistic calculated in the equation below can be used to determine, with a confidence level of 99%, whether A is independent of the class of the objects in collection C and should therefore be regarded as an irrelevant attribute.

\sum_{i=1}^{v} \left( \frac{(p_i - p'_i)^2}{p'_i} + \frac{(n_i - n'_i)^2}{n'_i} \right)

Another situation might arise where further testing of C is ruled out, caused either by inadequate attributes or because each attribute has been deemed irrelevant to the class of the objects in C. If this occurs, a leaf should be produced, labelled with class information, even though the objects in C are not all of the same class.
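Returning to the relevance test, a sketch of the computation (illustrative code, not Quinlan's): given the per-subset class counts, it computes the expected counts and the statistic above, to be compared against a chi-square threshold at the 99% confidence level.

#include <cstddef>
#include <vector>

double relevanceStatistic(const std::vector<int>& pCounts,   // p_i per subset
                          const std::vector<int>& nCounts) { // n_i per subset
    int p = 0, n = 0;  // totals over the whole collection C
    for (std::size_t i = 0; i < pCounts.size(); ++i) {
        p += pCounts[i];
        n += nCounts[i];
    }
    double stat = 0.0;
    for (std::size_t i = 0; i < pCounts.size(); ++i) {
        double expectedP = p * static_cast<double>(pCounts[i] + nCounts[i]) / (p + n);
        double expectedN = n * static_cast<double>(pCounts[i] + nCounts[i]) / (p + n);
        stat += (pCounts[i] - expectedP) * (pCounts[i] - expectedP) / expectedP
              + (nCounts[i] - expectedN) * (nCounts[i] - expectedN) / expectedN;
    }
    return stat;
}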

Several solutions were presented to deal with this situation, and the superior approach seems to be to simply opt for the more numerous class. Whether the leaf is assigned P or N is decided by comparing the class counts: if p > n, assign P; if n > p, assign N; and either class if p = n. This solution minimizes the sum of absolute errors for the objects in C.

Chapter 3 Approach

3.1 Ensemble of classifier chains

Classifier chains[14] is one approach to handling multi-label problems, by building a binary classifier for each label and then chaining those classifiers together to create one classifier. It was derived from another problem transformation approach called binary relevance[9]. The main difference between the two approaches is that classifier chains take into account that labels can have dependencies amongst each other, and use the labels themselves when creating the classifiers. There is a consensus in much of the literature that it is crucial to take label correlations into account during the classification process[14].

One of the main reasons that classifier chains were chosen in this thesis is the way label correlation affects the accuracy of machine learning algorithms when they are used for multi-label classification.

One issue with this is knowing which labels to take into account when creating the decision trees for classifying. As presented in the work on classifier chains for multi-label classification[14], there are many approaches to handling this problem. One approach is to randomly choose a tree to start with and then continue until all trees have been built.

An ensemble of classifier chains operates by training a number of classifier chains, using a random approach where the label order is chosen at random. The data is split into random subsets, so each classifier chain is unique and can give a different multi-label prediction. The predictions are then put under a vote, and a threshold is used to extract the most popular labels, which form the final predicted multi-label set. These are then deemed relevant and used for classifying.

The main advantage of using classifier chains in comparison to other transformation methods is that the computational complexity is in direct correlation to the number of labels, with only a small difference from the binary relevance method.


The main reason for the difference is that labels are also taken into consideration when building the decision trees. This likewise affected the choice of using classifier chains for extending the ID3 algorithm, albeit not as much as preserving the label dependency did.

3.2 The proposed algorithm

3.2.1 General

The algorithm was developed in C++. Since there are no existing ARFF readers for multi-label datasets in C++, one was developed during the process as well. There are two separate approaches to listing data in an ARFF file for this algorithm. The attributes and labels are defined in the same way in both approaches, but the data for each instance is listed differently. One approach is to list the data values in succession, with no explicit declaration of which attribute each value belongs to. The second approach is to state, for each value, the position of the attribute it belongs to; for example, "12 5" means that the attribute in position twelve has the value five. In the second approach the missing values are filled with zeros, to avoid missing values.
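To illustrate the difference (values invented), the same 14-attribute instance could be listed in the two styles roughly as follows; note that standard sparse ARFF wraps each line in braces and indexes attributes from zero:

% Dense listing: every attribute value, in declaration order
0,0,0,0,0,0,0,0,0,0,0,0,5,0
% Sparse listing: position/value pairs; unlisted positions default to 0
{12 5}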

Everything in this section describes what was newly developed when extending the existing ID3 algorithm to the multi-label problem. Every part was developed without applying techniques from other algorithms' multi-label extensions; the approaches were instead developed by the team. The exception is the approach for training the classifier, which is based on the original ID3 algorithm's calculations, with modifications to handle this specific case.

During development the algorithm underwent several iterations, and the end result consists of a combination of different approaches. One iteration, for example, resembled classifier chains, where the algorithm was chained into one large classification tree. It did not, however, use the same label dependency calculations as classifier chains, which are described in the section on classifier chains.

The approach used was instead to utilise the innate ability of the ID3 algorithm to calculate information gain: each label is compared to all other labels in the dataset, and an information gain calculation is executed to find, for each individual label, the label with the highest information gain.

The reason for this choice is that we were interested in testing an approach that combines the abilities of the transformed ID3 algorithm with some aspects of classifier chains, rather than utilising all aspects of classifier chains when creating the algorithm.

The way we reached a conclusion on how to find label dependencies was by discussing approaches based on our prior knowledge of the ID3 algorithm, trying these approaches, and using the one that yielded the best results in the end. This involved a large amount of trial and error.

The basic building of trees still uses the same approach as the original ID3 algorithm, except that it also considers labels when deciding which attribute to split the tree on.

This approach has not been investigated before, and one reason might be that the ID3 algorithm had already been developed into improved versions before the classifier chain approach was introduced. For many researchers, a newer type of algorithm might be a more interesting choice. There is also a large number of classification algorithms to which classifier chains can be applied, which makes an evaluation of this specific algorithm all the less likely.

One reason for choosing this algorithm was to see the difference in performance between an improved version of the original algorithm, in this case C4.5, compared to the original algorithm, in this case ID3, and to see whether the older version is still viable for use in multi-label classification.

3.2.2 Finding the tree order

The algorithm calculates each label's dependency on the other labels in the same way that ID3 calculates information gain for each attribute.

Algorithm 1 iterates over all labels and, for each label, calculates and stores information about every other label in the dataset currently being investigated. This is later used when deciding the order in which to train the algorithm on each label. It is an important part because it allows dependencies amongst labels to be preserved.

Data: data, first label position
Vector v of hashmaps of name and gain;
foreach label n from first label position do
    Hashmap l of label-name and information-gain;
    foreach label m that is not n do
        Calculate the information gain of label m for n;
        Add m's label name and information gain to l;
    end
    Add l to v;
end
Algorithm 1: Calculate tree entropy

The approach used in this algorithm is to evaluate the label dependencies beforehand, by calculating the information gain between labels in such a manner that a build order for the trees can be determined.

In the case of this algorithm, a struct containing only a string and a double is used, called LabelAndGain, since the container stores, for each label, its dependency on another tree together with the gain.
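A minimal sketch of how such a struct could look:

// The container described above: one label name and the information
// gain associated with depending on it.
#include <string>

struct LabelAndGain {
    std::string label;  // the label this dependency points to
    double gain;        // information gain between the two labels
};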

Algorithm 2 describes how the algorithm iterates over the calculated labels to find each label's respective best label. In cases where the best label has already been claimed by another label, and the information gain is higher for the new label, a recursive function call is made. In this call, the label which had previously claimed the best label finds its next-best candidate. If the same issue arises again, with that label also having been claimed before, another recursive call is made for the label that claimed it. This is repeated until all labels have been assigned the best label that was available to them. In the end, a list has been created consisting of each label, the label that claimed it, and the information gain. This is then used in the next part of the algorithm.

foreach h in vector v do
    foreach key-value pair in h do
        if gain is better than previous best gain then
            if key of h does not exist in l then
                Add to l with key of h and a LabelAndGain containing the current label being investigated and the current value of h;
            else
                Restart the search for the best dependency for the label that held the previous best gain;
                Add to l with key of h and a LabelAndGain containing the current label being investigated and the current value of h;
            end
        end
    end
end
Algorithm 2: Find the best dependency for each label

To avoid getting stuck in a tree loop after each label has chosen its respective best partner, the algorithm goes through the result and tries to correct any errors. This is done by choosing the tree with the lowest overall gain from another tree, choosing the tree that has the best gain on that tree, then recursively going through each corresponding tree to see whether the tree connected to it has been connected earlier. If not, the best unused alternative for the label is found.

Data: Hashmap containing string and LabelAndGain; vector v containing name of label and vector of gains for label

Find the tree with the lowest gain from another tree and add it to the current tree t;
List of used values l;
foreach label do
    Get the label name n that current tree t has its best dependency on;
    if n exists in l then
        Find best unused gain g for t from v where label = n;
        Set current tree to label of g;
        Add g to used values;
    else
        Set current tree to n;
        Add n to used values;
    end
end
Algorithm 3: Fix any loops in the tree building order

In the case of this algorithm, the label with the overall lowest information gain from another tree is chosen first, and in turn the tree with the best gain from that tree is chosen. This is done repeatedly until all trees are built. Information gain between labels is calculated using the standard ID3 method, as described in the section on entropy. The labels are then presented to the next tree to build, and each previously built tree is presented as an alternative when branching the tree. This does not necessarily mean that the labels will be used when branching; they are only presented as alternatives. During the building of each tree, it is determined whether an available label is useful for branching or not.

3.2.3 Creation of each tree

Each tree is built using standard ID3 calculations to determine the best attribute to split on. A subset of the data corresponding to each value of the split attribute is then used to choose the next attribute. This is repeated until all attributes correlate to a class, either negative or positive.

The training of the algorithm is done in a linear, non-recursive fashion. This is because trees that branch on many attributes were reaching the stack limit, which is a problem since those attributes can be vital for a correct classification.
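A minimal sketch of the idea of replacing recursion with an explicit work stack (the Node type and all names are illustrative, not the thesis's actual code):

#include <stack>
#include <vector>

struct Node;  // tree node type, defined elsewhere

struct WorkItem {
    Node* node;                    // node to expand next
    std::vector<int> instanceIds;  // training instances reaching it
};

// Instead of recursing into each subset (and risking the stack limit
// on trees that branch on many attributes), pending subsets are
// pushed onto an explicit stack and processed one at a time.
void buildTreeIteratively(WorkItem root) {
    std::stack<WorkItem> pending;
    pending.push(root);
    while (!pending.empty()) {
        WorkItem item = pending.top();
        pending.pop();
        // ... pick the best attribute for item.instanceIds, create one
        // child node per attribute value, and push a WorkItem for each
        // child instead of making a recursive call ...
        (void)item;  // expansion logic elided in this sketch
    }
}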

To avoid getting zero positives or zero negatives for a certain value, the values used in the entropy calculations are replaced by a very small value close to zero and a large value close to one. The supported attribute types for this algorithm are binary, nominal and numeric. If a value is encountered in the test set that has no matching branch when trying to reach the next node, the child with the closest possible value is chosen as the next node.

foreach label n do
    BuildTree(training data, label to check, allowed labels a, node current-node);
    if viable attributes is 0 then
        Return root of tree
    end
    if all data for label is positive then
        Add leaf with positive label
    end
    if all data for label is negative then
        Add leaf with negative label
    else
        foreach attribute or label allowed in a do
            Find attribute with best information gain;
            foreach possible value v of best-gain attribute do
                Split the data into s where value of attribute = v;
                Create new node n where value = v;
                Add as child to current-node;
                Remove current best-gain attribute as a viable candidate;
                Start next level of the tree with n and s;
            end
        end
    end
    Add n to allowed labels;
end
Algorithm 4: Training the algorithm for each label

3.2.4 Classification

Classification is done by going through each instance, jumping from node to node. If a node is a leaf, the value of the leaf node is recorded for that label in the dataset being classified. Each leaf node leads to the next attribute, and the classifier automatically moves to that node. From there, it checks each child node and finds the value that corresponds to the dataset. If no value corresponds, the closest value of that attribute is chosen and the classifier jumps to that child.

3.2.5 Classification statistics

The correct labels are provided and checked against the classified test set. If the classified label corresponds to the correct data for an individual instance, it is counted as a true positive or true negative, depending on the value. If the label does not have the correct value, it is counted as a false positive or false negative. The number of positives and negatives per label is also counted while going through the data, so that label accuracy can be evaluated using the F-measure described under evaluation measurements.

3.2.6 Threading

Boost[5] is used to simplify the threading in the algorithm. Boost is a C++ library that adds basic functionality missing from the language. In the algorithm's case, Boost is used for handling threads and for easier reading of the ARFF files. Boost has been developed for many years in cooperation with multiple board members of the C++ standards committee, and some Boost functionality has been incorporated into the C++ standard.

There are multiple places where threading is used to increase the performance of the algorithm: calculating the gain between labels, training the algorithm, classifying the test data, and counting the values that have been correctly or incorrectly classified.

In the case of the trees, each tree's training phase is started in a separate thread. This is done for every label, in the previously determined order. When the number of started threads equals the number of threads to execute concurrently, the algorithm stops and waits for a thread to finish before starting the next.

For the label dependency calculations, the labels are split by the number of threads to use, divided into ranges of equal size from the first label to the last. Each thread then calculates the labels within the range it is given.
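A sketch of this range-splitting scheme (all names are illustrative; boost::thread is used since Boost is the thesis's threading library):

#include <boost/thread.hpp>
#include <cstddef>
#include <vector>

void processRange(std::size_t first, std::size_t last);  // work on one range, defined elsewhere

// Divide numItems into equal contiguous ranges, one thread per range;
// the last thread also takes any remainder items.
void runInParallel(std::size_t numItems, std::size_t numThreads) {
    std::vector<boost::thread> workers;
    std::size_t chunk = numItems / numThreads;
    for (std::size_t t = 0; t < numThreads; ++t) {
        std::size_t first = t * chunk;
        std::size_t last = (t + 1 == numThreads) ? numItems : first + chunk;
        workers.emplace_back(processRange, first, last);
    }
    for (auto& w : workers)
        w.join();  // wait for every range to finish
}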

For classification and classification statistics, the test dataset is split into equal portions, determined by the number of test instances divided by the number of threads. The threads then perform the classification and counting on the ranges they are given, so each thread processes an equally sized portion of the data, calculated beforehand.

Chapter 4 Method

4.1 Environment

In order to test the performance of MLID, both in terms of accuracy and precision and in terms of execution time, a range of different datasets were classified. For each set, the predictions and the execution time were recorded and compared to the results of algorithms run through the machine learning tool MEKA[6].

In order to provide reliable and comparable results, all tests, for both MLID and the MEKA algorithms, were performed separately, on a personal computer with the following specifications:

• Intel Core i7-6700HQ CPU @ 2.60 GHz
• 8 GB DDR4 RAM @ 2133 MHz
• GeForce GTX 960M
• 120 GB SSD @ 560 MB/s read / 400 MB/s write

4.2 WEKA/MEKA

WEKA (Waikato Environment for Knowledge Analysis)[6] is a software workbench licenced under the GNU General Public Licence. It was developed by the University of Waikato in 1997 and contains a collection of tools and algorithms to help visualize results from machine learning experiments. The MEKA extension offers added features for performing machine learning experiments on multi-label classification algorithms and datasets.

In order to evaluate MLID against already established algorithms, we use MEKA to run comparable multi-label algorithms on the same datasets used to test our own. Three different algorithms have been used in MEKA to test MLID against. The first is a classifier chain method using J48 as the tree builder.


The second setup is a binary relevance method using LMT (logistic model trees). The third setup is a back-propagating neural network using random trees as the builder. Each dataset is split in an 80/20 ratio of training and test data, and we evaluate the accuracy of each algorithm per set, as well as the F-measure and the total execution time.

4.3 Datasets

To test MLID we have chosen seven datasets of varying size, in order to provide a wide array of test cases. The datasets come from WEKA's library of standardised datasets for testing machine learning algorithms; they are all established, tested, and provide realistic input data, giving reliable test results.

To avoid getting invalid values when calculating information gain for attributes, each dataset is split into two parts for training and testing. The first part, 80% of the total dataset, is used for training the classifier; the remaining 20% is used to test whether the classifier can correctly classify the test set.

Since each dataset is split in a ratio of 80% to 20%, the number of training instances is lower than the total number of instances. For example, the dataset "yeast" is trained on 1932 instances and tested on the remaining 484 instances.

Table 4.1: The datasets used during testing.

Dataset    Instances   Attributes   Labels
Medical    977         1488         45
Birds      321         259          18
Yeast      2416        103          14
Enron      1701        1000         42
CAL500     501         67           173
Emotions   592         71           6
Flags      193         18           7

4.4 Evaluation measurements

Evaluating a single-label classifier is simple, as there are only two possible outcomes, correct or incorrect. Multi-label classification, however, also takes partially correct classifications into account. All performance measures range in [0...1], where a higher score is better.

4.4.1 Example based evaluation measures

Let Y_i be the set of true labels and Z_i the set of predicted labels for instance i, with the indicator function I(true) = 1 and I(false) = 0. N is the number of instances and L the number of labels.

\mathrm{Accuracy}(H, D) = \frac{1}{N} \sum_{i=1}^{N} \frac{|Y_i \cap Z_i|}{|Y_i \cup Z_i|}

\mathrm{Precision}(H, D) = \frac{1}{N} \sum_{i=1}^{N} \frac{|Y_i \cap Z_i|}{|Z_i|}

\mathrm{Recall}(H, D) = \frac{1}{N} \sum_{i=1}^{N} \frac{|Y_i \cap Z_i|}{|Y_i|}

The accuracy of a machine learning algorithm provides a criterion for how well the algorithm correctly identifies a condition, labeling instances as positive or negative. Accuracy alone, however, does not provide a detailed picture of how well the algorithm works. In a scenario where the majority of the input should be labeled negative, an algorithm that always labels data instances as negative would appear to have a high accuracy, even though it would miss every case where the label should have been positive. In order to get a more exhaustive view of how well the algorithm performs, precision and recall are used in addition to accuracy.

Precision measures how many of the selected items are relevant, or in this context, correctly classified, while recall measures how many of the relevant items were selected. Precision and recall can be combined into their harmonic mean using a measure called the "F1-score"[19]. The F1-score is a special case of the Fβ measure that balances the weights of precision and recall evenly. The following equation is used to determine the F-measure produced by each algorithm:

F = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}

The results will show a comparison between the evaluated algorithms in terms of accuracy and F-measure for each dataset, in order to determine the viability of the MLID algorithm as a multi-label classifier.
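A sketch of these example-based measures in code (illustrative, with the true and predicted label sets per instance; names are ours):

#include <algorithm>
#include <cstddef>
#include <iterator>
#include <set>
#include <vector>

struct Scores { double accuracy, precision, recall, f1; };

// Y[i] holds instance i's true label indices, Z[i] the predicted ones.
Scores evaluate(const std::vector<std::set<int>>& Y,
                const std::vector<std::set<int>>& Z) {
    double acc = 0.0, prec = 0.0, rec = 0.0;
    std::size_t N = Y.size();
    for (std::size_t i = 0; i < N; ++i) {
        std::set<int> inter, uni;
        std::set_intersection(Y[i].begin(), Y[i].end(), Z[i].begin(), Z[i].end(),
                              std::inserter(inter, inter.begin()));
        std::set_union(Y[i].begin(), Y[i].end(), Z[i].begin(), Z[i].end(),
                       std::inserter(uni, uni.begin()));
        if (!uni.empty())  acc  += double(inter.size()) / uni.size();
        if (!Z[i].empty()) prec += double(inter.size()) / Z[i].size();
        if (!Y[i].empty()) rec  += double(inter.size()) / Y[i].size();
    }
    acc /= N; prec /= N; rec /= N;
    double f1 = (prec + rec > 0.0) ? 2.0 * prec * rec / (prec + rec) : 0.0;
    return {acc, prec, rec, f1};
}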

4.4.2 Evaluations for execution time

The execution-time-based evaluation is measured in seconds for each classifier and dataset. Two different times are measured: one for training the classifier and one for testing data with the classifier.

Chapter 5 Results

5.1 Accuracy and F-Measure

In this section the information gathered related to accuracy and F-measure is presented. These two measurements are described in the section on evaluation measurements. The data is compared between several algorithms using the same metrics. The algorithms presented are a J48 implementation using classifier chains, a binary relevance method using LMT (logistic model trees), and a neural-network version of a random tree algorithm. The choice of algorithms follows the pattern of the dataset choices: we want a diverse field of datasets to test MLID on, and to be able to compare our algorithm to different types of multi-label implementations. The metrics used are common for multi-label classification and are all standard measurements used with WEKA/MEKA.

J48 is a Java implementation of Quinlan's C4.5 algorithm and therefore provides a closely related and modern algorithm. The random-tree-based neural network, on the other hand, provides a completely different approach to the problem, and finally binary relevance stands as a baseline algorithm, providing data from a simplistic approach.

In the section below, the F1-score is used to evaluate each dataset across the tested algorithms. The F-measure is a measurement for evaluating each label in a dataset. It is a combination of precision and recall for a classifier and is described under the section on evaluation measurements.


Figure 5.1: The accuracy in percentage for different datasets.

The following table presents the data on which the accuracy is based. Information for each individual label in each dataset is not presented. The information presented includes true positives, which are cases in which MLID has predicted a correct positive value for a label, and likewise true negatives. False values indicate where classification has not produced correct results; for example, false positives are labels where the classification should have been negative.

Table 5.1: The classification results for each dataset using MLID.

Dataset    True Positives   True Negatives   False Positives   False Negatives
Emotions   95               361              123               135
Yeast      707              3942             813               1314
CAL500     882              13123            1926              1643
Enron      486              16121            770               696
Medical    200              8525             47                48
Flags      103              71               58                41
Birds      7                1111             58                59

Figure 5.2: The averaged F-score in percentage produced by the tested algorithms.

5.2 Execution time

In this section three sets of data are presented, in the form of tables. The first is the time it takes to train each classifier. The second table presents the time it takes to classify a dataset, in seconds. The third presents the speed increase of the parallel version compared to the sequential version. These are further used to evaluate the performance of the algorithm. The next table presents the time to train a classifier for the various datasets and algorithms.

Table 5.2: Training time of the classifiers for each dataset in seconds.

Dataset    Sequential (s)   Parallel (s)   J48 (s)   LMT (s)     Random tree (s)
Emotions   13               4              0.241     2.918       1.044
Yeast      204              67             1.792     61.423      5.424
CAL500     605              244            3.03      15.437      2.491
Enron      13537            6535           63.495    7924.036    30.431
Medical    2015             1134           4.326     1017.619    24.78
Flags      2.72             1.89           0.072     1.074       0.18
Birds      59               25             0.323     19.112      1.562

The following table shows the classification time for an already built classifier. The execution time is presented for each classifier and is used for evaluating testing speed in the analysis chapter.

Table 5.3: The time for each algorithm to classify the test datasets.

Dataset    Sequential (s)   Parallel (s)   J48 (s)   LMT (s)   Random tree (s)
Emotions   2                1              0         0.003     0
Yeast      44               12             0.029     0.03      0.003
CAL500     29               11             1.108     0.718     0.006
Enron      12               8              0.186     1.477     0.01
Medical    3                2              0.032     0.62      0.008
Flags      0.29             0.28           0.001     0         0
Birds      2                2              0.009     0.018     0.001

The next table presents the speed increase for the algorithm when using parallelisation compared to a sequential execution when training the classifier.

Table 5.4: Speed up in percentage, running in parallel compared to sequential.

Dataset    Sequential (s)   Parallel (s)   Speed up (%)
Emotions   13               4              325%
Yeast      204              67             304%
CAL500     605              244            248%
Enron      13537            6535           207%
Medical    2015             1134           177%
Flags      2.72             1.89           144%
Birds      59               25             236%

Chapter 6 Analysis

6.1 Accuracy and F-Measure

Overall, the test results in terms of accuracy and F-score show that MLID performs above average compared to a similar classifier chain version of C4.5 (J48), the binary relevance algorithm and the random tree algorithm, as can be seen in figures 5.1 and 5.2.

Figure 5.1 indicates that, of the seven datasets tested, MLID outperforms the other tested algorithms in most cases when comparing accuracy. In one case the accuracy is merely on par with the other algorithms, and in one case the accuracy is substantially lower than that of two of the other algorithms. The LMT algorithm is below the accuracy of MLID in all tests, which can be related to LMT using binary relevance, where label dependency is not taken into account.

Looking at Table 5.1, it shows that the trees built are in general good at predicting the true negatives for labels. However, there is a big drop in correct classification of true positives. Averaging over the datasets above, there are in general 0.62 true positives for every false negative, which is reflected in the F-measure, which does not take true negatives into account.

Figure 5.2 shows that the MLID algorithm provides a better F-measure for three of the seven datasets. On three of the remaining datasets it is on par with the other algorithms, with one of those slightly below the random tree and J48 algorithms. On the last dataset it provides a worse score than two of the other algorithms, but still a better result than the random tree algorithm.

In the cases where the F-measure is above the other algorithms, it often wins by a large margin. There also seems to be a correlation between the high F-measure results and the datasets where the classifier provides a high accuracy.


One notable difference is the dataset Emotions, where all MEKA classifiers except random tree achieve a better F-measure. In the following, we try to explain why the different datasets show such vast differences in accuracy and/or F-score.

Since accuracy takes both the true negatives and the true positives into account, the values will be quite high because of the number of correctly classified negative values in comparison to the total number of predicted values. Comparing figures 5.1 and 5.2, the accuracy is in most cases above the other classifiers, but for the dataset Emotions the accuracy is higher for MLID while the F-measure is lower. The higher accuracy therefore rests solely on a higher number of true negatives. This in itself is not bad, since classifying a false positive is even worse and would have given a worse precision and, in the end, a worse F-score. The F-measure is based almost entirely on the number of true positives and false negatives, and these are usually more vital to the classification than true negatives.

Another dataset that sticks out is the Medical set, where both accuracy and F-measure are above all the other methods tested. The cause of this is not clear, since the composition of the dataset does not differ vastly from a dataset such as Enron. One big difference is the number of attributes compared to the other datasets.

However, when examining the number of attributes in relation to the number of labels, no correlation could be found that would explain the great difference in performance. The different approaches on the dataset Birds indicate that classifier chains with a random order can outperform a precalculated order of trees such as our algorithm uses. Both binary relevance and the MLID algorithm have low accuracy there, and in terms of F-score most classifiers have a hard time getting sufficient results. In those cases the most likely cause of the low F-measure is that the dataset does not provide enough information for a correct classification using the algorithms that were tested.

When comparing the sequential and parallel versions of the algorithm, no difference in performance related to F-score or accuracy could be observed. This was investigated on all the datasets, and no correlation with the number of threads used could be found. This was expected, since the calculations and the classification process are independent for each instance of data being classified; the algorithm was developed to run on several threads without affecting accuracy or F-score.

In general, the MLID algorithm outperforms or is on par with the other classifiers. It does not get excellent scores on a few datasets, but as shown in figures 5.1 and 5.2, the other classifiers seem to have the same issues on those sets.

6.2 Execution Time

The parallel execution was done with six parallel threads. For each dataset tested, the execution time was recorded and divided into training time and test time, which can be seen in table 5.2 and table 5.3. This was done for the classifier chain algorithm and our own MLID algorithm, as well as for a binary relevance version of LMT and a random forest algorithm using back-propagation neural networks. The results show an average improvement of 249.5% when executing the training of MLID in parallel compared to the sequential version, as can be seen in table 5.4.

As seen in table 5.4, the parallelisation is usually inconsequential on datasets where the numbers of labels, attributes and instances are small. This is because the trees are built quickly enough that the threading overhead cancels out most of the gain. As the datasets grow in size, however, parallelisation becomes more important. Tree complexity also seems to matter: on datasets where no single attribute dominates the dataset size, the benefit of parallelisation appears to be larger.
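As an illustration of this kind of per-tree parallelism, the sketch below trains one tree per label on its own task. It is a minimal sketch under the assumption that the trees can be built independently; the names Dataset, DecisionTree and buildTree are hypothetical stand-ins, not our actual implementation (which capped the pool at six threads):

    #include <functional>
    #include <future>
    #include <vector>

    // Hypothetical stand-ins for the real data structures.
    struct Dataset {};
    struct DecisionTree {};

    // Placeholder: in the real algorithm this would run ID3-style
    // induction for the given label on the training data.
    DecisionTree buildTree(const Dataset& data, int label) {
        (void)data; (void)label;
        return DecisionTree{};
    }

    // Train one tree per label concurrently. Each task only reads the
    // shared data and writes its own tree, so the result is identical
    // to a sequential run; only the wall-clock time changes. For small
    // datasets the cost of spawning tasks can outweigh the gain.
    std::vector<DecisionTree> trainParallel(const Dataset& data, int numLabels) {
        std::vector<std::future<DecisionTree>> tasks;
        tasks.reserve(numLabels);
        for (int label = 0; label < numLabels; ++label)
            tasks.push_back(std::async(std::launch::async, buildTree,
                                       std::cref(data), label));
        std::vector<DecisionTree> trees;
        trees.reserve(numLabels);
        for (auto& task : tasks)
            trees.push_back(task.get()); // blocks until that tree is done
        return trees;
    }

    int main() {
        Dataset data;
        auto trees = trainParallel(data, 6); // e.g. six labels
        return static_cast<int>(trees.size()) == 6 ? 0 : 1;
    }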

Comparing the speed-up for each dataset, the increase seems to be higher if many attributes are numerical than if the datasets consist mostly of nominal values. Compared to the other algorithms the training speed is lacking, against both J48 and the random forest algorithm. Against binary relevance there are cases where the MLID algorithm performs better, usually on datasets where the numbers of labels and attributes are high. Such datasets in conjunction with a large number of instances increase the execution time drastically. This is most likely because the algorithm has to check all available values and count them individually for each attribute in order to determine the information gain of that attribute. If the number of instances increases together with the number of attributes, the execution time grows greatly.
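This cost argument follows from the information gain computation ID3 performs at every node (the standard formulation, using the entropy H introduced in section 2.4):

\[
\mathrm{IG}(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v)
\]

Each term requires counting label occurrences in the subset \(S_v\), so evaluating a single node takes on the order of one pass over the instances for every candidate attribute. This per-node cost compounds when both the number of attributes and the number of instances grow.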

The classification speed of MLID is in all cases slower than the other algorithms tested, which can be seen in table 5.3. This is most likely due to the way MLID stores the data for a testing dataset rather than the algorithm itself: MEKA and WEKA have well-developed ways of storing data, while our algorithm uses a basic implementation, and this can explain the speed difference. Another contributing factor may be that ID3 does not use any form of pruning when training the classifier, which produces more complex trees to traverse when classifying data; this impacts performance negatively because calculating each label for the test dataset becomes more time-consuming.

The classification is, however, not very time-consuming in absolute terms, usually taking only a couple of seconds to classify between a few hundred and a few thousand instances. If the application using the machine-learning algorithm is not time-critical and does not process large amounts of data, the testing performance of the MLID algorithm is sufficient.

There exist cases where the time it takes to train the classifier is not the important factor; instead accuracy, F-measure and testing speed matter more. These are the cases where the classifier does not need to be retrained on a regular basis. If the classifier is trained only once and then used over a couple of months or even years, the training speed is not the deciding factor; what matters is instead the results in areas such as accuracy, F-measure and testing speed. This holds in some cases for MLID, but in other cases it cannot beat the faster algorithms in both accuracy and F-score. Many algorithms need an appropriate type of data to perform well on, which can be why the faster algorithms provide better results for some datasets. In the cases where MLID is given a well-suited dataset, it works as well as comparable multi-label algorithms, provided training time is not an important factor.

Chapter 7 Conclusions and Future Work

Our performance evaluations with regard to accuracy, precision and execution time against already existing algorithms show that MLID can perform better than the other tested multi-label algorithms in terms of accuracy and F-score.

In this section we answer the research questions stated in section 1.2. Drawing on the analysis, we address the questions one at a time. The questions are:

• Can the MLID algorithm be a viable option as a multi-label classifier in comparison to already established algorithms?

• How will parallelization affect accuracy and execution time of MLID in comparison to a sequential execution?

• How will accuracy and execution time be affected by large data-sets in comparison to smaller data-sets for MLID?

• How can the ID3 algorithm be extended to handle multi-label classification problems?

7.1 Is it a viable multi-label classifier?

In this work we have presented an implementation of a multi-label classifier based on the ID3 algorithm. The analysis of the results shows that, in both F-score and accuracy, the developed algorithm outperforms the algorithms it was evaluated against in most cases. For accuracy it outperforms the other algorithms in five out of six cases. For F-measure it clearly outperforms the other algorithms in three out of six cases, and for one more dataset it performs better than the other algorithms, albeit not by as large a margin as in the other three cases.

The biggest issue with the algorithm is the speed when training the classifiers. In cases where the tree does not have to be rebuilt on a regular basis, the training speed can be neglected, as training sessions would occur infrequently. In these cases the MLID algorithm is a viable option when choosing a multi-label classification algorithm.

In cases where the number of instances does not exceed several thousand, and the test time does not need to be under a minute, the classification speed is acceptable. With the parallel approach the time it takes to classify the test data is usually under ten seconds, though in a few cases it is higher. In cases where classification speed does not need to be fast, the algorithm is more than suitable because of its higher accuracy and F-measure compared to the other tested algorithms, as presented in figure 5.1 and figure 5.2.

We deem these results to show that this approach to a multi-label algorithm is a viable option when compared to other algorithms, as supported by the classification results in the analysis chapter.

7.2 How is performance affected by parallelisation?

In terms of execution time we found that, for the datasets we used, the speed-up for building the classifier and classifying was on average 234.4%, as can be seen in table 5.4. Since the application was run on six threads, this is not a linear improvement in the number of threads. This was expected, since it is very hard to get an optimal increase for every thread. The results could probably be improved with another approach to handling the threads.
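To make this concrete (a back-of-the-envelope reading, assuming 234.4% denotes the ratio of sequential to parallel time, i.e. a speed-up factor of \(S \approx 2.34\)), the parallel efficiency on \(p = 6\) threads is:

\[
E = \frac{S}{p} \approx \frac{2.34}{6} \approx 0.39
\]

That is, each thread delivers roughly 39% of its ideal contribution, which is consistent with threading overhead and uneven tree-building times across labels.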

The algorithm was developed with parallelisation in mind, which led to a solution that is unaffected, except for execution time, by whether the application runs on multiple threads or a single thread. Because of this, the performance related to accuracy and F-score did not show any difference regardless of the number of threads.

We conclude that in cases where accuracy and F-score are the more important factors, this algorithm performs better than many other algorithms. In cases where speed is crucial, parallelisation does not increase performance enough for MLID to compete with the faster algorithms tested against. Where correctness matters most, this approach is more than suitable compared to approaches such as the J48, LMT or Random Tree algorithms. Performance related to accuracy and F-score is not affected by parallelisation.

7.3 How is performance affected by large datasets in comparison to small datasets?

It is hard to reach a conclusive answer when it comes to dataset size. Medical, which is a large set, is the best when it comes to both F-score and accuracy. In the three cases of enron, CAL500 and medical, the F-score is above all the others. These are moderately large sets, which would suggest that performance in general increases with size, although enron is bigger than medical. Bigger is thus not always better, which indicates that the composition of the dataset, in terms of attributes, attribute types and the number of labels, is more important than size.

The dataset flags is small but still performs well compared to other datasets of approximately the same size. This indicates that small datasets can perform well if the composition is good, but the bigger the set, the higher the likelihood that it contains enough information for a good result.

Our conclusion is that the results point towards performance improving with the size of the dataset, with larger datasets tending to be more accurate. MLID should, however, be tested on more and larger datasets, which time did not allow for during this thesis, before a firm conclusion about the effect of dataset size can be reached.

7.4 Is this a viable approach for extending ID3?

Looking at the scores, this approach works in terms of classification results. It can outperform the other algorithms in several areas, while lacking in others. In cases where training time is not important, but performance related to accuracy and F-score matters more, the MLID algorithm proves to perform better than many other algorithms.

There are obviously weaknesses, since other algorithms perform better in some cases, for example on the birds dataset, where both accuracy and F-score are below most of the algorithms tested against, as can be seen in figure 5.1 and figure 5.2. Many algorithms fit specific datasets, which has to be taken into account when a machine-learning algorithm needs to be adopted.

A big enough training set points towards better performance, so the algorithm should be run on large datasets where execution time is not an issue, because of the time issues related to the algorithm. In conclusion, from our point of view the approach is a viable option for handling a multi-label extension of the ID3 algorithm.

7.5 Future work

Future work would be to do more extensive testing on larger datasets, since the execution time of the algorithm did not allow testing on larger sets within the time frame of this thesis. Greater variation in the datasets' numbers of instances, attributes and labels would be needed to reach a conclusive answer on the viability of the algorithm.

Work building on this thesis would also involve testing the algorithm against more algorithms that can handle multi-label problems. This is crucial since the tests performed show better performance in most cases against the algorithms tested, but do not give a definite answer. It would also involve testing on more, and more varied, datasets, with different numbers of attributes and labels and different attribute value types, to reach a more conclusive answer.

Working on a multi-label version of the ID3 algorithm has highlighted some issues regarding the functionality present. To improve the viability of the MLID algorithm, support for additional features would be required, many of which would follow the development path of the original ID3 algorithm, such as:

• Handling continuous and discrete values - These values could be handled by creating a threshold at the time of tree building and splitting the attribute's values into those above the threshold and those less than or equal to it (see the sketch after this list).

• Pruning trees - The efficiency of the classifier could be improved by pruning the tree after it has been created. Pruning is done by removing branches that do not contribute to classification and replacing them with leaf nodes instead.

• Handling attributes with different costs - Introduce penalties and learning costs to the system. Each attribute is associated with a cost of being evaluated, and the system is restricted to a fixed budget. This helps the classifier increase its sensitivity and avoid the most costly errors.

• Support weighting cases - Enable the algorithm to weight different cases and misclassification types.
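As a sketch of the first item above: the C4.5-style treatment of a continuous attribute enumerates candidate thresholds as midpoints between consecutive distinct sorted values, and the tree builder then picks the threshold with the highest information gain. The code below is illustrative only (the names are hypothetical, not from our implementation), showing how the candidate thresholds could be generated:

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Candidate binary split for a continuous attribute: values less
    // than or equal to the threshold go left, values above go right.
    struct Split { double threshold; };

    // Enumerate candidate thresholds as midpoints between consecutive
    // distinct sorted values; the best candidate would then be chosen
    // by information gain, as for nominal attributes.
    std::vector<Split> candidateSplits(std::vector<double> values) {
        std::sort(values.begin(), values.end());
        std::vector<Split> splits;
        for (std::size_t i = 1; i < values.size(); ++i)
            if (values[i - 1] < values[i])
                splits.push_back({(values[i - 1] + values[i]) / 2.0});
        return splits;
    }

    int main() {
        // Example: attribute values from a toy dataset.
        std::vector<double> v{2.0, 3.5, 3.5, 7.0};
        auto splits = candidateSplits(v); // thresholds 2.75 and 5.25
        return static_cast<int>(splits.size()) == 2 ? 0 : 1;
    }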
