Bachelor Thesis Computer Science Thesis no: 06 2016

MLID: A multi-label extension of ID3

Henrik Starefors

Rasmus Persson

Dept. Computer Science & Engineering
Blekinge Institute of Technology
SE–371 79 Karlskrona, Sweden

This thesis is submitted to the Department of Computer Science & Engineering at Blekinge Institute of Technology in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science. The thesis is equivalent to 10 weeks of full-time studies.

Contact Information:

Author(s):
Henrik Starefors
E-mail: [email protected]

Rasmus Persson
E-mail: [email protected]

University advisor:
Prof. Håkan Grahn
Dept. Computer Science & Engineering

Dept. Computer Science & Engineering
Blekinge Institute of Technology
SE–371 79 Karlskrona, Sweden

Internet: www.bth.se/didd
Phone: +46 455 38 50 00
Fax: +46 455 38 50 57

Abstract

Context: Machine learning is a subfield within artificial intelligence that revolves around constructing systems that can learn from, and make predictions on, patterns in data. Instead of following strict and static instructions, the system operates by adapting and learning from input data in order to make predictions and decisions. This work focuses on a subcategory of machine learning called "multi-label classification", in which data introduced to the system is categorized by an analytical model, learned through what is called supervised learning, and each instance of the dataset can belong to multiple labels, or classes.

Objectives: This paper presents the task of implementing a multi-label classifier based on the ID3 algorithm, which we call MLID (Multi-label Iterative Dichotomiser). The solution is presented both in a sequentially executed version and in a parallelized one. We also present a comparison, based on accuracy and execution time, against algorithms of a similar nature, in order to evaluate the viability of using ID3 as a base to further expand and build upon with regard to multi-label classification.

Methods: In order to evaluate the performance of the MLID algorithm, we have measured execution time and accuracy, and summarized precision and recall into what is called the F-measure, the harmonic mean of the precision and sensitivity of the algorithm. These results are then compared to already defined and established algorithms on a range of datasets of varying sizes, in order to assess the viability of the MLID algorithm.

Conclusions: The results produced when comparing MLID against other multi-label algorithms such as binary relevance, classifier chains and random trees show that MLID does produce superior results in terms of accuracy and F-measure, but does so in an extensive amount of time compared to the other algorithms. Through these results, we can conclude that MLID is a viable option to use as a multi-label classifier. Although some constraints inherited from the original ID3 algorithm impede the full utility of the algorithm, we are confident that following the same path of development and improvement as ID3 experienced would allow MLID to develop into a suitable choice of algorithm for a diverse range of multi-label classification problems.

Keywords: Machine learning, Multi-label, Classification, ID3

Contents

Abstract

1 Introduction
  1.1 Motivation and scope of this thesis
  1.2 Problem Statement

2 Background and Related Work
  2.1 Machine Learning
  2.2 Basic concepts of multi-label learning
  2.3 Decision trees
  2.4 Entropy
  2.5 Attribute relation file format
  2.6 The ID3 algorithm
  2.7 C4.5
  2.8 CLARE
  2.9 Noise

3 Approach
  3.1 Ensemble of classifier chains
  3.2 The proposed algorithm
    3.2.1 General
    3.2.2 Finding the order
    3.2.3 Creation of each tree
    3.2.4 Classification
    3.2.5 Classification statistics
    3.2.6 Threading

4 Method
  4.1 Environment
  4.2 WEKA/MEKA
  4.3 Datasets
  4.4 Evaluation measurements
    4.4.1 Example based evaluation measures
    4.4.2 Evaluations for execution time

5 Results
  5.1 Accuracy and F-Measure
  5.2 Execution time

6 Analysis
  6.1 Accuracy and F-Measure
  6.2 Execution Time

7 Conclusions and Future Work
  7.1 Is it a viable multi-label classifier?
  7.2 How is performance affected by parallelisation?
  7.3 How is performance affected by large datasets in comparison to small datasets?
  7.4 Is this a viable approach for extending ID3?
  7.5 Future work

References

Chapter 1 Introduction

Our current era is one of technology, and particularly one of information. Each day, a vast amount of data is collected from an array of different sources, and the amount of data is expanding rapidly[1]. Alongside this development, the field of artificial intelligence, and especially machine learning, has seen significant growth, as artificial intelligence and machine learning are integrated into more and more systems[7].

Machine learning[1] is a field of study that focuses on creating computer systems that are able to learn and improve themselves. These systems are often used to perform tasks such as making predictions from unknown datasets and finding patterns within them. The patterns can then be used to deduce information related to the data. Learning is accomplished by creating a model based on example input, often using a training set with known inputs and corresponding outputs.

The efficiency of the algorithm is improved by using these initial models and further adapting them to unknown data, in order to make predictions and decisions based on historical relationships and trends found within the data. One of these methods is called decision trees, where a tree is built based on "questions" asked in order to categorize the data into one or more labels, based on the conditions each item fulfills. This labeling can be viewed as fulfilling the following statement: if condition1 and condition2 and condition3, then label x.

Artificial intelligence and machine learning are fields where improvements are made every day, and they are relevant for many different kinds of businesses. Examples of machine learning in use are random-forest trees [3] in the systems controlling Microsoft's Cortana and Kinect camera[10], or Google's self-driving car[2].


1.1 Motivation and scope of this thesis

Making improvements to algorithms is always relevant in any form of application. The general problem with learning-based algorithms is the performance impact they can have on an application, depending on the size of the dataset to be analyzed.

ID3[12] is an algorithm that achieves its classification by splitting data, based on the current object's attributes, into leaf nodes in a tree. When this tree has been created, an object passing through the decision tree will end up in one of the final leaves of the tree, and thereby become classified. ID3 has no explicit multi-label extension, even though some existing tree algorithms are based on ID3[13]. There are, however, implementations of a multi-label algorithm based on C4.5, which in turn is based on ID3[12].

The first challenge encountered is to modify the ID3 algorithm. As the original algorithm is only able to classify each instance into a single label, we have expanded its capabilities and created a multi-label version of the algorithm. This was done by defining a new hybrid node that acts as both a decision node and a class node simultaneously. This was in itself a challenge, together with finding an appropriate entropy calculation for splitting on feature values, since regular ID3 does not take multiple classes into the calculation.

1.2 Problem Statement

By using machine learning, it is possible to create an analytical model that can be used to make decisions based on historical relationships and trends in data. Since ID3 is adapted to binary classification[13] but can be extended to multi-label classification, it is interesting to see what benefits MLID can exhibit compared to other algorithms, and how well it performs in terms of execution time and accuracy.

Research questions

• Can the MLID algorithm be a viable option as a multi-label classifier in comparison to already established algorithms?
• How will parallelization affect accuracy and execution time of MLID in comparison to a sequential execution?
• How will accuracy and execution time be affected by large datasets in comparison to smaller datasets for MLID?
• How can the ID3 algorithm be extended to handle multi-label classification problems?

Chapter 2 Background and Related Work

2.1 Machine Learning

Machine learning is a subfield of the broader field of artificial intelligence. The term was coined by Arthur Samuel (1959) as "a field of study that gives computers the ability to learn without being explicitly programmed"[16]. The field is often divided into three broad categories[11]:

• Supervised learning: Example inputs are presented to the algorithm together with the desired outputs in order for the algorithm to create a general model that maps inputs to outputs.

• Unsupervised learning: The algorithm is left to its own to find rules and structures in its provided inputs.

• Reinforcement learning: The algorithm interacts with a dynamic environment and is given certain goals to reach. The algorithm has to analyze the actions taken in order to find the optimal set of actions for reaching the goal.

This study focuses on supervised learning, through improving and expanding the functionality of the decision tree algorithm known as ID3. During supervised learning, the dataset is commonly divided into a training set and a test set. The training data consists of training examples, where each example contains an input object, typically a vector of attributes, and a desired, predetermined output value. The chosen learning algorithm then analyzes this data and generates from it a model with which it can map new, unseen examples.

In order to test the accuracy of this model, the remaining part of the dataset, called the test set, is used. The ratio of training to test data is determined by two competing concerns: with less training data, the parameter estimates have greater variance, and with less testing data, the performance statistics will have greater variance. The goal is to divide the data such that neither variance is too high. To this end, the Pareto principle[8], or the "80/20 rule", can be applied as a baseline and modified as necessary based on the specific dataset or algorithm used.
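As a concrete illustration, a minimal sketch of such a split (assuming the instances are already shuffled; the type and function names here are ours, not from any particular library):

// Minimal sketch of an 80/20 train/test split following the Pareto
// baseline. Assumes the instances have already been shuffled.
#include <cstddef>
#include <vector>

struct Instance {
    std::vector<double> attributes;  // input vector
    std::vector<bool>   labels;      // desired outputs (multi-label)
};

void splitDataset(const std::vector<Instance>& data,
                  double trainRatio,  // e.g. 0.8 for the 80/20 rule
                  std::vector<Instance>& train,
                  std::vector<Instance>& test) {
    std::size_t cut = static_cast<std::size_t>(data.size() * trainRatio);
    train.assign(data.begin(), data.begin() + cut);
    test.assign(data.begin() + cut, data.end());
}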


Supervised learning includes two categories of algorithms:

Classification: Categorical response values, where the data can be separated into specific classes. Common classification algorithms include:

• Support vector machines (SVM)
• Neural networks
• Naive Bayes classifier
• Decision trees
• Discriminant analysis
• Nearest neighbors (kNN)

Regression: Continuous response values. Common regression algorithms include:

• Linear regression
• Nonlinear regression
• Generalized linear models
• Decision trees
• Neural networks

The main difference between the two approaches is that classification groups the output into a class, i.e., it predicts which class a data object belongs to. Regression, on the other hand, tries to predict an output value using training data, i.e., it predicts unknown or missing values.

Classification algorithms can be divided into two separate problems: binary or single-label classification, and multi-label classification. During this work we have focused on classification problems and algorithms, specifically on developing features for the classification algorithm ID3, in order to investigate the viability of expanding the capabilities of an established single-label classifier.

Binary classification algorithms like ID3 classify the elements given in a dataset into two groups, based on the models generated from the training set. The output is specified by the dataset and the desired output, but most commonly simple true or false classifications are used.

A typical application for this kind of algorithm is determining whether a sample has a qualitative property, that is, whether it does or does not possess specific characteristics. For example, whether a patient has a disease or not; here the property classified on is the presence of the disease. The result of classifying a sample can only be a boolean value.

Multi-label classification, on the other hand, assigns each sample a set of labels that are not mutually exclusive. The result of a multi-label classification can therefore be a range of output labels for each sample. For example, classifying a text based on topic can result in multiple labels, such as politics, religion, economics and education, all at the same time, as these properties are not deemed mutually exclusive in this context.

2.2 Basic concepts of multi-label learning

Let D be a dataset containing N examples E_i = (x_i, Y_i), i = 1, ..., N. Each instance E_i comprises an attribute vector x_i = (x_{i1}, x_{i2}, ..., x_{iM}), described by M attributes X_j, j = 1, ..., M, as well as a subset of labels Y_i ⊆ L, where L = {y_1, y_2, ..., y_q} is the set of q labels. This is illustrated in Table 2.1. The task of multi-label classification is to generate a classifier H that, when given an unknown instance E = (x, ?), is able to accurately predict its subset of labels Y [17]:

H(E) → Y

Table 2.1: Multi-label data structure

       X_1    X_2    ...   X_M    Y_1    Y_2    ...   Y_q
E_1    x_11   x_12   ...   x_1M   y_11   y_12   ...   y_1q
E_2    x_21   x_22   ...   x_2M   y_21   y_22   ...   y_2q
...
E_N    x_N1   x_N2   ...   x_NM   y_N1   y_N2   ...   y_Nq

2.3 Decision trees

A decision tree is a set of nodes, logically arranged in a tree-like structure, used to classify data into different categories. The variables in the dataset are examined, and the analytical model determines which of the variables are the most important, based on entropy and information gain calculations. Using this information, a tree is created by splitting the data up by variables and counting the number of observations in each node after each split. The main feature of a decision tree is its recursiveness. For a set S of observations, the following algorithm is applied:

1. If every observation in S is of the same class, or if S is very small, the node becomes an endpoint, labeled with the most frequent class.

2. If S is too large and contains more than one class, find the best rule based on one feature to split S into subsets, one for each class.

A decision tree consists of three different types of nodes:

• Decision nodes – A location on a decision tree where a decision between at least two possible alternatives can be made.
• Chance nodes – Identifies an event in a decision tree where uncertainty exists; this node represents at least two possible outcomes.
• End nodes – A node that terminates the current branch.
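As a sketch, such a tree can be represented with a single node type whose role is determined by its fields (the names and layout are illustrative only, not the thesis's actual implementation):

// Illustrative decision tree node. A node with children acts as a
// decision node (it tests splitAttribute); a node without children
// acts as an end node carrying the class label for its branch.
#include <map>
#include <string>

struct TreeNode {
    std::string splitAttribute;  // attribute tested at a decision node
    std::string classLabel;      // class assigned at an end node
    bool isLeaf = false;         // true for end nodes
    std::map<std::string, TreeNode*> children;  // one child per attribute value
};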

2.4 Entropy

In the realm of information theory, entropy is a measure of the uncertainty associated with random variables, or "disorder" in a system. In the context of this thesis, we have been working with a specific kind of entropy called "Shannon entropy"[15].

Shannon entropy is a common calculation within information theory and is used in many areas where some form of information gain is calculated, for example machine learning and statistical calculations. The algorithms described later in the thesis all use Shannon entropy for their calculations, which motivates the choice of utilising entropy.

Shannon entropy was first introduced in "The Mathematical Theory of Communication" by Claude E. Shannon in 1948[15]. Shannon entropy is the expected average value of the information contained in each instance of data in a flow of information. This is achieved by calculating the probability of events, together with the information gain of each event. This creates a probability distribution whose expected value is the average amount of information, or entropy, generated by the current distribution. In order to measure entropy, Shannon used units called shannons, commonly referred to as bits.

For instance, the entropy of a coin toss is 1 bit, and m tosses equal m bits. If the events are equally likely to happen, the entropy is equal to the number of bits. However, if one or more of the events are more likely to occur, those events will contribute a lower rate of entropy, or information gained by observing them.

The simplified example of a coin toss results in an entropy of one bit, as can be shown by Shannon's equation. The general equation calculates the entropy over events i, each with probability p_i. In order to determine the information gained by observing event i, Shannon's solution follows the fundamental properties of information[15].

1. I(p) ≥ 0 - information is a non-negative quantity

2. I(1) = 0 - events that always occur do not communicate information

3. I(p1p2)=I(p1)+I(p2) - information due to independent events is additive

The last property states that the joint probability of independent events conveys as much information as the two events separately. This means that if log2(n) bits are needed to encode the first value and log2(m) bits for the second, then log2(mn) = log2(m) + log2(n) bits are needed for both, leading to the Shannon entropy equation that calculates the average information gained over M events:

H(X) = -\sum_{i=1}^{M} p_i \log_2(p_i)

A coin toss with a 50% chance for each outcome results in the following calculation:

H(X) = -0.5 \log_2(0.5) - 0.5 \log_2(0.5) = 1 \text{ bit}

A weighted coin which has a 75% probability of tails and only a 25% chance of heads will instead yield a lower information gain; there is some information in knowing the outcome of the toss, but not as much as for a fair coin, because of the high probability of the outcome being tails.

H(X) = -0.75 \log_2(0.75) - 0.25 \log_2(0.25) \approx 0.811 \text{ bits}
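These calculations can be reproduced with a small routine; a minimal sketch (the function name is ours):

// Shannon entropy, in bits, of a discrete probability distribution.
// Terms with p = 0 contribute nothing, since p*log2(p) -> 0 as p -> 0.
#include <cmath>
#include <cstdio>
#include <vector>

double shannonEntropy(const std::vector<double>& probabilities) {
    double h = 0.0;
    for (double p : probabilities)
        if (p > 0.0)
            h -= p * std::log2(p);
    return h;
}

int main() {
    std::printf("fair coin:     %.3f bits\n", shannonEntropy({0.5, 0.5}));    // 1.000
    std::printf("weighted coin: %.3f bits\n", shannonEntropy({0.75, 0.25}));  // 0.811
    return 0;
}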

2.5 Attribute relation file format

ARFF[18] stands for attribute-relation file format and is a file format where data is presented in two separate sections. It was introduced by the University of Waikato during their machine learning project, as a way to logically present data to machine learning algorithms for analysis. The first section is the header, where the attributes and the relation of the dataset are declared. The second is the data section, where the data to be read is listed. Such files are used both when training the classifier and when testing it.
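A minimal example of the two sections (the relation and attribute names here are invented for illustration):

% Header section: the relation and its attributes
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}

% Data section: one instance per line, values in attribute order
@data
sunny,85,no
overcast,83,yes
rainy,70,yes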

2.6 The ID3 algorithm

The choice of the ID3 algorithm has influenced most of the subsequent decisions in this thesis, since it was the basis for choosing this field of study. This includes, for example, the choice of using classifier chains as a base to build upon, since they fit in well as an extension when applied to decision trees.

The ID3 algorithm (Iterative Dichotomiser 3) is an algorithm developed by Ross Quinlan, used to generate a decision tree from a dataset[4].

ID3 starts with the full dataset as the root node. It iterates over every unused attribute of the set and calculates the entropy, or information gain, of that attribute. It then selects the attribute with the smallest entropy, or largest information gain. The set is then split on the selected attribute, which creates subsets of the data. This continues recursively on each subset, until one of the following cases occurs:

• Every element in the subset belongs to the same class

• There are no more attributes to be selected, but the examples do not belong to the same class

• There are no examples in the subset.

During runtime, the tree is constructed with each non-terminal node representing the attribute that splits the data, and each terminal node representing the final class label for that branch's subset.

ID3 cannot guarantee an optimal solution; situations exist where it can get stuck in a local optimum, as it uses a greedy approach when selecting the best attribute to split the dataset on at each iteration. There is also a problem with ID3 overfitting the training data, which results in very accurate results on the dataset used for training, but the tree can be built to fit the training data "too perfectly". This leads to a tree that may not perform as well on unknown, real-world instances.

The ID3 algorithm utilises the entropy calculation described in the previous section. The entropy is first calculated using the number of positives and negatives for the label/class whose tree is being built. This is then used for each attribute to calculate that attribute's information gain.

S - the current set of instances for which the calculation is made.
X - the set of class variants for the label.
p(x) - the proportion of the number of elements in class x to the number of elements in set S.
H(S) - the entropy of set S.
T - the subsets created from splitting set S on attribute A.
p(t) - the proportion of the number of elements in subset t to the number of elements in set S.
H(t) - the entropy of subset t.

H(S) = -\sum_{x \in X} p(x) \log_2 p(x)

\mathrm{InformationGain}(A, S) = H(S) - \sum_{t \in T} p(t) H(t)

Information gain is used to evaluate each attribute for the class being classified. The attribute with the largest information gain is the most suitable to split on for the set S.
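A sketch of how these two formulas translate into code (a straightforward implementation with names of our choosing, not the thesis's actual source):

// Entropy of a set of class labels: H(S) = -sum_x p(x) log2 p(x).
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

double entropy(const std::vector<std::string>& classes) {
    std::map<std::string, int> counts;
    for (const auto& c : classes) ++counts[c];
    double h = 0.0;
    for (const auto& kv : counts) {
        double p = static_cast<double>(kv.second) / classes.size();
        h -= p * std::log2(p);
    }
    return h;
}

// InformationGain(A, S) = H(S) - sum_t p(t) H(t), where subset t
// holds the class labels of the instances sharing one value of
// attribute A; values[i] is instance i's value for A.
double informationGain(const std::vector<std::string>& values,
                       const std::vector<std::string>& classes) {
    std::map<std::string, std::vector<std::string>> subsets;
    for (std::size_t i = 0; i < values.size(); ++i)
        subsets[values[i]].push_back(classes[i]);
    double remainder = 0.0;
    for (const auto& kv : subsets) {
        double weight = static_cast<double>(kv.second.size()) / classes.size();
        remainder += weight * entropy(kv.second);
    }
    return entropy(classes) - remainder;
}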

2.7 C4.5

The C4.5 algorithm was developed by Ross Quinlan[12] and is an extension of his earlier ID3 algorithm. C4.5 is similar to ID3 in that it also generates a decision tree for single-label classification, although several improvements have been made. Some of these are:

• Handling both continuous and discrete attributes. C4.5 creates a threshold and splits the list into attribute values that are above the threshold and those that are less than or equal to it.
• Handling training data with missing attribute values. Missing attribute values are simply skipped in the information gain and entropy calculations.
• Pruning trees after creation. C4.5 goes through the tree once it has been created and removes branches that do not contribute to the decision making, by replacing them with leaf nodes.

2.8 CLARE

The CLARE algorithm is a multi-label algorithm based on Quinlan's C4.5 algorithm, developed by Amanda Clare and Ross D. King[4]. The algorithm was adapted from C4.5 for the analysis of phenotype data, and to achieve this, the functionality of C4.5 was extended to handle multi-label classification. CLARE does this by modifying the C4.5 entropy formula described below:

H(S) = -\sum_{i=1}^{N} p(C_i) \log p(C_i)

where p(C_i) is the probability of class C_i in the set. CLARE modifies this formula to sum the number of bits needed to describe membership or non-membership of each class. In the case of N classes, where membership of each class C_i has probability p(C_i), the formula describing the total number of bits needed is given by

H(S) = -\sum_{i=1}^{N} \left( p(C_i) \log p(C_i) + q(C_i) \log q(C_i) \right)

where p(C_i) is the probability of class C_i, and q(C_i) = 1 - p(C_i) is the probability of not being a member of class C_i. With this, the new information can be calculated as a weighted sum of the entropy for each subset. If a sample appears twice in a subset because it belongs to two classes, it is counted twice.

This allows for multiple labels per example, which in turn allows the classification outcome to be a set of classes. The decision nodes of the tree have a special case where a leaf is a set of classes; here, a separate rule is generated for each class.
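A sketch of the modified per-class term (an illustrative helper, not CLARE's actual code): summing memberEntropy(p(C_i)) over all N classes yields the modified H(S) above.

// Bits needed to describe membership and non-membership of one
// class: -( p log2 p + q log2 q ) with q = 1 - p.
#include <cmath>

double memberEntropy(double p) {
    double q = 1.0 - p;
    double h = 0.0;
    if (p > 0.0) h -= p * std::log2(p);
    if (q > 0.0) h -= q * std::log2(q);
    return h;
}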

2.9 Noise

Most real-world datasets are unlikely to produce a training set that is entirely accurate. It is not uncommon that descriptions of objects include attributes based on subjective or inaccurate measurements, which poses a risk of introducing erroneous attribute values, or even misclassified objects, into the training set.

These kinds of non-systematic errors are usually referred to as noise. Noise can cause problems during the tree-building procedure: consider an arbitrary dataset and suppose that the values of one of the objects are corrupted or incorrectly recorded during input. This situation might result in two or more identical objects that belong to different classes. Errors of this kind may lead to decision trees of enlarged, false complexity, or cause the attributes provided to become inadequate for making decisions.

In order to cope with noisy datasets, Quinlan proposed two modifications[13].

1. The algorithm must be able to decide that testing further attributes will not improve the predictive accuracy of the decision tree.

2. The algorithm must be able to work with inadequate attributes, because noise can cause even the most comprehensive set of attributes to appear inadequate.

To illustrate the first modification, imagine a collection C containing representatives from two classes, and let A be an attribute with random values that produces subsets C_1, C_2, ..., C_v. If the proportion of class P objects in each C_i differs from the proportion of P objects in C itself, branching on attribute A will appear to yield an information gain, and a seemingly sensible step is therefore to test on attribute A, despite the fact that the values of A are generated randomly and therefore cannot help with classifying the objects in collection C.

The solution to this dilemma is to require that the information gain of a tested attribute exceeds some absolute or percentage-based threshold. In order to ensure that this threshold does not exclude relevant attributes, a method based on the chi-square test is used.

The previously mentioned attribute A produces subsets C_1, C_2, ..., C_v, where C_i contains p_i and n_i objects of classes P and N. If the value of A is irrelevant with regard to the objects in C, the expected values p'_i and n'_i are

p'_i = p \cdot \frac{p_i + n_i}{p + n}

n'_i = n \cdot \frac{p_i + n_i}{p + n}

Provided that neither p'_i nor n'_i is very small, the statistic calculated in the equation below can be used to determine, with a confidence level of 99%, whether A is independent of the class of the objects in collection C and should therefore be regarded as an irrelevant attribute.

\sum_{i=1}^{v} \left( \frac{(p_i - p'_i)^2}{p'_i} + \frac{(n_i - n'_i)^2}{n'_i} \right)

Another situation might arise where further testing of C is ruled out, caused either by inadequate attributes or because each attribute has been deemed irrelevant to the class of the objects in C. If this occurs, a leaf should be produced, labelled with class information, even though the objects in C are not all of the same class.
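Returning to the relevance test, a sketch of the computation (illustrative code, not Quinlan's): given the per-subset class counts, it computes the expected counts and the statistic above, to be compared against a chi-square threshold at the 99% confidence level.

#include <cstddef>
#include <vector>

double relevanceStatistic(const std::vector<int>& pCounts,   // p_i per subset
                          const std::vector<int>& nCounts) { // n_i per subset
    int p = 0, n = 0;  // totals over the whole collection C
    for (std::size_t i = 0; i < pCounts.size(); ++i) {
        p += pCounts[i];
        n += nCounts[i];
    }
    double stat = 0.0;
    for (std::size_t i = 0; i < pCounts.size(); ++i) {
        double expectedP = p * static_cast<double>(pCounts[i] + nCounts[i]) / (p + n);
        double expectedN = n * static_cast<double>(pCounts[i] + nCounts[i]) / (p + n);
        stat += (pCounts[i] - expectedP) * (pCounts[i] - expectedP) / expectedP
              + (nCounts[i] - expectedN) * (nCounts[i] - expectedN) / expectedN;
    }
    return stat;
}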

Several solutions were presented to deal with this situation, and the superior approach seems to be to simply opt for the more numerous class. Whether the leaf is assigned P or N is decided by comparing the class counts: if p > n, assign P; if n > p, assign N; and either class if p = n. This solution minimizes the sum of absolute errors for the objects in C.

Chapter 3 Approach

3.1 Ensemble of classifier chains

Classifier chains[14] is one approach to handling multi-label problems, by building a binary classifier for each label and then chaining those classifiers together to create one classifier. It was derived from another problem transformation approach called binary relevance[9]. The main difference between the two approaches is that classifier chains take into account that labels can have dependencies amongst each other, and use the labels themselves when creating the classifiers. There is a consensus in much of the literature that it is crucial to take label correlations into account during the classification process[14].

One of the main reasons that classifier chains were chosen in this thesis is the way label correlation affects the accuracy of machine learning algorithms when they are used for multi-label classification.

One issue with this is knowing which labels to take into account when creating the decision trees for classifying. As presented in the work on classifier chains for multi-label classification[14], there are many approaches to handling this problem. One approach is to randomly choose a tree to start with and then continue until all trees have been built.

An ensemble of classifier chains operates by training a number of classifier chains, using a random approach where the label order is chosen at random. The data is split into random subsets, so each classifier chain is unique and can give a different multi-label prediction. The predictions are then put under a vote, and a threshold is used to extract the most popular labels, which form the final predicted multi-label set. These are then deemed relevant and used for classifying.

The main advantage of using classifier chains in comparison to other transformation methods is that the computational complexity is in direct correlation to the number of labels, with only a small difference from the binary relevance method.


The main reason for the difference is that labels are also taken into consideration when building the decision trees. This likewise affected the choice of using classifier chains for extending the ID3 algorithm, albeit not as much as preserving the label dependency did.

3.2 The proposed algorithm

3.2.1 General

The algorithm was developed in C++. Since there are no existing ARFF readers for multi-label datasets in C++, one was developed during the process as well. There are two separate approaches to listing data in an ARFF file for this algorithm. The attributes and labels are defined in the same way in both approaches, but the data for each instance is listed differently. One approach is to list the data values in succession, with no explicit declaration of which attribute each value belongs to. The second approach is to state, for each value, the position of the attribute it belongs to; for example, "12 5" means that the attribute in position twelve has the value five. In the second approach the missing values are filled with zeros, to avoid missing values.
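To illustrate the difference (values invented), the same 14-attribute instance could be listed in the two styles roughly as follows; note that standard sparse ARFF wraps each line in braces and indexes attributes from zero:

% Dense listing: every attribute value, in declaration order
0,0,0,0,0,0,0,0,0,0,0,0,5,0
% Sparse listing: position/value pairs; unlisted positions default to 0
{12 5}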

Everything in this section describes what was newly developed when extending the existing ID3 algorithm to the multi-label problem. Every part was developed without applying techniques from other algorithms' multi-label extensions; the approaches were instead developed by the team. The exception is the approach for training the classifier, which is based on the original ID3 algorithm's calculations, with modifications to handle this specific case.

During development the algorithm underwent several iterations, and the end result consists of a combination of different approaches. One iteration, for example, resembled classifier chains, where the algorithm was chained into one large classification tree. It did not, however, use the same label dependency calculations as classifier chains, which are described in the section on classifier chains.

The approach used was instead to utilise the innate ability of the ID3 algorithm to calculate information gain: each label is compared to all other labels in the dataset, and an information gain calculation is executed to find, for each individual label, the label with the highest information gain.

The reason for this choice is that we were interested in testing an approach that combines the abilities of the transformed ID3 algorithm with some aspects of classifier chains, rather than utilising all aspects of classifier chains when creating the algorithm.

The way we reached a conclusion on how to find label dependencies was by discussing approaches based on our prior knowledge of the ID3 algorithm, trying these approaches, and using the one that yielded the best results in the end. This involved a large amount of trial and error.

The basic building of trees still uses the same approach as the original ID3 algorithm, except that it also considers labels when deciding which attribute to split the tree on.

This approach has not been investigated before, and one reason might be that the ID3 algorithm had already been developed into improved versions before the classifier chain approach was introduced. For many researchers, a newer type of algorithm might be a more interesting choice. There is also a large number of classification algorithms to which classifier chains can be applied, which makes an evaluation of this specific algorithm all the less likely.

One reason for choosing this algorithm was to see the difference in performance between an improved version of the original algorithm, in this case C4.5, compared to the original algorithm, in this case ID3, and to see whether the older version is still viable for use in multi-label classification.

3.2.2 Finding the tree order

The algorithm calculates each label's dependency on the other labels in the same way that ID3 calculates information gain for each attribute.

Algorithm 1 iterates over all labels and, for each label, calculates and stores information about every other label in the dataset currently being investigated. This is later used when deciding the order in which to train the algorithm on each label. It is an important part because it allows dependencies amongst labels to be preserved.

Data: data, first label position
Vector v of hashmaps of name and gain;
foreach label n from first label position do
    Hashmap l of label-name and information-gain;
    foreach label m that is not n do
        Calculate the information gain of label m for n;
        Add m's label name and information gain to l;
    end
    Add l to v;
end
Algorithm 1: Calculate tree entropy

The approach used in this algorithm is to evaluate the label dependencies beforehand, by calculating the information gain between labels in such a manner that a build order for the trees can be determined.

In the case of this algorithm, a struct containing only a string and a double is used, called LabelAndGain, since the container stores, for each label, its dependency on another tree together with the gain.
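A minimal sketch of how such a struct could look:

// The container described above: one label name and the information
// gain associated with depending on it.
#include <string>

struct LabelAndGain {
    std::string label;  // the label this dependency points to
    double gain;        // information gain between the two labels
};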

Algorithm 2 describes how the algorithm iterates over the calculated labels to find each label's respective best label. In cases where the best label has already been claimed by another label, and the information gain is higher for the new label, a recursive function call is made. In this call, the label which had previously claimed the best label finds its next-best candidate. If the same issue arises again, with that label also having been claimed before, another recursive call is made for the label that claimed it. This is repeated until all labels have been assigned the best label that was available to them. In the end, a list has been created consisting of each label, the label that claimed it, and the information gain. This is then used in the next part of the algorithm.

foreach h in vector v do
    foreach key-value pair in h do
        if gain is better than previous best gain then
            if key of h does not exist in l then
                Add to l with key of h and a LabelAndGain containing the current label being investigated and the current value of h;
            else
                Restart the search for the best dependency for the label that held the previous best gain;
                Add to l with key of h and a LabelAndGain containing the current label being investigated and the current value of h;
            end
        end
    end
end
Algorithm 2: Find the best dependency for each label

To avoid getting stuck in a tree loop after each label has chosen its respective best partner, the algorithm goes through the result and tries to correct any errors. This is done by choosing the tree with the lowest overall gain from another tree, choosing the tree that has the best gain on that tree, then recursively going through each corresponding tree to see whether the tree connected to it has been connected earlier. If not, the best unused alternative for the label is found.

Data: Hashmap containing string and LabelAndGain; vector v containing name of label and vector of gains for label

Find the tree with the lowest gain from another tree and add it to the current tree t;
List of used values l;
foreach label do
    Get the label name n that current tree t has its best dependency on;
    if n exists in l then
        Find best unused gain g for t from v where label = n;
        Set current tree to label of g;
        Add g to used values;
    else
        Set current tree to n;
        Add n to used values;
    end
end
Algorithm 3: Fix any loops in the tree building order

In the case of this algorithm, the label with the overall lowest information gain from another tree is chosen first, and in turn the tree with the best gain from that tree is chosen. This is done repeatedly until all trees are built. Information gain between labels is calculated using the standard ID3 method, as described in the section on entropy. The labels are then presented to the next tree to build, and each previously built tree is presented as an alternative when branching the tree. This does not necessarily mean that the labels will be used when branching; they are only presented as alternatives. During the building of each tree, it is determined whether an available label is useful for branching or not.

3.2.3 Creation of each tree

Each tree is built using standard ID3 calculations to determine the best attribute to split on. A subset of the data corresponding to each value of the split attribute is then used to choose the next attribute. This is repeated until all attributes correlate to a class, either negative or positive.

The training of the algorithm is done in a linear, non-recursive fashion. This is because trees that branch on many attributes were reaching the stack limit, which is a problem since those attributes can be vital for a correct classification.
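A minimal sketch of the idea of replacing recursion with an explicit work stack (the Node type and all names are illustrative, not the thesis's actual code):

#include <stack>
#include <vector>

struct Node;  // tree node type, defined elsewhere

struct WorkItem {
    Node* node;                    // node to expand next
    std::vector<int> instanceIds;  // training instances reaching it
};

// Instead of recursing into each subset (and risking the stack limit
// on trees that branch on many attributes), pending subsets are
// pushed onto an explicit stack and processed one at a time.
void buildTreeIteratively(WorkItem root) {
    std::stack<WorkItem> pending;
    pending.push(root);
    while (!pending.empty()) {
        WorkItem item = pending.top();
        pending.pop();
        // ... pick the best attribute for item.instanceIds, create one
        // child node per attribute value, and push a WorkItem for each
        // child instead of making a recursive call ...
        (void)item;  // expansion logic elided in this sketch
    }
}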

To avoid getting zero positives or zero negatives for a certain value, the values used in the entropy calculations are replaced by a very small value close to zero and a large value close to one. The supported attribute types for this algorithm are binary, nominal and numeric. If a value is encountered in the test set that has no matching branch when trying to reach the next node, the child with the closest possible value is chosen as the next node.

foreach label n do
    BuildTree(training data, label to check, allowed labels a, node current-node);
    if viable attributes is 0 then
        Return root of tree
    end
    if all data for label is positive then
        Add leaf with positive label
    end
    if all data for label is negative then
        Add leaf with negative label
    else
        foreach attribute or label allowed in a do
            Find attribute with best information gain;
            foreach possible value v of best-gain attribute do
                Split the data into s where value of attribute = v;
                Create new node n where value = v;
                Add as child to current-node;
                Remove current best-gain attribute as a viable candidate;
                Start next level of the tree with n and s;
            end
        end
    end
    Add n to allowed labels;
end
Algorithm 4: Training the algorithm for each label

3.2.4 Classification

Classification is done by going through each instance, jumping from node to node. If a node is a leaf, the value of the leaf node is recorded for that label in the dataset being classified. Each leaf node leads to the next attribute, and the classifier automatically moves to that node. From there, it checks each child node and finds the value that corresponds to the dataset. If no value corresponds, the closest value of that attribute is chosen and the classifier jumps to that child.

3.2.5 Classification statistics

The correct labels are provided and checked against the classified test set. If the classified label corresponds to the correct data for an individual instance, it is counted as a true positive or true negative, depending on the value. If the label does not have the correct value, it is counted as a false positive or false negative. The number of positives and negatives per label is also counted while going through the data, so that label accuracy can be evaluated using the F-measure described under evaluation measurements.

3.2.6 Threading

Boost[5] is used to simplify the threading in the algorithm. Boost is a C++ library that adds basic functionality missing from the language. In the algorithm's case, Boost is used for handling threads and for easier reading of the ARFF files. Boost has been developed for many years in cooperation with multiple board members of the C++ standards committee, and some Boost functionality has been incorporated into the C++ standard.

There are multiple places where threading is used to increase the performance of the algorithm: calculating the gain between labels, training the algorithm, classifying the test data, and counting the values that have been correctly or incorrectly classified.

In the case of the trees, each tree's training phase is started in a separate thread. This is done for every label, in the previously determined order. When the number of started threads equals the number of threads to execute concurrently, the algorithm stops and waits for a thread to finish before starting the next.

For the label dependency calculations, the labels are split by the number of threads to use, divided into ranges of equal size from the first label to the last. Each thread then calculates the labels within the range it is given.
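A sketch of this range-splitting scheme (all names are illustrative; boost::thread is used since Boost is the thesis's threading library):

#include <boost/thread.hpp>
#include <cstddef>
#include <vector>

void processRange(std::size_t first, std::size_t last);  // work on one range, defined elsewhere

// Divide numItems into equal contiguous ranges, one thread per range;
// the last thread also takes any remainder items.
void runInParallel(std::size_t numItems, std::size_t numThreads) {
    std::vector<boost::thread> workers;
    std::size_t chunk = numItems / numThreads;
    for (std::size_t t = 0; t < numThreads; ++t) {
        std::size_t first = t * chunk;
        std::size_t last = (t + 1 == numThreads) ? numItems : first + chunk;
        workers.emplace_back(processRange, first, last);
    }
    for (auto& w : workers)
        w.join();  // wait for every range to finish
}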

For classification and classification statistics, the test dataset is split into equal portions, determined by the number of test instances divided by the number of threads. The threads then perform the classification and counting on the ranges they are given, so each thread processes an equally sized portion of the data, calculated beforehand.

Chapter 4 Method

4.1 Environment

In order to test the performance of MLID, both in terms of accuracy and precision and in terms of execution time, a range of different datasets were classified. For each set, the predictions and the execution time were recorded and compared to the results of algorithms run through the machine learning tool MEKA[6].

In order to provide reliable and comparable results, all tests, for both MLID and the MEKA algorithms, were performed separately, on a personal computer with the following specifications:

• Intel Core i7-6700HQ CPU @ 2.60 GHz
• 8 GB DDR4 RAM @ 2133 MHz
• GeForce GTX 960M
• 120 GB SSD @ 560 MB/s read / 400 MB/s write

4.2 WEKA/MEKA

WEKA (Waikato Environment for Knowledge Analysis)[6] is a software workbench licenced under the GNU General Public Licence. It was developed by the University of Waikato in 1997 and contains a collection of tools and algorithms to help visualize results from machine learning experiments. The MEKA extension offers added features for performing machine learning experiments on multi-label classification algorithms and datasets.

In order to evaluate MLID against already established algorithms, we use MEKA to run comparable multi-label algorithms on the same datasets used to test our own. Three different algorithms have been used in MEKA to test MLID against. The first is a classifier chain method using J48 as the tree builder.


The second setup is a binary relevance method using LMT (logistic model trees). The third setup is a back-propagating neural network using random trees as the builder. Each dataset is split in an 80/20 ratio of training and test data, and we evaluate the accuracy of each algorithm per set, as well as the F-measure and the total execution time.

4.3 Datasets

To test MLID we have chosen seven datasets of varying size, in order to provide a wide array of test cases. The datasets come from WEKA's library of standardised datasets for testing machine learning algorithms; they are all established, tested, and provide realistic input data, giving reliable test results.

To avoid getting invalid values when calculating information gain for attributes, each dataset is split into two parts for training and testing. The first part, 80% of the total dataset, is used for training the classifier; the remaining 20% is used to test whether the classifier can correctly classify the test set.

Since each dataset is split in a ratio of 80% to 20%, the number of training instances is lower than the total number of instances. For example, the dataset "yeast" is trained on 1932 instances and tested on the remaining 484 instances.

Table 4.1: The datasets used during testing.

Dataset    Instances   Attributes   Labels
Medical    977         1488         45
Birds      321         259          18
Yeast      2416        103          14
Enron      1701        1000         42
CAL500     501         67           173
Emotions   592         71           6
Flags      193         18           7

4.4 Evaluation measurements

Evaluating a single-label classifier is simple, as there are only two possible outcomes, correct or incorrect. Multi-label classification, however, also takes partially correct classifications into account. All performance measures range in [0...1], where a higher score is better.

4.4.1 Example based evaluation measures

Let Y_i be the set of true labels and Z_i the set of predicted labels for instance i, with the indicator function I(true) = 1 and I(false) = 0. N is the number of instances and L the number of labels.

\mathrm{Accuracy}(H, D) = \frac{1}{N} \sum_{i=1}^{N} \frac{|Y_i \cap Z_i|}{|Y_i \cup Z_i|}

\mathrm{Precision}(H, D) = \frac{1}{N} \sum_{i=1}^{N} \frac{|Y_i \cap Z_i|}{|Z_i|}

\mathrm{Recall}(H, D) = \frac{1}{N} \sum_{i=1}^{N} \frac{|Y_i \cap Z_i|}{|Y_i|}

The accuracy of a machine learning algorithm provides a criterion for how well the algorithm correctly identifies a condition, labeling instances as positive or negative. Accuracy alone, however, does not provide a detailed picture of how well the algorithm works. In a scenario where the majority of the input should be labeled negative, an algorithm that always labels data instances as negative would appear to have a high accuracy, even though it would miss every case where the label should have been positive. In order to get a more exhaustive view of how well the algorithm performs, precision and recall are used in addition to accuracy.

Precision measures how many of the selected items are relevant, or in this context, correctly classified, while recall measures how many of the relevant items were selected. Precision and recall can be combined into their harmonic mean using a measure called the "F1-score"[19]. The F1-score is a special case of the Fβ measure that balances the weights of precision and recall evenly. The following equation is used to determine the F-measure produced by each algorithm:

F = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}

The results will show a comparison between the evaluated algorithms in terms of accuracy and F-measure for each dataset, in order to determine the viability of the MLID algorithm as a multi-label classifier.
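A sketch of these example-based measures in code (illustrative, with the true and predicted label sets per instance; names are ours):

#include <algorithm>
#include <cstddef>
#include <iterator>
#include <set>
#include <vector>

struct Scores { double accuracy, precision, recall, f1; };

// Y[i] holds instance i's true label indices, Z[i] the predicted ones.
Scores evaluate(const std::vector<std::set<int>>& Y,
                const std::vector<std::set<int>>& Z) {
    double acc = 0.0, prec = 0.0, rec = 0.0;
    std::size_t N = Y.size();
    for (std::size_t i = 0; i < N; ++i) {
        std::set<int> inter, uni;
        std::set_intersection(Y[i].begin(), Y[i].end(), Z[i].begin(), Z[i].end(),
                              std::inserter(inter, inter.begin()));
        std::set_union(Y[i].begin(), Y[i].end(), Z[i].begin(), Z[i].end(),
                       std::inserter(uni, uni.begin()));
        if (!uni.empty())  acc  += double(inter.size()) / uni.size();
        if (!Z[i].empty()) prec += double(inter.size()) / Z[i].size();
        if (!Y[i].empty()) rec  += double(inter.size()) / Y[i].size();
    }
    acc /= N; prec /= N; rec /= N;
    double f1 = (prec + rec > 0.0) ? 2.0 * prec * rec / (prec + rec) : 0.0;
    return {acc, prec, rec, f1};
}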

4.4.2 Evaluations for execution time

The execution-time-based evaluation is measured in seconds for each classifier and dataset. Two different times are measured: one for training the classifier and one for testing data with the classifier.

Chapter 5 Results

5.1 Accuracy and F-Measure

In this section the information gathered related to accuracy and F-measure is presented. These two measurements are described in the section on evaluation measurements. The data is compared between several algorithms using the same metrics. The algorithms presented are a J48 implementation using classifier chains, a binary relevance method using LMT (logistic model trees), and a neural-network version of a random tree algorithm. The choice of algorithms follows the pattern of the dataset choices: we want a diverse field of datasets to test MLID on, and to be able to compare our algorithm to different types of multi-label implementations. The metrics used are common for multi-label classification and are all standard measurements used with WEKA/MEKA.

J48 is a Java implementation of Quinlan's C4.5 algorithm and therefore provides a closely related and modern algorithm. The random-tree-based neural network, on the other hand, provides a completely different approach to the problem, and finally binary relevance stands as a baseline algorithm, providing data from a simplistic approach.

In the section below, the F1-score is used to evaluate each dataset across the tested algorithms. The F-measure is a measurement for evaluating each label in a dataset. It is a combination of precision and recall for a classifier and is described under the section on evaluation measurements.


Figure 5.1: The accuracy in percentage for different datasets.

The following table presents the data on which the accuracy is based. Information for each individual label in each dataset is not presented. The information presented includes true positives, which are cases in which MLID has predicted a correct positive value for a label, and likewise true negatives. False values indicate where classification has not produced correct results; for example, false positives are labels where the classification should have been negative.

Table 5.1: The classification results for each dataset using MLID.

Dataset    True Positives   True Negatives   False Positives   False Negatives
Emotions   95               361              123               135
Yeast      707              3942             813               1314
CAL500     882              13123            1926              1643
Enron      486              16121            770               696
Medical    200              8525             47                48
Flags      103              71               58                41
Birds      7                1111             58                59

Figure 5.2: The averaged F-score in percentage produced by the tested algorithms.

5.2 Execution time

In this section three sets of data are presented, in the form of tables. The first is the time it takes to train each classifier. The second table presents the time it takes to classify a dataset, in seconds. The third presents the speed increase of the parallel version compared to the sequential version. These are further used to evaluate the performance of the algorithm. The next table presents the time to train a classifier for the various datasets and algorithms.

Table 5.2: Training time of the classifiers for each dataset in seconds.

Dataset    Sequential (s)   Parallel (s)   J48 (s)   LMT (s)     Random tree (s)
Emotions   13               4              0.241     2.918       1.044
Yeast      204              67             1.792     61.423      5.424
CAL500     605              244            3.03      15.437      2.491
Enron      13537            6535           63.495    7924.036    30.431
Medical    2015             1134           4.326     1017.619    24.78
Flags      2.72             1.89           0.072     1.074       0.18
Birds      59               25             0.323     19.112      1.562

The following table shows the classification time for an already built classifier. The execution time is presented for each classifier and is used for evaluating testing speed in the analysis chapter.

Table 5.3: The time for each algorithm to classify the test datasets.

Dataset    Sequential (s)   Parallel (s)   J48 (s)   LMT (s)   Random tree (s)
Emotions   2                1              0         0.003     0
Yeast      44               12             0.029     0.03      0.003
CAL500     29               11             1.108     0.718     0.006
Enron      12               8              0.186     1.477     0.01
Medical    3                2              0.032     0.62      0.008
Flags      0.29             0.28           0.001     0         0
Birds      2                2              0.009     0.018     0.001

The next table presents the speed increase for the algorithm when using parallelisation compared to a sequential execution when training the classifier.

Table 5.4: Speed up in percentage, running in parallel compared to sequential.

Dataset    Sequential (s)   Parallel (s)   Speed up (%)
Emotions   13               4              325%
Yeast      204              67             304%
CAL500     605              244            248%
Enron      13537            6535           207%
Medical    2015             1134           177%
Flags      2.72             1.89           144%
Birds      59               25             236%

Chapter 6 Analysis

6.1 Accuracy and F-Measure

Overall, the test results in terms of accuracy and F-score show that MLID performs above average compared to a similar classifier chain version of C4.5 (J48), the binary relevance algorithm and the random tree algorithm, as can be seen in figures 5.1 and 5.2.

Figure 5.1 indicates that, of the seven datasets tested, MLID outperforms the other tested algorithms in most cases when comparing accuracy. In one case the accuracy is merely on par with the other algorithms, and in one case the accuracy is substantially lower than that of two of the other algorithms. The LMT algorithm is below the accuracy of MLID in all tests, which can be related to LMT using binary relevance, where label dependency is not taken into account.

Looking at Table 5.1, it shows that the trees built are in general good at predicting the true negatives for labels. However, there is a big drop in correct classification of true positives. Averaging over the datasets above, there are in general 0.62 true positives for every false negative, which is reflected in the F-measure, which does not take true negatives into account.

Figure 5.2 shows that the MLID algorithm provides a better F-measure for three of the seven datasets. On three of the remaining datasets it is on par with the other algorithms, with one of those slightly below the random tree and J48 algorithms. On the last dataset it provides a worse score than two of the other algorithms, but still a better result than the random tree algorithm.

In the cases where the F-measure is above the other algorithms, it often wins by a large margin. There also seems to be a correlation between the high F-measure results and the datasets where the classifier provides a high accuracy.


One notable difference is the dataset Emotions, where all MEKA classifiers except random tree achieve a better F-measure. In the following, we try to explain why the different datasets show such vast differences in accuracy and/or F-score.

Since accuracy takes both the true negatives and the true positives into account, the values will be quite high because of the number of correctly classified negative values in comparison to the total number of predicted values. Comparing figures 5.1 and 5.2, the accuracy is in most cases above the other classifiers, but for the dataset Emotions the accuracy is higher for MLID while the F-measure is lower. The higher accuracy therefore rests solely on a higher number of true negatives. This in itself is not bad, since classifying a false positive is even worse and would have given a worse precision and, in the end, a worse F-score. The F-measure is based almost entirely on the number of true positives and false negatives, and these are usually more vital to the classification than true negatives.

Another dataset that sticks out is the Medical set, where both accuracy and F-measure are above all the other methods tested. The cause of this is not clear, since the composition of the dataset does not differ vastly from a dataset such as Enron. One big difference is the number of attributes compared to the other datasets.

However, when examining the number of attributes in relation to the number of labels, no correlation could be found that would explain the great difference in performance. The different approaches on the dataset Birds indicate that classifier chains with a random order can outperform a precalculated order of trees such as our algorithm uses. Both binary relevance and the MLID algorithm have low accuracy there, and in terms of F-score most classifiers have a hard time getting sufficient results. In those cases the most likely cause of the low F-measure is that the dataset does not provide enough information for a correct classification using the algorithms that were tested.

When comparing the sequential and parallel versions of the algorithm, no difference in performance related to F-score or accuracy could be observed. This was investigated on all the datasets, and no correlation with the number of threads used could be found. This was expected, since the calculations and the classification process are independent for each instance of data being classified; the algorithm was developed to run on several threads without affecting accuracy or F-score.

In general, the MLID algorithm outperforms or is on par with the other classifiers. It does not get excellent scores on a few datasets, but as shown in figures 5.1 and 5.2, the other classifiers seem to have the same issues on those sets.

6.2 Execution Time

The parallel execution was done with six parallel threads. For each dataset tested, the execution time was recorded and divided into training time and test time, which can be seen in table 5.2 and table 5.3. This was done for the classifier chain algorithm and our own MLID algorithm, as well as for a binary relevance version of LMT and a random forest algorithm using back-propagation neural networks. The results show an average improvement of 249.5% when executing the training of MLID in parallel compared to the sequential version, as can be seen in table 5.4.

As seen in table 5.4, the parallelisation is usually inconsequential on datasets where the numbers of labels, attributes and instances are small. This is because the trees are built quickly enough that the threading overhead cancels out most of the gain. As the datasets grow in size, however, parallelisation becomes more important. Tree complexity also seems to matter: on datasets where no single attribute dominates the dataset size, the benefit of parallelisation appears to be larger.
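As an illustration of this kind of per-tree parallelism, the sketch below trains one tree per label on its own task. It is a minimal sketch under the assumption that the trees can be built independently; the names Dataset, DecisionTree and buildTree are hypothetical stand-ins, not our actual implementation (which capped the pool at six threads):

    #include <functional>
    #include <future>
    #include <vector>

    // Hypothetical stand-ins for the real data structures.
    struct Dataset {};
    struct DecisionTree {};

    // Placeholder: in the real algorithm this would run ID3-style
    // induction for the given label on the training data.
    DecisionTree buildTree(const Dataset& data, int label) {
        (void)data; (void)label;
        return DecisionTree{};
    }

    // Train one tree per label concurrently. Each task only reads the
    // shared data and writes its own tree, so the result is identical
    // to a sequential run; only the wall-clock time changes. For small
    // datasets the cost of spawning tasks can outweigh the gain.
    std::vector<DecisionTree> trainParallel(const Dataset& data, int numLabels) {
        std::vector<std::future<DecisionTree>> tasks;
        tasks.reserve(numLabels);
        for (int label = 0; label < numLabels; ++label)
            tasks.push_back(std::async(std::launch::async, buildTree,
                                       std::cref(data), label));
        std::vector<DecisionTree> trees;
        trees.reserve(numLabels);
        for (auto& task : tasks)
            trees.push_back(task.get()); // blocks until that tree is done
        return trees;
    }

    int main() {
        Dataset data;
        auto trees = trainParallel(data, 6); // e.g. six labels
        return static_cast<int>(trees.size()) == 6 ? 0 : 1;
    }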

Comparing the speed-up for each dataset, the increase seems to be higher if many attributes are numerical than if the datasets consist mostly of nominal values. Compared to the other algorithms the training speed is lacking, against both J48 and the random forest algorithm. Against binary relevance there are cases where the MLID algorithm performs better, usually on datasets where the numbers of labels and attributes are high. Such datasets in conjunction with a large number of instances increase the execution time drastically. This is most likely because the algorithm has to check all available values and count them individually for each attribute in order to determine the information gain of that attribute. If the number of instances increases together with the number of attributes, the execution time grows greatly.
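This cost argument follows from the information gain computation ID3 performs at every node (the standard formulation, using the entropy H introduced in section 2.4):

\[
\mathrm{IG}(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v)
\]

Each term requires counting label occurrences in the subset \(S_v\), so evaluating a single node takes on the order of one pass over the instances for every candidate attribute. This per-node cost compounds when both the number of attributes and the number of instances grow.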

The classification speed of MLID is in all cases slower than the other algorithms tested, which can be seen in table 5.3. This is most likely due to the way MLID stores the data for a testing dataset rather than the algorithm itself: MEKA and WEKA have well-developed ways of storing data, while our algorithm uses a basic implementation, and this can explain the speed difference. Another contributing factor may be that ID3 does not use any form of pruning when training the classifier, which produces more complex trees to traverse when classifying data; this impacts performance negatively because calculating each label for the test dataset becomes more time-consuming.

The classification is, however, not very time-consuming in absolute terms, usually taking only a couple of seconds to classify between a few hundred and a few thousand instances. If the application using the machine-learning algorithm is not time-critical and does not process large amounts of data, the testing performance of the MLID algorithm is sufficient.

There exist cases where the time it takes to train the classifier is not the important factor; instead accuracy, F-measure and testing speed matter more. These are the cases where the classifier does not need to be retrained on a regular basis. If the classifier is trained only once and then used over a couple of months or even years, the training speed is not the deciding factor; what matters is instead the results in areas such as accuracy, F-measure and testing speed. This holds in some cases for MLID, but in other cases it cannot beat the faster algorithms in both accuracy and F-score. Many algorithms need an appropriate type of data to perform well on, which can be why the faster algorithms provide better results for some datasets. In the cases where MLID is given a well-suited dataset, it works as well as comparable multi-label algorithms, provided training time is not an important factor.

Chapter 7 Conclusions and Future Work

Our performance evaluations with regard to accuracy, precision and execution time against already existing algorithms show that MLID can perform better than the other tested multi-label algorithms in terms of accuracy and F-score.

In this section we answer the research questions stated in section 1.2. Drawing on the analysis, we address the questions one at a time. The questions are:

• Can the MLID algorithm be a viable option as a multi-label classifier in comparison to already established algorithms?

• How will parallelization affect accuracy and execution time of MLID in comparison to a sequential execution?

• How will accuracy and execution time be affected by large data-sets in comparison to smaller data-sets for MLID?

• How can the ID3 algorithm be extended to handle multi-label classification problems?

7.1 Is it a viable multi-label classifier?

In this work we have presented an implementation of a multi-label classifier based on the ID3 algorithm. The analysis of the results shows that, in both F-score and accuracy, the developed algorithm outperforms the algorithms it was evaluated against in most cases. For accuracy it outperforms the other algorithms in five out of six cases. For F-measure it clearly outperforms the other algorithms in three out of six cases, and for one more dataset it performs better than the other algorithms, albeit not by as large a margin as in the other three cases.

The biggest issue with the algorithm is the speed when training the classifiers. In cases where the tree does not have to be rebuilt on a regular basis, the training speed can be neglected, as training sessions would occur infrequently. In these cases the MLID algorithm is a viable option when choosing a multi-label classification algorithm.

In cases where the number of instances does not exceed several thousand, and the test time does not need to be under a minute, the classification speed is acceptable. With the parallel approach the time it takes to classify the test data is usually under ten seconds, though in a few cases it is higher. In cases where classification speed does not need to be fast, the algorithm is more than suitable because of its higher accuracy and F-measure compared to the other tested algorithms, as presented in figure 5.1 and figure 5.2.

We deem these results to show that this approach to a multi-label algorithm is a viable option when compared to other algorithms, as supported by the classification results in the analysis chapter.

7.2 How is performance affected by parallelisation?

In terms of execution time we found that, for the datasets we used, the speed-up for building the classifier and classifying was on average 234.4%, as can be seen in table 5.4. Since the application was run on six threads, this is not a linear improvement in the number of threads. This was expected, since it is very hard to get an optimal increase for every thread. The results could probably be improved with another approach to handling the threads.
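To make this concrete (a back-of-the-envelope reading, assuming 234.4% denotes the ratio of sequential to parallel time, i.e. a speed-up factor of \(S \approx 2.34\)), the parallel efficiency on \(p = 6\) threads is:

\[
E = \frac{S}{p} \approx \frac{2.34}{6} \approx 0.39
\]

That is, each thread delivers roughly 39% of its ideal contribution, which is consistent with threading overhead and uneven tree-building times across labels.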

The algorithm was developed with parallelisation in mind, which led to a solution that is unaffected, except for execution time, by whether the application runs on multiple threads or a single thread. Because of this, the performance related to accuracy and F-score did not show any difference regardless of the number of threads.

We conclude that in cases where accuracy and F-score are the more important factors, this algorithm performs better than many other algorithms. In cases where speed is crucial, parallelisation does not increase performance enough for MLID to compete with the faster algorithms tested against. Where correctness matters most, this approach is more than suitable compared to approaches such as the J48, LMT or Random Tree algorithms. Performance related to accuracy and F-score is not affected by parallelisation.

7.3 How is performance affected by large datasets in comparison to small datasets?

It is hard to reach a conclusive answer when it comes to dataset size. Medical, which is a large set, is the best when it comes to both F-score and accuracy. In the three cases of enron, CAL500 and medical, the F-score is above all the others. These are moderately large sets, which would suggest that performance in general increases with size, although enron is bigger than medical. Bigger is thus not always better, which indicates that the composition of the dataset, in terms of attributes, attribute types and the number of labels, is more important than size.

The dataset flags is small but still performs well compared to other datasets of approximately the same size. This indicates that small datasets can perform well if the composition is good, but the bigger the set, the higher the likelihood that it contains enough information for a good result.

Our conclusion is that the results point towards performance improving with the size of the dataset, with larger datasets tending to be more accurate. MLID should, however, be tested on more and larger datasets, which time did not allow for during this thesis, before a firm conclusion about the effect of dataset size can be reached.

7.4 Is this a viable approach for extending ID3?

Looking at the scores, this approach works in terms of classification results. It can outperform the other algorithms in several areas, while lacking in others. In cases where training time is not important, but performance related to accuracy and F-score matters more, the MLID algorithm proves to perform better than many other algorithms.

There are obviously weaknesses, since other algorithms perform better in some cases, for example on the birds dataset, where both accuracy and F-score are below most of the algorithms tested against, as can be seen in figure 5.1 and figure 5.2. Many algorithms fit specific datasets, which has to be taken into account when a machine-learning algorithm needs to be adopted.

A big enough training set points towards better performance, so the algorithm should be run on large datasets where execution time is not an issue, because of the time issues related to the algorithm. In conclusion, from our point of view the approach is a viable option for handling a multi-label extension of the ID3 algorithm.

7.5 Future work

Future work would be to do more extensive testing on larger datasets, since the execution time of the algorithm did not allow testing on larger sets within the time frame of this thesis. Greater variation in the datasets' numbers of instances, attributes and labels would be needed to reach a conclusive answer on the viability of the algorithm.

Work building on this thesis would also involve testing the algorithm against more algorithms that can handle multi-label problems. This is crucial since the tests performed show better performance in most cases against the algorithms tested, but do not give a definite answer. It would also involve testing on more, and more varied, datasets, with different numbers of attributes and labels and different attribute value types, to reach a more conclusive answer.

Working on a multi-label version of the ID3 algorithm has highlighted some issues regarding the functionality present. To improve the viability of the MLID algorithm, support for additional features would be required, many of which would follow the development path of the original ID3 algorithm, such as:

• Handling continuous and discrete values - These values could be handled by creating a threshold at the time of tree building and splitting the attribute's values into those above the threshold and those less than or equal to it (see the sketch after this list).

• Pruning trees - The efficiency of the classifier could be improved by pruning the tree after it has been created. Pruning is done by removing branches that do not contribute to classification and replacing them with leaf nodes instead.

• Handling attributes with different costs - Introduce penalties and learning costs to the system. Each attribute is associated with a cost of being evaluated, and the system is restricted to a fixed budget. This helps the classifier increase its sensitivity and avoid the most costly errors.

• Support weighting cases - Enable the algorithm to weight different cases and misclassification types.
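As a sketch of the first item above: the C4.5-style treatment of a continuous attribute enumerates candidate thresholds as midpoints between consecutive distinct sorted values, and the tree builder then picks the threshold with the highest information gain. The code below is illustrative only (the names are hypothetical, not from our implementation), showing how the candidate thresholds could be generated:

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Candidate binary split for a continuous attribute: values less
    // than or equal to the threshold go left, values above go right.
    struct Split { double threshold; };

    // Enumerate candidate thresholds as midpoints between consecutive
    // distinct sorted values; the best candidate would then be chosen
    // by information gain, as for nominal attributes.
    std::vector<Split> candidateSplits(std::vector<double> values) {
        std::sort(values.begin(), values.end());
        std::vector<Split> splits;
        for (std::size_t i = 1; i < values.size(); ++i)
            if (values[i - 1] < values[i])
                splits.push_back({(values[i - 1] + values[i]) / 2.0});
        return splits;
    }

    int main() {
        // Example: attribute values from a toy dataset.
        std::vector<double> v{2.0, 3.5, 3.5, 7.0};
        auto splits = candidateSplits(v); // thresholds 2.75 and 5.25
        return static_cast<int>(splits.size()) == 2 ? 0 : 1;
    }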
