The Use of Genetic Programming for Detecting the Incorrect Predictions of Classification Models
The use of Genetic Programming for detecting the incorrect predictions of Classification Models

by Adrianna Maria Napiórkowska

Dissertation presented as partial requirement for obtaining the Master's degree in Advanced Analytics

NOVA Information Management School
Instituto Superior de Estatística e Gestão de Informação
Universidade Nova de Lisboa

Advisor: Leonardo Vanneschi

Lisbon, November 27th, 2019

Abstract

Companies around the world use Advanced Analytics to support their decision-making process. Traditionally they relied on Statistics and Business Intelligence for that, but as technology advances, more complex models are gaining popularity. The main reason for the increasing interest in Machine Learning and Deep Learning models is the fact that they reach a high prediction accuracy. On the other hand, with good performance comes an increasing complexity of the programs. Therefore a new field was introduced, called Explainable AI. The idea is to create models that can be understood by business users, or models that explain the predictions of other models. We therefore propose a study in which we create a separate model that serves as a verifier of the predictions made by machine learning models. This work falls into the area of post-processing of model outputs. For this purpose we select Genetic Programming, which has proven successful in various applications. Within the scope of this research we investigate whether GP can evaluate the predictions of other models. This area of application has not been explored yet; therefore in the study we explore the possibility of evolving an individual that validates another model. We focus on classification problems and select four machine learning models (logistic regression, decision tree, random forest and perceptron) and three different datasets. This setup is used to assess whether the presented idea is universal across different problems. The performance of 12 Genetic Programming experiments indicates that in some cases it is possible to create a successful model for error prediction. During the study we discovered that the performance of GP programs is mostly connected to the dataset on which the experiment is conducted; the type of predictive model does not influence the performance of GP. Although we managed to create good classifiers of errors, during the evolution process we faced the problem of overfitting, which is common in problems with imbalanced datasets. The results of the study confirm that GP can be used for this new type of problem and successfully predict the errors of Machine Learning models.

Keywords: Machine Learning, Explainable AI, Post-processing, Classification, Genetic Programming, Errors Prediction

Table of contents

List of Figures
List of Tables
1 Introduction
2 Machine Learning
   2.1 Models interpretability
   2.2 Explainable AI
3 Genetic Programming
   3.1 General structure
   3.2 Initialization
   3.3 Selection
   3.4 Replication and Variation
   3.5 Applications
4 Experimental study
   4.1 Research Methodology
      4.1.1 Data Flow in a project
      4.1.2 Predictive models used in the study
      4.1.3 Dataset used in the study
   4.2 Experimental settings
   4.3 Experimental results
5 Conclusions and future work
Bibliography

List of Figures

2.1 Types of machine learning problems
2.2 Machine learning process
2.3 Deep Learning solution
2.4 Modified machine learning process
3.1 Example of a tree generation process using the full method
3.2 Example of a tree generation process using the grow method
3.3 Example of subtree crossover
3.4 Example of subtree mutation
4.1 Visualization of the test cases preparation
4.2 Datasets transformations used in the experimental study and steps applied in the process
4.3 Distribution of the dependent variable in the Breast Cancer Wisconsin dataset
4.4 Distribution of the dependent variable in the Bank Marketing dataset
4.5 Distribution of the target variable in the Polish Companies Bankruptcy dataset before and after up-sampling
4.6 Implementation of the research idea
4.7 Data split conducted in the project
4.8 Summary of the results for the Breast Cancer Wisconsin Dataset Test Cases
4.9 Summary of the results for the Bank Marketing Dataset Test Cases
4.10 Summary of the results for the Polish Companies Bankruptcy Dataset Test Cases
4.11 Comparison of the performance of the best GP programs from different runs calculated on the test set
4.12 Average of Maximum Train Fitness summarized by Model and Test Case

List of Tables

4.1 Summary of the predictions used as test cases
4.2 Comparison between Confusion Matrices obtained by two different Fitness Functions
4.3 Summary of the parameters selected for test cases
4.4 Best Individuals found for the Breast Cancer Wisconsin Dataset Test Cases
4.5 Best Individuals found for the Bank Marketing Dataset Test Cases
4.6 Best Individuals found for the Polish Companies Bankruptcy Dataset Test Cases

Chapter 1

Introduction

The history of algorithms begins in the 19th century, when Ada Lovelace, a mathematician and poet, wrote an article describing a concept that would allow the Analytical Engine to repeat a series of instructions. This method is known nowadays as a loop, a construct widely used in computer programming. In her work she described how code could be written for a machine to handle not only numbers, but also letters and commands. She is considered the author of the first algorithm and the first computer programmer. Although Ada Lovelace did not have a computer as we have today, the ideas she developed are present in various algorithms and methods used nowadays. Since that time, researchers and scientists have focused on the optimization of work and the automation of repetitive tasks, and over the years they have developed a wide range of methods for that purpose. In addition, the objective of much research has been to allow computer programs to learn. This ability could help in various areas, from learning how to treat diseases based on medical records, through applying predictive models in areas where classic approaches are not effective, to creating a personal assistant that can learn and optimize our daily tasks. All of these concepts can be described as machine learning. According to Mitchell (1997), an understanding of how to make computers learn would create new areas for customization and development.
In addition, detailed knowledge of machine learning algorithms and the ways they work might lead to a better comprehension of human learning abilities. Many computer programs were developed by implementing useful types of learning, and they started to be used in commercial projects. According to research, these algorithms have outperformed other methods in various areas, such as speech and image recognition, knowledge discovery in large databases, and creating programs that act like a human, e.g. chatbots and game-playing programs.

On the one hand, intelligent systems are very accurate and have high predictive power. On the other hand, they are described by a large number of parameters, hence it is more difficult to draw direct conclusions from the models and trust their predictions. Therefore research in the area of Explainable AI has become very popular, and there is a need for analysis of the outputs of predictive models. There are areas of study and business applications that especially require transparency of the applied models, e.g. banking and the loan approval process. One reason for this is new regulations protecting personal data, such as the General Data Protection Regulation (GDPR), which requires entrepreneurs to be able to delete sensitive personal data upon request and protects consumers with a new right: the Right to Explanation. It has affected business in Europe since May 2018 and has increased the importance of the field of Explainable AI, as mentioned in the publication Current Advances, Trends and Challenges of Machine Learning and Knowledge Extraction: From Machine Learning to Explainable AI by Holzinger et al. (2018). The application of AI in many fields is very successful, but as stated in the mentioned article: "We are reaching a new AI spring. However, as fantastic current approaches seem to be, there are still huge problems to be solved: the best performing models lack transparency, hence are considered to be black boxes. The general and worldwide trends in privacy, data protection, safety and security make such black box solutions difficult to use in practice."

Therefore, in order to align with this regulation and provide trustworthy predictions, in many cases an additional step of post-processing of the predictions is applied. A good model should generate decisions with high certainty. The first indication of this is high performance observed during the training phase. Secondly, the results of evaluation on the test and validation sets should not diverge significantly, proving the stability of the solution. In this area, the use of post-processing of outputs can be very beneficial. If the model predicts which loans will not be repaid, the cost of a wrong prediction can be very high when a loan is granted to a bad customer. Therefore banking institutions spend a lot of time and resources on improving their decision-making process.
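To make the post-processing idea concrete, the sketch below shows how such a verifier can be set up: a primary classifier is trained, its predictions on held-out data are relabelled as correct or incorrect, and a second model is trained to recognize the incorrect ones. This is only a minimal illustration assuming scikit-learn and the Breast Cancer Wisconsin dataset (one of the three datasets used later in the study); the split proportions and the decision tree standing in for the evolved Genetic Programming individual are illustrative choices, not the configuration used in the experiments.

# Minimal sketch of the post-processing / error-detection idea (assumptions noted above).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

X, y = load_breast_cancer(return_X_y=True)

# Split the data: one part trains the primary model, one part builds the
# error-labelled dataset, one part evaluates the error detector.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=42)
X_post, X_eval, y_post, y_eval = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

# Primary classification model whose mistakes we want to detect.
primary = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
primary.fit(X_train, y_train)

# Relabel the post-processing data: 1 = primary model was wrong, 0 = correct.
errors_post = (primary.predict(X_post) != y_post).astype(int)
errors_eval = (primary.predict(X_eval) != y_eval).astype(int)

# Secondary "verifier" trained to flag predictions that are likely incorrect
# (the role played by the evolved GP individual in the thesis).
verifier = DecisionTreeClassifier(max_depth=4, class_weight="balanced", random_state=42)
verifier.fit(X_post, errors_post)

print(classification_report(errors_eval, verifier.predict(X_eval), digits=3))

In the study itself the error-detecting model is an evolved GP individual rather than a decision tree, and Chapter 4 describes the actual datasets, data splits and fitness functions used.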