Fast and Slow Machine Learning
Total Page:16
File Type:pdf, Size:1020Kb
Fast and Slow Machine Learning These` de doctorat de l’Universite´ Paris-Saclay prepar´ ee´ a` Tel´ ecom´ ParisTech Ecole doctorale n◦580 Sciences et Technologies de l’Information et de la Communication (STIC) Specialit´ e´ de doctorat: Informatique These` present´ ee´ et soutenue a` Paris, le 7 mars 2019, par NNT : 2019SACLT014 JACOB MONTIEL LOPEZ´ Composition du Jury : Ricard Gavalda` Professor, Universitat Politecnica` de Catalunya President´ Joao˜ Gama Associate Professor, University of Porto Rapporteur Georges Hebrail´ Senior Researcher, Electricit´ e´ de France Rapporteur Themis Palpanas Professor, Universite´ Paris Descartes (LIPADE) Examinateur Jesse Read Assistant Professor, Ecole´ Polytechnique (LIX) Examinateur Albert Bifet Professor, Tel´ ecom´ ParisTech (LTCI) Directeur de recherche Talel Abdessalem Professor, Tel´ ecom´ ParisTech (LTCI) Directeur de these` ` ese de doctorat Th FASTANDSLOWMACHINELEARNING jacob montiel lópez Thesis submitted for the Degree of Doctor of Philosophy at Université Paris-Saclay – Télécom Paristech March 2019 Jacob Montiel López: Fast and Slow Machine Learning, c March 2019 supervisors: Albert Bifet Talel Abdessalem location: Paris, France To Dennis, for your continuous teaching that life is here and now, without you none of this would have been possible. ABSTRACT The Big Data era has revolutionized the way in which data is created and processed. In this context, multiple challenges arise given the mas- sive amount of data that needs to be efficiently handled and processed in order to extract knowledge. This thesis explores the symbiosis of batch and stream learning, which are traditionally considered in the literature as antagonists. We focus on the problem of classification from evolving data streams. Batch learning is a well-established approach in machine learning based on a finite sequence: first data is collected, then predictive mod- els are created, then the model is applied. On the other hand, stream learning considers data as infinite, rendering the learning problem as a continuous (never-ending) task. Furthermore, data streams can evolve over time, meaning that the relationship between features and the corresponding response (class in classification) can change. We propose a systematic framework to predict over-indebtedness, a real-world problem with significant implications in modern society. The two versions of the early warning mechanism (batch and stream) outperform the baseline performance of the solution implemented by the Groupe BPCE, the second largest banking institution in France. Additionally, we introduce a scalable model-based imputation method for missing data in classification. This method casts the imputation problem as a set of classification/regression tasks which are solved incrementally. We present a unified framework that serves as a common learning platform where batch and stream methods can positively interact. We show that batch methods can be efficiently trained on the stream setting under specific conditions. The proposed hybrid solution works under the positive interactions between batch and stream methods. We also propose an adaptation of the Extreme Gradient Boosting (XG- Boost) algorithm for evolving data streams. The proposed adaptive method generates and updates the ensemble incrementally using mini- batches of data. Finally, we introduce scikit-multiflow, an open source framework in Python that fills the gap in Python for a developmen- t/research platform for learning from evolving data streams. vii What we want is a machine that can learn from experience. — Alan M. Turing ACKNOWLEDGMENTS I believe that humanity moves forward by the sum of actions from a multitude of individuals. Research, as a human activity, is no different. This thesis is possible thanks to the people and institutions that helped me during my PhD. Each and every one deserve my acknowledgment and have my gratitude. I would like to thank my advisers, Prof. Albert Bifet and Prof. Talel Abdessalem for their continuous support, advice and mentoring. Thank you for your trust and encouragement to do my best and keep moving forward. I am certain that their human quality is one of the main factors in my positive experience working towards the completion of my PhD. To my wife Dennis, who always believed in me and walked next to me through this adventure, always willing to hear about my research even when it was puzzling to her. To my parents, María Elena and José Alejandro, who layed the ground upon which I have been able to pursue all my dreams. To my family and friends whose continuous encouragement accompanies me on each step. To the members of the Data, Intelligence and Graphs team: Jean- Louis, Mauro, Fabian, Pierre, Antoine, Camille, Mostafa, Heitor, Marie, Maroua, Dihia, Miyoung, Jonathan, Quentin, Pierre-Alexandre, Thomas [D], Thomas [F], Julien, Atef, Marc, Etienne, Nedeljko, Oana, Luis, Max- imilien, Mikaël, Jean-Benoit and Katerina. As well as to the honorary members of the team: Nicoleta and Ziad. Our lunch-break discussions were a very welcomed break from the routine and are cherished mem- ories from Télécom ParisTech. Special thanks to Prof. Jesse Read who taught me a lot about the practical side of machine learning, and to Prof. Rodrigo Fernandes de Mello whose passion for the theoretical elements of machine learning is an inspiration. To the Machine Learning group at the University of Waikato for all their attentions during my visit. Special thanks to Prof. Bernhard Pfahringer, Prof. Eibe Frank and Rory Mitchell for the very interesting and enlightening discussions that led to our research collaboration. Last but not least, I would like to thank the Government of Mexico for funding my doctoral studies via the National Council for Science and Technology (Consejo Nacional de Ciencia y Tecnología, CONACYT, scholarship 579465). ix CONTENTS i introduction 1 introduction 3 1.1 Motivation . 3 1.2 Challenges and Opportunities . 5 1.3 Open Data Science . 6 1.4 Contributions . 7 1.5 Publications . 9 1.6 Outline . 10 2 preliminaries and related work 11 2.1 Streaming Supervised Learning . 11 2.1.1 Performance Evaluation . 13 2.1.2 Concept Drift . 13 2.1.3 Ensemble Learning . 15 2.1.4 Incremental Learning . 17 2.2 Over-Indebtedness Prediction . 18 2.3 Missing Data Imputation . 19 ii core content 3 over-indebtedness prediction 25 3.1 A Multifaceted Problem . 26 3.1.1 Feature Selection . 27 3.1.2 Data Balancing . 28 3.1.3 Supervised Learning . 28 3.1.4 Stream Learning . 30 3.2 A Data-driven Warnining Mechanism . 30 3.2.1 Generalization . 33 3.3 Experimental Evaluation . 33 3.4 Results . 36 3.4.1 Feature Selection . 36 3.4.2 Data Balancing . 39 3.4.3 Batch vs Stream Learning . 40 4 missing data imputation at scale 43 4.1 A Model-based Imputation Method . 43 4.1.1 Cascade Imputation . 44 4.1.2 Multi-label Cascade Imputation . 46 4.2 Experimental Evaluation . 49 4.2.1 Impact on Classification . 50 4.2.2 Scalability . 52 4.2.3 Imputed vs Incomplete Data . 52 4.2.4 Multi-label Imputation . 52 4.2.5 Measuring Performance . 53 xi xii contents 4.3 Results . 55 5 learning fast and slow 61 5.1 From Psychology to Machine Learning . 62 5.2 Fast and Slow Learning Framework . 63 5.2.1 Fast and Slow Classifier . 63 5.3 Experimental Evaluation . 68 5.4 Results . 70 6 adaptive xgboost 79 6.1 Adapting XGBoost to Stream Learning . 80 6.1.1 Preliminaries . 80 6.1.2 Adaptive eXtreme Gradient Boosting . 81 6.1.3 Ensemble Update . 82 6.1.4 Handling Concept Drift . 83 6.2 Experimental Evaluation . 85 6.3 Results . 87 6.3.1 Predictive Performance . 87 6.3.2 Hyper-parameter Relevance . 91 6.3.3 Memory and Model Complexity . 94 6.3.4 Training Time . 96 7 scikit-multiflow 99 7.1 Architecture . 101 7.2 In a Nutshell . 102 7.2.1 Data Streams . 102 7.2.2 Stream Learning Experiments . 103 7.2.3 Concept Drift Detection . 105 7.3 Development . 106 iii conclusions 8 future work and conclusions 109 8.1 Future Directions . 109 8.1.1 Over-indebtedness Prediction . 109 8.1.2 Missing Data Imputation . 109 8.1.3 Fast and Slow Learning . 110 8.1.4 Adaptive XGBoost . 110 8.2 Final Remarks . 110 iv appendices a cascade imputation 115 b résumé en français 123 b.1 Défis et opportunités . 123 b.2 Open Data Science . 124 b.3 Contributions . 125 bibliography 129 Part I INTRODUCTION 1 INTRODUCTION In recent years we have witnessed a digital revolution that has dramat- ically changed the way in which we generate and consume data. In 2016, 90% of the world’s data was created just on the last 2 years and is expected that by 2020 the digital universe will reach 44 zettabytes (44 trillion gigabytes1). This new paradigm of ubiquitous data has impacted different sectors of society including government, healthcare, banking and entertainment, to name a few. Due to the extent of its potential data has been dubbed “the oil of the digital era”2. The term Big Data is used to refer to this groundbreaking phe- nomenon. A continuous effort exists to delimit this term; a popular approach are the so called “4 Vs” of big data: Volume, Velocity, Variety and Veracity. Volume is related to the scale of the data which varies depending on the application. Velocity considers the rate at which data is generated/collected. Variety represents the different types of data, including traditional structured data and emerging unstructured data. Finally, Veracity corresponds to the reliability that can be attributed to the data. Another significant factor is the dramatic growth of the Internet of Things (IoT), which is the ecosystem where devices (things) connect, interact and exchange data. Such devices can be analog or digital, e.g. cars, airplanes, cellphones, etc. By 2013, 20 billion of such devices were connected to the internet and this number is expected to grow to 32 billion by 2020, this means an increment from 7% to 15% of the total number of connected devices.