Improving IoT data stream analytics using summarization techniques Maroua Bahri

To cite this version:

Maroua Bahri. Improving IoT Data Stream Analytics Using Summarization Techniques. Machine Learning [cs.LG]. Institut Polytechnique de Paris, 2020. English. NNT: 2020IPPAT017. tel-02865982.

HAL Id: tel-02865982 https://tel.archives-ouvertes.fr/tel-02865982 Submitted on 12 Jun 2020

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Improving IoT Data Stream Analytics Using Summarization Techniques

Doctoral thesis of the Institut Polytechnique de Paris, prepared at Télécom Paris

Doctoral school no. 626 (École doctorale n°626). Doctoral specialty: Computer Science. NNT: 2020IPPAT017. Thesis presented and defended in Palaiseau on 5 June 2020 by

MAROUA BAHRI

Composition of the jury:

Albert Bifet, Professor, Télécom Paris (thesis co-supervisor)
Silviu Maniu, Associate Professor, Université Paris-Sud (thesis co-supervisor)
João Gama, Professor, University of Porto (president)
Cédric Gouy-Pailler, Engineer-Researcher, CEA-LIST (examiner)
Ons Jelassi, Researcher, Télécom Paris (examiner)
Moamar Sayed-Mouchaweh, Professor, École Nationale Supérieure des Mines de Douai (reviewer)
Mauro Sozio, Associate Professor, Télécom Paris (examiner)
Maguelonne Teisseire, Professor, TETIS, INRAE (reviewer)

Improving IoT Data Stream Analytics Using Summarization Techniques

Maroua Bahri

A thesis submitted in fulfillment of the requirements for the degree of Doctor of Philosophy

Télécom Paris
Institut Polytechnique de Paris

Supervisors: Albert Bifet, Silviu Maniu

Examiners: João Gama, Cédric Gouy-Pailler, Ons Jelassi, Moamar Sayed-Mouchaweh, Mauro Sozio, Maguelonne Teisseire

Paris, 2020

Abstract

With the evolution of technology, the use of smart Internet-of-Things (IoT) devices, sensors, and social networks results in an overwhelming volume of IoT data streams, generated daily from several applications, that can be transformed into valuable information through data mining tasks. In practice, multiple critical issues arise when extracting useful knowledge from these evolving data streams, mainly that the stream needs to be efficiently handled and processed. In this context, this thesis aims to improve the performance (in terms of memory and time) of existing algorithms on streams. We focus on the classification task in the streaming framework. This task is challenging on streams, principally due to the high – and increasing – data dimensionality, in addition to the potentially infinite amount of data. These two aspects make the classification task harder. The first part of the thesis surveys the current state of the art of classification and dimensionality reduction techniques as applied to the stream setting, providing an updated view of the most recent works in this vibrant area. In the second part, we detail our contributions to the field of stream classification by developing novel approaches based on summarization techniques that aim to reduce the computational resources of existing classifiers with no – or minor – loss of classification accuracy. To address high-dimensional data streams and make classifiers efficient, we incorporate an internal preprocessing step that consists in incrementally reducing the dimensionality of the input data before feeding them to the learning stage. We present several approaches applied to several classifiers: Naive Bayes, enhanced with sketches and the hashing trick; k-NN, using compressed sensing and UMAP; and their integration into ensemble methods.

I dedicate this thesis to my sweet loving

Father, Mother & Siblings, whose unconditional love, encouragement, affection, and prayers day and night have made me able to achieve such success and honor. Without them, I would never have gotten through this.

Table of Contents

Abstract

List of Figures

List of Tables

List of Abbreviations

List of Symbols

I  Introduction and Background

1 Introduction
  1.1 Context and Motivation
  1.2 Challenges
  1.3 Contributions
  1.4 Publications
  1.5 Outline

2 Stream Setting: Challenges, Mining and Summarization Techniques
  2.1 Introduction
  2.2 Preliminaries
    2.2.1 Processing
    2.2.2 Summarization
  2.3 Stream Supervised Learning
    2.3.1 Frequency-Based Classification
    2.3.2 Neighborhood-Based Classification
    2.3.3 Tree-Based Classification
    2.3.4 Ensembles-Based Classification
  2.4 Dimensionality Reduction
    2.4.1 Data-Dependent Techniques
    2.4.2 Data-Independent Techniques
    2.4.3 Graph-Based Techniques
  2.5 Evaluation Metrics
  2.6 Discussions
  2.7 Conclusion

II  Summarization-Based Classifiers

3 Sketch-Based Naive Bayes
  3.1 Introduction
  3.2 Preliminaries
    3.2.1 Naive Bayes Classifier
    3.2.2 Count-Min Sketch
  3.3 Sketch-Based Naive Bayes Algorithms
    3.3.1 SketchNB Algorithm
    3.3.2 AdaSketchNB Algorithm
    3.3.3 SketchNB_HT and AdaSketchNB_HT Algorithms
  3.4 Experimental Evaluation
    3.4.1 Datasets
    3.4.2 Results and Discussions
  3.5 Conclusion

4 Compressed k-Nearest Neighbors Classification
  4.1 Introduction
  4.2 Preliminaries
    4.2.1 Construction of Sensing Matrices
  4.3 Compressed Classification Using kNN Algorithm
    4.3.1 Theoretical Insights
    4.3.2 Application to Persistent Homology
  4.4 Compressed kNN Ensembles
  4.5 Experimental Evaluation
    4.5.1 Datasets
    4.5.2 Results and Discussions
  4.6 Conclusion

5 Compressed Adaptive Random Forest Ensemble
  5.1 Introduction
  5.2 Motivation
  5.3 Compressed Adaptive Random Forest
  5.4 Experimental Evaluation
    5.4.1 Datasets
    5.4.2 Results and Discussions
  5.5 Conclusion

6 Batch-Incremental Classification Using UMAP
  6.1 Introduction
  6.2 Related Work
  6.3 Batch-Incremental Classification
    6.3.1 Prior Work
    6.3.2 Algorithm Description
  6.4 Experimental Evaluation
    6.4.1 Datasets
    6.4.2 Results and Discussions
  6.5 Conclusion

III  Concluding Remarks

7 Conclusions and Future Work
  7.1 Conclusions
    7.1.1 Naive Bayes Classification
    7.1.2 Lazy learning
    7.1.3 Ensembles
  7.2 Open Issues and Future Directions

Appendix A Open Source Contributions
  A.1 Introduction
  A.2
  A.3 Scikit-Multiflow
  A.4 Conclusion

Appendix B Résumé en Français
  B.1 Contexte et Motivation
  B.2 Défis
  B.3 Contributions

References

List of Figures

1.1 IoT connected devices
1.2 The thesis context
1.3 Contributions of the thesis
2.1 Sliding window model
2.2 Landmark window of size 13
2.3 Damped window model
2.4 The data stream classification cycle
2.5 Taxonomy of classification algorithms
2.6 Ensemble classifier
2.7 Taxonomy of dimensionality reduction techniques
2.8 Handling data stream constraints
3.1 Count-min sketch
3.2 Hashing trick
3.3 Classification accuracy with different sketch sizes
3.4 Sorted plots of accuracy, memory and time over different output dimensions
4.1 Compressed kNN scheme
4.2 Sorted plots of accuracy, time and memory over different output dimensions
5.1 CS-ARF and ARF comparison
5.2 Accuracy comparison over different output dimensions on the Tweet1 dataset
5.3 The standard deviation of the methods while projecting into different dimensions
5.4 Memory comparison of the ensemble-based methods on all the datasets
6.1 Projection of the CNAE dataset in 2-dimensional space
6.2 Stream of mini-batches
6.3 Batch-incremental UMAP-kNN scheme
6.4 Accuracy while varying the chunk size and the number of neighbors
6.5 Comparison of UMAP-kNN, tSNE-kNN, PCA-kNN, and kNN (with the entire datasets) while projecting into 3 dimensions
6.6 Comparison of UMAP-kNN, PCA-kNN, UMAP-SkNN, and UMAP-HAT over different output dimensions using Tweet1
A.1 MOA main window
A.2 SketchNB classifier
A.3 Hashing trick filter
A.4 CS-kNN algorithm
A.5 CSB-kNN algorithm

List of Tables

1.1 Comparison between static and stream data
2.1 Window models comparison
3.1 Overview of the datasets
3.2 Accuracy comparison of SNB, GMS, NB, kNN, ASNB, AdNB, and HAT
3.3 Memory comparison of SNB, GMS, NB, kNN, ASNB, AdNB, and HAT
3.4 Time comparison of SNB, GMS, NB, kNN, ASNB, AdNB, and HAT
3.5 Accuracy comparison of SNB_HT, GMS_HT, NB_HT, kNN_HT, ASNB_HT, AdNB_HT, and HAT_HT
3.6 Memory comparison of SNB_HT, GMS_HT, NB_HT, kNN_HT, ASNB_HT, AdNB_HT, and HAT_HT
3.7 Time comparison of SNB_HT, GMS_HT, NB_HT, kNN_HT, ASNB_HT, AdNB_HT, and HAT_HT
4.1 Accuracy comparison of CS with Bernoulli, Gaussian, Fourier matrices, and the entire dataset
4.2 Overview of the datasets
4.3 Performance of kNN with different window sizes
4.4 Accuracy comparison of CS-kNN, CS-samkNN, HT-kNN, PCA-kNN, and kNN over the whole dataset
4.5 Time comparison of CS-kNN, CS-samkNN, HT-kNN, PCA-kNN, and kNN over the whole dataset
4.6 Memory comparison of CS-kNN, CS-samkNN, HT-kNN, PCA-kNN, and kNN over the whole dataset
4.7 Accuracy comparison of CS-kNN_LB, CSB-kNN, CS-HTree_LB, and CS-ARF
4.8 Time comparison of CS-kNN_LB, CSB-kNN, CS-HTree_LB, and CS-ARF
4.9 Memory comparison of CS-kNN_LB, CSB-kNN, CS-HTree_LB, and CS-ARF
5.1 Overview of the datasets
5.2 Accuracy comparison of CS-ARF, LB_cs, SRP_cs, SAMkNN_cs, and NB_cs
6.1 Overview of the datasets
6.2 Comparison of UMAP-kNN, PCA-kNN, UMAP-SkNN, and UMAP-HAT
B.1 Comparaison entre les données statiques et les flux

List of Abbreviations

ARF    Adaptive Random Forest
ADWIN  ADaptive WINdowing
CS     Compressed Sensing
CMS    Count-Min Sketch
DR     Dimensionality Reduction
HT     Hashing Trick
HAT    Hoeffding Adaptive Tree
IoT    Internet of Things
JL     Johnson-Lindenstrauss
kNN    k-Nearest Neighbors
NB     Naive Bayes
RP     Random Projection
UMAP   Uniform Manifold Approximation and Projection

List of Symbols

S        Stream
X_i      Instance, observation or data point i from S
x_i^j    Value of attribute j of instance i
a        Number of input attributes
m        Number of output attributes after reduction
R        Set of real numbers
R^a      Set of a-dimensional real vectors
R^m      Set of m-dimensional real vectors
R^(m×a)  Set of m × a real matrices
C        Set of class labels
W        Sliding window
d        Sketch depth
w        Sketch width
k        Number of neighbors

Part I

Introduction and Background


Data stream learning is a hot research topic in Machine Learning and Data Mining, that has motivated the development of several very efficient algorithms in the streaming setting. Our thesis deals with the elaboration of new classification approaches based on well-known summarization techniques. These proposed approaches make it possible to learn from – and make predictions on – evolving data streams using small computational resources while keeping good classification performance. This part presents the necessary background concerning our thesis and is composed of two chapters:

• Chapter 1 provides an overview of the context and motivation of the thesis, followed by our principal contributions.

• Chapter 2 gives the necessary background regarding the streaming framework, its challenges and limitations. We cover the definition of the basic concepts of the stream context, such as the processing and summarization techniques for data streams. In this chapter, we also review the state of the art of streaming classification and dimensionality reduction techniques that are relevant to this thesis.


1 Introduction

Contents

1.1 Context and Motivation
1.2 Challenges
1.3 Contributions
1.4 Publications
1.5 Outline

1.1 Context and Motivation

Artificial Intelligence (AI) is defined as a field of study in Computer Science that aims to produce smart machines capable of mimicking the natural intelligence displayed by humans. The last few decades have witnessed a tremendous pace in the pervasiveness of technology, which invades our world in all dimensions and keeps skyrocketing. This evolution includes, more and more, systems and applications that continuously generate vast amounts of data in an open-ended way, as streams. This incredibly huge quantity of data, derived daily from applications in AI, covers areas such as robotics, natural language processing, and sensor analytics [1]. As an example application, the Internet of Things (IoT) is defined as a large network of physical devices and sensors (objects) that connect, interact, and exchange data. IoT is a key component of life automation, e.g., cars, drones, airplanes, and home automation. These devices will be creating a massive quantity of big data, via real-time streams, in the near future. By the end of 2020, 31 billion such devices will be connected, and by 2025 this number is expected to grow, according to Statista¹, to around 75 billion devices in use around the world; see Figure 1.1.

[Figure 1.1: IoT connected devices from 2015 to 2025, in billions of connected devices per year.]

Hence, methods and applications must be explored to cope with this tremendous flow of data, which exhibits the so-called "3 Vs": volume, velocity, and variability. There exist different ways to treat, organize and analyze this fast-generated data using algorithms and tools from AI and data mining. Indeed, these techniques are often carried out by automated methods such as the ones from machine learning. The latter is a fundamental subset of AI that is based on the assumption that computer algorithms can learn from new data and automatically improve themselves without – or with minimal – external intervention (human intervention in general). Machine learning algorithms are characterized by learning from observations and then making predictions using these observations in order to build appropriate models [2].

The success of IoT has motivated the field of data mining, which consists in being able to extract useful knowledge by automatically acquiring the hidden, non-trivial insights in the vast and growing sea of data made available over time. Data mining is a sub-field of machine learning which includes tasks such as classification, regression, and clustering that have been thoroughly studied over the last decades. However, traditional approaches for static datasets have some limitations when applied to dynamic data streams; hence, new approaches with novel mining techniques are necessary. In the context of IoT, mining algorithms should be able to handle the infinite and high-velocity nature of IoT data streams under finite resources – in terms of time and space. More details on these challenges are provided in Section 1.2.

¹ www.statista.com/statistics/976079/number-of-iot-connected-objects-worldwide-by-type/


To convert these data into useful knowledge, we usually use machine learning algorithms adapted for streaming data tasks, as part of a data analysis process [3, 4]. In this context, the data stream mining area has become indispensable and ubiquitous in many real-world applications. Often this category of applications generates data from evolving distributions and requires real-time – or near real-time – processing. In the data mining field, classification is one of the most popular tasks: it attempts to predict the category of unlabeled – and unseen – observations by building a model based on the attribute contents of the available data. The stream classification task is considered an active area of research in the data stream mining field, where the focus is to develop new – or improve existing – algorithms [5]. There are a number of classifiers that are widely used in data mining and applied in several real-world applications, such as decision trees, neural networks, k-nearest neighbors, Bayesian networks, etc. The next chapter covers, inter alia, a survey of the well-known and recent classification algorithms for evolving data streams.

1.2 Challenges

As mentioned above, the stream classification task aims to predict the labels – or classes – of new incoming unlabeled instances from the stream while continuously updating, after prediction, the models as the stream evolves, so as to follow the current distribution of the data. The online and potentially infinite nature of data streams, which raises some critical issues and makes traditional mining algorithms fail, imposes high resource requirements to handle the dynamic behavior of evolving distributions. While many of the following issues are common across different data stream mining applications, we address them in the context of incremental classification [6, 7, 8].

• Evolving nature of data streams. Any classification algorithm has to take into account the considerable evolution of data and adapt to the high speed nature, because streams often deliver observations very rapidly. Thus, algorithms must incrementally classify recent instances.

• Processing time. A real-time algorithm should process the incoming instances as fast as possible because the slower it is the less efficient it will be for applications that require rapid processing.

• Unbounded memory. Due to the huge amount and high speed of streaming data, which would require unlimited memory, any classification algorithm should be able to work within a memory constraint by maintaining as little information as possible about processed instances and the current model(s).

• High-dimensional data streams. Data streams may be high-dimensional, such as those containing text documents. For such kinds of data, distances between instances grow very fast which can potentially impact any classifier’s performance.


Table 1.1: Comparison between static and stream data.

                   Static data    Stream data
Access             random         sequential
Number of passes   multiple       single
Processing time    unbounded      restricted
Memory usage       unbounded      restricted
Type of result     accurate       approximate
Environment        static         dynamic (evolving)

• Concept drift. One crucial issue when dealing with a very large stream is the fact that the underlying distribution of the data can change at any moment, a phenomenon known as concept drift. We direct the reader to [9] for a survey of this concept. Drift can change the classifier results over time. To cope with new trends in the data, which must be detected as soon as they appear, a drift detection mechanism is usually coupled with learning algorithms.

In the stream setting, a crucial question arises: how can we process infinite data over time while addressing the stream framework requirements with minimal cost?

These above-mentioned challenges are of special significance in stream classification. We note that stream mining techniques must differ from the traditional ones for static datasets. To handle these challenges, classification algorithms must incorporate an incremental strategy that meets the processing requirements presented in Section 2.2. Table 1.1 presents a comparison of the static and stream data environments [10]. In addition to the overwhelming volume of data, its dimensionality is increasing considerably and poses a critical challenge in many domains, such as biology (omics data²) [11, 12] and spam email filtering [13] (classifying an email as spam or not, based on the email text content). These high-dimensional data may contain many redundant or irrelevant features that can be reduced to a smaller set of relevant combinations extracted from the input feature set without a significant loss of information. In order to handle such data adequately at the lowest possible cost, a pre-processing step is imperative to filter relevant features and therefore allow cost and resource savings for data stream mining algorithms. To do so, synopses or statistics can be constructed from instances in the stream using summarization techniques (e.g., sketches that keep frequencies of data), by selecting a part of the incoming data without reducing the number of features (i.e., sampling), or by applying dimensionality reduction (DR) to reduce the number of features. Naturally, the choice of a suitable technique depends on the problem being solved [14].

² Omics data refer to data from biological fields ending in -omics, e.g., genomics, metabolomics.


[Figure 1.2: The thesis context. A data stream mining task can be applied to infinite data streams; a very popular task is classification. To alleviate the resource cost of a given classifier, a summarization technique is sometimes used to keep synopsis information about the stream for learning. Moreover, to overcome the curse of dimensionality, a dimensionality reduction technique can be applied to high-dimensional data as an internal pre-processing step; the low-dimensional representation is then fed to the classification task.]

Our thesis purpose is motivated by the desired criteria, described above, for IoT data stream mining. We focus mainly on the classification task and aim to develop novel stream classification approaches to improve the performance of existing algorithms using summarization techniques. Figure 1.2 illustrates the context of this thesis.

Dimensionality reduction, embedding, and manifold learning are names for tasks that are similar in spirit. DR is defined as the projection of high-dimensional data into a low-dimensional space by reducing the input features to the most relevant ones. Indeed, DR is crucial to avoid the curse of dimensionality – which may increase the use of computational resources and negatively affect the predictive performance of any mining algorithm. To mitigate this drawback, several reduction techniques have been proposed, and widely investigated, in the offline setting [15, 16] to handle high-dimensional data. However, these techniques do not adhere to the strict computational resource requirements of the data stream learning framework [17, 18]. More details are provided in the next chapter.

1.3 Contributions

The main research line of this thesis addresses the aforementioned issues about mining algorithms’ performance for IoT data streams. This thesis contributes to the stream mining field by introducing and exploring novel approaches that reduce the computational


resources of existing algorithms while sacrificing a minimal amount of accuracy.

[Figure 1.3: Contributions of the thesis. Sketch-based: SketchNB, naive Bayes approaches using the count-min sketch and the hashing trick (Chapter 3). DR-based: CS-kNN, k-NN and ensemble-based kNN using compressed sensing (Chapter 4); CS-ARF, ensemble-based ARF using CS (Chapter 5); and UMAP-kNN, batch-incremental k-NN using UMAP (Chapter 6).]

In this introductory chapter, we divide the objective of the thesis into three main research questions, enumerated in the following:

• Q1: How can we improve the performance of the existing classifiers in terms of computational resources while maintaining good accuracy?

• Q2: How can we do better by dynamically adapting to high-dimensional data streams?

• Q3: How can we address concept drifts, i.e., the fact that stream distribution might change over time?

In the following, we briefly sum up our contributions and schematize them in Figure 1.3.

• In Chapter 3, we aim to improve the performance of naive Bayes by developing three novel approaches to make it efficient and effective with high-dimensional data.


– We study an efficient data structure, called Count-Min Sketch (CMS) [19], to keep a synopsis (frequencies) of the data in memory.

– We propose a new sketch-based naive Bayes that uses the CMS to store information about the infinite amount of streaming data in a fixed memory size for the naive Bayes learner.

– Theoretical guarantees over the size of the CMS table are provided by extending the guarantees of the CMS technique.

– To handle high-dimensional data, we add an online pre-processing step during which data are compressed using a fast DR technique, such as the hashing trick [20], before the learning tasks. This pre-processing step makes the approach faster.

– We incorporate in the learning phase an adaptive strategy using ADaptive WINdowing (ADWIN) [21], a change detector, for the entire sketch table, in order to track drifts (changes in the distribution of data over time).

• In Chapter 4, we focus on the k-Nearest Neighbors (kNN) algorithm [22]. We propose two approaches that aim to improve the computational costs of the kNN algorithm when dealing with high-dimensional data streams by exploring the Compressed Sensing (CS) [23] technique to reduce the space size.

– We propose a new kNN algorithm to support evolving data stream classification, compressed-kNN. Our main focus is to improve the resource performance of kNN by compressing the input streams using CS before applying the classification task. This results in a huge reduction in the computational cost of the standard kNN.

– We provide theoretical guarantees over the neighborhood preservation before and after projection using the CS technique. We thus ensure that the classification accuracy of compressed-kNN is almost the same as the one that would be obtained with kNN on the high-dimensional input data.

– We also provide an ensemble technique based on Leveraging Bagging [24], where we combine several compressed-kNN results to enhance the accuracy of a single classifier.

• In Chapter 5, we aim to improve the performance of a recent, well-regarded ensemble-based method, Adaptive Random Forest (ARF) [25], with high-dimensional data.

– We propose a novel ensemble approach to support high-dimensional data stream classification, which aims to enhance the resource usage of the ARF ensemble method by internally reducing the dimensionality of the input data using the CS technique; the compressed data are afterwards fed to the ensemble members.


• In Chapter 6, we explore a new DR technique that has attracted a lot of attention recently: Uniform Manifold Approximation and Projection (UMAP) [26]. We use this technique to pre-process the data, leveraging the fact that it preserves the neighborhood, to improve the results of the neighborhood-based algorithm, kNN.

– We propose an adaptation of UMAP for evolving data streams, i.e., a batch-incremental manifold learning technique. Instead of applying this batch method, UMAP, on a static dataset at once, we adapt it by using mini-batches from the stream incrementally.

– We propose a new batch-incremental kNN algorithm for stream classification using UMAP. The core idea is to apply the kNN algorithm on mini-batches of low-dimensional data obtained from the pre-processing DR step.

1.4 Publications

Some of the research findings of this thesis have been presented at international conferences. In the following, we provide the list of publications:

• Maroua Bahri, Silviu Maniu, Albert Bifet. “Sketch-Based Naive Bayes Algorithms for Evolving Data Streams”. In the IEEE International Conference on Big Data (Big Data), 2018, Seattle, WA, USA.

• Maroua Bahri, Albert Bifet, Silviu Maniu, Rodrigo Fernandes de Mello, Nikolaos Tziortziotis. “Compressed k-Nearest Neighbors Ensembles for Evolving Data Streams”. In the 24th European Conference on Artificial Intelligence (ECAI), 2020, Santiago de Compostela, Spain.

• Maroua Bahri, Bernhard Pfahringer, Albert Bifet, Silviu Maniu. “Efficient Batch- Incremental Classification for Evolving Data Streams”. In the Symposium on Intelligent Data Analysis (IDA), 2020, Lake Constance, Germany.

• Maroua Bahri, Heitor Murilo Gomes, Albert Bifet, Silviu Maniu. “CS-ARF: Compressed Adaptive Random Forests for Evolving Data Stream Classification”. In the International Joint Conference on Neural Networks (IJCNN), 2020, Glasgow, UK.

• Maroua Bahri, Heitor Murilo Gomes, Albert Bifet, Silviu Maniu. “Survey on Feature Transformation Techniques for Data Streams”. In the International Joint Conference on Artificial Intelligence (IJCAI), 2020, Yokohama, Japan.

• Maroua Bahri, João Gama, Albert Bifet, Silviu Maniu, Heitor Murilo Gomes. “Data Stream Analysis: Foundations, Progress in Classification and Tools”. under review.


1.5 Outline

The thesis is structured in the following parts:

• The first part includes Chapter 1 – the current chapter – and Chapter 2. In this chapter, we introduced the motivation of the thesis subject, followed by the main goals and contributions. In Chapter 2, we introduce the fundamental concepts related to the stream setting. We define data streams and cover the challenges imposed by their infinite and online nature. We present the methodologies used to process these data and survey the state of the art of classification and summarization (mainly dimensionality reduction) techniques that handle evolving data streams.

• The second part covers the main body of the thesis, which includes our main contributions. It is divided into four chapters; their order does not necessarily follow the chronology of the research itself. Chapter 3 explores the naive Bayes algorithm, the count-min sketch technique, and the hashing trick. We propose three algorithms that efficiently handle high-dimensional streams and detect drifts. In Chapter 4, we focus on enhancing the computational resources of the kNN algorithm – which is very costly in practice. We present an approach, CS-kNN, that uses compressed sensing to incrementally pre-process the stream before the classification task. We prove that the neighborhood is preserved after the projection, i.e., the accuracy is not going to be greatly impacted. We then use this approach as a base learner inside the Leveraging Bagging ensemble, in order to improve the accuracy of the single CS-kNN. Following the same direction, Chapter 5 introduces a recent ensemble-based method, ARF, and induces diversity among the ensemble members (so that they differ from each other) by using different random projection matrices, one for each ensemble member. In Chapter 6, we explore a recent dimensionality reduction technique, UMAP. Motivated by its high-quality results, we extend UMAP to the stream setting by adapting a batch-incremental strategy with the kNN algorithm to obtain the UMAP-kNN approach.

• In the third part, composed of Chapter 7, we give a summary of the results achieved in this thesis, and discuss possible future developments.

• Finally, Appendix A covers the open source frameworks that have been used to develop and test the aforementioned contributions.


2 Stream Setting: Challenges, Mining and Summarization Techniques

Contents

2.1 Introduction
2.2 Preliminaries
    2.2.1 Processing
    2.2.2 Summarization
2.3 Stream Supervised Learning
    2.3.1 Frequency-Based Classification
    2.3.2 Neighborhood-Based Classification
    2.3.3 Tree-Based Classification
    2.3.4 Ensembles-Based Classification
2.4 Dimensionality Reduction
    2.4.1 Data-Dependent Techniques
        Principal Components Analysis
        Multi-Dimensional Scaling
        Auto-Encoder
        Linear Discriminant Analysis
        Maximum Margin Criterion
    2.4.2 Data-Independent Techniques
        Random Projection
        Compressed Sensing
        Hashing Trick
        Locality Sensitive Hashing
    2.4.3 Graph-Based Techniques
        Isometric Mapping
        t-distributed Stochastic Neighbor Embedding
        Uniform Manifold Approximation and Projection
2.5 Evaluation Metrics
2.6 Discussions
2.7 Conclusion

Parts of this chapter have been the subject of the two following survey papers in separate collaborations with Heitor Murilo Gomes1 and João Gama2:

• M. Bahri, A. Bifet, J. Gama, S. Maniu, H.M. Gomes. “Data Stream Analysis: Foundations, Progress in Classification and Tools”[27].

• M. Bahri, A. Bifet, S. Maniu, H.M. Gomes. “Survey on Feature Transformation Techniques for Data Streams” [28], accepted to the International Joint Conference on Artificial Intelligence (IJCAI) 2020.

2.1 Introduction

In this chapter, we provide a general overview of the background of this thesis: the data streaming setting. The problems addressed in this thesis also belong to the supervised learning field; in short, we are studying data stream classification. Handling the challenges of the stream setting may require sophisticated algorithms composed of multiple components, such as a pre-processing task to reduce the dimensionality of the input data, or a drift detection mechanism for concept drifts.

Beyond the standard constraints of data streams, the need for more computational resources to address further requirements arises when we deal with high-dimensional data. Classification algorithms must be coupled with summarization techniques to be effective.

This chapter is organized as follows. In Section 2.2, we define the data stream setting and outline the main limitations and challenges in this area. Then, we spotlight some fundamental approaches used to keep stream methods scalable, which is needed in order to handle continuous and potentially infinite streams. Section 2.3 is dedicated to the state of the art in stream classification, providing an overview of some well-known and recent classification methods. In Section 2.4, we survey the dimensionality reduction approaches designed to handle high-dimensional streaming data. We then highlight the key benefits of using these approaches for data stream classification algorithms.

¹ University of Waikato, Hamilton, New Zealand.
² University of Porto, Porto, Portugal.

2.2 Preliminaries

In the era of IoT, applications in different domains have seen an explosion of information generated from heterogeneous stream data sources every day. Hence, data stream mining has become indispensable in many real-world applications, e.g., social networks, weather forecasting, network monitoring, spam email filtering, call records, and more. As mentioned in the previous chapter, static and streaming data are different, because the dynamic and changing environment of data streams makes them impossible to store or to scan multiple times due to their tremendous volume [17].

Definition 2.1 (Data streams). Stream (incremental or online) data $S$ are defined as an unbounded sequence of multidimensional, sporadic, and transient observations (also called instances) made available over time. In the following, we assume that $S = X_1, \ldots, X_N, \ldots$, where each instance $X_i$ is a vector that contains $a$ attributes or features, denoted by $X_i = (x_i^1, \ldots, x_i^a)$, and $N$ denotes the number of instances encountered thus far in the stream.
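To make Definition 2.1 concrete, the following minimal Python sketch simulates such a stream as a generator that yields one instance $X_i$ at a time; the attribute values and the labeling rule are purely illustrative and do not correspond to any dataset used in this thesis.

```python
import random
from typing import Iterator, List, Tuple

def synthetic_stream(a: int, seed: int = 42) -> Iterator[Tuple[List[float], int]]:
    """Yield an unbounded stream S = X_1, X_2, ... where each X_i has `a` attributes
    and a binary class label; instances are produced one at a time (one-pass access)."""
    rng = random.Random(seed)
    while True:
        x = [rng.random() for _ in range(a)]   # attribute vector (x_i^1, ..., x_i^a)
        y = int(sum(x) > a / 2)                # toy concept defining the label
        yield x, y

# Consume the stream instance by instance; nothing is stored.
stream = synthetic_stream(a=10)
for i, (x, y) in zip(range(5), stream):
    print(i, y)
```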

In Section 1.2, we presented the main research issues encountered in the streaming framework; in this section, we present some well-known ways to deal with such constraints.

2.2.1 Processing

To cope with these requirements, we can use well-established methods such as processing in one-pass and summarization (e.g., sampling) [6,7, 29]. We describe them below.

• One-pass constraint. With the ever-increasing nature of the data, it is no longer possible to examine a stream of data using multiple passes, because of its huge size and the inability to examine it more than once during the course of computation. Considering this issue, results are obtained by scanning the data stream only once and updating the classification model incrementally, or under the assumption that data arrive in chunks (e.g., batch-incremental algorithms).

• Window models. Classification results can change over time as the data change accordingly. Since scanning the stream multiple times is not allowed, the so-called moving window techniques have been proposed to capture the important contents of the evolving stream. There exist three kinds of windows, which are the following [30]:

– Sliding window model: its size is fixed to keep the last incoming observations, i.e., only the most recent observations from the data stream are stored inside the window. As time passes, the sliding window moves over the stream while keeping the same size. The set of the last elements (t) is considered the most recent one (see Figure 2.1).

[Figure 2.1: Sliding window model.]

– Landmark window model: in this model, the size of the window increases with time, starting from a predefined instance called a landmark. When a new landmark is reached, all instances are removed and instances from the current landmark onward are kept (Figure 2.2). One issue arises in the special case where the landmark is fixed from the beginning, so the window will contain the entire stream.

[Figure 2.2: Landmark window of size 13.]

– Damped window model: this model is based on a fading function that periodically modifies the weights of instances. The key idea consists in assigning to each instance from the stream a weight that is inversely proportional to its age, i.e., assigning more weight to recently arrived data. When the weight of an instance falls below a given threshold, the instance is removed from the model (see Figure 2.3).

[Figure 2.3: Damped window model.]

Table 2.1: Window models comparison.

Window Model      Definition                                 Advantages                                           Disadvantages
Sliding           process the last received instances        suitable when the interest is only in the recent instances   ignores part of the stream
Landmark          process the entire history of the stream   suitable for one-pass classification algorithms      all the data are equally important
Damped (Fading)   assign weights to instances                suitable when the old data may affect the results    unbounded time window

Table 2.1 shows a brief description of the window models with their advantages and drawbacks. Different classification methods can be adapted to use the above models; the choice of the window model depends on the application needs [30].
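As a minimal illustration of the sliding and damped window models described above, the sketch below keeps, respectively, the last `size` instances and the instances whose fading weight is still above a threshold; the exponential fading function and the numeric values are illustrative assumptions, not the settings used later in the thesis.

```python
from collections import deque

class SlidingWindow:
    """Keep only the last `size` instances (Figure 2.1)."""
    def __init__(self, size: int):
        self.buffer = deque(maxlen=size)   # old instances are dropped automatically

    def add(self, instance):
        self.buffer.append(instance)

class DampedWindow:
    """Damped (fading) model: weight decays exponentially with age (Figure 2.3).
    The decay rate and threshold are illustrative choices."""
    def __init__(self, decay: float = 0.01, threshold: float = 1e-3):
        self.decay, self.threshold = decay, threshold
        self.items = []                    # list of (arrival_time, instance)

    def add(self, t: int, instance):
        self.items.append((t, instance))
        # drop instances whose weight 2^(-decay * age) fell below the threshold
        self.items = [(t0, x) for (t0, x) in self.items
                      if 2 ** (-self.decay * (t - t0)) >= self.threshold]

w, d = SlidingWindow(size=100), DampedWindow()
for t in range(1000):
    w.add(t)
    d.add(t, t)
print(len(w.buffer), len(d.items))   # 100 instances vs. all instances still "fresh enough"
```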

The infinite nature of data streams makes them impossible to store due to resource constraints. In this context, how can we keep track of instances seen so far with minimal information loss?

2.2.2 Summarization

Instead of – or in conjunction with – the previously mentioned techniques, another approach is to keep only a synopsis, or summary, of the information constructed from stream instances. This can be achieved either by keeping a small part of the incoming data or by constructing other data structures that store a synopsis of the data. In what follows, we briefly present some of these techniques.

• Sampling. Sampling methods are the simplest ones for synopsis construction in the stream framework. Storing static datasets is simple enough; in contrast, when dealing with large data streams, this is an impossible task. In this context, it is intuitively reasonable to sample the stream in order to keep some “representative” instances and thus decrease the stream size that will be stored in memory [31]. This method is conceptually easy and efficient to implement; however, the samples can be biased and not very representative.

• Histograms. Histograms are widely used with static datasets, but their extension to the stream framework is still a challenging task. Some techniques [32] have proposed histograms for the incremental setting to handle evolving streams. However, they do not always work well, because the distribution of the instances is assumed to be uniform, which is not always true.

• Sketching. Sketch-based methods are well known for keeping small, but approximate, synopses of data [33, 31]. They build a summary of the data stream using a small amount of memory. Among them, we cite the count-min sketch [19], a generalization of Bloom filters [34] used to estimate the frequency of a given item with approximate counts while keeping sound theoretical bounds on these counts (a minimal count-min sketch example is given after this list).

• Dimensionality reduction. Dimensionality reduction (DR) is a very popular tool to tackle high-dimensional data, another factor that makes classification algorithms expensive. It is defined as the transformation that maps instances from a high-dimensional space onto a lower-dimensional one while preserving the distances between instances [16]. Thus, instead of applying the classification algorithm on the high-dimensional data, we apply it on their small representation in the low-dimensional space.
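As a concrete illustration of the sketching idea (the count-min sketch is revisited in detail in Chapter 3), the following minimal Python sketch maintains a depth × width table of counters and answers approximate frequency queries; the hash construction used here is a simplification chosen for readability.

```python
import hashlib

class CountMinSketch:
    """Minimal count-min sketch: `depth` rows of `width` counters, one hash per row."""
    def __init__(self, depth: int = 4, width: int = 256):
        self.depth, self.width = depth, width
        self.table = [[0] * width for _ in range(depth)]

    def _hash(self, item: str, row: int) -> int:
        digest = hashlib.md5(f"{row}-{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def update(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._hash(item, row)] += count

    def query(self, item: str) -> int:
        # over-estimates the true frequency, never under-estimates it
        return min(self.table[row][self._hash(item, row)] for row in range(self.depth))

cms = CountMinSketch()
for token in ["spam", "ham", "spam", "spam"]:
    cms.update(token)
print(cms.query("spam"), cms.query("ham"))   # approximate counts: 3 and 1
```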

2.3 Stream Supervised Learning

One of the most important tasks in data mining is classification [5].

Definition 2.2 (Classification problem). The problem of classification, a supervised learning task, consists in attempting to predict the class label $y'$ of some unlabeled data instance $X'$, composed of a vector of attributes, by applying a generated model $M$ (trained on labeled data $(X, y)$) [35]:

$$y' = M(X'). \qquad (2.1)$$

Traditionally, data mining has been performed over static datasets in the offline setting where, for the classification task, the training process is applied on a fixed-size dataset and can afford to read the input data several times. However, the dynamic and open-ended nature of data streams has outpaced the capability of traditional classifiers, also called batch classifiers, which assume the data can be loaded into memory, due to the technical requirements of the incremental environment [36]. The difference between classifiers for static datasets (batch classifiers) and data stream classifiers resides in how the learning and prediction are performed. Unlike batch learning, online learning must deal with data incrementally and use one-pass processing. Moreover, and most importantly, stream classifiers must use a limited amount of time to allow the analysis of each instance without delay, and a limited amount of space to avoid storing huge amounts of data for the prediction task.

[Figure 2.4: The data stream classification cycle [36].]

Figure 2.4 illustrates the general model of the classification process within a data stream environment, taking into account the requirements outlined above [35]:

1. The classifier receives the next training instance available from the stream (one-pass constraint).

2. The classifier processes the instance and updates the current model quickly, using only a limited amount of memory.

3. The classifier predicts, in a first attempt, the class label of unlabeled instances and uses them later to update the model. (A minimal test-then-train loop implementing this cycle is sketched below.)
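This cycle corresponds to the prequential (interleaved test-then-train) evaluation scheme. The minimal sketch below assumes a stream iterator as in Definition 2.1 and an incremental classifier exposing `predict` and `partial_fit` methods (a scikit-multiflow-style interface; the method names are an assumption of this sketch).

```python
def prequential_evaluation(stream, model, n_instances: int) -> float:
    """Interleaved test-then-train: predict first, then use the instance for training."""
    correct, seen = 0, 0
    for _, (x, y) in zip(range(n_instances), stream):
        if seen > 0:                         # predict only once the model has been updated
            y_pred = model.predict([x])[0]   # step 3: first-attempt prediction
            correct += int(y_pred == y)
        model.partial_fit([x], [y])          # step 2: incremental update, bounded memory
        seen += 1
    return correct / max(seen - 1, 1)
```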

A multitude of classification algorithms for static datasets, which have been widely studied in offline processing and proved to be of limited effectiveness when dealing with evolving data streams, have been extended to work within a streaming framework [37]. A general taxonomy divides classification algorithms into four main categories (Figure 2.5): (i) frequency-based; (ii) neighborhood-based; (iii) tree-based; and (iv) ensemble-based classification algorithms.


[Figure 2.5: Taxonomy of classification algorithms: frequency-based (e.g., Naive Bayes), neighborhood-based (e.g., kNN), tree-based (e.g., Hoeffding tree), and ensembles (e.g., Bagging).]

2.3.1 Frequency-Based Classification

One of the simplest classifiers is Naive Bayes (NB) [38]. It assumes that the attributes are all independent of each other given the class label, and it uses Bayes' theorem to compute the posterior probability of a class given the evidence (the training data). This assumption is obviously not always true in practice, yet the NB classifier has a surprisingly strong performance in real-world scenarios. Naive Bayes is a special algorithm that needs no adaptation to the stream setting, because it naturally trains on data incrementally thanks to its simple frequency-based strategy.
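A minimal sketch of this frequency-based strategy for categorical attributes is given below: the classifier keeps only class counts and per-class attribute-value counts, both of which are updated one instance at a time; the Laplace smoothing used in the likelihood is an illustrative choice.

```python
from collections import defaultdict

class IncrementalNaiveBayes:
    """Categorical naive Bayes updated one instance at a time."""
    def __init__(self):
        self.class_counts = defaultdict(int)
        self.value_counts = defaultdict(int)   # keyed by (class, attribute_index, value)
        self.n = 0

    def partial_fit(self, x, y):
        self.n += 1
        self.class_counts[y] += 1
        for j, v in enumerate(x):
            self.value_counts[(y, j, v)] += 1

    def predict(self, x):
        best, best_score = None, float("-inf")
        for c, cc in self.class_counts.items():
            score = cc / self.n                                   # prior P(c)
            for j, v in enumerate(x):
                # likelihood with Laplace smoothing (illustrative)
                score *= (self.value_counts[(c, j, v)] + 1) / (cc + 2)
            if score > best_score:
                best, best_score = c, score
        return best

nb = IncrementalNaiveBayes()
nb.partial_fit(["sunny", "hot"], "no")
nb.partial_fit(["rainy", "cool"], "yes")
nb.partial_fit(["sunny", "cool"], "no")
print(nb.predict(["sunny", "cool"]))   # "no"
```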

2.3.2 Neighborhood-Based Classification

k-Nearest Neighbors (kNN) is a neighborhood-based algorithm that has been adapted to the data stream setting. It does not require any work during training, but it uses the entire dataset to predict the class labels of test examples. The challenge in adapting kNN to the stream setting is that it is not possible to store the entire stream for the prediction phase. An envisaged solution to this issue is to manage the examples that are remembered so that they fit into limited memory, and to merge new examples with the closest ones already in memory. Yet, searching for the nearest neighbors is still costly in terms of time and memory [39]. Another kNN method that has been proposed recently in the stream framework is Self-Adjusting Memory kNN (samkNN) [40]. SamkNN uses a dual-memory model to capture drifts in data streams by building an ensemble with models targeting current or former concepts.
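The following minimal sketch illustrates the window-based variant discussed above: only the last `window_size` labeled instances are stored, and prediction is a majority vote among the k closest stored instances; the Euclidean distance and the simple eviction policy are illustrative simplifications.

```python
import math
from collections import Counter, deque

class SlidingWindowKNN:
    """kNN over a bounded window of recent labeled instances."""
    def __init__(self, k: int = 3, window_size: int = 1000):
        self.k = k
        self.window = deque(maxlen=window_size)   # oldest instances are evicted

    def partial_fit(self, x, y):
        self.window.append((x, y))

    def predict(self, x):
        if not self.window:
            return None
        nearest = sorted(self.window, key=lambda xy: math.dist(x, xy[0]))[: self.k]
        return Counter(y for _, y in nearest).most_common(1)[0][0]

knn = SlidingWindowKNN(k=3, window_size=500)
knn.partial_fit([0.1, 0.2], "a")
knn.partial_fit([0.9, 0.8], "b")
knn.partial_fit([0.2, 0.1], "a")
print(knn.predict([0.15, 0.15]))   # "a"
```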

2.3.3 Tree-Based Classification

Several tree-based algorithms have been proposed to handle evolving data streams [41, 42, 43]. A well-known decision tree learner for data streams is the Hoeffding tree algorithm [41], also known as the Very Fast Decision Tree (VFDT). It is an incremental, anytime decision tree induction algorithm that uses the Hoeffding bound to select the optimal splitting attributes. However, this learner assumes that the distribution generating instances does not change over time. So, to cope with an evolving data stream, a drift detection algorithm is usually coupled with it. In [44], an adaptive algorithm was proposed, the Hoeffding Adaptive Tree (HAT), extending the VFDT to deal with concept drifts. It uses ADaptive WINdowing (ADWIN) [21], a change detector and estimator, to monitor the performance of the branches of the tree and to replace them with new branches when their accuracy decreases, if the new branches are more accurate. These algorithms require more memory as the tree expands and grows; they also sacrifice computational speed due to the time spent choosing the optimal attribute to split.
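For reference, the Hoeffding bound used by VFDT states that, with probability $1 - \delta$, the true mean of a random variable with range $R$ does not differ from its empirical mean over $n$ independent observations by more than

$$\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}},$$

so a split on the current best attribute is made once its observed advantage over the second-best attribute exceeds $\epsilon$, a quantity that shrinks as more instances arrive.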

2.3.4 Ensembles-Based Classification

Ensemble learning is receiving increased attention for data stream learning due to its potential to greatly improve learning performance [25, 45]. Ensemble-based methods, used for classification tasks, can easily be adapted to the stream setting because of their high learning performance and their flexibility to be integrated with drift detection strategies. Unlike single classifiers, ensemble-based methods predict by combining the predictions of several classifiers. Several empirical and theoretical studies have shown that combining multiple “weak” individual classifiers leads to better predictive performance than a single classifier [44, 46, 47], as illustrated in Figure 2.6. Ensembles have several advantages over single-classifier methods, such as: (i) robustness: they are easy to scale and parallelize; (ii) concept drift: they can adapt to change quickly by resetting or updating the currently under-performing models of the ensemble; and (iii) high predictive performance: they usually generate more accurate concept descriptions. An extensive review of the related work is provided in [48]; here, we briefly present the well-known and the most recent methods. Online Bagging [49] is a streaming version of Bagging [46] which generates k models trained on different samples. Different from batch Bagging, where samples are produced with replacement, online Bagging weights each incoming instance with a value sampled from a Poisson(1) distribution. Leveraging Bagging (LB) [24] is based on the online Bagging method. In order to increase accuracy, LB handles drifts using ADWIN [21]: if a change is detected, the worst classifier is erased and a new one is added to the ensemble. LB also induces more diversity in the ensemble via randomization. Adaptive Random Forests (ARF) [25] is a recent extension of the Random Forest method to handle evolving data streams. ARF uses the Hoeffding tree as a base learner, where attributes are randomly selected during training. It is coupled with a drift detection scheme using ADWIN on each ensemble member, where a tree is replaced, once a drift is detected, by an alternate tree trained on the new concept. Streaming Random Patches (SRP) [50] is also a novel method that combines random subspaces and bagging while using a strategy to detect drifts similar to the one introduced in [25].
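A minimal sketch of the online Bagging idea described above: each incoming instance is presented to each ensemble member a number of times drawn from Poisson(1), which replaces bootstrap sampling with replacement; the `partial_fit`/`predict` interface of the base models, and the trivial base learner used to make the sketch runnable, are assumptions of this illustration.

```python
import numpy as np
from collections import Counter

class OnlineBagging:
    """Online Bagging: each member is trained on each instance w ~ Poisson(1) times."""
    def __init__(self, base_models, seed: int = 1):
        self.models = base_models
        self.rng = np.random.default_rng(seed)

    def partial_fit(self, x, y):
        for model in self.models:
            for _ in range(self.rng.poisson(1.0)):   # weight drawn from Poisson(1)
                model.partial_fit(x, y)

    def predict(self, x):
        votes = Counter(model.predict(x) for model in self.models)
        return votes.most_common(1)[0][0]            # the "fusion" step of Figure 2.6

class MajorityClass:
    """Trivial incremental base learner, used only to make the sketch runnable."""
    def __init__(self):
        self.counts = Counter()
    def partial_fit(self, x, y):
        self.counts[y] += 1
    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None

ensemble = OnlineBagging([MajorityClass() for _ in range(10)])
ensemble.partial_fit([0.1, 0.2], "a")
ensemble.partial_fit([0.3, 0.4], "a")
print(ensemble.predict([0.2, 0.3]))   # most members vote "a"
```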


[Figure 2.6: Ensemble classifier. The data are fed to base classifiers 1 through e; their individual decisions are fused into the ensemble prediction.]

One notable issue with ensemble-based methods on evolving data streams is their massive computational demand (in terms of memory usage and running time). Ensembles require more resources than single classifiers, and this becomes significantly worse with high-dimensional data streams.

Ensemble-based methods pose a resources-accuracy tradeoff: they produce better predictive performance than single classifiers, but at the price of being more costly in terms of computational resources.

2.4 Dimensionality Reduction

Definition 2.3 (Dimensionality reduction). Given an instance $X_i$ composed of a vector of $a$ attributes, $X_i = (x_i^1, \ldots, x_i^a)$, DR comprises the process of finding a transformation function (or map) $A : \mathbb{R}^a \to \mathbb{R}^m$, where $m \ll a$, to be applied to each instance $X_i$ from the stream $S$ as follows:

$$Y_i = A(X_i), \qquad (2.2)$$

where $Y_i = (y_i^1, \ldots, y_i^m)$.

We distinguish two main dimensionality reduction categories: (i) feature selection, which consists in selecting a subset of the input features, i.e., the most relevant and non-redundant features, without operating any sort of data transformation [51]; and (ii) feature transformation – also called feature extraction – which consists in constructing, from a set of input features in a high-dimensional space, a new set of features in a lower-dimensional space [52].

[Figure 2.7: Taxonomy of dimensionality reduction techniques. Data-dependent: PCA, MDS, AE, LDA, MMC; data-independent: RP, CS, HT, LSH; graph-based: Isomap, tSNE, UMAP.]

Feature transformation and feature selection have different processing requirements. Despite the fact that both reductions are used to minimize an input feature space size, feature transformation is a DR that creates a new subset – or combinations – of features by exploiting the redundancy and noise of the input set of features, whereas feature selection is characterized by keeping the most relevant attributes from the original set of features present in the data without changing them.

Some recent surveys on streaming feature selection have been proposed [53, 54, 55, 56]. The major weakness of this category is that it could lead to data loss. The latter may happen when, unintentionally, we remove features that might be useful for a later task (e.g., classification, visualization). In the following, we review the most crucial feature transformation techniques that are – or can be – used in the stream framework and discuss their similarities and differences. In what follows, we refer to feature transformation as dimensionality reduction.

A Taxonomy of Dimensionality Reduction. In the following, we introduce DR techniques that have been widely used in machine learning algorithms. These techniques operate by transforming the input features and using the most relevant feature combinations, in turn reducing space and time demands; this can be crucial for applications such as classification and visualization. Traditionally, many techniques have been proposed and thoroughly used in the offline framework for static datasets, but these techniques cannot be used in the streaming framework, due to the requirements imposed by the latter (e.g., one-pass processing). Figure 2.7 shows a taxonomy that subdivides the DR techniques into data-dependent, data-independent, and graph-based transformation techniques. The data-dependent techniques are derived from the whole data to achieve the transformation, whereas the data-independent techniques are based on random projections and do not use the input data to perform the projection. Graph-based techniques are also data-dependent; they build a neighborhood graph to maintain the data structure (i.e., they preserve the neighborhood after projection).
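As an example of the data-independent family, a Gaussian random projection builds the map $A : \mathbb{R}^a \to \mathbb{R}^m$ of Definition 2.3 from the input and output dimensionalities alone, so it can be generated once before the stream is processed; the $1/\sqrt{m}$ scaling below follows the usual Johnson-Lindenstrauss-style construction, and the dimensions are illustrative.

```python
import numpy as np

def make_gaussian_projection(a: int, m: int, seed: int = 0) -> np.ndarray:
    """Build a data-independent map A: R^a -> R^m (Definition 2.3) as a random matrix."""
    rng = np.random.default_rng(seed)
    return rng.normal(loc=0.0, scale=1.0, size=(m, a)) / np.sqrt(m)

A = make_gaussian_projection(a=1000, m=50)
x = np.random.rand(1000)     # one high-dimensional instance X_i from the stream
y = A @ x                    # low-dimensional representation Y_i = A(X_i)
print(y.shape)               # (50,)
```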

2.4.1 Data-Dependent Techniques

Data-dependent techniques construct a projection function – or matrix – from the data. This requires the presence of the entire dataset, or at least a part of it. In the streaming context, where data are potentially infinite, the classical techniques from this category are therefore limited, since keeping the entire data stream in memory is impractical.

Principal Components Analysis

Principal Components Analysis (PCA) is the most popular and straightforward unsupervised technique that seeks to reduce the space dimension by finding a lower-dimensional basis in which the sum of squared distances between the original data and their projections is minimized, i.e. being as close as possible to zero while maximizing the variances between the first components. Mathematically, PCA aims to find a linear mapping formedbya few orthogonal linear combinations, also called eigenvectors or PCs, from the original data that maximizes a certain cost function. However, PCA computes eigenvectors and eigenvalues from a computed covariance matrix, relying on the whole dataset. This is ineffective for streaming data since a re-estimation of the covariance matrix from scratch for new observations is unavoidable. In this context different variations of component analysis have been adapted to the stream setting. For instance, Incremental PCA (IPCA) [57] focuses on how to update the eigenvectors of images (eigenimages) based on the previous coefficients. Candid Covariance-free Incremental PCA (CCIPCA) [58] is another extension that updates the eigenvectors incrementally and does not need to compute the covariance matrix for each new incoming instance (images) which makes it very fast. The main difference among these techniques arises in how the eigenvectors are updated. On the other hand, the common limitation concerns their application domain since both techniques deal with images as high-dimensional vectors and have not been tested on different types of data [57, 58]. Ross, Lim, Lin, and Yang proposed a batch-incremental PCA that deals with a set of new instances each time a batch is complete. However, this approach is not suited for instance-incremental learning (i.e., processing instances one by one incrementally). Mitliagkas, Caramanis, and Jain proposed a memory-limited streaming PCA that attempts to make vanilla PCA incremental and computation-efficient with high-dimensional data where samples are drawn from a Gaussian spiked covariance model. A more recent work [61] proposes a single- pass randomized PCA technique that iteratively update the subspace’s orthonormal basis matrix within an accuracy-performance tradeoff. Yu, Gu, Li, Liu, and Li claim that this

technique works well in many applications, although it has been evaluated on a single image dataset. The above PCA techniques can be applied within data stream mining algorithms to alleviate their computation costs. For instance, Feng, Yan, Ai-ping, and Quan-yuan proposed an efficient online classification algorithm, FIKOCFrame, that uses a PCA variant, fast iterative kernel PCA [63], to incrementally reduce the dimensionality before classification. Cardot and Degras recently proposed a comparative review of incremental PCA approaches, in which they provide guidance for selecting the appropriate approach based on its accuracy and computational resources (time and memory).
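To make the batch-incremental flavor concrete, the following minimal Python sketch uses scikit-learn's IncrementalPCA, which updates its principal components from successive mini-batches; it is only an illustration of the general idea, not the IPCA, CCIPCA, or randomized variants cited above, and the dimensions are arbitrary.

# Batch-incremental PCA: the eigenbasis is refined batch by batch, so the full
# stream never has to reside in memory.
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)
ipca = IncrementalPCA(n_components=5)

# Simulate a stream of 100-dimensional instances arriving in batches of 200.
for _ in range(50):
    batch = rng.normal(size=(200, 100))
    ipca.partial_fit(batch)            # update the components incrementally

new_batch = rng.normal(size=(10, 100))
reduced = ipca.transform(new_batch)    # project fresh instances into 5 dimensions
print(reduced.shape)                   # (10, 5)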

Multi-Dimensional Scaling

Multi-Dimensional Scaling (MDS) [65] is a well-known unsupervised technique used for embedding. It projects a given distance matrix into a non-linear lower-dimensional space while preserving the similarity among instances. Nevertheless, this technique is computationally expensive on large datasets and non-scalable because it requires the entire data distance matrix. To cope with this issue, some incremental versions have been proposed to alleviate the computational requirements. The incremental MDS (iMDS) technique, proposed by Agarwal, Phillips, Daumé III, and Venkatasubramanian, maintains some distance preservation using so-called out-of-sample mapping, without the need to reconstruct the whole matrix. A more recent work by Zhang, Huang, Mueller, and Yoo proposed a new version of MDS for high-dimensional data, named scMDS. It is a batch-incremental technique in which the authors introduce a realignment matrix for each batch to overcome the concept drift that may occur because each batch may have a different feature basis. Nevertheless, the efficiency of this batch-incremental technique depends on the size of the batch.

Auto-Encoder

Auto-Encoders (AEs) [68] are a family of Neural Networks (NNs) designed for unsupervised learning of a low-dimensional representation of a high-dimensional dataset, where the input is the same as the output. An AE has two main components: (i) the encoder step, during which the input data are compressed into a latent space representation; and (ii) the decoder step, where the input data are reproduced from this new representation. Vincent, Larochelle, Bengio, and Manzagol introduced the denoising AE (DAE), a variant of AE that extracts features by adding perturbations to the input data and then attempting to reconstruct the original data. Zhou, Sohn, and Lee proposed an online DAE that adaptively uses incremental feature augmentation, depending on the already existing features, to track drifts. However, this work does not address the convergence properties of the training task (the hyperparameter configuration used to construct the network, e.g., the number of epochs), which are crucial in the stream setting.


Unlike other algorithms, NNs naturally handle incremental learning tasks [71]. When dealing with data streams, NNs learn by passing the data in small chunks (mini-batch gradient descent) or one instance at a time (stochastic gradient descent). In this way, each instance is processed only once, without being stored. The advantage of this kind of technique is that it is not limited to linear transformations: non-linearities are introduced using non-linear activation functions, which makes NNs more flexible. Nevertheless, the high-quality results that this family of learners offers come at the price of a slow learning speed, due to the infinite nature of the data and the large parameter space needed.
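The following minimal NumPy sketch illustrates this instance-by-instance training regime with a small non-linear auto-encoder updated by stochastic gradient descent; the architecture, activation, and hyperparameters are illustrative choices, unrelated to the online DAE discussed above.

# A tiny auto-encoder trained one instance at a time (SGD): encode, decode,
# back-propagate the reconstruction error, and discard the instance.
import numpy as np

rng = np.random.default_rng(1)
a, m, lr = 50, 8, 0.01                      # input dim, latent dim, learning rate
W1 = rng.normal(scale=0.1, size=(m, a))     # encoder weights
W2 = rng.normal(scale=0.1, size=(a, m))     # decoder weights

def sgd_step(x):
    """One encode-decode pass followed by a gradient update on a single instance."""
    global W1, W2
    h = np.tanh(W1 @ x)                     # encoder with non-linear activation
    x_hat = W2 @ h                          # decoder (reconstruction)
    err = x_hat - x                         # reconstruction error
    grad_W2 = np.outer(err, h)
    grad_pre = (W2.T @ err) * (1.0 - h ** 2)  # back-propagation through tanh
    grad_W1 = np.outer(grad_pre, x)
    W2 -= lr * grad_W2
    W1 -= lr * grad_W1
    return np.tanh(W1 @ x)                  # low-dimensional code used downstream

for _ in range(10_000):                     # simulated stream of instances
    code = sgd_step(rng.normal(size=a))
print(code.shape)                           # (8,)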

Linear Discriminant Analysis

Linear Discriminant Analysis (LDA) [72], also known as Fisher Discriminant Analysis (FDA), is a linear transformation technique. Contrary to the techniques mentioned earlier, LDA performs a supervised reduction that takes into account the class labels of instances: it looks for an efficient discrimination of the data that maximizes the separation of the existing categories (class labels), while other techniques, e.g., PCA, aim at an efficient representation. However, when dealing with evolving data streams, the set of labels may be unknown at each learning stage because new classes can appear (concept evolution) [73]. One way to cope with this issue is to update the discriminant eigenspace when a new class arrives, as introduced in the Incremental LDA (ILDA) approach [74]. Another streaming extension of LDA, called IDR/QR [75], has been proposed. It applies a singular value decomposition suitable for large datasets and has a lower computational cost than ILDA. Kim, Stenger, Kittler, and Cipolla proposed an ILDA that incrementally updates the discriminant components using a different criterion; they claim it to be more efficient in terms of time and memory than the previous approaches.

Maximum Margin Criterion

Maximum Margin Criterion (MMC) [77] is a supervised feature extraction technique based on the same representation as LDA while maximizing a different objective function. To overcome the limitations of MMC with streaming data, Yan, Zhang, Yan, Yang, Li, Chen, Xi, Fan, Ma, and Cheng proposed an Incremental MMC (IMMC) approach, which infers an online adaptive supervised subspace from data streams by optimizing the MMC and updating the eigenvectors of the criterion matrix incrementally. Hence, IMMC is very fast since it does not need to reconstruct the criterion matrix when new instances arrive. The incremental formulation of the algorithm is given in [78], together with its proof. A major drawback of this approach is its sensitivity to parameter settings.


2.4.2 Data-Independent Techniques

Data-independent techniques are mainly based on the principle of random projections. These techniques are therefore appropriate for evolving streams because they generate the projection matrices (or functions) and transform the data into a low-dimensional space independently of the input data.

Random Projection

Random projection (RP) is a powerful technique for dimensionality reduction that has been widely applied with several mining algorithms to solve numerous problems [79]. RP is based on the Johnson-Lindenstrauss (JL) lemma [80], which asserts that N instances from a Euclidean space can be projected into a lower-dimensional space of O(\log(N)/\epsilon^2) dimensions in which pairwise distances are preserved within a multiplicative factor of 1 \pm \epsilon [81]. Let \epsilon \in [0, 1] and S = \{X_1, \dots, X_N\} \subset \mathbb{R}^a. Given a target dimension m on the order of \log(N)/\epsilon^2, for all X_i, X_j \in S there is a linear map A : \mathbb{R}^a \rightarrow \mathbb{R}^m such that:

(1 - \epsilon)\,\|X_i - X_j\|_2^2 \;\le\; \|AX_i - AX_j\|_2^2 \;\le\; (1 + \epsilon)\,\|X_i - X_j\|_2^2,   (2.3)

where A is a random matrix that can be generated using, e.g., a Gaussian distribution. Hence, RP offers a computationally-efficient and straightforward way to compress the dimensionality of input data rapidly while approximately preserving the pairwise distances between any two instances.
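The following short NumPy sketch checks Equation (2.3) empirically: a Gaussian matrix drawn independently of the data approximately preserves squared pairwise distances. The dimensions and scaling are illustrative only.

# Random projection with a scaled Gaussian matrix: distances are roughly preserved.
import numpy as np

rng = np.random.default_rng(42)
N, a, m = 200, 1_000, 200                      # instances, original dim, reduced dim
X = rng.normal(size=(N, a))
A = rng.normal(size=(m, a)) / np.sqrt(m)       # data-independent projection matrix
Y = X @ A.T                                    # projected instances

i, j = 0, 1
orig = np.linalg.norm(X[i] - X[j]) ** 2
proj = np.linalg.norm(Y[i] - Y[j]) ** 2
print(proj / orig)                             # typically close to 1, i.e. within (1 ± ϵ)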

Compressed Sensing

Compressed Sensing (CS), also called compressed sampling, is a technique based on the principle that a data compression method has to deal with redundancy while transforming and reconstructing data [23]. Given a sparse, high-dimensional vector X \in \mathbb{R}^a, the goal of CS is to measure Y \in \mathbb{R}^m and then reconstruct X, for m \ll a, as follows:

Y = AX, (2.4)

where A \in \mathbb{R}^{m \times a} is called a measurement (sampling, or sensing) matrix. The technique has been widely studied and used throughout different domains in the offline framework, such as image processing [82], face recognition [83], and vehicle classification [84]. The basic idea is to use orthogonal features to provably and properly represent sparse, high-dimensional vectors X \in \mathbb{R}^a and to reconstruct them from a small number of measurements Y \in \mathbb{R}^m, where m \ll a. Two main concepts are crucial for recovery with high probability [23]:


Sparsity: CS exploits the fact that data may be sparse and hence compressible into a concise representation. For an instance X with support \mathrm{supp}(X) = \{l : x_l \neq 0\}, we define the \ell_0-norm \|X\|_0 = |\mathrm{supp}(X)|; X is s-sparse if \|X\|_0 \leq s. The implication of sparsity is important, as it allows irrelevant data to be removed without much loss.

Restricted Isometry Property (RIP): A is said to respect the RIP for all s-sparse data if there exists \epsilon \in [0, 1] such that:

(1 - \epsilon)\,\|X\|_2^2 \;\le\; \|AX\|_2^2 \;\le\; (1 + \epsilon)\,\|X\|_2^2,   (2.5)

where X \in S. This property holds for all s-sparse data X \in \mathbb{R}^a. RP and CS are closely related: random matrices (e.g., Bernoulli, Binomial, Gaussian) are also known to satisfy the RIP with high probability if m = O(s \log(a)) [85], which is essentially a JL-type condition on projections using the sensing matrix A. The difference lies mainly in how big the matrix A has to be.
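As an illustration, the following Python sketch performs one compressed sensing round trip with a Gaussian sensing matrix and Orthogonal Matching Pursuit for recovery; it is a toy example under the stated sparsity assumption, not the CS-kNN pipeline developed later in this thesis, and the sizes are arbitrary.

# Sense an s-sparse vector with a Gaussian matrix, then recover it by sparse regression.
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(7)
a, m, s = 500, 80, 5                          # original dim, measurements, sparsity
x = np.zeros(a)
x[rng.choice(a, size=s, replace=False)] = rng.normal(size=s)   # s-sparse signal

A = rng.normal(size=(m, a)) / np.sqrt(m)      # sensing matrix (RIP with high probability)
y = A @ x                                     # m-dimensional measurement, as in Eq. (2.4)

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=s, fit_intercept=False)
omp.fit(A, y)                                 # solve y ≈ A x under the sparsity budget
print(np.allclose(omp.coef_, x, atol=1e-6))   # usually True: recovery succeeds with high probability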

Hashing Trick

Hashing trick (HT) [20], also known as feature hashing, is a fast and space-efficient technique that projects sparse instances, or vectors, into a lower-dimensional feature space using a hash function. Given a list of keys that represents the set of features of an input instance, it computes the hash function for each key, which maps it to a specific cell of a fixed-size vector that constitutes the new compressed instance. An important point is that the quality of models generally changes with the size of the hash table: usually, the larger the hash table, the better the model. However, an operating point can be picked that guarantees an almost perfect model while keeping the output dimension reasonably small. The HT technique has the appealing properties of being very fast, simple, and sparsity-preserving. A significant advantage is that it is very memory-efficient, because the size of the output feature vector is bounded, making it a clear candidate for use, especially for online learning on streams.
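A minimal usage sketch with scikit-learn's FeatureHasher is given below; the feature names and the table size are hypothetical and only illustrate how instances are mapped, one by one, into a fixed-size vector.

# Feature hashing: each (key, value) pair is hashed into one of n_features cells.
from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(n_features=32, input_type="dict")   # fixed output dimension

stream = [
    {"word_offer": 2, "word_free": 1},       # hypothetical bag-of-words instances
    {"word_meeting": 1, "word_free": 1},
]
for instance in stream:
    compressed = hasher.transform([instance])   # 1 x 32 sparse vector
    print(compressed.shape, compressed.nnz)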

Locality Sensitive Hashing

The basic idea behind Locality Sensitive Hashing (LSH) [86] is the application of hash functions that map, with high probability, similar instances (in the high-dimensional space) to the same bucket, i.e., similar instances receive the same hash code. For example, if instances are phrases that are very similar to each other, differing by only one or a couple of words or even a single character, LSH will generate very similar, ideally identical, hash codes to increase the probability of collision for those instances. LSH operates by partitioning the space with hyperplanes into disjoint regions that are spatially proximate. A particular hyperplane cuts the space into two half-spaces; arbitrarily, one side is called positive "1" and the other side negative "0", which helps classify the instances for that

dimension. The process is iterative: the first bit of an instance's hash code is assigned according to its position relative to the first hyperplane; the process then keeps cutting the space and assigning bits in the same way. We thus obtain the hash codes from the bits assigned after each hyperplane. There is an efficiency versus computational-resources tradeoff with this technique: to achieve good accuracy, LSH requires several hash functions, which increases memory usage, slows down the reduction process, and makes it less suitable for large data streams. The LSH technique is used in several interesting real-world applications: in recommendation systems, to find Netflix users with similar tastes in movies; in plagiarism detection, where, given a body of documents, LSH allows similar texts to be found; and in classification, e.g., classification by topic, where pages that have similar sets of words are likely to be about the same topic.
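The following NumPy sketch implements the random-hyperplane scheme described above: each hyperplane contributes one bit of the hash code, and near-duplicate instances are likely to collide. The number of hyperplanes and the perturbation level are arbitrary illustrative choices.

# Random-hyperplane LSH: the sign of the projection on each hyperplane gives one bit.
import numpy as np

rng = np.random.default_rng(3)
a, n_planes = 100, 16
planes = rng.normal(size=(n_planes, a))        # one random hyperplane per bit

def lsh_code(x):
    """Return the n_planes-bit hash code of an instance as a string of 0/1."""
    bits = (planes @ x > 0).astype(int)        # side of each hyperplane
    return "".join(map(str, bits))

x = rng.normal(size=a)
x_close = x + 0.01 * rng.normal(size=a)        # a near-duplicate instance
x_far = rng.normal(size=a)

print(lsh_code(x) == lsh_code(x_close))        # usually True (likely collision)
print(lsh_code(x) == lsh_code(x_far))          # usually False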

2.4.3 Graph-Based Techniques

Graph-based techniques are also data-dependent techniques that start by constructing a graph based on instance similarities and then operate on this representation.

Isometric Mapping

Isomap [87] is a manifold learning technique that can be viewed as a combination of the principles of PCA and MDS. It starts by building a neighborhood graph on the manifold, from which a geodesic distance matrix is constructed. Isomap assumes that pairwise geodesic distances are equal to the Euclidean ones (obtained by applying MDS on the resulting geodesic distance matrix) in the low-dimensional space. Since it requires the computation of pairwise distances, Isomap is not appropriate for the incremental setting with large datasets. Law, Zhang, and Jain proposed a streaming version of Isomap that updates the geodesic distances and the coordinates incrementally. This technique is not fully incremental because a new instance can affect the neighborhood structure and, therefore, the geodesic matrix; thus, there is a need to examine how this new instance interacts with the existing ones before finding its coordinates. Another incremental Isomap, denoted S-Isomap, has been proposed lately by Schoeneman, Mahapatra, Chandola, Napp, and Zola; it does not recompute the whole geodesic distance matrix when a new instance arrives, but only finds its nearest neighbors (which are used to approximate the geodesic distance between this new observation and the others already available in the batch). This approach fails when used to process evolving data because it assumes that the data are weakly correlated and is thus unable to detect when concept drift takes place.

t-distributed Stochastic Neighbor Embedding

t-distributed Stochastic Neighbor Embedding (tSNE) [90] is one of the most prominent DR techniques in the state-of-the-art. It is a graph-based non-linear technique proposed to visualize high-dimensional data embedded in a lower-dimensional space (typically 2 or 3 dimensions), using the insight that similar instances in the high-dimensional space should be represented by close instances in the low-dimensional space. tSNE uses a fixed parameter named perplexity, similar to a number of neighbors, that controls the neighborhood size of each node in the graph, which prevents it from preserving the global data structure (only the local one). The main weakness of tSNE in our context is its scalability: to add more instances, tSNE must be re-run from scratch.

Uniform Manifold Approximation and Projection

Uniform Manifold Approximation and Projection (UMAP) [26] is a recent manifold learning technique, similar to tSNE, that has attracted much attention and is built upon a rigorous mathematical foundation rooted in Riemannian geometry. UMAP starts by constructing open balls over all instances and building simplicial complexes. The space reduction is obtained by finding a representation, in a lower-dimensional space, that closely resembles the topological structure of the original space. Given the new dimension, an equivalent fuzzy topological representation is constructed; UMAP then optimizes it by minimizing the cross-entropy between the two fuzzy representations [26]. In addition to being faster than tSNE, UMAP also offers better visualization quality by preserving more of the global structure. Unlike tSNE [90], UMAP places no restriction on the projected space size, making it useful not only for visualization but also as a general DR technique for mining algorithms.
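As an illustration, the following sketch uses the umap-learn package (assumed to be installed) on a random batch; the batch-incremental use of UMAP for streams is the topic of Chapter 6, and the parameter values below are arbitrary.

# UMAP as a general-purpose reducer: the target dimension is not limited to 2 or 3.
import numpy as np
import umap

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 100))                    # a batch of high-dimensional instances

reducer = umap.UMAP(n_components=10, n_neighbors=15)
X_low = reducer.fit_transform(X)                   # 500 x 10 embedding
print(X_low.shape)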

2.5 Evaluation Metrics

After training a model, it is crucial to validate it by evaluating the classifier and verifying its applicability. Several evaluation methods exist; the most common one is the prequential evaluation [91], also known as the interleaved test-then-train evaluation, a popular method applied exclusively to data streams. In the prequential evaluation, instances are used to test (i.e., predict on) the current model before being used to train (i.e., update) it. To evaluate the performance of our proposed classification algorithms, we emphasize three evaluation criteria that are strongly related; a minimal sketch of the test-then-train loop is given after these criteria.

1. Accuracy. The accuracy (AC) [92] of a learning algorithm is the most pertinent concern: it measures the percentage (%) of correctly classified instances c_i that a model makes on a dataset, and is defined in Equation (2.6):

AC = \frac{\sum_i c_i}{N},   (2.6)

where N denotes the total number of instances received so far from the stream. The most accurate classifier is the one that achieves the highest accuracy and makes few mistakes when predicting the class labels of unlabeled instances.

Figure 2.8: Handling data stream constraints. Traditional classification algorithms offer exact accuracy but are expensive because they process the entire stream (which may be high-dimensional). Sophisticated algorithms use internal summarization technique(s) in order to improve their performance, which generally leads to an accuracy-resources tradeoff.

2. Running time. A good classification algorithm processes instances as fast as possible once they arrive. The running time comprises any internal pre-processing (e.g., dimensionality reduction), learning, and prediction steps as new instances arrive.

3. Memory. The memory – measured in megabytes (MB) – used by an algorithm is: (i) the memory used to store the current model(s); and/or (ii) the memory used to store some running statistics useful for the incremental processing of data streams.
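The following minimal sketch shows the test-then-train loop announced above; the predict_one/learn_one interface is an assumption (any incremental classifier exposing such methods would fit), not a reference to a specific library.

# Prequential (interleaved test-then-train) evaluation loop.
def prequential_accuracy(stream, model):
    """Test each labeled instance on the current model, then train on it."""
    correct = total = 0
    for x, y in stream:
        y_pred = model.predict_one(x)      # test first ...
        model.learn_one(x, y)              # ... then train on the same instance
        correct += int(y_pred == y)
        total += 1
    return correct / total if total else 0.0

# Usage (illustrative): acc = prequential_accuracy(labeled_stream, classifier)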

2.6 Discussions

DR plays a significant role in the data stream mining area since it aims at keeping the most relevant features in order to reduce the computational cost of stream mining algorithms (Figure 2.8).

Techniques such as PCA, LDA, and MDS are the most classical ones for DR. As mentioned in Section 2.4.1, some versions of the data-dependent techniques have been proposed to deal with evolving data streams. This category of techniques usually provides good accuracy when combined with stream data mining algorithms. On the other hand, data-independent techniques are naturally adapted to the evolving environment of data streams and do not suffer from the scalability problem. Moreover, data-independent transformations are extremely fast because they are performed without examining the content of the input data. Such a transformation performs as well as, if not better than, a data-dependent transformation, because it is less sensitive to new unseen instances and can benefit from the infinite nature of streams. However, data-independent schemes can destroy interpretability in the case of visualization. Thus, as illustrated in Figure 2.8, the choice of technique (data-dependent or data-independent) leads to an accuracy-resources tradeoff that may depend on the problem being solved and the algorithm used (e.g., when a graph-based technique is used for visualization to preserve the neighborhood and the global structure of the data).

2.7 Conclusion

Processing data streams is a big challenge in the data mining field because of the additional constraints created by the large volume of unbounded data. In addition, the high dimensionality of data streams in some domains makes the issue even more challenging. In this chapter, we studied the basic notions of evolving data streams and discussed the different processing techniques that can be used by mining algorithms to address the stream requirements. We then provided an overview of state-of-the-art classifiers and summarization techniques proposed in the stream setting. In the second part of the thesis, we detail our contributions, consisting of novel versions of some classifiers that use summarization techniques in the streaming framework. In the next chapter, we start by presenting our new developments of the Naive Bayes algorithm using sketching and DR techniques.

Part II

Summarization-Based Classifiers


After providing in the first part a deep overview of the literature, including the concepts used in our work, we now focus on our contributions. In this second part, the major part of this thesis, we detail the developments that we have proposed in order to build our classification approaches based on different summarization techniques for evolving data streams. This part is composed of four chapters:

• Chapter 3 presents the basic concepts of the count-min sketch [19], which is used to build our classification approach based on Naive Bayes, called SketchNB. We thereafter attempt to increase the performance of the latter on high-dimensional data streams using a fast dimensionality reduction technique, the hashing trick. Finally, an approach in the data stream framework has to address concept drifts; for this purpose, we incorporate an efficient change detector with strong guarantees, ADWIN [21].

• Chapter 4 explores another dimensionality reduction technique based on random projection, compressed sensing [23], and uses it in conjunction with the k-nearest neighbors classifier. The resulting algorithm, CS-kNN, aims to reduce memory and time requirements. Theoretical guarantees characterizing the similarity between the kNN neighborhoods before and after the projection are provided. To obtain more stable predictive performance, given the stochasticity introduced by compressed sensing, we use the CS-kNN approach as a base learner for the Leveraging Bagging ensemble method.

• Chapter 5 presents a strategy similar to the one described in Chapter 4. We propose an ensemble method, based on the adaptive random forest [25], which pre-processes the input instances using multiple CS random matrices to reduce resource usage.

• Chapter 6 remains in the context of kNN and explores a new visualization technique, UMAP [26], which has attracted a lot of attention because of its high-quality results and its neighborhood-preserving nature. We use UMAP as a dimensionality reduction technique in a batch-incremental manner to compress the input space size. The output is then fed to kNN instead of the high-dimensional data. Promising results are obtained with this batch-incremental UMAP-kNN approach because both methods, kNN and UMAP, are based on exploring the neighborhood of instances using a measure of proximity.


3 Sketch-Based Naive Bayes

Contents
3.1 Introduction
3.2 Preliminaries
    3.2.1 Naive Bayes Classifier
    3.2.2 Count-Min Sketch
3.3 Sketch-Based Naive Bayes Algorithms
    3.3.1 SketchNB Algorithm
    3.3.2 AdaSketchNB Algorithm
    3.3.3 SketchNBHT and AdaSketchNBHT Algorithms
3.4 Experimental Evaluation
    3.4.1 Datasets
    3.4.2 Results and Discussions
3.5 Conclusion

This chapter contains results from our paper [93], published at the IEEE International Conference on Big Data (BigData) 2018 under the title "Sketch-Based Naive Bayes Algorithms for Data Streams" and presented as a poster at the Machine Learning Summer School (MLSS, https://smiles.skoltech.ru/mlss2019) 2019.

3.1 Introduction

As mentioned in Chapter 2, several algorithms have been proposed in the literature to deal with this problem, such as decision trees, naive Bayes, neural networks, or k-nearest


neighbors. To cope with the evolving nature of data streams and the challenges previously enumerated, techniques such as sketches, a class of specialized algorithms that can produce approximate results efficiently with mathematically proven error bounds, are useful to process infinite data using fewer resources.

A sketch is a data structure that stores the counts of categorical data, e.g., by hashing a word into a particular cell using a hash function. Naive Bayes is a frequency-based algorithm that keeps counts of categorical data during the learning phase. Sketches and naive Bayes can thus be unified to obtain a memory-efficient naive Bayes.

In this chapter, we focus on these important challenges in classification over data streams by exploring the benefits of introducing a sketch technique for streams. The main focus of this work is to extend the streaming naive Bayes algorithm to deal with massive data by storing the data as high-quality approximations in a sketch table, which allows both fast predictions and the use of a minimal amount of space for training (which answers Q1). We thereafter extend this first proposed approach to handle changes in the evolving distribution using a concept drift mechanism (Q3), namely ADWIN [21]. Finally, we propose a third contribution on top of the two stated algorithms that aims to address high dimensionality (Q1 and Q2) using the hashing trick (HT) technique [20], detailed in Section 2.4.2. The remainder of the chapter is organized as follows. In Section 3.2, we present the main components related to our contributions. In Section 3.3, we detail our proposed approaches. Section 3.4 discusses the different experiments performed on both artificial and real datasets. Finally, we draw our conclusion in Section 3.5.

3.2 Preliminaries

We assume that the data stream S contains an infinite number of instances

X_1, X_2, \dots, X_N, \dots, where each instance is a vector containing a attributes, denoted by X_i = (x_i^1, x_i^2, \dots, x_i^a). We will denote by N the number of instances encountered thus far in the stream. The classification problem consists in assigning each instance X_i to a class c_j \in C, where C is the set of classes.

3.2.1 Naive Bayes Classifier

One of the most often used classifiers is Naive Bayes (NB) [38]. It relies on the assumption that the attributes are all independent of each other given the class label, and uses Bayes's theorem to compute the posterior probability of a class given the evidence (the training data). This assumption obviously does not always hold in practice, yet the NB classifier shows surprisingly strong performance in real-world scenarios.

Figure 3.1: Count-min sketch of width w and depth d.

Using Bayes's theorem, one can compute the probability of each class:

P(C \mid A_1, \dots, A_a) = \frac{P(A_1, \dots, A_a \mid C) \cdot P(C)}{P(A_1, \dots, A_a)},   (3.1)

where P(C \mid A_1, \dots, A_a) is the posterior probability of the target class given the attributes, P(C) is the prior probability of the class, P(A_1, \dots, A_a \mid C) is the likelihood, and P(A_1, \dots, A_a) is the prior probability of the attributes. Once the probabilities are computed, the class having the highest probability is chosen as the predicted class. The training in naive Bayes is straightforward. To compute class probabilities, we estimate them as a fraction of the instances seen thus far, as follows:

• Estimate P(C) as the fraction of records having C = c_j,

P(C = c_j) = \frac{\mathrm{Count}(C = c_j)}{N}.

• Estimate P(X = A_1, \dots, A_a \mid C) as the fraction of records with C = c_j for which X = A_1, \dots, A_a,

P(X = A_i \mid C = c_j) = \frac{\mathrm{Count}(X = A_i \wedge C = c_j)}{\mathrm{Count}(C = c_j)}.

We use this attribute independence assumption in the following to construct our NB approach with the Count-Min Sketch (CMS) technique.
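For concreteness, the following minimal Python sketch keeps these counts in exact dictionaries; the sketch-based algorithms of Section 3.3 replace precisely these exact counters with approximate CMS counts. The class name and toy data are illustrative only.

# A count-based naive Bayes for categorical streams with exact counters.
from collections import defaultdict

class CountingNB:
    def __init__(self):
        self.class_counts = defaultdict(int)            # Count(C = c_j)
        self.attr_counts = defaultdict(int)             # Count(x^k = v and C = c_j)
        self.n = 0

    def learn_one(self, x, y):
        self.n += 1
        self.class_counts[y] += 1
        for k, v in enumerate(x):
            self.attr_counts[(k, v, y)] += 1

    def predict_one(self, x):
        best, best_score = None, -1.0
        for c, cc in self.class_counts.items():
            score = cc / self.n                          # prior P(C = c_j)
            for k, v in enumerate(x):
                score *= self.attr_counts[(k, v, c)] / cc  # likelihood P(x^k = v | C = c_j)
            if score > best_score:
                best, best_score = c, score
        return best

nb = CountingNB()
nb.learn_one(("sunny", "hot"), "no")
nb.learn_one(("rainy", "mild"), "yes")
print(nb.predict_one(("sunny", "hot")))                  # "no"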

3.2.2 Count-Min Sketch

Nowadays, applications generate data at rates and volumes that cannot reasonably be stored. One way to cope with this vast scale of information is to use synopsis techniques [33, 94, 95]. Among them is the CMS [19], a generalization of Bloom filters [34] introduced to count items of a given type using approximate counts that are theoretically sound.


Definition 3.1 (Count-min Sketch). A CMS consists of a two-dimensional array of w · d counter cells, with a width of w columns and a depth of d rows. w and d are controlled by the approximation parameters ϵ and δ, such that, with probability 1 − δ, the approximate counts obtained from the sketch are within an absolute error ϵ of the true counts. Each of the d rows is associated with a different hash function h_1, h_2, \dots, h_d; each h_i is used to determine in which of the w counters of row i a count is incremented. Figure 3.1 shows the CMS table. Depending on the ϵ and δ parameters, the CMS should be initialized with dimensions w = ⌈e/ϵ⌉ and d = ⌈ln(1/δ)⌉, with all cells set to 0.

Each time a new instance arrives in the data stream S, for each attribute and class, each hash function h_i is applied and the corresponding counter (in the range 1, \dots, w) is incremented. When an instance needs to be classified, the corresponding counts are retrieved using the same hash functions, i.e., one looks at all cells for the attribute and the class being estimated. The estimate is obtained by taking the minimum over the values in the corresponding cells: since each cell has been incremented every time the attribute has been seen (plus possible collisions), each cell represents an upper bound on the actual value. Assuming a data stream with N arrivals, let q_i be the true count of an item being estimated. It has been shown in [19] that the estimated count for an item i is at least q_i, since all the inserts are non-negative, and, due to collisions, the counts can be over-estimated to at most q_i + ϵ · N with probability at least 1 − δ, i.e., an upper bound on the estimate. Little research has focused on using efficient data structures designed to reduce memory usage, such as the CMS, with standard classifiers. Kveton, Bui, Ghavamzadeh, Theocharous, Muthukrishnan, and Sun proposed three graphical model sketch algorithms that estimate the marginal and the joint probabilities within a Bayesian network. The authors experimented with the special case of Bayesian networks, naive Bayes. Their analysis shows that their proposed GMFactorSketch is the best approximation. The main idea of the approach is to use 2a − 1 sketch tables, one for each variable and one for each variable-parent pair in the graph. Given a test example, it retrieves the approximate count for each attribute from the corresponding sketch table to compute the conditional probabilities and then predicts the class label. None of the above-mentioned approaches is memory-efficient with large datasets, as evidenced in our experimental Section 3.4.
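The following compact Python sketch illustrates the structure of Definition 3.1 with update and minimum-based estimate operations; it is a simplified illustration, not the implementation used in our experiments, and the per-row hashing scheme is an assumption made for brevity.

# A compact count-min sketch: w columns, d rows, one hash per row.
import math

class CountMinSketch:
    def __init__(self, epsilon=0.01, delta=0.1):
        self.w = math.ceil(math.e / epsilon)             # width
        self.d = math.ceil(math.log(1.0 / delta))        # depth = number of hash functions
        self.table = [[0] * self.w for _ in range(self.d)]

    def _index(self, row, key):
        # Salting Python's built-in hash with the row id stands in for d independent hashes.
        return hash((row, key)) % self.w

    def update(self, key, count=1):
        for row in range(self.d):
            self.table[row][self._index(row, key)] += count

    def estimate(self, key):
        # Every cell over-counts because of collisions, so the minimum is the tightest bound.
        return min(self.table[row][self._index(row, key)] for row in range(self.d))

cms = CountMinSketch()
for token in ["spam", "spam", "ham"]:
    cms.update(token)
print(cms.estimate("spam"), cms.estimate("ham"))         # 2 1 (possibly over-estimated)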

3.3 Sketch-Based Naive Bayes Algorithms

Sketch-based techniques summarize massive data streams in a limited space by using multiple hash functions to decrease the probability of having wrong counts due to collisions. The main idea behind the sketch-based approaches is the use of the CMS technique to store the stream for memory-efficiency during the learning phase.


3.3.1 SketchNB Algorithm

We aim to adapt the CMS [19] to the classic naive Bayes classifier by leveraging its strong theoretical guarantees. The independence assumption in NB means that each attribute can be counted separately; this simplifies the learning for a large number of attributes, and allows us to use the sketch efficiently. During the classification process, the sketch table will be used in two steps:

• Learning: updating the sketch table for each attribute each time a new instance arrives.

• Prediction: retrieving the counts of a given instance from the CMS table and using them to compute the naive Bayes probability (see Equation 3.1).

Let us start by discussing the learning process using the CMS, as described in Algorithm 1. Once an instance X_i = (x_i^1, x_i^2, \dots, x_i^a) is received, the classification algorithm starts by updating the sketch with the counts of the attribute values, inserting each attribute value as ⟨k, x_i^k, c⟩, k \in [1 \cdots a]. The sketch table thus contains the counts of the attribute values, using each of the d hash functions a times (once per attribute), so O(d · a) hash evaluations in total. Figure 3.1 illustrates the updating process of the sketch table. We first create the sketch table and initialize it to zero (line 2). Given an instance X_i that belongs to a class c_j, each attribute value x_i^k is mapped to one counter in each row using the set of hash functions. Each of those counters gets incremented whenever the same attribute value within the same class is seen (line 5). Therefore, each of these counts is an upper bound on the true count; since all of them are upper bounds, only the minimum is taken during the prediction phase, using the same set of hash functions as in the update process.

Algorithm 1 Learning phase. Symbols: S = {X_1, X_2, ...}: labeled data stream; ϵ: epsilon; δ: delta.
1: function UpdateSketch(S, ϵ, δ)
2:     CMS ← 0    ▷ create the sketch of size w · d, with w = ⌈e/ϵ⌉ and d = ⌈ln(1/δ)⌉
3:     for all X_i ∈ S do
4:         for all x_i^k ∈ X_i do    ▷ k ∈ [1 ··· a]
5:             CMS[l, h_l(⟨k, x_i^k, c_j⟩)] += 1    ▷ for l = 1 ··· d, increment the cell selected by each of the d hash functions (range [1..w])
6:         end for
7:     end for
8: end function

Observe that Algorithm 2 assumes that we have one stream used to present training instances and a stream S′ for predictions; the latter works by presenting, at each timestamp i, an unlabeled instance X′_i. To predict the class label, using Equation (3.1) (line 5) and the counts retrieved from the CMS table (line 4), we compute an estimate of the class having

the highest probability (line 7).

Theoretical guarantees about the sketch size, w and d, are provided in the following.

Algorithm 2 Prediction phase. Symbols: S′ = {X′_1, X′_2, ...}: data stream; CMS: the count-min sketch.
1: function EstimateSketch(S′, CMS)
2:     for all X′_i ∈ S′ do
3:         for all x′_i^k ∈ X′_i and c_j ∈ C do    ▷ k ∈ [1 ··· a]
4:             count(x′_i^k) ← CMS[l, h_l(⟨k, x′_i^k, c_j⟩)]    ▷ retrieve the counts from the CMS (minimum over the d rows)
5:             P(x′_i^1, ..., x′_i^a | c_j) = ∏_{k=1}^{a} count(x′_i^k)    ▷ compute the probability
6:         end for
7:         p_c ← max_{c ∈ C} P(x′_i^1, ..., x′_i^a | c)
8:     end for
9: end function

To determine the efficiency of the proposed SketchNB algorithm, we need to analyze the behavior of the sketch table, since we are retrieving approximate counts from it. It has been shown that for all the inserts, the counts are non-negative and may be over-estimated because of collisions [19]. Let f_j(x_i^k) be the fractional count of the attribute value x_i^k of the instance X_i in the jth class. Its estimated fractional count is denoted by \hat{f}_j(x_i^k). The latter has the following guarantees: f_j(x_i^k) \le \hat{f}_j(x_i^k); and with probability at least (1 − δ):

\hat{f}_j(x_i^k) \le f_j(x_i^k) + N \cdot a \cdot \epsilon.   (3.2)

Using one sketch table of size ⌈e/ϵ⌉ × ⌈ln(1/δ)⌉, after processing N a-dimensional instances from the stream, the counts are over-estimated to within (N · a · ϵ) of their true values. Since we are using NB, as described in the second step of Algorithm 2, for each incoming instance, each class c_j ∈ C, and each attribute, we perform multiple extractions at the same time by retrieving t = |C| · a counts, where |C| is the number of class labels:

Theorem 3.1 Given a data stream S of instances X_i inserted in the sketch, ϵ, and δ, with probability at least 1 − ∆ = (1 − δ)^t we have:

\bigwedge_{k=1}^{t} \left( \hat{f}_j(x_i^k) \le f_j(x_i^k) + Na\epsilon \right) = \text{True}.   (3.3)

This means that we need to set the sketch according to the following setting,

\delta = 1 - \sqrt[t]{1 - \Delta},   (3.4)

or

d = \ln \frac{1}{1 - \sqrt[t]{1 - \Delta}}.   (3.5)

The above result is used to set δ, a crucial parameter of the sketch, in order to fix the depth of the sketch table. The obtained δ leads to a deeper sketch that is able to maintain the entire stream; consequently, we obtain a more accurate estimation of the counts by avoiding collisions. We would also like to determine the accuracy of the SketchNB algorithm, whose prediction process is described in the pseudocode of Algorithm 2. The latter consists in retrieving, from the sketch table, the fractional counts for each attribute value, so one can apply the NB equation for each class label. For this to happen, we need to compute the product of the estimated fractional counts for each hash function in [1, \dots, \ln(1/\delta)].

Theorem 3.2 Let \epsilon' = Na\epsilon, and let \hat{f}_j(x_i^k) be the estimated fractional count of the attribute value x_i^k in the jth class. We have:

\prod_{k=1}^{a} \hat{f}_j(x_i^k) \le \prod_{k=1}^{a} \left( f_j(x_i^k) + Na\epsilon \right).   (3.6)

Then, with probability at least (1 − ∆),

\prod_{k=1}^{a} \hat{f}_j(x_i^k) \le \prod_{k=1}^{a} f_j(x_i^k) + \sum_{p=1}^{a} \frac{a!}{p!(a - p)!} (\epsilon')^p.   (3.7)

Proof. Given an instance X_i:

\hat{f}_j(x_i^1) \cdot \hat{f}_j(x_i^2) \cdots \hat{f}_j(x_i^a) \le (f_j(x_i^1) + \epsilon') \cdot (f_j(x_i^2) + \epsilon') \cdots (f_j(x_i^a) + \epsilon')   (3.8)
= f_j(x_i^1) \cdots f_j(x_i^a) + f_j(x_i^1) \cdot \epsilon' + f_j(x_i^2) \cdot \epsilon' + \cdots + f_j(x_i^a) \cdot \epsilon' + f_j(x_i^1) \cdot f_j(x_i^2) \cdot \epsilon' + \cdots   (3.9)

Since all the frequencies are non-negative, we know that the range of possible fractional counts is [0, 1], i.e., they are at most 1; thus, 1 is an upper bound on the fractional counts. So, with certainty we know that f_j(x_i^k) \cdot \epsilon' \le \epsilon', and any product of the estimated fractional counts is

always less than 1. So, by applying Pascal's triangle, we obtain:

\prod_{k=1}^{a} \hat{f}_j(x_i^k) \le \prod_{k=1}^{a} f_j(x_i^k) + f_j(x_i^1) \cdot \epsilon' + f_j(x_i^2) \cdot \epsilon' + \cdots + f_j(x_i^a) \cdot \epsilon' + f_j(x_i^1) \cdot f_j(x_i^2) \cdot \epsilon' + \cdots
\le \prod_{k=1}^{a} f_j(x_i^k) + \sum_{p=1}^{a} \frac{a!}{p!(a - p)!} (\epsilon')^p
= \prod_{k=1}^{a} f_j(x_i^k) + \sum_{p=1}^{a} C_a^p (\epsilon')^p.   (3.10)

This completes the proof. □

This observation shows that, with probability 1 − ∆, the quantity of error due to collisions is at worst equal to \sum_{p=1}^{a} C_a^p (\epsilon')^p. This leads to the following corollary.

Corollary 1 Let E be big epsilon and N be the number of instances seen so far from the stream. An immediate result from Equation (3.10) is the following:

\epsilon = \frac{\sqrt[a]{aNE + 1} - 1}{aN},   (3.11)

or

w = \frac{e\,aN}{\sqrt[a]{aNE + 1} - 1}.   (3.12)

Proof. Let \epsilon' = aN\epsilon. After processing N a-dimensional instances from the stream, the quantity of error caused by collisions in Theorem 3.2 must be the same as aNE, according to Theorem 3.1.

\sum_{p=1}^{a} C_a^p (aN\epsilon)^p = aNE
C_a^0 (aN\epsilon)^0 + \sum_{p=1}^{a} C_a^p (aN\epsilon)^p = aNE + 1
\sum_{p=0}^{a} C_a^p (aN\epsilon)^p \, 1^{a-p} = aNE + 1
(aN\epsilon + 1)^a = aNE + 1
aN\epsilon + 1 = \sqrt[a]{aNE + 1}
\epsilon = \frac{\sqrt[a]{aNE + 1} - 1}{aN}.


This completes the proof. □

Our proposed SketchNB algorithm possesses strong theoretical guarantees when the width and depth of the sketch are set using Equations (3.11) and (3.4), respectively. This means that, in order to keep the guarantees over all attributes and classes, we need to set a much deeper and wider sketch table than in the case where we have only one counter.

Moreover, Equations (3.11) and (3.12) assume that the size of the stream is known. In real-world scenarios, where the stream can grow indefinitely, this means that our sketch, along with the hash functions, would need to grow with the size of the stream. The theoretical parameters derived here provide good results, but they can lead to a relatively large sketch; in some cases, this may not be desirable for space reasons. We can, however, perform some optimizations on the space needed. The size of the stream, N, can be very large or even infinite; instead, it can simply be the length of a "sliding window" over which the error guarantee is provided. To achieve this, we introduce a scaling constant b into the computations, in the following manner. First, we set ϵ as follows:

\epsilon = b \cdot \frac{\sqrt[a]{aNE + 1} - 1}{aN},

then, we choose the width w as follows:

w = \frac{e\,aN}{b\left(\sqrt[a]{aNE + 1} - 1\right)}.   (3.13)

Note that the depth d remains the same: it depends only on the parameter ∆. It is also necessary to point out that b > 1. When a increases, the sketch table size increases accordingly, and a higher value of b reduces the width of the sketch table. In this work, b is picked experimentally.
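As a worked example, the following short computation instantiates Equations (3.4), (3.5), and (3.13); the numeric values of a, N, E, ∆, and b below are illustrative only and are not the tuned settings of Section 3.4.

# Illustrative computation of the sketch dimensions (depth d and width w).
import math

a, N = 10, 100_000          # number of attributes, (windowed) stream length
n_classes = 2
E, Delta = 0.01, 0.1        # target error and failure probability
b = 4_500.0                 # scaling constant (b > 1), to be picked experimentally

t = n_classes * a                                   # simultaneous retrievals per instance
delta = 1 - (1 - Delta) ** (1 / t)                  # Equation (3.4)
d = math.ceil(math.log(1 / delta))                  # Equation (3.5), sketch depth
w = math.ceil(math.e * a * N / (b * ((a * N * E + 1) ** (1 / a) - 1)))   # Equation (3.13)

print(f"depth d = {d}, width w = {w}")              # here: d = 6, w = 400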

3.3.2 AdaSketchNB Algorithm

One crucial issue when dealing with a very large stream is the fact that the underlying distribution of the data can change at any moment, a phenomenon known as concept drift; we direct the reader to [9] for a survey on this concept. A widely popular algorithm that handles concept drift is ADaptive WINdowing (ADWIN) [21], used in several machine learning algorithms such as the Hoeffding adaptive tree [44] and leveraging bagging [24]. The main idea of ADWIN is to maintain a variable-length window W of the most recently seen instances, with the property that the window has the maximal length statistically consistent with the hypothesis "there has been no change in the average value inside the window" [21].


In our context, it is important to incorporate a change detector mechanism. We choose ADWIN here due to its strong theoretical guarantees and good practical results.

Algorithm 3 Learning phase. Symbols: S = {X_1, X_2, ...}: data stream; ϵ: epsilon; δ: delta; W: sliding window; D: change detector; α: drift threshold.
1: function UpdateAdaSketchNB(S, ϵ, δ, W, α)
2:     CMS ← 0    ▷ create the sketch of size w · d, with w = ⌈e/ϵ⌉ and d = ⌈ln(1/δ)⌉
3:     for all X_i ∈ S do
4:         for all x_i^k ∈ X_i do    ▷ k ∈ [1 ··· a]
5:             CMS[l, h_l(⟨k, x_i^k, c_j⟩)] += 1    ▷ for l = 1 ··· d, increment the cells using the d hash functions (range [1..w])
6:         end for
7:         if c = p_c then    ▷ true class = predicted class
8:             W ← 1
9:         else
10:            W ← 0
11:        end if
12:        if D(α) then    ▷ if a drift is detected
13:            CMS ← 0
14:        end if
15:    end for
16: end function

Against this background, we propose a second algorithm, AdaSketchNB, described in Algorithm 3, that adds to the SketchNB algorithm an adaptive strategy using the ADWIN change detector in the learning phase, applied to the entire sketch table, in order to track drifts. It monitors the data distribution by detecting changes in the class label predictions. Given a stream S, when a new training instance X_i arrives, we compare the predicted class label (under our current model) with the true class label. Then, we update the sliding window W by adding 1 if this comparison holds true, and 0 otherwise (lines 7 and 9). Once a change is detected in the distribution, i.e., the newly generated instances deviate from the class distribution on which the current classifier was built, the sketch table is re-initialized to learn the new model corresponding to the new distribution (lines 12-14). The prediction and updating phases remain the same as in the basic SketchNB algorithm.
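The following schematic sketch summarizes the adaptive loop of Algorithm 3; the model and detector objects, and their method names, are placeholders for a SketchNB-like classifier and an ADWIN-like detector, not an actual library API.

# Schematic adaptive step: monitor the 0/1 correctness stream and reset the sketch on drift.
def adaptive_learning_step(model, detector, x, y):
    correct = int(model.predict_one(x) == y)   # compare prediction with the true label
    model.learn_one(x, y)                      # update the sketch with the new instance
    detector.add(correct)                      # feed the 0/1 outcome to the adaptive window
    if detector.detected_change():             # drift detected on the accuracy stream
        model.reset_sketch()                   # drop outdated counts and start a new model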

Suppose that we have a high-dimensional data stream that we need to maintain in the sketch. Should we extract counts a times (a potentially very high number) to predict the class label of each instance, and do we also need to update the sketch a times?


Figure 3.2: Hashing trick.

3.3.3 SketchNBHT and AdaSketchNBHT Algorithms

It is an undeniable fact that sketches summarize massive data streams within a limited space, but the hashing trick (HT) [20] can be used for further improvements on large datasets. The main idea behind the HT is presented in Figure 3.2 and explained in Section 2.4.2.

We propose the SketchNBHT and AdaSketchNBHT algorithms, which attempt to enhance SketchNB and AdaSketchNB in terms of memory and processing time while maintaining high accuracy. These approaches can perform better on high-dimensional data by integrating the HT technique to compress the input space dimensionality. In fact, to make the analysis of high-dimensional and large datasets tractable, we first reduce the input dimension by applying the HT technique to instances one by one, incrementally. We thereafter obtain instances with a significantly smaller dimension that are processed by either the estimate procedure in Algorithm 2 or Algorithm 3, depending on whether we are using SketchNBHT or AdaSketchNBHT, respectively.

A numerical representation of each instance is obtained after applying DR using the HT technique. So how can we update the sketch (which works with nominal data) right after?

By applying the HT technique [20] to an input vector, we would obtain a numerical representation, which prevents it from fitting into the sketch table. Instead, we obtain a binary vector by updating each cell only once, i.e., we do not increment the cells: we simply put 1 if a cell contains 0, and it remains the same if it already contains 1. Thus, we get a discretized representation able to fit into the sketch table, which also avoids wrong values due to collisions. Then, we store the instances in the sketch table in the same way as the SketchNB algorithm. Consequently, instead of updating the sketch table a times, we henceforth update it only m times, where m ≪ a. In other words, the HT is treated as an

internal pre-processing step that transforms instances within the SketchNB and AdaSketchNB algorithms. In this manner, we guarantee that the out-of-memory errors that may occur with standard classifiers for data streams will not happen when using the hashing trick and the sketches.

3.4 Experimental Evaluation

We conduct several experiments to assess the classification performance of the aforementioned proposals: the SketchNB, AdaSketchNB, SketchNBHT, and AdaSketchNBHT algorithms. To evaluate them, we are interested in three main dimensions: the accuracy (%) as the final percentage of instances correctly classified, the memory (B), and the time (Sec) required to learn and predict from the data.

3.4.1 Datasets

We use 8 synthetic and 7 real-world datasets in our experiments, 5 of which contain high-dimensional data. The synthetic datasets, created using the data generators provided by MOA [97], include drifts. For the real datasets, we do not know whether a drift exists or not, but we still evaluate them using the change detector mechanism coupled with our proposals. Table 3.1 presents a short description of each dataset; further details are provided in what follows.
SEA. The SEA generator proposed by [98]. It is generated with 3 attributes and 2 decision classes, with concept drift, and simulates 3 gradual drifts.
RBF. The Radial Basis Function (RBF) generator creates centroids at random positions, each with a standard deviation, a weight, and a class label. This dataset simulates drift by moving the centroids at constant speed.
LED. The LED generator originates from the CART book [99] and simulates concept drifts. It produces 24 attributes, of which 17 are irrelevant. The goal is to predict the digit displayed on the LED display. We also generate LEDg, which simulates 3 gradual drifts.
AGR. The AGRAWAL generator [100] creates a data stream with 9 attributes and 2 classes. A factor is used to change the original values of the data. AGR is used to simulate 3 gradual drifts in the generated stream.
HYP. The HYPERPLANE generator [101] is used to generate streams with gradual concept drift by changing the values of its weights. We parameterize HYP with 10 attributes and a magnitude of change equal to 0.001.
Tweets. Tweets is a text generator that simulates sentiment analysis on tweets, where messages can be classified into two categories depending on whether they convey positive or negative feelings. Tweets1 and Tweets2 produce 1,000 and 10,000 attributes, respectively.

50 3.4 Experimental Evaluation

Table 3.1: Overview of the datasets.

Dataset    #Instances   #Attributes   #Classes   Type
SEA        1,000,000    3             2          Synthetic
RBF        1,000,000    10            5          Synthetic
LED        1,000,000    24            10         Synthetic
LEDg       1,000,000    24            10         Synthetic
AGR        1,000,000    9             2          Synthetic
HYP        1,000,000    10            2          Synthetic
Tweets1    10,000       1,000         2          Synthetic
Tweets2    10,000       10,000        2          Synthetic
KDD99      494,020      41            23         Real
Cover      581,012      54            7          Real
Elec       45,312       8             2          Real
Poker      829,201      10            10         Real
Enron      1,702        1,054         2          Real
IMDB       120,919      1,001         2          Real
CNAE       1,080        856           9          Real

KDD99. KDD Cup'99 (http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html) for network intrusion detection. This dataset contains 41 attributes and 23 classes. It has often been used to evaluate the performance of big data stream algorithms.
Cover. The forest covertype dataset obtained from US Forest Service (USFS) Region 2 Resource Information System (RIS) data. It contains 54 attributes and 7 classes.
Elec. The Electricity market dataset first described by Harries and Wales. In this market, prices change every 5 minutes and are affected by demand, supply, season, weather, and time. It contains two possible class labels identifying the change of the price relative to a moving average of the last 24h.
Poker. The Poker hand dataset consists of 829,201 instances and 10 attributes. Each instance of the Poker-Hand dataset is an example of a hand consisting of five playing cards.
Enron. The Enron corpus dataset is a large set of email messages that was made public during the legal investigation concerning the Enron corporation [103]. This cleaned version of Enron consists of 1,702 instances and 1,054 attributes.
IMDB. The IMDB movie reviews dataset (http://waikato.github.io/meka/datasets/) was first proposed for sentiment analysis [104]; reviews have been pre-processed, and each review is encoded as a sequence of word indexes (integers).
CNAE. CNAE is the national classification of economic activities dataset, initially used in [105]. It contains 1,080 instances, each with 856 attributes, representing descriptions of




Figure 3.3: Classification accuracy with different sketch sizes for different synthetic datasets.

Brazilian companies categorized into 9 classes. The original texts were pre-processed to obtain the current highly sparse dataset.
Handling continuous attributes in data stream classifiers is a bit tricky; one way to deal with this issue is to assume that the data follow a series of Gaussian distributions. We use a simpler approach here: we pre-process the datasets and transform all numerical attributes into discrete attributes [106]. The discretization was performed using [107], where each numerical attribute was discretized into an equal-width histogram with 10 bins.

3.4.2 Results and Discussions

The experiments were implemented and evaluated in Java using the MOA framework [97] and the datasets described in Section 3.4.1. Despite the theoretical bounds provided on the size of the sketch, we need to tune the sketch table parameters to allow better space usage. To do so, we use both the synthetic and the real datasets to parameterize the sketch size for each dataset, by controlling the parameter b. In order to fix the value of the constant b in Equation (3.13), we perform some preliminary experiments. We set the default values of the sketch table parameters to ∆ = 0.1, E = 0.01, and N = 10^5 in Equations (3.5) and (3.12) to fix the sketch table size, and we use the default confidence bound for ADWIN. Figure 3.3 illustrates the appropriate width for the first 6 synthetic datasets, i.e., the smallest width that leads to an accurate model for each dataset. We notice that for the same number of attributes and classes, e.g., LED and LEDg, we obtain the same width that is able


to maintain the entire data stream. From there, we fix the value of the constant b for each dataset experimentally; the values are reported in Table 3.4.

Table 3.2: Accuracy comparison of SNB, GMS, NB, kNN, AdSNB, AdNB, and HAT (non-adaptive: SNB, GMS, NB, kNN; adaptive: AdSNB, AdNB, HAT). Bold values indicate the best results per dataset.

Dataset   SNB     GMS     NB      kNN     AdSNB   AdNB    HAT
SEA       84.64   70.83   84.64   69.95   86.70   86.70   86.70
RBF       44.66   32.62   45.43   35.14   47.14   45.91   82.73
LED       73.94   73.82   73.94   44.02   73.90   73.90   70.87
LEDg      54.02   54.23   54.02   43.18   72.71   73.09   72.60
AGR       70.95   61.67   70.96   68.19   79.98   82.53   89.34
HYP       90.16   81.93   90.16   64.18   90.88   91.16   81.32
Tweet1    87.50   –       89.26   68.73   87.54   89.26   84.64
Tweet2    74.42   –       91.41   77.72   74.42   91.41   85.38
Syn ∅     72.54   –       75.23   58.89   76.66   79.24   81.73
KDD99     89.72   97.43   95.51   99.69   91.17   99.59   98.93
Cover     62.08   50.69   62.86   80.07   95.80   83.17   86.56
Elec      66.64   63.90   66.53   73.89   71.70   73.31   72.48
Poker     56.81   56.43   58.84   74.98   69.22   74.58   74.73
Enron     75.50   –       77.44   95.24   84.61   85.61   91.83
IMDB      64.98   –       68.38   70.42   68.52   70.67   70.71
CNAE      56.30   –       62.13   62.13   56.30   62.13   68.80
Real ∅    67.43   –       70.24   79.49   76.76   78.43   80.55
O. ∅      70.16   –       72.90   68.50   76.71   78.86   81.18

Tables 3.2, 3.3, and 3.4 present the results on all datasets. We compared the SketchNB (SNB) and AdaSketchNB (AdSNB) classifiers to well-known state-of-the-art algorithms: NB, kNN with k = 100, and GMFactorSketch (GMS) [96] with default parameters ϵ = 0.01 and δ = 0.1. We also compared against adaptive classifiers such as the Hoeffding Adaptive Tree (HAT) [44] and the adaptive NB (AdNB) with ADWIN. It turns out that GMS is not practical, as it could not finish execution on some datasets, which are marked by cells with "–". We observe that the accuracy of SketchNB is almost the same as that of NB on all the datasets, despite the use of probabilistic counts to estimate the true counts. In comparison with kNN, we notice that SketchNB is more accurate on the whole set of datasets except the real ones; this difference can be explained by the independence assumption between attributes made by NB, which in some cases does not hold. Since GMS builds two sketch tables for each attribute, whereas we use only one big sketch, it is naturally more memory- and time-consuming. Moreover, GMS cannot handle large datasets, e.g., Tweet1 (see Table 3.2). To evaluate the proposed change-detector-based classifier AdaSketchNB, we compare it against the AdNB and HAT approaches. In Table 3.2, on overall average, we observe that AdaSketchNB achieves practically the same accuracy as AdNB whilst using a smaller amount of resources (see Table 3.3).


Table 3.3: Memory comparison of SNB, GMS, NB, kNN, ASNB, AdNB, and HAT. Bold values indicate the best results per dataset.

(Non-adaptive: SNB, GMS, NB, kNN; Adaptive: AdSNB, AdNB, HAT)
Dataset   SNB       GMS        NB        kNN       AdSNB     AdNB      HAT
SEA       5,568     49,312     4,880     58,280    7,832     7,704     2.48E6
RBF       18,824    226,072    18,520    100,920   19,768    20,688    3.61E6
LED       23,568    644,544    27,704    187,896   22,792    30,168    1.27E6
LEDg      23,568    644,544    27,704    187,896   22,792    30,528    623,856
AGR       19,360    201,000    12,520    92,760    14,992    15,344    3.85E6
HYP       20,624    225,952    15,248    100,824   17,264    17,568    925,600
Tweet1    569,880   –          645,456   7.52E6    582,144   647,440   1.71E6
Tweet2    6.03E6    –          6.55E6    7.5E7     6.04E6    6.39E6    1.23E7
Syn ∅     839,598   –          912,687   1.04E7    840,600   895,121   3.38E6
KDD99     106,632   1.32E6     123,496   303,200   108,456   61,088    57,048
Cover     55,736    1.71E6     60,488    377,832   61,440    46,344    19,936
Elec      14,776    178,464    11,944    88,712    14,560    13,120    42,432
Poker     16,568    223,952    19,800    98,760    18,608    17,688    278,528
Enron     565,104   –          688,752   7.13E6    534,264   690,400   275,624
IMDB      1.41E6    –          1.57E6    7.53E6    1.37E6    1.54E6    2.91E6
CNAE      1.16E6    –          1.58E6    1.58E6    1.23E6    1.58E6    571,064
Real ∅    474,706   –          578,263   2.44E6    554,384   563,117   593,306
O. ∅      221,977   –          270,343   6.69E6    707,033   740,186   2.08E6

Table 3.4: Time comparison of SNB, GMS, NB, kNN, ASNB, AdNB, and HAT. Bold values indicate the best results per dataset.

(Non-adaptive: SNB, GMS, NB, kNN; Adaptive: AdSNB, AdNB, HAT; b: scaling constant of Equation (3.13))
Dataset   SNB     GMS     NB      kNN      AdSNB   AdNB    HAT      b
SEA       1.75    2.08    1.49    10.91    2.87    2.03    7.75     675
RBF       6.62    9.65    3.73    31.2     9.6     4.5     17.67    7,191
LED       15.9    41.03   5.67    64.21    30.79   7.28    21.08    14,692
LEDg      15.93   43.45   5.73    68.83    29.23   7.25    23.09    14,692
AGR       4.09    5.94    3.1     26.05    5.47    3.57    9.12     3,678
HYP       4.51    6.42    3.42    27.79    4.11    3.77    9.8      5,448
Tweet1    4.79    –       4.17    29.21    6.77    2.89    5.57     7,514,600
Tweet2    73.44   –       51.04   385.23   91.43   44.78   65.93    6.48E8
Syn ∅     15.87   –       9.79    80.43    22.53   9.51    20       –
KDD99     40.02   90.75   12.78   51.66    64.42   7.86    18.11    15,074
Cover     18.92   50      7.67    64.42    32.74   7.09    19.55    59,675
Elec      0.33    0.42    0.18    1.06     0.35    0.22    0.67     4,555
Poker     6.87    12.3    2.36    22.03    11.67   11.67   7.95     4,859
Enron     0.98    –       0.63    6.07     1.29    0.62    0.97     15,340,000
IMDB      53.45   –       27.07   273.73   68.36   32.89   113.38   6,525,300
CNAE      1.38    –       0.52    0.67     2.26    0.65    1.71     7,231,800
Real ∅    17.41   –       7.31    59.95    25.8    8.71    23.2     –
O. ∅      16.54   –       8.63    70.87    24.06   9.14    21.49    –


Figure 3.4: Sorted plots of accuracy, memory, and time over different output dimensions: (a) accuracy, (b) memory, and (c) time on Tweet1; (d) accuracy, (e) memory, and (f) time on CNAE.

In comparison with the HAT, the gain in memory exceeds the slight loss in accuracy. This is due to the use of the ADWIN change detector and estimator [21] in addition to the sketch. Assessing the benefits in terms of resource usage, we observe that the behavior of the memory results is similar to that of the time results, i.e., when the memory usage increases, the running time increases accordingly. For some datasets, NB is more space-efficient than SketchNB (especially for datasets with a low number of attributes and classes, e.g., SEA and AGR), which is quite natural, as simpler learners usually require less time for training and prediction. With large datasets (in terms of the number of attributes and classes), for instance Tweet1 and Tweet2, the SketchNB and AdaSketchNB approaches consume fewer resources than NB and AdNB, respectively, thanks to the space-saving CMS structure. In comparison with GMS, kNN, and HAT, the proposed algorithms are more space- and time-efficient.

Despite the gains exhibited by the SketchNB and AdaSketchNB classifiers over different datasets, the results are still not entirely satisfying. Therefore, we proposed SketchNBHT (SNBHT) and AdaSketchNBHT (ASNBHT), a third contribution that internally pre-processes the instances of SketchNB and AdaSketchNB using the hashing trick [20].


Table 3.5: Accuracy comparison of SNBHT , GMSHT , NBHT , kNNHT , ASNBHT , AdNBHT , and HATHT . Bold values indicate the best results per dataset.

              Non-adaptive                               Adaptive
Dataset     SNBHT    GMSHT    NBHT     kNNHT     ASNBHT   AdNBHT   HATHT
Tweet1      77.99    61.84    78.80    68.04     77.99    78.80    77.53
Tweet2      79.25    74.76    79.66    75.27     79.25    79.66    78.96
Enron       70.22    –        71.69    89.54     83.14    76.11    –
IMDB        68.40    –        69.98    70.27     69.73    69.55    70.44
CNAE        58.38    53.24    64.20    47.05     58.92    64.20    62.81
O ∅         70.85    –        72.87    70.03     73.80    73.67    –

Table 3.6: Memory comparison of SNBHT , GMSHT , NBHT , kNNHT , ASNBHT , AdNBHT , and HATHT . Bold values indicate the best results per dataset.

              Non-adaptive                                      Adaptive
Dataset     SNBHT    GMSHT        NBHT     kNNHT      ASNBHT   AdNBHT   HATHT
Tweet1      3,648    1,001,109    6,817    236,328    5,472    8,801    76,896
Tweet2      3,721    1,001,109    6,817    236,328    5,545    8,801    62,404
Enron       3,619    –            7,760    265,373    4,381    9,341    –
IMDB        4,528    –            6,817    236,288    5,580    8,157    93,113
CNAE        14,133   1,001,165    19,623   236,848    15,935   21,103   23,689
O ∅         5,930    –            9,567    242,233    7,382    11,241   –

Table 3.7: Time comparison of SNBHT , GMSHT , NBHT , kNNHT , ASNBHT , AdNBHT , and HATHT . Bold values indicate the best results per dataset.

              Non-adaptive                               Adaptive
Dataset     SNBHT    GMSHT    NBHT     kNNHT     ASNBHT   AdNBHT   HATHT
Tweet1      2.66     3.03     3.16     3.59      2.84     2.84     2.97
Tweet2      29.72    43.06    43.48    37.93     29.63    33.28    34.11
Enron       0.42     –        0.47     0.61      0.44     0.41     –
IMDB        14.59    –        13.54    25.57     14.38    13.58    14.75
CNAE        0.44     0.51     0.40     0.52      0.46     0.41     0.49
O ∅         9.55     –        12.2     13.64     9.56     9.51     –


Figure 3.4 presents some detailed results of these experiments with two datasets, where each one is processed with 6 different output dimensions. We report in Tables 3.5, 3.6, and 3.7 the overall average accuracy, memory, and running time over the different values of the dimension for each dataset. For all datasets, and across the different space dimensions, the accuracy of the SketchNBHT approach is similar to that of NB while using a feasible amount of resources. For instance, Figure 3.4(a) depicts the ability of SketchNBHT and AdaSketchNBHT to achieve accuracy similar to NB and AdNB on Tweet1, and Figures 3.4(b) and 3.4(c) show that they lead to large memory savings with lower processing time. The same behaviour is shown in Figures 3.4(d), (e), and (f) using the CNAE dataset. Tables 3.5, 3.6, and 3.7 also show that SketchNBHT outperforms GMSHT and kNNHT (except for accuracy with the latter). Nevertheless, the gain in memory and time is more substantial. Also, among the adaptive classifiers, AdaSketchNBHT is more accurate than HATHT for some datasets and uses much less memory.

3.5 Conclusion

In this chapter, we presented the SketchNB, AdaSketchNB, SketchNBHT, and AdaSketchNBHT approaches, new classifiers that handle evolving data streams and answer our main research questions (Q1, Q2, Q3) discussed in Chapter 1. The SketchNB algorithm extends the naive Bayes classifier using the CMS to reduce the memory needed. Then, we proposed AdaSketchNB, an adaptive version of SketchNB to handle concept drift. Finally, we coupled SketchNB and its adaptive version with the hashing trick technique for further gains with high-dimensional data, obtaining SketchNBHT and AdaSketchNBHT. We explained the learning process of these classifiers using sketches, which come with strong theoretical guarantees. We compared these classifiers in an extensive evaluation against well-known classifiers, showing that the GMS, NB, HAT, and kNN methods are outperformed in memory by SketchNB and AdaSketchNB on large datasets. We also showed that, using the HT and CMS techniques, the proposed SketchNBHT and AdaSketchNBHT obtain good results in terms of classification performance (accuracy, memory and time) when compared to other state-of-the-art classifiers. In the following chapter, we aim to improve the performance of the kNN algorithm, which proved to be costly in terms of resources with evolving data streams, by using an efficient DR technique.


4 Compressed k-Nearest Neighbors Classification

Contents
4.1 Introduction ...... 60
4.2 Preliminaries ...... 60
    4.2.1 Construction of Sensing Matrices ...... 61
4.3 Compressed Classification Using kNN Algorithm ...... 62
    4.3.1 Theoretical Insights ...... 65
    4.3.2 Application to Persistent Homology ...... 66
4.4 Compressed kNN Ensembles ...... 67
4.5 Experimental Evaluation ...... 67
    4.5.1 Datasets ...... 67
    4.5.2 Results and Discussions ...... 68
        Non-Ensemble Methods ...... 69
        Ensemble Methods ...... 71
4.6 Conclusion ...... 74

This chapter contains results from a collaboration [108] with Rodrigo F. de Mello1 and Nikolaos Tziortziotis2 published at the European Conference on Artificial Intelligence (ECAI) 2020 under the title "Compressed k-Nearest Neighbors Ensembles for Evolving Data Streams".

1University of São Paulo, São Carlos, Brazil. 2Tradelab, Paris, France


4.1 Introduction

In the previous chapter, we noticed that kNN is a resource-consuming algorithm, especially with high-dimensional data streams, because it maintains a window of the most recent instances from the stream (Section 2.3.2). In practice, mining tasks reduce time and space requirements by only processing combinations of relevant features out of those redundant streams, which corresponds to data summaries generally obtained using a DR technique.

The dimensionality reduction process inherently arises when one deals with a large number of attributes, especially when data are sparse, in order to reduce the use of computational resources, more precisely the memory and processing time.

To address this issue, we apply Compressed Sensing (CS), a feature extraction strategy that provides theoretical lower and upper bounds on pairwise data transformations (Section 2.4.2). This data reduction is highly relevant in the context of data stream mining since it helps to reduce resource demands while ensuring the quality of learning (e.g., classification accuracy) and addressing the stream setting requirements. In this chapter, our main focus consists in increasing the time and memory efficiency of the kNN algorithm by compressing the input stream using an efficient DR technique (Q2), which allows us to ensure good theoretical bounds between the original and the adapted stream, and therefore guarantees a close approximation to the accuracy that would be obtained using the original stream (Q1). The rest of this chapter is organized as follows. We present the background of this work in Section 4.2. Afterwards, we provide the basics of the CS technique, followed by its application in conjunction with kNN and ensemble-based methods for evolving data streams in Section 4.3. Section 4.5 discusses the experimental results performed on both synthetic and real datasets that show the efficiency of our proposals. Section 4.6 concludes this chapter.

4.2 Preliminaries

Because of the potentially infinite nature of evolving data streams and the additional cost of high-dimensional data, classification algorithms, such as kNN, suffer from resource issues. One way to cope with this issue is to couple the classifier with an efficient DR technique. The latter comprises the process of finding some transformation A : ℝ^a → ℝ^m, where m ≪ a, to be applied on each instance X_i from the stream S (see Equation (2.4)). As we presented in Section 2.4.2, random projection matrices have been used in conjunction with CS [109]. This approach applies a random linear transformation on vectors, changing their original space and leading to significant results, outperforming PCA. For example, given data in a 10^4-dimensional space, two RPs will give a perfect recovery while

two PCA projections will only recover the data with a probability equal to 2/10^4. In short, RP achieves good performance with few projections whereas the PCA performance increases linearly with the number of output dimensions, which makes it slower [109]. CS is a technique that has attracted a lot of attention and is based on the concept that a data compression method has to deal with redundancy while transforming and reconstructing data [23]. The basic idea is to use orthogonal features, i.e., complementary features, to provably and properly represent data as well as reconstruct them from a small number of samples (measurements). CS relies on principles that provide, with high probability, a good data reconstruction from a limited number of incoherent and possibly noisy measurements. Mathematically, the decompression of data that obeys the linear relation in Equation (2.4) consists in approximating the error by ℓ1-norm minimization, which provides a convex relaxation; when data are sparse, the recovery via ℓ1-minimization is provably exact [110]:

arg min_{x ∈ ℝ^d} ∥x∥₁   s.t.   y = Ax.

The goal is to find an efficient representation for each instance such that the sum of their reconstruction errors is minimized. The restricted isometry property (RIP) guarantees the proper computation of the above-mentioned recovery problem [111]. Hence, we aim to find the best tradeoff over three aspects (defined in Section 2.5): (i) the classifier accuracy; (ii) the memory usage; and (iii) the overall processing time.

All such aspects are strongly related: the drastic reduction of time and space complexities would make our approach much faster, but one should also weigh the classification performance – more precisely, the accuracy – in the equation.

4.2.1 Construction of Sensing Matrices

Two related properties have been pointed out for the characterization of the sensing – or sampling – matrix: the sparsity and the RIP. The latter is both a necessary and a sufficient condition for an efficient data recovery. Randomization is a key ingredient in the construction of most of the RIP matrices used in the CS transformation process [112]. In what follows, we cite examples of matrices that have been used in the CS transformation process:

• Fourier matrix is obtained by applying the Fourier transform on the data and thereafter selecting uniformly at random p rows from a d-dimensional Fourier matrix. However, this transformation requires maintaining the entire dataset;

• Random Gaussian matrix is generated randomly from a Gaussian distribution with independent and identically distributed (i.i.d.) entries with zero mean and variance one: A_{i,j} ∼ N(0, 1);


• Random Bernoulli matrix has entries which are randomly sampled from a Bernoulli distribution with equal probability: A_{i,j} ∈ {1/√p, −1/√p}.

For the data-independent random matrices, it has been proved that any matrix A satisfying the Johnson-Lindenstrauss Lemma (2.3) will also satisfy the RIP in CS with high probability if m = O(s log(a)) [85], which intuitively answers the question: "how many trials do we require to collect s different orthogonal and most interesting dimensions from a possibilities?". A comparison of the results obtained with these matrices and an explanation of the choice of the matrix used in this work are provided in the following.
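To make the construction of these data-independent matrices concrete, the following minimal Java sketch (illustrative only, not the MOA-based implementation used in this thesis; the class and method names are assumptions) generates Gaussian and Bernoulli sensing matrices and applies the compression y = Ax to a sparse vector.

import java.util.Random;

// Minimal sketch: data-independent sensing matrices and the compression y = A x.
public class SensingMatrices {

    // Gaussian sensing matrix: entries drawn i.i.d. from N(0, 1).
    static double[][] gaussianMatrix(int m, int a, Random rnd) {
        double[][] A = new double[m][a];
        for (int i = 0; i < m; i++)
            for (int j = 0; j < a; j++)
                A[i][j] = rnd.nextGaussian();
        return A;
    }

    // Bernoulli sensing matrix: entries in {+1/sqrt(p), -1/sqrt(p)} with equal
    // probability (here p = m, the number of rows of the matrix).
    static double[][] bernoulliMatrix(int m, int a, Random rnd) {
        double[][] A = new double[m][a];
        double v = 1.0 / Math.sqrt(m);
        for (int i = 0; i < m; i++)
            for (int j = 0; j < a; j++)
                A[i][j] = rnd.nextBoolean() ? v : -v;
        return A;
    }

    // Compression phase only: y = A x, projecting x from R^a into R^m.
    static double[] compress(double[][] A, double[] x) {
        double[] y = new double[A.length];
        for (int i = 0; i < A.length; i++) {
            double s = 0.0;
            for (int j = 0; j < x.length; j++) s += A[i][j] * x[j];
            y[i] = s;
        }
        return y;
    }

    public static void main(String[] args) {
        int a = 1000, m = 40;                       // input and target dimensions
        double[][] A = gaussianMatrix(m, a, new Random(42));
        double[] x = new double[a];                 // a sparse instance
        x[3] = 1.0; x[250] = -2.0; x[777] = 0.5;
        double[] y = compress(A, x);                // low-dimensional representation
        System.out.println("compressed length = " + y.length);
    }
}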

4.3 Compressed Classification Using kNN Algorithm

The kNN is one of the most often used algorithms that has been adapted to the stream setting [113]. The prediction of the class label for an instance is made by taking the majority vote of its nearest neighbors inside a fixed size sliding window3, using a defined distance metric. Since we are keeping instances in a window, a DR is imperative to avoid the curse of dimensionality, when dealing with high-dimensional data that could increase the use of computational resources during prediction.

The main idea to mitigate this drawback and improve kNN’s performance is to use a simple strategy which has the desired properties – such as CS.

We focus on the analysis of an infinite stream of instances X_i ∈ ℝ^a from which we wish to construct a low-dimensional space ℝ^m, where m ≪ a. We assume that instances are s-sparse in some basis, so we can use CS with a RIP matrix and work in a lower dimension of O(s log(a)). It is important to perform such a reduction because it depends only on the number of dimensions and is independent of the stream size, making it useful in applications for data streams where the size is unknown. This transformation can lead to information loss (by removing important attributes), but if the sensing matrix respects the RIP, then with high probability the information loss is minimal and the original data can be recovered. Figure 4.1 presents the main flow of the proposed approach, which combines the simplicity of kNN and the strong properties of CS to obtain the compressed kNN classifier, called CS-kNN in the following. Fundamentally, CS is composed of two phases: (i) the compression phase, where the data are projected onto a low-dimensional space; and (ii) the decompression phase, where the data are recovered. Nevertheless, the compressed nature of CS makes the paradigm a better fit for classification than for reconstruction. In this work, we are only concerned with the first stage, so the features extracted from the high-dimensional space are fed to the kNN classifier, which predicts the target class labels. This does not, however, prevent the guarantees over the recovery from holding true.

3Streaming kNN is therefore adaptive and handles the question Q3 presented in Chapter 1.


Figure 4.1: Compressed kNN Scheme. High-dimensional data are transformed by compressed sensing (feature extraction) into low-dimensional data, which are then fed to kNN.

Algorithm 4 Compressed kNN algorithm.
Symbols: S = {X1, X2, ...}: data stream; C = {c1, c2, ...}: set of class labels; W: sliding window; k: the number of neighbors; m: the target dimension; B: subset of W.

 1: function CS-kNN(S, W, k, m)
 2:     Init W ← ∅
 3:     for all Xi ∈ S do
 4:         Yi ← CS(Xi)                      ▷ apply CS
 5:         for all Yj ∈ W do                ▷ ∀ j ≠ i
 6:             compute D_Yj(Yi)             ▷ Equation (4.2)
 7:         end for
 8:         c ← max_{c∈C} D_{B,k}(Yi)        ▷ Equation (4.3)
 9:         W ← Yi                           ▷ maintain the compressed Xi in W
10:     end for
11: end function

Algorithm 4 shows the pseudo-code of CS-kNN. Given an infinite data stream S and the window size W, we apply CS on each instance Xi of the stream (lines 3-4), then we apply kNN by computing the distance of each Yj in W to Yi (line 6) and assign the most frequent class label among the nearest neighbors to Yi (line 8). Finally, we feed the compressed version Yi to W (line 9). In this work, we opt to make kNN more efficient in terms of memory and speed while taking into account the online aspect of evolving data streams. Our approach consists in applying CS on high-dimensional data by compressing every newly arrived instance via solving Equation (2.4). So, we need to use an effective sampling matrix that gives sufficiently good accuracy (or with only minor loss) and eventually leads to computational savings. In a recent work [114], the authors reviewed the performance of different sampling matrices; their experimental results show that Gaussian random matrices perform well.
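As an illustration of the end-to-end flow of Algorithm 4, the following self-contained Java sketch (an assumption-laden toy, not the MOA extension used for the experiments; all names are illustrative) compresses each instance with a fixed Gaussian matrix, predicts by majority vote over the k nearest compressed neighbors in a sliding window, and then stores the compressed instance.

import java.util.*;

// Minimal sketch of the CS-kNN idea: compress, predict by kNN over a sliding
// window of compressed instances, then store the compressed instance.
public class CsKnnSketch {
    final double[][] A;                                   // sensing matrix, m x a
    final int k, windowSize;
    final Deque<double[]> window = new ArrayDeque<>();    // compressed instances Y_j
    final Deque<Integer> labels = new ArrayDeque<>();     // their class labels

    public CsKnnSketch(int a, int m, int k, int windowSize, long seed) {
        Random rnd = new Random(seed);
        A = new double[m][a];
        for (double[] row : A)
            for (int j = 0; j < a; j++) row[j] = rnd.nextGaussian();
        this.k = k;
        this.windowSize = windowSize;
    }

    // Compression phase only: y = A x (projection into R^m).
    public double[] compress(double[] x) {
        double[] y = new double[A.length];
        for (int i = 0; i < A.length; i++)
            for (int j = 0; j < x.length; j++) y[i] += A[i][j] * x[j];
        return y;
    }

    // Majority vote among the k nearest compressed neighbors in the window.
    public int predict(double[] y) {
        if (window.isEmpty()) return 0;
        List<double[]> ws = new ArrayList<>(window);
        List<Integer> ls = new ArrayList<>(labels);
        Integer[] idx = new Integer[ws.size()];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble(i -> squaredDist(y, ws.get(i))));
        Map<Integer, Integer> votes = new HashMap<>();
        for (int i = 0; i < Math.min(k, idx.length); i++)
            votes.merge(ls.get(idx[i]), 1, Integer::sum);
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    // Maintain the compressed instance in the sliding window (line 9 of Algorithm 4).
    public void train(double[] y, int label) {
        window.addLast(y);
        labels.addLast(label);
        if (window.size() > windowSize) { window.removeFirst(); labels.removeFirst(); }
    }

    // Test-then-train on one raw instance: compress, predict, then store.
    public int processInstance(double[] x, int label) {
        double[] y = compress(x);
        int prediction = predict(y);
        train(y, label);
        return prediction;
    }

    static double squaredDist(double[] u, double[] v) {
        double s = 0;
        for (int i = 0; i < u.length; i++) { double d = u[i] - v[i]; s += d * d; }
        return s;
    }
}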


Table 4.1: Accuracy (%) comparison of compressed sensing with Bernoulli, Gaussian, Fourier matrices, and the entire dataset. Bold values indicate the best results.

Dataset Bernoulli Gaussian Fourier Whole data

Tweet1      64.78    77.90    77.90    79.80
Tweet2      64.59    77.17    79.53    79.20
Tweet3      60.24    75.59    77.13    78.86
CNAE        27.04    64.59    58.49    73.33
Enron       95.88    95.97    95.91    96.18
Overall ∅   62.51    78.24    77.79    81.58

To motivate our choice in the following, we perform experiments to assess different sampling matrices. For this, we first generate synthetic random Bernoulli and Gaussian matrices, and we also construct the Fourier matrices on some datasets (see the description in Table 4.2). For each dataset, we build projections for 5 different settings of the target dimension {10, 20, 30, 40, 50}. Table 4.1 shows the results for kNN (with k = 5) along with the overall average over the different target dimensions for each matrix. We notice that kNN performs worse on average with the random Bernoulli matrix, confirming previous studies [114, 115], whereas the Gaussian and Fourier matrices come very close, in terms of accuracy, to kNN applied on the whole data without projection. Nevertheless, the Fourier transformation relies on the data, i.e., it requires the presence of all instances, which is unrealistic in the context of data streams. In [116], the authors proposed a recursive scheme for the Fourier matrix with data streams which constructs successive windows and uses the measurement in the previous window to obtain the next one. However, this approach is expensive in terms of memory since it keeps data in windows, and it is still not as accurate as using a Gaussian matrix.

In this work, we want to use a data-independent matrix to ensure fast processing. The experiments above suggest that we should focus on the Gaussian matrix which not only provides good accuracy but also satisfies with high probability the RIP and therefore allows the recovery of instances [117].

We take the sampling matrix to be such that its elements are drawn independently from a Gaussian distribution, a setting common to various CS problems [23]. The matrix A in Equation (2.4) satisfies the RIP, so X can be recovered with minimum error from Y, i.e., Y preserves the important information that X contains. First, we bound the probability of error between an estimated instance and its expected value using the Hoeffding inequality. Given X_i, ∀i ∈ [1, N], where each component x_i^j is bounded by the interval [p_j, b_j], then for any ϵ, the probability of error is upper bounded as follows:

P(|X̂_i − X_i| ≥ ϵ) ≤ 2 exp( −2ϵ² / Σ_{l=1}^{a} (b_l − p_l)² ),    (4.1)


where X̂i is the reconstructed instance and N can simply be the size of a “sliding window” over which the error guarantee is provided.
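As a purely illustrative instantiation of this bound (the numbers below are assumptions, not values taken from the experiments): with a = 100 attributes, each bounded in [0, 1] so that b_l − p_l = 1 for every l, and ϵ = 20, Equation (4.1) yields

P(|X̂_i − X_i| ≥ 20) ≤ 2 exp( −2 · 20² / 100 ) = 2e^{−8} ≈ 6.7 × 10^{−4},

i.e., large estimation errors become exponentially unlikely as ϵ grows relative to the attribute ranges.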

When applying the DR on instances from the original space, a good technique aims to preserve the structure of the projected data, i.e., keep close instances together and dissimilar instances far away, as represented in the input space. The challenge is then to show that the neighborhood of a given instance is preserved after the CS projection to ensure a good classification performance of the kNN algorithm.

4.3.1 Theoretical Insights

To make the link between kNN and the use of RIP matrices, we point out that the JL Lemma 2.4.2 [80, 81] asserts that a random projection preserves the distances between pairs of instances up to a (1 ± ϵ) factor with high probability. Similarly, the kNN algorithm is based on a function that measures the distances between instances in order to predict. Thus, we aim to provide theoretical guarantees on the connection between kNN and stream recovery by showing that the CS transformation using a Gaussian matrix preserves the distance function and also approximately maintains the shape – in the neighborhood sense – of the original space, based on the concept of persistent homology. Persistent homology [118] is one of the main tools used to extract information from topological features of a space at different scales for an effective shape description. Given a dataset in some metric space, computing the persistent homology naturally involves nearest neighbors since we construct the topological space by building open balls around instances. In this regard, it has been shown in [119] that the persistent homology of a distance such as in the JL Lemma (2.3) is (1 ± ϵ)-preserved under random projection into m = O(log N/ϵ²) dimensions. The basic idea in [119] consists in preserving the radius of the minimum enclosing open ball of the data up to a factor of (1 ± 4ϵ). In the following, we deal with the Euclidean distance function in both kNN and the data reconstruction guarantees. Given a window W, the distance between instances X_i and X_j is defined as follows:

D_{X_j}(X_i) = √(∥X_i − X_j∥²).    (4.2)

Similarly, the k-nearest neighbors distance is defined as follows:

D_{W,k}(X_i) = min_{\binom{W}{k},\, X_j ∈ W} Σ_{j=1}^{k} D_{X_j}(X_i),    (4.3)

where \binom{W}{k} denotes the subsets of W of size k; the minimizing subset corresponds to the k-nearest neighbors of the instance X_i in W. CS random matrices satisfy the RIP, so we need to show that our matrix preserves the neighborhood for kNN without significant loss through the JL Lemma (2.3). This would allow

us to conserve distances among instances and finally ensure distance preservation among all neighbors. In [117], Baraniuk, Davenport, DeVore, and Wakin have indeed established a connection between the expressions in (2.3) and (2.5), and proved that the JL lemma implies the RIP for s-sparse data within an ϵ-multiplicative factor. A converse result has been proved in [120], wherein matrices having the RIP respect the JL lemma, i.e., preserve the distances in the transformation between any pair of instances up to a (1 ± ϵ)-factor with target dimension in O(s log(a)), where a is the input dimensionality.
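For intuition only (an assumed numerical instance, not a result reported in this thesis): with input dimensionality a = 1,000 and sparsity s = 20, the target dimension prescribed by these results is on the order of

m = O(s log(a)) ≈ 20 × ln(1000) ≈ 20 × 6.9 ≈ 138

dimensions, up to the constant hidden in the O(·) notation; unlike the RP bound O(log N/ϵ²), it does not grow with the number of instances N, which is what makes it convenient for unbounded streams.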

4.3.2 Application to Persistent Homology

Now we want to prove, based on the aforementioned result [120] derived from Equation (2.5), that CS also preserves the distances between all instances up to a (1 ± ϵ)-error, and not only the distances between pairs of instances. In other words, we prove that, given a RIP matrix, the resulting compressed instances preserve the kNN neighborhood of the data.

Theorem 4.1 Given a set of instances in a sliding window W = {X_i}, i ∈ [1, N], and ϵ ∈ [0, 1], if there exists a transformation matrix A : ℝ^a → ℝ^m having the RIP, such that m = O(s log(a)), where s is the sparsity of the data, then ∀X_i ∈ W:

(1 − ϵ) D²_{W,k}(X) ≤ D²_{W,k}(AX) ≤ (1 + ϵ) D²_{W,k}(X).    (4.4)

Proof. Assume that X_1, X_2, ..., X_k are the k-nearest neighbors to an instance t ∈ W. We have:

(1 − ϵ) ∥t − X_i∥² ≤ ∥At − AX_i∥² ≤ (1 + ϵ) ∥t − X_i∥².

By summing these inequalities over the k neighbors, we obtain:

(1 − ϵ) Σ_{i=1}^{k} ∥t − X_i∥² ≤ Σ_{i=1}^{k} ∥At − AX_i∥² ≤ (1 + ϵ) Σ_{i=1}^{k} ∥t − X_i∥².

The distance of At to its k-nearest neighbors in W is minimal, so we have the lower bound:

D²_{W,k}(At) ≤ Σ_{i=1}^{k} ∥At − AX_i∥².

For the upper bound, we have:

D²_{W,k}(At) ≤ Σ_{i=1}^{k} ∥At − AX_i∥² ≤ (1 + ϵ) Σ_{i=1}^{k} ∥t − X_i∥²,

D²_{W,k}(At) ≤ Σ_{i=1}^{k} ∥At − AX_i∥² ≤ (1 + ϵ) D²_{W,k}(t).


Assume that Az_1, Az_2, ..., Az_k are the k-nearest neighbors to At, where z_1, z_2, ..., z_k ∈ W. So we have:

(1 − ϵ) Σ_{i=1}^{k} ∥t − z_i∥² ≤ Σ_{i=1}^{k} ∥At − Az_i∥² = D²_{W,k}(At).

Given the fact that X_1, X_2, ..., X_k are the k-nearest neighbors to t, we obtain the lower bound:

D²_{W,k}(t) = Σ_{i=1}^{k} ∥t − X_i∥² ≤ Σ_{i=1}^{k} ∥t − z_i∥².

Combining the last two inequalities gives (1 − ϵ) D²_{W,k}(t) ≤ D²_{W,k}(At), which completes the proof. □

We demonstrated that the CS-kNN has desirable geometrical properties: by achieving homology preservation while being scale-invariant in terms of distances, it captures the neighborhood up to some (ϵ)-divergence between the original and the compressed instances.

4.4 Compressed kNN Ensembles

We propose another application in this framework that consists in using the CS technique within an ensemble-based method that applies CS-kNN as a base learner under Leveraging Bagging (LB) [24], denoted CS-kNNLB. To increase the diversity inside the LB ensemble method, in addition to sampling with the Poisson distribution (λ), with λ ≥ 1, we can use several random matrices by generating a different CS matrix for each ensemble member (CS-kNN) instead of using only one random matrix for all the learners (the case of CS-kNNLB). We refer to this approach in the following as the Compressed Sensing Bagging ensemble, CSB-kNN. The neighborhood-preservation properties proved for CS-kNN also hold for the ensemble-based methods that use CS-kNN as a base learner.
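A minimal sketch of this ensemble idea is given below, reusing the illustrative CsKnnSketch class from the earlier sketch (again an assumption-laden toy, not the MOA-based implementation): each member holds its own Gaussian matrix, and Leveraging-Bagging-style diversity is approximated by training each member a Poisson(λ)-distributed number of times on every instance.

import java.util.*;

// Minimal sketch of the CSB-kNN idea: one CS matrix per member plus Poisson-weighted training.
public class CsbKnnSketch {
    final List<CsKnnSketch> members = new ArrayList<>();
    final Random rnd = new Random(7);
    final double lambda;

    public CsbKnnSketch(int ensembleSize, int a, int m, int k, int windowSize, double lambda) {
        this.lambda = lambda;
        for (int e = 0; e < ensembleSize; e++)
            members.add(new CsKnnSketch(a, m, k, windowSize, e));   // one CS matrix per member
    }

    // Test-then-train: majority vote of the members, then Poisson-weighted training.
    public int testThenTrain(double[] x, int trueLabel) {
        Map<Integer, Integer> votes = new HashMap<>();
        for (CsKnnSketch member : members) {
            double[] y = member.compress(x);            // member-specific compressed view
            votes.merge(member.predict(y), 1, Integer::sum);
            int w = poisson(lambda);                    // online bagging weight
            for (int i = 0; i < w; i++) member.train(y, trueLabel);
        }
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    // Knuth's method for sampling from Poisson(lambda); adequate for small lambda.
    int poisson(double lambda) {
        double limit = Math.exp(-lambda), p = 1.0;
        int n = 0;
        do { n++; p *= rnd.nextDouble(); } while (p > limit);
        return n - 1;
    }
}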

4.5 Experimental Evaluation

In this section, we assess the impact of feature transformations on kNN and the ensemble methods for data streams. In order to thoroughly evaluate our proposals, we conduct extensive experiments using several datasets.

4.5.1 Datasets

We use 4 synthetic and 5 real-world datasets from a variety of domains. Table 4.2 presents a short description of each dataset, and further details are provided in what follows. Tweets. Tweets was created using the text data generator provided by MOA [97]. It simulates sentiment analysis on tweets, where messages can be classified into two categories


Table 4.2: Overview of the datasets.

Dataset #Instances #Attributes #Classes Type

Tweets1     1,000,000   500      2    Synthetic
Tweets2     1,000,000   1,000    2    Synthetic
Tweets3     1,000,000   1,500    2    Synthetic
RBF         1,000,000   200      10   Synthetic
CNAE        1,080       856      9    Real
Enron       1,702       1,000    2    Real
IMDB        120,919     1,001    2    Real
Spam        9,324       39,916   2    Real
Covt        581,012     54       7    Real

depending on whether they convey positive or negative feelings. Tweets1, Tweets2, and Tweets3 produce instances of 500, 1,000, and 1,500 attributes, respectively.
RBF. The Radial Basis Function generator creates centroids at random positions, and each one has a standard deviation, a weight and a class label.
CNAE. CNAE is the national classification of economic activities dataset, initially used in [105]. Instances represent descriptions of Brazilian companies categorized into 9 classes. The original texts were pre-processed to obtain the current highly sparse dataset.
Enron. The Enron corpus is a cleaned version of a large set of emails that was made public during the legal investigation concerning the Enron corporation [103].
IMDB. The IMDB4 movie reviews dataset was first proposed for sentiment analysis [104], where reviews have been pre-processed, and each review is encoded as a sequence of word indexes (integers).
Spam. The spam corpus is the result of text mining on an online news dissemination system that aims at incrementally filtering e-mails as spam or not [121]. Each attribute represents the presence of a word in the instance (an e-mail).
Covt. The forest covertype dataset was obtained from US Forest Service Region 2 Resource Information System (RIS) data.

4.5.2 Results and Discussions

The experiments were implemented and evaluated in Java by extending the MOA framework [37, 97]. We used the online Test-Then-Train evaluation setting [91], where each instance is used first for testing and then for training. Table 4.3 presents the results for distinct sizes of W and shows that, for shorter windows (W = 100), the accuracy degrades, while for bigger windows the accuracy slightly increases.
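The evaluation loop itself is simple; the following Java sketch (a toy stream with an arbitrary labeling rule, using the illustrative CsKnnSketch class from above rather than MOA's evaluators) shows the test-then-train pattern: predict on each instance before using it for training, and accumulate the prequential accuracy.

import java.util.Random;

// Minimal sketch of the prequential (Test-Then-Train) evaluation loop.
public class TestThenTrainSketch {
    public static void main(String[] args) {
        int a = 500, m = 40, k = 5, windowSize = 1000;
        CsKnnSketch model = new CsKnnSketch(a, m, k, windowSize, 1L);
        Random gen = new Random(3);
        long correct = 0, seen = 0;
        for (int t = 0; t < 10_000; t++) {
            // toy sparse instance; in the experiments this would come from a MOA stream
            double[] x = new double[a];
            for (int j = 0; j < 10; j++) x[gen.nextInt(a)] = gen.nextGaussian();
            int label = (x[0] + x[1] > 0) ? 1 : 0;      // arbitrary ground-truth rule
            // test first: the call below predicts, and only then trains on (x, label)
            if (model.processInstance(x, label) == label) correct++;
            seen++;
        }
        System.out.printf("prequential accuracy = %.2f%%%n", 100.0 * correct / seen);
    }
}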

4http://waikato.github.io/meka/datasets/.


Table 4.3: Performance of kNN with different window sizes (overall averages).

                   W = 100    W = 1000   W = 5000
Accuracy (%)       75.48      80.33      82.44
Time (sec)         537.76     6592.72    20028.01
Memory (MB)        46.85      269.65     2000

On the other hand, the processing time and memory usage increase as well. Therefore this parameter selection implies an accuracy-time-memory tradeoff. The following experiments are performed with W = 1000 for kNN, because a greater window size indeed yields better accuracy but the resource consumption is more significant.

Non-Ensemble Methods

For a fair comparison of the performance of our proposed classifier, CS-kNN, we use commonly-used techniques in the literature coupled with kNN: the self-adjusting memory kNN5 with the CS technique (CS-samkNN), kNN using the hashing trick (HT-kNN), principal component analysis (PCA-kNN), and the standard kNN without projection (using the entire data). The streaming kNN has two principal parameters: the number of neighbors k and the window size W. Tables 4.4, 4.5, and 4.6 report the final accuracies, memory consumption, and speed of the classification task in a 40-dimensional space after the projections, based on two setups of k = 5, 11. We choose 40 dimensions because we noticed that, starting from this space size, improvements are statistically insignificant, as shown in Figure 4.2. The latter illustrates a detailed comparison with five different values of the output dimension (10, 20, ..., 50) on the

Tweet2 and Enron datasets. We notice that our proposed CS-kNN approach has more accurate results (Table 4.4) than HT-kNN for all datasets and is slightly outperformed by CS-samkNN, the standard kNN (without projection) and PCA-kNN; this is quite a natural result since kNN processes the whole data stream and PCA-kNN formally tries to find a lower-dimensional space under which the sum of squared distances – representing the error between the original data and its projection – is minimized. The CS-kNN is moderately less accurate than CS-

5Best paper award at ICDM 2016.


Table 4.4: Accuracy comparison of CS-kNN, CS-samkNN, HT-kNN, PCA-kNN, and kNN over the whole dataset.

            CS-kNN            CS-samkNN         HT-kNN            PCA-kNN           kNN
Dataset     k=5      k=11     k=5      k=11     k=5      k=11     k=5      k=11     k=5      k=11
Tweet1      78.82    78.88    76.02    74.31    73.77    73.14    80.43    79.43    79.80    78.17
Tweet2      78.13    78.36    75.74    74.13    73.02    72.61    80.06    78.89    79.20    77.74
Tweet3      76.75    76.16    73.03    72.56    72.40    72.36    81.93    82.38    78.86    77.73
RBF         98.90    97.31    99.87    99.78    19.20    19.20    99.00    97.86    98.89    97.33
CNAE        70.00    68.70    73.77    72.19    65.00    65.28    75.83    72.08    73.33    71.48
Enron       96.02    95.65    96.23    96.06    95.76    95.48    94.59    93.18    96.18    96.00
IMDB        69.86    72.32    74.29    74.53    69.65    72.03    70.57    72.81    70.94    72.51
Spam        85.39    81.01    91.34    90.48    83.82    80.63    96.00    94.66    81.17    77.32
Covt        91.36    89.92    90.47    87.71    77.18    76.59    91.55    90.16    91.67    90.30
Overall ∅   82.80    82.04    83.42    82.42    69.98    69.70    85.55    84.61    83.34    82.06

Table 4.5: Time comparison of CS-kNN, CS-samkNN, HT-kNN, PCA-kNN, and kNN over the whole dataset.

            CS-kNN            CS-samkNN         HT-kNN            PCA-kNN           kNN
Dataset     k=5      k=11     k=5      k=11     k=5      k=11     k=5      k=11     k=5      k=11
Tweet1      62.55    91.06    41.81    59.20    93.24    99.78    622.65   629.60   1198     1432
Tweet2      107.48   112.97   74.92    99.77    120.83   127.95   705.71   712.84   2029     2502
Tweet3      126.73   142.95   83.01    101.43   154.22   165.11   988.25   995.93   2864     3643
RBF         59.47    80.52    60.08    77.00    168.31   169.88   243.26   258.12   284.34   439.23
CNAE        0.87     0.92     0.56     0.63     0.95     1.02     3.97     4.14     32.19    35.04
Enron       1.58     1.63     1.31     1.57     1.81     1.90     7.21     7.28     86.08    91.99
IMDB        95.62    120.66   80.82    103.51   125.62   129.27   1686     1692     7892     8217
Spam        159.92   183.19   197.22   208.94   194.07   216.37   11329    14820    34231    35031
Covt        30.94    51.08    39.25    45.55    88.17    90.85    161.00   164.16   252.69   268.28
Overall ∅   71.68    87.22    64.33    75.29    105.25   111.42   1749     2142     5430     5740

samkNN for some datasets containing drifts, because the latter deals with different types of concept drift, which makes it more robust to changes in the data distributions. To assess the benefits in terms of computational resources – where small values are desirable – Tables 4.5 and 4.6 point out the improvements of CS-kNN in terms of memory and time against CS-samkNN, PCA-kNN, and kNN, which are significant enough to justify the relatively minor losses in accuracy. In fact, the CS-samkNN algorithm maintains models for current and past concepts, which makes it memory inefficient. PCA-kNN performs worse, in terms of resource usage, than RP since it incrementally stores and updates the eigenvectors and eigenvalues, confirming previous studies [109]. Our proposed approach is also faster than HT-kNN, although they have similar memory behavior, because both are based on RP and do not rely on the data. For some datasets such as Spam, CS-kNN outperforms kNN (using the whole data) simply because finding relevant combinations of


Table 4.6: Memory comparison of CS-kNN, CS-samkNN, HT-kNN, PCA-kNN, and kNN over the whole dataset.

Dataset     CS-kNN   CS-samkNN   HT-kNN   PCA-kNN   kNN
Tweet1      2.52     8.86        2.52     3.03      34.64
Tweet2      2.52     10.48       2.52     5.97      70.97
Tweet3      2.52     10.52       2.52     8.84      103.19
RBF         2.52     10.31       2.52     8.86      13.18
CNAE        2.52     10.22       2.52     3.09      61.37
Enron       2.52     9.84        2.52     3.51      70.60
IMDB        2.52     10.28       2.52     8.81      70.65
Spam        2.52     10.57       2.52     245.22    1476.11
Covt        2.52     9.96        2.52     3.02      3.47
Overall ∅   2.52     10.12       2.52     32.26     211.57

existing features and presenting them in a different space can help supervised models to improve accuracy. Even if data are not sparse, CS surprisingly performs transformations on suitable bases. Figures 4.2(a) and 4.2(d) depict the typical tradeoff for accuracy: a small feature space cannot properly represent the data and can therefore significantly degrade the accuracy, whereas a higher-dimensional space (e.g., 50) increases the accuracy and brings it closer to the results of kNN. We also notice the stability of our CS-kNN, i.e., the accuracy increases steadily with the target space size and converges to the accuracy of kNN. On the other hand, CS-samkNN, HT-kNN and PCA-kNN have different behaviors, clearly illustrated in Figure 4.2(a); these results suggest that, in practice, it may be hard to fix a proper space size. We also show that kNN, PCA-kNN and HT-kNN are outperformed in terms of processing time (Figures 4.2(b) and 4.2(e)) and that CS-kNN also requires less memory than these baselines. For instance, with Tweet2 and Enron in Figures 4.2(c) and 4.2(f) respectively, we observe large gains compared to the kNN, PCA-kNN, and CS-samkNN algorithms, albeit our proposal has the same memory usage as HT-kNN because both do not rely on the data. We also observe that the behavior of the memory usage is correlated with the running time trends, i.e., when the memory usage increases, the processing time also increases accordingly.

Ensemble Methods

We compare the proposed LB with CS-kNN as a base learner (CS-kNNLB) and the CSB-kNN with a different CS matrix for each learner, both using 10 learners (the ensemble size) and k = 5, against popular ensemble methods such as the adaptive random forest (ARF) [25] and leveraging bagging using the Hoeffding tree [41] as base learner (HTreeLB), with 30 and 10 ensemble members, respectively. Tables 4.7, 4.8 and 4.9 display the performance of the ensembles. In this evaluation, each ensemble member uses the same CS matrix to


Figure 4.2: Sorted plots of accuracy, time and memory over different output dimensions (10 to 50). Panels: (a) accuracy Tweet2; (b) time Tweet2; (c) memory Tweet2; (d) accuracy Enron; (e) time Enron; (f) memory Enron.

Table 4.7: Accuracy (%) comparison of CS-kNNLB, CSB-kNN, CS-HTreeLB, and CS-ARF.

Dataset     CS-kNNLB   CSB-kNN   CS-HTreeLB   CS-ARF
Tweets1     78.94      81.80     81.35        81.53
Tweets2     78.24      81.28     80.39        80.75
Tweets3     76.06      80.40     78.59        79.54
RBF         98.90      99.68     99.24        99.25
CNAE        71.64      81.48     65.70        62.55
Enron       95.94      96.00     96.17        95.88
IMDB        70.02      74.27     74.80        74.88
Spam        86.08      90.28     90.02        89.04
Covt        91.09      91.76     88.48        88.01
Overall ∅   82.99      86.33     83.86        83.49


Table 4.8: Time (sec) comparison of CS-kNNLB, CSB-kNN, CS-HTreeLB, and CS-ARF.

Dataset CS-kNNLB CSB-kNN CS-HTreeLB CS-ARF

Tweets1     1130.44    1251.77    82.18     170.52
Tweets2     1449.44    1526.30    105.87    212.69
Tweets3     1668.26    1825.41    127.19    239.97
RBF         735.21     772.62     90.22     223.08
CNAE        8.99       11.02      1.80      4.66
Enron       20.07      21.92      2.11      3.78
IMDB        1552.81    1649.94    90.17     174.54
Spam        359.07     2194.93    218.16    270.15
Covt        612.62     694.02     41.69     115.3
Overall ∅   837.43     1105.33    84.37     108.70

Table 4.9: Memory (MB) comparison of CS-kNNLB, CSB-kNN, CS-HTreeLB, and CS-ARF.

Dataset CS-kNNLB CSB-kNN CS-HTreeLB CS-ARF

Tweets1     6.16     27.13     60.71    175.71
Tweets2     6.16     28.97     66.75    177.32
Tweets3     6.16     30.80     73.89    176.92
RBF         6.16     25.96     9.91     25.90
CNAE        6.15     28.11     0.48     1.31
Enron       6.15     28.59     1.59     4.10
IMDB        6.16     28.60     5.60     18.63
Spam        5.38     151.91    5.15     10.44
Covt        6.16     24.10     4.44     11.66
Overall ∅   6.07     41.57     25.39    66.89

perform the reduction into 40 dimensions, except CSB-kNN, which sets up a different matrix for each member in an attempt to assess the impact of ensemble diversity. Tables 4.7, 4.8 and 4.9 show that, using only 10 learners, CSB-kNN performs better than the reputed CS-ARF [25] using 30 learners (trees) on most of the datasets. We noticed that with CSB-kNN, when the feature set is large (e.g., Spam), the memory usage is relatively high. On the other hand, for large datasets (e.g., Tweets), CS-ARF and CS-HTreeLB require more memory whereas our approaches use less, which makes them useful for the stream setting. The CS-kNNLB ensemble method is the most memory efficient and even proved competitive with CS-HTreeLB and CS-ARF. However, this comes at the price of being slower. Also, the computational resources of CSB-kNN with different CS matrices increase considerably for the sake of accuracy and diversity (in order for the ensemble to generalize well). In conclusion, our CSB-kNN ensemble method has good overall performance compared to its competitors. We showed that our proposal can be used to accurately classify data streams with a large number of attributes using a relatively small number of base learners, in contrast with CS-ARF, where more – or fewer – base trees can considerably affect the classification performance.

4.6 Conclusion

In this chapter, we presented a scheme to enable the sliding window kNN algorithm (Q3) to be efficient with evolving high-dimensional data streams, in terms of classification performance and computational resources (memory and time) (Q1 and Q2 handled), after space transformations provided by CS, given its ability to ensure theoretical lower and upper bounds on pairwise data transformations. Our first contribution in this chapter is the mix of two main ingredients, CS and kNN, resulting in the CS-kNN algorithm designed to work on evolving data streams while operating on a reduced feature space. We also proposed an ensemble method, CSB-kNN, that uses CS-kNN as a base learner under LB, where each ensemble member has a different CS matrix to help increase the overall accuracy. We showed theoretically that for the CS-kNN using Gaussian matrices, the neighborhood distance is preserved up to some (1 ± ϵ)-factor. The key idea is to show that the squared kNN distances of the original data are also preserved within the same factor; consequently, our CS-kNN algorithm conserves such distances. We evaluated the proposed algorithms via extensive experiments using synthetic and real-world datasets with different parameters. Results show the potential of the CS-kNN and CSB-kNN algorithms to obtain close approximations to what would be obtained using the input instances from data streams. The following chapter will explore the ARF ensemble method and enhance its performance using the CS technique.

5 Compressed Adaptive Random Forest Ensemble

Contents
5.1 Introduction ...... 75
5.2 Motivation ...... 76
5.3 Compressed Adaptive Random Forest ...... 77
5.4 Experimental Evaluation ...... 80
    5.4.1 Datasets ...... 80
    5.4.2 Results and Discussions ...... 81
5.5 Conclusion ...... 85

This chapter contains results from a collaboration [122] with Heitor Murilo Gomes1 published at the International Joint Conference on Neural Networks (IJCNN) 2020 under the title “CS-ARF: Compressed Adaptive Random Forests for Evolving Data Stream Classification”.

5.1 Introduction

Streaming ensemble-based methods have become very popular thanks to their high predictive performance and the fact that they can be used – or tested – with new learners [48]. Nevertheless, other than their sensitivity to the learning algorithm used as a base learner, most of the existing stream ensemble methods are often expensive and time-consuming

1University of Waikato, Hamilton, New Zealand.

when dealing with sparse and high-dimensional data streams.

Despite their good classification performance, the major drawback of ensembles is the high computational cost, which is exacerbated as the dimensionality of the data increases.

In a recent work [25], the adaptive random forest method (ARF) was proposed to deal with evolving data streams by extending the random forest algorithm (RF) with a concept drift mechanism2 to deal with changes in the distribution over time (Section 2.6). However, based on the results from the previous chapter, it appears that ARF is effective (in terms of accuracy) but inefficient (in terms of resource usage) with high-dimensional data streams. In an attempt to improve the performance of ARF, we propose the compressed adaptive random forest, an ensemble-based method that extends the ARF [25] to handle high-dimensional and sparse data streams. To do so, we follow the strategy used in the previous chapter by incorporating a DR technique, CS [23], to project the data into a lower-dimensional space by finding useful combinations of existing attributes (Q1 and Q2 handled). Therefore, instead of building trees using high-dimensional instances, we will use a smaller representation of these instances that will boost the efficiency of the ARF method. The rest of the chapter is organized as follows. Section 5.2 discusses the details of the research issues that motivated our study. In Section 5.3, we introduce the strategy adopted with our proposed method. Section 5.4 outlines and discusses the experimental evaluation. We finally draw concluding remarks in Section 5.5.

5.2 Motivation

One notable issue related to the ensemble-based methods with evolving data streams is the massive computational demand (in terms of memory and running time). Ensembles require more resources than single classifiers, a problem that becomes significantly worse with high-dimensional data streams, as mentioned in the previous chapter. To cope with this problem without significantly affecting the predictive performance of the ARF method, we need to incorporate an efficient DR technique that can internally and incrementally transform high-dimensional data into a lower space before using them for the learning task. The feature extraction task plays a critical role when dealing with high-dimensional data and is often used in data mining and machine learning. This task consists in extracting a subset of relevant attributes (in a low-dimensional space) from a set of input attributes in a high-dimensional space [52]. This pre-processing step provides potential benefits to stream mining algorithms, such as reducing the storage usage, decreasing the processing time, and enhancing – or not losing much in – the prediction performance.

2The ARF method naturally handles concept drifts and embraces the question Q3.


In this context, we aim to use the CS technique [23] that deals with redundancy while transforming and reconstructing data. The basic idea is to use orthogonal attributes or samples, i.e. complementary attributes, to provably and properly represent data as well as reconstruct them from a small number of samples. More details about the basic notions of this technique are available in Section 2.4.2.

5.3 Compressed Adaptive Random Forest

The random forest [123] is a well-known ensemble-based method that is widely used in batch classification. It grows several trees while randomly selecting attributes at each split node from the entire set of input attributes. Nonetheless, this is inapplicable to evolving data streams because the random forest algorithm performs multiple passes to establish bootstraps, which is inappropriate in the streaming framework. To address this, an adaptive random forest method [25] has been proposed to adapt random forest to work under the streaming setting. This adaptation includes the use of: (i) an online bootstrap process to approximate the original data, explained in [25]; and (ii) a random subset of attributes to limit the size of the input set during each leaf split. To cope with concept drifts, the ARF method is coupled with warning and drift detection operators to adapt to changes in the data distribution over time, which leads to a superior classification performance. As mentioned previously, the major drawback of the ensemble-based methods, and particularly the ARF method, is the important amount of computational resources needed to deal with high-dimensional data streams. To cope with this issue, we use an efficient technique with relevant properties, such as CS [23, 109]. In this vein, we propose our novel approach, Compressed Adaptive Random Forest, denoted CS-ARF in the following, that combines the simplicity of CS and the high learning performance of the reputed ARF method for evolving data streams. Given an infinite stream of high-dimensional instances X ∈ ℝ^a, we wish to construct a low-dimensional representation Y ∈ ℝ^m, where m ≪ a and Y is the dense representation of X after the application of the reduction using the CS projection. We assume that all the instances X in the stream S are s-sparse to adhere to the CS requirements and use a RIP matrix in order to transform data into a lower-dimensional space of O(s log(a)) [23]. This compression space size is easy to obtain, since it depends on the size of the input attributes, which makes it convenient for applications in the streaming context where the total number of instances is unknown. CS is also different from RP, which satisfies the JL Lemma 2.4.2 [81] asserting that N instances from a Euclidean space can be projected into a lower-dimensional space of O(log N/ϵ²) dimensions. In this work, we are only concerned with the compression phase, which will alleviate the need of resources in the ARF classification task while dealing with high-dimensional streams. So, for each tree inside our ensemble approach, CS-ARF, we apply a pre-processing step

consisting in the CS transformation of every incoming instance via solving Equation (2.4). Therefore, the low-dimensional representation of the current instance will be fed to the underlying ARF ensemble member for prediction and then used to update the corresponding model. For the purpose of obtaining sufficiently good accuracy – or with minor loss – and reducing the use of computational resources, we need to perform the projection using an effective sensing matrix A that respects the RIP for the CS application. In this regard, recent studies [114, 115] and Chapter 4 (Table 4.1) assessed the performance of different sensing matrices that satisfy the restricted isometry property with high probability and showed that CS, using Gaussian random matrices, achieves good results in comparison with other sensing matrices.

In the light of this, we focus on using Gaussian random matrices because of their simplicity and data-independent nature, which is well suited to the nature of evolving data streams.

Actually, we do not need the instances from the stream to set up the projection of the high-dimensional data. Instead, we build the sensing matrix A such that its elements are independently generated from a Gaussian distribution, A_{i,j} ∼ N(0, 1). Algorithm 5 shows the pseudo-code of the proposed CS-ARF approach. As explained previously, for each ensemble member t we apply the CS transformation by generating a Gaussian random matrix gm (different from the ones generated for the rest of the ensemble members); the current instance is thus represented by e low-dimensional representations, one fed to each of the e trees (lines 6-8), instead of feeding the high-dimensional instance X. Then, we predict the class label for the current compressed dense instance Y ∈ ℝ^m (line 9) before using it to train the trees (line 10). For more details about the tree training task and how trees are updated, we redirect readers to the work of Gomes, Bifet, Read, Barddal, Enembreck, Pfahringer, Holmes, and Abdessalem [25]. To handle drifts in the stream, the ARF method includes warning and drift detection mechanisms: once a warning is detected for an ensemble member, a background tree is created (lines 11-13). This tree will replace the corresponding ensemble member if the warning signal becomes a drift (lines 15-16). The main novelty of our approach lies in how we internally couple the CS technique with the ARF method to deal with evolving data streams. In fact, we use several CS matrices by generating a different Gaussian matrix for each tree in order to promote diversity inside the ensemble and lose as little as possible in terms of predictive performance.
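To make the wrapping concrete, here is a minimal Java skeleton of the idea (illustrative only; the base learner, the Hoeffding tree internals and the ADWIN-based warning/drift handling of ARF [25] are abstracted behind an interface, and all names are assumptions): each member owns its own Gaussian matrix and is trained on its own compressed view of every instance.

import java.util.*;

// Minimal sketch of the CS-ARF wrapping: one sensing matrix per ensemble member.
public class CsArfSketch {
    // Any incremental base learner (e.g., a Hoeffding tree) could fit this interface.
    interface IncrementalLearner {
        int predict(double[] y);
        void train(double[] y, int label);   // would also feed the drift detector in ARF
    }

    final List<IncrementalLearner> trees;
    final List<double[][]> matrices = new ArrayList<>();   // one sensing matrix per tree

    public CsArfSketch(List<IncrementalLearner> trees, int a, int m, long seed) {
        this.trees = trees;
        Random rnd = new Random(seed);
        for (int e = 0; e < trees.size(); e++) {
            double[][] A = new double[m][a];
            for (double[] row : A)
                for (int j = 0; j < a; j++) row[j] = rnd.nextGaussian();
            matrices.add(A);
        }
    }

    // Test-then-train on one high-dimensional instance.
    public int testThenTrain(double[] x, int label) {
        Map<Integer, Integer> votes = new HashMap<>();
        for (int e = 0; e < trees.size(); e++) {
            double[] y = compress(matrices.get(e), x);    // member-specific low-dimensional view
            votes.merge(trees.get(e).predict(y), 1, Integer::sum);
            trees.get(e).train(y, label);                 // warning/drift checks would go here
        }
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    static double[] compress(double[][] A, double[] x) {
        double[] y = new double[A.length];
        for (int i = 0; i < A.length; i++)
            for (int j = 0; j < x.length; j++) y[i] += A[i][j] * x[j];
        return y;
    }
}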

Each ensemble member in our CS-ARF approach will be preceded by a dimensionality reduction step that uses a different sensing matrix.

Therefore, models – or trees – are going to be different inside the CS-ARF ensemble because of: (i) the randomization due to the generation of different Gaussian random


Algorithm 5 CS-ARF algorithm.
Symbols: S ∈ ℝ^a: data stream; m: output dimension; e: ensemble size; f: maximum features evaluated per split; C: change detector; B: set of background trees; δw: warning threshold; δd: drift threshold.

 1: function CS-ARF(m, e, f, δw, δd)
 2:     T ← CreateTrees(e)
 3:     G ← GaussianMatrix(e, a, m)          ▷ generate e random matrices
 4:     B ← ∅
 5:     for all X ∈ S do
 6:         (x, c) ← X
 7:         for all t ∈ T and gm ∈ G do
 8:             y ← CS(x, m, gm)             ▷ project x into m dimensions using CS
 9:             ĉ ← predict(t, y)
10:             TreeTrain(f, t, Y)           ▷ train t on the compressed Y ← (y, c)
11:             if C(δw, t, Y) then          ▷ if a warning is detected
12:                 b ← CreateTree()         ▷ create a background tree
13:                 B(t) ← b
14:             end if
15:             if C(δd, t, Y) then          ▷ if a drift is detected
16:                 t ← B(t)                 ▷ replace t by b
17:             end if
18:         end for
19:         for all b ∈ B do
20:             TreeTrain(f, b, Y)
21:         end for
22:     end for
23: end function


Table 5.1: Overview of the datasets.

Dataset #Instances #Attributes #Classes Type

Tweets1     1,000,000   500      2    Synthetic
Tweets2     1,000,000   1,000    2    Synthetic
Tweets3     1,000,000   1,500    2    Synthetic
RBF         1,000,000   200      10   Synthetic
Enron       1,702       1,000    2    Real
IMDB        120,919     1,001    2    Real
Nomao       34,465      119      2    Real
Har         10,299      561      6    Real
ADS         3,279       1,558    2    Real

matrices; and (ii) the construction of the trees by using random subsets of attributes for node splits.

5.4 Experimental Evaluation

In this section, we present in detail all the results provided by the proposed CS-ARF method. Then, we analyze them in order to explain the main advantages of using the CS technique.

5.4.1 Datasets

We use 4 synthetic and 5 real datasets that have been widely used in the literature to evaluate the performance of stream classifiers. Table 5.1 presents a short description of each dataset; further details are provided in what follows. Tweets. Tweets was created using the tweets text data generator provided by MOA [97] that simulates sentiment analysis on tweets, where messages can be classified into two categories depending on whether they convey positive or negative feelings. Tweets1, Tweets2, and

Tweets3 produce 1,000,000 instances of 500, 1,000, and 1,500 attributes, respectively.
RBF. The Radial Basis Function (RBF) generator is also provided by MOA. It creates centroids at random positions, and each one has a standard deviation, a weight and a class label. This dataset simulates drift by moving the centroids with constant speed.
Enron. The Enron corpus dataset is a large set of email messages that was made public during the legal investigation concerning the Enron corporation [103]. This cleaned version of Enron consists of 1,702 instances and 1,000 attributes.
IMDB. The IMDB3 movie reviews dataset was first proposed for sentiment analysis [104], where reviews have been pre-processed, and each review is encoded as a sequence of word indexes (integers).

3http://waikato.github.io/meka/datasets/.


Figure 5.1: CS-ARF and ARF comparison: CS-ARF while projecting into different dimensions (10, 30, 50, 70, 90); ARF with the entire datasets (all on the x-axis). Panels: (a) accuracy; (b) memory; (c) time.

Nomao. Nomao [124] is a large dataset that has been provided by Nomao Labs. It contains data coming from several sources on the web about places (name, website, address, localization, fax, etc.).
Har. The Human Activity Recognition dataset [125] was built from several subjects performing daily living activities, such as walking upstairs/downstairs, sitting, standing and laying, while wearing a waist-mounted smartphone equipped with sensors. The sensor signals were pre-processed using noise filters and attributes were normalized and bounded within [−1, 1].
ADS. The Advertisements dataset4 is a set of possible advertisements on internet pages, where each row represents one image tagged as ad or nonad (which are the class labels).

5.4.2 Results and Discussions

The experiments were implemented and evaluated in Java by extending the MOA framework [37, 97] using the datasets described above and the online Interleaved Test-Then-Train setting [91] for evaluation. For a fair comparison, we evaluate the CS-ARF approach against state-of-the-art classifiers coupled with CS as a filter, where one CS matrix is used for DR with all the ensemble members. For the state-of-the-art classification comparison, we use the Leveraging Bagging [24] (LBcs), Streaming Random Patches [50] (SRPcs), Hoeffding Adaptive Tree [44] (HATcs), Self-Adjusting Memory kNN [40] (SAMkNNcs), and Naive Bayes [38] (NBcs) algorithms. We include single classifiers (HAT, SAMkNN, NB) in our comparison because they are often used as baselines in stream classification. It has been proved in [25, 50] that the ensemble-based methods LB and SRP are the best, outperforming other ensemble classifiers on a similar set of datasets to the one used in this work. Parameterization: we fix k = 11 for the number of neighbors in the SAMkNN algorithm. We use a similar configuration for the HAT algorithm and the Hoeffding tree (HT) – the base

4https://www.kaggle.com/uciml/internet-advertisements-data-set.

learner for all the ensemble methods – with the grace period, the split confidence, and subspace size set to g = 50, c = 0.01, and m = 80%, respectively. To cope with drifts, the ensemble methods are coupled with the change detector and estimator ADWIN [21] using the default parameters for: (i) the warning threshold δw = 0.00001; and (ii) the drift threshold

δd = 0.0001 [24, 25, 50]. We fix the ensemble size to e = 30 learners for all the ensemble-based methods. Figure 5.1 presents the results of the CS-ARF approach, while applying a CS transformation over all the datasets into different space sizes (10, 30, 50, 70, 90), and of the vanilla ARF method, whilst using all the input attributes of the data stream without any projection (all on the x-axis). We notice that for almost all the datasets, the accuracy of our CS-ARF approach is only moderately affected while varying the output dimension p (Figure 5.1(a)). It slightly improves when we increase p, because we are using random subspaces from a dense set of attributes and not sparse ones (with many zeros). On the other hand, the ARF method using the original data (denoted by all on the x-axis) somewhat outperforms the CS-ARF approach for almost all datasets. This behaviour is explained by the fact that when we use a dimensionality reduction technique we are removing attributes that may impact the accuracy of any classifier. In contrast, Figure 5.1(b) illustrates the behavior of the memory usage, which is different in the sense that the vanilla ARF, using the entire data without projection (all), is more memory consuming than the CS-ARF approach. Figure 5.1(c) depicts the CS-ARF processing time, which increases with p and becomes slower than the ARF method. This is due to the fact that with the CS-ARF approach we have the additional processing of the CS computation, which grows as the CS matrix becomes larger. We highlight that this is an accuracy-resource usage tradeoff, because for a low value of p, our approach is able to be as accurate as the ARF method while using much smaller computational resources. Moreover, the accuracy increases slightly when we increase the number of dimensions to reach the accuracy of the ARF method. Figure 5.2 shows an accuracy comparison of the CS-ARF approach against reputed state-of-the-art algorithms, coupled with a compressed sensing filter, on the Tweet1 dataset. We notice that our approach achieves consistently better accuracy than its competitors for different output dimensions. Single classifiers (HATcs, SAMkNNcs, NBcs) are less accurate than the ensemble-based methods because the latter combine the predictions of several single "weak" classifiers and are all coupled with drift detection techniques. Due to the stochastic nature of the CS technique and therefore of our CS-ARF approach, all the results reported in this work are an average of several runs (with different random Gaussian matrices). Figure 5.3 depicts the standard deviation based on the accuracies obtained over several runs for different output dimensions using the Tweet3 and Har datasets (Figures 5.3(a) and 5.3(b), respectively). For both datasets, our approach has a small standard deviation (very close to zero), i.e., for all the runs, the accuracies obtained are close to the mean reported in this chapter. On the other hand, a larger standard deviation is obtained with


Figure 5.2: Accuracy comparison over different output dimensions on the Tweet1 dataset.

On the other hand, a larger standard deviation is obtained with the other algorithms, showing that the classification accuracies obtained for the different runs are farther away from the mean. This difference is explained by the fact that the competitors use one CS matrix as an internal filter, while our approach uses a different Gaussian matrix for each ensemble member. This strategy somewhat increases the diversity inside the ensemble, and thus a better predictive performance is obtained, guaranteeing a close approximation (within a CS perturbation ϵ) to the accuracy that would be obtained using the original stream. Based on these results, we use p = 50 in the following, because the standard deviation is minimal for most of the algorithms. The results presented in Table 5.2 show the classification performance of the CS-ARF approach against the other algorithms for all datasets projected into a 50-dimensional space using the compressed sensing technique. We note that CS-ARF performs best on most of the datasets, and we highlight that, when it is outperformed by other algorithms, the difference is statistically insignificant, as reported in [50]. To assess the benefits in terms of resources – where small values are desirable – Figure 5.4 shows the memory behavior of the ensemble-based methods. This figure depicts the large gains of our approach, CS-ARF, on almost all datasets; it outperforms the LBcs and SRPcs methods, confirming previous studies [25, 50] that reveal the high consumption of LB. We also note that on small datasets, such as Enron and ADS, CS-ARF does not achieve a prominent gain. Indeed, our proposed approach is efficient with large datasets, which makes it highly convenient for high-dimensional data streams whose size is potentially infinite, which is not the case for the Enron and ADS datasets.
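To illustrate this design choice, the following sketch (in Python with NumPy; the Java/MOA implementation differs, and the dimensions below are arbitrary) contrasts the two strategies: a single Gaussian CS matrix shared by all ensemble members, as used by the competitors, versus one independently drawn matrix per member, as in CS-ARF.

    import numpy as np

    rng = np.random.default_rng(42)
    a, p, n_members = 1000, 50, 30             # input dim, output dim, ensemble size

    # Competitors: one CS (Gaussian) matrix shared by all members.
    shared_R = rng.normal(0.0, 1.0 / np.sqrt(p), size=(p, a))

    # CS-ARF strategy: one matrix per member, increasing ensemble diversity.
    member_R = [rng.normal(0.0, 1.0 / np.sqrt(p), size=(p, a)) for _ in range(n_members)]

    def project(x, R):
        """CS-style projection of an instance x in R^a down to R^p."""
        return R @ x

    x = rng.random(a)                                                 # dummy high-dimensional instance
    inputs_shared = [project(x, shared_R) for _ in range(n_members)]  # identical views for all members
    inputs_diverse = [project(x, R) for R in member_R]                # member-specific views

Each ensemble member then learns from its own low-dimensional view of the same instance, which is the source of the additional diversity discussed above.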


(a) Tweet3 dataset. (b) Har dataset.

Figure 5.3: The standard deviation of the methods while projecting into different dimensions.

Table 5.2: Accuracy (%) comparison of CS-ARF, LBcs, SRPcs, HATcs, SAMkNNcs, and NBcs.

Dataset      CS-ARF   LBcs    SRPcs   HATcs   SkNNcs   NBcs

Tweets1      86.46    82.64   81.08   76.35   76.29    79.82
Tweets2      85.53    81.88   80.93   76.69   74.06    79.48
Tweets3      86.96    79.65   78.58   71.30   72.61    78.24
RBF          99.55    99.50   99.74   96.20   99.77    96.41
Enron        92.11    96.18   96.35   94.59   96.17    91.37
IMDB         74.90    74.86   74.87   74.04   74.55    74.27
Nomao        96.74    96.70   96.68   95.02   96.63    86.25
Har          88.14    88.61   88.65   80.22   82.07    81.72
ADS          98.25    99.74   99.81   98.71   98.52    89.48
Overall ∅    89.65    88.91   88.52   84.79   85.63    84.11


Figure 5.4: Memory (MB) comparison of the ensemble-based methods on all the datasets.

5.5 Conclusion

In this work, we presented the compressed adaptive random forest approach (addressing Q3) to make the ARF both efficient (in terms of resource usage) and effective (in terms of classification accuracy) with high-dimensional data streams (Q1 and Q2). The CS-ARF approach combines the CS technique, given its ability to preserve pairwise distances within a 1 ± ϵ factor, with the strength of the reputed ARF method, which achieves high predictive performance. Our proposed approach transforms high-dimensional data streams using the CS technique as an internal online pre-processing step, and then uses the resulting low-dimensional representation for the learning task with the ARF method. We evaluated and discussed the proposed method via extensive experiments using a diverse set of datasets. The results showed the ability of our approach to achieve good performance, close to what would be obtained using the original datasets without projection, and to outperform well-known state-of-the-art algorithms. We also showed that, despite its stochastic nature, the CS-ARF approach achieves good, stable accuracy, by extracting relevant attributes from sparse data in different low-dimensional spaces, while using a feasible amount of resources.


6 Batch-Incremental Classification Using UMAP

Contents
6.1 Introduction
6.2 Related Work
6.3 Batch-Incremental Classification
    6.3.1 Prior Work
    6.3.2 Algorithm Description
6.4 Experimental Evaluation
    6.4.1 Datasets
    6.4.2 Results and Discussions
6.5 Conclusion

This chapter concerns a collaboration [126] with Bernhard Pfahringer1, published at the Symposium on Intelligent Data Analysis (IDA) 2020 under the title "Efficient Batch-Incremental Classification Using UMAP for Evolving Data Streams".

6.1 Introduction

Data stream learning – or incremental learning – approaches can generally be divided into two main branches [113]:

1University of Waikato, Hamilton, New Zealand.


• Instance-incremental approaches that update the model with each instance as soon as it arrives, e.g. (i) Naive Bayes [127] which uses the Bayes formula to compute posterior probabilities; (ii) Self-Adjusting Memory kNN (samkNN) [40] which is a streaming version of kNN [22]; and (iii) Hoeffding Adaptive Tree (HAT) [44] which is an extension of the Hoeffding decision trees [41] to deal with concept drifts. See Section 2.3 for more approaches.

• Batch-incremental approaches (multiple instances), which make no change/increment to their model until a batch is completed. The main parameter to fix is then the batch size. E.g., (i) logistic regression [128], which uses a logistic function to model a dependent variable; (ii) support vector machines [129], which build a model that separates instances in the feature space according to their class labels; and (iii) batch-incremental ensembles of decision trees [130], which divide the stream into single batches and then, after learning from each batch, combine the underlying models into one global model.

In this chapter, we propose another attempt to improve the performance of kNN, which consists in incorporating a batch-incremental feature transformation strategy to tackle potentially high-dimensional and possibly infinite batches of data streams while ensuring effectiveness and quality of learning (e.g., accuracy), which addresses questions Q1 and Q2. This is achieved via a DR technique that has attracted a lot of attention recently: Uniform Manifold Approximation and Projection (UMAP) [26] (Section 2.4.3), built upon a rigorous mathematical foundation in Riemannian geometry. As far as we know, no incremental – streaming or online – version of UMAP exists, which makes it inapplicable to very large dynamic datasets. To make this possible, the approach outlined in this chapter adopts a batch-incremental strategy in which we use UMAP as an internal pre-processing step for the kNN algorithm on evolving data streams. This chapter is organized as follows. Section 6.2 reviews the prominent related work. Section 6.3 gives a brief background on UMAP, followed by the description of our approach. In Section 6.4, we outline and discuss the results of experiments on diverse datasets. Finally, we draw our conclusions.

6.2 Related Work

Beyond any doubt, DR is a powerful tool in data science to look for hidden structure in data and to reduce the resource usage of learning algorithms. DR techniques facilitate the classification task, by removing redundancies and extracting the most relevant features in the data, and permit better data visualization. A simple sub-taxonomy (of the one introduced in Figure 2.7) divides these techniques into two major groups, as follows: matrix factorization and graph/neighborhood-based techniques.


• Matrix factorization techniques require matrix computation tools, such as the well-known PCA [131]. PCA uses singular value decomposition and aims to find a lower-dimensional basis by converting the data into features called principal components, obtained by computing the eigenvalues and eigenvectors of a covariance matrix. Different incremental versions of PCA have been developed to handle streams of data [63, 59, 58]; inter alia, a batch-incremental PCA has been proposed to incrementally learn a subspace representation using successive batches [59] (see Section 2.4.1 for details, and the short sketch after this list for an illustration).

• Graph/Neighborhood-based techniques are leveraged in the context of DR and data visualization by using the insight that similar instances in the high-dimensional space should be represented by close instances in the corresponding low-dimensional space, whereas dissimilar instances should be well separated. As introduced in Section 2.4.3, tSNE [90] converts pairwise high-dimensional Euclidean distances between input instances into conditional probabilities P, which represent a matrix of pairwise similarities. Likewise, a probability distribution Q is computed describing the similarities in the lower-dimensional space after a first random projection. The objective of t-SNE is to find a representation in a low-dimensional space in which Q faithfully represents P. To do so, an optimization scheme is used to minimize the difference between P and Q over all instances. In addition to being computationally expensive, t-SNE has some limitations when projecting into two or three dimensions. In fact, it preserves neither distances between all instances nor density; it only preserves nearest neighbors, which can affect any density- or distance-based algorithm, and it hence preserves more of the local structure than the global structure [90].
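For reference, the optimization mentioned above minimizes the Kullback-Leibler divergence between the two similarity distributions, which can be written (here in LaTeX notation) as

    \mathrm{KL}(P \parallel Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}} .

As for the batch-incremental use of PCA mentioned in the first item above, the following minimal Python sketch (with illustrative batch size and target dimension; not the exact code of our experiments) shows how scikit-learn's IncrementalPCA updates its eigenbasis with each successive batch via partial_fit:

    import numpy as np
    from sklearn.decomposition import IncrementalPCA

    m, batch_size, a = 10, 500, 100              # target dim, chunk size, input dim (arbitrary)
    ipca = IncrementalPCA(n_components=m)

    def process_stream(batches):
        """Refine the PCA subspace with each batch, then project that batch."""
        for X in batches:                        # X has shape (batch_size, a)
            ipca.partial_fit(X)                  # incrementally update the eigenbasis
            yield ipca.transform(X)              # low-dimensional representation of the batch

    rng = np.random.default_rng(0)
    batches = (rng.random((batch_size, a)) for _ in range(5))
    low_dim = list(process_stream(batches))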

6.3 Batch-Incremental Classification

In this section, we present a batch-incremental adaptation of the UMAP technique [26] for the stream kNN algorithm.

6.3.1 Prior Work

Unlike tSNE [90], the UMAP [26] serves not only for visualization but also as a general DR technique. It uses the concept of k-nearest neighbors by constructing open balls over all instances and building simplicial complexes (Section 2.4.3).

UMAP offers better visualization quality than tSNE by preserving more of the global structure in a shorter running time.


To the best of our knowledge, tSNE and UMAP do not have any incremental version; ultimately, both techniques are essentially transductive2 and do not learn a mapping function from the input space. Hence, they need to process all the instances for each new unseen observation, which prevents them from being applicable in the stream framework. Figure 6.1 shows the projection of the CNAE dataset [105] (see Table 6.1) into 2 dimensions with UMAP, t-SNE, and PCA in offline/online fashions, where each color represents a label. In Figure 6.1(a), we notice that UMAP offers the most interesting visualization and has far better local structure, showing the effectiveness of its embedding at separating the 9 classes clearly, by putting similar data together, beating t-SNE and PCA (Figures 6.1(b) and 6.1(c), respectively). The overlap in the new space, for instance with tSNE in Figure 6.1(b), can potentially affect the later classification task, notably for distance-based algorithms, because properties like global distances and density may be lost in favor of preserving local structure. On the other hand, linear transformations, such as PCA, cannot discriminate between instances, which prevents them from being represented in the form of clusters as with the other algorithms (Figure 6.1(c)). To motivate our choice, we also project the same dataset with these approaches using our batch-incremental strategy (more details in Section 6.3.2). Figure 6.1(d) illustrates that the change from the offline UMAP representation is not as drastic as the ones engendered by tSNE and PCA (Figures 6.1(e) and 6.1(f), respectively), showing their limits at capturing information from data that arrive in a batch-incremental manner.
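As a rough illustration of how such offline projections can be produced (not the exact script used for Figure 6.1), the following Python sketch embeds a labeled high-dimensional dataset into 2 dimensions with the umap-learn, scikit-learn t-SNE, and PCA implementations; the digits dataset stands in for CNAE, and the parameter values are illustrative assumptions.

    import umap                                   # umap-learn package
    from sklearn.manifold import TSNE
    from sklearn.decomposition import PCA
    from sklearn.datasets import load_digits      # stand-in for the CNAE dataset

    X, y = load_digits(return_X_y=True)           # y is only used to color the plots

    emb_umap = umap.UMAP(n_neighbors=15, n_components=2).fit_transform(X)
    emb_tsne = TSNE(n_components=2, perplexity=30).fit_transform(X)
    emb_pca  = PCA(n_components=2).fit_transform(X)

    # Each emb_* is an (n_samples, 2) array; scatter-plotting them colored by y
    # reproduces the kind of qualitative comparison shown in Figures 6.1(a)-(c).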

6.3.2 Algorithm Description

A very efficient and simple scheme in supervised learning is lazy learning [132, 133]. Since lazy learning approaches are based on distances between every pair of instances, they unfortunately exhibit low performance in terms of execution time. The kNN algorithm is a well-known lazy algorithm that does not require any work during training; it uses the entire dataset to predict labels for test instances. However, it is impossible to store an evolving data stream, which is potentially infinite – or to scan it multiple times – due to its tremendous volume. To tackle this challenge, a basic incremental version of kNN has been proposed which uses a fixed-length window that slides through the stream and merges newly arriving instances with the closest ones already in the window (Section 2.3.2). To predict the class label of an incoming instance, we take the majority class label of its nearest neighbors inside the window using a defined distance metric (Equation 4.2). Since we keep the recently arrived instances inside the sliding window for prediction, the search for the nearest neighbors is still costly in terms of memory and time [39].

2Transductive learning consists of learning on a dataset but predicting on a known set of unlabeled instances from the same dataset.


(a) offline: UMAP. (b) offline: tSNE. (c) offline: PCA.

(d) batch-incremental: UMAP. (e) batch-incremental: tSNE. (f) batch-incremental: PCA.

Figure 6.1: Projection of CNAE dataset in 2-dimensional space.

Moreover, high-dimensional data streams impose further resource costs.
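To make the sliding-window scheme concrete, here is a minimal Python sketch of such an incremental kNN (a simplified stand-in for the MOA/scikit-multiflow implementations; parameter values are illustrative): a fixed-length window keeps the most recent instances, and each incoming instance is labeled by a majority vote over its k nearest neighbors inside that window.

    from collections import Counter, deque
    import numpy as np

    class WindowedKNN:
        """Minimal sliding-window kNN for streams (illustration only)."""
        def __init__(self, k=5, window_size=1000):
            self.k = k
            self.window = deque(maxlen=window_size)   # oldest instances fall off automatically

        def partial_fit(self, x, y):
            self.window.append((np.asarray(x), y))    # keep the most recent instances

        def predict(self, x):
            if not self.window:
                return None                           # no neighbors seen yet
            X = np.array([xi for xi, _ in self.window])
            labels = [yi for _, yi in self.window]
            dist = np.linalg.norm(X - np.asarray(x), axis=1)   # Euclidean distances
            nearest = np.argsort(dist)[: self.k]
            return Counter(labels[i] for i in nearest).most_common(1)[0][0]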

The main idea to cope with this drawback and improve the performance of kNN consists of using an efficient DR strategy that yields consistent results, such as UMAP.

Since UMAP is a transductive technique, an instance-incremental learning approach that includes UMAP does not work, because the entire stream would need to be processed whenever a new instance arrives. Doing it this way would be costly and would not meet the streaming requirements. To alleviate the processing cost – within a framework where several challenges must be respected, including the memory constraint and the incremental nature of the data – we adopt a batch-incremental strategy. In the following, we introduce the procedure of our novel approach, the batch-incremental UMAP-kNN.

Step 1: Partition of the stream. During this step, we must address the major weakness of UMAP, which is its ineffectiveness in an instance-incremental setting. The basic idea is to employ a batch-incremental strategy that consists in processing subsets of the stream. To do so, we assume that data arrive in batches – or chunks – by dividing the stream into disjoint partitions S1, S2, ··· of size s. Figure 6.2 shows a stream of instances divided into batches of equal size. So, instead of having observations available one by one, i.e., one at a time, they arrive as groups of instances, S1, S2, ..., Sq, where Sq is the qth chunk. A simple example of such a data stream is a video sequence, where at each instant we have a succession of images.


Figure 6.2: Stream of mini-batches.
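A minimal way to realize this partitioning is sketched below in Python: a generator groups the instance-by-instance stream into successive disjoint chunks of s instances (the value of s shown is the one adopted later in the experiments, but it is only illustrative).

    def to_batches(stream, s):
        """Group an instance-by-instance stream into disjoint chunks S1, S2, ... of size s."""
        batch = []
        for instance in stream:
            batch.append(instance)
            if len(batch) == s:
                yield batch          # a complete chunk Si is ready for pre-processing
                batch = []
        if batch:                    # a possibly smaller final chunk
            yield batch

    # Example: chunks of s = 400 instances.
    # for Si in to_batches(stream, 400): ...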

Step 2: Data pre-processing. We focus on the analysis of an infinite stream of high-dimensional instances Xi ∈ R^a from which we wish to construct a low-dimensional representation Yi ∈ R^m, where m ≪ a. As mentioned before, UMAP is unable to compress data incrementally and needs to transform more than one instance at a time, because it builds a neighborhood graph on a set of instances and then lays it out in a lower-dimensional space [26]. Thus, our proposed approach operates on batches of the stream, where a single batch Si of data is processed at a time Ti. The first two steps in Figure 6.3 show the application of UMAP on disjoint batches. Once a batch is complete, throughout the second step, we apply UMAP on it independently from the batches that have already been processed, so each Si ∈ R^a is transformed and represented in R^m. This new representation is very likely devoid of redundancies and irrelevant attributes, and is obtained by finding potentially useful non-linear combinations of existing attributes, i.e., by repacking the relevant information of the larger feature space and encoding it more compactly.


Figure 6.3: Batch-incremental UMAP-kNN scheme.

For UMAP to learn when moving from one batch to another, we seed each chunk's embedding with the outcome of the previous one, i.e., we match the initial coordinates of the instances in the current embedding to the final coordinates of the preceding one. This helps to avoid losing the topological information of the stream and to keep successive embeddings stable as we transition from one batch to its successor. Afterwards, we use the compressed representation of the high-dimensional chunk in the next step, which consists in supporting the incremental kNN classification algorithm internally.
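The sketch below illustrates this pre-processing step with the umap-learn API: each chunk is embedded independently into R^m and, as an approximation of the seeding described above, the init parameter of the next reducer can be warm-started from the previous embedding when the chunk sizes match. Treat this seeding detail as an assumption about one possible implementation rather than the exact code used in our experiments.

    import umap

    def embed_batches(batches, m=3, n_neighbors=15):
        """Apply UMAP to each chunk independently, optionally seeding from the previous one."""
        prev_embedding = None
        for X in batches:                                  # X has shape (s, a)
            if prev_embedding is not None and len(prev_embedding) == len(X):
                # Assumption: reuse the previous chunk's final coordinates as the
                # initial layout of the current chunk, to keep embeddings stable.
                reducer = umap.UMAP(n_neighbors=n_neighbors, n_components=m,
                                    init=prev_embedding)
            else:
                reducer = umap.UMAP(n_neighbors=n_neighbors, n_components=m)
            prev_embedding = reducer.fit_transform(X)      # chunk embedded in R^m
            yield prev_embedding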

Step 3: kNN classification. The UMAP-kNN approach aims to decrease the computational cost of kNN on high-dimensional data streams by reducing the input space size using the well-known dimensionality reduction technique UMAP in a batch-incremental way. Besides the prediction phase of the kNN algorithm, which is based on the neighborhood3, UMAP also operates on a k-nearest-neighbor graph (a topological representation) and optimizes the low-dimensional representation of the data using gradient descent. One nice takeaway is that UMAP, because of its solid theoretical backing as a manifold technique, keeps properties such as density and pairwise distances.

3The distances between the new incoming instance and the instances already available inside the adaptive window (Q3) are computed in order to assign it to a particular class.

Thus, it is not going to bias the performance of the neighborhood-based kNN algorithm. This step consists of classifying the evolving data stream, where the learning task occurs on consecutive batches, i.e., we train kNN incrementally with instances that become successively available in chunk buffers after pre-processing. Figure 6.3 shows the underlying batch-incremental learning scheme, which is built upon a divide-and-conquer strategy. Since UMAP is applied independently to batches, once a chunk is complete and has been transformed into R^m, we feed half of the batch to the sliding window and we incrementally predict the class labels of the second half (the remaining instances). Given that kNN is adaptive, the main novelty of UMAP-kNN is in how it merges the current batch with previous ones. This is done by adding it to the instances from previous chunks kept inside the kNN window. Even though past chunks have been discarded, some of their instances are stored and maintained while the adaptive window scrolls. Thereafter, the instances kept temporarily inside the window are used to define the neighborhood and predict the class labels of later incoming instances. As presented in Figure 6.3, the intuitive idea to combine results from different batches is to use half of each batch for training and the second half for prediction. In general, due to the possibility of successive embeddings often being very different, one could expect this to affect the global performance of our approach. Thus, we adopt this scheme to maintain stability in an adaptive batch-incremental manifold classification approach.

Algorithm 6 UMAP-kNN algorithm. Symbols: S = {X1, X2, ...} ∈ R^a: data stream; S̄ = {Y1, Y2, ...} ∈ R^m: transformed data stream; s: batch size; C = {c1, c2, ...}: set of class labels; W: sliding window; k: the number of neighbors.

 1: function UMAP-kNN(S, s, W, k, m)
 2:   W ← ∅, j ← 0
 3:   for all Si ∈ S do
 4:     S̄i ← UMAP(Si, m)                ▷ apply UMAP
 5:     W ← S̄i[1 ··· s/2]
 6:     j ← s/2 + 1
 7:     for all Yj ∈ S̄i[s/2 ··· s] do
 8:       for all Yk ∈ W do              ▷ ∀ k ≠ j
 9:         compute D_Yk(Yj)             ▷ Equation (4.2)
10:       end for
11:       c ← max_{c∈C} D_{S,k}(Yj)      ▷ Equation (4.3)
12:       W ← Yj                         ▷ maintain the projected Xj, Yj, in W
13:     end for
14:   end for
15: end function

Algorithm 6 shows the pseudo-code of UMAP-kNN. We start by applying UMAP on each new incoming chunk Si. Then, we obtain a dense representation of it, instead of a sparse high-dimensional one. Later, we feed half of the transformed instances of Si to the window W (line 5).

Afterwards, we apply kNN on the second half of the chunk by computing the distances between Yj and the accumulated instances inside W (lines 7 − 10).

Finally, we assign to Yj the most frequent class label by taking the majority vote over its nearest neighbors, and we add Yj to the window (lines 11 − 12).
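Putting the three steps together, the sketch below mirrors Algorithm 6 in Python, reusing the WindowedKNN sketched earlier in this section; the helper interface and parameter values are illustrative assumptions, not the exact code of our implementation.

    import umap

    def umap_knn(labeled_batches, knn, m=3, n_neighbors=15):
        """Batch-incremental UMAP-kNN sketch; returns (prediction, true label) pairs."""
        results = []
        for X, y in labeled_batches:                        # one labeled chunk Si
            Y = umap.UMAP(n_neighbors=n_neighbors,
                          n_components=m).fit_transform(X)  # lines 3-4: embed the chunk
            half = len(Y) // 2
            for yi, label in zip(Y[:half], y[:half]):       # line 5: first half feeds the window
                knn.partial_fit(yi, label)
            for yj, label in zip(Y[half:], y[half:]):       # lines 7-12: predict the second half
                results.append((knn.predict(yj), label))
                knn.partial_fit(yj, label)                  # then keep it in the window
        return results

    # Usage with the windowed kNN sketched earlier:
    # results = umap_knn(labeled_batches, WindowedKNN(k=5, window_size=1000))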

6.4 Experimental Evaluation

In this section, we present a series of experiments carried out on synthetic and real datasets to assess the classification performance of our proposal.

6.4.1 Datasets

In order to show whether the UMAP-kNN approach is capable of working in different scenarios, we use 3 synthetic and 6 real-world high-dimensional datasets that have been widely used in the literature to evaluate the classification performance of data stream classifiers. Table 6.1 presents a short description of each dataset, and further details are provided in what follows.

Tweets. The dataset was created using the tweets text data generator provided by MOA [97], which simulates sentiment analysis on tweets, where messages can be classified depending on whether they convey positive or negative feelings. Tweets1,2,3 produce instances of 500, 1,000 and 1,500 attributes, respectively.

Har. Human Activity Recognition dataset [125], built from several subjects performing daily living activities, such as walking, sitting, standing and lying, while wearing a waist-mounted smartphone equipped with sensors. The sensor signals were pre-processed using noise filters and the attributes were normalized.

CNAE. CNAE is the national classification of economic activities dataset [105]. Instances represent descriptions of Brazilian companies categorized into 9 classes. The original texts were pre-processed to obtain the current highly sparse data.

Enron. The Enron corpus dataset is a large set of email messages that was made public during the legal investigation concerning the Enron corporation [103]. This cleaned version of Enron consists of 1,702 instances and 1,000 attributes.

IMDB. The IMDB movie reviews dataset was proposed for sentiment analysis [104], where each review is encoded as a sequence of word indexes (integers).

Nomao. The Nomao dataset [124] was provided by Nomao Labs, where the data come from several sources on the web about places (name, address, localization, etc.).

Covt. The forest covertype dataset, obtained from US forest service resource information system data, where each class label represents a different forest cover type.


Table 6.1: Overview of the datasets.

Dataset    #Instances   #Attributes   #Classes   Type
Tweets1    1,000,000    500           2          Synthetic
Tweets2    1,000,000    1,000         2          Synthetic
Tweets3    1,000,000    1,500         2          Synthetic
Har        10,299       561           6          Real
CNAE       1,080        856           9          Real
Enron      1,702        1,000         2          Real
IMDB       120,919      1,001         2          Real
Nomao      34,465       119           2          Real
Covt       581,012      54            7          Real

6.4.2 Results and Discussions

We compare our proposed classifier, UMAP-kNN, to several commonly used baseline methods from the dimensionality reduction and machine learning areas: PCA [59] as provided in scikit-learn [134], an incremental-batch approach that updates the eigenbasis for each batch buffer; tSNE (fixing the perplexity to 30, the best value as reported in [90]); and SAM-kNN (SkNN) [40]. We also use HAT, a classifier with a different, tree-based structure [44], to assess its performance with the neighborhood-based UMAP. For a fair comparison, we compare UMAP-kNN against PCA and t-SNE using the same strategy proposed in this chapter, both combined with kNN, and we compare UMAP-kNN with the competitor classifiers using UMAP as well, in the same batch-incremental manner. Incremental kNN has two crucial parameters: (i) the number of neighbors k, fixed to 5; and (ii) the window size W that maintains the low-dimensional data, fixed to 1000. According to previous studies such as [39], a bigger window increases the resource usage and a smaller size impacts the accuracy. All experiments were implemented and evaluated in Python by extending the scikit-multiflow framework4 [135] and scikit-learn [134], using the datasets described in Section 6.4.1. Figure 6.4(a) depicts the influence of the chunk size on the accuracy of UMAP-kNN on some datasets. Generally, fixing the chunk size imposes the following dilemma: choosing a small size, so that we obtain an accurate reflection of the current data, or choosing a large size, which may increase the accuracy since more data are available. The ideal would be to use batches with as many instances as possible, to best represent the whole stream. In practice, the chunk size needs to be small enough to fit in main memory, otherwise the running time of the approach increases. UMAP is a slow technique, so we choose small chunk sizes to overcome this issue with UMAP-kNN. Based on the results in Figure 6.4(a), we fix the chunk size to 400 for an efficient accuracy-memory tradeoff.

4https://scikit-multiflow.github.io/.


(a) varying the chunk size. (b) varying the number of neighbors for UMAP.

Figure 6.4: Accuracy while varying the chunk size and the number of neighbors.


(a) accuracy. (b) memory. (c) time.

Figure 6.5: Comparison of UMAP-kNN, tSNE-kNN, PCA-kNN, and kNN (with the entire datasets) while projecting into 3 dimensions.

We investigate the behavior of a crucial parameter that controls UMAP, the number of neighbors, via the classification performance of our approach. Based on the size of the neighborhood, UMAP constructs the manifold and focuses on preserving local and global structures. Figure 6.4(b) shows the accuracy when the number of neighbors is varied on diverse datasets. We notice that for all datasets the accuracy stays roughly the same, with no large differences, e.g., on Har. This is thanks to the batch-incremental strategy, which lets each chunk maintain its structure in the sliding window to support the kNN classification; we therefore obtain stability for successive embeddings by merging them. Since a large neighborhood leads to a slower learning process, in the following we fix the number of neighbors for UMAP to 15. tSNE is a visualization technique, so we are limited to projecting high-dimensional data into 2 or 3 dimensions. In order to evaluate the performance of our proposal in a fair comparison against tSNE-kNN and PCA-kNN, we project the data into a 3-dimensional space.

We illustrate in Figure 6.5(a) that UMAP-kNN makes significantly more accurate predictions, consistently beating the best performing baselines (tSNE-kNN and PCA-kNN), notably on the CNAE and Tweets datasets. Figure 6.5(b) depicts the quantity of memory needed by the three algorithms, which is practically the same for some datasets. Compared to kNN, which uses the whole data without projection, we notice that UMAP-kNN consumes much less memory whilst sacrificing a bit of accuracy, because we are removing many attributes. Figure 6.5(c) shows that our approach is consistently faster than tSNE-kNN, because tSNE computes the distances between every pair of instances to project. On the other hand, PCA-kNN is a bit faster thanks to the simplicity of PCA in comparison with the manifold learning technique UMAP. But with this tradeoff, our approach performs well on almost all datasets. In addition to its good classification performance in comparison with its competitors, the batch-incremental UMAP-kNN does a better job of preserving density by capturing both global and local structures, as shown in Figure 6.1(d). The fact that UMAP and kNN are both neighborhood-based methods arises as a key element in achieving good accuracy. UMAP has not only the power of visualization but also the ability to reduce the dimensionality of data efficiently, which makes it useful as a pre-processing technique for machine learning. Table 6.2 reports the comparison of UMAP-kNN against state-of-the-art classifiers. We highlight that our approach performs better on almost all datasets. It achieves accuracies similar to UMAP-SkNN on several datasets, but in terms of resources the latter is slower because of its drift detection mechanism. UMAP-kNN performs better than PCA-kNN, e.g., on the Tweets datasets, at the cost of being slower. We also observe that UMAP-HAT fails to match our approach (in terms of accuracy, memory, and time) due to the integration of a neighborhood-based technique (UMAP) into a tree structure (HAT).

Figure 6.6 reports detailed results for the Tweet1 dataset with five output dimensions. Figure 6.6(a) exhibits the accuracy of our approach, which is consistently above its competitors whilst ensuring stability across different manifolds. Figures 6.6(b) and 6.6(c) show that kNN-based classifiers use much less resources than the tree-based UMAP-HAT. We see that UMAP-kNN requires less time than UMAP-HAT and UMAP-SkNN to process the stream, but PCA-kNN is the fastest thanks to its simplicity. Still, the gain in accuracy with UMAP-kNN is more significant.

6.5 Conclusion

Motivated by the high performance of UMAP, proposed recently by McInnes, Healy, and Melville, this chapter addressed the problem of high dimensionality using UMAP applied to stream classification (Q2). We presented a novel batch-incremental approach for mining data streams using the adaptive kNN algorithm (Q3). UMAP-kNN combines the simplicity of kNN and the high performance of UMAP, which is used as an internal incremental pre-processing step to reduce the feature space of data streams (Q1).


Table 6.2: Comparison of UMAP-kNN, PCA-kNN, UMAP-SkNN, and UMAP-HAT.

Accuracy (%)
Dataset    UMAP-kNN   PCA-kNN   UMAP-SkNN   UMAP-HAT
Tweets1    75.71      69.89     75.37       66.47
Tweets2    75.16      69.21     74.40       61.27
Tweets3    71.01      70.81     70.47       66.98
Har        75.30      70.50     64.09       84.89
CNAE       76.67      67.41     75.18       40.18
Enron      92.24      93.41     91.89       91.77
IMDB       67.38      67.28     67.43       64.52
Nomao      91.92      91.13     91.63       83.75
Covt       61.29      66.73     53.08       55.43

Memory (MB)
Dataset    UMAP-kNN   PCA-kNN   UMAP-SkNN   UMAP-HAT
Tweets1    1366.71    1354.24   1373.15     2738.32
Tweets2    2530.30    2518.76   2532.95     4891.23
Tweets3    3706.99    3706.55   3722.68     7144.77
Har        311.58     310.48    312.84      381.49
CNAE       254.17     246.94    260.29      262.52
Enron      269.00     267.31    271.56      288.74
IMDB       3012.85    3013.28   3018.04     7471.64
Nomao      289.81     285.50    290.60      508.50
Covt       700.69     689.97    704.46      3788.54

Time (Sec)
Dataset    UMAP-kNN   PCA-kNN   UMAP-SkNN   UMAP-HAT
Tweets1    558.56     217.44    1396.32     2163.14
Tweets2    616.50     350.63    908.59      3453.21
Tweets3    667.43     400.62    1066.98     6273.19
Har        75.20      24.37     77.99       82.47
CNAE       8.89       4.81      13.17       19.78
Enron      12.80      9.52      17.26       32.84
IMDB       715.68     407.60    1038.77     4691.07
Nomao      248.79     20.46     327.36      228.00
Covt       2311.21    137.62    3756.41     2297.01


(a) accuracy. (b) memory. (c) time.

Figure 6.6: Comparison of UMAP-kNN, PCA-kNN, UMAP-SkNN, and UMAP-HAT over different output dimensions using Tweet1.

We showed that UMAP is capable of efficiently embedding data streams within a batch-incremental strategy. We assessed the performance of the proposed approach in an extensive evaluation against well-known state-of-the-art algorithms, considering a diverse set of datasets. We further demonstrated that the batch-incremental approach is just as effective as the offline approach for visualization, and that its classification performance significantly outperforms reputed baselines while using a reasonable amount of resources. We would like to pursue this promising approach further to enhance its run-time performance by applying a fast dimensionality reduction before using UMAP. Another area for future work could be the use of a different mechanism, such as the application of UMAP to each incoming instance inside a sliding window. We believe that this may be slow but would be suited for instance-incremental learning.

Part III

Concluding Remarks


7 Conclusions and Future Work

Contents
7.1 Conclusions
    7.1.1 Naive Bayes Classification
    7.1.2 Lazy learning
    7.1.3 Ensembles
7.2 Open Issues and Future Directions

The amount of data streams generated daily by active devices and sensors is increasing considerably. Useful knowledge can be extracted from these evolving data for later decision-making. These data streams pose, however, several challenges for learning algorithms, including mainly, but not limited to, restricted resources (in terms of memory usage and running time), high dimensionality, and concept drift. In this thesis, we investigated the problem of evolving data stream classification while addressing the aforementioned challenges. This chapter concludes the thesis with a discussion of the completed work and indicates possible lines of further investigation.

7.1 Conclusions

We thoroughly analyzed and presented the two main problems addressed in this thesis, namely data stream classification and summarization, with a focus on sketching and dimensionality reduction. In the first part of the thesis, we defined the basic characteristics of data streams and discussed different approaches from the literature designed to handle the challenges of such evolving environments. Throughout our discussion, we focused on addressing the stream challenges in the context of classification.

In this context, we reviewed in detail the state-of-the-art algorithms for classification and dimensionality reduction. The aim of this thesis was to improve current stream classifiers by developing new approaches that use DR techniques. In the introductory part, in Chapter 1, we divided our objectives into three research questions (Q1, Q2, Q3) that were our main focus throughout our contributions.

7.1.1 Naive Bayes Classification

In Chapter 3, we studied a well-known summarization technique with promising theoretical guarantees, the CMS [19]. We adopted a strategy that allows us to compactly store instances from the stream in a sketch table of fixed size during the learning phase, in such a way that the data can easily be retrieved from it for prediction using the Naive Bayes algorithm (Q1 handled). Theoretical guarantees characterizing the size of the sketch are provided to build a sketch table capable of storing a synopsis of the data without much loss of information. Concept drift is defined as the changes which occur in the learned model due to time-evolving stream distributions. These changes mainly involve fluctuations of the underlying distributions. The occurrence of concept drift sometimes leads to a significant drop in the predictive performance of some classifiers. Moreover, models and partitions need to be updated with new information, which is why newly proposed classification algorithms dedicated to data streams are generally coupled with a drift detection mechanism. To cope with this phenomenon and address Q3, we incorporated a drift detection strategy using ADWIN [21] into SketchNB. To cope with high-dimensional data (Q2), we added an incremental pre-processing step to the above-mentioned SketchNB, and to its adaptive version, to reduce the dimensionality of the data before storing it in the sketch table and therefore minimize the resources that would be used for high-dimensional data. For this, we used the hashing trick [20], a fast and simple DR technique that employs a hash function for projection.
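As a reminder of how this technique works, the sketch below gives a minimal feature-hashing projection in Python; it is a generic illustration (with an arbitrary target dimension and Python's built-in hash), not the exact MOA filter of Chapter 3.

    import numpy as np

    def hashing_trick(x, d):
        """Project a sparse instance (dict: feature index -> value) into d dimensions.
        Illustration only; a production filter would use a stable hash function."""
        z = np.zeros(d)
        for i, v in x.items():
            j = hash(('bucket', i)) % d               # bucket chosen by hashing the index
            sign = 1 if hash(('sign', i)) % 2 == 0 else -1
            z[j] += sign * v                          # signed sum reduces collision bias
        return z

    x = {12: 1.0, 40532: 3.0, 999999: 0.5}            # sparse high-dimensional instance
    z = hashing_trick(x, d=100)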

7.1.2 Lazy learning

Valuable results were obtained during the research for the first main contribution, where we performed several experiments on a diverse set of datasets (including high-dimensional data) by evaluating well-known learners – among others, the kNN algorithm. In our results, we noticed that the latter is very costly, i.e., it uses huge amounts of resources (in terms of memory and time). These results motivated us to investigate ways to enhance the performance of kNN using successful DR techniques. In the following, we sum up the principal contributions to lazy learning in the streaming framework:

• In Chapter 4, we proposed an efficient kNN approach that reduces high-dimensional data streams (Q2) using a data-independent DR technique, CS [23], before feeding their low-dimensional representation to the sliding window of the kNN algorithm (Q3).


This approach is indeed memory-efficient and faster than the traditional incremental kNN because it processes low-dimensional data instead of high-dimensional data (Q1). Relying on some geometrical properties, we theoretically proved that the neighborhood before and after projection (using the CS technique) is preserved up to some ϵ-distortion. This means that our CS-kNN approach achieves an accuracy close to the one that would be obtained by running the classical kNN on the input high-dimensional stream (without projection).

• We also proposed another variation of kNN that handles high-dimensional data streams, using UMAP [26] to reduce the input space dimension (Q2). UMAP has no incremental version that is able to process data streams, because of its transductive nature, demanding the presence of the entire data for each new incoming observation. In this contribution (presented in Chapter 6), we proposed a batch-incremental UMAP-kNN that pre-processes the data in a batch-incremental manner and feeds them to the kNN for prediction in an instance-incremental way. This approach provided promising results by reducing the computational cost of kNN while obtaining good accuracy and being efficient in visualization (Q1).

The results of the aforementioned approaches exhibited a parameterized resource-accuracy tradeoff. In fact, a small kNN sliding window – or output dimension after projection – can significantly degrade the accuracy, whereas a large window – or space dimension – will lead to better predictive performance but also amplify the use of computational resources.

7.1.3 Ensembles

Several theoretical and empirical studies have shown that combining multiple individual learners (ensembles) leads to better accuracy. However, there is no free lunch: ensemble-based methods are very costly (in terms of resources) in comparison to single classifiers. This motivated us to improve the performance of two reputed ensembles, namely the LB and ARF methods, on high-dimensional data streams. In this context, we summarize our ensemble-based contributions as follows:

• We proposed two LB versions that use the CS-kNN, which addresses Q1, Q2, and Q3, in Chapter 4. The first one, CS-kNNLB, uses the CS-kNN as a base learner of the LB ensemble with only one CS random matrix (as a filter) for all the ensemble members. Thereafter, we proposed the CSB-kNN method, which improves the predictive results of the latter by increasing the diversity inside the ensemble using different CS random matrices, one for each ensemble member.

• In Chapter 5, we studied a new ensemble method, the ARF, which provides good accuracy in comparison with state-of-the-art ensembles but consumes a lot of computational resources.


Similarly to the strategy used in the CSB-kNN, we proposed the CS-ARF method, which decreases the computational resource usage (Q1) of the adaptive random forest (Q3) on high-dimensional data streams (Q2). We used different CS matrices, to reduce the input dimensionality, within the different trees of the ensemble.

7.2 Open Issues and Future Directions

The contributions presented in this thesis open new perspectives in the stream domain that we explored. We conclude with a list of some potentially interesting research directions that follow directly from our work.

SketchNB. In this work, we pre-processed the datasets and transformed numerical attributes into discrete attributes using WEKA [107], because CMS works principally on categorical data. A promising direction to fully handle numerical attributes incrementally is to use a discretization algorithm, such as the Partition Incremental Discretization (PiD) [106], as a filter. One way to test its feasibility is to develop it as a Java implementation under the MOA framework [97].

CS-kNN. Instead of keeping a synopsis of the data, the kNN algorithm maintains a window that contains a part of the stream, making kNN slow. Thus, the ensemble-based CSB-kNN method, which combines several CS-kNN, is also slow. We suggest that further research should investigate how to optimize the processing time of the CSB-kNN (which already obtains sufficiently good accuracy) and provide guarantees on the number of output dimensions. Once the latter is fixed, we could efficiently project the data streams knowing the input dimensions.

UMAP-kNN. We believe that our promising batch-incremental UMAP-kNN approach could be pursued further to enhance its run-time performance by applying a fast DR technique before using UMAP. Another area for future work could be the use of a different mechanism, such as the application of UMAP to each incoming instance inside a sliding window. We believe that this may be slow but would be suited for instance-incremental learning.

Towards AutoML. There is no doubt that the proposed algorithms mentioned in this thesis are suitable for data streams. In fact, before starting the incremental processing of the stream, we need to fix in advance the parameter(s) of the algorithm being used (e.g., the number of neighbors k for the kNN algorithm, the number of learners for the ensemble-based algorithms, the output dimension m for DR techniques). However, these algorithms are not fully automated, because the generated models might change over time and decrease the performance of the corresponding classifier. Consequently, the parameterization fixed at the beginning might not hold for the entire stream. So how could this issue be fixed? Auto Machine Learning (AutoML) is a new topic that addresses this problem of variability by automatically monitoring models. Some systems1 have been developed recently that allow hyper-parameter tuning, such as AutoWeka2 and AutoSklearn3 (i.e., AutoML combined with Weka and Scikit-learn, respectively).

1https://www.automl.org/automl/.

On the other hand, very few contributions on AutoML for data stream monitoring are found in the literature. In this context, Veloso, Gama, and Malheiro proposed a Self Parameter Tuning (SPT) technique that ensures a full automation of stream modelling algorithms by continuously searching for the optimal hyper-parameters while processing the stream. This hyper-parameterization can change over time, depending on the current distribution of the stream. We believe that hyper-parameter tuning and algorithm configuration have the potential to be revolutionary for data stream mining tasks. Finding the optimal hyper-parameters (e.g., using SPT) for a classification algorithm is tedious, especially since the latter is going to be used in different contexts on different streams. Thus, adjusting the hyper-parameters of stream algorithms automatically and dynamically is a very promising avenue.

2https://www.automl.org/automl/autoweka/. 3https://www.automl.org/automl/auto-sklearn/.


A Open Source Contributions

Contents
A.1 Introduction
A.2 Massive Online Analysis
A.3 Scikit-Multiflow
A.4 Conclusion

A.1 Introduction

In this appendix, we present existing tools used in this thesis to apply machine learning to data streams for both research and practical applications. These open-source frameworks are designed to facilitate collaboration among research groups and allow users to implement their own – or extend existing – algorithms, and to compare against state-of-the-art algorithms already implemented. In the following, we present the two main data stream mining frameworks1 used to implement and evaluate our contributions. We describe the installation process and some implementation details used in order to test our new classification approaches proposed throughout this thesis.

A.2 Massive Online Analysis

Massive Online Analysis (MOA)2 [97] is the most popular open-source framework for data stream analysis. It is implemented in Java, developed on top of WEKA [107], and has a very active community.

1Requirements of such frameworks are introduced in Part I of the thesis. 2https://moa.cms.waikato.ac.nz/.

It provides data generators (e.g., waveform, random tree, and SEA generators), evaluation methods (e.g., prequential and interleaved test-then-train evaluations), statistics (e.g., running time, memory), and algorithms for data mining tasks (e.g., classification, regression, clustering) on evolving data streams. To create a new classifier, users only need to extend an abstract class, called moa.classifiers.AbstractClassifier, and implement the desired algorithm. MOA can be run on Windows, Mac, and Unix/Linux systems and used through a user interface (see Figure A.1) or a command line. A recent book [37] discusses the MOA software and how to use it, and also includes exercises and lab sessions. In the following, we discuss the components we added to the MOA framework as part of our contributions.

Figure A.1: MOA main window.

Sketch Naive Bayes

We used the MOA framework to implement our sketch-based NB approaches, which can be run using the code and instructions in the following GitHub repository: https://github.com/marouabahri/SketchNB. The basic SketchNB classifier is implemented, following the pseudo-code listed in Algorithms 1 and 2 (Chapter 3), in the file SketchNB.java placed in the moa.classifiers package of the MOA framework. SketchNB is available from the learner selection dialog in the MOA graphical interface (Figure A.2). The basic parameters that can be set in the SketchNB classifier are the following:

• -d: delta to fix the depth of the sketch table.


• -e: epsilon to fix the width of the sketch table.

• -C: constant to adjust the size of the sketch.

Figure A.2: SketchNB classifier.

The SketchNBHT uses the binary hashing trick filter, implemented in the class HashingTrickFilter.java, which minimizes the space size before the learning phase with SketchNB. In addition to the aforementioned parameters, this classifier needs to fix the output dimensionality after projection (see also Figure A.3):

• -d: the output dimensionality.

Figure A.3: Hashing trick filter.


Compression-Based Algorithms

We implemented the compressed sensing technique (based on Gaussian random projection) in the file CS-Filter.java, placed in the moa.streams.filters package of MOA (Figure A.4). This technique can be used in conjunction with any data mining algorithm as a filter. We extended the AbstractClassifier class to build our proposed CS-kNN algorithm, which uses CS internally, i.e., there is no need to couple it with the CS filter. All source code and datasets employed in our analysis are available at https://github.com/marouabahri/CS-kNN. The main parameters that must be defined are the following:

• -k: the number of neighbors.

• -w: the maximum number of instances to store inside the sliding window.

• -d: the output dimensionality.

Figure A.4: CS-kNN algorithm.

To use the ensemble-based method CSB-kNN, we use the LeveragingBag.java class and select CS-kNN as the base learner (Figure A.5). The crucial parameter that needs to be fixed in CSB-kNN is:

• -s: the ensemble size, i.e., the number of CS-kNN inside the ensemble.

A similar strategy has been implemented based on the adaptive random forest, CS-ARF, where we updated the tree-based model used by ARF and coupled it with the CS technique. The code and datasets used to evaluate CS-ARF are available at https://github.com/marouabahri/CS-ARF.


Figure A.5: CSB-kNN algorithm.

A.3 Scikit-Multiflow

Scikit-multiflow3 [135] is a new framework, inspired by the popular frameworks scikit-learn4 [134] and MOA [97], that fills the void in Python for stream learning tasks. It contains, inter alia, stream generators, learning algorithms, and evaluation methods, as in MOA.

UMAP-kNN

We implemented our UMAP-kNN using scikit-multiflow for two reasons: (i) the only available implementation of UMAP is in Python; and (ii) to promote this new promising open-source framework. UMAP-kNN is implemented in the file batchIncrementalUMAPkNN.py, placed in the skmultiflow.lazy package of the scikit-multiflow framework. The material used in our analysis is available in the GitHub repository https://github.com/marouabahri/UMAP-kNN. The parameters that need to be fixed are the following:

• -k: the number of neighbors.

• -w: the maximum number of instances to store inside the sliding window.

• -d: the output dimensionality.

3https://scikit-multiflow.github.io/. 4https://scikit-learn.org/stable/.


• -batch: the batch size.

Other parameters related to the UMAP technique can be changed. In fact, UMAP is a graph-based technique, so an important parameter is the number of neighbors (n-neighbors), which controls how UMAP builds the neighborhood structure in the data (local versus global structure) when attempting to project it. Thus, low values of n-neighbors lead to the preservation of local structure, while with large values larger neighborhoods are preserved by UMAP.
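For instance, with the umap-learn API this parameter is exposed as n_neighbors; the values below are only illustrative.

    import umap

    local_view = umap.UMAP(n_neighbors=5, n_components=3)    # favors local structure
    global_view = umap.UMAP(n_neighbors=50, n_components=3)  # preserves more global structure
    # embedding = local_view.fit_transform(X_chunk)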

A.4 Conclusion

In this appendix, we presented the two open-source software libraries for data stream mining that we used for the implementation, evaluation, and comparison of our contributions in this thesis. We also presented our work and specified the parameters that need to be fixed for each algorithm. The source code and datasets used throughout the thesis are provided at the following address https://github.com/marouabahri.

B Résumé en Français

Contents
B.1 Contexte et Motivation
B.2 Défis
B.3 Contributions

B.1 Contexte et Motivation

Ces dernières décennies ont connu une révolution technologique omniprésente qui envahit notre vie à toutes les échelles. Cette montée technologique vertigineuse inclut de plus en plus de systèmes et d'applications qui génèrent continuellement de grandes quantités de données connues sous le nom de flux de données. Un exemple d'application est l'Internet des Objets (IdO) qui est défini comme un vaste réseau de dispositifs (objets) physiques et de capteurs qui connectent, interagissent et échangent des données. L'IdO est un élément-clé de l'automatisation du quotidien, par exemple les voitures, les drones, les avions et la domotique. Ces dispositifs créent et continueront de créer de manière exponentielle une quantité massive de données à cause des flux générés en temps réel. D'ici la fin de 2020, 31 milliards de tels dispositifs seront connectés dans le monde entier, et vers 2025 ce nombre devrait augmenter à environ 75 milliards, selon Statista1. Dans ce contexte, plusieurs méthodes et applications doivent être explorées pour faire face à ces données volumineuses qui sont caractérisées par les 3V : volume, vélocité et variabilité. Le succès de l'IdO est lié à sa capacité à extraire des connaissances utiles en acquérant automatiquement les informations cachées dans le vaste volume de données générées au fil du temps.

1www.statista.com/statistics/976079/number-of-IdO-connected-objects-worldwide-by-type/.

Les tâches typiques d'exploration de données qui ont été étudiées en profondeur au cours des dernières années comprennent la classification, la régression et le clustering. En effet, les approches traditionnelles proposées pour les données statiques ont certaines limites lorsqu'elles sont appliquées aux flux de données dynamiques. Par conséquent, de nouvelles approches et techniques d'exploration de données sont nécessaires pour traiter les flux de données. Dans ce contexte de l'IdO, les approches d'exploration de données devraient être capables de gérer la vitesse et l'infinité des flux de données en utilisant des ressources limitées – en termes de temps d'exécution et de mémoire. Plus de détails sur ces contraintes sont présentés dans la section B.2. Pour extraire des connaissances utiles de ces données, nous utilisons généralement des algorithmes d'apprentissage automatique adaptés au cadre incrémental des flux de données [3, 4]. Dans ce contexte, la tâche d'exploration de ces données est devenue indispensable dans de nombreuses applications du monde réel. Cette catégorie d'applications génère souvent des données à partir de flux en constante évolution et exige un traitement en temps réel. Dans le domaine de l'analyse de données, la classification est l'une des tâches les plus utilisées. Elle tente de prédire les catégories – ou les classes – des observations non-libellées en utilisant un modèle construit à partir des données déjà traitées. La classification de flux est considérée comme un axe de recherche actif dans le domaine de l'exploration de flux de données, où l'accent est mis sur le développement de nouveaux algorithmes – ou l'amélioration des algorithmes existants [5]. Il existe un certain nombre de classificateurs qui sont largement utilisés en exploration de données et appliqués dans plusieurs applications réelles, tels que les arbres de décision, les réseaux de neurones, les k plus proches voisins, les réseaux bayésiens, etc. Le Chapitre 2 couvre, entre autres, une étude de l'état de l'art sur les algorithmes de classification pour les flux de données les plus connus et les plus récents.

B.2 Défis

Comme mentionné ci-dessus, la classification des flux de données vise à prédire les classes de nouvelles instances non-libellées qui arrivent constamment. Après la prédiction, les modèles existants vont être mis à jour continuellement au fur et à mesure de l'évolution du flux pour suivre la distribution actuelle des données. La nature massive et potentiellement infinie des flux, qui soulève des problèmes critiques et fait échouer les algorithmes traditionnels d'exploration de données, impose des contraintes pour gérer convenablement le comportement dynamique et incrémental des flux. Bien que les contraintes suivantes soient communes aux différentes applications d'exploration de flux de données, nous abordons ces exigences dans le contexte de la classification [6, 7, 8].


• Nature évolutive des flux de données. Tout algorithme de classification doit tenir compte de l'évolution considérable des données et s'adapter à leur nature incrémentale à grande vitesse. Ainsi, les algorithmes doivent classer de manière séquentielle et incrémentale les instances récentes.

• Temps d’exécution. Un algorithme devrait traiter rapidement les instances entrantes provenant du flux, car plus l’algorithme est lent, moins il sera efficace pourles applications qui nécessitent un traitement en temps réel.

• Limited memory. Since storing the enormous volume of a stream in full would require unbounded memory, any classifier should be able to operate within a limited memory budget by keeping data synopses of the processed instances and of the current models.

• High-dimensional data streams. Data streams can be high-dimensional, as is the case for text documents. For this kind of data, distances between instances grow exponentially with the number of dimensions, which can potentially affect the performance of any classifier.

• Concept drift. The data distribution may change at any time, a phenomenon known as concept drift. Concept drift can affect a classifier's results over time, in particular its predictive quality. To cope with new directions in the data, which must be detected as soon as they appear, a drift detection mechanism is therefore usually coupled with data stream classification algorithms; a minimal sketch of such a drift-aware loop is given right after this list. We refer the reader to [9] for a detailed survey of this phenomenon.
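To make the drift-handling mechanism concrete, here is a minimal test-then-train loop, shown only as an illustration and not as the thesis implementation. It assumes the scikit-multiflow ADWIN interface and scikit-learn's GaussianNB; the delta value, the 0/1 error signal, and the reset-on-drift policy are arbitrary choices.

```python
import numpy as np
from skmultiflow.drift_detection import ADWIN   # ADWIN drift detector [21]
from sklearn.naive_bayes import GaussianNB      # incremental Naive Bayes

def drift_aware_loop(stream, classes, delta=0.002):
    """Test-then-train: predict each instance, feed the 0/1 error to ADWIN,
    and reset the classifier whenever a distribution change is detected."""
    detector, model, trained = ADWIN(delta=delta), GaussianNB(), False
    for x, y in stream:                          # x: feature vector, y: label
        X = np.asarray(x, dtype=float).reshape(1, -1)
        if trained:
            error = int(model.predict(X)[0] != y)
            detector.add_element(error)          # monitor the error stream
            if detector.detected_change():       # drift flagged by ADWIN
                model, trained = GaussianNB(), False
        model.partial_fit(X, [y], classes=classes)
        trained = True
```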

A crucial question in this thesis is therefore how to process potentially infinite data while addressing these challenges at low cost.

The aforementioned challenges are of great importance for the stream classification task. Stream mining techniques must therefore differ from the traditional techniques designed for static databases. To meet these challenges, classification algorithms must incorporate an incremental strategy that ensures data streams are processed under the constraints presented in Section 2.2. Table B.1 presents a comparison of the static-data and stream (dynamic-data) settings [10]. Besides the enormous volume of data, data dimensionality is also growing considerably and poses a notable challenge in many domains, such as biology (omics data 2) [11, 12] and e-mail filtering [13] (classifying an e-mail as spam or not based on its content).

2 Omics data refer to biological data whose names end in -omics, for example, genomics and metabolomics.


Table B.1: Comparison between static data and data streams.

                     Static data         Data streams
Access               random              sequential
Number of passes     multiple passes     single pass
Running time         unlimited           limited
Memory               unlimited           limited
Type of result       exact               approximate
Environment          static              dynamic

Such high-dimensional data generally contain many redundant or irrelevant attributes, which can be reduced to a smaller set of relevant combinations extracted from the input attributes without significant loss of information. To process this kind of data optimally and at low cost, a preprocessing step is imperative in order to keep only the relevant attributes and thus save resources when running data stream mining algorithms. To this end, synopses can be built from the stream instances using reduction techniques (for example, sketches that keep track of data frequencies), by selecting a part of the incoming data without reducing the dimensionality (sampling), or by applying a Dimensionality Reduction (DR) technique to reduce the number of attributes. Naturally, the choice of the appropriate technique depends on the problem to be solved [14].
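As an illustration of the sampling strategy mentioned above, the following sketch maintains a fixed-size uniform sample of the stream with classic reservoir sampling. It is not taken from the thesis; the reservoir size is an arbitrary choice.

```python
import random

def reservoir_sample(stream, k=1000, seed=42):
    """Keep a uniform random sample of k instances from a stream of unknown
    length, using only O(k) memory (Vitter's Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for i, instance in enumerate(stream):
        if i < k:
            reservoir.append(instance)   # fill the reservoir first
        else:
            j = rng.randint(0, i)        # keep the new item with prob. k/(i+1)
            if j < k:
                reservoir[j] = instance
    return reservoir
```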

Our goal in this thesis is motivated by the criteria described above for data stream mining. We focus mainly on the classification task and aim to develop new classification approaches that improve the performance of existing algorithms by using data reduction techniques.

Dimensionality reduction is defined as the projection of high-dimensional data into a lower-dimensional space by reducing the input attributes to the most relevant ones. DR is indeed a crucial process for avoiding the curse of dimensionality, which can increase the computational resources used and negatively affect the predictive performance of any data mining algorithm. To limit these effects, several reduction techniques have been proposed and widely studied in the static setting [15, 16] to handle high-dimensional data. However, these techniques do not meet the computational-resource requirements of data streams [17, 18]. More details are provided in Chapter 2.
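As a concrete example of a reduction technique that does satisfy the streaming constraints (a single pass and a constant cost per instance), the sketch below projects each incoming instance with a fixed sparse random matrix. It is only an illustration built on scikit-learn's SparseRandomProjection; the input and output dimensionalities are arbitrary values, not those used in the thesis.

```python
import numpy as np
from sklearn.random_projection import SparseRandomProjection

# Draw the projection matrix once; applying the same matrix to every incoming
# instance keeps the per-instance cost constant and requires no second pass.
d_in, d_out = 10_000, 100                            # illustrative dimensions
projector = SparseRandomProjection(n_components=d_out, random_state=1)
projector.fit(np.zeros((1, d_in)))                   # only builds the matrix

def reduce_instance(x):
    """Project one high-dimensional instance down to d_out attributes."""
    return projector.transform(np.asarray(x, dtype=float).reshape(1, -1))[0]
```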


B.3 Contributions

The main concern of this thesis is to address the aforementioned problems regarding the performance of data stream mining algorithms. This thesis contributes to the field by introducing and exploring new approaches that reduce the resources used by existing algorithms while sacrificing as little accuracy as possible.

In this introduction, we divide the objective of the thesis into three main research questions:

• Q1: How can we improve the performance of existing classifiers in terms of resources while maintaining good accuracy?

• Q2: Can we do better by dynamically preprocessing high-dimensional data streams?

• Q3: How can we handle concept drifts, where the current models are no longer representative?

In the following, we briefly summarize our contributions:

• In Chapter 3, we aim to improve the performance of the Naive Bayes classifier by developing three new approaches that make it effective and efficient on high-dimensional data.

– We study an efficient data structure, the Count-Min Sketch (CMS) [19], to maintain data synopses (frequencies) in memory.
– We propose a new sketch-based Naive Bayes that uses the CMS to store information coming from the data stream in a fixed-size memory.
– Theoretical guarantees on the size of the CMS table are provided by adapting the guarantees of the CMS technique to the Naive Bayes classifier.
– To handle high-dimensional data, we add an incremental preprocessing step during which the data are compressed using a fast DR technique, such as the hashing trick [20].
– We incorporate into the learning phase an adaptive strategy based on ADaptive WINdowing (ADWIN) [21], a concept drift detector, so as to adapt to changes in the distribution.
A minimal sketch of the CMS-based counting used by such a classifier is given right after this list.
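To make the Count-Min Sketch idea concrete, here is a minimal, self-contained CMS that could store attribute-value/class co-occurrence counts for a Naive Bayes classifier. It is an illustration only, not the thesis implementation; the width, depth, and key encoding are arbitrary choices.

```python
import numpy as np

class CountMinSketch:
    """Fixed-size frequency summary: the width controls the additive error
    (about total_count / width) and the depth the failure probability."""
    def __init__(self, width=2048, depth=5, seed=7):
        rng = np.random.default_rng(seed)
        self.width, self.depth = width, depth
        self.table = np.zeros((depth, width), dtype=np.int64)
        self.salts = [int(s) for s in rng.integers(1, 2**31 - 1, size=depth)]

    def _cells(self, key):
        for row, salt in enumerate(self.salts):
            yield row, hash((salt, key)) % self.width   # one hash per row

    def update(self, key, count=1):
        for row, col in self._cells(key):
            self.table[row, col] += count

    def estimate(self, key):
        return min(self.table[row, col] for row, col in self._cells(key))

# A sketch-based Naive Bayes would keep (class, attribute, value) counts in
# the CMS instead of exact count tables:
cms = CountMinSketch()
cms.update(("spam", "subject_word", "free"))
print(cms.estimate(("spam", "subject_word", "free")))   # upper-bound estimate
```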

• In Chapter 4, we study the k-nearest neighbors (kNN) algorithm [22]. We propose two approaches that aim to reduce the computational costs of kNN


on high-dimensional data streams by exploring the Compressed Sensing (CS) technique [23] to reduce the size of the input space.

– We propose a new algorithm for data stream classification, CS-kNN. Our main objective is to improve the resource usage of kNN by compressing the input streams with CS before applying the classification task. This leads to a considerable reduction in the resources (memory and running time) required by kNN.
– We provide theoretical guarantees on the preservation of neighborhoods before and after the projection with the CS technique. We thereby ensure that the predictive performance (accuracy) of CS-kNN is almost the same as what would have been obtained with the standard kNN (using the high-dimensional input data, without CS).
– We also propose an ensemble method based on Leveraging Bagging [24], in which we combine the outputs of several CS-kNN classifiers to improve on the accuracy of a single classifier.
A minimal sketch of the compress-then-classify idea is given right after this list.
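The compress-then-classify idea behind CS-kNN can be illustrated with the sketch below, which pairs a fixed Gaussian measurement matrix (a common choice in compressed sensing) with scikit-learn's KNeighborsClassifier over a sliding window of compressed instances. The window size, compressed dimension, and the refit-at-prediction shortcut are illustrative choices, not the thesis implementation.

```python
import numpy as np
from collections import deque
from sklearn.neighbors import KNeighborsClassifier

class CompressedKNN:
    """Project every instance with a fixed random measurement matrix, then
    run kNN on a sliding window of compressed instances."""
    def __init__(self, d_in, d_out=50, k=5, window=1000, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.normal(0.0, 1.0 / np.sqrt(d_out), size=(d_out, d_in))
        self.k = k
        self.X, self.y = deque(maxlen=window), deque(maxlen=window)

    def _compress(self, x):
        return self.A @ np.asarray(x, dtype=float)

    def partial_fit(self, x, label):
        self.X.append(self._compress(x))     # only compressed data is stored
        self.y.append(label)

    def predict(self, x):
        if len(self.X) < self.k:
            return None                      # not enough neighbors yet
        knn = KNeighborsClassifier(n_neighbors=self.k)
        knn.fit(np.vstack(self.X), list(self.y))
        return knn.predict(self._compress(x).reshape(1, -1))[0]
```

A practical stream implementation would of course maintain the neighbor index incrementally instead of refitting it at every prediction.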

• In Chapter 5, we aim to improve the performance of the recent, high-performing ensemble method Adaptive Random Forest (ARF) [25] on high-dimensional data.

– We propose a new ensemble that aims to minimize the resource usage of ARF by reducing the dimensionality of the input data. To do so, we internally apply the CS technique, which shrinks the data before passing them to the ensemble members; a sketch of this compress-then-learn wrapper follows this item.
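The same compress-then-learn pattern can be written as a generic wrapper. The sketch below is an assumed illustration, not the thesis code: the base model is any incremental classifier exposing the usual partial_fit/predict interface, such as an Adaptive Random Forest implementation.

```python
import numpy as np

class CompressedEnsemble:
    """Place a fixed random (compressed-sensing style) projection in front of
    an incremental ensemble, so its members only see low-dimensional data."""
    def __init__(self, base_model, d_in, d_out=50, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.normal(0.0, 1.0 / np.sqrt(d_out), size=(d_out, d_in))
        self.base = base_model               # e.g. an ARF-style ensemble

    def _compress(self, x):
        return (self.A @ np.asarray(x, dtype=float)).reshape(1, -1)

    def partial_fit(self, x, y, **kwargs):
        self.base.partial_fit(self._compress(x), [y], **kwargs)
        return self

    def predict(self, x):
        return self.base.predict(self._compress(x))[0]
```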

• In Chapter 6, we explore a new DR technique that has recently attracted a lot of attention thanks to its strong performance: UMAP (Uniform Manifold Approximation and Projection) [26]. We use this neighborhood-preserving technique to preprocess the data in order to improve the results of the kNN algorithm, which is itself neighborhood-based.

– We propose a batch-incremental learning technique: an adaptation of UMAP to evolving data streams. Instead of applying UMAP to a static dataset, we adapt it by using mini-batches of the stream incrementally.
– We also propose a new batch-incremental algorithm, UMAP-kNN, for stream classification using UMAP. The main idea is to apply kNN to the low-dimensional mini-batches obtained from the preprocessing step.
A minimal sketch of this batch-incremental pipeline is given right after this list.
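As an illustration only (the thesis adapts UMAP across mini-batches in a more careful way), the sketch below fits UMAP on the first mini-batch, embeds the following batches with transform(), and applies kNN in the embedded space following a test-then-train scheme over batches. It assumes the umap-learn package and mini-batches larger than UMAP's default neighborhood size.

```python
import numpy as np
import umap                                    # umap-learn package
from sklearn.neighbors import KNeighborsClassifier

def umap_knn_stream(batches, n_components=10, k=5):
    """Batch-incremental sketch: embed mini-batches with UMAP, test the kNN
    model trained on the previous embedded batch, then retrain on the
    current one."""
    reducer, knn, accuracies = None, None, []
    for X, y in batches:                       # X: (n, d) array, y: labels
        X = np.asarray(X, dtype=float)
        if reducer is None:
            reducer = umap.UMAP(n_components=n_components).fit(X)
            Z = reducer.embedding_             # embedding of the first batch
        else:
            Z = reducer.transform(X)           # project the new batch
            accuracies.append(np.mean(knn.predict(Z) == np.asarray(y)))
        knn = KNeighborsClassifier(n_neighbors=k).fit(Z, y)
    return accuracies
```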

References

[1] Jacques Bughin, Eric Hazan, Sree Ramaswamy, Michael Chui, Tera Allas, Peter Dahlstrom, Nicolaus Henke, and Monica Trench. Artificial intelligence: the next digital frontier? McKinsey Global Institute Report, 2017. [2] Christopher M Bishop. Pattern recognition and machine learning. springer, 2006. [3] Mohamed Medhat Gaber, Arkady Zaslavsky, and Shonali Krishnaswamy. “Mining data streams: a review”. In: SIGMOD 34.2 (2005), pp. 18–26. [4] Charu C Aggarwal. Data streams: models and algorithms. Vol. 31. Springer, 2007. [5] David J Hand, Heikki Mannila, and Padhraic Smyth. Principles of data mining. MIT press, 2001. [6] Charu C Aggarwal and S Yu Philip. “A survey of synopsis construction in data streams”. In: Data Streams. Springer, 2007, pp. 169–207. [7] João Gama and Mohamed Medhat Gaber. Learning from data streams: processing techniques in sensor networks. Springer, 2007. [8] Jiawei Han, Jian Pei, and Micheline Kamber. Data mining: concepts and techniques. Elsevier, 2011. [9] João Gama, Indrė Žliobaitė, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia. “A survey on concept drift adaptation”. In: Computing Surveys (CSUR) 46.4 (2014), p. 44. [10] Mohamed Medhat Gaber, João Gama, Shonali Krishnaswamy, João Bártolo Gomes, and Frederic Stahl. “Data stream mining in ubiquitous environments: state-of-the- art and current directions”. In: Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 4.2 (2014), pp. 116–138. [11] Luca Albergante, Jonathan Bac, and Andrei Zinovyev. “Estimating the effective dimen- sion of large biological datasets using Fisher separability analysis”. In: International Joint Conference on Neural Networks (IJCNN). IEEE. 2019, pp. 1–8. [12] Chunming Xu and Scott A Jackson. Machine learning and complex biological data. 2019.


[13] Tiago A Almeida, Jurandy Almeida, and Akebo Yamakami. “Spam filtering: how the dimensionality reduction affects the accuracy of Naive Bayes classifiers”. In: Journal of Internet Services and Applications 1.3 (2011), pp. 183–200. [14] John P Cunningham and Zoubin Ghahramani. “Linear dimensionality reduction: Survey, insights, and generalizations”. In: Journal of Machine Learning Research (JMLR) 16.1 (2015), pp. 2859–2900. [15] Laurens Van Der Maaten, Eric Postma, and Jaap Van den Herik. “Dimensionality reduction: a comparative”. In: Journal of Machine Learning Research (JMLR) 10.66-71 (2009), p. 13. [16] Carlos Oscar Sánchez Sorzano, Javier Vargas, and A Pascual Montano. “A survey of dimensionality reduction techniques”. In: arXiv preprint arXiv:1403.2877 (2014). [17] João Gama, Raquel Sebastião, and Pedro Pereira Rodrigues. “Issues in evaluation of stream learning algorithms”. In: SIGKDD. ACM. 2009, pp. 329–338. [18] Heitor Murilo Gomes, Jesse Read, Albert Bifet, Jean Paul Barddal, and João Gama. “Machine learning for streaming data: state of the art, challenges, and opportunities”. In: ACM SIGKDD Explorations Newsletter 21.2 (2019), pp. 6–22. [19] Graham Cormode and Shan Muthukrishnan. “An improved data stream summary: the count-min sketch and its applications”. In: Journal of Algorithms 55.1 (2005), pp. 58–75. [20] Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, and Josh Attenberg. “Feature hashing for large scale multitask learning”. In: International Conference on Machine Learning (ICML). ACM. 2009, pp. 1113–1120. [21] Albert Bifet and Ricard Gavalda. “Learning from time-changing data with adaptive windowing”. In: International Conference on Data Mining (ICDM). SIAM. 2007, pp. 443–448. [22] Naomi S Altman. “An introduction to kernel and nearest-neighbor nonparametric regression”. In: The American Statistician 46.3 (1992), pp. 175–185. [23] David L Donoho. “Compressed sensing”. In: IEEE Transactions on Information Theory 52.4 (2006), pp. 1289–1306. [24] Albert Bifet, Geoff Holmes, and Bernhard Pfahringer. “Leveraging bagging for evolving data streams”. In: Joint European conference on machine learning and knowledge discovery in databases. Springer. 2010, pp. 135–150. [25] Heitor M Gomes, Albert Bifet, Jesse Read, Jean Paul Barddal, Fabrıcio Enembreck, Bernhard Pfharinger, Geoff Holmes, and Talel Abdessalem. “Adaptive random forests for evolving data stream classification”. In: Machine Learning (2017), pp. 1–27.


[26] Leland McInnes, John Healy, and James Melville. “Umap: Uniform manifold approxi- mation and projection for dimension reduction”. In: arXiv preprint arXiv:1802.03426 (2018). [27] Maroua Bahri, Albert Bifet, João Gama, Silviu Maniu, and Heitor Murilo Gomes. Data stream analysis: foundations, progress in classification and tools. submitted. [28] Maroua Bahri, Albert Bifet, Silviu Maniu, and Heitor Murilo Gomes. “Survey on fea- ture transformation techniques for data streams”. In: International Joint Conference on Artificial Intelligence (IJCAI). 2020. [29] Elena Ikonomovska, Suzana Loskovska, and Dejan Gjorgjevik. “A survey of stream data mining”. In: National Conference with International participation ETAI. 2007, pp. 19–21. [30] Willie Ng and Manoranjan Dash. “Discovery of frequent patterns in transactional data streams”. In: Transactions on large-scale data-and knowledge-centered systems II. Springer, 2010, pp. 1–30. [31] Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, and Jennifer Widom. “Models and issues in data stream systems”. In: ACM SIGMOD. ACM. 2002, pp. 1–16. [32] Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi. “Querying and mining data streams: you only get one look a tutorial”. In: ACM SIGMOD. 2002, pp. 635–635. [33] Gurmeet Singh Manku and Rajeev Motwani. “Approximate frequency counts over data streams”. In: Very Large Data Bases (VLDB). Elsevier. 2002, pp. 346–357. [34] Burton H Bloom. “Space/time trade-offs in hash coding with allowable errors”. In: Communications of the ACM 13.7 (1970), pp. 422–426. [35] Albert Bifet, Geoff Holmes, Bernhard Pfahringer, Richard Kirkby, and Ricard Gavaldà. “New ensemble methods for evolving data streams”. In: SIGKDD. ACM. 2009, pp. 139– 148. [36] Albert Bifet and Richard Kirkby. “Data stream mining a practical approach”. In: (2009). [37] Albert Bifet, Ricard Gavaldà, Geoff Holmes, and Bernhard Pfahringer. Machine learning for data streams: with practical examples in MOA. MIT Press, 2018. [38] Nir Friedman, Dan Geiger, and Moises Goldszmidt. “Bayesian network classifiers”. In: Machine learning 29.2-3 (1997), pp. 131–163. [39] Albert Bifet, Bernhard Pfahringer, Jesse Read, and Geoff Holmes. “Efficient data stream classification via probabilistic adaptive windows”. In: Symposium On Applied Computing (SIGAPP). ACM. 2013, pp. 801–806.


[40] Viktor Losing, Barbara Hammer, and Heiko Wersing. “KNN classifier with self adjusting memory for heterogeneous concept drift”. In: International Conference on Data Mining (ICDM). IEEE. 2016, pp. 291–300. [41] Pedro Domingos and Geoff Hulten. “Mining high-speed data streams”. In: SIGKDD. ACM. 2000, pp. 71–80. [42] João Gama, Ricardo Rocha, and Pedro Medas. “Accurate decision trees for mining high-speed data streams”. In: SIGKDD. ACM. 2003, pp. 523–528. [43] João Gama, Ricardo Fernandes, and Ricardo Rocha. “Decision trees for mining data streams”. In: Intelligent Data Analysis (IDA) 10.1 (2006), pp. 23–45. [44] Albert Bifet and Ricard Gavaldà. “Adaptive learning from evolving data streams”. In: Intelligent Data Analysis (IDA). Springer. 2009, pp. 249–260. [45] Robi Polikar. “Ensemble based systems in decision making”. In: IEEE Circuits and Systems Magazine 6.3 (2006), pp. 21–45. [46] Leo Breiman. “Bagging predictors”. In: ML 24.2 (1996), pp. 123–140. [47] Thomas G Dietterich. “Ensemble methods in machine learning”. In: International Workshop on Multiple Classifier Systems. Springer. 2000, pp. 1–15. [48] Heitor Murilo Gomes, Jean Paul Barddal, Fabrıcio Enembreck, and Albert Bifet. “A survey on ensemble learning for data stream classification”. In: Computing Surveys (CSUR) 50.2 (2017), p. 23. [49] Nikunj C Oza. “Online bagging and boosting”. In: IEEE International Conference on Systems, Man and Cybernetics. Vol. 3. Ieee. 2005, pp. 2340–2345. [50] Heitor Murilo Gomes, Jesse Read, and Albert Bifet. “Streaming random patches for evolving data stream classification”. In: International Conference on Data Mining (ICDM). IEEE. 2019. [51] Huan Liu and Hiroshi Motoda. Computational methods of feature selection. CRC Press, 2007. [52] Huan Liu and Hiroshi Motoda. Feature extraction, construction and selection: A data mining perspective. Vol. 453. Springer Science & Business Media, 1998. [53] Jean Paul Barddal, Heitor Murilo Gomes, Fabrıcio Enembreck, and Bernhard Pfahringer. “A survey on feature drift adaptation: Definition, benchmark, challenges and future directions”. In: Journal of Systems and Software 127 (2017), pp. 278–294. [54] Sergio Ramırez-Gallego, Bartosz Krawczyk, Salvador Garcıa, Michał Wózniak,and Francisco Herrera. “A survey on data preprocessing for data stream mining: Current status and future directions”. In: Neurocomputing 239 (2017), pp. 39–57.


[55] Xuegang Hu, Peng Zhou, Peipei Li, Jing Wang, and Xindong Wu. “A survey on online feature selection with streaming features”. In: Frontiers of Computer Science 12.3 (2018), pp. 479–493. [56] Noura AlNuaimi, Mohammad Mehedy Masud, Mohamed Adel Serhani, and Nazar Zaki. “Streaming feature selection algorithms for big data: A survey”. In: Applied Computing and Informatics (2019). [57] Matej Artac, Matjaz Jogan, and Ales Leonardis. “Incremental PCA for on-line visual learning and recognition”. In: Object Recognition Supported by User interaction for Service Robots. Vol. 3. IEEE. 2002, pp. 781–784. [58] Juyang Weng, Yilu Zhang, and Wey-Shiuan Hwang. “Candid covariance-free incre- mental principal component analysis”. In: Transactions on Pattern Analysis and Machine Intelligence 25.8 (2003), pp. 1034–1040. [59] David A Ross, Jongwoo Lim, Ruei-Sung Lin, and Ming-Hsuan Yang. “Incremental learning for robust visual tracking”. In: International Journal of Computer Vision (IJCV) 77.1-3 (2008), pp. 125–141. [60] Ioannis Mitliagkas, Constantine Caramanis, and Prateek Jain. “Memory limited, streaming PCA”. In: Neural Information Processing Systems (NIPS). 2013, pp. 2886– 2894. [61] Wenjian Yu, Yu Gu, Jian Li, Shenghua Liu, and Yaohang Li. “Single-pass PCA of large high-dimensional data”. In: International Joint Conference on Artificial Intelligence (IJCAI) (2017). [62] Wu Feng, Zhong Yan, Li Ai-ping, and Wu Quan-yuan. “Online classification algorithm for data streams based on fast iterative Kernel principal component analysis”. In: International Conference on Natural Computation (ICNC). Vol. 1. IEEE. 2009, pp. 232– 236. [63] Simon Günter, Nicol N Schraudolph, and SVN Vishwanathan. “Fast iterative kernel principal component analysis”. In: Journal of Machine Learning Research (JMLR) 8.8 (2007), pp. 1893–1918. [64] Hervé Cardot and David Degras. “Online principal component analysis in high dimension: Which algorithm to choose?” In: International Statistical Review 86.1 (2018), pp. 29–50. [65] Florian Wickelmaier. “An introduction to MDS”. In: Sound Quality Research Unit 46.5 (2003), pp. 1–26. [66] Arvind Agarwal, Jeff M Phillips, Hal Daumé III, and Suresh Venkatasubramanian. “Incremental multidimensional scaling”. In: The Learning Workshop. Citeseer. 2010, pp. 131, 173, 227.


[67] Xi Zhang, Hao Huang, Klaus Mueller, and Shinjae Yoo. “Streaming classical mul- tidimensional scaling”. In: New York Scientific Data Summit (NYSDS). IEEE. 2018, pp. 1–2. [68] Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. “Greedy layer- wise training of deep networks”. In: Neural Information Processing Systems (NIPS). 2007, pp. 153–160. [69] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. “Extracting and composing robust features with denoising autoencoders”. In: Inter- national Conference on Machine Learning (ICML). ACM. 2008, pp. 1096–1103. [70] Guanyu Zhou, Kihyuk Sohn, and Honglak Lee. “Online incremental feature learning with denoising autoencoders”. In: Artificial Intelligence and Statistics. 2012, pp. 1453– 1461. [71] Yue Dong and Nathalie Japkowicz. “Threaded ensembles of supervised and un- supervised neural networks for stream learning”. In: Canadian AI. Springer. 2016, pp. 304–315. [72] Geoffrey J McLachlan. Discriminant analysis and statistical pattern recognition. Vol. 544. Wiley, 2004. [73] Ahsanul Haque, Latifur Khan, Michael Baron, Bhavani Thuraisingham, and Charu Aggarwal. “Efficient handling of concept drift and concept evolution over stream data”. In: International Conference on Data Engineering (ICDE). IEEE. 2016, pp. 481– 492. [74] Shaoning Pang, Seiichi Ozawa, and Nikola Kasabov. “Incremental linear discriminant analysis for classification of data streams”. In: Transactions on Systems, Man, and Cybernetics 35.5 (2005), pp. 905–914. [75] Jieping Ye, Qi Li, Hui Xiong, Haesun Park, Ravi Janardan, and Vipin Kumar. “IDR/QR: An incremental dimension reduction algorithm via QR decomposition”. In: Transac- tions on Knowledge and Data Engineering (TKDE) 17.9 (2005), pp. 1208–1222. [76] Tae-Kyun Kim, Björn Stenger, Josef Kittler, and Roberto Cipolla. “Incremental linear discriminant analysis using sufficient spanning sets and its applications”. In: International Journal of Computer Vision (IJCV) 91.2 (2011), pp. 216–232. [77] Haifeng Li, Tao Jiang, and Keshu Zhang. “Efficient and robust feature extraction by maximum margin criterion”. In: Neural Information Processing Systems (NIPS). 2004, pp. 97–104. [78] Jun Yan, Benyu Zhang, Shuicheng Yan, Qiang Yang, Hua Li, Zheng Chen, Wensi Xi, Weiguo Fan, Wei-Ying Ma, and Qiansheng Cheng. “IMMC: incremental maximum margin criterion”. In: SIGKDD. ACM. 2004, pp. 725–730.


[79] Santosh S Vempala. The random projection method. Vol. 65. American Mathematical Soc, 2005. [80] Sanjoy Dasgupta and Anupam Gupta. “An elementary proof of the Johnson- Lindenstrauss lemma”. In: International Computer Science Institute, Technical Report 22.1 (1999), pp. 1–5. [81] William B Johnson, Joram Lindenstrauss, and Gideon Schechtman. “Extensions of Lipschitz maps into Banach spaces”. In: Israel Journal of Mathematics 54.2 (1986), pp. 129–138. [82] Chenlu Qiu, Wei Lu, and Namrata Vaswani. “Real-time dynamic MR image recon- struction using Kalman filtered compressed sensing”. In: International Conference on Acoustics, Speech and Signal Processing. IEEE. 2009, pp. 393–396. [83] Hanxi Li, Chunhua Shen, and Qinfeng Shi. “Real-time visual tracking using com- pressive sensing”. In: Computer Vision and Pattern Recognition (CVPR). IEEE. 2011, pp. 1305–1312. [84] M Uttarakumari, Ashray V Achary, Sujata D Badiger, DS Avinash, Anisha Mukherjee, and Nancy Kothari. “Vehicle classification using compressive sensing”. In: Interna- tional Conference on Recent Trends in Electronics, Information and Communication Technology (RTEICT). IEEE. 2017, pp. 692–696. [85] Dimitris Achlioptas. “Database-friendly random projections: Johnson-Lindenstrauss with binary coins”. In: Journal of computer and System Sciences 66.4 (2003), pp. 671– 687. [86] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S Mirrokni. “Locality- sensitive hashing scheme based on p-stable distributions”. In: Symposium on Computational Geometry (SOCG). ACM. 2004, pp. 253–262. [87] Joshua B Tenenbaum, Vin De Silva, and John C Langford. “A global geometric framework for nonlinear dimensionality reduction”. In: science 290.5500 (2000), pp. 2319–2323. [88] Martin HC Law, Nan Zhang, and Anil K Jain. “Nonlinear manifold learning for data stream”. In: International Conference on Data Mining (ICDM). SIAM. 2004, pp. 33–44. [89] Frank Schoeneman, Suchismit Mahapatra, Varun Chandola, Nils Napp, and Jaroslaw Zola. “Error metrics for learning reliable manifolds from streaming data”. In: International Conference on Data Mining (ICDM). SIAM. 2017, pp. 750–758. [90] Laurens van der Maaten and Geoffrey Hinton. “Visualizing data using t-SNE”. In: Journal of Machine Learning Research (JMLR) 9 (2008), pp. 2579–2605. [91] A Philip Dawid. “Present position and potential developments: Some personal views statistical theory the prequential approach”. In: Journal of the Royal Statistical Society: Series A (General) 147.2 (1984), pp. 278–290.


[92] Zhexue Huang. “Extensions to the k-means algorithm for clustering large data sets with categorical values”. In: Data Mining and Knowledge Discovery 2.3 (1998), pp. 283–304. [93] Maroua Bahri, Silviu Maniu, and Albert Bifet. “Sketch-based naive Bayes algorithms for evolving data streams”. In: International Conference on Big Data. IEEE. 2018, pp. 604–613. [94] Graham Cormode, Minos Garofalakis, Peter J Haas, and Chris Jermaine. “Synopses for massive data: Samples, histograms, wavelets, sketches”. In: Foundations and Trends in Databases 4.1–3 (2012), pp. 1–294. [95] Graham Cormode. “Data Sketching”. In: Queue 15.2 (2017), p. 60. [96] Branislav Kveton, Hung Bui, Mohammad Ghavamzadeh, Georgios Theocharous, S Muthukrishnan, and Siqi Sun. “Graphical model sketch”. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer. 2016, pp. 81–97. [97] Albert Bifet, Geoff Holmes, Richard Kirkby, and Bernhard Pfahringer. “Moa: Massive online analysis”. In: Journal of Machine Learning Research (JMLR) 11.May (2010), pp. 1601–1604. [98] W Nick Street and YongSeog Kim. “A streaming ensemble algorithm (SEA) for large- scale classification”. In: SIGKDD. ACM. 2001, pp. 377–382. [99] Wei-Yin Loh. “Classification and regression trees”. In: Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 1.1 (2011), pp. 14–23. [100] Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. “Database mining: A perfor- mance perspective”. In: Transactions on Knowledge and Data Engineering (TKDE) 5.6 (1993), pp. 914–925. [101] Geoff Hulten, Laurie Spencer, and Pedro Domingos. “Mining time-changing data streams”. In: SIGKDD. ACM. 2001, pp. 97–106. [102] Michael Harries and New South Wales. “Splice-2 comparative evaluation: electricity pricing”. In: (1999). [103] Bryan Klimt and Yiming Yang. “The enron corpus: A new dataset for email classifi- cation research”. In: European Conference on Machine Learning (ECML). Springer. 2004, pp. 217–226. [104] Andrew L Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. “Learning word vectors for sentiment analysis”. In: ACL-HLT. Association for Computational Linguistics. 2011, pp. 142–150.


[105] Patrick Marques Ciarelli and Elias Oliveira. “Agglomeration and elimination of terms for dimensionality reduction”. In: International Conference on Intelligent Systems Design and Applications. IEEE. 2009, pp. 547–552. [106] João Gama and Carlos Pinto. “Discretization from data streams: applications to histograms and data mining”. In: ACM symposium on Applied computing. 2006, pp. 662–667. [107] Geoffrey Holmes, Andrew Donkin, and Ian H Witten. “Weka: A machine learning workbench”. In: Proceedings of the second Australia and New Zealand Conference on Intelligent Information Systems, IEEE. 1994, pp. 357–361. [108] Maroua Bahri, Albert Bifet, Silviu Maniu, Rodrigo de Mello, and Nikolaos Tziortzi- otis. “Compressed k-nearest neighbors ensembles for evolving data streams”. In: European Conference on Artificial Intelligence (ECAI). IEEE. 2020. [109] Yair Weiss, Hyun Sung Chang, and William T Freeman. “Learning compressed sensing”. In: Snowbird Learning Workshop. Citeseer. 2007. [110] Jean-Luc Starck, Fionn Murtagh, and Jalal Fadili. Sparse image and signal processing: Wavelets and related geometric multiscale analysis. Cambridge university press, 2015. [111] Alfred M Bruckstein, David L Donoho, and Michael Elad. “From sparse solutions of systems of equations to sparse modeling of signals and images”. In: SIAM Review 51.1 (2009), pp. 34–81. [112] Avishy Y Carmi, Lyudmila Mihaylova, and Simon J Godsill. Compressed sensing and sparse filtering. Springer, 2014. [113] Jesse Read, Albert Bifet, Bernhard Pfahringer, and Geoff Holmes. “Batch-incremental versus instance-incremental learning in dynamic and evolving data”. In: Intelligent Data Analysis (IDA). 2012, pp. 313–323. [114] Youness Arjoune, Naima Kaabouch, Hassan El Ghazi, and Ahmed Tamtaoui. “A performance comparison of measurement matrices in compressive sensing”. In: International Journal of Communication Systems 31.10 (2018), e3576. [115] Thu LN Nguyen and Yoan Shin. “Deterministic sensing matrices in compressive sensing: a survey”. In: The Scientific World Journal (2013). [116] Nikolaos M Freris, Orhan Oçal, and Martin Vetterli. “Compressed sensing of stream- ing data”. In: 51st Annual Allerton Conference on Communication, Control, and Computing. IEEE. 2013, pp. 1242–1249. [117] Richard Baraniuk, Mark Davenport, Ronald DeVore, and Michael Wakin. “A simple proof of the restricted isometry property for random matrices”. In: Constructive Approximation 28.3 (2008), pp. 253–263.


[118] Herbert Edelsbrunner and John Harer. “Persistent homology-a survey”. In: Contem- porary Mathematics 453 (2008), pp. 257–282. [119] Donald R Sheehy. “The persistent homology of distance functions under random projection”. In: Symposium on Computational Geometry (SOCG). ACM. 2014, p. 328. [120] Felix Krahmer and Rachel Ward. “New and improved Johnson–Lindenstrauss embeddings via the restricted isometry property”. In: Mathematical Analysis 43.3 (2011), pp. 1269–1281. [121] Ioannis Katakis, Grigorios Tsoumakas, Evangelos Banos, Nick Bassiliades, and Ioannis Vlahavas. “An adaptive personalized news dissemination system”. In: Journal of Intelligent Information Systems 32.2 (2009), pp. 191–212. [122] Maroua Bahri, Heitor Murilo Gomes, Albert Bifet, and Silviu Maniu. “Compressed adaptive random forests for evolving data streams”. In: International Joint Conference on Neural Networks (IJCNN). IEEE. 2020. [123] Leo Breiman. “Random forests”. In: Machine learning 45.1 (2001), pp. 5–32. [124] Laurent Candillier and Vincent Lemaire. “Design and analysis of the Nomao challenge active learning in the real-world”. In: ALRA, Workshop ECML-PKDD. sn. 2012. [125] Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra, and Jorge L Reyes-Ortiz. “Human activity recognition on smartphones using a multiclass hardware-friendly support vector machine”. In: IWAAL. Springer. 2012, pp. 216–223. [126] Maroua Bahri, Bernhard Pfahringer, Albert Bifet, and Silviu Maniu. “Efficient batch- incremental classification for evolving data streams”. In: Intelligent Data Analysis (IDA). Springer. 2020. [127] Frank Klawonn and Plamen Angelov. “Evolving extended naive Bayes classifiers”. In: International Conference on Data Mining Workshops. IEEE. 2006, pp. 643–647. [128] David W Hosmer Jr, Stanley Lemeshow, and Rodney X Sturdivant. Applied logistic regression. Vol. 398. Wiley, 2013. [129] Corinna Cortes and Vladimir Vapnik. “Support-vector networks”. In: ML 20.3 (1995), pp. 273–297. [130] Geoffrey Holmes, Richard Brendon Kirkby, and David Bainbridge. “Batch- incremental learning for mining data streams”. In: (2004). [131] Harold Hotelling. “Analysis of a complex of statistical variables into principal components.” In: Journal of Educational Psychology 24.6 (1933), p. 417. [132] David W Aha. Lazy learning. Springer, 2013. [133] Charu C Aggarwal. Data mining: the textbook. Springer, 2015.


[134] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. “Scikit-learn: Machine learning in Python”. In: Journal of Machine Learning Research (JMLR) 12.Oct (2011), pp. 2825–2830. [135] Jacob Montiel, Jesse Read, Albert Bifet, and Talel Abdessalem. “Scikit-multiflow: a multi-output streaming framework”. In: Journal of Machine Learning Research (JMLR) 19.1 (2018), pp. 2915–2914. [136] Bruno Veloso, João Gama, and Benedita Malheiro. “Self hyper-parameter tuning for data streams”. In: Data Streams. 2018.

Titre : Amélioration de l'Analyse des Flux de Données IdO à l'Aide de Techniques de Réduction de Données.

Mots clés : Internet des Objets ; Flux de Données ; Apprentissage Supervisé ; Classification ; Réduction de Dimension ; Résumés Minimalistes.

Title : Improving IoT Data Stream Analytics Using Summarization Techniques.
Keywords : Internet of Things ; Data Stream ; Supervised Learning ; Classification ; Dimensionality Reduction ; Sketching.

Institut Polytechnique de Paris 91120 Palaiseau, France