Automated Machine Learning Techniques for Data Streams Alexandru-Ionut Imbrea University of Twente P.O

Automated Machine Learning Techniques for Data Streams Alexandru-Ionut Imbrea University of Twente P.O. Box 217, 7500AE Enschede The Netherlands [email protected] ABSTRACT ative tasks that involve fine-tuning, which is usually per- Automated Machine Learning (AutoML) techniques ben- formed by data scientists in a trial-and-error process until efitted from tremendous research progress recently. These the desired performance is achieved. The ever-growing developments and the continuous-growing demand for ma- number of machine learning algorithms and hyperparam- chine learning experts led to the development of numerous eters leads to an increase in the number of configurations AutoML tools. Industry applications of machine learning which makes data scientists' job more laborious than ever. on streaming data become more popular due to the in- Considering the above reason and due to the lack of ex- creasing adoption of real-time streaming in IoT, microser- perts required in the industry, the field of Automated vices architectures, web analytics, and other fields. How- Machine Learning (AutoML) benefited from considerable ever, the AutoML tools assume that the entire training advances recently [7]. AutoML tools and techniques en- dataset is available upfront and that the underlying data able non-experts to achieve satisfactory results and ex- distribution does not change over time. These assump- perts to automate and optimize their tasks. Although the tions do not hold in a data-stream-mining setting where field of AutoML is relatively new, it consists of multiple an unbounded stream of data cannot be stored and is likely other sub-fields such as Automated Data Cleaning (Auto to manifest concept drift. This research surveys the state- Clean), Automated Feature Engineering (Auto FE), Hy- of-the-art open-source AutoML tools, applies them to real perparameter Optimization (HPO), Neural Architecture and synthetic streamed data, and measures how their per- Search (NAS), and Meta-Learning [36]. The sub-field of formance changes over time. For comparative purposes, Meta-Learning is concerned with solving the algorithm se- batch, batch incremental and instance incremental estima- lection problem [23] applied to ML by selecting the algo- tors are applied and compared. Moreover, a meta-learning rithm that provides the best predictive performance for a technique for online algorithm selection based on meta- data set. HPO techniques are used to determine the opti- feature extraction is proposed and compared, while model mal set of hyperparameters for a learning algorithm, while replacement and continual AutoML techniques are dis- Auto FE aims to extract and select features automatically. cussed. The results show that off-the-shelf AutoML tools NAS represents the process of automating architecture en- can provide satisfactory results but in the presence of con- gineering for artificial neural networks which already out- cept drift, detection or adaptation techniques have to be performed manually designed ones in specific tasks such as applied to maintain the predictive accuracy over time. image classification [8]. The overarching goal of the field is to develop tools that can produce end-to-end machine Keywords learning pipelines with minimal intervention and knowledge. However, this is not yet possible using state-of-the- AutoML, AutoFE, Hyperparameter Optimization, Online art open-source tools, thus when talking about AutoML Learning, Meta-Learning, Data Stream Mining throughout the paper it means solving the combined algorithm selection and hyperparameter optimization problem, 1. INTRODUCTION or CASH for short, as defined by Thornton et al. in [28]. Developing machine learning models that provide constant When other techniques from the general field of AutoML, high predictive accuracy is a difficult task that usually re- such as meta-learning, are needed, it is specifically men- quires the expertise of a data scientist. Data scientists tioned. are multidisciplinary individuals possessing skills from the Regardless of their increasing popularity and advances, intersection of mathematics, computer science, and busi- current AutoML tools lack applicability in data stream ness/domain knowledge. Their job mainly consists of per- mining. Mainly due to the searching and optimization forming a workflow that includes, among others, the fol- techniques these tools use internally, as described in Sec- lowing steps: data gathering, data cleaning, feature ex- tion 5, they have to make a series of assumptions [19]. traction, algorithm selection, and hyperparameter opti- First, the entire training data has to be available at the mization. The last three steps of this workflow are iter- beginning and during the training process. Second, it is assumed that the data used for prediction follows the same Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies distribution as the training data and it does not change are not made or distributed for profit or commercial advantage and that over time. However, streaming data may not follow any of copies bear this notice and the full citation on the first page. To copy oth- these assumptions: it is an unbounded stream of data that erwise, or republish, to post on servers or to redistribute to lists, requires cannot be stored entirely in memory and that can change prior specific permission and/or a fee. its underlying distribution over time. The problem of ap- nd st 32 Twente Student Conference on IT Jan. 31 , 2019, Enschede, The plying AutoML to data streams becomes more relevant if Netherlands. considering that the popularity of microservices and event- Copyright 2019, University of Twente, Faculty of Electrical Engineer- ing, Mathematics and Computer Science. driven architectures is also increasing considerably [24]. In 1 these types of systems, streams of data are constantly gen- 3. PROBLEM FORMULATION erated from various sources including sensor data, network The formal definition of the AutoML problem as stated traffic, and user interaction events. The growing amount by Feurer et al. in [9] is the following: of streamed data pushed the development of new technolo- Definition 1 (AutoML): gies and architectures that can accommodate big amounts For i = 1; : : : ; n + m, n; m 2 +, let x 2 d denote of streaming data in scalable and distributed ways (e.g. N i R a feature vector of d dimensions and y 2 Y the corre- Apache Kafka [18]). Nevertheless, despite their relevance, i sponding target value. Given a training dataset D = AutoML tools and techniques lack integration with such train (x ; y );:::; (x ; y ) and the feature vectors x ; :::; x big data streaming technologies. 1 1 n n n+1 n+m of a test dataset Dtest = (xn+1; yn+1);:::; (xn+m; yn+m) The workaround solution adopted by the industry con- drawn from the same underlying data distribution, as well sists, in general, of storing the data stream events in a as a resource budget b and a loss metric L(·; ·), the Au- distributed file system, perform AutoML on that data in toML problem is to (automatically) produce test set pre- batch, serialize the model, and use the model for provid- dictions yn+1; :::; yn+m. The loss of a solutiony ^n+1; :::; y^n+m ing real-time predictions for the new stream events [17], to the AutoML problem is given by: [13]. This solution presents a series of shortcomings including predicting on an outdated model, expensive disk m and network IO, and the problem of not adapting to con- 1 X L(^y ; y ) (1) cept drift. In this paper, the problem of concept drift m n+j n+j affecting the predictive performance of models over time j=1 is extensively discussed and measured for models gener- When restricting the problem to a combined algorithm se- ated using AutoML tools. For comparison, batch, batch lection and hyperparameter optimization problem (CASH) incremental and online models are developed and used as as defined and formalised by Thornton et al. in [28] the a baseline. Moreover, detection and adaptation techniques definition of the problem becomes: for concept drift are assessed for both AutoML-generated and online models. Definition 2 (CASH): Given a set of algorithms A = fA(1);:::;A(k)g with asso- The main contribution of this research consists of: ciated hyperparameter domains Λ(1);:::; Λ(k) and a loss metric L(·; ·; ·), we define the CASH problem as comput- • An overview and discussion of the possible AutoML ing: techniques and tools that can be applied to data streams. k 1 1 ∗ X (j) (i) (i) • A Python library called automl-streams available A λ∗ 2 argmin L(Aλ ; Dtrain; Dvalid) (2) (j) (j) k on GitHub2 including: a way to use a Kafka stream A 2A;λ2Λ i=1 in Python ML/AutoML libraries, an evaluator for pretrained models, implementations of meta-learning algorithms. 4. BACKGROUND 4.1 AutoML Frameworks • A collection of containerized experiments3 that can To solve the AutoML problem (in its original form or as be easily extended and adapted to new algorithms a CASH problem) a configuration space containing the and frameworks. possible combinations of algorithms and hyperparameters is defined. In the case of artificial neural networks algo- • An interpretation of the experimental results. rithms, an additional dimension represented by the neural architecture is added to the search space. For searching In the following sections of the paper the main research this space, different searching strategies and techniques questions are presented, the formal problem formulations can be used [29]. For the purpose of this research one rep- are given for the concepts intuitively described above, and resentative open-source framework [12] was selected for the related work is summarised. Finally, the experimental each type of searching strategy (CASH solver). The se- methods and the results are described. lected frameworks and their searching strategy are dis- played in Table 1.

Load more