Machine Learning for Streaming Data: State of the Art, Challenges, and Opportunities
Heitor Murilo Gomes (Department of Computer Science, University of Waikato, Hamilton, New Zealand; LTCI, Télécom ParisTech)
Jesse Read (LIX, École Polytechnique, Palaiseau, France; [email protected])
Albert Bifet (Department of Computer Science, University of Waikato, Hamilton, New Zealand; LTCI, Télécom ParisTech)
Jean Paul Barddal (PPGIA, Pontifical Catholic University of Paraná, Curitiba, Brazil; [email protected])
João Gama (LIAAD, INESC TEC, University of Porto, Porto, Portugal)

ABSTRACT

Incremental learning, online learning, and data stream learning are terms commonly associated with learning algorithms that update their models given a continuous influx of data without performing multiple passes over the data. Several works have been devoted to this area, either directly or indirectly as characteristics of big data processing, i.e., Velocity and Volume. Given current industry needs, there are many challenges to be addressed before existing methods can be efficiently applied to real-world problems. In this work, we focus on elucidating the connections among the current state of the art in related fields and on clarifying open challenges in both academia and industry. We treat with special care topics that were not thoroughly investigated in past position and survey papers. This work aims to evoke discussion and elucidate the current research opportunities, highlighting the relationships among different subareas and suggesting courses of action when possible.

1. INTRODUCTION

Data sources are becoming increasingly ubiquitous and faster in comparison to the previous decade. These characteristics motivated the development of several machine learning algorithms for data streams. We are now on the verge of moving these methods out of research labs and into industry, similarly to what happened to traditional machine learning methods in the recent past. This movement requires the development and adaptation of techniques that are adjacent to the learning algorithms, i.e., it is necessary to develop not only efficient and adaptive learners, but also methods to deal with data preprocessing and other practical tasks. On top of that, there is a need to objectively reassess the underlying assumptions of some techniques developed under hypothetical scenarios, to clearly understand when and how they are applicable in practice.

Machine learning for streaming data research has yielded several works on supervised learning [19], especially classification, mostly focused on addressing the problem of changes to the underlying data distribution over time, i.e., concept drifts [119]. In general, these works focus on one specific challenge: developing methods that maintain an accurate decision model with the ability to learn and forget concepts incrementally.

The focus given to supervised learning has shifted towards other tasks in the past few years, mostly to accommodate more general requirements of real-world problems. Nowadays one can find work on data stream clustering, pattern mining, anomaly detection, feature selection, multi-output learning, semi-supervised learning, novel class detection, and others. Nevertheless, some fields are less developed than others; e.g., drift detection and recovery has been thoroughly investigated for streaming data where labels are immediately (and fully) available. In contrast, not as much research has been conducted on the efficiency of drift detection methods for streaming data where labels arrive with a non-negligible delay or where some (or all) labels never arrive (semi-supervised and unsupervised learning).

Previous works have shown the importance of some of these problems and their research directions. Krempl et al. [73] discuss important issues, such as how to evaluate data stream algorithms, privacy, and the gap between algorithms and full decision support systems. Machine learning for data streams is a recurrent topic in Big Data surveys [44; 127], as it is related to the Velocity and Volume characteristics of the traditional 3V's of big data (Volume, Variety, Velocity). Nevertheless, there is no consensus about how learning from streaming data should be tackled, and depending on the application (and the research group), different abstractions and solutions are going to be used. For example, Stoica et al. [115] discuss continual learning and, from their point of view, learning from heterogeneous environments where changes are expected to occur is better addressed by reinforcement learning. In summary, some reviews and surveys focus on specific tasks or techniques, such as rule mining for data streams [66], activity recognition [1], or ensemble learning [53].

In this paper, we focus on providing an updated view of the field of machine learning for data streams, highlighting the state of the art and possible research (and development) opportunities. Our contributions focus on aspects that were not thoroughly discussed in similar works [73]; thus, when appropriate, we direct readers to works that better introduce the original problems while we highlight more complex challenges.

In summary, the topics discussed are organized as follows. The first three sections of the manuscript address the important topics of preprocessing (section 2), learning (section 3), and adaptation (section 4). In the last three sections, we turn our attention to evaluating algorithms (section 5), streaming data and related topics in AI (section 6), and existing tools for exploring machine learning for streaming data (section 7). Finally, section 8 outlines the main takeaways regarding research opportunities and challenges discussed throughout this paper.

2. DATA PREPROCESSING

Data preparation is an essential part of a machine learning solution. Real-world problems require transformations to raw data, preprocessing steps, and usually a further selection of 'relevant' data before it can be used to build machine learning models.

Trivial preprocessing, such as normalizing a feature, can be complicated in a streaming setting. The main reason is that statistics about the data are unknown a priori, e.g., the minimum and maximum values a given feature can exhibit.
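To make this concrete, the following minimal sketch (ours, not from the cited works) rescales a numeric feature using only the statistics observed so far in the stream; the class and variable names are illustrative.

class StreamingMinMaxScaler:
    """Rescale values to [0, 1] using the min/max observed so far in the stream."""

    def __init__(self):
        self.min_val = None
        self.max_val = None

    def update(self, x):
        # Update the running minimum and maximum with the newly arrived value.
        if self.min_val is None or x < self.min_val:
            self.min_val = x
        if self.max_val is None or x > self.max_val:
            self.max_val = x

    def transform(self, x):
        # Scale x with the statistics seen so far; fall back to 0.0
        # while the observed range is still degenerate.
        if self.min_val is None or self.max_val == self.min_val:
            return 0.0
        return (x - self.min_val) / (self.max_val - self.min_val)

scaler = StreamingMinMaxScaler()
for value in [12.0, 7.5, 30.2, 18.9, 3.3]:
    scaler.update(value)
    print(value, "->", round(scaler.transform(value), 3))

Note that early instances are scaled with less information than later ones, and a concept drift that changes the feature's range would require resetting or windowing these running statistics.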
There are different approaches to scale and discretize features, as discussed in the next subsections; still, as we move into more complex preprocessing, we are usually in territory that has not been explored in depth.

There are mainly two reasons for performing data preprocessing before training a machine learning model:

1. To allow learning algorithms to handle the data;
2. To improve learning by extracting or keeping only the most relevant data.

The first reason is more restrictive, as some algorithms will not be able to digest data if it is not in the expected format, i.e., the data types do not match. Other algorithms will perform poorly if the data is not normalized. Examples include algorithms that rely on Stochastic Gradient Descent and distance-based algorithms (such as k-Nearest Neighbors).

[...] and Flink deal with them and which methods are implemented for a wide range of preprocessing tasks, including feature selection, instance selection, discretization, and others. Ramírez-Gallego et al. [103] focus specifically on preprocessing techniques for data streams, mentioning both existing methods and open challenges. Critical remarks were made in [103], such as: the relevance of proposing novel discretization techniques that do not rely solely on quantiles and that perform better in the presence of concept drifts; expanding existing preprocessing techniques that deal with concept drift to also account for recurrent drifts; and the need for procedures that address non-conventional problems, e.g., multi-label classification.

In this section, we focus on issues and techniques that were not thoroughly discussed in previous works, such as summarization sketches. Other aspects that can be considered part of preprocessing, notably dealing with imbalanced data, are discussed in further sections.

2.1 Feature transformation

2.1.1 Summarization Sketches

Working with limited memory in streaming data is non-trivial, since data streams produce insurmountable quantities of raw data, which are often not useful as individual instances but essential when aggregated. These aggregated data structures can be used for further analysis, such as answering "what is the average age of all customers that visited a given website in the last hour?", or to train a machine learning model.

The summary created to avoid storing and maintaining a large amount of data is often referred to as a 'sketch' of the data. Sketches are probabilistic data structures that summarize streams of data, such that any two sketches of individual streams can be combined into the sketch of the combined stream in a space-efficient way. Using sketches requires a number of compromises: sketches must be created using a constrained amount of memory and processing time, and if the sketches are not accurate, then the information derived from them may be misleading.
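To illustrate the mergeability property described above, the following is a minimal, illustrative count sketch in the spirit of the Count-Min Sketch [30] discussed below; it is not an implementation taken from any of the cited works, and the class names and parameter choices are ours.

import hashlib

class CountMinSketch:
    """Approximate item counts for a stream using a small fixed-size table."""

    def __init__(self, width=272, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash position per row, derived from a row-specific seed.
        digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate counters, so take the minimum across rows.
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))

    def merge(self, other):
        # Element-wise addition yields the sketch of the combined stream.
        assert self.width == other.width and self.depth == other.depth
        for row in range(self.depth):
            for col in range(self.width):
                self.table[row][col] += other.table[row][col]

# Two nodes sketch their local streams independently, then merge the sketches.
a, b = CountMinSketch(), CountMinSketch()
for item in ["x", "y", "x"]:
    a.add(item)
for item in ["x", "z"]:
    b.add(item)
a.merge(b)
print(a.estimate("x"))  # 3 (estimates may overcount, never undercount)

Because collisions only ever add to the counters, estimates can overcount but never undercount the true frequency, and merging two sketches built with identical dimensions and hash functions produces the sketch of the union of the two streams.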
Over the last decades, many summarization sketches have been proposed, ranging from simple membership techniques, such as Bloom filters [22], and counting strategies to more advanced methods such as CM-Sketch [30] and ADA-Sketches [112]. Bloom filters are probabilistic data structures used to test whether an element is a member of a set, using hash functions.
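As an illustration of the membership test just described, the following is a minimal Bloom filter sketch (again ours, not code from the cited works), using a fixed-size bit array and a handful of hash functions; queries may return false positives but never false negatives.

import hashlib

class BloomFilter:
    """Approximate set membership with a fixed-size bit array and k hash functions."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # One bit position per hash function, derived from a seed-specific digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # True means "possibly in the set"; False means "definitely not in the set".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
for user in ["alice", "bob"]:
    bf.add(user)
print(bf.might_contain("alice"))  # True
print(bf.might_contain("carol"))  # False (with high probability)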