Mining Internet of Things (IoT) Big Data Streams

Albert Bifet LTCI, CNRS, Tel´ ecom´ ParisTech Universite´ Paris-Saclay 75634 Paris Cedex 13, FRANCE [email protected]

Abstract of Big Data infrastructures. How to do this ac- curately in real time is the main challenge for IoT Big Data and the Internet of Things (IoT) analytics systems in the near future. have the potential to fundamentally shift In the IoT data stream model, data arrives at the way we interact with our surroundings. high speed, and algorithms that process it must The challenge of deriving insights from do so under very strict constraints of space and the Internet of Things (IoT) has been rec- time. Consequently, data streams pose several ognized as one of the most exciting and challenges for algorithm design. First, key opportunities for both academia and algorithms must work within limited resources industry. Advanced analysis of big data (time and memory). Second, they must deal with streams from sensors and devices is bound data whose nature or distribution changes over to become a key area of data mining re- time. We need to deal with resources in an effi- search as the number of applications re- cient and low-cost way. In , we quiring such processing increases. Deal- are interested in three main dimensions: ing with the evolution over time of such data streams, i.e., with concepts that drift accuracy • or change completely, is one of the core amount of space (computer memory) neces- issues in stream mining. Dealing with • sary this setting, MOA is a software frame- work with classification, regression, and time required to learn from training examples • frequent pattern methods, and the new and to predict APACHE SAMOA is a distributed stream- ing software for mining IoT data streams. These dimensions are typically interdependent: adjusting the time and space used by an algorithm 1 Introduction can influence its accuracy. By storing more pre- computed information, such as look up tables, an The Internet of Things (IoT), the large network of algorithm can run faster at the expense of space. physical devices that extends beyond the typical An algorithm can also run faster by processing less computer networks, will be creating a huge quan- information, either by stopping early or storing tity of Big Data streams in real time in the next less, thus having less data to process. The more future. The realization of IoT depends on being time an algorithm has, the more likely it is that ac- able to gain the insights hidden in the vast and curacy can be increased. growing seas of data available. Since current ap- IoT data streams are closely related to Big Data. proaches don’t scale to Internet of Things (IoT) Big Data is a new term used to identify the datasets volumes, new systems with novel mining tech- that due to their large size, we can not manage niques are necessary due to the velocity, but also them with the typical data mining software tools. variety, and variability, of such data. Instead of defining “Big Data” as datasets of a con- This IoT setting is challenging, and needs algo- crete large size, for example in the order of mag- rithms that use an extremely small amount (iota) nitude of petabytes, the definition is related to the of time and memory resources, and that are able to fact that the dataset is too big to be managed with- adapt to changes and not to stop learning. These out using new algorithms or technologies. There algorithms should be distributed and run on top is need for new algorithms, and new tools to deal

15 with all of this data. Doug Laney (Laney, 2001) 3.1 High Level Architecture was the first to mention the 3 V’s of Big Data man- We identify three types of APACHE SAMOA users: agement: Volume, Variety and Velocity. 1. Platform users, who use available ML algo- 2 MOA rithms without implementing new ones.

Massive Online Analysis (MOA) (Bifet et al., 2. ML developers, who develop new ML algo- 2010) is a software environment for implement- rithms on top of APACHE SAMOA and want ing algorithms and running experiments for on- to be isolated from changes in the underlying line learning from evolving data streams. MOA SPEs. includes a collection of offline and online meth- ods as well as tools for evaluation. In particular, 3. Platform developers, who extend APACHE it implements boosting, bagging, and Hoeffding SAMOA to integrate more DSPEs into Trees, all with and without Na¨ıve Bayes classi- APACHE SAMOA. fiers at the leaves. Also it implements regression, and frequent pattern methods. MOA supports bi- 4 Conclusions directional interaction with , the Waikato Mining Internet of Things (IoT) Big Data streams Environment for Knowledge Analysis, and is re- is a challenging task, that needs new tools to per- leased under the GNU GPL license. form the most common algo- rithms such as classification, clustering, and re- 3APACHE SAMOA gression. APACHE SAMOA (SCALABLE ADVANCED MAS- APACHE SAMOA is is a platform for min- SIVE ONLINE ANALYSIS) is a platform for min- ing big data streams, and it is already avail- ing big data streams (Morales and Bifet, 2015). As able and can be found online at http://www. most of the rest of the big data ecosystem, it is samoa-project.net. The website includes a written in Java. wiki, an API reference, and a developer’s manual. APACHE SAMOA is both a framework and a li- Several examples of how the software can be used brary. As a framework, it allows the algorithm de- are also available. veloper to abstract from the underlying execution Acknowledgments engine, and therefore reuse their code on differ- ent engines. It features a pluggable architecture The presented work has been done in collaboration that allows it to run on several distributed stream with Gianmarco De Francisci Morales, Nicolas processing engines such as Storm, S4, and Samza. Kourtellis, Bernhard Pfahringer, Geoff Holmes, This capability is achieved by designing a minimal Richard Kirkby, and all the contributors to MOA API that captures the essence of modern DSPEs. and APACHE SAMOA. This API also allows to easily write new bindings to port APACHE SAMOA to new execution engines. APACHE SAMOA takes care of hiding the differ- References ences of the underlying DSPEs in terms of API Gianmarco De Francisci Morales and Albert Bifet. and deployment. 2015. SAMOA: Scalable Advanced Massive Online Analysis. Journal of Machine Learning Research, As a library, APACHE SAMOA contains imple- 16:149–153. mentations of state-of-the-art algorithms for dis- tributed machine learning on streams. For classi- Albert Bifet, Geoff Holmes, Richard Kirkby, and Bern- fication, APACHE SAMOA provides a Vertical Ho- hard Pfahringer. 2010. MOA: Massive Online Analysis. Journal of Machine Learning Research, effding Tree (VHT), a distributed streaming ver- 11:1601–1604, August. sion of a decision tree. For clustering, it includes an algorithm based on CluStream. For regression, Doug Laney. 2001. 3-D Data Management: Con- HAMR, a distributed implementation of Adaptive trolling Data Volume, Velocity and Variety. META Group Research Note, February 6. Model Rules. The library also includes meta- algorithms such as bagging and boosting. The platform is intended to be useful for both research and real world deployments.

16