Automated Machine Learning Techniques for Data Streams Alexandru-Ionut Imbrea University of Twente P.O

Total Page:16

File Type:pdf, Size:1020Kb

Automated Machine Learning Techniques for Data Streams Alexandru-Ionut Imbrea University of Twente P.O Automated Machine Learning Techniques for Data Streams Alexandru-Ionut Imbrea University of Twente P.O. Box 217, 7500AE Enschede The Netherlands [email protected] ABSTRACT ative tasks that involve fine-tuning, which is usually per- Automated Machine Learning (AutoML) techniques ben- formed by data scientists in a trial-and-error process until efitted from tremendous research progress recently. These the desired performance is achieved. The ever-growing developments and the continuous-growing demand for ma- number of machine learning algorithms and hyperparam- chine learning experts led to the development of numerous eters leads to an increase in the number of configurations AutoML tools. Industry applications of machine learning which makes data scientists' job more laborious than ever. on streaming data become more popular due to the in- Considering the above reason and due to the lack of ex- creasing adoption of real-time streaming in IoT, microser- perts required in the industry, the field of Automated vices architectures, web analytics, and other fields. How- Machine Learning (AutoML) benefited from considerable ever, the AutoML tools assume that the entire training advances recently [7]. AutoML tools and techniques en- dataset is available upfront and that the underlying data able non-experts to achieve satisfactory results and ex- distribution does not change over time. These assump- perts to automate and optimize their tasks. Although the tions do not hold in a data-stream-mining setting where field of AutoML is relatively new, it consists of multiple an unbounded stream of data cannot be stored and is likely other sub-fields such as Automated Data Cleaning (Auto to manifest concept drift. This research surveys the state- Clean), Automated Feature Engineering (Auto FE), Hy- of-the-art open-source AutoML tools, applies them to real perparameter Optimization (HPO), Neural Architecture and synthetic streamed data, and measures how their per- Search (NAS), and Meta-Learning [36]. The sub-field of formance changes over time. For comparative purposes, Meta-Learning is concerned with solving the algorithm se- batch, batch incremental and instance incremental estima- lection problem [23] applied to ML by selecting the algo- tors are applied and compared. Moreover, a meta-learning rithm that provides the best predictive performance for a technique for online algorithm selection based on meta- data set. HPO techniques are used to determine the opti- feature extraction is proposed and compared, while model mal set of hyperparameters for a learning algorithm, while replacement and continual AutoML techniques are dis- Auto FE aims to extract and select features automatically. cussed. The results show that off-the-shelf AutoML tools NAS represents the process of automating architecture en- can provide satisfactory results but in the presence of con- gineering for artificial neural networks which already out- cept drift, detection or adaptation techniques have to be performed manually designed ones in specific tasks such as applied to maintain the predictive accuracy over time. image classification [8]. The overarching goal of the field is to develop tools that can produce end-to-end machine Keywords learning pipelines with minimal intervention and knowl- edge. However, this is not yet possible using state-of-the- AutoML, AutoFE, Hyperparameter Optimization, Online art open-source tools, thus when talking about AutoML Learning, Meta-Learning, Data Stream Mining throughout the paper it means solving the combined algo- rithm selection and hyperparameter optimization problem, 1. INTRODUCTION or CASH for short, as defined by Thornton et al. in [28]. Developing machine learning models that provide constant When other techniques from the general field of AutoML, high predictive accuracy is a difficult task that usually re- such as meta-learning, are needed, it is specifically men- quires the expertise of a data scientist. Data scientists tioned. are multidisciplinary individuals possessing skills from the Regardless of their increasing popularity and advances, intersection of mathematics, computer science, and busi- current AutoML tools lack applicability in data stream ness/domain knowledge. Their job mainly consists of per- mining. Mainly due to the searching and optimization forming a workflow that includes, among others, the fol- techniques these tools use internally, as described in Sec- lowing steps: data gathering, data cleaning, feature ex- tion 5, they have to make a series of assumptions [19]. traction, algorithm selection, and hyperparameter opti- First, the entire training data has to be available at the mization. The last three steps of this workflow are iter- beginning and during the training process. Second, it is assumed that the data used for prediction follows the same Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies distribution as the training data and it does not change are not made or distributed for profit or commercial advantage and that over time. However, streaming data may not follow any of copies bear this notice and the full citation on the first page. To copy oth- these assumptions: it is an unbounded stream of data that erwise, or republish, to post on servers or to redistribute to lists, requires cannot be stored entirely in memory and that can change prior specific permission and/or a fee. its underlying distribution over time. The problem of ap- nd st 32 Twente Student Conference on IT Jan. 31 , 2019, Enschede, The plying AutoML to data streams becomes more relevant if Netherlands. considering that the popularity of microservices and event- Copyright 2019, University of Twente, Faculty of Electrical Engineer- ing, Mathematics and Computer Science. driven architectures is also increasing considerably [24]. In 1 these types of systems, streams of data are constantly gen- 3. PROBLEM FORMULATION erated from various sources including sensor data, network The formal definition of the AutoML problem as stated traffic, and user interaction events. The growing amount by Feurer et al. in [9] is the following: of streamed data pushed the development of new technolo- Definition 1 (AutoML): gies and architectures that can accommodate big amounts For i = 1; : : : ; n + m, n; m 2 +, let x 2 d denote of streaming data in scalable and distributed ways (e.g. N i R a feature vector of d dimensions and y 2 Y the corre- Apache Kafka [18]). Nevertheless, despite their relevance, i sponding target value. Given a training dataset D = AutoML tools and techniques lack integration with such train (x ; y );:::; (x ; y ) and the feature vectors x ; :::; x big data streaming technologies. 1 1 n n n+1 n+m of a test dataset Dtest = (xn+1; yn+1);:::; (xn+m; yn+m) The workaround solution adopted by the industry con- drawn from the same underlying data distribution, as well sists, in general, of storing the data stream events in a as a resource budget b and a loss metric L(·; ·), the Au- distributed file system, perform AutoML on that data in toML problem is to (automatically) produce test set pre- batch, serialize the model, and use the model for provid- dictions yn+1; :::; yn+m. The loss of a solutiony ^n+1; :::; y^n+m ing real-time predictions for the new stream events [17], to the AutoML problem is given by: [13]. This solution presents a series of shortcomings in- cluding predicting on an outdated model, expensive disk m and network IO, and the problem of not adapting to con- 1 X L(^y ; y ) (1) cept drift. In this paper, the problem of concept drift m n+j n+j affecting the predictive performance of models over time j=1 is extensively discussed and measured for models gener- When restricting the problem to a combined algorithm se- ated using AutoML tools. For comparison, batch, batch lection and hyperparameter optimization problem (CASH) incremental and online models are developed and used as as defined and formalised by Thornton et al. in [28] the a baseline. Moreover, detection and adaptation techniques definition of the problem becomes: for concept drift are assessed for both AutoML-generated and online models. Definition 2 (CASH): Given a set of algorithms A = fA(1);:::;A(k)g with asso- The main contribution of this research consists of: ciated hyperparameter domains Λ(1);:::; Λ(k) and a loss metric L(·; ·; ·), we define the CASH problem as comput- • An overview and discussion of the possible AutoML ing: techniques and tools that can be applied to data streams. k 1 1 ∗ X (j) (i) (i) • A Python library called automl-streams available A λ∗ 2 argmin L(Aλ ; Dtrain; Dvalid) (2) (j) (j) k on GitHub2 including: a way to use a Kafka stream A 2A;λ2Λ i=1 in Python ML/AutoML libraries, an evaluator for pretrained models, implementations of meta-learning algorithms. 4. BACKGROUND 4.1 AutoML Frameworks • A collection of containerized experiments3 that can To solve the AutoML problem (in its original form or as be easily extended and adapted to new algorithms a CASH problem) a configuration space containing the and frameworks. possible combinations of algorithms and hyperparameters is defined. In the case of artificial neural networks algo- • An interpretation of the experimental results. rithms, an additional dimension represented by the neural architecture is added to the search space. For searching In the following sections of the paper the main research this space, different searching strategies and techniques questions are presented, the formal problem formulations can be used [29]. For the purpose of this research one rep- are given for the concepts intuitively described above, and resentative open-source framework [12] was selected for the related work is summarised. Finally, the experimental each type of searching strategy (CASH solver). The se- methods and the results are described. lected frameworks and their searching strategy are dis- played in Table 1.
Recommended publications
  • Data Mining – Intro
    Data warehouse& Data Mining UNIT-3 Syllabus • UNIT 3 • Classification: Introduction, decision tree, tree induction algorithm – split algorithm based on information theory, split algorithm based on Gini index; naïve Bayes method; estimating predictive accuracy of classification method; classification software, software for association rule mining; case study; KDD Insurance Risk Assessment What is Data Mining? • Data Mining is: (1) The efficient discovery of previously unknown, valid, potentially useful, understandable patterns in large datasets (2) Data mining is the analysis step of the "knowledge discovery in databases" process, or KDD (2) The analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner Knowledge Discovery Examples of Large Datasets • Government: IRS, NGA, … • Large corporations • WALMART: 20M transactions per day • MOBIL: 100 TB geological databases • AT&T 300 M calls per day • Credit card companies • Scientific • NASA, EOS project: 50 GB per hour • Environmental datasets KDD The Knowledge Discovery in Databases (KDD) process is commonly defined with the stages: (1) Selection (2) Pre-processing (3) Transformation (4) Data Mining (5) Interpretation/Evaluation Data Mining Methods 1. Decision Tree Classifiers: Used for modeling, classification 2. Association Rules: Used to find associations between sets of attributes 3. Sequential patterns: Used to find temporal associations in time series 4. Hierarchical
    [Show full text]
  • Concept Drift Adaptation Techniques in Distributed Environment for Real-World Data Streams
    smart cities Article Concept Drift Adaptation Techniques in Distributed Environment for Real-World Data Streams Hassan Mehmood 1,* , Panos Kostakos 1, Marta Cortes 1, Theodoros Anagnostopoulos 2 , Susanna Pirttikangas 1 and Ekaterina Gilman 1 1 Center for Ubiquitous Computing, University of Oulu, Pentti Kaiteran katu 1, 90570 Oulu, Finland; Panos.Kostakos@oulu.fi (P.K.); Marta.Cortes@oulu.fi (M.C.); Susanna.Pirttikangas@oulu.fi (S.P.); Ekaterina.Gilman@oulu.fi (E.G.) 2 DigiT.DSS.Lab, Department of Business Administration, University of West Attica, P. Ralli & Thivon 250, Aigaleo, 122 44 Athens, Greece; [email protected] * Correspondence: Hassan.Mehmood@oulu.fi Abstract: Real-world data streams pose a unique challenge to the implementation of machine learning (ML) models and data analysis. A notable problem that has been introduced by the growth of Internet of Things (IoT) deployments across the smart city ecosystem is that the statistical properties of data streams can change over time, resulting in poor prediction performance and ineffective decisions. While concept drift detection methods aim to patch this problem, emerging communication and sensing technologies are generating a massive amount of data, requiring distributed environments to perform computation tasks across smart city administrative domains. In this article, we implement and test a number of state-of-the-art active concept drift detection algorithms for time series analysis within a distributed environment. We use real-world data streams and provide critical analysis of results retrieved. The challenges of implementing concept drift adaptation algorithms, along with their applications in smart cities, are also discussed. Citation: Mehmood, H.; Kostakos, P.; Keywords: Cortes, M.; Anagnostopoulos, T.; concept drift; machine learning; smart cities; edge computing; time series analysis; Pirttikangas, S.; Gilman, E.
    [Show full text]
  • Adaptation Strategies for Automated Machine Learning on Evolving Data
    1 Adaptation Strategies for Automated Machine Learning on Evolving Data Bilge Celik and Joaquin Vanschoren Abstract—Automated Machine Learning (AutoML) systems have been shown to efficiently build good models for new datasets. However, it is often not clear how well they can adapt when the data evolves over time. The main goal of this study is to understand the effect of concept drift on the performance of AutoML methods, and which adaptation strategies can be employed to make them more robust to changes in the underlying data. To that end, we propose 6 concept drift adaptation strategies and evaluate their effectiveness on a variety of AutoML approaches for building machine learning pipelines, including Bayesian optimization, genetic programming, and random search with automated stacking. These are evaluated empirically on real-world and synthetic data streams with different types of concept drift. Based on this analysis, we propose ways to develop more sophisticated and robust AutoML techniques. Index Terms—AutoML, data streams, concept drift, adaptation strategies F 1 INTRODUCTION HE field of automated machine learning (AutoML) aims concept drift. We propose six different adaptation strategies T to automatically design and build machine learning to cope with concept drift and implement these in open- systems, replacing manual trial-and-error with systematic, source AutoML libraries. We find that these can indeed be data-driven decision making [43]. This makes robust, state- effectively used, paving the way towards novel AutoML of-the-art machine learning accessible to a much broader techniques that are robust against evolving data. range of practitioners and domain scientists. The online learning setting under consideration is one Although AutoML has been shown to be very effective where data can be buffered in batches using a moving win- and even rival human machine learning experts [13], [24], dow.
    [Show full text]
  • Anytime Algorithms for Stream Data Mining
    Anytime Algorithms for Stream Data Mining Von der Fakultat¨ fur¨ Mathematik, Informatik und Naturwissenschaften der RWTH Aachen University zur Erlangung des akademischen Grades eines Doktors der Naturwissenschaften genehmigte Dissertation vorgelegt von Diplom-Informatiker Philipp Kranen aus Willich, Deutschland Berichter: Universitatsprofessor¨ Dr. rer. nat. Thomas Seidl Visiting Professor Michael E. Houle, PhD Tag der mundlichen¨ Prufung:¨ 14.09.2011 Diese Dissertation ist auf den Internetseiten der Hochschulbibliothek online verfugbar.¨ Contents Abstract / Zusammenfassung1 I Introduction5 1 The Need for Anytime Algorithms7 1.1 Thesis structure......................... 16 2 Knowledge Discovery from Data 17 2.1 The KDD process and data mining tasks ........... 17 2.2 Classification .......................... 25 2.3 Clustering............................ 36 3 Stream Data Mining 43 3.1 General Tools and Techniques................. 43 3.2 Stream Classification...................... 52 3.3 Stream Clustering........................ 59 II Anytime Stream Classification 69 4 The Bayes Tree 71 4.1 Introduction and Preliminaries................. 72 4.2 Indexing density models.................... 76 4.3 Experiments........................... 87 4.4 Conclusion............................ 98 i ii CONTENTS 5 The MC-Tree 99 5.1 Combining Multiple Classes.................. 100 5.2 Experiments........................... 111 5.3 Conclusion............................ 116 6 Bulk Loading the Bayes Tree 117 6.1 Bulk loading mixture densities . 117 6.2 Experiments..........................
    [Show full text]
  • Improving Iot Data Stream Analytics Using Summarization Techniques Maroua Bahri
    Improving IoT data stream analytics using summarization techniques Maroua Bahri To cite this version: Maroua Bahri. Improving IoT data stream analytics using summarization techniques. Machine Learn- ing [cs.LG]. Institut Polytechnique de Paris, 2020. English. NNT : 2020IPPAT017. tel-02865982 HAL Id: tel-02865982 https://tel.archives-ouvertes.fr/tel-02865982 Submitted on 12 Jun 2020 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Improving IoT Data Stream Analytics Using Summarization Techniques These` de doctorat de l’Institut Polytechnique de Paris prepar´ ee´ a` Tel´ ecom´ Paris Ecole´ doctorale n◦626 Denomination´ (Sigle) Specialit´ e´ de doctorat : Informatique NNT : 2020IPPAT017 These` present´ ee´ et soutenue a` Palaiseau, le 5 juin 2020, par MAROUA BAHRI Composition du Jury : Albert Bifet Professor, Tel´ ecom´ Paris Co-directeur de these` Silviu Maniu Associate Professor, Universite´ Paris-Sud Co-directeur de these` Joao˜ Gama Professor, University of Porto President´ Cedric´ Gouy-Pailler Engineer-Researcher, CEA-LIST Examinateur Ons Jelassi
    [Show full text]
  • Frequent Item Set Mining Using INC MINE in Massive Online Analysis Frame Work
    Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 45 ( 2015 ) 133 – 142 International Conference on Advanced Computing Technologies and Applications (ICACTA- 2015) Frequent Item set Mining using INC_MINE in Massive Online Analysis Frame work Prof.Dr.P.K.Srimania, Mrs. Malini M. Patilb* aFormer Chairman and Director, R & D, Bangalore University, Karnataka, India bAssistant Professor , Dept of ISE , J.S.S. Academy of Technical Education, Bangalore-560060, Karnataka, India Research Scholar, Bharthiar University, Coimbatore, Tamilnadu Abstract Frequent Pattern Mining is one of the major data mining techniques, which is exhaustively studied in the past decade. The technological advancements have resulted in huge data generation, having increased rate of data distribution. The generated data is called as a 'data stream'. Data streams can be mined only by using sophisticated techniques. The paper aims at carrying out frequent pattern mining on data streams. Stream mining has great challenges due to high memory usage and computational costs. Massive online analysis frame work is a software environment used to perform frequent pattern mining using INC_MINE algorithm. The algorithm uses the method of closed frequent mining. The data sets used in the analysis are Electricity data set and Airline data set. The authors also generated their own data set, OUR-GENERATOR for the purpose of analysis and the results are found interesting. In the experiments five samples of instance sizes (10000, 15000, 25000, 35000, 50000) are used with varying minimum support and window sizes for determining frequent closed itemsets and semi frequent closed itemsets respectively. The present work establishes that association rule mining could be performed even in the case of data stream mining by INC_MINE algorithm by generating closed frequent itemsets which is first of its kind in the literature.
    [Show full text]
  • Data Stream Clustering Techniques, Applications, and Models: Comparative Analysis and Discussion
    big data and cognitive computing Review Data Stream Clustering Techniques, Applications, and Models: Comparative Analysis and Discussion Umesh Kokate 1,*, Arvind Deshpande 1, Parikshit Mahalle 1 and Pramod Patil 2 1 Department of Computer Engineering, SKNCoE, Vadgaon, SPPU, Pune 411 007 India; [email protected] (A.D.); [email protected] (P.M.) 2 Department of Computer Engineering, D.Y. Patil CoE, Pimpri, SPPU, Pune 411 007 India; [email protected] * Correspondence: [email protected]; Tel.: +91-989-023-9995 Received: 16 July 2018; Accepted: 10 October 2018; Published: 17 October 2018 Abstract: Data growth in today’s world is exponential, many applications generate huge amount of data streams at very high speed such as smart grids, sensor networks, video surveillance, financial systems, medical science data, web click streams, network data, etc. In the case of traditional data mining, the data set is generally static in nature and available many times for processing and analysis. However, data stream mining has to satisfy constraints related to real-time response, bounded and limited memory, single-pass, and concept-drift detection. The main problem is identifying the hidden pattern and knowledge for understanding the context for identifying trends from continuous data streams. In this paper, various data stream methods and algorithms are reviewed and evaluated on standard synthetic data streams and real-life data streams. Density-micro clustering and density-grid-based clustering algorithms are discussed and comparative analysis in terms of various internal and external clustering evaluation methods is performed. It was observed that a single algorithm cannot satisfy all the performance measures.
    [Show full text]
  • Performance Analysis of Hoeffding Trees in Data Streams by Using Massive Online Analysis Framework
    View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by ePrints@Bangalore University Int. J. Data Mining, Modelling and Management, Vol. 7, No. 4, 2015 293 Performance analysis of Hoeffding trees in data streams by using massive online analysis framework P.K. Srimani R & D Division, Bangalore University Jnana Bharathi, Mysore Road, Bangalore-560056, Karnataka, India Email: [email protected] Malini M. Patil* Department of Information Science and Engineering, J.S.S. Academy of Technical Education, Uttaralli-Kengeri Main Road, Mylasandra, Bangalore-560060, Karnataka, India Email: [email protected] *Corresponding author Abstract: Present work is mainly concerned with the understanding of the problem of classification from the data stream perspective on evolving streams using massive online analysis framework with regard to different Hoeffding trees. Advancement of the technology both in the area of hardware and software has led to the rapid storage of data in huge volumes. Such data is referred to as a data stream. Traditional data mining methods are not capable of handling data streams because of the ubiquitous nature of data streams. The challenging task is how to store, analyse and visualise such large volumes of data. Massive data mining is a solution for these challenges. In the present analysis five different Hoeffding trees are used on the available eight dataset generators of massive online analysis framework and the results predict that stagger generator happens to be the best performer for different classifiers. Keywords: data mining; data streams; static streams; evolving streams; Hoeffding trees; classification; supervised learning; massive online analysis; MOA; framework; massive data mining; MDM; dataset generators.
    [Show full text]
  • Massive Online Analysis, a Framework for Stream Classification and Clustering
    MOA: Massive Online Analysis, a Framework for Stream Classification and Clustering. Albert Bifet1, Geoff Holmes1, Bernhard Pfahringer1, Philipp Kranen2, Hardy Kremer2, Timm Jansen2, and Thomas Seidl2 1 Department of Computer Science, University of Waikato, Hamilton, New Zealand fabifet, geoff, [email protected] 2 Data Management and Exploration Group, RWTH Aachen University, Germany fkranen, kremer, jansen, [email protected] Abstract. In today's applications, massive, evolving data streams are ubiquitous. Massive Online Analysis (MOA) is a software environment for implementing algorithms and running experiments for online learn- ing from evolving data streams. MOA is designed to deal with the chal- lenging problems of scaling up the implementation of state of the art algorithms to real world dataset sizes and of making algorithms compa- rable in benchmark streaming settings. It contains a collection of offline and online algorithms for both classification and clustering as well as tools for evaluation. Researchers benefit from MOA by getting insights into workings and problems of different approaches, practitioners can easily compare several algorithms and apply them to real world data sets and settings. MOA supports bi-directional interaction with WEKA, the Waikato Environment for Knowledge Analysis, and is released under the GNU GPL license. Besides providing algorithms and measures for evaluation and comparison, MOA is easily extensible with new contri- butions and allows the creation of benchmark scenarios through storing and sharing setting files. 1 Introduction Nowadays data is generated at an increasing rate from sensor applications, mea- surements in network monitoring and traffic management, log records or click- streams in web exploring, manufacturing processes, call detail records, email, blogging, twitter posts and others.
    [Show full text]
  • Adaptive Learning and Mining for Data Streams and Frequent Patterns
    Adaptive Learning and Mining for Data Streams and Frequent Patterns Doctoral Thesis presented to the Departament de Llenguatges i Sistemes Informatics` Universitat Politecnica` de Catalunya by Albert Bifet April 2009 Revised version with minor revisions. Advisors: Ricard Gavalda` and Jose´ L. Balcazar´ Abstract This thesis is devoted to the design of data mining algorithms for evolving data streams and for the extraction of closed frequent trees. First, we deal with each of these tasks separately, and then we deal with them together, developing classification methods for data streams containing items that are trees. In the data stream model, data arrive at high speed, and the algorithms that must process them have very strict constraints of space and time. In the first part of this thesis we propose and illustrate a framework for devel- oping algorithms that can adaptively learn from data streams that change over time. Our methods are based on using change detectors and estima- tor modules at the right places. We propose an adaptive sliding window algorithm ADWIN for detecting change and keeping updated statistics from a data stream, and use it as a black-box in place or counters or accumula- tors in algorithms initially not designed for drifting data. Since ADWIN has rigorous performance guarantees, this opens the possibility of extending such guarantees to learning and mining algorithms. We test our method- ology with several learning methods as Na¨ıve Bayes, clustering, decision trees and ensemble methods. We build an experimental framework for data stream mining with concept drift, based on the MOA framework, similar to WEKA, so that it will be easy for researchers to run experimental data stream benchmarks.
    [Show full text]
  • Online Learning for Big Data Analytics
    Online Learning for Big Data Analytics Irwin King, Michael R. Lyu and Haiqin Yang Department of Computer Science & Engineering The Chinese University of Hong Kong Tutorial presentation at IEEE Big Data, Santa Clara, CA, 2013 1 Outline • Introduction (60 min.) – Big data and big data analytics (30 min.) – Online learning and its applications (30 min.) • Online Learning Algorithms (60 min.) – Perceptron (10 min.) – Online non-sparse learning (10 min.) – Online sparse learning (20 min.) – Online unsupervised learning (20. min.) • Discussions + Q & A (5 min.) 2 Outline • Introduction (60 min.) – Big data and big data analytics (30 min.) – Online learning and its applications (30 min.) • Online Learning Algorithms (60 min.) – Perceptron (10 min.) – Online non-sparse learning (10 min.) – Online sparse learning (20 min.) – Online unsupervised learning (20. min.) • Discussions + Q & A (5 min.) 3 What is Big Data? • There is not a consensus as to how to define Big Data “A collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.” - wikii “Big data exceeds the reach of commonly used hardware environments and software tools to capture, manage, and process it with in a tolerable elapsed time for its user population.” - Tera- data magazine article, 2011 “Big data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze.” - The McKinsey Global Institute, 2011i 4 What is Big Data? Activity: Activity: IOPS File/Object Size, Content Volume Big Data refers to datasets grow so large and complex that it is difficult to capture, store, manage, share, analyze and visualize within current computational architecture.
    [Show full text]
  • An Extensible Framework for Data Stream Clustering Research with R
    JSS Journal of Statistical Software February 2017, Volume 76, Issue 14. doi: 10.18637/jss.v076.i14 Introduction to stream: An Extensible Framework for Data Stream Clustering Research with R Michael Hahsler Matthew Bolaños John Forrest Southern Methodist University Microsoft Corporation Microsoft Corporation Abstract In recent years, data streams have become an increasingly important area of research for the computer science, database and statistics communities. Data streams are ordered and potentially unbounded sequences of data points created by a typically non-stationary data generating process. Common data mining tasks associated with data streams include clustering, classification and frequent pattern mining. New algorithms for these types of data are proposed regularly and it is important to evaluate them thoroughly under standardized conditions. In this paper we introduce stream, a research tool that includes modeling and simu- lating data streams as well as an extensible framework for implementing, interfacing and experimenting with algorithms for various data stream mining tasks. The main advantage of stream is that it seamlessly integrates with the large existing infrastructure provided by R. In addition to data handling, plotting and easy scripting capabilities, R also provides many existing algorithms and enables users to interface code written in many program- ming languages popular among data mining researchers (e.g., C/C++, Java and Python). In this paper we describe the architecture of stream and focus on its use for data stream clustering research. stream was implemented with extensibility in mind and will be ex- tended in the future to cover additional data stream mining tasks like classification and frequent pattern mining.
    [Show full text]