arXiv:1603.08767v1 [cs.DC] 29 Mar 2016

Machine Learning and Cloud Computing: Survey of Distributed and SaaS Solutions ∗

Daniel Pop
Institute e-Austria Timișoara
Bd. Vasile Pârvan No. 4, 300223 Timișoara, România
E-mail: [email protected]

∗ This manuscript was originally published as IEAT Technical Report at https://www.ieat.ro/technical-reports in 2012.

Abstract

Applying popular machine learning algorithms to large amounts of data has raised new challenges for ML practitioners. Traditional ML libraries do not support the processing of huge datasets well, so new approaches were needed. Parallelization using modern parallel computing frameworks, such as MapReduce, CUDA, or Dryad, gained in popularity and acceptance, resulting in new ML libraries developed on top of these frameworks. We will briefly introduce the most prominent industrial and academic outcomes, such as Apache Mahout™, GraphLab or Jubatus.

We will investigate how the cloud computing paradigm impacted the field of ML. The first direction is that of popular statistics tools and libraries (R system, Python) deployed in the cloud. A second line of products augments existing tools with plugins that allow users to create a Hadoop cluster in the cloud and run jobs on it. Next on the list are libraries of distributed implementations of ML algorithms, and on-premise deployments of complex systems for data analytics and data mining. The last approach on the radar of this survey is ML as Software-as-a-Service, with several BigData start-ups (and large companies as well) already opening their solutions to the market.

1 Introduction

Given the enormous growth of collected and available data in companies, industry and science, techniques for analyzing such data are becoming ever more important. Today, data to be analyzed are no longer restricted to sensor data and classical databases, but more and more include textual documents and web pages (text mining, Web mining), spatial data, multimedia data, and relational data (molecules, social networks). Analytics tools allow end-users to harvest the meaningful patterns buried in large volumes of structured and unstructured data. Analyzing big datasets gives users the power to identify new revenue sources, develop loyal and profitable customer relationships, and run their overall organization more efficiently and cost effectively.

Research in knowledge discovery and machine learning combines classical questions of computer science (efficient algorithms, software systems, databases) with elements from artificial intelligence and statistics, up to user oriented issues (visualization, interactive mining).

Although parallel database products, such as Teradata, Oracle or Netezza, have for more than two decades provided the means to realize a parallel implementation of ML-DM algorithms, expressing ML-DM algorithms in SQL code is a complex task and the result is difficult to maintain. Furthermore, large-scale installations of these products are expensive and are not an affordable option in most cases. Another driver for the paradigm shift from the relational model to other alternatives is the new nature of data. Until about five years ago, most data was transactional in nature, consisting of numeric or string data that fit easily into the rows and columns of relational databases. Since then, while structured data has followed a near-linear growth, unstructured data (e.g. audio and video) and semi-structured data (e.g. Web traffic data, social media content, sensor generated data etc.) exhibit an exponential growth (see Figure 1). Most of the new data is either semi-structured in format, i.e. it consists of headers followed by text strings, or pure unstructured data (photo, video, audio). While the latter has limited textual content and is more difficult to parse and analyze, semi-structured data triggered a plethora of non-relational data store (NoSQL data store) solutions tailored to handle huge amounts of data. Consequently, the past 5 years have seen researchers moving to parallelization of ML-DM using these new platforms, such as NoSQL datastores, distributed processing environments (MapReduce), or cloud computing.

Figure 1. Trends in data growth

At this point, it is worth reflecting on a nice metaphor by Ben Werther [18], co-founder of Platfora, for big data processing today:

    In 'industrial revolution' terms, we are in the pre-industrial era of artisanship that preceded mass production. It is the equivalent of needing to engage an expert blacksmith to forge the forks and spoons for our dinner table.

Machine Learning is inherently a time consuming task, thus plenty of effort has gone into speeding up its execution. The cloud computing paradigm and cloud providers turned out to be valuable alternatives for speeding up machine learning platforms. Thus, popular statistics environments (like R, Octave, Python) went to the cloud as well. There are two main directions to integrate them with cloud providers: create a cluster in the cloud and bootstrap it with statistics tools, or augment statistics environments with plugins that allow users to create Hadoop clusters in the cloud and run jobs on them.

Environments like R, Octave, Maple and similar offer low-level infrastructure for data analysis that can be applied to large datasets once leveraged by cloud providers. Machine Learning comes on top of this and facilitates the retrieval of useful knowledge out of huge data for customers with little or no statistical background by automatically inferring 'knowledge models' out of data. To support this need, an explosion of start-ups offering machine learning or big data analysis services to their customers, some of them still in stealth mode, can be noticed in the past 5 years. These initiatives can be either PaaS/SaaS platforms or products that can be deployed on private environments.

Reviewing the literature and the market, we can conclude that ML-DM comes in many flavors. We classify these approaches in 5 distinct classes:

• Machine Learning environments from the cloud: create a computer cluster in the cloud and bootstrap it with statistics tools. ⇒ Section 3.

• Plugins for Machine Learning tools: augment statistics tools with plugins that allow users to create a Hadoop cluster in the cloud and run ML jobs on it. ⇒ Section 4.

• Distributed Machine Learning libraries: collections of parallelized implementations of ML algorithms for distributed environments (Hadoop, Dryad etc). ⇒ Section 5.

• Complex Machine Learning systems: products that need to be installed on private data centers (or in the cloud) and offer high performance data mining and analysis. ⇒ Section 6.

• Software as a Service providers for Machine Learning: PaaS/SaaS solutions that allow clients to access ML algorithms via Web services. ⇒ Section 7.

The remainder of the paper is structured as follows: the next section presents similar, recent studies, followed by 5 sections, each of them devoted to a particular class identified above. The paper ends with conclusions and future plans.

2 Related studies

Since 1995, many implementations have been proposed for the parallelization of ML-DM algorithms on shared or distributed systems. For a comprehensive study the reader is referred to a recent survey [17]. Our work is focused on frameworks, toolkits and libraries that allow large-scale, distributed implementations of state-of-the-art ML-DM algorithms. In this respect, we mention a recent book dealing with machine learning at large [1], which contains both presentations of general frameworks for highly scalable ML implementations, like DryadLINQ or IBM PMLT, and specific implementations of ML techniques on these platforms, like ensemble decision trees, SVM, k-Means etc. It contains contributions from both industry leaders (Google, HP, IBM, Microsoft) and academia (Berkeley, NYU, University of California etc).

Recent articles, such as those of S. Charrington [3], W. Eckerson [4] and D. Harris [5], review different providers of large-scale ML solutions that are trying to offer better tools and technologies, most of them based on Hadoop infrastructure, to move forward the novel industry of big data. They aim at improving user experience, product recommendations, or website optimization, applicable to finance, telecommunications, retail, advertising or media.

3 Machine Learning environments from the cloud

Providers in this category offer computer clusters on public cloud providers, such as Amazon EC2, Rackspace etc, pre-installed with statistics software, the preferred packages being R system, Octave or Maple. These solutions offer scalable high-performance resources in the cloud to their customers, who are freed from the burden of installing and managing their own clusters.

Cloudnumbers.com uses the Amazon EC2 provider to set up computer clusters preinstalled with software for scientific computing, such as R system, Octave or Maple. […] with map/reduce, visualization, security and version control packages. Results of data analysis processes, named dashboards in Opani, can easily be visualized and shared from desktop or mobile devices.

Approaches in this class are powerful and flexible solutions, offering users the possibility to develop complex ML-DM applications that run in the cloud. Users are freed from the burden of provisioning their own distributed environments for scientific computing, while being able to use their favorite environments. On the other hand, users of these tools need extensive programming experience and strong knowledge of statistics. Perhaps due to this limited audience, the stable providers in this category are fewer than in other categories, some of them (such as CRdata.org) shutting down operations only shortly after taking off.

4 Plugins for Machine Learning tools

In this class, statistics applications (e.g. R system, Python) are extended with plugins that allow users to create a Hadoop cluster in the cloud and run time consuming jobs over large datasets on it. Most of the interest went towards R, for which several extensions are available, compared to Python, for which less effort was invested until recently in supporting distributed […]
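
The MapReduce-style parallelization of ML algorithms discussed in this survey can be made concrete with a small sketch. The following is a minimal, single-process illustration in plain Python, not code taken from any of the surveyed products: it expresses one iteration of k-means (one of the techniques the survey names) as a map phase over data chunks and a reduce phase that merges partial sums. The function names `map_phase`, `reduce_phase` and `kmeans_step` are hypothetical, and on a real Hadoop cluster each mapper would run on a separate node rather than in a loop.

```python
def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_phase(chunk, centroids):
    """Mapper (hypothetical name): for one data chunk, build a partial
    (vector sum, count) per centroid index."""
    partial = {}
    for point in chunk:
        k = nearest(point, centroids)
        vec_sum, count = partial.get(k, ([0.0] * len(point), 0))
        partial[k] = ([a + b for a, b in zip(vec_sum, point)], count + 1)
    return partial


def reduce_phase(partials):
    """Reducer (hypothetical name): merge the partial sums of all mappers."""
    merged = {}
    for partial in partials:
        for k, (vec_sum, count) in partial.items():
            m_sum, m_count = merged.get(k, ([0.0] * len(vec_sum), 0))
            merged[k] = ([a + b for a, b in zip(m_sum, vec_sum)], m_count + count)
    return merged


def kmeans_step(chunks, centroids):
    """One k-means iteration: map over chunks, reduce, recompute centroids."""
    merged = reduce_phase(map_phase(chunk, centroids) for chunk in chunks)
    # A centroid with no assigned points keeps its previous position.
    return [
        [x / merged[k][1] for x in merged[k][0]] if k in merged else list(centroids[k])
        for k in range(len(centroids))
    ]


if __name__ == "__main__":
    # Two chunks stand in for data stored on two different nodes.
    chunks = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (10.0, 12.0)]]
    print(kmeans_step(chunks, [(1.0, 1.0), (9.0, 9.0)]))
    # prints [[0.0, 1.0], [10.0, 11.0]]
```

The point of the split is that each mapper only needs its own chunk and the current centroids, and only small per-centroid sums travel to the reducer, which is what makes this kind of algorithm a natural fit for the cluster-based offerings described above.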
Recommended publications
  • NTT Technical Review, December 2012, Vol. 10, No. 12
    Feature Articles: Platform Technologies for Open Source Cloud Big Data. Jubatus in Action: Report on Realtime Big Data Analysis by Jubatus. Keitaro Horikawa, Yuzuru Kitayama, Satoshi Oda, Hiroki Kumazaki, Jungyu Han, Hiroyuki Makino, Masakuni Ishii, Koji Aoya, Min Luo, and Shohei Uchikawa. Abstract This article revisits Jubatus, a scalable distributed framework for profound realtime analysis of big data that improves availability and reduces communication overheads among servers by using parallel data processing and by loosely sharing intermediate results. After briefly reviewing Jubatus, it describes the technical challenges and goals that should be resolved and introduces our design concept, open source community activities, achievements to date, and future plans for expanding Jubatus. 1. Introduction There is little doubt that data is of great value and importance in databases, data mining, and other data-centered applications. With the recent spread of networks, attention is being focused on the large quantities of a wide variety of data being generated and transmitted as big data. This trend is accelerating [1] as a result of advances in information and communications technology (ICT), which simplifies the collection and analysis of big data. There are two major types of big data and techniques for analyzing it. (1) Stockpile type: Lumped high-speed analysis of accumulated big data (batch processing) (2) Stream type: Sequential high-speed analysis of data stream being continuously generated without accumulation (realtime processing) With case (2) in particular, the ambiguity inherent in the environment is creating a growing need to make judgments and decisions on the basis of insuf-
  • Outline of Machine Learning
    Outline of machine learning The following outline is provided as an overview of and topical guide to machine learning: Machine learning – subfield of computer science[1] (more particularly soft computing) that evolved from the study of pattern recognition and computational learning theory in artificial intelligence.[1] In 1959, Arthur Samuel defined machine learning as a "Field of study that gives computers the ability to learn without being explicitly programmed".[2] Machine learning explores the study and construction of algorithms that can learn from and make predictions on data.[3] Such algorithms operate by building a model from an example training set of input observations in order to make data-driven predictions or decisions expressed as outputs, rather than following strictly static program instructions. Contents What type of thing is machine learning? Branches of machine learning Subfields of machine learning Cross-disciplinary fields involving machine learning Applications of machine learning Machine learning hardware Machine learning tools Machine learning frameworks Machine learning libraries Machine learning algorithms Machine learning methods Dimensionality reduction Ensemble learning Meta learning Reinforcement learning Supervised learning Unsupervised learning Semi-supervised learning Deep learning Other machine learning methods and problems Machine learning research History of machine learning Machine learning projects Machine learning organizations Machine learning conferences and workshops Machine learning publications
  • A Guide in the Big Data Jungle Thesis, Bachelor of Science
    A guide in the Big Data jungle. Thesis, Bachelor of Science. Anna Ohlsson, Dan Öman. Faculty of Computing, Blekinge Institute of Technology, SE-371 79 Karlskrona, Sweden. Contact Information: Author(s): Anna Ohlsson, E-mail: [email protected]; Dan Öman, E-mail: [email protected]. University advisor: Nina Dzamashvili Fogelström, Department of Software Engineering. Internet: www.bth.se/com, Phone: +46 0455 38 50 00, Fax: +46 0455 38 50 57. Abstract This bachelor thesis looks at the functionality of different frameworks for data analysis at large scale, and its purpose is to serve as a guide among available tools. The amount of data that is generated every day keeps growing, and for companies to take advantage of the data they collect they need to know how to analyze it to gain maximal use out of it. The choice of platform for this analysis plays an important role and you need to look into the functionality of the different alternatives that are available. We have created a guide to make this research easier and less time consuming. To evaluate our work we created a summary and a survey which we asked a number of IT-students, current and previous, to take part in. After analyzing their answers we could see that most of them find our thesis interesting and informative. Content: Introduction; 1.1 Overview; Background and related work; 2.1 Background; Figure 1 Overview of Platforms and Frameworks; 2.1.1 What is Big Data?; 2.1.2 Platforms and frameworks
  • Sensorbee Documentation Release 0.4
    SensorBee Documentation, Release 0.4. Preferred Networks, Inc. Jun 20, 2017. Contents: I Preface; II Tutorial; III The BQL Language; IV Server Programming; V Reference; VI Indices and Tables. Part I Preface. This is the official documentation of SensorBee. It describes all the functionality that the current version of SensorBee officially supports. This document is structured as follows: • Preface, this part, provides general information of SensorBee. • Part I is an introduction for new users through some tutorials. • Part II documents the syntax and specification of the BQL language. • Part III describes information for advanced users about extensibility capabilities of the server. • Reference contains reference information about BQL statements, built-in components, and client programs. CHAPTER 1 What is SensorBee? SensorBee is an open source, lightweight, stateful streaming data processing engine for the Internet of Things (IoT). SensorBee is designed to be used for streaming ETL (Extract/Transform/Load) at the edge of the network including Fog Computing. In ETL operations, SensorBee mainly focuses on data transformation and data enrichment, especially using machine learning. SensorBee is very small (stand-alone executable file size < 30MB) and runs on small computers such as Raspberry Pi. The processing flow in SensorBee is written in BQL, a dialect of CQL (Continuous Query Language), which is similar to SQL but extended for streaming data processing. Its internal data structure (tuple) is compatible to JSON documents rather than rows in RDBMSs.
  • Data Analysis from Scratch with Python: the Complete Beginner's
    Data Analysis from Scratch with Python The Complete Beginner's Guide for Machine Learning Techniques and A Step By Step NLP using Python Guide To Expert (Including Programming Interview Questions) STEPHEN RICHARD © Copyright 2019 by Stephen Richard All rights reserved. No a part of this book may be reproduced or transmitted in any kind or by any suggests that electronic or mechanical, as well as photocopying, Recording or by any data storage and TABLE OF CONTENTS INTRODUCTION CHAPTER 1 Method and Technique When to use each method and technique Machine learning methods Machine Learning Techniques Intel NLP Architect: Open Source Natural Language Processing Model Library CHAPTER 2 How To Solve NLP Tasks: A Walkthrough On Natural Language Processing A look at the importance of Natural Language Processing NLP: How to Become a Natural Language Processing Specialist CHAPTER 3 Programming Interview Questions CHAPTER 4 What is data analysis and why is it important? Software For The Application Of Automatic Learning Techniques For Medical Diagnosis CHAPTER 5 What Is Machine Learning And What Is It Contributing To Cognitive Neuroscience? Machine learning techniques at the service of Energy Efficiency in the digital home CHAPTER 6 5 natural language processing methods that are rapidly changing the world around us CHAPTER 8 How to develop chatbot from scratch in Python: a detailed instruction SUMMARY AND CONCLUSION INTRODUCTION The data analysis techniques have substantial support in machine learning for the generation of knowledge in the organization. Machine learning exploits statistics and many other areas of mathematics. Its advantage is speed. Although it would be unfair to consider that the strength of this type of techniques is only speed, it must be taken into a report that the speed that characterizes this artificial intelligence modality comes from the processing power, both sequential and parallel, and in-memory storage capacity.
  • Machine Learning for Computer Security Fourteenforty Research Institute, Inc
    FFRI,Inc. Machine learning for computer security. Fourteenforty Research Institute, Inc. Junichi Murakami, Executive Officer, Director of Advanced Development Division. http://www.ffri.jp Agenda 1. Introduction 2. Machine learning basics – What is “Machine learning” – Background and circumstances – Type of machine learning – Implementations 3. Malware detection based on machine learning – Overview – Datasets – Cuckoo Sandbox – Jubatus – Evaluations – Possibility for another applications 4. Conclusions 5. References. Introduction • First this slides describes a basis of machine learning then introduces malware detection based on machine learning • The Author is a security researcher, but is not an expert of machine learning • Currently the malware detection is experimental. What is “Machine learning” • By training a computer, letting the computer estimate and predict something based on the experience • Based on artificial intelligence research [Figure: a computer is trained, then evaluates an ambiguous input, “B or 13 ?”] Background and circumstances • Recently, big data analysis attracts attention – E-Commerce, Online games, BI (Business Intelligence), etc. – The demand would probably increase more through spreading the M2M • Machine learning as method for big data analysis – Analyze data and estimate the future – Penetration rate is different by each industry – In IT security, not enough used yet [Figure: Collection, Analysis, Estimation] Type of machine learning • “Machine learning” is general word which contains various themes and methods • Roughly it can be classified as shown below • Currently online learning is minor than the other [Figure: taxonomy of ML into batch learning and online learning, each split into supervised and unsupervised; tasks shown: classification, regression, recommendation, anomaly detection, clustering]
  • “Scalable Distributed Online Machine Learning Framework for Realtime Analysis of Big Data”
    “Scalable Distributed Online Machine Learning Framework for Realtime Analysis of Big Data” Hiroyuki MAKINO, NTT Software Innovation Center. XLDB2012, Sep. 12, 2012. © 2012 NTT Software Innovation Center. Objective of Jubatus • To satisfy “Scalable”, “Realtime”, and “Profound analysis” for Big Data. [Figure labels: Profound Analysis, SVMlight, Scalability; for variety, for volume, for velocity] Use Case: Social stream analysis. Realtime twitter analysis with Jubatus: more than 8000 tweets/sec from Twitter, classified into 1600 companies, analysis result. • Social stream analysis for marketing research • We want to know reputation or mood from their voice. • Too hard to go over each tweet • We need machine learning to classify tweets automatically in realtime. Demonstration: See you in the poster session. Performance. [Figure: the number of servers and throughput (features/sec., in millions, for 0 to 18 servers): 6700 tweets/sec. with 1 server, 100 thousand tweets/sec. with 16 servers] [Figure: accuracy and learning time: 90% accuracy in 1 sec., compared with the accuracy of batch learning] * 200,000 features / approx. 30 features (per tweet) = 6,700 tweets/sec. * SPAM classification task (Dataset: LIBSVM webspam) Throughput and linear scalability. Classification Tests: Classifying tweets into 1600 companies automatically - Throughput: Classifies all the Japanese Accuracy - Achieving the accuracy as good as batch processing based learning. - 90% accuracy
  • Comprehensive Analysis of Data Mining Tools S
    World Academy of Science, Engineering and Technology, International Journal of Computer and Information Engineering, Vol:9, No:3, 2015. Comprehensive Analysis of Data Mining Tools. S. Sarumathi, N. Shanthi. Abstract—Due to the fast and flawless technological innovation there is a tremendous amount of data dumping all over the world in every domain such as Pattern Recognition, Machine Learning, Spatial Data Mining, Image Analysis, Fraudulent Analysis, World Wide Web etc. This issue turns to be more essential for developing several tools for data mining functionalities. The major aim of this paper is to analyze various tools which are used to build a resourceful analytical or descriptive model for handling large amount of information more efficiently and user friendly. In this survey the diverse tools are illustrated with their extensive technical paradigm, outstanding graphical interface and inbuilt multipath algorithms in which it is very useful for handling significant amount of data more indeed. Keywords—Classification, Clustering, Data Mining, Machine learning, Visualization. I. INTRODUCTION The domain of data mining and discovery of knowledge in various research fields such as Pattern Recognition, Information Retrieval, Medicine, Image Processing, Spatial Data Extraction, Business and Education has been tremendously increased over the certain span of time. Data Mining highly endeavors to originate, analyze, extract and implement fundamental induction process that facilitates the mining of meaningful information and useful patterns from the huge dumped unstructured data. Fig. 1 A Data Mining Framework. Revolving into the relationships between the elements of the framework has several data modeling notations pointing towards the cardinality 1 or else m of every relationship. For these minimum familiar with data modeling notations. • A business problem is studied via more than one classes
  • Building a Working Data Hub
    Building a Working Data Hub Authors: Vijay Gadepally, Jeremy Kepner {vijayg, kepner} @ ll.mit.edu MIT Lincoln Laboratory Supercomputing Center Lexington, MA 02421 April 2020 1 EXECUTIVE SUMMARY Data forms a key component of any enterprise. The need for high quality and easy access to data is further amplified by organizations wishing to leverage machine learning or artificial intelligence for their operations. To this end, many organizations are building resources for managing heterogeneous data, providing end-users with an organization wide view of available data, and acting as a centralized repository for data owned/collected by an organization. Very broadly, we refer to this class of techniques as a “data hub.” While there is no clear definition of what constitutes a data hub, some of the key characteristics include: • Data catalog • Links to datasets or owners of data sets or centralized data repository • Basic ability to serve / visualize data sets • Access control policies that ensure secure data access and respect policies of data owners • Computing capabilities tied with data hub infrastructure Of course, developing such a data hub entails numerous challenges. This document provides background in databases, data management and outlines best practices and recommendations for developing and deploying a working data hub. A few key recommendations: Technology: • Support federated data access • Support complex access control • Tie computing and data hub • Support multiple data formats Infrastructure: • Provision hardware and software
  • Open-Source Software Business Models That Create Value
    Journal of Management and Marketing Research Volume 18, February, 2015 Open-source software business models that create value Stephen Turner Known-Quantity.com, part of Turner & Associates, Inc. ABSTRACT The late management theorist Peter Drucker described a business model as the answer to three questions: “Who is your customer, what does the customer value, and how do you deliver value at an appropriate cost?”(Casadesus-Masanell & Ricart, 2011). Harvard Business Review reported that 70 percent of businesses in a variety of industries surveyed in 2009 were engaged in business model innovation (Casadesus-Masanell & Ricart, 2011). This includes users of software who want to drive costs down and have more control over their technology infrastructure. It also includes vendors who find new ways to compete. Business models revolve around all expenses of the enterprise in order to make a designated price possible. Revenue models are customer-driven and price-centric (Popp, 2014). A revenue model is one dimension of a business model, and some vendors have multiple revenue models. The most successful proprietary software vendors have revenue models that generate cash flow with high margins, typically based on licensing fees and services. Those margins are used to develop upgrades, correct bugs, and make the software more secure. A variety of open-source software (OSS) models are challenging the status quo. Some purists envision a dichotomy between good (i.e. open-source) and evil (i.e. proprietary) software. This paper provides a marketing and business context and connects value creation with new software programming paradigms. It describes twenty-two emerging OSS models that aspire to generate sufficient revenue to finance research and development (R&D) and marketing initiatives that will identify and address customers’ business problems.
  • Platform Leadership in Open Source Software
    Platform Leadership in Open Source Software By Ken Chi Ho Wong Bachelor of Science, Computing Science, Simon Fraser University, 2005 SUBMITTED TO THE SYSTEM DESIGN AND MANAGEMENT PROGRAM IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN ENGINEERING AND MANAGEMENT at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY February 2015 ©2015 Ken Wong. All rights reserved. The author hereby grants to MIT permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole or in part in any medium now known or hereafter created. Signature of Author: Ken Wong, System Design and Management Program, February 2015. Advised by: Michael Cusumano, SMR Distinguished Professor of Management & Engineering Systems, MIT Sloan School of Management. Certified by: Patrick Hale, Director, System Design and Management Program, Massachusetts Institute of Technology. Platform Leadership in Open Source Software By Ken Chi Ho Wong Submitted to the System Design and Management Program on February 2015, in Partial Fulfillment of the Requirements for the degree of Master of Science in Engineering and Management. Abstract Industry platforms in the software sector are increasingly being developed in open source. Firms seeking to position themselves as platform leaders with such technologies must find ways of operating within the unique constraints of open source development. This thesis aims to understand those challenges by analyzing the Android and Hadoop ecosystems through an augmented version of Porter’s Five Forces framework proposed by Intel’s Andrew Grove. 
The analysis finds that platform contenders in open source behave differently depending on whether they focus on competing against alternative platforms or alternative providers of the same platform as rivals.
  • Pysad: a Streaming Anomaly Detection Framework in Python
    PySAD: A Streaming Anomaly Detection Framework in Python. Selim F. Yilmaz [email protected] Department of Electrical and Electronics Engineering, Bilkent University, Ankara, Turkey. Suleyman S. Kozat [email protected] Department of Electrical and Electronics Engineering, Bilkent University, Ankara, Turkey. arXiv:2009.02572v1 [cs.LG] 5 Sep 2020. Abstract PySAD is an open-source python framework for anomaly detection on streaming data. PySAD serves various state-of-the-art methods for streaming anomaly detection. The framework provides a complete set of tools to design anomaly detection experiments ranging from projectors to probability calibrators. PySAD builds upon popular open-source frameworks such as PyOD and scikit-learn. We enforce software quality by enforcing compliance with PEP8 guidelines, functional testing and using continuous integration. The source code is publicly available on github.com/selimfirat/pysad. Keywords: Anomaly detection, outlier detection, streaming data, online, sequential, data mining, machine learning, python. 1. Introduction Anomaly detection on streaming data has attracted a significant attention in recent years due to its real-life applications such as surveillance systems (Yuan et al., 2014) and network intrusion detection (Kloft and Laskov, 2010). We introduce a framework for anomaly detection on data streams, for which methods can only access instances as they arrive, unlike batch setting where model can access all the data (Henzinger et al., 1998). Streaming methods can efficiently handle limited memory and processing time requirements of real-life applications. These methods only store and process an instance or a small window of recent instances. In recent years, various Free and Open Source Software (FOSS) has been developed for both anomaly detection and streaming data.