Arxiv:1603.08767V1 [Cs.DC] 29 Mar 2016 of Relational Databases

Machine Learning and Cloud Computing: Survey of Distributed and SaaS Solutions ∗ Daniel Pop Institute e-Austria Timi¸soara Bd. Vasile Pârvan No. 4, 300223 Timi¸soara,România E-mail: [email protected] Abstract pages (text mining, Web mining), spatial data, mul- timedia data, relational data (molecules, social net- Applying popular machine learning algorithms to works). Analytics tools allow end-users to harvest the large amounts of data raised new challenges for the meaningful patterns buried in large volumes of struc- ML practitioners. Traditional ML libraries does not tured and unstructured data. Analyzing big datasets support well processing of huge datasets, so that new gives users the power to identify new revenue sources, approaches were needed. Parallelization using mod- develop loyal and profitable customer relationships, ern parallel computing frameworks, such as MapRe- and run your overall organization more efficiently and duce, CUDA, or Dryad gained in popularity and accep- cost effectively. tance, resulting in new ML libraries developed on top Research in knowledge discovery and machine learn- of these frameworks. We will briefly introduce the most ing combines classical questions of computer science prominent industrial and academic outcomes, such as (efficient algorithms, software systems, databases) with Apache MahoutTM, GraphLab or Jubatus. elements from artificial intelligence and statistics up to We will investigate how cloud computing paradigm user oriented issues (visualization, interactive mining). impacted the field of ML. First direction is of popu- Although for more than two decades, parallel lar statistics tools and libraries (R system, Python) de- database products, such as Teradata, Oracle or Netezza ployed in the cloud. A second line of products is aug- have provided means to realize a parallel implemen- menting existing tools with plugins that allow users to tation of ML-DM algorithms, expressing ML-DM al- create a Hadoop cluster in the cloud and run jobs on gorithms in SQL code is a complex task and difficult it. Next on the list are libraries of distributed imple- to maintain. Furthermore, large-scale installations of mentations for ML algorithms, and on-premise deploy- these products are expensive and are not an afford- ments of complex systems for data analytics and data able option in most cases. Another driver for paradigm mining. Last approach on the radar of this survey is shift from relational model to other alternatives is the ML as Software-as-a-Service, several BigData start-ups new nature of data. Until about five years ago, most (and large companies as well) already opening their so- data was transactional in nature, consisting of numeric lutions to the market. or string data that fit easily into rows and columns arXiv:1603.08767v1 [cs.DC] 29 Mar 2016 of relational databases. Since then, while structured data is following a near-linear growth, unstructured 1 Introduction (e.g. audio and video) and semi-structured data (e.g, Web traffic data, social media content, sensor gener- ated data etc.) exhibit an exponential growth (see fig- Given the enormous growth of collected and avail- ure 1). Most of the new data is either semi-structured able data in companies, industry and science, tech- in format, i.e. it consists of headers followed by text niques for analyzing such data are becoming ever more strings, or pure unstructured data (photo, video, au- important. Today, data to be analyzed are no longer dio). While the latter has limited textual content and restricted to sensor data and classical databases, but is more difficult to parse and analyze, semi-structured more and more include textual documents and web- data triggered a plethora of non-relational data stores ∗This manuscript was originally published as IEAT Technical (NoSQL data stores) solutions tailored to handle huge Report at https://www.ieat.ro/technical-reports in 2012. amount of data. Consequently, the past 5 years have 1 mode yet, who are offering machine learning services to their customers, or big data analysis services can be noticed in past 5 years. These initiatives can be either PaaS/SaaS platforms or products that can be deployed on private environments. Reviewing the literature and the market, we can conclude that ML-DM comes in many flavors. We clas- sify these approaches in 5 distinct classes: • Machine Learning environments from the cloud { create a computer cluster in the cloud and bootstrapping it with statistics tools. ) Section 3. Figure 1. Trends in data growth • Plugins for Machine Learning tools { augment statistics tools with plugins that allow users to create a Hadoop cluster in the cloud and run ML jobs seen researchers moving to parallelization of ML-DM on it. ) Section 4. using these new platforms, such as NoSQL datastores, distributed processing environments (MapReduce), or • Distributed Machine Learning libraries { collec- cloud computing. tions of parallelized implementations of ML al- At this point, it is worth reflecting to a nice gorithms for distributed environments (Hadoop, metaphor by Ben Werther [18], co-founder of Platfora, Dryad etc). ) Section 5. for big data processing today: • Complex Machine Learning systems { products In ìndustrial revolution' terms, we are in that need to be installed on private data centers the pre-industrial era of artisanship that (or in the cloud) and offers high performance data mining and analysis. ) Section 6. preceeded mass production. It is the equivalent of needing to engage an • Software as a Service providers for Machine Learn- ing { PaaS/SaaS solutions that allow clients to ac- expert blacksmith to forge the forks and cess ML algorithms via Web services. ) Section 7. spoons for our dinner table. The remaining of the paper is structured as follows: next section presents similar, recent studies, followed Machine Learning is inherently a time consuming by 5 sections, each of them devoted to a particular task, thus plenty of efforts were conducted to speed-up class identified above. The paper ends with conclusion the execution time. Cloud computing paradigm and and future plans. cloud providers turned out to be valuable alternatives to speed-up machine learning platforms. Thus, popular statistics tools environments { like R, Octave, Python 2 Related studies { went in the cloud as well. There are two main direc- tions to integrate them with cloud providers: create a Since 1995, many implementations were proposed cluster in the cloud and bootstrapping it with statistic for ML-DM algorithms parallelization for shared or tools, or augment statistic environments with plugins distributed systems. For a comprehensive study the that allow users to create Hadoop clusters in the cloud reader is referred to a recent survey [17]. Our work and run jobs on them. is focused in frameworks, toolkits, libraries that al- Environments like R, Octave, Mapple and similar low large-scale, distributed implementations of state- offer low-level infrastructure for data analysis, that of-the-art ML-DM algorithms. To this respect, we can be applied for large datasets once leveraged by mention a recent book dealing with machine learning at cloud providers. Machine Learning is something that large [1], which contains both presentations of general comes on top of this and facilitates the retrieval of frameworks for highly scalable ML implementations, useful knowledge out of huge data for customers with like DryadLINQ or IBM PMLT, and specific imple- no/less statistical background by automatically infer- mentations of ML techniques on these platforms, like ring `knowledge models' out of data. To support this ensemble decision trees, SVM, k-Means etc. It contains need, an explosion of start-ups, some of them in stealth contributions from both industry leaders (Google, HP, 2 IBM, Microsoft) and academia (Berkeley, NYU, Uni- with map/reduce, visualization, security and version versity of California etc). control packages. Results of data analysis processes, Recent articles, such as those of S. Charrington [3], named dashboard in Opani, can easily be visualized W. Eckerson [4] and D. Harris [5], review different and shared from desktop or mobile devices. large-scale ML solutions providers that are trying to Approaches in this class are powerful and flexible offer better tools and technologies, most of them based solutions, offering users the possibility to develop com- on Hadoop infrastructure, to move forward the novel plex ML-DM applications ran on the cloud. Users are industry of big data. They are aiming at improving freed from the burden of provisioning own distributed user experience, at product recommendations, or web- environments for scientific computing, while being able site optimization applicable for finance, telecommuni- to use their favorite environments. On the other side, cations, retail, advertising or media. users of these tools need to have extensive experience in programming and strong knowledge of statistics. Per- 3 Machine Learning environments haps, due to this limited audience, the stable providers from the cloud in this category are fewer than in other categories, some of them (such as CRdata.org) shutting down the oper- Providers of this category offer computer clusters ation only shortly after taking off. using public cloud providers, such as Amazon EC2, Rackspace etc, pre-installed with statistics software, 4 Plugins for Machine Learning toosl preferred packages being R system, Octave or Map- ple. These solutions offer scalable high-performance In this class, statistics applications (e.g. R system, resources in the cloud to their customers, who are freed Python) are extended with plugins that allow users to from the burden of installating and managing own clus- create a Hadoop cluster in the cloud and run time con- ters. suming jobs over large datasets on it. Most of the in- 1 2 Cloudnumbers.com are using Amazon EC2 terest went towards R, for which several extensions are provider to setup computer clusters preinstalled with available, comparing to Python for which less effort software for scientific computing, such as R system, was invested until recently in supporting distributed Octave or Mapple.

Arxiv:1603.08767V1 [Cs.DC] 29 Mar 2016 of Relational Databases

NTT Technical Review, December 2012, Vol. 10, No. 12

Outline of Machine Learning

A Guide in the Big Data Jungle Thesis, Bachelor of Science

Sensorbee Documentation Release 0.4

Data Analysis from Scratch with Python: the Complete Beginner's

Machine Learning for Computer Security Fourteenforty Research Institute, Inc

“Scalable Distributed Online Machine Learning Framework for Realtime Analysis of Big Data”

Comprehensive Analysis of Data Mining Tools S

Building a Working Data Hub

Open-Source Software Business Models That Create Value

Platform Leadership in Open Source Software

Pysad: a Streaming Anomaly Detection Framework in Python