Quantitative Analysis of Apache Storm Applications: The NewsAsset Case Study

Total Pages: 16

File Type: PDF, Size: 1020 KB

Quantitative Analysis of Apache Storm Applications: The NewsAsset Case Study

José I. Requeno · José Merseguer · Simona Bernardi · Diego Perez-Palacin · Giorgos Giotis · Vasilis Papanikolaou

José Ignacio Requeno, José Merseguer, Simona Bernardi and Diego Perez-Palacin: Dpto. de Informática e Ingeniería de Sistemas, Universidad de Zaragoza (Spain). Giorgos Giotis and Vasilis Papanikolaou: Athens Technology Center, ATC (Greece).

Abstract: The development of Information Systems today faces the era of Big Data. Large volumes of information need to be processed in real time, for example, for Facebook or Twitter analysis. This paper addresses the redesign of NewsAsset, a commercial product that helps journalists by providing services which analyze millions of media items from the social networks in real time. Technologies like Apache Storm can help enormously in this context. We have quantitatively analyzed the new design of NewsAsset to assess whether the introduction of Apache Storm can meet the demanding performance requirements of this media product. Our assessment approach, guided by the Unified Modeling Language (UML), takes advantage, for performance analysis, of the software designs already used for development. In addition, we converted UML into a domain-specific modeling language (DSML) for Apache Storm, thus creating a profile for Storm. Later, we transformed said DSML into an appropriate language for performance evaluation, specifically, stochastic Petri nets. The assessment ended with a successful software design that certainly met the scalability requirements of NewsAsset.

Keywords: Apache Storm · UML · Petri nets · Software Performance · Software Reuse

1 Introduction

Innovative practices for Information Systems development, like Big Data technologies, Model-Driven Engineering techniques or Cloud Computing processes, have penetrated the media domain. News agencies are already feeling the impact of these technologies (e.g., transparent distribution of information, sophisticated analytics or processing power) for facilitating the development of the next generation of applications. Especially considering the interesting media items and burst events that are out there in the digital world, these technologies can offer very efficient processing capabilities and can provide added value to journalists.

Apache Storm (Apache, 2017a) is a free and open-source distributed real-time computation system that can process a million tuples per second per node. Storm helps improve real-time analysis, news and advertisements, the customization of searches, and the optimization of a wide range of online services that require low-latency processing. Today, the volume of information on the Internet increases exponentially, especially that of interest to the media. For example, in the case of natural disasters or social or sporting events, the traffic of tweets or messages may rise to 10 or 100 times the number of messages in a normal situation (Ranjan, 2014). Hence, applications developed using Apache Storm have to meet very demanding performance and reliability requirements.

This paper addresses, using Apache Storm, the redesign of NewsAsset, a commercial product developed by the Athens Technology Center (ATC, 2018). To this end, we apply a quality-driven methodology, which we already introduced in (Requeno et al., 2017), for the performance assessment of Apache Storm applications. For ATC, the redesign also means reusing code from the current version of NewsAsset, impacting only the stream processing in order to leverage Apache Storm. The simulation-based approach that we apply here is useful for predicting the behavior of the application under future demands, and the impact of stress situations on some performance parameters (e.g., application response time, throughput or device utilization). Consequently, ATC gets, before reimplementation and deployment of the full application, valuable feedback which saves coding and monetary efforts.

In particular, this paper extends the approach in (Requeno et al., 2017), with respect to the quality-driven methodology, in different aspects. First, we improve the UML profile of the methodology by introducing a reliability characterization of Storm. Consequently, we convert UML into a DSML (Domain Specific Modeling Language) for the performance and reliability of Apache Storm applications. Second, we propose new transformations, into stochastic Petri nets (SPN), for some performance parameters of Storm not already addressed in (Requeno et al., 2017). Moreover, we introduce the computation of reliability metrics by means of the UML profile. Consequently, our approach enables the performance and reliability assessment of Apache Storm applications. Finally, the application of the methodology to the NewsAsset case study has been useful to validate the approach in a real scenario and to help ATC assess its scalability.

On the modeling side, our DSML allows working with the Apache Storm performance and reliability parameters in the very same model used for the workflow and deployment specifications. Moreover, the developer takes advantage of all the facilities provided by a UML software development environment. These reasons favor UML modeling over working directly with the SPN, which can be merely obtained by transformation.

Regarding the related work, (Ranjan, 2014) discusses the role of modeling and simulation in the era of big data applications and defends that they can empower practitioners and academics in conducting "what-if" analyses. (Singhal and Verma, 2016) develop a framework for efficiently setting up heterogeneous MapReduce environments, and (Nalepa et al., 2015a,b) address the need for modeling and performance assessment in stream applications. More in particular, a generic profile for modeling big data applications is defined for the Palladio Component Model (Kroß et al., 2015). In (Kroß and Krcmar, 2016), the authors model and simulate Apache Spark streaming applications. Mathematical models for predicting the performance of Spark applications are introduced in (Wang and Khan, 2015). Some of these works use variants of Petri nets, but they are applied in a generic context for stream processing (Nalepa et al., 2015b) or distributed systems (Samolej and Rak, 2009; Rak, 2015). Generalised stochastic Petri nets (Chiola et al., 1993), the formalism for performance analysis that we adopt here, have already been used for the performance assessment of Apache Hadoop MapReduce (et al., 2016). A recent publication uses fluid Petri nets for the modeling and performance evaluation of Apache Spark applications (et al., 2017). However, the work in (Requeno et al., 2017) was the first entirely devoted to Apache Storm performance evaluation, combining a genuine UML profile and GSPNs, and the present work validates and extends it as aforementioned.

The rest of the paper is organized as follows. Section 2 introduces the NewsAsset case study. Section 3 recalls the basics of Apache Storm for performance and reliability and defines the DSML. Section 4 presents our performance modeling approach with focus on the case study. Section 5 is devoted to the performance analysis of NewsAsset. Finally, Section 6 draws a conclusion. Appendix A details the transformation to get performance models. Appendix B explains the computation of reliability metrics in a Storm design. Appendix C recalls basic notions of Generalized stochastic Petri nets (Chiola et al., 1993).

2 A Case Study in the Media Domain

Heterogeneous sources like social or sensor networks are continuously feeding the Internet with a variety of real data at a tremendous pace: media items describing burst events, traffic speed on roads, or air pollution levels by location. Journalists are able to access these data, which aid them in all manner of news stories. It is social networks like Twitter, Facebook or Instagram that people use to watch the news ecosystem and try to learn what conditions exist in real time. Consequently, news agencies have realized that social-media content is becoming increasingly useful for news coverage, and that they can benefit from this trend only if they adopt current innovative technologies that effectively manage such volumes of information. Thus, the challenge is to catch up with this evolution and provide services that can handle the new situation in the media industry.

NewsAsset is a commercial product positioned in the news and media domain, branded by Athens Technology Center (ATC), an SME (small and medium-sized enterprise) located in Greece. The NewsAsset suite constitutes an innovative management solution for handling large volumes of information, offering a complete and secure electronic environment for storage, management and delivery of sensitive information in the news production environment. The platform proposes a distributed multi-tier architecture engine for managing data storage composed of media items such as text, images, reports, articles or videos.

… the system. The goal is to optimize the existing processing time by means of not only minimizing the time slot duration to reflect real-time processing but also by maximizing the
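The performance parameters mentioned in the text (application response time, throughput, device utilization) and the "what-if" value of a simulation-based approach can be illustrated with a toy model. The sketch below is a minimal discrete-event simulation of a single Storm bolt treated as a queue with a configurable number of parallel executors; it is only an illustrative stand-in, not the GSPN models the paper derives, and all rates and names are hypothetical.

```python
import random

def simulate_bolt(arrival_rate, service_rate, executors, n_tuples, seed=42):
    # Toy stand-in for one Storm bolt: tuples arrive as a Poisson stream and
    # each of `executors` parallel executors serves one tuple at a time.
    # Returns (mean response time, executor utilization).
    rng = random.Random(seed)
    t = 0.0                       # current arrival instant
    free_at = [0.0] * executors   # instant each executor becomes idle
    busy_time = 0.0               # accumulated service time over all executors
    total_response = 0.0          # accumulated queueing + service time
    for _ in range(n_tuples):
        t += rng.expovariate(arrival_rate)       # next tuple arrival
        service = rng.expovariate(service_rate)  # its processing demand
        k = min(range(executors), key=free_at.__getitem__)
        start = max(t, free_at[k])               # wait if all executors busy
        free_at[k] = start + service
        busy_time += service
        total_response += free_at[k] - t
    makespan = max(free_at)
    return total_response / n_tuples, busy_time / (executors * makespan)

# What-if question: does doubling the executors of a loaded bolt help?
r2, u2 = simulate_bolt(80.0, 50.0, 2, 20000)   # roughly 80% utilization
r4, u4 = simulate_bolt(80.0, 50.0, 4, 20000)   # roughly 40% utilization
```

With these hypothetical rates, going from 2 to 4 executors lowers both utilization and mean response time; this is exactly the kind of scalability question that can be answered on a model before any reimplementation or redeployment effort is spent.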
Recommended publications
  • Working with Storm Topologies
    Apache Storm — Working with Storm Topologies. Date of Publish: 2018-08-13. http://docs.hortonworks.com
    Contents: Packaging Storm Topologies; Deploying and Managing Apache Storm Topologies (Configuring the Storm UI; Using the Storm UI); Monitoring and Debugging an Apache Storm Topology (Enabling Dynamic Log Levels; Setting and Clearing Log Levels Using the Storm UI; Setting and Clearing Log Levels Using the CLI; Enabling Topology Event Logging; Configuring Topology Event Logging; Enabling Event Logging).
  • Apache Flink™: Stream and Batch Processing in a Single Engine
    Apache Flink™: Stream and Batch Processing in a Single Engine. Paris Carbone (KTH & SICS, Sweden), Stephan Ewen (data Artisans), Seif Haridi (KTH & SICS, Sweden), Asterios Katsifodimos (TU Berlin & DFKI), Volker Markl (TU Berlin & DFKI), Kostas Tzoumas (data Artisans).
    Abstract: Apache Flink is an open-source system for processing streaming and batch data. Flink is built on the philosophy that many classes of data processing applications, including real-time analytics, continuous data pipelines, historic data processing (batch), and iterative algorithms (machine learning, graph analysis), can be expressed and executed as pipelined fault-tolerant dataflows. In this paper, we present Flink's architecture and expand on how a (seemingly diverse) set of use cases can be unified under a single execution model.
    1 Introduction. Data-stream processing (e.g., as exemplified by complex event processing systems) and static (batch) data processing (e.g., as exemplified by MPP databases and Hadoop) were traditionally considered two very different types of applications. They were programmed using different programming models and APIs, and were executed by different systems (e.g., dedicated streaming systems such as Apache Storm, IBM Infosphere Streams, Microsoft StreamInsight, or Streambase, versus relational databases or execution engines for Hadoop, including Apache Spark and Apache Drill). Traditionally, batch data analysis made up the lion's share of the use cases, data sizes, and market, while streaming data analysis mostly served specialized applications. It is becoming more and more apparent, however, that a huge number of today's large-scale data processing use cases handle data that is, in reality, produced continuously over time.
  • DSP Frameworks DSP Frameworks We Consider
    Università degli Studi di Roma "Tor Vergata", Dipartimento di Ingegneria Civile e Ingegneria Informatica. DSP Frameworks, Corso di Sistemi e Architetture per Big Data, A.A. 2017/18, Valeria Cardellini.
    DSP frameworks we consider: Apache Storm (with lab); Twitter Heron (from Twitter, as Storm, and compatible with Storm); Apache Spark Streaming (lab), which reduces the size of each stream and processes streams of data in micro-batches; Apache Flink; Apache Samza; and cloud-based frameworks such as Google Cloud Dataflow and Amazon Kinesis Streams.
    Apache Storm: an open-source, real-time, scalable streaming system that provides an abstraction layer to execute DSP applications; initially developed by Twitter. Topology: a DAG of spouts (sources of streams) and bolts (operators and data sinks).
    Stream grouping in Storm concerns data parallelism: how streams are partitioned among multiple tasks (threads of execution). Shuffle grouping randomly partitions the tuples. Field grouping hashes on a subset of the tuple attributes. All grouping (i.e., broadcast) replicates the entire stream to all the consumer tasks. Global grouping sends the entire stream to a single task of a bolt. Direct grouping lets the producer of the tuple decide which task of the consumer will receive it.
    Storm architecture: a master-worker architecture.
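The stream groupings summarized in the slides above can be mimicked in a few lines. The sketch below is a hedged illustration of the two most common policies, not Storm's actual implementation: shuffle grouping simply spreads tuples evenly, while field grouping hashes a chosen attribute so that all tuples sharing a key reach the same consumer task (task counts and tuple shapes are made up for the example).

```python
import zlib

def shuffle_group(tuples, n_tasks):
    # Shuffle grouping: spread tuples evenly across tasks
    # (round-robin here; Storm randomizes, the effect is similar).
    tasks = [[] for _ in range(n_tasks)]
    for i, tup in enumerate(tuples):
        tasks[i % n_tasks].append(tup)
    return tasks

def field_group(tuples, n_tasks, field):
    # Field grouping: hash one attribute so equal values always land on
    # the same task, enabling per-key state (e.g. a word counter).
    tasks = [[] for _ in range(n_tasks)]
    for tup in tuples:
        idx = zlib.crc32(str(tup[field]).encode("utf-8")) % n_tasks
        tasks[idx].append(tup)
    return tasks

stream = [{"word": w} for w in ["storm", "flink", "storm", "spark", "storm"]]
by_word = field_group(stream, 3, "word")
```

Field grouping is what makes stateful bolts correct: every occurrence of "storm" lands on the same task, so that task's local counter sees them all.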
  • Apache Storm Tutorial
    About the Tutorial: Storm was originally created by Nathan Marz and his team at BackType, a social analytics company. Later, Storm was acquired and open-sourced by Twitter. In a short time, Apache Storm became a standard for distributed real-time processing, allowing you to process large amounts of data, similar to Hadoop. Apache Storm is written in Java and Clojure, and it continues to be a leader in real-time analytics. This tutorial explores the principles of Apache Storm, distributed messaging, installation, creating Storm topologies and deploying them to a Storm cluster, the workflow of Trident, and real-time applications, and finally concludes with some useful examples.
    Audience: This tutorial has been prepared for professionals aspiring to make a career in Big Data Analytics using the Apache Storm framework. It will give you enough understanding of creating and deploying a Storm cluster in a distributed environment.
    Prerequisites: Before proceeding with this tutorial, you must have a good understanding of Core Java and any of the Linux flavors.
    Copyright & Disclaimer: © Copyright 2014 by Tutorials Point (I) Pvt. Ltd. All the content and graphics published in this e-book are the property of Tutorials Point (I) Pvt. Ltd. The user of this e-book is prohibited from reusing, retaining, copying, distributing or republishing any contents or a part of the contents of this e-book in any manner without the written consent of the publisher. We strive to update the contents of our website and tutorials as timely and as precisely as possible; however, the contents may contain inaccuracies or errors.
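The spout/bolt topology idea described in the tutorial above can be sketched without Storm at all. The snippet below is a plain-Python analogy, deliberately not the real Storm Java API: a spout emits tuples, bolts transform or aggregate them, and wiring the stages together forms the DAG (here, a classic word count).

```python
from collections import Counter

def sentence_spout(sentences):
    # Spout: the source of the stream; emits one tuple per sentence.
    for s in sentences:
        yield s

def split_bolt(stream):
    # Bolt: transforms each sentence tuple into word tuples.
    for sentence in stream:
        for word in sentence.lower().split():
            yield word

def count_bolt(stream):
    # Terminal bolt: keeps a running count per word.
    counts = Counter()
    for word in stream:
        counts[word] += 1
    return counts

# Wiring the DAG: spout -> split bolt -> count bolt
counts = count_bolt(split_bolt(sentence_spout([
    "storm processes streams",
    "storm scales out",
])))
```

In real Storm the same shape is expressed with a TopologyBuilder, and each bolt runs as many parallel tasks fed by a stream grouping; the generator chaining above only captures the dataflow, not the distribution.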
  • HDP 3.1.4 Release Notes
    HDP 3.1.4 Release Notes. Date of Publish: 2019-08-26. https://docs.hortonworks.com
    Contents: HDP 3.1.4 Release Notes; Component Versions; Descriptions of New Features; Deprecation Notices (Terminology; Removed Components and Product Capabilities); Testing Unsupported Features (Descriptions of the Latest Technical Preview Features); Upgrading to HDP 3.1.4; Behavioral Changes; Apache Patch Information (Accumulo, …).
  • HDF® Stream Developer 3 Days
    TRAINING OFFERING | DEV-371. HDF® STREAM DEVELOPER, 3 DAYS. This course is designed for Data Engineers, Data Stewards and Data Flow Managers who need to automate the flow of data between systems, as well as create real-time applications to ingest and process streaming data sources using Hortonworks Data Flow (HDF) environments. Specific technologies covered include Apache NiFi, Apache Kafka and Apache Storm. The course culminates in the creation of an end-to-end exercise that spans this HDF technology stack.
    Prerequisites: Students should be familiar with programming principles and have previous experience in software development. First-hand experience with Java programming and developing within an IDE is required. Experience with Linux and a basic understanding of DataFlow tools would be helpful. No prior Hadoop experience is required.
    Target audience: Developers, Data & Integration Engineers, and Architects who need to automate data flow between systems and/or develop streaming applications.
    Format: 50% lecture/discussion, 50% hands-on labs.
    Agenda summary — Day 1: introduction to HDF components, Apache NiFi dataflow development. Day 2: Apache Kafka, NiFi integration with HDF/HDP, Apache Storm architecture. Day 3: Storm management options, multi-language support, Kafka integration.
    Day 1 objectives: introduce HDF's components (Apache NiFi, Apache Kafka, and Apache Storm); NiFi architecture, features, and characteristics; the NiFi user interface; processors and connections in detail; NiFi dataflow assembly; Processor Groups and their elements.
  • ADMI Cloud Computing Presentation
    ECSU/IU NSF EAGER: Remote Sensing Curriculum Enhancement using Cloud Computing. ADMI Cloud Workshop, June 10th–12th 2016. Day 1: Introduction to Cloud Computing with Amazon EC2 and Apache Hadoop. Prof. Judy Qiu, Saliya Ekanayake, and Andrew Younge; presented by Saliya Ekanayake.
    Cloud Computing: What's "cloud"? Defining this is not worth the time (ever heard of The Blind Men and The Elephant?); if you still need a definition, see the NIST one below. The idea is to consume X as-a-service, where X can be computing, storage, analytics, etc., and X comes from three categories: Infrastructure-as-a-Service, Platform-as-a-Service, Software-as-a-Service. (A laundry analogy compares them: classic computing is owning the washer and bleach and washing yourself; IaaS is renting a washer, or two or three, while still washing yourself; SaaS is handing over your clothes and having them magically appear clean the next day.)
    The Three Categories: Software-as-a-Service provides web-enabled software (e.g., Google Gmail, Docs). Platform-as-a-Service provides scalable computing environments and runtimes for users to develop large computational and big data applications (e.g., Hadoop MapReduce). Infrastructure-as-a-Service provides virtualized computing and storage resources in a dynamic, on-demand fashion (e.g., Amazon Elastic Compute Cloud).
    The NIST Definition of Cloud Computing: "Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction." On-demand self-service, broad network access, resource pooling, rapid elasticity, measured service. http://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-145.pdf However, formal definitions may not be very useful.
  • Installing and Configuring Apache Storm
    Apache Kafka — Installing and Configuring Apache Storm. Date of Publish: 2018-08-30. http://docs.hortonworks.com
    Contents: Installing Apache Storm; Configuring Apache Storm for a Production Environment (Configuring Storm for Supervision; Configuring Storm Resource Usage).
    Installing Apache Storm. Before you begin: HDP cluster stack version 2.5.0 or later; (optional) Ambari version 2.4.0 or later. Procedure: 1. Click the Ambari "Services" tab. 2. In the Ambari "Actions" menu, select "Add Service." This starts the Add Service Wizard, displaying the Choose Services screen. Some of the services are enabled by default. 3. Scroll down through the alphabetic list of components on the Choose Services page, select "Storm", and click "Next" to continue. 4. On the Assign Masters page, review node assignments for Storm components. If you want to run Storm with high availability of nimbus nodes, select more than one nimbus node; the Nimbus daemon automatically starts in HA mode if you select more than one nimbus node. Modify additional node assignments if desired, and click "Next". 5. On the Assign Slaves and Clients page, choose the nodes that you want to run Storm supervisors and clients: Storm supervisors are nodes from which the actual worker processes launch to execute spout and bolt tasks, and Storm clients are nodes from which you can run Storm commands (jar, list, and so on). 6. Click Next to continue. 7. Ambari displays the Customize Services page, which lists a series of services. For your initial configuration you should use the default values set by Ambari.
  • Perform Data Engineering on Microsoft Azure Hdinsight (775)
    Perform Data Engineering on Microsoft Azure HDInsight (775). www.cognixia.com
    Administer and Provision HDInsight Clusters.
    Deploy HDInsight clusters: create a cluster in a private virtual network; create a cluster that has a custom metastore; create a domain-joined cluster; select an appropriate cluster type based on workload considerations; customize a cluster by using script actions; provision a cluster by using the Portal; provision a cluster by using Azure CLI tools; provision a cluster by using Azure Resource Manager (ARM) templates and PowerShell; manage managed disks; configure vNet peering.
    Deploy and secure multi-user HDInsight clusters: provision users who have different roles; manage users, groups, and permissions through Apache Ambari, PowerShell, and Apache Ranger; configure Kerberos; configure service accounts; implement SSH tunneling; restrict access to data.
    Ingest data for batch and interactive processing: ingest data from cloud or on-premises sources; store data in Azure Data Lake; store data in Azure Blob Storage; perform routine small writes on a continuous basis using Azure CLI tools; ingest data in Apache Hive and Apache Spark by using Apache Sqoop, Azure Data Factory (ADF), AzCopy, and AdlCopy; ingest data from an on-premises Hadoop cluster.
    Configure HDInsight clusters: manage metastore upgrades; view and edit Ambari configuration groups; view and change service configurations through Ambari; access logs written to Azure Table storage; enable heap dumps for Hadoop services; manage HDInsight configuration; use
  • A Performance Comparison of Open-Source Stream Processing Platforms
    A Performance Comparison of Open-Source Stream Processing Platforms. Martin Andreoni Lopez, Antonio Gonzalez Pastana Lobato, Otto Carlos M. B. Duarte. Universidade Federal do Rio de Janeiro - GTA/COPPE/UFRJ - Rio de Janeiro, Brazil.
    Abstract: Distributed stream processing platforms are a new class of real-time monitoring systems that analyze and extract knowledge from large continuous streams of data. These types of systems are crucial for providing the high throughput and low latency required by Big Data or Internet of Things monitoring applications. This paper describes and analyzes three main open-source distributed stream-processing platforms: Storm, Flink, and Spark Streaming. We analyze the system architectures and compare their main features. We carry out two experiments concerning threat detection on network traffic to evaluate throughput efficiency and resilience to node failures. Results show that the performance of the native stream processing systems, Storm and Flink, is up to 15 times higher than that of the micro-batch processing system, Spark Streaming.
    … processing models have been proposed and have received attention from researchers. Real-time distributed stream processing models can benefit traffic monitoring applications for cyber-security threat detection [4]. Current intrusion detection and prevention systems are not effective, because 85% of threats take weeks to be detected, and up to 123 hours pass before a reaction is performed after detection [5]. New distributed real-time stream processing models for security-critical applications are required, and in the future, with the advancement of the Internet of Things, their use will be imperative. To respond to these needs, …
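The comparison above reports a throughput gap between native (per-tuple) engines and the micro-batch engine; the latency side of that trade-off can be made visible with simple arithmetic. The sketch below is a back-of-the-envelope model, assuming tuples arrive uniformly within a fixed batch interval; the interval and processing times are hypothetical, not measurements from the paper.

```python
def mean_microbatch_latency(batch_interval, processing_time):
    # Micro-batching buffers tuples into fixed intervals: a tuple arriving
    # uniformly at random waits batch_interval/2 on average before the
    # batch is even submitted, then adds the processing time.
    return batch_interval / 2 + processing_time

def mean_per_tuple_latency(processing_time):
    # Native streaming handles each tuple on arrival (queueing ignored).
    return processing_time

# Hypothetical numbers: a 500 ms batch interval and 5 ms of processing
micro = mean_microbatch_latency(0.5, 0.005)   # about 255 ms
native = mean_per_tuple_latency(0.005)        # about 5 ms
```

Even when both engines sustain the same throughput, the batching term dominates end-to-end latency, which is why per-tuple systems like Storm and Flink suit latency-critical detection workloads.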
  • Technology Overview
    Big Data Technology Overview (term, description, see also):
    Big Data - the 5 Vs: Volume, velocity and variety; some expand the definition further to include veracity and value as well. (See also: 3 Vs Everyone Must Know; 5 Vs of Big Data.)
    Agile: From Wikipedia, "Agile software development is a group of software development methods based on iterative and incremental development, where requirements and solutions evolve through collaboration between self-organizing, cross-functional teams. It promotes adaptive planning, evolutionary development and delivery, a time-boxed iterative approach, and encourages rapid and flexible response to change. It is a conceptual framework that promotes foreseen tight iterations throughout the development cycle." (See also: The Agile Manifesto.)
    Avro: A data serialization system. From Wikipedia, "It is a remote procedure call and serialization framework developed within Apache's Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format." (See also: Apache Avro.)
    BigInsights: BigInsights Enterprise Edition provides a spreadsheet-like data analysis tool to help organizations store, manage, and analyze big data. (See also: IBM Infosphere BigInsights.)
    Cassandra: A scalable multi-master database with no single points of failure; it provides scalability and high availability without compromising performance. (See also: Apache Cassandra.)
    Cloudera: Cloudera Inc. is an American-based software company that provides Apache Hadoop-based software, support and services, and training to business customers. (See also: Cloudera.)
    Data science: The study of the generalizable extraction of knowledge from data. (See also: Wikipedia - Data Science; IBM - Data Scientist; Coursera.)
    Dremel: A distributed system developed at Google for interactively querying large datasets; it empowers business analysts and makes it easy for business users to access the data rather than having to rely on data engineers. (See also: Google Research.)
  • Apache Beam: Portable and Evolutive Data-Intensive Applications
    Apache Beam: portable and evolutive data-intensive applications. Ismaël Mejía (@iemejia), Talend. Who am I? Software engineer; Apache Beam PMC / Committer; ASF member; integration software; Big Data / real-time; open source / enterprise.
    Introduction: Big data state of affairs. Before Big Data (early 2000s), the web pushed data analysis / infrastructure boundaries: huge data analysis needs (Google, Yahoo, etc.) and scaling DBs for the web (most companies). DBs (and in particular RDBMSs) had too many constraints, and it was hard to operate at scale. Solution: go back to basics, but in a distributed fashion.
    MapReduce, distributed filesystems and Hadoop: use distributed file systems (HDFS) to scale data storage horizontally; use MapReduce to execute tasks in parallel (performance); ignore the strict model (let representation loose to ease scaling, e.g. KV stores). (Diagram: Prepare → Map → Shuffle → Reduce → Produce.) Great for huge dataset analysis / transformation, but too low-level for many tasks (early frameworks) and not suited for latency-dependent analysis.
    The distributed database Cambrian explosion: ... and MANY others, all of them with different properties, utilities and APIs.
    Distributed databases API cycle: NoSQL — "let's reinvent our own thing, because SQL is too limited"; NewSQL — "SQL is back, because it is awesome" (yes, it is an over-simplification, but you get it).
    The fundamental problems are still the same, or worse (because of heterogeneity): data analysis / processing from systems with different semantics; data integration from heterogeneous sources; data infrastructure operational issues. Good old Extract-Transform-Load (ETL) is still an important need. "Data preparation accounts for about 80% of the work of data scientists" [1] [2]. (1: Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task. 2: Sculley et al.: Hidden Technical Debt in Machine Learning Systems.) … and the evolution continues.