Experimental Study on the Performance and Resource Utilization of Data Streaming Frameworks Subarna Chatterjee, Christine Morin

Experimental Study on the Performance and Resource Utilization of Data Streaming Frameworks Subarna Chatterjee, Christine Morin To cite this version: Subarna Chatterjee, Christine Morin. Experimental Study on the Performance and Resource Uti- lization of Data Streaming Frameworks. CCGrid 2018 - 18th IEEE/ACM Symposium on Cluster, Cloud and Grid Computing, May 2018, Washington, DC, United States. pp.143-152, 10.1109/CC- GRID.2018.00029. hal-01823697 HAL Id: hal-01823697 https://hal.inria.fr/hal-01823697 Submitted on 26 Jun 2018 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Experimental Study on the Performance and Resource Utilization of Data Streaming Frameworks Subarna Chatterjee and Christine Morin INRIA Centre Rennes - Bretagne Atlantique Email: {subarna.chatterjee, christine.morin}@inria.fr Abstract—With the advent of the Internet of Things (IoT), data can possibly offer. Despite the existence of several work that stream processing have gained increased attention due to the present a comparative study of data stream processing frame- ever-increasing need to process heterogeneous and voluminous works, such type of analysis regarding “how to efficiently data streams. This work addresses the problem of selecting a correct stream processing framework for a given application to choose a streaming platform for an application” is still to be executed within a specific physical infrastructure. For this be discovered. Therefore, deriving such knowledge towards purpose, we focus on a thorough comparative analysis of three selection of application- and infrastructure-based streaming data stream processing platforms – Apache Flink, Apache Storm, frameworks induces research interest and attention. and Twitter Heron (the enhanced version of Apache Storm), In this work, we focus on thoroughly comparing and that are chosen based on their potential to process both streams and batches in real-time. The goal of the work is to enlighten analyzing the performance and resource utilization of three the cloud-clients and the cloud-providers with the knowledge streaming platforms and eventually make an inference about of the choice of the resource-efficient and requirement-adaptive the various applications that are suitable to be launched within streaming platform for a given application so that they can plan each of these platforms and the various aspects of the physical during allocation or assignment of Virtual Machines for applica- infrastructure that each can support. Based on the potential tion execution. For the comparative performance analysis of the chosen platforms, we have experimented using 8-node clusters to process both streams and batches in real-time and the on Grid5000 experimentation testbed and have selected a wide popularity of usage, in this work, we have chosen Apache variety of applications ranging from a conventional benchmark Flink [5], Apache Storm [6], and Twitter Heron [7] as the to sensor-based IoT application and statistical batch processing stream processing frameworks to be studied. application. In addition to the various performance metrics related to the elasticity and resource usage of the platforms, A. Motivation this work presents a comparative study of the “green-ness” of the streaming platforms by analyzing their power consumption As mentioned earlier, several prior works exist on compara- – one of the first attempts of its kind. The obtained results are tive study and analysis of streaming platforms. However, there thoroughly analyzed to illustrate the functional behavior of these are some observed limitations of each of these works which platforms under different computing scenarios. are discussed as follows: Index terms— Stream processing, Apache Flink, Apache (i) Recent release of Heron: The open-source version of Spark, Twitter Heron, Internet of Things Twitter Heron was released in 2016. Therefore, the existing literature on comparative studies of streaming frameworks I. INTRODUCTION does not include any analysis of the performance of Heron. The inception of the Internet of Things (IoT) has witnessed Heron was introduced as an enhanced version of Apache massive data explosion due to the exponential growth of Storm and so it is extremely important to analyze and compare devices and the innumerable applications running within them its performance with the other streaming frameworks. [1], [2]. Therefore, data-intensive systems are gaining increas- (ii) Analysis of energy-efficiency: Existing works illustrate the ing importance today. Data stream processing comprises the comparison of the streaming frameworks in terms of the data heart of such systems which is why several stream processing throughput, the utilization of the communication network, the frameworks co-exist today [3], [4]. Every stream processing processing latency, and the fault-tolerance. However, a study framework is different in terms of the set of applications that and analysis on the energy consumption of the platforms is still it can serve, the unit (batch, or micro-batch, or streams) and to be found. The energy efficiency of a streaming platform is nature (bounded or persistent) of dataset that it can handle, the extremely important owing to the global responsibility towards type of processing it offers (at-least-once or at-most-once or the eco-friendly use of computers and thereby reducing the exactly-once), the hardware requirements, and the scalability environmental impact of the IT operations. potential. Thus, it is extremely important to choose the correct (iii) Choice of framework for different scenarios: A streaming streaming platform that would serve a specific application in platform cannot be categorized as ”good” or ”bad” as every a given available cluster. It is extremely difficult to choose platform is differently designed to process specific stream “the” streaming framework that would suit the needs of an types. Each platform is unique about the nature of data sets, application and as well would have resource requirements the variable message processing semantics, and the differential within the limits that the underlying physical infrastructure hardware expectations. Therefore, it is extremely difficult, yet crucial, to judicially select the streaming platform that et al. [12] focused on the comparison among Storm, Spark, not only suits the application requirements but also abides Samza, and Hadoop ecosystems. Another work [13] on com- by the resource offerings of the physical infrastructure of parative experimentation between Flink and Spark proposed the host organization (or an end-user). This might be of benchmarks for the platforms and validated the performance particular interest for cloud end-users while choosing Virtual of the frameworks on the proposed benchmarks. But, since Machines (VMs) in an infrastructure cloud for Infrastructure- Spark is essentially a batch-processing platform, the effective as-a-Service (IaaS) or for cloud providers while allocating comparative analysis among real-streaming platforms was still VMs to end-users in a Platform-as-a-Service (PaaS) cloud. absent. Over time, both Flink and Storm gained popularity An insight of the rationale behind the choice of a streaming as real-streaming frameworks and consequently had a wide platform is not yet discussed in prior existing works and is spectrum of usage within organizations and end-users [14], therefore one of the major focus of this work. [15]. Based on the popularity of usage, in this work, we have B. Contributions chosen both Flink and Storm along with the latest platform Heron [7] for analysis. The contributions of the work are multi-fold and are dis- Of late, some works specifically compared Flink and Storm. cussed as follows: For example, Chintapalli et al. [16] developed a streaming (i) Heron is included: The proposed work is one of the first benchmark for Storm, Flink and Spark and provided a perfor- attempts towards a Heron-inclusive comparative study and mance comparison of the engines. However, the work mainly analysis of the popular streaming frameworks. In this work, focused on the processing latency of the tuples but did not we have selected Apache Flink, Apache Storm, and Twitter consider other parameters related to the performance. Lopez Heron as the frameworks for comparison. The rationale behind et al. [17] also analyzed the architecture of Storm, Flink and this choice is that, in addition to being open-source and the Spark. The authors performed a study on the throughput and popularity of their usage, all these platforms have the ability fault-tolerance of the platforms, however, the work did not to process true streams in real-time. study the scalability of operators of each platforms and how (ii) Metrics studied: This work compares Flink, Storm, and that hugely affects the throughput of the incoming streams. Heron with respect to a multitude of parameters. The work The work also did not analyze the rseource consumption of the focuses to analyze the performance of the frameworks in platform in terms of CPU or memory. In 2016, the open-source

Experimental Study on the Performance and Resource Utilization of Data Streaming Frameworks Subarna Chatterjee, Christine Morin

Apache Flink™: Stream and Batch Processing in a Single Engine

Evaluation of SPARQL Queries on Apache Flink

Portable Stateful Big Data Processing in Apache Beam

Apache Flink Fast and Reliable Large-Scale Data Processing

The Ubiquity of Large Graphs and Surprising Challenges of Graph Processing: Extended Survey

Introduction to the Flank Stack

Spring 2020 1/21

HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack

DSP Frameworks

Introduction to Apache Beam

A Survey of State Management in Big Data Processing Systems

Apache Calcite for Enabling SQL Access to Nosql Data Systems Such As Apache Geode Christian Tzolov Whoami Christian Tzolov