Quantitative Impact Evaluation of an Abstraction Layer for Data Systems

Guenter Hesse*, Christoph Matthies*, Kelvin Glass†, Johannes Huegle* and Matthias Uflacker*
*Hasso Plattner Institute, University of Potsdam, Email: fi[email protected]
†Department of Mathematics and Computer Science, Freie Universität Berlin, Email: [email protected]

Abstract—With the demand to process ever-growing data volumes, a variety of new data stream processing frameworks have been developed. Moving an implementation from one such system to another, e.g., for performance reasons, requires adapting existing applications to new interfaces. Apache Beam addresses these high substitution costs by providing an abstraction layer that enables executing programs on any of the supported streaming frameworks. In this paper, we present a novel benchmark architecture for comparing the performance impact of using Apache Beam on three streaming frameworks: Apache Spark Streaming, Apache Flink, and Apache Apex. We find significant performance penalties when using Apache Beam for application development in the surveyed systems. Overall, usage of Apache Beam for the examined streaming applications caused a high variance of query execution times with a slowdown of up to a factor of 58 compared to queries developed without the abstraction layer. All developed benchmark artifacts are publicly available to ensure reproducible results.

Index Terms—Data Stream Processing, Abstraction Layer, Performance Benchmarking

I. INTRODUCTION

While the world of Big Data analysis presents a multitude of opportunities, it also comes with unique obstacles. For software developers, who are tasked with building Big Data applications, the need to work with many different frameworks, Application Programming Interfaces (APIs), programming languages, and software development kits (SDKs) can be challenging. Especially as the long-lasting paradigm "one size fits all" seems to be obsolete, which Stonebraker and Çetintemel [1] already predicted back in 2005, handling and keeping an overview of the multitude of technologies becomes more difficult. Furthermore, flexibility is crucial in our rapidly changing world in order to maintain or establish a leading position. An example for that is a desired change of an IT system for reasons such as variations in pricing policy, changed volume or velocity of data that needs to be processed, or altered performance characteristics of execution engines. Ideally, this adaption should happen at minimal costs, which is a challenging task if each system uses its own APIs.

The wide variety of available tools in the area of stream processing has spurred the development of an abstraction layer, which allows defining programs independent of the technologies used. This layer is the open source project Apache Beam [2], which provides a unified programming model for describing both batch and streaming data-parallel processing pipelines. Pipelines are described using a single Software Development Kit (SDK) and can then be executed by a variety of different frameworks, without developers needing detailed knowledge of the employed implementations. Thus, execution frameworks can be exchanged without the need to adapt code. As an additional benefit, Apache Beam enables benchmarking multiple systems with a single implementation. Conceptually, this idea can be compared to object-relational mapping (ORM), where data stored in database tables is encapsulated in objects. Data can be queried and manipulated just by using these objects instead of writing SQL [3].

A question concerning abstraction layers is if their usage has consequences on the performance characteristics of an application. If introduced performance penalties are too high, they might outweigh the gained benefits. Great performance impact variations between systems can, with regard to performance benchmarking, lead to a distorted result. Thus, it is crucial to understand and quantify the impact of used abstraction layers. The following contributions are presented in this paper:

• We give a description of Apache Beam as well as the three data stream processing systems (DSPSs) used for the conducted measurements, which are Apache Flink [4], Apache Spark Streaming [5], and Apache Apex [6].
• We propose a lightweight benchmark architecture for comparing DSPSs. This benchmark covers information on the process as well as data, query, and metric aspects. All developed artifacts are available online, which ensures transparency and reproducibility.
• We present the measurement results with focus on the performance impact of Apache Beam. That is done by comparing the query implementations using native system APIs with the implementations using Apache Beam.

The remainder of this paper is structured as follows: Section II describes the employed technologies. Section III illustrates the benchmark environment and the performance results. Section IV gives an overview of related work and Section V concludes, giving a summary on results and highlighting areas for future work.

Accepted at 2019 IEEE International Conference on Distributed Computing Systems (ICDCS). The final authenticated version is available online: DOI 10.1109/ICDCS.2019.00137. Copyright ©2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

II. DATA STREAM PROCESSING SYSTEM TECHNOLOGIES

This section describes the technologies used for the presented measurements.

A. Apache Beam

Apache Beam [2] describes itself as a unified programming model, which allows defining batch and stream processing applications. For that, Apache Beam SDKs are provided. Currently, three SDKs are part of the Apache Beam repository: a Java SDK, a Python SDK, and a Go SDK [7].

Instead of developing an application for a single DSPS, Apache Beam allows writing programs that are compatible with any supported execution engine. Engine-specific runners translate the Apache Beam code to the target runtime. Using such an abstraction layer theoretically allows, e.g., for an arbitrary exchange of engines without the need of code adaption. Central elements of the Apache Beam SDK are:

• Pipeline represents the entire application definition, including data input, transformation, and output.
• PCollection embodies a distributed data set that can be either bounded or unbounded. The latter is used for data stream processing applications.
• PTransform is an abstraction for data transformation. It receives one or more PCollection objects and applies the transformation on this data. That leads to an output of zero or more PCollection objects. Moreover, read or write operations on external storage systems are realized with PTransform objects. Apache Beam provides some core transforms. Selected ones are outlined in the following:
  – ParDo is an element-by-element processing of data, whereby the processing of a single element can lead to zero or more output elements. In addition to standard operations like map or flat map, a ParDo also supports aspects such as side inputs and stateful processing.
  – GroupByKey, as the name already states, processes key-value pairs and collects all values belonging to the same key. It is an aggregation operation that outputs pairs consisting of a key and a collection of values that belong to this key. For use with data streams, one must use an aggregation trigger or non-global windowing in order to enable the grouping to be applied to a finite data set.
  – Flatten merges the data of multiple PCollection objects that contain data of the same type into a single PCollection [8]–[10].

Next to Apache Flink, Apache Spark, and Apache Apex, other frameworks supporting Apache Beam exist [11]. These are Apache Gearpump [12], Apache Hadoop MapReduce [13], Apache Samza [14], Alibaba JStorm [15], IBM Streams [16], and Google Cloud Dataflow [17]. This group covers both closed source systems, e.g., Google Cloud Dataflow and IBM Streams, as well as open source systems, e.g., Apache Flink and Apache Spark. Hence, Apache Beam can be seen as a widely spread project with a high relevance.

Apache Beam itself resulted out of the donation of the Cloud Dataflow SDKs and programming model [10] to the Apache Software Foundation [18]. Therefore, as mentioned before, Google Cloud Dataflow is one supported system.
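To make these concepts concrete, the following minimal Java sketch defines a pipeline with a grep-like ParDo. It is not taken from the benchmark implementations (which are provided in the archive linked in Section III); class and element names are chosen purely for illustration. The target engine is selected via the pipeline options, so the pipeline definition itself stays engine-independent:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class GrepExample {
  public static void main(String[] args) {
    // The runner (e.g., FlinkRunner, SparkRunner, ApexRunner) is chosen
    // via the options; the pipeline code below does not change.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline pipeline = Pipeline.create(options);

    // A bounded PCollection from in-memory data keeps the sketch self-contained.
    PCollection<String> lines = pipeline.apply(Create.of("foo", "a test line"));

    // ParDo: element-by-element processing, zero or more outputs per input.
    lines.apply(ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        if (c.element().contains("test")) {
          c.output(c.element());
        }
      }
    }));

    pipeline.run().waitUntilFinish();
  }
}
```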
B. Apache Flink

Apache Flink is an open source system with batch and stream processing capabilities. It offers a Java and a Scala API for developing applications. Additionally, there are multiple libraries on top of Apache Flink that provide, e.g., machine learning or graph processing functionalities [4], [19]. The architecture of an Apache Flink cluster is shown in Figure 1.

Fig. 1. Architecture of an Apache Flink Runtime (based on [4], [19])

Figure 1 shows an Apache Flink Client, a Job Manager, and Task Managers. When a program is deployed to the system, the client transforms it into a dataflow graph, i.e., a directed acyclic graph (DAG), and sends it to the Job Manager. The client itself is not part of the program execution and can, after transmitting the dataflow graph, either disconnect from the Job Manager or stay connected in order to receive information about the execution progress.

The Job Manager or master is responsible for scheduling work amongst the Task Manager instances and for keeping track of the execution. There can be multiple Job Manager instances whereas only one Job Manager can be the leader. Others would be standby and could take over in case of failure.

The Task Manager instances execute the assigned parts of the program. Technically, a Task Manager is a JVM process. There must be at least one Task Manager in an Apache Flink installation. Thereby, they exchange data amongst each other where needed. Each Task Manager provides at least one task slot in which subtasks are executed in multiple threads. A task slot can be shared by multiple subtasks as long as they belong to the same application, even if they are part of different tasks. While one task is executed by one thread, Apache Flink chains multiple operator subtasks into a single task, such as two subsequent map operations. A benefit of this optimization is, e.g., a reduced overhead for inter-thread communication. Every task slot has a subset of the resources that belong to its corresponding Task Manager. Particularly, the available memory is split amongst task slots. CPU separation does not happen in the current Apache Flink version [4], [19], [20].
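For comparison with the Apache Beam sketch above, a grep-like query expressed directly against the native Apache Flink DataStream API could look as follows. This is a hedged, self-contained sketch, not the benchmark's actual query implementation; in the benchmark, the source and sink would be Apache Kafka:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FlinkGrepExample {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();

    // In-memory elements keep the sketch self-contained.
    DataStream<String> lines = env.fromElements("foo", "a test line");

    // A grep-like query: keep only records containing the search string.
    lines.filter(line -> line.contains("test")).print();

    env.execute("grep");
  }
}
```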
C. Apache Spark Streaming

Apache Spark is another open source system for distributed data processing. It offers, next to batch processing functionalities, stream processing features as part of its library Apache Spark Streaming. However, stream processing is implemented using micro-batches, i.e., it is not a tuple-by-tuple processing as in Apache Flink. Apache Spark Streaming applications can be written in Java, Scala, or Python. Besides Apache Spark Streaming, there are other libraries built on top of Apache Spark, e.g., similar to Apache Flink's ecosystem, a library for machine learning as well as for graph processing [5], [21].

The architecture of an Apache Spark installation is shown in Figure 2. An application is executed in the form of multiple independent processes distributed across a cluster. The SparkContext coordinates these processes. This coordinator is an object in the main() function of the application, which is called the Driver Program. Moreover, the SparkContext connects to a Cluster Manager that takes care of resource allocation. Currently, there are four Cluster Managers supported by Apache Spark: Spark Standalone, Apache Mesos [23], Apache Hadoop YARN (Yet Another Resource Negotiator) [24], and Kubernetes [25]. As soon as a connection is established, the SparkContext acquires so-called executors on the Worker Node instances. Each executor is a process belonging to exactly one application, which stores data and performs computations. So different applications running on the same Apache Spark cluster are executed in different JVMs, which is different to, e.g., the execution concept in the previously illustrated Apache Flink. Thus, data cannot be exchanged between different Apache Spark applications without making use of an external storage system.

Once executors are acquired, the SparkContext transmits the program in the form of a JAR or Python files to them. Afterwards, it sends tasks to the executor processes. One process can run multiple tasks in several threads [22], [26].

A central data structure that is used in Apache Spark is the Resilient Distributed Dataset (RDD). An RDD can be viewed as a distributed memory abstraction. To be more concrete, it is a partitioned and read-only collection of records. Apache Spark Streaming leverages a processing model called discretized streams (D-Streams). Such a D-Stream is a sequence of RDDs. An incoming data stream is divided into batches stored in RDDs. Data transformations are then performed on these RDDs, which again output a D-Stream [27], [28].

Fig. 2. Architecture of Apache Spark in Cluster Mode (based on [19], [22])
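The micro-batch model described above can be sketched with the Java API of Apache Spark Streaming. Again, this is an illustrative sketch under assumed inputs, not the benchmark code; the socket source, host, and port are placeholders, whereas the benchmark reads from Apache Kafka:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class SparkGrepExample {
  public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf().setAppName("grep");
    // Each batch interval produces one RDD of the D-Stream.
    JavaStreamingContext jssc =
        new JavaStreamingContext(conf, Durations.seconds(1));

    // Placeholder source to keep the sketch self-contained.
    JavaReceiverInputDStream<String> lines =
        jssc.socketTextStream("localhost", 9999);

    // Transformations are applied per micro-batch, i.e., per RDD.
    JavaDStream<String> matches = lines.filter(line -> line.contains("test"));
    matches.print();

    jssc.start();
    jssc.awaitTermination();
  }
}
```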

D. Apache Apex

Apache Apex is based on Apache Hadoop [29] with its components Apache Hadoop YARN and Hadoop Distributed File System (HDFS) [30]. Similar to Apache Flink, it offers batch as well as stream processing functionalities. Moreover, stream processing is also implemented in a way of processing data in a tuple-by-tuple fashion [31], [32].

Apex Malhar, built on top of Apex Core, is a library containing different input/output operators and compute operators. The former group includes, e.g., connectors to Apache Kafka [33] and other messaging systems [6].

The high-level architecture of Apache Hadoop 2, which is the version that is used for the presented measurements, is depicted in Figure 3. HDFS at the bottom is a distributed file system, as its name already states, and serves as the storage layer. Apache Hadoop YARN acts as a resource manager on top of HDFS. On top of YARN, there are multiple data processing frameworks available from various areas such as batch or stream processing. Two of those are Hadoop MapReduce and Apache Apex. The latter is the stream processing system used for measurements described in this paper [34].

Fig. 3. Architecture of Apache Hadoop 2 (based on [34])

The two other DSPSs used for those measurements, i.e., Apache Spark (Streaming) and Apache Flink, can also run on Apache Hadoop YARN as one of multiple deployment options [35], [36]. However, for the presented measurements we use the standalone cluster deployment option that is available in both systems, Apache Flink and Apache Spark.

Figure 4 illustrates the architecture of an Apache Hadoop YARN deployment. The depicted installation runs one application. Its components are marked with dashed lines.

Fig. 4. Architecture of an Apache Hadoop YARN Deployment (based on [24], [37])

Two major components are part of Apache Hadoop YARN, a Resource Manager and Node Manager instances. Both are daemons running on defined nodes. A client submits an application to the Resource Manager, which is responsible for distributing cluster resources amongst applications. Particularly, the Resource Manager allocates so-called containers on dedicated cluster nodes for applications. A container is defined as a logical bundle of resources, e.g., a bundle of 4GB RAM and 1 CPU, that is tied to a certain node. Communication between the Resource Manager and the Node Manager daemons happens via a heartbeat mechanism [24].

There is one special container spawned for each program, the Application Master. It manages application execution with respect to, e.g., resource needs, execution flow, or fault handling. The Application Master can be written in any programming language, though many applications make use of higher-level frameworks such as MapReduce or Apache Apex. Other containers may communicate directly with the Application Master if necessary. This communication would need to be managed by the application as YARN does not arrange that. The Application Master implemented in Apache Apex is called Streaming Application Manager (STRAM) [24], [38].
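An Apache Apex application is assembled as a DAG of operators. The following hedged Java sketch shows the general shape of such an application with a custom tuple-by-tuple grep operator; it is an illustration only, with names invented for the example, and omits the input/output operators (e.g., the Apex Malhar Kafka connectors) that the benchmark would wire in via dag.addStream(...):

```java
import com.datatorrent.api.DAG;
import com.datatorrent.api.DefaultInputPort;
import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.api.StreamingApplication;
import com.datatorrent.common.util.BaseOperator;
import org.apache.hadoop.conf.Configuration;

public class GrepApplication implements StreamingApplication {

  // A minimal tuple-by-tuple operator: emits only matching records.
  public static class GrepOperator extends BaseOperator {
    public final transient DefaultOutputPort<String> output =
        new DefaultOutputPort<>();
    public final transient DefaultInputPort<String> input =
        new DefaultInputPort<String>() {
          @Override
          public void process(String tuple) {
            if (tuple.contains("test")) {
              output.emit(tuple);
            }
          }
        };
  }

  @Override
  public void populateDAG(DAG dag, Configuration conf) {
    // Source and sink operators would be added here and connected
    // to the grep operator via dag.addStream(...).
    dag.addOperator("grep", new GrepOperator());
  }
}
```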

E. Similarities and Differences Between the Presented Systems

Table I gives an overview of the presented DSPSs. Apache Beam is not listed as it is not a DSPS but an SDK for developing stream processing programs.

TABLE I
COMPARISON OF APACHE FLINK, APACHE SPARK STREAMING, AND APACHE APEX (BASED ON [19], [39]–[41])

Criteria                      | Apache Flink        | Apache Spark Streaming | Apache Apex
Mainly Written in             | Java, Scala         | Scala, Java, Python    | Java
Languages for App Development | Java, Scala, Python | Scala, Java, Python    | Java
Data Processing               | Tuple-by-tuple      | Batch                  | Tuple-by-tuple
Processing Guarantees         | Exactly-once        | Exactly-once           | Exactly-once

All systems are mainly developed in a JVM language, specifically Java or Scala [39]–[41]. Programs can be written in either Java, Scala, or Python on Apache Flink and Apache Spark Streaming. Apache Apex only offers the possibility to write applications in Java. However, there are plans to also support the development of programs in Scala (https://malhar.atlassian.net/browse/APEX-175). With respect to data processing characteristics, Apache Flink and Apache Apex process data in a tuple-by-tuple fashion. Contrary, Apache Spark Streaming makes use of a batch processing approach. Furthermore, all three systems guarantee exactly-once processing, i.e., each input tuple is processed exactly once. This ensures correct results also in recovery scenarios.

III. PERFORMANCE ANALYSIS

This section describes the conducted performance analysis. First, the general benchmark setup as well as the data used for the benchmark are presented. Afterwards, the executed queries are highlighted. Lastly, the performance results are discussed. That particularly includes an analysis of the execution times, standard deviation, and a detailed view of the performance impact of Apache Beam. Query implementations and other used programs and scripts can be found online (http://hpi.de/fileadmin/user_upload/fachgebiete/plattner/publications/papers/gh/StreamBenchOnApacheBeamBenchmark.zip).

A. Benchmark Architecture and Process

The proposed benchmark setting is depicted in Figure 5. The benchmark process is divided into three separate and consecutive phases. The components that are involved in the corresponding part of the process are marked in the figure by dashed curly brackets.

Fig. 5. Overview About the General Benchmark Architecture and Process Based on Fundamental Modeling Concepts (FMC)

On the left-hand side there is the input data, which is read by the data sender and forwarded to the message broker, which in particular is Apache Kafka. The data sender is a program written in Scala with multiple configuration parameters such as the data ingestion rate or the level of Kafka Producer acknowledgments. The system under test (SUT) on the right-hand side, i.e., the DSPS to be benchmarked, executes the implemented query. Thereby, it reads from and writes to the message broker. Moreover, there is a result calculator tool developed in Scala that reads the query output from the message broker and leverages Apache Kafka functionality for the execution time computation. The three different benchmark process steps marked in Figure 5 are described in the following:

1) Data Ingestion: Firstly, data is inserted into an Apache Kafka topic using the data sender. Particularly, 1,000,001 records of the AOL Search Query Log [42] dataset that is also used in [43] are sent. The input topic is created with a replication factor of one and one partition in order to ensure the correct order of messages, since Apache Kafka only guarantees the correct order for entries within one partition [44]. The data file consists of records with five tab-separated columns. These columns contain a user ID, the query issued by the user, the time at which the query was issued, the search result rank the user clicked on if applicable, and the search result Uniform Resource Locator (URL) the user clicked on if applicable [42].
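The actual data sender is the Scala tool mentioned above and is part of the provided archive. Purely as a hedged illustration of the ingestion mechanism, the following Java sketch sends a tab-separated input file to Kafka; the broker address, topic name, and file name are placeholders, not the benchmark's configuration:

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DataSender {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put("bootstrap.servers", "kafka-node-1:9092"); // placeholder broker
    // The acknowledgment level is one of the data sender's parameters.
    props.put("acks", "1");
    props.put("key.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      // Each line of the input file is one tab-separated record.
      for (String line : Files.readAllLines(Paths.get("aol-search-log.tsv"))) {
        producer.send(new ProducerRecord<>("input-topic", line));
      }
    }
  }
}
```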
2) Program Execution: During the execution phase, each query is run ten times for each execution setup. Meanwhile, there are no other programs executed on the system and each system is restarted at the beginning of this benchmarking step. The stream processing program computes the output and sends it to an Apache Kafka topic that is also created with a replication factor of one and one partition for the same reasons as mentioned previously. Each query is executed with a parallelism of one and two, ten times each. Moreover, every query is implemented using the APIs provided by the DSPS as well as using Apache Beam. So for each query, as we analyzed three different systems, there are twelve different query execution setups, as can be seen in Section III-C.

The mentioned parallelism is set differently depending on the system and the used APIs. Regarding Apache Flink, it is configured using the command line option -p or --parallelism that is offered when submitting an application [45]. For programs executed on Apache Spark Streaming, the configuration parameter spark.default.parallelism is used [46]. Apache Apex does not provide an option for configuring parallelism, so instead the number of VCOREs is set accordingly in the Apache Hadoop YARN configuration (http://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-common/yarn-default.xml) as well as within the Apache Apex application as a DAG attribute [47]. The approach of using this configuration is also applied for the programs running on Apache Apex that are developed using Apache Beam APIs. Details can be found in the mentioned archive that is provided online.
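For illustration, both systems also expose programmatic counterparts to these settings. The following hedged Java fragment is a sketch, not the benchmark's actual submission setup, and assumes both frameworks are on the classpath:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.spark.SparkConf;

public class ParallelismConfig {
  public static void main(String[] args) {
    // Apache Flink: programmatic counterpart of the -p/--parallelism
    // command line option used when submitting an application.
    StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();
    env.setParallelism(2);

    // Apache Spark: the configuration parameter mentioned above.
    SparkConf conf = new SparkConf().set("spark.default.parallelism", "2");
  }
}
```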
3) Result Calculation: Lastly, the result records are read from Apache Kafka for each query and the time difference between the firstly inserted and the lastly inserted record is computed. Apache Kafka is configured to use LogAppendTime, i.e., the timestamp when a record is appended to the Apache Kafka log is stored together with the record itself [44]. For execution time calculation, we use these timestamps, which allows keeping the measurements application- and system-independent. That is a crucial benefit with respect to result correctness as definitions of performance criteria vary among systems. Thus, one cannot rely on performance data provided by DSPSs [48]. The overhead between having the correct result computed within the SUT and having it appended to the Apache Kafka log is identical for every system and hence, results are comparable.
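The actual result calculator is a Scala tool contained in the archive. As a hedged sketch of the principle only, the following Java consumer derives the execution time from the broker-side LogAppendTime timestamps; the broker address, topic name, and expected record count are placeholders:

```java
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ResultCalculator {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "kafka-node-1:9092"); // placeholder broker
    props.put("group.id", "result-calculator");
    props.put("auto.offset.reset", "earliest");
    props.put("key.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");

    long first = Long.MAX_VALUE;
    long last = Long.MIN_VALUE;
    long seen = 0;
    long expected = 1_000_001; // placeholder; depends on the query's output size

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(Collections.singletonList("output-topic"));
      while (seen < expected) {
        ConsumerRecords<String, String> records = consumer.poll(1000);
        for (ConsumerRecord<String, String> record : records) {
          // With LogAppendTime, timestamp() is the broker-side append time.
          first = Math.min(first, record.timestamp());
          last = Math.max(last, record.timestamp());
          seen++;
        }
      }
    }
    System.out.println("Execution time in ms: " + (last - first));
  }
}
```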
With regard to the hard- and software setup, virtual machines are used for all nodes. Apache Kafka version 2.11-0.10.1.0 is installed on a three node cluster with 64GB main memory and an Intel(R) Xeon(R) E5-2697 v3 CPU @ 2.60GHz with eight cores each. The DSPSs are installed on a two node cluster where both nodes act as worker nodes or the equivalent. These two nodes are identical to the Apache Kafka nodes with regard to both main memory and CPU. Ubuntu 14.04 is installed as the operating system on all virtual machines. Regarding system and framework versions, Apache Apex 3.7.0, Apache Hadoop 2.7.3, Apache Spark 2.3.0, Apache Flink 1.4.0, and Apache Beam 2.3.0 are used. The configuration files for the different systems can be found in the previously linked archive.

B. Benchmarked Queries

The executed queries are taken from the StreamBench [43] benchmark. StreamBench defines seven different queries. Four of these are stateless, i.e., it is not required to keep a state for producing the correct answer. The remaining three queries are stateful. The stateless queries used for the benchmark presented in this paper are listed in Table II. Stateful queries are excluded as Apache Beam does not support stateful processing when executed on Apache Spark, see [11].

TABLE II
OVERVIEW OF THE BENCHMARK QUERIES (BASED ON [43])

Identity: Read input and output it without performing any data transformation. Can be seen as a baseline query with respect to computational complexity.
Sample: Read input and output only a certain percentage of data that is randomly chosen. The number of output tuples is about 40% of the number of input tuples.
Projection: Read input and output only a certain column of the input record. In the presented measurements, the values of the first column are chosen for being included in the output.
Grep: Read input and output only records that match a certain regex. The search string used for the measurements is "test", which leads to an output of 3,003 records or about 0.3% of the number of input records.

C. Performance Results

This section illustrates the performance results regarding execution times and the performance impact of Apache Beam.

1) Execution Times: The following charts visualize the measured average execution times. On the y-axis, the combinations of system, parallelism, and kind of implementation, i.e., using Apache Beam or system APIs, are listed. The x-axis shows the times in seconds. The letter P stands for parallelism.

Figure 6 shows the results for the identity query. It can be seen that the query implementations using Apache Beam are slower compared to the implementations using the APIs provided by the corresponding system in all cases. That is true for almost all measurements presented in the following.

The overall shortest execution time belongs to queries run on Apache Spark Streaming, closely followed by Apache Apex and Apache Flink, both of which have a noticeably slower average runtime for one kind of parallelism. That could be due to outliers in the corresponding series of runs. Details on that can be found in Section III-C2.

When looking at the runtimes of queries implemented using Apache Beam, differences are much larger. Apache Beam queries running on Apache Apex have by far the highest execution times with around 240s. So the differences between the execution times of the analyzed systems are significantly higher for the queries implemented using Apache Beam compared to those developed using native system APIs. In comparison to these variances, distinctions between parallelism factors are very small.

Nevertheless, the result difference between parallelism factors of the Apache Beam query running on Apache Spark Streaming is noticeable. The average execution time for the parallelism of two is close to 70% higher compared to the one for the parallelism factor of one. As the relative standard deviation for these benchmark runs is low, as illustrated later on in Figure 10, that is not due to outliers. A reason for this observation could be the introduced overhead with respect to, e.g., data transfer, that comes with splitting up tasks and that may not pay off for simple queries like the identity query.

Fig. 6. Average Execution Times - Identity Query. Measured averages (in s): Spark P1 3.26, P2 3.23; Spark Beam P1 7.51, P2 12.75; Flink P1 6.52, P2 3.74; Flink Beam P1 30.28, P2 32.97; Apex P1 3.35, P2 5.71; Apex Beam P1 237.53, P2 241.01.

Figure 7 displays the results for the sample query. Again, it can be seen that implementations using native system APIs outperform those using Apache Beam. Moreover, results of queries using the system APIs do not differ significantly between the analyzed systems and parallelism factors. Compared to identity query results, times are slightly lower overall, which could be a result of the lower number of output records as described in Section III-B. The Apex Beam implementation is an exception as there is a major difference. To be more concrete, the average execution times for the sample query amount to only about 50% of the identity query times.

With average execution times of about 2.09s and 3s for the sample query developed using Apache Flink APIs for parallelisms of one and two respectively, these numbers are below the corresponding times for Apache Apex. Thus, the performance ranking between systems is identical for the sample query. That means, the times for Apache Spark are lowest, followed by those of Apache Flink and Apache Apex for both kinds of implementation.

Fig. 7. Average Execution Times - Sample Query. Measured averages (in s): Spark P1 2.23, P2 2.16; Spark Beam P1 11, P2 11.48; Flink P1 2.09, P2 3; Flink Beam P1 26.62, P2 26.88; Apex P1 4.1, P2 3.55; Apex Beam P1 118.74, P2 125.67.

The projection query results are shown in Figure 8. They are similar to the numbers for the identity query in all aspects. This closeness leads to the conclusion that splitting a string and accessing one column of the resulting list does not introduce a noticeable overhead. Regarding the number of output tuples, both queries are identical, though the tuple size for the projection query is smaller as only a subset of columns is sent to the output topic. However, this reduction in output size does not have a noticeable impact on the times either.

Fig. 8. Average Execution Times - Projection Query. Measured averages (in s): Spark P1 3.18, P2 3.48; Spark Beam P1 10.07, P2 14.73; Flink P1 6.1, P2 5.47; Flink Beam P1 33.54, P2 33.33; Apex P1 4.75, P2 3.52; Apex Beam P1 229.91, P2 241.35.

Figure 9 visualizes the results for the grep query measurements. These times are overall the lowest ones whereas there are differences between systems and used APIs. Especially the implementations using the native APIs offered by Apache Spark Streaming and Apache Flink have relatively low execution times in comparison to the corresponding numbers for the other three queries. With about 20s, the Apache Beam version for Apache Flink is close to 7s faster than the corresponding sample query result with about 27s. There is no noticeable difference between parallelism factors.

Contrary to Apache Flink, the average execution times for queries developed using Apache Beam and running on Apache Spark Streaming differ amongst parallelism factors. Similar to the corresponding times for the identity query depicted in Figure 6, the average execution time for a parallelism factor of two is noticeably higher. In particular, with absolute times of about 11.8s and 6.34s, a parallelism factor of two slows down the average execution time by more than 85% in comparison to the time measured for a parallelism of one. Reasons for that can be as described for the identity query results.

A surprising result is the Apex Beam performance for Apache Apex. While the times for the native Apache Apex implementation are about on the same level as the corresponding results for all other queries, the ones for the query developed using Apache Beam are drastically lower. For the projection and the identity query, Apex Beam results are approximately between 230s and 240s. With about 120s, the sample query performance is already significantly better. However, with 2.58s and 3.76s, the execution times for the Apex Beam grep query implementation are orders of magnitude lower.

A reason for the relatively low execution times could lie in the number of output records and the resulting smaller effort that is needed for emitting query outcomes. To be more concrete, the output for the grep query is significantly lower than for the other three queries, though, as described in Section III-B, the sample query already outputs fewer tuples than the projection and the identity query.
Fig. 9. Average Execution Times - Grep Query. Measured averages (in s): Spark P1 1.28, P2 1.21; Spark Beam P1 6.34, P2 11.8; Flink P1 1.58, P2 1.43; Flink Beam P1 20.03, P2 20.46; Apex P1 3.58, P2 3.37; Apex Beam P1 3.76, P2 2.58.

2) Standard Deviation in Measured Execution Times: Figure 10 visualizes the relative standard deviations for the measurements. These values are calculated for every system-query-SDK combination. By SDK it is distinguished between using Apache Beam or native system APIs for application development. Deviations for the two parallelism factors are averaged and condensed in this way. This is done since separate visualizations for different parallelisms would not reveal any further insights. Additionally, the reduced number of values simplifies analysis of standard deviations.

Fig. 10. Relative Standard Deviation for System-Query-SDK Combinations. Values: Spark: Identity 0.15, Sample 0.2, Projection 0.23, Grep 8.16 · 10^-2; Spark Beam: Identity 9.14 · 10^-2, Sample 5.51 · 10^-2, Projection 9.32 · 10^-2, Grep 4.3 · 10^-2; Flink: Identity 0.54, Sample 0.23, Projection 8.7 · 10^-2, Grep 0.11; Flink Beam: Identity 3.12 · 10^-2, Sample 4.89 · 10^-2, Projection 6.25 · 10^-2, Grep 4.43 · 10^-2; Apex: Identity 0.15, Sample 9.12 · 10^-2, Projection 0.11, Grep 9.04 · 10^-2; Apex Beam: Identity 3.15 · 10^-2, Sample 0.14, Projection 4.57 · 10^-2, Grep 0.12.

There is one value that is notably higher than others, which belongs to the identity query executed on Apache Flink. Figure 6, which visualizes the identity query execution times, makes visible that there is a noticeable difference between the numbers for the two parallelism factors for Apache Flink. Particularly, the execution time for Apache Flink with a parallelism of one is almost 75% higher than the one for Apache Flink with a parallelism of two. Although it seems plausible at a first glance that higher parallelism leads to better performance, this correlation is absent for other results.

Table III shows the execution times for the benchmark runs of the identity query on Apache Flink, i.e., numbers for the corresponding ten runs with a parallelism of one as well as for the ten runs with a parallelism of two. When looking at these measurements it becomes clear that there are two to three outliers that cause this relatively high coefficient of variation. While results for the higher parallelism are relatively homogeneous, there are outliers in the list of execution times for runs with a parallelism of one. Particularly, seven out of ten execution times range from three to four seconds. The results for the remaining benchmark runs differ significantly. To be more concrete, these runs lasted about 6s, 12.5s, and 21.5s. The highest execution time, e.g., is more than seven times higher than the lowest one. These outliers cause the comparatively high relative standard deviation. Apart from the identity query executed on Apache Flink, there are no further values that stand out in Figure 10.

TABLE III
EXECUTION TIMES FOR THE IDENTITY QUERY ON APACHE FLINK

Number of Run | Parallelism = 1 | Parallelism = 2
 1 |  6.25s | 4.15s
 2 | 21.56s | 3.77s
 3 |  3.42s | 2.71s
 4 |  3.31s | 5.29s
 5 |  3.73s | 3.00s
 6 | 12.69s | 3.93s
 7 |  3.90s | 2.90s
 8 |  3.96s | 3.66s
 9 |  3.42s | 3.57s
10 |  3.01s | 4.45s

3) Performance Impact of Apache Beam: The performance impact factors are calculated based on the arithmetic means of the execution times. These times are measured as defined in Section III-A.
The averages are determined as follows:

\bar{t}(dsps, query, k, p) = \frac{1}{N_{run}} \sum_{r=1}^{N_{run}} t(dsps, query, k, p, r),

where \bar{t}(dsps, query, k, p) denotes the average over the execution times for a certain data stream processing system, query, kind of implementation, i.e., using Apache Beam or native system APIs, and a certain parallelism. Variable k represents the mentioned kind of implementation and p stands for the used degree of parallelism. The number of benchmark runs is expressed as N_{run}, which is equal to ten for the context of this paper. The execution time for a single query run of a certain benchmark scenario is shown as t(dsps, query, k, p, r).

The slowdown factor calculation makes use of these arithmetic means. Specifically, it is computed as follows:

sf(dsps, query) = \frac{1}{N_p} \sum_{p=1}^{N_p} \frac{\bar{t}(dsps, query, Beam, p)}{\bar{t}(dsps, query, native, p)},

where sf(dsps, query) denotes the slowdown factor for a given data stream processing system and a given query. N_p depicts the number of parallelisms tested, which equals two in the previously discussed benchmark scenario, particularly a parallelism factor of one as well as a parallelism of two. So in simplified terms, the ratio of average execution times for Apache Beam implementations and those using native system APIs is calculated and again averaged over parallelisms, all for a given query and data stream processing system combination. Concretely, the average execution times for a certain system, query, and parallelism are determined, separately for the Apache Beam version as well as the implementation using native system APIs. The average execution time belonging to the Apache Beam variant is then divided by the corresponding average for the native query. That is done for every parallelism. The resulting factors for each parallelism are finally averaged by dividing their sum by the number of parallelisms.

All in all, the result tells how much slower or faster the Apache Beam version for a certain query and data stream processing system performed in the conducted measurements. That is with regard to execution times as defined in Section III-A and independent of parallelism. A result greater than one marks a slowdown, whereas a result smaller than one means that the Apache Beam implementation was faster than the one using native system APIs.
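As a worked example, using the rounded per-parallelism averages from Figure 6 for the identity query on Apache Apex:

sf(Apex, Identity) ≈ \frac{1}{2} \left( \frac{237.53}{3.35} + \frac{241.01}{5.71} \right) ≈ \frac{1}{2} (70.90 + 42.21) ≈ 56.6,

which agrees, up to rounding of the chart values, with the slowdown factor of 56.58 reported in Figure 11.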
The results for the computed slowdown factors are visualized in Figure 11. It can be seen that Apache Beam implementations are slower for almost all DSPSs and queries in comparison to those developed using native system APIs.

Generally, one can recognize differences between the studied systems and queries. When looking at Apache Flink and Apache Spark, factors are similar, especially with respect to relative distinctions amongst queries. Particularly, the performance penalty for the fastest query, namely the grep query, is highest. Accordingly, it is lowest for the longest-running queries projection and identity for both systems. Overall, the performance impact on Apache Flink is slightly higher.

In contrast, the Apache Apex results show a different pattern. The highest performance impact can be seen, contrary to Apache Flink and Apache Spark, for the longest-running queries projection and identity. The query with the shortest execution time, the grep query, is overall the only query where the Apache Beam implementation is even faster than the one using native system APIs according to the calculated slowdown factor. However, this speedup is very low, i.e., it is about as fast as the implementation without Apache Beam.

When looking at the absolute slowdown factors, there are also noticeable differences between Apache Apex and the two other analyzed systems. Except for the grep query slowdown, all slowdown factors are significantly higher compared to Apache Flink or Apache Spark Streaming. The slowdown factor for the projection query, e.g., is about 58 and so more than four times higher than the highest slowdown factor for either Apache Flink or Apache Spark Streaming.

Fig. 11. Slowdown Factor for the Analyzed Systems and Queries. Computed factors sf(dsps, query): Spark: Identity 3.13, Sample 5.13, Projection 3.7, Grep 7.37; Flink: Identity 6.73, Sample 10.87, Projection 5.79, Grep 13.51; Apex: Identity 56.58, Sample 32.17, Projection 58.46, Grep 0.91.

Summarizing, the conducted benchmark shows that Apache Beam has a negative performance impact for almost all scenarios. Averaged over systems, the performance penalty is lowest on Apache Spark, closely followed by Apache Flink. Patterns between these two systems and executed queries are similar. The performance impact on Apache Apex is much different, meaning the impact is significantly higher in most of the cases and the previously mentioned pattern is vice-versa. So the more output or the higher the execution time on Apache Apex, the higher the impact of Apache Beam on performance. The grep query running on Apache Apex is an exception to that, as explained before. Except for this exceptional case, slowdown factors range from about three to almost 60. Thus, in most of the studied cases, Apache Beam has a significant influence on performance when looking at the calculated slowdown factors.

Figure 12 and Figure 13 visualize the execution plans for the grep query executed with a parallelism of one on Apache Flink, implemented without and with Apache Beam respectively. Information on execution plans is retrieved from the Apache Flink system and visualized using the Apache Flink Plan Visualizer (https://flink.apache.org/visualizer/). These two plans serve as an example highlighting differences between the execution of applications developed with native APIs and those using Apache Beam.

The first execution plan depicted in Figure 12 contains three elements, a data source, an operator, and a data sink. Particularly, the source is shown as a custom source, the sink as an unnamed sink, and the operator is a filter, which fits the definition of the grep query as it basically filters data. Data is forwarded along these three elements.

Fig. 12. Apache Flink Execution Plan for the Grep Query

The second execution plan is presented in Figure 13. It comprises seven elements in total. In particular, these elements are a data source followed by six operators. The data source at the beginning is named PTransformTranslation.UnknownRawPTransform. PTransformTranslation is a registry of familiar transforms and uniform resource names (URNs) [9]. As outlined in Section II-A, a PTransform is used, e.g., for reading or writing to an external storage system [8].

The data source forwards data to the first operator, a flat map, which performs an action on each input value and produces zero or more output values. Its Apache Beam counterpart is the read() method of the KafkaIO class, which creates a Read PTransform. The remaining five ParDoTranslation.RawParDo operators follow the flat map. A ParDoTranslation comprises tools for working with instances of ParDo. A ParDo is one of the core transforms provided by Apache Beam and described in Section II-A. The first ParDo represents calling withoutMetadata() on the Read PTransform, which drops the Kafka metadata as it is not needed. Moreover, the method again returns a PTransform containing a PCollection of key-value pairs. The downstream operator represents the call of the create() method belonging to the class Values. This operator takes the previously created PCollection of key-value pairs and returns a PCollection containing only the values. Further downstream, the grep query logic is applied and resulting values are sent to Apache Kafka [7]–[10].

Fig. 13. Apache Flink Execution Plan for the Grep Query Implemented Using Apache Beam
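The chain of calls described above can be sketched in the Beam Java SDK as follows. This is a hedged illustration of the described read()/withoutMetadata()/Values.create() sequence, not the benchmark's exact source; the broker address and topic name are placeholders:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Values;
import org.apache.beam.sdk.values.PCollection;
import org.apache.kafka.common.serialization.StringDeserializer;

public class KafkaSourceExample {
  public static void main(String[] args) {
    Pipeline pipeline =
        Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    PCollection<String> values = pipeline
        // KafkaIO.read() creates the Read PTransform shown as the data source.
        .apply(KafkaIO.<String, String>read()
            .withBootstrapServers("kafka-node-1:9092") // placeholder broker
            .withTopic("input-topic")                  // placeholder topic
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            // Drops the Kafka metadata, yielding key-value pairs.
            .withoutMetadata())
        // Keeps only the values of the key-value pairs.
        .apply(Values.<String>create());

    // The grep logic and the write back to Kafka would follow here.
    pipeline.run().waitUntilFinish();
  }
}
```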
When comparing both execution plans it becomes visible that the plan for the query implemented using Apache Beam is significantly larger, i.e., it contains more elements in comparison to its counterpart. That is due to the more complex management of communication with Apache Kafka and could cause a lower performance. Both plans have in common that they start with a data source and that all elements are executed with a parallelism of one, due to the defined degree of parallelism. Moreover, a dedicated data sink is not identified for the program developed using Apache Beam. Thus, the sink must be represented as an operator.

Overall, the performance of Apache Beam applications highly depends on the runner implementations. Effort put into this development is likely to vary between systems. The closeness of the DSPSs' programming model to the underlying concepts of Apache Beam also impacts the application execution. Further details, e.g., with respect to the concrete impact of the additional operator, could be uncovered through profiling applications. However, all measurements are a snapshot in time and results may differ with different versions of Apache Beam, other DSPS versions, or alternative system configurations. Moreover, changed workload characteristics might also influence performance results.

IV. RELATED WORK

For supporting Apache Beam, a system has to implement a so-called runner. The development of the runner for IBM Streams is described in [49]. Next to highlighting three implemented optimizations with regard to the IBM Streams runner, performance evaluations between IBM Streams, Apache Flink, and Apache Spark are presented.

Besides abstraction layers such as Apache Beam that allow for developing data stream processing applications using a programming language such as Java, there is the idea of leveraging or extending SQL to be able to do that. Continuous Query Language (CQL) [50] is a comprehensive approach for such an SQL extension. To be more concrete, CQL is a SQL-based language for defining continuous queries over data streams as well as updatable relations. It is not only a concept but also integrated into STREAM [51], a data stream management system developed at Stanford University. Next to describing the CQL implementation in STREAM, the semantics of CQL are outlined and a comparison to other languages is included in the linked paper.

Apache Calcite [52] is another approach that was developed more recently. It is a framework that comprises various functionalities, e.g., with respect to query processing, query optimization, and query language support, which is the relevant aspect in this context. Amongst others, the architecture of Apache Calcite is presented in the mentioned work as well as developed SQL extensions for different areas such as semi-structured data or geospatial queries. Another depicted example is the extensions for data stream processing queries that are called STREAM extensions. They are inspired by the mentioned CQL and also explained on their website [53]. However, not all the presented concepts have been implemented yet [53]. A few DSPSs already integrate Apache Calcite, Apache Apex and Apache Flink being two of them [52].
is significantly larger, i.e., it contains more elements in com- Jain et al. [54] discuss the differences of two SQL-based parison to its counterpart. That is due to the more complex languages for defining streaming queries, particularly Oracle management of communication with Apache Kafka and could Continuous Query Language [55] and StreamBase Stream- cause a lower performance. Both plans have in common SQL [56]. Moreover, an approach for unifying both languages that they start with a data source and that all elements are is proposed. However, they highlight that for achieving a executed with a parallelism of one, due to the defined degree complete standard, further challenges needs to be tackled. of parallelism. Moreover, a dedicated data sink is not identified Besides, there are more SQL extensions for stream process- for the program developed using Apache Beam. Thus, the sink ing scenarios developed for certain systems, e.g., streaming must be represented as an operator. SQL for Apache Kafka called KSQL [57], Continuous Com- Overall, the performance of Apache Beam applications putation Language (CCL) that is the extended SQL used in highly depends on the runner implementations. Effort put SAP HANA Smart Data Streaming [58], or SamzaSQL [59] into this development is likely to vary between systems. The as extended SQL for the DSPS Samza [60]. closeness of the DSPSs’ programming model to the under- With respect to benchmarking DSPSs in general, the Linear lying concepts of Apache Beam also impacts the application Road benchmark by Arasu et al. [61] is a very well-known execution. Further details, e.g., with respect to the concrete work. It is an application benchmark that provides a bench- impact of the additional operator, could be uncovered through marking toolkit. This toolkit consists of a data generator, a profiling applications. However, all measurements are a snap- data sender, and a result validator. The underlying idea of the benchmark is a variable tolling system for a metropolitan area. systems from a data stream processing point of view is out This area covers multiple expressways with moving vehicles. of scope in the performed measurements. The amount of accumulated tolls depends on various aspects Lopez et al. [68] compare , Apache Flink, and concerning the traffic situation. Apache Spark Streaming in their paper. Besides describing The mentioned data sender emits data to the DSPS, which is the architecture of these three systems, the performance is mostly car position reports. Depending on the overall situation studied in a network traffic analysis scenario. Additionally, on the expressways, car position reports may require the DSPS the behavior in case of a node failure is investigated. to create an output or not. Next to car position reports, the remaining input data represent an explicit query which always V. CONCLUSIONAND FUTURE WORK requires an answer. Linear Road defines four distinct queries, whereas the query lastly presented in [61] was skipped in the This paper describes the characteristics of three state-of- two presented implementations due to complexity reasons. the-art DSPSs, particularly Apache Apex, Apache Flink, and The benchmark result for a system is summarized as a so Apache Spark Streaming, as well as the abstraction layer called L-rating. This metric defined by Linear Road expresses Apache Beam for implementing stream processing programs. 
The benchmark result for a system is summarized as a so-called L-rating. This metric defined by Linear Road expresses how many expressways the system could handle while meeting the defined response time requirements for each query. A higher number of expressways corresponds to a higher data input rate for the SUT. When generating data, the amount of expressways can be configured. The Linear Road benchmark was applied to the DSPS Aurora [62] and a commercial relational database. Results are presented in the paper.

Another related benchmark is the already mentioned StreamBench [43]. It aims at benchmarking distributed DSPSs. Regarding its category in the area of performance benchmarks, it can be viewed as a microbenchmark, i.e., it measures atomic operations such as a projection rather than more complex applications as in, e.g., Linear Road.

The seven queries defined by StreamBench partly contain a single computational step and partly comprise multiple computational steps. Three queries require to keep a state in order to calculate correct results while the remaining four queries are stateless. One query uses numerical data as input while the others process textual data. With respect to the benchmark architecture, StreamBench makes use of a message broker, specifically Apache Kafka, for decoupling data generation and consumption. That is different to Linear Road but similar to the benchmark architecture proposed in Section III-A. As part of the evaluation section, the DSPSs Apache Storm [63] and Apache Spark Streaming are compared by applying StreamBench.

NEXMark [64] is a benchmark aiming to benchmark streaming queries in the context of an online auction system. It seems the benchmark was never finished, as the website still states that it is "work in progress" [64] and there is only a draft version of a paper that was published a couple of years ago [65].
However, in the context of Apache Beam this benchmark was used as inspiration and foundation for a NEXMark-based benchmark suite [66]. This suite extends the eight NEXMark queries by five additional ones. A complete implementation of all queries for all runners is work in progress according to [66].

The work presented in [67] compares Apache Flink and Apache Spark. The conducted measurements include different queries, a grep query being one of them. One focus area that is analyzed is the scaling behavior with regard to different numbers of nodes in the cluster. However, studying both systems from a data stream processing point of view is out of scope in the performed measurements.

Lopez et al. [68] compare Apache Storm, Apache Flink, and Apache Spark Streaming in their paper. Besides describing the architecture of these three systems, the performance is studied in a network traffic analysis scenario. Additionally, the behavior in case of a node failure is investigated.

V. CONCLUSION AND FUTURE WORK

This paper describes the characteristics of three state-of-the-art DSPSs, particularly Apache Apex, Apache Flink, and Apache Spark Streaming, as well as the abstraction layer Apache Beam for implementing stream processing programs. To study the performance impact of Apache Beam, we propose a lightweight benchmark architecture that uses a recognized workload and a novel approach for measuring execution times using Apache Kafka. All benchmark implementations are provided in order to ensure reproducibility.

Our benchmark results show that Apache Beam has a noticeable impact on the performance of DSPSs in almost all cases. Programs developed using Apache Beam suffered from a slowdown of up to a factor of 58 in the worst case. At the same time, there is one scenario where the query developed using Apache Beam is about as fast as its counterparts using the APIs of the corresponding DSPS. However, for most scenarios we observed a slowdown of at least a factor of three.

The results lead to two major conclusions. Firstly, using Apache Beam as an abstraction layer for application development comes at a cost in terms of runtime performance. Secondly, the results of benchmarking different DSPSs using a program developed with Apache Beam are not likely to represent the performance differences that are to be expected from a benchmark with programs developed using native system APIs. While using Apache Beam certainly provides a greater flexibility to switch underlying DSPSs with relatively low effort, one needs to be aware of the fact that this advantage comes with a negative impact on performance. This performance penalty varies among systems and applications and is currently unpredictable.

Future work in this area involves studying the reasons for performance differences in greater detail. Particularly, the applications could be profiled in order to see how much time is spent in which part of the execution plans and thus, to identify possible performance bottlenecks. Finding potential reasons for the comparatively large performance penalties when employing Apache Apex represents another interesting area for future work. In the best case, it is possible to identify factors that influence the performance penalty applications suffer from and make them predictable. Additionally, measurements can be extended with respect to various aspects such as the number of studied systems or query complexity as well as scaling, parallelism, or fault-tolerance behaviors. Furthermore, upcoming Apache Beam versions and other abstraction layers can be compared against the presented results to supplement the initial overview presented here.

REFERENCES

[1] M. Stonebraker and U. Çetintemel, "'One size fits all': An idea whose time has come and gone (abstract)," in Proc. International Conference on Data Engineering, ICDE, 2005, pp. 2–11. [Online]. Available: https://doi.org/10.1109/ICDE.2005.1
[2] "Apache Beam Overview," https://beam.apache.org/get-started/beam-overview/, accessed: 2018-10-30.
[3] M. Lorenz, J. Rudolph, G. Hesse, M. Uflacker, and H. Plattner, "Object-Relational Mapping Revisited - A Quantitative Study on the Impact of Database Technology on O/R Mapping Strategies," in Hawaii International Conference on System Sciences, HICSS, 2017.
[4] P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, and K. Tzoumas, "Apache Flink™: Stream and Batch Processing in a Single Engine," IEEE Data Eng. Bull., vol. 38, no. 4, pp. 28–38, 2015. [Online]. Available: http://sites.computer.org/debull/A15dec/p28.pdf
REFERENCES

[1] M. Stonebraker and U. Çetintemel, “”One size fits all”: An idea whose time has come and gone (abstract),” in Proc. International Conference on Data Engineering, ICDE, 2005, pp. 2–11. [Online]. Available: https://doi.org/10.1109/ICDE.2005.1
[2] “Apache Beam Overview,” https://beam.apache.org/get-started/beam-overview/, accessed: 2018-10-30.
[3] M. Lorenz, J. Rudolph, G. Hesse, M. Uflacker, and H. Plattner, “Object-Relational Mapping Revisited - A Quantitative Study on the Impact of Database Technology on O/R Mapping Strategies,” in Hawaii International Conference on System Sciences, HICSS, 2017.
[4] P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, and K. Tzoumas, “Apache Flink™: Stream and Batch Processing in a Single Engine,” IEEE Data Eng. Bull., vol. 38, no. 4, pp. 28–38, 2015. [Online]. Available: http://sites.computer.org/debull/A15dec/p28.pdf
[5] M. Zaharia, R. S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, J. Rosen, S. Venkataraman, M. J. Franklin, A. Ghodsi, J. Gonzalez, S. Shenker, and I. Stoica, “Apache Spark: A Unified Engine for Big Data Processing,” Commun. ACM, vol. 59, no. 11, pp. 56–65, 2016. [Online]. Available: http://doi.acm.org/10.1145/2934664
[6] “Apache Apex,” https://apex.apache.org/docs/apex/, accessed: 2018-09-11.
[7] “Apache Beam,” https://github.com/apache/beam, accessed: 2018-08-17.
[8] “Apache Beam Programming Guide,” https://beam.apache.org/documentation/programming-guide/, accessed: 2018-10-18.
[9] “Runner Authoring Guide,” https://github.com/apache/beam/blob/master/website/src/contribute/runner-guide.md, accessed: 2018-10-18.
[10] T. Akidau, R. Bradshaw, C. Chambers, S. Chernyak, R. Fernández-Moctezuma, R. Lax, S. McVeety, D. Mills, F. Perry, E. Schmidt, and S. Whittle, “The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing,” PVLDB, vol. 8, no. 12, pp. 1792–1803, 2015. [Online]. Available: http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
[11] “Beam Capability Matrix,” https://beam.apache.org/documentation/runners/capability-matrix/#cap-summary-what, accessed: 2018-09-19.
[12] “Apache Gearpump,” https://gearpump.apache.org/overview.html, accessed: 2018-10-15.
[13] “MapReduce Tutorial,” https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html, accessed: 2018-10-15.
[14] “What is Samza?” https://samza.apache.org, accessed: 2018-10-15.
[15] “Alibaba JStorm,” http://jstorm.io, accessed: 2018-10-15.
[16] “IBM Streams,” https://www.ibm.com/de-en/marketplace/stream-computing, accessed: 2018-10-15.
[17] “CLOUD DATAFLOW - Simplified stream and batch data processing, with equal reliability and expressiveness,” https://cloud.google.com/dataflow/, accessed: 2018-08-17.
[18] “Cloud Dataflow, Apache Beam and you,” https://cloud.google.com/blog/products/gcp/cloud-dataflow-apache-beam-and-you, accessed: 2018-10-15.
[19] G. Hesse and M. Lorenz, “Conceptual Survey on Data Stream Processing Systems,” in IEEE International Conference on Parallel and Distributed Systems, ICPADS, 2015, pp. 797–802. [Online]. Available: https://doi.org/10.1109/ICPADS.2015.106
[20] “Flink - Distributed Runtime Environment,” https://ci.apache.org/projects/flink/flink-docs-master/concepts/runtime.html, accessed: 2018-09-27.
[21] “Spark Streaming Programming Guide,” https://spark.apache.org/docs/latest/streaming-programming-guide.html, accessed: 2018-09-26.
[22] “Apache Spark - Cluster Mode Overview,” https://spark.apache.org/docs/2.3.1/cluster-overview.html, accessed: 2018-09-10.
[23] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. H. Katz, S. Shenker, and I. Stoica, “Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center,” in Proc. USENIX Symposium on Networked Systems Design and Implementation, NSDI, 2011. [Online]. Available: https://www.usenix.org/conference/nsdi11/mesos-platform-fine-grained-resource-sharing-data-center
[24] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, B. Saha, C. Curino, O. O’Malley, S. Radia, B. Reed, and E. Baldeschwieler, “Apache Hadoop YARN: Yet Another Resource Negotiator,” in ACM Symposium on Cloud Computing, SOCC, 2013, pp. 5:1–5:16. [Online]. Available: http://doi.acm.org/10.1145/2523616.2523633
[25] E. A. Brewer, “Kubernetes and the Path to Cloud Native,” in Proc. ACM Symposium on Cloud Computing, SoCC, 2015, p. 167. [Online]. Available: http://doi.acm.org/10.1145/2806777.2809955
[26] X. Lu, M. Wasi-ur-Rahman, N. S. Islam, D. Shankar, and D. K. Panda, “Accelerating Spark with RDMA for Big Data Processing: Early Experiences,” in IEEE Annual Symposium on High-Performance Interconnects, HOTI, 2014, pp. 9–16. [Online]. Available: https://doi.org/10.1109/HOTI.2014.15
[27] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauly, M. J. Franklin, S. Shenker, and I. Stoica, “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing,” in Proc. USENIX Symposium on Networked Systems Design and Implementation, NSDI, 2012, pp. 15–28. [Online]. Available: https://www.usenix.org/conference/nsdi12/technical-sessions/presentation/zaharia
[28] M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica, “Discretized Streams: Fault-Tolerant Streaming Computation at Scale,” in ACM SIGOPS Symposium on Operating Systems Principles, SOSP, 2013, pp. 423–438. [Online]. Available: http://doi.acm.org/10.1145/2517349.2522737
[29] “Apache Hadoop,” https://hadoop.apache.org, accessed: 2018-09-11.
[30] “HDFS Architecture,” http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html, accessed: 2018-09-11.
[31] M. Bhandarkar, “AdBench: A Complete Benchmark for Modern Data Pipelines,” in TPC Technology Conference, TPCTC, 2016, pp. 107–120. [Online]. Available: https://doi.org/10.1007/978-3-319-54334-5_8
[32] T. Dunning and E. Friedman, Streaming Architecture: New Designs Using Apache Kafka and MapR Streams. O’Reilly Media, 2016. [Online]. Available: https://books.google.de/books?id=EU8kDAAAQBAJ
[33] J. Kreps, N. Narkhede, and J. Rao, “Kafka: a Distributed Messaging System for Log Processing,” in Proc. International Workshop on Networking Meets Databases, NetDB, 2011, pp. 1–7.
[34] B. Saha, H. Shah, S. Seth, G. Vijayaraghavan, A. C. Murthy, and C. Curino, “Apache Tez: A Unifying Framework for Modeling and Building Data Processing Applications,” in Proc. International Conference on Management of Data, ACM SIGMOD, 2015, pp. 1357–1369. [Online]. Available: http://doi.acm.org/10.1145/2723372.2742790
[35] “Flink - YARN Setup,” https://ci.apache.org/projects/flink/flink-docs-master/ops/deployment/yarn_setup.html, accessed: 2018-09-23.
[36] “Running Spark on YARN,” https://spark.apache.org/docs/latest/running-on-yarn.html, accessed: 2018-09-23.
[37] “Apache Hadoop YARN,” https://hadoop.apache.org/docs/r3.0.3/hadoop-yarn/hadoop-yarn-site/YARN.html, accessed: 2018-09-25.
[38] “Apache Apex Documentation - Application Developer Guide,” http://apex.apache.org/docs/apex-3.7/application_development/, accessed: 2018-09-26.
[39] “Apache Flink,” https://github.com/apache/flink, accessed: 2018-10-21.
[40] “Mirror of Apache Apex core,” https://github.com/apache/apex-core, accessed: 2018-10-21.
[41] “Mirror of Apache Spark,” https://github.com/apache/spark, accessed: 2018-10-21.
[42] “AOL Search Query Logs,” http://www.researchpipeline.com/mediawiki/index.php?title=AOL_Search_Query_Logs, accessed: 2018-09-28.
[43] R. Lu, G. Wu, B. Xie, and J. Hu, “StreamBench: Towards Benchmarking Modern Distributed Stream Computing Frameworks,” in Proc. IEEE/ACM International Conference on Utility and Cloud Computing, UCC, 2014, pp. 69–78. [Online]. Available: https://doi.org/10.1109/UCC.2014.15
[44] “Documentation - Kafka 0.10.2 Documentation,” https://kafka.apache.org/documentation/, accessed: 2017-04-24.
[45] “Flink - Command-Line Interface,” https://ci.apache.org/projects/flink/flink-docs-release-1.6/ops/cli.html, accessed: 2018-09-27.
[46] “Spark Configuration,” https://spark.apache.org/docs/latest/configuration.html#spark-properties, accessed: 2018-09-27.
[47] “Interface Context.OperatorContext,” https://ci.apache.org/projects/apex-core/apex-core-javadoc-release-3.6/com/datatorrent/api/Context.OperatorContext.html, accessed: 2018-10-29.
[48] J. Karimov, T. Rabl, A. Katsifodimos, R. Samarev, H. Heiskanen, and V. Markl, “Benchmarking Distributed Stream Data Processing Systems,” in IEEE International Conference on Data Engineering, ICDE, 2018, pp. 1507–1518. [Online]. Available: http://doi.ieeecomputersociety.org/10.1109/ICDE.2018.00169
[49] S. Li, P. Gerver, J. Macmillan, D. Debrunner, W. Marshall, and K. Wu, “Challenges and Experiences in Building an Efficient Apache Beam Runner For IBM Streams,” PVLDB, vol. 11, no. 12, pp. 1742–1754, 2018. [Online]. Available: http://www.vldb.org/pvldb/vol11/p1742-li.pdf
[50] A. Arasu, S. Babu, and J. Widom, “The CQL Continuous Query Language: Semantic Foundations and Query Execution,” VLDB J., vol. 15, no. 2, pp. 121–142, 2006. [Online]. Available: https://doi.org/10.1007/s00778-004-0147-z
[51] A. Arasu, B. Babcock, S. Babu, M. Datar, K. Ito, R. Motwani, I. Nishizawa, U. Srivastava, D. Thomas, R. Varma, and J. Widom, “STREAM: The Stanford Stream Data Manager,” IEEE Data Eng. Bull., vol. 26, no. 1, pp. 19–26, 2003. [Online]. Available: http://sites.computer.org/debull/A03mar/paper.ps
[52] E. Begoli, J. Camacho-Rodríguez, J. Hyde, M. J. Mior, and D. Lemire, “Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources,” in Proc. International Conference on Management of Data, ACM SIGMOD, 2018, pp. 221–230. [Online]. Available: http://doi.acm.org/10.1145/3183713.3190662
[53] “Streaming,” https://calcite.apache.org/docs/stream.html, accessed: 2018-10-19.
[54] N. Jain, S. Mishra, A. Srinivasan, J. Gehrke, J. Widom, H. Balakrishnan, U. Çetintemel, M. Cherniack, R. Tibbetts, and S. B. Zdonik, “Towards a Streaming SQL Standard,” PVLDB, vol. 1, no. 2, pp. 1379–1390, 2008. [Online]. Available: http://www.vldb.org/pvldb/1/1454179.pdf
[55] “Oracle® CEP CQL Language Reference 11g Release 1 (11.1.1),” https://docs.oracle.com/cd/E16764_01/doc.1111/e12048/intro.htm, accessed: 2018-10-19.
[56] “StreamSQL Overview,” https://docs.tibco.com/pub/sb-lv/2.1.8/doc/html/streamsql/ssql-intro.html, accessed: 2018-10-19.
[57] “KSQL and Kafka Streams,” https://docs.confluent.io/current/streams-ksql.html, accessed: 2018-10-19.
[58] “SAP HANA Smart Data Streaming: Developer Guide,” https://help.sap.com/doc/25fc8560420d4d5099d6df02f7cbff9e/1.0.12/en-US/streaming_developer_guide.pdf, accessed: 2018-10-19.
[59] M. Pathirage, J. Hyde, Y. Pan, and B. Plale, “SamzaSQL: Scalable Fast Data Management with Streaming SQL,” in IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPS, 2016, pp. 1627–1636. [Online]. Available: https://doi.org/10.1109/IPDPSW.2016.141
[60] S. A. Noghabi, K. Paramasivam, Y. Pan, N. Ramesh, J. Bringhurst, I. Gupta, and R. H. Campbell, “Samza: Stateful Scalable Stream Processing at LinkedIn,” PVLDB, vol. 10, no. 12, pp. 1634–1645, 2017. [Online]. Available: http://www.vldb.org/pvldb/vol10/p1634-noghabi.pdf
[61] A. Arasu, M. Cherniack, E. F. Galvez, D. Maier, A. Maskey, E. Ryvkina, M. Stonebraker, and R. Tibbetts, “Linear Road: A Stream Data Management Benchmark,” in (e)Proc. International Conference on Very Large Data Bases, 2004, pp. 480–491. [Online]. Available: http://www.vldb.org/conf/2004/RS12P1.PDF
[62] D. J. Abadi, D. Carney, U. Çetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, and S. B. Zdonik, “Aurora: a new model and architecture for data stream management,” VLDB J., vol. 12, no. 2, pp. 120–139, 2003. [Online]. Available: https://doi.org/10.1007/s00778-003-0095-z
[63] “Apache Storm,” http://storm.apache.org, accessed: 2018-10-23.
[64] “NEXMark Benchmark,” http://datalab.cs.pdx.edu/niagara/NEXMark/, accessed: 2018-10-20.
[65] P. Tucker, K. Tufte, V. Papadimos, and D. Maier, “NEXMark – A Benchmark for Queries over Data Streams DRAFT,” http://datalab.cs.pdx.edu/niagara/pstream/nexmark.pdf, accessed: 2018-10-20.
[66] “Nexmark benchmark suite,” https://beam.apache.org/documentation/sdks/java/nexmark/, accessed: 2018-10-20.
[67] O. Marcu, A. Costan, G. Antoniu, and M. S. Pérez-Hernández, “Spark versus Flink: Understanding Performance in Big Data Analytics Frameworks,” in IEEE International Conference on Cluster Computing, CLUSTER, 2016, pp. 433–442. [Online]. Available: https://doi.org/10.1109/CLUSTER.2016.22
[68] M. A. Lopez, A. G. P. Lobato, and O. C. M. B. Duarte, “A Performance Comparison of Open-Source Stream Processing Platforms,” in IEEE Global Communications Conference, GLOBECOM, 2016, pp. 1–6. [Online]. Available: https://doi.org/10.1109/GLOCOM.2016.7841533