Analyzing Efficient Stream Processing on Modern Hardware [Experiment and Analysis Paper]

Steffen Zeuch 2, Bonaventura Del Monte 2, Jeyhun Karimov 2, Clemens Lutz 2, Manuel Renz 2, Jonas Traub 1, Sebastian Breß 1,2, Tilmann Rabl 1,2, Volker Markl 1,2
1 Technische Universität Berlin, 2 German Research Center for Artificial Intelligence

ABSTRACT
Modern Stream Processing Engines (SPEs) process large data volumes under tight latency constraints. Many SPEs execute processing pipelines using message passing on shared-nothing architectures and apply a partition-based scale-out strategy to handle high-velocity input streams. Furthermore, many state-of-the-art SPEs rely on a Java Virtual Machine to achieve platform independence and speed up system development by abstracting from the underlying hardware. In this paper, we show that taking the underlying hardware into account is essential to exploit modern hardware efficiently. To this end, we conduct an extensive experimental analysis of current SPEs and SPE design alternatives optimized for modern hardware. Our analysis highlights potential bottlenecks and reveals that state-of-the-art SPEs are not capable of fully exploiting current and emerging hardware trends, such as multi-core processors and high-speed networks. Based on our analysis, we describe a set of design changes to the common architecture of SPEs to scale-up on modern hardware. We show that the single-node throughput can be increased by up to two orders of magnitude compared to state-of-the-art SPEs by applying specialized code generation, fusing operators, batch-style parallelization strategies, and optimized windowing. This allows for deploying typical streaming applications on a single or a few nodes instead of large clusters.

Figure 1: Yahoo! Streaming Benchmark (1 Node). (Throughput in M records/s for Flink, Spark, Storm, Java HC, C++ HC, Streambox, Saber, and an optimized implementation, relative to the memory bandwidth.)

PVLDB Reference Format:
Steffen Zeuch, Bonaventura Del Monte, Jeyhun Karimov, Clemens Lutz, Manuel Renz, Jonas Traub, Sebastian Breß, Tilmann Rabl, Volker Markl. Analyzing Efficient Stream Processing on Modern Hardware. PVLDB, 12(xxx): xxxx-yyyy, 2019.
DOI: https://doi.org/TBD

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were invited to present their results at The 45th International Conference on Very Large Data Bases, August 2019, Los Angeles, California.
Proceedings of the VLDB Endowment, Vol. 12, No. xxx
Copyright 2018 VLDB Endowment 2150-8097/18/10... $10.00.
DOI: https://doi.org/TBD

1. INTRODUCTION
Over the last decade, streaming applications have emerged as an important new class of big data processing use cases. Streaming systems have to process high velocity data streams under tight latency constraints. To handle high velocity streams, modern SPEs such as Apache Flink [29], Apache Spark [89], and Apache Storm [81] distribute processing over a large number of computing nodes in a cluster, i.e., scale-out the processing. These systems trade single node performance for scalability to large clusters and use a Java Virtual Machine (JVM) as the underlying processing environment for platform independence. While JVMs provide a high level of abstraction from the underlying hardware, they cannot easily provide efficient data access due to processing overheads induced by data (de-)serialization, objects scattering in main memory, virtual functions, and garbage collection. As a result, the overall performance of scale-out SPEs building on top of a JVM is severely limited in throughput and latency. Another class of SPEs optimizes execution to scale-up the processing on a single node, for example, Saber [58], which is built with Java, Streambox [67], which is built with C++, and Trill [34], which is built with C#.

In Figure 1, we show the throughput of modern SPEs executing the Yahoo! Streaming Benchmark [39] on a single node compared to hand-coded implementations. We report the detailed experimental setup in Section 5.1. As shown, the state-of-the-art SPEs Flink, Spark, and Storm achieve sub-optimal performance compared to the physical limit that is established by the memory bandwidth. To identify bottlenecks of SPEs, we hand-code the benchmark in Java and C++ (HC bars). Our hand-coded Java implementation is faster than Flink but still 40 times slower than the physical limit. In contrast, our C++ implementation and Streambox omit serialization, virtual functions, and garbage collection overheads and store tuples in dense arrays (arrays of structures). This translates to a 26 times better performance for our C++ implementation and a 6 times better performance for Streambox compared to the Java implementation. Finally, we show an optimized implementation that eliminates the potential SPE-bottlenecks which we identify in this paper. Our optimized implementation operates near the physical memory limit and utilizes modern hardware efficiently.

The observed sub-optimal single node performance of current SPEs requires a large cluster to achieve the same performance as a single node system using the scale-up optimizations that we evaluate in this paper. The major advantage of a scale-up system is the avoidance of inter-node

1 data transfer and a reduced synchronization overhead. With SPEs. We choose these SPEs due to their maturity, wide hundreds of cores and terabytes of main memory available, academic and industrial adoption, and the size of their open- modern scale-up servers provide an interesting alternative to source communities. process high-volume data streams with high throughput and SPEs in the second category are optimized for scale-up low latency. Utilizing the capabilities of modern hardware execution on a single machine. Their goal is to exploit on a single node efficiently, we show that the throughput the capabilities of one high-end machine efficiently. Recent of a single node becomes sufficient for many streaming ap- SPEs in this category are Streambox [67], Trill [34], and plications and can outperform a cluster of several nodes. Saber [58]. In particular, Streambox aims to optimize the As one result of this paper, we show that SPEs need to be execution for multi-core machines, whereas SABER takes optimized for modern hardware to exploit the possibilities heterogeneous processing into account by utilizing GPUs. of emerging hardware trends, such as multi-core processors Finally, Trill supports a broad range of queries beyond SQL- and high-speed networks. like streaming queries. In this paper, we examine Streambox We investigate streaming optimizations by examining dif- and SABER as representative SPEs in this category. ferent aspects of stream processing regarding their exploita- A major aspect of parallel stream processing engines are tion of modern hardware. To this end, we adopt the ter- the underlying processing models which are either based on minology introduced by Hirzel et al. [53] and structure micro-batching or use pipelined tuple-at-a-time processing our analysis in data-related and processing-related optimiza- [14, 19, 33, 42, 87]. Micro-batching SPEs split streams into tions. As a result, we show that by applying appropriate finite chunks of data (batches) and process these batches in optimizations, single node throughput can be increased by parallel. SPEs such as Spark [88], Trident [1], and SABER [58] two orders of magnitude. Our contributions are as follows: adopt this micro-batching approach. In contrast, pipelined, • We analyze current SPEs and identify their inefficien- tuple-at-a-time systems execute data-parallel pipelines cies and bottlenecks on modern hardware setups. consisting of stream transformations. Instead of splitting • We explore and evaluate new architectures of SPEs on streams into micro-batches, operators receive individual tu- modern hardware that address these inefficiencies. ples and produce output tuples continuously. SPEs such as • Based on our analysis, we describe a set of design Apache Flink [15] and Storm [81] adopt this approach. changes to the common architecture of SPEs to scale- 2.2 Modern Hardware up on modern hardware A streaming system on modern hardware has to exploit • We conduct an extensive experimental evaluation us- three critical hardware resources efficiently to achieve a high ing the Yahoo! Streaming Benchmark, the Linear Road resource utilization: CPU, main memory, and network. In Benchmark, and a query on the NY taxi data set. this paper, we investigate if current SPEs and their system The rest of this paper is structured as follows. In Section 2, designs exploit these resources efficiently. 
we introduce concepts of current SPEs as well as aspects of First, modern many-core CPUs contain dozens of CPU modern hardware that we investigate in this paper. After cores per socket. Additionally, multi-socket CPUs connect that, we explore the data-related (Section 3) and processing- multiple CPUs via fast interconnects to form a network of related stream processing optimization (Section 4) of an SPE cores. However, this architecture introduces non-uniform on modern hardware. In Section 5, we evaluate state-of- memory access (NUMA) among cores [62, 74]. As a result, the-art SPEs and investigate different optimizations in an the placement of data in memory directly affects the overall experimental analysis. Finally, we discuss our results in Sec- efficiency of the system. Furthermore, each core caches parts tion 5.3 and present related work in Section 6. of the data in its private and shared caches for fast access. To exploit this performance critical resource efficiently, the 2. BACKGROUND data and instruction locality have to be taken into account. Stream processing engines enable the processing of long- Second, today’s common servers provide main memory running queries over unbounded, continuous data streams. capacities of several terabytes. As a result, the majority of A major challenge for SPEs is to provide near real-time pro- the data and state informations, e.g., the data of one win- cessing guarantees. Furthermore, they have to handle fluc- dow, can be maintained in main memory. In combination tuations in the input data rate. In contrast, batch processing with multiple threads working on the same data, this sig- engines pull data from storage as needed, which allows for nificantly speeds up the processing compared to maintain- controlling input rates [14, 29]. In this section, we intro- ing the same state with slower disk or network connections. duce the underlying processing models of SPEs as well as On the other hand, such a streaming system has to handle the SPEs that we selected for our study (Section 2.1). Af- synchronization and parallel aspects in much shorter time ter that, we present aspects of modern hardware that we frames to enable efficient processing. investigate in this paper (Section 2.2). Third, today’s network connections become constantly faster and potentially outperform main memory band- 2.1 Streaming Systems width [22] in the future. Commonly available Ethernet tech- Over the last decades, two categories of streaming sys- nologies provide 1, 10, 40, or 100Gbit bandwidth. In con- tems emerged. SPEs in the first category are optimized for trast, new network technologies such as InfiniBand provide scale-out execution of streaming queries on shared-nothing much higher bandwidth up to or even faster than main mem- architectures. In general, these SPEs apply a distributed ory bandwidth [22]. This important trend will lead to rad- producer-consumer pattern and a buffer mechanism to han- ical changes in system designs. In particular, the common dle data shuffling among operators (e.g., partitioning). Their wisdom of processing data locally first does not hold with goal is to massively parallelize the workload among many future network technologies [22]. As a result, a system which small to medium sized nodes. 
In this paper, we analyze Apache Flink [29], Apache Spark Streaming [89], and Apache Storm [81] as representative, JVM-based, state-of-the-art scale-out SPEs. A system which transfers data between nodes as fast as or even faster than reading from main memory introduces new challenges and opportunities for future SPEs.
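To make the data and instruction locality argument from Section 2.2 concrete, the following C++ sketch shows a dense, cache-line-aligned array-of-structures tuple layout of the kind the hand-coded implementations rely on. The 64-byte layout, the field names, and the input size are illustrative assumptions, not the exact layout used by any evaluated system (e.g., the YSB tuples discussed later are 78 bytes).

#include <cstdint>
#include <vector>

// Illustrative stream tuple, padded to one 64-byte cache line.
struct alignas(64) Tuple {
    uint64_t key;        // e.g., a campaign or segment identifier (hypothetical field)
    uint64_t timestamp;  // event time in milliseconds (hypothetical field)
    uint64_t value;      // payload consumed by the aggregation (hypothetical field)
    uint8_t  pad[40];    // pad to a full cache line to avoid false sharing
};
static_assert(sizeof(Tuple) == 64, "tuple should occupy exactly one cache line");

int main() {
    // A dense array of structures: sequential scans touch contiguous memory,
    // so the hardware prefetcher can hide memory latency and no pointers are chased.
    std::vector<Tuple> stream(1'000'000);  // larger inputs would mimic the experiments
    uint64_t sum = 0;
    for (const Tuple& t : stream) {
        sum += t.value;  // tight loop over dense data
    }
    return static_cast<int>(sum & 1);
}

In contrast, a JVM-based SPE typically stores such tuples as individual heap objects, which scatters them in memory and adds header, reference, and garbage-collection overhead.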

2 8 80 75 Ethernet 1 Gbit/s 66.4 Network Bandwidth Ethernet 10 Gbit/s 6 60 [GB/s] Ethernet 40 Gbit/s 37.5 4 Rate [GB/s] 40 Ethernet 100 Gbit/s IB EDR 12X One NIC 2 20 12.5 DDR3-2133 (4 Ch.) 5 0.13 1.25 IB EDR 12X Two NIC 0 Throughput Ingestion 0 1K 2K 4K 8K 16K 32K 64K 1M 2M 4M 8M Figure 2: Data ingestion rate overview. 128K 256K 512K 16M Buffer Size Java RDMA Read C++ RDMA Read TCP over IB 3. DATA-RELATED OPTIMIZATIONS C++ RDMA Write C++ Send/Receive In this section, we explore data-related aspects for stream- ing systems. We discuss different methods to supply high Figure 3: Infiniband network bandwidth. velocity streams in Section 3.1 and evaluate strategies for efficient data passing among operators in Section 3.2. communication: one-sided and two-sided. With two-sided communication, both sides, the sender and the receiver are 3.1 Data Ingestion involved in the communication. Thus, each message has to We examine different alternatives how SPEs can receive be acknowledged by the receiver and both participants are input streams. In Figure 2, we provide an overview of com- aware that a communication takes place. In contrast, one- mon ingestion sources and their respective maximum band- sided communication involves one active sender and one pas- widths. In general, an SPE receives data from network, e.g., sive receiver (RDMA write) or a passive sender and a active via sockets, distributed file systems, or messaging systems. receiver (RDMA read). In both cases, one side is agnostic of Common network bandwidths range from 1 to 100 GBit over the communication. As a result, synchronization methods Ethernet (125 MB/s - 12.5 GB/s). In contrast, InfiniBand are required to detect that a communication is finished. (IB) offers higher bandwidths from 1.7 GB/s (FDR 1x) to Experiment. In Figure 3, we run the aforementioned 37.5 GB/s (EDR - Enhanced Data Rate - 12x) per network communication modes on an InfiniBand high-speed network interface controller (NIC) per port [3]. with a maximum bandwidth of 56 Gbit/s (FDR 4x). On Analysis. Based on the numbers in Figure 2, we con- the x-axis, we scale the buffer/message and on the y-axis clude that modern network technologies enable ingestion we show the throughput in GBytes/s for a data transfer rates near main memory bandwidth or even higher [22]. of 10M tuples of 78 byte each, i.e., the yahoo streaming This is in line with Binning et al. who predict the end of benchmark tuples. To implement the C++ version, we slow networks [22]. Furthermore, they point out that fu- use RDMA Verbs [40]. The Java implementation uses the ture InfiniBand standards such as HDR or NDR will of- disni [2] and follows the Spark RDMA implemen- fer even higher bandwidths. Trivedi et al. [82] assess the tation [6]. As a first observation, Figure 14 shows that all performance-wise importance of the network for modern dis- modes perform better with larger buffer/messages sizes and tributed data processing systems such as Spark and Flink. their throughput levels off starting around 1M. Second, the They showed that increasing the network bandwidth from C++ RDMA read implementations achieves roughly 2 GB/s 10 Gbit to 40 Gbit transforms those system from a network- more throughput than the Java implementation. As a result, bound to CPU-bound system. As a result, improving the we conclude that the JVM introduces additional overhead single node efficiency is crucial for scale-out systems as well. and, thus, a JVM-based SPE is not able to utilize the entire On single node SPEs, Zhang et al. 
[92] point out that cur- network bandwidth of modern high-speed networks. Third, rent SPEs are significantly CPU-bound and thus will not the C++ RDMA write operation is slightly slower than the natively benefit from an increased ingestion rate. An SPE read operation. Finally, the two-sided send/receive mode is on modern hardware should be able to exploit the boosted slower than the one-sided communication modes. Overall, ingestion rate to increase its overall, real-world throughput. the one-sided RDMA read communication and a message Infiniband and RDMA. The trends in the area of net- size around 4MB performs best. Note that, our findings are working technologies over the last decade showed, that Re- in line with previous work [68, 56]. mote (RDMA) capable networks be- Discussion. Our experiments show that a JVM-based came affordable and thus are present in an increasing num- distributed implementation is restricted by the JVM and ber of data centers [56]. RDMA enables the direct access cannot utilize the bandwidth as efficiently as a C++ im- to main memory of a remote machine while bypassing the plementation. As a result, the performance of JVM-based remote CPU. Thus, it does not consume CPU resources of SPEs is limited in regards to upcoming network technologies the remote machine. Furthermore, the caches of the re- that are able to deliver data at memory-speed. mote CPU are not polluted with the transfered memory con- tent. Finally, zero-copy enables direct send and receive us- 3.2 Data Exchange between Operators ing buffers and bypasses the software network stack. Alter- In this section, we examine the design of pipelined ex- natively, high-speed networks based on Infiniband can run ecution using message passing, which is common in state- common network protocols such as TCP and UDP (e.g., via of-the-art SPEs. In particular, SPEs execute data-parallel IPoIB). However, this removes the benefits of RDMA be- pipelines using asynchronous message passing to provide cause their socket interfaces involve system calls and data high-throughput processing with low-latency. To pass data copies between application buffers and socket buffers [75]. among operators, they use queues as a core component. In Today, three different network technologies provide RDMA- general, data is partitioned by keys and distributed among support: 1) InfiniBand, 2) RoCE (RDMA over Converged parallel instances of operators. In this case, queues enable Ethernet), and 3) iWARP (Internet Wide Area RDMA Pro- asynchronous processing by decoupling producer and con- tocol). The underlying NICs are able to provide up to 100 sumer. However, queues become the bottleneck if operators Gbps of per-port bandwidth and a round-trip latency of overload them. In particular, a slow consumer potentially roughly 2µs [56]. Overall, there are two modes for RDMA causes back-pressure and slows down previous operators.
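Before presenting the queue experiment, the following minimal C++ sketch illustrates the buffer-granular hand-over described above: a producer groups tuples into fixed-size buffers and passes whole buffers through a bounded queue to a consumer thread. It uses a plain mutex-based queue for brevity; the lock-free queues evaluated next (e.g., Folly, TBB, Boost) follow the same buffer-at-a-time pattern. All names and sizes are illustrative.

#include <array>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>
#include <thread>

struct Tuple { uint64_t key, value; };
using Buffer = std::array<Tuple, 1024>;  // tuples are grouped into buffers

// Bounded, mutex-based buffer queue; real SPEs use lock-free variants.
class BufferQueue {
public:
    void push(Buffer buf) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_full_.wait(lock, [&] { return queue_.size() < kCapacity; });
        queue_.push(std::move(buf));
        not_empty_.notify_one();
    }
    Buffer pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [&] { return !queue_.empty(); });
        Buffer buf = std::move(queue_.front());
        queue_.pop();
        not_full_.notify_one();
        return buf;
    }
private:
    static constexpr std::size_t kCapacity = 64;
    std::queue<Buffer> queue_;
    std::mutex mutex_;
    std::condition_variable not_full_, not_empty_;
};

int main() {
    BufferQueue queue;
    constexpr int kBuffers = 1000;
    std::thread producer([&] {
        for (int b = 0; b < kBuffers; ++b) {
            Buffer buf;
            for (std::size_t i = 0; i < buf.size(); ++i)
                buf[i] = Tuple{static_cast<uint64_t>(i), static_cast<uint64_t>(b)};
            queue.push(buf);  // hand over a filled buffer, not individual tuples
        }
    });
    uint64_t sum = 0;
    std::thread consumer([&] {
        for (int b = 0; b < kBuffers; ++b) {
            Buffer buf = queue.pop();
            for (const Tuple& t : buf) sum += t.value;  // consumer logic runs per buffer
        }
    });
    producer.join();
    consumer.join();
    return static_cast<int>(sum & 1);
}

The buffer size is the central tuning knob: larger buffers amortize synchronization cost but increase latency, which is exactly the trade-off examined in the experiment below.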

Figure 4: Queue throughput. (Throughput in GB/s over buffer sizes Tuple, L1, L2, 2·L2, L3/2, and L3 for Folly, Moody, MemFence, TBB, Boost, STL Queue, STL List, and the Java Queue, relative to the memory bandwidth.)

To examine queues as the central building block in modern SPEs, we conduct an experiment to assess the capabilities of open-source, state-of-the-art queue implementations.

Experimental Setup. In this experiment, we selected queues that cover different design aspects such as memory management, synchronization, and underlying data structures. We select queues provided by Facebook (called Folly) [7], the Boost library [8], the TBB library [9], and the Java java.util.concurrent package [4]. In addition, we choose an implementation called Moody [5] and an implementation based on memory fences [83]. Finally, we provide our own implementations based on the STL queue and list templates. Except for the Java queue, all queues are implemented in C++.

Following Sax et al. [55] and Carbone et al. [29], we group tuples into buffers to increase the overall throughput. They point out that buffers introduce a trade-off between latency and throughput. For our C++ implementations, we enqueue filled buffers into the queue by an explicit copy and dequeue the buffer with a different thread. In contrast, our Java queue implementation follows Flink by using buffers as an unmanaged memory area. A tuple matches the size of a cache line, i.e., 64 Byte, and we pass 16.7M tuples, i.e., 1 gigabyte of data, via the queue.

Observation. In Figure 4, we relate the throughput measurements in GB/s to the physically possible main memory bandwidth of 28 GB/s. Furthermore, we scale the buffer size on the x-axis such that it fits a particular cache size. In general, the lock-free queues from the TBB and Boost library achieve the highest throughputs of up to 7 GB/s, which corresponds to 25% of the memory bandwidth. Other C++ implementations achieve slightly lower throughput of up to 7 GB/s. In contrast, the Java queue, which uses conditional variables, achieves the lowest throughput of up to 3 GB/s. Overall, buffering improves throughput as long as the buffer size either matches a private cache size (L1 or L2) or is smaller than the L3 cache size. Inside this range, all buffer sizes perform similar, with a small advantage for a buffer size that matches the L1 cache.

Discussion. Figure 4 shows that queues are a potential bottleneck for a pipelined streaming system using message passing. Compared to a common memory bandwidth of approximately 28 GB/s, the best state-of-the-art queue implementations exploit only one fourth of this bandwidth. Furthermore, a Java queue achieves only 10% of the available memory bandwidth because it exploits conditional variables and unmanaged buffers. These design aspects introduce synchronization as well as serialization overhead.

As a result, the maximum throughput of a streaming system using queues is bounded by the number of queues and their aggregated maximum throughput. Additionally, load imbalances reduce the maximum throughput of the entire streaming system. In particular, a highly skewed data stream can overload some queues and under-utilize others.

3.3 Discussion
The results in Figure 3 suggest using RDMA read operations and a buffer size of several megabytes to utilize today's high-speed networks in a distributed SPE. The results in Figure 4 strongly suggest that a high performance SPE on modern hardware should omit queues as a potential bottleneck. If queues are used, the buffer size should be between the size of the L1 and 2*L2 cache.

4. PROCESSING-RELATED OPTIMIZATIONS
In this section, we examine the processing-related design aspects of SPEs on modern hardware. Following the classification of stream processing optimizations introduced by Hirzel et al. [53], we address operator fusion in Section 4.1, operator fission in Section 4.2, and windowing techniques in Section 4.3. We will address the stream-processing optimization of placement and load balancing in the context of operator fusion and fission [53].

4.1 Operator Fusion
We examine two different execution models for SPEs, namely query interpretation and compilation-based query execution. Both models enable data-parallel pipelined stream processing. For our following considerations, we assume a query plan that is translated into a processing pipeline consisting of sources, intermediate operators, and sinks. Sources continuously provide input data to the processing pipeline and sinks write out the result stream. In general, we can assign a different thread to each operator as well as to each source and sink.

4.1.1 Interpretation-based Query Execution
An interpretation-based processing model is characterized by: 1) an interpretation-based evaluation of the query plan, 2) the application of queues for data exchange, and 3) the possibility of operator fusion. In Figure 5(a), we show an example query plan of a simple select-project-aggregate query using a query interpretation model. To this end, a query plan is translated into a set of pipelines containing producer-consumer operators. Each operator has its own processing function which is executed for each tuple. The interpretation-based processing model uses queues to forward intermediate results to downstream operators in the pipeline. Each operator can have multiple instances where each instance reads from an input queue, processes a tuple, and pushes its result tuples to its output queue. Although all operators process a tuple at a time on a logical level, the underlying engine might perform data exchanges through buffers (as shown in Section 3.2).

In general, interpretation-based processing is used by modern SPEs like Flink, Spark, and Storm to link parallel operator instances using a push-based approach implemented by function calls and queues. In case an operator instance forwards its results to only one subsequent operator, both operators can be fused. Operator fusion combines operators into a single tight for-loop that passes tuples via registers and thus eliminates function calls. Note that this register-level operator fusion goes beyond the function call-based operator fusion used in other SPEs [30, 80, 48]. However, many operators like the aggregate-by-key operator in Figure 5(a) require a data exchange due to partitioning. In particular, each parallel instance of the aggregate-by-key operator processes a

Figure 5: Query execution strategies. (a) Interpretation-based: Source, σ, π, Γ, and Sink as separate operators linked by push/pull producer-consumer pairs; (b) Compilation-based: Source, fused σ,π,Γ, Sink.
Figure 6: Parallelization strategies. (a) Upfront Partitioning; (b) Late Local Merge; (c) Late Global Merge.
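To make Figures 5(b) and 6(c) concrete, the sketch below fuses selection, projection, and a keyed aggregation into one tight loop per worker thread and merges results directly into shared window state using fine-grained atomics, a simplified stand-in for the compare-and-swap based global merge discussed in Sections 4.2 and 4.3. The key domain, field names, filter predicate, and use of fetch-add for the commutative aggregate are illustrative assumptions, not the code generated by any evaluated system.

#include <algorithm>
#include <atomic>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

struct Tuple { uint64_t key, timestamp, value; };

constexpr std::size_t kKeySpace = 10'000;  // illustrative key domain

// Shared window state for Late Global Merge: one slot per key, updated
// with fine-grained atomics instead of a final merge step.
struct WindowSlot {
    std::atomic<uint64_t> count{0};
    std::atomic<uint64_t> sum{0};
};

// Fused select-project-aggregate pipeline: one tight loop per worker,
// tuples stay in registers, and no queues sit between the operators.
void fusedPipeline(const std::vector<Tuple>& partition,
                   std::vector<WindowSlot>& window) {
    for (const Tuple& t : partition) {
        if (t.value <= 100) continue;            // selection (sigma)
        const uint64_t key = t.key % kKeySpace;  // projection / key extraction (pi)
        WindowSlot& slot = window[key];          // keyed aggregation (Gamma)
        slot.count.fetch_add(1, std::memory_order_relaxed);
        slot.sum.fetch_add(t.value, std::memory_order_relaxed);
        // Non-commutative aggregates would use a compare-and-swap loop here instead.
    }
}

int main() {
    const unsigned workers = std::max(1u, std::thread::hardware_concurrency());
    std::vector<WindowSlot> window(kKeySpace);
    std::vector<std::vector<Tuple>> partitions(workers);
    for (unsigned w = 0; w < workers; ++w)       // pre-generated, disjoint input partitions
        for (uint64_t i = 0; i < 100'000; ++i)
            partitions[w].push_back(Tuple{i, 0, i % 1'000});

    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w)
        pool.emplace_back(fusedPipeline, std::cref(partitions[w]), std::ref(window));
    for (auto& t : pool) t.join();
    return 0;
}

Because each worker runs the whole fused pipeline over its own partition, no intermediate queues or per-operator threads are needed; only the shared window slots are synchronized.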

Input Output Data Producer Consumer State range of keys and stores intermediate results as internal op- Stream Stream Flow erator state. As a result, queues are a central component in an interpretation-based model to link operators. Note that, Figure 6: Parallelization strategies. this execution model represents each part of the pipeline as a logical operator. To execute a query, every logical opera- 4.2 Operator Fission tor is mapped to a number of physical operators, which is In this section, we present two general parallelization stra- specified by the assigned degree of parallelism. tegies to distribute the processing among different comput- ing units. This stream processing optimization includes par- 4.1.2 Compilation-based Query Execution titioning, , and replication [53]. With Up- front Partitioning we refer to a strategy that physically par- A compilation-based processing model follows the ideas of titions tuples such that each consumer processes a distinct Neumann [70] and is characterized by: 1) the execution of sub-set of the stream. With Late Merging we refer to a compiled query plans (using custom code generation), 2) op- strategy that logically distributes chunks of the stream to erator fusion performed at query compile time, and 3) an consumers, e.g., using a round-robin. This strategy may improved hardware resource utilization. In general, this ap- require a final merge step at the end of the processing if proach fuses operators within a pipeline until a so-called operators such as keyed aggregations or joins are used. pipeline-breaker is reached. A pipeline breaker requires the Upfront Partitioning (UP). Upfront Partitioning in- full materialization of its result before the next operator troduces a physical partitioning step between producer and can start processing. In the context of databases and in- consumer operators. This partitioning consists of an inter- memory batch processing, pipeline breakers are relational mediate data shuffling among n producers and m consumers, operators such as sorting, join, or aggregations. In the con- which are mapped to threads. Each producer applies a par- text of streaming systems, the notion of a pipeline breaker titioning function (e.g., consistent hashing or round-robin) is different because streaming data is unbounded and, thus, after processing a tuple to determine the responsible con- the materialization of a full stream is impracticable. There- sumer. In particular, partitioning functions ensure that tu- fore, we define operators that send partially materialized ples with the same key are distributed to the same consumer. data to the downstream operator as soft pipeline breakers. In Figure 6(a), we present this common partition-based par- With partially materialized, we refer to operators that re- allelization strategy, which is used by Apache Flink [29], quire buffering before they can emit their results, e.g., win- Apache Spark Streaming [89], and Apache Storm [81]. dowed aggregation. Compilation-based SPEs such as IBM’s This strategy leverages queues to exchange data among System S/SPADE and SABER use operator fusion to re- producer and consumer operators. Thus, each producer move unnecessary data movement via queues. explicitly pushes its output tuples to the input queue of In Figure 5(b), we present a select-project-aggregate query its responsible consumer. Although the use of buffers im- in a compilation-based model. 
In contrast to an interpret- proves the throughput of this strategy (see Section 3.2), we ation-based model, all three operators are fused in a single showed that queues establish a severe bottleneck in a scale- operator that can run with any degree of parallelism. This up SPE. Furthermore, the load balancing on consumer side model omits queues for message passing and passes tuples highly depends on the quality of the partitioning function. via CPU registers. Additionally, it creates a tight loop over If data are evenly distributed, the entire system achieves the input stream, which increases data and instructions lo- high throughput. However, an uneven distribution leads to cality. Neumann showed that this model leads to improved over-/underutilized parallel instances of an operator. resource utilizations on modern CPUs in databases [70]. The processing using upfront partitioning creates inde- In terms of parallelization, operator fusion results in a re- pendent/distinct, thread-local intermediate results on each duced number of parallel operators compared to the query consumer. Thus, assembling the final output result requires interpretation-based model. Instead of assigning one thread only a concatenation of thread local intermediate results. to the producer and another thread to the consumer, this An SPE can perform this operation in parallel, e.g., by writ- model executes both sides in one thread. Thus, with the ing to a distributed file system. same number of available threads, the parallelism increases. Late Merging. Late Merging divides the processing among independent threads that pull data by their own. 4.1.3 Discussion Leis et al. [62] and Pandis et al. [72] previously applied this We contrast two possible processing models for an SPE. partitioning approach in the context of high-performance, On a conceptual level, we conclude that the compilation- main memory databases. In this paper, we revise those based execution model is superior for exploiting the resource techniques in the context of stream processing. In contrast capabilities of modern hardware. In particular, it omits to upfront partitioning, this strategy requires an additional queues, avoids function calls, and results in code optimized merge step at the end of the processing pipeline. In the for the underlying hardware. We expect the same improve- following, we introduce two merging strategies: Local Late ment for streaming queries as Neumann reported for database Merge and Global Late Merge. queries [70]. However, a scale-out optimized SPE commonly In the Local Late Merge (LM) strategy, each worker introduces queues to distribute the computation among dif- thread buffers its partial results in a local data structure (see ferent nodes to decouple the producer from the consumer. Figure 6(b)). When the processing pipeline reaches a soft pipeline-breaker (e.g., a windowed aggregation), those par-

5 tial results have to be merged into a global data structure. Because each worker has potentially tuples of the entire do- main, the merge step is more complex compared to the sim- ple concatenation used by UP. One solution for merging the results of a soft pipeline-breaker such as a aggregation op- erator is to use parallel tree aggregation algorithms. Figure 7: Alternating window buffers. In the Late Global Merge (GM) strategy, all worker threads collaboratively create a global result during execu- of incoming tuples. Furthermore, multiple non-active buffers tion (see Figure 6(c)). In contrast to late local merge, this store the results of previous windows. The non-active buffers strategy omits an additional merging step at the end. In- can be used to complete the aggregate computation, output stead, this strategy introduces additional synchronization the results, and reinitialize the buffer memory. As a result, overhead during run-time. In particular, if the query plan this implementation never defers the processing of the input contains stateful operators, e.g., keyed aggregation, the sys- stream, which increases the throughput of the window oper- tem has to maintain a global state among all stateful op- ation for both parallelization strategies (see Section 4.2). To erators of a pipeline. To minimize the maintenance costs, implement the buffers, we use lock-free data structures that lock-free data structure should be utilized. Furthermore, exploit atomic compare-and-swap instructions. This enables contention can be mitigated among instances of stateful op- concurrent accesses and modification [51, 62, 63]. As a re- erators through fine-grained updates using atomic compare- sult, several threads can write to a window buffer in parallel and-swap instructions. Finally, Leis et al. [62] describe a and thereby aggregate different parts. similar technique in the context of batch processing, whereas Detecting Window Ends. We implement an event- Fernandez et al. [45] provide solutions for scale-out SPEs. based approach where writer threads check upon the arrival Depending on the characteristics of the streaming query, of a tuple if the current window ended. This implies a very an SPE should either select late merge or global merge. If low latency for high-bandwidth data streams because arriv- the query induces high contention on a small set of data val- ing tuples cause frequent checks for window ends. Discon- ues, e.g., an aggregations with a small group cardinality, lo- tinuous streams [60] may lead to high latencies, because of cal merge is beneficial because it reduces synchronization on the absence of those checks during the discontinuity. In this the global data structure. In contrast, if a query induces low case, we switch between our lock-free implementation and a contention, e.g., an aggregation with a large group cardinal- timer-thread implementation depending on the continuity of ity, global merge is beneficial because it omits the additional input streams. A timer thread checks for window ends pe- merging step at the end. riodically and if required triggers the output of aggregates. Discussion. In this section, we presented different par- We implement the switch between timer threads and our allelization strategies. Upfront partitioning builds on top event-driven technique directly at the source by monitoring of queues to exchange data between operators. This strat- the input rate. egy is commonly used in scale-out SPEs in combination Extensions. 
In order to achieve very low output la- with a interpretation-based model.However, as shown in Sec- tencies, we aggregate tuples incrementally whenever possi- tion 3.2, queues are a potential bottleneck for a scale-up op- ble [79]. In order to process sliding windows (window with timized SPE. Therefore, we present two late merging strate- overlapping time-frames), we combine our implementation gies that omit queues for data exchange. In particular for with stream slicing [31, 61, 64]. Slicing techniques divide a scale-up SPE, those strategies enable a compilation-based a data stream in non-overlapping chunks of data (slices) execution (see Section 4.1). With late merging and a compi- such that all windows are combinations of slices. When lation-based query execution, we expect the same improve- processing sliding windows, we switch alternating buffers ments for SPEs that Neumann achieved for in-memory data- upon the start of each new slice. We use time-based win- bases [70]. dows as an example to simplify the description. However, our windowing technique works with arbitrary stream slic- 4.3 Windowing ing techniques, which enables diverse window types such as count-based windows [12, 52] and variants of data-driven Windowing is a core feature of streaming systems that and user-defined windows [31, 47, 50]. We provide a detailed splits the conceptually infinite data stream into finite chunks discussion of window types in the related work in Section 6. of data, i.e., windows. An SPE computes an aggregate for Discussion. We implement a lock-free windowing mecha- each window, e.g., the revenue per month. As a main goal, nism that minimizes the contention between worker threads. an efficient windowing mechanism for an SPE on modern This minimized contention is crucial for a scale-up optimized hardware has to minimize the synchronization overhead to SPE on modern hardware. To determine its efficiency, we enable a processing. To this end, we im- run a micro-benchmark using the data set of the Yahoo! plement a lock-free continuous windowing mechanism fol- Streaming Benchmark and compare it to a common solution lowing the ideas in [58, 78, 79]. Our implementation uses based on timer threads. Our results show that a double- a double-buffer approach and fine-grained synchronization buffer approach achieves about 10% higher throughput on a primitives, i.e., atomic instructions, instead of coarse-grained scale-up SPE on modern hardware. locks to minimize synchronization overhead. Moreover, our implementation induces a smaller memory footprint than approaches based on aggregate trees [17] as we use on-the- 5. EVALUATION fly aggregation [79] instead of storing aggregate trees. In this section, we experimentally evaluate design aspects Alternating Window Buffers. In Figure 7, we show of an SPE on modern hardware. In Section 5.1, we introduce our double-buffer implementation based on the ideas in [78, our experimental setup including machine configuration and 58, 79]. At its core, it exploits alternating window buffers to selected benchmarks. After that, we conduct a series of ensure that there is always one active buffer as a destination experiments in Section 5.2 to understand the reasons for

6 different throughput values of different implementations and its results in an in-memory structure through a sink. Thus, state-of-the-art SPEs. Finally, we summarize and discuss we exclude network I/O and disks from our experiments. our results in Section 5.3. Java Optimizations. We optimize our Java implemen- tations and the JVM to overcome well-known performance issues. We ensure that the just-in-time compiler is warmed- 5.1 Experimental Setup up [66] by repeating the benchmark 30times. We achieve In the following, we present the hardware and software efficient memory utilization by avoiding unnecessary mem- configurations used for our analysis as well as the selected ory copy, object allocations, GC cycles, and pointer chasing benchmarks and their implementation details. operations [28] Furthermore, we set the heap size to 28 GB Hardware and Software. We execute our benchmarks to avoid GC overhead, use primitive data types and byte on an Intel Core i7-6700K with 4 GHz and four buffer for pure-java implementations and off-heap memory physical cores (eight cores using hyper-threading). This pro- for the scale-out SPEs [43, 85]. Finally, we use the Garbage- cessor contains a dedicated 32 KB L1 cache for data and First (G1GC) garbage collector which performs best for our instructions per core. Additionally, each core has a 256 KB examined benchmarks [10]. L2 cache and all cores share an 8 MB L3 cache. The test Benchmarks. We select the Yahoo! Streaming Bench- system has 32 GB of main memory and runs Ubuntu 16.04. mark (YSB), the Linear Road Benchmark (LRB), and a If not stated otherwise, we execute all measurements using query on the New York City Taxi (NYT) dataset to as- all logical cores, i.e., a degree of parallelism of eight. sess the performance of different design aspects for modern The C++ implementations are compiled with GCC 5.4 SPEs. YSB simulates a real-word advertisement analytics and O3 optimization as well as the mtune flags to produce task and its main goal is to assess the performance of win- specific code for the underlying CPU. The Java implemen- dowed aggregation operators [38, 39]. We implement YSB tations run on the HotSpot VM in version 1.8.0 131. We [38] using 10K campaigns based on the codebase provided use Apache Flink 1.3.2, Apache Spark Streaming 2.2.0, and by [49, 39]. Following these implementations, we omit a Apache Storm 1.0.0 as scale-out SPEs. We disable fault- dedicated key/value store and perform the join directly in tolerance mechanisms (e.g., checkpointing) to minimize the the engine [49, 39]. overhead of those frameworks. We use the Streambox (writ- LRB simulates a highway toll system [16]. It is widely ten in C++) release of March 10th and Saber (written in adopted to benchmark diverse systems such as relational Java) in version 0.0.1 as representative scale-up SPEs. We databases [16, 25, 65, 77], specialized streaming systems [12, measure hardware performance counters using Intel VTune 16, 54, 46], distributed systems [24, 90, 91], and cloud-based Amplifier XE 2017 and the PAPI library 5.5.1. systems [32, 37]. We implement the accident detection and The JVM-based scale-out SPEs have to serialize and de- the toll calculation for the LRB benchmark [16] as a sub-set serialize tuples to send them over the network. The (de)- of the queries. 
serialization requires a function call and a memory copy The New York City Taxi dataset covers 1.1 billion indi- from/to the buffer for each field of the tuple, which adds vidual taxi trips in New York from January 2009 through extra overhead for JVM-based SPEs [82]. In contrast, a June 2015 [76]. Besides pickup and drop-off locations, it scale-up C++ based SPE accesses tuples directly in dense provides additional informations such as the trip time or in-memory data structures. distance, the fare, or the number of passengers. We divided Strategies Considered. We implement each bench- the area of NY into 1000 distinct regions and formulate the mark query using the three previously discussed paralleliza- following business relevant query: What are the number of tion strategies: upfront partitioning as well as late merg- trips and their average distance for the VTS vendor per re- ing with Local Merge and Global Merge (see Section 4.2). gion for rides more than 5 miles over the last two seconds? The upfront partitioning uses an interpretation-based exe- The answer to this query would provide a value information cution strategy and both Late Merge implementations use for taxi drivers in which region they should cruise to get a a compilation-based execution. Furthermore, all implemen- profitable ride. tations use the lock-free windowing mechanism described in YSB, LRB, and NYT are three representative workloads Section 4.3. For each strategy and benchmark, we provide for todays SPEs. However, we expect the analysis of other a Java and a C++ implementation. benchmarks would obtain similar results. We implement The upfront partitioning strategy follows the design of all three benchmarks in Flink, Storm, Spark Streaming, state-of-the-art scale-out SPEs. Thus, it uses message pass- Streambox, and Saber. Note that for SABER, we use the ing via queues to exchange tuples among operators. In con- YSB implementation provided in [73] and implement the trast, both Late Merge strategies omit queues and fuse oper- LRB benchmark and NYT query accordingly. Finally, we ators where possible. Furthermore, we pass tuples via local provide hand-coded Java and C++ implementations for the variables in CPU registers. In the following, we refer to Late three parallelization strategies presented in Section 4.2. In Local Merge and Late Global Merge as Local Merge (LM) our experiments, we exclude the preliminary step of pre- and Global Merge (GM), respectively. loading data into memory from our measurements. In our experiments, we use the memory bandwidth as an Experiments. In total, we conduct six experiments. upper bound for the input rate. However, future network First, we evaluate the throughput of the design alternatives technologies will increase this boundary [22]. To achieve a showed in this paper to get an understanding of their over- data ingestion at memory speed, we ingest streams to our all performance. In particular, we examine the through- implementations using an in-memory structure. Each in- put of the design alternatives for the YSB and LRB (see memory structure is thread-local and contains a pre-gene- Section 5.2.1). Second, we break down the execution time rated, disjoint partition of the input. We parallelize query of the design alternatives to identify the CPU component processing by assigning each thread to a different input par- that consume the majority of the execution time (see Sec- tition. At run-time, each implementation reads data from tion 5.2.2). 
Third, we sample data and cache-related perfor- an in-memory structure, applies its processing, and outputs

Figure 8: YSB single node. Figure 9: LRB single node. Figure 10: NYT query single node. (Throughput in M records/s for (a) Flink, (b) Spark, (c) Storm, (d) Java: UP, (e) C++: UP, (f) Java: LM, (g) C++: LM, (h) Java: GM, (i) C++: GM, (j) Streambox, and (k) Saber, relative to the memory bandwidth.)

mance counters to compare the resource utilization in detail (see Section 5.2.3). Fourth, we compare our scale-out optimized implementations of the YSB with published benchmark results of state-of-the-art SPEs running in a cluster (see Section 5.2.4). Fifth, we show how scale-out optimizations perform in a distributed execution using a high-speed InfiniBand in Section 5.2.5. Finally, we evaluate the impact of design alternatives on latency (see Section 5.2.6).

5.2 Results
In this section, we present the results of the series of experiments presented in the previous section. The goal is to understand the differences in performance of state-of-the-art SPEs and the design alternatives presented in this paper.

5.2.1 Throughput
To understand the overall efficiency of the design alternatives presented in this paper, we measure their throughput in million records per second for both benchmarks in Figures 8 and 9. Additionally, we relate their throughput to the physically possible memory bandwidth, which could be achieved by only reading data from an in-memory structure without processing it.

YSB Benchmark. The scale-out optimized SPEs Flink, Spark Streaming, and Storm exploit only a small fraction of the physical limit, i.e., 1.5% (Flink), 0.8% (Spark Streaming), and 0.07% (Storm). The scale-up optimized Streambox (written in C++) achieves about 17% and Saber (written in Java) achieves 6.6% of the theoretically possible throughput. In contrast, the C++ implementations achieve a throughput of 68.5% (Upfront Partitioning), 91.4% (Local Merge), and 91% (Global Merge) compared to memory bandwidth. For Local and Global Merge, these numbers are a strong indicator that they are bounded by memory bandwidth. We approximate the overhead for using a JVM compared to C++ to a factor of 15 (Upfront Partitioning), 16 (Local Merge), and 13 (Global Merge). We analyze the implementations in-depth in Sections 5.2.2 and 5.2.3 to identify causes of these large differences in performance.

As shown in Figure 8, Local and Global Merge outperform the Upfront Partitioning strategy by a factor of 1.34 and 1.33. The main reason for the sub-optimal scaling of Upfront Partitioning is the use of queues. First, queues represent a bottleneck if their maximum throughput is reached (see Section 3.2). Second, queues require the compiler to generate two distinct code segments that are executed by different threads such that tuples cannot be passed via registers. Thus, queues lead to less efficient code, which we investigate in detail in Section 5.2.3. In contrast, Local and Global Merge perform similar for YSB and almost reach the physical memory limit of a single node. The main reason is that threads work independently from each other because the overall synchronization among threads is minimal.

LRB Benchmark. In Figure 9, we present throughput results for the LRB benchmark. Flink, Spark Streaming, and Storm exploit only a small fraction of the physical limit, i.e., 0.27% (Flink), 0.03% (Spark Streaming), and 0.01% (Storm). We approximate the overhead of each framework by comparing their throughput with the Java implementation of the upfront partitioning. Such overhead ranges from a factor of 2.7 (Flink), over a factor of 9 (Spark Streaming), to a factor of 39 (Storm). Note that Spark Streaming processes data streams in micro-batches [88]. We observe that the scheduling overhead for many small micro-batch jobs negatively affects the overall throughput.

The scale-up optimized SPEs Streambox (written in C++) and Saber (written in Java) achieve about 9% and less than 1% of the theoretically possible throughput, respectively. In contrast, the C++ implementations achieve a throughput of 15% (UP), 48% (LM), and 67% (GM) of the physical limit. We approximate the overhead of a JVM compared to C++ to a factor of 20 (UP), 40 (LM), and 55 (GM).

Compared to YSB, the overall throughput of the LRB is lower because it is more complex, requires more synchronization, and maintains larger operator states. In particular, the consumer requires lookups in multiple data structures compared to a single aggregation in the YSB. As a result, concurrent access by different threads to multiple data structures introduces cache thrashing and thus prevents the LRB implementations from achieving a throughput close to the physical limit.

Overall, the Global Merge strategy achieves the highest throughput because it writes into a shared data structure when windows end. In particular, we use a lock-free hash table based on atomic compare-and-swap instructions and open addressing [51]. Thus, each processing thread writes its results for a range of disjoint keys, which leads to lower contention on the cells of the hash table. As a result, the C++ implementation of the Global Merge outperforms the Java implementation by a factor of 54 because it uses fine-grained atomic instructions (assembly instructions) that are not available in Java. In contrast, the C++ Late Merge implementation is slower than Global Merge by a factor of 1.4. The main reason is that the additional merging step at the end of a window introduces extra overhead in terms of thread synchronization. Furthermore, the Java implementations of the Local Merge and Global Merge perform similarly because they implement the same atomic primitives.

The upfront partitioning is slower than Local Merge by a factor of 3.2 because it utilizes queues, buffers, and extra threads, which introduce extra overhead. An analysis of the execution reveals that the increased code complexity results in a larger code footprint. Furthermore, the upfront partitioning materializes input tuples on both sides of the queue.

8 As a result, the Java implementation of upfront partitioning 100 is slower than the C++ counterpart by a factor of 20. 80 In the remainder of this evaluation, we use Flink and Streambox as representative baselines for state-of-the-art 60 scale-up and scale-out SPEs. Furthermore, we restrict the 40 in-depth performance analysis to the YSB because the over- 20 all results of YSB and LRB are similar. ercentage of Time. 0 New York Taxi Query In Figure 10, we present the P Flink Stream Java C++ Java: C++ Java C++ throughput results for the NYT query. As shown, Flink, box UP UP LM LM GM GM Spark Streaming, and Storm exploit only a small fraction of FE Bound Bad Speculation Retiring physical limit, i.e., 0.5% (Flink), 0.37% (Spark Streaming), Memory Bound Core Bound and 0.34% (Storm). The pure Java implementations per- Figure 11: Execution time breakdown YSB. form faster and exploit 16% (Java UP), 29% (Java LM), and 20% (Java GM) of the available memory bandwidth. In con- maining time is spent in the back-end (23%-28%) by waiting trast, the C++ based implementation achieve a much higher for functional units of the out-of-order engine. The main bandwidth utilization of 69% (C++ UP), 80% (C++ LM), reason for that is that Java code accesses tuples using a and 94% (C++ GM). Finally, the scale-out optimized SPEs random access pattern and thus spend the majority of time Streambox (17%) and Saber (6%) perform better than the waiting for data. Therefore, an instruction related stall time scale-out optimized SPEs. Compared to the YSB, this query is mainly overlapped by long lasting data transfers. In con- induces simpler instructions but each tuples is larger, which trast, C++ implementations supply data much faster to the leads to better performing pure-Java implementations. How- CPU such that the front-end related stall time becomes more ever, the the C++ implementations are utilizing the mem- predominant. We conclude that Java implementations are ory bandwidth more efficiently and the best variant (GM) significantly core bound. LRB induces similar cycle distri- achieves throughput values near the physical memory limit. butions and is therefore not shown. 5.2.2 Execution Time Breakdown All C++ implementations are more memory bound (43%- 70%) compared to their respective Java implementations In this section, we break down the execution time for (33%-40%). To further investigate the memory component, different CPU components following the Intel optimization we analyze the exploited memory bandwidth. Flink and guide [41]. This time breakdown allows us to identify which Streambox utilizes only 20% of the available memory band- parts of the micro-architecture are the bottleneck. width. In contrast, C++ implementations utilize up to Metrics. Intel provides special counters to monitor 90% of the memory bandwidth and Java implementations buffers that feed micro-ops supplied by the front-end to the up to 65%. In particular, the Java implementations exploit out-of-order back-end. Using these counters, we are able roughly 70% of the memory bandwidth of their respective to derive which CPU component stalls the CPU pipeline C++ implementations. Therefore, other bottlenecks like in- and for how long. In the following, we describe these com- efficient resource utilization inside the CPU may prevent a ponents in detail. First, the front-end delivers up to four higher record throughput. Finally, the more complex code micro-ops per cycle to the back-end. 
If the front-end stalls, of LRB results in 20% lower memory bandwidth among all the rename/allocate part of the out-of order engine starves implementations (not shown here). We conclude that C++ and thus execution becomes front-end bound. Second, the implementations are memory bound. back-end processes instructions issued by the front-end. If Outlook. The memory bound C++ implementations can the back-end stalls because all processing resources are occu- benefit from an increased bandwidth offered by future net- pied, the execution becomes back-end bound. Furthermore, work and memory technologies. In contrast, Java implemen- we divide back-end stalls into stalls related to the memory tations would benefit less from this trend because they are sub-system (called Memory bound) and stalls related to the more core bound compared to the C++ implementations. execution units (called Core bound). Third, bad speculation In particular, a CPU becomes core bound if its computing summarizes the time that the pipeline executes speculative resources are inefficiently utilized by resource starvation or micro-ops that never successfully retire. This time repre- non-optimal execution ports utilization. In the same way, sents the amount of work that is wasted by branch mispre- state-of-the-art scale-out SPEs would benefit less from in- dictions. Fourth, retiring refers to the number of cycles that creased bandwidth because they are significantly front-end are actually used to execute useful instructions. This time bound. In contrast, Streambox as a scale-up SPE would represents the amount of useful work done by the CPU. benefit from an improved code generation that utilizes the Experiment. In Figure 11, we breakdown the execution available resources more efficiently. Overall, Java and cur- of cycles spent in the aforementioned five CPU components rent SPEs would benefit from additional resources inside the based on the Intel optimization guide [41]. CPU, e.g., register or compute units, and from code that Flink is significantly bounded by the front-end and bad utilizes the existing resources more efficiently. speculation (up to 37%). Furthermore, it spends only a small portion of its cycles waiting on data (23%). This in- 5.2.3 Analysis of Resource Utilization dicates that Flink is CPU bound. In contrast, Streambox is In this section, we analyze different implementations of largely core bounded which indicate that the generated code YSB regarding their resource utilization. In Table 1, we does not use the available CPU resources efficiently. present sampling results for Flink and Streambox as well as All Java implementations are less front-end bound and Upfront Partitioning (Part.), Local Merge (LM), and Global induce less bad speculation and more retiring instructions Merge (GM) implementations in Java and C++. Each im- compared to their respective C++ implementations. In par- plementation processes 10M records per thread and the sam- ticular, the Java implementations spend only few cycles in pling results reflect the pure execution of the workload with- the front-end or due to bad speculation (5%-11%). The re- out any preliminary setup.
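The counters discussed in this section can also be sampled programmatically. The following sketch uses the PAPI library mentioned in Section 5.1 to derive per-tuple counter values similar to those reported in Table 1. The chosen preset events and the measured loop are illustrative; the sketch assumes PAPI is installed and the binary is linked with -lpapi, and it does not reproduce the VTune-based breakdown of Figure 11.

#include <cstdint>
#include <cstdio>
#include <vector>
#include <papi.h>

int main() {
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        std::fprintf(stderr, "PAPI init failed\n");
        return 1;
    }
    int event_set = PAPI_NULL;
    PAPI_create_eventset(&event_set);
    // Illustrative preset events: branch mispredictions, L1/L2 data cache
    // misses, and total instructions; availability depends on the CPU.
    int events[] = {PAPI_BR_MSP, PAPI_L1_DCM, PAPI_L2_DCM, PAPI_TOT_INS};
    const int num_events = sizeof(events) / sizeof(events[0]);
    if (PAPI_add_events(event_set, events, num_events) != PAPI_OK) {
        std::fprintf(stderr, "some events are not available on this CPU\n");
        return 1;
    }

    // Workload under measurement: a stand-in for one benchmark run.
    std::vector<uint64_t> data(10'000'000, 1);
    long long counters[num_events] = {0};
    PAPI_start(event_set);
    uint64_t sum = 0;
    for (uint64_t v : data) sum += v;
    PAPI_stop(event_set, counters);

    const double tuples = static_cast<double>(data.size());
    std::printf("branch mispred./tuple: %f\n", counters[0] / tuples);
    std::printf("L1D misses/tuple:      %f\n", counters[1] / tuples);
    std::printf("L2D misses/tuple:      %f\n", counters[2] / tuples);
    std::printf("instructions/tuple:    %f\n", counters[3] / tuples);
    return static_cast<int>(sum & 1);
}

Dividing the raw counter values by the number of processed tuples yields per-tuple metrics that can be compared across implementations, as done in Table 1.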

5.2.3 Analysis of Resource Utilization

In this section, we analyze different implementations of YSB regarding their resource utilization. In Table 1, we present sampling results for Flink and Streambox as well as Upfront Partitioning (UP), Local Merge (LM), and Global Merge (GM) implementations in Java and C++. Each implementation processes 10M records per thread and the sampling results reflect the pure execution of the workload without any preliminary setup.

Table 1: Resource utilization of design alternatives (all counters per input tuple).

                   Flink    Streambox  Java UP  C++ UP    Java LM  C++ LM    Java GM  C++ GM
Branches           549      350        418      6.7       191      2.7       192      2.6
Branch Mispred.    1.84     2.53       1.44     0.37      0.77     0.37      0.79     0.36
L1D Misses         52.16    42.4       12.86    1.93      5.66     1.41      7.31     1.42
L2D Misses         81.93    107.09     34.42    3.54      14.50    2.22      23.79    3.66
L3 Misses          37.26    72.04      13.97    1.67      6.88     1.56      6.70     1.83
TLBD Misses        0.225    0.825      0.145    0.00035   0.01085  0.000175  0.00292  0.0012
L1I Misses         19.5     1.35       0.125    0.00125   0.0155   0.0002    0.0229   0.000075
L2I Misses         2.2      0.6        0.075    0.001     0.008    0.0001    0.009    0.00007
TLBI Misses        0.04     0.3        0.003    0.0004    0.0007   0.000008  0.0009   0.000003
Instr. Executed    2,247    2,427      1,344    136       699      49        390      64

Control Flow. The first block of results presents the number of branches and branch mispredictions. These counters are used to evaluate the control flow of an implementation. As shown, the C++ implementations induce only a few branches as well as branch mispredictions compared to the Java implementations (up to a factor of 3) and Flink (up to a factor of 5). However, all C++ implementations induce the same number of branch mispredictions but different numbers of branches. The main reason for that is that UP induces additional branches for looping over the buffer. Since loops induce only few mispredictions, the number of mispredictions does not increase significantly.

Overall, C++ implementations induce a lower branch misprediction rate, which wastes fewer cycles for executing mispredicted instructions and loads less unused data. Furthermore, a low branch misprediction rate increases code locality. Regarding the state-of-the-art SPEs, Streambox induces fewer branches but more branch mispredictions.

Data Locality. The second block presents data cache related counters, which can be used to evaluate the data locality of an implementation. Note that YSB induces no tuple reuse because a tuple is processed once and is never reloaded. Therefore, data-related cache misses are rather high for all implementations. As shown, the C++ implementations outperform the Java based implementations by inducing the least amount of data cache misses per input tuple in all three cache levels. The main reason is the random memory access pattern that is introduced by object scattering in memory for JVM-based implementations. Furthermore, all C++ implementations produce similar numbers but the Java implementations differ significantly. Interestingly, the Java Upfront Partitioning implementation and Flink induce a significantly higher number of misses in the data TLB cache, which correlates directly with their low throughput (see Figure 8). Additionally, Streambox induces the most data related cache and TLB misses, which indicates that it utilizes a sub-optimal data layout.

Overall, C++ implementations induce up to 9 times fewer cache misses per input tuple, which leads to a higher data locality. This higher data locality results in shorter access latencies for tuples and thus speeds up execution significantly.

Code Locality. The third block presents instruction cache related counters, which can be used to evaluate the code locality of an implementation. The numbers show that the C++ implementations create more efficient code that exhibits a high instruction locality with just a few instruction cache misses. The high instruction locality originates from a small instruction footprint and only few mispredicted branches. This indicates that the instruction footprint of the C++ implementations of YSB fits into the instruction cache of a modern CPU. Furthermore, the late merging strategies produce significantly fewer cache misses compared to the Upfront Partitioning strategy. The misses in the instruction TLB follow this behavior.

The last block presents the number of executed instructions per input tuple, which also impacts the code locality. Again, the C++ implementations execute significantly fewer instructions compared to Java implementations. Furthermore, the late merging strategies execute fewer instructions than UP. This is also reflected in the throughput numbers in Figure 8. Since the C++ implementations induce fewer instruction cache misses compared to the respective Java implementations, we conclude that a larger code footprint and the virtual function calls of a JVM are responsible for that. Interestingly, Flink executes almost as many instructions as Streambox but achieves a much lower throughput.

For all instruction related cache counters, Flink induces significantly more instruction cache misses compared to the Java implementations, by up to three orders of magnitude. The main reason for this is that the code footprint of the generated UDF code exceeds the size of the instruction cache. In contrast, Streambox generates smaller code footprints and thus induces fewer instruction related cache misses.

Summary. All C++ implementations induce a better control flow as well as a higher data and instruction locality. In contrast, Java-based implementations suffer from JVM overheads such as pointer chasing and transparent memory management. Overall, both late merging strategies outperform Upfront Partitioning. Finally, Flink and Streambox are not able to exploit modern CPUs efficiently.
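Per-tuple counter values like those in Table 1 can be obtained by reading hardware counters around the processing loop and dividing by the number of processed tuples. Below is a minimal sketch using the Linux perf_event_open interface; it is not the measurement harness used for Table 1, the workload is a stand-in, and error handling is omitted.

```cpp
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <vector>

// Open one hardware counter for the calling thread.
static int open_counter(uint32_t type, uint64_t config) {
  perf_event_attr attr{};
  attr.size = sizeof(attr);
  attr.type = type;
  attr.config = config;
  attr.disabled = 1;
  attr.exclude_kernel = 1;
  attr.exclude_hv = 1;
  return static_cast<int>(syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0));
}

int main() {
  int fd_branch = open_counter(PERF_TYPE_HARDWARE, PERF_COUNT_HW_BRANCH_MISSES);
  int fd_llc    = open_counter(PERF_TYPE_HARDWARE, PERF_COUNT_HW_CACHE_MISSES);

  // Hypothetical workload: sum over a buffer of 10M 8-byte "tuples".
  std::vector<uint64_t> tuples(10'000'000, 1);

  ioctl(fd_branch, PERF_EVENT_IOC_RESET, 0);  ioctl(fd_llc, PERF_EVENT_IOC_RESET, 0);
  ioctl(fd_branch, PERF_EVENT_IOC_ENABLE, 0); ioctl(fd_llc, PERF_EVENT_IOC_ENABLE, 0);

  uint64_t sum = 0;
  for (uint64_t t : tuples) sum += t;          // measured region

  ioctl(fd_branch, PERF_EVENT_IOC_DISABLE, 0); ioctl(fd_llc, PERF_EVENT_IOC_DISABLE, 0);

  uint64_t mispred = 0, llc = 0;
  read(fd_branch, &mispred, sizeof(mispred));
  read(fd_llc, &llc, sizeof(llc));

  std::printf("branch mispred./tuple: %f, LLC misses/tuple: %f (checksum %llu)\n",
              double(mispred) / tuples.size(), double(llc) / tuples.size(),
              (unsigned long long)sum);
  close(fd_branch); close(fd_llc);
}
```

Depending on the system configuration, access to these counters may require lowering /proc/sys/kernel/perf_event_paranoid or running with elevated privileges.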

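To make the difference between the parallelization strategies in Table 1 concrete, the following simplified sketch contrasts Upfront Partitioning with a Late Merge (Local Merge flavor) variant on a YSB-like count-per-campaign aggregation. It only illustrates the idea and is not the code used in the experiments; all names (Tuple, kView, local_merge, etc.) are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Simplified YSB-like tuple: an event type that is filtered and a campaign id
// that is counted per window.
struct Tuple { uint32_t campaign; uint32_t event_type; };
constexpr size_t   kCampaigns = 100;
constexpr uint32_t kView      = 1;   // only "view" events are counted

// Upfront Partitioning: every tuple is inspected and copied into the buffer of
// the worker that owns its key. The extra per-tuple branch and the scattered
// writes are what show up as additional branches, data-cache and TLB misses.
void upfront_partition(const std::vector<Tuple>& in, size_t workers,
                       std::vector<std::vector<Tuple>>& queues) {
  queues.assign(workers, {});
  for (const Tuple& t : in) {
    size_t owner = t.campaign % workers;   // partitioning decision per tuple
    queues[owner].push_back(t);            // scattered write into a foreign buffer
  }
}

// Late Merge: each worker scans a contiguous block of the stream, filter and
// window aggregation are fused into one loop, and results are kept in
// worker-local arrays that are merged once per window.
std::array<uint64_t, kCampaigns> local_merge(const std::vector<Tuple>& in,
                                             size_t workers) {
  std::vector<std::array<uint64_t, kCampaigns>> partial(
      workers, std::array<uint64_t, kCampaigns>{});
  std::vector<std::thread> pool;
  size_t chunk = in.size() / workers + 1;
  for (size_t w = 0; w < workers; ++w) {
    pool.emplace_back([&, w] {
      size_t begin = w * chunk, end = std::min(in.size(), begin + chunk);
      for (size_t i = begin; i < end; ++i) {
        if (in[i].event_type != kView) continue;     // fused filter
        partial[w][in[i].campaign % kCampaigns]++;   // fused aggregation, local write
      }
    });
  }
  for (auto& t : pool) t.join();
  std::array<uint64_t, kCampaigns> result{};
  for (const auto& p : partial)                      // one merge per window
    for (size_t c = 0; c < kCampaigns; ++c) result[c] += p[c];
  return result;
}
```

The Late Merge variant also illustrates operator fusion: filter and windowed aggregation run in the same sequential loop, so no queue is needed between them, which corresponds to the difference between the UP and LM/GM columns above.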
5.2.4 Comparison to Cluster Execution

In this section, we compare our hand-written implementations with scale-out optimized SPEs on a cluster of computing nodes. Additionally, we present published results from the scale-up SPEs SABER and Streambox. In Figure 12, we show published benchmark results of YSB on a cluster [49, 18, 86]. We choose to use external numbers because the vendors carefully tuned their systems to run the benchmark as efficiently as possible on their cluster.

Figure 12: YSB reported throughput. (a) 10 Nodes [49], (b) 10 Nodes [18], (c) 1 Node [86], (d) optimized.

The first three numbers are reported by Databricks [18] on a cluster of 10 nodes. They show that Structured Streaming, an upcoming version of Apache Spark, significantly improves the throughput (65M rec/sec) compared to Kafka Streams (2M rec/sec) and Flink (15M rec/sec). The next three numbers are reported by dataArtisans [49] on a cluster of 10 nodes. They show that Storm with Kafka achieves only 400K rec/sec and Flink using Kafka achieves 3M rec/sec. However, if they omit Kafka as the bottleneck and move data generation directly into Flink, they increase the throughput up to 15M rec/sec.

Figure 12 shows that, even using 10 compute nodes, the throughput is significantly below the best performing single node execution that we present in this paper (C++ LM). The major reason for this inefficient scale-out is the partitioning step involved in distributed processing. As the number of nodes increases, the network traffic also increases. Therefore, the network bandwidth becomes the bottleneck and limits the overall throughput of the distributed execution. However, even without the limitations on network bandwidth, i.e., by using one node and main memory, state-of-the-art SPEs cannot utilize modern CPUs efficiently (see Figure 8). One major reason for this is that their optimization decisions are focused on scale-out instead of scale-up.

The next two benchmarking results, reported in [73], measure the throughput of SABER and Streambox on one node. Although they achieve a higher throughput compared to scale-out SPEs, they still do not exploit the capabilities of modern CPUs entirely. Note that the numbers reported in [73] differ from our numbers because we use different hardware. Furthermore, we change the implementation of YSB on Streambox to process characters instead of strings and optimize the parameter configuration such that we can improve the throughput to 71M rec/sec (see Figure 8). The last benchmarking result reports the best performing hand-coded C++ implementation presented in this paper, which achieves a throughput of up to 384M rec/sec.

Figure 13: Scale-out experiment (YSB, LRB, and NYT on Flink with 1 to 16 nodes compared to the optimized single-node implementation).

As a second experiment, we demonstrate the scale-out performance of Flink on the YSB, LRB, and NYT and compare it to a single node implementation. Note that we performed the same scale-out experiments for all queries on Spark and Storm and they show the same behaviour; therefore, we omit them in Figure 13. We execute the benchmarks on up to 16 nodes, whereas each node contains a 16-core Intel Xeon CPU (E5620 2.40GHz) and all nodes are connected via 1Gb Ethernet. In Figure 13, we compare our single node optimized version (DOP=1) with the scale-out execution (DOP 2-16). As shown, YSB and NYT scale up to two nodes, reaching a throughput of 8.2M rec/sec and 7.4M rec/sec, respectively. However, adding additional nodes does not increase the throughput. The main reason for this sub-optimal scaling is that the benchmarks are network-bound. Flink generates the input data inside the engine and uses Flink state management to save intermediate state. However, the induced partitioning utilizes the available network bandwidth entirely. In contrast, LRB does increase the throughput for each additional node but the overall increase is only minor.

Overall, Figure 13 reveals that even if we scale out to a large cluster of nodes, we are not able to provide the same throughput as a hand-tuned single node solution. However, a recent study shows that, as the network bandwidth increases, the data processing of distributed engines becomes CPU bound [82]. The authors conclude that current data processing systems, including Apache Spark and Apache Flink, need systematic changes.

5.2.5 RDMA

In this experiment, we want to examine if hand-written Java or C++ implementations are capable of utilizing the ever rising network bandwidth available in the future. To this end, we extend the YSB benchmark such that one node produces the data, as it will be the case in a streaming scenario, and one node receives the stream of data via InfiniBand and RDMA. In Figure 14, we show the throughput of our hand-written implementations of the YSB using Java and C++ on an E7-4850v4 CPU and an InfiniBand FDR network. With physical limit, we refer to the maximum bandwidth following the specification of 56 Gbit/s (FDR 4x). For Java as well as for C++, we implement one version that just sends data via the network without processing it, i.e., read only, and one version that includes the processing of the YSB, i.e., with proc. As shown, a C++ implementation that sends data over a high-speed InfiniBand network and processes it is able to reach nearly the same throughput as the implementation without processing. In contrast, the Java implementation with processing achieves only 22% of the throughput without processing. This indicates that the Java implementation is bounded by the processing instead of the network transfer. This is in line with the results presented in the previous sections. Overall, the C++ implementation is able to exploit 70% of the theoretical bandwidth without processing, compared to 38% using Java.

Figure 14: Infiniband exploitation using RDMA (C++ and Java, read only vs. with processing, compared to the physical limit).

5.2.6 Latency

In this experiment, we compare the latencies of the Upfront Partitioning and the Global Merge parallelization strategy for C++ implementations processing 10M YSB input tuples. With latency, we refer to the time span between the end of a window and the materialization of the window at the sink. Our results show that the Global Merge exhibits an average latency of 57ns (min: 38ns; max: 149ns). In contrast, Upfront Partitioning induces a latency that is orders of magnitude higher, namely 116µs (min: 112µs; max: 122µs). The main reason for this is the delay introduced by queues and buffers.

In all our experiments, Global Merge implementations achieve very low latencies in the order of a hundred nanoseconds. This is in contrast to state-of-the-art streaming systems, which use Upfront Partitioning [29, 81, 89]. In particular, these systems have orders of magnitude higher latencies (in the order of ms) for the YSB [39, 49, 86]. The Linear Road Benchmark defines a fixed latency upper bound of 5 seconds, which our implementation satisfies by far.
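The following sketch illustrates how the window-end-to-materialization latency defined above can be measured for a Global-Merge-style, queue-less window: worker threads aggregate into thread-local partials and, when the window ends, the partials are merged directly at the sink. It is a heavily simplified illustration under these assumptions, not the implementation evaluated here, and all names are hypothetical.

```cpp
#include <array>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

using Clock = std::chrono::steady_clock;
constexpr size_t kKeys = 100;

// Thread-local partial aggregates for the current window (Global Merge style):
// no tuple queues, each worker only touches its own slot until the window ends.
struct Partial { std::array<uint64_t, kKeys> counts{}; };

int main() {
  const size_t workers = 4;
  std::vector<Partial> partials(workers);
  std::vector<uint32_t> input(10'000'000);
  for (size_t i = 0; i < input.size(); ++i) input[i] = i % kKeys;  // synthetic keys

  // Each worker aggregates its contiguous block of the window.
  std::vector<std::thread> pool;
  size_t chunk = input.size() / workers;
  for (size_t w = 0; w < workers; ++w)
    pool.emplace_back([&, w] {
      for (size_t i = w * chunk; i < (w + 1) * chunk; ++i)
        partials[w].counts[input[i]]++;
    });
  for (auto& t : pool) t.join();

  // Window end: merge the partials directly at the sink and measure the time
  // from the end of the window to the materialized result.
  auto window_end = Clock::now();
  std::array<uint64_t, kKeys> result{};
  for (const auto& p : partials)
    for (size_t k = 0; k < kKeys; ++k) result[k] += p.counts[k];
  auto materialized = Clock::now();

  auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                materialized - window_end).count();
  std::printf("window materialization latency: %lld ns (key 0 count: %llu)\n",
              (long long)ns, (unsigned long long)result[0]);
}
```

Because only a handful of per-worker partials have to be combined at window end, the merge in this sketch completes in the nanosecond-to-microsecond range, whereas a queue-based exchange adds buffering delay on top.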

5.3 Discussion

Our work is driven by the key observation that emerging network technologies allow us to ingest data with main memory bandwidth on a single scale-up server [22]. We investigate design aspects of an SPE optimized for scale-up. The results of our extensive experimental analysis are the following key insights:

Avoid high level languages. Implementing the critical code path or the performance critical data structures in Java results in a larger portion of random memory accesses and virtual function calls. These issues slow down Java implementations by up to a factor of 54 compared to our C++ implementations. Furthermore, our profiling results provide strong indication that current SPEs are CPU bound.

Avoid queuing and apply operator fusion. An SPE using queues as a means to exchange data between operators is not able to fully exploit the memory bandwidth. Thus, data exchange should be replaced by operator fusion, which is enabled by query compilation techniques [70].

Use Late Merge parallelization. The parallelization based on Upfront Partitioning introduces an overhead of up to a factor of 4.5 compared to late merging, which is based on logically dividing the stream into blocks. The key factor that makes working on blocks applicable in an SPE is that the network layer performs batching on a physical level already.

Use lock-free windowing. Our experiments reveal that a lock-free windowing implementation achieves a high throughput for an SPE on modern hardware. Furthermore, our approach enables queue-less late merging.

Scale-Up is an alternative. In our experiments, a single node using an optimized C++ implementation outperforms a 10 node cluster running state-of-the-art streaming systems. We conclude that scale-up is a viable way of achieving high throughput and low latency stream processing.

6. RELATED WORK

In this section, we cover additional related work that we did not discuss already. We group related work by topics.

Stream Processing Engines (SPEs). The first generation of streaming systems has built the foundation for today's state-of-the-art SPEs [12, 11, 20, 35, 36]. While the first generation of SPEs explores the design space of stream processing in general, state-of-the-art systems focus on throughput and latency optimizations for high velocity data streams [13, 21, 29, 32, 69, 81, 84, 89]. Apache Flink [29], Apache Spark [89], and Apache Storm [81] are the most popular and mature open-source SPEs. These SPEs optimize their execution for scaling out on shared-nothing architectures. In contrast, another line of research focuses on scaling up the execution on a single machine. Their goal is to exploit the capabilities of one high-end machine efficiently. Recently proposed SPEs in this category are Streambox [67], Trill [34], and Saber [58]. In this paper, we examine the data-related and processing-related design space of scale-up and scale-out SPEs in regards to the exploitation of modern hardware. We showcase design changes that are applicable in many SPEs to achieve higher throughput on current and future hardware technologies.

Data Processing on Modern Hardware. In-memory databases explore fast data processing near memory bandwidth speed [23, 27, 70, 44, 93]. Although in-memory databases support fast state access [57, 26], their overall stream processing capabilities are limited. In this paper, we widen the scope of modern hardware exploitation to more complex stream processing applications such as LRB, YSB, and NYT. In this field, SABER also processes complex streaming workloads on modern hardware [58]. However, it combines a Java-based SPE with GPGPU acceleration. In contrast, we focus on outlining the disadvantages of Java-based SPEs. Nevertheless, the design changes proposed in this paper are partially applicable to JVM-based SPEs.

Performance Analysis of SPEs. Zhang et al. [92] deeply analyze the performance characteristics of Storm and Flink. They contribute a NUMA-aware scheduler as well as a system-agnostic non-blocking tuple buffering technique. Complementary to the work of Zhang, we investigate fundamental design alternatives for SPEs, specifically data passing, processing models, parallelization strategies, and an efficient windowing technique. Furthermore, we consider upcoming networking technologies and existing scale-up systems.

Ousterhout et al. [71] analyze Spark and reveal that many workloads are CPU-bound and not disk-bound. In contrast, our work focuses on understanding why existing engines are CPU-bound and on systematically exploring means to overcome these bottlenecks by applying scale-up optimizations.

Trivedi et al. assess the performance-wise importance of the network for modern distributed data processing systems such as Spark and Flink [82]. They analyze query runtime and profile the systems regarding the time spent in each major function call. In contrast, we perform a more fine-grained analysis of the hardware usage of those and other SPEs as well as of native implementations.

Query Processing and Compilation. In this paper, we used hand-coded queries to show potential performance advantages for SPEs using query compilation. Using established code generation techniques proposed by Krikellas [59] or Neumann [70] allows us to build systems that generate optimized code for any query, i.e., also streaming queries. However, as opposed to a batch-style system, input tuples are not immediately available in streaming systems. Furthermore, the unique requirements of windowing semantics introduce new challenges for a query compiler. Similarly to Spark, we adopt code generation to increase throughput. In contrast, we directly generate efficient machine code instead of byte code, which requires an additional interpretation step during execution.

7. CONCLUSION

In this paper, we analyze the design space of SPEs optimized for scale-up. Our experimental analysis reveals that the design of current SPEs cannot exploit the full memory bandwidth and computational power of modern hardware. In particular, SPEs written in Java cause a larger number of random memory accesses and virtual function calls compared to C++ implementations.

Our hand-written implementations that use appropriate streaming optimizations for modern hardware achieve near memory bandwidth and outperform existing SPEs in terms of throughput by up to two orders of magnitude on three common streaming benchmarks. Furthermore, we show that carefully-tuned single node implementations outperform existing SPEs even if they run on a cluster. Emerging network technologies can deliver similar or even higher bandwidth compared to main memory. Current systems cannot fully utilize the current memory bandwidth and, thus, also cannot exploit faster networks in the future.

With this paper, we lay the foundation for a scale-up SPE by exploring design alternatives in-depth and describing a set of design changes that can be incorporated into current and future SPEs. In particular, we show that a queue-less execution engine based on query compilation, which eliminates tuple queues as a means to exchange intermediate results between operators, is highly suitable for an SPE on modern hardware. Furthermore, we suggest replacing Upfront Partitioning with Late Merging strategies. Overall, we conclude that with emerging hardware, scaling up is a viable alternative to scaling out.

12 Acknowledgments. This work was funded by the EU [15] A. Alexandrov, R. Bergmann, S. Ewen, J.-C. Freytag, projects E2Data (780245), DFG Priority Program “Scalable F. Hueske, A. Heise, O. Kao, M. Leich, U. Leser, Data Management for Future Hardware” (MA4662-5), and V. Markl, et al. The stratosphere platform for big data the German Ministry for Education and Research as BBDC analytics. The VLDB Journal, 23(6):939–964, 2014. (01IS14013A). [16] A. Arasu, M. Cherniack, E. Galvez, D. Maier, A. S. Maskey, E. Ryvkina, M. Stonebraker, and 8. REFERENCES R. Tibbetts. Linear road: a stream data management [1] Apache Storm Trident. URL storm. apache. org/ benchmark. In VLDB, pages 480–491. 2004. releases/ 1. 1. 1/ Trident-tutorial. html . [17] A. Arasu and J. Widom. Resource sharing in [2] Disni: Direct storage and networking interface, 2017. continuous sliding-window aggregates. In VLDB, Retrieved March 30, 2018, from pages 336–347. VLDB Endowment, 2004. https://github.com/zrlio/disni. [18] M. Armbrust. Making apache spark the fastest open [3] I. t. association. infiniband roadmap, 2017. Retrieved source streaming engine, 2017. Databricks Engineering October 30, 2017, from Blog, URL http://www.infinibandta.org/. https://databricks.com/blog/2017/06/06/simple- super-fast-streaming-engine-apache-spark.html. [4] Package java.util.concurrent, 2017. Retrieved October 30, 2017, from [19] D. Battr´e,S. Ewen, F. Hueske, O. Kao, V. Markl, and https://docs.oracle.com/javase/8/docs/api/ D. Warneke. Nephele/PACTs: A programming model java/util/concurrent/package-summary.html. and execution framework for web-scale analytical processing. In SoCC, pages 119–130. ACM, 2010. [5] A ReaderWriter queue which shows superior performance in benchmarks, 2017. Retrieved October [20] T. Bernhardt and A. Vasseur. Esper: Event stream 30, 2017, from https://github.com/cameron314/ processing and correlation, 2007. readerwriterqueue/tree/master/benchmarks. [21] A. Biem, E. Bouillet, H. Feng, A. Ranganathan, [6] Sparkrdma shufflemanager plugin, 2017. Retrieved A. Riabov, O. Verscheure, H. Koutsopoulos, and March 30, 2018, from C. Moran. IBM infosphere streams for scalable, https://github.com/Mellanox/SparkRDMA. real-time, intelligent transportation services. In SIGMOD, pages 1093–1104. ACM, 2010. [7] A SPSC implementation from Facebook, 2017. Retrieved October 30, 2017, from https: [22] C. Binnig, A. Crotty, A. Galakatos, T. Kraska, and //github.com/facebook/folly/tree/master/folly. E. Zamanian. The end of slow networks: It’s time for a redesign. PVLDB, 9(7):528–539, Mar. 2016. [8] A SPSC queue implemented by the Boost library, 2017. Retrieved October 30, 2017, from [23] P. Boncz, M. Zukowski, and N. Nes. MonetDB/X100: www.boost.org/doc/libs/1\_64\_0/doc/html/ hyper-pipelining query execution. In CIDR, volume 5, boost/lockfree/queue.html. pages 225–237, 2005. [9] A SPSC queue implemented by the TBB library, 2017. [24] I. Botan, Y. Cho, R. Derakhshan, N. Dindar, L. Haas, Retrieved October 30, 2017, from K. Kim, C. Lee, G. Mundada, M.-C. Shan, N. Tatbul, https://www.threadingbuildingblocks.org/docs/ et al. Design and implementation of the maxstream doxygen/a00035.html. federated stream processing architecture. 2009. [10] Hotspot virtual machine garbage collection tuning [25] I. Botan, D. Kossmann, P. M. Fischer, T. Kraska, guide, 2018. Retrieved November 2, 2018, from D. Florescu, and R. Tamosevicius. Extending xquery https://docs.oracle.com/en/java/javase/11/ with window functions. In VLDB, pages 75–86. 2007. 
gctuning/available-collectors.html. [26] L. Braun, T. Etter, G. Gasparis, M. Kaufmann, [11] D. J. Abadi, Y. Ahmad, M. Balazinska, D. Kossmann, D. Widmer, A. Avitzur, A. Iliopoulos, U. Cetintemel, M. Cherniack, J.-H. Hwang, E. Levy, and N. Liang. Analytics in motion: High W. Lindner, A. Maskey, A. Rasin, E. Ryvkina, et al. performance event-processing and real-time analytics The design of the borealis stream processing engine. in the same database. In SIGMOD, pages 251–264. In CIDR, volume 5, pages 277–289, 2005. ACM, 2015. [12] D. J. Abadi, D. Carney, U. C¸etintemel, M. Cherniack, [27] S. Breß, H. Funke, and J. Teubner. Robust query C. Convey, S. Lee, M. Stonebraker, N. Tatbul, and processing in co-processor-accelerated databases. In S. Zdonik. Aurora: a new model and architecture for SIGMOD, pages 1891–1906. ACM, 2016. data stream management. The VLDB Journal, [28] C. Cameron, J. Singer, and D. Vengerov. The 12(2):120–139, 2003. judgment of forseti: Economic utility for dynamic [13] T. Akidau, A. Balikov, K. Bekiro˘glu,S. Chernyak, heap sizing of multiple runtimes. In ACM SIGPLAN J. Haberman, R. Lax, S. McVeety, D. Mills, Notices, volume 50, pages 143–156. ACM, 2015. P. Nordstrom, and S. Whittle. Millwheel: [29] P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, fault-tolerant stream processing at internet scale. S. Haridi, and K. Tzoumas. Apache Flink: Stream PVLDB, 6(11):1033–1044, 2013. and batch processing in a single engine. Bulletin of the [14] T. Akidau, R. Bradshaw, C. Chambers, S. Chernyak, IEEE Computer Society Technical Committee on Data R. J. Fern´andez-Moctezuma, R. Lax, S. McVeety, Engineering, 36(4), 2015. D. Mills, F. Perry, E. Schmidt, et al. The dataflow [30] P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, model: a practical approach to balancing correctness, S. Haridi, and K. Tzoumas. Apache flink: Stream and latency, and cost in massive-scale, unbounded, batch processing in a single engine. IEEE Data out-of-order data processing. PVLDB, Engineering Bulletin, 36(4), 2015. 8(12):1792–1803, 2015. [31] P. Carbone, J. Traub, A. Katsifodimos, S. Haridi, and

13 V. Markl. Cutty: Aggregate sharing for user-defined data stream processing. In EDBT, pages 207–218. windows. In CIKM, pages 1201–1210. 2016. 2010. [32] R. Castro Fernandez, M. Migliavacca, E. Kalyvianaki, [47] B. Gedik. Generic windowing support for extensible and P. Pietzuch. Integrating scale out and fault stream processing systems. Software: Practice and tolerance in stream processing using operator state Experience, 44(9):1105–1128, 2014. management. In SIGMOD, pages 725–736. 2013. [48] B. Gedik, H. Andrade, and K. Wu. A code generation [33] C. Chambers, A. Raniwala, F. Perry, S. Adams, approach to optimizing high-performance distributed R. Henry, R. Bradshaw, and Nathan. Flumejava: Easy, data stream processing. In Proceedings of the 18th efficient data-parallel pipelines. In ACM SIGPLAN ACM Conference on Information and Knowledge Conference on Programming Language Design and Management, CIKM 2009, Hong Kong, China, Implementation (PLDI), pages 363–375. 2 Penn Plaza, November 2-6, 2009, pages 847–856, 2009. Suite 701 New York, NY 10121-0701, 2010. [49] J. Grier. Extending the yahoo! streaming benchmark, [34] B. Chandramouli, J. Goldstein, M. Barnett, 2016. R. DeLine, D. Fisher, J. C. Platt, J. F. Terwilliger, [50] M. Grossniklaus, D. Maier, J. Miller, S. Moorthy, and and J. Wernsing. Trill: A high-performance K. Tufte. Frames: data-driven windows. In DEBS, incremental query processor for diverse analytics. pages 13–24. 2016. Proc. VLDB Endow., 8(4):401–412, Dec. 2014. [51] M. Herlihy, N. Shavit, and M. Tzafrir. Hopscotch [35] S. Chandrasekaran, O. Cooper, A. Deshpande, M. J. hashing. In International Symposium on Distributed Franklin, J. M. Hellerstein, W. Hong, Computing, pages 350–364. Springer, 2008. S. Krishnamurthy, S. R. Madden, F. Reiss, and M. A. [52] M. Hirzel, H. Andrade, B. Gedik, G. Jacques-Silva, Shah. Telegraphcq: continuous dataflow processing. In R. Khandekar, V. Kumar, M. Mendell, H. Nasgaard, SIGMOD, pages 668–668. ACM, 2003. S. Schneider, R. Soul´e,et al. IBM streams processing [36] J. Chen, D. J. DeWitt, F. Tian, and Y. Wang. language: Analyzing big data in motion. IBM Journal Niagaracq: A scalable continuous query system for of Research and Development, 57(3/4):7–1, 2013. internet databases. In ACM SIGMOD Record, [53] M. Hirzel, R. Soul´e,S. Schneider, B. Gedik, and volume 29, pages 379–390. ACM, 2000. R. Grimm. A catalog of stream processing [37] Q. Chen, M. Hsu, and H. Zeller. Experience in optimizations. ACM Comput. Surv., 46(4):46:1–46:34, continuous analytics as a service (CaaaS). In EDBT, Mar. 2014. pages 509–514. 2011. [54] N. Jain, L. Amini, H. Andrade, R. King, Y. Park, [38] S. Chintapalli, D. Dagit, B. Evans, R. Farivar, P. Selo, and C. Venkatramani. Design, T. Graves, M. Holderbaugh, Z. Liu, K. Nusbaum, implementation, and evaluation of the linear road K. Patil, B. J. Peng, et al. Benchmarking streaming bnchmark on the stream processing core. In SIGMOD, computation engines: Storm, Flink and Spark pages 431–442. 2006. streaming. In International Parallel and Distributed [55] C. S. Jensen, C. M. Jermaine, and X. Zhou, editors. Processing Symposium, Workshops, pages 1789–1792. 29th IEEE International Conference on Data 2016. Engineering, ICDE 2013, Brisbane, Australia, April [39] S. Chintapalli, D. Dagit, B. Evans, R. Farivar, 8-12, 2013. IEEE Computer Society, 2013. T. Graves, M. Holderbaugh, Z. Liu, K. Nusbaum, [56] A. Kalia, M. Kaminsky, and D. G. Andersen. Design K. Patil, B. J. Peng, and P. Poulosky. 
Benchmarking guidelines for high performance rdma systems. In streaming computation engines at yahoo, 2015. Proceedings of the 2016 USENIX Conference on [40] R. Consortium. RDMA Protocol Verbs Specification Usenix Annual Technical Conference, USENIX ATC (Version 1.0). April 2003. ’16, pages 437–450, Berkeley, CA, USA, 2016. [41] I. Corporation. Intel R 64 and IA-32 Architectures USENIX Association. Software Developer’s Manual. Number 325462-044US. [57] A. Kipf, V. Pandey, J. B¨ottcher, L. Braun, August 2012. T. Neumann, and A. Kemper. Analytics on fast data: [42] J. Dean and S. Ghemawat. Mapreduce: Simplified Main-memory database systems versus modern data processing on large clusters. In OSDI, pages streaming systems. In EDBT, pages 49–60. 2017. 10–10. USENIX Association, 2004. [58] A. Koliousis, M. Weidlich, R. Castro Fernandez, A. L. [43] S. Ewen. Off-heap memory in apache flink and the Wolf, P. Costa, and P. Pietzuch. Saber: curious jit compiler, 2015. Retrieved November 2, Window-based hybrid stream processing for 2018, from https://flink.apache.org/news/2015/ heterogeneous architectures. In SIGMOD, pages 09/16/off-heap-memory.html. 555–569. ACM, 2016. [44] F. F¨arber, S. K. Cha, J. Primsch, C. Bornh¨ovd, [59] K. Krikellas, S. Viglas, and M. Cintra. Generating S. Sigg, and W. Lehner. Sap hana database: data code for holistic query evaluation. In ICDE, pages management for modern business applications. ACM 613–624. IEEE, 2010. Sigmod Record, 40(4):45–51, 2012. [60] S. Krishnamurthy, M. J. Franklin, J. Davis, D. Farina, [45] R. C. Fernandez, M. Migliavacca, E. Kalyvianaki, and P. Golovko, A. Li, and N. Thombre. Continuous P. Pietzuch. Making state explicit for imperative big analytics over discontinuous streams. In SIGMOD, data processing. In 2014 USENIX Annual Technical pages 1081–1092. 2010. Conference (USENIX ATC 14), pages 49–60, [61] S. Krishnamurthy, C. Wu, and M. Franklin. On-the-fly Philadelphia, PA, 2014. USENIX Association. sharing for streamed aggregation. In SIGMOD, pages [46] P. M. Fischer, K. S. Esmaili, and R. J. Miller. Stream 623–634. 2006. schema: providing and exploiting static metadata for [62] V. Leis, P. Boncz, A. Kemper, and T. Neumann.

14 Morsel-driven parallelism: A NUMA-aware query constant time. In Proceedings of the 11th ACM evaluation framework for the many-core age. In International Conference on Distributed and SIGMOD, pages 743–754. ACM, 2014. Event-based Systems, DEBS ’17, pages 66–77, New [63] V. Leis, F. Scheibner, A. Kemper, and T. Neumann. York, NY, USA, 2017. ACM. The art of practical synchronization. In DaMoN, [79] K. Tangwongsan, M. Hirzel, S. Schneider, and K.-L. page 3. ACM, 2016. Wu. General incremental sliding-window aggregation. [64] J. Li, D. Maier, K. Tufte, V. Papadimos, and P. A. PVLDB, 8(7):702–713, 2015. Tucker. No pane, no gain: efficient evaluation of [80] A. Toshniwal, S. Taneja, A. Shukla, K. Ramasamy, sliding-window aggregates over data streams. ACM J. M. Patel, S. Kulkarni, J. Jackson, K. Gade, M. Fu, SIGMOD Record, 34(1):39–44, 2005. J. Donham, et al. Storm@ twitter. In SIGMOD, pages [65] E. Liarou, R. Goncalves, and S. Idreos. Exploiting the 147–156. ACM, 2014. power of relational databases for efficient stream [81] A. Toshniwal, S. Taneja, A. Shukla, K. Ramasamy, processing. In EDBT, pages 323–334. 2009. J. M. Patel, S. Kulkarni, J. Jackson, K. Gade, M. Fu, [66] D. Lion, A. Chiu, H. Sun, X. Zhuang, N. Grcevski, J. Donham, et al. Storm@Twitter. In SIGMOD, pages and D. Yuan. Don’t get caught in the cold, warm-up 147–156. 2014. your jvm: Understand and eliminate jvm warm-up [82] A. Trivedi, P. Stuedi, J. Pfefferle, R. Stoica, overhead in data-parallel systems. In OSDI, pages B. Metzler, I. Koltsidas, and N. Ioannou. On the 383–400, 2016. [ir]relevance of network performance for data [67] H. Miao, H. Park, M. Jeon, G. Pekhimenko, K. S. processing. In USENIX Workshop on Hot Topics in McKinley, and F. X. Lin. Streambox: Modern stream (HotCloud 16). USENIX processing on a multicore machine. In 2017 USENIX Association, 2016. Annual Technical Conference (USENIX ATC 17), [83] D. Vyukov. Single-producer/single-consumer queue. pages 617–629, Santa Clara, CA, 2017. USENIX Intel Developer Zonw, URL software.intel.com/en- Association. us/articles/single-producer-single-consumer-queue, [68] C. Mitchell, Y. Geng, and J. Li. Using one-sided rdma 2015. reads to build a fast, cpu-efficient key-value store. In [84] Y. Wu and K.-L. Tan. Chronostream: Elastic stateful Proceedings of the 2013 USENIX Conference on stream computation in the cloud. In ICDE, pages Annual Technical Conference, USENIX ATC’13, pages 723–734. IEEE, 2015. 103–114, Berkeley, CA, USA, 2013. USENIX [85] R. Xin and J. Rosen. Project tungsten: Bringing Association. apache spark closer to bare metal, 2015. Retrieved [69] D. G. Murray, F. McSherry, R. Isaacs, M. Isard, November 2, 2018, from URL P. Barham, and M. Abadi. Naiad: a timely dataflow https://databricks.com/blog/2015/04/28/project- system. In SOSP, pages 439–455. ACM, 2013. tungsten-bringing-spark-closer-to-bare-metal.html. [70] T. Neumann. Efficiently compiling efficient query [86] B. Yavuz. Benchmarking structured streaming on plans for modern hardware. PVLDB, 4(9):539–550, databricks runtime against state-of-the-art streaming 2011. systems, 2017. [71] K. Ousterhout, R. Rasti, S. Ratnasamy, S. Shenker, [87] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, and B.-G. Chun. Making sense of performance in data M. McCauley, M. J. Franklin, S. Shenker, and analytics frameworks. In NSDI, pages 293–307. I. Stoica. Resilient distributed datasets: A USENIX Association, 2015. fault-tolerant abstraction for in-memory cluster [72] I. Pandis, R. Johnson, N. Hardavellas, and computing. 
In NSDI, pages 2–2. 2012. A. Ailamaki. Data-oriented transaction execution. [88] M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and PVLDB, 3(1-2):928–939, 2010. I. Stoica. Discretized streams: Fault-tolerant [73] P. Pietzuch, P. Garefalakis, A. Koliousis, H. Pirk, and streaming computation at scale. In SOSP, pages G. Theodorakis. Do we need distributed stream 423–438. ACM, 2013. processing?, 2018. [89] M. Zaharia, R. S. Xin, P. Wendell, T. Das, [74] I. Psaroudakis, T. Scheuer, N. May, A. Sellami, and M. Armbrust, A. Dave, X. Meng, J. Rosen, A. Ailamaki. Scaling up concurrent main-memory S. Venkataraman, M. J. Franklin, et al. Apache spark: column-store scans: Towards adaptive numa-aware A unified engine for big data processing. data and task placement. Proc. VLDB Endow., Communications of the ACM, 59(11):56–65, 2016. 8(12):1442–1453, Aug. 2015. [90] E. Zeitler and T. Risch. Scalable splitting of massive [75] W. R¨odiger, T. M¨uhlbauer,A. Kemper, and data streams. In DASFAA, pages 184–198. 2010. T. Neumann. High-speed query processing over [91] E. Zeitler and T. Risch. Massive scale-out of expensive high-speed networks. Proc. VLDB Endow., continuous queries. In PVLDB, pages 1181–1188. 2011. 9(4):228–239, Dec. 2015. [92] S. Zhang, B. He, D. Dahlmeier, A. C. Zhou, and [76] T. Schneider. Analyzing 1.1 billion nyc taxi and uber T. Heinze. Revisiting the design of data stream trips, with a vengeance, 2015. processing systems on multi-core processors. In ICDE, [77] M. Svensson. Benchmarking the performance of a data pages 659–670. 2017. stream management system. PhD thesis, MSc thesis [93] M. Zukowski, P. A. Boncz, et al. Vectorwise: Beyond report, Uppsala University, 2007. column stores. IEEE Data Eng. Bull., 35(1):21–27, [78] K. Tangwongsan, M. Hirzel, and S. Schneider. 2012. Low-latency sliding-window aggregation in worst-case
