Apache Spark Streaming, Kafka and HarmonicIO: A Performance Benchmark and Architecture Comparison for Enterprise and Scientific Computing

Ben Blamey, Andreas Hellander and Salman Toor
Department of Information Technology, Division of Scientific Computing, Uppsala University, Sweden
Email: {Ben.Blamey, Andreas.Hellander, Salman.Toor}@it.uu.se

Abstract—This paper presents a benchmark of stream processing throughput comparing Apache Spark Streaming (under file-, TCP socket- and Kafka-based stream integration) with a prototype P2P stream processing framework, HarmonicIO. Maximum throughput for a spectrum of stream processing loads is measured: specifically, those with large message sizes (up to 10MB) and heavy CPU loads, more typical of scientific computing use cases (such as microscopy) than enterprise contexts. A detailed exploration of the performance characteristics with these streaming sources, under varying loads, reveals an interplay of performance trade-offs, uncovering the boundaries of good performance for each framework and streaming source integration. We compare with theoretical bounds in each case. Based on these results, we suggest which frameworks and streaming sources are likely to offer good performance for a given load. Broadly, the advantages of Spark's rich feature set come at a cost of sensitivity to message size in particular: common stream source integrations can perform poorly in the 1MB-10MB range. The simplicity of HarmonicIO offers more robust performance in this region, especially for raw CPU utilization.

Keywords—Stream Processing; Apache Spark; HarmonicIO; high-throughput microscopy; HPC; Benchmark; XaaS.

I. INTRODUCTION

A number of stream processing frameworks have gained wide adoption over the last decade or so (Apache Flink (Carbone et al., 2015), Apache Spark Streaming (Zaharia et al., 2016), Apache Flume (Apache Flume, 2016)), suitable for high-volume, high-reliability stream processing workloads. Their development has been motivated by the analysis of data from cloud, web and mobile applications. For example, Apache Flume is designed for the analysis of server application log data. Apache Spark improves upon the Hadoop framework (Apache Software Foundation, 2011) for distributed computing, and was later extended with streaming support. Apache Flink was later developed primarily for stream processing.

These frameworks boast performance, scalability, data security, processing guarantees, and efficient, parallelized computation; together with high-level stream processing APIs (augmenting familiar map/reduce with stream-specific functionality such as windowing). All of this makes them attractive for scientific computing – including imaging applications in the life sciences, and simulation output more generally.

Previous studies have shown that these frameworks are capable of processing message streams on the order of 1 million or more messages per second, but they focus on enterprise use cases with textual rather than binary content, and with message sizes of perhaps a few KB. Additionally, the computational cost of processing an individual message may be relatively small (e.g. parsing JSON, and applying some business logic). By contrast, in scientific computing domains messages can be much larger (on the order of several MB).

Our motivating use case is the development of a cloud pipeline for the processing of streams of microscopy images, for biomedical research applications. Existing systems for working with such datasets have largely focused on offline processing: our online processing (processing the 'live' stream) is relatively novel for the image microscopy domain. Electron microscopes generate high-frequency streams of large, high-resolution image files (message sizes 1-10MB), and feature extraction is computationally intensive. This is typical of many scientific computing use cases, where files have binary content, with execution time dominated by the per-message 'map' stage. We therefore investigate how well the performance of enterprise-grade stream processing frameworks (such as Apache Spark) translates to loads more characteristic of scientific computing – for example, microscopy image stream processing – by benchmarking under a spectrum of conditions representative of both. We do this by varying both the processing cost of the map stage and the message size, expanding on previous studies.

For comparison, we measured the performance of HarmonicIO (Torruangwatthana et al., 2018) – a research prototype with a P2P-based architecture – under the same conditions.
This paper contributes:
• A performance comparison of an enterprise-grade framework for stream processing (Apache Spark) with a streaming framework tailored for scientific use cases (HarmonicIO).
• An analysis of these results, and a comparison with theoretical bounds – relating the findings to the architectures of the frameworks when integrated with various streaming sources. We find that performance varies considerably according to the application load, and we quantify where these transitions occur.
• Benchmarking tools for Apache Spark Streaming, with tunable message size and CPU load per message – to explore this domain as a continuum.
• Recommendations for choosing frameworks and their integration with stream sources, especially for atypical stream processing applications, highlighting some limitations of both frameworks, especially for scientific computing use cases.

II. BACKGROUND: STREAM PROCESSING OF IMAGES IN THE HASTE PROJECT

High-throughput (Wollman and Stuurman, 2007) and high-content imaging (HCI) experiments are highly automated experimental setups which are used to screen molecular libraries and assess the effects of compounds on cells using microscopy imaging. Work on smart cloud systems for prioritizing and organizing data from these experiments is our motivation for considering streaming applications where messages are relatively large binary objects (BLOBs) and where each processing task can be quite CPU intensive.

Our goal of online analysis of the microscopy image stream allows both the quality of the images to be analyzed (highlighting any issues with the equipment, sample preparation, etc.), as well as the detection of characteristics (and changes) in the sample itself during the experiment. Industry setups for high-content imaging can produce 38 frames/second with image sizes on the order of 10MB (Lugnegård, 2018). These image streams, like other scientific use cases, have different characteristics than many enterprise stream analytics applications:
• Messages are binary (not textual, JSON, XML, etc.).
• Messages are larger (order MBs, not bytes or KBs).
• The initial map phase can be computationally expensive, and perhaps dominate execution time.

Our goal is to create a general pipeline able to process streams with these characteristics (and image streams in particular), with an emphasis on spatial-temporal analysis. Our SCaaS – Scientific Computing as a Service – platform will allow domain scientists to work with large datasets in an economically efficient way, without needing to manage infrastructure and software themselves. Apache Spark Streaming (ASS) has many of the features needed to build such a platform, with rich APIs suitable for scientific applications, and proven performance for small messages with computationally light map tasks. However, it is not clear how well this performance translates to the regimes of interest to the HASTE project. This paper explores the performance of ASS for a wide range of application characteristics, and compares it to a research prototype streaming framework, HarmonicIO.
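To put these rates in perspective, a back-of-the-envelope calculation is helpful. The sketch below is our own illustration (it assumes frames at the upper end of the 10MB figure quoted above, i.e. ~10 megabytes, and uses the 1.4 Gbit/s VM-to-VM bandwidth measured later in Section VII):

    # Illustrative ingest arithmetic for the HCI use case (assumptions noted above).
    frame_rate_hz = 38             # frames/second (Lugnegard, 2018)
    frame_size_bytes = 10e6        # ~10 MB per image, upper end

    ingest_bits_s = frame_rate_hz * frame_size_bytes * 8
    print(ingest_bits_s / 1e9)     # ~3.04 Gbit/s sustained ingest

    link_bits_s = 1.4e9            # iperf measurement between VMs (Section VII)
    print(ingest_bits_s / link_bits_s)  # ~2.2x the measured link bandwidth

A single instrument at full rate can therefore exceed a typical cloud network link, which is why the network bound features as prominently as the CPU bound in the analysis that follows.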
III. EXISTING BENCHMARKING STUDIES

Several studies have investigated the performance of Spark, Flink and related platforms. However, these studies tend to use small messages, with a focus on sorting, joining and other stream operations. Under an experimental setup modeling a typical enterprise stream processing pipeline (Chintapalli et al., 2016), Flink and Storm were found to have considerably lower latencies than Spark (owing to its micro-batching implementation), whilst Spark's throughput was significantly larger. The input was small JSON documents, for which the initial map – i.e. parsing – is cheap, and the stream processing frameworks under test were integrated with Kafka (Kreps et al., 2011) and Redis (Salvatore Sanfilippo, 2009). This is advantageous in that it models a realistic enterprise system, but with each component having its own performance characteristics, it makes it difficult to get a sense of the maximum performance of the streaming frameworks in isolation. In their study the data is preloaded into Kafka. In our study, we investigate ingress bottlenecks by writing and reading data through Kafka during the benchmarking, to get a full measurement of sustained throughput.

In an extension of their study (Grier, 2016), Spark was shown to outperform Flink in terms of throughput by a factor of 5, achieving frequencies of more than 60MHz. Again, as with previous studies, Kafka integration was used, with small messages. Other studies follow a similar vein: (Qian et al., 2016) used small messages (60 bytes, 200 bytes) and lightweight pre-processing (i.e. 'map') operations, e.g. grepping and tokenizing strings, with an emphasis on common stream operations such as sorting. Indeed, sorting is seen as something of a canonical benchmark for distributed stream processing. For example, Spark previously won the GraySort contest (Xin, 2014), where the framework's ability to shuffle data between worker nodes is exercised. Marcu et al. (2016) offer a comparison of Flink and Spark on familiar BigData benchmarks (grepping, wordcount), and give a good overview of performance optimizations in both frameworks.

HarmonicIO, a research prototype streaming framework with a peer-to-peer architecture, developed specifically for scientific computing workloads, has previously shown good performance for messages in the 1-10MB range (Torruangwatthana et al., 2018). To the authors' knowledge there is no existing work benchmarking stream processing with Apache Spark, or related frameworks, with messages larger than a few KB and with map stages which are computationally expensive.
IV. STREAM PROCESSING FRAMEWORKS

This section introduces the two frameworks selected for study in this paper, Apache Spark and HarmonicIO. Apache Spark competes with other frameworks such as Flink, Flume and Heron in offering high performance at scale, with features relating to data integrity (such as tunable replication), processing guarantees, fault tolerance, checkpointing, and so on. HarmonicIO is a research prototype – much simpler in implementation, and built around direct use of TCP sockets for direct, high-throughput P2P communication.

A. Apache Spark

Apache Spark stands out as an attractive framework for scientific computing applications with its high-level APIs, such as built-in support for scalable machine learning. Originally developed for batch operations, its in-memory caching of intermediate results improves on the performance of Hadoop, where data is typically written to a distributed file system at each stage. Spark allows a more interactive user experience, with ad-hoc analysis, which was difficult with Hadoop. This smart caching behavior is combined with functionality to track the lineage of the calculations, allowing deterministic re-calculation in the event of errors and node failure, together with a host of other features, built around the Resilient Distributed Dataset (RDD) (Zaharia et al., 2012). Spark can scale successfully to 1000s of nodes (Xin, 2014).

Spark Streaming was a later addition, leveraging the batch functionality in a streaming context by creating a new batch every few seconds (the batch interval). As with batch operations, data is further partitioned for distribution and scheduling. The Streaming API augments the map/reduce operators (familiar from batch processing) with stream functionality, such as windowing.

B. HarmonicIO

HarmonicIO (Torruangwatthana et al., 2018) is a peer-to-peer distributed processing framework, intended for high throughput of medium and large messages. HarmonicIO's architecture favors P2P message transfer, but falls back to a queue buffer when necessary to absorb fluctuations in input or processing rate. Messages are processed in a simple loop: pop a message from the master queue (if any exists), otherwise wait to receive a message directly from the streaming source over TCP; process it; and repeat. Per-node daemons aggregate availability information at the master, from where clients query the status of available processing engines. The master node manages the message buffer queue. Being a research prototype, it lacks many features of more mature frameworks, such as resilience, error handling, guaranteed delivery etc., and does not yet offer reduce functionality. Yet the simplicity of the implementation makes it easily adoptable and extensible.
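The worker-side behavior described above can be summarized in a few lines. The following is an illustrative sketch only – pop_from_queue, receive and process are hypothetical stand-ins, not HarmonicIO's actual API:

    # Illustrative sketch of a HarmonicIO processing engine's main loop,
    # as described in the text (not the framework's real implementation).
    def worker_loop(master, tcp_listener, process):
        while True:
            message = master.pop_from_queue()     # queue mode: drain the master's buffer first
            if message is None:
                message = tcp_listener.receive()  # P2P mode: block for a direct transfer
            process(message)                      # user-supplied (containerized) analysis

The design choice matters for what follows: in P2P mode a message crosses the network only once, from source to worker, so no cores or bandwidth are spent on re-serializing and forwarding it.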
V. THEORETICAL BOUNDS ON PERFORMANCE

For a range of message sizes and CPU costs of the map function, we consider the theoretical performance of an 'ideal' stream processing framework which exhibits performance equal to the tightest bound, either network or CPU, with zero overhead, across this domain. In this article we investigate how closely the frameworks under study can approach these bounds over our domain. Fig. 1 illustrates how we might expect an ideal framework to perform in different regimes.

[Fig. 1 sketch: y-axis 'CPU Cost (Map Function)', x-axis 'Message Size'; regions (A) CPU Bound, (B) Network Bound, (C) 'Framework' Bound.]
Fig. 1: Schematic of the parameter space for this study, showing the processing cost of the map function, and the message size.

A - Small message size, large processing cost - CPU Bound: For sufficiently large processing cost in relation to message size, performance will be CPU bound. Relative performance of the frameworks in this region will be determined by their ability to utilize CPU, minimizing processing overheads. This regime would be typical of scientific computing applications involving e.g. a simulation step as part of the map stage.

B - Large message size, small processing cost - Network Bound: For sufficiently large message size, performance will be network bound. In this region, relative performance between frameworks will be determined by the network topology. P2P network topologies should perform well, whereas routing messages among the worker nodes will create additional network traffic which could impair performance. This regime would be typical of scientific computing applications involving relatively simple filtering operations on binary large objects (BLOBs), such as filtering of massive genomics datasets (Ausmees et al., 2018).

C - Small messages, small processing cost: In this regime, processing frequency should be high. This region will expose any limitations on the absolute maximum message frequency for the particular integration, and thus may be 'framework bound'. Well-performing frameworks may be able to approach the network bounds, with very high frequencies. This regime would be typical of the type of enterprise applications studied in previous benchmarks.
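The 'ideal framework' can be stated precisely: its sustained frequency is the smaller of the network bound and the CPU bound. A minimal sketch of the bound (our own formulation of the definition above; the 1.4 Gbit/s bandwidth and 48-core defaults are the testbed values from Section VII):

    def ideal_max_frequency(message_size_bytes, cpu_cost_s,
                            bandwidth_bits_s=1.4e9, cores=48):
        """Upper bound on sustained message frequency (Hz) for an 'ideal'
        zero-overhead framework: the tighter of the network and CPU bounds."""
        network_bound = bandwidth_bits_s / (8 * message_size_bytes)  # msgs/s the link can carry
        cpu_bound = float("inf") if cpu_cost_s == 0 else cores / cpu_cost_s
        return min(network_bound, cpu_bound)

    # e.g. 10 MB messages, 0.1 s/message:
    # network: ~17.5 Hz, CPU: 480 Hz -> network bound, ~17.5 Hz
    print(ideal_max_frequency(10e6, 0.1))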

VI. METHODOLOGY

Varying the message size (100 bytes – 10MB) and the CPU cost for processing each message (0 – 1 second per message) allowed us to sample the performance of the studied streaming frameworks over a parameter space with settings ranging from highly CPU-intensive workloads to highly data-movement-intensive use cases. This domain captures typical enterprise use cases as well as scientific computing workloads.

The microscopy use case is a particular focus: message (image) sizes of 1-10MB, with a CPU cost profiled at around 100ms for very simple analysis (consistent with (Torruangwatthana et al., 2018)) – the CPU cost would depend on the specific features being extracted, and could be significantly higher. We measure maximum sustained frequency (i.e. throughput) at each point (message size, CPU cost), for each of the framework stream source integrations explained below.

A. Apache Spark Streaming Source Integrations

There are many possible stream integrations for Apache Spark Streaming – we discuss some of the most common, likely to be adopted by new users:

Spark + TCP Socket: TCP sockets are a simple, universal mechanism, easy to integrate into existing applications, with minimal configuration.

Spark + Kafka: Kafka is a stream processing framework in its own right: Kafka producers write messages onto one end of message queues, which are processed by Kafka consumers. Kafka is commonly used in conjunction with Spark Streaming in enterprise contexts, to provide a resilient and durable buffer, allowing Spark applications to be restarted without interrupting the streaming source. The newer Direct DStream integration approach with Kafka is used in this study.

Spark + File Streaming: Under this approach, Spark will process new files in each batch interval. We configure an NFS share on the streaming source node. This allows direct transfer of file contents from the streaming source node to the processing machines – similar to the TCP socket approach used in HarmonicIO. We do not use HDFS: it is a distributed filesystem intended for replicated storage of very large (append-only) files – many GBs or TBs – which makes it inappropriate for the files used in this study, which are at most 10MB.
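For concreteness, the three integrations correspond to three different input DStream constructors in the Spark Streaming API. The following PySpark sketch is illustrative (it assumes Spark 2.x with the spark-streaming-kafka-0-8 package on the classpath; hostnames, ports, topics and paths are placeholders, and in practice each integration would be benchmarked in a separate job):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils  # spark-streaming-kafka-0-8 package

    sc = SparkContext(appName="stream-benchmark")
    ssc = StreamingContext(sc, batchDuration=5)  # 5 s batch interval, as in Sec. VII

    # (1) TCP socket source: one worker is nominated as the receiver.
    tcp_stream = ssc.socketTextStream("stream-source-host", 9999)

    # (2) Kafka Direct DStream: workers fetch partitions straight from the broker.
    kafka_stream = KafkaUtils.createDirectStream(
        ssc, topics=["benchmark"],
        kafkaParams={"metadata.broker.list": "kafka-host:9092"})

    # (3) File streaming: poll an (NFS-shared) directory for new files each batch.
    file_stream = ssc.textFileStream("file:///mnt/stream-share/incoming")

    file_stream.count().pprint()  # e.g. messages per batch; the map stage goes here
    ssc.start()
    ssc.awaitTermination()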
B. HarmonicIO

We chose HarmonicIO for its simple P2P architecture. Its APIs are a simple abstraction around TCP sockets (in contrast to Apache Spark). Its container-based architecture provides a convenient way for scientists to encapsulate complex (and often fragile) software with a variety of dependent libraries, models and datasets. Docker containers are a useful 'unit' of scientific computing code.

VII. EXPERIMENTAL SETUP & BENCHMARKING TOOLS

To explore the (message size, CPU cost per message) parameter space we developed benchmarking applications to run on Spark and HarmonicIO, able to process synthetic messages and generate synthetic CPU load. These tools are publicly available at: https://github.com/HASTE-project/benchmarking-tools.

Fig. 2 shows the various pipelines: HarmonicIO, and Apache Spark Streaming with file-, Kafka- and TCP-based integrations. The arrows show the main network communication. For each setup, 6 stream processing VMs were used, each with 8 VCPUs and 16GB RAM (1 master, 5 workers). For the streaming source VM we used a 1 VCPU, 2GB RAM instance. These resources are similar to the experimental setup of (Xin, 2014), where 40 cores were used. The maximum network bandwidth monitored using iperf was 1.4Gbit/s. Below, we describe the details of the experimental setup for each framework, and the approach used for determining the maximum throughput frequency.

A. Apache Spark

We created a streaming source application, supporting TCP, Kafka, and file-based streaming integrations for ASS. The synthetic messages contain metadata describing the CPU load, so that both parameters (CPU load and message size) can be tuned in real time via the streaming source application. To determine the maximum throughput, we adopt the approach of gradually increasing the message frequency (for a fixed (message size, CPU cost) pair) until a bottleneck is detected somewhere in the pipeline, then performing a binary search to find the maximum. A monitoring and throttling tool was developed for this purpose. Listing 1 shows a simplified view of the algorithm used to determine the maximum. To achieve this, it monitors and controls our streaming source application and the Spark application, through a combination of REST APIs and log file analysis.

This process is repeated for each (message size, CPU cost) pair in a parameter sweep. We used a batch interval of 5 seconds, and a micro-batch interval of 150ms. Experimenting with other values had little impact on throughput. For the Spark file streaming investigation, files are shared on the streaming server with NFS. Maximum throughput is reached when a bottleneck occurs somewhere in the system, as detected by the throttling tool:
• ASS is taking too long to process messages.
• There is a network bottleneck at the stream source.
• For file streaming, ASS is taking too long to perform a directory listing.

B. HarmonicIO

In HarmonicIO, the maximum throughput is determined by measuring the time to stream and process a predefined number of messages for the given parameters. We created a separate benchmarking application for HarmonicIO, which reads messages in our format. As with Spark, metadata describing the amount of CPU load is embedded in each message.
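A compact sketch of the idea behind the synthetic messages (our simplified reading of the benchmarking tools linked above, not their actual wire format): a small header requests a CPU cost, the payload is padded to the target size, and the map stage busy-spins for the requested duration:

    import json
    import time

    def make_message(size_bytes, cpu_cost_s):
        """Synthetic message: JSON header with the requested CPU load,
        padded with filler bytes up to the target message size."""
        header = json.dumps({"cpu_cost_s": cpu_cost_s}).encode() + b"\n"
        return header + b"x" * max(0, size_bytes - len(header))

    def map_stage(message):
        """Benchmark 'map': parse the header, then burn CPU for the requested time."""
        header, _, _ = message.partition(b"\n")
        deadline = time.perf_counter() + json.loads(header)["cpu_cost_s"]
        while time.perf_counter() < deadline:  # busy-spin to hold a core
            pass
        return len(message)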

VIII. RESULTS

The maximum frequencies achieved by each framework (and stream integration setup), according to message size and per-message CPU load, are shown in Fig. 3, color-coded according to the best performing framework. A subset of these results is presented again in Fig. 4 and Fig. 5, where results for particular CPU loads are shown in relation to the CPU- and network-theoretical bounds.

We summarize the results for each setup, before further discussion in relation to the intended use cases, architectures, and relation to theoretical bounds:

Apache Spark Streaming with TCP: This integration achieves very high frequency when message size and CPU load are small, consistent with previous studies. For 100 byte messages without CPU load, the pipeline was able to process messages at frequencies approaching 320KHz, meaning around 1.6M messages were processed by Spark in a 5 second batch. This can be seen in the extreme lower-left of Fig. 3, and in the results shown in Fig. 4.A. But performance degraded rapidly for larger message sizes, and under our benchmarks it couldn't reliably handle messages larger than 10^5 bytes at any frequency.

Apache Spark Streaming with Kafka: This streaming pipeline performed well for messages smaller than 1MB and CPU loads of less than 0.1 second/message, at the bottom-left of Fig. 3. Away from this region, performance degrades relative to the other setups in this study, falling away from the theoretical limits.

Apache Spark Streaming with File Streaming: This integration performed efficiently at low frequencies – in regions tightly constrained by the network- and CPU-theoretic bounds (the top and right of Figs. 1 and 3, and the results for higher message sizes in Fig. 4).

[Fig. 2 diagram: four panels – HarmonicIO; Apache Spark Streaming w. File Streaming; w. TCP; and w. Kafka – each showing the Master, Workers 1..N, and the Stream Source (plus a Kafka Server in the Kafka panel). Legend: Message Transfer – P2P Mode; Message Transfer – Queue Mode; File Transfer (NFS Share); File Listing (NFS Share); Message Transfer.]
Fig. 2: Architecture / Network Topology Comparison between the frameworks and streaming source integrations – focusing on major network traffic flows. Note that with Spark and a TCP streaming source, one of the worker nodes is selected as 'receiver' and manages the TCP connection to the streaming source. Note that neither the monitoring and throttling tool (which communicates with all components) nor the additional metadata traffic between nodes (for scheduling, etc.) is shown. Note the adaptive queue- and P2P-based message processing in HarmonicIO.

HarmonicIO: Fig. 3 shows that HIO was the best performing framework for the broad intermediate region of our study domain: for medium-sized messages (larger than 1.0MB), and/or CPU loads higher than 0.05 seconds/message. It matched the performance of file-based streaming with Apache Spark for larger messages and higher CPU loads.

IX. DISCUSSION

A. Performance: Message Size & CPU Load

Maximum message throughput is limited by network and CPU bounds, which are inversely proportional to message size and CPU cost respectively (assuming constant network speed and number of CPU cores, respectively). The relative performance of the different frameworks (and stream integrations) depends on these same parameters. Fig. 3 shows that the frameworks also have specific, well-defined regions where they each perform the best. We discuss these regions in turn, moving outwards from the origin of Fig. 3.

Close to the origin, the theoretical maximum message throughput is high. Fig. 4.A shows that for the smallest of messages, Spark with TCP streaming is able to outperform Kafka. For slightly larger message sizes (but less than 1MB), Kafka slightly outperforms Spark with TCP (it is better optimized for handling messages in this size range – its intended use case).

For Kafka, under the Direct DStream integration with Spark, messages are transferred directly from the Kafka server to the Spark workers. Yet the Kafka server itself, having been deployed on its own machine, has a theoretical network-bounded throughput of half the network link speed (half the bandwidth for incoming messages, half for outgoing). Under TCP, we see a similar effect, with the Spark worker nominated as receiver performing an analogous role. Consequently, Fig. 4.A shows that Spark with neither Kafka nor direct TCP can approach the theoretical network bound, as other frameworks do in some regions. This is consistent with the overview of the network topology shown in Fig. 2.
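The halving effect is simple arithmetic; a worked example (a sketch using the 1.4 Gbit/s link speed measured in Section VII, with 10MB messages):

    # Relay nodes (the Kafka broker, or Spark's nominated TCP receiver) must both
    # receive and re-send every message over the same link, halving the bound.
    bandwidth_bits_s = 1.4e9       # iperf measurement (Sec. VII)
    message_bits = 10e6 * 8        # 10 MB message

    p2p_bound_hz = bandwidth_bits_s / message_bits  # ~17.5 Hz (direct transfer)
    relay_bound_hz = p2p_bound_hz / 2               # ~8.75 Hz (store-and-forward)
    print(p2p_bound_hz, relay_bound_hz)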
Moving further from the origin, into the region with CPU cost of 0.2-0.5 secs/message and/or medium message size (1-10MB), HarmonicIO performs better than the Spark integrations under this study. It exhibits good performance: transferring messages P2P means it is able to make very good use of bandwidth (when the messages are larger than 1MB or so); similarly, when there is considerable CPU load, its simplicity obviates spending CPU time on serialization and on passing messages multiple times among nodes. For large messages and heavy CPU loads, this integration is able to approach closely to the network and CPU bounds – it is able to make cost-effective use of the hardware.

For the very largest messages and the highest per-message CPU loads, the network and CPU bounds are very tight, and overall frequencies are very low (double digits). In these regions HarmonicIO's performance is matched (or exceeded) by Spark with File Streaming. Both approaches are able to tightly approach the network- and CPU-theoretic bounds, as shown in Fig. 4. Spark with file streaming performs slightly better in CPU-bound cases, with HarmonicIO performing slightly better in network-bound use cases (again, Fig. 4).

    def find_max_f(msize, cpu_cost):
        max_known_ok_f <- 0
        min_known_not_ok_f <- null
        f <- f_last_run or default(msize, cpu_cost)
        while true:
            metrics <- [spark_metrics(), strm_src_metrics()]
            switch heuristics(metrics):
                case sustained_load_ok: f <- throttle_up(metrics, f)
                case too_much_load:     f <- throttle_down(f)
                case wait_and_see:      pass
            sleep

    def throttle_up(metrics, f):
        max_known_ok_f <- f
        if min_known_not_ok_f == null:
            # no upper bound found yet: increase piecewise
            load <- estimate_fraction_max_load(metrics)
            if   load < 0.01: new_f <- f * 10
            elif load < 0.1:  new_f <- f * 5
            elif load < 0.5:  new_f <- int(f * 1.10)
            elif load < 0.8:  new_f <- int(f * 1.05)
            else:             new_f <- int(f * 1.05)
            if f == new_f: new_f <- f + 1
            return new_f
        else:
            return find_midpoint_or_done()

    def throttle_down(f):
        min_known_not_ok_f <- f
        return find_midpoint_or_done()

    def find_midpoint_or_done():
        # binary search between the known-good and known-bad frequencies
        if max_known_ok_f + 1 >= min_known_not_ok_f:
            done(max_known_ok_f)
        else:
            return int(mean(max_known_ok_f, min_known_not_ok_f))

Listing 1: Monitoring and Throttling Algorithm (Simplified). Until an integer upper bound is reached, increase piecewise linearly, then binary search.

Fig. 3: Performance of Apache Spark (under TCP, File streaming, and Kafka integrations), and HarmonicIO, by maximum message throughput frequency over the domain under study. Under each setting, the highest frequency is shown in bold, with color-coding according to which framework/integration was able to achieve the best frequency. Compare with Fig. 1.

B. Performance, Frameworks & Architecture

This section draws together the previous discussions, to summarize the strengths and weaknesses of each framework and streaming source integration, and their suitability for different use cases, with an emphasis on internals and on the network topology/architecture in each case.

Spark with TCP is able to process messages at very high frequencies in a narrow region; yet this performance is highly sensitive to message size and CPU load. Forwarding messages between workers allows message replication (and hence resilience), at a cost to overall message throughput. With heavy CPU loads, we see reduced performance of Spark with TCP relative to the other integrations, with fewer cores available.

Kafka is the best performing framework for slightly larger messages and CPU loads of less than 0.05 secs/message. The Direct DStream integration with Spark means that messages are transferred directly from the Kafka server to the Spark workers (see Fig. 2). However, for larger messages (1-10MB), Kafka performs poorly (Kafka is not intended for handling these file sizes). When the CPU load is more considerable, for our small cluster (48 cores), the overheads of Kafka (and Spark) reduce the overall CPU core utilization – there are simply fewer cores available for message processing. In these cases, performance is met or exceeded by HarmonicIO – messages are transferred directly from the streaming source, so more cores are available for processing message content.
The results for Spark with File Streaming are quite different. The implementation polls the filesystem for new files, which no doubt works well for quasi-batch jobs at low polling frequencies – on the order of minutes, with a small number of large files (order GBs or TBs, the intended use cases for HDFS). However, for large numbers of much smaller files (MBs, KBs), this mechanism performs poorly. For these smaller messages, network-bound throughput corresponds to message frequencies of, say, several KHz (see Fig. 4). These frequencies are outside the intended use case of this integration, and the filesystem polling-based implementation is cumbersome for low-latency applications. The FileInputDStream is not intended to handle the deletion of files during streaming (see SPARK-20568: https://issues.apache.org/jira/browse/SPARK-20568). However, NFS is a simple integration, and allows message content to be transferred directly to where it is processed on the workers (see Fig. 2). Consequently, where the theoretic bound on message frequency is down to double-digit frequencies, the integration is very efficient, and it is the best performing framework at very high CPU loads (the top of Fig. 3). Each worker directly fetches the files it needs from the streaming source machine, making good use of network bandwidth, and robustly handling messages of arbitrary size.

Fig. 4: Maximum stream processing frequencies for Spark (with TCP, File Streaming, Kafka) and HarmonicIO by message size; for a selection of CPU costs per message.

HarmonicIO is the best performing framework (in terms of maximum frequency) over much of the parameter space in this study, as shown in Fig. 3. Its peer-to-peer network topology (see Fig. 2) explains its ability to make good use of the network infrastructure: it has excellent performance in the network-bound case, and its simplicity explains how this generalizes well over the rest of the domain. In scenarios with message sizes in the 1-10MB range, or computationally intensive message analysis on medium clusters – i.e. scientific computing applications, including low-latency microscopy image analysis – overall throughput can be better using HarmonicIO. The plots in Fig. 4 show that HarmonicIO is able to make excellent use of available network bandwidth, for larger message sizes especially. HarmonicIO achieved a maximum message transfer rate of 625Hz, for the smallest, lightest messages (see Fig. 3) – meaning that in this domain it was easily beaten by Kafka and the Spark with TCP stream integration. Fig. 4.A clearly shows this frequency bound – making it unsuitable for enterprise use cases with high message frequencies.

HarmonicIO's dynamic architecture, switching between P2P and queuing modes, is a simple approach to making effective use of network bandwidth: if all workers are busy, messages are sent to the queue, keeping the link to the streaming source fully utilized.

In summary, for very small messages (and lightweight processing), Spark with Kafka performs well (as does Spark with TCP streaming); similarly, for quasi-batch processing of very large files (at large polling intervals), the Spark with file streaming integration performs well. These are the two typical enterprise stream analytics contexts: low-latency, high-frequency processing of small messages, and high-latency quasi-batch file-based processing of large message batches, log files, archives, etc. This leaves a middle region – messages of 1-10 MB, and CPU-intensive processing of small messages (> 0.1 sec/message): conditions typical of many scientific computing use cases, including our motivating microscopy image processing use case. In this region, HarmonicIO is able to outperform the Spark streaming approaches benchmarked in this study. Our results show how replication and message forwarding have a cost in terms of processing (cores are used for forwarding messages) and network bandwidth. Benchmarking over a spectrum of processing loads makes this visible empirically.

Fig. 5: Maximum frequency by message size, for Spark (with TCP, File Streaming, Kafka) and HarmonicIO under varying message size; normalized as a fraction of the best performing framework for the particular parameter values.

C. Features, Challenges & Recommendations

This section briefly discusses our experiences with each framework setup, implementation issues, and the challenges of measuring performance. Whilst HarmonicIO offers container isolation, and a simple and extensible code base, it lacks functionality for replication and fault handling, with no delivery guarantees – messages can be lost in the case of node failure. The current implementation has no reduce functionality. The simplicity of HarmonicIO means it has only a handful of configuration options. Conversely, ASS's rich feature set creates complexity, especially for configuration.
X. CONCLUSIONS

This study has confirmed Spark's excellent performance, consistent with earlier studies, albeit for use cases with small message size and low message processing cost – and for quasi-batch loads at low frequencies. But we also find that these 'islands' of excellent performance do not generalize across the wider domain of use cases we studied. In particular, it was difficult for Spark to achieve good performance in the 1MB-10MB message size range (typical of microscopy image analysis, for example). Our results have shown the importance of choosing a stream source integration appropriate for the message size and the required performance.

By contrast, HarmonicIO performs well in this region, with good hardware utilization at low frequencies – whilst not matching Spark's maximum frequency for the small message/cheap map function use case. Its simplicity makes its performance less sensitive to configuration settings and to the parameters of the application load. Its lack of replication (and fault tolerance) allows for better throughput in some scenarios, but makes it unsuitable for handling valuable data not replicated elsewhere. Where datasets are generated deterministically, for numerical simulations for example, this may be less of a concern. Out of the box, HarmonicIO outperforms many typical (or easily adopted) stream integrations with Spark in the intermediate region – atypical of the enterprise use cases for which Spark (and common stream integrations) are intended – and this performance is robust over a greater domain (of message size and processing cost). This is in contrast to highly-tuned Spark deployments, which are likely to give high performance in perhaps narrower domains. HarmonicIO is sometimes able to achieve better (and more robust) performance because of its simplicity and lack of extensive functionality, making it much more lightweight – and able to achieve better hardware utilization in some scenarios.

This paper has quantified the use cases where each framework integration can perform at or near theoretical bounds, and where each may perform poorly. We have highlighted key feature differences between the frameworks. We hope our findings prove useful to designers of atypical stream processing applications, particularly within scientific computing.

XI. FUTURE WORK

Our goal is to develop a SCaaS platform for scientific computing. The challenge is to combine the features of Spark and the robust performance of HarmonicIO into a single platform, or indeed to investigate whether this is feasible.

A further challenge is to engineer a system which is able to adapt to users' performance, durability and economic cost requirements in a smart and dynamic way, going beyond, say, merely tuning replication policies, so that instead the architecture itself is radically different depending on the use case. We could approach the problem 'from above' and build meta-frameworks and middleware, or 'from below' – selecting integration with different underlying systems (and reconfiguring them) depending on the characteristics of the application load and the desired trade-offs. We feel this represents an open problem in cloud computing, and a key focus for our future research.

XII. ACKNOWLEDGEMENTS

This work is funded by the Swedish Foundation for Strategic Research (SSF) under award no. BD15-0008, and the eSSENCE strategic collaboration on eScience. Computational resources were provided by the Swedish National Infrastructure for Computing via the SNIC Science Cloud (SSC) (Toor et al., 2017), an OpenStack-based community cloud for Swedish academia.

REFERENCES

Apache Flume (2016). Apache Flume. https://flume.apache.org/.
Apache Software Foundation (2011). Apache Hadoop. http://hadoop.apache.org/.
Ausmees, K., John, A., Toor, S. Z., Hellander, A., and Nettelblad, C. (2018). BAMSI: A multi-cloud service for scalable distributed filtering of massive genome data. BMC Bioinformatics, 19(1):240.
Carbone, P., Katsifodimos, A., Ewen, S., Markl, V., Haridi, S., and Tzoumas, K. (2015). Apache Flink™: Stream and Batch Processing in a Single Engine. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 36(4).
Chintapalli, S., Dagit, D., Evans, B., Farivar, R., Graves, T., Holderbaugh, M., Liu, Z., Nusbaum, K., Patil, K., and Peng, B. J. (2016). Benchmarking streaming computation engines: Storm, Flink and Spark streaming. In Parallel and Distributed Processing Symposium Workshops, 2016 IEEE International, pages 1789–1792. IEEE.
Grier, J. (2016). Extending the Yahoo! Streaming Benchmark.
Kreps, J., Narkhede, N., Rao, J., et al. (2011). Kafka: A distributed messaging system for log processing. In Proceedings of the NetDB, pages 1–7.
Lugnegård, L. (2018). Building a high throughput microscope simulator using the Apache Kafka streaming framework (MSc Thesis).
Marcu, O. C., Costan, A., Antoniu, G., and Pérez-Hernández, M. S. (2016). Spark Versus Flink: Understanding Performance in Big Data Analytics Frameworks. In 2016 IEEE International Conference on Cluster Computing (CLUSTER), pages 433–442.
Qian, S., Wu, G., Huang, J., and Das, T. (2016). Benchmarking modern distributed streaming platforms. In 2016 IEEE International Conference on Industrial Technology (ICIT), pages 592–598.
Salvatore Sanfilippo (2009). Redis. https://redis.io/.
Toor, S., Lindberg, M., Falman, I., Vallin, A., Mohill, O., Freyhult, P., Nilsson, L., Agback, M., Viklund, L., Zazzik, H., et al. (2017). SNIC Science Cloud (SSC): A National-Scale Cloud Infrastructure for Swedish Academia. In E-Science (e-Science), 2017 IEEE 13th International Conference On, pages 219–227. IEEE.
Torruangwatthana, P., Wieslander, H., Blamey, B., Hellander, A., and Toor, S. (2018). HarmonicIO: Scalable Data Stream Processing for Scientific Datasets. In 2018 IEEE 11th International Conference on Cloud Computing (CLOUD), pages 879–882.
Wollman, R. and Stuurman, N. (2007). High throughput microscopy: From raw images to discoveries. Journal of Cell Science, 120(21):3715–3722.
Xin, R. (2014). Apache Spark the fastest open source engine for sorting a petabyte.
Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M. J., Shenker, S., and Stoica, I. (2012). Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, NSDI'12, pages 2–2, San Jose, CA. USENIX Association.
Zaharia, M., Xin, R. S., Wendell, P., Das, T., Armbrust, M., Dave, A., Meng, X., Rosen, J., Venkataraman, S., Franklin, M. J., Ghodsi, A., Gonzalez, J., Shenker, S., and Stoica, I. (2016). Apache Spark: A Unified Engine for Big Data Processing. Commun. ACM, 59(11):56–65.