Fine-Grained Real-Time Stream Processing with Cameo

Move Fast and Meet Deadlines: Fine-grained Real-time Stream Processing with Cameo Le Xu, University of Illinois at Urbana-Champaign; Shivaram Venkataraman, UW-Madison; Indranil Gupta, University of Illinois at Urbana-Champaign; Luo Mai, University of Edinburgh; Rahul Potharaju, Microsoft https://www.usenix.org/conference/nsdi21/presentation/xu This paper is included in the Proceedings of the 18th USENIX Symposium on Networked Systems Design and Implementation. April 12–14, 2021 978-1-939133-21-2 Open access to the Proceedings of the 18th USENIX Symposium on Networked Systems Design and Implementation is sponsored by Move Fast and Meet Deadlines: Fine-grained Real-time Stream Processing with Cameo Le Xu1∗, Shivaram Venkataraman2, Indranil Gupta1, Luo Mai3, and Rahul Potharaju4 1University of Illinois at Urbana-Champaign, 2UW-Madison, 3University of Edinburgh, 4Microsoft Abstract Resource provisioning in multi-tenant stream processing systems faces the dual challenges of keeping resource utilization high (without over-provisioning), and ensuring performance isolation. In our common production use cases, where streaming workloads have to meet latency targets and avoid breach- ing service-level agreements, existing solutions are incapable Figure 1: Slot-based system (Flink), Simple Actor system (Or- of handling the wide variability of user needs. Our framework leans), and our framework Cameo. called Cameo uses fine-grained stream processing (inspired by actor computation models), and is able to provide high latency. Others wish to maximize throughput under limited re- resource utilization while meeting latency targets. Cameo dy- sources, and yet others desire high resource utilization. Violat- namically calculates and propagates priorities of events based ing such user expectations is expensive, resulting in breaches on user latency targets and query semantics. Experiments of service-level agreements (SLAs), monetary losses, and cus- on Microsoft Azure show that compared to state-of-the-art, tomer dissatisfaction. the Cameo framework: i) reduces query latency by 2.7×in To address these challenges, we explore a new fine-grained single tenant settings, ii) reduces query latency by 4.6×in philosophy for designing a multi-tenant stream processing multi-tenant scenarios, and iii) weathers transient spikes of system. Our key idea is to provision resources to each operator workload. based solely on its immediate need. Concretely we focus on deadline-driven needs. Our fine-grained approach is inspired 1 Introduction by the recent emergence of event-driven data processing ar- Stream processing applications in large companies handle chitectures including actor frameworks like Orleans [10,25] tens of millions of events per second [16, 68, 89]. In an at- and Akka [1], and serverless cloud platforms [5,7, 11, 51]. tempt to scale and keep total cost of ownership (TCO) low, Our motivation for exploring a fine-grained approach is today’s systems: a) parallelize operators across machines, and to enable resource sharing directly among operators. This b) use multi-tenancy, wherein operators are collocated on is more efficient than the traditional slot-based approach, shared resources. Yet, resource provisioning in production wherein operators are assigned dedicated resources. In the environments remains challenging due to two major reasons: slot-based approach, operators are mapped onto processes or (i) High workload variability. In a production cluster at a threads—examples include task slots in Flink [27], instances large online services company, we observed orders of magni- in Heron [57], and executors in Spark Streaming [90]. Devel- tude variation in event ingestion and processing rates, across opers then need to either assign applications to a dedicated time, across data sources, across operators, and across ap- subset of machines [13], or place execution slots in resource plications. This indicates that resource allocation needs to be containers and acquire physical resources (CPUs and mem- dynamically tailored towards each operator in each query, in ory) through resource managers [8, 47, 84]. a nimble and adept manner at run time. While slot-based systems provide isolation, they are hard (ii) Latency targets vary across applications. User expec- to dynamically reconfigure in the face of workload variability. tations come in myriad shapes. Some applications require As a result it has become common for developers to “game” quick responses to events of interest, i.e., short end-to-end their resource requests, asking for over-provisioned resources, far above what the job needs [34]. Aggressive users starve ∗Contact author: Le Xu <[email protected]> other jobs which might need immediate resources, and the USENIX Association 18th USENIX Symposium on Networked Systems Design and Implementation 389 upshot is unfair allocations and low utilization. At the same time, today’s fine-grained scheduling systems like Orleans, as shown in Figure1, cause high tail latencies. The figure also shows that a slot-based system (Flink on YARN), which maps each executor to a CPU, leads to low resource utilization. The plot shows that our approach, Cameo, can provide both high utilization and low tail latency. To realize our approach, we develop a new priority-based framework for fine-grained distributed stream processing. (a) Data Volume (b) Job Scheduling & Completion La- This requires us to tackle several architectural design chal- Distribution tencies lenges including: 1) translating a job’s performance target (deadlines) to priorities of individual messages, 2) developing Normalized Data Volume 0.25 0.50 0.75 1.00 interfaces to use real-time scheduling policies such as earliest deadline first (EDF) [65], least laxity first (LLF) [69] etc., and 3) low-overhead scheduling of operators for prioritized messages. We present Cameo, a new scheduling framework designed for data streaming applications. Cameo: • Dynamically derives priorities of operators, using both: a) Event Stream static input, e.g., job deadline; and b) dynamic stimulus, e.g., tracking stream progress, profiled message execution Time (1−minute data volume) (each band is a distinct stream) (each times. (c) Ingestion Heatmap • Contributes new mechanisms: i) scheduling contexts, which propagate scheduling states along dataflow paths, Figure 2: Workload characteristics collected from a produc- ii) a context handling interface, which enables pluggable tion stream analytics system. scheduling strategies (e.g., laxity, deadline, etc.), and iii) provisioning—their users rarely have any means of accu- tackles required scheduling issues including per-event rately gauging how many nodes are required, and end up synchronization, and semantic-awareness to events. over-provisioning for their job. • Provides low-overhead scheduling by: i) using a stateless Temporal variation makes resource prediction difficult. scheduler, and ii) allowing scheduling operations to be Figure 2(c) is a heat map showing incoming data volume driven purely by message arrivals and flow. for 20 different stream sources. The graph shows a high de- We build Cameo on Flare [68], which is a distributed data gree of variability across both sources and time. A single flow runtime built atop Orleans [10, 25]. Our experiments are stream can have spikes lasting one to a few seconds, as well run on Microsoft Azure, using production workloads. Cameo, as periods of idleness. Further, this pattern is continuously using a laxity-based scheduler, reduces latency by up to 2.7× changing. This points to the need for an agile and fine-grained in single-query scenarios and up to 4.6× in multi-query sce- way to respond to temporal variations, as they are occurring. narios. Cameo schedules are resilient to transient workload Users already try to do fine-grained scheduling. We have spikes and ingestion rate skews across sources. Cameo’s observed that instead of continuously running streaming appli- scheduling decisions incur less than 6.4% overhead. cations, our users prefer to provision a cluster using external resource managers (e.g., YARN [2], Mesos [47]), and then run 2 Background and Motivation periodic micro-batch jobs. Their implicit aim is to improve 2.1 Workload Characteristics resource utilization and throughput (albeit with unpredictable latencies). However, Figure 2(b) shows that this ad-hoc ap- We study a production cluster that ingests more than 10 PB proach causes overheads as high as 80%. This points to the per day over several 100K machines. The shared cluster has need for a common way to allow all users to perform fine- several internal teams running streaming applications which grained scheduling, without a hit on performance. perform debugging, monitoring, impact analysis, etc. We first Latency requirements vary across jobs. Finally, we also make key observations about this workload. see a wide range of latency requirements across jobs. Fig- Long-tail streams drive resource over-provisioning. Each ure 2(b) shows that the job completion time for the micro- data stream is handled by a standing streaming query, de- aggregation jobs ranges from less than 10 seconds up to 1000 ployed as a dataflow job. As shown in Figure 2(a), we first seconds. This suggests that the range of SLAs required by observe that 10% of the streams process a majority of the queries will vary across a wide range. This also presents an data. Additionally, we observe that a long tail of streams, opportunity for priority-based scheduling: applications have each processing small amount

Fine-Grained Real-Time Stream Processing with Cameo

Parallel Patterns for Adaptive Data Stream Processing

A Middleware for Efficient Stream Processing in CUDA

AMD Accelerated Parallel Processing Opencl Programming Guide

Fine-Grained Window-Based Stream Processing on CPU-GPU Integrated

Lightsaber: Efficient Window Aggregation on Multi-Core Processors

Parallel Stream Processing with MPI for Video Analytics and Data Visualization

Design and Implementation of an FPGA-Based Scalable Pipelined

SABER: Window-Based Hybrid Stream Processing for Heterogeneous Architectures

Analyzing Efficient Stream Processing on Modern Hardware

Streambox: Modern Stream Processing on a Multicore Machine

A Media Enhanced Vector Architecture for Embedded Memory Systems

A Survey on Coarse-Grained Reconfigurable Architectures from a Performance Perspective