Auto-Pipelining for Data Stream Processing

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. XX, NO. XX, XXX 2012

Yuzhe Tang, Student Member, IEEE, Buğra Gedik, Member, IEEE

Abstract: Stream processing applications use online analytics to ingest high-rate data sources, process them on-the-fly, and generate live results in a timely manner. The data flow graph representation of these applications facilitates the specification of stream computing tasks with ease, and also lends itself to possible run-time exploitation of parallelization on multi-core processors. While the data flow graphs naturally contain a rich set of parallelization opportunities, exploiting them is challenging due to the combinatorial number of possible configurations. Furthermore, the best configuration is dynamic in nature; it can differ across multiple runs of the application, and even during different phases of the same run. In this paper, we propose an auto-pipelining solution that can take advantage of multi-core processors to improve throughput of streaming applications, in an effective and transparent way. The solution is effective in the sense that it provides good utilization of resources by dynamically finding and exploiting sources of pipeline parallelism in streaming applications. It is transparent in the sense that it does not require any hints from the application developers. As part of our solution, we describe a light-weight runtime profiling scheme to learn resource usage of operators comprising the application, an optimization algorithm to locate best places in the data flow graph to explore additional parallelism, and an adaptive control scheme to find the right level of parallelism. We have implemented our solution in an industrial-strength stream processing system.
Our experimental evaluation based on micro-benchmarks, synthetic workloads, as well as real-world applications confirms that our design is effective in optimizing the throughput of stream processing applications without requiring any changes to the application code.

Index Terms: stream processing; parallelization; auto-pipelining

• Y. Tang is a Ph.D. student at the College of Computing, Georgia Institute of Technology, 801 Atlantic Drive, Atlanta, GA 30332. E-mail: [email protected]. The work was done while the author was at the IBM T.J. Watson Research Center.
• B. Gedik is an Asst. Professor at the Computer Engineering Department, Bilkent University, 06800, Ankara, Turkey. E-mail: [email protected]. Part of the work was done while the author was at the IBM T.J. Watson Research Center.

1 INTRODUCTION

With the recent explosion in the amount of data available as live feeds, stream computing has found wide application in areas ranging from telecommunications to healthcare to cyber-security. Stream processing applications implement data-in-motion analytics to ingest high-rate data sources, process them on-the-fly, and generate live results in a timely manner. Stream computing middleware provides an execution substrate and runtime system for stream processing applications. In recent years, many such systems have been developed in academia [1], [2], [3], as well as in industry [4], [5], [6].

For the last decade, we have witnessed the proliferation of multi-core processors, fueled by diminishing gains in processor performance from increasing operating frequencies. Multi-core processors pose a major challenge to software development, as taking advantage of them often requires fundamental changes to how application code is structured. Examples include employing thread-level primitives or relying on higher-level abstractions that have been the focus of much research and development [7], [8], [9], [10], [11], [12]. The high-throughput processing requirement of stream processing applications makes them ideal for taking advantage of multi-core processors. However, it is a challenge to keep the simple and elegant data flow programming model of stream computing, while best utilizing the multiple cores available in today’s processors.

Stream processing applications are represented as data flow graphs, consisting of reusable operators connected to each other via stream connections attached to operator ports. This is a programming model that is declarative at the flow manipulation level and imperative at the flow composition level [13]. The data flow graph representation of stream processing applications contains a rich set of parallelization opportunities. For instance, pipeline parallelism is abundant in stream processing applications. While one operator is processing a tuple, an upstream operator can process the next tuple concurrently. Many data flow graphs contain bushy segments that process the same set of tuples, and which can be executed in parallel. This is an example of task parallelism. It is noteworthy that both forms of parallelism have advantages in terms of preserving the semantics of a parallel program. On the other hand, exploiting data parallelism has additional complexity due to the need for morphing the graph to create multiple copies of an operator and to re-establish the order between tuples. Pipeline and task parallelism do not require morphing the graph and preserve the order without additional effort. These two forms of parallelism can be exploited by inserting the right number of threads into the data flow graph at the right locations. It is desirable to perform this kind of parallelization in a transparent manner, such that the applications are developed without explicit knowledge of the amount of parallelism available on the platform. We call this process auto-pipelining.

There are several challenges to performing effective and transparent auto-pipelining in the context of stream processing applications.

First, optimizing the parallelization of stream processing applications requires determining the relative costs of operators. The prevalence of user-defined operators in real-world streaming applications [5] means that cost modeling, commonly applied in database systems [14], is not applicable in this setting. On the other hand, profile-driven optimization that requires one or more profile runs based on compiler-generated instrumentation [15], [16], while effective, suffers from usability problems and lack of runtime adaptation. On the usability side, requiring profile runs and specification of additional compilation options has proven to be unpopular among users in our own experience (see Appendix J). In terms of runtime adaptation, the profile run may not be representative of the final execution. In summary, a light-weight dynamic profiling of operators is needed in order to provide effective and transparent auto-pipelining.

Second, and more fundamentally, it is a challenge to efficiently (time-wise) find an effective (throughput-wise) configuration that best utilizes available resources and harnesses the inherent parallelism present in the streaming application. Given $N$ operator ports and up to $T$ threads, there are combinatorial possibilities, $\sum_{k=0}^{T} \binom{N}{k}$ to be precise. In the absence of auto-pipelining, we have observed application developers struggling to insert threads manually¹ to improve throughput. This is no surprise, as for a medium size application with 50 operators on an 8-core system, the number of possibilities reaches multiple billions. Thus, a practical optimization solution needs to quickly and automatically locate an effective configuration at runtime.

Finally, deciding the right level of parallelism is a challenge. The behavior of the system is difficult to predict for various reasons. User-defined operators can contain locks that inhibit effective parallelization. The overhead imposed by adding an additional thread in the execution path is a function of the size of the tuples flowing through the port. The behavior of the operating system scheduler cannot be easily modeled and predicted. The impact of these and other system artifacts are observable only

helps eliminate bottlenecks and improve throughput.
• A control algorithm that decides when to stop inserting additional threads and also backtracks from decisions that turn out to be ineffective.
• Runtime mechanics to insert/remove threads while maintaining lock correctness and continuous operation.

We implemented our auto-pipelining solution on IBM’s System S [3], an industrial strength stream processing middleware. We evaluate its effectiveness using micro-benchmarks, synthetic workloads, and real-world applications. Our results show that auto-pipelining provides better throughput compared to hand-optimized applications at no cost to application developers.

2 BACKGROUND

We provide a brief overview of the basic concepts associated with stream processing applications, using SPL [5] as the language of illustration. We also describe the fundamentals of runtime execution in System S.

2.1 Basic concepts

Listing 1 in Appendix A gives the source code for a very simple stream processing application in SPL, with its visual representation depicted in Figure 1 below.

Fig. 1: Data flow graph for the SensorQuery app. (Labels recovered from the figure: Sensor Source, Query Source, Join, TCPSink; streams Sensors, Queries, Results.)

The application is composed of operator instances connected to each other via stream connections. An operator instance is a vertex in the application graph. An operator instance is always associated with an operator. For instance, the operator instance shown in the middle of the graph in Figure 1 is an instance of a Join
