Scaling Ordered Stream Processing on Shared-Memory Multicores

Guna Prasaad, University of Washington
G. Ramalingam, Microsoft Research India
Kaushik Rajan, Microsoft Research India

ABSTRACT
Many modern applications require real-time processing of large volumes of high-speed data. Such data processing needs can be modeled as a streaming computation. A streaming computation is specified as a dataflow graph that exposes multiple opportunities for parallelizing its execution, in the form of data, pipeline and task parallelism. On the other hand, many important applications require that processing of the stream be ordered, where inputs are processed in the same order as they arrive. There is a fundamental conflict between ordered processing and parallelizing the streaming computation. This paper focuses on the problem of effectively parallelizing ordered streaming computations on a shared-memory multicore machine.

We first address the key challenges in exploiting data parallelism in the ordered setting. We present a low-latency, non-blocking concurrent data structure to order outputs produced by concurrent workers on an operator. We also propose a new approach to parallelizing partitioned stateful operators that can handle load imbalance across partitions effectively and mostly avoid delays due to ordering. We illustrate the trade-offs and effectiveness of our concurrent data structures on micro-benchmarks and streaming queries from the TPCx-BB [16] benchmark. We then present an adaptive runtime that dynamically maps the exposed parallelism in the computation to that of the machine. We propose several intuitive scheduling heuristics and compare them empirically on the TPCx-BB queries. We find that for streaming computations, heuristics that exploit as much pipeline parallelism as possible perform better than those that seek to exploit data parallelism.

KEYWORDS
stream processing systems, streaming dataflow graph, continuous queries, data parallelism, partitioned parallelism, pipeline parallelism, dynamic scheduling, ordered processing, runtime, concurrent data structures

1 INTRODUCTION
Stream processing as a computational model has a long history dating back to Petri Nets in the 1960s, Kahn Process Networks and Communicating Sequential Processes in the 1970s, and Synchronous Dataflow in the 1980s [39]. Practical applications of stream processing were, for a long time, limited to audio, video and digital signal processing that typically involves deterministic, high-performance computations. Several languages and compilers such as StreamIt [22], Continuous Query Language (CQL) [9] and Imagine [28] were designed to specify and optimize the execution of such programs on single-core and shared-memory architectures. The emergence of sensors and similar small-scale computing devices that continuously produce large volumes of data led to the rise of many new applications such as surveillance, fraud detection, environment monitoring, etc. The scale and distributed nature of the problem spurred the interest of several research communities and gave rise to large-scale distributed stream processing systems such as Aurora [2], Borealis [1], STREAM [11] and TelegraphCQ [19].

In today's highly connected world, data is of utmost value as it arrives. The advent of Big Data has further increased the importance of realtime stream processing. Several modern use cases, such as shopping cart abandonment analysis, ad serving, brand monitoring on social media, and leaderboard maintenance in online games, require realtime processing of large volumes of high-speed data. In the past decade, there has been a tremendous increase in the number of products (e.g. IBM Streams [37], MillWheel [5], Spark Streaming [41], Apache Storm [8], S4 [35], Samza [7], Heron [29], Microsoft StreamInsight [32], TRILL [18]) that cater to such data processing needs, which is evidence of its ever-growing importance.

In this paper, we focus on scaling stream processing on the shared-memory multicore architecture. We consider this an important problem for several reasons. Streaming pipelines generally have a low memory footprint, as most of the operators are either stateless or have a small bounded state. With increasing main memory sizes and the prevalence of multicore architectures, the bandwidth and parallelism offered by a single machine today are often sufficient to deploy pipelines with a large number of operators [37]. So, most streaming workloads can be handled efficiently on a single multicore machine without having to distribute them over a cluster of machines. In fact, systems like TRILL [18] run streaming computations entirely on a single multicore machine at scales sufficient for several important applications. This is unlike batch processing systems, where the input, intermediate results and output data are often large and hence the pipeline needs to be split into many stages. Even in workloads where distribution across a cluster is important (e.g. for fault tolerance), the individual nodes in the cluster are typically shared-memory multicores themselves. Most distributed streaming systems [8, 41] today treat each core in a multicore node as an individual executor and fail to exploit the advantages of the low-overhead communication offered by shared memory. We believe that treating a multicore machine as a single powerful node, rather than as a set of independent nodes, can better exploit shared-memory parallelism. A streaming computation can be split into multiple stages and each stage can be deployed on a shared-memory node [37]. Most prior work in this area has not studied the shared-memory multicore setting in depth: it either focuses on the single core [10, 17, 27] or on distributed shared-nothing architectures [1, 5, 6, 8, 29, 41].

Further, we are interested in ordered stream processing. A stream of events/tuples usually has an associated notion of temporal ordering, as in user click-streams from an online shopping session or periodic sensor readings. In many scenarios, the application logic depends on this temporal order: for example, clustering click-streams into sessions based on timeouts between two consecutive events, or computing time-windowed aggregates over streams of data. Implementing such logic on systems that do not provide ordered semantics is complicated and often turns out to be a performance bottleneck.

Ordered processing also enables our parallelization framework to be deployed easily on individual multicore nodes in a distributed stream processing cluster. This guarantee is especially important when a large stream processing query is divided into sub-queries, each allotted to a multicore node, and one of them contains a non-commutative operation. Moreover, fault tolerance techniques such as active replication depend on the state of the pipeline being the same on two replicas, which cannot be guaranteed without ordered processing.

Many stream processing systems today provide mechanisms to support ordered stream processing. Most of them are based on a micro-batch architecture [1, 5, 18, 41], in which the input stream is broken down into a stream of smaller batches and each batch is processed as in a batch processing system such as MapReduce [20] or Apache Spark [40]. They support order-sensitive pipelines by periodically sending watermarks denoting that all events with timestamps less than a specific value have been received. However, these techniques are not suitable for latency-critical applications, mainly due to batching delays. We show that it is possible to achieve this guarantee at much lower latencies without constraining execution of the pipeline excessively.
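As a minimal illustration of the idea (a sketch in Java with hypothetical names, not the non-blocking data structure we present later), ordered outputs can be enforced per operator without batching: each input is stamped with a sequence number, workers process inputs concurrently, and a small reordering buffer releases results strictly in input order. For clarity, this sketch serializes the release step with a lock; the structure we actually present avoids blocking.

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.function.Consumer;

    // Illustrative sketch only: buffers results of concurrent workers and
    // releases them strictly in input (sequence-number) order. Unlike the
    // non-blocking structure presented in this paper, this simple version
    // serializes the release step with a lock.
    final class ReorderBuffer<T> {
        private final ConcurrentHashMap<Long, T> pending = new ConcurrentHashMap<>();
        private final Consumer<T> downstream; // next operator in the pipeline
        private long nextSeq = 0;             // release frontier, guarded by 'this'

        ReorderBuffer(Consumer<T> downstream) { this.downstream = downstream; }

        // Called by any worker thread once it finishes the input numbered 'seq'.
        void offer(long seq, T result) {
            pending.put(seq, result);
            drain();
        }

        // Emit every result that is now contiguous with the release frontier.
        private synchronized void drain() {
            T ready;
            while ((ready = pending.remove(nextSeq)) != null) {
                downstream.accept(ready); // outputs leave in arrival order
                nextSeq++;
            }
        }
    }

Note that a slow worker holds back the frontier while results of faster workers wait in the buffer; minimizing such ordering delays is one of the key challenges we address.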
1.1 Background and Challenges
A streaming computation can be specified as a dataflow graph, where each vertex is associated with an operator and directed edges represent the flow of input into and out of the operators. An operator may be stateless, processing each input independently, or stateful, maintaining state across inputs. A stateful operator often accesses only a part of the state during the computation, which is pre-determined by a key associated with every input tuple. Such operators are called partitioned stateful operators, as the state can be partitioned by the key. windowed-group-by-count is an operator of this type.
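For concreteness, the following sketch (illustrative only; the class and method names are our own assumptions, not an implementation from this paper) shows a windowed-group-by-count whose state is partitioned by key: each input tuple reads and updates only the state entry of its own key.

    import java.util.HashMap;
    import java.util.Map;

    // Sketch of a partitioned stateful operator: a per-key count over tumbling
    // time windows. Each input touches only the state of its own key, so the
    // state naturally partitions by key. Not thread-safe: it assumes each key
    // partition is processed by a single worker at a time.
    final class WindowedGroupByCount {
        private final long windowSizeMillis;
        // key -> (window start -> count): state partitioned by the tuple's key
        private final Map<String, Map<Long, Long>> counts = new HashMap<>();

        WindowedGroupByCount(long windowSizeMillis) {
            this.windowSizeMillis = windowSizeMillis;
        }

        // Process one tuple; the key pre-determines which partition is accessed.
        long process(String key, long timestampMillis) {
            long windowStart = timestampMillis - (timestampMillis % windowSizeMillis);
            Map<Long, Long> perKeyWindows =
                counts.computeIfAbsent(key, k -> new HashMap<>());
            return perKeyWindows.merge(windowStart, 1L, Long::sum);
        }
    }

Because tuples with different keys access disjoint state, they can be processed by different workers in parallel, as discussed next.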
1.1.1 Opportunities for Parallelization. A streaming dataflow graph exposes various opportunities for parallelizing the computation efficiently. We elucidate this using an example: figure 1 represents an algorithm to detect high-mobility fraud using call data records as a streaming dataflow graph.

A call data record (CDR) is generated for every call between two mobile phones and contains information such as the time and duration of the call and the location and phone number of the caller and the callee. In the detection algorithm, a CDR is first filtered (1, fig. 1) on the area code of interest, and the caller/callee's time and location information is projected (2) as a record. These location records are then grouped by phone number to compute (3) the speed at which a user must have traveled between locations. Phone numbers with a speed greater than T are then filtered (4), and the number of such cases in a given time window is counted (5).

An operator is said to be data parallel if its inputs can be processed concurrently. Stateless operators such as (1, 2, 4) in the example are data parallel. On the other hand, inputs to a partitioned stateful operator can be processed in parallel only if they belong to different partitions. Hence, such operators are said to exhibit partitioned parallelism. In our example, computing the speed from the location records (3) for two different phone numbers can be done in parallel. Non-commutative stateful operators do not exhibit any data parallelism.

Further, when two operators are connected such that the output of one forms the input to the other, they are said to exhibit pipeline parallelism: the two operators can be processed concurrently. For example, one worker can compute the speed (3) for a particular phone number, while another filters (4) some phone numbers based on the speed already computed and sent