
A Scalable Circular Design for Multi-Way Joins in Hardware

Mohammadreza Najafi†, Mohammad Sadoghi‡, Hans-Arno Jacobsen†
†Technical University of Munich, ‡University of California, Davis

Abstract—Efficient real-time analytics are an integral part of a growing number of data management applications such as computational targeted advertising, algorithmic trading, and the Internet of Things. In this paper, we primarily focus on accelerating stream joins, arguably one of the most commonly used and resource-intensive operators in stream processing. We propose a scalable circular pipeline design (Circular-MJ) in hardware to orchestrate multi-way joins while minimizing data flow disruption. In this circular design, each new tuple (given its origin stream) starts its processing from a specific join core and passes through all respective join cores in a pipeline sequence to produce final results. We further present a novel two-stage pipeline stream join (Stashed-MJ) that uses a best-effort buffering technique (stash) to maintain intermediate results. In case an overwrite is detected in the stash, our design automatically resorts to recomputing intermediate results. Our experimental results demonstrate a linear throughput scaling with respect to the number of execution units in hardware.

Fig. 1: Reordering operators in multi-way stream joins (join predicates S1.OrderNr = S2.OrderNr and S2.Price < S3.Price < S2.Price + $10; the operator order differs for new tuples from S1/S2 versus S3).

I. INTRODUCTION

In recent years, there has been an increasing interest in data stream management systems, which encompass a wide range of applications such as real-time data analytics, targeted advertising, data mining, and the Internet of Things. The common pattern among these applications is a predefined set of streaming queries (e.g., ad campaigns or trading strategies) and unbounded event streams of incoming data (e.g., user profiles or stock feeds) that must be processed against the queries in real-time. These latency-sensitive and throughput-intensive applications have opened the demand for accelerating data management operations in general and stream processing in particular. Considering the crucial role of joins as resource-intensive operators in relational databases, it is not surprising that stream joins have also been the focus of much research on data streams [1], [2], [3], [4], [5], [6], [7], [8]. For example, consider TPC-H [9], where 20 queries (out of 22) contain a join operator and 12 of them use multi-way joins, some with up to 7 joins. However, the importance of joins is no longer limited to the classical relational setting. The emergence of the Internet of Things (IoT) has introduced a wave of applications that rely on sensing, gathering, and processing data from an increasingly large number of connected devices.

In stream join processing, software platforms offer flexible communication, where we see anycast and multicast connections between internal components without a significant drop in performance. As an example, assume a system with four internal components A, B, C, and D, where point-to-point connections between them could build a data path similar to A→B→C→D. On a software platform, establishing additional communication such as A→D and B→D is not detrimental, especially given the flexible shared memory hierarchy. State-of-the-art software approaches to multi-way stream joins benefit from this provided flexibility, which leads to an unstructured architecture.¹ However, in hardware, it is instrumental to predetermine and plan the necessary communication channels; otherwise a design's performance, complexity, and cost rapidly deteriorate, since all communication has to be realized exclusively using physical wiring. This motivates a complete rethinking of the hardware design rather than simply relying on a re-implementation of available software solutions. On the other hand, the execution of continuous queries (i.e., repetitive tasks) on potentially unbounded streams using finite-window semantics offers a unique opportunity and makes them suitable for hardware acceleration. A hardware solution presents negligible gains, if any, when executing an operation only once versus its competing software variant; however, when the operation repeats many times, the amortized gains grow far beyond those of a software solution.

¹We define an unstructured architecture as one with arbitrary point-to-point communication (or other classes of communication patterns such as anycast) among any processing nodes within the system.

Multi-Way Stream Join: Past hardware solutions [3], [4], [10], [11], [12], [13], [14] focus on a join between two streams, while practical queries often go beyond joining two streams. Notice that it is non-trivial to build multi-way join operators by cascading operators designed for two streams, as each new tuple, depending on its origin, requires its own order of join operators for its processing. Therefore, it remains a major challenge to design multi-way stream joins in hardware due to the excessive cost and penalty of flexible communication.
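To make the origin-dependent operator order concrete, consider the predicates shown in Figure 1. The following minimal Python sketch is ours and not part of the paper (the dictionary-based tuples and unbounded window lists are simplifying assumptions); it shows that a fixed cascade of two-way operators cannot serve all three streams, because the probe order over the other windows depends on where the new tuple originates.

```python
# Behavioral sketch (not the paper's hardware): a 3-way stream join whose
# probe order depends on the origin stream of the incoming tuple.
# Predicates follow Figure 1: p12 on OrderNr, p23 on Price.

W = {"S1": [], "S2": [], "S3": []}  # sliding windows, one per stream

def p12(t1, t2):                # S1.OrderNr = S2.OrderNr
    return t1["OrderNr"] == t2["OrderNr"]

def p23(t2, t3):                # S2.Price < S3.Price < S2.Price + 10
    return t2["Price"] < t3["Price"] < t2["Price"] + 10

def insert(origin, t):
    results = []
    if origin == "S1":          # probe S2 first (p12), then S3 (p23)
        for t2 in W["S2"]:
            if p12(t, t2):
                results += [(t, t2, t3) for t3 in W["S3"] if p23(t2, t3)]
    elif origin == "S3":        # reordered: probe S2 first (p23), then S1 (p12)
        for t2 in W["S2"]:
            if p23(t2, t):
                results += [(t1, t2, t) for t1 in W["S1"] if p12(t1, t2)]
    else:                       # origin == "S2": probe S1 (p12), then S3 (p23)
        for t1 in W["S1"]:
            if p12(t1, t):
                results += [(t1, t, t3) for t3 in W["S3"] if p23(t, t3)]
    W[origin].append(t)         # finally store the new tuple in its own window
    return results

# Tiny usage example:
insert("S1", {"OrderNr": 7, "Price": 0})
insert("S2", {"OrderNr": 7, "Price": 100})
print(insert("S3", {"OrderNr": 0, "Price": 105}))  # -> one (t1, t2, t3) result
```

In a hardware cascade, each of these probe orders corresponds to a different physical arrangement of the join operators, which is exactly the real-time reordering problem that the circular design in the following sections avoids.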

In the first part of this work, we propose a novel circular pipeline architecture for multi-way stream joins (Circular-MJ) that uses a dedicated stage for each stream's sliding window. This inherently limits the data dependencies between stages by using the results of each stage as input for the next stage, leading to a scalable architecture that is centered around direct neighbor-to-neighbor communication. We use two fundamental steps to reshape the problem of multi-way joins into a scalable hardware architecture. First, we turn the problem of unstructured multi-way stream join designs into a join reordering problem in a structured point-to-point design. Second, we eliminate the join operator reordering problem by moving the reordering task to the tuple insertion circuitry using a pipelined distribution chain.

Multi-Way Stream Join Optimization with Stash: Depending on the streams' characteristics and the join operators' properties, materializing intermediate results² in a buffer may provide performance improvements by avoiding the recomputation of already processed tuples. However, in the introduced Circular-MJ, the processing engine for each stream is placed in a separate stage, and accessing this buffer from two independent pipeline stages imposes multiple resource-sharing challenges in a hardware realization. First, having a shared unit between two stages violates the main concept of pipeline design, which is the separation of concerns. Second, this sharing can result in race conditions between the storage and processing of two tuples at different stages, which require expensive pipeline stalls to resolve.

²Outcomes of each join operator, except the last one in a join tree, which emits the final results, are referred to as intermediate results.

In the second part of this work, we propose a custom two-stage (for three or more streams) pipeline (referred to as Stashed-MJ) that includes a stash³. The novelty of our approach is to benefit from the reduction in the number of pipeline stages in favor of a better utilization of the available processing units while avoiding the recomputation of already processed data. Furthermore, the processing unit in the first stage operates on two windows (which are also connected to the stash), but not at the same time, which also eliminates the resource-sharing challenges.

³We refer to the intermediate result buffer together with its additional control circuitry as the stash.

In this paper, we make the following contributions:
A) We propose a scalable multi-way stream join (Circular-MJ) in hardware that is built on a circular chain of dedicated stages (one per stream) and benefits from pipeline parallelism.
B) We present a novel two-stage pipeline (Stashed-MJ) that benefits from a stash (an intermediate results buffer) to accelerate processing throughput.

Fig. 2: Multi-way stream joins with circular design (join cores #1-3, each with its sliding window; TS1-TS3 denote tuples from streams S1-S3).

Fig. 3: Intermediate result generation in a pipeline stage (StageBuffer feeding a join core that performs lookup and store/expire on its window, with OBuf/EBuf).

Fig. 4: Circular-MJ architecture (join cores N, N-1, N-2 with StageBuffers and order control, isolating pipeline registers, the circular bus, the distribution chain, and the result collection circuitry).

II. MULTI-WAY STREAM JOIN

A conventional join operates on tuples originating from two sources. Naturally, we should be able to cascade join operators to support more than two sources (streams). However, avoiding arbitrary (unstructured) communication between processing components, a crucial property of hardware design, introduces the challenge of real-time join operator reordering, as demonstrated in Figure 1. Here, tuples from streams S1 and S2 keep the order of join operators intact (left), while a new tuple from stream S3 mandates operator reordering (right). Without the reordering, we would have to recompute all intermediate join results between all existing tuples in the sliding windows of S1 and S2, which is not feasible due to the size and complexity of this processing. The reordering challenge is exacerbated when dealing with a hardware system, where changes in the data path and control circuitry of a design, especially as it is scaled up (i.e., in the number of streams), have severe effects on the complexity, performance, and cost of the system.

Each new tuple insertion into the multi-way stream join updates its sliding window and subsequently may produce new intermediate results. Without materializing the intermediate results, we need to sequentially cascade the join operators and always feed each new tuple from the bottom (input port) of the cascaded architecture. Each new tuple, and subsequently its intermediate results, passes through all join operators for processing, which leads to a right-deep join tree architecture.

A. Multi-Way Join with Circular Pipeline

Designing hardware based on join reordering poses a scalability issue due to the required crossbar⁴ connection between processing units (join operators) and sliding windows. To address this issue, we need to fix the order of join operators, which hardwires each sliding window to only one processing unit and thus eliminates the need for a crossbar. To fix each join operator's location in the right-deep join tree, we propose a circular data-path design that connects all operators together, as shown in Figure 2. In this design, each join operator is connected to only one sliding window and has two entries. Each operator receives new tuples, determined based on their origin, from its right entry. The left entries are placed on the closed circular path, which carries intermediate results from one join operator to the next. Resulting tuples are emitted after a new tuple has been processed in exactly N − 1 operators, where N is the number of streams. The remaining operator is responsible for storing the new tuple in its sliding window.

⁴A type of connection that provides the possibility for every input to access all output ports.
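As a behavioral illustration of this circular probe-and-store pattern, the sketch below (ours, not the hardware design; the count-based window size, integer tuples, and single chained predicate are assumptions for readability) processes a new tuple from stream k against the other N − 1 windows in circular order, while the k-th window only stores the tuple and expires its oldest entry.

```python
# Minimal software model of the circular probe/store pattern of Section II-A:
# a tuple from stream k is joined against windows k+1, ..., k+N-1 (mod N) and
# stored in window k; count-based expiry keeps each window at WINDOW_SIZE.
from collections import deque

N = 3                      # number of streams (assumed for the sketch)
WINDOW_SIZE = 2 ** 14      # count-based window (size borrowed from Sec. IV)
windows = [deque() for _ in range(N)]

def predicate(i, a, j, b):
    """Placeholder pairwise join condition between stream i and stream j;
    tuples are modeled as plain integer keys here."""
    return a % 1000 == b % 1000

def insert(k, new_tuple):
    """Process a new tuple arriving on stream k through the circular order."""
    # The k-th 'operator' only stores the new tuple and expires the oldest one.
    windows[k].append(new_tuple)
    if len(windows[k]) > WINDOW_SIZE:
        windows[k].popleft()
    # The other N-1 operators probe their windows one after another, each
    # consuming the intermediate results produced by the previous operator.
    intermediates = [(k, (new_tuple,))]     # (last_stream, partial_result)
    for step in range(1, N):
        j = (k + step) % N
        nxt = []
        for last_stream, partial in intermediates:
            for cand in windows[j]:
                if predicate(last_stream, partial[-1], j, cand):
                    nxt.append((j, partial + (cand,)))
        intermediates = nxt
    return [partial for _, partial in intermediates]   # final N-way results
```

Circular-MJ lays this loop out in space: each iteration of the probe loop becomes a dedicated pipeline stage, and the intermediate results handed from one iteration to the next travel over the circular bus between neighboring stages.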
Using this design (Figure 2), we propose a scalable circular pipeline for multi-way stream joins in hardware (referred to as Circular-MJ), shown in Figure 4. This pipeline has the same number of stages as input streams. Each stage is placed between two isolating sets of registers and is responsible for processing a new tuple against a specific sliding window. In case the window in a stage belongs to the current tuple's origin, store and expiration tasks are performed instead of the processing.

B. Circular Pipeline Design Rationale

The key factor that has heavily influenced our design is having an independent operation on each window, assuming we already have the intermediate results from another join operator (or operators). Therefore, we design N stages, each responsible for the processing, storage, and expiration operations on a single sliding window. To handle the data transfer between stages in a scalable manner, we arrange them in a circular architecture in such a way that the intermediate results of each stage are fed to another stage as input.

The key intuition of our circular design is that instead of adapting the order of join operations to incoming tuples, we adapt the insertion location of incoming tuples using a pipelined distribution chain. In other words, instead of having an adaptive reordering architecture and a single entry for incoming tuples, we fix the order of join operators and change the entry, respectively, which hardwires each join operator (processing unit) to a separate sliding window (Figure 4).

Our circular design differs from a conventional pipeline in three ways: (1) a circular bus within the pipeline that relies exclusively on direct neighbor-to-neighbor communication; (2) arbitrary access to the pipeline through all stages, as opposed to physically fixing the entry to a single stage; and (3) arbitrary output collection from all stages, which is necessary to offload the final results of each stream's stage while sustaining high throughput.

C. Circular-MJ Architecture

Our circular pipeline consists of N stages. Each stage is surrounded by two isolating sets of pipeline registers that are connected to the other stages using the circular bus, as shown in Figure 4. This bus provides a processing path that encompasses all pipeline stages. Each stage contains three components: (1) a StageBuffer, (2) a Join Core, and (3) an Order Control Unit. To feed incoming tuples to the stages, we use a pipelined chain of registers (referred to as the distribution chain) that carries new tuples to their corresponding stage. The main purpose of the distribution chain is to keep the circular pipeline at full utilization and to maximize processing throughput.

The StageBuffer collects intermediate results from the previous stage and feeds them one-by-one to its stage's join core. The existence of this buffer is necessary to prevent deadlocks caused by bursts of new intermediate results (high match probabilities⁵ in consecutive join operators may occur due to low selectivity, which leads to the generation of many intermediate results in the deep join tree). In the deadlock scenario, all StageBuffers are full and each stage is waiting for the next stage to consume some of the intermediate results in its StageBuffer before it can continue its own processing. Looking at this scenario from the perspective of the i-th stage, the i-th stage waits for the (i+1)-th stage, which is waiting for the (i+2)-th stage, and so on; since the pipeline stages are arranged in a circular architecture, this dependency eventually reaches the i-th stage again, so the stage is effectively waiting for itself to continue, which translates to a deadlock. Using an extra entry in the StageBuffer of each stage reduces the chance of deadlocks by providing extra space to store intermediate results. Furthermore, we give priority to intermediate results over new tuples, which further reduces the chance of a deadlock: when there is a choice between processing an intermediate result or a new tuple, the stage controller picks the former. Additionally, to fully prevent deadlocks, we drop an intermediate result when the StageBuffer's fill-ratio is beyond a specified threshold. In this way, a stage can always push its produced intermediate results without waiting indefinitely for the next stage to consume them.

⁵The probability that any two consecutive tuples (each from one of the N streams) satisfy a join condition in multiple consecutive joins; as the number of consecutive joins (streams) increases, this probability decreases.

Each Join Core contains its dedicated stream's sliding window. The processing, storage, and expiry components that operate on this window can encompass different approaches: the joining algorithm can range from nested-loop, used for general joins, to hash-based, used for equi-joins, and the window storage and expiry can be count-based or time-based.

The Order Control Unit is responsible for the correct execution order. Out-of-order tuple insertion into the pipeline by the distribution chain can lead to incorrect results, namely missed matches between consecutive tuples or matches with prematurely expired tuples. This unit is involved in two tasks, store and lookup, as shown in Figure 5. It has an OBuf that stores a newly received tuple and an EBuf that stores tuples that are prematurely expired from its stage's sliding window; these two buffers have the same size as the distribution chain (the number of pipeline stages). The Order Calculator counts the number of tuples belonging to its stage's sliding window that are stored prematurely⁶, and a comparison with this count is used to avoid incorrect matches with prematurely stored tuples; to avoid missing matches with prematurely expired tuples, the lookup task performs an additional lookup over the tuples kept in the EBuf, in addition to the window, before they are dropped.

⁶Earlier than their order in correspondence with the current tuple under processing.

Fig. 5: Order control unit architecture: (a) the store path (OBuf/EBuf next to the window store/expire logic) and (b) the lookup path (order calculator, compare, and refine logic operating alongside the join core).

III. MULTI-WAY STREAM JOIN WITH STASH

We now turn our attention to another key dimension of the multi-way join architecture by introducing Stashed-MJ, an improved design that consists of two key ideas. The first is the integration of a buffer, referred to as the stash, to materialize intermediate results; this improves throughput by avoiding the recomputation of previously processed tuples. The second is improved pipelining to substantially improve resource utilization: in Circular-MJ, the pipelining is centered only around the processing, and the N stages perform the processing such that, for each stream, one stage is always occupied with the storage and expiry tasks. Therefore, there is always one unused processing unit in the circular pipeline.

Fig. 6: Stashed-MJ architecture (a two-stage pipeline over the sliding windows of S1-S3 with a stash, access controllers, a result store controller, and compute/store/expire units handling incoming tuples from S1, S2, and S3).

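The stash introduced above can be summarized by a small behavioral model (ours; the capacity, keying, and eviction policy are illustrative assumptions, and the hardware realizes the equivalent logic with dedicated control circuitry): intermediate results of the first join are materialized in a bounded buffer, and whenever a needed entry has been overwritten, the affected intermediate results are simply recomputed, as stated in the abstract.

```python
# Behavioral sketch of a best-effort stash for a 3-way join S1-S2-S3:
# stage 1 materializes S1-S2 intermediate results keyed by the S2 tuple that
# produced them; stage 2 reuses them when present and recomputes otherwise.
# Capacity, keying, and eviction below are illustrative assumptions; tuples
# are assumed hashable (e.g., integer keys).

STASH_CAPACITY = 1024

class Stash:
    def __init__(self):
        self.slots = {}                   # key -> list of intermediate results

    def put(self, key, results):
        if len(self.slots) >= STASH_CAPACITY and key not in self.slots:
            evicted = next(iter(self.slots))   # overwrite some existing entry,
            del self.slots[evicted]            # which invalidates it
        self.slots[key] = results

    def get(self, key):
        return self.slots.get(key)        # None means overwritten / never stored

def probe_s3(t3, w1, w2, stash, p12, p23):
    """Second-stage processing of a new S3 tuple, using the stash if possible."""
    out = []
    for t2 in w2:
        if not p23(t2, t3):
            continue
        partials = stash.get(("S2", t2))
        if partials is None:
            # Overwrite detected (or never materialized): recompute S1-S2 results.
            partials = [(t1, t2) for t1 in w1 if p12(t1, t2)]
            stash.put(("S2", t2), partials)
        out += [(a, b, t3) for (a, b) in partials]
    return out
    # NOTE: in the real design, updates to the S1 window must also invalidate
    # (overwrite) affected entries; this model only shows the capacity path.
```

The buffering is best-effort in the sense that correctness never depends on the stash; a missing or overwritten entry only costs a recomputation, never a wrong result.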
Fig. 7: Stream count effect on input throughput (w = 2^14, stream (stage) count: 3, 4, 5, 6, 7, 8). Panels: (a) mp = 0.0001%, (b) mp = 0.001%, (c) mp = 0.01%; y-axis: throughput (10^3 tuples/s), x-axis: time (ms).
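As a rough aid for reading Figure 7 (a back-of-the-envelope estimate of ours that assumes independent, uniformly distributed matches; it is not an analysis from the paper), the expected number of matches found when probing one window of w = 2^14 tuples is mp * w, and a 3-way join therefore produces on the order of (mp * w)^2 final results per inserted tuple:

```python
# Back-of-the-envelope: expected intermediate/final results per inserted tuple
# for the Figure 7 settings (w = 2^14; mp in {0.0001%, 0.001%, 0.01%}),
# assuming independent matches; an illustration, not the paper's analysis.
w = 2 ** 14
for mp_percent in (0.0001, 0.001, 0.01):
    mp = mp_percent / 100.0
    per_probe = mp * w                 # expected matches when probing one window
    per_tuple_3way = per_probe ** 2    # expected final results for a 3-way join
    print(f"mp = {mp_percent}%  ->  {per_probe:.3f} matches/probe, "
          f"~{per_tuple_3way:.3f} 3-way results per new tuple")
```

At mp = 0.01% this amounts to a few results per probe and per tuple, which is consistent with the observation in Section IV that later stages then spend their time processing intermediate results instead of accepting new tuples.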

Integration of the stash into Circular-MJ (Figure 4) imposes an important design challenge, because we are now forced to share the buffer among more than one processing unit placed in two or more separate pipeline stages. This raises two concerns. First, having a shared block between two pipeline stages that could be accessed frequently and in parallel violates the main concept of the circular pipeline design, which is based on the separation of concerns and a strictly one-way neighbor-to-neighbor communication. Second, this sharing can result in race conditions between the concurrent processing of a new tuple and the storing of the matched results of an earlier tuple, which now demands expensive pipeline stalls for coordination.

In Circular-MJ, we utilize three pipeline stages for a 3-way stream join operation. For every tuple insertion, two of these stages process the tuple against the sliding windows of the other streams, while the remaining stage is responsible for storing the tuple in its stream's sliding window and expiring the oldest tuple from it. Two key insights guide our new Stashed-MJ design: (1) storage and expiration operations are relatively less costly compared to processing (especially in a nested-loop stream join), and (2) storage and expiration are performed on a separate sliding window than the windows that are used for processing. By exploiting these insights, we reduce the number of pipeline stages to two by performing storage and expiration in parallel with the processing in the first pipeline stage. As a result, the processing unit in the first stage has to operate on two sliding windows, depending on the newly received tuple's origin, but not at the same time. This provides us with the opportunity to offload the processing of the two streams that are involved in updating the stash onto this stage, which eliminates the sharing challenge, as shown in Figure 6.

IV. EXPERIMENTAL RESULTS

We have realized our multi-way stream join pipeline (Circular-MJ) and the optimized pipeline with stash (Stashed-MJ) in VHDL. For the evaluations, we used the Questa Advanced Simulator to extract cycle-accurate measurements (with a clock frequency of 100 MHz), guaranteeing the same performance for the actual hardware. We employed input streams consisting of 64-bit tuples (32-bit key and 32-bit value) that are joined against the other streams' sliding windows.

To measure the effect of the stream count on input throughput, we use join conditions with high selectivity (low match probability) to approach the maximum sustainable throughput, presented in Figure 7a. As expected from pipelining parallelism, an increase in the number of pipeline stages linearly improves the processing throughput. This shows the effectiveness of the distribution chain, since it has been able to keep the pipeline stages busy, which is the key factor in pipelining parallelism. On the left side of this figure, we first observe a warm-up phase in which we have a super-linear reduction in the input throughput as the sliding windows get filled. The vertical dashed lines specify the end of the warm-up phase for pipelines with different numbers of streams.

Figures 7b and 7c present the input throughput for lower selectivities. We observe only a small reduction in the input throughput in Figure 7b, while the measurements for a match probability (mp) of 0.01%, Figure 7c, show much lower and also sporadic throughput readings. Increasing the match probability leads to more intermediate results in each pipeline stage, which forces the next stages to process them instead of receiving new tuples. In a smaller pipeline, intermediate results require fewer stages to construct the final results, which is the reason for the smaller throughput drop for pipelines with a lower number of streams in Figure 7c.

V. CONCLUSIONS

We have focused on the hardware acceleration of multi-way stream joins. In particular, we presented Circular-MJ, a novel circular pipeline architecture for realizing multi-way stream joins that eliminates the need to re-arrange the order of join operators and avoids arbitrary point-to-point communication among custom join cores. We further expanded our multi-way join design, referred to as Stashed-MJ, to efficiently cache intermediate results in order to avoid recomputing already processed data.

REFERENCES
[1] C. Koch, "Incremental query evaluation in a ring of databases," PODS, 2010.
[2] J. Kang, J. Naughton, and S. Viglas, "Evaluating window joins over unbounded streams," ICDE, 2003.
[3] B. Gedik, R. R. Bordawekar, and P. S. Yu, "CellJoin: A parallel stream join operator for the cell processor," VLDBJ, 2009.
[4] J. Teubner and R. Mueller, "How soccer players would do stream joins," SIGMOD, 2011.
[5] C. Koch, Y. Ahmad, O. Kennedy, M. Nikolic, A. Nötzli, D. Lupei, and A. Shaikhha, "DBToaster: Higher-order delta processing for dynamic, frequently fresh views," VLDBJ, 2014.
[6] V. Gulisano, Y. Nikolakopoulos, M. Papatriantafilou, and P. Tsigas, "ScaleJoin: A deterministic, disjoint-parallel and skew-resilient stream join," Big Data, 2015.
[7] Q. Lin, B. C. Ooi, Z. Wang, and C. Yu, "Scalable distributed stream join processing," SIGMOD, 2015.
[8] M. Najafi, M. Sadoghi, and H.-A. Jacobsen, "SplitJoin: A scalable, low-latency stream join architecture with adjustable ordering precision," USENIX ATC, 2016.
[9] Transaction Processing Performance Council, "TPC-H benchmark specification," http://www.tpc.org/hspec.html, 2008.
[10] M. Sadoghi, R. Javed, N. Tarafdar, H. Singh, R. Palaniappan, and H.-A. Jacobsen, "Multi-query stream processing on FPGAs," ICDE, 2012.
[11] M. Najafi, K. Zhang, M. Sadoghi, and H.-A. Jacobsen, "Hardware acceleration landscape for distributed real-time analytics: Virtues and limitations," ICDCS, 2017.
[12] Y. Oge, T. Miyoshi, H. Kawashima, and T. Yoshinaga, "Design and implementation of a handshake join architecture on FPGA," IEICE, 2012.
[13] M. Najafi, M. Sadoghi, and H.-A. Jacobsen, "Configurable hardware-based streaming architecture using online programmable-blocks," ICDE, 2015.
[14] C. Kritikakis, G. Chrysos, A. Dollas, and D. N. Pnevmatikatos, "An FPGA-based high-throughput stream join architecture," FPL, 2016.