Raftlib: a C++ Template Library for High Performance Stream Parallel Processing

RaftLib: A C++ Template Library for High Performance Stream Parallel Processing Jonathan C. Beard Peng Li Roger D. Chamberlain Jonathan C. Beard, Peng Li and Roger D. Chamberlain. “RaftLib: A C++ Template Library for High Performance Stream Parallel Processing,” in Proc. of 6th Int’l Workshop on Programming Models and Applications for Multicores and Manycores (PMAM), February 2015, pp. 96-105. Dept. of Computer Science and Engineering Washington University in St. Louis RaftLib: A C++ Template Library for High Performance Stream Parallel Processing Jonathan C. Beard, Peng Li and Roger D. Chamberlain Dept. of Computer Science and Engineering Washington University in St. Louis {jbeard,pengli,roger}@wustl.edu ABSTRACT Stream processing or data-flow programming is a compute source paradigm that has been around for decades in many forms yet has failed garner the same attention as other mainstream sum print languages and libraries (e.g., C++ or OpenMP [15]). Stream processing has great promise: the ability to safely exploit extreme levels of parallelism. There have been many im- source plementations, both libraries and full languages. The full languages implicitly assume that the streaming paradigm cannot be fully exploited in legacy languages, while library approaches are often preferred for being integrable with the Figure 1: Simple streaming application example vast expanse of legacy code that exists in the wild. Li- with four compute kernels of three distinct types. braries, however are often criticized for yielding to the shape From left to right: the two source kernels each pro- of their respective languages. RaftLib aims to fully exploit vide a number stream, the “sum” kernel adds pairs of the stream processing paradigm, enabling a full spectrum of numbers and the last kernel prints the result. Each streaming graph optimizations while providing a platform kernel acts independently, sharing data via commu- for the exploration of integrability with legacy C/C++ code. nications streams depicted as arrows. RaftLib is built as a C++ template library, enabling end users to utilize the robust C++ standard library along with Stream processing is a compute paradigm that views an RaftLib’s pipeline parallel framework. RaftLib supports dy- application as a set of compute kernels (also sometimes termed namic queue optimization, automatic parallelization, and “filters” [49]) connected by communication links that deliver real-time low overhead performance monitoring. data streams. Each compute kernel is typically a sequentially executing unit. Each stream is a first-in, first-out Categories and Subject Descriptors (FIFO) queue whose exact allocation and construction is de- H.3.4 [Systems and Software]: [distributed systems]; D.1.3 pendent upon the link type. Figure 1 is an example of a sim- sum [Concurrent Programming]: [distributed programming, ple streaming application, which takes in two streams parallel programming] of numbers, adds each pair, and then writes the result to an outbound data stream. A salient feature of streaming processing is the compart- 1. INTRODUCTION & BACKGROUND mentalization of state within each compute kernel [1], which Decries touting the end of frequency scaling and the in- simplifies parallelization logic for the run-time [19] as well evitability of our multi-core future are frequently found in as the programming API (compared to standard paralleliza- current literature [20]. Equally prescient are the numerous tion methods [2]). Therefore, stream processing has two papers with potential solutions to programming multi-core immediate advantages: 1) it enables a programmer to think architectures [6, 33, 41, 52]. One of the more promising sequentially about individual pieces of a program while com- programming modalities to date is a very old one: stream posing a larger program that can be executed in parallel, processing [18, 36] (the term “stream processing” is also used 2) a streaming run-time can reason about each kernel in- by some to refer to online data processing [13, 21]). dividually while optimizing globally [38]. Moreover, stream processing has the fortunate side effect of encouraging de- Permission to make digital or hard copies of all or part of this work for personal or velopers to compartmentalize and separate programs into classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full cita- logical partitions. tion on the first page. Copyrights for components of this work owned by others than In addition to simpler logic, stream processing also en- ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or re- ables easier heterogeneous and distributed computation. A publish, to post on servers or to redistribute to lists, requires prior specific permission compute kernel could have individual implementations that and/or a fee. Request permissions from [email protected]. target an FPGA and a multi-core running within the same PMAM ’15, February 7-8, 2015, San Francisco Bay Area, USA application, known as “heterogeneous computation” or “hy- Copyright is held by the owner/author(s). Publication rights licensed to ACM. ACM 978-1-4503-3404-4/15/02 ..$15.00. brid computing” [14]. As long as the inputs and outputs http://dx.doi.org/10.1145/2712386.2712400. are matching, the application will run identically regard- 96 less of which resource a kernel is scheduled to execute on. systems). To the best of our knowledge, RaftLib is the first Brook [12], Auto-Pipe [22], GPU-Chariot [23], and Scala- C++ library directly targeting streaming processing. Pipe [51] are examples of such systems. Stream processing There has been considerable work investigating the ef- also naturally lends itself to distributed (network) process- ficient execution of streaming applications, both on tradi- ing, where network links simply become part of the stream. tional multi-cores and on heterogeneous compute platforms, Despite the promising features of stream processing, there from the early work on dedicated data flow engines [19], are hurdles that affect programmers’ decision to use the to the synchronous data flow model [32]. Thies et al. [48] paradigm. One hurdle to adoption is communication cost. describe a number of performance impacting factors for ap- For the example of Figure 1, the overhead of moving data el- plications described using StreamIt, and Optimus [26] re- ements between compute kernels could easily overwhelm the alizes StreamIt applications on FPGA platforms. Streams- benefits of parallel computation. Clearly, a courser-grained C [24] is an earlier streaming language that targets FPGAs streaming decomposition of an application is more realistic for execution. Lancaster et al. [29, 30] have built low-impact (we will continue to use this example, however, for illustra- performance monitors for streaming computations, and Pad- tion purposes). A second hurdle is simply the bulk of legacy manabhan et al. [38, 39, 40] have shown how to efficiently code and the requirement on the part of most streaming search the design space given a model that connects tun- frameworks that applications be re-authored or substantially ing parameters to application performance. RaftLib intends modified to conform [37]. to leverage the above work as it seeks to efficiently execute This paper introduces RaftLib, a C++ template and frame- streaming applications. work enabling safe and fast stream processing. By leveraging the power of C++ templates, RaftLib can be incorporated with a few function calls and the linking of one additional li- 3. DESIGN CONSIDERATIONS brary. No compiler other than a C++ compiler is needed, so Several properties of streaming applications that must be it is completely self contained. By using the native C++ en- exploited or overcome by streaming systems have been noted vironment, the C++ standard library functionality is avail- by others. The stream access pattern is often that of a able. As to communication between nodes, RaftLib seam- sliding window [48], which should be accommodated effi- lessly integrates TCP/IP networks, and the parallelized ex- ciently. RaftLib accommodates this through a peek_range ecution on multiple distributed compute nodes is transpar- function. Streaming systems, both long running and oth- ent to the programmer. RaftLib reduces the communication erwise often must deal with behavior that differs from the cost in a multitude of ways, drawing on research from past steady state [8, 34, 48]. Non-steady state behavior is of- works [8, 9, 29, 30] and enabling avenues of research that ten also observed with data-dependent behavior, resulting would otherwise be irrelevant in conventional programming in very dynamic I/O rates (behavior also observed in [48]). models. This dynamic behavior, either at startup or elsewhere during execution, makes the analysis and optimization of streaming systems a slightly more difficult prospect, however it is not 2. RELATED WORK insurmountable and we will demonstrate empirically how There are many streaming languages and run-time sys- RaftLib handles dynamic rates through a text search appli- tems designed to assist the development of streaming ap- cation. For example, text search has the property that while plications. Brook [12] is a language designed for streaming the input volume is often fixed, the downstream data vol- processing on GPUs. StreamIt [49] is a streaming language ume varies dramatically with algorithms, which heuristically and compiler based on the synchronous dataflow model. skip, as does the output (pattern matches). Kernel devel- Storm [46] and Samza [44] are open-source streaming plat- opers, as should be the case, focus on producing the most forms that are focused on message-based data processing. efficient algorithm possible for a given kernel. Despite this, Those languages and frameworks have their “niche” applica- kernels can become bottlenecks within the streaming system. tions. They are not designed for general-purpose program- Raft dynamically monitors the system to eliminate the bot- ming and usually have steep learning curves. RaftLib’s ad- tlenecks where possible.

Raftlib: a C++ Template Library for High Performance Stream Parallel Processing

High-Level and Efficient Stream Parallelism on Multi-Core Systems

Queue Streaming Model Theory, Algorithms, and Implementation

Pash: Light-Touch Data-Parallel Shell Processing

A Personal Analytics Platform for the Internet of Things Implementing Kappa Architecture with Microservice-Based Stream Processing

Paving the Way Towards High-Level Parallel Pattern Interfaces for Data Stream Processing

A Complete Bibliography of Publications in the International

Packet-Level Analytics in Software Without Compromises

Balancing Performance and Reliability in Software Components

User-Defined Execution Relaxations for Enhanced Programmability in High-Performance Parallel Computing

Online Modeling and Tuning of Parallel Stream Processing Systems Jonathan Curtis Beard Washington University in St

Intrinsic Compilation Model to Enhance Performance of Real Time

Efficiently and Transparently Maintaining High SIMD Occupancy in the Presence of Wavefront Irregularity Stephen V