Fine-Grained Window-Based Stream Processing on CPU-GPU Integrated

FineStream: Fine-Grained Window-Based Stream Processing on CPU-GPU Integrated Architectures Feng Zhang and Lin Yang, Renmin University of China; Shuhao Zhang, Technische Universität Berlin and National University of Singapore; Bingsheng He, National University of Singapore; Wei Lu and Xiaoyong Du, Renmin University of China https://www.usenix.org/conference/atc20/presentation/zhang-feng This paper is included in the Proceedings of the 2020 USENIX Annual Technical Conference. July 15–17, 2020 978-1-939133-14-4 Open access to the Proceedings of the 2020 USENIX Annual Technical Conference is sponsored by USENIX. FineStream: Fine-Grained Window-Based Stream Processing on CPU-GPU Integrated Architectures Feng Zhang1, Lin Yang1, Shuhao Zhang2;3, Bingsheng He3, Wei Lu1, Xiaoyong Du1 1Key Laboratory of Data Engineering and Knowledge Engineering (MOE), and School of Information, Renmin University of China 2DIMA, Technische Universität Berlin 3School of Computing, National University of Singapore [email protected], [email protected], [email protected], [email protected], [email protected], [email protected] Abstract GPU memory via PCI-e before GPU processing, but the low Accelerating SQL queries on stream processing by utilizing bandwidth of PCI-e limits the performance of stream process- heterogeneous coprocessors, such as GPUs, has shown to be ing on GPUs. Hence, stream processing on GPUs needs to be an effective approach. Most works show that heterogeneous carefully designed to hide the PCI-e overhead. For example, coprocessors bring significant performance improvement be- prior works have explored pipelining the computation and cause of their high parallelism and computation capacity. communication to hide the PCI-e transmission cost [34, 54]. However, the discrete memory architectures with relatively Despite of various studies in previous stream processing low PCI-e bandwidth and high latency have dragged down engines on general-purpose applications [5, 23, 25, 33, 54, 62], the benefits of heterogeneous coprocessors. Recently, hard- relatively few studies focus on SQL-based relational stream ware vendors propose CPU-GPU integrated architectures that processing. Supporting relational stream processing involves integrate CPU and GPU on the same chip. This integration additional complexities, such as how to support window- provides new opportunities for fine-grained cooperation be- based query semantics and how to utilize the parallelism tween CPU and GPU for optimizing SQL queries on stream with a small window or slide size efficiently. Existing engines, processing. In this paper, we propose a data stream system, such as Spark Streaming [57], struggle to support small win- called FineStream, for efficient window-based stream pro- dow and slide sizes, while the state-of-the-art window-based cessing on integrated architectures. Particularly, FineStream query engine, Saber [34], adopts a bulk-synchronous parallel performs fine-grained workload scheduling between CPU model [66] for hiding PCI-e transmission overhead. and GPU to take advantage of both architectures, and it also In recent years, hardware vendors have released integrated provides efficient mechanism for handling dynamic stream architectures, which completely remove PCI-e overhead. We queries. Our experimental results show that 1) on integrated ar- have seen CPU-GPU integrated architectures such as NVIDIA chitectures, FineStream achieves an average 52% throughput Denver [13], AMD Kaveri [15], and Intel Skylake [27]. They improvement and 36% lower latency over the state-of-the-art fuse CPUs and GPUs on the same chip, and let both CPUs stream processing engine; 2) compared to the stream process- and GPUs share the same memory, thus avoiding the PCI-e ing engine on the discrete architecture, FineStream on the data transmission. Such integration poses new opportunities integrated architecture achieves 10.4x price-throughput ratio, for window-based streaming SQL queries from both hardware 1.8x energy efficiency, and can enjoy lower latency benefits. and software perspectives. First, different from the separate memory hierarchy of dis- 1 Introduction crete CPU-GPU architectures, the integrated architectures provide unified physical memory. The input stream data can Optimizing the performance of stream processing systems be processed in the same memory for both CPUs and GPUs, has been a hot research topic due to the rigid requirement which eliminates the data transmission between two memory on the event processing latency and throughput. Stream pro- hierarchies, thus eliminating the data copy via PCI-e. cessing on GPUs has been shown to be an effective method Second, the integrated architecture makes it possible for to improve its performance [23,33,34,41,54,62,67]. GPUs processing dynamic relational workloads via fine-grained co- consist of a large amount of lightweight computing cores, operations between CPUs and GPUs. A streaming query can which are naturally suitable for data-parallel stream process- consist of multiple operators with varying performance fea- ing. GPUs are often used as coprocessors that are connected tures on different processors. Furthermore, stream processing to CPUs through PCI-e [42]. Under such discrete architec- often involves dynamic input workload, which affects opera- tures, stream data need to be copied from the main memory to tor performance behaviors as well. We can place operators on USENIX Association 2020 USENIX Annual Technical Conference 633 different devices with proper workloads in a fine-grained man- caches. Some models of integrated architectures, such as Intel ner, without worrying about transmission overhead between Haswell i7-4770R processor [3], integrate a shared last level CPUs and GPUs. cache for both CPUs and GPUs. The shared memory man- Based on the above analysis, we argue that stream process- agement unit is responsible for scheduling accesses to system ing on integrated architectures can have much more desir- DRAM by different devices. Compared to the discrete CPU- able properties than that on discrete CPU-GPU architectures. GPU architecture, both CPUs and GPUs are integrated on the To fully exploit the benefits of integrated architectures for same chip. The most attractive feature of such integration is stream processing, we propose a fine-grained stream process- the shared main memory which is available to both devices. ing framework, called FineStream. Specifically, we propose With the shared main memory, CPUs and GPUs can have the following key techniques. First, a performance model is more opportunities for fine-grained cooperation. The most proposed considering both operator topologies and different commonly used programming model for integrated architec- architecture characteristics of integrated architectures. Sec- tures is OpenCL [49], which regards the CPU and the GPU as ond, a light-weight scheduler is developed to efficiently assign devices. Each device consists of several compute units (CUs), the operators of a query plan to different processors. Third, which are the CPU and GPU cores in Figure1. online profiling with computing resource and topology adjust- ment are involved for dynamic workloads. CPU GPU We evaluate FineStream on two platforms, AMD A10- CPU CPU … CPU GPU GPU … GPU 7850K, and Ryzen 5 2400G. Experiments show that core core core core core core FineStream achieves 52% throughput improvement and 36% CPU Cache GPU Cache lower latency over the state-of-the-art CPU-GPU stream pro- Shared Memory Management Unit cessing engine on the integrated architecture. Compared to the best single processor throughput, it achieves 88% performance improvement. System DRAM We also compare stream processing on integrated architectures with that on discrete CPU-GPU architectures. Our Figure 1: A general overview of the integrated architecture. evaluation shows that FineStream on integrated architectures achieves 10.4x price-throughput ratio, and 1.8x energy ef- We show a comparison between the integrated and discrete ficiency. Under certain circumstances, it is able to achieve architectures (discrete GPUs) in Table1. These architectures lower processing latency, compared to the state-of-the-art ex- are used in our evaluation (Section7). Although the integrated ecution on discrete architectures. This further validates the architectures have lower computation capacity than the dis- large potential of exploring the integrated architectures for crete architectures currently, the integrated architecture is a data stream processing. potential trend for a future generation of processors. Hardware Overall, we make the following contributions: vendors, including AMD [15], Intel [27] and NVIDIA [13], • We propose the first fine-grained window-based rela- all release their integrated architectures. Moreover, future inte- tional stream processing framework that takes the advan- grated architectures can be much more powerful, even can be a tages of the special features of integrated architectures. competitive building block for exascale HPC systems [47,55], and the insights and methods in this paper still can be applied. • We develop lightweight query plan adaptations for han- Besides, the integrated architectures are attractive due to their dling dynamic workloads with the performance model efficient power consumption [15, 60] and low price [31]. that considers both the operator and architecture characteristics. Table 1: Integrated architectures vs. discrete architectures. Integrated Architectures Discrete Architectures •

Fine-Grained Window-Based Stream Processing on CPU-GPU Integrated

Parallel Patterns for Adaptive Data Stream Processing

A Middleware for Efficient Stream Processing in CUDA

AMD Accelerated Parallel Processing Opencl Programming Guide

Lightsaber: Efficient Window Aggregation on Multi-Core Processors

Parallel Stream Processing with MPI for Video Analytics and Data Visualization

Design and Implementation of an FPGA-Based Scalable Pipelined

SABER: Window-Based Hybrid Stream Processing for Heterogeneous Architectures

Analyzing Efficient Stream Processing on Modern Hardware

Streambox: Modern Stream Processing on a Multicore Machine

A Media Enhanced Vector Architecture for Embedded Memory Systems

Fine-Grained Real-Time Stream Processing with Cameo

A Survey on Coarse-Grained Reconfigurable Architectures from a Performance Perspective