A Many-core Architecture for In-Memory Data Processing
Sandeep R Agrawal, Oracle Labs ([email protected])
Sam Idicula, Oracle Labs ([email protected])
Arun Raghavan, Oracle Labs ([email protected])
Evangelos Vlachos, Oracle Labs ([email protected])
Venkatraman Govindaraju, Oracle Labs ([email protected])
Venkatanathan Varadarajan, Oracle Labs (venkatanathan.varadarajan@oracle.com)
Cagri Balkesen, Oracle Labs ([email protected])
Georgios Giannikis, Oracle Labs ([email protected])
Charlie Roth, Oracle Labs ([email protected])
Nipun Agarwal, Oracle Labs ([email protected])
Eric Sedlar, Oracle Labs ([email protected])

ABSTRACT
For many years, the highest energy cost in processing has been data movement rather than computation, and energy is the limiting factor in processor design [21]. As the data needed for a single application grows to exabytes [56], there is clearly an opportunity to design a bandwidth-optimized architecture for big data computation by specializing hardware for data movement. We present the Data Processing Unit or DPU, a shared memory many-core that is specifically designed for high bandwidth analytics workloads. The DPU contains a unique Data Movement System (DMS), which provides hardware acceleration for data movement and partitioning operations at the memory controller that is sufficient to keep up with DDR bandwidth. The DPU also provides acceleration for core to core communication via a unique hardware RPC mechanism called the Atomic Transaction Engine. Comparison of a DPU chip fabricated in 40nm with a Xeon processor on a variety of data processing applications shows a 3×–15× performance per watt advantage.

CCS CONCEPTS
• Computer systems organization → Multicore architectures; Special purpose systems;

KEYWORDS
Accelerator; Big data; Microarchitecture; Databases; DPU; Low power; Analytics Processor; In-Memory Data Processing; Data Movement System

ACM Reference format:
Sandeep R Agrawal, Sam Idicula, Arun Raghavan, Evangelos Vlachos, Venkatraman Govindaraju, Venkatanathan Varadarajan, Cagri Balkesen, Georgios Giannikis, Charlie Roth, Nipun Agarwal, and Eric Sedlar. 2017. A Many-core Architecture for In-Memory Data Processing. In Proceedings of MICRO-50, Cambridge, MA, USA, October 14–18, 2017, 14 pages. https://doi.org/10.1145/3123939.3123985

1 INTRODUCTION
A large number of data analytics applications in areas varying from business intelligence, health sciences, and real-time log and telemetry analysis already benefit from working sets that span several hundreds of gigabytes, and sometimes several terabytes, of data [9]. To cater to these applications, in recent years, data stores such as key-value stores, columnar databases, and NoSQL databases have moved from traditional disk-based storage to main-memory (DRAM) resident solutions [1, 9, 40]. However, today's commodity hardware solutions, which serve such applications, employ powerful servers with relatively sparse memory bandwidth: a typical Xeon-based, 2U-sized chassis hosts 8 memory channels (4 channels per socket).

For applications which scan, join, and summarize large volumes of data, this hardware limit on memory bandwidth translates into a fundamental performance bottleneck [8, 35, 49, 62]. Furthermore, several features which contribute to server power, such as paging, large last-level caches, sophisticated branch predictors, and double-precision floating point units, have previously been found to be unutilized by applications which rather rely on complex analytics queries and fixed-point arithmetic [2, 9].
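To make the constraint concrete, a back-of-envelope estimate (our illustration, using the ≈10 GBps practical per-channel figure this paper adopts in Section 2) bounds the aggregate scan rate of such a chassis:

    \[
      B_{\text{chassis}} \approx 8 \times 10\,\mathrm{GBps} = 80\,\mathrm{GBps},
      \qquad
      t_{\text{scan}}(1\,\mathrm{TB}) \gtrsim \frac{1000\,\mathrm{GB}}{80\,\mathrm{GBps}} \approx 12.5\,\mathrm{s},
    \]

far from interactive latencies even for a modest terabyte-scale working set.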
The work presented in this paper captures a subset of a larger project which explores the question: How can we perform analytics on terabytes of data in sub-second latencies within a rack's provisioned power budget? In particular, this paper focuses on optimizing the memory bandwidth per watt on a single, programmable data processing unit (DPU). This scalable unit then allows packing up to ten times as many memory channels in a rack-able chassis as compared to a commodity server organization.

To identify on-chip features essential for efficient execution, we analyzed the performance of complex analytics on large volumes of data, such as traditional TPC-H query benchmarks, and also more contemporary text and image processing operations applied when ingesting and querying data from memory. This analysis yielded three observations. Firstly, a large portion of the power in present systems is spent in bringing data closer to the cores via large cache hierarchies [34]. Secondly, these queries need to be broken down into simple streaming primitives, which can then be efficiently parallelized and executed [48] (one such primitive is sketched below). Thirdly, the variety of operations performed by these queries raises the need for the cores to be easily programmable for efficiency.
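As a concrete illustration of one such streaming primitive (our generic sketch, not code from the DPU software stack), a selection over a column of keys reduces to a single sequential pass: with no data-dependent pointer chasing, its throughput is bounded by memory bandwidth rather than compute, and it parallelizes trivially by splitting the input range across cores.

    #include <stddef.h>
    #include <stdint.h>

    /* Streaming selection: collect the row indices whose 32-bit key is
     * below a threshold. out_rows must have capacity for n entries. */
    size_t filter_lt(const uint32_t *keys, size_t n,
                     uint32_t threshold, uint32_t *out_rows)
    {
        size_t hits = 0;
        for (size_t i = 0; i < n; i++) {
            out_rows[hits] = (uint32_t)i;   /* speculative append */
            hits += (keys[i] < threshold);  /* advance only on a match */
        }
        return hits;
    }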
We examined alternate commodity low-power chip architectures for such applications. GPUs have been increasingly popular in data centers for their significant compute capabilities and memory bandwidth [3, 4, 29, 30]. However, their SIMT programming model is intolerant to control flow divergence, which occurs in applications such as parsing, and their dependence on high-bandwidth GDDR memory to sustain the large number of on-die cores severely constrains their memory capacity.

We also considered FPGA-based hardware subroutines to offload functions, reviewed contemporary work on fixed-function ASIC units for filtering, projecting, partitioning data, and walking hash tables [36, 41, 61, 62], and experimented with large, networked clusters of wimpy general-purpose cores similar to prior research [5, 38]. We found that while ASIC units could significantly reduce power over off-the-shelf low-power cores by optimizing data movement, very specific fixed-function units separated from the instruction processing pipeline impeded application development. We therefore opted to engineer a customized data processing unit which integrates several programmable, low-power dpCores with a specialized, yet programmable data movement system (DMS) in hardware. Each dpCore contains several kilobytes of scratchpad SRAM memory (called DMEM) in lieu of traditional hardware-managed caches. Software explicitly schedules data transfers via the DMS to each dpCore's DMEM through the use of specialized instructions called descriptors. Descriptors enable individual dpCores to schedule extremely efficient operations such as filter and projection, hash/range partitioning, and scatter-gather operations at close to wire-speed. Each DPU also consists of a dual-core ARM A9 Osprey macro and an ARM M0 core to host network drivers and system software. A lightweight runtime schedules application code without pre-emption on the dpCores and overlaps data movement via the DMS. We also abstract inter-dpCore communication and synchronization routines over the ATE to allow porting of common parallel programming paradigms such as threads, task queues, and independent loops. These runtime hardware abstractions enable large-scale, efficient, in-memory execution of heavyweight applications including SQL processing offloaded from a commercial database on our prototype system.

The following sections describe our experiences with designing, fabricating, and programming the DPU chip. In particular, we highlight the following contributions:
• The low-power hardware architecture of the DPU (Section 2)
• The microarchitecture and software interface of our novel data movement system (Section 3)
• The ISA extensions and programming environment which allow relatively straightforward software development on the DPU (Section 4)
• Example parallel applications which achieve 3×–15× improvement in performance per watt over a commodity Xeon socket when optimized for the DPU hardware (Section 5)

2 DPU ARCHITECTURE
The primary goal of the Data Processing Unit (DPU) is to optimize for analytic workloads [22, 47, 58] characterized by (i) large data set sizes [63] (on the order of tens of TB), (ii) data-parallel computation, and (iii) complex data access patterns that can be made memory-bandwidth-bound using software techniques. A typical analytic workflow involves partitioning the working set into several data chunks that are independently analyzed, followed by a final result aggregation stage [47]. Hence an ideal architecture for such a memory-bound workload would aim to compute at memory bandwidth.

Such computation at memory bandwidth requires keeping the workload memory-resident, with DRAM memory channels feeding data into the processing units. With a practical memory channel bandwidth of 10 GBps (DDR3-1600), to scan a nominal workload size of ≈10 TB in under a second, we require 1000 channels per rack. This results in 3 kW (at 3 W per channel) budgeted for main memory, leaving
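For reference, the arithmetic behind these channel and power figures, restated as a sanity check:

    \[
      N_{\text{channels}} = \frac{10\,\mathrm{TB}}{10\,\mathrm{GBps} \times 1\,\mathrm{s}} = 1000,
      \qquad
      P_{\mathrm{DRAM}} = 1000 \times 3\,\mathrm{W} = 3\,\mathrm{kW}.
    \]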