A Many-Core Architecture for In-Memory Data Processing
ABSTRACT

We live in an information age, with data and analytics guiding a large portion of our daily decisions. Data is being generated at a tremendous pace from connected cars, connected homes and connected workplaces, and extracting useful knowledge from this data is quickly becoming an impractical task. Single-threaded performance has saturated over the last decade, and there is a growing need for custom solutions that keep pace with these workloads in a scalable and efficient manner.

A large portion of the power consumed by analytics workloads goes into bringing data to the processing cores, and we aim to optimize that. We present the Database Processing Unit, or DPU, a shared-memory many-core that is specifically designed for in-memory analytics workloads. The DPU contains a unique Data Movement System (DMS), which provides hardware acceleration for data movement and preprocessing operations. The DPU also accelerates core-to-core communication via a unique hardware RPC mechanism called the Atomic Transaction Engine, or ATE. A comparison of a fabricated DPU chip with a variety of state-of-the-art x86 applications shows a performance/Watt advantage of 3× to 16×.

1. INTRODUCTION

The end of Dennard scaling and dark silicon [9] have saturated the performance of x86 processors, with most new generations devoting larger areas of the die to the integrated GPU [15]. With more companies gathering data for use in analytics and machine learning, the volume and velocity of data are rapidly rising, and massive computation and memory bandwidth are needed to process and analyze this data in a scalable and efficient manner. The slowdown in CPU scaling means that the only alternative for traditional server farms is to scale out, which is expensive in terms of area and power due to additional uncore overheads.

Custom architectures have traditionally faced resistance in datacenters due to their high programming and bootstrap costs. However, with Moore's law slowing down, most big vendors are considering application-specific hardware as a way to get higher performance at lower power budgets. Intel recently announced their foray into a hybrid FPGA solution [11], Microsoft is exploring FPGAs in their datacenters [21], Google is building custom accelerators for machine learning [16], and all major cloud vendors offer GPUs as part of their offerings.

GPUs are quickly becoming an attractive choice in datacenters for their compute capabilities and corresponding power efficiency [2][1][14][13]. However, they are hard to program due to their SIMT programming model and intolerance to control flow divergence, and their dependence on high-bandwidth graphics memory to sustain the large number of on-die cores severely constrains their memory capacity. A single GPU with 300+ GB/s of memory bandwidth still sits on a PCIe 3.0 link, which restricts its data load and data movement capabilities; these are essential for high-ingest streaming workloads as well as SQL queries involving large-to-large joins.

We analyzed the performance of complex analytics queries on large data structures, and identified several key areas where efficiency can be improved. First, most analytics queries involve many joins and group-bys. Second, these queries need to be broken down into simple streaming primitives which can be efficiently partitioned amongst cores (a sketch of such a decomposition follows below). Third, this level of scale-out means that core-to-core communication must be fast and power efficient. We present the architecture and runtime capabilities of the Database Processing Unit (DPU), designed for database query processing and complex analytics. The DPU is designed to scale to thousands of nodes, terabytes of memory capacity and terabytes/sec of memory bandwidth at rack scale, focusing on a precise compute-to-memory-bandwidth ratio while minimizing power.
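To make the second point concrete, the sketch below decomposes a group-by-sum into two streaming primitives: a partition pass that splits the input by a hash of the group key, and a per-core aggregation pass over each partition. This is plain C written for this discussion, not DPU code; the names (row_t, partition, local_sum) are ours, and the modulo hash merely stands in for the hardware hash partitioning described later.

    #include <stdint.h>
    #include <stdio.h>

    #define NCORES 4           /* 32 on the DPU; 4 keeps the demo small */
    #define NKEYS  8           /* illustrative dense group-key domain   */
    #define NROWS  16

    typedef struct { uint32_t key; uint64_t value; } row_t;

    /* Step 1: one pass over the input splits it into NCORES per-core
     * streams by a hash of the group key, so every core owns a
     * disjoint set of keys. The modulo stands in for a CRC32 hash.   */
    static void partition(const row_t *in, size_t n,
                          row_t out[NCORES][NROWS], size_t count[NCORES]) {
        for (size_t i = 0; i < n; i++) {
            uint32_t c = in[i].key % NCORES;
            out[c][count[c]++] = in[i];
        }
    }

    /* Step 2: every core runs the same tight loop over its own
     * partition; no locks or cross-core traffic in the hot path.     */
    static void local_sum(const row_t *part, size_t n, uint64_t sums[NKEYS]) {
        for (size_t i = 0; i < n; i++)
            sums[part[i].key] += part[i].value; /* keys assumed < NKEYS */
    }

    int main(void) {
        row_t    rows[NROWS], parts[NCORES][NROWS];
        size_t   counts[NCORES] = {0};
        uint64_t sums[NKEYS] = {0};

        for (size_t i = 0; i < NROWS; i++)      /* toy input           */
            rows[i] = (row_t){ .key = (uint32_t)(i % NKEYS), .value = i };

        partition(rows, NROWS, parts, counts);
        for (unsigned c = 0; c < NCORES; c++)   /* would run in parallel */
            local_sum(parts[c], counts[c], sums);

        for (unsigned k = 0; k < NKEYS; k++)
            printf("key %u -> %llu\n", k, (unsigned long long)sums[k]);
        return 0;
    }

Because the partition step gives each core a disjoint set of keys, the aggregation loop needs no locks and generates no cross-core traffic, which is what makes this shape of query easy to spread across 32 cores.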
The DPU is a shared-memory many-core; the hardware reduces power over a commodity Xeon processor by sacrificing features such as out-of-order superscalar execution, SIMD units, large multi-tier caches, paging and cache coherence, and instead provides acceleration for common data movement operations. A DPU consists of 32 identical dbCores, which are simple dual-issue in-order cores with a simple (and low power) cache and a programmer-managed scratchpad (Data Memory, or DMEM). For an equal power envelope, DPUs provide higher memory capacity as well as higher aggregate memory bandwidth compared to GPUs, making them a better choice for big data workloads. DPUs achieve this efficiency by recognizing data movement and inter-core communication as key enablers of complex analytics applications, and by providing hardware acceleration for these functions.

The DPU programming model is built around scalability and reaching peak memory bandwidth: each DPU is designed with a balanced compute-to-memory-bandwidth ratio of 2 cycles of compute per byte of data transferred from DRAM (that is, software has a budget of roughly two dbCore cycles for every byte it streams from memory). A DPU relies on a programmable DMA engine called the Data Movement System (DMS) to efficiently stream data from memory. The DMS enables line-rate processing of in-memory data by performing data movement and partitioning between DRAM and a dbCore's scratchpad asynchronously, leaving the dbCore free to perform useful work. The DMS can perform many common stream operations such as read, write, gather, scatter and hashing at peak bandwidth, allowing for efficient overlap of compute and data transfer. This also means that dbCores are less reliant on caches, allowing us to dramatically simplify the cache and reduce power consumption. The DMS also communicates with the dbCores via a unique Wait For Event (WFE) mechanism that allows for interrupt-free notification of data movement, further reducing power and improving performance.

[Figure 1: Block diagram of a DPU]

Unlike conventional DMA block engines, the DMS is independently programmable by each dbCore, through the use of specialized instructions called descriptors that are populated and stored in the dbCore's scratchpad. Chaining descriptors together to form a pipeline allows the DMS to be used in a variety of ways. For example, efficient access to large data structures requires hardware support for partitioning operations [24]. The DMS allows for efficient data partitioning across any or all cores, based on a range of values, a radix, or a CRC32 hash across any number of keys. The results of these partitioning operations are placed in the respective dbCore's scratchpad for further processing.
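The descriptor format and DMS interface are not defined at this point in the paper, so the following is only a hypothetical sketch of what this programming model might look like from a dbCore: a chain of descriptors built in DMEM (a read stage feeding a partition stage), submitted to the engine, and completed with a WFE-style wait. Every type, field and function here (dms_desc_t, dms_submit, wfe_wait) is invented for illustration, and the engine is emulated in software with a memcpy so the sketch compiles and runs.

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    /* Hypothetical descriptor: what a dbCore might populate in its own
     * scratchpad (DMEM) to program one stage of a DMS pipeline.       */
    typedef struct dms_desc {
        enum { DMS_READ, DMS_WRITE, DMS_PART_CRC32 } op;
        const void      *src;      /* DRAM-side address                */
        void            *dst;      /* DMEM-side buffer                 */
        uint32_t         nbytes;
        uint32_t         event_id; /* event raised on completion       */
        struct dms_desc *next;     /* chaining forms the pipeline      */
    } dms_desc_t;

    /* Software emulation so the sketch runs anywhere: on the DPU these
     * would be descriptor-store and Wait-For-Event operations, and the
     * chain would run asynchronously while the core does other work.  */
    static void dms_submit(const dms_desc_t *d) {
        for (; d; d = d->next)
            if (d->op == DMS_READ || d->op == DMS_WRITE)
                memcpy(d->dst, d->src, d->nbytes);
    }
    static void wfe_wait(uint32_t event_id) { (void)event_id; }

    int main(void) {
        static uint8_t dram[64];   /* stand-ins for DRAM and DMEM      */
        static uint8_t dmem[64];
        memset(dram, 0xAB, sizeof dram);

        /* Stage 2: partition the staged block (a no-op when emulated). */
        dms_desc_t part = { .op = DMS_PART_CRC32, .src = dmem, .dst = dmem,
                            .nbytes = sizeof dmem, .event_id = 1, .next = NULL };
        /* Stage 1: stream a block from DRAM into the scratchpad.       */
        dms_desc_t read = { .op = DMS_READ, .src = dram, .dst = dmem,
                            .nbytes = sizeof dmem, .event_id = 0, .next = &part };

        dms_submit(&read);         /* kick off the chain...             */
        /* ...the core could do useful work here...                     */
        wfe_wait(1);               /* ...then block, interrupt-free.    */

        printf("first byte staged in DMEM: 0x%02X\n", dmem[0]);
        return 0;
    }

On real hardware the point of this structure is overlap: while the engine fills one DMEM buffer, the core processes the previous one, so a double-buffered loop can run at line rate without interrupts.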
x86 processors focus on programmability and single-thread performance; GPUs focus on compute bandwidth; DPUs focus on programmability and aggregate memory bandwidth/Watt at scale.

The DPU is envisioned as part of a larger system, where thousands of DPUs are connected via an InfiniBand tree at rack scale, providing > 40000 dbCores, > 10 TB/s of aggregate memory bandwidth and > 1 TB/s of network bandwidth at 20 kW of power. Achieving rack-scale gains requires fulfilling two objectives: improved algorithms for scalability across these tens of thousands of cores, and improved performance/Watt on a single DPU, which translates into improved performance/Watt of a rack. We taped out and manufactured a DPU using a 40 nm process for benchmarking purposes, and focus our evaluation on this single DPU for this work. Our primary contributions are:

• We present the Database Processing Unit (DPU), a many-core architecture for in-memory data processing focused on highly parallel processing of large amounts of data while minimizing power consumption.

• We introduce the Data Movement System (DMS), a DMA engine which provides efficient data movement, partitioning and pre-processing capabilities. The DMS has been uniquely designed to minimize the amount of software needed to control it in critical loops.

• We present an Atomic Transaction Engine (ATE), a mechanism that implements synchronous and asynchronous Remote Procedure Calls from one core to another, some of which are executed in hardware, allowing fast execution of functions without interrupting the remote core.

• We design and fabricate the DPU, and evaluate its performance across a large variety of workloads, achieving 3× to 15× improvement in performance/Watt over a single-socket Xeon processor.

• We showcase unique hardware-software co-design aspects of our approach towards DPU application performance across a variety of workloads.

Section 2 describes the energy-efficient architecture of the DPU and its core components. Section 3 briefly describes the runtime used to simplify application development on the DPU. We then evaluate the performance of our hardware using a few microbenchmarks in Section 4. Finally, we look at the performance and energy efficiency of the DPU across a variety of real-world applications in Section 5.

2. DPU ARCHITECTURE

The DPU is a System-on-a-Chip (SoC) that defines a basic processing unit at the edge of a network topology. The DPU is designed with power efficiency and maximizing data parallelism as its principal constraints. Figure 1 shows the block diagram of a DPU. The DPU is organized hierarchically: 8 identical dbCores form a macro, and the DPU contains 4 such identical macros. DbCores in a macro share an L2 cache (Figure 2), and each macro can be power gated independently of other macros. The primary computation in a DPU is carried out on 32 identical dbCores, which can communicate via an ATE interface. Each dbCore reads/writes to main memory (DRAM) using the DMS, or via a 2-level simplified cache hierarchy. Each DPU also has a dual-core ARM A9 processor on board, which manages the InfiniBand networking stack and provides a high-bandwidth interface.

[Figure 2: Block diagram of a Macro]

Memory operations. In general, all loads read the value from an address in memory and either sign- or zero-extend the value to 64 bits as the value is written into the register file.
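As a plain-C illustration of that rule (this is not DPU code), the same byte in memory yields different 64-bit register values depending on whether the load is signed or unsigned:

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint8_t byte = 0xF0;              /* the raw byte in memory    */

        int64_t  s = (int8_t)byte;        /* signed load: sign bit set,
                                             extends to 0x...FFF0 (-16) */
        uint64_t u = byte;                /* unsigned load: zero-extends
                                             to the value 240           */

        printf("sign-extended: %lld (0x%016llX)\n",
               (long long)s, (unsigned long long)(uint64_t)s);
        printf("zero-extended: %llu (0x%016llX)\n",
               (unsigned long long)u, (unsigned long long)u);
        return 0;
    }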