<<

MICRO 2017 Submission #XXX – Confidential Draft – Do NOT Distribute!!

A Many-core Architecture for In-Memory

ABSTRACT program due to their SIMT programming model and intoler- We live in an age, with data and analytics guiding ance to control flow divergence. Their dependence on high a large portion of our daily decisions. Data is being generated graphics memory to sustain the large number of at a tremendous pace from connected cars, connected homes on-die cores severely constraints their memory capacity. A and connected workplaces, and extracting useful knowledge single GPU with 300+ GB/s of memory bandwidth still sits from this data is a quickly becoming an impractical task. on a PCIE 3.0 link, reducing their data load and data move- Single-threaded performance has become saturated in the last ment capabilities, which is essential for high ingest streaming decade, and there is a growing need for custom solutions to workloads as well as SQL queries involving large to large keep pace with these workloads in a scalable and efficient joins. manner. We analyzed the performance of complex analytics queries A big portion of the power in analytics workloads involves on large data structures, and identified several key areas that bringing data to the processing cores, and we aim to optimize can improve efficiency. Firstly, most analytics queries needs that. We present the Processing Unit or DPU, a lots of joins and group-bys. Secondly, these queries need shared memory many-core that is specifically designed for to be broken down into simple streaming primitives, which in-memory analytics workloads. The DPU contains a unique can be efficiently partitioned amongst cores. Thirdly, this Data Movement System (DMS), which provides hardware level of scaleout also means that core to core acceleration for data movement and preprocessing operations. must be fast and power efficient. We present the architecture The DPU also provides acceleration for core to core com- and runtime capabilities of the Database Processing Unit munication via a unique hardware RPC mechanism called (DPU), designed for database query processing and complex the Atomic Transaction Engine or ATE. Comparison of a analytics. The DPU is designed to scale to thousands of nodes, fabricated DPU chip with a variety of state of the art x86 terabytes of memory capacity and terabytes/sec of memory applications shows a performance/Watt advantage of 3× to bandwidth at rack scale, focusing on a precise compute to 16×. memory bandwidth ratio while minimizing power. The DPU is a shared memory many-core, the hardware reduces power over a commodity Xeon by sacri- 1. INTRODUCTION ficing features such as out-of-order, superscalar execution, The end of Dennard scaling and dark [9] have sat- SIMD units, large multi-tier caches, paging and cache co- urated the performance of x86 processors, with most new herence, and instead provides acceleration for common data generations devoting larger areas of the die to the integrated movement operations. A DPU consists of 32 identical db- GPU [15]. With more companies gathering data for use in Cores, which are simple dual issue in-order cores with a with analytics and , the volume and velocity of a simple (and low power) cache and a programmer managed data is rapidly rising, and massive computation and memory scratchpad (Data Memory or DMEM). For an equal power bandwidth is needed to process and analyze this data in a envelope, DPUs provide higher memory capacity and higher scalable and efficient manner. 
The slowdown in CPU scal- aggregate memory bandwidth as well as compared to GPUs, ing means that only alternative for traditional server farms is making them a better choice for workloads. DPUs scale out, which is expensive in terms of area and power due achieve this efficiency by recognizing data movement and to additional uncore overheads. inter-core communication as key enablers of complex analyt- Custom architectures have traditionally faced resistance in ics applications, and provide for these datacenters due to their high programmability and bootstrap functions. cost, however, with Moore’s law slowing down, most big ven- The DPU programming model is based on scalability and dors are considering application specific hardware as a way to reaching peak memory bandwidth, as each DPU is designed get higher performance at lower power budgets. recently with a balanced compute to memory bandwidth ratio of 2 announced their foray into a hybrid FPGA solution [11], Mi- cycles of compute per of data transferred from DRAM. crosoft is exploring FPGAs in their datacenters [21], Google A DPU relies on a programmable DMA engine called the is building custom accelerators for machine learning [16] and Data Movement System (DMS), that allows it to efficiently all major cloud vendors offer GPUs as part of their offerings. stream data from memory. The DMS allows for efficient GPUs are quickly becoming an attractive choice in dat- line speed processing of in-memory data, by allowing data acenters for their compute capabilities and corresponding movement and partitioning between DRAM and a dbCore’s power efficiency [2][1][14][13]. However, they are hard to

1 scratchpad asynchronously, leaving the dbCore to perform useful work. The DMS can perform many common stream operations like read, write, gather, scatter and hashes at peak bandwidth, allowing for efficienct overlap of compute and data transfer. This also means that dbCores are less reliant on caches, allowing us to dramatically simplify the cache and re- duce power consumption. The DMS also communicates with the dbCores via a unique Wait For Event (WFE) mechanism, that allows for interrupt-free notification of data movement, further reducing power and improving performance. Unlike conventional DMA block engines, the DMS is inde- pendently programmable by each dbCore, through the use of specialized instructions called descriptors that are populated and stored in the dbCore’s scratchpad. Chaining descriptors Figure 1: Block diagram of a DPU together to form a pipeline allows for the DMS to be used in a variety of ways. For example, efficient access to large data structures requires hardware support for partitioning opera- tions. [24]. The DMS allows for efficient data partitioning across any/all cores, either based on a range of values, or based on a radix, or on the basis of a CRC32 hash across any number of keys. The results of these partitioning operations are placed in the respective dbCore’s scratchpad for further processing. x86 processors focus on programmability and single thread performance, GPUs focus on compute bandwidth, DPUs fo- cus on programmability and aggregate memory bandwidth/Watt at scale. The DPU is envisioned as part of a larger system, where thousands of DPUs are connected via an Infiniband tree at rack scale, providing > 40000 dbCores, > 10 TB/s of aggregate memory bandwidth and > 1 TB/s of network bandwidth at 20 kW of power. Achieving rack scale gains requires fulfillment of two objectives - Improved for scalability across these tens of thousands of cores, and im- proved performance/Watt on a single DPU, which translates into improved performance/Watt of a rack. We taped out and Figure 2: Block diagram of a Macro manufactured a DPU using a 40 nm process for benchmark- ing purposes, and focus our evaluation on this single DPU • We showcase unique hardware software co-design as- for this work. Our primary contributions are: pects of our approach towards DPU application perfor- mance across a variety of workloads. • We present the Data Processing Unit (DPU), a many- core architecture for in memory data processing focused Section2 describe the energy efficient architecture of the on highly parallel processing of large amounts of data DPU and its core components. Section3 briefly describes while minimizing power consumption. the runtime used to simplify application development on the DPU. We then evaluate the performance of our hardware • We introduce the Data Movement System (DMS), a using a few microbenchmarks in Section4. Finally, we look DMA engine which provides efficient data movement, at the performance and energy efficiency of the DPU across partitioning and pre-processing capabilities. The DMS a variety of real world applications in Section5. has been uniquely designed to minimize the amount of software needed to control the DMS in critical loops. 2. DPU ARCHITECTURE • We present an Atomic Transaction Engine (ATE), a The DPU is a System-On-a-Chip (SoC) that defines a ba- mechanism that implements synchronous and asyn- sic processing unit at the edge of a network topology. 
The chronous Remote Procedure Calls from one core to DPU is designed with power efficiency and maximizing data another, some of which are executed in hardware, al- parallelism as its principle constraints. Figure1 shows the lowing fast execution of functions without interrupting block diagram of a DPU. The DPU is designed in a hierarchi- the remote core. cal manner, 8 identical dbCores form a macro, and the DPU contains 4 such identical macros. DbCores in a macro share • We design and fabricate the DPU, and evaluate its per- a L2 cache (Figure2, and each macro can be power gated formance across a large variety of workloads, achieving independently of other macros. The primary computation 3× to 15× improvement in performance/Watt over a in a DPU is carried out on 32 identical dbCores, which can single socket Xeon processor. communicate via an ATE interface. Each dbCore reads/writes

2 to main memory (DRAM) using the DMS, or via a 2 level Memory operations. In general, all loads read the value simplified cache hierarchy. Each DPU has a dual core ARM from an address in memory and either sign- or zero-extend A9 processor on board as well, which manages the Infiniband the value to 64- as the value is written into the register networking stack, and provides a high bandwidth interface file. Two addressing modes are available. The offset ad- to other DPUs or a host machine. A Cortex M0 processor dressing mode adds an immediate value to a base register serves as a Power Unit to control the various value to form the address. The scaled indexed addressing power modes of the DPU and provides health monitoring mode adds two registers together to form the address, the first functionality for the board. A Mailbox messaging interface register being a base and the second register being an index, provides a peer to peer communication interface between the where the index is scaled by 1, 2, 4 or 8 prior to adding to dbCores, ARM cores and the M0 processor. the base register. All load/store accesses must be naturally aligned. Conditional stores examine the least significant 2.1 DbCore Architecture of a register to determine whether or not to perform the store. Address spaces and protection. The address space is The DbCore features the Extensible RISC Instructions for broken into physically addressed DRAM regions (up to 8GB Queries, ERIQ, instruction set architecture. The instruction DDR memory on RAPIDv1), control, configuration and spe- set provides a base instruction set for general purpose comput- cial purpose register spaces, and a core-local data memory ing as well as specialized instruction set to facilitate efficient (scratchpad SRAM, called DMEM—32 KB in RAPID-v1). processing of database queries. Each dbCore is designed to In the absence of a conventional memory management unit nominally operate at 800MHz, although capable of 1GHz (MMU), hardware watchpoint registers allow the loader soft- speeds with voltage and frequency scaling. ware to set sentinel boundaries demarcating legal code and All instructions in the ISA are 32-bits long and are aligned data regions. Any dbcore executable overreaching these on a 32-bit boundary. The executable is always loaded in the boundaries causes a hardware exception, thus preventing 0-4GB physical address range in memory to allow addressing memory errors and potential vulnerabilities. The DMEM is with a 32-bit instruction pointer. The ISA consists of 32 64- core-local SRAM mapped at a fixed base address outside the bit registers with instructions that explicitly operate on either physically addressed DRAM range. Any load or store to the 32-bit quantities or 64- bit quantities. The ISA permits the C 32 KB address region offset from this base address results in language int type to be 32-bits to encourage software to oper- an in-pipeline (cache-hit latency) access to this memory. Soft- ate on 32-bit quantities, while allowing for the pointer type ware manages the contents of this scratchpad memory with to be 64-bits. Instructions that operate on 32-bit quantities explicit data-movement commands to the DMS. In contrast to sign-extend the result to 64-bits when saving to the register the software managed DMEM, each dbcore is also equipped file. with traditional data and instruction caches (16 KB And 8 KB ALU operations. The dbcore supports common arith- respectively in RAPID-v1). 
Hardware populates and evicts metic operations (addition, subtraction, multiplication, divi- these direct-mapped caches on load and store misses to phys- sion, modulo), logical operations (shift, rotation), and bitwise ical memory. In addition a set of 8 dbcores (dbcore macro), (and, or, xor, bit set, clear and population count) operations share a 256 KB, 4-way associative level-2 cache, which is on word (32-bit), double word (64-bit) or subword (8-, 16- also implicitly managed by hardware on misses to the L1 bit) operands. Explicit instructions that operate on 32-bit instruction and data caches. quantities enable the pipeline implementation to minimize Consistency and Coherence. The dbcore does not per- power consumption while meeting computing performance form invalidations or updates to remote (i.e., other dbcores) goals. The primary motivation for this 64-bit ISA is memory caches when populating or evicting a dbcore’s cache (L1 or addressing, specifically the pipeline load/stores addressing far L2) in response to misses. Instead, the ISA provides explicit more than 4GB of DRAM when processing database queries. operations (CacheOps) which can be issued from the dbcore The secondary motivation was ease of handling 64-bit colum- pipeline to invalidate or writeback L1 or L2 cache blocks by nar data types. In general, instructions that operate on 32-bit address or index. Software uses these CacheOp instructions quantities produce a 32-bit result and sign-extend the result in conjunction with the remote procedure call mechanism to 64- bits when saving to the register file. The motivation for (described below) to ensure coherent memory accesses. All sign-extending 32-bit results stems directly from the load/s- load and store misses cause the pipeline to stall. An explicit tore instructions. The load/store instructions utilize 64-bits to sync instruction ensures all previous memory operations (in- form an effective addresses. The hardware does not support cluding evictions) and serves as a memory fence to ensure floating point arithmetic natively. program consistency. Comparisons and Branches. Compare instructions oper- Data processing operations. The dbcore supports bit- ate on word or double word registers and set a single bit vector load (BVLD), filter (FILT), and CRC32 hashcode in the resultant register. Conditional branches are fused generation as single-cycle instructions to accelerate database compare-and-branch operations that only operate on 32-bit operations such as filters, joins and late materialization. The values. Target address is 16 bits but yields a signed, 18-bit CRC32 instruction computes a seeded 32-bit Castagnoli poly- PC-relative range (the lower 2-bits are implied zeros since nomial which can be used for certain hash-value computa- instructions must be aligned on a 32-bit boundary). Uncondi- tions, and for checksums. The BVLD and FILT instructions tional branches directly identify the branch target via immedi- are designed to filter through sparsely populated columns effi- ate or register operands and can result from explicit branches ciently in a loop. The program first populates the column into (’goto’ statements), function calls and return statements or a contiguous buffer in the DMEM address space, notes the while handling exceptions.

3 base address in a special register, then repeatedly performs coupled with software techniques such as loop unrolling and the BVLD And FILT operations to compute offset of the next pipelining. Multiply, divide and modulo instructions stall the valid column item, load the element from DMEM, and per- ALU pipe for multiple cycles until they complete execution, form the filter comparison operation. The instructions also and bypass their results to downstream instructions. A 12 enable efficient generation of operations such as population entry instruction buffer captures tight loops and issue such counts and scatter-gather masks. basic blocks while bypassing the instruction cache. Condi- Event and FIFO processing. The pipeline issues instruc- tional branches are predicted to be taken when backward, and tions also to the Atomic Transaction Engine (ATE) to perform not taken when forward. Mispredictions cause the pipeline remote procedure calls, and to the Data Movement System to stall for two cycles. The ATE controller may additionally (DMS) to move data between the DMEM and DRAM. The issue load and store requests to cacheable memory or to the program first constructs descriptors for such operations, then DMEM. No instructions are issued from the instruction cache enqueues the descriptor by invoking a push instruction iden- or loop buffers during these cycles. tifying the engine (ATE, or DMAC queue as the case may be). To receive completions for the pushed operations, the 2.2 Atomic Transaction Engine (ATE) ISA defines a wait-for-event (WFE) instruction with an event Traditional synchronization mechanisms like locks and identifier. The above mechanisms allow the dbcore pipeline, mutexes are costly and complex to implement and the pro- and thus the software to: (i) utilize the DMS engines with grammer needs to ensure visibility of the state of the synchro- concurrently pending, programmed memory requests, and (ii) nization mechanism throughout the system. We implement execute operations on a specific dbcore and receive the return an "Atomic Transaction Engine" (ATE) which enables the value. execution of remote procedure calls (RPC) on other dbCores Interrupts and Exceptions. The dbcore is designed pri- on the DPU. The ATE guarantees that two RPCs to the same marily to execute data processing operations and expect lim- dbCore will be executed atomically, which then allows them ited interruptions. Specifically, there are three external inter- to implement mutually exclusive code segments. This also rupts which switch execution context. means that all operations to "shared" data must be performed • When a dbcore pushes an ATE descriptor, the ATE using the ATE, to ensure that data read write ordering is controller raises an interrupt on the target dbcore (which maintained. Atomic RPCs were first explored in the context could be the issuing dbcore itself). The ATE interrupt of distributed systems to ensure consistency in case of handler causes the pipeline to execute instructions from failures [18]. Our use of hardware accelerated atomic RPCs the remote procedure until it explicitly returns from the to provide synchronization in a multicore system is novel to interrupt. This feature enables software to construct the best of our knowledge. critical sections and shared data structures. 
The ATE adds two instructions to the dbCore ISA - RPC No Return (RPCNR) which is used for routines that do not • When a dbcore or ARM core pushes a mailbox mes- require data to be returned to the calling dbCore, and RPC sage into a target dbcore’s FIFO, the mailbox controller With Return (RPCWR), we returns data to the calling dbCore. hardware raises an interrupt on the target dbcore. The These instructions require 2 operands, a RPC identifier and mailbox interrupt handler minimally enqueues the mail- the address in DMEM for the first payload word. The ATE box FIFO message into a software structure for further is programmed with descriptors, which are sent over the processing and returns to the main context. This feature ATE network to the other dbCore. The descriptor is a 32 allows software to communicate with external nodes byte structure, and the arguments to the RPC need to be across network fabrics. packed into these 32 . If the arguments exceed the space, • The DPU’s System Interrupt Controller (SIC) can be arguments can be placed in DRAM as well and only a pointer programmed to deliver general purpose I/O interrupts to the arguments will be shipped over the ATE. In this case, to the dbcore. While most such interrupts are handled due to lack of cache coherence, the runtime system and the by the one of ARM cores in the system, the timer in- application need to ensure visibility of the arguments to the terrupt allows user code running on the DPU to make remote dbCore. scheduling and priority decisions. When a dbCore issues a RPC instruction, it is routed over the dbCore interconnect to another dbCore, or potentially Exceptions arising from within the dbcore pipeline, such itself. The ATE on the receiving dbCore places the RPC and as unaligned data and instruction accesses or divide by zero its payload in the ATE Receive Queue (reserved in DMEM). errors, cause hardware to switch the program counter to spec- The ATE contains a state machine which removes RPCs from ified offsets into an exception vector table. The architecture the ATE Receive Queue and executes them. The ATE pro- also implements a special debug exception which allows an vides hardware support for simple atomic operations like external debug process to attach to a dbcore executable’s read, write, increment and compare and swap. These low instruction stream. latency routines are executed in hardware, and the remote Pipeline execution throughput. The dbcore can issue in- dbCore is not interrupted. For RPCs which are not executed structions on two pipelines each cycle. One pipeline issues by hardware, an interrupt is sent to the dbCore and a soft- integer and branch instructions to the ALU, while the other ware interrupt service routine interrogates the RPC Descriptor pipeline issues memory operations to the load store unit. Cer- Register to determine the RPC ID and processes the RPC ap- tain instructions (e.g., sync) prevent dual issue. Hardware propriately. features such as two sets of BVLD/FLIT column base reg- isters are designed to extract full pipeline throughput when 2.3 Data Movement System (DMS)

4 The DMS is also fully aware of database tuples (multiple columns), and supports multiple data types. Multiple descrip- tors can be queued up in a hardware managed linked list for processing. A special loop descriptor is used to reuse these programmed descriptors multiple times. These descriptors provide ways to specify auto-incrementing the effective ad- dress, allowing large columns of data to be moved to DMEM without reprogramming the DMS. These unqiue aspects of the DMS dramatically simplify the programming interface, improving efficiency for operations involving small buffers, where the programming cost cannot be amortized. For efficient communication with the dbCores, the DMS software interface uses the Wait For Event (WFE) mechanism, allowing buffer flow control without the overhead and latency of interrupts. The DMS interface exposes 31 DMS events per dbCore that software can use to communicate with the DMS hardware. DMS descriptors have a notify event field that specify the event to be asserted when the operation is completed. They also have a Wait event field that specifies which event the descriptor to be waited on. Software can set or clear the specific events directly. Software can also wait Figure 3: Data Movement System (DMS) on a specific event using WFE instruction. It should be noted that while the dbCore is waiting for an event using the WFE instruction, it is automatically put in to a low power state where most clocks are gated. The WFE instruction can be thought of as hardware managed low power polling. A majority of database operations are focused more on Software can use two channels per DMAD (read and write) data movement rather than on computation. To support these to submit the descriptors to DMS. DMS descriptors are 16 operations, we have implemented an intelligent DMA engine, bytes long, which encodes the source address, destination termed the Data Movement System, or DMS. Unlike conven- address and other control information such as auto-increment, tional block DMA engines, the DMS understands database event mangement in a compact way. Software programs the data formats and can perform a variety of functions, including DMS by pushing the address of the descriptor to the DMAD’s data partitioning and projection, hash computation, and array active list, using the FIFO interface on one of the channel. transposing (row-major to column-major or vice-versa), all DMAD walks the active list and processes the descriptors in while transfering data. The DMS is fully programmable by order. Before it executes the descriptor, it checks descriptor software using linked descriptors in the dbCore’s DMEM, from the DMEM and executes the descriptor once the other enabling software to combine the capabilities of the DMS in conditions such as synchronization, events readiness are sat- a flexible manner to suit various processing needs. isfied. The descriptor information is passed up to DMAC Figure3 shows the overall architecture of the DMS, a data through DMAX. The DMAC executes the descriptor by per- highway built from three separate units. The central DMA forming the operations such as moving data from the main Controller or DMAC performs all tasks that are common to memory to DMEM or partitioning the data etc., Upon com- all 32 dbCores such as descriptor parsing and is responsible pletion of the operation, it returns the descriptor to DMAD for the bulk of the data movement in the DMS. It comprises through DMAX. 
Once the completion notification is reached most of the control, buffering and data path processing of DMAD, DMAD retires the descriptor. When it retires, it also DMS, allowing most of the DMS logic to be shared across notifies the core if it is programmed to raise an event. dbCores. The DMA DMEM unit (DMAD) performs the tasks Table1 shows the various type of DMS descriptors avail- that are unique to each dbCore. It is designed to be as simple able to program the DMS. The data movement descriptors as possible and is duplicated on all 32 dbCores. The third are used to program the DMS to move the data betweeen the unit is the DMA Crossbar (DMAX) which serves as a switch main memory, core’s DMEM, and DMS internal memory. or crossbar in each macro. It acts as a bridge between the DMS internal memory are buffer space provided by DMS eight DMADs to the DMAC. to store column data during data operations. For example, The DMS provides efficient data movement between the it provides the space for key columns while DMS computes main memory (DDR) and scratch pad memory (DMEM) in the CRC32 and provides space for columns while data is the dbCores. While moving data, the DMS can also effi- partitioned among the cores using the hash or range. Loop ciently perform some preprocessing operations such as scat- descriptors specifies the looping operations on the descrip- ter/gather, computing hashes and partitioning the data. The tors. AUX descriptors provide a way to provide additional DMS provides a robust interface to minimize the software information needed to process other descriptor. This provides overhead involved in programming it, through the use of de- a way to extend the descriptors with backward compatibility. scriptors. Descriptor based DMA has previously been used to Subsection 3.3 describes how the DMS is programmed and program on-chip NICs [4], however, our use of the DMS to presents an example. efficient move data from DRAM to the dbCores for is unique.

5 Descriptor Type Main Purpose

DDR->DMEM Moving data from the main memory to DMEM DDR->DMS Moving data from the main memory to DMS internal memory DMEM->DDR Moving data from DMEM to the main memory DMEM->DMS Moving data from DMEM to the DMS internal memory DMS->DMEM Moving data from DMS internal memory to DMEM DMS->DDR Moving data from DMS internal memory to the main memory DMS->DMS Moving data between the DMS internal memory Loop Specifies looping operations Event Provides a way for DMS to wait on events and clear events HARE Controls HASH and range functions AUX Used to provide any additional information needed to process other descriptors Program Configures DMS control registers

Table 1: DMS descriptor types

2.4 Mailbox Messaging The Mailbox Controller (MBC) is used for managing com- munication between the dbCores, the A9 cores and the M0 processor. It maintains a total of 34 mailboxes, one for every dbCore, one for both A9 cores and one for the M0. The controller is located on the AXI bus, and it is accessible via a standard AXI slave interface by all cores. For each mailbox the MBC maintains a set of memory mapped data (wr/rd) and control (wr/rd) registers that are used to write (send) or read (receive) messages to and from the corresponding mailbox, along with the corresponding logic that implements the send and receive protocol. An SRAM memory module, which is shared among all mailboxes, is used to store all messages delivered but not consumed yet. Figure 4: dbCore processor implementation A source core that wants to send a message to a destination core, first acquires exclusive access to the destination mailbox, by accessing atomically the destination WR control register. data are communicated through main memory. The network Once given exclusive access, it checks for available free space communication among the nodes and the RDBMS is based and then sends the message, by performing consecutive writes on this property. For example, a node will place the results of to the WR data register. Finally, it marks send completion by the query execution in a buffer in memory and send a mailbox updating the WR control register. The MBC logic updates the message to A9 with a pointer to the buffer. Upon receiving counter of received messages for the corresponding mailbox, the message, the A9 will pass the pointer to the network stack, causing an interrupt to be delivered to the destination core. which will then forward the buffer to its destination. Following similar steps, the destination core, upon a mail- Overall, the latency of a mailbox message may not be as box interrupt, will read the RD control register to see how low as the latency of a more integrated network (i.e., ATE). many messages are available. It will then read the data of However, we observed the MBC message throughput to be each message, by performing consecutive reads from the RD sufficient to sustain a network message rate that keeps the data register. The first word of a message is expected to con- dbCores utilized, without introducing any communication tain the number of words each message consists of, in order bottlenecks to the system. for the destination core to identify message boundaries. After the destination core has read all individual words, it updates 2.5 Fabrication the RD control register to mark the consumption of the mes- We fabricated the DPU using a 40 nm process, with a sili- sage. At that time, the message counter for this mailbox is con area of 90.63 mm2 and 540 millions , of which decremented and if there are no other messages available the 268 million transistors are used for memory cells. Figure interrupt signal is de-asserted. 4 shows the implementation of a single dbCore. We went The messages exchanged over MBC are always a multiple through an extensive physical design and verification process, of 4 bytes, and their size varies between one word and one the details of which are beyond the scope of this paper. We mailbox’s capacity (128 words). The goal of the MBC is used formal verification techniques as well, and almost 16% to facilitate quick exchange of light-weight messages (i.e., of our RTL bugs were found via formal tools. 
The DMS sending a pointer to buffer in memory), while the bulk of the proved to be one of the most challenging blocks to verify, due to deep corners in design. We also created a detailed virtual

6 prototype in software for verification, allowing us to perform coherency contract for the arguments struct as well as another end-to-end black box testing of the DMS. visitor for the returned data structure are supplied. Any hardware is only as good as the software that runs on The non-coherent caches require that the programming top of it. We next look at the our runtime. Creating a runtime model enforces the programmer to specify how shared struc- for the DPU was based on the same principles as its hardware, tures should be made coherent by the underlying layers. The maintaining a balance between efficiency and programmer vistor has a memory operation argument that defines which . and how we operation should be executed and will be filled in by the runtime. This argument modifies the visitor to either flush 3. THE DPU RUNTIME the data structure on the calling core, and invalidate the data structure on the remote core ensuring coherence is maintained. The low power design of a DPU delegates a majority of The visitor essentially defines a coherence contract, which scheduling and memory management decisions to software. can be used by the runtime to perform the necessary actions. The DPU runtime is guided by same principles, providing a intuitive and powerful API to the programmer, while maxi- 3.2 Scheduling mizing optimization potential through the use of the DMS and the ATE messaging interfaces. The guiding principle Each dbCore maintains an independent thread of execution, behind the DPU runtime is energy efficiency by fully exploit- and a list of pending tasks that are executed in order. The ing all levels of compute and memory parallelism, and our scheduler is designed to be lightweight and runs as a part of focus on achieving peak memory bandwidth. Programming this main event loop, and puts the dbCore in a sleep state if the DPU essentially involves asking the following questions: the task queue is empty using a wfe instruction. We create a lightweight messaging API on top of ATE Software RPCs, • How can I partition my data across all dbcores to maxi- and an ATE interrupt on a dbCore adds a task to the local mize compute parallelism? task queue, and wakes up the dbCore. The scheduler resumes execution, and processes all tasks in the task queue. The • How can I layout my data in DMEM so I can concur- scheduler relies on the programmer to yield control at the end rently load, compute and store data? of a task. We create a lightweight barrier by using the ATE hardware • How can I layout my data in main memory to allow the RPCs. A master core (the core where the barrier is created) DMS to fetch and store contigous buffers, and achieve acts as the coordinator, and initializes a local variable count maximum memory bandwidth? in its DMEM to zero. Other cores increment this variable by using the atomic add with return ATE RPC, and once count 3.1 DbCore Programming reaches the total number of cores, all cores leave the barrier. The dbcore comes out of reset in a bare state, and the Cores can check the value of this shared variable via a poll runtime is responsible for reserving the stack and heap in based mechanism using the ATE, however that would create DRAM. It also reserves some space in DMEM for the ATE significant contention on the coordinator core. We create a message queues and runtime state. After initializing the scalable barrier, where each core uses a local wait variable, interrupt handlers, control is transferred to the main routine. 
and spins on this variable after atomically incrementing count. A dbCore is similar in architecture to an embedded MIPS Once all cores enter the barrier, the coordinator notifies all the core, allowing for rapid prototyping. We provide standard C other cores by modifying the local wait variable using an ATE routines such as printf, memory and string APIs using atomic write operation. This would cause the coordinator to the newlib library [23]. Subsequent optimization relies on perform 31 sequential writes once all cores enter the barrier. moving data from the caches into DMEM, and overlapping We further optimize this based on our architecture, and the compute with memory transfer using the DMS. coordinator notifies macro masters, which in turn notify the The programmer must be aware of incoherent caches, and cores within their macro, creating a scalable tree like barrier. we use the ATE to simplify core to core communication. This improves the time taken for the last core to leave the The DPU allows executing a function pointer on a remote barrier from 30 us for the sequential case to 12 us for the tree core. This mimics remote procedure calls where an API is barrier. called which executes on a remote endpoint as well. This mechanism can be employed to build data structures to which 3.3 DMS Programming accesses must be performed in a mutually exclusive fash- The DMS in the DPU is fully programmable with the ion. We define a serialized interface to allows execution of DMS Descriptors as mentioned in the Subsection 2.3. DMS software RPCs on remote cores. descriptors are like macro instructions or commands which void* dpu_serialized(core_id_t _id, void(*rpc)(void*), instructs the DMS to move data between the main memory, void* args, visitor_fp args_visitor, visitor_fp DMEM, and DMS internal memory, to partition the data, to return_visitor); perform looping operation and to deliver events when data movement is complete. A target core id as well as a function pointer which will be Listing1 illustrates how the DMS is programmed for a executed on the remote core are specified. In addition to that, simple DMA access with double buffering. In this example, the arguments supplied to the RPC function pointer on the the DMS is programmed such that the program consumes 4 remote core are supplied as parameters. To achieve coherency million rows of a 4 byte column. First we setup two DDR- in this 1 to 1 communication style, a visitor specifying the >DMEM descriptors is setup to read 256 rows with same

7 We keep in DRAM instructions (i.e., dbCores’ binary im- age) and data, such as stack space for each dbCore, heap space and global data. The lack of an OS or of a virtual mem- ory management scheme, has led to having all these regions fixed in the memory address space, and share their location with the (i.e., position of the stack) or expose them to the dbCore programmer, either directly or through an API. The largest fixed memory region is the heap space. We implemented a two-level dynamic memory allocator, that overlooks the heap space and provides dynamic buffer man- Figure 5: DMS Descriptors agement. The lower level implements a fast buffer alloca- tion/free core-local path that handles the majority of the re- source address and destination address, but with different quests. When a core-local allocator runs out of buffers, it events. Even though the same destination address is used, the requests for additional memory from the global allocator, effective address is different due to the auto-increment feature which may choose to reply with either a set of buffers of the in the DMS. Second, we setup the loop descriptor such that it requested size or a chunk of memory, which the core-local loops to the first descriptor for 8191 times such that it reads all allocator will use to produce buffers of the requested size. 4 million rows. Finally, we push the descriptors to the DMS Similarly, if the number of free buffers in a core-local alloca- and thereby we trigger the data transfer to DMEM. Figure5 tor exceeds a pre-defined parameter, the core-local allocator shows the state of the descriptors in the active list after they will return a number of buffers to the global allocator. Each are pushed. Once the data is fully transfered to DMEM for the core can reach the global allocator by performing an ATE first descriptor (desc0), it will notify the core by raising the request to a specific core, which is tasked with running the event1. The DBCore waits for event1 to be raised and then global allocator. The code for the global allocator, is always start consuming the rows. Meanwhile DMS start transfering executed within the context of an ATE interrupt, and thus is the data corresponding to the second descriptor (desc1) to designed to be as short as possible. DMEM without waiting for dbcore to finish consuming the Internally, both allocators are organized as a set of seg- data. After finish transfering the data, DMS will loop back regated free-lists. The core-local one maintains buckets of to first descriptor and wait for the event1 to be cleared. After sizes ranging from 64 bytes to a few MBs, forwarding any consuming the data for the first descriptor (desc0), dbcore request for larger buffers directly to the global allocator. The clears the event, signaling the DMS that the first buffer is free. global allocator maintains the same set of free lists, to collect The overlapping of computation (consume_rows()) and the returned buffers from the cores and an additional linked list transfering of data with double buffering enables the DPU to for all the remaining buffer sizes. process data with high throughput. The memory allocator participates in the system-wide ef- fort of maintaining (software) cache coherence. 
At any given dms_descriptor* desc0 = dms_setup_ddr_to_dmem(256, time a buffer can be in one of the following three states: src_addr, dest_addr, event0); dms_descriptor* desc1 = dms_setup_ddr_to_dmem(256, • It can be in the global allocator, which means no part of src_addr, dest_addr, event1); such a buffer can be found in any part of the memory dms_descriptor* loop = dms_setup_loop(desc0, 511); hierarchy, other than the main memory, and it is not dms_push(desc0); dms_push(desc1); going to be accessed by anyone. dms_push(loop); count = 1; • If in the local allocator and marked as cache-only, it can buffer_index = 1; be accessed only through ld/st instructions executed by events[] = {event1, event2}; dbCores, and parts of the buffer can be in different lo- do{ cations in the memory hierarchy (i.e., different caches). dms_wfe(events[buffer_index]); It is the programmers responsibility to further enforce consume_rows(); clear_event(events[buffer_index]); which core is the owner of the buffer at any given time. buffer_index = 1 - buffer_index;// toggle buffer index • If in the local allocator and marked as DMS-only, it can } while (++count != 8192); be accessed only through DMS operations, and cannot be loaded on any cache. Listing 1: DMS Programming Example The state of any buffer can change only when the buffer is free, and either moves to/from the global allocator or it is in 3.4 Memory Management the local allocator and the allocator chooses to transition the The memory available in the system consists of the main buffer based on requests and availability. All state transitions memory, which is shared among dbCores and DMS, and the require the address range of the buffer to be invalidated, apart DMEM scratch space locally available to every dbCore. Each from the ones moving from DMS-only to any other state. one of these two spaces are managed in different way. 3.4.2 DMEM Memory Management with a Stack Al- 3.4.1 Heap Space Allocator locator

8 The DMEM holds data input to operators as well as in- Partitioning is critical for achieving high efficiency for termediate results, shared across operators. The lifetime of data processing operations and we study the efficiency of this data is short, as the chain of operators forming a task, is the partitioning engine in the DMS in terms of its achievable usually short and the result of a task is copied back to DRAM. memory bandwidth. The input to the microbenchmark is a To avoid the overhead of dynamically managing the avail- relation with four 4 byte columns. The relation is stored in able DMEM space through a malloc-like API, the operators column major format. The microbenchmark uses the DMS’s runtime uses DMEM in a stack-based fashion. Every new partition engine to partition the input relation into 32-way operator appends its result to the end of the used DMEM partitions. First, the DMS is programmed to move the data space. At the end of the last operator, the whole stack space from the main memory into the DMS internal memory. Once is released and made available for the next task. the data is in the DMS internal memory, DMS then triggers the hash or range engine to compute the index of the output 4. MICROBENCHMARKS partition. Finally, DMS moves the payload column data from the DMS internal memory and sends them to output We fabricated a test DPU using a 40 nm process, with 8 partitions which are in a core’s DMEM. We pipeline these GB of DDR3 memory. We first evaluate the performance three stages since that DMS uses different resources in each of the DMS and the dbCores for several small test cases, stages. In addition, we used multiple cores to submit the and ensure that functionality on the DPU works as designed, DMS descriptors to reduce the DMS programming overhead and measure peak performance of the DMS and ATE across per core and used all four available memory controller to several use cases. maximize the input bandwidth to the hardware partitioning 4.1 Read and Write engine. Figure8 shows the effective bandwidth achieved with dif- We first take a look at the Read and Write performance ferent partition schemes available in the DMS. Radix parti- of the DMS. Each dbcore reads 4096 rows from DRAM tioning uses 5 bits from a key column to partition the data into DMEM using the DMS, and write the same rows from into 32 ways. Hash partitioning first uses the hash engine to DMEM to DRAM. Each dbcore reads/writes a buffer of mul- compute the CRC32 from one or multiple key columns and tiple rows at a time containing all columns. We vary the num- then uses the CRC32 bits to partition the data. In all the parti- ber of columns from 2 to 32 and width of each column from tioning scheme, the DMS achieves 9.3 GB/s and outperforms 1 byte to 4 bytes. The data is stored in column major format. the previous published state of art hardware accelerator for Figure6(a) shows the read bandwidth achieved across all db- partitioning [24], where the partitioning throughput is only cores across different buffer sizes and column widths. There 3.13 GB/s. The DMS achieves high efficiency by pipelining is a small decrease in bandwidth as the number of columns the partitioning operations into three stages, by using efficient increase. The DMS fetches data one column at a time, and hash and range engine desigh, which runs at 800 MHz, and more columns mean more non-contigous DRAM pages are the usage of the low latency scratchpad memory to process opened, which incurs a small latency for each buffer. 
Larger the partitions. buffer sizes amortize fixed DMS overheads achieve higher bandwidths. Using the DMS, the DPU achieves a bandwidth 4.4 Atomic Transcation Engine of > 9 GB/s for a buffer size of 8 KB(128 rows/buffer, 4 The ATE is designed for fast core to core communication, columns, 4B column widths) which is almost 75% of the and allows dbCores to perform atomic operations on shared peak DDR3 bandwidth, and this is bandwidth we observe in data in DMEM or DRAM. We measure the latency of dif- many real applications as well. ferent ATE operations and different payload sizes (Figure 4.2 Gather 9. Looking at read, an atomic read from a remote dbCore incurs a latency of 80 ns for an 8B value. Writes are faster, We study a basic bitvector gather operation, where we as the local dbCore does not need to wait for a return value. gather rows from DRAM corresponding to 1s in a given Similarly, an atomic addition that returns a value to the caller bitvector into DMEM. Gather is fundamental to the filter op- dbCore takes 100 ns. Software RPCs are much slower due erator, wherein we gather rows from DRAM corresponding to interrupt overhead, incurring almost 400 ns to read an 8B to ones in a given bitvector. and . We test the DMS against value. a dense (0xF7) and a sparse (0x13) bitvector. The DMS is designed to perform gather at line speeds, however, due to 4.5 Mailbox a RTL bug, the first version of our chip could not utilize the DMS to its full potential(Figure7). In brief, when all 32 cores The mailbox controller is responsible for the communica- issue gather operations, a FIFO that hold bitvector counts in tion between the dbCores and the A9 cores. We evaluated its the DMAC overflows and starts dropping counts. DMAD peak performance with the following traffic patterns: channels in the corresponding dbCores that are waiting for • A9 to dbCore0, back to back messages of variable size. these counts get stuck, causing some dbCores to hang. We create a software workaround for this issue by serializing • dbCore0 to A9, back to back messages of variable size. descriptors across dbCores, ensuring only a single dbCore issues a gather operation at a time, causing low gather band- • A9 to dbCores, back to back messages on both direc- widths(Figure7). tions. dbCore0 is the recipient of A9 messages and dbCore1 the . A9 alternates between sending and 4.3 Hardware Partitioning consuming a message.

9 (a) Columnwidth = 1B (b) Columnwidth = 4B

Figure 6: Bandwidth achieved across 32 dbcores for reading (a) and reading and writing(b) data via the DMS. Each line shows the bandwidth achieved for given number of rows per buffer and column width

Figure 7: Bandwidth achieved across 32 dbcores for the Figure 8: Bandwidth achieved with DMS partitioning engine bitvector gather operation using the DMS. The x-axis 4.6 SQL Operations • round-trip messages of minimum size, exchanged be- tween A9 and dbCore0. 4.6.1 Filter The filter operation is a basic database operation (SQL Figure 10 shows the message throughput rate in thousands WHERE clause), used to identify elements of a column that of messages per second, as a function of the message payload satisfy a given condition. We program the DMS to fetch size (in 4-byte words), for the three traffic patterns that ex- a single column of data (columnwidth = 1B), and vary the change back-to-back messages. The message processing rate number of rows from 512 to 16384. The DMS fetches a tile at the two ends of the message exchange is different, with A9 of the requested size into DMEM, and the dbCores gener- having to involve the kernel interrupt to receive the messages ate a bitvector in DMEM corresponding to the ≤,<,≥,>,= and deliver them to the user space of the destination process. comparisons with a constant. We accelerate these operations On the other side, when a dbCore is receiving a message, by using the bvld and filt instructions in the dbCore ISA, the processing path is much shorter. As a result, there is and pipeline these operations together with the DMEM read. higher throughput when A9 sends messages to dbCores than A single dbCore achieves a bandwidth of 482 millions tu- when receiving. The combination of the two patterns, results ples/second (Figure 11), which translates to 1.65 cycles/tuple. naturally in lower throughput, as the A9 has more work to do We can see a saturation in dbCore performance as the number compared to either previous scenario. of rows are increased, as fetching larger buffers causes the Finally, we evaluated the round-trip throughput of the mail- DMS to approach peak memory bandwidth. For 16K rows/d- box , by having A9 and dbcore) ex- bCore, this allows us to achieve a peak memory bandwidth change the same message multiple times. We measured the of almost 9.6 GB/s for 32 dbCores. average message rate under that traffic pattern to be 64.184 round trips per second. 4.6.2 Binary Expressions

10 Figure 9: Performance of ATE SW and HW RPCs for differ- Figure 12: Performance achieved for a single dbCore for the ent payload sizes filter primitive Messaging rate & Bandwidth We evaluate binary expressions on the DPU, which are defined as operations on columns of the form ColA EXPR Message Rate ColB, where ColA and ColB are two database columns, and 1000 9000EXPR is one of ADD,Bandwidth SUB, MUL or DIV. We also evaluate 900 the CASE expression defined as ColA > Const1 then ColB A9 to DbCore0 - single direction 8000 800 else Const2. We fetch 1024 rowsA9 to perDbCore0 column - single using direction the DMS DbCore0 to A9 - single direction 7000 700 into DMEM, and measure theDbCore0 achieved to A9 bandwidth- single direction for dif- A9 to DbCore0 - bi-directional 6000 600 ferent columns widths (FigureA9 to 12 DbCore0). The - bandwidthbi-directional for the 5000ADD/SUB expression increases with larger column widths, 500 4000due to larger buffer being transferred by the DMS. MUL 400 and DIV achieve their peak bandwidth at a column width 3000 300 of 4 bytes, and become compute bound for higher column 200 2000widths. The CASE expression bandwidth increases linearly Bandwidth Bandwidth (Kwords /sec) MessageRate (Kmsg / sec) 100 1000with increasing column widths, due to improved DMS band- 0 width.0 CASE achieves lower bandwidth than ADD/SUB due 0 20 40 60 80 100 120 to higher0 branch 20 misprediction 40 60 penalties. 80 100 120 Payload Size (words) Payload Size (words) Figure 10: Mailbox throughput as a function of payload size 5. APPLICATIONS: OPTIMIZING FOR THE DPUNote: Bandwidth = (payload + 1) * MsgRate

Copyright © 2015, OracleWe and/or designed its affiliates. Allthe rights reserved. DPU | toOracle be Confidential able – to Internal/Restricted/Highly perform in Restricted memory an- 4 alytics at peak memory bandwidth and corresponding power efficiency, and we look at applications spanning a variety of domains that are a good match for our hardware. We find state of the art algorithms from research for these applica- tions, and implement them both on x86 and the DPU. In most cases, similar code runs on both architectures, with threads on x86 translating to dbCores, and cache-friendly memory operations are converted to DMS operations. The DMEM and LLC on the DPU are much smaller than Xeon processors, and we perform additional optimizations to allow working sets to fit in DMEM. We compare our numbers to a Xeon server, with two In- tel Xeon E5-2699 v3 18C/36T processors and 256GB (16 * 16GB Samsung 16GB DDR4 2133MHz) DRAM running at 1600 MHz. For our DPU experiments, all datasets were Figure 11: Performance achieved for a single dbCore for the converted to 10.22 fixed point. There has been a increasing filter primitive amount of research on using fixed point for machine learning algorithms [12][7], and we observed no losses in accuracy while comparing with a floating point implementation. A

11 the fly, since we found generating and maintaining a kernel cache for the entire dataset to be much slower. A side-effect of our fixed point implementation is that the DPU converges in 35% fewer iterations, with no loss in classification accuracy, further making the case for fixed point. 5.2 Similarity Search on Text Text processing is an ubiquitous workload, we use a sim- ple example of similarity search on text to demonstrate the applicability of DPUs to text analytics. The similarity search problem essentially involves computing similarities between a group of queries and a group of indexed using the tf-idf scoring technique, and coming up with a set of topk matches for each query. Computing cosine similarities for a group of queries against an inverted index of documents can Figure 13: Power efficiency gains of the DPU for several be formulated as a Sparse -Matrix Multiplication prob- applications lem (SpMM) [1]. We leverage recent research on optimizing SpMM on the CPU [20] and the GPU [1] and implement simple 10.22 fixed point approach works due to the fact that these algorithms on x86 and the DPU. Each query indepen- most machine learning algorithms require data normaliza- dently searches across the index, making the problem easily tion, which constrains the range of the numbers involved, parallelizable across multiple threads/dbCores. We search leaving 22 bits to handle precision. We create optimized across 4M pages in the English , using page titles routines for fixed point multiply, divide, dot product, and as queries, similar to [1]. exponential functions. For the dot product, we use hand- A SpMM operation between 2 sparse matrices A and B coded assembly to make sure the element-wise loop fits into creating C relies on a simple principle, accumulate rows of the loop buffer, making the backwards branch free. To com- B corresponding to non-zero columns of A into C. The algo- pute performance/Watt, we assume a TDP of 145 W for the rithm relies on using a dense intermediate format for rows Xeon, and 6 W for the DPU. Figure 13 shows the perfor- of C, which makes indexing easier during accumulation. We mance/Watt advantages of the DPU for several applications range-partition rows of B and C into smaller tiles, allowing that we looked at normalized to a corresponding optimized this intermediate dense array to fit in the LLC. These tiles are x86 implementation. stored in the Compressed Sparse Row (CSR) format, which means we cannot know when a tile ends without actually 5.1 Support Vector Machines reading the tile. These irregular access patterns make SpMM Support Vector machines (SVMs) are widely used for clas- a challenging problem for DMS access. Getting good band- sification problems in healthcare (cancer/diabetes), width and efficiency from the DMS entails fetching large classification and handwriting recognition. We implement fixed length contiguous buffers. Naively using the DMS in- a variation of the Parallel SMO proposed by Cao volves fetching a buffer containing a tile, utilizing the tile, et. al [5] on the DPU. Each iteration of the SMO algorithm and discarding the rest of the buffer. This generates an ef- involves computing the maximum violating pair across all fective bandwidth of only 0.26 GB/s across 32 dbcores. We training samples, and the algorithm converges when no such use a novel technique for SpMM, where we fetch a buffer pair could be found. 
We compare the DPU version with a multicore LIBSVM [6] implementation on x86, using 128K samples from the HIGGS [17] dataset for evaluation. Optimal parameters for LIBSVM (a 100 MB kernel cache and 18 OpenMP threads) are chosen empirically. The DPU version generates kernels on the fly, since we found generating and maintaining a kernel cache for the entire dataset to be much slower. A side effect of our fixed point implementation is that the DPU converges in 35% fewer iterations with no loss in classification accuracy, further making the case for fixed point.

5.2 Similarity Search on Text

Text processing is a ubiquitous workload; we use a simple example of similarity search on text to demonstrate the applicability of DPUs to text analytics. The similarity search problem involves computing similarities between a group of queries and a group of indexed documents using the tf-idf scoring technique, and producing a set of top-k matches for each query. Computing cosine similarities for a group of queries against an inverted index of documents can be formulated as a Sparse Matrix-Matrix Multiplication (SpMM) problem [1]. We leverage recent research on optimizing SpMM on the CPU [20] and the GPU [1] and implement these algorithms on x86 and the DPU. Each query independently searches across the index, making the problem easily parallelizable across multiple threads/dbCores. We search across 4M English pages, using page titles as queries, similar to [1].

An SpMM operation between two sparse matrices A and B producing C relies on a simple principle: accumulate the rows of B corresponding to the non-zero columns of A into C. The algorithm relies on a dense intermediate format for rows of C, which makes indexing easier during accumulation. We range-partition the rows of B and C into smaller tiles, allowing this intermediate dense array to fit in the LLC. The tiles are stored in the Compressed Sparse Row (CSR) format, which means we cannot know where a tile ends without actually reading it. These irregular access patterns make SpMM a challenging problem for the DMS, since getting good bandwidth and efficiency from the DMS entails fetching large fixed-length contiguous buffers. Naively using the DMS means fetching a buffer containing a tile, using the tile, and discarding the rest of the buffer, which yields an effective bandwidth of only 0.26 GB/s across 32 dbCores. We instead use a novel technique for SpMM (sketched below): we fetch a buffer containing multiple tiles into DMEM and track state corresponding to the end of each tile, which allows us to consume all of the data in DMEM and increases bandwidth utilization. Careful DMEM management improves the effective bandwidth to 5.24 GB/s on the DPU and gives a 3.9× improvement in performance/Watt over an optimized 256-thread Xeon implementation (effective bandwidth across 36 cores: 34.5 GB/s).
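The core of the SpMM kernel is the row-accumulation step described above. The C sketch below shows one row of C being accumulated over one column tile of B into a dense scratch array; field and function names are illustrative, and column indices inside a tile are assumed to be tile-local.

```c
#include <stdint.h>
#include <string.h>

/* 10.22 fixed point multiply, as in the earlier sketch. */
static inline int32_t fx22_mul(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * (int64_t)b) >> 22);
}

/* Minimal CSR view of a sparse matrix (illustrative layout). */
typedef struct {
    const int32_t *rowptr;   /* n_rows + 1 offsets                  */
    const int32_t *col;      /* column index per nonzero            */
    const int32_t *val;      /* 10.22 fixed point value per nonzero */
} csr_t;

/* Accumulate row `row` of C = A*B restricted to one column tile of B.  The
 * dense accumulator is sized to the tile width so it fits in the LLC on x86
 * or in DMEM on the DPU. */
void spmm_row_tile(const csr_t *A, const csr_t *B_tile, int row,
                   int tile_cols, int32_t *dense_acc)
{
    memset(dense_acc, 0, (size_t)tile_cols * sizeof(int32_t));
    for (int32_t p = A->rowptr[row]; p < A->rowptr[row + 1]; p++) {
        int32_t k = A->col[p];            /* nonzero column of A selects row k of B */
        int32_t a = A->val[p];
        for (int32_t q = B_tile->rowptr[k]; q < B_tile->rowptr[k + 1]; q++)
            dense_acc[B_tile->col[q]] += fx22_mul(a, B_tile->val[q]);
    }
    /* On the DPU, several tiles of B arrive packed in one DMS buffer in DMEM;
     * the caller tracks where each tile ends so that the whole buffer is
     * consumed before the next descriptor is issued. */
}
```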

5.3 Grouping and Aggregation (SQL 'Group By')

This SQL operation consists of grouping rows based on the values of certain columns (or, more generally, expressions) and then calculating aggregates (like sum and count) of a given list of expressions within each group. It can be processed efficiently using a hash table as long as the number of distinct groups is small enough [8]. Since the access pattern is random and the hash table size grows linearly with the number of distinct groups, ensuring locality of access is very important for performance, especially on the DPU architecture. On conventional architectures like x86, query processing systems rely on large multi-tier caches to hide the latency of accessing a large hash table; even on such systems there is no guarantee that a given section of the hash table will be cache-resident, and performance drops significantly when the hash table grows beyond a certain size.

Our query processing software is architected around careful partitioning of the data to ensure that each partition's data structures (such as the hash table, in the case of group-by) fit into the DMEM. This also guarantees single-cycle latency to access any part of the hash table, unlike a cache.

The process begins in the query compiler, where the DMEM space is allocated among input/output buffers, metadata structures and the hash table in a way that maximizes performance: the algorithm initially allocates a minimum amount of space to each structure, and then allocates the rest of the space incrementally using a greedy approach based on the performance gain of giving each structure extra space. Typically, input/output buffers do not benefit much beyond 0.5 KB (since the DMS can nearly saturate memory bandwidth at that point), so a large part of the DMEM space is allocated to the hash table.

The compiler then calculates the number of partitions needed to achieve that hash table size per partition. Partitioning is performed using a combination of hardware and/or software partitioning. Based on the number of columns involved, we can calculate the maximum number of software partitions that can be produced in one "round" (a round-trip through DRAM, reading the data in and writing it out as separate partitions) at a rate close to memory bandwidth; this limit exists because software partitioning internally uses DMEM buffers for each output partition. The number of rounds of partitioning required is then calculated, and partition operators are added to the query plan before the grouping operator (a sketch of this arithmetic follows below). At runtime, if a partition turns out to be larger than estimated, the execution engine can re-partition that partition's data as needed. In the last round, if the number of partitions is less than the number of cores, only hardware partitioning is needed; this is especially useful for moderately sized hash tables (larger than one core's DMEM but not larger than the combined DMEM of all cores), since no extra round-trip through DRAM is required.
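A sketch of the partition-planning arithmetic is shown below; the actual compiler uses richer cost heuristics, and the names and the DMEM budget parameter are ours.

```c
#include <stddef.h>

/* How many partitions are needed so that each partition's hash table fits in
 * the DMEM budget the compiler reserved for it? */
int partitions_needed(size_t est_hash_table_bytes, size_t dmem_hash_budget_bytes)
{
    return (int)((est_hash_table_bytes + dmem_hash_budget_bytes - 1)
                 / dmem_hash_budget_bytes);
}

/* How many software partitioning rounds (DRAM round-trips) does that take,
 * given the per-round fan-out that the DMEM output buffers allow? */
int software_rounds_needed(int partitions, int fanout_per_round)
{
    if (fanout_per_round < 2)
        fanout_per_round = 2;     /* guard against a degenerate fan-out */
    int rounds = 0;
    for (long produced = 1; produced < partitions; produced *= fanout_per_round)
        rounds++;
    return rounds;
}
```

As noted above, the final round can often be delegated to the DMS hardware partitioner when its fan-out is no larger than the number of cores, saving one DRAM round-trip.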
Partitioning also provides a natural way to parallelize the operation among the cores, since each core can usually operate on a separate partition. When the number of distinct groups is low, however, partitioning is neither necessary nor useful; in this case the input data is distributed equally among the cores, and a merge operator is added to the query plan after the grouping operator. Since the merge operator only works on already-aggregated data, its overhead is very low.

We evaluate group-by for two cases: a low number of distinct values (Low-NDV) and a high number of distinct values (High-NDV). In the Low-NDV case, both platforms process the operation at a rate close to memory bandwidth, so the improvement (6.7×) is primarily due to the DPU's higher memory bandwidth per Watt. In the High-NDV case, however, the data first needs to be partitioned on both platforms. Thanks to the DMS's hardware partitioning feature, the DPU needs only one round of partitioning whereas x86 needs two, so the improvement (9.7×) is higher in this case.
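To make the DMEM-resident design concrete, the sketch below shows a per-partition group-by with a sum aggregate; the table size, hash function and layout are illustrative stand-ins for the compiler-chosen configuration, and the code assumes partitioning has bounded the number of distinct groups below the table capacity.

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

#define HT_SLOTS 1024   /* sized to the DMEM space the compiler allocated */

typedef struct { uint64_t key; int64_t sum; uint8_t used; } ht_slot_t;

/* On the DPU this table lives in DMEM, so every probe is a single-cycle
 * scratchpad access rather than a potential cache miss. */
static ht_slot_t ht[HT_SLOTS];

void groupby_sum_partition(const uint64_t *keys, const int64_t *vals, size_t n)
{
    memset(ht, 0, sizeof(ht));
    for (size_t i = 0; i < n; i++) {
        size_t h = (size_t)((keys[i] * 0x9E3779B97F4A7C15ULL) >> 32) % HT_SLOTS;
        while (ht[h].used && ht[h].key != keys[i])
            h = (h + 1) % HT_SLOTS;            /* linear probing */
        ht[h].used = 1;
        ht[h].key  = keys[i];
        ht[h].sum += vals[i];
    }
    /* Rows stream in through DMS-filled input buffers; the aggregated groups
     * are written out (or merged across cores in the Low-NDV plan) afterwards. */
}
```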

5.4 HyperLogLog

The HyperLogLog algorithm [10] provides an efficient way to approximately count the number of distinct elements (the cardinality) in a large data collection with just a single pass over the data. Distinct counting is an important problem with applications across a variety of disciplines, from counting unique visitors to a website to common database operations (COUNT DISTINCT). HyperLogLog relies on a well-behaved hash function, which is used to build a maximum likelihood estimator by counting the maximum number of leading zeros (NLZ) seen in the hashes of the data values. This estimate is coarse-grained; its variance can be controlled by splitting the data into multiple subsets, computing the maximum NLZ for each subset, and using a harmonic mean across the subsets to estimate the cardinality of the whole collection. This also makes the algorithm easily parallelizable: each core computes the maximum NLZs for its subsets, followed by a merge phase at the end.

We optimize our implementation using a key observation: the properties of the hash function remain the same if we count the number of trailing zeros (NTZ) instead of the number of leading zeros (NLZ). The NTZ operation takes only 4 cycles on a dbCore, compared to 13 cycles for NLZ, thanks to hardware support for a popcount instruction. Instead of a static schedule, we partition the input set into multiple chunks and implement work stealing across cores using the ATE hardware atomics; the variable-latency multiplier on the dbCores makes this dynamic scheduling essential to avoid long tail latencies. We also use the DMS to read and write buffers at peak bandwidth. We use a 64-bit hash function, and we further optimize the x86 version with atomics for synchronization and SIMD intrinsics. The hash function is at the heart of the HyperLogLog algorithm, and we compare the performance of the DPU for two common hash functions, Murmur64 and CRC32. The DPU has hardware acceleration for CRC32, making the CRC implementation almost 9× better than the x86 implementation. The Murmur64 implementation does poorly on the DPU due to the high-latency multiplier (4-8 cycles), and we plan to address this in a future revision of our chip.
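A simplified rendering of the trailing-zero variant is shown below. It omits HyperLogLog's small- and large-range corrections, and hash64() is a stand-in for the CRC32/Murmur64 choices discussed above.

```c
#include <stdint.h>
#include <stddef.h>
#include <math.h>

#define HLL_P 12                       /* 2^12 = 4096 registers */
#define HLL_M (1u << HLL_P)

uint64_t hash64(const void *data, size_t len);   /* CRC32- or Murmur64-based, assumed */

/* Register update: bucket by the low HLL_P bits, rank by trailing zeros of
 * the remaining bits (the NTZ variant described in the text). */
void hll_add(uint8_t reg[HLL_M], const void *data, size_t len)
{
    uint64_t h    = hash64(data, len);
    uint32_t idx  = (uint32_t)(h & (HLL_M - 1));
    uint64_t rest = (h >> HLL_P) | (1ULL << (64 - HLL_P));  /* sentinel bounds the rank */
    uint8_t  rank = (uint8_t)(__builtin_ctzll(rest) + 1);
    if (rank > reg[idx])
        reg[idx] = rank;
}

/* Harmonic-mean estimate across the registers. */
double hll_estimate(const uint8_t reg[HLL_M])
{
    double sum = 0.0;
    for (uint32_t j = 0; j < HLL_M; j++)
        sum += ldexp(1.0, -(int)reg[j]);          /* 2^(-reg[j]) */
    double alpha = 0.7213 / (1.0 + 1.079 / HLL_M);
    return alpha * (double)HLL_M * (double)HLL_M / sum;
}
```

Per-core register arrays merge with an element-wise maximum, which is what the final merge phase does.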
pute disparity for a distinct shift in pixel of one of the images Apart from using the DMS for moving data from DRAM and finally aggregate the result across cores. Although this to individual dbcore memories, further optimizations were approach reduces the number of synchronization between required to comply with the requirements of parsing such as the cores, may not efficiently utilize the available memory returning of pointers to full, null-terminated records. The bandwidth. The result of this experiment is highly dependent straightforward approach of checking once for the end of on the architecture. For example, the cost of synchronization a DMS descriptor, followed by a subsequent scan for the and ability to orchesterate memory accesses. Low-latency terminal character causes inputs to be read twice. Instead, we synchronization primitives and the DMS played a vital role exploit the JSON grammar by adding an invalid byte at the in the RAPID architecture. end of the DMS buffer (like 0x00), force a parse error and [ht] invoke the appropriate next-state production rule with a single The four disparity vision kernel involves three distinct data- pass over the input. The DMS also triple-buffers the data in 8 access patterns as shown in the Figure-14. The most chal- KB chunks, with a padding size of 1 KB to avoid the chunk- lenging data-access patterns are the columnar and pixelated- straddling issue mentioned above. These optimizations allow pattern. The software-managed DMEM via DMS makes our DPU implementation to process the above dataset at 1.73 orchestrating these access pattern significantly easier on the GB/s using 32 dbcores, with an improvement of 8× in terms RAPID architecture. For instance, the most-challenging pixe- of performance/Watt over SAJSON. lated access pattern can be reduced into gathering pixels with two different strides into two sections of the DMEM scratch 5.6 Disparity memory. Similarly, each core could independently fetch its In this section, we evaluate a well-known computer vi- own column into the core-private scratch memory. sion workload that computes a disparity map [19]. Disparity We found that a tightly-coupled design of the RAPID archi- map provides detailed information on the relative distance tecture performed better under the fine-grained parallelism ap- and depth between objects in the image from two stereo im- proach and resulted in a 2.9x slow down to 16 core IvyBridge ages each taken with slightly different camera angles. In Xeon machines (Oracle X4-2) with the OpenMP-based paral- a nutshell, the disparity map workload involves computing lel implementation. This translates into 8.6x better perf/watt the pixel-wise difference between the two images by shift- on RAPID compared to the X86 counterpart. The perfor-

5.6 Disparity

In this section, we evaluate a well-known computer vision workload that computes a disparity map [19]. A disparity map provides detailed information on the relative distance and depth of objects in an image, computed from two stereo images taken at slightly different camera angles. In a nutshell, the disparity workload computes the pixel-wise difference between the two images after shifting one of the images by X pixels, where X varies from 0 to a given max_shift parameter. The resulting disparity map is an essential component of many machine vision applications such as autonomous cruising and pedestrian tracking. This well-studied workload is known to be data intensive [22], and its memory accesses need to be carefully orchestrated to efficiently utilize the available memory bandwidth. In addition, the disparity computation involves four distinct image kernels, each with different data access patterns that are not straightforward to parallelize across multiple cores. These are the two main challenges a developer faces in parallelizing this workload on any architecture.

To parallelize the disparity computation efficiently across multiple cores, we experimented with both fine-grained and coarse-grained parallelization approaches, each with different trade-offs. Under the fine-grained approach, we split the input images into distinct chunks or tiles of pixels, one per dbCore, and the cores cooperatively compute each disparity kernel in lockstep; this approach can require a non-trivial number of system-wide barriers for synchronization between the vision kernels. The coarse-grained approach, on the other hand, splits the work by having each core independently compute the disparity for a distinct pixel shift of one of the images and aggregating the results across cores at the end; although this reduces the synchronization between cores, it may not utilize the available memory bandwidth efficiently. Which approach wins is highly dependent on the architecture, for example on the cost of synchronization and the ability to orchestrate memory accesses; low-latency synchronization primitives and the DMS play a vital role on the RAPID architecture.

The four disparity kernels exhibit three distinct data access patterns, shown in Figure 14; the most challenging are the columnar and the pixelated patterns. The software-managed DMEM, filled by the DMS, makes orchestrating these access patterns significantly easier on the RAPID architecture. For instance, the most challenging pixelated access pattern can be reduced to gathering pixels with two different strides into two sections of the DMEM scratch memory, and each core can independently fetch its own column into its core-private scratch memory.

Figure 14: The three types of data access pattern in the disparity vision workload.
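The two DMEM-oriented access patterns can be modelled with plain strided copies, as sketched below; on the DPU these loops would instead be expressed as DMS gather descriptors that fill regions of the scratchpad, and the function names are ours.

```c
#include <stdint.h>
#include <stddef.h>

/* Gather `count` pixels starting at `base`, `stride` apart, into a DMEM
 * destination region (software model of one DMS gather descriptor). */
void gather_strided(const uint8_t *image, size_t base, size_t stride,
                    size_t count, uint8_t *dmem_dst)
{
    for (size_t i = 0; i < count; i++)
        dmem_dst[i] = image[base + i * stride];
}

/* Pixelated pattern: two gathers with different strides land in two
 * sections of the scratchpad. */
void fetch_pixelated(const uint8_t *image, size_t stride_a, size_t stride_b,
                     size_t count, uint8_t *dmem_a, uint8_t *dmem_b)
{
    gather_strided(image, 0, stride_a, count, dmem_a);
    gather_strided(image, 0, stride_b, count, dmem_b);
}

/* Columnar pattern: each core fetches its own column of a width x height
 * image into core-private scratch memory (stride = row width). */
void fetch_column(const uint8_t *image, size_t width, size_t height,
                  size_t col, uint8_t *dmem_col)
{
    gather_strided(image, col, width, height, dmem_col);
}
```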

We found that the tightly-coupled design of the RAPID architecture performed better under the fine-grained parallelism approach, which resulted in a 2.9× slowdown relative to a 16-core Ivy Bridge Xeon machine (Oracle X4-2) running an OpenMP-based parallel implementation. This translates into 8.6× better performance/Watt on RAPID compared to the x86 counterpart. The result of scaling from 1 to 32 threads is shown in Figure 15. This shows that RAPID is a versatile system which benefits other non-SQL data-intensive workloads as well.

Figure 15: Performance of the disparity workload on both x86 and RAPID: (a) shows the absolute runtime on each architecture; (b) shows scalability by plotting speedup over single-thread performance with an increasing number of threads. Both graphs show the fine- and coarse-grained parallelization approaches on RAPID.

6. CONCLUSION

Rapid increases in data sizes require a fundamental change in the way we tackle large scale analytics applications. We present the Database Processing Unit (DPU), a unique architecture focused on memory bandwidth achieved per Watt, created to address the challenges of big data. We implement a variety of applications on the DPU, ranging from JSON parsing to SQL operations, as well as a number of analytics workloads, and show an efficiency improvement of 5× to 15× in terms of performance/Watt. We note that this is a first step towards a larger rack-scale system, in which hundreds of thousands of DPUs are connected via a high bandwidth interconnect to provide an efficient solution to the exascale challenge.

7. REFERENCES

[1] S. R. Agrawal, C. M. Dee, and A. R. Lebeck, "Exploiting accelerators for efficient high dimensional similarity search," in Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '16). ACM, 2016, pp. 3:1-3:12.

[2] S. R. Agrawal, V. Pistol, J. Pang, J. Tran, D. Tarjan, and A. R. Lebeck, "Rhythm: Harnessing data parallel hardware for server workloads," in Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '14). ACM, 2014, pp. 19-34.

[3] C. Austin, "sajson: Single-allocation JSON parser," 2013. [Online]. Available: https://chadaustin.me/2013/01/single-allocation-json-parser

[4] N. L. Binkert, A. G. Saidi, and S. K. Reinhardt, "Integrated network interfaces for high-bandwidth TCP/IP," in Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XII). ACM, 2006, pp. 315-324.

[5] L. J. Cao, S. S. Keerthi, C.-J. Ong, J. Q. Zhang, U. Periyathamby, X. J. Fu, and H. P. Lee, "Parallel sequential minimal optimization for the training of support vector machines," IEEE Transactions on Neural Networks, vol. 17, no. 4, pp. 1039-1049, Jul. 2006.

[6] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, pp. 27:1-27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm

[7] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, "DaDianNao: A machine-learning supercomputer," in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-47). IEEE Computer Society, 2014, pp. 609-622.

[8] J. Cieslewicz and K. A. Ross, "Adaptive aggregation on chip multiprocessors," in Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB '07). VLDB Endowment, 2007, pp. 339-350.
[9] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, "Dark silicon and the end of multicore scaling," in Proceedings of the 38th Annual International Symposium on Computer Architecture (ISCA '11). ACM, 2011, pp. 365-376.

[10] P. Flajolet, É. Fusy, O. Gandouet, and F. Meunier, "HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm," in AofA: Analysis of Algorithms, ser. DMTCS Proceedings, vol. AH, P. Jacquet, Ed. Juan les Pins, France: Discrete Mathematics and Theoretical Computer Science, Jun. 2007, pp. 137-156.

[11] P. K. Gupta, "Accelerating datacenter workloads," 2016. [Online]. Available: http://www.fpl2016.org/slides/Gupta%20--%20Accelerating%20Datacenter%20Workloads.

[12] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, "EIE: Efficient inference engine on compressed deep neural network," in Proceedings of the 43rd International Symposium on Computer Architecture (ISCA '16). IEEE Press, 2016, pp. 243-254.

[13] T. H. Hetherington, M. O'Connor, and T. M. Aamodt, "MemcachedGPU: Scaling-up scale-out key-value stores," in Proceedings of the Sixth ACM Symposium on Cloud Computing (SoCC '15). ACM, 2015, pp. 43-57.

[14] T. H. Hetherington, T. G. Rogers, L. Hsu, M. O'Connor, and T. M. Aamodt, "Characterizing and evaluating a key-value store application on heterogeneous CPU-GPU systems," in Proceedings of the 2012 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS '12). IEEE Computer Society, 2012, pp. 88-98.

[15] Intel, "Advancing Moore's law in 2014 - the road to 14 nm," 2014. [Online]. Available: http://www.intel.com/content/www/us/en/silicon-innovations/advancing-moores-law-in-2014-presentation.html

[16] N. Jouppi, "Google supercharges machine learning tasks with TPU custom chip," 2016. [Online]. Available: https://cloudplatform.googleblog.com/2016/05/Google-supercharges-machine-learning-tasks-with-custom-chip.html

[17] M. Lichman, "UCI machine learning repository," 2013. [Online]. Available: http://archive.ics.uci.edu/ml

[18] K.-J. Lin and J. D. Gannon, "Atomic remote procedure call," IEEE Transactions on Software Engineering, vol. 11, no. 10, pp. 1126-1135, Oct. 1985.

[19] D. Marr and T. Poggio, "Cooperative computation of stereo disparity," in From the Retina to the Neocortex. Springer, 1976, pp. 239-243.

[20] M. M. A. Patwary, N. R. Satish, N. Sundaram, J. Park, M. J. Anderson, S. G. Vadlamudi, D. Das, S. G. Pudov, V. O. Pirogov, and P. Dubey, "Parallel efficient sparse matrix-matrix multiplication on multicore platforms," in International Conference on High Performance Computing. Springer, 2015, pp. 48-57.

[21] A. Putnam, A. M. Caulfield, E. S. Chung, D. Chiou, K. Constantinides, J. Demme, H. Esmaeilzadeh, J. Fowers, G. P. Gopal, J. Gray, M. Haselman, S. Hauck, S. Heil, A. Hormati, J.-Y. Kim, S. Lanka, J. Larus, E. Peterson, S. Pope, A. Smith, J. Thong, P. Y. Xiao, and D. Burger, "A reconfigurable fabric for accelerating large-scale datacenter services," in Proceedings of the 41st Annual International Symposium on Computer Architecture (ISCA '14). IEEE Press, 2014, pp. 13-24.

[22] S. K. Venkata, I. Ahn, D. Jeon, A. Gupta, C. Louie, S. Garcia, S. Belongie, and M. B. Taylor, "SD-VBS: The San Diego vision benchmark suite," in IEEE International Symposium on Workload Characterization (IISWC 2009). IEEE, 2009, pp. 55-64.

[23] C. Vinschen and J. Johnston, "The Red Hat newlib C library." [Online]. Available: https://sourceware.org/newlib/

[24] L. Wu, R. J. Barker, M. A. Kim, and K. A. Ross, "Navigating big data with high-throughput, energy-efficient data partitioning," in Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA '13). ACM, 2013, pp. 249-260.
