pLUTo: In-DRAM Lookup Tables to Enable Massively Parallel General-Purpose Computation

João Dinis Ferreira§ Gabriel Falcao† Juan Gómez-Luna§ Mohammed Alser§ Lois Orosa§ Mohammad Sadrosadati‡§ Jeremie S. Kim§ Geraldo F. Oliveira§ Taha Shahroodi§ Anant Nori⋆ Onur Mutlu§
§ETH Zürich †IT, University of Coimbra ‡Institute for Research in Fundamental Sciences ⋆Intel

arXiv:2104.07699v1 [cs.AR] 15 Apr 2021

Data movement between main memory and the processor is a significant contributor to the execution time and energy consumption of memory-intensive applications. This data movement bottleneck can be alleviated using Processing-in-Memory (PiM), which enables computation inside the memory chip. However, existing PiM architectures often lack support for complex operations, since supporting these operations increases design complexity, chip area, and power consumption.

We introduce pLUTo (processing-in-memory with lookup table (LUT) operations), a new DRAM substrate that leverages the high area density of DRAM to enable the massively parallel storing and querying of lookup tables (LUTs). The use of LUTs enables the efficient execution of complex operations in-memory, which has been a long-standing challenge in the domain of PiM. When running a state-of-the-art binary neural network in a single DRAM subarray, pLUTo outperforms the baseline CPU and GPU implementations by 33× and 8×, respectively, while simultaneously achieving energy savings of 110× and 80×.

1. Introduction

For decades, DRAM has been the predominant technology for manufacturing main memory, due to its low cost and high capacity. Despite recent efforts to create technologies to replace it [64,69,75,96,117], DRAM is expected to continue to be the de facto main memory technology for the foreseeable future. However, despite its high density, DRAM's latency and bandwidth have not kept pace with the rapid improvements to processor core speed. This divide creates a bottleneck to system performance, which has become increasingly limiting in recent years due to the rapidly growing sizes of the working sets used by many modern applications [25,63,95]. Indeed, recent surveys show that the movement of data between main memory and the processor is responsible for up to 60% of the energy consumption of modern memory-intensive applications [25, 30, 79, 90, 102, 121, 124, 127].

Processing-in-Memory (PiM) is a promising paradigm that aims to alleviate this data movement bottleneck. In a PiM-enabled device, the system's main memory is augmented with some form of compute capability [46,48,89,90,116]. This augmentation both 1) alleviates computational pressure from the CPU, and 2) reduces the movement of data between main memory and the CPU.

Recent works divide DRAM-based PiM architectures [48] into two categories: 1) Processing-near-Memory (PnM), where computation occurs near the memory array [33, 59, 78, 84, 103, 107], and 2) Processing-using-Memory (PuM), where computation occurs within the memory array, by exploiting intrinsic properties of the memory technology [31, 44, 82, 108, 110].

In PnM architectures, data is transferred from the DRAM to nearby processors or specialized accelerators, which are either 1) a part of the DRAM chip, but separate from the memory array [33], or 2) integrated into the logic layer of 3D-stacked memories [59,84]. PnM enables the design of flexible substrates that support a diverse range of operations. However, PnM architectures are limited in functionality and scalability: the design and fabrication of memory chips that integrate specialized processing units (such as [33]) has proven to be challenging, and 3D-stacked memories are bound by strict thermal and area limitations.

In contrast, PuM architectures enable computation to occur within the memory array. Impactful works in this domain have proposed mechanisms for the execution of bitwise operations (e.g., AND/OR/XOR) [44,108,110], arithmetic operations [31,32,82,119], and basic LUT-based operations [32,45]. Operations in PuM are usually performed between multiple memory rows. This fact, combined with the use of vertical data layouts and bit-serial computing algorithms [4, 35, 115], enables a very high degree of parallelism, since there can be as many execution lanes as there are bits in each memory row. However, the flexibility of PuM architectures is limited by the range of operations they support, and by the difficulty of using these operations to express higher-level algorithms of interest with sufficiently low latency and energy costs.

We leverage LUT-based computing through the use of LUT-based operations that synergize well with existing work to enable more complex PuM-based functions. This allows pLUTo to perform a wider range of operations than prior works, while enjoying similar performance and energy efficiency.

Our goal is to enable the execution of complex operations in-memory with simple changes to commodity DRAM that synergize well with available PuM-based operations [28, 31, 68, 82, 110]. To this end, we propose pLUTo: processing-in-memory with lookup table (LUT) operations, a DRAM substrate that enables massively parallel in-DRAM LUT queries. pLUTo extends current PuM-enabled DRAM substrates [31, 82] by integrating a novel LUT-querying mechanism that can be used to more efficiently perform arithmetic operations (e.g., multiplication, division), transcendental functions (e.g., binarization, exponentiation), and access precomputed results (e.g., memoization, LUT queries in cryptographic algorithms).

pLUTo stands out from prior works by being the first work to enable the massively parallel bulk querying of LUTs inside the DRAM array, which is our main contribution. pLUTo's careful design enables these LUTs, which can be stored and queried directly inside memory, to express complex operations (e.g., multiplication, division, transcendental functions, memoization) and enables two critical LUT-based capabilities: 1) the querying of LUTs of arbitrary size and 2) the pipelining of LUT operations, which significantly synergize with and enhance existing PuM mechanisms (e.g., [28,82,110]). Furthermore, LUTs are an integral component of many widespread algorithms, including AES, Blowfish, RC4, and CRC and Huffman codes [40, 86, 87, 118, 126].

We evaluate pLUTo's performance on a number of workloads against CPU-, GPU-, and PnM-based baselines. Our evaluations show that pLUTo consistently outperforms the considered baselines, especially when normalizing for area overhead. We also show that LUT-based computing is an efficient paradigm to execute bulk bitwise, arithmetic, and transcendental functions (e.g., binarization, exponentiation) with high throughput and energy efficiency. For example, pLUTo outperforms existing PuM designs [32,82,110] by up to 3.5× in execution time for XOR and XNOR bitwise operations.

In this paper, we make the following contributions:
• We introduce pLUTo, a PuM substrate that enables new lookup table operations. These operations synergize well with available PuM-based operations to enable more complex operations that are commonly used in modern applications.
• We propose three designs for pLUTo with different trade-offs in area cost, energy efficiency, and performance, depending on the system designer's needs.
• We evaluate pLUTo using a set of real-world cryptography, image processing, and neural network workloads. We compare against state-of-the-art GPU implementations and find that pLUTo outperforms the baseline CPU and GPU implementations by up to 33× and 8×, respectively, while simultaneously achieving energy savings of 110× and 80×.

2. Background

In this section we describe the hierarchical organization of DRAM and provide an overview of relevant prior work.

2.1. DRAM Background

A DRAM chip contains multiple memory banks (8 for DDR3, 16 for DDR4), and I/O circuitry. Each memory cell comprises an access transistor and a capacitor, which stores a single bit (0 or 1) in the form of stored electrical charge. The memory cell transistor connects the capacitor to the bitline wire. Each bitline is shared by all the memory cells in a column, and connects them to a sense amplifier. The set of sense amplifiers in a subarray makes up the local row buffer.

Figure 1: The internal organization of DRAM banks.

Reading and writing data in DRAM occurs over three phases: 1) Activation, 2) Reading/Writing, 3) Precharging. During Activation, the wordline of the accessed row is driven high. This turns on the row's access transistors and creates a path for charge to be shared between each memory cell and its bitline. This charge sharing process induces a fluctuation (δ) in the voltage level of the bitline, which is originally set at VDD/2. If the cell is charged, the bitline voltage becomes VDD/2 + δ. If the cell is discharged, the bitline voltage becomes VDD/2 − δ. To read the value of the cell, the sense amplifiers in the local row buffer amplify the fluctuation (±δ) induced in the bitline during Activation. Simultaneously, the desired charge level is restored to the capacitor in the memory cell. After reading, the data is sent to the host CPU through the DRAM chip's I/O circuitry and the system memory bus. During Precharging, the access transistors are turned off, and the voltage level of all the bitlines is reset to VDD/2. This ensures the correct operation of subsequent activations.

2.2. DRAM Extensions

pLUTo optimizes key operations by incorporating the following previous proposals for enhanced DRAM architectures.

Inter-Subarray Data Copy. The LISA-RBM (Row Buffer Movement) operation, introduced in [28], copies the contents of a row buffer to the row buffer of another subarray, without making use of the external memory channel. This is achieved by linking neighboring subarrays with isolation transistors. LISA-RBM commands are issued by the memory controller. The total area overhead of LISA is 0.8%.

Subarray-Level Parallelism. MASA [68] is a mechanism that introduces support for subarray-level parallelism by
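To make the PuM parallelism argument above concrete, the following sketch models bit-serial computing over a vertical data layout, in which bit i of every element sits in the same "row" (bit plane) and one row-wide bitwise operation advances all lanes at once. This is an illustrative software model, not the paper's hardware mechanism; the element width and lane count are assumptions for the example.

```python
# Illustrative model of bit-serial addition over a vertical data layout.
# Each bit plane is a Python integer whose bits are the parallel lanes,
# so one bitwise operation on a plane processes every lane at once.

NBITS = 8          # element width (assumed for illustration)
LANES = 16         # lanes per row (a real DRAM row has thousands)
MASK = (1 << LANES) - 1

def to_planes(values):
    """Transpose a list of integers into NBITS bit planes (vertical layout)."""
    return [sum(((v >> i) & 1) << lane for lane, v in enumerate(values)) & MASK
            for i in range(NBITS)]

def from_planes(planes):
    """Transpose bit planes back into per-lane integers."""
    return [sum(((planes[i] >> lane) & 1) << i for i in range(NBITS))
            for lane in range(LANES)]

def bitserial_add(a_planes, b_planes):
    """Ripple-carry addition using only row-wide AND/OR/XOR operations."""
    carry = 0
    out = []
    for a, b in zip(a_planes, b_planes):
        out.append(a ^ b ^ carry)                    # sum bit plane
        carry = (a & b) | (a & carry) | (b & carry)  # carry bit plane
    return out  # NBITS planes; overflow wraps modulo 2**NBITS

a = list(range(16))
b = [3] * 16
result = from_planes(bitserial_add(to_planes(a), to_planes(b)))
# result[i] == (a[i] + b[i]) % 256 in every lane
```

Note that the addition costs one pass per bit of element width, independent of the number of lanes, which is why wide memory rows yield such high throughput.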
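The LUT-based computing paradigm that pLUTo builds on can be sketched in software: an arbitrary 8-bit function is precomputed once into a 256-entry table, after which "computing" the function over a data row reduces to one lookup per element. In pLUTo the queries happen in bulk inside the DRAM subarray; here the table and the bulk query are plain Python, and the cube function is an assumed example of an operation that is awkward for bitwise-only PuM but trivial as a lookup.

```python
# Illustration of LUT-based computing: precompute once, then query in bulk.

def build_lut(func, width=8):
    """Precompute func over the whole input domain (2**width entries)."""
    return [func(x) % (1 << width) for x in range(1 << width)]

# Example "complex" operation: x**3 mod 256.
cube_lut = build_lut(lambda x: x ** 3)

def bulk_query(lut, row):
    """Apply the LUT to every element of a data row.

    pLUTo performs this step inside the memory array, with one lane
    per element; this loop only models the result.
    """
    return [lut[x] for x in row]

row = [0, 1, 2, 5, 255]
print(bulk_query(cube_lut, row))  # -> [0, 1, 8, 125, 255]
```

The same pattern covers the use cases named above: multiplication and division tables, binarization and exponentiation, and the precomputed S-box-style tables used by AES, Blowfish, and CRC.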
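The Activation/Sensing/Precharging sequence described in Section 2.1 can be captured with a toy charge-sharing model. The capacitance and voltage values below are assumptions chosen only to make the ±δ fluctuation visible; real DRAM parameters differ.

```python
# Toy numeric model of a DRAM access (illustration only; VDD and the
# capacitances are assumed values, not real device parameters).

VDD = 1.2         # supply voltage in volts (assumed)
C_BITLINE = 80.0  # bitline capacitance in fF (assumed)
C_CELL = 20.0     # cell capacitance in fF (assumed)

def activate(cell_charged: bool) -> float:
    """Charge sharing: the cell and the precharged bitline settle at a
    common voltage, VDD/2 + delta (charged cell) or VDD/2 - delta."""
    v_cell = VDD if cell_charged else 0.0
    v_bitline = VDD / 2  # bitline starts precharged
    return (C_BITLINE * v_bitline + C_CELL * v_cell) / (C_BITLINE + C_CELL)

def sense(v_bitline: float) -> int:
    """The sense amplifier amplifies the +/- delta fluctuation to full
    swing; the full-swing level also restores the open cell's charge."""
    return 1 if v_bitline > VDD / 2 else 0

def precharge() -> float:
    """Reset the bitline to VDD/2 for the next activation."""
    return VDD / 2

delta = activate(True) - VDD / 2   # = (C_CELL / (C_BITLINE + C_CELL)) * VDD/2
assert sense(activate(True)) == 1 and sense(activate(False)) == 0
```

With these assumed values, δ is only 0.12 V, which illustrates why the sense amplifier is needed: the cell perturbs the much larger bitline capacitance by a small fraction of VDD/2, and sensing both resolves the bit and restores the destructively-read cell.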