arXiv:2104.07699v1 [cs.AR] 15 Apr 2021

pLUTo: In-DRAM Lookup Tables to Enable Massively Parallel General-Purpose Computation

João Dinis Ferreira§, Gabriel Falcao†, Juan Gómez-Luna§, Mohammed Alser§, Lois Orosa§, Mohammad Sadrosadati‡§, Jeremie S. Kim§, Geraldo F. Oliveira§, Taha Shahroodi§, Anant Nori⋆, Onur Mutlu§

§ETH Zürich   †IT, University of Coimbra   ⋆Intel   ‡Institute for Research in Fundamental Sciences

Abstract

Data movement between main memory and the processor is a significant contributor to the execution time and energy consumption of memory-intensive applications. This bottleneck can be alleviated using Processing-in-Memory (PiM), which enables computation inside the memory chip. However, existing PiM architectures often lack support for complex operations, since supporting these operations increases design complexity, chip area, and power consumption.

We introduce pLUTo (processing-in-memory with lookup table (LUT) operations), a new DRAM substrate that leverages the high area density of DRAM to enable the massively parallel storing and querying of lookup tables (LUTs). The use of LUTs enables the efficient execution of complex operations in-memory, which has been a long-standing challenge in the domain of PiM. When running a state-of-the-art binary neural network in a single DRAM subarray, pLUTo outperforms the baseline CPU and GPU implementations by 33× and 8×, respectively, while simultaneously achieving energy savings of 110× and 80×.

1. Introduction

For decades, DRAM has been the predominant manufacturing technology for main memory, due to its low cost and high capacity. Despite recent efforts to create technologies to replace it [64, 69, 75, 96, 117], DRAM is expected to continue to be the de facto main memory technology for the foreseeable future.

However, despite its high density, DRAM's latency and bandwidth have not kept pace with the rapid improvements to processor core speed. This divide creates a bottleneck to system performance, which has become increasingly limiting in recent years due to the rapidly growing sizes of the working sets used by many modern applications [8, 30, 79, 102, 121, 124, 127]. Indeed, recent surveys show that the movement of data between the main processor and the memory is responsible for up to 60% of the energy consumption of modern memory-intensive applications [25, 90].

Processing-in-Memory (PiM) is a promising paradigm that aims to alleviate this data movement bottleneck. In a PiM-enabled device, the system's main memory is augmented with some form of compute capability [33, 46, 48, 63, 89, 95, 116]. This augmentation both 1) alleviates computational pressure from the CPU, and 2) reduces the movement of data between main memory and the CPU.

Recent works divide DRAM-based PiM architectures in two categories: 1) Processing-near-Memory (PnM) [44, 59, 84], where computation occurs near the memory array, and 2) Processing-using-Memory (PuM) [28, 31, 110], where computation occurs within the memory array.

In PnM architectures, data is transferred from the DRAM array to nearby processors or specialized accelerators, which are either 1) a part of the DRAM chip, but separate from the memory array [59, 84], or 2) integrated into the logic layer of 3D-stacked memories [44, 78, 103]. PnM enables the design of flexible substrates that support a diverse range of operations. However, PnM architectures are limited in functionality and scalability: the design and fabrication of memory chips that integrate specialized processing units has proven to be challenging, and 3D-stacked memories are bound by thermal and strict area limitations.

In contrast, PuM architectures enable computation to occur within the memory array, by exploiting the intrinsic properties of the memory technology [28, 31, 32, 110]. Impactful works in this domain have proposed mechanisms for the execution of bitwise operations (e.g., AND/OR/XOR) [82, 110], arithmetic operations [31, 82], and basic LUT-based operations [32, 119]. Operations in PuM are usually performed between rows in the memory array [31, 82, 110]. This fact, combined with the use of vertical data layouts and bit-serial computing algorithms, enables a very high degree of parallelism, since there can be as many execution lanes as there are bits in each memory row. However, the flexibility of PuM architectures is limited by the range of operations they support, and by the difficulty of expressing higher-level algorithms of interest using these operations with sufficiently low latency and energy costs.

Our goal is to enable the execution of complex operations in-memory with simple changes to commodity DRAM that synergize well with available PuM-based operations. To this end, we propose pLUTo (processing-in-memory with lookup table (LUT) operations), a DRAM substrate that enables massively parallel in-DRAM LUT queries. pLUTo extends current PuM-enabled DRAM substrates [28, 31, 32, 82] by integrating a novel LUT-querying mechanism that can be used to more efficiently perform arithmetic operations (e.g., multiplication, division), transcendental functions (e.g., binarization, exponentiation), and access precomputed results (e.g., memoization, LUT queries in cryptographic algorithms). We leverage LUT-based computing through the use of LUT-based operations that synergize well with existing work to enable more complex PuM-based functions. This allows pLUTo to perform a wider range of operations than prior works, while enjoying similar performance and energy efficiency metrics.

pLUTo stands out from prior works by being the first work to enable the massively parallel bulk querying of LUTs inside the DRAM array, which is our main contribution. pLUTo's careful design enables these LUTs, which can be stored and queried directly inside memory, to express complex operations (e.g., multiplication, division, transcendental functions, memoization) and enables two critical LUT-based capabilities: 1) the querying of LUTs of arbitrary size and 2) the pipelining of LUT operations, which significantly synergize with and enhance existing PuM mechanisms (e.g., [28, 82, 110]). Furthermore, LUTs are an integral component of many widespread algorithms, including AES, Blowfish, RC4, and CRC and Huffman codes [40, 86, 87, 118, 126].

We evaluate pLUTo's performance on a number of workloads against CPU-, GPU-, and PnM-based baselines. Our evaluations show that pLUTo consistently outperforms the considered baselines, especially when normalizing to area overhead. We also show that LUT-based computing is an efficient paradigm to execute bulk bitwise, arithmetic and transcendental functions (e.g., binarization, exponentiation) with high throughput and energy efficiency. For example, pLUTo outperforms existing PuM designs [32, 82, 110] by up to 3.5× in the execution time for XOR and XNOR bitwise operations.

In this paper, we make the following contributions:
• We introduce pLUTo, a PuM substrate that enables new lookup table operations. These operations synergize well with available PuM-based operations to enable more complex operations that are commonly used in modern applications.
• We propose three designs for pLUTo with different levels of trade-offs in area cost, energy efficiency, and performance, depending on the system designer's needs.
• We evaluate pLUTo using a set of real-world cryptography, image processing and neural network workloads. We compare against state-of-the-art GPU implementations and find that pLUTo outperforms the baseline CPU and GPU implementations by up to 33× and 8×, respectively, while simultaneously achieving energy savings of 110× and 80×.

2. Background

In this section we describe the hierarchical organization of DRAM and provide an overview of relevant prior work.

2.1. DRAM Background

A DRAM chip contains multiple memory banks (8 for DDR3, 16 for DDR4) and I/O circuitry. Each bank is further divided into subarrays, which are two-dimensional arrays of memory cells. The DRAM subarrays in a bank share peripheral circuitry, such as the row decoders. Each cell contains one capacitor and one access transistor. The capacitor encodes a single bit (0 or 1) in the form of stored electrical charge. The memory cell transistor connects the capacitor to the bitline wire. Each bitline is shared by all the memory cells in a column, and connects them to a sense amplifier. The set of sense amplifiers in a subarray makes up the local row buffer.

Figure 1: The internal organization of DRAM banks.

Reading and writing data in DRAM occurs over three phases: 1) Activation, 2) Reading/Writing, and 3) Precharging. During Activation, the wordline of the accessed row is driven high. This turns on the row's access transistors and creates a path for charge to be shared between each memory cell and its bitline. This charge sharing process induces a fluctuation (δ) in the voltage level of the bitline, which is originally set at VDD/2. If the cell is charged, the bitline voltage becomes VDD/2 + δ. If the cell is discharged, the bitline voltage becomes VDD/2 − δ. To read the value of the cell, the sense amplifiers in the local row buffer amplify the fluctuation (±δ) induced in the bitline during Activation. Simultaneously, the desired charge level is restored to the capacitor in the memory cell. After reading, the data is sent to the host CPU through the DRAM chip's I/O circuitry and the system memory bus. During Precharging, the access transistors are turned off, and the voltage level of all the bitlines is reset to VDD/2. This ensures the correct operation of subsequent activations.

2.2. DRAM Extensions

pLUTo optimizes key operations by incorporating the following previous proposals for enhanced DRAM architectures.

Inter-Subarray Data Copy. The LISA-RBM (Row Buffer Movement) operation, introduced in [28], copies the contents of a row buffer to the row buffer of another subarray, without making use of the external memory channel. This is achieved by linking neighboring subarrays with isolation transistors. LISA-RBM commands are issued by the memory controller. The total area overhead of LISA is 0.8%.

Subarray-Level Parallelism. MASA [68] is a mechanism that introduces support for subarray-level parallelism by overlapping the latency of memory accesses directed to different subarrays. The total area overhead of MASA is 0.15%.

Bitwise Operations. Ambit [110] introduces support for bulk bitwise logic operations between memory rows in a DRAM subarray by leveraging the analog principles of operation of DRAM cells. Ambit's key idea is to activate three DRAM rows simultaneously (using a triple-row activation operation) to perform a bitwise majority function across the three rows.

Shifting. DRISA [82] introduces support for intra-lane shifting in DRAM. Using this mechanism, the contents of a memory row may be shifted by 1 or 8 bits at a time, with a cost of one Activate-Activate-Precharge (AAP) command sequence.

3. Motivation

Our goal in this work is to introduce support for the in-memory execution of general-purpose operations. To this end, we propose pLUTo, a mechanism that augments the well-established DRAM technology with support for general-purpose operations by enabling the in-place querying of LUTs.

DRAM Is a Prominent Memory Technology. Despite recent advances in emerging memory technologies [64, 69, 75, 96, 117], DRAM is expected to continue to provide superior density at lower cost in the immediate future. In recent years, the increasingly high bandwidth of 3D-stacked DRAM variants (e.g., HBM [78] and HMC [103]) has led to their widespread adoption in specific domains, such as GPU memory [93, 94]. pLUTo modifies the DRAM subarrays to introduce in-memory compute capabilities, and is thus well-suited for both 2D and 3D-stacked DRAM.

LUTs Enable General-Purpose Computing. LUTs can be used to replace complex computations with less costly lookup queries. Any deterministic operation applied on a finite input set, regardless of its complexity, can be expressed using LUTs. Common use cases of LUTs include the computation of arithmetic and transcendental functions, as well as other complex operations. pLUTo is a scalable solution that supports LUTs with an arbitrary number of elements, limited in practice only by the DRAM capacity.

PuM Is Promising but Has Limitations. State-of-the-art PuM architectures [32, 82, 110] provide very high throughput and energy efficiency, due to the massive reductions in data movement they provide. However, these approaches, which make only minor changes to the structure of the memory array, are only able to support a very limited range of operations. To address this limitation in the potential range of applications for PuM, we position pLUTo as an extension to these prior works, by leveraging their best features (i.e., high parallelism, reduced need for data movement) and addressing their main drawbacks (i.e., the reduced range of supported operations). pLUTo achieves this by supporting more complex operations through the use of LUT-based computing.

4. pLUTo: A DRAM-based PuM Substrate

To enable a wide variety of complex functions with Processing-using-Memory (PuM) at low cost, we introduce pLUTo, a new DRAM substrate that enables PuM through the time- and energy-efficient querying of lookup tables (LUTs). pLUTo supports the following new LUT-based functions: 1) querying LUTs of arbitrary size, and 2) pipelining LUT queries. These LUTs exploit the massive parallelism of DRAM's internal structures, and the new functions are enabled with minimal changes to the DRAM array. pLUTo synergizes well with available PuM-based operations [28, 68, 82, 110] to enable complex functions (described in Section 4.2) with high throughput and efficiency.

Figure 2: Main components of pLUTo (a); LUT layout inside a pLUTo-enabled subarray (b); pLUTo-BSA design (c).

4.1. pLUTo Architecture

pLUTo's design enables flexibility in its implementation in DRAM, as it only requires small changes in the subarrays that are desirable for pLUTo-enabled operations. These small changes only enhance the functionality of the subarrays, allowing them to be accessed as standard DRAM arrays in addition to supporting pLUTo's operations. This leads to a natural trade-off between area overhead and the number of subarrays that can be pLUTo-enabled. The number of pLUTo-enabled subarrays can be defined on a system-by-system basis, depending on a) the system's design requirements, and b) how many subarrays the system's design constraints allow.

To enhance a DRAM subarray with pLUTo operations, we change the row buffer and row decoders, and build a unit, dubbed match logic, to compare the values of each entry in the source row with the value of the currently activated row and to identify matches. Figure 2 (a) shows a high-level overview of all the DRAM structures required to support pLUTo's operation. Figure 2 (b) shows the vertical layout used to store LUTs in the pLUTo-enabled subarray.
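To make the LUT-based computing paradigm of Section 3 concrete, the following minimal Python sketch (our illustration, not part of the pLUTo hardware or its simulator; the name build_lut is ours) shows how any deterministic function over a finite input domain can be replaced by a table that is built once and then only queried:

```python
# Minimal software analogue of LUT-based computing (Section 3):
# any deterministic function over a finite input domain can be
# replaced by a precomputed table that is only built once.

def build_lut(func, input_bits):
    """Precompute func(x) for every possible input_bits-wide value."""
    return [func(x) for x in range(2 ** input_bits)]

# Example: an 8-bit thresholding function (image binarization).
binarize_lut = build_lut(lambda x: 0 if x < 128 else 255, input_bits=8)

# At query time, computation is replaced by indexing. In pLUTo,
# this indexing step is what executes in bulk inside the DRAM
# array, one lookup per element of a memory row.
pixels = [3, 130, 127, 255]
out = [binarize_lut[p] for p in pixels]
assert out == [0, 255, 0, 255]
```

The table-build cost is paid once and amortized over all subsequent queries, which is the same trade-off pLUTo exploits when it loads a LUT into a pLUTo-enabled subarray.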

Figure 3: A pLUTo LUT Query: (a) shows a LUT containing the first four prime numbers, (b) shows a simplified view of a pLUTo-enabled subarray, and (c) steps of the pLUTo LUT Query. In this example, the pLUTo LUT Query returns the i-th prime number, for each element in the input row.

4.1.1. pLUTo-enabled Row Buffer. To enhance the DRAM row buffer, we connect one additional flip-flop (FF) to every sense amplifier in the row buffer via a switch (as shown in Figure 2 (c)). The set of additional FFs constitutes an FF buffer. If a switch is enabled by the matchline signal, data in the sense amplifier is copied to the corresponding FF. As we can drive each matchline signal independently, data in the row buffer can be partially written to the FF buffer. This capability enables more complex data management operations such as gather-scatter, which we use to enable our new LUT-based operations.

4.1.2. pLUTo-enabled Match Logic. As shown in Figure 2 (a), we implement match logic between subarrays that consists of an array of two-input comparators, which compare 1) the index of the currently activated row in a pLUTo-enabled subarray with 2) an element in another source subarray. Each comparator outputs N matchline signals, where N is the bit width of each LUT entry. Each matchline connects to the switch belonging to the corresponding FF in the pLUTo-enabled row buffer. If the two inputs of the comparator match, each of the N output matchlines from the comparator is driven high. Otherwise, each of the N output matchlines is driven low.

4.1.3. pLUTo-enabled Row Decoder. To enhance the DRAM row decoder, we enable a new functionality: row sweep. Row sweep is similar to the self-refresh operation [73], which already exists in many commodity DRAM designs and quickly activates many rows sequentially. Row sweep extends this functionality to quickly activate an arbitrary number of rows sequentially. The latency of the row sweep operation is equal to (tRAS + tRP) × N, where tRAS is the time that must elapse between an activate and a precharge command, tRP is the latency that must elapse between a precharge command and the next activation to the bank, and N is the total number of rows swept. This operation enables quickly matching and copying data across many rows.

4.2. pLUTo-enabled Operations

The architectural changes we make to each DRAM subarray enable various PuM operations that synergize with available PuM operations to provide complex functions. These include, but are not necessarily limited to, 1) lookup queries within a DRAM subarray and 2) lookup queries spanning multiple DRAM subarrays. We next describe how pLUTo's architecture enables these operations.

4.2.1. Querying LUTs of Arbitrary Size. pLUTo's design enables efficient operations that mimic the lookup table (LUT) query operation, illustrated in Figure 3. We refer to pLUTo's implementation of the LUT query as the pLUTo LUT Query.

To explain the process of the pLUTo LUT Query, we use a small pLUTo LUT storing four prime numbers {2, 3, 5, 7} at the respective indices {0, 1, 2, 3}, as shown in Figure 3 (a). This LUT is stored in a pLUTo-enabled subarray such that each row i contains repeated elements of the LUT at index i, as shown in Figure 3 (b). An application can store the LUT in DRAM in a one-time step for any number of future queries, amortizing the performance cost of initially storing the LUT. pLUTo then performs the massively-parallelizable pLUTo LUT Query with an example input vector of LUT indices {1, 0, 1, 3} in three steps. First, the memory controller loads the input indices from a source subarray (not shown) into the source row buffer, as shown in Figure 3 (b). Second, the memory controller issues a row sweep operation to quickly activate all rows storing the LUT. After each back-to-back row activation, the match logic checks for matches between the elements of the source row buffer (i.e., the input vector of LUT indices) and the index of the currently activated row in the pLUTo-enabled subarray. If there is a match (see step 1 in Figure 3 (c)), then the contents of the activated row at the corresponding matching locations are loaded into the destination row buffer (see step 2 in Figure 3 (c)). Third, the results of the pLUTo LUT Query are copied to the destination row using a LISA-RBM command (see step 3 in Figure 3 (c)), as described in Section 2.2.

pLUTo supports querying LUTs that have as many entries as the total number of rows across all pLUTo-enabled subarrays. However, if a LUT has more entries than the number of rows in a subarray (typically 512 rows [121]), additional measures must be taken.
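The following Python sketch (ours, not the authors' simulator) captures the functional behavior of the pLUTo LUT Query in Figure 3: a row sweep over the LUT rows, per-lane match logic against the activated row index, and a destination row buffer that collects the matched values. Timing and the final LISA-RBM copy are not modeled.

```python
# Functional model of the pLUTo LUT Query in Figure 3 (software
# sketch only; one subarray, one value per LUT row).

def pluto_lut_query(source_row, lut_rows):
    """source_row: list of LUT indices; lut_rows[i]: the value
    stored (repeated across the row) at LUT index i."""
    dest_row_buffer = [None] * len(source_row)
    # Row sweep: activate every LUT row back-to-back.
    for activated_index, lut_value in enumerate(lut_rows):
        # Match logic: compare each source element with the index
        # of the currently activated row; copy on a match.
        for lane, src in enumerate(source_row):
            if src == activated_index:        # matchline driven high
                dest_row_buffer[lane] = lut_value
    return dest_row_buffer                    # then copied via LISA-RBM

# LUT storing the first four primes at indices 0..3 (Figure 3 (a)).
primes = [2, 3, 5, 7]
assert pluto_lut_query([1, 0, 1, 3], primes) == [3, 2, 3, 7]
```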

Large LUTs. Queries to a LUT that contains more entries than the number of rows in a subarray can be performed by querying multiple subarrays in parallel and aggregating the results of these subqueries for efficiency. In this case, each element of the source row specifies both 1) the index of the target subarray and 2) the index of the target row in the target subarray. For example, assume the input source row contains n 10-bit elements, which are used to query a LUT with 1024 entries spread across two 512-row subarrays. Each 10-bit element encodes the S most significant bits as the target subarray index and the R least significant bits as the target row index. In our example, S = 1 bit and R = 9 bits (shown in Figure 4).

Figure 4: Addressing scheme for large LUTs spread across multiple subarrays (two subarrays, in this example).

In order to prepare the source rows, the subarray and row indices must be separated using a combination of row duplication operations and bitwise operations with pre-computed bit masks. Section 4.3.1 describes how pLUTo supports efficient bitwise operations. After preparing the source rows, pLUTo processes the pLUTo LUT Query in all subarrays containing the LUT elements in parallel, according to the subarray indices in the source row. The row indices are then used as in the standard pLUTo LUT Query described above. After the completion of the pLUTo LUT Query in each subarray containing LUT elements, pLUTo must combine the results generated by each subarray into a single result.

4.2.2. Selective Bit Match Logic. pLUTo enables masking the row index inputs to the match logic such that the match logic only compares unmasked bits in the two input values. This feature enables LUTs to be represented in a compressed form for certain applications, which in turn reduces the number of row activations required in a pLUTo LUT Query, and thereby reduces the overall latency and energy costs of performing a pLUTo LUT Query.

This feature is useful in processes such as image binarization, where 8-bit values are quantized into either 0 or 255. In a LUT implementation of image binarization, every possible 8-bit input value would require a corresponding output value. Input values less than 128 would map to an output value of 0, and input values greater than or equal to 128 would map to an output value of 255. Without masking, the pLUTo LUT Query would have to scan every entry in the table to fully binarize the image. However, the bit masking feature enables the table to be represented by 2 entries, where 'X' bits are masked: 1) input row indices matching the binary form '0XXXXXXX' map to an output value of 0, and 2) input row indices matching '1XXXXXXX' map to an output value of 255. This effectively reduces the number of row activations required in the pLUTo LUT Query from 256 to 2, providing a 128× improvement in both performance and energy.

4.3. Synergies With Available PuM Operations

pLUTo synergizes well with available PuM operations proposed in prior work. In this section, we describe the complex functions that are enabled when using available PuM operations in conjunction with the pLUTo architecture. These complex instructions improve pLUTo's performance in a wider variety of workloads.

4.3.1. Arithmetic Operations. pLUTo can perform arithmetic operations in two ways: 1) the LUT-based approach, which can implement any operation, and 2) the Kogge-Stone Adder (KSA) approach, which enables efficient operations on large operands.

LUT-based Arithmetic. pLUTo can perform any operation on two input values with a carefully constructed LUT. The LUT takes as input two concatenated operands and returns as output the result of the operation. For example, the addition of the binary values '1010' and '0101' can be represented in a LUT by storing the result ('00001111') in the LUT at index '10100101'.

KSA-based Arithmetic. pLUTo enables addition using the Kogge-Stone Adder (KSA) approach demonstrated in prior work [31, 82]. KSA performs carry-look-ahead addition with only XOR, AND, and shifting operations. pLUTo can also perform subtraction (two's complement) by either 1) a single pLUTo LUT Query, or 2) an Ambit-NOT [110] operation and an increment operation.

4.3.2. Subarray-Level Parallelism. pLUTo's performance can be further improved by utilizing the parallelism available across subarrays, as demonstrated in MASA [68] (described in Section 2.2). pLUTo leverages MASA to operate on multiple subarrays in parallel. The degree of parallelism that can be achieved in practice is limited by the JEDEC DRAM Standard [61, 62], which specifies a limit for the rate of issuing activate commands. This limit is dictated by the tFAW timing parameter, which indicates the time window during which at most four activate commands can be issued, per DRAM rank. Therefore, parallel activations to different subarrays in the same rank are limited. Since this constraint serves to prevent deterioration of the DRAM reference voltage, tFAW may be reduced with the use of, for example, more powerful charge pumps. DRAM manufacturers have been able to reduce the value of tFAW in DDR3 chips in recent years [88].
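The concatenated-operand construction from Section 4.3.1 is easy to see in software. The sketch below (our illustration; build_arith_lut is a hypothetical helper, not part of pLUTo) builds such a LUT for 4-bit operands and reproduces the addition example above:

```python
# Software sketch of LUT-based arithmetic (Section 4.3.1): a LUT
# indexed by two concatenated n-bit operands returns the result of
# an arbitrary two-input operation. Illustrative code only.

def build_arith_lut(op, operand_bits):
    size = 2 ** (2 * operand_bits)        # one entry per operand pair
    lut = [0] * size
    for a in range(2 ** operand_bits):
        for b in range(2 ** operand_bits):
            lut[(a << operand_bits) | b] = op(a, b)
    return lut

add4 = build_arith_lut(lambda a, b: a + b, operand_bits=4)

# '1010' (10) concatenated with '0101' (5) forms index '10100101';
# that entry holds '00001111' (15), matching the example in the text.
assert add4[0b10100101] == 0b00001111
```

Note that the LUT size grows as 2^(2n) for n-bit operands, which is why pLUTo pairs this approach with the large-LUT addressing scheme of Section 4.2.1 and, for wide operands, with the KSA approach.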

4.3.3. Pipelined Operation Mode (pLUTo-POM). pLUTo can be further improved by chaining pLUTo LUT Query operations (similar to SIMD operations in the CPU) for increased throughput and performance. We use Figure 5 to illustrate how pLUTo can chain operations (pLUTo-POM) using a collection of pLUTo LUTs. In this example, pLUTo-POM performs the chain of functions f2(f1(x)) using five subarrays: one source subarray, which contains an input vector; two pLUTo-enabled subarrays, which contain pLUTo LUTs representing f1() and f2(); one intermediate storage subarray, which contains the results of f1(x); and one destination subarray, which will contain the final result of f2(f1(x)).

Figure 5: Overview of the pLUTo Pipelined Operation Mode (pLUTo-POM). An example with a two-stage pipeline is shown. The left-to-right flow indicates temporal progression. Input data flows from top to bottom as it is processed.

pLUTo-POM performs the following steps: 1) The first input row (Row 0) is activated and copied into the source subarray's row buffer. 2) A pLUTo LUT Query operation is executed in the f1() subarray. The results of f1(Row 0) are now available in the row buffer of the f1() subarray. 3) The intermediate results are copied from the row buffer of the f1() subarray to the row buffer of the intermediate storage subarray, using LISA-RBM, and stored in a row of the intermediate storage subarray. At this point, the output of f1() can be inputted to f2(), and the next row of the source subarray (Row 1) can be inputted to f1(). 4) Two independent pLUTo LUT Query operations, corresponding to f1(Row 1) and f2(f1(Row 0)), begin simultaneously in the f1() and f2() subarrays, respectively. At this point, pLUTo-POM has reached a steady state of operation. 5) When any pending pLUTo LUT Query operations finish, LISA-RBM operations copy the results to the intermediate storage subarray and the destination subarray, respectively.

4.4. Alternative pLUTo Architectures

In order to offer more flexibility to the system architect, who is limited by design constraints, we provide two additional variants of the pLUTo architecture with different trade-offs in throughput, area efficiency, and energy efficiency: 1) pLUTo-GSA (Gated Sense Amplifier) and 2) pLUTo-GMC (Gated Memory Cell). We refer to the original design as pLUTo-BSA (Buffered Sense Amplifier). Table 1 qualitatively tabulates the trade-offs for each of the designs. Each design can largely be used in the same way as pLUTo-BSA, but we explain the subtleties associated with each design next.

Table 1: Qualitative comparison of the pLUTo design variants.

                    pLUTo-GSA        pLUTo-BSA  pLUTo-GMC
Area Efficiency     High             Medium     Low
Throughput          Low              Medium     High
Energy Efficiency   Low              Medium     High
Destructive Reads   Yes              No         No
Data Loading        After every use  Once       Once

Figure 6: pLUTo-GSA (a) and pLUTo-GMC (b) designs.

4.4.1. pLUTo-GSA: Gated Sense Amplifier. pLUTo-GSA provides superior area efficiency over pLUTo-BSA, at the expense of reduced throughput and energy efficiency. pLUTo-GSA differs from pLUTo-BSA in its row buffer design and implementation of the row sweep operation.

pLUTo-GSA Row Buffer. Each sense amplifier in the row buffer of pLUTo-GSA only includes a switch that is controlled by the matchline signal. The switch connects the sense amplifier to the bitline (as shown in Figure 6 (a)). When the switch is enabled, the sense amplifier is able to sense the value on the bitline. Since pLUTo-GSA does not use an FF buffer to aggregate data, this design has a smaller area overhead than pLUTo-BSA. However, row activations performed in pLUTo-GSA are destructive for cells connected to bitlines that are not attached to their respective sense amplifiers (i.e., those whose matchline is driven low by the match logic). Since this means that activations performed during the row sweep operation are potentially destructive, LUTs must always be loaded into the pLUTo-enabled subarrays prior to executing a pLUTo-GSA LUT Query operation.

Row Sweep Operation. Since row activations are destructive in pLUTo-GSA, the row sweep operation does not require issuing a precharge command following each activation. Instead, it is possible to issue a single precharge command at the end of the row sweep operation. The total time required to perform a row sweep in pLUTo-GSA is therefore equal to tRC × N + tRP, where tRC is the minimum enforced time between two consecutive activate commands, tRP is the precharge time, and N is the total number of rows swept. This is about half the time required for a row sweep operation in pLUTo-BSA.
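The two row sweep cost models can be compared directly. The sketch below uses the formulas from Sections 4.1.3 and 4.4.1; the timing values are placeholder assumptions chosen to be consistent with the "about half" claim, not values taken from the paper or from a specific DDR4 datasheet:

```python
# Row sweep latency models (Sections 4.1.3 and 4.4.1). All timing
# values below are assumed example values for illustration only.

T_RAS = 32.0   # ns, ACT-to-PRE delay (assumed)
T_RP  = 14.2   # ns, PRE-to-ACT delay (assumed)
T_RC  = 23.0   # ns, assumed ACT-to-ACT spacing when per-row
               # precharges are skipped (pLUTo-GSA/GMC sweeps)

def sweep_bsa(n_rows):
    # pLUTo-BSA: every activation is followed by a precharge.
    return (T_RAS + T_RP) * n_rows

def sweep_gated(n_rows):
    # pLUTo-GSA/GMC: back-to-back activations, one final precharge.
    return T_RC * n_rows + T_RP

n = 512  # rows in a typical subarray
print(sweep_bsa(n), sweep_gated(n))  # gated sweep ~ half of BSA here
```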

4.4.2. pLUTo-GMC: Gated Memory Cell. pLUTo-GMC provides superior throughput and energy efficiency over pLUTo-BSA, at the cost of increased area overhead. pLUTo-GMC differs from pLUTo-BSA in its DRAM cell design, row buffer design, and implementation of the row sweep operation.

pLUTo-GMC DRAM Cell. pLUTo-GMC implements 2T1C memory cells, instead of the conventional 1T1C DRAM cell design (which is used in both pLUTo-BSA and pLUTo-GSA). The additional transistor in each 2T1C memory cell is controlled by the matchline, as shown in Figure 6 (b). This enables fine-grained control of which cells in an activated row share charge with their respective bitlines. This significantly contributes to minimizing the energy consumption of this design, but also requires a higher area overhead.

pLUTo-GMC Row Buffer. pLUTo-GMC places an additional switch between the sense amplifier and its enable signal. This switch is enabled by the matchline signal (shared by cells attached to the same bitline). This switch ensures that the sense amplifier does not drive a value on the bitline if the cell is not attached to the bitline. This both enables pLUTo-GMC to perform back-to-back activations without a precharge (explained next), and saves on energy costs associated with enabling the sense amplifier.

Row Sweep Operation in pLUTo-GMC. Due to two key features of our pLUTo-GMC design, this design is able to perform the row sweep operation almost twice as fast as pLUTo-BSA, by using back-to-back activations without precharging. First, a sense amplifier is only enabled when there is a match in the corresponding match logic. This means that activations only disturb bitlines whose associated matchline signals are driven high, and the remaining bitlines are maintained at the nominal bitline voltage level (in the precharged state). Second, since a LUT query only has one match in a LUT, the sense amplifier is only enabled for a single row activation during an entire pLUTo LUT Query. Therefore, we can guarantee that back-to-back activates will not open the gating transistors of any two cells sharing the same bitline, and thus will not destroy the data in the cell. As in pLUTo-GSA, the total time required to perform a row sweep in pLUTo-GMC is equal to tRC × N + tRP. In addition, due to the additional gating transistors, pLUTo-GMC does not destroy the data in the LUTs.

5. System Integration of pLUTo

In this section, we detail the integration of pLUTo in a system with a host processor. pLUTo has the same address and command interface as conventional DRAM, and can therefore be directly connected onto the memory bus. The pLUTo controller is an enhanced memory controller, with support for the commands that pLUTo requires (e.g., LISA-RBM, row sweep). The pLUTo controller should additionally support issuing shifting and bitwise operations.

5.1. ISA Support

An application can initiate a pLUTo LUT Query by issuing a pluto_op instruction:

pluto_op(src, dst, lut_subarr, lut_size, lut_bitw)

where src is the address of the source row, and dst is the address of the destination row. lut_subarr is the address of the pLUTo-enabled subarray where the pLUTo LUT Query is executed. lut_size is the number of entries of the LUT, i.e., the number of rows to be swept. lut_bitw specifies the bit width of the LUT entries. The pluto_op instruction always operates at the granularity of DRAM rows. For this reason, operating on an S-byte input requires ⌈S / DRAM row size⌉ pluto_op instructions.

5.2. Generating and Loading LUTs

LUTs must be generated before use. This can be achieved in one of three ways:

Generating LUTs From Scratch. The first time a LUT is generated, all of its values must be computed from scratch. The entries of the LUT must have the same bit width as the elements in the source and destination subarrays. To ensure this, it may be necessary to zero-pad the elements in the source row. For example, in order to be compatible with 6-bit LUT entries of the form 'L5L4L3L2L1L0', 4-bit input elements of the form 'i3i2i1i0' should be zero-padded as '00i3i2i1i0'.

Loading LUTs From Memory. If a required LUT is already stored in a DRAM subarray, it is possible to quickly copy the LUT to a pLUTo-enabled subarray using a sequence of LISA-RBM commands.

Loading LUTs From Secondary Storage. In case the LUT is stored in secondary storage, it is possible to copy its contents into memory by performing a direct memory access (DMA) operation. Even though accessing secondary storage is relatively time- and energy-consuming, this may be more efficient than recalculating the LUT from scratch. Such LUTs can be generated by a program at compile-time and stored together with the generated binary file. Later, at launch time, these LUTs can be loaded into main memory together with the application code. Furthermore, it is possible for the application to request the allocation of a greater or smaller number of subarrays on which to store LUTs, based on measures of computational intensity which can be obtained at runtime. For this reason, loading LUTs does not degrade application performance any more than the process of bringing the program instructions and data into main memory does. In addition, the LUTs have the potential to reduce the total volume of data exchanged between main memory and the CPU. We estimate that the time spent loading a LUT relative to the time spent in computation for that same LUT will be less than 1% for datasets of at least 50 MB.
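The row-granularity rule of Section 5.1 implies a simple host-side dispatch loop. The following sketch is our hypothetical illustration of such a wrapper: pluto_op is the ISA primitive named in the text, but its software binding, the wrapper name, and the address arithmetic here are assumptions, not a defined driver API.

```python
# Host-side sketch of issuing pluto_op over an arbitrarily sized
# input (Section 5.1). The wrapper and addresses are illustrative.

import math

DRAM_ROW_SIZE = 8192  # bytes per row (DDR4 configuration, Table 2)

def pluto_op(src, dst, lut_subarr, lut_size, lut_bitw):
    """Placeholder for the pluto_op instruction; in a real system
    this would be issued to the pLUTo memory controller."""
    pass

def pluto_query_buffer(src, dst, lut_subarr, nbytes, lut_size, lut_bitw):
    # An S-byte input needs ceil(S / row size) pluto_op instructions,
    # since pluto_op always operates on whole DRAM rows.
    for i in range(math.ceil(nbytes / DRAM_ROW_SIZE)):
        off = i * DRAM_ROW_SIZE
        pluto_op(src + off, dst + off, lut_subarr, lut_size, lut_bitw)

# e.g., a 50 KB input with a 256-entry, 8-bit LUT -> 7 pluto_op calls
pluto_query_buffer(src=0x0, dst=0x100000, lut_subarr=0x200000,
                   nbytes=50 * 1024, lut_size=256, lut_bitw=8)
```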

6. pLUTo Evaluation

We first describe our methodology for evaluating the variations of pLUTo against CPU and GPU baselines.

6.1. Methodology

We evaluate the three proposed designs of pLUTo (pLUTo-GSA, pLUTo-BSA and pLUTo-GMC) on each workload. Unless stated otherwise, our implementations assume the parallel operation of 16 subarrays for DDR4 pLUTo and 512 subarrays for 3D-stacked pLUTo, which is based on HMC memory [103]. The reason for this disparity in the number of subarrays operating in parallel is that the row buffer in the 3DS memory is much smaller than the row buffer in DDR4. Because of this key difference, these values for subarray-level parallelism in the two memory technologies provide the same effective parallelism at the operation level, and should therefore be considered comparable design points. Additionally, we perform a sensitivity study where the number of subarrays operating in parallel is one of {1, 16, 256, 2048} subarrays for DDR4 pLUTo, and one of {512, 8192} subarrays for 3D-stacked pLUTo. We compare the performance of each pLUTo configuration against three baselines: 1) a state-of-the-art CPU implementation, 2) a state-of-the-art GPU implementation, and 3) an implementation in a near-data processing (NDP) accelerator.

Evaluation Frameworks. We evaluate the CPU and GPU baselines on real systems. For the evaluation of the near-data processing (NDP) baseline, we simulate an HMC-based system [59] with the characteristics described in Table 2. For the implementation of the in-situ DRAM-based system, we augment the simulated model for the NDP baseline with support for bulk bitwise operations, as described in [110], and shifting, as described in [82]. Since these operations are also supported by pLUTo, comparing our proposal with this simulated system clearly highlights the benefits that result from the introduction of the pLUTo LUT Query operation. We simulate various configurations of pLUTo on both DDR4 [62] and HMC [59] memory models using a custom-built simulator (which we plan to release under an open source license). We evaluate the energy consumption and area overhead of pLUTo configurations using CACTI 7 [22] DDR4 and HMC models. Table 2 shows the main parameters that we use in our evaluations.

Table 2: System configuration for simulations.

Parameter               Configuration
Processor               Intel Xeon Gold 5118
GPU                     NVIDIA GeForce RTX 2080 Ti
Last-level Cache        64-Byte cache line, 8-way set-associative, 16.5 MB
Main Memory             DDR4 2400 MHz, 8 GB, 1 channel, 1 rank, 4 bank groups, 4 banks per bank group, 512 rows per subarray, 8 KB per row
Main Memory Timings     17-17-17 (14.16 ns)
Near-Data Processor     HMC model [59], 1.25 GHz on-die core clock, 10 W on-die core TDP
(NDP)
In-Situ DRAM-Based      Same as NDP + support for bitwise operations [110]
Bulk Bitwise Accelerator    and bit shifting [82]

Workloads. Table 3 shows the names and characteristics of the workloads that we analyze. We consider a number of workloads spanning the domains of cryptography, image processing and neural networks. Many of the workloads require the execution of operations which cannot be executed trivially using existing Processing-in-Memory designs, and which pLUTo implements using LUT-based computing. Examples of these operations whose implementation is challenging include 1) direct LUT queries to substitution tables, as required by cipher algorithms such as Salsa20 and VMPC, 2) polynomial division, required by the CRC algorithm, and 3) image binarization, which in prior bulk bitwise accelerator proposals requires several bit masking steps and long sequences of bitwise logical operations. Unless stated otherwise, we assume that the necessary LUTs are already available in pLUTo-enabled subarrays at the start of simulation.

Table 3: Workloads Evaluated.

Name                                      Parameters
Vector Addition, LUT-based [122]          Element width: 4 bits
Vector Point-Wise Multiplication [122]    Q format: Q1.7, Q1.15
Bitwise Logic                             # LUT entries: 4
Bit Counting                              Element width: 8 bits; 256-entry LUTs and 16-entry Short LUTs
CRC-8/16/32 [125]                         Packet size: 128 bytes
Salsa20 [23], VMPC [138]                  Packet size: 512 bytes
Image Binarization                        936000 pixels; static threshold: 50%
Color Grading                             936000 pixels; 8-bit to 8-bit

6.2. Performance

Figure 7 plots the absolute speedups relative to the CPU, GPU, and NDP baselines for the considered pLUTo design points. We make four key observations. First, we observe that pLUTo-BSA and pLUTo-GMC outperform the GPU baseline on average; we also note that pLUTo-GSA achieves comparable performance to the GPU. These observations hold for both DDR4- and 3DS-based pLUTo designs. Second, we observe that the 3DS implementations of pLUTo consistently outperform the DDR4 implementations, although both provide improvements over the baseline that are on the same order of magnitude. Third, we observe that the workload that least benefits from pLUTo is VMPC. This workload relies heavily on memory accesses, and the operands between consecutive steps of the algorithm are highly dependent. For this reason, memory is a bottleneck both for execution on conventional architectures and in pLUTo. Nevertheless, the pipelined querying of the LUTs used by the algorithm still allows pLUTo to pull ahead of the baselines and show performance gains. Fourth, we observe that the CRC workloads show the least overall benefit from executing in pLUTo. The speedup in these workloads is bottlenecked by a reduction step, which must be performed serially either in the host CPU, in the case of the 2D implementation of pLUTo, or on the logic layer of the HMC, in the case of pLUTo-3DS. Nevertheless, the acceleration of the parallel portion of these workloads still allows nearly all pLUTo design points to achieve performance comparable to that of the GPU baseline.

Figure 7: Speedup of GPU and pLUTo relative to the baseline CPU. pLUTo parallelizes operations across 16 subarrays. The y-axis uses a logarithmic scale, where higher is better.

Figure 8 shows the speedup per unit area of each pLUTo configuration, relative to the CPU and GPU baselines. Area values for the CPU/GPU baselines refer to total chip area for each of these devices. We make three key observations. First, all pLUTo designs outperform the GPU on average. This improvement is considerably greater than the one observed in Figure 7, when normalized to the area of each design. Second, the performance improvement of pLUTo-3DS is less noticeable in this plot. This leads to the observation that the performance of pLUTo is roughly proportional to the area of the memory technology in which it is implemented, at least for the well-established DDR4 and HMC memory technologies considered in our evaluation. Third, we especially observe improvements for the most memory-intensive workloads, which for this set are Salsa20 and VMPC. This demonstrates the benefit of performing computation in-memory for workloads that are especially data-intensive.

Figure 8: Normalized speedup of GPU and pLUTo relative to the baseline CPU. pLUTo parallelizes operations across 16 subarrays. The y-axis uses a logarithmic scale, where higher is better.

6.2.1. Analysis of Maximum Internal Bandwidth. In order to study the impact of parallelizing pLUTo operations on performance, we examine the maximum internal data bandwidth that can be achieved with pLUTo when using the most commonly-used commodity DRAM technologies (DDR3/4/5) and several widely-used high-bandwidth memories (GDDR6, HMC2, HBM3). The theoretical maximum bandwidth for each of these technologies is dictated by their timing parameters and plotted in Figure 9. As DRAM timing parameters are largely limited by power constraints, the maximum bandwidth is fixed for each of these DRAM standards. However, since pLUTo takes advantage of several mechanisms that exploit parallelism at the granularity of subarrays (typically 512 rows), pLUTo's bandwidth improves with increasing DRAM capacity. pLUTo's theoretical maximum bandwidth is calculated under the assumption that every subarray can be operated on in parallel, and therefore that the internal maximal memory bandwidth increases linearly with DRAM capacity (number of subarrays). For a 16 GB DRAM, this bandwidth is 508 GB/s, 1017 GB/s and 2027 GB/s, for DDR4-based pLUTo-GSA, pLUTo-BSA and pLUTo-GMC, respectively. On the upper end, a 64 GB DRAM would achieve a bandwidth greater than 10 TB/s with pLUTo-GMC. This contrasts with the available bandwidth in conventional systems, which is typically around 20−60 GB/s for DDR3/4/5 memories, and around 400−600 GB/s for high-bandwidth memories [59, 60, 78, 92, 94, 103, 135].

Figure 9: Theoretical maximum bandwidth for DRAM (DDR3, DDR4, DDR5), GPU memory (GDDR6, HMC2, HBM3), and pLUTo-GSA, pLUTo-BSA and pLUTo-GMC (4-64 GB).

6.2.2. Throughput with Subarray-Level Parallelism. In order to present a more realistic internal bandwidth when using pLUTo, we evaluate the three pLUTo designs (i.e., pLUTo-GSA, pLUTo-BSA, pLUTo-GMC) with varying degrees of subarray-level parallelism for both DDR4 and 3D-stacked memory. Figure 10 plots the speedups (averaged across all evaluated workloads) of each configuration against the baseline CPU. We make three observations. First, due to the different row sizes in DDR4 (i.e., 8 KB) and 3D memory (i.e., 256 B), DDR4 provides higher speedup than 3D memory when utilizing the same degree of subarray-level parallelism. As discussed in Section 6.1, a fair direct comparison can only be made between the configuration pairs {DDR4-16, HMC-512} and {DDR4-256, HMC-8192}. Second, performance scaling is very close to proportional to the number of subarrays operating in parallel in both cases. The reason that we do not observe the maximum theoretical performance that would result from increasing the number of subarrays by a given factor is that there is an additional overhead associated with in-memory data movement when computation is partitioned across multiple subarrays. Third, performance is approximately constant when normalized to area across all pLUTo designs. This validates the potential scalability of pLUTo to operate in as many subarrays in parallel as the memory technology (i.e., DDR4, HMC) supports. We perform an evaluation of the effect of one limiting factor to the subarray parallelism scaling (the tFAW timing parameter in DDR4) in Section 6.2.3.
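The "comparable design points" claim in Sections 6.1 and 6.2.2 reduces to simple arithmetic: the number of bytes exposed per sweep step (row size × parallel subarrays) is equal across the paired configurations. A quick check, using only the row sizes stated in the text:

```python
# Sanity check of the comparable configuration pairs in Sections
# 6.1 and 6.2.2: both technologies expose the same number of bytes
# (i.e., execution lanes) per parallel row sweep step.

DDR4_ROW_BYTES = 8192   # 8 KB rows (Table 2)
HMC_ROW_BYTES  = 256    # 256 B rows (Section 6.2.2)

for ddr4_subarrays, hmc_subarrays in [(16, 512), (256, 8192)]:
    ddr4_lanes = DDR4_ROW_BYTES * ddr4_subarrays
    hmc_lanes  = HMC_ROW_BYTES * hmc_subarrays
    assert ddr4_lanes == hmc_lanes   # 128 KB and 2 MB, respectively
    print(ddr4_subarrays, hmc_subarrays, ddr4_lanes)
```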

Figure 10: Measured geometric mean speedup relative to CPU for pLUTo configurations utilizing varying degrees of subarray-level parallelism.

6.2.3. Impact of tFAW on Performance. As explained previously, the tFAW timing parameter is another limiter of the activation rate in a DRAM chip to meet power constraints. Since the activation operation is central to pLUTo, it is important to evaluate the impact of this parameter on performance. While we discussed methods for reducing this constraint (in Section 4.3.2), we consider how tFAW would affect pLUTo's performance in the case that tFAW cannot be relaxed. The results shown in Figures 7, 8 and 10 are obtained without consideration for any power constraints on the rate of activating DRAM rows (i.e., tFAW = 0). Figure 11 shows the effects of varying tFAW (between 0% and 100% of its nominal value) in a commodity DDR4 memory module on the performance of a single pLUTo LUT Query, across our examined workloads.

Figure 11: The impact of different values of tFAW on pLUTo's performance.

We make two key observations. First, the loss in performance is only around 10% when tFAW is 50% of its nominal value, and only around 20% when tFAW is set to its nominal value (requiring no relaxation of power constraints). Even considering this performance penalty, the performance results of pLUTo would still outperform the CPU baseline, and be comparable to the GPU baseline. Second, we note that the performance penalties are very similar across all of the considered workloads, for the same value of tFAW. Despite the limited impact of tFAW on pLUTo, the use of more powerful charge pumps could further relax power constraints and therefore reduce the required value of tFAW in pLUTo-capable DRAMs, bringing the actual performance results closer to the ones reported in Figures 7 and 8.

6.2.4. Comparison with Prior PiM Mechanisms. Figure 12 shows a comparison between a specialized state-of-the-art in-situ DRAM bitwise accelerator, with support for bitwise logic operations as described in [110] and bit shifting as described in [82], and implementations of pLUTo-GMC in DDR4 and HMC memory technologies. pLUTo speedups are in the range of 0.8× to 12× for the DDR4 implementation, and 1.1× to 17× for the HMC implementation, with geometric mean speedups of 2.0× and 2.8×, respectively. We make three key observations. First, pLUTo consistently outperforms these specialized accelerator designs for basic low-bit-width arithmetic operations and other bit-level operations. Second, the improved timing parameters of the 3DS-based implementation yield considerable gains over its DDR4 counterpart for single-subarray operation. Third, pLUTo is able to exactly match the considered prior works in the execution of bitwise operations in the DDR4 case, and to outperform them in the 3DS variant: whereas DDR4 matches the performance of the prior works in bitwise operations, pLUTo-3DS exceeds it. Note that the implementation of these operations in pLUTo is entirely LUT-based, and consists of looking up the values in a LUT with 4 elements, i.e., all possible 2-bit values.

Figure 12: Comparison of pLUTo-GMC-{DDR4, 3DS} with prior state-of-the-art specialized in-situ DRAM bitwise accelerator proposals.

6.3. Energy Efficiency

Figure 13 shows the absolute energy costs of executing each of the evaluated workloads on different pLUTo configurations. The energy cost is independent from the level of parallelism considered, since the overall number of row activations is constant. For this reason, we report the energy consumption for the CPU and GPU baselines and for the combined 2D and 3D pLUTo designs. We make three key observations. First, the average energy consumption of the GPU implementation is higher than pLUTo but lower than pLUTo-3DS. This is because HMC memory is comprised of smaller rows, which increases the overall number of activations required and, consequently, the overall energy consumption. Second, we observe that pLUTo is able to outperform the GPU for most simple arithmetic operations, but begins to fall short as the complexity of these operations increases (e.g., Matrix-Vector Multiplication). This trend is consistent with our observations relative to workload performance discussed earlier in this section, and the increased energy consumption for these workloads arises directly from their performance scaling. Third, the use of smaller LUTs as described in Section 4.2.1 to implement the bit count workload (BC-SL vs. BC) translates into significant energy savings, roughly proportional to the reduction in the number of LUT entries.

Figure 13: Energy consumption of CPU, GPU and pLUTo. pLUTo parallelizes operations across 16 subarrays. The y-axis uses a logarithmic scale, where lower is better.
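Returning to the tFAW constraint analyzed in Section 6.2.3: its effect can be sketched as a cap on sustained activations per rank. The model and all timing values below are our assumed illustration of the constraint described in the text, not measurements or parameters from the paper.

```python
# Sketch of the tFAW constraint (Sections 4.3.2 and 6.2.3): at most
# four ACT commands may be issued per rank in any tFAW window.
# All timing values are assumed examples.

T_FAW = 30.0  # ns, example window for four activations (assumed)
T_RC  = 46.0  # ns, example ACT-to-ACT spacing per subarray (assumed)

def act_rate(parallel_subarrays, t_faw=T_FAW, t_rc=T_RC):
    """Sustained activations per ns: demand from parallel row
    sweeps, capped by the four-ACTs-per-tFAW power constraint."""
    demand = parallel_subarrays / t_rc   # one sweep per subarray
    cap = 4.0 / t_faw                    # tFAW ceiling (per rank)
    return min(demand, cap)

# A single-subarray sweep sits well below the cap; many parallel
# sweeps saturate it, which is why relaxing tFAW (e.g., stronger
# charge pumps) matters mainly for subarray-level parallelism.
print(act_rate(1), act_rate(16))
```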

10 CPU GPU pLUTo−GSA pLUTo−BSA for each of a set of operations of interest, under each of the

pLUTo−GMC pLUTo−GSA−3DS pLUTo−BSA−3DS pLUTo−GMC−3DS

    × architectures mentioned above.

log 9

×

6 In the implementation of all algorithms we assume the use © 1 × 1 × 103 of ideal data layouts for all designs. This enables the report-

0

Energy (pJ) 1 × 10 ing of the best-case achievable performance for each design.

"

 

 !

 













V

 I

CRC−8 CRC− CRC−32 S GMEAN ColorGrade For example, in the case of bitwise operations between in- put sets A (’a1a2...’) and B (’b1b2...’), under the LUT-based Figure 13: Energy consumption of CPU, GPU and pLUTo. paradigm all input operands are ideally shuffled (i.e., laid out pLUTo parallelizes operations across 16 subarrays. The y-axis as ’a1b1a2b2...’), and for all prior PuM designs input sets A uses a logarithmic scale, where lower is better. and B are ideally stored in two separate memory rows. We loads arises directly from their performance scaling. Third, note that changes at the system level are required to support the use of smaller LUTs as described in Section 4.2.1 to imple- the ideal data mapping schemes that maximize pLUTo’s per- ment the bit count workload (BC-SL vs. BC) translates into formance. To ensure as fair a comparison as possible, the significant energy savings, roughly proportional to the reduc- memory capacity for each of the designs was chosen to en- tion in the number of LUT entries. sure that the area overheads for all designs remain in a nar- row range that is similar to the typical overhead of commod- 6.4. Area Overhead ity DRAM devices. We observe that DRISA, the most perfor- mant of the previous approaches, requires a substantial de- Table 4 shows a breakdown of the estimated area overheads, crease in area density: it is only possible to store up to 2GB per DRAM component. The rationale for the estimated over- of data for a similar chip area, compared to 8GB in all three heads of each of the three designs follows. other approaches considered here. pLUTo-BSA. We estimate that the sense amplifier switch and We draw three key conclusions from Table 5. First, we ob- FF (shown in Figure 2 (c)) incur a 60% area overhead for the serve that, due to their complexity, some operations cannot sense amplifiers. The total overhead of pLUTo-BSA is 16.7% be implemented in a time-efficient manner at all using any of of the DRAM chip area. the prior designs. Examples of such operations include bina- pLUTo-GMC. The estimated area overhead per 2T1C DRAM rization and exponentiation. In pLUTo, it is only possible to cell (shown in Figure 6 (a)) is 25%. The total area overhead of perform exponentiation when operating on small bit-widths pLUTo-GMC is 23.1% of the DRAM chip area. (for best results, using input operands with 8 bits or less), but pLUTo-GSA. The estimated area overhead of the switch this can be done with high efficiency. Second, we observe (shown in Figure 6 (b)) is 20% of the area of a sense amplifier, that pLUTois able to performbitwise logic operations at rates per bitline. The total area overhead of pLUTo-GSA is 10.2% that match or exceed those of all prior works. This is made of the DRAM chip area. possible by employing the data mapping assumption intro- Table 4: Area breakdown of the three designs of pLUTo (GSA, duced earlier in this section. This result is consequential since BSA, GMC). The area overheads for an unmodified DRAM chip are also shown. it shows that, with proper data alignment, LUT-based com- puting is able to outperform even highly specialized designs. DRAM pLUTo-GSA pLUTo-BSA pLUTo-GMC

) DRAM Cell 45.23 45.23 45.23 56.53 Third, we observe that pLUTo consistently outperforms all 2 Local WL Driver 12.45 17.06 17.06 17.06 three other approaches for most of the considered operations, mm

( Sense Amplifier 11.39 13.67 18.23 11.39 Other 1.16 1.49 1.49 1.49 in performance(absolute and normalized to area) as well as in

Area 77.44 82.00 86.47 Total 70.23 energy efficiency. This is possible due to the fact that pLUTo (+10.2%) (+16.7%) (+23.1%) operates much like conventional DRAM, and thus incurs a similar power footprint. This improvement is not universal: 6.5. Comparison With Other Works for instance, pLUTo lags behind all three baselines in the case As discussed in Section 3, prior PuM architectures (e.g., [32,82, of 4-bit addition. However, we note that even in cases where 110]) achieve very high throughput and energy efficiency, but it is not the fastest or the most energy-efficient, it is not far do so at the expense of a high degree of specialization, man- from being so, which arguably makes pLUTo the most well- ifested through the support of a limited range of operations. rounded of the approaches considered in this analysis. These works are able to address this limitation by exploiting alternatives to conventional bit-parallel algorithms. For ex- 7. Case Study: Binary Neural Networks ample, it is possible to efficiently realize arithmetic operations As shown in Section 6, pLUTo is especially well-suited for ex- in Ambit [110] using bit-serial algorithms. Nevertheless, we ecuting limited-precision operations efficiently, since these argue that the additional flexibility afforded by pLUTo’s na- operations can be expressed using small LUTs. Building on tive support for LUT operations allows it to outperform prior this observation, in this section we validate the applicability PuM architectures in meaningful and substantive ways. We of pLUTo for quantized neural networks, an emerging ma- substantiate this claim with Table 5, which shows the time chine learning application. We evaluate as a proof-of-concept

7. Case Study: Binary Neural Networks

As shown in Section 6, pLUTo is especially well-suited for executing limited-precision operations efficiently, since these operations can be expressed using small LUTs. Building on this observation, in this section we validate the applicability of pLUTo to quantized neural networks, an emerging machine learning application. As a proof of concept, we evaluate a quantized version of the LeNet-5 network that classifies the digits in the MNIST dataset. The inference times for CPU, GPU and pLUTo are shown in Table 6. For this evaluation, the CPU is unchanged from Table 2; the GPU is a server-grade NVIDIA P100 with 16 GB of dedicated memory, commonly used for machine learning applications.

We make two key observations. First, pLUTo-16 outperforms both the CPU (by 10x and 30x for 1-bit and 4-bit precision, respectively) and the GPU (by 2x and 7x) in inference time. This is because pLUTo operations on reduced-bit-width data are especially efficient, since they can be performed in-place as a short sequence of DRAM commands. Second, pLUTo-16 also achieves considerable energy savings over both the CPU (110x and 109x) and the GPU (80x and 81x) at 1-bit and 4-bit precision, respectively. This reduction can mostly be attributed to the overall mitigation of data movement, since most operations are performed in-place. These results strengthen the case for the use of pLUTo in heavily energy-constrained devices, such as IoT and other edge devices.

Table 6: LeNet-5 inference times (in µs) and energy (in mJ) for CPU, GPU and pLUTo.

Network | Bit Width | Accuracy [66] | CPU Time | CPU Energy | GPU Time | GPU Energy | pLUTo-BSA-16 Time | pLUTo-BSA-16 Energy
LeNet-5 | 1 bit     | 97.4 %        | 249      | 2.2        | 56       | 1.6        | 23                | 0.02
LeNet-5 | 4 bits    | 99.1 %        | 997      | 8.7        | 224      | 6.5        | 30                | 0.08
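To illustrate why binarized networks map so naturally onto LUT queries, the sketch below (ours, not the authors' LeNet-5 implementation) expresses a binary dot product, the core operation of BNN inference, as XNOR followed by a population count served from a 16-entry LUT over 4-bit chunks.

    # Binary dot product with weights/activations in {-1, +1}, encoded as bits
    # (1 -> +1, 0 -> -1): XNOR counts matches, and dot = matches - mismatches.
    POP4 = [bin(x).count("1") for x in range(16)]  # 16-entry popcount LUT

    def bnn_dot(a_bits, w_bits, n):
        agree = ~(a_bits ^ w_bits) & ((1 << n) - 1)  # XNOR, masked to n bits
        matches = 0
        for shift in range(0, n, 4):                 # one small-LUT query per nibble
            matches += POP4[(agree >> shift) & 0xF]
        return 2 * matches - n

    # 8 binary inputs: a = 10110010, w = 10100110 -> 6 matches, dot product = +4.
    print(bnn_dot(0b10110010, 0b10100110, 8))  # -> 4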
8. Related Work

To our knowledge, pLUTo is the first work to propose a mechanism for the efficient storage and querying of LUTs inside DRAM, enabling the in-memory execution of complex operations. In this section, we describe relevant prior works.

Processing-using-Memory (PuM). Many prior works propose various forms of compute-capable memory [1-3, 5-21, 24, 26, 28, 29, 31, 32, 34-39, 41-44, 46, 47, 49-58, 65, 68, 70-72, 74, 76, 77, 80-83, 85, 91, 97-100, 104-106, 108-114, 120, 123, 128-134, 137]. All these approaches provide significant performance and energy improvements, but each focuses only on a reduced set of operations, e.g., data movement [28, 109], bulk bitwise operations [1, 83, 110, 129], or the acceleration of neural networks [31, 32, 35, 82]. By combining the in-memory pLUTo LUT Query with the fast and efficient bitwise logic and shifting operations enabled by these prior works, pLUTo achieves considerable performance improvements and enables the execution of workloads beyond what the cited PiM works have been able to address, e.g., cryptographic applications and image/video processing pipelines.

Processing-near-Memory (PnM). 3D-stacked memories are an emerging technology that stacks memory layers vertically on top of a logic layer with compute capabilities. This technology provides higher bandwidth compared to standard (2D) DRAM chips. Many prior works [2, 27, 67, 101, 136] propose logic layers with various compute capabilities to minimize data movement to the CPU core. However, pLUTo offers the following advantages over 3D-stacked memories with a custom logic layer: 1) it is built on widely adopted conventional DRAM; 2) it provides superior energy savings, by operating on data in-place; and 3) pLUTo and 3D-stacked memories can be complementary technologies, as shown in Section 6.

pPIM [119] and LAcc [32] both employ fully LUT-driven computing paradigms to efficiently execute a limited range of operations relevant to neural network acceleration, in particular vectorized multiplication and nonlinear activation functions. For this reason, their applicability to other domains is limited. In contrast, pLUTo employs the same LUT-driven computing paradigm while providing support for a greater range of operations, such as the querying of both small and large LUTs (Section 4.2.1) and LUT query pipelining (Section 4.3.3). In addition, the pLUTo substrate is heavily based on well-established commodity DRAM technology, and supports LUT querying directly within the DRAM subarrays without requiring considerable dedicated logic. Furthermore, when not in use for LUT querying, a pLUTo subarray can be operated exclusively for storage, in a way that is very similar to a conventional DRAM subarray.

In contrast to the specialized nature of pPIM and LAcc, DRAF [45] is an architecture optimized for flexibility: it employs the same LUT-based computing paradigm used by FPGAs, with the lookup tables located inside the DRAM subarrays. DRAF is able to outperform FPGAs in area and energy efficiency, at the expense of inferior throughput and latency. In contrast, the pLUTo substrate is optimized for high-throughput, regular LUT queries that exploit a high level of reuse. Because of this, even though the cost of an individual LUT query is higher in pLUTo than in DRAF, this added cost is amortized by the massive level of parallelism that pLUTo adopts.

9. Conclusion

We introduced pLUTo, a new DRAM substrate that enables the energy-efficient storage and bulk querying of lookup tables entirely within DRAM. Any deterministic function with a bounded domain can be mapped to a pLUTo-enabled subarray. With our design, computing the value of one such function for a large number of inputs can be performed with significantly reduced data movement and considerable energy savings. We believe that pLUTo has the potential to provide significant performance and energy improvements in applications designed to take maximum advantage of it.

Acknowledgments

We thank the anonymous reviewers of HPCA 2019, ICCD 2020, HPCA 2020, ISCA 2020 and ISCA 2021 for their valuable comments and feedback. We thank the SAFARI Research Group members for the valuable feedback and the stimulating intellectual environment they provide. We acknowledge the generous gifts provided by our industrial partners: Google, Huawei, Intel, Microsoft, and VMware. This work was funded in part by the Instituto de Telecomunicações and the Fundação para a Ciência e a Tecnologia (FCT), under grant number UIDB/50008/2020-UIDP/50008/2020.

References

[1] S. Aga et al., "Compute caches," in HPCA, 2017.
[2] J. Ahn et al., "A scalable processing-in-memory accelerator for parallel graph processing," in ISCA, 2015.
[3] A. Akerib et al., "Using storage cells to perform computation," US Patent 8,238,173, 2012.
[4] M. F. Ali et al., "In-memory low-cost bit-serial addition using commodity DRAM technology," in TCAS I, 2019.
[5] S. Angizi et al., "PIM-Assembler: A processing-in-memory platform for genome assembly," in DAC, 2020.
[6] S. Angizi and D. Fan, "IMC: Energy-efficient in-memory convolver for accelerating binarized deep neural network," in NCS, 2017.
[7] S. Angizi and D. Fan, "Deep Neural Network Acceleration in Non-Volatile Memory: A Digital Approach," in NANOARCH, 2019.
[8] S. Angizi and D. Fan, "GraphiDe: A graph processing accelerator leveraging in-DRAM-computing," in GLSVLSI, 2019.
[9] S. Angizi and D. Fan, "ReDRAM: A reconfigurable processing-in-DRAM platform for accelerating bulk bit-wise operations," in ICCAD, 2019.
[10] S. Angizi et al., "Design and evaluation of a spintronic in-memory processing platform for nonvolatile data encryption," in IEEE TCAD, 2017.
[11] S. Angizi et al., "Energy efficient in-memory computing platform based on 4-terminal spin Hall effect-driven domain wall motion devices," in GLSVLSI, 2017.
[12] S. Angizi et al., "DIMA: A depthwise CNN in-memory accelerator," in ICCAD, 2018.
[13] S. Angizi et al., "PIMA-logic: A novel processing-in-memory architecture for highly flexible and energy-efficient logic computation," in DAC, 2018.
[14] S. Angizi et al., "ParaPIM: A parallel processing-in-memory accelerator for binary-weight deep neural networks," in ASP-DAC, 2019.
[15] S. Angizi et al., "RIMPA: A new reconfigurable dual-mode in-memory processing architecture with spin Hall effect-driven domain wall motion device," in ISVLSI, 2017.
[16] S. Angizi et al., "IMCE: Energy-efficient bit-wise in-memory convolution engine for deep neural network," in ASP-DAC, 2018.
[17] S. Angizi et al., "CMP-PIM: An energy-efficient comparator-based processing-in-memory neural network accelerator," in DAC, 2018.
[18] S. Angizi et al., "AlignS: A processing-in-memory accelerator for DNA short read alignment leveraging SOT-MRAM," in DAC, 2019.
[19] S. Angizi et al., "GraphS: A graph processing accelerator leveraging SOT-MRAM," in DATE, 2019.
[20] S. Angizi et al., "PIM-Aligner: A processing-in-MRAM platform for biological sequence alignment," in DATE, 2020.
[21] S. Angizi et al., "Exploring DNA Alignment-in-Memory Leveraging Emerging SOT-MRAM," in GLSVLSI, 2020.
[22] R. Balasubramonian et al., "CACTI 7: New tools for interconnect exploration in innovative off-chip memories," in TACO, 2017.
[23] D. J. Bernstein, "Salsa20 specification," http://www.ecrypt.eu.org/stream/salsa20pf.html, 2005.
[24] D. Bhattacharjee et al., "ReVAMP: ReRAM based VLIW architecture for in-memory computing," in DATE, 2017.
[25] A. Boroumand et al., "Google workloads for consumer devices: Mitigating data movement bottlenecks," in ASPLOS, 2018.
[26] A. Boroumand et al., "CoNDA: Efficient cache coherence support for near-data accelerators," in ISCA, 2019.
[27] A. Boroumand et al., "LazyPIM: An efficient cache coherence mechanism for processing-in-memory," in CAL, 2016.
[28] K. K. Chang et al., "Low-cost inter-linked subarrays (LISA): Enabling fast inter-subarray data movement in DRAM," in HPCA, 2016.
[29] P. Chi et al., "PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory," in ISCA, 2016.
[30] B. Dally, "The Path to Exascale Computing," http://images.nvidia.com/events/sc15/pdfs/SC5102-path-exascale-computing.pdf, 2015.
[31] Q. Deng et al., "DrAcc: A DRAM based accelerator for accurate CNN inference," in DAC, 2018.
[32] Q. Deng et al., "LAcc: Exploiting Lookup Table-based Fast and Accurate Vector Multiplication in DRAM-based CNN Accelerator," in DAC, 2019.
[33] F. Devaux, "The True Processing In Memory Accelerator," in HC, 2019.
[34] J. Draper et al., "The architecture of the DIVA processing-in-memory chip," in ICS, 2002.
[35] C. Eckert et al., "Neural cache: Bit-serial in-cache acceleration of deep neural networks," in ISCA, 2018.
[36] D. Fan, "Low power in-memory computing platform with four terminal magnetic domain wall motion devices," in NANOARCH, 2016.
[37] D. Fan and S. Angizi, "Energy efficient in-memory binary deep neural network accelerator with dual-mode SOT-MRAM," in ICCD, 2017.
[38] D. Fan et al., "In-memory computing with spintronic devices," in ISVLSI, 2017.
[39] D. Fan et al., "Leveraging spintronic devices for ultra-low power in-memory computing: Logic and neural network," in MWSCAS, 2017.
[40] A. M. Fiskiran and R. B. Lee, "On-chip lookup tables for fast symmetric-key encryption," in ASAP, 2005.
[41] D. Fujiki et al., "Duality cache for data parallel acceleration," in ISCA, 2019.
[42] P.-E. Gaillardon et al., "The Programmable Logic-in-Memory (PLiM) Computer," in DATE, 2016.
[43] D. Gao et al., "A design framework for processing-in-memory accelerator," in SLIP, 2018.
[44] F. Gao et al., "ComputeDRAM: In-memory compute using off-the-shelf DRAMs," in MICRO, 2019.
[45] M. Gao et al., "DRAF: A low-power DRAM-based reconfigurable acceleration fabric," in ISCA, 2016.
[46] S. Ghose et al., "Processing-in-memory: A workload-driven perspective," in IBM J. Res. Dev., 2019.
[47] S. Ghose et al., "Enabling the adoption of processing-in-memory: Challenges, mechanisms, future research directions," in Beyond-CMOS Technologies for Next Generation Computer Design, 2018.
[48] S. Ghose et al., "The Processing-in-Memory Paradigm: Mechanisms to Enable Adoption," in Beyond-CMOS Technologies for Next Generation Computer Design, 2019.
[49] M. Gokhale et al., "Processing in memory: The Terasys massively parallel PIM array," in Computer, 1995.
[50] P. Gu et al., "DLUX: A LUT-based Near-Bank Accelerator for Data Center Deep Learning Training Workloads," in TCAD, 2020.
[51] N. Hajinazar et al., "SIMDRAM: A Framework for Bit-Serial SIMD Processing Using DRAM," in HPCA, 2021.
[52] S. Hamdioui et al., "Memristor for computing: Myth or reality?" in DATE, 2017.
[53] S. Hamdioui et al., "Memristor based computation-in-memory architecture for data-intensive applications," in DATE, 2015.
[54] Z. He et al., "Exploring STT-MRAM based in-memory computing paradigm with application of image edge extraction," in ICCD, 2017.
[55] Z. He et al., "High performance and energy-efficient in-memory computing architecture based on SOT-MRAM," in NANOARCH, 2017.
[56] Z. He et al., "Leveraging dual-mode magnetic crossbar for ultra-low energy in-memory data encryption," in GLSVLSI, 2017.
[57] K. Hsieh et al., "Transparent offloading and mapping (TOM): Enabling programmer-transparent near-data processing in GPU systems," in ISCA, 2016.
[58] K. Hsieh et al., "Accelerating pointer chasing in 3D-stacked memory: Challenges, mechanisms, evaluation," in ICCD, 2016.
[59] Hybrid Memory Cube Consortium, "Hybrid Memory Cube Specification 2.1," Tech. Rep., 2014.
[60] J. Jeddeloh and B. Keeth, "Hybrid Memory Cube new DRAM architecture increases density and performance," in VLSIT, 2012.
[61] JEDEC, "DDR3 SDRAM Standard, JESD79-3D," https://www.jedec.org/standards-documents/docs/jesd-79-3d, 2012.
[62] JEDEC, "DDR4 SDRAM Standard, JESD79-4B," https://www.jedec.org/standards-documents/docs/jesd79-4a, 2017.
[63] S. Kanev et al., "Profiling a warehouse-scale computer," in ISCA, 2015.
[64] D. Kang et al., "256 Gb 3 b/cell V-NAND flash memory with 48 stacked WL layers," in JSSC, 2017.
[65] M. Kang et al., "An energy-efficient VLSI architecture for pattern recognition via deep embedding of computation in SRAM," in ICASSP, 2014.
[66] S. Khoram and J. Li, "Adaptive quantization of neural networks," in ICLR, 2018.
[67] D. Kim et al., "Neurocube: A programmable digital neuromorphic architecture with high-density 3D memory," in ISCA, 2016.
[68] Y. Kim et al., "A case for exploiting subarray-level parallelism (SALP) in DRAM," in ISCA, 2012.
[69] E. Kültürsay et al., "Evaluating STT-RAM as an energy-efficient main memory alternative," in ISPASS, 2013.
[70] S. Kvatinsky et al., "MAGIC—Memristor-aided logic," in TCAS II, 2014.
[71] S. Kvatinsky et al., "Memristor-based IMPLY logic design procedure," in ICCD, 2011.
[72] S. Kvatinsky et al., "Memristor-based material implication (IMPLY) logic: Design principles and methodologies," in VLSI, 2013.
[73] P. S. Lazar and S. C. Oh, "DRAM with total self refresh and control circuit," US Patent 6,741,515, 2004.
[74] P. V. Lea, "Apparatuses and methods for in-memory operations," US Patent 10,268,389, 2019.
[75] B. C. Lee et al., "Architecting phase change memory as a scalable DRAM alternative," in ISCA, 2009.
[76] B. C. Lee et al., "Phase change memory architecture and the quest for scalability," in CACM, 2010.
[77] B. C. Lee et al., "Phase-change technology and the future of main memory," in MICRO, 2010.
[78] D. U. Lee et al., "A 1.2V 8Gb 8-channel 128 GB/s high-bandwidth memory (HBM) stacked DRAM with effective microbump I/O test methods using 29nm process and TSV," in ISSCC, 2014.
[79] C. Lefurgy et al., "Energy management for commercial servers," in Computer, 2003.
[80] Y. Levy et al., "Logic operations in memory using a memristive Akers array," in Microelectronics Journal, 2014.
[81] S. Li et al., "SCOPE: A Stochastic Computing Engine for DRAM-Based In-Situ Accelerator," in MICRO, 2018.
[82] S. Li et al., "DRISA: A DRAM-based reconfigurable in-situ accelerator," in MICRO, 2017.
[83] S. Li et al., "Pinatubo: A processing-in-memory architecture for bulk bitwise operations in emerging non-volatile memories," in DAC, 2016.
[84] G. H. Loh, "3D-Stacked Memory Architectures for Multi-core Processors," in ISCA, 2008.
[85] T. A. Manning, "Apparatuses and methods for comparing data patterns in memory," US Patent 9,934,856, 2018.
[86] M. F. Mansour, "Efficient Huffman decoding with table lookup," in ICASSP, 2007.
[87] J. McNeely and M. Bayoumi, "Low Power Lookup Tables for Huffman Decoding," in ICIP, 2007.
[88] Micron, "Micron Collaborates With Broadcom to Solve DRAM Timing Challenge, Delivering Improved Performance for Networking Customers," http://investors.micron.com/static-files/3e9669f9-7186-481c-8594-dca7e992a0b2, 2013.
[89] O. Mutlu et al., "Enabling practical processing in and near memory for data-intensive computing," in DAC, 2019.
[90] O. Mutlu et al., "Processing Data Where It Makes Sense: Enabling In-Memory Computation," in Microprocessors and Microsystems, 2019.
[91] O. Mutlu et al., "A Modern Primer on Processing in Memory," in Emerging Computing: From Devices to Systems - Looking Beyond Moore and Von Neumann, 2021.
[92] R. Nair et al., "Active memory cube: A processing-in-memory architecture for exascale systems," in IBM J. Res. Dev., 2015.
[93] Nvidia, "P100 GPU," in Pascal Architecture White Paper, 2016.
[94] Nvidia, "NVIDIA Tesla V100 GPU Architecture," in White Paper, 2017.
[95] D. Pandiyan and C. Wu, "Quantifying the energy cost of data movement for emerging smart phone workloads on mobile platforms," in IISWC, 2014.
[96] K.-T. Park et al., "Three-dimensional 128 Gb MLC vertical NAND flash memory with 24-WL stacked layers and 50 MB/s high-speed programming," in JSSC, 2015.
[97] F. Parveen et al., "Low power in-memory computing based on dual-mode SOT-MRAM," in ISLPED, 2017.
[98] F. Parveen et al., "IMCS2: Novel device-to-architecture co-design for low-power in-memory computing platform using coterminous spin switch," in IEEE Trans. Magn., 2018.
[99] F. Parveen et al., "Hybrid polymorphic logic gate with 5-terminal magnetic domain wall motion device," in ISVLSI, 2017.
[100] F. Parveen et al., "HieIM: Highly flexible in-memory computing using STT MRAM," in ASP-DAC, 2018.
[101] A. Pattnaik et al., "Scheduling techniques for GPU architectures with processing-in-memory capabilities," in PACT, 2016.
[102] I. Paul et al., "Harmonia: Balancing compute and memory power in high-performance GPUs," in ISCA, 2015.
[103] J. T. Pawlowski, "Hybrid Memory Cube (HMC)," in HCS, 2011.
[104] A. S. Rakin et al., "PIM-TGAN: A processing-in-memory accelerator for ternary generative adversarial networks," in ICCD, 2018.
[105] A. K. Ramanathan et al., "Look-Up Table based Energy Efficient Processing in Cache Support for Neural Network Acceleration," in MICRO, 2020.
[106] S. H. S. Rezaei et al., "NoM: Network-on-Memory for Inter-Bank Data Transfer in Highly-Banked Memories," in CAL, 2020.
[107] P. Rosenfeld, "Performance exploration of the Hybrid Memory Cube," Ph.D. dissertation, 2014.
[108] V. Seshadri et al., "Fast Bulk Bitwise AND and OR in DRAM," in CAL, 2015.
[109] V. Seshadri et al., "RowClone: Fast and energy-efficient in-DRAM bulk data copy and initialization," in MICRO, 2013.
[110] V. Seshadri et al., "Ambit: In-memory accelerator for bulk bitwise operations using commodity DRAM technology," in MICRO, 2017.
[111] V. Seshadri et al., "Gather-scatter DRAM: In-DRAM address translation to improve the spatial locality of non-unit strided accesses," in MICRO, 2015.
[112] V. Seshadri and O. Mutlu, "Simple operations in memory to reduce data movement," in Adv. Comput., 2017.
[113] V. Seshadri and O. Mutlu, "In-DRAM bulk bitwise execution engine," arXiv preprint arXiv:1905.09822, 2019.
[114] A. Shafiee et al., "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," in ISCA, 2016.
[115] W. Shooman, "Parallel computing with vertical data," in IRE-AIEE-ACM, 1960.
[116] P. Siegl et al., "Data-centric computing frontiers: A survey on processing-in-memory," in MEMSYS, 2016.
[117] D. B. Strukov et al., "The missing memristor found," in Nature, 2008.
[118] Y. Sun and M. S. Kim, "A pipelined CRC calculation using lookup tables," in CCNC, 2010.
[119] P. R. Sutradhar et al., "pPIM: A Programmable Processor-in-Memory Architecture With Precision-Scaling for Deep Learning," in CAL, 2020.
[120] Y. Tian et al., "ApproxLUT: A novel approximate lookup table-based accelerator," in ICCAD, 2017.
[121] T. Vogelsang, "Understanding the energy consumption of dynamic random access memories," in MICRO, 2010.
[122] Q. Wang et al., "AUGEM: Automatically generate high performance dense linear algebra kernels on CPUs," in SC, 2013.
[123] Y. Wang et al., "FIGARO: Improving system performance via fine-grained in-DRAM data relocation and caching," in MICRO, 2020.
[124] M. Ware et al., "Architecting for power management: The IBM POWER7 approach," in HPCA, 2010.
[125] H. S. Warren, Hacker's Delight, 2013.
[126] J. Wolkerstorfer et al., "An ASIC implementation of the AES SBoxes," in RSA Conf., 2002.
[127] W. A. Wulf and S. A. McKee, "Hitting the memory wall: Implications of the obvious," in ACM SIGARCH Computer Architecture News, 1995.
[128] L. Xie et al., "Fast Boolean logic mapped on memristor crossbar," in ICCD, 2015.
[129] X. Xin et al., "ROC: DRAM-based Processing with Reduced Operation Cycles," in DAC, 2019.
[130] X. Xin et al., "ELP2IM: Efficient and Low Power Bitwise Operation Processing in DRAM," in HPCA, 2020.
[131] L. Yang et al., "A Flexible Processing-in-Memory Accelerator for Dynamic Channel-Adaptive Deep Neural Networks," in ASP-DAC, 2020.
[132] J. Yu et al., "Memristive devices for computation-in-memory," in DATE, 2018.
[133] J. T. Zawodny and G. E. Hush, "Apparatuses and methods to reverse data stored in memory," US Patent 9,959,923, 2018.
[134] Y. Zha and J. Li, "Hyper-AP: Enhancing associative processing through a full-stack optimization," in ISCA, 2020.
[135] D. P. Zhang et al., "A new perspective on processing-in-memory architecture design," in MSPC, 2013.
[136] D. Zhang et al., "TOP-PIM: Throughput-oriented programmable processing in memory," in HPDC, 2014.
[137] H. Zhao et al., "Apparatuses and methods to control body potential in memory operations," US Patent 9,536,618, 2017.
[138] B. Zoltak, "VMPC one-way function and stream cipher," in FSE, 2004.
