FORMS: Fine-grained Polarized ReRAM-based In-situ Computation for Mixed-signal DNN Accelerator

Geng Yuan?1, Payman Behnam?2, Zhengang Li1, Ali Shafiee3, Sheng Lin1, Xiaolong Ma1, Hang Liu4, Xuehai Qian5, Mahdi Nazm Bojnordi6, Yanzhi Wang1, Caiwen Ding7 1Northeastern University, 2Georgia Institute of Technology, 3Samsung, 4Stevens Institute of Technology, 5University of Southern California, 6University of Utah, 7University of Connecticut 1{yuan.geng, li.zhen, lin.sheng, ma.xiaol, yanz.wang}@northeastern.edu, [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]

Abstract—Recent work demonstrated the promise of using storage and computation units are separated. To reduce resistive random access memory (ReRAM) as an emerging movement, model compression techniques [2–4] and hardware technology to perform inherently parallel analog domain in-situ accelerators [5–11] have been intensively investigated. How- matrix-vector multiplication—the intensive and key computation in deep neural networks (DNNs). One key problem is the weights ever, as Moore’s law is reaching an end [12], the potential of that are signed values. However, in a ReRAM crossbar, weights the acceleration architecture based on conventional technology are stored as conductance of the crossbar cells, and the in-situ is still limited. We argue that drastic improvements can be only computation assumes all cells on each crossbar column are of achieved by 1) the next-generation emerging device/circuit the same sign. The current architectures either use two ReRAM technology beyond CMOS; and 2) the vertical integration [13] crossbars for positive and negative weights (PRIME), or add an offset to weights so that all values become positive (ISAAC). and optimization of algorithm, architecture, and technology Neither solution is ideal: they either double the cost of crossbars, innovations to deliver better overall performance and energy or incur extra offset circuity. To better address this problem, we efficiency for various applications. propose FORMS, a fine-grained ReRAM-based DNN accelerator with algorithm/hardware co-design. Instead of trying to represent A promising emerging technology is the recently discov- the positive/negative weights, our key design principle is to ered resistive random access memory (ReRAM) [14, 15] enforce exactly what is assumed in the in-situ computation— devices that are able to perform the inherently parallel in- ensuring that all weights in the same column of a crossbar situ matrix-vector multiplication in the analog domain. This have the same sign. It naturally avoids the cost of an additional key feature has been applied to several significant problems, crossbar. Such polarized weights can be nicely generated using alternating direction method of multipliers (ADMM) regularized including solving systems of linear equations in O(1) time optimization during the DNN training, which can exactly enforce complexity [16], and more interestingly, building DNN accel- certain patterns in DNN weights. To achieve high accuracy, we erators [17–24]. Since the key computation in DNNs can be divide the crossbar into logical sub-arrays and only enforce this essentially expressed as matrix-vector multiplication, ReRAM property within the fine-grained sub-array columns. Crucially, crossbars can naturally accelerate DNNs with much less data the small sub-arrays provides a unique opportunity for input zero- skipping, which can significantly avoid unnecessary computations movement and low-cost computation. Due to their promising and reduce computation time. At the same time, it also makes the results, structured pruning [3] and quantization [4] are also hardware much easier to implement and is less susceptible to non- developed to facilitate a smaller number of weights and bits idealities and noise than coarse-grained architectures. Putting all to reduce the amount of computation and ReRAM hardware together, with the same optimized DNN models, FORMS achieves GOP s resources. 
1.50× and 1.93× throughput improvement in terms of s×mm2 GOP s On the other side, a key complication of the ReRAM- and W compared to ISAAC, and 1.12× ∼ 2.4× speed up in arXiv:2106.09144v1 [cs.AR] 16 Jun 2021 terms of frame per second over optimized ISAAC with almost based DNN accelerator is that although the weights stored the same power/area cost. Interestingly, FORMS optimization in ReRAM crossbar cells can be either positive or negative, framework can even speed up the original ISAAC from 10.7× the in-situ computation assumes all cells in each crossbar up to 377.9×, reflecting the importance of software/hardware co-design optimizations. column hold values of the same sign, i.e., all positive or all negative. There are two approaches to tackle the problem. I.INTRODUCTION The general way is to use two ReRAM crossbars to hold the Deep Neural Networks (DNNs) have become the funda- positive and negative magnitudes weights separately, doubling mental element and core enabler of ubiquitous artificial - the ReRAM portion of hardware cost [17, 25–28]. In contrast, ligence, thanks to their high accuracy, excellent scalability, and ISAAC [18] adds an offset to weights so that all values become self-adaptiveness [1]. As the ever-growing DNN model size, positive. While keeping the number of crossbars the same, the high computation and memory storage of DNN models the latter approach introduces additional hardware costs for introduce substantial data movements, posing key challenges the peripheral circuits by adding extra offset circuits and may to the conventional Von Neumann architectures, where weight also decrease the network robustness to hardware failures [29]. We argue that both solutions are not ideal, and we attempt to ?These Authors contributed equally. develop an alternative approach with better cost/performance Training with ADMM regularized optimization Sign Indicator Driver Driver } Optimized Pre-trained Fragment MCU MCU

Weight Pruning Quantization ReRAM-aware DAC DAC

Polarization zero-skip DNN Model zero-skip DNN Model S&H S&H MCU MCU ADCs ADCs Pruned Driver Driver MCU MCU Polarized Controller Data-Flow Controller DAC DAC zero-skip zero-skip Quantized MCU MCU S&H S&H after pruning positive weight ADCs ADCs pruned weight 2D weight format negative weight Tile MCU FORMS Optimization Framwork FORMS Accelerator Architecture Design Fig. 1: Overall Flow of FORMS Algorithm/Hardware Co-designed. trade-offs. fine-grained computation are easier to implement than large Different from the previous approaches, which use addi- ADCs for coarse-grained computation, and make the design tional hardware to “fix” the problem, our design principle is to less susceptible to non-idealities and noise [31, 32]. enforce exactly what is assumed in the in-situ computation— Putting all together, this paper proposes FORMS, the first ensuring the pattern that all weights in the same column of a algorithm/hardware co-designed fine-grained ReRAM archi- crossbar have the same sign. This idea takes the opportunity tecture solution leveraging polarized weights. The overall of algorithm and hardware co-design, and is motivated by flow of FORMS algorithm/hardware co-design is shown in the capability of the powerful alternating direction method Figure 1. Starting with a pretrained model, we first apply of multipliers (ADMM) regularized optimization [30], which structured pruning. Then, the weight matrix in the pruned is exactly able to enforce patterns in DNN training while model is divided into fixed-sized fragments, and each frag- maintaining high accuracy. Based on this idea, we can train ment corresponds to a column of the crossbar sub-array. By a DNN model using ADMM with our novel constraints such incorporating our fragment polarization constraints in ADMM that the weights mapped to the same crossbar columns are all regularized training, the weights in each fragment will be positive or negative. With the typical ReRAM crossbar size, trained to have the same sign. Please note that we are not e.g., 128 × 128, we found that the “same-sign” property of moving the weights around to form polarized fragments. the coarse granularity, i.e., the whole column of 128 weights, Finally, the quantization is applied to reduce the number of bits can lead to accuracy degradation. To maintain high accuracy, required for each weight. With the multi-step ADMM-based we propose to divide the crossbar into smaller logical sub- structured pruning, polarization, and quantization, a significant arrays and only enforce the property within the fine-grained model compression ratio is achieved. sub-array columns, and use fine-grained computation instead At the hardware level, we design a fine-grained DNN of coarse-grained computation (i.e., calculate fine-grained sub- accelerator architecture leveraging fine-grained computations. array column at each time). However, this raises another A novel zero-skipping logic is developed to control the shift problem: the mainstream designs take advantage of coarse- of input bit streams on-the-fly and avoids entering the unnec- grained computation to achieve high performance (i.e., frames essary zero bits into each layer of the DNNs-based ReRAMs per second (FPS)). For a fine-grained design to reach a similar to eliminate useless computations and start computation on FPS to that of coarse-grained designs, if no other optimizations the next input early. 
Applied to the inputs of a small sub- are applied, more parallel analog-to-digital converters (ADCs) array (fragment), zero-skipping significantly reduces the frame are needed to operate in parallel to compensate for the processing time and energy consumption. performance degradation caused by computation granularity, II.BACKGROUNDAND CHALLENGES which will generally lead to a higher hardware overhead than the coarse-grained designs. A. ReRAM Crossbar and ReRAM-Based DNN Acceleration Is this yet another failed idea that seems to work at the Recently, there is significant progress in fabricating non- first thought but actually not after deliberation? Fortunately, volatile memories. The 3D Xpoint [33, 34] is an example the answer is no, because our idea opens a new opportunity of commercial non-volatile memories fabricated jointly by for more significant performance improvements. The crucial Micron and Intel. Resistive RAM is a non- observation is that if all inputs to the crossbar become zeros with nearly zero leakage power, high integration density, and in higher bits after a certain position, e.g., 00001011, they can high scalability. Research papers demonstrated the results of be simply skipped, saving cycles and improving performance. fabricated ReRAM memory cells, memory arrays [35] and also The effectiveness of the input zero-skipping technique is neuromorphic accelerators using ReRAM technology [33, 36]. determined by the size of the crossbar sub-array. The smaller Several ReRAM-based in-situ mixed-signal DNN acceler- the number of inputs, the higher the probability all inputs ators such as ISAAC [18], Newton [19], PipeLayer [20], become zero at the higher bits, thereby more zeros can be PRIME [17], PUMA [21], MultiScale [22], XNOR- skipped. This is why zero-skipping is a unique opportunity RRAM [37], RapidDNN [38], have been proposed in recent for small sub-arrays. With a fraction of a crossbar column years. These designs utilize a combination of analog and (e.g., 4 or 8 cells), zeros can be aggressively skipped among 4 digital units to speed up the computation. Besides, the recent or 8 inputs. Moreover, it is worth noting that small ADCs for work TIMELY [23] is proposed to enhance the analog data

locality to keep computations in analog domain to save the Reshape Filter Prune m columns energy cost of data movements and D/A and A/D domain } } conversion. The SRE [39] exploits the sparsity of both weights 3x3xn rows and activations to achieve better energy efficiency and per- n channels formance by proposing hardware mechanisms. However, it +5 -1 -8 -3 -1 +2 +2 -2 +2 +3 +1 -6 +2 +1 +1 -1 +5 -4 pruned induces remarkable hardware overhead as all mechanisms -4 -2 +4 +1 0 -2 +3 +1 -3

filter(s) } such as row indexing, routing controls, and word-line controls filter 1 filter 2 filter m

Filter Shape Prune 2D weight format are hardware-managed. TinyADC [40] proposes a pruning } m columns solution that fixes the number of non-zero weights in each } 3x3xn rows column of the ReRAM crossbar while their positions can vary. n channels This helps to decrease the accumulated value and required +5 -1 -8 -3 -1 +2 +2 -2 +2 ADC resolution. In contrast, FORMS proposes hardware- +3 +1 -6 +2 +1 +1 -1 +5 -4 pruned -4 -2 +4 +1 0 -2 +3 +1 -3 filter software co-design optimizations to improve frame processing filter 1 filter 2 filter m shape(s) } rate, area, and power efficiency. peripheral peripheral B. Challenges Original Xbar Xbar weight Original peripheral peripheral cost matrix controller Despite the recent research progress, we identify two chal- Xbar Xbar lenges for ReRAM-based DNN acceleration. Harware mapping Mapping signed weights. When we map the arbitrary peripheral peripheral Pruned Xbar Xbar Hardware DNN weights onto ReRAM crossbars, it is challenging to weight peripheral peripheral Saved represent negative weights using positive conductance. Prior matrix controller Xbar Xbar works address the problem differently. The general way is to decompose each weight into a positive magnitude portion Fig. 2: Structured pruning and a negative magnitude portion, then use two crossbars to represent each of the portions [17, 26–28, 41]. This method achieve four benefits: 1) zero-skipping techniques can achieve doubles the hardware cost of the ReRAM crossbar. Instead, significant performance improvements; 2) making ADC/DAC ISAAC [18] addresses this problem by shifting or adding implementation less challenging compared to the state-of-the- an offset to the original negative weight value, so that all art coarse-grained architecture designs (e.g., ISAAC, PUMA, the weights become positive. To be able to calculate correct and PRIME); 3) fine-grained architecture is less susceptible to results, ISAAC needs to count the number of 1’s in each non-idealities and noise than coarse-grained architecture; and individual input. If considering 16-bit weights are used, for 4) achieving a higher overall performance rate than coarse- each 1 in each input, a bias of 215 must be subtracted from grained designs under a similar power and area cost. We the final results. Counting all 1s for all inputs (that feed to the demonstrate that by leveraging the principle of algorithm crossbar in parallel) and performing subtractions for each of and hardware co-design, our proposed solution significantly 1’s introduces significant overhead to ISAAC. Moreover, this advances the state-of-the-art and paves a new way for fine- mapping method decreases the network robustness to hardware grained mixed-signal accelerators design. failures [29]. We can see that both methods will cost extra III.HARDWARE-AWARE OPTIMIZATION resources in terms of area, power, and energy consumption. FRAMEWORK ADC/DAC implementation. To ensure high throughput, the mainstream accelerator designs (e.g., ISAAC, PRIME, In this section, we describe the whole software procedure PUMA) take advantage of coarse-grained computations, a to generate hardware-friendly DNN models. The key novelty large number of ReRAM crossbar rows (e.g., 128 or 256) need is the fragment polarization technique that enforces the same to be processed at the same time and hence large ADCs are sign for weights in each fragment. As recent works [43–46] needed. 
However, the ADC does not scale as fast as the CMOS have demonstrated, the structured pruning and quantization are technology does [18, 42]. The recent study [40, 42] reports two essential steps for hardware-friendly model compression the ADC/DAC blocks may become the major contributor that are universally applicable to all DNN accelerators. Thus, to the total chip area and power. To save power and area, we perform structured pruning before fragment polarization many designs share an ADC with many crossbar columns considering the size of the ReRAM crossbars, and quantization (e.g., 128 columns in ISAAC). Therefore, the ADC needs to after. Thus, the sign of each fragment is determined by the switch among those columns at a high-speed, which is hard structurally pruned model. And the quantization is applied in practice. after the structure and sign of the weights are determined. All of the three steps are uniformly supported by Alternating C. Motivation Direction Method of Multipliers (ADMM) regularized opti- We realize the possibility of generating polarized weights mization method [30] which has shown to be very suitable that can elegantly solve the natural problem of mapping for DNN training and produced state-of-the-art results [47]. positive/negative weights to the ReRAM crossbar without In the following sections, we describe these three steps, and doubling crossbar cost or introducing extra hardware for then explain how to express our desired pattern in ADMM. result compensation. We explore a good balance between the overall hardware cost and performance. Even with paying the A. Crossbar-Aware Structured Pruning extra overhead to compensate for the performance degrada- There are different types of structured sparsity in DNNs. tion caused by fine-grained computation, our design can still FORMS combines two types of structured pruning methods—

filter pruning and filter-shape pruning. As shown in Figure 2, } +3 -1 +2 channel +4 -5 +1 we reshape the weights from convolutional filters of a convo- } +1 -4 +3 -2 +3 -1 +4 +4 -6 -4 lutional layer into a 2D weight matrix, each column represents -1 +2 -2 -6 sub-array -3 all the weights from the same filter, while each row represents height -1 -1 -4 } the weights on the same position of all filters. By adopting width W-major H-major C-major these two types of pruning, the entire columns and rows of original fragments polarized fragments sub-array the weight matrix will be removed while the remaining weight +3 +2 -1 +4 -1 -2 matrix is still dense. After the structured pruning, the weight -1 -2 -4 +1 -3 -5 mapping matrix size becomes much smaller, which effectively reduces +4 -2 +1 +5 -2 0 sub-array the number of ReRAM crossbars to store the weights and also -1 -1 -4 0 -2 -3 crossbar array removes the corresponding peripheral circuits. fragment 1 fragment 2 fragment 3 The previous ReRAM-based accelerator designs [47, 48] Fig. 3: Fragments and direction of polarization apply structured pruning and aim to make the pruning ratio as high as possible while maintaining an acceptable accuracy every M epochs. This means that at the end of M epochs, loss. They do not take the ReRAM crossbar size into consid- the sign for each fragment will be calculated again using the eration, therefore causing performance degradation when the current weights belonging of the fragment. During the training pruned model is mapped onto the ReRAM crossbars. For ex- process, the fragment signs are updated for N/M times. ample, if the size of the crossbar is 128×128, only the portion Next, we consider the design space of mapping weights of pruned columns/rows of weights that reaches the multiple to fragments. As shown in Figure 3, given the weights of of the 128 (e.g., 128, 256) will lead to the actual crossbar a convolution filter in 3-D format with width (W), height reduction. The remaining pruned columns/rows of weights still (H), and channel (C) dimension, we can naturally have three need to be stored as zeros on the crossbars. Consequently, polarization policy. In width-major or W-major polarization, those pruned columns/rows are wasted, and the accuracy drop the consecutive weights in one or multiple consecutive rows is incurred without gaining the full benefit. In FORMS, we in a filter are mapped to the same fragment. The weights in perform a crossbar-aware structured pruning by considering two consecutive fragments are also consecutive in the width- the crossbar size and carefully choosing the pruning ratio for major order. After all weights in a filter are mapped, we move each DNN layer to avoid unnecessary accuracy drop. on to the next filter using the same procedure. More details regarding the ReRAM mapping scheme will be discussed in B. Fragment Polarization Section IV-A. A fragment is a set of consecutive weights mapped to Similarly, we can define height-major or H-major polariza- ReRAM crossbar. The fragment size is determined by the tion, and channel-major or C-major polarization. In particular, number of rows in sub-array—a fragment contains weights for C-major, the weights in the same position at all channels that will be mapped to the same column of a sub-array, as are mapped before moving to the next position of the filter. shown in Figure 3. In our design, the weights in a fragment are Since the polarization policy determines which region of polarized. They are either positive or negative. 
Therefore, the weights should have the same sign, different policies could multiply-accumulate operations performed by each column of affect accuracy. Our observation results show the W-major the ReRAM crossbar sub-array do not suffer from the arbitrary polarization scheme achieves the highest accuracy on the sign of its operands. ImageNet dataset, where C − major polarization is the best With the concept of fragment introduced, we need to choice for the CIFAR-10/100 dataset. Since the same polar- consider two issues: 1) how to determine the sign of weights in ization scheme will be applied over the entire neural network, a fragment; and 2) the mapping policy of weights to sub-array we only need to uniformly re-order the weights with their columns. The second problem matters because it determines corresponding inputs in advance, then directly map them to the which set of weights in the DNN model will contain the same ReRAM crossbars. Base on the chosen polarization scheme, sign. Next, we show how to handle these two problems. the weights are trained to enforce the fragment polarization To orchestrate fragment polarization during the training properties. There is no need to individually move the weights process, we calculate the sum of weights in each fragment around to form a polarized fragment, and it will not incur any and determine the fragment sign based on the following hardware overheads due to location indices. principle: if the sum is greater than or equal to 0, we set the sign of the fragment as positive, otherwise, we set it C. ReRAM Customized Weight Quantization as negative. We incorporate ADMM regularized optimization We develop ReRAM customized weight quantization, to with the fragment polarization constraint to regularize the reduce the bit representation of weight parameters. Similar to weights to have the same sign as the fragment or to become crossbar-aware structured pruning, we consider the numbers zero. Eventually, the negative/positive weights are eliminated of valid ReRAM conductance states into our quantization in positive/negative fragments. constraint during the training process. In reality, the number During training, since the weights are continuously updated, of ReRAM conductance states is limited by the provided the sign of a fragment changes. Specifically, at the beginning resolution of the peripheral write and read circuitry. More of fragment polarization, the fragment signs are determined by state levels require more sophisticated peripheral circuitry. Due the above policy based on the structurally pruned model. Then, to limited computational accuracy, multiple ReRAM cells are the training for fragment polarization starts. Let us assume that usually used to represent one weight. For example, we need the training process contains N epochs, each goes through eight 2-bit ReRAM cells to represent one 16-bit weight, and the entire training dataset once. We can update the target sign four 2-bit ReRAM cells to represent one 8-bit weight. Thus, the quantization can effectively reduce the design area and Pre-trained Model Sub-problem1: Sub-problem2: power consumption. In FORMS, if the 2-bit ReRAM cells are find W, b find Z Crossbar-aware used in the design, the quantization bits will be set to the Structural Pruning multiply of 2, which can fully utilize the resolution of the Update: U Fragement ADMM ReRAM cells. 
More importantly, compared to forcing a high- Polarization Regularization Euclidean bit representation DNN model into a low-bit representation Projection ReRAM Customized during the ReRAM mapping process, introducing the quanti- Quantization zation to the training process allows the weights to be trained Optimized Polarized Model Model into a low-bit representation. for ReRAM Fig. 4: Procedure of ADMM-Regularized Optimization. D. ADMM-Regularized Optimizations 4) ADMM-Regularized Optimization Flow: As shown in ADMM [30] is an advanced optimization technique, where Figure 4, the overall ADMM-regularized optimization flow an original optimization problem is decomposed into two sub- is an iterative training process, which is similar in ADMM- problems that can be solved separately and iteratively. We NN [49]. Thanks to the flexibility in the definition of constraint incorporate ADMM-regularized training into FORMS training sets Si, Pi, and Qi, the above constraints can be jointly process to achieve and optimize the crossbar-aware structured included, or applied individually. First, we incorporate one of pruning, fragment polarization, and ReRAM customized quan- the three constraints by using indicator function, which is tization feature. Using ADMM can guarantee the solution 0 if W ∈ S or P or Q , g (W ) = i i i i feasibility (satisfying ReRAM hardware constraints) while i i +∞ otherwise. providing high solution quality (no obvious accuracy degra- Problem (1) with constraint cannot be directly solved by dation after model compression and after hardware mapping). classic stochastic gradient descent (SGD) methods [50] as Consider the i-th layer in an N-layer DNN, the weights original DNN training. However, the ADMM regularization and bias can be represented by Wi and bi. And we define can reforge and separate the problem, then solve them itera- N N  the loss function as: L {Wi}i=1, {bi}i=1 . We minimize the tively [51, 52]. First, we reformulate problem (1) as follows: loss function associated with DNN model and subject to the N N N  X constraints of FORMS. Then the overall problem is given by: minimize f {Wi}i=1, {bi}i=1 + gi(Zi), {W },{b } (3) i i i=1

N N  subject to Wi = Zi, i = 1,...,N, minimize L {Wi}i=1, {bi}i=1 , {Wi},{bi} (1) where Zi is an auxiliary variable. With formation of aug- subject to Wi ∈ Si, Wi ∈ Pi, Wi ∈ Qi, i = 1,...,N, mented Lagrangian [53], problem (3) can be decomposed into two subproblems (4) and (5), where Si, Pi and Qi are the constraint sets of Crossbar- N N N  X ρi t t 2 aware Structured Pruning, Fragment Polarization and ReRAM minimize f {Wi}i=1, {bi}i=1 + kWi −Zi +UikF , (4) {W },{b } 2 Customized Quantization, respectively. i i i=1 1) Crossbar-aware Structured Pruning Constraints: In N N crossbar-aware structured pruning, we use H to represent X X ρi t+1 t 2 minimize gi(Zi) + kWi − Zi + UikF , (5) the weights in 2D format of a specific layer, as shown in {Z } 2 i i=1 i=1 Figure 2. The constraints in the i-th CONV layer becomes Wi ∈ Si := {H | the percentage of nonzero filters and filter- where Ui denotes dual variable and t is the iteration index. shapes in H is less than or equal to αi and βi, where αi and The positive scalars ρi is a penalty hyperparameter for the L2 βi are predefined hyperparameters. For example, suppose we regularization. The first subproblem can be solved by classic want a 43% filter sparsity and 62% shape sparsity in ith layer, SGD, and the solution for the second subproblem is given by then we set αi = 0.57 and βi = 0.38. t+1 Y t+1 t Zi = (Wi + Ui), (6) 2) Fragment Polarization Constraints: For fragment polar- Xi ization, the constraint set Pi = {the weights on each fragment Q where X (?) is Euclidean projection to Xi ∈ {Si, Pi, Qi}, (a column of a crossbar sub-array) have the same sign}, where thereby weighti matrices are structured pruned or fragment the ReRAM crossbar sub-array size m × n is a predefined polarized or customized quantized. These two subproblems hyperparameter. The sign of the fragment is determined by will be iteratively solved and we update Ui in each iteration t t−1 t t the following function: by Ui := Ui + Wi − Zi until convergence. The detailed solution process can refer to [54]. + if Pm (W ) ≥ 0, Sign = i=1 i (2) f − otherwise, IV. FORMS ARCHITECTURE This section describes the FORMS accelerator architecture where i ∈ 1, . . . , m is the i-th weight on a fragment. that can execute the optimized DNN models generated by 3) Customized Quantization Constraints: For the ReRAM the FORMS optimization framework. Our design only needs customized quantization, the set Qi = {the weights in the i-th to map the magnitude bits of all the weights to ReRAM layer are mapped to the quantization values}. The quantization crossbars without adding extra crossbars or offset circuits. To values depend on the characteristics of the ReRAM device ensure high throughput, we develop a new pipelined design such as conductance range and valid state levels. incorporated with zero skipping logic. A. Mapping Scheme and Dataflow (ReLU)), we produce a single output feature map. By repeating this procedure for the rest of the weight filters, we obtain the In FORMS, instead of mapping both the sign and magnitude whole output feature maps. bits of arbitrary/mixed signs of weight values as in prior works We fetch the digital inputs (feature maps) from eDRAM or such as memristive Boltzman machine [55], ISAAC [18], DRAM (DRAM for input images, eDRAM for intermediate Newton [19], PipleLayer [20], PRIME [17], PUMA [21], we results) and arrange them into input buffers to be sent to only need to store the magnitude bits to the ReRAM, and the DAC. 
The output of DAC becomes the analog input vi of the sign bit of each fragment in the 1R ReRAM-based array [56]. ReRAM crossbars. The matrix-vector multiplication can be Figure 5 shows an example of the ReRAM mapping scheme performed by leveraging the feature of the ReRAM crossbars, T on a crossbar array. We assume the physical crossbar array and the output can be calculated by: io = W vi. Each size is q × m rows by p × n columns (e.g., 128 by 128), fragment (column) of the ReRAM crossbar sub-array produces where q, p, m, n could be various depending on different an intermediate result, which is the accumulated current re- design specifications. We partition the crossbar array into sult. Intermediate results produced by all fragments will be logical sub-arrays, where each sub-array has m rows and n propagated through ADC. The converted digital values will columns. On one crossbar array, q ×m weights from the same be carried to our proposed accumulation blocks along with the filter will be mapped on the same column, and the weights corresponding sign bits from sign indicator for accumulation. from different p × n filters will be mapped onto different The accumulation blocks are used for accumulating all the columns of this crossbar array. The rest of the weights will be intermediate results from crossbar sub-arrays. The sign-bit mapped onto other crossbar arrays in the same manner. Thus, indicator specifies whether the adder should work in add or in order to accommodate all the weights of a CONV layer, subtract mode. The output of the adder will be accumulated multiple ReRAM crossbar arrays are needed. In reality, due to the temporary results from other fragments. Iteratively, we to the limitation of the ReRAM resolution, we need multiple finish the CONV operation and obtain the output feature map ReRAM cells to represent one weight. For example, we need to be stored into eDRAM, which will become the inputs four 2-bit ReRAM cells to represent one 8-bit weight. In this (feature maps) of the next layer. By using the sign indicator way, each fragment will still have m rows, but 4 columns with our polarized ReRAM crossbar mapping, our design instead of 1 column. And all m weights in the same fragment can save half of the crossbars, which are used to store the that are represented by those m × 4 ReRAM cells. positive/negative weights separately. When compared to the The optimized ReRAM-aware DNN model obtained from designs like ISAAC [18], FORMS avoids introducing the the ADMM regularized optimization process in Section III-D, offset circuity with small overhead caused by sign indicator. has the fragment polarization property. Thus, all the weights Moreover, since we only store the magnitude bits on ReRAM within a fragment have the same sign, and we do not need to crossbars, FORMS can take the advantage to store one more move the weights around to satisfy the polarization constraint. magnitude bit when using the same number of ReRAM For each fragment, an associated 1R is used to store the sign crossbars, which allows FORMS to achieve higher weight bit. All the associated 1Rs are grouped as a sign indicator. representation precision. The sign bits stored in the sign indicator will be carried out for accumulation in the digital domain. B. Fragment Size Exploration and Zero Skipping Logic Convolution Dataflow. CONV layers perform convolution The fine-grained sub-array is a key feature of FORMS ar- of input feature maps and weight filters. 
The convolution chitecture that ensures high accuracy, more feasible hardware results are then accumulated. After passing intermediate re- implementation, and significant performance improvement. sults through an activation function (e.g., rectified linear unit We perform experiments to understand the relation between the size of fragment size and accuracy. Figure 6 shows that + - + sign indicator + - + the smaller fragment size introduces zero or minor accuracy f1 f2 fn fn+1 f2n fpn degradation in polarized form, while the larger fragment size inp weights 1 of may lead to a small accuracy degradation. Compared to the inp 2 filter f pn coarse-grained crossbars, the small fragment size corresponds 010010 011000 to small ADCs. For example, FORMS can use 4-bit ADCs in- inpm 000001 stead of one 8-bit ADC per each crossbar used in ISAAC [18] Fragment or PROMISE [57]. In general, the area and power of ADCs inpm+1 110010 110100 grow exponentially with the number of bits of ADCs. To save 100001 the area and power, ISAAC shares a large 8-bit ADC with 128 columns. Unfortunately, this leads to stringent hardware inp2m sub-array 010010 011000 implementation requirement. Specifically, one 8-bit ADC must

000001 switch between 128 columns and convert the analog current to digital values within 100ns. This design makes ISAAC impractical for fabrication. More importantly, having fine- inp grained fragments opens a new opportunity to significantly qm improve performance by employing a novel zero skipping crossbar array logic that reduces the required cycles to feed input bits to sub-arrays, leading to significant performance improvement. To understand how FORMS saves input cycles, we use a 16- Fig. 5: Polarized ReRAM Crossbar Mapping Scheme. bit input representation as an example. For each clock cycle, Accuracy vs. different fragment size Distribu�on of effec�ve input cycles of fragments (ResNet50) 78

) 80 % ) 77 ( e

% 60 g ( a

76 t n 40 c y e s a r r 75 e 20 P 74 0 A c u 1 2~13 14 15 16 73 Effec�ve input cycles 4 8 16 32 64 128 72 (a) 1 4 8 16 32 64 128 Average effec�ve input cycles of fragments (ResNet50) Fragment Size 16 e VGG16 ResNet18 ResNet50 s v e l � c

c 12 y e Fig. 6: Test accuracy under different fragment sizes on CIFAR- c ff t e u g 100 dataset. p 8 v n i A 1 input bit is fed into the fragments in parallel. Generally, 4 4 8 16 32 64 128 it needs 16 cycles to feed a 16-bit input to the crossbar. Fragment size layer10 layer26 layer49 all-layers avg However, most inputs actually have small values [58] and can (b) be represented by just a few bits, which means the upper bits Fig. 8: (a) Percentage of effective input cycles for different are all 0s and can be skipped from feeding to the crossbar. fragment sizes using 16-bit input data; (b) average effective We define effective bits as the number of input bits that input cycles for various fragment sizes. contribute to output results. It is obtained by removing the consecutive most significant zeros among all inputs. We also C. Overall Architecture define the effective input cycles (EIC) as the minimum number FORMS includes several architectural and circuit-level op- of required cycles to fed effective bits of all inputs into a timizations. We performed design space exploration to find fragment, which is equal to the maximum effective bits of the best size of crossbar arrays, ADCs, DACs, and eDRAM all inputs corresponding to that fragment. For example, as storage. The FORMS system is organized into the multiple shown in Figure 7, the effective bits of inp1 is 6, however, the nodes/tiles as shown in Figure 10. Each layer of the CNN is required EIC for the fragment is 7. Because the inp2 has the mapped into one or multiple tiles. Tiles are connected together largest effective bits in the fragment, which is 7. The EIC in in a mesh-based network while the data flow between differ- the coarse-grained designs that have large fragment size (e.g., ent layers (tiles) in a pipelined manner. The chip controller 128) is generally higher than the fine-grained designs that have orchestrates the flow of operation between different tiles. Each smaller fragment size (e.g., 4,8 or 16). The reason is that the tile comprises multiple MAC units (MCU), eDRAMs, and larger fragments have a higher probability to contain the inputs digital units (DUs) that contain shift and add units, activation that have higher effective bits. Figure 8 (a) shows an example function, and output registers. Each MCU comprises eight of the percentage of EIC (1 to 16) of fragments using 16-bit 128×128 crossbar arrays, ADC units, and output registers. For inputs and various fragment sizes (i.e., 4, 8, 16, 32, 64, 128) fragment sizes of 16, 8, and 4, FORMS employs 5,4, and 3-bit in one CONV layer. As the fragment size becomes larger, it ADCs. Unless mentioned explicitly, we assume the fragment increases the percentage of the fragments that require more size 8 in our discussion. In ISAAC, the 8-bit ADCs contribute EIC. Figure 8 (b) shows, under different fragment sizes, the to 58% of tile power and 31% of tile area consumption. If average required EIC over all fragments for different layers. with the same technology, we build a 4-bit ADC, it results in For fragment size of 4, the average required EIC of all layers almost 4× times less area and power [59, 60]. By reducing to feed all effective bits into crossbar is 10.7, which saves 33% the ADC resolution from 8-bit to 4-bit, the throughput will of the total 16 cycles. The required EIC for fragment size of drop by a factor of 16× since each time only one fragment 128 is 15, which only saves 6% cycles. 
These results show that (i.e, 8 rows) is activated for computing dot-products using 4- while zero-skipping is applicable to coarse-grain crossbars as bit ADCs; whereas in ISAAC, 128 rows are processed each well, the benefits are drastically less. time. In FORMS, we use 2-bit ReRAM cells. Through design To take this key advantage, we propose a zero skipping logic space explorations, we find that 2-bit ReRAM cells delivers as shown in Figure 9. It dynamically controls the input cycles a better energy-efficiency than other number of bits per cell to skip the redundant cycles for feeding zeros, which do not (e.g., 4-bit, 8-bit). ADC bits increase as we increase the contribute to output, but consume power, energy and increase ReRAM cell bits, thereby consuming more power and area. latency. As a result, FORMS achieves higher performance than More importantly, using more bits per ReRAM cells requires the coarse-grained designs. more rigorous hardware fabrication, which also introduces

Effective Bits 1 bit/cycle Fragment imprecision in analog computing and is more prone to process inp 0000 0000 0010 1011 1 ow control Load ...... next inputs inp 0000 0000 0100 1011 2 ......

0000 0000 0000 0110 ...... DAC inp3

...... Input shift register inp 0000 0000 0011 0100 m ...... Required Zero Skipping ... Effective Input Cycles Logic = 7 Fig. 7: Input Effective Bits and Required Fragment EIC. Fig. 9: Zero-skipping Logic. Tile Architecture Chip ISAAC, the pipeline has 22 stages (26 stages for layers that DU Sh&Add Output Chip Controller need pooling). We design a skipping logic circuit that detects eDRAM Regs the number of required cycles for shifting inputs on-the-fly. ReLU T1 T1 Tk-2 Considering that inputs are 16(8) bits, normally a total of 16(8)

MCU MCU MCU MCU T2 T1 Tk-1 cycles are necessary to feed all the inputs to the crossbar array. However, FORMS does not need to wait for the whole 16(8) MCU MCU MCU MCU T3 T1 Tk cycles since usually most significant bits are zeros. In most cases, we can finish feeding inputs to the crossbar early and Fig. 10: Components of FORMS architecture. start computation faster. In addition, feeding zero bits wastes variation [18, 61]. power and energy. The inputs are fed into a crossbar using a To increase throughput, we perform several optimizations. shift register with the parallel-in, serial-out property. To detect For instance, ISAAC uses 8 crossbars per MCU, 1 ADC the number of required cycles to shift inputs to a fragment, per crossbar, 12 MCUs per tile, and 168 tiles. FORMS can for the zero skipping logic, we NOR all the bits in each shift keep 1 ADC per crossbar, increase the number of crossbars register every cycle. The results of the NOR operations per per MCU and accordingly reduce the required MCUs and each fragment are AND-ed and used to trigger ADC to start the tiles. This helps us to reduce power and energy, but it cannot computation. As soon as contents of all shift registers become increase throughput. On the other side, FORMS increases the zero, the output of the AND is “1”, shifting the 8-bit values number of ADCs per crossbar to increase the throughput and stops and the ADC starts computing the results of the dot- yet reduce power and energy. To have an iso-area comparison product. In contrast to ISAAC, the number of iterations for with ISAAC, we increase the number of ADCs per crossbar inputs is variable as FORMS skips some cycle of execution by 4×. The architecture design of the MCU is shown in if all the input bits are zero, which happens frequently due to Figure 11. Thus, each ADC is responsible for 32 columns the small size of the fragment, The results of the ADC logic instead of 128 columns in ISAAC and they work in parallel, is delivered to the “Shift and ADD” units for accumulation in which increases the throughput by 4× compared with having the output registers. They are fed into the activation function 1 ADC per crossbar. and results are stored into an eDRAM, which are inputs for Most importantly, in ISAAC, 1 8-bit ADC should switch the next layer. When we perform pooling, eDRAM is read in between 128 columns in 1 cycle (100 ns), which is infeasible cycles 23-36 to perform max-pooling. In this case, the max of in practice. FORMS architecture solves this problem by adding 4 values is computed and written back into eDRAM. more but smaller ADCs with a higher sampling rate in the iso- V. EVALUATION RESULTS area fashion. In addition, due to the small fragment size, the In this section, we first evaluate our proposed FORMS buffer size required for storing intermediate results between optimization framework in terms of model accuracy and layers is decreased. The ISAAC 8-bit ADC operates at 1.2GHz ReRAM crossbar reduction achieved by combining crossbar- and is shared by 128 columns. As a result, processing one aware structured pruning, polarization, and quantization tech- bit of all active inputs requires (128 × 1 = 106.6ns). 1.2GHz niques. And we compare our results with several representative Instead, FORMS employs four 4-bit ADCs (within the same pruning works. 
Then, we analyze our FORMS accelerator × area of an 8-bit ADC but 1.8 times higher frequency) to architecture design in terms of area, power, throughput, and compute 128 dot-products, which results in a cycle time of performance rate, and compare the results with the state- 128 × 1 = 15 ns. As a result, FORMS improves the cycle 4 2.1GHz of-the-art accelerator designs. At the end of this section, time that assists to increase the throughput. Please note that we analyze the impact of variation from both the software we cannot scale down the ADC resolution continuously and and hardware perspective. All the results of FORMS are keep improving the lower area, power, and better frequency. based on 16-bit inputs/activations and 8-bit weights, and 2-bit Doing so will lead to a significant system throughput drop. ReRAM cells are used. The detailed experimental setups for Moreover, the overhead of orchestrating the results of all small the software and hardware results are illustrated in Section V-A ADCs would be considerable. and Section V-B. Figure 12 shows the proposed pipeline for FORMS. Like A. Comparing Model Compression Methods

global driver We evaluate our method using several representative bench- mark networks, including LeNet-5, ResNet-18/50, and VGG- input shift reg DAC input shift reg sub-array sub-array sub-array sub-array sign indicator 16, and using MNIST, CIFAR-10/100, and ImageNet datasets.

zero-skip input shift reg control logic reg The crossbar reduction results are from our crossbar-aware input shift reg DAC structured pruned, polarized, and quantized models comparing input shift reg sub-array sub-array sub-array sub-array adder

zero-skip input shift reg control logic to the original baseline model using a splitting mapping scheme [41]. All models are trained on an 8× NVIDIA Quadro mux

input shift reg DAC RTX 6000 GPU server by PyTorch API. input shift reg sub-array sub-array sub-array sub-array inv MNIST & CIFAR-10. Table I shows, on MNIST dataset zero-skip input shift reg control logic accumulation using LeNet5, FORMS achieves 185.44× crossbar reduction, controller crossbar array ADC ADC ADC ADC where 23.18× reduction comes from crossbar-aware structured sub-array decoder ADC 8 acc block acc block acc block acc block pruning, 4× reduction comes from quantization, and 2× re- duction from polarization by eliminating half of the crossbars Fig. 11: FORMS MCU Architecture Design. to represent positive/negative weights. When fragment size is Cycle 1 Cycle 2 Cycle 3 Cycle 4~16 Cycle 17 Cycle 18 Cycle 19 Cycle 20 Cycle 21 Cycle 22 eDRAM Crossbar Crossbar Crossbar Crossbar eDRAM ADC Shift Shift Read Parametters Array Array Array Array AF Write ADD ADD ADC ADC+shift+ADD ADC+shift+ADD ADC+shift+ADD shift+ADD

Potentially Skipped Cycles Fig. 12: Pipeline of the FORMS Architecture. TABLE I: The experimental results of FORMS on multi-layer TABLE III: FORMS MCU hardware specification and com- network on small to medium datasets. parison with ISAAC.

Original Acc. FORMS (Fragment size 8) ISAAC [18] Prune Fragment Crossbar Acc. Drop 2 2 Method Ratio Size Reduction Component Parameter Spec Power(mW ) Area(mm ) Spec Power(mW ) Area(mm ) (32-bit) (8-bit) resolution 4-bit 8-bit MNIST ADC frequency 2.1GHz 15.2 0.0091 1.2GHz 16 0.0096 GroupScissor [62] 0.01% number 32 8 99.15% 4.2× - 4.2× resolution 1-bit 1-bit LeNet5 (32-bit) DAC 4 0.00017 4 0.00017 4 -0.02% number 8×128 8×128 our 99.17% 23.18× 8 -0.01% 185.44× S&H number 8×128 0.0055 0.000023 8×128 0.01 0.00004 LeNet5 16 0.14% number 8 8 CIFAR-10 crossbar array size 128×128 2.44 0.00024 128×128 2.43 0.00023 bit/cell 2 2 IterativePrune [2] 0.3% 92.50% 2× - 2× S+A number 4 0.2 0.000024 4 0.2 0.000024 VGG16 (32-bit) Skipping Logic 0.01 0.0000001 –– TinyButAcc [47] 0.66% Sign Indicator 0.012 0.0000031 –– VGG16 93.70% 44.67× - 164.8× 4 0.61% our 93.70% 41.2× 8 0.64% 329.6× part as TINYBUTACC, with a lower pruning ratio. This helps VGG16 16 0.77% FORMS avoid unnecessary accuracy drop. By combining the AMC [63] 0.3% ResNet18 90.5% 2× - (32-bit) 2× fragment polarization and quantization, our overall crossbar TinyButAcc [47] 0.35% ResNet18 94.14% 52.07× - 203.4× reduction ratio is 2× higher than TINYBUTACC with similar 4 0.35% our accuracy. 94.14% 50.85× 8 0.47% 406.8× ResNet18 16 0.92% CIFAR-100 & ImageNet. Compared to CIFAR-10 dataset, TABLE II: The experimental results of FORMS on multi-layer the classification task on CIFAR-100 and ImageNet dataset network on medium to large datasets. Top-5 accuracy is used is more complicated. Especially for ImageNet, there are for ImageNet dataset. fewer redundant weights. Thus, it is common to have a much smaller pruning ratio on CIFAR-100 and Imagenet Original Acc. Prune Fragment Crossbar Acc. Drop dataset. As shown in Table II, on CIFAR-100 dataset, with- Method Ratio Size Reduction (32-bit) (8-bit) CIFAR-100 out accuracy loss or with minor accuracy loss, we achieve FPGM [64] 0.76% 6.65×(53.2×), 9.18×(73.44×) and 8.15×(65.20×) weight ResNet20 67.62% 1.73× - (32-bit) 1.73× 4 -0.06% pruning ratio (crossbar reduction) on ResNet-18, ResNet-50 our 76.37% 6.65× 8 -0.03% 53.2× ResNet18 16 0.17% and VGG-16, respectively. FPGM [64] 1.75% On ImageNet, we compare our results with references on ResNet56 71.41% 2.11× - (32-bit) 2.11× 4 0.10% ResNet-18. Since model accuracy is more sensitive to pruning our 77.35% 9.18× 8 0.31% 73.44× ResNet50 ratio on ImageNet dataset, to achieve a higher overall crossbar 16 0.61% Network Slim [65] -0.22% reduction ratio and maintain model accuracy, we use a less VGG16 73.26% 2× - (32-bit) 2× 4 -0.01% aggressive pruning strategy. Compared to TINYBUTACC, we our 73.32% 8.15× 8 0.10% 65.20× VGG16 achieve higher overall crossbar reduction (13.36× and 16×) 16 0.37% ImageNet with higher or similar accuracy when using fragment size of 4 DCP [66] 0.12% ResNet18 88.98% 1.42× - (32-bit) 1.42× or 8. In summary, compared to state-of-the-art works, FORMS TinyButAcc [47] 0.6% achieves high crossbar reduction with no or little accuracy loss. ResNet18 89.07% 3.33× - 12.4× 4 0.03% 1.67× 8 0.27% 13.36× B. Area and Power our 16 1.19% 89.08% ResNet18 4 0.34% We develop an in-house simulator to model access latency, 2.0× 8 0.62% 16.0× 16 1.73% energy, and area of all buffers, on-chip interconnects and 4 0.13% crossbar arrays as well as the performance of FORMS. 
The 2.15× 8 0.34% 17.2× our 16 1.17% 92.34% back-end of the tool utilizes CACTI 7.0 [68], NVSIM, and ResNet50 4 0.37% 3.67× 8 0.70% 29.36× NVSIM-CAM [69, 70] to provide a unified platform that can 16 1.62% support both volatile and non-volatile memories with multi- banking properties. It can also model process variations of the 4 or 8, there is no accuracy degradation (in fact, the accuracy size and the threshold voltage of transistors. We conservatively is higher than the original model since the pruning may reduce choose a 10% process variation for evaluations. We use the the overfitting), where there is a minor accuracy loss for the VTEAM ReRAM model [71]. The zero skipping logic is fragment size of 16. FORMS achieves a much higher crossbar modeled in Verilog HDL and synthesized using Synopsys De- reduction ratio than GROUPSCISSOR [62]. On CIFAR-10 sign Compiler at 45nm technology and scaled down to 32nm. dataset, we compare our results with several reference works. We follow the methodology of ISAAC paper to model max- TINYBUTACC [47] is a state-of-the-art ReRAM-based pruning pooling, shift-and-add, ADC, DACs, and activation functions. work using ADMM. By taking the crossbar size into account, The energy and area of the max-pool and shift-and-add are we achieve a similar crossbar reduction ratio from the pruning adapted from ISAAC [18] while for ReLU activation function circuits we use the information provided in PRIME [17]. A TABLE IV: Comparison of FORMS hardware characteristics with ISAAC and DaDianNao

Total FORMS (Fragment size 8) ISAAC [18] DaDianNao [67] Spec Power(mW ) Area(mm2) Spec Power(mW ) Area(mm2) Spec Power(mW ) Area(mm2) Dig unit 53.05 0.25 40.85 0.213 12 MCUs per tile 280.05 0.152 288.96 0.1580 NFU 16 4886 16.09 1 Tile (12 MCUs + Dig unit) 333.1 0.39 329.81 0.370 EDRAM 4/tile 36 MB 4760 33.12 168 Tiles 55960.8 66.27 55408.08 62.21 Global Bus 128 bits 12.8 15.66 HyperTransport freq 4/1.6GHz 10400 22.88 4/1.6GHz 10400 22.88 HyperTransport freq 4/1.6GHz 10400 22.88 HyperTransport bw 6.4GB/s 6.4GB/s HyperTransport bw 6.4 GB/s Chip Total 66360.8 89.15 65808.08 85.09 Chip Total 19856 86.2

TABLE V: Comparison the effective peak nominal throughput as ISAAC. The difference is negligible (0.08% more power per area unit and power unit of different architectures normal- and 4.5% more area). Like ISAAC and PUMA, in return for ized to ISAAC. consuming more area and power compared with DaDianNao, Architecture GOP s GOP s the throughput of FORMS is increased significantly. s×mm2 W ISAAC 1 1 C. Throughput DaDianNao 0.13 0.45 PUMA 0.70 0.79 Table V compares the peak nominal area-, power-efficiency TPU 0.08 0.48 of FORMS that uses the fragment size of 8 and 16 with WAX[74] 0.33 2.3 SIMBA[75] 0.34 0.08-2.5 state-of-the-art DNN accelerators in terms of the number of FORMS (polarization only, 8) 0.54 0.61 operations performed per second per mm2 and the number FORMS (polarization only, 16) 0.77 0.84 GOP s GOP s Pruned/Quantized-ISAAC 26.4 26.61 of operations performed per watt (i.e. s×mm2 and w ). Pruned/Quantized-PUMA 18.67 21.07 GOP s The results are normalized to the ISAAC in terms of s×mm2 FORMS (full optimization, 8) 36.02 27.73 GOP s FORMS (full optimization, 16) 39.48 51.26 and w , respectively. As results demonstrate, by taking advantage of the proposed pruning and quantization meth- charge pump circuit [72] is used to provide higher voltage as ods, the throughput of the ISAAC and PUMA are increased ReRAM cells require a write voltage more than the supply considerably. This potential increase can be achieved if in- Vdd. The off-chip links are the same HyperTransport serial terconnects can provide enough bandwidth for the processing link used by ISAAC [18] and DaDianNao [67]. Also, the crossbars. By only considering the polarization, the effective power and area of 1-bit DAC (a simple inverter) are adopted peak throughput of FORMS with the fragment size of 8 is from [60]. The ADC energy and area are taken from [59]. To 4.15×(1.36×) and 6.75×(1.27×) over DaDianNao and TPU, get power and area of the same style ADC used in ISAAC, respectively. By increasing the fragment size, the throughput but with 4-bit resolution, we scale down the power and area of is increased. For example, when we increase the fragment the memory, clock, and vref buffer linearly, and the capacitive size to 16, the throughput of FORMS with the polarization DAC exponentially [60]. The same scaling has been used in only is increased by 42%(29%). FORMS can outperform other works like ISAAC. We utilize a 4-bit ADC with 2.1 existing architectures when we apply all of the proposed GSps [73]. We choose this methodology to model peripheral methods including pruning, quantization, polarization, and circuitry and make a fair comparison at 32 nm with state- GOP s zero-skipping logic. In terms of s×mm2 , FORMS with the of-the-art works. We also compare the results with the fully fragment size of 16, outperforms the original non-pruned digital DaDianNao [67], after scaling from 28nm to 32nm. ISAAC and Pruned/Quantized ISAAC by 39× and 1.5×, To make a fair ISO-area comparison on throughput and GOP s respectively. These results in terms of w are 51× and frame processing rate with ISAAC, we build FORMS to have 1.9×. Overall, the peak throughput of the FORMS is higher similar power and area to ISAAC. We compare the power and than the existing designs. Note that WAX and SIMBA trade- area of the proposed building blocks in Table III. Table IV off throughput for power efficiency by reducing voltage and shows the area and power of the main components of FORMS, frequency. 
For instance, in SIMBA, the voltage is 0.48V and ISAAC and DaDianNao (digital design) without incorporating frequency is 0.52GHz, whereas, in WAX, the frequency is the pruning, and quantization, while the fragment is set to 8. 0.2GHz while the voltage is not mentioned. Specifically, FORMS needs sign indicator arrays, and a few interconnects, which connect ADCs to fragments. It is note- D. Frame Processing Rate worthy to mention that polarization help to accommodate more Reaching the higher throughput per area unit and power weights in the same crossbar than existing techniques. In addi- unit is not the only goal of an acceleration design [75, 76]. tion, since FORMS performs more computations than ISAAC, For instance, the throughput per area efficiency of commercial GOP s it needs higher bandwidth (512-bit versus 256-bit in ISAAC) TPU [21, 76] is only 41 s×mm2 . Moreover, many DNN- and larger eDRAM to store the results(128KB vs. 64KB in based applications take advantage of a higher frame per second ISAAC). However, since an ADC in FORMS generates 16 (fps) speed up or end to end throughput. For examples, object levels compared with 256 levels generated in ISAAC (due to detection [77], GoTurn [78], self-driving cars[79], virtual and using small fragment sizes), the required sample&hold circuit augmented reality applications [80] tend to have higher fps is smaller and faster. Overall, the sample&hold circuit in processing rate rather than higher throughput per area or power FORMS is almost 2× smaller than ISAAC. In addition, Zero- efficiency. This makes the high fps another crucial property for skipping logic saves dynamic power consumption by feeding DNN accelerators. fewer input bits (useless 0s) to the crossbar. In summary, We compare the frame per second speed up of different we build FORMS to have almost the same power and area architectures when we apply assorted proposed techniques. TABLE VI: Accuracy degradation caused by device variation for ResNet18 on different dataset under lognormal distribution with 0 mean and 0.1 standard deviation.

D. Frame Processing Rate

Reaching a higher throughput per area unit and per power unit is not the only goal of an accelerator design [75, 76]. For instance, the throughput-per-area efficiency of the commercial TPU [21, 76] is only 41 GOPs/(s×mm²). Moreover, many DNN-based applications benefit more from a higher frame-per-second (fps) rate, i.e., end-to-end throughput. For example, object detection [77], GoTurn [78], self-driving cars [79], and virtual/augmented reality applications [80] favor a high fps processing rate over a high throughput per area or power efficiency. This makes a high fps another crucial property for DNN accelerators.

We compare the frame-per-second speedup of different architectures when the assorted proposed techniques are applied. Figure 13 and Figure 14 show the results on the CIFAR-10, CIFAR-100, and ImageNet datasets, respectively. All results are normalized to the non-pruned ISAAC architecture with 32-bit weights.

Fig. 13: Speedup results in terms of frames per second on CIFAR-10 when various techniques proposed in FORMS are applied.

Fig. 14: Speedup results in terms of frames per second on CIFAR-100 and ImageNet (VGG16, ResNet18, and ResNet50) when various techniques proposed in FORMS are applied. (Both figures compare Pruned/Quantized-ISAAC, Pruned/Quantized-PUMA, and FORMS with fragment sizes 8 and 16, with and without zero-skipping.)

These results demonstrate that existing architectures such as ISAAC and PUMA can also benefit from the proposed pruning/quantization techniques, since structured pruning and quantization are architecture agnostic. Depending on the neural network design and dataset, applying pruning speeds up the frame processing rate of ISAAC by 7.5x to 200.8x; for PUMA, the speedup ranges from 5.3x to 142x. The results also show how much each technique proposed in FORMS contributes to the frame-rate speedup. The model optimizations, including pruning, quantization, and polarization, increase the FORMS speedup by 4x to 109.6x with a fragment size of 8, and by 5.8x to 155.8x with a fragment size of 16. As discussed in Section IV-B, existing mixed-signal designs cannot take advantage of zero-skipping logic; FORMS utilizes this novel optimization to reach a higher fps. Applying zero-skipping on top of the model optimizations, the speedup of FORMS ranges from 10.7x to 377.9x with a fragment size of 8, and from 11.2x to 336.9x with a fragment size of 16. As demonstrated, FORMS processes frames (images) much faster than all current mixed-signal DNN accelerators.
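To illustrate the intent of the zero-skipping logic referenced above, here is a minimal, hedged Python sketch of a bit-serial fragment MAC that skips any cycle whose input bit-slice is all zero; the fragment size, bit width, and data are illustrative assumptions rather than the actual FORMS control logic.

import numpy as np

# Hedged sketch of input zero-skipping for one crossbar fragment: inputs are fed
# bit-serially (LSB first); a cycle is skipped when the bit-slice across the
# fragment's rows is all zero, so no crossbar evaluation (and no ADC sample) is
# needed for it. Fragment size and bit width are illustrative assumptions.

def fragment_mac_with_zero_skipping(inputs, weights, n_bits=8):
    acc, skipped = 0, 0
    for b in range(n_bits):
        bit_slice = (inputs >> b) & 1          # one bit from every row input
        if not bit_slice.any():                # all-zero slice -> skip the cycle
            skipped += 1
            continue
        partial = int(bit_slice @ weights)     # analog column sum, modeled digitally
        acc += partial << b                    # shift-and-add over bit positions
    return acc, skipped

rng = np.random.default_rng(0)
inputs = rng.integers(0, 16, size=8)           # 8-row fragment inputs
weights = rng.integers(0, 16, size=8)          # unsigned, polarized column weights
result, skipped_cycles = fragment_mac_with_zero_skipping(inputs, weights)
assert result == int(inputs @ weights)         # skipping changes latency, not the result
print(result, "cycles skipped:", skipped_cycles)

Because the skipped cycles never drive the crossbar or the ADC, the saving shows up as both reduced latency and reduced dynamic power, which is consistent with the fps results reported above.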

E. Variation Analysis

Variation is a challenge in building in-situ accelerators [81]. We evaluate the impact of ReRAM device variation caused by imperfections in the fabrication technology. We model the device variation as a log-normal distribution [82]. Table VI shows the accuracy degradation of ResNet18 (other networks show similar results) on different datasets when a device variation with zero mean and 0.1 standard deviation is introduced. The accuracy degradation is measured relative to the accuracy results in Tables I and II and is averaged over 50 runs.

TABLE VI: Accuracy degradation caused by device variation for ResNet18 on different datasets under a log-normal distribution with zero mean and 0.1 standard deviation.

Dataset      Original   Polarization Only   Pruning Only   Full Optimization
CIFAR-10     0.35%      0.37%               1.82%          1.80%
CIFAR-100    0.72%      0.68%               1.86%          1.89%
ImageNet     2.87%      2.86%               4.24%          4.21%

The classification task on ImageNet is much harder than on CIFAR-10/100 and is therefore more sensitive to hardware variation [83]; for instance, the original model (without any compression) already suffers around a 2.8% accuracy degradation. As shown in Table VI, our polarization, quantization, and zero-skipping techniques do not decrease the network robustness, while pruning introduces a certain degradation of robustness (an extra 1.3% accuracy degradation). The reason is that when 50% of the weights are pruned (under a 2x pruning rate), each remaining weight becomes more important, so the network's resistance to variation is reduced. This effect occurs in all pruning works and is not unique to FORMS. The accuracy degradation can be relieved by reducing the pruning ratio; such a trade-off depends on the demands of the target application. It is worth noting that prior techniques for improving robustness [29, 84, 85] can also be applied to FORMS.
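The following minimal Python/NumPy sketch shows one way to reproduce this style of variation study under our reading of the setup: each mapped weight is perturbed multiplicatively by a log-normal sample whose underlying normal distribution has zero mean and 0.1 standard deviation, and the accuracy drop is averaged over 50 runs. The evaluate_accuracy callable and the toy weights are placeholders, not the actual FORMS evaluation pipeline.

import numpy as np

# Hedged sketch of the variation study: multiplicative log-normal noise
# (underlying normal with mean 0, std 0.1) applied to every mapped weight,
# averaged over 50 runs. `evaluate_accuracy` and the toy weights are
# placeholders for the real network and test pipeline.

def perturb(weights, sigma=0.1, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    return [w * rng.lognormal(mean=0.0, sigma=sigma, size=w.shape) for w in weights]

def average_degradation(model_weights, evaluate_accuracy, runs=50, sigma=0.1, seed=0):
    rng = np.random.default_rng(seed)
    baseline = evaluate_accuracy(model_weights)
    drops = [baseline - evaluate_accuracy(perturb(model_weights, sigma, rng))
             for _ in range(runs)]
    return float(np.mean(drops))

# Toy stand-in: "accuracy" is a dummy function of the weights, just to make the
# sketch runnable end to end; a real study would run inference on the test set.
toy_weights = [np.random.default_rng(1).normal(size=(4, 4))]
toy_accuracy = lambda ws: 1.0 - min(0.5, abs(float(np.sum(ws[0]))) * 0.01)
print(average_degradation(toy_weights, toy_accuracy))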
VI. CONCLUSION

We propose FORMS, a fine-grained ReRAM-based DNN accelerator with algorithm/hardware co-design optimizations. A novel fragment polarization method elegantly solves the inherent challenge of representing positive/negative weights on ReRAM crossbars without doubling the crossbar cost or introducing extra hardware for result compensation. Weight pruning and quantization techniques are combined to explore a good balance between overall hardware cost and performance. We design a fine-grained architecture that is more robust to non-idealities and noise and makes the ADC implementation less challenging than coarse-grained architecture designs. Crucially, our novel zero-skipping logic avoids unnecessary computations and significantly reduces computation time. Putting it all together, our FORMS optimization framework speeds up ISAAC by up to 377.9x. Combining our optimization framework and architecture design, FORMS achieves 1.50x and 1.93x improvements in area and power efficiency, in terms of GOPs/(s×mm²) and GOPs/W, over an optimized ISAAC with almost the same power/area cost.

ACKNOWLEDGEMENT

This work is funded by the National Science Foundation Awards CCF-1750656, CCF-1919289, CCF-1937500, and CNS-1909172.

REFERENCES

[1] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[2] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” in NeurIPS, 2015.
[3] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured sparsity in deep neural networks,” in NeurIPS, 2016, pp. 2074–2082.
[4] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng, “Quantized convolutional neural networks for mobile devices,” in CVPR, 2016.
[5] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, “DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning,” in ACM Sigplan Notices, 2014.
[6] Google supercharges machine learning tasks with TPU custom chip, https://cloudplatform.googleblog.com/2016/05/Google-supercharges-machine-learning-tasks-with-custom-chip.html.
[7] X. Ma, Y. Zhang, G. Yuan, A. Ren, Z. Li, J. Han, J. Hu, and Y. Wang, “An area and energy efficient design of domain-wall memory-based deep convolutional neural networks using stochastic computing,” in ISQED, 2018.
[8] C. Ding, S. Liao, Y. Wang, Z. Li, N. Liu, Y. Zhuo, C. Wang, X. Qian, Y. Bai, G. Yuan, X. Ma, Y. Zhang, J. Tang, Q. Qiu, X. Lin, and B. Yuan, “CirCNN: Accelerating and compressing deep neural networks using block-circulant weight matrices,” in Proceedings of the 50th MICRO. ACM, 2017.
[9] Y. Wang, C. Ding, G. Yuan, S. Liao, Z. Li, X. Ma, B. Yuan, X. Qian, J. Tang, Q. Qiu, and X. Lin, “Towards ultra-high performance and energy efficiency of deep learning systems: An algorithm-hardware co-optimization framework,” in AAAI Conference on Artificial Intelligence. AAAI, 2018.
[10] C. Ding, A. Ren, G. Yuan, X. Ma, J. Li, N. Liu, B. Yuan, and Y. Wang, “Structured weight matrices-based hardware accelerators in deep neural networks: FPGAs and ASICs,” in Proceedings of the 2018 Great Lakes Symposium on VLSI, 2018.
[11] G. Yuan, X. Ma, S. Lin, Z. Li, and C. Ding, “A SOT-MRAM-based processing-in-memory engine for highly compressed DNN implementation,” arXiv preprint arXiv:1912.05416, 2019.
[12] M. M. Waldrop, “The chips are down for Moore's law,” Nature News, vol. 530, no. 7589, 2016.
[13] J. L. Hennessy and D. A. Patterson, “A new golden age for computer architecture,” Commun. ACM, vol. 62, no. 2, pp. 48–60, Jan. 2019. [Online]. Available: https://doi.org/10.1145/3282307
[14] L. Xia, B. Li, T. Tang, P. Gu, P.-Y. Chen, S. Yu, Y. Cao, Y. Wang, Y. Xie, and H. Yang, “MNSIM: Simulation platform for memristor-based neuromorphic computing system,” IEEE T-CAD, 2017.
[15] A. Thomas, “Memristor-based neural networks,” Journal of Physics D: Applied Physics, 2013.
[16] L. Chua, “Memristor-the missing circuit element,” IEEE Transactions on Circuit Theory, 1971.
[17] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, “PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory,” in 2016 ACM/IEEE 43rd ISCA, 2016.
[18] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, “ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars,” ACM SIGARCH Computer Architecture News, 2016.
[19] A. Nag, R. Balasubramonian, V. Srikumar, R. Walker, A. Shafiee, J. P. Strachan, and N. Muralimanohar, “Newton: Gravitating towards the physical limits of crossbar acceleration,” IEEE Micro, 2018.
[20] L. Song, X. Qian, H. Li, and Y. Chen, “PipeLayer: A pipelined ReRAM-based accelerator for deep learning,” in IEEE HPCA, 2017.
[21] A. Ankit, I. E. Hajj, S. R. Chalamalasetti, G. Ndu, M. Foltin, R. S. Williams, P. Faraboschi, W.-m. W. Hwu, J. P. Strachan, and K. Roy, “PUMA: A programmable ultra-efficient memristor-based accelerator for machine learning inference,” in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, 2019, pp. 715–731.
[22] M. J. Marinella, S. Agarwal, A. Hsia, I. Richter, R. Jacobs-Gedrim, J. Niroula, S. J. Plimpton, E. Ipek, and C. D. James, “Multiscale co-design analysis of energy, latency, area, and accuracy of a ReRAM analog neural training accelerator,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 2018.
[23] W. Li, P. Xu, Y. Zhao, H. Li, Y. Xie, and Y. Lin, “TIMELY: Pushing data movements and interfaces in PIM accelerators towards local and in time domain,” arXiv preprint arXiv:2005.01206, 2020.
[24] S. Ghodrati, H. Sharma, S. Kinzer, A. Yazdanbakhsh, K. Samadi, N. S. Kim, D. Burger, and H. Esmaeilzadeh, “Mixed-signal charge-domain acceleration of deep neural networks through interleaved bit-partitioned arithmetic,” arXiv preprint arXiv:1906.11915, 2019.
[25] S. Choi, S. Jang, J.-H. Moon, J. C. Kim, H. Y. Jeong, P. Jang, K.-J. Lee, and G. Wang, “A self-rectifying TaOy/nanoporous TaOx memristor synaptic array for learning and energy-efficient neuromorphic systems,” NPG Asia Materials, vol. 10, no. 12, pp. 1097–1106, 2018.
[26] C. Li, D. Belkin, Y. Li, P. Yan, M. Hu, N. Ge, H. Jiang, E. Montgomery, P. Lin, Z. Wang et al., “Efficient and self-adaptive in-situ learning in multilayer memristor neural networks,” Nature Communications, vol. 9, no. 1, pp. 1–8, 2018.
[27] Z. Liu, J. Tang, B. Gao, P. Yao, X. Li, D. Liu, Y. Zhou, H. Qian, B. Hong, and H. Wu, “Neural signal analysis with memristor arrays towards high-efficiency brain–machine interfaces,” Nature Communications, vol. 11, no. 1, pp. 1–9, 2020.
[28] O. Krestinskaya, B. Choubey, and A. James, “Memristive GAN in analog,” Scientific Reports, vol. 10, no. 1, pp. 1–14, 2020.
[29] G. Yuan, Z. Liao, X. Ma, Y. Cai, Z. Kong, X. Shen, J. Fu, Z. Li, C. Zhang, H. Peng, N. Liu, A. Ren, J. Wang, and Y. Wang, “Improving DNN fault tolerance using weight pruning and differential crossbar mapping for ReRAM-based edge AI,” in 2021 22nd International Symposium on Quality Electronic Design (ISQED), 2021, pp. 135–141.
[30] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends® in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
[31] D. A. Johns and K. Martin, Analog Integrated Circuit Design. John Wiley & Sons, 2008.
[32] W. Kester, “ADC input noise: The good, the bad, and the ugly. Is no noise good noise?” Analog Dialogue, vol. 40, no. 2, pp. 1–5, 2006.
[33] J. Izraelevitz, J. Yang, L. Zhang, J. Kim, X. Liu, A. Memaripour, Y. J. Soh, Z. Wang, Y. Xu, and S. R. Dulloor, “Basic performance measurements of the Intel Optane DC persistent memory module,” arXiv preprint arXiv:1903.05714, 2019.
[34] F. T. Hady, A. Foong, B. Veal, and D. Williams, “Platform storage performance with 3D XPoint technology,” Proceedings of the IEEE, vol. 105, no. 9, pp. 1822–1833, 2017.
[35] H.-D. Trinh, C.-Y. Tsai, and H.-L. Lin, “Resistive RAM structure and method of fabrication thereof,” May 22, 2018, US Patent 9,978,938.
[36] I. Kataeva, S. Ohtsuka, H. Nili, H. Kim, Y. Isobe, K. Yako, and D. Strukov, “Towards the development of analog neuromorphic chip prototype with 2.4M integrated memristors,” in 2019 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, 2019, pp. 1–5.
[37] X. Sun, S. Yin, X. Peng, R. Liu, J.-s. Seo, and S. Yu, “XNOR-RRAM: A scalable and parallel resistive synaptic architecture for binary neural networks,” in 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2018, pp. 1423–1428.
[38] M. Imani, M. Samragh, Y. Kim, S. Gupta, F. Koushanfar, and T. Rosing, “RAPIDNN: In-memory deep neural network acceleration framework,” arXiv preprint arXiv:1806.05794, 2018.
[39] T.-H. Yang, H.-Y. Cheng, C.-L. Yang, I.-C. Tseng, H.-W. Hu, H.-S. Chang, and H.-P. Li, “Sparse ReRAM engine: Joint exploration of activation and weight sparsity in compressed neural networks,” in Proceedings of the 46th ISCA, 2019.
[40] G. Yuan, P. Behnam, Y. Cai, A. Shafiee, J. Fu, Z. Liao, Z. Li, X. Ma, J. Deng, J. Wang, M. Bojnordi, Y. Wang, and C. Ding, “TinyADC: Peripheral circuit-aware weight pruning framework for mixed-signal DNN accelerators,” in Design, Automation and Test in Europe Conference (DATE), 2021.
[41] S. Sayyaparaju, G. Chakma, S. Amer, and G. S. Rose, “Circuit techniques for online learning of memristive synapses in CMOS-memristor neuromorphic systems,” in Proceedings of the Great Lakes Symposium on VLSI 2017. ACM, 2017, pp. 479–482.
[42] M. Imani, S. Gupta, Y. Kim, and T. Rosing, “FloatPIM: In-memory acceleration of deep neural network training with high precision,” in Proceedings of the 46th International Symposium on Computer Architecture, 2019, pp. 802–815.
[43] Y. Cai, H. Li, G. Yuan, W. Niu, Y. Li, X. Tang, B. Ren, and Y. Wang, “YOLObile: Real-time object detection on mobile devices via compression-compilation co-design,” arXiv preprint arXiv:2009.05697, 2020.
[44] X. Ma, G. Yuan, S. Lin, Z. Li, H. Sun, and Y. Wang, “ResNet can be pruned 60x: Introducing network purification and unused path removal (P-RM) after weight pruning,” in 2019 IEEE/ACM International Symposium on Nanoscale Architectures (NANOARCH). IEEE, 2019.
[45] X. Ma, S. Lin, S. Ye, Z. He, L. Zhang, G. Yuan, S. H. Tan, Z. Li, D. Fan, X. Qian et al., “Non-structured DNN weight pruning–is it beneficial in any platform?” IEEE Transactions on Neural Networks and Learning Systems, 2021.
[46] Z. Li, Y. Gong, X. Ma, S. Liu, M. Sun, Z. Zhan, Z. Kong, G. Yuan, and Y. Wang, “SS-Auto: A single-shot, automatic structured weight pruning framework of DNNs with ultra-high efficiency,” arXiv preprint arXiv:2001.08839, 2020.
[47] X. Ma, G. Yuan, S. Lin, C. Ding, F. Yu, T. Liu, W. Wen, X. Chen, and Y. Wang, “Tiny but accurate: A pruned, quantized and optimized memristor crossbar framework for ultra efficient DNN implementation,” in 25th ASP-DAC, 2020.
[48] G. Yuan, X. Ma, C. Ding, S. Lin, T. Zhang, Z. S. Jalali, Y. Zhao, L. Jiang, S. Soundarajan, and Y. Wang, “An ultra-efficient memristor-based DNN framework with structured weight pruning and quantization using ADMM,” in 2019 IEEE/ACM ISLPED, 2019.
[49] A. Ren, T. Zhang, S. Ye, J. Li, W. Xu, X. Qian, X. Lin, and Y. Wang, “ADMM-NN: An algorithm-hardware co-design framework of DNNs using alternating direction methods of multipliers,” in 24th ASPLOS, 2019.
[50] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[51] M. Hong, Z.-Q. Luo, and M. Razaviyayn, “Convergence analysis of alternating direction method of multipliers for a family of nonconvex problems,” SIAM Journal on Optimization, vol. 26, no. 1, pp. 337–364, 2016.
[52] S. Liu, J. Chen, P.-Y. Chen, and A. Hero, “Zeroth-order online alternating direction method of multipliers: Convergence analysis and applications,” in International Conference on Artificial Intelligence and Statistics, 2018.
[53] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends® in Machine Learning, 2011.
[54] T. Zhang, S. Ye, K. Zhang, J. Tang, W. Wen, M. Fardad, and Y. Wang, “A systematic DNN weight pruning framework using alternating direction method of multipliers,” in ECCV, 2018.
[55] M. N. Bojnordi and E. Ipek, “Memristive Boltzmann machine: A hardware accelerator for combinatorial optimization and deep learning,” in 2017 Fifth Berkeley Symposium on Energy Efficient Electronic Systems & Steep Transistors Workshop (E3S), 2017, pp. 1–3.
[56] Y.-C. Chen, C. Chen, C. Chen, J. Yu, S. Wu, S. Lung, R. Liu, and C.-Y. Lu, “An access-transistor-free (0T/1R) non-volatile resistance random access memory (RRAM) using a novel threshold switching, self-rectifying chalcogenide device,” in IEEE International Electron Devices Meeting 2003. IEEE, 2003, pp. 37-4.
[57] P. Srivastava, M. Kang, S. K. Gonugondla, S. Lim, J. Choi, V. Adve, N. S. Kim, and N. Shanbhag, “PROMISE: An end-to-end design of a programmable mixed-signal accelerator for machine-learning algorithms,” in ISCA, 2018.
[58] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” CoRR, 2015.
[59] B. Murmann, “ADC performance survey 1997-2019,” [Online]. Available: http://web.stanford.edu/~murmann/adcsurvey.html
[60] M. Saberi, R. Lotfi, K. Mafinezhad, and W. A. Serdijn, “Analysis of power consumption and linearity in capacitive digital-to-analog converters used in successive approximation ADCs,” IEEE Transactions on Circuits and Systems I: Regular Papers, 2011.
[61] C.-X. Xue, T.-Y. Huang, J.-S. Liu, T.-W. Chang, H.-Y. Kao, J.-H. Wang, T.-W. Liu, S.-Y. Wei, S.-P. Huang, W.-C. Wei et al., “15.4 A 22nm 2Mb ReRAM compute-in-memory macro with 121-28TOPS/W for multibit MAC computing for tiny AI edge devices,” in ISSCC, 2020.
[62] Y. Wang, W. Wen, B. Liu, D. Chiarulli, and H. Li, “Group scissor: Scaling neuromorphic computing design to large neural networks,” in DAC. IEEE, 2017.
[63] Y. He, J. Lin, Z. Liu, H. Wang, L.-J. Li, and S. Han, “AMC: AutoML for model compression and acceleration on mobile devices,” in ECCV, 2018.
[64] Y. He, P. Liu, Z. Wang, Z. Hu, and Y. Yang, “Filter pruning via geometric median for deep convolutional neural networks acceleration,” CVPR, 2019.
[65] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, “Learning efficient convolutional networks through network slimming,” in ICCV, 2017.
[66] Z. Zhuang, M. Tan, B. Zhuang, J. Liu, Y. Guo, Q. Wu, J. Huang, and J. Zhu, “Discrimination-aware channel pruning for deep neural networks,” in NeurIPS, 2018.
[67] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, “DaDianNao: A machine-learning supercomputer,” in MICRO, 2014.
[68] R. Balasubramonian, A. B. Kahng, N. Muralimanohar, A. Shafiee, and V. Srinivas, “CACTI 7: New tools for interconnect exploration in innovative off-chip memories,” ACM TACO, 2017.
[69] S. Li, L. Liu, P. Gu, C. Xu, and Y. Xie, “NVSim-CAM: A circuit-level simulator for emerging nonvolatile memory based content-addressable memory,” in ICCAD, 2016.
[70] X. Dong, C. Xu, Y. Xie, and N. P. Jouppi, “NVSim: A circuit-level performance, energy, and area model for emerging nonvolatile memory,” IEEE T-CAD, 2012.
[71] S. Kvatinsky, M. Ramadan, E. G. Friedman, and A. Kolodny, “VTEAM: A general model for voltage-controlled memristors,” IEEE Transactions on Circuits and Systems II: Express Briefs, 2015.
[72] G. Palumbo and D. Pappalardo, “Charge pump circuits: An overview on design strategies and topologies,” IEEE Circuits and Systems Magazine, 2010.
[73] C.-H. Chan, Y. Zhu, I.-M. Ho, W.-H. Zhang, U. Seng-Pan, and R. P. Martins, “16.4 A 5mW 7b 2.4GS/s 1-then-2b/cycle SAR ADC with background offset calibration,” in 2017 IEEE International Solid-State Circuits Conference (ISSCC). IEEE, 2017, pp. 282–283.
[74] S. Gudaparthi, S. Narayanan, R. Balasubramonian, E. Giacomin, H. Kambalasubramanyam, and P.-E. Gaillardon, “Wire-aware architecture and dataflow for CNN accelerators,” in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, 2019, pp. 1–13.
[75] Y. S. Shao, J. Clemons, R. Venkatesan, B. Zimmer, M. Fojtik, N. Jiang, B. Keller, A. Klinefelter, N. Pinckney, and P. Raina, “Simba: Scaling deep-learning inference with multi-chip-module-based architecture,” in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, 2019, pp. 14–27.
[76] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers et al., “In-datacenter performance analysis of a tensor processing unit,” in Proceedings of the 44th Annual International Symposium on Computer Architecture, 2017.
[77] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
[78] D. Held, S. Thrun, and S. Savarese, “Learning to track at 100 FPS with deep regression networks,” in European Conference on Computer Vision. Springer, 2016, pp. 749–765.
[79] S.-C. Lin, Y. Zhang, C.-H. Hsu, M. Skach, M. E. Haque, L. Tang, and J. Mars, “The architectural implications of autonomous driving: Constraints and acceleration,” in Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, 2018, pp. 751–766.
[80] P. Lincoln, A. Blate, M. Singh, T. Whitted, A. State, A. Lastra, and H. Fuchs, “From motion to photons in 80 microseconds: Towards minimal latency for virtual and augmented reality,” IEEE Transactions on Visualization and Computer Graphics, vol. 22, no. 4, 2016.
[81] T. Dalgaty, N. Castellani, C. Turck, K.-E. Harabi, D. Querlioz, and E. Vianello, “In situ learning using intrinsic memristor variability via Markov chain Monte Carlo sampling,” Nature Electronics, pp. 1–11, 2021.
[82] G. Charan, J. Hazra, K. Beckmann, X. Du, G. Krishnan, R. V. Joshi, N. C. Cady, and Y. Cao, “Accurate inference with inaccurate RRAM devices: Statistical data, model transfer, and on-line adaptation,” in 2020 57th ACM/IEEE Design Automation Conference (DAC), 2020, pp. 1–6.
[83] S. Mittal, “A survey on modeling and improving reliability of DNN algorithms and accelerators,” Journal of Systems Architecture, vol. 104, p. 101689, 2020.
[84] B. Liu, H. Li, Y. Chen, X. Li, Q. Wu, and T. Huang, “Vortex: Variation-aware training for memristor x-bar,” in 2015 52nd ACM/EDAC/IEEE Design Automation Conference (DAC), 2015, pp. 1–6.
[85] Y. Long, X. She, and S. Mukhopadhyay, “Design of reliable DNN accelerator with un-reliable ReRAM,” in 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2019, pp. 1769–1774.