Tiny but Accurate: A Pruned, Quantized and Optimized Crossbar Framework for Ultra Efficient DNN Implementation

Xiaolong Ma†1, Geng Yuan†1, Sheng Lin1, Caiwen Ding2, Fuxun Yu3, Tao Liu4, Wujie Wen4, Xiang Chen3, Yanzhi Wang1 1Northeastern University, 2University of Connecticut, 3George Mason University, 4Florida International University E-mail: 1{ma.xiaol, yuan.geng, lin.sheng,}@husky.neu.edu, [email protected], [email protected], 3{fyu2, xchen26}@gmu.edu, 4{tliu023, wwen}@fiu.edu

Abstract— The state-of-art DNN structures involve vector multiplication in the analog domain and the com- intensive computation and high memory storage. To putation is in O(1) time complexity [12,13]. Motivated by mitigate the challenges, the memristor crossbar array the fact that there is no precedent model that is structured has emerged as an intrinsically suitable matrix compu- pruned and quantized as well as satisfying memristor hard- tation and low-power acceleration framework for DNN ware constraints, in this work, a memristor-based ADMM applications. However, the high accuracy solution for regularized optimization method is utilized both on struc- extreme model compression on memristor crossbar ar- tured pruning and weight quantization in order to mitigate ray architecture is still waiting for unraveling. In this the accuracy degradation during extreme model compres- paper, we propose a memristor-based DNN frame- sion. A structured pruned model can potentially benefit work which combines both structured weight pruning for high-parallelism implementation in crossbar architec- and quantization by incorporating alternating direction ture. Further more, quantized weights can reduce hard- method of multipliers (ADMM) algorithm for better ware imprecision during read/write procedure, and save pruning and quantization performance. We also dis- more hardware footprint due to less peripheral circuits are cover the non-optimality of the ADMM solution in needed to support fewer bits. weight pruning and the unused data path in a struc- However, to achieve ultra-high compression ratio, an tured pruned model. Motivated by these discoveries, ADMM pruning method [3, 14] cannot fully exploit all re- we design a software-hardware co-optimization frame- dundancy in a neural network model. As a result, we design work which contains the first proposed Network Pu- a hardware-software co-optimization framework in which rification and Unused Path Removal algorithms tar- we investigate Network Purification and Unused Path Re- geting on post-processing a structured pruned model moval after the procedure of structured weight pruning with after ADMM steps. By taking memristor hardware ADMM. Moreover, we utilize distilled knowledge from soft- constraints into our whole framework, we achieve ex- ware model to guide our memristor hardware constraint treme high compression ratio on the state-of-art neu- quantization. To the best of our knowledge, we are the first ral network structures with minimum accuracy loss. to combine extreme structured weight pruning and weight For quantizing structured pruned model, our frame- quantization in an unified and systematic memristor-based work achieves nearly no accuracy loss after quantiz- framework. Also, we are the first to discover the redun- ing weights to 8-bit memristor weight representation. dant weights and unused path in a structured pruned DNN We share our models at anonymous link https://bit. model and design a sophisticate co-optimization framework ly/2VnMUy0. to boost higher model compression rate as well as maintain high network accuracy. By incorporating memristor hard- 1 Introduction ware constraints in our model, our frameworks are guar- anteed feasible for a real memristor crossbar device. The Structured weight pruning [1–3] and weight quantiza- contributions of this paper include: arXiv:1908.10017v1 [eess.SP] 27 Aug 2019 tion [4–6] techniques are developed to facilitate weight com- pression and computation acceleration to solve the high de- • We adopt ADMM for efficiently optimizing the non- mand for parallel computation and storage resources [7–9]. convex problem and utilized this method on structured Even though models are compressed, computation com- weight pruning. plexity still burden the overall performance of the state-of- • We systematically investigate the weight quantization art CMOS hardware applications. on a pruned model with memristor hardware con- To mitigate the bottleneck caused by CMOS-based DNN straints. architectures, the next-generation device/circuit technolo- gies [10,11] triumph CMOS in their non-volatility, high en- • We design a software-hardware co-optimization frame- ergy efficiency, in-memory computing capability and high work in which Network Purification and Unused Path scalability. Memristor crossbar device has shown its po- Removal are first proposed. tential for bearing all these characteristic which makes it intrinsically suitable for large DNN hardware architecture We evaluate our proposed memristor framework on dif- design. A memristor crossbar device can perform matrix- ferent networks. We conclude that structured pruning method with memristor-based ADMM regularized opti- ∗†These authors contributed equally. mization achieves high compression ratio and desirable Vertical high accuracy. Software and hardware experimental results BL Bit Line shows our memristor framework is very energy efficient and V1 saves great amount of hardware footprint. V2 undoped V3 2 Related Works doped

WL Heuristic weight pruning methods [15] are widely used in Wi,j neuromorphic computing designs to reduce the weight stor- Vi Horizontal age and computing delay [16]. [16] implemented weight World Line I1 I2 I3 Ij pruning techniques on a neuromorphic computing sys- Figure 1: memristor and memristor crossbar tem using irregular pruning caused unbalanced workload, greater circuits overheads and extra memory requirement tions [21]. It is known that memristor is a thin-film device on indices. To overcome the limitations, [17] proposed constructed by a region highly doped with oxygen vacan- group connection deletion, which structually prunes con- cies and an undoped region. By nature, applying an electric nections to reduce routing congestion between memristor field across the memristor over a period of time, the oxy- crossbar arrays. gen vacancies would migrate to the direction along with Weight quantization can mitigate hardware imperfec- the electric field, which leads to the (memristance) state tion of memristor including state drift and process vari- drift. Consequently, an error will incur when the state of ations, caused by the imperfect fabrication process or by memristor drifts to another state level. the device feature itself [4, 5]. [18] presented a technique It has been proved that applying quantization on to reduce the overhead of Digital-to-Analog Converters memristor-based designs can mitigate the undesired im- (DACs)/Analog-to-Digital Converters (ADCs) in resistive pacts caused by hardware imperfections [22]. random-access memory (ReRAM) neuromorphic comput- 4 A Memristor-Based Highly Compressed ing systems. They first normalized the data, and then DNN Framework quantized intermediary data to 1-bit value. This can be directly used as the analog input for ReRAM crossbar and, The memristor crossbar structure has shown its poten- hence, avoids the need of DACs. tial for neuromorphic computing system compared to the CMOS technologies [16]. Due to great amount of weights 3 Background on and computations that involved in networks, an efficient 3.1 Memristor Crossbar Model and highly performed framework is needed to conquer the memory storage and energy consumption problems. Memristor [10] crossbar is an array structure consists of We propose an unified memristor-based framework includ- memristors, horizontal Word-lines and Vertical Bit-lines, ing memristor-based ADMM regularized optimization and as shown in Figure 1. Due to its outstanding performance masked mapping. on computing matrix-vector multiplications (MVM), mem- ristor crossbars are widely used as dot-product accelerator 4.1 Problem Formulation in recent neuromorphic computing designs [19]. By pro- ADMM [23] is an advanced optimization technique which gramming the conductance state (which is also known as decompose an original problem into subproblems that “memristance”) of each memristor, the weight matrix W can be solved separately and iteratively. By adopt- can be mapped onto the memristor crossbar. Given the ing memristor-based ADMM regularized optimization, the input voltage vector Vi, the MVM output current vector framework can guarantee the solution feasibility (satisfying Ij can be obtained in time complexity of O(1). memristor hardware constraints) while provide high solu- 3.2 Challenges in Memristor Crossbars Imple- tion quality (no obvious accuracy degradation after prun- mentation and Mitigation Techniques ing). First, the memristor-based ADMM regularized optimiza- Different from the software-based designs, hardware im- tion starts from a pre-trained full size DNN model without perfection is one of the key issues that causes the hard- compression. Consider an N- DNNs, sets of weights ware non-ideal behaviors and needs to be considered in of the i-th (CONV or FC) layer are denoted by Wi. And memristor-based designs. The hardware imperfection of the associated with the DNN is denoted by N  memristor devices are mainly come from the imperfect fab- f {Wi}i=1 . The overall problem is defined by rication process and the memristor features. N  Process Variation. Process variation is one major minimize f {Wi}i=1 , {Wi} hardware imperfection that caused by the fluctuations (1) subject to W ∈ P , W ∈ Q , i = 1,...,N. in fabrication process. It mainly comes from the line- i i i i edge roughness, oxide thickness fluctuations, and random Given the value of αi, the memristor-based constraint set P dopant variations [20]. Inevitably, process variation plays Pi = {Wi| (structured Wi 6= 0) ≤ αi} and Qi={the an increasingly significant role as the process technology weights in the i-th layer are mapped to the quantization scales down to nanometer level. In a DNN hardware de- values}, where αi is predefined hyper parameters. The gen- sign, the non-ideal behaviors caused by process variations eral constraint can be extended in structured pruning such may lead to an accuracy degradation. as filter pruning, channel pruning and column pruning, State Drift. State drift is the phenomenon that the which facilitate high-parallelism implementation in hard- memristance would change after several reading oper- ware. Filter Pruning Channel Pruning Filter Shape Pruning k+1 k k+1 k+1 k+1 k k+1 k+1 Ui := Ui +Wi −Yi ; Vi := Vi +Wi −Zi (7) which (5) is the proximal step, (6) is projection step and Filter 1 Filter 1 Filter 1 (7) is dual variables update. The optimal solution is the Euclidean projection (masked mapping) of Wk+1 + Uk and Wk+1 + Vk onto P and Q . Filter 2 Filter 2 Filter 2 i i i i i i Namely, elements in solution that less than αi will be set to ...... zero. In the meantime, those kept elements are quantized to the closest valid memristor state value. Filter A Filter Ai Filter Ai i 4.3 Memristor-Based Structured Weight Pruning Figure 2: Illustration of filter-wise, channel-wise and shape- In order to accommodate high-parallelism implementation wise structured sparsities. in hardware, we use structured pruning method [1] instead Similarly, for weight quantization, elements in Qi are of the irregular pruning method [15] to reduce the size the solutions of Wi. Assume set of qi,1, qi,2, ··· , qi,Mi is of the weight matrix while avoid extra memory storage the available memristor state value which is the elements requirement for indices. Figure 2 shows different types in Wi, where Mi denotes the number of available quan- of structured sparsity which include filter-wise sparsity, tization level in layer i. Suppose qi,j indicates the j-th channel-wise sparsity and shape-wise sparsity. quantization level in layer i, which gives Figure 3 (a) shows the general matrix multiplication qi,j ∈ [−memrmax, −memrmin] ∪ [memrmin, memrmax] (2) (GEMM) view of the DNN weight matrix and the different structured weight pruning methods. The structured prun- where memrmin, memrmax are the minimum and maxi- mum memristance value of a specified memristor device. ing corresponds to removing rows (filters-wise) or columns (shape-wise) or the combination of them. We can see that 4.2 Memristor-based ADMM regularized opti- after structured weight pruning, the remaining weight ma- mization step trix is still regular and without extra indices. Figure 3 (b) illustrate the memristor crossbar schematic Corresponding to every memristor-based constraint set of size reduction from corresponding structured weight prun- Pi and Qi, a indicator functions is utilized to incorporate ing and Figure 3 (c) shows physical view of the mem- Pi and Qi into objective functions, which are ristor crossbar blocks. A CONV layer has n filters, m ( ( channels which include total k columns, and is denoted 0 if Wi ∈ Pi, 0 if Wi ∈ Qi, as W ∈ n×k. Due to the increasing reading/writing er- gi(Wi) = hi(Wi) = R +∞ otherwise, +∞ otherwise, rors caused by expanding the memristor crossbar size, we for i = 1,...,N. Substituting into (1) and we get limited our design by using multiple 128×64 [25] crossbars for all DNN layers. In Figure 3 (c), i, j denote columns N N N  X X and rows for each crossbar, X represent inputs and c is the minimize f {Wi}i=1 + gi(Yi) + hi(Zi), {Wi} column number which is also shown in Figure 3 (a). By i=1 i=1 (3) subject to W = Y = Z , i = 1,...,N, easy calculation, one can derived that there’s k/j different i i i crossbars to store one filter’s weights as a block unit. So n×k We incorporate auxiliary variables Yi and Zi, dual vari- there’s total p = n/j blocks to store W ∈ R . Within ables Ui and Vi, and the augmented Lagrangian formation each block, the outputs of each crossbar will be propagated Lρ{·} of problem (3) is through an ADC. Then We column-wisely sum the inter- N mediate results of all crossbars. N  X ρi 2 minimize f {Wi}i=1 + kWi − Yi + UikF {Wi} 2 5 Software-hardware Co-optimization i=1 (4) N X ρi 2 + kW − Z + V k , Due to the existence of the non-optimality of ADMM pro- 2 i i i F i=1 cess and the accuracy degradation problem of quantizing sparse DNN, a software-hardware co-optimization frame- We can see that the first term in problem (4) is origi- work is desired. In this section we propose: (i) network nal DNN loss function, and the second and third term are purification and unused path removal to efficiently remove differentiable and convex. As a result, subproblem (4) can redundant channels or filters, (ii) memristor model quanti- be solved by stochastic [24] as the original zation by using distilled knowledge from software helper. DNN training. The standard ADMM algorithm [23] steps proceed by repeating, for k = 0, 1,..., the following subproblems iter- 5.1 Network Purification and Unused Path Re- ations: moval k+1 k k Wi := minimize Lρ({Wi}, {Yi }, {Ui }) Weight pruning with memristor-based ADMM regularized {Wi} (5) k k optimization can significantly reduce the number of weights + Lρ({Wi}, {Zi }, {Vi }) while maintaining high accuracy. However, does the prun- k+1 k+1 k+1 k ing process really remove all unnecessary weights? Yi , Zi := minimize Lρ({Wi }, {Yi}, {Ui }) {Yi,Zi} (6) From our analysis on the DNN data flow, we find that k+1 k if a whole filter is pruned, after General Matrix Multiply + Lρ({Wi }, {Zi}, {Vi }) Schematic View Physical View

GEMM view of weight reduction Memristor block size view DAC Column Decoder channel 1 channel 2 channel m filter 1 x1 x2 x3 xi xk filter 2 ADC ADC filter 3 full size memristor block ADC filter 1* SUM SUM SUM filter n ADC ADC

Block 1 Sense Amplifier channel 1 channel 2 channel m filter j filter 1* Decoder size shrinked Data / Control filter n* memristor block Sense Amplifier ADC ADC

channel 1channel 2 channel m* SUM SUM SUM filter 1* Block p Decoder smallest ADC ADC filter n* memristor block filter n* Sense Amplifier C1 Ci Ci+1 C2i Ck C1 Ci Ck (a) (b) channel 1 channel m* size shrink filter prune may not be the same filter original size column prune with original weight matrix (c)

Figure 3: Structured weight pruning and reduction of hardware resources

-0.4 0.3 0.1 (GEMM), the generated feature maps by this filter will be Feature maps from 0.8 1.2 0.6 Weight kernel Feature maps to previous layer next layer Channel “blank”. If we map those “blank” feature input to next 1.1 0.6 -2.1 Channel Filter layer, the corresponding unused input channel weights be- Filter come removable. By the same token, a pruned channel also ... causes the corresponding filter in previous layer removable. Figure 4 gives a clear illustration about the correspond- Layer i weight matrix Layer i+1 weight matrix ing relationship between pruned filters/channels and cor- respond unused channels/filters. Figure 4: Unused data path caused by structured pruning To better optimize the unused path removal effect we Algorithm 1: Network purification & Unused path removal discussed above, we derive an emptiness ratio parameter η Result: Redundant weights and unused paths removed to define what can be treated as an empty channel. Sup- Load ADMM pruned model pose Λi is the number of columns per channel in layer i, δ = numbers of columns per channel and j is channel index. We have for i ← 1 until last layer do δ for j ← 1 until last channel in layeri do 2  X  for each: k ∈ δ and kcolumnkkF < T h1 do ηi,j = (columnk! = 0) /δ δ ∈ Λi (8) calculate: equation (8), (9); k=1 end if η < T h and σ < T h then If ηi,j exceeds a pre-defined threshold, we can assume that i,j 2 i,j 3 this channel is empty and thus actually prune every col- prune(channeli,j ) prune(filteri−1,j ) when i 6= 1; umn in it. However, if we remove all columns that satisfy end η, dramatic accuracy drop will occur and it will be hard end to recover by retraining because some relatively “impor- for m ← 1 until last filter in layeri do if filter is empty or kfilter k2 < T h then tant” weights might be removed. To mitigate this problem, m m F 4 prune(filter ) we design Network Purification algorithm dealing with the i,m prune(channeli+1,m) when i 6= last layer index; non-optimality problem of the ADMM process. We set-up end an criterion constant σi,j to represent channel j’s impor- end tance score, which is derived from an accumulation proce- end dure: δ X 2 network are represented by the memristance of the mem- σi,j = kcolumnkkF /δ δ ∈ Λi (9) k=1 ristor (i.e. the memristance range constraint Qi in ADMM One can think of this process as if collection evidence for process). Due to the limited memristance range of the whether each channel that contains one or several columns memristor devices, the weight values exceeding memris- need to be removed. A channel can only be treated as empty tance range cannot be represented precisely. Meanwhile, when both equation (8) and (9) are satisfied. Network Pu- the write-on value and the exact value mismatch when rification also works on purifying remaining filters and thus mapping weights on memristor crossbar will also cause the remove more unused path in the network. Algorithm 1 reading mismatch if the amount of the value shift exceeds shows our generalized method of the P-RM method where state level range. T h1 . . . T h4 are hyper-parameter thresholds values. In order to mitigate the memristance range limitation and the mapping mismatch, larger range between state 5.2 Memristor Weight Quantization level (qi,1, qi,2, ··· , qi,Mi ) is needed which means fewer bits Traditionally, DNN in software is composed by 32-bit are representing weights. To better maintain accuracy, we weights. But on a memristor device, the weights of a neural use a pretrained high-accuracy teacher model to provide Table 1: Structured weight pruning results on multi-layer network on MNIST, CIFAR-10 and ImageNet datasets. (P-RM: Network Purification and Unused Path Removal). Accuracies in ImageNet results are reported in Top-5 accuracy. Original model Compression Rate Accuracy Prune Ratio Accuracy Weight Quantization Method Accuracy Without P-RM Without P-RM With P-RM With P-RM Accuracy (8-bit) MNIST Group Scissor [17] 99.15% 4.16× 99.14% N/A N/A N/A 23.18× 99.20% 39.23× 99.20% 99.16% our 99.17% 34.46× 99.06% *87.93× 99.06% 99.04% LeNet-5 45.54× 98.48% 231.82× 98.48% 98.05% *numbers of parameter reduced: 25.2K CIFAR-10 Group Scissor [17] 82.01% 2.35× 82.09% N/A N/A N/A 2.35× 84.55% N/A N/A 84.33% our 84.41% *2.93× 84.53% N/A N/A 83.93% ConvNet 5.88× 83.58% N/A N/A 83.01% our 44.67× 93.36% 93.04% 93.70% 20.16× 93.36% VGG-16 *50.02× 92.73% 92.46% our 5.83× 93.79% 52.07× 93.79% 93.71% 94.14% ResNet-18 15.14× 93.20% *59.84× 93.22% 93.27% *numbers of parameter reduced on ConvNet: 102.30K, VGG-16 : 14.42M, ResNet-18 : 10.97M ImageNet ILSVRC-2012 SSL [1] AlexNet 80.40% 1.40× 80.40% N/A N/A N/A our AlexNet 82.40% 4.69× 81.76% 5.13× 81.76% 80.45% our ResNet-18 89.07% 3.02× 88.41% 3.33× 88.36% 88.47% our ResNet-50 92.86% 2.00× 92.26% 2.70× 92.27% 92.20% numbers of parameter reduced on AlexNet: 1.66M, ResNet-18 : 7.81M, ResNet-50 : 14.77M

Algorithm 2: Distillation Quantization Result: distillation quantization with memristor hardware constraints student ← model pruned and ready to apply quantization; teacher ← model with a deeper structure and higher accuracy; for step ← 1 until lstudent converge do studentq = apply quantization(ws, Q); 2 calculate T L(ps, pt) of studentq & teacher; 2 back propagate on student ← ∂(T L(ps,pt)) ; ∂(studentq ) end Figure 5: Effect of removing redundant weights and unused distillation loss to add on our memristor model (referred paths. (dataset: CIFAR-10; Accuracy: VGG-16-93.36%, as student model) loss to provide better training perfor- ResNet-18-93.79%) mance. The 1R crossbar structure is used in our design. And we 2 choose the memristor device that has R = 1MΩ and lstudent = (1 − σ)L(ps, pr) + σT L(ps, pt) (10) on Roff = 10MΩ. The memristor precision is 4-bit, which The L in first term in (10) is the memristor model (stu- indicates that 16 state-levels can be represented by a sin- dent) loss, and in second term is distillation loss between gle memristor device, and two memristors are combined to student and teacher. ps and pt are outputs of student and represent 8-bit weight in our framework. For the peripheral teacher and pr is the ground-truth label. σ is a balancing circuits, the power and area is calculated based on 45nm parameter, and T is the temperature parameter. technology. And H-tree distribution networks are used to access all the memristor crossbars. 6 Experimental Results As shown in Table 1, we show groups of different prune In this section, we show the experimental results of our ratios and 8-bits quantization with accuracies on each net- proposed memristor-based DNN framework in which struc- work structure. Figure 5 proves our previous arguments tured weight pruning and quantization with memristor- that ADMM’s non-optimality exists in a structured pruned based ADMM regularized optimization are included. Our model. P-RM can further optimize the loss function. software-hardware co-optimization framework (i.e. Net- Please note all of the results are based on non-retraining work Purification, Unused Path Removal (P-RM)) are also process. Below are some results highlights on different thoroughly compared. We test MNIST dataset on LeNet- dataset with different network structures. 5 and CIFAR-10 dataset using ConvNet (4 CONV layers MNIST. With LeNet-5 network, comparing to original and 1 FC layer), VGG-16 and ResNet-18, and we also accuracy (99.17%), our proposed P-RM framework achieve show our ImageNet results on AlexNet, ResNet-18 and 231.82× compression with minor accuracy loss while other ResNet-50. The accuracy of pruned and quantized model state-of-art compression ratios are lossless. And no accu- results are tested based on our software models that incor- racy losses are observed after quantization on 40× and 88× porated with memristor hardware constraints. Models are models and only 0.4% accuracy drop on 231.82× model. trained on an eight NVIDIA GTX-2080Ti GPUs server us- On the other hand, Group Scissor [17] only has 4.16× com- ing PyTorch API. Our memristor model on MATLAB and pression rate. the NVSim [26] is used to calculate power consumption CIFAR-10. Convnet structure are relative shallow so and area cost of the memristors and memristor crossbars. ADMM reaches a relative optimal local minimum, so post- framework of structured weight pruning for dnns,” arXiv preprint Table 2: Area/power comparison between models with and arXiv:1807.11091, 2018. without P-RM on ResNet-18 and VGG-16 on CIFAR-10 [4] E. Park, J. Ahn, and S. Yoo, “Weighted-entropy-based quantization for deep neural networks,” in CVPR, 2017.

[5] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng, “Quantized convo- lutional neural networks for mobile devices,” in CVPR, 2016.

[6] S. Lin, X. Ma, S. Ye, G. Yuan, K. Ma, and Y. Wang, “Toward extremely low bit and lossless accuracy in dnns with progressive admm,” arXiv preprint arXiv:1905.00789, 2019.

[7] W. Niu, X. Ma, Y. Wang, and B. Ren, “26ms inference time for resnet-50: Towards real-time execution of all dnns on smartphone,” processing is not necessary. But we still outperform Group arXiv preprint arXiv:1905.00571, 2019.

Scissor [17] in accuracy (84.55% to 82.09%) when compres- [8] H. Li, N. Liu, X. Ma, S. Lin, S. Ye, T. Zhang, X. Lin, W. Xu, and sion rate is same (2.35×). For larger networks, when a mi- Y. Wang, “Admm-based weight pruning for real-time acceleration on mobile devices,” in Proceedings of the 2019 on Great nor accuracy loss is allowed, our proposed P-RM method Lakes Symposium on VLSI, 2019. improves the prune ratio to 50.02× and 59.84× on VGG-16 [9] C. Ding, A. Ren, G. Yuan, X. Ma, J. Li, N. Liu, B. Yuan, and and ResNet-18 respectively, and no obvious accuracy loss Y. Wang, “Structured weight matrices-based hardware accelerators after quantization on pruned models. in deep neural networks: Fpgas and asics,” in Proceedings of the ImageNet. AlexNet model outperform SSL [1] both 2018 on Great Lakes Symposium on VLSI, 2018. in compression rate (4.69× to 1.40×) and network ac- [10] D. B. Strukov, G. S. Snider, D. R. Stewart, and R. S. Williams, “The missing memristor found,” nature, vol. 453, no. 7191, p. 80, 2008. curacy (81.76% to 80.40%), with or without P-RM. Our [11] X. Ma, Y. Zhang, G. Yuan, A. Ren, Z. Li, J. Han, J. Hu, and ResNet-18 and ResNet-50 models also achieve unprece- Y. Wang, “An area and energy efficient design of domain-wall dented 3.33× with 88.36% accuracy and 2.70× with 92.27% memory-based deep convolutional neural networks using stochastic respectively. No accuracy losses are observed after quan- computing,” in ISQED. IEEE, 2018. tization on pruned ResNet-18/50 models and around 1% [12] L. Chua, “Memristor-the missing circuit element,” IEEE Transac- accuracy loss on 5.13× compressed AlexNet model. tions on circuit theory, vol. 18, no. 5, pp. 507–519, 1971. Table 2 shows our highlighted memristor crossbar power [13] G. Yuan, C. Ding, R. Cai, X. Ma, Z. Zhao, A. Ren, B. Yuan, and Y. Wang, “Memristor crossbar-based ultra-efficient next-generation and area comparisons of ResNet-18 and VGG-16 mod- baseband processors,” in MWSCAS, 2017. els. By using our proposed P-RM method, the area [14] S. Ye, X. Feng, T. Zhang, X. Ma, S. Lin, Z. Li, K. Xu, W. Wen, and power of the 5.83× (15.14×) ResNet-18 model is re- S. Liu, J. Tang et al., “Progressive dnn compression: A key 2 2 to achieve ultra-high weight pruning and quantization rates using duced from 0.235mm (0.117mm ) and 3.359W (1.622W ) admm,” arXiv preprint arXiv:1903.09769, 2019. to 0.042mm2 (0.041mm2) and 0.585W (0.556W ), with- [15] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and out any accuracy loss. For VGG-16 20.16× model, after connections for efficient neural network,” in NeurIPS, 2015. using our P-RM method, the area and power is reduced [16] A. Ankit, A. Sengupta, and K. Roy, “Trannsformer: Neural network 2 2 2 from 0.113mm and 1.611W to 0.056mm (0.053mm ) and transformation for memristive crossbar based neuromorphic system 0.824W (0.754W ), where the compression ratio is achieved design,” in Proceedings of ICCD, 2017. 44.67× (50.02×) with 0% (0.63%) accuracy degradation. [17] Y. Wang, W. Wen, B. Liu, D. Chiarulli, and H. Li, “Group scissor: Scaling neuromorphic computing design to large neural networks,” in DAC. IEEE, 2017. 7 Conclusion [18] L. Xia, T. Tang, W. Huangfu, M. Cheng, X. Yin, B. Li, Y. Wang, and H. Yang, “Switched by input: power efficient structure for rram- based convolutional neural network,” in DAC. ACM, 2016, p. 125. In this paper, we designed an unified memristor-based [19] A. Shafiee, A. Nag, N. Muralimanohar, and et.al, “ISAAC: A Convo- DNN framework which is tiny in overall hardware footprint lutional Neural Network Accelerator with In-Situ Analog Arithmetic and accurate in test performance. We incorporate ADMM in Crossbars,” in ISCA 2016. in weight structured pruning and quantization to reduce [20] S. Kaya, A. R. Brown, A. Asenov, D. Magot, e. D. LintonI, T.”, model size in order to fit our designed tiny framework. and C. Tsamis, “Analysis of statistical fluctuations due to line edge roughness in sub-0.1µm mosfets,” 2001. We find the non-optimality of the ADMM solution and [21] J. J. Yang, M. D. Pickett, X. Li, D. A. Ohlberg, D. R. Stew- design Network Purification and Unused Path Removal in art, and R. S. Williams, “Memristive switching mechanism for our software-hardware co-optimization framework, which metal/oxide/metal nanodevices,” Nature Nanotechnology, 2008. achieve better results comparing to Gourp Scissor [17] and [22] C. Song, B. Liu, W. Wen, H. Li, and Y. Chen, “A quantization-aware SSL [1]. On AlexNet, VGG-16 and ResNet-18/50, after regularized learning method in multilevel memristor-based neuro- structured weight pruning and 8-bit quantization, model morphic computing system,” in 2017 NVMSA. IEEE, 2017. size, power and area are significant reduced with negligible [23] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein et al., “Dis- tributed optimization and statistical learning via the alternating di- accuracy loss. rection method of multipliers,” Foundations and Trends® in Ma- chine learning, 2011. References [24] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimiza- tion,” arXiv preprint arXiv:1412.6980, 2014. [1] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured sparsity in deep neural networks,” in NeurIPS, 2016, pp. 2074–2082. [25] M. Hu, C. E. Graves, C. Li, and e. Li, Yunning, “Memristor-Based Analog Computation and Neural Network Classification with a Dot Product Engine,” Advanced Materials, 2018. [2] X. Ma, G. Yuan, S. Lin, Z. Li, H. Sun, and Y. Wang, “Resnet can be pruned 60x: Introducing network purification and un- [26] X. Dong, C. Xu, S. Member, Y. Xie, S. Member, and N. P. used path removal (p-rm) after weight pruning,” arXiv preprint Jouppi, “Nvsim: A circuit-level performance, energy, and area model arXiv:1905.00136, 2019. for emerging nonvolatile memory,” IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS [3] T. Zhang, K. Zhang, S. Ye, J. Li, J. Tang, W. Wen, X. Lin, AND SYSTEMS. M. Fardad, and Y. Wang, “Adam-admm: A unified, systematic