
Algorithm-hardware Co-optimization for Neural Network Efficiency Improvement by Qing Yang

Department of Electrical and Computer Engineering Duke University

Date: Approved:

Hai Li, Advisor

Yiran Chen

Jeffrey Derby

Henry Pfister

Benjamin Lee

Dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Electrical and Computer Engineering in the Graduate School of Duke University

2020

Copyright © 2020 by Qing Yang
All rights reserved

Abstract

Deep neural networks (DNNs) are widely applied in the artificial intelligence field. While the performance of DNNs is continuously improved by deeper and more complicated structures, the feasibility of deployment on edge devices remains a critical problem. In this thesis, we present algorithm-hardware co-optimization approaches to address the challenges of efficient DNN deployment from three aspects: 1) save computational cost, 2) save memory cost, and 3) save data movements.

First, we present a joint regularization technique to advance the compression beyond the weights to neuron activations. By distinguishing and leveraging the significant difference among neuron responses and connections during learning, the jointly pruned network, namely JPnet, optimizes the sparsity of activations and weights. Second, to structurally regulate the dynamic activation sparsity (DAS), we propose a generic low-cost approach based on the winners-take-all (WTA) dropout technique. The network enhanced by the proposed WTA dropout, namely DASNet, features structured activation sparsity with an improved sparsity level, which can be easily utilized to achieve acceleration on conventional embedded systems. The effectiveness of JPnet and DASNet has been thoroughly evaluated through various network models with different activation functions and on different datasets. Third, we propose BitSystolic, a neural processing unit based on a systolic array structure, to fully support mixed-precision inference. In BitSystolic, the numerical precision of both weights and activations can be configured in the range of 2b∼8b, fulfilling different requirements across mixed-precision models and tasks. Moreover, the design can support the various data flows presented in different types of neural layers and adaptively optimize the data reuse by switching between the matrix-matrix mode and vector-matrix mode. We designed and fabricated the proposed BitSystolic in

the 65nm process. Our measurement results show that BitSystolic achieves a unified power efficiency of up to 26.7 TOPS/W with 17.8 mW peak power consumption across various layer types. In the end, we take a glance at computing-in-memory architectures based on resistive random-access memory (ReRAM), which realize in-place storage and computation. A quantized training method is proposed to enhance the accuracy of ReRAM-based neuromorphic systems by alleviating the impact of limited parameter precision.

Acknowledgements

First of all, I owe many thanks to my advisor, Dr. Hai Li. Her enthusiasm for work and research has inspired me a lot in my doctoral study. I am honored to have had her guidance and support over the past five years.

I would like to thank my respectable committee members: Dr. Yiran Chen, Dr. Jeffrey Derby, Dr. Benjamin Lee and Dr. Henry Pfister. Their expertise and insights have helped me strengthen and improve this thesis.

I would like to thank my lab mates: Bonan Yan, Chaofei Yang, Jiachen Mao, Chang Song and many others. Many contributions in this thesis are the results of collaborations. Our friendship has made my doctoral student life happy and enjoyable.

I would also like to give my special thanks to Mr. Elias Fallon. His mentorship during the NSF IUCRC project has made a substantial change in my career.

Thanks to my parents and family for their selfless love and support.

In the end, I would like to thank my fiancee, Xiaoshuang Xun. It would have been impossible for me to complete the PhD journey without her love and tolerance.

Contents

Abstract iv

Acknowledgements vi

List of Figures xi

List of Tables xiv

1 Introduction 1

1.1 Overview of Challenges in DNN Deployments ...... 2

1.2 Related Work ...... 4

1.2.1 Weight Pruning ...... 4

1.2.2 Activation Pruning ...... 6

1.2.3 Model Quantization ...... 8

1.3 Thesis Contribution ...... 9

2 Joint Regularization on Neuron Weights and Connectivity 11

2.1 Motivation ...... 11

2.2 Approach ...... 12

2.2.1 Joint Regularization ...... 12

2.2.2 Joint Pruning Procedure ...... 13

2.2.3 Optimizer and Learning Rate ...... 15

2.2.4 Reconcile Dropout and Activation Pruning ...... 16

2.2.5 Winner Prediction in Activation Pruning ...... 17

2.3 Experiments ...... 17

2.3.1 Overall Performance ...... 18

2.3.2 MNIST and CIFAR-10 ...... 19

2.3.3 ImageNet ...... 21

2.3.4 Prune Activation without Intrinsic Zeros ...... 24

2.4 Discussions ...... 26

2.4.1 Comparison with Weight Pruning ...... 26

2.4.2 Comparison with Static Activation Pruning ...... 27

2.4.3 Activation Analysis ...... 28

2.4.4 Speedup from Dynamic Activation Pruning ...... 28

2.4.5 Activation Threshold Prediction ...... 29

2.5 Conclusive Remarks ...... 30

3 WTA Dropout for Structured Activation Regularization 31

3.1 Motivation ...... 31

3.2 DASNet with WTA Dropout ...... 32

3.2.1 Ranking and Winner Rate Selection ...... 34

3.2.2 Theoretical Analysis ...... 39

3.2.3 The Tuning Flow to Obtain DASNet ...... 43

3.3 Evaluation ...... 43

3.3.1 DASNet Measurement Results ...... 44

3.3.2 Case Study on Winner Rate Configuration ...... 45

3.3.3 Case Study on Layer-wise Speedup ...... 46

3.3.4 Comparison with State-of-the-art ...... 48

3.4 Conclusive Remarks ...... 49

4 BitSystolic Architecture for Mixed-precision Inference 50

4.1 Motivation ...... 50

4.2 Preliminary ...... 53

4.3 BitSystolic Architecture ...... 55

4.3.1 Bit-level Data Layout ...... 55

4.3.2 Architecture Overview ...... 56

4.3.3 BVPE Design ...... 60

4.3.4 Bit-Significance Integration ...... 61

4.3.5 Memory Arbiter Design ...... 61

4.4 Data Partition and Mapping in BitSystolic ...... 63

4.4.1 Partition and Mapping Scheme ...... 63

4.4.2 Computation Boundary and IO Boundary ...... 64

4.4.3 BitSkipping Technique ...... 66

4.5 Measurements and Discussions ...... 67

4.5.1 BitSystolic Chip ...... 67

4.5.2 Test System Design ...... 68

4.5.3 Measurement Results ...... 70

4.5.4 Comparison with State-of-the-art NPUs ...... 73

4.5.5 Discussions ...... 74

4.6 Conclusive Remarks ...... 79

5 Quantized Training to Enhance ReRAM-based CIM Systems 80

5.1 Motivation ...... 80

5.2 Working Principle of ReRAM-based CIM ...... 82

5.3 Quantized Training Algorithm ...... 83

5.3.1 Quantization Method ...... 83

5.3.2 Quantized Training Method ...... 84

5.4 Discussions ...... 86

5.4.1 Classification Accuracy vs. Layer Size ...... 87

5.4.2 Quantization for conv Layer and fc Layer ...... 89

5.4.3 Robustness to Device Variations ...... 90

5.5 Conclusive Remarks ...... 91

6 Conclusions 92

6.1 Summary of Contributions ...... 92

6.2 Future Work ...... 93

Bibliography 95

Biography 106

List of Figures

1.1 The overview of thesis contributions...... 9

2.1 Working principle of joint pruning...... 15

2.2 Comparison between WP and JP...... 19

2.3 The activation distribution of fc2 in MLP-3 for all digits...... 21

2.4 The number of activated top neurons for all digits...... 22

2.5 The decomposition of weight and computation cost of AlexNet. . . . 23

2.6 Activation distribution of ResNet-32...... 25

2.7 Comparison to static activation pruning for ResNet-32...... 27

2.8 The analysis of activation pruning sensitivity...... 28

2.9 The effects of threshold prediction...... 30

3.1 An example of the activation distribution of the first layer in AlexNet on ImageNet dataset...... 32

3.2 The original network and the DASNet with dynamic WTA dropout. . 33

3.3 Accuracy vs. activation pruning for fc layers...... 36

3.4 Working scheme of WTA dropout in conv layers...... 38

3.5 Comparison between different feature vectors...... 39

3.6 Accuracy vs. feature map pruning for conv layers...... 40

3.7 WTA mask in the forward and backward propagation...... 42

3.8 The tuning flow to obtain DASNet...... 42

3.9 The winner rate vs. energy threshold for AlexNet...... 46

3.10 The layer-wise decomposition for the inference time in DASNets. . . . 48

4.1 The case studies of the impact of the numerical precision of weight and activation on DNN accuracy...... 51

4.2 The data reuse analysis on various layer types...... 52

4.3 The data layouts of conv layer and fc/lstm layer...... 53

4.4 The working principle of matrix multiplication A · W on conventional systolic array architecture...... 55

4.5 The overview of BitSystolic architecture...... 57

4.6 The flowing data waveform in the systolic row...... 59

4.7 The bit-vector PE structure...... 60

4.8 The shifter design to handle the running significance...... 61

4.9 The memory arbiter design...... 62

4.10 Partition for different types of layers...... 62

4.11 The illustrations of computation boundary and IO boundary...... 64

4.12 The demonstration of BitSkipping on a 3-layer MLP for MNIST. . . . 66

4.13 The die photo and chip configuration...... 67

4.14 The test board design with USB interface connected to PC...... 68

4.15 The design of the concise SIMD instruction set...... 68

4.16 The chip power consumption analysis...... 70

4.17 Comparison between the MM mode and VM mode...... 70

4.18 Comparison between different numerical precision...... 71

4.19 The layer-wise results for mixed-precision MobileNet-V1 on ImageNet. 71

4.20 The layer-wise results for mixed-precision MobileNet-V2 on ImageNet. 72

4.21 The overview of the state-of-the-art chips...... 74

4.22 Balance spots between VM mode and MM mode...... 76

4.23 The throughput scaling vs. bit precision...... 77

4.24 The throughput scaling vs. IO bandwidth with varying the bitwidth of stationary A for fc1...... 78

4.25 The computation cost saving on AlexNet using BitSkipping...... 79

5.1 The typical CIM architecture...... 81

5.2 The comparison between Von Neumann chips and CIM chips. . . . . 82

5.3 Quantization example...... 84

5.4 Learning curves for three different quantized models...... 88

5.5 Classification accuracy vs. layer size for the quantized MLP...... 89

5.6 Quantization effects for conv layer and fc layer...... 90

5.7 ReRAM-based system robustness to variations...... 91

6.1 The neuron overlapping in MLP-3 for MNIST dataset...... 94

List of Tables

2.1 Summary of JPnets ...... 18

2.2 The joint pruning results of MLP-3 on MNIST ...... 20

2.3 The joint pruning results of LeNet-4 on MNIST ...... 21

2.4 The joint pruning results of ConvNet-5 on CIFAR-10 ...... 22

2.5 The joint pruning results of AlexNet on ImageNet ...... 23

2.6 The joint pruning results of ResNet-50 on ImageNet...... 24

2.7 The joint pruning results of ResNet-32 on CIFAR-10...... 25

2.8 Comparison with the state-of-the-art weight pruning methods. . . . . 26

2.9 Speedup test for fc layers in AlexNet...... 29

3.1 The network configurations for DASNet exploration...... 35

3.2 The experiment setup and results of DASNets...... 44

3.3 Comparison with existing feature map pruning methods...... 47

4.1 The definitions of architectural parameters of BitSystolic...... 58

4.2 Comparison with the state-of-the-art NPUs ...... 75

5.1 Network configurations for quantized training...... 86

5.2 Training results based on 2-level ReRAM...... 87

Chapter 1

Introduction

Deep Neural Networks (DNNs) have been serving as the creative horsepower of modern artificial intelligence and have demonstrated significant advantages in many real-world applications, such as image classification, object detection and speech recognition [HZRS16, RDGF16, AAA+16]. For different application scenarios, distinct network architectures and layer types are developed to fit the application requirements and data features. For instance, fully-connected (fc) networks can be taken as "universal approximators" capable of representing any function by virtue of the non-linearity in each layer [RZ18]. Thus fc networks are widely adopted in generic classification and regression applications. In image-related applications, convolution (conv) layers are usually applied to the input image and intermediate feature maps. The neurons/filters with predefined filtering window sizes take the convolution operation at each pixel location of the image or feature maps. The preliminary features extracted by neurons can be gradually elaborated into high-level features by stacking multiple conv layers. Take AlphaGo [SHM+16] as an example: the Go board image is processed by many conv layers to derive high-level board position information. At the end, the convolution neural network outputs a probability distribution of legal moves over the Go board. For applications featuring temporal data, recurrent neural networks are designed to process information not only by working on the individual event at each time step but also by considering the states incurred by all the previous input data. To further improve the performance of recurrent networks, long short-term memory (lstm) networks [HS97] introduce input gate and forget gate structures to manage the long- and short-term dependencies on input data.

As can be seen, distinct network architectures are designed based on the various requirements and properties of machine learning applications. With the rapid expansion of data volume and application scale, neural networks become deeper and more complicated to achieve better model accuracy. Moreover, hybrid network architectures are widely adopted for intricate applications. For example, in speech recognition [AAA+16], the DNN model stacks conv layers, lstm layers and fc layers together for end-to-end processing on the spectrogram images of input voices. Similarly, to complete the image captioning task [VTBE15], the overall model is divided into two logic modules: the convolution network for image processing and the lstm network for text generation.

In short, the rapid development of neural networks has paved the way for a bright future of machine learning applications in daily life. DNN deployments spread from small-scale applications like hand-written digit recognition and moderate-scale applications like object classification to large-scale scenarios such as autonomous driving and medical diagnosis. Recently, DNNs have also been promoted for creative usage in artistic and scientific fields, such as painting style transfer using convolution neural networks [GEB16], creating music using a generative model [DJP+20] and the AlphaFold model for protein structure prediction [SEJ+20].

1.1 Overview of Challenges in DNN Deployments

The vigorous development of DNNs designed for various machine learning applications broadens our vision of a ubiquitous artificial intelligence era. The whole life-cycle of a DNN model mainly comprises the training phase and the deployment phase. GPU platforms facilitate the training phase of DNNs by learning from massive input data, which usually consumes tens to thousands of GPU-hours to obtain a DNN model ready to be deployed. The resource-hungry training phase is mainly conducted

2 ing servers or cloud services. On the contrary, the deployments of DNNs are often targeted on edge devices to make the wide impact of machine learning techniques for human life. The edge computing platforms, e.g., mobile phones and self-driving cars, are usually equipped with limited computing and storage resources, which incurs challenges for efficient DNN deployments.

The major challenges for DNN deployments on edge devices are due to the rapidly growing model size and computational cost. Taking the 2012 ImageNet competition [RDS+15] as an example, many great breakthroughs have been achieved by using DNNs as the backbones for object classification tasks. AlexNet [KSH12a], the first entry to adopt a DNN model, achieved a 19.2% Top-5 error rate with a 233 MB model size and 727M floating-point operations (FLOPs). To further improve the classification accuracy, ResNet-152 [HZRS16] reduced the Top-5 error to 6.7% with a 230 MB model size and 11G FLOPs, which involves a much heavier computation burden. A larger model can achieve a smaller error rate but with heavier computational cost, such as SENet [HSS18], featuring a 440 MB model size, 21G FLOPs and a Top-5 error rate of 4.47%. As can be seen, the classification accuracy is highly correlated with the model size and computational power of DNN models. While increasing the model capacity yields better model performance, it brings more computational complexity. Another reason for the soaring complexity of DNN deployments comes from those intricate applications that need network combinations. For instance, the SSD [LAE+16] model, designed for the object detection task, attaches additional head structures to the base convolution network to predict bounding boxes and object classes. Overall, SSD has a model size of 337 MB and involves 91G FLOPs.

Power efficiency is critical for DNN deployments on edge devices with limited power budgets and heat dissipation requirements. As aforementioned, the DNN model size and complexity are rapidly increasing, which demands more and more computing power. Meanwhile, the intensive data communications with on- and off-chip memory

also play a vital role in the power efficiency. As illustrated in the energy table for the 45nm process [Hor14b], a 32-bit on-chip SRAM access consumes 1.4× the energy of a 32-bit FLOP. A 32-bit off-chip DRAM access consumes far more energy, being 173× more expensive than a FLOP. Hence, the challenges for efficient DNN deployments not only originate from the large model size and intensive computation, but also come from the heavy data movements during model execution.

1.2 Related Work

To overcome the challenges for efficient DNN deployments, many approaches have been proposed to address three optimization aims: 1) save computational cost, 2) save memory cost, and 3) save data movements. The widely adopted techniques for DNN optimization include weight pruning, activation pruning and model quantization. It's worth noting that each technique usually covers multiple optimization aspects. For example, weight pruning and model quantization focus on model compression to obtain a smaller model and thereby save memory cost. At the same time, the deeply compressed models incur less computational cost and fewer data movements. Activation pruning is dedicated to saving computational cost by reducing the inter-layer data propagation, while the reduced inter-layer data volume naturally saves data movements. In the following, we summarize the related works on optimizing DNN deployments in recent years.

1.2.1 Weight Pruning

Weight pruning emerges as an effective compression technique for reducing the model size and computation cost of neural networks. A common approach to pruning the redundant weights in DNNs is to include an extra regularization term (e.g., ℓ1/ℓ2 regularization) in the loss function [LWF+15, PLW+16] to constrain the weight distribution. Then the weights below a heuristic threshold are removed. Afterwards, a certain number of finetuning epochs are applied to recover the accuracy loss induced by the pruning. In practice, the direct-pruning and finetuning stages can be carried out iteratively to gradually achieve the optimal tradeoff between the model compression rate and accuracy. To avoid the erroneous removal of important weights in the naïve pruning-and-finetuning approach, a dynamic compression method was proposed to recover those pruned weights whose expected updates are larger than an empirical threshold in each training iteration [GYC16]. Rather than using ℓ1/ℓ2 regularization to constrain the weight magnitude and distribution, ℓ0 regularization can be adopted as a stochastic binary mask on weights, which was proven to produce a higher sparsification level [LWK17]. These regularization-based weight pruning approaches demonstrated high effectiveness, especially for fc layers [HPTD15]. However, these methods are heuristic and lack theoretical guarantees for the convergence and compression performance. Being theoretically proved, sparse variational dropout can be utilized on individual weights to realize all possible dropout rates [MAV17, NMAV17]. The objective of weight pruning can also be transformed into a non-convex optimization problem which is mathematically solvable using the alternating direction method of multipliers (ADMM) [ZYZ+18]. Again, finetuning is needed to recover the accuracy drop for the sparsified model obtained by ADMM.
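As a concrete illustration of the magnitude-based prune-then-finetune step described above, the following sketch zeroes out weights whose magnitude falls below a threshold chosen from a target sparsity. This is a minimal NumPy example, not the exact procedure of any cited work; the function name and the quantile-based threshold are illustrative assumptions.

```python
import numpy as np

def magnitude_prune(w, sparsity=0.9):
    """Zero out the smallest-magnitude weights so that `sparsity` of them are removed.

    Returns the pruned weights and the binary mask that a finetuning loop would
    keep applying so pruned weights stay at zero.
    """
    threshold = np.quantile(np.abs(w), sparsity)   # heuristic magnitude threshold
    mask = (np.abs(w) > threshold).astype(w.dtype)
    return w * mask, mask

# One prune round: prune, then (during finetuning) update only unpruned weights.
w = np.random.randn(256, 256)
w_pruned, mask = magnitude_prune(w, sparsity=0.9)
print(1.0 - mask.mean())   # ≈ 0.9 of the weights removed
```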

Removing the redundant weights in structured forms, e.g., filters and filter channels, has been widely investigated too. For example, structured pruning applies group lasso regularization on weight groups in a variety of self-defined shapes and sizes [WWW+16]. In [MTK+16], the rankings of filters are indicated by the first-order Taylor series expansion of the loss function on feature maps. The filters with low rankings are then removed. The filter ranking can also be represented by the root mean square or the sum of absolute values of the filter weights [MHP+17, YLP+17].

A theoretical view on the importance of neurons/filters can be derived from the perspective of the variational information bottleneck, which minimizes the mutual information between layers [DZW18]. The structured pruning methods do not require dedicated support for random sparse matrices and thus are hardware-friendly for conventional computation platforms. However, these methods seldom achieve a weight compression rate as high as the element-wise pruning methods.

1.2.2 Activation Pruning

Activation sparsity has been utilized in DNN accelerator designs. The activation sparsity originating from ReLU accelerates DNN inference with reduced off-chip memory access and computation cost [CES16, AJH+16, RWA+16]. A simple technique to improve activation sparsity was explored by zeroing out small activations [AJH+16]. However, the increment of activation sparsity is very limited due to the concern of accuracy loss. Moreover, these works heavily relied on the zero activations of ReLU, which cannot be extended to other activation functions. To structurally regulate the activations, SeFAct [SAM+19] proposed deriving a sparse computing graph for a DNN model by clustering neurons into independent sets specified for different input classes. However, the threshold determining a neuron's importance needs to be chosen empirically during clustering, which makes the learning process intricate. Furthermore, specific accelerators optimized for the sparse computation graph are required to fully utilize the benefits from SeFAct.

At the algorithmic level, dropout-based methods were proposed to regulate activation sparsity and obtain sparse feature representations [BF13, MF15]. These techniques incur essential model modifications, e.g., adding a binary belief network overlaid on the original model. It's worth noting that these dropout-based approaches were investigated to alleviate the overfitting problem and improve the model generality.

However, such sparse activation patterns are not applicable in the inference stage. To make activation regularization feasible in the inference stage, Spring et al. [SS17] utilized locality-sensitive hashing (LSH) to predict the neurons with top activation magnitudes for DNNs with only fc layers. In both the training and inference stages, only the selected neurons are activated. The LSH-based method can produce a high sparsity in activations (e.g., on average only 5% of activations remain). However, it is hard to fully utilize the generated random sparse patterns on conventional computing platforms. Structural feature map pruning has been studied accordingly. For example, Molchanov et al. [MTK+17] analyzed the importance of feature map channels based on the partial derivative of the loss function and eliminated the less important ones. The importance of feature maps can also be determined by either the percentage of zeros [HPTT16] or the mutual information between inter-layer activations [DZW18]. However, the feasibility of these approaches has not been validated on more advanced network structures other than VGGNet. The scaling factor in batch normalization can be utilized to indicate the significance of each feature map channel, too [LLS+17, YLLW18]. The idea of adding scaling factors on feature maps can be extended to conv layers that do not even have batch normalization [HW18]. In the training stage, all scaling factors are learned by adding a specified regularization term to the original loss function. A feature map channel whose scaling factor is less than a predefined threshold is removed once the training is completed. The partial selection of feature map channels can also be dynamically determined at run-time during inference to reduce the accuracy loss [HZDS+19, GZD+19]. However, the existing dynamic approaches introduce intricate branching structures into the original model, which increases the model size and complicates the training process.

7 1.2.3 Model Quantization

A common approach for improving the energy efficiency and computation density of DNN deployments is to replace floating-point operations with integer operations, e.g., quantizing the weight parameters down to 8-bit (8b) [JYP+17a] or even 1-bit (1b) [CBD15a] precision. Activation quantization can also help reduce the inter-layer data traffic and computational cost [HCS+16a, RORF16a]. However, models with deeply quantized weights or activations could suffer severe accuracy drops. For instance, XNOR-Net [RORF16a] incurs a more than 10% accuracy degradation on the ImageNet dataset, compared to its floating-point counterpart.
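To make the low-bit quantization referred to above concrete, the sketch below quantizes a weight tensor to a configurable bitwidth. It is a minimal NumPy example assuming a symmetric, per-tensor scheme; the function name and range choice are illustrative, not the exact schemes used in the cited works.

```python
import numpy as np

def quantize_symmetric(w, num_bits=8):
    """Uniformly quantize a tensor to signed integers with `num_bits` bits."""
    qmax = 2 ** (num_bits - 1) - 1            # e.g., 127 for 8b
    scale = np.max(np.abs(w)) / qmax + 1e-12  # map the largest magnitude to qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int32)
    return q, scale                           # w ≈ q * scale

# Example: quantize random weights to 8b and measure the reconstruction error.
w = np.random.randn(64, 64).astype(np.float32)
q, s = quantize_symmetric(w, num_bits=8)
print(np.max(np.abs(w - q * s)))
```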

Recent studies [DYG+19, GHC+19] reveal that different layers of a DNN model have different sensitivities to quantization with respect to model performance, and thus mixed-precision quantization can better trade off execution efficiency and accuracy. To overcome the large accuracy drop of the unified quantization method, where all the layers share an identical bitwidth configuration for weights/activations, many methods have been proposed to generate layer-wise mixed-precision models. HAQ [WLL+19] uses reinforcement learning to learn and determine the numerical precision automatically. DNAS [WWZ+18] and DFS [SWX+20] transform the quantization process into network architecture search (NAS) problems by representing the bitwidth configurations as optional paths in network graphs. Hence, the stochastic gradient descent (SGD) optimization method can be efficiently applied to the path selection through updating the connection parameters. HAWQ enables the bitwidth selection based on the quantization sensitivity reflected by deterministic second-order information, the Hessian spectrum, derived from the Hessian matrix of weights per layer/block [DYG+19]. Moreover, HAWQ provides an explicit quantization and finetuning order among layers to optimize the accuracy recovery.


Figure 1.1: The overview of thesis contributions.

1.3 Thesis Contribution

This thesis focuses on efficient DNN deployments on edge devices. In other words, we aim to improve the neural network efficiency in the inference stage through algorithm and hardware co-optimizations. The proposed cross-layer approaches, combining algorithm and hardware techniques, simultaneously take the optimizations of computational cost, memory cost and data movements into consideration. The thesis contributions are shown in Fig. 1.1:

In Chapter 2, we propose the joint regularization on weights and activations by integrating the static weight pruning and dynamic activation pruning. In this way, the jointly pruned networks, namely JPnet, feature compressed model size and computational cost without compromising the model accuracy. The derived deep sparsification of JPnet reveals more optimization space for the existing DNN accelerators dedicated to sparse matrix operations.

In Chapter 3, the activation sparsity is structurally regulated with winners-take-all (WTA) dropout at the granularity of neurons. The structured activation sparsity

patterns derived from the winner neuron selection are easily utilized in conventional embedded systems. Our experiments on mobile CPUs achieved significant speedups for the networks equipped with the WTA dropout scheme.

Chapter 4 presents the BitSystolic architecture for mixed-precision inference of DNNs. BitSystolic fulfills the flexible numerical precision (2b ∼ 8b) requirements of mixed-precision models with distinct bitwidth configurations for weights and activations in different layers. Meanwhile, BitSystolic optimizes the corresponding data flows for various layer types, e.g., conv layers, fc layers and lstm layers, to save data movements by maximizing their intrinsic data reuse.

To further save the data movements involved in the intensive matrix computation during DNN deployments, we explore the computing-in-memory (CIM) mechanism in Chapter 5. Here we focus on the ReRAM-based CIM architecture, which utilizes the ReRAM crossbar structure to realize in-place storage and computation. Considering the limited programming resolution of ReRAM devices, a quantized training method is proposed to improve the model accuracy with awareness of the available weight candidates.

Chapter 2

Joint Regularization on Neuron Weights and Connectivity

2.1 Motivation

Although prior weight pruning techniques achieved tremendous successes, merely focusing on the weights cannot lead to the best inference efficiency, the crucial metric in DNN deployment, for the following three reasons. First, existing weight pruning methods reduce the fc layer size dramatically, while a systematic method to achieve a comparable compression rate for conv layers is lacking. The conv layers account for most of the computation cost and dominate the inference time in DNNs, whose performance is usually bounded by computation instead of memory accesses [JYP+17b, ZLS+15]. The most essential challenge in speeding up DNNs is to minimize the computation cost, i.e., the intensive multiply-and-accumulate operations (MACs). Second, the weights and activations together determine the performance of a network. Our experiments show that the zero-activation percentage obtained by ReLU decreases after applying weight pruning [HLM+16]. Such a deterioration in activation sparsity could potentially eliminate the advantage of the aforementioned accelerator designs. Third, the activation function in DNNs is not strictly limited to ReLU. Non-ReLU activation functions, such as leaky ReLU and sigmoid, do not have intrinsic zero-activation patterns.

Generally, model size compression is the main focus of weight pruning, while the regulation of activation sparsification focuses more on the intrinsic activation sparsity of ReLU or on exploiting the virtue of sparse activations in DNN training for better model generalization. In contrast, our proposed joint regularization aims

to reduce the DNN computation cost and accelerate the inference by simultaneously optimizing weight pruning and activation sparsification.

2.2 Approach

2.2.1 Joint Regularization

For an $L$-layer neural network represented by the weight set $S_W = \{W_i : i = 1, \dots, L\}$, where $W_i$ denotes the weights of layer $i$, given the dataset $\{\mathcal{X}, \mathcal{Y}\}$, $S_W$ is learned to minimize the loss function as follows:
\[
\mathrm{Loss} = \frac{1}{n}\sum_{k=1}^{n} \mathcal{L}\big(y_k, \mathcal{D}(x_k, S_W)\big), \qquad
S_W^{*} = \operatorname*{argmin}_{S_W} \{\mathrm{Loss}\}, \tag{2.1}
\]
where $\{x_k, y_k\}$ is a sampled input-output pair from $\{\mathcal{X}, \mathcal{Y}\}$, and $n$ is the minibatch size. The nonlinear relationship of the network is modeled as $\mathcal{D}(\cdot)$. The cross-entropy is usually adopted as the function $\mathcal{L}(\cdot)$ for multi-class problems. For the common weight pruning techniques, the optimization problem extends the loss function in

Equation (2.1) with a regularization term on $S_W$:
\[
\mathrm{Loss} = \frac{1}{n}\sum_{k=1}^{n} \mathcal{L}\big(y_k, \mathcal{D}(x_k, S_W)\big) + \alpha \cdot \mathcal{R}^{W}(S_W). \tag{2.2}
\]

$\mathcal{R}^{W}(\cdot)$ can be configured as the $\ell_0/\ell_1/\ell_2$ regularization on weights with a strength $\alpha$.

The $\mathcal{R}^{W}(\cdot)$ term focuses on the optimal weight compression, whereas $A_{i-1} \cdot W_i$ in layer $i$ is determined by both the activation $A_{i-1}$ from the previous layer and the weights $W_i$ of layer $i$. We propose joint regularization on both weights and activations to minimize the computation cost and thereafter optimize the execution efficiency. Overall, the loss function is represented as:
\[
\mathrm{Loss} = \frac{1}{n}\sum_{k=1}^{n} \mathcal{L}\big(y_k, \mathcal{D}(x_k, S_W)\big) + \alpha \cdot \mathcal{R}^{W}(S_W) + \beta \cdot \mathcal{R}^{A}(S_A), \tag{2.3}
\]
where $S_A = \{A_i : i = 1, \dots, L-1\}$. $A_L$ is the model output, which is not included in the activation regularization $\mathcal{R}^{A}(\cdot)$. It is inappropriate to apply the

$\ell_1/\ell_2$ regularization to activations, as the regularization may constrain the activation magnitude and hinder the feature learning process. Hence, we propose to adopt the $\ell_0$ regularization, which minimizes the number of effective activations without disturbing their magnitudes. More specifically, for each layer $i$, a binary mask $T_i$ acting as an information filter is designed for the original activations $A_{orig,i}$ such that
\[
A_{m,i} = A_{orig,i} \odot T_i, \tag{2.4}
\]
where $\odot$ is the element-wise multiplication. The $\ell_0$ optimization problem on activations is therefore transformed into the derivation of the optimal mask set $S_T = \{T_i : i = 1, \dots, L-1\}$.
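To make the formulation concrete, the sketch below assembles the joint objective for a toy fully-connected network: a cross-entropy task loss, an ℓ1 penalty on the weight set $S_W$, and activation masking through binary masks $T_i$ as in Equation (2.4). It is a minimal NumPy illustration, not the thesis's TensorFlow implementation; all function and variable names are assumptions.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def masked_forward(x, weights, masks):
    """Forward pass of a small MLP with per-layer binary activation masks T_i."""
    a = x
    for i, W in enumerate(weights[:-1]):
        a = relu(a @ W)        # A_orig,i
        a = a * masks[i]       # A_m,i = A_orig,i ⊙ T_i  (Eq. 2.4)
    return a @ weights[-1]     # A_L, the model output, is not masked

def joint_loss(logits, labels, weights, alpha=1e-4):
    """Cross-entropy plus the ℓ1 weight penalty used in the joint regularization."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    ce = -log_p[np.arange(len(labels)), labels].mean()
    l1 = sum(np.abs(W).sum() for W in weights)
    return ce + alpha * l1

# Toy usage: a two-layer network on random data with a fixed hidden-layer mask.
rng = np.random.default_rng(0)
W = [rng.standard_normal((8, 16)), rng.standard_normal((16, 4))]
T = [(rng.random(16) > 0.7).astype(np.float32)]   # keep roughly 30% of hidden activations
x, y = rng.standard_normal((5, 8)), rng.integers(0, 4, size=5)
print(joint_loss(masked_forward(x, W, T), y, W))
```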

2.2.2 Joint Pruning Procedure

When implementing the joint regularization, we choose the $\ell_1$ regularization on the weight distribution for its ease of gradient derivation during training. After combining with

$S_T$ for the $\ell_0$ regularization on activations, the loss function in Equation (2.3) can be rewritten as:
\[
\mathrm{Loss} = \frac{1}{n}\sum_{k=1}^{n} \mathcal{L}\big(y_k, \mathcal{D}(x_k, S_W, S_T)\big) + \alpha \cdot \|S_W\|_1, \qquad
S_W^{*},\, S_T^{*} = \operatorname*{argmin}_{S_W,\, S_T} \{\mathrm{Loss}\}. \tag{2.5}
\]

To overcome the non-differentiability of the $\ell_0$ regularization, we adopt a deterministic solution to obtain a proper $S_T$. Noting that small weights are learned to be pruned, activations with small magnitudes are taken as unimportant and masked out to further minimize the inter-layer connections. Considering that neurons are activated in various patterns according to different input classes, we propose dynamic masks for the activation pruning, which is different from the static masks in the weight pruning.

The activations selected by mask $T_i$ are denoted as winners. To derive the activation $a_{m,j} \in A_{m,i}$ based on $a_{orig,j} \in A_{orig,i}$, we have:
\[
a_{m,j} =
\begin{cases}
a_{orig,j}, & \text{when } a_{orig,j} \text{ is a winner}, \\
0, & \text{otherwise},
\end{cases} \tag{2.6}
\]
where the winners are dynamically determined at run-time according to the winner rate per layer. The determination of winners through the activation mask is a relaxed partial sorting problem of finding the top-$k$ arguments in an array. The winner rate of layer $i$ is defined as:

\[
(\text{winner rate})_i = \frac{|A_{m,i}|}{|A_{orig,i}|}, \tag{2.7}
\]
where $|A_{m,i}|$ and $|A_{orig,i}|$ respectively denote the number of winners selected by $T_i$ and the number of original activations. Usually, each layer features a unique optimal winner rate. To get the appropriate winner rate per layer, the model with configurable activation masks is tested on a validation set sampled from the training set. Verified by our experiments, the size of the validation set can be similar to that of the test set. The accuracy drop is taken as the indicator of the model sensitivity to the winner rate setting. The $(\text{winner rate})_i$ is set empirically according to the tolerable accuracy loss. Examples of the activation sensitivity analysis will be presented in Section 2.4. After deriving the winner rates, dynamic activation masks are configured as illustrated in Fig. 2.1.
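A minimal NumPy sketch of the dynamic winner selection in Equations (2.6) and (2.7) is given below; it keeps the top-k activations by magnitude for a given winner rate using a linear-time partial sort (`np.argpartition`). The function name and per-example masking granularity are illustrative assumptions, not the thesis code.

```python
import numpy as np

def winner_mask(a_orig, winner_rate):
    """Return the binary mask T_i and masked activations A_m,i for one layer.

    a_orig: activation vector of a single example (1-D array).
    winner_rate: fraction of activations kept as winners (Eq. 2.7).
    """
    k = max(1, int(round(winner_rate * a_orig.size)))
    # Partial sort: indices of the k largest-magnitude activations, O(N) on average.
    winner_idx = np.argpartition(np.abs(a_orig), -k)[-k:]
    mask = np.zeros_like(a_orig)
    mask[winner_idx] = 1.0
    return mask, a_orig * mask          # Eq. (2.6): keep winners, zero the rest

# Example: keep 25% of 8 activations.
a = np.array([0.1, -2.0, 0.3, 1.5, -0.05, 0.0, 0.7, -0.2])
t, a_m = winner_mask(a, 0.25)
print(t, a_m)
```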

To understand the working scheme of the optimization problem defined by Equation (2.5), we focus on the operation of a single layer $i$:

\[
A_{orig,i} = f_i(W_i, A_{m,i-1}) = f_i(W_i, A_{orig,i-1} \odot T_{i-1}), \tag{2.8}
\]
where $f_i(\cdot)$ represents the function of layer $i$. In the backpropagation phase, the partial derivative of the loss function with respect to $A_{orig,i}$ is propagated backwards:
\[
\frac{\partial \mathrm{Loss}}{\partial A_{orig,i-1}} = \frac{\partial \mathrm{Loss}}{\partial A_{orig,i}} \cdot \frac{\partial A_{orig,i}}{\partial A_{m,i-1}} \cdot \frac{\partial A_{m,i-1}}{\partial A_{orig,i-1}}. \tag{2.9}
\]


Figure 2.1: Working principle of joint pruning.

The term $\partial A_{m,i-1} / \partial A_{orig,i-1}$ is equal to $T_{i-1}$, which means the backpropagation process is masked in the same way as the forward propagation. Thereafter, only the activated neurons are updated. For the weight update in a finetuning iteration, a small decay is applied according to the setting of the $\ell_1$ regularization on weights. The weights smaller than the empirical threshold are then pruned out.
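The masked backward pass can be checked numerically. The short sketch below is an illustrative NumPy example (not the thesis code) showing that the gradient reaching $A_{orig,i-1}$ is simply the upstream gradient multiplied element-wise by the same mask $T_{i-1}$ used in the forward pass.

```python
import numpy as np

rng = np.random.default_rng(1)
a_orig = rng.standard_normal(6)          # A_orig,i-1
t = np.array([1, 0, 1, 1, 0, 0], float)  # binary mask T_i-1
a_m = a_orig * t                         # forward: Eq. (2.4)

grad_a_m = rng.standard_normal(6)        # upstream gradient dLoss/dA_m,i-1
grad_a_orig = grad_a_m * t               # backward: dA_m/dA_orig = T_i-1 (Eq. 2.9)

# Gradients of masked-out (non-winner) activations are zero, so only the
# neurons selected in the forward pass receive weight updates.
print(grad_a_orig)
```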

As summarized in Fig. 2.1, the proposed end-to-end joint pruning approach consists of three steps. First, the significance of activations per layer is analyzed to determine the winner rates and define the pruning strength of each dynamic activation mask. Afterwards, the regularizations on both weights and activations are applied in the following finetuning stage. With the joint regulating force of $\|S_W\|_1$ and the activation masks $S_T$ as defined in Equation (2.5), weights and activations are co-trained to obtain deep sparsification. Through finetuning, the generated model is jointly optimized by dynamic sparse activation patterns and static compressed weights.

2.2.3 Optimizer and Learning Rate

We start the pruning process with several warm-up finetuning epochs to obtain the preliminary sparse patterns in both weights and activations with joint regularization.

The same optimizer used for training the original model is adopted. The learning rate is set 0.1× ∼ 0.01× smaller than the original setting. Our experiments show that Adadelta [Zei12] usually brings the best performance in the subsequent pruning process after the warm-up finetuning, especially for deeply sparsified activations. Adadelta adapts the learning rate for each individual weight parameter. Smaller updates are performed on neurons associated with more frequently occurring activations, whereas larger updates are applied to infrequently activated neurons. Hence, Adadelta is beneficial for sparse weight updates, which commonly occur in our joint pruning method. During finetuning, only a small portion of the weight parameters are updated because of the combination of sparse patterns in weights and activations. The learning rate for Adadelta is recommended to be reduced 0.1× ∼ 0.01× compared to the setting for training the original model.

2.2.4 Reconcile Dropout and Activation Pruning

In DNN training, a dropout layer is commonly added after large fc layers to avoid over-fitting. Neuron activations are randomly chosen in the feedforward phase, and weight updates are only applied to the neurons associated with the selected activations in the backpropagation phase. Thus, a random partition of the weight parameters is updated in each training iteration. Similarly, the activation mask only selects a small portion of the activated neurons and realizes sparse weight updates. However, over-fitting is still prone to happen because the selected neurons with winner activations are always kept and updated. Thus the random dropout layer is still needed. In fc layers, the number of remaining activated neurons is reduced from $|A_{orig,i}|$ to $|A_{m,i}|$ as defined in Equation (2.7). Similar to [HPTD15], which deals with sparse fc layer training, the dropout layer connected after the activation mask

16 is suggested to be modified with the setting:

\[
(\text{dropout rate})_i = C_d \cdot \sqrt{(\text{winner rate})_i}, \tag{2.10}
\]
where the constant $C_d$ is the dropout rate used in the training process of the original model. The activation winner rate is introduced to modify the dropout strength and balance over-fitting and under-fitting. The dropout layers are directly removed in the inference stage.
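For instance, the adjusted dropout rate of Equation (2.10) can be computed as below; the baseline rate of 0.5 is just an example value, not a setting prescribed by the thesis.

```python
import math

def adjusted_dropout_rate(c_d, winner_rate):
    """Dropout rate for a layer whose activations are pruned to `winner_rate` (Eq. 2.10)."""
    return c_d * math.sqrt(winner_rate)

# With a baseline dropout rate of 0.5 and a 10% winner rate,
# the dropout strength is weakened to about 0.16.
print(adjusted_dropout_rate(0.5, 0.10))
```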

2.2.5 Winner Prediction in Activation Pruning

The dynamic activation pruning method increases the activation sparsity and maintains the model accuracy. As aforementioned, the determination of $A_{m,i}$ through the activation mask is actually a relaxed partial sorting problem. According to the Master Theorem [BHS80], partial sorting can be quickly solved in linear time $O(N)$ on average through recursive algorithms, where $N$ is the number of elements to be partitioned. To further speed up the selection, $A_{m,i}$ can be predicted based on a down-sampled activation set. A threshold $\theta$ is derived by separating the top-$\varepsilon k$ elements from the down-sampled activation set comprising $\varepsilon N$ elements, with $\varepsilon$ as the down-sampling rate.

Then $\theta$ is applied to derive $a_{m,j} \in A_{m,i}$ from $a_{orig,j} \in A_{orig,i}$ as follows:
\[
a_{m,j} =
\begin{cases}
a_{orig,j}, & \text{when } |a_{orig,j}| > \theta, \\
0, & \text{otherwise}.
\end{cases} \tag{2.11}
\]
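The following NumPy sketch illustrates this threshold prediction under the stated down-sampling scheme; the function name is an assumption, and the ε-fraction subset is drawn uniformly at random here, which the thesis does not necessarily specify.

```python
import numpy as np

def predicted_threshold(a_orig, winner_rate, eps=0.1, rng=None):
    """Estimate the winner threshold θ from an ε-fraction of the activations (Eq. 2.11)."""
    rng = rng or np.random.default_rng()
    sample = rng.choice(np.abs(a_orig), size=max(1, int(eps * a_orig.size)), replace=False)
    k_sub = max(1, int(round(winner_rate * sample.size)))
    # θ separates the top-εk magnitudes within the εN-element subset.
    return np.partition(sample, -k_sub)[-k_sub]

# Apply the predicted threshold to the full activation vector.
a = np.random.default_rng(0).standard_normal(1000)
theta = predicted_threshold(a, winner_rate=0.2, eps=0.1)
a_m = np.where(np.abs(a) > theta, a, 0.0)
print(theta, np.count_nonzero(a_m) / a.size)   # kept fraction ≈ winner rate
```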

2.3 Experiments

We evaluate the joint regularization on various models ranging from a multi-layer perceptron (MLP) to deep neural networks (DNNs) on three datasets: MNIST, CIFAR-10 and ImageNet (Table 2.1). In ResNet-50 [HZRS16] and wide ResNet-32 [ZK16],

conv layers account for more than 99% of the computation cost and are therefore our focus. All the evaluations are implemented in TensorFlow.

2.3.1 Overall Performance

The compression results of JPnets on activations, weights and MACs are summarized in Table 2.1. Our method can learn both sparse activations and sparse weights. Compared to the original dense models, JPnets achieve a 1.4× ∼ 5.2× activation compression rate and a 1.6× ∼ 12.3× weight compression rate. As such, JPnets execute only 1.2% ∼ 27.7% of the MACs required by the dense models. The accuracy drop is kept below 0.4%, and in some cases the JPnets achieve even better accuracy (e.g., MLP-3, AlexNet and ResNet-50).

The ReLU function in MLP-3, LeNet-4, ConvNet-5, AlexNet and ResNet-50 brings intrinsic zero activations. However, our experiment results in Fig. 2.2(a) show that the non-zero activation percentage in the weight-pruned (WP) models tends to increase compared to the original dense models. This increment indeed undermines the benefit from weight pruning. Our proposed JP method can remedy the activation sparsity loss in WP models and remove 7.7% ∼ 22.5% more activations even compared to the original dense models. We observe the largest activation removal in ResNet-32, which uses leaky ReLU as its activation function.

Table 2.1: Summary of JPnets

Network                          MLP-3     LeNet-4   ConvNet-5   AlexNet    ResNet-50   ResNet-32
Dataset                          MNIST     MNIST     CIFAR-10    ImageNet   ImageNet    CIFAR-10
Activation Function              ReLU      ReLU      ReLU        ReLU       ReLU        Leaky ReLU
Accuracy (Baseline)              98.41%    99.4%     86.0%       57.22%     75.6%       95.0%
Accuracy (Joint Regularization)  98.42%    99.0%     85.9%       57.26%     75.7%       94.6%
Activation Percentage            17.1%     5.5%      43.6%       37.9%      17.7%       30.8%
Weight Compression Rate          10×       12.3×     2.5×        5.3×       1.6×        3.1×
MAC Percentage                   3.65%     1.2%      27.7%       25.2%      19.1%       11.5%

Figure 2.2: Comparison between WP and JP: (a) non-zero activation percentage; (b) MAC percentage.

As leaky ReLU does not provide intrinsic zero activations, the WP model of ResNet-32 cannot benefit from activation sparsity. In contrast, the JPnet in this work can remove 69.2% of activations and reduce an additional 25% of the MAC operations compared to the WP model. As shown in Fig. 2.2(b), JPnets decrease the MAC operations to 1.2% ∼ 27.7%, a 1.3× ∼ 10.5× improvement over the WP models. More details on the model configuration and analysis are presented in the following subsections.

2.3.2 MNIST and CIFAR-10

The MLP-3 on MNIST has two hidden layers with 300 and 100 neurons respectively. The model configuration details are summarized in Table 2.2. The non-zero activation percentage (Acti %) per layer indicates the pruning strength on activations before reaching the next layer. The amount of MACs is calculated with a batch size of 1. The same setting is applied to the analysis of the other models.

The MLP-3 is successfully compressed 10× and only 17.1% of activations are kept.

The total number of MACs is reduced to merely 3.65% (a 27.4× reduction) without compromising the model accuracy at all. A higher computation reduction rate is achieved by the LeNet-4 model, which comprises two conv layers and two fc layers (Table 2.3). The JPnet for LeNet-4 reduces the computation cost 83.3×, with only 5.5% of activations and 8.1% of weights retained.

To analyze and understand the effectiveness of the dynamic activation masks, we take the example of the activation patterns from layer fc2 in MLP-3. Before starting the joint pruning, the activation distribution for all MNIST digits is visualized in Fig. 2.3, which clearly shows that digits 0 − 9 excite different regions in fc2. The activation is obtained by averaging over all data examples per digit class. This observation implies that it is impossible to design a static activation mask and obtain a sparsification effectiveness comparable to the dynamic counterpart. We name the neuron featuring the maximum activation for each input the top neuron. Fig. 2.4 compares the number of activated top neurons for all digits, observed over the training set before and after applying joint pruning. The results show that the JPnet needs fewer top neurons and generates a sparser feature representation.

We also apply the joint regularization to ConvNet-5 on the CIFAR-10 dataset. The accuracy of the original dense model is 86.0%. As detailed in Table 2.4, the JPnet for ConvNet-5 needs only 27.7% of the total MACs compared to the dense model, by pruning 59.6% of weights and 56.4% of activations. Only a 0.1% accuracy drop is caused by the JPnet.

Table 2.2: The joint pruning results of MLP-3 on MNIST

Layer   Shape      Weight #   MAC #     Acti %   Weight %   MAC %
fc1     784×300    235.2K     235.2K    12%      10%        3.77%
fc2     300×100    30K        30K       24%      10%        2.62%
fc3     100×10     1K         1K        100%     20%        6.81%
Total              266.2K     266.2K    17.1%    10%        3.65%


Figure 2.3: The activation distribution of fc2 in MLP-3 for all digits.

The conv layers account for more than 80% of the total MACs and dominate the computation cost.

2.3.3 ImageNet

We use the ImageNet ILSVRC-2012 dataset to evaluate the joint pruning method on large datasets. ImageNet consists of about 1.2M training images and 50K validation images. AlexNet and ResNet-50 are adopted.

The AlexNet comprises 5 conv layers and 3 fc layers and achieves a 57.22% top-1 accuracy on the validation set. Similar to ConvNet-5, the computation bottleneck of AlexNet emerges in the conv layers, which account for more than 90% of the total MACs. As shown in Table 2.5, deeper layers present larger pruning strength on weights and

Table 2.3: The joint pruning results of LeNet-4 on MNIST

Layer   Shape        Weight #   MAC #    Acti %   Weight %   MAC %
conv1   5×5, 20      0.5K       0.4M     6.6%     60%        11.5%
conv2   5×5, 50      25K        4.9M     1.9%     10%        0.7%
fc1     2450×500     1.23M      1.2M     12.2%    8%         0.2%
fc2     500×10       5K         5K       100%     18%        2.2%
Total                1.26M      6.5M     5.5%     8.1%       1.2%


activations due to the high-level feature abstraction of the input images. For example, the MACs of conv5 can be reduced 18.2×, while only a 1.2× reduction rate is realized in conv1. In total, applying joint pruning removes 81.1% of weights and 62.1% of activations, inducing a 4× reduction in effective MACs. The weight and computation cost decomposition is shown in Fig. 2.5, where the overall model size and computation cost are normalized. The fc layers contribute the vast majority of the model size and are generally pruned with larger strength than conv layers to realize a significant model compression rate. In contrast, the computation cost reduction mainly comes from the optimization in conv layers, as depicted in Fig. 2.5(b).

Figure 2.4: The number of activated top neurons for all digits.

To reach higher accuracy, DNNs are getting deeper with tens to hundreds of conv layers. We deploy the joint regularization on ResNet-50 and summarize the

Table 2.4: The joint pruning results of ConvNet-5 on CIFAR-10

Layer   Shape        Weight #   MAC #     Acti %   Weight %   MAC %
conv1   5×5, 64      4.8K       0.69M     50.6%    70%        70%
conv2   5×5, 64      102.4K     3.68M     17.3%    50%        25.3%
fc1     2304×384     884.7K     884.7K    9.9%     40%        6.92%
fc2     384×192      73.7K      73.7K     44.8%    30%        3.0%
fc3     192×10       1.92K      1.92K     100%     50%        22.4%
Total                1.07M      5.34M     43.6%    40.4%      27.7%

22 Table 2.5: The joint pruning results of AlexNet on ImageNet

Layer   Shape         Weight #   MAC #     Acti %   Weight %   MAC %
conv1   11×11, 96     34.85K     109.3M    69.9%    85%        85%
conv2   5×5, 256      307.2K     240.8M    28.6%    40%        27.9%
conv3   3×3, 384      884.7K     149.5M    16.4%    35%        10%
conv4   3×3, 384      663.5K     112.1M    13.7%    40%        6.6%
conv5   3×3, 256      442.4K     74.8M     15%      40%        5.5%
fc1     9216×4096     37.7M      37.7M     10%      17.2%      2.6%
fc2     4096×4096     16.8M      16.8M     9.4%     17.2%      1.7%
fc3     4096×1000     4M         4M        100%     31%        2.9%
Total                 60.9M      745.2M    37.9%    18.9%      25.2%

detailed results in Table 2.6. Consisting of 1 conv layer, 4 residual units and 1 fc layer, the ResNet-50 model achieves a 75.6% accuracy on the ImageNet ILSVRC-2012 dataset. In each residual unit, several residual blocks equipped with bottleneck layers are stacked. The filter numbers in the residual units increase rapidly, and so does the weight amount, as shown in the table. An average pooling layer is connected before the last fc layer to reduce the feature dimension. Overall, conv layers contribute the vast majority of the weights and computation. The JPnet for ResNet-50 achieves a 75.7% accuracy, which is 0.1% higher than the original model. Only 19.1% of MACs are retained in the JPnet, with a 5.65× activation reduction and a 1.6× weight compression.

Figure 2.5: The decomposition of weight and computation cost of AlexNet: (a) the weight decomposition (compression rate = 5.3×); (b) the computation decomposition (reduction rate = 4×).

Table 2.6: The joint pruning results of ResNet-50 on ImageNet.

Layer   Shape                                 Weight #   MAC #    Acti %   Weight %   MAC %
conv1   7×7, 64                               9.4K       0.84G    39.9%    91.5%      91.5%
unit2   [1×1, 64;  3×3, 64;  1×1, 256]  × 3   0.21M      1.13G    19.4%    66.8%      13.6%
unit3   [1×1, 128; 3×3, 128; 1×1, 512]  × 4   1.21M      1.68G    19.6%    68.2%      14.3%
unit4   [1×1, 256; 3×3, 256; 1×1, 1024] × 6   7.08M      2.49G    12.7%    59.1%      8.4%
unit5   [1×1, 512; 3×3, 512; 1×1, 2048] × 3   14.94M     1.49G    7.9%     62.2%      5.8%
Total                                         25.5M      7.63G    17.7%    61.6%      19.1%

2.3.4 Prune Activation without Intrinsic Zeros

For the networks aforementioned, the joint regularization stretches the sparsity level in the ReLU activations. In the following, we validate the idea on an activation function without intrinsic sparse patterns, e.g., leaky ReLU. Table 2.7 shows our results for ResNet-32. The model consists of 1 conv layer, 3 stacked residual units and 1 fc layer. Each residual unit contains 5 consecutive residual blocks. Compared to the conv layers, the last fc layer is negligible in terms of weight volume and computation cost. The original model has a 95.0% accuracy on the CIFAR-10 dataset with 7.34G MACs per image. As its activation function is leaky ReLU, zero activations rarely occur in the original and WP models. After applying joint pruning, the activation percentage can be dramatically reduced down to 30.8%. As shown in Table 2.7, the JPnet keeps 32.3% of the weight parameters, while only 11.5% of MACs are required in execution. The accuracy drop is merely 0.4%.

Fig. 2.6(a) demonstrates the activation distribution of the first residual block

24 Table 2.7: The joint pruning results of ResNet-32 on CIFAR-10.

Layer   Shape                      Weight #   MAC #    Acti %   Weight %   MAC %
conv1   3×3, 16                    0.43K      0.44M    50%      40%        40%
unit2   [3×3, 160; 3×3, 160] × 5   2.1M       2.15G    29.1%    40%        11.6%
unit3   [3×3, 320; 3×3, 320] × 5   8.76M      2.6G     31.8%    40%        12.7%
unit4   [3×3, 640; 3×3, 640] × 5   35.02M     2.6G     34.5%    30%        10.3%
Total                              45.87M     7.34G    30.8%    32.3%      11.5%

in the original model by randomly selecting 500 images from the training set. The distribution gathers near zero with long tails towards both the positive and negative directions. For comparison, the activation distribution after joint pruning is shown in Fig. 2.6(b), in which activations near zero are pruned out. In addition, the kept activations are trained to be stronger with larger magnitudes, which is consistent with the phenomenon that the non-zero activation percentage increases in WP models, as illustrated in Fig. 2.2(a).

Figure 2.6: Activation distribution of ResNet-32: (a) original; (b) after joint pruning.

Table 2.8: Comparison with the state-of-the-art weight pruning methods.

Model      Dataset    Method   Weight %   MAC Reduction   Error
LeNet-4    MNIST      L0       8.9%       5.9×            0.9%
                      VIB      0.8%       71.4×           1.0%
                      VD       0.4%       80.6×           0.8%
                      Ours     8.1%       83.3×           1.0%
AlexNet*   ImageNet   DNS      32.5%      3.7×            20%
                      ADMM     20.5%      3.8×            19.8%
                      Ours     38.7%      3.7×            19.6%

* For AlexNet, we focus on conv layers, which are the computation bottleneck for inference. The top-5 prediction error is reported in the table.

2.4 Discussions

2.4.1 Comparison with Weight Pruning

In Table 2.8, we compare the joint pruning method with the state-of-the-art weight pruning methods, including ℓ0 regularization (L0) [LWK17], the variational information bottleneck (VIB) [DZW18], variational dropout (VD) [MAV17], dynamic network surgery (DNS) [GYC16] and the non-convex optimization method based on ADMM [ZYZ+18]. For LeNet-4, our method achieves the best reduction rate in computation cost with an inference error similar to the others. While VIB and VD provide comparable reduction results in computation cost by merely focusing on weight pruning, their computational complexity during training hinders their application to DNNs for large datasets, e.g., ImageNet. Joint pruning can be easily applied to large models, as shown in Section 2.3.3. Compared with DNS and ADMM, we obtain the minimum prediction error with a comparable reduction rate in computation cost.

26 ) 0 %

(

p 5 o r d Leaky ReLU with static threshold y 10 c

a Dyn. activation pruning w/o finetuning r

u Dyn. activation pruning w/ finetuning c 15 c A

35% 40% 45% 50% 55% 60% 65% Non-zero activation percentage

Figure 2.7: Comparison to static activation pruning for ResNet-32.

2.4.2 Comparison with Static Activation Pruning

The static activation pruning has been widely adopted in efficient DNN accelerator designs [AJH+16, RWA+16]. By selecting a proper static threshold θ in Equation (2.11), more activations can be pruned with little impact on the model accuracy. For the activation pruning in joint pruning, the threshold is dynamic according to the layer-wise winner rate and activation distribution. The comparison between static and dynamic pruning is conducted on ResNet-32 for the CIFAR-10 dataset. For the static pruning setup, the θ for leaky ReLU is assigned in the range of [0.07, 0.14], which yields different activation sparsity patterns.

As shown by the result of leaky ReLU with a static threshold in Fig. 2.7, the accuracy starts to drop rapidly when the non-zero activation percentage is less than 58.6% (θ = 0.08). Using dynamic activation masks, a better accuracy can be obtained under the same activation sparsity constraint. Finetuning the model using dynamic activation masks dramatically recovers the accuracy loss. As shown by our experiment in Section 2.3.4, the JPnet for ResNet-32 can be finetuned to eliminate the 10.4% accuracy drop caused by the static activation pruning.

27 ) )

% 0

% 0

(

( conv1

p p o 20 conv1 o conv2 r r 10 D d

conv3 unit2 y y c c 40 unit3 a

a conv4 r r u u 20 conv5 unit4 c c c c 60 A A 0.5 0.4 0.3 0.2 0.1 0.0 0.5 0.4 0.3 0.2 0.1 Winner rate Winner rate

(a) AlexNet on ImageNet. (b) ResNet-32 on CIFAR-10.

Figure 2.8: The analysis of activation pruning sensitivity.

2.4.3 Activation Analysis

In weight pruning, the applicable pruning strength varies by layer [HPTD15, MTK+16]. Similarly, a pruning sensitivity analysis is required to determine the proper layer-wise activation pruning strength, i.e., the activation winner rate per layer. Fig. 2.8(a) shows the relation between the JPnet accuracy drop and the selection of winner rates for AlexNet before pruning. It can be seen that the accuracy drops sharply when the activation winner rate of conv1 is less than 0.3, while setting the winner rate of conv5 to 0.1 does not affect the accuracy. This implies that deeper conv layers can support sparser activations. The unit-wise analysis results for ResNet-32 are shown in Fig. 2.8(b), which denote a similar trend of activation pruning sensitivity to AlexNet: conv1 is the most susceptible to the activation pruning. The accuracy of ResNet-32 drops quickly with decreasing winner rates, indicating a high sensitivity. Verified by the thorough experiments in Section 2.3, the accuracy loss can be well recovered by finetuning with proper activation winner rates.

2.4.4 Speedup from Dynamic Activation Pruning

The speedup for fc layers with dynamic activation pruning can be easily observed even without specific support for sparse matrix operations. After activation pruning, the weight matrix in fc layers can be condensed by removing all connections related to the pruned activations, which speeds up the inference with the compact weight matrix. Table 2.9 shows the experiment results for AlexNet's three fc layers, implemented in TensorFlow and compiled on an Intel i7-7700HQ CPU. The activation percentage listed here is the winner rate for the input activations. There is no accuracy loss after finetuning with these winner rate settings. The batch size is set to 1 in the test, which is the typical scenario for real-time applications on edge devices. The experiment obtains a 1.95× ∼ 3.65× speedup. The time spent on activation pruning to select the winner activations accounts for a very small portion of the time spent on the original densely connected layers.

Table 2.9: Speedup test for fc layers in AlexNet.

Layer  Shape      Acti %  Time (Original)  Time (Acti Pruning + MACs)  Speedup
fc1    9216×4096  15%     10.19 ms         0.87 ms + 3.08 ms           2.58×
fc2    4096×4096  10%     4.54 ms          0.52 ms + 0.72 ms           3.65×
fc3    4096×1000  9.4%    1.52 ms          0.39 ms + 0.39 ms           1.95×
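To illustrate why the condensed weight matrix speeds up an fc layer, the following NumPy sketch keeps only the winner activations and multiplies them with the matching rows of the weight matrix. The layer shape, winner rate, and use of argpartition are illustrative assumptions, not the thesis' TensorFlow implementation.

```python
import numpy as np

def fc_with_activation_pruning(x, W, winner_rate):
    """Keep the top-(winner_rate * len(x)) input activations by magnitude,
    then multiply with the correspondingly condensed weight matrix."""
    k = max(1, int(winner_rate * x.size))
    # argpartition finds the top-k indices without a full sort
    winners = np.argpartition(np.abs(x), -k)[-k:]
    return x[winners] @ W[winners, :]              # condensed (k x O) matmul

# Toy check against the dense layer (fc2-like shape, 10% winner rate)
x = np.maximum(np.random.randn(4096), 0)           # ReLU-like input activations
W = 0.01 * np.random.randn(4096, 4096)
y_sparse = fc_with_activation_pruning(x, W, winner_rate=0.10)
y_dense = x @ W
print(y_sparse.shape, np.abs(y_sparse - y_dense).mean())   # approximation error
```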

2.4.5 Activation Threshold Prediction

As discussed in Section 2.2.5, the process to select activation winners can be accelerated by threshold prediction on a down-sampled activation set. We apply different down-sampling rates on the JPnet for AlexNet. As can be seen in Fig. 2.9, layer conv1 is most vulnerable to threshold prediction. From the overall results for AlexNet, it's practical to down-sample 10% (ε = 0.1) of activations for activation threshold prediction while keeping the accuracy drop below 0.5%.
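A minimal sketch of the threshold-prediction idea, assuming the threshold is approximated by a quantile of a randomly down-sampled activation set; the quantile-based estimator and the shapes here are illustrative rather than the exact procedure of Section 2.2.5.

```python
import numpy as np

def predict_threshold(activations, winner_rate, eps=0.1):
    """Estimate the pruning threshold from a random eps-fraction of activations,
    instead of ranking the full activation set."""
    flat = activations.ravel()
    sample = np.random.choice(flat, size=max(1, int(eps * flat.size)), replace=False)
    # The (1 - winner_rate) quantile of the sample approximates the full-set threshold
    return np.quantile(np.abs(sample), 1.0 - winner_rate)

acts = np.maximum(np.random.randn(96, 55, 55), 0)      # ReLU-like feature map
theta_hat = predict_threshold(acts, winner_rate=0.3, eps=0.1)
mask = np.abs(acts) >= theta_hat                        # approximate winner mask
print(theta_hat, mask.mean())                           # kept fraction ~ winner rate
```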

[Plot: Top-1 accuracy (%) per layer (conv1–conv5) and overall, for threshold prediction w/o sampling and with sampling rates of 0.2, 0.1, 0.05, 0.02, and 0.01.]

Figure 2.9: The effects of threshold prediction.

2.5 Conclusive Remarks

To minimize the computation cost in DNNs, joint regularization integrating weight pruning and activation pruning is proposed in this chapter. The experiment results on various models for the MNIST, CIFAR-10 and ImageNet datasets have demonstrated considerable computation cost reduction. In total, a 1.4× ∼ 5.2× activation compression rate and a 1.6× ∼ 12.3× weight compression rate are obtained. Only 1.2% ∼ 27.7% of MACs are left with marginal effects on model accuracy, which outperforms weight pruning alone by 1.3× ∼ 10.5×. The JPnets are targeted at dedicated DNN accelerators with efficient sparse matrix storage and computation units on chip. Featuring compressed model size and reduced computation cost, the JPnets will meet the constraints on memory space and computing resources in embedded systems.

Chapter 3

WTA Dropout for Structured Activation Regularization

3.1 Motivation

Although the activation sparsity can dramatically save computation cost and power consumption, two fundamental issues prevent its full advantage in practice. (i) Randomness of the intrinsic activation sparsity pattern from ReLU. Fig. 3.1(a) is an example of the flattened feature map1 generated by the first convolution layer in AlexNet on the ImageNet dataset. It can be seen that the distribution of white dots that represent zero activations has no regularity. Specific hardware designs are compulsory in order to leverage the random activation sparsity distribution. In other words, it is hard to achieve a similar speedup on conventional embedded systems without the support of dedicated circuit components. For example, Eyeriss [CKES17] implemented an on-chip run-length compression module, which achieved up to 1.9× reduction in memory accesses. Rhu et al. [ROC+18] proposed a dedicated compressing direct memory access (cDMA) engine to exploit the inherent sparsity in activations to speed up the data movements between the CPU memory and the GPU memory. EIE [HLM+16] and ZeNA [KAY18] utilized non-zero detection logic units to select non-zero activations for processing engines. (ii) Nonidentical sparsity levels for different input images. The overhead of non-zero detection and compression/decompression techniques can be amortized for sparse activations, which might not be possible for dense activations incurred by certain input data.

1The feature map with 96 channels is arranged in an 8×12 grid for better visualization. White and dark blue dots represent zero and non-zero activations, respectively.

(a) The original feature map. (b) The feature map with WTA masks.

Figure 3.1: An example of the activation distribution of the first convolution layer in AlexNet on ImageNet dataset.

In this chapter, we aim to overcome these difficulties, enable feasible activation sparsification, and improve the efficiency of DNN deployment in embedded systems. We exploit the dynamic activation sparsity (DAS) and propose a dynamic winners-take-all (WTA) dropout technique. In each layer, the neurons are tagged with rankings based on their activation magnitudes at run-time. Only those neurons with top rankings are selected to participate in the computations of the following layers. In other words, the WTA dropout serves as a dynamic WTA mask to prune neuron activations layer-wise. In this way, we remove certain channels in feature maps and thus obtain structured activation sparsity, as shown in Fig. 3.1(b). The rate of WTA dropout can be adjusted by layers to ensure the overall system accuracy.

3.2 DASNet with WTA Dropout

In this work, we propose dynamic WTA dropout which regulates the neuron activa- tion sparsity based on the activation magnitudes, as stronger activations potentially contribute more to the computation results in both forward and backward propa- gation phases. Fig. 3.2 depicts the basic concept of the proposed dynamic WTA


(a) Original. (b) After applying WTA masks.

Figure 3.2: The original network and the DASNet with dynamic WTA dropout.

dropout. It behaves as a mask between layers and prunes low-ranking neurons at run-time. We first rank all the neurons of a layer based on their activation magnitudes. Only those top-ranking neurons (namely, winner neurons) will remain and propagate their outputs to the following layer. DASNets with WTA masks need to be finetuned in order to maintain the model accuracy. In the backward propagation phase during finetuning, accordingly, we update only the winner neurons selected by the WTA masks along the forward path. The winner rate (p) of a layer is defined as:

p = Nwinner/Ntotal, (3.1)

where Nwinner and Ntotal denote the sizes of the winner neuron set Swinner and the total neuron set Stotal, respectively. p is adjustable to balance the sparsity level and accuracy. The finetuning process is described in Algorithm 1, which shares the same loss function used for training the original model. Because of the WTA dropout, a new term ∂Aw,l/∂Ao,l is applied on the weight updates during the backpropagation phase. Actually, ∂Aw,l/∂Ao,l is the binary mask generated by the WTA dropout, determining which neurons should be updated for the particular input data xi.

Algorithm 1: DASNet Finetuning. Loss is the loss function for the input-output pair (xi, yi); η is the learning rate; Wl denotes the weight variables per layer; Ao,l and Aw,l are the original activation and the pruned activation after the WTA mask, respectively; and L is the total layer number.
while not reaching the stop criteria do
  1. Forward Propagation:
    For layer l = 1, compute Ao,1 = g(xi, W1);
    for l = 2 → L do
      get the activation Aw,l−1 for Swinner,l−1;
      compute Ao,l = g(Aw,l−1, Wl);
    end
  2. Backward Propagation:
    For layer l = L, Aw,L = Ao,L;
    for l = (L − 1) → 1 do
      propagate ∂Loss/∂Aw,l;
      update Wl = Wl + ∆Wl, where ∆Wl = −η · ∂Loss/∂Wl = −η · (∂Loss/∂Aw,l) · (∂Aw,l/∂Ao,l) · (∂Ao,l/∂Wl);
    end
end
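The following NumPy sketch walks through one finetuning step in the spirit of Algorithm 1 for a hypothetical two-layer ReLU network with a squared-error loss; the network, loss, and learning rate are illustrative assumptions, and the binary mask `m` stands in for the term ∂Aw,l/∂Ao,l.

```python
import numpy as np

def wta_mask(a, p):
    """Binary mask keeping the top-(p * len(a)) activations by magnitude."""
    k = max(1, int(p * a.size))
    mask = np.zeros_like(a)
    mask[np.argpartition(np.abs(a), -k)[-k:]] = 1.0
    return mask

# One finetuning step of a tiny 2-layer MLP with a WTA mask after the hidden layer
x, y = np.random.randn(784), np.random.randn(10)
W1, W2 = 0.1 * np.random.randn(784, 300), 0.1 * np.random.randn(300, 10)
eta, p = 1e-3, 0.2

a_o = np.maximum(x @ W1, 0)        # A_o,1 = g(x, W1)
m = wta_mask(a_o, p)               # WTA mask generated in the forward pass
a_w = a_o * m                      # A_w,1: only winner neurons propagate
out = a_w @ W2

d_out = out - y                    # dLoss/d_out for a squared-error loss
d_aw = W2 @ d_out                  # propagate dLoss/dA_w,1
d_ao = d_aw * m                    # reuse the mask: dA_w/dA_o is the binary WTA mask
d_ao *= (a_o > 0)                  # ReLU derivative
W2 -= eta * np.outer(a_w, d_out)   # only winner activations contribute to the update
W1 -= eta * np.outer(x, d_ao)      # non-winner columns receive zero updates
```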

3.2.1 Ranking and Winner Rate Selection

The selection of the winner rate p for each layer is a tradeoff between the activation sparsification level and the model inference accuracy. On the one hand, a DASNet with more activations pruned potentially obtains faster execution; on the other hand, it becomes harder to keep the accuracy of such a DASNet intact. To explore the relation between model accuracy and activation pruning strength p, we first adopt three basic networks (MLP-3, LeNet-4, and ConvNet-5) on two simple datasets (MNIST and CIFAR-10). Experiments on deeper networks will be presented in Section 3.3. Table 3.1 summarizes the configurations of the three models. The rules of thumb for determining winner neurons in conv and fc layers are explained as follows.

Table 3.1: The network configurations for DASNet exploration.

Model      Dataset   Accuracy  conv1    conv2    fc1        fc2      fc3
MLP-3      MNIST     98.4%     -        -        784×300    300×100  100×10
LeNet-4    MNIST     99.4%     5×5, 32  5×5, 64  3136×1024  1024×10  -
ConvNet-5  CIFAR-10  86.0%     5×5, 64  5×5, 64  2304×384   384×192  192×10

Fully-connected layers

In fc layers, each neuron generates one single activation. The rankings of the activation magnitudes can be directly used to determine the winner neurons. The winner neuron selection is a relaxed partial sorting problem. It can be solved by a recursive method such as [BHS80] with linear time complexity O(n) on average, or O(n log n) in the worst case. Here, n is equal to Ntotal as defined in Equation (3.1).

The standard dropout [SHK+14] has been used to avoid overfitting in large fc layers. Our experiments are conducted by including the standard dropout with a dropout rate of 50%. We observe that although half of the neurons are activated in each layer, the proportion of neurons with non-zero activation is usually close to or even below 20%. That is, only a small portion of neurons in a layer need to fire each time. Furthermore, the same dropout rate is usually applied across all the layers of a model, without considering the different sparsity levels presented by different layers. Instead, our method determines p in WTA masks according to the cumulative energy

Ecum of the winner set Swinner, such as:

Ecum = (Σ_{aj ∈ Swinner} aj²) / (Σ_{ai ∈ Stotal} ai²), (3.2)

where ai and aj denote the neuron activations. We set a threshold θ. The size of Swinner, and thus the winner rate p, is then decided by satisfying the constraint Ecum ≥ θ.
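A minimal sketch of deriving the winner rate p from the cumulative-energy constraint of Equation (3.2), assuming the activations of one fc layer are available as a NumPy array; the helper name and shapes are illustrative.

```python
import numpy as np

def winner_rate_from_energy(a, theta):
    """Smallest winner set whose cumulative squared-activation energy reaches theta (Eq. 3.2)."""
    energy = np.sort(a.ravel() ** 2)[::-1]            # strongest activations first
    cum = np.cumsum(energy) / energy.sum()
    n_winner = int(np.searchsorted(cum, theta) + 1)   # first index with E_cum >= theta
    return n_winner / a.size

a = np.maximum(np.random.randn(1024), 0)              # activations of one fc layer
print(winner_rate_from_energy(a, theta=0.95))         # winner rate p for theta = 95%
```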

We adopt MLP-3 and ConvNet-5 to explore the appropriate energy threshold θ and the effects of the winner rate p on the model accuracy. WTA masks are

[Plots: accuracy drop (%) and pruned activation percentage (60%–90%) vs. cumulative energy threshold (99% down to 80%) for MLP-3 and ConvNet-5.]

Figure 3.3: Accuracy vs. activation pruning for fc layers.

attached to all the fc layers except the output layer. The activation sparsity level is controlled by configuring p according to the setting of θ. Fig. 3.3 shows the changes of model accuracy and the corresponding activation pruning strength when varying θ. The pruned percentage of activations directly reflects the sparsity level and the inter-layer data traffic reduction, and therefore is used to measure the pruning strength here. As expected, the model accuracy has an explicit inverse correlation with the pruning strength. The majority of the activations can be removed while sacrificing only a slight accuracy drop. For instance, pruning 93.6% of activations from MLP-3 induces a 0.4% accuracy drop, and ConvNet-5 can mask out 81.7% of activations in its fc layers with a 0.8% accuracy drop. It's worth noting that ConvNet-5 is more sensitive to activation pruning compared to MLP-3 due to the higher complexity of the CIFAR-10 dataset. For this reason, different models require different θ's in practice.

Convolution layers

In conv layers, each filter is regarded as a neuron. Every neuron generates a channel in a feature map. We cannot directly adopt the ranking method used in fc layers as it is based on single activation. More specifically, it is inefficient to rank all the activations in the feature map, which involves significant computation resources. We propose to

get the winner set Swinner in a conv layer through a two-step ranking process: (i) construct a feature vector (fv) with elements representing the importance of feature map channels. The vector length is equal to the number of filters; and (ii) apply the ranking method used in fc layers to fv.

There are two ways to obtain the feature vector. Consider a feature map fm ∈ R^{H×W×C} as depicted in Fig. 3.4(a); the feature map is a 3-D matrix with dimensions of height H, width W and channel number C. The filtering dimension is F × F.

The feature vector fv = [v1, v2, . . . , vC] ∈ R^C can be constructed with the mean activation per channel, where

vj = mean(fm[:, :, j]), where j = 1, 2,...,C. (3.3)

Previous studies [ZF14, BZK+17] discovered that the maximum activation in a fea- ture map likely dominates the feature representation in the corresponding feature dimension. So fv can also be abstracted from fm by:

vj = max(fm[:, :, j]), where j = 1, 2,...,C. (3.4)

Each feature map channel can be regarded as a feature dimension. Thus, the eigenvalue of a feature dimension is a good indicator of the significance of a feature map channel. We propose to use the eigenvalues Λ = {λc | c = 1, 2, . . . , C} to replace ai and aj in Equation (3.2) to determine Ecum of the winner neuron set Swinner in conv layers. Singular value decomposition (SVD) is adopted to derive the eigenvalues for the feature map fm and thereby seek the optimal p. Before computing the SVD, fm is reshaped into a matrix M^fm ∈ R^{S×C} by arranging all the activation vectors fm[h, w, :] at each pixel position [h, w] into rows, where S is the total number of activation vectors. S is then enlarged by gathering all the feature maps for a random set of training images (1,000 is enough in our experiments). Applying SVD to M^fm gives

M^fm_{S×C} = U_{S×S} Σ_{S×C} V^T_{C×C}, (3.5)

(a) Original. (b) With WTA masks.

Figure 3.4: Working scheme of WTA dropout in conv layers.

where U and V^T are unitary matrices, and Σ is a diagonal matrix comprising the eigenvalues Λ. M^fm can be converted into a new feature space as M^fm·V, which is equal to U_{S×S}Σ_{S×C}. Due to the low-rank behavior of the responses of filters in conv layers [ZZHS16], Σ_{S×C} can be reduced to Σ_{S×C′}. Here C′ is obtained by removing negligible eigenvalues from C, so it has a much smaller dimension than C. The winner rate p can be determined by setting an appropriate cumulative energy threshold θ for the eigenvalues in Σ. Once p is set, the winner set Swinner can be selected using the aforementioned two-step ranking method. After finetuning, the DASNet is trained to obtain the trimmed feature map U_{S×S}Σ_{S×C′} in a new feature space with the basis V_{C×C′}. The feature map unrolling scheme is depicted in Fig. 3.4(a), where each convolution patch per filtering location is unrolled as a column of the input matrix. Applying WTA masks can remove certain feature map channels (in grey color) and accordingly shrink the row-dimension of the input matrix, as shown in Fig. 3.4(b).
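The two-step ranking and the SVD-based selection can be sketched as follows with NumPy. The use of singular-value energy as the cumulative-energy terms, the random data, and the small batch of 32 feature maps (standing in for the ~1,000-image analysis set) are illustrative assumptions rather than the exact training-time procedure.

```python
import numpy as np

def conv_winner_channels(fm, p):
    """Two-step ranking for a conv layer: build the feature vector with max(.) per
    channel (Eq. 3.4), then keep the top-(p*C) channels."""
    C = fm.shape[-1]
    fv = fm.reshape(-1, C).max(axis=0)                 # v_j = max(fm[:, :, j])
    k = max(1, int(p * C))
    return np.argpartition(fv, -k)[-k:]                # indices of winner channels

def winner_rate_from_svd(fm_batch, theta):
    """Derive p from the singular-value energy of the unrolled feature maps (Eq. 3.5)."""
    C = fm_batch.shape[-1]
    M = fm_batch.reshape(-1, C)                        # S x C matrix of activation vectors
    s = np.linalg.svd(M, compute_uv=False)             # singular values of M^fm
    cum = np.cumsum(s ** 2) / np.sum(s ** 2)
    C_prime = int(np.searchsorted(cum, theta) + 1)     # dimensions kept for E_cum >= theta
    return C_prime / C

fm = np.maximum(np.random.randn(32, 55, 55, 96), 0)    # feature maps of 32 images (H, W, C)
p = winner_rate_from_svd(fm, theta=0.95)
winners = conv_winner_channels(fm[0], p)
print(p, winners.shape)
```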

We adopt LeNet-4 and ConvNet-5 in Table 3.1 to evaluate the effectiveness of the different feature vectors defined in Equations (3.3) and (3.4). For the comparison between the two fv extraction methods, the WTA masks are configured with identical winner rates according to the cumulative energy threshold θ settings. Interestingly, the max(·) method outperforms the mean(·) method regarding the model accuracy,

[Plots: accuracy vs. cumulative energy threshold for the MAX and MEAN feature vectors.]

(a) The LeNet-4 on MNIST. (b) The ConvNet-5 on CIFAR-10.

Figure 3.5: Comparison between different feature vectors.

as shown in Fig. 3.5. As θ approaches 100%, the accuracy gap between the two methods decreases. In the following experiments, max(·) will be adopted to derive the feature vector fv unless otherwise indicated.

Fig. 3.6 shows that different cumulative energy thresholds θ for Swinner result in different sparsity levels of the feature maps. Lowering θ can aggressively increase the feature map sparsity. Again, we adopt different θ for LeNet-4 and ConvNet-5 due to their different pruning sensitivities. Similar to the findings in fc layers, the accuracy drops along with the increment of the sparsity level. As can be seen in Fig. 3.6, 61.9% of the feature maps among all conv layers can be pruned for LeNet-4 with a mere 0.05% accuracy drop, while ConvNet-5 can remove 35.1% of the feature maps while keeping the accuracy loss within 1%. The feature maps tolerate a lower pruning strength than the activations in fc layers because the feature map is more complicated in the feature space.

3.2.2 Theoretical Analysis

Ranking cost

A lightweight ranking process is necessary to guarantee the computation reduction via the activation sparsification. The ranking overhead should be much less than the saved matrix computational complexity. Assume an fc layer has K inputs and

[Plots: accuracy drop (%) and pruned feature map percentage (up to 60%) vs. cumulative energy threshold (99% down to 80%) for LeNet-4 and ConvNet-5.]

Figure 3.6: Accuracy vs. feature map pruning for conv layers.

O outputs as illustrated in Fig. 3.2(a). Its computational complexity is KO. After applying the dynamic WTA dropout, only the winner neurons from the previous layer, i.e., pK input activations, are kept. So the lightened computation complexity is pKO. The ratio of the ranking cost over the saved computation complexity of the fc layer is:

(Ranking Cost / Saved Computation)_fc = K log K / ((1 − p)KO) = log K / ((1 − p)O) ≪ 1, (3.6)

as K and O are usually of similar magnitudes and p in fc layers can be smaller than 20% according to our experiments.

Assume a conv layer has N filters with the configuration shown in Fig. 3.4(a). Its computation complexity is originally F²HWCN and reduces to pF²HWCN after applying the WTA dropout with a winner rate of p. Here the ranking cost is composed of two components: finding the maximum in each feature map channel with a complexity of HWC, and ranking with a complexity of C log C. So the ratio of the ranking cost to the saved computational complexity for the conv layer is:

(Ranking Cost / Saved Computation)_conv = (HWC + C log C) / ((1 − p)F²HWCN) ≈ HW / ((1 − p)F²HWN) = 1 / ((1 − p)F²N) ≪ 1, (3.7)

where p of conv layers is commonly less than 70% according to our experiments.
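The two ratios can be checked numerically. The short sketch below assumes base-2 logarithms and AlexNet-like layer shapes purely for illustration; both ratios come out far below one.

```python
import numpy as np

def ratio_fc(K, O, p):
    """(Ranking cost) / (saved computation) for an fc layer, Eq. (3.6)."""
    return (K * np.log2(K)) / ((1 - p) * K * O)

def ratio_conv(H, W, C, N, F, p):
    """(Ranking cost) / (saved computation) for a conv layer, Eq. (3.7)."""
    return (H * W * C + C * np.log2(C)) / ((1 - p) * F * F * H * W * C * N)

# Illustrative AlexNet-like shapes: fc1 (9216 -> 4096) and a 27x27x96 input with 256 5x5 filters
print(ratio_fc(K=9216, O=4096, p=0.15))                  # ~4e-3: ranking cost is negligible
print(ratio_conv(H=27, W=27, C=96, N=256, F=5, p=0.7))   # ~5e-4: also << 1
```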

Memory accesses and computational complexity

The computations in DNNs are intensive matrix multiplications. From a mathematical view, the matrix computation in layer l works as Wl · Xl, where Wl denotes the weight matrix and Xl is the input vector/matrix derived from the previous layer's activations. The dynamic WTA method attempts to improve the sparsity level of the neuron activations. As such, the number of MACs decreases too. Moreover, a sparse Xl can reduce the memory allocation.

For fc layers, a winner rate p in vector Xl straightforwardly reduces the computation cost by a factor of 1/p. For conv layers, the reduction of memory accesses and MACs can also be obtained from the WTA dropout. As the example depicted in Fig. 3.4(a) shows, the dimension of the filters is F × F, where F is the window size of the 2-D convolution sliding over the feature map. When the convolution stride is 1, the feature map will be unrolled into a matrix Xl with the dimension of F²C × HW. After applying the dynamic WTA dropout, as illustrated in Fig. 3.4(b), the feature map will contain structured sparse patterns, where the matrix X′l derived from the feature map unrolling can be condensed in the row direction from F²C to F²C′. With a winner rate of p in the conv layer, the data volume of Xl reduces by a factor of 1/p, and so do the MACs needed in Wl · Xl. Very importantly, the method doesn't require any specific compression techniques, e.g., coordinate list or compressed sparse row formats.
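A minimal sketch of the condensed unrolling, assuming stride-1 convolution without padding and a hypothetical set of winner channels; it only illustrates how the unrolled matrix shrinks from F²C to F²C′ rows.

```python
import numpy as np

def unroll_with_channel_mask(fm, F, keep_channels):
    """im2col for stride-1, no-padding convolution, keeping only winner channels.
    Returns a matrix of shape (F*F*C', H_out*W_out) instead of (F*F*C, H_out*W_out)."""
    fm = fm[:, :, keep_channels]                       # drop pruned channels first
    H, W, C_prime = fm.shape
    H_out, W_out = H - F + 1, W - F + 1
    cols = np.empty((F * F * C_prime, H_out * W_out))
    for i in range(H_out):
        for j in range(W_out):
            patch = fm[i:i + F, j:j + F, :]            # one convolution patch
            cols[:, i * W_out + j] = patch.reshape(-1)
    return cols

fm = np.maximum(np.random.randn(14, 14, 64), 0)
winners = np.arange(24)                                # e.g., 24 of 64 channels survive WTA
X = unroll_with_channel_mask(fm, F=3, keep_channels=winners)
print(X.shape)                                         # (3*3*24, 12*12)
```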

WTA mask reuse in backpropagation

The reduction in memory accesses and computational complexity is also applicable to the backpropagation phase. As shown in the forward propagation graph in Fig. 3.7, the WTA mask is generated at run-time by ranking the activations Ao,l = g(Aw,l−1), where the function of layer l is modeled as g(·). Only the activations from winner neurons are propagated to the following layer l + 1. For example, Aw,l is obtained


Figure 3.7: WTA mask in the forward and backward propagation.

Figure 3.8: The tuning flow to obtain DASNet.

through the element-wise multiplication of Ao,l and the WTA mask. In the training process, the errors δAw,l = ∂Loss/∂Aw,l and δAo,l = δAw,l · ∂Aw,l/∂Ao,l, where ∂Aw,l/∂Ao,l is the WTA mask, need to be backward propagated to derive the weight updates as described in Algorithm 1. Consistent with the dimension reduction in Aw,l by the

WTA dropout, the size of δAw,l reduces accordingly. The WTA mask is reused as shown in the backward propagation graph in Fig. 3.7, so it doesn’t incur any extra ranking overhead.

3.2.3 The Tuning Flow to Obtain DASNet

As indicated in Section 3.2.1, there exists a tradeoff between model accuracy and activation pruning strength for both conv and fc layers. The pruning strength (i.e., winner rate) is determined by the threshold θ for the cumulative energy Ecum of winner neurons in each layer. Therefore, we can tune θ to make the DASNet meet various requirements on model accuracy and/or inference speed. As illustrated in Fig. 3.8, an iterative tuning flow based on binary search is adopted to obtain a qualified DASNet from a pretrained baseline model. The initial search iteration on θ starts with the range [L, U], where the lower bound L = 0 corresponds to removing all activations while the upper bound U = 100% corresponds to keeping all activations. In each iteration, we derive a DASNet candidate with θ = θmid = (L + U)/2 and check its compliance with the requirements. The search range for the next iteration is updated accordingly. More specifically, L = θmid when the candidate doesn't meet the minimum accuracy requirement, or U = θmid to prune more activations if further speedup is needed. The tuning flow ends when the DASNet meets the accuracy and/or latency requirements. In the experiments, as we shall show in Section 3.3, the accuracy drop for all the DASNets is kept within 0.5%. These models can be further accelerated by relaxing the constraint on θ to achieve more MAC reduction, if a larger accuracy loss is acceptable.
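A compact sketch of the binary-search tuning loop in Fig. 3.8; build_dasnet, meets_accuracy, and needs_speedup are placeholder callables standing in for the finetuning step and the accuracy/latency checks.

```python
def tune_theta(build_dasnet, meets_accuracy, needs_speedup, max_iters=8):
    """Binary search over the cumulative-energy threshold theta (Fig. 3.8).
    build_dasnet(theta) returns a finetuned candidate; the two predicates encode
    the accuracy and latency requirements (all three are placeholders here)."""
    lower, upper = 0.0, 1.0              # 0: prune everything, 1: keep all activations
    best = None
    for _ in range(max_iters):
        theta = (lower + upper) / 2.0
        candidate = build_dasnet(theta)
        if not meets_accuracy(candidate):
            lower = theta                # too aggressive: keep more energy next time
        elif needs_speedup(candidate):
            upper = theta                # accuracy is fine: try pruning more
        else:
            best = candidate             # both requirements satisfied
            break
    return best if best is not None else candidate   # fall back to the last candidate
```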

3.3 Evaluation

We verify the WTA dropout on three DNN models: ResNet [HZRS16], AlexNet and MobileNet [HZC+17a]. Compared to conventional CNN models like ResNet and AlexNet, MobileNet is designed specifically for embedded systems, featuring a conv block that concatenates a depth-wise conv layer and a point-wise conv layer.

43 Table 3.2: The experiment setup and results of DASNets.

Model       Dataset    MAC #    Top-1 Accuracy  Top-5 Accuracy  MAC # Reduction  Δ Top-1   Δ Top-5  Measured Speedup
ResNet-164  CIFAR-10   384.5M   94.54%          -               32.2%            -0.14%    -        1.48×
AlexNet     GTSRB      724.3M   97.21%          -               34.9%            -0.5%     -        1.80×
AlexNet     ImageNet   748.1M   57.22%          80.2%           27%              +0.41%    +0.12%   1.61×
MobileNet   SVHN       11.9M    93.3%           -               45%              -0.1%     -        1.75×
MobileNet   ImageNet   610.4M   70.9%           89.9%           12%              -0.39%    +0%      1.12×

MobileNet consists of 1 conv layer, 13 optimized conv blocks and 1 fc layer. We adopt four datasets including CIFAR-10, GTSRB (German Traffic Sign Recognition Benchmark), SVHN (Street View House Number) and ImageNet. CIFAR-10 is a simple dataset containing 10 classes of images in low resolution (32×32); GTSRB has 43 classes of traffic signs; SVHN is a color version of MNIST with 10 digits collected from natural scene images; and ImageNet is a large benchmark in image classification with 1000 object categories and millions of images. Table 3.2 presents the classification accuracy of the pretrained models on each dataset. The number of MACs is measured under minibatch size = 1, which is the typical scenario for real-time applications in embedded systems.

3.3.1 DASNet Measurement Results

We deployed the DASNets on an LG Nexus 5X with a 1.8 GHz processor and 2GB RAM, running Android 6.0.1 (API level 23). The neural network models were implemented based on MXNet [CLL+15], a high-performance library developed in C++ by the Distributed Machine Learning Community (DMLC). We modified the MXNet library and cross-compiled it with the Android Standalone Toolchain

so that it can support general matrix multiplication (GEMM) operations with our customized WTA dropout and neuron activation sparsity features on ARM-based platforms.

As summarized in Table 3.2, a 1.12× ∼ 1.8× wall-clock time2 speedup is achieved. This implies that the structured sparsity in DASNets can be easily utilized for acceleration. The performance of DASNets varies on different datasets. For example, the speedup for AlexNet on GTSRB is 1.8× when keeping 65.1% of the MACs and losing 0.5% accuracy. When changing the dataset to ImageNet, AlexNet even has a 0.41% (0.12%) improvement in Top-1 (Top-5) accuracy with a 1.61× speedup. It's worth noting that MobileNet is more sensitive to the WTA dropout method. Through the SVD analysis on the feature maps of MobileNet on ImageNet, less redundancy is observed than in AlexNet. By saving 12% of the MACs with θ = 99%, the inference is accelerated by 1.12× with a 0.39% drop of Top-1 accuracy, and it doesn't incur any loss in Top-5 accuracy. In contrast, more feature maps can be pruned for MobileNet on the less complex dataset SVHN, and thus a higher speedup (1.75×) is achieved.

3.3.2 Case Study on Winner Rate Configuration

The winner rate p for each layer is derived according to the cumulative energy threshold θ. To better understand the relation between p and θ, the layer-wise analysis for AlexNet on the ImageNet dataset is shown as an example in Fig. 3.9, without loss of generality for other models. Similar to the setup in Section 3.2.1, 1,000 images are randomly selected from the training set to analyze the configuration of p for the feature maps in conv layers and the activations in fc layers. Since layer fc8 is the output layer, the analysis of the activations in fc layers is only applied to fc6 and fc7.

2Wall-clock time here emphasizes that the execution time is measured by including both computation and memory access time.

[Plots: cumulative energy threshold θ vs. winner rate p for layers conv1–conv5, fc6, and fc7 of AlexNet.]

Figure 3.9: The winner rate vs. energy threshold for AlexNet.

As seen from the analysis results, different layers have distinct winner rate behavior. For example, the winner rate p of conv1 is 0.54 when setting θ = 99%, while p = 0.72 for conv2 at the same θ. The divergence in the derived p values indicates the different feature redundancy in the generated feature space per layer. It's worth noting that the cumulative energy in fc layers increases rapidly with the winner rate. When p is set around 0.1, Ecum can easily be kept above 95%, which means the activations of fc layers can be deeply sparsified. The analysis in Fig. 3.9 can be generalized to different combinations of datasets and model structures to explore activation sparsity. Thereafter, the DASNet with WTA dropout will efficiently utilize the improved activation sparsity derived from the winner rate configuration.

3.3.3 Case Study on Layer-wise Speedup

The speedup of each layer in AlexNet and MobileNet on ImageNet is shown in Fig. 3.10. The time consumption per layer consists of both the ranking process and the remaining MAC computation for the winner neurons. For AlexNet and MobileNet, no WTA dropout is applied on the input image. To have a clear view of the time distribution among all layers, the time is normalized to the computation time of the original dense layer.

46 Table 3.3: Comparison with existing feature map pruning methods.

Model (Top-5 Acc.)   Pruned Model                Dynamic   Δ Top-5 Accuracy   MAC # Reduction
AlexNet (80.2%)      Ours                        X         +0.12%             27%
                     Molchanov et al. [MTK+17]             -1.4%              22.4%
ResNet-18 (89.68%)   Ours                        X         -1.24%             50.4%
                     Gao et al. [GZD+19]         X         -1.46%             49.5%
                     Hua et al. [HZDS+19]                  -1.87%             37.9%
ResNet-50 (92.8%)    Ours-v1                     X         +0.26%             15.9%
                     Huang & Wang [HW18]                   -0.19%             15.1%
                     Ours-v2                     X         -0.37%             29%
                     Ours-v3                     X         -1.14%             52.4%

As shown in Fig. 3.10(a), for conv layers 2–5 in AlexNet, a 1.13× ∼ 1.84× speedup is achieved. The input layer is omitted here because the WTA dropout is not applicable on the input data. It's worth noting that fc layers 6–8 obtain a higher speedup of 3.6× ∼ 7.6×. This is because the WTA masks before fc layers usually generate a deeply sparsified input. For MobileNet, the 13 conv blocks account for more than 90% of the overall computation. The results in Fig. 3.10(b) show that a 1.07× ∼ 1.96× speedup per block is obtained. For both AlexNet and MobileNet, the time consumption of the ranking process accounts for a small portion (< 2%) of the original model execution time.

[Bar charts: normalized time composition (MAC, ranking, saved) per layer/block index.]

(a) AlexNet on ImageNet. (b) MobileNet on ImageNet.

Figure 3.10: The layer-wise decomposition for the inference time in DASNets.

3.3.4 Comparison with State-of-the-art

Many methods have been proposed for feature map pruning. The comparison of our WTA dropout method with the state-of-the-art is shown in Table 3.3. The models denoted with 'X' dynamically prune feature maps, in contrast to the static pruning in the others. Compared to the static pruning methods, our WTA dropout method obtains a better accuracy with similar reductions of MACs. In the cases with a small computation saving, the DASNet models even boost the accuracy over the original models. The existing dynamic feature map pruning methods actually complicate

network structures and lack experiments on models deeper than ResNet-18. Our WTA dropout method scales better to deep DNNs as it doesn't introduce additional training variables. Compared with the existing dynamically pruned ResNet-18, the accuracy drop of DASNet is smaller with a comparable computation reduction.

3.4 Conclusive Remarks

We propose the dynamic WTA dropout to generate structured sparsity in the feature maps of conv layers and to deeply sparsify the activations of fc layers. The DASNet equipped with dynamic WTA dropout can be efficiently utilized by conventional embedded systems without requiring dedicated hardware. The proposed WTA dropout is a generic approach to exploit the dynamic activation sparsity, which can easily be applied to DNN applications beyond the models and datasets in this chapter. Our experiments on the mobile platform show a 1.12× ∼ 1.8× speedup on various DNN models while keeping the accuracy loss within 0.5%. The DASNets can also be integrated with weight pruning and quantization without compromising on accuracy, which presents a great potential for efficient DNN deployment in embedded systems.

Chapter 4

BitSystolic Architecture for Mixed-precision Inference

4.1 Motivation

Mixed-precision inference with both compressed model size and reduced computation cost enlightens a way for accurate and efficient DNN deployment. We examine the impact of the unified quantization method and present the results of two typical models in Fig. 4.1. Here, all the weights (activations) of the model adopt an identical precision level denoted by the x-axis (y-axis). This study shows that the numerical precision requirement of a DNN is related to many factors, including the application (dataset), design (model), data type (weight, activation), and tolerable accuracy loss. Recent studies [DYG+19, WLL+19, WWZ+18, SWX+20] have proposed mixed-precision DNN models by considering the different sensitivities to weight/activation quantization in different layers. Despite this progress in algorithmic methods, state-of-the-art hardware solutions [SPS+18, URS18, SLLY17, LKK+18] targeted for mixed-precision inference cannot sufficiently support the numerical flexibility of mixed-precision models. For example, BitFusion [SPS+18], built with an array of ternary operation blocks, can only adopt exponentially growing bitwidths (2b/4b/8b). BISMO [URS18] needs prior knowledge of the weight and activation bitwidths before defining the top-level architecture. Moreover, BISMO strictly requires weights and activations across all the layers to share an identical bitwidth configuration to fit its computing paradigm. DNPU [SLLY17] and UNPU [LKK+18] take only the weight quantization into consideration while ignoring the precision flexibility of activations.

We also note that artificial intelligence (AI) applications deployed on edge

(a) The classification accuracy (%) of a 3-layer MLP on MNIST (rows: activation bitwidth; columns: weight bitwidth):

act\wt   2b    3b    4b    5b    6b    7b    8b
2b      14.0  43.8  52.5  51.9  52.3  52.0  52.4
3b      17.6  91.6  96.4  96.5  96.7  96.9  96.7
4b      17.4  95.6  97.8  98.0  98.0  98.1  98.0
5b      19.4  96.1  98.0  98.3  98.3  98.3  98.3
6b      19.3  96.3  98.1  98.3  98.4  98.4  98.4
7b      19.4  96.4  98.0  98.3  98.4  98.4  98.4
8b      19.5  96.3  98.0  98.3  98.4  98.4  98.4

(b) The classification accuracy (%) of AlexNet [KSH12b] on ImageNet (rows: activation bitwidth; columns: weight bitwidth):

act\wt   2b   3b   4b    5b    6b    7b    8b
2b      0.1  0.1  0.1   0.2   0.3   0.2   0.2
3b      0.1  0.1  0.3   26.3  32.3  34.3  34.0
4b      0.1  0.1  0.4   43.0  52.9  54.0  54.1
5b      0.1  0.1  0.5   46.4  56.1  57.4  57.5
6b      0.1  0.1  0.5   47.0  56.9  58.2  58.2
7b      0.1  0.1  0.5   47.0  57.1  58.3  58.5
8b      0.1  0.1  0.5   47.2  57.1  58.3  58.5

Figure 4.1: The case studies of the impact of the numerical precision of weight and activation on DNN accuracy.

devices spread across many scenarios, such as image classification, object detection, and speech recognition. Such a diversity in the application domain stimulates various network structures with different types of layers, including convolution (conv), fully-connected (fc), long-short-term-memory (lstm), and depthwise convolution (dconv) layers. However, an NPU optimized for a particular network architecture may not provide excellent support for other network types. For example, the NPU designs in [JYP+17a, SLLY17, LKK+18, YOT+18] support only the weight-stationary data flow. This is optimal for conv layers, in which the weight parameters are repetitively applied to input feature maps. However, fc/lstm layers reuse activation data a lot more frequently than weights, as our study in Fig. 4.2 shows, indicating that an activation-stationary data flow might be more beneficial for these layers1. Configurable data flows are essential for optimizing deployments of different network architectures.

1The data is collected from the conv layers and fc layers in AlexNet [KSH12b], the lstm layer examples from PTB [WHR+17], and the dconv layer examples from MobileNet-V1 [HZC+17b].

[Bar chart: weight reuse and activation reuse (log scale) for fc1–fc3, conv1–conv5, lstm1–lstm2, and dconv1–dconv2.]

Figure 4.2: The data reuse analysis on various layer types.

Many NPU designs like DNPU [SLLY17] and UNPU [LKK+18] adopt multipliers based on look-up tables (LUTs), which require constant LUT updates, especially when the inputs change frequently. The Tensor Processing Unit (TPU) [JYP+17a], based on a systolic array structure, has obtained tremendous success for the workloads in data centers. It adopts the weight-stationary scheme and is optimized for large-batch DNN processing. The large batch is commonly applicable to the workloads in data centers, whereas edge AI applications usually feature a small batch size, which could even be one for latency-strict tasks. Thus, the TPU cannot be directly adopted for edge devices.

We aim at an NPU design supporting flexible numerical precision, configurable data flows, and small batch sizes, all of which are essential requirements of edge devices. We propose BitSystolic based on a systolic array structure. Unlike the existing systolic design [JYP+17a], BitSystolic can alter the precision level from 2b (ternary) to 8b at runtime. Moreover, BitSystolic features two computational modes, matrix-matrix and vector-matrix, which are dedicated to the distinct data reuse patterns in different layer types.


Figure 4.3: The data layouts of conv layer and fc/lstm layer.

4.2 Preliminary

Matrix multiplications dominate the DNN computational cost. From an abstract view, all matrix-relevant operations in neural networks can be represented by the multiplication of an input activation matrix A and a weight matrix W, where A ∈

R^{x×y} and W ∈ R^{y×z}. The matrix dimensions x, y and z are determined by the layer type and structure. For example, the convolution operation between the input feature map and filters can be transformed as A · W after unrolling the feature map and filters as depicted in Fig. 4.3. Mathematically, the convolution operation is conducted between the input feature map in the space of R^{H×W×C} and K filters, with each filter in the space of R^{F×F×C}, where H, W and C denote the height, width and channel number of the feature map, respectively, and F is the filtering window size. A convolution patch in the feature map is converted into a row in A, and each filter is arranged as a column in W. Here we assume there are K filters. Thus we obtain the matrices A and W with dimensions x = HW, y = CF² and z = K.

For fc/lstm layers, the input activation vectors are directly aligned in rows of A, and neurons are arranged per column of W. Overall, there are K neurons, each of which has M parameters. As illustrated in Fig. 4.3, when the batchsize is 1, A has only one row. Correspondingly, the dimensions of A and W are determined by x = 1, y = M and z = K, where there are M parameters per neuron and K neurons in the layer. The operation turns into a vector-matrix product, which is different from the matrix-matrix multiplication in conv layers. An fc/lstm layer with a large batchsize would have a similar computational pattern as conv layers. Large batchsizes are common in datacenter applications, but edge applications usually use small batchsizes due to the limited computational resources and run-time operation requirements.

Fig. 4.4 depicts the working principle of matrix multiplication Y = A · W using weight-stationary scheme on the conventional systolic array structure as adopted in

TPU [JYP+17a]. Without losing generality, A ∈ R^{3×3} and W ∈ R^{3×3} in this tiny example. Before computation, the weight matrix W is preloaded into the systolic array. During computation, matrix A, organized in the systolic format, will flow downwards through the systolic array. The multiplication is executed at the wavefronts of the systolic A. The partial multiplication results are accumulated while flowing rightwards. Synchronizing with the input flow, the multiplication result Y will be produced in a corresponding systolic format. This kind of systolic computing paradigm provides efficient local data movements for intensive streaming inputs, e.g., in many applications in datacenters. However, for edge devices with small batchsizes, the length of the input flow is small. Such a design induces low utilization and

[Diagram of Y = A · W on a systolic array: 1. Preload W. 2. Make systolic A. 3. Get systolic Y.]

Figure 4.4: The working principle of matrix multiplication A · W on the conventional systolic array architecture.

low computing throughput, because the overhead of setting up the systolic input cannot be amortized.

4.3 BitSystolic Architecture

To address the challenges in realizing mixed-precision inference and supporting distinct types of network layers, we propose the BitSystolic architecture based on a bit-level data layout. In this section, we elaborate on the overall architecture and design details of BitSystolic.

4.3.1 Bit-level Data Layout

To deal with the mixed-precision matrix multiplication, we propose to expand the activation matrix A and the weight matrix W to a group of bit-matrices in different

55 numerical significance. Assume the numerical precisions of the feature map and filters are m-bits and n-bits, respectively. A and W can be expressed in the expansion forms

as A = Σ_{k=0}^{m−1} 2^k · A[k] and W = Σ_{k=0}^{n−1} 2^k · W[k], where A[k] and W[k] are the bit-matrices with the significance of 2^k. Thereafter, A · W can be expressed as the accumulation of products of all bit-matrix pairs, i.e., {A[i], W[j]}, s.t. 0 ≤ i ≤ m − 1 and 0 ≤ j ≤ n − 1:

A · W = Σ_{i=0}^{m−1} Σ_{j=0}^{n−1} 2^i · 2^j · A[i] · W[j] = Σ_{i+j=0}^{m+n−2} 2^{i+j} · A[i] · W[j]. (4.1)

As illustrated in Fig. 4.3, A and W can then be mapped to a stack of bit-matrices. We can configure the stack depth of bit-matrices (e.g., m = 3 and n = 5 in Fig. 4.3) to obtain distinct numerical precision for activations and weights, and thus, enable the mixed-precision operation.
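The bit-level expansion can be verified with a few lines of NumPy, assuming unsigned integer operands; signed or ternary encodings are not covered by this sketch.

```python
import numpy as np

def to_bit_matrices(X, bits):
    """Expand an unsigned integer matrix into its bit-matrices X[k] of significance 2^k."""
    return [(X >> k) & 1 for k in range(bits)]

m, n = 3, 5                                            # activation / weight bitwidths
A = np.random.randint(0, 2 ** m, size=(4, 8))
W = np.random.randint(0, 2 ** n, size=(8, 6))
A_bits, W_bits = to_bit_matrices(A, m), to_bit_matrices(W, n)

# Accumulate 2^(i+j) * A[i] * W[j] over all bit-matrix pairs (Eq. 4.1)
product = sum((1 << (i + j)) * (A_bits[i] @ W_bits[j])
              for i in range(m) for j in range(n))
assert np.array_equal(product, A @ W)                  # matches the full-precision result
```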

4.3.2 Architecture Overview

Fig. 4.5 illustrates an overview of our proposed BitSystolic architecture. The corresponding architectural parameters are defined in Table 4.1. The core of the design is a systolic array comprising P parallel systolic rows, each of which has a chain of N bit-vector processing engines (BVPEs). BitSystolic embeds three types of on-chip memory resources: the input buffer, the output buffer, and local register files. The input buffer is arranged into P banks aligning with the P systolic rows. Each systolic row is equipped with an independent accumulator (i.e., output buffer) with depth D. The input and output buffers are implemented in SRAMs. Every BVPE contains a register file for local data storage.


Figure 4.5: The overview of BitSystolic architecture.

Data Loading

Before loading data onto the chip, we can recognize the weight matrix W as a sta- tionary operand and the activation matrix A as a flying (non-stationary) operand, or vice versa. Taking the weight-stationary setup in Fig. 4.4 as an example, the direct memory access (DMA) unit will facilitate loading the stationary operand W into the register files located in the systolic rows, while the flying operand A will be fetched into the input buffer through the data loader module. Here, W and A will be formatted in the bit-level layout as described in Section 4.3.1. The bit-matrix stacks

{A[k] | 0 ≤ k ≤ m − 1} and {W[k] | 0 ≤ k ≤ n − 1} are stored contiguously in the corresponding memory space.

57 Table 4.1: The definitions of architectural parameters of BitSystolic.

Symbol   Definition
P        Number of systolic rows
L        Length of sub-vector
N        Number of BVPEs per row
D        Depth of the accumulator
fcore    Core frequency
fio      External data link frequency
Bio      External data link width

Data Processing

Conventional systolic paradigm in Fig. 4.4 derives the matrix multiplication by accu- mulating the element-wise products of A and W. Instead, BitSystolic organizes the entire matrix operations by rows and each systolic row carries out a vector-matrix multiplication. Furthermore, a systolic row is made of a chain of N BVPEs, each of which implements a partial vector product, e.g., a · w with sub-vector length L as depicted in Fig. 4.3.

Fig. 4.6 elaborates the two orthogonal data flows in a systolic row: a sequence of input bit-vectors (e.g., a[0] ∼ a[2] expanded from a in Fig. 4.3) fetched from the input buffer, and a stream of the partial bit-vector product (psum) accumulation produced by the BVPEs that propagates along the systolic row. Each input bit-vector will be reused multiple times by the stationary bit-vectors (e.g., w[0] ∼ w[4] expanded from w in Fig. 4.3) that are stored in the register file of each BVPE. Therefore, we adopt two clock domains, clk and clk_div_n, to orchestrate the psum accumulation stream and the input bit-vectors in the systolic format, respectively. clk_div_n is generated by dividing clk by n and synchronized to clk within the local control unit per row, where n is the bitwidth of the stationary operands in the register files. Along with the systolic wavefronts of the input bit-vector sequence, the psum (e.g., a[0] · w[0]) derived in


Figure 4.6: The flowing data waveform in the systolic row.

each BVPE will be propagated toward the shifter and adder at the end of the row, which are designed to coordinate the bit significance, i.e., 2^{i+j} in Equation (4.1). The final product results will be collected in the accumulator attached to the systolic row.

As can be seen from Fig. 4.6, a systolic row can conduct a vector-matrix mul- tiplication based on a stationary vector buffered in register files and the bit-vector sequence derived from an input matrix. Again, taking the conv layer in Fig. 4.3 as an example, one column of W could be segmented and loaded into the BVPEs of a systolic row with the sub-vector length as L. We then fetch A from the input buffer and send it to the systolic data setup module to generate the input bit-vectors with systolic waveform for the BVPE chain. Multiple rows can be coordinated to further handle matrix-matrix multiplication. Each column of W could be loaded into a systolic row, and A would be broadcast to all the rows.

Result Exporting

Once the computation is completed, the results stored in the accumulators will be off-loaded with the data exporter. In our design, ReLU activation function is applied during data off-loading. The data exporter shares the global data bus with data


Figure 4.7: The bit-vector PE structure.

loader and DMA in a time-multiplexing way.

4.3.3 BVPE Design

BVPE is the fundamental processing unit in BitSystolic. Different from the TPU with a fixed 8b precision, we design the BVPE to provide a flexible precision of 2b ∼ 8b for weights and activations. Fig. 4.7 illustrates the circuit diagram of a BVPE, which assembles a register file to buffer the stationary bit-vectors (e.g., w[0] ∼ w[4] in Fig. 4.6) preloaded by DMA. The register file has 8 entries and arranges the bit-vectors in significance order, i.e., storing a bit-vector with the significance of 2^k at entry k.

The BVPE derives the partial vector products by orchestrating three data flows: a sequence of input bit-vectors (e.g., a[0] ∼ a[2] in Fig. 4.6) sent from the systolic data setup, the input psum from the previous adjacent BVPE, and the output psum to be received by the next adjoining BVPE. The bit-vector product (e.g., a[0] · w[0]) can be obtained via a bit-wise Boolean AND followed by a bit summation, which avoids a costly multiplier.
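A one-line model of the BVPE partial product, assuming 0/1 bit-vectors of sub-vector length L: a bitwise AND followed by a bit summation (popcount) reproduces the dot product of the two bit-vectors.

```python
import numpy as np

def bvpe_partial_product(a_bits, w_bits):
    """One BVPE step: bit-wise AND of two bit-vectors followed by bit summation.
    a_bits and w_bits are 0/1 vectors of length L (the sub-vector length)."""
    return int(np.sum(a_bits & w_bits))                # popcount of the AND result

L = 16
a_bit = np.random.randint(0, 2, size=L)                # one input bit-vector a[i]
w_bit = np.random.randint(0, 2, size=L)                # one stationary bit-vector w[j]
psum = bvpe_partial_product(a_bit, w_bit)
assert psum == int(a_bit @ w_bit)                      # equals the bit-vector dot product
```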


(a) Using distributed 7b shifters.


(b) Using a postponed 14b shifter.

Figure 4.8: The shifter design to handle the running significance.

4.3.4 Bit-Significance Integration

With BVPEs in place, we need to realize the significance terms to complete the computation in Equation (4.1). A straightforward implementation is a hierarchical significance scheme as depicted in Fig. 4.8(a). The design utilizes a distributed form of 7b shifters: every BVPE is accompanied by a 7b shifter to realize the term 2^j of W[j]; at the end of a BVPE row, a tail 7b shifter is responsible for the term 2^i of A[i]. Instead of such a hierarchical significance scheme, we postpone the management of 2^i and 2^j to the end of a systolic row, as depicted in Fig. 4.8(b). Although the postponed shifter needs to support a 14b shifting capacity, it improves the computation density by removing the shifters near the BVPEs. Based on the synthesis result by Cadence Genus with the TSMC 65nm GP process, for the systolic row with L = 16 and N = 16 (refer to Section 4.5), the implementation with the postponed shifter saves 30.4% area compared to the systolic row design that utilizes the distributed shifters.

4.3.5 Memory Arbiter Design

To optimize the data reuse, the memory arbiter between the input buffer and the systolic array provides the flexibility of switching between the matrix-matrix (MM) mode and the vector-matrix (VM) mode as in Fig. 4.9. In the MM mode, the data sequence fetched

Figure 4.9: The memory arbiter design.

[Figure 4.10 summary: conventional conv layer — W stationary, MM mode; fc/lstm layer — A stationary, VM mode; depthwise conv layer — W stationary, VM mode.]

Figure 4.10: Partition for different types of layers.

from one buffer bank will be broadcast to all the systolic rows. Whereas in the VM mode, all buffer banks are independently accessed by the corresponding systolic rows. The determination of the MM mode and the VM mode is based on the intrinsic data reuse patterns of the various layer types, which will be discussed in detail in Section 4.4.

62 4.4 Data Partition and Mapping in BitSystolic

4.4.1 Partition and Mapping Scheme

As summarized in Table 4.1, the overall BitSystolic architecture provides a multiple- dimension configuration, including the parallel row number (P ), the BVPE number per row (N), the sub-vector length (L), and the accumulation depth (D). The overall computational capacity per round is determined by the multiplication of two matrices in the dimensions of [D,NL] and [NL, P ] with arbitrary bitwidth options from 2b to 8b. In practical neural networks, the neuron number and the parameter number per neuron can easily reach several thousands and the computational load per layer is often partitioned into sub-tasks. We propose to optimize the partition and mapping scheme according to the types of layers as shown in Fig. 4.10. The corresponding weight and activation matrices will be partitioned into chunks based on the architectural parameters of BitSystolic. In each sub-task, the BitSystolic architecture would derive the multiplication between a pair of weight and activation chunks.

As discussed in Section 4.2, the computation of conv layers can be generalized as a matrix-matrix multiplication. Naturally, the MM mode with the weight-stationary scheme can be adopted. A weight chunk with the dimension as NL × P will be preloaded into a systolic array, and then its corresponding activation chunk with the dimension as D ×NL will be fed as the input to generate the systolic sequence. Thus the weight chunk in the systolic array will be reused D times before being evicted and replaced with a new one.
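A sketch of the MM-mode chunking described above, assuming the matrix dimensions are multiples of the tile sizes; the tiling order and the helper name are illustrative, not the exact scheduler.

```python
import numpy as np

P, N, L, D = 16, 16, 16, 256                           # architectural parameters (Table 4.1)

def partition_mm(A, W):
    """Split A (x by y) and W (y by z) into D x NL activation chunks and NL x P weight
    chunks for MM-mode sub-tasks. Assumes x, y, z are multiples of the tile sizes."""
    x, y = A.shape
    _, z = W.shape
    tasks = []
    for r in range(0, y, N * L):                       # reduction dimension
        for c in range(0, z, P):                       # one weight chunk per array load
            W_chunk = W[r:r + N * L, c:c + P]
            for d in range(0, x, D):                   # weight chunk reused D rows at a time
                tasks.append((A[d:d + D, r:r + N * L], W_chunk))
    return tasks

A = np.random.randn(512, 512)                          # unrolled feature map (x=512, y=512)
W = np.random.randn(512, 64)                           # filters arranged column-wise (z=64)
sub_tasks = partition_mm(A, W)
print(len(sub_tasks))                                  # 2 reduction tiles * 4 weight chunks * 2 row tiles = 16
```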

For fc/lstm layers, the activation matrix dimension is related to the batchsize, e.g., the activation becomes a vector when the batchsize is 1. In such a situation, the MM mode with weight stationary scheme is wasteful, because the weights need to be


(a) The computation-bounded pattern.


(b) The IO-bounded pattern.

Figure 4.11: The illustrations of computation boundary and IO boundary.

repeatedly updated. The VM mode with the activation-stationary scheme would be more efficient: an activation chunk with the dimension of 1 × NL is preloaded in a systolic row, and the corresponding weight chunk with the dimension of D × NL will be fetched from the input buffer as the input of the row. We can further duplicate the activation chunk over multiple systolic rows to increase the data parallelism.

BitSystolic can also support dconv layers featuring separate filter convolution against each channel of input feature map, which introduces a different data pattern as shown in Fig. 4.10. The depthwise convolution can be operated in the VM mode with the separate filters mapped onto parallel systolic rows as weight-stationary to maximize the data reuse. The corresponding activation chunks will be simultaneously fetched from input buffer banks aligned with the rows.

4.4.2 Computation Boundary and IO Boundary

The sub-tasks executed on BitSystolic are based on data chunks extracted from weight and activation matrices as depicted in Fig. 4.10. The MM mode for conv layers is usually computation-bounded. It’s convenient to fully utilize the systolic array

64 through pipelining the off-chip data loading and the computation. As shown in Fig. 4.11(a), while the array is processing the activation chunk buffered in one specific buffer bank, the chunk for the following sub-task can be prepared in another buffer bank. Whereas in VM mode, all systolic rows work independently, i.e., each row consumes an exclusive data chunk from the input buffer. Thus the chip performance in the VM mode is prone to be IO-bounded as depicted in Fig. 4.11(b). For instance, row0 could encounter a stall until its corresponding buffer bank is updated with a new data chunk.

To be more generic, we model the computation boundary and IO boundary with the architectural parameters defined in Table 4.1. Here, we denote Bbuf and Breg as the bitwidths of the operands in the input buffer and the register file, respectively. The time consumed by a systolic row to process a data chunk can be approximated as:

tcomp. = (D · Bbuf · Breg) / fcore, (4.2)

neglecting the small overhead caused by the systolic data setup to simplify the analysis. The time required for loading the data chunk into the chip is:

tio = (D · N · L · Bbuf) / (Bio · fio). (4.3)

Thus the relationship between tcomp. and tio can be represented as

tcomp./tio = (Breg · Bio · fio) / (N · L · fcore). (4.4)

With the configuration of the prototype chip shown in Fig. 4.13, Equation (4.4) can be simplified as tcomp./tio = Breg/4. For example, when the bitwidth of the stationary operands preloaded in the register files is 8b (i.e., Breg = 8), ideally the chip can only complete the loading of two chunks during the computation of one chunk. In other words, we can at most harness the computation power of 2 systolic rows out of 16.
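The boundary analysis can be reproduced directly from Equation (4.4); the sketch below assumes the prototype parameters (N = L = 16, Bio = 128b, fio = 100 MHz) and a 200 MHz core clock, which yields the Breg/4 simplification.

```python
def boundary_ratio(B_reg, B_io=128, f_io=100e6, N=16, L=16, f_core=200e6):
    """t_comp / t_io from Eq. (4.4); defaults assume the prototype configuration."""
    return (B_reg * B_io * f_io) / (N * L * f_core)

# With 8b stationary operands the ratio is 2: only ~2 chunks can be loaded during one
# chunk's computation, so at most 2 of the 16 systolic rows stay busy in the VM mode.
for b in (2, 4, 8):
    print(b, boundary_ratio(B_reg=b))                  # 0.5, 1.0, 2.0
```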

[Grids of significances 2^(i+j) for the 4b×4b bit-matrix pairs {A[i], W[j]} of Hidden Layer 1 and Hidden Layer 2, marking the skipped and needed pairs.]

Figure 4.12: The demonstration of BitSkipping on a 3-layer MLP for MNIST.

4.4.3 BitSkipping Technique

The task partition can be managed at a finer granularity at the bit level. BitSystolic would calculate all the multiplications of the bit-matrix pairs extracted from the weight and activation matrices as defined in Equation (4.1). In fact, the computations on certain pairs with small significance are wasted effort. In other words, the product accumulation from bits with small significance is erased by the quantization process on the final results. Taking the two hidden layers of the 3-layer MLP for MNIST with 4b weights and activations as an example (refer to Fig. 4.1(a)), each layer involves 16 bit-level matrix multiplications between weights and activations as depicted in Fig. 4.12. According to our offline analysis, the quantization stepsize QA·W for the multiplication result A·W is significantly larger than the product QA · QW, where QA and QW are the quantization stepsizes of A and W, respectively. For the first hidden layer, QA·W is 66.8× greater than QA · QW. Correspondingly, 62.5% of the bit-matrix multiplications can be safely skipped as shown in Fig. 4.12, which doesn't affect the accuracy of A · W after quantization to 4b. Similarly, 37.5% of the computational cost can be avoided for the second hidden layer. Such a computation-saving approach, namely BitSkipping, can be generically applied to other mixed-precision models with advanced network structures to avoid useless bit-matrix multiplications; more results will be included in Section 4.5.5.
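A sketch of the skipping rule: a pair {A[i], W[j]} is dropped when its significance 2^(i+j) falls below a fraction of the measured stepsize ratio Q_{A·W}/(Q_A·Q_W). The margin value is an illustrative assumption rather than the thesis' exact criterion; with the margin chosen here, the sketch happens to skip 10 of the 16 pairs for a 66.8× stepsize ratio.

```python
def bitskip_pairs(m, n, q_ratio, margin=0.125):
    """Select the (i, j) bit-matrix pairs worth computing. q_ratio = Q_{A.W} / (Q_A * Q_W);
    a pair is skipped when its significance 2^(i+j) falls below margin * q_ratio.
    The margin value is an illustrative assumption."""
    keep = [(i, j) for i in range(m) for j in range(n)
            if 2 ** (i + j) >= margin * q_ratio]
    return keep

kept = bitskip_pairs(m=4, n=4, q_ratio=66.8)           # 4b weights and activations
print(len(kept), "of 16 pairs computed,", 16 - len(kept), "skipped")   # 6 computed, 10 skipped
```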

Technology: TSMC GP 65 nm
External data link: 128b ($B_{io}$), 100 MHz ($f_{io}$)
Core frequency: 50 ∼ 200 MHz
Input buffer (SRAM): 4 KB × 16
Accumulator (SRAM): 1 KB × 16
Register files: 2 KB
Supported layer types: conv, depthwise conv, fc, lstm
Systolic row # P = 16, BVPE # per row N = 16, Sub-vector length L = 16, Accumulator depth D = 256

Figure 4.13: The die photo and chip configuration.

4.5 Measurements and Discussions

4.5.1 BitSystolic Chip

We implemented and fabricated a prototype BitSystolic chip in the TSMC GP 65nm process. Fig. 4.13 shows the die photo and the chip configuration. Within an area of 2×2 mm2 for the whole chip including IO pads, the systolic array features 16 independent rows and 16 BVPEs per row with a sub-vector length of 16. The maximal sequence length of consecutive vector products is 256. The chip contains 80 KB of SRAM in total, consisting of 64 KB single-port SRAM for the input buffer and 16 KB dual-port SRAM for the accumulators. In addition, register files with a total capacity of 2 KB are distributed across the systolic array. Separate clock frequencies are applied to the core and the external data link. The core clock frequency is in the range of 50 ∼ 200 MHz, while the external data link works at up to 100 MHz with a bus width of 128b.


Figure 4.14: The test board design with USB interface connected to PC.

Instruction              OP code   Configuration data (8b total)
global config            000       x | MM_VM | bitwidth_sta.
runtime config           001       x | round_one | bitwidth_fly.
load stationary data     010       x | id_register_file
load flying data         011       x | id_input_buffer
compute                  100       x | id_systolic_row
export data              101       x | id_accumulator

Figure 4.15: The design of the concise SIMD instruction set.

4.5.2 Test System Design

The test board shown in Fig. 4.14 is built with a BitSystolic chip and an accompanying FPGA controller (Xilinx Spartan-6 XC6SLX45). The FPGA controller serves as the data and instruction bridge between the PC host and the BitSystolic chip under test through an 8b USB interface. A dedicated on-board clock generator provides the clock for the BitSystolic core, and the clock for the external data link is supplied by the FPGA controller.

To facilitate the testing process, we design a concise Single-Instruction-Multiple-Data (SIMD) instruction set to cover the chip configuration and the principal operations on data chunks for neural networks that will be allocated and executed on BitSystolic. The instruction decoder, built in the FPGA controller, translates the instruction streams from the PC host into configuration binaries written to the global control unit of BitSystolic. The instruction set contains six instructions, as summarized in Fig. 4.15. The instruction length is constrained to 8b considering the bus width of the USB interface integrated on the FPGA board, and the leading three bits are reserved for operation codes. The details of the instruction set are described as follows (an illustrative encoding sketch follows the list).

• I1) global config: This instruction is used to set the global configuration related to the computational mode (i.e., VM or MM) and bitwidth of the stationary operands in the systolic array.

• I2) runtime config: Different from the global configuration in I1, instruction I2 configures the chip with runtime information about the bitwidth of the data fetched from the input buffer and computational state of the sub-tasks as discussed in Section 4.4.1.

• I3) load stationary data: For a matrix computation A · W, either A or W will be chosen as the stationary operand based on the offline data reuse analysis in order to minimize the data movements. I3 will load the stationary operand into the systolic array in dedicated register files by identifying the row number.

• I4) load flying data: Corresponding to I3, instruction I4 is designed for loading the non-stationary operand into the input buffer.

• I5) compute: It kicks off the matrix computation mode once both A and W are loaded. I5 can either synchronously control the entire systolic array or enable only an individual row partition.

Figure 4.16: The chip power consumption analysis. (a) Frequency-voltage scaling: minimal supply voltage and peak power (mW) vs. core frequency (MHz). (b) The power breakdown among the systolic array, accumulators, and on-chip buffer in the MM and VM modes.

Figure 4.17: Comparison between the MM mode and VM mode (layer latency in seconds for fc1–fc3, conv1–conv5, lstm1–lstm2, and dconv1–dconv2).

• I6) export data: Upon issuing instruction I6, the matrix computation results buffered in a certain accumulator first go through the activation function and are then exported to the PC through the USB interface. In this chip, the ReLU function is supported.
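As referenced above, a rough illustration of how the host could assemble the 8b instruction words is given below. The OP codes follow Fig. 4.15, but the layout of the 5b configuration field is a simplified assumption for demonstration, not the chip's exact bit assignment.

    # Illustrative host-side encoder for the 8b SIMD instructions
    # (3b OP code + 5b configuration data).  The configuration-field layout is a
    # simplified assumption for demonstration purposes only.

    OPCODES = {
        "global_config":        0b000,  # I1: MM/VM mode, stationary bitwidth
        "runtime_config":       0b001,  # I2: flying bitwidth, sub-task state
        "load_stationary_data": 0b010,  # I3: target register-file id
        "load_flying_data":     0b011,  # I4: target input-buffer bank id
        "compute":              0b100,  # I5: systolic row id (or broadcast)
        "export_data":          0b101,  # I6: accumulator id
    }

    def encode(op: str, config: int) -> int:
        """Pack one instruction into a single byte: [3b OP code | 5b config]."""
        if op not in OPCODES:
            raise ValueError(f"unknown instruction {op!r}")
        if not 0 <= config < 32:
            raise ValueError("configuration data must fit in 5 bits")
        return (OPCODES[op] << 5) | config

    if __name__ == "__main__":
        program = [
            encode("global_config", 0b00001),   # e.g., select the MM mode
            encode("load_stationary_data", 3),  # register file of row 3
            encode("load_flying_data", 3),      # input-buffer bank 3
            encode("compute", 3),               # start row 3
            encode("export_data", 3),           # read back accumulator 3
        ]
        print([f"0x{w:02x}" for w in program])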

4.5.3 Measurement Results

The measurement results in Fig. 4.16(a) show that as the core frequency increases, the minimal power supply to sustain correct functionality increases from 0.54 V to 0.66 V. Accordingly, the peak power consumption ranges from 3.8 mW to 17.8 mW. Note that the core frequency could exceed 200 MHz under a higher power supply, which is not in the scope of this work. Fig. 4.16(b) presents the normalized power breakdown in the MM and VM modes. The SRAM-based input buffer and accumulators consume more than 95% of the power in both modes. Due to the memory arbiter's large fanout for data broadcasting in the MM mode, the systolic array accounts for a slightly heavier portion of the power consumption in the MM mode than in the VM mode.

Figure 4.18: Comparison between different numerical precisions (2b, 4b, and 8b layer latency across fc, conv, lstm, and dconv layers).

Figure 4.19: The layer-wise results for mixed-precision MobileNet-V1 on ImageNet (latency with all layers in MM mode, the 8b model, and the mixed-precision model, together with the per-layer computation saving).

Fig. 4.17 compares the processing latency of various neural layers (refer to Fig. 4.2) in the MM and VM modes, under a 200 MHz core frequency, batch size 1, and 8b precision for both weights and activations. As can be seen, conv layers benefit more from the weight-stationary scheme and execute faster in the MM mode. For fc, lstm, and dconv layers, the VM mode outperforms as it provides more flexible data reuse patterns by allowing all the systolic rows to run independently.

The throughput can be further increased by lowering the numerical precision. Here, we adopt the identical bitwidth for weights and activations for simplicity of testing. Fig. 4.18 shows that the layer latency decreases rapidly as the bitwidth is reduced. Reducing from 8b to 4b results in a 3.96× speedup for conv layers, 1.98× for fc/lstm layers, and 2.23× for dconv layers. When decreasing the bitwidth to 2b, the overall speedup rates are 7.7×, 3.85×, and 4.34× for conv layers, fc/lstm layers, and dconv layers, respectively. It is worth noting that, different from the MM mode for conv layers, the VM mode used for fc/lstm and dconv layers does not provide comparable latency reduction rates. As explained in Equation (4.4), a lower bitwidth reduces the number of concurrent systolic rows, which impedes the throughput improvement expected from reducing the bitwidth of operands.

Figure 4.20: The layer-wise results for mixed-precision MobileNet-V2 on ImageNet (latency of the 8b and mixed-precision models and the per-layer computation saving).

We further evaluate BitSystolic on mixed-precision MobileNet [HZC+17b, SHZ+18] models featuring light-weight dconv layers optimized for edge devices. The trained models are obtained from HAQ [WLL+19]. MobileNet-V1 has 13 consecutive convolution blocks, with each block comprising a dconv layer and a point-wise conv (pconv) layer. The layer-wise measurement results for MobileNet-V1 in Fig. 4.19 show that the dconv layer using MM mode becomes a severe bottleneck, because MM mode does not utilize the parallelism in the depthwise filtering. Thus, we assign the VM mode and MM mode to the dconv layer and pconv layer, respectively, to fit their intrinsic data parallelism. Overall, the 8b MobileNet-V1 with this mode adaptivity obtains a 3.88× speedup. The mixed-precision model can further increase the throughput by optimizing the bitwidth layer by layer without losing inference accuracy. Up to 62.5% of the computation cost can be saved per layer. On average, the mixed-precision MobileNet-V1 achieves a 1.6× speedup with 41.8% computation saving (a 1.72× reduction rate) per layer compared to the 8b counterpart.

MobileNet-V2 further compresses the computation cost and model size by including the bottleneck convolution block consisting of an expansion layer, a dconv layer and a projection layer, where the expansion layer and projection layer are both pconv layers. Through the expansion and projection operations per bottleneck block, MobileNet-V2 could shrink the inter-block tensor dimension without losing the intra-block spatial information in the feature map. MobileNet-V2 is built on 16 stacked bottleneck convolution blocks. As shown in Fig. 4.20, the mixed-precision MobileNet-V2 can save 23.4% ∼ 71.9% computation cost layer-wise compared to the 8b counterpart without losing inference accuracy. On average, it achieves a 1.64× acceleration rate with 43.7% computation reduction.

4.5.4 Comparison with State-of-the-art NPUs

Fig. 4.21 compares the computation density (GOPS/mm2) and power efficiency (TOPS/W) of the BitSystolic chip with state-of-the-art NPUs targeted for DNN acceleration. Compared to the early designs dealing with fixed-point operations at high precision, e.g., 16b in [CKES16] and 24b in [SPK+16], the chip computation density and efficiency have been continuously improved by orders of magnitude. The improvements mainly originate from two aspects. (1) Many designs resort to advanced technologies, such as the 8nm technology node in [SCP+19], the 28nm FDSOI node in [DCB+17, MBV19], and the novel 3-D SRAM technology in [UAH+18]. (2) Quantized models are utilized to reduce the computational burden and memory cost [SLLY17, LKK+18, YOT+18]. Here, BitSystolic is designed for mixed-precision models to fulfill the precision flexibility of both weights and activations. Overall, BitSystolic provides state-of-the-art computation density and power efficiency among these NPU designs.

Figure 4.21: The overview of the state-of-the-art chips (power efficiency in TOPS/W vs. computation density in GOPS/mm2; the compared designs include Eyeriss, Sim et al. 2016, DNPU, ENVISION, Desoli et al. 2017, Thinker, QUEST, UNPU, Song et al. 2019, and BitSystolic).

The details of the comparison with some representative NPUs specialized for quantized DNN models are summarized in Table 4.2. These counterparts are particularly optimized for conv layers, while lacking efficient support for fc/lstm/dconv layers. BitSystolic features coherent power efficiency for all layer types by virtue of its configurable data flows. UNPU [LKK+18], which reports the highest power efficiency, utilizes LUT-based multipliers; frequently changing inputs can induce exhausting LUT updates, making it difficult to maintain the peak performance. In contrast, BitSystolic adopts the low-cost systolic structure, which runs steadily over the input data sequences. BitSystolic also features the lowest peak power consumption and is the only design that fully supports flexible numerical precision for both weights and activations.

4.5.5 Discussions

Balance Spot between VM and MM modes

Considering the computation patterns of different layers depicted in Fig. 4.10, the layer execution on BitSystolic is adaptable to VM or MM mode. The input batchsize can also change the underlying computation pattern, which is another factor for the mode configuration. We investigate the impact of batchsize and show the results in Fig. 4.22, where the analysis is based on the 8b (for both weights and activations) fc layers of AlexNet. The MM mode results are derived from cycle-accurate simulation after synthesis, while the VM mode results are based on measurements. The MM mode with weight stationary, as used in the TPU [JYP+17a], provides nearly constant performance when the batchsize is less than the accumulator depth due to the low utilization of the systolic array. For the tested fc layers, the VM mode with activation stationary outperforms the MM mode when the batchsize is less than 32. We note that a lower bitwidth causes lower utilization of the systolic rows due to the IO boundary, which shifts the balance spot to a smaller batchsize. For example, the balance spot turns out to be 8 for 2b layers based on the post-synthesis simulation.

Table 4.2: Comparison with the state-of-the-art NPUs

                           DNPU        QUEST        UNPU        JSSC        ISSCC       BitSystolic
                           [SLLY17]    [UAH+18]     [LKK+18]    [YOT+18]    [SCP+19]
Process                    65nm        40nm         65nm        65nm        8nm         65nm
Die area                   16mm2       121.6mm2     16mm2       19.4mm2     5.5mm2      4mm2
Supply voltage (V)         0.77∼1.1    0.77∼1.1     0.63∼1.1    0.67∼1.2    0.5∼0.8     0.54∼0.66
Max core frequency         200 MHz     330 MHz      200 MHz     200 MHz     933 MHz     200 MHz
SRAM size                  290 KB      7.68 MB      256 KB      348 KB      1.6 MB      80 KB
Quantization               LUT-based   Logarithmic  LUT-based   Linear      Linear      Linear
Weight precision           4b∼7b       1b∼4b        1b∼16b      8b/16b      8b          2b∼8b
Activation precision       16b         4b           16b         8b/16b      8b          2b∼8b
Peak power                 279mW       3,300mW      297mW       290mW       1,553mW     17.8mW
Peak throughput (GOPS)     1,200       7,490        7,372       410         6,937       403
Comp. density (GOPS/mm2)   75          61.6         461         21          1261        101
Peak power efficiency      8.1@conv    2.27@conv    50.6@conv   5.09@conv   11.5@conv   26.7@conv
(TOPS/W)                   3.9@fc      2.27@fc      46.7@fc     5.09@fc     n.a.@fc     26.7@fc
* 1 Multiply-Add = 2 OPs.

Figure 4.22: Balance spots between VM mode and MM mode (latency vs. batchsize for the fc1/fc2/fc3 layers of AlexNet in the MM and VM modes).

Throughput Scaling vs. Bit Precision

As defined in Equation (4.1), the computation cost of A · W can be reduced by lowering the bitwidth of the activation matrix A and the weight matrix W, and the chip throughput increases accordingly. Based on the cycle-accurate simulation after synthesis, Fig. 4.23 shows the throughput scaling for two typical cases, conv1 and fc1 in AlexNet, without losing generality for other layers. The results are normalized to the 8b result in each curve. The conv1 layer is executed in MM mode with weight stationary, while fc1 is in VM mode with activation stationary. When scaling the bitwidth of either A or W, we keep the precision of the other one at 8b. For conv1, the throughput increases linearly with the bitwidth reduction of A, and its performance starts to plateau at 4b when scaling the bitwidth of W. For fc1, the throughput stays nearly constant when varying the bitwidth of A, while the performance scales linearly with the bitwidth of W. Recalling that A in conv1 and W in fc1 are the non-stationary data loaded off-chip, the chip performance scales appropriately with the bitwidth of the non-stationary operands. When varying the bitwidth of the stationary operands, i.e., W in conv1 and A in fc1, the performance scalability is vulnerable to the limited IO bandwidth. In other words, the throughput is IO-bounded with the systolic array underutilized.

Figure 4.23: The throughput scaling vs. bit precision (normalized throughput of conv1 and fc1 when scaling the A or W bitwidth from 8b to 2b).

Throughput Scaling vs. IO Bandwidth

The performance scalability can be IO-bounded, especially for fc layers. The IO boundary can be alleviated by increasing the IO bandwidth, i.e., $B_{io} \cdot f_{io}$, as defined in Equation (4.4). To illustrate the effect of the IO boundary, we scale the bandwidth from 1.6 GB/s (this design) to 48 GB/s. The results of fc1 with varying bitwidth of the stationary A are shown in Fig. 4.24, where W is kept in 8b. Each curve is normalized to its 8b result. When the IO bandwidth increases, the performance scalability agrees more with that of the ideal IO condition. In addition, a smaller bitwidth of the stationary A tends to lower the utilization of the systolic array, which makes the chip throughput IO-bounded.

Figure 4.24: The throughput scaling vs. IO bandwidth (1.6 GB/s to 48 GB/s and the ideal case) with varying the bitwidth of stationary A for fc1.

BitSkipping on AlexNet

As discussed in Section 4.4.3, the computation cost can be further reduced by the BitSkipping technique, which is analyzed offline to avoid wasted bit-matrix multiplications without model accuracy loss. Here we apply BitSkipping to AlexNet to validate its feasibility. According to Fig. 4.1(b), AlexNet can be quantized with a bitwidth of 6b ∼ 8b for activations and 7b ∼ 8b for weights while keeping the inference accuracy > 58% on ImageNet. Fig. 4.25 shows the results for all the feasible configurations of activation and weight bitwidth. Since the final output is usually kept in high precision, BitSkipping is not applied to the ending layer. As can be seen from the results, the average layer-wise computation cost is reduced to 61.6% ∼ 71.4% of the original requirements. It is worth noting that a lower activation bitwidth indicates more computation saving with BitSkipping, because lower-precision matrix multiplication results are allowable.

Figure 4.25: The computation cost saving on AlexNet using BitSkipping (configurations from A:6b/W:7b to A:8b/W:8b over the layer index).

4.6 Conclusive Remarks

We propose the BitSystolic architecture, aiming to support flexible numerical precision (2b ∼ 8b) for both weights and activations in mixed-precision DNN models. The prototype BitSystolic chip with a 16×16 systolic array achieves a unified 26.7 TOPS/W peak power efficiency with a 101 GOPS/mm2 computation density. Our chip provides a 1.64× layer-wise latency improvement for mixed-precision MobileNet-V2 with a 1.78× computation cost reduction compared to the 8b counterpart. Moreover, BitSystolic features both the MM mode and the VM mode, which can be configured to optimize the data reuse and parallelism in different layer types.

Chapter 5

Quantized Training to Enhance ReRAM-based CIM Systems

5.1 Motivation

Modern computers heavily adopt the Von Neumann architecture, as shown in Fig. 5.1(a). In the Von Neumann architecture, the arithmetic unit and the memory unit are separate. The control unit fetches instructions and data from memory, and commonly there is a direct memory access (DMA) unit to help access data in memory directly. In memory-intensive tasks, e.g., neural networks, the data communication between the arithmetic unit and the memory unit dominates the performance and power consumption. Resistive random-access memory (ReRAM) brings new opportunities in computing in memory (CIM) for DNN acceleration. ReRAM is a new kind of memory device whose conductance can be modified by current or voltage pulses. It features high density and low-power read/write operations. ReRAM-based systems can combine data storage and computation on the crossbar structure. As depicted in Fig. 5.1(b), a CIM architecture enhanced with ReRAM manages to alleviate the enormous data movements by combining data storage and computation. While reading the data programmed in the ReRAM devices located at the cross-points of horizontal and vertical wires, the crossbar completes a multiplication operation through the ReRAM's current-voltage characteristic. The bit-line current is the accumulation of the multiplications on the same column. The reduction of data movements saves time and power consumption.

Figure 5.1: The typical CIM architecture.

With the help of the crossbar structure, ReRAM-based CIM systems have found opportunities in various applications, e.g., deep learning, graph computing, and spiking neuron modeling, where they achieve higher computing efficiency compared to general CPU and GPU platforms. ISAAC [SNM+16] designed a pipelined CNN structure based on ReRAM crossbars with peripheral DAC and ADC circuits. PRIME [CLX+16] and PipeLayer [SQLC17] demonstrate processing-in-memory architectures where ReRAM devices can be switched between data memory and computing engine. Liu et al. designed a spiking neural network chip [LYY+15] with a ReRAM crossbar that features low power and good robustness against process variations. Fig. 5.2 shows the comparison of power efficiency between the NPU chips with Von Neumann architecture and CIM chips. As can be seen, the ReRAM-based chips [LGY+20, XHL+20] feature higher power efficiency, benefiting from in-place data storage and computation.

It is worth noting that current ReRAM-based designs have an unrealistic requirement on resistance resolution. For example, a ReRAM device with 16 programming levels is assumed in PRIME, and ISAAC and PipeLayer require merging the calculation results of several (4∼8) adjacent columns in a ReRAM array to achieve 16-bit resolution. To pave the way for networks with low-precision devices (e.g., binary ReRAM devices), Wang et al. [WWSL17] and Song et al. [SLW+17] utilize a quantization regularization term to constrain the distribution of parameter values. A new hyper-parameter λ is introduced to control the regularization strength. A brute-force method is utilized in [SLW+17] to determine the optimal λ, which is very time-consuming. The accuracy drops using the quantization regularization method are still considerable, e.g., the accuracy drops by 7.69% for the CIFAR10 dataset in [WWSL17], and Song et al. [SLW+17] report a 1.8% accuracy drop for the MNIST dataset.

Figure 5.2: The comparison between Von Neumann chips and CIM chips (power efficiency in TOPS/W vs. computation density in GOPS/mm2; the analog ReRAM chip by Liu et al. 2020 and the multibit ReRAM chip by Xue et al. 2020 are compared against the NPU designs in Fig. 4.21).

We propose a quantized training method that targets the accuracy of ReRAM-based neuromorphic systems with limited programming resolution. Instead of introducing new hyper-parameters for training, our approach deals with the limited programming resolution directly. A model with a discrete parameter distribution is obtained once the training process terminates.

5.2 Working Principle of ReRAM-based CIM

Matrix-vector multiplication is the core operation of DNNs. In systems based on ReRAM technology, the weight parameters (matrix) of a layer are mapped onto ReRAM devices, while the inter-layer data (vector) is routed as the interconnect signals between crossbar arrays. Figure 5.1(b) depicts the scenario of a matrix-vector multiplication where the input data generated from the previous layer is represented by voltage amplitudes, and ReRAM devices are located at the cross-points of horizontal and vertical wires. Taking the first column as an example, the column current shows the vector multiplication result of the vector $v = [v_0, v_1, \ldots, v_{n-1}]^T$ and the vector $g_+ = [g_{0,+}, g_{1,+}, \ldots, g_{n-1,+}]^T$, where $v_i$ is the voltage applied on each row and $g_{i,+}$ are the conductances of the ReRAM devices along the same column, that is, $i_{0,+} = v^T \cdot g_+$. A current sensing circuit buffers the column currents for analog-to-digital conversions that can be implemented by conventional ADCs [SNM+16] or integrate-and-fire circuits [LYY+15]. When the vector multiplication result $i_{0,+}$ is converted into the digital domain, additional algorithmic operations, e.g., addition, subtraction, and shifting, can be easily realized using digital logic circuits.

It is worth noting that a ReRAM's conductance is always a positive value. To realize the representation of negative parameters, we use the combination of two ReRAM devices on two adjacent columns to represent one value. As shown in Figure 5.1, one combination example is $gg_0 = g_{0,+} - g_{0,-}$, where $gg_0$ can be positive, negative, or 0. Thus, the vector multiplication result of $v$ and $g$ is the subtraction of the currents from two columns:

$$ i_0 = i_{0,+} - i_{0,-} = v^T \cdot (g_+ - g_-). \qquad (5.1) $$

With the help of the ReRAM crossbar structure, the vector-matrix multiplication $v^T \cdot G_{n \times m}$ can be calculated in parallel to obtain the result $i = [i_0, i_1, \ldots, i_{m-1}]$ by sensing the column currents.
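The behavior described by Equation (5.1) can be captured with a few lines of NumPy. The sketch below is an idealized functional model (no device non-idealities, ideal sensing and conversion), and the weight-to-conductance mapping is one simple choice among many, not the mapping used in any particular chip.

    # Idealized functional model of the differential ReRAM crossbar in Eq. (5.1).
    import numpy as np

    def weights_to_conductance_pair(W, g_max=20e-6):
        """Map signed weights onto a differential pair of non-negative conductances.

        Positive entries go to the '+' column and negative entries to the '-' column,
        scaled so the largest |w| uses the top of the programmable range; unused
        cells are left at (approximately) zero conductance for simplicity.
        """
        scale = g_max / np.max(np.abs(W))
        G_pos = np.where(W > 0,  W, 0.0) * scale
        G_neg = np.where(W < 0, -W, 0.0) * scale
        return G_pos, G_neg, scale

    def crossbar_vmm(v, G_pos, G_neg):
        """Ideal crossbar vector-matrix multiplication: i = v^T (G+ - G-)."""
        return v @ G_pos - v @ G_neg

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        W = rng.standard_normal((8, 4))    # 8 input rows x 4 output columns
        v = rng.random(8)                  # input voltages from the previous layer
        G_pos, G_neg, scale = weights_to_conductance_pair(W)
        i = crossbar_vmm(v, G_pos, G_neg)
        print(np.allclose(i / scale, v @ W))   # True: currents recover v^T W up to scale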

5.3 Quantized Training Algorithm

5.3.1 Quantization Method

Figure 5.3 depicts a quantization example using a ReRAM device with two resistance levels. Following the conductance combination rule discussed in Section 5.2, there are only three discrete values $\{gg_0, 0, gg_1\}$ to which parameters can be mapped. One parameter $w_i$ will be quantized to one of its two nearest neighbors, as instructed by the corresponding distances $d_l$ and $d_r$.

Figure 5.3: Quantization example.

There are two commonly adopted quantization methods to convert continuous numbers to quantized numbers: the stochastic method and the deterministic method. In the stochastic method, the quantization process can be expressed as:

$$ w_i = \begin{cases} 0, & \text{with probability } p = \dfrac{d_l}{d_l + d_r} \\ gg_0, & \text{with probability } 1 - p \end{cases} \qquad (5.2) $$

Thus, $w_i$ has a higher probability of being quantized to 0 in Figure 5.3, while it may still be quantized to $gg_0$.

The other quantization method is deterministic: $w_i$ is quantized to its nearest neighbor:

$$ w_i = \begin{cases} 0, & \text{if } d_r \le d_l \\ gg_0, & \text{if } d_l < d_r \end{cases} \qquad (5.3) $$

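Both rules generalize directly to an arbitrary list of discrete candidates gg. The NumPy sketch below illustrates that generalization with a per-element nearest-neighbor search; it is written for clarity rather than speed and is not tied to any specific training framework.

    import numpy as np

    def quantize_deterministic(w, gg):
        """Eq. (5.3) generalized: snap each weight to its nearest candidate in gg."""
        gg = np.sort(np.asarray(gg))
        idx = np.argmin(np.abs(w[..., None] - gg), axis=-1)
        return gg[idx]

    def quantize_stochastic(w, gg, rng=None):
        """Eq. (5.2) generalized: pick one of the two neighboring candidates with a
        probability proportional to the distance from the other one."""
        rng = rng or np.random.default_rng()
        gg = np.sort(np.asarray(gg))
        hi = np.clip(np.searchsorted(gg, w), 1, len(gg) - 1)   # right neighbor index
        lo = hi - 1                                            # left neighbor index
        d_l = w - gg[lo]                                       # distance to the left
        d_r = gg[hi] - w                                       # distance to the right
        p_hi = d_l / np.maximum(d_l + d_r, 1e-12)              # far from left -> go right
        return np.where(rng.random(np.shape(w)) < p_hi, gg[hi], gg[lo])

    if __name__ == "__main__":
        gg = [-1.0, 0.0, 1.0]                 # e.g. {gg0, 0, gg1} for a binary device
        w = np.array([-0.8, -0.1, 0.3, 0.9])
        print(quantize_deterministic(w, gg))  # [-1.  0.  0.  1.]
        print(quantize_stochastic(w, gg))     # a random mix of the two neighbors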
5.3.2 Quantized Training Method

Our proposed quantized training algorithm for ReRAM-based systems is described in Algorithm 2. Here, we simply use the term weights (denoted by W) to indicate both the weight and bias parameters in a network model (denoted by net) for ease of explanation.

Algorithm 2: Quantized Training Method. The network model is denoted by net, and gg is a list of all possible discrete candidates for weights W.

    W(c,0) = getParameters(net)
    W(q,0) = quantize(W(c,0), gg)
    W(c,0) = W(q,0)                                  ▷ Initialize net
    while i < max_iterations do
        ∆W = SGD(W(q,i), batch_i)
        W(c,i+1) = clipping(W(c,i) + ∆W)
        W(q,i+1) = quantize(W(c,i+1), gg)
        if loss(W(q,i+1), batch_i) > loss(W(q,i), batch_i) then
            W(q,i+1) = W(q,i)
        end
        i++                                          ▷ Training iterations
    end

In the algorithm, gg is a list of all possible discrete weight candidates, and the function quantize implements the deterministic quantization discussed in Section 5.3.1, as it is more computationally efficient than the stochastic method.

There are two sets of weights in the algorithm: the continuous weights Wc and the quantized weights Wq. For the initialization before training, the initial Wc is set the same as the initial Wq, which is quantized from the net initialization. During training, the stochastic gradient descent (SGD) is computed upon Wq, which means that the quantized weights participate in the forward pass and back-propagation. However, the derived weight updates ∆W are accumulated into Wc. After that, Wq is updated by quantizing the current Wc for the next training iteration. Through the interaction between Wq and Wc, our proposed method deals with quantization while training and is aware of the available discrete weight candidates gg. To understand the principle of the quantized training method, we can view Wc as an accumulator of numerous tiny attractions that pull the net parameters toward the expected discrete values. After training, the quantized net with Wq turns out to be an appropriate approximation of the model with the real-valued Wc.

To guarantee the feasibility of the quantization process, we adopt clipping on the model parameters during training. The clipping function bounds all parameters in the range $[gg_{min}, gg_{max}]$ supported by ReRAM devices, which avoids uncontrollable value distributions. Loss checking is applied along the training phase to verify whether the updated Wq achieves a smaller loss, which helps to avoid meaningless weight updates. Here, we simply set a maximum number of iterations to stop the training. At the end of the quantized training, we obtain a net model with quantized weights, which can be directly mapped onto ReRAM devices.
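A framework-agnostic rendition of Algorithm 2 is sketched below, assuming NumPy arrays and placeholder sgd_update and loss callables supplied by the training framework; it mirrors the structure of the algorithm rather than any particular implementation.

    import numpy as np

    def quantize_deterministic(w, gg):
        """Nearest-neighbor quantization, as in Section 5.3.1."""
        gg = np.sort(np.asarray(gg))
        return gg[np.argmin(np.abs(w[..., None] - gg), axis=-1)]

    def quantized_training(net_params, gg, sgd_update, loss, batches, max_iterations):
        """Sketch of Algorithm 2.  sgd_update(Wq, batch) returns the weight update
        computed on the quantized weights; loss(Wq, batch) evaluates the batch loss.
        Both are placeholders for framework-specific code."""
        gg_min, gg_max = min(gg), max(gg)
        W_c = np.array(net_params, dtype=float)    # continuous (accumulator) weights
        W_q = quantize_deterministic(W_c, gg)      # quantized weights used for fwd/bwd
        W_c = W_q.copy()                           # initialize the accumulator

        for i in range(max_iterations):
            batch = batches[i % len(batches)]
            dW = sgd_update(W_q, batch)                       # gradients taken on W_q
            W_c = np.clip(W_c + dW, gg_min, gg_max)           # accumulate and clip
            W_q_new = quantize_deterministic(W_c, gg)         # re-quantize
            if loss(W_q_new, batch) <= loss(W_q, batch):      # loss checking
                W_q = W_q_new                                 # keep the better model
        return W_q                                            # maps directly onto ReRAM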

5.4 Discussions

We show our experiment results and discussions in this section. We apply the quantized training method to both an MLP and CNNs to explore the feasibility of quantized training under various scenarios.

As summarized in Table 5.1, we implement three typical neural networks as case studies: a two-layer MLP for the MNIST dataset and two CNNs from the LeNet family [LBBH98]. LeNet1 is configured with two convolution (conv) layers and two fully-connected (fc) layers for the MNIST dataset, while LeNet2 has three conv layers and one fc layer for the CIFAR10 dataset. In the experiments, we set the resistance range of the ReRAM devices from 50 KΩ to 1 MΩ [LYY+15][YWW11]. Thus, the programmable conductance can be linearly distributed in the range [1 µS, 20 µS].

Table 5.1: Network configurations for quantized training.

Model     conv1      conv2      conv3      fc1         fc2
MLP       -          -          -          784×128     128×10
LeNet1    5×5×20     5×5×50     -          2450×128    128×10
LeNet2    5×5×32     5×5×32     5×5×64     1024×10     -

Table 5.2: Training results based on 2-level ReRAM.

Model     Ideal      Direct quantization     Quantized training
MLP       97.51%     85.98%                  94.53%
LeNet1    99.00%     69.24%                  98.75%
LeNet2    76.99%     26.20%                  73.04%

Table 5.2 shows the quantized training results based on a ReRAM device with two programming levels. The learning curves are depicted in Figure 5.4. Direct quantization maps the model parameters onto ReRAM devices only after training. As Table 5.2 shows, our proposed quantized training method can recover the big accuracy drops of direct quantization. For example, LeNet1 based on 2-level ReRAM approaches the accuracy of the ideal case using floating-point parameters with a drop of only 0.25%; LeNet2 increases the accuracy from 26.2% to 73.04% compared to direct quantization. The big accuracy gain comes from the fact that the quantized training method takes the parameter limits into consideration while training. As depicted in Figure 5.4, although there is a big accuracy gap between the ideal training and the quantized training at the beginning (iteration = 1), the gap becomes smaller and stable after enough training iterations.

In the following, we also carry out experiments to explore the design spaces for ReRAM-based neuromorphic systems in terms of layer size, quantization effects and device variation.

5.4.1 Classification Accuracy vs. Layer Size

Higher ReRAM programming resolution and a larger layer size can both bring higher classification accuracy. However, achieving multi-level ReRAM devices requires challenging read/write circuitry designs, which increases circuit complexity and chip area. In contrast, thanks to the high integration density of ReRAM, it is convenient to implement a large layer size.

Figure 5.4: Learning curves for three different quantized models.

Taking the two-layer MLP in Table 5.2 as an example, we scan across various ReRAM programming levels and hidden layer sizes. The MNIST classification results after quantized training are shown in Figure 5.5. The obvious trend is that both a larger layer size and a higher ReRAM programming resolution help the classification accuracy. When we focus on the accuracy improvements column-wise and row-wise in Figure 5.5, a larger accuracy increment is introduced by enlarging the hidden layer size. The classification accuracy is improved from 92.68% to 95.86% (3.18% higher) by enlarging the hidden layer size from 64 to 256 (4× larger size). However, increasing the ReRAM resolution from 2 levels to 10 levels (5× higher resolution) only yields a 0.56% accuracy improvement. Based on the observations from Figure 5.5, instead of finer ReRAM resolution, enlarging the layer size is a more efficient choice for improving the neural network's classification accuracy.

Figure 5.5: Classification accuracy vs. layer size for the quantized MLP.

5.4.2 Quantization for conv Layer and fc Layer

Typically, a CNN model consists of a group of conv layers and several fc layers. The conv layers, consisting of 3-D filters, are responsible for feature extraction to represent images using high-level feature maps. The fc layers are responsible for the classification task based on the feature maps generated by the preceding conv layers.

Quantization has different effects on the conv layers and the fc layers. We take LeNet2 in Table 5.2 as an example to explore the different effects of quantized training on these two kinds of layers. Three comparative experiments are implemented: quantizing the conv layers only, quantizing the fc layer only, and quantizing both. Figure 5.6 shows the image recognition accuracy trends vs. the number of ReRAM resistance levels. As depicted by Figure 5.6, with more ReRAM programming levels, LeNet2 generally achieves a higher recognition accuracy on the CIFAR10 dataset. Furthermore, the accuracy trend of quantizing the conv layers only coincides more with quantizing the whole model. This indicates that conv layer quantization plays a more vital role in model quantization; in other words, the conv layers have a higher demand on ReRAM programming resolution. The curve for quantizing the fc layer only also supports the idea in Section 5.4.1: the resolution increment has limited impact on the model accuracy, especially when the fc layer has an enormous number of parameters, e.g., the weights in the fc layer of LeNet2 have a dimension of 1024×10.

Figure 5.6: Quantization effects for conv layer and fc layer.

5.4.3 Robustness to Device Variations

It is not trivial to have ideal ReRAM devices that can be programmed to the designed levels without variations. Here, we explore the system robustness to programming variations. To cover different circumstances, we apply various variation conditions to the model parameters obtained from quantized training. We use LeNet2 as an example and apply uniformly distributed variations to the weight parameters with a variation bound ranging from 10% to 50%. The accuracy drops due to ReRAM variations are shown in Figure 5.7. Although the model trained with a higher ReRAM programming resolution is better than the model with a lower resolution when the variation is small, the low-resolution models gradually take over. When the weight parameter variation is greater than 30%, the model based on the 2-level ReRAM device becomes superior to all other models.
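The variation study can be mimicked with a simple perturbation model. The sketch below applies a uniformly distributed multiplicative error within a given bound to the programmed conductances; it is a behavioral approximation of programming variation, and the evaluate callable is a placeholder for running inference with the perturbed model.

    import numpy as np

    def apply_uniform_variation(G, bound, rng=None):
        """Perturb each programmed conductance by a uniform relative error in
        [-bound, +bound], e.g. bound=0.3 for a 30% variation bound."""
        rng = rng or np.random.default_rng()
        noise = rng.uniform(-bound, bound, size=np.shape(G))
        return G * (1.0 + noise)

    def variation_sweep(G_quantized, evaluate, bounds=(0.1, 0.2, 0.3, 0.4, 0.5)):
        """Evaluate a quantized model under a range of variation bounds.
        evaluate(G) stands for running inference with the perturbed conductances."""
        return {b: evaluate(apply_uniform_variation(G_quantized, b)) for b in bounds}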

To design a robust neuromorphic system based on ReRAM, higher programming resolution is not always beneficial. Beyond the fact that complicated read/write circuitry designs are needed for more ReRAM programming levels, higher resolution is less immune to device variations. Thus, in ReRAM-based neuromorphic system designs, an appropriate ReRAM programming resolution needs to be determined by considering the read/write circuitry cost, the device variation circumstances, and the application requirements.

Figure 5.7: ReRAM-based system robustness to variations.

5.5 Conclusive Remarks

The limited ReRAM programming resolution introduces a big challenge for mapping DNN model parameters onto ReRAM devices. In this chapter, we propose a quantized training method to enhance the accuracy of ReRAM-based neuromorphic systems. Our quantized training method is aware of the limited parameter precision during the training phase. Thorough experiments demonstrate that the quantized DNN models can approach the performance of the ideal DNN models on various image recognition tasks. Furthermore, this chapter gives several rules for ReRAM-based neuromorphic system designs. First, increasing the layer size improves the model accuracy more efficiently than increasing the ReRAM's programming resolution. Second, the convolution layers have a higher demand on parameter precision than the fully-connected layers. At last, we find there exists an optimal choice of the ReRAM programming resolution, determined by the design complexity, device variations, and application requirements.

Chapter 6

Conclusions

6.1 Summary of Contributions

This thesis proposed the cross-layer optimization of DNN deployments by integrating the algorithmic improvements on network structures and the hardware support for the novel features of the optimized DNN models.

In Chapter 2, we propose joint regularization integrating weight pruning and activation pruning. The experiment results on various models for MNIST, CIFAR-10 and ImageNet datasets have demonstrated considerable computation cost reduction. In total, a 1.4× ∼ 5.2× activation compression rate and a 1.6× ∼ 12.3× weight compression rate are obtained. Only 1.2% ∼ 27.7% of MACs are left with marginal effects on model accuracy, which outperforms the weight pruning by 1.3× ∼ 10.5×.

In Chapter 3, we exploit the dynamic activation sparsity and propose structural activation regularization using the WTA dropout technique. Our experiments on various datasets show that the derived networks enhanced by WTA dropout, namely DASNets, can achieve a 1.12× ∼ 1.8× speedup by paying up to 0.5% accuracy loss.

In Chapter 4, an NPU, BitSystolic, is designed to support flexible numerical precision, configurable data flows, and small batch sizes, all of which are essentially required by edge devices. BitSystolic features two computational modes, matrix-matrix and vector-matrix, which are dedicated to the distinct data reuse patterns in different layer types. We design and fabricate our BitSystolic chip with a 16×16 systolic array in the TSMC GP 65nm process. The measurements show that our chip obtains a unified 26.7 TOPS/W peak power efficiency for all layer types under a 17.8 mW chip power consumption.

In Chapter 5, a quantized training method is proposed to enhance the performance of neuromorphic systems based on ReRAM. Our quantized training method deals with training and quantization at the same time to alleviate the impact of limited parameter precision. The experiment results verify that the quantized training method can approximate the accuracy of full-precision training, e.g., LeNet1 based on binary ReRAM decreases the classification accuracy by only 0.25% on the MNIST dataset.

6.2 Future Work

Though the thesis has demonstrated promising achievements in efficient DNN deployments through algorithm-hardware co-optimization, there still exist many research directions worthy of exploration. Here we list two preliminary ideas as potential future work.

• In Chapter 3, by activating only the winner neurons, the WTA dropout used to generate structural activation sparsity can reduce the overlapping of activated neuron sets for different object classes. We numerically define the neuron overlapping as the intersection percentage, |Sa ∩ Sb|/|Sa ∪ Sb|. Here, Sa ∩ Sb is the intersection of the neuron set Sa activated for class a and the neuron set Sb for class b, and Sa ∪ Sb is the union of these two neuron sets. Fig. 6.1 shows the confusion matrices of the neuron overlapping in the two hidden layers of the MLP-3 model (refer to Table 3.1) on the MNIST dataset. When constructing the neuron set for each class, we neglect the neurons activated for less than 1% of the targeted digit images to filter out random noise. Compared to the original model, applying the WTA dropout reduces the neuron overlapping between digit pairs in both layers significantly. In other words, the inference paths for different digits share far fewer neurons. The reduced neuron overlapping helps to sort out the object selectivity for neurons in each layer. This observation indicates that the inference may potentially stop early with a correct prediction because of the appearance of object selectivity before the ending layer. We may explore the early-stopping potential for DNNs equipped with WTA dropout to further save the execution cost during inference. (A small sketch of the overlap metric follows this list.)

Figure 6.1: The neuron overlapping in MLP-3 for the MNIST dataset ((a) fc1: original; (b) fc1: with WTA; (c) fc2: original; (d) fc2: with WTA).

• In Chapter 4, the design space exploration for BitSystolic over the architectural parameters (refer to Table 4.1), e.g., the systolic array size and the accumulator depth, is a valuable but not yet addressed question in this thesis. On one hand, we anticipate a good scalability of the computation capacity of BitSystolic by virtue of the systolic architecture. On the other hand, the IO boundary would potentially hinder the full utilization of the systolic array, especially for dense fc/lstm layers. The optimal design configuration of the BitSystolic architecture is worth investigating in the future.
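As referenced in the first item above, the intersection-percentage metric is straightforward to compute from recorded activations. The sketch below assumes an activation matrix of shape (num_images, num_neurons) and the 1% activation-count filter mentioned in the text; it is an illustration of the metric, not the exact analysis script used for Fig. 6.1.

    import numpy as np

    def class_neuron_set(activations, labels, cls, min_fraction=0.01):
        """Neurons activated (non-zero) for at least `min_fraction` of the images
        of class `cls`; `activations` has shape (num_images, num_neurons)."""
        acts = activations[labels == cls] > 0
        frac = acts.mean(axis=0)
        return set(np.flatnonzero(frac >= min_fraction))

    def neuron_overlap(set_a, set_b):
        """Intersection percentage |Sa ∩ Sb| / |Sa ∪ Sb| between two neuron sets."""
        union = set_a | set_b
        return len(set_a & set_b) / len(union) if union else 0.0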

Bibliography

[AAA+16] Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catan- zaro, Qiang Cheng, Guoliang Chen, et al. Deep speech 2: End-to-end speech recognition in english and mandarin. In International confer- ence on machine learning, 2016.

[AJH+16] Jorge Albericio, Patrick Judd, Tayler Hetherington, Tor Aamodt, Na- talie Enright Jerger, and Andreas Moshovos. Cnvlutin: Ineffectual- neuron-free deep neural network computing. In ACM SIGARCH Com- puter Architecture News. IEEE Press, 2016.

[BF13] Jimmy Ba and Brendan Frey. Adaptive dropout for training deep neu- ral networks. In Advances in Neural Information Processing Systems, 2013.

[BHS80] Jon Louis Bentley, Dorothea Haken, and James B Saxe. A general method for solving divide-and-conquer recurrences. ACM SIGACT News, 1980.

[BZK+17] David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Tor- ralba. Network dissection: Quantifying interpretability of deep visual representations. arXiv preprint arXiv:1704.05796, 2017.

[CBD15a] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in neural information processing systems, pages 3123–3131, 2015.

[CBD15b] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in neural information processing systems, pages 3123–3131, 2015.

[CES16] Yu-Hsin Chen, Joel Emer, and Vivienne Sze. Eyeriss: A spatial archi- tecture for energy-efficient dataflow for convolutional neural networks. In ACM SIGARCH Computer Architecture News. IEEE Press, 2016.

[CKES16] Yu-Hsin Chen, Tushar Krishna, Joel S Emer, and Vivienne Sze. Ey- eriss: An energy-efficient reconfigurable accelerator for deep convolu- tional neural networks. IEEE journal of solid-state circuits, 52(1):127– 138, 2016.

[CKES17] Yu-Hsin Chen, Tushar Krishna, Joel S Emer, and Vivienne Sze. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits, 52(1):127–138, 2017.

[CLL+14] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, et al. Dadiannao: A machine-learning supercomputer. In 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, pages 609–622. IEEE, 2014.

[CLL+15] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. Mxnet: A flexible and efficient machine learning library for heterogeneous dis- tributed systems. arXiv preprint arXiv:1512.01274, 2015.

[CLX+16] Ping Chi, Shuangchen Li, Cong Xu, Tao Zhang, Jishen Zhao, Yongpan Liu, Yu Wang, and Yuan Xie. Prime: A novel processing-in-memory architecture for neural network computation in reram-based main mem- ory. ACM SIGARCH Computer Architecture News, 44(3):27–39, 2016.

[DCB+17] Giuseppe Desoli, Nitin Chawla, Thomas Boesch, Surinder-pal Singh, Elio Guidetti, Fabio De Ambroggi, Tommaso Majo, Paolo Zambotti, Manuj Ayodhyawasi, Harvinder Singh, et al. 14.1 a 2.9 tops/w deep convolutional neural network soc in fd-soi 28nm for intelligent embed- ded systems. In 2017 IEEE International Solid-State Circuits Confer- ence (ISSCC), pages 238–239. IEEE, 2017.

[DJP+20] Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and Ilya Sutskever. Jukebox: A generative model for music. arXiv preprint arXiv:2005.00341, 2020.

[DYG+19] Zhen Dong, Zhewei Yao, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. Hawq: Hessian aware quantization of neural networks with mixed-precision. In Proceedings of the IEEE International Conference on Computer Vision, pages 293–302, 2019.

[DZW18] Bin Dai, Chen Zhu, and David Wipf. Compressing neural net- works using the variational information bottleneck. arXiv preprint arXiv:1802.10399, 2018.

[GEB16] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2414–2423, 2016.

[GHC+19] Xuyang Guo, Yuanjun Huang, Hsin-pai Cheng, Bing Li, Wei Wen, Siyuan Ma, Hai Li, and Yiran Chen. Exploration of automatic mixed-precision search for deep neural networks. In 2019 IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS), pages 276–278. IEEE, 2019.

[GWFM+13] Ian J Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. arXiv preprint arXiv:1302.4389, 2013.

[GYC16] Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient dnns. In Advances In Neural Information Processing Sys- tems, 2016.

[GZD+19] Xitong Gao, Yiren Zhao, Łukasz Dudziak, Robert Mullins, and Cheng-Zhong Xu. Dynamic channel pruning: Feature boosting and suppression. In International Conference on Learning Representations, 2019.

[HCS+16a] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In Advances in neural information processing systems, pages 4107–4115, 2016.

[HCS+16b] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In Advances in neural information processing systems, pages 4107–4115, 2016.

[HLL+18] Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. Amc: Automl for model compression and acceleration on mobile de- vices. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.

[HLM+16] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A Horowitz, and William J Dally. Eie: efficient inference engine on com- pressed deep neural network. In Computer Architecture, ACM/IEEE International Symposium on. IEEE, 2016.

[HMD15] Song Han, Huizi Mao, and William J Dally. Deep compression: Com- pressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.

[Hor14a] Mark Horowitz. Computing’s energy problem (and what we can do about it). In Proceedings of the IEEE International Solid-State Circuits Conference, pages 10–14, 2014.

[Hor14b] Mark Horowitz. Energy table for 45nm process. In Stanford VLSI wiki, 2014.

[HPTD15] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, 2015.

[HPTT16] Hengyuan Hu, Rui Peng, Yu-Wing Tai, and Chi-Keung Tang. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250, 2016.

[HS97] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.

[HSS18] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018.

[HW18] Zehao Huang and Naiyan Wang. Data-driven sparse structure selection for deep neural networks. In Proceedings of the European Conference on Computer Vision, pages 304–320, 2018.

[HZC+17a] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

[HZC+17b] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

[HZDS+19] Weizhe Hua, Yuan Zhou, Christopher M De Sa, Zhiru Zhang, and G Edward Suh. Channel gating neural networks. In Advances in Neural Information Processing Systems, pages 1884–1894, 2019.

[HZRS16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep resid- ual learning for image recognition. In Proceedings of the IEEE confer- ence on computer vision and pattern recognition, 2016.

[JYP+17a] Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gau- rav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Bo- den, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Sym- posium on Computer Architecture, pages 1–12, 2017.

[JYP+17b] Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In Computer Architecture, ACM/IEEE International Symposium on. IEEE, 2017.

[KAY18] Dongyoung Kim, Junwhan Ahn, and Sungjoo Yoo. Zena: Zero-aware neural network accelerator. IEEE Design & Test, 35(1):39–46, 2018.

[KSH12a] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, 2012.

[KSH12b] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.

[LAE+16] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multi- box detector. In European conference on computer vision, pages 21–37. Springer, 2016.

[LBBH98] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[LGY+20] Q. Liu, B. Gao, P. Yao, D. Wu, J. Chen, Y. Pang, W. Zhang, Y. Liao, C. Xue, W. Chen, J. Tang, Y. Wang, M. Chang, H. Qian, and H. Wu. 33.2 a fully integrated analog reram based 78.4tops/w compute-in- memory chip with fully parallel mac computing. In 2020 IEEE In- ternational Solid- State Circuits Conference - (ISSCC), pages 500–502, 2020.

[LKK+18] Jinmook Lee, Changhyeon Kim, Sanghoon Kang, Dongjoo Shin, Sangyeob Kim, and Hoi-Jun Yoo. Unpu: A 50.6 tops/w unified deep neural network accelerator with 1b-to-16b fully-variable weight bit- precision. In 2018 IEEE International Solid-State Circuits Conference- (ISSCC), pages 218–220. IEEE, 2018.

[LLS+17] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision, pages 2755–2763, 2017.

[LWF+15] Baoyuan Liu, Min Wang, Hassan Foroosh, Marshall Tappen, and Mar- ianna Pensky. Sparse convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.

[LWK17] Christos Louizos, Max Welling, and Diederik P Kingma. Learning sparse neural networks through l0 regularization. arXiv preprint arXiv:1712.01312, 2017.

[LYY+15] Chenchen Liu, Bonan Yan, Chaofei Yang, Linghao Song, Zheng Li, Beiye Liu, Yiran Chen, Hai Li, Qing Wu, and Hao Jiang. A spiking neuromorphic design with resistive crossbar. In 2015 52nd ACM/EDAC/IEEE Design Automation Conference (DAC), pages 1– 6. IEEE, 2015.

[MAV17] Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 2017.

[MBV19] Bert Moons, Daniel Bankman, and Marian Verhelst. Envision: Energy- scalable sparse convolutional neural network processing. In Embedded Deep Learning, pages 115–151. Springer, 2019.

[MF15] Alireza Makhzani and Brendan J Frey. Winner-take-all autoencoders. In Advances in Neural Information Processing Systems, 2015.

[MHP+17] Huizi Mao, Song Han, Jeff Pool, Wenshuo Li, Xingyu Liu, Yu Wang, and William J Dally. Exploring the granularity of sparsity in convolu- tional neural networks. IEEE CVPRW, 2017.

[MTK+16] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. arXiv preprint arXiv:1611.06440, 2016.

[MTK+17] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. In Proceedings of the International Conference on Learning Representations, 2017.

[NMAV17] Kirill Neklyudov, Dmitry Molchanov, Arsenii Ashukha, and Dmitry P Vetrov. Structured bayesian pruning via log-normal multiplicative noise. In Advances in Neural Information Processing Systems, 2017.

[PLW+16] Jongsoo Park, Sheng Li, Wei Wen, Ping Tak Peter Tang, Hai Li, Yiran Chen, and Pradeep Dubey. Faster cnns with direct sparse and guided pruning. arXiv preprint arXiv:1608.01409, 2016.

[PPG+13] Eustace Painkras, Luis A Plana, Jim Garside, Steve Temple, Francesco Galluppi, Cameron Patterson, David R Lester, Andrew D Brown, and Steve B Furber. Spinnaker: A 1-w 18-core system-on-chip for massively-parallel neural network simulation. IEEE Journal of Solid-State Circuits, 48(8):1943–1953, 2013.

[RDGF16] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2016.

[RDS+15] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition chal- lenge. International journal of computer vision, 115(3):211–252, 2015.

[ROC+18] Minsoo Rhu, Mike O’Connor, Niladrish Chatterjee, Jeff Pool, Youngeun Kwon, and Stephen W Keckler. Compressing dma engine: Leveraging activation sparsity for training deep neural networks. In Proceedings of the IEEE International Symposium on High Perfor- mance Computer Architecture, pages 78–91, 2018.

[RORF16a] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In European conference on computer vision, pages 525–542. Springer, 2016.

[RORF16b] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In Proceedings of the European Conference on Com- puter Vision, pages 525–542, 2016.

[RWA+16] Brandon Reagen, Paul Whatmough, Robert Adolf, Saketh Rama, Hyunkwang Lee, Sae Kyu Lee, José Miguel Hernández-Lobato, Gu-Yeon Wei, and David Brooks. Minerva: Enabling low-power, highly-accurate deep neural network accelerators. In ACM SIGARCH Computer Architecture News. IEEE Press, 2016.

[RZ18] Bharath Ramsundar and Reza Bosagh Zadeh. TensorFlow for deep learning: from linear regression to reinforcement learning. O'Reilly Media, Inc., 2018.

[SAM+19] Farhana Sharmin Snigdha, Ibrahim Ahmed, Susmita Dey Manasi, Meghna G Mankalale, Jiang Hu, and Sachin S Sapatnekar. Sefact: selective feature activation and early classification for cnns. In Proceed- ings of the 24th Asia and South Pacific Design Automation Conference, pages 487–492, 2019.

[SCP+19] Jinook Song, Yunkyo Cho, Jun-Seok Park, Jun-Woo Jang, Sehwan Lee, Joon-Ho Song, Jae-Gon Lee, and Inyup Kang. 7.1 an 11.5 tops/w 1024-mac butterfly structure dual-core sparsity-aware neural processing unit in 8nm flagship mobile soc. In 2019 IEEE International Solid-State Circuits Conference (ISSCC), pages 130–132. IEEE, 2019.

[SEJ+20] Andrew W Senior, Richard Evans, John Jumper, James Kirkpatrick, Laurent Sifre, Tim Green, Chongli Qin, Augustin Žídek, Alexander WR Nelson, Alex Bridgland, et al. Improved protein structure prediction using potentials from deep learning. Nature, 577(7792):706–710, 2020.

[SHK+14] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 2014.

[SHM+16] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Lau- rent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484–489, 2016.

[SHZ+18] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4510–4520, 2018.

[SLLY17] Dongjoo Shin, Jinmook Lee, Jinsu Lee, and Hoi-Jun Yoo. 14.2 dnpu: An 8.1 tops/w reconfigurable cnn-rnn processor for general-purpose deep neural networks. In 2017 IEEE International Solid-State Circuits Conference (ISSCC), pages 240–241. IEEE, 2017.

[SLW+17] Chang Song, Beiye Liu, Wei Wen, Hai Li, and Yiran Chen. A quantization-aware regularized learning method in multilevel memristor-based neuromorphic computing system. In 2017 IEEE 6th Non-Volatile Memory Systems and Applications Symposium (NVMSA), pages 1–6. IEEE, 2017.

[SNM+16] Ali Shafiee, Anirban Nag, Naveen Muralimanohar, Rajeev Balasubramonian, John Paul Strachan, Miao Hu, R Stanley Williams, and Vivek Srikumar. Isaac: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars. ACM SIGARCH Computer Architecture News, 44(3):14–26, 2016.

[SP15] Tara N Sainath and Carolina Parada. Convolutional neural networks for small-footprint keyword spotting. In Conference of the International Speech Communication Association, 2015.

[SPK+16] Jaehyeong Sim, Jun-Seok Park, Minhye Kim, Dongmyung Bae, Yeongjae Choi, and Lee-Sup Kim. 14.6 a 1.42 tops/w deep convolutional neural network recognition processor for intelligent ioe systems. In 2016 IEEE International Solid-State Circuits Conference (ISSCC), pages 264–265. IEEE, 2016.

[SPS+18] Hardik Sharma, Jongse Park, Naveen Suda, Liangzhen Lai, Benson Chau, Vikas Chandra, and Hadi Esmaeilzadeh. Bit fusion: Bit-level dynamically composable architecture for accelerating deep neural network. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), pages 764–775. IEEE, 2018.

[SQLC17] Linghao Song, Xuehai Qian, Hai Li, and Yiran Chen. Pipelayer: A pipelined reram-based accelerator for deep learning. In 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 541–552. IEEE, 2017.

[SS17] Ryan Spring and Anshumali Shrivastava. Scalable and sustainable deep learning via randomized hashing. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining, pages 445–454, 2017.

[SWX+20] Jianghao Shen, Yue Wang, Pengfei Xu, Yonggan Fu, Zhangyang Wang, and Yingyan Lin. Fractional skipping: Towards finer-grained dynamic cnn inference. In AAAI, pages 5700–5708, 2020.

[SZ14] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[UAH+18] Kodai Ueyoshi, Kota Ando, Kazutoshi Hirose, Shinya Takamaeda-Yamazaki, Junichiro Kadomoto, Tomoki Miyata, Mototsugu Hamada, Tadahiro Kuroda, and Masato Motomura. Quest: A 7.49 tops multi-purpose log-quantized dnn inference engine stacked on 96mb 3d sram using inductive-coupling technology in 40nm cmos. In 2018 IEEE International Solid-State Circuits Conference (ISSCC), pages 216–218. IEEE, 2018.

[URS18] Yaman Umuroglu, Lahiru Rasnayake, and Magnus Själander. Bismo: A scalable bit-serial matrix multiplication overlay for reconfigurable computing. In 2018 28th International Conference on Field Programmable Logic and Applications (FPL), pages 307–3077. IEEE, 2018.

[VTBE15] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3156–3164, 2015.

[WHR+17] Wei Wen, Yuxiong He, Samyam Rajbhandari, Minjia Zhang, Wenhan Wang, Fang Liu, Bin Hu, Yiran Chen, and Hai Li. Learning intrinsic sparse structures within long short-term memory. arXiv preprint arXiv:1709.05027, 2017.

[WLL+19] Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, and Song Han. Haq: Hardware-aware automated quantization with mixed precision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8612–8620, 2019.

[WWSL17] Yandan Wang, Wei Wen, Linghao Song, and Hai Helen Li. Classification accuracy improvement for neuromorphic computing systems with one-level precision synapses. In 2017 22nd Asia and South Pacific Design Automation Conference (ASP-DAC), pages 776–781. IEEE, 2017.

[WWW+16] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, 2016.

[WWZ+18] Bichen Wu, Yanghan Wang, Peizhao Zhang, Yuandong Tian, Peter Vajda, and Kurt Keutzer. Mixed precision quantization of convnets via differentiable neural architecture search. arXiv preprint arXiv:1812.00090, 2018.

[XHL+20] C. Xue, T. Huang, J. Liu, T. Chang, H. Kao, J. Wang, T. Liu, S. Wei, S. Huang, W. Wei, Y. Chen, T. Hsu, Y. Chen, Y. Lo, T. Wen, C. Lo, R. Liu, C. Hsieh, K. Tang, and M. Chang. 15.4 a 22nm 2mb reram compute-in-memory macro with 121-28tops/w for multibit mac computing for tiny ai edge devices. In 2020 IEEE International Solid-State Circuits Conference (ISSCC), pages 244–246, 2020.

[YLLW18] Jianbo Ye, Xin Lu, Zhe Lin, and James Z Wang. Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers. arXiv preprint arXiv:1802.00124, 2018.

[YLP+17] Jiecao Yu, Andrew Lukefahr, David Palframan, Ganesh Dasika, Reetuparna Das, and Scott Mahlke. Scalpel: Customizing dnn pruning to the underlying hardware parallelism. In ACM SIGARCH Computer Architecture News. ACM, 2017.

[YOT+18] Shouyi Yin, Peng Ouyang, Shibin Tang, Fengbin Tu, Xiudong Li, Shixuan Zheng, Tianyi Lu, Jiangyuan Gu, Leibo Liu, and Shaojun Wei. A high energy efficient reconfigurable hybrid neural network processor for deep learning applications. IEEE Journal of Solid-State Circuits, 53(4):968–982, 2018.

[YWW11] Shimeng Yu, Yi Wu, and H-S Philip Wong. Investigating the switching dynamics and multilevel capability of bipolar metal oxide resistive switching memory. Applied Physics Letters, 98(10):103514, 2011.

[YWWL19] Q. Yang, W. Wen, Z. Wang, and H. Li. Joint regularization on activations and weights for efficient neural network pruning. In 2019 18th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 790–797, 2019.

[Zei12] Matthew D Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

[ZF14] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision. Springer, 2014.

[ZK16] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

[ZLS+15] Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. Optimizing fpga-based accelerator design for deep convolutional neural networks. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2015.

[ZYZ+18] Tianyun Zhang, Shaokai Ye, Kaiqi Zhang, Jian Tang, Wujie Wen, Makan Fardad, and Yanzhi Wang. A systematic dnn weight pruning framework using alternating direction method of multipliers. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.

[ZZHS16] Xiangyu Zhang, Jianhua Zou, Kaiming He, and Jian Sun. Accelerating very deep convolutional networks for classification and detection. IEEE transactions on pattern analysis and machine intelligence, 38(10):1943–1955, 2016.

Biography

Qing Yang received his B.S. degree in Electronic Science and Technology from the University of Electronic Science and Technology of China (UESTC), Chengdu, China, in 2012, and dual M.S. degrees in Electronic Science and Technology from Tsinghua University, Beijing, China, and Katholieke Universiteit Leuven, Leuven, Belgium, in 2015. His M.S. research focused on communication and wireless power management for implantable medical devices.

He is now pursuing the Ph.D. degree in the Department of Electrical and Computer Engineering at Duke University, Durham, North Carolina, USA. His research interests include deep learning, DNN compression, and algorithm-hardware co-optimization for DNN efficiency improvement. Qing has more than five years of experience in mixed-signal/digital chip design and tape-out in 65nm ∼ 1µm process technologies. To date, he has published three first-author papers, four second-author papers, and one first-author book chapter.
