To appear in 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA).

Enabling Highly Efficient Capsule Networks Processing Through A PIM-Based Architecture Design

Xingyao Zhang, University of Houston, Houston, USA, [email protected]
Shuaiwen Leon Song, University of Sydney, Sydney, Australia, [email protected]
Chenhao Xie, Pacific Northwest National Laboratory, Richland, USA, [email protected]
Jing Wang, Capital Normal University, Beijing, China, [email protected]
Weigong Zhang, Capital Normal University, Beijing, China, [email protected]
Xin Fu, University of Houston, Houston, USA, [email protected]

arXiv:1911.03451v1 [cs.DC] 7 Nov 2019

ABSTRACT

In recent years, CNNs have achieved great success in image processing tasks, e.g., image recognition and object detection. Unfortunately, traditional CNN classification is easily misled by increasingly complex image features due to the usage of pooling operations, and hence is unable to preserve accurate position and pose information of the objects. To address this challenge, a novel neural network structure called Capsule Network (CapsNet) has been proposed, which introduces equivariance through capsules to significantly enhance the learning ability for image segmentation and object detection. Due to its requirement of performing a high volume of matrix operations, CapsNets have generally been accelerated on modern GPU platforms that provide highly optimized software libraries for common deep learning tasks. However, based on our performance characterization on modern GPUs, CapsNets exhibit low efficiency due to the special program and execution features of their routing procedure, including massive unshareable intermediate variables and intensive synchronizations, which are very difficult to optimize at the software level. To address these challenges, we propose a hybrid computing architecture design named PIM-CapsNet. It preserves the GPU's on-chip computing capability for accelerating the CNN types of layers in CapsNet, while pipelining with an off-chip in-memory acceleration solution that effectively tackles the routing procedure's inefficiency by leveraging the processing-in-memory capability of today's 3D stacked memory. Using the routing procedure's inherent parallelization feature, our design enables hierarchical improvements on CapsNet inference efficiency through minimizing data movement and maximizing parallel processing in memory. Evaluation results demonstrate that our proposed design can achieve substantial improvements on both performance and energy savings for CapsNet inference, with almost zero accuracy loss. The results also suggest good performance scalability in optimizing the routing procedure with increasing network size.

1. INTRODUCTION

Recently, machine learning has bloomed into rapid growth and been widely applied in many areas, including medical [1, 2], security [3], social media [4], engineering [5], etc. Such explosive development is credited to the success of neural network algorithms, especially convolutional neural networks (CNNs), which are extremely suitable for image processing tasks, e.g., image recognition and object detection [6, 7, 8, 9, 10]. However, recent studies have found that CNNs can easily misdetect important image features when image scenarios are more complex, which significantly affects the classification accuracy [11, 12, 13]. As shown in Fig.1, when attempting to identify lung cancer cells, traditional CNNs are bounded to the confined region, thus missing critical regions around the cell edges and leading to the wrong identification. This is because the pooling operations used in common CNNs apply happenstance translational invariance [11, 14], which limits the learning of rotation and proportional change, resulting in obtaining only partial features. With the increasing adoption into humans' daily life of emerging applications (e.g., medical image processing and autonomous driving) that have strict requirements on object detection accuracy, wrong identification could be fatal in some cases.

[Figure 1: The comparison of neural networks in identifying lung cancer cells [17], where CapsNet outperforms the traditional CNN on detection accuracy. The heat maps indicate the detected features.]

To address these challenges, a novel neural network called Capsule Network (CapsNet) has been proposed recently [14]. Fig.1 demonstrates the evolution from classic CNN identification to CapsNet identification. CapsNet abandons the usage of pooling operations in CNNs and introduces the concept of the capsule, which is any function that tries to predict the presence and the instantiation parameters of a particular object at a given location. The figure illustrates a group of capsules, each with a double-featured activation vector (i.e., probability of presence and pose, shown as the green and red arrows in Fig.1). Because of this added equivariance, CapsNet can accurately detect the cancer cells via precise classification according to the cell edges and body texture. According to recent studies, CapsNets are increasingly involved in human-safety related tasks, and on average outperform CNNs by 19.6% and 42.06% on detection accuracy for medical image processing [15, 16, 17, 18, 19, 20] and autonomous driving [21, 22, 23].

Because CapsNet execution exhibits a high percentage of matrix operations, state-of-the-art GPUs have become the primary platforms for accelerating CapsNets by leveraging their massive on-chip parallelism and deeply optimized software libraries [24, 25]. However, the processing efficiency of CapsNets on GPUs often cannot achieve the desired level for fast real-time inference. To investigate the root causes of this inefficiency, we conduct a comprehensive performance characterization of CapsNets' execution behaviors on modern GPUs, and observe that the computation between two consecutive capsule layers, called the routing procedure (Sec.2.2), presents the major bottleneck. Through runtime profiling, we further identify that the inefficient execution of the routing procedure originates from (i) tremendous data accesses to off-chip memory due to the massive unshareable intermediate variables, and (ii) intensive synchronizations to avoid potential write-after-read and write-after-write hazards on the limited on-chip storage. These challenges are induced by the unique features of the routing procedure execution, and cannot be addressed well via common NN optimization techniques [26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37] or software-level on-chip memory management (e.g., register manipulation or shared memory multiplexing).

To tackle CapsNets' significant off-chip memory accesses and intensive synchronization (induced by numerous aggregation operations in the routing procedure), we propose a processing-in-memory based hybrid computing architecture named PIM-CapsNet. At the highest level, PIM-CapsNet continues to utilize the GPU's native on-chip units as the host for fast processing of the CNN-type layers included in CapsNet, such as the convolutional and fully-connected layers. Meanwhile, by leveraging batch execution, PIM-CapsNet pipelines the host GPU execution with an off-chip in-memory solution that can effectively accelerate CapsNet's routing procedure.

To enable the in-memory acceleration capability for the routing procedure (RP), we select one of the emerging 3D stacked technologies, i.e., the hybrid memory cube (HMC) [38], for RP's design and integration. Because of HMC's high internal and external memory bandwidth, and its unique logic layer for easy integration of computation logic, it has become a promising PIM platform [39, 40, 41, 42, 43]. By leveraging HMC's unique architectural features, our PIM-CapsNet design migrates the data-intensive RP into the HMC to tackle its major challenges above. There are two key objectives for this in-memory design for RP: under the hardware design constraints, creating a hierarchical optimization strategy to enable (i) inter-vault workload balance and communication minimization (Sec.5.1 and 5.3) and (ii) intra-vault maximum parallel processing (Sec.5.2 and 5.3). Additionally, the in-memory optimizations should be generally applicable to different RP algorithms.

For objective (i), we further investigate RP's algorithm and identify an interesting feature: it is highly parallelizable in multiple dimensions. This gives RP's workloads great potential to be concurrently executed across vaults without incurring significant communication overheads. We then create a modeling strategy that considers both per-vault workloads and inter-vault communication to guide dimension selection for parallelization, in order to achieve the optimal performance and power improvement. This also significantly reduces the original synchronization overheads by converting them to aggregation within a vault. For objective (ii), we integrate multiple processing elements (PEs) into a vault's logic layer and explore a customized PE design to concurrently perform RP's specific operations across memory banks. Meanwhile, we propose a new address mapping scheme that effectively transfers many inter-vault level data requests to the intra-vault level and addresses the bank conflict issues of concurrent data requests. Furthermore, to reduce logic design complexity and guarantee performance, we use simple low-cost logic to approximate complex special functions with negligible accuracy loss. To summarize, this study makes the following contributions:

- We conduct a comprehensive characterization study of CapsNet inference on modern GPUs and identify the root causes of its execution inefficiency.

- Based on the interesting insights from the characterization and further algorithm analysis, we propose a processing-in-memory based hybrid computing architecture named PIM-CapsNet, which leverages both the GPU's on-chip computing capability and the off-chip in-memory acceleration features of 3D stacked memory to improve the overall CapsNet inference performance.

- To drastically reduce the identified performance bottlenecks, we propose several memory-level optimizations to enable minimal in-memory communication, maximum parallelization, and low design complexity and overhead.

- The experimental results demonstrate that for overall CapsNet inference, our proposed PIM-CapsNet design outperforms the baseline GPU by 2.44x (up to 2.76x) on performance and 64.91% (up to 85.16%) on energy saving. It also achieves good performance scalability with increasing network size.

2. BACKGROUND: CAPSULE NETWORK

As shown in Fig.2, CapsNet inherits the convolutional (Conv) and fully connected (FC) layers from standard CNNs, but introduces new layers (i.e., Caps layers) to realize the concept of the "capsule" for better information representation. A capsule is a group of neurons (Fig.1) whose activity vector represents the instantiation parameters of a specific type of entity (e.g., the location and pose of an object). It introduces equivariance, which makes standard CNNs understand rotation and proportional change (e.g., in Fig.1). CapsNet significantly lifts the limitation of the happenstance translational invariance of the pooling operations applied in traditional CNNs, and is thus considered superior in image segmentation and object detection [11, 14].

[Figure 2: The computation diagram of CapsNet for MNIST [14]. Encoder: a [28x28] input, a convolutional layer ([20x20], 256 channels), a PrimaryCaps layer ([6x6] 8D capsules, 32 channels), and a DigitCaps layer ([10x1] 16D capsules) connected via the dynamic routing procedure, producing the classification results. Decoder: fully-connected layers (512, 1024, 784) for reconstruction.]

2.1 CapsNet Structure

Fig.2 takes CapsNet-MNIST [14] as an example to illustrate a basic CapsNet structure. It contains two computation stages: encoding and decoding. The encoding stage is composed of Conv layers and Caps layers. The convolutional operations are first performed on the data mapping from the neurons of the Conv layer to the capsules of the first Caps layer, which is defined as the PrimeCaps layer. It is often followed by at least one other Caps layer. The data mapping between the capsules of adjacent Caps layers is performed via the routing procedure (RP). Finally, the last Caps layer produces the classification information towards categorization, with each capsule representing one category. Following the encoding stage, the decoding function includes multiple FC layers which attach to the last Caps layer for image reconstruction (i.e., improving model accuracy in training or plotting abstract features in inference) [14].
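For concreteness, the encoding/decoding structure of Fig.2 can be skeletonized as below. This is an illustrative PyTorch sketch of the CapsNet-MNIST layer sizes only, not the authors' code: `routing_fn` is a placeholder for the routing procedure of Sec.2.2 (the per-pair weight matrices W_ij live inside it here), and the usual masking of non-predicted capsules before reconstruction is omitted.

```python
import torch
import torch.nn as nn

class CapsNetMNIST(nn.Module):
    """Encoder/decoder skeleton following Fig.2's layer sizes (illustrative)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 256, kernel_size=9)                     # 28x28 -> 20x20, 256 ch
        self.primary = nn.Conv2d(256, 32 * 8, kernel_size=9, stride=2)   # -> 6x6, 32 groups of 8
        self.decoder = nn.Sequential(                                    # FC reconstruction: 512-1024-784
            nn.Linear(10 * 16, 512), nn.ReLU(),
            nn.Linear(512, 1024), nn.ReLU(),
            nn.Linear(1024, 784), nn.Sigmoid())

    def forward(self, x, routing_fn):
        x = torch.relu(self.conv(x))
        u = self.primary(x).view(x.size(0), 1152, 8)  # 6*6*32 = 1152 L capsules, 8D each
        v = routing_fn(u)                             # RP: 1152 L caps -> 10 H caps (16D)
        probs = v.norm(dim=-1)                        # capsule length ~ class probability
        recon = self.decoder(v.flatten(1))            # reconstruction from H capsules
        return probs, recon
```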

2.2 Essential Mechanism For Avoiding Feature Loss: Routing Procedure (RP)

The routing procedure (RP) is introduced to route the information from low-level capsules (or L capsules) to high-level capsules (or H capsules) without feature loss. Several routing algorithms have been used in the routing procedure, such as Dynamic Routing [14] and Expectation-Maximization routing [44]. In this work, we use the popular Dynamic Routing [14] as an example to explain the RP execution.

Algorithm 1 Dynamic Routing Procedure
Input: L capsules u, weight matrix W
Output: H capsules v
1: for all L capsule i & all H capsule j from all input set k:
       \hat{u}^k_{j|i} <- u^k_i x W_{ij}                  (Eq.1)
2: for all L capsule i & all H capsule j:
       b_{ij} <- 0                                        (initialize routing coefficients)
3: for routing iterations do
4:     for all L capsule i:
           c_{ij} <- softmax(b_{ij})                      (Eq.5)
5:     for all H capsule j from all input set k:
           s^k_j <- \sum_i \hat{u}^k_{j|i} x c_{ij}       (Eq.2)
6:     for all H capsule j from all input set k:
           v^k_j <- squash(s^k_j)                         (Eq.3)
7:     for all L capsule i & H capsule j:
           b_{ij} = \sum_k v^k_j \hat{u}^k_{j|i} + b_{ij} (Eq.4)
8: end for
9: return v

[Figure 3: Dynamic Routing Procedure (RP) in CapsNet: describing the computation flow and multiple ways for parallel processing.]

Fig.3 and Algorithm 1 demonstrate the computation flow and the possible dimensions for parallelization. Given the kth batched input set, in order to generate the jth H capsule, the ith L capsule in this input set (u^k_i) first multiplies with the corresponding weight (W_{ij}) to generate the prediction vector (\hat{u}^k_{j|i}) (Fig.3, step 1):

    \hat{u}^k_{j|i} = u^k_i \times W_{ij}    (1)

Then, these prediction vectors multiply with their corresponding routing coefficients (c_{ij}), with the results aggregated across all L capsules (Fig.3, step 2):

    s^k_j = \sum_i \hat{u}^k_{j|i} \times c_{ij}    (2)

The non-linear "squashing" function is then applied to the aggregated result s^k_j to produce the jth H capsule v^k_j (Fig.3, step 3):

    v^k_j = \frac{\|s^k_j\|^2}{1 + \|s^k_j\|^2} \cdot \frac{s^k_j}{\|s^k_j\|}    (3)

Note that v^k_j cannot be considered the final value of the jth H capsule unless the features of the L capsules have been correctly inherited. The information difference between an L and an H capsule can be quantified by the agreement measurement via the scalar product of the prediction vector \hat{u}^k_{j|i} and the H capsule v^k_j (Fig.3, step 4), where a "0" output means the information is precisely inherited. In the case of a large divergence, the agreements are accumulated into an intermediate variable b_{ij} (Fig.3, step 5), which is used to update the routing coefficients via the "softmax" function (Fig.3, step 6):

    b_{ij} = \sum_k v^k_j \hat{u}^k_{j|i} + b_{ij}    (4)

    c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})}    (5)

The updated routing coefficients are then integrated into Eq.(2) to start another iteration of the routing procedure (Fig.3, step 2).

The number of iterations is determined by the convergence of the routing coefficients and is set by programmers. Several recent studies indicate that the number of iterations increases for tasks with large datasets and many categories [45, 46]. Once all the iterations complete, the features of the L capsules have been routed to the H capsules and are ready to proceed to the following layers.

Summary. Generally, the routing algorithms (e.g., Dynamic Routing, Expectation-Maximization Routing) share a similar execution pattern and exhibit several core features in the CapsNet routing procedure: (1) The execution of the RP exhibits strong data dependency and needs to be sequentially processed. (2) The procedure adopts all-to-all computation, which routes all the L capsules to the H capsules and forms aggregations in all possible dimensions. (3) The procedure produces a large amount of intermediate variables. (4) The routing procedure is iteratively processed to generate the dynamic coefficients that pass the feature information. We will discuss these features in relation to the performance characterization of CapsNet in Sec.3; our optimizations on Dynamic Routing in the following sections can easily be applied to other routing algorithms with simple adjustments.
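For reference, the following NumPy sketch implements Algorithm 1 end to end. It is an illustrative sketch only: the tensor shapes are our own choice and, following Eq.(4), the agreement is accumulated across the whole batch (the k dimension) so that b and c are shared among the k input sets.

```python
import numpy as np

def squash(s, eps=1e-9):
    # Eq.(3): scale vector length into [0, 1) while preserving direction.
    sq = np.sum(s * s, axis=-1, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def dynamic_routing(u, W, iters=3):
    """u: (NB, NL, CL) L capsules; W: (NL, NH, CL, CH) weights; returns (NB, NH, CH)."""
    u_hat = np.einsum('kic,ijch->kijh', u, W)     # Eq.(1): prediction vectors
    b = np.zeros(W.shape[:2])                     # (NL, NH) routing logits, Alg.1 line 2
    for _ in range(iters):                        # Alg.1 line 3
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # Eq.(5): softmax over H caps
        s = np.einsum('kijh,ij->kjh', u_hat, c)               # Eq.(2): aggregate over L caps
        v = squash(s)                                         # Eq.(3)
        b = b + np.einsum('kjh,kijh->ij', v, u_hat)           # Eq.(4): agreement over batch k
    return v
```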

Table 1: CapsNet Benchmark Configurations

Network  | Dataset              | BS  | L Caps | H Caps | Iter
Caps-MN1 | MNIST [47]           | 100 | 1152   | 10     | 3
Caps-MN2 | MNIST [47]           | 200 | 1152   | 10     | 3
Caps-MN3 | MNIST [47]           | 300 | 1152   | 10     | 3
Caps-CF1 | CIFAR10 [48]         | 100 | 2304   | 11     | 3
Caps-CF2 | CIFAR10 [48]         | 100 | 3456   | 11     | 3
Caps-CF3 | CIFAR10 [48]         | 100 | 4608   | 11     | 3
Caps-EN1 | EMNIST Letter [49]   | 100 | 1152   | 26     | 3
Caps-EN2 | EMNIST Balanced [49] | 100 | 1152   | 47     | 3
Caps-EN3 | EMNIST By Class [49] | 100 | 1152   | 62     | 3
Caps-SV1 | SVHN [50]            | 100 | 576    | 10     | 3
Caps-SV2 | SVHN [50]            | 100 | 576    | 10     | 6
Caps-SV3 | SVHN [50]            | 100 | 576    | 10     | 9

[Figure 4: The overall execution time breakdown of CapsNets on GPU across different layers (Conv layer, L Caps layer, H Caps layer, FC layer). The red line represents the actual inference time.]

[Figure 5: The breakdown of pipeline stall cycles (memory access, synchronization, lack of resource, instruction fetch, other) during RP execution on a Tesla P100 Pascal GPU.]

3. CHARACTERIZATION AND ANALYSIS

While CapsNet is gaining popularity in both academia and industry, the performance characterization of CapsNet on modern high-performance platforms has been largely neglected. Given that the GPU has become the major platform for executing CapsNet due to its high computation capability and deep optimization for matrix operations (of which CapsNet has a large amount), we adopt several state-of-the-art NVIDIA GPU platforms to conduct a comprehensive characterization of the execution behaviors of the various CapsNets listed in Table 1. We use 4 different datasets and 12 corresponding CapsNets with CapsNet-MNIST-like structures (Sec.2.1), with different configurations of batch size (BS in Table 1), L capsules, H capsules and routing iteration number. These CapsNets' inference is processed via the PyTorch framework [51] with the latest deep learning library (i.e., cuDNN [52]), which already enables the state-of-the-art CNN optimizations [53].

3.1 Overall Evaluation for CapsNet Inference

Fig.4 demonstrates the performance breakdown of each layer in the overall CapsNet execution. It shows that, across different CapsNet configurations, the routing procedure (RP) accounts for an average of 74.62% of the entire inference time, becoming the major performance bottleneck. We further conduct detailed analysis of the performance results of CapsNets on GPU and make the following observations:

Observation 1: ineffectiveness of batched execution. One common strategy to accelerate CNNs is to conduct batched execution for improving hardware utilization, especially when the input dataset is large. However, it cannot improve the RP's performance during inference. As Fig.4 illustrates, with increasing batch size (i.e., Caps-MN1 -> Caps-MN3), the overall CapsNet inference time increases; meanwhile, the RP proportion also expands with batch size.

Observation 2: sensitivity to network scaling. Table 1 shows the network size (formed by a combination of L capsules, H capsules and routing iterations) for each individual case, e.g., Caps-SV1 being the smallest. The red curve in Fig.4 also demonstrates that the overall inference time and RP's percentage generally increase when scaling up the network size (e.g., comparing Caps-MN1, Caps-CF1, Caps-EN1 and Caps-SV1). This implies that RP's execution time is sensitive to network size as well.

To summarize, even with a highly optimized deep learning library, the RP execution time on GPU cannot be effectively reduced through general CNN optimization techniques such as batched execution. Moreover, it exhibits a certain level of sensitivity to network scaling. Both of these factors make the RP execution a dominating performance bottleneck for CapsNet inference, especially with the growing size and complexity of future CapsNet structures [45, 46].

3.2 Root Causes for Inefficient RP Execution

To understand the root causes of RP's inefficiency on GPU, we use NVprofiler [54] to collect runtime GPU stats for comprehensive analysis. We observe two root causes for poor RP execution efficiency on GPU-like architectures:

(1) Significant Off-Chip Memory Accesses: We profile the utilization of several major GPU function units during RP execution on an NVIDIA Tesla P100 GPU. We observe that the arithmetic logic unit (ALU) is lightly utilized (i.e., on average only 38.6% across the investigated benchmarks), while the load/store unit (LDST) is heavily stressed, with an average utilization of 85.9%. This implies that CapsNets' RP phase fails to effectively leverage the GPU's strong computation capability and is severely limited by the intensive off-chip memory accesses. We further investigate the factors that may contribute to GPU pipeline stalls during RP, including off-chip memory access, barrier synchronization, lack of resource, etc. Fig.5 profiles the individual contribution of each major factor to the overall pipeline stall cycles. As can be seen, the majority of the pipeline stalls are induced by memory accesses (i.e., on average 44.64%). This further confirms that RP performance is significantly limited by the off-chip memory accesses. This is caused by a combination of massive intermediate variables from RP execution and limited on-chip storage. Fig.6(a) illustrates the ratio of the size of RP's intermediate variables to the on-chip memory sizes of different generations of NVIDIA GPUs. As can be seen, the size of RP's intermediate variables far exceeds the GPU on-chip storage. Given the iterative computation pattern of RP, these variables (e.g., \hat{u}, s_j, v_j, b_{ij}, c_{ij}) need to be periodically loaded into the GPU cores from the off-chip memory due to the limited on-chip storage. Moreover, the intermediate variables are not sharable among different input batches, which also explains the ineffectiveness of batched execution observed in Sec.3.1. Due to the large data volume and the lack of temporal value similarity among these variables, software-level schemes such as register manipulation and shared memory multiplexing are also not very effective.
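The scale of these intermediate variables can be estimated directly from the network configuration. The sketch below is an illustrative back-of-envelope estimate only (assuming FP32 variables and the Caps-MN1 configuration from Table 1, with capsule dimensions C_L = 8 and C_H = 16 taken from the CapsNet-MNIST structure in Fig.2); the paper's exact ratios in Fig.6(a) depend on its full accounting of temporaries.

```python
def rp_intermediates_mb(NB=100, NL=1152, NH=10, CL=8, CH=16, bytes_per=4):
    u_hat = NB * NL * NH * CH   # prediction vectors, Eq.(1)
    s = NB * NH * CH            # aggregated inputs, Eq.(2)
    v = NB * NH * CH            # H capsules, Eq.(3)
    b = NL * NH                 # agreement accumulators, Eq.(4)
    c = NL * NH                 # routing coefficients, Eq.(5)
    return (u_hat + s + v + b + c) * bytes_per / 2**20

# Caps-MN1: ~70 MB of unshareable intermediates per batch, versus e.g.
# 5.31 MB of on-chip storage on a Tesla P100.
print(f"{rp_intermediates_mb():.1f} MB")
```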

[Figure 6: (a) Ratio of intermediate variables' size to on-chip storage of different GPUs; (b) the impact of the on-chip storage sizes of state-of-the-art GPUs on RP's execution. A: 1.73MB (K40m), B: 5.31MB (Tesla P100), C: 9.75MB (RTX2080Ti), D: 16MB (Tesla V100).]

[Figure 7: The impact of memory bandwidth on the overall RP performance. GDDR5: 288GB/s (K40m), GDDR5X: 484GB/s (GTX 1080Ti), GDDR6: 616GB/s (RTX 2080Ti), HBM2: 897GB/s (Tesla V100).]

(2) Intensive Synchronization: Fig.5 also indicates that frequent barrier synchronization is the second major contributor (i.e., on average 34.45%) to the pipeline stalls. These synchronization overheads are induced by the syncthread() calls, which coordinate shared memory accesses for all threads in one thread block. There are two major factors causing the frequent synchronization during RP execution: (i) the RP execution contains numerous aggregation operations (e.g., Eq.(2)), inducing massive data communication between threads through shared memory; (ii) the size of the intermediate variables far exceeds the shared memory size, leading to frequent data loading from global memory. Thus, syncthread() calls occur frequently to avoid potential write-after-write and write-after-read hazards.

To address the issues above, we attempt two naive solutions: scaling up the on-chip and the off-chip memory capacity. Fig.6(b) shows the impact of the on-chip memory sizes of different generations of NVIDIA GPUs on RP execution. We observe that increasing the on-chip memory size can help alleviate the challenges above, but not very effectively, e.g., only up to an average of 14% performance improvement for the 16MB V100. This is because the size of the nonsharable intermediate variables from RP still far exceeds the on-chip storage of current GPUs, as shown in Fig.6(a). Similarly, Fig.7 shows that only increasing the memory bandwidth from 288GB/s GDDR5 to 897GB/s HBM slightly improves the overall RP performance, by an average of 26%. This indicates that higher off-chip memory bandwidth can only solve a small part of the problem but by itself does not reduce the high intensity of the off-chip memory accesses. Therefore, to significantly improve CapsNet's inference performance, we need a customized solution that addresses the root causes of RP's inefficient execution.

4. BASIC IDEA: PROCESSING-IN-MEMORY + PIPELINING

As discussed previously, we aim to address CapsNets' significant off-chip memory accesses caused by massive unshareable intermediate variables, and the intensive synchronization due to numerous aggregation operations in RP. Meanwhile, we also want to utilize the excellent core computing capability provided by modern GPUs for deep learning's matrix operations. Thus, we propose a hybrid computing engine named "PIM-CapsNet", shown in Fig.8. It utilizes the GPU's native on-chip units as the host for fast processing of layers such as Conv and FC, while pipelining with an off-chip in-memory acceleration that effectively tackles RP's inefficient execution. For example, since multiple input sets are generally batched together to be concurrently processed in RP to avoid the local optimal solution of the routing coefficients [55], the host processors can start processing Conv/FC operations from different batches of the input sets while waiting for RP's results from in-memory processing on the current batch, forming an execution pipeline. Furthermore, the in-memory accelerators of our PIM-CapsNet design can hierarchically improve RP's execution efficiency by minimizing data movement and maximizing parallel processing. Note that our proposed design is a general optimization solution that is applicable to the different routing algorithms used in RP.

To build our in-memory acceleration capability for RP, we resort to one of the emerging 3D stacked technologies, i.e., the hybrid memory cube (HMC) [38], which has become a promising PIM platform [39, 40, 41, 42, 43]. The major reason for replacing the current GPU's off-chip memory (e.g., GDDRx or HBMs) with HMC in our design is that they lack a logic layer for in-memory integration. As illustrated in Fig.9, HMC stacks several DRAM dies on the CMOS logic layer as a cube (specification 2.1 [56]). The memory cube connects with the host processor (i.e., GPU cores in our case) via fully-duplex links which can provide up to 320GB/s of external memory bandwidth. Additionally, the logic layer is split into 32 sub-memory controllers, each of which communicates with its local DRAM banks through Through-Silicon Vias (TSVs), which together provide an internal memory bandwidth of 512GB/s [38, 41]. A sub-memory controller and its local DRAM partitions (each partition contains multiple DRAM banks) form the vault architecture, highlighted in the red dashed box. The logic layer receives system commands and routes memory accesses to different vaults. A crossbar is integrated in the logic layer to support the communication between the SerDes links and the vaults. Note that relatively simple computation logic can be integrated onto HMC's logic layer, which can directly access data from the vaults via the memory controllers and benefit from the large internal memory bandwidth. This layer is thus very suitable for integrating in-memory accelerators for RP execution. Next, we will discuss the detailed PIM-CapsNet design.

[Figure 8: The overview of the PIM-CapsNet design.]

[Figure 9: Left: HMC organization. Right: HMC block diagram. The red dashed box marks the vault structure.]

5. PIM-CAPSNET ARCHITECTURE DESIGN

There are two key design objectives in PIM-CapsNet: (i) maintaining workload balance with minimal data communication at the inter-vault level (Sec.5.1 and Sec.5.3), and (ii) maximizing parallel processing at the intra-vault level within the architecture design constraints (Sec.5.2 and Sec.5.3).

5.1 Inter-Vault Level Design

A typical HMC design adopts a crossbar switch to support the inter-vault communication [56], and the overall HMC performance and power/thermal efficiency can be drastically impacted by increasing data movement across vaults. Employing more complicated data flow management (e.g., a network-on-chip) on the HMC logic layer would further exacerbate the thermal issue. In our proposed design, we leverage the unique features of the RP execution and intelligently distribute workloads across vaults with minimal inter-vault communication.

5.1.1 RP's Unique Feature: Highly Parallelizable in Multiple Dimensions

As an interesting feature, the operations of the RP equations (Sec.2.2) are highly parallelizable. For example, in Eq.(1), the vector-matrix multiplications for all the low- to high-level (L-H) capsule pairs are independent. We define such independent operations on L capsules as parallelism in the L-dimension, while independent operations on H capsules are defined as parallelism in the H-dimension. Additionally, if operations corresponding to different batches are independent, they are defined as parallelism in the B-dimension. Thus, we make the following key observations:

Observation I: The operations of each equation in the RP can be partitioned into multiple independent sub-operations on at least one of the three dimensions (L, H, or B), suggesting a highly parallelizable feature.

Table 2 further demonstrates through which dimensions the five equations of the dynamic routing procedure can be parallelized. Based on it, we also make the second key observation:

Observation II: Not all the RP equations can be concurrently executed through the same dimension.

Table 2: Possible Parallelizable Dimensions

     | Batch (B-dimension) | Low-level Caps (L-dimension) | High-level Caps (H-dimension)
Eq.1 | x                   | x                            | x
Eq.2 | x                   |                              | x
Eq.3 | x                   |                              | x
Eq.4 |                     | x                            | x
Eq.5 |                     | x                            |

Observation I indicates that the inter-vault data communication can be reduced by parallelizing the independent sub-operations of each equation on one chosen dimension and then distributing them across the HMC vaults. By doing so, the major communication across vaults is only required by the aggregation operations on that dimension (aggregations required on the other two dimensions are performed locally within vaults), which is relatively low. Observation II, however, emphasizes that the inter-vault communication cannot be fully eliminated for RP, as none of the three dimensions can support the parallelization of all the RP equations. When distributing the entire RP workload on a certain dimension, some related data have to be reloaded to another designated vault for the aggregation operations.

5.1.2 Parallelization and Workload Distribution

Since distributing the workloads on different dimensions leads to different communication overheads and execution patterns, it is important to apply an intelligent workload distributor to achieve the optimal design for power and performance. To minimize the inter-vault communication, our distributor distributes the RP workload on a single chosen dimension only. Fig.10 illustrates an example of the RP execution flow when distributing on the B-dimension. As it shows, the RP operations of Eq.1, 2 and 3 (steps 1, 2, 3) exhibit parallelism along the B-dimension. Additionally, the multiplication operations of Eq.4 (step 4) (i.e., v^k_j \times \hat{u}^k_{j|i}) can also be parallelized along the B-dimension. These parallelizable workloads can be divided into snippets (i.e., the green blocks) and distributed across the HMC vaults. Note that typical CapsNet workloads will generate far more snippets than the number of vaults in the HMC (e.g., up to 32 vaults in HMC Gen3).

[Figure 10: The execution diagram for the RP procedure with B-dimension distribution. The workloads in green blocks can be split across vaults, but the workloads in purple blocks cannot be distributed via the B-dimension.]

Due to the aggregation requirements on all the dimensions during RP execution, there are always some workloads that cannot be processed in parallel on the selected dimension, leading to workload imbalance. For instance, the remaining operations of the RP in Fig.10 (i.e., the purple blocks), including partial Eq.4 (step 5) and Eq.5 (step 6), cannot be divided into snippets due to the lack of parallelism on the B-dimension. Additionally, these workloads usually require data to be aggregated in one place, making it hard to utilize all the HMC computation resources. Furthermore, the data size requested by the aggregation operations can be large, which may cause massive inter-vault communication and even crossbar stalls. To reduce the inter-vault communication and increase hardware utilization, we propose to pre-aggregate the partial aggregation operations inside each vault for the correspondingly allocated snippets, as sketched below. For example, when the RP workloads are distributed on the B-dimension, the pre-aggregation can be performed for Eq.(4) to combine the b_{ij} from the snippets assigned to a specific vault before performing the global inter-vault aggregation.

. 6

. . .

. . . . s0 v0

. × . . × . SIZEx Data. size of a. variable or packet head&tail

. . amount of data movements M can be represented as: . . k s be performeds fori Eq.(4)vi to combine the b from the p ij aMB = I × [(Nvault − 1) × NL × NH × (SIZEb + SIZEpkt) D-c ij t n re 0 k 0 k (8) snippets assigned to a specific vault before performing fe u ~u × û ~û × s v ×

. f 0 i 0j ij j j

. . . i

. D 0 k +(N 0 k − 1) × N × N × (SIZE + SIZE )] . . . 1 vault 2 L 3 H 4 cij pkt 5

. × . . × . u 0~u i × û 0j~û ij × sj vj × b0j~bij global inter-vault aggregation. 0 k 0 k 0 k u 0~u iw0j~w× ij û 0j~û ijc0j~c×ij sj vj û 0j~×û ij b0j~bij softmax 0 k w0j~wij c0j~cij û 0j~û ij Distribution on L-dimension:0 k As Table 2 illustrates, Guiding Distribution via an Execution Score: To w0j~wij c0j~cij û 0j~û ij achieve the optimal power/thermal and performance re- the RP operations of Eq.1,4,5 can be divided into work- load snippets on the L-dimension. Besides, partial oper- sults, we propose a metric called execution score S to k guide workload distribution, which quantitatively esti- ations from Eq.2 (i.e.,u ˆj|i ×cij) also exhibit parallelism mates the RP execution efficiency under a given work- on the L-dimension. Thus, E can be represented as:

load distribution. S considers workload assignment to NL EL = NB × d e × NH × [2I(2CH − 1) + CH (2CL − 1)] (9) each vault, inter-vault communication overheads, device- Nvault dependent factors and inter-vault memory bandwidth. The inter-vault communication contains data all-reducing k S is modeled as S = 1/(αE + βM). for of sj and broadcasting vj : where E represents the largest workloads distributed ML = I × [NB × (Nvalut − 1) × NH × (SIZEsk + SIZEpkt) to a single vault (since even distribution across vaults is j (10) +NB × (Nvault − 1) × NH × (SIZE k + SIZEpkt)] typically not feasible), which can be quantified based on vj the amount of allocated operations. M represents the Distribution on H-dimension: As Table 2 presents, amount of inter-vault data movement. Both E and M only Eq.5 cannot be parallelized on this dimension. Hence, are affected by the distribution strategy and the model E can be represented as
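The offline selection logic itself is then simple, as the sketch below shows (illustrative only: `E_fns`/`M_fns` stand for the per-dimension expressions of Eqs.(6)-(12) given after this sketch; the H-dimension pair is written out as a concrete example, and alpha/beta are the device-dependent coefficients).

```python
import math

def E_H(c):  # Eq.(11): per-vault workload under H-dimension distribution
    return (c['N_B'] * c['N_L'] * math.ceil(c['N_H'] / c['N_vault'])
            * c['C_H'] * (2 * c['C_L'] - 1 + 2 * c['I']))

def M_H(c):  # Eq.(12): inter-vault traffic under H-dimension distribution
    return c['I'] * ((c['N_vault'] - 1) * c['N_L'] * (c['SIZE_b'] + c['SIZE_pkt'])
                     + c['N_L'] * (c['SIZE_c'] + c['SIZE_pkt']))

def choose_dimension(cfg, E_fns, M_fns, alpha, beta):
    """Offline: pick the distribution dimension maximizing S = 1/(alpha*E + beta*M)."""
    scores = {d: 1.0 / (alpha * E_fns[d](cfg) + beta * M_fns[d](cfg))
              for d in ('B', 'L', 'H')}
    return max(scores, key=scores.get)
```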

configuration. Finally, α and β represent the device- NH EH = NB × NL × d e × CH × [2CL − 1 + 2I] (11) dependent coefficients, determined by HMC frequency Nvault and inter-vault memory bandwidth, respectively. The inter-vault communication contains data all-reducing The calculation of S is independent of the online pa- for of bij and broadcasting cij:
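For concreteness, the following minimal Python sketch (not the authors' compiler; the byte sizes size_var and size_pkt, and the coefficients alpha and beta, are hypothetical placeholders) evaluates the simplified workload models of Eqs.(7), (9) and (11) and the movement models of Eqs.(8), (10) and (12), then picks the dimension with the highest score S = 1/(αE + βM):

```python
import math

def pick_dimension(N_B, N_L, N_H, C_L, C_H, I, N_vault=32,
                   size_var=4, size_pkt=2, alpha=1e-9, beta=1e-8):
    """Return the distribution dimension with the highest execution score S."""
    c = lambda n: math.ceil(n / N_vault)
    # B-dimension: workload of Eq.(7); movement of Eq.(8), assuming all
    # routing variables share the same FP32 size (size_var).
    E_B = c(N_B) * N_L * N_H * ((4 * I - 1) * C_H + 2 * C_L * C_H - I)
    M_B = I * 2 * (N_vault - 1) * N_L * N_H * (size_var + size_pkt)
    # L-dimension: Eq.(9) and Eq.(10).
    E_L = N_B * c(N_L) * N_H * (2 * I * (2 * C_H - 1) + C_H * (2 * C_L - 1))
    M_L = I * 2 * N_B * (N_vault - 1) * N_H * (size_var + size_pkt)
    # H-dimension: Eq.(11) and Eq.(12); the (N_vault-1) b_ij reduce plus
    # the c_ij broadcast collapse to N_vault packet rounds per capsule.
    E_H = N_B * N_L * c(N_H) * C_H * (2 * C_L - 1 + 2 * I)
    M_H = I * N_L * N_vault * (size_var + size_pkt)
    S = {d: 1.0 / (alpha * E + beta * M)
         for d, (E, M) in {"B": (E_B, M_B), "L": (E_L, M_L),
                           "H": (E_H, M_H)}.items()}
    return max(S, key=S.get), S

# Example: the MNIST CapsNet of [14] (1152 8-D low-level capsules,
# 10 16-D high-level capsules, 3 DR iterations), batch size 128.
print(pick_dimension(N_B=128, N_L=1152, N_H=10, C_L=8, C_H=16, I=3))
```

Because none of these inputs depend on runtime values, such a score table can be computed once per network configuration and baked into the compiler-generated schedule.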

5.2 Intra-Vault Level Design

In this section, we propose the intra-vault level design that effectively processes the sub-operations of each equation allocated to a vault. We target the design for the IEEE-754 single precision (FP32) format, which provides a sufficient precision range for CapsNet workloads [14]. Our design can also fit other data formats with minor modifications.

5.2.1 Intra-Vault Level Workload Distribution

In a basic HMC design, the logic layer of each vault contains a sub-memory controller to handle the memory accesses. In order to conduct RP-specific computation, we introduce 16 processing elements (PEs) into each vault. This design overhead has been tested to satisfy both the area and thermal constraints of the HMC [43] (see the detailed overhead analysis in Sec.6.5). These PEs (integrated onto the logic layer) are connected to the sub-memory controller in the vault, as shown in Fig.11(left). Note that the number of parallel sub-operations on a certain dimension is generally orders of magnitude higher than the number of vaults in the HMC. In other words, many parallel sub-operations will be allocated to the same vault. Hence, they can easily be distributed on the same dimension and concurrently processed by the PEs without introducing additional communication overheads. There may exist some extreme cases in which the number of parallel sub-operations allocated to a vault is smaller than the number of PEs, leading to low PE utilization. Since most equations in RP can be parallelized on more than one dimension, the workloads can then be distributed along a different dimension that produces enough parallelism to fully utilize the PE resources, as sketched below.
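The following sketch illustrates this fallback (ours, not the hardware scheduler's logic): inside a vault, pick any dimension that exposes at least as many parallel sub-operations as there are PEs, then hand out sub-operations round-robin.

```python
def schedule_vault(parallelism: dict, n_pes: int = 16):
    """parallelism maps each dimension usable by the current equation to the
    number of parallel sub-operations it exposes in this vault (illustrative)."""
    # Prefer a dimension that fills all PEs; otherwise take the widest one.
    dim = next((d for d, n in parallelism.items() if n >= n_pes),
               max(parallelism, key=parallelism.get))
    n = parallelism[dim]
    # Round-robin the n sub-operations over the PEs (PEs idle only if n < n_pes).
    return dim, {pe: list(range(pe, n, n_pes)) for pe in range(n_pes)}

# Example: only 8 sub-operations on H in this vault, but 1152 on L, so use L.
dim, assignment = schedule_vault({"H": 8, "L": 1152})
```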

Figure 11: Intra-vault level Architecture Design.

5.2.2 Customized PE Design

There have been some studies [43, 42, 57] that integrate adders and multipliers onto the HMC logic layer to perform the multiply-accumulate (MAC) operation, which is essential for deep learning (e.g., CNNs). However, CapsNet's RP execution involves other operations beyond MAC. As discussed in Sec.2.2, among the five RP equations, Eq.1, Eq.2 and Eq.4 can be translated into MAC operations. But the remaining operations, in Eq.3 and Eq.5, involve more complex functions such as division (Eq.3), inverse square root (Eq.3) and the exponential function (Eq.5), which require complicated logic design and result in large performance and power/thermal overheads [58, 59].

Operation Approximation: To gain high performance while maintaining low hardware design complexity, we propose approximations that simplify these operations with negligible accuracy loss. For the division and inverse square root functions on FP32, we apply bit shifting [60], which is widely adopted in the graphics processing domain [61, 62, 63]. For the exponential function, we approximate the operation as follows. The original exponential function can be transformed into a power function with base 2 [64]:

$e^x = 2^{\log_2(e) \times x} = 2^y = 2^{\lfloor y \rfloor}(1 + 2^{y - \lfloor y \rfloor} - 1)$   (13)

where $\lfloor y \rfloor$ is the integer part of y, and $y - \lfloor y \rfloor$ is the decimal part.

Fig.12(a) illustrates an example of the representation transfer. First, the ep fields of both A and B are subtracted by the bias b to obtain their real exponents (i.e., ep − b), as shown in Fig.12(a) 1 and 3. Then, the most significant ep−b+1 bits of A's significand (i.e., 1+fraction) are chunked and filled into the least significant ep−b+1 bits of C's exponent field, with the remaining exponent bits in C filled with zeros, as shown in Fig.12(a) 2. We conduct a similar operation to transfer B into C's fraction. Since B is a fraction value, its exponent is a non-positive number: B's significand is logically shifted right by |ep − b| bits, and then its most significant 23 bits are filled into C's fraction field, as shown in Fig.12(a) 4.

The above two transfers can be considered as bit shifting on the significand (i.e., 1+fraction) in FP32 format, with the distance and direction determined by the real exponent (i.e., ep − b). As illustrated in Fig.12, B's exponent can be increased to match the exponent of A, with its significand bits logically shifting right. By doing this, the fraction representation can be described as D in Fig.12(b). Given the matched real exponent value (i.e., ep − b), the two representations A and D in Fig.12(b) now share identical bit shifting operations. Additionally, A and D correspond to the integer and fraction parts of the ExpResult (i.e., the exponential function's result), respectively, and there is no overlap between their fraction bits. Thus, the two representations can be combined (i.e., A OR D), followed by a unified bit shifting operation on the significand. Note that the exponent matching procedure (B → D) could truncate several least significant bits that would originally be mapped into the ExpResult.

Since the exponent matching and the combination of two FP32 numbers can simply be treated as an FP32 addition, we can compute the ExpResult as an addition of the exponent and fraction representations (i.e., $\lfloor y \rfloor + b + 2^{y - \lfloor y \rfloor} - 1$) followed by the bit shifting operations. Because the power-of-2 function causes high complexity in both execution and logic design, we propose to simplify it as follows. The above polynomial can be expanded as $(y + 2^{y - \lfloor y \rfloor} - (y - \lfloor y \rfloor) + b - 1)$; then, the average value Avg of $(2^{y - \lfloor y \rfloor} - (y - \lfloor y \rfloor))$ can be obtained by integrating that term over $(y - \lfloor y \rfloor) \in [0, 1)$, which is fixed and can be computed offline. With y expressed in terms of x, the exponential function can be described as follows:

$ExpResult \simeq BS(\log_2(e) \times x + Avg + b - 1)$   (14)

where BS is the bit shifting operation with the information from ep − b, and $\log_2(e)$ is a constant computed offline.

Accuracy Recovery: In the worst case, several of the lowest significand bits may be truncated when mapping from D to C, which can cause some accuracy loss. To minimize this impact, we analyze 10,000 exponential executions to collect the value differences between the approximated and the original results. During the approximated execution, the accuracy loss is recovered by enlarging the results by the mean percentage of the value difference. Note that our accuracy recovery scheme introduces only one additional multiplication operation during inference, which guarantees high performance and low design complexity compared to other exponential approximation methodologies, e.g., applying look-up tables [59]. Sec.6.5 shows the detailed accuracy analysis.

Final PE Structure: According to the discussion above, the special functions can be simplified into a combination of addition, multiplication, and bit shifting operations. Thus, our intra-vault PE employs adders, multipliers, and bit shifters to construct these special functions, as shown in Fig.11. Specifically, our PE enables flow configuration via multiplexers (MUX) to support different types of operations. For example, PE execution flow 1 2 is for MAC operations; 3 2 1 2 1 is for inverse square root operations; and 1 2 2 3 is for the exponential function.

5.3 Contention Reduction in CapsNet Design

In this section, we discuss strategies to combat the memory-level contention in our PIM-CapsNet design.
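As a concrete illustration of Eqs.(13)-(14), the sketch below (a software emulation of the hardware flow, not the PE netlist; the hardware uses integer adders and shifters, whereas here the unified bit shift is emulated by scaling into the FP32 exponent/fraction fields, and the sampling range used to calibrate the recovery factor is our assumption):

```python
import math, random, struct

LOG2E = math.log2(math.e)
# Avg = integral of (2^f - f) over f in [0, 1) = 1/ln(2) - 1/2 ~ 0.9427
AVG = 1.0 / math.log(2.0) - 0.5
BIAS = 127  # FP32 exponent bias b

def approx_exp(x: float) -> float:
    """Eq.(14): ExpResult ~ BS(log2(e)*x + Avg + b - 1)."""
    y = LOG2E * x
    # Scaling by 2^23 drops the integer part of (y + Avg + b - 1) into the
    # exponent field and its fraction into the mantissa field, i.e., the
    # unified bit shift of the combined representation (valid for x > ~ -87).
    bits = int((y + AVG + BIAS - 1) * (1 << 23))
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# Accuracy recovery: one extra multiply by the mean exact/approx ratio,
# calibrated offline over 10,000 samples (the range [-8, 8] is assumed).
samples = [random.uniform(-8.0, 8.0) for _ in range(10000)]
RECOVER = sum(math.exp(s) / approx_exp(s) for s in samples) / len(samples)

def approx_exp_recovered(x: float) -> float:
    return approx_exp(x) * RECOVER

print(approx_exp(1.0), approx_exp_recovered(1.0), math.exp(1.0))
```

With this construction the raw approximation deviates from the true exponential by a few percent, and the single calibrated multiplication pulls the average error back toward zero, mirroring the recovery scheme above.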

Figure 12: (a) An example of transferring the exponent representation A (i.e., $\lfloor y \rfloor + b$) and the fraction representation B (i.e., $2^{y - \lfloor y \rfloor} - 1$) to the exponential function's result C in FP32 format. (b) Combining the exponent representation A and the fraction representation (i.e., D, transferred from B), and applying a unified bit shifting to obtain the exponential function's result C.

Figure 13: (a) The default address mapping of 8GB HMC Gen3; (b) our address mapping.

Figure 14: The Evaluation Infrastructure.

5.3.1 Memory Address Mapping Mechanism

In the default HMC design, the memory access granularity is 16 bytes, which is defined as a block, and a sequential interleaving mapping mechanism is applied to benefit from better bandwidth utilization. Moreover, MAX block is introduced to define the maximum number of blocks that one bank can provide at a time; its size can be set to 32B, 64B, 128B, or 256B [38]. In this study, we redefine MAX block as a sub-page in order to differentiate it from a block, where one sub-page is composed of multiple blocks. Fig.13(a) illustrates the default address mapping from the HMC Gen3 specification [38]. The lowest 4 bits and the highest bit are ignored, and the remaining bits describe the block address. From its lower to higher bits, a block address is composed of several fields: the block ID in the sub-page (whose number of bits is determined by the sub-page size), the 5-bit vault ID for 32 vaults, the 4-bit bank ID for 16 banks per vault, and the sub-page ID in the bank. As can be seen, sub-pages with consecutive addresses are first spread sequentially across different vaults and then across different DRAM banks.

Note that our inter-vault level design requires consecutive blocks to be allocated into one vault to avoid high inter-vault communication. This can easily be addressed by moving the vault ID field up to the highest field of the block address (as shown in Fig.13(b)), so that the vault ID remains unchanged during intra-vault memory address mapping. However, at the intra-vault level, the PEs process their workloads in parallel and concurrently generate data requests, which may result in serious bank conflicts and vault request stalls (VRS).

Interestingly, we observe that most of the concurrent PE requests assigned to the same bank actually visit different blocks. Based on this, we propose a new memory addressing scheme that distributes these blocks across different banks, in order to significantly alleviate bank conflicts and decrease the VRS. However, simply distributing blocks to different banks could further increase the VRS, since one PE may request multiple consecutive blocks at a time; these blocks would then reside in multiple banks, leading to multiple accesses to those banks and higher bank conflicts. To ensure the consecutive blocks required by one PE are stored in the same bank, our scheme dynamically determines the sub-page size according to the size of the requested data. As shown in Fig.13(b), we leverage bit 1 to bit 3 of the lowest 4 ignored bits as the indicator of the sub-page size for the data requests, where the range "000" to "100" represents sub-page sizes from 16B to 256B. Given that the data requests from the PEs and the host GPU need to be allocated into different banks, the indicator bits are dynamically assigned by the page table during virtual-to-physical address translation, according to the storage requested by each variable involved in each execution thread.
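A minimal sketch of the remapped decode path of Fig.13(b) follows (illustrative only: the 5-bit vault ID and 4-bit bank ID match the text above, but the exact widths and offsets of the remaining fields for an 8GB device are our assumption):

```python
def decode_pim_address(addr: int):
    """Split a 33-bit HMC address per the remapping of Fig.13(b)."""
    indicator = (addr >> 1) & 0b111         # bits 1-3 of the ignored low bits:
    block_bits = indicator                  # "000".."100" -> 16B..256B sub-pages
    subpage_bytes = 16 << indicator
    blk = (addr >> 4) & ((1 << 28) - 1)     # block address (4 low + 1 high bit ignored)
    vault_id = blk >> 23                    # 5-bit vault ID, now the highest field
    bank_id = (blk >> 19) & 0xF             # 4-bit bank ID, 16 banks per vault
    low = blk & ((1 << 19) - 1)
    block_id = low & ((1 << block_bits) - 1)   # block ID inside the sub-page
    subpage_id = low >> block_bits             # sub-page ID in the bank
    return vault_id, bank_id, subpage_id, block_id, subpage_bytes
```

Because the vault ID occupies the highest field, a PE walking consecutive blocks never crosses a vault boundary, while the indicator keeps all blocks of one request inside a single bank.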

5.3.2 Identifying Memory Access Priority

During CapsNet execution, resource contention can occur when both the host GPU and the PEs concurrently issue data requests to the same vault. Although the Conv/FC layers exhibit much better data locality than the RP, they may still need to periodically request data from the HMC. By default, the priority of a request is determined by its arrival time, but this can cause stalls if both sides request access to the same bank in a vault, which may occur more frequently after applying our new address mapping.

To address this issue, we propose a runtime memory access scheduler (RMAS) that dynamically identifies the priority of memory requests from both the host GPU and the vault PEs. We first observe that, with our address mapping mechanism, consecutive data are likely to be distributed across the banks within a vault instead of being scattered over all the vaults. Thus, it is highly likely that each consecutive data request from the host will access only a single vault or a few vaults at a time instead of all of them. This provides opportunities for some vaults to serve the GPU memory requests in rotation without affecting the overall HMC execution.

To quantify the impact of vault access priority, the runtime scheduler (RMAS) first collects the information of the operations issued by the HMC and the number of PE requests (Q) in the vault request queues of the HMC. It also collects information about the operations issued from the host GPU, i.e., which and how many vaults these operations are requesting from the HMC (defined as n_max).

The collected information is associated with the HMC performance overhead of granting access priority to either side. Thus, the performance impact of serving the two types of access requests from the HMC and the GPU can be quantified via the following overhead function:

$\kappa = \gamma_v \times n_h \times Q + \gamma_h \times \frac{n_{max}}{n_h}$   (15)

where κ represents the quantified performance impact; $\gamma_v$ and $\gamma_h$ represent impact factors determined by the type of operations issued from the HMC and the host GPU, respectively (e.g., memory-intensive operations correspond to a large γ since their performance is more sensitive to memory bandwidth than that of computation-intensive operations); Q is the average number of PE requests in the request queues of the targeted vaults; and $n_h$ is the number of vaults that grant access priority to the host GPU, which lies in the range $[0, n_{max}]$. If $n_h$ is 0, all the target vaults first process the current HMC PE requests before processing the GPU requests; if $n_h$ is $n_{max}$, all the target vaults grant priority to the GPU requests. Setting $d\kappa/dn_h = 0$ shows that the impact is minimized (i.e., $\min(\kappa)$) when $n_h = \sqrt{n_{max} \gamma_h / (Q \gamma_v)}$, clamped to $[0, n_{max}]$. The RMAS then sends control signals to the $n_h$ vaults that give the host GPU higher access priority. Note that a vault with a smaller Q has higher priority to be chosen by RMAS, to further reduce the impact on execution.
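The closed-form rule above is cheap enough to evaluate every scheduling epoch; a small sketch (the γ values in the example are hypothetical, and this is our illustration rather than the RMAS hardware):

```python
import math

def grant_gpu_priority(n_max: int, Q: float, gamma_v: float, gamma_h: float) -> int:
    """Number of vaults n_h that should prioritize GPU requests, per Eq.(15)."""
    if Q <= 0:                 # no pending PE requests: let the GPU go first everywhere
        return n_max
    n_h = math.sqrt(n_max * gamma_h / (Q * gamma_v))  # minimizer of Eq.(15)
    return max(0, min(n_max, round(n_h)))

# e.g., GPU touching 8 vaults, 12 queued PE requests per vault on average,
# PE side memory-bound (large gamma_v), GPU side compute-bound (small gamma_h):
print(grant_gpu_priority(n_max=8, Q=12.0, gamma_v=1.0, gamma_h=0.25))
```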
Table 4: Platform Specifications

Host Processor (GPU)
  Shading Units: 3584 @ 1190MHz
  On-chip Storage: L1 cache/shared: 24KB x 56; L2 cache: 4MB
  Default Memory: HBM, 8GB, 320GB/s
HMC
  Capacity: 8GB, 32 vaults, 16 banks/vault
  Bandwidth: external: 320GB/s; internal: 512GB/s
  No. of PEs per Vault: 16
  Frequency: 312.5MHz

6. EVALUATIONS

6.1 Experimental Setup

Fig.14 shows the infrastructure used to evaluate PIM-CapsNet. We first employ Pytorch [51], a popular open-source machine learning framework that supports dynamic computation graphs, to execute the CapsNet on our host GPU. We then design a physical-simulator cooperated platform, which takes the execution status of CapsNet provided by Pytorch to obtain the detailed performance and energy results when applying PIM-CapsNet. On the physical side, we adopt the Nvidia Tesla P100 [65] as our host processor to evaluate the performance and energy consumption of the Conv/PrimeCaps/FC layers of CapsNet. The detailed execution time and power information for these layers are captured using NVprofiler [54] and Nvidia-smi [66]. On the simulator side, we leverage NVprofiler to collect the event trace from the host and pass it to a modified HMC-sim 3.0 [67] to simulate the computation and memory accesses in the HMC. Considering that HMC-sim 3.0 cannot provide precise execution cycles and power information for the logic layer design, we conduct a gate-level simulation on Cadence [68] to measure the execution latency and power consumption of our logic design (PE). We then integrate the gate-level results and our PIM design into HMC-sim to obtain the performance and energy consumption of the RP execution. Finally, since the execution of CapsNet is pipelined across the host processor and the HMC, we combine the results from both sides by overlapping the execution time and accumulating the energy consumption. The detailed platform configurations are shown in Table 4. In this work, we choose 12 different types of CapsNets as our benchmarks, which are introduced in Sec.3 and shown in Table 1.

To evaluate the effectiveness of PIM-CapsNet, we compare it with the following design scenarios: (1) Baseline: the state-of-the-art GPU-accelerated CapsNet execution with HBM memory (320GB/s). (2) GPU-ICP: the GPU-accelerated CapsNet execution with an ideal cache replacement policy. (3) PIM-CapsNet: our combined inter-vault and intra-vault level design for RP acceleration, with the new memory addressing scheme and the RMAS scheme. (4) PIM-Intra: our PIM-CapsNet without the inter-vault level design, where the memory addressing scheme does not optimize the inter-vault data distribution. (5) PIM-Inter: our PIM-CapsNet without the intra-vault level design, where the memory addressing scheme does not support the intra-vault bank conflict optimization. (6) RMAS-PIM and (7) RMAS-GPU: our PIM-CapsNet with naive memory access scheduling, which always grants the HMC PEs higher priority than the GPU (RMAS-PIM), or always grants the GPU higher priority than the HMC PEs (RMAS-GPU). (8) All-in-PIM: the HMC-accelerated CapsNet execution, including the RP and all other layers.

Figure 15: The (a) speedup and (b) normalized energy consumption of PIM-CapsNet on the RP procedure compared with the GPU-based design.

6.2 Effectiveness of PIM-CapsNet

6.2.1 Performance and Energy for RP Execution

We first evaluate the effectiveness of PIM-CapsNet on the RP execution. Fig.15 illustrates the normalized performance and energy consumption of our PIM-CapsNet design compared with the GPU-based designs (i.e., Baseline and GPU-ICP) for the RP execution. From the figure, we first observe that GPU-ICP only outperforms Baseline by 1.14% on performance and 0.77% on energy during RP execution. This is because RP requires a large number of intermediate variables which exceed the on-chip storage; as a result, cache policy improvements can barely reduce the off-chip memory accesses. Second, PIM-CapsNet outperforms Baseline by 117% on performance by addressing the large number of memory accesses as well as the intensive synchronizations. Third, from Fig.15(b), we observe that PIM-CapsNet saves 92.18% of the energy consumption compared to Baseline. This is because the entire working power of our PIM design is much lower than that of the host, and PIM-CapsNet greatly reduces the data movement between the host and the HMC.

Figure 16: The breakdown of the factors contributing to (a) normalized performance and (b) normalized energy consumption when performing the RP execution on different PIM designs.

Figure 17: The (a) speedup and (b) normalized energy consumption when processing the entire CapsNet with different designs.

Figure 18: The speedup (heat map) achieved by different workload distribution dimensions (X-axis) under different HMC execution frequencies. Red means better improvement.

Moreover, we observe that PIM-CapsNet achieves better performance and energy saving for the RP execution on larger CapsNets, e.g., 2.27x speedup and 92.52% energy saving when accelerating Caps-EN3, compared with 2.09x speedup and 91.90% energy saving when accelerating Caps-SV1. This implies that PIM-CapsNet scales well in optimizing the RP execution with ever-increasing network sizes.

6.2.2 Effectiveness of Intra-Vault and Inter-Vault Level Designs

To better understand the effectiveness of our intra-vault and inter-vault level designs, we compare PIM-CapsNet with other PIM design scenarios for the RP execution only. Fig.16(a) illustrates the evaluation results of normalized performance with the breakdown factors for the different PIM designs. From the figure, we make several observations. First, even though PIM-Intra achieves a 1.22x speedup over Baseline, the inter-vault communication overheads contribute on average 45.24% of the overall execution time. This is because the PIM-Intra design induces massive concurrent data transfers between the centralized computation units in the logic layer and the DRAM banks in the vaults, leading to high crossbar utilization and serious stalls. Second, PIM-Inter decreases the performance by 4.73% compared with Baseline. Compared with PIM-Intra, the inter-vault communication overheads are significantly reduced in PIM-Inter, but the vault request stalls (VRS) grow, contributing on average 57.91% of the execution time due to the serious bank conflicts within each vault. Finally, PIM-CapsNet improves the performance by about 127.83%/76.62% on average compared to PIM-Inter/PIM-Intra by reducing both the inter-vault communications and the VRS. From the energy perspective, as Fig.16(b) shows, all these PIM designs achieve high energy savings over Baseline by executing RP on the energy-efficient PIM hardware, and our PIM-CapsNet on average outperforms PIM-Inter/PIM-Intra by 4.81%/4.52%, respectively, on energy saving.

6.3 Overall Performance and Energy

Fig.17 shows the normalized speedup and energy of the entire CapsNet execution. First, to evaluate the effectiveness of the pipelined execution between the host and the HMC, we compare our design with the all-in-GPU design (i.e., Baseline) and All-in-PIM. From the figure, we observe that All-in-PIM causes a 47.59% performance drop compared to Baseline. This is because we mainly focus on optimizing the RP procedure at minimal cost in the HMC, and this low-cost design choice can hardly achieve the best performance for the Conv/FC layers. But it reduces the energy consumption by 71.09%, which exhibits better execution efficiency (i.e., performance per unit of energy) compared with Baseline. Fig.17 further plots the performance and energy consumption of the pipelined design with the naive memory access schedulers (i.e., RMAS-PIM and RMAS-GPU). We observe that these naive schedulers cannot achieve the best performance and energy saving compared with PIM-CapsNet, due to memory access stalls. In summary, our design outperforms Baseline on both performance (on average 2.44x) and energy saving (64.91%), and it even achieves a higher performance improvement than accelerating the RP execution alone, which benefits from the pipelined execution.

6.4 Sensitivity to PE Frequency Scaling

We conduct a sensitivity analysis of the inter-vault level workload distribution in our PIM-CapsNet when different frequencies are adopted for the PEs, i.e., 312.5MHz, 625MHz and 937.5MHz. Note that the frequency scaling is controlled within a certain range so as not to violate the HMC thermal constraint. Fig.18 shows the speedup achieved by selecting different distribution dimensions (i.e., B-dimension, L-dimension, and H-dimension) under these three frequencies. A darker color indicates a higher speedup (e.g., red indicates substantial speedups while yellow means trivial speedups).

It is obvious that PIM-CapsNet achieves better improvement with a higher execution frequency. We also notice that the selection of the distribution dimension changes with frequency scaling. For example, for Caps-SV3, the L-dimension distribution achieves the best performance under the 312.5MHz execution frequency, but the H-dimension distribution performs best when the frequency increases to 937.5MHz. This indicates that the dimension selection is affected not only by the network configuration, but also by the hardware configuration, e.g., the processing frequency.

Table 5: Accuracy Validation for PIM-CapsNet

                        Caps-MN1  Caps-MN2  Caps-MN3  Caps-CF1  Caps-CF2  Caps-CF3
Origin                  99.75%    99.75%    99.75%    89.40%    90.03%    90.43%
w/o Accuracy Recovery   99.73%    99.74%    99.73%    89.15%    89.91%    90.08%
w/ Accuracy Recovery    99.75%    99.75%    99.75%    89.37%    90.02%    90.39%

                        Caps-EN1  Caps-EN2  Caps-EN3  Caps-SV1  Caps-SV2  Caps-SV3
Origin                  88.74%    85.01%    82.36%    96.70%    95.90%    95.90%
w/o Accuracy Recovery   88.19%    84.16%    81.64%    95.08%    95.92%    95.92%
w/ Accuracy Recovery    88.69%    84.96%    82.34%    96.42%    95.90%    95.90%

6.5 Overhead Analysis

Accuracy Analysis: Table 5 shows the absolute CapsNet accuracy after applying the approximations and the accuracy recovery methodology of our PIM-CapsNet design, as discussed in Sec.5.2.2. As it shows, the approximations cause on average 0.35% accuracy loss, while the accuracy recovery method achieves no accuracy loss for most of the cases and induces on average only a 0.04% accuracy difference across our benchmarks. We also observe a slight accuracy boost for Caps-SV2 and Caps-SV3 when we increase the RP iteration count for PIM-CapsNet without accuracy recovery. This is because the noise induced by the approximation enhances the robustness of the feature mapping, given enough RP iterations [69].

Area Analysis: Our PIM-CapsNet introduces 16 PEs and an operation controller into each HMC vault, with one RMAS module located in the logic layer. Based on our gate-level simulations, our logic design for 32 vaults together with the RMAS incurs a 3.11 mm2 area overhead under the 24nm process technology, which occupies only 0.32% of the HMC logic surface area.

Thermal Analysis: Note that our logic design raises the total power of the HMC, which could cause thermal issues for the 3D stacked memory architecture. We observe that the average power overhead of our logic design is 2.24W, which is below the maximum power overhead (i.e., the 10W thermal design power (TDP) [70]) that the HMC can tolerate.

7. RELATED WORKS

PIM-Based NN Acceleration: There have been multiple studies focusing on PIM-based neural network accelerators [71, 72, 73, 74, 75, 76]. For example, [72] employs ReRAM-based PIM to conduct efficient neural network execution. However, the massive intermediate variables involved in the RP would induce both performance and energy overheads in the ReRAM design due to frequent value updating. Besides, [43, 77, 42] propose CNN accelerator designs using the 3D stacked PIM technique. Since the execution pattern of the RP procedure is far different from that of CNN layers, previous logic layer designs for 3D stacked memory exhibit low efficiency on the RP execution. To the best of our knowledge, this is the first work that leverages the HMC to explore a customized architectural design for efficient CapsNet acceleration.

Workload Distributions: Several studies have explored workload distribution for efficient neural network execution [78, 79, 80, 81, 82]. For example, [80] proposes to distribute the parallel workloads of a convolutional layer among the computation units within a single device, and [82] splits the convolutional execution and distributes the workloads onto multiple devices. Their methods have achieved significant CNN acceleration by greatly improving the resource utilization. Since the execution of the RP procedure is much more complicated than a convolutional layer and involves strong execution dependence, these studies are not applicable to CapsNet acceleration.

Exponential Approximation: There are also some works on approximating the exponential function [58, 59, 83, 84]. For example, [59] leverages a lookup table to conduct fast exponential execution, which causes substantial performance and area overheads. [84] accelerates the exponential function via software optimizations, which are hard to implement in simple logic design. In this work, we conduct efficient acceleration of the exponential function (Sec.5.2.2) with low design complexity and low power overhead.

8. CONCLUSIONS

In recent years, the CapsNet has outperformed the CNN on image processing tasks and has become increasingly popular in many areas. In this study, we propose a processing-in-memory based hybrid computing design called PIM-CapsNet, which leverages both the GPU core computing capability and the special features of the hybrid memory cube to effectively process the data-intensive RP operations, achieving substantial improvements in both performance and energy saving. To drastically reduce the identified performance bottlenecks of RP, we propose several memory-level optimizations (e.g., multi-dimensional parallelism selection, intelligent workload distribution, complex logic approximation, and a new address mapping mechanism) based on RP's program and execution features, to enable minimal in-memory communication, maximum parallelization, and low design complexity and overhead. Evaluation results demonstrate that our proposed design achieves significant performance speedup and energy savings over the baseline GPU design, for both the RP phase and overall CapsNet inference. It also achieves good performance scalability with increasing network size.

9. REFERENCES
[1] I. Kononenko, "Machine learning for medical diagnosis: history, state of the art and perspective," Artificial Intelligence in Medicine, vol. 23, no. 1, pp. 89-109, 2001.
[2] B. J. Erickson, P. Korfiatis, Z. Akkus, and T. L. Kline, "Machine learning for medical imaging," Radiographics, vol. 37, no. 2, pp. 505-515, 2017.
[3] A. L. Buczak and E. Guven, "A survey of data mining and machine learning methods for cyber security intrusion detection," IEEE Communications Surveys & Tutorials, vol. 18, no. 2, pp. 1153-1176, 2016.
[4] J. Grimmer, "We are all social scientists now: How big data, machine learning, and causal inference work together," PS: Political Science & Politics, vol. 48, no. 1, pp. 80-83, 2015.
[5] J. Kober, J. A. Bagnell, and J. Peters, "Reinforcement learning in robotics: A survey," The International Journal of Robotics Research, vol. 32, no. 11, pp. 1238-1274, 2013.
[6] S. Ren, K. He, R. Girshick, and J. Sun, "Faster r-cnn: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, pp. 91-99, 2015.
[7] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, "Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 mb model size," arXiv preprint arXiv:1602.07360, 2016.
[8] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-9, 2015.

[9] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[10] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.
[11] G. E. Hinton, A. Krizhevsky, and S. D. Wang, "Transforming auto-encoders," in International Conference on Artificial Neural Networks, pp. 44-51, Springer, 2011.
[12] G. Hinton, "Where do features come from?," Cognitive Science, vol. 38, no. 6, pp. 1078-1101, 2014.
[13] M. Jaderberg, K. Simonyan, A. Zisserman, et al., "Spatial transformer networks," in Advances in Neural Information Processing Systems, pp. 2017-2025, 2015.
[14] S. Sabour, N. Frosst, and G. E. Hinton, "Dynamic routing between capsules," in Advances in Neural Information Processing Systems, pp. 3856-3866, 2017.
[15] A. Jiménez-Sánchez, S. Albarqouni, and D. Mateus, "Capsule networks against medical imaging data challenges," in Intravascular Imaging and Computer Assisted Stenting and Large-Scale Annotation of Biomedical Data and Expert Label Synthesis, pp. 150-160, Springer, 2018.
[16] P. Afshar, K. N. Plataniotis, and A. Mohammadi, "Capsule networks for brain tumor classification based on mri images and course tumor boundaries," arXiv preprint arXiv:1811.00597, 2018.
[17] A. Mobiny and H. Van Nguyen, "Fast capsnet for lung cancer screening," in International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 741-749, Springer, 2018.
[18] A. S. Lundervold and A. Lundervold, "An overview of deep learning in medical imaging focusing on mri," Zeitschrift für Medizinische Physik, 2018.
[19] X. Zhang and S. Zhao, "Blood cell image classification based on image segmentation preprocessing and capsnet network model," Journal of Medical Imaging and Health Informatics, vol. 9, no. 1, pp. 159-166, 2019.
[20] L. Wang, R. Nie, R. Xin, J. Zhang, and J. Cai, "sccapsnet: a deep learning classifier with the capability of interpretable feature extraction, applicable for single cell rna data analysis," bioRxiv, p. 506642, 2018.
[21] A. D. Kumar, "Novel deep learning model for traffic sign detection using capsule networks," arXiv preprint arXiv:1805.04424, 2018.
[22] J. Kronenberger and A. Haselhoff, "Do capsule networks solve the problem of rotation invariance for traffic sign classification?," in International Conference on Artificial Neural Networks, pp. 33-40, Springer, 2018.
[23] M. Pöpperl, R. Gulagundi, S. Yogamani, and S. Milz, "Capsule neural network based height classification using low-cost automotive ultrasonic sensors," arXiv preprint arXiv:1902.09839, 2019.
[24] D. A. Lopez, Evolving GPU-Accelerated Capsule Networks. PhD thesis, 2018.
[25] A. Yusuf and S. Alawneh, "A survey of gpu implementations for hyperspectral image classification in remote sensing," Canadian Journal of Remote Sensing, pp. 1-19, 2018.
[26] A. Stoutchinin, F. Conti, and L. Benini, "Optimally scheduling cnn convolutions for efficient memory access," arXiv preprint arXiv:1902.01492, 2019.
[27] C. Li, Y. Yang, M. Feng, S. Chakradhar, and H. Zhou, "Optimizing memory efficiency for deep convolutional neural networks on gpus," in SC'16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 633-644, IEEE, 2016.
[28] M. Song, Y. Hu, H. Chen, and T. Li, "Towards pervasive and user satisfactory cnn across gpu microarchitectures," in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 1-12, IEEE, 2017.
[29] Z. Chen, L. Luo, W. Quan, Y. Shi, J. Yu, M. Wen, and C. Zhang, "Multiple cnn-based tasks scheduling across shared gpu platform in research and development scenarios," in 2018 IEEE 20th International Conference on High Performance Computing and Communications; IEEE 16th International Conference on Smart City; IEEE 4th International Conference on Data Science and Systems (HPCC/SmartCity/DSS), pp. 578-585, IEEE, 2018.
[30] X. Zhang, C. Xie, J. Wang, W. Zhang, and X. Fu, "Towards memory friendly long-short term memory networks (lstms) on mobile gpus," in 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 162-174, IEEE, 2018.
[31] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, et al., "Dadiannao: A machine-learning supercomputer," in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 609-622, IEEE Computer Society, 2014.
[32] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, "Shidiannao: Shifting vision processing closer to the sensor," in ACM SIGARCH Computer Architecture News, vol. 43, pp. 92-104, ACM, 2015.
[33] X. Zhou, Z. Du, Q. Guo, S. Liu, C. Liu, C. Wang, X. Zhou, L. Li, T. Chen, and Y. Chen, "Cambricon-s: Addressing irregularity in sparse neural networks through a cooperative software/hardware approach," in 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 15-28, IEEE, 2018.
[34] M. Mahmoud, K. Siu, and A. Moshovos, "Diffy: a déjà vu-free differential deep neural network accelerator," in 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 134-147, IEEE, 2018.
[35] Y. Kwon and M. Rhu, "Beyond the memory wall: A case for memory-centric hpc system for deep learning," in 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 148-161, IEEE, 2018.
[36] Y. Li, J. Park, M. Alian, Y. Yuan, Z. Qu, P. Pan, R. Wang, A. Schwing, H. Esmaeilzadeh, and N. S. Kim, "A network-centric hardware/algorithm co-design to accelerate distributed training of deep neural networks," in 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 175-188, IEEE, 2018.
[37] C. Deng, S. Liao, Y. Xie, K. Parhi, X. Qian, and B. Yuan, "Permdnn: Efficient compressed deep neural network architecture with permuted diagonal matrices," in MICRO, 2018.
[38] J. Jeddeloh and B. Keeth, "Hybrid memory cube new dram architecture increases density and performance," in 2012 Symposium on VLSI Technology (VLSIT), pp. 87-88, IEEE, 2012.
[39] S. F. Yitbarek, T. Yang, R. Das, and T. Austin, "Exploring specialized near-memory processing for data intensive operations," in 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1449-1452, IEEE, 2016.
[40] J. Ahn, S. Yoo, O. Mutlu, and K. Choi, "Pim-enabled instructions: a low-overhead, locality-aware processing-in-memory architecture," in 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), pp. 336-348, IEEE, 2015.
[41] J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi, "A scalable processing-in-memory accelerator for parallel graph processing," ACM SIGARCH Computer Architecture News, vol. 43, no. 3, pp. 105-117, 2016.
[42] J. Liu, H. Zhao, M. A. Ogleari, D. Li, and J. Zhao, "Processing-in-memory for energy-efficient neural network training: A heterogeneous approach," in 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 655-668, IEEE, 2018.
[43] D. Kim, J. Kung, S. Chai, S. Yalamanchili, and S. Mukhopadhyay, "Neurocube: A programmable digital neuromorphic architecture with high-density 3d memory," in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 380-392, IEEE, 2016.

[44] G. E. Hinton, S. Sabour, and N. Frosst, "Matrix capsules with em routing," 2018.
[45] E. Xi, S. Bing, and Y. Jin, "Capsule network performance on complex data," arXiv preprint arXiv:1712.03480, 2017.
[46] S. S. R. Phaye, A. Sikka, A. Dhall, and D. Bathula, "Dense and diverse capsule networks: Making the capsules learn better," arXiv preprint arXiv:1805.04001, 2018.
[47] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al., "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
[48] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," tech. rep., Citeseer, 2009.
[49] G. Cohen, S. Afshar, J. Tapson, and A. van Schaik, "Emnist: an extension of mnist to handwritten letters," arXiv preprint arXiv:1702.05373, 2017.
[50] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, "Reading digits in natural images with unsupervised feature learning," 2011.
[51] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in pytorch," 2017.
[52] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer, "cudnn: Efficient primitives for deep learning," arXiv preprint arXiv:1410.0759, 2014.
[53] "Optimizations To Accelerate Deep Learning Training on NVIDIA GPUs." https://devblogs.nvidia.com/new-optimizations-accelerate-deep-learning-training-gpu/.
[54] "Nvidia Profiler." https://docs.nvidia.com/cuda/profiler-users-guide/index.html.
[55] R. Mukhometzianov and J. Carrillo, "Capsnet comparative performance evaluation for image classification," arXiv preprint arXiv:1805.11195, 2018.
[56] "HMC Specification 2.1." http://hybridmemorycube.org/files/SiteDownloads/HMC-30G-VSR_HMCC_Specification_Rev2.1_20151105.pdf.
[57] D.-I. Jeon, K.-B. Park, and K.-S. Chung, "Hmc-mac: Processing-in memory architecture for multiply-accumulate operations with hybrid memory cube," IEEE Computer Architecture Letters, vol. 17, no. 1, pp. 5-8, 2018.
[58] D. De Caro, N. Petra, and A. G. Strollo, "High-performance special function unit for programmable 3-d graphics processors," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 56, no. 9, pp. 1968-1978, 2009.
[59] J. Partzsch, S. Höppner, M. Eberlein, R. Schüffny, C. Mayr, D. R. Lester, and S. Furber, "A fixed point exponential function accelerator for a neuromorphic many-core system," in 2017 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1-4, IEEE, 2017.
[60] C. Lomont, "Fast inverse square root," Technical Report, vol. 32, 2003.
[61] M. Robertson, A Brief History of InvSqrt. Department of Computer Science & Applied Statistics, 2012.
[62] A. Zoglauer, S. E. Boggs, M. Galloway, M. Amman, P. N. Luke, and R. M. Kippen, "Design, implementation, and optimization of megalib's image reconstruction tool mimrec," Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, vol. 652, no. 1, pp. 568-571, 2011.
[63] L. Middendorf and C. Haubelt, "A programmable graphics processor based on partial stream rewriting," in Computer Graphics Forum, vol. 32, pp. 325-334, Wiley Online Library, 2013.
[64] W. Kahan, "Ieee standard 754 for binary floating-point arithmetic," Lecture Notes on the Status of IEEE, vol. 754, no. 94720-1776, p. 11, 1996.
[65] "Nvidia Tesla P100 Whitepaper." https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf.
[66] "NVIDIA-smi." https://developer.nvidia.com/nvidia-system-management-interface.
[67] J. D. Leidel and Y. Chen, "Hmc-sim: A simulation framework for hybrid memory cube devices," Parallel Processing Letters, vol. 24, no. 04, p. 1442002, 2014.
[68] "Gate-Level Simulation Methodology - Cadence." https://www.cadence.com/content/dam/cadence-www/global/en_US/documents/tools/system-design-verification/gate-level-simulation-wp.pdf.
[69] P. Koistinen and L. Holmström, "Kernel regression and backpropagation training with noise," in Advances in Neural Information Processing Systems, pp. 1033-1039, 1992.
[70] D. Zhang, N. Jayasena, A. Lyashevsky, J. L. Greathouse, L. Xu, and M. Ignatowski, "Top-pim: throughput-oriented programmable processing in memory," in Proceedings of the 23rd International Symposium on High-Performance Parallel and Distributed Computing, pp. 85-98, ACM, 2014.
[71] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, "Prime: A novel processing-in-memory architecture for neural network computation in reram-based main memory," in ACM SIGARCH Computer Architecture News, vol. 44, pp. 27-39, IEEE Press, 2016.
[72] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, "Isaac: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," ACM SIGARCH Computer Architecture News, vol. 44, no. 3, pp. 14-26, 2016.
[73] F. Chen, L. Song, and Y. Chen, "Regan: A pipelined reram-based accelerator for generative adversarial networks," in 2018 23rd Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 178-183, IEEE, 2018.
[74] H. Mao, M. Song, T. Li, Y. Dai, and J. Shu, "Lergan: A zero-free, low data movement and pim-based gan architecture," in 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 669-681, IEEE, 2018.
[75] B. K. Joardar, B. Li, J. R. Doppa, H. Li, P. P. Pande, and K. Chakrabarty, "Regent: A heterogeneous reram/gpu-based architecture enabled by noc for training cnns," in 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 522-527, IEEE, 2019.
[76] H. Falahati, P. Lotfi-Kamran, M. Sadrosadati, and H. Sarbazi-Azad, "Origami: A heterogeneous split architecture for in-memory acceleration of learning," arXiv preprint arXiv:1812.11473, 2018.
[77] M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, "Tetris: Scalable and efficient neural network acceleration with 3d memory," in ACM SIGARCH Computer Architecture News, vol. 45, pp. 751-764, ACM, 2017.
[78] Y. Shen, M. Ferdman, and P. Milder, "Maximizing cnn accelerator efficiency through resource partitioning," in 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pp. 535-547, IEEE, 2017.
[79] J. Guo, S. Yin, P. Ouyang, L. Liu, and S. Wei, "Bit-width based resource partitioning for cnn acceleration on fpga," in 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 31-31, IEEE, 2017.
[80] S. Wang, G. Ananthanarayanan, Y. Zeng, N. Goel, A. Pathania, and T. Mitra, "High-throughput cnn inference on embedded arm big.little multi-core processors," arXiv preprint arXiv:1903.05898, 2019.

[81] C. Lo, Y.-Y. Su, C.-Y. Lee, and S.-C. Chang, "A dynamic deep neural network design for efficient workload allocation in edge computing," in 2017 IEEE International Conference on Computer Design (ICCD), pp. 273-280, IEEE, 2017.
[82] S. Dey, A. Mukherjee, A. Pal, and P. Balamuralidhar, "Partitioning of cnn models for execution on fog devices," in Proceedings of the 1st ACM International Workshop on Smart Cities and Fog Computing, pp. 19-24, ACM, 2018.
[83] A. C. I. Malossi, Y. Ineichen, C. Bekas, and A. Curioni, "Fast exponential computation on simd architectures," Proc. of HIPEAC-WAPCO, Amsterdam NL, 2015.
[84] F. Perini and R. D. Reitz, "Fast approximations of exponential and logarithm functions combined with efficient storage/retrieval for combustion kinetics calculations," Combustion and Flame, vol. 194, pp. 37-51, 2018.