To appear in 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA).

Enabling Highly Efficient Capsule Networks Processing Through A PIM-Based Architecture Design

Xingyao Zhang, University of Houston, Houston, USA, [email protected]
Shuaiwen Leon Song, University of Sydney, Sydney, Australia, [email protected]
Chenhao Xie, Pacific Northwest National Laboratory, Richland, USA, [email protected]
Jing Wang, Capital Normal University, Beijing, China, [email protected]
Weigong Zhang, Capital Normal University, Beijing, China, [email protected]
Xin Fu, University of Houston, Houston, USA, [email protected]

arXiv:1911.03451v1 [cs.DC] 7 Nov 2019

ABSTRACT

In recent years, CNNs have achieved great success in image processing tasks, e.g., image recognition and object detection. Unfortunately, traditional CNN classification is easily misled by increasingly complex image features due to the usage of pooling operations, and hence is unable to preserve accurate position and pose information of the objects. To address this challenge, a novel neural network structure called Capsule Network (CapsNet) has been proposed, which introduces equivariance through capsules to significantly enhance the learning ability for image segmentation and object detection. Due to its requirement of performing a high volume of matrix operations, CapsNets have generally been accelerated on modern GPU platforms that provide highly optimized software libraries for common deep learning tasks. However, based on our performance characterization on modern GPUs, CapsNets exhibit low efficiency due to the special program and execution features of their routing procedure, including massive unshareable intermediate variables and intensive synchronizations, which are very difficult to optimize at the software level. To address these challenges, we propose a hybrid computing architecture design named PIM-CapsNet. It preserves the GPU's on-chip computing capability for accelerating the CNN types of layers in CapsNet, while pipelining with an off-chip in-memory acceleration solution that effectively tackles the routing procedure's inefficiency by leveraging the processing-in-memory capability of today's 3D stacked memory. Using the routing procedure's inherent parallelization feature, our design enables hierarchical improvements on CapsNet inference efficiency through minimizing data movement and maximizing parallel processing in memory. Evaluation results demonstrate that our proposed design can achieve substantial improvements on both performance and energy savings for CapsNet inference, with almost zero accuracy loss. The results also suggest good performance scalability in optimizing the routing procedure with increasing network size.

1. INTRODUCTION

Recently, machine learning has bloomed into rapid growth and been widely applied in many areas, including medical [1, 2], security [3], social media [4], engineering [5], etc. Such explosive development is credited to the success of neural network algorithms, especially convolutional neural networks (CNNs), which are extremely suitable for image processing tasks, e.g., image recognition and object detection [6, 7, 8, 9, 10]. However, recent studies have found that CNNs can easily misdetect important image features when image scenarios are more complex, which significantly affects the classification accuracy [11, 12, 13]. As shown in Fig.1, when attempting to identify lung cancer cells, traditional CNNs are bounded to the confined region, thus missing critical regions around the cell edges and leading to the wrong identification. This is because the pooling operations used in common CNNs apply happenstance translational invariance [11, 14], which limits the learning of rotation and proportional change, resulting in obtaining only partial features. With the increasing adoption into humans' daily life of emerging applications (e.g., medical image processing and autonomous driving) that have strict requirements on object detection accuracy, wrong identification could be fatal in some cases.

[Figure 1: The comparison of neural networks in identifying lung cancer cells [17], where CapsNet outperforms the traditional CNN on detection accuracy. The heat maps indicate the detected features.]

To address these challenges, a novel neural network called Capsule Network (CapsNet) has been proposed recently [14]. Fig.1 demonstrates the evolution from classic CNN identification to CapsNet identification. CapsNet abandons the usage of pooling operations in CNNs and introduces the concept of the capsule, which is any function that tries to predict the presence and the instantiation parameters of a particular object at a given location. The figure illustrates a group of capsules, each with a double-featured activation vector (i.e., probability of presence and pose, shown as the green and red arrows in Fig.1). Because of this added equivariance, CapsNet can accurately detect the cancer cells via precise classification according to the cell edges and body texture. According to recent studies, CapsNets are increasingly involved in human-safety related tasks, and on average outperform CNNs by 19.6% and 42.06% on detection accuracy for medical image processing [15, 16, 17, 18, 19, 20] and autonomous driving [21, 22, 23].

Because CapsNet execution exhibits a high percentage of matrix operations, state-of-the-art GPUs have become the primary platforms for accelerating CapsNets by leveraging their massive on-chip parallelism and deeply optimized software libraries [24, 25]. However, the processing efficiency of CapsNets on GPUs often cannot achieve the desired level for fast real-time inference. To investigate the root causes of this inefficiency, we conduct a comprehensive performance characterization of CapsNets' execution behaviors on modern GPUs, and observe that the computation between two consecutive capsule layers, called the routing procedure (Sec.2.2), presents the major bottleneck. Through runtime profiling, we further identify that the inefficient execution of the routing procedure originates from (i) tremendous data accesses to off-chip memory due to the massive unshareable intermediate variables, and (ii) intensive synchronizations to avoid potential write-after-read and write-after-write hazards on the limited on-chip storage. These challenges are induced by the unique features of the routing procedure execution, and cannot be addressed well via common NN optimization techniques [26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37] or software-level on-chip memory management (e.g., register manipulation or shared memory multiplexing).

To tackle CapsNets' significant off-chip memory accesses and intensive synchronization (induced by numerous aggregation operations in the routing procedure), we propose a processing-in-memory based hybrid computing architecture named PIM-CapsNet. At the highest level, PIM-CapsNet continues to utilize the GPU's native on-chip units as the host for fast processing of the CNN-type layers included in CapsNet, such as the convolutional and fully-connected layers. Meanwhile, by leveraging batch execution, PIM-CapsNet pipelines the host GPU execution with an off-chip in-memory solution that can effectively accelerate CapsNet's routing procedure.

To enable the in-memory acceleration capability for the routing procedure (RP), we select one of the emerging 3D stacked technologies, i.e., the hybrid memory cube (HMC) [38], for RP's design and integration. Because of HMC's high internal and external memory bandwidth, and its unique logic layer for easy integration of computation logic, it has become a promising PIM platform [39, 40, 41, 42, 43]. By leveraging HMC's unique architectural features, our PIM-CapsNet design migrates the data-intensive RP into the HMC to tackle its major challenges above. There are two key objectives for this in-memory design for RP: under the hardware design constraints, creating a hierarchical optimization strategy to enable (i) inter-vault workload balance and communication minimization (Sec.5.1 and 5.3) and (ii) intra-vault maximum parallel processing (Sec.5.2 and 5.3). Additionally, the in-memory optimizations should be generally applicable to different RP algorithms.

For objective (i), we further investigate RP's algorithm and identify an interesting feature: it is highly parallelizable in multiple dimensions. This gives RP's workloads great potential to be concurrently executed across vaults without incurring significant communication overheads. We then create a modeling strategy that considers both per-vault workloads and inter-vault communication to guide dimension selection for parallelization, in order to achieve the optimal performance and power improvement. This also significantly reduces the original synchronization overheads by converting them to aggregation within a vault. For objective (ii), we integrate multiple processing elements (PEs) into a vault's logic layer and explore a customized PE design to concurrently perform RP's specific operations across memory banks. Meanwhile, we propose a new address mapping scheme that effectively transfers many inter-vault level data requests to the intra-vault level and addresses the bank conflict issues of concurrent data requests. Furthermore, to reduce logic design complexity and guarantee performance, we use simple low-cost logic to approximate complex special functions with negligible accuracy loss. To summarize, this study makes the following contributions:

- We conduct a comprehensive characterization study of CapsNet inference on modern GPUs and identify the root causes of its execution inefficiency.

- Based on the interesting insights from the characterization and further algorithm analysis, we propose a processing-in-memory based hybrid computing architecture named PIM-CapsNet, which leverages both the GPU's on-chip computing capability and the off-chip in-memory acceleration features of 3D stacked memory to improve the overall CapsNet inference performance.

- To drastically reduce the identified performance bottlenecks, we propose several memory-level optimizations to enable minimal in-memory communication, maximum parallelization, and low design complexity and overhead.

- The experimental results demonstrate that for overall CapsNet inference, our proposed PIM-CapsNet design outperforms the baseline GPU by 2.44x (up to 2.76x) on performance and 64.91% (up to 85.16%) on energy saving. It also achieves good performance scalability with increasing network size.

2. BACKGROUND: CAPSULE NETWORK

As shown in Fig.2, CapsNet inherits the convolutional (Conv) and fully connected (FC) layers from standard CNNs, but introduces new layers (i.e., Caps layers) to realize the concept of the "capsule" for better information representation. A capsule is a group of neurons (Fig.1) whose activity vector represents the instantiation parameters of a specific type of entity (e.g., the location and pose of an object). It introduces equivariance, which makes standard CNNs understand rotation and proportional change (e.g., in Fig.1). CapsNet significantly lifts the limitation of the happenstance translational invariance of the pooling operations applied in traditional CNNs, and is thus considered superior in image segmentation and object detection [11, 14].

[Figure 2: The computation diagram of CapsNet for MNIST [14]. Encoder: a [28x28] input, a convolutional layer ([20x20], 256 channels), a PrimaryCaps layer ([6x6] 8D capsules, 32 channels), and a DigitCaps layer ([10x1] 16D capsules) connected via the dynamic routing procedure, producing the classification results. Decoder: fully-connected layers (512, 1024, 784) for reconstruction.]

2.1 CapsNet Structure

Fig.2 takes CapsNet-MNIST [14] as an example to illustrate a basic CapsNet structure. It contains two computation stages: encoding and decoding. The encoding stage is composed of Conv layers and Caps layers. The convolutional operations are first performed on the data mapping from the neurons of the Conv layer to the capsules of the first Caps layer, which is defined as the PrimeCaps layer. It is often followed by at least one other Caps layer. The data mapping between the capsules of adjacent Caps layers is performed via the routing procedure (RP). Finally, the last Caps layer produces the classification information towards categorization, with each capsule representing one category. Following the encoding stage, the decoding function includes multiple FC layers which attach to the last Caps layer for image reconstruction (i.e., improving model accuracy in training or plotting abstract features in inference) [14].
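For concreteness, the encoding/decoding structure of Fig.2 can be skeletonized as below. This is an illustrative PyTorch sketch of the CapsNet-MNIST layer sizes only, not the authors' code: `routing_fn` is a placeholder for the routing procedure of Sec.2.2 (the per-pair weight matrices W_ij live inside it here), and the usual masking of non-predicted capsules before reconstruction is omitted.

```python
import torch
import torch.nn as nn

class CapsNetMNIST(nn.Module):
    """Encoder/decoder skeleton following Fig.2's layer sizes (illustrative)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 256, kernel_size=9)                     # 28x28 -> 20x20, 256 ch
        self.primary = nn.Conv2d(256, 32 * 8, kernel_size=9, stride=2)   # -> 6x6, 32 groups of 8
        self.decoder = nn.Sequential(                                    # FC reconstruction: 512-1024-784
            nn.Linear(10 * 16, 512), nn.ReLU(),
            nn.Linear(512, 1024), nn.ReLU(),
            nn.Linear(1024, 784), nn.Sigmoid())

    def forward(self, x, routing_fn):
        x = torch.relu(self.conv(x))
        u = self.primary(x).view(x.size(0), 1152, 8)  # 6*6*32 = 1152 L capsules, 8D each
        v = routing_fn(u)                             # RP: 1152 L caps -> 10 H caps (16D)
        probs = v.norm(dim=-1)                        # capsule length ~ class probability
        recon = self.decoder(v.flatten(1))            # reconstruction from H capsules
        return probs, recon
```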

2.2 Essential Mechanism For Avoiding Feature Loss: Routing Procedure (RP)

The routing procedure (RP) is introduced to route the information from low-level capsules (or L capsules) to high-level capsules (or H capsules) without feature loss. Several routing algorithms have been used in the routing procedure, such as Dynamic Routing [14] and Expectation-Maximization routing [44]. In this work, we use the popular Dynamic Routing [14] as an example to explain the RP execution.

Algorithm 1 Dynamic Routing Procedure
Input: L capsules u, weight matrix W
Output: H capsules v
1: for all L capsule i & all H capsule j from all input set k:
       \hat{u}^k_{j|i} <- u^k_i x W_{ij}                  (Eq.1)
2: for all L capsule i & all H capsule j:
       b_{ij} <- 0                                        (initialize routing coefficients)
3: for routing iterations do
4:     for all L capsule i:
           c_{ij} <- softmax(b_{ij})                      (Eq.5)
5:     for all H capsule j from all input set k:
           s^k_j <- \sum_i \hat{u}^k_{j|i} x c_{ij}       (Eq.2)
6:     for all H capsule j from all input set k:
           v^k_j <- squash(s^k_j)                         (Eq.3)
7:     for all L capsule i & H capsule j:
           b_{ij} = \sum_k v^k_j \hat{u}^k_{j|i} + b_{ij} (Eq.4)
8: end for
9: return v

[Figure 3: Dynamic Routing Procedure (RP) in CapsNet: describing the computation flow and multiple ways for parallel processing.]

Fig.3 and Algorithm 1 demonstrate the computation flow and the possible dimensions for parallelization. Given the kth batched input set, in order to generate the jth H capsule, the ith L capsule in this input set (u^k_i) first multiplies with the corresponding weight (W_{ij}) to generate the prediction vector (\hat{u}^k_{j|i}) (Fig.3, step 1):

    \hat{u}^k_{j|i} = u^k_i \times W_{ij}    (1)

Then, these prediction vectors multiply with their corresponding routing coefficients (c_{ij}), with the results aggregated across all L capsules (Fig.3, step 2):

    s^k_j = \sum_i \hat{u}^k_{j|i} \times c_{ij}    (2)

The non-linear "squashing" function is then applied to the aggregated result s^k_j to produce the jth H capsule v^k_j (Fig.3, step 3):

    v^k_j = \frac{\|s^k_j\|^2}{1 + \|s^k_j\|^2} \cdot \frac{s^k_j}{\|s^k_j\|}    (3)

Note that v^k_j cannot be considered the final value of the jth H capsule unless the features of the L capsules have been correctly inherited. The information difference between an L and an H capsule can be quantified by the agreement measurement via the scalar product of the prediction vector \hat{u}^k_{j|i} and the H capsule v^k_j (Fig.3, step 4), where a "0" output means the information is precisely inherited. In the case of a large divergence, the agreements are accumulated into an intermediate variable b_{ij} (Fig.3, step 5), which is used to update the routing coefficients via the "softmax" function (Fig.3, step 6):

    b_{ij} = \sum_k v^k_j \hat{u}^k_{j|i} + b_{ij}    (4)

    c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})}    (5)

The updated routing coefficients are then integrated into Eq.(2) to start another iteration of the routing procedure (Fig.3, step 2).

The number of iterations is determined by the convergence of the routing coefficients and is set by programmers. Several recent studies indicate that the number of iterations increases for tasks with large datasets and many categories [45, 46]. Once all the iterations complete, the features of the L capsules have been routed to the H capsules and are ready to proceed to the following layers.

Summary. Generally, the routing algorithms (e.g., Dynamic Routing, Expectation-Maximization Routing) share a similar execution pattern and exhibit several core features in the CapsNet routing procedure: (1) The execution of the RP exhibits strong data dependency and needs to be sequentially processed. (2) The procedure adopts all-to-all computation, which routes all the L capsules to the H capsules and forms aggregations in all possible dimensions. (3) The procedure produces a large amount of intermediate variables. (4) The routing procedure is iteratively processed to generate the dynamic coefficients that pass the feature information. We will discuss these features in relation to the performance characterization of CapsNet in Sec.3; our optimizations on Dynamic Routing in the following sections can easily be applied to other routing algorithms with simple adjustments.
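For reference, the following NumPy sketch implements Algorithm 1 end to end. It is an illustrative sketch only: the tensor shapes are our own choice and, following Eq.(4), the agreement is accumulated across the whole batch (the k dimension) so that b and c are shared among the k input sets.

```python
import numpy as np

def squash(s, eps=1e-9):
    # Eq.(3): scale vector length into [0, 1) while preserving direction.
    sq = np.sum(s * s, axis=-1, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def dynamic_routing(u, W, iters=3):
    """u: (NB, NL, CL) L capsules; W: (NL, NH, CL, CH) weights; returns (NB, NH, CH)."""
    u_hat = np.einsum('kic,ijch->kijh', u, W)     # Eq.(1): prediction vectors
    b = np.zeros(W.shape[:2])                     # (NL, NH) routing logits, Alg.1 line 2
    for _ in range(iters):                        # Alg.1 line 3
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # Eq.(5): softmax over H caps
        s = np.einsum('kijh,ij->kjh', u_hat, c)               # Eq.(2): aggregate over L caps
        v = squash(s)                                         # Eq.(3)
        b = b + np.einsum('kjh,kijh->ij', v, u_hat)           # Eq.(4): agreement over batch k
    return v
```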

Table 1: CapsNet Benchmark Configurations

Network  | Dataset              | BS  | L Caps | H Caps | Iter
Caps-MN1 | MNIST [47]           | 100 | 1152   | 10     | 3
Caps-MN2 | MNIST [47]           | 200 | 1152   | 10     | 3
Caps-MN3 | MNIST [47]           | 300 | 1152   | 10     | 3
Caps-CF1 | CIFAR10 [48]         | 100 | 2304   | 11     | 3
Caps-CF2 | CIFAR10 [48]         | 100 | 3456   | 11     | 3
Caps-CF3 | CIFAR10 [48]         | 100 | 4608   | 11     | 3
Caps-EN1 | EMNIST Letter [49]   | 100 | 1152   | 26     | 3
Caps-EN2 | EMNIST Balanced [49] | 100 | 1152   | 47     | 3
Caps-EN3 | EMNIST By Class [49] | 100 | 1152   | 62     | 3
Caps-SV1 | SVHN [50]            | 100 | 576    | 10     | 3
Caps-SV2 | SVHN [50]            | 100 | 576    | 10     | 6
Caps-SV3 | SVHN [50]            | 100 | 576    | 10     | 9

[Figure 4: The overall execution time breakdown of CapsNets on GPU across different layers (Conv layer, L Caps layer, H Caps layer, FC layer). The red line represents the actual inference time.]

[Figure 5: The breakdown of pipeline stall cycles (memory access, synchronization, lack of resource, instruction fetch, other) during RP execution on a Tesla P100 Pascal GPU.]

3. CHARACTERIZATION AND ANALYSIS

While CapsNet is gaining popularity in both academia and industry, the performance characterization of CapsNet on modern high-performance platforms has been largely neglected. Given that the GPU has become the major platform for executing CapsNet due to its high computation capability and deep optimization for matrix operations (of which CapsNet has a large amount), we adopt several state-of-the-art NVIDIA GPU platforms to conduct a comprehensive characterization of the execution behaviors of the various CapsNets listed in Table 1. We use 4 different datasets and 12 corresponding CapsNets with CapsNet-MNIST-like structures (Sec.2.1), with different configurations of batch size (BS in Table 1), L capsules, H capsules and routing iteration number. These CapsNets' inference is processed via the PyTorch framework [51] with the latest deep learning library (i.e., cuDNN [52]), which already enables the state-of-the-art CNN optimizations [53].

3.1 Overall Evaluation for CapsNet Inference

Fig.4 demonstrates the performance breakdown of each layer in the overall CapsNet execution. It shows that, across different CapsNet configurations, the routing procedure (RP) accounts for an average of 74.62% of the entire inference time, becoming the major performance bottleneck. We further conduct detailed analysis of the performance results of CapsNets on GPU and make the following observations:

Observation 1: ineffectiveness of batched execution. One common strategy to accelerate CNNs is to conduct batched execution for improving hardware utilization, especially when the input dataset is large. However, it cannot improve the RP's performance during inference. As Fig.4 illustrates, with increasing batch size (i.e., Caps-MN1 -> Caps-MN3), the overall CapsNet inference time increases; meanwhile, the RP proportion also expands with batch size.

Observation 2: sensitivity to network scaling. Table 1 shows the network size (formed by a combination of L capsules, H capsules and routing iterations) for each individual case, e.g., Caps-SV1 being the smallest. The red curve in Fig.4 also demonstrates that the overall inference time and RP's percentage generally increase when scaling up the network size (e.g., comparing Caps-MN1, Caps-CF1, Caps-EN1 and Caps-SV1). This implies that RP's execution time is sensitive to network size as well.

To summarize, even with a highly optimized deep learning library, the RP execution time on GPU cannot be effectively reduced through general CNN optimization techniques such as batched execution. Moreover, it exhibits a certain level of sensitivity to network scaling. Both of these factors make the RP execution a dominating performance bottleneck for CapsNet inference, especially with the growing size and complexity of future CapsNet structures [45, 46].

3.2 Root Causes for Inefficient RP Execution

To understand the root causes of RP's inefficiency on GPU, we use NVprofiler [54] to collect runtime GPU stats for comprehensive analysis. We observe two root causes for poor RP execution efficiency on GPU-like architectures:

(1) Significant Off-Chip Memory Accesses: We profile the utilization of several major GPU function units during RP execution on an NVIDIA Tesla P100 GPU. We observe that the arithmetic logic unit (ALU) is lightly utilized (i.e., on average only 38.6% across the investigated benchmarks), while the load/store unit (LDST) is heavily stressed, with an average utilization of 85.9%. This implies that CapsNets' RP phase fails to effectively leverage the GPU's strong computation capability and is severely limited by the intensive off-chip memory accesses. We further investigate the factors that may contribute to GPU pipeline stalls during RP, including off-chip memory access, barrier synchronization, lack of resource, etc. Fig.5 profiles the individual contribution of each major factor to the overall pipeline stall cycles. As can be seen, the majority of the pipeline stalls are induced by memory accesses (i.e., on average 44.64%). This further confirms that RP performance is significantly limited by the off-chip memory accesses. This is caused by a combination of massive intermediate variables from RP execution and limited on-chip storage. Fig.6(a) illustrates the ratio of the size of RP's intermediate variables to the on-chip memory sizes of different generations of NVIDIA GPUs. As can be seen, the size of RP's intermediate variables far exceeds the GPU on-chip storage. Given the iterative computation pattern of RP, these variables (e.g., \hat{u}, s_j, v_j, b_{ij}, c_{ij}) need to be periodically loaded into the GPU cores from the off-chip memory due to the limited on-chip storage. Moreover, the intermediate variables are not sharable among different input batches, which also explains the ineffectiveness of batched execution observed in Sec.3.1. Due to the large data volume and the lack of temporal value similarity among these variables, software-level schemes such as register manipulation and shared memory multiplexing are also not very effective.
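The scale of these intermediate variables can be estimated directly from the network configuration. The sketch below is an illustrative back-of-envelope estimate only (assuming FP32 variables and the Caps-MN1 configuration from Table 1, with capsule dimensions C_L = 8 and C_H = 16 taken from the CapsNet-MNIST structure in Fig.2); the paper's exact ratios in Fig.6(a) depend on its full accounting of temporaries.

```python
def rp_intermediates_mb(NB=100, NL=1152, NH=10, CL=8, CH=16, bytes_per=4):
    u_hat = NB * NL * NH * CH   # prediction vectors, Eq.(1)
    s = NB * NH * CH            # aggregated inputs, Eq.(2)
    v = NB * NH * CH            # H capsules, Eq.(3)
    b = NL * NH                 # agreement accumulators, Eq.(4)
    c = NL * NH                 # routing coefficients, Eq.(5)
    return (u_hat + s + v + b + c) * bytes_per / 2**20

# Caps-MN1: ~70 MB of unshareable intermediates per batch, versus e.g.
# 5.31 MB of on-chip storage on a Tesla P100.
print(f"{rp_intermediates_mb():.1f} MB")
```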

[Figure 6: (a) Ratio of intermediate variables' size to on-chip storage of different GPUs; (b) the impact of the on-chip storage sizes of state-of-the-art GPUs on RP's execution. A: 1.73MB (K40m), B: 5.31MB (Tesla P100), C: 9.75MB (RTX2080Ti), D: 16MB (Tesla V100).]

[Figure 7: The impact of memory bandwidth on the overall RP performance. GDDR5: 288GB/s (K40m), GDDR5X: 484GB/s (GTX 1080Ti), GDDR6: 616GB/s (RTX 2080Ti), HBM2: 897GB/s (Tesla V100).]

(2) Intensive Synchronization: Fig.5 also indicates that frequent barrier synchronization is the second major contributor (i.e., on average 34.45%) to the pipeline stalls. These synchronization overheads are induced by the syncthread() calls, which coordinate shared memory accesses for all threads in one thread block. There are two major factors causing the frequent synchronization during RP execution: (i) the RP execution contains numerous aggregation operations (e.g., Eq.(2)), inducing massive data communication between threads through shared memory; (ii) the size of the intermediate variables far exceeds the shared memory size, leading to frequent data loading from global memory. Thus, syncthread() calls occur frequently to avoid potential write-after-write and write-after-read hazards.

To address the issues above, we attempt two naive solutions: scaling up the on-chip and the off-chip memory capacity. Fig.6(b) shows the impact of the on-chip memory sizes of different generations of NVIDIA GPUs on RP execution. We observe that increasing the on-chip memory size can help alleviate the challenges above, but not very effectively, e.g., only up to an average of 14% performance improvement for the 16MB V100. This is because the size of the nonsharable intermediate variables from RP still far exceeds the on-chip storage of current GPUs, as shown in Fig.6(a). Similarly, Fig.7 shows that only increasing the memory bandwidth from 288GB/s GDDR5 to 897GB/s HBM slightly improves the overall RP performance, by an average of 26%. This indicates that higher off-chip memory bandwidth can only solve a small part of the problem but by itself does not reduce the high intensity of the off-chip memory accesses. Therefore, to significantly improve CapsNet's inference performance, we need a customized solution that addresses the root causes of RP's inefficient execution.

4. BASIC IDEA: PROCESSING-IN-MEMORY + PIPELINING

As discussed previously, we aim to address CapsNets' significant off-chip memory accesses caused by massive unshareable intermediate variables, and the intensive synchronization due to numerous aggregation operations in RP. Meanwhile, we also want to utilize the excellent core computing capability provided by modern GPUs for deep learning's matrix operations. Thus, we propose a hybrid computing engine named "PIM-CapsNet", shown in Fig.8. It utilizes the GPU's native on-chip units as the host for fast processing of layers such as Conv and FC, while pipelining with an off-chip in-memory acceleration that effectively tackles RP's inefficient execution. For example, since multiple input sets are generally batched together to be concurrently processed in RP to avoid the local optimal solution of the routing coefficients [55], the host processors can start processing Conv/FC operations from different batches of the input sets while waiting for RP's results from in-memory processing on the current batch, forming an execution pipeline. Furthermore, the in-memory accelerators of our PIM-CapsNet design can hierarchically improve RP's execution efficiency by minimizing data movement and maximizing parallel processing. Note that our proposed design is a general optimization solution that is applicable to the different routing algorithms used in RP.

To build our in-memory acceleration capability for RP, we resort to one of the emerging 3D stacked technologies, i.e., the hybrid memory cube (HMC) [38], which has become a promising PIM platform [39, 40, 41, 42, 43]. The major reason for replacing the current GPU's off-chip memory (e.g., GDDRx or HBMs) with HMC in our design is that they lack a logic layer for in-memory integration. As illustrated in Fig.9, HMC stacks several DRAM dies on the CMOS logic layer as a cube (specification 2.1 [56]). The memory cube connects with the host processor (i.e., GPU cores in our case) via fully-duplex links which can provide up to 320GB/s of external memory bandwidth. Additionally, the logic layer is split into 32 sub-memory controllers, each of which communicates with its local DRAM banks through Through-Silicon Vias (TSVs), which together provide an internal memory bandwidth of 512GB/s [38, 41]. A sub-memory controller and its local DRAM partitions (each partition contains multiple DRAM banks) form the vault architecture, highlighted in the red dashed box. The logic layer receives system commands and routes memory accesses to different vaults. A crossbar is integrated in the logic layer to support the communication between the SerDes links and the vaults. Note that relatively simple computation logic can be integrated onto HMC's logic layer, which can directly access data from the vaults via the memory controllers and benefit from the large internal memory bandwidth. This layer is thus very suitable for integrating in-memory accelerators for RP execution. Next, we will discuss the detailed PIM-CapsNet design.

[Figure 8: The overview of the PIM-CapsNet design.]

[Figure 9: Left: HMC organization. Right: HMC block diagram. The red dashed box marks the vault structure.]

5. PIM-CAPSNET ARCHITECTURE DESIGN

There are two key design objectives in PIM-CapsNet: (i) maintaining workload balance with minimal data communication at the inter-vault level (Sec.5.1 and Sec.5.3), and (ii) maximizing parallel processing at the intra-vault level within the architecture design constraints (Sec.5.2 and Sec.5.3).

5.1 Inter-Vault Level Design

A typical HMC design adopts a crossbar switch to support the inter-vault communication [56], and the overall HMC performance and power/thermal efficiency can be drastically impacted by increasing data movement across vaults. Employing more complicated data flow management (e.g., a network-on-chip) on the HMC logic layer would further exacerbate the thermal issue. In our proposed design, we leverage the unique features of the RP execution and intelligently distribute workloads across vaults with minimal inter-vault communication.

5.1.1 RP's Unique Feature: Highly Parallelizable in Multiple Dimensions

As an interesting feature, the operations of the RP equations (Sec.2.2) are highly parallelizable. For example, in Eq.(1), the vector-matrix multiplications for all the low- to high-level (L-H) capsule pairs are independent. We define such independent operations on L capsules as parallelism in the L-dimension, while independent operations on H capsules are defined as parallelism in the H-dimension. Additionally, if operations corresponding to different batches are independent, they are defined as parallelism in the B-dimension. Thus, we make the following key observations:

Observation I: The operations of each equation in the RP can be partitioned into multiple independent sub-operations on at least one of the three dimensions (L, H, or B), suggesting a highly parallelizable feature.

Table 2 further demonstrates through which dimensions the five equations of the dynamic routing procedure can be parallelized. Based on it, we also make the second key observation:

Observation II: Not all the RP equations can be concurrently executed through the same dimension.

Table 2: Possible Parallelizable Dimensions

     | Batch (B-dimension) | Low-level Caps (L-dimension) | High-level Caps (H-dimension)
Eq.1 | x                   | x                            | x
Eq.2 | x                   |                              | x
Eq.3 | x                   |                              | x
Eq.4 |                     | x                            | x
Eq.5 |                     | x                            |

Observation I indicates that the inter-vault data communication can be reduced by parallelizing the independent sub-operations of each equation on one chosen dimension and then distributing them across the HMC vaults. By doing so, the major communication across vaults is only required by the aggregation operations on that dimension (aggregations required on the other two dimensions are performed locally within vaults), which is relatively low. Observation II, however, emphasizes that the inter-vault communication cannot be fully eliminated for RP, as none of the three dimensions can support the parallelization of all the RP equations. When distributing the entire RP workload on a certain dimension, some related data have to be reloaded to another designated vault for the aggregation operations.

5.1.2 Parallelization and Workload Distribution

Since distributing the workloads on different dimensions leads to different communication overheads and execution patterns, it is important to apply an intelligent workload distributor to achieve the optimal design for power and performance. To minimize the inter-vault communication, our distributor distributes the RP workload on a single chosen dimension only. Fig.10 illustrates an example of the RP execution flow when distributing on the B-dimension. As it shows, the RP operations of Eq.1, 2 and 3 (steps 1, 2, 3) exhibit parallelism along the B-dimension. Additionally, the multiplication operations of Eq.4 (step 4) (i.e., v^k_j \times \hat{u}^k_{j|i}) can also be parallelized along the B-dimension. These parallelizable workloads can be divided into snippets (i.e., the green blocks) and distributed across the HMC vaults. Note that typical CapsNet workloads will generate far more snippets than the number of vaults in the HMC (e.g., up to 32 vaults in HMC Gen3).

[Figure 10: The execution diagram for the RP procedure with B-dimension distribution. The workloads in green blocks can be split across vaults, but the workloads in purple blocks cannot be distributed via the B-dimension.]

Due to the aggregation requirements on all the dimensions during RP execution, there are always some workloads that cannot be processed in parallel on the selected dimension, leading to workload imbalance. For instance, the remaining operations of the RP in Fig.10 (i.e., the purple blocks), including partial Eq.4 (step 5) and Eq.5 (step 6), cannot be divided into snippets due to the lack of parallelism on the B-dimension. Additionally, these workloads usually require data to be aggregated in one place, making it hard to utilize all the HMC computation resources. Furthermore, the data size requested by the aggregation operations can be large, which may cause massive inter-vault communication and even crossbar stalls. To reduce the inter-vault communication and increase hardware utilization, we propose to pre-aggregate the partial aggregation operations inside each vault for the correspondingly allocated snippets, as sketched below. For example, when the RP workloads are distributed on the B-dimension, the pre-aggregation can be performed for Eq.(4) to combine the b_{ij} from the snippets assigned to a specific vault before performing the global inter-vault aggregation.

. 6

. . .

. . . . s0 v0

. × . . × . SIZEx Data. size of a. variable or packet head&tail

. . amount of data movements M can be represented as: . . k s be performeds fori Eq.(4)vi to combine the b from the p ij aMB = I × [(Nvault − 1) × NL × NH × (SIZEb + SIZEpkt) D-c ij t n re 0 k 0 k (8) snippets assigned to a specific vault before performing fe u ~u × û ~û × s v ×

. f 0 i 0j ij j j

. . . i

. D 0 k +(N 0 k − 1) × N × N × (SIZE + SIZE )] . . . 1 vault 2 L 3 H 4 cij pkt 5

. × . . × . u 0~u i × û 0j~û ij × sj vj × b0j~bij global inter-vault aggregation. 0 k 0 k 0 k u 0~u iw0j~w× ij û 0j~û ijc0j~c×ij sj vj û 0j~×û ij b0j~bij softmax 0 k w0j~wij c0j~cij û 0j~û ij Distribution on L-dimension:0 k As Table 2 illustrates, Guiding Distribution via an Execution Score: To w0j~wij c0j~cij û 0j~û ij achieve the optimal power/thermal and performance re- the RP operations of Eq.1,4,5 can be divided into work- load snippets on the L-dimension. Besides, partial oper- sults, we propose a metric called execution score S to k guide workload distribution, which quantitatively esti- ations from Eq.2 (i.e.,u ˆj|i ×cij) also exhibit parallelism mates the RP execution efficiency under a given work- on the L-dimension. Thus, E can be represented as:

load distribution. S considers workload assignment to NL EL = NB × d e × NH × [2I(2CH − 1) + CH (2CL − 1)] (9) each vault, inter-vault communication overheads, device- Nvault dependent factors and inter-vault memory bandwidth. The inter-vault communication contains data all-reducing k S is modeled as S = 1/(αE + βM). for of sj and broadcasting vj : where E represents the largest workloads distributed ML = I × [NB × (Nvalut − 1) × NH × (SIZEsk + SIZEpkt) to a single vault (since even distribution across vaults is j (10) +NB × (Nvault − 1) × NH × (SIZE k + SIZEpkt)] typically not feasible), which can be quantified based on vj the amount of allocated operations. M represents the Distribution on H-dimension: As Table 2 presents, amount of inter-vault data movement. Both E and M only Eq.5 cannot be parallelized on this dimension. Hence, are affected by the distribution strategy and the model E can be represented as
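The offline selection logic itself is then simple, as the sketch below shows (illustrative only: `E_fns`/`M_fns` stand for the per-dimension expressions of Eqs.(6)-(12) given after this sketch; the H-dimension pair is written out as a concrete example, and alpha/beta are the device-dependent coefficients).

```python
import math

def E_H(c):  # Eq.(11): per-vault workload under H-dimension distribution
    return (c['N_B'] * c['N_L'] * math.ceil(c['N_H'] / c['N_vault'])
            * c['C_H'] * (2 * c['C_L'] - 1 + 2 * c['I']))

def M_H(c):  # Eq.(12): inter-vault traffic under H-dimension distribution
    return c['I'] * ((c['N_vault'] - 1) * c['N_L'] * (c['SIZE_b'] + c['SIZE_pkt'])
                     + c['N_L'] * (c['SIZE_c'] + c['SIZE_pkt']))

def choose_dimension(cfg, E_fns, M_fns, alpha, beta):
    """Offline: pick the distribution dimension maximizing S = 1/(alpha*E + beta*M)."""
    scores = {d: 1.0 / (alpha * E_fns[d](cfg) + beta * M_fns[d](cfg))
              for d in ('B', 'L', 'H')}
    return max(scores, key=scores.get)
```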

configuration. Finally, α and β represent the device- NH EH = NB × NL × d e × CH × [2CL − 1 + 2I] (11) dependent coefficients, determined by HMC frequency Nvault and inter-vault memory bandwidth, respectively. The inter-vault communication contains data all-reducing The calculation of S is independent of the online pa- for of bij and broadcasting cij:
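For concreteness, the following minimal Python sketch (not the authors' compiler; the byte sizes size_var and size_pkt, and the coefficients alpha and beta, are hypothetical placeholders) evaluates the simplified workload models of Eqs.(7), (9) and (11) and the movement models of Eqs.(8), (10) and (12), then picks the dimension with the highest score S = 1/(αE + βM):

```python
import math

def pick_dimension(N_B, N_L, N_H, C_L, C_H, I, N_vault=32,
                   size_var=4, size_pkt=2, alpha=1e-9, beta=1e-8):
    """Return the distribution dimension with the highest execution score S."""
    c = lambda n: math.ceil(n / N_vault)
    # B-dimension: workload of Eq.(7); movement of Eq.(8), assuming all
    # routing variables share the same FP32 size (size_var).
    E_B = c(N_B) * N_L * N_H * ((4 * I - 1) * C_H + 2 * C_L * C_H - I)
    M_B = I * 2 * (N_vault - 1) * N_L * N_H * (size_var + size_pkt)
    # L-dimension: Eq.(9) and Eq.(10).
    E_L = N_B * c(N_L) * N_H * (2 * I * (2 * C_H - 1) + C_H * (2 * C_L - 1))
    M_L = I * 2 * N_B * (N_vault - 1) * N_H * (size_var + size_pkt)
    # H-dimension: Eq.(11) and Eq.(12); the (N_vault-1) b_ij reduce plus
    # the c_ij broadcast collapse to N_vault packet rounds per capsule.
    E_H = N_B * N_L * c(N_H) * C_H * (2 * C_L - 1 + 2 * I)
    M_H = I * N_L * N_vault * (size_var + size_pkt)
    S = {d: 1.0 / (alpha * E + beta * M)
         for d, (E, M) in {"B": (E_B, M_B), "L": (E_L, M_L),
                           "H": (E_H, M_H)}.items()}
    return max(S, key=S.get), S

# Example: the MNIST CapsNet of [14] (1152 8-D low-level capsules,
# 10 16-D high-level capsules, 3 DR iterations), batch size 128.
print(pick_dimension(N_B=128, N_L=1152, N_H=10, C_L=8, C_H=16, I=3))
```

Because none of these inputs depend on runtime values, such a score table can be computed once per network configuration and baked into the compiler-generated schedule.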

5.2 Intra-Vault Level Design

In this section, we propose the intra-vault level design that effectively processes the sub-operations of each equation allocated to a vault. We target the design for the IEEE-754 single precision (FP32) format, which provides a sufficient precision range for CapsNet workloads [14]. Our design can also fit other data formats with minor modifications.

5.2.1 Intra-Vault Level Workload Distribution

In a basic HMC design, the logic layer of each vault contains a sub-memory controller to handle the memory accesses. In order to conduct RP-specific computation, we introduce 16 processing elements (PEs) into each vault. This design overhead has been tested to satisfy both the area and thermal constraints of the HMC [43] (see the detailed overhead analysis in Sec.6.5). These PEs (integrated onto the logic layer) are connected to the sub-memory controller in the vault, as shown in Fig.11(left). Note that the number of parallel sub-operations on a certain dimension is generally orders of magnitude higher than the number of vaults in the HMC. In other words, many parallel sub-operations will be allocated to the same vault. Hence, they can easily be distributed on the same dimension and concurrently processed by the PEs without introducing additional communication overheads. There may exist some extreme cases in which the number of parallel sub-operations allocated to a vault is smaller than the number of PEs, leading to low PE utilization. Since most equations in RP can be parallelized on more than one dimension, the workloads can then be distributed along a different dimension that produces enough parallelism to fully utilize the PE resources, as sketched below.
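The following sketch illustrates this fallback (ours, not the hardware scheduler's logic): inside a vault, pick any dimension that exposes at least as many parallel sub-operations as there are PEs, then hand out sub-operations round-robin.

```python
def schedule_vault(parallelism: dict, n_pes: int = 16):
    """parallelism maps each dimension usable by the current equation to the
    number of parallel sub-operations it exposes in this vault (illustrative)."""
    # Prefer a dimension that fills all PEs; otherwise take the widest one.
    dim = next((d for d, n in parallelism.items() if n >= n_pes),
               max(parallelism, key=parallelism.get))
    n = parallelism[dim]
    # Round-robin the n sub-operations over the PEs (PEs idle only if n < n_pes).
    return dim, {pe: list(range(pe, n, n_pes)) for pe in range(n_pes)}

# Example: only 8 sub-operations on H in this vault, but 1152 on L, so use L.
dim, assignment = schedule_vault({"H": 8, "L": 1152})
```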

Figure 11: Intra-vault level Architecture Design.

5.2.2 Customized PE Design

There have been some studies [43, 42, 57] that integrate adders and multipliers onto the HMC logic layer to perform the multiply-accumulate (MAC) operation, which is essential for deep learning (e.g., CNNs). However, CapsNet's RP execution involves other operations beyond MAC. As discussed in Sec.2.2, among the five RP equations, Eq.1, Eq.2 and Eq.4 can be translated into MAC operations. But the remaining operations, in Eq.3 and Eq.5, involve more complex functions such as division (Eq.3), inverse square root (Eq.3) and the exponential function (Eq.5), which require complicated logic design and result in large performance and power/thermal overheads [58, 59].

Operation Approximation: To gain high performance while maintaining low hardware design complexity, we propose approximations that simplify these operations with negligible accuracy loss. For the division and inverse square root functions on FP32, we apply bit shifting [60], which is widely adopted in the graphics processing domain [61, 62, 63]. For the exponential function, we approximate the operation as follows. The original exponential function can be transformed into a power function with base 2 [64]:

$e^x = 2^{\log_2(e) \times x} = 2^y = 2^{\lfloor y \rfloor}(1 + 2^{y - \lfloor y \rfloor} - 1)$   (13)

where $\lfloor y \rfloor$ is the integer part of y, and $y - \lfloor y \rfloor$ is the decimal part.

Fig.12(a) illustrates an example of the representation transfer. First, the ep fields of both A and B are subtracted by the bias b to obtain their real exponents (i.e., ep − b), as shown in Fig.12(a) 1 and 3. Then, the most significant ep−b+1 bits of A's significand (i.e., 1+fraction) are chunked and filled into the least significant ep−b+1 bits of C's exponent field, with the remaining exponent bits in C filled with zeros, as shown in Fig.12(a) 2. We conduct a similar operation to transfer B into C's fraction. Since B is a fraction value, its exponent is a non-positive number: B's significand is logically shifted right by |ep − b| bits, and then its most significant 23 bits are filled into C's fraction field, as shown in Fig.12(a) 4.

The above two transfers can be considered as bit shifting on the significand (i.e., 1+fraction) in FP32 format, with the distance and direction determined by the real exponent (i.e., ep − b). As illustrated in Fig.12, B's exponent can be increased to match the exponent of A, with its significand bits logically shifting right. By doing this, the fraction representation can be described as D in Fig.12(b). Given the matched real exponent value (i.e., ep − b), the two representations A and D in Fig.12(b) now share identical bit shifting operations. Additionally, A and D correspond to the integer and fraction parts of the ExpResult (i.e., the exponential function's result), respectively, and there is no overlap between their fraction bits. Thus, the two representations can be combined (i.e., A OR D), followed by a unified bit shifting operation on the significand. Note that the exponent matching procedure (B → D) could truncate several least significant bits that would originally be mapped into the ExpResult.

Since the exponent matching and the combination of two FP32 numbers can simply be treated as an FP32 addition, we can compute the ExpResult as an addition of the exponent and fraction representations (i.e., $\lfloor y \rfloor + b + 2^{y - \lfloor y \rfloor} - 1$) followed by the bit shifting operations. Because the power-of-2 function causes high complexity in both execution and logic design, we propose to simplify it as follows. The above polynomial can be expanded as $(y + 2^{y - \lfloor y \rfloor} - (y - \lfloor y \rfloor) + b - 1)$; then, the average value Avg of $(2^{y - \lfloor y \rfloor} - (y - \lfloor y \rfloor))$ can be obtained by integrating that term over $(y - \lfloor y \rfloor) \in [0, 1)$, which is fixed and can be computed offline. With y expressed in terms of x, the exponential function can be described as follows:

$ExpResult \simeq BS(\log_2(e) \times x + Avg + b - 1)$   (14)

where BS is the bit shifting operation with the information from ep − b, and $\log_2(e)$ is a constant computed offline.

Accuracy Recovery: In the worst case, several of the lowest significand bits may be truncated when mapping from D to C, which can cause some accuracy loss. To minimize this impact, we analyze 10,000 exponential executions to collect the value differences between the approximated and the original results. During the approximated execution, the accuracy loss is recovered by enlarging the results by the mean percentage of the value difference. Note that our accuracy recovery scheme introduces only one additional multiplication operation during inference, which guarantees high performance and low design complexity compared to other exponential approximation methodologies, e.g., applying look-up tables [59]. Sec.6.5 shows the detailed accuracy analysis.

Final PE Structure: According to the discussion above, the special functions can be simplified into a combination of addition, multiplication, and bit shifting operations. Thus, our intra-vault PE employs adders, multipliers, and bit shifters to construct these special functions, as shown in Fig.11. Specifically, our PE enables flow configuration via multiplexers (MUX) to support different types of operations. For example, PE execution flow 1 2 is for MAC operations; 3 2 1 2 1 is for inverse square root operations; and 1 2 2 3 is for the exponential function.

5.3 Contention Reduction in CapsNet Design

In this section, we discuss strategies to combat the memory-level contention in our PIM-CapsNet design.
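As a concrete illustration of Eqs.(13)-(14), the sketch below (a software emulation of the hardware flow, not the PE netlist; the hardware uses integer adders and shifters, whereas here the unified bit shift is emulated by scaling into the FP32 exponent/fraction fields, and the sampling range used to calibrate the recovery factor is our assumption):

```python
import math, random, struct

LOG2E = math.log2(math.e)
# Avg = integral of (2^f - f) over f in [0, 1) = 1/ln(2) - 1/2 ~ 0.9427
AVG = 1.0 / math.log(2.0) - 0.5
BIAS = 127  # FP32 exponent bias b

def approx_exp(x: float) -> float:
    """Eq.(14): ExpResult ~ BS(log2(e)*x + Avg + b - 1)."""
    y = LOG2E * x
    # Scaling by 2^23 drops the integer part of (y + Avg + b - 1) into the
    # exponent field and its fraction into the mantissa field, i.e., the
    # unified bit shift of the combined representation (valid for x > ~ -87).
    bits = int((y + AVG + BIAS - 1) * (1 << 23))
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# Accuracy recovery: one extra multiply by the mean exact/approx ratio,
# calibrated offline over 10,000 samples (the range [-8, 8] is assumed).
samples = [random.uniform(-8.0, 8.0) for _ in range(10000)]
RECOVER = sum(math.exp(s) / approx_exp(s) for s in samples) / len(samples)

def approx_exp_recovered(x: float) -> float:
    return approx_exp(x) * RECOVER

print(approx_exp(1.0), approx_exp_recovered(1.0), math.exp(1.0))
```

With this construction the raw approximation deviates from the true exponential by a few percent, and the single calibrated multiplication pulls the average error back toward zero, mirroring the recovery scheme above.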

Figure 12: (a) An example of transferring the exponent representation A (i.e., $\lfloor y \rfloor + b$) and the fraction representation B (i.e., $2^{y - \lfloor y \rfloor} - 1$) to the exponential function's result C in FP32 format. (b) Combining the exponent representation A and the fraction representation (i.e., D, transferred from B), and applying a unified bit shifting to obtain the exponential function's result C.

Figure 13: (a) The default address mapping of 8GB HMC Gen3; (b) our address mapping.

Figure 14: The Evaluation Infrastructure.

5.3.1 Memory Address Mapping Mechanism

In the default HMC design, the memory access granularity is 16 bytes, which is defined as a block, and a sequential interleaving mapping mechanism is applied to benefit from better bandwidth utilization. Moreover, MAX block is introduced to define the maximum number of blocks that one bank can provide at a time; its size can be set to 32B, 64B, 128B, or 256B [38]. In this study, we redefine MAX block as a sub-page in order to differentiate it from a block, where one sub-page is composed of multiple blocks. Fig.13(a) illustrates the default address mapping from the HMC Gen3 specification [38]. The lowest 4 bits and the highest bit are ignored, and the remaining bits describe the block address. From its lower to higher bits, a block address is composed of several fields: the block ID in the sub-page (whose number of bits is determined by the sub-page size), the 5-bit vault ID for 32 vaults, the 4-bit bank ID for 16 banks per vault, and the sub-page ID in the bank. As can be seen, sub-pages with consecutive addresses are first spread sequentially across different vaults and then across different DRAM banks.

Note that our inter-vault level design requires consecutive blocks to be allocated into one vault to avoid high inter-vault communication. This can easily be addressed by moving the vault ID field up to the highest field of the block address (as shown in Fig.13(b)), so that the vault ID remains unchanged during intra-vault memory address mapping. However, at the intra-vault level, the PEs process their workloads in parallel and concurrently generate data requests, which may result in serious bank conflicts and vault request stalls (VRS).

Interestingly, we observe that most of the concurrent PE requests assigned to the same bank actually visit different blocks. Based on this, we propose a new memory addressing scheme that distributes these blocks across different banks, in order to significantly alleviate bank conflicts and decrease the VRS. However, simply distributing blocks to different banks could further increase the VRS, since one PE may request multiple consecutive blocks at a time; these blocks would then reside in multiple banks, leading to multiple accesses to those banks and higher bank conflicts. To ensure the consecutive blocks required by one PE are stored in the same bank, our scheme dynamically determines the sub-page size according to the size of the requested data. As shown in Fig.13(b), we leverage bit 1 to bit 3 of the lowest 4 ignored bits as the indicator of the sub-page size for the data requests, where the range "000" to "100" represents sub-page sizes from 16B to 256B. Given that the data requests from the PEs and the host GPU need to be allocated into different banks, the indicator bits are dynamically assigned by the page table during virtual-to-physical address translation, according to the storage requested by each variable involved in each execution thread.
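A minimal sketch of the remapped decode path of Fig.13(b) follows (illustrative only: the 5-bit vault ID and 4-bit bank ID match the text above, but the exact widths and offsets of the remaining fields for an 8GB device are our assumption):

```python
def decode_pim_address(addr: int):
    """Split a 33-bit HMC address per the remapping of Fig.13(b)."""
    indicator = (addr >> 1) & 0b111         # bits 1-3 of the ignored low bits:
    block_bits = indicator                  # "000".."100" -> 16B..256B sub-pages
    subpage_bytes = 16 << indicator
    blk = (addr >> 4) & ((1 << 28) - 1)     # block address (4 low + 1 high bit ignored)
    vault_id = blk >> 23                    # 5-bit vault ID, now the highest field
    bank_id = (blk >> 19) & 0xF             # 4-bit bank ID, 16 banks per vault
    low = blk & ((1 << 19) - 1)
    block_id = low & ((1 << block_bits) - 1)   # block ID inside the sub-page
    subpage_id = low >> block_bits             # sub-page ID in the bank
    return vault_id, bank_id, subpage_id, block_id, subpage_bytes
```

Because the vault ID occupies the highest field, a PE walking consecutive blocks never crosses a vault boundary, while the indicator keeps all blocks of one request inside a single bank.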

5.3.2 Identifying Memory Access Priority

During CapsNet execution, resource contention can occur when both the host GPU and the PEs concurrently issue data requests to the same vault. Although the Conv/FC layers exhibit much better data locality than the RP, they may still need to periodically request data from the HMC. By default, the priority of a request is determined by its arrival time, but this can cause stalls if both sides request access to the same bank in a vault, which may occur more frequently after applying our new address mapping.

To address this issue, we propose a runtime memory access scheduler (RMAS) that dynamically identifies the priority of memory requests from both the host GPU and the vault PEs. We first observe that, with our address mapping mechanism, consecutive data are likely to be distributed across the banks within a vault instead of being scattered over all the vaults. Thus, it is highly likely that each consecutive data request from the host will access only a single vault or a few vaults at a time instead of all of them. This provides opportunities for some vaults to serve the GPU memory requests in rotation without affecting the overall HMC execution.

To quantify the impact of vault access priority, the runtime scheduler (RMAS) first collects the information of the operations issued by the HMC and the number of PE requests (Q) in the vault request queues of the HMC. It also collects information about the operations issued from the host GPU, i.e., which and how many vaults these operations are requesting from the HMC (defined as n_max).

The collected information is associated with the HMC performance overhead of granting access priority to either side. Thus, the performance impact of serving the two types of access requests from the HMC and the GPU can be quantified via the following overhead function:

$\kappa = \gamma_v \times n_h \times Q + \gamma_h \times \frac{n_{max}}{n_h}$   (15)

where κ represents the quantified performance impact; $\gamma_v$ and $\gamma_h$ represent impact factors determined by the type of operations issued from the HMC and the host GPU, respectively (e.g., memory-intensive operations correspond to a large γ since their performance is more sensitive to memory bandwidth than that of computation-intensive operations); Q is the average number of PE requests in the request queues of the targeted vaults; and $n_h$ is the number of vaults that grant access priority to the host GPU, which lies in the range $[0, n_{max}]$. If $n_h$ is 0, all the target vaults first process the current HMC PE requests before processing the GPU requests; if $n_h$ is $n_{max}$, all the target vaults grant priority to the GPU requests. Setting $d\kappa/dn_h = 0$ shows that the impact is minimized (i.e., $\min(\kappa)$) when $n_h = \sqrt{n_{max} \gamma_h / (Q \gamma_v)}$, clamped to $[0, n_{max}]$. The RMAS then sends control signals to the $n_h$ vaults that give the host GPU higher access priority. Note that a vault with a smaller Q has higher priority to be chosen by RMAS, to further reduce the impact on execution.
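The closed-form rule above is cheap enough to evaluate every scheduling epoch; a small sketch (the γ values in the example are hypothetical, and this is our illustration rather than the RMAS hardware):

```python
import math

def grant_gpu_priority(n_max: int, Q: float, gamma_v: float, gamma_h: float) -> int:
    """Number of vaults n_h that should prioritize GPU requests, per Eq.(15)."""
    if Q <= 0:                 # no pending PE requests: let the GPU go first everywhere
        return n_max
    n_h = math.sqrt(n_max * gamma_h / (Q * gamma_v))  # minimizer of Eq.(15)
    return max(0, min(n_max, round(n_h)))

# e.g., GPU touching 8 vaults, 12 queued PE requests per vault on average,
# PE side memory-bound (large gamma_v), GPU side compute-bound (small gamma_h):
print(grant_gpu_priority(n_max=8, Q=12.0, gamma_v=1.0, gamma_h=0.25))
```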
Table 4: Platform Specifications

Host Processor (GPU)
  Shading Units: 3584 @ 1190MHz
  On-chip Storage: L1 cache/shared: 24KB x 56; L2 cache: 4MB
  Default Memory: HBM, 8GB, 320GB/s
HMC
  Capacity: 8GB, 32 vaults, 16 banks/vault
  Bandwidth: external: 320GB/s; internal: 512GB/s
  No. of PEs per Vault: 16
  Frequency: 312.5MHz

6. EVALUATIONS

6.1 Experimental Setup

Fig.14 shows the infrastructure used to evaluate PIM-CapsNet. We first employ Pytorch [51], a popular open-source machine learning framework that supports dynamic computation graphs, to execute the CapsNet on our host GPU. We then design a physical-simulator cooperated platform, which takes the execution status of CapsNet provided by Pytorch to obtain the detailed performance and energy results when applying PIM-CapsNet. On the physical side, we adopt the Nvidia Tesla P100 [65] as our host processor to evaluate the performance and energy consumption of the Conv/PrimeCaps/FC layers of CapsNet. The detailed execution time and power information for these layers are captured using NVprofiler [54] and Nvidia-smi [66]. On the simulator side, we leverage NVprofiler to collect the event trace from the host and pass it to a modified HMC-sim 3.0 [67] to simulate the computation and memory accesses in the HMC. Considering that HMC-sim 3.0 cannot provide precise execution cycles and power information for the logic layer design, we conduct a gate-level simulation on Cadence [68] to measure the execution latency and power consumption of our logic design (PE). We then integrate the gate-level results and our PIM design into HMC-sim to obtain the performance and energy consumption of the RP execution. Finally, since the execution of CapsNet is pipelined across the host processor and the HMC, we combine the results from both sides by overlapping the execution time and accumulating the energy consumption. The detailed platform configurations are shown in Table 4. In this work, we choose 12 different types of CapsNets as our benchmarks, which are introduced in Sec.3 and shown in Table 1.

To evaluate the effectiveness of PIM-CapsNet, we compare it with the following design scenarios: (1) Baseline: the state-of-the-art GPU-accelerated CapsNet execution with HBM memory (320GB/s). (2) GPU-ICP: the GPU-accelerated CapsNet execution with an ideal cache replacement policy. (3) PIM-CapsNet: our combined inter-vault and intra-vault level design for RP acceleration, with the new memory addressing scheme and the RMAS scheme. (4) PIM-Intra: our PIM-CapsNet without the inter-vault level design, where the memory addressing scheme does not optimize the inter-vault data distribution. (5) PIM-Inter: our PIM-CapsNet without the intra-vault level design, where the memory addressing scheme does not support the intra-vault bank conflict optimization. (6) RMAS-PIM and (7) RMAS-GPU: our PIM-CapsNet with naive memory access scheduling, which always grants the HMC PEs higher priority than the GPU (RMAS-PIM), or always grants the GPU higher priority than the HMC PEs (RMAS-GPU). (8) All-in-PIM: the HMC-accelerated CapsNet execution, including the RP and all other layers.

Figure 15: The (a) speedup and (b) normalized energy consumption of PIM-CapsNet on the RP procedure compared with the GPU-based design.

6.2 Effectiveness of PIM-CapsNet

6.2.1 Performance and Energy for RP Execution

We first evaluate the effectiveness of PIM-CapsNet on the RP execution. Fig.15 illustrates the normalized performance and energy consumption of our PIM-CapsNet design compared with the GPU-based designs (i.e., Baseline and GPU-ICP) for the RP execution. From the figure, we first observe that GPU-ICP only outperforms Baseline by 1.14% on performance and 0.77% on energy during RP execution. This is because RP requires a large number of intermediate variables which exceed the on-chip storage; as a result, cache policy improvements can barely reduce the off-chip memory accesses. Second, PIM-CapsNet outperforms Baseline by 117% on performance by addressing the large number of memory accesses as well as the intensive synchronizations. Third, from Fig.15(b), we observe that PIM-CapsNet saves 92.18% of the energy consumption compared to Baseline. This is because the entire working power of our PIM design is much lower than that of the host, and PIM-CapsNet greatly reduces the data movement between the host and the HMC.

Figure 16: The breakdown of the factors contributing to (a) normalized performance and (b) normalized energy consumption when performing the RP execution on different PIM designs.

Figure 17: The (a) speedup and (b) normalized energy consumption when processing the entire CapsNet with different designs.

Figure 18: The speedup (heat map) achieved by different workload distribution dimensions (X-axis) under different HMC execution frequencies. Red means better improvement.

Moreover, we observe that PIM-CapsNet achieves better performance and energy saving for the RP execution on larger CapsNets, e.g., 2.27x speedup and 92.52% energy saving when accelerating Caps-EN3, compared with 2.09x speedup and 91.90% energy saving when accelerating Caps-SV1. This implies that PIM-CapsNet scales well in optimizing the RP execution with ever-increasing network sizes.

6.2.2 Effectiveness of Intra-Vault and Inter-Vault Level Designs

To better understand the effectiveness of our intra-vault and inter-vault level designs, we compare PIM-CapsNet with other PIM design scenarios for the RP execution only. Fig.16(a) illustrates the evaluation results of normalized performance with the breakdown factors for the different PIM designs. From the figure, we make several observations. First, even though PIM-Intra achieves a 1.22x speedup over Baseline, the inter-vault communication overheads contribute on average 45.24% of the overall execution time. This is because the PIM-Intra design induces massive concurrent data transfers between the centralized computation units in the logic layer and the DRAM banks in the vaults, leading to high crossbar utilization and serious stalls. Second, PIM-Inter decreases the performance by 4.73% compared with Baseline. Compared with PIM-Intra, the inter-vault communication overheads are significantly reduced in PIM-Inter, but the vault request stalls (VRS) grow, contributing on average 57.91% of the execution time due to the serious bank conflicts within each vault. Finally, PIM-CapsNet improves the performance by about 127.83%/76.62% on average compared to PIM-Inter/PIM-Intra by reducing both the inter-vault communications and the VRS. From the energy perspective, as Fig.16(b) shows, all these PIM designs achieve high energy savings over Baseline by executing RP on the energy-efficient PIM hardware, and our PIM-CapsNet on average outperforms PIM-Inter/PIM-Intra by 4.81%/4.52%, respectively, on energy saving.

6.3 Overall Performance and Energy

Fig.17 shows the normalized speedup and energy of the entire CapsNet execution. First, to evaluate the effectiveness of the pipelined execution between the host and the HMC, we compare our design with the all-in-GPU design (i.e., Baseline) and All-in-PIM. From the figure, we observe that All-in-PIM causes a 47.59% performance drop compared to Baseline. This is because we mainly focus on optimizing the RP procedure at minimal cost in the HMC, and this low-cost design choice can hardly achieve the best performance for the Conv/FC layers. But it reduces the energy consumption by 71.09%, which exhibits better execution efficiency (i.e., performance per unit of energy) compared with Baseline. Fig.17 further plots the performance and energy consumption of the pipelined design with the naive memory access schedulers (i.e., RMAS-PIM and RMAS-GPU). We observe that these naive schedulers cannot achieve the best performance and energy saving compared with PIM-CapsNet, due to memory access stalls. In summary, our design outperforms Baseline on both performance (on average 2.44x) and energy saving (64.91%), and it even achieves a higher performance improvement than accelerating the RP execution alone, which benefits from the pipelined execution.

6.4 Sensitivity to PE Frequency Scaling

We conduct a sensitivity analysis of the inter-vault level workload distribution in our PIM-CapsNet when different frequencies are adopted for the PEs, i.e., 312.5MHz, 625MHz and 937.5MHz. Note that the frequency scaling is controlled within a certain range so as not to violate the HMC thermal constraint. Fig.18 shows the speedup achieved by selecting different distribution dimensions (i.e., B-dimension, L-dimension, and H-dimension) under these three frequencies. A darker color indicates a higher speedup (e.g., red indicates substantial speedups while yellow means trivial speedups).

It is obvious that PIM-CapsNet achieves better improvement with a higher execution frequency. We also notice that the selection of the distribution dimension changes with frequency scaling. For example, for Caps-SV3, the L-dimension distribution achieves the best performance under the 312.5MHz execution frequency, but the H-dimension distribution performs best when the frequency increases to 937.5MHz. This indicates that the dimension selection is affected not only by the network configuration, but also by the hardware configuration, e.g., the processing frequency.

Table 5: Accuracy Validation for PIM-CapsNet

                        Caps-MN1  Caps-MN2  Caps-MN3  Caps-CF1  Caps-CF2  Caps-CF3
Origin                  99.75%    99.75%    99.75%    89.40%    90.03%    90.43%
w/o Accuracy Recovery   99.73%    99.74%    99.73%    89.15%    89.91%    90.08%
w/ Accuracy Recovery    99.75%    99.75%    99.75%    89.37%    90.02%    90.39%

                        Caps-EN1  Caps-EN2  Caps-EN3  Caps-SV1  Caps-SV2  Caps-SV3
Origin                  88.74%    85.01%    82.36%    96.70%    95.90%    95.90%
w/o Accuracy Recovery   88.19%    84.16%    81.64%    95.08%    95.92%    95.92%
w/ Accuracy Recovery    88.69%    84.96%    82.34%    96.42%    95.90%    95.90%

6.5 Overhead Analysis

Accuracy Analysis: Table 5 shows the absolute CapsNet accuracy after applying the approximations and the accuracy recovery methodology of our PIM-CapsNet design, as discussed in Sec.5.2.2. As it shows, the approximations cause on average 0.35% accuracy loss, while the accuracy recovery method achieves no accuracy loss for most of the cases and induces on average only a 0.04% accuracy difference across our benchmarks. We also observe a slight accuracy boost for Caps-SV2 and Caps-SV3 when we increase the RP iteration count for PIM-CapsNet without accuracy recovery. This is because the noise induced by the approximation enhances the robustness of the feature mapping, given enough RP iterations [69].

Area Analysis: Our PIM-CapsNet introduces 16 PEs and an operation controller into each HMC vault, with one RMAS module located in the logic layer. Based on our gate-level simulations, our logic design for 32 vaults together with the RMAS incurs a 3.11 mm2 area overhead under the 24nm process technology, which occupies only 0.32% of the HMC logic surface area.

Thermal Analysis: Note that our logic design raises the total power of the HMC, which could cause thermal issues for the 3D stacked memory architecture. We observe that the average power overhead of our logic design is 2.24W, which is below the maximum power overhead (i.e., the 10W thermal design power (TDP) [70]) that the HMC can tolerate.

7. RELATED WORKS

PIM-Based NN Acceleration: There have been multiple studies focusing on PIM-based neural network accelerators [71, 72, 73, 74, 75, 76]. For example, [72] employs ReRAM-based PIM to conduct efficient neural network execution. However, the massive intermediate variables involved in the RP would induce both performance and energy overheads in the ReRAM design due to frequent value updating. Besides, [43, 77, 42] propose CNN accelerator designs using the 3D stacked PIM technique. Since the execution pattern of the RP procedure is far different from that of CNN layers, previous logic layer designs for 3D stacked memory exhibit low efficiency on the RP execution. To the best of our knowledge, this is the first work that leverages the HMC to explore a customized architectural design for efficient CapsNet acceleration.

Workload Distributions: Several studies have explored workload distribution for efficient neural network execution [78, 79, 80, 81, 82]. For example, [80] proposes to distribute the parallel workloads of a convolutional layer among the computation units within a single device, and [82] splits the convolutional execution and distributes the workloads onto multiple devices. Their methods have achieved significant CNN acceleration by greatly improving the resource utilization. Since the execution of the RP procedure is much more complicated than a convolutional layer and involves strong execution dependence, these studies are not applicable to CapsNet acceleration.

Exponential Approximation: There are also some works on approximating the exponential function [58, 59, 83, 84]. For example, [59] leverages a lookup table to conduct fast exponential execution, which causes substantial performance and area overheads. [84] accelerates the exponential function via software optimizations, which are hard to implement in simple logic design. In this work, we conduct efficient acceleration of the exponential function (Sec.5.2.2) with low design complexity and low power overhead.

8. CONCLUSIONS

In recent years, the CapsNet has outperformed the CNN on image processing tasks and has become increasingly popular in many areas. In this study, we propose a processing-in-memory based hybrid computing design called PIM-CapsNet, which leverages both the GPU core computing capability and the special features of the hybrid memory cube to effectively process the data-intensive RP operations, achieving substantial improvements in both performance and energy saving. To drastically reduce the identified performance bottlenecks of RP, we propose several memory-level optimizations (e.g., multi-dimensional parallelism selection, intelligent workload distribution, complex logic approximation, and a new address mapping mechanism) based on RP's program and execution features, to enable minimal in-memory communication, maximum parallelization, and low design complexity and overhead. Evaluation results demonstrate that our proposed design achieves significant performance speedup and energy savings over the baseline GPU design, for both the RP phase and overall CapsNet inference. It also achieves good performance scalability with increasing network size.

9. REFERENCES
[1] I. Kononenko, "Machine learning for medical diagnosis: history, state of the art and perspective," Artificial Intelligence in Medicine, vol. 23, no. 1, pp. 89-109, 2001.
[2] B. J. Erickson, P. Korfiatis, Z. Akkus, and T. L. Kline, "Machine learning for medical imaging," Radiographics, vol. 37, no. 2, pp. 505-515, 2017.
[3] A. L. Buczak and E. Guven, "A survey of data mining and machine learning methods for cyber security intrusion detection," IEEE Communications Surveys & Tutorials, vol. 18, no. 2, pp. 1153-1176, 2016.
[4] J. Grimmer, "We are all social scientists now: How big data, machine learning, and causal inference work together," PS: Political Science & Politics, vol. 48, no. 1, pp. 80-83, 2015.
[5] J. Kober, J. A. Bagnell, and J. Peters, "Reinforcement learning in robotics: A survey," The International Journal of Robotics Research, vol. 32, no. 11, pp. 1238-1274, 2013.
[6] S. Ren, K. He, R. Girshick, and J. Sun, "Faster r-cnn: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, pp. 91-99, 2015.
[7] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, "Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 mb model size," arXiv preprint arXiv:1602.07360, 2016.
[8] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-9, 2015.

[9] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[10] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.
[11] G. E. Hinton, A. Krizhevsky, and S. D. Wang, "Transforming auto-encoders," in International Conference on Artificial Neural Networks, pp. 44-51, Springer, 2011.
[12] G. Hinton, "Where do features come from?," Cognitive Science, vol. 38, no. 6, pp. 1078-1101, 2014.
[13] M. Jaderberg, K. Simonyan, A. Zisserman, et al., "Spatial transformer networks," in Advances in Neural Information Processing Systems, pp. 2017-2025, 2015.
[14] S. Sabour, N. Frosst, and G. E. Hinton, "Dynamic routing between capsules," in Advances in Neural Information Processing Systems, pp. 3856-3866, 2017.
[15] A. Jiménez-Sánchez, S. Albarqouni, and D. Mateus, "Capsule networks against medical imaging data challenges," in Intravascular Imaging and Computer Assisted Stenting and Large-Scale Annotation of Biomedical Data and Expert Label Synthesis, pp. 150-160, Springer, 2018.
[16] P. Afshar, K. N. Plataniotis, and A. Mohammadi, "Capsule networks for brain tumor classification based on mri images and course tumor boundaries," arXiv preprint arXiv:1811.00597, 2018.
[17] A. Mobiny and H. Van Nguyen, "Fast capsnet for lung cancer screening," in International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 741-749, Springer, 2018.
[18] A. S. Lundervold and A. Lundervold, "An overview of deep learning in medical imaging focusing on mri," Zeitschrift für Medizinische Physik, 2018.
[19] X. Zhang and S. Zhao, "Blood cell image classification based on image segmentation preprocessing and capsnet network model," Journal of Medical Imaging and Health Informatics, vol. 9, no. 1, pp. 159-166, 2019.
[20] L. Wang, R. Nie, R. Xin, J. Zhang, and J. Cai, "sccapsnet: a deep learning classifier with the capability of interpretable feature extraction, applicable for single cell rna data analysis," bioRxiv, p. 506642, 2018.
[21] A. D. Kumar, "Novel deep learning model for traffic sign detection using capsule networks," arXiv preprint arXiv:1805.04424, 2018.
[22] J. Kronenberger and A. Haselhoff, "Do capsule networks solve the problem of rotation invariance for traffic sign classification?," in International Conference on Artificial Neural Networks, pp. 33-40, Springer, 2018.
[23] M. Pöpperl, R. Gulagundi, S. Yogamani, and S. Milz, "Capsule neural network based height classification using low-cost automotive ultrasonic sensors," arXiv preprint arXiv:1902.09839, 2019.
[24] D. A. Lopez, Evolving GPU-Accelerated Capsule Networks. PhD thesis, 2018.
[25] A. Yusuf and S. Alawneh, "A survey of gpu implementations for hyperspectral image classification in remote sensing," Canadian Journal of Remote Sensing, pp. 1-19, 2018.
[26] A. Stoutchinin, F. Conti, and L. Benini, "Optimally scheduling cnn convolutions for efficient memory access," arXiv preprint arXiv:1902.01492, 2019.
[27] C. Li, Y. Yang, M. Feng, S. Chakradhar, and H. Zhou, "Optimizing memory efficiency for deep convolutional neural networks on gpus," in SC'16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 633-644, IEEE, 2016.
[28] M. Song, Y. Hu, H. Chen, and T. Li, "Towards pervasive and user satisfactory cnn across gpu microarchitectures," in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 1-12, IEEE, 2017.
[29] Z. Chen, L. Luo, W. Quan, Y. Shi, J. Yu, M. Wen, and C. Zhang, "Multiple cnn-based tasks scheduling across shared gpu platform in research and development scenarios," in 2018 IEEE 20th International Conference on High Performance Computing and Communications; IEEE 16th International Conference on Smart City; IEEE 4th International Conference on Data Science and Systems (HPCC/SmartCity/DSS), pp. 578-585, IEEE, 2018.
[30] X. Zhang, C. Xie, J. Wang, W. Zhang, and X. Fu, "Towards memory friendly long-short term memory networks (lstms) on mobile gpus," in 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 162-174, IEEE, 2018.
[31] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, et al., "Dadiannao: A machine-learning supercomputer," in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 609-622, IEEE Computer Society, 2014.
[32] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, "Shidiannao: Shifting vision processing closer to the sensor," in ACM SIGARCH Computer Architecture News, vol. 43, pp. 92-104, ACM, 2015.
[33] X. Zhou, Z. Du, Q. Guo, S. Liu, C. Liu, C. Wang, X. Zhou, L. Li, T. Chen, and Y. Chen, "Cambricon-s: Addressing irregularity in sparse neural networks through a cooperative software/hardware approach," in 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 15-28, IEEE, 2018.
[34] M. Mahmoud, K. Siu, and A. Moshovos, "Diffy: a déjà vu-free differential deep neural network accelerator," in 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 134-147, IEEE, 2018.
[35] Y. Kwon and M. Rhu, "Beyond the memory wall: A case for memory-centric hpc system for deep learning," in 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 148-161, IEEE, 2018.
[36] Y. Li, J. Park, M. Alian, Y. Yuan, Z. Qu, P. Pan, R. Wang, A. Schwing, H. Esmaeilzadeh, and N. S. Kim, "A network-centric hardware/algorithm co-design to accelerate distributed training of deep neural networks," in 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 175-188, IEEE, 2018.
[37] C. Deng, S. Liao, Y. Xie, K. Parhi, X. Qian, and B. Yuan, "Permdnn: Efficient compressed deep neural network architecture with permuted diagonal matrices," in MICRO, 2018.
[38] J. Jeddeloh and B. Keeth, "Hybrid memory cube new dram architecture increases density and performance," in 2012 Symposium on VLSI Technology (VLSIT), pp. 87-88, IEEE, 2012.
[39] S. F. Yitbarek, T. Yang, R. Das, and T. Austin, "Exploring specialized near-memory processing for data intensive operations," in 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1449-1452, IEEE, 2016.
[40] J. Ahn, S. Yoo, O. Mutlu, and K. Choi, "Pim-enabled instructions: a low-overhead, locality-aware processing-in-memory architecture," in 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), pp. 336-348, IEEE, 2015.
[41] J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi, "A scalable processing-in-memory accelerator for parallel graph processing," ACM SIGARCH Computer Architecture News, vol. 43, no. 3, pp. 105-117, 2016.
[42] J. Liu, H. Zhao, M. A. Ogleari, D. Li, and J. Zhao, "Processing-in-memory for energy-efficient neural network training: A heterogeneous approach," in 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 655-668, IEEE, 2018.
[43] D. Kim, J. Kung, S. Chai, S. Yalamanchili, and S. Mukhopadhyay, "Neurocube: A programmable digital neuromorphic architecture with high-density 3d memory," in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 380-392, IEEE, 2016.

[44] G. E. Hinton, S. Sabour, and N. Frosst, "Matrix capsules with em routing," 2018.
[45] E. Xi, S. Bing, and Y. Jin, "Capsule network performance on complex data," arXiv preprint arXiv:1712.03480, 2017.
[46] S. S. R. Phaye, A. Sikka, A. Dhall, and D. Bathula, "Dense and diverse capsule networks: Making the capsules learn better," arXiv preprint arXiv:1805.04001, 2018.
[47] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al., "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
[48] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," tech. rep., Citeseer, 2009.
[49] G. Cohen, S. Afshar, J. Tapson, and A. van Schaik, "Emnist: an extension of mnist to handwritten letters," arXiv preprint arXiv:1702.05373, 2017.
[50] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, "Reading digits in natural images with unsupervised feature learning," 2011.
[51] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in pytorch," 2017.
[52] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer, "cudnn: Efficient primitives for deep learning," arXiv preprint arXiv:1410.0759, 2014.
[53] "Optimizations To Accelerate Deep Learning Training on NVIDIA GPUs." https://devblogs.nvidia.com/new-optimizations-accelerate-deep-learning-training-gpu/.
[54] "Nvidia Profiler." https://docs.nvidia.com/cuda/profiler-users-guide/index.html.
[55] R. Mukhometzianov and J. Carrillo, "Capsnet comparative performance evaluation for image classification," arXiv preprint arXiv:1805.11195, 2018.
[56] "HMC Specification 2.1." http://hybridmemorycube.org/files/SiteDownloads/HMC-30G-VSR_HMCC_Specification_Rev2.1_20151105.pdf.
[57] D.-I. Jeon, K.-B. Park, and K.-S. Chung, "Hmc-mac: Processing-in memory architecture for multiply-accumulate operations with hybrid memory cube," IEEE Computer Architecture Letters, vol. 17, no. 1, pp. 5-8, 2018.
[58] D. De Caro, N. Petra, and A. G. Strollo, "High-performance special function unit for programmable 3-d graphics processors," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 56, no. 9, pp. 1968-1978, 2009.
[59] J. Partzsch, S. Höppner, M. Eberlein, R. Schüffny, C. Mayr, D. R. Lester, and S. Furber, "A fixed point exponential function accelerator for a neuromorphic many-core system," in 2017 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1-4, IEEE, 2017.
[60] C. Lomont, "Fast inverse square root," Technical Report, vol. 32, 2003.
[61] M. Robertson, A Brief History of InvSqrt. Department of Computer Science & Applied Statistics, 2012.
[62] A. Zoglauer, S. E. Boggs, M. Galloway, M. Amman, P. N. Luke, and R. M. Kippen, "Design, implementation, and optimization of megalib's image reconstruction tool mimrec," Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, vol. 652, no. 1, pp. 568-571, 2011.
[63] L. Middendorf and C. Haubelt, "A programmable graphics processor based on partial stream rewriting," in Computer Graphics Forum, vol. 32, pp. 325-334, Wiley Online Library, 2013.
[64] W. Kahan, "Ieee standard 754 for binary floating-point arithmetic," Lecture Notes on the Status of IEEE, vol. 754, no. 94720-1776, p. 11, 1996.
[65] "Nvidia Tesla P100 Whitepaper." https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf.
[66] "NVIDIA-smi." https://developer.nvidia.com/nvidia-system-management-interface.
[67] J. D. Leidel and Y. Chen, "Hmc-sim: A simulation framework for hybrid memory cube devices," Parallel Processing Letters, vol. 24, no. 04, p. 1442002, 2014.
[68] "Gate-Level Simulation Methodology - Cadence." https://www.cadence.com/content/dam/cadence-www/global/en_US/documents/tools/system-design-verification/gate-level-simulation-wp.pdf.
[69] P. Koistinen and L. Holmström, "Kernel regression and backpropagation training with noise," in Advances in Neural Information Processing Systems, pp. 1033-1039, 1992.
[70] D. Zhang, N. Jayasena, A. Lyashevsky, J. L. Greathouse, L. Xu, and M. Ignatowski, "Top-pim: throughput-oriented programmable processing in memory," in Proceedings of the 23rd International Symposium on High-Performance Parallel and Distributed Computing, pp. 85-98, ACM, 2014.
[71] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, "Prime: A novel processing-in-memory architecture for neural network computation in reram-based main memory," in ACM SIGARCH Computer Architecture News, vol. 44, pp. 27-39, IEEE Press, 2016.
[72] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, "Isaac: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," ACM SIGARCH Computer Architecture News, vol. 44, no. 3, pp. 14-26, 2016.
[73] F. Chen, L. Song, and Y. Chen, "Regan: A pipelined reram-based accelerator for generative adversarial networks," in 2018 23rd Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 178-183, IEEE, 2018.
[74] H. Mao, M. Song, T. Li, Y. Dai, and J. Shu, "Lergan: A zero-free, low data movement and pim-based gan architecture," in 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 669-681, IEEE, 2018.
[75] B. K. Joardar, B. Li, J. R. Doppa, H. Li, P. P. Pande, and K. Chakrabarty, "Regent: A heterogeneous reram/gpu-based architecture enabled by noc for training cnns," in 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 522-527, IEEE, 2019.
[76] H. Falahati, P. Lotfi-Kamran, M. Sadrosadati, and H. Sarbazi-Azad, "Origami: A heterogeneous split architecture for in-memory acceleration of learning," arXiv preprint arXiv:1812.11473, 2018.
[77] M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, "Tetris: Scalable and efficient neural network acceleration with 3d memory," in ACM SIGARCH Computer Architecture News, vol. 45, pp. 751-764, ACM, 2017.
[78] Y. Shen, M. Ferdman, and P. Milder, "Maximizing cnn accelerator efficiency through resource partitioning," in 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pp. 535-547, IEEE, 2017.
[79] J. Guo, S. Yin, P. Ouyang, L. Liu, and S. Wei, "Bit-width based resource partitioning for cnn acceleration on fpga," in 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 31-31, IEEE, 2017.
[80] S. Wang, G. Ananthanarayanan, Y. Zeng, N. Goel, A. Pathania, and T. Mitra, "High-throughput cnn inference on embedded arm big.little multi-core processors," arXiv preprint arXiv:1903.05898, 2019.

[81] C. Lo, Y.-Y. Su, C.-Y. Lee, and S.-C. Chang, "A dynamic deep neural network design for efficient workload allocation in edge computing," in 2017 IEEE International Conference on Computer Design (ICCD), pp. 273-280, IEEE, 2017.
[82] S. Dey, A. Mukherjee, A. Pal, and P. Balamuralidhar, "Partitioning of cnn models for execution on fog devices," in Proceedings of the 1st ACM International Workshop on Smart Cities and Fog Computing, pp. 19-24, ACM, 2018.
[83] A. C. I. Malossi, Y. Ineichen, C. Bekas, and A. Curioni, "Fast exponential computation on simd architectures," Proc. of HIPEAC-WAPCO, Amsterdam NL, 2015.
[84] F. Perini and R. D. Reitz, "Fast approximations of exponential and logarithm functions combined with efficient storage/retrieval for combustion kinetics calculations," Combustion and Flame, vol. 194, pp. 37-51, 2018.