Towards Efficient U-Nets: a Coupled and Quantized Approach
Zhiqiang Tang, Xi Peng, Kang Li, and Dimitris N. Metaxas, Fellow, IEEE

Abstract—In this paper, we propose to couple stacked U-Nets for efficient visual landmark localization. The key idea is to globally reuse features of the same semantic meanings across the stacked U-Nets. The feature reuse makes each U-Net lightweight. Specifically, we propose an order-K coupling design to trim off long-distance shortcuts, together with an iterative refinement and memory sharing mechanism. To further improve the efficiency, we quantize the parameters, intermediate features, and gradients of the coupled U-Nets to low bit-width numbers. We validate our approach on two tasks: human pose estimation and facial landmark localization. The results show that our approach achieves state-of-the-art localization accuracy while using ∼70% fewer parameters, ∼30% less inference time, ∼98% less model size, and ∼75% less training memory than benchmark localizers.

Index Terms—Stacked U-Nets, Dense Connectivity, Network Quantization, Efficient AI, Human Pose Estimation, Face Alignment.

• Zhiqiang Tang is with the Department of Computer Science, Rutgers University, NJ, USA. E-mail: [email protected]
• Xi Peng (corresponding author) is with the Department of Computer Science, Binghamton University, NY, USA. E-mail: [email protected]
• Kang Li is with the Department of Orthopaedics, New Jersey Medical School, Rutgers University, NJ, USA. E-mail: [email protected]
• Dimitris Metaxas is with the Department of Computer Science, Rutgers University, NJ, USA. E-mail: [email protected]

1 INTRODUCTION

The U-Net architecture [1] is a basic category of Convolutional Neural Network (CNN). It has been widely used in location-sensitive tasks: semantic segmentation [2], biomedical image segmentation [1], human pose estimation [3], facial landmark localization [4], etc. A U-Net contains several top-down and bottom-up blocks, with shortcut connections between the corresponding top-down and bottom-up blocks. The essence of U-Net is to integrate both local visual cues and global context information when making the inference.

Recently, stacked U-Nets, e.g. hourglasses (HGs) [5], have become a standard baseline in landmark localization tasks. Their multiple rounds of top-down and bottom-up processing refine the inference stage by stage. Many techniques, such as adversarial training [6] and attention modeling [7], have been used to further improve their inference accuracy. However, very few works try to improve the efficiency of stacked U-Nets. Stacked U-Nets usually contain dozens of millions of float parameters. The massive high-precision computations require high-end GPU devices with abundant memory, which makes applications on resource-limited mobile devices very challenging. In this paper, we aim to improve the efficiency of stacked U-Nets in three aspects: parameter, memory, and bit-width.

Parameter efficiency. Shortcut connections promote feature reuse and can thereby remove many redundant parameters. For a single U-Net, it is straightforward to change each block into a dense block, inside which several convolutional layers are densely connected. However, adding shortcut connections properly across stacked U-Nets is nontrivial. Our solution is to couple the stacked U-Nets, generating the coupled U-Nets (CU-Net). The key idea is to directly connect blocks of the same semantic meanings, i.e. those having the same resolution in either the top-down or bottom-up context, from any U-Net to all subsequent U-Nets; please refer to Fig. 1 for an illustration. This encourages feature reuse across the stacks, making each U-Net lightweight.

Fig. 1. Illustration of the dense U-Net, stacked U-Nets, and coupled U-Nets (the diagrams are drawn with dense blocks, residual modules, and bottlenecks, connected by element-wise addition and channel-wise concatenation). The CU-Net is a hybrid of the dense U-Net and stacked U-Nets, integrating the merits of both dense connectivity and multi-stage top-down and bottom-up refinement. The coupled U-Nets save ∼70% of the parameters and ∼30% of the inference time of stacked U-Nets. Each block in the coupled U-Nets is a bottleneck module, which is different from the dense block.

Yet there is an issue in designing the CU-Net: the number of shortcut connections grows quadratically if we couple every U-Net pair, e.g. n stacked U-Nets would generate O(n^2) connections. To balance parameter efficiency and inference accuracy, we propose order-K coupling, which couples a U-Net only to its K nearest successors. Besides, we employ intermediate supervisions to provide additional gradients, compensating for the trimmed-off shortcut connections. Order-K coupling cuts down ∼70% of the parameters and ∼30% of the forward time without sacrificing inference accuracy compared with stacked U-Nets [5]. Furthermore, we propose an iterative design that can further reduce the parameter number to ∼50%. More specifically, the CU-Net output of the first pass is used as the input of the second pass, which is equivalent to a double-depth CU-Net.
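To make the connectivity concrete, the following is a minimal PyTorch-style sketch of order-K coupling at the granularity of whole U-Nets. ToyUNet, the channel width, and the feedback step at the end are illustrative assumptions: the actual CU-Net couples the matching top-down and bottom-up blocks inside each U-Net and uses bottleneck modules, which this sketch does not reproduce.

```python
import torch
import torch.nn as nn

class ToyUNet(nn.Module):
    """Stand-in for one U-Net (hypothetical). A small conv body keeps the
    sketch readable; the real stages contain top-down/bottom-up blocks."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1),
        )

    def forward(self, x):
        return self.body(x)

class OrderKCoupledUNets(nn.Module):
    """Order-K coupling: U-Net i reuses (channel-wise concatenates) the
    features of at most its K nearest predecessors, so the shortcut count
    grows as O(nK) rather than O(n^2) for n stacked U-Nets."""
    def __init__(self, n_stacks=4, k=1, width=64):
        super().__init__()
        self.k = k
        self.stages = nn.ModuleList(
            [ToyUNet(width * (1 + min(i, k)), width) for i in range(n_stacks)]
        )

    def forward(self, x):
        outputs = []  # per-stage features; each can receive intermediate supervision
        for i, stage in enumerate(self.stages):
            reused = outputs[max(0, i - self.k):i]  # K nearest predecessors only
            inp = torch.cat(reused + [x], dim=1) if reused else x
            outputs.append(stage(inp))
        return outputs

# Iterative refinement: feed the first-pass output back as the second-pass
# input, which behaves like a double-depth CU-Net with no extra parameters.
model = OrderKCoupledUNets(n_stacks=4, k=2, width=64)
x = torch.randn(1, 64, 64, 64)
first_pass = model(x)[-1]
second_pass = model(first_pass)[-1]
```

Setting k = n_stacks - 1 recovers full coupling with its quadratic shortcut growth; a small k bounds the per-stage concatenation width, and hence the parameter count, regardless of how many U-Nets are stacked.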
Memory efficiency. The shortcut connections may cause a severe memory issue: a naive implementation makes feature copies repeatedly for all shortcut connections. We adapt the memory-efficient implementation of [8] to share memory among the features of connected blocks. This technique reduces the training memory by ∼40%.

Bit-width efficiency. In addition to the parameter and memory efficiency, we also investigate model quantization to improve the bit-width efficiency. Different from the common setup, we quantize the parameters as well as the data flow (intermediate features and gradients). On the one hand, we ternarize or binarize the float parameters, which shrinks the model size 16× or 32× at test time. On the other hand, we quantize the data flow with different bit-width setups, which saves ∼4× training memory without compromising performance. To the best of our knowledge, this is the first study to simultaneously quantize the parameters and the data flow in U-Nets.
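To ground the bit-width discussion, below is a minimal sketch of the two kinds of quantizers involved, following the TWN, XNOR-Net, and DoReFa-Net rules reviewed in the related work; the threshold constant, per-tensor (rather than per-filter) scaling, and the assumption that data-flow inputs are pre-normalized to [0, 1] are simplifications, not the paper's exact quantization recipe.

```python
import torch

def ternarize(w: torch.Tensor) -> torch.Tensor:
    """TWN-style ternarization: map weights to {-a, 0, +a} using a symmetric
    threshold delta and a scaling factor a fitted to the surviving weights."""
    delta = 0.7 * w.abs().mean()                      # TWN's heuristic threshold
    mask = (w.abs() > delta).float()
    a = (w.abs() * mask).sum() / mask.sum().clamp(min=1.0)
    return a * torch.sign(w) * mask

def binarize(w: torch.Tensor) -> torch.Tensor:
    """XNOR-Net-style binarization: sign(w) scaled by the mean magnitude."""
    return w.abs().mean() * torch.sign(w)

def quantize_k_bit(x: torch.Tensor, k: int = 8) -> torch.Tensor:
    """DoReFa-style uniform k-bit quantizer for the data flow (features and
    gradients), assuming x has already been squashed into [0, 1]."""
    levels = float(2 ** k - 1)
    return torch.round(x * levels) / levels
```

In training, a straight-through estimator would pass gradients through these non-differentiable rounding steps; applying low bit-width setups to the features and gradients as well as the parameters is what yields the ∼4× training-memory saving noted above.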
In summary, we present a comprehensive study of efficient U-Nets in three aspects: parameter, memory, and bit-width. Coupled U-Nets (CU-Nets), order-K coupling, and iterative refinement are proposed to balance parameter efficiency and inference accuracy. Besides, a memory sharing technique is employed to significantly cut down the training memory. Moreover, we investigate the bit-width efficiency by quantizing the parameters as well as the data flow. Two popular tasks, human pose estimation and facial landmark localization, are studied to validate our approach in various aspects. The experimental results prove that our model cuts down ∼70% of the parameters and ∼30% of the inference time. Together with the quantization, we can shrink the model size by ∼98% and reduce the training memory by ∼75% with performance comparable to state-of-the-art U-Net designs.

This is an extension of our ECCV work [9]. We have improved it mainly in four aspects: giving a more comprehensive study of U-Net efficiency, presenting a more complete solution for both a single U-Net and stacked U-Nets, giving the detailed network architecture and the implementation of order-K coupling, and conducting a more thorough ablation study to evaluate every component. For details, please refer to the difference summary.

2 RELATED WORK

In this section, we review the recent developments in designing convolutional network architectures, quantizing neural networks, and two landmark localization tasks: human pose estimation and facial landmark localization.

Network Architectures. In contrast to prior densely connected designs, where feature reuse is local within each dense block, our dense connectivity is global at the U-Net level. Besides, we aim to improve U-Net efficiency whereas they focus on accuracy. Our method is also related to DLA [17] in the sense of feature aggregation. However, the proposed coupling connectivity is designed for multiple stacked U-Nets whereas DLA [17] is for a single U-Net.

Network Quantization. Training deep neural networks usually consumes a large amount of computational power, which makes them hard to deploy on mobile devices. Recently, network quantization approaches [18], [19], [20], [21], [22] offer an efficient solution: they reduce the network size by cutting down high-precision operations and operands. TWN [19] utilizes two symmetric thresholds to ternarize the parameters to +1, 0, or -1. XNOR-Net [22] quantizes both the parameters and the intermediate features, using a scaling factor to approximate the real-valued parameters and features. DoReFa-Net [20] quantizes gradients to low bit-width numbers. WAGE [21] proposes an integer-based implementation for training and inference simultaneously. These quantization methods are mainly designed for image classification networks. In the recent binarized convolutional landmark localizer (BCLL) [23] architecture, XNOR-Net [22] is utilized for network binarization. However, BCLL only quantizes parameters for inference; due to its high-precision demand for training, it can neither save training memory nor improve training efficiency. Therefore, we explore quantizing the proposed CU-Net in training and inference simultaneously. That is, we quantize the parameters as well as the intermediate features and gradients.

Human Pose Estimation. Starting from the DeepPose