Squeeze U-Net: A Memory and Energy Efficient Image Segmentation Network
Nazanin Beheshti                          Lennart Johnsson
Department of Computer Science            Department of Computer Science
University of Houston                     University of Houston
[email protected]                   [email protected]

Abstract

To facilitate implementation of deep neural networks on embedded systems, keeping memory and computation requirements low is critical, particularly for real-time mobile use. In this work, we propose a SqueezeNet-inspired version of U-Net for image segmentation that achieves a 12× reduction in model size to 32 MB and a 3.2× reduction in multiply-accumulate operations (MACs), from 287 billion ops to 88 billion ops, for inference on the CamVid data set while preserving accuracy. Our proposed Squeeze U-Net is efficient in both MACs and memory use. Our performance results using Tensorflow 1.14 with Python 3.6 and CUDA 10.1.243 on an NVIDIA K40 GPU show that Squeeze U-Net is 17% faster for inference and 52% faster for training than U-Net for the same accuracy.

1. Introduction

Until recently, most deep convolutional neural network (CNN) research focused on increasing accuracy on computer vision datasets. The resulting large and powerful neural networks consume considerable energy, memory bandwidth and computational resources. For embedded mobile applications, not only accuracy but also energy consumption and model size are top concerns, and in many applications inference time matters as well for real-time use. A small CNN architecture may enable on-chip storage of the model, thereby significantly reducing the energy consumed retrieving model parameters from DRAM during inference; it also reduces the energy required for computation. References to off-chip memory may incur a latency of hundreds of compute cycles and dissipate more than a hundred times as much energy as arithmetic operations [1].

Our goal is to reduce the energy and memory consumption of neural networks so they can be deployed on devices with limited resources while preserving accuracy. To achieve this goal, we devised Squeeze U-Net for image segmentation. Its design is inspired by SqueezeNet [2] and U-Net [3]. We choose the U-Net architecture [3] as a starting point because it can be successfully trained on small data sets, which is beneficial for hardware with limited memory and for applications for which large training data sets may not be available. We replace the down- and up-sampling layers in U-Net with modules similar to the fire modules in SqueezeNet [2]. Our fire modules use pointwise convolutions followed by an inception stage [4] in which pointwise and 3×3 convolutions are performed independently and then concatenated to form the output. The result is a small model with only 2.6 million parameters. Our Squeeze U-Net is 1.68×, 2.59×, 11.58×, 3.65×, 16.84× and 27.4× smaller than the MobileNet [5], DeepLab [6], U-Net [3], SegNet [7], FCN [8] and DeconvNet [9] architectures, respectively.

To analyze the merit of the Squeeze U-Net architecture, we have implemented both it and U-Net using Tensorflow 1.14 and Python 3.6 with CUDA 10.1.243, and measured the execution time of every layer on an NVIDIA K40 GPU. On the CamVid data set, the Squeeze U-Net contracting path requires 63% of the U-Net time, the expanding path 75% of the U-Net time, and the two together 69% of the U-Net time. For inference, Squeeze U-Net is 17% faster than U-Net. Next, we describe related work, followed by a detailed description of the Squeeze U-Net architecture in Section 3. Section 4 discusses training with the Squeeze U-Net architecture, followed by an evaluation in Section 5 and our conclusions in Section 6.

2. Related Work

It has become evident that deep neural networks typically are overparametrized, in that a variety of compression techniques have been applied to large parameter spaces with no or only minor loss of accuracy. Redundancy in deep learning models results in wasted computation, memory and energy. Shrinking, factorizing or compressing pretrained networks are approaches for removing redundancy and obtaining smaller models [11]–[14].

One of the most straightforward approaches to model compression is applying singular value decomposition (SVD) to a pretrained CNN model and finding low-rank approximations of the parameters [15]. Another approach is network pruning, which takes a pretrained model and replaces parameters below a certain threshold with zeros to form sparse matrices. In sparse matrices, relative encoding of indices can be used to compress the indices to a few bits at the expense of indirection. To reduce the number of parameters and the computational effort of CNNs, several techniques based on factorizing the convolution kernel have been used. Depthwise separable convolution [16] separates convolution across channels from convolution within channels; it is used in SqueezeNet, in Squeeze U-Net, and in e.g. [17], [18].
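As a concrete illustration of this factorization, the sketch below splits a standard convolution into its depthwise and pointwise parts using the Keras API of the TensorFlow version used in this work; the input shape and channel counts are illustrative assumptions, not values from the paper.

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(224, 224, 64))  # illustrative shape

# Standard convolution: K*K*Ci*Co = 3*3*64*128 = 73,728 weights (bias omitted).
standard = tf.keras.layers.Conv2D(128, 3, padding="same")(inputs)

# Depthwise separable factorization: a 3x3 depthwise convolution within
# channels (3*3*64 = 576 weights) followed by a 1x1 pointwise convolution
# across channels (1*1*64*128 = 8,192 weights) -- about 8x fewer parameters
# for the same input and output shapes.
depthwise = tf.keras.layers.DepthwiseConv2D(3, padding="same")(inputs)
separable = tf.keras.layers.Conv2D(128, 1, padding="same")(depthwise)
```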
Reducing the data type size from 32 bits to eight or 16 bits, commonly known as quantization, may be able to reduce the Squeeze U-Net model size further, but is not investigated here. Quantization in depthwise separable convolution networks that use non-linear activation functions within layers may require special attention, as shown in [19] for MobileNetV1. Other approaches to reducing computation time have focused on specialized hardware for CNNs, e.g. [20]–[22]; [22] focuses on data reuse for dense, uncompressed models, thereby making the hardware architecture energy efficient.

Figure 1: (A) convolution in the contracting path in U-Net. (B) our corresponding Squeeze U-Net implementation using fire modules [2] instead of full convolution to reduce the number of parameters. (C) transposed convolution in the expansive path in U-Net. (D) our Squeeze U-Net implementation corresponding to (C). In (B) and (D), the fire modules first squeeze the number of output channels, then apply two parallel convolutions with different kernel sizes to capture missing features from the previous layer, and concatenate their outputs.

Table 1: The number of parameters for a K×K convolution kernel with Ci input channels and Co output channels is K × K × Ci × Co, and is given below for a few CNNs. The networks typically use 3×3 kernels or a combination of 3×3 and 1×1 kernels.

Model           #Params   Model Size (MB)   Kernel Sizes
Squeeze U-Net   2.59M     32                3×3, 2×2, 1×1
U-Net           30M       386               3×3, 2×2, 1×1
SegNet          30M       117               3×3
DeepLab         20M       83                3×3 – 1×1
FCN-8s          132M      539               3×3 – 1×1 – 4×4, 7×7 – 16×16
DeconvNet       143M      877               1×1 – 3×3, 7×7
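As a quick check of the K × K × Ci × Co formula in the Table 1 caption, the helper below computes the weight count of a single convolution layer; the layer shapes are our own illustrative choices, and bias terms are ignored as in the table.

```python
def conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in one k x k convolution with c_in input and c_out output channels."""
    return k * k * c_in * c_out

print(conv_params(3, 64, 128))  # 3*3*64*128 = 73,728 weights
print(conv_params(1, 64, 128))  # the 1x1 pointwise variant: 8,192 weights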
Contributions: For image segmentation, we show that Squeeze U-Net achieves the same accuracy as U-Net on the CamVid dataset [23] with 2.6 million parameters, a 12× reduction compared to U-Net [3]. In designing Squeeze U-Net we employ the SqueezeNet fire module [2] design in both the U-Net contracting and expansive paths. The fire module's initial pointwise convolution reduces the number of channels, and compensates for this reduction with an inception stage of two parallel convolutions, each having half the number of the fire module's output channels. The two parallel convolutions help prevent the feature loss and vanishing gradients that may be caused by reducing the number of channels [24]. Furthermore, we show that, although Squeeze U-Net has more layers than U-Net, its inference time, implemented in Tensorflow 1.14 using Python 3.6 and CUDA 10.1.243 on an NVIDIA K40 GPU, is 17% shorter than U-Net's on the CamVid data set, and it requires only 66% of the U-Net training time.

Figure 2: The Squeeze U-Net architecture consists of down-sampling units in the contracting U-Net path and up-sampling units in the expansive U-Net path. Every down-sampling (DS) unit consists of two fire modules which extract features. The extracted features are passed down to the next down-sampling unit and to the corresponding up-sampling (US) unit. Every up-sampling unit consists of a transposed fire module, a concatenation unit and two fire modules which, in order, up-sample the input, concatenate the skip features, and extract features to construct the output.

3. Architecture

3.1 Contracting Path

Inspired by SqueezeNet [2], we adopt fire modules for the down-sampling (DS) units in the contracting path of Squeeze U-Net. Each fire module in the contracting path consists of one 1×1 convolution layer with Csq < C output channels, and an inception layer with two parallel convolutions of 3×3 and 1×1 kernel size and C/2 output channels each. Concatenating the parallel convolutions' output channels forms the fire module output. It is passed to the next contracting layer and also, via long skip connections [25], [26], to the corresponding layer in the expansive path of Squeeze U-Net. A code sketch of this module is given after Table 2 below.

3.2 Expansive Path

For the expansive path in Squeeze U-Net we also use the SqueezeNet fire modules to reduce the total number of parameters; a sketch of the corresponding up-sampling unit follows the fire module sketch below.

Table 2: Quantitative comparison of Squeeze U-Net and U-Net regarding model size and the number of convolution and multiplication operations. These numbers are obtained using the TensorFlow 1.14 profiler on a saved model; the convolution and multiplication counts are gathered at inference time.

Model                 Size (MB)   #CONV (CamVid, Billion)   #Mult (CamVid, Million)
Squeeze U-Net         32          432.71                    33.03
U-Net                 386.6       1315.60                   61.44
Factor of reduction   12.08×      3.04×                     1.86×
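The sketch below is a minimal rendering of the Section 3.1 fire module in the Keras functional API: a 1×1 squeeze convolution with Csq < C output channels feeding parallel 3×3 and 1×1 convolutions with C/2 output channels each, concatenated to C output channels. The function name, the particular channel counts, the ReLU activations, and the max pooling between DS units are our assumptions rather than details stated in the paper.

```python
import tensorflow as tf

def fire_module(x, c, c_sq):
    """Fire module per Section 3.1: 1x1 squeeze to c_sq channels, then an
    inception stage of parallel 3x3 and 1x1 convolutions, concatenated."""
    squeeze = tf.keras.layers.Conv2D(c_sq, 1, padding="same",
                                     activation="relu")(x)
    branch3x3 = tf.keras.layers.Conv2D(c // 2, 3, padding="same",
                                       activation="relu")(squeeze)
    branch1x1 = tf.keras.layers.Conv2D(c // 2, 1, padding="same",
                                       activation="relu")(squeeze)
    return tf.keras.layers.Concatenate()([branch3x3, branch1x1])

# A down-sampling (DS) unit per Figure 2: two fire modules extract features;
# the result is kept for the long skip connection and passed to the next
# DS unit (here via 2x2 max pooling, an assumption on our part).
inputs = tf.keras.Input(shape=(352, 480, 3))  # CamVid-like shape, illustrative
skip = fire_module(fire_module(inputs, c=64, c_sq=16), c=64, c_sq=16)
down = tf.keras.layers.MaxPooling2D(2)(skip)
```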
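Figure 2 specifies that each up-sampling (US) unit applies a transposed fire module, concatenates the long-skip features, and then applies two fire modules, but this excerpt does not spell out the transposed fire module's internals. The sketch below is one plausible reading, doubling resolution with a stride-2 3×3 transposed convolution in one branch and nearest-neighbor up-sampling plus a 1×1 convolution in the other; all of these internals are our assumptions.

```python
import tensorflow as tf

def transposed_fire_module(x, c, c_sq):
    """Hypothetical transposed fire module: 1x1 squeeze, then two parallel
    2x up-sampling branches concatenated to c output channels."""
    squeeze = tf.keras.layers.Conv2D(c_sq, 1, padding="same",
                                     activation="relu")(x)
    up3x3 = tf.keras.layers.Conv2DTranspose(c // 2, 3, strides=2,
                                            padding="same")(squeeze)
    up1x1 = tf.keras.layers.Conv2D(c // 2, 1, padding="same")(
        tf.keras.layers.UpSampling2D(2)(squeeze))
    return tf.keras.layers.Concatenate()([up3x3, up1x1])

def upsampling_unit(x, skip, c, c_sq):
    """US unit per Figure 2: up-sample, concatenate the skip features,
    then extract features with two fire modules (see the previous sketch)."""
    x = transposed_fire_module(x, c, c_sq)
    x = tf.keras.layers.Concatenate()([x, skip])  # long skip connection
    return fire_module(fire_module(x, c, c_sq), c, c_sq)
```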