Multi-level Wavelet Convolutional Neural Networks

Pengju Liu, Hongzhi Zhang, Wei Lian, and Wangmeng Zuo
Abstract—In computer vision, convolutional networks (CNNs) often adopt pooling to enlarge the receptive field, which has the advantage of low computational complexity. However, pooling can cause information loss and is thus detrimental to further operations such as feature extraction and analysis. Recently, dilated filtering has been proposed to trade off between receptive field size and efficiency, but the accompanying gridding effect causes a sparse sampling of input images with checkerboard patterns. To address this problem, in this paper we propose a novel multi-level wavelet CNN (MWCNN) model to achieve a better trade-off between receptive field size and computational efficiency. The core idea is to embed the wavelet transform into the CNN architecture to reduce the resolution of feature maps while, at the same time, increasing the receptive field. Specifically, MWCNN for image restoration is based on the U-Net architecture, and the inverse wavelet transform (IWT) is deployed to reconstruct the high-resolution (HR) feature maps. The proposed MWCNN can also be viewed as an improvement of dilated filtering and a generalization of average pooling, and can be applied not only to image restoration tasks, but also to any CNN requiring a pooling operation. The experimental results demonstrate the effectiveness of the proposed MWCNN on tasks such as image denoising, single image super-resolution, JPEG image artifacts removal and object classification. The code and pre-trained models will be given at https://github.com/lpj-github-io/MWCNNv2.

Index Terms—Convolutional networks, receptive field size, efficiency, multi-level wavelet.

Fig. 1: Running time vs. PSNR value of representative CNN models, including SRCNN [1], FSRCNN [18], ESPCN [4], VDSR [2], DnCNN [5], RED30 [20], LapSRN [3], DRRN [17], MemNet [19] and our MWCNN. The receptive field of each model is also provided: SRCNN (17×17), ESPCN (36×36), DnCNN and VDSR (41×41), FSRCNN and RED30 (61×61), DRRN (105×105), LapSRN (137×137), and MWCNN (181×181). PSNR and time are evaluated on Set5 with scale factor ×4, running on a GTX1080 GPU.
I. INTRODUCTION

Nowadays, convolutional networks have become the dominant technique behind many computer vision tasks, e.g., image restoration [1]–[5] and object classification [6]–[10]. With continual progress, CNNs are extensively and easily learned on large-scale datasets, sped up by increasingly advanced GPU devices, and often achieve state-of-the-art performance in comparison with traditional methods. The popularity of CNNs in computer vision can be attributed to two aspects. First, existing CNN-based solutions dominate on several simple tasks by outperforming other methods by a large margin, such as single image super-resolution (SISR) [1], [2], [11], image denoising [5], image deblurring [12], compressed imaging [13], and object classification [6]. Second, CNNs can be treated as a modular part and plugged into traditional methods, which also promotes the widespread use of CNNs [12], [14], [15].

Actually, CNNs in computer vision can be viewed as a nonlinear map from the input image to the target. In general, a larger receptive field is helpful for improving the fitting ability of CNNs and promoting accurate performance by taking more spatial context into account. Generally, the receptive field can be enlarged by increasing the network depth, enlarging the filter size, or using pooling operations. But increasing the network depth or enlarging the filter size inevitably results in higher computational cost. Pooling can enlarge the receptive field and guarantee efficiency by directly reducing the spatial resolution of feature maps; nevertheless, it may result in information loss. Recently, dilated filtering [8] was proposed to trade off between receptive field size and efficiency by inserting "zero holes" in convolutional filtering. However, dilated filtering with a fixed factor greater than 1 only takes into account a sparse sampling of the input with checkerboard patterns, so it suffers from an inherent gridding effect [16]. Based on the above analysis, one can see that care should be taken when enlarging the receptive field if we want to avoid both increasing the computational burden and incurring a potential performance sacrifice. As can be seen from Figure 1, even though DRRN [17] and MemNet [19] enjoy larger receptive fields and higher PSNR performance than VDSR [2] and DnCNN [5], their speed is nevertheless orders of magnitude slower.

In an attempt to address the problems stated previously, we propose an efficient CNN-based approach aimed at trading off between performance and efficiency. More specifically, we propose a multi-level wavelet CNN (MWCNN) that utilizes the discrete wavelet transform (DWT) to replace the pooling operations. Due to the invertibility of DWT, none of the image information or intermediate features are lost by the proposed downsampling scheme. Moreover, both frequency and location information of feature maps are captured by DWT [21], [22], which is helpful for preserving detailed texture when using a multi-frequency feature representation. More specifically, we adopt the inverse wavelet transform (IWT) with expansion convolutional layers to restore the resolution of feature maps in image restoration tasks, where the U-Net architecture [23] is used as the backbone network architecture. Also, element-wise summation is adopted to combine feature maps, thus enriching the feature representation.

In terms of relation with relevant works, we show that dilated filtering can be interpreted as a special variant of MWCNN, and that the proposed method is more general and effective in enlarging the receptive field. Using an ensemble of such networks trained with embedded multi-level wavelets, we achieve PSNR/SSIM values that improve upon the best known results in image restoration tasks such as image denoising, SISR and JPEG image artifacts removal. For the task of object classification, the proposed MWCNN achieves higher performance than when adopting pooling layers. As shown in Figure 1, although MWCNN is moderately slower than LapSRN [3], DnCNN [5] and VDSR [2], MWCNN has a much larger receptive field and achieves a higher PSNR value.

This paper is an extension of our previous work [24]. Compared to the former work [24], we propose a more general approach for improving performance, further extend it to a high-level task, and provide more analysis and discussion. To sum up, the contributions of this work include:
• A novel MWCNN model to enlarge the receptive field with a better tradeoff between efficiency and restoration performance by introducing the wavelet transform.
• Promising detail preservation due to the good time-frequency localization property of DWT.
• A general approach to embedding the wavelet transform in any CNN where a pooling operation is employed.
• State-of-the-art performance on image denoising, SISR, JPEG image artifacts removal, and classification.

The remainder of the paper is organized as follows. Sec. II briefly reviews the development of CNNs for image restoration and classification. Sec. III describes the proposed MWCNN model in detail. Sec. IV reports the experimental results in terms of performance evaluation. Finally, Sec. V concludes the paper.

P. Liu is with the School of Computer Science and Technology, Harbin Institute of Technology, China, e-mail: [email protected]. H. Zhang and W. Zuo are with Harbin Institute of Technology. W. Lian is with the Department of Computer Science, Changzhi University, China.

arXiv:1907.03128v1 [cs.CV] 6 Jul 2019

II. RELATED WORK

In this section, the development of CNNs for image restoration tasks is briefly reviewed. In particular, we discuss relevant works on incorporating DWT in CNNs. Finally, relevant object classification works are introduced.

A. Image Restoration

Image restoration aims at recovering the latent clean image x from its degraded observation y. For decades, research on image restoration has been done from the viewpoints of both prior modeling and discriminative learning [25]–[30]. Recently, with this booming development, CNN-based methods have achieved state-of-the-art performance over the traditional methods.

1) Improving Performance and Efficiency of CNNs for Image Restoration: In early attempts, CNN-based methods did not work well on some image restoration tasks. For example, the methods of [31]–[33] could not achieve state-of-the-art denoising performance compared to BM3D [27] in 2007. In [34], a multi-layer perceptron (MLP) achieved performance comparable to BM3D by learning the mapping from noisy patches to clean patches. In 2014, Dong et al. [1] for the first time adopted only a 3-layer FCN without pooling for SISR, which realizes only a small receptive field but achieves state-of-the-art performance. Then, Dong et al. [35] proposed a 4-layer ARCNN for JPEG image artifacts reduction.

Recently, deeper networks are increasingly used for image restoration. For SISR, Kim et al. [2] stacked a 20-layer CNN with residual learning and adjustable gradient clipping. Subsequently, some works, for example, very deep networks [5], [36], [37], symmetric skip connections [20], residual units [11], Laplacian pyramids [3], and recursive architectures [17], [38], had also been suggested to enlarge the receptive field. However, the receptive field of those methods is enlarged with the increase of network depth, which may have limited potential to extend to deeper networks.

For a better tradeoff between speed and performance, a 7-layer FCN with dilated filtering was presented as a denoiser by Zhang et al. [12]. Santhanam et al. [39] adopt pooling/unpooling to obtain and aggregate multi-context representations for image denoising. In [40], Zhang et al. considered operating the CNN denoiser on downsampled subimages. Guo et al. [41] utilized a U-Net [23] based CNN as a non-blind denoiser. On account of the speciality of SISR, the receptive field size and efficiency can be better traded off by taking the low-resolution (LR) image as input and zooming in on features with an upsampling operation [4], [18], [42]. Nevertheless, this strategy can only be adopted for SISR, and is not suitable for other tasks, such as image denoising and JPEG image artifacts removal.

2) Universality of Image Restoration: On account of the similarity of tasks such as image denoising, SISR, and JPEG image artifacts removal, the model suggested for one task may be easily extended to other image restoration tasks simply by retraining the same network. For example, both DnCNN [5] and MemNet [19] had been evaluated on all these three tasks. Moreover, CNN denoisers can also serve as a kind of plug-and-play prior. Thus, any restoration task can be tackled by sequentially applying the CNN denoisers, incorporated with unrolled inference [12]. To provide an explicit functional for defining the regularization induced by denoisers, Romano et al. [14] further proposed a regularization-by-denoising framework. In [43] and [44], the LR image with blur kernel is incorporated into CNNs for non-blind SR. These methods not only promote the application of CNNs in low-level vision, but also present solutions to deploying CNN denoisers for other image restoration tasks.

3) Incorporating DWT in CNNs: Several studies have also been devoted to incorporating the wavelet transform into CNNs. Bae et al. [45] proposed a wavelet residual network (WavResNet) with the discovery that CNN learning can benefit from learning on wavelet subbands with features having more channels. For recovering missing details in subbands, Guo et al. [46] proposed a deep wavelet super-resolution (DWSR) method.
Subsequently, deep convolutional framelets (DCF) [47], [48] had been developed for low-dose CT and inverse problems. However, only one-level wavelet decomposition is considered in WavResNet and DWSR, which may restrict the application of the wavelet transform. Inspired by the viewpoint of decomposition, DCF processes each subband independently, which spontaneously ignores the dependency between these subbands. In contrast, the multi-level wavelet transform is considered by our MWCNN to enlarge the receptive field, where the computational burden is barely increased.

Fig. 2: From WPT to MWCNN: (a) multi-level WPT architecture; (b) embedding CNN blocks. Intuitively, WPT can be seen as a special case of our MWCNN without CNN blocks, as shown in (a) and (b). By inserting CNN blocks into WPT, we design our MWCNN as in (b). Obviously, our MWCNN is a generalization of multi-level WPT, and reduces to WPT when each CNN block becomes the identity mapping.
B. Object Classification

AlexNet [6] is an 8-layer network for object classification, and for the first time achieved state-of-the-art performance, outperforming other methods on the ILSVRC2012 dataset. In this method, differently sized filters are adopted for extracting and enhancing features. However, Simonyan et al. [7] found that using only 3 × 3 convolutional filters with a deeper architecture can realize a larger receptive field and achieve better performance than AlexNet. Yu et al. [8] adopted dilated convolution to enlarge the receptive field without increasing the computational burden. Later, residual blocks [9], [10], the inception model [49], pyramid architectures [50], doubly CNN [51] and other architectures [52], [53] were proposed for object classification. Some measures on the pooling operation, such as parallel grid pooling [54] and gated mixture of second-order pooling [55], were also proposed to enhance the feature extractor or feature representation and thus promote performance.

In general, pooling operations, such as average pooling and max pooling, are often adopted for downsampling features and enlarging the receptive field, but they can result in significant information loss. To avoid this downside, we adopt DWT as our downsampling layer, replacing the pooling operation without changing the main architecture, resulting in more power for enhancing feature representation.

III. METHOD

In this section, we first briefly introduce the concept of multi-level wavelet packet transform (WPT) and provide our motivation. We then formally present our MWCNN based on multi-level WPT, and describe its network architecture for image restoration and object classification. Finally, a discussion is presented to analyze the connection of MWCNN with average pooling and dilated filtering.

A. From multi-level WPT to MWCNN

Given an image x, we can use the 2D DWT [56] with four convolutional filters, i.e., the low-pass filter f_LL and the high-pass filters f_LH, f_HL, f_HH, to decompose x into four subband images x_LL, x_LH, x_HL, and x_HH. Note that the four filters have fixed parameters and convolutional stride 2 during the transformation. Taking the Haar wavelet as an example, the four filters are defined as

$$f_{LL}=\begin{bmatrix}1 & 1\\ 1 & 1\end{bmatrix},\quad f_{LH}=\begin{bmatrix}-1 & -1\\ 1 & 1\end{bmatrix},\quad f_{HL}=\begin{bmatrix}-1 & 1\\ -1 & 1\end{bmatrix},\quad f_{HH}=\begin{bmatrix}1 & -1\\ -1 & 1\end{bmatrix}. \quad (1)$$

It is evident that f_LL, f_LH, f_HL, and f_HH are orthogonal to each other and, when flattened, form a 4 × 4 invertible matrix. The operation of DWT is defined as x_LL = (f_LL ⊗ x) ↓2, x_LH = (f_LH ⊗ x) ↓2, x_HL = (f_HL ⊗ x) ↓2, and x_HH = (f_HH ⊗ x) ↓2, where ⊗ denotes the convolution operator and ↓2 denotes the standard downsampling operator with factor 2. In other words, DWT mathematically involves four fixed convolution filters with stride 2 to implement the downsampling operator. Moreover, according to the theory of the Haar transform [56], the (i, j)-th values of x_LL, x_LH, x_HL and x_HH after the 2D Haar transform can be written as

$$\begin{aligned}
x_{LL}(i,j) &= x(2i{-}1,2j{-}1)+x(2i{-}1,2j)+x(2i,2j{-}1)+x(2i,2j),\\
x_{LH}(i,j) &= -x(2i{-}1,2j{-}1)-x(2i{-}1,2j)+x(2i,2j{-}1)+x(2i,2j),\\
x_{HL}(i,j) &= -x(2i{-}1,2j{-}1)+x(2i{-}1,2j)-x(2i,2j{-}1)+x(2i,2j),\\
x_{HH}(i,j) &= x(2i{-}1,2j{-}1)-x(2i{-}1,2j)-x(2i,2j{-}1)+x(2i,2j).
\end{aligned} \quad (2)$$

Although the downsampling operation is deployed, due to the biorthogonal property of DWT, the original image x can be accurately reconstructed without information loss by the IWT, i.e., x = IWT(x_LL, x_LH, x_HL, x_HH). For the Haar wavelet, the IWT can be defined as follows:

$$\begin{aligned}
x(2i{-}1,2j{-}1) &= \left(x_{LL}(i,j)-x_{LH}(i,j)-x_{HL}(i,j)+x_{HH}(i,j)\right)/4,\\
x(2i{-}1,2j) &= \left(x_{LL}(i,j)-x_{LH}(i,j)+x_{HL}(i,j)-x_{HH}(i,j)\right)/4,\\
x(2i,2j{-}1) &= \left(x_{LL}(i,j)+x_{LH}(i,j)-x_{HL}(i,j)-x_{HH}(i,j)\right)/4,\\
x(2i,2j) &= \left(x_{LL}(i,j)+x_{LH}(i,j)+x_{HL}(i,j)+x_{HH}(i,j)\right)/4.
\end{aligned} \quad (3)$$

Generally, the subband images x_LL, x_LH, x_HL, and x_HH can be sequentially decomposed by DWT for further processing in multi-level WPT [22], [57]. To obtain the result of a two-level WPT, DWT is utilized separately to decompose each subband image x_i (i = LL, LH, HL, or HH) into four subband images x_{i,LL}, x_{i,LH}, x_{i,HL}, and x_{i,HH}. Recursively, the results of three- or higher-level WPT can be obtained.
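The unnormalized Haar DWT/IWT pair of Eqs. (2) and (3), and the WPT recursion described above, can be sketched in a few lines of NumPy. The function names and the 0-based array slicing are ours (the paper uses 1-based indices), not the authors' released code:

```python
import numpy as np

def haar_dwt(x):
    """Split x (H x W, even dims) into the four subbands of Eq. (2)."""
    a = x[0::2, 0::2]  # x(2i-1, 2j-1) in the paper's 1-based indexing
    b = x[0::2, 1::2]  # x(2i-1, 2j)
    c = x[1::2, 0::2]  # x(2i,   2j-1)
    d = x[1::2, 1::2]  # x(2i,   2j)
    return a + b + c + d, -a - b + c + d, -a + b - c + d, a - b - c + d

def haar_iwt(xLL, xLH, xHL, xHH):
    """Invert haar_dwt exactly, following Eq. (3)."""
    H, W = xLL.shape
    x = np.empty((2 * H, 2 * W))
    x[0::2, 0::2] = (xLL - xLH - xHL + xHH) / 4
    x[0::2, 1::2] = (xLL - xLH + xHL - xHH) / 4
    x[1::2, 0::2] = (xLL + xLH - xHL - xHH) / 4
    x[1::2, 1::2] = (xLL + xLH + xHL + xHH) / 4
    return x

x = np.random.rand(8, 8)
assert np.allclose(haar_iwt(*haar_dwt(x)), x)     # lossless round trip
level2 = [haar_dwt(s) for s in haar_dwt(x)]       # two-level WPT: 16 subbands
```

The round-trip assertion is exactly the invertibility property the paper relies on when replacing pooling with DWT.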
Fig. 3: Multi-level wavelet-CNN architecture. It consists of two parts: the contracting and expanding subnetworks. Each solid box corresponds to a multi-channel feature map, and the number of channels is annotated on the top of each box. The number of convolutional layers is set to 24. Moreover, our MWCNN can be further extended to a higher level (e.g., ≥ 4) by duplicating the configuration of the 3rd-level subnetwork.
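The shape bookkeeping behind the contracting path in Fig. 3 can be sanity-checked with a toy computation: DWT stacks the four subbands as channels, so a C × H × W map becomes 4C × (H/2) × (W/2) before the following conv block adjusts the channel count. The 1 × 1 random projection below is a stand-in for the paper's 3 × 3 Conv + ReLU blocks, and the channel counts are illustrative, not the exact configuration of the figure:

```python
import numpy as np

def dwt_channels(x):
    """Per-channel Haar DWT: (C, H, W) -> (4C, H/2, W/2) subband stack."""
    a, b = x[:, 0::2, 0::2], x[:, 0::2, 1::2]
    c, d = x[:, 1::2, 0::2], x[:, 1::2, 1::2]
    return np.concatenate([a + b + c + d, -a - b + c + d,
                           -a + b - c + d, a - b - c + d], axis=0)

def project(x, c_out):
    """1x1 conv stand-in: random channel mixing to c_out channels."""
    w = np.random.randn(c_out, x.shape[0])
    return np.einsum('oc,chw->ohw', w, x)

x = project(np.random.rand(3, 64, 64), 64)   # first conv block: 3 -> 64 channels
x = project(dwt_channels(x), 256)            # DWT: 64 -> 256 channels at 32 x 32
x = project(dwt_channels(x), 512)            # DWT: 256 -> 1024, reduced to 512 at 16 x 16
print(x.shape)                               # (512, 16, 16)
```

Note that spatial resolution halves at each level while no values are discarded, unlike pooling.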
Correspondingly, the reconstruction of each level of subband images is implemented by the complete inverse operation via IWT. The above-mentioned process of decomposition and reconstruction of an image is illustrated in Figure 2(a). If we treat the filters of WPT as convolutional filters with pre-defined weights, one can see that WPT is a special case of FCN without the nonlinearity layers. Obviously, the original image x can be first decomposed by WPT and then accurately reconstructed by the inverse WPT without any information loss. In image processing applications such as image denoising and compression, some operations, e.g., soft-thresholding and quantization, are usually required to process the decomposition coefficients [58], [59], as shown in Figure 2(a). These operations can be treated as a kind of nonlinearity tailored to the specific task. In this work, we further extend WPT to the multi-level wavelet CNN (MWCNN) by plugging CNN blocks into the traditional WPT-based method, as illustrated in Figure 2(b). Due to the biorthogonal property of WPT, our MWCNN can use subsampling and upsampling operations safely without incurring information loss. Obviously, our MWCNN is a generalization of multi-level WPT, and reduces to WPT when each CNN block becomes the identity mapping. Moreover, DWT can be treated as a downsampling operation and extended to any CNN where a pooling operation is required.

B. Network architecture

1) Image Restoration: As mentioned previously in Sec. III-A, we design the MWCNN architecture for image restoration based on the principle of WPT, as illustrated in Figure 2(b). The key idea is to insert CNN blocks into WPT before (or after) each level of DWT. As shown in Figure 3, each CNN block is a 3-layer FCN without pooling, and takes both low-frequency and high-frequency subbands as inputs. More concretely, each layer contains convolution with 3 × 3 filters (Conv) and rectified linear unit (ReLU) operations. Only Conv is adopted in the last layer for predicting the residual result. The total number of convolutional layers is set to 24. For more details on the setting of MWCNN, please refer to Figure 3.

Our MWCNN modifies U-Net in three aspects. (i) In conventional U-Net, pooling and deconvolution are utilized as the downsampling and upsampling layers; in MWCNN, DWT and IWT are used instead. (ii) After DWT, we deploy additional CNN blocks to reduce the number of feature map channels for a compact representation and for modeling inter-band dependency, and before IWT, convolution is adopted to increase the number of feature map channels so that IWT can upsample the feature maps. In comparison, conventional U-Net uses convolution layers to increase the number of feature map channels after pooling (pooling itself leaves the channel number unchanged), and directly adopts deconvolution layers to zoom in on the feature maps for upsampling. (iii) In MWCNN, element-wise summation is used to combine the feature maps from the contracting and expanding subnetworks, while in conventional U-Net, concatenation is adopted. Compared to our previous work [24], we have made several improvements: (i) instead of directly decomposing the input image by DWT, we first use conv blocks to extract features from the input, which is empirically shown to be beneficial for image restoration; (ii) in the 3rd hierarchical level, we use more feature maps to enhance the feature representation. In our implementation, the Haar wavelet is adopted as the default wavelet in MWCNN. Other wavelets, e.g., Daubechies 2 (DB2), are also considered in our experiments.

Denote by Θ the network parameters of MWCNN and by F(y; Θ) the network output. Let {(y_i, x_i)}_{i=1}^{N} be a training set, where y_i is the i-th input image and x_i is the corresponding ground-truth image. Then the objective function for learning MWCNN is given by

$$\mathcal{L}(\Theta)=\frac{1}{2N}\sum_{i=1}^{N}\left\|F(y_i;\Theta)-(y_i-x_i)\right\|_F^2. \quad (4)$$

The ADAM algorithm [60] is adopted to train MWCNN by minimizing the objective function.

2) Extend to Object Classification: Similar to image restoration, DWT is employed as a downsampling operation, often without a corresponding upsampling operation, to replace pooling. A compression filter with 1 × 1 Conv is subsequently
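As a minimal illustration of the residual-learning objective in Eq. (4), where the network output F(y) is trained to match the degradation residual y − x (so the restored image is y − F(y)), the loss can be sketched as follows. `mwcnn_loss` is our hypothetical helper, not the authors' training code:

```python
import numpy as np

def mwcnn_loss(preds, inputs, targets):
    """L(Theta) = 1/(2N) * sum_i ||F(y_i) - (y_i - x_i)||_F^2  (Eq. (4))."""
    N = len(preds)
    total = 0.0
    for F_y, y, x in zip(preds, inputs, targets):
        residual = y - x                        # ground-truth residual
        total += np.sum((F_y - residual) ** 2)  # squared Frobenius norm
    return total / (2 * N)

# If the network predicted the residual exactly, the loss is zero:
y = [np.random.rand(16, 16) for _ in range(2)]   # degraded inputs
x = [np.random.rand(16, 16) for _ in range(2)]   # ground-truth images
perfect = [yi - xi for yi, xi in zip(y, x)]
assert mwcnn_loss(perfect, y, x) == 0.0
```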
Fig. 4: Illustration of average pooling, dilated filtering and the proposed MWCNN. Taking one CNN block as an example: (a) sum-pooling with factor 2 leads to the most significant information loss, which is not suitable for image restoration; (b) dilated filtering with rate 2 is equal to shared-parameter convolution on sub-images; (c) the proposed MWCNN first decomposes an image into four sub-bands (LL, LH, HL, HH) and then concatenates them as the input to CNN blocks; IWT is then used as an upsampling layer to restore the resolution of the image.
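The point behind Fig. 4(a) can be verified numerically: with the unnormalized Haar filters of Eq. (2), 2 × 2 average pooling is exactly the LL subband scaled by 1/4, so pooling keeps one of the four subbands and discards the other three. The helper names are ours:

```python
import numpy as np

def avg_pool2(x):
    """2x2 average pooling with stride 2."""
    return (x[0::2, 0::2] + x[0::2, 1::2] + x[1::2, 0::2] + x[1::2, 1::2]) / 4

def haar_ll(x):
    """Unnormalized Haar LL subband of Eq. (2)."""
    return x[0::2, 0::2] + x[0::2, 1::2] + x[1::2, 0::2] + x[1::2, 1::2]

x = np.random.rand(8, 8)
assert np.allclose(avg_pool2(x), haar_ll(x) / 4)  # pooling == scaled LL band
```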
utilized after the DWT. Note that we do not modify other blocks or the loss function. With this improvement, features can be further selected and enhanced with adaptive learning. Moreover, DWT can be adopted in place of the pooling operation in any CNN that uses pooling, and the information of feature maps can then be transmitted to the next layer without loss. DWT can thus be seen as a safe downsampling module that can be plugged into any CNN without the need to change the network architecture, and may benefit the extraction of more powerful features for different tasks.

C. Discussion

1) Connection to Pooling Operation: The DWT in the proposed MWCNN is closely related to the pooling operation and dilated filtering. Using the Haar wavelet as an example, we explain the connection between DWT and average pooling. For average pooling with factor 2, the (i, j)-th value of the feature map x_l in the l-th layer after pooling can be written as

$$x_l(i,j)=\left(x_{l-1}(2i{-}1,2j{-}1)+x_{l-1}(2i{-}1,2j)+x_{l-1}(2i,2j{-}1)+x_{l-1}(2i,2j)\right)/4, \quad (5)$$

where x_{l−1} is the feature map before the pooling operation. It is clear that Eq. (5) is the same (up to a constant scaling factor) as the low-frequency component of DWT in Eq. (2), which also means that all the high-frequency information is lost during the pooling operation. In Figure 4(a), the feature map is first decomposed into four sub-images with stride 2. The average pooling operation can be treated as summing all sub-images with the fixed coefficient 1/4 to generate one new sub-image. In comparison, DWT combines the sub-images with four fixed orthogonal sets of weights to obtain four new sub-images. By taking all the subbands into account, MWCNN can therefore avoid the information loss caused by conventional subsampling, and may benefit restoration and classification. Hence, average pooling can be seen as a simplified variant of the proposed MWCNN.

2) Connection to Dilated Filtering: To illustrate the connection between MWCNN and dilated filtering, we first give the definition of dilated filtering with factor 2:

$$(x_l \otimes_2 k)(i,j)=\sum_{p+2s=i,\; q+2t=j} x_l(p,q)\,k(s,t), \quad (6)$$

where ⊗_2 denotes the convolution operation with dilation factor 2, (s, t) is a position in the convolutional kernel k, and (p, q) is a position within the range of the convolution on the feature map x_l. Eq. (6) can be decomposed into two steps, sampling and convolving: a sampled patch is obtained by sampling the neighborhood of the center position (i, j) of x_l at one-pixel intervals under the constraint p + 2s = i, q + 2t = j; then the value x_{l+1}(i, j) is obtained by convolving the sampled patch with the kernel k. Therefore, dilated filtering with factor 2 can be expressed as first decomposing an image into four sub-images and then applying a shared standard convolutional kernel to those sub-images, as illustrated in Figure 4(b). We rewrite Eq. (6) for the pixel value x_{l+1}(2i−1, 2j−1) as follows:

$$(x_l \otimes_2 k)(2i{-}1,2j{-}1)=\sum_{p+2s=2i-1,\; q+2t=2j-1} x_l(p,q)\,k(s,t). \quad (7)$$

The pixel values x_{l+1}(2i−1, 2j), x_{l+1}(2i, 2j−1) and x_{l+1}(2i, 2j) can be obtained in the same way. Actually, the values of x_l at the positions (2i−1, 2j−1), (2i−1, 2j), (2i, 2j−1) and (2i, 2j) can be obtained by applying IWT to the subband images x_l^{LL}, x_l^{LH}, x_l^{HL} and x_l^{HH} based on Eq. (3). Therefore, dilated filtering can be represented as convolution with the subband images as follows:

$$\begin{aligned}
(x_l \otimes_2 k)(2i{-}1,2j{-}1) &= \big((x_l^{LL}-x_l^{LH}-x_l^{HL}+x_l^{HH})\otimes k\big)(i,j)/4,\\
(x_l \otimes_2 k)(2i{-}1,2j) &= \big((x_l^{LL}-x_l^{LH}+x_l^{HL}-x_l^{HH})\otimes k\big)(i,j)/4,\\
(x_l \otimes_2 k)(2i,2j{-}1) &= \big((x_l^{LL}+x_l^{LH}-x_l^{HL}-x_l^{HH})\otimes k\big)(i,j)/4,\\
(x_l \otimes_2 k)(2i,2j) &= \big((x_l^{LL}+x_l^{LH}+x_l^{HL}+x_l^{HH})\otimes k\big)(i,j)/4.
\end{aligned} \quad (8)$$

Different from dilated filtering, the operation of MWCNN in Figure 4(c) can be given as x_l^{DWT} ⊗ k, where x_l^{DWT} = Concat(x_l^{LL}, x_l^{LH}, x_l^{HL}, x_l^{HH}) and Concat(·) denotes the concatenation operation. If k is a group convolution [61] with factor 4, the equation can be rewritten as:

$$\big(x_l^{DWT}\otimes k\big)(i,j)=\sum_{b\in\{LL,LH,HL,HH\}}\big(x_l^{b}\otimes k_b\big)(i,j). \quad (9)$$