Compression Artifacts Reduction by a Deep Convolutional Network
Chao Dong, Yubin Deng, Chen Change Loy, and Xiaoou Tang
Department of Information Engineering, The Chinese University of Hong Kong
{dc012,dy015,ccloy,xtang}@ie.cuhk.edu.hk

Abstract

Lossy compression introduces complex compression artifacts, particularly blocking artifacts, ringing effects and blurring. Existing algorithms either focus on removing blocking artifacts and produce blurred output, or restore sharpened images that are accompanied by ringing effects. Inspired by the deep convolutional networks (DCN) on super-resolution [5], we formulate a compact and efficient network for seamless attenuation of different compression artifacts. We also demonstrate that a deeper model can be effectively trained with the features learned in a shallow network. Following a similar "easy to hard" idea, we systematically investigate several practical transfer settings and show the effectiveness of transfer learning in low-level vision problems. Our method shows superior performance over the state-of-the-art both on benchmark datasets and in a real-world use case (i.e. Twitter).

Figure 1. Example compressed images and our restoration results on the JPEG compression scheme and the real use case – Twitter. (a) Left: the JPEG-compressed image, where we can see blocking artifacts, ringing effects and blurring on the eyes, and abrupt intensity changes on the face. Right: the image restored by the proposed deep model (AR-CNN), where these compression artifacts are removed and sharp details are produced. (b) Left: the Twitter-compressed image, which is first re-scaled to a small image and then compressed on the server side. Right: the image restored by the proposed deep model (AR-CNN).

1. Introduction

Lossy compression (e.g. JPEG, WebP and HEVC-MSP) is one class of data encoding methods that uses inexact approximations to represent the encoded content. In this age of information explosion, lossy compression is indispensable and inevitable for companies (e.g. Twitter and Facebook) to save bandwidth and storage space. However, compression by its nature introduces undesired complex artifacts, which severely reduce the user experience (e.g. Figure 1). All these artifacts not only decrease perceptual visual quality, but also adversely affect various low-level image processing routines that take compressed images as input, e.g. contrast enhancement [16], super-resolution [30, 5], and edge detection [3]. Despite this huge demand, effective compression artifacts reduction remains an open problem.

We take JPEG compression as an example to explain compression artifacts. The JPEG compression scheme divides an image into 8×8 pixel blocks and applies a block discrete cosine transform (DCT) on each block individually. Quantization is then applied to the DCT coefficients to save storage space. This step causes a complex combination of different artifacts, as depicted in Figure 1(a). Blocking artifacts arise when each block is encoded without considering the correlation with adjacent blocks, resulting in discontinuities at the 8×8 borders. Ringing effects along the edges occur due to the coarse quantization of the high-frequency components (also known as the Gibbs phenomenon [9]). Blurring happens due to the loss of high-frequency components. To cope with the various compression artifacts, different approaches have been proposed, some of which can only deal with certain types of artifacts. For instance, deblocking oriented approaches [18, 21, 26] perform filtering along the block boundaries to reduce only blocking artifacts. Liew et al. [17] and Foi et al. [6] use thresholding by wavelet transform and Shape-Adaptive DCT transform, respectively.
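As a side illustration, the blockwise DCT-plus-quantization step described above can be simulated in a few lines. This is a simplified sketch, not the actual JPEG codec: it uses a single uniform quantization step `q_step` instead of JPEG's frequency-dependent quantization tables, and the function names are our own.

```python
import numpy as np

def dct_matrix(n=8):
    # Orthonormal DCT-II basis: dct_matrix(n) @ dct_matrix(n).T == identity
    k = np.arange(n).reshape(-1, 1)
    x = np.arange(n).reshape(1, -1)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * x + 1) * k / (2 * n))
    c[0, :] /= np.sqrt(2.0)
    return c

def jpeg_like(image, q_step=40.0, n=8):
    """Simulate JPEG-style degradation: blockwise DCT, uniform
    quantization of the coefficients, then inverse DCT.
    (Real JPEG uses frequency-dependent quantization tables;
    a single step size is used here for simplicity.)"""
    c = dct_matrix(n)
    h, w = image.shape
    out = np.empty_like(image, dtype=np.float64)
    for i in range(0, h, n):
        for j in range(0, w, n):
            block = image[i:i + n, j:j + n].astype(np.float64)
            coeff = c @ block @ c.T                    # forward block DCT
            coeff = np.round(coeff / q_step) * q_step  # quantize / dequantize
            out[i:i + n, j:j + n] = c.T @ coeff @ c    # inverse block DCT
    return out
```

Because each 8×8 block is quantized independently, running `jpeg_like` with a large `q_step` produces exactly the discontinuities at block borders and the ringing near edges described above.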
These approaches are good at removing blocking and ringing artifacts, but tend to produce blurred output. Jung et al. [13] propose a restoration method based on sparse representation. They produce sharpened images, but these are accompanied by noisy edges and unnaturally smooth regions.

To date, deep learning has shown impressive results on both high-level and low-level vision problems. In particular, the SRCNN proposed by Dong et al. [5] shows the great potential of an end-to-end DCN in image super-resolution. The study also points out that the conventional sparse-coding-based image restoration model can equally be seen as a deep model. However, we find that the three-layer network is not well suited to restoring compressed images, especially in dealing with blocking artifacts and handling smooth regions. As various artifacts are coupled together, the features extracted by the first layer are noisy, causing undesirable noisy patterns in the reconstruction.

To eliminate the undesired artifacts, we improve the SRCNN by embedding one or more "feature enhancement" layers after the first layer to clean the noisy features. Experiments show that the improved model, namely "Artifacts Reduction Convolutional Neural Networks (AR-CNN)", is exceptionally effective in suppressing blocking artifacts while retaining edge patterns and sharp details (see Figure 1). However, we encounter difficulties in training a deeper DCN. "Deeper is better" is widely observed in high-level vision problems, but not in low-level vision tasks. Specifically, "deeper is not better" has been pointed out in super-resolution [4], where training a five-layer network becomes a bottleneck. The difficulty of training is partially due to the sub-optimal initialization settings.

The aforementioned difficulty motivates us to investigate a better way to train a deeper model for low-level vision problems. We find that this can be effectively solved by transferring the features learned in a shallow network to a deeper one and fine-tuning simultaneously¹. This strategy has also proven successful in learning a deeper CNN for image classification [24]. Following a similar general intuitive idea, easy to hard, we discover other interesting transfer settings in this low-level vision task: (1) We transfer the features learned in a high-quality compression model (easier) to a low-quality one (harder), and find that it converges faster than random initialization. (2) In the real use case, companies tend to apply different compression strategies (including re-scaling) according to their purposes (e.g. Figure 1(b)). We transfer the features learned in a standard compression model (easier) to a real use case (harder), and find that it performs better than learning from scratch.

The contributions of this study are two-fold: (1) We formulate a new deep convolutional network for efficient reduction of various compression artifacts. Extensive experiments, including those on real use cases, demonstrate the effectiveness of our method over state-of-the-art methods [6, 12] both perceptually and quantitatively. (2) We verify that reusing the features learned in shallow networks is helpful in learning a deeper model for compression artifact reduction. Under the same intuitive idea – easy to hard – we reveal a number of interesting and practical transfer settings. Our study is the first attempt to show the effectiveness of feature transfer in a low-level vision problem.

¹Generally, the transfer learning method will first train a base network, and then copy the learned parameters or features of several layers to the corresponding layers of a target network. These transferred layers can be left frozen or fine-tuned on the target dataset. The remaining layers are randomly initialized and trained on the target task.

2. Related work

Existing algorithms can be classified into deblocking oriented and restoration oriented methods. The deblocking oriented methods focus on removing blocking and ringing artifacts. In the spatial domain, different kinds of filters [18, 21, 26] have been proposed to adaptively deal with blocking artifacts in specific regions (e.g., edge, texture, and smooth regions). In the frequency domain, Liew et al. [17] utilize the wavelet transform and derive thresholds at different wavelet scales for denoising. The most successful deblocking oriented method is perhaps the Pointwise Shape-Adaptive DCT (SA-DCT) [6], which is widely acknowledged as the state-of-the-art approach [12, 16]. However, like most deblocking oriented methods, SA-DCT cannot reproduce sharp edges, and tends to overly smooth texture regions. The restoration oriented methods regard the compression operation as distortion and propose restoration algorithms. They include the projection onto convex sets based method (POCS) [32], solving a MAP problem (FoE) [25], the sparse-coding-based method [13] and the Regression Tree Fields based method (RTF) [12], which is the new state-of-the-art method. The RTF takes the results of SA-DCT [6] as bases and produces globally consistent image reconstructions with a regression tree field model. It can also be optimized for any differentiable loss function (e.g. SSIM), but often at the cost of other evaluation metrics.

The Super-Resolution Convolutional Neural Network (SRCNN) [5] is closely related to our work. In that study, the independent steps of the sparse-coding-based method are formulated as different convolutional layers and optimized in a unified network. It shows the potential of deep models in low-level vision problems like super-resolution. However, the compression model differs from super-resolution in that it comprises different kinds of artifacts. Designing a deep model for compression restoration requires a deep understanding of the different artifacts. We show that directly applying the SRCNN architecture to compression restoration results in undesired noisy patterns in the reconstructed image.

Transfer learning in deep neural networks becomes popular

Figure 2. The framework of the Artifacts Reduction Convolutional Neural Network (AR-CNN). The network consists of four convolutional layers, each responsible for a specific operation: feature extraction ("noisy" feature maps), feature enhancement ("cleaner" feature maps), mapping ("restored" feature maps), and reconstruction, taking the compressed image as input and producing the reconstructed image as output.
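As a rough illustration of the four-stage pipeline in Figure 2, the forward pass can be sketched as a plain stack of convolutions. This is a minimal NumPy sketch, not the authors' implementation: the kernel sizes and channel widths below, and the use of ReLU activations, are illustrative assumptions.

```python
import numpy as np

def conv2d(x, w, b):
    """Naive multi-channel 'valid' convolution (cross-correlation,
    as is conventional in CNNs). x: (c_in, h, w), w: (c_out, c_in, kh, kw)."""
    c_out, c_in, kh, kw = w.shape
    h, wd = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    out = np.empty((c_out, h, wd))
    for o in range(c_out):
        acc = np.zeros((h, wd))
        for i in range(c_in):
            for a in range(kh):
                for bb in range(kw):
                    acc += w[o, i, a, bb] * x[i, a:a + h, bb:bb + wd]
        out[o] = acc + b[o]
    return out

def relu(x):
    return np.maximum(x, 0.0)

def ar_cnn_forward(y, params):
    """Feature extraction -> feature enhancement -> mapping ->
    reconstruction (no activation on the final reconstruction layer)."""
    f1 = relu(conv2d(y, *params["extract"]))
    f2 = relu(conv2d(f1, *params["enhance"]))
    f3 = relu(conv2d(f2, *params["map"]))
    return conv2d(f3, *params["reconstruct"])

def random_params(rng,
                  sizes=((9, 1, 16), (7, 16, 8), (1, 8, 4), (5, 4, 1))):
    # (kernel, c_in, c_out) per stage; widths kept small for illustration
    names = ("extract", "enhance", "map", "reconstruct")
    return {n: (0.01 * rng.standard_normal((co, ci, k, k)), np.zeros(co))
            for n, (k, ci, co) in zip(names, sizes)}
```

With 'valid' convolutions the output is smaller than the input; a practical version would pad the borders so that the restored image matches the compressed input in size.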
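The layer-wise transfer procedure described in footnote 1 (copy several layers from a trained base network, randomly initialize the rest, then fine-tune) amounts to a simple copy between parameter dictionaries. In this sketch the dictionary layout and layer names are our own illustration, not the authors' code:

```python
import numpy as np

def transfer_params(base, target, transfer_layers, rng, init_scale=0.01):
    """Build initial parameters for a target network: layers listed in
    `transfer_layers` are copied from the trained base network (they can
    later be frozen or fine-tuned); all remaining target layers are
    randomly re-initialized and trained from scratch on the target task.
    Each network is a dict mapping layer name -> weight array."""
    out = {}
    for name, w in target.items():
        if name in transfer_layers:
            out[name] = base[name].copy()
        else:
            out[name] = init_scale * rng.standard_normal(w.shape)
    return out
```

For example, transferring only the first-layer features of a shallow base model into a deeper target corresponds to `transfer_params(base, target, {"conv1"}, rng)`.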