Non-Local ConvLSTM for Video Compression Artifact Reduction


Yi Xu1∗  Longwen Gao2  Kai Tian1  Shuigeng Zhou1†  Huyang Sun2
1Shanghai Key Lab of Intelligent Information Processing, and School of Computer Science, Fudan University, Shanghai, China
2Bilibili, Shanghai, China
{yxu17,ktian14,sgzhou}[email protected]  {gaolongwen,sunhuyang}[email protected]

arXiv:1910.12286v1 [eess.IV] 27 Oct 2019

∗This author did most work during his internship at Bilibili.
†Corresponding author.

Abstract

Video compression artifact reduction aims to recover high-quality videos from low-quality compressed videos. Most existing approaches use a single neighboring frame or a pair of neighboring frames (preceding and/or following the target frame) for this task. Furthermore, as frames of high quality overall may contain low-quality patches, and high-quality patches may exist in frames of low quality overall, current methods focusing on nearby peak-quality frames (PQFs) may miss high-quality details in low-quality frames. To remedy these shortcomings, in this paper we propose a novel end-to-end deep neural network called non-local ConvLSTM (NL-ConvLSTM in short) that exploits multiple consecutive frames. An approximate non-local strategy is introduced in NL-ConvLSTM to capture global motion patterns and trace the spatiotemporal dependency in a video sequence. This approximate strategy makes the non-local module work in a fast and low space-cost way. Our method uses the preceding and following frames of the target frame to generate a residual, from which a higher-quality frame is reconstructed. Experiments on two datasets show that NL-ConvLSTM outperforms the existing methods.

Figure 1. An example of high-quality patches existing in low-quality frames. Here, though the 140th and 143rd frames have better full-frame SSIM than the 142nd frame, the cropped patch from the 142nd frame has the best patch SSIM. The upper part shows the SSIM values of the frames and cropped patches; the lower images are the cropped patches from the three frames. Also, comparing the details in the boxes of the same color, we can see that the cropped patch from the 142nd frame is of better quality.

1. Introduction

Video compression algorithms are widely used due to limited communication bandwidth and storage space in many real (especially mobile) application scenarios [34]. While significantly reducing the cost of transmission and storage, lossy video compression also leads to various compression artifacts such as blocking, edge/texture floating, mosquito noise and jerkiness [48]. Such visual distortions often severely impact the quality of experience (QoE). Consequently, video compression artifact reduction has emerged as an important research topic in the multimedia and computer vision areas [26, 43, 45].

In recent years, significant advances in compressed image/video enhancement have been achieved thanks to the successful application of deep neural networks. For example, [11, 12, 36, 49] directly utilize deep convolutional neural networks to remove compression artifacts of images without considering the characteristics of the underlying compression algorithms. [16, 38, 43, 44] propose models that are fed with compressed frames and output the enhanced ones. These models all use a single frame as input and do not consider the temporal dependency of neighboring frames. To exploit the temporal correlation of neighboring frames, [26] proposes the deep Kalman filter network, [42] employs task-oriented motions, and [45] uses the two motion-compensated nearest PQFs. However, [26] uses only the preceding frames of the target frame, while [42, 45] adopt only a pair of neighboring frames, which may miss high-quality details in other neighboring frames (as will be explained later).

Video compression algorithms have intra-frame and inter-frame codecs. Inter-coded frames (P and B frames) significantly depend on their preceding and following neighbor frames. Therefore, extracting spatiotemporal relationships among the neighboring frames can provide useful information for improving video enhancement performance. However, mining detailed information from one/two neighboring frame(s), or even from the two nearest PQFs, is not enough for compressed-video artifact reduction. To illustrate this point, we present an example in Fig. 1. Frames with a larger structural similarity index measure (SSIM) are usually regarded as having better visual quality. Here, though the 140th and 143rd frames have better overall visual quality than the 142nd frame, the cropped patch of highest quality comes from the 142nd frame. The high-quality details in such patches would be ignored when mining spatiotemporal information from videos with the existing methods.

Motivated by the observation above, in this paper we try to capture the hidden spatiotemporal information from multiple preceding and following frames of the target frame to boost the performance of video compression artifact reduction. To this end, we develop a non-local ConvLSTM framework that uses the non-local mechanism [3] and the ConvLSTM [41] architecture to learn the spatiotemporal information from a frame sequence. To speed up the non-local module, we further design an approximate and effective method to compute the inter-frame pixel-wise similarity. Compared with the existing methods, our method is advantageous in at least three aspects: 1) no accurate motion estimation and compensation is explicitly needed; 2) it is applicable to videos compressed by various commonly-used compression algorithms such as H.264/AVC and H.265/HEVC; 3) it outperforms the existing methods.

Major contributions of this work include: 1) We propose a new idea for video compression artifact reduction: exploiting multiple preceding and following frames of the target frame, without explicitly computing and compensating motion between frames. 2) We develop an end-to-end deep neural network called non-local ConvLSTM to learn the spatiotemporal information from multiple neighboring frames. 3) We design an approximate method to compute the inter-frame pixel-wise similarity, which dramatically reduces computation and memory cost. 4) We conduct extensive experiments on two datasets to evaluate the proposed method, which achieves state-of-the-art performance in video compression artifact reduction.

2. Related Work

2.1. Single Image Compression Artifact Reduction

Early works mainly include manually-designed filters [3, 9, 30, 51], iterative approaches based on the theory of projections onto convex sets [29, 47], wavelet-based methods [40] and sparse coding [5, 25].

Following the success of AlexNet [21] on ImageNet [31], many deep learning based methods have been applied to this long-standing low-level computer vision task. Dong et al. [11] first proposed a four-layer network named ARCNN to reduce JPEG compression artifacts. Afterwards, [22, 28, 35, 36, 49, 50] proposed deeper networks to further reduce compression artifacts. A notable example is [49], which devised an end-to-end trainable denoising convolutional neural network (DnCNN) for Gaussian denoising; DnCNN also achieves promising results on the JPEG deblocking task. Moreover, [7, 14, 46] enhanced visual quality by exploiting wavelet/frequency-domain information of JPEG-compressed images. Recently, more methods [1, 6, 8, 12, 15, 24, 27] were proposed and achieved competitive results. Concretely, Galteri et al. [12] used a deep generative adversarial network and recovered more photorealistic details. Inspired by [39], [24] incorporated non-local operations into a recursive framework for quality restoration: it computed the self-similarity between each pixel and its neighbors, and applied the non-local module recursively for correlation propagation. Differently, here we adopt a non-local module to capture global motion patterns by exploiting inter-frame pixel-wise similarity.

2.2. Video Compression Artifact Reduction

Most existing video compression artifact reduction works [10, 16, 38, 43, 44] focus on individual frames, neglecting the spatiotemporal correlation between neighboring frames. Recently, several works were proposed to exploit the spatiotemporal information in neighboring frames. Xue et al. [42] designed a neural network with a motion estimation component and a video processing component, and utilized a joint training strategy to handle various low-level vision tasks. Lu et al. [26] further incorporated the quantized prediction residual in compressed code streams as strong prior knowledge, and proposed a deep Kalman filter network (DKFN) to utilize the spatiotemporal information from the preceding frames of the target frame. In addition, considering that the quality of nearby compressed frames fluctuates dramatically, [13, 45] proposed multi-frame quality enhancement (MFQE), which utilizes motion compensation of the two nearest PQFs to enhance low-quality frames. Compared with DKFN [26], MFQE is a post-processing method and uses less prior knowledge of the compression codec, but still achieves state-of-the-art performance on HEVC-compressed videos.

In addition to compression artifact removal, spatiotemporal correlation mining is also a hot topic in other video quality enhancement tasks, such as video super-resolution (VSR). [4, 18, 19, 23, 32, 37, 42] estimated optical flow and warped several frames to capture the hidden spatiotemporal dependency for VSR. Although these methods work well, they rely heavily on the accuracy of motion estimation. Instead of explicitly taking advantage of motion … learning spatiotemporal correlation across frames, and decoding high-level features to residuals, with which the high-quality frames are reconstructed eventually.

Figure 2. The framework of our method (left) and the architecture of NL-ConvLSTM (right).
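The full-frame versus per-patch SSIM comparison behind Fig. 1 can be sketched as follows. This is a simplified, single-window form of the SSIM formula, not the paper's evaluation code: the reference metric averages an 11×11 Gaussian-windowed local SSIM map, while this helper computes one global value. The function name is illustrative; the constants k1 = 0.01 and k2 = 0.03 are the standard defaults.

```python
import numpy as np

def simplified_ssim(x, y, data_range=1.0):
    """Single-window SSIM between two equally-sized grayscale arrays.

    Simplified sketch: statistics are taken over the whole array instead
    of an 11x11 Gaussian sliding window as in the reference metric.
    """
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
```

Comparing full-frame and patch quality, as in Fig. 1, then amounts to calling this helper once on the whole frames and once on the cropped patches: a frame can score lower overall while one of its patches scores highest among the neighbors.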
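The inter-frame pixel-wise similarity that the non-local module relies on can be illustrated with the exact dot-product form of the non-local operation [39], here applied between feature maps of two consecutive frames rather than within one frame. The sketch below is a NumPy illustration under assumed (C, H, W) feature shapes; the function names are hypothetical, and the paper's approximate strategy exists precisely because materializing the exact (HW × HW) matrix shown here is too costly.

```python
import numpy as np

def non_local_similarity(f_prev, f_cur):
    """Exact inter-frame pixel-wise similarity (non-local form of [39]).

    f_prev, f_cur: feature maps of shape (C, H, W) from frames t-1 and t.
    Returns an (H*W, H*W) row-stochastic matrix S, where S[i, j] is the
    softmax-normalized affinity between pixel i of frame t and pixel j
    of frame t-1.
    """
    c, h, w = f_cur.shape
    q = f_cur.reshape(c, h * w).T            # (HW, C) queries from frame t
    k = f_prev.reshape(c, h * w).T           # (HW, C) keys from frame t-1
    logits = q @ k.T / np.sqrt(c)            # (HW, HW) dot-product affinities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    s = np.exp(logits)
    return s / s.sum(axis=1, keepdims=True)

def warp_state(s, h_prev):
    """Aggregate the previous hidden state with the similarity matrix."""
    c, hh, ww = h_prev.shape
    flat = h_prev.reshape(c, hh * ww).T      # (HW, C)
    return (s @ flat).T.reshape(c, hh, ww)   # weighted sum per target pixel
```

The cost argument is easy to check: for a hypothetical 240×416 frame, HW ≈ 10^5, so the exact matrix would hold roughly 10^10 entries (about 40 GB in float32) per frame pair, which is why an approximate, low-space similarity computation is needed.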
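The final step of the pipeline, adding the decoder's predicted residual back to the compressed frame, reduces to elementwise addition with clipping. A minimal sketch, assuming intensities normalized to [0, 1] (variable names are illustrative, not from the paper):

```python
import numpy as np

def reconstruct(compressed, residual):
    """Add the predicted residual to the compressed frame and clip to
    the valid intensity range (frames assumed normalized to [0, 1])."""
    return np.clip(compressed + residual, 0.0, 1.0)
```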
