Video Super Resolution Based on Deep Learning: a Comprehensive Survey
Total Page:16
File Type:pdf, Size:1020Kb
1 Video Super Resolution Based on Deep Learning: A Comprehensive Survey Hongying Liu, Member, IEEE, Zhubo Ruan, Peng Zhao, Chao Dong, Member, IEEE, Fanhua Shang, Senior Member, IEEE, Yuanyuan Liu, Member, IEEE, Linlin Yang Abstract—In recent years, deep learning has made great multiple successive images/frames at a time so as to uti- progress in many fields such as image recognition, natural lize relationship within frames to super-resolve the target language processing, speech recognition and video super- frame. In a broad sense, video super-resolution (VSR) resolution. In this survey, we comprehensively investigate 33 state-of-the-art video super-resolution (VSR) methods based can be regarded as a class of image super-resolution on deep learning. It is well known that the leverage of and is able to be processed by image super-resolution information within video frames is important for video algorithms frame by frame. However, the SR perfor- super-resolution. Thus we propose a taxonomy and classify mance is always not satisfactory as artifacts and jams the methods into six sub-categories according to the ways of may be brought in, which causes unguaranteed temporal utilizing inter-frame information. Moreover, the architectures and implementation details of all the methods are depicted in coherence within frames. detail. Finally, we summarize and compare the performance In recent years, many video super-resolution algo- of the representative VSR method on some benchmark rithms have been proposed. They mainly fall into two datasets. We also discuss some challenges, which need to categories: the traditional methods and deep learning be further addressed by researchers in the community of based methods. For some traditional methods, the mo- VSR. To the best of our knowledge, this work is the first systematic review on VSR tasks, and it is expected to make tions are simply estimated by affine models as in [8]. In a contribution to the development of recent studies in this [9, 10], they adopt non-local mean and 3D steering ker- area and potentially deepen our understanding to the VSR nel regression for video super-resolution, respectively. techniques based on deep learning. Liu and Sun [11] proposed a Bayesian approach to Index Terms—Video super-resolution, deep learning, con- simultaneously estimate underlying motion, blur kernel, volutional neural network, inter-frame Information. and noise level and reconstruct high-resolution frames. In [12], the expectation maximization (EM) method is adopted to estimate the blur kernel, and guide the I. INTRODUCTION reconstruction of high-resolution frames. However, these Super-resolution (SR) aims at recovering a high- explicit models of high-resolution videos are still inade- resolution (HR) image or multiple images from the corre- quate to fit various scenes in videos. sponding low-resolution (LR) counterparts. It is a classic With the great success of deep learning in a variety and challenging problem in computer vision and image of areas, super-resolution algorithms based on deep processing, and it has extensive real-world applications, learning are studied extensively. Many video super- such as medical image reconstruction [1], face [2], remote resolution methods based on deep neural networks such sensing [3] and panorama video super-resolution [4,5], as convolutional neural network (CNN), generative ad- UAV surveillance [6] and high-definition television [7]. versarial network (GAN) and recurrent neural network With the advent of the 5th generation mobile commu- (RNN) have been proposed. Generally, they employ a arXiv:2007.12928v2 [cs.CV] 20 Dec 2020 nication technology, larger size images or videos can large number of both low-resolution and high-resolution be transformed within a shorter time. Meanwhile, with video sequences to input the neural network for inter- the popularity of high definition (HD) and ultra high frame alignment, feature extraction/fusion, and then to definition (UHD) display devices, super-resolution is produce the high-resolution sequences for the corre- attracting more attention. sponding low-resolution video sequences. The pipeline Video is one of the most common multimedia in our of most video super-resolution methods mainly includes daily life, and thus super-resolution of low-resolution one alignment module, one feature extraction and fusion videos has become very important. In general, image module, and one reconstruction module, as shown in super-resolution methods process a single image at a Fig.1. Because of the nonlinear learning capability of time, while video super-resolution algorithms deal with deep neural networks, the deep learning based meth- ods usually achieve good performance on many public H. Liu, Z. Ruan, P. Zhao, F. Shang, Y. Liu and L. Yang are with the Key Laboratory of Intelligent Perception and Image Understanding benchmark datasets. of Ministry of Education, School of Artificial Intelligence, Xidian So far, there are few works about the overview on University, China. E-mails: fhyliu, fhshang, [email protected]. video super-resolution tasks, though many works [13, 14, C. Dong is with the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences. E-mail: [email protected]. 15] on the investigation of single image super-resolution Manuscript received November 12, 2020. have been published. Daithankar and Ruikar [16] pre- 2 SR image Feature extraction LR sequences Alignment and fusion Reconstruction t-1 t t+1 Fig. 1: The general pipeline of deep learning methods for VSR tasks. Note that the inter-frame alignment module can be either traditional methods or deep CNNs, while both the feature extraction & fusion module and the upsampling module usually utilize deep CNNs. The dashed line box means that the module is optional. sented a brief review on many frequency-spatial domain RGB color space, the YUV including YCbCr color space H×W ×3 methods, while the deep learning methods are rarely is also widely used for VSR. Ii 2 R denotes the ^ sH×sW ×3 mentioned. Unlike the previous work, we provide a com- i-th frame in a LR video sequence I, and Ii 2R prehensive investigation on deep learning techniques for is the corresponding HR frame, where s is the scale ^ i+N video super-resolution in recent years. It is well known factor, e.g., s = 2, 4, or 8. And fIjgj=i−N is a set of that the main difference between video super-resolution 2N +1 HR frames for the center frame I^i, where N is and image super-resolution lies in the processing of the temporal radius. Then the degradation process of HR inter-frame information. How to effectively leverage the video sequences can be formulated as follows: information from neighboring frames is critical for VSR ^ ^ i+N tasks. We focus on the ways of utilizing inter-frame Ii = φ(Ii; fIjgj=i−N ; θα) (1) information for various deep learning based methods. The contributions of this work are mainly summarized where φ(·; ·) is the degradation function, and the param- as follows. 1) We review recent works and progresses eter θα represents various degradation factors such as on developing techniques for deep learning based video noise, motion blur and downsampling factors. In most super-resolution. To the best of our knowledge, this is the existing works such as [11, 12, 17, 18], the degradation first comprehensive survey on deep learning上采样 based VSR process is expressed as: methods. 2) We propose a taxonomy for deep learning ^ based video super-resolution methods by categorizing Ij = DBEi!jIi + nj (2) their ways of utilization金字塔级联 inter-frame information and 时空注意力 where D and B are the down-sampling and blur opera- illustrate特征提取 how the taxonomy的可变形卷 can be used to categorize 重构模块 HR LR 模块 积对齐模块 tions, nj denotes image noise, and Ei!j is the warping existing methods. 3) We summarize the performance ^ ^ of state-of-the-art methods on some public benchmark operation based on the motion from Ii to Ij. datasets. 4) We further discuss some challenges and In practice, it is easy to obtain LR image Ij, but the perspectives for video super-resolutionDual network tasks. degradation factors, which may be quite complex or The rest of the paper is organized as follows. In Section probably a combination of several factors, are unknown. II, we briefly introduce the background of video super- Different from single image super-resolution (SISR) aim- resolution. Section III shows our taxonomy for recent ing at solving a single degraded image, VSR needs to works. In Sections IV and V, we describe the video残 deal with残 degraded video sequences, and recovers the super-resolution methods with and without alignment,差 corresponding差 HR video sequences, which should be respectively, according to the taxonomy. In Section VI, as close as the ground truth (GT) videos. Specifically, the performance of state-of-the-art methods is analyzed a VSR algorithm may use similar techniques to SISR for quantitatively. In Section VII, we discuss the challenges processing a single frame (spatial information), while it and prospective trends in video super-resolution. Finally, has to take relationships among frames (temporal infor- we conclude this work in Section VIII. mation) into consideration to ensure motion consistency of the video. The super-resolution process, namely the reverse process of Eq. (1), can be formulated as follows: II. BACKGROUND ~ −1 i+N Video super-resolution stems from image super- Ii = φ (Ii; fIjgj=i−N ; θβ) (3) resolution, and it aims at restoring high-resolution ~ ^ videos from multiple low-resolution frames. However,