Frame-Recurrent Video Super-Resolution
Mehdi S. M. Sajjadi1,2∗   Raviteja Vemulapalli2   Matthew Brown2
[email protected]   [email protected]   [email protected]
1Max Planck Institute for Intelligent Systems   2Google
Abstract
Recent advances in video super-resolution have shown that convolutional neural networks combined with motion compensation are able to merge information from multiple low-resolution (LR) frames to generate high-quality images. Current state-of-the-art methods process a batch of LR frames to generate a single high-resolution (HR) frame and run this scheme in a sliding window fashion over the entire video, effectively treating the problem as a large number of separate multi-frame super-resolution tasks. This approach has two main weaknesses: 1) each input frame is processed and warped multiple times, increasing the computational cost, and 2) each output frame is estimated independently conditioned on the input frames, limiting the system's ability to produce temporally consistent results.

In this work, we propose an end-to-end trainable frame-recurrent video super-resolution framework that uses the previously inferred HR estimate to super-resolve the subsequent frame. This naturally encourages temporally consistent results and reduces the computational cost by warping only one image in each step. Furthermore, due to its recurrent nature, the proposed method has the ability to assimilate a large number of previous frames without increased computational demands. Extensive evaluations and comparisons with previous methods validate the strengths of our approach and demonstrate that the proposed framework is able to significantly outperform the current state of the art.

Figure 1: Side-by-side comparison of bicubic interpolation, our FRVSR result, and HR ground truth for 4x upsampling. (Panels: Bicubic, Our result, HR frame; image not reproduced.)

arXiv:1801.04590v4 [cs.CV] 25 Mar 2018

∗This work was done when Mehdi S. M. Sajjadi was interning at Google.

1. Introduction

Super-resolution is a classic problem in image processing that addresses the question of how to reconstruct a high-resolution (HR) image from its downscaled low-resolution (LR) version. With the rise of deep learning, super-resolution has received significant attention from the research community over the past few years [3, 5, 20, 21, 26, 28, 35, 36, 39]. While high-frequency details need to be reconstructed exclusively from spatial statistics in the case of single image super-resolution, temporal relationships in the input can be exploited to improve reconstruction for video super-resolution. It is therefore imperative to combine the information from as many LR frames as possible to reach the best video super-resolution results.

The latest state-of-the-art video super-resolution methods approach the problem by combining a batch of LR frames to estimate a single HR frame, effectively dividing the task of video super-resolution into a large number of separate multi-frame super-resolution subtasks [3, 28, 29, 39]. However, this approach is computationally expensive since each input frame needs to be processed several times. Furthermore, generating each output frame separately reduces the system's ability to produce temporally consistent frames, resulting in unpleasing flickering artifacts.

In this work, we propose an end-to-end trainable frame-recurrent video super-resolution (FRVSR) framework to address the above issues. Instead of estimating each video frame separately, we use a recurrent approach that passes the previously estimated HR frame as an input for the following iteration. Using this recurrent architecture has several benefits. Each input frame needs to be processed only once, reducing the computational cost. Furthermore, information from past frames can be propagated to later frames via the HR estimate that is recurrently passed through time. Passing the previous HR estimate directly to the next step helps the model to recreate fine details and produce temporally consistent videos.

To analyze the performance of the proposed framework, we compare it with strong single image and video super-resolution baselines using identical neural networks as building blocks. Our extensive set of experiments provides insights into how the performance of FRVSR varies with the number of recurrent steps used during training, the size of the network, and the amount of noise, aliasing or compression artifacts present in the LR input. The proposed approach clearly outperforms the baselines under various settings both in terms of quality and efficiency. Finally, we also compare FRVSR with several existing video super-resolution approaches and show that it significantly outperforms the current state of the art on a standard benchmark dataset.

1.1. Our contributions

• We propose a recurrent framework that uses the HR estimate of the previous frame for generating the subsequent frame, leading to an efficient model that produces temporally consistent results.
• Unlike existing approaches, the proposed framework can propagate information over a large temporal range without increasing computations.
• Our system is end-to-end trainable and does not require any pre-training stages.
• We perform an extensive set of experiments to analyze the proposed framework and relevant baselines under various different settings.
• We show that the proposed framework significantly outperforms the current state of the art in video super-resolution both qualitatively and quantitatively.

2. Video super-resolution

Let I_t^LR ∈ [0, 1]^{H×W×C} denote the t-th LR video frame obtained by downsampling the original HR video frame I_t^HR ∈ [0, 1]^{sH×sW×C} by scale factor s. Given a set of consecutive LR video frames, the goal of video super-resolution is to generate HR estimates I_t^est that approximate the original HR frames I_t^HR under some metric.

2.1. Related work

Super-resolution is a classic ill-posed inverse problem with approaches ranging from simple interpolation methods such as Bilinear, Bicubic and Lanczos [9] to example-based super-resolution [12, 13, 40, 42], dictionary learning [32, 43], and self-similarity approaches [16, 41]. We refer the reader to Milanfar [30] and Nasrollahi and Moeslund [31] for extensive overviews of prior art up to recent years.

The recent progress in deep learning, especially in convolutional neural networks, has shaken up the field of super-resolution. After Dong et al. [5] reached state-of-the-art results with shallow convolutional neural networks, many others followed up with deeper network architectures, advancing the field tremendously [6, 21, 22, 25, 36, 37]. Parallel efforts have studied alternative loss functions for more visually pleasing reconstructions [26, 35]. Agustsson and Timofte [1] provide a recent survey on the current state of the art in single image super-resolution.

Video and multi-frame super-resolution approaches combine information from multiple LR frames to reconstruct details that are missing in individual frames, which can lead to higher quality results. Classical video and multi-frame super-resolution methods are generally formulated as optimization problems that are computationally very expensive to solve [2, 11, 27, 38].

Most of the existing deep learning-based video super-resolution methods divide the task of video super-resolution into multiple separate sub-tasks, each of which generates a single HR output frame from multiple LR input frames. Kappeler et al. [20] warp video frames I_{t−1}^LR and I_{t+1}^LR onto the frame I_t^LR using the optical flow method of Drulea and Nedevschi [8], concatenate the three frames and pass them through a convolutional neural network that produces the output frame I_t^est. Caballero et al. [3] follow the same approach but replace the optical flow model with a trainable motion compensation network. Makansi et al. [29] follow an approach similar to [3] but combine warping and mapping to HR space into a single step.

Tao et al. [39] rely on a batch of up to 7 input LR frames to estimate a single HR frame. After computing the motion from neighboring input frames to I_t^LR, they map the frames onto high-resolution grids. In a final step, they run an encoder-decoder style network with a Conv-LSTM in the core, yielding I_t^est. Liu et al. [28] process up to 5 LR frames using different numbers of input frames (I_t^LR), (I_{t−1}^LR, I_t^LR, I_{t+1}^LR), and (I_{t−2}^LR, ..., I_{t+2}^LR) simultaneously to produce separate HR estimates that are aggregated in a final step with dynamic weights to produce a single output I_t^est.

While a number of the above-mentioned methods are end-to-end trainable, the authors often note that they first pre-train each component before fine-tuning the system as a whole in a final step [3, 28, 39].

Huang et al. [17] use a bidirectional recurrent architecture for video super-resolution with shallow networks but do not use any explicit motion compensation in their model. Recurrent architectures have also been used for other tasks such as video deblurring [23] and stylization [4, 15]. While Kim et al. [23] and Chen et al. [4] pass on a feature representation to the next step, Gupta et al. [15] pass the previous output frame to the next step to produce temporally consistent stylized videos in concurrent work. A recurrent approach for video super-resolution was proposed by Farsiu et al. [10] more than a decade ago with motivations similar to ours. However, this approach uses an approximation of the Kalman filter for frame estimation and is constrained to translational motion.