Efficient Video Super-Resolution Through Recurrent Latent Space Propagation
Dario Fuoli, Shuhang Gu, Radu Timofte
Computer Vision Lab, ETH Zurich, Switzerland
{dario.fuoli, shuhang.gu, [email protected]

Abstract

With the recent trend for ultra high definition displays, the demand for high quality and efficient video super-resolution (VSR) has become more important than ever. Previous methods adopt complex motion compensation strategies to exploit temporal information when estimating the missing high frequency details. However, as motion estimation is a highly challenging problem, inaccurate motion compensation may affect the performance of VSR algorithms. Furthermore, the complex motion compensation module may also introduce a heavy computational burden, which limits the application of these methods in real systems. In this paper, we propose an efficient recurrent latent space propagation (RLSP) algorithm for fast VSR. RLSP introduces high-dimensional latent states to propagate temporal information between frames in an implicit manner. Our experimental results show that RLSP is a highly efficient and effective method to deal with the VSR problem. We outperform the current state-of-the-art method [16] with over 70x speed-up.

[Figure 1 plot: PSNR (dB) on Vid4 versus runtime t (ms, log scale) per frame, comparing RLSP variants (7-48, 7-64, 7-128, 10-128, 7-256) with FRVSR (3-64, 10-128) and DUF (16, 28, 52).] Figure 1. Quantitative comparison of PSNR values on Vid4 and computation times to produce a single Full HD (1920x1080) frame with other state-of-the-art methods FRVSR [33] and DUF [16].

1. Introduction

Super-resolution aims to obtain high-resolution (HR) images from low-resolution (LR) observations. It provides a practical solution to enhance existing images as well as to alleviate the pressure of data transportation. One category of methods [6, 18, 23, 25, 22, 12] takes a single LR image as input. The single image super-resolution (SISR) problem has been intensively studied for many years and is still an active topic in the area of computer vision. Another category of approaches [15, 33, 16, 7, 2, 38], i.e. video super-resolution (VSR), takes LR video as input. In contrast to SISR methods, which can only rely on natural image priors for the estimation of high resolution details, VSR exploits temporal information for improved recovery of image details.

A key issue for the success of VSR algorithms is how to take full advantage of temporal information [30]. In the early years, different methods were suggested to model the subpixel-level motion between LR observations; image priors, including the bilateral prior [8] and the Bayesian estimation model [26], have been adopted to solve the VSR problem. In recent years, the success of deep learning in other vision tasks inspired research to apply convolutional neural networks (CNN) also to VSR. Following a similar strategy, already adopted in conventional VSR algorithms, most existing deep learning based VSR methods divide the task into two sub-problems: motion estimation and the following compensation procedure. A large number of elaborately designed models have been proposed to capture the subpixel motion between input LR frames. However, as subpixel-level alignment of images is a highly challenging problem, these approaches may generate blurred estimations when the motion compensation module fails to produce accurate motion estimates. Furthermore, the complex motion estimation and compensation modules are often computationally expensive, which makes these methods unable to handle HR video in real time.

To address the accuracy issue, Jo et al. [16] proposed dynamic upsampling filters (DUF) to perform VSR without explicit motion compensation. In their solution, motion information is implicitly captured with dynamic upsampling filters and the HR frame is directly constructed by local filtering of the center input frame. Such an implicit formulation avoids conducting motion compensation in the image space and helps DUF to obtain state-of-the-art VSR results. However, as DUF needs to estimate dynamic filters at each location, the algorithm suffers from heavy computation as well as a memory burden when processing large images.

In this paper, we propose a recurrent latent space propagation (RLSP) method for efficient VSR (our code and models are publicly available at https://github.com/dariofuoli/RLSP). RLSP follows a similar strategy as FRVSR [33], which utilizes a recurrent architecture to avoid processing LR input frames multiple times. In contrast to FRVSR, which adopts explicit motion estimation and warping operations to exploit temporal information, RLSP introduces high-dimensional latent states to propagate temporal information in an implicit way.

In Fig. 1, we present the trade-off between runtime and accuracy (average PSNR) for state-of-the-art VSR approaches on the Vid4 dataset [26]. The proposed RLSP approach achieves a better balance between speed and performance than the competing methods. RLSP achieves about 10x and 70x speed-up over FRVSR and DUF, respectively, while maintaining similar accuracy. Furthermore, despite its efficiency, by utilizing more filters in our model, RLSP can also be pushed to pursue state-of-the-art VSR accuracy. Our model RLSP 7-256 achieves the highest PSNR on the Vid4 benchmark.

2. Related Work

Single Image Super-Resolution. With the rise of deep learning, especially convolutional neural networks (CNN) [21], learning based super-resolution models have proven superior in accuracy to classical interpolation methods, such as bilinear and bicubic interpolation. One of the earliest methods to apply convolution to super-resolution is SRCNN, proposed by [6]. SRCNN uses a shallow network of only 3 convolutional layers. VDSR [18] shows substantial improvements by using a much deeper network of 20 layers combined with residual learning. In order to obtain visually more pleasing, photorealistic and natural looking images, accuracy to the ground truth is traded off by methods such as SRGAN [23], EnhanceNet [32], and [31, 4, 28], which introduce alternative loss functions [11] to super-resolution. An overview of recent methods in the field of SISR is provided by [36].

Video Super-Resolution. Super-resolution can be generalized from images to videos. Videos additionally provide temporal information among frames, which can be exploited to improve interpolation quality. Non-deep-learning video super-resolution is often solved by formulating demanding optimization problems, leading to slow evaluation times [1, 8, 26].

Many deep learning based VSR methods are composed of multiple independent processing pipelines, motivated by prior knowledge and inspired by traditional computer vision tools. To leverage temporal information, a natural extension of SISR is combining multiple low-resolution frames to produce a single high-resolution estimate [24, 29, 5]. Kappeler et al. [17] combine several adjacent frames. Non-center frames are motion compensated by calculating optical flow and warping towards the center frame. All frames are then concatenated and processed by 3 convolution layers. Tao et al. [35] produce a single high-resolution frame y_t from up to 7 low-resolution input frames x_{t-3:t+3}. First, motion is estimated in low resolution and a preliminary high-resolution frame is computed through a sub-pixel motion compensation layer. The final output is computed by applying an encoder-decoder style network with an intermediate convolutional LSTM [13] layer. Liu et al. [27] calculate multiple high-resolution estimates in parallel branches, each processing an increasing number of low-resolution frames. Additionally, a temporal modulation branch computes weights according to which the respective high-resolution estimates are aggregated, forming the final high-resolution output. Caballero et al. [3] extract motion flow maps between adjacent frames and the center frame x_t. The frames are warped according to the flow maps towards frame x_t. These frames are then processed with a spatio-temporal network, by either direct concatenation and convolution, gradually applying several convolution and concatenation steps, or applying 3D convolutions [37]. Jo et al. [16] propose DUF, a network without explicit motion estimation. Dynamic upsampling filters and residuals are calculated from a batch of adjacent input frames. The center frame is filtered and combined with the residuals to get the final output.

A more powerful approach to process sequential data like video is to use recurrent connections between time steps. Methods using a fixed number of input frames are inherently limited by the information content in those frames. Recurrent models, however, are able to leverage information from a potentially unlimited number of frames. Sajjadi et al. [33] use an optical flow network followed by a super-resolution network. Optical flow is calculated between x_{t-1} and x_t to warp the previous output y_{t-1} towards t. The final output y_t is calculated from the warped previous output and the current low-resolution input frame x_t. The two networks are trained jointly. Huang et al. [14] propose a bidirectional recurrent network using 2D and 3D convolutions with recurrent connections between time steps. A forward pass and a backward pass are combined to produce the final output frames. Because of its nature, this method cannot be applied online.

[Figure 2 diagram: unrolled RLSP cells over frames x_{t-1}, x_t, x_{t+1}; each cell receives the LR frames, the hidden state h_{t-1}, and the previous output y_{t-1} (folded by /4 space-to-depth), passes them through n 3x3xf Conv + ReLU layers and a 3x3x16 Conv, and emits y_t via x4 depth-to-space upsampling together with the new hidden state h_t.] Figure 2.
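The data flow of the recurrent scheme sketched in Figure 2 can be summarized as a per-frame recurrence: each step consumes a small window of LR frames, the previous latent state h_{t-1}, and the previous HR output y_{t-1} (folded back to the LR grid by space-to-depth), and emits y_t together with an updated latent state h_t. The NumPy sketch below illustrates only this data flow; the learned convolution stack is replaced by a fixed dummy mapping, and all function names, shapes, and channel counts are our own illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def space_to_depth(x, r):
    # (H, W, C) -> (H//r, W//r, C*r*r): fold each r x r spatial block into channels
    h, w, c = x.shape
    return (x.reshape(h // r, r, w // r, r, c)
             .transpose(0, 2, 1, 3, 4)
             .reshape(h // r, w // r, c * r * r))

def depth_to_space(x, r):
    # inverse (pixel shuffle): (H, W, C*r*r) -> (H*r, W*r, C)
    h, w, c = x.shape
    return (x.reshape(h, w, r, r, c // (r * r))
             .transpose(0, 2, 1, 3, 4)
             .reshape(h * r, w * r, c // (r * r)))

def dummy_cell(features, f=8, r=4):
    # stand-in for the stack of 3x3 conv + ReLU layers: emits an r*r-channel
    # HR residual (to be pixel-shuffled) and an f-channel latent state
    res = features[..., :r * r]           # placeholder for the learned residual
    h = np.maximum(features[..., :f], 0)  # placeholder latent state
    return res, h

def rlsp_like_step(x_window, h_prev, y_prev, r=4, f=8):
    # one recurrent step: LR frames + latent state + fed-back HR output in,
    # HR frame + updated latent state out (data flow only, nothing learned)
    y_fb = space_to_depth(y_prev, r)  # fold HR output back onto the LR grid
    feats = np.concatenate(list(x_window) + [h_prev, y_fb], axis=-1)
    res, h_cur = dummy_cell(feats, f=f, r=r)
    # add the pixel-shuffled residual to a naive enlargement of the center frame
    base = np.repeat(np.repeat(x_window[1], r, axis=0), r, axis=1)
    y_cur = base + depth_to_space(res, r)
    return y_cur, h_cur
```

Because the state h_t and the fed-back output y_{t-1} carry information forward, each frame is processed once, yet details from arbitrarily distant past frames can, in principle, still influence the current estimate.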
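For contrast, the explicit-motion methods discussed above (Kappeler et al., Caballero et al., FRVSR) all rely on warping a frame or the previous output by an estimated flow field. The warping operation itself amounts to backward bilinear sampling; a self-contained sketch follows, with flow estimation omitted and all names and conventions our own rather than taken from any of the cited implementations:

```python
import numpy as np

def warp_bilinear(img, flow):
    # backward-warp img (H, W, C) by per-pixel flow (H, W, 2) holding (dx, dy)
    # sampling offsets, using bilinear interpolation with edge clamping
    h, w, _ = img.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    sx = np.clip(xs + flow[..., 0], 0, w - 1)  # source x coordinates
    sy = np.clip(ys + flow[..., 1], 0, h - 1)  # source y coordinates
    x0 = np.floor(sx).astype(int)
    y0 = np.floor(sy).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = (sx - x0)[..., None]  # horizontal interpolation weight
    wy = (sy - y0)[..., None]  # vertical interpolation weight
    top = img[y0, x0] * (1 - wx) + img[y0, x1] * wx
    bot = img[y1, x0] * (1 - wx) + img[y1, x1] * wx
    return top * (1 - wy) + bot * wy
```

The fragility noted in the introduction lives upstream of this operation: the warp is only as good as the flow it is given, and subpixel flow errors directly blur the compensated frames that the super-resolution network then consumes.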