iSeeBetter: Spatio-Temporal Video Super Resolution using Recurrent-Generative Back-Projection Networks
Aman Chadha∗ Stanford University [email protected]
Abstract
Recently, learning-based models have enhanced the performance of Single-Image Super-Resolution (SISR). However, applying SISR successively to each video frame leads to a lack of temporal coherency. Video Super Resolution (VSR) models based on Convolutional Neural Networks (CNNs) outperform traditional approaches in terms of image quality metrics such as Peak Signal to Noise Ratio (PSNR) and Structural SIMilarity (SSIM). However, Generative Adversarial Networks (GANs) offer a competitive advantage in mitigating the lack of finer texture detail usually seen with CNNs when super-resolving at large upscaling factors. We present iSeeBetter, a novel spatio-temporal approach to VSR. iSeeBetter seeks to render temporally consistent Super Resolution (SR) videos by extracting spatial and temporal information from the current and neighboring frames using the concept of Recurrent Back-Projection Networks (RBPN) as its generator. Further, to improve the "naturality" of the super-resolved image while eliminating artifacts seen with traditional algorithms, we utilize the discriminator from the Super-Resolution Generative Adversarial Network (SRGAN). Using Mean Squared Error (MSE) as a primary loss-minimization objective improves PSNR and SSIM, but these metrics may not capture fine details in the image, leading to a misrepresentation of perceptual quality. To address this, we use a four-fold (adversarial, perceptual, MSE and Total-Variation (TV)) loss function. Our results demonstrate that iSeeBetter offers superior VSR fidelity and surpasses state-of-the-art performance.
1 Introduction
The goal of Super Resolution (SR) is to enhance a Low Resolution (LR) image to a Higher Resolution (HR) image by filling in missing fine-grained details in the LR image. This domain can be divided into three main areas: Single Image SR (SISR) (1), (2), (3), (4), Multi Image SR (MISR) (5), (6) and Video SR (VSR) (7), (8), (9), (10), (11). The idea behind SISR is to super-resolve an LR frame LRt independently of the other frames in the video sequence. While this technique takes into account spatial information, it fails to exploit the temporal details inherent in a video sequence. MISR seeks to address just that: it utilizes the details available from neighboring frames and fuses them to super-resolve LRt. After spatially aligning frames, missing details are extracted by separating differences between the aligned frames from details observed only in one or some of the frames. However, in MISR the frames are aligned without any concern for temporal smoothness, whereas in VSR frames are typically aligned in temporally smooth order.

Traditional VSR methods upscale based on a single degradation model (usually bicubic interpolation), followed by reconstruction. This is sub-optimal and adds computational complexity (12). Recently, learning-based models based on Convolutional Neural Networks (CNNs) have outperformed traditional approaches in terms of widely-accepted image reconstruction metrics such as Peak Signal to Noise Ratio (PSNR) and Structural SIMilarity (SSIM). A crucial aspect of an effective VSR system is its ability to handle motion sequences, since those are often important components of videos (7), (13).

The proposed method, iSeeBetter, is inspired by Recurrent Back-Projection Networks (RBPNs) (10), which utilize "back-projection" as their underpinning approach, originally introduced in (14), (15).
∗Aman Chadha is an engineer in the System Performance and Architecture team at Apple Inc., but did this work at Stanford.

CS230: Deep Learning, Fall 2019, Stanford University, CA.

The basic concept behind back-projection is to iteratively calculate residual images as the reconstruction error between a target image and a set of neighboring images. The residuals are then back-projected onto the target image to improve super-resolution accuracy. The multiple residuals capture both subtle and significant differences between the target frame and the other frames, thereby exploiting the temporal relationships between adjacent frames, as shown in Figure 1. This results in superior SR accuracy. To mitigate the lack of finer texture details when super-resolving at large upscaling factors, which is usually seen with CNNs (16), iSeeBetter utilizes GANs with a loss function that weighs adversarial loss, perceptual loss (16), Mean Square Error (MSE)-based loss and Total-Variation (TV) loss (17). Our approach combines the merits of RBPN and SRGAN (16): it uses RBPN as its generator, complemented by SRGAN's discriminator architecture. Blending these techniques yields iSeeBetter, a state-of-the-art system that is able to recover photo-realistic textures and motion-based scenes from heavily down-sampled videos.
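The iterative back-projection idea described above can be sketched in a few lines of NumPy. This is a toy illustration, not the paper's implementation: block averaging stands in for the true degradation model and nearest-neighbor upsampling for the learned back-projection operator.

```python
import numpy as np

def downsample(img, s=2):
    """Toy degradation model: s x s block averaging (stand-in for bicubic)."""
    h, w = img.shape
    return img.reshape(h // s, s, w // s, s).mean(axis=(1, 3))

def upsample(img, s=2):
    """Toy back-projection operator: nearest-neighbor upsampling."""
    return img.repeat(s, axis=0).repeat(s, axis=1)

def back_project(sr, lr_frames, n_iters=5, s=2):
    """Iteratively refine an SR estimate: compute the residual between each
    LR observation and the downsampled SR estimate, then back-project that
    residual into HR space and add it to the estimate."""
    for _ in range(n_iters):
        for lr in lr_frames:
            residual = lr - downsample(sr, s)  # reconstruction error in LR space
            sr = sr + upsample(residual, s)    # back-project into HR space
    return sr
```

With multiple LR frames in `lr_frames`, each neighbor contributes its own residual, which is how the scheme exploits temporal relationships between adjacent frames.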
Figure 1: Adjacent frame similarity
Our contributions include the following key innovations.

Combining the state-of-the-art in SR: We propose a model that leverages two superior SR techniques, RBPN and SRGAN. RBPN enables iSeeBetter to extract details from neighboring frames, while the generator-discriminator architecture pushes iSeeBetter to generate more realistic frames and eliminate artifacts.

"Optimizing" the loss function: Minimizing MSE encourages finding pixel-wise averages of plausible solutions, which are typically overly smooth and thus have poor perceptual quality (18), (19), (20), (21). To address this, we adopt a four-fold (adversarial, perceptual, MSE and TV) loss for superior results.

Extended evaluation protocol: To evaluate iSeeBetter, we used standard datasets: Vimeo90K (22), Vid4 (23) and SPMCS (8). To expand the spectrum of data diversity, we wrote scripts to collect additional data from YouTube, augmenting our dataset to roughly 170,000 frames.

User-friendly script infrastructure: We built several tools to download and structure datasets, visualize temporal profiles and run benchmarks, enabling us to iterate over different models quickly. Further, we built a video-to-frames tool so that videos, rather than individual frames, can be input directly to iSeeBetter.
2 Related work
Learning-based methods have emerged as superior VSR techniques compared to traditional statistical methods. We thus focus our discussion in this section solely on learning-based methods that are trained end-to-end. Deep VSR can be primarily divided into three types based on the approach to preserving temporal information.

(a) Temporal Concatenation. The most popular approach to retaining temporal information in VSR is concatenating the frames, as in (24), (7), (11), (25). Essentially, this approach can be seen as an extension of SISR to accept multiple input images.

(b) Recurrent Networks. A many-to-one architecture is used in (26), (8), where a sequence of LR frames is mapped to a single target HR frame. A many-to-many RNN has recently been used in VSR by (9) to map the current LR frame and the previous HR estimate to the target HR frame.

(c) Optical Flow-Based Methods. To reduce unwanted flickering artifacts in the output frames (17), the authors of (9) proposed a method that trains a network to estimate optical flow jointly with the SR network. Optical flow methods allow estimation of the trajectories of moving objects, thereby assisting in VSR.
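The temporal-concatenation family (a) is simple enough to sketch directly. The following is a minimal, hypothetical PyTorch baseline, not a model from any of the cited works: 2T+1 RGB frames are stacked along the channel axis and a small CNN predicts the center frame at 4× resolution via PixelShuffle.

```python
import torch
import torch.nn as nn

class ConcatVSR(nn.Module):
    """Minimal temporal-concatenation baseline (illustrative only):
    num_frames RGB frames are concatenated channel-wise and a small CNN
    predicts the centre frame at `scale`x resolution via PixelShuffle."""
    def __init__(self, num_frames=5, scale=4):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3 * num_frames, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 3 * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),  # rearranges channels into a scale-x larger frame
        )

    def forward(self, frames):               # frames: (B, num_frames, 3, H, W)
        b, n, c, h, w = frames.shape
        x = frames.reshape(b, n * c, h, w)   # temporal concatenation along channels
        return self.body(x)                  # (B, 3, scale*H, scale*W)
```

This makes concrete why category (a) is "an extension of SISR": the only change from a single-image network is the input channel count.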
3 Datasets
To train iSeeBetter, we amalgamated diverse datasets with differing video lengths, resolutions, motion sequences and numbers of clips. Table 1 presents a summary of the datasets used. When training our model, we generated the corresponding LR frame for each HR input frame by performing 4× down-sampling using bicubic interpolation. To extend our dataset further, we wrote scripts to collect additional data from YouTube. The dataset was shuffled for training and testing, with a training/validation/test split of 80%/10%/10%.
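The data-preparation steps above can be sketched as follows. This is a hedged illustration rather than the paper's actual scripts; `make_lr` uses PyTorch's bicubic interpolation as the degradation model, and `split_clips` is a hypothetical helper for the 80/10/10 split.

```python
import random
import torch
import torch.nn.functional as F

def make_lr(hr, scale=4):
    """Generate the LR counterpart of a batch of HR frames (B, C, H, W)
    by `scale`x bicubic down-sampling, matching the degradation model
    described for training-pair generation."""
    return F.interpolate(hr, scale_factor=1 / scale, mode="bicubic",
                         align_corners=False)

def split_clips(clips, seed=0):
    """Shuffle clip identifiers and split 80%/10%/10% into train/val/test."""
    clips = list(clips)
    random.Random(seed).shuffle(clips)
    n_train, n_val = int(0.8 * len(clips)), int(0.1 * len(clips))
    return (clips[:n_train],
            clips[n_train:n_train + n_val],
            clips[n_train + n_val:])
```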
Dataset     Resolution                                                # of clips   # of frames/clip   # of frames
Vimeo90K    448 × 256                                                 13,100       7                  91,701
SPMCS       240 × 135                                                 30           31                 930
Vid4        (720 × 576 × 3), (704 × 576 × 3),
            (720 × 480 × 3), (720 × 480 × 3)                          4            41, 34, 49, 47     684
Augmented   960 × 720                                                 7,000        110                77,000
Total       -                                                         46,034       -                  170,315

Table 1: Datasets used for training and evaluation
4 Methods
4.1 Implementation²
Figure 2: Overview of iSeeBetter
Figure 2 shows the iSeeBetter architecture, which uses RBPN (10) and SRGAN (16) as its generator and discriminator respectively. RBPN combines two approaches that extract missing details from different sources, namely SISR and MISR. Figure 3 shows the horizontal flow (blue arrows in Figure 2), which enlarges LRt using SISR. Figure 4 shows the vertical flow (red arrows in Figure 2), which is based on MISR and computes residual features from pairs of LRt with its neighboring frames (LRt−1, ..., LRt−n) and the corresponding flow maps (Ft−1, ..., Ft−n). At each projection step, RBPN observes the missing details in LRt and extracts residual features from neighboring frames to recover those details. Within the projection modules, RBPN utilizes a recurrent encoder-decoder mechanism to incorporate the details extracted in SISR and MISR through back-projection.
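The two-flow recurrence can be sketched as below. This is a heavily simplified stand-in for RBPN, with illustrative layer sizes that are not the paper's: an SISR path upscales LRt, an MISR path extracts residual features from LRt, each neighbor frame and its 2-channel flow map, and a decoder fuses the two paths, concatenating the per-neighbor HR features for reconstruction.

```python
import torch
import torch.nn as nn

class MiniRBPN(nn.Module):
    """Heavily simplified sketch of RBPN's recurrent generator flow
    (layer sizes are illustrative, not the paper's)."""
    def __init__(self, scale=4, feat=32, num_neighbours=2):
        super().__init__()
        self.feat0 = nn.Conv2d(3, feat, 3, padding=1)     # initial features of LR_t
        self.sisr = nn.ConvTranspose2d(feat, feat, 8, stride=scale, padding=2)
        self.misr = nn.Sequential(                        # input: LR_t + neighbour + 2-ch flow
            nn.Conv2d(3 + 3 + 2, feat, 3, padding=1), nn.PReLU(),
            nn.ConvTranspose2d(feat, feat, 8, stride=scale, padding=2))
        self.decoder = nn.Conv2d(2 * feat, feat, 3, padding=1)       # fuse SISR + MISR
        self.reconstruct = nn.Conv2d(num_neighbours * feat, 3, 3, padding=1)

    def forward(self, lr_t, neighbours, flows):
        h = self.sisr(self.feat0(lr_t))           # SISR path (blue arrows)
        hr_feats = []
        for nb, fl in zip(neighbours, flows):     # one projection step per neighbour
            m = self.misr(torch.cat([lr_t, nb, fl], dim=1))  # MISR path (red arrows)
            h = torch.relu(self.decoder(torch.cat([h, m], dim=1)))
            hr_feats.append(h)
        return self.reconstruct(torch.cat(hr_feats, dim=1))  # concat HR feature maps
```

The recurrent state `h` is refined once per neighbor, mirroring how RBPN treats each neighboring frame as a separate source of residual detail.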
[Figure 3 diagram: LRt passes through convolutional feature extraction, then 2× up-down projection blocks followed by 1× up projection block, producing SRt.]
Figure 3: DBPN (2) architecture for SISR, where we perform up-down-up sampling using 8 × 8 kernels with a stride of 4 and padding of 2. Like the ResNet architecture in Figure 4, the DBPN network uses Parametric ReLUs (27) as its activation functions.

²Code and samples for the implementation are available at github.com/amanchadha/iSeeBetter
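A single DBPN-style up-projection unit with the kernel geometry from the caption can be sketched as follows. This is a simplified illustration of the projection idea, not the repository's exact module: upscale, project back down, and use the LR-space residual to correct the HR estimate.

```python
import torch
import torch.nn as nn

def up_conv(c):
    """4x up-sampling: 8x8 transposed convolution, stride 4, padding 2."""
    return nn.Sequential(nn.ConvTranspose2d(c, c, 8, stride=4, padding=2),
                         nn.PReLU(c))

def down_conv(c):
    """4x down-sampling with the same 8x8 / stride 4 / padding 2 geometry."""
    return nn.Sequential(nn.Conv2d(c, c, 8, stride=4, padding=2),
                         nn.PReLU(c))

class UpProjection(nn.Module):
    """DBPN-style up-projection unit (simplified sketch)."""
    def __init__(self, c=64):
        super().__init__()
        self.up1, self.down, self.up2 = up_conv(c), down_conv(c), up_conv(c)

    def forward(self, l):
        h0 = self.up1(l)          # initial HR feature estimate
        e = self.down(h0) - l     # back-projected reconstruction error (LR space)
        return h0 + self.up2(e)   # residual correction in HR space
```

Note the shape arithmetic: with an 8 × 8 kernel, stride 4 and padding 2, the transposed convolution exactly quadruples the spatial size and the strided convolution exactly quarters it, so the up and down projections are inverses in shape.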
Figure 4: ResNet architecture for MISR that is composed of three tiles of five blocks where each block consists of two convolutional layers with 3 × 3 kernels, stride of 1 and padding of 1. The network uses Parametric ReLUs (27) for its activations.
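One residual block from this caption is straightforward to write down. The sketch below follows the stated geometry (two 3 × 3 convolutions, stride 1, padding 1, Parametric ReLU), though the exact channel count and skip arrangement are assumptions based on the figure.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """One block of the MISR ResNet: two 3x3 convolutions (stride 1,
    padding 1) with a PReLU in between, plus an identity skip connection."""
    def __init__(self, c=64):
        super().__init__()
        self.conv1 = nn.Conv2d(c, c, 3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(c, c, 3, stride=1, padding=1)
        self.act = nn.PReLU(c)

    def forward(self, x):
        return x + self.conv2(self.act(self.conv1(x)))

# One "tile" of five such blocks, as in Figure 4; the full network stacks three tiles.
tile = nn.Sequential(*[ResidualBlock(64) for _ in range(5)])
```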
Figure 5: Discriminator Architecture from SRGAN (16). The discriminator uses Leaky ReLUs for computing its activations.
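An SRGAN-style discriminator matching this description can be sketched as below. The channel ladder and strides follow the standard SRGAN design; the adaptive pooling before the dense head is an assumption added here to make the sketch input-size-agnostic.

```python
import torch
import torch.nn as nn

def d_block(cin, cout, stride, bn=True):
    """Conv -> (optional) BatchNorm -> LeakyReLU, as used throughout the ladder."""
    layers = [nn.Conv2d(cin, cout, 3, stride=stride, padding=1)]
    if bn:
        layers.append(nn.BatchNorm2d(cout))
    layers.append(nn.LeakyReLU(0.2, inplace=True))
    return layers

class Discriminator(nn.Module):
    """SRGAN-style discriminator sketch: a ladder of Conv-BN-LeakyReLU blocks
    (64 -> 512 channels) followed by dense layers that output the probability
    that the input frame is a real HR frame."""
    def __init__(self):
        super().__init__()
        specs = [(3, 64, 1, False), (64, 64, 2, True),
                 (64, 128, 1, True), (128, 128, 2, True),
                 (128, 256, 1, True), (256, 256, 2, True),
                 (256, 512, 1, True), (512, 512, 2, True)]
        layers = []
        for cin, cout, s, bn in specs:
            layers += d_block(cin, cout, s, bn)
        self.features = nn.Sequential(*layers)
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(512, 1024), nn.LeakyReLU(0.2, inplace=True),
            nn.Linear(1024, 1), nn.Sigmoid())

    def forward(self, x):
        return self.head(self.features(x))
```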
4.2 Loss functions
To evaluate the quality of an image, a commonly used loss function is MSE, which aims to improve the PSNR of an image (28). While optimizing MSE during training improves PSNR and SSIM, these metrics may not capture fine details in the image, leading to a misrepresentation of perceptual quality, and can cause the resulting video frames to be overly smooth (29). In a series of experiments, it was found that even manually distorted images still had an MSE score comparable to the original image (30). To address this, we use a four-fold (adversarial, perceptual, MSE and TV) loss. (19) introduced a new loss function called perceptual loss, which relies on features extracted from a pre-trained VGG network instead of low-level pixel-wise error measures. Per (16), we use adversarial loss along with a content loss that focuses on perceptual similarity instead of similarity in pixel space, to limit model "fantasy". Further, we use a de-noising term, the TV loss (19). We weigh these losses together as the final training objective for iSeeBetter. We define our loss function for each frame as follows. The total loss of a sample is the average of all its frames.
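The definition itself is truncated in this copy. A reconstruction consistent with the four components named above, with placeholder weights α, β and γ (the actual weights used in training are not given here), is:

```latex
% Per-frame loss: weighted sum of the four components
\mathcal{L}_t = \mathcal{L}_{\mathrm{MSE}}\big(SR_t, HR_t\big)
              + \alpha \, \mathcal{L}_{\mathrm{percep}}\big(SR_t, HR_t\big)
              + \beta  \, \mathcal{L}_{\mathrm{adv}}\big(SR_t\big)
              + \gamma \, \mathcal{L}_{\mathrm{TV}}\big(SR_t\big)

% Total loss of a sample: average over its N frames
\mathcal{L} = \frac{1}{N} \sum_{t=1}^{N} \mathcal{L}_t
```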