iSeeBetter: Spatio-Temporal Video Super Resolution using Recurrent-Generative Back-Projection Networks
Aman Chadha∗ Stanford University [email protected]
Abstract
Recently, learning-based models have enhanced the performance of Single-Image Super-Resolution (SISR). However, applying SISR successively to each video frame leads to a lack of temporal coherency. Video Super Resolution (VSR) models based on Convolutional Neural Networks (CNNs) outperform traditional approaches in terms of image quality metrics such as Peak Signal to Noise Ratio (PSNR) and Structural SIMilarity (SSIM). However, Generative Adversarial Networks (GANs) offer a competitive advantage in mitigating the lack of finer texture detail usually seen with CNNs when super-resolving at large upscaling factors. We present iSeeBetter, a novel spatio-temporal approach to VSR. iSeeBetter seeks to render temporally consistent Super Resolution (SR) videos by extracting spatial and temporal information from the current and neighboring frames using the concept of Recurrent Back-Projection Networks (RBPN) as its generator. Further, to improve the "naturality" of the super-resolved image while eliminating artifacts seen with traditional algorithms, we utilize the discriminator from the Super-Resolution Generative Adversarial Network (SRGAN). Using Mean Squared Error (MSE) as a primary loss-minimization objective improves PSNR and SSIM, but these metrics may not capture fine details in the image, leading to a misrepresentation of perceptual quality. To address this, we use a four-fold (adversarial, perceptual, MSE and Total-Variation (TV)) loss function. Our results demonstrate that iSeeBetter offers superior VSR fidelity and surpasses state-of-the-art performance.
1 Introduction
The goal of Super Resolution (SR) is to enhance a Low Resolution (LR) image to a Higher Resolution (HR) image by filling in missing fine-grained details in the LR image. This domain can be divided into three main areas: Single Image SR (SISR) (1), (2), (3), (4), Multi Image SR (MISR) (5), (6) and Video SR (VSR) (7), (8), (9), (10), (11). The idea behind SISR is to super-resolve an LR frame LRt independently of the other frames in the video sequence. While this technique takes into account spatial information, it fails to exploit the temporal details inherent in a video sequence. MISR seeks to address just that: it utilizes the details available from neighboring frames and fuses them to super-resolve LRt. After spatially aligning frames, missing details are extracted by separating differences between the aligned frames from details observed only in one or some of the frames. However, in MISR the frames are aligned without any concern for temporal smoothness, whereas in VSR frames are typically aligned in temporally smooth order.

Traditional VSR methods upscale based on a single degradation model (usually bicubic interpolation), followed by reconstruction. This is sub-optimal and adds computational complexity (12). Recently, learning-based models based on Convolutional Neural Networks (CNNs) have outperformed traditional approaches in terms of widely-accepted image reconstruction metrics such as Peak Signal to Noise Ratio (PSNR) and Structural SIMilarity (SSIM). A crucial aspect of an effective VSR system is its ability to handle motion sequences, since those are often important components of videos (7), (13).

The proposed method, iSeeBetter, is inspired by Recurrent Back-Projection Networks (RBPNs) (10), which utilize "back-projection" as their underpinning approach, originally introduced in (14), (15).
∗Aman Chadha is an engineer in the System Performance and Architecture team at Apple Inc., but did this work at Stanford.

CS230: Deep Learning, Fall 2019, Stanford University, CA.

The basic concept behind back-projection is to iteratively calculate residual images as the reconstruction error between a target image and a set of neighboring images. The residuals are then back-projected onto the target image to improve super-resolution accuracy. The multiple residuals capture both subtle and significant differences between the target frame and the other frames, thereby exploiting the temporal relationships between adjacent frames, as shown in Figure 1. This results in superior SR accuracy. To mitigate the lack of finer texture details when super-resolving at large upscaling factors, which is usually seen with CNNs (16), iSeeBetter utilizes GANs with a loss function that weighs adversarial loss, perceptual loss (16), Mean Square Error (MSE)-based loss and Total-Variation (TV) loss (17). Our approach combines the merits of RBPN and SRGAN (16): it uses RBPN as its generator, complemented by SRGAN's discriminator architecture. Blending these techniques yields iSeeBetter, a state-of-the-art system that is able to recover photo-realistic textures and motion-based scenes from heavily down-sampled videos.
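The iterative back-projection idea described above can be sketched in a few lines of NumPy. This is a toy illustration, not the paper's implementation: block averaging stands in for the true degradation model and nearest-neighbor upsampling for the learned back-projection operator.

```python
import numpy as np

def downsample(img, s=2):
    """Toy degradation model: s x s block averaging (stand-in for bicubic)."""
    h, w = img.shape
    return img.reshape(h // s, s, w // s, s).mean(axis=(1, 3))

def upsample(img, s=2):
    """Toy back-projection operator: nearest-neighbor upsampling."""
    return img.repeat(s, axis=0).repeat(s, axis=1)

def back_project(sr, lr_frames, n_iters=5, s=2):
    """Iteratively refine an SR estimate: compute the residual between each
    LR observation and the downsampled SR estimate, then back-project that
    residual into HR space and add it to the estimate."""
    for _ in range(n_iters):
        for lr in lr_frames:
            residual = lr - downsample(sr, s)  # reconstruction error in LR space
            sr = sr + upsample(residual, s)    # back-project into HR space
    return sr
```

With multiple LR frames in `lr_frames`, each neighbor contributes its own residual, which is how the scheme exploits temporal relationships between adjacent frames.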
Figure 1: Adjacent frame similarity
Our contributions include the following key innovations.

Combining the state-of-the-art in SR: We propose a model that leverages two superior SR techniques, RBPN and SRGAN. RBPN enables iSeeBetter to extract details from neighboring frames, while the generator-discriminator architecture pushes iSeeBetter to generate more realistic frames and eliminate artifacts.

"Optimizing" the loss function: Minimizing MSE encourages finding pixel-wise averages of plausible solutions, which are typically overly smooth and thus have poor perceptual quality (18), (19), (20), (21). To address this, we adopt a four-fold (adversarial, perceptual, MSE and TV) loss for superior results.

Extended evaluation protocol: To evaluate iSeeBetter, we used standard datasets: Vimeo90K (22), Vid4 (23) and SPMCS (8). To expand the spectrum of data diversity, we wrote scripts to collect additional data from YouTube, augmenting our dataset to roughly 170,000 frames.

User-friendly script infrastructure: We built several tools to download and structure datasets, visualize temporal profiles and run benchmarks, enabling us to iterate over different models quickly. Further, we built a video-to-frames tool so that videos, rather than individual frames, can be input directly to iSeeBetter.
2 Related work
Learning-based methods have emerged as superior VSR techniques compared to traditional statistical methods. We thus focus our discussion in this section solely on learning-based methods that are trained end-to-end. Deep VSR can be primarily divided into three types based on the approach to preserving temporal information.

(a) Temporal Concatenation. The most popular approach to retaining temporal information in VSR is concatenating the frames, as in (24), (7), (11), (25). Essentially, this approach can be seen as an extension of SISR to accept multiple input images.

(b) Recurrent Networks. A many-to-one architecture is used in (26), (8), where a sequence of LR frames is mapped to a single target HR frame. A many-to-many RNN has recently been used in VSR by (9) to map the current LR frame and the previous HR estimate to the target HR frame.

(c) Optical Flow-Based Methods. To reduce unwanted flickering artifacts in the output frames (17), the authors of (9) proposed a method that trains a network to estimate optical flow jointly with the SR network. Optical flow methods allow estimation of the trajectories of moving objects, thereby assisting in VSR.
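The temporal-concatenation family (a) is simple enough to sketch directly. The following is a minimal, hypothetical PyTorch baseline, not a model from any of the cited works: 2T+1 RGB frames are stacked along the channel axis and a small CNN predicts the center frame at 4× resolution via PixelShuffle.

```python
import torch
import torch.nn as nn

class ConcatVSR(nn.Module):
    """Minimal temporal-concatenation baseline (illustrative only):
    num_frames RGB frames are concatenated channel-wise and a small CNN
    predicts the centre frame at `scale`x resolution via PixelShuffle."""
    def __init__(self, num_frames=5, scale=4):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3 * num_frames, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 3 * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),  # rearranges channels into a scale-x larger frame
        )

    def forward(self, frames):               # frames: (B, num_frames, 3, H, W)
        b, n, c, h, w = frames.shape
        x = frames.reshape(b, n * c, h, w)   # temporal concatenation along channels
        return self.body(x)                  # (B, 3, scale*H, scale*W)
```

This makes concrete why category (a) is "an extension of SISR": the only change from a single-image network is the input channel count.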
3 Datasets
To train iSeeBetter, we amalgamated diverse datasets with differing video lengths, resolutions, motion sequences and numbers of clips. Table 1 presents a summary of the datasets used. When training our model, we generated the corresponding LR frame for each HR input frame by performing 4× down-sampling using bicubic interpolation. To extend our dataset further, we wrote scripts to collect additional data from YouTube. The dataset was shuffled for training and testing, with a training/validation/test split of 80%/10%/10%.
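The data-preparation steps above can be sketched as follows. This is a hedged illustration rather than the paper's actual scripts; `make_lr` uses PyTorch's bicubic interpolation as the degradation model, and `split_clips` is a hypothetical helper for the 80/10/10 split.

```python
import random
import torch
import torch.nn.functional as F

def make_lr(hr, scale=4):
    """Generate the LR counterpart of a batch of HR frames (B, C, H, W)
    by `scale`x bicubic down-sampling, matching the degradation model
    described for training-pair generation."""
    return F.interpolate(hr, scale_factor=1 / scale, mode="bicubic",
                         align_corners=False)

def split_clips(clips, seed=0):
    """Shuffle clip identifiers and split 80%/10%/10% into train/val/test."""
    clips = list(clips)
    random.Random(seed).shuffle(clips)
    n_train, n_val = int(0.8 * len(clips)), int(0.1 * len(clips))
    return (clips[:n_train],
            clips[n_train:n_train + n_val],
            clips[n_train + n_val:])
```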
Dataset     Resolution                                                # of clips   # of frames/clip   # of frames
Vimeo90K    448 × 256                                                 13,100       7                  91,701
SPMCS       240 × 135                                                 30           31                 930
Vid4        (720 × 576 × 3), (704 × 576 × 3),
            (720 × 480 × 3), (720 × 480 × 3)                          4            41, 34, 49, 47     684
Augmented   960 × 720                                                 7,000        110                77,000
Total       -                                                         46,034       -                  170,315

Table 1: Datasets used for training and evaluation
4 Methods
4.1 Implementation²
Figure 2: Overview of iSeeBetter
Figure 2 shows the iSeeBetter architecture, which uses RBPN (10) and SRGAN (16) as its generator and discriminator respectively. RBPN combines two approaches that extract missing details from different sources, namely SISR and MISR. Figure 3 shows the horizontal flow (blue arrows in Figure 2), which enlarges LRt using SISR. Figure 4 shows the vertical flow (red arrows in Figure 2), which is based on MISR and computes residual features from pairs of LRt with its neighboring frames (LRt−1, ..., LRt−n) and the corresponding flow maps (Ft−1, ..., Ft−n). At each projection step, RBPN observes the missing details in LRt and extracts residual features from neighboring frames to recover those details. Within the projection modules, RBPN utilizes a recurrent encoder-decoder mechanism to incorporate the details extracted in SISR and MISR through back-projection.
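The two-flow recurrence can be sketched as below. This is a heavily simplified stand-in for RBPN, with illustrative layer sizes that are not the paper's: an SISR path upscales LRt, an MISR path extracts residual features from LRt, each neighbor frame and its 2-channel flow map, and a decoder fuses the two paths, concatenating the per-neighbor HR features for reconstruction.

```python
import torch
import torch.nn as nn

class MiniRBPN(nn.Module):
    """Heavily simplified sketch of RBPN's recurrent generator flow
    (layer sizes are illustrative, not the paper's)."""
    def __init__(self, scale=4, feat=32, num_neighbours=2):
        super().__init__()
        self.feat0 = nn.Conv2d(3, feat, 3, padding=1)     # initial features of LR_t
        self.sisr = nn.ConvTranspose2d(feat, feat, 8, stride=scale, padding=2)
        self.misr = nn.Sequential(                        # input: LR_t + neighbour + 2-ch flow
            nn.Conv2d(3 + 3 + 2, feat, 3, padding=1), nn.PReLU(),
            nn.ConvTranspose2d(feat, feat, 8, stride=scale, padding=2))
        self.decoder = nn.Conv2d(2 * feat, feat, 3, padding=1)       # fuse SISR + MISR
        self.reconstruct = nn.Conv2d(num_neighbours * feat, 3, 3, padding=1)

    def forward(self, lr_t, neighbours, flows):
        h = self.sisr(self.feat0(lr_t))           # SISR path (blue arrows)
        hr_feats = []
        for nb, fl in zip(neighbours, flows):     # one projection step per neighbour
            m = self.misr(torch.cat([lr_t, nb, fl], dim=1))  # MISR path (red arrows)
            h = torch.relu(self.decoder(torch.cat([h, m], dim=1)))
            hr_feats.append(h)
        return self.reconstruct(torch.cat(hr_feats, dim=1))  # concat HR feature maps
```

The recurrent state `h` is refined once per neighbor, mirroring how RBPN treats each neighboring frame as a separate source of residual detail.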
[Figure 3 diagram: LRt passes through convolutional feature extraction, then 2× up-down projection blocks followed by 1× up projection block, producing SRt.]
Figure 3: DBPN (2) architecture for SISR, where we perform up-down-up sampling using 8 × 8 kernels with a stride of 4 and padding of 2. Like the ResNet architecture in Figure 4, the DBPN network uses Parametric ReLUs (27) as its activation functions.

²Code and samples for the implementation are available at github.com/amanchadha/iSeeBetter
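A single DBPN-style up-projection unit with the kernel geometry from the caption can be sketched as follows. This is a simplified illustration of the projection idea, not the repository's exact module: upscale, project back down, and use the LR-space residual to correct the HR estimate.

```python
import torch
import torch.nn as nn

def up_conv(c):
    """4x up-sampling: 8x8 transposed convolution, stride 4, padding 2."""
    return nn.Sequential(nn.ConvTranspose2d(c, c, 8, stride=4, padding=2),
                         nn.PReLU(c))

def down_conv(c):
    """4x down-sampling with the same 8x8 / stride 4 / padding 2 geometry."""
    return nn.Sequential(nn.Conv2d(c, c, 8, stride=4, padding=2),
                         nn.PReLU(c))

class UpProjection(nn.Module):
    """DBPN-style up-projection unit (simplified sketch)."""
    def __init__(self, c=64):
        super().__init__()
        self.up1, self.down, self.up2 = up_conv(c), down_conv(c), up_conv(c)

    def forward(self, l):
        h0 = self.up1(l)          # initial HR feature estimate
        e = self.down(h0) - l     # back-projected reconstruction error (LR space)
        return h0 + self.up2(e)   # residual correction in HR space
```

Note the shape arithmetic: with an 8 × 8 kernel, stride 4 and padding 2, the transposed convolution exactly quadruples the spatial size and the strided convolution exactly quarters it, so the up and down projections are inverses in shape.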
Figure 4: ResNet architecture for MISR that is composed of three tiles of five blocks where each block consists of two convolutional layers with 3 × 3 kernels, stride of 1 and padding of 1. The network uses Parametric ReLUs (27) for its activations.
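One residual block from this caption is straightforward to write down. The sketch below follows the stated geometry (two 3 × 3 convolutions, stride 1, padding 1, Parametric ReLU), though the exact channel count and skip arrangement are assumptions based on the figure.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """One block of the MISR ResNet: two 3x3 convolutions (stride 1,
    padding 1) with a PReLU in between, plus an identity skip connection."""
    def __init__(self, c=64):
        super().__init__()
        self.conv1 = nn.Conv2d(c, c, 3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(c, c, 3, stride=1, padding=1)
        self.act = nn.PReLU(c)

    def forward(self, x):
        return x + self.conv2(self.act(self.conv1(x)))

# One "tile" of five such blocks, as in Figure 4; the full network stacks three tiles.
tile = nn.Sequential(*[ResidualBlock(64) for _ in range(5)])
```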
Figure 5: Discriminator Architecture from SRGAN (16). The discriminator uses Leaky ReLUs for computing its activations.
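An SRGAN-style discriminator matching this description can be sketched as below. The channel ladder and strides follow the standard SRGAN design; the adaptive pooling before the dense head is an assumption added here to make the sketch input-size-agnostic.

```python
import torch
import torch.nn as nn

def d_block(cin, cout, stride, bn=True):
    """Conv -> (optional) BatchNorm -> LeakyReLU, as used throughout the ladder."""
    layers = [nn.Conv2d(cin, cout, 3, stride=stride, padding=1)]
    if bn:
        layers.append(nn.BatchNorm2d(cout))
    layers.append(nn.LeakyReLU(0.2, inplace=True))
    return layers

class Discriminator(nn.Module):
    """SRGAN-style discriminator sketch: a ladder of Conv-BN-LeakyReLU blocks
    (64 -> 512 channels) followed by dense layers that output the probability
    that the input frame is a real HR frame."""
    def __init__(self):
        super().__init__()
        specs = [(3, 64, 1, False), (64, 64, 2, True),
                 (64, 128, 1, True), (128, 128, 2, True),
                 (128, 256, 1, True), (256, 256, 2, True),
                 (256, 512, 1, True), (512, 512, 2, True)]
        layers = []
        for cin, cout, s, bn in specs:
            layers += d_block(cin, cout, s, bn)
        self.features = nn.Sequential(*layers)
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(512, 1024), nn.LeakyReLU(0.2, inplace=True),
            nn.Linear(1024, 1), nn.Sigmoid())

    def forward(self, x):
        return self.head(self.features(x))
```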
4.2 Loss functions
To evaluate the quality of an image, a commonly used loss function is MSE, which aims to improve the PSNR of an image (28). While optimizing MSE during training improves PSNR and SSIM, these metrics may not capture fine details in the image, leading to a misrepresentation of perceptual quality, and can cause the resulting video frames to be overly smooth (29). In a series of experiments, it was found that even manually distorted images still had an MSE score comparable to the original image (30). To address this, we use a four-fold (adversarial, perceptual, MSE and TV) loss. (19) introduced a new loss function called perceptual loss, which relies on features extracted from a pre-trained VGG network instead of low-level pixel-wise error measures. Per (16), we use adversarial loss along with a content loss that focuses on perceptual similarity instead of similarity in pixel space, to limit model "fantasy". Further, we use a de-noising term, the TV loss (19). We weigh these losses together as the final training objective for iSeeBetter. We define our loss function for each frame as follows. The total loss of a sample is the average of all its frames.
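The definition itself is truncated in this copy. A reconstruction consistent with the four components named above, with placeholder weights α, β and γ (the actual weights used in training are not given here), is:

```latex
% Per-frame loss: weighted sum of the four components
\mathcal{L}_t = \mathcal{L}_{\mathrm{MSE}}\big(SR_t, HR_t\big)
              + \alpha \, \mathcal{L}_{\mathrm{percep}}\big(SR_t, HR_t\big)
              + \beta  \, \mathcal{L}_{\mathrm{adv}}\big(SR_t\big)
              + \gamma \, \mathcal{L}_{\mathrm{TV}}\big(SR_t\big)

% Total loss of a sample: average over its N frames
\mathcal{L} = \frac{1}{N} \sum_{t=1}^{N} \mathcal{L}_t
```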