Can Learned Frame-Prediction Compete with Block-Motion Compensation for Video Coding?


Serkan Sulun · A. Murat Tekalp

Received: 25 December 2019 / Accepted: 16 July 2020
arXiv:2007.08922v1 [eess.IV] 17 Jul 2020

A. M. Tekalp acknowledges support from the TUBITAK project 217E033 and the Turkish Academy of Sciences (TUBA). College of Engineering, Koç University, 34450 Istanbul, Turkey. E-mail: [email protected]

Abstract Given recent advances in learned video prediction, we investigate whether a simple video codec that uses a pre-trained deep model for next-frame prediction based on previously encoded/decoded frames, without sending any motion side information, can compete with standard video codecs based on block-motion compensation. Frame differences given learned frame predictions are encoded by a standard still-image (intra) codec. Experimental results show that the rate-distortion performance of the simple codec with symmetric complexity is on average better than that of the x264 codec on 10 MPEG test videos, but does not yet reach the level of the x265 codec. This result demonstrates the power of learned frame prediction (LFP), since, unlike motion compensation, LFP does not use information from the current picture. The implications of training with ℓ1, ℓ2, or combined ℓ2 and adversarial loss on prediction performance and compression efficiency are analyzed.

Keywords deep learning · frame prediction · predictive frame difference · HEVC-Intra codec · rate-distortion performance

1 Introduction

An essential component of video compression is motion compensation, which reconstructs a predicted frame with the help of block-based motion vectors sent as side information. Naturally, this prediction is imperfect, so the residual frame difference needs to be encoded and transmitted alongside the motion vectors. These two components constitute a compressed video file, whose size can be adjusted by rate-distortion (RD) optimization, creating videos with varying bitrate and visual quality.

Until recently, there was no serious competition to block motion compensation (BMC) for forming a predicted video frame. Advances in network architectures, training methods, and graphics processing units (GPUs) have enabled the creation of powerful learned models for many tasks, including prediction of future video frames given past frames without using motion vectors.

This paper investigates whether learned frame prediction (LFP) can replace traditional BMC in video compression, making the estimation and transmission of motion vectors as side information redundant. LFP is not constrained by the block-translation motion model, but it only uses previously decoded frames at both the encoder and decoder, unlike traditional BMC, which has access to the current frame at the encoder. On the other hand, a video codec employing LFP can use the bits saved by not sending motion vectors to better code the prediction residual in order to increase video fidelity. Hence, it is of interest to analyze how a video codec based on LFP compares with traditional codecs.
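To make this codec structure concrete, the following is a minimal Python sketch of the encoding loop. The names lfp_model, intra_encode, and intra_decode are hypothetical stand-ins for the pre-trained frame-prediction network and a standard still-image (intra) codec, and the number of bootstrap intra-coded frames is illustrative; this is a sketch of the idea, not the authors' implementation.

def encode_sequence(frames, lfp_model, intra_encode, intra_decode, n_past=4):
    """Sketch of an LFP-based encoder: the first n_past frames are
    intra-coded; every later frame is predicted from previously
    *decoded* frames and only the residual is intra-coded."""
    bitstream = []
    decoded = []  # decoder-side reconstructions, kept in sync with the decoder
    for t, frame in enumerate(frames):
        if t < n_past:
            bits = intra_encode(frame)            # bootstrap intra frames
            recon = intra_decode(bits)
        else:
            pred = lfp_model(decoded[-n_past:])   # prediction from decoded past only
            bits = intra_encode(frame - pred)     # residual coded as a still image
            recon = pred + intra_decode(bits)     # closed-loop reconstruction
        bitstream.append(bits)
        decoded.append(recon)
    return bitstream

Since the prediction depends only on previously decoded frames, the decoder can run the identical prediction on its own reconstructions; the bitstream therefore carries no motion vectors at all.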
The main contribution of this paper is to demonstrate that a simple video encoder based on a pre-trained LFP model yields rate-distortion performance that on average exceeds that of the well-established x264 encoder in sequential configuration. More generally, we provide answers to the following questions:

- How do we evaluate the performance of LFP models in video compression vs. computer vision?
- Can LFP compete with block-motion compensation in terms of compression rate-distortion performance?
- Does training with sum of absolute differences (ℓ1) or mean square (ℓ2) loss, or a weighted combination of ℓ2 and adversarial losses, provide better rate-distortion performance in video compression?

In the following, Section 2 reviews related work. Learned video frame prediction is discussed in Section 3, and compression of the predictive frame differences is described in Section 4. Experimental results are presented in Section 5. Section 6 concludes the paper.

2 Related Work

A recent review of video prediction methods using neural networks can be found in [28]. Different from [28], we classify prior work on LFP in terms of the prediction methodology employed and the loss function used in training. In terms of prediction methodology, we classify LFP methods as frame reconstruction methods, which directly predict pixel values, and frame transformation methods, which predict transformation parameters, e.g., affine parameters, to transform the current frame into the future frame.

Among frame reconstruction methods, Srivastava et al. [31] use long short-term memory (LSTM) autoencoders to simultaneously reconstruct the current frame and predict a future frame. Mathieu et al. [23] propose multi-scale generator and discriminator networks and introduce a new loss that calculates the difference between gradients of predicted and ground-truth images. Kalchbrenner et al. [17] introduce an encoder-decoder architecture called Video Pixel Networks, where the decoder employs a PixelCNN [24] network. Denton et al. [12] propose a variational encoder to estimate a probability distribution over potential predicted frame outcomes.

Among frame transformation methods, Amersfoort et al. [35] predict affine transformations between patches from consecutive frames, which are applied to the current frame to generate the next one. Vondrick et al. [38] also predict frame transformations but train the model using adversarial loss only. Villegas et al. [37] focus on long-term prediction of human actions, using pose information from a pretrained LSTM autoencoder as input. In a follow-up work, Wichers et al. [41] replace the pre-trained pose-extracting network with a trainable encoder, enabling end-to-end training. Finn et al. [15] predict object motion kernels using convolutional LSTMs. In a follow-up work, Babaeizadeh et al. [3] supplement Finn's model with a variational encoder to avoid generating blurry predictions. In a further follow-up, Lee et al. [19] use adversarial loss to obtain more realistic results.

In terms of loss functions, most methods use mean square (ℓ2) loss. Other loss functions used include ℓ1 loss [23] and cross-entropy loss [31, 17]. Mathieu et al. [23] introduce the gradient loss. Variational autoencoders employ KL-divergence [23, 38, 37, 19], and GANs employ adversarial loss [3, 19, 12]. Perceptual loss is also used, by comparing frames in a feature space [13], using a pretrained network [37].

Our paper differs from other video prediction work as follows:

- While most related works use some form of LSTM, we chose a deep convolutional residual network, inspired by EDSR [20], for frame prediction. The rationale for this choice is explained in Section 3.
- In applications where the predicted frame is the final product, the visual quality, i.e., textureness and sharpness, of the predicted image is important; hence, the use of adversarial loss is well justified. However, most methods using such a loss function do not report a direct quantitative evaluation of generated vs. ground-truth images using peak signal-to-noise ratio (PSNR) or the structural similarity metric (SSIM). In contrast, in video compression it is customary to compare compression efficiency by rate-distortion (RD) performance, where distortion is measured in terms of PSNR and predicted frames are only intermediate results, not to be viewed. Our results show that training with only ℓ1 or ℓ2 loss provides the best RD performance, even though predicted frames may look blurry.

The state-of-the-art video compression standard is High Efficiency Video Coding, known as H.265/HEVC [27]. Several works to enhance H.265/HEVC codecs with or without deep learning have been proposed [22]. While there are many works on learned intra-prediction or learned end-to-end image/video compression (e.g., [14]), few works address learned frame prediction for video compression. Our work differs from them as follows:

- Several researchers propose learned models that supplement standard block motion compensation to enhance prediction and improve the compression efficiency of HEVC [16, 21, 40, 42, 26, 10]. In contrast, we do not use block motion compensation at all.
- Chen et al. [8] employ a neural network for frame prediction. However, they also estimate and use 4×4 block motion vectors at both the encoder and decoder, even though they do not transmit them. In contrast, we do not need motion vectors at all.
3 Learned Video Frame Prediction

Recurrent models, such as LSTMs, have been the top choice of architecture for sequence learning problems. With the advent of ResNet, which introduced skip connections, it has become feasible to train deep CNNs to learn temporal dynamics from a fixed number of past frames. Although, in theory, LSTMs can remember the entire history of a video, in practice, due to training with truncated backpropagation through time [4], we obtain as good if not better performance by processing only a fixed number of past frames using a CNN. Our approach is consistent with a recent study, in which Villegas et al. [36] show that large-scale networks can perform state-of-the-art video prediction without using optical flow or any other inductive bias in the network architecture. In the following, we present our generator network architecture in Section 3.1 and discuss the details of the training procedures in Section 3.2.

3.1 The Generator Network

The architecture of our LFP network is inspired by the success of the enhanced deep super-resolution (EDSR) network [20]. [...] After each residual block, residual scaling with 0.1 is applied [32]. At the output layer, output values are scaled between -1 and 1 during training. At test time, the outputs are scaled between 0 and 255.
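As a rough illustration of such an EDSR-style predictor, the PyTorch sketch below shows a residual block with the 0.1 residual scaling mentioned above, and a fully convolutional network that takes the past frames stacked along the channel axis. The depth, width, number of input frames, global skip connection, and the tanh used to keep training outputs in [-1, 1] are assumptions for illustration, not the paper's exact configuration.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """EDSR-style block: conv-ReLU-conv without batch normalization,
    with the block output scaled by 0.1 before the skip connection [32]."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + 0.1 * self.body(x)  # residual scaling

class FramePredictor(nn.Module):
    """Fully convolutional next-frame predictor. The n_past previously
    decoded frames (single-channel here) are stacked as input channels;
    depth and width are illustrative placeholders."""
    def __init__(self, n_past=4, channels=64, n_blocks=16):
        super().__init__()
        self.head = nn.Conv2d(n_past, channels, 3, padding=1)
        self.body = nn.Sequential(*[ResidualBlock(channels) for _ in range(n_blocks)])
        self.tail = nn.Conv2d(channels, 1, 3, padding=1)

    def forward(self, past):              # past: (B, n_past, H, W), scaled to [-1, 1]
        x = self.head(past)
        x = self.body(x) + x              # global skip connection, as in EDSR
        return torch.tanh(self.tail(x))   # one way to keep outputs in [-1, 1]

At test time, per the scaling described above, the [-1, 1] output would be mapped back to the [0, 255] pixel range before the prediction residual is formed.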
3.2 Training

An important factor that affects video prediction and compression performance is the choice of loss function, which is related to how we evaluate the prediction performance of the network.
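The three training objectives compared in this paper (ℓ1, ℓ2, and a weighted combination of ℓ2 and adversarial loss) can be sketched as follows. The discriminator, the adversarial weight, and the particular GAN formulation are illustrative assumptions, not the paper's exact training setup.

import torch
import torch.nn.functional as F

def generator_loss(pred, target, mode="l2", discriminator=None, adv_weight=0.01):
    """Sketch of the loss options: mean absolute error (l1, equivalent to
    the sum of absolute differences up to scale), mean square error (l2),
    or l2 combined with a non-saturating adversarial term."""
    if mode == "l1":
        return F.l1_loss(pred, target)
    if mode == "l2":
        return F.mse_loss(pred, target)
    if mode == "l2_adv":
        # discriminator(pred) is assumed to output probabilities in (0, 1)
        adv = -torch.log(discriminator(pred) + 1e-8).mean()
        return F.mse_loss(pred, target) + adv_weight * adv
    raise ValueError(f"unknown mode: {mode}")

In the combined case, the discriminator would be trained in alternation with its own loss, which is omitted from this sketch.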