Neural Video Coding using Multiscale Motion Compensation and Spatiotemporal Context Model

Haojie Liu1, Ming Lu1, Zhan Ma1, Fan Wang2, Zhihuang Xie2, Xun Cao1, and Yao Wang3 1 Nanjing University, 2 OPPO.Inc, 3 New York University

Abstract—Over the past two decades, traditional block-based video coding has made remarkable progress and spawned a series of well-known standards such as MPEG-4, H.264/AVC and H.265/HEVC. On the other hand, deep neural networks (DNNs) have shown their powerful capacity for visual content understanding, feature extraction and compact representation. Some previous works have explored learnt video coding algorithms in an end-to-end manner, showing great potential compared with traditional methods. In this paper, we propose an end-to-end deep neural video coding framework (NVC), which uses variational autoencoders (VAEs) with joint spatial and temporal prior aggregation (PA) to exploit the correlations in intra-frame pixels, inter-frame motions and inter-frame compensation residuals, respectively. Novel features of NVC include: 1) to estimate and compensate motion over a large range of magnitudes, we propose an unsupervised multiscale motion compensation network (MS-MCN) together with a pyramid decoder in the VAE for coding motion features that generates multiscale flow fields; 2) we design a novel adaptive spatiotemporal context model for efficient entropy coding of motion information; 3) we adopt nonlocal attention modules (NLAM) at the bottlenecks of the VAEs for implicit adaptive feature extraction and activation, leveraging their high transformation capacity and unequal weighting with joint global and local information; and 4) we introduce multi-module optimization and a multi-frame training strategy to minimize the temporal error propagation among P-frames. NVC is evaluated for the low-delay causal settings and compared with H.265/HEVC, H.264/AVC and other learnt video compression methods following the common test conditions, demonstrating consistent gains across all popular test sequences for both PSNR and MS-SSIM distortion metrics.

Index Terms—Neural video coding, neural network, multiscale motion compensation, pyramid decoder, multiscale compressed flows, nonlocal attention, spatiotemporal priors, temporal error propagation.

I. INTRODUCTION

Compressed video, a dominant media representation across the entire Internet, occupies more than 70% of total traffic volume nowadays for entertainment (e.g., YouTube), productivity (e.g., tele-education), security (e.g., surveillance), etc. It still keeps growing explosively. Thus, in pursuit of efficient storage and network transmission, and pristine quality of experience (QoE) with higher-resolution content (e.g., 2K, 4K and even 8K video at 30 Hz, 60 Hz or higher frame rates), a better compression approach is greatly and continuously desired. In principle, the key problem in video coding is how to efficiently exploit visual signal redundancy using prior information, spatially (e.g., intra prediction, transform), temporally (e.g., inter prediction), and statistically (e.g., entropy context adaptation), for more compact representations with less rate consumption at the same reconstruction quality. This is well formulated as the minimization of the Lagrangian cost J in rate-distortion optimization (RDO) that is widely adopted in existing video coders, e.g.,

min J = R + λ · D,    (1)

with R and D representing the compressed bit rate and reconstructed distortion, respectively.

Motivation. Over the past three decades, video compression technologies have been evolving and adapting constantly with coding efficiency improved by several folds, mostly driven by the efforts of the experts in the ISO/IEC Moving Picture Experts Group (MPEG), the ITU-T Video Coding Experts Group (VCEG) and their joint task forces. This has led to several popular video coding standards, including H.264/Advanced Video Coding (H.264/AVC) [1], High-Efficiency Video Coding (HEVC) [2] and the emerging Versatile Video Coding (VVC) [3]. These standards share a similar (recursive) block-based hybrid prediction/transform framework where individual coding tools, such as the intra/inter prediction, integer transforms, context-adaptive entropy coding, etc., are intensively handcrafted to optimize the overall efficiency. Among them, pixel-domain predictive coding is one of the most important factors, contributing to the major performance gains [4]. For example, pixel-domain intra prediction was officially adopted into H.264/AVC and later extended with the support of recursive block sizes and abundant predictive directions for efficiently exploiting spatial structures; recursive and even non-squared blocks are extensively used in inter prediction to remove temporal coherency. Basically, conventional video coding methods leverage the spatiotemporal pixel neighbors (as well as their linear combinations) for predictive signal construction, resulting in corresponding residuals for subsequent transform, quantization, and entropy coding for more compact representation. The optimal coding mode with appropriate block size and orientation (e.g., intra direction, inter motion vectors) is selected via a computational RDO process, utilizing the ℓ1-norm (e.g., mean absolute error - MAE) or ℓ2-norm (e.g., mean squared error - MSE) as the distortion metric.

Though recursive block-based pixel prediction shows great success, this is mainly due to the hardware advancements of past decades, by which we can exhaustively search for the best prediction. It, however, is more and more challenging to simply trade computational resources for efficiency improvement because Moore's Law does not hold any more [5]. It therefore calls for innovative methodologies and architectures of video coding to further improve the coding efficiency, in response to the ever increasing users' requirements for video resolution and quality. Such a pixel prediction strategy, either intra or inter, mostly relies on the physical coherence of the video signal and applies mathematical tools (e.g., linear weighting, orthogonal transform, Lagrangian optimization) for signal energy compaction.

TABLE I: Abbreviations and Notations

abbr.          description
NLAM           NonLocal Attention Module
LAM            Local Attention Module
MS-MCN         Multi-scale Motion Compensation Network
SS-MCN         Single-scale Motion Compensation Network
PA             Prior Aggregation
MCF            Multiscale Compressed Flow
neuro-Intra    Neural intra coding
neuro-Motion   Neural motion coding
neuro-Res      Neural residual coding
MSE            Mean Squared Error
MAE            Mean Absolute Error
PSNR           Peak signal-to-noise ratio
MS-SSIM        Multiscale Structural Similarity

Our Approach. We propose an end-to-end neural video coding framework (NVC), which codes intra-frame pixels (called neuro-Intra), inter-frame motion (called neuro-Motion), and inter-frame residual (called neuro-Res) using separate variational autoencoders (VAE), as shown in Fig. 1. A multiscale motion compensation network (MS-MCN) works together with neuro-Motion to generate multiscale optical flows and perform multiscale motion-compensated prediction of the current frame from the previous frame. The sparse image differences between past and present frames, e.g., residuals, are then encoded to obtain the final reconstruction. All three VAEs, i.e., neuro-Intra, neuro-Motion and neuro-Res for compressing intra-pixel, inter-motion and inter-residual, are engineered together with MS-MCN in an end-to-end learning manner. Note that neuro-Intra takes a native image frame as input; neuro-Motion uses the current and past reconstructed frames, generating multiscale compressed flows (MCFs); MS-MCN uses these generated MCFs for motion compensation to obtain the inter-predicted frame; and neuro-Res encodes the difference between the current frame and its prediction for the final reconstruction. Additionally, joint spatiotemporal and hyper priors are aggregated for efficient and adaptive context modeling of latent features to improve entropy coding efficiency for the motion field.

Fig. 1: Neural Video Coding (NVC). The modules neuro-Intra, neuro-Res, and neuro-Motion follow the general model architecture in Fig. 2 for efficient representations of intra pixels, displaced inter residuals, and inter motions. The neuro-Motion uses a pyramid decoder for the main decoder as discussed in Sec. III-D2.

We have evaluated the efficiency of the proposed NVC for the low-delay causal settings against the well-known HEVC, H.264/AVC and other learnt video compression methods following the common test conditions. The NVC demonstrated the leading performance with consistent gains across all popular test sequences for both PSNR (Peak signal-to-noise ratio) and MS-SSIM (multiscale structural similarity) [6] distortion metrics. Using the H.264/AVC as a common anchor, our NVC presents 35% BD-Rate (Bjontegaard Delta Rate) [7] gains, while HEVC and DVC (Deep Video Coding) [8] offer 30% and 22% gains, respectively, when the distortion is measured in terms of PSNR. Gains are even larger if the distortion metric is replaced by MS-SSIM: in this case, NVC can achieve 50% improvement, while both HEVC and DVC are around 25%. We further compare our NVC with DVC Pro [9] (an upgraded version of DVC) on the HEVC test sequences: our NVC reveals 32.20% and 50.07% BD-Rate reduction measured by PSNR and MS-SSIM distortion respectively, while DVC Pro gives 34.57% and 45.88%.

Ablation studies have also been conducted to examine the gains due to different modules of NVC. We have shown that temporal priors and multi-frame training could greatly improve the efficiency and learning stability. Our MS-MCN is able to remove motion compensation noise by multiscale compensation, even better than using a cascaded trained denoising network.

Novelty. The main contributions are highlighted below:

• We propose an end-to-end deep neural video coding framework (NVC), leveraging learnt feature-domain representations of intra-pixel, inter-motion and inter-residual, respectively, for compression;
• neuro-Motion and the multiscale motion compensation network (MS-MCN) are employed together to capture coarse-to-fine motion displacements and obtain the prediction by warping the features of the reference frame at multiple scales;
• We propose a novel spatiotemporal context modeling approach for the entropy coding of the motion features, where the temporal context is obtained through modeling the temporal evolution of the motion features using a ConvLSTM in the temporal updating module (TUM). The temporal context is combined with the autoregressive spatial context and hyperprior features in a spatiotemporal-hyper aggregation module (STHAM);
• Nonlocal attention is attached at the bottleneck layers of the VAE modules for adaptive bit allocation based on joint global and local feature extraction implicitly.

This work is based on our preliminary work [10] but with significant extensions and discussions, including the pyramid flow decoder in neuro-Motion replacing single-scale motion compensation with multiscale motion compensation, the spatiotemporal prior aggregation engine for context modeling, and progressive training with multiple frames. A sketch of the overall low-delay coding loop is given below.
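To make the data flow of Fig. 1 concrete before the detailed sections, the following sketch outlines one possible per-GOP coding loop under the low-delay setting. It is an illustrative outline, not the released implementation; the callables neuro_intra, neuro_motion, ms_mcn and neuro_res are hypothetical stand-ins for the corresponding NVC modules, each assumed to return its output plus a bitstream or bit estimate.

```python
# Hypothetical sketch of the NVC low-delay coding loop (Fig. 1, Sec. III-A).
def encode_gop(frames, neuro_intra, neuro_motion, ms_mcn, neuro_res):
    """Encode a group of pictures: first frame as intra, the rest predictively."""
    streams = []

    # Intra-code the first frame (neuro-Intra), keep its reconstruction as reference.
    ref, bits = neuro_intra(frames[0])
    streams.append(bits)

    for frame in frames[1:]:
        # neuro-Motion: implicit motion features -> multiscale compressed flows (MCFs).
        mcfs, motion_bits = neuro_motion(frame, ref)

        # MS-MCN: warp multiscale features of the reference and fuse them
        # into the motion-compensated prediction of the current frame.
        predicted = ms_mcn(ref, mcfs)

        # neuro-Res: code the displaced inter residual.
        residual_hat, res_bits = neuro_res(frame - predicted)

        # The final reconstruction becomes the reference for the next P-frame.
        ref = predicted + residual_hat
        streams.append((motion_bits, res_bits))

    return streams
```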

Fig. 2: The general structure of the variational autoencoder based compression engine used for neuro-Intra, neuro-Motion and neuro-Res in Fig. 1. The main encoder Em includes four major convolutional layers (each including a convolutional layer followed by down-sampling and three residual blocks, and the last layer includes a nonlocal attention module). The hyper encoder Eh includes two major convolutional layers. The main decoder Dm and hyper decoder Dh reverse the processing of Em and Eh, respectively. The prior aggregation (PA) engine collects the information from the hyper prior, autoregressive spatial neighbors, as well as temporal priors (if applicable) for efficient modeling of the probability distribution of latent features generated by the main encoder. Nonlocal attention is adopted at the bottlenecks of both main and hyper encoders to enable saliency based bit allocation, and the rectified linear unit (ReLU) is embedded for enabling nonlinearity. "Q" is for quantization, AE and AD are for arithmetic encoding and decoding, respectively. 2↓ and 2↑ are downsampling and upsampling by a factor of 2 for both horizontal and vertical dimensions.

The rest of this paper is structured as follows: Sec. II briefly reviews relevant studies on learnt image and video coding. Our neural video coding framework (NVC) is given in Sec. III with detailed discussions of neuro-Intra, neuro-Motion, MS-MCN and neuro-Res; Sec. IV presents experiments and ablation studies; and concluding remarks are drawn in Sec. V.

II. RELATED WORK

Built on the advancements of deep neural networks (DNNs), we have seen explosive growth of DNN-based image/video compression approaches. Some explorations have attempted to replace modular components in the traditional image/video coding framework, such as filtering, intra prediction, etc.; and others have fully relied on powerful learning tools to perform end-to-end optimization. Given that our neural video coding framework (NVC) belongs to the end-to-end learning category, we will emphasize the reviews in this avenue. For the modular optimization using DNNs, two great review articles can be found in [11], [12].

A. Learnt Image Compression

DNN-based image compression usually utilizes autoencoder or VAE architectures, consisting of nonlinear transforms, attention or importance maps, differentiable quantization, context models, and embedded loss functions, for end-to-end learning.

Recurrent Autoencoder. Toderici et al. [13] first proposed to utilize fully-connected recurrent autoencoders for variable-rate thumbnail image compression. A series of improvements were then made, including full-resolution image support, learned entropy coding, unequal bits allocation, etc. [14], [15], through the introduction of ConvLSTM or ConvGRU, for better coding efficiency. Variable-rate coding is intrinsically enabled by such a recurrent structure. It, however, suffers from higher computational complexity at higher bit rates, because more recurrent processing is required.

Convolutional Autoencoder. Alternatively, convolutional autoencoders [16]–[19] have been extensively studied in past years, where different bit rates are adapted by setting a variety of λ in learning to optimize (1). Note that different network models may be required for individual bit rates, making it difficult for hardware implementation (e.g., model adaptation for various bit rates). Recently, conditional convolution [20] and scaling factors [21] were proposed to enable variable-rate compression using a single or a limited number of network parameters without noticeable coding efficiency loss. This makes the convolutional autoencoders more attractive for practical applications.

Attention/Importance Map. Li et al. [22] utilized a separate three-layer CNN to generate an importance map for spatial-complexity-based adaptive bits allocation, leading to noticeable subjective quality improvement with well-preserved edges and textures. Instead, Mentzer et al. [18] further selected one channel from the bottleneck layer to unequally weigh features at different spatial locations for simplification. Such importance map embedding was lightweight and easy for training and end-to-end optimization. This approach was later improved with a nonlocal attention mechanism to efficiently and implicitly capture both global and local important information for better compression [21].

Nonlinear Transform. To generate a more compact feature representation, Balle et al. [16] suggested replacing the traditional nonlinear activation, e.g., ReLU, with the generalized divisive normalization (GDN), which was theoretically proven to be more consistent with natural image statistics for visual perception. A succeeding investigation by the same authors was given in [23], reporting that GDN outperformed other nonlinear rectifiers, such as ReLU, leakyReLU and tanh, in compression tasks. Several follow-up studies [24], [25] directly applied GDN in their networks for compression exploration.

Adaptive Contexts. The probabilistic model plays a vital role in entropy coding. Assuming the Gaussian distribution for feature elements, Balle et al. [17] utilized hyper priors to estimate the parameters of a Gaussian Scale Model (GSM) for latent features. Later, Hu et al. [26] used hierarchical hyper priors (coarse-to-fine) for improving the entropy models in multiscale representations. Minnen et al. [19] improved the context modeling using joint autoregressive spatial neighbors and hyper priors based on a Gaussian Mixture Model (GMM). Autoregressive spatial priors were usually extracted by PixelCNNs or PixelRNNs [27], which have been widely adopted for natural image density modeling. Reed et al. [28] further introduced multiscale PixelCNNs, yielding competitive density estimation and great speedup (e.g., from O(N) to O(log N)). This was later extended from 2D architectures to 3D PixelCNNs [18]. Channel-wise weight-sharing-based 3D implementations can greatly reduce network parameters with higher parallelization. Then Chen et al. [21] discussed parallel pipelines of 3D PixelCNNs for practical decoding. Previous methods accumulated all the accessible priors to estimate the probability based on a single Gaussian distribution for each element. Recent explorations have shown that weighted GMMs can further improve the coding efficiency, as reported by [29], [30].

Quantization. Quantization is a non-differentiable operation, basically converting continuous variables into discrete variables with a limited alphabet. This process has to be replaced by a differentiable operation when used in an end-to-end learning framework for back propagation. A number of methods, such as uniform noise adding [16], stochastic rounding [13], soft-to-hard quantization [18] and universal quantization [20], were developed to approximate a continuous distribution for differentiation.

Loss Functions. Pixel error, such as MSE or MAE, was one of the most popular loss functions used. In the meantime, SSIM or MS-SSIM was also adopted because of its better consistency with visual perception. Simulations have revealed that SSIM-based loss can improve the perceptual quality, especially at low bit rates. Towards perception-optimized encoding, perceptual losses measured by adversarial loss [31]–[33] and VGG loss [34] were embedded in learning to produce visually appealing results.

B. Learnt Video Compression

Learnt video compression is extended from learnt image compression by further exploiting the temporal redundancy in a trainable way for efficient representations of temporal motion and displaced image difference (e.g., predictive residual). In most cases, network models for image compression were directly re-used for temporally displaced residuals, leaving a great deal of effort devoted to better motion field compression. Chen et al. [35] developed the DeepCoder, where a simple convolutional autoencoder was applied for both intra and residual coding at fixed 32×32 blocks, and block-based motion estimation in traditional video coding was re-used for temporal compensation. Lu et al. [8] introduced the optical flow for motion representation in their DVC work, which, together with the intra coding in [17], demonstrated similar performance compared with the HEVC. However, the coding efficiency suffers a sharp loss at low bit rates. Liu et al. [10] extended their nonlocal attention optimized image compression (NLAIC) for coding the intra and residual frames, and applied a spatiotemporal adaptive context model for more compact motion representation, showing consistent rate-distortion gains across different contents and bit rates.

Motion can also be implicitly inferred by temporal interpolation. For example, Wu et al. [36] applied recurrent neural network (RNN)-based frame interpolation. Together with the residual compensation, it offered comparable performance with H.264/AVC. Djelouah et al. [37] further improved interpolation-based video coding by utilizing advanced optical flow estimation and feature-domain residual coding. Yang et al. [38] also use an interpolation method with three hierarchical quality layers and a recurrent enhancement network for compression. However, temporal interpolation usually leads to coding delay that may not be acceptable for low-latency applications.

Another interesting exploration, made by Ripple et al. in [39], was to jointly encode the flow and residual signals using unified quantized features in an unsupervised way. A recurrent state was embedded to aggregate multi-frame information for efficient flow generation and residual coding.
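As background for the hyperprior-style context models reviewed above (and reused later in the PA engine of Sec. III), the snippet below shows one common way a conditional Gaussian entropy model converts predicted means and scales into an estimated bit count during training. The function name and clamping constants are our own illustrative choices, not code from the cited works.

```python
import torch
from torch.distributions import Normal

def estimated_bits(y_hat, mu, sigma, eps=1e-9):
    """Estimate the bit cost of quantized latents y_hat under a conditional
    Gaussian model N(mu, sigma): p(y_hat) = CDF(y_hat + 0.5) - CDF(y_hat - 0.5)."""
    gaussian = Normal(mu, sigma.clamp(min=1e-6))
    prob = gaussian.cdf(y_hat + 0.5) - gaussian.cdf(y_hat - 0.5)
    return -torch.log2(prob.clamp(min=eps)).sum()

# Usage: mu and sigma would come from the prior aggregation (hyper + spatial
# + temporal context); the returned bit estimate is the R term in Eq. (1).
```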
III. NEURAL VIDEO CODING

A. Overview of NVC

Our NVC framework is designed for low-delay applications. As with all modern video encoders, the proposed NVC compresses the first frame in each group of pictures as an intra-frame using a VAE based compression engine (neuro-Intra). It codes the remaining frames using motion compensated prediction. As shown in Fig. 1, it uses the VAE compressor (neuro-Motion) to generate the multiscale motion field between the current frame and the reference frame. Then, MS-MCN takes the multiscale compressed flows, warps the multiscale features of the reference frame, and combines these warped features to generate the predicted frame. The prediction residual is then coded using another VAE-based compressor (neuro-Res).

Given a group of pictures (GOP) X = {X_1, X_2, ..., X_t}, we first encode X_1 using the neuro-Intra module, leading to the reconstructed frame X̂_1. The following frame X_2 is encoded predictively, using neuro-Motion, MS-MCN, and neuro-Res together, as shown in Fig. 1. Note that MS-MCN takes the multiscale optical flows {f_d^1, f_d^2, ..., f_d^s} derived by the pyramid decoder in neuro-Motion, and then generates the predicted frame X̂_2^p by multiscale motion compensation. The displaced inter-residual r_2 = X_2 − X̂_2^p is then compressed in neuro-Res, yielding the reconstruction r̂_2. The final reconstruction X̂_2 is given by X̂_2 = X̂_2^p + r̂_2. Encoding of the next P-frame follows the same procedure as X_2 until all frames in the GOP are coded completely.

Table I summarizes relevant abbreviations used throughout this paper.

B. The VAE Architecture using NLAM and Spatiotemporal Priors

The general architecture of the VAE model is shown in Fig. 2, with a main encoder-decoder pair for latent feature analysis and synthesis, and a hyper encoder-decoder for hyper prior generation. The main encoder Em uses four stacked CNN layers, where each convolutional layer employs strided convolutions to achieve downsampling (e.g., at a factor of 2 in this example) and cascaded convolutions (e.g., three ResNet-based residual blocks [40]¹) for efficient feature extraction. We utilize a two-layer hyper encoder Eh to further generate the subsequent hyper priors as side information, which are used for the entropy coding of the latent features.

To capture spatial locality, we apply convolutional layers with a limited receptive field (e.g., 3×3) that are stacked altogether to simulate layer-wise feature extraction. These same ideas are used in many relevant studies [17], [19]. We utilize the simplest ReLU as the nonlinear activation function. Other nonlinear activation functions can be used as well, such as the GDN (Generalized Divisive Normalization) in [16].

An attention mechanism is leveraged to intelligently allocate bit resource (e.g., via unequal feature quantization) for image/video compression [18], [41]. It basically enables more accurate reconstruction of salient areas. We adopt the nonlocal attention module (i.e., NLAM) at the bottleneck layers of both the main encoder and hyper encoder, prior to quantization, to include both global and local information for more accurate importance selection. This module is motivated by the behaviour of the HVS, where we often promptly scan the entire viewing scene to have a complete understanding of the field of vision, and then fixate on the salient regions.

To enable more accurate conditional probability density modeling for entropy coding of the latent features, we introduce the Prior Aggregation (PA) engine, which fuses the inputs from the hyper priors, spatial neighbors and temporal context (if applicable)². More accurate context modeling requires less resource (e.g., bits) for information representation, as suggested in [42]. For the sake of simplicity, we assume the latent features (e.g., motion, image pixel, residual) follow the Gaussian distribution as in [19], [26], and use the PA engine to derive the mean and standard deviation of the distribution for each feature.

C. Neural Intra Coding

Our neuro-Intra is a simplified version of the NLAIC that was originally proposed in [21].

NLAIC: One major distinction of NLAIC from the VAE model using autoregressive spatial context in [19] is the introduction of a nonlocal attention module (NLAM) inspired by [43]. NLAM is used to capture the global and local importance for saliency aggregation, by which we can assign different importance to various spatial-channel elements implicitly. Such a nonlocal attention mechanism is inspired by the visual information processing of spatial perception (e.g., coarse global structure plus fine-grain local details). We have found that with the addition of NLAM, we can achieve similar performance as [19] without using GDN nonlinearity.

In addition, we have applied a 3D 5×5×5 masked CNN³ to extract spatial priors, which are fused with hyper priors in PA for entropy context modeling (e.g., bottom part of Fig. 6). Here, we have assumed the single Gaussian distribution for the context modeling of entropy coding. More details of NLAIC can be found in [21]. Note that temporal priors are not adopted for intra-pixel and inter-residual in this paper.

Improvements to NLAIC. The original NLAIC applies multiple NLAMs in both main and hyper coders, leading to excessive memory consumption at a large spatial scale. In NVC, NLAMs are only used at the bottleneck layers of both main and hyper encoder-decoder pairs, achieving adaptive bit allocation implicitly.

To overcome the non-differentiability of the quantization operation, quantization is simulated by adding uniform noise in [17]. However, such noise augmentation is not exactly consistent with the rounding in inference, yielding the performance loss reported by [20]. Thus, we apply the universal quantization (UQ) [20] in neuro-Intra, i.e.,

x̂ = R(x + u) − u,    (2)

where x̂ is a quantized symbol and u represents a random uniform variable ranging from −1/2 to 1/2. Statistically, we define the gradient as 1 for back propagation since UQ can be approximated as a linear function. Such UQ (2) is used for neuro-Motion and neuro-Res as well. When applied to the common Kodak dataset, neuro-Intra achieves similar performance as NLAIC [21], outperforming Minnen (2018) [19], BPG (4:4:4) and JPEG2000, as shown in Fig. 3.

D. Neural Motion Coding and Compensation

Inter-frame coding plays a vital role in video coding. The key is how to efficiently represent motion in a compact format for effective compensation. In comparison to the pixel-domain block-based motion estimation and compensation in conventional video coding, we rely on the optical flow to accurately capture the temporal information for motion compensation.

¹We apply cascaded ResNets for the stacked CNNs because of their high efficiency and reliable performance. Other efficient CNN architectures can be applied as well.
²Intra and residual coding only use joint spatial and hyper priors without temporal inference.
³This 5×5×5 convolutional kernel shares the same parameters for all channels, offering great model complexity reduction compared with the 2D CNN-based solution in [19].
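The universal quantization of Eq. (2) and its unit-gradient (straight-through) backward pass can be expressed in a few lines. The sketch below is our own minimal illustration under the stated assumptions, not the authors' implementation; the same uniform draw is added before rounding and subtracted afterwards.

```python
import torch

def universal_quantize(x):
    """Sketch of universal quantization, Eq. (2): x_hat = round(x + u) - u,
    with u ~ Uniform(-0.5, 0.5) drawn once per call."""
    u = torch.empty(1, device=x.device).uniform_(-0.5, 0.5)
    x_hat = torch.round(x + u) - u
    # Straight-through estimator: forward uses the quantized value,
    # backward treats quantization as identity (gradient of 1).
    return x + (x_hat - x).detach()
```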

Fig. 3: Efficiency of neuro-Intra. PSNR vs. rate performance of neuro-Intra in comparison to NLAIC [21], Minnen (2018) [19], BPG (4:4:4) and JPEG2000 on the Kodak dataset (PSNR in dB versus bits per pixel). Note that the curves for neuro-Intra and NLAIC overlap.

1) Single-scale Motion Generation and Compensation: Deep learning has spawned several optical flow estimation methods, trained in either a supervised or unsupervised way, such as FlowNet2 [44], PWC-net [45], etc. They mostly focus on acquiring high-precision optical flow between two consecutive (uncompressed) frames without any compression rate constraint. In a video coder, however, it is more challenging to derive robust flow where the bitrate is often limited and the reference frame is usually lossy encoded with inherent compression noise.

One approach applies a two-stage method, shown in Fig. 4a, which is utilized by DVC [8]. Such a two-stage scheme first uses a pre-trained lossless flow generator (e.g., FlowNet2) for explicit flow derivation, and then cascades an auto-encoder to encode the flow. RDO is examined when compressing the optical flow to balance the bit allocation and reconstruction distortion. Either separate or joint optimization of flow derivation and compression can be applied for such a two-stage approach.

We propose an alternative one-stage framework, shown in Fig. 4b, which was first presented in [10]. It directly transforms two concatenated frames (e.g., one is the reference from the past, and one is the current frame) into quantized temporal features that represent the inter-frame motion. These quantized features are decoded into compressed optical flow in an unsupervised way for frame compensation via warping. Such a one-stage scheme does not require any pre-trained flow network such as FlowNet2 or PWC-net to generate the optical flow explicitly. It allows us to quantize the motion features rather than the optical flows and to train the motion feature encoder and decoder together with explicit consideration of quantization and rate constraint. Note that the motion features are generated from the main motion encoder, and the hyper encoder is used to generate hyper features for the motion features in the VAE model in Fig. 2.

Loop filters, or quality enhancement networks, can be used in both two-stage and one-stage methods to enhance the quality of the flow-warped frame, for improved quality of the predicted frame, as shown in Fig. 4a and 4b. This is mainly because linear interpolation-based backward warping would inevitably introduce artifacts in reconstruction, especially for cases with flow estimation errors [46]. A cascaded quality enhancement model can be devised to alleviate such artifacts and improve the reconstruction quality. End-to-end learning can be further applied to the entire framework.

2) Multiscale Motion Generation and Compensation: We further extend our earlier work [10] to multiscale motion generation and compensation, for better inter-frame prediction. The neuro-Motion is modified for multiscale motion generation, where the main encoder is used for feature fusion, but we replace the main decoder by a pyramidal flow decoder, which generates the Multiscale Compressed optical Flows (MCFs). The MCFs are processed together with the reference frame, using a multiscale motion compensation network (MS-MCN), to obtain the predicted frame efficiently, as shown in Fig. 4c. The overall pipeline has four key steps:

• Step 0: Pretrain the neuro-Motion using ℓ1 loss for MCF generation. This part involves the implicit motion feature derivation in the encoder, and multiscale (compressed) flow decoding in the decoder. For the former part, we use the main encoder of the VAE (see Fig. 2) to extract temporal features for implicit motion representation. It inputs the concatenation of the reference frame X̂_{t−1} and current frame X_t to derive latent motion features. To ensure accurate MCF generation, we replace the residual blocks in both the VAE encoder and decoder with ResNet-based local attention modules (LAMs)⁴. Here, quantized features have a size of (H/16)×(W/16)×192, with H, W denoting the respective height and width of the original frame.
For the pyramidal flow decoder, we plug additional convolutional layers at each scale after the LAM to produce resolution-dependent flows f_d^s, s = 0, 1, ..., 4, to capture the multiscale motion fields. The pyramid flow decoder is first trained in an unsupervised way using the loss

L = Σ_{s=0}^{4} α_s · D_1(X̂_t^s, X_t^s),    (3)

with D_1 as the ℓ1 loss. The α_s are scale-dependent weighting coefficients, set as α_s = 4^s empirically. Here, s = 0 refers to the finest resolution. X̂_t^s is obtained by backward warping at each scale:

X̂_t^s = warping(X̂_{t−1}^s, f_d^s).    (4)

Multiscale labels X_t^s and multiscale references X̂_{t−1}^s are generated from the current and reference frames, respectively, using average pooling with stride 2 for both vertical and horizontal dimensions (a sketch of this multiscale warping loss follows this step).

⁴The local attention module only utilizes local features to generate the attention maps, compared with NLAM.
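As noted above, the following is a minimal sketch of the multiscale warping and pyramid ℓ1 loss of Eqs. (3)-(4), assuming bilinear backward warping, average-pooled multiscale labels, and the α_s = 4^s weighting reconstructed above. The helper names and the normalized-grid convention are illustrative assumptions rather than the paper's released code.

```python
import torch
import torch.nn.functional as F

def backward_warp(ref, flow):
    """Bilinearly warp `ref` (N,C,H,W) with per-pixel displacements `flow` (N,2,H,W),
    flow given in pixels (dx, dy), as in Eq. (4)."""
    n, _, h, w = ref.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=ref.device),
                            torch.arange(w, device=ref.device), indexing="ij")
    grid_x = (xs[None] + flow[:, 0]) / (w - 1) * 2 - 1   # normalize to [-1, 1]
    grid_y = (ys[None] + flow[:, 1]) / (h - 1) * 2 - 1
    grid = torch.stack((grid_x, grid_y), dim=-1)
    return F.grid_sample(ref, grid, mode="bilinear", align_corners=True)

def multiscale_flow_loss(cur, ref, flows, num_scales=5):
    """Eq. (3): sum of scale-weighted L1 errors between the pooled current frame
    and the warped, pooled reference at scales s = 0 (finest) ... 4 (coarsest)."""
    loss = 0.0
    for s in range(num_scales):
        cur_s = F.avg_pool2d(cur, 2 ** s) if s > 0 else cur   # average-pooled label X_t^s
        ref_s = F.avg_pool2d(ref, 2 ** s) if s > 0 else ref   # pooled reference
        warped = backward_warp(ref_s, flows[s])               # flows[s]: (N,2,H/2^s,W/2^s)
        loss = loss + (4 ** s) * F.l1_loss(warped, cur_s)     # alpha_s = 4^s (assumed)
    return loss
```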

Fig. 4: Motion Estimation and Compensation. (a) Two-stage single-scale motion coding and compensation approach using a pre-trained flow network (with explicit raw flow) and a cascaded flow compression network (e.g., [8]); (b) One-stage unsupervised motion generation and compensation approach with implicit flow represented by quantized features that are decoded into a single-scale motion field for motion compensation (warping) (e.g., [10]); (c) One-stage neuro-Motion with MS-MCN, which uses a pyramidal flow decoder to synthesize the multiscale compressed flows (MCFs) that are used in a multiscale motion compensation network for generating predicted frames.

• Step 1: Fine-tune neuro-Motion via rate-constrained ℓ1 loss. We further fine-tune the neuro-Motion with rate constraints, where the bit rate of the quantized feature representations is estimated based on adaptive contexts generated from the hyper features and other spatiotemporal context (to be discussed in Fig. 6). We use a rate-distortion loss as in (1), where the distortion is the multiscale prediction loss as in (3) and the rate is the estimated entropy of the motion features.

• Step 2: Pretrain MS-MCN via MCFs. Using the MCFs (i.e., the f_d^s) generated by the neuro-Motion from Step 1, we pretrain the multiscale motion compensation network for motion compensated prediction in the feature domain. It first uses a pyramidal decomposition of the reference frame, shown in Fig. 5a, to generate multiscale features {F^s}, s = 0, 1, ..., 4, of the reference frame X̂_{t−1}, which are respectively warped with the MCFs at the corresponding scale for progressive aggregation.
Fig. 5b exemplifies the aggregation at the smallest scale, s = 4, and the second smallest scale, s = 3: we use f_d^4 to warp the corresponding F^4 using (4), leading to the warped representation F_w^4. This F_w^4 is then upsampled to F_up^3 = U(F_w^4), which has the same resolution as the next scale, and concatenated with F_w^3. These concatenated features are then upsampled again to yield the scale-2 features:

F_up^2 = U(CAT(F_w^3, F_up^3)) = U(CAT(F_w^3, U(F_w^4))).    (5)

Note that F_up^2 will be concatenated with F_w^2 using similar steps. The upsampling operator U(·) is implemented with two convolution layers, as detailed in Fig. 5b. Eventually, we apply a fusion layer, consisting of a 1×1 convolution, ReLU and a 3×3 convolution, on CAT(F_w^0, F_up^0) to obtain the final predicted frame. Pretraining the MS-MCN is accomplished by minimizing the multiscale motion compensation loss in (3).

• Step 3: End-to-end Joint Refinement. In the end, we take the pre-trained neuro-Motion and MS-MCN together and perform an end-to-end joint refinement. We use the rate-constrained loss as in Step 1, and use the reconstructed intra frame as the reference frame. We resort to the "pre-training and joint-refinement" strategy since applying the joint training directly makes the model extremely unstable and hard to converge, as revealed experimentally.

Fig. 5: Multiscale Motion Compensation. (a) A pyramidal decomposition is applied on the reference frame to generate multiscale features. It consecutively uses a 3×3 convolution, ReLU activation and a 5×5 convolution with stride 2. (b) Feature aggregation between consecutive scales. Each upsampling layer consists of a 5×5 convolution with stride 2, ReLU activation and a 3×3 convolution.

Table II reports the consistent PSNR gains offered by the MS-MCN in Sec. III-D2, in comparison to the SS-MCN [10] in Sec. III-D1, revealing the efficiency of MS-MCN for accurate motion compensated prediction. At high bit rates, almost 2 dB gain is observed; MS-MCN keeps superior efficiency even with challenging temporal motions or occlusions in the "BasketballPass" content. At low bit rates, gains are reduced but still noticeable. This shows that the PSNR gains depend on how much motion information is actually compressed: more information (e.g., high bit rate) comes with larger PSNR gains, and vice versa.

TABLE II: Comparative Studies of MS-MCN and SS-MCN Using the PSNRs of the Predicted Frame.

                 High Bit Rate          Low Bit Rate
Sequences        MS-MCN   SS-MCN        MS-MCN   SS-MCN
BasketballPass   32.94    29.34         30.52    30.01
RaceHorses       27.48    25.08         26.12    24.41
PartyScene       27.01    26.43         25.55    25.27
BQMall           33.09    31.35         30.76    30.33
vidyo1           38.81    37.14         36.34    36.30
vidyo4           38.86    36.60         36.40    36.33
Average          33.03    30.99         30.94    30.44

Fig. 6: Context-Adaptive Modeling Using Joint Spatiotemporal and Hyper Priors. All priors are fused in PA to provide estimates of the probability distribution parameters.

3) Context-Adaptive Flow Coding: It is well recognized that motion fields present high correlations, both spatially and temporally. To exploit such correlation, Ripple et al. [39] proposed to code the flow differences in the feature domain. Here, we propose to exploit this correlation in the context modeling for the entropy coding of the motion features. Specifically, we develop a joint spatiotemporal and hyper prior-based context-adaptive model, shown in Fig. 6. This is implemented in the PA engine of Fig. 2 for neuro-Motion.

The proposed PA engine for neuro-Motion consists of a spatio-temporal-hyper aggregation module (STHAM) and a temporal updating module (TUM), shown in Fig. 6. At timestamp t, STHAM accumulates all the accessible priors and estimates the mean and standard deviation of the assumed Gaussian distribution for each new quantized motion feature F̂ using

(µ_F̂, σ_F̂) = G(F̂_1, ..., F̂_{i−1}, ẑ_t, h_{t−1}).    (6)

Here, F̂_i, i = 1, 2, ..., are elements of the quantized latent motion features for the current frame; h_{t−1} consists of temporal priors derived from the motion features preceding the current frame; and ẑ_t includes the hyper priors of the quantized motion features. These features are concatenated and fused using stacked 1×1×1 convolutions. Note that masked convolution is used for the spatial features F̂.

To generate the temporal priors, TUM is applied to the current quantized features F̂_t recurrently using a standard ConvLSTM:

(h_t, c_t) = ConvLSTM(F̂_t, h_{t−1}, c_{t−1}),    (7)

where h_t are the updated temporal priors for the next frame, and c_t is a memory state that controls information flow across multiple time instances (e.g., frames). Note that the STHAM and TUM components are trained with the other components in neuro-Motion in Step 1 and Step 3 described in Sec. III-D2. A minimal sketch of this spatiotemporal prior aggregation is given below.
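As referenced above, the sketch below shows one way the aggregation of Eqs. (6)-(7) could be wired: a ConvLSTM cell serves as the TUM producing the temporal prior h_t, and a small 1×1 fusion network maps the concatenated hyper, spatial and temporal priors to per-element means and scales. The plain convolution standing in for the masked spatial-context layer, the layer widths, and the softplus parameterization are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell used as the temporal updating module (TUM), Eq. (7)."""
    def __init__(self, channels):
        super().__init__()
        self.gates = nn.Conv2d(2 * channels, 4 * channels, kernel_size=3, padding=1)

    def forward(self, x, h, c):
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class SpatioTemporalHyperAggregation(nn.Module):
    """Sketch of STHAM, Eq. (6): fuse hyper, spatial and temporal priors into
    (mu, sigma) for the conditional Gaussian entropy model of the motion features."""
    def __init__(self, channels):
        super().__init__()
        self.tum = ConvLSTMCell(channels)
        # Stand-in for the masked (causal) spatial-context convolution of the paper;
        # a real coder would need proper masking for decodability.
        self.spatial_ctx = nn.Conv2d(channels, channels, kernel_size=5, padding=2)
        self.fuse = nn.Sequential(
            nn.Conv2d(3 * channels, 2 * channels, kernel_size=1), nn.ReLU(),
            nn.Conv2d(2 * channels, 2 * channels, kernel_size=1))

    def forward(self, y_hat, hyper_prior, h_prev, c_prev):
        spatial = self.spatial_ctx(y_hat)
        mu, sigma = self.fuse(
            torch.cat([hyper_prior, spatial, h_prev], dim=1)).chunk(2, dim=1)
        h_next, c_next = self.tum(y_hat, h_prev, c_prev)   # temporal prior for frame t+1
        return mu, F.softplus(sigma), (h_next, c_next)
```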

It is worth pointing out that leveraging temporal correlation for compact motion representation is also widely explored in traditional video coding approaches. For example, motion vector predictions from spatially and temporally co-located neighbors are standardized in HEVC, by which only motion vector differences (after prediction) are encoded. Here, instead of coding the flow feature differences, we use the flow features from the past to help estimate the probability distribution of the flow features in the current frame.

E. Neural Residual Coding

Inter-frame residual coding is another significant module contributing to the overall system efficiency. It is used to compress the temporal prediction error pixels. It affects the efficiency of next-frame prediction since errors usually propagate temporally.

Here we use the VAE architecture in Fig. 2 for encoding the residual r_t. The rate-constrained loss is used:

L = λ · D_2(X_t, (X_t^p + r̂_t)) + R,    (8)

where D_2 is the ℓ2 loss between the residual-compensated frame X_t^p + r̂_t and X_t. neuro-Res is first pretrained using the predicted frames from the pretrained neuro-Motion and MS-MCN, and a loss function as in (8) where the rate R only considers the bits for the residual. Then we refine it jointly with neuro-Motion and MS-MCN, using a loss where R considers the bits for both motion and residual, with two frames.

F. Training Strategy

Progressive Training. Training all NVC components directly is difficult and unreliable, because these modules are interdependent with each other. For example, inter prediction depends on the previously reconstructed frame, and meanwhile a better predicted frame will reduce the residual energy. In our model development, we first train the neuro-Intra for multiple bit rates, using multiple λ values. Then we pretrain and fine-tune the neuro-Motion and MS-MCN through Steps 1-3 described in Sec. III-D2, to minimize a training loss that considers both the prediction error and the rate for motion features. We also pretrain neuro-Res using the predicted frames generated by the neuro-Motion and MS-MCN trained so far, with a loss that considers both the bit rate of the residual and the final reconstruction of the current frame. Finally, we further refine all the modules for inter-coding, including neuro-Motion, MS-MCN and neuro-Res, using a training loss that includes the reconstruction error and the total rate for the motion features and residual features. For each target bit rate, we set the λ for training the inter-coding model to be proportional to the λ used for training the neuro-Intra module.

Training with Multi-frame Loss. One way to train the inter-coding model (including the neuro-Motion, MS-MCN and neuro-Res modules) is to use only two frames as a training sample and use the neuro-Intra module to code the first frame, and use the inter-coding model to code the second frame. This training approach can only learn the intra-to-inter variations, yielding instability and poor reconstruction for the succeeding inter frames distant from the first intra frame. To overcome the quality degradation in future inter-frames, we adopt a multi-frame training strategy.

We first pretrain the inter-coding model using pairs of two successive frames as training samples, where the first frame is encoded and decoded using the neuro-Intra. This step follows the progressive training procedure. We then refine the inter-coding model by using groups of four successive frames as training samples, where the first frame is encoded and decoded by a pretrained neuro-Intra, and each following frame is coded using the inter-coding model, with the decoded frame then used recursively as the reference frame for coding the next frame. We use a loss function that considers the sum of the distortions (in MSE or negative MS-SSIM) over all three P-frames, and the sum of the rates for these frames, so that the model update considers the impact of the propagation of P-frame reconstruction errors. Note that once the inter-coding model is trained, it is applied to all inter frames in a GOP during testing. Although it is possible to improve the performance by training a different inter-coding model for each possible distance between a P-frame and the intra-frame, we choose to train a single model that is used repeatedly for all P-frames, to reduce the implementation complexity.

IV. EXPERIMENTAL STUDIES

A. Datasets and Hyperparameters for Model Training

Datasets. The neuro-Intra is trained on COCO [47] and CLIC [48] with training samples randomly cropped to 256×256×3. The neuro-Motion, MS-MCN and neuro-Res are jointly trained using Vimeo 90k [49] with a sample size of 192×192×3.

Our NVC is evaluated using the standard HEVC test sequences and the ultra video group (UVG) dataset. The HEVC test sequences include different classes, covering contents with a variety of motion, frame rate, resolution and texture; the UVG dataset has seven videos which are often selected for testing video applications.

Loss Function. We have used either MSE or negative MS-SSIM loss for training the intra-coding and inter-coding modules. For pretraining the neuro-Motion and MS-MCN modules, we use the ℓ1 loss on the prediction error instead, which is similar to the mean absolute error (MAE) used in traditional motion estimation. We do not use MS-SSIM loss on the predicted frame because MS-SSIM mainly cares about structural similarity and may ignore the background noise to some extent. Through experiments, we have found that using MS-SSIM loss for neuro-Motion can lead to temporal error accumulation, which not only increases the bit consumption but also potentially leads to unreliable training.

Hyperparameters and Platform. The initial learning rate (LR) is set to 10e-4 and is halved every 10 epochs, and the final models are obtained using a LR of 10e-5 for both pretraining and overall training. We apply distributed training on 4 GPUs (Titan Xp) for 3 days for each bit rate model.
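Returning to the multi-frame loss of Sec. III-F, the sketch below accumulates distortion and rate over a short window of P-frames while reusing each reconstruction as the next reference. It is an illustrative training-step outline using the same hypothetical module interfaces as the earlier pipeline sketch, not the authors' training code.

```python
import torch.nn.functional as F

def multi_frame_loss(frames, neuro_intra, neuro_motion, ms_mcn, neuro_res, lam):
    """Sum rate-distortion cost over successive P-frames (Sec. III-F).
    frames[0] is intra-coded; each reconstruction is reused as the reference,
    so gradients see the propagation of P-frame reconstruction errors."""
    ref, _ = neuro_intra(frames[0])           # intra reference (its rate not optimized here)
    ref = ref.detach()                        # assumption: neuro-Intra is pretrained and frozen
    total_distortion, total_rate = 0.0, 0.0

    for frame in frames[1:]:                  # e.g., three P-frames in a 4-frame group
        mcfs, motion_bits = neuro_motion(frame, ref)
        predicted = ms_mcn(ref, mcfs)
        res_hat, res_bits = neuro_res(frame - predicted)
        recon = predicted + res_hat

        total_distortion += F.mse_loss(recon, frame)   # or negative MS-SSIM
        total_rate += motion_bits + res_bits
        ref = recon                            # recursive reference for the next P-frame

    return lam * total_distortion + total_rate
```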

Fig. 7: PSNR vs. Rate Performance. Each panel plots PSNR (dB) against bits per pixel (Bpp) for NVC, Lu_CVPR2019, H.265 and H.264 (plus Wu_ECCV2018 on UVG), over (a) the UVG dataset, (b) HEVC Class B, (c) HEVC Class C, (d) HEVC Class D, (e) HEVC Class E, and (f) the average over all test sets.

Evaluation Criteria. We apply the same low-delay coding setting as DVC in [8] for our method and for the traditional H.264/AVC and HEVC, for a fair comparison. We encode 100 frames and use a GOP of 10 on the HEVC test sequences, and 600 frames with a GOP of 12 on the UVG dataset. Both PSNR and MS-SSIM results are offered to understand the efficiency of NVC.

Generate Models for Different Target Rates. To generate a model for a particular bit rate, we first train the neuro-Intra using a certain λ value, denoted by λ_intra. Then we train the inter-frame models (neuro-Motion, MS-MCN and neuro-Res) using a proportionally reduced λ value, λ_inter = λ_intra/4. This simple way of setting the λ_intra and λ_inter values yielded good results. We chose not to further optimize λ_inter for a given λ_intra.

B. Performance Comparison

Rate-distortion Performance. We compare the coding performance of different methods in Figs. 7 and 8 using the respective PSNR and MS-SSIM measures, across the HEVC and UVG test sequences. Note that when we report PSNR or MS-SSIM values, the models are trained using PSNR and MS-SSIM, respectively, as the distortion metric. In Table III, by setting the same anchor using H.264/AVC, our NVC presents 35% BD-Rate gains when the distortion is measured by PSNR, while HEVC offers 30% gains. Gains are even larger if the distortion is measured by MS-SSIM: in this case, NVC can achieve 50% improvement, while HEVC is around 25%.

Wu et al. [36] proposed an interpolation-based video coding framework and could not get better performance than H.265/HEVC. Our NVC performs significantly better than DVC [8], which has 22.26% and 23.08% BD-Rate reductions under the PSNR and MS-SSIM metrics, respectively. From Fig. 7 and Fig. 8, DVC [8] has mainly improved the coding efficiency against HEVC at high bit rates. However, DVC is not competitive at low bit rates (e.g., having performance even worse than H.264/AVC at some rate regimes). We have also observed that DVC's performance varies for different test sequences. An improved version of DVC, known as DVC Pro, has been recently reported, which has shown state-of-the-art performance [9] by using [19] for intra and residual coding and λ tuning. NVC is slightly better than DVC Pro in both PSNR and MS-SSIM, at 34.5% and 45.88%, respectively.

Visual Comparison. We provide a visual quality comparison with H.264/AVC and H.265/HEVC in Fig. 9. Generally, NVC leads to a higher-quality reconstruction at a slightly lower bit rate. For example, for the "RaceHorses" sequence, which has nontranslational motion and a complex background, NVC uses 7% fewer bits for more than 1.5 dB PSNR improvement compared with H.264/AVC. For other cases, our method also shows robust improvement. Traditional codecs usually suffer from blocky artifacts and motion-induced noise close to object edges. One can clearly observe block partition boundaries with severe pixel discontinuity in frames reconstructed by H.264/AVC. Our results have much less visible noise and artifacts.

C. Ablation Study

This section examines modular components in NVC to further understand its capacity for application.

Fig. 8: MS-SSIM vs. Rate Performance. NVC shows significant gains for all the testing videos. MS-SSIM is usually more correlated with the perceptual quality than PSNR, especially at low bit rates. Each panel plots MS-SSIM against bits per pixel (Bpp) for NVC, Lu_CVPR2019, H.265 and H.264 (plus Wu_ECCV2018 on UVG), over (a) the UVG dataset, (b) HEVC Class B, (c) HEVC Class C, (d) HEVC Class D, (e) HEVC Class E, and (f) the overall average.

TABLE III: BD-Rate Gains of NVC, HEVC and DVC against the H.264/AVC anchor.

            H.265/HEVC                        DVC                               NVC
            PSNR             MS-SSIM          PSNR             MS-SSIM          PSNR             MS-SSIM
Sequences   BDBR     BD-(D)  BDBR     BD-(D)  BDBR     BD-(D)  BDBR     BD-(D)  BDBR     BD-(D)  BDBR     BD-(D)
ClassB      -32.03%  0.78    -27.67%  0.0046  -27.92%  0.72    -22.56%  0.0049  -45.66%  1.21    -54.90%  0.0114
ClassC      -20.88%  0.91    -19.57%  0.0054  -3.53%   0.13    -24.89%  0.0081  -17.82%  0.73    -43.11%  0.0133
ClassD      -12.39%  0.57    -9.68%   0.0023  -6.20%   0.26    -22.44%  0.0067  -15.53%  0.70    -43.64%  0.0123
ClassE      -36.45%  0.99    -30.82%  0.0018  -35.94%  1.17    -29.08%  0.0027  -49.81%  1.70    -58.63%  0.0048
UVG         -48.53%  1.00    -37.5%   0.0056  -37.74%  1.00    -16.46%  0.0032  -48.91%  1.24    -53.87%  0.0100
Average     -30.05%  0.85    -25.04%  0.0039  -22.26%  0.65    -23.08%  0.0051  -35.54%  1.11    -50.83%  0.0103

Spatiotemporal Context Modeling. To evaluate the gain from using temporal priors generated by the ConvLSTM for the entropy coding of the motion features, we have also trained a model in which the PA engine in neuro-Motion does not use the temporal priors. Table IV shows that a 2% to 7% rate increase is incurred if the temporal priors are not used. This reveals that temporal priors help to make the probability prediction more accurate, leading to less bit consumption for compressing the motion features.

Generally, the bits consumed by motion information vary across different content and total target rates. More saving is reported for low bit rates, and for motion-intensive content. In these cases, motion bits usually occupy a higher percentage of the total rate. For stationary content, such as HEVC Class E, motion bits contribute a smaller percentage, thus the gain from using temporal priors is less significant.

TABLE IV: Efficiency of Temporal Priors for Coding Motion Features. The entries in the 2nd and 3rd columns are the BD-rate reductions.

Sequences      NVC        W/O temporal priors   Bits saving
ClassB         -54.90%    -48.77%               6.13%
ClassC         -43.11%    -39.17%               3.94%
ClassD         -43.64%    -39.23%               4.41%
ClassE         -58.63%    -56.27%               2.36%
UVG dataset    -53.87%    -49.14%               4.73%
Average        -50.83%    -46.52%               4.31%

Comparison with Cascaded Denoising Networks. Linear warping often brings artifacts (e.g., ghosting edges), especially when flow estimation is not accurate and occlusion happens. To remove such artifacts, a denoising network is often added after motion compensated warping, as shown in Fig. 4a and 4b. Our MS-MCN, instead, relies on multiscale motion fields to generate the predicted frame. As a comparison, we also trained a U-net-like denoising network on the warped frames based on the single-scale decoded motion field. As shown in Fig. 10, MS-MCN achieves 0.2 to 0.3 dB PSNR gains over using a cascaded denoising network across a large bit rate range, for various content and for overall performance.

Temporal Stability with Multi-frame Training. Compression errors (e.g., noise) often propagate from one frame to another due to the predictive coding structure. It is critical to limit the temporal quality degradation during training. To evaluate the gain from multi-frame training, we compare the performance of the NVC model trained by minimizing reconstruction errors for only one P-frame (single-frame training)

with the model further refined by minimizing the reconstruction loss over multiple consecutive P-frames. We have varied the number of total frames used during training, and found that using 4 frames reaches a good compromise. As shown in Fig. 11, multi-frame training can well capture temporal quality variations from intra to inter, and from inter to inter, leading to a more generalized model with consistent performance for all the frames in a GOP, with slower quality degradation and improved stability.

Fig. 9: Visual Comparison. Reconstructed frames of NVC, HEVC and H.264/AVC. NVC has fewer blocky artifacts and less visible noise, and provides better quality at a lower bit rate. Per-example BPP/PSNR: NVC 0.1274/28.07 dB, H.265/HEVC 0.1347/27.61 dB, H.264/AVC 0.1353/26.57 dB; NVC 0.0634/34.63 dB, H.265/HEVC 0.0627/33.88 dB, H.264/AVC 0.0687/32.57 dB; NVC 0.0364/36.82 dB, H.265/HEVC 0.0368/36.24 dB, H.264/AVC 0.0395/35.15 dB.

V. CONCLUSION

We have developed an end-to-end deep neural video coding framework (NVC) that can compactly represent the intra-pixel, inter-motion and inter-residual information, respectively. We have shown that the pyramid decoder in neuro-Motion and the multiscale motion compensation network (MS-MCN) together can significantly improve inter-frame prediction, compared to the more conventional single-scale motion compensation. Furthermore, adding temporal context can lead to more efficient entropy coding of the motion information than using only spatial context and hyper priors. We further demonstrate that using a multi-frame training loss can effectively mitigate the temporal error propagation. We evaluate the NVC by PSNR and MS-SSIM respectively, and compare its performance both with standard coding methods including H.264/AVC and H.265/HEVC as well as with learnt video coders including Wu et al. [36], DVC [8] and DVC Pro [9].

Fig. 10: Comparison with a Cascaded Denoising Network. The proposed MS-MCN achieves better rate-distortion performance than a denoising network cascaded after single-scale motion compensation.
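For context, the cascaded baseline compared in Fig. 10 can be sketched as backward warping with the decoded single-scale flow followed by a small denoising CNN. The PyTorch sketch below is only illustrative: the layer counts, channel widths, and the assumption that the flow stores (dx, dy) displacements in pixels are ours, not the paper's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def backward_warp(ref, flow):
    """Warp the reference frame with a dense flow field; flow[:, 0] is the
    horizontal and flow[:, 1] the vertical displacement in pixels (assumed)."""
    _, _, h, w = ref.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys)).float().to(ref.device)   # (2, H, W) pixel grid
    coords = base.unsqueeze(0) + flow                      # displaced sampling positions
    # grid_sample expects coordinates normalized to [-1, 1], ordered (x, y).
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                   # (N, H, W, 2)
    return F.grid_sample(ref, grid, mode="bilinear", align_corners=True)

class DenoiseNet(nn.Module):
    """Tiny stand-in for the U-Net-like network that cleans warping artifacts."""
    def __init__(self, ch=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 3, 3, padding=1))

    def forward(self, warped):
        return warped + self.body(warped)  # residual correction of the warped frame

# Single-scale baseline: warp the previous reconstruction, then denoise.
ref = torch.rand(1, 3, 64, 64)             # previous reconstructed frame
flow = torch.zeros(1, 2, 64, 64)           # decoded motion field (zeros here)
prediction = DenoiseNet()(backward_warp(ref, flow))
```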

Fig. 11: Benefit from Multi-frame Training. Two-frame training can only learn the temporal variations from intra to inter reconstruction; multi-frame training captures both intra-to-inter and inter-to-inter variations, yielding improved stability and performance. (Two panels plot PSNR and BPP against the number of frames N, comparing 2-frame and 4-frame training.)

The proposed DNN-based model can be further improved by engineering better stacked CNNs than the current implementation. Context-adaptive probability models could also be improved; for example, recent exploration in [30] has demonstrated that a weighted GMM can further improve the entropy coding efficiency, and this can be easily incorporated into our PA engine. Reference frame selection is another direction for overall efficiency optimization, by which we can embed and aggregate the most appropriate information to improve the inter-coding efficiency [50].

The H.264/AVC, HEVC, and even VVC are engineering masterpieces for video coding. Their λ adaptation, rate control, etc., can definitely be borrowed to improve the NVC. Furthermore, how to make NVC practically applicable is also worth deeper investigation.

To benefit the research community, all relevant materials will be made publicly available soon at https://njuvision.github.io/Neural-Video-Coding.

VI. ACKNOWLEDGEMENT

Our gratitude is directed to the anonymous reviewers for their suggestions and efforts.

REFERENCES

[1] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560–576, 2003.
[2] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, "Overview of the High Efficiency Video Coding (HEVC) standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649–1668, 2012.
[3] B. Bross, J. Chen, S. Liu, and Y.-K. Wang, "Versatile video coding (draft 7)," Doc. JVET-P2001, Oct. 2019.
[4] J.-R. Ohm, G. J. Sullivan, H. Schwarz, T. K. Tan, and T. Wiegand, "Comparison of the coding efficiency of video coding standards, including High Efficiency Video Coding (HEVC)," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1669–1684, 2012.
[5] J. Dean, "The deep learning revolution and its implications for computer architecture and chip design," in 2020 IEEE International Solid-State Circuits Conference (ISSCC). IEEE, 2020, pp. 8–14.
[6] Z. Wang, E. P. Simoncelli, and A. C. Bovik, "Multiscale structural similarity for image quality assessment," in The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, vol. 2. IEEE, 2003, pp. 1398–1402.
[7] G. Bjontegaard, "Calculation of average PSNR differences between RD-curves," VCEG-M33, 2001.
[8] G. Lu, W. Ouyang, D. Xu, X. Zhang, C. Cai, and Z. Gao, "DVC: An end-to-end deep video compression framework," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[9] G. Lu, X. Zhang, W. Ouyang, L. Chen, Z. Gao, and D. Xu, "An end-to-end learning framework for video compression," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[10] H. Liu, L. Huang, M. Lu, T. Chen, Z. Ma et al., "Learned video compression via joint spatial-temporal correlation exploration," in AAAI, 2020.

[11] D. Liu, Y. Li, J. Lin, H. Li, and F. Wu, "Deep learning-based video coding: A review and a case study," ACM Computing Surveys (CSUR), vol. 53, no. 1, pp. 1–35, 2020.
[12] S. Ma, X. Zhang, C. Jia, Z. Zhao, S. Wang, and S. Wang, "Image and video compression with neural networks: A review," IEEE Transactions on Circuits and Systems for Video Technology, 2019.
[13] G. Toderici, S. M. O'Malley, S. J. Hwang, D. Vincent, D. Minnen, S. Baluja, M. Covell, and R. Sukthankar, "Variable rate image compression with recurrent neural networks," in Proceedings of the International Conference on Learning Representations, 2016.
[14] G. Toderici, D. Vincent, N. Johnston, S. Jin Hwang, D. Minnen, J. Shor, and M. Covell, "Full resolution image compression with recurrent neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5306–5314.
[15] N. Johnston, D. Vincent, D. Minnen, M. Covell, S. Singh, T. Chinen, S. Jin Hwang, J. Shor, and G. Toderici, "Improved lossy image compression with priming and spatially adaptive bit rates for recurrent networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4385–4393.
[16] J. Ballé, V. Laparra, and E. P. Simoncelli, "End-to-end optimized image compression," arXiv preprint arXiv:1611.01704, 2016.
[17] J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, "Variational image compression with a scale hyperprior," arXiv preprint arXiv:1802.01436, 2018.
[18] F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. Van Gool, "Conditional probability models for deep image compression," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, no. 2, 2018, p. 3.
[19] D. Minnen, J. Ballé, and G. D. Toderici, "Joint autoregressive and hierarchical priors for learned image compression," in Advances in Neural Information Processing Systems, 2018, pp. 10794–10803.
[20] Y. Choi, M. El-Khamy, and J. Lee, "Variable rate deep image compression with a conditional autoencoder," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 3146–3154.
[21] T. Chen, H. Liu, Z. Ma, Q. Shen, X. Cao, and Y. Wang, "Neural image compression via non-local attention optimization and improved context modeling," arXiv preprint arXiv:1910.06244, 2019.
[22] M. Li, W. Zuo, S. Gu, D. Zhao, and D. Zhang, "Learning convolutional networks for content-weighted image compression," arXiv preprint arXiv:1703.10553, 2017.
[23] J. Ballé, "Efficient nonlinear transforms for lossy image compression," arXiv preprint arXiv:1802.00847, 2018.
[24] J. Lee, S. Cho, and S.-K. Beack, "Context-adaptive entropy model for end-to-end optimized image compression," arXiv preprint arXiv:1809.10452, 2018.
[25] J. Klopp, Y.-C. F. Wang, S.-Y. Chien, and L.-G. Chen, "Learning a code-space predictor by exploiting intra-image-dependencies," in BMVC, 2018, p. 124.
[26] Y. Hu, W. Yang, and J. Liu, "Coarse-to-fine hyper-prior modeling for learned image compression," in Proc. AAAI Conf. Artif. Intell., 2020, pp. 1–8.
[27] A. v. d. Oord, N. Kalchbrenner, and K. Kavukcuoglu, "Pixel recurrent neural networks," arXiv preprint arXiv:1601.06759, 2016.
[28] S. Reed, A. van den Oord, N. Kalchbrenner, S. G. Colmenarejo, Z. Wang, Y. Chen, D. Belov, and N. De Freitas, "Parallel multiscale autoregressive density estimation," in Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 2017, pp. 2912–2921.
[29] Z. Cheng, H. Sun, M. Takeuchi, and J. Katto, "Learned image compression with discretized gaussian mixture likelihoods and attention modules," arXiv preprint arXiv:2001.01568, 2020.
[30] J. Lee, S. Cho, and M. Kim, "An end-to-end joint learning scheme of image compression and quality enhancement with improved entropy minimization," arXiv preprint, 2019.
[31] O. Rippel and L. Bourdev, "Real-time adaptive image compression," arXiv preprint arXiv:1705.05823, 2017.
[32] C. Huang, H. Liu, T. Chen, S. Pu, Q. Shen, and Z. Ma, "Extreme image compression via multiscale autoencoders with generative adversarial optimization," in Proc. of IEEE VCIP, 2019.
[33] E. Agustsson, M. Tschannen, F. Mentzer, R. Timofte, and L. V. Gool, "Generative adversarial networks for extreme learned image compression," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 221–231.
[34] H. Liu, T. Chen, Q. Shen, T. Yue, and Z. Ma, "Deep image compression via end-to-end learning," in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2018.
[35] T. Chen, H. Liu, Q. Shen, T. Yue, X. Cao, and Z. Ma, "DeepCoder: A deep neural network based video compression," in Visual Communications and Image Processing (VCIP), 2017 IEEE. IEEE, 2017, pp. 1–4.
[36] C.-Y. Wu, N. Singhal, and P. Krähenbühl, "Video compression through image interpolation," in ECCV, 2018.
[37] A. Djelouah, J. Campos, S. Schaub-Meyer, and C. Schroers, "Neural inter-frame compression for video coding," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 6421–6429.
[38] R. Yang, F. Mentzer, L. V. Gool, and R. Timofte, "Learning for video compression with hierarchical quality and recurrent enhancement," in The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[39] O. Rippel, S. Nair, C. Lew, S. Branson, A. G. Anderson, and L. Bourdev, "Learned video compression," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 3454–3463.
[40] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[41] M. Li, W. Zuo, S. Gu, D. Zhao, and D. Zhang, "Learning convolutional networks for content-weighted image compression," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3214–3223.
[42] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, 2012.
[43] Y. Zhang, K. Li, K. Li, B. Zhong, and Y. Fu, "Residual non-local attention networks for image restoration," arXiv preprint arXiv:1903.10082, 2019.
[44] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, "FlowNet 2.0: Evolution of optical flow estimation with deep networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2462–2470.
[45] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz, "PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8934–8943.
[46] M. Jaderberg, K. Simonyan, A. Zisserman et al., "Spatial transformer networks," in Advances in Neural Information Processing Systems, 2015, pp. 2017–2025.
[47] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in European Conference on Computer Vision. Springer, 2014, pp. 740–755.
[48] Challenge on Learned Image Compression 2018. [Online]. Available: http://www.compression.cc/challenge/
[49] T. Xue, B. Chen, J. Wu, D. Wei, and W. T. Freeman, "Video enhancement with task-oriented flow," International Journal of Computer Vision, vol. 127, no. 8, pp. 1106–1125, 2019.
[50] H. Li, B. Li, and J. Xu, "Rate-distortion optimized reference picture management for high efficiency video coding," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1844–1857, 2012.