Arxiv:2104.02322V1 [Cs.CV] 6 Apr 2021 ﬁc

Efficient Video Compression via Content-Adaptive Super-Resolution Mehrdad Khani, Vibhaalakshmi Sivaraman, Mohammad Alizadeh MIT CSAIL fkhani,vibhaa,[email protected] Abstract H.265 codec improves upon these same ideas by incorpo- rating variable block sizes [7]. Recent efforts [31, 10, 35] to Video compression is a critical component of Internet improve video compression have turned to deep learning to video delivery. Recent work has shown that deep learning capture the complex relationships between the components techniques can rival or outperform human-designed algo- of a video compression pipeline. These approaches have rithms, but these methods are significantly less compute and had moderate success at outperforming current codecs, but power-efficient than existing codecs. This paper presents a they are much less compute- and power-efficient. new approach that augments existing codecs with a small, We present SRVC, a new approach that combines ex- content-adaptive super-resolution model that significantly isting compression algorithms with a lightweight, content- boosts video quality. Our method, SRVC, encodes video adaptive super-resolution (SR) neural network that sig- into two bitstreams: (i) a content stream, produced by com- nificantly boosts performance with low computation cost. pressing downsampled low-resolution video with the exist- SRVC compresses the input video into two bitstreams: a ing codec, (ii) a model stream, which encodes periodic up- content stream and a model stream, each with a separate dates to a lightweight super-resolution neural network cus- bitrate that can be controlled independently of the other tomized for short segments of the video. SRVC decodes stream. The content stream relies on a standard codec such the video by passing the decompressed low-resolution video as H.265 to transmit low-resolution frames at a low bitrate. frames through the (time-varying) super-resolution model The model stream encodes a time-varying SR neural net- to reconstruct high-resolution video frames. Our results work, which the decoder uses to boost the quality of decom- show that to achieve the same PSNR, SRVC requires 16% pressed frames derived from the content stream. SRVC uses of the bits-per-pixel of H.265 in slow mode, and 2% of the the model stream to specialize the SR network for short seg- bits-per-pixel of DVC, a recent deep learning-based video ments of video dynamically (e.g., every few seconds). This compression scheme. SRVC runs at 90 frames per second makes it possible to use a small SR model, consisting of just on a NVIDIA V100 GPU. a few convolutional and upsampling layers. Applying SR to improve the quality of low-bitrate compressed video isn’t new. AV1 [15], for instance, has a mode 1. Introduction (typically used in low-bitrate settings) that encodes frames Recent years have seen a sharp increase in video traf- at low resolution and applies an upsampler at the decoder. While AV1 relies on standard bicubic [24] or bilinear [47] arXiv:2104.02322v1 [cs.CV] 6 Apr 2021 fic. It is predicted that by 2022, video will account for more than 80% of all Internet traffic [6,1]. In fact, video interpolation for upsampling, recent proposals have shown content consumption increased so much during the initial that learned SR models can significantly improve the qual- months of the pandemic that content providers like Netflix ity of these techniques [30, 19]. and Youtube were forced to throttle video-streaming quality However, these approaches rely on generic SR neural to cope with the surge [2,3]. Hence efficient video com- networks [42, 48, 23]) that are designed to generalize across pression to reduce bandwidth consumption without com- a wide range of input images. These models are large (e.g., promising on quality is more critical than ever. 10s of millions of parameters) and can typically reconstruct While the demand for video content has increased over only a few frames per second even on high-end GPUs [28]. the years, the techniques used to compress and transmit But in many usecases, generalization isn’t necessary. In par- video have largely remained the same. Ideas such as ap- ticular, we often have access to the video being compressed plying Discrete Cosine Transforms (DCTs) to video blocks ahead of time (e.g, for on-demand video). Our goal is to and computing motion vectors [43, 17] , which were de- dramatically reduce the complexity of the SR model in such veloped decades ago, are still in use today. Even the latest applications by specializing it (in a sense, overfitting it) to 1 (b) Original (c) H.265 1080p (slow) (d) H.264 1080p (slow) 1.2Mbps 1.2Mbps Frame from “Sita Sings the Blues” [9] (e) H.265 480p (f) H.265 480p (g) H.265 480p (h) Deep Video Comp. (i) SRVC (ours) + Bicubic Upsamp. + Generic SR + One-shot Custom. (DVC) [31] 0.20Mbps 0.12Mbps 0.12Mbps 0.12Mbps 4.97Mbps Figure 1: Image patches produced by different schemes. short segments of video. 100s of milliseconds at the same resolution. To make this idea work, we must ensure that the over- • We show that to achieve similar PSNR, SRVC requires head of the model stream is low. Even with our small SR 16% of the bits-per-pixel consumed by H.265 in slow 1 model (with 2.22M parameters), updating the entire model mode , and 2% of DVC’s bits-per-pixel. SRVC’s qual- every few seconds would consume a high bitrate, undoing ity improvement extends across all frames in the video. any compression benefit from lowering the resolution of the Figure1 shows visual examples comparing the SRVC content stream. SRVC tackles this challenge by carefully with these baseline approaches at competitive or higher bi- selecting a small fraction (e.g., 1%) of parameters to update trates. Our datasets and code are available at https: for each segment of the video, using a “gradient-guided” //github.com/AdaptiveVC/SRVC.git coordinate-descent [45] strategy that identifies parameters that have the most impact on model quality. Our primary 2. Related Work finding is that a SR neural network adapted in this man- Standard codecs. Prior work has widely studied video ner over the course of a video can provide such a boost to encoder/decoders (codecs) such as H.264/H.265 [37, 39], quality, that including a model stream along with the com- VP8/VP9 [12, 34], and AV1 [15]. These codecs rely on pressed video is more efficient than allocating the entire bit- hand-designed algorithms that exploit the temporal and spa- stream to content. tial redundancies in video pixels, but cannot adapt these In summary, we make the following contributions: algorithms to specific videos. Existing codecs are partic- • We propose a novel dual-stream approach to video ularly effective when used in slow mode for offline com- streaming that combines a time-varying SR model with pression. Nevertheless, SRVC’s combination of a low- compressed low-resolution video produced by a stan- resolution H.265 stream with a content-adaptive SR model dard codec. We develop a coordinate descent method outperforms H.265 at high resolution, even in its slow to update only a fraction of model parameters for each mode. Some codecs like AV1 provide the option to en- few-second segment of video with low overhead. code at low resolution and upsample using bicubic interpo- • We propose a lightweight model with spatially- lation [24]. But, as we show in §4, SRVC’s learned model adaptive kernels, designed specifically for content- provides a much larger improvement in video quality com- specific SR. Our model runs in real-time, taking only pared to bicubic interpolation. 11 ms (90 fps) to generate a 1080p frame on an 1To the authors’ knowledge, this is the first learning-based scheme that NVIDIA V100 GPU. In comparison, DVC [31] takes compares to H.265 on its slow mode Encoder Decoder Downsampler ...01100110... Content Decoder Super-Resolution Content Stream (H.265) Upsampler Original Frame Decoded Frame Content Encoder (H.265) ...10101001... Model Encoder Model Decoder Model Stream Content Decoder (H.265) Diff2 Diff1 Initialization Values2, Values1, Values0, Indices2 Indices1 Indices0 Figure 2: SRVC video compression pipeline overview. Super resolution. Recent work on single-image SR [48, 3. Methods 23] and video SR [30, 19, 21, 27] has produced a variety of CNN-based methods that outperform classic interpola- Figure2 shows an overview of SRVC. SRVC compresses video into two bitsreams: tion methods such as bilinear [47] and bicubic [24]. Ac- celerating these SR models has been of interest particularly 1. Content stream: The encoder downsamples the in- due to their high computational complexity at higher reso- put video frames by a factor of k in each dimension lutions [49]. Our design adopts the idea of subpixel convo- (e.g., k=4) to generate low-resolution (LR) frames us- lution [38], keeping the spatial dimension of all layers iden- ing area-based downsampling. It then encodes the tical to the low-resolution input until the final layer. Fusing LR frames using an off-the-shelf video codec to gen- the information from several video frames has been shown erate the content bitstream (our implementation uses to further improve single-image SR models [41]. However, H.265 [7]). The decoder decompresses the content to isolate the effects of using a content-adaptive SR model, stream using the same codec to reconstruct the LR we focus on single-image SR in this work. frames. Since video codecs are not lossless, the LR Learned video compression. End-to-end video compres- frames at the decoder will not exactly match the LR sion techniques [35, 32, 10] follow a compression pipeline frames at the encoder.

Load more