Neural Rate Control for Video Encoding Using Imitation Learning
Hongzi Mao†‡, Chenjie Gu†, Miaosen Wang†, Angie Chen†, Nevena Lazic†, Nir Levine†, Derek Pang*, Rene Claus*, Marisabel Hechtman*, Ching-Han Chiang*, Cheng Chen*, Jingning Han*
†DeepMind *Google ‡MIT CSAIL
‡Work done during the internship at DeepMind. Correspondence to Chenjie Gu <[email protected]>.
arXiv:2012.05339v1 [cs.LG] 9 Dec 2020

Abstract

In modern video encoders, rate control is a critical component and has been heavily engineered. It decides how many bits to spend to encode each frame, in order to optimize the rate-distortion trade-off over all video frames. This is a challenging constrained planning problem because of the complex dependency among decisions for different video frames and the bitrate constraint defined at the end of the episode.

We formulate the rate control problem as a Partially Observable Markov Decision Process (POMDP), and apply imitation learning to learn a neural rate control policy. We demonstrate that by learning from optimal video encoding trajectories obtained through evolution strategies, our learned policy achieves better encoding efficiency and has minimal constraint violation. In addition to imitating the optimal actions, we find that additional auxiliary losses, data augmentation/refinement and inference-time policy improvements are critical for learning a good rate control policy. We evaluate the learned policy against the rate control policy in libvpx, a widely adopted open source VP9 codec library, in the two-pass variable bitrate (VBR) mode. We show that over a diverse set of real-world videos, our learned policy achieves 8.5% median bitrate reduction without sacrificing video quality.

1. Introduction

In recent years, video traffic has kept growing and now constitutes over 75% of Internet traffic [5]. Efficient video encoding is key to improving video quality and alleviating backbone Internet traffic pressure for video tasks such as video on demand (VOD) and live streaming decoding.

In modern video encoders, rate control is a critical component that directly modulates the trade-off between rate (video size) and distortion (video quality). The rate control module manages the distribution of available bandwidth, usually constrained by the network condition. It directly determines how many bits to spend to encode each video frame, by assigning a quantization parameter (QP). The goal of rate control is to maximize the encoding efficiency (often measured by the Bjontegaard delta rate) while keeping the bitrate under a user-specified target bitrate.

Rate control is a constrained planning problem and can be formulated as a Partially Observable Markov Decision Process. For a particular frame, the QP assignment decision depends on (1) the frame's spatial and temporal complexity, (2) the previously encoded frames this frame refers to, and (3) the complexity of future frames. In addition, the bitrate constraint imposes an episodic constraint on the planning problem. The inter-dependency of encoding decisions and the episodic constraint make it challenging to engineer optimal rate control algorithms.

To evaluate the potential of improving the rate control policy, we conducted an experiment to measure how much we may improve the Bjontegaard delta rate (BD-rate) compared to the rate control policy in libvpx, a widely adopted open source VP9 codec for adaptive video streaming services, such as YouTube and Netflix. We collected 1600 videos spanning diverse content types and spatial/temporal complexities. For every video, we applied an evolution strategy algorithm which maximizes the peak signal-to-noise ratio (PSNR) under libvpx's two-pass variable-bitrate (VBR) mode. We observe an average 13% improvement in BD-rate, indicating a large gap between libvpx's rate control policy and an optimal rate control policy.

In this paper, we use imitation learning to train a neural rate control policy and demonstrate its effectiveness in improving encoding efficiency. We first generate a teacher dataset by searching for optimal QP sequences for individual videos. We then train a neural network by imitating these optimal QP sequences. However, imitating the optimal QP sequences by itself fails to learn a good rate control policy. In order for the policy to generalize to many videos and to satisfy the bitrate constraint, we design additional auxiliary losses and customized hindsight experience replay to train the policy. During inference, we use a truncation trick, a simple technique to prune spurious action values and improve encoding efficiency. Finally, we augment the learned policy with feedback control, in order to precisely hit the target bitrate, and therefore minimize constraint violations.

Although the method is developed with the goal of solving the rate control problem, it is general and is applicable to other constrained planning problems and imitation learning methods. For example, the feedback control augmentation defines a policy improvement operator for a class of constrained planning problems, and may be used in reinforcement learning.

We evaluate and compare our learned neural rate control policy against the default rate control policy in libvpx's two-pass VBR mode. When evaluated on 200 real-world videos never seen during training, our learned policy achieves an 8.5% median reduction in bitrate while maintaining the same video quality.

In summary, our contributions include:

1. We formulate the rate control problem as a Partially Observable Markov Decision Process (POMDP) (§2);
2. We develop an imitation learning method, including teacher dataset generation, auxiliary losses, model architecture and a customized hindsight experience replay method, for learning a rate control policy (§3);
3. We develop a policy improvement operator that augments a learned policy with feedback control, and show that it significantly reduces the constraint violation (§3.5);
4. We show that our learned rate control policy achieves 8.5% median reduction in bitrate while maintaining the same video quality (measured in PSNR) (§4).

2. Problem formulation and challenges

2.1. Rate control in video encoding

Without loss of generality, we focus on improving rate control in libvpx, a widely adopted open source implementation of VP9. The same methodology applies to other modern video encoders. As illustrated in Figure 1a, the rate controller makes encoding decisions of video frames sequentially. The rate controller regulates the trade-off between rate and distortion by assigning a quantization parameter (QP) to each frame. QP is an integer in the range [0, 255]. QP is monotonically mapped to a quantization step size, which is used to digitize the prediction residual for entropy coding. Smaller quantization step sizes lead to smaller quantization error but also higher bit usage. Smaller quantization error means smaller reconstruction error, measured by mean squared error (MSE).

VP9 has multiple rate control modes. In this paper, we focus on the "two-pass, variable bitrate (VBR)" mode [27] (see Supplementary Materials A for more details). Mathematically, VBR corresponds to a constrained optimization problem (for a particular video):

\begin{equation}
\max_{\pi} \; \mathrm{PSNR}(\pi) \quad \text{subject to} \quad \mathrm{bitrate}(\pi) \le \mathrm{bitrate}_{\mathrm{target}}, \tag{1}
\end{equation}

where π is a policy that decides the QP for each video frame during the second-pass encoding. PSNR (Peak Signal-to-Noise Ratio) measures the video quality (a.k.a., distortion) and is proportional to the log-sum of the mean squared error of all show frames (excluding hidden alternate reference frames). Bitrate is the sum of bits of all frames divided by the duration of the video. Note that PSNR in Equation (1) can be replaced with other distortion metrics such as SSIM [28] and VMAF [20].

Metrics. Bjontegaard delta rate (BD-rate) [3] is a key metric to compare rate control policies. Given the rate-distortion (RD) curves (bitrate-PSNR in our case) of two policies, BD-rate computes the average bitrate difference in percentage across the overlapped PSNR range, and therefore measures the average bitrate reduction.

When we have only a single point on the RD curve of one policy, BD-rate reduces to the projected bitrate difference, which is the difference between the bitrate of one policy and the interpolated bitrate of the other policy at the same PSNR, as illustrated in Figure 1b. Similarly, we define the projected PSNR difference as the difference between the PSNR of one policy and the interpolated PSNR of the other policy at the same bitrate.

2.2. Partially Observable Markov Decision Process (POMDP) formulation

We formulate the rate control problem in Equation (1) as a POMDP. Each step of the POMDP corresponds to the encoding of one video frame. At each step, the action is the QP, an integer in [0, 255], for the corresponding video frame. The state of the underlying MDP corresponds to libvpx's internal state, which is updated as new frames are encoded. The state transition is deterministic and is determined by how libvpx encodes a frame. We expose a subset of libvpx's internal state as the observation, including (1) first-pass statistics (e.g., noise energy, motion vector statistics) of all video frames, which provide a rough estimate of the complexity of all frames and help the rate controller plan for future frames, (2) features of the frame to be encoded (e.g., frame type and frame index) and (3) features of previously encoded frames (e.g., bits/size of the encoded frame and mean squared error between the encoded frame and the raw frame). Supplementary Materials B enumerates all the observations we used in this work. Note that because

[Figure 1, panel (a): Rate control via quantization parameters (QPs); each frame i is encoded with QP_i and yields Bits_i / MSE_i. Panel (b): Projected PSNR and bitrate.]
Figure 1: Illustrative example of the rate control problem and the metrics for encoding efficiency.