An Optimized H.266/VVC Software Decoder on Mobile Platform
Total Page:16
File Type:pdf, Size:1020Kb
An Optimized H.266/VVC Software Decoder On Mobile Platform Yiming Li, Shan Liu, Yu Chen, Yushan Zheng, Sijia Chen, Bin Zhu, Jian Lou Tencent Media Lab, Shenzhen, China and Palo Alto, CA, USA, fmarcli, [email protected] Abstract—As the successor of H.265/HEVC, the new versatile standard. Therefore, it is essential to have an efficient and video coding standard (H.266/VVC) can provide up to 50% optimized software decoder implementation to support the bitrate saving with the same subjective quality, at the cost of emerging applications. In [3] [4], an independent VVC soft- increased decoding complexity. To accelerate the application of the new coding standard, a real-time H.266/VVC software ware decoder implemented by Tencent demonstrated real-time decoder that can support various platforms is implemented, HD/UHD decoding capability on x86 platform. Considering where SIMD technologies, parallelism optimization, and the that mobile devices have become an essential carrier and acceleration strategies based on the characteristics of each coding display tool for video services, extensive optimization efforts tool are applied. As the mobile devices have become an essential were made on top of the framework of [3] to achieve real- carrier for video services nowadays, the mentioned optimization efforts are not only implemented for the x86 platform, but more time HD/UHD decoding on the mobile platform. As a result, importantly utilized to highly optimize the decoding performance a uniform-designed software H.266/VVC decoder that can on the ARM platform in this work. The experimental results show run real-time on different platforms and supports versatile that when running on the Apple A14 SoC (iPhone 12pro), the av- functionalities such as screen content coding (SCC) is ac- erage single-thread decoding speed of the present implementation complished. This paper will focus more on the discussions can achieve 53fps (RA and LB) for full HD (1080p) bitstreams generated by VTM-11.0 reference software using 8bit Common of mobile optimization and capabilities. The rest of the paper Test Conditions (CTC). When multi-threading is enabled, an is organized as follows: Section II presents an overview of average of 32 fps (RA) can be achieved when decoding the 4K H.266/VVC decoder processing blocks and briefly introduces bitstreams. the proposed software decoder design including single in- Index Terms—H.266, Versatile Video Coding (VVC), Video struction, multiple data (SIMD) processing technology and decoding, ARM, SIMD optimization data/task level parallelism. Section III elaborates the optimiza- tion on some important modules and shows the benefit of the I. INTRODUCTION implementation. Experimental results of the proposed decoder With the popularization of the video applications on the are shown in Section IV. Finally, Section V concludes the Internet and the demand for high-quality video services, a paper. more efficient video compression standard with high coding II. CODING TOOL INDEPENDENT DECODER performance is desired. After High Efficiency Video Coding OPTIMIZATION (HEVC)/H.265 standard, Joint Video Experts Team (JVET) between VCEG (Q6/16) and ISO/IEC JTC1 SC29/WG11 A. Overview of H.266/VVC Decoder (MPEG) was created, and formed the Versatile Video Coding The structure of H.266/VVC decoder with the main pro- standard (H.266/VVC) in July 2020 [1]. It can provide up cessing modules is shown in Fig. 1. Following the decoding to 50% improvement in compression efficiency for the same pipeline, the entropy decoder stage is the first stage, which ex- subjective quality compared to its predecessor HEVC [2]. In tract and decode the codewords from the input bitstream. After arXiv:2103.03612v1 [eess.IV] 5 Mar 2021 order to enhance the coding performance, several coding tools that, the syntax elements are derived. Similar as H.265/HEVC, with decoder side computations are adopted, like decoder side the entropy encoding/decoding is still based on the context- motion vector refinement (DMVR), bi-directional optical flow adaptive binary arithmetic coding (CABAC) scheme, but sev- (BDOF), adaptive loop filter (ALF), etc. These coding tools eral enhancements have been introduced into VVC to improve introduce additional complexities for the decoder, that makes the throughput and the compression efficiency. it a hard work to optimize the decoding speed. The derived intra mode after the entropy decoder stage is In general, video decoders can be classified into hard- used in intra prediction stage, where the prediction value can ware implementations and software implementations, where be obtained by using the boundary pixels and the intra mode hardware decoders are preferred in low-power devices, but index. To increase the prediction accuracy, VVC extends the normally come a few years later after a standard is finalized. intra modes from 35 to 67, and wide-angle intra prediction In contrast to the hardware decoder, software decoders may modes are used for the non-square blocks. Different with the need more power, but still play an important role on general traditional angular prediction, the matrix weighted intra predic- computer, especially in the early stage for the new coding tion (MIP) technology takes the left and above neighbouring boundary samples as the input to generate the prediction by matrix vector multiplication and linear interpolation. Besides, a cross-component linear model (CCLM) prediction mode is also introduced in the VVC to utilize the correlation between luma and chroma components. Benefiting from these tools, an accurate predicted block is prepared, and the final recon- structed block can be formed by adding the residual block and predicted block. Fig. 2. Runtime Spending of Each Processing Block on ARM Platform without SIMD Acceleration. a better prediction. Similar as the intra prediction module, the reconstructed block generated by inter prediction mode is from the residual block and the prediction block. In VVC, loop filtering (LF) stage contains three kinds of in- loop filters, including deblocking filter (DBK), adaptive loop filter (ALF) and sample-adaptive offset (SAO). The general idea for DBK and SAO in VVC is similar as the previous video coding standard except the granularity for luma component is changed to 4×4 on the deblocking filtering process. Adaptive loop filters is newly added in VVC. Based on the gradients of each block, one among 25 filters is applied for each 4×4 block filtering. The test on each new coding tool has been conducted by AHG13 of JVET [5], which is based on the reference software. Fig. 1. VVC Decoder Structure [3]. In [3], the runtime spent on each processing block based on the optimized real-time decoder is provided on x86 platform. Residual block is generated by the inverse quantization Considering the architecture is different between x86 and and inverse transform (IQ/IT) stage shown in Fig. 1. This ARM platform, the runtime spent of the optimized decoder stage can dequantize and transform the frequency-domain [3] is tested on the ARM platform, as shown in Fig. 2. It coefficients back to the spatial domain. In VVC, multiple can be seen as a start point for the optimization on the ARM transform selection (MTS) scheme is adopted, so that besides platform. In this test, the test bitstreams are encoded from the DCT-2, DCT8/DST7 can also be selected as the core transform H.266/VVC common test conditions (CTC) [6] with random matrices. This tool can help to remove the correlation and access (RA) configuration. We take the average of the 1080p reduce the coding bits. As another new coding tool, low- test bitstreams for collecting the runtime in order to reduce frequency non-separable transform (LFNST) is added to fur- the content based fluctuation. As SIMD optimization for x86 ther remove the correlation in transform domain. platform cannot be directly used in ARM platform, there is As videos usually contain temporal redundancy, for elim- no SIMD acceleration in this test. The results show that ALF inating the correlation in temporal domain, inter prediction and inter prediction are the most two time-consuming blocks technology is widely used, where the motion compensation in the whole decoding pipeline. (MC) stage generates the inter prediction from a decoded reference block. For acquiring a more accurate prediction B. Tool Independent Decoder Optimization result, the decoded inter modes from entropy decoding stage. As described in Section II-A, many new coding tools are e.g., MVs, RefIdxs can help to locate the reference block. As added in VVC. Besides the coding tools in each module, the Fig. 1 shows, another input for motion compensation stage is new standard can also support larger block size, multi-type the reference block, which is from the decoded picture buffer coding unit partitions, etc. For versatile video applications, (DPB). In the MC stage in VVC, 8-tap interpolation filter is the coding tools for screen content sequences like intra block used for luma block and 4-tap filter is used for chroma block. copy are also included in the main profile of H.266/VVC Besides the traditional translation motion as H.265/HEVC instead of in the extension [7]. With the support for the new used, the affine motion model is also applied in VVC. For coding tools and new coding features, the new standard is further increasing the prediction accuracy, BDOF and DMVR significantly more complex than previous codecs and more can help to refine the subblock motion vector in order to obtain efforts are required for fully accelerating. coefficients among 25 filters. As shown in Fig. 3, The shape of filter is diamond, where 7×7 diamond shape is applied for luma component and chroma components use the 5×5 diamond shape. Considering the filtering process is very time-consuming, many efforts are spent in this part in the VVC standardization stage.