An In-Depth Review on the AV1 Intra Prediction Marcel M
Total Page:16
File Type:pdf, Size:1020Kb
An In-depth Review on the AV1 Intra Prediction Marcel M. Corrêa (Ph.D. student), Daniel M. Palomino, Guilherme R. Corrêa (Co-advisors), Luciano V. Agostini (Advisor) Video Technology Research Group (ViTech) Programa de Pós-Graduação em Computação (PPGC) Universidade Federal de Pelotas (UFPel) {mmcorrea, dpalomino, gcorrea, agostini}@inf.ufpel.edu.br Abstract. The AOMedia Video 1 (AV1) is an open-source and royalty-free video coding format, which was developed by the Alliance for Open Media (AOM) industry consortium and released in June 2018. The main goal of AV1 development was to achieve substantial compression gain over state-of-the-art codecs, while keeping a practical decoding complexity, hardware feasibility and its open and free status. This monograph provides a brief technical overview on the AV1 block partitioning and transform coding, and an in-depth technical review of its intra prediction tools, along with a brief presentation of the VP9 and HEVC intra prediction tools. 1. Introduction The continuous growth in consumption of digital videos over the internet, including services such as video-on-demand, video sharing on social networks, as well as high definition video conferences (even on mobile devices), are reaching the limits of available bandwidth of the telecommunication infrastructures. To address this situation, standardization bodies have been developing video coding standards during the last years. The most recent ISO/IEC and ITU-T standard is the High Efficiency Video Coding (HEVC) [Sullivan 2012; ISO/IEC 2013], which has many of its techniques protected by patents. A key factor in the success of the internet is that its core technologies are open and freely implementable, and digital videos certainly are a central part of the internet experience nowadays. According to Cisco Systems, digital videos are consuming the majority of all the internet traffic and are expected to consume even more in the near future [Cisco Systems 2017], which makes open-source and royalty-free video formats highly desirable. Based on that, Google LLC purchased the company On2 Technologies, which originally developed the VP8 video format [Bankoski 2011], and released the VP8 codec code under a BSD-like license and the bitstream specification under an irrevocable free patent license. This episode marked the start of the WebM project, as an effort to develop open media formats, later resulting in the release of the VP9 video format [Mukherjee 2013; Grange 2013]. Although Google LLC has been using VP9 successfully on YouTube, it is considered to be not enough to handle new tendencies, such as Ultra High Definition (UHD) 4K resolution which is four times larger than 1080p, UHD 8K resolution which is four times larger than 4K, High Dynamic Range (HDR) which has 25% to 50% more data than Standard Dynamic Range (SDR) and, also, higher frame rates. These tendencies result in an expressive increase of data to be compressed. As Google LLC was working on a successor for VP9, called VP10 [Mukherjee 2015], and Cisco Systems and Mozilla Corporation were working on their own video formats, respectively called Thor [Bjøntegaard 2016] and Daala [Valin 2016], an industry consortium, called Alliance for Open Media (AOM), was formed. This alliance was made mainly because the paradigm of developing the technology first and solving the licensing deals later was not working for the majority of the companies involved, as Rosenberg (2015) explained: Unfortunately, the patent licensing situation for H.265 [HEVC] has recently taken a turn for the worse. Two distinct patent licensing pools have formed so far, and many license holders are not represented in either. There is just one license pool for H.264. The total costs to license H.265 from these two pools is up to sixteen times more expensive than H.264, per unit. H.264 had an upper bound on yearly licensing costs, whereas H.265 has no such upper limit. […] We [Cisco Systems] also hired patent lawyers and consultants familiar with this technology area. We created a new codec [Thor] development process which would allow us to work through the long list of patents in this space, and continually evolve our codec to work around or avoid those patents. As result, the AOMedia Video 1 (AV1) [Chen 2018; Rivaz 2018] was released in June 2018 as the VP9 successor, to be a next-generation, open-source and royalty-free video format developed based on many elements of VP10, Thor and Daala. This monograph provides a brief technical overview on the AV1 block partitioning and transform coding, and an in-depth technical review of its intra prediction tools, along with a brief presentation of the VP9 and HEVC intra prediction tools. 2. AV1 Block Structure and Transform Coding In AV1, a video frame is partitioned into superblocks (SBs) of size 128x128 or 64x64 pixels, according to a bitstream flag. Then, SBs are predicted in raster order within the frame. To deliver a locally optimal prediction for each SB, the encoder can further divide each SB using a 10-way partition tree structure [Chen 2018], as illustrated in Figure 1. In this figure, blue blocks are final partition modes, but all four partitions of the white block can be recursively divided based on the same 10-way tree structure, down to 4x4 blocks, which is the smallest supported block size. The numbers inside each partition indicate the raster prediction order within the block. Each final partition can be either intra or inter predicted, or even by a compound inter-intra mode, which combines a single-reference inter prediction and an intra prediction separated by a smoothing function [Chen 2018]. Specifically, for intra prediction, the process is not invoked at block level, but at transform block level instead. That is, when the transform size is smaller than the block size, the prediction is invoked multiple times in raster order within the block [Rivaz 2018], thus allowing the prediction of a transform block to use the previously predicted and reconstructed transform block as a better reference. The AV1 specifies many square and non-square sizes for transform blocks, from 4x4 to 64x64, as listed in Table 1. In this table, TxSize represents how a given transform size is signaled in the bitstream. Super Can start at Block 128x128 or (SB) 64x64 pixels. Each block can be further divided. 1 1 2 1 2 1 2 3 4 1 3 2 4 VERT HORZ VERT_4 HORZ_4 NONE 1 2 1 2 1 3 1 2 3 3 2 3 VERT_A VERT_B HORZ_A HORZ_B SPLIT Figure 1. AV1 10-way block partition tree structure. Table 1. AV1 supported transform sizes [Rivaz 2018]. TxSize Transform Size 0 4x4 1 8x8 2 16x16 3 32x32 4 64x64 5 4x8 6 8x4 7 8x16 8 16x8 9 16x32 10 32x16 11 32x64 12 64x32 13 4x16 14 16x4 15 8x32 16 32x8 17 16x64 18 64x16 A very rich set of transform kernels is defined for both inter and intra blocks. The full set consists of 16 combinations of vertical and horizontal Discrete Cosine Transform (DCT), Asymmetrical Discrete Sine Transform (ADST), flipADST and Identity Transform (IDTX). Intra prediction tends to be less accurate for predicted samples further away from the reference samples, resulting in a residue with higher concentration on the bottom right corner. For these cases, the asymmetrical transforms, such as the ADST and flipADST (the ADST applied in reverse order), are the most appropriate [Parker 2016]. Also, the IDTX, which skips the transform in the applied direction, shows good results for coding residue with very sharp edges. Moreover, the specific combination of IDTX vertical and IDTX horizontal results in a full transform skip, which is very effective for Screen Content Coding (SCC) [Parker 2016]. 3. AV1 Intra Prediction As explained in section 2, although AV1 supports many different block sizes ranging from 128x128 to 4x4 due to its flexible partition structure, only partitions of size 64x64 or smaller are supported by the intra prediction modes (predictors), according to the 19 transform sizes listed in Table 1. AV1 supports a variety of non-directional predictors that consider gradients, spatial correlation of samples and coherence of luminance and chrominance planes. These modes are: DC, Paeth, Smooth, Smooth Vertical, Smooth Horizontal, Recursive-based- filtering (modes 0 to 4) and Chroma-from-luma (CfL) [Trudeau 2018]. There are also two modes particularly efficient for Screen Content Coding, namely the Intra Block Copy [Li 2018] and Palette [Guo 2014] modes. Finally, AV1 also supports many different directional predictors, corresponding to angles between 36 and 212 degrees, aiming at several varieties of spatial redundancies in directional textures. Generally, the intra prediction of a single block, referred as a 2D array called Pred of size width by height, requires reference samples from previously reconstructed blocks located on the left and above the current block. The number of reference samples needed for each reference array, referred as AboveRow (row of samples above the current block) and LeftCol (column of samples to the left of the current block), is equal to width+height+1, as illustrated in Figure 2. AboveRow[−1 to 11] Pred[0 to 3][0 to 7] 11] [−1 to [−1 LeftCol Figure 2. Example of a 4x8 block, which needs a total of 25 reference samples from previously encoded adjacent blocks. The position -1 of both reference arrays always share the same value. The following subsections present detailed information about each intra prediction mode supported by the AV1. 3.1. DC Mode This well-known predictor is also found in many older video formats, although this one was based on the VP9 version, which also considers the availability of reference samples. That is, because of the block processing order or data dependency break introduced by parallel encoding tools, the top and/or left neighboring blocks may not be available to be used as reference in intra prediction.