AV1 Now! Project Report

Jacob Aulenback CSC 461 University of Victoria

Fall 2019

AV1 is an up-and-coming, state-of-the-art video coding format being developed by a consortium of the largest technology companies in the world. In the coming years it is set to replace current-generation formats in all forms of online video. It improves upon the coding efficiency of the incumbent formats AVC and HEVC by roughly 50% and 20% respectively, and, unlike them, is made available completely royalty-free for use by anyone, anywhere.

Contents

1 The Story So Far
  1.1 AVC/H.264
  1.2 Free Formats
  1.3 HEVC/H.265
2 AOMedia and AV1
3 Technology
  3.1 Partitioning
  3.2 Prediction
    3.2.1 Intra-Frame Prediction
    3.2.2 Inter-Frame Prediction
  3.3 Decode Filters
4 Status and Conclusion
References

1 The Story So Far

This report is centred around the current push by some of the largest players in the online video industry to create, standardize, and adopt an entirely new video coding format. The format this effort has spawned, AV1, is uniquely situated to become the predominant video format on the web. To understand the significance and potential of AV1, it is necessary to understand the recent history of video encoding standards, and the challenges that new standards face.

1.1 AVC/H.264

In the context of AV1, recent history begins in 2003 with the standardization of the first version of AVC/H.264, also known as MPEG-4 Part 10. This standard was developed by two committees, the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC JTC1 Moving Picture Experts Group (MPEG), together known as the Joint Video Team (JVT). The goal of the project was to create a family of standards which would facilitate the increasing number of uses for digital video, including video telephony, traditional broadcast, and High Definition content. To achieve this purpose, the standard would need to substantially increase video coding efficiency without excessively increasing the computational complexity of encoders and decoders.

The focus during the creation of the standard was purely on technical performance. As such, many of the techniques set out in the standard are protected by patents. These patents are owned by a large variety of third-party companies, each of whom has legal rights to certain parts of the standard and the software which implements it. In practice, those making commercial use of H.264-encoded video must pay a royalty fee to those patent owners. This has been challenging, even for large corporations, because of the huge number of patent holders. The majority of the patents covering H.264 can be licensed together from an organization called MPEG LA, which administers a pool of thousands of patents from dozens of companies.[1] MPEG LA does not, however, include every patent potentially covering H.264, and due to the nature of software patents, it is near impossible to create an exhaustive list. There have been multiple lawsuits over the years over infringement of H.264 patents.[2][3] Ultimately, despite its legal woes, AVC rose to prominence over its competitors. Initially at odds with Microsoft's VC-1 format on HD DVDs, H.264 became the de facto standard for both Blu-ray discs and online video of all types.

1.2 Free Formats

The patent-encumbered nature of H.264, and the lack of a free, open alternative, led to problems in the late 2000s when work was being done on the new specification for HTML5. The updated specification was going to recognize the importance of video on the modern web and to move away from the proprietary Adobe Flash plugin which was then in widespread use. The problem was that there was no single video encoding format which was both high quality and royalty-free. Initially, a free format called Theora, developed by the Xiph.org Foundation, was proposed as a mandatory supported format for HTML5 compliance. Concerns were raised, however, that due to Theora's new status and lack of widespread use, there could have been unknown patent holders who might only make themselves known once a large company was already making use of their patent. Thus the HTML5 specification removed the explicit requirement to support Theora and read instead:

“It would be helpful for interoperability if all browsers could support the same codecs. However, there are no known codecs that satisfy all the current players: we need a codec that is known to not require per-unit or per-distributor licensing, that is compatible with the open source development model, that is of sufficient quality as to be usable, and that is not an additional submarine patent risk for large companies.” [4]

In the end there was no encoding format suitable for use in open web standards. Work was begun during the subsequent years by various entities to create brand-new formats which, it was hoped, could fill this royalty-free gap. Xiph.org, in collaboration with Mozilla, created a spiritual successor to Theora called Daala; Cisco created a new format called Thor which targeted real-time video applications; and Google, having purchased On2 Technologies, began extending On2's VP8 format into the newer VP9. Each of these was a high-efficiency format designed to outperform H.264 by a significant margin. The most successful of the three in terms of adoption was VP9, which saw significant use on YouTube as well as eventual hardware support in newer CPUs and GPUs.

1.3 HEVC/H.265

The VCEG and MPEG, however, had also released a new format, the successor to H.264, called High Efficiency Video Coding (HEVC) or H.265. Like Daala, Thor, and VP9, H.265 is a significant upgrade in efficiency when compared to H.264. Being the successor to the most commonly used video format in the world gave H.265 a significant advantage in gaining market share. It was selected as the format for 4K Blu-ray discs, for example, which gives chip-makers reason to create hardware implementations which can be used in Blu-ray players and game consoles. Once the hardware is there, it can be used for video which came not from a Blu-ray disc but over the Internet.

This in turn drives support by online video sites which seek to serve video best suited to their customers’ devices. This creates a feedback loop and network effect which makes breaking into the market very difficult, even for companies able to create their own demand like Google and YouTube.

The royalty and licensing landscape for H.265 is significantly more involved and expensive than it was for H.264. While MPEG LA, the organization largely responsible for H.264 licensing, has attempted to simplify its royalty structure, it is no longer the only major entity seeking royalty payments. A new pool has appeared, called HEVC Advance, administering patents belonging to General Electric and more than a dozen other companies. Per-unit costs for playback devices have also risen: where MPEG LA charged US$0.10 to US$0.20, HEVC Advance charges US$0.40 for a mobile device and as much as US$1.20 for a 4K television.

One significant positive change, specifically for online video, is that both MPEG LA and HEVC Advance have eliminated fees for non-physical content distribution. This means that the only fees owed for using H.265 on the web would be paid for the codec itself, that is, the hardware or software implementation which encodes or decodes the H.265 format. MPEG LA and HEVC Advance are not, however, the only patent holders, nor are they the only ones charging fees. At least two other licensors have been paid for use of the format, and there remain many other patent holders who have not yet provided terms for use of their patents.

Despite the tumult and uncertainty around H.265 licensing, it has nevertheless shown remarkable popularity. As of 2019 it is the second most widely used video format after H.264.

2 AOMedia and AV1

The problems caused by the patent-encumbered dominant formats, as well as the lack of any royalty-free format with widespread support, were the motivation behind the creation of the Alliance for Open Media (AOMedia), a consortium of technology companies formed to develop a new royalty-free video coding format. As of December 2019 AOMedia consists of 14 “founding members” (although at its creation it consisted of only 7), as well as 29 “promoter members”. The founding members are:

• Amazon • Apple • ARM • Cisco • Facebook
• Google • IBM • Intel • Microsoft • Mozilla
• Netflix • Nvidia • Samsung Electronics • Tencent

The current members of the alliance represent a broad section of the online video industry, including major content providers (Facebook, Netflix), makers of browsers and operating systems (Google, Mozilla), as well as hardware manufacturers (Intel, Samsung Electronics). This is crucial to the success of a new format, as adoption is greatly eased if support exists at all levels.

The other benefit afforded by the Alliance is the set of pledges made about intellectual property. Each member has agreed to license their relevant patents for free to anyone in the world based only on the concept of reciprocity; that is, the license is free so long as one does not engage in patent litigation against Alliance members. The Alliance has also established the AOMedia “patent defense program” to guard against future patent claims by third parties.

It is notable that Cisco, Mozilla, and Google (all part of the original 7 founding members) each had their own royalty-free format in development. The idea was to combine technologies from these existing encoders into a single format. Cisco, Xiph/Mozilla, and Google submitted their codecs, Thor, Daala, and VP9 respectively, as inputs toward standardization. It was decided that the format would be largely based on VP9, with additions from both Daala and Thor. VP9 was chosen because it was seen as the “lowest-technical-risk”[5] option; the techniques used by VP9 were more mature owing to the significant success VP9 had achieved on its own.

The first product of these efforts is, of course, AV1, a new video format designed specifically to solve the problems the industry has faced until now. In March 2018 the AV1 1.0.0 bitstream was frozen and a reference encoder and decoder were released.

3 Technology

In order to be a viable replacement for H.265, AV1 would need not only to be royalty-free, but also to provide similar or better coding efficiency at a reasonable complexity. VP9 was found already to be of similar or slightly better efficiency than HEVC, which gave a solid foundation on which AV1 could be built; however, the design goal of AV1 was to improve upon HEVC by a significant margin rather than to simply match it. This section investigates the techniques used by AV1, some of which are brand new and some of which are variations on coding technology used for decades.

3.1 Partitioning

Partitioning is the process of dividing a video frame into smaller sections called blocks. In AV1 the largest blocks are called superblocks. These are either 64x64 pixels or 128x128 pixels, and are analogous to the macroblocks and coding tree units of other encoders. Where AV1 differs from other formats is in the size of the superblocks. In the older H.264 format, the largest blocks (macroblocks) are 16x16 pixels, while the newer H.265 format allows up to 64x64. The larger blocks become useful at higher resolutions, where small blocks incur a high overhead.[6]

In addition to superblocks, AV1 encoders allow for smaller, recursive partitions to be performed within blocks. Compared to VP9, AV1 expands the ways in which blocks can be partitioned.

Figure 1: Each of the 10 ways a block may be partitioned in AV1

Starting at the 128-pixel superblock, an encoder will select one of 10 possible subdivisions. In VP9, only the vertical rectangular, horizontal rectangular, and 2x2 partition options were available. Partitions may only be applied recursively if the block was partitioned into 2x2 squares. By having more block shapes, encoders are able to better align blocks to natural features of the video without resorting to computationally expensive smaller blocks. Adding additional block shapes and sizes increases the computational complexity of the encoding process; however, the decision to use them is up to the encoder. Compliant AV1 video can be created without using any of the newer block sizes.

Thus the additional block sizes can be tuned based on the desired speed setting of an encoder.
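The recursive split decision can be sketched as follows. This is a simplified illustration, not AV1's actual algorithm: a real encoder evaluates all ten partition shapes with full rate-distortion costs, while this sketch uses only the recursive 2x2 square split and a hypothetical variance threshold as its split criterion.

```python
# Simplified recursive block partitioning. A real AV1 encoder evaluates
# all 10 partition shapes with rate-distortion costs; this sketch only
# performs the recursive 2x2 square split, driven by a hypothetical
# variance threshold.

def variance(block):
    """Pixel variance of a 2-D block given as a list of rows."""
    pixels = [p for row in block for p in row]
    mean = sum(pixels) / len(pixels)
    return sum((p - mean) ** 2 for p in pixels) / len(pixels)

def partition(block, x, y, size, min_size=16, threshold=100.0):
    """Return a list of (x, y, size) leaf blocks."""
    if size <= min_size or variance(block) <= threshold:
        return [(x, y, size)]
    half = size // 2
    quads = [
        ([row[:half] for row in block[:half]], x, y),
        ([row[half:] for row in block[:half]], x + half, y),
        ([row[:half] for row in block[half:]], x, y + half),
        ([row[half:] for row in block[half:]], x + half, y + half),
    ]
    leaves = []
    for sub, sx, sy in quads:
        leaves.extend(partition(sub, sx, sy, half, min_size, threshold))
    return leaves

# A flat 64x64 superblock is kept whole; a detailed one would be split.
flat = [[128] * 64 for _ in range(64)]
print(partition(flat, 0, 0, 64))  # [(0, 0, 64)]
```

The split criterion is where all the encoder complexity hides: smarter (slower) encoders try more of the 10 shapes before deciding, which is exactly the speed-setting trade-off described above.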

3.2 Prediction

Once the frames have been divided into blocks, the technique employed by video encoders is to “predict” the data within each block using data from surrounding blocks and surrounding frames. This is possible because “natural” videos contain a lot of visually redundant information. When a block is predicted using data from within its own frame, this is called intra-frame prediction. When the prediction happens temporally, between frames, it is called inter-frame prediction. A typical video will have intra-predicted frames, called I-frames or keyframes, spaced periodically through the video, with the intervening frames being inter-predicted from them. Intra-prediction is not limited to I-frames, however; blocks within inter-predicted frames can be intra-predicted as well for additional efficiency.
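The payoff of prediction, in either direction, can be seen in a toy example: the encoder transmits only the residual between a block and its prediction, and small residuals cost far fewer bits than raw pixel values.

```python
# Why prediction helps: instead of coding a block's raw pixel values,
# the encoder codes the (usually small) residual between the block and
# its prediction. Small residuals compress far better than raw values.

actual = [120, 121, 123, 126]
prediction = [120, 120, 124, 125]  # e.g. copied from a neighbouring block
residual = [a - p for a, p in zip(actual, prediction)]
print(residual)  # [0, 1, -1, 1]
```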

3.2.1 Intra-Frame Prediction

Intra-frame prediction is a very important aspect of a video encoder. Because an I-frame is essentially a compressed still image, I-frames are typically large compared to the rest of the frames. As advances are made in inter-prediction, I-frames become a larger and larger component of the final bitstream, so intra-frame prediction needs to be as efficient as possible.

Some ways of achieving this are very simple. AV1 contains an intra-prediction mode called Block Copy which will simply reference back to another block if it determines they are similar enough. For a repeating texture, for example, this is an extremely simple way to substantially reduce the necessary bits. A more complicated technique used by AV1 is called Directional Intra Prediction, whereby the directional spatial features of a frame are exploited. In VP9 there are 8 directional modes available, meaning 8 angles which will be checked for possible prediction. AV1 allows an offset to be applied to each of these directional modes, allowing for a total of 56 directional modes and overall a finer angular granularity for spatial features.[7]
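The count of 56 follows from combining the 8 nominal angles with a small range of offsets. The nominal angles and step size below follow the AV1 specification (each nominal angle may be offset by up to 3 steps of 3 degrees in either direction):

```python
# Enumerating AV1's directional intra-prediction angles: 8 nominal
# angles, each refined by an offset of -3..+3 steps of 3 degrees,
# giving 8 * 7 = 56 distinct prediction directions.

NOMINAL_ANGLES = [45, 67, 90, 113, 135, 157, 180, 203]  # degrees
ANGLE_STEP = 3   # degrees per offset step
MAX_OFFSET = 3   # offsets range from -3 to +3

angles = sorted(
    nominal + delta * ANGLE_STEP
    for nominal in NOMINAL_ANGLES
    for delta in range(-MAX_OFFSET, MAX_OFFSET + 1)
)
print(len(angles))  # 56
```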

Another method implemented by AV1 for intra-prediction is Chroma from Luma prediction. The concept is to use the brightness values in a block to predict the colour information. This does not, of course, mean that the colour of an image can be extracted from the luminance data alone. What it does mean is that within a block there is a correlation between the colour and brightness of a pixel. From Xiph.org's AV1 demo page:

“...it’s obvious looking at YUV decompositions of frames that edges in the luma and chroma planes still happen in the same places. There’s remaining correlation that we can exploit to reduce bitrate” [5]

While Chroma-from-Luma techniques have been used in previous encoders, including Daala and HEVC, AV1 includes a new method of model fitting which reduces the prediction error while simultaneously reducing decoder complexity.[8]
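The correlation being exploited can be illustrated with a toy linear model. This is not AV1's actual fitting procedure (AV1 is designed to keep the decoder simple by signalling the scaling factor explicitly); it is a least-squares sketch of the underlying chroma ≈ alpha × (luma AC) + DC relationship:

```python
# A sketch of the Chroma-from-Luma idea: model each chroma sample as
# chroma ≈ alpha * (luma - mean(luma)) + mean(chroma). Here alpha is
# fit by least squares purely to illustrate the correlation; it is not
# AV1's actual fitting method.

def cfl_fit(luma, chroma):
    """Return (alpha, chroma_dc) for the linear model above."""
    n = len(luma)
    luma_mean = sum(luma) / n
    chroma_mean = sum(chroma) / n
    luma_ac = [l - luma_mean for l in luma]
    denom = sum(a * a for a in luma_ac)
    alpha = sum(a * (c - chroma_mean) for a, c in zip(luma_ac, chroma)) / denom
    return alpha, chroma_mean

def cfl_predict(luma, alpha, chroma_dc):
    """Predict chroma samples from co-located luma samples."""
    luma_mean = sum(luma) / len(luma)
    return [alpha * (l - luma_mean) + chroma_dc for l in luma]

# Synthetic block where chroma tracks luma edges exactly.
luma = [100, 100, 200, 200, 100, 100, 200, 200]
chroma = [60, 60, 90, 90, 60, 60, 90, 90]
alpha, dc = cfl_fit(luma, chroma)
print(round(alpha, 2), dc)  # 0.3 75.0
```

Because the edges in the two planes line up, a single scaling factor reproduces the chroma block from the luma block, which is exactly the remaining correlation the quote describes.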

3.2.2 Inter-Frame Prediction

While intra-frame prediction is crucial for an optimized encoder, online high-definition video would not be practical without inter-prediction. An inter-predicted block reduces the number of bits necessary to store it by referencing another block in a surrounding frame. AV1 has made several improvements over previous-generation codecs in this space. For one, AV1 extends the number of references for each frame from 3 in VP9 to 7. Additionally, AV1 allows references to be combined into a pair to form a compound mode. A compound prediction of multiple references “can encode a variety of videos with dynamic temporal correlation characteristics in a more adaptive and optimal way.”[7]
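A compound prediction can be sketched as a per-pixel blend of two references. The simple average below is only illustrative; AV1's real compound modes include distance-weighted and wedge-masked blends:

```python
# A minimal sketch of compound inter prediction: a block is predicted
# as a weighted blend of two co-located reference blocks. AV1's actual
# compound modes are richer; this uses a plain average for illustration.

def compound_predict(ref_a, ref_b, weight_a=0.5):
    """Blend two reference blocks pixel by pixel."""
    return [round(weight_a * a + (1 - weight_a) * b)
            for a, b in zip(ref_a, ref_b)]

past = [100, 102, 104, 106]    # block from an earlier frame
future = [108, 110, 112, 114]  # block from a later frame
print(compound_predict(past, future))  # [104, 106, 108, 110]
```

For content that changes smoothly over time, the blend of a past and a future reference lands closer to the true block than either reference alone, shrinking the residual.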

Another tool used in AV1 which was left out of previous-generation codecs is Global Motion Compensation. Each block in a frame has a motion vector associated with it. This motion vector represents the true motion of the block in the video between frames. These vectors are calculated for each frame and used to reference blocks in other frames. In many videos, however, there will be large whole-frame motion caused by, for example, a panning camera. In this case the motion vectors for every block in the frame may be the same. AV1 determines this global motion vector and allows individual blocks to signal their use of it when it is beneficial to do so. This technique was examined for previous-generation codecs, but was rejected due to the increased computational intensity.
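The idea can be sketched with a hypothetical "most common vector" heuristic. This is only a stand-in: AV1's actual global motion model is a parametric warp that also supports rotation, zoom, and general affine motion, not just the translation shown here.

```python
# A sketch of the global-motion idea: if most blocks share one motion
# vector (e.g. a camera pan), the encoder can transmit that vector once
# and let each matching block signal a single flag instead of a full
# vector. The "most common vector" choice below is an illustrative
# stand-in for AV1's parametric global motion model.

from collections import Counter

def pick_global_mv(block_mvs):
    """Choose the most common per-block motion vector as the global one."""
    return Counter(block_mvs).most_common(1)[0][0]

# 14 of 16 blocks move by (4, 0): a horizontal pan.
mvs = [(4, 0)] * 14 + [(1, 2), (0, -3)]
global_mv = pick_global_mv(mvs)
blocks_using_global = sum(mv == global_mv for mv in mvs)
print(global_mv, blocks_using_global)  # (4, 0) 14
```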

3.3 Decode Filters

All video coders, at the most basic level, are simply trying to maximize the visual quality perceived by a human viewer while using the lowest possible bitrate. Using the techniques mentioned above and many, many others, the AV1 encoder produces a compressed bitstream which is transmitted to the end user. At this stage of the process, however, the decoded video would contain many artefacts resulting from compression. Codecs mitigate some of these effects by applying filters during the decode loop. AV1 contains two novel filters which have shown success in increasing perceived quality.

The first of these is called the Constrained Directional Enhancement Filter (CDEF). The goal of the filter is to remove coding artefacts, specifically ringing artefacts of the type caused by sharp edges which have been transformed to the frequency domain and then quantized. Encoded edges develop rippling artefacts emanating from the edge. These are particularly visible against what the eye knows to be a solid, simple backdrop. Other filters which remove artefacts of this type run into the issue of trying to remove the artefact without degrading the actual detail of the image. The technique employed by the CDEF is to identify the direction of the spatial features within a block and then to filter along only that direction. This preserves much more of the detail in the edge.[9]
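The direction search at the heart of CDEF can be sketched as follows. This simplified version checks only 4 candidate directions on a small block (the real filter searches 8 directions on 8x8 blocks), scoring each by how much pixel values vary along its lines; the dominant edge direction is the one with the least variation.

```python
# A simplified CDEF-style direction search: group a block's pixels into
# lines along each candidate direction and measure variation within each
# line. The direction with the least variation matches the block's
# dominant edge. Real CDEF searches 8 directions; this checks 4.

def line_key(direction, x, y):
    return {
        "horizontal": y,     # pixels in the same row
        "vertical": x,       # pixels in the same column
        "diag_down": x - y,  # 45-degree lines
        "diag_up": x + y,    # 135-degree lines
    }[direction]

def best_direction(block):
    size = len(block)
    scores = {}
    for d in ("horizontal", "vertical", "diag_down", "diag_up"):
        lines = {}
        for y in range(size):
            for x in range(size):
                lines.setdefault(line_key(d, x, y), []).append(block[y][x])
        cost = 0.0
        for vals in lines.values():
            mean = sum(vals) / len(vals)
            cost += sum((v - mean) ** 2 for v in vals)
        scores[d] = cost
    return min(scores, key=scores.get)

# A block with a vertical edge: each row is identical, so pixel values
# are constant along vertical lines and vary along horizontal ones.
edge = [[0, 0, 0, 0, 255, 255, 255, 255] for _ in range(8)]
print(best_direction(edge))  # vertical
```

Having identified the direction, the filter smooths only along it, which suppresses ringing next to the edge without blurring the edge itself.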

Another decode-loop filter used in AV1 is called Film Grain Synthesis. Many videos naturally contain noise originating from imperfections in film or from a digital camera's sensor. While noise is often considered undesirable, it is also often part of the creative intent of the video. The grain must then be captured and transmitted to preserve the original look of the video. This is difficult for a video encoder, however, because noise is random in nature: it cannot be predicted either within a given frame or between frames. It will either increase the bitrate necessary to represent the video or else degrade the visual fidelity significantly.

The insight of Film Grain Synthesis is that because the grain is inherently random, a viewer will not notice if it is different grain than the original. At the encode stage, AV1 applies a process to parametrize the grain in a video: a flat area of the image is analysed to extract parameters which are then transmitted with the video. At the decode stage, the decoder uses these parameters to synthesize similar-looking but entirely different noise, which is applied after the fact. The encoder is then free to filter out the original noise as necessary without ruining the look of the video.[10]
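The synthesis step can be sketched with a 1-D autoregressive model. AV1's actual grain model is 2-D and applied per colour plane, and the coefficient values here are purely illustrative; the point is that the same few parameters regenerate statistically similar, but not identical, grain.

```python
# A sketch of Film Grain Synthesis using a 1-D autoregressive model:
# each grain sample is a weighted sum of its predecessors plus fresh
# white noise. Only the coefficients and noise strength need to be
# transmitted; the decoder regenerates grain with the same character
# but different values. Coefficients here are illustrative, not AV1's.

import random

def synthesize_grain(coefs, noise_std, length, seed):
    """Generate grain where each sample depends on the previous ones."""
    rng = random.Random(seed)
    grain = [0.0] * len(coefs)  # zero history to start
    for _ in range(length):
        predicted = sum(c * g for c, g in zip(coefs, grain[-len(coefs):]))
        grain.append(predicted + rng.gauss(0.0, noise_std))
    return grain[len(coefs):]

params = {"coefs": [0.6, -0.2], "noise_std": 2.0, "length": 100}
a = synthesize_grain(seed=1, **params)
b = synthesize_grain(seed=2, **params)
# Different seeds give different grain with the same statistical look.
print(a == b, len(a))  # False 100
```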

4 Status and Conclusion

The AV1 bitstream was finalized in 2018. Evaluations of AV1's efficiency are still ongoing. Preliminary results show AV1 succeeding in at least matching the efficiency of HEVC for most inputs, and typically improving upon it by 20% to 30%.[11] The greatest technical hurdle remaining is speed optimization, primarily for encoding: performance is still significantly worse than that of competing codecs. This is an area of active development and it is uncertain where performance will settle in mature encoders.

Provided improved encoding speeds can be attained, AV1 is well positioned to become an industry standard in the coming years. With support from the largest tech companies in all segments of the industry, from hardware to software to content providers, and a repertoire of new technologies to increase efficiency, AV1 is poised to finally become the long-desired royalty-free format for video on the modern web.

References

[1] MPEG LA, “AVC Attachment 1.” https://www.mpegla.com/wp-content/uploads/avc-att1.pdf.

[2] MPEG LA, “ZTE sued for AVC patent infringement.” https://www.mpegla.com/wp-content/uploads/ZTE-Infringement-PrsRls-2017-02-15.pdf.

[3] MPEG LA, “Huawei sued for AVC patent infringement.” https://www.mpegla.com/wp-content/uploads/Huawei-Infringement-PrsRls-2017-02-15.pdf.

[4] “When will HTML 5 support <video>? Sooner if you help.” https://www.w3.org/blog/2007/12/when-will-html-5-support-soone/.

[5] Xiph.org, “Next generation video: Introducing AV1,” 2018. https://people.xiph.org/~xiphmont/demo/av1/demo1.shtml.

[6] P. K. Papadopoulos, M. G. Koziri, N. Tziritas, T. Loukopoulos, I. Anagnostopoulos, P. Šaloun, and D. Andrešič, “On the evaluation of coarse grained parallelism in AV1 video coding,” in 2018 13th International Workshop on Semantic and Social Media Adaptation and Personalization (SMAP), pp. 55–59, Sep. 2018.

[7] Y. Chen, D. Mukherjee, J. Han, A. Grange, Y. Xu, Z. Liu, S. Parker, C. Chen, H. Su, U. Joshi, C. Chiang, Y. Wang, P. Wilkins, J. Bankoski, L. Trudeau, N. Egge, J. Valin, T. Davies, S. Midtskogen, A. Norkin, and P. de Rivaz, “An overview of core coding tools in the AV1 video codec,” in 2018 Picture Coding Symposium (PCS), pp. 41–45, June 2018.

[8] L. Trudeau, N. Egge, and D. Barr, “Predicting chroma from luma in AV1,” in 2018 Data Compression Conference, pp. 374–382, March 2018.

[9] S. Midtskogen and J. Valin, “The AV1 constrained directional enhancement filter (CDEF),” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1193–1197, April 2018.

[10] A. Norkin and N. Birkbeck, “Film grain synthesis for AV1 video codec,” in 2018 Data Compression Conference, pp. 3–12, March 2018.

[11] D. Grois, T. Nguyen, and D. Marpe, “Coding efficiency comparison of AV1/VP9, H.265/MPEG-HEVC, and H.264/MPEG-AVC encoders,” in 2016 Picture Coding Symposium (PCS), pp. 1–5, Dec. 2016.
