<<

CALIFORNIA STATE UNIVERSITY, NORTHRIDGE

Optimized AV1 Inter Prediction using Binary classification techniques

A graduate project submitted in partial fulfillment of the requirements for the degree of Master

of Science in Engineering

by

Alex Kit Romero

May 2020

The graduate project of Alex Kit Romero is approved:

______

Dr. Katya Mkrtchyan Date

______

Dr. Kyle Dewey Date

______

Dr. John J. Noga, Chair Date

California State University, Northridge

Dedication

This project is dedicated to all of the professors that I have come in contact with over the years who have inspired and encouraged me to pursue a career in computer science. The words and wisdom of these professors are what pushed me to try harder and accomplish more than I ever thought possible.

I would like to give a big thanks to the open source community and my fellow cohort of computer science co-workers for always being there with answers to my numerous questions and inquiries. Without their guidance and expertise, I could not have been successful.

Lastly, I would like to thank my friends and family who have supported and uplifted me throughout the years. Thank you for believing in me and always telling me to never give up.

Table of Contents

Signature Page ...... ii

Dedication ...... iii

List of Figures ...... vi

List of Tables ...... viii

Abstract ...... ix

Chapter 1: Introduction ...... 1

Chapter 2: Background ...... 4

Video Standards ...... 4
Intra-frame Prediction ...... 6
Inter-frame Prediction ...... 9
AVC ...... 10
VP9 ...... 12
HEVC ...... 13
AV1 ...... 15
Comparison of Video Standards ...... 16

Inter Prediction of AV1 ...... 18

Statistical Classification ...... 20

Chapter 3: Related Work ...... 22

Chapter 4: Classification Methods ...... 26

Support Vector Machines ...... 26

Decision Trees ...... 28

Chapter 5: Implementation – Languages, Frameworks, and Tools ...... 32

Training Sequences ...... 33

Scikit-Learn ...... 37

AV1 Reference Encoder ...... 40

Other helpful tools ...... 41

Chapter 6: Experimental Results ...... 44

Comparison of encoding complexity ...... 44

Conclusion ...... 46

Future Work ...... 47

References ...... 48


List of Figures

Figure 1 - I-Frame, P-Frame and B-Frame Example [6] ...... 5

Figure 2 - 4 x 4 Prediction block ...... 6

Figure 3 - 8 directional modes of AVC ...... 7

Figure 4 - Original block (left), DCT coefficients (right) ...... 8

Figure 5 - Quantization table (left), quantized DCT coefficients (right) ...... 8

Figure 6 - H.264 encoding/decoding process ...... 11

Figure 7 - H.264 Macroblock/Sub-macroblock partitions ...... 12

Figure 8 - VP9 Macroblock/Sub-macroblock partitions ...... 13

Figure 9 - HEVC encoding process [8] ...... 14

Figure 10 - GF Group using Multi-layer [10] ...... 18

Figure 11 - Four-step early-termination [12] ...... 23

Figure 12 - One-dimensional data plot ...... 26

Figure 13 - Support Vector Machine Diagram ...... 27

Figure 14 - Basic Decision Tree structure ...... 28

Figure 15 - Gini Impurity formula ...... 29

Figure 16 - Iris Decision Tree example [17] ...... 30

Figure 17 - Technical Specs w/o SRFPM ...... 34

Figure 18 - Technical Specs /w CRFPM enabled ...... 34

Figure 19 - Encoding Time reduction formula ...... 35

Figure 20 - Rav1e --speed preferences ...... 35

Figure 21 - Side by Side comparison 720p video ...... 36

Figure 22 - Sklearn Decision Tree call ...... 41

Figure 23 - Sklearn classifier output - Decision Tree ...... 41

Figure 24 - Sklearn svm C-Vector call ...... 42

Figure 25 - Sklearn classifier output - SVM C-Vector ...... 42

Figure 26 - Sklearn svm LinearSVC call ...... 42

Figure 27 - Sklearn classifier output - LinearSVC ...... 43

List of Tables

Table 1 - AVC/VP9/HEVC/AV1 Comparison ...... 16

Table 2 - 10 sec encode H264/VP9/HEVC/AV1 [11] ...... 17

Table 3 - Example Data of Flu Symptoms ...... 29

Table 4 - System Specs ...... 32

Table 5 - 10 Sequences encoded /w Rav1e ...... 34

Table 6 - Selected Features ...... 38

Table 7 - Selected Features in textual format ...... 39

Table 8 - Decision Tree classification report ...... 40

Table 9 - Encoding Results ...... 44


Abstract

Optimized AV1 Inter Prediction using Binary classification techniques

By

Alex Romero

Master of Science in

Software Engineering

Thanks to the dynamic and improved encoding structures of the AOMedia Video 1 (AV1) codec, considerably higher compression efficiency has been achieved. Developed by the Alliance for Open Media, AV1 was designed to become an alternative to the widely used

High Efficiency Video Coding (HEVC) format. The significant increase in efficiency compared to other video standards is overshadowed by a significant increase in computational complexity.

In its initial release, the AV1 reference encoder took an average of 63 hours to encode a 5-minute clip. This works out to roughly 45000x real-time. To compare, HEVC took an average of 58x, while AVC only took 4x. This graduate project explores the various classification methods available that can help to reduce encoding time. Optimal training data sets are collected and features are chosen based on the trained data. An alternative to the Rate-Distortion (R-D) optimization used for choosing the prediction mode in inter-frame prediction is proposed that ultimately results in a reduction in encoding time of roughly 30-50%. The proposed method is integrated seamlessly into the AV1 reference encoder with limited impact on compression efficiency.

Chapter 1: Introduction

The objective of the proposed project is to analyze the viability of applying machine learning techniques to the AV1 video codec and its reference implementation. The goal of applying machine learning to the AV1 codec in this project is to ultimately reduce the time it takes to encode a video. There have been a number of different experiments done by computer scientists in the past that apply novel machine learning techniques to one or more particular aspects of a video codec. The common goal of these experiments is to replace a pre-existing feature or process that is used within the codec with a more efficient, intuitive process that enhances that particular codec in some way. This project focuses on the improvement of one particular feature of the AV1 codec known as inter-prediction. In later sections, more time will be spent on defining what inter-prediction is and why it is important. To help explain why this particular experiment is important for the improvement of the AV1 codec, the backstory of the AV1 codec and the "Codec War" that has taken place in recent years needs to be explained.

In September of 2015, a new open, royalty-free video coding format named AV1 was developed by the Alliance for Open Media (AOMedia) [1], a consortium of 30+ partners that includes large multi-national technology companies such as Amazon, Apple, Google, Microsoft, and Netflix. The primary goal of its creation was to offer an alternative to the High Efficiency Video Coding format (HEVC), which is the leading video compression standard created by the Moving Picture Experts Group

(MPEG). HEVC, which is the successor of the Advanced Video Coding format (AVC), boasts a hefty 25% to 50% increase in compression efficiency at the same bit rate as AVC. Compression efficiency is extremely important in this golden age of video streaming. According to Cisco's

2018 Visual Networking Index, "By 2022, video will make up 82 percent of all global IP traffic" [4]. Although network speeds have significantly improved over the years, the increase in the volume of 4K and 8K video content being created will rely heavily on compression efficiency to produce optimal results. For example, a typical 4G connection on a mobile device will on average give an individual a data rate, or bandwidth, of 20-50 Mbps (megabits per second). The bit rate for a 4K video is 66-85 Mbps using the AVC codec. However, to get the same quality using HEVC, the bit rate can be cut in half to 33-42.5 Mbps. Ultimately, this has led to individuals being able to stream 4K video on a train or in a car using only a 4G connection. Theoretically, a 5G connection is said to top out at 10 Gbps, or gigabits per second, which would be more than enough to handle any type of video size using any codec. However, the reality is that this type of speed is only theoretical at this time, and true 5G speeds are being clocked at a much lower rate. The emergence of 8K UHD video will quadruple the bit rate requirements compared to 4K, which makes compression efficiency the most important factor in video compression at this time and in the near future.

The big problem with HEVC is that it comes with considerably more expensive patent licensing fees. The cost to use HEVC can be over ten times the amount compared to AVC [2]. Ultimately, the price increase led to AOMedia's development of AV1. AV1 itself is the culmination of multiple royalty-free video codec projects including VP9, Daala, and Thor [3]. Since the release of AV1's reference encoder in late 2015, a number of alternative encoder/decoder implementations have been released. Rav1e, an encoder written in Rust and Assembly, has made some significant gains in reducing encoding complexity. Intel's SVT-AV1 encoder, which is designed for larger data centers, boasts significant complexity reduction as well. However, none of these newer encoders have reduced the encoding time of AV1 enough to directly compete

with HEVC. As of this time, AV1 is still viewed as a codec in its experimental phase due to its slow compression speed. The actual speed or "complexity" differences that set AV1 apart from other modern codec formats will be examined in the upcoming chapters. Ultimately, the motivation to reduce complexity (i.e. make compression speeds faster) is to streamline AV1 into a usable format that content creators and technology companies alike can integrate into their workflows without the risk of slowing down production.

Chapter 2: Background

Video and audio compression have been driving forces in almost every person's daily life for the past two decades. Every sector of compression (data, video, and audio) has seen monumental growth. In order to keep up with demand, video compression standards in particular have evolved over the past decades to become increasingly more advanced and efficient. In order to understand AV1 and its role in today's society, a brief history, overview, and comparison of the major video standards and their features are presented in the sections below.

Video Standards

In 1988, long before AV1 came to fruition, a group called the Moving Picture Experts Group

(MPEG) was established to set the standards for audio and video compression and transmission

[5]. In 1992, the Moving Picture Experts Group established MPEG-1, which is the coding standard that is used to encode Video CDs (VCD) and MP3 audio files. A few years later, MPEG-2 was released and became the standard for DVDs and digital television broadcasts in its time. The compression technique advancements from these formats would eventually lead to the most widely used compression standard used today, known as H.264 or Advanced Video Coding (AVC).

A compression standard or "codec" can be categorized as either lossless or lossy. Lossless compression, which is when data can be compressed and reconstructed with no quality loss, is rarely used in modern video workflows and streaming. The decrease in the size of a lossless file is minuscule in comparison to a lossy file, as it only removes certain metadata that does not directly affect the content itself. All of the video standards mentioned in this project including

AV1 are defined as lossy formats. There are two key methods, known as Inter-frame and Intra-frame coding, that every lossy video codec uses to reduce the amount of stored data. While the

primary focus of this project is improving the Inter-frame process of AV1, an understanding of both Inter-frame and Intra-frame coding needs to be established.

As many have learned before, video is essentially made of a series of images put together to illustrate motion. These images, also known as frames, are the building blocks of every video file. In a lossless video file, every frame is unique and identical to the original source (as mentioned previously). In a lossy video file, there are typically different types of frames that hold unique information depending on the type. An illustration of the different types of frames can be seen in Figure 1 below.

Figure 1 - I-Frame, P-Frame and B-Frame Example [6]

Typically, a frame is cut up into smaller sections known as macroblocks, and each macroblock holds a set amount of information. For example, in the MPEG-1 codec every frame is divided into many macroblocks, with each block holding 64 distinct values that define each pixel in it. The first frame in Figure 1 is the I-frame, where the "I" stands for Intra, as in Intra-frame prediction. An intra-coded frame is also known as a "keyframe" and is defined as a complete image where every macroblock in it is processed using intra-frame compression techniques. The next frame in the figure is the P-frame, where the "P" stands for Predicted. A

P-frame consists of macroblocks that are processed using either Intra-frame prediction or Inter-frame prediction. Macroblocks processed using Inter-frame prediction in a P-frame use motion vectors to calculate the differences between the current frame and the "anchor" frame (the frame immediately before it), and use that residual data to process the current frame. As opposed to intra-frame prediction, which exploits spatial redundancy, inter-frame prediction takes advantage of temporal

redundancies, and uses information from a previous or future frame to process a macroblock.

The last frame type in the figure is a B-frame, where the "B" stands for bidirectionally predicted frame. A B-frame generally requires less data when compared to both I-frames and P-frames, as macroblocks in a B-frame can be processed using Intra-Prediction and Inter-Prediction in both directions (meaning data from the anchor frame or the future frame can be used). The problem with B-frames is that no other frames can be predicted from a B-frame. Also, since a B-frame uses inter-prediction bidirectionally, a larger buffer is required to decode it. Most of the newest codecs do not use B-frames; however, they do use newer unique hybrid frame types that can process macroblocks in a multitude of different ways.

Intra-frame Prediction

Intra-frame prediction techniques used in video encoding are very similar to the techniques used to compress an image into a lossy file type such as JPEG. The intra-frame compression pipeline for most codecs follows these steps when compressing a macroblock. Step

One is to convert or transform the color information that defines a macroblock into the correct format. For example, MPEG-1 partitions a frame into 8x8 macroblocks where the color information for each pixel is typically in an RGB format. MPEG-1 would convert the RGB data into the YUV format, which basically separates the "Y", the brightness or luminance, from the color information (UV). Step Two is where the exploitation of spatial redundancies occurs, through extrapolation from neighboring pixels in a macroblock. The dark shaded cells in Figure 2 below illustrate the group of pixels that can be predicted using extrapolation.
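To make Step One concrete, the short sketch below converts an RGB block to a Y'CbCr ("YUV") representation. It is only an illustration: the BT.601-style coefficients shown here are one common choice, and the block contents are random placeholder values rather than data from any real codec.

    import numpy as np

    # Step One (illustrative): separate brightness (Y) from color information (Cb, Cr).
    def rgb_to_ycbcr(rgb):
        rgb = rgb.astype(np.float64)
        r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
        y = 0.299 * r + 0.587 * g + 0.114 * b                # luminance
        cb = -0.168736 * r - 0.331264 * g + 0.5 * b + 128    # blue-difference chroma
        cr = 0.5 * r - 0.418688 * g - 0.081312 * b + 128     # red-difference chroma
        return np.stack([y, cb, cr], axis=-1)

    block = np.random.randint(0, 256, size=(8, 8, 3))         # hypothetical 8x8 RGB macroblock
    ycbcr = rgb_to_ycbcr(block)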

Figure 2 - 4 x 4 Prediction block


A pixel can only be predicted if it has both an upper and a left neighbor (i.e. the lowercase "a" cell/pixel can be predicted since the uppercase "A" and "I" cells/pixels are available). In the newer video codec formats like AVC and HEVC, the encoder provides multiple intra prediction modes to determine a particular pixel's information. In AVC, also known as MPEG-4 Part 10, there are a total of nine different intra-prediction modes that can be leveraged to find the best chroma or luminance match of a particular pixel. These modes consist of eight directional modes and DC mode, which basically averages all of the neighboring pixel values around a particular cell/pixel and uses that average as the prediction. Figure 3 below illustrates the eight different directional modes that the encoder can use to make the best prediction of a given pixel (i.e. the encoder can look diagonally right at a pixel that might have the closest match to a particular luminance).

Figure 3 - 8 directional modes of AVC

As can be seen in Figure 3, "0" points completely vertical, while "1" is horizontal ("2" is missing in the diagram as it is DC prediction and does not have a direction). In newer codecs like AV1 and HEVC, the number of prediction modes increases even further. Step Three can occur once the pixel information of a block has been predicted and is in the necessary format. In this step, DCT

(discrete cosine transform) or some other type of transform is used to convert the pixel data into a representation that separates the important information about the macroblock from the less important information.

Figure 4 below illustrates how an 8x8 macroblock (luminance) might look before and after a DCT occurs.

Figure 4 - Original data block (left), DCT coefficients (right)

As seen in Figure 4 above (right), the higher frequency cosine waves are scored at much smaller values than the lower frequency cosine waves (the low frequency waves start in the top left corner of the matrix and move diagonally to the highest frequency waves at the bottom right corner). This occurs because the higher frequency components typically contribute very little to an average

8x8 macroblock. Step Four can occur after transforming a macroblock. In this step, a quantization table can be applied to the coefficients to “zero out” some of the higher frequency data. Figure 5 below represents a typical quantization table and what the quantized DCT coefficients look like after it is applied.

Figure 5 - Quantization table (left), quantized DCT coefficients (right)


A quantization table is typically created based on the QP value that is given during the encoding process and basically dictates the quality of the processed macroblock. Depending on the QP value, more or fewer of the DCT coefficients will become zero. By removing more values, the compression rate becomes greater, but the image quality becomes worse. Figure 5 (right) illustrates what happens to the data in Step Five after it is quantized, which is known as Zig Zag scanning.

Step Five basically scans the data diagonally in a zig zag direction from the top left, low frequency quantized coefficients, down to the bottom right, high frequency quantized coefficients. As seen in Figure 5 above, this rearranges the order of the values, with the larger numbers typically in the front followed by a large number of smaller numbers (in this case zeros). In Step

Six, grouped numbers (i.e. when multiple zeros occur in succession) can be stored together as tuples instead of as individual numbers using a process called run-length encoding. Also, the more frequent values can be encoded with fewer bits, while the less frequent values are encoded with more bits. This process is known as entropy encoding. When decoding or decompressing an intra-frame encoded macroblock, these steps are done in reverse from Step Six to Step One.
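To make Steps Three through Six concrete, the sketch below pushes a single 8x8 block through a DCT, quantization, a zig-zag scan, and run-length encoding. It is a simplified illustration only: the quantization table is invented for the example, and real codecs use integer transforms and proper entropy coders rather than this toy pipeline.

    import numpy as np
    from scipy.fft import dctn

    block = np.random.randint(0, 256, size=(8, 8)).astype(np.float64) - 128   # centered sample data

    coeffs = dctn(block, norm='ortho')                                        # Step Three: 2-D DCT

    # Step Four: quantization with a made-up table (larger divisors for higher frequencies).
    quant_table = np.full((8, 8), 16) + 4 * np.add.outer(np.arange(8), np.arange(8))
    quantized = np.round(coeffs / quant_table).astype(int)

    # Step Five: zig-zag scan from the low-frequency (top-left) corner to the high-frequency corner.
    order = sorted(((i, j) for i in range(8) for j in range(8)),
                   key=lambda ij: (ij[0] + ij[1], ij[0] if (ij[0] + ij[1]) % 2 else -ij[0]))
    scanned = [int(quantized[i, j]) for i, j in order]

    # Step Six: run-length encoding, storing repeated values (mostly zeros) as (value, run) tuples.
    rle = []
    for v in scanned:
        if rle and rle[-1][0] == v:
            rle[-1] = (v, rle[-1][1] + 1)
        else:
            rle.append((v, 1))
    print(rle[:10])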

Inter-frame Prediction

Inter-frame prediction techniques used in video coding capitalize on the ability of an encoder to look at previous or future frames in a video sequence. Just like in Intra-frame prediction, an inter-coded frame is partitioned into macroblocks. For each macroblock, the encoder will search for a macroblock similar to the current block in previous or future frames. These frames are referred to as reference frames, and every codec provides a number of different frames near the current frame that can be used for reference. There are different macroblock search algorithms that are used to find and compare macroblocks, such as

Exhaustive, Diamond and Four Step Search (greater detail into the algorithms is not essential to

the scope of this project). If the encoder finds a close match, the current block can be encoded by a motion vector, which holds the position of the macroblock being referenced. The match found might not be an exact match of the current macroblock, and in this case the encoder will calculate the differences between the current and reference macroblock. The difference values between the original and reference macroblock are known as the prediction error, and the prediction error must go through many of the same steps that are done in Intra-frame prediction.

The prediction error consists of much less data than a complete intra-frame encoded macroblock, as it only contains the value differences between two similar macroblocks. The prediction error goes through a transform (i.e. DCT), quantization, and the run-length/entropy encoding phases just like in intra-prediction. The decoding process works the same as in intra-frame prediction, where all of the steps are done in reverse. Once a prediction error is decoded, it is used along with the motion vector to recreate the macroblock. If a suitable reference macroblock cannot be found for the current macroblock (meaning the size of the motion vector plus the prediction error is greater than the size of the current block), then the macroblock will be encoded using intra-frame prediction instead. One of the drawbacks of Inter-frame prediction is that an encoder can take a considerable amount of time searching for a suitable reference macroblock, especially when it fails to find one. More details about AV1 Inter-frame prediction will be presented in a later section.
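A minimal sketch of the block matching described above is shown below, using an exhaustive search with the sum of absolute differences (SAD) as the matching cost. The frames, block size, and search range are hypothetical; real encoders use faster searches such as the diamond or four-step algorithms mentioned above, plus sub-pixel refinement.

    import numpy as np

    def find_motion_vector(current, reference, top, left, size=16, search=8):
        """Exhaustive search for the best-matching block in a reference frame."""
        cur_block = current[top:top + size, left:left + size].astype(int)
        best_mv, best_sad = (0, 0), float('inf')
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                y, x = top + dy, left + dx
                if y < 0 or x < 0 or y + size > reference.shape[0] or x + size > reference.shape[1]:
                    continue
                sad = np.abs(cur_block - reference[y:y + size, x:x + size].astype(int)).sum()
                if sad < best_sad:
                    best_sad, best_mv = sad, (dy, dx)
        return best_mv, best_sad

    cur = np.random.randint(0, 256, (64, 64), dtype=np.uint8)   # hypothetical current frame
    ref = np.random.randint(0, 256, (64, 64), dtype=np.uint8)   # hypothetical reference frame
    mv, sad = find_motion_vector(cur, ref, top=16, left=16)
    # The prediction error (residual) is what is then transformed, quantized, and entropy coded.
    residual = cur[16:32, 16:32].astype(int) - ref[16 + mv[0]:32 + mv[0], 16 + mv[1]:32 + mv[1]].astype(int)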

AVC

Advanced Video Coding (AVC), also known as H.264 or MPEG-4 Part 10, is a video coding standard developed by the ITU-T Video Coding Experts Group and the Moving Picture Experts

Group (MPEG). It was developed to provide lower bit rates compared to the previous standards set by MPEG-2 and H.263. AVC is a hybrid coding format, which means it combines the discrete cosine transform (DCT) with motion-compensated predictive coding (also known as Inter-frame

prediction). As seen in Figure 6 below, the AVC encode/decode process starts by making predictions based on either Intra-Prediction (from the current frame) or Inter-Prediction (from other frames that have been encoded). If a macroblock is encoded using Inter-prediction, the encoder subtracts the predictions from the current block to form a residual. A block of residual samples is then transformed using a 4x4 or 8x8 integer DCT (intDCT).

Figure 6 - H.264 encoding/decoding process

AVC uses an integer Discrete Cosine Transform (intDCT) instead of the standard DCT that is used by its predecessor MPEG-2 in order to increase compression efficiency. Having selectable sizes of 4x4 or 8x8 macroblocks and using integer approximations of the standard DCT leads to an overall reduction of complexity (complexity ultimately refers to processing/encoding time). After the intDCT process is complete, the block of transform coefficients is quantized according to a quantization parameter (QP) value. A higher QP value leads to higher compression but lower decoded image quality, while a lower QP value results in better decoded image quality at the cost of lower compression. In the decoding process, the now compressed information, which includes the quantized transform coefficients along with other information that the decoder needs (such as the motion vector), is sent through this process in reverse to finally result in the expected video output. As this project's primary focus is on reducing encoding complexity, no further

elaborations on the decoding process will be made. The AVC/H.264 encoding process became the standard process that the majority of future coding formats were modeled after. This includes the main predecessor of AV1, Google's VP9.

VP9

VP9, which was initially released by Google in June 2013, was developed to be the primary, royalty-free alternative to HEVC/H.265. The design goal of the VP9 format was to further increase compression efficiency compared to VP8. VP9 distinguished itself from its MPEG and VP predecessors by increasing the block size variations in inter/intra prediction and by using three different types of transforms. To compare, Figure 7 below illustrates all of the possible size variations available for AVC/H.264. The sizes 16x16 or

4x8 refer to the number of pixels in a particular macroblock (i.e. a 16x16 block would hold the information of 256 pixels in a given location).

Figure 7 - H.264 Macroblock/Sub-macroblock partitions


Figure 8 - VP9 Macroblock/Sub-macroblock partitions

Above, Figure 8 illustrates that VP9 macroblock partitions can be as big as 64x64 and as small as 4x4. This can be compared to H.264, which can only go from 16x16 to 4x4. The block sizes of VP9 differ slightly between Intra and Inter prediction. A performance analysis published in 2017 detailed that intra-prediction block sizes can range from 4x4 to 32x32 pixels with 10 modes, while inter-prediction block sizes can go from 4x4 up to 64x64 [7]. The more variations and options an encoder can choose from ultimately leads to greater compression efficiency (because finding a closer match to a particular block becomes easier with more variations). In addition to increased block size variations, VP9 uses three transforms: the Discrete Cosine Transform (DCT),

the Asymmetric Discrete Sine Transform (ADST) and the Walsh-Hadamard Transform (WHT) [7].

Using various transforms for different block situations adds to increased compression efficiency as well. Although VP9 still required roughly two times the bitrate to reach video quality comparable to HEVC, it would serve as an excellent building block for its successor, AV1.

HEVC

The latest video compression format, which was standardized by a collaboration of MPEG and the International Telecommunication Union Telecommunication Standardization Sector (ITU-T), is known as High Efficiency Video

Coding (HEVC) or H.265. HEVC is the second most widely used video coding format behind

AVC, and boasts significantly better data compression over its predecessor (AVC). HEVC

introduced a brand-new type of basic processing unit called CTUs, or coding tree units, that essentially replaced macroblocks. Similar to macroblocks, CTUs can start as a large 64x64 pixel block and can be sub-partitioned into blocks as small as 4x4. This may sound similar to VP9's macroblock partitioning; however, the process of subdivision works very differently in HEVC.

In Figure 9 below, the HEVC encoding process is shown in greater detail.

Figure 9 - HEVC encoding process [8]

In HEVC, the video source macroblock or CTU is chosen by the encoder to have a size of 64x64 to 16x16. From there, the block is further subdivided into CTBs, or coding tree blocks, and CTBs are divided into one or more coding units (CUs). At the CU level, the decision of whether to encode a picture area using inter or intra prediction is made. Once a prediction type is established, the CU is then divided into prediction units, or PUs [8]. In AVC/H.264, intra-prediction is limited to only 8 directional modes, while HEVC has 33 directional modes to choose from (in addition to DC and planar prediction modes). With more directional modes, predicting a particular pixel becomes much more accurate, as more options usually lead to a closer prediction. Inter-prediction in HEVC was improved over AVC and uses a new feature called

Adaptive Motion Vector Prediction. Compared to previous MPEG iterations, Adaptive Motion

Vector Prediction allows for much more picture information/data to be acquired. A research paper on the adaptive motion vector prediction process of HEVC indicates that it can adaptively select the optimal motion vector resolution at the frame level, according to the various characteristics of the video contents [9]. A positive result of adaptive motion vector prediction is that it increases encoding efficiency while not significantly raising the cost/computing time of the prediction. HEVC uses only two transforms: an intDCT similar to AVC, and the discrete sine transform (DST). These transforms are used in transforming the residual data from a PU into the coefficients needed in the decoding process.

AV1

AOMedia Video 1 (AV1) is the newest of all of the video standards looked at in this project, and it was designed specifically to be the direct competitor of HEVC. As stated earlier, the AV1 coding format is the culmination of the various technological advances of a number of different royalty-free video coding projects, including Google's VP9, Xiph.Org's Daala, and Cisco's Thor.

In order to compete with HEVC, many improvements to Inter/Intra prediction were made to increase compression efficiency. AV1 leverages an assortment of brand-new coding techniques.

In macroblock partitioning, AV1 increased the maximum macroblock size from VP9's 64x64 to a superblock size of 128x128, which can again be partitioned to be as small as 4x4 pixels. The intra-prediction of AV1 boasts eight main directional modes that can be chosen and can be set to different angles in 22.5-degree increments. This results in a total of 56 different angles, which is ultimately more directions than HEVC's 33 directional modes if all the various angles are accounted for. In addition, the granularity of directional extrapolation is upgraded, and non-directional predictors are enriched by taking into account gradients and evolving correlations

[10]. Inter-prediction was also changed in AV1 compared to its predecessors. While VP9 had

the ability to select up to two references amongst three candidates for reference frames, AV1 extends its reference pool from three to seven [10]. A deeper dive into the inner workings of

AV1 inter-prediction will be done later in this chapter as it is the primary focus of this project.

Lastly, AV1 allows for 16 different transform types, compared to the four that VP9 offered. The different types used by AV1 are the Discrete Cosine Transform (DCT), the Asymmetric Discrete Sine Transform (ADST),

the Flip-Asymmetric Discrete Sine Transform, and the IDTX/Identity transform. The AV1 codec provides these numerous transforms as each one excels in different situations (i.e. IDTX, or no transform, works best when a block has sharp edges). In total, this makes four types of transforms that can be used independently on the horizontal and vertical axes, allowing for a total of 16 different combinations of transforms (4x4). In addition, a transform in AV1 is not constrained to a square shape, as both square and rectangular transforms can be used depending on the situation.
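As a small illustration of that count, the snippet below simply enumerates the 4 x 4 horizontal/vertical pairings; the short names are informal labels for this example, not identifiers from the AV1 source code.

    from itertools import product

    kernels = ["DCT", "ADST", "FLIP_ADST", "IDTX"]    # the four 1-D transform types
    combinations = list(product(kernels, kernels))     # (horizontal, vertical) pairs
    print(len(combinations))                           # 16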

Comparison of Video Standards

The features and improvements touched on in the previous sections are only a glimpse of some of the new elements and features introduced with AVC, VP9, HEVC and AV1 respectively. A comparison of some of the other main features can be seen in Table 1 below.

Table 1 - AVC/VP9/HEVC/AV1 Comparison

In Table 1 above, the picture type refers to the standard frame types available for each codec (I-

Frame, P-Frame, B-Frame). In addition to these types, other hybrid frame types are used that are unique to a particular coding format. The dimensions listed in Table 1 above refer to the possible block sizes. A number of different studies have been done to test and compare these four unique video codecs in order to gauge compression efficiency and complexity (i.e. encoding time). A recent research study was done by Cisco that encoded a ten-second 4k file using the four different codecs looked at in the previous sections. The results of this study can be seen in

Table 2 below.

Table 2 - 10 sec encode H264/VP9/HEVC/AV1 [11]

The encoding times in Table 2 above are in hours, minutes, and seconds format, while the file size is in kilobytes. In Table 2 above, VMAF, or Video Multimethod Assessment Fusion, is a video quality metric that scores the quality of a particular video file from 0-100 (the higher the score, the better the overall quality). This study was done in March 2019, roughly a year and a half after AV1's feature freeze occurred in October 2017. From Table 2 above, it can be clearly seen that AV1 still has a significantly higher encoding time compared to the other competing video coding formats. HEVC, which had a roughly one-minute slower encoding time compared to

VP9, counterbalances this increase with considerably higher coding efficiency. The results of the research in Table 2 above indicate that AV1 will need to drastically decrease encoding

complexity (i.e. encoding time) in order to be successful as a video standard. A deeper dive into

AV1’s inter-prediction is presented in the next section.

Inter Prediction of AV1

One of the biggest changes of AV1 compared to VP9 is how it processes its inter-predictions.

Inter-predictions, as stated previously, are predictions that leverage temporal redundancies that can be found between a particular frame and its neighboring frames (i.e. previous frames or future frames). In AV1 Inter-prediction, an encoder first chooses one frame (Single Reference Frame

Prediction) or two frames (Compound Reference Frame Prediction) for prediction from a candidate pool of seven reference frames. Previously, VP9 had two past frames called

LAST (nearest past) and GOLDEN (distant past) frames, as well as one future frame called

ALTREF (temporally filtered future). The GOLDEN_FRAME is a past frame that was predicted with intra prediction. AV1 introduces two more past frames called LAST2 and LAST3, and two more future frames called BWDREF (which does not apply temporal filtering) and ALTREF2.

ALTREF2 acts as an intermediate filtered future reference between the GOLDEN/KEY frame and ALTREF [9]. Figure 10 below has been used in multiple research papers and presentations to illustrate single/multi-layer Golden-Frame groups in AV1.

Figure 10 - GF Group using Multi-layer [10]

Figure 10 above illustrates how a Golden Frame group can look using Compound Reference

Frame Prediction Mode (CRFPM), where two reference frames are used for prediction. The

colors of the arrows in Figure 10 above are different to illustrate the different frames that the key/golden frame can point to (directly or through an ALTREF intermediate frame). Prediction modes in AV1 can be bi-directional or uni-directional, meaning that reference frames could both be past frames or possibly both future frames. In the case of VP9, only bi-directional prediction is available, which requires one past and one future frame. After the reference frame or frames are determined, the next step in the inter-prediction process is motion vector prediction. In Single

Reference Frame Prediction Mode (SRFPM), there are four motion vector candidates used, which are: NEARESTMV (motion vector from the nearest neighbor block), NEARMV (motion vector from a neighbor of the nearest neighbor block), NEWMV (motion vector generated from the current block) and

GLOBALMV (motion vector generated for the whole frame). These spatial and temporal motion vector candidates are sorted and ranked from closest to least close match of the current block. After choosing the highest ranked motion vector, AV1 signals the index of the selected reference motion vector from the pool. In Compound Reference Frame Prediction

Mode (CRFPM), there are a total of eight motion vector candidates looked at, which are combinations of the previous candidates mentioned. The AV1 encoder employs rate-distortion optimization in order to decide whether to use SRFPM or CRFPM. Rate-distortion optimization has been known to cause some encoding speed issues due to the exhaustive measures it takes in order to find the most optimal choice. In Rate-Distortion optimization (RDO), an encoder makes decisions about which prediction mode to use and how many bits will be needed to encode a particular block based on cost. For every macroblock, Rate-Distortion optimization is used to calculate the respective costs of using CRFPM and SRFPM before the final decision is made.
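In general terms (this is a generic formulation, not the exact expression implemented in the AV1 reference encoder), the cost minimized by such a rate-distortion search can be written as:

    J_{\text{mode}} = D_{\text{mode}} + \lambda \, R_{\text{mode}}, \qquad
    \text{mode}^{*} = \arg\min_{\text{mode} \in \{\text{SRFPM},\,\text{CRFPM}\}} J_{\text{mode}}

where D is the distortion of the reconstructed block, R is the number of bits required to signal the mode and its data, and lambda is a Lagrange multiplier tied to the quantization parameter; the mode with the smaller cost J is selected.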

The time it takes to do this ultimately leads to slower encoding time. AV1 added other predictors like Compound Wedge Prediction, Difference Modulated Masked Prediction and even

Compound Inter-Intra Prediction (where Intra and Inter prediction are both used together in combination) that have not been seen in its predecessors. Due to the large number of new features added to AV1 Inter-prediction (which are substantially used in Compound Reference Frame

Prediction Mode), AV1 encoding takes significantly longer when compared to VP9. If the Rate-

Distortion optimization system itself could be replaced with a less exhaustive prediction method, encoding speeds would definitely improve. In recent years, research into newer, innovative methods of prediction has led to the application of Machine Learning algorithms to video coding.

One machine learning approach being researched for this type of application is known as

Statistical classification.

Statistical Classification

In machine learning, Statistical classification is a method used to build predictive models that are used to classify unstructured data. The set of data used to train a predictive model is called a training set. From a given training set, features are collected, which are ultimately used to accurately form predictions. Feature selection can be automated or done manually, and features are normally chosen based on factors related to the accuracy of the model. Using a training set to map input-output pairs is known in machine learning as supervised learning, which is the category of machine learning related to classification. For this project, binary classification was used to ultimately predict whether a pixel block should be encoded using

Single Reference Frame Prediction Mode (SRFPM) or Compound Reference Frame Prediction Mode

(CRFPM). Every type of supervised learning algorithm that is available can be structured to be binary, or having only two choices. The most commonly used classification algorithms today would be Linear Classifiers, Support Vector Machines, Kernel Estimation, Decision Trees and

Neural Networks. All of the different algorithms have their own strengths and weaknesses,

depending on the situation. For example, in terms of accuracy, Neural Networks have been known to be extremely effective at prediction as they mimic how biological neural networks operate. A tradeoff would be that neural networks have a considerably high complexity cost

(complexity cost refers to the total runtime and amount of processing power). As the premise of this project is to optimize the AV1 reference encoder by reducing complexity, an algorithm that has a relatively low complexity cost is ideal. The algorithm needs to be accurate enough to mimic the efficiency of the current Rate-Distortion optimization (RDO) process, while keeping the complexity cost low. The classification methods chosen will be detailed in a later section.
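As an illustration of how this binary problem can be framed, the sketch below builds a feature matrix with one row per sampled macroblock and a label taken from the encoder's own RDO decision. The array shapes and feature values are placeholders, not the data or features actually used in this project; any of the classifiers discussed in the next chapter can then be fitted to the training split.

    import numpy as np
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.random((4000, 8))       # 4000 sampled macroblocks, 8 hypothetical features each
    y = rng.integers(0, 2, 4000)    # ground truth from RDO: 0 = SRFPM, 1 = CRFPM

    # Hold out part of the data so prediction accuracy can be checked before a model
    # is transplanted into the encoder.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)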

Chapter 3: Related Work

As AV1 is a relatively new video coding format, there is only a small amount of research that has gone into improving its performance. However, over the years there have been numerous studies involving the use of machine learning to improve the functionality and efficiency, and to reduce the complexity, of various past video coding formats. The machine learning research dates back to the first version of MPEG-2. In February 2008, a research paper was published documenting a fast macroblock-mode decision algorithm for Inter-frame prediction to be part of an MPEG-2 to

H.264 video transcoder [12]. In this study, the coding modes determined during the MPEG-2 decoding stage were used to build Decision Trees. The training sets used the mean and variance of various block sizes to accurately predict which macroblock mode would be selected by the rate-distortion optimization (RDO) process. The results of this Inter-frame prediction research were close to a 95% reduction in complexity (compute time and power consumption), with limited to no efficiency loss. Since the High Efficiency Video Coding (HEVC) format started long before AV1 did, there have been many more research studies on it. Most of the studies focus on leveraging different machine learning techniques to achieve greater efficiency and reduce complexity (compute time and power consumption).

Although not on the same scale as AV1, the new features of the HEVC format add a great deal of additional complexity when compared to the Advanced Video Coding (AVC) format (as stated earlier, AVC is the predecessor to HEVC). A research study in 2014 developed a four-step algorithm for early termination of inter-frame prediction using Decision Trees [13]. In this study, blocks encoded as Merge and SKIP (two different Inter-prediction modes in HEVC) were

put in one group called MSM mode. The sequence of decision trees can be seen in Figure 11 below.

Figure 11 - Four-step early-termination algorithm [12]

In this four-step process, Test One checks for early Merge or SKIP mode. It terminates early if one of these modes is used. Test Two then checks for possible 2N x 2N early termination, and

Test Three does the same for 2N x N. The last test (Test Four) checks for N x 2N. This experiment again resulted in significant complexity (compute time and power consumption) savings, with the overall complexity reduction ranging from 37% to 66% [13]. In addition to using decision trees, other types of classification studies were done on HEVC. In 2016, a Fast

HEVC Intra-Mode Decision based on Logistic Regression Classification was researched [14].

The goal for this study was to reduce the computational complexity of intra CU splitting/unsplitting through early termination. As mentioned previously, a CU or coding unit is the new structure that replaced macroblocks in HEVC. After training the datasets, a total of seven candidate features were selected using 4 different CU sizes (64x64, 32x32, 16x16 and

8x8). Replacing the Rate-Distortion optimization (RDO) process, the built-in process that is used in HEVC's Intra prediction process, with the trained Logistic Regression classifier resulted in a significant overall complexity reduction (compute time and power consumption reduction).

The average complexity reduction was 55.51% with larger resolution files resulting in savings above 60% [14].

While there is significant interest in improving AV1's encoding speed and reducing its overall complexity (compute time and power consumption), at this time there have been relatively few research papers published on the matter. In a paper published in 2018,

Convolutional Neural Network-based texture models are used to reduce temporal flickering artifacts that can sometimes be found as a result of MSE (mean squared error) [15]. The CNN texture analyzer identifies the texture regions in a frame, labels each block as texture or non-texture, and creates a texture mask for every frame [15]. The texture mask and original frame are then fed into the codec, and the texture regions are successfully skipped in the encoding process

[15]. Although this experiment was not intended specifically to reduce the computing time/complexity of the encoding process (the experiment actually increased encoding time), this experiment illustrates how Statistical classification can be used by a video coding format to make improvements.

The most significant and most closely related research done on reducing the complexity of the AV1 encoder was published in August 2019. The work of this research paper acted as the building block for the current research of this project. This paper proposed using Decision Trees for the early termination of AV1 Inter-prediction [16]. In this research, the first 20 frames of a group of well-known video sequences used to develop MPEG standards were used as the training set [16]. Since the video sequences were of various resolutions, a set number of blocks sampled per sequence was used. In this research, a Binary classifier was used with various features selected that were found to be the most accurate in predicting the correct Inter-prediction mode (SRFPM or CRFPM). Although the experiment achieved success and an average 43.4% reduction of complexity (compute time and power consumption), only one classification algorithm was tested. Decision Trees are well known as being the simplest to implement,

however, the accuracy and overall efficiency compared to other classification methods in this case are unknown. Ultimately, the aim of the current project was to experiment with multiple classification methods, training sets and features to implement a more efficient encoder that can boast a higher rate of complexity reduction compared to previous research.

Chapter 4: Classification Methods

After researching and experimenting with the various classification methods and the AV1 encoder, three particular classification methods were chosen for the project. In choosing a classification method, one of the criteria is that the classifier does not add more complexity than it reduces. For example, implementing a Convolutional Neural Network to be used for

Binary classification in this context would probably result in higher efficiency, but the complexity costs (compute time and processing power) would far outweigh the benefits. In the actual implementation, the trained classifier will need to be run by the encoder on every macroblock for which the inter-prediction process occurs. This makes it crucial for the classification process to be as simple as possible. Another criterion is that the classifier needs to be able to produce a high rate of prediction accuracy. Therefore, choosing a classification method that generally produces a higher rate of successful predictions is needed. The classification methods ultimately chosen for this project were: C-Support Vector SVM, Linear Support Vector

SVM, and the Decision Tree (single).

Support Vector Machines

Support Vector Machines (SVM) are discriminative classifiers that are defined by a separating hyperplane. An SVM basically works by moving data that has no obvious linear classification separation through higher dimensional space, until a Support Vector Classifier can be used effectively. They do this by the use of Kernel functions that systematically find Support Vector

Classifiers in higher and higher dimensions. Figure 12 below is an example of data that does not have an obvious linear classification.

Figure 12 - One-dimensional data plot

Imagine that the green circles in the middle of the graph in Figure 12 above mean "yes", while the red circles located at the beginning and end of the graph mean "no". In the one-dimensional data plot in Figure 12 above, there is no hyperplane that can really separate the

“yes” and “no” choices. If a hyperplane was set in the middle of the green circles, the “yes” and

“no” choices would have a pretty equal amount of choices on both sides. What a Support Vector

Machine (SVM) can do is use Kernel functions to keep transforming the data until a hyperplane that clearly separates the choices can be generated (Figure 13 below is an example of choices that are clearly separated by a hyperplane). The main advantages of an SVM are that it can be very effective in high dimensional spaces and is relatively memory efficient compared to other methods. Scikit-learn provides three different types of SVM classifiers, labeled SVC, NuSVC and

LinearSVC. The first type, labeled SVC, is also known as C-Support Vector Classification and is generally used with training sets that are smaller than 10,000 samples. The other SVM used, known as

Linear Support Vector Classification (LinearSVC), has more flexibility with training set sizes and can generally handle a larger number of samples. As seen in Figure 13 below, support vectors maximize the margin around a separating hyperplane to clearly define a classification problem.

Figure 13 - Support Vector Machine Diagram

In SVMs, multiple hyperplanes can be found in some classification problems, and the classifier's job is to calculate the optimal solution. In the case of this research project, the two choices are clearly defined as either using SRFPM or not using SRFPM, so this should ultimately reduce some of the calculations needed with this classifier. Nu-Support Vector Classification

(NuSVC) was previously looked at, but the implementation seemed to carry a bit more encoding complexity (compute time and power consumption) when compared to the other two SVM classifiers. Ultimately only SVC and LinearSVC were used for experiments.
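A minimal scikit-learn sketch of the two SVM variants kept for the experiments is shown below, fitted to synthetic stand-in data rather than the block features collected later; the parameters are illustrative defaults, not the settings used in the final experiments.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC, LinearSVC
    from sklearn.metrics import accuracy_score

    # Synthetic binary data standing in for the per-block features.
    X, y = make_classification(n_samples=4000, n_features=8, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    svc = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)           # C-Support Vector Classification
    lin = LinearSVC(C=1.0, max_iter=10000).fit(X_train, y_train)   # Linear Support Vector Classification

    print("SVC accuracy:      ", accuracy_score(y_test, svc.predict(X_test)))
    print("LinearSVC accuracy:", accuracy_score(y_test, lin.predict(X_test)))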

Decision Trees

In terms of simplicity, Decision Trees are one of the easiest classification methods to explain and understand. Decision Trees are well known for their high accuracy, stability and ease of interpretation. A Decision Tree can handle both numerical and categorical data and is structured similarly to a flowchart. The structure of a Decision Tree can be seen in Figure

14 below.

Figure 14 - Basic Decision Tree structure

A Decision Tree will normally contain a Root Node, Decision Nodes, and Terminal Nodes. The

Root Node is the top decision node and represents the entire sample, which will later get divided into Decision and Terminal Nodes. When a node gets split into further sub-nodes, they are

known as Decision Nodes (technically the Root Node is also a Decision Node). Finally, a

Terminal Node is a node that does not have any more children (does not split further). There are multiple different algorithms that can be used to decide when a node should be a Decision Node and split further, or when it should be a Terminal Node. The most common algorithm used in

Decision Tree classification is known as the Gini Impurity formula. To help explain how the

Gini Impurity formula works, a random set of data is given in Table 3 below.

Table 3 - Example Data of Flu Symptoms

From the findings in Table 3 above, there are three “Yes” and two “No” answers under the Sore

Throat column that had a Sore Throat and had the Flu. There is one “Yes” and two “No” answers in the Sore Throat column that did not have a Sore Throat and had the Flu. To find the

Gini impurity of having a Sore Throat (which is ultimately how Sore Throat relates to the Flu), the formula in Figure 15 below is used.

Figure 15 - Gini Impurity formula

Using the formula from Figure 15 above, the probability of having a Sore Throat and having the

Flu is calculated to be .6, and .6 squared equals .36. The probability of not having the Flu and having a Sore Throat is calculated to be .4, and .4 squared equals .16. Therefore, using the

Formula from Figure 15 above (one minus the sum of the squared probabilities), the Gini Impurity for "Yes" to Sore Throat and the Flu (this would be the left node) is equal to 1 - (.36 + .16) = .48. Applying the same formula to the right node (this node

represents people who did not have a Sore Throat and have the flu), the Gini Impurity is equal to

.455. Finally, to find the total Gini Impurity of the Sore Throat Decision Node, the weighted average of both Gini Impurity results (.48 and .455) is calculated and equals .471. The Gini

Impurity formula is a crucial formula for finding the best predictors from a set of data. Having a

Gini Impurity of zero is known as Gini Purity and means that everything that falls under that specified category results in 100% accuracy (at least among the samples of a given data set).
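A small sketch of the calculation just described, assuming the counts from the Sore Throat example above; it reproduces the left-node value of .48 and shows the weighted average used to score the whole split.

    # Gini impurity of a node (Figure 15): 1 minus the sum of the squared class probabilities.
    def gini(counts):
        total = sum(counts)
        return 1.0 - sum((c / total) ** 2 for c in counts)

    # Total impurity of a split: the weighted average of the two child nodes.
    def split_gini(left_counts, right_counts):
        n = sum(left_counts) + sum(right_counts)
        return (sum(left_counts) / n) * gini(left_counts) + (sum(right_counts) / n) * gini(right_counts)

    print(gini([3, 2]))   # left node: 3 "Yes", 2 "No" -> 1 - (0.36 + 0.16), roughly 0.48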

The advantages of a Decision Tree include little required data preparation, low complexity costs, and the ability to handle both numerical and categorical data (this can be done in the same tree). A disadvantage of a Decision Tree is that it can become unstable when small variations of a problem are introduced. Small variations in a classification problem can result in completely different Decision Trees. Most of the disadvantages come from problems with the data or with the interpretation of a Decision Tree rather than from the actual structure of the tree itself. A good example of a fully developed Decision Tree can be seen in a famous iris example used in the scikit-learn documentation (seen in Figure 16 below).

Figure 16 - Iris Decision Tree example [17]

In this Decision Tree (Figure 16 above), features are classified in terms of Gini impurity with respect to the ground truth. As can be seen in Figure 16 above, the Terminal nodes all have Gini

Impurities equal to 0.0 (this is usually the goal, however, in some cases 0.0 or Gini Purity cannot always be achieved). In the case of this research project, only the decision to use SRFPM or to use either SRFPM or CRFPM is needed. More complicated classification problems can become much more difficult to classify into a decision tree structure, however, the binary problem for this project is relatively simple to interpret. In the next section, the various classification methods are put to the test and used with the data collected.
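For reference, a minimal sketch in the spirit of the scikit-learn iris example shown in Figure 16; the trees in this project are trained on per-block encoder features instead, so this is purely illustrative. The printed rules correspond to the splits whose Gini impurities are displayed graphically in the figure.

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    iris = load_iris()
    tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(iris.data, iris.target)
    print(export_text(tree, feature_names=list(iris.feature_names)))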

Chapter 5: Implementation – Languages, Frameworks, and Tools

A crucial part of gathering accurate and efficient results for this project was discovering and implementing the various frameworks and tools available. The exact specifications of the two systems used in the implementation and testing for this project can be seen in Table 4 below.

Table 4 - System Specs

The technical specifications given in Table 4 above are provided merely to give background on the systems that the encoders were run on. The results of the experiments (specifically encoding time) can vary depending on the systems being used (i.e. using a machine with the newest/fastest processor and/or graphics card would potentially reduce encoding times). Throughout the project a number of different programming languages were needed to implement the various tools and tasks. The AV1 reference encoder libaom and the AV1 coding format itself are written in C, while all the CMake/CPack/CTest files for the reference encoder are written in C++. The tool/library performing the binary classification tasks is scikit-learn, and it is written in Python.

The tool used to transpile the trained scikit-learn estimators to C (to be used to modify the reference encoder) is also written in Python. The majority of created tables/figures were made with basic html/css/bootstrap. The Rav1e AV1 encoder is written in Rust, however, only minor modifications needed to be made to it. Also, some shell/unix scripting was used to assist with some compilation.

Technically scikit-learn is the only high-level framework used for this project. The editors primarily used were Visual Studio Code and Jupyter Notebook, while Visual Studio

Community 2019 v16.3.8 was used for debugging. The scripting was done using a mix of

Windows PowerShell/Command Prompt/Terminal, and sometimes an Ubuntu Terminal was needed to run certain things on Linux. The various tools and setup will be explained in further detail in the next sections.

Training Sequences

One of the most crucial aspects of ML/classification is finding a training data set that can be used to efficiently and effectively train a classifier. After some trial and error, 10 sequences were chosen from a batch of well-known sequences that have been used in many HEVC/AVC experiments in the past. To simulate the various resolutions and frame rates that would be used in normal encoding circumstances, the chosen training sequences include: three nHD files

(640x360), three HD/WXGA files (1280x720), three FHD files (1920x1080) and one UHD file

(3840x2160). Although AV1 was created to compete with HEVC and to generally be used with higher resolution files, the current encoding times of AV1 encoders are so long and CPU-heavy that testing with resolutions higher than 4K/UHD is nearly impossible with the current systems used. Even at 4K, the encoder would frequently crash or time out while encoding only a small number of frames. As stated previously, the AV1 reference encoder 1.0 (which was used in this project) was clocked at roughly 45000x real-time. To circumvent this, the Rav1e encoder, which runs more efficiently than the reference encoder, was used in the initial testing and statistical analysis of the various training sequences. To verify the statistical analysis done by the previous

AV1 Inter-prediction research that was mentioned in the earlier sections [16], an experiment was done to test the complexity of the training sequences by encoding the first 20 frames of each sequence. The sequences were encoded twice using a quantization parameter (QP) value of 56, once with both Single Reference Frame Prediction Mode (SRFPM) and Compound Reference Frame

Prediction Mode (CRFPM) available to be chosen, and another where only SRFPM is available.

The encoder specifications and results can be seen in Figures 17 and 18 and in Table 5 below.

Figure 17 - Technical Specs w/o SRFPM

Figure 18 - Technical Specs /w CRFPM enabled

Table 5 - 10 Sequences encoded /w Rav1e

From Table 5 above, a significant increase in encoding complexity can be seen when all reference modes are available versus using only SRFPM. The average encoding time using

SRFPM was 365.477 seconds, or roughly 6 minutes, to encode just 20 frames of a sequence.

This is compared to an average of 9144.321 seconds or 2 hours and 54 minutes having all reference modes available.
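Based on these averages, the reduction reported in Figure 19 below presumably follows the standard form

Time Reduction (%) = (1 - T_SRFPM-only / T_all-modes) x 100 ≈ (1 - 365.477 / 9144.321) x 100 ≈ 96.0%,

which is in line with the 96.01% average reported below (the small difference would come from averaging the per-sequence reductions rather than taking the ratio of the mean times).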

Figure 19 - Encoding Time reduction formula

Figure 19 above illustrates the total savings percentage obtained by encoding with modified speed parameters that remove CRFPM versus the unmodified/anchor configuration. In total, the average reduction in complexity comes out to 96.01%. These complexity savings could actually be higher, as the speed setting was set to 1 instead of the lowest value of 0 (which was done due to frequent crashing). Figure 20 below illustrates the speed settings of the Rav1e encoder.

Figure 20 - Rav1e --speed preferences

As can be seen in Figure 20 above, speed setting 7 completely removes complex prediction mode (CRFPM), while speed setting 1 enables complex prediction mode for all frames. Adding the quantizer Rate-Distortion Optimization (RDO) and bottom-up encoding of speed setting 0, which are standard in the reference encoder, would only further add to the complexity. At relatively low resolutions, very minimal differences can be seen between the two encodes. A side-by-side comparison of the two encodes of the Netflix_Rollercoaster 720p video file can be seen in Figure 21 below (the sequence encoded with SRFPM only is on the left, while the sequence encoded using all modes is on the right).

Figure 21 - Side by Side comparison 720p video

The edges appear slightly softer, but overall the two encodes are extremely similar in visual quality. When encoding 4K or higher video this could become a bigger factor, but for resolutions up to FHD (1920x1080), removing CRFPM does not seem to be a difference maker.

However, the previous research study found an average efficiency loss of 12.7% overall when removing CRFPM modes [15]. Further testing with higher-resolution files might show a more significant loss in efficiency (only one UHD file was ultimately used in the training set due to encoder timeout issues).

Scikit-Learn

After establishing the 10-sequence training set, samples from each file needed to be taken in order to create the dataset necessary to train the various classification models. Although the Rav1e encoder provides a lot of useful data during encoding, it only provides information about the encode on a frame-by-frame basis. The peak signal-to-noise ratio (PSNR), as well as whether a frame is encoded using inter or intra prediction, is helpful; however, it does not provide enough information about the inter-frame modes on a block-by-block basis. From what was gathered, the only way to get this information was to step through an encode using Visual Studio 2019 debug mode and track the relevant information block by block. The difficulty of this led to a significantly smaller pool of data than desired. The final number of sampled macroblocks ended up being 4000, where 20 random inter-frame blocks were used per frame for a total of 20 frames per sequence (20 x 20 x 10). Because many different resolutions are used, a fixed number of samples per frame is preferable when creating the dataset. However, using a small number of samples could ultimately lead to inaccurate prediction. To illustrate, a frame of a 360p video sequence (640x360) contains roughly 900 macroblocks, whereas a frame of a 4K video sequence (3840x2160) contains roughly 20,000 macroblocks! This is a significant difference and could ultimately lead to inaccurate classification due to the under-representation of higher-resolution content in the samples. Nonetheless, the data was gathered, and 9 features (including the ground truth) were chosen; they can be seen in Table 6 below.

Table 6 - Selected Features

In the feature selection table (Table 6) above, "RF" stands for Reference Frame and "M" stands for macroblock (the number on the far left simply indicates which randomly selected inter-coded macroblock is being looked at). The features used for the dataset include the reference frame modes used to encode the four surrounding macroblocks of the current block (if a surrounding macroblock was encoded with CRFPM, then the second/last of its two reference frame macroblocks is used as the feature). As mentioned previously, a total of seven different reference frames can be chosen in SRFPM and CRFPM: LAST_FRAME, LAST2_FRAME, LAST3_FRAME, GOLDEN_FRAME, BWDREF_FRAME, ALTREF_FRAME, and ALTREF2_FRAME. In Table 6 above, each reference frame type is given a designated numerical value from "1" to "7". The next four features in Table 6 are binary, where "0" and "1" indicate whether a surrounding macroblock was encoded using SRFPM or not ("1" means SRFPM, while "0" means not SRFPM). The "Bottom", "Right", "Left", and "Top" positions in Table 6 refer to the positions of these macroblocks relative to the current block (i.e., the "Right" M refers to the inter-prediction mode used to encode the macroblock directly to the right of the current macroblock). Finally, the far-right column in Table 6 represents the actual macroblock being looked at and whether it uses SRFPM or not (the ground truth). Needless to say, there was an abundance of blocks encoded with SRFPM. To make this table easier to understand, a second table was created to illustrate the data in a categorical sense instead of a numerical one. Table 7 below illustrates the same information as Table 6, but uses "Yes" or "No" to indicate whether SRFPM is used, along with the respective reference frames used (note that Table 6 has numerical values that designate the reference frame type; however, ultimately the feature is checking the mode used for that particular reference frame macroblock). Some additional data was added to Table 7 below to specify whether the reference frame macroblock was encoded using SRFPM or not (as mentioned previously, the second reference frame is used in the case of a macroblock that was coded using CRFPM).

Table 7 - Selected Features in textual format

From the textual data in Table 7 above, it is easier to recognize how "Yes" and "No" decision nodes can be created from the dataset and how the Gini Impurity scores can be generated. The Scikit-Learn package used for this project performs all of the Gini Impurity calculations and automatically generates fully functional Decision Trees and SVMs based on the dataset and features that are loaded into the software.
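To make the numerical layout of Table 6 concrete, the sketch below builds a single training row. The assignment of the codes 1-7 to the reference frames (taken from the order they are listed in above) and the Top/Left/Right/Bottom ordering of the neighbor columns are assumptions, not necessarily the exact layout of the project's dataset.

# Assumed mapping of the seven AV1 reference frames to the codes "1" through "7",
# following the order in which they are listed above.
REF_FRAME_CODES = {
    "LAST_FRAME": 1, "LAST2_FRAME": 2, "LAST3_FRAME": 3, "GOLDEN_FRAME": 4,
    "BWDREF_FRAME": 5, "ALTREF_FRAME": 6, "ALTREF2_FRAME": 7,
}

def encode_sample(neighbor_refs, neighbor_srfpm, current_srfpm):
    """Build one dataset row: four neighbor reference-frame codes, four binary
    neighbor SRFPM flags (1 = SRFPM, 0 = CRFPM), and the ground-truth flag
    for the current macroblock."""
    row = [REF_FRAME_CODES[name] for name in neighbor_refs]  # Top, Left, Right, Bottom
    row += [1 if used_srfpm else 0 for used_srfpm in neighbor_srfpm]
    row.append(1 if current_srfpm else 0)  # ground truth
    return row

# Example: all four neighbors predicted from LAST_FRAME using SRFPM,
# and the current block also coded with SRFPM.
print(encode_sample(["LAST_FRAME"] * 4, [True] * 4, True))
# -> [1, 1, 1, 1, 1, 1, 1, 1, 1]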

After the dataset was assembled, it was used to train the three different classifiers mentioned previously (SVM, Linear SVM, and Decision Tree). The classification method that produced the best weighted-average prediction accuracy ended up being the decision tree. The results can be seen in Table 8 below.
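Table 8 is a per-class classification report; a minimal sketch of the training and evaluation flow that produces such a report is shown below. The dataset file name and column layout, the hold-out split, and the classifier hyperparameters are assumptions rather than the project's exact settings.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.metrics import classification_report

# Hypothetical CSV holding the 4000 rows described above: 8 features, ground truth last.
data = np.loadtxt("srfpm_dataset.csv", delimiter=",")
X, y = data[:, :8], data[:, 8]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

classifiers = {
    "Decision Tree": DecisionTreeClassifier(criterion="gini"),
    "SVM (C-Vector)": SVC(kernel="rbf"),
    "Linear SVM": LinearSVC(max_iter=10000),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name)
    # Per-class precision/recall/F1 for "0" (not SRFPM) and "1" (SRFPM).
    print(classification_report(y_test, clf.predict(X_test)))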

Table 8 - Decision Tree classification report

As can be seen in Table 8 above, the accuracy of predicting that a block would be encoded without SRFPM was 0.76, or 76% (indicated by the "0" class), whereas the accuracy of predicting that a block would be encoded using SRFPM was 0.91, or 91% (indicated by the "1" class). I think these predictions could be somewhat skewed due to the small sample size and the abundance of blocks encoded using SRFPM. In a later section, the process of outputting the classifiers and fitting them into the reference encoder will be discussed.

AV1 Reference Encoder

In this project, the AV1 reference encoder was used with discretion and, for the most part, avoided when possible. The speed of the reference encoder proved to be the biggest hurdle of this project. Since the premise of the entire project is to improve the reference encoder's speed, other avenues were researched for running some of the various encoding tests. Ultimately, the Rav1e encoder and FFmpeg were used often when encoding certain sequences in practice. FFmpeg, a free/open-source project that allows a user to encode and transcode using multiple video codecs/wrappers, recently added the libaom AV1 reference encoder (libaom-av1) to its list of available encoders. Because FFmpeg is much more user-friendly and the version/location of the libaom encoder can be designated (allowing for modifications to the encoder), compiling and using the AV1 reference encoder directly was not done (except during the debug process and at the start of the project). However, the reference encoder did provide some helpful CMake/makefiles that proved useful during research.

Other helpful tools

Package manager tools like npm and Chocolatey were very helpful throughout the project; however, the most crucial/helpful tool found for this project was the sklearn-porter tool. This tool was used to transpile the various classifiers into C (for use with the reference encoder).
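A minimal, self-contained sketch of typical sklearn-porter usage is shown below. The tiny stand-in dataset, the output file name, and the Porter API details are based on the library's documented interface and may differ from the project's exact calls (which appear in Figures 22, 24, and 26).

from sklearn.tree import DecisionTreeClassifier
from sklearn_porter import Porter

# Tiny stand-in for the real 4000-row dataset, just to make the sketch runnable.
X = [[1, 1, 1, 1, 1, 1, 1, 1],
     [2, 3, 4, 5, 0, 0, 1, 0]]
y = [1, 0]

clf = DecisionTreeClassifier().fit(X, y)

# Transpile the fitted estimator to C source containing a predict() routine.
porter = Porter(clf, language='c')
c_source = porter.export()

with open("decision_tree_classifier.c", "w") as f:
    f.write(c_source)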

The sklearn-porter call used for the Decision Tree and the resulting output can be seen in Figures 22 and 23 below.

Figure 22 - Sklearn Decision Tree call

Figure 23 - Sklearn classifier output - Decision Tree

The sklearn-porter tool proved invaluable, because attempting to convert the classifiers into a language usable by the reference encoder was otherwise extremely frustrating. Even though the other two classifiers did not match the precision of the decision tree, both were still transpiled and experimented with. The SVM C-Vector classifier was the next model to be transpiled; however, the output was incredibly large (as seen in Figures 24 and 25 below).


Figure 24 - Sklearn svm C-Vector call

Figure 25 - Sklearn classifier output - SVM C-Vector

The double vector in Figure 25 above took up hundreds of lines of code, and ultimately the SVM C-Vector code was much longer than anticipated. The last classifier used was the Linear SVM; although it was the least precise of the three classifiers in the prediction reports, it was also the most lightweight (as seen in Figures 26 and 27 below).

Figure 26 - Sklearn svm LinearSVC call


Figure 27 - Sklearn classifier output - LinearSVC

The LinearSVC classifier code (in Figure 27 above) resulted in only 25 lines of code, which is comparable in length to the decision tree output. In the next section, the experimental results of testing these different classifiers with the reference encoder will be discussed.

Chapter 6: Experimental Results

After converting the various classifiers into the C language, each classifier was inserted into the reference encoder for testing. A total of 7 video sequences, chosen at random and having no relationship to any of the original training sequences, were used in these tests. The resolutions used for these experiments were intentionally kept low due to the encoding speeds and other issues that persisted when using the reference encoder instead of Rav1e. Although running the reference encoder through FFmpeg helped in some regards, the reference encoder's encoding speeds proved prohibitively long. The sequences used to test the modified reference encoder included bq_zoom, chairlift, Netflix_Tango, Driving_POV, Netflix_Crosswalk, and Netflix_Boat. In the next section, the encoding results are compared.

Comparison of encoding complexity

A comparison of the encoding times (in seconds) of each of the sequences can be seen in Table 9 below.

Table 9 - Encoding Results

Overall, the modified encoders proved to have significantly lower encoding times compared to the unmodified reference encoder (libaom). The decision tree ultimately proved to be the most efficient in terms of complexity reduction (i.e., encoding speed), with an average encoding time of 1348.75 seconds (compared to the reference encoder's 4598.051 seconds). Using the modified reference encoder (decision tree) instead of the unmodified encoder resulted in an average 71% reduction in coding complexity. The SVM C-Vector and LinearSVC modified encoders resulted in average reductions of 51.6% and 66.8%, respectively. Visually, the encoded sequences seemed relatively similar in quality, and the average PSNR scores reflected this. The problem with these testing parameters and results is that ultimately only lower-resolution sequences could be tested (due to system constraints). Lower-resolution sequences would typically benefit the most from using just SRFPM without evaluating CRFPM, as most of the frames and blocks would ultimately be encoded with SRFPM anyway. The abundance of SRFPM in the original training data may have inadvertently caused the SRFPM-only case to be chosen more frequently than desired.
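For reference, the 71% figure for the decision-tree encoder follows directly from the reduction formula of Figure 19: (1 - 1348.75 / 4598.051) x 100 ≈ 70.7%, which rounds to the reported value.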

Conclusion

Ultimately, the experimentation on the reference encoder appears to have been a success with regard to reducing its complexity. It can be inferred that lower-resolution files benefit from skipping the test of whether CRFPM is a viable encoding option, because skipping it is not detrimental to the quality of the resulting sequence. The major reason for the creation of AV1 is to become the codec of the future, one meant to stream video formats as large as 8K (7680x4320) and beyond. Quality imperfections will become noticeable at these higher resolutions, making CRFPM prediction and encoding efficiency all the more important. The novel idea of using a binary classifier in inter-prediction is definitely viable. I believe that using a larger pool of test data from a wider range of sequence resolutions could result in a powerful and efficient alternative to the Rate-Distortion Optimization (RDO) process.

As it stands, AV1's inter-prediction process needs to be replaced in order for AV1 to function as a viable video coding option. While some attempts were made to fit the exported classifiers into the Rav1e encoder, I was unable to do so because the transpiler used cannot output Rust (and I am also very new to the Rust language). In the Rav1e documentation on GitHub, it appears that the developers are also attempting to replace the inter-prediction Rate-Distortion Optimization (RDO) process with a different machine learning prediction algorithm (which will hopefully be implemented soon).

Future Work

For future work, I believe that additional testing with the methods selected in this project could be beneficial. Using a wider range of file types, formats, and resolutions for the test sets could ultimately lead to better classification precision. Experiments implementing classification techniques in the other AV1 encoders could also lead to even faster and more efficient encoding than ever before. Machine learning techniques could be applied to a multitude of different AV1 features; in particular, experiments using classifiers or regression methods on the intra-prediction process could result in further complexity reduction. While many classification methods were researched for this project, not all were implemented. Future testing of other machine learning classification methods could result in higher overall prediction precision.

References

[1] “AOM - Alliance for Open Media,” http://aomedia.org/

[2] "The Cost of Codecs: Royalty-Bearing Video Compression Standards and the Road that Lies Ahead," https://www.cablelabs.com/the-cost-of-codecs-royalty-bearing-video-compression-standards-and-the-road

[3] Laude, Thorsten, Yeremia Gunawan Adhisantoso, Jan Voges, Marco Munderloh, and Jorn Ostermann. ”A Comparison of JEM and AV1 with HEVC: Coding Tools, Coding Efficiency and Complexity.” 2018 Picture Coding Symposium (PCS) (2018): 36-40. Web.

[4] "Cisco Visual Networking Index: Forecast and Trends, 2017–2022 White Paper." Cisco, Cisco, 27 Feb. 2019, https://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/white-paper-c11-741490.html.

[5] John Watkinson, The MPEG Handbook, p.1

[6] Uzun, Onur. "I-P-B Frames." Medium, Medium, 18 Dec. 2017, medium.com/@nonuruzun/i-p-b-frames-b6782bcd1460.

[7] Abu Layek, M., Ngo Quang Thai, Hossain, Ngo Thien Thu, Le Pham Tuyen, Talukder, Taechoong Chung, and Eui-Nam Huh. "Performance Analysis of H.264, H.265, VP9 and AV1 Video Encoders." 2017 19th Asia-Pacific Network Operations and Management Symposium (APNOMS) (2017): 322-25. Web.

[8] Sullivan, G., Ohm, Woo-Jin Han, and Wiegand. "Overview of the High Efficiency Video Coding (HEVC) Standard." IEEE Transactions on Circuits and Systems for Video Technology 22.12 (2012): 1649-668. Web.

[9] Zhao Wang, Juncheng Ma, Falei Luo, and Siwei Ma. "Adaptive Motion Vector Resolution Prediction in Block-based Video Coding." 2015 Visual Communications and Image Processing (VCIP) (2015): 1-4. Web.

[10] Chen, Yue, Mukherjee, Debargha, Han, Jingning, Grange, Adrian, Xu, Yaowu, Liu, Zoe, Parker, Sarah, Chen, Cheng, Su, Hui, Joshi, Urvang, Chiang, Ching-Han, Wang, Yunqing, Wilkins, Paul, Bankoski, Jim, Trudeau, Luc, Egge, Nathan, Valin, Jean-Marc, Davies, Thomas, Midtskogen, Steinar, Norkin, Andrey, and De Rivaz, Peter. "An Overview of Core Coding Tools in the AV1 Video Codec." 2018 Picture Coding Symposium (PCS) (2018): 41-45. Web.

[11] "Cisco Visual Networking Index: Forecast and Trends, 2017–2022 White Paper." Cisco, Cisco, 27 Feb. 2019, https://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/white-paper-c11-741490.html.

[12] Fernandez-Escribano, G., H. Kalva, P. Cuenca, L. Orozco-Barbosa, and A. Garrido. "A Fast MB Mode Decision Algorithm for MPEG-2 to H.264 P-Frame Transcoding." IEEE Transactions on Circuits and Systems for Video Technology 18.2 (2008): 172-85. Web.

[13] Correa, Guilherme, Pedro Assuncao, Luciano Agostini, and Luis A Da Silva Cruz. "Four- step Algorithm for Early Termination in HEVC Inter-frame Prediction Based on Decision Trees." 2014 IEEE Visual Communications and Image Processing Conference (2014): 65-68. Web.

[14] Qiang Hu, Zhiru Shi, Xiaoyun Zhang, and Zhiyong Gao. "Fast HEVC Intra Mode Decision Based on Logistic Regression Classification." 2016 IEEE International Symposium on Broadband Multimedia Systems and Broadcasting (BMSB) (2016): 1-4. Web.

[15] Chen, Di, Chichen Fu, and Fengqing Zhu. ”AV1 Video Coding Using Texture Analysis With Convolutional Neural Networks.” (2018). Web.

[16] Kim, Jieon, Saverio Blasi, Andre Seixas Dias, Marta Mrak, and Ebroul Izquierdo. ”Fast Inter-Prediction Based on Decision Trees for AV1 Encoding.” (2019). Web.

[17] Breiman, L., and J. Friedman. "1.10. Decision Trees." scikit-learn documentation, scikit-learn.org/stable/modules/tree.html.
