
Title: Enhancing Libva-Utils with VP8 and HEVC Encoding and Temporal Scalability for VP8

= Context =

Many modern CPUs support hardware acceleration for processing, decoding and encoding video. Hardware acceleration lowers the CPU time needed for handling videos, and it can also be more energy-efficient, resulting in longer battery life for mobile devices. Applications can leverage these capabilities with interfaces that enable access to hardware acceleration from user mode.

For CPUs on Linux, this interface is called VA-API. It consists of “a main library and driver-specific acceleration backends for each supported hardware vendor” [1]. The main library is called libva, and the driver-specific backend for integrated Intel GPUs is called intel-vaapi-driver.

In this project, I want to focus on libva-utils, written in C. It provides simple reference applications for encoding, decoding and video processing and includes a number of API conformance tests for libva (which are implemented using the GoogleTest framework). Libva-utils is designed to test the hardware acceleration and API stack and serves as a starting point for further development, such as including VA-API acceleration into application software and multimedia frameworks.

= Problem =

As hardware advances, VA-API must reflect these changes. Most available engineering resources are invested in the backend (intel-vaapi-driver) and the main library (libva). Regarding libva-utils, several areas can be identified that have lagged behind the rest of the stack. Libva-utils in its current state does not include reference encoders for either VP8 or HEVC. Further encoder-specific enhancements, such as temporal scalability, are currently only available for selected codecs (e.g., H.264). With the present proposal, I intend to contribute to an up-to-date libva-utils.

= Project Goals/Deliverables =

* Implement a sample encoder application for the VP8 codec (called vp8enc).
* Implement a sample encoder application for the HEVC codec (called hevcenc).
* Add automated testing for the encoders.
* Add temporal scalability to the VP8 encoder.
* Optional: Add temporal scalability to the VP9 encoder.

= Prerequisites =

VP8 and HEVC encoding are supported on Intel Gen9 GPUs and higher; therefore, a Skylake or newer system is needed for development and testing. I plan to set up a low-cost, remotely accessible Kaby Lake Celeron server running a recent 64-bit Ubuntu as the main development machine for this project. On the software side, having a self-contained stack with intel-vaapi-driver, libva and libva-utils would allow different versions to be tested independently. The entire stack should be compiled in a way that all relevant shared libraries, such as libva*.so and libi965_drv_video.so, can be placed in arbitrary locations. This can be done by setting the relevant environment variables (such as LIBVA_LIBS at compile time and LIBVA_DRIVERS_PATH at runtime). This allows the stack to coexist with the standard Ubuntu packages and several versions of the stack to be installed on the same machine.

= Implementation =

== VP8 Encoder Application ==

Developing a simple VP8 encoder application demands a basic understanding of the VP8 codec [2], its bitstream [3], the IVF container [4] and VA-API. Because we are interfacing with a hardware encoder, and since most of the intra-frame processing is transparent to the user, it is not necessary to understand every detail of the VP8 codec. Nevertheless, we must know how VP8 handles inter-frame processing because we must provide the necessary buffers for the reference frames.

Libva-utils already includes a reference implementation of a VP9 encoder application [5], which can be used as a starting point to prototype a VP8 encoder. To understand which parts of the code need modification, it is worth investigating the differences between VP8 and VP9. The following list is a first attempt to identify these differences, but it may not currently be complete:

* Superblocks and Macroblocks

VP8 processes a frame in fixed-size macroblocks (16x16 luma samples with the corresponding 8x8 chroma blocks). VP9 uses a more flexible approach: so-called superblocks of up to 64x64 pixels, which can be further subdivided into smaller blocks down to 4x4 pixels. This helps VP9 perform better with high-resolution content. For example, imagine encoding a uniformly blue sky: the corresponding blocks do not hold much frequency information and can therefore be encoded more efficiently in one large block. Having had a brief look at the source of vp9enc.c, I am inclined to assume that, when going through VA-API, this difference is transparent to the user.

* Segments

Both codecs support segmenting frames: each macroblock or superblock can be individually assigned to a segment number. The segments need not be contiguous or follow any predefined order [6]. Each segment can be processed with its own quantization and filter settings. VP8 and VP9 differ in the maximum number of segments per frame: VP8 allows four segments and VP9 allows eight. In libva, VP8 and VP9 segments are also exposed slightly differently, as a comparison of va_enc_vp8.h and va_enc_vp9.h shows, so some adaptation work in this regard can be expected.
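As a first impression of the VP8 side, the following minimal sketch shows how per-segment quantization could be expressed, assuming the VAQMatrixBufferVP8 layout from va_enc_vp8.h (one base quantization index per segment); the concrete values are arbitrary examples:

#include <va/va_enc_vp8.h>

/* Minimal sketch: fill one base quantization index per VP8 segment (four
 * segments at most). The structure layout is assumed from va_enc_vp8.h and
 * needs to be verified; the values are arbitrary examples. */
static void fill_vp8_segment_quant(VAQMatrixBufferVP8 *quant)
{
    quant->quantization_index[0] = 60; /* segment 0 */
    quant->quantization_index[1] = 50; /* segment 1 */
    quant->quantization_index[2] = 40; /* segment 2 */
    quant->quantization_index[3] = 30; /* segment 3 */
}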

* Reference Frames

Both codecs use reference frames. VP8- and VP9-encoded frames can have up to three references (the last frame, the golden frame and an alternative reference frame). In VP9, the potential reference frames are stored in a pool of eight frames; for example, in va_enc_vp9.h, the _VAEncPictureParameterBufferVP9 structure holds an element VASurfaceID reference_frames[8]. Currently, I am unsure whether the VP8 implementation uses such a pool; in any case, reference frames are handled slightly differently. Generally speaking, reference frames are part of the inter-frame processing, and since the frames must be allocated (by calling vaCreateSurfaces()), it is certain that this part of the VP8 encoder application must be reworked.
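To illustrate the expected rework, here is a minimal sketch of allocating the reconstructed and reference surfaces with vaCreateSurfaces() and wiring them into the VP8 picture parameter buffer; the field names (reconstructed_frame, ref_last_frame, ref_gf_frame, ref_arf_frame) are assumed from va_enc_vp8.h, and error handling is reduced to a bare minimum:

#include <va/va.h>
#include <va/va_enc_vp8.h>

/* Minimal sketch: allocate the surfaces for the VP8 reference structure.
 * Assumes "va_dpy" is an initialized VADisplay and "width"/"height" are the
 * frame dimensions; most error handling is omitted for brevity. */
static VAStatus setup_vp8_references(VADisplay va_dpy, unsigned int width,
                                     unsigned int height,
                                     VAEncPictureParameterBufferVP8 *pic_param)
{
    VASurfaceID surfaces[4]; /* reconstructed + last/golden/altref */
    VAStatus va_status;

    va_status = vaCreateSurfaces(va_dpy, VA_RT_FORMAT_YUV420,
                                 width, height, surfaces, 4, NULL, 0);
    if (va_status != VA_STATUS_SUCCESS)
        return va_status;

    /* Field names as found in va_enc_vp8.h (assumption to be verified). */
    pic_param->reconstructed_frame = surfaces[0];
    pic_param->ref_last_frame      = surfaces[1];
    pic_param->ref_gf_frame        = surfaces[2];
    pic_param->ref_arf_frame       = surfaces[3];

    return VA_STATUS_SUCCESS;
}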

* Tiles and Data Partitioning

VP9 uses tiling to divide frames into sections that can be processed independently. VP8 instead uses data partitioning, which allows the macroblock modes and motion vectors to be entropy-coded independently of the quantized transform coefficients. Both techniques address the need to parallelize work across several CPU cores. VP8 does not support tiling and VP9 does not support data partitioning. Implementing data partitioning in the VP8 encoder application can most likely be done by setting the auto_partitions flag in the _VAEncPictureParameterBufferVP8 structure.
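If the flag behaves as its name suggests, enabling data partitioning from the encoder application might look like the following sketch (both flag names are assumed from va_enc_vp8.h and would need to be verified against the driver):

#include <va/va_enc_vp8.h>

/* Sketch: request data partitioning for VP8. The flag names are assumed from
 * va_enc_vp8.h; whether the driver honours them must be verified. */
static void enable_vp8_data_partitioning(VAEncPictureParameterBufferVP8 *pic_param)
{
    pic_param->pic_flags.bits.auto_partitions = 1;      /* driver picks the partition layout */
    pic_param->pic_flags.bits.num_token_partitions = 2; /* log2 of token partitions (assumed) */
}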

Developing the VP8 encoder application can be done in incremental steps. This process would start with an I-frame-only encoder and later add support for segments and reference frames as well as VP8-specific features such as data partitioning.

== HEVC Encoder Application ==

As for the previous task, the prerequisites for developing a simple HEVC encoder application are a basic understanding of the HEVC codec [7], its bitstream and container, as well as the corresponding VA-API. For reference, an H.264 encoder is already available in libva-utils. Studying other sources such as libyami [8] and gstreamer-vaapi [9] may also be worthwhile. Comparing H.264 to HEVC hints at which parts of the implementation require special attention:

* Macroblocks and Coding Tree Units

H.264 uses fixed-size 16x16 macroblocks (with the corresponding 8x8 chroma blocks), which can be partitioned for prediction into blocks as small as 4x4. HEVC (like VP9) addresses the need for high-resolution content by dividing the frame into coding tree units (CTUs), “which can use larger block structures of up to 64×64 pixels and can better sub-partition the picture into variable sized structures” [10]. The coding tree is a quadtree, i.e., each subdivision has four children, and coding blocks can be split down to 8x8 pixels (with prediction and transform blocks as small as 4x4). My guess is that this is handled internally and the encoder application need not interfere at this level.

* Prediction Modes

In HEVC, the number of intra prediction modes has increased to 35 (from nine 4x4 intra modes in H.264). Again, this difference is most likely handled internally.

* Reference Frames

H.264 can have up to 16 reference frames, whereas in HEVC, the number of references is 2x8, meaning that the same reference frame can be used more than once but with different weights. The total number of unique references in HEVC is eight.

* Others

This list is certainly not complete at this time. In addition, more research on HEVC, its bitstream and container is required. The time needed to become familiar with both codecs is reserved in the timeline.

== Automated Encoder Test ==

To test the developed encoders automatically, the following criteria for encoder output can be evaluated:

* Decodability: tests whether the bitstream can be decoded without errors
* Number of frames: checks whether the number of input frames equals the number of output frames
* Resolution: checks whether input and output resolutions match
* Frame content: tests for content reproduction

All tests require a corresponding decoder, either from within libva-utils or an external one. Testing frame content is a bit more involved, as it includes generating suitable test patterns, encoding, decoding and automated analysis of the decoded pattern. I would like to propose QR codes as a test pattern that is simple both to generate and to analyze. Because we are testing lossy codecs, the QR dots must be the size of a macroblock (or an integer multiple of it) to keep distortion from the coding process low. Depending on the QR profile used, a certain percentage of dot defects can also be tolerated (7–30%). Rather than quantifying the quality of the resulting image, this test is limited to basic image reproduction and only qualifies as a pass/fail test.

This test could be extended in two ways:

* Testing interframe encoding by using a series of QR codes

While this test is intended for static images, it can be enhanced to handle interframe encoding by using a series of QR codes. For example, the frame number (such as “frame:%03d”) can be encoded into each QR code. Because the QR dots are chosen to be the size of a macroblock (and placed in the exact macroblock raster), the motion predictor can easily make a reference to another white or black dot. This should work well with VP8 and H.264. For VP9 or HEVC, with superblocks and CTUs, it may be more complicated, but by using QR dots that are large enough, the probability of success is higher.

* Testing chroma processing

Because video codecs process luma (Y) and chroma (UV) components separately, the encoder could be tested against different input colors. By using colors that are orthogonal in the YUV color space, this test could be further enhanced to test channel separation between the components. This case would require special frames that code three QR codes into one image by using one of the three components Y, U or V to store the QR dots of each code.

I performed preliminary testing for this method on the vp9enc encoder using static (single-frame) content with a QR dot size of 16x16. Therefore, I expect this test methodology to work for the static case; for the extended tests, more experiments must be carried out to determine whether they are practicable.
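To sketch how such a test frame could be generated programmatically, the following example renders a QR code into the luma plane of a raw frame, scaling each QR module to one 16x16 macroblock and aligning it to the macroblock raster. It uses libqrencode (see the packages mentioned below); the helper name and the assumption that the frame is large enough for the code are mine:

#include <stdint.h>
#include <string.h>
#include <qrencode.h>

#define MB_SIZE 16 /* one QR module per 16x16 macroblock */

/* Sketch: render a QR code into the luma plane of an I420 test frame, one
 * macroblock per QR module, aligned to the macroblock raster. The caller must
 * ensure the frame is at least qr->width * MB_SIZE pixels in both dimensions;
 * the chroma planes are left untouched here. */
static void render_qr_test_frame(uint8_t *luma, int width, int height,
                                 const char *text)
{
    QRcode *qr = QRcode_encodeString(text, 0, QR_ECLEVEL_H, QR_MODE_8, 1);
    if (!qr)
        return;

    memset(luma, 235, (size_t)width * height); /* white background (video range) */

    for (int y = 0; y < qr->width; y++) {
        for (int x = 0; x < qr->width; x++) {
            if (!(qr->data[y * qr->width + x] & 1)) /* LSB set means black module */
                continue;
            for (int dy = 0; dy < MB_SIZE; dy++)
                memset(luma + (size_t)(y * MB_SIZE + dy) * width + x * MB_SIZE,
                       16, MB_SIZE); /* black macroblock-sized dot */
        }
    }
    QRcode_free(qr);
}

Feeding frames generated with, e.g., render_qr_test_frame(luma, w, h, "frame:001") to the encoder and decoding the output with the zbar tools would then give a simple pass/fail criterion, including for the inter-frame variant described above.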

Encoders and decoders for QR codes are readily available as standard Ubuntu packages (such as libqrencode3 and zbar-utils) [11].

== Temporal Scalability for VP8 (and Optionally VP9) ==

Scalable video coding (SVC) is a technique for encoding a video stream in a way that enables its distribution to clients with different bandwidth requirements, without the need to transcode the stream. The video is encoded only once, and elements of the stream are selectively omitted during distribution in order to meet the desired bandwidth. SVC can operate on three kinds of layers [12]:

* Temporal: using different framerates
* Spatial: using different frame sizes
* Quality: using different levels of encoding quality

In the present proposal, I am focusing only on temporal scalability. Distributing the same video stream at different framerates means skipping frames. For example, a video recorded and encoded at 60 fps can be distributed at 60 fps, 30 fps (1/2), 20 fps (1/3), 15 fps (1/4) or perhaps only 10 fps (1/6). This may sound easy at first, but it becomes more complicated when considering that video codecs encode frames using references to other frames in the stream. This method is called inter-frame prediction and introduces dependencies between frames. When skipping frames, it is therefore important that all dependencies of the remaining frames stay intact.

This can be achieved by creating a hierarchy of temporal layers. For example [13], all frames in Layer 0 depend on Layer 0 frames only, while frames in Layer 1 may depend on Layer 1 and Layer 0 frames. By removing all Layer 1 frames, all dependencies of the remaining Layer 0 frames are still met. Care must be taken during encoding so that references are chosen in a way that establishes this temporal-layer hierarchy. The WEBM project provides a reference implementation that lists several layer patterns for this purpose [14]. When it comes to reference frames for inter-frame prediction, VP8 and VP9 work very similarly, as both can use up to three references (the last frame, the golden frame and an alternative reference). On the bitstream level, the temporal-layer ID is stored in the temporal-layer index field of the RTP payload descriptor and is encoded slightly differently depending on whether VP8 or VP9 is used. VP8 allows up to four different temporal layers [15], and VP9 up to eight [16].
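To make the layer hierarchy more concrete, here is a minimal sketch of a two-layer pattern (the structure and field names are illustrative, loosely following the examples in [13] and [14], and are not taken from libva): even frames form Layer 0, refresh the last-frame reference and are kept when the stream is thinned to half the framerate; odd frames form Layer 1, are never used as a reference and can therefore be dropped safely:

/* Sketch: two temporal layers. Even frames (layer 0) refresh the "last"
 * reference; odd frames (layer 1) only reference it and can be dropped.
 * Structure and field names are illustrative, not part of libva. */
struct temporal_layer_desc {
    int layer_id;      /* temporal-layer index to signal for this frame */
    int refresh_last;  /* does this frame update the last-frame reference? */
};

static const struct temporal_layer_desc two_layer_pattern[2] = {
    { 0, 1 },  /* layer 0: base layer, kept at half framerate */
    { 1, 0 },  /* layer 1: disposable, no other frame depends on it */
};

static const struct temporal_layer_desc *layer_for_frame(unsigned frame_num)
{
    return &two_layer_pattern[frame_num % 2];
}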

For the implementation, I plan to start by adding temporal scalability support to vp8enc and, if time allows, to port it to vp9enc afterwards. One way of testing the temporal scalability of the encoders is to write a simple frame-skipper application that analyzes the bitstream, removes the frames of a specific temporal layer and adjusts the bitstream so that it remains processable. The resulting stream could then be fed to the testing procedure proposed above.
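A rough sketch of such a frame skipper, based on the IVF layout described in [4] (a 32-byte file header followed by frames that each carry a 12-byte header starting with a little-endian 32-bit payload size). Since VP8 signals the temporal-layer index in the RTP payload descriptor rather than in the stored frame data, this sketch derives the layer from the frame index using the hypothetical two-layer pattern above:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Sketch of a frame skipper for IVF files [4]: copy the 32-byte file header,
 * then copy only the frames whose temporal layer does not exceed max_layer.
 * The layer is derived from the frame index (two-layer pattern from above). */
static int skip_layers(FILE *in, FILE *out, unsigned max_layer)
{
    uint8_t file_hdr[32], frame_hdr[12];
    unsigned frame_num = 0;

    if (fread(file_hdr, 1, sizeof(file_hdr), in) != sizeof(file_hdr))
        return -1;
    /* Note: the frame-count field in the file header would need fixing up. */
    fwrite(file_hdr, 1, sizeof(file_hdr), out);

    while (fread(frame_hdr, 1, sizeof(frame_hdr), in) == sizeof(frame_hdr)) {
        /* Frame header: 4-byte little-endian payload size, 8-byte timestamp. */
        uint32_t size = frame_hdr[0] | frame_hdr[1] << 8 |
                        frame_hdr[2] << 16 | (uint32_t)frame_hdr[3] << 24;
        uint8_t *payload = malloc(size);

        if (!payload || fread(payload, 1, size, in) != size) {
            free(payload);
            return -1;
        }
        if (frame_num % 2 <= max_layer) { /* keep frame if its layer is wanted */
            fwrite(frame_hdr, 1, sizeof(frame_hdr), out);
            fwrite(payload, 1, size, out);
        }
        free(payload);
        frame_num++;
    }
    return 0;
}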

= Timeline =

Week 1: Become familiar with VP8
Week 2: Program the VP8 encoder application
Week 3: Develop automated testing and enhance the VP8 encoder
Week 4: Document/merge/evaluate

Week 5: Become familiar with HEVC
Week 6: Program the HEVC encoder application
Week 7: Develop automated testing and enhance the HEVC encoder
Week 8: Document/merge/evaluate

Week 9: Become familiar with VP8 and VP9 temporal scalability
Week 10: Implement temporal scalability for VP8
Week 11: Implement temporal scalability for VP9 (optional)
Week 12: Document/merge/evaluate

The task of implementing temporal scalability for VP9 is optional. If time allows, I would like to use week 11 to port temporal scalability to vp9enc; otherwise, week 11 will be used to compensate for delays.

= About me =

My name is Georg Ottinger, and I am pursuing a master’s degree in practical computer science at the University of Hagen. Before that, I completed a master’s degree in sociology (University of Vienna) and worked for several years as an embedded software engineer, both as an employee and as a contractor, gaining decent practice in C/C++ development. I am also inclined towards electrical engineering and have initiated open-source hardware projects in the domain of audio and video streaming.

OggStreamer is a device enabling simple live audio streaming setups, and it is used by many independent radio stations (in Europe and abroad). Its prototype won second place in the Lantronix Design Contest 2010, and it can be considered a finished product. See https://oggstreamer.wordpress.com/ . I started VideoBrick with two friends; it is intended to perform live capturing from HDMI. We built a prototype and coded a proof of concept but then ran out of energy. Nevertheless, we documented our progress here: https://videobrick.wordpress.com/

Currently, I am focusing on progressing with my studies, and I really like the idea of learning more about video codecs. I am excited about royalty-free codecs such as VP8, VP9 and AV1.

= References =

[1] VA-API (Video Acceleration API) user mode driver for the Intel GEN Graphics family https://github.com/intel/intel-vaapi-driver

[2] Technical overview of VP8, an open source video codec for the web https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/37073.pdf

[3] RFC 6386: VP8 Data Format and Decoding Guide https://tools.ietf.org/html/rfc6386

[4] IVF Container https://wiki.multimedia.cx/index.php/IVF

[5] Libva-utils VP9 encoder https://github.com/intel/libva-utils/blob/master/encode/vp9enc.c

[6] An analysis of VP8, a new video codec for the web http://scholarworks.rit.edu/cgi/viewcontent.cgi?article=4188&context=theses

[7] Overview of the High Efficiency Video Coding (HEVC) Standard http://iphome.hhi.de/wiegand/assets/pdfs/2012_12_IEEE-HEVC-Overview.pdf

[8] Yet Another Media Infrastructure https://github.com/intel/libyami

[9] VA-API support to GStreamer https://github.com/GStreamer/gstreamer-vaapi

[10] Coding tree unit https://en.wikipedia.org/wiki/Coding_tree_unit

[11] QR code: Encode and Decode QR code on linux command line https://tuxthink.blogspot.co.at/2014/01/qr-code-encode-and-decode-qr-code-on.html

[12] Chrome’s WebRTC VP9 SVC Layer Cake https://webrtchacks.com/chrome-vp9-svc/

[13] HOWTO Use temporal scalability to adapt video bitrates http://www.rtcbits.com/2017/04/howto-implement-temporal-scalability.html

[14] vpx_temporal_svc_encoder (WEBM Project) https://www.webmproject.org/docs/webm-sdk/example_vpx_temporal_svc_encoder.html

[15] RTP Payload Format for VP8 Video https://tools.ietf.org/html/rfc7741#section-4.2

[16] RTP Payload Format for VP9 Video (draft-ietf-payload-vp9-04) https://tools.ietf.org/html/draft-ietf-payload-vp9-04#section-4.1