A Low Latency Compressed Video Codec for Mobile and Embedded Devices
3.8.2011 Version 9

Joe Bertolami
Founder, everyAir
[email protected]

Summary - Inarguably the most widely supported video codec (circa 2010) is H.264/AVC. This codec gained significant popularity over the past eight years through the adoption of advanced video formats such as Blu-ray and AVCHD. Unfortunately, it was not designed for real-time game streaming on low power devices (e.g. mobile and embedded), and thus fails to offer an adequate low latency solution. In this paper we review a simple software solution, implemented for the everyAir Cloud Gaming product, that trades bandwidth efficiency for significantly improved processing latency over existing H.264 implementations. Analysis of this solution, in comparison with H.264, reveals a severe but expected loss in bandwidth efficiency and a substantial improvement in coding latency.

Index Terms - Compression, Quantization, Real-time, Video

1. INTRODUCTION

Recent advances in mobile computing have significantly increased the demand for specialized, efficient, low latency video compression techniques. One of the most popular formats in widespread use is H.264, a format designed through a partnership between several companies, with patent licensing administered by MPEG LA (via the MPEG LA License Agreement). Although this format contains some of the world’s best video compression techniques, its adoption requires careful business consideration.

During the early stages of everyAir development, we assessed the suitability of H.264, including the relevant benefits and risks of using it. We ultimately incorporated it into our product, but did so cognizant of the following risks:

• Given the licensing structure of the codec at the time, our primary option was to use an open source implementation. The package we adopted was released under the GPL v2, which can be particularly thorny when used alongside proprietary technologies.
  Careful consideration is required to ensure total compliance with GPL requirements.

• The H.264 process is patent protected, and no precedent has been set for the legal viability of many of the open source encoders. As legal clarity is unlikely to arrive anytime soon, there remains a risk in taking a dependency on this format without proper licensing agreements in place. [Update: MPEG LA has recently set forth a royalty free plan for licensees of H.264]

• Adopting open source codecs could stymie our future ability to innovate in this area. Our engineers would undoubtedly be less familiar with a foreign codebase than with a homegrown solution, and operating with, cross compiling, bug fixing, and accepting updates from open source projects carries its own set of risks.

Given these risks, we developed a custom codec, symbolically named P.264 (due to its similarities to H.264), and used it within our everyAir product. This solution provided lower encode/decode latency and significantly lower computational cost, but was unable to match the quality-per-bitrate of traditional H.264¹. Based on encoding trials, however, P.264 proved acceptable for our product.

The remainder of this paper describes the simple P.264 encoding process. We omit the decoding process, as it may be trivially derived from the encoding process.

¹ Based on subjective and peak signal-to-noise ratio (PSNR) analysis.

1.2 Further Motivation

P.264 is a low latency software video format designed for real-time application streaming. Its primary consumer is everyAir, a multi-platform real-time remote desktop application that facilitates personal cloud gaming. In our target scenarios, it was important that P.264 support both decoding and encoding operations arbitrarily across processors in a heterogeneous configuration (i.e. both many-core and GPU based execution).
Thus, considerations have been made with respect to addressability, threading, cache coherency, and so forth.

1.3 Target Audience

This paper was originally authored as an internal document describing the basics of video coding as well as the design of the P.264 codec and its proprietary optimizations. Although some proprietary portions have been removed for this release, the majority of the paper remains intact. This release is intended for software engineers interested in learning more about P.264 and about the general field of video compression. We will discuss several of the basic concepts behind video coding and then describe how they were designed and incorporated within P.264.

A final note before we begin: while the naming of this format borrows heavily from H.264, the techniques and algorithms it uses, even when similarly named, do not necessarily bear technical resemblance to those of H.264. Readers wishing to learn more about the architecture of H.264 should consult the relevant specification².

2. OVERVIEW

At the highest level, P.264 is a simple transform codec that relies upon block based frequency domain quantization and arithmetic encoding to produce reasonable inter and intra-frame compression rates. P.264 achieves significantly lower computational cost than H.264 by omitting or simplifying several intensive features of the traditional pipeline. These features were adjusted after our analysis concluded that they would either require an unacceptable amount of processing time on our target platforms or prevent us from transmitting early portions of a frame while later portions were still being encoded.

A secondary goal for this format was that it be tailored to the potential asymmetries between host and client display resolutions. Special attention has been paid to ever-increasing video resolution requirements by ensuring that all block sizes in P.264 are scalable.
In this manner, when possible, video data is compressed with respect to its display properties, which ultimately affects the compression efficiency.

2.2 Display Orientations

Assume video file α has a resolution of 2048x1152 and is presented on display A, which features x pixels per inch (PPI). Additionally, assume that video file β has a resolution of 4096x2304 and is presented on display B, which has the same physical dimensions as display A but 2x the PPI. With most modern encoders, both video files will be partitioned along display-agnostic tile dimensions (typically 16x16 pixel blocks), which may result in larger-than-necessary file sizes for video β. With P.264, the encoder understands that video β, when viewed on display B, presents a very different visible artifact signature than video α, and adjusts accordingly.

Although this scenario may seem contrived, it is quite relevant for mobile devices with well-known native resolutions and processing constraints (e.g. iPhone 4 vs. iPad 1; reasonably similar resolutions but very different PPIs).

² Currently available at http://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-H.264-200305-S!!PDF-E&type=items

2.3 Client Server Design

Throughout this paper we use the term client to refer to a device that receives an encoded message, decodes the message to produce an image, and presents the image. We use the term server to refer to a device that receives an image from the operating system, encodes it into a message, and transmits the message to a client. Additionally, unless otherwise stated, encoder refers to the encoding software on the server, while decoder refers to the decoding software on the client. Lastly, we refer to the most recent image received by the encoder from the operating system as the current image, and to the previously encoded image as simply the previous image.
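Returning to the display example of Section 2.2, the arithmetic behind the "larger-than-necessary" observation can be sketched in a few lines of Python. The paper does not state how P.264 scales its blocks, so the proportional rule in `scaled_block_size` is purely an assumption made for illustration, and both function names are hypothetical.

```python
import math

def block_count(width, height, block):
    """Number of block x block tiles needed to cover a width x height frame."""
    return math.ceil(width / block) * math.ceil(height / block)

def scaled_block_size(base_block, ppi_ratio):
    """ASSUMED rule (not from the paper): grow the block edge in proportion
    to the PPI ratio, so one block covers the same physical screen area."""
    return max(1, round(base_block * ppi_ratio))

BASE_BLOCK = 16                      # typical display-agnostic block edge
alpha = (2048, 1152)                 # video alpha, shown on display A (x PPI)
beta = (4096, 2304)                  # video beta, shown on display B (2x PPI)

alpha_blocks = block_count(*alpha, BASE_BLOCK)    # 128 * 72 blocks
beta_fixed = block_count(*beta, BASE_BLOCK)       # 256 * 144 blocks
beta_block = scaled_block_size(BASE_BLOCK, 2.0)   # block edge doubles
beta_scaled = block_count(*beta, beta_block)      # 128 * 72 blocks again

print(alpha_blocks, beta_fixed, beta_block, beta_scaled)
```

With fixed 16x16 blocks, video β is split into four times as many blocks as video α even though each block covers one quarter of the physical screen area. Under the assumed proportional rule, β is partitioned into the same number of blocks as α.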
2.4 Process Description

The everyAir Server is responsible for capturing new candidate frames from the host operating system and supplying them to the encoder in an appropriate format. On Windows and Mac platforms, screen captures are represented as three channel RGB images with 24 bits per pixel (i.e. RGB8). Since P.264 prefers images in the YUV color space with 12 bits per pixel (YUV420), we convert each frame just prior to submitting it to the encoder. The remainder of this document assumes that inputs are represented in this format.

2.4.2 Frame Encoding

After color space conversion, the frame is submitted to the encoder. The encoder quickly examines the frame and its current context to decide whether to encode the frame in intra mode or inter mode. Intra mode produces an “i-frame” that contains exclusively self-referencing information and requires no other frame in order to be decoded. The encoder will generally operate in intra mode in any of the following situations:

• The current frame is the first frame to be encoded. Since no previous frames are available to reference, the encoder will produce a frame that can be decoded on its own.

• The encoder is explicitly switched to intra mode. This will happen if the server detects that a frame was dropped en route to the client, or that the client is out of sync with the server for any other reason.

• The encoder is told to produce intra frames at a specific interval. This is often the case for seekable streams that need to begin playback from any location.

Inter mode, on the other hand, produces a “p-frame” that references previous frames and thus requires access to them in order to be decoded properly. The advantage of a p-frame is that it will likely compress much better than an i-frame, because adjacent frames in a stream tend to be similar.
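The mode decision described above can be sketched as follows. This is a minimal illustration rather than the everyAir implementation: the `EncoderContext` fields, the `choose_mode` and `encode_next` functions, and the way the interval is counted are all hypothetical names and assumptions introduced here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class FrameMode(Enum):
    INTRA = "i-frame"   # self-contained, decodable on its own
    INTER = "p-frame"   # references previously encoded frames

@dataclass
class EncoderContext:
    """Hypothetical encoder state; field names are illustrative only."""
    frames_encoded: int = 0                # frames produced so far
    frames_since_intra: int = 0            # frames since (and including) last i-frame
    force_intra: bool = False              # set when the client is out of sync
    intra_interval: Optional[int] = None   # emit an i-frame every N frames, if set

def choose_mode(ctx: EncoderContext) -> FrameMode:
    # Situation 1: the first frame has nothing to reference.
    if ctx.frames_encoded == 0:
        return FrameMode.INTRA
    # Situation 2: the server explicitly requested an i-frame (e.g. dropped frame).
    if ctx.force_intra:
        return FrameMode.INTRA
    # Situation 3: a periodic i-frame is due (e.g. for seekable streams).
    if ctx.intra_interval is not None and ctx.frames_since_intra >= ctx.intra_interval:
        return FrameMode.INTRA
    # Otherwise encode against the previous image.
    return FrameMode.INTER

def encode_next(ctx: EncoderContext) -> FrameMode:
    """Pick a mode for the current image and update the bookkeeping."""
    mode = choose_mode(ctx)
    ctx.frames_encoded += 1
    ctx.force_intra = False  # a forced request is satisfied by one i-frame
    if mode is FrameMode.INTRA:
        ctx.frames_since_intra = 1
    else:
        ctx.frames_since_intra += 1
    return mode

ctx = EncoderContext(intra_interval=3)
modes = [encode_next(ctx) for _ in range(6)]
print([m.value for m in modes])
```

In this sketch the three checks are ordered as the situations are listed in the text, though any order gives the same result because each one independently selects intra mode. Clearing `force_intra` after a single frame is a design assumption: one i-frame is taken to be enough to resynchronize the client.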