On Computational Complexity of Motion Estimation Algorithms in MPEG-4 Encoder
Muhammad Shahid
This thesis report is presented as a part of degree of Master of Science in Electrical Engineering
Blekinge Institute of Technology, 2010
Supervisor: Tech. Lic. Andreas Rossholm, ST-Ericsson. Examiner: Dr. Benny Lövström, Blekinge Institute of Technology
Abstract
Video encoding in mobile devices is a computationally demanding feature that requires well-designed and well-developed algorithms. The optimal solution requires trade-offs in the encoding process, e.g. in motion estimation between low complexity on the one hand and high perceptual quality and efficiency on the other. This thesis works on reducing the complexity of motion estimation algorithms used for MPEG-4 video encoding, taking the SLIMPEG motion estimation algorithm as reference. Inherent properties of video, such as spatial and temporal correlation, have been exploited to test new motion estimation techniques. Four motion estimation algorithms have been proposed, and their computational complexity and encoding quality have been evaluated. The resulting encoded video quality has been compared against the standard Full Search algorithm, while the reduction in computational complexity of the improved algorithms is compared against SLIMPEG, which is already about 99 % more efficient than Full Search in terms of computational complexity. The fourth proposed algorithm, Adaptive SAD Control, offers a mechanism for choosing the trade-off between computational complexity and encoding quality dynamically.
Acknowledgements
It is a matter of great pleasure to express my deepest gratitude to my advisors Dr. Benny Lövström and Andreas Rossholm for all their guidance, support and encouragement throughout my thesis work. It was a great opportunity to do research work at ST-Ericsson under the marvelous supervision of Andreas Rossholm. The counseling provided by Benny Lövström was of great value to me in writing up this manuscript. I cannot forget the help I received from Fredrik Nillson and Jimmy Rubin of ST-Ericsson in setting up the working environment and the start-up of the ST-Ericsson algorithm. I owe my successes in life so far to all of my family members, for their magnificent kindness and love!
Contents
Abstract i
Acknowledgements iii
Contents v
1 Introduction 1
2 Basics of Digital Video 3
  2.1 Color Spaces 3
  2.2 Video Quality 4
  2.3 Representation of Digital Video 4
  2.4 Applications 4
    2.4.1 Internet 5
    2.4.2 Video Storage 5
    2.4.3 Television 5
    2.4.4 Games and Entertainment 6
    2.4.5 Video Telephony 6
3 Video Compression Fundamentals 7
  3.1 CODEC 7
  3.2 A Video CODEC 8
  3.3 Video Coding Standards 9
    3.3.1 MPEG-1 10
    3.3.2 MPEG-2 10
    3.3.3 MPEG-4 10
    3.3.4 MPEG-7 10
    3.3.5 MPEG-21 10
    3.3.6 H.261 11
    3.3.7 H.263 11
    3.3.8 H.263+ 11
    3.3.9 H.264 11
  3.4 MPEG-4 11
  3.5 Syntax 12
4 Motion Estimation and its Implementation 15
  4.1 Block Matching 17
  4.2 Motion Estimation Algorithms 18
    4.2.1 Full Search 19
    4.2.2 Three-Step Search 20
    4.2.3 Diamond Search 20
    4.2.4 SLIMPEG 21
5 Rate Distortion Optimization and Bjontegaard Delta PSNR 23
  5.1 Measurement of Distortion 24
  5.2 Bjontegaard Delta PSNR 24
6 Simulation, Results and Discussion 29
  6.1 SAD as a Comparison Metric 30
  6.2 Proposed Techniques 30
    6.2.1 Spatial Correlation Algorithm 31
    6.2.2 Temporal Correlation Algorithm 31
    6.2.3 Adaptive SAD Control 32
  6.3 Simulations with different video sequences 34
    6.3.1 Football Sequence 35
    6.3.2 Foreman Sequence 36
    6.3.3 Claire Sequence 40
7 Conclusion and Future Work 51
List of figures 54
List of tables 55
Bibliography 57
Chapter 1
Introduction
Since the advent of the first digital video coding standard, released in 1984 by the International Telecommunication Union (ITU), the technology has seen great progress. The two main standard-setting bodies in this regard are the ITU and the International Organization for Standardization (ISO). Recommendations of the ITU include standards like H.261/262/263/264, which focus on applications in the area of telecommunication. The Moving Picture Experts Group (MPEG) of ISO has released standards like MPEG-1/-2/-4, which focus on applications in the computer and consumer electronics area. The standards defined by both of these groups have some parts in common, and some work has been performed as a joint venture. The field of video compression has been continuously developing, with enhancements to previous versions of the standards and the introduction of new recommendations. The MPEG-4 standard is followed in this thesis work.

It can safely be said that video compression is a top requirement in any multimedia storage and transmission scenario: the video is encoded in some form before being sent or stored, and is decoded subsequently at the receiver end or when viewed. Besides the presence of digital video in television and on CD/DVD, cellular phones will probably be the next place where video content is heavily used. The limited storage capacity of mobile devices dictates the requirement of efficient video compression tools. Video encoding in mobile devices has developed from a high-end feature to something that is taken for granted. Nevertheless, it is a computationally demanding feature that requires well-designed and well-developed algorithms, and many different algorithms need to be evaluated in order to come closer to the optimal solution.

As early as 1929, Ray Davis Kell described a form of video compression for which he obtained a patent [1].
Given that a video is actually a series of pictures transmitted at some designated rate, Kell's patent gave rise to the idea of transmitting the difference between successive images instead of sending each whole image. However,
it took ages to get the idea implemented in reality, yet it is still a keystone of many video compression standards today. Connected to this idea is the concept of motion estimation, which tries to exploit the temporal correlation present at different positions between video frames. It predicts the motion found in the current frame using already encoded frames. Hence, the residual frame contains much less energy than the actual frame, and the motion vectors and the residual frame can be encoded at a bit rate much lower than that required to encode a regular frame. Motion estimation may require a tremendous amount of computational work inside the video coding process. Certain algorithms are employed for motion estimation. The basic class of these is called Full Search algorithms; it gives optimal performance but is computationally very time consuming. To deal with this computational issue, many sub-optimal fast search algorithms have been designed, and this thesis focuses on some of them in an attempt to improve the performance of one of them. The SLIMPEG motion estimation algorithm is taken as reference here, and inherent video properties like spatial and temporal correlation have been applied to devise less complex yet performance-oriented motion estimation techniques.

The rest of the report is organized as follows: Chapter 2 and chapter 3 deal with the fundamentals of digital video and video compression, respectively. Implementation aspects of motion estimation are explored in chapter 4, ending with the introduction of the SLIMPEG motion estimation algorithm. Rate distortion and delta PSNR are the contents of chapter 5. Results of the main contribution are provided and discussed in chapter 6. Chapter 7 contains the conclusion and some hints about future work in the field.
Chapter 2
Basics of Digital Video
A video image is obtained by capturing a 2D plane view of a 3D scene. Digital video is then spatially and temporally sampled frames presented in a sequence. The spatio-temporal sampling unit, usually called a pixel (picture element), can be represented by a digital value describing its color and brightness. The more sampling points taken to form the video frame, the higher the visual quality usually is, but at the cost of higher storage requirements. The video frame is usually rectangular in shape. The smoothness of a video is determined by the rate at which its frames are presented in succession; a video with a frame rate of thirty frames per second looks fairly smooth for most purposes. A general comparison of the appearance of a video as determined by its frame rate is given in table 2.1 [2].
Table 2.1: Video frame rates [2].

    Frame rate                  Appearance
    Below 10 frames per second  'Jerky', unnatural appearance of movement
    10-20 frames per second     Slow movement appears OK; rapid movement is clearly jerky
    20-30 frames per second     Movement is reasonably smooth
    50-60 frames per second     Movement is very smooth
2.1 Color Spaces
The pixel may be represented by just one number (grey-scale image) or by multiple numbers (colored image). A particular scheme used for representing colors is called a color space. Two of the most common schemes are known as RGB (red/green/blue) and YCrCb (luminance/red chrominance/blue chrominance). In the RGB color space, each pixel is represented by three numbers indicating the relative proportions of the three colors. Each of the numbers is usually formed by eight bits, so one pixel requires twenty-four bits for its complete representation. It has been observed from psychovisual experiments that the human optical system is less sensitive to color than to luminance. This fact is exploited in the YCrCb color space, where luminance is concentrated in the single component Y and color information is contained in the remaining components. There is a relationship between the two color space schemes by which one representation can be transformed into the other. For details, please see [2].
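As an illustration of the transformation between the two color spaces, the following sketch uses the ITU-R BT.601 full-range equations, one common definition; the exact coefficients used by a given codec may differ.

```python
def rgb_to_ycbcr(r, g, b):
    """Map 8-bit RGB values to Y, Cb, Cr (ITU-R BT.601, full range)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr

def ycbcr_to_rgb(y, cb, cr):
    """Inverse transform, back to RGB."""
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    return r, g, b
```

Note that any grey pixel (r = g = b) maps to chrominance values of exactly 128, i.e. all its information ends up in the Y component, which is what makes chrominance subsampling cheap.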
2.2 Video Quality
Video quality is an important parameter and is a subjective issue by nature, being judged by humans. There are many objective criteria for measuring video quality which give results with some correlation to human experience, e.g., PSNR. However, they may not satisfactorily match the subjective experience of a human observer: experiments show that a picture with a lower PSNR may look visually better than one with a higher PSNR value. Human visual experience may also vary from person to person, which brings up the need for alternatives that cover the needs of both objective and subjective tests. An objective test which matches the human visual experience well will give acceptable results.
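For reference, PSNR is derived from the mean squared error between the original and degraded images; a minimal sketch on flat pixel sequences (the helper below is illustrative, not part of the thesis code):

```python
import math

def psnr(original, degraded, peak=255):
    """PSNR in dB between two equally sized sequences of 8-bit pixel values."""
    mse = sum((a - b) ** 2 for a, b in zip(original, degraded)) / len(original)
    if mse == 0:
        return float("inf")  # identical images: PSNR is unbounded
    return 10 * math.log10(peak ** 2 / mse)
```

For example, an error of exactly one grey level on every pixel gives 10*log10(255^2) ≈ 48.1 dB, regardless of image content, which illustrates why PSNR correlates only loosely with perceived quality.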
2.3 Representation of Digital Video
Before the video is ready for coding, it is often transformed to one of the intermediate formats. The central one is the Common Intermediate Format, CIF, with a frame resolution of 352 x 288 pixels. Table 2.2 gives information about some standard intermediate formats.
2.4 Applications
There has been exponential growth in the applications of digital video, and the technology continues to evolve rapidly. Some examples of
Table 2.2: Intermediate formats [2].

    Format               Luminance resolution (horiz. x vert.)
    Sub-QCIF             128 x 96
    Quarter CIF (QCIF)   176 x 144
    CIF                  352 x 288
    4CIF                 704 x 576
widely used digital video applications are given in the following subsections.
2.4.1 Internet
It can safely be said that the current era of the internet holds most digital video applications, ranging from small video clips to full-length movies, and from a small video chat to a corporate video conference. Remote teaching/learning, video telephony and video sharing have been made possible by digital video. The video broadcasting service YouTube presents billions of videos to its viewers worldwide using the benefits of digital video technology.
2.4.2 Video Storage
Digital video has reshaped the way videos are stored. CD/DVD-ROM and Blu-ray Disc have almost wiped out classic film tape storage media. These new storage discs come with huge advantages in capacity, portability and durability. The latest of them is the Blu-ray Disc, with a storage capacity of 25 GB on a single-layer disc and up to 50 GB on a dual-layer disc [3].
2.4.3 Television
Satellite television channels across the planet create a whole new global village by virtue of digital video. Literally thousands of television channels operate in various areas of the world, and the number is still increasing. News, current affairs shows and popular drama serials gather huge numbers of viewers.
2.4.4 Games and Entertainment

Video games and movies have gained enormous popularity, and these are again applications of digital video. Nowadays we see an increasing trend in the popularity of 3D animated movies, which is a big success of digital video. Take the example of 'Avatar', a blockbuster 3D movie that is possibly among the most popular of the current era.
2.4.5 Video Telephony

Digital video has played an important role in delivering video along with voice while communicating on the telephone. At both government and private levels, video conferencing is replacing the need to travel far away to attend meetings in one place. Skype is probably the brand leader in this field.
Chapter 3
Video Compression Fundamentals
It has been observed that the size of an ordinary digitized video signal is far greater than the usual storage capacity and transmission media bandwidth. This shows the need for systems capable of compressing video. For example, a channel of ITU-R 601 television (at 30 fps) requires media with a bit rate of 216 Mbps for broadcasting in its uncompressed form; a 4.7 GB DVD can store only 87 seconds of uncompressed video at this bit rate. There is thus a clear need for a mechanism that can fit this data into media of limited capacity for storage or transmission. Hence comes compression, but with the drawback of some quality loss in the visual experience: an effective compression system is, in general, lossy in nature.
3.1 CODEC
The term CODEC represents a combination of two systems capable of encoding (compressing) and decoding (decompressing). A typical codec is shown in figure 3.1. The encoder compresses the original signal, a process called source coding. After some further signal processing, the signal reaches the point of decompression at the source decoder.
Figure 3.1: Source coder, channel coder, channel [2].

According to information theory, there is statistical redundancy in an ordinary data signal. This principle is utilized in Huffman coding, and such a CODEC is known as an entropy CODEC. However, entropy encoders alone do not perform well on images and video; source models need to be deployed before entropy coding can be applied to such data. Some properties present in video are taken into consideration to benefit the source models. These properties include the spatial and temporal redundancy present amongst pixels in video frames. Moreover, psychovisual experiments have shown that the human visual system is more sensitive to lower frequencies, so in the video encoding process some high frequencies can be safely ignored.

Codecs are often designed to emphasize certain aspects of the media to be encoded, or of their use. For example, a digital video (using a DV codec) of a sports event, such as baseball or soccer, needs to encode motion well but not necessarily exact colors, while a video of an art exhibit needs to encode color and surface texture well. Pertaining to video quality, there can be two kinds of codecs: in order to achieve a good level of compression, most codecs degrade the original quality of the signal and are known as lossy codecs; codecs which preserve the original quality of the signal are known as lossless codecs [10].

Some examples of coding techniques are presented here. In the Differential Pulse Code Modulation (DPCM) coding technique, pixels are sent as predictions from already dispatched pixels; the next step is to transmit the prediction error, which is the difference between the prediction and the actual pixel. Transform coding changes the domain of the frame signal; this change is helpful in rounding off insignificant coefficients, and a lossy compression is achieved. Transform coding has a great deal of application in various video compression techniques. Another technique is motion compensated predictive coding, which is the emphasis of this thesis. In a way similar to DPCM, a model of the actual frame of a video is obtained by prediction based on an already encoded frame. This model is then subtracted from the original frame to obtain the residual frame, which contains much less energy than its original frame [2].
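The DPCM idea described above can be sketched as follows, a minimal lossless variant for a 1-D pixel sequence (real codecs predict from reconstructed, quantized values rather than from the original pixels):

```python
def dpcm_encode(pixels):
    """Replace each pixel by its difference from the previous pixel (prediction error)."""
    prediction = 0  # assumed initial predictor value
    errors = []
    for p in pixels:
        errors.append(p - prediction)
        prediction = p  # lossless case: the predictor is the previous actual pixel
    return errors

def dpcm_decode(errors):
    """Rebuild the pixel sequence by accumulating the prediction errors."""
    prediction, pixels = 0, []
    for e in errors:
        prediction += e
        pixels.append(prediction)
    return pixels
```

For smooth signals the errors cluster near zero, so they entropy-code far more compactly than the raw pixel values.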
3.2 A Video CODEC
Video signals are constructed from a sequence of still images, better known as video frames. These frames can be encoded for compression using intra-frame coding techniques, but this compression does not turn out to be good enough for video. This fact, and the presence of temporal redundancy inside the video sequence, drives the need for inter-frame encoding. A prediction of the actual video frame, based on a previous frame, is subtracted from the actual frame to form what is called the residual frame. The residual frame is then encoded by the frame codec. A block diagram of such a video coder is shown in figure 3.2.

Figure 3.2: Video CODEC With Prediction [2].

The process of encoding the residual frame includes its transformation. The transform coefficients are quantized, and entropy coding is then applied for transmission or storage. At the decoder end, the reverse of these steps is applied to get the original data back [2].
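The quantization step mentioned above is where the loss occurs; a uniform-quantizer sketch follows (the step size `qstep` is a stand-in for the codec's actual quantization parameter, which in practice varies per coefficient):

```python
def quantize(coeffs, qstep):
    """Map each transform coefficient to an integer level (the lossy step)."""
    return [round(c / qstep) for c in coeffs]

def dequantize(levels, qstep):
    """Approximate reconstruction of the coefficients at the decoder."""
    return [lv * qstep for lv in levels]
```

The reconstruction error per coefficient is bounded by half the step size, so a larger `qstep` means higher compression at the cost of lower quality, which is exactly the rate-distortion trade-off discussed in chapter 5.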
3.3 Video Coding Standards
Most of the video codecs currently in use belong to one of the two mainstream families of video coding standards, from the International Organization for Standardization (ISO) and the International Telecommunication Union (ITU). ISO has introduced the JPEG and MPEG-x series for images and video respectively; similarly, ITU has introduced its standards in the H.26x series. A brief description of these standards follows, and MPEG-4 is described in detail. ISO has covered applications related to storage and distribution through its standards. Its Moving Picture Experts Group (MPEG) has developed recommendations for video, and its standards include the following [2][3].
3.3.1 MPEG-1
Under this standard, video and audio data can be compressed (at a bit rate of about 1.4 Mbps) and played back in real time from CD-ROM. VHS-quality digital video is compressed at a ratio of about 26:1.
3.3.2 MPEG-2
The bit rate has been increased from the previous standard to 3-5 Mbps for the compression, storage and transmission of video and audio data. Additionally, support for interlaced video has been added.
3.3.3 MPEG-4
It came in late 1998 and provides additional features beyond those of the previous standards. It supports a huge range of bit rates and is discussed in detail at the end of this section.
3.3.4 MPEG-7
It is a multimedia content description standard. It provides support for describing multimedia content data, with the aim of providing a standardized system for content-based indexing and retrieval of multimedia information. It is meant for accessing multimedia data rather than for coding and compression. MPEG-7 is formally known as the Multimedia Content Description Interface.
3.3.5 MPEG-21
It is commonly known as the Multimedia Framework. It provides the definition of an open framework for multimedia applications. The Rights Expression Language, as defined by MPEG-21, standardizes the process of sharing digital rights for digital content from its source to the consumer end. Integration and interoperation between various technologies related to the multimedia field are promoted by this standard.
ITU has focused on applications related to real-time, full-duplex video communications. Its working body for standardization is called the Video Coding Experts Group (VCEG), and it has produced the following standards.
3.3.6 H.261

It was primarily introduced for video telephony over ISDN lines, where channel capacity is a multiple of 64 kbps. Two video sizes, CIF and QCIF, are supported.
3.3.7 H.263

It offers videoconferencing at a variety of bit rates, ranging from tens of kbps to several Mbps. It is quite popular in internet applications.
3.3.8 H.263+

It is the second version of H.263, adding some enhancements to the original standard, including better encoding and a level of immunity to transmission errors. A further version, H.263++, followed, in which additional annexes with more functionality were added.
3.3.9 H.264

Also known as MPEG-4 Part 10 or Advanced Video Coding (AVC), its first set of recommendations came in 2003. It is used in applications such as Blu-ray Disc, YouTube videos and television services. H.264 was developed by the Joint Video Team (JVT), a collaborative working group of ITU and ISO.
3.4 MPEG-4
This standard was developed in an effort to enhance the functionality of the already existing MPEG standards for video coding. One of the added features is efficient compression for applications involving transmission media with low bit rates. A whole new concept of video scenes and video objects has been introduced, which considers the coding of video based on its contents instead of treating everything the same in rectangular frames. The MPEG-4 standard is progressive in the sense that it has the capacity to absorb new tools and enhancements. The tools which MPEG-4 offers for encoding are organized into various subsets. Such subsets are called profiles, and a specific profile addresses a specific application. One example is the Simple Profile, which aims at applications requiring low bit rate and low resolution. Another is the Advanced Simple Profile, which has features like support for bidirectional prediction frames and quarter-pixel motion compensation. Some of the salient functionalities for encoding video frames with MPEG-4 are described in the following [2].
• Video core: The video coding core of the standard is built on very-low-bit-rate coding algorithms.
• Input format: Video data is pre-processed, and sometimes converted to one of the picture sizes listed in Table 2.2, at a frame rate of up to 30 frames per second and in 4:2:0 (Y:Cr:Cb) format, before the codec is applied.
• Picture type: Frames are encoded as I frames (intra coded), P frames (predictive coded) or B frames (bidirectionally predicted). For video encoding, a frame is usually divided into small sections of a certain size called macroblocks. I frames contain strictly intra-coded macroblocks, while a P frame may contain either inter- or intra-coded macroblocks.
• Motion estimation: It is normally done on macroblocks of size 16x16, with optional availability of block sizes 8x8, 4x4, 4x8, 8x4, 8x16 and 16x8, depending upon the profile in operation. The motion vectors (arrays of coordinates representing relative motion) can have sub-pixel resolution.
• Transform coding: The residual frame obtained after the motion estimation process is coded using the discrete cosine transform (DCT). The returned coefficients are quantized and arranged in zigzag fashion. Finally, run-level coding is applied.
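The zigzag scan and run-level steps above can be sketched as follows; this is a generic JPEG-style illustration, since block sizes and coding tables vary by standard:

```python
def zigzag_order(n):
    """Visit an n x n block along anti-diagonals, alternating direction."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[1] if (rc[0] + rc[1]) % 2 == 0 else -rc[1]))

def run_level(block):
    """Zigzag-scan a square block and emit (zero-run, level) pairs for nonzero values.

    Trailing zeros are implicitly dropped; a real codec signals them
    with an end-of-block code.
    """
    n = len(block)
    pairs, run = [], 0
    for r, c in zigzag_order(n):
        v = block[r][c]
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    return pairs
```

After quantization most high-frequency coefficients are zero, so the scan groups them into long zero runs that compress very well.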
3.5 Syntax
The main features of the syntax of the MPEG-4 coded bit stream are described below.
• Picture layer: The top layer of the syntax contains a complete coded picture. The picture header contains values describing the picture resolution, the type of coded picture (inter or intra) and a temporal reference field.
• Group of blocks layer: A complete row of macroblocks forms a group of blocks (GOB) in sub-QCIF, QCIF and CIF frames, and the GOB header helps the decoder resynchronize if errors cause a loss of synchronization.
• Macroblock layer: Four luminance and two chrominance blocks form one macroblock. The header contains information about the type of macroblock and the motion vectors for inter-coded macroblocks.
Chapter 4
Motion Estimation and its Implementation
Motion estimation is a key component in video compression, video processing and computer vision. In a video compression process, knowledge of the motion helps eradicate the temporal redundancy amongst successive frames, and consequently a high compression ratio is obtained. This makes motion estimation an essential module in the video coding standards. Contrary to the older standards, MPEG-4 introduces a region-based motion model which is quite flexible and more efficient.

Consider a video frame from a set of video frames and call it the current frame. Frames from the video which have already been encoded can be used to predict the contents of the current frame; such frames are called reference frames, and such prediction is called motion prediction. The temporal order of a reference frame can be earlier or later than the current frame, making the estimation forward prediction or backward prediction respectively. Forward and backward prediction can also be combined, in which case it is called bidirectional prediction. The process can be understood from the pictorial explanation given in figure 4.1 [11] and the block diagram in figure 4.2 [2]. The target of a motion estimation algorithm is to develop a model of the current frame from the reference frame with maximum accuracy and minimum computational effort.
As shown in figure 4.2, the Motion Estimation block creates such a model by altering a reference frame. The Motion Compensation block creates a residual frame by subtracting the model of the current frame from the original current frame. This residual frame is then sent for transform and entropy coding, and transmitted along with the motion vector information. Another interesting step taken here is the decoding of this encoded frame in order to reproduce the current frame, to be
Figure 4.1: Motion Estimation and Compensation [11].
Figure 4.2: Motion Estimation and Compensation [2].
Figure 4.3: Block Matching [2].
used as a reference frame in the further encoding process. The grade of compression can be measured by the size of the coded residual frame, also called the displaced frame difference (DFD), plus the overhead information related to the motion vectors. The size of the coded residual frame is proportional to the energy present in the DFD after the motion compensation process. This energy can be reduced using motion estimation and compensation to achieve higher compression efficiency [2].
4.1 Block Matching
To carry out the motion estimation and compensation process, a video frame is treated as composed of non-overlapping blocks of a certain size, e.g., 16 x 16 pixels; other standard sizes are used in different video coding standards. Such blocks are formally known as macroblocks, and motion estimation applied to them is known as block matching. Block matching is performed on luminance samples (e.g., on Y blocks in MPEG-4 encoding). A macroblock from the current frame is checked for similarity against macroblocks of the reference frame, aiming to minimize the energy difference between them. The search area in the reference frame is centered around the position of the macroblock under consideration, to exploit the temporal redundancy and to avoid searching the whole reference frame. The block matching process is depicted in figure 4.3, where a 3x3 current block is searched for a match around the corresponding position in the reference frame, and the search region is kept 1 pixel wider than the size of the block. There are various search criteria for estimating the optimum matching point; examples include the Sum of Absolute Differences (SAD), Mean Square Error (MSE) and Mean Absolute Error (MAE). SAD is calculated as:
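As a concrete preview of block matching with the SAD criterion, the following sketch performs an exhaustive (Full Search style) scan of all candidate offsets; it is a plain illustration on integer-valued frames, not the thesis implementation:

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks (lists of rows)."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))

def full_search(current, reference, bx, by, n, search_range):
    """Find the motion vector for the n x n block at (bx, by) of the current frame.

    Every candidate offset within +/- search_range pixels is tested
    against the reference frame; the offset with minimum SAD wins.
    """
    h, w = len(reference), len(reference[0])
    cur_block = [row[bx:bx + n] for row in current[by:by + n]]
    best, best_cost = (0, 0), float("inf")
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            x, y = bx + dx, by + dy
            if x < 0 or y < 0 or x + n > w or y + n > h:
                continue  # candidate block falls outside the reference frame
            cand = [row[x:x + n] for row in reference[y:y + n]]
            cost = sad(cur_block, cand)
            if cost < best_cost:
                best_cost, best = cost, (dx, dy)
    return best, best_cost
```

The nested loops make the cost of Full Search obvious: (2 * search_range + 1)^2 SAD evaluations per block, which is exactly the complexity that the fast algorithms in section 4.2 try to avoid.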