On Computational Complexity of Motion Estimation Algorithms in
MPEG-4 Encoder
Muhammad Shahid
This thesis report is presented as a part of degree of
Master of Science in Electrical Engineering
Blekinge Institute of Technology, 2010
Supervisor: Tech Lic. Andreas Rossholm, ST-Ericsson Examiner: Dr. Benny Lovstrom, Blekinge Institute of Technology
Abstract
Video encoding in mobile equipment is a computationally demanding feature that requires well-designed and well-developed algorithms. The optimal solution requires trade-offs in the encoding process, e.g. in motion estimation, between low complexity on the one hand and high perceptual quality and efficiency on the other. This thesis works on reducing the complexity of motion estimation algorithms used for MPEG-4 video encoding, taking the SLIMPEG motion estimation algorithm as reference. Inherent properties of video, such as spatial and temporal correlation, have been exploited to test new motion estimation techniques. Four motion estimation algorithms have been proposed, and their computational complexity and encoding quality have been evaluated. The resulting encoded video quality has been compared against the standard Full Search algorithm. At the same time, the reduction in computational complexity of the improved algorithms is compared against SLIMPEG, which is already about 99% more efficient than Full Search in terms of computational complexity. The fourth proposed algorithm, Adaptive SAD Control, offers a mechanism for choosing the trade-off between computational complexity and encoding quality dynamically.
Acknowledgements
It is a matter of great pleasure to express my deepest gratitude to my advisors Dr. Benny Lovstrom and Andreas Rossholm for all their guidance, support and encouragement throughout my thesis work. It was a great opportunity to do research work at ST-Ericsson under the marvelous supervision of Andreas Rossholm. The counseling provided by Benny Lovstrom was of great value to me in writing up this manuscript.
I cannot forget to mention the help I received from Fredrik Nillson and Jimmy Rubin of ST-E in setting up the working environment and getting started with the ST-E algorithm. I owe my successes in life so far to all of my family members, for their magnificent kindness and love!
Contents
Abstract
Acknowledgements
Contents
1 Introduction
2 Basics of Digital Video
    2.1 Color Spaces
    2.2 Video Quality
    2.3 Representation of Digital Video
    2.4 Applications
        2.4.1 Internet
        2.4.2 Video Storage
        2.4.3 Television
        2.4.4 Games and Entertainment
        2.4.5 Video Telephony
3 Video Compression Fundamentals
    3.1 CODEC
    3.2 A Video CODEC
    3.3 Video Coding Standards
        3.3.1 MPEG-1
        3.3.2 MPEG-2
        3.3.3 MPEG-4
        3.3.4 MPEG-7
        3.3.5 MPEG-21
        3.3.6 H.261
        3.3.7 H.263
        3.3.8 H.263+
        3.3.9 H.264
    3.4 MPEG-4
    3.5 Syntax
4 Motion Estimation and its Implementation
    4.1 Block Matching
    4.2 Motion Estimation Algorithms
        4.2.1 Full Search
        4.2.2 Three-Step Search
        4.2.3 Diamond Search
        4.2.4 SLIMPEG
5 Rate Distortion Optimization and Bjontegaard Delta PSNR
    5.1 Measurement of Distortion
    5.2 Bjontegaard Delta PSNR
6 Simulation, Results and Discussion
    6.1 SAD as a Comparison Metric
    6.2 Proposed Techniques
        6.2.1 Spatial Correlation Algorithm
        6.2.2 Temporal Correlation Algorithm
        6.2.3 Adaptive SAD Control
    6.3 Simulations with different video sequences
        6.3.1 Football Sequence
        6.3.2 Foreman Sequence
        6.3.3 Claire Sequence
7 Conclusion and Future Work
List of figures
List of tables
Bibliography
Chapter 1
Introduction
Since the first digital video coding standard was introduced by the International Telecommunication Union (ITU) in 1984, the technology has seen great progress. The two main standard-setting bodies in this area are the ITU and the International Organization for Standardization (ISO). ITU recommendations include standards such as H.261, H.262, H.263 and H.264, which focus on telecommunication applications. The Moving Picture Experts Group (MPEG) of ISO has released standards such as MPEG-1, MPEG-2 and MPEG-4, which focus on applications in computers and consumer electronics. The standards defined by the two groups have some parts in common, and some work has been carried out jointly. The field of video compression has developed continuously through enhancements to earlier versions of the standards and the introduction of new recommendations. This thesis follows the MPEG-4 standard.
Video compression is a key requirement in any multimedia storage or transmission system: video is encoded before it is sent or stored, and decoded at the receiver or at viewing time. Besides television and CD/DVD, cellular phones will probably be the next major platform for video content. The limited storage capacity of mobile equipment dictates the need for efficient video compression tools. Video encoding in mobile equipment has developed from a high-end feature into something that is taken for granted. Nevertheless, it is a computationally demanding feature that requires well-designed and well-developed algorithms, and many different algorithms need to be evaluated in order to come closer to the optimal solution.
As early as 1929, Ray Davis Kell described a form of video compression for which he obtained a patent [1].
Given that a video is a series of pictures transmitted at a fixed rate, Kell's patent gave rise to the idea of transmitting the difference between successive images instead of sending the whole image. It took a long time for the idea to be put into practice, but it remains a keystone of many video compression standards today. Connected to this idea is the concept of motion estimation, which exploits the temporal correlation between video frames. It predicts the content of the current frame from already encoded frames by estimating the motion between them. As a result, the residual frame contains much less energy than the original frame, and the motion vectors and the residual frame can be encoded at a much lower bit rate than a regular frame. Motion estimation can account for a very large share of the computational work in video coding. Several algorithms exist for motion estimation. The most basic one, Full Search, gives optimal performance but is computationally very expensive. To deal with this, many sub-optimal fast search algorithms have been designed; this thesis focuses on some of them and attempts to improve the performance of one. The SLIMPEG motion estimation algorithm is taken as reference, and inherent video properties such as spatial and temporal correlation are used to devise motion estimation algorithms that are less complex yet still perform well.
The rest of the report is organized as follows. Chapters 2 and 3 deal with the fundamentals of digital video and video compression, respectively. Chapter 4 explores implementation aspects of motion estimation and ends with an introduction to the SLIMPEG motion estimation algorithm. Chapter 5 covers rate distortion and delta PSNR. Chapter 6 presents and discusses the results of the main contribution. Chapter 7 contains the conclusion and some hints about future work in the field.
Chapter 2
Basics of Digital Video
A video image is obtained by capturing a 2D view of a 3D scene. Digital video is then a sequence of spatially and temporally sampled frames. The spatio-temporal sampling unit, usually called a pixel (picture element), is represented by a digital value that describes its color and brightness. The more sampling points used to form a video frame, the higher the visual quality usually is, but the more storage capacity is required. A video frame is usually rectangular. The smoothness of a video is determined by the rate at which its frames are presented in succession. A frame rate of thirty frames per second looks smooth enough for most purposes. A general comparison of the appearance of a video at different frame rates is given in Table 2.1 [2].
Table 2.1: Video frame rates [2].

Video frame rate              Appearance
Below 10 frames per second    'Jerky', unnatural appearance to movement
10-20 frames per second       Slow movement appears OK; rapid movement is clearly jerky
20-30 frames per second       Movement is reasonably smooth
50-60 frames per second       Movement is very smooth
2.1 Color Spaces
A pixel may be represented by just one number (grey scale image) or by multiple numbers (color image). A particular scheme used for representing colors is called a color space. Two of the most common schemes are RGB (red/green/blue) and YCrCb (luminance/red chrominance/blue chrominance). In the RGB color space, each pixel is represented by three numbers indicating the relative proportions of the three colors. Each number usually consists of eight bits, so one pixel requires twenty-four bits for its complete representation. Psychovisual experiments have shown that the human visual system is less sensitive to color than to luminance. This fact is exploited in the YCrCb color space, where the luminance is concentrated in a single component, Y, and the color information is carried by the remaining components. The two color spaces are related, and one representation can be transformed into the other. For details, see [2].
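The relationship between the two color spaces can be illustrated with a small sketch. It assumes the commonly used ITU-R BT.601 luminance weights (0.299, 0.587, 0.114) and zero-centred chrominance components; these constants and the function names are illustrative assumptions and are not taken from the thesis.

```python
def rgb_to_ycrcb(r, g, b):
    """Convert one RGB pixel to (Y, Cr, Cb).

    Assumes BT.601 luma weights; Cr and Cb are zero-centred
    (no +128 offset is applied in this sketch).
    """
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cr = 0.713 * (r - y)  # red chrominance: scaled R - Y difference
    cb = 0.564 * (b - y)  # blue chrominance: scaled B - Y difference
    return y, cr, cb


def ycrcb_to_rgb(y, cr, cb):
    """Invert rgb_to_ycrcb under the same assumptions."""
    r = y + cr / 0.713
    b = y + cb / 0.564
    # Green is recovered from the luma equation.
    g = (y - 0.299 * r - 0.114 * b) / 0.587
    return r, g, b


# A grey pixel carries no color information: both chrominance values are ~0.
print(rgb_to_ycrcb(128, 128, 128))
```

Because the eye is less sensitive to the Cr and Cb components, a codec can store them at a lower resolution than Y, which is where the saving over RGB comes from.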
2.2 Video Quality
Video quality is an important parameter and is inherently subjective, since it is judged by human viewers. There are many objective criteria for measuring video quality, e.g. PSNR, whose results correlate to some degree with human experience. However, they may not fully reflect the subjective experience of a human observer. Experiments show that a picture with a lower PSNR may look visually better than one with a higher PSNR. Visual experience also varies from person to person, which motivates alternatives that cover the needs of both objective and subjective tests. An objective test that matches human visual experience as closely as possible will give acceptable results.
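As a concrete reference for the PSNR measure mentioned above, the following sketch uses the standard definition PSNR = 10 log10(MAX^2 / MSE), where MSE is the mean squared error between the original and the distorted samples and MAX is the largest possible sample value (255 for 8-bit video). The function name and the flat-list input format are assumptions made for illustration.

```python
import math


def psnr(original, distorted, max_value=255):
    """Peak signal-to-noise ratio in dB between two equal-length sample lists."""
    if len(original) != len(distorted) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all samples.
    mse = sum((a - b) ** 2 for a, b in zip(original, distorted)) / len(original)
    if mse == 0:
        return math.inf  # identical signals: no distortion
    return 10 * math.log10(max_value ** 2 / mse)


# Every sample is off by one, so the MSE is 1.
print(psnr([100, 100, 100, 100], [101, 99, 101, 99]))
```

PSNR depends only on the sample-wise error, which is why two pictures with the same PSNR can look quite different to a human observer.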
2.3 Representation of Digital Video
Before video is coded, it is often converted to one of the intermediate formats. The central one is the Common Intermediate Format (CIF), in which the frame resolution is 352 X 288 pixels. Table 2.2 lists some standard intermediate formats.

Table 2.2: Intermediate formats [2].

Format                Luminance resolution (horiz. X vert.)
Sub-QCIF              128 X 96
Quarter CIF (QCIF)    176 X 144
CIF                   352 X 288
4CIF                  704 X 576

2.4 Applications
Applications of digital video have grown exponentially, and the technology continues to develop rapidly. Some examples of widely used digital video applications are given in the following subsections.
2.4.1 Internet
The Internet currently hosts the majority of digital video applications, ranging from short video clips to full-length movies and from informal video chats to corporate video conferences. Remote teaching and learning, video telephony and video sharing have all been made possible by digital video. The video sharing service YouTube presents billions of videos to viewers worldwide using digital video technology.
2.4.2 Video Storage
Digital video has reshaped the way videos are stored. CD/DVD-ROM and Blu-ray Disc have almost wiped out classic tape-based storage media. These discs offer large advantages in capacity, portability and durability. The latest of them is the Blu-ray Disc, which has a storage capacity of 25 GB in single layer and 50 GB in dual layer [3].
2.4.3 Television
Satellite television channels across the planet have helped create a global village by virtue of digital video. There are thousands of television channels operating in various parts of the world, and the number is still increasing. News, current affairs shows and popular drama serials attract huge numbers of viewers.
2.4.4 Games and Entertainment
Graphics-intensive video games and movies have gained enormous popularity, and these are again applications of digital video. Nowadays, 3D animated movies are increasingly popular, which is a big success for digital video. One example is 'Avatar', a blockbuster 3D film that is possibly among the best-liked movies of the current era.
2.4.5 Video Telephony
Digital video has played an important role in adding video to voice in telephone communication. In both government and private organizations, video conferencing is reducing the need to travel long distances to attend meetings in one place. Skype is probably the brand leader in this field.
Chapter 3
Video Compression Fundamentals
The size of an ordinary digitized video signal is far greater than typical storage capacities and transmission bandwidths, which shows the need for systems capable of compressing video. For example, a channel of ITU-R 601 television (at 30 fps) requires a bit rate of 216 Mbps for broadcasting in uncompressed form. A 4.7 GB DVD can store only 87 seconds of uncompressed video at this bit rate. There is therefore a clear need for a mechanism that makes this data fit into storage and transmission media of limited capacity. Compression provides this, but at the cost of some loss in visual quality. An effective compression system is, in general, lossy in nature.
3.1 CODEC
The term CODEC denotes a combination of two systems, one capable of encoding (compressing) and one of decoding (decompressing). A typical codec is shown in Figure 3.1. The encoder compresses the original signal in a process called source coding. After some further signal processing, the signal reaches the point of decompression at the source decoder.
Figure 3.1: Source coder, channel coder, channel [2].

According to information theory, an ordinary data signal contains statistical redundancy. Huffman coding exploits this principle, and a CODEC of this kind is known as an entropy CODEC. However, entropy coders alone do not perform well on images and video; a source model needs to be applied before entropy coding can be used effectively on such data. Source models take advantage of certain properties of video, including the spatial and temporal redundancy among pixels in video frames. Moreover, psychovisual experiments have shown that the human visual system is more sensitive to lower frequencies, so some high frequencies can be safely discarded when encoding video.
Codecs are often designed to emphasize certain aspects of the media to be encoded, or of its use. For example, a digital video (using a DV codec) of a sports event, such as baseball or soccer, needs to encode motion well but not necessarily exact colors, while a video of an art exhibit needs to encode color and surface texture well. With regard to video quality, there are two kinds of codecs. To achieve a good level of compression, most codecs degrade the original quality of the signal; these are known as lossy codecs. Codecs that preserve the original quality of the signal are known as lossless codecs [10].
Some examples of coding techniques follow. In Differential Pulse Code Modulation (DPCM), each pixel is predicted from pixels that have already been transmitted, and only the prediction error, i.e. the difference between the actual pixel and its prediction, is sent. Transform coding changes the domain in which the frame signal is represented. This makes it possible to round off insignificant coefficients, which yields lossy compression. Transform coding is widely applied in video compression techniques. Another technique is motion compensated predictive coding, which is the emphasis of this thesis. In a similar way to DPCM, a model of the current frame is obtained by prediction from an already encoded frame. This model is then subtracted from the original frame to obtain a residual frame, which contains much less energy than the original frame [2].
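The DPCM principle described above can be sketched in a few lines. This is a minimal, lossless illustration that assumes the simplest possible predictor (the previously transmitted pixel, with an initial prediction of 0); practical DPCM coders use more elaborate predictors and quantize the error. The function names are hypothetical.

```python
def dpcm_encode(pixels):
    """Return the prediction errors for a row of pixels."""
    prediction = 0  # assumed initial prediction before any pixel is sent
    errors = []
    for pixel in pixels:
        errors.append(pixel - prediction)  # transmit only the difference
        prediction = pixel                 # next prediction is this pixel
    return errors


def dpcm_decode(errors):
    """Rebuild the pixel row by adding each error to the running prediction."""
    prediction = 0
    pixels = []
    for error in errors:
        pixel = prediction + error
        pixels.append(pixel)
        prediction = pixel
    return pixels


row = [100, 102, 103, 103, 101]
errors = dpcm_encode(row)
print(errors)  # [100, 2, 1, 0, -2]
assert dpcm_decode(errors) == row
```

After the first sample, the errors are small values clustered around zero, which an entropy coder can represent with fewer bits than the raw pixel values.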
3.2 A Video CODEC
A video signal consists of a sequence of still images, better known as video frames. These frames can be compressed individually using intra-frame coding techniques, but the compression achieved this way is not sufficient for video. This fact, together with the temporal redundancy present in a video sequence, motivates inter-frame encoding. A prediction of the current video frame, based on the previous frame, is subtracted from the current frame to form what is called the residual frame. The residual frame is then encoded by the frame codec. A block diagram of such a video coder is shown in Figure 3.2. Encoding the residual frame includes transforming it; the transform coefficients are quantized, and entropy coding is then applied for transmission or storage. At the decoder, the inverse of these steps is applied to reconstruct the video [2].

Figure 3.2: Video CODEC With Prediction [2].
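The residual frame idea can be illustrated with a small sketch. It assumes the crudest possible prediction, namely the previous frame used unchanged (no motion compensation), and measures energy as the sum of squared sample values; the frame contents and function names are made up for illustration.

```python
def residual_frame(current, prediction):
    """Subtract the predicted frame from the current frame, sample by sample."""
    return [[c - p for c, p in zip(cur_row, pred_row)]
            for cur_row, pred_row in zip(current, prediction)]


def reconstruct(prediction, residual):
    """Decoder side: add the residual back onto the same prediction."""
    return [[p + r for p, r in zip(pred_row, res_row)]
            for pred_row, res_row in zip(prediction, residual)]


def energy(frame):
    """Sum of squared sample values."""
    return sum(value * value for row in frame for value in row)


previous = [[10, 10, 10, 10],
            [10, 50, 50, 10],
            [10, 50, 50, 10],
            [10, 10, 10, 10]]
current = [[10, 10, 10, 10],
           [10, 52, 50, 10],
           [10, 50, 49, 10],
           [10, 10, 10, 10]]

residual = residual_frame(current, previous)
print(energy(current), energy(residual))  # 11305 5
assert reconstruct(previous, residual) == current
```

In this sketch the residual is passed through unchanged, so reconstruction is exact. In the codec described above, the residual is transformed and quantized before entropy coding, which is where the loss is introduced.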