
Efficient Coding with Motion-Compensated Orthogonal Transforms

DU LIU

Master’s Degree Project Stockholm, Sweden 2011

XR-EE-SIP 2011:011 Efficient Video Coding with Motion-Compensated Orthogonal Transforms

Du Liu

July, 2011

Abstract

Well-known standard hybrid coding techniques utilize the concept of motion-compensated predictive coding in a closed loop. The resulting coding dependencies are a major challenge for packet-based networks like the Internet. On the other hand, subband coding techniques avoid the dependencies of predictive coding and are able to generate video streams that better match packet-based networks. An interesting class for subband coding is the so-called motion-compensated orthogonal transform. It generates orthogonal subband coefficients for arbitrary underlying motion fields. In this project, a theoretical lossless signal model based on Gaussian distribution is proposed. It is possible to obtain the optimal rate allocation from this model. Additionally, a rate-distortion efficient video coding scheme is developed that takes advantage of motion-compensated orthogonal transforms. The scheme combines multiple types of motion-compensated orthogonal transforms, variable block size, and half-pel accurate motion compensation. The experimental results show that this scheme outperforms individual motion-compensated orthogonal transforms.

Acknowledgements

This thesis was carried out at Sound and Image Processing Lab, School of Electrical Engineering, KTH. I would like to express my appreciation to my supervisor Markus Flierl for the opportunity of doing this thesis. I am grateful for his patience and valuable suggestions and discussions. Many thanks to Haopeng Li, Mingyue Li, and Zhanyu Ma, who helped me a lot during my research. I would also like to thank my parents and my friends Alicia Wang and Peng Wu for their constant support.

Contents

Abstract

Acknowledgements

1 Introduction

2 Background
  2.1 Motion-Compensated Orthogonal Transforms
    2.1.1 Motion Compensation
    2.1.2 The Orthogonal Transform
  2.2 Adaptive Spatial Wavelet Transforms
    2.2.1 Type-1 Spatial Wavelet Transform
    2.2.2 Type-2 Spatial Wavelet Transform
  2.3 Quantization
  2.4 Entropy Coding

I Theoretical Model

3 Theoretical Signal Model
  3.1 General Transform Model
  3.2 Memoryless Gaussian Model

4 Numerical Results

II Practical System

5 Efficient Video Coding Scheme
  5.1 Construction of Various MCOTs
    5.1.1 Multiple Types of MCOT
    5.1.2 Multi-hypothesis MCOT
  5.2 Obtaining Motion Vectors
  5.3 Variable Block Size
  5.4 Mode Decision

6 Experimental Results

7 Conclusions

Bibliography

List of Figures

2.1 Block-based motion compensation with two matching blocks x1,i and x2,j and a motion vector mv.
2.2 Half-pel accurate motion compensation for integer position A. Positions 1 to 8 are the possible half-pel positions for Position A.
2.3 The distribution of a 2-dimensional noisy image for (a) the Haar wavelet transform with a rotation of 45° and (b) the MCOT with an optimal decorrelation angle α*.
2.4 Type-2 spatial wavelet transform of Lena with three decomposition levels.
2.5 Structure of a bitstream for one code-block. Sign: Signs of the coefficients. SP: Significant Propagation Pass. MR: Magnitude Refinement Pass. CP: Cleanup Pass.

3.1 Theoretical signal model.
3.2 The theoretical curve g(Rp) of the variance of the clean high band over Rp.
3.3 The theoretical curve of the total rate ht over Rp.
3.4 Different g for different f.
3.5 Different hc for different f.

4.1 The total rate ht over Rp with γ = 9 for different noise levels. g0 = σ²v = 1.
4.2 The rate of the coefficients hc over Rp with γ = 9 for different noise levels. g0 = σ²v = 1.

5.1 Efficient video coding system.
5.2 Multi-hypothesis for bidirectional half-pel motion compensation.
5.3 An example of 6-hypothesis motion estimation.
5.4 Partitions of a block of 16×16 for motion estimation.
5.5 Structure of the minimization of the cost function with the three levels.

6.1 Luminance PSNR vs. rate for the QCIF sequence Foreman at 30 fps with 64 frames and a GOP size of 8 frames. The compared transforms include the proposed MCOT, the bidirectional MCOT with variable block size (VBS) and half-pel motion compensation (HP), the bidirectional MCOT without VBS or HP, the Haar wavelet transform without VBS or HP, and the intra coding.
6.2 Luminance PSNR vs. rate for the QCIF sequence Mother & Daughter at 30 fps with 64 frames and a GOP size of 8 frames. The compared transforms include the proposed MCOT, the bidirectional MCOT with variable block size (VBS) and half-pel motion compensation (HP), the bidirectional MCOT without VBS or HP, the Haar wavelet transform without VBS or HP, and the intra coding.

Chapter 1

Introduction

Video communication has been broadly used in today's communication and visual services, such as terrestrial broadcast, cable TV, satellite TV, real-time conversation, Internet video, and so on. For all these applications, video coding techniques play an important part in the storage, transmission, and representation of video data. Since the storage space or the transmission bandwidth is usually limited, most video coding schemes are lossy. There is obviously a trade-off between the video quality and the hardware and software requirements. Thus, a video coding technique is expected to code the video sequences efficiently such that the decoded video provides the highest possible quality for a given storage space or a given data rate. The standard video compression techniques, such as H.261 [1], H.263 [2], MPEG-1 [3], MPEG-4 Part 2 [4], and, more recently, H.264/AVC [5], utilize the concept of motion-compensated predictive coding. Predicted frames (known as P-frames) and bi-predicted frames (B-frames) are used to exploit the temporal redundancy of the sequences, with one key frame (I-frame) for each group of pictures (GOP). Because predictive coding operates in a closed-loop fashion, the coded pictures heavily depend on the relationship of the successive pictures. These dependencies introduce the risk of error propagation to the subsequently decoded pictures, which might be suboptimal for channels with packet loss [6]. On the other hand, the motion-compensated orthogonal transform (MCOT) is a subband coding technique that operates in an open-loop fashion. It does not depend on predictive coding and, therefore, avoids error propagation. Thus it is more suitable for packet-based networks like the Internet. The motion-compensated orthogonal transform is a class of subband coding techniques. It generates orthogonal subband coefficients for arbitrary underlying motion fields.
The goal of this project is to develop a rate-distortion efficient video coding scheme that takes advantage of motion-compensated orthogonal transforms. A theoretical model is proposed to analyze the

optimal rate allocation. The performance of the practical system will be evaluated by peak signal-to-noise ratio (PSNR). The report is organized as follows: Chapter 2 introduces the background of motion-compensated orthogonal transforms, adaptive spatial wavelet transforms, quantization, and entropy coding. Chapter 3 proposes a theoretical signal model for the transform coding. Numerical results for the theoretical model are presented in Chapter 4. Chapter 5 describes the implemented video coding system. Chapter 6 presents the experimental results for the coding system.

Chapter 2

Background

2.1 Motion-Compensated Orthogonal Transforms

The class of MCOTs includes the unidirectional motion-compensated orthogonal transform [7], the bidirectional motion-compensated orthogonal transform [8], a half-pel accurate transform [9], and a multi-hypothesis transform [10]. In this thesis, various types of MCOTs are combined with various motion models to achieve an efficient adaptation to the actual motion of the coded image sequence.

2.1.1 Motion Compensation

Motion compensation exploits the similarity of consecutive pictures and is very commonly used in today's video coding techniques. Usually, a sequence of successive frames is similar, and motion compensation is used to exploit this kind of redundancy. Applied to Internet video services, this can reduce the required data rate from several megabytes per second to around 10 kbps [11]. In block-based motion compensation, each frame is divided into blocks of, e.g., 8 × 8 or 16 × 16 pixels. A reference frame is defined, and the motion-compensation algorithm searches for the block in the reference frame that best matches the current block. A motion vector indicates the shift between the current block and the reference block. Fig. 2.1 depicts the two matching blocks x1,i in x1 and x2,j in x2. x2,j is the current block, and x1 is the reference frame in which the best match for x2,j is found with a motion vector mv. The matching criterion is usually the Sum of Squared Differences (SSD) or the Sum of Absolute Differences (SAD). The components of the motion vector need not be integers; they can point to sub-pixel locations such as half-pel or quarter-pel positions. This provides more accurate motion compensation for the block-matching scheme and therefore reduces the information in the residual signal.
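The block-matching search described above can be sketched as a full search with the SAD criterion. This is an illustrative sketch (function name and window size are ours, not from the thesis):

```python
import numpy as np

def block_matching_sad(ref, cur_block, top, left, search=4):
    """Full search in a (2*search+1)^2 window around (top, left) of the
    reference frame; returns the motion vector (dy, dx) and its SAD."""
    B = cur_block.shape[0]
    best_mv, best_sad = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + B > ref.shape[0] or x + B > ref.shape[1]:
                continue                      # candidate outside the frame
            cand = ref[y:y + B, x:x + B].astype(np.int64)
            sad = int(np.abs(cand - cur_block.astype(np.int64)).sum())
            if sad < best_sad:
                best_sad, best_mv = sad, (dy, dx)
    return best_mv, best_sad

# A block cut out of the reference at an offset is recovered exactly.
rng = np.random.default_rng(1)
ref = rng.integers(0, 256, size=(32, 32), dtype=np.uint8)
cur_block = ref[10:18, 12:20]        # true displacement (2, 3) from (8, 9)
mv, sad = block_matching_sad(ref, cur_block, top=8, left=9)
```

With SSD, only the distance measure changes; the search structure is identical.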


Figure 2.1: Block-based motion compensation with two matching blocks x1,i and x2,j and a motion vector mv.

Figure 2.2: Half-pel accurate motion compensation for integer position A. Positions 1 to 8 are the possible half-pel positions for Position A.

In this project, half-pel motion accuracy is considered. Fig. 2.2 depicts the integer pixel positions A to I and the half-pel positions 1 to 8. For each integer pixel position, e.g., A, we consider the eight surrounding half-pel positions. The interpolated values at the half-pel positions are given by averages of the neighbouring integer pixels:

p1 = (pA + pD)/2,             p5 = (pA + pH)/2,
p2 = (pA + pB + pC + pD)/4,   p6 = (pA + pH + pF + pG)/4,
p3 = (pA + pB)/2,             p7 = (pA + pF)/2,
p4 = (pA + pB + pI + pH)/4,   p8 = (pA + pF + pE + pD)/4.   (2.1)

2.1.2 The Orthogonal Transform

Since the MCOT is an orthogonal linear transform, the differential entropy of the source signal is preserved. We have h(x1, x2) = h(L, H), where h(x1, x2) is the joint entropy of the two input pictures x1 and x2 and h(L, H) is the joint entropy of the low band and the high band. Fig. 2.3 depicts a 2-dimensional signal with (a) the Haar wavelet transform and (b) the MCOT. α is the rotation angle determined by the transform. The Haar wavelet transform always rotates the signal by α = 45°, which means it may be suboptimal if the source signal distribution has an angle unequal to 45°. The Haar transform matrix is

H_Haar = (1/√2) [ 1  1
                 −1  1 ].   (2.2)

The MCOT, on the other hand, determines the decorrelation angle α = α* by the constraint of energy concentration. It aims at rotating the signal onto the x1-axis, which means the MCOT is adaptive to the distribution of the signal. The orthogonal matrix of the MCOT is

H_MCOT = [  cos α  sin α
           −sin α  cos α ].   (2.3)

For uncorrelated Gaussian signals, the coefficients after the MCOT are independent, and the differential entropy of the source signal becomes h(L) + h(H). We can write the differential entropy as

h(x1, x2) = h(L, H) ≤ h(L) + h(H),  ∀α,   (2.4)

and

h(x1, x2) = h(L) + h(H),  only for α = α*.   (2.5)
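The effect of choosing the decorrelation angle by energy concentration can be illustrated numerically: rotate a pair of signals by the angle that minimizes the high band energy and observe that the total energy is preserved. This is a simplified 2-point sketch of the rotation idea, not the pixel-wise MCOT of [7] (the function name is ours):

```python
import numpy as np

def mcot_rotation(x1, x2):
    """Rotate the pair (x1, x2) by the angle that minimizes the high band
    energy (energy concentration); returns (low, high, alpha)."""
    x1 = np.asarray(x1, dtype=float).ravel()
    x2 = np.asarray(x2, dtype=float).ravel()
    # critical angles of the high band energy E_H(alpha); test both
    a0 = 0.5 * np.arctan2(2.0 * np.dot(x1, x2),
                          np.dot(x1, x1) - np.dot(x2, x2))
    best = None
    for alpha in (a0, a0 + np.pi / 2):
        low = np.cos(alpha) * x1 + np.sin(alpha) * x2
        high = -np.sin(alpha) * x1 + np.cos(alpha) * x2
        e_high = np.dot(high, high)
        if best is None or e_high < best[0]:
            best = (e_high, low, high, alpha)
    return best[1], best[2], best[3]

# Identical blocks: the optimal angle is 45 degrees and the high band vanishes.
x = np.arange(16.0).reshape(4, 4)
low, high, alpha = mcot_rotation(x, x)
```

For identical inputs the optimal rotation coincides with the Haar angle of 45°; for other distributions the minimizing angle differs, which is exactly the adaptivity of the MCOT.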

Figure 2.3: The distribution of a 2-dimensional noisy image for (a) the Haar wavelet transform with a rotation of 45° and (b) the MCOT with an optimal decorrelation angle α*.

Considering the two matching blocks in Fig. 2.1, the orthogonal transform becomes

[ x″1,i ]       [ x′1,i ]
[ x″2,j ]  = H  [ x′2,j ].   (2.6)

After the transform, x′1,i becomes the low band block x″1,i and x′2,j becomes the high band block x″2,j. In the ideal case x′1,i = x′2,j, the high band block x″2,j is zero. Since the transform is orthonormal, Parseval's theorem always holds: the energy is preserved before and after the transform,

‖x′1,i‖² + ‖x′2,j‖² = ‖x″1,i‖² + ‖x″2,j‖².   (2.7)

Thus it is possible to evaluate the distortion of the coefficients even before the inverse transform. The MCOT achieves high energy concentration, with up to 99% of the energy in the temporal low band for the QCIF sequence Foreman [12]. If the input images are identical, the whole energy is compacted into the temporal low band and the temporal high band becomes zero. Considering the bidirectional MCOT, we can construct the orthogonal transform matrix as

H = H3 H2 H1

    [ cos ψ  0  sin ψ ] [ 1    0       0   ] [ cos φ  0  sin φ ]
  = [   0    1    0   ] [ 0  cos θ  −sin θ ] [   0    1    0   ] .   (2.8)
    [ −sin ψ 0  cos ψ ] [ 0  sin θ   cos θ ] [ −sin φ 0  cos φ ]

The Euler angles ψ, θ, and φ are calculated by energy concentration. The bidirectional MCOT can be expressed as

 00   0  x1,i x1,i  00   0  x2,j  =H3H2H1 x2,j  . (2.9) 00 0 x3,k x3,k where x1,i, x2,j, and x3,k are three coefficients from three corresponding frames. 0 0 After the transform, the original coefficients x1,i and x3,k will turn to the 00 00 0 low band coefficients x1,i and x3,k respectively. x2,j becomes the high band 00 0 0 0 coefficient x2,j. In the ideal case that x1,i = x2,j = x3,k, the high coefficient 00 x2,j will be zero. In the case of half-pel accuracy, the pixel based motion-compensated orthogonal transform is given in [9]. We have 2-hypothesis MCOT for p1, p3, p5, and p7, and 4-hypothesis transform for p2, p4, p6, and p8, see also in Fig. 2.2. The 2-hypothesis has a similar transform to the bidirectional MCOT, while the 4-hypothesis extends the orthogonal transform to an operation of five pixels at a time.

2.2 Adaptive Spatial Wavelet Transforms

Spatial transforms can exploit the spatial redundancy between coefficients in a picture and map the pixels into spatial low and high bands. The adaptive spatial wavelet transform is designed to modify the spatial relationship within each of the temporal bands produced by the MCOT [13]. It preserves the orthogonality of the temporal decomposition. It also takes the scale factors from the MCOT into consideration to achieve efficient energy compaction. The adaptive spatial wavelet transform consists of two types: the type-1 and the type-2 spatial wavelet transform.

2.2.1 Type-1 Spatial Wavelet Transform

The type-1 transform is an adaptive Haar-like wavelet transform. It processes two pixels at a time according to

[ x″1 ]       [ x′1 ]
[ x″2 ]  = H  [ x′2 ],   (2.10)

where x′1 and x′2 are two successive temporal low band samples and x″1 and x″2 are the corresponding spatial low and high band coefficients, respectively. The orthogonal transform matrix H is

H = (1/√(1 + a²)) [ 1  a
                   −a  1 ],   (2.11)

where a is the decorrelation factor determined by the energy concentration constraint. If a = 1, the spatial transform reduces to the standard Haar wavelet transform. After spatial decomposition, the temporal low band yields a spatial low band and horizontal, vertical, and diagonal high bands. The adaptive spatial wavelet transform achieves high energy compaction. It is shown in [12] that the type-1 adaptive spatial transform achieves 98.3% energy concentration from the temporal low band to the spatial low band for Foreman and 98.77% for Mother & Daughter.
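For a single pair of samples, the decorrelation factor that concentrates all the energy into the low band has a closed form. The sketch below assumes a is chosen per pair purely for illustration (in the actual transform, a follows from the energy concentration constraint over the subband):

```python
import numpy as np

def type1_transform(x1, x2):
    """One adaptive Haar-like step, Eqs. (2.10)-(2.11): choose the
    decorrelation factor a that zeros the high band for this pair."""
    a = x2 / x1 if x1 != 0 else 1.0    # a = 1 recovers the standard Haar
    s = 1.0 / np.sqrt(1.0 + a * a)
    low = s * (x1 + a * x2)            # spatial low band coefficient
    high = s * (-a * x1 + x2)          # spatial high band coefficient
    return low, high, a

# All the pair's energy moves into the low band: 3^2 + 4^2 -> low^2 = 25.
low, high, a = type1_transform(3.0, 4.0)
```

Because the matrix is orthonormal for any a, the energy low² + high² always equals x1² + x2²; the choice of a only decides how the energy is split between the bands.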

2.2.2 Type-2 Spatial Wavelet Transform

Instead of processing two pixels at a time as the type-1 spatial transform does, the type-2 spatial transform considers three pixels at a time and processes along a whole row or column within each of the subbands. Let x′1, x′2, and x′3 be the three samples processed by the type-2 transform at a time. After the transform, x′2 becomes the spatial high band pixel x″2. Assuming x′1 = x′2 = x′3, the high band energy can be completely removed (x″2 = 0):

[ x″1 ]               [ x′1 ]
[ x″2 ]  = S3 S2 S1   [ x′2 ],   (2.12)
[ x″3 ]               [ x′3 ]

where

S = S3 S2 S1

    [ cos ψ  0  sin ψ ] [ 1    0       0   ] [ cos φ  0  sin φ ]
  = [   0    1    0   ] [ 0  cos θ  −sin θ ] [   0    1    0   ] .   (2.13)
    [ −sin ψ 0  cos ψ ] [ 0  sin θ   cos θ ] [ −sin φ 0  cos φ ]

The Euler angles φ, θ, and ψ are determined via the energy concentration constraint. Fig. 2.4 depicts the three spatial decomposition levels of Lena (512×512). Most of the energy lies in the upper left corner, the spatial low band, while the spatial high bands contain information about edges and curves, shown as gray parts. Since the type-2 transform outperforms the type-1 in energy compaction [13], the type-2 adaptive spatial wavelet transform is used in this thesis.

2.3 Quantization

Quantization is a mapping from a large set of values to a small set of representative values. The mean squared error (MSE) of the quantization can be expressed as

D = E[(x − q(x))²],   (2.14)

Figure 2.4: Type-2 spatial wavelet transform of Lena with three decomposition levels.

where x is a random variable. The literature studies the relationship between the MSE and the probability density function (pdf) of the source signal [14] [15]. It is shown that for high rates and a smooth pdf, the distortion can be written as

D = (1/(12N²)) ( ∫ f_x^(1/3)(x) dx )³,   (2.15)

where N is the number of representative levels and f_x(x) is the pdf of x. In most cases, the quantizer is uniform and scalar. Assume x is uniformly distributed. The distortion is

D = Σ_{i=1}^{N} ∫_{xi}^{xi+1} (x − x̂i)² dx = ∆²/12,   (2.16)

where ∆ is the quantization step size and ∆ = 1/N. Quantization is always associated with a rate; this is the subject of rate-distortion theory. If the quantization step size grows larger, fewer bits are needed for the entropy coding, and vice versa. At high rates, the uniform quantizer is often optimal [16]. For uniform quantization, the rate can be expressed via the number of quantization levels as R = log2 N. It is possible to control the total rate by setting different quantization step sizes.
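The high-rate result D = ∆²/12 of Eq. (2.16) is easy to check empirically for a uniform source. A minimal sketch with an illustrative step size (the sample count and seed are arbitrary):

```python
import numpy as np

def uniform_quantize(x, delta):
    """Midtread uniform quantizer with step size delta."""
    return delta * np.round(x / delta)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 200_000)   # uniform source on [0, 1)
delta = 1.0 / 16                     # N = 16 levels -> R = log2 N = 4 bits
mse = float(np.mean((x - uniform_quantize(x, delta)) ** 2))
# mse is close to delta**2 / 12
```

Halving ∆ (one more bit of rate) reduces the distortion by a factor of four, i.e., by about 6 dB per bit.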

In this thesis, Embedded Block Coding with Optimized Truncation (EBCOT) is used for entropy coding as well as for rate-distortion evaluation. Quantization effects on EBCOT-based compression are studied in [17]. However, as suggested in [18], a uniform scalar deadzone quantizer with standard step size ∆ = 1 is used for simplicity, and the rate is controlled by the post-compression rate-distortion (PCRD) algorithm.

2.4 Entropy Coding

The main purpose of entropy coding is to reduce the redundancy of the source message and represent it in binary format. Assume X to be a discrete random variable with probability p(xi) for its possible values xi. The Shannon entropy of X is defined as

H(X) = − Σ_{i=1}^{n} p(xi) log2 p(xi).   (2.17)

It measures the uncertainty of the random variable X. It also indicates the theoretical lower limit of lossless compression: the optimal code length for one symbol is H(X). The above definition of the Shannon entropy applies to a discrete X. If we consider a continuous X, e.g., a practical analog signal, the entropy can be extended to the differential entropy. Let f_x(x) be the pdf of X. The differential entropy is defined as

h(X) = − ∫ f_x(x) log2 f_x(x) dx.   (2.18)
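Eq. (2.17) can be computed directly from a probability vector (the function name is ours):

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy in bits, Eq. (2.17)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                            # 0 * log 0 is taken as 0
    return float(-np.sum(p * np.log2(p)))

# A fair coin carries 1 bit per symbol; 8 equiprobable symbols carry 3 bits.
h_coin = shannon_entropy([0.5, 0.5])
h_octal = shannon_entropy([1 / 8] * 8)
```

Any skew in the distribution lowers the entropy below the uniform case, which is what an entropy coder exploits.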

Note that the differential entropy can be negative, and it is not bounded by the Shannon entropy. A number of entropy coding techniques have been designed for lossless coding, such as Huffman coding, arithmetic coding, and Lempel-Ziv-Welch (LZW) coding [19]. Most image coding techniques are lossy, such as Embedded Zerotrees of Wavelet transforms (EZW) [20], Set Partitioning in Hierarchical Trees (SPIHT) [21], and Discrete Cosine Transform (DCT) based coding [22]. EBCOT is another coding method used in image and video compression [23]. It serves as the entropy coder in JPEG2000. It utilizes the idea of bit-plane coding to encode bits from the most significant bit-plane to the least significant bit-plane. Generally, the low-order bit-planes are more difficult to encode than the high-order bit-planes because they contain more details and randomness. Consider an 8-bit image, which has a maximum value of 255 for a grey scale image. Each of the coefficients can be represented as x = a_{n−1} 2^{n−1} + a_{n−2} 2^{n−2} + ··· + a_0 2^0, where x is the coefficient and n is the number of binary bit-planes. The coefficients constitute the bit-planes from the most

significant bit a_{n−1} to the least significant bit a_0. The image is divided into small code-blocks, such as 16×16 or 32×32. The bit-plane encoder encodes the bit-planes of each code-block independently in three passes:

• Significant Propagation Pass;

• Magnitude Refinement Pass;

• Cleanup Pass.

All the information obtained from the previous passes is coded by an adaptive arithmetic encoder. The bit-stream for one code-block is constructed in an embedded way, as shown in Fig. 2.5. The most important information is put at the head, and the least important information follows at the end. If the bit-stream is truncated, it always preserves the most important information and discards the bits at the end. In this way, EBCOT can perform rate control without different quantization steps. It creates embedded bitstreams from bit-planes and adapts to a given rate by truncating the bitstreams.
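The bit-plane decomposition that EBCOT works on can be sketched as follows (helper names are ours; the actual EBCOT coding passes and arithmetic coder are not reproduced here):

```python
import numpy as np

def bit_planes(block, n=8):
    """Split non-negative n-bit coefficients into n binary planes, ordered
    from the most significant a_{n-1} down to the least significant a_0."""
    block = np.asarray(block, dtype=np.int64)
    return [(block >> b) & 1 for b in range(n - 1, -1, -1)]

def from_planes(planes):
    """Reassemble the coefficients: x = sum_b a_b * 2^b."""
    n = len(planes)
    return sum(p << (n - 1 - i) for i, p in enumerate(planes))

block = np.array([[203, 5]])    # 203 = 11001011b, 5 = 00000101b
planes = bit_planes(block)
```

Truncating the list of planes from the back corresponds to coarser quantization, which is why an embedded bit-plane bitstream supports rate control by truncation.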

Figure 2.5: Structure of a bitstream for one code-block. Sign: Signs of the coefficients. SP: Significant Propagation Pass. MR: Magnitude Refinement Pass. CP: Cleanup Pass.

Because the code-blocks are coded independently, the bit-streams of different blocks are also independent. If one bit-stream is lost, it does not affect the other bit-streams, and the decoder can still decode the picture. The PCRD algorithm can be performed over each single code-block, which reduces the complexity of the rate control. Moreover, the distortion is additive when the truncated bit-streams are independent: D = Σ_i D_i, where D is the distortion of the whole picture and D_i is the distortion of one code-block B_i. A detailed distortion estimation based on bit-planes can be found in [24], but here we only consider the block-based distortion. In this project, the JasPer software implementation of JPEG-2000 (ISO/IEC 15444-1) is used [25].

Part I

Theoretical Model

Chapter 3

Theoretical Signal Model

3.1 General Transform Model

Within each single MCOT, assume there are two input pictures x0 and x1. x0 and x1 can be viewed as a clean picture v plus independent additive white Gaussian noises n0 and n1, respectively, as shown in Fig. 3.1 [26]. The noises n0 and n1 are statistically independent.

Figure 3.1: Theoretical signal model.

After the transform, the output signals comprise one temporal low band L and one energy-removed temporal high band H. However, the energy of the noises n0 and n1 cannot be shifted by the transform; it remains in each of their subbands. Thus the temporal subbands are composed of the clean subband signals plus the noises: L = L_clean + n0 and H = H_clean + n1. Since we would like to describe the performance of the transform, we use the parameter rate Rp to determine how much energy is moved from the high band to the low band. The parameter rate comprises the information related to the transform, such as the rate of the motion vectors and the rate of the block sizes. If Rp = 0, no additional bits are spent on the


parameter rate and the energy is not shifted. As Rp gets larger, more energy is concentrated in the low band and less is left in the high band. If Rp → +∞, the clean high band energy can be completely removed and all the signal energy is in the low band, as shown in Fig. 3.2. Let g(Rp) be the transform function of Rp indicating the variance of the clean high band signal. g(Rp) is a decreasing function: if Rp gets larger, more energy is removed from the high band. However, we should notice that it is not possible to remove the noise from the high band even if Rp → +∞.

Figure 3.2: The theoretical curve g(Rp) of the variance of the clean high band over Rp.

Because the noise cannot be shifted around, we use f to denote the variance of the noisy high band:

f = σ²H = σ²n + g(Rp).   (3.1)

Since f describes the high band, it should satisfy 0 ≤ f ≤ E/2, where E = 2σ²n + 2σ²v is the total energy. From Fig. 3.2, we know both f and g are convex. The variance of the noisy low band is

σ²L = σ²n + 2σ²v − g(Rp) = 2σ²n + 2σ²v − f.   (3.2)

The total energy E is always conserved because of the orthogonal transform. Let hc be the differential entropy of the coefficients:

hc = (1/2)(h(L) + h(H)) ≥ (1/2) h(x1, x2).   (3.3)

The differential entropy of the total signal is

ht = Rp + hc = Rp + (1/2)(h(L) + h(H))  [bpp].   (3.4)

Because hc is related to the subband coefficients, there is a trade-off between Rp and hc. Either more bits are spent on Rp to improve the transform performance and fewer bits are used for hc, or fewer bits are spent on Rp and more bits are required for hc. For efficient transform coding, we expect the decrease in hc to be larger than the increase in Rp, such that it is possible to reduce the total rate. Theoretically, there exists an optimal Rp* that minimizes the total rate. Fig. 3.3 depicts the theoretical curve of ht with the optimal Rp* and min{ht} = ht*. If Rp = 0, ht is (1/2)(h(L) + h(H)). If Rp → +∞, hc approaches the constant entropy (1/2) h(x1, x2) of the source signal.

Figure 3.3: The theoretical curve of the total rate ht over Rp.

To minimize the total rate ht, we take the partial derivative of ht with respect to Rp:

∂ht/∂Rp = 0.   (3.5)

The result gives min{ht} = ht(Rp*) = ht*. If we consider a cost function L with µ = 1,

L = Rp + µhc,   (3.6)

to minimize ht is to find

dL = dhc + dRp = 0,   (3.7)

which turns out to be

dhc/dRp = −1.   (3.8)

To evaluate Eq. 3.8, we need to know the absolute value of hc such that we can find the optimal rate allocation when combined with Rp. Although hc can be obtained from the bit-streams after entropy coding, it is difficult to calculate the value of hc within the transform. Because we would like to evaluate the transform before entropy coding, we therefore assume the source signal is Gaussian distributed and calculate its differential entropy.

3.2 Memoryless Gaussian Model

As is known, the Gaussian source is the most difficult source to encode among all real-valued probability distributions with the same mean and variance [27]; it requires the highest number of bits. Here we consider the worst case in modeling the signal. We obtain the differential entropy of the temporal low band

h(L) = (1/2) log2(2πe) + (1/2) log2(σ²L)   (3.9)

and of the temporal high band

h(H) = (1/2) log2(2πe) + (1/2) log2(σ²H).   (3.10)

The rate of the coefficients is

hc = (1/2)(h(L) + h(H))   (3.11)
   = (1/2) log2(2πe) + (1/4) log2[(σ²n + 2σ²v − g(Rp))(σ²n + g(Rp))]   (3.12)
   = (1/2) log2(2πe) + (1/4) log2[(E − f) f].   (3.13)

If we minimize hc for a given Rp0 , we have

min hc s.t. Rp = Rp0 . (3.14)

⇒ min(E − f)f s.t. Rp = Rp0 (3.15)

Because 0 ≤ f ≤ E/2, Eq. 3.15 can be rewritten as

min f s.t. Rp = Rp0 (3.16)

Looking at Eq. 3.14, hc is based on the Gaussian distribution assumption. But from Eq. 3.16, we notice that the objective is based on the variance of the high band rather than on the Gaussian distribution. That means, if we consider Eq. 3.16, there is no need to know the signal distribution in advance. For this, we can construct another cost function

J = f + λRp = σ²H + λRp,   (3.17)

where λ is the Lagrangian multiplier. To minimize this cost function, we set

dJ/dRp = 0,   (3.18)

which gives

λ = −df/dRp.   (3.19)

This equation characterizes the transform performance in two respects. The first is the performance with the same Rp but different g(Rp); the second is the same g(Rp) but different Rp:

• Suppose there are two transform functions g1 and g2. Assume g1 is steeper than g2, which means that at a given f1, Rp11 < Rp21. So g1 is more efficient in compacting energy, as fewer bits are spent. Then at f1 we have λ11 < λ21. At f2 we have the opposite case, λ12 > λ22, as shown in Fig. 3.4.

• Consider f1 and f2 (f1 < f2) for a given g1. The rates at these two points satisfy Rp11 > Rp12, which means a higher Rp removes more energy from the high band. Then the two λs at these two points satisfy λ11 < λ12. The same holds for g2: λ21 < λ22, as also shown in Fig. 3.4.

Table 3.1 shows the parameters for each of the conditions in Figs. 3.4 and 3.5.

f     g1          g2          hc1    hc2
f1    Rp11, λ11   Rp21, λ21   λh11   λh21
f2    Rp12, λ12   Rp22, λ22   λh12   λh22

Table 3.1: Parameters shown in Figs. 3.4 and 3.5.

Eq. 3.8 can be rewritten as

dhc/dRp = (dhc/df)(df/dRp) = −1.   (3.20)

We obtain

dhc/df = (1/(4 ln 2)) · (E − 2f)/((E − f) f) = 1/λ.   (3.21)

The term dhc/df evaluates the slope of the differential entropy of the subbands over f, based on the Gaussian distribution. This term also has meaning in two respects:

• Consider one f for two curves hc1 and hc2, where hc1 is higher than hc2. We have hc1(f) > hc2(f). The larger value means hc1 is not

Figure 3.4: Different g for different f.

so efficient in coding the coefficients compared to hc2. Then for f1, 1/λh11 > 1/λh21 ⇒ λh11 < λh21, and for f2, 1/λh12 < 1/λh22 ⇒ λh12 > λh22, as shown in Fig. 3.5. For a relatively small f1, ghc1 should be lower than ghc2 such that λh11 < λh21, and thus Rp,h11 < Rp,h21. So ghc1 is more effective than ghc2. When a more effective ghc1 is combined with a less efficient hc1, it can still achieve an optimal performance.

• If we look at one curve hc1 and two points f1 and f2, we have 1/λh11 > 1/λh12 ⇒ λh11 < λh12. Besides, for one transform function g, we obtain f1 < f2 ⇒ Rp1 > Rp2 ⇒ λh11 < λh12. The two results are consistent, which means the transform and the entropy coding can be analyzed together.

As discussed above, Eq. 3.20 clearly demonstrates the relationship between the transform and the entropy coder. In real cases, the term dhc/df describes the performance of the entropy coder, e.g., EBCOT. If the entropy coder can encode the coefficients efficiently, it gives a smaller hc than an inefficient encoder at the same f. In the case of bit-streams, we can use Rc instead of hc to represent the actual rate of the coefficients.

To choose one optimal λ, we can do the following steps:

• For a given transform g(Rp), calculate N possible λs λ1 . . . λN for the N different Rps.

• With each Rp,n (n ∈ [1 ... N]) we can obtain σ²H,n and Rc,n.

Figure 3.5: Different hc for different f.

• Choose the minimum Rn = Rp,n + Rc,n among the N possible rates and we can find the corresponding optimal λ.

The next chapter gives numerical results for a given g(Rp). An optimal rate combination will be presented.

Chapter 4

Numerical Results

Because g(Rp) is unknown beforehand, it can have any form, such as an exponential or a rational function. For simplicity, here we assume the function g(Rp) to be

g(Rp) = g0 2^(−γRp),   (4.1)

where g0 is the value of g at Rp = 0 and γ > 0 is a parameter indicating the shape of g. To find a minimum point of ht, we require

∂ht/∂Rp = 0,   (4.2)
∂²ht/∂Rp² > 0.   (4.3)

The solution is

Rp* = −(1/γ) log2[ (4 − γ + √((4 − γ)² + 4(2 − γ)C)) / (2(2 − γ)) ],   (4.4)

where

C = 4σ²v σ²n + 2σ⁴n,   (4.5)

ht* = Rp* + (1/2) log2(2πe) + (1/4) log2[(σ²n + g(Rp*))(2σ²v + σ²n − g(Rp*))],   (4.6)

and

λ* = −df/dRp |_{Rp = Rp*} = γ g0 ln 2 · 2^(−γRp*) = γ ln 2 · g(Rp*).   (4.7)

This pair of Rp* and ht* is then our optimal rate allocation. Figure 4.1 presents the total rate ht over Rp for noise levels of −10 dB, −30 dB, and −50 dB. The top curve, for σ²n = −10 dB, shows a small reduction in ht of about 0.1 bits. At this level, the noise is large enough to destroy the original signal. The transform does not gain much in reducing


Figure 4.1: The total rate ht over Rp with γ = 9 for different noise levels; g0 = σv² = 1.

the total rate. The middle curve corresponds to σn² = −30 dB; there is a reduction of about 1 bit at the minimum point. This level of noise neither destroys the signal nor is negligible. For the last curve with noise of −50 dB, the noise is so small that the signal is almost clean, and the transform reduces ht by about 1.8 bits. We can see that the transform is efficient in reducing the total rate for a low level of noise, but for a high level of noise there is not much gain. As we are also interested in the relationship between hc and Rp, Fig. 4.2 depicts hc vs. Rp. As shown, all three curves first decrease and then approach a constant. The decreasing part indicates that Rp is compensating hc. We can see negative values of hc because hc is a differential entropy, and differential entropy can be negative. The level of the noise affects the rate distribution. For the top curve with σn² = −10 dB, there is only a short range of trade-off between hc and Rp. Because the noise destroys the signal, we do not see much reduction in hc. The optimal rates Rp*, hc*, and ht* are presented in Table 4.1.


Figure 4.2: The rate of the coefficients hc over Rp with γ = 9 for different noise levels; g0 = σv² = 1.

Noise     ht*    Rp*    hc*      σH²*      λ*        dhc/df
−50 dB    0.24   1.88   −1.64    1.81e−5   4.99e−5   2.00e4
−30 dB    1.16   1.14    0.02    0.18e−2   0.50e−2   2.00e2
−10 dB    2.09   0.37    1.72    0.20      0.61      1.64

Table 4.1: Optimal rate combinations and corresponding σH²* and λ* for different noise levels.

Remarks

Notice that the function g(Rp) was assumed to be given when solving Eqs. 4.2 and 4.3. If we solve these two conditions without assuming any g(Rp), the solution is

g* = σv² + σn² − √( (σv² + σn²)² − [2(σv² + σn²)g0 − g0²] · 2^(−4Rp) ).    (4.8)

This special g* makes ∂ht(g*)/∂Rp = 0 for all Rp, which means the resulting ht is flat over Rp. However, a flat curve of ht means the transform is not efficient at all, and the original question was to find one minimum point of ht. We do not need g* to hold for all Rp. Therefore, g should be assumed in advance.

Part II

Practical System

Chapter 5

Efficient Video Coding Scheme

The practical video coding system is depicted in Fig. 5.1. It utilizes various types of motion-compensated orthogonal transforms. The input is a group of n pictures (GOP = n). The MCOT is a combination of the unidirectional MCOT, the bidirectional MCOT, a half-pel accurate transform, and variable block sizes. The type of transform to be used is decided by a Lagrangian cost function. After the MCOT, the temporal subbands consist of one temporal low band and n − 1 temporal high bands. Then the adaptive spatial wavelet transform is applied to the temporal subbands. Applying the spatial transform to the temporal high bands is less efficient than applying it to the temporal low band, as there is not much spatial redundancy in the high bands. We still apply the spatial transform to all subbands, however, because the EBCOT codec requires the same spatial decomposition level for all subbands. In this work, the spatial decomposition level is set to three. After the transforms, we use EBCOT as the entropy coder for the obtained coefficients. According to [28], uniform deadzone quantization with step size one is used and the rate is controlled by the PRCD.

5.1 Construction of Various MCOTs

5.1.1 Multiple Types of MCOT

As introduced in Chapter 2, there are two types of MCOT: the unidirectional MCOT and the bidirectional MCOT. Using these two types to construct our system, we consider the following available transforms:

• Intra-frame coding

• Left unidirectional MCOT


Figure 5.1: Efficient video coding system.

• Right unidirectional MCOT

• Bidirectional MCOT.

Intra-frame coding means there is no temporal transform. The original pictures are kept as temporal subbands directly, without applying any transform. This is highly inefficient for video compression. The reason we keep this coding mode is that it covers the worst case, where motion compensation fails, e.g., when a completely different frame appears in the video sequence and any motion compensation would introduce high distortion. In the general case, intra-frame coding is not used. The left unidirectional MCOT and the right unidirectional MCOT are similar. The only difference is that the left unidirectional MCOT uses the previous frame as the reference frame, while the right unidirectional MCOT uses the subsequent frame. This strategy covers cases where there is a sudden scene change and the following frames are completely different. The system would then take the previous (left) frame as the reference to process the current frame and the subsequent (right) one as the reference for the following pictures. Finally, the bidirectional MCOT takes both the previous frame and the subsequent frame into consideration. The system compares the performance of the four possibilities and chooses the optimal one. The decision is made by a Lagrangian cost function (see Section 5.4) [29]. The purpose of using various types of MCOT is that the implemented system can adapt to video sequences with different content and patterns and, thus, improve the overall performance.

5.1.2 Multi-hypothesis MCOT

A unidirectional half-pel MCOT has been introduced in [9]. To enrich the combinations of motion models for the MCOT, we also consider the combination of the bidirectional MCOT and half-pel motion estimation. This combination requires an extension of multi-hypothesis motion compensation beyond the 1-hypothesis, 2-hypothesis, and 4-hypothesis cases. The unidirectional MCOT can be considered as 1-hypothesis with integer motion estimation. With half-pel motion estimation, we have 2-hypothesis motion estimation for positions p1, p3, p5, and p7 and 4-hypothesis for positions p2, p4, p6, and p8, see Fig. 2.2. In the case of bidirectional half-pel MCOT, suppose we have two reference frames A (the previous frame) and B (the subsequent frame). The unidirectional half-pel MCOT has three hypothesis types:

• 1-hypothesis for frame A or B (1A or 1B for short)

• 2-hypothesis for frame A or B (2A or 2B)

• 4-hypothesis for frame A or B (4A or 4B).

For the bidirectional case, there are nine possible combinations in total, as shown in Fig. 5.2.

Figure 5.2: Multi-hypothesis for bidirectional half-pel motion estimation.

As we can see, there are four new kinds of multi-hypothesis motion estimation: 3-hypothesis, 5-hypothesis, 6-hypothesis, and 8-hypothesis. To construct the transform matrices for these new kinds of hypotheses, we follow the idea of energy compaction and distribution. From the bidirectional transform matrix shown in Eq. 2.8, we can observe that the orthogonal matrix H1 operates on the first and third pixels: their energy is compacted into the third pixel. H2 then moves the energy of the second pixel into the third pixel. At this point, the energy of all three pixels is compacted into the third one. Finally, H3 splits the compacted energy back to the first and third pixels.

Thus, the energy of the second pixel is shifted to the other two pixels, and the second pixel becomes a high-band pixel. With the same idea, we can construct the additional transform matrices.

Figure 5.3: An example of 6-hypothesis motion estimation.

An example of 6-hypothesis MCOT

Fig. 5.3 shows an example of 6-hypothesis (4A + 2B) motion estimation. The reference frame A provides a 4-hypothesis motion estimation, which means the half-pel position p2 is the average of the four neighbouring integer pixels. The reference frame B provides a 2-hypothesis motion estimation, where the half-pel position p5 is the average of the two neighbouring integer pixels. The MCOT compacts the energy of the seven coefficients into one coefficient and then distributes the whole energy back to the six low-band pixels (pA ∼ pD in A and pA and pH in B), leaving an energy-removed high-band pixel (grey pA). Eqs. 5.2 to 5.4 present the sub-transform matrices Ha to Hf that construct H. Each sub-matrix deals with two pixels at a time. The energy is gradually compacted into one pixel by the Euler angles φ1 to φ6; the distribution of the energy is determined by φ7 to φ11. The transform matrix for this 6-hypothesis motion estimation is

H = Ha(φ11) Hb(φ10) Hc(φ9) Hd(φ8) He(φ7) Hf(φ6) · He(φ5) Hd(φ4) Hc(φ3) Hb(φ2) Ha(φ1)    (5.1)

where Ha to Hf are 7 × 7 matrices (Eqs. 5.2 to 5.4). Each of them equals the 7 × 7 identity matrix except in one pair of rows and columns (i, j), where it performs the plane rotation

cos φ   sin φ
−sin φ  cos φ

on pixels i and j. The rotated pairs are (1, 2) for Ha, (3, 4) for Hb, (5, 6) for Hc, (2, 4) for Hd, (4, 6) for He, and (6, 7) for Hf, and φ1 ∼ φ11 are determined by the energy concentration constraints.
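As a sanity check of this construction, the sketch below assembles H from the eleven plane rotations with arbitrary placeholder angles (the actual φ1 ∼ φ11 follow from the energy concentration constraints, which are not modeled here) and verifies that H is orthogonal.

```python
import numpy as np

# Build a 7x7 plane (Givens) rotation acting on rows/columns i and j.
def plane_rotation(i, j, phi, n=7):
    h = np.eye(n)
    c, s = np.cos(phi), np.sin(phi)
    h[i, i], h[i, j] = c, s
    h[j, i], h[j, j] = -s, c
    return h

# Rotated index pairs for Ha..Hf (0-based), as in Eqs. 5.2 to 5.4.
pairs = {'a': (0, 1), 'b': (2, 3), 'c': (4, 5),
         'd': (1, 3), 'e': (3, 5), 'f': (5, 6)}

# H = Ha(p11) Hb(p10) Hc(p9) Hd(p8) He(p7) Hf(p6) He(p5) Hd(p4) Hc(p3) Hb(p2) Ha(p1);
# building the product right-to-left, phi1 is applied first.
order = ['a', 'b', 'c', 'd', 'e', 'f', 'e', 'd', 'c', 'b', 'a']
phis = np.linspace(0.1, 1.1, 11)     # placeholder angles phi1..phi11
H = np.eye(7)
for name, phi in zip(order, phis):
    H = plane_rotation(*pairs[name], phi) @ H

# A product of plane rotations is orthogonal regardless of the angles.
assert np.allclose(H @ H.T, np.eye(7))
```

Orthogonality holds for any angle choice; the energy concentration constraints only fix which orthogonal transform of this family is used.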

5.2 Obtaining Motion Vectors

In this coding system, we use both integer and half-pel motion estimation. A motion vector is obtained by minimizing the cost

JMV = Σ|xi − xj| + λMV RMV

where xi is the reference block and xj is the current block. Σ|xi − xj| is the sum of absolute differences of the coefficients of xi and xj, λMV is the Lagrangian multiplier for the motion vectors, and RMV is the rate of the motion vectors. Normally, a higher rate of motion vectors can provide a better match between xi and xj. If λMV is set to zero, we only consider the similarity of the two blocks; JMV then yields the block most similar to the reference block.

Our motion compensation provides one integer motion vector (mx0, my0) and its corresponding eight half-pel positions around the integer position. However, due to implementation complexity and computation time, only the integer vector (mx0, my0) and the best two half-pel motion vectors (mx1, my1) and (mx2, my2) are considered for practical evaluation. Because the motion vectors are crucial to the reconstruction of the video sequences, they require lossless coding. Huffman coding is used to code them.
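A minimal sketch of such a motion search, assuming a plain full search over integer displacements; the rate model `mv_rate` is a hypothetical stand-in for the actual Huffman rate of the motion vectors.

```python
import numpy as np

def mv_rate(mv):
    # Hypothetical rate model: roughly the bits needed per component.
    return sum(1 + abs(int(c)).bit_length() for c in mv)

def motion_search(ref, cur, block, size=8, search=4, lam=0.0):
    """Full-search block matching minimizing J = SAD + lambda * R_mv."""
    by, bx = block
    target = cur[by:by + size, bx:bx + size].astype(np.int64)
    best, best_mv = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = by + dy, bx + dx
            if y < 0 or x < 0 or y + size > ref.shape[0] or x + size > ref.shape[1]:
                continue
            cand = ref[y:y + size, x:x + size].astype(np.int64)
            j = np.abs(cand - target).sum() + lam * mv_rate((dy, dx))
            if best is None or j < best:
                best, best_mv = j, (dy, dx)
    return best_mv

# Toy example: the current frame is the reference shifted by (2, 1).
ref = np.arange(24 * 24).reshape(24, 24)
cur = np.roll(ref, shift=(-2, -1), axis=(0, 1))
print(motion_search(ref, cur, block=(8, 8)))   # (2, 1)
```

With `lam` set to the λMV of the text, the search trades block similarity against the motion-vector rate instead of matching on SAD alone.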

5.3 Variable Block Size

In addition to the multiple types of MCOTs, various block sizes are used to provide more accurate block-based motion estimation. A macroblock with block size m × n is partitioned into smaller block sizes of m × n/2, m/2 × n, and m/2 × n/2. Fig. 5.4 depicts a macroblock of 16 × 16 segmented into subblock sizes of 16 × 8, 8 × 16, and 8 × 8. The motion estimation provides one motion vector for each of the subblocks. A maximum of four motion vectors is transmitted for a macroblock, if the 8 × 8 subblocks are chosen. In our case, there are 9 (= 1 + 2 + 2 + 4) motion vectors saved for each macroblock before the MCOT. Our system evaluates all four subblock types to determine which block size is optimal, see Section 5.4.

Figure 5.4: Partitions of a macroblock of 16x16 for motion estimation.
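The per-macroblock bookkeeping of the four partition modes and their 9 (= 1 + 2 + 2 + 4) candidate motion vectors can be sketched as follows; the mode names are illustrative.

```python
# Enumerate the subblocks of an m x n macroblock for each partition mode
# (whole block, two horizontal halves, two vertical halves, four quarters),
# mirroring Fig. 5.4 for m = n = 16.
def partitions(m=16, n=16):
    return {
        'whole':      [(m, n)],
        'horizontal': [(m // 2, n)] * 2,      # two 8 x 16 subblocks
        'vertical':   [(m, n // 2)] * 2,      # two 16 x 8 subblocks
        'quad':       [(m // 2, n // 2)] * 4  # four 8 x 8 subblocks
    }

modes = partitions()
# One motion vector per subblock: 1 + 2 + 2 + 4 = 9 vectors per macroblock.
n_mv = sum(len(blocks) for blocks in modes.values())
print(n_mv)   # 9
```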

To summarize, there are three levels of combination inside our system:

• Motion compensation for each type of MCOT

• Different types of MCOT for each subblock

• Variable block sizes for each macro block.

Fig. 5.5 demonstrates the structure of these three levels of combination. The first and most detailed level evaluates the nine possible motion vectors for a particular subblock with a particular type of MCOT. The second level evaluates the performance of the different transform types for each single subblock, given the motion vectors. After the second level, we thus have a number of combinations of motion vectors and transform types for the subblocks. Finally, the last level finds which kind of block segmentation is best for the macroblock. In the end, the system gives the optimal combination of motion vectors, transform types, and subblock type for each macroblock.

Figure 5.5: Structure of the minimization of the cost function with the three levels.

5.4 Mode Decision

The purpose of our Lagrangian cost function is to find the optimal combination of the various parameters and to achieve an efficient trade-off between the rate of the parameters and the rate of the coefficients. Let Rp be the parameter rate, i.e., the sum of the rate of the motion vectors Rp(mv), the rate of the MCOT types Rp(t), and the rate of the subblock sizes Rp(s). They are obtained from the motion estimation, the transform types, and the variable block sizes, respectively. Let σH² denote the variance of the high band. The relationship between Rp and σH² has been studied in Chap. 3. Our Lagrangian cost function is

J = σH² + λ Rp    (5.5)
  = σH² + λ (Rp(mv) + Rp(t) + Rp(s)).    (5.6)

For the practical implementation, the multiplier λ is set to 1. This cost function operates on a macroblock. Because variable block sizes are used to split a macroblock into subblocks, the cost function can be expanded to

J = σH² + λ ( Σ_{i=1}^{N} (Rp(mv, i) + Rp(t, i)) + Rp(s) )    (5.7)

where N is the number of subblocks. Here N can be 1, 2, or 4, as shown in Fig. 5.4. Note that even when there are subblocks, we still sum the parameter rates up to the level of a macroblock:

Rp(mv) = Σ_i Rp(mv, i),    (5.8)
Rp(t) = Σ_i Rp(t, i).    (5.9)
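The mode decision of Eq. 5.7 then reduces to picking the candidate with the smallest cost J. The sketch below mocks the candidate data (high-band variances and rates) with fixed illustrative numbers; in the real system they come from the motion search, the MCOT types, and the block partitions.

```python
LAMBDA = 1.0   # multiplier used in the practical implementation

def cost(sigma_h2, r_mv, r_t, r_s):
    # J = sigma_H^2 + lambda * (sum_i (Rp(mv,i) + Rp(t,i)) + Rp(s))
    return sigma_h2 + LAMBDA * (sum(r_mv) + sum(r_t) + r_s)

# Hypothetical candidates per partition mode: (high-band variance,
# per-subblock MV rates, per-subblock type rates, subblock-size rate),
# with N = 1, 2, 2, 4 subblocks respectively.
candidates = {
    '16x16': (120.0, [10],            [2],          1),
    '16x8':  ( 90.0, [10, 12],        [2, 2],       2),
    '8x16':  ( 95.0, [11, 11],        [2, 2],       2),
    '8x8':   ( 70.0, [10, 12, 9, 11], [2, 2, 2, 2], 3),
}

best = min(candidates, key=lambda k: cost(*candidates[k]))
print(best, cost(*candidates[best]))   # 16x8 118.0
```

With these numbers the 16 × 8 partition wins: its lower high-band variance outweighs the extra motion-vector rate of the finer partitions.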

In this system, different macroblocks contain different combinations of the parameters. Different subblocks within one macroblock can also have different transform types. Take a macroblock with block partition of size 16 × 8 as an example. Assume the optimal transform type for the first subblock, Subblock0, is Type1 (left unidirectional MCOT). The transform type for the second subblock, Subblock1, can still be any type from Type0 to Type3; the constraint is that the total cost of the two subblocks must be minimal compared to the other transform type combinations. Finally, we obtain the optimal subblocks, transform types, and motion vectors for each macroblock. After the MCOT, the subband coefficients are processed by the spatial transform, uniform deadzone quantization, and EBCOT.

Chapter 6

Experimental Results

For the experiments, we use the test videos Foreman and Mother & Daughter. The motion compensation uses a macroblock size of 16 × 16 and a search range of ±20. The dictionary for the Huffman coding of the motion vectors is established from five training videos, Foreman, Carphone, Salesman, Claire, and Mother & Daughter, each with 288 frames. The performance is evaluated by the PSNR

PSNR = 10 log10 (255² / MSE).    (6.1)

The JasPer software is used for the entropy coding, which is the codec specified in the JPEG2000 part I standard (ISO/IEC 15444-1), written in the C programming language. It has been verified that JasPer and JJ2000 (JPEG2000 part I in Java) give almost the same coding performance. Figs. 6.1 and 6.2 present the PSNR of the luminance signal over the rate for Foreman and Mother & Daughter with different transform types. The first curve is the proposed transform, an efficient combination of variable block size, different transform types, and half-pel motion-compensation accuracy. The second curve is the bidirectional MCOT without the left/right unidirectional MCOTs. The third one is also the bidirectional MCOT, but without variable block size or half-pel motion compensation. The fourth curve is the Haar wavelet transform without variable block size or half-pel motion compensation. And the last one is intra coding without temporal transform. For Fig. 6.1, there is a large gap between intra coding and the Haar wavelet transform. The bidirectional MCOT shows a 2 to 4 dB improvement compared to the Haar wavelet transform. If variable block size and half-pel motion accuracy are engaged, we gain an additional 1 dB. Finally, when the transform is constructed from both unidirectional and bidirectional MCOTs, our proposed system gains another 0.5 dB compared to a single bidirectional MCOT. As shown in the figures, the proposed MCOT outperforms the other compared transforms. We can observe the same result

from Fig. 6.2 that the proposed system is the optimal one.
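The PSNR measure of Eq. 6.1 is straightforward to compute per frame; the toy frames below are purely illustrative.

```python
import numpy as np

# PSNR of Eq. 6.1 for 8-bit video frames.
def psnr(original, reconstructed):
    diff = original.astype(np.float64) - reconstructed.astype(np.float64)
    mse = np.mean(diff ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)

# A reconstruction that is off by 1 everywhere has MSE = 1, so the
# PSNR is 10*log10(255^2), about 48.13 dB.
a = np.full((16, 16), 128, dtype=np.uint8)
b = a + 1
print(round(psnr(a, b), 2))   # 48.13
```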


Figure 6.1: Luminance PSNR vs. rate for the QCIF sequence Foreman at 30 fps with 64 frames and a GOP size of 8 frames. The compared transforms include the proposed MCOT, the bidirectional MCOT with variable block size (VBS) and half-pel motion compensation (HP), the bidirectional MCOT without VBS or HP, the Haar wavelet transform without VBS or HP, and intra coding.


Figure 6.2: Luminance PSNR vs. rate for the QCIF sequence Mother & Daughter at 30 fps with 64 frames and a GOP size of 8 frames. The compared transforms include the proposed MCOT, the bidirectional MCOT with variable block size (VBS) and half-pel motion compensation (HP), the bidirectional MCOT without VBS or HP, the Haar wavelet transform without VBS or HP, and intra coding.

Chapter 7

Conclusions

The goal of this project was to implement an efficient video coding scheme that combines various kinds of motion-compensated orthogonal transforms. The first part of the report proposes a theoretical lossless signal model for the orthogonal transform, based on a Gaussian distribution assumption. From this model, we find an optimal rate combination for the rate of the coefficients and the rate of the parameters. The relationship between the transform and the entropy coding is studied, and a cost function for the orthogonal transform is constructed. This cost function is also used in the practical implementation to make mode decisions. Numerical results for the memoryless Gaussian model are presented to show the optimal rate allocation for a given transform function. The second part of the report describes an efficient combination of MCOTs. The combination includes multiple types of MCOTs, variable block sizes, and half-pel motion estimation. The experimental results show that using variable block sizes and half-pel motion estimation improves the PSNR performance significantly. Combined with multiple types of MCOTs, the PSNR increases by another 0.3 to 0.6 dB. From the results, we see that our proposed system outperforms the individual motion-compensated orthogonal transforms.

Bibliography

[1] Video Codec for Audiovisual Services at p × 64 kbit/s. ITU-T Recommendation H.261, 1990.

[2] K. Rijkse. H.263: Video coding for low-bit-rate communication. IEEE Communications Magazine, 34(12):42–45, Dec. 1996.

[3] Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s - Part 2: Video. Int. Standards Org./Int. Electrotech. Comm. (ISO/IEC) JTC 1, 1993.

[4] Coding of audio-visual objects - Part 2: Visual. Int. Standards Org./Int. Electrotech. Comm. (ISO/IEC) JTC 1, 1999-2003.

[5] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra. Overview of the H.264/AVC video coding standard. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):560–576, July 2003.

[6] G. Sullivan and T. Wiegand. Video compression - from concepts to the H.264/AVC standard. Proceedings of the IEEE, 93(1), 2005.

[7] M. Flierl and B. Girod. A motion-compensated orthogonal transform with energy-concentration constraint. In Proc. of the IEEE International Workshop on Multimedia Signal Processing, pages 391–394, Oct. 2006.

[8] M. Flierl and B. Girod. A new bidirectionally motion-compensated orthogonal transform for video coding. In Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume 1, pages I-665–I-668, Apr. 2007.

[9] M. Flierl and B. Girod. Half-pel accurate motion-compensated orthogonal video transforms. In Proc. of the Data Compression Conference (DCC), pages 13–22, Mar. 2007.

[10] M. Flierl and B. Girod. A double motion-compensated orthogonal transform with energy concentration constraint. In Proceedings of the SPIE Conference on Visual Communications and Image Processing, page 6508, 2007.


[11] B. Girod. Efficiency analysis of multihypothesis motion-compensated prediction for video coding. IEEE Transactions on Image Processing, 9(2):173–183, Feb. 2000.

[12] O. Barry, Du Liu, S. Richter, and M. Flierl. Robust motion-compensated orthogonal video coding using EBCOT. In Proc. of the Fourth Pacific-Rim Symposium on Image and Video Technology (PSIVT), pages 264–269, Nov. 2010.

[13] M. Flierl. Adaptive spatial wavelets for motion-compensated orthogonal video transforms. In Proc. of the IEEE International Conference on Image Processing (ICIP), pages 1045–1048, Nov. 2009.

[14] P.F. Panter and W. Dite. Quantization distortion in pulse-count modulation with nonuniform spacing of levels. Proceedings of the IRE, 39(1):44–48, Jan. 1951.

[15] Sangsin Na and David L. Neuhoff. On the support of MSE-optimal, fixed-rate, scalar quantizers. IEEE Transactions on Information Theory, 47:2972–2982, 2001.

[16] H. Gish and J. Pierce. Asymptotically efficient quantizing. IEEE Transactions on Information Theory, 14(5):676–683, Sep. 1968.

[17] C. Gunter and A. Rothermel. Quantizer and entropy effects on EBCOT based compression. IEEE Transactions on Consumer Electronics, 53(2):661–666, May 2007.

[18] D. S. Taubman and M. W. Marcellin. JPEG2000: Fundamentals, Standards and Practice. Kluwer Academic Publishers, Boston/Dordrecht/London, first edition, 2002.

[19] T.A. Welch. A technique for high-performance data compression. Computer, 17(6):8–19, June 1984.

[20] J.M. Shapiro. Embedded image coding using zerotrees of wavelet coefficients. IEEE Transactions on Signal Processing, 41(12):3445–3462, Dec. 1993.

[21] A. Said and W.A. Pearlman. A new, fast, and efficient image codec based on set partitioning in hierarchical trees. IEEE Transactions on Circuits and Systems for Video Technology, 6(3):243–250, June 1996.

[22] N. Ahmed, T. Natarajan, and K.R. Rao. Discrete cosine transform. IEEE Transactions on Computers, C-23(1):90–93, Jan. 1974.

[23] D. Taubman. High performance scalable image compression with EBCOT. IEEE Transactions on Image Processing, 9(7):1158–1170, July 2000.

[24] F. Auli-Llinas and M.W. Marcellin. Distortion estimators for bitplane image coding. IEEE Transactions on Image Processing, 18(8):1772–1781, Aug. 2009.

[25] Michael D. Adams and Faouzi Kossentini. JasPer: A software-based JPEG-2000 codec implementation, 2000.

[26] M. Flierl and B. Girod. Investigation of motion-compensated lifted wavelet transforms. In Proceedings of the Picture Coding Symposium, pages 59–62, 2003.

[27] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley- Interscience, 2006.

[28] M.W. Marcellin, M.J. Gormish, A. Bilgin, and M.P. Boliek. An overview of JPEG-2000. In Proc. of the IEEE Data Compression Con- ference, pages 523–541, Mar. 2000.

[29] H. Everett. Generalized Lagrange multiplier method for solving prob- lems of optimum allocation of resources. Oper. Res., 11:399–417, 1963.