COMPRESSION AND RATE CONTROL METHODS BASED ON THE WAVELET TRANSFORM

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the

Graduate School of The Ohio State University

By

Eric J. Balster, B.S., M.S.

*****

The Ohio State University

2004

Dissertation Committee:

Yuan F. Zheng, Adviser
Ashok K. Krishnamurthy
Steven B. Bibyk

Approved by

Adviser
Department of Electrical and Computer Engineering

© Copyright by

Eric J. Balster

2004

ABSTRACT

Wavelet-based image and video compression techniques have become popular areas in the research community. In March of 2000, the Joint Photographic Experts Group (JPEG) released JPEG2000. JPEG2000 is a wavelet-based image compression standard and is predicted to completely replace the original JPEG standard. In the video compression field, a compression technique called 3D wavelet compression shows promise. Thus, wavelet-based compression techniques have received more attention from the research community.

This dissertation involves further investigation of the wavelet transform in the compression of image and video signals, and a rate control method for real-time transfer of wavelet-based compressed video.

A pre-processing algorithm based on the wavelet transform is developed for the removal of noise in images prior to compression. The intelligent removal of noise reduces the entropy of the original signal, aiding in compressibility. The proposed wavelet-based denoising method shows a computational speedup of at least an order of magnitude over previously established image denoising methods as well as a higher peak signal-to-noise ratio (PSNR).

A video denoising algorithm is also included which eliminates both intra- and inter-frame noise. The inter-frame noise removal technique estimates the amount of motion in the image sequence. Using motion and noise level estimates, a video denoising technique is established which is robust to various levels of noise corruption and various levels of motion.

A virtual-object video compression method is included. Object-based compression methods have come to the forefront of the research community with the adoption of the MPEG-4 (Moving Picture Experts Group) standard. Object-based compression methods promise higher compression ratios without further cost in reconstructed quality. Results show that virtual-object compression outperforms 3D wavelet compression with an increase in compression ratio and higher PSNR.

Finally, a rate-control method is developed for the real-time transmission of wavelet-based compressed video. Wavelet compression schemes demand a rate-control algorithm for real-time video communication systems. Using a leaky-bucket design approach, the proposed rate-control method manages the uncertain factors in the acquisition time of the group of frames (GoF), the computation time of the compression/decompression algorithms, and the network delay. Results show good management and control of the buffers and minimal variance in frame rate.

To my parents

ACKNOWLEDGMENTS

I would like to express my sincere gratitude to my advisor, Professor Yuan F. Zheng, for his constant encouragement, shrewd guidance, and financial support throughout my years at The Ohio State University (OSU). I have benefited from his expert technical knowledge in science and engineering and learned from his creative and novel solutions to many research problems. It has truly been an honor and a privilege to study under his guidance. I would also like to thank Professors Ashok K. Krishnamurthy and Steven B. Bibyk for serving on my committee and providing feedback on this dissertation.

It has been my pleasure to work with my colleagues in the Wavelet Research Group at OSU. Specifically, I would like to thank Ms. Yi Liu and Mr. Zhigang (James) Gao for their continual help with the many technical problems that I came across over the years and for their computer support, which is second to none. I would also like to thank my former colleagues Dr. Jianyu (Jane) Dong (currently at California State University) and Mr. Chao He (currently at Microsoft Corp.) for helping me become acclimated to our research group and to the university during the beginning of my studies. Both Jane and Chao were also helpful in many productive discussions concerning wavelet-based compression of video signals.

I would like to thank both the Dayton Area Graduate Studies Institute (DAGSI) and the Air Force Research Laboratory (AFRL) for funding this research.

I want to give a special thanks to the AFRL Embedded Information Systems Engineering Branch (IFTA) for their continued support over the years. Everyone in the branch has been very encouraging and supportive throughout my studies. Specifically, I would like to thank Mr. James Williamson and Mr. Eugene Blackburn for giving me the opportunity to work at AFRL, an institution of superb research and state-of-the-art technology. Thanks to Dr. Robert L. Ewing for his tutelage and advice through many milestones over the years. I would also like to thank Mr. Al Scarpelli for his support and help during many projects.

Lastly, I would also like to thank my family for their love and encouragement.

Susan, Craig, Jenny, Michael, Megan, Evan, Mom, and Dad, you have always been a very supportive and loving family. Without you all, I would not be able to pursue my goals.

VITA

Dec. 24, 1975 ...... Born - Dayton, OH

May 1998 ...... B.S. Electrical Engineering, University of Dayton, Dayton, OH

Aug. 1998 - Aug. 1999 ...... Graduate Teaching Assistant, Electrical Engineering, University of Dayton, Dayton, OH

Aug. 1999 - May 2000 ...... Graduate Research Assistant, Electrical Engineering, University of Dayton, Dayton, OH

May 2000 ...... M.S. Electrical Engineering, University of Dayton, Dayton, OH

Sept. 2000 - June 2002 ...... Graduate Research Associate, Electrical Engineering, The Ohio State University, Columbus, OH

July 2002 - present ...... Associate Electronics Engineer, Embedded Information Systems Engineering Branch, Air Force Research Laboratory, Wright-Patterson AFB, OH

PUBLICATIONS

Research Publications

Eric J. Balster, Yuan F. Zheng, and Robert L. Ewing, ”Combined Spatial and Temporal Domain Wavelet Shrinkage Algorithm for Video Denoising”, submitted to IEEE Transactions on Circuits and Systems for Video Technology. Apr. 2004.

Eric J. Balster, Yuan F. Zheng, and Robert L. Ewing, ”Combined Spatial and Temporal Domain Wavelet Shrinkage Algorithm for Video Denoising”, in Proc. IEEE International Conference on Communication Systems, Networks, and Digital Signal Processing. March 2004.

Eric J. Balster, Yuan F. Zheng, and Robert L. Ewing, ”Feature-Based Wavelet Shrinkage Algorithm for Image Denoising”. submitted with one revision to IEEE Transactions on Image Processing. Feb 2004.

Eric J. Balster, Yuan F. Zheng, and Robert L. Ewing, ”Fast, Feature-Based Wavelet Shrinkage Algorithm for Image Denoising”, in Proc. IEEE International Conference on Integration of Knowledge Intensive Multi-Agent Systems. pp. 722-728, Oct. 2003.

Eric J. Balster, Waleed W. Smari, and Frank A. Scarpino, ”Implementation of Efficient Wavelet Image Compression Algorithms using Reconfigurable Devices”, in Proc. IASTED International Conference on Signal and Image Processing. pp. 249-256, Aug. 2003.

Eric J. Balster and Yuan F. Zheng, ”Constant Quality Rate Control for Content-based 3D Wavelet Video Communication”, in Proc. World Congress on Intelligent Control and Automation. pp. 2056-2060, June 2002.

Eric J. Balster and Yuan F. Zheng, ”Real-Time Video Rate Control Algorithm for a Wavelet-Based Compression Scheme”, in Proc. IEEE Midwest Symposium on Circuits and Systems. pp. 492-496, Aug 2001.

Eric J. Balster, Frank A. Scarpino, and Waleed W. Smari, ”Wavelet Transform for Real-Time Image Compression Using FPGAs”, in Proc. IASTED International Conference on Parallel and Distributed Computing and Systems. pp. 232-238, Nov. 2000.

FIELDS OF STUDY

Major Field: Electrical Engineering

Studies in: Communication and Signal Processing
Circuits and Electronics
Mathematics

TABLE OF CONTENTS

Page

Abstract ...... ii

Dedication ...... iv

Acknowledgments ...... v

Vita ...... vii

List of Variables ...... xii

List of Tables ...... xxi

List of Figures ...... xxii

Chapters:

1. Introduction ...... 1

1.1 A Review of Current Compression Standards ...... 1
1.1.1 Image Compression Standard (JPEG) ...... 1
1.1.2 JPEG2000 Image Compression Standard ...... 2
1.1.3 Video Compression Standards (H.26X and MPEG-X) ...... 3
1.2 Motivation for Wavelet Image Compression Research ...... 6
1.2.1 Wavelet Image Compression vs. JPEG Compression ...... 6
1.2.2 Wavelet Image Pre-processing ...... 9
1.3 Motivation for Wavelet Video Compression Research ...... 11
1.3.1 Video Signal Pre-processing for Noise Removal ...... 12
1.3.2 Virtual-Object Based Video Compression ...... 13
1.4 Motivation for the Rate Control of Wavelet-Compressed Video ...... 14
1.5 Dissertation Overview ...... 15

2. Wavelet Theory Overview ...... 17

2.1 Scaling Function and Wavelet Definitions ...... 17
2.2 Scaling Function and Wavelet Restrictions ...... 20
2.3 Wavelet Filterbank Analysis ...... 20
2.4 Wavelet Filterbank Synthesis ...... 22
2.5 Two-Dimensional Wavelet Transform ...... 22
2.6 Summary ...... 24

3. Feature-Based Wavelet Selective Shrinkage Algorithm for Image Denoising 25

3.1 Introduction ...... 25
3.2 2D Non-Decimated Wavelet Analysis and Synthesis ...... 30
3.3 Retention of Feature-Supporting Wavelet Coefficients ...... 33
3.4 Selection of Threshold τ and Support s ...... 39
3.5 Estimation of Parameter Values ...... 49
3.5.1 Noise Estimation ...... 49
3.5.2 Parameter Estimation ...... 49
3.6 Experimental Results ...... 51
3.7 Discussion ...... 54

4. Combined Spatial and Temporal Domain Wavelet Shrinkage Algorithm for Video Denoising ...... 59

4.1 Introduction ...... 59
4.2 Temporal Denoising and Order of Operations ...... 62
4.2.1 Temporal Domain Denoising ...... 62
4.2.2 Order of Operations ...... 64
4.3 Proposed Motion Index ...... 66
4.3.1 Motion Index Calculation ...... 66
4.3.2 Motion Index Testing ...... 67
4.4 Temporal Domain Parameter Selection ...... 69
4.5 Experimental Results ...... 71
4.6 Discussion ...... 84

5. Virtual-Object Video Compression ...... 86

5.1 Introduction ...... 86
5.2 3D Wavelet Compression ...... 89
5.2.1 2D Wavelet Transform ...... 89
5.2.2 2D Quantization ...... 91
5.2.3 3D Wavelet Transform ...... 91
5.2.4 3D Quantization ...... 92
5.2.5 3D Wavelet Compression Results ...... 95
5.3 Virtual-Object Compression ...... 97
5.3.1 Virtual-Object Definitions ...... 97
5.3.2 Virtual-Object Extraction Method ...... 98
5.3.3 Virtual-Object Coding ...... 102
5.4 Performance Comparison Between 3D Wavelet and Virtual-Object Compression ...... 103
5.5 Discussion ...... 105

6. Constant Quality Rate Control for Content-Based 3D Wavelet Video Communication ...... 107

6.1 Introduction ...... 107
6.2 Multi-Threaded, Content-Based 3D Wavelet Compression ...... 109
6.3 The Rate Control Algorithm ...... 112
6.3.1 Rate Control Overview ...... 112
6.3.2 Buffer Constraints ...... 114
6.3.3 Grouping Buffer Design ...... 118
6.3.4 Display Buffer Design ...... 120
6.4 Experimental Results ...... 123
6.5 Discussion ...... 128

7. Conclusions and Future Work ...... 129

7.1 Contributions ...... 129
7.2 Future Work ...... 131

Appendices:

A. Computation of S·,k[x, y]...... 134

Bibliography ...... 135

LIST OF VARIABLES

In this dissertation, the following variables are used:

Greek Variables:

• α[x, y, z]: Boolean value of position (x, y, z) indicating the presence of background information

• αk[n]: Non-decimated scaling coefficient of scale k and position n

• αll,k[x, y]: Two-dimensional non-decimated scaling coefficient of scale k and spatial position (x, y)

• αk^3D[l, z]: Non-decimated scaling coefficient of level k, spatial position l, and frame z, generated by temporal domain transformation

• αˆll,k[x, y]: Reconstructed non-decimated scaling coefficient of spatial position (x, y)

• αˆll,k^opt[x, y]: Optimally reconstructed non-decimated scaling coefficient of spatial position (x, y)

• αA: Percent change in frame acquisition rate

• αD: Percent change in display rate

• γx[z]: Leftmost position of the virtual-object in frame z

• γy[z]: Highest vertical position of the virtual-object in frame z

• Γ: The maximum size of a group of frames (GoF)

• δA: Incremental change in the frame acquisition rate

• δD: Incremental change in the display rate

• εd: Empty display buffer warning threshold

• εg: Empty grouping buffer warning threshold

• εx[z]: Rightmost position of the virtual-object in frame z

• εy[z]: Lowest vertical position of the virtual-object in frame z

• η(x, y): Two-dimensional noise function value at spatial position (x, y)

• λk[n]: Non-decimated wavelet coefficient of scale k and position n

• λhl,k[x, y]: Two-dimensional non-decimated wavelet coefficient, high-low subband, of scale k and spatial position (x, y)

• λlh,k[x, y]: Two-dimensional non-decimated wavelet coefficient, low-high subband, of scale k and spatial position (x, y)

• λhh,k[x, y]: Two-dimensional non-decimated wavelet coefficient, high-high subband, of scale k and spatial position (x, y)

• λk^3D[l, z]: Non-decimated wavelet coefficient of level k, spatial position l, and frame z, generated by temporal domain transformation

• λ̃·,k[x, y]: Non-decimated wavelet coefficient of level k and spatial position (x, y), generated by the wavelet transform of f̃(·)

• λvo[x, y, z]: Non-decimated wavelet coefficient of position (x, y, z) used to determine the location of the virtual-object

• µl: Temporal mean of the spatially averaged pixel values, Al^z

• σn: Standard deviation of η(·)

• σ̃n: Estimated standard deviation of η(·)

• τ: Threshold used in image denoising

• τc: The critical time period before the display buffer is empty

• τm(·): Optimal threshold function used in image denoising

• τ̃m(·): Estimated threshold function used in image denoising

• τvo: Threshold used to determine motion in the wavelet coefficients, λvo[·]

• τz[·]: temporal domain threshold for video denoising

• φd: Full display buffer warning threshold

• φg: Full grouping buffer warning threshold

• Φ(t): Scaling function

• Φk,n(t): Scaling function of scale k and shift n

• Ψ(t): Mother wavelet

• Ψk,n(t): Wavelet of scale k and shift n

English Variables:

• ak[n]: Scaling coefficient of scale k and position n

• all,k[x, y]: Two-dimensional scaling coefficient of scale k and spatial position (x, y)

• ãll,k[x, y, z]: Quantized two-dimensional scaling coefficient of scale k and position (x, y, z)

• a·,k,j^3D[x, y, z]: Three-dimensional scaling coefficient of 2D scale k, 3D scale j, and position (x, y, z)

• ã·,k,j^3D[x, y, z]: Quantized three-dimensional scaling coefficient of 2D scale k, 3D scale j, and position (x, y, z)

• as: Multiplicative term used in the LMMSE calculation of s̃m(·)

• aτ: Multiplicative term used in the LMMSE calculation of τ̃m(·)

• Ai: Frame acquisition rate

• Al^z: Spatially averaged pixel value of spatial position l and frame z used in motion index calculation

• b(x, y): Background pixel of spatial location (x, y)

• bs: Additive term used in the LMMSE calculation of s̃m(·)

• bτ: Additive term used in the LMMSE calculation of τ̃m(·)

• Bi^d: Display buffer fullness at time i

• Bi^g: Grouping buffer fullness at time i

• CN: Size of the Nth group of frames (GoF)

• dk[n]: Wavelet coefficient of scale k and position n

• dhl,k[x, y]: Two-dimensional wavelet coefficient, high-low subband, of scale k and spatial position (x, y)

• dlh,k[x, y]: Two-dimensional wavelet coefficient, low-high subband, of scale k and spatial position (x, y)

• dhh,k[x, y]: Two-dimensional wavelet coefficient, high-high subband, of scale k and spatial position (x, y)

• d̃hl,k[x, y, z]: Quantized 2D wavelet coefficient, high-low subband, of scale k and location (x, y, z)

• d̃lh,k[x, y, z]: Quantized 2D wavelet coefficient, low-high subband, of scale k and location (x, y, z)

• d̃hh,k[x, y, z]: Quantized 2D wavelet coefficient, high-high subband, of scale k and location (x, y, z)

• d·,k,j^3D[x, y, z]: Three-dimensional wavelet coefficient of 2D scale k, 3D scale j, and position (x, y, z)

• d̃·,k,j^3D[x, y, z]: Quantized three-dimensional wavelet coefficient of 2D scale k, 3D scale j, and position (x, y, z)

• D: Space below the virtual-object

• Davg|Bi−1^d<εd: Estimated average display rate

• Di: Display frame rate at time i

• Ei: Compression rate at time i

• Ex(z): Ending horizontal position of the virtual-object in frame z

• Ey(z): Ending vertical position of the virtual-object in frame z

• f(t): Arbitrary function

• fk(t): Arbitrary function of scale k

• f(x, y): Original image pixel of spatial position (x, y)

• f̃(x, y): Noisy image pixel of spatial position (x, y)

• fˆ(x, y): Denoised image pixel of spatial position (x, y)

• fˆopt(x, y): Optimal denoised image pixel of spatial position (x, y)

• f(x, y, z): Original video signal pixel of position (x, y, z)

• fˆ(x, y, z): Reconstructed video signal pixel of position (x, y, z)

• fl^z: Video signal pixel of spatial location l and frame z

• F : Number of frames in a group of frames (GoF)

• g[n]: Wavelet filter coefficient of position n

• GN: Time period when the last frame of the Nth group of frames (GoF) is acquired

• h[n]: Scaling function filter coefficient of position n

• Hf : Height of image

• Ho: Height of the virtual-object

• I: The initial buffering level for the display buffer

• I·,k[x, y]: Boolean value formed by thresholding the noisy wavelet coefficient, λ̃·,k[x, y], by τ

• Ivo[x, y, z]: Boolean value created by thresholding the λvo[x, y, z] coefficient by the threshold, τvo

• J·,k[x, y]: Boolean value formed by refining I·,k[x, y] with local support

• J·,k^opt[x, y]: Optimal Boolean value of spatial location (x, y)

• Jvo[x, y, z]: Refined Boolean value used for motion detection of location (x, y, z)

• K: Number of terms included in noise estimation calculation

• KM : Number of subband levels in the 2D wavelet transform

• JM : Number of subband levels in the 3D wavelet transform

• L: Space left of the virtual-object

• L·,k[x, y]: Wavelet coefficient of scale k and spatial location (x, y) used in reconstruction

• L·,k^opt[x, y]: Wavelet coefficient of scale k and spatial location (x, y) used in optimal reconstruction

• LN: The total delay of the Nth group of frames (GoF)

• mse: Mean-squared error between original and modified image

• Ml: Motion index of spatial location l

• o(x, y, z): Virtual-object pixel of location (x, y, z)

• R: Space right of the virtual-object

• Ri: Video reconstruction rate at time i

• s: Support variable used to create Boolean map J·,k[·]

• s2: 2D Quantization step size

• s3: 3D Quantization step size

• sm(·): Optimal support function used in image denoising

• s̃m(·): Estimated support function used in image denoising

• svo: Support value used to refine motion detection

• S·,k[x, y]: Coefficient support value of level k and spatial location (x, y)

• Sd: Size of the display buffer

• Sg: Grouping buffer size

• Sx(z): Starting horizontal position of the virtual-object in frame z

• Sy(z): Starting vertical position of the virtual-object in frame z

• U: Space above the virtual-object

• Vk: Spanning set of scaling functions of scale k

• Wf : Width of image

• Wk: Spanning set of wavelet functions of scale k

• Wo: Width of the virtual-object

• zm,x: Frame which contains the maximum virtual-object width

• zm,y: Frame which contains the maximum virtual-object height

LIST OF TABLES

Table Page

3.1 Minimum average error of test images for various noise levels and their corresponding threshold and support values...... 48

3.2 PSNR comparison of the proposed method to other methods given in the literature (results given in dB)...... 52

3.3 Computation times for a 256x256 image, in seconds...... 53

3.4 Compression ratios of 2D wavelet compression both with and without denoising applied as a pre-processing step...... 54

4.1 Compression ratios of 3D wavelet compression both with and without denoising applied as a pre-processing step...... 84

LIST OF FIGURES

Figure Page

1.1 Generalized architecture of the H.261 encoder...... 4

1.2 2D wavelet transform. Left: Original ”Peppers” image. Center: Wavelet transformed image, MRlevel = 3. Right: Subband reference...... 7

1.3 Comparison between JPEG and wavelet compression methods using the ”Peppers” image. Left: JPEG compression, file size = 6782 bytes, compression ratio 116:1, PSNR = 22.32. Right: 2D Wavelet compression, file size = 6635 bytes, compression ratio 118:1, PSNR = 25.64. . 9

2.1 Wavelet decomposition...... 22

2.2 Wavelet reconstruction...... 23

3.1 Non-decimated wavelet decomposition...... 31

3.2 Non-decimated wavelet synthesis...... 32

3.3 Generic coefficient array...... 36

3.4 Generic coefficient array, with corresponding S·,k values...... 37

3.5 Optimal denoising method applied to noisy ”Lenna” image. Left: Corrupted image f̃(x, y), σn = 50, PSNR = 14.16 dB. Right: Optimally denoised image fˆopt(x, y), PSNR = 27.72 dB...... 41

3.6 Test images...... 44

3.7 Average PSNR values using different wavelets...... 46

3.8 Error results for test images, σn = 30...... 47

3.9 τm(·), sm(·) and their corresponding estimates, τ̃m(·), s̃m(·)...... 51

3.10 Results of the proposed image denoising algorithm. Top left: Original ”Peppers” image. Top right: Corrupted image, σn = 37.75, PSNR = 16.60 dB. Bottom: Denoised image using the proposed method, PSNR = 27.17 dB...... 56

3.11 Results of the proposed image denoising algorithm. Top left: Original ”House” image. Top right: Corrupted image, σn = 32.47, PSNR = 17.90 dB. Bottom: Denoised image using the proposed method, PSNR = 29.81 dB...... 57

3.12 Wavelet-based compression results with and without pre-processing. . 58

4.1 Test results of both TFS and SFT denoising methods. Upper left: FOOTBALL image sequence, SFT denoising, max. PSNR = 30.85, τ = 18, τz = 12. Upper right: FOOTBALL image sequence, TFS denoising, max. PSNR = 30.71, τ = 18, τz = 12. Lower left: CLAIRE image sequence, SFT denoising, max. PSNR = 40.77, τ = 19, τz = 15. Lower right: CLAIRE image sequence, TFS denoising, max. PSNR = 40.69, τ = 15, τz = 21...... 73

4.2 Spatial positions of motion estimation test points. Left: FOOTBALL image sequence, frame #96. Right: CLAIRE image sequence, frame #167...... 74

4.3 Motion estimate given in [10] of image sequences, CLAIRE and FOOT- BALL...... 74

4.4 Proposed motion estimate of image sequences, CLAIRE and FOOT- BALL...... 75

4.5 α and β parameter testing for temporal domain denoising...... 75

4.6 Denoising methods applied to the SALESMAN image sequence, std. =10...... 76

4.7 Denoising methods applied to the SALESMAN image sequence, std. =20...... 77

4.8 Denoising methods applied to the TENNIS image sequence, std. = 10. 77

4.9 Denoising methods applied to the TENNIS image sequence, std. = 20. 78

4.10 Denoising methods applied to the FLOWER image sequence, std. = 10. 78

4.11 Denoising methods applied to the FLOWER image sequence, std. = 20. 79

4.12 Original frame #7 of the SALESMAN image sequence...... 79

4.13 SALESMAN image sequence corrupted, std. = 20, PSNR = 22.10. . . 80

4.14 Results of the 3D K-nearest neighbors filter, [83], PSNR = 28.42. . . 80

4.15 Results of the 2D wavelet denoising filter, given in Chapter 3, PSNR = 29.76...... 81

4.16 Results of the 2D wavelet filtering with linear temporal filtering, [55], PSNR = 30.47...... 82

4.17 Results of the proposed denoising method, PSNR = 30.66...... 82

4.18 Wavelet-based compression results with and without pre-processing. . 83

5.1 3D wavelet compression...... 89

5.2 Starting from left to right. 1) Original three-dimensional video signal. 2) 2D wavelet transform (KM = 2 and JM = 0). 3) Symmetric 3D wavelet transform 4) Decoupled 3D wavelet transform (KM = 2 and JM = 2)...... 94

5.3 Decoupled 3D wavelet transform subbands, KM = 2, JM = 2. Left: Subband dhl,1,1^3D[·] highlighted in gray. Right: Subband dlh,0,2^3D[·] highlighted in gray...... 95

5.4 Comparison of 2D wavelet compression and 3D wavelet compression using the CLAIRE image sequence (frame #4 is shown). Left: 2D wavelet compression. s2 = 64, KM = 8, file size = 198KB, compression ratio = 256:1, average PSNR = 29.80. Right: 3D wavelet compression. s2 = 29, s3 = 29, KM = 8, JM = 8, file size = 196KB, compression ratio = 258:1, average PSNR = 33.31...... 96

5.5 Virtual-object extraction...... 99

5.6 Virtual-object compression...... 103

5.7 Comparison of 3D wavelet compression and virtual-object compression using the CLAIRE image sequence (frame #4 is shown). Left: 3D wavelet compression. s2 = 29, s3 = 29, KM = 8, JM = 8, file size = 196KB, compression ratio = 258:1, average PSNR = 33.31. Right: Virtual-object compression, s2 = 25, s3 = 25, KM = 8, JM = 8 for the virtual-object and s2 = 9, KM = 8 for the background, file size = 195KB, compression ratio = 259:1, average PSNR = 34.00...... 104

5.8 Comparison of 2D wavelet compression, 3D wavelet compression, and virtual-object compression...... 105

6.1 Content-based 3D wavelet compression/decompression design flow. . . 110

6.2 3D wavelet communication system...... 111

6.3 Complete rate control system...... 113

6.4 Rate control model...... 115

6.5 Display frame rate and display buffer size, D0=12 fps...... 124

6.6 Frame acquisition rate and grouping buffer size, D0=12 fps...... 125

6.7 Display frame rate and display buffer size, D0=2 fps...... 126

6.8 Frame acquisition rate and grouping buffer size, D0=2 fps...... 127

CHAPTER 1

Introduction

Effective image and video compression techniques have been active research areas for the last several years. Because of the vast data size of raw digital image and video signals and the limited transmission bandwidth and storage space, image and video compression techniques are paramount in the development of digital image and video systems. It is essential to develop compression methods which can both produce high compression ratios and preserve reconstructed quality in order to enable the creation of high-quality, affordable image and video products.

It is this seemingly limitless demand for higher quality image and video compression systems which provides substantial motivation for further compression research.

First, a brief overview of the latest compression standards will be provided prior to the presentation of specific research topics and objectives.

1.1 A Review of Current Compression Standards

1.1.1 Image Compression Standard (JPEG)

The Joint Photographic Experts Group (JPEG) committee developed a compression standard for digital images in the late 1980s. JPEG compression has long since been the most widely accepted standard in image compression, embedded in most modern digital imaging products.

The JPEG image encoder operates on 8x8 or 16x16 blocks of image data. Thus, images being compressed by JPEG are segmented into processing blocks called macroblocks. JPEG compresses each macroblock separately by first transforming the block by Discrete Cosine Transformation (DCT), quantizing the resultant coefficients, run-length encoding, and finally coding with a variable-length entropy coder [47]. The block-based encoder facilitates simplicity, computational speed, and a modest memory requirement.

Typically, JPEG can compress images at a 10:1 to 20:1 compression ratio and retain high-quality reconstruction. Compression ratios of 30:1 to 50:1 can be obtained with only minor defects in the reconstructed image [34].
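As a rough illustration of the pipeline just described, the sketch below applies a 2D DCT and uniform quantization to a single 8x8 block in Python using SciPy. The flat quantization step, the level shift value, and the omission of the run-length and entropy coding stages are simplifying assumptions, not the standard's exact tables or procedure.

```python
# Illustrative JPEG-style block pipeline: 2D DCT of an 8x8 block, then
# uniform quantization (a sketch, not the JPEG standard's quantization tables).
import numpy as np
from scipy.fftpack import dct, idct

def dct2(block):
    # Separable 2D DCT-II with orthonormal scaling.
    return dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')

def idct2(coeffs):
    return idct(idct(coeffs, axis=0, norm='ortho'), axis=1, norm='ortho')

def code_block(block, q_step=16.0):
    # Transform, quantize (round to the nearest step), then reconstruct.
    coeffs = dct2(block.astype(float) - 128.0)      # level shift, as in JPEG
    quantized = np.round(coeffs / q_step)           # most entries become zero
    reconstructed = idct2(quantized * q_step) + 128.0
    return quantized, reconstructed

block = np.random.randint(0, 256, (8, 8))
q, rec = code_block(block)
print("nonzero coefficients:", np.count_nonzero(q), "of 64")
```

The zeros produced by quantization are what the subsequent run-length and entropy coders exploit to shrink the file.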

1.1.2 JPEG2000 Image Compression Standard

It has been known throughout the research community for several years that the wavelet transform is superior to DCT methods in image compression. Thus, in March of 2000, JPEG published the JPEG2000 standard based on wavelet technology [63]. The compression method of JPEG2000 is similar to that of JPEG. However, JPEG2000 uses the wavelet transform instead of the block-based DCT. This allows the user to specify the size of the processing block (small block sizes reduce the memory requirement while large block sizes improve compression gain and reconstructed image quality). After transformation, coefficients are quantized and encoded as in the JPEG standard.

The JPEG2000 standard promises a 20%-25% smaller average file size, with quality comparable to the original JPEG standard [44].

1.1.3 Video Compression Standards (H.26X and MPEG-X) The H.261 Video Compression Standard

H.261 is a compression standard developed by the ITU (International Telecommunication Union) in 1990. The compression algorithm involves block-based DCT transformation as in JPEG, but also inter-frame prediction and motion compensation (MC) for temporal domain compression. Temporal domain compression starts with an initial frame, the intra (or I) frame. Compression is achieved by creating a predicted (P) frame by subtracting the motion compensated current frame from the closest reconstructed I frame. The I and P frames are then compressed by a method very similar to JPEG, and because the P frames no longer contain as much information as their original frame counterparts, temporal domain compression is achieved. Figure 1.1 gives a generalized architecture of the H.261 encoder.

Because of the subtraction involved in temporal domain compression, the quality of the P frames is highly dependent upon the quality of the I frames. To combat this problem, the P frames are compressed by subtraction from reconstructed I frames. Thus, in decoding the P frames, there is little error introduced from temporal domain compression.
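The prediction/residual idea can be sketched in a few lines. The single whole-pixel motion vector per frame and the current-minus-prediction sign convention below are illustrative assumptions rather than the H.261 specification.

```python
# Minimal sketch of motion-compensated prediction: a P frame stores only the
# residual between the current frame and a shifted, reconstructed I frame.
import numpy as np

def predict_frame(reference, motion_vector):
    # Shift the reconstructed reference frame by the estimated motion vector
    # (np.roll wraps at the borders; real codecs handle edges differently).
    dy, dx = motion_vector
    return np.roll(np.roll(reference, dy, axis=0), dx, axis=1)

def p_frame_residual(current, reconstructed_i, motion_vector):
    # The residual is what actually gets transformed and coded for a P frame.
    return current.astype(int) - predict_frame(reconstructed_i, motion_vector).astype(int)

def decode_p_frame(residual, reconstructed_i, motion_vector):
    # The decoder forms the same prediction and adds the transmitted residual.
    return predict_frame(reconstructed_i, motion_vector) + residual
```

Because both encoder and decoder predict from the same reconstructed I frame, the temporal prediction itself introduces little additional error, as noted above.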

The H.263 Video Compression Standard

H.263, also developed by the ITU, was published in 1995. The standard is similar to H.261, but provides more advanced techniques such as half-pixel precision MC, whereas H.261 uses full pixel precision MC.

Figure 1.1: Generalized architecture of the H.261 encoder.

The MPEG-1 Video Compression Standard

The Moving Picture Experts Group (MPEG) published the MPEG-1 standard in 1990 [1]. The video compression algorithm embedded in MPEG-1 follows H.261 with a few differences. One, the MC algorithm has fewer restrictions, providing better predictive performance. Two, MPEG-1 not only generates I and P frames, but also provides bi-directionally predicted (or B) frames. While a P frame is generated from the difference between the motion compensated current frame and the closest reconstructed I frame, a B frame is produced from the difference between the current frame and the average of the closest two reconstructed I frames. The introduction of the B frame in MPEG-1 gives a sequence of coded video frames in the form of:

I BB P BB P BB P BB I BB P BB P...

The advances of MPEG-1 over H.261 and H.263 make it a more popular compression standard. A typical compression ratio for a high-quality MPEG-1 encoded bitstream is 26:1 [8].

The MPEG-2 Video Compression Standard

Soon after the advent of MPEG-1, MPEG-2 was developed. The MPEG-2 standard is much like MPEG-1, with some added capability. Among its many improvements, and like H.263 relative to H.261, MPEG-2 supports half-pixel precision MC for higher performance inter-frame prediction [2, 30]. Typically, a high-quality MPEG-2 video encoding will result in a 45:1 compression ratio [9]. Currently, MPEG-2 is the most widely used compression standard. It is the compression method used in digital video discs (DVDs) and most digital video recorders (DVRs).

The MPEG-4 Video Compression Standard

The finalized version of the MPEG-4 standard was published in December of 1999. The basis of coding in MPEG-4 is not a processing macroblock, as in MPEG-1 and MPEG-2, but rather an audio-visual object [3]. Object-based compression techniques have certain advantages, such as:

1) Allowing more user interaction with video content.

2) Allowing the reuse of recurring object content.

3) Removal of artifacts due to the joint coding of objects.

Although MPEG-4 does specify the advantages of object-based compression and provides a standard of communication between sender and receiver, it does not provide the means by which a) the content is separated into audio-visual objects, or b) the audio-visual objects are compressed. The MPEG-4 standard is a more open standard which can accept various compression methods. As long as both sender and receiver possess the correct respective tool set for compression and decompression, they can communicate.

The advent of the MPEG-4 compression standard has opened up audio and video compression to more researchers, and provides a flexible environment for continual improvement in the compression of audio and video signals.

1.2 Motivation for Wavelet Image Compression Research

1.2.1 Wavelet Image Compression vs. JPEG Compression

With the exception of JPEG2000 and MPEG-4 (which does not provide a method of compression), each of the aforementioned compression standards given in Section 1.1 has the same drawback: blocking artifacts which appear in the reconstructed signals at low bit-rate coding. These artifacts are a direct result of the block-based DCT transform.

The wavelet transform does not have the drawbacks of block-based DCT methods. Compression algorithms based on the wavelet transform do not segment frames into processing blocks. Thus, wavelets have been extensively researched as an alternative to block-based DCT compression methods, for both images and video signals [37, 52, 70, 82].

Figure 1.2 shows the ”Peppers” image, its wavelet decomposition, and a graphic giving the referenced subband decomposition. As shown in Figure 1.2, the wavelet transform does not break the image up into processing blocks, but processes the entire image as a whole, creating subbands representative of differing spatial frequency bandwidths.

6 Figure 1.2: 2D wavelet transform. Left: Original ”Peppers” image. Center: Wavelet transformed image, MRlevel = 3. Right: Subband reference.

Each of the subbands of the subband reference in the rightmost portion of Figure 1.2 is labeled with a letter ”a” or ”d”. The subband labeled with the letter ”a” contains scaling coefficients, which are the low spatial frequency representation of the original image. The remaining subbands, which are labeled with the letter ”d”, contain wavelet coefficients. Wavelet coefficients represent different levels of bandpass spatial frequency information of the original image.

The subscript letters following the a’s and d’s, given in Figure 1.2, provide the horizontal and vertical contributions of the particular subband. Typically, in the 2D wavelet transform, the original data values are processed first in the horizontal direction, then in the vertical direction. Therefore, the data in each subband has contributions from both horizontal and vertical processing. The ”H” designation is representative of high frequency information, and the ”L” designation is representative of low frequency information. For example, an HL designation denotes that the data in that particular subband represents high frequency information in the horizontal dimension and low frequency information in the vertical dimension. Conversely, the LH designation denotes low frequency information in the horizontal dimension and high frequency information in the vertical dimension. Also, the ”all,2” subband is the lowest frequency representation of the original image and is merely a copy of the original image that has been decimated (low-pass filtered and downsampled) by 2^(2+1) in both the horizontal and vertical dimensions.

The numbers following the subscript letters represent the multiresolution level (MRlevel) of the wavelet decomposition; the higher the value, the lower the frequency representation of the original signal that the wavelet coefficients provide.
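For readers who want to reproduce a decomposition like Figure 1.2, the following sketch uses the PyWavelets package. The 'db2' wavelet, the random stand-in image, and the MRlevel indexing shown in the comments are assumptions for illustration, not the software used in this research.

```python
# Sketch of a 3-level 2D wavelet decomposition (cf. Figure 1.2) with PyWavelets.
import numpy as np
import pywt

image = np.random.rand(256, 256)                 # stand-in for the "Peppers" image
coeffs = pywt.wavedec2(image, wavelet='db2', level=3)

print("lowest-frequency 'a' subband:", coeffs[0].shape)
for mr_level, details in zip((2, 1, 0), coeffs[1:]):
    # Three detail ('d') subbands per multiresolution level, coarsest level first.
    print("MRlevel", mr_level, [d.shape for d in details])
```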

After the wavelet transform is applied to an image as in Figure 1.2, each subband is quantized, run-length encoded, and sometimes entropy encoded, much like JPEG compression.

Images compressed by methods utilizing the 2D wavelet transform have been shown to exhibit a more graceful degradation of reconstructed quality with an increase in compression ratio. Unlike DCT-based compression, wavelet-based image encoders operate on each frame as a whole, thus eliminating blocking artifacts. Figure 1.3 gives the ”Peppers” image compressed both by the JPEG standard and by wavelet-based compression.

As displayed in Figure 1.3, the wavelet compression algorithm does not produce the blocking artifacts that appear in JPEG compression, but rather exhibits a more graceful degradation in image quality with high compression ratio.

The JPEG compressed image given in Figure 1.3 is produced by the Advanced JPEG Compressor™, downloadable software that can be found at http://www.winsoftmagic.com. The wavelet compressed image given in Figure 1.3 is produced by in-house software developed by the OSU research group. The ”Peppers” image is compressed by wavelet transformation, uniform quantization in all subbands, stack-run coding [72], and Huffman coding [22]. No other processing is used. This method of compression is referred to as 2D wavelet compression; the two dimensions being processed are the vertical and horizontal dimensions of the image, as shown in Figure 1.2.

Figure 1.3: Comparison between JPEG and wavelet compression methods using the ”Peppers” image. Left: JPEG compression, file size = 6782 bytes, compression ratio 116:1, PSNR = 22.32. Right: 2D Wavelet compression, file size = 6635 bytes, compression ratio 118:1, PSNR = 25.64.
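A minimal sketch of the uniform quantization step in such a 2D wavelet compression chain is given below (again using PyWavelets). The step size and the plain rounding rule are assumptions, and the stack-run and Huffman stages are omitted.

```python
# Uniform quantization of all subbands of a 2D wavelet decomposition (sketch).
import numpy as np
import pywt

def quantize_subbands(coeffs, s2=64.0):
    # Divide every subband by the step size and round; small coefficients
    # collapse to zero, which the run-length/entropy coders then exploit.
    quantized = [np.round(coeffs[0] / s2)]
    quantized += [tuple(np.round(d / s2) for d in details) for details in coeffs[1:]]
    return quantized

def dequantize_subbands(quantized, s2=64.0):
    coeffs = [quantized[0] * s2]
    coeffs += [tuple(d * s2 for d in details) for details in quantized[1:]]
    return coeffs

image = np.random.rand(256, 256) * 255
coeffs = pywt.wavedec2(image, 'db2', level=3)
reconstructed = pywt.waverec2(dequantize_subbands(quantize_subbands(coeffs)), 'db2')
```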

1.2.2 Wavelet Image Pre-processing

Our research motivation in image compression is to provide supplemental pre-processing steps to further enhance the capabilities of 2D wavelet compression. Image pre-processing techniques are well established in many compression algorithms. However, we have developed an image pre-processing algorithm which has proven to outperform established methods in both image quality and computation time.

Image pre-processing techniques are able to intelligently remove noise inherent in digital images. The removal of noise decreases the entropy in the original image signal, facilitating compressibility and reconstructed quality. With the removal of noise, the encoder need not waste bits on noise, but rather use all the encoded bits for storage of important image features.
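The entropy argument can be checked numerically with a first-order (histogram) entropy estimate. The sketch below assumes 8-bit images and is only a back-of-the-envelope measure, not the entropy model of any particular coder.

```python
# First-order entropy of an 8-bit image, in bits per pixel.
import numpy as np

def entropy_bits_per_pixel(image):
    counts = np.bincount(image.astype(np.uint8).ravel(), minlength=256)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

# entropy_bits_per_pixel(noisy) is typically larger than
# entropy_bits_per_pixel(denoised), which is why denoising aids compressibility.
```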

Many different noise removal techniques have been applied to images, but the wavelet transform has been viewed by many as the preferred technique for noise removal [29, 42, 43, 54]. Rather than a complete transformation into the frequency domain, as in DCT or FFT (Fast Fourier Transform), the wavelet transform produces coefficient values which represent both time and frequency information. The hybrid spatial-frequency representation of the wavelet coefficients allows for analysis based on both spatial position and spatial frequency content. The hybrid analysis of the wavelet transform is excellent in facilitating image denoising algorithms.

The wavelet transform does have a drawback, however. The computation time of the wavelet transform hinders the performance of real-time image denoising applications. Thus, it is imperative to minimize the processing steps between wavelet transformation and inverse transformation, i.e., the modification of wavelet coefficient values for noise removal.

Thus, an image denoising method is developed which outperforms the algorithms given in [42, 43, 54] in both signal-to-noise ratio and computation time. This is accomplished by providing an accurate and computationally simple coefficient selection process. Results of the proposed image denoising research show an improvement in PSNR and a substantial reduction in computational complexity, with a speedup of over an order of magnitude compared to the established methods given in [42, 43, 54].

1.3 Motivation for Wavelet Video Compression Research

Because the wavelet transform has been successful in achieving better image quality at high compression ratios than traditional JPEG image compression, it is only natural to assume that wavelet video compression techniques would be able to outperform the block-based DCT compression methods of H.26X and MPEG-X.

Several wavelet compression techniques have been targeted toward video applications. Tham et al. use block-based motion compensation for temporal domain compression and the 2D wavelet transform for spatial compression [71]. Zheng et al. use the wavelet transform for temporal domain compression as well as spatial domain compression, or 3D wavelet compression [24, 81].

The more straightforward approach in [81] exploits the advantages of the wavelet transform in three dimensions for the compression of video. This approach uses the 2D wavelet transform for intra-frame coding, and uses the wavelet transform between frames for inter-frame coding.

Although both wavelet video compression techniques have had success in video compression, there has not been an overwhelmingly superior wavelet video compression technique to combat the industry standards. Thus, this research develops wavelet-based techniques that further enhance the capabilities of 3D wavelet compression.

We provide two processing methods to aid in the effectiveness of 3D wavelet compression: a wavelet-based video noise removal algorithm for video pre-processing, and a virtual-object based compression scheme utilizing 3D wavelet compression.

1.3.1 Video Signal Pre-processing for Noise Removal

It is well known that the removal of noise in images helps compression techniques obtain higher compression ratios while achieving better reconstructed image quality.

However, there has not been much work in the removal of noise in video signals.

With video signals, there exists not only spatial domain noise, but also noise in the temporal domain. Using the wavelet transform, we remove both spatial and temporal noise providing a higher compression gain with 3D wavelet compression.

Noise reduction in digital images has been studied extensively [15, 16, 27, 29, 31, 42, 43, 54, 61, 77]. However, noise reduction in digital video has only rarely been studied. Preliminary methods for temporal domain noise removal are variable coefficient spatio-temporal filters [33, 83] and weighted median filters [45]. These types of filters have also been studied for noise removal in images. Huang et al. use an adaptive median filter for noise removal in images [27]. Rieder and Scheffler [61], and Wong [77], both use an adaptive linear filter for image noise removal. But the wavelet transform has not been used for temporal domain noise removal.
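As a simple stand-in for the spatio-temporal filters cited above (not the exact filters of those works), a pixel-wise median over a three-frame temporal window can be sketched as follows.

```python
# Pixel-wise temporal median over a 3-frame window (illustrative sketch).
import numpy as np

def temporal_median(frames):
    # frames: array of shape (T, H, W); each output pixel is the median of
    # itself and its two temporal neighbors (edges replicated).
    padded = np.concatenate([frames[:1], frames, frames[-1:]], axis=0)
    stacked = np.stack([padded[:-2], padded[1:-1], padded[2:]], axis=0)
    return np.median(stacked, axis=0)
```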

One can only speculate why the wavelet transform has not yet been considered for video signal denoising. However, our own preliminary analysis shows that the overwhelming difficulty with using the wavelet transform is its considerable computational load. But with our image denoising technique, we have shown a significant speedup in wavelet image denoising when compared to established methods, so the computational burden in video denoising can be overcome.

Thus, we include a method of removing temporal domain noise in video sequences via the wavelet transform. Using techniques similar to the proposed image denoising technique, we overcome the overwhelming computational burden imposed by the application of the wavelet transform in the temporal domain. Our video denoising technique is applied to image sequences prior to compression, enabling more effective video compression.

1.3.2 Virtual-Object Based Video Compression

With the advent of the MPEG-4 standard, video compression is based on an audio-visual object instead of the traditional macroblock [3].

Due to the advantages of object-based compression, as provided in the MPEG-4 standard [3], we propose a wavelet-based virtual-object compression algorithm. Virtual-object compression first separates moving objects from the stationary background and compresses each separately, thus achieving the advantages of object-based compression.

There are two separate processing areas in object-based compression. Object extraction is the method of separating different objects in an image sequence, and object compression is the coding of those arbitrarily shaped objects. In the virtual-object compression method, the wavelet transform is used for both object extraction and object compression.

When the wavelet transform is applied in the temporal domain, motion of objects is detected by large coefficient values. Therefore, the wavelet transform is used in the identification and extraction of moving objects prior to object-based compression.

Virtual-object compression uses the non-decimated wavelet transform in the temporal domain for the separation of objects from stationary background.

Virtual-object compression also restricts the virtual-object to be rectangular. This restriction enables the use of 3D wavelet compression for the compression of the virtual-object. Also, with a rectangular object restriction, the location and shape of the object can be completely defined with only two sets of spatial coordinates (the starting horizontal and vertical locations of the virtual-object, and the width and height of the virtual-object), thus virtually eliminating shape coding overhead.
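The rectangular restriction means the object description reduces to two coordinate pairs per frame. The sketch below shows one hypothetical way to obtain them from a Boolean motion mask; the names echo the List of Variables, but the code is illustrative rather than the extraction method developed in Chapter 5.

```python
# Derive the rectangular virtual-object description from a motion mask (sketch).
import numpy as np

def virtual_object_box(motion_mask):
    # motion_mask: 2D Boolean array marking moving pixels in one frame.
    ys, xs = np.nonzero(motion_mask)
    if ys.size == 0:
        return None                               # no moving object in this frame
    start = (int(xs.min()), int(ys.min()))        # roughly (Sx(z), Sy(z))
    size = (int(xs.max() - xs.min() + 1),         # roughly (Wo, Ho)
            int(ys.max() - ys.min() + 1))
    return start, size                            # two coordinate pairs define the shape
```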

Results show the virtual-object compression method to be superior in compression ratio with higher PSNR when compared to 3D wavelet compression.

1.4 Motivation for the Rate Control of Wavelet-Compressed Video

Using the 3D wavelet compression method discussed in [24, 81], the number of frames contained in a GoF (Group of Frames) varies due to video content. Thus, there exists an unknown delay in the acquisition of the GoF, and the computation time needed for compression. Also, in streaming applications across the Internet, there exists another unknown delay in the transmission of the compressed GoF to the receiver, and yet another unknown delay in the decompression time. The variability in the time from frame acquisition to frame display requires a rate control algorithm for real-time transmission of 3D wavelet compressed video.

A real-time video compression and transmission system is necessarily a multi-threaded package. On the server side, the frame acquisition, GoF compression, and packet transmission processes must work independently for real-time operation. For example, in real-time compression the frame acquisition process may not wait for the compression process to finish before acquiring the next GoF. Frame acquisition must occur at regular intervals for real-time processing. On the client side, the decompression of the GoF must occur independently from frame display for real-time systems.

In a multi-threaded environment such as the real-time compression and transmission of video, there must exist a process to manage the computational activity of each processing thread in order to avoid overflow or starvation of the buffers between the threads. Also, this management process must exist in both the client and server systems, and the management processes must communicate to ensure equivalent acquisition and display rates (a requirement for real-time video applications).
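A minimal sketch of this buffer-management idea is given below. The warning-threshold names mirror εg and φg from the List of Variables, but the specific adjustment rule is an assumption for illustration, not the rate-control algorithm developed in Chapter 6.

```python
# Illustrative buffer-driven rate adjustment (not the dissertation's algorithm).
def adjust_acquisition_rate(acq_rate, buffer_fullness, buffer_size,
                            eps_g=0.1, phi_g=0.9, delta=0.05):
    """Slow acquisition when the grouping buffer nears overflow, speed it up
    when the buffer nears starvation, otherwise leave the rate unchanged."""
    fill = buffer_fullness / buffer_size
    if fill > phi_g:          # full-buffer warning threshold reached
        return acq_rate * (1.0 - delta)
    if fill < eps_g:          # empty-buffer warning threshold reached
        return acq_rate * (1.0 + delta)
    return acq_rate
```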

The true motivation for a rate-control algorithm in a 3D wavelet compression scheme is that of necessity. We may possess an efficient and effective video compression scheme, but without an effective rate-control system, real-time video communication is not possible. Performance results give a continuous video stream from sender to receiver with a modest variation in frame rate.

1.5 Dissertation Overview

The rest of the dissertation is organized as follows. Chapter 2 is an overview of wavelet theory. The goal of the overview is to develop the wavelet filterbank analysis and synthesis equations, used in the computation of the wavelet forward and inverse transforms. The wavelet forward and inverse transforms are then used throughout the dissertation.

In Chapter 3 we develop the feature-based wavelet selective shrinkage algorithm for image denoising. The coefficient selection method is based on a two-threshold criterion to aptly determine which coefficients contain useful image information, and which coefficients are corrupted with noise. The two-threshold criterion proves to be an effective means of distinguishing between useful and useless coefficients, and the performance of the denoising method is an improvement over other methods given in the literature in both PSNR and computation time.

Chapter 4 develops the video denoising algorithm which is based upon the image denoising algorithm described in Chapter 3. However, the video denoising algorithm also applies temporal domain processing to eliminate inter-frame noise. There is also a motion estimation algorithm applied to the video signal prior to temporal domain processing. The motion estimation algorithm is able to determine the amount of temporal domain processing which can improve overall quality.

Chapter 5 describes the virtual-object compression method. The virtual-object compression method separates moving objects from the stationary background and compresses each separately. The independent coding of object and background gives the virtual-object compression method an improvement in signal-to-noise ratio over GoF-based compression methods such as 3D wavelet compression.

Chapter 6 develops a rate control algorithm for real-time video communication using wavelet-based compression schemes. The size of the GoF varies in the wavelet-based codec, so the computation times of the compression and decompression algorithms are unknown. Also, the transmission time of the compressed GoF from sender to receiver is unknown and variable. Thus, it is necessary to include a rate control mechanism to ensure continuous video delivery from server to client. Chapter 7 concludes the dissertation and provides some areas for future research.

CHAPTER 2

Wavelet Theory Overview

An overview of wavelet theory is presented for completeness and for the formulation of both the wavelet analysis and synthesis filterbank equations, used in the computation of the wavelet forward and inverse transforms, respectively.

2.1 Scaling Function and Wavelet Definitions

The basic idea of a transform is to use a set of orthonormal basis functions to convolve with an input function. The resultant output function, then, can be evaluated or modified. The Fourier Transform, for example, uses complex sinusoids (i.e., e^(jωn) for all n) as its orthonormal basis set. The wavelet transform uses stretched and shifted versions of one function, the mother wavelet, as its basis. However, not any function can be a mother wavelet. There are certain criteria which the mother wavelet must obey.

We will start with a scaling function, Φ(·). A basis can be generated by shifting and stretching this function:

Φk,n(t) = 2^(−k/2) Φ(2^(−k) t − n),   (2.1)

and

||Φ(t)|| = 1,   (2.2)

where Φk,n(·) is the basis function of the kth scale and nth position.

It is required that the set of all Φk,n(·) be an orthonormal basis. Therefore, any function, f(·), can be completely defined by a weighted sum of the basis functions given in Equation 2.1:

f(t) = Σ_k Σ_n ak[n] Φk,n(t),   (2.3)

where

ak[n] = ⟨Φk,n(t), f(t)⟩ = ∫_{−∞}^{∞} Φ*k,n(t) f(t) dt.   (2.4)

The ak[·] are called scaling coefficients.

Let us define a subset of the basis functions, Φk,n(·).

Vk = Span{Φk,n(t); n ∈ Z}. (2.5)

It is required that,

... Vk+1 ⊂ Vk ⊂ Vk−1 ... (2.6)

where Vk+1 defines a span of coarser scaling functions than does Vk.

We know from Equations 2.5 and 2.6 that Φk+1,0(·) ∈ Vk+1 ⊂ Vk. So substituting into Equation 2.3, we can show there exists a set of weights, h[·], such that

Φk+1,0(t) = Σ_n h[n] Φk,n(t),   (2.7)

which, when using Equation 2.1 and setting k = 0, reduces to

Φ(t) = √2 Σ_n h[n] Φ(2t − n).   (2.8)

Equation 2.8 is referred to as the scaling equation, and the scaling function, Φ(·), is completely defined by h[·].

A subset of scaling functions, Vk−1, can be defined by a subset of coarser scaling functions, Vk, plus a difference subset, which we will call Wk. Therefore,

Vk−1 = Vk + Wk (Vk ⊥ Wk). (2.9)

We can then define a basis for Wk:

Wk = span{Ψk,n(t), n ∈ Z},   (2.10)

where

Ψk,n(t) = 2^(−k/2) Ψ(2^(−k) t − n).   (2.11)

Ψ(·) is the mother wavelet, and the set of all Ψk,n(·) are the wavelet basis functions corresponding to the subset Wk.

Because Wk ⊂ Vk−1, as given in Equation 2.9, we can substitute into Equation 2.3 to show that there exists a set of values, g[·], such that

Ψk,0(t) = Σ_n g[n] Φk−1,n(t),   (2.12)

which, using Equation 2.11 and setting k = 1, can be reduced to

Ψ(t) = √2 Σ_n g[n] Φ(2t − n).   (2.13)

Equation 2.13 is referred to as the wavelet scaling equation, and g[·] completely describes the mother wavelet, Ψ(·).

Notice from Equation 2.9 that, for any arbitrarily fine scale k, we can show that

Vk = Vk+1 + Wk+1
   = Vk+2 + Wk+2 + Wk+1
   = Vk+3 + Wk+3 + Wk+2 + Wk+1
   = Σ_{n=1}^{∞} Wk+n.   (2.14)

And therefore, any function, f(·), can be defined by

f(t) = Σ_k Σ_n dk[n] Ψk,n(t),   (2.15)

where

dk[n] = ⟨Ψk,n(t), f(t)⟩ = ∫_{−∞}^{∞} Ψ*k,n(t) f(t) dt.   (2.16)

2.2 Scaling Function and Wavelet Restrictions

Recall that we want to keep the shifted basis functions, Φk,n(·), orthonormal. Therefore, for a given scale, k, we have

δ[m] = ⟨Φk,0(t), Φk,m(t)⟩
     = ⟨Φk,0(t), Φk,0(t − 2^k m)⟩,   (2.17)

where δ[·] is the Kronecker delta function [50]. Using Equations 2.1, 2.7, and setting k = 1, Equation 2.17 can reduce to

δ[m] = Σ_n h[n] h[n − 2m].   (2.18)

The wavelet basis functions, Ψk,n(·), also need to be orthonormal to the scaling basis functions, Φk,n(·), for Equation 2.9 to be valid. Therefore,

0 = ⟨Ψk,0(t), Φk,m(t)⟩,   (2.19)

which can be reduced to

0 = Σ_n g[n] h[n − 2m].   (2.20)

Equation 2.20 can be solved by

g[n] = (−1)^n h[N − n],   (2.21)

where N is the length of both h[·] and g[·].
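Equations 2.18, 2.20, and 2.21 can be verified numerically. The sketch below uses the four-tap Daubechies scaling filter as an assumed example and indexes the flip as h[N − n] with N taken as the filter length minus one, so that the indices stay in range.

```python
# Numerical check of Equations 2.18, 2.20, and 2.21 (four-tap Daubechies filter).
import numpy as np

h = np.array([0.48296291314469025, 0.8365163037378079,
              0.22414386804185735, -0.12940952255092145])    # db2 scaling filter
N = len(h) - 1
g = np.array([(-1) ** n * h[N - n] for n in range(len(h))])   # Equation 2.21

# Equation 2.18: even shifts of h are orthonormal.
print(np.dot(h, h), np.dot(h[2:], h[:-2]))        # ~1.0 and ~0.0
# Equation 2.20: g is orthogonal to even shifts of h.
print(np.dot(g, h), np.dot(g[2:], h[:-2]))        # ~0.0 and ~0.0
```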

2.3 Wavelet Filterbank Analysis

Let fk(·) ∈ Vk. From Equations 2.3, 2.14, and 2.15 it can be shown that

fk(t) = Σ_n ak[n] Φk,n(t)
      = Σ_n ak+1[n] Φk+1,n(t) + Σ_n dk+1[n] Ψk+1,n(t),   (2.22)

where dk+1[·] and ak+1[·] are the wavelet coefficients and scaling coefficients of the k+1 scale, respectively.

Using Equation 2.4 the scaling coefficients are realized, and substituting Equation 2.7 we obtain

ak+1[n] = ⟨fk(t), Φk+1,n(t)⟩
        = ⟨Σ_m ak[m] Φk,m(t), Φk+1,n(t)⟩
        = Σ_m ak[m] ⟨Φk,m(t), Φk+1,n(t)⟩
        = Σ_m ak[m] ⟨Φk,m(t), Φk+1,0(t − 2^(k+1) n)⟩.   (2.23)

Using Equations 2.1 and 2.7, Equation 2.23 can be reduced to

ak+1[n] = Σ_m ak[m] Σ_l h[l] ⟨2^(−k/2) Φ(2^(−k) t − m), 2^(−k/2) Φ(2^(−k) t − l − 2n)⟩.   (2.24)

Since the scaling function basis is orthonormal, the inner product in Equation 2.24 is equal to one if and only if (l + 2n) = m. Therefore,

ak+1[n] = Σ_m ak[m] h[m − 2n].   (2.25)

Equation 2.25 indicates that the scaling coefficients ak+1[·] can be obtained by convolving a reversed h[·] with ak[·] and downsampling by two.

Very similarly, it can be shown that

dk+1[n] = Σ_m ak[m] g[m − 2n].   (2.26)

From Equations 2.23 and 2.25, we can obtain increasingly coarser scales of wavelet coefficients, dk+1[·], by convolving the scaling coefficients, ak[·], with both a reversed scaling filter, h[·], and a reversed wavelet filter, g[·], and downsampling by two. Figure 2.1 gives a block diagram of wavelet filterbank analysis.

Because each filtered output is downsampled by two, the total number of coefficients remains the same regardless of the number of resolution levels, k.

21 Figure 2.1: Wavelet decomposition.

2.4 Wavelet Filterbank Synthesis

Let fk(·) ∈ Vk. From Equations 2.4 and 2.22 it can be shown that

ak[n] = ⟨fk(t), Φk,n(t)⟩
      = ⟨Σ_m ak+1[m] Φk+1,m(t) + Σ_m dk+1[m] Ψk+1,m(t), Φk,n(t)⟩.   (2.27)

With some further computation, and substituting in Equations 2.7 and 2.12, it can be shown that

ak[n] = Σ_m ak+1[m] ⟨Φk+1,m(t), Φk,n(t)⟩ + Σ_m dk+1[m] ⟨Ψk+1,m(t), Φk,n(t)⟩
      = Σ_m ak+1[m] h[n − 2m] + Σ_m dk+1[m] g[n − 2m].   (2.28)

From Equation 2.28, we can then obtain the original signal, fk(t), by upsampling the scaling and wavelet coefficients and filtering the coefficients with their respective filters, h[·] and g[·]. The wavelet reconstruction block diagram is given in Figure 2.2.

Figure 2.2: Wavelet reconstruction.
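A minimal numerical sketch of one analysis/synthesis level is given below, assuming the two-tap Haar filters (for which the sums in Equations 2.25-2.28 collapse to two terms each); longer filters would require the full convolutions and boundary handling.

```python
# One-level wavelet filterbank analysis and synthesis with Haar filters (sketch).
import numpy as np

h = np.array([1.0, 1.0]) / np.sqrt(2.0)       # Haar scaling filter
g = np.array([1.0, -1.0]) / np.sqrt(2.0)      # Haar wavelet filter

def analyze(a_k):
    # Equations 2.25 and 2.26: correlate with h and g, keep every other sample.
    a_next = a_k[0::2] * h[0] + a_k[1::2] * h[1]
    d_next = a_k[0::2] * g[0] + a_k[1::2] * g[1]
    return a_next, d_next

def synthesize(a_next, d_next):
    # Equation 2.28: upsample by two and filter with h and g, then sum.
    f = np.zeros(2 * len(a_next))
    f[0::2] = a_next * h[0] + d_next * g[0]
    f[1::2] = a_next * h[1] + d_next * g[1]
    return f

x = np.random.rand(16)
a1, d1 = analyze(x)
print(np.allclose(synthesize(a1, d1), x))     # perfect reconstruction -> True
```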

2.5 Two-Dimensional Wavelet Transform

Figure 2.2: Wavelet reconstruction.

A digital image is, in most cases, considered as a two-dimensional array, with width and height as the dimensions. Let f(·) be a two-dimensional, discrete signal. As shown in Equations 2.25 and 2.26, the wavelet transform in one dimension generates two sets of coefficients: scaling coefficients, a_k[·], and wavelet coefficients, d_k[·]. When dealing with two dimensions, however, four sets of coefficients are generated. That is,

a_{ll,0}[x, y] = Σ_n h[n − 2y] Σ_m h[m − 2x] f(m, n)
d_{hl,0}[x, y] = Σ_n h[n − 2y] Σ_m g[m − 2x] f(m, n)
d_{lh,0}[x, y] = Σ_n g[n − 2y] Σ_m h[m − 2x] f(m, n)
d_{hh,0}[x, y] = Σ_n g[n − 2y] Σ_m g[m − 2x] f(m, n).   (2.29)

As in the case of the one-dimensional wavelet transform, the scaling coefficients can be processed further for a multiresolution analysis of the original image, f(·):

a_{ll,k+1}[x, y] = Σ_n h[n − 2y] Σ_m h[m − 2x] a_{ll,k}[m, n]
d_{hl,k+1}[x, y] = Σ_n h[n − 2y] Σ_m g[m − 2x] a_{ll,k}[m, n]
d_{lh,k+1}[x, y] = Σ_n g[n − 2y] Σ_m h[m − 2x] a_{ll,k}[m, n]
d_{hh,k+1}[x, y] = Σ_n g[n − 2y] Σ_m g[m − 2x] a_{ll,k}[m, n].   (2.30)

The four coefficient sets are referred to as the low-low band, a_{ll,·}[·], the high-low band, d_{hl,·}[·], the low-high band, d_{lh,·}[·], and the high-high band, d_{hh,·}[·]. The subbands are named according to the order in which the scaling and/or wavelet filters process the scaling coefficients, a_{ll,·}[·].

The reconstruction of f(x, y) is accomplished by

a_{ll,k}[x, y] = Σ_m h[x − 2m] Σ_n h[y − 2n] a_{ll,k+1}[m, n]
             + Σ_m h[x − 2m] Σ_n g[y − 2n] d_{lh,k+1}[m, n]
             + Σ_m g[x − 2m] Σ_n h[y − 2n] d_{hl,k+1}[m, n]
             + Σ_m g[x − 2m] Σ_n g[y − 2n] d_{hh,k+1}[m, n],   (2.31)

and

f(x, y) = Σ_m h[x − 2m] Σ_n h[y − 2n] a_{ll,0}[m, n]
        + Σ_m h[x − 2m] Σ_n g[y − 2n] d_{lh,0}[m, n]
        + Σ_m g[x − 2m] Σ_n h[y − 2n] d_{hl,0}[m, n]
        + Σ_m g[x − 2m] Σ_n g[y − 2n] d_{hh,0}[m, n].   (2.32)
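Because Equations 2.29-2.32 are separable, one level of the 2D transform can be written as row and column filtering using the 1D relations above. The sketch below is an illustrative implementation only; the Haar filters, the even image dimensions, and the particular assignment of filter order to the two image axes are assumptions made here.

import numpy as np

h = np.array([1.0, 1.0]) / np.sqrt(2.0)
g = np.array([-1.0, 1.0]) / np.sqrt(2.0)

def filt_down(rows, f):
    # Apply Equation 2.25/2.26 along the last axis of each row: filter with f, downsample by two.
    return np.array([[sum(r[2*n + i] * f[i] for i in range(len(f)))
                      for n in range(len(r) // 2)] for r in rows])

def dwt2_level(a):
    # One level of Equations 2.29/2.30: filter one axis, then the other, with h and g.
    lo = filt_down(a.T, h).T          # low-pass along the first axis
    hi = filt_down(a.T, g).T          # high-pass along the first axis
    return (filt_down(lo, h),         # low-low (scaling) band
            filt_down(lo, g),         # one mixed band
            filt_down(hi, h),         # the other mixed band
            filt_down(hi, g))         # high-high band

image = np.arange(64, dtype=float).reshape(8, 8)
print([s.shape for s in dwt2_level(image)])   # four 4x4 subbands, each a quarter of the image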

2.6 Summary

In this chapter, a brief overview of wavelet theory is presented and a formulation of the wavelet analysis and synthesis filterbank equations is developed. The wavelet analysis equations are given by Equations 2.25 and 2.26, and the wavelet synthesis equation is given by Equation 2.28. Also, the 2D wavelet transform is described. The 2D forward wavelet transform is given by Equations 2.29 and 2.30, and the 2D inverse wavelet transform is given by Equations 2.31 and 2.32. Both the wavelet analysis and synthesis equations and the 2D wavelet transform are used throughout the rest of the dissertation.

CHAPTER 3

Feature-Based Wavelet Selective Shrinkage Algorithm for Image Denoising

3.1 Introduction

The recent advancement in multimedia technology has promoted an enormous amount of research in the area of image and video processing. Image and video processing applications such as compression, enhancement, and target recognition require preprocessing functions for noise removal to improve performance. Noise removal is one of the most common and important processing steps in many image and video systems.

Because of the importance and commonality of preprocessing in most image and video systems, there has been an enormous amount of research dedicated to the subject of noise removal, and many different mathematical tools have been proposed.

Variable coefficient linear filters [17, 49, 61, 77], adaptive nonlinear filters [27, 46, 53, 83], DCT-based solutions [31], cluster filtering [76], genetic algorithms [73], fuzzy logic [39, 64], etc. have all been proposed in the literature.

The wavelet transform has also been used to suppress noise in digital images. It has been shown that the reduction of absolute value in wavelet coefficients is successful in signal restoration [43]. This process is known as wavelet shrinkage. Other more complex denoising techniques select or reject wavelet coefficients based on their predicted contribution to reconstructed image quality. This process is known as selective wavelet shrinkage, and many works have used it as the preferred method of image denoising. Preliminary methods predict the contribution of the wavelet coefficients based on the magnitude of the wavelet coefficients [69], and others based on intra-scale dependencies of the wavelet coefficients [15, 20, 41, 43]. More recent denoising methods are based on both intra- and inter-scale coefficient dependencies [18, 26, 29, 42, 54].

Mallat and Hwang prove the successful removal of noise in signals via the wavelet transform by selecting and rejecting wavelet coefficients based on their Lipschitz (Hölder) exponents [43]. The Hölder exponent is a measure of regularity in a signal, and it may be approximated by the evolution of wavelet coefficient ratios across scales. Thus, this regularity metric is used to select those wavelet coefficients which are to be used in reconstruction, and those which are not. Although this fundamental work in image denoising is successful in the removal of noise, its application is broad and not focused on image noise removal, and the results are not optimal.

Malfait and Roose refined the selective shrinkage denoising approach by applying a Bayesian probabilistic formulation and modeling the wavelet coefficients as Markov random sequences [42]. This method is focused on image denoising, and its results are an improvement upon [43]. The Hölder exponents are roughly approximated by the evolution of coefficient values across scales, i.e.,

m_{l,n} = (1/(p − l)) Σ_{k=l}^{p−1} | λ_{k+1,n} / λ_{k,n} |,

where m_{l,n} is the approximated Hölder exponent of position n and scale l, and λ_{k,n} is the wavelet coefficient of scale k and position n. The rough approximation is refined by assuming that the coefficient values are well modeled as a Markov chain, and the probability of a coefficient's contribution to the image can be well approximated by the Hölder exponents of neighboring coefficients. Coefficients are then assigned binary labels x_{k,n} of scale k and position n depending on their predicted retention for reconstruction (x_{k,n} = 1) or predicted removal (x_{k,n} = 0). The binary labels are then randomly and iteratively switched until P(X|M) is maximized, where x_{k,n} ∈ X and m_{k,n} ∈ M. The coefficients are modified by λ^{new}_{k,n} = λ_{k,n} P(x_{k,n} = 1|M), and the denoised image is formed by the inverse wavelet transform of the modified coefficients. Each coefficient is reduced in magnitude depending on its probable contribution to the image, i.e., P(x_{k,n} = 1|M).

Later, Pizurica et al. [54] continued the work of [42] by using a different approximation of the Hölder exponent, given by

ρ_{l,n} = (1/(p − l)) Σ_{k=l}^{p−1} | I_{k+1,n} / I_{k,n} |,

where

I_{k,n} = Σ_{t ∈ C(k,n)} |λ_{k,t}|.

ρ_{l,n} is the approximation of the Hölder exponent, and C(k, n) is the set of coefficients surrounding λ_{k,n}. This work applies the same probabilistic model as [42] using the new approximation of the Hölder exponent. Coefficients are assigned binary labels, x_{k,n}, depending on their predicted retention for reconstruction (x_{k,n} = 1) or predicted removal (x_{k,n} = 0). The binary labels are then randomly and iteratively switched until P(X|M) is maximized. Unlike [42], the significance measure of a coefficient, M, is not merely its Hölder exponent, but is evaluated by the magnitude of the coefficient as well as its Hölder approximation, i.e., f_{M|X}(m_{k,n}|x_{k,n}) = f_{Λ|X}(λ_{k,n}|x_{k,n}) f_{R|X}(ρ_{k,n}|x_{k,n}).

Thus, a joint measure of coefficient significance is developed based on both the Hölder exponent approximation and the magnitude of the wavelet coefficient. As in [42], the new coefficients are modified by λ^{new}_{k,n} = λ_{k,n} P(x_{k,n} = 1|M). Although both algorithms in [42] and [54] show promising results in denoised image quality, the iterative procedure necessary to maximize the probability P(X|M) adds computational complexity, making the processing times of the algorithms impractical for most image and video processing applications. Also, the Markov Random Field (MRF) model used in the calculation of P(X|M) is not appropriate for analysis of wavelet coefficients because it ignores the influence of non-neighboring coefficients. The MRF model is strictly used for simplicity and conceptual ease [42].

From the review of the literature, one can see that image denoising remains an active and challenging topic of research. The major challenge lies in the fact that one does not know what the original signal is for a corrupted image. The performance of a method, on the other hand, can only be measured by comparing the denoised image with its original. In this chapter, we present a new denoising approach which consists of two components. The first is the selective wavelet shrinkage method for denoising, and the second is a new threshold selection method which makes use of test images as training samples.

In general, selective shrinkage methods consist of three processing steps.

First, a corrupted image is decomposed into multiresolution subbands via the wavelet transform. Next, wavelet coefficients are modified based upon certain criteria to predict their importance in reconstructed image quality. Finally, the denoised image is formed by reconstructing the modified coefficients via the inverse wavelet transform.

The processing step of greatest computational cost in the methods of [42] and [54], and of greatest importance in denoising performance, is the coefficient modification process, which calls for effective and efficient criteria to modify wavelet coefficients. To improve performance, this chapter presents a new coefficient selection process which uses a two-threshold criteria to non-iteratively select and reject wavelet coefficients. The two-threshold selection criteria results in an effective and computationally simple coefficient selection process.

The threshold selection method presented is based on minimizing the error between the wavelet coefficients of the denoised image and the wavelet coefficients of an optimally denoised image produced by a method using supplemental information. The supplemental information provided produces a denoised image that is far superior to that of any method which does not utilize supplemental information. Thus, the image produced by the method utilizing supplemental information is referred to as an optimally denoised image. Using several test cases, the threshold values which produce the minimum difference between the wavelet coefficients of the denoised image and the wavelet coefficients of the optimally denoised image are chosen as the threshold values for the general case.

The two-threshold coefficient selection method results in a denoising algorithm which gives improved results over those of [42, 54] without the computational complexity. The two-threshold requirement investigates the regularities of wavelet coefficients both spatially and across scales for predictive coefficient selection, providing selective wavelet shrinkage to non-decimated wavelet subbands.

Following the Introduction, Section 3.2 gives theory on the 2D non-decimated wavelet analysis and synthesis filters. Section 3.3 then describes the coefficient selection process prior to selective wavelet shrinkage. Section 3.4 gives testing results for parameter selection. Section 3.5 gives the estimation algorithms for proper parameter selection, and Section 3.6 gives the results. Section 3.7 gives the discussion.

3.2 2D Non-Decimated Wavelet Analysis and Synthesis

To facilitate the discussion of the proposed method, non-decimated wavelet filterbank theory is presented. In certain applications such as signal denoising, it is not desirable to downsample the wavelet coefficients after decomposition, as in the traditional wavelet filterbank, because the spatial resolution of the coefficients is degraded by downsampling. Therefore, for the non-decimated case, each subband contains the same number of coefficients as the original signal.

Let a_k[n] and d_k[n] be scaling and wavelet coefficients, respectively, of scale k and position n. Thus,

α_k[2^{k+1} n] = a_k[n]
λ_k[2^{k+1} n] = d_k[n],   (3.1)

where α_k[·] are the non-decimated scaling coefficients, and λ_k[·] are the non-decimated wavelet coefficients. Equation 3.1 is substituted into the scaling analysis filterbank equation, Equation 2.25, to find the non-decimated filterbank equation:

a_{k+1}[n] = Σ_m h[m] a_k[m − 2n]
α_{k+1}[2^{k+2} n] = Σ_m h[m] α_k[2^{k+1}(m − 2n)]
α_{k+1}[n] = Σ_m h[m] α_k[2^{k+1} m − n],   (3.2)

where h[·] and g[·] are the filter coefficients corresponding to the low-pass and high-pass filter, respectively, of the wavelet transform. The 2^{k+1} scalar introduced into Equation 3.2 is equivalent to upsampling h[·] by 2^{k+1} prior to its convolution with α_k[·]. Similarly, Equation 3.1 is substituted into the wavelet analysis filterbank equation, Equation 2.26, to obtain

λ_{k+1}[n] = Σ_m g[m] α_k[2^{k+1} m − n].   (3.3)

Figure 3.1 gives a block diagram of the non-decimated wavelet decomposition.

Figure 3.1: Non-decimated wavelet decomposition.

The synthesis of the non-decimated wavelet transform also differs from the downsampled case. From the wavelet synthesis filterbank equation, Equation 2.28, we obtain

a_k[2n] = Σ_m h[2(n − m)] a_{k+1}[m] + Σ_m g[2(n − m)] d_{k+1}[m].   (3.4)

Substituting p = n − m, we obtain

a_k[2n] = Σ_p h[2p] a_{k+1}[n − p] + Σ_p g[2p] d_{k+1}[n − p].   (3.5)

Substituting Equation 3.1 into Equation 3.5,

α_k[2^{k+2} n] = Σ_p h[2p] α_{k+1}[2^{k+2}(n − p)] + Σ_p g[2p] λ_{k+1}[2^{k+2}(n − p)],   (3.6)

and

α_k[n] = Σ_p h[2p] α_{k+1}[n − 2^{k+2} p] + Σ_p g[2p] λ_{k+1}[n − 2^{k+2} p].   (3.7)

Looking at Equation 3.7, samples are being thrown away by downsampling α_{k+1}[·] and λ_{k+1}[·] by two prior to convolution. Because the downsampling in the analysis filters is eliminated, a downsample by two appears in the synthesis equation, Equation 3.7. If the downsample by two is not performed, i.e., m = 2p, then we must divide by two to preserve power equality. That is,

α_k[n] = (1/2) Σ_m h[m] α_{k+1}[n − 2^{k+1} m] + (1/2) Σ_m g[m] λ_{k+1}[n − 2^{k+1} m].   (3.8)

Figure 3.2 gives a block diagram of the non-decimated wavelet transform synthesis.

Figure 3.2: Non-decimated wavelet synthesis.

The above analysis is expanded to the two-dimensional case. For a 2D discrete signal f(·), the 2D non-decimated wavelet transform is given by

α_{ll,k+1}[x, y] = Σ_{n,m} h[n] h[m] α_{ll,k}[2^{k+1} m − x, 2^{k+1} n − y]
λ_{hl,k+1}[x, y] = Σ_{n,m} h[n] g[m] α_{ll,k}[2^{k+1} m − x, 2^{k+1} n − y]
λ_{lh,k+1}[x, y] = Σ_{n,m} g[n] h[m] α_{ll,k}[2^{k+1} m − x, 2^{k+1} n − y]
λ_{hh,k+1}[x, y] = Σ_{n,m} g[n] g[m] α_{ll,k}[2^{k+1} m − x, 2^{k+1} n − y],   (3.9)

where

α_{ll,−1}[x, y] = f(x, y).   (3.10)

The four coefficient sets given in Equation 3.9 are referred to as the low-low band, α_{ll,k+1}[·], the high-low band, λ_{hl,k+1}[·], the low-high band, λ_{lh,k+1}[·], and the high-high band, λ_{hh,k+1}[·]. The subbands are named according to the order in which the scaling and/or wavelet filters process the scaling coefficients.

For the synthesis of f(·) we have,

α_{ll,k}[x, y] = (1/4) Σ_{m,n} h[m] h[n] α_{ll,k+1}[x − 2^{k+1} m, y − 2^{k+1} n]
             + (1/4) Σ_{m,n} h[m] g[n] λ_{hl,k+1}[x − 2^{k+1} m, y − 2^{k+1} n]
             + (1/4) Σ_{m,n} g[m] h[n] λ_{lh,k+1}[x − 2^{k+1} m, y − 2^{k+1} n]
             + (1/4) Σ_{m,n} g[m] g[n] λ_{hh,k+1}[x − 2^{k+1} m, y − 2^{k+1} n].   (3.11)

Equation 3.9 is recursively computed to produce several levels of wavelet coefficients, and reconstruction of the 2D signal, f(·), is accomplished by the recursive computation of Equation 3.11.

The non-decimated wavelet transform has many advantages in signal denoising over the traditional decimated case. First, each subband in the wavelet decomposition is equal in size, so it is more straightforward to find the spatial relationships between subbands. Second, the spatial resolution of each of the subbands is preserved by eliminating the downsample by two. Because of the elimination of the downsampler, the information contained in the wavelet coefficients is redundant, and this redundancy is exploited to determine which coefficients are comprised of noise and which are comprised of feature information contained in the original image.
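To make the non-decimated filterbank concrete, the sketch below performs one level of undecimated analysis and the corresponding synthesis with the division by two of Equation 3.8. It is a minimal illustration: the Haar filters and circular (periodic) boundary handling are assumptions made here, and the sign convention of the shift index is chosen so that the synthesis reproduces the input exactly; it may therefore differ superficially from the indexing written in Equations 3.2 and 3.3.

import numpy as np

h = np.array([1.0, 1.0]) / np.sqrt(2.0)
g = np.array([-1.0, 1.0]) / np.sqrt(2.0)

def nd_analysis(alpha, step):
    # Undecimated analysis: the filters are applied with spacing 'step' (an upsampled
    # filter, step = 2^(k+1)) and no downsampling, so the outputs keep the input length.
    N = len(alpha)
    a_next = np.zeros(N)
    d_next = np.zeros(N)
    for n in range(N):
        for m in range(len(h)):
            a_next[n] += h[m] * alpha[(n + step * m) % N]
            d_next[n] += g[m] * alpha[(n + step * m) % N]
    return a_next, d_next

def nd_synthesis(a_next, d_next, step):
    # Undecimated synthesis with the factor 1/2 of Equation 3.8 to preserve power.
    N = len(a_next)
    alpha = np.zeros(N)
    for n in range(N):
        for m in range(len(h)):
            alpha[n] += 0.5 * (h[m] * a_next[(n - step * m) % N]
                               + g[m] * d_next[(n - step * m) % N])
    return alpha

x = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 3.0])
a0, d0 = nd_analysis(x, step=1)
print(np.allclose(nd_synthesis(a0, d0, step=1), x))   # True for the Haar pair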

3.3 Retention of Feature-Supporting Wavelet Coefficients

One of the many advantages of the wavelet transform over other mathematical transformations is the retention of the spatial relationship between pixels in the original image by the coefficients in the wavelet domain. These spatial relationships represent features of the image and should be retained as much as possible during denoising. In general, images are comprised of regular features, and the resulting wavelet transform of an image generates few, large, spatially contiguous coefficients which are representative of the features given in the original image. We refer to the spatial contiguity of the wavelet coefficients as spatial regularity.

The concept of spatial regularity serves a function similar to that of signal regularity in previous denoising approaches for selecting the wavelet coefficients. The key difference is that the spatial correlation of the features is represented by connectivity of wavelet coefficients rather than by statistical models such as Markov random sequences [42, 54] or Hölder exponents [42, 43, 54] used in previous methods. These models are often computationally complicated and still do not reflect the geometry of the features explicitly. As a result, the current method achieves better performance with a much simpler computation.

Because of spatial regularity, the resulting subbands of the wavelet transform do not generally contain isolated coefficients. This regularity can aid in deciding which coefficients should be selected for reconstruction and which should be discarded for maximum reconstructed image quality. The proposed coefficient selection method, in which spatial regularity is exploited, is described as follows.

Let us assume that an image is corrupted with additive noise, i.e.

f̃(x, y) = f(x, y) + η(x, y),   (3.12)

where f(·) is the noiseless 2D signal, η(·) is a random noise function, and f̃(·) is the corrupted signal.

The first step in selecting the wavelet coefficients is to form a preliminary binary label for each coefficient; these labels collectively form a binary map. The binary map is then used to determine whether or not a particular wavelet coefficient is included in a regular spatial feature. The wavelet transform of f̃(·) generates coefficients, λ̃_{·,k}[·], from Equations 3.9 and 3.10. λ̃_{·,k}[·] is used to create the preliminary binary map, I_{·,k}[·]:

I_{·,k}[x, y] = { 1, when |λ̃_{·,k}[x, y]| > τ; 0, else },   (3.13)

where τ is a threshold for selecting valid coefficients in the construction of the binary coefficient map. A valid coefficient is defined as a coefficient, λ̃_{·,k}[x, y], which results in I_{·,k}[x, y] = 1; hence the coefficient has been selected due to its magnitude. After coefficients are selected by magnitude, spatial regularity is used to further examine the role of the valid coefficient: whether it is isolated noise or part of a spatial feature.

The number of supporting binary values around a particular non-zero value I·,k[x, y] is used to make the judgement. The support value, S·,k[x, y], is the sum of all I·,k[·] which support the current binary value I·,k[x, y]; that is, the total number of all valid coefficients which are spatially connected to I·,k[x, y].

A coefficient is spatially connected to another if there exists a continuous path of valid coefficients between the two. Figure 3.3 gives a generic coefficient map. The valid coefficients are highlighted in gray. From Figure 3.3 it can be shown that coefficients A, B, C, and H do not support any other valid coefficients in the coefficient map. However, coefficients D and F support each other, coefficients E and G support each other, and N and O support each other. Also, coefficients I, J, K, L, M, P, Q, and R all support one another. Figure 3.4 gives the value of S_{·,k}[x, y] for each of the valid coefficients given in Figure 3.3. A method of computing S_{·,k}[x, y] is given in Appendix A. S_{·,k}[·] is used to refine the original binary map I_{·,k}[·] by

J_{·,k}[x, y] = { 1, when S_{·,k}[x, y] > s, or J_{·,k+1}[x, y] I_{·,k}[x, y] = 1; 0, else },   (3.14)

Figure 3.3: Generic coefficient array.

where J_{·,k}[·] is the refined binary map, and s is the necessary number of support coefficients for selection. J_{·,·}[·] is calculated recursively, starting from the highest multiresolution level and progressing downward.

Figure 3.4: Generic coefficient array, with corresponding S_{·,k} values.

Equation 3.14 is equal to one when there exist enough wavelet coefficients of large magnitude around the current coefficient. However, it is also equal to one when the magnitude of the coefficient is effectively large (I_{·,k}[·] = 1) but not locally supported (J_{·,k}[·] = 0), only if the coefficient of the larger scale is large and locally supported (J_{·,k+1}[·] = 1). This criterion addresses the somewhat rare case in which a useful coefficient is not locally supported. In the general case, wavelet coefficients of images are clustered together and are rarely isolated. In [43], wavelet coefficients are modified only by their evolution across scales. Regular signal features contain wavelet coefficients which increase with increasing scale. Thus, if there exists a useful coefficient which is isolated in an image, it is reasonable that a coefficient in the same spatial location at an increase in scale will be sufficiently large and spatially supported. Thus, the coefficient selection method provided by Equation 3.15 selects coefficients which are sufficiently large and locally supported, as well as isolated coefficients which are sufficiently large and supported by scale.

This type of scale-selection is consistent with the findings of Said and Pearlman [62], who developed an image codec based on a "spatial self-symmetry" between differing scales in wavelet transformed images. They discovered that most of an image's energy is concentrated in the low-frequency subbands of the wavelet transform. Because of the self-symmetry properties of wavelet transformed images, if a coefficient value is insignificant (i.e., of small value or zero), then it can be assumed that the coefficients of higher spatial frequency and the same spatial location are insignificant also. In our application, however, we are looking for significance rather than insignificance, so we look to the significance of lower frequency coefficients to determine the significance of the current coefficient. In this way, the preliminary binary map is refined by both spatial and scalar support, as given by Equation 3.14.

The final coefficients retained for reconstruction are given by

L_{·,k}[x, y] = { λ̃_{·,k}[x, y], when J_{·,k}[x, y] = 1; 0, else }.   (3.15)

The denoised image is reconstructed using the supported coefficients, L_{·,k}[·], in the synthesis equation given in Equation 3.11. Thus,

α̂_{ll,k}[x, y] = (1/4) Σ_{m,n} h[m] h[n] α̂_{ll,k+1}[x − 2^{k+1} m, y − 2^{k+1} n]
             + (1/4) Σ_{m,n} h[m] g[n] L_{hl,k+1}[x − 2^{k+1} m, y − 2^{k+1} n]
             + (1/4) Σ_{m,n} g[m] h[n] L_{lh,k+1}[x − 2^{k+1} m, y − 2^{k+1} n]
             + (1/4) Σ_{m,n} g[m] g[n] L_{hh,k+1}[x − 2^{k+1} m, y − 2^{k+1} n].   (3.16)

Equation 3.16 is calculated recursively, producing scaling coefficients of finer resolution until k = −1. The denoised image, f̂(·), is then given by

f̂(x, y) = α̂_{ll,−1}[x, y].   (3.17)

α̂_{ll,k}[·] are the reconstructed scaling coefficients of scale k.

In general, natural and synthetic imagery can be compactly represented by a few wavelet coefficients of large magnitude, and those coefficients are in general spatially clustered. Thus, it is useful to base coefficient selection on magnitude and spatial regularity to distinguish between useful coefficients which are representative of the image and useless coefficients representative of noise. The two-threshold criteria for the rejection of noisy wavelet coefficients is a computationally simple, non-iterative test for magnitude and spatial regularity which can effectively distinguish between useful and useless coefficients.
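The two-threshold test of Equations 3.13-3.15 can be sketched for a single subband as follows. This is an illustrative simplification only: 8-connectivity is assumed for "spatially connected" (the exact computation is deferred to Appendix A), and the cross-scale term J_{·,k+1} I_{·,k} of Equation 3.14 is omitted so that the sketch operates on one scale.

import numpy as np
from collections import deque

def select_coefficients(lam, tau, s):
    # Equation 3.13: mark coefficients whose magnitude exceeds tau.
    I = np.abs(lam) > tau
    # Support values S[x, y]: the number of valid coefficients spatially connected
    # to each valid coefficient, computed here by flood-filling each cluster.
    S = np.zeros(lam.shape, dtype=int)
    seen = np.zeros(lam.shape, dtype=bool)
    for x0 in range(lam.shape[0]):
        for y0 in range(lam.shape[1]):
            if I[x0, y0] and not seen[x0, y0]:
                cluster, queue = [], deque([(x0, y0)])
                seen[x0, y0] = True
                while queue:
                    x, y = queue.popleft()
                    cluster.append((x, y))
                    for dx in (-1, 0, 1):
                        for dy in (-1, 0, 1):
                            u, v = x + dx, y + dy
                            if (0 <= u < lam.shape[0] and 0 <= v < lam.shape[1]
                                    and I[u, v] and not seen[u, v]):
                                seen[u, v] = True
                                queue.append((u, v))
                for x, y in cluster:
                    S[x, y] = len(cluster) - 1    # supporters, excluding the coefficient itself
    J = I & (S > s)                               # Equation 3.14, spatial part only
    return np.where(J, lam, 0.0)                  # Equation 3.15

rng = np.random.default_rng(0)
band = rng.normal(0.0, 10.0, (16, 16))            # a stand-in for one noisy subband
retained = select_coefficients(band, tau=15.0, s=2)
print(np.count_nonzero(retained), "coefficients retained")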

38 3.4 Selection of Threshold τ and Support s

The selection of threshold τ and support s is a key component of the denoising algorithm. Unfortunately, the two parameters cannot be easily determined for a given corrupted image because there is no information about how the corrupted image decomposes into the original signal and the noise. We derive τ and s using a set of test images which serve as training samples. These training samples are artificially corrupted by noise.

The noise is then removed using a range of τ and s values. The set of τ and s which generates the best results is selected for noise removal in general. This approach has its root in the idea of an oracle [15], which is described below.

An oracle is an entity which provides extra information to aid in the denoising process. The extra information provided by the oracle is undoubtedly beneficial, producing substantially better denoising results than methods which are not furnished supplemental information. Thus, the coefficient selection method which uses the oracle's information is referred to as the optimal denoising method. With the optimal denoising method, the threshold and support can be selected using test images for which both the original image and the noise are known. The selected threshold and support functions can then be applied to any corrupted image without supplemental information.

An optimal coefficient selection process has been defined based on the original (noiseless) image. The optimal binary map J^{opt}_{·,k}[·] is given by

J^{opt}_{·,k}[x, y] = { 1, when |λ_{·,k}[x, y]| > σ_n; 0, else },   (3.18)

where λ_{·,k}[·] are the wavelet coefficients of the original (noiseless) image, f(·), and σ_n is the standard deviation of the noise in the corrupted image, f̃(·). Thus, the extra information given by the oracle is the noiseless wavelet coefficients, λ_{·,k}[·]. The coefficients of the original image are used in the coefficient selection process, but not in the image reconstruction. The coefficients which are used in the reconstruction, L^{opt}_{·,k}[·], are given by

L^{opt}_{·,k}[x, y] = { λ̃_{·,k}[x, y], when J^{opt}_{·,k}[x, y] = 1; 0, else },   (3.19)

where λ̃_{·,k}[·] are the wavelet coefficients of the noisy image.

The optimal coefficient map is used to create the optimal denoised image, which is given by

α̂^{opt}_{ll,k}[x, y] = (1/4) Σ_m Σ_n h[m] h[n] α̂^{opt}_{ll,k+1}[x − 2^{k+1} m, y − 2^{k+1} n]
                   + (1/4) Σ_m Σ_n h[m] g[n] L^{opt}_{hl,k+1}[x − 2^{k+1} m, y − 2^{k+1} n]
                   + (1/4) Σ_m Σ_n g[m] h[n] L^{opt}_{lh,k+1}[x − 2^{k+1} m, y − 2^{k+1} n]
                   + (1/4) Σ_m Σ_n g[m] g[n] L^{opt}_{hh,k+1}[x − 2^{k+1} m, y − 2^{k+1} n].   (3.20)

Equation 3.20 is recursively computed for lesser values of k until the optimal denoised image is achieved, where

f̂^{opt}(x, y) = α̂^{opt}_{ll,−1}[x, y].   (3.21)

α̂^{opt}_{ll,k}[·] are the optimal scaling coefficients, and f̂^{opt}(·) is the optimally denoised image. Figure 3.5 gives the denoising results of the optimal denoising method when applied to the "Lenna" image corrupted with additive white Gaussian noise (AWGN). As shown in Figure 3.5, the optimal denoising method is able to effectively remove the noise from the "Lenna" image because of the added information given by the oracle.

PSNR is calculated for performance measurement and is given by

PSNR = 20 log_{10}(255 / √mse),   (3.22)

where

mse = (1/(W_f H_f)) Σ_x Σ_y (f̂(x, y) − f(x, y))^2.   (3.23)

Figure 3.5: Optimal denoising method applied to the noisy "Lenna" image. Left: Corrupted image f̃(x, y), σ_n = 50, PSNR = 14.16 dB. Right: Optimally denoised image f̂^{opt}(x, y), PSNR = 27.72 dB.

mse is the mean-squared error between the original image f(·) and the denoised image f̂(·), and W_f and H_f are the width and height of the image, respectively.
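For reference, Equations 3.22 and 3.23 translate directly into a few lines of code; the sketch below assumes 8-bit imagery (peak value 255), and the synthetic data in the usage example is a placeholder only.

import numpy as np

def psnr(original, denoised):
    # Equations 3.22 and 3.23: mean-squared error and peak signal-to-noise ratio.
    mse = np.mean((np.asarray(denoised, float) - np.asarray(original, float)) ** 2)
    return 20.0 * np.log10(255.0 / np.sqrt(mse))

rng = np.random.default_rng(1)
clean = rng.integers(0, 256, (64, 64)).astype(float)
noisy = clean + rng.normal(0.0, 10.0, clean.shape)
print(round(psnr(clean, noisy), 2), "dB")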

PSNR is the most popular quality metric among researchers in the image and video processing community and has been used almost exclusively in the literature for more than a decade. However, it is also well known in the community that PSNR is not always consistent with the human perception of quality. That is, although image processing method A is shown to give a higher PSNR than image processing method B, people on average may tend to prefer the results of image processing method B.

Because of this inconsistency, there has recently been research conducted into the development of new quality metrics which tend to give results that more closely follow human perception. A metric called QI (quality index) has been developed based not on pixel error, as in PSNR, but on loss of correlation, luminance distortion, and contrast distortion [75]. This method has been tested, and the results suggest that QI may be a better means of quantitative quality measurement than PSNR.

Also, another metric has been developed which suggests even more consistent quality assessment than both QI and PSNR. The weighted frequency-domain normalized mean-squared error (W-NMSE) quality metric is based upon wavelet coefficient error [19]. Results given in [19] suggest that W-NMSE gives results that are closer to human perception than both PSNR and QI.

In addition to PSNR, QI, and W-NMSE, there are also a number of proprietary quality metrics available for purchase. So, there is a choice to be made when evaluating the performance of an image processing algorithm. The choice made in this dissertation is to use PSNR, and there is a reason for the decision. The methods of [19, 75] are very new metrics developed only in the past few years. These metrics may be substantially better than PSNR, but they have not had time to impact the literature published by the image and video processing communities. Because the methods of [19, 75] are new, it is unclear how much of an improvement they have over PSNR, and until these metrics become more well known and commonplace among researchers they will not replace PSNR as the quality metric of choice. Also, the results of methods given in this dissertation are compared to methods developed previously whose results are given in the literature. These methods all use PSNR as the performance metric, so we must use PSNR for consistency.

It is rather obvious that the optimal coefficient selection process is unattainable when no supplemental information is provided by the oracle. Thus, the optimal image denoising method is not possible for practical implementation. However, the knowledge obtained from the optimal binary map, J^{opt}_{·,k}[·], is used for comparison with the refined coefficient map generated by the two-threshold criteria, J_{·,k}[·], described in Section 3.3. The coefficient selection method is based on the error between the optimal coefficient subband and the subband generated by the two-threshold criteria. The error is given by

Error = [ Σ_{p∈{hl,lh,hh},k,x,y} (J^{opt}_{p,k}[x, y] ⊕ J_{p,k}[x, y]) λ̃^2_{p,k}[x, y] ] / [ Σ_{p∈{hl,lh,hh},k,x,y} λ̃^2_{p,k}[x, y] ],   (3.24)

where ⊕ is the exclusive-OR operation.
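A direct transcription of Equation 3.24 is given below as a sketch; the lists of subband arrays and the synthetic data in the usage example are placeholders introduced here for illustration, not part of the method description.

import numpy as np

def selection_error(J_opt, J, lam_noisy):
    # Equation 3.24: energy-weighted disagreement between the optimal binary maps
    # and the maps produced by the two-threshold criteria, summed over all subbands.
    num = sum(np.sum(np.logical_xor(jo, j) * ln ** 2)
              for jo, j, ln in zip(J_opt, J, lam_noisy))
    den = sum(np.sum(ln ** 2) for ln in lam_noisy)
    return num / den

rng = np.random.default_rng(2)
lam = [rng.normal(0.0, 10.0, (8, 8)) for _ in range(3)]   # placeholder noisy subbands
J_opt = [np.abs(l) > 12.0 for l in lam]                   # placeholder optimal maps
J_two = [np.abs(l) > 15.0 for l in lam]                   # placeholder two-threshold maps
print(selection_error(J_opt, J_two, lam))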

In the proposed coefficient selection algorithm, we use a training sample approach.

The approach starts with a series of test images serving as training samples to derive the functions which determine the optimal set of values for τ and s, as well as the type of wavelet used for denoising. Theoretically, we may represent each training sample as a vector V_i, i = 1, ..., n. Those training samples should span a space which includes many similar images corrupted by noise:

S = Span{V_i; i = 1, ..., n}.   (3.25)

The original data and the statistical distribution of the noise are given for each of the training samples which are corrupted. The optimal set of parameters can then be determined for the training samples using the approach described earlier. Ideally, the space spanned by the training samples contains the types of corrupted images which are to be denoised. As a result, the same set can generate an optimal or close to optimal performance for corrupted images of the same type. It is clear that more training samples will generate parameters suitable for more types of images, while a space of fewer training samples is suitable for a lesser number of images. In the following, we will use some examples to illustrate this approach. The test images are all 256x256 pixels. Shown in Figure 3.6, each of the training sample images is well known in the image processing community, and collectively they represent as many types of images as possible. Starting from the upper-left image and going clockwise, the images are "Lenna", "Airplane", "Fruits", and "Girl". In this way, the τ and s obtained will likely perform well in most cases.

Figure 3.6: Test images.

A test is used to demonstrate the effectiveness of different wavelets in denoising. First, each of the four test images is corrupted with AWGN at various levels. Next, the 2D non-decimated wavelet transform, given in Section 3.2, is calculated using several different wavelets. The wavelet coefficients are then hard thresholded using a threshold T ranging from 0 to 150, and the inverse wavelet transform is applied to the thresholded coefficients. The wavelet which gives the reconstructed images with the highest average PSNR is chosen for use in the general case.

Several wavelets were used in the testing; however, for simplicity only five are presented. We have chosen the Daubechies wavelets [12] (Daub4 and Daub8) for their smoothness properties, the spline wavelets (first-order and quadratic spline) [6] because of their use in the previous works of [42, 43, 54], and the Haar wavelet because of its simplicity and compact support. The results are given in Figure 3.7. Based on the testing results given in Figure 3.7, the Haar wavelet is selected for image denoising:

h[n] = { 1/√2, when n = 0, 1; 0, else }      g[n] = { −1/√2, when n = 0; 1/√2, when n = 1; 0, else }.   (3.26)

Testing has shown the Haar wavelet to be the most promising in providing the highest reconstructed image quality. The compact support of the Haar wavelet enables the wavelet coefficients to represent the smallest number of original pixels in comparison with other types of wavelets. Therefore, when a coefficient is removed because of its insignificance or isolation, the result affects the smallest area of the original image in the reconstruction, which reduces the impact on image quality even if a removed coefficient is not comprised solely of noise.

The Haar wavelet is used in a non-decimated wavelet decomposition of the original image. Three subband levels are used, i.e., k = −1 to 2. The proposed selective wavelet shrinkage algorithm is applied to all wavelet subbands, and the subbands are synthesized by the non-decimated inverse wavelet transform.

Figure 3.7: Average PSNR (dB) versus hard threshold T for the Haar, first-order spline, quadratic spline, Daub. 4, and Daub. 8 wavelets at noise levels σ_n = 10, 20, 30, and 40.

Testing for the optimal values of τ and s is accomplished by artificially adding Gaussian noise to each of the four images, denoising all four images with a particular τ and s, and recording the average error given by Equation 3.24. Then, the combination of τ and s which gives the lowest error is the choice for that particular noise level. The average error is recorded when denoising each of the four test images given in Figure 3.6 using τ ranging from 0 to 150 and s ranging from 0 to 20. The proposed algorithm is tested by applying AWGN with a standard deviation (σ_n) of 10, 20, 30, 40, and 50 to each of the test images. The proposed method of selective wavelet shrinkage is applied to the corrupted image, and the resulting error is recorded using Equation 3.24. The results of the testing in which σ_n = 30 are given in Figure 3.8.

Figure 3.8: Error results for the test images, σ_n = 30 (average error versus threshold value τ and spatial support s).

Table 3.1 gives the τ and s which provide the lowest average error for each noise level tested. These particular values are referred to as τ_m(·) and s_m(·). Table 3.1 suggests that the parameters τ_m(·) and s_m(·) are functions of the standard deviation of the noise, σ_n.

Noise level (σ_n)    10      20      30      40      50
Min. avg. error      3E-4    11E-4   24E-4   42E-4   64E-4
s_m value            5       9       10      15      14
τ_m value            23      43      63      85      108

Table 3.1: Minimum average error of the test images for various noise levels and their corresponding threshold and support values.

Because τm(·) and sm(·) generally increase with an increase in additive noise as shown in Table 3.1, both parameters can be modeled as functions of the additive noise, σn. Then, knowing the level of noise corruption, the threshold levels which produce the minimum error, Error, may be obtained by estimating the τm(·) and sm(·) functions. The five noise levels provided in the test are used as sampling points for the estimation of the continuous functions τm(·) and sm(·). With enough sampling points both τm(·) and sm(·) can be effectively estimated, and the correct τ and s can be calculated to denoise an image with any level of noise corruption, given that the noise level is known.

The estimated functions of the sampled values τ_m(·) and s_m(·) are referred to as τ̃_m(·) and s̃_m(·), respectively. Once the estimated functions are calculated, they are used in the general case. Thus, given an image corrupted with noise, it is denoised with no prior knowledge by estimating the level of noise corruption, calculating the proper thresholds using the τ̃_m(·) and s̃_m(·) functions, and using the calculated threshold levels in the denoising process given in Section 3.3.

3.5 Estimation of Parameter Values

It can be shown from the values given in Table 3.1 that the parameters τ_m(·) and s_m(·) are functions of σ_n; therefore, we need to estimate both the standard deviation of the noise and the functions themselves. These two topics are discussed in this section.

3.5.1 Noise Estimation

The level of noise in a given digital image is unknown and must be estimated from the noisy image data. Several well known algorithms have been given in the literature to estimate image noise. In [16, 54] a median value of the λ̃_{hh,0}[·] subband is used in the estimation process. The median noise estimation method of [54] is used in our algorithm:

σ̃_n = Median(|λ̃_{hh,0}[·]|) / 0.6745,   (3.27)

where λ̃_{hh,0}[·] are the noisy wavelet coefficients in the high-high band of the 0th scale. Because the vast majority of useful information in the wavelet domain is confined to few and large coefficients, the median can effectively estimate the level of noise (i.e., the average level of the useless coefficients) without being adversely influenced by useful coefficients.
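Equation 3.27 is a one-line computation; the sketch below includes a sanity check in which synthetic Gaussian samples stand in for the high-high subband coefficients (an assumption made only for the example), so the printed estimate should be close to the true standard deviation.

import numpy as np

def estimate_noise_sigma(lam_hh0):
    # Equation 3.27: median-based noise estimate from the finest high-high subband.
    return np.median(np.abs(lam_hh0)) / 0.6745

rng = np.random.default_rng(3)
fake_subband = rng.normal(0.0, 20.0, (256, 256))   # stand-in coefficients, true sigma = 20
print(estimate_noise_sigma(fake_subband))          # prints a value near 20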

3.5.2 Parameter Estimation

Using the known level of noise added to the original images, the values of τ_m(·) and s_m(·) given in Table 3.1 are estimated. One of the simplest and most popular estimation procedures is the LMMSE (Linear Minimum Mean Squared Error) method, and it is used as the estimation procedure [68]. That is, two parameters a_τ and b_τ are found such that

τ̃_m(σ_n) = a_τ σ_n + b_τ.   (3.28)

The choice of a_τ and b_τ will minimize the mean squared error. Similarly, an estimate of s_m, which must be an integer, is found as

s̃_m(σ_n) = ⌊a_s σ_n + b_s⌋.   (3.29)

The parameters which minimize the mean squared error are: a_τ = 2.12, b_τ = 0.80, a_s = 0.26, and b_s = 2.81.

The LMMSE estimation procedure gives a simple description of the τ_m and s_m functions. That is, only two values (a and b) are needed to determine the proper thresholds for denoising. The LMMSE estimator is also shown to be a good fit to the test data in Figure 3.9, which gives the values of τ_m(·) and s_m(·) as well as their corresponding LMMSE estimates. The LMMSE estimate functions are the best linear fit to the data. Note that the support value s_m must be an integer.

The threshold τ and the support value s are determined by using the estimate of the noise given by Equation 3.27. The two thresholds are given by

τ = a_τ σ̃_n + b_τ
s = ⌊a_s σ̃_n + b_s⌋.   (3.30)

Using this information, a new image denoising algorithm is formalized. With a given image, the noise level is estimated by Equation 3.27, τ and s are then calculated using Equation 3.30, and the image is denoised by the method given in Section 3.3.
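Putting Equations 3.27 and 3.30 together, the whole parameter selection reduces to a few lines; the constants below are the LMMSE values reported in this section, and the function is only a sketch of that mapping.

import math

A_TAU, B_TAU = 2.12, 0.80     # LMMSE fit for tau_m (Section 3.5.2)
A_S, B_S = 0.26, 2.81         # LMMSE fit for s_m (Section 3.5.2)

def thresholds_from_noise(sigma_hat):
    # Equation 3.30: map the estimated noise level to the threshold tau and the
    # integer support value s.
    tau = A_TAU * sigma_hat + B_TAU
    s = math.floor(A_S * sigma_hat + B_S)
    return tau, s

print(thresholds_from_noise(30.0))   # roughly (64.4, 10); compare Table 3.1 at sigma_n = 30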

Figure 3.9: τ_m(·) and s_m(·) and their corresponding estimates, τ̃_m(·) and s̃_m(·), plotted against the noise level (standard deviation σ_n): threshold value τ (top) and local support value s (bottom).

3.6 Experimental Results

The "Peppers" and "House" images are used for gauging the performance of the proposed denoising algorithm. These two images have also been used in the results of [42, 43, 54]; therefore, the proposed algorithm's performance can be compared with the performance of other recent algorithms given in the literature. Both the "Peppers" and "House" images are corrupted with AWGN and the proposed method is used for denoising. The results are given in Figures 3.10 and 3.11.

"Peppers" image
Input PSNR                       22.6    19.6    16.6    13.6    Average
Proposed Algorithm               31.00   28.98   27.17   25.46   28.15
Pizurica 3-band, [54]            30.20   28.60   27.00   25.20   27.75
Pizurica 2-band, [54]            29.90   28.20   26.60   24.90   27.40
Malfait and Roose, [42]          28.60   27.30   26.00   24.60   26.63
Mallat and Hwang, [43]           28.20   27.30   27.10   24.60   26.80
Matlab's Sp. Adaptive Wiener     29.00   27.10   25.30   23.30   26.18

"House" image
Input PSNR                       23.9    20.9    17.9    14.9    Average
Proposed Algorithm               33.09   31.55   29.81   28.34   30.70
Pizurica 3-band, [54]            32.80   31.30   29.80   28.30   30.55
Pizurica 2-band, [54]            32.10   30.50   29.30   28.10   30.00
Malfait and Roose, [42]          32.90   31.30   29.80   28.20   30.55
Mallat and Hwang, [43]           31.30   30.50   29.10   27.10   29.50
Matlab's Sp. Adaptive Wiener     30.30   28.60   26.70   24.90   27.63

Table 3.2: PSNR comparison of the proposed method to other methods given in the literature (results given in dB).

Table 3.2 gives the results of the proposed method as well as the results of [42, 43, 54]. Note that the methods of [42, 43, 54] all use the quadratic spline wavelet [6] in three subband levels, and each algorithm's coefficient selection method is based on a probabilistic formulation to determine how much a particular coefficient contributes to the overall image quality. The proposed algorithm uses the Haar wavelet, given in Equation 3.26, in three subband levels, and the coefficient selection process is based on a geometrical approach. As shown in Table 3.2, the results of the proposed method are an improvement over other methods described in the literature.

In addition to improved performance, the proposed algorithm is computationally simple, facilitating real-world applications. The proposed algorithm has been computed on older processors for an accurate comparison, and the computation time of the proposed method is an order of magnitude less than that of the previous method of highest performance, [54]. Table 3.3 gives the computational results of the proposed method as well as the results of [42, 54].

Processor                    Pentium IV    Pentium III    IBM RS6000/320H
Proposed Algorithm           0.66          1.14           ***
Pizurica 3-band, [54]        ***           45.00          ***
Pizurica 2-band, [54]        ***           30.00          ***
Malfait and Roose, [42]      ***           ***            180.00
*** Computation time not evaluated

Table 3.3: Computation times for a 256x256 image, in seconds.

The proposed algorithm shows a substantial drop in computation time. Both [42] and [54] use iterative computation in the selection of wavelet coefficients for reconstruction, which requires unreasonable computation time for certain applications. The current two-threshold technique is a simpler, non-iterative coefficient selection method which produces better results.

In addition to obtaining a higher signal-to-noise ratio than established image denoising algorithms, the proposed denoising algorithm facilitates image compression when used as a pre-processing step. That is, the image is first denoised using the proposed method, then compressed by 2D wavelet compression. The "Peppers" image is compressed with various quantization step sizes, both with and without the proposed denoising algorithm. Figure 3.12 gives the compression results.

As given in Figure 3.12, regardless of the quantization step, applying the proposed denoising algorithm prior to compression improves the compression ratio. However, pre-processing is most beneficial when the step size is small. This is not surprising, however. When a large step size is applied to the wavelet transform subbands, much of the noise inherent in the image as well as much image content is removed, thus increasing the compression ratio. However, when a small step size is applied, much of the inherent noise is included in the compressed image, decreasing the compression ratio.

Image                  Step Size    Without Denoising         With Denoising
Lenna (512x512)        2            4.82:1, 159.2 kbytes      7.15:1, 107.4 kbytes
Fruits (512x512)       4            8.66:1, 88.7 kbytes       10.92:1, 70.4 kbytes
Barb (512x512)         8            11.67:1, 65.8 kbytes      12.98:1, 59.19 kbytes
Goldhill (512x512)     16           24.56:1, 31.3 kbytes      28.30:1, 27.14 kbytes
Peppers (512x512)      32           49.28:1, 15.6 kbytes      51.19:1, 15.0 kbytes

Table 3.4: Compression ratios of 2D wavelet compression both with and without denoising applied as a pre-processing step.

Table 3.4 gives the results of 2D wavelet compression of various images both with and without the denoising algorithm applied as a pre-processing step. As shown in Table 3.4, when the denoising algorithm is applied to the image prior to compression, the 2D wavelet compression algorithm achieves better performance. However, the performance improvement is greater with a smaller quantization step size.

3.7 Discussion

A new selective wavelet shrinkage algorithm for image denoising has been described. The proposed algorithm uses a two-threshold support criteria which investigates coefficient magnitude, spatial support, and support across scales in the coefficient selection process. In general, images can be accurately represented by a few large wavelet coefficients, and those few coefficients are spatially clustered together.

The two-threshold criteria is an efficient and effective way of using the magnitude and spatial regularity of wavelet coefficients to distinguish useful from useless coefficients. Furthermore, the two-threshold criteria is a non-iterative, computationally simple approach to selective wavelet shrinkage, facilitating real-time image processing applications.

The values of the two thresholds are determined by minimizing the error between the coefficients selected by the two thresholds and the coefficients selected by a denoising method which uses supplemental information provided by an oracle. The supplemental information provided by the oracle is useful in determining the correct coefficients to select, and the denoising performance is substantially greater than that of methods which do not use the supplemental information. Thus, the method which uses the supplemental information provided by the oracle is referred to as the optimal denoising method. Therefore, by minimizing the error between the two-threshold method and the optimal denoising method, the two-threshold method can come as close as possible to the performance of the optimal denoising method.

Consequently, the two-threshold method of selective wavelet shrinkage provides an image denoising algorithm which improves upon previous image denoising methods given in the literature in both denoised image quality and computation time. The light computational burden of the proposed denoising method makes it suitable for real-time image processing applications.

Figure 3.10: Results of the proposed image denoising algorithm. Top left: Original "Peppers" image. Top right: Corrupted image, σ_n = 37.75, PSNR = 16.60 dB. Bottom: Denoised image using the proposed method, PSNR = 27.17 dB.

Figure 3.11: Results of the proposed image denoising algorithm. Top left: Original "House" image. Top right: Corrupted image, σ_n = 32.47, PSNR = 17.90 dB. Bottom: Denoised image using the proposed method, PSNR = 29.81 dB.

Figure 3.12: Wavelet-based compression results with and without pre-processing: compressed file size (kBytes) of the "Peppers" image versus quantization step size, for 2-D wavelet compression with and without the pre-processing step.

CHAPTER 4

Combined Spatial and Temporal Domain Wavelet Shrinkage Algorithm for Video Denoising

4.1 Introduction

As shown in the introduction of Chapter 3, the process of removing noise in digital images has been studied extensively [15, 17, 18, 20, 26, 27, 29, 31, 39, 41, 42, 43, 46, 49, 53, 54, 61, 64, 69, 73, 76, 77, 83]. However, until recently, the removal of noise in video signals has not been studied seriously. Cocchia et al. developed a three-dimensional rational filter for noise removal in video signals [10]. The 3D rational filter is able to remove noise while preserving important edge information. Also, the 3D rational filter uses a motion estimation technique: where no motion is detected, the 3D rational filter is applied in the temporal domain; otherwise, only spatial domain processing is applied.

Later, Zlokolica et al. introduced two new techniques for noise removal in image sequences [83]. Both of these new techniques show improved results over the method of [10]. The first method is the alpha-trimmed mean filter of [4] extended to video signals, and the second is the K nearest neighbors (KNN) filter. Both the alpha-trimmed and KNN denoising methods are based on ordering the pixel values in the neighborhood of the location to be filtered and averaging a portion of those spatially contiguous pixels. Each of these methods attempts to average values which are close in value, and to avoid averaging values which are largely dissimilar. Thus, the image sequence is smoothed without blurring edges.

However, because of the success of the wavelet transform over other mathematical tools in denoising images, some researchers believe that wavelets may be successful in the removal of noise in video signals as well. Pizurica et al. use a wavelet-based image denoising method to remove noise from each individual frame in an image sequence, and then apply a temporal filtering process for temporal domain noise removal [55]. The combination of wavelet image denoising and temporal filtering outperforms both wavelet-based image denoising techniques [42, 43, 54] and spatial-temporal filtering techniques [4, 10, 83].

The temporal domain filtering technique described in [55] is a linear IIR filter which continues to filter until it reaches a large temporal discontinuity. It does not filter locations of large temporal discontinuity, where the absolute difference in neighboring pixel values is greater than a threshold, T, thus preserving motion while removing noise.

Although temporal processing aids the quality of the original image denoising method, the value of T that gives improved performance varies with differing video signals. That is, the value of T may be large in sequences where there is little motion, for improved noise removal; i.e., there is more redundancy between consecutive frames, and the redundancy may be exploited by a large T to improve video quality. However, in image sequences where there exists a large amount of motion, consecutive frames are more independent and there exists little to no redundancy to exploit. Thus, the parameter T must be small to achieve optimal performance.

60 In the case of video denoising, it has been fairly well documented that the amount of noise removal achievable from temporal domain processing, while preserving overall quality, is dependent on the amount of motion in the original video signal [10, 55].

Thus, a robust, high-quality video denoising algorithm is required to not only be scalable to differing levels of noise corruption, but also scalable to differing amounts of motion in the original signal. Unfortunately, this principle has not been seriously considered in video denoising.

In this chapter, we develop a noise removal algorithm for video signals. This algorithm uses selective wavelet shrinkage in all three dimensions of the image sequence and proves to outperform the few video denoising algorithms given in the relevant literature in terms of PSNR. First, the individual frames of the sequence are denoised by the method described in Chapter 3; then a new selective wavelet shrinkage method is used for temporal domain processing.

Also, a motion estimation algorithm is developed to determine the amount of temporal domain processing to be performed. Several motion estimators have been proposed [10, 55], but few are robust to noise corruption. The proposed motion estimation algorithm is robust to noise corruption and is an improvement over the motion estimation method of [10]. The proposed denoising algorithm, including the proposed motion estimation method, is experimentally determined to be an improvement over the methods of [10, 55, 83].

Following the Introduction, Section 4.2 describes the temporal domain wavelet shrinkage method and explores the proper order of temporal and spatial domain processing functions. Section 4.3 provides the proposed motion estimation index used in the temporal domain processing and compares it with the motion estimation method of [10]. Section 4.4 develops the parameters for temporal domain processing, and Section 4.5 gives the experimental results of the proposed method as well as other established methods. Section 4.6 gives the discussion.

4.2 Temporal Denoising and Order of Operations

In this section, we develop the principal algorithm for video denoising. Additional mechanisms required by this algorithm are discussed in later sections.

4.2.1 Temporal Domain Denoising

Let us define f^z_l as a pixel of spatial location l and frame z in a given image sequence. The non-decimated wavelet transform applied in the temporal domain is given by

λ^{3D}_{k+1}[l, z] = Σ_p g[p] α^{3D}_k[l, 2^{k+1} p − z],   (4.1)

and

α^{3D}_{k+1}[l, z] = Σ_p h[p] α^{3D}_k[l, 2^{k+1} p − z],   (4.2)

where

α^{3D}_{−1}[l, z] = f^z_l.   (4.3)

λ^{3D}_k[l, z] is the high-frequency wavelet coefficient of spatial location l, frame z, and scale k. Also, α^{3D}_k[l, z] is the low-frequency scaling coefficient of spatial location l, frame z, and scale k. Thus, multiple resolutions of wavelet coefficients may be generated from iterative calculation of Equations 4.1 and 4.2.

The wavelet function used in the temporal domain denoising process is the Haar wavelet, given by

h[n] = { 1/√2, when n = 0, 1; 0, else }      g[n] = { −1/√2, when n = 0; 1/√2, when n = 1; 0, else }.   (4.4)

The decision to use the Haar wavelet is based on experimentation with several other wavelet functions and finding the best results with the Haar. The compact support of the Haar wavelet makes it a suitable function for denoising applications. Because of its compact support, the Haar coefficients represent the smallest number of original pixels in comparison to other types of wavelets. Thus, when a coefficient is removed because of its insignificance, the result affects the smallest area of the original signal in the reconstruction.

Significant wavelet coefficients are selected by their magnitude with a threshold operation:

L^{3D}_k[l, z] = { λ^{3D}_k[l, z], when |λ^{3D}_k[l, z]| > τ_z[l]; 0, else },   (4.5)

where L^{3D}_k[·] are the thresholded wavelet coefficients used in signal reconstruction, and τ_z[·] is the threshold value. The resulting denoised video signal is computed via the inverse non-decimated wavelet transform

α̂^{3D}_k[l, z] = (1/2) Σ_p h[p] α̂^{3D}_{k+1}[l, z − 2^{k+1} p] + (1/2) Σ_p g[p] L^{3D}_{k+1}[l, z − 2^{k+1} p],   (4.6)

which leads to

f̂^{z,3D}_l = α̂^{3D}_{−1}[l, z].   (4.7)

f̂^{z,3D}_l is the temporally denoised video signal.
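The temporal step of Equations 4.1-4.7 can be sketched, for a single resolution level and a single scalar threshold, as follows. This is an illustration only: the full method uses a spatially varying threshold τ_z[l] and several levels, circular indexing along the frame axis is an assumption made here, and the shift sign is chosen so that the synthesis reproduces the input exactly when nothing is thresholded.

import numpy as np

h = np.array([1.0, 1.0]) / np.sqrt(2.0)    # Haar low-pass filter (Equation 4.4)
g = np.array([-1.0, 1.0]) / np.sqrt(2.0)   # Haar high-pass filter (Equation 4.4)

def temporal_denoise(frames, tau_z):
    # frames has shape (Z, H, W); the transform runs along the frame axis only.
    Z = frames.shape[0]
    alpha = np.zeros_like(frames, dtype=float)
    lam = np.zeros_like(frames, dtype=float)
    for z in range(Z):                          # analysis, Equations 4.1 and 4.2 (k = -1, step 1)
        for p in range(len(h)):
            alpha[z] += h[p] * frames[(z + p) % Z]
            lam[z] += g[p] * frames[(z + p) % Z]
    lam[np.abs(lam) <= tau_z] = 0.0             # Equation 4.5 with a single scalar threshold
    out = np.zeros_like(frames, dtype=float)
    for z in range(Z):                          # synthesis, Equation 4.6
        for p in range(len(h)):
            out[z] += 0.5 * (h[p] * alpha[(z - p) % Z] + g[p] * lam[(z - p) % Z])
    return out

video = np.random.default_rng(4).normal(128.0, 5.0, (8, 32, 32))
print(temporal_denoise(video, tau_z=10.0).shape)    # (8, 32, 32): same shape, temporally smoothed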

63 4.2.2 Order of Operations

With a spatial denoising technique and a temporal denoising technique established in Chapter 3 and above, respectively, there still remains the question of the order of operations. The highest quality may occur with temporal domain denoising followed by spatial domain (TFS) denoising, or spatial denoising followed by temporal (SFT) denoising.

Theoretically, it is not possible to prove which order of operations is better because the description of the noise is not known. However, it is our hypothesis that SFT denoising can more aptly separate noise from signal information. The reasoning behind this hypothesis is that removing noise in the spatial domain is a well known process, and any noise removal prior to temporal domain processing is helpful in discriminating between the residual noise and motion in the image sequence. However, this hypothesis is validated heuristically.

Thus, a test is conducted using two video signals. The first video signal is one which contains little motion, and the other contains a great deal of motion. The selected image sequences are the "CLAIRE" sequence from frames #104-167 and the "FOOTBALL" sequence from frames #33-96.

Both of the image sequences are denoised with \tau and \tau_z ranging from 0 to 30 for both TFS and SFT denoising operations. Note that in the test, \tau_z is a single value and spatially independent, unlike the temporal threshold, \tau_z[\cdot], which is used in the

final denoising algorithm, dependent upon spatial position, and given in Equation 4.5.

Also, the s parameter for feature selection in the image denoising method described in Section 3.3 is calculated by taking Equation 3.30 and solving for s. The parameter

s is given by:

s = \left\lfloor \frac{a_s}{a_\tau}(\tau - b_\tau) + b_s \right\rfloor.   (4.8)

Also, the number of resolutions of the non-decimated wavelet transform used in both the spatial and temporal denoising methods is k = 1, ..., 5. The average PSNR of each trial is recorded. The PSNR of an image is given by Equation 3.22.

Figure 4.1 gives the results of testing. As shown in Figure 4.1, the highest average PSNR is achieved by SFT denoising: first spatially denoising each frame of the sequence, followed by temporal domain denoising. Thus, for the proposed denoising method, spatial domain denoising occurs prior to temporal domain denoising, exclusively.

In addition to a higher average PSNR, there is another benefit to SFT denoising.

The level of motion in an image sequence is known to be crucial in determining the amount of noise reduction possible from temporal domain processing, and a motion index calculation is inevitably done by comparing consecutive frames to one another.

Thus, let us define a noisy image sequence where \hat{f}_l^z is a corrupted pixel in spatial position l and frame z, defined by

\hat{f}_l^z = f_l^z + \eta_l^z,   (4.9)

where f_l^z is the noiseless pixel value, and \eta_l^z is the noise function. We can compare consecutive frames by taking the difference as in [10, 55] to find

\hat{f}_l^z - \hat{f}_l^{z+1} = \Delta f_l^z + \Delta \eta_l^z.   (4.10)

Thus, by taking the difference between frames to find the level of motion, the noise in one frame is subtracted from the noise in the next, in effect doubling the level of noise corruption [68].

Therefore, by applying spatial denoising prior to motion index calculation we can

reduce the value of \Delta\eta_l^z and provide a more precise calculation of the motion in the image sequence.

4.3 Proposed Motion Index

A motion index is important in the success of a video denoising method in order to discriminate between large temporal variances in the video signal which are caused by noise and large temporal variances which are caused by motion in the original

(noiseless) signal. A motion index is able to aid temporal denoising algorithms to eliminate the large temporal variances caused by noise while preserving the temporal variances caused by motion in the original image sequence, creating a higher quality video signal. That is, the motion index is used to determine τz[·].

4.3.1 Motion Index Calculation

Several works have developed a motion estimation index to determine the amount of temporal domain processing to perform, i.e., the amount of information that can be removed from the original signal to improve the overall quality [10, 55]. However, neither of these proposed indices is robust to noise corruption, which is an important feature in a motion index. There are a few characteristics that a motion index must possess. First, a motion index should be a localized value, because the amount of motion may vary in different spatial portions of an image sequence; the motion index should be able to identify those differences. Second, a motion index must be unaffected by the amount of noise corruption in a given video signal, so that it can aptly determine the proper amount of temporal domain processing.

Thus, a localized motion index is developed which is relatively unaffected by the level of noise corruption in the original image sequence. A spatially averaged temporal standard deviation (SATSD) is used as the index of motion. Spatial averaging is used to remove the noise inherent in the signal, and the temporal standard deviation is used to detect the amount of activity in the temporal domain.

Let us define \hat{f}_l^{z,2D} as the pixel value in spatial location l of the z-th frame of an image sequence already processed by the 2D denoising method given in Chapter 3.

The spatial averaging of the spatially denoised signal is given by

A_l^z = \frac{1}{B^2} \sum_{i \in I} \hat{f}_i^{z,2D},   (4.11)

where I is the set of spatial locations which form a square area centered around spatial location l, and B^2 is the number of spatial locations contained in I; typically, B = 15. The value of B must be odd to allow the square area to sit centrally around spatial location l. This average is used to find the standard deviation in the temporal domain:

\mu_l = \frac{1}{F} \sum_{i=1}^{F} A_l^i,   (4.12)

and

M_l = \sqrt{\frac{1}{F} \sum_{i=1}^{F} (A_l^i - \mu_l)^2}.   (4.13)

M_l is the localized motion index, F is the number of frames in the image sequence, and \mu_l is the temporal mean of the spatial average at location l.
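The SATSD calculation maps directly onto simple array operations. The sketch below is an illustration (not the original implementation) of Equations 4.11 through 4.13: a B x B box filter provides the spatial average, and the temporal standard deviation is taken over the group of frames. The uniform_filter routine from SciPy is assumed to be available.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def satsd_motion_index(denoised_video, B=15):
    """Sketch of the spatially averaged temporal standard deviation (SATSD).

    denoised_video : float array, shape (F, H, W) -- spatially denoised frames
    B              : odd size of the B x B spatial averaging window
    Returns M of shape (H, W), the per-pixel motion index M_l of Eq. 4.13.
    """
    frames = denoised_video.astype(np.float64)
    # Eq. 4.11: B x B spatial mean around each pixel, frame by frame
    A = np.stack([uniform_filter(frame, size=B, mode='nearest') for frame in frames])
    # Eqs. 4.12-4.13: temporal mean and standard deviation of the averaged frames
    return A.std(axis=0)
```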

4.3.2 Motion Index Testing

The "FOOTBALL" and "CLAIRE" image sequences are used once more to test the proposed motion index as well as the motion index given in [10], and two specific spatial locations are selected from each sequence: a location where there is little to no motion present, and a location where motion is present. A frame from each of the two image sequences is given in Figure 4.2, and the four spatial locations for evaluation of the proposed motion index are highlighted.

The two sequences are corrupted with various levels of noise, and the motion is estimated at each of the four selected spatial locations with both the proposed motion index and that of [10]. The results of the motion index used in [10] are given in Figure 4.3. As shown in Figure 4.3, the motion index of [10] is not robust to noise corruption. That is, the motion calculation from the same spatial location increases with an increase in noise. Also, the motion index shows the "FOOTBALL" image sequence (x = 300, y = 220) as having a higher motion index than the "CLAIRE" image sequence (x = 40, y = 200) with zero noise corruption. However, the motion index shows the opposite result at higher levels of noise. Thus, the motion index gives conflicting results with the introduction of noise.

The results of the proposed SATSD motion index are given in Figure 4.4. As shown in Figure 4.4, the proposed motion index is much more robust to varying noise levels, and the ordering of locations from highest to lowest motion is what one would expect. The location with the lowest motion index is in the "CLAIRE" image sequence, where there is no camera motion and there are no moving objects in that spatial location. The next lowest motion location is in the "FOOTBALL" image sequence in the spatial location where there are no moving objects; however, there is some slight camera motion in the sequence, so the motion index is slightly higher than in the "CLAIRE" image sequence. The location with the next highest motion index is the center of the "CLAIRE" image sequence, where there is some motion due to movement of the head, and the location with the highest motion index is in the "FOOTBALL" image sequence in the spatial location where many objects cross.

4.4 Temporal Domain Parameter Selection

The amount of temporal denoising which is beneficial to an image sequence is dependent upon the amount of noise corruption as well as the amount of motion.

Thus, the threshold \tau_z[\cdot] is given by

\tau_z[l] = \alpha \sigma_{f_n} + \beta M_l,   (4.14)

where M_l is the motion index of spatial position l, and \sigma_{f_n} is the estimated noise standard deviation of the image sequence. The two parameters \alpha and \beta are determined experimentally using test image sequences.
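Combining the noise estimate with the motion index, the per-pixel threshold of Equation 4.14 reduces to a one-line computation. The sketch below is illustrative only; its default weights are the alpha and beta values reported later in this section for the training sequences used there.

```python
import numpy as np

def temporal_threshold(M, sigma_n, alpha=0.9, beta=-0.11):
    """Eq. 4.14 sketch: per-pixel temporal threshold tau_z[l].

    M       : motion index map M_l (e.g., from the SATSD calculation)
    sigma_n : estimated noise standard deviation of the image sequence
    alpha, beta : weights found experimentally (see Section 4.4)
    """
    return alpha * sigma_n + beta * np.asarray(M)
```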

In the proposed coefficient selection method, we use a training sample approach.

The approach starts with a series of test image sequences serving as training samples to derive the functions which determine the optimal set of the values for α and β.

Theoretically, we may represent each training sample as a vector V_i, i = 1, \ldots, n. The training samples should span a space which covers more corrupted image sequences than the training samples themselves:

S = \mathrm{Span}\{V_i;\ i = 1, \ldots, n\}.   (4.15)

The original data and the statistical distribution of the noise are given for each of the training samples which are corrupted. The optimal set of parameters can then be determined which give the highest average PSNR for the training samples. Ideally, the space spanned by the training samples contains the type of the corrupted image sequences which are to be denoised. As a result, the same parameter set can generate

optimal or close to optimal performance for the corrupted image sequences of the same type. It is clear that more training samples will generate parameters suitable for more types of image sequences, while a space of fewer training samples is suitable for fewer types of image sequences.

In order to obtain an estimate of the noise level, \sigma_{f_n}, an average is taken of the noise estimates of each frame in the image sequence, given by Equation 3.27. It is reasonable to assume an IID (independent, identically distributed) model for the noise at each pixel position, since the noise in each pixel position is generated by individual sensing units of the image sensor, such as a CCD [25], which are independent.

As a result, the estimate of the standard deviation of the noise (σn) in each image also represents the standard deviation of the noise in the temporal domain. Therefore, we can use the estimate of the noise in the spatial domain to estimate that in the temporal domain.

It should be pointed out that after the denoising has occurred in the spatial domain using the SFT method, the standard deviation of the noise is significantly reduced.

That reduction is statistically equal for each frame. As a result, the estimated noise in the spatial domain can still be nominally used for noise reduction in the temporal domain, as the reduction of \sigma_n can be automatically absorbed by \alpha.

The sequences "CLAIRE", "FOOTBALL", and "TREVOR" are used for \alpha and \beta selection. Each of the image sequences is corrupted with differing levels of noise corruption (\sigma_n = 10, 20) and denoised with the SFT denoising method, where Equation 4.14 is used as the temporal domain threshold. Values of \alpha and \beta are used ranging from \alpha = 0 to 3.0 and \beta = -0.3 to 0.3. The results of this testing are given in Figure

4.5. As shown in Figure 4.5 the maximum average PSNR is achieved when α = 0.9

and β = −0.11. The result is reasonable, of course, because as the motion increases in an image sequence the redundancy between frames decreases, and the benefits of temporal domain processing decrease. Thus, as the testing has shown, the temporal domain threshold decreases as the motion increases.

4.5 Experimental Results

The proposed video denoising algorithm is first applied to each of the video frames individually and independently. The method developed in Chapter 3 is used to denoise single images and serves as the spatial denoising portion of the wavelet-based video denoising algorithm.

The video signal is then denoised in the temporal domain by the method developed in Sections 4.2 and 4.4. The temporal denoising algorithm is a selective shrinkage algorithm which uses a proposed motion estimation index to determine the temporal threshold, τz[·]. The temporal threshold is modified by the motion index to effectively eliminate temporal domain noise while preserving important motion information.

Three image sequences are used to determine the effectiveness of the proposed video denoising method. They are the "SALESMAN" image sequence, the "TENNIS" image sequence, and the "FLOWER" image sequence. These three sequences are all corrupted with various levels of noise and denoised with the methods of [10, 55, 83] as well as the proposed method. Please note that only the temporal domain denoising algorithm of [55] is being tested. The spatial domain denoising method given in

Chapter 3 is used for all the wavelet-based video denoising methods. The results are given in Figures 4.6 through 4.11. As shown in Figures 4.6 through 4.11, the proposed method consistently outperforms the other methods presented. In all cases,

the proposed denoising method has a higher average PSNR than all other denoising methods tested. Also, note that in the method of [55], the threshold T changes due to video content and noise level to obtain the highest average PSNR using that particular method. In the proposed method, the temporal domain threshold, τz[·], is automatically calculated due to estimates of the noise level and motion.

Figures 4.12 through 4.17 give an example of the effectiveness of each of the denoising methods. Figure 4.12 gives the original frame #7 of the SALESMAN image sequence, and Figure 4.13 gives frame #7 corrupted with noise. Figures 4.14 through

4.17 give frame #7 denoised by each of the methods mentioned in this section.

In addition to obtaining a higher signal-to-noise ratio than established video denoising algorithms, the proposed denoising algorithm facilitates the compression of video signals when used as a pre-processing step. That is, the image sequence is first denoised using the proposed method, then compressed by 3D wavelet compression.

The "CLAIRE" image sequence is compressed with various quantization step sizes, both with and without the proposed denoising algorithm. Figure 4.18 gives the compression results. As given in Figure 4.18, regardless of the quantization step, applying the proposed denoising algorithm prior to compression improves the compression ratio. However, pre-processing is most beneficial when the step size is small.

Table 4.1 gives the results of 3D wavelet compression of various image sequences both with and without the denoising algorithm applied as a pre-processing step.

Figure 4.1: Test results of both TFS and SFT denoising methods (average PSNR versus \tau and \tau_z). Upper left: FOOTBALL image sequence, SFT denoising, max. PSNR = 30.85, \tau = 18, \tau_z = 12. Upper right: FOOTBALL image sequence, TFS denoising, max. PSNR = 30.71, \tau = 18, \tau_z = 12. Lower left: CLAIRE image sequence, SFT denoising, max. PSNR = 40.77, \tau = 19, \tau_z = 15. Lower right: CLAIRE image sequence, TFS denoising, max. PSNR = 40.69, \tau = 15, \tau_z = 21.

Figure 4.2: Spatial positions of motion estimation test points. Left: FOOTBALL image sequence, frame #96. Right: CLAIRE image sequence, frame #167.

Figure 4.3: Motion estimate given in [10] of the CLAIRE and FOOTBALL image sequences (motion estimate versus noise standard deviation at the four test points).

Figure 4.4: Proposed motion estimate of the CLAIRE and FOOTBALL image sequences (motion estimate M_l versus noise standard deviation at the four test points).

Figure 4.5: \alpha and \beta parameter testing for temporal domain denoising (average PSNR over the test image sequences for varying \alpha and \beta).

Figure 4.6: Denoising methods applied to the SALESMAN image sequence, std. = 10 (PSNR per frame for the proposed method, Pizurica (T=20), 2D wavelet filter, 3D KNN filter, and 3D rational filter).

Figure 4.7: Denoising methods applied to the SALESMAN image sequence, std. = 20 (PSNR per frame; Pizurica at T=40).

Figure 4.8: Denoising methods applied to the TENNIS image sequence, std. = 10 (PSNR per frame; Pizurica at T=20).

Figure 4.9: Denoising methods applied to the TENNIS image sequence, std. = 20 (PSNR per frame; Pizurica at T=40).

Figure 4.10: Denoising methods applied to the FLOWER image sequence, std. = 10 (PSNR per frame; Pizurica at T=10).

Figure 4.11: Denoising methods applied to the FLOWER image sequence, std. = 20 (PSNR per frame; Pizurica at T=20).

Figure 4.12: Original frame #7 of the SALESMAN image sequence.

Figure 4.13: SALESMAN image sequence corrupted, std. = 20, PSNR = 22.10.

Figure 4.14: Results of the 3D K-nearest neighbors filter, [83], PSNR = 28.42.

Figure 4.15: Results of the 2D wavelet denoising filter, given in Chapter 3, PSNR = 29.76.

Figure 4.16: Results of the 2D wavelet filtering with linear temporal filtering, [55], PSNR = 30.47.

Figure 4.17: Results of the proposed denoising method, PSNR = 30.66.

Figure 4.18: Wavelet-based compression results with and without pre-processing (compressed file size of the "CLAIRE" image sequence versus quantization step size, for 3D wavelet compression with and without denoising pre-processing).

Image Sequence           Step Size   Without Denoising       With Denoising
CLAIRE (360x288x168)         2       15.12:1, 3.29 Mbytes    31.72:1, 1.57 Mbytes
FOOTBALL (320x240x97)        4       6.45:1, 3.30 Mbytes     7.95:1, 2.68 Mbytes
MISSA (360x288x150)          8       33.10:1, 1.34 Mbytes    66.93:1, 0.68 Mbytes
CLAIRE (360x288x168)        16       137.2:1, 0.38 Mbytes    170.0:1, 0.30 Mbytes
MISSA (360x288x150)         32       198.2:1, 0.23 Mbytes    273.6:1, 0.17 Mbytes

Table 4.1: Compression ratios of 3D wavelet compression both with and without denoising applied as a pre-processing step.

As shown in Table 4.1, when the denoising algorithm is applied to an image sequence prior to compression, the 3D wavelet compression algorithm achieves better performance. However, the performance improvement is greater with a smaller quantization step size.

4.6 Discussion

In this chapter, a new combined spatial and temporal domain wavelet shrinkage method is developed for the removal of noise in video signals. The proposed method uses a geometrical approach to spatial domain denoising to preserve edge information, and a newly developed motion estimation index for selective wavelet shrinkage in the temporal domain.

The spatial denoising technique is a selective wavelet shrinkage algorithm developed in Chapter 3 and is shown to outperform other wavelet shrinkage denoising algorithms given in the literature in both denoised image quality (average PSNR) and computation time. The temporal denoising algorithm is also a selective wavelet shrinkage algorithm which uses a motion estimation index to determine the level of thresholding in the temporal domain.

The proposed motion index is experimentally determined to be more robust to noise corruption than other methods, and is able to help determine the threshold value for selective wavelet shrinkage in the temporal domain. With the motion index and temporal domain wavelet shrinkage, the proposed video denoising method is experimentally proven to provide higher average PSNR than other methods given in the literature for various levels of noise corruption applied to video signals with varying amounts of motion.

CHAPTER 5

Virtual-Object Video Compression

5.1 Introduction

The finalized version of the MPEG-4 standard was published in December of 1999.

The basis of coding in MPEG-4 is not a processing macroblock, as in MPEG-1 and

MPEG-2, but rather an audio-visual object [3]. Object based compression techniques have certain advantages, such as:

1) Allowing more user interaction with video content.

2) Allowing the reuse of recurring object content.

3) Removal of artifacts due to the joint coding of objects.

Although MPEG-4 does specify the advantages of object-based compression and provides a standard of communication between sender and receiver, it does not provide the means by which a) the content is separated into audio-visual objects, or b) the audio-visual objects are compressed. Since the publication of the MPEG-4 standard, much research has been conducted in the areas of shape coding [28, 40, 79] and texture coding [36, 78] of arbitrarily shaped objects, and methods of object identification and tracking [11, 23, 80].

However, although some success has been achieved in the various components necessary for the implementation of an object-based compression method, no such

compression method exists to date. The reason that a robust, object-based compression method does not exist is two-fold. One, robust multiple object identification and tracking methods have yet to be developed. The identification and tracking of all objects that exist in a given image sequence is difficult, and the object extraction and tracking technologies given in the literature are not mature enough to handle the task. Two, it is unknown whether the additional bit savings achieved by object-based compression will be greater than the added overhead of shape coding of objects to provide an overall compression gain.

Thus, a wavelet-based compression method is presented to provide some of the benefits of object-based compression methods without the difficulties of true object- based compression. An object-based wavelet compression algorithm, called virtual- object compression, is developed for high quality, low bit-rate video.

Virtual-object compression separates the portion of video that exhibits motion from the portion of the video that is stationary. The stationary video portion is then grouped as the background, and the portion of the video which exhibits motion is grouped as the virtual-object. After separation, both background and virtual-object are coded independently by means of 2D wavelet compression and 3D wavelet compression, respectively.

There are two separate processing areas in object-based compression. Object extraction is the method of separating different objects in an image sequence, and the compression of those objects is a method of compressing arbitrarily shaped objects.

In the virtual-object compression method, the wavelet transform is used for both object extraction and compression.

When the wavelet transform is applied in the temporal domain, the motion of objects is detected by large coefficient values. Therefore, the wavelet transform is used in the identification and extraction of moving objects prior to object-based compression. Virtual-object based compression uses the non-decimated wavelet transform in the temporal domain in the separation of objects and stationary background.

Virtual-object compression also restricts the shape of the virtual-object to be rectangular. This restriction enables the use of known video compression methods such as 3D wavelet compression for the compression of the virtual-object. Also, with a rectangular object restriction, the location and shape of the object can be completely defined with only two sets of spatial coordinates (the starting horizontal and vertical locations of the virtual-object, and the width and height of the virtual-object), virtually eliminating shape coding overhead.

Experimental results show the virtual-object compression method to be superior in compression ratio and PSNR when compared to both 2D wavelet compression and

3D wavelet compression.

The organization of this chapter is as follows. Following the Introduction, Section 5.2 gives a description of 3D wavelet compression. 3D wavelet compression is a known compression method of video signals [21, 24] and is used to test the effectiveness of virtual-object compression. Section 5.3 describes the virtual-object compression method, and Section 5.4 gives the performance results of both virtual-object compression and 3D wavelet compression. Section 5.5 gives the discussion.

5.2 3D Wavelet Compression

To show the improvement of the virtual-object compression method over more traditional compression methods based on macroblocks or frames, we briefly describe a known compression method called 3D wavelet compression, which is an extension of the well known image compression method, 2D wavelet compression. A block diagram of 3D wavelet compression is given in Figure 5.1, the components of which are as follows:

Figure 5.1: 3D wavelet compression.

5.2.1 2D Wavelet Transform

The first processing block of 3D wavelet compression is the spatial transformation of each of the frames of the image sequence into the wavelet domain. This processing block is referred to as 2D wavelet transformation.

First, let us define a 3 dimensional video signal f(·), where f(x, y, z) is a pixel in the image sequence of horizontal position x, vertical position y and frame z. The dimensions of f(·) are width Wf , height Hf , and frames F . f(·) is a processing unit

referred to as a group of frames (GoF). The 2D wavelet transform of f(\cdot) is given by:

a_{ll,k+1}[x, y, z] = \sum_n \sum_m h[n] h[m]\, a_{ll,k}[m - 2x, n - 2y, z]
d_{lh,k+1}[x, y, z] = \sum_n \sum_m g[n] h[m]\, a_{ll,k}[m - 2x, n - 2y, z]
d_{hl,k+1}[x, y, z] = \sum_n \sum_m h[n] g[m]\, a_{ll,k}[m - 2x, n - 2y, z]
d_{hh,k+1}[x, y, z] = \sum_n \sum_m g[n] g[m]\, a_{ll,k}[m - 2x, n - 2y, z],   (5.1)

where

a_{ll,-1}[x, y, z] = f(x, y, z).   (5.2)

d_{\cdot,k}[\cdot] and a_{ll,k}[\cdot] are the wavelet and scaling coefficients of subband level k, respectively. The subband level k ranges over [-1, K_M), where K_M is the 2D multiresolution level (MRlevel). h[\cdot] is the low-pass scaling filter, and g[\cdot] is the high-pass wavelet filter. The subscript designations of the coefficients, ll, lh, hl, hh, describe the horizontal and vertical processing in the coefficient construction. For example, d_{hl,k}[\cdot] is obtained by first high-pass filtering a_{ll,k-1}[\cdot] with g[\cdot] in the horizontal dimension, and then low-pass filtering the result with h[\cdot] in the vertical dimension.

The type of wavelet used in all the given results is the FT wavelet, or 5/3 wavelet.

The FT wavelet is given by

h[\cdot] = \{-\tfrac{1}{8}, \tfrac{1}{4}, \tfrac{3}{4}, \tfrac{1}{4}, -\tfrac{1}{8}\}, \qquad g[\cdot] = \{\tfrac{1}{2}, -1, \tfrac{1}{2}\}.   (5.3)

The FT wavelet is chosen because it has been shown to give the best overall quality for a given compression ratio among wavelets which produce only integer coefficients [74].

Note that the benefits of integer wavelet coefficients are reduced computational complexity and memory requirements.
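To make the filter-bank structure concrete, the following sketch applies one analysis level of the separable 2D transform with the FT (5/3) filters of Equation 5.3. It is an illustration under common conventions (symmetric boundary extension, even-sample decimation), not necessarily the exact indexing of Equation 5.1 or the implementation used in this work.

```python
import numpy as np

# FT (5/3) analysis filters from Eq. 5.3
H = np.array([-1/8, 1/4, 3/4, 1/4, -1/8])   # low-pass scaling filter h[.]
G = np.array([ 1/2,  -1,  1/2])             # high-pass wavelet filter g[.]

def analyze_1d(x, filt):
    """Filter a 1-D signal (symmetric extension) and keep every other sample."""
    pad = len(filt) // 2
    xe = np.pad(x, pad, mode='reflect')
    return np.convolve(xe, filt, mode='same')[pad:-pad][::2]

def dwt2_level(frame):
    """One 2-D analysis level of a single frame: returns (ll, lh, hl, hh)."""
    lo = np.apply_along_axis(analyze_1d, 1, frame, H)   # horizontal low-pass
    hi = np.apply_along_axis(analyze_1d, 1, frame, G)   # horizontal high-pass
    ll = np.apply_along_axis(analyze_1d, 0, lo, H)      # vertical low-pass of lo
    lh = np.apply_along_axis(analyze_1d, 0, lo, G)      # vertical high-pass of lo
    hl = np.apply_along_axis(analyze_1d, 0, hi, H)      # vertical low-pass of hi
    hh = np.apply_along_axis(analyze_1d, 0, hi, G)      # vertical high-pass of hi
    return ll, lh, hl, hh
```

Further levels are produced by applying dwt2_level again to the ll band, matching the multiresolution iteration described above.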

After the coefficients are transformed in the spatial domain, they are then quantized to represent the coefficients with no more precision than necessary to obtain the desired reconstructed quality.

5.2.2 2D Quantization

After the GoF has been 2D wavelet transformed, the coefficients are quantized uniformly across all subbands in the case of orthonormal wavelet transformation.

However, the wavelet transform used in the given 3D wavelet compression algorithm is biorthogonal to facilitate integer computation and fast compression. Therefore, the quantization level is modified according to scale. That is,

\tilde{a}_{ll,k}[x, y, z] = \mathrm{Int}\!\left(\frac{2^{k+1}\, a_{ll,k}[x, y, z]}{s_2}\right)
\tilde{d}_{lh,k}[x, y, z] = \mathrm{Int}\!\left(\frac{2^{k}\, d_{lh,k}[x, y, z]}{s_2}\right)
\tilde{d}_{hl,k}[x, y, z] = \mathrm{Int}\!\left(\frac{2^{k}\, d_{hl,k}[x, y, z]}{s_2}\right)
\tilde{d}_{hh,k}[x, y, z] = \mathrm{Int}\!\left(\frac{2^{k-1}\, d_{hh,k}[x, y, z]}{s_2}\right),   (5.4)

where s_2 is the 2D quantization step size, and \tilde{a}_{ll,k}[\cdot], \tilde{d}_{lh,k}[\cdot], \tilde{d}_{hl,k}[\cdot], and \tilde{d}_{hh,k}[\cdot] are the 2D quantized coefficient values. For more information on orthogonal and biorthogonal wavelets, refer to [12].
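Interpreting Int(·) as rounding to the nearest integer, the subband-weighted quantization of Equation 5.4 can be sketched as follows; this is an assumed reading (the original may truncate rather than round), not the system's implementation.

```python
import numpy as np

def quantize_2d(ll, lh, hl, hh, k, s2):
    """Sketch of the level-k 2D quantization of Eq. 5.4 with step size s2."""
    q = lambda c, w: np.rint(w * c / s2).astype(np.int32)   # Int(.) taken as round-to-nearest
    return (q(ll, 2.0 ** (k + 1)),   # ll band weighted by 2^(k+1)
            q(lh, 2.0 ** k),         # lh band weighted by 2^k
            q(hl, 2.0 ** k),         # hl band weighted by 2^k
            q(hh, 2.0 ** (k - 1)))   # hh band weighted by 2^(k-1)
```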

After all frames in the GoF have been spatially transformed and quantized, they are then transformed in the temporal domain to exploit inter-frame redundancy.

This is generally referred to as 3D wavelet transformation. The temporal domain transformation generally allows for greater compression, given that the frames in the

GoF are similar.

5.2.3 3D Wavelet Transform

The 3D wavelet transform is given by:

d^{3D}_{\zeta,k,j+1}[x, y, z] = \sum_p g[p]\, a^{3D}_{\zeta,k,j}[x, y, p - 2z]
a^{3D}_{\zeta,k,j+1}[x, y, z] = \sum_p h[p]\, a^{3D}_{\zeta,k,j}[x, y, p - 2z],   (5.5)

where

a^{3D}_{\zeta,k,-1}[x, y, z] = \tilde{d}_{\zeta,k}[x, y, z].   (5.6)

In Equations 5.5 and 5.6, \zeta \in \{lh, hl, hh\}, and a^{3D}_{\zeta,k,j}[\cdot] and d^{3D}_{\zeta,k,j}[\cdot] are the scaling and wavelet subbands of spatial scale k and temporal scale j. The superscript 3D denotes 3D wavelet transformation, and j is the subband level in the temporal domain, which ranges over [-1, J_M), where J_M is the 3D MRlevel.

For the ll band of the 2D transform, \tilde{a}_{ll,k}[\cdot], we have

d^{3D}_{ll,k,j+1}[x, y, z] = \sum_p g[p]\, a^{3D}_{ll,k,j}[x, y, p - 2z]
a^{3D}_{ll,k,j+1}[x, y, z] = \sum_p h[p]\, a^{3D}_{ll,k,j}[x, y, p - 2z],   (5.7)

where

a^{3D}_{ll,k,-1}[x, y, z] = \tilde{a}_{ll,k}[x, y, z].   (5.8)

Note that in Equations 5.5 and 5.7, all 2D wavelet coefficients which are processed with the g[\cdot] filter are designated as 3D wavelet coefficients, d^{3D}_{\cdot}[\cdot], and all the 2D coefficients which are processed with the h[\cdot] filter are designated as 3D scaling coefficients, a^{3D}_{\cdot}[\cdot]. As with the 2D wavelet transformation, the 3D wavelet coefficients are quantized once they are obtained.
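One temporal analysis level of the decoupled transform can be sketched in the same style: each 2D subband, stacked over the frames of the GoF, is filtered and decimated along the frame axis. The sketch below assumes the same 5/3 filters are reused temporally, which Equations 5.5 and 5.7 suggest but do not restate explicitly; it is an illustration only.

```python
import numpy as np

H = np.array([-1/8, 1/4, 3/4, 1/4, -1/8])   # 5/3 low-pass scaling filter h[.]
G = np.array([ 1/2,  -1,  1/2])             # 5/3 high-pass wavelet filter g[.]

def temporal_level(subband):
    """One temporal level of the decoupled 3D transform (Eqs. 5.5-5.7, sketch).

    subband : float array (F, H, W) -- a 2D-quantized subband stacked over frames
    Returns (a3d, d3d): temporal scaling and wavelet coefficients, decimated by 2.
    """
    def analyze(x, filt):
        pad = len(filt) // 2
        xe = np.pad(x, pad, mode='reflect')                 # symmetric boundary extension
        return np.convolve(xe, filt, mode='same')[pad:-pad][::2]
    a3d = np.apply_along_axis(analyze, 0, subband, H)
    d3d = np.apply_along_axis(analyze, 0, subband, G)
    return a3d, d3d
```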

5.2.4 3D Quantization

The 3D wavelet and scaling coefficients are quantized by

\tilde{d}^{3D}_{ll,k,j}[x, y, z] = \mathrm{Int}\!\left(\frac{s_2\, 2^{k+1} \sqrt{2}^{\,j}\, d^{3D}_{ll,k,j}[x, y, z]}{s_3}\right) \qquad \tilde{a}^{3D}_{ll,k,j}[x, y, z] = \mathrm{Int}\!\left(\frac{s_2\, 2^{k+1} \sqrt{2}^{\,j+1}\, a^{3D}_{ll,k,j}[x, y, z]}{s_3}\right)
\tilde{d}^{3D}_{lh,k,j}[x, y, z] = \mathrm{Int}\!\left(\frac{s_2\, 2^{k} \sqrt{2}^{\,j}\, d^{3D}_{lh,k,j}[x, y, z]}{s_3}\right) \qquad \tilde{a}^{3D}_{lh,k,j}[x, y, z] = \mathrm{Int}\!\left(\frac{s_2\, 2^{k} \sqrt{2}^{\,j+1}\, a^{3D}_{lh,k,j}[x, y, z]}{s_3}\right)
\tilde{d}^{3D}_{hl,k,j}[x, y, z] = \mathrm{Int}\!\left(\frac{s_2\, 2^{k} \sqrt{2}^{\,j}\, d^{3D}_{hl,k,j}[x, y, z]}{s_3}\right) \qquad \tilde{a}^{3D}_{hl,k,j}[x, y, z] = \mathrm{Int}\!\left(\frac{s_2\, 2^{k} \sqrt{2}^{\,j+1}\, a^{3D}_{hl,k,j}[x, y, z]}{s_3}\right)
\tilde{d}^{3D}_{hh,k,j}[x, y, z] = \mathrm{Int}\!\left(\frac{s_2\, 2^{k-1} \sqrt{2}^{\,j}\, d^{3D}_{hh,k,j}[x, y, z]}{s_3}\right) \qquad \tilde{a}^{3D}_{hh,k,j}[x, y, z] = \mathrm{Int}\!\left(\frac{s_2\, 2^{k-1} \sqrt{2}^{\,j+1}\, a^{3D}_{hh,k,j}[x, y, z]}{s_3}\right)   (5.9)

where s_3 is the 3D quantization level. Again, if the transform used in compression is an orthonormal transform, the scalings of Equation 5.9 would not be necessary.

However, the bi-orthogonal wavelet transform requires an adjustment by subband level.

The quantization levels s_2 and s_3 are left to the user to determine. The relationship between s_2 and s_3 is an important one, however. If s_3 is significantly larger than s_2, unwanted temporal artifacts may result in the reconstructed signal. Therefore, it is recommended to maintain s_3 \le s_2. Also, there is a specific reason why two quantization processes are necessary. It is known that the statistical properties of the horizontal and vertical dimensions in a video signal are similar to each other but differ from the time dimension [23]. Thus, a different quantization step applied to the spatial and temporal domains is reasonable. Also, it is well known that the quantization step leads to artifact generation in signal reconstruction. However, the artifacts that appear from quantization of the 2D wavelet coefficients and the 3D wavelet coefficients are perceptibly vastly different. The quantization of spatial domain wavelet coefficients leads to blurring and softening of the video signal, while the quantization of the 3D wavelet coefficients leads to "trails" of moving objects from frame to frame. Thus, to mitigate the differing types of artifacts generated from wavelet transformation in the two domains, two quantization step sizes are necessary.

Also, the above formulation of the 2D and 3D wavelet transform is not consistent with the traditional symmetric wavelet transformation of a 3-dimensional signal. In the symmetric case, each dimension is transformed at a certain MRlevel, and the lowest subband is then processed further for the next MRlevel. In the above formulation, however, the wavelet transform is applied in the spatial domain through all

subbands, and only afterwards is applied in the temporal domain. This is referred to as the decoupled 3D wavelet transform, and it is the preferred wavelet transformation method for video compression [5, 21, 24, 35].

A visual difference between the 2D wavelet transform and 3D wavelet transform

(both symmetric and decoupled) can be shown when viewing the differing sizes and shapes of the various subbands that are calculated. Figure 5.2 gives the size and shapes of each of the subbands calculated by the various wavelet transforms. The 2D

Figure 5.2: Starting from left to right: 1) Original three-dimensional video signal. 2) 2D wavelet transform (K_M = 2 and J_M = 0). 3) Symmetric 3D wavelet transform. 4) Decoupled 3D wavelet transform (K_M = 2 and J_M = 2).

wavelet transform, shown in Figure 5.2, applies no temporal domain processing; thus there are no segmentation lines crossing the temporal domain separating different subbands. There are only segmentation lines crossing the horizontal and vertical dimensions, where the level 2 LL band, a_{ll,2}[\cdot], is shown in the upper left-hand corner, and the level 0 HH band, d_{hh,0}[\cdot], is shown in the lower right-hand corner. Also shown in Figure 5.2, there exists a greater number of subbands generated by the

decoupled 3D wavelet transform than in the symmetric 3D wavelet transform, allowing for greater frequency analysis in both the spatial and temporal domains.

Each subband generated by the 3D wavelet transform is a 3-dimensional bandpass signal representing the original signal, f(·). A sample of subband locations is given in Figure 5.3.

Figure 5.3: Decoupled 3D wavelet transform subbands, K_M = 2, J_M = 2. Left: Subband d^{3D}_{hl,1,1}[\cdot] highlighted in gray. Right: Subband d^{3D}_{lh,0,2}[\cdot] highlighted in gray.

After the decoupled 3D wavelet transform and quantization are computed, stack-run [72] followed by Huffman [22] encoding are applied to each of the subbands for compression.

5.2.5 3D Wavelet Compression Results

The advantage of the 3D wavelet transform is evident when coding a video signal with both 2D and 3D wavelet compression. Figure 5.4 gives the results of 2D wavelet compression vs. 3D wavelet compression on the "CLAIRE" image sequence. 2D

wavelet compression is accomplished by computing the 2D wavelet transform on each frame in the image sequence separately, applying 2D quantization, and using stack-run [72] followed by Huffman [22] coding on the quantized coefficients. The 3D wavelet transform exploits redundancy in the temporal domain as well as in the spatial domain. Therefore, 3D wavelet compression produces a much higher compression ratio and better overall quality. As shown in Figure 5.4, the performance of 3D

Figure 5.4: Comparison of 2D wavelet compression and 3D wavelet compression using the CLAIRE image sequence (frame #4 is shown). Left: 2D wavelet compression. s2 = 64, KM = 8, file size = 198KB, compression ratio = 256:1, average PSNR = 29.80. Right: 3D wavelet compression. s2 = 29, s3 = 29, KM = 8, JM = 8, file size = 196KB, compression ratio = 258:1, average PSNR = 33.31.

wavelet compression method is greater than that of 2D wavelet compression. Note that for the results given in Figure 5.4 the GoF processing block for 3D wavelet compression is F = 64 frames.

5.3 Virtual-Object Compression

The advantages of 3D wavelet compression over traditional 2D frame-by-frame compression are evident from the results given in Figure 5.4. However, to further exploit temporal domain redundancy in video signals, virtual-object compression is developed. In virtual-object compression, the original video signal is separated into background and virtual-object. Each is then compressed separately for better compression results.

5.3.1 Virtual-Object Definitions

Let us define a three-dimensional rectangular object o(·) where o(x, y, z) is a pixel in the object sequence of horizontal position x, vertical position y and frame z. The dimensions of o(·) are width Wo, height Ho, and frames F . We restrict the object to be the same size in each frame of the sequence to ensure that the virtual-object is easily defined and compressible. Therefore, Wo and Ho are constant, and not dependent on z.

However, because objects in an image sequence move, we must allow the virtual-object to be placed anywhere within each frame. Thus, we define coordinates S_x[\cdot] and

Sy[·] which correspond to upper-left corner of the virtual-object in each frame, or the starting horizontal and vertical positions of the virtual-object, respectively. We also define Ex[·] and Ey[·] which correspond to the lower-right corner of the virtual-object, or the ending horizontal and vertical positions of the virtual-object, respectively.

With these definitions some boundary conditions are required. The virtual-object must be positive in width and height, and it cannot be larger than the original video frames, thus 0 ≤ Wo ≤ Wf and 0 ≤ Ho ≤ Hf . Also, the virtual-object must lie

within each frame. Thus, 0 \le S_x[z] < W_f - 1 and 0 \le S_y[z] < H_f - 1, for all z. It is also known that S_x[z] < E_x[z] < W_f and S_y[z] < E_y[z] < H_f, for all z.

As stated previously, the virtual-object must remain the same size for each frame in the sequence. Therefore, Ex[z] − Sx[z] = Wo and Ey[z] − Sy[z] = Ho for all z.

The virtual-object is defined as:

o(x, y, z) = f(x + S_x[z], y + S_y[z], z), \quad 0 \le x < W_o,\ 0 \le y < H_o,\ 0 \le z < F,   (5.10)

where o(\cdot) is the virtual-object and f(\cdot) is the original image sequence.

The background is defined as:

b(x, y) = \begin{cases} \dfrac{\sum_{z=0}^{F-1} f(x, y, z)\, \alpha[x, y, z]}{\sum_{z=0}^{F-1} \alpha[x, y, z]}, & \text{when } \sum_{z=0}^{F-1} \alpha[x, y, z] \neq 0 \\ 0, & \text{else,} \end{cases}   (5.11)

where

\alpha[x, y, z] = \begin{cases} 1, & \text{when } (x, y, z) \in L + R + U + D \\ 0, & \text{else.} \end{cases}   (5.12)

L, R, U, and D represent the area which lies outside the virtual-object, or the area left (L), right (R), above (U), and below (D) the virtual-object. More specifically,

L = \{(x, y, z) : x < S_x[z]\}, R = \{(x, y, z) : x \ge E_x[z]\}, U = \{(x, y, z) : y < S_y[z]\}, and D = \{(x, y, z) : y \ge E_y[z]\}. As shown in Equation 5.11, the background is formed by a temporal average of the entire GoF area outside of the virtual-object boundary.
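As a sketch of Equation 5.11, the background frame can be formed by masking out the per-frame object rectangle and averaging the remaining pixels over time. The helper below is illustrative only and assumes the per-frame start coordinates and the common object size have already been determined.

```python
import numpy as np

def background_frame(f, Sx, Sy, Wo, Ho):
    """Sketch of Eq. 5.11: temporal average of pixels outside the virtual-object.

    f      : float array (F, Hf, Wf) -- group of frames
    Sx, Sy : int arrays of length F  -- per-frame object start coordinates
    Wo, Ho : common object width and height
    """
    F = f.shape[0]
    outside = np.ones(f.shape, dtype=bool)
    for z in range(F):
        outside[z, Sy[z]:Sy[z] + Ho, Sx[z]:Sx[z] + Wo] = False   # alpha = 0 inside the object
    counts = outside.sum(axis=0)
    sums = np.where(outside, f, 0.0).sum(axis=0)
    return np.where(counts > 0, sums / np.maximum(counts, 1), 0.0)
```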

Figure 5.5 gives a frame of the "CLAIRE" image sequence including virtual-object definitions.

5.3.2 Virtual-Object Extraction Method

The method of extracting the virtual-object is accomplished by applying the wavelet transform in the temporal domain to the original image sequence f(\cdot). The extraction method separates the portion of video with motion from the portion of the video without motion. Motion in an image sequence results in large temporal domain transform coefficients which are spatially contiguous.

Figure 5.5: Virtual-object extraction.

The non-decimated wavelet transform in the temporal domain of a 3-dimensional image sequence f(\cdot) is given by

\lambda_{vo}[x, y, z] = \sum_m f(x, y, m)\, g_{vo}[m - z],   (5.13)

where \lambda_{vo}[\cdot] are the wavelet coefficients, and g_{vo}[\cdot] is the wavelet filter. The subscript designation vo is given to identify the coefficients and wavelet filter used for virtual-object extraction.

Experimentally, it has been determined that the biorthogonal Haar wavelet function provides the best motion identification. The biorthogonal Haar wavelet is given by

g_{vo}[t] = \begin{cases} 1, & \text{when } t = 0 \\ -1, & \text{when } t = 1 \\ 0, & \text{else.} \end{cases}   (5.14)

The compact support of the biorthogonal Haar wavelet makes it a natural choice for motion identification. Assuming there is no noise in the image sequence, a simple difference between consecutive frames is the most effective means of motion identification. The compact support of the Haar wavelet is most aptly able to locate the spatial and temporal position of motion in an image sequence.

A 3-dimensional Boolean map separating motion from non-motion is obtained by thresholding the coefficient values, \lambda_{vo}[\cdot]:

I_{vo}[x, y, z] = \begin{cases} 1, & \text{when } |\lambda_{vo}[x, y, z]| > \tau_{vo} \\ 0, & \text{else.} \end{cases}   (5.15)

The Boolean motion map, I_{vo}[\cdot], is refined by the spatial support criteria described in Section 3.3. That is,

J_{vo}[x, y, z] = \begin{cases} 1, & \text{when } S_{vo}[x, y, z] > s_{vo} \\ 0, & \text{else,} \end{cases}   (5.16)

where S_{vo}[x, y, z] is calculated by an algorithm given in Appendix A.

The values of \tau_{vo} and s_{vo} are experimentally determined. We find that \tau_{vo} = 15 and s_{vo} = 2 give the best separation of object and background.

Each frame of the Boolean map is scanned to find the smallest rectangle that contains all the non-zero J_{vo}[\cdot]. This is obtained by

\gamma_x[z] = \max(\vec{K}) \text{ where } k \in \vec{K} \iff \sum_{m=0}^{k-1} \sum_{n=0}^{H_f-1} J_{vo}[m, n, z] = 0
\epsilon_x[z] = \min(\vec{K}) \text{ where } k \in \vec{K} \iff \sum_{m=0}^{k-1} \sum_{n=0}^{H_f-1} J_{vo}[m, n, z] = \sum_{m=0}^{W_f-1} \sum_{n=0}^{H_f-1} J_{vo}[m, n, z]
\gamma_y[z] = \max(\vec{K}) \text{ where } k \in \vec{K} \iff \sum_{n=0}^{k-1} \sum_{m=0}^{W_f-1} J_{vo}[m, n, z] = 0
\epsilon_y[z] = \min(\vec{K}) \text{ where } k \in \vec{K} \iff \sum_{n=0}^{k-1} \sum_{m=0}^{W_f-1} J_{vo}[m, n, z] = \sum_{n=0}^{H_f-1} \sum_{m=0}^{W_f-1} J_{vo}[m, n, z]   (5.17)

The vectors \gamma_x[\cdot] and \epsilon_x[\cdot] are the starting and ending horizontal positions of the virtual-object in each frame of the Boolean map. Similarly, \gamma_y[\cdot] and \epsilon_y[\cdot] are the starting and ending vertical positions of the virtual-object. However, these boundaries for the virtual-object may not be the same size in every frame, i.e., \epsilon_x(b) - \gamma_x(b) \neq \epsilon_x(a) - \gamma_x(a) for a \neq b. Therefore, the width and height of the virtual-object are defined by

W_o = \max(\vec{\epsilon}_x - \vec{\gamma}_x), \quad z_{m,x} = \arg\max(\vec{\epsilon}_x - \vec{\gamma}_x)
H_o = \max(\vec{\epsilon}_y - \vec{\gamma}_y), \quad z_{m,y} = \arg\max(\vec{\epsilon}_y - \vec{\gamma}_y).   (5.18)

z_{m,x} and z_{m,y} are the frames which contain the maximum virtual-object width and maximum virtual-object height, respectively.
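The extraction steps above reduce to a frame difference, a magnitude threshold, and per-frame bounding boxes. The following sketch illustrates Equations 5.13, 5.15, 5.17, and 5.18 under that reading; the spatial-support refinement of Equation 5.16 (Appendix A) is omitted, so it is not the full extraction method.

```python
import numpy as np

def virtual_object_box(f, tau_vo=15):
    """Sketch of virtual-object extraction: frame-difference threshold and bounding boxes.

    f : float array (F, Hf, Wf) -- group of frames
    Returns per-frame horizontal/vertical bounds and the common object size (Wo, Ho).
    """
    lam = np.diff(f, axis=0)                     # biorthogonal Haar along time (Eq. 5.14), up to sign
    J = np.abs(lam) > tau_vo                     # Boolean motion map (Eq. 5.15), refinement omitted
    F1 = J.shape[0]
    gx = np.zeros(F1, int); ex = np.zeros(F1, int)
    gy = np.zeros(F1, int); ey = np.zeros(F1, int)
    for z in range(F1):
        cols = J[z].any(axis=0); rows = J[z].any(axis=1)
        if cols.any():                           # smallest rectangle containing all motion pixels
            gx[z], ex[z] = cols.argmax(), len(cols) - cols[::-1].argmax()
            gy[z], ey[z] = rows.argmax(), len(rows) - rows[::-1].argmax()
    Wo, Ho = (ex - gx).max(), (ey - gy).max()    # Eq. 5.18: common object width and height
    return gx, ex, gy, ey, Wo, Ho
```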

The starting horizontal and vertical positions of the virtual-object, S_x[\cdot] and S_y[\cdot], are needed to completely specify the location of the virtual-object. These positions are established to completely contain the virtual-object in all frames, and to minimize the horizontal and vertical motion of the virtual-object border throughout the image sequence. It has been experimentally determined that minimal spatial movement of the virtual-object between consecutive frames provides the largest compression ratios and best reconstructed quality. Thus the starting horizontal and

vertical positions of the virtual-object are given by

S_x[0] = \begin{cases} \gamma_x[z_{m,x}], & \text{when } \gamma_x[0] < S_x[z_{m,x}] \\ \epsilon_x[0] - W_o, & \text{when } \epsilon_x[0] \ge E_x[z_{m,x}] \\ S_x[z_{m,x}], & \text{else,} \end{cases}   (5.19)

S_x[z] = \begin{cases} \gamma_x[z], & \text{when } \gamma_x[z] < S_x[z-1] \\ \epsilon_x[z] - W_o, & \text{when } \epsilon_x[z] \ge E_x[z-1] \\ S_x[z-1], & \text{else,} \end{cases}   (5.20)

S_y[0] = \begin{cases} \gamma_y[z_{m,y}], & \text{when } \gamma_y[0] < S_y[z_{m,y}] \\ \epsilon_y[0] - H_o, & \text{when } \epsilon_y[0] \ge E_y[z_{m,y}] \\ S_y[z_{m,y}], & \text{else,} \end{cases}   (5.21)

and

S_y[z] = \begin{cases} \gamma_y[z], & \text{when } \gamma_y[z] < S_y[z-1] \\ \epsilon_y[z] - H_o, & \text{when } \epsilon_y[z] \ge E_y[z-1] \\ S_y[z-1], & \text{else.} \end{cases}   (5.22)

The calculation of the starting horizontal and vertical positions, S_x[\cdot] and S_y[\cdot], given in Equations 5.19 through 5.22 guarantees minimal movement of the virtual-object border.

The reconstructed video signal, \hat{f}(\cdot), is given by

\hat{f}(x, y, z) = \begin{cases} \hat{b}(x, y), & \text{when } \alpha[x, y, z] = 1 \\ \hat{o}(x, y, z), & \text{else,} \end{cases}   (5.23)

where \hat{b}(\cdot) and \hat{o}(\cdot) are the reconstructed background frame and virtual-object, respectively.

5.3.3 Virtual-Object Coding

Once the virtual-object and background have been identified and separated, the independent compression of each is straightforward. The background is compressed by 2D wavelet compression, and the virtual-object is compressed by the 3D wavelet compression described in Section 5.2. Figure 5.6 gives the design flow of the virtual-object compression method.

Figure 5.6: Virtual-object compression.

As given in Figure 5.6, the original video signal is separated into the virtual-object and background using the virtual-object extraction method. The virtual-object and background are then compressed separately using the 3D wavelet compression and

2D wavelet compression methods, respectively. Each of the processing blocks given in Figure 5.6 following the virtual-object extraction method is described in Section

5.2.

5.4 Performance Comparison Between 3D Wavelet and Virtual-Object Compression

The virtual-object compression method is compared to the 3D wavelet compression method. The "CLAIRE" image sequence is used for continuity with the comparison of 2D wavelet compression to 3D wavelet compression, given in Figure 5.4.

Figure 5.7 gives results of the 3D wavelet compression and virtual-object compression methods, using the "CLAIRE" image sequence. Note that for the results given in

Figure 5.7 the GoF processing block is F = 64 frames. As shown in Figure 5.7, the

Figure 5.7: Comparison of 3D wavelet compression and virtual-object compression using the CLAIRE image sequence (frame #4 is shown). Left: 3D wavelet compression. s_2 = 29, s_3 = 29, K_M = 8, J_M = 8, file size = 196KB, compression ratio = 258:1, average PSNR = 33.31. Right: Virtual-object compression, s_2 = 25, s_3 = 25, K_M = 8, J_M = 8 for the virtual-object and s_2 = 9, K_M = 8 for the background, file size = 195KB, compression ratio = 259:1, average PSNR = 34.00.

virtual-object compression method achieves an increase in compression ratio from 3D wavelet compression while providing higher PSNR.

Along with the "CLAIRE" image sequence, the virtual-object compression method is tested against 3D wavelet compression as well as 2D wavelet compression using the "SALESMAN" and "MISSA" image sequences. The results of the quality comparison are given in Figure 5.8. Figure 5.8 shows that virtual-object compression consistently outperforms both 2D wavelet compression and 3D wavelet compression in compression ratio and PSNR.

Figure 5.8: Comparison of 2D wavelet compression, 3D wavelet compression, and virtual-object compression (PSNR per frame). Top: SALESMAN image sequence (virtual-object 54,278 bytes; 3D wavelet 56,449 bytes; 2D wavelet 59,367 bytes). Middle: MISSA image sequence (virtual-object 199,554 bytes; 3D wavelet 202,035 bytes; 2D wavelet 206,914 bytes). Bottom: CLAIRE image sequence (virtual-object 200,205 bytes; 3D wavelet 201,140 bytes; 2D wavelet 202,878 bytes).

5.5 Discussion

In this chapter, a new object-based compression method called virtual-object compression has been described. Virtual-object compression differs from typical video compression methods by first extracting moving objects from the stationary background and compressing each separately. The separation of objects and background enables independent coding of both, providing a low bit-rate compressed video signal.

Although virtual-object compression is not a true object-based compression method as set forth by the MPEG-4 standard, it is able to provide compression gain and improved PSNR over the 3D wavelet compression method by relaxing some of the constraints involved with object-based compression methods. Thus, the results of virtual-object compression have shown a performance improvement over the more traditional wavelet-based compression methods of 2D wavelet compression and 3D wavelet compression.

CHAPTER 6

Constant Quality Rate Control for Content-Based 3D Wavelet Video Communication

6.1 Introduction

The vast amounts of data associated with digital images and video streams have provided a growing concern and motivation for efficient image compression methods.

Many such compression algorithms have been developed around a variety of matrix transforms [47, 48, 52]. One such method, the wavelet transform, has shown promising results in large compression ratio and high reconstructed image quality [37, 70, 82].

Recently, the efficient coding of video signals has become a leading topic in compression research [30, 56]. A new compression algorithm, the 3D wavelet transform, has been developed to provide very high compression ratios of digital video while preserving the reconstructed quality [71, 81].

Tightly coupled with compression research is the reliable transmission and reception of compressed video. Real-time video communication applications using compression algorithms demand a constant frame rate for a high quality of service (QoS).

This requirement is challenging, however. Inconsistent compression and decompression computation times, variable compressed video data size, and the unpredictable

available bandwidth of volatile communication channels all hinder the performance of real-time video communication.

Many rate control algorithms have been proposed in recent history, and most have been associated with providing a constant frame rate with a variable quantization parameter [13, 32, 38, 51, 57, 59, 60, 65]. The quantization parameter directly affects both the bit rate and reconstructed video quality. Therefore, for low bit-rate environments, the constant frame rate approach may provide poor quality image frames at the receiver. To combat this effect, other rate control algorithms have controlled both the frame rate and the quantization parameter to provide the best possible QoS

[58, 66, 67]. However, for many applications, individual image frames of reasonable visual quality are vastly more important than high frame rates. Therefore, we employ a fixed quantization step-size to deliver constant quality video frames.

Also, most former rate control algorithms have a minimum bit rate requirement for the communication channel [13, 14, 32, 51, 57, 58, 59, 60, 65, 66]. Unfortunately many communication systems such as the Internet do not provide a minimum bit rate guarantee. Furthermore, the content-based 3D wavelet compression scheme is a special case of image compression and also a relatively new idea [71, 81]. Thus it is desirable for a rate control algorithm specific to 3D wavelet compression to be developed.

The content-based 3D wavelet compression scheme operates on a group of frames

(GoF), and the number of frames varies between groups depending on the video content. Because we group only similar frames together, the number of frames in each group is variable. Thus, the 3D wavelet transform produces a variable delay for the transmission of real-time video. Because of this delay, rate control becomes

an even more difficult issue. To deal with the uncertainty of both the bandwidth of the communication channel and the video content, we propose a new rate control algorithm. It differs from previous algorithms in many ways. First, because there are two uncertainties, there are two frame buffers for the storage of video frames in both the client and server sides. Secondly, the client side buffer is developed to ensure the continuous display of reconstructed image frames. The client side buffer must contain enough reconstructed video content to overcome the acquisition delay of the next GoF as well as the delay of data transfer over the network, and the computation time of the compression and decompression algorithms. The buffer is based on a leaky bucket algorithm with an adjustable window of constant frame rate (AWCF). Thirdly, for the server side we develop a feedback mechanism from the client to control the server’s buffer content and ensure that the frame rates of the server and client sides are equal.

This chapter is arranged into five sections. Following the Introduction, Section

6.2 gives a brief description of content-based 3D wavelet compression and illustrates the functionality and importance of a multi-threaded application for real-time communication. Section 6.3 provides an overview and analysis of the rate control system, including the constraints imposed on the rate control buffers, design parameters of the control buffers on the client and server sides, and a definition of the AWCF.

Section 6.4 gives experimental results of the rate control algorithm, and Section 6.5 summarizes the chapter.

6.2 Multi-Threaded, Content-Based 3D Wavelet Compression

The content-based 3D wavelet video compression/decompression system design

flow is given in Figure 6.1. As shown in Figure 6.1, the frame grabber loads video

frames into the compression system. The dynamic grouping of frames then compares and groups frames of similar content together. The dynamic grouping process sends the group of frames (GoF) to the 3D wavelet compression system. The compression algorithm then compresses the video using wavelet analysis. By grouping frames of similar content, the inter-frame redundancy of the individual pixels is assured, thus providing high compression ratios. The compressed video is then either stored or sent across a communication channel. The 3D wavelet decompression system reconstructs the video, and the video is then displayed to the user. The content-based compression approach develops GoFs of differing size, and because of the disparity in GoF size the computation time required to compress and decompress each GoF varies. Thus, continuous and smooth display of video becomes a challenging issue.

Figure 6.1: Content-based 3D wavelet compression/decompression design flow.

A real-time compression/decompression system must be able to perform many tasks concurrently. For example, the compression algorithm must continuously capture and group frames while compressing video and sending it to the receiver. This

can only be performed when operations are being computed independently. Therefore, four processing threads are created in the communication system: the grouping thread, compression thread, decompression thread, and display thread. Figure 6.2 gives a model of the communication system.

Figure 6.2: 3D wavelet communication system.

The two buffers that have been added to the system, shown in Figure 6.2, are instrumental in achieving independent operation from each of the application threads.

Also, all four threads will be continuously active as long as both buffers are neither empty, nor full. The grouping thread will continue to group frames until the grouping buffer is full. At that point, there is no space left for the next GoF. Conversely, the compression thread will continue to compress until the grouping buffer is empty.

After the grouping buffer is empty there is no longer a GoF to compress. Therefore, continuous activity from both the grouping thread and the compression thread depends

on the fullness of the grouping buffer. Similarly, at the receiving end, continuous activity from the decompression thread and the display thread can only be achieved if the display buffer is neither full nor empty.
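The buffer-mediated independence of the threads can be illustrated with Python's standard queue and threading modules. The sketch below is a toy producer/consumer pair in the spirit of Figure 6.2; the compress() body is a placeholder standing in for 3D wavelet coding, not the system's actual implementation.

```python
import queue
import threading
import time

# Minimal sketch of the grouping/compression thread pair of Figure 6.2 (illustrative only).
grouping_buffer = queue.Queue(maxsize=4)     # GoFs waiting to be compressed

def grouping_thread(gof_source):
    for gof in gof_source:
        grouping_buffer.put(gof)             # blocks while the grouping buffer is full
    grouping_buffer.put(None)                # sentinel: no more groups

def compression_thread(send):
    while True:
        gof = grouping_buffer.get()          # blocks while the grouping buffer is empty
        if gof is None:
            break
        send(compress(gof))

def compress(gof):
    time.sleep(0.01)                         # stand-in for 3D wavelet compression time
    return bytes(len(gof))

# Example wiring: feed ten dummy GoFs through the two threads.
if __name__ == "__main__":
    gofs = [[0] * 16 for _ in range(10)]
    t1 = threading.Thread(target=grouping_thread, args=(gofs,))
    t2 = threading.Thread(target=compression_thread, args=(lambda data: None,))
    t1.start(); t2.start(); t1.join(); t2.join()
```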

6.3 The Rate Control Algorithm

6.3.1 Rate Control Overview

The rate control algorithm of the current system is based on a leaky bucket approach [7, 13, 14, 59]. The leaky bucket idea has been developed earlier for ATM networks and other applications, but has never been considered for 3D wavelet compression. As stated previously, all four computation threads are continuously active if and only if both data buffers given in Figure 6.2 are neither full nor empty. Therefore, the goal of the rate control algorithm is to keep the amount of data in both buffers at a reasonable level while ensuring the frame grabber rate and frame display rate are constant and equal.

Also, the network bandwidth limitation has not yet been considered. With limited bandwidth, all four of the threads cannot be completely active. In most applications, the computational capacity of both platforms greatly exceeds the communication bandwidth available. Therefore, a rate control algorithm must manage each thread's computational activity. Figure 6.3 gives the complete rate control wavelet communication system. The additions to the system given in Figure 6.2 are as follows:

Send Thread and Send Buffer – The most important part of the wavelet communication system is to maximally utilize the available bandwidth given by the communication channel, thus attempting to provide the highest possible frame rate. Therefore,

Figure 6.3: Complete rate control system.

another buffer and processing thread are created to continually send data at the maximum rate possible. The send buffer is inserted into the system to give the send thread data to output through the channel. The compression thread is an algorithm whose output bit rate depends on the content of the input video, so the send buffer is necessary to achieve continuous data throughput. The send thread also partitions the data into smaller packets to enable the continuous flow of data.

Receive Thread and Receive Buffer – The receive thread is used to capture the data packets from the communication channel, and the received data is stored in the receive buffer. The send buffer and receive buffer need not be controlled. Given that they are sufficiently large, the control of the grouping buffer and display buffer will limit the amount of data that the send buffer and receive buffer must hold.

Send Monitor – The send monitor controls the rate at which the frame grabber acquires each frame. Its decision is based on the amount of data in the grouping buffer.

The send monitor attempts to keep the grouping buffer fullness at a reasonable level by adjusting the frame acquisition rate. However, the frame acquisition rate is confined by the feedback provided by the receiver, because real-time communication requires that the frame acquisition rate and display rate be equivalent. The send monitor enforces the grouping buffer constraints, which are given in Subsection 6.3.2.

Receive Monitor – The receive monitor regulates the size of the receive buffer by controlling the display rate at the receiver. The receive monitor attempts to keep the display buffer fullness at a reasonable level by adjusting the display rate and enforcing the display buffer constraints, which are given in Subsection 6.3.2.

Feedback – A virtual path over which the client sends information to the server. The receive monitor uses the feedback path to ensure equivalent acquisition and display rates.

The proposed leaky bucket control model reduces the number of variables in the compression algorithm. Our interest lies only in rate control, not in the specifics of wavelet video compression. Therefore, the compression thread, decompression thread, and network can be modeled as a single delay from transmitter to receiver. Figure 6.4 gives the control model for the rate control system. From the control model given in Figure 6.4, we can develop the constraints of the grouping and display buffers.

6.3.2 Buffer Constraints

Figure 6.4: Rate control model.

As shown in Subsection 6.3.1, the send monitor and receive monitor adjust the flow of data into and out of the grouping buffer and display buffer, respectively, to control buffer fullness. Therefore, it is necessary to analyze the constraints imposed on both the grouping buffer and the display buffer by the send monitor and receive monitor.

The display buffer content is given by

B^d_i = B^d_{i−1} + R_i − D_i,   (6.1)

where i is the unit time, B^d_i is the display buffer fullness, R_i is the video reconstruction rate, and D_i is the display frame rate. Also, since the display buffer has a fixed size, it is governed by

0 ≤ B^d_i ≤ S_d,   (6.2)

where S_d is the size of the display buffer. The receive monitor manages the fullness of the display buffer by regulating D_i. Therefore,

D_i = D_{i−1} − δ_D,  when B^d_{i−1} < ε_d
D_i = D_{i−1},        when ε_d ≤ B^d_{i−1} ≤ φ_d    (6.3)
D_i = D_{i−1} + δ_D,  when φ_d < B^d_{i−1}

where ε_d and φ_d are threshold levels corresponding to an almost empty and an almost full display buffer, respectively. δ_D corresponds to a modest change in the display rate given by

δ_D = α_D D_{i−1},   (6.4)

where α_D is the percent change in display rate. Assuming a small value for α_D, the receive monitor applies a gradual reduction in the display rate when the display buffer falls below ε_d, and a gradual increase in the display rate when the display buffer exceeds φ_d. The gradual increase and decrease of the frame rate is crucial to providing a high QoS for the user.
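A minimal Python rendering of the receive monitor's update rule in Equations 6.3 and 6.4 is given below. The function name and the default value α_D = 0.01 (the value used later in Section 6.4) are illustrative choices, not part of the original implementation.

def update_display_rate(d_prev, display_fullness, eps_d, phi_d, alpha_d=0.01):
    """One step of the receive monitor (Equations 6.3 and 6.4).

    d_prev           : previous display rate D_{i-1} (fps)
    display_fullness : display buffer fullness B^d_{i-1} (frames)
    eps_d, phi_d     : almost-empty / almost-full thresholds
    alpha_d          : fractional change in display rate, alpha_D
    """
    delta_d = alpha_d * d_prev              # Equation 6.4
    if display_fullness < eps_d:            # buffer almost empty: slow the display down
        return d_prev - delta_d
    if display_fullness > phi_d:            # buffer almost full: speed the display up
        return d_prev + delta_d
    return d_prev                           # within the window: hold the rate constant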

The grouping buffer follows similar constraints:

B^g_i = B^g_{i−1} + A_i − E_i,   (6.5)

where B^g_i is the grouping buffer fullness, A_i is the frame acquisition rate, and E_i is the compression rate. Similar to the display buffer, the grouping buffer is also governed by

0 ≤ B^g_i ≤ S_g,   (6.6)

where S_g is the size of the grouping buffer. The grouping buffer fullness is controlled by the send monitor, which regulates the frame acquisition rate A_i:

A_i = D_{i−1} + δ_A,  when B^g_{i−1} < ε_g
A_i = D_{i−1},        when ε_g ≤ B^g_{i−1} ≤ φ_g    (6.7)
A_i = D_{i−1} − δ_A,  when φ_g < B^g_{i−1}

where ε_g and φ_g are grouping buffer threshold levels similar to those of the display buffer given in Equation 6.3. δ_A corresponds to a modest change in the acquisition rate given by

δ_A = α_A D_{i−1},   (6.8)

where α_A is the percent change in the acquisition rate. Note that the grouping buffer in the server is controlled by the display rate of the client; the send monitor is provided D_{i−1} by the receive monitor through the feedback path from client to server. Also,

A_i ≈ D_i,   (α_A, α_D ≪ 1)   (6.9)

which is a requirement for real-time systems.
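The send monitor can be sketched in the same style. The sketch below follows Equations 6.7 and 6.8, with the fed-back client display rate D_{i−1} as its reference; again, the names and the default α_A = 0.1 (the experimental value in Section 6.4) are illustrative assumptions, not the original code.

def update_acquisition_rate(d_feedback, grouping_fullness, eps_g, phi_g, alpha_a=0.1):
    """One step of the send monitor (Equations 6.7 and 6.8).

    d_feedback        : D_{i-1}, the client display rate fed back to the server
    grouping_fullness : grouping buffer fullness B^g_{i-1} (frames)
    eps_g, phi_g      : grouping buffer thresholds
    alpha_a           : fractional change in acquisition rate, alpha_A
    """
    delta_a = alpha_a * d_feedback          # Equation 6.8
    if grouping_fullness < eps_g:           # buffer draining: acquire slightly faster
        return d_feedback + delta_a
    if grouping_fullness > phi_g:           # buffer filling up: acquire slightly slower
        return d_feedback - delta_a
    return d_feedback                       # otherwise match the display rate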

The compression algorithm can only operate on an entire GoF for temporal domain compression. Therefore,

E_i = C_N,  when i = G_N
E_i = 0,    otherwise,   (6.10)

and

C_N ∈ {1, 2, ..., Γ},   (6.11)

where N is the GoF index, and G_N corresponds to the unit time period when the last frame of the Nth group is acquired. C_N denotes the size of the Nth GoF, and Γ is the maximum group size. Note that Γ is an important parameter to select. When Γ is large, more frames are allowed in a single group, thus increasing the compression ratio. On the other hand, a large Γ increases the delay between the acquisition and display of the video. Usually, Γ is selected to maximize the compression ratio while staying within the delay requirement, which is application specific.

Similar to Equation 6.10, the video reconstruction rate is given by

R_i = C_N,  when i = G_N + L_N
R_i = 0,    otherwise,   (6.12)

where L_N is the delay of the Nth GoF from the grouping buffer to the display buffer, as shown in Figure 6.4, caused by the compression and decompression computation times and the network delay.

For the grouping buffer to neither overflow nor empty, it is necessary that

lim_{n→∞} (1/n) Σ_{i=0}^{n} A_i = lim_{n→∞} (1/n) Σ_{i=0}^{n} E_i.   (6.13)

As n increases, the system reaches steady state where the grouping buffer input rate is equal to the grouping buffer output rate. Similarly, the display buffer input and output rates become equal in steady state.

lim_{n→∞} (1/n) Σ_{i=0}^{n} R_i = lim_{n→∞} (1/n) Σ_{i=0}^{n} D_i.   (6.14)

The control of the buffers’ fullness, given by Equations 6.3 and 6.7, is developed to ensure the validity of Equations 6.13 and 6.14. The steady state of the buffers’ fullness is necessary for the success of the rate control algorithm. With steady-state data flow through both buffers, the data flowing from the input of the grouping buffer to the output of the display buffer approaches a constant rate, which is exactly what is desired.
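As a toy numerical check of Equations 6.10 and 6.13 (an illustration only, not part of the dissertation's system), the short Python script below acquires one frame per unit time, emits E_i = C_N only at the instant a randomly sized GoF completes, and confirms that the two time-averaged rates approach the same value.

import random

gamma = 64                                   # maximum group size, Gamma
A, E = [], []
group_size = random.randint(1, gamma)        # C_N for the current group
frames_in_group = 0
for i in range(100_000):
    A.append(1.0)                            # A_i: one frame acquired per unit time
    frames_in_group += 1
    if frames_in_group == group_size:        # i = G_N: last frame of the Nth group acquired
        E.append(float(group_size))          # E_i = C_N (Equation 6.10)
        frames_in_group = 0
        group_size = random.randint(1, gamma)
    else:
        E.append(0.0)                        # E_i = 0 otherwise

# Both time averages approach the same value as n grows (Equation 6.13).
print(sum(A) / len(A), sum(E) / len(E))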

6.3.3 Grouping Buffer Design

The design parameters that need to be assigned for the grouping buffer are:

• The empty buffer threshold, ²g.

• The full buffer threshold, φg.

• The grouping buffer size, Sg.

The empty buffer threshold

The basic idea of the grouping buffer is to continue to push more data through the network until the maximum bandwidth available is utilized, or the computational

activity of one of the platforms is maximized. As seen from Equation 6.7, the grouping thread continues to acquire frames at a slightly greater rate than the display thread in an effort to continually push more data through the network. Also, from Equation 6.10 we see that the grouping buffer empties when the last frame of a GoF is acquired. So, in an effort to keep the acquisition rate constant and continually push the available network bandwidth,

ε_g = Γ.   (6.15)

With this threshold in place, the grouping thread continually acquires frames at a slightly greater rate than the display thread's frame rate, thus continually pushing the bandwidth of the communication system.

The full buffer threshold and the grouping buffer size

With limited bandwidth it is possible for the compression thread and sending thread to both be limited in the rate at which each can output data. Therefore, to combat the possible overflow of both the send buffer and grouping buffer, the value of φg is determined.

If we look at the worst-case scenario of total network congestion, the grouping thread may acquire up to φg frames before the send monitor starts to slow the frame acquisition rate. Therefore, the value of φg is determined to be

φg = 2Γ. (6.16)

With this threshold in place, the grouping thread may acquire up to two GoFs of the maximum size before being penalized with a slowed acquisition rate. The size of the grouping buffer is also determined:

Sg = φg + Γ = 3Γ. (6.17)

The size of the grouping buffer allows up to three GoFs of the maximum size to be acquired under total network congestion. Therefore, the value of Sg gives enough space for buffer overflow to be avoided.

The grouping buffer design is simple, with fixed values for ε_g and φ_g, and is mostly governed by the frame rate of the display thread, as seen in Equation 6.7. Therefore, the display buffer design is the primary vehicle for rate control, which is discussed in detail in the following subsection.
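A minimal helper, assuming only the relations just derived, collects the three grouping buffer parameters as a function of Γ; with the Γ = 64 used in the experiments of Section 6.4 it returns (64, 128, 192). The function name is illustrative.

def grouping_buffer_design(gamma):
    """Grouping buffer parameters from the maximum GoF size (Equations 6.15-6.17)."""
    eps_g = gamma          # empty threshold: one maximum-size GoF
    phi_g = 2 * gamma      # full threshold: two maximum-size GoFs
    s_g = phi_g + gamma    # total size: three maximum-size GoFs
    return eps_g, phi_g, s_g

# Example: grouping_buffer_design(64) -> (64, 128, 192)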

6.3.4 Display Buffer Design

There are several design parameters that need to be assigned for the display buffer:

• The initial buffering level, I.

• The empty buffer threshold, ²d.

• The full buffer threshold, φd.

• The display buffer size, Sd.

The initial buffering level

Because the video frames are grouped by content, the groups are of different sizes with a maximum threshold Γ, as given in Equation 6.11. Therefore, group sizes range from 1 to Γ frames. As an example, assume the beginning of a video sequence contains two groups: the first group consists of 1 frame, and the second group consists of Γ frames. If the first group is sent to the receiver, and the receiver immediately displays that frame after image reconstruction, the receiver will inevitably wait for the second group to be sent with no frames in the display buffer, and a constant frame rate will

not be achieved. Therefore, an initial buffering level large enough to ensure constant video display must exist.

From the previous example, it is obvious that the initial buffering level, I, must be at least as large as Γ:

I ≥ Γ. (6.18)

However, the initial buffer level must also be larger than the empty buffer threshold,

ε_d. This is necessary to keep the display buffer level greater than ε_d to ensure that the frame rate remains constant, as given in Equation 6.3. Therefore,

I ≥ Γ + ε_d.   (6.19)

However, I directly corresponds to the initial waiting time for the receiver. If I is chosen too large, the receiver will have an overly long initial buffering time, decreasing the QoS. Therefore, I must be kept at a minimum, and we choose

I = Γ + ε_d.   (6.20)

The empty buffer threshold

The variable delay L_N, given in Equation 6.12, is used to calculate the minimum value of ε_d needed to ensure that the display buffer never empties. From Equations

6.3 and 6.4 we can determine the average display rate during the critical empty buffer warning level, i.e., B^d_{i−1} ≤ ε_d. First, we can determine the amount of time the buffer has before it empties, without control. That is,

τ_c = ε_d / D_i.   (6.21)

τ_c represents the critical time period before the display buffer is empty. We can now assume control of the display buffer and determine the estimated average display rate, D_avg|B^d_{i−1}≤ε_d:

D_avg|B^d_{i−1}≤ε_d ≥ (D_i + (D_i − δ_D τ_c)) / 2 = D_i − ε_d α_D / 2.   (6.22)

Note that Equation 6.22 is merely an estimate of the average display rate while the display buffer drains; the exact average is a polynomial of degree ε_d − 1. In Equation 6.22, we assume δ_D to be constant, when in reality the value of δ_D changes with each change in the display rate, as seen in Equation 6.4. The choice to use this estimate is based on computational simplicity and algorithmic intuitiveness.

Moreover, we know that enough frames must exist in the display buffer to keep displaying throughout the delay of the next GoF, LN+1. Therefore,

ε_d / D_avg|B^d_{i−1}≤ε_d ≥ L_{N+1}.   (6.23)

Solving for ε_d and substituting in Equation 6.22, we obtain

ε_d ≥ 2 L_{N+1} D_i / (2 + α_D L_{N+1}).   (6.24)

In practice, however, the variable delay L_{N+1} depends greatly on the size of the next GoF, which is unknown. Therefore, for a worst-case scenario, we compute the average delay per frame, L_f, and multiply by Γ to estimate the delay of a GoF consisting of Γ frames. Therefore,

ε_d ≥ 2 L_f Γ D_i / (2 + α_D L_f Γ).   (6.25)

The average delay per frame can then be obtained by

L_f = L_N / C_N.   (6.26)

The value of L_N is determined by calculating the round-trip time (RTT) of the compressed GoF from client to server, dividing by two, and adding the computation times of the compression and decompression algorithms.

Again, to ensure the minimum delay possible for I, and substituting in for L_f, we obtain

ε_d = 2 L_N Γ D_i / (2 C_N + α_D L_N Γ).   (6.27)

Substituting into Equation 6.20, we have

I = Γ (1 + 2 L_N D_i / (2 C_N + α_D L_N Γ)).   (6.28)

The full buffer threshold and the display buffer size

The full buffer threshold, φ_d, is set one maximum group size, Γ, greater than I in order to produce an AWCF that is 2Γ in size. Therefore,

φ_d = Γ (2 + 2 L_N D_i / (2 C_N + α_D L_N Γ)).   (6.29)

The display frame rate is constant whenever the buffer fullness is within this window.

Also, the display buffer size can be arbitrarily set greater than φd. We find that

Sd = 4Γ (6.30) gives enough space for the AWCF to move.
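To tie the display buffer design together, the following Python sketch evaluates Equations 6.20 and 6.27 through 6.30 from measured quantities, with a small helper for the RTT-based estimate of L_N described above. The function and parameter names are illustrative assumptions, not the original implementation.

def gof_delay(rtt, compression_time, decompression_time):
    """Estimate L_N as half the measured round-trip time plus the codec computation times."""
    return rtt / 2.0 + compression_time + decompression_time

def display_buffer_design(gamma, d_i, alpha_d, l_n, c_n):
    """Display buffer parameters from Equations 6.20 and 6.27-6.30.

    gamma   : maximum GoF size, Gamma (frames)
    d_i     : current display rate, D_i (frames per unit time)
    alpha_d : fractional change in display rate, alpha_D
    l_n     : delay of the Nth GoF, L_N (unit times)
    c_n     : size of the Nth GoF, C_N (frames)
    """
    eps_d = (2.0 * l_n * gamma * d_i) / (2.0 * c_n + alpha_d * l_n * gamma)  # Eq. 6.27
    initial_level = gamma + eps_d                                            # Eq. 6.20 / 6.28
    phi_d = 2 * gamma + eps_d                                                # Eq. 6.29
    s_d = 4 * gamma                                                          # Eq. 6.30
    return eps_d, initial_level, phi_d, s_d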

6.4 Experimental Results

The communication system is developed, and a test is run with a maximum group size Γ of 64, an αA of 0.1, and an αD of 0.01. These parameters are found to produce quality results, but their values are determined empirically, and without analysis beyond the requirements given by Equation 6.9. The video is run for approximately

20 minutes. Also, for evaluation purposes, the initial display rate of the receiver is deliberately set to a frame rate too high for the communication system to handle.

The video sample has a 320x240 color frame size, and the initial frame rate, D0, is set at 12 fps. The display frame rate, as well as the display buffer size, is given in Figure 6.5.


Figure 6.5: Display frame rate and display buffer size, D0=12 fps.

As seen in Figure 6.5, the rate control algorithm does reduce the frame rate until steady state is found. Also, the frame rate stays constant unless the buffer fullness reaches beyond the threshold levels of the AWCF, as given in Equation 6.3. Therefore,

the control algorithm produces a smooth and continuous frame rate for real-time video communication.

The results of the acquisition frame rate and grouping buffer fullness are given in Figure 6.6.


Figure 6.6: Frame acquisition rate and grouping buffer size, D0=12 fps.

As seen in Figure 6.6, the frame acquisition rate follows the frame display rate as given in Equation 6.9. However, the acquisition rate is slightly higher than the

display rate. This is due to the grouping buffer fullness, which remains below the empty buffer threshold, as shown in Equation 6.7.

The same video is run again, but the initial frame rate is set to 2 fps, intentionally slower than the maximum frame rate the network can handle. The display frame rate, as well as the display buffer fullness is given in Figure 6.7.


Figure 6.7: Display frame rate and display buffer size, D0=2 fps.

As seen in Figure 6.7, the frame rate slowly reaches a steady-state frame rate of approximately 8 fps, the same steady-state frame rate as given in Figure 6.5.

Therefore, the rate control algorithm does converge to a frame rate that maximally utilizes the capacity of the platforms and network. Figure 6.8 gives the acquisition frame rate and grouping buffer fullness.


Figure 6.8: Frame acquisition rate and grouping buffer size, D0=2 fps.

Figure 6.8 indeed shows that the frame acquisition rate and display rate are close to being equal, as given in Equation 6.9. Thus, the rate control algorithm continually monitors the capacity of the network and adjusts the frame rate accordingly.

6.5 Discussion

We have developed a rate control algorithm designed for a content-based 3D wavelet video compression scheme used for real-time video transfer. With the GoF requirement of 3D wavelet compression, an inherent delay is introduced in the transmission of real-time video. Also, because the compression scheme is content-based, the compression and decompression times vary with each group, and the compressed file size varies between differing GoFs of the same size. A rate control algorithm is designed to supply a smooth and continuous frame rate from server to client in an environment with a variable and unknown network delay, such as the Internet, and a compression scheme which allows for variable GoF sizes.

A buffering mechanism is developed on both the client and server sides to ensure the continuous display of reconstructed image frames. On the server side, a grouping buffer is designed based on the maximum GoF size. On the client side, a display buffer is designed based on the maximum GoF size as well as the variable delay of the network. As shown in the experimental results, the AWCF is able to provide continuous video to the client given the inherent characteristics associated with content-based 3D wavelet compression and real-time video transfer. In addition, a feedback mechanism is used from the client to control the server's buffer content and ensure that the acquisition rate of the server and the display rate of the client are equal. Experimental results show that the rate control algorithm is effective for the content-based 3D wavelet video compression scheme.

CHAPTER 7

Conclusions and Future Work

This dissertation presents several methods to improve the state-of-the-art in video compression and communication technology. This concluding chapter summarizes the research presented and specifies contributions made. Also, various topics are identified for future research.

7.1 Contributions

Noise removal in natural digital imagery is an important part of many different imaging systems. Denoising methods based on the non-decimated wavelet transform have been shown to achieve a large PSNR increase. However, the computational burden of previous wavelet-based noise removal algorithms is too large for real-time imaging systems. Thus, the two-threshold criterion for coefficient selection in image denoising has been developed to ease the computational burden associated with the coefficient selection process. The two thresholds are defined by using a training sample approach.

The training sample images are artificially corrupted with AWGN and denoised with several threshold levels. The threshold levels which produce the minimum error from that of the optimal denoising method are used in the general case. The resulting image denoising algorithm is not only 10x less complex computationally, but it also

shows an improvement in PSNR when compared to other wavelet-based denoising algorithms given in the literature.

The removal of noise from video signals is important in the development of high-quality video systems. Therefore, a video denoising technique is described in this dissertation. The denoising technique first uses the image denoising technique described in this work for spatial domain denoising, and then uses a selective wavelet shrinkage algorithm for temporal domain denoising. The temporal domain denoising technique uses an estimate of the noise level as well as an estimate of the motion in the image sequence to determine the amount of filtering that can improve the quality of the video signal. This video denoising technique is more effective in noise removal and achieves better average PSNR than the limited number of methods presented in the literature.

Also, a virtual-object compression method is developed to provide the compression gain that object-based compression methods promise, without the many difficulties that object-based compression methods pose. With virtual-object compression, stationary background is separated from moving objects, and each is compressed independently. The independent compression of objects and background gives virtual-object compression an improvement in PSNR over 3D wavelet compression.

Real-time delivery of compressed video is a challenging problem because of the many uncertain factors involved, such as the computational capacity of both client and server platforms, the bandwidth and amount of congestion of the network, and the inherent acquisition delay of each GoF. We have provided a real-time video communication solution which combats the many problems associated with real-time video delivery over lossy channels by developing a rate control algorithm based on a

leaky bucket approach. Both sender and receiver include an independent monitoring thread which adjusts the acquisition and display rates, respectively, to ensure proper management of the video stream. The result is real-time video delivery over a lossy channel.

Together, these contributions result in a high-quality real-time video compression and transmission system.

7.2 Future Work

Although this work provides some promising techniques to boost the overall performance of 3D wavelet compression, there are still many issues that need to be addressed for video compression using wavelets to become a method suitable for industry standards. In this section we outline a few areas of related study:

• Currently, wavelet-based image and video compression systems use one particular wavelet in transformation of the original signal, and that wavelet is chosen experimentally. However, given different input signals, different wavelet functions may provide better results. Thus, it would be beneficial to analyze the statistics of the input signal prior to compression in order to select the wavelet which will most compactly represent that signal. Also, in multiresolution analysis, the same wavelet need not be used in each level of decomposition. Such signal analysis and wavelet selection could provide a compression system that is optimal for all types of imaging and video signals (e.g., long-wave infra-red (LWIR), short-wave infra-red (SWIR), synthetic aperture radar (SAR), etc.).

• Also, the image denoising and video denoising algorithms are currently not computationally efficient enough for real-time imaging and video systems. The image denoising algorithm developed in this work can denoise a 320x240 grayscale image in approximately 1 second, which is 30 times slower than needed for real-time calculation. In addition, the video denoising algorithm has an added computational load with the addition of temporal domain processing; it can denoise a 320x240x64 grayscale GoF in approximately 1.5 minutes. A computational speedup of greater than 30 is most likely unattainable with computational optimization of the algorithms alone. Thus, a hardware implementation is necessary for real-time applications.

• This dissertation in part defines an image and a video denoising algorithm. These algorithms are designed to remove AWGN from images and video signals and have been shown to give higher PSNR than other methods given in the literature. However, AWGN is only one of many types of noise found in the image and video capture process. Fixed pattern noise, shot noise, thermal noise, correlated noise, and speckle, as well as AWGN, are different types of noise that corrupt many different image and video capture processes. Thus, for an image/video denoising algorithm to be most useful in industry, the image/video capture process must be studied and the types of noise corruption involved in that process must be discovered. Then, an image/video denoising process may be developed that is tailored to removing the type of noise that is produced by the capture process.

• Much of the work involved in this dissertation is in the removal of noise in signals prior to compression. The removal of noise facilitates compression by reducing the entropy of the signal while improving the signal quality. However, the removal of noisy artifacts generated by the compression algorithm after reconstruction is also an important processing step. Post-processing is used in most modern-day compression systems. In both the JPEG and MPEG standards, there exist filtering algorithms to remove the blocking artifacts associated with the block-based DCT transform used in the compression engine. Thus, it would be fruitful to obtain a post-processing method to remove the artifacts generated by wavelet-based compression methods.

• This dissertation uses PSNR as the metric for quality. The reasoning behind using this metric is one of legacy and consistency. Most of the image and video processing community continues to publish results using PSNR as the quality metric, so to compare results with other methods we use PSNR as well. However, in Chapter 3 we briefly mention some metrics that may be closer to the human perception of quality. Thus, new denoising and compression methods can and should be developed which publish results with not one but several quality metrics. In this way, researchers can be more confident about the performance of such algorithms.

APPENDIX A

Computation of S·,k[x, y]

The computation of S·,k[x, y] is given by the following algorithm:

N(m), m = 0, ..., 7:  {[−1, −1], [−1, 0], [−1, 1], [0, −1], [0, 1], [1, −1], [1, 0], [1, 1]}
O[·] = 0, t = 0, p = 0, D·,k(0) = (x, y)
if I·,k[x, y] == 1,
    while D·,k(t) ≠ NULL,
        (i, j) = D·,k(t)
        t = t + 1
        for m = 0 to 7,
            if (I·,k[(i, j) + N(m)] == 1) and (O[(i, j) + N(m)] == 0),
                p = p + 1
                D·,k(p) = (i, j) + N(m)
                O[(i, j) + N(m)] = 1
            end if
        end for
    end while
end if
S·,k[x, y] = t                                                              (A.1)

O[x, y] is a Boolean value used to determine whether a particular I·,k[x, y] value has been counted previously. D·,k is an array of the spatial coordinates of valid coefficients that support the current coefficient value I·,k[x, y]. N is the set of vectors corresponding to the neighboring coefficient positions.
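For reference, a Python rendering of the same breadth-first count is sketched below. It is an illustration rather than the original implementation; in particular, it marks the seed coefficient as visited up front, a detail the listing above leaves implicit, and the function name is an assumption.

from collections import deque

# 8-connected neighbourhood N
NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def support_size(indicator, x, y):
    """Count the coefficients 8-connected to (x, y) in the binary map `indicator`.

    `indicator` is a 2D list (or array) of 0/1 values playing the role of I[x, y];
    the return value corresponds to S[x, y] in Appendix A.
    """
    rows, cols = len(indicator), len(indicator[0])
    if indicator[x][y] != 1:
        return 0

    visited = [[False] * cols for _ in range(rows)]   # plays the role of O[x, y]
    visited[x][y] = True                              # mark the seed coefficient
    frontier = deque([(x, y)])                        # plays the role of the array D
    count = 0                                         # plays the role of t

    while frontier:
        i, j = frontier.popleft()
        count += 1
        for di, dj in NEIGHBORS:
            ni, nj = i + di, j + dj
            if 0 <= ni < rows and 0 <= nj < cols \
                    and indicator[ni][nj] == 1 and not visited[ni][nj]:
                visited[ni][nj] = True
                frontier.append((ni, nj))
    return count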

BIBLIOGRAPHY

[1] ISO/IEC 11172-2. Information technology – Coding of moving pictures and asso- ciated audio for digital storage media at up to about 1,5 Mbit/s – Part 2: Video, Mar. 1993.

[2] ISO/IEC 13818-2. Information technology – Generic coding of moving pictures and associated audio information: Video, Mar. 1995.

[3] O. Avaro, A. Eleftheriadis, C. Herpel, G. Rajan, and L. Ward. MPEG-4 Systems: Overview, June 2000.

[4] J. B. Bednar and T. L. Wat. ”Alpha-Trimmed Means and Their Relationship to Median Filters”. IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-32:pages 145–153, Feb. 1984.

[5] T. J. Burns, S. K. Roghers, M. E. Oxley, and D. W. Ruck. ”A Wavelet Multires- olution Analysis for Spatio-Temporal Signals”. IEEE Transactions on Aerospace and Electronic Systems, 32(2):628–649, Apr. 1996.

[6] C. S. Burrus, R. A. Gopinath, and H. Guo. Introduction to Wavelets and Wavelet Transforms, A Primer. Prentice Hall, 1998.

[7] M. Butto, E. Cavallero, and A. Tonietti. ”Effectiveness of the Leaky Bucket Policy Mechanism in ATM Networks”. IEEE Journal of Selected Areas in Com- munications, 9:335–342, April 1991.

[8] Berkeley Multimedia Research Center. MPEG-1 faq, Aug. 2001.

[9] Berkeley Multimedia Research Center. MPEG-2 faq, Aug. 2001.

[10] F. Cocchia, S. Carrato, and G. Ramponi. ”Design and Real-Time Implementa- tion of a 3-D Rational Filter for Edge Preserving Smoothing”. IEEE Transactions on Consumer Electronics, vol. 43:pages 1291–1300, Nov. 1997.

[11] C. D. Creusere and G. Dahman. ”Object Detection and Localization in Com- pressed Video”. In Proc. IEEE International Asilomar Conference on Signals, Systems, and Computers, volume 1, pages 93–97, 2001.

[12] I. Daubechies. Ten Lectures on Wavelets. Society for Industrial and Applied Mathematics, 1992.

[13] W. Ding. ”Joint Encoder and Channel Rate Control of VBR Video over ATM Networks”. IEEE Transactions on Circuits and Systems for Video Technology, 7(2):266–278, 1997.

[14] W. Ding and B. Liu. ”Rate Control of MPEG Video Coding and Recording by Rate-Quantization Modeling”. IEEE Transactions on Circuits and Systems for Video Technology, 6(1):12–20, 1996.

[15] D. L. Donoho and I. M. Johnstone. ”Ideal Spatial Adaptation by Wavelet Shrink- age”. Biometrika, vol. 81:pages 425–455, Apr. 1994.

[16] D. L. Donoho and I. M. Johnstone. ”Adapting to Unknown Smoothness via Wavelet Shrinkage”. Journal of American Statistical Association, vol. 90:pages 1200–1224, 1995.

[17] R. Dugad and N. Ahuja. ”Video Denoising by Combining Kalman and Wiener Estimates”. In Proc. IEEE International Conference on Image Processing, vol- ume 4, pages 152–156, 1999.

[18] F. Faghih and M. Smith. ”Combining Spatial and Scale-Space Techniques for Edge Detection to Provide a Spatially Adaptive Wavelet-Based Noise Filtering Algorithm”. IEEE Transactions on Image Processing, vol. 11:pages 1062–1071, Sept. 2002.

[19] Z. Gao and Y. F. Zheng. ”Variable Quantization in Subbands for Optimal Com- pression Using Wavelet Transform”. In Proc. World Conference on Systemics, Cybernetics, and Informatics, July 2003.

[20] M. Ghazel, G. H. Freeman, and E.R. Vrscay. ”Fractal-Wavelet Image Denoising”. In Proc. IEEE International Conference on Image Processing, volume 1, pages I836–I839, 2002.

[21] K. H. Goh, J. J. Soraghan, and T. S. Durrani. ”New 3-D wavelet Transform Coding Algorithm for Image Sequences”. Electron. Letters, 29(4):401–402, Feb. 1993.

[22] R. C. Gonzalez and R. E. Woods. Digital Image Processing. Addison-Wesley Publishing, 1992.

[23] C. He, J. Dong, Y. F. Zheng, and S. C. Ahalt. ”Object Tracking Using the Gabor Wavelet Transform and the Golden Section Algorithm”. IEEE Transactions on Multimedia, 4(4):528–538, Dec. 2002.

[24] C. He, J. Dong, Y. F. Zheng, and Z. Gao. ”Optimal 3-D Coefficient Tree Structure for 3-D Wavelet Video Coding”. IEEE Transactions on Circuits and Systems for Video Technology, 13(10):961–972, Oct. 2003.

[25] G. Healey and R. Kondepudy. ”CCD Camera Calibration and Noise Estima- tion”. In Proc. IEEE International Conference on Computer Vision and Pattern Recognition, volume 1, page 90, June 1992.

[26] T. C. Hsung, D Pak-Kong Lun, and W. C. Siu. ”Denoising by Singularity Detection”. IEEE Transactions on Signal Processing, vol. 47:pages 3139–3144, Nov. 1999.

[27] S. J. Huang. ”Adaptive Noise Reduction and Image Sharpening for Digital Video Compression”. In Proc. IEEE International Conference on Computational Cybernetics and Simulation, volume 4, pages 3142–3147, 1997.

[28] Y.-T. Hwang, Y.-C. Wang, and S.-S. Wang. ”An Efficient Shape Coding Scheme and its Codec Design”. In Proc. IEEE Workshop on Signal Processing Systems, volume 2, pages 225–232, 2001.

[29] C. R. Jung and J. Scharcanski. ”Adaptive Image Denoising in Scale-Space Us- ing the Wavelet Transform”. In Proc. XIV Brazilian Symposium on Computer Graphics and Image Processing, pages 172–178, 2001.

[30] C. M. Kim, B. U. Lee, and R. H. Park. ”Design of MPEG-2 Video Test Bit- streams”. IEEE Transactions on Consumer Electronics, 45(4):1213–1220, 1999.

[31] S. D. Kim, S. K. Jang, M. J. Kim, and J. B. Ra. ”Efficient Block-Based Coding of Noise Images by Combining Pre-Filtering and DCT”. In Proc. IEEE Interna- tional Symposium on Circuits and Systems, volume 4, pages 37–40, 1999.

[32] Y.-R. Kim, Y. K. Kim, Y.-K. Ko, and S.-J. Ko. ”Video Rate Control Using Activity Based Rate Prediction”. In Proc. IEEE International Conference on Consumer Electronics, volume 99, pages 236–237, June 1999.

[33] R. P. Kleinhorst, R. L. Lagendijk, and J. Biemond. ”An Efficient Spatio- Temporal OS-Filter for Gamma-Corrected Video Signals”. In Proc. IEEE Inter- national Conference on Image Processing, 1:348–352, Nov. 1994.

[34] Tom Lane. Image Compression FAQ, part 1/2, Mar. 1999.

[35] A. S. Lewis and G. Knowles. ”Video Compression Using 3D Wavelet Trans- forms”. Electron. Letters, 26(6):396–398, Mar. 1990.

[36] S. Li and W. Li. ”Shape-Adaptive Discrete Wavelet Transforms for Arbitrarily Shaped Visual Object Coding”. IEEE Transactions on Circuits and Systems for Video Technology, 10(5):725–743, Aug. 2000.

[37] C. Lin, B. Zhang, and Y. F. Zheng. ”Packed Integer Wavelet Transform Con- structed by Lifting Scheme”. IEEE Transactions on Circuits and Systems for Video Technology, 10(8):1496–1501, Dec. 2000.

[38] G. Lin and L. Zemin. ”3D Wavelet Video Codec and its Rate Control in ATM Network”. In Proc. IEEE International Symposium on Circuits and Systems, volume 4, pages 447–450, 1999.

[39] W. Ling and P. K. S. Tam. ”Video Denoising Using Fuzzy-connectedness Princi- ples”. In Proc. IEEE International Symposium on Intelligent Multimedia, Video, and Speech Processing, pages 531–534, 2001.

[40] T.-M. Liu, B.-J. Shieh, and C.-Y. Lee. ”An Efficient Modeling Codec Archi- tecture for Binary Shape Coding”. In Proc. IEEE International Symposium on Circuits and Systems, volume 2, pages II–316–II–319, 2002.

[41] W. S. Lu. ”Wavelet Approaches to Still Image Denoising”. In Proc. IEEE Inter- national Asilomar Conference on Signals, Systems, and Computers, volume 2, pages 1705–1709, 1998.

[42] M. Malfait and D. Roose. ”Wavelet-Based Image Denoising Using a Markov Random Field A Priori Model”. IEEE Transactions on Image Processing, vol. 6:pages 549–565, Apr. 1997.

[43] S. Mallat and W. L. Hwang. ”Singularity Detection and Processing with Wavelets”. IEEE Transactions on Information Theory, vol. 38:pages 617–623, March 1992.

[44] F. McMahon. ”JPEG2000”. Digital Output, June. 2002.

[45] M. Meguro, A. Taguchi, and N. Hamada. ”Data-dependent Weighted Median Filtering with Robust Motion Information for Image Sequence Restoration”. In Proc. IEEE International Conference on Image Processing, 2:424–428, 1999.

[46] M. Meguro, A. Taguchi, and N. Hamada. ”Data-dependent Weighted Median Filtering with Robust Motion Information for Image Sequence Restoration”. IE- ICE Transactions on Fundamentals, vol. 2:pages 424–428, 2001.

[47] J. Miano. Compressed Image File Formats: JPEG, PNG, GIF, XBM, BMP. ACM Publishing, 1999.

[48] N. Moayeri. ”A Low-Complexity, Fixed-Rate Compression Scheme for Color Images and Documents”. The Hewlett-Packard Journal, 50(1), Nov. 1998.

[49] O. Ojo and T. Kwaaitaal-Spassova. ”An Algorithm for Integrated Noise Reduc- tion and Sharpness Enhancement”. IEEE Transactions on Consumer Electron- ics, vol. 46:pages 474–480, May 2000.

[50] S. J. Orfanidis. Introduction to Signal Processing. Prentice Hall, 1996.

[51] I.-M. Pai and M.-T. Sun. ”Encoding Stored Video for Streaming Applications”. IEEE Transactions on Circuits and Systems for Video Technology, 11(2):199– 209, Feb. 2001.

[52] K. R. Persons, P. M. Pallison, A. Manduca, W. J. Charboneau, E. M. James, M. T. Charboneau, N. J. Hangiandreou, and B. J. Erickson. ”Ultrasound grayscale image compression with JPEG and wavelet techniques”. Journal of Digital Imaging, 13(1):25–32, 2000.

[53] R. A. Peters. ”A New Algorithm for Image Noise Reduction Using Mathematical Morphology”. IEEE Transactions on Image Processing, vol. 4:pages 554–568, May 1995.

[54] A. Pizurica, W. Philips, I. Lemahieu, and M. Acheroy. ”A Joint Inter- and Intrascale Statistical Model for Bayesian Wavelet Based Image Denoising”. IEEE Transactions on Image Processing, vol. 11:pages 545–557, May 2002.

[55] A. Pizurica, V. Zlokolica, and W. Philips. ”Combined Wavelet Domain and Tem- poral Video Denoising”. In Proc. IEEE International Conference on Advanced Video and Signal Based Surveillance, volume 1, pages 334–341, July 2003.

[56] S. M. Poon, B. S. Lee, and C. K. Yeo. ”Davic-based Video-on-Demand System over IP Networks”. IEEE Transactions on Consumer Electronics, 46(1):6–15, 2000.

[57] D. Qiao and Y. F. Zheng. ”Dynamic Bit-Rate Estimation and Control for Constant-Quality Communication of Video”. In Proc. Third World Congress on Intelligent Control and Automation, pages 2506–2511, June 2000.

[58] A. C. Reed and F. Dufaux. ”Constrained Bit-Rate Control for Very Low Bit-Rate Streaming-Video Applications”. IEEE Transactions on Circuits and Systems for Video Technology, 11(7):882–889, July 2001.

[59] A. R. Reibman and B. G. Haskell. ”Constraints on Variable Bit-Rate Video for ATM Networks”. IEEE Transactions on Circuits and Systems for Video Technology, 2(4):361–372, 1992.

[60] J. Ribas-Corbera and S. Lei. ”Rate Control in DCT Video Coding for Low-Delay Communications”. IEEE Transactions on Circuits and Systems for Video Technology, 11(2):172–185, Feb. 2001.

[61] P. Rieder and G. Scheffler. ”New Concepts on Denoising and Sharpening of Video Signals”. IEEE Transactions on Consumer Electronics, vol. 47:pages 666–671, Aug. 2001.

[62] A. Said and W. A. Pearlman. ”A New, Fast, and Efficient Image Codec Based on Set Partitioning in Hierarchical Trees”. IEEE Transactions on Circuits and Systems for Video Technology, vol. 6:pages 243–250, June 1996.

[63] D. Santa-Cruz, T. Ebrahimi, J. Askelof, M. Larsson, and C.A. Christopoulos. ”JPEG2000 still image coding versus other standards”. In Proc. SPIE’s 45th annual meeting, Applications of Digital Image Processing XXIII, volume 4115, pages 446–454, 2000.

[64] L. Shutao, W. Yaonan, Z. Changfan, and M. Jianxu. ”Fuzzy Filter Based on Neural Network and Its Applications to Image Restoration”. In Proc. IEEE International Conference on Signal Processing, volume 2, pages 1133–1138, 2000.

[65] K.-D. Soe, S.-H. Lee, J.-K. Kim, and J.-S. Kow. ”Rate Control Algorithm for Fast Bit-Rate Conversion Transcoding”. IEEE Transactions on Consumer Elec- tronics, 46(4):1128–1136, Nov. 2000.

[66] H. Song, J. Kim, and J. Kuo. ”Real-Time H.263+ Frame Rate Control for Low Bit-Rate VBR Video”. In Proc. IEEE International Symposium on Circuits and Systems, volume 4, pages 307–310, May 1999.

[67] H. Song and C.-C. J. Kuo. ”Rate Control for Low-Bit Rate Video via Variable- Encoding Frame Rates”. IEEE Transactions on Circuits and Systems for Video Technology, 11(4):512–521, April 2001.

[68] H. Stark and J. Woods. Probability, Random Processes, and Estimation Theory for Engineers. Prentice Hall, 1994.

[69] A. De Stefano, P. R. White, and W. B. Collis. ”An Innovative Approach for Spa- tial Video Noise Reduction Using a Wavelet Based Frequency Decomposition”. In Proc. IEEE International Conference on Image Processing, volume 3, pages 281–284, 2000.

[70] W. Sweldens. ”The lifting scheme: A custom-design construction of biorthogonal wavelets”. Appl. Comput. Harmon. Anal., 3(2):186–200, 1996.

[71] J. Y. Tham, S. Ranganath, and A. A. Kassim. ”Highly Scalable Wavelet-Based Video Codec for Very Low Bit-Rate Environment”. IEEE Journal on Selected Areas in Communications, 16(1):12–27, 1998.

[72] M. J. Tsai, J. D. Villasenor, and F. Chen. ”Stack-Run Image Coding”. IEEE Transactions on Circuits and Systems for Video Technology, 6:519–521, Oct. 1996.

[73] C. Vertan, C. I. Vertan, and V. Buzuloiu. ”Reduced Computation Genetic Al- gorithm for Noise Removal”. In Proc. IEEE International Conference on Image Processing and Its Applications, volume 1, pages 313–316, July 1997.

[74] J. D. Villasenor, B. Belzer, and J. Liao. ”Wavelet Filter Evaluation for Image Compression”. IEEE Transactions on Image Processing, 4(7):1053–1060, Aug. 1995.

[75] Z. Wang and A. Bovik. ”A Universal Image Quality Index”. IEEE Signal Pro- cessing Letters, 9(3):81–84, March 2002.

[76] Y. F. Wong, E. Viscito, and E. Linzer. ”PreProcessing of Video Signals for MPEG Coding by Clustering Filter”. In Proc. IEEE International Conference on Image Processing, volume 2, pages 2129–2133, 1995.

[77] Y. I. Wong. ”Nonlinear Scale-Space Filtering and Multiresolution System”. IEEE Transactions on Image Processing, vol. 4:pages 774–786, June 1995.

[78] G. Xing, J. Li, S. Li, and Y.-Q. Zhang. ”Arbitrarily Shaped video-Object Coding by Wavelet”. IEEE Transactions on Circuits and Systems for Video Technology, 11(10):1135–1139, Oct. 2001.

[79] C. H. Yeh, H. T. Chang, and C. J. Kuo. ”Boundary Block-Searching Algo- rithm for Arbitrary Shaped Coding”. In Proc. IEEE International Conference on Multimedia, volume 1, pages 473–476, 2002.

[80] W. Zhe, S. Wang, R.-S. Lin, and S. Levinson. ”Tracking of Object with SVM Regression”. In Proc. IEEE International Conference on Computer Vision and Pattern Recognition, volume 2, pages II–240–II–245, 2001.

[81] Y. F. Zheng. ”Method for Dynamic 3D Wavelet Transform for Video Compression”. U.S. Patent Application Submitted by the Department of Electrical Engineering, The Ohio State University, Dec. 2000.

[82] Z. Zheng and I. Cumming. ”SAR Image Compression Based on the Discrete Wavelet Transform”. In Proc. IEEE International Conference on Signal Pro- cessing, pages 787–791, Oct. 1998.

[83] V. Zlokolica, W. Philips, and D. Van De Ville. ”A New Non-linear Filter for Video Processing”. In Proc. IEEE Benelux Signal Processing Symposium, volume 2, pages 221–224, 2002.
