
HARMONIC CODING OF SPEECH AT LOW BIT RATES

Peter Lupini

B.A.Sc., University of British Columbia, 1985

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in the School of Engineering Science

© Peter Lupini 1995
SIMON FRASER UNIVERSITY
September, 1995

All rights reserved. This work may not be reproduced in whole or in part, by photocopy or other means, without the permission of the author.

APPROVAL

Name: Peter Lupini

Degree: Doctor of Philosophy

Title of Thesis: Harmonic Coding of Speech at Low Bit Rates

Examining Committee: Dr. J. Cavers, Chairman

Dr. V. Cuperman, Senior Supervisor

Dr. Paul K.M. Ho, Supervisor

Dr. J. Vaisey, Supervisor

Dr. S. Hardy, Internal Examiner

Dr. K. Rose Professor, UCSB, External Examiner

Date Approved:

PARTIAL COPYRIGHT LICENSE

I hereby grant to Simon Fraser University the right to lend my thesis, project or extended essay (the title of which is shown below) to users of the Simon Fraser University Library, and to make partial or single copies only for such users or in response to a request from the library of any other university, or other educational institution, on its own behalf or for one of its users. I further agree that permission for multiple copying of this work for scholarly purposes may be granted by me or the Dean of Graduate Studies. It is understood that copying or publication of this work for financial gain shall not be allowed without my written permission.

Title of Thesis/Project/Extended Essay

"Harmonic Coding of Speech at Low Bit Rates"

Author:

September 15, 1995 (date)

Abstract

Activity in research relating to the compression of digital speech has increased markedly in recent years, due in part to rising consumer demand for products such as digital cellular telephones, personal communications systems, and multimedia systems. The dominant structure for speech coding at rates above 4 kb/s is Code Excited Linear Prediction (CELP), in which the speech waveform is reproduced as closely as possible. Recently, however, harmonic coding has become increasingly prevalent at rates of 4 kb/s and below. Harmonic coders use a parametric model in an attempt to reproduce the perceptual quality of the speech without directly encoding the waveform details. In this thesis, we address some of the challenges of harmonic coding through the development of a new speech codec called Spectral Excitation Coding (SEC). SEC is a harmonic coder which uses a sinusoidal model applied to the excitation signal rather than to the speech signal directly. The same model is used to process both voiced and unvoiced speech through the use of an adaptive algorithm for phase dispersion. Informal listening test results are presented which indicate that the quality of SEC operating at 2.4 kb/s is close to that of existing standard codecs operating at over 4 kb/s.

The SEC system incorporates a new technique for vector quantization of the variable-dimension harmonic magnitude vector called Non-Square Transform Vector Quantization (NSTVQ). NSTVQ addresses the problem of variable-dimension vector quantization by combining a fixed-dimension vector quantizer with a set of variable-sized non-square transforms. We discuss the factors which influence the choice of transform in NSTVQ, as well as several features of the algorithm, including single-parameter control over the tradeoff between complexity and distortion, simpler use of vector prediction techniques, and inherent embedded coding. Experimental results show that NSTVQ outperforms several existing techniques in terms of providing lower distortion along with lower complexity and storage requirements. Results are presented which indicate that NSTVQ used in the Improved Multiband Excitation (IMBE) environment could achieve equivalent spectral distortion while reducing the overall rate by 1000-1250 bits per second.

Acknowledgements

I would like to thank my advisor, Dr. Vladimir Cuperman, for his steady supply of ideas, insight, and guidance. Thanks also to Brigitte Rabold, Jackie Briggs, Marilyn Prouting, and Chao Cheng for all their excellent support. I am grateful to Dr. Neil Cox at MPR Teltech Ltd. for his helpful comments and for his support of my GREAT scholarship application, and to the B.C. Science Council and the Natural Sciences and Engineering Research Council for their financial assistance. I also want to thank everyone in the speech group for their friendship, support, and, most importantly, for always nicing their unix jobs to at least +15. Finally, I want to thank my parents, Louise and Dante Lupini, for their support and encouragement, my children, Jesse and Elliot, for providing balance and bedlam, and my wife, Valerie, for (on top of everything else) finding the "last" bug.

Contents

Abstract
Acknowledgements
List of Tables
List of Figures
List of Symbols
List of Abbreviations

1 Introduction
  1.1 Thesis Objectives
  1.2 Thesis Organization

2 Linear Prediction
  2.1 Introduction
  2.2 Prediction Overview
  2.3 The Linear Prediction Model
  2.4 Linear Prediction of Speech
    2.4.1 Autocorrelation and Covariance Methods
  2.5 Linear Prediction and the Speech Production Model
  2.6 Alternative Representations of the LPC Coefficients
  2.7 Conclusions
3 Vector Quantization
  3.1 Introduction
  3.2 VQ Definitions
  3.3 Nearest-Neighbour Quantizers
  3.4 Generalized Lloyd Algorithm for VQ Design
    3.4.1 GLA With Known Multivariate Probability Density Function
    3.4.2 GLA Based on a Training Sequence
  3.5 Constrained Vector Quantization
    3.5.1 Mean-Removed Vector Quantization
    3.5.2 Multi-Stage Vector Quantization
  3.6 Vector Transform Quantization
    3.6.1 The Karhunen-Loeve Transform
    3.6.2 The Discrete Cosine Transform
    3.6.3 The Orthonormal Polynomial Transform
    3.6.4 The Discrete Hartley Transform

4 Speech Coding
  4.1 Introduction
  4.2 Waveform vs. Parametric Coding
    4.2.1 Variable-Rate Speech Coding
  4.3 Code-Excited Linear Prediction
  4.4 Sinusoidal Coding
    4.4.1 Sinusoidal Transform Coding
    4.4.2 Multi-band Excitation Coding

5 Non-Square Transform Vector Quantization
  5.1 Introduction
  5.2 Existing Approaches
    5.2.1 IMBE Hybrid Scalar/Vector Quantization
    5.2.2 All-Pole Modeling
    5.2.3 Variable Dimension Vector Quantization
  5.3 NSTVQ System Overview
    5.3.1 System Block Diagram
    5.3.2 Choice of an Inverse Transform
    5.3.3 Vector Quantizer Design for NSTVQ
  5.4 Choosing the Transformation Matrices in NSTVQ
    5.4.1 Reducing the Modeling Distortion
    5.4.2 Embedded Coding Property of NSTVQ
    5.4.3 A Multi-Source Coding Formulation
    5.4.4 Reducing the Quantization Distortion
    5.4.5 Reducing the Total Distortion
  5.5 Complexity and Storage Requirements
  5.6 Evaluation of NSTVQ Performance
    5.6.1 Evaluation Environment
    5.6.2 Comparison with All-Pole Modeling and VDVQ
    5.6.3 Comparison with IMBE Scalar/Vector Quantization
  5.7 Summary

6 Spectral Excitation Coding of Speech
  6.1 Introduction
  6.2 Excitation-Based Harmonic Coding Overview
  6.3 Generation of the Unquantized Excitation Signal
    6.3.1 Choice of Analysis Window Length
    6.3.2 Choice of Interpolation Interval, L_int
  6.4 Fundamental Frequency Estimation
    6.4.1 Open-Loop Pitch Estimation
    6.4.2 Pitch Tracking
  6.5 Spectral Estimation and Quantization
    6.5.1 Spectral Estimation
    6.5.2 Magnitude Quantization
    6.5.3 Phase Quantization
    6.5.4 Phase Dispersion
    6.5.5 Phase Interpolation
  6.6 Spectral Excitation Coding at 2.4 kb/s
    6.6.1 System Description
    6.6.2 Performance Evaluation
  6.7 Conclusions

7 Conclusions
  7.1 Suggestions for Future Work

A Mean Opinion Score (MOS) Test Details

References

List of Tables

4.1 Bit Allocations for the FS 1016 CELP codec operating at 4.8 kb/s

5.1 Bit Allocation for the IMBE Coder
5.2 Spectral Distortion (dB) for LPC-10, VDVQ, and NSTVQ (30 bits per spectrum)

6.1 Pitch Tracking parameters used in the 2.4 kb/s SEC codec
6.2 Bit Allocations for the 2.4 kb/s SEC Codec using a frame length of 240 samples with 40 samples per subframe
6.3 Bit Allocations for SEC-v1 using a frame length of 320 samples with 80 samples per subframe
6.4 Mean opinion score (MOS) results

List of Figures

2.1 Structures used in human speech generation
2.2 Typical speech waveforms for (a) voiced, (b) unvoiced fricative, (c) voiced fricative, and (d) plosive
2.3 Simplified speech production model for voiced sounds. The vocal tract is modeled as a series of lossless acoustic tubes which can be described by an all-pole filter

3.1 Example of a vector quantizer with k = 2 and N = 8
3.2 Mean-Removed Vector Quantizer Encoder
3.3 Multi-Stage Vector Quantizer Encoder
3.4 Vector Transform Quantizer Encoder
3.5 Three equivalent vector quantizers. In (a), a random vector x is quantized directly using a VQ. In (b), an invertible transform is first applied to x followed by the inverse transform. In (c), the inverse transform is combined with the codebook. The quantization distortion in all three cases is identical

4.1 Example of Waveform Coding vs. Parametric Coding
4.2 Overview of Standardized Speech Coding Systems
4.3 Embedded Coder Block Diagram
4.4 Code-Excited Linear Prediction Block Diagram
4.5 MBE Synthesis Window Positioning
5.1 Actual speech spectrum (solid) and LPC-10 modeled spectrum (dashed) with actual and modeled harmonic magnitudes indicated for a speaker with a pitch period of 20 samples. In (a), no interpolation was used before modeling; in (b), 70 interpolated magnitude samples were used
5.2 Block diagram of the Non-Square Transform Vector Quantization (NSTVQ) (a) encoder and (b) decoder
5.3 Spectral Distortion due to NSTVQ modeling versus fixed vector dimension M for various orthonormal transforms
5.4 Block diagram of a source model consisting of multiple sources of random vectors with independent statistics. At each time instant t = iT a source z_i is selected by the switch with probability p_i
5.5 Basic structure of a Vector Multi-Transform Quantization encoder
5.6 Example of a multiple transform VTQ system with two Markov-I inputs
5.7 Joint probability density functions for (a) a Markov-I process with ρ = 0.8, (b) a Markov-I process with ρ = -0.8, and (c) a process obtained by selecting vectors from (a) and (b) with equal probability
5.8 Joint probability density function after KLT transformation for a Markov-I process with ρ = 0.8, a Markov-I process with ρ = -0.8, and a process obtained by selecting vectors from each source with equal probability
5.9 Comparison of theoretical vs. experimental coding gain vs. ρ for a VMTQ system with two Markov inputs and KLT transforms. The VMTQ classification index is assumed to be available at the receiver
5.10 Experimental coding gain of NSTVQ due to the VMTQ structure for various bit rates
5.11 Spectral Distortion due to a low-rate (18 bits/spectrum) quantizer versus fixed vector dimension M for various orthonormal transforms
5.12 Spectral Distortion due to a high-rate (66 bits/spectrum) quantizer versus fixed vector dimension M for various orthonormal transforms
5.13 Total Distortion for M = 10 versus number of bits/spectrum for various transforms
5.14 Total Distortion for M = 30 versus number of bits/spectrum for various transforms

5.15 NSTVQ distortion for 30 bits per spectrum versus fixed dimension M for the DCT-II transform: (i) modeling distortion, (ii) quantizer distortion, (iii) total distortion
5.16 Complexity estimates for NSTVQ and VDVQ using an MSVQ structure with 10 bits/stage
5.17 Storage estimates for NSTVQ and VDVQ using an MSVQ structure with 10 bits/stage
5.18 Spectral distortion vs. number of bits per spectrum for (i) NSTVQ and (ii) NSTPVQ. IMBE spectral magnitude quantization using an average of 66 bits/spectrum is indicated with an asterisk

6.1 Block Diagram of a General Excitation-Based Harmonic Coder
6.2 SEC Short-Term Filter Coefficient Estimation and Interpolation
6.3 Short-term filter prediction gain vs. SEC analysis window length for a frame length of N = 240 samples
6.4 Short-term filter prediction gain vs. number of coefficient interpolation intervals for a frame length of N = 240 samples
6.5 Example of pitch contour obtained from a voiced speech segment with and without the pitch tracking algorithm
6.6 Error in spectral coefficient estimation using the analysis window orthogonality assumption
6.7 Error in spectral coefficient estimation due to single sample pitch errors for various windows
6.8 Error in spectral coefficient estimation due to single sample pitch errors for various windows
6.9 Harmonic Phase Quantization Using Prediction Residuals
6.10 Typical Phase Prediction Residuals for Voiced and Unvoiced Speech Segments
6.11 Phase distortion D_p(φ, φ̂) as a function of φ - φ̂
6.12 Effect of Pitch Period on Number of Randomized Harmonics using Adaptive Phase Dispersion
6.13 Block Diagram of the Low-Rate SEC System
6.14 SEC Excitation Synthesis Model

A.1 Spectral characteristics of the MOS speech material

List of Symbols

A bold lowercase symbol implies a vector; a bold uppercase symbol implies a matrix; bold uppercase italics imply a random process.

a_i          the ith element of the vector or discrete signal a
A^t          the transpose of the matrix A
x̂_n          an estimate of the element x_n, or the quantized value of x_n
||x||        the norm of the vector x
E[x]         the expectation of x
R_XX         the autocorrelation matrix for the random process X
s            the speech signal
e            the speech production model excitation signal
a_i          the ith linear prediction coefficient
Q(x)         quantization of the vector x
x_q          the quantized value of x
d(x, x_q)    the distortion between x and x_q
R_i          the ith vector quantizer partition cell
cent(R_i)    the centroid of the region R_i
f_x(x)       the multivariate probability density function of the random vector x
C^(m)        a vector quantizer codebook at GLA iteration m
D^(m)        the total average vector quantizer distortion at GLA iteration m
T            a training set of vectors for vector quantizer design
|A|          the Jacobian of the transformation matrix A
λ_i          the ith eigenvalue
||f(x)||_a   the l_a norm of the function f(x)
             the perceptual weighting filter
             the long-term predictor or adaptive codebook
             the fundamental (normalized) frequency
             the real part of a complex number x
L_int        the interpolation interval for the short-term filter
             the length of the short-term filter analysis window
             the normalized autocorrelation function evaluated at lag p
             the spectrum of a windowed input speech signal
             the spectrum of a windowed signal synthesized using a harmonic speech model
             the spectral coefficient for harmonic k
             the spectral magnitude for harmonic k
             the spectral phase for harmonic k
             the predicted spectral phase for harmonic k

List of Abbreviations

ADPCM    Adaptive Differential Pulse Code Modulation
CELP     Code Excited Linear Prediction
DCT      Discrete Cosine Transform
DFT      Discrete Fourier Transform
DHT      Discrete Hartley Transform
GLA      Generalized Lloyd Algorithm
MBE      Multiband Excitation
IMBE     Improved Multiband Excitation
KLT      Karhunen-Loeve Transform
LD-CELP  Low-Delay Code Excited Linear Prediction
LPC      Linear Prediction Coefficients
LSP      Line Spectral Pairs
MSE      Mean Squared Error
MSVQ     Multi-stage Vector Quantization
NN       Nearest Neighbor
NSTVQ    Non-Square Transform Vector Quantization
OPT      Orthonormal Polynomial Transform
PCM      Pulse Code Modulation
SD       Spectral Distortion
SEC      Spectral Excitation Coding
SEEVOC   Spectral Envelope Estimation Vocoder
SNR      Signal-to-Noise Ratio

STC      Sinusoidal Transform Coding
TFI      Time Frequency Interpolation
VDVQ     Variable Dimension Vector Quantization
VQ       Vector Quantizer
VTQ      Vector Transform Quantization

Chapter 1

Introduction

Research directed towards the compression of digital speech signals has a history stretching back several decades. In the last few years, however, there has been a flourish of activity in this area. While the bandwidth available for communications over both wireless and wireline channels has grown, consumer demand for services using these channels has consistently outpaced the growth in channel capacity. Furthermore, the availability of low-cost, high-performance digital processing hardware has made it possible to use increasingly complex algorithms while maintaining real-time compression rates. Emerging applications such as digital cellular telephones, personal communications systems, and multimedia systems all benefit from conserving either bandwidth for transmission applications or media space for storage applications; hence the need for speech coding.

Until recently, the dominant structure for speech codecs has been Code Excited Linear Prediction (CELP). CELP coders are waveform coders which use analysis-by-synthesis to reproduce the speech waveform as closely as possible. At rates above 4 kb/s, CELP coders are still dominant; for rates below 4 kb/s, however, harmonic coding has become increasingly prevalent. Harmonic coders use a parametric model in which speech is synthesized using a sum-of-sinusoids approach. The sinusoids have frequencies which are harmonics of the fundamental (pitch) frequency of the talker. Because harmonic coders do not try to reproduce details of the speech waveform, which may be unimportant perceptually, they can perform better than waveform coders at very low rates, when there are not enough bits for accurate waveform matching.

1.1 Thesis Objectives

Harmonic coders bring a new set of challenges to the field of speech coding. In this thesis we address some of these challenges through the development of a new speech codec called Spectral Excitation Coding (SEC). SEC is a harmonic coder which uses a sinusoidal model applied to the excitation signal rather than to the speech signal directly. The same model is used to process both voiced and unvoiced speech through the use of an adaptive algorithm for phase dispersion. Informal listening test results are presented which indicate that the quality of SEC operating at 2.4 kb/s is close to that of existing standard codecs operating at over 4 kb/s.

The most important contribution of this work is a new technique for vector quantization of the variable-dimension harmonic magnitude vector called Non-Square Transform Vector Quantization (NSTVQ). NSTVQ addresses the problem of variable-dimension vector quantization by combining a fixed-dimension vector quantizer with a set of variable-sized non-square transforms. We discuss the factors which influence the choice of transform in NSTVQ and show that for typical speech coding applications the Discrete Cosine Transform and Orthogonal Polynomial Transform are good choices. We show that NSTVQ has several advantages, including single-parameter control over the tradeoff between complexity and distortion, simpler use of vector prediction techniques, and inherent embedded coding. Results are presented which show that NSTVQ outperforms several existing techniques in terms of providing lower distortion along with lower complexity and storage requirements. Experiments are provided which show that NSTVQ used in the Improved Multiband Excitation (IMBE) environment could achieve equivalent spectral distortion while reducing the overall rate by 1000-1250 bits per second.

1.2 Thesis Organization

Chapter 2 provides a brief introduction to linear prediction, with a focus on the relationship between linear prediction and the speech production model. Linear prediction is an important component of the spectral excitation coding system. Chapter 3 presents the notation and concepts related to vector quantization which provide a background for the discussion of NSTVQ in Chapter 5. In particular, the generalized Lloyd algorithm for VQ design, which is used to train the NSTVQ codebooks, is presented. An overview of constrained vector quantization and vector transform quantization provides additional background for the NSTVQ discussion. In Chapter 4, an overview of the current state of speech coding research is presented, with emphasis on waveform vs. parametric coding. The most widely used low-rate waveform coder, CELP, is presented and contrasted with two well-known sinusoidal coders: sinusoidal transform coding (STC) and multiband excitation coding (MBE). In Chapter 5, a new method for vector quantization of variable-length vectors is presented. The chapter begins with a discussion of some well-known existing approaches to the variable-dimension problem, followed by a detailed presentation of NSTVQ. Finally, a comparison of NSTVQ with existing methods is presented. Chapter 6 presents a new speech coding system, SEC. A general discussion of parameter estimation, quantization, and interpolation is followed by a detailed description of the 2.4 kb/s SEC system. The final chapter, Chapter 7, contains a brief summary of the work presented in this thesis.

Chapter 2

Linear Prediction

2.1 Introduction

Linear prediction theory covers a large volume of material, ranging from parameter estimation of linear systems to the adaptation of these systems under a wide variety of conditions. In particular, researchers in the fields of speech coding and speech recognition have made extensive use of linear prediction theory. In this chapter we focus on the application of linear prediction to speech coding. Our goal is to introduce the notation and basic concepts that are important later in the thesis. For further exploration there are many sources containing extensive material relating to linear prediction, for example [50, 36, 17].

We start by introducing a general form of linear prediction in which values from one random process are estimated based on a set of observations from another random process. This is used, for example, in the prediction of spectral magnitude vectors for quantization. We then discuss the form of linear prediction used in speech coding, with special emphasis on the relationship between the linear prediction coefficients and the coefficients of the all-pole digital filter used to model the vocal tract transfer function.

2.2 Prediction Overview

Prediction applied to a random process is a procedure in which past observations of the process are used to obtain an estimate of one or more future observations. Intuitively, it is apparent that knowledge of the past can help us predict the future. For example, if we watch someone flip a coin twenty times in a row and get heads each time, we might assume the coin was not fair and predict another head on the next toss! In the same way, knowledge of the underlying joint probability distribution of a random process can help us infer future observations from past observations. When the prediction of future observations is based on a linear operation on the previously observed samples, the prediction is said to be linear.

The following section presents a formal derivation of the equations for optimal (in a mean squared error sense) linear prediction. Part of the presentation is based on [50] and [17].

2.3 The Linear Prediction Model

Suppose we want to predict the value which will be observed for a K-dimensional random vector y = [y_1, ..., y_K]^t using an observation of an N-dimensional random vector x = [x_1, ..., x_N]^t. If we want to use linear prediction, the vector y is estimated based on a weighted linear sum of the elements of x using

    \hat{y} = Ax,    (2.1)

where A is a K×N prediction matrix. Naturally, we would like to find the matrix A which gives the "best" possible estimate of y according to some criterion. The error vector used in the minimization is

    e = y - \hat{y} = y - Ax.    (2.3)

If we use a mean squared error (MSE) criterion, our goal is then to find the elements of A which minimize the expected value of the squared norm of the error vector of eqn. (2.3):

    E[\|e\|^2] = E[\|y - Ax\|^2].    (2.4)

Minimization of equation (2.4) can be approached in several ways. For example, using variational techniques, it is possible to set the derivative of E[\|e\|^2] with respect to each element of A equal to zero and verify that the solution matrix A_opt gives a global minimum. Another useful approach makes use of the orthogonality theorem, which is stated below; a proof of this theorem is provided in [17].

Theorem 2.1 A linear predictor \hat{y} = Ax is optimal in the MSE sense if and only if the components of the error vector (which are themselves random variables) are orthogonal to the components of the observation vector. That is, if e = y - Ax, then

    E[e_k x_n] = 0, \quad k = 1, \ldots, K, \; n = 1, \ldots, N.    (2.5)

We can now obtain an expression for the optimal (in the MSE sense) linear predictor of the form given by (2.1) as follows:

    E[x e^t] = E[x (y - Ax)^t]    (2.6)
             = E[x y^t] - E[x x^t] A^t,    (2.7)

where x e^t is an N×K matrix whose nth row and kth column element is given by x_n e_k. Using Theorem 2.1, we know that the error is minimized when all components of the matrix defined by eqn. (2.7) are zero. This leads to the following system of equations, whose solution matrix A results in the optimal linear predictor:

    E[x x^t] A^t = E[x y^t].    (2.8)

The term E[x x^t] is simply the autocorrelation matrix, R_xx, for the random vector x. When the components of x are linearly independent, the autocorrelation matrix is positive definite and the optimal (MSE) solution matrix is given by

    A_opt^t = R_xx^{-1} E[x y^t].    (2.9)

Equation (2.9) gives an expression for the optimal prediction matrix for linear prediction of one random vector given another random vector. One application of this expression presented in this thesis uses linear vector prediction to estimate the speech spectrum based on a linear combination of previously observed spectra. The most common use of linear prediction in speech coding, however, is the prediction of the current speech sample from a linear combination of past samples. The remainder of this chapter focuses on this application.
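To make the vector predictor concrete, the following is a minimal numpy sketch (not code from the thesis) in which sample averages stand in for the expectations of eqn. (2.9); the data and the matrix A_true are illustrative assumptions only:

```python
import numpy as np

def optimal_linear_predictor(X, Y):
    """Estimate the MSE-optimal prediction matrix A with y_hat = A x.

    X: (num_samples, N) observations; Y: (num_samples, K) targets.
    Sample averages replace the expectations in eqn. (2.9):
        A^t = R_xx^{-1} E[x y^t].
    """
    Rxx = X.T @ X / len(X)          # sample autocorrelation matrix E[x x^t]
    Rxy = X.T @ Y / len(X)          # sample cross-correlation E[x y^t]
    At = np.linalg.solve(Rxx, Rxy)  # solve R_xx A^t = E[x y^t]
    return At.T

# Example: predict a 2-vector from a correlated 3-vector.
rng = np.random.default_rng(0)
X = rng.standard_normal((10000, 3))
A_true = np.array([[1.0, -0.5, 0.2], [0.3, 0.8, -0.1]])
Y = X @ A_true.T + 0.01 * rng.standard_normal((10000, 2))
A_hat = optimal_linear_predictor(X, Y)   # close to A_true
```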

2.4 Linear Prediction of Speech

In speech coding, we often encounter a linear predictor of the form

    \hat{x}_n = \sum_{i=1}^{M} a_i x_{n-i}.    (2.10)

This is a specific case of eqn. (2.1) where K = 1 and N = M. The vector y becomes a scalar, x_n, representing a sample at index n; the vector x consists of the past M observed samples; and the matrix A becomes an M-dimensional prediction vector a = [a_1, ..., a_M]^t. The elements of a are called the prediction coefficients. The error signal, which will be seen to be an important component in speech coding applications, is given by

    e_n = x_n - \hat{x}_n    (2.11)
        = x_n - \sum_{i=1}^{M} a_i x_{n-i}.    (2.12)

Using eqn. (2.8) and assuming a stationary process X, we obtain a linear system of equations for computing the optimal (in an MSE sense) prediction coefficients:

    \sum_{i=1}^{M} a_i\, r_{(i-j)} = r_j, \quad j = 1, \ldots, M,    (2.13)

where v = [r_1, r_2, ..., r_M]^t and r_j = E[x_n x_{n-j}]. The element in the ith row and jth column of R_xx is given by R_xx(i, j) = r_{(i-j)}, i = 1...M, j = 1...M. This system of equations is called the normal equations or Yule-Walker equations for the optimal linear predictor coefficients, and can be written in matrix form as

    R_xx a = v.    (2.14)

When the components of the vector x are linearly independent, the autocorrelation matrix R_xx is non-singular and the optimal linear prediction coefficients are given by

    a = R_xx^{-1} v.    (2.15)

The autocorrelation matrix of a stationary random sequence has a Toeplitz structure (the elements along each diagonal are equal), which is exploited in many efficient coefficient estimation algorithms, the most common being the Levinson-Durbin algorithm.
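The Levinson-Durbin recursion can be stated compactly. The sketch below assumes the conventions of eqns. (2.10)-(2.14) (prediction coefficients a_i, autocorrelation values r_0..r_M); it is an illustrative implementation, not the thesis's own:

```python
import numpy as np

def levinson_durbin(r, M):
    """Solve the Toeplitz normal equations R_xx a = v of eqn. (2.14)
    for the prediction coefficients a_1..a_M, given autocorrelation
    values r[0..M].  Returns (a, reflection_coeffs, error_power)."""
    a = np.zeros(M)
    k = np.zeros(M)                 # reflection (PARCOR) coefficients
    err = r[0]                      # zeroth-order prediction error power
    for m in range(1, M + 1):
        # partial correlation of x_n with x_{n-m}, given samples between
        acc = r[m] - np.dot(a[:m-1], r[m-1:0:-1])
        k_m = acc / err
        k[m-1] = k_m
        # order-update of the predictor coefficients
        a_prev = a[:m-1].copy()
        a[:m-1] = a_prev - k_m * a_prev[::-1]
        a[m-1] = k_m
        err *= (1.0 - k_m * k_m)    # error power never increases
    return a, k, err
```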

2.4.1 Autocorrelation and Covariance Methods

Up to this point, we have considered the problem of linear prediction applied to stationary stochastic signals. Although speech is obviously not a stationary signal, a model of local stationarity is often applied, in which a short segment of speech is obtained by applying a window, w(n), n = 0...N-1, to the speech signal. The resulting segment is assumed to be a set of samples taken from an ergodic random process. In this case we can obtain an estimate of the underlying autocorrelation function for a speech segment through time averaging. For a segment x_0, ..., x_{N-1}, a commonly used estimate of the correlation between two samples separated by lag j is given by

    r_j = \frac{1}{N} \sum_{n=0}^{N-|j|-1} w_n x_n\, w_{n+|j|} x_{n+|j|}.    (2.16)

When this estimate is used in place of r_j in eqn. (2.14) to obtain the optimal set of linear prediction coefficients, the estimation procedure is called the autocorrelation method. Note that the autocorrelation matrix obtained from this method has a Toeplitz structure, and therefore efficient algorithms such as Levinson-Durbin can be used to compute the coefficients. The fact that the autocorrelation method results in a Toeplitz matrix is a direct consequence of the finite-length windowing, which sets samples outside the window to zero. Another approach is to avoid windowing completely. Instead, the signal is considered to be deterministic, and a mean-squared error cost function is applied directly to the observed data over a fixed interval, 0 ≤ n ≤ N - 1, according to

    E = \sum_{n=0}^{N-1} e_n^2 = \sum_{n=0}^{N-1} \left( x_n - \sum_{i=1}^{M} a_i x_{n-i} \right)^2.    (2.17)

The approach for finding the optimal linear prediction coefficients through the minimization of eqn. (2.17) is called the covariance method¹, which leads to the following system of equations:

    \sum_{i=1}^{M} a_i\, \phi(i, j) = \phi(0, j), \quad j = 1, \ldots, M,    (2.18)

where φ(i, j) is given by

    \phi(i, j) = \sum_{n=0}^{N-1} x_{n-i}\, x_{n-j}.    (2.19)

¹ Although the term covariance method is used extensively in the speech coding literature, it has no relation to the usual meaning of "covariance" found in random process theory.

Note that the solution of eqn. (2.18) using eqn. (2.19) requires samples of x within the interval -M ≤ n ≤ N - 1, as compared with the autocorrelation method, in which only samples within the interval 0 ≤ n ≤ N - 1 are used. Unfortunately, the computation of the prediction coefficients using the covariance method requires the inversion of a non-Toeplitz matrix and therefore cannot be performed as efficiently as with the autocorrelation method. Another advantage of the autocorrelation method is that an IIR filter using the prediction coefficients is guaranteed to be stable; the same is not true for coefficients calculated with the covariance method.
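As an illustration of the autocorrelation method, the sketch below (an example under assumed conventions, not code from the thesis) forms the windowed estimate of eqn. (2.16) with a Hamming window and solves the resulting Toeplitz system; a generic linear solver is used for brevity where the Levinson-Durbin routine above would normally be applied:

```python
import numpy as np

def lpc_autocorrelation(segment, M, window=None):
    """Autocorrelation-method LPC analysis of one speech segment."""
    N = len(segment)
    w = np.hamming(N) if window is None else window
    xw = w * np.asarray(segment, dtype=float)
    # r_j = (1/N) sum_n (w_n x_n)(w_{n+j} x_{n+j}),  j = 0..M  (eqn. 2.16)
    r = np.array([xw[: N - j] @ xw[j:] / N for j in range(M + 1)])
    # Toeplitz normal-equation matrix: R_xx(i, j) = r_{|i-j|}
    idx = np.abs(np.arange(M)[:, None] - np.arange(M)[None, :])
    a = np.linalg.solve(r[idx], r[1 : M + 1])
    return a, r
```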

2.5 Linear Prediction and the Speech Production Model

In many speech coding applications, compression is achieved through the use of a model. Typically, the model includes an excitation signal which is passed through an all-pole filter in order to synthesize speech. The motivation for the all-pole synthesis filter and its relationship to linear prediction requires a basic understanding of the way we produce speech.

Figure 2.1 illustrates the structures used for speech generation in humans. Voiced speech is generated by forcing air from the lungs through the glottis.


Figure 2.1: Structures used in human speech generation.

For voiced sounds, the glottis gives the signal a quasi-periodic structure by opening and closing at an interval known as the pitch period. This quasi-periodic signal then excites the vocal tract to create voiced speech. Unvoiced speech can be generated by forcing air through the vocal tract while the glottis remains open, resulting in a speech waveform which has no periodic structure. When the vocal tract is constricted, for example by pressing the tongue against the roof of the mouth, noise-like sounds called fricatives are created. Fricatives may be voiced, as in "zip", or unvoiced, as in "sip". Plosive sounds are produced by sealing off the vocal tract to build up pressure which is then suddenly released, as in "pop". Figure 2.2 gives examples of waveforms corresponding to (a) voiced speech, (b) unvoiced fricatives, (c) voiced fricatives, and (d) plosives.

Several assumptions can be made in order to obtain a simplified parametric model for speech production. The air forced through the glottis creates a signal called the excitation signal. The excitation signal may be quasi-periodic (voiced speech) or noise-like (unvoiced speech)². The vocal tract can be modeled as a set of lossless cylindrical acoustic tubes, each tube having a different resonant frequency.

² Sometimes the excitation signal for voiced speech is itself modeled as a periodic pulse train passed through a glottal shaping filter.



Figure 2.2: Typical speech waveforms for (a) voiced, (b) unvoiced fricative, (c) voiced fricative, and (d) plosive.

The transfer function of the vocal tract can then be represented as an Mth-order all-pole IIR filter of the form

    H(z) = \frac{1}{A(z)} = \frac{1}{1 - \sum_{i=1}^{M} a_i z^{-i}}.    (2.20)

This model is illustrated in fig. 2.3, which shows the excitation and speech signals for a typical voiced waveform. We now come to the relationship between the speech production model defined above and linear prediction. Using the filter polynomial A(z), the expression for the excitation signal at index n given the speech signal is

    e_n = s_n - \sum_{i=1}^{M} a_i s_{n-i}.    (2.21)

Direct comparison of eqn. (2.21) and eqn. (2.12) shows that the excitation signal for the all-pole vocal tract model is nothing more than the error signal resulting from linear prediction applied to the speech signal, s. Clearly, linear prediction plays an important part in the estimation of the speech production model parameters.

It should be stated that several assumptions made in creating the speech production model defined above are not valid.


Figure 2.3: Simplified speech production model for voiced sounds. The vocal tract is modeled as a series of lossless acoustic tubes which can be described by an all-pole filter.

For example, the vocal tract is clearly not built of cylinders, and it does absorb some energy. More importantly, the nasal cavity can be used to eliminate energy at certain frequencies, creating spectral "nulls" which cannot be modeled easily by an all-pole filter. However, these assumptions greatly simplify the model parameter estimation and are therefore widely used in speech applications.
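The analysis/synthesis relationship of eqns. (2.20) and (2.21) can be demonstrated directly: inverse filtering the speech with A(z) yields the excitation, and passing that excitation back through 1/A(z) reconstructs the speech exactly. A simple sketch (illustrative only; direct-form loops, no optimization):

```python
import numpy as np

def lpc_residual(s, a):
    """Inverse-filter speech s with A(z) = 1 - sum_i a_i z^{-i}
    (eqn. 2.21) to obtain the excitation (prediction error) signal."""
    s = np.asarray(s, dtype=float)
    M = len(a)
    e = s.copy()
    for n in range(len(s)):
        for i in range(1, M + 1):
            if n - i >= 0:
                e[n] -= a[i - 1] * s[n - i]
    return e

def lpc_synthesize(e, a):
    """Pass an excitation signal through the all-pole filter 1/A(z)
    of eqn. (2.20); this exactly inverts lpc_residual."""
    M = len(a)
    s = np.zeros(len(e))
    for n in range(len(e)):
        s[n] = e[n]
        for i in range(1, M + 1):
            if n - i >= 0:
                s[n] += a[i - 1] * s[n - i]
    return s
```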

2.6 Alternative Representations of the LPC Coefficients

In speech compression applications which use linear prediction, it is generally required that a set of parameters representing the all-pole filter be quantized. Usually, direct quantization of the LPC coefficients is avoided due to the complexity required to ensure that the quantized filter is stable. The quantization properties of several alternative representations of the short-term filter have been studied in [4]. Two important representations are reflection coefficients (RC) and Line Spectral Pairs (LSP)³. Both representations provide simple stability checks: the absolute value of all reflection coefficients must be less than or equal to one, and the line spectral pairs must be monotonically increasing in frequency. A detailed discussion of these and other alternative representations of the all-pole filter can be found in many references (see, for example, [36, 26]).

³ The line spectral pairs are also called line spectral frequencies (LSF).
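A simple way to apply the reflection-coefficient stability check is the step-down (backward Levinson) recursion, which converts prediction coefficients to reflection coefficients. The sketch below assumes the predictor convention of eqn. (2.10) and is illustrative rather than taken from the thesis:

```python
import numpy as np

def reflection_coefficients(a):
    """Convert prediction coefficients a (x_hat_n = sum_i a_i x_{n-i})
    to reflection coefficients via the step-down recursion.
    The synthesis filter 1/A(z) is stable iff all |k_m| < 1."""
    a = np.asarray(a, dtype=float).copy()
    M = len(a)
    k = np.zeros(M)
    for m in range(M, 0, -1):
        k[m-1] = a[m-1]
        if abs(k[m-1]) >= 1.0:
            break  # unstable; lower-order steps would divide by ~zero
        if m > 1:
            # a_i^(m-1) = (a_i^(m) + k_m a_{m-i}^(m)) / (1 - k_m^2)
            a[:m-1] = (a[:m-1] + k[m-1] * a[m-2::-1]) / (1.0 - k[m-1]**2)
    return k

def is_stable(a):
    return bool(np.all(np.abs(reflection_coefficients(a)) < 1.0))
```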

2.7 Conclusions

In this chapter we have presented the general problem of linear prediction for the estimation of observations from one random process using a linear combination of observations from another random process. This form of linear vector prediction will appear later in this work for the purpose of quantizing spectral magnitude vectors. Linear prediction of speech samples was then presented with special emphasis placed on parameter estimation and the relationship between the prediction coefficients and the coefficients of an all-pole filter used to model the vocal tract. The excitation signal described here plays an important role in Spectral Excitation Coding.

Chapter 3

Vector Quantization

3.1 Introduction

Vector quantization (VQ) involves the mapping of an input set of points in k-dimensional Euclidean space onto a finite set of output points using a partitioning of the input space into regions called cells. Scalar quantization can be considered a special case of vector quantization where the vector dimension, k, is one. Vector quantization is used extensively in speech recognition and speech compression. In this chapter we focus on the use of vector quantization for signal compression, with emphasis on the terms and concepts used later in this thesis. In particular, the widely used Generalized Lloyd Algorithm (GLA) for codebook design is presented, as well as several sub-optimal VQ structures which are used in the Spectral Excitation Coder discussed in Chapter 6.

The importance of vector quantization in speech compression can be attributed to three important advantages over scalar quantization:

1. Vector quantization takes advantage of both linear statistical dependence (correlation) and non-linear statistical dependence between vector elements in order to improve quantizer performance.

2. Vector quantizers have extra freedom in choosing cell shapes compared with scalar quantization.

3. Vector quantizers make possible the use of fractional per-sample bit rates.

Further discussion of these advantages and of vector quantization in general can be found in many sources, for example [17] and [35].

3.2 VQ Definitions

A vector quantizer Q of dimension k and size N is a mapping Q: R^k → C in which an input vector x ∈ R^k is mapped into the finite set C, where C = {y_1, y_2, ..., y_N} and y_i ∈ R^k. Each vector in C is called a codevector, or codeword, and the set of N codevectors is called the codebook. The rate of the vector quantizer in bits per vector element (or bits per sample) is given by r = (log_2 N)/k. The mapping decisions are based on the partitioning of R^k into N regions R_i, i = 1...N, called cells. The ith cell is defined as the subspace of R^k containing all vectors which are mapped by Q into y_i. If the mapping of vector x into y_i is indicated as

    Q(x) = y_i,    (3.1)

then the region R_i is defined by

    R_i = \{x ∈ R^k : Q(x) = y_i\}.    (3.2)

Because every input point is mapped onto a unique output point, it follows that the union of all cells is R^k and that the intersection of R_i and R_j is empty for i ≠ j. Cells which are bounded are called granular cells, and the set of all granular cells is called the granular region. Cells which are unbounded are called overload cells, and the set of all overload cells is called the overload region. Figure 3.1 shows an example of a two-dimensional vector quantizer illustrating the terms defined above.

In a typical signal compression application, a vector quantizer is used to encode a vector x by mapping x into a codevector index which is transmitted to the receiver. The receiver uses the index to obtain a quantized vector, x_q. Note that the encoder must know the partitioning of the input space in order to determine the cell in which each input vector lies. The decoder, on the other hand, requires only the codebook and simply uses the index to look up the corresponding codevector.


Figure 3.1: Example of a vector quantizer with k = 2 and N = 8.

A distortion measure, d(x, x_q), is used to measure the distortion or error due to quantization, and thus the performance of the quantizer. The distortion measure

    d_m(x, x_q) = \|x - x_q\|_m = \left( \sum_{i=1}^{k} |x_i - x_{q,i}|^m \right)^{1/m}    (3.3)

is called the l_m norm of the error vector x - x_q, and is commonly used in signal compression, in particular with m = 2.

3.3 Nearest-Neighbour Quantizers

As discussed in the previous section, a VQ encoder must know the partitioning of the input space in order to perform the mapping from input vector to cell index. For one important class of vector quantizers, the partitioning is completely determined by the codebook and the distortion measure. An encoder of this class is called a Voronoi or nearest-neighbour (NN) vector quantizer. Nearest-neighbour quantizers have the advantage of not requiring any explicit storage of the geometrical description of the cells. Furthermore, nearest-neighbour quantizers are optimal for a given codebook in the sense of minimizing the average distortion between the unquantized and quantized vectors.

The partition cell R_i of a nearest-neighbour quantizer is defined by

    R_i = \{x : d(x, y_i) ≤ d(x, y_j), \; j = 1, \ldots, N\}.    (3.4)

Equation (3.4) simply states that each cell in an NN quantizer consists of those vectors which are "closest" to the codevector for that cell, where "closest" means resulting in the minimum distortion. Proving that eqn. (3.4) results in the optimal partition for a given codebook is straightforward and can be found in [17]. Note that when the distortion between an input vector and two or more codevectors is equal, the input vector is on a cell boundary and must be assigned to a unique codevector, for example the codevector with the minimum index.

Equation (3.4) provides the rule for computing the optimal partitions given a codebook and distortion rule. Suppose we have the inverse problem: we want to compute the optimal codewords given the distortion rule and the partitions. The optimal codeword y_i^* for cell R_i is called the centroid of R_i, or cent(R_i), and is defined to be the point in R_i for which the expected value of the distortion between x ∈ R_i and y_i^* is minimized. This can be stated as

    y_i^* = cent(R_i) \; if \; E[d(x, y_i^*) \,|\, x ∈ R_i] ≤ E[d(x, y) \,|\, x ∈ R_i] \quad ∀\, y ∈ R_i.    (3.5)

For example, when a squared error distortion measure is used, the centroid of region R_i is the value of y_i given by

    y_i^* = \frac{\int_{R_i} x\, f_x(x)\, dx}{\int_{R_i} f_x(x)\, dx} = E[x \,|\, x ∈ R_i],    (3.6)

where f_x(x) is the multivariate probability density function of the random vector x. Equations (3.4) and (3.5) form the basis of the vector quantizer design method presented in the next section.

3.4 Generalized Lloyd Algorithm for VQ Design

In this section we consider the design of efficient nearest-neighbour vector quantizers. The most commonly used design algorithm is a generalization of Lloyd's (Method I) algorithm [28], known as the Generalized Lloyd Algorithm (GLA) or the Linde-Buzo-Gray (LBG) algorithm [27]. The GLA starts with an initial codebook and uses an iterative approach to obtain a new codebook which has equal or lower average distortion. We start by considering the case where the statistics of the input random vector x are known. We then consider the case where these statistics are unknown but can be inferred from a sufficiently rich training set.

3.4.1 GLA With Known Multivariate Probability Density Function

Assume we have a k-dimensional random vector x with known multivariate probability density function (pdf) f_x(x). Given an initial codebook C^(m) of size N and a distortion measure d(x, y), the GLA uses iteration of the following steps to create a new codebook C^(m+1):

1. For each i = 1...N, compute the optimal cell partition R_i using eqn. (3.4) (a tie-breaking rule is required when the distortion between x and two or more codevectors is equal).

2. Given R_i, compute the codewords y_i^(m+1) = cent(R_i), i = 1...N, using eqn. (3.5). These new codewords form the new codebook, C^(m+1).

In practical applications, an analytical description of the pdf is generally not available and therefore codebook training is usually performed based on a sample distribution inferred from an empirical set of observations called the training sequence.

3.4.2 GLA Based on a Training Sequence

Assume that in place of a known pdf for x we have a training set T consisting of representative vectors x_j, j = 1...M.

Under some assumptions, we can use averaging over T to obtain estimates of f_x(x), with probability one that our estimates approach f_x(x) as M → ∞¹. Based on this idea, we can once again apply the GLA, this time using averaging over the training set in place of the actual pdf. This leads to the following steps for the creation of a new codebook C^(m+1) based on an initial codebook C^(m) of size N, a distortion measure d(x, y), and a training set T consisting of M vectors:

1. Let R_i be a set of vectors called a cluster. For each j = 1,...,M, assign training vector x_j to cluster R_i if and only if d(x_j, y_i) ≤ d(x_j, y_k) ∀ k ≠ i (a suitable tie-breaking rule is required).

2. For each cluster computed in step 1, compute the codewords y_i^(m+1) = cent(R_i), i = 1...N. The centroid for cluster R_i is given by

    cent(R_i) = \arg\min_{y} \sum_{x_j ∈ R_i} d(x_j, y).

These new codewords form the new codebook, C^(m+1).

These steps for codebook improvement are known as the Lloyd Iteration and form the core of the GLA. The complete GLA algorithm for codebook design using a training set can now be summarized as follows

1. Set m = 1 and choose an initial codebook C^(1).

2. Given the codebook C^(m), perform the Lloyd Iteration to obtain an improved codebook C^(m+1).

3. Compute the total average distortion

    D^{(m+1)} = \frac{1}{M} \sum_{j=1}^{M} d(x_j, Q^{(m+1)}(x_j)).

If (D^(m) - D^(m+1))/D^(m) < ε, then stop. Otherwise set m = m + 1 and go to step 2.
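A compact implementation of this training-set GLA for the squared-error distortion (for which the centroid of a cluster is its mean) might look as follows; the initialization by randomly chosen training vectors is one common choice, not the only one:

```python
import numpy as np

def gla(train, N, eps=1e-4, max_iter=100, seed=0):
    """Train a nearest-neighbour VQ codebook with the GLA.

    train: (M, k) training set; N: codebook size; squared-error distortion.
    """
    rng = np.random.default_rng(seed)
    codebook = train[rng.choice(len(train), size=N, replace=False)].copy()
    D_prev = np.inf
    for _ in range(max_iter):
        # Lloyd Iteration step 1: nearest-neighbour clustering (eqn. 3.4)
        d2 = ((train[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        idx = d2.argmin(axis=1)
        # total average distortion of the current codebook
        D = d2[np.arange(len(train)), idx].mean()
        if np.isfinite(D_prev) and (D_prev - D) / D_prev < eps:
            break
        D_prev = D
        # Lloyd Iteration step 2: centroid update (cluster means)
        for i in range(N):
            members = train[idx == i]
            if len(members) > 0:   # keep old codevector for empty clusters
                codebook[i] = members.mean(axis=0)
    return codebook
```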

An important property of the GLA is that each successive iteration results in a codebook giving average total distortion less than or equal to that of the previous codebook. This is a direct result of the fact that step 1 of the Lloyd Iteration produces optimal partitions and step 2 produces optimal codebooks given those partitions and the training set T. It should be noted, however, that the average distortion produced by GLA-designed codebooks converges only to a local minimum; a different initial codebook can result in a different final codebook. For a discussion of algorithms related to initial codebook design, see [17].

¹ For example, ergodic X and T a set of M observations of X.

3.5 Constrained Vector Quantization

For many vector quantization applications, the use of a single codebook is impractical due to limitations in complexity or available storage. For example, consider a typical speech coding application: the quantization of tenth-order linear prediction coefficients using three bits per coefficient. In this case, the optimal VQ structure would consist of a single codebook having N = 2^30 entries. Using 4 bytes per sample, the codebook would require 4 gigabytes of storage. The complexity required to search the codebook would also be prohibitive: on the order of N operations. A common approach to this problem involves modifying the structure of the optimal VQ in order to obtain a constrained, or sub-optimal, VQ. Many different constraints have been used; in this chapter, we focus on those structures relevant to work presented later in this thesis.

3.5.1 Mean-Removed Vector Quantization

A mean-removed vector quantizer is an example of a product code VQ. A product code VQ reduces the complexity involved in encoding an input vector x by decomposing x into a set of feature vectors which are then quantized separately (for more information on product codes in general, see [17]). Mean-removed vector quantizers are often used when the mean of the input vector set can be considered approximately independent of the vector shape. For example, when the log function is applied to the signal vector components before quantization, the Euclidean norm of the signal vector, or signal level, becomes additive. In such cases, the mean of the log-signal vector is often removed prior to vector quantization.

Figure 3.2 shows the structure of a mean-removed vector quantizer. The mean of the input vector x is first computed using

    m = \frac{1}{k} \sum_{n=1}^{k} x_n,

where k is the vector dimension. The vector mean is then quantized using a scalar quantizer, resulting in an index j corresponding to the quantized mean, m̂. The quantized mean is subtracted from each element of x to obtain the mean-removed vector x', which is then quantized using a k-dimensional vector quantizer.


Figure 3.2: Mean-Removed Vector Quantizer Encoder

A method for training the mean-removed VQ of the form shown in fig. 3.2 (given in [17]) proceeds as follows:

1. For each vector in the training set T, compute the vector mean to create the mean training set M.

2. Apply GLA to M to obtain a codebook for the mean scalar quantizer.

3. For each vector in the training set T, compute the vector mean, quantize it using the codebook obtained from step 2, and subtract the quantized mean to create a new mean-removed training set T'.

4. Apply GLA to T' to obtain the shape codebook.

Separate quantization of the mean allows fewer bits to be used for the shape vector, resulting in a smaller codebook requiring lower search complexity.
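As an illustration, encoding and decoding with a trained mean-removed VQ reduces to the few lines below (a sketch following the structure of fig. 3.2; mean_codebook and shape_codebook are assumed to have been designed with the GLA as described):

```python
import numpy as np

def mean_removed_encode(x, mean_codebook, shape_codebook):
    """Mean-removed VQ encoder: scalar-quantize the mean, subtract it,
    then vector-quantize the residual shape vector."""
    m = x.mean()
    j = int(np.argmin((mean_codebook - m) ** 2))         # mean index
    x_shape = x - mean_codebook[j]                       # remove quantized mean
    i = int(np.argmin(((shape_codebook - x_shape) ** 2).sum(axis=1)))
    return j, i

def mean_removed_decode(j, i, mean_codebook, shape_codebook):
    return mean_codebook[j] + shape_codebook[i]
```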

3.5.2 Multi-Stage Vector Quantization

Multi-stage vector quantization (MSVQ) is a technique which uses a cascade of codebooks in which each codebook is used to quantize the error vector from the previous codebook. Figure 3.3 shows the structure of a multi-stage vector quantizer encoder. The input vector x is quantized by a first-stage vector quantizer, Q_1.

Figure 3.3: Multi-Stage Vector Quantizer Encoder

The quantized vector x_1 is subtracted from the input vector to form the first-stage residual vector e_1. The vector e_1 then becomes the search target for the second-stage vector quantizer Q_2, resulting in a second-stage residual vector e_2. Quantization at each successive stage proceeds in a similar fashion. After quantization of e_{K-1} by the Kth vector quantizer, the K optimal codevector indices are transmitted to the decoder.

To see how MSVQ can be used to solve complexity and memory problems, consider the case of the optimal 30-bit VQ discussed in section 3.5 for tenth-order LP coefficients, which required on the order of N = 2^30 operations for codebook searching and 4 gigabytes of storage using 4 bytes per sample. An alternative approach using a 30-bit MSVQ having 6 stages with 5 bits per stage would require on the order of 6(2^5) = 192 operations for codebook searching and only 768 bytes of storage! Of course, the large structural constraint imposed by the MSVQ can often cause a significant drop in performance. To deal with this problem, several modifications to the basic MSVQ search and design algorithms have been suggested. In particular, the coefficients representing the short-term filter used in the Spectral Excitation Coding system discussed in Chapter 6 are quantized using an MSVQ structure with an M-L search procedure described in [5].
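The MSVQ encoder of fig. 3.3 reduces to a short loop: each stage quantizes the residual left by the previous stage, and the decoder sums the selected codevectors. A minimal sketch (squared-error search; the codebook arrays are hypothetical inputs):

```python
import numpy as np

def msvq_encode(x, stage_codebooks):
    """Multi-stage VQ encoding: returns one codevector index per stage."""
    indices = []
    residual = np.asarray(x, dtype=float).copy()
    for cb in stage_codebooks:                     # cb: (N_stage, k) array
        i = int(np.argmin(((cb - residual) ** 2).sum(axis=1)))
        indices.append(i)
        residual = residual - cb[i]                # pass the error onward
    return indices

def msvq_decode(indices, stage_codebooks):
    return sum(cb[i] for cb, i in zip(stage_codebooks, indices))
```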

Figure 3.4: Vector Transform Quantizer Encoder associated with the use of sub-optimal VQ structures. For example, consider the case of a k-dimensional input vector x with highly correlated elements. A sub-optimal VQ consisting of k scalar quantizers obviously cannot take advantage of the intra-vector correlation. However, a decorrelating transform applied to x can be used to remove the correlation before scalar quantization resulting in a system with a performance gain over direct scalar quantization with no transform. A detailed evaluation of the performance gain of vector transform coding relative to scalar quantization can be found in [lo]. CHAPTER 3. VECTOR QUANTIZATION

It is important to note that when L = 1 (i.e., a transform followed by a k- dimensional vector quantizer), the performance of an optimal VTQ system is equiv- alent to that of an optimal VQ applied directly to the input random vector without transformation as long as the transform is invertible. To prove this, we must first show that for any optimal codebook, C, used to quantize a random variable, x, there exists another codebook, C', which gives the same distortion when used to quantize Ax, where A is an invertible transform. To complete the proof we must then show that C' is optimal for Ax. The first part of the proof is illustrated in fig. 3.5. Fig- ure 3.5(a) shows a system in which x is quantized using an optimal VQ codebook, C = {cl,. . . , cN). An equivalent system, shown in fig. 3.5(b), uses an invertible trans- formation matrix, A, followed by its inverse, A-', applied to x before quantization. Finally, in fig. 3.5(c), the inverse transform is combined with the VQ codebook to obtain a new codebook, C' = {A-'cl,. . . , A-lcN) . Because all three systems are identical, quantizing Ax using C' results in the same distortion as quantizing x using C. To complete the proof we note that C' must be optimal for Ax. If this were not the case, we could obtain a codebook for x using the argument given above which would result in lower distortion than C, violating the initial assumption that C is optimal. To summarize, the application of a single invertible transformation to a random vector cannot improve the minimum obtainable distortion for a full-complexity VQ. The remainder of this section presents some transforms which are relevant to the Non-Square Transform Quantization system discussed in chapter 5.

3.6.1 The Karhunen-Loeve Transform

The Karhunen-Loeve transform (KLT) is an orthonormal transform which has several important properties exploited in signal compression applications. In particular, the KLT

1. completely decorrelates the elements of the transformed vector, and

2. minimizes the mean squared error between an original vector and a vector obtained using an inverse transform with one or more of the transformed vector components set to zero.

Figure 3.5: Three equivalent vector quantizers. In (a), a random vector x is quantized directly using a VQ. In (b), an invertible transform is first applied to x followed by the inverse transform. In (c), the inverse transform is combined with the codebook. The quantization distortion in all three cases is identical.

These two properties lead to two possible approaches for the derivation of the KLT. Using the first property, we can derive the KLT by finding a transform A which, when applied to a random vector x, results in a transformed random vector y having a diagonal autocorrelation matrix (see, for example, [17]). In this section, however, we use a derivation based on the second property, which has particular relevance to work later in this thesis. The derivation is based on that found in [47].

Given a zero-mean random vector x = [x_1, x_2, ..., x_N]^t, we can represent x using

    x = \sum_{i=1}^{N} y_i a_i,    (3.13)

where a_i, i = 1...N, form a set of orthogonal vectors, and y_i, i = 1...N, are the coefficients given by

    y_i = a_i^t x.    (3.14)

Our goal is to find the basis functions a_i which will minimize the error between x and x̂, a truncated representation of x given by

    \hat{x} = \sum_{i=1}^{M} y_i a_i,    (3.15)

with M ≤ N. The mean squared error due to the truncation of basis functions is

    e = E[\|x - \hat{x}\|^2].    (3.16)

If we assume that the basis functions are orthonormal, then

    a_i^t a_j = \begin{cases} 1 & i = j \\ 0 & \text{otherwise,} \end{cases}    (3.17)

and eqn. (3.16) becomes

    e = E\left[ \Big\| \sum_{i=M+1}^{N} y_i a_i \Big\|^2 \right]    (3.18)
      = \sum_{i=M+1}^{N} E[y_i^2]    (3.19)
      = \sum_{i=M+1}^{N} a_i^t E[x x^t]\, a_i.    (3.20)

We want to minimize eqn. (3.20) with respect to a_i for each i, subject to the constraint that a_i^t a_i = 1. Using the method of Lagrange multipliers, we obtain

    \frac{\partial}{\partial a_i} \left[ a_i^t E[x x^t]\, a_i - \lambda_i (a_i^t a_i - 1) \right] = 0,    (3.21)

which leads directly to the eigenvalue problem

    \left( E[x x^t] - \lambda_i I_N \right) a_i = 0,    (3.22)

where I_N is the N×N identity matrix. The basis vectors a_i which minimize eqn. (3.20), then, are the eigenvectors of the autocorrelation matrix E[x x^t], and the truncation error can be written completely in terms of the eigenvalues, λ_i, as

    e = \sum_{i=M+1}^{N} \lambda_i.    (3.23)

Note that e is minimized by ranking the eigenvalues λ_i in descending order. We can now write the KLT transform pair in matrix notation as

    x = A y    (3.24)

and

    y = A^t x,    (3.25)

where the columns of A are the eigenvectors of the autocorrelation matrix of the random vector x. The derivation above shows that the KLT minimizes the error in representing x using a truncated set of basis functions; to show that the KLT decorrelates the vector in the transform domain, observe that

E[y y^t] = E[A^t x x^t A]     (3.26)
         = A^t E[x x^t] A     (3.27)
         = \mathrm{diag}[\lambda_1, \ldots, \lambda_N].     (3.28)

While the KLT is important for theoretical work, it is often impractical for signal processing applications because the basis functions depend on the autocorrelation matrix E[x x^t], which usually cannot be predetermined.
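As a concrete illustration of the derivation above (our sketch, not from the text; `klt_basis` is a hypothetical helper, and the autocorrelation matrix is estimated from sample vectors rather than known a priori):

```python
import numpy as np

def klt_basis(samples: np.ndarray):
    """Estimate the KLT basis from rows of `samples` (assumed zero mean).

    Returns (A, eigvals) with eigenvalues sorted in descending order, so
    truncating to the first M columns of A minimizes the MSE of eqn. (3.16).
    """
    R = samples.T @ samples / len(samples)   # estimate of E[x x^t]
    eigvals, A = np.linalg.eigh(R)           # eigh: R is symmetric
    order = np.argsort(eigvals)[::-1]        # descending eigenvalue order
    return A[:, order], eigvals[order]

# Example: correlated Gaussian vectors are decorrelated by the KLT.
rng = np.random.default_rng(1)
x = rng.standard_normal((5000, 8)) @ rng.standard_normal((8, 8))
A, lam = klt_basis(x)
y = x @ A                                    # y = A^t x, applied row-wise
print(np.round(y.T @ y / len(y), 2))         # approximately diag(lambda)
```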

3.6.2 The Discrete Cosine Transform

The discrete cosine transform (DCT) is often used in place of the KLT in practical systems for several reasons:

It is an orthonormal transform with pre-determined basis functions.

Fast algorithms are available for DCT computation.

Under several criteria, the DCT performance approaches that of the KLT [47].

An extensive analysis of the DCT can be found in [47]. In this section we present the definitions of two forms of the DCT, the DCT-I and the DCT-II. Both transforms are orthonormal.

The (N+1) x (N+1) DCT-I transformation matrix A has elements a_m(n) defined by

a_m(n) = \sqrt{\frac{2}{N}} \, k_m k_n \cos\left( \frac{mn\pi}{N} \right), \quad m, n = 0, \ldots, N     (3.29)

and the N x N DCT-II transformation matrix has elements defined by

a_m(n) = \sqrt{\frac{2}{N}} \, k_m \cos\left[ \frac{(2n+1)m\pi}{2N} \right], \quad m, n = 0, \ldots, N-1     (3.30)

where

k_i = \begin{cases} \frac{1}{\sqrt{2}} & \text{when } i = 0 \text{ or } i = N \\ 1 & \text{otherwise.} \end{cases}
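A quick numerical check of the DCT-II definition in eqn. (3.30) (our sketch) builds the matrix and verifies its orthonormality:

```python
import numpy as np

def dct2_matrix(N: int) -> np.ndarray:
    """Orthonormal DCT-II matrix with elements a_m(n) from eqn. (3.30)."""
    m = np.arange(N)[:, None]          # row index (basis function)
    n = np.arange(N)[None, :]          # column index (sample)
    k = np.where(m == 0, 1 / np.sqrt(2), 1.0)
    return np.sqrt(2.0 / N) * k * np.cos((2 * n + 1) * m * np.pi / (2 * N))

A = dct2_matrix(8)
print(np.allclose(A @ A.T, np.eye(8)))  # True: the rows are orthonormal
```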

3.6.3 The Orthonormal Polynomial Transform

The orthonormal polynomial transform (OPT) is obtained by constructing a set of discrete polynomials which are orthonormal over the interval n = 1 ... N. Given the first two orthonormal polynomials, the remaining N - 2 polynomials can be found recursively. The equations for generating the basis functions are given below. The first two basis functions of the OPT are defined by

a_1(n) = \frac{1}{\sqrt{N}}     (3.32)

and

a_2(n) = \sqrt{\frac{12}{N(N^2 - 1)}} \left( n - \frac{N+1}{2} \right).     (3.33)

The remaining basis functions, a_m(n), m = 3 ... N, n = 1 ... N, can be found recursively using

a_m(n) = (\alpha_m n + \beta_m) \, a_{m-1}(n) + \gamma_m \, a_{m-2}(n)     (3.34)

where the recursion coefficients α_m, β_m, and γ_m are chosen at each step so that a_m(n) has unit norm and is orthogonal to the lower-order basis functions.

3.6.4 The Discrete Hartley Transform

The discrete Hartley transform (DHT) was first introduced in [23] and has more recently been proposed in [6] because of its close relationship to the discrete Fourier transform and its apparent advantage in handling real data. The basis functions for the N x N DHT are given by

a_m(n) = \frac{1}{\sqrt{N}} \left[ \cos\left( \frac{2\pi mn}{N} \right) + \sin\left( \frac{2\pi mn}{N} \right) \right], \quad m, n = 0, \ldots, N-1.     (3.37)

Chapter 4

Speech Coding

4.1 Introduction

Speech coding generally refers to the process of reducing the number of bits required to adequately represent a speech signal in digital form for a given application. Most work in speech coding has been applied to speech signals having a typical telephone bandwidth of about 200 Hz to 3400 Hz. More recently, attention has been focussed on wideband coding of speech signals having a bandwidth of about 7 kHz. In the last few years there has been a marked increase in research activity centered around speech coding, driven mainly by several new voice communication applications, for example digital cellular telephones, personal communications systems, and multimedia systems. In all cases, there is a need to conserve either bandwidth for transmission applications or media space for storage applications, hence the need for speech coding. Speech coding applications generally involve lossy compression, where the reconstructed speech signal is not an exact replica of the original signal. The main goal of most coding systems is to minimize the audible distortion due to lossy compression under constraints such as bit-rate, coding delay, algorithm complexity, and robustness to transmission errors. In this chapter, we focus on speech coding applications which typically operate at bit-rates of about 16 kb/s and below. In particular, we discuss the two main classes of speech coding systems, waveform coders and parametric coders, and provide examples of each.

A good overview of recent advances in speech and audio compression can be found in [19]. For a more detailed presentation of current research, see [2], [3], and [25].

4.2 Waveform vs. Parametric Coding

There are two main classes of algorithms used for speech coding: waveform coders and parametric coders. Waveform coders attempt to reconstruct a speech signal by reproducing the time domain waveform as closely as possible. Parametric coders, on the other hand, attempt to re-create the sound of the original speech signal by specifying a set of parameters which can be used in conjunction with a model for signal generation. While the reconstructed signal from a parametric coder may sound like the original signal, the waveform may be quite different. Parametric coders are sometimes called vocoders, a contraction of "voice coders". Figure 4.1 illustrates the difference between the time-domain signal produced by a waveform coder and that of a parametric coder. The original speech segment is shown in fig. 4.1(a). The reconstructed signal using a waveform coder operating at 4 kb/s is shown in fig. 4.1(b), and the same signal reconstructed with a parametric coder operating at a similar rate is shown in fig. 4.1(c). Although the output signal of the waveform coder matches the original signal much more closely than that of the parametric coder, subjective evaluations indicate that the perceptual quality of the two codecs used in this example is very similar. Figure 4.2 provides an overview of some existing standard speech compression systems. At rates ranging from 64 kb/s to about 8 kb/s, the majority of speech coding systems are waveform coders. Existing standards include Adaptive differential pulse code modulation (ADPCM) at 32 kb/s, low-delay code excited linear prediction (LD-CELP) at 16 kb/s [9], vector sum excited linear prediction (VSELP) at 8 kb/s [21], and the Department of Defense U.S. Federal Standard 1016 (FS 1016) operating at 4.6 kb/s [8]. Recently, parametric coding has gained prominence at rates below about 4 kb/s. The 4.15 kb/s Improved Multiband Excitation (IMBE) coder has been adopted by Inmarsat as a standard for satellite voice communications [15]. There is an interesting difference in the audible distortion of waveform coders as compared to parametric coders, which is highlighted in fig. 4.2. As the bit-rate is reduced, waveform coders are generally perceived as becoming "noisier". Parametric coders are often described as being "clean" even at low bit rates; however, as the bit-rate is reduced, the reconstructed speech tends to become robotic or unnatural.

Figure 4.1: Example of Waveform Coding vs. Parametric Coding. (a) original (unencoded) signal; (b) signal reconstructed using a waveform coder; (c) signal reconstructed using a parametric coder.

Figure 4.2: Overview of Standardized Speech Coding Systems

4.2.1 Variable-Rate Speech Coding

An important goal in the design of voice communication networks and storage systems is to maximize capacity while maintaining an acceptable level of voice quality.

Conventional speech coding systems use a fixed bit rate regardless of factors such as local speech statistics, transmission channel conditions, or network load. One method of maximizing capacity while maintaining an acceptable level of speech quality is to allow the bit rate to vary as a function of these factors. Variable rate speech coders exploit two important characteristics of speech communications: the large percentage of silence during conversations, and the large local changes in the minimal rate required to achieve a given speech reproduction quality. Variable rate coders can be divided into three main categories:

source-controlled variable rate coders, where the coding algorithm determines the data rate based on analysis of the short-term speech signal statistics.

network-controlled variable rate coders, where the data rate is determined by an external control signal generated by the network in response to traffic levels.

channel-controlled variable rate coders, where the data rate is determined by channel state information (such as the estimated channel SNR).

The first two categories were defined in [20]. Channel-controlled variable rate coders are used in systems where a fixed aggregate rate is divided between the speech coder and the channel coder under the control of a channel state estimate, with the objective of optimizing the speech quality for the end user [11]. Embedded coders form one important sub-category of network-controlled variable rate speech coders; the concept is briefly discussed here in order to provide background for work presented later in this thesis. An embedded speech codec produces a fixed rate bit stream in which lower rate substreams are "embedded" in the bit stream of the higher rate substreams. The encoder state (filter memories, etc.) is determined by the lowest rate substream, hence the transmitter and receiver will have the same state even if the bits used only for the higher rate substreams are dropped by the network. Figure 4.3 shows a block diagram of an embedded coder which produces e-bit codewords. The input signal, s, is processed by the encoder to create a fixed bit stream containing e bits. In response to network traffic conditions, a network controller reduces the rate by dropping d bits from the bit stream, leaving c = e - d bits. After transmission through the network, d filler bits, which carry no information, are added to the bit stream so that e bits are passed to the decoder. If the embedded coder is properly designed, the speech quality at the decoder will be close to that obtained by using a fixed rate c-bit coder.

Figure 4.3: Embedded Coder Block Diagram

Pulse Code Modulation (PCM) quantization provides a straightforward example of embedded coding. If all but the c most significant bits are stripped from an e-bit PCM codeword and replaced with zeros at the decoder, an output signal can be obtained which is close to the output of a fixed rate c-bit PCM encoder. The degree to which the quality of an embedded lower rate encoding can approach the quality of a fixed rate coder operating at the same rate depends on the codec structure and on the choice of quantizers; usually there is some degradation associated with the constraint imposed by the embedded codec structure. More information on embedded coding and variable-rate speech coding in general can be found in [20, 11].
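The PCM example can be simulated directly. In the sketch below (ours; a uniform quantizer over [-1, 1) is assumed for simplicity), dropping d least-significant bits and zero-filling them at the decoder is compared against a true fixed-rate (e-d)-bit coder; the embedded decoding is slightly worse, illustrating the degradation mentioned above.

```python
import numpy as np

def pcm_encode(x: np.ndarray, bits: int) -> np.ndarray:
    """Uniform PCM codewords for samples in [-1, 1)."""
    q = 1 << bits
    return np.clip(((x + 1.0) * 0.5 * q).astype(int), 0, q - 1)

def pcm_decode(codes: np.ndarray, bits: int) -> np.ndarray:
    q = 1 << bits
    return (codes + 0.5) / q * 2.0 - 1.0

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 10000)
e, d = 8, 3                                   # e-bit codewords, d bits dropped
codes = pcm_encode(x, e)
embedded = (codes >> d) << d                  # network drops d LSBs; zero-filled
x_embedded = pcm_decode(embedded, e)
x_fixed = pcm_decode(pcm_encode(x, e - d), e - d)   # true (e-d)-bit coder

for name, xr in (("embedded", x_embedded), ("fixed-rate", x_fixed)):
    snr = 10 * np.log10(np.mean(x**2) / np.mean((x - xr) ** 2))
    print(f"{name}: {snr:.2f} dB")
```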

4.3 Code-Excited Linear Prediction

One of the most important speech coding systems in use today is code-excited linear prediction (CELP). CELP was first proposed as a high-complexity algorithm in [48]; today, however, CELP generally refers to a class of coders having the following key features:¹

Speech is synthesized by passing an excitation signal through some form of long-term synthesis filter (defined below) followed by an LPC-based synthesis filter.

The excitation signal is vector quantized using an analysis-by-synthesis technique where the best excitation vector is selected by passing candidate vectors through the synthesis filters and comparing the output with the original speech using a perceptually weighted error criterion.

The purpose of the long-term synthesis filter in CELP is to model the long-term correlation left in the speech signal after LPC filtering. The form of the long-term synthesis filter, or long-term predictor, is given by

\frac{1}{B(z)} = \frac{1}{1 - \sum_{k} b_k z^{-(p+k)}}     (4.1)

where p is the pitch period. The predictor coefficients, b_k, are often called the tap gains. The values of the tap gains can be computed by minimizing the squared error

E = \sum_{n=0}^{N-1} c^2(n)     (4.2)

where N is the minimization frame length and c(n) is the prediction residual signal obtained by passing the output of the short-term filter, e(n), through the long-term filter, B(z), according to

c(n) = e(n) - \sum_{k} b_k \, e(n - p - k).     (4.3)

The value of the pitch period, p, in the minimization of eqn. (4.2) is usually found using open-loop pitch estimation techniques. In many current CELP systems, the long-term filter approach has been improved upon by using an adaptive codebook which is searched by jointly optimizing the tap gains and pitch period. An adaptive codebook is a set of vectors consisting of time-shifted segments of previous excitation samples. The codebook is searched to find the set of L consecutive vectors whose weighted sum best matches the target excitation signal e(n). The L weights are analogous to the long-term filter tap gains. Typically, single tap and three tap long-term filters or adaptive codebooks are used in CELP applications; in [34], five and seven tap adaptive codebooks were shown to improve the performance of a 2.4 kb/s CELP system. Figure 4.4 shows a diagram illustrating the analysis-by-synthesis nature of the CELP algorithm for subframe n. For each index in the excitation codebook, a candidate excitation codevector c_n is gain-scaled and passed through a long-term synthesis filter designed to add periodicity to the excitation signal. Alternatively, the long-term synthesis filter can be replaced by an adaptive codebook, in which case the gain-scaled excitation codevector is added to the gain-scaled adaptive codebook vector. The resulting vector, û_n, is passed through the LPC synthesis filter 1/A_n(z) to form the synthetic speech vector ŝ_n. The vector ŝ_n is then subtracted from the clean speech vector s_n, and the error signal is weighted using a perceptual weighting filter W_n(z). The norm of the weighted error vector is then computed. An index selector keeps track of the error norms associated with each excitation codevector, and selects the codevector resulting in the minimum norm for transmission to the decoder. For a typical CELP system, the transmitted parameter set consists of the excitation codebook index, the long-term filter tap gains and pitch period (or, in the case of an adaptive codebook, the codebook index), the excitation gain, and the LPC coefficients (or related coefficients such as line spectral pairs). Note that the perceptual weighting filter is only used for analysis in the encoder and therefore its parameters do not need to be transmitted to the decoder. There have been many modifications to the basic CELP structure shown in fig. 4.4 since its introduction. One of the most important was the decomposition of the filtering into zero-input and zero-state responses in order to significantly reduce the complexity of the algorithm. Further complexity reductions have been developed which focus on the structure of the excitation codebook, for example the use of sparse, overlapped codes. Other algorithm developments include modification of the adaptive codebook structure in order to allow fractional pitch periods. One system which incorporates these and other modifications to the basic CELP algorithm is the Department of Defense FS 1016 CELP codec [8]. FS 1016 operates at an encoding rate of 4.6 kb/s, with an extra 200 b/s set aside for synchronization, error correction, and future algorithm modifications. Table 4.1 summarizes the bit allocation for the codec.

¹A good survey on the history and development of the CELP algorithm can be found in [19].

Figure 4.4: Code-Excited Linear Prediction Block Diagram

PARAMETER             Bit Allocation    Rate (bps)
Envelope LSPs         34                1133
Adaptive CB Index     8-6-8-6           933
Adaptive CB Gain      5x4               667
Stochastic CB Index   9x4               1200
Stochastic CB Gain    5x4               667
Total                                   4600

Table 4.1: Bit Allocations for the FS 1016 CELP codec operating at 4.6 kb/s.

Once per 30 ms frame, a set of 10 linear prediction coefficients representing the short-term filter are converted into line spectral pairs and quantized with 34 bits using nonuniform scalar quantizers. The adaptive codebook allows 256 possible non-integer delays ranging from 20-147 samples. Every even subframe, delays are delta-searched and coded with a 6-bit offset relative to the previous subframe. The adaptive codebook gain is encoded using 5 bits each subframe. The complexity of the stochastic codebook search is greatly reduced through the use of a codebook structure with the following features:

a sparse structure (77% of the entries are zero)

ternary valued samples (-1, 0, +1)

overlapped codewords (each consecutive codeword shares all but two samples with the previous and next codewords).

These constraints make it possible to use fast convolution and fast energy computation by exploiting recursive end-point correction algorithms. A 512-entry codebook is used, requiring 9 bits per subframe for the encoding of the codebook index. The stochastic gain is encoded with 5 bits per subframe. The CELP structure has also been proposed in recent years for variable-rate speech coding. Usually, frame classification is used to dynamically alter the parameter bit-rates. An algorithm known as QCELP [24] uses energy-based classification of each input speech frame to determine the bit rate. In [29], a codec capable of operating in either source or network controlled mode is presented which uses frame classification based on the normalized autocorrelation coefficient. Another CELP-based approach taken in [44] attempts direct phonetic classification of speech segments.

4.4 Sinusoidal Coding

In recent years CELP algorithms have become dominant at rates above 4 kb/s. At lower rates, however, CELP systems suffer from large amounts of quantization noise due to the fact that there are not enough bits to accurately encode the details of the waveform. As an alternative, an important class of parametric coders called sinusoidal coders has emerged. Sinusoidal coding is a parametric coding method whereby speech synthesis is modeled as a sum of sinusoidal generators having time-varying amplitudes and phases. The general model used in sinusoidal coding for the synthesis of a frame of speech is given by

\hat{s}(n) = \sum_{l=1}^{L} A_l(n) \cos\left[ \omega_l(n)\, n + \phi_l \right], \quad n = n_0, \ldots, n_0 + N - 1     (4.4)

where L is the number of sinusoids used for synthesis in the current frame, A_l(n) and ω_l(n) specify the amplitude and frequency of the lth sinusoidal oscillator, and φ_l specifies the initial phase of each sinusoid. Note that the amplitude and frequency of each oscillator may vary with the index n. In order to encode speech, a sinusoidal coder analyzes a speech frame to determine the number of sinusoids required for signal reconstruction. For each sinusoid, the frequency, amplitude, and phase are estimated. The transmitted parameter set for a single frame, therefore, consists of L, A_l(n_0), ω_l(n_0), and φ_l, for l = 1 ... L. The decoder then uses interpolation between the parameters of the previous frame and the current frame to obtain all values required for eqn. (4.4). Floating-point simulations of sinusoidal coding using unquantized parameters and various parameter interpolation methods have shown that both voiced and unvoiced speech reconstructed using eqn. (4.4) is indistinguishable from the original [37]. The parameter set required for sinusoidal coding of speech is exceedingly large for speech coding applications at 4 kb/s and below. The most common modification to the synthesis model given by eqn. (4.4) for reducing the required parameter set is to assume that the frequencies of the sinusoids for a given frame are integer multiples of the lowest frequency (called the fundamental or pitch frequency). In this case, we do not need to transmit the number of sinusoids, L, or the frequency of each sinusoid, ω_l(n_0). Instead we need only transmit the fundamental frequency ω_0(n_0), which leads to a large reduction in the required bit rate. The synthesis model for harmonic coding, then, is given by

\hat{s}(n) = \sum_{l=1}^{L} A_l(n) \cos\left[ l\,\omega_0(n)\, n + \phi_l \right], \quad n = n_0, \ldots, n_0 + N - 1.     (4.5)

Because the frequencies in such a system are harmonics of the fundamental frequency, this special case of sinusoidal coding is known as harmonic coding (a minimal code sketch of this synthesis model is given after the list below). Although harmonic coding is obviously well-suited to the reconstruction of the near-periodic signals typical of voiced speech, it is less clear how well unvoiced speech can be synthesized by the model of eqn. (4.5). An analysis of this problem was performed in [39] using the Karhunen-Loeve expansion for noise-like signals [51]. The results showed that the harmonic model is valid for unvoiced speech provided that the fundamental frequency used is less than approximately 100 Hz. There are several issues which must be addressed by all harmonic coding systems.

1. Parameter estimation: Methods must be developed which will provide good estimates of the fundamental frequency, harmonic magnitudes, and harmonic phases. In fact, for most low bit-rate systems there are not enough bits available for phase encoding, and other methods, for example phase prediction, must be used.

2. Voicing Detection: Because pitch estimation methods may return meaningless values during unvoiced speech, there must be some way of measuring the level of voicing in order to ensure that enough harmonics will be used for adequate representation of unvoiced sounds. Furthermore, when phase values are not transmitted, it is essential to randomize the phases during unvoiced sounds in order to obtain noise-like signals.

3. Parameter Interpolation: During synthesis, the parameter values must evolve smoothly from frame to frame in order to prevent artifacts; therefore methods for interpolating the model parameters must be defined.
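The harmonic synthesis model of eqn. (4.5) can be rendered in a few lines (our sketch; linear amplitude interpolation across the frame and fixed initial phases are simplifying assumptions):

```python
import numpy as np

def synthesize_frame(A0, A1, w0, phi, N):
    """Harmonic synthesis per eqn. (4.5), with amplitudes interpolated
    linearly from A0 (previous frame) to A1 (current frame).

    A0, A1 : arrays of L harmonic magnitudes
    w0     : fundamental frequency in radians/sample (held constant here)
    phi    : array of L initial phases
    """
    n = np.arange(N)
    alpha = n / N                                  # interpolation weight
    s = np.zeros(N)
    for l in range(1, len(A0) + 1):
        Al = (1 - alpha) * A0[l - 1] + alpha * A1[l - 1]
        s += Al * np.cos(l * w0 * n + phi[l - 1])
    return s

# Example: a 100 Hz fundamental at 8 kHz sampling, 10 harmonics.
L, w0 = 10, 2 * np.pi * 100 / 8000
rng = np.random.default_rng(3)
frame = synthesize_frame(rng.uniform(0.5, 1, L), rng.uniform(0.5, 1, L),
                         w0, rng.uniform(-np.pi, np.pi, L), 160)
```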

In the following sections, we present two well-known harmonic coding systems: sinusoidal transform coding (STC) and multi-band excitation coding (MBE). In particular, we focus on the approaches used in each of these codecs to address the issues of parameter estimation, interpolation, and the special handling of unvoiced speech. It should be noted that the main quantization problem in harmonic coding involves quantization of the harmonic magnitudes. Presentation of this subject is left for chapter 5 in the context of Non-Square Transform Vector Quantization.

4.4.1 Sinusoidal Transform Coding

Sinusoidal transform coding (STC) is a sinusoidal coding technique developed by McAulay and Quatieri [39, 38]. It has been applied to several signal processing applications such as time-scale and pitch-scale modification [40], and two-talker separation [46]. For speech coding at low bit-rates, STC uses a harmonic model for speech synthesis [37].

STC Parameter Estimation

In STC, the fundamental frequency, ω_0, is found using an MSE minimization technique in which a closest fit is performed between a speech signal represented by a set of spectral coefficients measured at an arbitrary set of frequencies, and an estimate of that signal represented by coefficients evaluated at harmonically related frequencies. The goal is to find the phases and fundamental frequency which result in the best fit. We begin by representing the speech segment, s(n), n = -N/2 ... N/2, to be synthesized as a sum of (possibly) aharmonic sinusoids

s(n) = \sum_{l=1}^{L} A_l \, e^{j(n\omega_l + \theta_l)}.     (4.6)

We would like to obtain an estimate of this waveform using the harmonic model

\hat{s}(n) = \sum_{k=1}^{K} A(k\omega_0) \, e^{j(nk\omega_0 + \phi_k)}     (4.7)

where ω_0 is the fundamental frequency, K is the number of harmonics in the speech bandwidth, A(ω) is the vocal tract envelope, and φ_k, k = 1 ... K are the harmonic phases. Assuming for the moment that we know A(ω) in advance, we would like to find the values of φ_k and ω_0 for which the mean squared error between eqns. (4.6) and (4.7) is minimized. The MSE evaluated over the N + 1 samples of s(n) can be written as

\epsilon = \sum_{n=-N/2}^{N/2} |s(n)|^2 - 2\,\mathrm{Re}\left\{ \sum_{n=-N/2}^{N/2} s(n)\,\hat{s}^*(n) \right\} + \sum_{n=-N/2}^{N/2} |\hat{s}(n)|^2.     (4.9)

The first term of eqn. (4.9) is the power in the measured signal, P_s, and is independent of ω_0 and φ_k. The second term of eqn. (4.9) can be written as

2\,\mathrm{Re}\left\{ \sum_{n=-N/2}^{N/2} s(n)\,\hat{s}^*(n) \right\} = 2\,\mathrm{Re}\left\{ \sum_{k=1}^{K} A(k\omega_0) \, e^{-j\phi_k} \sum_{n=-N/2}^{N/2} s(n)\, e^{-jnk\omega_0} \right\}     (4.10)

By substituting eqn. (4.7) into the third term of eqn. (4.9) we can obtain the approximation

\sum_{n=-N/2}^{N/2} |\hat{s}(n)|^2 \approx (N+1) \sum_{k=1}^{K} A^2(k\omega_0)     (4.11)

which is valid for (N+1) >> 2π/ω_0. This condition is met in STC by first obtaining a coarse pitch estimate and using an analysis window size which is two and a half times this estimate for the refined pitch analysis. If we define the short-time Fourier transform of s(n) as

S(\omega) = \sum_{n=-N/2}^{N/2} s(n)\, e^{-j\omega n}     (4.12)

we can rewrite eqn. (4.9) using eqns. (4.9) - (4.12) as

\epsilon = P_s - 2 \sum_{k=1}^{K} A(k\omega_0) \, \mathrm{Re}\left\{ e^{-j\phi_k} S(k\omega_0) \right\} + (N+1) \sum_{k=1}^{K} A^2(k\omega_0).     (4.13)

To minimize eqn. (4.13) with respect to the harmonic phases, φ_k, k = 1 ... K, we want to maximize the second term. If S(kω_0) = S_k e^{jψ_k}, where S_k is the harmonic magnitude at ω = kω_0 and ψ_k is the phase, then the second term is maximized when the exponent of e is zero, or φ_k = ψ_k, k = 1 ... K. In other words, the harmonic model phases which minimize the mean squared error criterion are simply the measured phases obtained by evaluating eqn. (4.12) at the harmonic frequencies. Using the optimal phases, the resulting MSE can be expressed in terms of ω_0 as

\epsilon(\omega_0) = P_s - 2 \sum_{k=1}^{K} A(k\omega_0) \, |S(k\omega_0)| + (N+1) \sum_{k=1}^{K} A^2(k\omega_0).     (4.14)

Because the pitch affects only the second and third terms of eqn. (4.14), we can obtain an estimate for the optimal pitch value by evaluating

\rho(\omega_0) = (N+1) \sum_{k=1}^{K} A^2(k\omega_0) - 2 \sum_{k=1}^{K} A(k\omega_0) \, |S(k\omega_0)|     (4.15)

over all possible pitch periods and selecting the value of ω_0 which minimizes ρ(ω_0). Several enhancements to this procedure which result in improved pitch estimates are discussed in [37]. In the procedure given above for computing the optimal (MSE) fundamental frequency for the harmonic model, it is assumed that the spectral envelope, A(ω), is known in advance. In STC, the spectral envelope is evaluated using a peak-picking procedure applied to S(ω), the short-time Fourier transform of the input speech frame. The procedure is known as the Spectral Envelope Estimation Vocoder (SEEVOC) algorithm [45]. Given an average pitch estimate, the SEEVOC algorithm finds the largest peak of S(ω) in each harmonic interval and records the magnitude and frequency of that peak. Each subsequent interval is defined based on the location of the previous peak. When no peak is found, the value of S(ω) at the bin center is used. The envelope A(ω) is then obtained using piecewise constant interpolation between the estimated spectral peaks. It should be noted that the pitch estimation algorithm in STC requires an estimate of the spectral envelope, and the spectral envelope estimation procedure requires an estimate of the pitch. This problem is overcome in STC by using a coarse pitch calculation to compute an initial spectral envelope estimate. The optimal pitch value is then computed by minimizing eqn. (4.15), and finally a refined spectral estimate is computed using the SEEVOC algorithm and the optimal pitch.
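The following sketch captures the spirit of the SEEVOC peak-picking loop described above (our paraphrase, not the reference implementation of [45]; `mag` is a magnitude DFT and `w0` the coarse pitch estimate expressed in DFT bins):

```python
import numpy as np

def seevoc_envelope_points(mag: np.ndarray, w0: float):
    """One envelope anchor per harmonic interval, in the spirit of SEEVOC.

    mag : magnitude spectrum (half DFT); w0 : average pitch in DFT bins.
    Each interval starts half a harmonic above the previous anchor, and
    piecewise-constant interpolation between anchors then yields A(w).
    """
    anchors = []
    start = 0.5 * w0
    while start + w0 <= len(mag) - 1:
        lo, hi = int(round(start)), int(round(start + w0))
        k = lo + int(np.argmax(mag[lo:hi]))
        interior = lo < k < hi - 1        # a peak strictly inside the interval
        if not interior:                  # no peak found: fall back to center
            k = int(round(start + 0.5 * w0))
        anchors.append((k, mag[k]))
        start = k + 0.5 * w0              # next interval keyed to this anchor
    return anchors
```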

STC Voicing Detection

Voicing detection in STC is based on the idea that if the current speech frame is voiced, the harmonic model should result in a reasonably good estimate. The signal-to-noise ratio (SNR) of the harmonic fit is calculated using

\mathrm{SNR} = \frac{ \sum_{n=-N/2}^{N/2} |s(n)|^2 }{ \sum_{n=-N/2}^{N/2} |s(n) - \hat{s}(n)|^2 }     (4.17)

where ŝ(n) is the harmonic model estimate computed using ω̂_0, the optimal pitch value which minimizes ρ(ω_0) given by eqn. (4.15). An expression for the probability of voicing, P_v, given the SNR of eqn. (4.17) was derived heuristically

[37] and is given by

P_v(\mathrm{SNR}) = \begin{cases} 1 & \mathrm{SNR} > 13\ \mathrm{dB} \\ \frac{1}{9}(\mathrm{SNR} - 4) & 4\ \mathrm{dB} \leq \mathrm{SNR} \leq 13\ \mathrm{dB} \\ 0 & \mathrm{SNR} < 4\ \mathrm{dB} \end{cases}     (4.18)

When STC is applied to low bit-rate speech coding, a minimum-phase model is used for the spectral harmonics [41], which requires no phase information to be transmitted to the receiver. To handle the case of unvoiced or partially voiced speech, a random phase component is added to all harmonic phases above a cutoff frequency, ω_c, defined by

\omega_c(P_v) = \pi P_v.     (4.19)
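The voicing logic of eqns. (4.18) and (4.19) translates directly into code (our sketch; the slope of the linear ramp follows from requiring continuity at the 4 dB and 13 dB endpoints):

```python
import math

def voicing_probability(snr_db: float) -> float:
    """Heuristic probability of voicing, after eqn. (4.18)."""
    if snr_db > 13.0:
        return 1.0
    if snr_db < 4.0:
        return 0.0
    return (snr_db - 4.0) / 9.0       # linear ramp, continuous at both ends

def phase_cutoff(p_v: float) -> float:
    """Cutoff frequency (radians) above which harmonic phases are
    randomized, eqn. (4.19): w_c = pi * P_v."""
    return math.pi * p_v
```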

STC Parameter Interpolation

In STC, parameter interpolation is inherent in the overlap-add algorithm used for speech synthesis. For each frame the model parameters are estimated, quantized, and transmitted to the decoder. The decoder then uses eqn. (4.7) to synthesize speech over a duration of 2N where N is the number of samples per frame. Successive frames of synthetic speech are then overlapped by N samples, weighted using a symmetric triangular window of length 2N, and added together to create the decoded speech signal.
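A bare-bones version of this overlap-add synthesis (our sketch) makes the implicit interpolation explicit: each frame contributes 2N samples, weighted by a length-2N triangular window and summed with a hop of N samples.

```python
import numpy as np

def overlap_add(frames: list, N: int) -> np.ndarray:
    """Overlap-add synthesis: each frame has 2N samples, the hop is N
    samples, and a symmetric triangular window provides the cross-fade."""
    win = np.bartlett(2 * N)                 # symmetric triangular window
    out = np.zeros(N * (len(frames) + 1))
    for i, frame in enumerate(frames):
        out[i * N : i * N + 2 * N] += win * frame
    return out
```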

4.4.2 Multi-band Excitation Coding

Multi-band excitation (MBE), first proposed by Griffin and Lim [22], is a speech coding technique in which the speech signal is represented by a combination of harmonic sinusoids and noise-like signals. A key feature of MBE is the use of a set of binary, frequency-dependent voicing decisions which control the type of signal used for synthesis in each frequency band. An algorithm based on MBE called Improved MBE (IMBE) has been adopted by Inmarsat as a standard for satellite voice communications [15].

MBE Parameter Estimation

In section 4.4.1 we saw that the fundamental frequency is estimated in STC by minimizing the MSE between a signal represented by a set of sinusoids at unconstrained frequencies, and a signal represented by harmonic sinusoids. The minimization procedure assumed a known spectral envelope. In MBE, an MSE criterion is also used, but in this case the error is measured between the spectrum of an original speech segment and the spectrum of an ideal synthetic speech segment which is periodic. Unlike STC, the MBE approach assumes that the speech segments are obtained using symmetric window functions which are not in general rectangular. The derivation of the equations used in MBE to obtain the optimal spectral magnitudes, spectral phases, and pitch period is presented below. Note that the presentation is based on the MBE algorithm proposed in [22]; some important modifications were made for the IMBE implementation [15], in particular to the spectral magnitude estimation procedure. Let s(n) be an input speech signal, and w(n) be a real symmetric window which is non-zero only over the interval -N ≤ n ≤ N. The spectrum, S_w(ω), of the windowed input speech signal, s(n)w(n), can be obtained using the Fourier transform

S_w(\omega) = \sum_{n=-N}^{N} s(n)\, w(n)\, e^{-j\omega n}.     (4.20)

For the harmonic synthesis model of the form shown in eqn. (4.5), the spectrum of the windowed synthetic speech signal, Ŝ_w(ω), is given by

\hat{S}_w(\omega) = \sum_{k=-M}^{P-M+1} A_k \, W(\omega - k\omega_0)     (4.21)

where W(ω) is the Fourier transform of the window function w(n), M = ⌊P/2⌋, where P is the pitch period in samples, and ω_0 = 2π/P. We now define the error, ε(A_k, ω_0), as the difference between the windowed synthetic speech spectrum and the windowed input speech spectrum

\epsilon(A_k, \omega_0) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \left| S_w(\omega) - \hat{S}_w(\omega) \right|^2 d\omega.     (4.22)

We would like to determine the values of A_k which minimize eqn. (4.22). We first make the assumption that the window function spectrum, W(ω), is orthonormal with respect to shifts by kω_0, that is

\frac{1}{2\pi} \int_{-\pi}^{\pi} W^*(\omega - k\omega_0) \, W(\omega - l\omega_0) \, d\omega = \begin{cases} 1 & \text{if } k = l \\ 0 & \text{otherwise.} \end{cases}     (4.23)

Expanding eqn. (4.22) and making use of the orthonormal window assumption, we get

\epsilon = \frac{1}{2\pi} \int_{-\pi}^{\pi} |S_w(\omega)|^2 \, d\omega - 2\,\mathrm{Re}\left\{ \sum_k A_k^* \, \frac{1}{2\pi} \int_{-\pi}^{\pi} S_w(\omega)\, W^*(\omega - k\omega_0) \, d\omega \right\} + \sum_k |A_k|^2.     (4.24)

Taking the derivative of eqn. (4.24) with respect to A_k and setting the result to zero yields the equation for the optimal values of A_k

A_k = \frac{1}{2\pi} \int_{-\pi}^{\pi} S_w(\omega) \, W^*(\omega - k\omega_0) \, d\omega.     (4.25)

It is straightforward to transform the complex weights A_k, k = -M ... P - M + 1, into a set of harmonic magnitudes and phases. Substitution of eqn. (4.25) into (4.24) leads to ε_a(ω_0), the spectral error given the optimal harmonic weights as a function of the fundamental frequency

\epsilon_a(\omega_0) = \frac{1}{2\pi} \int_{-\pi}^{\pi} |S_w(\omega)|^2 \, d\omega - \sum_k |A_k|^2.     (4.26)

Because the direct use of eqn. (4.26) is computationally complex, a more efficient approximation for ε_a(ω_0) was presented in [22]. The approximation includes a correction factor designed to remove an inherent bias in (4.26) toward selecting longer pitch periods, which results from the more densely sampled spectral envelopes.
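In a discrete implementation the integrals above become sums over DFT bins near each harmonic. The sketch below (ours) estimates the complex weights using the unnormalized form A_k = Σ S_w W* / Σ |W|², which reduces to eqn. (4.25) when the window is orthonormal to its shifts.

```python
import numpy as np

def mbe_harmonic_weights(s: np.ndarray, w: np.ndarray, w0_bins: float, K: int,
                         nfft: int = 1024) -> np.ndarray:
    """Complex harmonic weights A_k ~ <S_w, W_k> / ||W_k||^2 (cf. eqn. (4.25)),
    with the integrals replaced by sums over DFT bins near each harmonic."""
    Sw = np.fft.fft(s * w, nfft)              # windowed speech spectrum
    W = np.fft.fft(w, nfft)                   # window spectrum
    half = max(int(w0_bins // 2), 1)          # bins used around each harmonic
    A = np.zeros(K, dtype=complex)
    for k in range(1, K + 1):
        c = int(round(k * w0_bins))           # center bin of harmonic k
        idx = np.arange(c - half, c + half + 1) % nfft
        Wk = W[(idx - c) % nfft]              # W shifted to the harmonic
        A[k - 1] = (Sw[idx] * np.conj(Wk)).sum() / (np.abs(Wk) ** 2).sum()
    return A
```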

MBE Voicing Determination

As in STC, the MBE algorithm makes voicing decisions based on how well the harmonic model fits the observed data. However, an important feature of MBE is the use of a separate voicing decision for each spectral band. For a frequency band defined by ω_l ≤ ω < ω_u, the normalized MSE between the model spectrum and the observed spectrum is given by

\epsilon_b = \frac{ \int_{\omega_l}^{\omega_u} \left| S_w(\omega) - \hat{S}_w(\omega) \right|^2 d\omega }{ \int_{\omega_l}^{\omega_u} \left| S_w(\omega) \right|^2 d\omega }.     (4.27)

When ε_b is below a given threshold, the synthetic spectrum is close to the input speech spectrum for that band, and thus the band is declared voiced. It should be noted that the performance of the MBE algorithm is strongly dependent on correct voicing decisions, and in the IMBE implementation several heuristic rules are used to adapt the threshold for various spectral characteristics [15].

MBE Parameter Interpolation

Figure 4.5 shows an example of the window positioning for speech synthesis in MBE. An analysis window is applied to overlapping frames of speech in order to obtain the parameter set {A_k^{(i)}, ω_0^{(i)}, d_k^{(i)}} representing the spectral magnitudes, fundamental frequency, and voicing decisions for frame i. Synthesis frame i is reconstructed using the signal synthesized with parameters from analysis frame i-1 weighted by the synthesis window w_s(n), and parameters from analysis frame i weighted by the synthesis window w_s(n - N), where 0 ≤ n < N. In MBE, the voiced and unvoiced components of the reconstructed speech signal are synthesized separately. The unvoiced signal for analysis frame i is synthesized by taking the Fourier transform of a white noise sequence, applying a spectral weighting function based on interpolated values of A_k^{(i)} for unvoiced bands (the weight for voiced bands is set to zero), and finally taking the inverse Fourier transform to obtain a time domain waveform, u^{(i)}(n). The synthesized unvoiced waveform is then obtained by

\hat{u}(n) = w_s(n)\, u^{(i-1)}(n) + w_s(n - N)\, u^{(i)}(n), \quad 0 \leq n < N     (4.28)

where u^{(i-1)}(n) and u^{(i)}(n) are the unvoiced signals for frames i-1 and i respectively. For voiced speech, the synthesis is performed on a harmonic-by-harmonic basis in the time domain. To explain voiced synthesis in MBE, we first define the following parameters:

N : the synthesis frame length
ω_0^{(i)} : the fundamental frequency estimated for frame i
M^{(i)} : the number of harmonics used for synthesis in frame i (a function of ω_0^{(i)})
B_k^{(i)} : the magnitude of the kth harmonic estimated for frame i (given by |A_k^{(i)}|)
φ_k^{(i)} : the phase of the kth harmonic estimated for frame i (given by arg[A_k^{(i)}])
d_k^{(i)} : the voicing decision for harmonic k in frame i; d_k^{(i)} = 1 when the kth harmonic of frame i lies in a band declared voiced, otherwise d_k^{(i)} = 0


Figure 4.5: MBE Synthesis Window Positioning

For harmonic k of the synthesized speech signal, there are several reconstruction possibilities. If d_k^{(i-1)} = 0 and d_k^{(i)} = 0, then there will be no voiced harmonic component with frequency kω_0 in the reconstructed signal. If d_k^{(i-1)} = 1 and d_k^{(i)} = 0, then the voiced signal component with frequency kω_0 is given by

s_v(n) = w_s(n)\, B_k^{(i-1)} \cos\left[ \omega_0^{(i-1)} nk + \phi_k^{(i-1)} \right], \quad 0 \leq n < N.     (4.29)

If d_k^{(i-1)} = 0 and d_k^{(i)} = 1, then the voiced signal component with frequency kω_0 is given by

s_v(n) = w_s(n - N)\, B_k^{(i)} \cos\left[ \omega_0^{(i)} nk + \phi_k^{(i)} \right], \quad 0 \leq n < N.     (4.30)

When d_k^{(i-1)} = 1 and d_k^{(i)} = 1, and the fundamental frequency has changed significantly from the previous frame to the current frame, then the voiced signal component with frequency kω_0 is given by

s_v(n) = w_s(n)\, B_k^{(i-1)} \cos\left[ \omega_0^{(i-1)} nk + \phi_k^{(i-1)} \right] + w_s(n - N)\, B_k^{(i)} \cos\left[ \omega_0^{(i)} nk + \phi_k^{(i)} \right], \quad 0 \leq n < N.     (4.31)

The final condition handles the case where the previous and current harmonics are both declared voiced and the change in fundamental frequency is small. When this occurs, it is advantageous to obtain a smoothly evolving waveform by appropriately interpolating the parameters. In this case, the voiced signal component with frequency kω_0 is given by

s_v(n) = B_k(n) \cos[\theta_k(n)], \quad 0 \leq n < N     (4.32)

where B_k(n) = (1 - α) B_k^{(i-1)} + α B_k^{(i)}, and α = n/N. There are several approaches possible for obtaining the interpolated phase function θ_k(n). One approach, shown in section 6.5.5, preserves phase and instantaneous frequency continuity at the frame boundaries and requires cubic interpolation. The approach taken by Griffin and Lim in [22] uses quadratic interpolation which preserves phase continuity at frame boundaries by allowing a small discontinuity in frequency, Δω. The equations used are

\theta_k(0) = \phi_k^{(i-1)}, \qquad \theta_k(N) = \phi_k^{(i)} + 2\pi M^*     (4.33)

\theta_k(n) = \phi_k^{(i-1)} + \left( \omega_0^{(i-1)} k + \Delta\omega \right) n + \frac{ \left( \omega_0^{(i)} - \omega_0^{(i-1)} \right) k }{2N} \, n^2     (4.34)

\Delta\omega = \frac{1}{N} \left[ \phi_k^{(i)} - \phi_k^{(i-1)} + 2\pi M^* - \frac{ \left( \omega_0^{(i-1)} + \omega_0^{(i)} \right) k N }{2} \right]     (4.35)

where 0 ≤ n < N and M^* is the integer which makes |Δω| as small as possible. Note that eqn. (4.35) sets Δω to the smallest possible value which, when used in eqn. (4.34), will guarantee phase continuity at the boundary samples n = 0 and n = N. As in STC, when MBE is used for low rate speech coding, the phase information, φ_k, is not transmitted to the receiver. Instead, it is replaced by a predicted phase, ψ_k, using

\psi_k^{(i)} = \psi_k^{(i-1)} + \frac{ \left( \omega_0^{(i-1)} + \omega_0^{(i)} \right) k N }{2}.     (4.36)

Using a synthesis model based only on predicted phases can result in too much phase coherence, giving the reconstructed speech a "buzzy" quality. To avoid this, the IMBE algorithm includes a mechanism for adding random noise to the phases of the upper voiced harmonics in proportion to the percentage of bands declared unvoiced.

Chapter 5

Non-Square Transform Vector Quantization

5.1 Introduction

In recent years, several techniques for speech coding at rates of 4 kb/s and lower have emerged which require quantization of spectral magnitudes at a set of frequencies which are harmonics of the fundamental pitch frequency of the talker [15, 42, 49]. Because the pitch period is time-varying, the number of components to be quantized changes from frame to frame, making it difficult to directly take advantage of the benefits of vector quantization due to practical limits on codebook storage requirements and volume of training material. For example, an optimal variable-dimension vector quantizer consists of a set of fixed-dimension codebooks, one for each possible vector dimension. The encoding algorithm simply matches the input vector dimension with the codebook having the corresponding dimension. Because each codebook must be trained independently using a sufficiently large training set, the size of the overall training set can be excessively large, especially for spectral magnitude quantization applications which may require sixty or more codebooks for optimal variable dimension vector quantization. Several techniques have been developed to avoid the difficulties associated with quantization of variable dimension vectors. The Inmarsat Multiband Excitation (IMBE) codec [15] uses a complicated encoding scheme with variable bit assignments

and hybrid scalar/vector quantization. In [7], Brandstein presented a method which uses a fixed-order all-pole model to represent the variable length spectral magnitude vector. Recently, a technique called Variable Dimension Vector Quantization (VDVQ) [14] has been proposed and shown to perform better than the IMBE quantization scheme and all-pole modeling. In this chapter, we present a quantization technique called Non-Square Transform Vector Quantization (NSTVQ) [31, 32, 33], which addresses the problems associated with variable-dimension vector quantization by combining a fixed-dimension vector quantizer with a variable-sized non-square transform. Although the NSTVQ approach is a general one which can be applied to any variable-dimension vector quantization problem, we focus here on its application to the problem of harmonic magnitude quantization for speech compression. Experimental results are presented which compare the performance of the proposed NSTVQ approach with that of all-pole modeling, VDVQ, and IMBE magnitude quantization. This chapter is organized as follows: first, we present the details of three approaches which have been used recently in harmonic coding systems for quantization of the variable dimension spectral magnitude vector. Next, an overview of NSTVQ is presented, followed by an analysis of the effect of the transform choice on NSTVQ performance. A brief complexity/storage requirement analysis is then presented, followed by a performance evaluation in which NSTVQ is compared to the three existing procedures discussed earlier in the chapter.

5.2 Existing Approaches

Several methods have been proposed in current harmonic coding systems for handling the quantization of variable length vectors. In this section, we examine three of the most well-known approaches: the IMBE hybrid scalar/vector quantization scheme, all-pole modeling, and variable dimension vector quantization.

5.2.1 IMBE Hybrid Scalar/Vector Quantization

In the IMBE system, 128 bits are used to encode each speech frame of 20 ms. Of these, 45 bits are reserved for error correction, leaving 83 bits for parameter encoding.

The fundamental frequency is encoded with 8 bits, and the binary voicing decisions are encoded with b bits, where b is the number of bands and depends on the fundamental period. The remaining bits are used for harmonic magnitude quantization. Because b can range from 3 to 12, the number of bits available for magnitude encoding ranges from 63 to 75 bits. Table 5.1 summarizes the parameter bit allocation for IMBE.

Parameter               No. of bits
Fundamental Frequency   8
Voicing Decisions       b  (3 <= b <= 12)
Spectral Magnitudes     75 - b

Table 5.1: Bit Allocation for the IMBE Coder

In IMBE, the spectral magnitudes are not quantized directly; instead vector prediction is used to obtain prediction residuals which tend to have lower variance and can therefore be encoded with fewer bits. Because the vector from the previous frame may have a different number of components than the vector of the current frame, the prediction method must use interpolation. Let L(-1) and L(0) be the number of harmonics for the previous and current frames respectively. The prediction residual vector, T_l, l = 1 ... L(0), is given by

T_l = \log_2 M_l(0) - \log_2 \hat{M}_{k_l}(-1)     (5.2)

where M_l(0) is the estimated magnitude for the lth harmonic of the current frame and M̂_l(-1) is the quantized magnitude for the lth harmonic of the previous frame; for non-integer k_l, log_2 M̂_{k_l}(-1) is obtained by linear interpolation between the quantized log-magnitudes at harmonics ⌊k_l⌋ and ⌊k_l⌋ + 1. The value of k_l depends on the quantized fundamental frequencies for the previous and current frames according to

k_l = \frac{ \hat{\omega}_0(0) }{ \hat{\omega}_0(-1) } \, l.

When L(0) > L(-1), there will be more harmonic magnitudes in the current frame than in the previous frame, and the following assumption is used in eqn. (5.2):

\hat{M}_l(-1) = \hat{M}_{L(-1)}(-1) \quad \text{for } l > L(-1).     (5.3)

In other words, when a spectral magnitude is required from the previous frame for a harmonic which does not exist, the last harmonic magnitude is used in its place.

Once the prediction residual vector for the current frame is computed, the L(0) elements are grouped into 6 consecutive blocks, with each block containing J_i samples, where the values of J_i are selected to meet the following constraints:

\sum_{i=1}^{6} J_i = L(0), \qquad |J_i - J_j| \leq 1 \quad \text{for all } i, j.

Each block is then transformed using a DCT of length J_i, and the first (DC) coefficient of each block is vector quantized using a gain/shape vector quantizer with 6 bits for the gain and 10 bits for the shape. The higher order coefficients from each DCT block are then quantized using scalar quantizers, where the number of bits used for each quantizer is calculated based on the remaining bits available for magnitude encoding and the number of DCT coefficients to be quantized. Note that the use of vector quantization in the IMBE encoding algorithm is limited to quantization of the DC values of each block of the log-magnitude residual vector given by eqn. (5.2). The quantization methods presented in the next two sections take advantage of vector quantization more extensively.
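The prediction of eqns. (5.2)-(5.3) can be sketched as follows (our illustration; log-domain linear interpolation is assumed, and any leakage or scaling constants a deployed IMBE coder applies to the predictor are omitted):

```python
import numpy as np

def imbe_residual(logmag_cur, logmag_prev_q, w0_cur, w0_prev):
    """Prediction residual T_l between current log-magnitudes and the
    interpolated quantized log-magnitudes of the previous frame."""
    L0, Lm1 = len(logmag_cur), len(logmag_prev_q)
    T = np.zeros(L0)
    for l in range(1, L0 + 1):
        kl = l * w0_cur / w0_prev            # fractional harmonic index
        k0 = int(np.floor(kl))
        # beyond the last previous harmonic, repeat its magnitude (eqn. (5.3))
        p0 = logmag_prev_q[min(max(k0, 1), Lm1) - 1]
        p1 = logmag_prev_q[min(max(k0 + 1, 1), Lm1) - 1]
        frac = kl - k0
        T[l - 1] = logmag_cur[l - 1] - ((1 - frac) * p0 + frac * p1)
    return T
```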

5.2.2 All-Pole Modeling

The most popular approach to date for dealing with variable dimension spectral magnitude quantization involves fitting a fixed-order all-pole model to the spectral magnitude samples. The model coefficients can then be quantized using a fixed-dimension VQ. The all-pole method is used in STC [37] as well as several other systems including [7, 53, 16]. We start by defining an all-pole model, H(ω), as

H(\omega) = \frac{G}{ \left| 1 + \sum_{k=1}^{M} a_k e^{-jk\omega} \right| }     (5.7)

where G is the filter gain and M is the filter order (the number of poles in the all-pole model). We then define a distortion function which gives the error between the input speech spectrum, S(ω), and the model spectrum, H(ω), as

d[H(\omega), S(\omega)] = \frac{1}{2\pi} \int_{-\pi}^{\pi} S^2(\omega) \left| 1 + \sum_{k=1}^{M} a_k e^{-jk\omega} \right|^2 d\omega.     (5.8)

The filter coefficients {a_k}_{k=1}^{M} are obtained by minimizing d[H(ω), S(ω)] with respect to a_k for each k. It can be shown [36] that the minimization leads to the following system of equations

\sum_{k=1}^{M} a_k R_{i-k} = -R_i, \quad 1 \leq i \leq M     (5.9)

where

R_i = \frac{1}{2\pi} \int_{-\pi}^{\pi} S^2(\omega) \cos(i\omega) \, d\omega.     (5.10)

The optimal gain, G, can then be calculated using

G^2 = R_0 + \sum_{k=1}^{M} a_k R_k.     (5.11)

The set of M linear equations in M unknowns in eqn. (5.9) is identical to eqn. (2.13) from chapter 2, except that the autocorrelation matrix is computed using eqn. (5.10). Again, efficient algorithms for computation of {a_k}_{k=1}^{M} are available which exploit the Toeplitz nature of the autocorrelation matrix. When applying all-pole modeling to the problem of spectral magnitude quantization in harmonic coders, the continuous spectrum of the input speech signal, S(ω), is only available at a discrete set of harmonically related frequencies. In this case, the distortion criterion given by eqn. (5.8) can be made discrete according to

d[H(\omega), S(\omega)] = \sum_{l=1}^{L} S^2(l\omega_0) \left| 1 + \sum_{k=1}^{M} a_k e^{-jkl\omega_0} \right|^2     (5.12)

where L is the number of harmonics and ω_0 is the fundamental frequency. Minimization of eqn. (5.12) again leads to eqn. (5.9), except that the autocorrelation coefficients are defined in terms of the discrete frequencies as

R_k = \sum_{l=1}^{L} S^2(l\omega_0) \cos(k l \omega_0).     (5.13)

The procedure for quantization of a set of harmonically related spectral magnitudes, {S(lω_0)}_{l=1}^{L}, using all-pole modeling can be summarized as follows:

Compute the autocorrelation coefficients, Rk, k = 0 . . . M, according to eqn. (5.13).

Using the Levinson-Durbin algorithm, compute the optimal model coefficients, a_k, k = 1 ... M, according to eqn. (5.9).

Using the values of a_k, compute the optimal gain, G, according to eqn. (5.11).

Quantize G using a scalar quantizer, and quantize the M-dimensional vector a = [a_1, ..., a_M] using any of the well-known LPC quantization methods (for example, MSVQ quantization of the corresponding LSP coefficients).

For a good spectral fit, the number of frequency points must be large compared to the order used for the all-pole model [7]. This constraint is clearly not met in the case of high-pitched speakers when typical model orders of 10 to 16 are used. In order to solve this problem, interpolation between the known spectral magnitude samples is performed. In [37] cubic interpolation is used; linear interpolation in the log domain is used in [7]. Figure 5.1 illustrates the problem associated with high-pitched speakers. In fig. 5.1(a), the actual harmonic magnitudes (x) and underlying spectral envelope (solid line) are shown for a speaker with a fundamental frequency of ω_0 = 0.1π (P = 20 samples). Superimposed on this plot is a dashed line indicating the spectrum obtained using 10th order LPC modeling with the actual harmonic magnitudes used in eqn. (5.13). The modeled harmonic magnitudes, which are samples of the model spectrum at the harmonics of ω_0, are also shown (o). In fig. 5.1(b), the same plot is shown except that a total of 70 samples obtained using linear interpolation between the logarithms of the actual magnitudes was used in eqn. (5.13). It is clear from the plot that although the model spectra using the actual and interpolated samples are different, the interpolation procedure results in a significant improvement when the error between the actual and modeled harmonic magnitudes is considered. An alternative approach to interpolation which has been used to address the problem of all-pole modeling applied to high-pitched speakers was presented in [13]. Here, a discrete Itakura-Saito distortion measure was used to find the optimal model coefficients. In this case, no closed-form solution for the coefficients exists, and the optimal values are found using an iterative procedure.
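The all-pole fitting procedure above is compact enough to sketch end-to-end (our illustration; scipy's Toeplitz solver stands in for an explicit Levinson-Durbin recursion):

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def allpole_from_harmonics(mags: np.ndarray, w0: float, M: int = 10):
    """Fit an M-pole model to harmonic magnitudes S(l*w0), l = 1..L,
    via eqns. (5.13), (5.9), and (5.11)."""
    L = len(mags)
    l = np.arange(1, L + 1)
    # Discrete autocorrelation coefficients, eqn. (5.13)
    R = np.array([np.sum(mags**2 * np.cos(k * l * w0)) for k in range(M + 1)])
    # Normal equations (5.9): a Toeplitz system in a_1..a_M
    a = solve_toeplitz(R[:M], -R[1 : M + 1])
    # Optimal gain, eqn. (5.11)
    G = np.sqrt(R[0] + np.dot(a, R[1 : M + 1]))
    return a, G
```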

5.2.3 Variable Dimension Vector Quantization

In [14], a new method for quantization of the spectral magnitude vector was presented, called Variable-Dimension Vector Quantization (VDVQ). In VDVQ, the vector quantizer codebook consists of a set of fixed-length vectors, each having N elements.



Figure 5.1: Actual speech spectrum (solid) and LPC-10 modeled spectrum (dashed), with actual (x) and modeled (o) harmonic magnitudes indicated, for a speaker with a pitch period of 20 samples. In (a), no interpolation was used before modeling; in (b), 70 interpolated magnitude samples were used.

Each codebook vector can be considered to represent a spectrum sampled at N frequencies, ω_n = nπ/N. The distortion between a variable-length candidate spectral magnitude vector x = (x[1] ... x[L])^t and the ith codebook vector y_i = (y_i[1] ... y_i[N])^t is given by

d(x, y_i) = \sum_{l=1}^{L} \left( x[l] - y_i[k(l, \omega_0)] \right)^2     (5.14)

where L is the number of harmonics in x. The function k(l, ω_0) uses the harmonic number l and the fundamental frequency ω_0 to find the index, n, for which |ω_n - lω_0| is minimized. In other words, eqn. (5.14) gives the mean squared error between the candidate magnitude x[l] at each frequency lω_0 and the codevector magnitude y_i[n] corresponding to the frequency closest to lω_0. In [14], VDVQ is used in an IMBE codec to replace the IMBE quantization method presented in section 5.2.1. The variable dimension spectral magnitude vectors are encoded using a total of 30 bits. The fixed dimension codevectors of length N = 128 are split into two 10-bit codebooks corresponding to frequency ranges of 64 Hz ≤ f ≤ 1500 Hz and 1500 Hz ≤ f ≤ 3600 Hz respectively. A 2-dimensional, 10-bit VQ is used to encode the mean signal level (in dB) in each frequency range. Note that the lower frequency range is purposely made smaller in order to reduce distortion over the perceptually more important lower frequencies.
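The index mapping k(l, ω_0) and the distortion of eqn. (5.14) translate directly into code (our sketch; ω_0 is in radians per sample and the codevector grid is ω_n = nπ/N):

```python
import numpy as np

def vdvq_distortion(x: np.ndarray, y: np.ndarray, w0: float) -> float:
    """Eqn. (5.14): squared error between harmonic magnitudes x[l] at l*w0
    and the nearest-frequency samples of a fixed-length codevector y."""
    N = len(y)
    l = np.arange(1, len(x) + 1)
    n = np.clip(np.round(l * w0 * N / np.pi).astype(int), 0, N - 1)  # k(l, w0)
    return float(np.sum((x - y[n]) ** 2))

def vdvq_search(x: np.ndarray, codebook: np.ndarray, w0: float) -> int:
    """Return the index of the codevector minimizing eqn. (5.14)."""
    return int(np.argmin([vdvq_distortion(x, y, w0) for y in codebook]))
```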

5.3 NSTVQ System Overview

The remainder of this chapter presents NSTVQ, an alternative to the three methods described in the preceding sections. The presentation of NSTVQ is followed by a performance evaluation in which NSTVQ is compared to IMBE spectral magnitude quantization, all-pole modeling, and VDVQ.

5.3.1 System Block Diagram

In a typical harmonic magnitude encoding application, a frame of speech data is analyzed in order to extract a pitch period, P^{(i)}, and a harmonic log-magnitude vector y^{(i)} of dimension N^{(i)}, where i is the frame index. The number of harmonics extracted, and thus the dimension, is usually a simple function of P^{(i)}. For example, only the spectral magnitudes for harmonic frequencies between 64 Hz and 3600 Hz may be considered perceptually important in a spectral domain speech coder. The NSTVQ encoder we propose for quantization of the log-magnitude vector at frame i is shown in figure 5.2(a). The vector dimension N^{(i)} is used by the switch to select from a set of L non-square transformation matrices which are fixed and known at both the encoder and decoder.


Figure 5.2: Block diagram of the Non-Square Transform Vector Quantization (NSTVQ) (a) encoder and (b) decoder.

The selected matrix, B_l, is called the forward transformation matrix and has dimension M x N^{(i)}, where M is the fixed dimension of the vector quantizer. The variable-length log-magnitude vector y^{(i)} is transformed into a fixed-length vector z^{(i)} using the operation

z^{(i)} = B_l \, y^{(i)}.     (5.15)

The M-dimensional vector z^{(i)} is then quantized using a standard fixed-length vector quantizer. The codebook index and input vector dimension must be transmitted to the decoder, which is shown in figure 5.2(b). Note that for spectral domain coders, the input vector dimension can be derived from the pitch period, which is always transmitted to the decoder; therefore no extra bits are required for vector dimension transmission. At the decoder, the codebook index is used to obtain the quantized fixed-dimension vector ẑ^{(i)}. The vector dimension N^{(i)} is then used to select the optimal inverse transformation matrix A_l. Finally, the quantized log-magnitude vector ŷ^{(i)} of dimension N^{(i)} is obtained using the operation

\hat{y}^{(i)} = A_l \, \hat{z}^{(i)}.     (5.16)

When the dimension of the vector y^{(i)} is larger than the fixed vector dimension M, there will in general be distortion between the original and reconstructed vectors even when the fixed-length vectors are not quantized. The distortion due to this dimension conversion is called modeling distortion. Under some conditions detailed below, the forward transformation matrix which minimizes the modeling distortion is simply the transpose of the inverse transformation matrix for the corresponding vector dimension. There are several advantages of the NSTVQ method over existing methods. These advantages are summarized below and will be discussed in more detail in this chapter.

1. The fixed vector dimension M can be used to trade modeling distortion for complexity reduction. When M < N the technique becomes equivalent to generalized least-squares estimation and therefore is optimal in the sense of minimizing the squared error. When M > N, the modeling error is guaranteed to be zero.

2. Because fixed-length vectors are being quantized, vector prediction can be used in a straightforward manner without requiring vector interpolation.

3. NSTVQ fixed length vectors have the property that for any value of L < M, the first L coefficients of the length M vector are themselves the optimal length L vector when orthonormal transforms are used. This introduces the possibility of using NSTVQ for embedded coding.

In the following sections, we present the details of the NSTVQ system.

5.3.2 Choice of an Inverse Transform

In this section, we derive the relationship between the forward and the inverse transform. Let y be a vector of length N, where N is variable, and let A be the corresponding inverse matrix in eqn. (5.16). Note that for clarity, the frame index (i) and the matrix index l are dropped from the following discussion. We start by assuming the N x M matrix A is known and find a fixed-length M-dimensional vector z which can be used to compute an estimate of y using the transformation y_m = Az. For any given A, our goal is to minimize the mean squared error distortion criterion D_m(y, y_m) = \frac{1}{N} \| y_m - y \|^2 with respect to z. It can be shown that the vector z_{opt} which minimizes D_m(y, y_m) is obtained as the solution to the following set of linear equations:

(A^T A) \, z_{opt} = A^T y.     (5.17)

A solution to this equation can always be found regardless of the rank of A using one of the linear algebra techniques for inverting ill-conditioned matrices, for example Singular Value Decomposition (SVD). However, there are two important cases where we can obtain an explicit solution. In the first case, N ≥ M and A is of rank M (i.e., the M columns of A are linearly independent). In this case the M x M matrix A^T A is of full rank, and therefore has an explicit inverse which gives a unique solution vector z_{opt}:

z_{opt} = (A^T A)^{-1} A^T y.     (5.18)

In the second case, N < M and A is of rank N (i.e., the N rows of A are linearly independent). In this case eqn. (5.17) is under-determined and therefore has no unique solution. One particular solution vector, z_{min}, is interesting because it has the minimum norm of any vector in the solution set. It can be shown that

z_{min} = A^T (A A^T)^{-1} y.     (5.19)

Another important solution for the case of N < M can be obtained by truncating the last M - N columns of A to create a square N x N matrix. We can then solve for the first N elements of z_{opt} using eqn. (5.18), and set the last M - N elements to zero. These zero-padded elements of z_{opt} are irrelevant to the reconstruction of ŷ and therefore should be ignored during vector quantizer training and codebook searching.

Both approaches produce a solution vector which results in zero distortion; however, the performance after vector quantization of z_{opt} will in general be different. In fact, it was found experimentally that the zero-padded solution works well when combined with vector quantization. One further restriction on A can be made in order to reduce the complexity involved in computing the optimal solution vector z_{opt}. If the columns of A are orthonormal, then a general solution for z_{opt} can be written as

z_{opt} = A^T y.     (5.20)

Here, for the case N < M, A is defined as

A = [\, a_1 \; a_2 \; \cdots \; a_N \; | \; 0 \,]     (5.21)

where a_i are orthonormal column vectors and 0 is an N x (M - N) all-zero matrix. The last two relations give the choice for the forward transform: B = A^T. The dimension conversion from N to M can also be viewed in terms of generalized least-squares curve fitting by considering the vector y to be an N-dimensional signal which is modeled using a sum of M weighted basis functions (the orthonormal columns of A). The vector z is a linear projection of y onto an M-dimensional subspace. When M < N, a modeling distortion, D_m, is introduced due to the dimension reduction. Using this formulation, it can be shown that the choice of the inverse matrix indicated above is equivalent to optimal (MSE) curve fitting using the basis functions defined by the columns of A.
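Section 5.3.2 in code (our sketch; the DCT-II basis of section 3.6.2 is used as the fixed orthonormal transform, and the zero-padding convention above handles the case N < M):

```python
import numpy as np

def dct2_matrix(N: int) -> np.ndarray:
    m, n = np.arange(N)[:, None], np.arange(N)[None, :]
    k = np.where(m == 0, 1 / np.sqrt(2), 1.0)
    return np.sqrt(2.0 / N) * k * np.cos((2 * n + 1) * m * np.pi / (2 * N))

def nstvq_forward(y: np.ndarray, M: int) -> np.ndarray:
    """Variable dimension N -> fixed dimension M (eqns. (5.15)/(5.20))."""
    N = len(y)
    A = dct2_matrix(N).T[:, :min(M, N)]     # first min(M,N) orthonormal columns
    z = A.T @ y                             # z_opt = A^T y
    if M > N:                               # zero-pad: elements the VQ ignores
        z = np.concatenate([z, np.zeros(M - N)])
    return z

def nstvq_inverse(z: np.ndarray, N: int) -> np.ndarray:
    """Fixed dimension M -> variable dimension N (eqn. (5.16))."""
    A = dct2_matrix(N).T[:, :min(len(z), N)]
    return A @ z[:A.shape[1]]
```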

5.3.3 Vector Quantizer Design for NSTVQ

The non-square transformation derived in section 5.3.2 converts a variable-length vector y into a vector z which can be encoded using a fixed-dimension VQ. The quantized fixed-length vector z_q is then transformed into the quantized variable-length vector ŷ using the inverse transform of eqn. (5.16).


We trained our vector quantizers using the generalized Lloyd algorithm (GLA) presented in chapter 3. A training set of size L consists of fixed-length vectors z_l and corresponding vector lengths N_l, where l = 1 ... L. Given an initial codebook of size K with entries c_k, k = 1 ... K, we encode the training set by assigning each vector z_l to partition S_k if D_q(z_l, c_k) is minimum over all codebook entries. The centroid rule for computing the new kth codebook entry c'_k is given by

c'_k[n] = \frac{ \sum_{z_l \in S_k} p_l[n] \, z_l[n] }{ \sum_{z_l \in S_k} p_l[n] }

where zl[n] is the nth element of the lth training vector. pl [n] are the components of a vector which eliminates zero-padded elements from the distortion calculation, and are defined as 1 if n 5 min(N, M) ~lbl= 0 otherwise
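A compact sketch of this masked centroid update (my own code and names; it assumes each training vector is stored zero-padded to length M alongside its original dimension N_l):

```python
import numpy as np

def masked_centroid(Z, lengths, M):
    """Z: (L, M) zero-padded training vectors; lengths: original dimensions N_l."""
    num = np.zeros(M)
    den = np.zeros(M)
    for z, N in zip(Z, lengths):
        p = (np.arange(M) < min(N, M)).astype(float)  # the mask p_l[n]
        num += p * z
        den += p
    return num / np.maximum(den, 1.0)  # positions never used stay zero
```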

5.4 Choosing the Transformation Matrices in NSTVQ

In section 5.3.3 we showed that the total distortion of an NSTVQ system is made up of modeling distortion (D_m) due to dimension conversion and quantization distortion (D_q) due to the VQ. In this section, we show that the choice of transform affects both components of the total distortion. Results comparing the performance of several transforms in terms of minimizing each type of distortion are presented.

5.4.1 Reducing the Modeling Distortion

In section 5.3.2, we derived the equations which use a transformation matrix A to compute the fixed-length vector z_opt from the variable-length vector y using a mean squared error criterion. It can be shown for stationary stochastic signals that the basis functions (columns of A) which minimize the modeling error are given by the columns of the Karhunen-Loève Transform (KLT) matrix. In fact, the KLT is often derived in exactly this way: by determining the set of basis functions which minimize the MSE between a signal and its representation using a truncated set of basis functions (see section 3.6.1).

To see how the KLT can be used to minimize the modeling distortion, consider a set of vectors to be quantized using an NSTVQ system. At frame i, the input vector y_i is formed by taking N_i samples from a discrete stationary stochastic process S_l where l = N_i, and N_i can take on one of L possible values. In other words, the input to the NSTVQ system consists of a set of variable-dimension vectors where vectors of the same dimension are formed from the same stationary stochastic source. To minimize the NSTVQ modeling distortion for this case, each matrix A_l (see fig. 5.2) should be formed using the first M basis functions of the KLT computed from the statistics of random process S_l.

When the set of vectors to be quantized consists of speech spectral magnitudes, we cannot assume that vectors of similar dimension are formed from the same underlying stationary stochastic process. The KLT, therefore, cannot be used directly as in the previous example. However, if we assume that the underlying processes are locally stationary, we can estimate the statistics at each frame and use these estimates to compute KLT-based transforms. Of course, the changing statistical estimates imply that the basis functions would have to be transmitted to the receiver every frame, making KLT-based transforms impractical in real-world systems. A better solution is to find a set of basis functions which are fixed and therefore known at both the encoder and decoder.

We examined the effect of transform choice on the modeling error by evaluating the spectral distortion between original and modeled speech spectra using various transforms and various values of M. We tested the first and second forms of the discrete cosine transform (DCT-I and DCT-II), a transform made from orthogonal polynomial basis functions (OPT), and the discrete Hartley transform (DHT). Details for these transforms are given in section 3.6. All the transforms tested are orthonormal, and therefore matrix inversion during the NSTVQ procedure was avoided. We included the KLT-based transform (KLTB) discussed above as a reference. The autocorrelation function used to compute the KLTB transform was estimated each frame using the biased estimate

R(k) = \frac{1}{N} \sum_{n=0}^{N-|k|-1} y[n] y[n+|k|]   (5.31)

where y[n] is the nth element of the log-magnitude vector and N is the vector dimension.
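As an illustration of how such a per-frame KLTB basis could be computed from the biased autocorrelation estimate of eqn. (5.31) (a sketch under my reading of the garbled formula; the function names are mine):

```python
import numpy as np
from scipy.linalg import toeplitz

def kltb_basis(y, M):
    """KLT-based basis for one log-magnitude vector y, keeping <= M columns."""
    N = len(y)
    r = np.array([np.dot(y[:N - k], y[k:]) / N for k in range(N)])  # eqn. (5.31)
    w, V = np.linalg.eigh(toeplitz(r))    # eigenvectors of the Toeplitz matrix
    order = np.argsort(w)[::-1]           # decreasing eigenvalue order
    return V[:, order[:min(N, M)]]
```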

The results obtained from a large speech database are shown in figure 5.3.

Figure 5.3: Spectral distortion due to NSTVQ modeling versus fixed vector dimension M for various orthonormal transforms.

As expected, the modeling distortion approaches zero for all transforms as the fixed vector dimension M approaches N_max, the maximum vector dimension in the test set (for our data N_max = 55). Although the local statistical estimate given by eqn. (5.31) is quite crude, the KLTB transform still performed better than any other transform for all values of M except M = 10, where the polynomial-based OPT gave the lowest modeling distortion. In fact, the OPT produced lower modeling distortion than any other transform (except KLTB) for all values of M, although for M > 20 the DCT-II performance was very close to that of the OPT. It is interesting to note that the DCT-II performs significantly better than the DCT-I for small M. For the DCT-II, the elements of A from eqn. (5.21), a_i[n] for n = 1 ... N, are given by:

a_i[n] = \left(\frac{2}{N}\right)^{1/2} C_i \cos\left(\frac{(2(n-1)+1)\pi(i-1)}{2N}\right)   (5.32)

where C_i = 1 when i ≠ 1, and C_i = 1/\sqrt{2} when i = 1.
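A quick numerical check of this formula as reconstructed here (my reading of the OCR-damaged equation): with these a_i[n], the columns of A are indeed orthonormal.

```python
import numpy as np

N = 16
A = np.zeros((N, N))
for i in range(1, N + 1):                      # 1-based indices, as in the text
    C = 1.0 / np.sqrt(2.0) if i == 1 else 1.0
    for n in range(1, N + 1):
        A[n - 1, i - 1] = np.sqrt(2.0 / N) * C * np.cos(
            (2 * (n - 1) + 1) * np.pi * (i - 1) / (2 * N))
print(np.allclose(A.T @ A, np.eye(N)))         # True
```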

5.4.2 Embedded Coding Property of NSTVQ

When orthonormal column vectors are used in A, the nth element of the fixed-length vector z_opt is given by

z_opt[n] = a_n^T y   (5.33)

and therefore depends only on the nth basis function. Furthermore, the value of M can be reduced by simply truncating the basis functions a_i for i > M. Because of this property, for any value of L < M, the first L coefficients of the length-M vector z_opt are themselves the optimal length-L vector when orthonormal transforms are used. This introduces the possibility of using NSTVQ for embedded coding. For example, consider a speech codec in which the fixed-length vector z_opt is split into two segments of length M_1 and M_2 and quantized with two separate vector quantizers. A network transmitting data from this codec could drop the bits associated with the upper segment during network congestion. In this case, the receiver would be reconstructing the variable-length vector using a fixed-length vector of dimension M = M_1 rather than M = M_1 + M_2. The truncated fixed-length vector would be the optimal vector in terms of minimizing the squared error for NSTVQ with M = M_1.
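The truncation property is easy to verify numerically (my own sketch, reusing the DCT-II construction from section 5.4.1): the first 12 coefficients computed with M = 30 coincide with the coefficients computed directly with M = 12.

```python
import numpy as np

def dct2_columns(N, count):
    n = np.arange(N)
    return np.stack([np.sqrt(2.0 / N) * (1 / np.sqrt(2) if i == 0 else 1.0)
                     * np.cos((2 * n + 1) * np.pi * i / (2 * N))
                     for i in range(count)], axis=1)

y = np.random.randn(40)
z30 = dct2_columns(40, 30).T @ y
z12 = dct2_columns(40, 12).T @ y
print(np.allclose(z30[:12], z12))   # True: the embedded coding property
```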

5.4.3 A Multi-Source Coding Formulation

Even when the modeling distortion is zero (i.e., M ≥ N_max, where N_max is the maximum allowable input vector dimension), the performance of the VQ in NSTVQ can be improved by the choice of the transforms. This is not an obvious statement, and in order to justify it we will introduce a multi-source coding model. A block diagram of the model is shown in figure 5.4. The model consists of N independent stationary random sources which produce sample vectors z_n^(i) at each time instant t = iT, where i is a positive integer. The statistics of each random source are given by the multivariate density functions f_{z_n}(z_n). At each time instant t = iT, the switch selects for input to the encoder the sample vector z_n^(i) with probability p_n. For simplicity, we will drop the time index i. Because the random vectors z_n, n = 1 ... N, are stationary and independent, the multivariate pdf for the random vector x (see figure 5.4) is given by

f_x(x) = \sum_{n=1}^{N} p_n f_{z_n}(x)   (5.34)

Figure 5.4: Block diagram of a source model consisting of multiple sources of random vectors with independent statistics. At each time instant t = iT a source z_n is selected by the switch with probability p_n.

We will now define a source encoder which is a generalization of NSTVQ called Vector Multi-Transform Quantization (VMTQ). Our goal is to determine the effect of a set of orthonormal transforms on vector quantizer performance. The basic structure of VMTQ is shown in figure 5.5. At each time instant t = iT, the input vector x^(i) is classified by a process recognizer in order to choose a transform A_n from a set of orthonormal transforms. The time superscript (i) is dropped in the following discussion for clarity. The selected transform is applied to the input vector x to obtain the transformed vector y. The transformed vector is then quantized using a VQ common to all transforms. The transmitted parameters are the classification index and the index of the optimal vector quantizer codevector. An alternate system could avoid the use of transforms completely by using a unique VQ for each possible class of input vector; however, this is often impractical due to huge storage requirements. Note that for the case of N = 1, VMTQ reduces to Vector Transform Quantization (VTQ) [10], except that split VQs would typically be used rather than the single VQ shown in figure 5.5. For the case of a single full-complexity VQ, it is well known that VTQ cannot achieve a coding gain over a system with no transform which uses a full-complexity VQ to directly encode the input vectors. For a proof, see section

Figure 5.5: Basic structure of a Vector Multi-Transform Quantization encoder.

3.6. Looking at this result from another point of view will help to illustrate the idea behind VMTQ. Vector quantization asymptotic theory [18] states that the minimum MSE distortion for an N-point VQ, D(N), applied to the source y is given by

D(N) = C N^{-2/k} \| f_y(y) \|_{k/(k+2)}   (5.35)

where k is the VQ dimension, C is constant for a given N and k, and \| f_y(y) \|_{k/(k+2)} is the L_{k/(k+2)} norm of f_y(y). The L_p norm of a continuous function is defined as:

\| f(y) \|_p = \left[ \int f(y)^p \, dy \right]^{1/p}   (5.36)

The performance of a VQ, then, depends on the probability density function (pdf) of the input vector. If f_x(x) is the pdf of the random vector x, and y = Ax where A is an invertible matrix, then it can be shown that

f_y(y) = \frac{1}{|J(A)|} f_x(A^{-1} y)   (5.37)

where J(A) is the Jacobian of the transformation A. If A is orthonormal, J(A) = 1 and eqn. (5.37) reduces to

f_y(y) = f_x(A^T y)   (5.38)

Substituting eqn. (5.38) into (5.36) shows that the L_p norm is invariant to an orthonormal transformation. This leads to the same result given in section 3.6: applying a single orthonormal transformation to a random vector will not improve the minimum obtainable distortion for a full-complexity VQ. When N > 1, however, a set of N transforms in VMTQ can be used to obtain a coding gain for signals described by the model shown in figure 5.4. Referring to figure 5.5, we recall that the pdf of x is given by eqn. (5.34). Assuming perfect classification, in which input vectors belonging to the same class are transformed using the same orthonormal transformation, and making use of eqn. (5.37), the pdf of the transformed vector y is given by

f_y(y) = \sum_{n=1}^{N} p_n f_{z_n}(A_n^T y)   (5.39)

Although the application of the orthonormal transforms to each class of input vector cannot change the norm of each individual density function, f_{z_n}(z_n), the fact that the transforms are different can change the norm of the overall density function, f_y(y). It is important to note that any performance improvements obtained due to the change in f_y(y) come at the expense of the rate required to transmit the classification index. However, for some applications, the classification information is already available at the receiver. For example, when the VMTQ model is applied to quantization of harmonic spectral magnitude vectors, one approach is to use the vector dimension to indicate the class. In this case, the vector dimension can be derived from the pitch period, which must in any case be transmitted to the receiver. In other applications, the class of input vector presented to the encoder may vary slowly, and therefore the classification index need not be transmitted every frame.

The method by which a coding gain can be obtained using a VMTQ system with classification information available at the receiver is best illustrated by the following example. Consider the VMTQ system shown in fig. 5.6. The input vector to the encoder, x, consists of vectors of dimension 2 taken with equal probability from two zero-mean, unit-variance Markov-I processes z_1 and z_2 with correlation coefficients ρ_1 and ρ_2 respectively. Examples of the corresponding pdfs, f_{z_1}(z_1) and f_{z_2}(z_2), are shown as contour plots in fig. 5.7(a) and fig. 5.7(b) for ρ_1 = 0.8 and ρ_2 = -0.8. The pdf of x, which is input to the VMTQ system, is shown in fig. 5.7(c). The two transforms, KLT_1 and KLT_2, are obtained by computing the KLT for each of the two input processes. In this example, we assume perfect classification, which is indicated in the figure by the connection between the two switches. Figure 5.8 shows the pdfs for the vectors z_1 and z_2 after the KLT transformations, which are now identical to the pdf of y. It is obvious from comparison of figures 5.7(c) and 5.8 that the use of multiple transforms has resulted in a pdf which can more easily be quantized by an optimal VQ.

Figure 5.6: Example of a multiple transform VTQ system with two Markov-I inputs.

Figure 5.7: Joint probability density functions for (a) a Markov-I process with ρ = 0.8, (b) a Markov-I process with ρ = -0.8, and (c) the process obtained by selecting vectors from (a) and (b) with equal probability.

Figure 5.8: Joint probability density function after the KLT transformation for a Markov-I process with ρ = 0.8, a Markov-I process with ρ = -0.8, and the process obtained by selecting vectors from each source with equal probability.

This can be shown analytically by computing the covariance matrices of x and y in terms of the correlation coefficients of z_1 and z_2. Using the fact that the k/(k+2) norm of a Gaussian pdf is proportional to the kth root of the determinant of the covariance matrix [18, 10], we can show that

G_vmtq = \left[ \frac{1 - \left(\frac{\rho_1 + \rho_2}{2}\right)^2}{1 - \left(\frac{|\rho_1| + |\rho_2|}{2}\right)^2} \right]^{1/2}   (5.40)

where G_vmtq is the coding gain obtained due to the use of multiple transforms when the classification index is assumed available at the receiver. Looking at eqn. (5.40), it can be seen that for the VMTQ system described above, performance improvement over direct VQ can be obtained when the correlation coefficients differ in sign. For the special case of ρ_1 = -ρ_2, eqn. (5.40) reduces to

G_vmtq = \left[ \frac{1}{1 - \rho_1^2} \right]^{1/2}   (5.41)

In order to test the theoretical result given in eqn. (5.41), a large training set of vectors of dimension 2 was obtained by selecting with equal probability from two Markov-I sources with ρ_1 = -ρ_2. The vectors were then used to train a single 2-bit (1 bit per sample) VQ, and the total MSE distortion, D, was measured. Next, the same training set was transformed with the multiple-transform VTQ system described above, with the classification index assumed known at the receiver. The resulting VQ distortion, D_m, was then measured, and the experimental coding gain, G_vmtq = D/D_m, was compared to the theoretical result given by eqn. (5.41) for various correlation values. The coding gains are plotted for each correlation coefficient in fig. 5.9, which indicates that the experimental and theoretical results match quite closely.

Figure 5.9: Comparison of theoretical vs. experimental coding gain vs. ρ for a VMTQ system with two Markov-I inputs and KLT transforms. The VMTQ classification index is assumed to be available at the receiver.
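The experiment can be re-created in a few lines (a simplified sketch, not the original test harness; the k-means VQ, seed, and sample sizes are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def markov1_pairs(rho, n):
    x1 = rng.standard_normal(n)
    x2 = rho * x1 + np.sqrt(1.0 - rho ** 2) * rng.standard_normal(n)
    return np.stack([x1, x2], axis=1)

def kmeans_mse(X, K=4, iters=50):
    """Train a K-point (2-bit) VQ with Lloyd iterations; return MSE per vector."""
    C = X[rng.choice(len(X), K, replace=False)].copy()
    for _ in range(iters):
        idx = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(1)
        for k in range(K):
            if np.any(idx == k):
                C[k] = X[idx == k].mean(0)
    return ((X - C[idx]) ** 2).sum(-1).mean()

def klt(Z):
    _, V = np.linalg.eigh(np.cov(Z.T))
    return Z @ V[:, ::-1]          # project onto eigenvectors, largest first

rho = 0.8
Z1, Z2 = markov1_pairs(rho, 20000), markov1_pairs(-rho, 20000)
X = np.concatenate([Z1, Z2])       # direct VQ input
Y = np.concatenate([klt(Z1), klt(Z2)])   # per-class KLT, perfect classification
print("experimental gain:", kmeans_mse(X) / kmeans_mse(Y))
print("theoretical gain :", (1.0 / (1.0 - rho ** 2)) ** 0.5)
```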

5.4.4 Reducing the Quantization Distortion

Based on the discussion in the previous section, we see that to maximize the coding gain based on a multi-source formulation we should group the input vectors into classes with similar statistics and then design a separate transform matrix for each class. NSTVQ applied to speech coding applications can be considered as a VMTQ system where the classification is made based on the dimension of the input spectral magnitude vector. While this classification is handy in that it eliminates the problems associated with variable-dimension vector quantization, it is likely not the best way to select input vectors with similar underlying statistics. However, test results show that even when the classification is based only on vector dimension, the VMTQ structure still provides some coding gain. This test can be done by comparing the MSE distortion for NSTVQ using a transform such as the DCT-II with an NSTVQ system which uses the identity transform (i.e., no transform). If the set of DCT-II transforms alters the statistics of the input speech spectra in such a way as to make the VQ more efficient, this will be reflected as a coding gain (the classification index is simply the vector dimension and need not be transmitted to the receiver). Figure 5.10 shows the results of this experiment for various bit rates, where the coding gain is obtained by dividing the NSTVQ distortion using no transform by the distortion obtained when using the DCT-II. The fixed dimension M was set equal to the maximum dimension in the input data set for this experiment, to ensure zero modeling error. From the figure we can see that the coding gain due to the multiple transforms ranges between about 1.25 and 1.35 for the rates shown.

Figure 5.10: Experimental coding gain of NSTVQ due to the VMTQ structure for various bit rates.

We now present experimental results which compare the performance of the transforms discussed in section 5.4.1 from the point of view of reducing quantizer distortion. Using the environment described in section 5.6.1, we compared the NSTVQ distortion due only to the vector quantizer for various transforms. Because modeling distortion is not considered here, any performance differences in quantizer distortion must be due to the VMTQ structure of NSTVQ. The results are shown in figures 5.11 and 5.12. In figure 5.11, the quantizer distortion versus M is shown for low-rate quantization (18 bits/spectrum). In this case, the lowest distortion is obtained when using DCT-I basis functions. The DCT-II, OPT, and DHT perform similarly, especially for higher values of M. It is interesting to note that the KLTB transform, which was the best performer in terms of reducing the modeling distortion, is the worst performer in terms of reducing quantizer distortion. This clearly highlights the fact that the two roles of the transform in NSTVQ discussed above can have conflicting requirements for distortion minimization.


Figure 5.11: Spectral Distortion due to a low-rate (18 bits/spectrum) quantizer versus fixed vector dimension M for various orthonormal transforms.

Figure 5.12 shows the quantizer distortion versus M for high-rate quantization (66 bits/spectrum). At this rate, the distortion is almost zero for all transforms when the dimension of the quantizer is small. For most values of M, the DCT-I once again gives the lowest distortion, although the performance of the DCT-II, OPT, and DHT is very close. Only for M > 40 and high-rate quantization does the KLTB perform slightly better than the other transforms in terms of minimization of quantization error.

Figure 5.12: Spectral distortion due to a high-rate (66 bits/spectrum) quantizer versus fixed vector dimension M for various orthonormal transforms.

5.4.5 Reducing the Total Distortion

The previous two sections have shown that the set of transforms in NSTVQ impacts both the modeling distortion and the quantization distortion, in different ways.

This indicates that the best choice for the transforms may depend on the fixed vector dimension M. When M is small, the modeling distortion will dominate the total distortion, and we will want to choose a transform which minimizes the modeling error. For large M and low rate, the modeling distortion is negligible and any coding gain will come from the VMTQ structure of NSTVQ; in this case, we will want to choose a transform which minimizes the quantization distortion. These points are illustrated in figures 5.13 and 5.14. Figure 5.13 shows the total distortion (modeling and quantization) for M = 10 at rates ranging from 18-66 bits/spectrum. For all but the lowest rates, the quantizer distortion is very small and the total distortion is dominated by the modeling distortion. As expected from the results given in section 5.4.1, the OPT out-performs the other transforms. Figure 5.14 also shows the total distortion, but this time for M = 30. In this case, the total distortion is a combination of modeling and quantizer distortion. At low rates, the advantage of the KLTB at reducing modeling distortion is canceled by its relatively poor performance with respect to quantizer distortion. A similar situation occurs with the DCT-I, which achieves low quantizer distortion at the expense of high modeling distortion. At high rates, the quantizer distortion becomes less of a factor and the transforms which give low modeling distortion, for example the KLTB, perform better.

Based on the results above, it is clear that for the transforms evaluated, the best choices for practical quantization of spectral magnitude data using NSTVQ with typical values of M are the DCT-II and OPT. For applications which achieve low complexity through the use of very small values of M, the best choice of transform is the OPT. We now look at the impact of the fixed dimension M on the total distortion for the DCT-II transform. Figure 5.15 shows the modeling, quantizer, and total distortion for the DCT-II versus the fixed dimension M when NSTVQ is used to quantize speech spectral magnitudes at a rate of 30 bits/spectrum. Increasing the value of M reduces the modeling distortion at the expense of the quantizer distortion. Looking at the total distortion curve, it is clear that there is little performance to be gained by increasing M beyond about 30. In fact, the distortion for M = 20 is only about 0.15 dB higher than the minimum distortion at this rate. This result is significant because lowering M results in no increase in rate, but reduces the complexity of the VQ and the storage memory required. For example, in the next section it will be shown that a full-search VQ with M = 20 requires approximately one third the peak complexity and storage of a similar VQ with M = 55.

Figure 5.13: Total distortion for M = 10 versus number of bits/spectrum for various transforms.

Figure 5.14: Total distortion for M = 30 versus number of bits/spectrum for various transforms.


Figure 5.15: NSTVQ distortion for 30 bits per spectrum versus fixed dimension M for the DCT-II transform: (i) modeling distortion, (ii) quantizer distortion, (iii) total distortion.

5.5 Complexity and Storage Requirements

For each N-dimensional spectral vector y, the NSTVQ algorithm requires an N×M linear transformation (z = A^T y), quantization using an M-dimensional VQ (z_q = Q(z)), and finally an M×N inverse transformation (y_q = A z_q). For a K-stage MSVQ configuration using b bits per stage, the number of floating point operations for NSTVQ in the worst case (peak complexity) is on the order of

C_nstvq = K 2^b M + 4 M N_max + M N_max   (5.42)

where M is the value used for the NSTVQ fixed dimension and N_max is the maximum number of harmonics which can appear in a spectral magnitude vector. The first term in eqn. (5.42) represents the complexity required for the codebook search, the second term gives the complexity of the forward and inverse transformations, and the third term is an estimate of the complexity required to compute the DCT transformation matrix (assuming linear interpolation from a table of cosine values). Note that the complexity varies linearly with M. The number of words required for storage in NSTVQ is given by

S_nstvq = K 2^b M + L   (5.43)

where L is the number of entries in the cosine interpolation table.

For comparison, the complexity estimates above were compared with similar estimates for VDVQ. For the same MSVQ configuration, the complexity of VDVQ is on the order of

C_vdvq = K 2^b N_max   (5.44)

and the number of words required for codebook storage is

S_vdvq = K 2^b N_vdvq   (5.45)

where N_vdvq is the dimension of the VDVQ codebook. Comparing eqn. (5.44) with eqn. (5.42) shows that the NSTVQ system incurs a fixed overhead of approximately 5MN_max operations due to the non-square transform. However, the complexity of the NSTVQ codebook search is K(2^b)M rather than K(2^b)N_max as in VDVQ. As shown in section 5.4.5, the value of M in NSTVQ can be significantly lower than N_max with only a small drop in performance. For large codebooks typical in harmonic coding applications, the difference between N_max and M can result in large complexity reductions for NSTVQ. For example, using values of b = 10, M = 30, N_max = 70, L = 128, and N_vdvq = 128, we computed the complexity and storage requirements of NSTVQ and VDVQ for K = 1 ... 4. The results are plotted in figures 5.16 and 5.17. The plots indicate that the NSTVQ algorithm requires approximately half the complexity and one quarter the storage of VDVQ for this configuration. It is expected that the overhead due to the non-square transform in NSTVQ can be reduced through the use of fast DCT transforms.
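A quick numeric check of eqns. (5.42)-(5.45) as reconstructed here (the exact split of the transform-overhead terms is my reading of the garbled text) reproduces the roughly half-complexity, quarter-storage figures quoted above:

```python
def nstvq_cost(K, b=10, M=30, N_max=70, L=128):
    complexity = K * (2 ** b) * M + 5 * M * N_max   # search + transform overhead
    storage = K * (2 ** b) * M + L
    return complexity, storage

def vdvq_cost(K, b=10, N_max=70, N_vdvq=128):
    return K * (2 ** b) * N_max, K * (2 ** b) * N_vdvq

for K in range(1, 5):
    print(K, nstvq_cost(K), vdvq_cost(K))
# e.g. K = 4: complexity 133380 vs 286720, storage 123008 vs 524288
```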


Figure 5.16: Complexity estimates for NSTVQ and VDVQ using an MSVQ structure with 10 bits/stage.


Figure 5.17: Storage estimates for NSTVQ and VDVQ using an MSVQ structure with 10 bits/stage.

5.6 Evaluation of NSTVQ Performance

In order to evaluate the NSTVQ method for spectral magnitude quantization, we compared the objective performance of NSTVQ with three other methods: all-pole modeling [7], the combination scalar/vector quantization scheme of IMBE [15] and the direct VQ approach of VDVQ [14].

5.6.1 Evaluation Environment

For all methods, the spectral log-magnitudes to be quantized and the associated pitch periods were obtained exactly as specified in the IMBE standard. A set of 40,000 vectors was used for training the quantizers, and a set of 12,000 vectors outside the training set was used for evaluation. For the comparison with IMBE we implemented a version of NSTVQ with predictive vector quantization (NSTPVQ) which uses vector prediction to exploit inter-vector correlations. Prediction is made simpler with NSTVQ because the quantized vectors are of fixed length; other methods which use vector prediction, including IMBE, must use interpolation prior to prediction.

The distortion criterion used to evaluate performance is the root mean square spectral distortion (SD). The spectral distortion between the unquantized and quantized harmonic magnitude vectors, m and \hat{m}, is given by:

SD = \left[ \frac{1}{i_2 - i_1 + 1} \sum_{i=i_1}^{i_2} \left( 20\log_{10} m_i - 20\log_{10} \hat{m}_i \right)^2 \right]^{1/2}

where i_1 and i_2 are chosen such that only harmonics within the frequency range of interest are included in the distortion calculation. Unless otherwise stated, all tests used multi-stage vector quantizers (MSVQ) with M-L search [5], trained using the Generalized Lloyd Algorithm (GLA). Using 11 stages with 6 bits/stage and dropping stages sequentially, we were able to obtain results for systems ranging from 6-66 bits/spectrum.
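A direct transcription of the SD measure as reconstructed above (a sketch; m and m_hat are linear harmonic magnitudes, and the index bounds default to the full vector):

```python
import numpy as np

def spectral_distortion(m, m_hat, i1=0, i2=None):
    """RMS spectral distortion in dB over harmonics i1..i2 (inclusive)."""
    if i2 is None:
        i2 = len(m) - 1
    d = 20.0 * np.log10(m[i1:i2 + 1]) - 20.0 * np.log10(m_hat[i1:i2 + 1])
    return np.sqrt(np.mean(d ** 2))
```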

5.6.2 Comparison with All-Pole Modeling and VDVQ

We first compared NSTVQ to an all-pole modeling technique (LPC-10) similar to the algorithm used in [7] (see section 5.2.2), and to VDVQ [14] (see section 5.2.3). We quantized

the LPC-10 model coefficients using a 24 bit multi-stage VQ (MSVQ) for the 10 LSP values and a 6 bit scalar gain quantizer. VDVQ uses two 10 bit mean-removed VQs with vector dimensions of 47 and 68 to encode harmonics lying in the ranges 64-1500 Hz and 1500-3600 Hz respectively, and another 10 bit VQ to encode the means (actually the log-gains). Splitting the spectrum in this way may improve subjective quality by using more bits to encode the lower frequency harmonics, but objective performance is often reduced. Because of this, we kept the comparison with VDVQ fair by using exactly the same spectral splitting and mean-removal, with NSTVQ using M = 20 and M = 25 applied to each half-spectrum respectively. Table 5.2 shows the results of this comparison.

Table 5.2: Spectral Distortion (dB) for LPC-10, VDVQ and NSTVQ (30 bits per spectrum).

    METHOD    FEMALE    MALE    BOTH
    LPC-10    4.42      5.05    4.73
    VDVQ      2.94      3.58    3.25
    NSTVQ     2.76      3.34    3.05

The results indicate that for both male and female speakers the NSTVQ system out-performed the LPC-10 system by approximately 1.7 dB. NSTVQ also obtained 0.2 dB lower distortion than VDVQ, but given that the results may vary depending on the speech data used for the test, we consider the performance of VDVQ and NSTVQ comparable.

5.6.3 Comparison with IMBE Scalar/Vector Quantization

The next test compared IMBE scalar/vector quantization with NSTVQ. For this test we applied NSTVQ to the entire spectral range of 64-3600 Hz without splitting. We used a 6 bit per stage MSVQ structure with M-L search [5]. By training 11 stages and dropping the stages sequentially, we were able to obtain results for systems ranging from 6-66 bits/spectrum. The same structure was used for our predictive system, NSTPVQ. IMBE uses vector prediction and a variable bit assignment scheme which obtained an average rate of 66 bits/spectrum on our test data. Figure 5.18 shows that performance equivalent to 66-bit IMBE can be obtained using 46-bit NSTVQ, or 41-bit NSTPVQ. An IMBE codec with NSTVQ magnitude quantization could save 20-25 bits/frame, or 1000-1250 bps. Furthermore, the NSTVQ system shows a smooth drop in performance as the number of bits per spectrum is reduced.

Figure 5.18: Spectral distortion vs. number of bits per spectrum for (i) NSTVQ and (ii) NSTPVQ. IMBE spectral magnitude quantization using an average of 66 bits/spectrum is indicated with an asterisk.

The NSTVQ variable dimension vector quantization technique has been incorporated in the 2.4 kb/s spectral excitation coding (SEC) system discussed in chapter 6. This system scored within 0.1 MOS points of the IMBE standard operating at 4.15 kb/s, and within 0.1 MOS points of the 4.8 kb/s FS 1016 CELP codec [8].

5.7 Summary

Motivated by the problems encountered when encoding variable-length vectors such as harmonic magnitudes, we have introduced a quantization technique called NSTVQ which uses a variable-size non-square transform combined with a fixed-dimension vector quantizer. The technique is shown experimentally to out-perform all-pole modeling and to obtain performance comparable to the best existing procedure, VDVQ. We have also shown that significant bit-reduction potential is possible if NSTVQ were to be used in place of the IMBE magnitude quantization scheme. Furthermore, NSTVQ is shown to require lower complexity and fewer storage words than VDVQ for some typical quantizer configurations. The performance of NSTVQ is also shown to degrade gracefully as the bit rate is reduced, making it a good candidate for very low bit rate systems. Other advantages associated with NSTVQ include the ability to trade distortion for complexity reduction by adjusting a single parameter, the fact that vector interpolation is not necessary when using vector prediction, and the fact that embedded encoding is inherent in the technique.

Chapter 6

Spectral Excitation Coding of Speech

6.1 Introduction

In this chapter, we present a new harmonic coding system called Spectral Excitation Coding (SEC) [30, 12]. In chapter 4, two harmonic coding systems (STC and MBE) were presented which use a sinusoidal model applied directly to the speech signal. In SEC, the sinusoidal model is applied to the excitation signal obtained by passing the speech through a short-term linear prediction filter. In the receiver, the excitation signal is synthesized using the sinusoidal model and then passed through the short-term synthesis filter to obtain the reconstructed speech. The SEC system also differs from STC and MBE in that the harmonic model parameter analysis is performed more frequently: in STC and MBE, typical parameter update rates are 20 ms - 30 ms, while in SEC typical update rates are 5 ms - 15 ms. There have been other systems which use a harmonic model applied to the residual, for example Time Frequency Interpolation (TFI) [49]. Unlike SEC, which uses the harmonic model for all sounds, TFI uses a CELP codec for encoding unvoiced sounds and the equivalent of a sinusoidal model applied to the excitation signal for encoding voiced sounds.

There are several advantages in using the excitation signal for harmonic modeling as opposed to the speech signal. For example, in speech coding systems operating at rates between 2400 bps and 4800 bps, a large percentage of the rate is devoted to quantization of the harmonic magnitudes. Usually, the number of bits required for the encoding of each spectral magnitude vector makes optimal vector quantization impractical. As discussed in chapter 3, one approach to this problem is to use suboptimal vector quantizers; by quantizing the spectral envelope separately using the coefficients of an LPC filter, we are employing such a strategy. The decoupling of the spectral envelope from the spectral shape is justified by the speech production model, in which the glottal excitation signal is considered to be independent of the vocal tract shape. Furthermore, there are many existing algorithms for quantization and interpolation of linear prediction coefficients and related parameters which have been improving over several years and can be directly utilized in an excitation-based harmonic coder.

The chapter is organized as follows. We first present a general description of a speech coder based on harmonic modeling of the excitation signal. The notation and concepts relating to parameter estimation, quantization, and interpolation in SEC are then discussed. Finally, the specific details of a 2.4 kb/s SEC floating point simulation are presented, along with a performance evaluation.

6.2 Excitation-Based Harmonic Coding Overview

Figure 6.1 shows a block diagram of a general speech coder based on harmonic modeling of the excitation signal. At the encoder, LPC analysis is performed on the input speech signal s(n) in order to determine the LP coefficients {a_i; i = 1 ... M_lpc}, where M_lpc is the filter order. The coefficients are quantized, and interpolation is used to obtain a time-varying short-term filter A(z, n). The input speech signal is then passed through the short-term filter in order to obtain the unquantized residual signal e(n). Pitch analysis is performed on e(n) to obtain the fundamental frequency ω_0, and spectral analysis is used to obtain the spectral magnitudes {M_k; k = 1 ... K(ω_0)} and the spectral phases {φ_k; k = 1 ... K(ω_0)}. The number of harmonics to be used for synthesis, K(ω_0), is a function of the pitch period.¹ The fundamental period and spectral parameters are quantized and transmitted to the decoder.

¹Although the maximum harmonic number for a given fundamental frequency is given by ⌊π/ω_0⌋, K(ω_0) can often be set to a lower value without significantly reducing the perceptual quality of the reproduced signal.


Figure 6.1: Block Diagram of a General Excitation-Based Harmonic Coder

At the decoder, the fundamental period, spectral magnitudes, and spectral phases are used to synthesize the quantized excitation signal ê(n) on a sample-by-sample basis according to the following synthesis equation

\hat{e}(n) = \sum_{k=1}^{K(\omega_0(n))} \hat{e}_k(n)   (6.1)

where ê_k(n) is a function of \hat{M}_k(n), \hat{ω}_0(n), and \hat{φ}_k(n). The quantized excitation signal is then passed through the short-term synthesis filter in order to obtain the quantized speech signal ŝ(n). The description of the general excitation-based harmonic coder above highlights the three key issues which are addressed in the following sections:

1. generation of the unquantized excitation signal

2. estimation of the fundamental frequency

3. estimation and quantization of the excitation spectrum

6.3 Generation of the Unquantized Excitation Signal

The short-term filter is used in the encoder to compute the unquantized (or ideal) excitation signal from the input speech signal. It is the unquantized excitation signal on which the subsequent harmonic model analysis is performed. In the decoder, the quantized (or reconstructed) excitation signal is passed through the inverse short-term filter in order to synthesize the output speech signal. In speech coding systems which make use of a short-term filter, the input speech is typically organized into frames and subframes. Once per frame, the filter coefficients are computed using LPC analysis and quantized. Each subframe, the short-term filter residual signal is quantized. In most speech codecs, the short-term filter coefficients are updated once per subframe using linear interpolation in order to avoid abrupt changes in the filter response. While interpolation is important, there is no reason to couple the subframe structure with the interval over which the short-term filter remains fixed. As long as the update interval is fixed at both the transmitter and receiver, any arbitrary interval may be used. In the SEC system, we separate the issue of short-term filter interpolation from the subframe structure. The remainder of the section discusses the generation of the excitation signal and the motivation behind the choice of the coefficient update interval and analysis window length.

Figure 6.2 shows the procedure for the generation of a single frame of unquantized excitation, e(n), defined over the interval 0 ≤ n < N. In order to compute the LPC coefficients, a Hamming window of length L_a is centered over the first sample for which the excitation signal is to be generated and multiplied with the input speech signal to form the windowed analysis frame. The tenth-order all-pole filter coefficients are then calculated using the autocorrelation method of LPC parameter estimation (see section 2.4.1). The LPC coefficients are converted to Line Spectral Pairs (LSPs) and vector quantized. The procedure is then repeated, this time with the analysis window centered over the start of the next excitation frame at n = N. The excitation signal is computed using the following filtering operation, with the time-varying coefficients updated every L_int samples:

e(n) = s(n) - \sum_{i=1}^{M_lpc} \hat{a}_i(n) s(n-i)   (6.2)

Figure 6.2: SEC Short-Term Filter Coefficient Estimation and Interpolation.
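A minimal sketch of the filtering in eqn. (6.2), holding the interpolated coefficients fixed over each L_int-sample interval (my own code; a_hat is assumed to supply one coefficient vector per interval, and the sign convention follows the reconstruction above):

```python
import numpy as np

def excitation(s, a_hat, L_int):
    """s: speech frame; a_hat: list of LPC vectors, one per L_int interval."""
    e = np.zeros(len(s))
    for n in range(len(s)):
        a = a_hat[n // L_int]          # coefficients fixed within the interval
        acc = 0.0
        for i, ai in enumerate(a, start=1):
            if n - i >= 0:             # history outside the frame ignored here
                acc += ai * s[n - i]
        e[n] = s[n] - acc
    return e
```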

The time-varying LP coefficients â_i(n) are obtained using linear interpolation of the corresponding line spectral pairs. Line spectral pair interpolation was chosen because direct interpolation of the linear prediction coefficients can result in unstable synthesis filters. Furthermore, quantization of the short-term filter coefficients is performed using line spectral pairs. It was shown in [4] that line spectral pair interpolation performed no worse than interpolation in several other domains. If we define the conversion from LSP to LPC coefficients using the operation LPC[·], then the set of equations used to obtain the interpolated linear prediction coefficients â_i(n), 0 ≤ n < N, is given by

\hat{a}_i(n) = LPC[\hat{l}_i(n)]   (6.3)

\hat{l}_i(n) = \hat{l}_i(0)(1 - \alpha) + \hat{l}_i(N)\alpha   (6.4)

\alpha = \frac{L_int}{2N}\left(2\left\lfloor\frac{n}{L_int}\right\rfloor + 1\right)   (6.5)

where \hat{l}_i(0) and \hat{l}_i(N) are the quantized LSP coefficients obtained from LPC analysis centered on samples n = 0 and n = N respectively. Equation (6.5) ensures that the values for â_i(n) used within any interpolation interval are based on line spectral pairs interpolated at the interval center. There are two parameters which must be specified for the generation of the unquantized residual as outlined above: the analysis window length, L_a, and the interpolation interval, L_int (refer to fig. 6.2).
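The piecewise-constant weight of eqn. (6.5) can be checked numerically (a small sketch using the values adopted later in this chapter, N = 240 and L_int = 12):

```python
def alpha(n, N, L_int):
    # eqn. (6.5): constant within each interval, sampled at the interval center
    return (L_int / (2.0 * N)) * (2 * (n // L_int) + 1)

N, L_int = 240, 12
print([round(alpha(n, N, L_int), 3) for n in (0, 11, 12, 239)])
# [0.025, 0.025, 0.075, 0.975]
```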

6.3.1 Choice of Analysis Window Length, L_a

The choice of L_a must be made carefully in low-rate coding systems, where the frame length may be as large as 40 ms. In many applications which involve LPC estimation, frame lengths of about 20 ms are typically used in order to avoid smeared spectral estimates due to violation of the local stationarity assumption. If L_a is set too small, however, some of the speech signal will not be included in the LPC analysis because the analysis frames will not overlap. In order to choose the best value for L_a in a low-rate environment, the following experiment was performed. A set of speech segments sampled at 8 kHz was analyzed and filtered according to eqn. (6.2) for various values of L_a. The frame size was set at 240 samples (30 ms) and L_int was fixed at 12 samples (see section 6.3.2). The prediction gain of the filter was then estimated by computing, for each value of L_a,

SNR(L_a) = 10 \log_{10} \frac{\sum_{n=0}^{N-1} s^2(n)}{\sum_{n=0}^{N-1} e^2(n)}   (6.6)

where N is the frame length. The results were averaged over many frames and plotted in fig. 6.3. The plot shows that the maximum prediction gain occurs when the analysis windows are approximately 360 samples (45 ms) long for a frame length of 240 samples (30 ms), implying an overlap of about 120 samples (15 ms). Listening tests were conducted which confirm the objective results shown in fig. 6.3.

Figure 6.3: Short-term filter prediction gain vs. SEC analysis window length for a frame length of N = 240 samples.

6.3.2 Choice of Interpolation Interval, L_int

The interpolation interval, L_int, defines the number of samples for which the short-term filter coefficients in eqn. (6.2) remain fixed. To maintain the smoothest possible transition between the short-term filters defined for each analysis frame, the filter coefficients should be interpolated on a sample-by-sample basis, corresponding to L_int = 1. Every interpolated set of LP coefficients, however, requires a conversion from line spectral pairs to linear prediction coefficients, which can make sample-by-sample interpolation exceedingly complex. In order to find a value for L_int which avoids unnecessary complexity, the experiment defined in subsection 6.3.1 was repeated, this time with L_a fixed at 360. The prediction gain was measured for various values of L_int, and the results were averaged over many frames and plotted in fig. 6.4. Note that the prediction gain is plotted against the number of interpolation subframes, N/L_int, where N = 240 is the synthesis frame length. The results show that the use of coefficient interpolation results in an increase of about 1 dB in average prediction gain. About 90% of the increase in gain can be achieved using only 20 interpolation subframes, corresponding to L_int = 12 for 240-sample frames.

Figure 6.4: Short-term filter prediction gain vs. number of coefficient interpolation intervals for a frame length of N = 240 samples.

6.4 Fundamental Frequency Estimation

In section 4.4, it was shown that a speech signal can be modeled using a bank of sinusoidal oscillators with harmonically related frequencies. A critical component of any harmonic-based model is the estimation of the fundamental frequency, or pitch, of the talker. In SEC, a time domain based pitch estimation method is used which includes an open-loop pitch estimator followed by a pitch tracking algorithm. It should be noted, however, that there is nothing inherent in the SEC algorithm which requires pitch estimation to be performed in the time domain; either of the pitch estimation procedures presented in section 4.4 could be used.

6.4.1 Open-Loop Pitch Estimation

The pitch estimation procedure used in SEC is based on the SIFT method presented in [9], applied to the unquantized excitation signal discussed in section 6.3. A pitch estimate is obtained once per synthesis subframe by minimizing the following error criterion

E(p) = \sum_{n=0}^{L_p-1} \left[ e(n) - \gamma e(n-p) \right]^2   (6.7)

where L_p is the pitch estimation window length, e(n) is the unquantized excitation signal, p is the pitch period, and γ is a factor designed to account for changes in the short-term signal energy over time. The optimal value for γ which minimizes eqn. (6.7) can be determined by taking the derivative of E(p) with respect to γ, leading to

\gamma_{opt} = \frac{\sum_{n=0}^{L_p-1} e(n) e(n-p)}{\sum_{n=0}^{L_p-1} e^2(n-p)}   (6.8)

Substituting eqn. (6.8) into eqn. (6.7) gives the error criterion

E(p) = \sum_{n=0}^{L_p-1} e^2(n) - \frac{\left[ \sum_{n=0}^{L_p-1} e(n) e(n-p) \right]^2}{\sum_{n=0}^{L_p-1} e^2(n-p)}   (6.9)

Equation (6.9) is minimized by maximizing the second term, which is the square of the normalized autocorrelation function. For pitch period estimation we are only interested in positively correlated signal shifts; therefore it is better to maximize the normalized autocorrelation function directly. This leads to the following equation for the optimal pitch period, p_opt:

p_{opt} = \arg\max_{p_l \le p \le p_h} \rho(p)   (6.10)

where

\rho(p) = \frac{\sum_{n=0}^{L_p-1} e(n) e(n-p)}{\left[ \sum_{n=0}^{L_p-1} e^2(n) \sum_{n=0}^{L_p-1} e^2(n-p) \right]^{1/2}}   (6.11)

and p_l and p_h are the minimum and maximum possible pitch periods respectively. For 8 kHz sampled speech, p_l = 20 and p_h = 147 are used. Note that the estimation procedure described here obtains integer estimates of the pitch lag; it is possible to obtain sub-integer estimates by appropriately upsampling the signal before analysis.
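A sketch of the resulting open-loop search over integer lags (my own code; it analyzes the last L_p samples of the residual and assumes enough history is available):

```python
import numpy as np

def open_loop_pitch(e, L_p=160, p_l=20, p_h=147):
    """Maximize the normalized autocorrelation; requires len(e) >= L_p + p_h."""
    cur = e[-L_p:]
    best_p, best_rho = p_l, -np.inf
    for p in range(p_l, p_h + 1):
        prev = e[-L_p - p:-p]
        denom = np.sqrt(np.dot(cur, cur) * np.dot(prev, prev))
        rho = np.dot(cur, prev) / denom if denom > 0 else 0.0
        if rho > best_rho:
            best_p, best_rho = p, rho
    return best_p, best_rho
```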

6.4.2 Pitch Tracking

During initial testing of the pitch estimation algorithm described in the preceding section, it was found that several pitch doubling and pitch halving errors occurred

which produced artifacts in the synthesized speech. Analysis indicated that the worst errors from a perceptual point of view occurred within relatively long voiced regions. As a result, a pitch tracking algorithm was developed which attempts to identify voiced regions during which the pitch is changing slowly. When the algorithm determines that the pitch is being tracked during steady voicing, any large deviations in the estimated pitch period are assumed to be pitch errors, and the open-loop pitch estimate is modified to be within close range of the previous pitch values. When the pitch tracker decides that the current speech segment is within a voiced region, it is said to be tracking the pitch. This decision requires all of the following conditions to be true:

The number of times the pitch has changed by more than Δp_tol in the past N_b subframes is less than N_p.

The value of ρ(p) for the current subframe is greater than ρ_min.

The number of times that ρ(p) < ρ_tol over the last N_b frames is less than N_ρ.

where Δp_tol, N_b, N_p, ρ_min, ρ_tol, and N_ρ are rate-dependent parameters which are obtained heuristically. The actual values used in the 2.4 kb/s implementation of the SEC system are defined in section 6.6.1. Figure 6.5 shows an example of how the pitch tracker smooths the pitch contour during voiced speech. In fig. 6.5(a), a speech segment containing only voiced speech is plotted. The pitch estimates without and with the pitch tracker are plotted in fig. 6.5(b) and (c) respectively. The areas where the pitch tracker has corrected bad pitch estimates are shown in grey.
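In outline, the tracking decision could look like the sketch below. The threshold values here are placeholders rather than the thesis parameters (those are given in section 6.6.1), and the first condition is implemented as "few large pitch deviations", which is my reading of the OCR-damaged text.

```python
def is_tracking(dp_hist, rho, rho_hist,
                dp_tol=10, N_b=8, N_p=3, rho_min=0.5, rho_tol=0.4, N_rho=2):
    """dp_hist: recent pitch changes; rho_hist: recent normalized correlations."""
    large_changes = sum(abs(d) >= dp_tol for d in dp_hist[-N_b:])
    weak_frames = sum(r < rho_tol for r in rho_hist[-N_b:])
    return large_changes < N_p and rho > rho_min and weak_frames < N_rho
```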

6.5 Spectral Estimation and Quantization

In section 6.2, eqn. (6.1) was given which defines the sample-by-sample synthesis of the excitation signal using the spectral magnitudes, spectral phases, and fundamental frequency, all functions of the sample index n. In this section we present a general method for spectral magnitude and phase estimation, along with a discussion of some issues relating to quantization and interpolation of the harmonic model parameters.

Figure 6.5: Example of pitch contour obtained from a voiced speech segment with and without the pitch tracking algorithm: (a) speech signal, (b) pitch estimate with no tracking, (c) pitch estimate with tracking.

6.5.1 Spectral Estimation

In SEC, the spectral magnitudes and phases corresponding to the update sample n = n_0 are estimated using frequency analysis of a segment of the unquantized excitation signal e(n), obtained by multiplying e(n) by a symmetrical window function centered on n = n_0. The goal of the analysis is to obtain a set of complex spectral coefficients {A_k(n_0); k = 1 ... K(ω_0(n_0))}, from which the spectral magnitudes and phases may be directly obtained. In section 4.4.2, a method was presented for determining the optimal values of the spectral coefficients based on minimization of the error between the windowed synthetic spectrum and the windowed speech spectrum, under the assumption that the analysis window was orthonormal with respect to shifts equal to multiples of the fundamental frequency. In this section, we present a more general derivation for spectral coefficient estimation which does not require the orthonormality assumption. Furthermore, the error due to the orthonormality assumption in eqn. (4.25) is evaluated for various windows. Note that while s(n) is used to describe the input signal in the following derivation, in the SEC system it is the unquantized residual signal e(n) on which spectral analysis is performed.

Let s(n) be an input signal, and w(n) be a real symmetric window which is non-zero only over the interval -N ≤ n ≤ N. The spectrum, S_w(ω), of the windowed input signal, s(n)w(n), is obtained using the short-time Fourier transform

S_w(\omega) = \sum_{n=-N}^{N} s(n) w(n) e^{-j\omega n}   (6.12)

For a synthesis model based on the summation of a set of sinusoids having harmonically related frequencies, the spectrum of the windowed synthetic signal, Ŝ_w(ω), is given by

\hat{S}_w(\omega) = \sum_{k=-M}^{P-M-1} A_k W(\omega - k\omega_0)   (6.13)

where W(ω) is the Fourier transform of the window function w(n), M = ⌊P/2⌋ where P is the pitch period in samples, and ω_0 = 2π/P. We now define the error signal, ε(A_k, ω_0), as the mean squared error between the

windowed synthetic speech spectrum and the windowed input speech spectrum:

\varepsilon(A_k, \omega_0) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \left| S_w(\omega) - \sum_{k=-M}^{P-M-1} A_k W(\omega - k\omega_0) \right|^2 d\omega   (6.14)

We would like to determine the values for A_k which minimize eqn. (6.14); however, unlike in section 4.4.2, we will make no assumptions about the orthonormality of W(ω). We are now in a position to determine the values for A_k which minimize eqn. (6.14). For notational convenience we first define

S_k = \frac{1}{2\pi} \int_{-\pi}^{\pi} S_w(\omega) W^*(\omega - k\omega_0)\, d\omega   (6.15)

and

W_{kl} = \frac{1}{2\pi} \int_{-\pi}^{\pi} W^*(\omega - k\omega_0) W(\omega - l\omega_0)\, d\omega   (6.16)

Next, we expand the square in eqn. (6.14):

\varepsilon = \frac{1}{2\pi} \int_{-\pi}^{\pi} |S_w(\omega)|^2 d\omega - \sum_k A_k^* S_k - \sum_k A_k S_k^* + \sum_k \sum_l A_k A_l^* W_{lk}   (6.17)

We now let A_k = a_k + jb_k and take the derivative of ε with respect to a_k and b_k separately. Setting both partial derivatives to zero and solving for a_k and b_k, we get

a_k = \frac{1}{W_{kk}} \mathrm{Re}\left[ S_k - \sum_{l \ne k} A_l W_{kl} \right]   (6.18)

and

b_k = \frac{1}{W_{kk}} \mathrm{Im}\left[ S_k - \sum_{l \ne k} A_l W_{kl} \right]   (6.19)

Next we add the real and imaginary parts a_k and b_k:

A_k = \frac{1}{W_{kk}} \left[ S_k - \sum_{l \ne k} A_l W_{kl} \right]   (6.20)

By multiplying both sides of eqn. (6.20) by W_kk and moving A_k into the summation, we have a system of P linear equations:

\sum_{l=-M}^{P-M-1} W_{kl} A_l = S_k, \qquad k = -M, \ldots, P-M-1   (6.21)

We can write this system of equations in matrix form as

W a = s   (6.22)

where a = [A_{-M}, A_{-M+1}, ..., A_M]^T, s = [S_{-M}, S_{-M+1}, ..., S_M]^T, and

W = \begin{bmatrix} W_{-M,-M} & \cdots & W_{-M,M} \\ \vdots & \ddots & \vdots \\ W_{M,-M} & \cdots & W_{M,M} \end{bmatrix}   (6.23)

The solution to the system of equations is

a = W^{-1} s   (6.24)

It is clear from eqn. (6.23) that for a symmetric window W is a symmetric Toeplitz matrix, and can therefore be efficiently inverted. Furthermore, the diagonal elements are window energy terms and are therefore non-zero. We can also see that if the window spectrum is orthogonal with respect to shifts by ω = ω_0, then W_kl will be zero for k ≠ l. This reduces eqn. (6.24) to

A_k = \frac{S_k}{W_{00}}   (6.25)

or, written out explicitly,

A_k = \frac{\frac{1}{2\pi} \int_{-\pi}^{\pi} S_w(\omega) W(\omega - k\omega_0)\, d\omega}{\frac{1}{2\pi} \int_{-\pi}^{\pi} W^*(\omega) W(\omega)\, d\omega}   (6.26)

This is the same equation derived in [22]. Furthermore, if the window spectrum is orthonormal, W_00 = 1 and eqn. (6.26) reduces to eqn. (4.25). Comparing eqn. (6.26) with eqn. (6.20) shows that a non-orthogonal window causes energy at each harmonic to leak into the other harmonics; but because the window spectrum is known, this leakage can be taken into account in the calculation of the spectral coefficients.
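Because W_kl and S_k are inner products of windowed exponentials, Parseval's theorem lets the integrals be evaluated as time-domain sums. The sketch below (my own discretization and names) builds the system of eqn. (6.22) this way and solves it:

```python
import numpy as np

def harmonic_coeffs(s, w, P):
    """s, w: equal-length signal and real symmetric window; P: pitch (samples)."""
    L = len(w)
    n = np.arange(L) - (L - 1) // 2           # centered time index
    w0 = 2.0 * np.pi / P
    M = P // 2
    ks = np.arange(-M, P - M)                 # harmonic indices, as in eqn. (6.13)
    E = np.exp(1j * w0 * np.outer(ks, n))     # rows: e^{j k w0 n}
    Sk = (E.conj() * (s * w * w)).sum(axis=1)        # eqn. (6.15) via Parseval
    Wm = (E.conj() * (w * w)) @ E.T                  # W_kl of eqn. (6.16)
    return np.linalg.solve(Wm, Sk)                   # eqn. (6.24)
```

For a signal exactly periodic in P and a pitch-length rectangular window, Wm is diagonal and the result reduces to pitch-sized DFT coefficients, consistent with eqn. (6.26).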

Pitch-Sized Rectangular Windows

There is one special case where the orthogonality approximation used to derive eqn. (6.26) is perfectly valid: when w(n) is a rectangular window equal in width to the estimated pitch period. To prove this, we define our rectangular window as

\Pi\left(\frac{n}{P}\right) = \begin{cases} 1 & \text{if } -M \le n \le P-M-1 \\ 0 & \text{otherwise} \end{cases}   (6.27)

where M = ⌊P/2⌋. Now, for the orthogonality condition to be satisfied, we want to show that

\frac{1}{2\pi} \int_{-\pi}^{\pi} W^*(\omega) W(\omega - k\omega_0)\, d\omega = 0, \qquad k \ne 0   (6.28)

where W(ω) = F[Π(n/P)]. Using the frequency shifting property of the Fourier transform, we can write W(ω - kω_0) = F[e^{jkω_0 n} Π(n/P)]. Now we apply Parseval's theorem, which states that if X_1(ω) = F[x_1(n)] and X_2(ω) = F[x_2(n)], then

\sum_{n=-\infty}^{\infty} x_1^*(n)\, x_2(n) = \frac{1}{2\pi} \int_{-\pi}^{\pi} X_1^*(\omega) X_2(\omega)\, d\omega   (6.29)

This gives us the relation

\frac{1}{2\pi} \int_{-\pi}^{\pi} W^*(\omega) W(\omega - k\omega_0)\, d\omega = \sum_{n=-\infty}^{\infty} \Pi\left(\frac{n}{P}\right) e^{jk\omega_0 n} \Pi\left(\frac{n}{P}\right)   (6.30)

Using the definition of the rectangular window in eqn. (6.27), we can reduce the limits on the summation. Substituting ω_0 = 2π/P and expanding the exponential, we get

\sum_{n=-M}^{P-M-1} e^{j2\pi kn/P} = \sum_{n=-M}^{P-M-1} \left[ \cos\left(\frac{2\pi kn}{P}\right) + j\sin\left(\frac{2\pi kn}{P}\right) \right]   (6.31)

Note that both the real and imaginary parts are summations over an integral number of periods of a sine or cosine function, and are therefore equal to zero. Thus the spectrum of a pitch-length rectangular window is orthogonal with respect to frequency shifts of kω_0. It should be noted that this result is much stronger than the obvious statement that the P-point Discrete Fourier Transform of a pitch-sized rectangular window is orthogonal with respect to frequency shifts of kω_0, because that transform is a discrete delta function. The result derived in eqn. (6.31) states that the Fourier transform of w(n), which is a continuous periodic function, is orthogonal. Furthermore, we can see now that if we choose to use eqn. (6.14) as our distortion criterion with pitch-sized rectangular windows, we can get the A_k values directly by computing pitch-sized DFTs. It should be stressed that although the use of pitch-sized rectangular windows makes the simplified eqn. (6.26) exact for periodic sequences, it is not the best choice for all applications. In MBE, for example, the pitch is found using a closed-loop search. In this

case, it is best to have the error drop sharply at the optimal pitch period in order to avoid ambiguity in the pitch estimate. This requirement usually leads to longer windows with spectra having narrow main lobes and low side lobes. Even though these windows may not be orthogonal, it is still desirable to use the simplified estimation equation (6.26) in order to reduce the complexity of the estimation procedure.
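The orthogonality result of eqn. (6.31) is easy to confirm numerically (a tiny sketch): the sum of e^{j2πkn/P} over any full period vanishes for k ≠ 0.

```python
import numpy as np

P = 37
M = P // 2
n = np.arange(-M, P - M)                     # one full period of the window
for k in range(5):
    print(k, abs(np.exp(1j * 2 * np.pi * k * n / P).sum()))  # P for k=0, ~0 otherwise
```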

Error in Using Eqn. (6.26) for Non-Orthogonal Windows

We now present some experimental results comparing the A_k estimation equation (6.24) to equation (6.26), which assumes orthogonal window spectra, for rectangular and Hamming windows of various lengths. In the following discussion, the distortion criterion used is the root mean square spectral distortion (SD). The spectral distortion between two spectral estimates {A_k} and {Â_k}, where k = 1 ... K and K is the number of harmonics, is given by

SD = \left[ \frac{1}{K} \sum_{k=1}^{K} \left( 20\log_{10}|A_k| - 20\log_{10}|\hat{A}_k| \right)^2 \right]^{1/2}   (6.32)

Figure 6.6 shows a plot of the spectral distortion due to the orthogonality assumption for various windows. The signal used for estimation was made periodic by concatenating P-length segments of a random Gaussian signal. Because the value of P is known exactly, the general estimation from eqn. (6.24) always gives zero distortion. The distortion resulting from the use of eqn. (6.26), which assumes orthogonal windows, is zero for pitch-length rectangular windows, as expected; otherwise it is non-zero. Note that for pitch-length Hamming windows the error remains fairly constant, because the window spectrum width and sidelobe height become smaller as the pitch period increases, negating the effect of shifting by smaller amounts. However, for a fixed-length (221-point) Hamming or rectangular window, the error increases with increased pitch period, because the window spectrum shifts become smaller but the window spectrum width and sidelobe height do not depend on pitch. Also, it is clear that for lower P, a fixed-length Hamming window better satisfies the orthogonality assumption than a rectangular window of equal length, due to the much lower sidelobes of the Hamming window. However, for larger P, the wider mainlobe of the Hamming window begins to dominate the error in the orthogonality assumption, and the rectangular window performs better.

Figure 6.6: Error in spectral coefficient estimation using the analysis window orthogonality assumption.

Based on figure 6.6, it is clear that using the orthogonality assumption to reduce the complexity of the spectral estimation procedure results in large spectral distortion for pitch periods greater than approximately 90 samples (11.25 ms), for all the non-orthogonal windows tested.

Choice of the Window Function in SEC

The choice of the spectral estimation window in SEC is strongly influenced by the fact that an open-loop pitch estimator is used. Because the pitch is obtained separately from the spectral estimate, it is important that the use of eqn. (6.24) does not lead to large estimation errors in the presence of small pitch errors. Several experiments were performed on speech and residual segments with hand-calculated pitch values, in which spectral estimates were compared using a small range of pitch values centered on the actual pitch. The result of one such experiment is shown in fig. 6.7. In this experiment, periodic sequences with spectral shapes similar to those found in typical speech residual signals were generated for pitch values ranging from 20 to 140. For each pitch value, eqn. (6.26) was used to obtain spectral coefficient estimates for both the correct pitch and a pitch having an error of one sample. The distortion between the two estimates is plotted against the correct pitch.


Figure 6.7: Error in spectral coefficient estimation due to single sample pitch errors for various windows

The plot shows that pitch-sized windows result in the lowest overall distortion when single-sample pitch errors are made in the open-loop estimation. The distortion for the longer windows is large for low pitch periods due to the fact that the main lobe of these windows is very narrow when compared with the frequency difference between harmonics. This effect is illustrated in fig. 6.8, which plots the estimated spectral magnitudes and actual spectral magnitudes superimposed on the spectrum of the windowed synthetic signal. In fig. 6.8(a), the correct pitch is used and the estimated and actual magnitudes are identical. However, the spectrum for the signal windowed by the pitch-sized rectangular window is much smoother than that of the 221-sample Hamming window. In fig. 6.8(b), the effect of the difference between the window spectra is obvious due to a single-sample pitch error. Because the pitch-sized rectangular window results in a smooth spectrum, the error in estimating the spectrum energy at incorrect harmonic frequencies is relatively small. For the longer Hamming window, however, estimation at the wrong frequency can result in large spectral energy estimation errors.

Pitch-Length Rectangular / 221-pt Hamming (estimated and actual magnitudes shown in each panel)

(a) No Pitch Error

(b) 1 Sample Pitch Error

Figure 6.8: Error in spectral coefficient estimation due to single sample pitch errors for various windows

Note that the errors tend to increase as the harmonic number increases due to the accumulation of the pitch error in determining the harmonic frequency.

Based on these experiments, pitch-sized windows were chosen for spectral coefficient estimation in SEC. Although pitch-sized Hamming windows performed as well as pitch-sized rectangular windows in terms of single-sample pitch errors, rectangular windows were chosen because their orthogonality property allows the use of the simplified estimation eqn. (6.26) without any approximations.

6.5.2 Magnitude Quantization

Spectral magnitude quantization is a critical component in any harmonic coding system. For example, approximately 80% of the rate in the IMBE coder is dedicated to quantizing the spectral magnitudes. Because the spectral magnitude coefficients form a vector, it is natural to use vector quantization in order to lower the required bit rate. However, the length of the vector is dependent on the pitch period and thus changes from frame to frame, making direct vector quantization difficult. A new quantization procedure, NSTVQ, was introduced for this purpose (see chapter 5).

In SEC, the spectral magnitude estimates are quantized every Ls samples, where Ls is the number of samples in an excitation subframe and is dependent on the codec rate. Let n = 0 be a sample over which the spectral analysis window is centered. The spectral magnitudes, Mk(n), used in eqn. (6.1) to synthesize the quantized excitation signal for 0 ≤ n < Ls are given by

    M_k(n) = \frac{L_s - n}{L_s} \hat{M}_k(0) + \frac{n}{L_s} \hat{M}_k(L_s)   (6.33)

where M̂k(0) and M̂k(Ls) are the quantized spectral magnitudes obtained using analysis centered on update samples n = 0 and n = Ls respectively.
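A minimal C sketch of this per-sample interpolation follows; the function name and array layout are illustrative assumptions, not the SEC source.

    /* Linearly interpolate quantized harmonic magnitudes between two
     * spectral updates at n = 0 and n = Ls (a sketch of eqn. (6.33)). */
    void interpolate_magnitudes(const double *m0,   /* M̂k(0),  k = 0..K-1 */
                                const double *m1,   /* M̂k(Ls), k = 0..K-1 */
                                int K, int Ls, int n,
                                double *mk)         /* output: Mk(n), 0 <= n < Ls */
    {
        double w = (double)n / (double)Ls;
        for (int k = 0; k < K; k++)
            mk[k] = (1.0 - w) * m0[k] + w * m1[k];
    }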

6.5.3 Phase Quantization

In low bit-rate harmonic coders, there are generally not enough bits available for encoding the spectral phases directly. For example, in SEC operating at a rate of 2.4 kb/s, all the encoded harmonic phase information is contained in a single parameter called the phase dispersion factor. Before discussing phase dispersion, however, it is useful to present the general problem of phase encoding. Not only does it provide background which is useful in understanding phase dispersion, but it may also be valuable in future SEC systems. For example, phonetic classifiers might be used to identify speech segments where phase information is more important in terms of perceptual quality than magnitude information, suggesting dynamic bit assignment between the spectral magnitude and spectral phase quantizers.

Figure 6.9 illustrates a possible phase quantization method. First, spectral estimation is performed using analysis windows centered on the first sample of each subframe of the unquantized excitation signal e(n). Subframes are Ls samples long. The analysis centered over sample n = Ls yields the fundamental frequency, ω0(Ls), and the set of measured phases, {φk(Ls)}. The fundamental frequency is quantized using a scalar quantizer giving ω̂0(Ls). The quantized fundamental frequency and phases from the previous analysis centered on sample n = 0 are then used to compute the set of predicted phases, {φ̃k(Ls)}, at n = Ls according to the prediction formula

    \tilde{\phi}_k(L_s) = \hat{\phi}_k(0) + k \, \hat{\omega}_0(0) \, L_s   (6.34)

and the prediction residual phases to be quantized, {Δφk(Ls)}, are obtained by

    \Delta\phi_k(L_s) = \phi_k(L_s) - \tilde{\phi}_k(L_s).   (6.35)
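The C sketch below mirrors this prediction step as we have reconstructed it in eqns. (6.34)-(6.35); the wrapping of the residual into (−π, π] and all names are our assumptions.

    #include <math.h>

    /* Predict phases at n = Ls by advancing the previous update's phases
     * linearly at the previous harmonic frequencies, then form the
     * prediction residuals to be quantized. */
    void phase_prediction_residuals(const double *phi0, /* quantized phases at n = 0 */
                                    const double *phi1, /* measured phases at n = Ls */
                                    double w0,          /* quantized fundamental at n = 0 */
                                    int K, int Ls,
                                    double *dphi)       /* output: residuals */
    {
        for (int k = 0; k < K; k++) {
            double pred = phi0[k] + (k + 1) * w0 * Ls;  /* harmonic k+1 of w0 */
            double r = phi1[k] - pred;
            dphi[k] = atan2(sin(r), cos(r));            /* wrap to (-pi, pi] */
        }
    }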

The use of phase prediction residuals is designed to reduce the variance of the phase vector to be quantized. Based on eqn. (6.34), it is clear that the predicted phases will match the measured phases most closely during voiced segments, when the signal is approximately periodic and the change in fundamental frequency is slow enough that a linear model provides a good fit. This is illustrated in fig. 6.10, which shows phase prediction residuals for a typical voiced and unvoiced segment of speech. In fig. 6.10(a), the input speech signal for each segment is plotted, and the corresponding unquantized residual signals are plotted in fig. 6.10(b). Using a subframe length of 80 samples, the predicted phases were obtained by computing eqn. (6.34) and subtracted from the measured phases. The resulting prediction residuals are plotted in fig. 6.10(c). For the voiced segment, the predicted phases were very close to the measured phases for all but the uppermost harmonics. For the unvoiced segment, however, the predicted phases do not match the measured phases and the resulting prediction residual is large.


Figure 6.9: Harmonic Phase Quantization Using Prediction Residuals

Voiced Segment (left) and Unvoiced Segment (right): (a) speech signal, s(n); (b) excitation signal, e(n); (c) phase prediction residuals versus harmonic number, k.

Figure 6.10: Typical Phase Prediction Residuals for Voiced and Unvoiced Speech Segments

As was the case for the spectral magnitudes, the phase residuals to be quantized, {Δφk(Ls)}, form a variable-dimension vector, and NSTVQ can once again be used to handle this problem. Another approach to phase quantization is motivated by listening tests which indicate that phases corresponding to lower frequencies are more important than those corresponding to higher frequencies. In this case, only the first few harmonic phases may be encoded using a fixed-dimension vector quantizer.

Regardless of the method used for phase quantization, it is necessary to define a distortion measure which provides a meaningful indication of the fit between two sinusoids operating at the same frequency but with different initial phase. One obvious approach is to apply the mean squared error criterion directly to the two phase values. Such an approach, however, fails to take into account the circular nature of phase, resulting in MSE distortions which may not properly reflect the distortion in the underlying sinusoids. While it is possible to circumvent this problem by modifying the MSE criterion to subtract multiples of 2π until each distortion is between 0 and π, centroid computation using this modified criterion is not straightforward. Instead, we propose an alternate phase distortion measure which has a direct physical interpretation, does not suffer the ambiguities associated with MSE applied directly to the phase, and results in an unambiguous and easily computed phase centroid.

Phase Distortion Measure

Our approach is based on the fact that we are often interested in the phase of a signal because of the information it conveys about the temporal structure of that signal. Therefore, it makes sense to find a distortion measure between phase values which reflects the distortion in the time domain. We first define two time domain signals, x(t) and y(t), which are both periodic with period T0 and have identical spectral magnitudes (i.e., they differ only in phase). We can represent these signals in the frequency domain as

    x(t) = \sum_k C_k \cos(2\pi k f_0 t + \phi_k)   (6.36)

and

    y(t) = \sum_k C_k \cos(2\pi k f_0 t + \hat{\phi}_k)   (6.37)

where f0 = 1/T0. We define our distortion measure, D[x, y], to be the mean squared error between x(t) and y(t):

    D[x, y] = \frac{1}{T_0} \int_{-T_0/2}^{T_0/2} \left[ x(t) - y(t) \right]^2 dt.   (6.38)

By expanding the squared term, the integrand of the above equation can be written

    \left[ x(t) - y(t) \right]^2 = \sum_k \sum_l \left[ C_k \cos(2\pi k f_0 t + \phi_k) - C_k \cos(2\pi k f_0 t + \hat{\phi}_k) \right] \left[ C_l \cos(2\pi l f_0 t + \phi_l) - C_l \cos(2\pi l f_0 t + \hat{\phi}_l) \right].   (6.39)

Whenever k ≠ l in eqn. (6.39), the multiplication within the summations will produce four terms of the general form

    g(t) = \cos(2\pi k f_0 t + \alpha) \cos(2\pi l f_0 t + \beta).   (6.40)

It is straightforward to show that²

    \frac{1}{T_0} \int_{-T_0/2}^{T_0/2} g(t) \, dt = 0 \quad \text{for } k \neq l.   (6.42)

Stated differently, the integral of the product of two harmonic sinusoids over one period of the fundamental frequency is zero, regardless of the phase difference between the harmonics. Substituting eqn. (6.39) into eqn. (6.38), reversing the order of summation and integration, and eliminating terms which are zero by taking advantage of eqn. (6.42), we obtain

    D[x, y] = \sum_k \frac{1}{T_0} \int_{-T_0/2}^{T_0/2} C_k^2 \left[ \cos(2\pi k f_0 t + \phi_k) - \cos(2\pi k f_0 t + \hat{\phi}_k) \right]^2 dt.   (6.43)

Each one of the terms in the summation of eqn. (6.43) is an integral of the form

    D_p(\phi, \hat{\phi}) = \frac{1}{T_0} \int_{-T_0/2}^{T_0/2} C^2 \left[ \cos(2\pi k f_0 t + \phi) - \cos(2\pi k f_0 t + \hat{\phi}) \right]^2 dt   (6.44)

which evaluates to

    D_p(\phi, \hat{\phi}) = C^2 \left[ 1 - \cos(\phi - \hat{\phi}) \right].   (6.45)

²Using the trigonometric identity cos(α + β) = cos(α)cos(β) − sin(α)sin(β), g(t) can be transformed into an expression involving products of sinusoids having different but harmonically related frequencies and zero phase.

Dp(φ, φ̂), then, is the mean-squared error distortion between two sinusoids of identical frequency and phases φ and φ̂ respectively, where C is the spectral magnitude of the sinusoids. This shows that the distortion is independent of frequency and depends only on the difference between the two phase values and the spectral magnitude. As expected, the following properties are observed:

• the distortion approaches zero as the phase difference approaches zero

• the distortion is maximum when the phase difference is π

• the distortion function is cyclical with a period of 2π radians

Figure 6.11 plots the distortion measure Dp(φ, φ̂) as a function of δφ = φ − φ̂ over the range 0 ≤ δφ ≤ 2π.

Figure 6.11: Phase distortion Dp(φ, φ̂) as a function of φ − φ̂

We can now substitute eqn. (6.45) into eqn. (6.43) to get the final expression for the mean-squared error between x(t) and y(t):

    D[x, y] = \sum_k C_k^2 \left[ 1 - \cos(\phi_k - \hat{\phi}_k) \right].   (6.46)

If we consider sampled data with discrete period N, all summations go from 0 to N − 1 rather than from 0 to ∞³. The distortion measure for sampled signals x[n] and y[n] is then:

    D[x, y] = \sum_{k=0}^{N-1} C_k^2 \left[ 1 - \cos(\phi_k - \hat{\phi}_k) \right].   (6.47)
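As a concrete illustration, here is a small C sketch of the per-harmonic distortion Dp and the total distortion of eqn. (6.47); the function names and array conventions are ours.

    #include <math.h>

    /* Per-harmonic phase distortion: Dp(phi, phi_hat) = C^2 (1 - cos(phi - phi_hat)). */
    double phase_distortion(double c, double phi, double phi_hat)
    {
        return c * c * (1.0 - cos(phi - phi_hat));
    }

    /* Total distortion of eqn. (6.47), summed over K harmonics. */
    double total_phase_distortion(const double *c, const double *phi,
                                  const double *phi_hat, int K)
    {
        double d = 0.0;
        for (int k = 0; k < K; k++)
            d += phase_distortion(c[k], phi[k], phi_hat[k]);
        return d;
    }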

Centroid for VQ Design

In this section we derive the centroid of a training set of phase values using the distortion criterion given by eqn. (6.47). Because the distortion for harmonic k is independent of both the magnitude and phase for any harmonic l ≠ k, we can simplify the following centroid computation by dropping the harmonic number subscript and considering each harmonic separately. Given a set of L phase values from a training set, φ⁽ⁱ⁾, and the corresponding spectral magnitudes C⁽ⁱ⁾, where i = 1...L, we would like to compute a centroid phase φ* which minimizes the total distortion, Dt, given by

    D_t = \sum_{i=1}^{L} \left( C^{(i)} \right)^2 \left[ 1 - \cos\left( \phi^{(i)} - \phi^* \right) \right].   (6.48)

To minimize the distortion with respect to φ*, we take the derivative, set it to zero, and solve for φ*, giving

    \sum_{i=1}^{L} \left( C^{(i)} \right)^2 \sin\left( \phi^{(i)} - \phi^* \right) = 0   (6.50)

which leads to the following equation for computing the centroid:

    \phi^* = \tan^{-1} \left( \frac{ \sum_{i=1}^{L} \left( C^{(i)} \right)^2 \sin \phi^{(i)} }{ \sum_{i=1}^{L} \left( C^{(i)} \right)^2 \cos \phi^{(i)} } \right).   (6.51)

Note that the inverse tangent must be computed such that the resultant phase angle is in the correct quadrant (for example, in the C programming language, atan2(y, x) should be used, with y and x the numerator and denominator of eqn. (6.51) respectively).
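A minimal C sketch of this centroid computation follows; the function name and array conventions are ours.

    #include <math.h>

    /* Magnitude-weighted phase centroid of eqn. (6.51). phi[] and c[] hold
     * the L training phases and their spectral magnitudes for one harmonic;
     * atan2 places the result in the correct quadrant. */
    double phase_centroid(const double *phi, const double *c, int L)
    {
        double num = 0.0, den = 0.0;
        for (int i = 0; i < L; i++) {
            double w = c[i] * c[i];
            num += w * sin(phi[i]);
            den += w * cos(phi[i]);
        }
        return atan2(num, den);   /* undefined only if all magnitudes are zero */
    }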

³Note that for a real, discrete, periodic signal, there are only N/2 + 1 unique harmonic phases due to the symmetry of the Discrete Fourier Transform; therefore summations to N − 1 can be replaced with summations to N/2.

Analysis of eqn. (6.51) reveals some interesting properties. First, the centroid equation is undefined (both arguments of the arctangent are zero) only when all spectral magnitudes are zero and no unique centroid exists. Second, eqn. (6.51) never uses phase values directly; they are always applied as arguments to sine and cosine functions. As a result, there is no ambiguity due to the circular nature of phase. For example, the distortion between two phase values which differ by a multiple of 2π will be zero without special handling.

6.5.4 Phase Dispersion

In low bit-rate harmonic coding systems, there are not enough bits available for phase encoding. One approach is simply to set the phase residual to zero and instead use only the predicted phase, which is already available at both the transmitter and receiver. When this approach is used to encode speech, however, the resulting codec output sounds buzzy during unvoiced segments and sometimes sounds robotic or unnatural during voiced segments. If, on the other hand, the receiver randomly assigns values to the phase residuals using a uniform distribution between −π and π, the unvoiced speech sounds natural while the voiced speech sounds whispered or breathy. This suggests that it may be possible to use a voicing-dependent model to replace direct quantization of the phase residuals. In SEC at low bit rates, quantization of the phase residuals, Δφk, is replaced by

    \Delta\phi_k = \begin{cases} 0, & k \leq h_c \\ \beta \, U[-\pi, \pi], & h_c < k \leq K(\omega_0) \end{cases}   (6.52)

where hc is the cutoff harmonic (defined below), K(ω0) is the number of harmonics for the current subframe, U[−a, a] is a uniform random variable defined over the interval −a...a, and β is a parameter which modifies the range of the randomized phase residual.

Experimentally, it was found that a good approach for obtaining the cutoff frequency uses the fundamental frequency, ω0, and is given by

    h_c = \begin{cases} K(\omega_0), & D_\phi \geq D_h \\ 0, & D_\phi \leq D_l \\ \left\lfloor \frac{D_\phi - D_l}{D_h - D_l} \, K(\omega_0) \right\rfloor, & \text{otherwise} \end{cases}   (6.53)

where Dφ is defined as the phase dispersion factor, and Dl and Dh are heuristically determined parameters. In SEC, Dφ is computed using a frame classifier based on the normalized autocorrelation at the pitch lag given by eqn. (6.11); during strongly voiced frames, Dφ is close to one, and during strongly unvoiced frames, Dφ is close to zero. Note that when quantization of the phase residuals is replaced with the model defined above, Dφ must be quantized and transmitted to the receiver.

During subjective testing using an SEC system with phase dispersion as defined above, the reconstructed speech for male speakers was often described as being too breathy, indicating that the phase vector for male speakers contained too large a random component. Analysis showed that for male speakers, ρ(P) tends to be lower, probably due to the longer pitch periods. To improve subjective quality, the upper dispersion limit Dh was made to be dependent on the pitch period according to eqn. (6.54),

where all constants were determined through experimentation. By substituting eqn. (6.54) into eqn. (6.53), it can be seen that the cutoff frequency is higher for lower-pitched speakers, resulting in fewer harmonics being randomized. Figure 6.12 illustrates the effect of the adaptive dispersion equation by plotting the fraction of randomized harmonics versus the dispersion factor for two different pitch values of 20 and 80. It can be seen, for example, that when the phase dispersion factor is 0.5, 70% of the phases are randomized for a high-pitched speaker with P = 20, but only 40% of the phases are randomized for a low-pitched speaker with P = 80.
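The following C sketch illustrates the dispersion mechanism as we have reconstructed it in eqns. (6.52)-(6.53); the clipping of the cutoff harmonic and the random number source are illustrative assumptions, not the SEC source.

    #include <stdlib.h>

    static const double PI = 3.14159265358979323846;

    /* Uniform random value in [-a, a] (illustrative generator). */
    static double uniform(double a)
    {
        return a * (2.0 * rand() / (double)RAND_MAX - 1.0);
    }

    /* Apply adaptive phase dispersion: harmonics above the cutoff h_c get
     * randomized residuals; those at or below it keep the predicted phase
     * (residual zero). */
    void phase_dispersion(double *dphi, int K, double D_phi,
                          double D_l, double D_h, double beta)
    {
        int h_c;
        if (D_phi >= D_h)      h_c = K;            /* strongly voiced: no randomization */
        else if (D_phi <= D_l) h_c = 0;            /* strongly unvoiced: randomize all */
        else                   h_c = (int)((D_phi - D_l) / (D_h - D_l) * K);

        for (int k = 0; k < K; k++)
            dphi[k] = (k < h_c) ? 0.0 : beta * uniform(PI);
    }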

6.5.5 Phase Interpolation

In SEC, the spectral phase estimates are obtained every Ls samples, where Ls is the number of samples in an excitation subframe and is dependent on the codec rate. Any interpolation method for obtaining the phase on a sample-by-sample basis according to eqn. (6.1) must ensure phase continuity at the subframe boundaries in order to avoid perceptual artifacts in the reproduced speech. As discussed in section 4.4.2, the MBE coder preserves phase continuity by allowing a small discontinuity in frequency at the subframe boundaries.


Figure 6.12: Effect of Pitch Period on Number of Randomized Harmonics using Adaptive Phase Dispersion

In SEC, a cubic interpolation method from [1] is used to preserve both phase and instantaneous frequency at the subframe boundaries. Assume that at n = 0 we have an estimate of the spectral phase for each harmonic, φ̂k(0), and the fundamental frequency, ω̂0(0). Similarly, at n = Ls, the first sample of the next subframe, we have estimates given by φ̂k(Ls) and ω̂0(Ls). Our goal is to synthesize a signal over the range 0 ≤ n < Ls using eqn. (6.1) with interpolation of θ̂k(n) such that the following conditions are met:

    \hat{\theta}_k(0) = \hat{\phi}_k(0)   (6.55)
    \hat{\theta}_k(L_s) = \hat{\phi}_k(L_s)   (6.56)
    \hat{\theta}'_k(0) = k \, \hat{\omega}_0(0)   (6.57)
    \hat{\theta}'_k(L_s) = k \, \hat{\omega}_0(L_s)   (6.58)

Equations (6.55) and (6.56) ensure phase continuity and equations (6.57) and (6.58) ensure frequency continuity. To satisfy these conditions we require a cubic interpolator for θ̂k(n) of the form

    \hat{\theta}_k(n) = a + bn + cn^2 + dn^3.   (6.59)

Solving for the four unknown coefficients is straightforward using eqns. (6.55) to (6.58) and yields the following phase interpolator:

    a = \hat{\phi}_k(0), \qquad b = k \, \hat{\omega}_0(0),

    c = \frac{3}{L_s^2} \left[ \hat{\phi}_k(L_s) - \hat{\phi}_k(0) - k \, \hat{\omega}_0(0) L_s \right] - \frac{k}{L_s} \left[ \hat{\omega}_0(L_s) - \hat{\omega}_0(0) \right],

    d = -\frac{2}{L_s^3} \left[ \hat{\phi}_k(L_s) - \hat{\phi}_k(0) - k \, \hat{\omega}_0(0) L_s \right] + \frac{k}{L_s^2} \left[ \hat{\omega}_0(L_s) - \hat{\omega}_0(0) \right].   (6.60)
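A C sketch of this cubic (Hermite) phase interpolation follows; note that in practice the end-point phase is usually unwrapped by a multiple of 2π before the coefficients are computed, a detail omitted here, and all names are illustrative.

    /* Cubic phase interpolator matching phase and frequency at both
     * subframe boundaries (eqns. (6.55)-(6.60)). theta[n] receives the
     * interpolated phase of harmonic k for 0 <= n < Ls. */
    void cubic_phase_interp(double phi0, double phi1,   /* phases at n = 0 and n = Ls */
                            double w0_0, double w0_1,   /* fundamentals at n = 0 and n = Ls */
                            int k, int Ls, double *theta)
    {
        double b  = k * w0_0;
        double dp = phi1 - phi0 - b * Ls;        /* phase excess over the linear term */
        double dw = k * (w0_1 - w0_0);           /* harmonic frequency change */
        double c  = 3.0 * dp / ((double)Ls * Ls) - dw / Ls;
        double d  = -2.0 * dp / ((double)Ls * Ls * Ls) + dw / ((double)Ls * Ls);
        for (int n = 0; n < Ls; n++)
            theta[n] = phi0 + b * n + c * (double)n * n + d * (double)n * n * n;
    }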

6.6 Spectral Excitation Coding at 2.4 kb/s

In this section we present the details of the spectral excitation coding system operating at 2.4 kb/s. The system is currently implemented in the C programming language as a general-purpose, floating-point computer simulation. Informal listening tests indicate that the quality of SEC at 2.4 kb/s is very close to that of IMBE operating at almost twice the rate.

6.6.1 System Description

Figure 6.13 shows a block diagram of the 2.4 kb/s SEC system. Once each 30 ms frame, the speech spectral envelope is estimated using tenth-order LPC analysis. Analysis is performed on speech segments obtained by multiplying the input speech signal with a 45 ms Hamming window. The coefficients are converted to line spectral pairs and quantized once per frame using the tree-searched multi-stage vector quantization scheme presented in [5]. The 10 LSP coefficients are encoded using 24 bits each frame. We have found that using a 4-stage, 6 bits/stage MSVQ with 8 candidates results in a robust VQ with low spectral distortion. The quantized coefficients are then transformed back into LPCs and used in the short-term filter which computes the excitation signal according to eqn. (6.2). The filter coefficients are updated using LSP interpolation every 2 ms (Lint = 16)⁴. The excitation signal is reconstructed at the decoder using eqn. (6.1) applied over 5 ms subframes, giving 6 subframes per 30 ms frame. In order to reproduce the excitation signal at the decoder, estimates for three parameters are required once per subframe: the fundamental period (pitch) P, the phase dispersion factor Dφ, and the harmonic spectral magnitudes y.

⁴When parameters in this section are given in units of samples, a sampling rate of 8000 Hz is assumed.

Figure 6.13: Block Diagram of Low-Rate SEC System

PARAMETER   VALUE
Nb          16 subframes
Np          10 subframes
ΔPtol       4
Pmin        0.2
Ptol        0.4
Np          5

Table 6.1: Pitch tracking parameters used in 2.4 kb/s SEC

Every subframe, the pitch period P is estimated from the unquantized excitation signal using the autocorrelation-based method described in section 6.4.1. The pitch tracking algorithm of section 6.4.2 is used in an attempt to recognize and correct single-subframe pitch errors. Table 6.1 gives the actual values used for the parameters defined in section 6.4.2. Although the pitch must be computed every 5 ms because it is required for estimation of the phase dispersion factor, it is quantized only once every 15 ms using a 7-bit scalar quantizer. Values for unencoded subframes are obtained by linearly interpolating between quantized pitch values.

Each subframe, the pitch period is used to compute the phase dispersion factor, Dφ, using the normalized autocorrelation given by eqn. (6.11). Vector quantization is used to encode the 6 values of Dφ once per frame using a 6-bit VQ.
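As an illustration of the pitch update scheme described above, the sketch below linearly interpolates per-subframe pitch values between the 15 ms quantized updates; the array layout and function name are simplified assumptions.

    /* Pitch is estimated every subframe (5 ms) but encoded only every
     * 15 ms. p_q[] holds the frame's two quantized values plus the next
     * update; unencoded subframes are linearly interpolated. */
    void interpolate_pitch(const double p_q[3],  /* quantized pitch at 15 ms updates */
                           double p_sub[6])      /* per-subframe pitch values */
    {
        for (int s = 0; s < 6; s++) {
            int j = s / 3;                       /* preceding quantized anchor */
            double w = (s % 3) / 3.0;
            p_sub[s] = (1.0 - w) * p_q[j] + w * p_q[j + 1];
        }
    }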

PARAMETER            Bits/Update   Updates   Rate (bps)
Envelope LSPs        24            1         800
Pitch Period         7             2         467
Phase Disp. Factor   1 (VQ)        6         200
Exc. Gain            6             2         400
Spectral Mags        8             2         533
Total                                        2400

Table 6.2: Bit allocations for the 2.4 kb/s SEC codec using a frame length of 240 samples with 40 samples per subframe

Estimation of the excitation spectrum, y, is performed every 3 subframes (15 ms). The excitation signal is windowed using a pitch-sized rectangular window and the magnitude spectrum is estimated using the Discrete Fourier Transform (DFT). Spectral estimates for intermediate subframes are evaluated by linearly interpolating between quantized log spectral magnitudes. For speech segments which are not periodic, the system uses a fixed value of P = 100, and the components of y are simply samples of the excitation spectrum taken at frequencies kFs/P, where Fs is the sampling frequency. A new variable-length vector quantization method called Non-Square Transform Vector Quantization (NSTVQ) (see chapter 5) is used to transform y into a quantized, fixed-length vector z. Before quantization, the mean of the variable-length log spectra is removed and quantized separately using 6 bits every 15 ms. The remaining normalized vectors are quantized every 15 ms using NSTVQ with 8 bits. The NSTVQ configuration in SEC uses the DCT-II transform and a fixed dimension of M = 30.

The excitation synthesis model used in SEC is shown in Fig. 6.14. The model is based on a bank of sinusoidal oscillators having frequencies which are integer multiples of a fundamental (pitch) frequency, ω0(n). The index n is a sample index which shows that the fundamental frequency is time-varying. The output of each oscillator is scaled using a time-varying gain factor Mk(n), where k is the harmonic number. The phases which are applied to each oscillator can come from one of two sources: predicted phases or random noise. Switching between these two sources is controlled by the phase dispersion factor. Table 6.2 shows a summary of the bit allocations for the 2.4 kb/s SEC codec.


Figure 6.14: SEC Excitation Synthesis Model
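To make the synthesis model of fig. 6.14 concrete, here is a minimal C sketch of a harmonic oscillator bank in the style of eqn. (6.1); per-sample interpolation of the magnitudes and phase tracks is assumed to have been done by the caller, and all names are ours.

    #include <math.h>

    /* Oscillator-bank synthesis: e(n) is the sum over K harmonics of a
     * time-varying magnitude times a cosine at the harmonic's phase track. */
    void synthesize_excitation(const double *Mk,    /* K x Ls magnitudes, row k */
                               const double *theta, /* K x Ls phase tracks */
                               int K, int Ls, double *e)
    {
        for (int n = 0; n < Ls; n++) {
            double s = 0.0;
            for (int k = 0; k < K; k++)
                s += Mk[k * Ls + n] * cos(theta[k * Ls + n]);
            e[n] = s;
        }
    }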

6.6.2 Performance Evaluation

The performance of the 2.4 kb/s SEC system was evaluated using an informal Mean Opinion Score (MOS) test in which several speech codecs were used to encode 10 sentences, 5 from male speakers and 5 from female speakers. For each sentence, an uncompressed (16-bit PCM) version was played followed by the output of each codec being tested. The order of the codecs was randomized between sentences, and stereo headphones were used in which the same signal was sent to both channels. The sentences used were specifically chosen for the purpose of codec evaluation by the 1995 IEEE speech coding workshop committee, and were not used as part of any SEC codebook training (for more details, see Appendix A). Fourteen participants took part in the informal MOS test and were asked to rate the quality of each speech sample using a scale from 1 to 5 representing a subjective quality of bad, poor, fair, good, and excellent. This provides a total of 140 ratings for each system. The uncompressed samples were always played first as a reference and were given a score of 5 in advance. Included in the test were the existing 2.4 kb/s LPC-10e standard [52], the 4.15 kb/s IMBE standard [15] (see section 4.4.2), and the FS 1016 4.6 kb/s CELP standard [8] (see section 4.3). Two SEC systems were included in the test. The first, SEC-v1, is a previous baseline system operating at 2.45 kb/s. The bit allocation for SEC-v1 is given in table 6.3. The second system, SEC-v2, is an improved version which was summarized in table 6.2.

PARAMETER            Bits/Update   Updates   Rate (bps)
Envelope LSPs        24            1         600
Pitch Period         7             2         350
Phase Disp. Factor   4             4         400
Exc. Gain            5             4         500
Spectral Mags        6             4         600
Total                                        2450

Table 6.3: Bit allocations for SEC-v1 using a frame length of 320 samples with 80 samples per subframe

                         Mean Opinion Score       Variance
System      Rate (bps)   All   Male   Female     All    Male   Female
IMBE        4150         3.4   3.3    3.5        0.57   0.62   0.53
FS 1016     4600         3.3   3.2    3.4        0.56   0.59   0.51
SEC-v2      2400         3.2   3.1    3.3        0.57   0.59   0.52
SEC-v1      2450         3.0   3.0    3.0        0.62   0.63   0.61
LPC-10e     2400         1.8   1.7    1.8        0.57   0.55   0.59

Table 6.4: Mean opinion score (MOS) results

As can be seen from the two tables, the improved version uses a VQ for quantization of the phase dispersion factor. The use of a VQ makes it possible to encode the dispersion factor every 40 samples rather than every 80 samples, while using only half the rate. Further improvements were made by reducing the frame length from 320 samples to 240 samples. As a result, the pitch period can be encoded at a lower rate, leaving more bits for spectral magnitude encoding.

The results of the test are shown in table 6.4. The 2.4 kb/s SEC-v2 system scored within 0.1 MOS points of the FS 1016 CELP system operating at 4.6 kb/s, and within 0.2 MOS points of the 4.15 kb/s IMBE standard. The MOS differences were consistent for both male and female speakers. The existing LPC-10e standard performed poorly on these sentences, obtaining an MOS of 1.8. The results also indicate that the quality of the SEC algorithm was improved from the older baseline (SEC-v1) through the use of vector quantization of the phase dispersion factor, a shorter frame length, and a higher encoding rate for the spectral magnitudes.

In an attempt to determine how SEC might be improved, several people were asked to judge the quality of the SEC-v2 system in informal interviews. The most common comment made was that the unvoiced sounds were often unnatural and sometimes annoying. It is possible that this problem may be alleviated through the development of a more sophisticated phase dispersion algorithm. A successful approach to unvoiced sound synthesis was recently reported in [43], in which separate spectral quantization codebooks were used for unvoiced sounds in combination with rapid updates of the RMS gain.

6.7 Conclusions

In this chapter, we have presented the details of a harmonic coding system called SEC which applies a sinusoidal speech production model to the short-term filter residual signal. The SEC system operating at 2.4 kb/s was shown to achieve a quality close to that of both the IMBE 4.15 kb/s standard and the 4.6 kb/s FS 1016 standard. By improving the main deficiency of the SEC system, the synthesis of unvoiced sounds, it is believed that the quality of the SEC system can exceed that of both the IMBE and FS 1016 standards.

Chapter 7

Conclusions

The recent emergence of sinusoidal coders at rates of 2 kb/s to 4 kb/s has brought a new set of challenges to the field of speech coding. We have addressed some of these challenges in a direct and practical way through the development of a new coding system called spectral excitation coding, in which a harmonic synthesis model is applied to the excitation signal rather than the speech signal. In particular, we have developed new algorithms for phase quantization, adaptive phase dispersion, and open-loop pitch estimation combined with pitch tracking. A generalized form of the spectral magnitude estimation equation was developed which does not rely on the window spectrum orthogonality assumption.

The most important contribution of this work is the development of NSTVQ, a new technique for vector quantization of the variable-dimension harmonic magnitude vector. Results were presented which show that NSTVQ out-performs existing techniques in terms of providing lower distortion along with lower complexity and storage requirements. In particular, we provided experimental results which show that NSTVQ used in the Improved Multiband Excitation (IMBE) environment could achieve equivalent spectral distortion while reducing the overall rate by 1000-1250 bits per second.

7.1 Suggestions for Future Work

This section provides suggestions for further research into several areas covered in this thesis.

Investigate the possibility of using more sophisticated classifiers (rather than simply using the vector dimension) for NSTVQ. For example, a voiced/unvoiced classifier may be combined with the vector dimension classifier in order to select an appropriate transform.

As discussed in Chapter 5, the optimal structure for variable dimension vector quantization, which uses one VQ for each possible input dimension, has an exceedingly large storage requirement when used to quantize harmonic magnitude vectors. The NSTVQ system presented uses a single VQ. It would be interesting to investigate a compromise between these two extremes in which several VQs may be used in combination with NSTVQ.

In Chapter 6, it was explained that the main deficiency with the SEC system was the synthesis of unvoiced sounds. Further investigation into more sophisticated phase dispersion models may yield substantial improvements to the SEC voice quality.

Complexity reduction schemes should be investigated for both the application of the non-square transform in NSTVQ as well as the synthesis of unvoiced speech in SEC.

Appendix A

Mean Opinion Score (MOS) Test Details

This appendix contains the details of the MOS testing, including the actual sentences used and the overall spectral characteristics of the sentences. The following ten sentences, encoded with 16-bit PCM, were used as input to the various speech codecs in the MOS tests.

1. (female): The reasons for this dive seemed foolish now

2. (male): He has never himself done anything for which to be hated. Which of us has.

3. (female): It provides a frame for the sampling senses.

4. (male): The feet wore army shoes in obvious disrepair.

5. (female): He paused, then added, everything on a ship is a weapon.

6. (male): Yes sir, she said. Is that definite?

7. (female): What factors condition the degree of realization at various times and places?

8. (male): All the while she sat there, her sinewy arms swirled before her chest.

9. (female): Every movement she made seemed unnecessarily noisy.

10. (male): As a rule, the autistic child doesn't enjoy physical contact with others.

Figure A.1 shows the spectral characteristics of the speech data used for MOS testing. The estimate was obtained by averaging overlapped short-time Fourier transforms across all speech files.


Figure A.1: Spectral characteristics of the MOS speech material.

References

[1] L. Almeida and F. Silva, "Variable-frequency synthesis: An improved harmonic coding scheme," in Proc. ICASSP, (San Diego), 1984.

[2] B. Atal, V. Cuperman, and A. Gersho, Advances in Speech Coding. Kluwer Academic Publishers, 1991.

[3] B. Atal, V. Cuperman, and A. Gersho, Speech and Audio Coding for Wireless and Network Applications. Kluwer Academic Publishers, 1993.

[4] B. Atal, R. Cox, and P. Kroon, "Spectral quantization and interpolation for CELP coders," in Proc. ICASSP, pp. 69-72, 1989.

[5] B. Bhattacharya, W. LeBlanc, S. Mahmoud, and V. Cuperman, "Tree searched multi-stage vector quantization of LPC parameters for 4 kb/s speech coding," in Proc. ICASSP, pp. 105-108, 1992.

[6] R. N. Bracewell, "The fast Hartley transform," in Proceedings of the IEEE, vol. 72, pp. 1010-1018, 1984.

[7] M. S. Brandstein, "A 1.5 kbps multi-band excitation speech coder," Master's thesis, EECS Dept., MIT, 1990.

[8] J. Campbell, V. Welch, and T. Tremain, CELP Documentation Version 3.2. U.S. DoD, Fort Meade, MD, September 1990.

[9] J. H. Chen, R. Cox, Y. Lin, N. Jayant, and M. Melchner, "A low-delay CELP coder for the CCITT 16 kb/s speech coding standard," IEEE Journal on Selected Areas in Communications, vol. 10, pp. 830-849, June 1992.

[10] V. Cuperman, "On adaptive vector transform quantization for speech coding," IEEE Transactions on Communications, vol. 37, pp. 261-267, March 1989.

[11] V. Cuperman and P. Lupini, "Variable rate speech coding," in Modern Methods of Speech Processing, Kluwer Academic Publishers, 1995.

[12] V. Cuperman, P. Lupini, and B. Bhattacharya, "Spectral excitation coding of speech at 2.4 kb/s," in Proc. ICASSP, pp. 496-499, 1995.

[13] A. Das, A. Rao, and A. Gersho, "Enhanced multiband excitation coding of speech at 2.4 kb/s with discrete all-pole spectral modeling," in Proc. IEEE Globecomm, (San Francisco), 1994.

[14] A. Das, A. V. Rao, and A. Gersho, "Variable-dimension vector quantization of speech spectra for low-rate vocoders," in Proc. Data Compression Conference, pp. 421-429, 1994.

[15] Digital Voice Systems, Inmarsat-M Voice Codec, Version 2. Inmarsat, February 1991.

[16] C. Garcia et al., "Analysis, synthesis, and quantization procedures for a 2.5 kb/s voice coder obtained by combining LP and harmonic coding," in Signal Processing VI: Theories and Applications (J. Vandewalle, R. Boite, M. Moonen, and A. Oosterlinck, eds.), pp. 471-474, Elsevier Science Publishers, 1992.

[17] A. Gersho and R. Gray, Vector Quantization and Signal Compression. Boston: Kluwer, 1992.

[18] A. Gersho, "Asymptotically optimal block quantization," IEEE Transactions on Information Theory, vol. IT-25, pp. 373-380, July 1979.

[19] A. Gersho, "Advances in speech and audio compression," in Proceedings of the IEEE, vol. 82, June 1994.

[20] A. Gersho and E. Paksoy, "An overview of variable rate speech coding for cellular networks," in Proc. of the Int. Conf. on Selected Topics in Wireless Communications, (Vancouver, B.C., Canada), 1992.

[21] I. Gerson and M. Jasiuk, "Vector sum excited linear prediction (VSELP) speech coding at 8 kbps," in Proc. ICASSP, pp. 461-464, 1990.

[22] D. W. Griffin and J. S. Lim, "Multiband excitation vocoder," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 36, pp. 1223-1235, August 1988.

[23] R. V. L. Hartley, "A more symmetrical Fourier analysis applied to transmission problems," in Proceedings of the IRE, vol. 30, pp. 144-150, 1942.

[24] P. Jacobs and W. Gardner, "QCELP: A variable rate speech coder for CDMA digital cellular systems," in Speech and Audio Coding for Wireless and Network Applications (B. S. Atal, V. Cuperman, and A. Gersho, eds.), Kluwer Academic Publishers, 1993.

[25] W. Kleijn and K. K. Paliwal, Speech Coding and Synthesis. Elsevier Science B.V, 1995.

[26] A. M. Kondoz, Digital Speech Coding for Low Bit Rate Communication Systems. John Wiley & Sons, 1994.

[27] Y. Linde, A. Buzo, and R. M. Gray, "An algorithm for vector quantizer design," IEEE Transactions on Communications, vol. COM-28, pp. 84-95, 1980.

[28] S. P. Lloyd, "Least squares quantization in PCM," IEEE Transactions on Information Theory, vol. IT-28, pp. 129-137, 1982.

[29] P. Lupini, N. B. Cox, and V. Cuperman, "A multi-mode variable rate CELP coder based on frame classification," in Proc. International Conference on Communications, (Geneva), 1993.

[30] P. Lupini and V. Cuperman, "Spectral excitation coding of speech," in Proc. SBT/IEEE International Telecommunications Symposium, (Brazil), August 1994.

[31] P. Lupini and V. Cuperman, "Vector quantization of harmonic magnitudes for low-rate speech coders," in Proc. IEEE Globecomm, (San Francisco), 1994.

[32] P. Lupini and V. Cuperman, "Non-square transform vector quantization," To be published in IEEE Signal Processing Letters, 1995.

[33] P. Lupini and V. Cuperman, "Non-square transform vector quantization for low-rate speech coding," Proc. IEEE Workshop on Speech Coding for Telecommunications, 1995.

[34] P. Lupini, H. Hassanein, and V. Cuperman, "A 2.4 kb/s CELP speech codec with class-dependent structure," in Proc. ICASSP, (Minneapolis), 1993.

[35] J. Makhoul, S. Roucos, and H. Gish, "Vector quantization in speech coding," in Proceedings of the IEEE, vol. 73, pp. 1551-1588, 1985.

[36] J. Markel and A. Gray, Linear Prediction of Speech. Springer, 1976.

[37] R. J. McAulay and T. F. Quatieri, "Sinusoidal coding," in Speech Coding and Synthesis, Elsevier Science B.V, 1995.

[38] R. McAulay and T. Quatieri, "Multirate sinusoidal transform coding at rates from 2.4 kb/s to 8 kb/s," in Proc. ICASSP, pp. 945-948, 1985.

[39] R. McAulay and T. Quatieri, "Speech analysis-synthesis based on a sinusoidal representation," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-34, pp. 744-754, 1986.

[40] R. McAulay and T. Quatieri, "Speech transformations based on a sinusoidal rep- resentation," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-34, pp. 1449-1464, 1986.

[41] R. McAulay and T. Quatieri, "Sine-wave phase coding at low data rates," in Proc. ICASSP, 1991.

[42] R. McAulay, T. Parks, T. Quatieri, and M. Sabin, "Sine-wave amplitude coding at low data rates," in Proc. IEEE Workshop on Speech Coding for Telecommunications, (Vancouver, B.C.), 1989.

[43] M. Nishiguchi and J. Matsumoto, "Harmonic and noise coding of LPC residuals with classified vector quantization," in Proc. ICASSP, pp. 484-487, 1995.

[44] E. Paksoy, K. Srinivasan, and A. Gersho, "Variable rate CELP coding of speech with phonetic classification," European Transactions on Telecommunications, September 1994.

[45] D. B. Paul, "The spectral envelope estimation vocoder," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-29, pp. 786-794, 1981.

[46] T. Quatieri and R. Danisewicz, "An approach to co-channel talker interference suppression using a sinusoidal model for speech," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-38, pp. 56-69, 1990.

[47] K. R. Rao and P. Yip, Discrete Cosine Transform: Algorithms, Advantages, and Applications. Boston: Harcourt Brace Jovanovich, 1990.

[48] M. Schroeder and B. Atal, "Code-excited linear prediction (CELP): High-quality speech at very low bit rates," in Proc. ICASSP, pp. 937-940, 1985.

[49] Y. Shoham, "High-quality speech coding at 2.4 to 4.0 kbps based on time-frequency interpolation," in Proc. ICASSP, (Minneapolis), 1993.

[50] P. Strobach, Linear Prediction Theory. Springer-Verlag, 1990.

[51] H. L. Van Trees, Detection, Estimation and Modulation Theory, Part I. Wiley, New York, 1968.

[52] T. E. Tremain, "The government standard linear predictive coding algorithm: LPC-10," Speech Technology, pp. 40-49, April 1982.

[53] S. Yeldener, A. M. Kondoz, and B. G. Evans, "High quality multiband LPC coding of speech at 2.4 kb/s," Electronics Letters, vol. 27, no. 14, 1991.