Thesis for the degree of Doctor of Philosophy

Entropy and Speech

Mattias Nilsson

Sound and Image Processing Laboratory School of Electrical Engineering KTH (Royal Institute of Technology)

Stockholm 2006

Copyright © 2006 Mattias Nilsson, except where otherwise stated. All rights reserved.

ISBN 91-628-6861-6 TRITA-EE 2006:014 ISSN 1653-5146

Sound and Image Processing Laboratory, School of Electrical Engineering, KTH (Royal Institute of Technology), SE-100 44 Stockholm, Sweden. Telephone +46 (0)8-790 7790.

Abstract

In this thesis, we study the representation of speech signals and the estimation of information-theoretical measures from observations containing features of the speech signal. The main body of the thesis consists of four research papers.

Paper A presents a compact representation of the speech signal that facilitates perfect reconstruction. The representation consists of models, model parameters, and signal coefficients. A difference compared to existing speech representations is that we seek a compact representation by adapting the models to maximally concentrate the energy of the signal coefficients according to a selected energy concentration criterion. The individual parts of the representation are closely related to speech signal properties such as spectral envelope, pitch, and voiced/unvoiced signal coefficients, which is beneficial for both coding and modification.

From the information-theoretical measure of entropy, performance limits in coding and classification can be derived. Papers B and C discuss the estimation of differential entropy. Paper B describes a method for estimation of the differential entropy in the case when the set of vector observations (from the representation) lies on a lower-dimensional surface (manifold) in the embedding space. In contrast to the method presented in Paper B, Paper C introduces a method where the manifold structures are destroyed by constraining the resolution of the observation space. This facilitates the estimation of bounds on classification error rates even when the manifolds are of varying dimensionality within the embedding space.

Finally, Paper D investigates the amount of shared information between spectral features of narrow-band (0.3-3.4 kHz) and high-band (3.4-8 kHz) speech. The results in Paper D indicate that the information shared between the high-band and the narrow-band is insufficient for high-quality wide-band speech coding (0.3-8 kHz) without transmission of extra information describing the high-band.

Keywords: speech representation, energy concentration, entropy estimation, manifolds.


List of Papers

The thesis is based on the following papers:

[A] M. Nilsson, B. Resch, M.-Y. Kim, and W. B. Kleijn, “A Canonical Representation of Speech”, to be submitted to IEEE Transactions on Speech, Audio, and Language Processing, 2006.

[B] M. Nilsson and W. B. Kleijn, “On the Estimation of Differential Entropy from Data Located on Embedded Manifolds”, submitted to IEEE Transactions on Information Theory, 2004.

[C] M. Nilsson and W. B. Kleijn, “Intrinsic Dimensionality and its Implication for Performance Prediction in Pattern Classification”, to be submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence, 2006.

[D] M. Nilsson, H. Gustafsson, S. V. Andersen, and W. B. Kleijn, “Gaussian Mixture Model based Mutual Information Estimation between Frequency Bands in Speech”, in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. 525-528, Orlando, Florida, USA, 2002.

In addition to papers A-D, the following papers have also been produced in part by the author of the thesis:

[1] M. Nilsson, S. V. Andersen, and W. B. Kleijn, “On the Mutual Information between Frequency Bands in Speech”, in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 3, pp. 1327-1330, Istanbul, Turkey, 2000.

[2] M. Nilsson and W. B. Kleijn, “Avoiding Over-Estimation in Bandwidth Extension of Telephony Speech”, in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. 869-872, Salt Lake City, Utah, USA, 2001.

[3] M. Faundez-Zanuy, M. Nilsson, and W. B. Kleijn, “On the Relevance of Bandwidth Extension for Speaker Verification”, in Proceedings of the International Conference on Spoken Language Processing, pp. 2317-2320, Denver, Colorado, USA, 2002.

[4] M. Faundez-Zanuy, M. Nilsson, and W. B. Kleijn, “On the Relevance of Bandwidth Extension for Speaker Identification”, in Proceedings of the European Signal Processing Conference (EUSIPCO), vol. 3, pp. 125-128, Toulouse, France, 2002.

[5] M. Nilsson and W. B. Kleijn, “Shannon Entropy Estimation based on High-Rate Quantization Theory”, in Proceedings of the European Signal Processing Conference (EUSIPCO), pp. 1753-1756, Vienna, Austria, 2004.

[6] S. Srinivasan, M. Nilsson, and W. B. Kleijn, “Denoising through Source Separation and Minimum Tracking”, in Proceedings of Interspeech - Eurospeech, pp. 2349-2352, Lisboa, Portugal, 2005.

[7] S. Srinivasan, M. Nilsson, and W. B. Kleijn, “Speech Denoising through Source Separation and Min-Max Tracking”, submitted to IEEE Signal Processing Letters, 2005.

[8] B. Resch, M. Nilsson, A. Ekman, and W. B. Kleijn, “Estimation of the Instantaneous Pitch of Speech”, submitted to IEEE Transactions on Speech, Audio, and Language Processing, 2006.

Acknowledgements

I am now approaching the end of my Ph.D. studies, and I would like to take the opportunity to thank everyone who, in one way or another, has supported me during this time. Although it is very likely that I forget someone, I still wish to mention a number of people who have been of great importance throughout this journey.

First of all, I would like to express my gratitude to my supervisor, Prof. Bastiaan Kleijn. Without his knowledge, clever ideas, and genuine interest in research, this thesis would not have been possible.

Next, I would like to thank all my present and former colleagues and guests at the Sound and Image Processing Lab: Anders Ekman, Andrei Jefremov, Arne Leijon, Barbara Resch, Christian Feldbauer, David Zhao, Davor Petrinovic, Dora Söderberg, Elisabet Molin, Ermin Kozica, Geoffrey Chan, Harald Pobloth, Kahye Song, Jan Plasberg, Jesús De Vicente Peña, Jonas Lindblom, Jonas Samuelsson, Karolina Smeds, Manohar Murthi, Martin Dahlquist, Moo-Young Kim, Peter Kabal, Peter Nordqvist, Renat Vafin, Shenghui Zhao, Søren Vang Andersen, Sriram Srinivasan, Volodya Grancharov, Yannis Agiomyrgiannakis, and Yusuke Hiwasaki. Your technical and social support cannot be overstated.

I especially would like to thank my previous and current office mates: Søren Vang Andersen, Manohar Murthi, Moo-Young Kim, and David Zhao for our numerous discussions on various topics such as information theory, numbness, Kimchi, diapers and despair. I would also like to devote a special thanks to Prof. Bastiaan Kleijn, Barbara Resch, David Zhao, Jan Plasberg, and Sriram Srinivasan for the proofreading of, and all the discussions on, the introduction and the papers included in this thesis.

Finally, using the highest number known by my eldest son, thirteen-eighteen-thousand-millions of kisses go to my beloved wife Kristina and my two wonderful boys Axel and Albin for their endless encouragement and unconditional love.

Mattias Nilsson Stockholm, May 2006.


Contents

Abstract i

List of Papers iii

Acknowledgements v

Contents vii

Acronyms xi

I Introduction 1

Introduction 1
1 Speech signal representation ...... 2
1.1 Motivation ...... 2
1.2 Goodness of a representation ...... 3
1.3 Some existing speech representations ...... 4
1.4 Modeling tools ...... 6
2 Information extraction ...... 14
2.1 Entropy ...... 14
2.2 Relation to coding and classification ...... 19
2.3 Estimation of entropy ...... 22
2.4 Effects of manifolds for entropy estimation ...... 27
3 Summary of contributions ...... 29
References ...... 31

II Included papers 39

A A Canonical Representation of Speech A1
1 Introduction ...... A1
2 Frame theory ...... A4

3 System description ...... A5
3.1 Speech analysis ...... A5
3.2 Speech synthesis ...... A13
4 Applications ...... A14
4.1 Speech coding ...... A14
4.2 Prosodic modifications ...... A15
5 Experiments and results ...... A16
5.1 Implementation specific details ...... A17
5.2 Voiced and unvoiced separation ...... A17
6 Concluding remarks ...... A21
References ...... A21

B On the Estimation of Differential Entropy from Data Located on Embedded Manifolds B1
1 Introduction ...... B1
2 Entropy estimation ...... B3
2.1 Entropy estimator based on quantization theory ...... B4
2.2 Estimation of the manifold dimension and entropy ...... B6
2.3 Statistical properties ...... B7
3 Experiments and results ...... B10
3.1 Artificial experiment I: Swiss roll ...... B10
3.2 Artificial experiment II: ring torus ...... B11
3.3 Real experiment: cepstral coefficients of speech ...... B12
4 Conclusions ...... B15
References ...... B26

C Intrinsic Dimensionality and its Implication for Performance Prediction in Pattern Classification C1
1 Introduction ...... C1
2 Problem formulation ...... C3
3 Feature space resolution ...... C4
3.1 Limiting resolution in plug-in estimators ...... C5
3.2 Limiting resolution by noise addition ...... C5
4 Classification error probability estimation ...... C7
5 Experiments and results ...... C9
5.1 Varying intrinsic dimensionality ...... C10
5.2 Pattern classification - artificial scenario ...... C11
5.3 Pattern classification - speech vowels ...... C13
6 Concluding remarks ...... C15
References ...... C15

D Gaussian Mixture Model based Mutual Information Estimation between Frequency Bands in Speech D1
1 Introduction ...... D1
2 Sound classes ...... D2
3 Modeling of the speech spectrum ...... D2
4 Information measures ...... D4
5 Simulations ...... D6
5.1 Simulation procedure ...... D6
5.2 Number of mixture components ...... D6
5.3 Analysis of sound class sensitivity ...... D7
6 Results ...... D7
7 Discussion and conclusions ...... D8
References ...... D9


Acronyms

ADPCM Adaptive Differential Pulse-Code Modulation
APC Adaptive Predictive Coding
ASR Automatic Speech Recognition
AR Auto-Regressive
CC Cepstral Coefficient
CCR Comparison Category Rating
CELP Code-Excited Linear Prediction
DCT Discrete Cosine Transform
DFT Discrete Fourier Transform
DPCM Differential Pulse-Code Modulation
EM Expectation Maximization
GMM Gaussian Mixture Model
i.i.d. Independent and Identically Distributed
HNM Harmonic plus Noise Model
KLT Karhunen-Loève Transform
kbps kilobits per second
LER Log Energy Ratio
LP Linear Prediction
LPC Linear Predictive Coding/Linear Prediction Coefficients
LSF Line Spectral Frequency
MBE MultiBand Excitation

MDCT Modified Discrete Cosine Transform
MDL Minimum Description Length
MFCC Mel-Frequency Cepstral Coefficients
MI Mutual Information
MLT Modulated Lapped Transform
MNRU Modulated Noise Reference Units
MST Minimum Spanning Tree
NN Nearest Neighbor
PCM Pulse-Code Modulation
pdf probability density function
pmf probability mass function
PSOLA Pitch-Synchronous OverLap-Add
RMS Root Mean Squared
r.v. Random Variable
SLB Shannon Lower Bound
SNR Signal to Noise Ratio
ST Steiner Tree
STC Sinusoidal Transform Coding
STFT Short-Time Fourier Transform
TTS Text To Speech
TSP Traveling Salesman Problem
VQ Vector Quantization/Quantizer
WI Waveform Interpolation
WSOLA Waveform Similarity OverLap-Add

Part I

Introduction

Introduction

The novelist Charles Dickens has supposedly said that “Electric communication will never be a substitute for the face of someone who with their soul encourages another person to be brave and true.”1 Had he lived in the information technology oriented society of today, he would perhaps have been a bit more careful with “will never be” in his statement. Equipment for conferences has been available for some time, and virtual reality communication systems simulating the complete audio-visual experience of face-to-face communication are emerging [76, 86]. Due to the fast progress in technology, it is very likely that we will soon be as used to three-dimensional telepresence systems as we are now to existing communication systems such as the telephone, the television, the Internet, etc.

At a general level, the communication chain can be described by the following three consecutive blocks: a source, a channel, and a sink. The source produces a signal containing some information, and the sink receives the parts of the information that have not been destroyed during the transmission of the signal through the channel. Let us take the speech communication chain as an example. The speaker (source) formulates a linguistic message in his or her mind, and the brain maps it into neurological motor signals activating the muscles connected to the lungs, the vocal folds, and the vocal tract. This results in pressure changes in the vocal tract and at the lips, yielding the acoustic speech signal (source signal). Being a pressure wave, the acoustic signal propagates through the air and causes pressure changes at the ear drum of the listener (sink). In the ear, the relevant features of the signal are extracted and a neurological signal is generated and transported through the auditory nerve to the brain, where the linguistic message is finally decoded [59, 96]. In the previous example we considered that both the source and the sink were humans, but in speech communication applications such as automatic speech recognition (ASR) and text-to-speech (TTS) the sink and the source, respectively, are artificial.

1 www.whatquote.com/quotes/Charles-Dickens/11237-Electric-communicatio.htm

This thesis is about the representation of source signals and the extraction of information useful for both coding and recognition from the obtained representation. The focus is on speech, but we believe that many of the ideas and techniques presented can be applied to other sources. The objectives of this thesis are to:

1. Develop a compact speech signal representation that can form a basis for speech coders and speech modification systems.

2. Develop tools and methodologies for measuring information useful for estimating performance limits in compression and classification given, e.g., a particular speech signal representation.

The remainder of this introduction is organized as follows. In Section 1 we discuss both general aspects of representing the speech signal and some particular existing speech signal representations. Section 2 focuses on the extraction of information from the signal representation. A summary of the contributions in this thesis is given in Section 3, which concludes the introduction.

1 Speech signal representation

In this section we discuss a generic approach to source signal representation and apply it to the speech signal. The section starts by providing a motivation for having a representation that captures the structure and characteristic features of the speech signal. We then discuss desired properties of a good representation, followed by a brief description of some existing speech signal representations. Finally, we present some useful tools that we utilize in Paper A to obtain an efficient representation of the speech signal.

1.1 Motivation

Let us start by defining what a representation is. In our context, a representation consists of a model, model parameters, and signal coefficients. The model is for instance an autoregressive model [96] of a particular prediction order, and the model parameters determine the pole locations of the autoregressive model. The signal coefficients of the representation characterize the mismatch, or equivalently the error, between the original signal and the signal as predicted from the model. It is possible to interpret the modeling of the speech signal as a mapping from the signal to the coefficients, where the mapping is determined by the model and the model parameters.

The main motivation for having a speech signal representation constituted by models, model parameters, and signal coefficients is that it facilitates description of particular characteristics of the signal in a concise way. This makes it possible to develop more efficient systems for speech coding [45, 67], speech modification (e.g., time- and pitch-scaling) [77, 111, 113],

and speech recognition [78], since tailor-made processing strategies can be applied to the different speech signal features. For instance, voiced (tonal) speech has a strong periodic structure due to the vibration of the vocal folds. The vibration frequency of the vocal folds is commonly referred to as the fundamental frequency or the pitch.

The periodic structure implies a high redundancy in the signal, which must be exploited in efficient speech coding. Thus, modeling of the pitch of the speech signal is valuable to coding. Representation of the pitch is also relevant to speech modification applications such as time- and pitch-scaling. Moreover, in speech coding, unvoiced (noise-like) waveform descriptors can be modeled by Gaussian noise (with properly matched gain and color) without introducing any significant perceptual degradation to the reconstructed speech signal [74]. This result indicates that unvoiced signal coefficients can be compactly described, since representing the (blocks of) coefficients by their second-order statistics suffices. Similarly, in systems for text-to-speech based on concatenative speech synthesis [112], identification of unvoiced signal components facilitates time-stretching of unvoiced speech free from artificial periodicity (i.e., avoiding periodic extension of unvoiced speech by, e.g., phase randomization of the unvoiced signal coefficients).

1.2 Goodness of a representation

What is a good representation of a source signal? In general this depends on the particular source signal and the type of application. However, it is possible to define some generic properties that we would like a good source signal representation to have, independently of the particular application. Reasonable properties of the representation are that it is compact, complete, and that it “orthogonally” describes the relevant characteristics of the source signal.

By a compact (or sparse) representation we imply that a signal can be reconstructed at a low distortion from a representation that has relatively few model parameters and few significant signal coefficients per time instant. This is consistent with the energy of the coefficient space, after an energy-preserving transformation, being concentrated in a small subspace, which is advantageous for compression (cf. coding gain [49]). Completeness of a representation means that the signal can be perfectly reconstructed given the representation. Thus, transforms built from basis (or frame) functions spanning the complete signal space are valuable tools to achieve compact and complete representations. The concept of compactness and completeness of a representation was previously discussed in [65, 70] in the context of speech and is consistent with the more formal objectives of, e.g., the minimum description length (MDL) principle [13] and rate-distortion theory [30, 108].

The third desired property of the representation is that the different

characteristics of the source signal are captured by almost independent parts of the representation. For example, in the case of a speech signal, we prefer to have different parameters that describe the short-term dependencies due to the resonances of the vocal tract, the pitch-track (evolution of the pitch over time), and signal coefficients of the voiced and the unvoiced speech. This decomposition allows for high-quality manipulations of the speech signal. That is, if we alter the model parameters or signal coefficients, the output signal after synthesis should be a speech signal, albeit a different one from the original.

1.3 Some existing speech representations

Speech coding research has been the primary driving force in finding more efficient representations of the speech signal. Comprehensive overviews of speech coding research can be found in, e.g., [45, 48, 50, 69, 104, 110]. A secondary driving force, of significantly less magnitude, is speech modification research, where high-quality time- and pitch-scaling is of great importance [113]. In the following, we briefly discuss some of the existing speech representations.

Historically, the speech coding methods can be split into two broad categories: waveform coders and voice coders (vocoders). The former category aims at replicating the speech waveform as accurately as possible, whereas the latter category synthesizes the speech signal using a parametric model of speech production. The properties of the waveform coders and the vocoders are in stark contrast to each other. For example, a waveform coder, such as the pulse code modulation (PCM) scheme [1, 62], is capable of producing high-quality decoded speech. The expense is, however, that a high bit-rate (more than 16 kbps for speech sampled at 8 kHz) is required for the transmission of the encoded representation to achieve sufficiently high perceptual quality. The vocoders [44], on the other hand, facilitate speech coding at very low bit rates (less than 4 kbps for speech sampled at 8 kHz), but the performance in terms of perceptual quality saturates at a relatively low level as the bit-rate of the vocoder increases.

Research on waveform coders and on vocoders has been motivated by making their corresponding representations compact and complete, respectively. In the case of the original PCM waveform coder, more compact representations are achieved by using the dependencies between adjacent speech sample amplitudes. Initially, prediction of the current speech sample amplitude was performed using the previous speech samples with fixed prediction coefficients. The actual encoding was performed using PCM on the prediction error. An example of such a coding scheme is differential PCM (DPCM) [50, 102].

Since the dependencies between speech signal samples vary with both the vocal tract configuration (including lip radiation and glottal excitation)

and the speaker's pitch, a fixed predictor cannot predict the speech signal samples efficiently at all times. Thus, to increase coding efficiency, time-adaptive prediction coefficients were introduced (using the assumption that speech can be considered stationary during segments of 10-30 ms duration). The associated coding is referred to as adaptive predictive coding (APC), or more commonly, linear predictive coding (LPC) [2, 12, 102].

The vocoder was invented in 1939 by H. Dudley [35, 104] and showed that speech can be represented compactly. Dudley's system generated synthetic speech by spectrally shaping a source signal that was either produced by a random noise generator or a pitch-controlled oscillator. The spectral shaping simulated the vocal tract configuration and could be changed continuously through a resonance control unit. Although the original vocoder yielded a very compact representation of speech, the synthesized speech was far from natural-sounding. From the autoregressive (AR) modeling of the vocal tract (discussed later in Section 1.4), linear prediction (LP) follows naturally [80, 81], motivating the LPC-based vocoders [10]. To improve the quality of the synthesized sound, more elaborate models of the source excitation have been proposed. The concept of modeling the voiced excitation by a sum of sinusoids was introduced in [55], and further refined into the sinusoidal transform coding (STC) approach [69, 84]. In [7] the sinusoids were constrained by the pitch, and thus, the approach was referred to as harmonic coding. An interesting feature of the harmonic coding presented in [7] is that the error between the harmonic model and the original excitation is quantized and transmitted. Thus, in principle, the system in [7] forms a complete description of the speech signal. An alternative to STC that facilitates mixed voiced/unvoiced excitation is the multiband excitation (MBE) vocoder [25, 54]. The MBE represents the excitation spectrum by a pitch (fundamental frequency), a voiced/unvoiced decision for each harmonic of the pitch frequency, and the phase of each voiced harmonic.

The need for improved compactness of the waveform coders and completeness of the vocoders as discussed above has led to hybrids of the two coder categories. The most well-known hybrid coder is the code-excited linear prediction (CELP) scheme [11], which followed from the idea of the multipulse excitation [9]. The principle of CELP combines linear-predictive analysis-by-synthesis [105] with codebooks (vector quantizers) representing the excitation signal. CELP is the core of most cellular standards of today [3-5, 40].

Another coding scheme capable of producing high-quality decoded speech at very low bit rates is the waveform interpolation (WI) coder [66, 67]. In WI, the one-dimensional speech signal is represented by a two-dimensional surface of so-called characteristic waveforms. Each characteristic waveform consists of speech segments of length one pitch period, time-scaled to a fixed normalized length and periodically extended. A major

advantage of WI over CELP is that the two-dimensional surface of characteristic waveforms in WI facilitates efficient voiced-unvoiced decomposition, which has been shown to be beneficial for coding [74] due to the different perceptual characteristics of voiced and unvoiced sounds.

The discussion so far has been related to speech coding, but the area of speech modification has also influenced the research on speech representations. In speech modification, compactness of the representation is generally of less importance compared to speech coding. An exception is TTS systems based on concatenative speech synthesis, where the speech sound database has to be efficiently stored [112]. For speech modification it is more important that the different characteristics of the speech are described by almost independent parts of the representation, since this facilitates flexible manipulation of the speech signal, as mentioned in Section 1.2. One such speech representation is the harmonic plus noise model (HNM) [77, 111, 113]. Similarly to the harmonic coding in [7], the speech signal is decomposed into a harmonic part and a noise part. The harmonic part of the signal is modeled by harmonically related sinusoids with linear time-varying complex amplitudes, and the noise part is modeled by modulated colored Gaussian noise (following the energy envelope of the original signal over time). Besides the modeling of the noise part, HNM differs from the harmonic coding approach by having pitch-synchronous analysis and synthesis.

1.4 Modeling tools

In this section we present some general modeling tools useful for representing speech. In addition to the AR modeling mentioned in the previous section, the tools we present are: constant pitch warping, transformation, and best-basis selection. Transformations and best-basis selections are important tools for obtaining compact signal representations. Further, the pitch is a key feature of the speech signal. The high signal redundancy it implies facilitates a compact speech representation if the information is properly used. The pitch varies continuously as a function of time, and the tool of constant pitch warping time-scales (warps) the speech signal into a (warped) signal that has a constant pitch. This facilitates both pitch-synchronous processing and the usage of efficient (fixed) transforms, which is of importance to coding and modification of the speech signal. The section ends with a discussion on how to combine the modeling tools to obtain a compact and complete representation of the speech signal, as proposed in Paper A.

Autoregressive modeling

To capture the short-term dependencies between speech signal samples, the autoregressive model is an efficient tool commonly used in current speech

coding systems, including adaptive differential pulse code modulation (ADPCM) [2, 102] and CELP [3-5, 11, 40]. The AR model of a discrete-time signal s(n) can be expressed as [94]

s(n) = \sum_{m=1}^{M} a_m s(n-m) + e(n),    (1)

where M is the model order, \{a_m\}_{m=1}^{M} is the set of parameters specifying the AR model, and where e(n) denotes the (prediction) error term. The parameters \{a_m\}_{m=1}^{M} are often referred to as linear prediction coefficients, since they are commonly estimated through linear prediction analysis [80, 81, 96]. The AR model specifies a prediction of the current sample s(n) given a weighted sum of previous samples, and the optimal weights \{a_m\}_{m=1}^{M} are found by minimizing the energy (variance) of the prediction error. In the frequency domain, the AR model results in an all-pole transfer function [96]. The power spectrum of the all-pole transfer function describes the overall (smoothed) distribution of the signal energy along the frequency axis using few coefficients (thus a compact representation). Since the empirical short-term statistics of speech vary over time, the AR model parameters are commonly estimated from short speech segments (typically of 10-30 ms duration) using either the autocorrelation or the covariance LP analysis methods [96]. A historical perspective on LP analysis applied to speech can be found in [8].

Instead of viewing the AR modeling as merely a simple, yet efficient, model to capture dependencies between speech samples, it has been shown that the AR model can also be justified from a physical speech production perspective. Fant [43] did pioneering work in this area, and proposed the so-called source-filter decomposition of speech production [43] shown in Figure 1. The basic premise of the source-filter decomposition is that the speech signal is modeled by filtering an excitation source signal through vocal-tract and lip radiation filters. As seen from Figure 1, the excitation is a weighted sum of a pitch-pulse train (pulses spaced inversely proportional to the pitch and shaped by the glottal model) and random noise. The most well-known physical modeling of the vocal tract is the tube model [43]. A discrete-time model based on the concatenation of tubes of various diameters can be shown to be equivalent to an autoregressive (AR) model of the speech signal [32, 43, 96].
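To make the estimation step concrete, the following is a minimal sketch of the autocorrelation method of LP analysis mentioned above, implemented with the Levinson-Durbin recursion. It is an illustration only, not the implementation used in Paper A; the synthetic signal, frame length, window, and model order are arbitrary choices.

    import numpy as np

    def lpc_autocorrelation(frame, order):
        # Estimate the AR/LP coefficients {a_m} of (1) for one windowed frame
        # using the autocorrelation method (Levinson-Durbin recursion).
        n = len(frame)
        r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
        a = np.zeros(order)          # predictor coefficients a_1 ... a_M
        err = r[0]                   # prediction-error energy
        for i in range(order):
            k = (r[i + 1] - np.dot(a[:i], r[i:0:-1])) / err   # reflection coefficient
            a[:i] = a[:i] - k * a[:i][::-1]
            a[i] = k
            err *= 1.0 - k * k
        return a, err

    # Illustrative use on a synthetic AR(2) signal (true coefficients 1.5, -0.7)
    rng = np.random.default_rng(0)
    e = rng.standard_normal(16000)
    s = np.zeros_like(e)
    for n in range(2, len(s)):
        s[n] = 1.5 * s[n - 1] - 0.7 * s[n - 2] + e[n]
    frame = s[1000:1240] * np.hamming(240)     # one 30 ms frame at 8 kHz
    a, err = lpc_autocorrelation(frame, order=2)
    print(a)                                   # close to [1.5, -0.7]

For real speech, the same routine would be applied to windowed 10-30 ms segments, as described in the text.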

Constant pitch warping

In voiced speech the duration and shape of the pitch cycles generally change slowly, and the long-term dependencies are strong (high redundancy). Therefore, it is essential to take advantage of the pitch to get a compact representation of the speech signal.

Figure 1: Source-filter model of speech production. (Block diagram: a pulse train generator shaped by a glottal pulse model and a random noise generator, each with a gain, feed a vocal tract model followed by a radiation model, producing the speech signal; the blocks are controlled by the pitch period and the vocal tract parameters.)

Since the pitch is continuously varying, the design of transforms that maximally concentrate the energy of the signal coefficients is difficult. However, if the speech signal is warped into a signal of constant pitch, the design of these transforms is simplified. The parameters specifying the mapping between the original and warped signal domains are then a part of the speech signal representation.

To enable warping of the speech signal, we require access to an accurate description of a so-called warping function, relating the original time domain and the (warped) time domain in which the pitch is constant. In [98] the continuous-time warping function is modeled by cubic B-splines, and the optimal B-spline coefficients are assumed to be the ones that minimize the squared error between the speech signal and the speech signal delayed by one pitch period. Under the assumption that both the continuous speech signal and the warping function are well described with B-splines, a standard gradient descent method can be used to find an estimate of the warping function [63, 98]. Since the warping function is related to the (instantaneous) pitch, standard pitch estimation methods [22, 68, 85] can be used to obtain an initial estimate of the warping function for the gradient descent method.

The actual warping can either be performed by methods of interpolation (resampling) in the time domain using the warping function, as done in [98], or by zero-padding pitch cycles, as done in [41]. The latter approach requires pitch marking such that the zero-padding can be performed within the closed phase of the glottal cycle. Moreover, if we want to apply a frequency transform to the warped signal, the zero-padding approach to warping is disadvantageous since it yields an oversampled frequency representation. This is not the case if we use an interpolation-based warping approach.
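As a deliberately simplified illustration of interpolation-based warping, the sketch below resamples a signal so that a given time-varying pitch track becomes constant. The warping function is derived directly from the pitch track rather than from the B-spline optimization of [98], linear interpolation stands in for a proper resampling filter, and all function names are hypothetical.

    import numpy as np

    def warp_to_constant_pitch(s, pitch_hz, fs, target_pitch_hz):
        # Resample s(n) so that the given time-varying pitch becomes constant.
        # The warping function is the scaled running phase of the pitch track.
        phase = np.cumsum(pitch_hz) / fs               # pitch cycles elapsed
        warped_time = phase * fs / target_pitch_hz     # sample positions in the warped domain
        # Invert the warping function on a uniform warped-time grid and
        # interpolate the original signal there (linear interpolation for brevity).
        n_out = int(np.floor(warped_time[-1]))
        uniform_warped = np.arange(n_out)
        original_positions = np.interp(uniform_warped, warped_time, np.arange(len(s)))
        return np.interp(original_positions, np.arange(len(s)), s)

    # Example: a "voiced" test signal whose pitch glides from 100 to 150 Hz
    fs = 8000
    pitch = np.linspace(100.0, 150.0, fs)
    s = np.sin(2 * np.pi * np.cumsum(pitch) / fs)
    warped = warp_to_constant_pitch(s, pitch, fs, target_pitch_hz=125.0)
    # 'warped' now has an approximately constant pitch of 125 Hz.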

Transforms and filter banks

A desired property of the representation of the speech signal is compactness. One aspect of compactness is that the energy of the signal coefficients, after an energy-preserving linear transformation, is concentrated into only a few of the signal coefficients. This generally facilitates a low distortion when reconstructing the signal from only a few significant signal coefficients. Transforms and filter banks can be used to achieve energy concentration.

Let us start the discussion on transforms by considering the Karhunen-Loève transform (KLT), which is commonly used in speech signal processing both for coding [51, 64] and enhancement [37, 87]. The KLT is a unitary2 transform U_{KLT} that, when applied to a d-dimensional random variable X (representing, e.g., a speech signal block), diagonalizes the covariance matrix of X (or equivalently the autocorrelation matrix in the case of X being zero mean). That is, if X is zero mean, the autocorrelation matrix of the transformed vector Y = U_{KLT} X becomes

E[YY^T] = U_{KLT} E[XX^T] U_{KLT}^H = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_d),    (2)

where the superscript H denotes conjugate transpose. From (2) we note that the KLT is computed from an eigenvalue decomposition of the autocorrelation matrix E[XX^T] and that the rows of U_{KLT} consist of the eigenvectors of E[XX^T]. The concentration of energy obtained by a unitary transform is commonly evaluated using the coding gain defined as [49, 62]

G_{coding} = \frac{d^{-1}\sum_{i=1}^{d}\sigma_{X_i}^2}{\left(\prod_{i=1}^{d}\sigma_{Y_i}^2\right)^{d^{-1}}},    (3)

where \sigma_{X_i}^2 and \sigma_{Y_i}^2 are the component variances of the d-dimensional random variable before and after the transform, respectively. It can be shown that the coding gain is optimized by the KLT [51], which is thus beneficial in transform coding. As seen from (2), the KLT is signal dependent, since it is constructed from the eigenvectors of the covariance/autocorrelation matrix. This is a disadvantage in the context of coding because the transform itself has to be transmitted. Interestingly, if we consider d-dimensional signal blocks from a stationary signal, the KLT becomes the discrete Fourier transform (DFT) as the block length d tends to infinity [49] (cf. complex exponentials are eigenfunctions of linear time-invariant (LTI) systems [90]). This result justifies the commonly used approximation of the signal-dependent KLT by the signal-independent DFT or discrete cosine transform (DCT) [103].

2 A unitary transform U is a transform that satisfies U^H U = I, where I denotes the identity matrix.
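The following numerical sketch (an illustration, not part of the thesis) estimates the KLT of (2) from sample blocks of an AR(1) process and evaluates the coding gain (3). Since the transform is energy preserving, (3) reduces to the ratio of the arithmetic to the geometric mean of the transformed component variances; the block length and process parameters are arbitrary.

    import numpy as np

    def am_over_gm(variances):
        # Arithmetic over geometric mean of component variances; for an
        # energy-preserving transform this equals the coding gain (3).
        variances = np.asarray(variances)
        return variances.mean() / np.exp(np.mean(np.log(variances)))

    # d-dimensional blocks from a zero-mean AR(1) process
    rng = np.random.default_rng(1)
    d, rho = 8, 0.9
    x = np.zeros(100000)
    e = rng.standard_normal(len(x))
    for n in range(1, len(x)):
        x[n] = rho * x[n - 1] + e[n]
    blocks = x[: (len(x) // d) * d].reshape(-1, d)
    R = blocks.T @ blocks / len(blocks)             # sample estimate of E[XX^T]

    # KLT: rows of U_KLT are eigenvectors of E[XX^T]; Y = U_KLT X decorrelates the blocks
    eigvals, eigvecs = np.linalg.eigh(R)
    U_klt = eigvecs.T
    Y = blocks @ U_klt.T

    print("untransformed blocks:", am_over_gm(blocks.var(axis=0)))   # close to 1
    print("after the KLT:      ", am_over_gm(Y.var(axis=0)))         # noticeably larger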

The unitary transforms of the KLT, the DFT, and the DCT are constituted by a set of orthonormal basis functions. These transforms perform a one-to-one mapping between d-dimensional coefficient spaces. It is also possible to construct transforms that perform a one-to-one mapping between a d-dimensional space and a space of dimensionality higher than d. The formalism that discusses this is the so-called frame theory [82]. Frames are a generalization of bases.

Frames, and implicitly also bases, are efficient means for showing the perfect reconstruction property of uniform filter banks (UFB). Let f denote a band-limited discrete signal (function) in the discrete Hilbert space l^2(\mathbb{Z}). A frame consists of a set of functions \{\gamma_k\}_{k \in \mathcal{K}} (\mathcal{K} being a countable index set) that satisfies the frame condition, e.g., [82]

A \langle f, f \rangle \le \sum_{k \in \mathcal{K}} |\langle f, \gamma_k \rangle|^2 \le B \langle f, f \rangle,    (4)

where \langle \cdot, \cdot \rangle denotes the inner product and the scalars A > 0, B < \infty are the so-called frame bounds (constants). We call a frame where A = B a tight frame. Furthermore, we define the frame operation U (U denotes the operator) on the signal f as

(U f)(k) = \langle f, \gamma_k \rangle.    (5)

The frame condition (4) forms a necessary and sufficient condition for U to be invertible with a bounded inverse. The dual operator of U is its pseudo-inverse, i.e.,

U^{\dagger} = (U^H U)^{-1} U^H,    (6)

where the superscript H denotes conjugate transpose. For a tight frame, U^{\dagger} = (U^H U)^{-1} U^H = A^{-1} U^H, so that the frame is its own dual in this case, except for a constant factor.

Because of the conditions of invertibility imposed by the frame condition (4), frames can be used efficiently to show the perfect reconstruction property of the UFBs [23]. The analysis stage of such a filter bank corresponds to frames constructed by generating the set of frame functions from regular time-shifts of a finite set of functions, \{\gamma_{j,0}\}_{j \in \mathcal{J}}. The transform coefficients are the outcomes of the inner products. Similarly, the synthesis stage can be seen as a summation, over all channels, of all dual (pseudo-inverse) frame functions for that channel; each frame function is scaled by its transform coefficient. Thus, the synthesis stage forms an expansion in dual frame functions. Let \gamma_{j,m}(n) denote a shifted frame function \gamma_j(n - mM) that is nonzero only over a finite time-support. Then

the (transform) coefficients of the frame expansion can be expressed by the inner product g_j(m) = \langle f, \gamma_{j,m} \rangle, and the (perfectly) reconstructed signal becomes f(n) = \sum_{j,m} g_j(m) \gamma_{j,m}^{\dagger}(n), where \gamma_{j,m}^{\dagger}(n) denotes a dual frame function with time offset m and frequency index j. The analysis and synthesis stages of a uniform filter bank are shown in Figure 2.


Figure 2: The analysis and synthesis stages of a uniform filter bank. In the analysis stage the filters consist of complex-conjugated, time-reversed frame functions, and in the synthesis stage of the frame functions obtained from the pseudo-inverse.

There exist many different frequency transforms, e.g., the DFT, the DCT, the short-time Fourier transform (STFT) [96], the Gabor transform [15, 95], etc. A particularly attractive frequency transform is the modulated lapped transform (MLT), also referred to as the modified DCT (MDCT) [36, 83]. A major reason for this is that the MLT facilitates the implementation of a critically sampled (i.e., it renders as many output coefficients as input samples or coefficients), perfect reconstruction uniform filter bank with well-localized functions in time and frequency. This can for instance be achieved by combining smooth windows (that satisfy the power complementarity constraint3 needed for the perfect reconstruction) and DCT-IV functions [36]. That is, for this specific MLT the frame coefficients g_j(m) in Figure 2 can be expressed as the inner product between the function f and the frame function \phi_{j,m}, i.e.,

g_j(m) = \langle f, \phi_{j,m} \rangle,    (7)

where

\phi_{j,m}(n) = w(n - mM) \cdot \sqrt{\frac{2}{M}} \cos\left( \frac{(2j+1)(2n - (2m+1)M + 1)\pi}{4M} \right),    (8)

where M denotes the size of the time shift. The window w(n - mM) in (8) can for instance be the square root of the Hann window (it satisfies the power complementarity constraint), defined as

w(n - mM) = \begin{cases} \sqrt{\frac{1}{2}\left(1 - \cos\left(\frac{\pi(n - mM + \frac{1}{2})}{M}\right)\right)} & n \in \Omega_n, \\ 0 & \text{otherwise,} \end{cases}    (9)

where \Omega_n = [mM, ..., (m+2)M - 1].

3 For a window w(n) with nonzero support for n = 0, ..., M - 1, the periodic extension of the squared windows has to be constant.

Best Basis selection

In best-basis selection, a library of orthonormal bases is used in combination with a cost function to match the basis to a given signal. The library of orthonormal bases can for instance be constructed from DCTs of different time support. To facilitate efficient coding it is desirable to have a cost function that measures the energy concentration of the coefficients expressed in terms of the new basis. A large number of different energy concentration cost functions exist, e.g., coding gain (3) based or entropy based [27], and the best choice depends on the particular application.

If we consider the speech signal, the effect of the best-basis selection using a library of DCTs of various time-support in combination with an energy concentration criterion is an automatic time-segmentation of the signal. Steady speech vowels are assigned DCT basis functions of long time-support and transients are assigned basis functions of short time-support. Thus, best-basis selection [27] forms a powerful tool when seeking a compact and complete speech signal representation. For computational reasons the best-basis selection often starts from some fixed segmentation (of some minimum segmentation length) and, in a tree-structured bottom-up fashion, adjacent segments are merged if the new transform coefficients resulting from the merged segments yield an increased concentration in energy.
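The bottom-up merging just described can be illustrated with the following toy sketch, which segments a signal by greedily merging adjacent DCT blocks whenever the merged block concentrates energy at least as well as its halves. The concentration score (energy fraction in a few leading coefficients) and the merging rule are simplistic stand-ins, not the criterion or the procedure used in Paper A.

    import numpy as np
    from scipy.fft import dct

    def concentration(x, keep=8):
        # Toy energy-concentration score: fraction of the block's energy captured
        # by its `keep` largest-magnitude DCT coefficients (higher is better).
        c = dct(x, norm='ortho')
        e = np.sort(c ** 2)[::-1]
        return e[:keep].sum() / max(e.sum(), 1e-12)

    def best_basis_segmentation(signal, min_len=64, keep=8):
        # Bottom-up selection over a dyadic set of DCT block lengths: start from
        # blocks of min_len samples and merge adjacent, equally long blocks when
        # the merged DCT concentrates energy at least as well as the two halves.
        segments = [signal[i:i + min_len]
                    for i in range(0, len(signal) - len(signal) % min_len, min_len)]
        merged = True
        while merged and len(segments) > 1:
            merged, out, i = False, [], 0
            while i < len(segments):
                if i + 1 < len(segments) and len(segments[i]) == len(segments[i + 1]):
                    pair = np.concatenate([segments[i], segments[i + 1]])
                    halves = 0.5 * (concentration(segments[i], keep)
                                    + concentration(segments[i + 1], keep))
                    if concentration(pair, keep) >= halves:
                        out.append(pair)          # merge: longer basis functions chosen
                        merged, i = True, i + 2
                        continue
                out.append(segments[i])
                i += 1
            segments = out
        return [len(seg) for seg in segments]     # chosen time-support per segment

    # Print the chosen time-supports for a steady tone followed by a noise burst
    fs = 8000
    tone = np.sin(2 * np.pi * 440 * np.arange(2048) / fs)
    burst = np.random.default_rng(2).standard_normal(512)
    print(best_basis_segmentation(np.concatenate([tone, burst])))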

An efficient tool combination for speech

So far, we have presented some useful tools for representing speech. In the following we briefly discuss the approach we have used in Paper A to combine these tools such that the resulting speech signal representation is compact and complete. The proposed arrangement of the modeling tools is shown in Figure 3, and consists of an initial LP analysis followed by constant pitch warping, a pitch-synchronous frequency transform, and a modulation transform. The modulation transform is a DCT applied to blocks of the time sequence of frequency coefficients from the pitch-synchronous frequency transform. The modulation transforms are in fact a best-basis selection where the block lengths are adaptively set to maximize some energy

concentration criterion. Besides the energy concentration potential of the modulation transform, it facilitates the identification of voiced and unvoiced signal coefficients, beneficial for both coding [74] and prosodic modification [113]. This can be accomplished by assigning the coefficients of the low modulation bands to the voiced speech category. These coefficients represent the constant and slowly evolving components of the pitch-synchronous coefficients over time (block length of the modulation transform).

Figure 3: An efficient combination of modeling tools for representing speech. (Block diagram: LP analysis of the speech signal, constant-pitch warping of the residual, a pitch-synchronous (lapped) frequency transform, and modulation transforms with best-basis selection, producing AR parameters, pitch-track model parameters, best-basis parameters, and voiced/unvoiced coefficients.)

The combination of the pitch-synchronous and the modulation transforms results in lapped frequency transforms [83], and similarly to all frequency transforms, they approximate the Karhunen-Loève transform (KLT) for stationary signal segments, as discussed previously. The KLT maximizes the coding gain [49, 62], which can be seen as a particular energy concentration criterion. If the pitch is constant, the pitch-synchronous and the modulation transforms can be applied directly to the speech signal to achieve a highly energy-concentrated representation. It is desirable to describe the variances of the signal coefficients after the transformations in an efficient way. Towards this goal we describe the spectral envelope by a parametric model. As is common in speech processing, we use the conventional AR model for this purpose.

In practice, the pitch is not constant, but varying over time. However, by warping the speech signal of varying pitch into a signal of constant pitch, fixed pitch-synchronous and modulation transforms can be used. To facilitate perfect reconstruction the warped signal has to be oversampled. Thus, an increased efficiency of the AR modeling is obtained if the LP analysis is performed prior to the warper, motivating the system structure in Figure 3.

2 Information extraction

In the previous section we discussed a generic approach to source signal representation and specifically applied it to the speech signal. This section considers the extraction of information from the signal representation obtained from source models. In particular, we are interested in information that is useful for the estimation of bounds on compression and on the performance of classifiers. The concept of entropy is a well-accepted measure of information, and we present in the following a detailed discussion on the definition of entropy, related measures, and existing entropy estimators. In practice, high-dimensional vector observations often associated with real-world sources, such as speech autoregressive model parameters or image tangent vectors, seem to have an effective space that can be parameterized with much fewer dimensions than the original vector [71, 106, 114, 115]. This causes severe problems in entropy estimation. We discuss these problems together with possible solutions at the end of this section.

2.1 Entropy

Entropy is Greek for “in transformation”, and the term was originally introduced by Clausius in 1865 as a useful quantity for the definition of the second law of thermodynamics4. Clausius' work on thermodynamics was followed by contributions from both Boltzmann and Gibbs, who during the 1870s established the connection between statistical mechanics and the thermodynamic entropy. In statistical mechanics, entropy is seen as a macroscopic entity that quantifies the average disorder in an isolated system at a microscopic level.

To be more concrete, consider an isolated system with a fixed total energy and a fixed total number of particles N. Each particle of the system can be in one out of M discrete energy states m ∈ {1, 2, ..., M}, and the particles can exchange energy without loss. A microstate defines in this context a specific microscopic configuration of the system, i.e., one particular configuration of the N particles and M discrete energy states. Then Boltzmann showed that the entropy S is proportional to the logarithm of the number of possible microscopic configurations W, under the assumption that all possible microstates of the system are equally likely and that the constraints on energy mentioned above are satisfied, i.e.,

S = kB ln(W ), (10)

where k_B ≈ 1.3806505(24) · 10^{-23} denotes the Boltzmann constant, and where the number of possible configurations W = N!/(\prod_{m=1}^{M} N_m!), with

4 Entropy in an isolated system cannot decrease.

N_m representing the number of particles that are in a specific energy state m. A particular energy constraint allows a set of different distributions {N_m}. Under the aforementioned assumption that all microstates are equally likely and a given overall energy for all particles, one distribution is the most likely distribution. That is the distribution that corresponds to the largest number of microstates W. This distribution is the so-called maximum-entropy distribution. For large systems (many particles in our example) the maximum-entropy distribution can be assumed to be the actual distribution (to be used for estimating macroscopic observables). Applying Stirling's approximation log(Q!) ≈ Q log(Q) − Q, which is valid for large Q, to (10), the entropy can be expressed as

S = -k_B N \sum_{m=1}^{M} p_m \ln(p_m),    (11)

where p_m = N_m/N denotes the empirical probability that any particle is in a particular energy state m.

Information entropy

In 1948 Shannon published his pioneering work relating the uncertainty of messages to their limit of compression [107]. As a measure of uncertainty Shannon adopted the concept of entropy from statistical mechanics and developed the foundation of what is known today as information theory. In contrast to statistical mechanics, Shannon postulates a distribution (and not a uniform distribution of the microstates). Herein we define the (information) entropy and the differential entropy for a discrete and a continuous random variable, respectively.

Let Ξ be a discrete random variable (r.v.) that can take any outcome from a countable set of outcomes \mathcal{A}. The set of outcomes is often referred to as the alphabet. Associated with the r.v. Ξ is a probability mass function (pmf) p_Ξ(ξ) ≡ Pr(Ξ = ξ), where ξ ∈ \mathcal{A}. The information entropy of Ξ is then defined (in bits) as [107]

H(\Xi) = -\sum_{\xi \in \mathcal{A}} p_\Xi(\xi) \log_2(p_\Xi(\xi)).    (12)

In many cases the source under consideration, e.g., a speech signal, is analog in nature. The entropy of such a source as defined in (12) equals infinity, i.e., an infinite amount of bits is required for lossless encoding of the continuous-alphabet (analog) random variable. Interestingly, using the notion of a uniform quantizer (i.e., a quantizer where all quantization cells are identical) we can split the entropy of the indices from the quantizer

into one term describing the intrinsic properties of the continuous random variable, and one term directly related to the cell volume of the quantizer. That is, for sufficiently small quantization cells the pmf of the indices can be approximated as p_\Xi(\xi) \approx f_X(\xi)\Delta, where f_X(\xi) is the probability density function (pdf) of the continuous variable X at position ξ, and Δ denotes the volume of the quantization cell. This yields

H(\Xi) = -\sum_{\xi \in \mathcal{A}} p_\Xi(\xi) \log_2(p_\Xi(\xi))
\approx -\sum_{\xi \in \mathcal{A}} f_X(\xi)\Delta \log_2(f_X(\xi)\Delta)
\approx -\sum_{\xi \in \mathcal{A}} f_X(\xi) \log_2(f_X(\xi))\Delta - \log_2(\Delta)
\approx \underbrace{-\int_{\Omega_x} f_X(x) \log_2(f_X(x))\, dx}_{\text{intrinsic property of } X} \; \underbrace{-\, \log_2(\Delta)}_{\text{effect of scale}}
= h(X) - \log_2(\Delta),    (13)

where the entity h(X) = -\int_{\Omega_x} f_X(x) \log_2(f_X(x))\, dx is commonly referred to as the differential entropy of X. Note that the differential entropy can be negative, which contrasts with the entropy of a discrete random variable.
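Relation (13) is easy to verify numerically. The sketch below (illustrative only) uniformly quantizes samples of a Gaussian r.v., estimates the index entropy H(Ξ) of (12) from the empirical pmf, and compares H(Ξ) + log2(Δ) with the closed-form differential entropy 0.5 log2(2πeσ²) of the Gaussian.

    import numpy as np

    rng = np.random.default_rng(3)
    sigma = 2.0
    x = rng.normal(0.0, sigma, 1_000_000)
    h_true = 0.5 * np.log2(2 * np.pi * np.e * sigma ** 2)   # differential entropy of N(0, sigma^2)

    for delta in [1.0, 0.5, 0.1, 0.01]:
        # Uniform quantization with cell volume delta, then empirical pmf of the indices
        idx = np.floor(x / delta).astype(int)
        _, counts = np.unique(idx, return_counts=True)
        p = counts / counts.sum()
        H_indices = -(p * np.log2(p)).sum()                  # H(Xi) as in (12)
        print(delta, H_indices + np.log2(delta), h_true)     # (13): H(Xi) ~ h(X) - log2(delta)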

Entropy based information measures

For later purposes it is useful to define some other information measures related to differential entropy. Given the joint pdf of two r.v. X and Y, f_{XY}(x, y), the conditional differential entropy of X given Y, h(X|Y), is defined (in bits) as [30]

h(X|Y) = -\int_{\Omega_x} \int_{\Omega_y} f_{XY}(x, y) \log_2\left(f_{X|Y}(x|y)\right) dy\, dx,    (14)

where f_{X|Y}(x|y) = f_{XY}(x, y)/f_Y(y) denotes the conditional distribution of X given Y = y. The conditional differential entropy of (14) represents the remaining uncertainty of X when Y is given, averaged over the alphabet of Y.

By subtracting the conditional entropy h(X|Y) from the differential entropy h(X), we can quantify the information that Y provides about X, or equivalently, the mutual information between X and Y, i.e.,

I(X; Y) = h(X) - h(X|Y).    (15)

Using the joint and marginal pdfs of X and Y, the mutual information can

also be expressed as

I(X; Y) = \int_{\Omega_x} \int_{\Omega_y} f_{XY}(x, y) \log_2\left(\frac{f_{XY}(x, y)}{f_X(x) f_Y(y)}\right) dy\, dx.    (16)

From (16) we note that the mutual information is a symmetric measure in the sense that the information that Y provides about X is the same as the information that X provides about Y.
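To make (15)-(16) concrete, the sketch below compares a simple histogram plug-in estimate of I(X;Y) with the closed-form value -0.5 log2(1-ρ²) for jointly Gaussian variables with correlation ρ. This is an illustration only; Paper D instead estimates the mutual information through Gaussian mixture models, and the sample size and bin count used here are arbitrary.

    import numpy as np

    rng = np.random.default_rng(4)
    rho, n = 0.8, 500_000
    x = rng.standard_normal(n)
    y = rho * x + np.sqrt(1 - rho ** 2) * rng.standard_normal(n)

    i_closed_form = -0.5 * np.log2(1 - rho ** 2)   # exact MI for jointly Gaussian X, Y

    # Histogram plug-in estimate of (16): I = sum p(x,y) log2( p(x,y) / (p(x) p(y)) )
    bins = 60
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    i_plugin = (pxy[mask] * np.log2(pxy[mask] / (px @ py)[mask])).sum()

    print(i_closed_form, i_plugin)   # the two values should be close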

Shannon versus Renyi entropy

An alternative to, or generalization of, Shannon's definition of entropy is the Renyi entropy [97]. The Renyi entropy (of order α) of the r.v. X with pdf f_X(x) is defined as [30]

h_\alpha(X) = \frac{1}{1 - \alpha} \log\left(\int_{\Omega_X} f_X^\alpha(x)\, dx\right),    (17)

where f_X^\alpha(x) denotes the pdf of X raised to the power α. The Renyi entropy is sometimes advantageous due to mathematical tractability. For instance, Renyi entropy in combination with Parzen windows facilitates the derivation of gradient-type algorithms in applications where entropy or mutual information is used as a cost function, e.g., blind deconvolution [38] and blind source separation [58]. Moreover, the graph length of, e.g., minimum spanning trees has been shown to be closely related to the Renyi entropy [56], which has found applications in image registration [79, 88].

As α approaches one, the Renyi entropy converges to the Shannon differential entropy h(X), i.e.,

\lim_{\alpha \to 1} h_\alpha(X) = \lim_{\alpha \to 1} \frac{1}{1 - \alpha} \log\left(\int_{\Omega_X} f_X^\alpha(x)\, dx\right)
\overset{\text{l'Hospital}}{=} \lim_{\alpha \to 1} \frac{\frac{d}{d\alpha}\log\left(\int_{\Omega_X} f_X^\alpha(x)\, dx\right)}{\frac{d}{d\alpha}(1 - \alpha)}
= \lim_{\alpha \to 1} -\frac{\int_{\Omega_X} f_X^\alpha(x) \log(f_X(x))\, dx}{\int_{\Omega_X} f_X^\alpha(x)\, dx}
= -\int_{\Omega_X} f_X(x) \log(f_X(x))\, dx = h(X),    (18)

where we have used l'Hospital's rule and assumed the validity of interchanging the orders of differentiation and integration.

In [107] Shannon stated three properties that are reasonable to require from a measure of uncertainty for a given pmf. First, the entropy should be a continuous function of the pmf, i.e., an arbitrarily small change in the pmf should not result in a jump in the entropy (cf. [101] for a formal definition of continuous functions). Second, if the pmf is uniform, then the entropy should be a monotonically increasing function of the cardinality of the discrete random variable. This is intuitively reasonable since with equally likely events the uncertainty about a particular event grows with the increasing number of possible events. Finally, the third desired property, according to Shannon, is that if a choice of an event can be split into two consecutive choices, the original entropy should remain the same and it should be possible to express it as a weighted sum of individual entropies. This is illustrated by the two example configurations in Figure 4, where we consider the uncertainty of the joint vector of two discrete r.v. Ξ and Υ. Ξ and Υ have the possible outcomes zero and one, and thus, the joint vector [Ξ, Υ] has four possible outcomes. The configuration to the right


Figure 4: Illustration of the desired conditioning property of the uncertainty measure according to Shannon.

in Figure 4 is constructed by representing the four different events as a sequence of binary choices. If we let H˜ denote some measure of uncertainty, then for this example, Shannon’s third desired property implies that

\tilde{H}(\Xi, \Upsilon) = \tilde{H}(\Xi) + \sum_{\xi \in \{0,1\}} p_\Xi(\xi)\, \tilde{H}(\Upsilon\,|\,\Xi = \xi) = \tilde{H}(\Xi) + \tilde{H}(\Upsilon\,|\,\Xi).    (19)

Shannon showed in [107, Appendix 2] that the only possible expression of the uncertainty that satisfies all three properties mentioned above is the expression of (12). That is, for the relation of (19) to hold, \tilde{H}(\Xi, \Upsilon) = H(\Xi, \Upsilon) = -\sum p_{\Xi\Upsilon}(\xi, \upsilon) \log_2(p_{\Xi\Upsilon}(\xi, \upsilon)). Thus, the third property is not satisfied for the general Renyi entropy unless α = 1 in (17), i.e., unless the Renyi entropy coincides with the Shannon differential entropy.
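The decomposition (19) can be checked numerically for the Shannon entropy with an arbitrary joint pmf over [Ξ, Υ], as in the two-class example of Figure 4 (the pmf values below are made up for illustration):

    import numpy as np

    p = np.array([[0.30, 0.20],    # joint pmf p(xi, upsilon); rows: xi = 0, 1
                  [0.15, 0.35]])   # columns: upsilon = 0, 1

    H = lambda q: -(q[q > 0] * np.log2(q[q > 0])).sum()

    H_joint = H(p)                                                # H(Xi, Upsilon)
    p_xi = p.sum(axis=1)                                          # marginal pmf of Xi
    H_cond = sum(p_xi[i] * H(p[i] / p_xi[i]) for i in range(2))   # H(Upsilon | Xi)
    print(H_joint, H(p_xi) + H_cond)                              # equal, as required by (19)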

According to the source coding theorem [30], the entropy of an independent and identically distributed (i.i.d.) source is the lower bound on the average number of bits per source symbol needed for compression of the source without loss. This suggests that the relation in (19) is a desirable property since it facilitates the derivation of compression bounds including side information. Thus, the Shannon entropy clearly has an advantage over the Renyi entropy (for the case when α ≠ 1) in the field of coding theory.

2.2 Relation to coding and classification

A major motivation for estimating entropies and differential entropies is that they can be used to form bounds on the performance of source coders and pattern classifiers. Entropy and mutual information can also be used as cost functions in various signal processing applications, e.g., image registration or blind source separation as previously mentioned, and for optimal feature selection in speech recognition. In the following we discuss the role of entropy in both coding and classification.

Source coding

Source coding refers to the compression of signals. Source coders can either be of the lossless or the lossy type. As the name indicates, lossless compression implies that the source signal can be compressed and decompressed without introducing any distortion. Examples of commonly used techniques for lossless compression are the Huffman coding [60] and the arithmetic coding [99] schemes. Lossless coding using a finite number of bits is only possible when the source signal to encode is a sequence of discrete-alphabet variables. When the receiver tolerates distortion to the processed signal, lossy compression techniques such as, e.g., scalar or vector quantization [53] can be used. A comprehensive overview of lossy source coding is given in [19]. Irrespective of the type of compression, i.e., lossless or lossy, it is always of interest to know or estimate the lowest average rate per symbol possible (or bounds on this rate) when designing source coders.

In the encoding of a sequence of discrete-alphabet source variables we assign to each source variable ξ a codeword of length l(ξ). Typically, we are only interested in codes that facilitate a decoding of a sequence of concatenated codewords. This is commonly referred to as the code being uniquely decodable. The unique decodability of the code places constraints on the set of possible codeword lengths \{l(\xi)\}_{\xi \in \mathcal{A}}. This constraint is formalized by the Kraft inequality: \sum_{\xi \in \mathcal{A}} 2^{-l(\xi)} \le 1. Using the Kraft inequality, it is possible to show that the minimum average codeword length of a uniquely decodable code is bounded between the entropy of the source and the entropy of the source plus one bit. This is more formally stated in Theorem 1.

Thus, an estimate of the entropy provides useful knowledge about how close the performance of a practical lossless coder is to the optimum.

Theorem 1 The source coding theorem: The uniquely decodable code that minimizes the average codeword length, L = \sum_{\xi \in \mathcal{A}} p_\Xi(\xi)\,l(\xi), satisfies

H(\Xi) \leq L = \sum_{\xi \in \mathcal{A}} p_\Xi(\xi)\,l(\xi) < H(\Xi) + 1. \qquad (20)
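As an illustration of Theorem 1 (not part of the thesis), the following Python sketch builds a binary Huffman code [60] for a hypothetical four-symbol pmf and checks that its average codeword length L lies between H(Ξ) and H(Ξ) + 1; the pmf and all implementation details are assumptions made for this example.

```python
# A minimal sketch, assuming a hypothetical four-symbol i.i.d. source,
# illustrating the bound (20) with a binary Huffman code.
import heapq
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a discrete pmf."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log2(p)))

def huffman_lengths(p):
    """Codeword lengths of a binary Huffman code for pmf p."""
    # Each heap item: (probability, tie-breaker, list of symbol indices).
    heap = [(pi, i, [i]) for i, pi in enumerate(p)]
    heapq.heapify(heap)
    lengths = [0] * len(p)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, t, s2 = heapq.heappop(heap)
        for s in s1 + s2:            # merging two subtrees adds one bit
            lengths[s] += 1          # to every symbol they contain
        heapq.heappush(heap, (p1 + p2, t, s1 + s2))
    return lengths

p = [0.5, 0.25, 0.15, 0.1]                      # hypothetical source pmf
L = sum(pi * li for pi, li in zip(p, huffman_lengths(p)))
H = entropy_bits(p)
print(f"H = {H:.3f} bits, average length L = {L:.3f} bits")
assert H <= L < H + 1                           # the bound of (20)
```

Any other pmf would do; the point is only that the empirical average length of a uniquely decodable code respects the bound of (20).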

The theory specifying the optimal performance in lossy compression is referred to as rate-distortion or distortion-rate theory, and was also originally developed by Shannon in [108]. The rate-distortion function is a tight lower bound on the average rate R required to transmit a stationary process for a given average distortion D. For an R^d-valued r.v. X with pdf f_X(x) and a bounded distortion criterion d(x, x̂), the rate-distortion function is defined as

R(D) = \inf_{\{f_{\hat{X}|X}(\hat{x}|x)\,:\,E[d(X,\hat{X})] \leq D\}} I(X; \hat{X}), \qquad (21)

where f_{\hat{X}|X}(\hat{x}|x) specifies the statistical mapping between the original and the quantized variable, and where E[d(X, \hat{X})] denotes the average distortion.

The rate-distortion function (21) is generally hard to determine analytically. An example of an exception is the combination of a univariate Gaussian source (zero mean and variance σ^2) and a mean squared error distortion measure, where the rate-distortion function is simply R(D) = 0.5 \log_2(σ^2/D). A numerical approach for the computation of the rate-distortion function is the Blahut algorithm [21]. Naturally, for a continuous-alphabet source the Blahut algorithm yields an approximation of the rate-distortion function.

An alternative approach that is usually simpler than finding the rate-distortion function is to determine a lower bound to it. The Shannon lower bound (SLB) is such a bound. The SLB can be defined for discrete-alphabet variables as well as for continuous-alphabet variables. For the latter case and a difference distortion criterion, i.e., d(x, x̂) = d(x − x̂), the SLB is defined as

R_{SLB}(D) = h(X) - \sup_{\{f_{X-\hat{X}}(x-\hat{x})\,:\,E[d(X-\hat{X})] \leq D\}} h(X - \hat{X}). \qquad (22)

If the reconstruction variable X̂ is independent of the reconstruction error X − X̂, then the SLB coincides with the actual rate-distortion function and the SLB is said to be tight.
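To make the numerical route concrete, the sketch below (not from the thesis) runs the Blahut algorithm [21] on a discretized zero-mean, unit-variance Gaussian source with a squared-error distortion; the grid, the slope values s, and the iteration count are assumptions chosen for illustration only. Each slope traces out one (R, D) point, which can be compared with the analytical 0.5 log2(σ²/D) of the continuous Gaussian case.

```python
# A minimal sketch of the Blahut algorithm for (21), assuming a discretized
# Gaussian source and squared-error distortion (neither taken from the thesis).
import numpy as np

def blahut(p_x, d, s, n_iter=500):
    """One (R, D) point for source pmf p_x, distortion matrix d and slope s > 0."""
    q = np.full(d.shape[1], 1.0 / d.shape[1])       # reproduction pmf, start uniform
    for _ in range(n_iter):
        A = q * np.exp(-s * d)                      # unnormalized transition weights
        Q = A / A.sum(axis=1, keepdims=True)        # conditional pmf Q(xhat | x)
        q = p_x @ Q                                 # updated reproduction pmf
    D = np.sum(p_x[:, None] * Q * d)
    # Mutual information in bits; the tiny constant guards 0 * log 0 terms.
    R = np.sum(p_x[:, None] * Q * (np.log2(Q + 1e-300) - np.log2(q)[None, :]))
    return R, D

# Discretized zero-mean, unit-variance Gaussian source, squared-error distortion.
x = np.linspace(-4.0, 4.0, 81)
p_x = np.exp(-0.5 * x**2)
p_x /= p_x.sum()
d = (x[:, None] - x[None, :])**2
for s in (0.5, 1.0, 2.0, 4.0):                      # each slope gives one (R, D) point
    R, D = blahut(p_x, d, s)
    print(f"D = {D:.3f}  R = {R:.3f} bits  (0.5*log2(1/D) = {0.5*np.log2(1.0/D):.3f})")
```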

Classification

The aim of pattern recognition is to classify sensor data into a discrete set of predefined classes (patterns). Due to the curse of dimensionality [17], features extracted from the raw sensor data are commonly used for the classification task instead of the sensor data itself.

Through Fano’s inequality [42] it is possible to obtain a lower bound on the error probability of any classifier given the entropies of the classes conditioned on the features. Let the discrete r.v. Υ represent the discrete classes we want to classify our sensor data into. Furthermore, let the continuous r.v. X represent the observations from which we classify Υ. Then the probability of error, P_e, of a classifier is related to the conditional entropy H(Υ|X) as [30]

H_{P_e} + P_e \log_2(|\Omega_\Upsilon| - 1) \geq H(\Upsilon \mid X), \qquad (23)

where H_{P_e} = -P_e \log_2(P_e) - (1 - P_e)\log_2(1 - P_e), and where |\Omega_\Upsilon| denotes the cardinality of Υ. Figure 5 shows an example of (23) for the case of two equally probable classes.

Figure 5: Lower bound on the error probability P_e as a function of the conditional entropy H(Υ|X) for a two-class example.

The extreme points in Figure 5 are intuitive; if the class-conditional entropy is zero, then the lower bound on the classification error is zero. Conversely, if the class-conditional entropy equals the class entropy, the features do not provide any information and the error probability corresponds to chance only. From (23) we note that selecting the set of features (represented by X) that minimizes H(Υ|X) minimizes the lower bound on the classification

error probability. This is equivalent to seeking the features that maximize the mutual information (MI) between the features and the classes, I(Υ; X). The MI I(Υ; X) represents the reduction (in bits) of the class entropy (uncertainty) H(Υ) that the features X provide. Feature selection using the MI is treated in, e.g., [6, 14, 16, 24, 39, 116, 120].
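For two equally probable classes, as in Figure 5, the bound (23) reduces to H_{P_e} ≥ H(Υ|X), so the lower bound on P_e follows from inverting the binary entropy function. The sketch below (not from the thesis) does this inversion numerically by bisection; the test values of H(Υ|X) are arbitrary.

```python
# A minimal sketch of the Fano lower bound (23) for two equally probable
# classes, obtained by inverting the binary entropy function numerically.
import numpy as np

def binary_entropy(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def fano_lower_bound(h_cond, tol=1e-10):
    """Smallest Pe in [0, 0.5] with H_b(Pe) >= h_cond (bisection)."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if binary_entropy(mid) < h_cond:
            lo = mid                   # entropy too small: Pe must be larger
        else:
            hi = mid
    return hi

for h in (0.0, 0.2, 0.5, 0.8, 1.0):
    print(f"H(Y|X) = {h:.1f} bits  ->  Pe >= {fano_lower_bound(h):.3f}")
```

The two extreme points discussed above reappear here: a conditional entropy of zero gives a lower bound of zero, and a conditional entropy of one bit forces the bound up to 0.5, i.e., chance level.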

2.3 Estimation of entropy

So far, we have introduced the definition of entropy and how it relates to bounds on optimal coding and classification. The second main topic of this thesis concerns the actual estimation of the differential entropy.

Existing techniques for (differential) entropy estimation can essentially be split into two categories: plug-in entropy estimators and direct entropy estimators. The former category of estimators consists of a parametric or non-parametric estimate of the probability density function (pdf) followed by numerical integration, whereas the latter category utilizes the data directly to estimate the entropy, avoiding the intermediate density estimation step. Examples of different pdf-estimation techniques used for the plug-in entropy estimates are histogram- and kernel-based models. In the category of direct entropy estimators we classify, for instance, the approaches based on the Euclidean distance between observations. A description of the different methods is given in this section.

Histogram based

An estimate of the pdf based on a histogram gives a piecewise constant approximation (it is constant over the histogram bin) of the true density. The resulting estimator of the differential entropy becomes (cf. the entropy to differential entropy relation in (13))

\hat{h}_{hist}(X) = \hat{H}(\Xi) + \log_2(V), \qquad (24)

where \hat{H}(\Xi) is the estimator of the (quantization) index entropy of the histogram bins and V is the volume of the histogram bins, assuming identical bins. The maximum-likelihood estimate of the probability of each bin, \hat{p}_\Xi(\xi), is equal to the ratio of the number of observations falling into the bin and the total number of observations. The corresponding estimate of the index entropy is \hat{H}(\Xi) = -\sum_{\xi \in \mathcal{X}} \hat{p}_\Xi(\xi) \log_2(\hat{p}_\Xi(\xi)), where \mathcal{X} denotes the set of indices. Since the number of histogram bins grows exponentially with the dimensionality, the practical use of the method is generally limited to the scalar or low-dimensional vector case. In [20, 46] the histogram bins are not of equal size and shape over the space, but instead they are individually selected as a function of the local structure of X ∈ R^d, and the assumption of identical histogram bins in (24) (that resulted in the \log_2(V) term) no longer holds and has to be compensated for.
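A minimal sketch (not from the thesis) of the histogram estimator (24) for scalar data could look as follows; the number of bins and the Gaussian test source are assumptions made for illustration.

```python
# A minimal sketch of the histogram estimator (24): index entropy of the bin
# counts plus log2 of the (identical) bin volume.
import numpy as np

def hist_entropy(x, n_bins=30):
    """Differential-entropy estimate (bits) for scalar observations x."""
    counts, edges = np.histogram(x, bins=n_bins)
    p_hat = counts[counts > 0] / counts.sum()      # ML bin probabilities
    H_index = -np.sum(p_hat * np.log2(p_hat))      # entropy of the bin index
    V = edges[1] - edges[0]                        # common bin volume (width)
    return H_index + np.log2(V)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=10000)
# True differential entropy of N(0,1): 0.5*log2(2*pi*e) ~ 2.047 bits.
print(f"estimate: {hist_entropy(x):.3f} bits")
```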

Kernel based

The kernel-based pdf models used in plug-in entropy estimators can be either non-parametric or parametric. The Parzen windows approach is a commonly used non-parametric technique for density estimation [91]. In this case the pdf estimate is constructed by a superposition of windows centered around each observation of the random variable. Naturally, there are constraints on the windows to ensure that the pdf estimate is a valid pdf, that is, it has to be non-negative and integrate to one. This can be guaranteed by letting the window itself be a density, e.g., a Gaussian. If we denote the window (or kernel) centered around an observation x_n by ψ_N(x; x_n), the Parzen windows density estimate can be expressed as

\hat{f}_{Parzen,X}(x) = \frac{1}{N} \sum_{n=1}^{N} \psi_N(x; x_n). \qquad (25)

The window ψ_N(x; x_n) is a function of the number of observations N: the window width is selected to be inversely proportional to the number of observations. Figure 6 shows an example of Parzen-window based pdf estimation using Gaussian kernels.

Figure 6: Example of pdf estimation using Parzen windows. The dashed curve shows the true pdf and the solid shows the Parzen win- dow estimate using Gaussian kernels with standard deviation 0.1. The number of Parzen windows is 5 and 500 in the left and right plots, respectively. In the left plot the individual windows are shown (dotted curves).

Plug-in entropy estimators require a multidimensional integration, which often is difficult or impossible to compute analytically. However, if we assume the random variable X to come from an ergodic source, we can replace the ensemble average h(X) = -E[\log_2(f_X(X))] by the sample average

\lim_{N\to\infty} -\frac{1}{N}\sum_{i=1}^{N} \log_2(f_X(x_i)), and we obtain for the Parzen windows technique:

\hat{h}_{Parzen}(X) = -\frac{1}{K} \sum_{k=1}^{K} \log_2\left(\frac{1}{N}\sum_{n=1}^{N} \psi_N(x_k; x_n)\right), \qquad (26)

where K is the number of observations used for the stochastic integration and x_k denotes the k'th observation. The observations used for the stochastic integration above can be a subset of the true observations that was not used to form the density estimate, or artificially generated given the estimated pdf. In the former case, the estimator in (26) has a positive bias as K approaches infinity, since

\lim_{K\to\infty} \hat{h}_{Parzen}(X) = h(X) + h(f_X \,\|\, \hat{f}_{Parzen,X}), \qquad (27)

where h(·‖·) denotes the Kullback-Leibler distance [75] between the true and estimated distributions. Thus, by using a subset of the true observations in the stochastic integration, we have the advantage of knowing that our differential entropy estimate is an overestimate of the true differential entropy. Under some constraints on the smoothness of the window and the rate of decrease of the window width, convergence in mean square error of (25) can be shown [34, 91], and the bias term in (27) vanishes as N approaches infinity.

A commonly used parametric pdf model utilized in plug-in estimators of the differential entropy is the Gaussian mixture model (GMM). Similarly to the Parzen windows approach, the pdf is modeled by a weighted sum of kernels. However, in contrast to the Parzen windows, each kernel has an individual weight, mean, and covariance. Moreover, the number of kernels is an additional model parameter. Thus, the pdf using a GMM can be expressed as

\hat{f}_{GMM,X}(x) = \sum_{l=1}^{L} \alpha_l\, \phi_l(x; m_l, C_l), \qquad (28)

where L denotes the number of components, α_l is the component weight, and φ_l(x; m_l, C_l) denotes the l'th Gaussian kernel with mean m_l and covariance C_l. The parameters of the GMM, \{\alpha_l, m_l, C_l\}_{l=1}^{L}, can be obtained using, for instance, the expectation-maximization (EM) algorithm [33]. Figure 7 shows an example of pdf modeling using a GMM. Assuming an ergodic source, we can form an estimator of the differential entropy using stochastic integration:

\hat{h}_{GMM}(X) = -\frac{1}{K}\sum_{k=1}^{K} \log_2\left(\sum_{l=1}^{L} \alpha_l\, \phi_l(x_k; m_l, C_l)\right), \qquad (29)

where \{x_k\}_{k=1}^{K} is the set of observations not included in the estimation of the model parameters.

Figure 7: Example of pdf modeling using GMMs. The dotted curves show the individual components of the GMM and the solid curve shows the composite pdf.
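A minimal sketch (not from the thesis) of the GMM plug-in estimator (29) is given below; it fits the mixture with the EM algorithm [33] as implemented in scikit-learn, and the data split, model order, and Gaussian test source are arbitrary choices for illustration.

```python
# A minimal sketch of the GMM plug-in estimator (29), assuming scikit-learn
# is available and using a held-out subset for the stochastic integration.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
C = np.array([[1.0, 0.6], [0.6, 1.0]])
x = rng.multivariate_normal([0.0, 0.0], C, size=6000)
x_train, x_test = x[:4000], x[4000:]          # held-out set for the integration

gmm = GaussianMixture(n_components=4, covariance_type='full', random_state=0)
gmm.fit(x_train)

# score_samples returns log f(x) in nats; convert to bits and average as in (29).
h_gmm = -np.mean(gmm.score_samples(x_test)) / np.log(2.0)

# True differential entropy of the Gaussian: 0.5*log2((2*pi*e)^2 * det(C)).
h_true = 0.5 * np.log2((2 * np.pi * np.e)**2 * np.linalg.det(C))
print(f"estimate: {h_gmm:.3f} bits, true: {h_true:.3f} bits")
```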

In contrast to the histogram approach, both the Parzen windows and the GMM based entropy estimators perform a smoothing of the observation space since the kernel widths are adapted to the number of observations and to the number of mixtures. In the case of Parzen windows the degree of smoothing is controlled by the kernel bandwidth and has to be manually set using, e.g., Silverman’s rule of thumb [109]. For the Gaussian univariate kernel the Silverman rule of thumb yields

\sigma_{kernel} = 0.9 \min\left(\hat{\sigma}_X, \frac{q_{75} - q_{25}}{1.34}\right) N^{-\frac{1}{5}}, \qquad (30)

where σ_kernel specifies the standard deviation of the Gaussian kernel, \hat{σ}_X denotes the estimate of the standard deviation of the data, q_25 and q_75 specify the 25 and 75 percent quantiles, respectively, and N denotes the number of kernels.
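Combining (25), (26), and a Silverman-style bandwidth in the spirit of (30), a minimal sketch (not from the thesis) of the Parzen-window estimator for scalar data could read as follows; the data split and the Gaussian test source are assumptions made for illustration.

```python
# A minimal sketch of the Parzen-window estimator (25)-(26) for scalar data,
# with a Gaussian kernel whose width follows a Silverman-style rule of thumb.
import numpy as np

def silverman_sigma(x):
    """Rule-of-thumb kernel standard deviation for scalar data x."""
    q25, q75 = np.percentile(x, [25, 75])
    return 0.9 * min(np.std(x), (q75 - q25) / 1.34) * len(x) ** (-0.2)

def parzen_entropy(x_model, x_eval):
    """h_Parzen in bits: density from x_model, stochastic integration over x_eval."""
    sigma = silverman_sigma(x_model)
    # psi_N(x_k; x_n) for all pairs (k, n), Gaussian kernel.
    diff = (x_eval[:, None] - x_model[None, :]) / sigma
    kernels = np.exp(-0.5 * diff**2) / (sigma * np.sqrt(2 * np.pi))
    f_hat = kernels.mean(axis=1)                   # eq. (25) at each x_k
    return -np.mean(np.log2(f_hat))                # eq. (26)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=4000)
x_model, x_eval = x[:3000], x[3000:]               # disjoint subsets, as in the text
print(f"estimate: {parzen_entropy(x_model, x_eval):.3f} bits (true ~2.047)")
```

Because the evaluation set is disjoint from the modeling set, the estimate should, in line with (27), tend to slightly overestimate the true differential entropy.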

Euclidean distance based

In [119] it was shown that an estimator of the differential entropy given a set of scalar observations can be derived from the distance between sorted observations (sorted from minus infinity to plus infinity). Intuitively this makes sense, since regions where the distances between the sorted observations are small are consequently also densely populated, and thus, closely related to the underlying pdf.

Sorting of observations is only possible in the scalar case. However, the Euclidean distance between observations in the vector space can be used for estimating the differential entropy [56, 57, 72]. In [56, 57] it was shown that the graph length obtained from, e.g., the minimum spanning tree (MST), the Steiner tree (ST), or the traveling salesman problem (TSP) (see [121]

for more details on graph theory) is directly related to the Renyi entropy. Similarly, Kozachenko et al. [72] showed that the average nearest-neighbor distance relates to the Shannon differential entropy.

Let x^N = \{x_n\}_{n=1}^{N} define a set of N d-dimensional independent identically distributed (i.i.d.) vectors in R^d, and let an edge be the connection between two vectors of the set. Associated with each edge is a cost (here the cost is the Euclidean distance). A minimum spanning tree is defined as the set of edges that connects all vectors in x^N at minimum total cost. Thus, the length (or equivalently the cost), L_{x^N}, of the tree is

L_{x^N} = \min_{\tau} \sum_{e \in \tau} \|e\|^{\gamma}, \qquad (31)

where e denotes an edge in the tree τ, and ‖·‖ and γ denote the Euclidean norm and power exponent, respectively. The most common algorithms for the minimum spanning tree search are the Kruskal [73] and Prim [93] algorithms. Figure 8 shows a minimum spanning tree of 512 random vectors containing the first and second mel-frequency cepstral coefficients (MFCCs) [31] of narrowband speech.

Figure 8: Minimum spanning tree of 512 random vectors containing the first and second mel-frequency cepstral coefficients of speech.
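A minimal sketch (not from the thesis) of the tree length (31) could use SciPy's minimum-spanning-tree routine as below; the 512 Gaussian test vectors stand in for the MFCC data of Figure 8, and γ = 1 corresponds to summing plain Euclidean edge lengths.

```python
# A minimal sketch of the MST length (31) for a random point set, assuming
# SciPy is available; gamma = 1 sums Euclidean edge lengths as in Figure 8.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def mst_length(x, gamma=1.0):
    """Total MST cost sum_e ||e||^gamma for an (N, d) array of observations x."""
    D = squareform(pdist(x)) ** gamma          # pairwise Euclidean edge costs
    T = minimum_spanning_tree(D)               # sparse matrix holding the tree edges
    return T.sum()

rng = np.random.default_rng(0)
x = rng.standard_normal((512, 2))              # 512 random 2-D vectors
print(f"MST length: {mst_length(x):.2f}")
```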

In [57] Hero and Michel exploit that the length of the minimum spanning tree can be related (for d ≥ 2) to an estimate of the differential entropy (α-Renyi entropy):

\hat{h}_{\alpha}(X) = \left(\log_2\left(\frac{L_{x^N}}{N^{\alpha}}\right) - \log_2(\beta_{L,\gamma})\right) / (1 - \alpha), \qquad (32)

where the α-entropy is controlled by varying γ as α = (d − γ)/d, and where β_{L,γ} is a density-independent constant depending only on the type of spanning tree (MST, ST, TSP, etc.) and on γ. A problem, however, is that β_{L,γ} is not known analytically, and has to be estimated from the data.

As previously mentioned, the Euclidean distance between nearest neighbors can be related to the Shannon differential entropy. In contrast to (32), the equivalent of the constant β_{L,γ} is known for the nearest-neighbor entropy estimator (cf. β_{L,γ} and the normalized moment of inertia for random codebooks [122]). The nearest-neighbor (NN) entropy estimator is defined as [72]

\hat{h}_{NN}(X) = \frac{1}{N}\sum_{n=1}^{N} \log_2\left(\|e_n\|^{d}\,(N - 1)\,\exp(C_E)\,V_d\right), \qquad (33)

where ‖e_n‖ is the Euclidean distance between an observation x_n and its nearest neighbor, d denotes the dimension, and V_d is the volume of a d-dimensional sphere with unit radius, i.e., V_d = \frac{\pi^{d/2}}{(d/2)!}. The constant C_E ≈ 0.5772 in (33) is the Euler constant. Asymptotic properties of the NN entropy estimator have been investigated in [72, 89, 117, 118].
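A minimal sketch (not from the thesis) of the nearest-neighbor estimator (33) is given below; the Gaussian test data are an assumption made so that the estimate can be compared with a known differential entropy.

```python
# A minimal sketch of the nearest-neighbor estimator (33), with the unit-sphere
# volume V_d and the Euler constant as given in the text.
import numpy as np
from math import gamma, pi
from scipy.spatial import cKDTree

EULER = 0.5772156649

def nn_entropy(x):
    """h_NN in bits for an (N, d) array of i.i.d. observations."""
    N, d = x.shape
    V_d = pi ** (d / 2) / gamma(d / 2 + 1)             # volume of the unit d-sphere
    dist, _ = cKDTree(x).query(x, k=2)                 # k=1 is the point itself
    e = dist[:, 1]                                     # nearest-neighbor distances
    return np.mean(np.log2(e**d * (N - 1) * np.exp(EULER) * V_d))

rng = np.random.default_rng(0)
x = rng.standard_normal((5000, 2))
h_true = 0.5 * np.log2((2 * pi * np.e) ** 2)           # N(0, I_2): ~4.094 bits
print(f"estimate: {nn_entropy(x):.3f} bits, true: {h_true:.3f} bits")
```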

2.4 Effects of manifolds for entropy estimation

A manifold is a mathematical space which is locally Euclidean, i.e., observing such a space from a close distance the space appears approximately flat, whereas when observed from a greater distance it may have a much more complicated structure. The surface of the Earth is an example of a two-dimensional manifold in a three-dimensional space. Locally, the distance between two neighboring Swedish cities such as Stockholm and Uppsala can be well approximated by the Euclidean distance in the three-dimensional space. However, if we would like to measure the distance between Stockholm and Sydney, Australia, the Euclidean distance would be improper. Figure 9 shows two other two-dimensional manifolds embedded in the three-dimensional Euclidean space.

Manifolds have been observed in topics related to vision [100, 106, 114] and speech [71, 115]. High-dimensional vectors of observations from these real-world processes have often been shown to be located on a lower-dimensional manifold embedded in the high-dimensional vector space. The dimensionality of the manifold is commonly referred to as the intrinsic dimensionality, and estimation methods for the intrinsic dimensionality are

Figure 9: Examples of two-dimensional manifolds embedded in the three-dimensional Euclidean space: a Möbius strip and a ring torus.

presented in, e.g., [18, 26, 47, 61, 92]. This implies that the number of parameters needed to describe a point in the high-dimensional space is equal to the manifold dimension, assuming that the manifold structure is known. Dimension reduction techniques that consider an underlying manifold structure can be found in [100, 114].

The manifold structure of the data causes vanishing support of the probability density function, which in turn, when using the standard definition, results in an infinitely negative Shannon differential entropy. A more representative calculation of the differential entropy in this case is to first make a, possibly nonlinear, mapping of the data from the high-dimensional space onto a space of dimension equal to the intrinsic dimension of the manifold. The entropy calculation can then be performed in this lower-dimensional space. The existence of manifold structures in the data has often been overlooked in entropy estimation, with the result that classical methods provide erroneous estimates of the entropy, since they assume the wrong intrinsic dimension (manifold dimension).

Recently, progress has been made towards including the effect of data lying on a manifold in entropy estimation. Methods for estimating the Renyi entropy of data located on manifolds have been presented in [28, 29]. The method in [28] is based on the construction of so-called geodesic minimal spanning trees obtained from pruning of the Isomap produced by the algorithm derived in [114]. In [29] the same authors reduce the complexity of their previous method by using k-nearest neighbor graphs instead of the geodesic minimal spanning trees. Contrasting with the work of [28, 29],

which provides estimation procedures for the Renyi entropy, our work provides direct estimation procedures for the Shannon differential entropy of data lying on a manifold. In Paper B, we start from high-rate quantization theory and arrive at a joint estimator of the intrinsic dimensionality and the Shannon differential entropy.

The assumption of a fixed manifold dimension over the embedding space is too strong in many cases. The manifold can be of varying dimensionality within the embedding space. In this case, the entropy estimation methods assuming a fixed dimensionality cannot be applied. The estimation of information-theoretical measures in this context is addressed in Paper C.

It is important to emphasize that the manifold structures (of constant or varying dimensionality) of real-world processes are not likely to be surfaces in a strict sense, because of the presence of noise (e.g., measurement or quantization noise). However, the noise can be so small that the practical behavior, given a finite set of observations, is very similar to the theoretical effect of infinitely negative differential entropy caused by manifolds.
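The sketch below (not from the thesis, and using a hypothetical circle example rather than speech data) illustrates this effect with the nearest-neighbor estimator (33): for points on a unit circle, a one-dimensional manifold in R², the estimate computed in the two-dimensional embedding space keeps decreasing as the number of observations grows, while the estimate computed in the intrinsic one-dimensional coordinate (the angle) stays near log2(2π) ≈ 2.65 bits.

```python
# A minimal sketch of the manifold effect on the nearest-neighbor estimator
# (33): 2-D estimates on a 1-D manifold drift towards minus infinity with N.
import numpy as np
from math import gamma, pi
from scipy.spatial import cKDTree

EULER = 0.5772156649

def nn_entropy(x):
    """Nearest-neighbor estimate (33) in bits, as in the sketch above."""
    N, d = x.shape
    V_d = pi ** (d / 2) / gamma(d / 2 + 1)
    e = cKDTree(x).query(x, k=2)[0][:, 1]
    return np.mean(np.log2(e**d * (N - 1) * np.exp(EULER) * V_d))

rng = np.random.default_rng(0)
for N in (500, 5000, 50000):
    theta = rng.uniform(0.0, 2 * np.pi, size=N)            # intrinsic coordinate
    circle = np.column_stack((np.cos(theta), np.sin(theta)))  # embedded in R^2
    print(f"N = {N:6d}: 2-D estimate {nn_entropy(circle):6.2f} bits, "
          f"1-D (angle) estimate {nn_entropy(theta[:, None]):5.2f} bits")
```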

3 Summary of contributions

In this thesis, we study the representation of speech signals and the estimation of entropy given observations containing features of the speech signal. In Paper A we present a compact representation of the speech signal that facilitates perfect reconstruction. The individual parts of the representation are closely related to speech signal properties such as spectral envelope, pitch, and voiced/unvoiced signal coefficients, which is beneficial for both speech coding and manipulation.

The canonical representation forms a natural basis for speech processing, including classification and bandwidth extension. When performing such processing, it is natural to wonder about bounds on the performance of the processing. Such bounds are often based on information theory, and require estimates of the entropy. We address this problem in Papers B and C. The estimation of entropy is important since entropy is at the core of the performance limit theorems in coding [52, 107, 108] and classification [42]. Paper B considers the estimation of the differential entropy in the case when the set of vector observations (from the representation), to be used for the estimation, lie on a lower-dimensional surface (manifold) in the embedding space. In contrast to the method presented in Paper B, Paper C introduces a method where the manifold structures are destroyed by constraining the resolution of the observation space. This facilitates the prediction of classification error rates even when the manifolds are of varying dimensionality within the embedding space.

Finally, Paper D investigates the amount of shared information between

spectral features of narrow-band (bandwidth from 0.3 to 3.4 kHz) and high-band (bandwidth from 3.5 to 7 kHz) speech. The results in Paper D indicate that the information shared between the narrow- and high-band is insufficient to facilitate high-quality wide-band speech coding without transmission of some extra information describing the high-band.

Short summaries of the four research papers included in this thesis are presented below. Most of the derivations and all of the experiments in the following papers have been performed by the author of this thesis.

Paper A: A Canonical Representation of Speech

It is well known that the use of an appropriate representation of the speech signal improves the performance of speech coders, recognizers, and synthesizers. In this paper we present a representation of speech whose efficiency, in terms of compactness, is similar to that of parametric modeling, but which additionally has the completeness property of signal expansions. The resulting canonical representation of speech is suited for a wide range of speech processing applications, and we demonstrate this through experiments related to coding and prosodic modification.

Paper B: On the Estimation of Differential Entropy from Data Located on Embedded Manifolds

Estimation of the differential entropy from observations of a random variable is of great importance for a wide range of signal processing applications such as source coding, pattern recognition, hypothesis testing, and blind source separation. In this paper we present a method for estimation of the Shannon differential entropy that accounts for embedded manifolds. The method is based on high-rate quantization theory and forms an extension of the classical nearest-neighbor entropy estimator. The estimator is consistent in the mean square sense and an upper bound on the rate of convergence of the estimator is given. Because of the close connection between compression and Shannon entropy, the proposed method has an advantage over methods estimating the Renyi entropy. Through experiments on uniformly distributed data on known manifolds and real-world speech data we show the accuracy and usefulness of our proposed method.

Paper C: Intrinsic Dimensionality and its Implication for Performance Prediction in Pattern Classification

Fano's inequality relates entropy to the error probability of any classifier. It has been shown that feature observations extracted from real-world sources can have an intrinsic dimensionality different from the vector dimensionality. In this work, we show how this affects the estimation of entropy and consequently the prediction of classification error rates. We introduce a new method that minimizes the effects due to the differences in dimensionality of the embedding space and the intrinsic dimension, and we demonstrate its performance.

Paper D: Gaussian Mixture Model based Mutual Information Estimation between Frequency Bands in Speech

In this paper, we investigate the dependency between the spectral envelopes of speech in disjoint frequency bands, one covering the telephone bandwidth from 0.3 kHz to 3.4 kHz and one covering the frequencies from 3.7 kHz to 8 kHz. The spectral envelopes are jointly modeled with a Gaussian mixture model based on mel-frequency cepstral coefficients and the log-energy-ratio of the disjoint frequency bands. Using this model, we quantify the dependency between bands through their mutual information and the perceived entropy of the high frequency band. Our results indicate that the mutual information is only a small fraction of the perceived entropy of the high band. This suggests that speech bandwidth extension should not rely only on mutual information between narrow- and high-band spectra. Rather, such methods need to make use of perceptual properties to ensure that the extended signal sounds pleasant.

References

[1] Pulse code modulation (PCM) of voice frequencies. ITU-T Recommendation G.711, November 1988.
[2] 40, 32, 24, 16 kbit/s adaptive differential pulse code modulation (ADPCM). ITU-T Recommendation G.726, December 1990.
[3] Coding of speech at 8 kbit/s using conjugate-structure algebraic-code-excited linear prediction (CS-ACELP). ITU-T Recommendation G.729, March 1996.
[4] 3GPP TS 26.090. AMR Speech Codec; Transcoding Functions.
[5] 3GPP TS 26.290. Extended Adaptive Multi-Rate - Wideband (AMR-WB+) Codec; Transcoding Functions.
[6] A. Al-Ani and M. Deriche. Feature selection using a mutual information based measure. In IEEE Int. Conf. Pattern Recog., pages 82–85, 2002.
[7] L. B. Almeida and J. M. Tribolet. Harmonic coding: A low bit-rate, good-quality speech coding technique. In Proc. IEEE Int. Conf. Acoust. Speech Sign. Process., pages 1664–1667, 1982.
[8] B. S. Atal. The history of linear prediction. IEEE Signal Processing Mag., pages 154–157,161, 2006.
[9] B. S. Atal and N. David. On synthesizing natural-sounding speech by linear prediction. In Proc. IEEE Int. Conf. Acoust. Speech Sign. Process., pages 44–47, 1979.

[10] B. S. Atal and S. L. Hanauer. Speech analysis and synthesis by linear prediction of the speech wave. J. Acoust. Soc. Am., 50(2):637–655, 1971.
[11] B. S. Atal and J. R. Remde. A new model of LPC excitation for producing natural-sounding speech at low bit rates. In Proc. IEEE Int. Conf. Acoust. Speech Sign. Process., pages 614–617, 1982.
[12] B. S. Atal and M. R. Schroeder. Predictive coding of speech signals and subjective error criteria. In Proc. IEEE Int. Conf. Acoust. Speech Sign. Process., pages 573–576, 1978.
[13] A. R. Barron, J. Rissanen, and B. Yu. The minimum description length principle in coding and modeling. IEEE Trans. Information Theory, 44(6):2743–2760, 1998.
[14] G. L. Barrows and J. C. Sciortino. A mutual information measure for feature selection with application to pulse classification. In Proc. IEEE-SP Int. Symp. time-freq. and time-scale analysis, pages 249–252, 1996.
[15] M. J. Bastiaans. Gabor's expansion of a signal into Gaussian elementary signals. Proceedings of the IEEE, 68(4):538–539, 1980.
[16] R. Battiti. Using mutual information for selecting features in supervised neural net learning. IEEE Trans. Neural Networks, 5(4):537–550, 1994.
[17] R. E. Bellman. Adaptive control processes: a guided tour. Princeton Univ. Press, 1961.
[18] R. S. Bennett. The intrinsic dimensionality of signal collections. IEEE Trans. Information Theory, 15(5):517–525, 1969.
[19] T. Berger and J. D. Gibson. Lossy source coding. IEEE Trans. Information Theory, 44(6):2693–2723, 1998.
[20] H. Bernhard and G. Kubin. A fast mutual information calculation algorithm. In M.J.J. Holt et al., editor, Signal Processing VII: Theories and Applications, volume 1, pages 50–53. Elsevier, Amsterdam, 1994.
[21] R. E. Blahut. Computation of channel capacity and rate-distortion functions. IEEE Trans. Information Theory, 18(4):460–473, 1972.
[22] P. Boersma. Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound. Proc. Institute of Phonetic Sciences of the University of Amsterdam, 17:97–110, 1993.
[23] H. Bölcskei, F. Hlawatsch, and H. G. Feichtinger. Frame-theoretic analysis of oversampled filter banks. IEEE Trans. Signal Processing, 46(12):3256–3268, 1998.
[24] B. V. Bonnlander and A. S. Weigend. Selecting input variables using mutual information and nonparametric density estimation. In Proc. Int. Symp. Artificial Neural Networks, pages 42–50, 1994.
[25] M. S. Brandstein, P. A. Monta, J. C. Hardwick, and J. S. Lim. A real-time implementation of the improved MBE speech coder. In Proc. IEEE Int. Conf. Acoust. Speech Sign. Process., pages 5–8, 1990.

[26] J. Bruske and G. Sommer. Intrinsic dimensionality estimation with optimally topology preserving maps. IEEE Trans. Pattern Analysis and Machine Intelligence, 20(5):572–575, 1998.
[27] R. R. Coifman and M. V. Wickerhauser. Entropy-based algorithms for best basis selection. IEEE Trans. Information Theory, 38(2):713–718, 1992.
[28] J. Costa and A. Hero. Geodesic entropic graphs for dimension and entropy estimation in manifold learning. IEEE Trans. Signal Processing, 52(8):2210–2221, 2004.
[29] J. Costa and A. Hero. Manifold learning using Euclidean k-nearest neighbor graphs. In Proc. IEEE Int. Conf. Acoust. Speech Sign. Process., pages 988–991, 2004.
[30] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, 1991.
[31] S. B. Davis and P. Mermelstein. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust., Speech, and Signal Processing, 28(4):357–366, 1980.
[32] J. R. Deller, J. G. Proakis, and J. H. L. Hansen. Discrete-time processing of speech signals. Prentice Hall, New Jersey, USA, 1993.
[33] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statistical Soc., Ser. B, 39(1):1–38, 1977.
[34] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley, New York, USA, 2000.
[35] H. Dudley. Remaking speech. J. Acoust. Soc. Am., 11(2):169–177, 1939.
[36] P. Duhamel, Y. Mahieux, and J. P. Petit. A fast algorithm for the implementation of filterbanks based on "time domain aliasing cancellation". In Proc. IEEE Int. Conf. Acoust. Speech Sign. Process., pages 2209–2212, 1991.
[37] Y. Ephraim and H. L. Van Trees. A signal subspace approach for speech enhancement. IEEE Trans. Speech Audio Processing, 3(4):251–266, 1995.
[38] D. Erdogmus, K. E. Hild, J. C. Principe, M. Lazaro, and I. Santamaria. Adaptive blind deconvolution of linear channels using Renyi's entropy with Parzen window estimation. IEEE Trans. Signal Processing, 52(6):1489–1498, 2004.
[39] T. Eriksson, S. Kim, H.-G. Kang, and C. Lee. An information-theoretic perspective on feature selection in speaker recognition. IEEE Signal Processing Lett., 12(7):500–503, 2005.
[40] European Telecommun. Standards Institute (ETSI). Enhanced Full Rate (EFR) speech transcoding (GSM 06.60), 1996.
[41] G. Evangelista. Pitch-synchronous wavelet representation of speech and music signals. IEEE Trans. Signal Processing, 41(12):3313–3330, 1993.
[42] R. M. Fano. Transmission of information. Wiley, New York, USA, 1961.

[43] G. Fant. Acoustic theory of speech production. Technical report, Report No. 10, Division of Telegraphy-Telephony, Royal Institute of Technology, Stockholm, Sweden, 1958.
[44] J. L. Flanagan. A resonance vocoder and baseband complement: A hybrid system for speech transmission. IRE Transactions on Audio, 8(3):95–107, 1960.
[45] J. L. Flanagan, M. R. Schroeder, B. S. Atal, R. E. Crochiere, N. S. Jayant, and J. M. Tribolet. Speech coding. IEEE Trans. Communications, 27(4):710–932, 1979.
[46] A. M. Fraser. Information and entropy in strange attractors. IEEE Trans. Information Theory, 35(2):245–262, 1989.
[47] K. Fukunaga and D. R. Olsen. An algorithm for finding intrinsic dimensionality of data. IEEE Trans. Computers, C-20(2):176–183, 1971.
[48] A. Gersho. Advances in speech and audio compression. Proceedings of the IEEE, 83(6):900–918, 1994.
[49] A. Gersho and R. M. Gray. Vector Quantization and Signal Compression. Kluwer Academic Publishers, Dordrecht, Holland, 1991.
[50] J. D. Gibson. Adaptive prediction in speech differential encoding systems. Proceedings of the IEEE, 68(4):488–525, 1980.
[51] V. K. Goyal. Theoretical foundations of transform coding. IEEE Signal Processing Mag., 18(5):9–21, 2001.
[52] R. M. Gray. Source Coding Theory. Kluwer Academic Publishers, 1990.
[53] R. M. Gray and D. L. Neuhoff. Quantization. IEEE Trans. Information Theory, 44(6):2325–2383, 1998.
[54] D. W. Griffin and J. S. Lim. Multiband excitation vocoder. IEEE Trans. Acoust., Speech, and Signal Processing, 36(8):1223–1235, 1988.
[55] P. Hedelin. A tone-oriented voice-excited vocoder. In Proc. IEEE Int. Conf. Acoust. Speech Sign. Process., pages 205–208, 1981.
[56] A. O. Hero, B. Ma, O. Michel, and J. Gorman. Applications of entropic spanning graphs. IEEE Signal Processing Mag., 19(5):85–95, 2002.
[57] A. O. Hero and O. Michel. Asymptotic theory of greedy approximations to minimal k-point random graphs. IEEE Trans. Information Theory, 45(6):1921–1938, 1999.
[58] K. E. Hild, D. Erdogmus, and J. Principe. Blind source separation using Renyi's mutual information. IEEE Signal Processing Lett., 8(6):174–176, 2001.
[59] X. Huang, A. Acero, and H.-W. Hon. Spoken Language Processing, A guide to theory, algorithm, and system development. Prentice Hall, New Jersey, USA, 2001.
[60] D. A. Huffman. A method for the construction of minimum-redundancy codes. Proc. IRE, 40(9):1098–1101, 1952.

[61] D. R. Hundley and M. J. Kirby. Estimation of topological dimension. In Proc. SIAM Int. Conf. Data Mining, pages 194–202, 2003.
[62] N. S. Jayant and P. Noll. Digital coding of waveforms. Prentice Hall, New Jersey, USA, 1984.
[63] A. Jefremov and W. B. Kleijn. Spline-based continuous-time pitch estimation. In Proc. IEEE Int. Conf. Acoust. Speech Sign. Process., pages 337–340, 2002.
[64] M.-Y. Kim and W. B. Kleijn. KLT-based adaptive classified VQ of the speech signal. IEEE Trans. Speech Audio Processing, 12(3):277–289, 2004.
[65] W. B. Kleijn. A frame interpretation of sinusoidal coding and waveform interpolation. In Proc. IEEE Int. Conf. Acoust. Speech Sign. Process., volume 3, pages 1475–1478, 2000.
[66] W. B. Kleijn and J. Haagen. Transformation and decomposition of the speech signal for coding. IEEE Signal Processing Lett., 1(9):136–138, 1994.
[67] W. B. Kleijn and J. Haagen. A speech coder based on decomposition of characteristic waveforms. In Proc. IEEE Int. Conf. Acoust. Speech Sign. Process., pages 508–511, 1995.
[68] W. B. Kleijn, P. Kroon, L. Cellario, and D. Sereno. A 5.85 kbps CELP algorithm for cellular applications. In Proc. IEEE Int. Conf. Acoust. Speech Sign. Process., pages 596–599, 1993.
[69] W. B. Kleijn and K. K. Paliwal. Speech Coding and Synthesis. Elsevier, 1998.
[70] W. B. Kleijn and D. Talkin. Compact speech representations for speech synthesis. In Proc. IEEE Workshop on Speech Synthesis, pages 35–38, 2002.
[71] W. Klein, R. Plomp, and L. C. W. Pols. Vowel spectra, vowel spaces, and vowel identification. J. Acoust. Soc. Am., 48(4):999–1009, 1970.
[72] L. F. Kozachenko and N. N. Leonenko. Sample estimate of entropy of a random vector. Problems of Information Transmission, 23(1):95–101, 1987.
[73] J. B. Kruskal. On the shortest spanning subtree of a graph and the traveling salesman problem. Proc. Amer. Math. Soc., 7:48–50, 1956.
[74] G. Kubin, B. S. Atal, and W. B. Kleijn. Performance of noise excitation for unvoiced speech. In IEEE Workshop on Speech Coding for Telecommunications, pages 35–36, 1993.
[75] S. Kullback and R. A. Leibler. On information and sufficiency. The Ann. Math. Stat., 22(1):79–86, 1951.
[76] E. Lamboray, S. Würmlin, and M. Gross. Data streaming in telepresence environments. IEEE Trans. Visualization and Computer Graphics, 11(6):637–648, 2005.
[77] J. Laroche, Y. Stylianou, and E. Moulines. HNM: A simple, efficient harmonic + noise model for speech. In IEEE Workshop on Applications of Sign. Proc. to Audio and Acoust., pages 169–172, 1993.

[78] C.-H. Lee. From knowledge-ignorant to knowledge-rich modeling: a new speech research paradigm for next generation automatic speech recognition. In Proc. Int. Conf. Speech Lang. Process., pages 845–848, 2004.
[79] B. Ma, A. Hero, J. Gorman, and O. Michel. Image registration with minimum spanning tree. In Proc. IEEE Int. Conf. Image Process., pages 481–484, 2000.
[80] J. Makhoul. Spectral linear prediction: properties and applications. IEEE Trans. Acoust., Speech, and Signal Processing, 23(3):283–296, 1975.
[81] J. Makhoul and M. Berouti. High-frequency regeneration in speech coding systems. In Proc. IEEE Int. Conf. Acoust. Speech Sign. Process., pages 428–431, 1979.
[82] S. Mallat. A Wavelet Tour of Signal Processing. Academic Press, 1998.
[83] H. Malvar. Lapped transforms for efficient transform/subband coding. IEEE Trans. Acoust., Speech, and Signal Processing, 38(6):969–978, 1990.
[84] R. J. McAulay and T. F. Quatieri. Speech analysis/synthesis based on a sinusoidal representation. IEEE Trans. Acoust., Speech, and Signal Processing, 34(4):744–754, 1986.
[85] A. McCree and J. C. De Martin. A 1.7 kb/s MELP coder with improved analysis and quantization. In Proc. IEEE Int. Conf. Acoust. Speech Sign. Process., pages 593–596, 1998.
[86] D. McLeod, U. Neumann, C. L. Nikias, and A. A. Sawchuk. Integrated media systems. IEEE Signal Processing Mag., 16(1):33–43,76, 1999.
[87] U. Mittal and N. Phamdo. Signal/noise KLT based approach for enhancing speech degraded by colored noise. IEEE Trans. Speech Audio Processing, 8(2):159–167, 2000.
[88] H. Neemuchwala, A. Hero, and P. Carson. Image registration using entropic graph-matching criteria. In Conf. Rec. Asilomar Conf. Signals, Systems, and Computers, pages 134–138, 2002.
[89] M. Nilsson and W. B. Kleijn. On the estimation of differential entropy from data located on embedded manifolds. Submitted for publication, 2004.
[90] A. V. Oppenheim, A. S. Willsky, and S. H. Nawab. Signals and systems, second edition. Prentice Hall, New Jersey, USA, 1997.
[91] E. Parzen. On estimation of a probability density function and mode. The Ann. Math. Stat., 33(3):1065–1076, 1962.
[92] K. W. Pettis, T. A. Bailey, A. K. Jain, and R. C. Dubes. An intrinsic dimensionality estimator from near-neighbor information. IEEE Trans. Pattern Analysis and Machine Intelligence, 1(1):25–37, 1979.
[93] R. C. Prim. Shortest connection networks and some generalizations. Bell Sys. Tech. J., 36:1389–1401, 1957.
[94] J. G. Proakis and D. G. Manolakis. Digital Signal Processing - Principles, Algorithms, and Applications. Prentice Hall, New Jersey, USA, 1996.
[95] S. Qian and D. Chen. Discrete Gabor transform. IEEE Trans. Signal Processing, 41(7):2429–2438, 1993.

[96] T. F. Quatieri. Discrete-time speech signal processing - principles and practice. Prentice Hall, New Jersey, USA, 2002.
[97] A. Renyi. On the foundations of information theory. Review of the International Statistical Institute, 33(1):1–14, 1965.
[98] B. Resch, A. Ekman, M. Nilsson, and W. B. Kleijn. Estimation of the instantaneous pitch of speech. Submitted to IEEE Trans. on Speech, Audio, and Language Processing, 2006.
[99] J. J. Rissanen. Generalized Kraft inequality and arithmetic coding. IBM J. Res. Develop., 20:198–203, 1976.
[100] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embeddings. Science, 290(1):2323–2326, 2000.
[101] W. Rudin. Principles of mathematical analysis, third edition. McGraw-Hill, 1976.
[102] S. Saito and K. Nakata. Fundamentals of speech signal processing. Academic Press, Orlando, USA, 1985.
[103] V. Sanchez, P. Garcia, A. M. Peinado, J. C. Segura, and A. J. Rubio. Diagonalizing properties of the discrete cosine transform. IEEE Trans. Signal Processing, 43(11):2631–2641, 2001.
[104] M. R. Schroeder. Vocoders: analysis and synthesis of speech. Proceedings of the IEEE, 54(5):720–735, 1966.
[105] M. R. Schroeder and B. S. Atal. Rate distortion theory and predictive coding. In Proc. IEEE Int. Conf. Acoust. Speech Sign. Process., pages 201–204, 1981.
[106] H. S. Seung and D. D. Lee. The manifold ways of perception. Science, 290(1):2268–2269, 2000.
[107] C. E. Shannon. A mathematical theory of communication. Bell Syst. Tech. J., 27:379–423, 623–656, 1948.
[108] C. E. Shannon. Coding theorems for a discrete source with a fidelity criterion. IRE Nat. Conv. Rec., pages 142–163, 1959.
[109] B. W. Silverman. Density estimation for statistics and data analysis. Chapman and Hall, 1986.
[110] A. S. Spanias. Speech coding: a tutorial review. Proceedings of the IEEE, 82(10):1541–1582, 1994.
[111] Y. Stylianou. Decomposition of speech into a deterministic and stochastic part. In Proc. Int. Conf. Speech Lang. Process., pages 1213–1216, 1996.
[112] Y. Stylianou. Applying the harmonic plus noise model in concatenative speech synthesis. IEEE Trans. Speech Audio Processing, 9:21–29, 2001.
[113] Y. Stylianou, J. Laroche, and E. Moulines. High-quality speech modification based on a harmonic + noise model. In Proc. Eurospeech, pages 451–454, 1995.
[114] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(1):2319–2323, 2000.

[115] R. Togneri, M. D. Alder, and Y. Attikiouzel. Dimension and structure of the speech space. In IEE Proc.-I Communications, Speech and Vision, volume 139, pages 123–127, 1992.
[116] K. Torkkola. Visualizing class structure in data using mutual information. In Proc. IEEE-SP Workshop on Neural Networks for Sign. Proc., pages 376–385, 2000.
[117] A. B. Tsybakov and E. C. van der Meulen. Root-n consistent estimators of entropy for densities with unbounded support. Technical Report Discussion paper 9206, Institut de Statistique, Université Catholique de Louvain, 1992.
[118] A. B. Tsybakov and E. C. van der Meulen. Root-n consistent estimators of entropy for densities with unbounded support. Scandinavian Journal of Statistics, 23(1):75–83, 1996.
[119] O. Vasicek. A test for normality based on sample entropy. J. Royal Statistical Soc., Ser. B, 38(1):54–59, 1976.
[120] H. Yang and J. Moody. Feature selection based on joint mutual information. In Advances in Intelligent Data Analysis (AIDA), Computational Intelligence Methods and Applications (CIMA), International Computer Science Conventions, 1999.
[121] J. E. Yukich. Probability Theory of Classical Euclidean Optimization Problems (Lecture Notes in Mathematics; No. 1675). Springer, 1998.
[122] P. L. Zador. Asymptotic quantization error of continuous signals and the quantization dimension. IEEE Trans. Information Theory, 28(2):139–149, 1982.