ROBUST STRUCTURED VOICE EXTRACTION FOR FLEXIBLE EXPRESSIVE RESYNTHESIS

A DISSERTATION SUBMITTED TO THE DEPARTMENT OF ELECTRICAL ENGINEERING AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

Pamornpol Jinachitra
June 2007

© Copyright by Pamornpol Jinachitra 2007

All Rights Reserved

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Julius O. Smith, III (Principal Adviser)

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Robert M. Gray

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Jonathan S. Abel

Approved for the University Committee on Graduate Studies.

Abstract

Parametric representation of audio allows for a reduction in the amount of data needed to represent the sound. If chosen carefully, these parameters can capture the expressiveness of the sound while reflecting the production mechanism of the sound source, and thus allow for intuitive control when modifying the original sound in a desirable way. To achieve the desired parametric encoding, algorithms are needed which can robustly identify the model parameters even from noisy recordings. As a result, not only do we get an expressive and flexible coding system, but we also obtain a model-based speech enhancement that reconstructs the speech embedded in noise cleanly, free of the musical noise usually associated with filter-based approaches. In this thesis, a combination of analysis algorithms to achieve automatic encoding of a human voice recorded in noise is described. The source-filter model is employed for parameterization of a speech sound, especially voiced speech, and an iterative joint estimation of the glottal source and vocal tract parameters based on Kalman filtering and the expectation-maximization algorithm is presented. In order to find the right production model for each speech segment, speech segmentation is required, which is especially challenging in noise. A switching state-space model is adopted to represent the underlying speech production mechanism, involving smoothly varying hidden variables and their relationship to the observed speech. A technique called the unscented transform is incorporated into the algorithm to improve segmentation performance in noise. In addition, during voiced periods, the choice of the glottal source model requires the detection of the glottal closure instants. A dynamic programming-based algorithm with a flexible parametric model of the source is also proposed. Each algorithm is evaluated in comparison to recently published

methods from the literature. The system combination demonstrates the possibility of a parametric extraction of speech from a clean recording or a moderately noisy recording, further providing the option of modifying the reconstruction to implement various desirable effects.

Acknowledgements

I would like to thank Professor Julius Smith, my principal advisor, who gave me all kinds of support throughout the course of developing the work in this dissertation. His encouragement and inspiration made a difference in propelling me to try to do better work and keep improving myself. His enthusiasm for sound source modeling has inspired me to pursue research in this area, to think differently and to take chances. His teachings in classes, seminars and the meetings we had now and then have greatly influenced the way I approach problems. His advice over these last five years will surely guide me through my future career and life just as it has guided me to the finish of this thesis. I would like to thank my associate advisor, Professor Robert M. Gray, who agreed so willingly to be on my reading committee and provided me with many useful comments that expanded my horizon greatly. Without a doubt, his expertise has contributed to the improvement of this dissertation. I would like to thank Dr. Jonathan Abel, who is also on my reading committee. He has contributed to the CCRMA community, and to my work here in particular, beyond his duty. His enthusiasm, encouragement and technical insights have proved to be a great help to this work. The CCRMA community has provided me with a great learning atmosphere in a multi-disciplinary area, which I always appreciate. I am particularly grateful to my colleagues and friends in the DSP group: Kyogu Lee, Ryan Cassidy, Patty Huang, Hiroko Terasawa, Ed Berdhal, David Yeh, Nelson Lee, Gautham Mysore, Matt Wright, Greg Sell. The group's alumni, and mostly my predecessors: Yi-Wen Liu, Aaron Master, Arvindh Krishnaswamy, Rodrigo Segnini and Harvey Thornberg, also helped

me along the way by both showing and sharing. We supported each other technically as well as morally, by letting each other know that we are not alone in this long and arduous process. My thanks extend to the regular participants of the CCRMA DSP seminar, who contributed bits and pieces along the way. My gratitude also goes to my Master's thesis advisor, Professor Jonathon Chambers, who got me interested in statistical signal processing and perhaps set the path for the rest of my academic career. I would like to also thank the people at Toyota InfoTechnology Center, U.S., the sole provider of my financial support for most of the last three years. Special thanks go to my manager, Ben Reaves, who looked after my welfare and professional development, not only as a manager but also as a friend. I would like to thank Ramon Prieto, my colleague and friend there, who brought me in and shared a lot of his experience. I would like to thank the executives at the ITC for their continuous support and the friends there who made the experience all the more enjoyable. Last but not least, I would like to express my eternal gratitude toward my beloved parents, who kept investing in my education and gave me the freedom, along with the moral and financial support, to pursue whatever I wanted. I would like to extend my gratitude to my sister, who has been a great big sister throughout and was a great help to the family in affording me the freedom and this wonderful experience abroad. Under the category of my family, I would like to thank Papinya for her support and companionship that certainly made the hard part of this experience more bearable and the better part of it even more joyful.

Preface

The research in parametric representation of audio, especially in physical modeling, has been a dominant theme at CCRMA, Stanford University. The idea of structured audio coding, where a parametric description of the sound source is sent for resynthesis at the receiving end, originated at CCRMA [1] [2]. Since then, some structured audio coding standards and tools have emerged in the musical domain. Arguably, though, structured speech coding has been the holy grail since Homer Dudley's day at the Bell Laboratory, in the form of an articulatory model. Yet the most popular synthesis is still achieved by concatenation of samples, and high-quality speech coding is achieved only with the help of codebooks.

This dissertation demonstrates the possibility of a well-structured audio parametric representation which can be obtained from real-world recordings where noise can be present. It is based heavily on prior research and inspired by the ideas developed at CCRMA. The aspect of modification flexibility is largely inspired by the current difficulty of generating emotional speech from neutral speech and the evident desire for an expressive computer-generated singing voice which is also easy to control. The inclusion of noise into consideration is motivated by the lack of research in this area and the experience of seeing many technologically superior techniques fail when it comes to real-world deployment. Speech enhancement by reconstruction also seems to be a deserted avenue of research that may not deserve to be abandoned just yet.

From the applications point of view, the dissertation is an attempt to bring together coding and synthesis while achieving speech enhancement as a by-product. It tries to achieve all desirable properties of a sound source analysis from natural

recordings: high compression or data reduction, faithful and noise-free reconstruction, and physically meaningful controls that are flexible to manipulate or modify, all achieved with robustness against noise. Given the current inadequate knowledge of the true physical model for voice production, most algorithms presented in this thesis only try to stay as close as possible to some physical interpretation, while retaining other properties. With more powerful techniques becoming available in recent years, an investigation into these "upgrades" to see if they can give such coding effects is an interesting endeavor for both academic and industrial purposes. Hopefully, the contributions here will encourage more attempts in this direction toward even better well-structured parametric models that may allow us to attain the aforementioned properties.

Contents

Abstract

Acknowledgements

Preface

1 Introduction
  1.1 Review of Parametric Speech Representation
    1.1.1 Template Model
    1.1.2 Sinusoid Model
    1.1.3 The Vocoder
    1.1.4 Source-filter Model
    1.1.5 Formant Synthesizer
    1.1.6 Digital Waveguide Model
    1.1.7 Articulatory Model
    1.1.8 Fluid Mechanics Computational Models
  1.2 Singing Versus Speaking
  1.3 Parameter Identification and Speech Enhancement
  1.4 Summary

2 System Overview
  2.1 Segmentation Front-end
  2.2 Voice Model
  2.3 Synthesis System
  2.4 Discussion
  2.5 Summary

3 Noisy Speech Segmentation
  3.1 Introduction
  3.2 Switching State-space Model of Speech
    3.2.1 Learning
    3.2.2 Inference
  3.3 Nonlinear Noisy Observation Model
    3.3.1 GPB1 and GPB2 Inference Using Unscented Kalman Filtering
  3.4 Experiments and Results
    3.4.1 Evaluation Method
    3.4.2 Clean Speech Segmentation
    3.4.3 Comparison with HMM
    3.4.4 Noisy Speech Segmentation
    3.4.5 Other Classes Definition and Hierarchical Decision
  3.5 Conclusion

4 Robust Glottal Closure Instant Detection
  4.1 Dynamic Programming for GCI/GOI Detection
    4.1.1 Waveform Error Cost
    4.1.2 Cross-correlation Cost
    4.1.3 Filling The Dynamic Programming Grids
    4.1.4 Candidate Selection
  4.2 Experiments
    4.2.1 Evaluation Test Set
    4.2.2 Results and Discussion
    4.2.3 Parametric Voice Synthesis and Voice Modification
  4.3 Conclusion

5 Probabilistic Framework
  5.1 Generative Model of Voice in Noise
  5.2 Inference and Learning
    5.2.1 E step
    5.2.2 MAP Estimate of The Boundary Variable
    5.2.3 Kalman Smoothing
    5.2.4 M step
    5.2.5 Joint Source-Filter Parameter Maximum Likelihood Estimation
    5.2.6 Penalized Maximum Likelihood
    5.2.7 Post-Kalman Smoothing Integration
    5.2.8 VQ-Codebook Constraint
    5.2.9 Algorithm Initialization
    5.2.10 Adaptive Post-filtering
    5.2.11 EM Parameter Extraction for Fricatives
  5.3 Experiments
    5.3.1 Test Samples and Evaluation
    5.3.2 Results and Discussion
    5.3.3 Listening Test
  5.4 Conclusion

6 Applications of Structured Voice Coding
  6.1 Voice Coding and Flexible Synthesis
    6.1.1 Volume Changing
    6.1.2 Time Scaling
    6.1.3 Pitch Shifting
    6.1.4 Breathiness Modification
    6.1.5 Glottal Fry Synthesis
  6.2 Resampling and Bandwidth Extension
  6.3 Voice Conversion
  6.4 Comfort Noise Generation

7 Conclusions and Future Work
  7.1 Noisy Speech Segmentation
  7.2 GCI/GOI Detection
  7.3 Voice in Noise Parameter Extraction
  7.4 Flexible Expressive Resynthesis and Structured Coding

A State-space Model Inference and Estimation
  A.1 Kalman Filtering
  A.2 Kalman Smoothing
  A.3 Switching State-Space Inference
    A.3.1 GPB1
    A.3.2 GPB2
    A.3.3 Filtering
    A.3.4 Smoothing
    A.3.5 Unscented Kalman Filtering

B Maximum Likelihood Estimator Derivation

Bibliography

List of Tables

3.1 Clean speech phone classification accuracy of various algorithms and features
3.2 The confusion matrix using GPB2(6) with MFCC as features on clean speech
3.3 Comparison of accuracy between HMM and SSM
3.4 Noisy speech phone classification accuracy
3.5 The confusion matrix using GPB-UKF(4) and LMFB as features in a car noise environment with SNR = 10 dB
3.6 The confusion matrix using uncompensated GPB2 and GPB2-UKF(4) on speech embedded in different kinds of noise with MFCC as features
3.7 The confusion matrix using uncompensated GPB2 and GPB2-UKF(4) on speech embedded in different kinds of noise with LMFB as features
3.8 The confusion matrix using GPB-UKF(4) and LMFB as features in a car noise environment of SNR = 10 dB for the seven phone classes

4.1 Initial candidates' characteristics
4.2 GCI identification results comparison
4.3 GOI identification results

5.1 Mean opinion score of a male voice (/aa/) extraction and noise suppression under pink noise contamination of SNR = 20 dB

List of Figures

1.1 Kelly-Lochbaum vocal tract digital waveguide model

2.1 Overall system diagram
2.2 Rosenberg's derivative glottal source waveform
2.3 Lattice filter for synthesis

3.1 Switching state-space graphical model
3.2 GPB1 and GPB2 diagram
3.3 LMFB-1 (x-axis) vs LMFB-2 (y-axis) scatter plot in clean speech (dot) and with white noise at SNR = 10 dB (cross)
3.4 MFCC-1 (x-axis) vs MFCC-2 (y-axis) scatter plot in clean speech (dot) and with white noise at SNR = 10 dB (cross)
3.5 Switching state-space graphical model for noisy observation
3.6 Continuous state trajectories of an utterance "Which church do the Smiths worship in?"
3.7 Spectrogram result illustration with 10 dB car noise
3.8 Spectrogram result illustration with 10 dB white noise
3.9 The accuracy (%) for various noise types and SNRs using LMFB and MFCC as features

4.1 Two periods of Rosenberg's derivative glottal waveform model
4.2 Example of the three initial candidate selection schemes
4.3 DEGG waveform, inverse-filtered derivative glottal waveform, GCIs and GOIs derived from DEGG
4.4 FAR-MDR operating curve for females and males using different values of the cost term relative weight
4.5 Accuracy and gross error performance for females and males using different values of the cost term relative weight
4.6 FAR-MDR operating curve of the EWGD method for males and females using different averaging window sizes
4.7 Normal and pressed voice with GCIs and GOIs identified
4.8 A female voice with GCIs and GOIs identified
4.9 Error histogram of GCI detection for the male and female test set in milliseconds
4.10 Error histogram of GOI detection for the male and female test set in milliseconds
4.11 Voice analysis system diagram
4.12 Spectrogram of the original male singing voice and the parametric construction
4.13 Raw and smoothed parameter estimates for a male singing voice in normal mode
4.14 Spectrogram of the original male utterance and the parametric construction
4.15 Smoothed parameter estimates for male voiced utterance "Where are you?"

5.1 Noisy voice system diagram
5.2 Generative model of voice in noise
5.3 Standard deviation of various LPC-related coefficients extracted from a singing voice using closed-phase covariance LPC
5.4 Prediction error surface for different OQ values, all for F0 = 100 Hz, except for the bottom right, where F0 = 200 Hz
5.5 Averaged I-S and area function distance measures of synthetic signals of vowels /aa/, /iy/, /uw/ and /eh/ at different SNR levels of pink noise
5.6 Averaged I-S and area function distance measures of synthetic signals of vowels /aa/, /iy/, /uw/ and /eh/ at different SNR levels of white noise
5.7 Averaged I-S and area function distance measures of a male singing voice at different SNR levels of pink and white noise
5.8 Spectral envelopes of reference pre-emphasized clean speech closed-phase covariance LPC, noisy autocorrelation LPC and the result of joint source-filter estimation from the basic EM, EM-PKS and EM-VQ algorithms
5.9 The normalized area function from closed-phase covariance LPC estimates, the initial estimates from autocorrelation LPC and after EM iterations
5.10 The derivative glottal waveform, g(n), the initial model (dash) and the model after EM iterations (solid)
5.11 Process error log-variances at each iteration
5.12 Log-likelihood convergence

5.13 Raw (dash) and smoothed (solid) estimates of α5, ag and bg at the last iteration of EM-PKS
5.14 Estimates of fundamental period (T0), open quotient (OQ) and amplitude (AV) using EM-PKS
5.15 The time-samples and spectra of the original and the resynthesized signal
5.16 Mean spectral envelopes and ±1 standard deviation over time of filter coefficient estimates using basic EM-Kalman
5.17 Mean spectral envelopes and ±1 standard deviation over time of filter coefficient estimates using post Kalman smoothing
5.18 Mean spectral envelopes and ±1 standard deviation over time of filter coefficient estimates using basic EM-Kalman with no input model
5.19 Spectrogram of the original singing voice, the noisy version with white noise at SNR = 20 dB, and the parametric reconstruction from EM-PKS and EM-VQ
5.20 Spectrogram of the original male utterance "Where are you?", the noisy version with white noise at SNR = 20 dB, and the parametric reconstruction from EM-PKS and EM-VQ
5.21 Codebook index lookup using original clean utterance as codebook
5.22 Codebook index lookup using speaker-independent codebook

6.1 Spectrograms of the original utterance ("She saw a fire"), the pitch-shifting by a factor of 1.5 and the time-scaling by a factor of 2
6.2 Original synthesized glottal waveform (top) and its breathy version (bottom)

Chapter 1

Introduction

A parametric representation of a human voice not only provides data compression, but also allows modification of the sound, which can be useful in many applications. In speech communication, a slowed-down reproduction of speech can be beneficial to people with hearing difficulty, or to normal-hearing listeners in difficult environments. Correcting the stress and perceptually emphasizing some articulations can also help intelligibility. More importantly, parameter modification can change the expression of the voice, making it more emotionally relevant. This can be valuable for a cheap implementation of an emotional speech synthesis system where only neutral speech recordings are needed and the range of expression is unlimited. In the artistic realm, modification flexibility is even more important. A parametrically-coded singing voice can change pitch, get time-scaled or, with an appropriate model, change from normal voice mode to breathy or pressed, without the need for re-recording. An amateur singing voice can be improved by correcting pitches, adjusting or adding vibrato and boosting the "singer's formant" [3]. Parametric modeling of natural sound can also allow us to reconstruct a sound without loss in resolution. To draw a comparison to computer graphics, consider a rendition of a circle. If the circle is enlarged by expanding the original set of pixels representing it, without an interpolation mechanism, the resolution will be lost. On the other hand, if we represent the circle by its center position and its radius, enlarging it is as simple as increasing the value of the radius, now that we have a method to re-render it given its parameters.

In audio, direct "stretching" in amplitude, time or frequency on the signal samples results in a similar loss, whereas re-rendering a sound that is louder, slower or higher-pitched through parameter modification does not lead to such loss. In addition, parametric representation of a sound allows us to achieve a sound that may not be possible to produce naturally, yet resembles what people are familiar with. For example, a human sound that is extremely high-pitched, speaking an impossible-to-articulate utterance, can be produced relatively easily. This type of advantage is also evident in computer animation, where many real-world object instances can do what they cannot do in reality. The history of research in parametric representation of a human voice would not be complete if we did not consider the undertakings in both speech coding and speech synthesis, since they go hand-in-hand in trying to represent the production of speech sounds in one way or another. Speech coding has been a subject of intensive research for decades, mainly because of its place in telecommunication. Speech is one of the most common forms of communication among people, and the ability to compress a speech sound, transmit the data and reconstruct the speech at the far end of the channel allows humans to overcome the physical boundary to communication. The more we can compress the data, the faster and the more data we can manage to send under limited resource availability such as bandwidth and transmission power. In contrast to general speech synthesis, the top priority of speech coding is its faithfulness and compression rate, which are usually in conflict. The ability to modify the encoded speech is not important. While robustness to noise is also important, especially for telecommunication applications, most standard coding schemes only rely on a front-end noise reduction or analysis procedures that deemphasize the effect of noise. Most current coders are not purely parametric and, even if they are, little intuitive control can be applied [4] [5]. Speech synthesis, on the other hand, is concerned more with the naturalness of the produced sound and the ability for the user to control the production for a desired expression [6]. Although intelligibility is also of the utmost importance, since the user has full control of the parameters, which are usually pre-determined, this is less of a problem than having to estimate them for faithful reproduction of the

original speech. Although a purely parametric speech synthesizer is not currently as widely used as sample-based synthesis [7], due to its unnatural sound, constant progress has been made in this direction. Its advantages of requiring a small footprint and the absence of synthesis discontinuities make it desirable. Even the samples used in concatenative synthesis are parametrically encoded in some way, which shows the trend to at least hybridize the two approaches [8] [9].

Despite the differences between speech synthesis and coding, they both converge on the notion of how to represent natural human sounds, especially the parametric model which mimics the voice production mechanism. Recently, the idea of object-based coding of audio, commonly called structured audio coding, has led to a part of the MPEG-4 audio coding standard [10]. In short, structured audio coding involves the idea of transmitting a sound by describing it rather than compressing it. The sound synthesis model is now part of the content, instead of just residing in the codec. It is transmitted along with the parametric encoding of sound source components at each instant. The parametric model describes the sound semantically. The lower the dimension of the parameter space and the more intuitive the parameters are, the more structured the sound is. While the idea of structured audio originated mainly for synthetic sounds in the context of computer music sound synthesis [1], especially in physical model simulation, it now also encapsulates natural sound. At least in the standard, it can refer loosely to various forms of sound representations which are structured descriptively for independent manipulation in some way [2]. For example, Casey proposed the structured audio group transform concept, which deals with modeling natural sounds like glass shattering or tire screeching by structured decomposition of spectral and temporal bases [11]. The standard itself covers everything from frequency-domain representations like the sinusoid model, to physical models such as a vibrating string, and even non-purely parametric speech coding like CELP [12]. Many processing modules can also be added or changed modularly to create the sounds the content creator wants. Examples of these are a reverberation module to simulate a room characteristic, a different guitar body response for a specific vintage guitar sound, and general filtering effects for spectral transformation. For an ensemble, these sound objects can also be remixed differently, with each one being encoded by

its most efficient model. Synthetic sounds and natural sounds can also be mixed under the same framework. Algorithmic structured coding has also been shown, using information theory, to potentially provide higher compression ratios than other techniques [13]. Ultimately, all that is needed for transmission is a text file with event definitions and the associated parameters for the sound to be re-rendered at the receiver's end. With the standardization trend towards tag-based programming such as XML for multimedia content, a sound can be compactly transmitted, modified and even indexed for search and retrieval using these descriptive tags. The use of structured audio coding has been practical for monophonic synthetic sound and composition [1][14]. However, its direct use on natural sounds still has not realized its potential, due to difficult parameter extraction and the lack of a complete and powerful model for some sound sources such as human speech. Analyzing a sound ensemble also requires source separation or auditory scene analysis [15, 16, 17, 18], which are extremely difficult in general. Nevertheless, a constant stream of research into sound production mechanisms, their effects on perception and the accompanying analysis techniques should contribute to making more efficient structured coding possible. It is foreseeable that a hybrid, or multiple-encoder, system will be used sooner or later, subject to the application and the type of signal to be encoded [2]. For example, a generic coder like MPEG-AAC [19] can be applied broadly to a complex mixture of sounds, while a physical model can be used with a simple monophonic sound. For the human voice, a good sound representation and robust analysis techniques are of primary importance in order to achieve this type of structured coding. This chapter starts by reviewing the history of parametric speech coding and synthesis, both in speech communication and in singing. Although the focus is on parametric models, some non-parametric approaches will be mentioned for perspective, while some systems are not entirely parametric. An attempt is made to present the categories in increasing order towards a more physical model. The review attempts to elucidate the problem space and the compromises among analysis complexity, intuitive parametric control and the quality of the encoded sound. The context of the structured audio concept within the speech enhancement paradigm is then discussed, leading to an interesting

approach to noise elimination through parametric reconstruction. Together, they should give an overview of the vision to which this thesis is trying to contribute.

1.1 Review of Parametric Speech Representation for Coding and Synthesis

1.1.1 Template Model

In the template model, speech is represented by a set of quantized units in a codebook. The model may be considered non-parametric due to the absence of obvious parameters. On the other hand, we may consider the speech as having a model represented by a (linear) combination of selected discrete units using appropriate weighting "parameters" or coefficients. The differences among the template models seen so far in the literature lie in what part (only the excitation source or the entire frame of the speech signal) or what features (temporal or frequency) are being represented. Another important difference is in the method used by the coder to derive a codebook. Optimal solutions are often determined in a least-squares error or maximum likelihood manner using the observation as a reference. The rate-distortion tradeoff is an important aspect of model selection and optimization. Sometimes, other constraints such as perceptual criteria or continuity constraints can be incorporated into either the optimization process or the codebook itself. Examples of a pure template model include the ergodic hidden Markov model (HMM) of spectral features and their derivatives, where a discrete state represents a unit template of those features which can be inverted back to reconstruct the original speech [20]. In this example, the use of spectral feature derivatives implicitly forces continuity of the spectrum, while the maximum likelihood method is used to promote global reconstruction faithfulness. A finite set of vector components which can be used to describe the original speech signal, possibly in a lossy manner, can also be derived in an information-theoretic fashion. For example, principal component analysis (PCA) is one such technique, which concentrates the coding effort on the coordinates of the space with the most variance. Independent component analysis

(ICA) decomposition [21, 22, 23] is a similar information-theoretic criterion that can be used to decompose a signal into discrete basis components. Instead of promoting the components which have the highest variances, ICA encourages those which are the most mutually independent, which sometimes represent more perceptually meaningful features of the observed signal. For example, most temporal units of time-domain observations of general sounds derived from ICA are bandpass in nature, just like the ear's auditory filterbank [21]. For spectral observations, the spectral units arrived at are harmonically related, among other auditory cues such as common onset [17] [24]. They can even represent the sound in a part-based manner, for example, as an onset transient, sustained harmonics and residual portions [25]. Sparse coding gives similar results, albeit using a different but closely related objective function, sparsity: the values of either the basis templates or the expansion coefficients are mostly zero [26]. This gives a concise representation of a signal with a sparse nature like speech. From the speech synthesis angle, concatenative synthesis is similar in nature to this type of speech signal representation, where many units of speech are stored and selected as required, subject to some cost functions and rules. The technique has also been used to generate a singing voice in [27]. Obviously, other types of model templates and other derivation methods exist beyond those mentioned here. Popular dictionary-driven methods such as matching pursuit find the "best" atoms in the given dictionary and the corresponding coefficients according to some energy criterion [28]. The sound is then represented by coefficients of symbols representing selected units in the dictionary. Its application to denoising has also been demonstrated in [28]. Its variations include basis pursuit, where sparsity of the coefficients is emphasized [29], and the best orthogonal basis method, where specialized sub-collections of the dictionary are selected [30]. While the template model is a useful and simple way to represent speech for pattern classification, as proposed for early speech recognition, it may not be satisfactory for many speech coding and synthesis applications that require modification flexibility and easy manipulation. However, the use of codebooks is still prominent in many modern speech coders. This is evident in the standard code-excited linear prediction (CELP) speech codec, where the glottal excitation source is reconstructed from codebooks.

The same is true for unit-selection concatenative speech synthesis, which can give a very natural sound since the codebooks consist of samples from actual speech. Without a doubt, there will also be constant and parallel attempts to perform codebook derivations that are perceptually or physically intuitive. For example, a template model of the vocal tract area for speech synthesis has been proposed recently [31], based upon the idea that there are eigen-shapes of the vocal tract during voice production. Voices are then created by the excitation of the weighted sum of selected substrates and constriction shapes, in combination with other fine-detail parametric controls.
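As a concrete illustration of the dictionary-driven decomposition mentioned above, the following minimal sketch implements plain matching pursuit in Python/NumPy: at each step, the atom most correlated with the current residual is selected and its contribution removed. The dictionary, signal and stopping rule here are arbitrary placeholders for illustration, not those of any cited system, and unit-norm atoms are assumed.

    import numpy as np

    def matching_pursuit(x, D, n_atoms=10):
        """Greedy matching pursuit: approximate x by a sparse combination of
        the unit-norm columns of D (shape: len(x) x dictionary size)."""
        residual = np.asarray(x, dtype=float).copy()
        coeffs = np.zeros(D.shape[1])
        for _ in range(n_atoms):
            corr = D.T @ residual              # correlation of every atom with the residual
            k = int(np.argmax(np.abs(corr)))   # atom capturing the most residual energy
            coeffs[k] += corr[k]               # accumulate its expansion coefficient
            residual -= corr[k] * D[:, k]      # subtract that atom's contribution
        return coeffs, residual

    # Toy usage: a random unit-norm dictionary and a signal built from one atom plus noise.
    rng = np.random.default_rng(0)
    D = rng.standard_normal((256, 512))
    D /= np.linalg.norm(D, axis=0)
    x = 2.0 * D[:, 3] + 0.1 * rng.standard_normal(256)
    coeffs, res = matching_pursuit(x, D, n_atoms=5)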

1.1.2 Sinusoid Model

This section refers to the use of sinusoids to model sounds directly from the spectral observations. Sinusoidal decomposition can, however, also be used in other parametric modeling of speech, especially in the source-filter model, due to its power in representing a general quasi-periodic signal. Quatieri and McAulay first showed the use of sinusoid modeling and estimation for coding of speech [32], while its use in the musical domain was first shown by Serra and Smith [33]. Since then, its use has proliferated in coding of a sound's tonal components, and many refinements have followed. In its basic form, sinusoid modeling involves the representation of the harmonic components as a sum of sinusoids whose phases and amplitudes are time-varying. The estimation of parameters can be as simple as peak picking (with interpolation [34]) or can be iterative, e.g. [35]. Harmonic structure has been exploited both in representation (as pitch, e.g. [36]) and in constrained estimation (e.g. [35] [37]). The movement of the sinusoids is another area of intense study since it is crucial to artifact-free reconstruction. Spline interpolation, among other techniques, has been used [32]. What is left after the modeling of the tonal components is usually called the residual and is encoded by numerous methods such as Bark-band filtered noise [38]. The sines+noise+transients model is another popular extension, where the transients are modeled differently from the tonal parts [38]. A sinusoid model has been used for the singing voice [39] [40] and has been commercially deployed in a product by Yamaha Corporation, due to its flexibility in pitch changing, duration and vibrato effects.

Sinusoid models exist in the MPEG audio compression standards [19]. In the MPEG-4 standard, sinusoid-based codecs which allow for high-quality modification have been adopted. An HILN (harmonic and individual lines plus noise) codec is adopted for general audio, which allows pitch and time-scale modification [41] [42]. For speech, a codec called HVXC (harmonic vector excitation coding) is included, using the sinusoid module to describe the speech excitation [43] [44]. HVXC also allows time-scaling. FM (frequency modulation) synthesis is a close cousin of sinusoid modeling, where each frequency component is modulated to generate a rich sound spectrum due to the nonlinearity in the modulation. It has been used to generate a singing voice [45], but the extraction of parameters from real voices for coding purposes remains difficult. Arguably, sinusoid modeling is currently the most popular approach when it comes to high-quality sound modification, with recent refinements including [46]. However, what it represents is the signal at the receiving end, i.e., the ears, so users still have to deduce the change they need to make in order to achieve the desired perceptual effect. There is no doubt that research in this area will try to fill the gap by studying the perceptual correlates and deriving rules for various effects, for example, making a voice breathy as demonstrated in [47]. Intuition is, however, still left to be desired.
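To make the basic analysis/synthesis loop concrete, the sketch below (Python/NumPy; the window, frame length and number of peaks are arbitrary illustration choices) performs per-frame peak picking on an FFT magnitude spectrum and oscillator-bank resynthesis. Practical systems add parabolic peak interpolation, partial tracking across frames and phase matching at frame boundaries, all of which are omitted here.

    import numpy as np

    def frame_peaks(frame, sr, n_peaks=20):
        """Pick the n_peaks largest spectral local maxima of one frame and return
        their (frequency, amplitude, phase) triplets."""
        win = np.hanning(len(frame))
        spec = np.fft.rfft(frame * win)
        mag = np.abs(spec)
        cand = [k for k in range(1, len(mag) - 1)
                if mag[k] > mag[k - 1] and mag[k] > mag[k + 1]]
        cand = sorted(cand, key=lambda k: mag[k], reverse=True)[:n_peaks]
        freqs = np.array(cand) * sr / len(frame)
        amps = 2.0 * mag[cand] / np.sum(win)     # rough amplitude correction for the window
        phases = np.angle(spec[cand])
        return freqs, amps, phases

    def synth_frame(freqs, amps, phases, n, sr):
        """Oscillator-bank resynthesis of one frame as a sum of cosines."""
        t = np.arange(n) / sr
        out = np.zeros(n)
        for f, a, p in zip(freqs, amps, phases):
            out += a * np.cos(2 * np.pi * f * t + p)
        return out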

1.1.3 The Vocoder

Preceding sinusoid modeling is the vocoder (voice coder). The first vocoder was in analog form, produced by Homer Dudley at the Bell Laboratory. The coder itself was called the "channel vocoder", while the manually-controlled version was called a Voder. An amplitude envelope of each filterbank channel is calculated using analog bandpass filters and rectifiers. The voiced/unvoiced decision is made along with an estimate of the corresponding fundamental frequency, to be regenerated by a buzz or an oscillator through a filter bank. Although Homer Dudley's original idea came from the underlying notion that a speech sound can be represented by a small set of articulator movements, he had to settle for the spectral modeling-based channel vocoder [48]. The channel vocoder is nevertheless very much the first proof of concept of structural compression of speech for coding applications.

An extension of the channel vocoder, called the phase vocoder, was developed for speech coding by Flanagan and Golden [49]. Instead of using only the amplitude for each filter channel, the starting phase is also included. The phase vocoder was implemented in software based on a short-time Fourier transform (STFT), unlike the analog hardware implementations of the channel vocoder. A synthesis method is used to reconstruct the signal from its amplitude and the phase derivative, or instantaneous frequency. The phase vocoder can be considered a direct ancestor of sinusoid modeling. Its ability to support time-scaling and pitch-shifting applications has also been demonstrated [49] [50].
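The core of the phase vocoder is easy to state in code. The sketch below (Python/NumPy; the window, hop and stretch factor are arbitrary illustration values, and window-gain normalization is omitted) keeps each bin's magnitude, estimates its instantaneous frequency from the frame-to-frame phase difference, and accumulates phase at a different synthesis hop to time-stretch the signal.

    import numpy as np

    def phase_vocoder_stretch(x, rate=1.5, n_fft=1024, hop=256):
        """Time-stretch x by `rate` (> 1 means longer/slower) with a basic phase vocoder."""
        win = np.hanning(n_fft)
        frames = np.array([np.fft.rfft(win * x[i:i + n_fft])
                           for i in range(0, len(x) - n_fft, hop)])
        omega = 2 * np.pi * np.arange(n_fft // 2 + 1) * hop / n_fft  # expected phase advance per hop
        out_hop = int(round(hop * rate))
        phase = np.angle(frames[0])
        y = np.zeros(out_hop * len(frames) + n_fft)
        pos = 0
        for m in range(1, len(frames)):
            dphi = np.angle(frames[m]) - np.angle(frames[m - 1]) - omega
            dphi -= 2 * np.pi * np.round(dphi / (2 * np.pi))          # wrap to principal value
            inst = omega + dphi                                       # instantaneous frequency times hop
            phase += inst * out_hop / hop                             # advance by the synthesis hop
            seg = np.fft.irfft(np.abs(frames[m]) * np.exp(1j * phase))
            y[pos:pos + n_fft] += win * seg                           # overlap-add
            pos += out_hop
        return y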

1.1.4 Source-filter Model

The source-filter model [51] attempts to linearly separate the effect of the tract pathway from the glottal source excitation and to model them as two lumped systems. This results in a great simplification in identification, and in independent adjustment to achieve cross-synthesis of different voices. For vowels, the tract pathway is generally modeled as an all-pole digital filter due to its ability to match perceptually important formant resonances. The lip radiation is often modeled by a filter 1 − µz^{-1}, where µ ≈ 1 but is less than one, which then gets folded into the glottal waveform to give an approximate derivative glottal waveform as the sole input. The sound sample, represented by s(n), is then given by

s(n) = \sum_{i=1}^{P} a(i)\, s(n-i) + u(n)    (1.1)

where a(i) represents the ith coefficient of the all-pole filter, P is the order of the filter and u(n) is the excitation input. The model assumes no interaction between the glottal and tract compartments due to the insignificant glottal opening relative to the vocal tract area. This is not true during the glottal opening phase of the vibration [52]. Many have tried to incorporate the coupling effect into either the source or the tract, for example, a ripple model in the source and a reduced first formant bandwidth in the tract filter [53]. Such coupling still confounds source-filter modeling attempts, leading to efforts in more distributed modeling, such as the digital waveguide or fluid mechanics simulation, where coupling is implicitly simulated.
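Equation (1.1) is simply an all-pole difference equation. The following sketch (Python/NumPy; the coefficients and excitation are placeholders) generates s(n) from a given excitation u(n), and is equivalent to scipy.signal.lfilter([1.0], [1, -a(1), ..., -a(P)], u).

    import numpy as np

    def allpole_synthesis(a, u):
        """Direct implementation of Eq. (1.1): s(n) = sum_i a[i] * s(n - i) + u(n),
        where a holds a(1)..a(P) and u is the excitation (e.g. a derivative glottal wave)."""
        P = len(a)
        s = np.zeros(len(u))
        for n in range(len(u)):
            acc = u[n]
            for i in range(1, min(P, n) + 1):
                acc += a[i - 1] * s[n - i]
            s[n] = acc
        return s

    # Equivalent one-liner with SciPy:
    #   from scipy.signal import lfilter
    #   s = lfilter([1.0], np.concatenate(([1.0], -np.asarray(a))), u)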

The actual synthesis can take the form of an overlap-add STFT or time-domain filtering, whose parameters can be derived efficiently from the all-pole filter coefficients obtained from the identification process. A particular form of time-domain filtering, called the lattice filter, also has some physical meaning with respect to the traveling air wave in a series of piecewise-cylindrical tubes, under some simplifying assumptions [51]. In nasalized sounds, however, there is an opening which leads to zeros in the transfer function of the overall pathway, which now has two parallel transmission lines. Using the source-filter model, the filter must then have zeros in its transfer function. For fricatives, zeros are also evident in the transfer function; however, there is usually no need to model them. The source input was first modeled as a train of (filtered) impulses for voiced and random noise for unvoiced sounds [51]. This is what is used in the LPC-10e speech coding standard, which can give intelligible but low-quality speech due to the binary voicing decision for the excitation. An improvement is obtained in the multi-band excitation, where each frequency band is modeled as a mixture of sinusoid components and some random noise (spectral modeling in this part) [54]. A harmonic model has also been used to encode the source in part of the MPEG-4 standard, in the HVXC coder [43] [44]. Researchers have identified the importance of the asymmetry of the glottal shape to perception and concluded that the closing of the vocal folds must be more abrupt than the opening for a realistic synthesis [55]. A popular parametric form of a glottal waveform that reflects this finding is given by Liljencrants and Fant (the LF model) [56]. A simplified quadratic version is given by Rosenberg [55], which is also used in Dennis Klatt's formant synthesizer [57] (see Section 1.1.5). Recently, the normalized amplitude quotient has been proposed as a source parameter that captures most of the perceptual quality while being easy to determine [58]. A polynomial model has also been proposed by Fujisaki and Ljungquist [59]. A filtered impulse train has also been experimented with to shape the glottal pulse excitation; however, incorrect phase causes perceptual distortion and an FIR filter is hard to modify for pitch shifting [60].
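As an illustration of such a constrained parametric source, here is a sketch of a Rosenberg/Klatt-style polynomial pulse. The KLGLOTT88 form u_g(t) = a t^2 - b t^3 over the open phase is assumed purely for illustration (it is not necessarily the exact variant used later in this thesis), parameterized by the fundamental period T0, open quotient OQ and amplitude AV, and returning one period of the derivative glottal waveform.

    import numpy as np

    def glottal_derivative(T0, OQ, AV, fs):
        """One period of a Rosenberg/Klatt-style derivative glottal waveform.
        T0: fundamental period (s), OQ: open quotient in (0, 1), AV: peak flow amplitude,
        fs: sampling rate (Hz)."""
        N = int(round(T0 * fs))                 # samples per period
        Ne = int(round(OQ * N))                 # open-phase length in samples
        t = np.arange(N) / fs
        Te = Ne / fs                            # open-phase duration (s)
        # Flow u_g(t) = a t^2 - b t^3 with u_g(Te) = 0  =>  b = a / Te;
        # a is chosen so that the peak flow equals AV.
        a = 27.0 * AV / (4.0 * Te ** 2)
        b = a / Te
        dug = np.where(np.arange(N) < Ne, 2 * a * t - 3 * b * t ** 2, 0.0)
        return dug                              # abrupt return to zero at glottal closure

    # A pulse train is a concatenation of periods, e.g.:
    # train = np.tile(glottal_derivative(1 / 110.0, 0.6, 1.0, 16000), 50)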

The identification of the vocal tract filter parameters has been well investigated. The leading analysis technique is without a doubt the linear predictive coding (LPC) approach [51] [61]. The LPC coefficients can also be represented in other forms such as the partial correlation or reflection coefficients, the area ratios, or the line spectral frequencies (LSF), which prove to be less sensitive to quantization error than the LPC coefficients themselves. The physical interpretation of the reflection coefficients and the area function is certainly appealing. However, these conversions rely on some assumptions, such as a zero-impedance termination or matched impedance, which are not quite realistic. Many techniques to determine these LPC coefficients exist, e.g., the Levinson-Durbin algorithm, Cholesky decomposition and Burg's method [53]. Despite its closeness to a physical interpretation when converted to reflection coefficients or area functions, LPC cannot give a unique physical solution due to the one-to-many relationship between the acoustic observation and the vocal tract shape. Its common model assumption is obviously not physically correct (the input u(n) is either random noise or an impulse train). Direct estimation of the vocal tract area is desired. However, this is a non-linear problem which often requires codebooks and the analysis-by-synthesis approach (see [62] for a classical review). For nasal sounds, identification also becomes harder due to a possible pole-zero cancellation effect. Most have approached this by modeling the zeros as an FIR filter and combining it during estimation or synthesis with the source or the vocal tract filter [60]. The technique most commonly used for excitation identification is the analysis-by-synthesis method. This involves minimizing the reconstruction error compared to the original signal. The first attempt, using an impulse train and random noise as excitation, only requires pitch estimation and the power gain, both of which have been extensively researched independently. The choice of model does have an impact on identification and its accuracy. For example, the polynomial model of Fujisaki and Ljungquist [59] might allow better characteristics, but it has a tendency to yield physically impossible solutions during estimation. On the other hand, Rosenberg's model [55] has parameters that can be constrained over a physical range, helping identification.

Jointly estimating the source and the filter is one way of reducing the interaction effects in both estimates. While most analysis-by-synthesis methods choose the source that, when combined with the tract filter, minimizes the error, this is not a true joint estimation, since the tract filter is estimated first using LPC, which assumes white input. Joint source-tract estimation can be viewed as a way to obtain better estimates by modeling more accurately. The effect of the source that might otherwise be assigned to the tract in separate estimation is mitigated. Due to the one-to-many relationship between the acoustics and the tract's area function, joint estimation, with the source constrained to a model, could also help achieve a more physically realistic area function estimate.
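To make the filter-identification side concrete, the sketch below (Python/NumPy; the windowing, order and framing choices are arbitrary) implements the autocorrelation method with the Levinson-Durbin recursion, returning predictor coefficients in the convention of Eq. (1.1) together with the reflection (PARCOR) coefficients. The area-ratio conversion shown uses one common sign convention; conventions differ across texts.

    import numpy as np

    def lpc_levinson(x, order):
        """Autocorrelation-method LPC via Levinson-Durbin. Returns a(1..P) for
        s(n) = sum_i a(i) s(n - i) + u(n), plus the reflection coefficients."""
        x = np.asarray(x, dtype=float) * np.hamming(len(x))     # analysis window
        r = np.correlate(x, x, mode='full')[len(x) - 1:len(x) + order]
        a = np.zeros(order + 1)                                 # prediction-error filter 1 + sum a_i z^-i
        a[0] = 1.0
        E = r[0]
        k = np.zeros(order)
        for m in range(1, order + 1):
            acc = r[m] + np.dot(a[1:m], r[m - 1:0:-1])
            k[m - 1] = -acc / E
            a[1:m + 1] += k[m - 1] * np.concatenate((a[m - 1:0:-1], [1.0]))
            E *= 1.0 - k[m - 1] ** 2                            # prediction error energy update
        return -a[1:], k                                        # sign-flipped to match Eq. (1.1)

    def area_ratios(k):
        """Tube area ratios A(i+1)/A(i) from reflection coefficients (one sign convention)."""
        return (1.0 + k) / (1.0 - k)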

1.1.5 Formant Synthesizer

The formant synthesizer can be considered a type of source-filter synthesis model of speech. The filter part is implemented as either a bank of parallel resonance filters or a cascade of resonance filters, each modeling a formant peak. The parallel configuration is used for fricative and plosive generation, while the cascade is used for vowel synthesis due to simpler relative amplitude control [57]. Such modeling clearly has some links to the vocoder mentioned earlier. The source is still mostly either an impulse train or a parametric glottal waveform for voiced sounds, and some noise for unvoiced ones. KLSYN is a well-known formant synthesizer for speech developed by Dennis Klatt in the 1980s [63] [57]. Although the parameters representing the sound, namely the formant frequencies, their bandwidths and the source pitch, can be extracted from real speech, the model has not been used for coding. This is perhaps due to the difficulty and inefficiency of estimation compared to the LPC model. Formant identification from voice observations is a research area in its own right. It generates interest not only from speech synthesis research but also from speech recognition. The original formant synthesizer also cannot produce an indistinguishable human voice, due to its model simplicity, so its use is currently mainly for intelligibility purposes in text-to-speech systems with small footprint requirements. The formant synthesizer is another

example of a combination of physical and spectral modeling, just like the LPC source-filter model. There is also a similar formant synthesizer developed specifically for the singing voice, called the FOF method, which is based on the generation of time-domain formant wave functions [64] in a system called CHANT [65]. The difference from Klatt's speech formant synthesizer is the use of parallel time-domain (impulse response) formant filters and a relatively simpler impulse train excitation. All of them are characterized by the specification of the spectral formant frequencies.
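As a sketch of the cascade (vowel) branch described above, each formant can be realized as a two-pole resonator specified by a center frequency and a bandwidth. The example below (Python with NumPy/SciPy) normalizes each resonator to unity gain at DC; the formant values are ballpark illustrations, not taken from any cited synthesizer.

    import numpy as np
    from scipy.signal import lfilter

    def formant_resonator(F, B, fs):
        """Two-pole resonator for one formant: center frequency F (Hz), bandwidth B (Hz)."""
        r = np.exp(-np.pi * B / fs)
        theta = 2 * np.pi * F / fs
        a = [1.0, -2.0 * r * np.cos(theta), r * r]
        b = [1.0 - 2.0 * r * np.cos(theta) + r * r]   # unity gain at DC
        return b, a

    def cascade_formants(excitation, formants, bandwidths, fs):
        """Pass an excitation through a cascade of formant resonators."""
        y = np.asarray(excitation, dtype=float)
        for F, B in zip(formants, bandwidths):
            b, a = formant_resonator(F, B, fs)
            y = lfilter(b, a, y)
        return y

    # Rough /aa/-like formants and bandwidths (illustrative values only).
    fs = 16000
    source = np.random.randn(fs)        # stand-in excitation; a glottal pulse train in practice
    out = cascade_formants(source, [730, 1090, 2440], [80, 90, 120], fs)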

1.1.6 Digital Waveguide Model

The digital waveguide model may be viewed as a more distributed version of the source-filter model. It is based on simulating wave propagation through different parts of the sound production system using delay lines, gains, digital filters, etc. The model is the result of sampling the solution to the governing wave equation [14]. The wave could be that of the volume velocity of the air or of its pressure, among many possibilities. The subsystems along the sound production pathway model the physical properties, such as the time-varying reflection and refraction caused by the differences in the areas of piecewise-cylindrical air cavities in speech, or the frequency-dependent loss in inelastic collisions of the air itself and the cavity tissue. Instead of lumping subsystems together, for example the lip radiation and the source, the waveguide model allows these subsystems or subcomponents to be modeled separately at their actual points of occurrence throughout the pathway. Sometimes, however, these subsystems can indeed be lumped together without noticeable difference, and sometimes they are lumped for simplification or for the speed of the simulation. The source-filter model can be thought of as a heavily simplified special case of a digital waveguide model. The digital waveguide also offers the capability of real-time synthesis, where output sound samples can be generated to instantaneously reflect changes in model parameters. An example of a 1-D digital waveguide for voice synthesis is shown in Figure 1.1. The model is known as the Kelly-Lochbaum cascade filter and is probably the first


Figure 1.1: Kelly-Lochbaum vocal tract digital waveguide model

sampled traveling wave model of the vocal tract [66]. It shows waves propagating with delays through each vocal tract tube, reflected and transmitted through the scattering junctions caused by the differences in wave impedance. When the lip termination is assumed to have zero impedance, due to the much larger area of the free space compared to the lip opening, the full digital waveguide can be collapsed to the basic lattice filter often used to implement a vocal tract filter. Such a 1-D digital waveguide model for the human voice has been shown to give human-like quality singing [67] [68]. In [67], a nasal cavity and throat radiation are each modeled using another coupled digital waveguide, demonstrating the flexibility and the distributed nature of the model. A 2-D digital waveguide mesh has been attempted for vocal tract simulation in [69], using the LF parametric waveform as excitation input with rigid wall boundary conditions, among other simplifications. The 2-D model has been shown to improve significantly on the 1-D model [70]. Recent attempts to improve the digital waveguide model for vocal synthesis include the use of conical segments [71] and fractional-delay lengthening of the segments [72].
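A small sketch of the scattering computation in Figure 1.1 (Python/NumPy; the tube areas are made-up illustration values, and the sign of the reflection coefficient depends on whether pressure or volume-velocity waves are propagated, so this is only meant to show the structure of the junction): each junction between adjacent cylindrical sections gets a reflection coefficient derived from the area (impedance) change, and a lossless junction splits the two incoming traveling waves with gains (1 + k), (1 - k) and +/-k.

    import numpy as np

    def reflection_coeffs(areas):
        """Reflection coefficient at each junction of a piecewise-cylindrical tube,
        k_i = (A[i+1] - A[i]) / (A[i+1] + A[i])  (one common sign convention)."""
        A = np.asarray(areas, dtype=float)
        return (A[1:] - A[:-1]) / (A[1:] + A[:-1])

    def kl_junction(in_right, in_left, k):
        """Kelly-Lochbaum junction: map the incoming right- and left-going waves
        to the outgoing ones using reflection coefficient k (as in Figure 1.1)."""
        out_right = (1.0 + k) * in_right - k * in_left
        out_left = k * in_right + (1.0 - k) * in_left
        return out_right, out_left

    # Crude glottis-to-lips area profile (cm^2), illustrative only.
    k = reflection_coeffs([0.6, 0.9, 1.5, 2.5, 4.0, 6.0, 8.0])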

A model based on the digital waveguide idea, applied to the vocal fold oscillation without going fully into the physical paradigm, is presented in [73]. In that work, a combination of an oscillator, a delay line, a gain and a non-linear black box, which models the non-linear source-tract interaction through regression, is used to simulate the vocal folds with a good match to the original derivative glottal waveform. This is preceded by the simulation of the vocal fold mass-spring mechanical model proposed by Ishizaka and Flanagan [74], which is often used not only in digital waveguide synthesis but also in the lumped source-filter and computational simulations to be

described. The distributed nature of the digital waveguide model allows detailed modeling of each scattering junction as well as of the frequency-dependent absorption along the pathway. The tissue absorption must be determined beforehand from physical measurement and fixed during estimation of other dynamic parameters such as the vocal tract shape. The non-linear interaction, for example near the glottis, can also now be modeled. However, its identification is still hard and usually requires prior training. Some of these characteristics, such as tissue absorption, might be assumed fixed without much harm, while others, such as the non-linear glottal-tract interaction, may not generalize well to all speakers. The estimation of the vocal tract shape and the source in this type of model is not analytically possible, so most rely on the analysis-by-synthesis principle, searching through codebooks or combinations of configurations. A genetic algorithm has been used to help identify relevant parameters for implementation in a 2-D digital waveguide mesh, using a physical measurement of the glottal opening as input [75]. A 3-D model where finite differences are employed, using data from physical measurements, has achieved a natural sound [76]. Whether it sounds like the original subject whose measurements were taken is not clear.

1.1.7 Articulatory Model

Using an articulatory model, one tries to determine the movement, in two or three dimensions, of the articulators involved in the production of speech, such as the tongue and jaw, and to directly manipulate them to effect the synthesis. Strictly speaking, an articulatory model is not a synthesis model, but it is worth mentioning considering our goal of structured coding and its long history in speech modeling. The actual sound synthesis still requires the transformation from the articulatory set of values to either the area function or the formants, to be fed to a waveguide model or a formant synthesizer respectively. It, however, offers the advantage of representing the human voice with an even smaller number of parameters than the conventional LPC. After all, Homer Dudley's original idea for the vocoder was to transmit these articulatory values. Cecil Coker, also at Bell Labs, was probably the first to present a small

set of articulatory parameters that can be converted to produce intelligible speech [77] [78]. A simple and intuitive control of these parameters, with a direct connection to motor control strategies, should be an advantage, although the control rules are not normally known. These parameters are also subject to physical constraints, such as the range within which they can move, individually and in combination, helping identification. Their inability to move too quickly also results in nicely smooth variation, and the unattainable targets which give rise to coarticulation follow naturally. Indeed, experiments have confirmed the ability to interpolate over intervals of several phonemes while remaining intelligible and free of artifacts [79]. However, even the most advanced articulatory model still cannot produce natural speech.

Identification of the articulatory parameters given a speech observation is extremely difficult due to the non-linear nature of the mapping. Initially, the identification step was done in the analysis-by-synthesis manner. Articulator configurations are chosen from codebooks, along with the excitation, such that the reconstruction error is minimized. Perceptual codebooks have been derived to help reduce the size [80]. To exploit the smooth variation of the articulators, dynamic programming can be used to encourage continuity. Recently, more data have become available from MRI and the electromagnetic articulograph (EMA), which circumvent the dosage limitation of X-ray imaging. Along with new statistical techniques, the non-linearity is modeled through training by some form of neural network [81], a set of radial-basis functions or a quantized codebook [82]. A (non-linear) dynamical system is commonly used to model articulator movements. A variety of identification techniques are then applied, based on previous training, to obtain the parameters of the incoming sound. These include extended Kalman filtering [83] and linearization using a quantization codebook [84]. A variational calculus approach has also been proposed for articulatory-acoustic mapping, which iteratively finds an optimal solution with respect to the vocal tract shape and its smooth variation [85]. Alternatives to non-linear continuous-state models are discrete mapping models, such as the HMM, as experimented with in [86], where the diphone observation is modeled by a three-state HMM and trained with seven articulatory points. Although this non-linearity may be well modeled by a neural network, the lack of a physical relation makes it hard to control during the

identification process and often results in large errors. The research trend in this direction seems to be the use of collected data to learn, in a data-driven manner, the rules of movement and the way to estimate these articulatory values accurately from acoustic observations. Careful studies of various articulatory-acoustic relationships for better understanding are also still very much needed. More high-quality physical data acquisition and interpretation tools are still desirable and crucial to deeper understanding. While not applicable to direct synthesis, articulatory modeling provides useful, smoothly varying and physically grounded guidance to other synthesis models and is heavily studied for speech recognition applications [87] [83]. Observations of articulator behavior can lead to useful constraints on algorithms used for parameter estimation from the acoustic waveform.

1.1.8 Fluid Mechanics Computational Models

This approach refers to the use of the differential equations that govern the physics of air flow during speech production, which may include higher-order terms that are usually difficult to solve. It can be thought of as more abstract than the digital waveguide model. When the boundary conditions, especially at the tract walls, glottis and lips, and the initial conditions are specified, synthesis is achieved by simulating the solution to the differential equations. Any complex manifestation of non-linearity within the model appears in the output sound once that part of the model is accurately specified. Related parameters determine the characteristics of the system; some can be measured physically but are obviously hard to determine from acoustic observation alone. In order to synthesize voice, some control through articulatory modeling, which in turn determines the tract pathway configuration, is needed. The main identification problem is therefore still related to tract shape and vocal fold movement, just as for the other models mentioned previously. While promising, even after all parameters have been identified, solving this set of equations is still time-consuming and sometimes unstable, and to date no perfectly natural results have been achieved. Real-time synthesis is not yet possible. However, with ever-increasing computational power, this approach may become viable in the future. A degree of success has been obtained for voiced sounds and fricatives using an irrotational but slightly compressible fluid model in the Reynolds-averaged Navier-Stokes (RANS) equations, which allows the linearity, zero-viscosity and plane-wave assumptions to be relaxed [88] [89]. A simple solver such as the finite difference method, which in the linear lossless case is equivalent to the digital waveguide, has also been applied to model the human voice both in the simple linear case [90] and with the RANS equations [89]; it cannot handle more complex differential equations, however. The source excitation can be simulated by any pressure/velocity waveform generator.

Identification, so far, has to be done largely as in the digital waveguide model, using prior physical measurements or analysis-by-synthesis with some sort of search mechanism. Essentially, the main parameter determinable directly from acoustic observations is the vocal tract shape; it is not yet clear how to determine physical coefficients such as viscosity or power absorption, and most researchers instead use medical imaging data in their synthesis demonstrations. In order to encode a voice using this method, the physical coefficients have to be specified and the simulation performed at the receiver. Research efforts in this area, to date, again employ imaging data and are concerned more with pure synthesis than with faithful encoding. Nevertheless, they gradually elucidate the inner workings of the human voice production system.
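As a small illustration of the simplest solver mentioned above, the following sketch integrates the linear, lossless one-dimensional wave equation with a finite-difference scheme (the case noted as equivalent to a digital waveguide). It is a generic sketch, not the solver of [90] or [89]; the tube length, grid size and boundary treatment are illustrative assumptions.

```python
import numpy as np

# Finite-difference time-domain (FDTD) sketch of the lossless 1-D wave equation
#   p_tt = c^2 * p_xx
# on a uniform tube with a closed (glottis) end and an open (lip) end.
# All numerical values below are illustrative assumptions.

c = 343.0                 # speed of sound, m/s
L = 0.17                  # tube length, m (roughly a vocal tract)
nx = 60                   # spatial grid points
dx = L / (nx - 1)
dt = dx / c               # Courant number = 1 (stability limit of this scheme)
nt = 2000                 # number of time steps

p_prev = np.zeros(nx)     # pressure at time n-1
p_curr = np.zeros(nx)     # pressure at time n
p_curr[1] = 1.0           # impulsive excitation near the closed end

lam2 = (c * dt / dx) ** 2
output = np.zeros(nt)     # pressure observed near the open end

for n in range(nt):
    p_next = np.zeros(nx)
    # interior update: standard second-order centered differences in space and time
    p_next[1:-1] = (2 * p_curr[1:-1] - p_prev[1:-1]
                    + lam2 * (p_curr[2:] - 2 * p_curr[1:-1] + p_curr[:-2]))
    p_next[0] = p_next[1]     # rigid (closed) end: zero pressure gradient
    p_next[-1] = 0.0          # open end: pressure release
    output[n] = p_next[-2]
    p_prev, p_curr = p_curr, p_next

# The spectral peaks of `output` approximate the odd quarter-wave resonances
# c/(4L), 3c/(4L), ... of a closed-open tube.
```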

1.2 Singing Versus Speaking

Modeling of a singing voice may be viewed as either a harder or an easier problem than modeling speech, depending on how we look at it. While a singing voice often demands a higher sampling rate, raising computational concerns and the difficulty of recovering the high-frequency region of the sound, a speaking voice may get away with a lower bandwidth (as in telephony) since the primary concern is intelligibility. On the other hand, a normal speaking voice often lacks control of pitch and vocal effort, resulting in pitch irregularity and glottal activity that does not conform to the assumed model, which can cause many problems in identification. Indeed, creaky voice and vocal fry occur readily in normal speech, while complete glottal closure is almost non-existent. The greater proportion and longer duration of sustained vowel sounds in singing also help estimation. Arguably, coarticulation effects are also more common in speaking. From the coding point of view, a singing voice tends to be more regular, which allows for coding with quantized pitch (MIDI-like)1 and low-order amplitude, while the highly regular fluctuations of vibrato and tremolo may be encoded independently. A vowel template and a stochastic residual codebook may be used without degradation for a singing voice, as shown in [91]; these are clearly inappropriate for speech.

1.3 Robust Parameter Identification and The Speech Enhancement Perspective

At the core of realizing any parametric representation in practice is the ability to obtain robust and accurate estimates of the parameters from real-world observations. Most parameter identification and other analysis algorithms have been proposed for clean speech signals. When noise is present, the luxury of direct analysis-by-synthesis is gone, perceptual weighting becomes harder, and spurious peaks caused by the noise can throw most algorithms off course. Selecting "codes" for the excitation source such that the resulting speech perceptually matches the observation, as done in CELP, is no longer valid. As a result, such coders rely on a pre-processor to first denoise the incoming signal without distorting the original, which is a difficult task in its own right, and a cascade of errors can easily occur. Relatively few works have considered noise explicitly during coding.

A structured audio concept, on the other hand, describes speech as an object rendered by a synthesis model. Noise is also an object, which combines with the speech to create the final object entering the ear. Had there been room reverberation, the object would be described as having gone through a reverberation system. Once all system components have been modeled, the sound can be recreated noise-free and reverberation-free, or, if fidelity is desired, the noise can be added back and the reverberation system reapplied. Modeling noise and adding it to the reconstruction is perfectly valid, for example, in the comfort noise generation commonly used in current telephony.

1 Simple MIDI is actually still not suitable for continuous-event instruments like the voice; it is better suited to discrete-event instruments like the piano.

The estimation process therefore needs to model all components in the sound scene well and should give robust estimates despite all the complicated interactions. This is a basic requirement for structured coding that is not easy to achieve.

Speech enhancement research [92], on the other hand, usually has only a rough model of speech. Sometimes only a statistical model is used, with hardly any structure particularly pertaining to speech. Most approaches also treat the task as a filtering problem: finding a filter that will give undistorted output speech. For example, the classical Wiener filter is derived from estimates of the signals' power spectral densities. The spectral subtraction method [93] assumes linear addition of the spectral magnitudes of the noise and the clean signal, and can also be viewed as a filtering process. Similarly, the log-spectral magnitude estimators in [94] and [95] estimate the clean speech spectrum directly. When dynamic properties are explicitly included in the model, there are HMMs [96] for discrete states and Kalman filtering for continuous states [97], both aiming at the Wiener solution for non-stationary processes. Recently, simulation-based solutions such as particle filtering have also been used [98]. The filtering approach is robust, and researchers continue to include more perceptual [99] and production-model features for improvement [100]. However, at the time of this writing, none of these methods achieves both good noise suppression and low distortion at the same time. Many of them produce "musical noise", a result of imperfect filtering in which noise bleeds through and creates tones scattered through time. Structured audio approaches the problem simply by modeling the observation and its production process: with good parameter estimates, the sound of interest can be resynthesized noise-free. While the existing algorithms are likely not robust enough and can easily degrade ungracefully compared to filtering, the result is most certainly free of musical noise. A similar idea of estimation and reconstruction for enhancement has been explored in sinusoidal modeling [101]. Basis thresholding and general subspace approaches, such as wavelet thresholding and subspace projection [102], are arguably in the same camp: the signal is represented by a set of spanning bases, and the coefficients corresponding to noise are set to zero or excluded from the reconstruction. Most of these, however, only model the speech spectrum, and none has come close to a physical model.
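For reference, a minimal magnitude-domain spectral subtraction in the spirit of [93] can be sketched as follows; the frame length, spectral floor and noise-estimation strategy are illustrative assumptions rather than the settings of any particular published system.

```python
import numpy as np

def spectral_subtraction(noisy, fs, frame_len=0.02, noise_frames=10, floor=0.01):
    """Very small magnitude spectral-subtraction sketch (illustrative only).

    The noise magnitude spectrum is estimated from the first few frames,
    subtracted from each frame's magnitude, and the noisy phase is reused.
    """
    n = int(frame_len * fs)
    hop = n // 2
    win = np.hanning(n)
    frames = [noisy[i:i + n] * win for i in range(0, len(noisy) - n, hop)]
    spectra = [np.fft.rfft(f) for f in frames]

    # crude noise estimate: average magnitude of the first `noise_frames` frames
    noise_mag = np.mean([np.abs(s) for s in spectra[:noise_frames]], axis=0)

    out = np.zeros(len(noisy))
    for i, s in enumerate(spectra):
        mag = np.maximum(np.abs(s) - noise_mag, floor * noise_mag)  # subtract, with a spectral floor
        clean = np.fft.irfft(mag * np.exp(1j * np.angle(s)), n)     # keep the noisy phase
        out[i * hop:i * hop + n] += clean * win                     # overlap-add
    return out
```

The isolated spectral peaks that survive the subtraction in individual frames are exactly what is heard as the "musical noise" discussed above.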

The goal of this thesis is to develop an analysis system which can automatically identify, in a structured manner, the parameters of an expressive parametric model of a human voice, be it normal speech or singing, even in moderately noisy conditions. Robust identification without human intervention is important for practical purposes, since in real-world applications there is most likely some noise in the background. The contributions lie both in the development of algorithms, based on advanced techniques which have emerged relatively recently, for each sub-system, and in how they can be put together to form a robust automatic speech-sound coding system for practical use. Observation noise is taken into consideration at all stages of the development. From a coding perspective, this is an attempt to bring speech coding one step closer to a well-structured coding paradigm with physically meaningful parameters and noise compensation integrated within the system. From an enhancement point of view, it is an attempt to bring more structure and source modeling into the enhancement process. Taken together, the work in this thesis merges speech synthesis principles into speech coding, with the by-product of achieving, at the receiving end, a noise-free reconstruction of the original speech. While most speech coding research has concentrated on how best to compress the sound, this thesis focuses only on how to extract, in a noisy environment, slowly varying parameters which should be amenable to compression.

Another recurring theme in this thesis, which should be pointed out, is the motivation to incorporate, as much as possible, physically related parameters into the models and methods used at each stage of the analysis. There is a clear thread of relationship in which analysis and synthesis enhance each other. Speech recognition has moved closer to modeling the speech production mechanism to address the shortfalls of purely data-driven approaches like the widely used HMM [87, 83, 103]. Speech enhancement with a more accurate source model has shown greater SNR improvement [104]. These models, along with their parameters, can also help constrain the problem to the set of physically possible configurations.

1.4 Summary

In this chapter, the motivation for structured coding of speech and voice observed in interfering noise has been given. To lay out the landscape of modeling choices, a review of past and present research on human voice representation for synthesis and coding was presented, both in terms of the quality of the models and the analysis techniques involved. The perspective of structured coding with respect to the enhancement application was also given. This provides the background and motivation for the development of each component presented in this thesis for a fully structured voice extraction system.

Chapter 2

System Overview

In this chapter, we discuss the model chosen for modification-friendly encoding, followed by an overview of the proposed system used in analysis and synthesis. The system takes into consideration the possibility of noise contamination. The purpose is to extract sound-source parameters which can be resynthesized in many expressive ways, according to the structured audio coding principle. Despite many advances in estimation techniques and progress in our knowledge of speech production, a completely parametric representation of human speech, even of just a voiced sound, is still quite far from being indistinguishable from the real thing. Fully structured coding of the human voice when noise may be present is still very difficult, even for a simple model. All system components are developed and studied independently in the following chapters, each with its own evaluation, before the ideas are put together to illustrate the concept of structured coding for the human voice. Figure 2.1 shows the components developed in this dissertation toward such coding in noise, described in the following sections.

Figure 2.1: Overall system diagram: speech segmentation (I), GCI detection (II) and parameter estimation (III), followed by sound reconstruction from the mode information and the source parameters (T0, OQ, AV)

2.1 Segmentation Front-end

Each type of sound is produced by a different model. Segmentation and identification are needed as a front-end (block I) in order to determine the appropriate model for each segment of an utterance or a singing phrase. This module is also important for expressive synthesis, where only the voiced sounds need to be modified, whereas transients such as fricatives and plosives only need to be translated in time (for time-scale modification) and are left virtually unchanged in frequency (for pitch modification) [38]. Segmentation and mode identification become complicated when noise is present. A contribution to performing this segmentation and identification in noisy circumstances is presented in Chapter 3. Recent advances in inference algorithms, namely the generalized pseudo-Bayesian method for switching state-space models and unscented Kalman filtering, will be employed for the task. Although the algorithms are general enough to identify many different types of sound, for the source-filter model used here only the fricatives and the vowels can be parametrically represented with sufficient sound quality, as described later for the synthesis model.

2.2 Voice Model

Once the voice segments have been identified, their parameters can be extracted according to a source-filter model similar to that shown in (1.1). The vocal tract filter is represented by a set of auto-regressive (AR) coefficients; such all-pole filters have been found to model the resonances of the vocal tract well during non-nasal voiced sounds. The glottal source waveform follows Rosenberg's model, given by

g(n) = \begin{cases} 2 a_g (n/f_s) - 3 b_g (n/f_s)^2, & 0 \le n \le T_0 \cdot OQ \cdot f_s \\ 0, & T_0 \cdot OQ \cdot f_s \le n \le T_0 \cdot f_s \end{cases} \qquad (2.1)

a_g = \frac{27 \cdot AV}{4 \cdot OQ^2 \cdot T_0}, \qquad b_g = \frac{27 \cdot AV}{4 \cdot OQ^3 \cdot T_0^2} \qquad (2.2)

Figure 2.2: Rosenberg's derivative glottal source waveform (two periods), showing the closed phase (CP), open phase (OP), glottal closure instants (GCI) and fundamental period T0

An example of two periods of this model is shown in Figure 2.2. Glottal closure instant (GCI) detection is performed during the voice analysis (block II); its purpose is basically to obtain pitch information. In a conventional speech coder, such as the standard LPC-10e, this would probably be done by a correlation method that determines the periodicity of the voice, which is then used to generate an impulse train for the voiced excitation. In this dissertation, we explore time-domain waveform fitting via dynamic programming, which, in principle, can give sample-accurate closure-instant detection. The algorithm can also identify the other parameters shown in (2.1), fully characterizing the derivative glottal excitation waveform. The voice source parameters can be stored as the fundamental periods (T0), the amplitude of voicing (AV) and the open quotient (OQ), defined as the ratio of OP to T0 as shown in Figure 2.2. These parameters can be modified at will, subject to the application. Together with the vocal tract filter estimates, the voice can be reconstructed or modified intuitively.

The algorithm and its evaluation will be presented in Chapter 4. In the case of clean speech, or when noise is neglected in the model, direct LPC analysis can be used to identify the vocal tract filter. When noise is present, however, LPC estimates are often not good enough. An iterative pitch-synchronous joint source-filter estimation is proposed as a final subcomponent (block III) to robustly extract all voice parameters. Iteration back to the glottal period segmentation can also be performed. A probabilistic framework for this iteration, and instances of algorithms to accomplish it, are presented in Chapter 5. The iterative nature of the glottal period segmentation and parameter estimation is represented by the loop within the dotted-line box in Figure 2.1. Of course, when noise is present, noise suppression may be desirable before the initial glottal period segmentation presented in Chapter 4.
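For concreteness, a minimal sketch of the derivative glottal waveform in (2.1)-(2.2) follows; the sampling rate and parameter values are arbitrary choices for illustration, not values used elsewhere in this dissertation.

```python
import numpy as np

def rosenberg_derivative(T0, OQ, AV, fs):
    """One period of the derivative glottal waveform of (2.1)-(2.2).

    T0 : fundamental period in seconds
    OQ : open quotient (0 < OQ <= 1)
    AV : amplitude of voicing
    fs : sampling rate in Hz
    """
    a_g = 27.0 * AV / (4.0 * OQ**2 * T0)
    b_g = 27.0 * AV / (4.0 * OQ**3 * T0**2)
    n = np.arange(int(round(T0 * fs)))           # one full period in samples
    t = n / fs
    g = np.where(t <= OQ * T0,
                 2.0 * a_g * t - 3.0 * b_g * t**2,   # open phase
                 0.0)                                 # closed phase
    return g

# Illustrative values: 8 kHz sampling, 100 Hz pitch, OQ = 0.6.
pulse = rosenberg_derivative(T0=0.01, OQ=0.6, AV=1.0, fs=8000)
excitation = np.tile(pulse, 20)   # a short train of identical periods
```

The abrupt return to zero at the end of the open phase is the glottal closure instant that block II is designed to locate.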

2.3 Synthesis System

The analysis of the sound yields a set of parameters. In the case of voiced sounds, it consists of the fundamental periods (or pitches), the amplitude and the open quotient of the derivative glottal source, and the vocal tract filter coefficients. While these parameters are highly compressible due to their smooth variation in time, the actual information-theoretic data compression is beyond the scope of this dissertation. The glottal excitation parameters can be modified to change the pitch through T0, the loudness through AV, and the breathiness through OQ and the injection of glottal-instant-synchronous noise [68]. For fricatives, which are the only other type of sound parametrically represented here, the excitation is simply white Gaussian noise; the parameters representing this type of sound are then the vocal tract filter coefficients and the energy of the injected noise. Useful voice modifications and applications will be described in Chapter 6.

The vocal tract filter can be represented in many interchangeable forms, albeit with different robustness to quantization and different interpolation properties [61] [105]. In this dissertation, the synthesis model will be the lattice filter implementation of the all-pole filter, shown in Figure 2.3, where the coefficients involved are the reflection coefficients. These coefficients have the desirable property of easy interpolation between two sets of stable all-pole filters with guaranteed stability, which is handy for voice modification or smoothing. Besides, they also have a physical interpretation in the waveguide model, representing how the air pressure or velocity wave transfers and reflects at each junction between vocal tract cross-sections. The lattice filter is a special case of the Kelly-Lochbaum waveguide model of Figure 1.1 in which there is a matched impedance at the glottal source and a zero impedance at the lips; a manipulation of the signal flow graph in Figure 1.1 results in the lattice filter of Figure 2.3 [51].
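A minimal sketch of the all-pole lattice synthesis filter of Figure 2.3 follows; the helper name is ours, and the sign convention for the reflection coefficients varies between texts (it may need flipping to match the figure exactly).

```python
import numpy as np

def lattice_allpole(u, k):
    """All-pole lattice synthesis filter driven by an excitation u.

    k : reflection coefficients k(1)..k(P); |k(i)| < 1 guarantees stability.
    Follows the common recursion
        f_{m-1}(n) = f_m(n) - k_m * b_{m-1}(n-1)
        b_m(n)     = k_m * f_{m-1}(n) + b_{m-1}(n-1)
    with f_P(n) = u(n) and output s(n) = f_0(n).
    """
    P = len(k)
    b = np.zeros(P + 1)          # backward signals b_0 ... b_P from the previous sample
    s = np.zeros(len(u))
    for n, x in enumerate(u):
        f = x
        for m in range(P, 0, -1):          # work down from stage P to stage 1
            f = f - k[m - 1] * b[m - 1]
            b[m] = k[m - 1] * f + b[m - 1]
        b[0] = f                            # b_0(n) = f_0(n)
        s[n] = f
    return s
```

Driving this filter with a Rosenberg excitation and a set of reflection coefficients converted from LPC analysis reproduces the familiar all-pole synthesis; interpolating the reflection coefficients frame to frame keeps every intermediate filter stable as long as each |k(i)| < 1.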

Figure 2.3: Lattice filter for synthesis, mapping the excitation u(n) to the output s(n) through reflection coefficients k(1), ..., k(P) and unit delays

2.4 Discussion

While the chosen model, based on source-filter parameterization, is not fully physical, as explained in Chapter 1, it has a close relationship with physical interpretations. Since the full physical mechanism of speech production is not yet fully understood, this provides a way to parameterize speech samples for a faithful temporal-spectral reconstruction while leaving room for physically intuitive control. The glottal excitation waveform chosen is easier to understand and offers more intuitive control than a straightforward AR or ARMA source model as used in [100] and [60]. Because it is the derivative of the presumed actual glottal air flow, many physical limitations can be used to constrain its shape and hence keep the output natural. Many researchers have used the term "physically informed" for such models, with varying degrees of physicality; in a way, it can be considered a hybrid model at the boundary of spectral and physical source modeling. Compared to a direct spectral modeling approach such as the sinusoidal model, it provides more intuitive control of voice texture such as laryngealization and breathiness. The glottal-period-synchronous nature of the aspiration noise synthesis arguably sounds more natural than some attempts in the sines+noise paradigm [106, 47, 107]. It also provides a means for direct manipulation of the open quotient, which has been identified as important to the perception of breathiness in voice, though to a lesser degree than the noise injection [57]. Compared to impulse-train excitation, Rosenberg's model gives less buzziness. It is also the model used in the popular Klatt formant synthesizer (KLSYN). While buzziness remains a criticism of KLSYN-synthesized voice, its unnaturalness is perhaps more damaging, due to the use of rules to determine formant targets, pitch and duration; this is what causes people to reject the synthesis as machine-like. As examples in this dissertation will show, when voice parameters are properly extracted, the natural progression of pitch and vocal tract resonances simply follows the original, and the result is acceptable as human-like. This indicates another area where the work in this dissertation will be useful: parametric speech synthesis [8], where parameters representing speech units are modeled and learned from a real speech database and used in synthesis. The advantage over conventional unit-selection concatenative speech synthesis is the smaller footprint and the absence of unit discontinuities, while overcoming the unnaturalness of rule-based fully parametric synthesis. Although still in its infancy, the combination of parametric and unit-selection techniques seems to be an interesting avenue for the future of speech synthesis research.

Speaker identification can also benefit from the chosen model, by discriminating between speakers using the vocal tract and glottal source parameters. Compared to conventional features such as the Mel-frequency cepstral coefficients (MFCC), a more specific set of physically related features such as pitch, loudness, emphasis and open-quotient variation, directly obtained in this work, should be more discriminative. Most likely, this crude model cannot capture all speaker-discriminatory information, but it may provide a good supplement to existing systems. This has been illustrated in the hybrid system of [108], where the combination of MFCC and glottal source model parameters is shown to outperform the use of MFCC alone. Although the selected model is obviously still too simple, it provides much of the naturalness while allowing many linear estimation procedures (Chapters 4 and 5).

The lack of adoption of physical parametric models in speech synthesis and coding is most likely due to the complexity of the estimation procedures, coupled with the incomplete modeling knowledge of human speech production, perception and cognition. It may yet emerge as a winner in the future. This dissertation contributes by taking a step closer in the area of algorithmic parameter estimation, using state-of-the-art tools and knowledge about speech production while trying to keep the complexity down.

2.5 Summary

An overview of the envisioned system for structured voice extraction from a possibly noisy recording has been presented in this chapter, along with some justification of the chosen synthesis model in terms of its trade-off between quality and estimation difficulty relative to other models. The subcomponents, and how they fit into the system, have been briefly described as an introduction to the following chapters. The contribution of this thesis toward improved robust speech segmentation (block I), using recent techniques in statistical signal processing and modeling, is presented in Chapter 3. A new glottal waveform segmentation and parameter extraction method (block II) based on dynamic programming is presented in Chapter 4. Finally, for noisy environments, a new iterative algorithm that jointly estimates the vocal tract and the glottal parameters (block III) is presented in Chapter 5. All accompanying sound samples can be found at http://ccrma.stanford.edu/~pj97/Thesis/index.html.

Chapter 3

Noisy Speech Segmentation

As shown in Chapter 2, speech segmentation is at the forefront of the analysis system. Its decisions determine the appropriate models, for well-structured coding and parameter estimation, for each different period of the speech. In this chapter, various algorithms for robust segmentation of speech embedded in noise are reviewed. While there are many ways to perform such segmentation, we focus on the emerging switching state-space model (SSM), an extension of the popular hidden Markov model (HMM), to make inferences on the segmentation decision. A mixture model, along with its learning and inference methods, is presented. While noise is additive in time to the speech signal, its effect on the observation is nonlinear in feature spaces, such as MFCC or other log-scale filter-bank outputs, where discrimination between phoneme classes is otherwise better. A technique called unscented Kalman filtering (UKF) is adopted as a tool to deal with the nonlinear inference.

3.1 Introduction

Speech segmentation refers to the process of accurately dividing speech into identifiable groups according to the requirements of an application. In mobile telephony, a crude segmentation into speech and silence is enough for bandwidth efficiency and also enhances automatic speech recognition (ASR) performance by eliminating uninformative frames. Voiced/unvoiced/silence segmentation can also increase speech coding efficiency and is useful in segment-based speech enhancement. Accurate identification of voiced frames can help applications like emotion detection, where accurate tracking of pitch dynamics is crucial. A more detailed categorization is useful for linguistic study, for generating concatenative speech synthesis units, and for general structured audio applications in which a sound source can be regenerated from its descriptive parametric model.

While all of these processes can be called segmentation, the terms "speech detection", "voice activity detection" (VAD) or "end-point detection" are often used for the segmentation of an incoming sound into speech and silence, where the interest is only in when speech starts and ends. In this respect, various features, such as energy change and zero-crossing rate among others, are compared against thresholds, and a decision module decides whether there is a change. Many hang-over schemes have been devised to avoid false starts or ends of speech. The G.729 Annex B standard VAD is often quoted as state-of-the-art, based on a combination of schemes and features like those mentioned above. More sophisticated algorithms have been proposed and studied since then, especially those based on statistical decision making and classifiers such as neural networks, support vector machines (SVM), hidden Markov models (HMM), fuzzy clustering, and Gaussian mixture models (GMM). A variety of features have also been tested, including wavelet coefficients, auditory filterbank outputs, higher-order statistics of the LPC residual, likelihood ratios, pitch, cepstral coefficients, and the Itakura LPC distance.

In some applications, including ours, a more detailed categorization is required; effectively, this is a simpler version of speech recognition. The difference between them, apart from the level of categorization, is that speech segmentation requires phonetically meaningful units with accurate boundary identification, whereas in speech recognition the units do not strictly need to represent a phone and the boundaries are usually optimized for overall recognition performance, which includes the use of a language model; the recognizer units do not necessarily coincide with phone boundaries. The categorization depends on the nature of the application. Structured speech coding, for example, may need to determine whether the sound is voiced, unvoiced, mixed excitation or a pause.

In addition, voiced sounds can be nasal, which should be modeled differently from the vowel sounds due to the opening of the nasal cavity.

The HMM is presently the most popular tool for ASR, which motivates the investigation and development presented in this chapter. The HMM, however, is known to be a poor model for speech. Being just a statistical tool, it does not include any inherent property of speech production. As a result, it performs poorly on spontaneous speech, where a lot of coarticulation, the effect of pronouncing the current phoneme in anticipation of future phonemes, occurs and the acoustic observations consequently differ drastically from the training samples. The HMM, as commonly used in speech modeling, is also only a model for discrete variables and hence cannot exploit smoothness in speech through the use of continuous variables. On the other hand, a conventional linear dynamical system typically consists of a single set of parameters modeling a fixed period of time over which the system is assumed stationary; changing the model through time requires pre-determined boundaries. Recently, there has been emerging interest in switching models, in which a dynamic model contains both discrete and continuous variables. Such a model allows appropriate switching to best suit the data and the prior knowledge about the system, automatically and based on probabilities. Thanks to better tools for approximate inference and learning in this type of model, the switching state-space model (SSM) has been applied to a number of applications, including speech recognition [103, 109, 87, 110], speech segmentation [111] and speech feature enhancement for ASR [112].

Speech production fits naturally into the SSM framework: it consists of discrete units, such as phones or phonemes, which govern how the articulators should move to produce the acoustic observations, and the articulators themselves move smoothly through time under physical constraints. The SSM therefore offers a clear advantage, over the HMM now widely used for ASR, in modeling the smooth variation of speech production itself for better ASR accuracy. In [82], the continuous states model the smooth variation of the articulators, while in [103], [109] and [87] they are used more explicitly to model the vocal tract resonances, which vary smoothly through time toward a target position for each phone. As a result, coarticulation can be taken into account, because of the global optimization through phone targets, which the vocal tract resonances may or may not reach. Coarticulation has led to poor ASR performance, and the SSM has been shown to improve this shortfall [87, 103, 109] with fewer parameters than the HMM. In [111], the SSM is used for speech segmentation, classifying frames of speech observations into vowels, nasals, fricatives/stops and silence; the hidden continuous states again implicitly represent the slowly varying articulator-related states through time. Also, [110] shows that the state posterior variance is much smaller at segment boundaries when an SSM is used instead of a factor-analyzed HMM, which does not include a continuous dynamical system in the model.

While the SSM has also been applied to noisy speech for speech enhancement, both for listening [99] and as an enhanced front-end feature extraction for a speech recognizer [112], no such work has been reported for segmentation where phonetically meaningful units are desired; conversely, the SSM used for speech segmentation in [111] did not attempt to address the problem of noise. Inspired by the significant performance gains in ASR from past research, in this chapter we focus on the use of the SSM for speech segmentation under noisy conditions. The work does not intend to compare this type of classifier (a generative model) with others, for example neural networks or support vector machines, nor does it try to compare different features, which might include those derived from highly discriminative auditory models. There are, however, advantages to using such a generative model over others: clear topological and dynamic relationships which often carry meaning, the globally optimized nature of the algorithms involved, and the ability to integrate feature compensation directly into the system, as illustrated in this chapter.

3.2 Switching State-space Model of Speech

The clean speech model presented here is an extension of [111] to a mixture-of-Gaussians model, similar to [103] and [87]. Basically, each phoneme class is represented by a mixture of Gaussians at the observation, each component having its own linear dynamical system evolution. Using a mixture model can lead to a significant improvement over a single-Gaussian model. A clean speech utterance can be modeled using the following equations.

x_t = A_m(S_t)\, x_{t-1} + v_t(S_t) \qquad (3.1)

y_t = C_m(S_t)\, x_t + D_m(S_t) + w_t(S_t) \qquad (3.2)

v(S_t) \sim \mathcal{N}(0, Q_m(S_t)), \qquad w(S_t) \sim \mathcal{N}(0, R_m(S_t))

T(i, j) = \Pr(S_t = i \mid S_{t-1} = j), \qquad i, j = 1, \ldots, |S| \qquad (3.3)

\Pr(S_0 = i) = \pi_i, \qquad i = 1, \ldots, |S| \qquad (3.4)

The clean speech feature vector, such as the MFCC, at each time instant is represented by y_t. The vector x_t represents the hidden continuous state variable, with dynamic properties governed by the matrices A, C and D. The dynamical model is driven by state-specific random variables, v and w, whose covariances are invertible. The hidden dynamical model at each time instant is determined by the discrete state at that instant, S_t. The discrete states represent the phone classes, while the continuous states reflect the slowly varying articulatory-related parameters. The number of phone classes is |S|, each class consisting of M dynamical systems; i.e., within each class, from one step to the next, there are M possible trajectories and hence M output Gaussians, modeling output distributions which may be non-Gaussian or multi-modal. The probability of transitioning from one phone class to another is given by the transition matrix T, whereas the probability of a particular phone occurring at the beginning of the observation is given by \pi. The mixture distribution is mixed through the prior probability weights \alpha_m = \Pr(m \mid S), which are obtained through training. The system's graphical model is depicted in Figure 3.1. The graph shows that if the phone class at a particular time is known to be, say, S_t, the dynamics of the hidden states and the observations follow dynamical-system component m of that phone class.
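To make the generative reading of (3.1)-(3.4) concrete, the following sketch samples an observation sequence from such a switching model; the dimensions, number of classes and random parameter values are placeholders, not learned speech parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

n_class, n_mix = 4, 2      # |S| phone classes, M mixture components per class
dx, dy = 2, 10             # hidden state and observation (feature) dimensions
T_len = 200                # number of frames to generate

# Placeholder parameters: one linear dynamical system per (class, component).
A = rng.uniform(0.8, 0.99, (n_class, n_mix, dx, dx)) * np.eye(dx)   # stable, diagonal A
C = rng.normal(0, 1, (n_class, n_mix, dy, dx))
D = rng.normal(0, 1, (n_class, n_mix, dy))
Q = np.tile(0.01 * np.eye(dx), (n_class, n_mix, 1, 1))
R = np.tile(0.10 * np.eye(dy), (n_class, n_mix, 1, 1))
alpha = np.full((n_class, n_mix), 1.0 / n_mix)       # Pr(m | S)
Trans = np.full((n_class, n_class), 0.05 / (n_class - 1)) + 0.95 * np.eye(n_class)
pi = np.full(n_class, 1.0 / n_class)

x = np.zeros(dx)
S = rng.choice(n_class, p=pi)
Y = np.zeros((T_len, dy))
states = np.zeros(T_len, dtype=int)
for t in range(T_len):
    S = rng.choice(n_class, p=Trans[:, S])           # discrete phone class, eq. (3.3)
    m = rng.choice(n_mix, p=alpha[S])                # mixture component within the class
    x = A[S, m] @ x + rng.multivariate_normal(np.zeros(dx), Q[S, m])                 # eq. (3.1)
    Y[t] = C[S, m] @ x + D[S, m] + rng.multivariate_normal(np.zeros(dy), R[S, m])    # eq. (3.2)
    states[t] = S
```

Learning inverts this process: given Y and the labeled states, the E-step infers x and the M-step re-estimates the per-class, per-component matrices, as described next.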

Figure 3.1: Switching state-space graphical model: the discrete state S_t and mixture index m govern the hidden continuous state x_t and the observation y_t

3.2.1 Learning

The transition matrix and the prior probability in (3.3) and (3.4) are both learned by counting from labeled frames of clean speech samples in the training set.

\pi_j = \frac{f(S_0 = j)}{f(S_0)} \qquad (3.5)

\Pr(S_t = i, S_{t-1} = j) = \frac{f(S_t = i, S_{t-1} = j)}{f(S_t, S_{t-1})} \qquad (3.6)

T(i, j) = \frac{\Pr(S_t = i, S_{t-1} = j)}{\sum_i \Pr(S_t = i, S_{t-1} = j)} \qquad (3.7)

where f(E) is the frequency with which event E occurs in the training data. Had we not wanted meaningful phone boundaries, optimal learning of arbitrary discrete units could also be included in the iteration below. Given the hand-labeled discrete state sequence, the maximum-likelihood (ML) continuous state-space parameters can be learned through the expectation-maximization (EM) algorithm as follows:

E step: For each m, find the sufficient statistics

\hat{x}_{t|T_k} = E(x_t \mid y_{1:T_k}) \qquad (3.8)

V_{t|T_k} = \mathrm{Cov}(x_t\, x_t' \mid y_{1:T_k}) \qquad (3.9)

V_{t,t-1|T_k} = \mathrm{Cov}(x_t\, x_{t-1}' \mid y_{1:T_k}) \qquad (3.10)

\langle \hat{x}_{t|T}\, \hat{x}_{t|T}' \rangle = V_{t,t|T_k} + \hat{x}_{t|T}\, \hat{x}_{t|T}' \qquad (3.11)

\langle \hat{x}_{t|T}\, \hat{x}_{t-1|T}' \rangle = V_{t,t-1|T_k} + \hat{x}_{t|T}\, \hat{x}_{t-1|T}' \qquad (3.12)

The statistics shown, for each component m, are obtained by Kalman smoothing (see Appendix A) for each phone segment k and collected over all K_s phone-class segments. To avoid having to collapse M distributions at every time step, we assume that each component propagates separately within a phone segment. Since each phone segment is now represented by a mixture of components, at the end of each segment the smoothed states are combined into a single Gaussian state through moment matching, for use in the following phone segment. If the segment is too short, having only one frame, only filtering is used to arrive at the required statistics for that segment. The M-step involves calculating the ML estimates of the parameters through partial differentiation of the likelihood

\hat{L} = E_{P(S_{1:T}, x_{1:T}, y_{1:T})}[L] \qquad (3.13)

where L is the complete-data likelihood

L = \log P(S_{1:T}, x_{1:T}, y_{1:T}) \qquad (3.14)

\;\; = -\frac{1}{2} \sum_{t=1}^{T} [y_t - C_t x_t]' R_t^{-1} [y_t - C_t x_t] - \frac{1}{2} \sum_{t=1}^{T} \log |R_t| \qquad (3.15)

\;\;\;\; - \frac{1}{2} \sum_{t=2}^{T} [x_t - A_t x_{t-1}]' Q_t^{-1} [x_t - A_t x_{t-1}] - \frac{1}{2} \sum_{t=2}^{T} \log |Q_t| \qquad (3.16)

\;\;\;\; - \frac{1}{2} [x_1 - \mu_1]' \Sigma_1^{-1} [x_1 - \mu_1] - \frac{1}{2} \log |\Sigma_1| - \frac{T(n+m)}{2} \log 2\pi \qquad (3.17)

\;\;\;\; + \log \pi_{S_1} + \sum_{t=2}^{T} \log T(S_t, S_{t-1}) \qquad (3.18)

After some simplification and taking the appropriate expectations, which leads to the use of the sufficient statistics estimated in the E-step, we arrive at the M-step as follows:

M step: Denoting y_t - D_m by \bar{y} (where m is omitted, subject to context), the ML estimates of the system parameters can be found from

A_m = \left[ \sum_{k=1}^{K_s} \sum_{t=2}^{T_k} \omega_t(m) \langle \hat{x}_{t|T}\, \hat{x}_{t-1|T}' \rangle_m \right] \left[ \sum_{k=1}^{K_s} \sum_{t=2}^{T_k} \omega_t(m) \langle \hat{x}_{t-1|T}\, \hat{x}_{t-1|T}' \rangle_m \right]^{-1} \qquad (3.19)

C_m = \left[ \sum_{k=1}^{K_s} \sum_{t=1}^{T_k} \omega_t(m)\, \bar{y}_{k,t}\, \hat{x}_{t|T,m}' \right] \left[ \sum_{k=1}^{K_s} \sum_{t=1}^{T_k} \omega_t(m) \langle \hat{x}_{t|T}\, \hat{x}_{t|T}' \rangle_m \right]^{-1} \qquad (3.20)

D_m = \left[ \sum_{k=1}^{K_s} \sum_{t=1}^{T_k} \omega_t(m)\, y_t \right] \Bigg/ \left[ \sum_{k=1}^{K_s} \sum_{t=1}^{T_k} \omega_t(m) \right] \qquad (3.21)

Q_m = \frac{ \sum_{k=1}^{K_s} \sum_{t=2}^{T_k} \left( \langle \hat{x}_{t|T}\, \hat{x}_{t|T}' \rangle_m - A_m \langle \hat{x}_{t|T}\, \hat{x}_{t-1|T}' \rangle_m \right) }{ \sum_{k=1}^{K_s} \sum_{t=2}^{T_k} \omega_t(m) } \qquad (3.22)

R_m = \frac{ \sum_{k=1}^{K_s} \sum_{t=1}^{T_k} \omega_t(m) \left( \bar{y}_{k,t}\, \bar{y}_{k,t}' - C_m \hat{x}_{t|T,m}\, \bar{y}_{k,t}' \right) }{ \sum_{k=1}^{K_s} \sum_{t=1}^{T_k} \omega_t(m) } \qquad (3.23)

\bar{\hat{x}}_{1,m} = \left[ \sum_{k=1}^{K_s} \omega_1(m)\, \hat{x}_{1,m} \right] \Bigg/ \left[ \sum_{k=1}^{K_s} \omega_1(m) \right] \qquad (3.24)

\bar{\hat{V}}_{1,m} = \left[ \sum_{k=1}^{K_s} \omega_1(m)\, (\hat{x}_{1,m} - \bar{\hat{x}}_{1,m})(\hat{x}_{1,m} - \bar{\hat{x}}_{1,m})' \right] \Bigg/ \left[ \sum_{k=1}^{K_s} \omega_1(m) \right] \qquad (3.25)

\alpha_m = \left[ \sum_{k=1}^{K_s} \sum_{t=1}^{T_k} \omega_t(m) \right] \Bigg/ \left[ \sum_{k=1}^{K_s} T_k \right] \qquad (3.26)

where \bar{\hat{x}}_1 and \bar{\hat{V}}_1 are the initial state and covariance estimates at time t = 1, respectively. As usual, to avoid a scaling ambiguity between C_m and Q_m, C_m is constrained to have unit columns. Also,

\omega_t(m) = p(m \mid y_t, S_t, \Theta_t) = \frac{ p(y_t \mid x_t, m, \Theta)\; p(m \mid S_t, \Theta_t) }{ \sum_{m'=1}^{M} p(y_t \mid x_t, m', \Theta)\; p(m' \mid S_t, \Theta_t) } = \frac{ L(m) \cdot \alpha_m }{ \sum_{m'=1}^{M} L(m') \cdot \alpha_{m'} } \qquad (3.27)

where \Theta_t represents all the system parameters at time t for that iteration, and L is the likelihood obtained from filtering or smoothing (see Appendix A for the difference). Note that each m is conditioned on the state S, so wherever an index m appears it refers to component m within a class S; the S is omitted for cleaner presentation.

The parameters C_m, R_m and \alpha_m are initialized using probabilistic PCA for mixture distributions [113]; A and Q are initialized using linear regression on the projected hidden states. Q and R are forced to be diagonal for numerical stability during learning.
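As a small illustration of the counting estimates (3.5)-(3.7), the following sketch builds the prior and the transition matrix from hand-labeled frame sequences; the label arrays and helper name are placeholders.

```python
import numpy as np

def count_prior_and_transitions(label_seqs, n_class):
    """Estimate pi and T(i, j) = Pr(S_t = i | S_{t-1} = j) by counting, as in (3.5)-(3.7).

    label_seqs : list of integer label arrays, one per training utterance
                 (hand-labeled frames in practice; toy data below)."""
    pi = np.zeros(n_class)
    counts = np.zeros((n_class, n_class))            # counts[i, j]: transitions j -> i
    for seq in label_seqs:
        pi[seq[0]] += 1.0
        for prev, curr in zip(seq[:-1], seq[1:]):
            counts[curr, prev] += 1.0
    pi /= pi.sum()                                   # eq. (3.5)
    col_totals = np.maximum(counts.sum(axis=0, keepdims=True), 1.0)  # guard unseen origins
    T = counts / col_totals                          # normalize each origin column j, eq. (3.7)
    return pi, T

# e.g. two toy label sequences over 4 classes
pi, T = count_prior_and_transitions([np.array([3, 3, 0, 0, 2]), np.array([3, 0, 0, 1])], 4)
```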

3.2.2 Inference

The ultimate goal is to find the globally smoothed posterior probability sequence, \Pr(S_{1:T} \mid Y_{1:T}), given the observations. The difficulty with inference in the SSM is the exponential growth in the number of component hypotheses, which makes exact inference intractable. At each time step there are M possible trajectories, according to each class's dynamics, so for an entire sequence of length T there are M^T possible trajectories. Many approximate inference methods have been proposed (see the review in [114]). They may be categorized as follows:

1. Selection: by keeping only the path with the highest likelihood. Unlike basic HMM, this cannot guarantee a globally optimal solution. This method is usually called approximate Viterbi.

2. Collapsing: approximate the mixture of M^t Gaussians at time t by a mixture of r Gaussians using moment matching.

3. Expectation propagation: this is a batch algorithm which iteratively calculates the approximate likelihood and prior to give better estimates.

4. Variational method: a variational parameter is used to link an artificially decoupled structure which would otherwise be intractable. Iterations of exact inference can then be performed to minimize the Kullback-Leibler (KL) divergence between the original system and the artificially decoupled one.

5. Sampling: a distribution is approximated by a set of weighted particles or samples. This approach can handle arbitrary distributions and arbitrary graphical topologies, although in practice some marginalization techniques are used to help the estimation. Particle filtering and MCMC methods are examples of this technique.

In this work, a collapsing approach in the form of GPB(r) is used in approximate inference due to its simplicity.
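The collapsing operation itself is simply moment matching: a weighted Gaussian mixture is replaced by the single Gaussian with the same mean and covariance. A minimal sketch follows (the weights, means and covariances are placeholder inputs).

```python
import numpy as np

def collapse(weights, means, covs):
    """Moment-match a Gaussian mixture to a single Gaussian.

    weights : (M,) mixture weights, summing to one
    means   : (M, d) component means
    covs    : (M, d, d) component covariances
    Returns the mean and covariance of the collapsed Gaussian, the closest
    single Gaussian to the mixture in the KL-divergence sense.
    """
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    covs = np.asarray(covs, dtype=float)

    mean = weights @ means                               # overall mean
    diff = means - mean                                  # spread of the component means
    cov = np.einsum('m,mij->ij', weights, covs) \
        + np.einsum('m,mi,mj->ij', weights, diff, diff)  # within- plus between-component covariance
    return mean, cov
```

In GPB1 this collapse is applied across the filtered Gaussians at every frame; in GPB2 it is applied, per destination class, over the possible origin classes.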

Generalized Pseudo Bayesian Inference

Figure 3.2 illustrates the operation of GPB1 and GPB2. In GPB1, at each time step, |S| Gaussians are generated from the |S| class components; to avoid an exponential explosion in the number of states, they are collapsed into a single Gaussian component. In other words, we collapse Gaussians (at time t) which differ only in their history one step ago (at time t - 1). In GPB2, we collapse Gaussians which differ in their history two steps ago, i.e., which came from different state origins at that point. Generally, the more history we keep, the more accurate the approximation should be; obviously, a higher order also means more computation. An approximation to GPB2 called "interacting multiple models" (IMM) reduces the computation but, unlike GPB2 itself, cannot do smoothing.

Figure 3.2: GPB1 (top) and GPB2 (bottom) diagrams: at each time step the state estimates (x_{t|t}, V_{t|t}) from the per-class Kalman filters are collapsed, into a single Gaussian for GPB1 or into one Gaussian per current class for GPB2

The generic steps for both GPB1 and GPB2 are summarized as follows. Filtering: for t = 1, ..., T,

1. Predict the states and covariances from the previous time step.


2. Update the states and the covariances with the observation, using conventional Kalman filtering.

3. Calculate the likelihood and the filtered posteriors.

4. Collapse the Gaussian components.

Smoothing (GPB2 only): For t = T,..., 1,

1. Calculate smoothed states and covariances from the future time step.

2. Calculate the smoothed posteriors.

3. Collapse the Gaussian components.

For details of basic GPB1 and GPB2, see Appendix A. Despite making approximations at every step, the error has been shown to be bounded [115]. The collapsing operation through moment matching gives the closest Gaussian, in the KL-divergence sense, to the original distribution. Note that GPB1 does not allow backward smoothing, since it does not have the first-order time-difference statistics. A heuristic backward pass, which can help reduce the continuous-state variance, can be performed by simply repeating the GPB1 process in a time-reversed manner and then combining the results via the posterior estimates [112]; in inference, however, this does not affect the resulting posterior. One more pass after such a forward-backward combination, now with better continuous-state estimates, may nevertheless give better filtered posteriors. For this work, however, only GPB1 filtering and GPB2 smoothing will be shown.
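To fix ideas, here is a compact sketch of the GPB1 forward pass for the linear, clean-speech case with a single component per class (M = 1); the matrix names follow (3.1)-(3.2) and the collapse step is the moment matching described above. It is an illustrative skeleton under those simplifying assumptions, not the exact implementation evaluated later.

```python
import numpy as np

def gpb1_filter(Y, A, C, D, Q, R, Trans, pi, x0, V0):
    """GPB1 forward pass for a switching linear-Gaussian model (M = 1 per class).

    Y : (T, dy) observations; A, C, D, Q, R are lists of per-class system matrices.
    Returns the filtered class posteriors Pr(S_t | y_1:t)."""
    n_class = len(A)
    T_len = len(Y)
    post = np.zeros((T_len, n_class))
    x, V, p = x0.copy(), V0.copy(), pi.copy()
    for t in range(T_len):
        means, covs, lik = [], [], np.zeros(n_class)
        for s in range(n_class):
            # predict with the class-s dynamics
            xp = A[s] @ x
            Vp = A[s] @ V @ A[s].T + Q[s]
            # Kalman update with the class-s observation model
            e = Y[t] - (C[s] @ xp + D[s])
            S_cov = C[s] @ Vp @ C[s].T + R[s]
            K = Vp @ C[s].T @ np.linalg.inv(S_cov)
            means.append(xp + K @ e)
            covs.append(Vp - K @ C[s] @ Vp)
            # Gaussian innovation likelihood p(y_t | S_t = s, y_1:t-1)
            lik[s] = np.exp(-0.5 * e @ np.linalg.solve(S_cov, e)) / \
                     np.sqrt(np.linalg.det(2 * np.pi * S_cov))
        # discrete posterior: predict with the transition matrix, then weight by likelihood
        prior = Trans @ p
        w = lik * prior
        p = w / w.sum()
        post[t] = p
        # collapse the per-class filtered Gaussians into one (moment matching)
        x = sum(p[s] * means[s] for s in range(n_class))
        V = sum(p[s] * (covs[s] + np.outer(means[s] - x, means[s] - x))
                for s in range(n_class))
    return post
```

GPB2 differs only in keeping one collapsed Gaussian per current class instead of a single one, which is what makes backward smoothing possible.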

3.3 Nonlinear Noisy Observation Model

When speech is corrupted with noise, the observation features deviate badly from the clean-speech training model. Figures 3.3 and 3.4 show how the log-Mel filterbank (LMFB) magnitude outputs and the MFCC are affected by white noise at SNR = 10 dB along the first and second dimensions. An LMFB feature can be calculated by taking the log magnitude of the DFT spectrum passed through a Mel-scale filterbank; MFCC can be obtained by multiplying the LMFB by a Discrete Cosine Transform (DCT) matrix. Assuming that what is recorded at the microphone is a linear addition of the speech and the background noise, the two are combined nonlinearly in a frequency-derived feature domain such as the popular MFCC or LMFB. For log-Mel filterbank outputs of the power spectral density, the combination can be approximated by

Z(k) \approx \log\!\left( 10^{Y(k)} + 10^{N(k)} \right) \qquad (3.28)

where Z(k), Y(k) and N(k) are the k-th coefficients of the observation, the clean speech and the noise, respectively.

Figure 3.3: LMFB-1 (x-axis) vs. LMFB-2 (y-axis) scatter plots for vowels, nasals, fricatives and silence, in clean speech (dots) and with white noise at SNR = 10 dB (crosses)

The expression is precise only if the speech and the noise are in phase at all frequencies that fall within band k. In [116], this model was used as shown, or with a variance model for the discrepancy, for probabilistic modeling of the speech and noise in speech-feature enhancement for ASR applications. The approximation in (3.28) has been shown to do generally well at any SNR level (whereas the variant with the variance model does well at high SNR but worse at low SNR).

If MFCC is used instead, the approximation is

Z \approx F \cdot \log\!\left( 10^{F^{\dagger} Y} + 10^{F^{\dagger} N} \right) \qquad (3.29)

where F is the DCT matrix and F^{\dagger} is its right-inverse, such that F F^{\dagger} = I.
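A small sketch of the feature-domain mixing in (3.28)-(3.29) follows. The base-10 logarithm, the filterbank size and the DCT construction (scipy's orthonormal DCT-II used for F) are generic illustrative choices, not necessarily the exact conventions of the experiments below.

```python
import numpy as np
from scipy.fftpack import dct

def noisy_lmfb(Y, N):
    """Approximate noisy LMFB features from clean-speech and noise LMFB, eq. (3.28).
    Assumes base-10 log features."""
    return np.log10(10.0 ** Y + 10.0 ** N)

def noisy_mfcc(Y, N, n_bands=40, n_ceps=10):
    """Approximate noisy MFCC from clean-speech and noise MFCC, eq. (3.29).

    F maps an LMFB vector to truncated cepstra; F_dag is a right-inverse (F @ F_dag = I).
    """
    F = dct(np.eye(n_bands), norm='ortho', axis=0)[:n_ceps]     # (n_ceps, n_bands) DCT matrix
    F_dag = np.linalg.pinv(F)                                   # (n_bands, n_ceps) right-inverse
    return F @ np.log10(10.0 ** (F_dag @ Y) + 10.0 ** (F_dag @ N))
```

It is exactly these exponentials that make the observation model nonlinear in the hidden state and motivate the unscented treatment in the next section.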

Figure 3.4: MFCC-1 (x-axis) vs. MFCC-2 (y-axis) scatter plots for vowels, nasals, fricatives and silence, in clean speech (dots) and with white noise at SNR = 10 dB (crosses)

3.3.1 GPB1 and GPB2 Inference Using Unscented Kalman Filtering

In addition to the set of Equations (3.1)–(3.4), which models clean speech, we now have another layer of the generative model for the noisy speech observation Z(t), as given by Equation (3.29). As mentioned earlier, we employ GPB1 and GPB2, incorporating the UKF, as approximate inference methods. The basic operation of the GPB(r) algorithm is to approximate the |S|^t Gaussian components at time t by |S|^(r-1) Gaussians using moment matching. The resulting Gaussian after collapsing through moment matching is optimal in the Kullback-Leibler sense [117], and the error can be shown to be bounded despite the approximations [115]. In the forward pass, at each time step, the current state is filtered using each class's component systems. For GPB1 the collapsing is done at each step, keeping only one state estimate, while for GPB2 we keep |S| state estimates, one for each possible discrete state. See [114] for more details on GPB1 and GPB2 for the SSM.

Figure 3.5: Switching state-space graphical model for noisy observation: the discrete state S_t and mixture index m drive the continuous state x_t and the clean feature y_t, which combines with the noise feature N_t to produce the observed Z_t

Unscented Kalman filtering (UKF) is a method of state inference in nonlinear dynamical systems [118] [119]. Instead of linearizing the nonlinearity, as in extended Kalman filtering (EKF), the UKF updates the states by passing a deterministically sampled set of points that characterize the current state's distribution through the dynamic system and approximating the filtered distribution from them. The accuracy is generally up to second order, which is better than the EKF while demanding a similar computational load. In fact, our experiments using the EKF for the state update were unsuccessful without regularization, partly due to numerical instability of the Jacobian caused mainly by the exponentials in the observation models (3.28) and (3.29).

For GPB1, the UKF can be applied directly in the filtering step, giving filtered hidden state and covariance estimates [119]. Smoothing is not allowed in GPB1, but we can approximate it by simply repeating the filtering process in a time-reversed manner; the smoothed posterior is then calculated from the filtered posteriors and the transition matrix, backward in time. GPB2, on the other hand, allows smoothing, but this requires an estimate of the cross-covariance between adjacent time steps, which is non-standard for the UKF. It can be shown that the filtered cross-covariance, V_{t,t-1|t}, can be expressed as

V_{t,t-1|t} = A_t V_{t,t|t} - K_t\, E[(z_t - \hat{z}_t^-)(x_{t-1} - \hat{x}_{t-1}^-)] \qquad (3.30)

E[(z_t - \hat{z}_t^-)(x_{t-1} - \hat{x}_{t-1}^-)] \approx \sum_i \mathcal{W}_i^{(c)}\, [(\mathcal{Z}_t)_i - z_t^-]\,[(\mathcal{X}_{t-1})_i - x_{t-1}^-]^T \qquad (3.31)

where \mathcal{W}_i^{(c)} is the conventional unscented-transform weight for the i-th sample point and (\mathcal{Z}_t)_i is the unscented-transform output point from its corresponding "sigma point" (\mathcal{X}_t)_i; z_t^- is the predicted observation and x_{t-1}^- is the predicted state from the previous time step. The last bracket in (3.31) can be stored directly, for each previous time step, as the difference. Due to the linearity of the hidden continuous-state dynamic model, the smoothing is exactly the same as in conventional Kalman smoothing. For details of the other steps of the UKF, see [119]. In this work, the noise feature in Equation (3.29) is also assumed to be deterministic, characterized only by its mean for simplicity. Although the initial continuous states and covariances are not generally important, it was found that using initial state and covariance estimates from training (Equations (3.24) and (3.25)) consistently gives better accuracy, especially at low SNR, than estimating them by reverse projection from the first observation frame, albeit only by a fraction of a percentage point.

At each step in both the forward and the backward pass, the multiple Gaussians resulting from the propagation of different components need to be collapsed in order to keep the number of distributions finite. Experimentally, collapsing components within the same class before collapsing across classes gives the same performance as collapsing all components simultaneously; only the latter is used for the reported results. Also, the combined likelihood of class S at time frame t is calculated from

L(S) = \sum_m \alpha_m\, L(m) \qquad (3.32)
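A brief sketch of the unscented transform used above: sigma points are drawn deterministically from the current state distribution, pushed through the nonlinear observation model of (3.28), and recombined with the standard weights. The dimensions and the scaling parameters (alpha, beta, kappa) are generic defaults, not tuned values from this work.

```python
import numpy as np

def sigma_points(mean, cov, alpha=1e-3, beta=2.0, kappa=0.0):
    """Standard unscented-transform sigma points and weights for a Gaussian state."""
    n = len(mean)
    lam = alpha**2 * (n + kappa) - n
    S = np.linalg.cholesky((n + lam) * cov)
    pts = np.vstack([mean, mean + S.T, mean - S.T])          # 2n + 1 points
    wm = np.full(2 * n + 1, 1.0 / (2 * (n + lam)))           # mean weights
    wc = wm.copy()                                           # covariance weights
    wm[0] = lam / (n + lam)
    wc[0] = lam / (n + lam) + (1 - alpha**2 + beta)
    return pts, wm, wc

def unscented_observation(x_mean, x_cov, C, D, noise_mean):
    """Propagate the state distribution through the nonlinear LMFB noise model
    z = log10(10**(C x + D) + 10**N), cf. (3.2) and (3.28).  Returns the predicted
    observation mean, its covariance contribution, and the state-observation
    cross-covariance used for the Kalman gain and for (3.31)."""
    pts, wm, wc = sigma_points(x_mean, x_cov)
    Z = np.array([np.log10(10.0 ** (C @ p + D) + 10.0 ** noise_mean) for p in pts])
    z_mean = wm @ Z
    z_cov = sum(w * np.outer(z - z_mean, z - z_mean) for w, z in zip(wc, Z))
    xz_cov = sum(w * np.outer(p - x_mean, z - z_mean) for w, p, z in zip(wc, pts, Z))
    return z_mean, z_cov, xz_cov
```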

3.4 Experiments and Results

3.4.1 Evaluation Method

The features are extracted frame by frame using a frame length of 20 ms with 10 ms overlap and a Hamming window. The dimension of the features is kept at 10 for both LMFB and MFCC, while the hidden continuous state dimension is 2. Note, however, that the MFCC are derived from a 40-band Mel filterbank before being reduced to 10 dimensions via the DCT and truncation. The classification accuracy is defined as the percentage of frames correctly classified with respect to their ground-truth labels. The ground-truth label of a frame is the class that occupies the majority of that frame, according to the manually segmented phoneme boundaries given in the database. Note that this measure favors an algorithm which does well on the vowel class, since more of the frames encountered are vowels; this should be considered reasonable for real-world applications rather than an unfair bias. For comparison with the past research in [111], we first use the same training set of ten female speakers from DR1 of the TRAIN portion of the TIMIT database [120] and the remaining four female speakers for testing. The phone classes are: 1) vowels, semi-vowels and glides; 2) nasals; 3) fricatives and stop releases; and 4) stop closures and pauses.
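A small sketch of the frame labeling and accuracy metric just described follows; the frame length and hop are the stated settings, while the label format and helper names are ours.

```python
import numpy as np

FRAME = 0.020   # 20 ms frames
HOP = 0.010     # 10 ms hop (frames overlap by half)

def frame_labels(phone_segments, n_frames):
    """Majority-vote ground-truth class per frame.

    phone_segments : list of (start_sec, end_sec, class_id) from the manual
    segmentation, with class_id already mapped to the four classes."""
    n_class = max(c for _, _, c in phone_segments) + 1
    labels = np.zeros(n_frames, dtype=int)
    for i in range(n_frames):
        t0, t1 = i * HOP, i * HOP + FRAME
        overlap = np.zeros(n_class)
        for s, e, c in phone_segments:
            overlap[c] += max(0.0, min(t1, e) - max(t0, s))   # overlap duration with this frame
        labels[i] = int(np.argmax(overlap))                   # class occupying most of the frame
    return labels

def frame_accuracy(predicted, truth):
    """Percentage of frames whose predicted class matches the ground truth."""
    predicted, truth = np.asarray(predicted), np.asarray(truth)
    return 100.0 * np.mean(predicted == truth)
```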

3.4.2 Clean Speech Segmentation

Table 3.1 shows the accuracy of various classification methods in the clean-speech test, using different numbers of components. Both GPB1 and GPB2 outperform the approximate Viterbi algorithm used in [111], which keeps only the maximum-likelihood node path at each forward-pass step. GPB2 performs slightly better than GPB1, and the performance increases with the number of mixture components at the expense of more computation. In fact, approximate Viterbi fails dramatically with MFCC. Note, however, that in [111] approximate Viterbi was used more successfully, as confirmed by our own experiment, with line spectral frequency features, which should be more linear than both LMFB and MFCC but have no closed-form expression in noise. Nevertheless, using a mixture model yields encouraging results. For a complete picture, Table 3.2 also shows the by-class results for the best total accuracy achieved, obtained using GPB2(6), where 6 is the number of mixture components, with MFCC as features. The first column is the true phone class, while the columns to the right contain the percentage of frames recognized as each category. The vowel class is recognized most accurately, perhaps due to its strong formant features and high energy, which provide robustness. On the other hand, the training data also contain more voiced sounds, which must have contributed to this and may help explain the asymmetry of the misrecognition about the diagonal seen in Table 3.2.

%        AV     GPB2   GPB2(2)  GPB2(4)  GPB2(6)
LMFB    64.5    76.9    78.6     80.4     81.9
MFCC     N/A    81.2    82.7     82.8     82.9

Table 3.1: Clean-speech phone classification accuracy in % for the original approximate Viterbi (AV), the single-Gaussian GPB2, and GPB2 with a mixture model of M = 2, 4 and 6 components (in parentheses), using LMFB and MFCC as features.

3.4.3 Comparison with HMM

To show the benefit of having hidden continuous-state dynamics underneath the discrete state transitions, the output matrix can simply be set to zero so that the continuous states cannot affect the likelihood calculation, reducing the model to a conventional HMM; the HMM is therefore a special case of the SSM. The accuracy is then compared across the number of mixture components, as shown in Table 3.3. For the same number of components modeling the observation distribution, the SSM outperforms the HMM.

%         Vowels  Nasals  Fric.  Sil.
Vowels      92      4       4      0
Nasals      37     59       3      1
Fric.       14      2      77      6
Sil.         3      5      18     73

Table 3.2: The confusion matrix in % using GPB2(6) with MFCC as features on the clean-speech test set

However, it must be mentioned that, in each case, the SSM requires more memory, having the extra model layer, and takes longer to compute. The HMM inference method, the Viterbi algorithm, is also exact, whereas for the SSM only approximate inference is possible. The continuous state trajectories of the utterance "Which church do the Smiths worship in?", inferred by single-component GPB2, are displayed in Figure 3.6. It can be seen that the trajectories of the voiced classes (1 and 2) are smoother than those of the unvoiced classes (3 and 4), reflecting the nature of the dynamics and the process-error variance of each class, which helps constrain the dynamics during inference.

% / #Gaussians    1      2      4      6
HMM             80.7   81.2   81.5   82.8
SSM             81.2   82.7   82.8   82.9

Table 3.3: Comparison of accuracy between HMM and SSM using the same number of mixture components for clean-speech classification

Figure 3.6: Continuous state trajectories x(1) and x(2) of the utterance "Which church do the Smiths worship in?", with the inferred phone-class sequence shown along the time axis

3.4.4 Noisy Speech Segmentation

For noisy speech segmentation, the noise feature means, N(k), are assumed to be stationary and are estimated from some silence frames at the start of the file. Table 3.4 shows the results using GPB1 and GPB2 with the mixture model and UKF noise compensation, averaged over SNR = 0, 5, ..., 20 dB.

                     white           car            babble
%                 LMFB   MFCC   LMFB   MFCC    LMFB   MFCC
GPB2              49.3   38.4   49.6   32.4    46.5   37.9
Feat. Sub.+GPB2   57.0   60.8   53.7   55.1    54.3   55.8
Denoise+GPB2      56.0   54.8   57.2   51.6    55.5   55.1
GPB1-UKF          64.6   65.4   65.3   66.9    63.6   66.1
GPB2-UKF          62.8   67.2   66.5   68.4    64.9   67.8
GPB2-UKF(2)       71.1   69.4   67.4   68.5    64.9   64.9
GPB2-UKF(4)       70.5   66.8   68.7   69.1    64.7   62.3
GPB2-UKF(6)       70.4   63.3   67.2   60.4    64.9   57.0

Table 3.4: Noisy speech phone classification accuracy in % using the basic single-Gaussian GPB2, noise feature subtraction followed by basic GPB2, front-end denoising followed by basic GPB2, and GPB1-UKF and GPB2-UKF with mixture models of M = 2, 4 and 6 components, in various types of noise. For each noise type, the first and second columns use LMFB and MFCC features, respectively.

  %         Vowels   Nasals   Fric.   Sil.
  Vowels      89        8       1       2
  Nasals      35       54       7       4
  Fric.       14       11      45      30
  Sil.         3        8      30      59

Table 3.5: The confusion matrix using GPB-UKF(4) and LMFB as features in a car noise environment with SNR = 10 dB

The results in Table 3.4 show that the proposed scheme can greatly improve the overall frame recognition rate for both GPB1 and GPB2. GPB2 outperforms GPB1 in all types of noise and at all SNRs (not all SNRs are shown) due to its use of more information and history keeping. While the two-mixture model consistently outperforms the single-Gaussian one at all SNRs, the benefit of having more than two mixture components is less certain, especially at low SNR, when it comes to noisy segmentation. This is

probably due to more confusion under the approximated compensation and estimation, or perhaps under-training due to the larger number of parameters. For comparison, simply estimating the clean speech features by solving (3.29) before using basic GPB2 improves the results only moderately, and so does applying front-end denoising to the corrupted speech using the MMSE log-spectral estimator of Ephraim and Malah [94]. Front-end denoising will obviously remove some information, such as fricatives, and in some cases also add artifacts before classification. A closer inspection of the results using the Ephraim and Malah algorithm seems to indicate the latter, with the vowel class being misrecognized as fricatives more than the other way round. The results for the relatively non-stationary babble noise still show a comparable improvement. An example of a by-class accuracy result is shown in Table 3.5 for car noise. As expected, despite the compensation, the fricative class is misclassified as silence while the other classes are greatly improved. Especially in the case of white noise, most of the salient high-frequency features of the class-three (fricative) members are buried in noise, making them difficult to recognize. It should be mentioned that without noise compensation, most phones are mistaken for the fricative class in white noise, for the vowel class in car noise and for nasals in babble noise, depending on the characteristics of each noise.

Figure 3.7 shows an example utterance in car noise and its classification results using GPB2-UKF(4) with MFCC as features. Note that fricatives such as /s/ are hardly evident in noise at the rather low sampling rate of 8 kHz, yet can still be detected. Using LMFB instead can only detect fricatives such as /sh/ and /ch/. The same utterance in white noise at SNR = 10 dB is shown in Figure 3.8.

For comparison, examples of the results of applying GPB2 directly to the uncompensated features and of GPB2-UKF(4) are shown for the various kinds of noise in Tables 3.6 and 3.7, with LMFB and MFCC as features. The corresponding clean results are those already shown in Table 3.2. The detailed comparison of all algorithms with different numbers of mixture components, expanding the summary shown in Table 3.4, is illustrated in Figure 3.9.

  (a) White noise + basic GPB2              (b) White noise + GPB2-UKF
  %        Vowels  Nasals  Fric.  Sil.      %        Vowels  Nasals  Fric.  Sil.
  Vowels     34      0      66     0        Vowels     89      8       2     1
  Nasals      6      0      94     0        Nasals     49     41       6     4
  Fric.       1      0      99     0        Fric.      14      5      28    54
  Sil.        0      0      97     3        Sil.        1      3      11    85

  (c) Car noise + basic GPB2                (d) Car noise + GPB2-UKF
  %        Vowels  Nasals  Fric.  Sil.      %        Vowels  Nasals  Fric.  Sil.
  Vowels      9      0      91     0        Vowels     90      6       3     1
  Nasals      1      0      99     0        Nasals     41     44      11     4
  Fric.       1      0      99     0        Fric.      14      7      53    26
  Sil.        0      0      98     2        Sil.        2      8      42    49

  (e) Babble noise + basic GPB2             (f) Babble noise + GPB2-UKF
  %        Vowels  Nasals  Fric.  Sil.      %        Vowels  Nasals  Fric.  Sil.
  Vowels     36      1      63     0        Vowels     87     10       3     0
  Nasals     13      6      80     0        Nasals     50     39       8     2
  Fric.       5      0      95     0        Fric.      28     17      49     6
  Sil.        1      0      96     2        Sil.       21     22      23    34

Table 3.6: The confusion matrix using uncompensated GPB2 and GPB2-UKF(4) on speech embedded in white noise, car noise and babble noise, all at SNR = 10 dB. MFCC is used as features.

  (a) White noise + basic GPB2              (b) White noise + GPB2-UKF
  %        Vowels  Nasals  Fric.  Sil.      %        Vowels  Nasals  Fric.  Sil.
  Vowels     34      0      66     0        Vowels     89      8       2     1
  Nasals      6      0      94     0        Nasals     49     41       6     4
  Fric.       1      0      99     0        Fric.      14      5      28    54
  Sil.        0      0      97     3        Sil.        1      3      11    85

  (c) Car noise + basic GPB2                (d) Car noise + GPB2-UKF
  %        Vowels  Nasals  Fric.  Sil.      %        Vowels  Nasals  Fric.  Sil.
  Vowels      9      0      91     0        Vowels     90      6       3     1
  Nasals      1      0      99     0        Nasals     41     44      11     4
  Fric.       1      0      99     0        Fric.      14      7      53    26
  Sil.        0      0      98     2        Sil.        2      8      42    49

  (e) Babble noise + basic GPB2             (f) Babble noise + GPB2-UKF
  %        Vowels  Nasals  Fric.  Sil.      %        Vowels  Nasals  Fric.  Sil.
  Vowels     36      1      63     0        Vowels     87     10       3     0
  Nasals     13      6      80     0        Nasals     50     39       8     2
  Fric.       5      0      95     0        Fric.      28     17      49     6
  Sil.        1      0      96     2        Sil.       21     22      23    34

Table 3.7: The confusion matrix using uncompensated GPB2 and GPB2-UKF(4) on speech embedded in white noise, car noise and babble noise, all at SNR = 10 dB. LMFB is used as features.

"Which church do the Smiths worship in?" 4000

3000

2000

1000 Frequency (Hz) 0 0 0.5 1 1.5 2 Time (sec) 4

3

2 Class Label

1 0 0.5 1 1.5 2

Figure 3.7: A spectrogram plot of an utterance in 10 dB SNR car noise (top) and the classification results (blue/circles) using GPB2-UKF(4) along with ground truth labeling (red/crosses) (bottom)

3.4.5 Other Class Definitions and Hierarchical Decision

It can be envisioned that a fully production-mode-dependent parametric coding may one day be possible, where the production model changes with the type of phone to be generated. Applying the above algorithms to a larger number of phone classes is straightforward, and the results are shown here for completeness, under otherwise the same conditions as those in Table 3.5. Although far from perfect, most misclassifications are still among like types, for example, vowels for liquids/glides and stops for affricates/fricatives. Stop closures are very hard to detect but may not be that necessary to identify anyway. Obviously, the narrower classes can be combined to give a classification into broader classes, e.g., combining vowels and liquids/glides, resulting in a hierarchical classification. Experiments combining the results in Table 3.8 into the original four-class segmentation do not show a significant difference in performance. However, a deeper investigation could shed more light on the details and is left for future work.

"Which church do the Smiths worship in?" 4000

3000

2000

1000 Frequency (Hz) 0 0 0.5 1 1.5 2 Time (sec) 4

3

2 Class Label

1 0 0.5 1 1.5 2

Figure 3.8: A spectrogram plot of an utterance in 10 dB SNR white noise (top) and the classification results (blue/circles) using GPB2-UKF(4) along with ground truth labeling (red/crosses) (bottom)

3.5 Conclusion

In this chapter, a robust speech phone segmentation has been developed and investigated. The switching state-space model used to describe the speech feature production process allows smooth and robust decisions based on the maximum a posteriori principle. The contribution in this work is an extension of a simple SSM to a mixture-of-Gaussians observation model and the use of GPB approximate inference methods, which show improved results in the clean speech application over a single Gaussian model and the previous inference technique. When noise is present, a nonlinear approximate inference method called unscented Kalman filtering has been adopted, which has been shown to improve the segmentation results over the basic uncompensated inference algorithms. Although multiple components in the observation model do not improve the recognition results in noise, unlike for clean speech, the basic work investigated has shown the possibility of recognizing the periods of each phone, which could be used for the ultimate phone-based coder even in noisy situations. Whether this level of accuracy will be enough in practice, however, remains to be seen. Most likely, a mechanism to ensure graceful degradation from misclassification should always be used.


Figure 3.9: The accuracy (%) for various noise types and SNRs using LMFB and MFCC as features

  %             Vow.   Liq./Gl.   Nas.   Affr./Fric.   Stops   Stops Cl.   Sil.
  Vow.           71       18        8         1           0         1        0
  Liq./Gl.       39       45       12         2           1         0        1
  Nas.           30       10       49         6           3         1        2
  Affr./Fric.     3        1        6        58          19         1       13
  Stops          13        6       16        32          15         1       15
  Stops Cl.      21       10       13        21          11         8       16
  Sil.            5        1        9        24          18         2       42

Table 3.8: The confusion matrix using GPB-UKF(4) and LMFB as features in a car noise environment of SNR = 10 dB for the seven phone classes. Vow. = Vowels, Liq./Gl. = Liquids and Glides, Nas. = Nasals, Affr./Fric. = Affricates and Fricatives, Stops Cl. = Stops closures and Sil. = Silence or pause.

Chapter 4

Robust Glottal Closure Instant Detection

The knowledge of glottal closure and opening instants (GCI/GOI) is useful for many speech analysis applications. Closed-phase linear prediction has been shown to give more accurate vocal tract filter estimates owing to the small source-tract interaction during those periods. Speaker identification, pathological voice detection and pitch tracking can also benefit from the knowledge of GCIs. Clearly, low bit-rate coding using pitch-synchronous waveform modeling of the voice, such as that presented in Chapter 5, requires GCI detection. Given the perceptual significance of aspiration noise injection near glottal closure and opening, knowing the instants where this noise should be modeled and synthesized is also important. Numerous techniques for GCI detection have been proposed in the past. Most are based on detecting discontinuities or peaks in some measurements of voiced speech. The peaks of the LPC residual energy were used in [121], while in [122] an abrupt change in the Kalman filtering innovation error indicates such events. In [123], an energy-weighted group delay (EWGD) was proposed. This method provides very effective and efficient GCI detection based on a closed-form expression for the group delay of a minimum-phase system used to model the vocal tract. Its false alarm rate can, however, increase significantly in the presence of noise. While a large averaging window in the EWGD method improves the false alarm rate (FAR), it may compromise the missed detection rate (MDR) as well as


the accuracy, which refers to how close the detections are to the actual values (see a quantitative analysis in [124]). Recently, a dynamic programming approach combining cost functions related to pitch deviation and quasi-periodicity with other heuristic cost terms was proposed [125], giving a considerable improvement over earlier methods.

GCI detection is a classical problem that has received a great deal of attention. The detection of GOIs, on the other hand, has received relatively little consideration. One reason is that the opening instant is much harder to identify or even define. Another reason is its less crucial effect on the perceptual quality of a voice. While the closures generate significant excitation with an abrupt change in the waveform, the opening tends to be gradual. A number of works have examined how various tools respond to the opening instants [126] [127], but none has been thoroughly evaluated on a real speech corpus.

In this chapter, dynamic programming is employed to solve for the global closed-phase and open-phase segmentation based on a polynomial parametric model of the derivative glottal waveform described in Chapter 2 and its quasi-periodicity. Some physical considerations are used, for example, constraints on the parameter values to ensure a physically plausible glottal waveshape, the range of human pitch, and the constraint of alternating closed and open phases to boost detection of the GOIs. The results allow for the subsequent pitch-synchronous joint estimation of source and filter presented in Chapter 5. In fact, the algorithm identifies not only the GCIs but also the elusive GOIs and, as a by-product, the parameters of the glottal excitation waveform. Together, they can represent voiced sounds in a way that achieves the modification flexibility described earlier in this dissertation, including pitch shifting, time scaling and breathiness modification. The chapter describes the newly proposed algorithm with its own experimental evaluation of detection accuracy in comparison to a classical GCI detection method, and concludes with a description of how the algorithm can yield voice parameters amenable to modification.

4.1 Dynamic Programming for GCI/GOI Detection

Dynamic programming (DP) calculates the optimal path through a lattice of candidate points, where the decision at any particular point depends only on the objective function of that point and the previous ones. For our problem, the cost function to be minimized is based on a combination of polynomial waveform fitting and the quasi-periodic nature of the derivative glottal waveform, expected from inverse filtering the speech signal using LPC estimates. General speech cannot be expected to always follow such a simple model, and the inverse filtering is expected to be imperfect, so the algorithm tries to provide some robustness against these anomalies. This is achieved through flexibility in the cost function as well as other means of constraining the problem toward the right solution. The composite cost function of a segment of the inverse-filtered waveform, between points i and j, is given by

C(i, j) = C_P(i, j) + \lambda \cdot C_Q(i, j) \qquad (4.1)

where C_P and C_Q are the waveform error cost function and the cross-correlation measurement respectively. The constant \lambda provides the relative weighting between the two cost terms.

4.1.1 Waveform Error Cost

The derivative glottal waveform can be represented by Rosenberg's model [55], reproduced here from Section 2.1.

g(n) = \begin{cases} 2 a_g\, n/f_s - 3 b_g\, (n/f_s)^2, & 0 \le n \le T_0 \cdot OQ \cdot f_s \\ 0, & T_0 \cdot OQ \cdot f_s \le n \le T_0 \cdot f_s \end{cases} \qquad (4.2)

a_g = \frac{27 \cdot AV}{4 \cdot (OQ^2 \cdot T_0)} \qquad (4.3)


Figure 4.1: Two periods of Rosenberg’s derivative glottal waveform model showing the period (T0), glottal closure instant (GCI), closed phase (CP) and open phase (OP)

b_g = \frac{27 \cdot AV}{4 \cdot (OQ^3 \cdot T_0^2)} \qquad (4.4)

where T_0 is the fundamental period, f_s is the sampling frequency, AV is the amplitude-of-voicing parameter, and OQ is the open quotient of the glottal source. An example of the waveform is shown in Figure 4.1.
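As a concrete illustration, the following minimal NumPy sketch synthesizes one period of the waveform directly from Equations (4.2)-(4.4); the function name and the example parameter values are illustrative only, not part of the original system.

    import numpy as np

    def rosenberg_derivative(T0, OQ, AV, fs):
        """One period of Rosenberg's derivative glottal waveform, following
        Eqs. (4.2)-(4.4): a quadratic open phase followed by a zero-valued
        closed phase."""
        a_g = 27.0 * AV / (4.0 * OQ**2 * T0)            # Eq. (4.3)
        b_g = 27.0 * AV / (4.0 * OQ**3 * T0**2)         # Eq. (4.4)
        n = np.arange(int(round(T0 * fs)))
        t = n / fs
        open_phase = n <= OQ * T0 * fs                  # 0 <= n <= T0*OQ*fs
        return np.where(open_phase, 2.0 * a_g * t - 3.0 * b_g * t**2, 0.0)

    # Example: a 100 Hz period with open quotient 0.6 at 16 kHz.
    g = rosenberg_derivative(T0=0.01, OQ=0.6, AV=1.0, fs=16000)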

Let s indicate the phase of the glottal waveform, where s = 0 is the closed phase (CP) and s = 1 is the open phase (OP). Given the LPC residual signal x, the waveform-fitting error cost function of a segment between time samples t_1 and t_2, for both CP and OP, is the squared L2-norm

C_{P,s}(t_1, t_2) = \| x_{t_1:t_2} - \hat{x} \|_2^2 \qquad (4.5)

where for s = 0, \hat{x} is the mean of x_{t_1:t_2}. Even though the model expects this to be zero, using the mean gives extra robustness to non-ideal waveforms. For s = 1, \hat{x} is generated from the first line of Equation (4.2) using a_g and b_g estimated by least-squares regression. That is,

\hat{\theta}_{LS} = \begin{bmatrix} 0 & 0 \\ 2 \cdot 1/f_s & -3 \cdot (1/f_s)^2 \\ \vdots & \vdots \\ 2 \cdot N/f_s & -3 \cdot (N/f_s)^2 \end{bmatrix}^{\dagger} \cdot\; x_{t_1:t_2} \qquad (4.6)

where \theta = [a_g \;\; b_g]^T and N = t_2 - t_1 + 1. If the estimate of a_g or b_g is less than zero, the waveform is not in the right shape and the cost is set to a large number. To increase robustness, a local search is also performed for the open-phase fitting, while for the closed phase, an offset at the beginning of the segment is allowed to avoid the spikes which commonly occur in the LPC residual.
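A minimal sketch of this least-squares fit, using NumPy's lstsq in place of the explicit pseudo-inverse of Equation (4.6); the handling of the time origin and of the large-cost fallback is simplified and the function names are illustrative.

    import numpy as np

    def open_phase_fit(x_seg, fs):
        """Least-squares fit of (a_g, b_g) of the quadratic open-phase model to
        an inverse-filtered segment (cf. Eq. (4.6)), returning the fitted
        parameters and the residual cost C_P of Eq. (4.5)."""
        N = len(x_seg)
        t = np.arange(N) / fs                        # time origin at the segment start
        D = np.column_stack((2.0 * t, -3.0 * t**2))  # design matrix of Eq. (4.6)
        theta, _, _, _ = np.linalg.lstsq(D, x_seg, rcond=None)
        a_g, b_g = theta
        if a_g <= 0.0 or b_g <= 0.0:                 # wrong waveshape: large cost
            return a_g, b_g, np.inf
        resid = x_seg - D @ theta
        return a_g, b_g, float(resid @ resid)

    def closed_phase_cost(x_seg):
        """Closed-phase cost of Eq. (4.5) with s = 0: deviation from the mean."""
        resid = x_seg - np.mean(x_seg)
        return float(resid @ resid)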

4.1.2 Cross-correlation Cost

To tap into the quasi-periodicity expected in the LPC residual waveform, the cross-correlation cost between two segments, used similarly in [125], is

C_{Q,s}(x_1, x_2, \gamma) = -\max\big(\mathrm{CrossCorr}_\gamma(x_1, x_2)\big) \qquad (4.7)

where \gamma is the maximum lag used (set to correspond to 1.5 ms in the experiments). Similarity between the two waveforms will result in a large cross-correlation and hence a large

negative cost term, CQ.

Both cost terms, C_P and C_Q, are sums of time-sample products, making them comparable in magnitude and suggesting that the relative weighting between them should be on the order of one. Nevertheless, fine-tuning is possible, and the experiments to find its optimal value with respect to the test set will be presented in Section 4.2.2.
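The sketch below illustrates the cross-correlation cost of Equation (4.7) and the composite cost of Equation (4.1); the raw (unnormalized) sum of products is used, matching the discussion of comparable magnitudes above, and the function names are illustrative.

    import numpy as np

    def cross_correlation_cost(x1, x2, max_lag):
        """C_Q of Eq. (4.7): negative maximum cross-correlation between two
        segments over lags up to max_lag samples (about 1.5 ms here)."""
        L = min(len(x1), len(x2))
        best = -np.inf
        for lag in range(-max_lag, max_lag + 1):
            if lag >= 0:
                a, b = x1[lag:L], x2[:L - lag]
            else:
                a, b = x1[:L + lag], x2[-lag:L]
            if len(a) > 0:
                best = max(best, float(np.dot(a, b)))
        return -best

    def composite_cost(cp, cq, lam=1.0):
        """Composite segment cost of Eq. (4.1) with relative weight lambda."""
        return cp + lam * cq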

4.1.3 Filling The Dynamic Programming Grids

At each candidate point j, for each mode s between i and j, the cost functions with respect to a set of preceding marker candidates i \in \mathcal{I}_s are calculated by

J_s(i, j) = \min_{k \in \mathcal{K}_{s'}} \{ J_{s'}(k, i) + C_s(i, j) \} \qquad (4.8)

where s' is the negation of s, which means that the constraint of alternating open and closed phases is enforced. The sets \mathcal{I}_s and \mathcal{K}_{s'} contain the allowable candidates for each mode, e.g., \mathcal{I}_s = \{ i \mid j - \Delta_s < i < j \}, where \Delta_s bounds the allowable duration of phase s.

C_Q(i, j) in Equation (4.1), used in Equation (4.8), actually depends on two further points back. In order to keep the global optimality, we have to look back two points and select the point that gives the minimum total cost at point k in (4.8). It is the same back-tracing used in general DP, with point k now being the end-point. The first period may be assumed to be CP for simplicity. However, in our experiments, either case is allowed for truncation robustness. This is useful especially in segment-based or phonetic coding, where the segmentation may not be sample-accurate. Two tables are therefore populated and, at the end of the segment, the smaller end-cost of the two sequences at the ends of the two tables is chosen and back-tracing is executed to obtain alternating GCIs and GOIs that are globally optimal, subject to the constraints. Note that due to these constraint limitations, Viterbi-like decoding as described may not be truly optimal. Keeping more than one trace at any point may give better results at the expense of much greater computation and memory requirements.
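A simplified sketch of this grid-filling and back-tracing procedure is given below. The cost_fn callback and the duration bounds min_dur/max_dur are illustrative stand-ins for the scheme described above; in particular, the two-point look-back needed for the exact C_Q term is omitted for brevity.

    import numpy as np

    def segment_phases(candidates, cost_fn, min_dur, max_dur):
        """DP over candidate points in the spirit of Eq. (4.8): alternately
        assign closed (s = 0) and open (s = 1) phases, enforcing per-phase
        duration limits, then back-trace the cheapest alternating sequence."""
        K = len(candidates)
        J = np.full((K, 2), np.inf)          # best cost of a phase-s segment ending at j
        back = np.full((K, 2), -1, dtype=int)
        J[0, :] = 0.0                        # either phase may start the segment
        for j in range(1, K):
            for s in (0, 1):
                for i in range(j - 1, -1, -1):
                    dur = candidates[j] - candidates[i]
                    if dur < min_dur[s]:
                        continue
                    if dur > max_dur[s]:
                        break                # candidates are sorted, so stop early
                    total = J[i, 1 - s] + cost_fn(i, j, s)   # previous phase is the other one
                    if total < J[j, s]:
                        J[j, s], back[j, s] = total, i
        s = int(np.argmin(J[K - 1]))         # cheaper of the two end states
        marks, j = [], K - 1
        while j > 0 and back[j, s] >= 0:
            marks.append((candidates[j], s))
            j, s = back[j, s], 1 - s
        return marks[::-1]                   # alternating boundaries in time order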

4.1.4 Candidate Selection

Strictly speaking, all sample points can be considered as candidates for GCIs and GOIs. However, the computation involved would be far too great. A careful selection of candidate points can help reduce the computation time. In fact, it also affects the performance, both in terms of the detection rate trade-off and the accuracy. On the other hand, it is important to make sure that all correct points, both GCIs and GOIs, are at least included as initial candidates. One possible way of reducing the number of candidates is to choose the positive zero-crossings of the inverse-filtered signal. To make sure opening-instant candidates are included, an extra point has to be added for every positive zero-crossing. Alternatively, one can first apply the EWGD method, which has been shown to detect impulses at the GCIs as well as some GOIs. By using a narrow averaging window, it is likely that GOIs will be detected. Moreover, a narrow window generates an over-complete set of candidates with higher accuracy for the correct points [124]. The key is to generate a redundant set of candidate points, which also includes the accurate ones, for the DP to choose from. An experiment was carried out to evaluate initial candidate sets against ground truth GCIs, to be described in Section 4.2.1. It was found that using zero-crossings as candidates gives more accurate results but with a large number of candidates. On the other hand, using the EWGD method with a very narrow averaging window, say 0.1 ms, gives far fewer candidates but less accurate ones. We combine the two methods by first performing the narrow-window EWGD method and then looking for the adjacent zero-crossings on its left, say p (potentially the "true" opening instant), and right, say q (potentially a more accurate closing instant). From inspection, sometimes a CP never exists, especially in a female voice. We therefore assign an extra point, q + 1, to the candidate set so that the DP has more choices and, at the same time, allows for no-CP events, which can now be modeled as having a CP duration of one sample. Table 4.1 shows the accuracy measured against the ground truth data and the number of candidates per voice segment. Accuracy is defined as the percentage of those samples, excluding false alarms and misses, that fall within 0.25 ms of the references. The RMS measures the standard deviation of all errors from matched samples¹. The combined method (hybrid) achieves a good compromise between the first two methods. An example of a period of the LPC residual signal, along with the initialization points from all schemes, is shown in Figure 4.2.

Figure 4.2: Example of the three initial candidate selection schemes: zero-crossings (triangle), energy-weighted group delay (circle) and hybrid (star).

Table 4.1: Initial candidates' characteristics.
                    Zero X   EWGD (.1 ms)   Hybrid
  Accuracy (%)       80.2        46.4        78.8
  RMS (ms)           0.29        0.34        0.30
  # per segment       310         102         156
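A minimal sketch of the hybrid candidate selection just described; the EWGD peak locations are assumed to be supplied by a separate detector (ewgd_peaks), since that method is not re-implemented here.

    import numpy as np

    def hybrid_candidates(residual, ewgd_peaks):
        """Hybrid initial-candidate selection: for each impulse located by an
        EWGD-style detector (ewgd_peaks, assumed given), keep the adjacent
        positive zero-crossings on its left (p) and right (q), plus q + 1 so
        that a one-sample closed phase remains possible."""
        residual = np.asarray(residual, dtype=float)
        zc = np.where((residual[:-1] < 0) & (residual[1:] >= 0))[0] + 1
        cands = set()
        for peak in ewgd_peaks:
            left, right = zc[zc <= peak], zc[zc > peak]
            if len(left) > 0:
                cands.add(int(left[-1]))     # p: candidate opening instant
            if len(right) > 0:
                q = int(right[0])            # q: refined closing instant
                cands.update((q, q + 1))
        return sorted(cands)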

4.2 Experiments

4.2.1 Evaluation Test Set

The algorithm's performance will first be evaluated in the context of GCI/GOI detection accuracy. The test set used here is the Keele database [128], widely used for pitch estimation evaluation. It consists of 20-kHz sampling-rate recordings of a roughly 30-second phonetically balanced passage read by five females and five males. The electroglottograph (EGG) signal is also simultaneously recorded for each speaker. Pitch masks at a frame resolution have been derived from the EGG, and our evaluation periods ignore a one-frame margin on both sides of the voiced periods.

¹Naturally, these cases give a zero missed-detection rate and a large number of false alarms; characteristics of a good candidate initialization.


Figure 4.3: Examples of the DEGG waveform (dashed) and the inverse-filtered derivative glottal waveform (solid) for a female (top) and male (bottom). Reference GCIs and GOIs derived from peak picking of the DEGG are shown as (x) and (o) respectively

Our reference GCIs are generated by finding the peaks of the derivative EGG (DEGG) [52], which are very clear. The ground truths for GOIs are, on the other hand, much harder to identify. Our references are again based on the minimal peaks of the DEGG, which correspond to the inflexion points in the EGG (see Figure 4.3; note that the constant time lag between observing the speech air pressure near the lips and the EGG recording at the throat is not compensated). Although easier to identify, such minimal peaks do not correspond to the "opening" instants for our modeling purposes, although they may be useful in other applications. The results presented in Table 4.3 will therefore be only an approximation after forced alignment (also a DP) between these different definitions of opening, assuming a constant offset as evident in Figure 4.3. The forced alignment is performed on the GCI results to compensate for the fixed delay between the speech pressure wave, which has to travel through the vocal tract, and the EGG measurement. Given two sets of markers, the reference and the target, forced alignment calculates the best pairing, subject to some cost measures.

For each target marker t_i and reference marker t_j, the cost function of the DP grid at point (i, j) is given by

J(i, j) = \min \{\, J(i-1, j-1) + |t_i - t_j|^2,\; J(i, j-1) + J_{MD},\; J(i-1, j) + J_{FA} \,\} \qquad (4.9)

In our experiments, the cost of a missed detection, J_{MD}, and of a false alarm, J_{FA}, are treated equally as the square of the maximum half glottal period in the reference. Back-tracing results in the best alignment subject to the cost measures above. This setting has been verified to match human manual-alignment expectations in many simulated cases. The false alarm rate (FAR) and missed detection rate (MDR) are defined as the number of outputs and of references, respectively, left unmatched after the forced alignment with respect to the ground truth. Accuracy and RMS are defined as described earlier in Section 4.1.4.
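The forced alignment itself is a small DP; the following sketch implements the recursion of Equation (4.9) with cumulative boundary penalties, which is one reasonable (but not the only) choice for initializing the grid.

    import numpy as np

    def forced_align(target, reference, j_md, j_fa):
        """Forced alignment of detected markers to reference markers using the
        recursion of Eq. (4.9); unmatched references count as missed detections
        and unmatched targets as false alarms."""
        nt, nr = len(target), len(reference)
        J = np.zeros((nt + 1, nr + 1))
        J[1:, 0] = np.arange(1, nt + 1) * j_fa
        J[0, 1:] = np.arange(1, nr + 1) * j_md
        for i in range(1, nt + 1):
            for j in range(1, nr + 1):
                match = J[i - 1, j - 1] + (target[i - 1] - reference[j - 1]) ** 2
                J[i, j] = min(match, J[i, j - 1] + j_md, J[i - 1, j] + j_fa)
        pairs, i, j = [], nt, nr             # back-trace the optimal pairing
        while i > 0 and j > 0:
            match = J[i - 1, j - 1] + (target[i - 1] - reference[j - 1]) ** 2
            if J[i, j] == match:
                pairs.append((i - 1, j - 1)); i -= 1; j -= 1
            elif J[i, j] == J[i, j - 1] + j_md:
                j -= 1                       # missed detection
            else:
                i -= 1                       # false alarm
        return J[nt, nr], pairs[::-1]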

4.2.2 Results and Discussion

In order to find the optimal value of the relative penalty weight λ, batch experiments were run on the male and female data while varying the value of λ in each experiment. Figure 4.4 shows the operating curves of FAR versus MDR using λ in the range of 0 to 8 for the females and the males in each column. Since the two measures are in conflict with respect to the weighting λ, the best value is the point on the operating curve closest to the origin. However, given the insensitivity of MDR to λ relative to FAR shown in the figures, we may use a larger value to suppress false alarms at the price of a small increase in missed detections. In the case of the male data the optimal point is fortunately rather clear at λ = 0.75, while for the female data it is at around λ = 0.5. Looking at the accuracy of the detected points and the STD error in Figure 4.5, the optimal value of λ = 0.75 is confirmed for the male data and 0.5 for the female data. Table 4.2 shows the results of our algorithm and of the EWGD method using the window size that gives the best comparable compromise on the operating curve for each gender (see Figure 4.6). The proposed DP waveform-fitting (DPWF) achieves FAR and MDR comparable to the EWGD. The accuracy, however, is clearly superior. Figure 4.7 shows a typical example of a successful detection of both GCIs and


Figure 4.4: FAR-MDR operating curve for females (left) and males (right) using different values of the cost term relative weight λ.

GOIs when applied to a male normal (modal) and pressed singing voice. On the other hand, Figure 4.8 illustrates another example of successful performance in the case of a female voice where the CP is virtually non-existent and in the presence of waveform deviations from the ideal model. A closer inspection of the data in general reveals that most errors occur during transitional periods where the LPC residual does not follow Rosenberg's model and where many spurious peaks occur. This may also explain the poorer results for the male data. In many cases of the male voice, the inverse-filtered LPC residue contains large deviations from the ideal waveform during the expected closed phase, looking a lot like another glottal period. As a result, FAR is high, which is perhaps also why the optimal weighting value for the male data is higher than that for the females. The female voices do not have such a problem and false alarms are fewer, while the potential missed detections from not seeing a closed phase when there is none are well managed by the technique described in the algorithm. The results for GOI identification are given in Table 4.3. FAR and MDR are comparable to those of the GCIs, as should be expected due to the forced alternation. The accuracy, however, is much lower for the same stringent accuracy criterion used for the GCIs, illustrating the difficulty.


Figure 4.5: Accuracy and gross error performance for females (left) and males (right) using different values of the cost term relative weight λ.

Figure 4.10 shows the error histogram distributions (after forced alignment) of the GOI detection results, which have much larger spreads than those of the GCIs shown in Figure 4.9. From inspection, most identifications are nevertheless reliable enough for waveform coding purposes, although parameter smoothing might be required for a good sound. It is possible that other models could be used to fit the waveform. Spikes in the RMS error between the predicted waveform and the actual one may be used to indicate identification errors. Systematic errors of identifying an opening as another closing are quite common, and post-processing to correct such obvious errors should improve the performance. Other cost terms such as those used in [125] could also help further, especially a pitch deviation cost. On the other hand, adding more cost terms requires careful determination of the penalty weighting, which could be circumstantial and difficult to tune. The two cost terms presented should not require heavy tuning due to their comparable order of magnitude. The algorithm also has the weak point of being polarity dependent. However, this should be easy to spot using a simple thresholding. It is also interesting to note that, due to the waveform-fitting nature, during low-energy periods the missed detections are inaudible.

Table 4.2: GCI identification results comparison.
  Performance               EWGD    DPWF
  FAR (%)        Females     2.6     2.3
                 Males       6.0     6.1
  MDR (%)        Females     4.6     4.3
                 Males       2.6     2.0
  Accuracy (%)   Females    52.8    80.9
                 Males      64.1    69.3
  RMS (ms)       Females     2.0     0.5
                 Males       1.0     0.9

Table 4.3: GOI identification results.
  Performance               DPWF
  FAR (%)        Females     2.6
                 Males       2.9
  MDR (%)        Females     4.2
                 Males       4.6
  Accuracy (%)   Females    27.0
                 Males      14.6
  RMS (ms)       Females     1.3
                 Males       1.4


Figure 4.6: FAR-MDR operating curve of the EWGD method for males and females using different averaging window sizes.

By the same token, applying this algorithm over a plosive consonant adjacent to some voiced sounds results in a very intelligible resynthesis. After all, the generated waveform is still impulsive, with a magnitude proportional to the actual observation of the LPC residual.

The algorithm was also tested at SNR = 30 dB and the results changed only slightly, with FAR = 4.2%, MDR = 1.9%, accuracy = 82.6% and RMS error = 0.78 ms (females only). The disadvantage of using the derivative glottal waveform is evident in this case where, besides less accurate LPC estimates of the vocal tract filter, the resulting inverse-filtered residue is also much noisier due to the highpass nature of the inverse filtering. Using the integral version of the residue, and possibly a parametric model of the glottal excitation waveform like the LF model, could mitigate this problem. However, the integral signal will have an indeterminate DC drift, which makes waveform fitting difficult. In addition, the non-linear model of the glottal waveform is more difficult to fit via least-squares than the derivative glottal waveform presented.


Figure 4.7: Normal and pressed voice with GCIs (x) and GOIs (o) identified

Figure 4.8: A female voice with GCIs (x) and GOIs (o) identified.


Figure 4.9: Error histogram of GCI detection for the male and female test set in milliseconds.


Figure 4.10: Error histogram of GOI detection for the male and female test set in milliseconds.

4.2.3 Parametric Voice Synthesis and Voice Modification

For a clean speech input, or even a moderately noisy one, the voice can be resynthesized from the glottal instant segmentation, the corresponding waveform parameters just derived and the vocal tract AR filter from LPC. The synthesis system is shown in Figure 4.11. For controllable coding, it makes sense to encode the primary variables, such as the GCIs, GOIs, a_g and b_g, as the fundamental period (T_0), the open quotient (OQ) and the amplitude (AV), which are more intuitive to control when voice modification is desired.

Often, however, their initial estimates do not follow a smooth trajectory and manifest themselves in the resynthesized sound as annoying clicks or unnatural roughness. A simple heuristic is to filter the parameters using a narrow tapering window, such as a Hann window, for smoothness. Alternatively, median filtering can be used for extra robustness against outliers, or something in between, such as the trimmed mean, which takes the mean of the central portion of the population, leaving out the outliers. Preliminary experiments have been conducted on the whole set of data. Defining the gross error of the pitch estimate as those fixed frame-rate pitch estimates, taken from the closest glottal period, that deviate more than five cents² from the reference, median filtering can reduce such error from about 22% to about 18%.


Figure 4.11: Voice analysis system diagram

However, the pitches associated with correct GCI estimates will be compromised, and the total standard deviation of the non-gross errors will increase. The trimmed means were found not to improve either aspect. A combination of median filtering to remove outliers and a narrow-window smoothing for a smooth contour may be most desirable.
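A minimal SciPy sketch of the smoothing heuristics just discussed, combining median filtering against outliers with a short Hann-window average; the window lengths are illustrative, not the values used in the experiments.

    import numpy as np
    from scipy.signal import medfilt
    from scipy.signal.windows import hann

    def smooth_trajectory(raw, median_len=5, hann_len=5):
        """Smooth a per-period parameter trajectory (e.g. T0, OQ or AV):
        median filtering first to reject outliers, then a short Hann-window
        weighted average for a smooth contour.  Edges are zero-padded by
        np.convolve in this sketch."""
        x = medfilt(np.asarray(raw, dtype=float), kernel_size=median_len)
        w = hann(hann_len)
        return np.convolve(x, w / w.sum(), mode="same")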

An example of a male singing voice at a sampling rate of 16 kHz is shown as a spectrogram in Figure 4.12. The raw parameter trajectory estimates superimposed on the ones smoothed by a Hann window of length three are shown in Figure 4.13. Figure 4.14 shows the spectrogram of the original male utterance of “Where are you?” and the reconstruction using parameters obtained as described. The estimated parameters after smoothing are shown in Figure 4.15.

The basic voice modifications that can be easily implemented are pitch-shifting, time-scaling and open-quotient modification. A good combination is the key to changing the emotion and voice quality. How to achieve a specific voice effect is beyond the scope of this dissertation. The schemes to do the basic operations are described in Chapter 6.

²One cent is precisely equal to the 1200th root of 2. The just-noticeable difference for human pitch perception is roughly 5 cents over the whole range of pitches.


Figure 4.12: Spectrogram of the original male singing voice (top) and the parametric construction (bottom)


Figure 4.13: Raw (dots) and smoothed (solid) parameter estimates for a male singing voice in normal mode 78 CHAPTER 4. ROBUST GLOTTAL CLOSURE INSTANT DETECTION


Figure 4.14: Spectrogram of the original male utterance “Where are you?” (top) and the parametric construction (bottom).


Figure 4.15: Smoothed parameter estimates for male voiced utterance "Where are you?".

4.3 Conclusion

In this chapter, a dynamic programming algorithm which simultaneously identifies the GCIs, the GOIs and the glottal waveform parameters has been presented. The evaluation results, compared with the EGG physical measurements, show identification rates for the GCIs comparable to a classical method using group delay. The accuracy of the identified GCIs, however, is improved. Experiments also show reasonable estimates of the GOIs. The algorithm can provide initial estimates for further pitch-synchronous iterative estimation, especially when noise causes inaccurate estimation of the vocal tract filter in the first place. By itself, it already enables parametric coding of the voice excitation source which is amenable to various types of expressiveness modification.

Chapter 5

Probabilistic Framework for Voice Parameter Extraction

When the voice recording is corrupted by noise from the recording instrument or from the environment, the previous method of simple linear prediction, inverse filtering and glottal derivative waveform fitting will not work. While many techniques of vocal tract filter and glottal source estimation applicable to clean voice have been developed [68] [129], the presence of generic noise in the observation invalidates procedures such as pre-emphasis or the assumption of white Gaussian noise often used in least-squares fitting. Approaches which try to extract glottal source parameters by first inverse filtering the voice samples with vocal tract filter estimates are also invalid [130] [131] [108] [58]. One way to address this is to pre-process the signal with a noise suppression algorithm, which has been the standard practice for speech coding applications in real-world situations. However, doing so risks losing information, especially the high-frequency components, which tend to get suppressed as noise in classical noise reduction algorithms such as Wiener filtering and basic Kalman filtering [94] [96] [102] [97]. Furthermore, it can add distortion to the signal, leaving the subsequent parameter estimation with bad initial samples. Many estimation procedures for the vocal tract all-pole filter from noisy observations also do not simultaneously yield excitation parameters [132] [133]. There is virtually no algorithm which tries to automatically estimate glottal source parameters directly from the noisy observations.


Joint source-filter estimation exists but only on an assumed-clean signal [68] [134].

In this chapter, a probabilistic framework is presented as a way to model this noisy observation holistically, according to our sound source generation model previously shown and a simple autoregressive noise model. The parameters involved in the model include the glottal closure and opening instants, the vocal tract filter coefficients and the glottal source waveform parameters. Like the work shown in Chapter 3, it is an attempt to integrate noise compensation tightly within the estimation framework, instead of performing noise suppression and estimation separately. An EM algorithm [135] framework for iterative inference and learning of these parameters, in the presence of noise, is then described. Due to the complexity of the problem space, global inference is not tractable. An instance of approximate inference and subsequent learning is therefore presented which allows an extraction of voice parameters from a noisy observation to be used for structured coding applications. The EM framework allows many useful constraints, especially physically motivated ones, to be applied during iteration while keeping the monotonic convergence property [136]. Additionally, it provides a single framework for joint glottal segmentation and joint source-filter estimation of a voiced sound. By doing glottal period segmentation jointly with vocal tract estimation (pitch-synchronous estimation), the harmonic-peak over-fitting problem common to fixed-interval estimation should be alleviated. A joint estimation of the source and the vocal tract filter should also result in better estimates, as it effectively takes into account the problematic source-tract interaction to a certain extent [137] [138]. The estimated parameters can then be used to resynthesize the original signal. Due to its parametric nature, the voice can also be modified in many ways during resynthesis, and a great amount of compression can be achieved.

Another motivation for this parametric approach is that most classical single-channel noise suppression algorithms are based on filtering the input signal by some form of filter estimate. This inevitably results in musical noise, which occurs when noise gets through the filter due to incorrect filter estimates that also vary rapidly in time. One way to alleviate this problem is to pay more attention to the perceptual side of the filtering process, for example, using masking properties to determine

an appropriate level of suppression [99], or trying to model spectral peaks and valleys more accurately subject to some perceptual cost function [139] [140]. With our structured model, the parameters of the sound source are instead extracted and the original source is resynthesized. This gets rid of the musical noise problem completely, although at the price of suffering from the effects of the model being too simple. One can imagine, however, that one day, when all the necessary models and parameters have been identified, a robust algorithm will be used to extract them and achieve an indistinguishable reconstruction of the original voice. This chapter investigates the quality of the parametric reconstruction to see whether our current algorithm is good enough for human listeners. Previously, a similar idea of sound source modeling for noise reduction has been used for de-hissing of a string instrument [141], and an example of structured audio coding for a two-voice guitar has shown an excellent result in [142]. In both cases, only the vibrating fundamental frequency and the frequency-dependent loss filter per loop of wave propagation need to be estimated. The human voice is most certainly much more challenging. A spectral template approach to noise reduction by reconstruction has been attempted in [101], although without much success. To be sure, the EM algorithm with Kalman smoothing has been used in speech enhancement applications before [97]. Here, we extend beyond most Kalman filtering-based enhancement algorithms by modeling the source input using a parametric form of Rosenberg's model as described earlier in Chapter 2, instead of just white noise or a pulse train as used in [97]. On the other hand, joint estimation of glottal source and filter has also been presented by Lu [68], where convex optimization is used to arrive at the joint estimation. An extension to a frequency-warped version has been shown in [91], while a multiple-period least-squares solution has been presented in [129]. In contrast to these deterministic approaches, this work instead employs a statistical model to include the uncertainty due to the observation noise. Different types of noise are hence allowed. The incorporation of the glottal segmentation into the framework is also new compared with the others. First, a generative model of the voice in noise will be presented, which shows the relationships among the variables and helps in formulating the probabilistic inference and learning framework. An EM framework is then described along with an approximate


Figure 5.1: Noisy voice system diagram

inference which can make use of our previous glottal segmentation algorithm. The rest of the EM steps will then be described, detailing how we infer the clean voice and learn the model parameters. Different methods to achieve this are compared and some results are shown.

5.1 Generative Model of Voice in Noise

A generative model helps determine the relationships and dependencies among the different variables. In this case, we need to estimate the glottal source parameters and the vocal tract filter coefficients, which generate a clean output sound that is combined with noise to give the actual observation, as shown in Figure 5.1. However, due to the segmental model, the glottal periods need to be identified simultaneously.

Let b = \{b_0, b_1, \ldots, b_K\} represent a variable indicating the glottal period boundary sample-indices. Given the segmentation estimate \hat{b}, the state-space model of each glottal period, say between b_k and b_{k+1} - 1, is represented by

x_{n+1} = A_s x_n + B u_n + v_n
w_{n+1} = A_w w_n + \epsilon_n \qquad (5.1)
y_n = C z_n + r_n

where

z_n = \begin{bmatrix} x_n^T & w_n^T \end{bmatrix}^T \qquad (5.2)

A_s = \begin{bmatrix} \alpha_s^T \\ I_{P_s-1} \;\; 0 \end{bmatrix}, \quad A_w = \begin{bmatrix} \alpha_w^T \\ I_{P_w-1} \;\; 0 \end{bmatrix} \qquad (5.3)

C = \begin{bmatrix} 1 & 0_{P_s-1}^T & 1 & 0_{P_w-1}^T \end{bmatrix} \qquad (5.4)

B = \begin{bmatrix} a_g & b_g \\ 0 & 0 \end{bmatrix}, \quad Q_m = \begin{bmatrix} q_m & 0 \\ 0 & 0 \end{bmatrix} \qquad (5.5)

v \sim \mathcal{N}(0, Q_s), \quad \epsilon \sim \mathcal{N}(0, Q_w), \quad r \sim \mathcal{N}(0, R) \qquad (5.6)

x(n) = \begin{bmatrix} x(n) & x(n-1) & \cdots & x(n-P_s+1) \end{bmatrix}^T \qquad (5.7)

w(n) = \begin{bmatrix} w(n) & w(n-1) & \cdots & w(n-P_w+1) \end{bmatrix}^T \qquad (5.8)

\alpha_s = \begin{bmatrix} a_{s,1} & a_{s,2} & \cdots & a_{s,P_s} \end{bmatrix}, \quad \alpha_w = \begin{bmatrix} a_{w,1} & a_{w,2} & \cdots & a_{w,P_w} \end{bmatrix} \qquad (5.9)

u(n) = \begin{cases} \begin{bmatrix} 2 \cdot (n - n_o)/f_s \\ -3 \cdot ((n - n_o)/f_s)^2 \end{bmatrix}, & n_o \le n \le N \\ \begin{bmatrix} 0 & 0 \end{bmatrix}^T, & \text{otherwise} \end{cases} \qquad (5.10)

where n is the sample index within one glottal period frame. A_s contains the AR coefficients of the vocal tract filter, \alpha_s, on its top row, and A_w contains those of the generic colored noise, \alpha_w, which is assumed to be stationary throughout the observation. The product B u_n gives the glottal excitation g_n as given by Equation (2.1). Since we consider a period from one glottal closure instant (GCI) to the next, Equation (2.1) is modified so that there is an integer offset n_o, which is the starting index of the glottal source open phase. n_o therefore determines the open quotient (OQ) such that OQ = (T_0 - n_o)/T_0. z is a state variable consisting of a length-P_s clean speech vector concatenated with a length-P_w noise vector, to be inferred.


Figure 5.2: Generative model of voice in noise

The noisy observation y_n is then the sum of the two instantaneous samples, plus a small observation model error r_n. Q_s and Q_w have only one non-zero element, q_s and q_w respectively, at the top-left; they are forced to have this form during estimation for stability. Ideally, if the deterministic model of the voice production were accurate, v_n could be thought of as the aspiration noise. In practice, however, there is always a modeling error, and using q_s as the pure aspiration noise energy tends to be an overestimate. The variance R of r_n is fixed to a small number. All processes are driven by Gaussian variables and, because of the linearity of the system, remain Gaussian. The parameters to be estimated are referred to collectively as \theta = \{\alpha_s, a_g, b_g, n_o, \alpha_w, q_s, q_w\}. The graphical model of the variables and parameters described in the previous set of equations is shown in Figure 5.2, where t indicates the time-step through which each state variable evolves. Each instant is governed by the current frame parameters, \theta, and the segmentation, b.
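To make the model concrete, the sketch below assembles the matrices of Equations (5.1)-(5.10) for one glottal period; the helper names are illustrative, and the companion-form construction simply follows the conventions of Equations (5.3)-(5.5) and (5.10).

    import numpy as np

    def companion(alpha):
        """Companion-form transition matrix with the AR coefficients on the
        top row and a shifted identity below, as in Eq. (5.3)."""
        P = len(alpha)
        A = np.zeros((P, P))
        A[0, :] = alpha
        A[1:, :-1] = np.eye(P - 1)
        return A

    def build_state_space(alpha_s, alpha_w, a_g, b_g, q_s, q_w):
        """Assemble the per-period matrices of Eq. (5.1): speech and noise
        dynamics (A_s, A_w), excitation input B, observation row C and the
        sparse process covariances Q_s, Q_w."""
        Ps, Pw = len(alpha_s), len(alpha_w)
        A_s, A_w = companion(alpha_s), companion(alpha_w)
        B = np.zeros((Ps, 2)); B[0, :] = (a_g, b_g)          # Eq. (5.5)
        C = np.zeros(Ps + Pw); C[0] = 1.0; C[Ps] = 1.0       # y = x + w, Eq. (5.4)
        Q_s = np.zeros((Ps, Ps)); Q_s[0, 0] = q_s
        Q_w = np.zeros((Pw, Pw)); Q_w[0, 0] = q_w
        return A_s, A_w, B, C, Q_s, Q_w

    def glottal_input(n, n_o, fs):
        """Driving input u(n) of Eq. (5.10): zero before the opening offset
        n_o, then the terms that B maps to 2*a_g*t - 3*b_g*t^2."""
        if n < n_o:
            return np.zeros(2)
        t = (n - n_o) / fs
        return np.array([2.0 * t, -3.0 * t**2])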

5.2 Inference and Learning

Having outlined the generative model, an algorithm within an EM framework can be devised for iterative inference (E step) and learning (M step). Inference is needed since we have an incomplete set of data, that is, some variables are hidden. According

to the model, the hidden state variables are the segmentation boundary variables, b, the clean speech, collectively denoted by X, and the noise, collectively denoted by W. Let Y represent the observation and Z the combined state variable of X and W. According to the EM principle [135] [143],

L(\theta) = \log P(Y \mid U, \theta)
          = \log \int_b \int_Z P(Z, b, Y \mid U, \theta)\, dZ\, db
          = \log \int_b \int_Z Q(Z, b)\, \frac{P(Z, b, Y \mid U, \theta)}{Q(Z, b)}\, dZ\, db
          \ge \int_b \int_Z Q(Z, b) \log \frac{P(Z, b, Y \mid U, \theta)}{Q(Z, b)}\, dZ\, db
          = \int_b \int_Z Q(Z, b) \log P(Z, b, Y \mid U, \theta)\, dZ\, db - \int_b \int_Z Q(Z, b) \log Q(Z, b)\, dZ\, db
          = \mathcal{F}(Q, \theta) \qquad (5.11)

which is true for any distribution Q; the inequality follows from Jensen's inequality. For details on the EM algorithm, see [135]. The following steps are alternately iterated until convergence.

E step: \; Q_{k+1} \leftarrow \arg\max_Q \mathcal{F}(Q, \theta_k)
M step: \; \theta_{k+1} \leftarrow \arg\max_\theta \mathcal{F}(Q_{k+1}, \theta)

5.2.1 E step

It can be shown that the E step is satisfied by Q(Z, b) = P(Z, b \mid Y, U, \theta). Unfortunately, in our model, inference of b is intractable. This is because, in general, the boundary can be any combination of samples; the dimension of the problem space is large and a combinatorial explosion ensues (even if some constraints on the space are imposed, as will be shown later). Instead, approximate inference is needed. In this case, the sufficient statistics required are those of p(Z, b \mid Y, \theta). If we approximate this distribution as concentrated on one particular point, a delta function in the space spanned by b, say

at its maximum a posteriori (MAP) estimate, we can marginalize Equation (5.11) to obtain

\int_b p(Z, b \mid Y, \theta)\, db = \int_b p(Z \mid b, Y, \theta) \cdot p(b \mid Y, \theta)\, db
                                   = \int_b p(Z \mid b, Y, \theta) \cdot \delta(b - b_{MAP})\, db \qquad (5.12)
                                   = p(Z \mid b_{MAP}, Y, \theta)

where b_{MAP} can be found from

b_{MAP} = \arg\max_b\, p(b \mid Y, \theta) \qquad (5.13)

5.2.2 MAP Estimate of The Boundary Variable

The conditional probability p(b \mid X, \theta) is used to approximate the posterior probability p(b \mid Y, \theta) originally shown in Equation (5.12), where X represents the clean speech estimates. Following the work in [144] and assuming a Markov relationship, the posterior can be expressed as

p(b \mid X, \theta) \propto p(X \mid b, \theta) \cdot p(b) = p(x_{b_0}^{b_1}) \prod_{k=2}^{K} p(x_{b_{k-1}}^{b_k} \mid x_{b_{k-2}}^{b_{k-1}}, \theta)\; p(b) \qquad (5.14)

Many forms of probability function can be used. One possibility is the simple harmonic model and spectral voice template used in [144] to calculate each conditional probability and the initial probability, respectively. Alternatively, due to the conditioning on \theta, we can also perform approximate inference on the derivative glottal waveform obtained by inverse-filtering the current estimate of the clean speech with the current estimate of the AR parameters. The algorithm presented in Chapter 4 can then be used to find the MAP estimates of b. Instead of using p(X \mid b, \theta), we then use p(G \mid b), where G represents the glottal derivative waveform. The method in Chapter 4 also gives initial estimates of n_o for the next stage [145]. Theoretically, the segmentation candidates consist of all samples in the observation and the problem space is over all combinations

of these candidates. In practice, we can limit the candidates to some sensible set of points, e.g., the positive zero-crossings of the estimated clean speech, or the points given by the hybrid scheme used in Chapter 4, to reduce computation. The prior p(b) can then be modeled as uniform over these initial candidate points.

5.2.3 Kalman Smoothing

Given the MAP segmentation solution, the sufficient statistics of the marginalized distribution in (5.12) can now be derived using Kalman filtering or smoothing and the model in (5.1) [146]. While filtering uses only past samples, smoothing exploits future samples as well, generally giving better estimates with lower variance and sharper tracking response. Since the state-space model is linear and driven only by Gaussian random variables, all the state variables are also Gaussian distributed. Therefore, only up to second-order statistics are needed at each time point. Overall, the E step consists of an inference of the boundary variables by finding their MAP estimates as in Section 5.2.2, and then performing Kalman smoothing on the state-space model in (5.1) to infer the clean voice and the additive colored noise. Here, however, the Kalman smoothing and the ML estimation are performed in their own EM iterations, before any re-segmentation is done. Details of Kalman filtering and smoothing can be found in Appendix A.

5.2.4 M step

During the M step, period-by-period maximum likelihood (ML) estimates of the state-space model parameters can be derived using the statistics from the E step. Since the posterior p(Z, b \mid Y, \theta) gives the lower bound on the likelihood estimate, deriving parameter estimates which maximize this likelihood means the likelihood will monotonically increase after each iteration.

5.2.5 Joint Source-Filter Parameter Maximum Likelihood Estimation

At each iteration, given the sufficient statistics at each time point within a glottal period, the voice parameters, denoted by \theta_s = \begin{bmatrix} \alpha_s^T & a_g & b_g \end{bmatrix}^T, and the process noise variance q_s can be estimated in the maximum likelihood fashion by

\hat{\theta}_s = J^{-1} D \qquad (5.15)

where

J = \sum_{n=2}^{N} \begin{bmatrix} V_0(n-1) & \hat{x}(n-1)\, u^T(n) \\ u(n)\, \hat{x}^T(n-1) & u(n)\, u^T(n) \end{bmatrix} \qquad (5.16)

D = \sum_{n=2}^{N} \begin{bmatrix} v_1(n) \\ \hat{x}_1(n)\, u(n) \end{bmatrix} \qquad (5.17)

\hat{q}_s = \frac{1}{N-1} \sum_{n=2}^{N} \left( V_0^{(1,1)}(n) - 2\, \theta_s^T \begin{bmatrix} v_1(n) \\ \hat{x}_1(n)\, u(n) \end{bmatrix} + \theta_s^T J(n)\, \theta_s \right) \qquad (5.18)

where V_0(n) and V_1(n) are the covariances given by the Kalman smoother, v_1(n) is the first column of V_1^T(n), and V_0^{(1,1)}(n) is the top-left-corner entry of V_0(n), i.e.,

V_0(n) = \langle x(n)\, x^T(n) \rangle \qquad (5.19)

V_1(n) = \langle x(n)\, x^T(n-1) \rangle \qquad (5.20)

where \langle \cdot \rangle denotes the posterior averages from the Kalman smoother of the E step. A detailed derivation of the ML estimate expressions can be found in Appendix B.
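A sketch of this M-step update is given below, accumulating J and D of Equations (5.16)-(5.17) from the smoother statistics and reusing the accumulated sums for q_s in Equation (5.18); it assumes the convention that the first state element is the current clean-speech sample, and the function name is illustrative.

    import numpy as np

    def ml_voice_update(xhat, V0, V1, u, Ps):
        """M-step update of theta_s = [alpha_s; a_g; b_g] and q_s from the
        Kalman-smoother statistics (Eqs. (5.15)-(5.18)).  xhat[n], V0[n] and
        V1[n] are the smoothed state means and (lagged) second moments for one
        glottal period, and u[n] is the glottal input of Eq. (5.10)."""
        N, du = len(xhat), len(u[0])
        J = np.zeros((Ps + du, Ps + du))
        D = np.zeros(Ps + du)
        sum_v00 = 0.0
        for n in range(1, N):
            xu = np.outer(xhat[n - 1][:Ps], u[n])
            J[:Ps, :Ps] += V0[n - 1][:Ps, :Ps]
            J[:Ps, Ps:] += xu
            J[Ps:, :Ps] += xu.T
            J[Ps:, Ps:] += np.outer(u[n], u[n])
            D[:Ps] += V1[n][0, :Ps]            # v_1(n): first column of V_1^T(n)
            D[Ps:] += xhat[n][0] * u[n]
            sum_v00 += V0[n][0, 0]
        theta = np.linalg.solve(J, D)          # Eq. (5.15)
        # Eq. (5.18), using the accumulated sums (theta is constant over n).
        q_s = (sum_v00 - 2.0 * theta @ D + theta @ J @ theta) / (N - 1)
        return theta, max(q_s, 0.0)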

Unfortunately, the OQ-related parameter n_o enters the minimized error nonlinearly; we therefore need to do a grid search. Provided that its initialization value is close to the solution, only a few points in the vicinity of the current estimate need to be evaluated. However, it is important to note that for a given set of

current estimates of the other parameters at one iteration, even a grid search on n_o might not lead to an eventually globally optimal solution. A way to ensure optimality is to

start by optimizing the other parameters for all values of n_o and then pick the one that gives the highest likelihood at convergence. This method is, however, computationally expensive. We therefore content ourselves with getting a good initialization and only making sure the likelihood function increases at each iteration, doing the grid search given the current ML estimates of the other parameters. Keeping the likelihood function increasing, without actually maximizing it, still guarantees EM's monotonic convergence.

Many constraints can be applied during this M step without jeopardizing the monotonic convergence property of EM, as long as the constraints still yield a greater likelihood value at each iteration. In the basic joint estimation, the physical waveshape constraints applied are a_g > 0 and b_g > 0, which give the quadratic shape, and 2 a_g f_s / (3 b_g) < N - n_o, which ensures a non-zero return-phase time before glottal closure, where N is the glottal period length. Filter stability is also checked, and unstable poles are reflected inside the unit circle. The open quotient also has to be within a certain range, which is usually tighter than 0 to 1.

For the algorithm just described, the model parameters are estimated independently for each glottal period. This results in significant variation due to noise interference. Perceptually perfect estimation is hardly possible, even with probabilistically optimal estimates, when it comes to real signals. Before resynthesis, small-window weighted-average smoothing is found to be necessary in order to avoid unpleasant artifacts. After all, smoothness is an important cue in voice production. While most speech enhancement techniques which involve some smoothness constraints report better filtered speech results [147] [148], the requirement for resynthesis is much more stringent, since a few samples of jitter, or over-smoothing, can be heard very easily. In addition, high-frequency noise, coupled with the weak high-frequency content of voice, will result in poor high-frequency formant estimates. Using Rosenberg's derivative glottal waveform model, which has an intrinsic -6 dB/octave spectral roll-off, the lack of higher formants in the filter estimates results in a more buzzy and muffled resynthesis. In the following sections, two methods are proposed as ways to constrain the voice

parameter estimates at a certain glottal period by their neighbors. One is the penalized maximum likelihood, which weights the original likelihood function by a form of prior model, acting as a penalty against too much deviation. The other method employs a dynamic model of the parameters themselves, which constrains the trajectories of these parameters to be smooth. Lastly, a codebook constraint is presented which helps solve the problem of bad high-frequency formant filter estimates.

5.2.6 Penalized Maximum Likelihood

The penalized maximum likelihood adds a penalty term to the original likelihood expression to be maximized (see Appendix B). In this case, we chose a Gaussian error between the estimate and a measure of the parameters' mean derived from neighboring values. This can also be viewed as a probabilistic prior model of the parameters. The mean of this prior is taken to be the half-Hann-window-weighted average of previous frames' estimates. Since averaging AR coefficients does not guarantee stability, the modeling is done instead on the line spectral frequencies (LSF). The covariance matrix of the LSF could be converted back to the AR domain using the unscented transform [118] to retain accuracy. The penalized log-likelihood term pertaining to the voice parameters becomes

L(\theta_s) \propto \frac{1}{q_s} \sum_{n=2}^{N} \left( x_n - \theta_s^T d_n \right)^2 + \lambda \cdot (\theta_s - \bar{\theta}_s)^T \Sigma_{\theta_s}^{-1} (\theta_s - \bar{\theta}_s) \qquad (5.21)

where \theta_s = [\alpha_s^T \; a_g \; b_g]^T and d_n = [x_{n-1}^T \; u_n^T]^T. Here \lambda is the normalizing constant or penalty weight, which includes the frame-length factor. The estimated mean and covariance of the prior are \bar{\theta}_s and \Sigma_{\theta_s}, respectively. The constrained estimate of \theta_s is then given by

\hat{\theta}_s^{PML} = \left[ J + \lambda \cdot q_s \cdot \Sigma_{\theta_s}^{-1} \right]^{-1} \left[ D + \lambda \cdot q_s \cdot \Sigma_{\theta_s}^{-1} \bar{\theta}_s \right] \qquad (5.22)

where J and D are as given in (5.16) and (5.17).

The contribution of the prior to the estimate is controlled by \lambda, q_s, and the prior covariance. Since q_s decreases with each iteration, the contribution of the prior is reduced once the evidence becomes more reliable. From experiments, this constraint is found to help only during the first few iterations; in the last iterations its contribution becomes small and the normal likelihood term dominates. This results in more robust estimates, but they are not smooth enough for resynthesis purposes. Another smoothing mechanism is still needed, as described next.
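As a concrete illustration, the constrained M-step update of (5.22) reduces to a single regularized linear solve. The sketch below assumes that the sufficient statistics J and D and the half-Hann prior mean have already been computed; the variable names are illustrative.

```python
import numpy as np

def penalized_ml_update(J, D, theta_prior_mean, sigma_prior, lam, q_s):
    """Constrained M-step estimate of theta_s, following Eq. (5.22).

    J, D             : sufficient statistics from the E-step (Eqs. 5.16, 5.17)
    theta_prior_mean : half-Hann-weighted average of previous frames' estimates
    sigma_prior      : prior covariance (e.g. converted from the LSF domain)
    lam, q_s         : penalty weight and current process-error variance
    """
    prior_prec = lam * q_s * np.linalg.inv(sigma_prior)
    lhs = J + prior_prec
    rhs = D + prior_prec @ theta_prior_mean
    return np.linalg.solve(lhs, rhs)
```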

5.2.7 Post-Kalman Smoothing Integration

Assuming slowly varying parameters, a state-space model of the parameters’ dynamics can be constructed as

\tilde{\theta}_{n+1} = F \cdot \tilde{\theta}_n + e_n, \qquad \theta_n = \tilde{\theta}_n + \eta_n \qquad (5.23)

where \theta is the parameter's ML estimate and \tilde{\theta} is the smoothed estimate. Both the process error, e_n, and the observation error, \eta_n, are Gaussian. Given the state-space model parameters, smoothed vocal parameters can be found by Kalman smoothing over a sequence of raw ML estimates. An EM algorithm can also be used here to determine an appropriate matrix F and the noise variances for each segment where the parameter dynamics can be assumed stationary. From experiments, however, a simple drift model, where F is an identity matrix, is sufficient and seems more robust. The process covariance determines the inertia, or degree of smoothness, of the inferred estimates (the smaller, the smoother, reflecting a stronger prior belief) and is fixed in the experiments. Kalman smoothing is applied over a segment where the dynamics of the parameters are expected to be stationary. It is performed only on a_g, b_g and the AR coefficients, although it could also be applied to n_o. The AR coefficients are first converted to LSF before smoothing, again because of the LSF's stability guarantee. Additionally, the LSF appear to vary more smoothly and to have better interpolation properties. This is illustrated in Figure 5.3 by their lower standard deviation over time for a male singing voice, compared to other LPC-related coefficients such as the reflection coefficients (RF), the area function and the AR coefficients themselves. The resulting algorithm will be referred to as EM-PKS, in contrast to the basic EM joint estimation described in Section 5.2.5.
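Under the drift model (F equal to the identity), the smoothing amounts to a forward Kalman filter followed by a Rauch-Tung-Striebel backward pass on each parameter trajectory. The following sketch, with assumed fixed process and observation variances, shows the idea; it is not the exact implementation used in the experiments.

```python
import numpy as np

def drift_smooth(raw, q=1e-4, r=1e-2):
    """Fixed-interval (RTS) Kalman smoothing of a scalar parameter trajectory
    under the drift model of Eq. (5.23) with F = I.

    raw : raw per-period ML estimates (e.g. one LSF trajectory)
    q   : process variance; smaller values give a smoother trajectory
    r   : observation (estimation-noise) variance
    Both variances are assumed fixed here, as in the experiments.
    """
    raw = np.asarray(raw, dtype=float)
    n = raw.size
    xp, pp = np.zeros(n), np.zeros(n)   # predicted mean / variance
    xf, pf = np.zeros(n), np.zeros(n)   # filtered mean / variance

    # forward Kalman filter
    for t in range(n):
        if t == 0:
            xp[t], pp[t] = raw[0], r + q        # simple prior centred on first sample
        else:
            xp[t], pp[t] = xf[t - 1], pf[t - 1] + q   # predict with F = I
        k = pp[t] / (pp[t] + r)                 # Kalman gain
        xf[t] = xp[t] + k * (raw[t] - xp[t])
        pf[t] = (1.0 - k) * pp[t]

    # backward Rauch-Tung-Striebel pass
    xs = xf.copy()
    for t in range(n - 2, -1, -1):
        c = pf[t] / pp[t + 1]                   # smoother gain (F = I)
        xs[t] = xf[t] + c * (xs[t + 1] - xf[t])
    return xs
```

In practice the routine would be run per segment on a_g, b_g and each LSF trajectory, with the LSF converted back to AR coefficients afterwards.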


Figure 5.3: Standard deviation of various LPC-related coefficients extracted from a singing voice using closed-phase covariance LPC.

5.2.8 VQ-Codebook Constraint

In a similar EM-based Kalman filtering framework for speech enhancement, a codebook of AR coefficients trained on clean speech has been shown to provide more robust convergence and better noise suppression performance [149]. A codebook-constrained Wiener filter has also been studied in [150]. In this dissertation, the codebook is meant to help the filter estimates converge to the correct solution as well as to recover the high-frequency detail of the vocal tract configuration that is mostly covered by noise. The codebook is learned using Lloyd or k-means clustering, trained on the LSF extracted by autocorrelation LPC from pre-emphasized clean voice in a reference database. Since the LSF coefficients characterize the shape of the spectral formants, a simple squared error can be used as a distance measure that reflects perception. Alternatively, LPC cepstral coefficients with a squared-error distance can be used, with an equally easy conversion from AR coefficients. However, mixing lower-band estimates with high-band coefficients from the codebook cannot be done directly with cepstral coefficients, so LSF would probably have to be used in that step. The Itakura-Saito distance, which is a good distortion measure for spectral envelopes, can also be used directly with the AR coefficients, although it takes more computation during clustering. Lastly, one could imagine a codebook of vocal tract shapes trained from physical measurements [31] [67]. However, a straightforward conversion to an AR filter does not exist, except for a simple constrained version which is no longer truly physical. For an LSF codebook, at each iteration and each glottal period, the distances between the current estimate of the AR coefficients, up to P_c coefficients, and those in the codebook are calculated. The entry with the minimum distance provides the LSF coefficients from P_c + 1 to P, where P is the total order. The new LSF vector is then converted back to AR coefficients for the next iteration. Due to the quantized nature of the codebook, unless it is very large and a smoothness criterion is used in its selection, post-smoothing like that of Section 5.2.7 is needed to remove discontinuities across frames, which would otherwise manifest themselves in the synthesized output. The resulting algorithm, with appropriate smoothing, will be referred to as EM-VQ to reflect the quantized nature of the codebook.
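The look-up step itself is a nearest-neighbor search in the LSF domain followed by a splice of the high-order coefficients. A minimal sketch, assuming a NumPy codebook array and the squared-error distance on the trusted low-order coefficients (the final sort is just one simple way to keep the spliced LSF vector valid before conversion back to AR coefficients):

```python
import numpy as np

def vq_constrain_lsf(lsf_est, codebook, p_c):
    """One codebook-constrained update: keep the lower p_c LSF coefficients from
    the current estimate and take the upper ones from the nearest codebook entry.

    lsf_est  : current LSF estimate, length P
    codebook : (K, P) array of LSF entries trained on clean voice
    p_c      : number of low-order coefficients trusted from the estimate
    """
    d = np.sum((codebook[:, :p_c] - lsf_est[:p_c]) ** 2, axis=1)
    best = codebook[np.argmin(d)]
    constrained = np.concatenate([lsf_est[:p_c], best[p_c:]])
    # enforce the ascending-order property the LSF must satisfy for a stable filter
    return np.sort(constrained)
```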

5.2.9 Algorithm Initialization

Just like any other ML method, the EM algorithm runs the risk of converging to a suboptimal local maximum, so good initialization is crucial for global convergence. Conventional LPC can be used to initialize the AR coefficients under high-SNR conditions. Alternatively, a noise-suppression pre-processing stage is possible, especially in light of the required segmentation, which performs much worse under noisy conditions. The noise filter and its variance parameter can be estimated from a pause period, say at the beginning of the samples, using simple LPC. OQ can be estimated from the following expression given by Fant [151]

(Figure 5.4 panels: prediction error versus OQ for true OQ = 0.3, 0.6 and 0.8 at F0 = 100 Hz, and OQ = 0.6 at F0 = 200 Hz.)

Figure 5.4: Prediction error surface for different OQ values, all for F0 = 100 Hz, except for the bottom right, where F0 = 200 Hz.

H_1 - H_2 = -6 + 0.27 \exp(5.5 \cdot OQ) \qquad (5.24)

where H_1 and H_2 are the spectral amplitudes of the first and second harmonics. Alternatively, the segmentation algorithm of Chapter 4 can give an accurate glottal opening estimate for n_o, which the grid search during the EM iteration can hardly beat. To illustrate how the grid search on n_o performs, Figure 5.4 shows the prediction error surface, which has an inverse relationship with the likelihood, when all parameters other than OQ are at their true values for a synthetic input. AV = 0.001 is used and the AR coefficients are taken from LPC analysis of a frame of the sound /aa/. Where the range towards OQ = 1 is not shown in the top-left panel, the exponential trend continues. The figure shows that OQ should be over-initialized if in doubt. Given that, from experimental studies [57], physically 0.4 ≤ OQ ≤ 1, a grid search over that range alone suffices to bring the likelihood up. Also note that the error curve is convex in that interval for these examples, implying that a gradient method could be used reliably.
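For the initialization itself, (5.24) can be inverted to obtain OQ from the measured difference between the first two harmonic amplitudes. The sketch below assumes a dB magnitude spectrum and a known F0; reading the harmonics off the nearest FFT bins and the fallback behavior are simplifications for illustration.

```python
import numpy as np

def init_oq_from_harmonics(spectrum_db, f0, fs, nfft, oq_range=(0.4, 1.0)):
    """Initialize OQ by inverting Fant's relation (Eq. 5.24):
    H1 - H2 = -6 + 0.27*exp(5.5*OQ), with spectrum_db a dB magnitude spectrum."""
    def harmonic_db(k):
        # nearest FFT bin of the k-th harmonic (assumes a long enough window)
        return spectrum_db[int(round(k * f0 * nfft / fs))]

    h1_minus_h2 = harmonic_db(1) - harmonic_db(2)
    arg = (h1_minus_h2 + 6.0) / 0.27
    if arg <= 0.0:
        # relation not invertible; over-initialize, as advised in the text
        return oq_range[1]
    oq = np.log(arg) / 5.5
    return float(np.clip(oq, *oq_range))
```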

5.2.10 Adaptive Post-filtering

Post-processing can improve the perceptual quality of the reconstructed sound. Adaptive post-filtering is a technique used in standard speech coders to boost formant peaks [4]. In our case, it is especially helpful in reducing buzziness. Given the AR coefficients \hat{\alpha} estimated at frame i, the filtering is done over the frame either in the time domain by the overlap-add method or by the windowed FFT, using a filter with the frequency response

M(z) = \frac{ \left| \hat{A}(z/\gamma_1) \right| }{ \left| \hat{A}(z/\gamma_2) \right| } \qquad (5.25)

where 0 < \gamma_1 < \gamma_2 < 1 and \hat{A}(z) is the frequency response of \hat{\alpha}. The zeros of M(z) have broader bandwidth than its poles; therefore its transfer function has peaks at the same frequencies as 1/\hat{A}(z) but with a smaller peak-to-valley ratio, giving the formant peaks an extra boost without overdoing it. Adaptive comb filtering can also be applied to enhance harmonicity, which is also perceptually important. Note, however, that this kind of post-filtering only helps reduce annoying buzziness; it cannot compensate for bad filter estimates.
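In the time domain, M(z) is realized simply by bandwidth-expanding the AR coefficients by powers of γ for the numerator and the denominator. A sketch using SciPy, with illustrative γ values rather than the ones used in the experiments:

```python
import numpy as np
from scipy.signal import lfilter

def adaptive_postfilter(frame, ar_coeffs, gamma1=0.5, gamma2=0.8):
    """Time-domain formant post-filter M(z) = A(z/g1)/A(z/g2), Eq. (5.25).
    ar_coeffs is [1, a_1, ..., a_P]; the gamma values are typical illustrative
    choices satisfying 0 < gamma1 < gamma2 < 1.
    """
    ar_coeffs = np.asarray(ar_coeffs, dtype=float)
    k = np.arange(ar_coeffs.size)
    b = ar_coeffs * gamma1 ** k     # numerator A(z/gamma1): broadened zeros
    a = ar_coeffs * gamma2 ** k     # denominator A(z/gamma2): milder poles
    return lfilter(b, a, frame)
```

The frames would then be recombined by overlap-add, as described above.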

5.2.11 EM Parameter Extraction for Fricatives

For completeness, we discuss in this section another class of speech signal that can easily be encoded parametrically with a convincing resynthesis. Fricatives can be parameterized by the energy of a white Gaussian noise excitation and an all-pole filter that shapes the spectrum of the output. For clean speech, conventional LPC can be used reliably. When noise is present, the state-space model presented in Section 5.1, with the deterministic input Bu omitted, can also be used to find, perhaps, better filter coefficients. The same EM algorithm and Kalman smoothing can be applied. In practice, however, this may not yield anything perceptually superior to basic LPC applied to a noise-suppressed signal, since human ears are not very sensitive to differences in noise-like sounds. Indeed, these consonants, which have low correlation and look very much like noise themselves, usually get suppressed during the EM iterations if the noise parameters are allowed to adapt. Any smoothing is also rather unnecessary for these procedures. A codebook constraint may be applied, but its benefit tends to be minimal.

5.3 Experiments

5.3.1 Test Samples and Evaluation

The main real test sample is a male singing voice with vibrato and tremolo on the phoneme /aa/ at a fundamental frequency around 123 Hz. The original sampling rate is 16 kHz. The sample is an instance of a well-controlled voice with challenging and important variation in both pitch and amplitude. A male speech sample will also be used to show the applicability to speech. The source filter order, P_s, in the model is set according to the rule of thumb P_s = 4 + f_s/1000 for a sampling rate f_s. A set of synthetic signals is also used to collect numerical simulation results on a larger scale. These samples are generated with pre-specified glottal waveform parameters and a vocal tract filter template extracted from real speech using covariance LPC with pre-emphasis. Two types of noise are used: white Gaussian noise and pink noise (or 1/f noise, whose energy is inversely proportional to frequency). Evaluation is based on a perceptually motivated distance measure between the spectral envelopes corresponding to the AR coefficient estimates. The Itakura-Saito (I-S) distance is used for this purpose [51]. Given a set of reference AR coefficients and an estimate, the I-S distance can be expressed as

d_{IS} = \frac{P(\omega)}{\hat{P}(\omega)} - \log\!\left( \frac{P(\omega)}{\hat{P}(\omega)} \right) - 1 \qquad (5.26)

where P(\omega) denotes a power spectral density, whose value can be calculated through the autocorrelation of the corresponding AR coefficients based on Fourier transform duality [152]. Although d_{IS} is not a symmetric metric, its use as a spectral distance measure is acceptable as long as the order of its arguments is kept consistent [153]. The reference, in the case of real voice, is the closed-phase covariance LPC derived from pre-emphasized sound samples. Covariance LPC gives better accuracy than autocorrelation LPC, but without a stability guarantee. Estimating only during a glottal closed phase minimizes the interference caused by source-tract interaction. Note that the use of closed-phase covariance LPC makes sense here only after its filter stability has been verified and a good-quality reproduction has been ensured. Another distance measure, used to reflect physical accuracy, is the distance between area function estimates used in [68], defined as

d_{Area} = \sqrt{ \frac{1}{P} \sum_{i=1}^{P} \left( A_i - \hat{A}_i \right)^2 } \qquad (5.27)

where the area function is calculated from the corresponding AR coefficients, under the assumption of zero area at the glottis and a lip area normalized to one [152]. Despite its literal physical meaning, the area function can be used safely only when the conversion yields an area function that makes sense, which is not always the case for arbitrary AR coefficients. On the other hand, for a "correct" vocal tract filter spectral configuration in voice, the corresponding area function usually, if not always, has a physically plausible shape. It is included in the experiments to validate the possibility of a physical model extraction.
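For reference, the I-S distance between two AR envelopes can be evaluated on a uniform frequency grid directly from the filter coefficients. A small sketch in which both gains are set to one, so that only the spectral shapes are compared (an assumption made here for simplicity):

```python
import numpy as np
from scipy.signal import freqz

def ar_psd(ar_coeffs, gain=1.0, nfft=512):
    """Power spectral density implied by an AR model: gain^2 / |A(w)|^2."""
    _, h = freqz([gain], ar_coeffs, worN=nfft)
    return np.abs(h) ** 2

def itakura_saito(ar_ref, ar_est, nfft=512):
    """Frequency-averaged I-S distance between reference and estimated AR
    envelopes (Eq. 5.26 evaluated on a uniform grid, unit gains assumed)."""
    p_ref = ar_psd(ar_ref, nfft=nfft)
    p_est = ar_psd(ar_est, nfft=nfft)
    ratio = p_ref / p_est
    return float(np.mean(ratio - np.log(ratio) - 1.0))
```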

5.3.2 Results and Discussion

Although the segmentation can be repeated at each iteration, in experiments with the samples shown here not much changes (except, perhaps, a few frames at the beginning and the end where the voice energy is low). Given a stable and correct segmentation, we therefore only look at the results of the final EM iteration of the joint source-filter estimation after one segmentation iteration. We start with the synthetic signals based on vowel AR coefficient templates, excited by a synthetic glottal waveform with a constant fundamental frequency of 40 Hz for 1.25 seconds. The sampling rate used is 8 kHz. The distances are calculated with respect to the template used in generation, and the average is taken over all periods except the first and the last, and over ten noisy-instance simulations. Both pink and white noise are tested. For all SNRs, the EM joint estimation gives smaller distance measures of the vocal tract filter estimates, both for the I-S and the area function distance, than the initial autocorrelation LPC estimated directly from the noisy signal. This is also generally true when no input is modeled (setting B = 0) during the iteration, which is expected since our synthetic signal was generated with one. The difference is not so significant at moderate to high SNRs, however. EM-PKS can improve the performance at low SNRs, say 0 to 10 dB, thanks to its smoothness constraint, which gives the estimates more immunity against noise. This is true for almost all vowel templates and both types of noise tested, as shown in Figures 5.5 and 5.6.


Figure 5.5: Averaged I-S and area function distance measures of synthetic signals of vowels /aa/, /iy/, /uw/ and /eh/ at different SNR levels of pink noise. The distance is between the reference and the initial LPC estimate (solid-circle), basic EM (dash-x), EM-PKS (dash-dot-*) and basic EM without input model (dots-+).


Figure 5.6: Averaged I-S and area function distance measures of synthetic signals of vowels /aa/, /iy/, /uw/ and /eh/ at different SNR levels of white noise. The distance is between the reference and the initial LPC estimate (solid-circle), basic EM (dash-x), EM-PKS (dash-dot-*) and basic EM without input model (dots-+).

For the male singing voice, Figure 5.7 shows the two distance measures at different SNR levels, for pink and white noise. Note that in this case the reference is taken from the closed-phase covariance LPC estimates of the pre-emphasized clean voice. Both measures show improvements over direct autocorrelation LPC estimates for both types of noise at all noise levels. The joint estimation also outperforms EM estimation with no glottal input model (B = 0). The improvement from incorporating PKS in EM is even more evident than in the synthetic cases at low SNRs.

To show the algorithm's behavior in more detail, the male singing voice embedded in pink noise at SNR = 20 dB is studied. Figure 5.8 shows the spectral envelopes corresponding to the AR coefficient estimates obtained from the various algorithms for a glottal period near the middle of the singing voice, compared to the reference closed-phase covariance LPC and the initial autocorrelation LPC of the noisy observation. The gain is normalized to unity at DC. The iteration provides a more accurate estimate of the vocal tract filter in the presence of noise. In this example, the PKS integration attenuates the high-frequency formants due to its averaging, but nevertheless gives a better overall I-S distance and a better sound quality. Both algorithms suffer from a lack of information in the spectral regions where the noise is overwhelming. In such regions, the poles tend to be assigned uniformly with uniform bandwidth, resulting in a relatively flat spectral envelope; the ringing buzz at the output is caused by this "floor." Applying post-filtering to suppress the valleys and boost the peaks reduces the buzz to a certain extent. The VQ-codebook-constrained EM gives a somewhat similar spectral shape at low frequencies but different, perhaps better, formant shaping at high frequencies, especially in the deep spectral valleys. This helps reduce the buzziness, although the distance measure is not necessarily superior overall. Also, due to its quantized nature, unnatural fluctuations can occur, so more smoothing is required.

Figure 5.9 shows the equivalent area functions of the basic EM and the other references, converted from the corresponding AR coefficients. The resulting shape suggests that the algorithm can reach a physically plausible solution. Figure 5.10 shows the waveform of the initial estimates of the glottal source parameters versus the results after the EM iterations, superimposed onto the residue from inverse-filtering the clean speech using autocorrelation LPC estimated from clean pre-emphasized samples. While the estimates given by EM-PKS and EM-VQ are similar, they may not "look" like a better fit because of the averaging and constraint matching. They do, however, usually produce better sounds, especially when the input SNR is low.


Figure 5.7: Averaged I-S and area function distance measures of a male singing voice at different SNR levels of pink and white noise.


Figure 5.8: Spectral envelopes of the reference closed-phase covariance LPC of pre-emphasized clean speech, the noisy autocorrelation LPC and the result of joint source-filter estimation from the basic EM, EM-PKS and EM-VQ algorithms.


Figure 5.9: The normalized area function from the closed-phase covariance LPC estimates, the initial estimates from autocorrelation LPC, and after EM iterations.


Figure 5.10: The derivative glottal waveform, g(n): the initial model (dash) and the model after EM iterations (solid).

Using the basic EM joint estimation, Figure 5.11 shows the variance, q_s, of the voice process error over time steps and iterations. It keeps decreasing with the iterations, meaning the model fits the inferred clean speech better and better. An example of the log-likelihood convergence is shown in Figure 5.12. Once convergence is reached, which can take fewer than five iterations, the parameters are used to reconstruct the voice. For the basic EM algorithm of the joint source-filter estimation, raw estimates result in artifacts caused by discontinuities in the parameter trajectories, especially those of the glottal waveform, a_g, b_g and n_o. A narrow Hann window can be used to smooth them by filtering the trajectories; alternatively, they can be converted to T_0, AV and OQ before smoothing. PKS applied at the end of every iteration, on the other hand, circumvents this ad hoc post-processing step. In Figure 5.13, the raw estimates of \alpha_5, a_g and b_g are shown with the results after PKS superimposed. They can be converted directly into the canonical set of parameters T_0, OQ and AV, as shown in Figure 5.14 (there is no smoothing of n_o, hence the roughness in OQ and, slightly, in AV, although it could also be smoothed). Figure 5.15 shows the spectrum of a frame of the original voice and the parametric reconstruction. From listening, EM-PKS makes little difference compared to small-window smoothing of the basic EM estimates. However, it generally yields a less fluctuating sound and significantly moderates estimation anomalies during the iterations, for a more graceful degradation at low SNR levels.


Figure 5.11: Process error log-variances at each iteration.

In addition to a more pleasantly smooth sound, EM-PKS also provides a smoothness constraint such that the estimate variances over time are low, which is generally beneficial for the human voice. Figure 5.16 shows the mean spectral envelope corresponding to the AR filter estimated by the basic EM, along with its standard deviation bounds. Figure 5.17 shows the same for the EM algorithm with PKS. For comparison, the values obtained from basic EM without an input model (equivalently, setting B = 0 at all times) are illustrated in Figure 5.18. The difference made by providing a source model in the estimation lies mainly in the correct spectral tilt inherent in the source model, which helps achieve a better separation of the source and the filter.


Figure 5.12: Log-likelihood convergence.

As mentioned earlier, pre-emphasis of a noisy signal for vocal tract filter estimation is not advisable; the joint estimation, on the other hand, avoids this problem. The high-frequency formant structures, however, remain difficult to identify in both basic EM and EM-PKS, being overwhelmed by noise. EM-VQ somewhat alleviates this problem. Figure 5.19 shows the spectrogram of the original singing voice and of the noisy version with white noise contamination at an SNR of 20 dB. Also shown are the spectrograms of the parametric reconstructions, using parameters extracted by EM-PKS and by EM-VQ with P_c = 6 and a codebook of size 64 trained on the clean signal. A speaker-independent codebook can also recover the high-frequency structure, but listening reveals a rather high fluctuation in the formant trajectories. This results in unnatural sound reproduction, which highlights the drawback of using too small a quantized codebook with entries that may not match the actual signal well. Smoothing then needs to be done more extensively, using longer segments and smaller process variances for greater inertia, and the benefit of better-shaped higher formants may then be compromised.


Figure 5.13: Raw (dash) and smoothed (solid) estimates of α5, ag and bg at the last iteration of EM-PKS.


Figure 5.14: Estimates of fundamental period (T0), open-quotient (OQ) and amplitude (AV) using EM-PKS.


Figure 5.15: The time-samples (top) and spectra (bottom) of the original (dots) and the resynthesized signal (solid). The spectrum is offset for clarity.


Figure 5.16: Mean spectral envelopes and ±1 standard deviation over time of the filter coefficient estimates using basic EM-Kalman.


Figure 5.17: Mean spectral envelopes and ±1 standard deviation over time of the filter coefficient estimates using post-Kalman smoothing.


Figure 5.18: Mean spectral envelopes and ±1 standard deviation over time of the filter coefficient estimates using basic EM-Kalman with no input model.

Another example, a male speech utterance "Where are you?", is shown as a spectrogram comparison in Figure 5.20. Presented in the figure are the original, the noisy version with pink noise embedded at SNR = 20 dB, and the EM-PKS and EM-VQ reconstructions, where the codebook of size 84 is taken from the original clean utterance, essentially frame by frame. Using a codebook taken from the original utterance clearly limits its practical use, but it demonstrates here how the codebook look-up performs and what kind of sound quality can be achieved. Figure 5.21 shows the codebook indices that matched best during the last iteration. Most selections are correct, judging from an approximate segmentation of the phonemes and the codebook, except for /y/ and /ux/, which are low in energy and sound very much like /w/. This results in good reconstruction quality for the first part of the utterance, while the last part is not as intelligible. Using a codebook trained from other speakers (512 entries per phoneme) gives the indices shown in Figure 5.22, which do not behave very differently from before; the sound quality is, however, inferior. While the proposed look-up procedure, as well as the way the codebook is derived, clearly needs to be improved, what is missing from the basic EM and EM-PKS estimation has been shown to be partially recoverable.

5.3.3 Listening Test

An informal listening test has been conducted to gauge the perceived quality of the different sounds. The test subjects consist of three males and three females ranging in age from their 20s to their 40s. Roughly half may be considered expert listeners, while the rest are general audience with normal hearing. Beyond objective measures like the Itakura-Saito distance used earlier, the ultimate test of any parametric encoding must reflect the auditory perception of human listeners. While SNR, calculated from the energy of the difference between the target and the reference, is commonly used for conventional noise suppression algorithms where sample-by-sample comparison is valid, a structure-based coding and reconstruction is not sample-accurate at all, so an SNR measurement is not applicable. Recently, a perceptually relevant objective measure, the perceptual evaluation of speech quality (PESQ), was introduced [154]. However, it is not yet clear how well it correlates with actual listening for model-based coding, among other effects like echo or a competing talker.


Figure 5.19: Spectrogram of the original singing voice, the noisy version with white noise at SNR = 20 dB, and the parametric reconstruction from EM-PKS and EM-VQ. 5.3. EXPERIMENTS 113


Figure 5.20: Spectrogram of the original male utterance “Where are you?”, the noisy version with white noise at SNR = 20 dB, and the parametric reconstruction from EM-PKS and EM-VQ.


Figure 5.21: Codebook index lookup using original clean utterance as codebook. 114 CHAPTER 5. PROBABILISTIC FRAMEWORK


Figure 5.22: Codebook index lookup using speaker-independent codebook.

Indeed, the current version of PESQ, while covering both narrowband and wideband signals as well as packet-based speech coding like VoIP, does not perform accurately with silence substitution of speech frames [155]. Structured coding, on the other hand, is more about re-rendering the sound object and re-creating an indistinguishable image of it. A parametric sound reconstruction may have a continuously variable delay, causing alignment problems in PESQ, and yet sound good. Often, people do not object to a high-quality new sound which may be slightly different from the original.

Given this current lack of an appropriate objective measurement, we use a primitive subjective measure widely used in speech research, the mean opinion score (MOS) [156], to reflect the performance of the parameter extraction and reconstruction presented in this dissertation. Each test subject scores each sound sample on a scale of 1 to 5 (1 = Bad, 2 = Poor, 3 = Fair, 4 = Good, 5 = Excellent) based on the overall quality of the sound they hear, including mixed background noise and other artifacts of the subsequent processing. For simple relative comparison, the original sound is referenced at a score of 5 (excellent). The test is meant only to verify effects that may not be captured in quantitative measurements and to explore the opinions of human listeners. A larger test set and more subjects would be necessary for a solid evaluation such as that required for formal voice coding standardization.

Table 5.1 shows the averaged results from six subjects for the different algorithms in the case of a sustained male singing voice (/aa/) at a sampling rate of 16 kHz with pink noise mixed in at SNR = 20 dB. The result is also compared against a conventional noise suppression algorithm, the fixed frame-rate EM with Kalman smoothing [97], which is the same as our basic algorithm except for the input model and the pitch synchrony. For comparison, an achievable result of our parametric model using the DPWF algorithm proposed in Chapter 4, applied to a clean signal, is also included. The listening experiments clearly show that, for EM-PKS, while the first few formants are accurately estimated, the higher-frequency ones are over-damped or largely not modeled. The inclusion of the source model helps compensate for the spectral tilt due to its inherent -6 dB/octave roll-off. However, this does not help recover the higher-order formants (and valleys), which are low in energy or even buried in noise, as is evident in Figure 5.8. The resulting sound is more buzzy than the original because of the inharmonicity at high frequencies caused by the glottal waveform discontinuity, which is not shaped or compensated properly by the vocal tract filter. This is in contrast to the clean-voice estimation of the vocal tract, with pre-emphasis, and of the glottal source parameters, which gives a higher score for the reconstruction, although a certain degree of buzziness remains. An experiment in which the sound input is first pre-emphasized has also been conducted. The result is that the high-frequency formants can be recovered better, at the expense of worse estimates of the lower-frequency formants and the source parameters; pre-emphasis of a noisy signal also risks boosting the noise at high frequencies, misguiding the subsequent AR modeling. Interestingly, when the parametric reconstruction from EM-PKS, which is somewhat buzzy, is compared with the Kalman-filtering noise suppression, which contains musical noise, some subjects preferred one type of artifact over the other. This is perhaps due to the ability of the human brain to listen through even the musical noise and capture the original voice image embedded in it.

Table 5.1: Mean opinion score of a male voice (/aa/) extraction and noise suppression under pink noise contamination at SNR = 20 dB. EM-KF refers to Kalman-filtering noise suppression using EM, EM-PKS refers to the best reconstruction using the proposed EM parameter estimation with smoothing, and DPWF refers to the reconstruction using only LPC and DPWF glottal waveform estimation from a clean signal.

    Reference   Noisy   EM-KF   EM-PKS   Clean DPWF
    5.0         1.8     3.3     3.5      4.2

5.4 Conclusion

In this chapter, a probabilistic framework based on a generative model of the segmented clean voice in noise has been presented, in which all random variables are conditioned on the voice and noise production models. An iterative method based on the EM algorithm was described, with approximate inference using a MAP estimate of the segmentation variable and a sub-EM algorithm for inference of the clean speech and learning of the relevant parameters. The framework can be extended and applied to other models, and many extensions of EM are also possible. The results show a gain in estimation accuracy when the source and the filter are estimated jointly, while smoothness and codebook constraints help improve the quality of the resynthesized voice.

Chapter 6

Applications of Structured Voice Coding

In this chapter, we briefly discuss and illustrate various applications of structured voice coding enabled by the algorithms described in this dissertation.

6.1 Voice Coding and Flexible Synthesis

The primary motivation for voice parameter extraction is the parsimonious encoding of sounds, with the added benefit of modification flexibility. Using the model described, the glottal waveform parameters can be converted to the fundamental period (T0), the amplitude (AV) and the open-quotient (OQ) for independent manipulation. Resynthesis involves converting back to the intrinsic waveform parameters for glottal excitation generation. Manipulations in the pitch-amplitude-open-quotient domain can achieve the following effects.

6.1.1 Volume Changing

Changing the volume is probably the easiest of all. AV is simply modified while the rest are kept unchanged. This is useful in combination with other effects for stress modification in speech or for changing the loudness dynamics in singing.


6.1.2 Time Scaling

Stretching the voice involves inserting extra parameters into the original trajectory. These values are calculated using linear interpolation. For a factor-of-D increase in duration, where D is a positive integer, the new set of parameters is obtained, with appropriate boundary ramping, from

\theta_{new}(l) = \theta_k^{(old)} + n \cdot \left( \theta_{k+1}^{(old)} - \theta_k^{(old)} \right) / D \qquad (6.1)

where l = (k-1)D + n, n = 0, ..., D, k = 1, ..., K, and K is the total number of parameters in the original trajectory. Compression in time by a factor of \alpha, where 0 < \alpha < 1, involves discarding parameters every J time steps, where J = \left[ \frac{1}{1-\alpha} \right] and [\cdot] denotes the rounding operation. A stretch in time by a non-integer factor \beta can be done in either of two ways: a) first stretch the parameters by the integer \lceil\beta\rceil, then compress by a factor of \beta/\lceil\beta\rceil; or b) first stretch the parameters by the integer \lfloor\beta\rfloor, then stretch by the non-integer factor \beta/\lfloor\beta\rfloor to get the final result. Obviously, this is only an approximation, especially since the pitch period itself varies in time. It makes sense only for modification by multiplicative factors, but given a target duration such a factor can be calculated relative to the original duration. To hit exactly the target duration, extrapolation may be required; but given the spread of the places of replication or deletion, the final duration should be very close to the target and the change will sound very natural. For mixed phone types, all transients such as fricatives and plosives are translated in time without being stretched or compressed themselves, to maintain naturalness. This can be seen in the spectrogram at the bottom of Figure 6.1, where the original utterance is "She saw a fire."
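A minimal sketch of the integer stretch of Eq. (6.1) and the compression rule above on a parameter trajectory, using NumPy; boundary ramping and the non-integer-factor combinations are left out.

```python
import numpy as np

def stretch_parameters(theta, D):
    """Integer-factor time stretching per Eq. (6.1): linear interpolation
    inserts D-1 values between neighbouring estimates.
    theta is a (K,) or (K, M) array of per-period parameters.
    """
    theta = np.asarray(theta, dtype=float)
    K = theta.shape[0]
    k_old = np.arange(K)
    k_new = np.linspace(0, K - 1, (K - 1) * D + 1)   # the l = (k-1)D + n grid
    if theta.ndim == 1:
        return np.interp(k_new, k_old, theta)
    return np.column_stack(
        [np.interp(k_new, k_old, theta[:, m]) for m in range(theta.shape[1])]
    )

def compress_parameters(theta, alpha):
    """Compression by a factor alpha (0 < alpha < 1): drop every J-th entry,
    J = round(1 / (1 - alpha)), following the rule stated in the text."""
    j = int(round(1.0 / (1.0 - alpha)))
    keep = np.ones(len(theta), dtype=bool)
    keep[j - 1 :: j] = False
    return np.asarray(theta)[keep]
```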

6.1.3 Pitch Shifting

Increasing the pitch involves reducing the length of the glottal period, and vice versa. This can be achieved simply by multiplying T0 by the desired factor. However, since the duration has to remain unchanged, it must be followed by the time-scale modification described previously, so that the stretched or compressed time gives back the original duration. The transient components such as fricatives stay the same, as shown in the spectrogram at the top right of Figure 6.1. The spectrogram also shows the more widely spaced harmonic components obtained in the spectral domain through the time-domain model synthesis.

6.1.4 Breathiness Modification

The open quotient, OQ, is directly related to the amount of air that flows through the glottis. It is therefore one of the primary cues for the perception of breathiness. However, Dennis Klatt showed experimental results suggesting that a more dominant cue actually comes from the aspiration noise [57]. In [68], a GCI-synchronous injection of window-shaped Gaussian noise is used to synthesize a breathy voice. An example of such an aspirated glottal excitation is shown in Figure 6.2. The energy of the noise could be encoded as another parameter and can be estimated by extracting the residue after estimation, for example using wavelet thresholding as shown in [68]. Extracting a breathy voice from a noisy recording, however, is much harder. Besides the observation noise, which will bury the weak aspiration noise, the modeling error can obscure the aspiration noise even further. It may be possible to determine the amount of noise appropriate for injection from the waveform parameters. An increase in spectral tilt, giving a more rounded glottal waveform, can also enhance the perception of breathiness [57]. For a pressed mode of singing, decreasing OQ alone is still very effective, as also discussed and tested in [57].

6.1.5 Glottal Fry Synthesis

An extremely low open quotient, in addition to irregularity in the glottal pulses, leads to a vocal fry or creaky voice. It happens when the vocal folds get compressed tightly through rotation of the arytenoid cartilages, resulting in a slow and irregular air flow at a low fundamental frequency. It has significance in speech communication, being a type of phonation and indicating the start or end of sentences, as well as an emotional cue such as sounding bored. In singing, such a voice sounds rough or raspy. The ability to transform a normal voice into a creaky one not only allows a wider range of expressive synthesis, it can also be used to mimic dysarthria for clinical studies. This effect is clearly much less intuitive to achieve with a spectral-modeling coder such as the sinusoidal model. To synthesize a creaky voice, energy-dependent jitter can be added to modulate the position of each glottal pulse, and the open quotient is compressed to a very low value. Additional modifications can be made to the spectral tilt, along with the addition of aspiration noise and even extra (smaller) pulses for a diplophonic double-pulsing effect and a variety of degrees of creakiness.
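As a rough illustration of this kind of manipulation, the sketch below perturbs the period trajectory with amplitude-dependent jitter and forces a low open quotient; the jitter depth, the OQ value, and the choice to jitter more where the voice is weak are all assumptions for illustration, not settings from this work.

```python
import numpy as np

def creaky_parameters(t0, av, jitter_depth=0.15, oq_creak=0.2, rng=None):
    """Push per-period parameters toward a vocal-fry quality: a rough sketch.

    t0, av : per-period fundamental period (samples) and amplitude trajectories
    """
    rng = np.random.default_rng() if rng is None else rng
    t0 = np.asarray(t0, dtype=float)
    av = np.asarray(av, dtype=float)

    # Energy-dependent jitter: stronger irregularity where the amplitude is low.
    rel_energy = av / (av.max() + 1e-12)
    jitter = jitter_depth * (1.0 - 0.5 * rel_energy) * rng.standard_normal(t0.size)
    t0_creaky = t0 * (1.0 + jitter)

    # Compress the open quotient to a very low value.
    oq_creaky = np.full(t0.size, oq_creak)
    return t0_creaky, oq_creaky
```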

6.2 Resampling and Bandwidth Extension

Another advantage of a model-based encoding of voice is that upsampling is sometimes possible. Consider structured coding in computer graphics: describing a circle by its radius and center position means its rendering is independent of pixel resolution. In a similar vein, the derivative glottal model used here represents the voice by parameters which are independent of the sampling rate; we only need to generate the waveform at the correct sampling rate from these parameters. Nevertheless, for the human voice, the detailed information of the vocal tract filter that shapes the high-frequency region is still missing. Bandwidth extension is the term used in speech applications where the high-frequency band content is generated from the lower band, commonly transmitted over the telephone channel at a sampling rate of 8 kHz.¹ Current research in bandwidth extension mostly focuses on spectral modeling. To recover, or rather regenerate, the high-band spectral content, the vocal tract filter coefficients are predicted from the lower-band coefficients. In [157], a codebook of LSF trained using the Itakura-Saito distance is used to look up higher-order coefficients.

¹ Telephone speech bandwidth is 0.3-3.4 kHz, so this is sometimes referred to as narrowband, whereas the extension, which may include bands both below and above the telephone bandwidth range, is called wideband.

The scheme has been improved with interpolation and memory in [158]. More sophisticated models have been studied, e.g., a hidden Markov model [159], a neural network [160] and a Gaussian mixture model [161]. At the same time, the set of lower-band features most informative about the high band has been sought using linear discriminant analysis [162]. All kinds of LPC-related features have been used in these works; formant peaks have also been studied [163]. Less obvious is how to regenerate the excitation for the high-band region. Early research proposed simple spectral folding of the inverse-filtered residue, which has continued to find use [164] [46]. Some approaches distinguish between voiced and unvoiced spectra and use sinusoidal components for the former and random noise for the latter [165] [163].

6.3 Voice Conversion

Voice conversion refers to generating an utterance spoken by one speaker with the voice characteristics of another [166]. Its applications include speech synthesis using a limited database and voice dubbing in films. Both the vocal tract pathway configuration and the glottal excitation determine the characteristics of a speaker, along with other non-physical factors such as speaking style and rate. Conventionally, voice conversion involves mapping the formant structure, the pitch and other fine details of the excitation between two speakers via training, which requires both speakers to speak the same utterances during training. Having parameterized the vocal tract and the glottal excitation, one can imagine a cross-synthesis in which the glottal source generated from the parameters estimated from one person's voice is used as input to the vocal tract filter estimated from another's. The glottal source encoding captures not only the large-scale pitch, amplitude and open-quotient variation, but also, implicitly, the small-scale jitter and shimmer, which can also form an important part of a speaker's identity. An experiment on two male singing voices has been conducted and a successful cross-synthesis has been achieved, although it is clear that frequency warping of the vocal tract filter will be needed to better match the characteristics and possibly to mask some artifacts due to the cross-synthesis mismatch.

6.4 Comfort Noise Generation

Despite the obvious motivation to suppress noise interference, a small dose of ambient noise is sometimes desirable. Comfort noise generation is an accepted part of standard telephony, where a low level of noise is generated at the receiver to avoid the effect of dead-channel silence during periods of no speech activity. Additionally, the ambient noise is sometimes also informative about where the speaker at the other end of the line is. Structured modeling and coding of voice in noise allows us to capture the noise spectral characteristics in a few parameters, which can be used for comfort or ambient noise generation at an appropriate loudness level. For example, the model and identification techniques of Chapter 5 give the colored noise's short-term correlation coefficients and its energy in the original observation. The spectral characteristic can then be preserved while the energy is reduced to an appropriate level.
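A minimal sketch of such a generator, assuming the estimated noise AR coefficients and variance are available from the analysis; the attenuation level is an illustrative choice.

```python
import numpy as np
from scipy.signal import lfilter

def comfort_noise(noise_ar, noise_var, n_samples, level_db=-15.0, rng=None):
    """Generate comfort noise with the spectral shape of the estimated ambient
    noise (AR coefficients and variance from the Chapter 5 framework), attenuated
    by level_db relative to the original level.
    """
    rng = np.random.default_rng() if rng is None else rng
    gain = np.sqrt(noise_var) * 10.0 ** (level_db / 20.0)
    excitation = rng.standard_normal(n_samples)
    # Shape white Gaussian noise with the all-pole noise model 1 / A_noise(z).
    return gain * lfilter([1.0], noise_ar, excitation)
```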


Figure 6.1: Spectrograms of the original utterance (top-left), the pitch-shifting by a factor of 1.5 (top-right) and the time-scaling by a factor of 2 (bottom). The utterance is “She saw a fire.”


Figure 6.2: Original synthesized glottal waveform (top) and its breathy version (bottom).

Chapter 7

Conclusions and Future Work

The goal of this dissertation was to investigate and develop new automatic procedures to enable parametric extraction of the human voice from a microphone observation which may contain noise. The parametric model of the voice was chosen for its modification flexibility and physical intuitiveness, in contrast to state-of-the-art voice or speech coders currently used for modifiability purposes. The thesis of the research is that such a relatively simple yet intuitive parametric model can be estimated well enough for acceptably faithful encoding as well as convincing sound modification.

In order to be applicable to general speech or singing, segments of different modes of sound production have to be identified. A new mixture model and inference technique has been applied to a switching state-space model (SSM), showing improvement over past techniques using similar models. A novel method of noise feature compensation also shows superior segmentation and classification accuracy over front-end noise reduction and loosely coupled feature compensation. While there are many phone types originating from different production models, the focus of this dissertation is on the non-nasal voice, since this class constitutes the majority of speech and singing and is also easier to deal with. It is also the most expressive part, the one that most needs to be processed for voice modification applications, and it can serve as a foundation for an extension to nasal sounds in the future. A new dynamic programming method which aims to identify not only the GCIs but also the


GOIs and associated waveform parameters, in contrast to other GCI detection algorithms, has been described. The output of the system is a parametric representation which can be used to reconstruct the original voice without much degradation and whose parameters can be modified easily for seamless continuous pitch-shifting and time-scaling. Compared to the spectral manipulation often used for such applications, our model and method also allow glottal effects such as breathiness modification and laryngealization to be achieved more intuitively. The proposed glottal segmentation algorithm assumes that conventional LPC can first be applied to the voice to estimate the vocal tract filter and, subsequently, the inverse-filtered waveform, which should correspond to the glottal derivative waveform. With noise contamination, such estimates of the vocal tract filter are significantly degraded. As a result, the parametrically reconstructed voice either contains spurious resonances, is less intelligible, or is coarse, or all of the above; these are the results of noise resonances being mistaken for those of the voice, inaccurate formant estimates, and incorrect GCI detection, respectively. A probabilistic framework which encompasses all variables involved in the model has then been described, along with an iterative method based on an EM algorithm to derive more accurate parameter estimates. Thanks to its monotonic convergence, many constraints, especially physically motivated ones, can be applied during the iteration. Experiments validate the improvement of this iterative algorithm over the initial estimates, and informal listening tests have shown its usability for direct voice parameter extraction from noise for coding purposes. Compared to conventional noise suppression by filtering, the reconstruction is free of musical noise, at the price of the buzziness associated with the simplistic model and imperfect reconstruction in spectral regions where the SNR condition is severe. On the other hand, the voice is now completely parameterized and can be modified as described previously. Besides an assessment of each individual part, the system developed during the course of this dissertation has provided some convincing examples. However, there are many things that must be done, or are at least worth exploring, before a practical system can be achieved. Some suggestions for future work are listed in the following sections.

7.1 Noisy Speech Segmentation

The model used in Chapter 3 for speech segmentation is meant to be scalable. Many different phone-type categorizations are possible while the algorithm and its features remain essentially the same. In this work, the noise can be compensated because of the closed-form expression, albeit an approximation. It is yet to be compared against more rigid classifiers such as neural networks, which cannot be easily adapted on the fly but can use more discriminative or more noise-robust features in classification. For example, a form of periodicity measurement should definitely be included as a feature in any phone classifier. Although model-based feature compensation is most likely harder for a feature such as this, a robust implementation based on, say, spectral peaks can provide noise immunity. A larger speech set should be evaluated to compare the two approaches. Possibly, a combination of both types of features or classifiers could also be beneficial.

Another useful by-product of the proposed algorithm is the inference probability, which can be used as a confidence score. It is virtually impossible to get 100% classification accuracy and sometimes, using a wrong model could be costly. Incorporating these confidence scores into a coder's decision making could provide robustness to misclassification and make its use acceptable.

As shown in the results of Chapter 3, using more than two mixture components of dynamical systems per class seems to create more confusion in noisy cases. Many other inference techniques for SSMs exist, for example the variational method [167] and particle filtering [168]. A non-uniform number of mixtures per class should also make sense; this may require a study of model-selection techniques such as the Bayes information criterion or the Akaike information criterion. Although the hidden continuous states are learned from the training data in a data-driven manner, it would be interesting if a physical meaning, possibly reflecting the underlying articulators, can be found for them. Of course, explicit modeling of these state trajectories based on physical attributes such as the vocal tract resonances used in [87] can also be done, but it would require much higher computation.

7.2 GCI/GOI Detection

The GCI detection performance of the algorithm in Chapter 4 is comparable to a classical method like the EWGD. Despite its better accuracy, its use in practical coders would be marred by the few remaining false alarms and missed detections. One way of addressing this is post-processing such as averaging, median filtering, or something in between like a trimmed mean, which should be able to recover from spurious mistakes. However, contiguous errors can be hard to identify and to distinguish from an actual rapid change in pitch, and in its current form such errors are still present from time to time. A better solution is to strengthen the algorithm by adding other cost terms to the dynamic programming cost function, in the hope that one will catch mistakes the others cannot, as done in [125]. The drawback of having more cost terms is the difficulty of adjusting the weights among them for the best result. A preliminary experiment showed that using the algorithm in [125] to identify the GCIs as part of the initialization, before identifying the GOIs and the waveform parameters using the algorithm in Chapter 4, gives more reliable results with more robustness to noise. Another weakness of the current algorithm is its use of the derivative glottal waveform, which is more susceptible to noise than the actual glottal waveform. Using a different model in the least-squares waveform fitting could be beneficial but is most likely computationally intensive. In fact, the least-squares estimation itself, even in the current linear model, is the main source of slow computation; a way to reduce this load significantly would be a considerable contribution. With respect to coding, the proposed algorithm only goes as far as floating-point parameter estimation and modification. These parameters can be used when sufficient computing power and power supply are abundant; in many cases, such as mobile communication, further quantization is needed. While, to the best of the author's knowledge, the quantization of open-quotient data with respect to perceptual distortion has not been studied, its effect on the pitch contour has been studied somewhat and an appropriate quantization scheme has been proposed in [169]. Quantization of the vocal tract's AR coefficients, or variants such as the line spectral frequencies, has been well investigated and is used in standard mobile communication [170].

7.3 Voice in Noise Parameter Extraction

Due to the lack of a copy of clean speech, direct analysis-by-synthesis (A-b-S) is not possible. So, in contrast to conventional speech coder derived by A-b-S, incorporating perceptual criterion is not so straightforward although not entirely impossible. A warping function or perceptual filter might be applied to the inferred clean speech. But in order to make the most of the statistical estimation, covariances and other system variables might have to be manipulated accordingly which is the source of the difficulty. Problems also exist when the glottal period is too short, relative to the order of the filter, for example in a female voice. This will then require the use of multiple periods. Experiments have shown that trying to fit the same glottal source parameters to two adjacent periods simultaneously give inferior results. Therefore, a more flexible manipulation is required. The model described in Chapter 5 is quite general and flexible. The input to the process in the state-space model can probably be extended to other forms to deal with difficult classes like plosives. For nasals, an additional FIR filter can also be applied there although preliminary experiments showed that the estimates will be more error- prone. The FIR filter provides zeros which cause more jaggedness in the combined input waveform. Furthermore, no physical interpretation can be derived from such model. Currently, the output matrix C in Equation (5.1) is also only a simple linear addition of instantaneous samples of the voice and the noise. Theoretically, it can be used to represent the room reverberation FIR filter which provides spectral coloring to the original sound. If it can be assumed to be stationary over a long period of sound input, the statistics over that period can be calculated and the FIR filter coefficients contained in the output matrix can be estimated [143]. The limitation, however, remains in the length of the FIR filter which must not be longer than the size of the state in the state-space model. A type of codebook-constrained EM algorithm has been demonstrated in this work. However, despite the apparent improvement in spectral shape, little percep- tual gain has been achieved, especially when the codebook has been derived from other speakers. The first problem with the current codebook-constrained algorithm 130 CHAPTER 7. CONCLUSIONS AND FUTURE WORK

is a low-quality codebook. This includes differences in voice characteristics as well as perceptually meaningless codebook entries, despite the perceptually meaningful criterion used to derive the codebook. The second problem is that the estimates of the low-band spectral envelope, being intermediate in the iteration, are usually not good enough to look up accurately in the codebook. Future work that solves these problems should contribute to better sound quality. A preliminary look at other types of spectral envelope extension or prediction, such as a GMM (a softer decision than a VQ codebook), vocal tract area function interpolation, and formant peak features, has not yielded any better outcome. Although these techniques have been used rather successfully in speech bandwidth extension, where less stringent demands are placed on the prediction, their direct use as a synthesis filter in combination with Rosenberg's glottal excitation gives rather unconvincing results.

While the basic EM algorithm treats the AR coefficients and the glottal parameters as deterministic parameters, EM-PKS gives a glimpse of how they can be turned into stochastic variables that follow their own dynamic systems. Such modeling could provide more robust estimation, especially at low SNRs, as shown in the case of EM-PKS, and could also allow for non-stationary modeling in which these parameters vary constantly in time. EM-PKS is only a loose way to couple them into the model; many other methods could potentially perform better inference, especially when the parameters are truly allowed to change constantly, for example the dual UKF in [119] and the particle filtering in [98].

As in the case of voice observed in a pristine environment, further quantization of the floating-point estimates may eventually be needed. While denoising and quantization may be done simultaneously, it has been shown that the process can equivalently be done in two successive steps, with a different intermediate estimate of the clean speech or voice signal depending on the type of desired parameters and the statistical model of the noise [171]. For example, the waveform samples can be quantized optimally from noisy observations if their MMSE estimates are made in the intermediate step; for AR coefficients, on the other hand, the spectral magnitude is the desired intermediate estimate. The algorithm proposed in Chapter 5 operates in a similar way, giving MMSE estimates of the time samples of the clean voice,

which means that applying time-sample quantization would have been optimal. Although optimality cannot be proven for the parameters we eventually extract, further investigation from an information-theoretic point of view may yield better overall coding performance when limited-precision quantization is desired.

Aspiration noise is an important component of voice texture. In a noisy observation it can hardly be heard, let alone estimated. The process error in the state-space model also tends to over-estimate the aspiration if used for this purpose, since it actually absorbs all kinds of errors associated with the model. However, there might be other cues which indicate how much aspiration noise should be present, so that it can be produced synthetically in addition to the deterministic waveform estimated from the noisy observation. One important clue is probably the open quotient, although other features could certainly help [68]. An investigation into this kind of relationship is interesting and would be useful for the cause of robust structured voice coding.

7.4 Flexible Expressive Resynthesis and Structured Coding

As mentioned previously, phone-type classification can never be 100% accurate, so a mechanism to cope with such errors is imperative for any coder to be usable in practice. There is some evidence that the glottal segmentation and modeling algorithm of Chapter 4 can deal to some extent with plosives misidentified as voiced, say, during a vowel-consonant-vowel (VCV) utterance. After all, the algorithm is in a way analysis-by-synthesis, and with such impulsive models the plosive excitation can be modeled by a set of narrow impulses (very small open quotients). The resulting sound quality, however, is not very high. Although left out of this dissertation, many methods exist to encode the residue left over after the deterministic waveform has been modeled. For example, a codebook similar to that in CELP could act as another safety net, capturing what is left over after incorrect modeling. During voice modification, however, these errors are most likely to be exposed, and other mechanisms might have to be devised to mitigate their effects.

On the other hand, one can imagine that even in its current form our model could be used in combination with other standard coders. Difficult signal classes such as plosives could be encoded using a codebook, wavelets or another transform coder, while noise compensation in those domains may also be possible in some form. The result is a hybrid coder in which the model is selected according to suitability, judged from the signal or some confidence indicator; this determination is not trivial and will require further research and development. Lastly, as mentioned earlier, appropriate parameter quantization and the ultimate data compression must be determined for overall coding performance and efficiency in real use, especially when there are multiple modes and possibly a variable rate of parameter estimates. Putting together systems that ensure graceful degradation with acceptable quality should eventually lead to a structured speech and singing-voice coder that is versatile and highly compressed, for a custom delivery of the sound experience.

Appendix A

State-space Model Inference and Estimation

A.1 Kalman Filtering

In statistical inference, filtering refers to calculating $P(X_t \mid y_{1:t})$, where $X_t$ is a random variable at time $t$ and $y_{1:t}$ is the observation from time instant 1 to $t$. This can be used to calculate $E[X_t \mid y_{1:t}]$, the expected value of the hidden state at time $t$ given all observations up to that time. For a linear dynamical system component as given in (3.1), Kalman filtering first calculates the predicted mean and variance.

$$x_{t|t-1} = A x_{t-1|t-1} \tag{A.1}$$
$$V_{t|t-1} = A V_{t-1|t-1} A^T + Q \tag{A.2}$$
where $A$ is the state transition matrix and $Q$ is the covariance of the process noise. $V_{t-1|t-1}$ is the covariance of the state variable at time $t-1$, while $V_{t|t-1}$ is called the a priori estimate error covariance. The error in the prediction (the innovation) and its covariance are then calculated by


$$e_t = y_t - C x_{t|t-1} \tag{A.3}$$
$$S_t = C V_{t|t-1} C^T + R \tag{A.4}$$
where $C$ is the observation matrix and $R$ is the covariance matrix of the observation noise. Finally, the a posteriori estimates of the state variable, the covariance and the cross-covariance at time $t$ are computed through

$$\hat{x}_{t|t} = x_{t|t-1} + K_t e_t \tag{A.5}$$
$$V_{t|t} = V_{t|t-1} - K_t S_t K_t^T \tag{A.6}$$
$$\;\;\;\;= (I - K_t C) V_{t|t-1} \tag{A.7}$$
$$V_{t,t-1|t} = (I - K_t C) A V_{t-1|t-1} \tag{A.8}$$
where $K_t$ is the Kalman gain matrix

$$K_t = V_{t|t-1} C^T S_t^{-1} \tag{A.9}$$

The log likelihood of the N-dimensional observation can also be calculated from

$$L_t = \log N(e_t; 0, S_t) \tag{A.10}$$
$$\;\;\;= -\frac{1}{2}\, e_t^T S_t^{-1} e_t - \frac{1}{2}\log|S_t| - \frac{N}{2}\log 2\pi \tag{A.11}$$
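The filtering recursion (A.1)-(A.11) can be summarized in a few lines of code. The following NumPy sketch is illustrative only; variable names mirror the equations, and numerical refinements (e.g., Joseph-form covariance updates or square-root filtering) are omitted.

```python
import numpy as np

def kalman_filter_step(x_prev, V_prev, y_t, A, C, Q, R):
    """One Kalman filtering step, following (A.1)-(A.11):
    time update, innovation, gain, a posteriori update and log-likelihood."""
    # Time update (A.1)-(A.2)
    x_pred = A @ x_prev
    V_pred = A @ V_prev @ A.T + Q
    # Innovation and its covariance (A.3)-(A.4)
    e_t = y_t - C @ x_pred
    S_t = C @ V_pred @ C.T + R
    # Kalman gain (A.9)
    K_t = V_pred @ C.T @ np.linalg.inv(S_t)
    # A posteriori estimates (A.5)-(A.8)
    x_filt = x_pred + K_t @ e_t
    V_filt = V_pred - K_t @ S_t @ K_t.T
    V_cross = (np.eye(len(x_prev)) - K_t @ C) @ A @ V_prev   # V_{t,t-1|t}
    # Log-likelihood of the innovation (A.10)-(A.11)
    N = len(y_t)
    loglik = -0.5 * (e_t @ np.linalg.solve(S_t, e_t)
                     + np.log(np.linalg.det(S_t)) + N * np.log(2 * np.pi))
    return x_filt, V_filt, V_cross, loglik
```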

A.2 Kalman Smoothing

Instead of using only the observations up to time $t$ in the estimation, the entire sequence is used to calculate $P(X_t \mid y_{1:T})$. This can be used to calculate $E[X_t \mid y_{1:T}]$. First, the predictions are calculated.

$$x_{t+1|t} = A x_{t|t} \tag{A.12}$$
$$V_{t+1|t} = A V_{t|t} A^T + Q \tag{A.13}$$

Then calculate the estimate updates.

$$x_{t|T} = x_{t|t} + J_t (x_{t+1|T} - x_{t+1|t}) \tag{A.14}$$
$$V_{t|T} = V_{t|t} + J_t (V_{t+1|T} - V_{t+1|t}) J_t^T \tag{A.15}$$
$$V_{t+1,t|T} = V_{t+1,t|t+1} + (V_{t+1|T} - V_{t+1|t+1}) V_{t+1|t+1}^{-1} V_{t+1,t|t+1} \tag{A.16}$$
where $J_t$ is the smoother gain matrix

$$J_t = V_{t|t} A^T V_{t+1|t}^{-1} \tag{A.17}$$
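A corresponding sketch of one backward (RTS) smoothing step, following (A.12)-(A.17); it is illustrative only, and the cross-covariance recursion (A.16) is omitted here for brevity.

```python
import numpy as np

def kalman_smooth_step(x_filt, V_filt, x_next_smooth, V_next_smooth, A, Q):
    """One backward Kalman (RTS) smoothing step, following (A.12)-(A.17)."""
    # Predictions (A.12)-(A.13)
    x_pred = A @ x_filt
    V_pred = A @ V_filt @ A.T + Q
    # Smoother gain (A.17)
    J_t = V_filt @ A.T @ np.linalg.inv(V_pred)
    # Smoothed estimates (A.14)-(A.15)
    x_smooth = x_filt + J_t @ (x_next_smooth - x_pred)
    V_smooth = V_filt + J_t @ (V_next_smooth - V_pred) @ J_t.T
    return x_smooth, V_smooth, J_t
```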

A.3 Switching State-Space Inference

Most formulae in this part were extracted from [114]. Let’s first denote

$$\hat{x}^j_{t|t} = E[X_t \mid y_{1:t}, S_t = j] \tag{A.18}$$
$$\hat{V}^j_{t|t} = \mathrm{Cov}[X_t \mid y_{1:t}, S_t = j] \tag{A.19}$$
$$L^j_t = \Pr(y_t \mid y_{1:t-1}, S_t = j) \quad \text{(likelihood)} \tag{A.20}$$
$$T(i,j) = \Pr(S_t = j \mid S_{t-1} = i) \quad \text{(transition probability)} \tag{A.21}$$

The estimates $\hat{x}^j_{t|t}$ and $\hat{V}^j_{t|t}$ are the mean and the covariance of the state variable given that the system is in the discrete state $S_j$ at time $t$. $L^j_t$ is the likelihood of the innovation given that the current discrete state at time $t$ is $S_j$.

A.3.1 GPB1

For each time step, t, we obtain the filtered statistics using the following steps.

$$(\hat{x}^j_{t|t}, \hat{V}^j_{t|t}, \hat{V}^j_{t,t-1|t}, L^j_t) = \mathrm{Filter}(\hat{x}_{t-1|t-1}, \hat{V}_{t-1|t-1}, y_t;\, A_j, C_j, Q_j, R_j) \tag{A.22}$$
$$M^j_{t|t} = \Pr(S_t = j \mid y_{1:t}) = \frac{L^j_t \sum_i T(i,j)\, M^i_{t-1|t-1}}{\sum_j L^j_t \sum_i T(i,j)\, M^i_{t-1|t-1}} \tag{A.23}$$
$$(\hat{x}_{t|t}, \hat{V}_{t|t}) = \mathrm{Collapse}(\hat{x}^j_{t|t}, \hat{V}^j_{t|t}, M^j_{t|t}) \tag{A.24}$$
where Filter is the Kalman filtering operation given in Appendix A.1 and Collapse is a moment-matching collapsing operation defined as

$$\mathrm{Collapse}(\mu^j_X, V^j_X, P^j) = \mathrm{CollapseCross}(\mu^j_X, \mu^j_X, V^j_X, P^j) \tag{A.25}$$

$X$ and $Y$ are two random variables with conditional means and covariance $\mu^j_X = E[X \mid S = j]$, $\mu^j_Y = E[Y \mid S = j]$ and $V^j_{X,Y}$, respectively. The mixing coefficients are $P^j = \Pr(S = j)$. The CollapseCross operation is described by

$$(\mu_X, \mu_Y, V_{X,Y}) = \mathrm{CollapseCross}(\mu^j_X, \mu^j_Y, V^j_{X,Y}, P^j) \tag{A.26}$$
$$\mu_X = \sum_j P^j \mu^j_X \tag{A.27}$$
$$\mu_Y = \sum_j P^j \mu^j_Y \tag{A.28}$$
$$V_{X,Y} = \sum_j P^j\, \mathrm{Cov}[X, Y \mid S = j] \tag{A.29}$$
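The Collapse operation is simply moment matching of a Gaussian mixture. The sketch below is a generic illustration rather than the exact routine used in this work; note that it also includes the usual spread-of-means term, so that the returned covariance is the full mixture covariance, which (A.29) as reconstructed above leaves implicit. In GPB1, (A.24) amounts to calling this routine once per time step with the per-mode filtered means and covariances and the mode posteriors $M^j_{t|t}$ of (A.23) as weights.

```python
import numpy as np

def collapse(mus, covs, weights):
    """Moment-matching collapse of a Gaussian mixture into one Gaussian,
    in the spirit of (A.25)-(A.29)."""
    mus = np.asarray(mus, dtype=float)        # shape (J, d)
    covs = np.asarray(covs, dtype=float)      # shape (J, d, d)
    w = np.asarray(weights, dtype=float)      # shape (J,)
    mu = np.einsum('j,jd->d', w, mus)                        # cf. (A.27)
    diffs = mus - mu
    cov = (np.einsum('j,jde->de', w, covs)                   # cf. (A.29)
           + np.einsum('j,jd,je->de', w, diffs, diffs))      # spread of the means
    return mu, cov
```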

A.3.2 GPB2

Let us first define some notation.

$$\hat{x}^{i(j)}_{t|\tau} = E[X_t \mid y_{1:\tau}, S_{t-1} = i, S_t = j] \tag{A.30}$$
$$\hat{x}^{j(k)}_{t|\tau} = E[X_t \mid y_{1:\tau}, S_t = j, S_{t+1} = k] \tag{A.31}$$
$$\hat{x}^j_{t|\tau} = E[X_t \mid y_{1:\tau}, S_t = j] \tag{A.32}$$
$$\hat{V}^j_{t|\tau} = \mathrm{Cov}[X_t \mid y_{1:\tau}, S_t = j] \tag{A.33}$$
$$\hat{V}^j_{t,t-1|\tau} = \mathrm{Cov}[X_t, X_{t-1} \mid y_{1:\tau}, S_t = j] \tag{A.34}$$
$$\hat{V}^{i(j)}_{t,t-1|\tau} = \mathrm{Cov}[X_t, X_{t-1} \mid y_{1:\tau}, S_{t-1} = i, S_t = j] \tag{A.35}$$
$$M_{t-1,t|\tau}(i,j) = \Pr(S_{t-1} = i, S_t = j \mid y_{1:\tau}) \tag{A.36}$$
$$M_{t|\tau}(j) = \Pr(S_t = j \mid y_{1:\tau}) \quad \text{(posterior)} \tag{A.37}$$
$$L^j_t = \Pr(y_t \mid y_{1:t-1}, S_t = j) \quad \text{(likelihood)} \tag{A.38}$$

Similar notation to that given for GPB1 in (A.18)-(A.20) is used for the state mean and covariance conditioned on a particular state. For GPB2, however, the history of cross-state trajectories must also be kept. For example, $\hat{x}^{i(j)}_{t|t}$ denotes the filtered mean of the state variable going from the discrete state $S_i$ to $S_j$ at time $t$.

A.3.3 Filtering

Now the GPB2 algorithm is executed in sequence as follows:

$$(\hat{x}^{i(j)}_{t|t}, \hat{V}^{i(j)}_{t|t}, \hat{V}^{i(j)}_{t,t-1|t}, L^{i(j)}_t) = \mathrm{Filter}(\hat{x}^i_{t-1|t-1}, \hat{V}^i_{t-1|t-1}, y_t;\, A_j, C_j, Q_j, R_j) \tag{A.39}$$
$$M_{t-1,t|t}(i,j) = \frac{L^{i(j)}_t T(i,j)\, M^i_{t-1|t-1}}{\sum_i \sum_j L^{i(j)}_t T(i,j)\, M^i_{t-1|t-1}} \tag{A.40}$$

$$M_{t|t}(j) = \sum_i M_{t-1,t|t}(i,j) \tag{A.41}$$
$$W^{i|j}_{t-1} = \Pr(S_{t-1} = i \mid S_t = j, y_{1:t}) = M_{t-1,t|t}(i,j) / M_{t|t}(j) \tag{A.42}$$
$$(\hat{x}^j_{t|t}, \hat{V}^j_{t|t}) = \mathrm{Collapse}(\hat{x}^{i(j)}_{t|t}, \hat{V}^{i(j)}_{t|t}, W^{i|j}_{t-1}) \tag{A.43}$$
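One GPB2 filtering step, (A.39)-(A.43), can then be sketched as follows, reusing the `kalman_filter_step` and `collapse` sketches given earlier in this appendix. Mode-indexed model matrices and a transition matrix `T` (with `T[i, j]` = Pr(S_t = j | S_{t-1} = i)) are assumed; this illustrates the bookkeeping only, not an optimized or numerically hardened implementation.

```python
import numpy as np

def gpb2_filter_step(x_prev, V_prev, M_prev, y_t, models, T):
    """One GPB2 filtering step in the spirit of (A.39)-(A.43).
    x_prev[i], V_prev[i], M_prev[i] hold the per-mode filtered statistics
    from time t-1; models[j] = (A_j, C_j, Q_j, R_j)."""
    M_prev = np.asarray(M_prev, dtype=float)
    J = len(models)
    x_ij, V_ij = {}, {}
    lik_ij = np.zeros((J, J))
    for i in range(J):
        for j in range(J):
            A, C, Q, R = models[j]
            # Per-trajectory Kalman update, (A.39)
            x, V, _, ll = kalman_filter_step(x_prev[i], V_prev[i], y_t, A, C, Q, R)
            x_ij[i, j], V_ij[i, j], lik_ij[i, j] = x, V, np.exp(ll)
    # Joint and marginal mode posteriors, (A.40)-(A.41)
    M_joint = lik_ij * T * M_prev[:, None]
    M_joint /= M_joint.sum()
    M_t = M_joint.sum(axis=0)
    # Collapse over the previous mode i for each current mode j, (A.42)-(A.43)
    x_t, V_t = [], []
    for j in range(J):
        w = M_joint[:, j] / M_t[j]
        mu, cov = collapse([x_ij[i, j] for i in range(J)],
                           [V_ij[i, j] for i in range(J)], w)
        x_t.append(mu)
        V_t.append(cov)
    return x_t, V_t, M_t
```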

A.3.4 Smoothing

Using the notation from the GPB2 filtering step, the smoothing pass executes the following steps in order.

$$(\hat{x}^{(j)k}_{t|T}, \hat{V}^{(j)k}_{t|T}, \hat{V}^{(j)k}_{t+1,t|T}) = \mathrm{Smooth}(\hat{x}^k_{t+1|T}, \hat{V}^k_{t+1|T}, \hat{x}^j_{t|t}, \hat{V}^j_{t|t}, \hat{V}^k_{t+1|t+1}, \hat{V}^{j(k)}_{t+1,t|t+1};\, F_k, Q_k) \tag{A.44}$$

$$U^{j|k}_t = \Pr(S_t = j \mid S_{t+1} = k, y_{1:T}) \approx \frac{M_{t|t}(j)\, T(j,k)}{\sum_{j'} M_{t|t}(j')\, T(j',k)} \tag{A.45}$$
$$M_{t,t+1|T}(j,k) = U^{j|k}_t M_{t+1|T}(k) \tag{A.46}$$

$$M_{t|T}(j) = \sum_k M_{t,t+1|T}(j,k) \tag{A.47}$$
$$W^{k|j}_t = \Pr(S_{t+1} = k \mid S_t = j, y_{1:T}) = M_{t,t+1|T}(j,k) / M_{t|T}(j) \tag{A.48}$$
$$(\hat{x}^j_{t|T}, \hat{V}^j_{t|T}) = \mathrm{Collapse}(\hat{x}^{(j)k}_{t|T}, \hat{V}^{(j)k}_{t|T}, W^{k|j}_t) \tag{A.49}$$
$$(\hat{x}_{t|T}, \hat{V}_{t|T}) = \mathrm{Collapse}(\hat{x}^j_{t|T}, \hat{V}^j_{t|T}, M_{t|T}(j)) \tag{A.50}$$
$$\hat{x}^{(j)k}_{t+1|T} = E[X_{t+1} \mid y_{1:T}, S_t = j, S_{t+1} = k] \approx \hat{x}^k_{t+1|T} \tag{A.51}$$
$$\hat{V}^k_{t+1,t|T} = \mathrm{CollapseCross}(\hat{x}^{(j)k}_{t+1|T}, \hat{x}^{(j)k}_{t|T}, \hat{V}^{(j)k}_{t+1,t|T}, U^{j|k}_t) \tag{A.52}$$
$$\hat{x}^{(k)}_{t|T} = E[X_t \mid y_{1:T}, S_{t+1} = k] = \sum_j \hat{x}^{j(k)}_{t|T} U^{j|k}_t \tag{A.53}$$
$$\hat{V}_{t+1,t|T} = \mathrm{CollapseCross}(\hat{x}^k_{t+1|T}, \hat{x}^{(k)}_{t|T}, \hat{V}^k_{t+1,t|T}, M_{t+1|T}(k)) \tag{A.54}$$
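The discrete part of the backward pass, (A.45)-(A.48), amounts to a small matrix recursion over the mode posteriors. The sketch below covers only these switch posteriors; the continuous-state smoothing and collapsing of (A.44) and (A.49)-(A.54) would be layered on top of it.

```python
import numpy as np

def gpb2_switch_smoother(M_filt, T):
    """Backward recursion over the discrete switch posteriors, following
    (A.45)-(A.48).  M_filt[t, j] = Pr(S_t = j | y_{1:t}) from the forward
    pass; T[j, k] = Pr(S_{t+1} = k | S_t = j).  Returns Pr(S_t = j | y_{1:T})."""
    M_filt = np.asarray(M_filt, dtype=float)
    n, J = M_filt.shape
    M_smooth = np.zeros_like(M_filt)
    M_smooth[-1] = M_filt[-1]
    for t in range(n - 2, -1, -1):
        # Approximate backward mixing weights U_t^{j|k}, (A.45)
        U = M_filt[t][:, None] * T
        U /= U.sum(axis=0, keepdims=True)
        # Joint posterior (A.46) and its marginal over k (A.47)
        M_joint = U * M_smooth[t + 1][None, :]
        M_smooth[t] = M_joint.sum(axis=1)
    return M_smooth
```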

A.3.5 Unscented Kalman Filtering

First, the unscented transform is described. Given a nonlinear function relating two random variables, $y = f(x)$, the input distribution is characterized by a set of deterministically chosen points called "Sigma points." These points are propagated through the nonlinear function and a Gaussian distribution is fit to the resulting transformed points. The approximation is accurate to at least second order, whereas the EKF is only first-order accurate. Its computational complexity is of order $O(d^3)$, where $d$ is the dimension of the input variable. Assuming $P(x) = N(x; \hat{x}, P_x)$, the Sigma points are chosen to be

$$x_0 = \hat{x} \tag{A.55}$$

$$x_i = \hat{x} + \left(\sqrt{(d+\lambda)P_x}\right)_i, \qquad i = 1, \ldots, d \tag{A.56}$$
$$x_i = \hat{x} - \left(\sqrt{(d+\lambda)P_x}\right)_{i-d}, \qquad i = d+1, \ldots, 2d \tag{A.57}$$
where $\lambda = \alpha^2(d+\kappa) - d$ is a scaling parameter. The optimal values of $\alpha$, $\beta$ and $\kappa$ are problem dependent. $(\sqrt{(d+\lambda)P_x})_i$ is the $i$th column of the matrix square root, giving points that are $\pm 1$ standard deviation around the mean. The transformed points are then $y_i = f(x_i)$, and the mean and covariance of $y$ can be calculated by

$$\hat{y} = \sum_{i=0}^{2d} W^{(m)}_i y_i \tag{A.58}$$
$$P_y = \sum_{i=0}^{2d} W^{(c)}_i (y_i - \hat{y})(y_i - \hat{y})^T \tag{A.59}$$
where

$$W^{(m)}_0 = \lambda/(d+\lambda) \tag{A.60}$$
$$W^{(c)}_0 = \lambda/(d+\lambda) + (1 - \alpha^2 + \beta) \tag{A.61}$$
$$W^{(m)}_i = W^{(c)}_i = 1/\big(2(d+\lambda)\big), \qquad i = 1, \ldots, 2d \tag{A.62}$$
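The sigma-point construction (A.55)-(A.62) and the propagation (A.58)-(A.59) translate directly into code. The following sketch uses a Cholesky factor as the matrix square root and common default values for α, β and κ; these defaults are illustrative, not values prescribed in this work.

```python
import numpy as np

def unscented_transform(f, x_mean, P_x, alpha=1e-3, beta=2.0, kappa=0.0):
    """Propagate a Gaussian N(x_mean, P_x) through a nonlinearity f using
    the sigma-point construction of (A.55)-(A.62)."""
    x_mean = np.asarray(x_mean, dtype=float)
    d = len(x_mean)
    lam = alpha**2 * (d + kappa) - d
    L = np.linalg.cholesky((d + lam) * P_x)   # columns are the scaled square-root directions
    # Sigma points (A.55)-(A.57)
    sigma = np.vstack([x_mean, x_mean + L.T, x_mean - L.T])   # shape (2d+1, d)
    # Weights (A.60)-(A.62)
    Wm = np.full(2 * d + 1, 1.0 / (2.0 * (d + lam)))
    Wc = Wm.copy()
    Wm[0] = lam / (d + lam)
    Wc[0] = lam / (d + lam) + (1.0 - alpha**2 + beta)
    # Transformed points, mean and covariance (A.58)-(A.59)
    y = np.array([np.atleast_1d(f(s)) for s in sigma])
    y_mean = Wm @ y
    diff = y - y_mean
    P_y = (Wc[:, None] * diff).T @ diff
    return y_mean, P_y, Wm, Wc
```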

The UKF algorithm is a simple modification of basic Kalman filtering, with the nonlinearity handled by the unscented transform. For zero-mean additive Gaussian noise and a dynamical system given as follows:

$$x_t = f(x_{t-1}) + v_t \tag{A.63}$$

$$y_t = g(x_t) + w_t \tag{A.64}$$

We first create a concatenated vector,

$$\mathcal{X}_{t-1} = \left[\hat{x}_{t-1|t-1},\;\; \hat{x}_{t-1|t-1} + \gamma\sqrt{P_{t-1|t-1}},\;\; \hat{x}_{t-1|t-1} - \gamma\sqrt{P_{t-1|t-1}}\right] \tag{A.65}$$
where $\gamma = \sqrt{d+\lambda}$. The updates are calculated as

$$\mathcal{X}_{t|t-1} = f(\mathcal{X}_{t-1}) \tag{A.66}$$
$$\hat{x}_{t|t-1} = \sum_{i=0}^{2d} W^{(m)}_i \mathcal{X}_{i,t|t-1} \tag{A.67}$$
$$\hat{P}_{t|t-1} = \sum_{i=0}^{2d} W^{(c)}_i (\mathcal{X}_{i,t|t-1} - \hat{x}_{t|t-1})(\mathcal{X}_{i,t|t-1} - \hat{x}_{t|t-1})^T + Q \tag{A.68}$$
$$\hat{\mathcal{Y}}_{t|t-1} = g(\mathcal{X}_{t|t-1}) \tag{A.69}$$
$$\hat{y}_{t|t-1} = \sum_{i=0}^{2d} W^{(m)}_i \mathcal{Y}_{i,t|t-1} \tag{A.70}$$
The measurement updates are

$$S_t = \sum_{i=0}^{2d} W^{(c)}_i (\mathcal{Y}_{i,t|t-1} - \hat{y}_{t|t-1})(\mathcal{Y}_{i,t|t-1} - \hat{y}_{t|t-1})^T + R \tag{A.71}$$
$$\hat{V}_{x_t y_t} = \sum_{i=0}^{2d} W^{(c)}_i (\mathcal{X}_{i,t|t-1} - \hat{x}_{t|t-1})(\mathcal{Y}_{i,t|t-1} - \hat{y}_{t|t-1})^T \tag{A.72}$$
$$K_t = \hat{V}_{x_t y_t} S_t^{-1} \tag{A.73}$$
$$\hat{x}_{t|t} = \hat{x}_{t|t-1} + K_t (y_t - \hat{y}_{t|t-1}) \tag{A.74}$$
$$V_{t|t} = V_{t|t-1} - K_t S_t K_t^T \tag{A.75}$$
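Putting the pieces together, one UKF time and measurement update, (A.65)-(A.75), for the additive-noise model (A.63)-(A.64) can be sketched as below. Again this is a minimal illustration: the sigma points are not redrawn after the time update, matching (A.69), the α/β/κ defaults are common illustrative choices, and no numerical safeguards are included.

```python
import numpy as np

def ukf_step(x_prev, V_prev, y_t, f, g, Q, R, alpha=1e-3, beta=2.0, kappa=0.0):
    """One UKF time/measurement update following (A.65)-(A.75) for the
    additive-noise model x_t = f(x_{t-1}) + v_t, y_t = g(x_t) + w_t."""
    x_prev = np.asarray(x_prev, dtype=float)
    y_t = np.atleast_1d(y_t)
    d = len(x_prev)
    lam = alpha**2 * (d + kappa) - d
    gamma = np.sqrt(d + lam)
    Wm = np.full(2 * d + 1, 1.0 / (2.0 * (d + lam)))
    Wc = Wm.copy()
    Wm[0] = lam / (d + lam)
    Wc[0] = lam / (d + lam) + (1.0 - alpha**2 + beta)
    # Sigma points around the previous estimate (A.65)
    L = np.linalg.cholesky(V_prev)
    X_prev = np.vstack([x_prev, x_prev + gamma * L.T, x_prev - gamma * L.T])
    # Time update (A.66)-(A.70)
    X_pred = np.array([f(x) for x in X_prev])
    x_pred = Wm @ X_pred
    dX = X_pred - x_pred
    P_pred = (Wc[:, None] * dX).T @ dX + Q
    Y_pred = np.array([np.atleast_1d(g(x)) for x in X_pred])
    y_pred = Wm @ Y_pred
    dY = Y_pred - y_pred
    # Measurement update (A.71)-(A.75)
    S_t = (Wc[:, None] * dY).T @ dY + R
    V_xy = (Wc[:, None] * dX).T @ dY
    K_t = V_xy @ np.linalg.inv(S_t)
    x_filt = x_pred + K_t @ (y_t - y_pred)
    V_filt = P_pred - K_t @ S_t @ K_t.T
    return x_filt, V_filt
```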

Appendix B

Maximum Likelihood Derivation for Joint Source-Filter Estimation

The complete data log-likelihood of the model in Chapter 5 can be expressed as

$$L(\theta) = \log p(Z, Y) \tag{B.1}$$

$$\;\;\;\;= \sum_{k=1}^{K} \sum_{t=t_k}^{T_k} \log p(y_t \mid z_t) + \sum_{k=1}^{K} \sum_{t=t_k}^{T_k} \log p(z_t \mid z_{t-1}) \tag{B.2}$$
where $z$ refers to the concatenated state of the clean voice and the noise, and $k$ is the index of glottal periods. Assuming independence among glottal periods and no correlation between voice and noise, the likelihood term relating to the voice parameters, $\theta_s = [\alpha_s^T \;\; a_g \;\; b_g]^T$, and its model error, $q_s$, can be expressed for a single period as
$$L(\theta) \propto -\frac{N-1}{2}\log q_s - \frac{1}{2 q_s}\sum_{n=2}^{N}\left(x_n - \theta_s^T d_n\right)^2 \tag{B.3}$$
where $d_n = [x_{n-1}^T \;\; u_n^T]^T$ according to the model in (5.1). Taking a partial derivative with respect to $\theta_s$ and applying the MMSE estimator ($\langle\cdot\rangle$) to the missing data, the maximum likelihood estimate can be calculated as follows:


$$\frac{\partial L(\theta)}{\partial \theta_s} = \frac{1}{q_s}\left(\sum_{n=2}^{N}\langle x_n d_n\rangle - \sum_{n=2}^{N}\langle d_n d_n^T\rangle\,\theta_s\right) \tag{B.4}$$
$$\hat{\theta}_{s,\mathrm{ML}} = \left[\sum_{n=2}^{N}\langle d_n d_n^T\rangle\right]^{-1}\left[\sum_{n=2}^{N}\langle x_n d_n\rangle\right] \tag{B.5}$$

$$\;\;\;\;= \left[\sum_{n=2}^{N}\begin{pmatrix}\langle x_{n-1} x_{n-1}^T\rangle & \langle x_{n-1} u_n^T\rangle \\ \langle u_n x_{n-1}^T\rangle & u_n u_n^T\end{pmatrix}\right]^{-1}\left[\sum_{n=2}^{N}\begin{pmatrix}\langle x_n x_{n-1}\rangle \\ \langle x_n u_n\rangle\end{pmatrix}\right] \tag{B.6}$$
which results in Equations (5.15)-(5.17). The process error variance, $q_s$, can be estimated similarly.

$$\frac{\partial L(\theta)}{\partial q_s} = -\frac{N-1}{2 q_s} + \frac{1}{2 q_s^2}\sum_{n=2}^{N}\left\langle\left(x_n - \theta_s^T d_n\right)^2\right\rangle \tag{B.7}$$
$$\hat{q}_{s,\mathrm{ML}} = \frac{1}{N-1}\sum_{n=2}^{N}\left(\langle x_n^2\rangle - 2\theta_s^T\langle x_n d_n\rangle + \theta_s^T\langle d_n d_n^T\rangle\theta_s\right) \tag{B.8}$$
which leads to Equation (5.18). The stationary noise AR parameters and their process error variance can be derived similarly, except that the sum runs over all periods instead of within a single period. Their derivation can also be found in [97].
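As a small illustration of the M-step that (B.5) and (B.8) define, the following sketch computes the voice-parameter and process-variance updates from the accumulated E-step statistics. The argument names are introduced here for the example only and do not correspond to identifiers in the original implementation.

```python
import numpy as np

def mstep_source_filter(xd_sum, ddT_sum, xx_sum, N):
    """M-step updates following (B.5) and (B.8), given the E-step sums
    xd_sum = sum_n <x_n d_n>, ddT_sum = sum_n <d_n d_n^T> and
    xx_sum = sum_n <x_n^2>, accumulated over n = 2..N of one glottal period."""
    theta = np.linalg.solve(ddT_sum, xd_sum)                                  # (B.5)
    q = (xx_sum - 2.0 * theta @ xd_sum + theta @ ddT_sum @ theta) / (N - 1)   # (B.8)
    return theta, q
```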

Bibliography

[1] J. O. Smith, “Viewpoints on the history of digital synthesis,” in Int. Computer Music Conf., 1991, pp. 1–10.

[2] E. Schierer and Y. M. Kim, “Generalized audio coding with MPEG-4 structured audio,” in AES 17th Int. Conf. on High Quality Audio Coding, 1999, pp. 1–16.

[3] J. Sundberg, “Perceptual aspects of singing,” Journal of Voice, vol. 8, no. 2, pp. 106–22, 1994.

[4] M. Hasegawa-Johnson and A. Alwan, “Speech coding: Fundamentals and ap- plications,” in Wiley Encyclopedia of Telecommunications. John Wiley & Sons, Inc., 2003.

[5] M. R. Schroeder, “Vocoders: Analysis and synthesis of speech (a review of 30 years of applied speech research),” IEEE Trans., vol. 54, pp. 720–734, May 1966.

[6] R. Carlson, “Models of speech synthesis,” in Colloquium on Human-Machine Communication by Voice. National Academy of Sciences, April 1993.

[7] E. Moulines and F. Charpentier, “Pitch synchronous waveform processing tech- niques for text-to-speech synthesis using diphones,” Speech Communication, 1990.

[8] A. Black, “CLUSTERGEN: A statistical parametric synthesizer using trajec- tory modeling,” in ICSLP, April 2006.


[9] Y. Stylianou, “Concatenative speech synthesis using a harmonic plus noise model,” in The 3rd ESCA/COCOSDA Workshop on Speech Synthesis, 1998.

[10] B. Vercoe, W. Gardner, and E. Scheirer, “Structured audio: Creation, trans- mission, and rendering of parametric sound representations,” Proc. IEEE, vol. 86, no. 5, pp. 922–940, May 1998.

[11] M. A. Casey, Auditory Group Theory with Applications to Statistical Basis Methods for Structured Audio, Ph.D. thesis, M.I.T. Media Laboratory, Cam- bridge, MA, February 1998.

[12] S. R. Quackenbush, “Coding of natural audio in MPEG-4,” in ICASSP, 1998, vol. 6, pp. 3797–3800.

[13] E. D. Scheirer, “Structured audio, Kolmogorov complexity and generalized audio coding,” IEEE Transaction on Speech and Audio Processing, vol. 9, no. 8, pp. 914–931, November 2001.

[14] Julius O. Smith III, Physical Audio Signal Processing: For Virtual Musical Instruments and Digital Audio Effects, http://ccrma.stanford.edu/~jos/pasp/, 2006.

[15] M. Reyes-Gomez, N. Jojic, and D. P. W. Ellis, “Towards single-channel unsu- pervised source separation of speech mixtures: The layered harmonics/formants separation-tracking model,” in ISCA Tutorial and Research Workshop on Sta- tistical and Perceptual Audio Processing, 2004.

[16] S. T. Roweis, “One microphone source separation,” Neural Information Pro- cessing Systems, 2000.

[17] M. Casey and A. Westner, “Separation of mixed audio sources by independent subspace analysis,” in ICMA, August 2000.

[18] A. S. Bregman, Auditory Scene Analysis: The Perceptual Organization of Sound, MIT Press, Cambridge, Massachusetts, 1990.

[19] M. Bosi and R. E. Goldberg, Introduction to Digital Audio Coding and Stan- dards, Kluwer Academic Publishers, 2003.

[20] M. E. Lee, A. S. Durey, E. Moore, and M. A. Clements, “Ultra low bit rate speech coding using an ergodic hidden Markov model,” in ICASSP, 2005.

[21] A. J. Bell and T. J. Sejnowski, “Learning the higher-order structures of a natural sound,” Network: Computation in Neural Systems, vol. 7, no. 2, pp. 261–266, July 1996.

[22] Gil-Jin Jang, Te-Won Lee, and Yung-Hwan Oh, “Learning statistically efficient features for speaker recognition,” Neural Computing, vol. 49, no. 1-4, pp. 329– 348, 2002.

[23] T.-W. Lee and G.-J. Jang, “The statistical structures of male and female speech signals,” in ICASSP, May 2001.

[24] P. Smaragdis, Redundancy Reduction for Computational Audition: A Unifying Approach, Ph.D. thesis, M.I.T. Media laboratory, 2001.

[25] P. Jinachitra, “Polyphonic instrument identification using independent sub- space analysis,” in ICME, 2004.

[26] D. D. Lee and H. S. Seung, “Algorithms for non-negative matrix factorization,” Neural Information Processing Systems, pp. 556–562, 2000.

[27] N. Schnell, G. Peeters, S. Lemouton, P. Manoury, and X. Rodet, “Synthesizing a choir in real-time using pitch synchronous overlap add (PSOLA),” in Int. Computer Music Conf., 2000, pp. 102–108.

[28] S. Mallat and S. Zhang, “Matching pursuits with time-frequency dictionar- ies,” IEEE Transactions on Signal Processing, vol. 41, no. 12, pp. 3397–3416, December 1993.

[29] S. S. Chen, D. L. Donoho, and M. A. Saunders, “Atomic decomposition by basis pursuit,” SIAM Journal on Scientific Computing, 1996.

[30] R. R. Coifman and M. V. Wickerhauser, “Entropy-based algorithms for best- basis selection,” IEEE Transactions on Information Theory, vol. 38, pp. 713– 718, 1992.

[31] B. H. Story, “A parametric model of the vocal tract area function for vowel and consonant simulation,” J. Acoust. Soc. of Am., vol. 117, no. 5, pp. 3231–3254, May 2005.

[32] R. McAulay and T. Quatieri, “Speech analysis/synthesis based on a sinusoidal representation,” IEEE Trans. ASSP, vol. 34, no. 4, pp. 744–754, August 1986.

[33] X. Serra and J. O. Smith, “Spectral modeling synthesis: A sound analy- sis/synthesis system on a deterministic plus stochastic decomposition,” Com- puter Music Journal, vol. 14, no. 4, pp. 12–24, 1990.

[34] M. Abe and J. O. Smith, “Design criteria for simple sinusoidal parameter estimation based on quadratic interpolation of FFT magnitude peaks,” in Audio Engineering Society Convention, San Francisco, 2004.

[35] P. Jinachitra, “Constrained EM estimates for harmonic source separation,” in ICASSP, 2003.

[36] Y. Stylianou, J. Laroche, and E. Moulines, “High-quality speech modification based on a harmonic+noise model,” in Eurospeech, 1995, pp. 451–454.

[37] A. L. C. Wang, Instantaneous and Frequency-Warped Signal Processing Tech- niques for Auditory Source Separation, Ph.D. thesis, Electrical Engineering Department, Stanford University, 1994.

[38] S. N. Levine, Audio Representations for Data Compression and Compressed Domain Processing, Ph.D. thesis, Electrical Engineering Department, Stanford University, 1998.

[39] M. Macon, L. Jensen-Link, J. Oliverio, M. Clements, and E. B. George, “A singing voice synthesis system based on sinusoidal modeling,” in ICASSP, 1997.

[40] J. Bonada and X. Serra, “Synthesis of the singing voice by performance sam- pling and spectral models,” IEEE Signal Processing Magazine, vol. 24, no. 2, pp. 67–79, 2007.

[41] B. Edler, H. Purnhagen, and C. Ferekidis, “ASAC-analysis/synthesis audio codec for very low bit rates,” in 100th Conv. of AES, May 1996.

[42] H. Purnhagen and N. Meine, “HILN - The MPEG-4 parametric audio coding tools,” in ISCAS, May 2000, vol. 3, pp. 201–204.

[43] M. Nishiguchi and J. Matsumoto, “Harmonic and noise coding of LPC residuals with classified vector quantization,” in ICASSP, May 1995.

[44] M. Nishiguchi, A. Inoue, Y. Maeda, and J. Matsumoto, “Parametric speech coding - HVXC at 2.0-4.0 kbps,” in IEEE Speech Coding Workshop, 1999, pp. 84–86.

[45] J. Chowning, “Frequency modulation synthesis of the singing voice,” in Current Directions in Computer Music Research, pp. 57–63. MIT Press, Cambridge, MA, 1989.

[46] D. Chazan, R. Hoory, A. Sagi, S. Shechtman, A. Sorin, Z. W. Shuang, and R. Bakis, “High quality sinusoidal modeling of wideband speech for the purposes of speech synthesis and modification,” in ICASSP, 2006.

[47] F. Thibault and P. Depalle, “Adaptive processing of singing voice timbre,” in DAFX, 2004.

[48] M. R. Schroeder, Computer Speech: Recognition, Compression, Synthesis, in- formation Science. Springer, 1999.

[49] J. L. Flanagan and R. M. Golden, “Phase vocoder,” Bell System Technical Journal, no. 45, pp. 1493–1509, 1966.

[50] J. Laroche, “Time and pitch scale modification of audio signals,” in Applications of Digital Signal Processing to Audio and Acoustics, M. Kahrs and K. Brandenburg, Eds. Kluwer, Norwell, MA, 1998.

[51] J. D. Markel and A. H. Gray Jr., Linear prediction of speech, Springer-Verlag, 1976.

[52] D. G. Childers, D. M. Hicks, G. P. Moore, and Y. A. Alsaka, “A model for vocal fold vibratory motion, contact area, and the electroglottogram,” J. Acoust. Soc. Am., vol. 80, no. 5, pp. 1309–20, 1986.

[53] J. R. Deller, J. G. Proakis, and J. H. Hansen, Discrete-time processing of speech signals, Macmillan, 1993.

[54] D. W. Griffin and J. S. Lim, “Multi-band excitation vocoder,” IEEE Trans. on Speech and Audio Processing, vol. 36, no. 8, pp. 1223–35, 1988.

[55] A. E. Rosenberg, “Effect of glottal pulse shape on the quality of natural vowels,” J. Acoust. Soc. of Am., vol. 49, no. 2, pp. 583–590, 1971.

[56] G. Fant, J. Liljencrants, and Q. Lin, “A four-parameter model of glottal flow,” Tech. Rep., STL-QPSR, 1985.

[57] D. H. Klatt and L. C. Klatt, “Analysis, synthesis and perception of voice quality variations among female and male talkers,” J. Acoust. Soc. Am., vol. 87, no. 2, pp. 820–857, 1990.

[58] P. Alku, T. Bäckström, and E. Vilkman, “Normalized amplitude quotient for parameterization of the glottal flow,” J. Acoust. Soc. of Am., vol. 112, no. 2, pp. 701–710, 2002.

[59] H. Fujisaki and M. Ljungquist, “Proposal and evaluation of models for the glottal source waveform,” in ICASSP, 1986, vol. 31, pp. 1605–08.

[60] A. Acero, “Source-filter models for time-scale pitch-scale modification of speech,” in ICASSP, 1998.

[61] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, Prentice Hall, Englewood Cliffs, NJ, 1978.

[62] J. Schroeter and M. M. Sondhi, “Techniques for estimating vocal-tract shapes from the speech signal,” IEEE Transaction on Speech and Audio Processing, vol. 2, no. 1(II), pp. 133–150, January 1994.

[63] D. H. Klatt, “Software for a cascade/parallel formant synthesizer,” J. Acoust. Soc. Am., vol. 67, no. 3, pp. 971–995, March 1980.

[64] X. Rodet, “Time-domain formant wave-function synthesis,” Computer Music Journal, pp. 6–14, 1984.

[65] G. Bennett and X. Rodet, “Synthesis of the singing voice,” in Current Di- rections in Computer Music Research, pp. 19–44. MIT Press, Cambridge, MA, 1989.

[66] J. L. Kelly and C. C. Lochbaum, “Speech synthesis,” in Proc. Fourth Int. Congress on Acoustics, Copenhagen, September 1962, pp. 1–4.

[67] P. R. Cook, Identification of Control Parameters in an Articulatory Vocal Tract Model with Applications to the Synthesis of Singing, Ph.D. thesis, Department of Electrical Engineering, Stanford University, 1991.

[68] H. L. Lu, Towards a High Quality Singing Synthesizer with Vocal Texture Con- trol, Ph.D. thesis, Department of Electrical Engineering, Stanford University, 2001.

[69] J. Mullen, D. M. Howard, and D. T. Murphy, “Digital waveguide mesh modeling of the vocal tract acoustics,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, October 2003, pp. 119–122.

[70] J. Mullen, D. M. Howard, and D. T. Murphy, “Waveguide physical modeling of vocal tract acoustics: Flexible formant bandwidth control from increased model dimensionality,” IEEE Transaction on Speech, Audio and Language Processing, vol. 14, no. 3, pp. 1–8, May 2005.

[71] V. Välimäki and M. Karjalainen, “Improving the Kelly-Lochbaum vocal tract model using conical tube sections and fractional delay filtering techniques,” in ICSLP, September 1994, vol. 2, pp. 615–618.

[72] S. Mathur, B. H. Story, and J. J. Rodríguez, “Vocal-tract modeling: Fractional elongation of segment lengths in a waveguide model with half-sample delays,” IEEE Trans. on Audio, Speech and Language Processing, 2006.

[73] F. Avanzini, C. Drioli, and P. Alku, “Synthesis of the voice source using a physically-informed model of the glottis,” in ISMA, 2001.

[74] K. Ishizaka and J. Flanagan, “Synthesis of voiced sounds from a two-mass model of vocal cords,” Bell Syst. Tech. J., vol. 51, pp. 1233–1269, 1972.

[75] C. Cooper, D. Murphy, D. Howard, and A. Tyrrell, “Singing synthesis with an evolved physical model,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1454–61, July 2006.

[76] P. Birkholz and D. Jackel, “Construction and control of a three-dimensional vocal tract model,” in ICASSP, 2006.

[77] C. H. Coker, “Synthesis by rule from articulatory parameters,” in IEEE Con- ference on Speech Communication Processes, 1967, pp. 52–53.

[78] C. H. Coker, “A model of articulatory dynamics and control,” in Proceedings of the IEEE, April 1976, vol. 64, pp. 452–460.

[79] S. Parthasarathy and C. H. Coker, “Phoneme-level parameterization of speech using an articulatory model,” in ICASSP, 1990, pp. 337–340.

[80] S. Chennoukh, D. Sinder, G. Richard, and J. Flanagan, “Articulatory based low bit-rate speech coding,” in Journal of the Acoustical Society of America, 1997, vol. 102, p. 3163.

[81] M. Karjalainen, T. Altosaar, and M. Vainio, “Neural net trained to map control parameter to acoustic wave generation, with tract modeled by WLPC,” in ICASSP, 1998, vol. 2, pp. 877–880.

[82] L. J .Lee, P. Fieguth, and L. Deng, “A functional articulatory dynamic model for speech production,” in ICASSP, 2001.

[83] R. Togneri and L. Deng, “An EKF-based algorithm for learning statistical hidden dynamic model parameters for phonetic recognition,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, April 2001, vol. 1, pp. 465–468.

[84] S. Dusan and L. Deng, “Estimation of articulatory parameters from speech acoustics by Kalman filtering,” in Proceedings of the Inaugural CITO Researcher Retreat, 1998.

[85] Y. Laprie and P. Mathieu, “A variational approach for estimating vocal tract shapes from the speech signal,” in ICASSP, 1998.

[86] S. Hiroya and M. Honda, “Determination of articulatory movements from speech acoustics using an HMM-based speech production model,” in ICASSP, 2002.

[87] J .Z. Ma and L. Deng, “Target-directed mixture dynamic models for sponta- neous speech recognition,” IEEE Transactions on Speech and Audio Processing, vol. 12, no. 1, pp. 47–58, 2004.

[88] J. Huang, S. Levinson, D. Davis, and S. Slimon, “Articulatory speech synthesis based upon fluid dynamic principles,” in ICASSP, 2002.

[89] G. Richard, M. Liu, D. Sinder, H. Duncan, Q. Lin, J. Flanagan, S. Levinson, D. Davis, and S. Slimon, “Numerical simulations of fluid flow in the vocal tract,” in Eurospeech, September 1995, pp. 1297–1300.

[90] K. Cummings, J. Maloney, and M. Clements, “Modeling speech production using Yee’s finite difference method,” in ICASSP, 1995, vol. 1, pp. 672–675.

[91] Y. E. Kim, Singing Voice Analysis/Synthesis, Ph.D. thesis, M. I. T. Media Laboratory, 2003.

[92] J. S. Lim, Speech Enhancement, Prentice-Hall, New Jersey, 1983.

[93] R. Martin, “Spectral subtraction based on minimum statistics,” in Seventh European Signal Processing Conference, 1994, pp. 1182–1185.

[94] Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean square error log-spectral amplitude estimator,” IEEE Trans. Acoust., Speech and Signal Processing, vol. ASSP-33, pp. 443–445, April 1985.

[95] Y. Ephraim, “A minimum mean square error approach for speech enhance- ment,” ICASSP, 1990.

[96] Y. Ephraim, D. Malah, and B.-H. Juang, “On the application of hidden Markov models for enhancing noisy speech,” IEEE Trans. Acoust., Speech, Signal Pro- cessing, vol. ASSP-37, December 1989.

[97] S. Gannot, D. Burshtein, and E. Weinstein, “Iterative and sequential Kalman filter-based speech enhancement algorithms,” IEEE Trans. on Speech and Audio Processing, 1998.

[98] J. Vermaak, C. Andrieu, A. Doucet, and S. J. Godsill, “Particle methods for Bayesian modeling and enhancement of speech signals,” IEEE Trans. Speech and Audio Processing, vol. 10, pp. 173–185, March 2002.

[99] J. Deng, M. Bouchard, and T. H. Yeap, “Speech enhancement using a switching Kalman filter with a perceptual post-filter,” in ICASSP, 2005, vol. 1, pp. 1121– 1124.

[100] Z. Goh, K. C. Tan, and B.T.G Tan, “Kalman-filtering speech enhancement method based on a voiced-unvoiced speech model,” IEEE Transactions on Speech and Audio Processing, vol. 7, no. 5, 1999.

[101] T. F. Quatieri and R. J. McAulay, “Noise reduction using a soft-decision sine-wave vector quantizer,” in ICASSP, 1990.

[102] Y. Ephraim and H. L. V. Trees, “A signal subspace approach for speech en- hancement,” IEEE Transactions on Speech, Audio Processing, vol. 3, no. 4, pp. 251–266, 1995.

[103] J. L. Zhou, F. Seide, and L. Deng, “Coarticulation modeling by embedding a target-directed hidden trajectory model into HMM - model and training,” in ICASSP, 2003.

[104] A. Yasmin, P. Fieguth, and L. Deng, “Speech enhancement using voice source models,” in ICASSP, 1999.

[105] J. L. Flanagan, Speech Analysis, Synthesis, and Perception, Springer Verlag, New York, 1972.

[106] D. Mehta and T. F. Quatieri, “Pitch-scale modification using the modulated aspiration noise source,” in ICSLP, April 2006.

[107] B. Matthews, R. Bakis, and E. Eide, “Synthesizing breathiness in natural speech with sinusoid modeling,” in ICSLP, April 2006.

[108] M. D. Plumpe, T. F. Quatieri, and D. A. Reynolds, “Modeling of the glottal flow derivative waveform with application to speaker identification,” IEEE Trans- action on Speech and Audio Processing, vol. 7, no. 5, pp. 569–586, September 1999.

[109] J. L. Zhou, F. Seide, and L. Deng, “Coarticulation modeling by embedding a target-directed hidden trajectory model into HMM - MAP decoding and eval- uation,” in ICASSP, 2003.

[110] A-V. I. Rosti and M. J. F. Gales, “Switching linear dynamical systems for speech recognition,” Tech. Rep., Cambridge University, 2003.

[111] Y. Zheng and M. Hasegawa-Johnson, “Acoustic segmentation using switching state Kalman filter,” in ICASSP, 2003, vol. 1, pp. 752–755.

[112] J. Droppo and A. Acero, “Noise robust speech recognition with a switching linear dynamic model,” in ICASSP, 2004, pp. 953–956.

[113] M. E. Tipping and C. M. Bishop, “Probabilistic principal component analysis,” Journal of the Royal Statistical Society, Series B, vol. 61, no. 3, pp. 611–622, 1999.

[114] K. P. Murphy, “Switching Kalman filter,” Tech. Rep., Compaq Cambridge Res. Lab, Cambridge, MA, 1998.

[115] X. Boyen and D. Koller, “Approximate learning of dynamic models,” in Ad- vances in Neural Information Processing Systems, December 1998, pp. 396–402.

[116] J. Droppo, L. Deng, and A. Acero, “A comparison of three non-linear observa- tion models for noisy speech features,” in Proc. Eurospeech, September 2003, pp. 681–684.

[117] S. Lauritzen, Graphical Models, Oxford, 1996.

[118] E. Wan, R. van der Merwe, and A. T. Nelson, “Dual estimation and the unscented transformation,” Neural Information Processing Systems, vol. 12, pp. 666–672, 2000.

[119] E. A. Wan and R. van der Merwe, “The unscented Kalman filter,” in Kalman Filtering and Neural Networks, S. Haykin, Ed. Wiley Publishing, 2001.

[120] National Institute of Standards and Technology, “TIMIT acoustic-phonetic continuous speech corpus,” CD-ROM NTIS Order No. PB91-505065, October 1990.

[121] D. Y. Wong, J. D. Markel, and A. H. Gray, “Least squares inverse filtering from the acoustic speech waveform,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 27, no. 4, pp. 350–355, August 1979.

[122] J. McKenna, “Automatic glottal closed-phase location and analysis by Kalman filtering,” in SSW4, 2001. BIBLIOGRAPHY 155

[123] R. Smits and B. Yegnanarayana, “Determination of instants of significant exci- tation in speech using group delay function,” IEEE Trans. Speech Audio Proc., vol. 3, pp. 325–333, 1995.

[124] M. Brookes, P. A. Naylor, and J. Gudnason, “A quantitative assessment of group delay methods for identifying glottal closures in voiced speech,” IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 2, pp. 456–466, March 2006.

[125] A. Kounoudes, P. A. Naylor, and M. Brookes, “The DYPSA algorithm for estimation of glottal closure instants in voiced speech,” in ICASSP, 2002.

[126] M. Brookes and H. P. Loke, “Modeling energy flow in the vocal tract with applications to glottal closure and opening detection,” in ICASSP, 1999, pp. 213–216.

[127] W. Wokurek, “Time-frequency analysis of the glottal opening,” in ICASSP, 1997.

[128] F. Plante, W. A. Ainsworth, and G. Meyer, “A pitch extraction reference database,” in Proc. Eurospeech Madrid, 1995, pp. 837–840.

[129] H. Deng, R. K. Ward, M. P. Beddoes, and M. Hodgson, “Estimating vocal-tract area functions from vowel sound signals over closed glottal phases,” in ICASSP, 2005.

[130] H. Strik, B. Cranen, and L. Boves, “Fitting a LF-model to inverse filter signals,” in Proceedings of the 3rd European conference on speech communication and technology, 1993, vol. 1, pp. 103–106.

[131] E. Moore and M. Clements, “Algorithm for automatic glottal waveform estima- tion without the reliance on precise glottal closure information,” in ICASSP, 2004, vol. 1, pp. 533–536.

[132] R. W. Morris, M. A. Clements, and J. S. Collura, “Autoregressive parameter estimation of speech in noise,” in ICASSP, 2005.

[133] K. K. Paliwal and N. Koestoer, “Robust linear prediction analysis for low bit-rate speech coding,” in WOSPA, 2002.

[134] M. Fröhlich, D. Michaelis, and H. W. Strube, “SIM-simultaneous inverse filtering and matching of a glottal flow model for acoustic speech signals,” J. Acoust. Soc. Am., vol. 110, no. 1, July 2001.

[135] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society, vol. B, no. 1, pp. 1–38, 1977.

[136] X. L. Meng and D. B. Rubin, “Maximum likelihood estimation via the ECM algorithm: A general framework,” Biometrika, vol. 80, no. 2, pp. 267–278, 1993.

[137] D. G. Childers and C.-F. Wong, “Measuring and modeling vocal source-tract interaction,” IEEE Transaction on Biomedical Engineering, vol. 41, no. 7, pp. 663–671, July 1994.

[138] G. Fant and Q. Lin, “Glottal source-vocal tract acoustic interaction,” Journal of the Acoustical Society of America, vol. 81, no. 1, 1987.

[139] V. Atti and A. Spanias, “Speech analysis by estimating perceptually relevant pole locations,” in ICASSP, 2005.

[140] V. Grancharov, J. Samuelson, and W. B. Kleijn, “Improved Kalman filtering for speech enhancement,” in ICASSP, March 2005, vol. 1, pp. 1109–1112.

[141] P. A. A. Esquef, V. Välimäki, and M. Karjalainen, “Restoration and enhancement of solo guitar recordings based on sound source modeling,” Journal of the Audio Engineering Society, vol. 50, no. 4, pp. 227–236, 2002.

[142] T. Tolonen, “Object-based sound source modeling for musical signals,” in AES 109th Convention, September 2000.

[143] Z. Ghahramani and S. Roweis, “A unifying review of linear Gaussian models,” Neural Computation, no. 11, pp. 305–345, 1999.

[144] K. Achan, S. Roweis, A. Hertzmann, and B. Frey, “A segment-based proba- bilistic generative model of speech,” in ICASSP, 2005.

[145] P. Jinachitra, “Glottal closure and opening detection for flexible parametric voice coding,” in Interspeech, September 2006.

[146] P. Jinachitra and J. O. Smith III, “Joint estimation of glottal source and vocal tract for vocal synthesis using Kalman smoothing and EM algorithm,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, October 2005.

[147] J. H. Hansen and M. A. Clements, “Iterative speech enhancement with spectral constraints,” in ICASSP, April 1987, vol. 12, pp. 189–192.

[148] J. H. Hansen and L. M. Arslan, “Markov model-based phoneme class parti- tioning for improved constrained iterative speech enhancement,” IEEE Trans- actions on Speech and Audio Processing, vol. 3, no. 1, pp. 98–104, January 1995.

[149] K. Venkatesh, Framework for Speech Enhancement, Ph.D. thesis, Georgia Institute of Technology, 2005.

[150] T. V. Srinivas and P. Kirnapure, “Codebook constrained Wiener filtering for speech enhancement,” IEEE Trans. Speech and Audio Proc., vol. 4, no. 5, September 1996.

[151] G. Fant, “The voice source in connected speech,” Speech Communication, no. 22, 1997.

[152] M. Brookes, “Voicebox matlab toolbox,” http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html, 1997.

[153] R. M. Gray, A. Buzo, A. H. Gray Jr., and Y. Matsuyama, “Distortion measures for speech processing,” IEEE Trans. on Acoustics, Speech, and Signal Proc., vol. ASSP-28, no. 4, pp. 367–376, August 1980.

[154] ITU-T, “Perceptual Evaluation of Speech Quality (PESQ), an objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs,” Series P: Telephone Transmission Quality Recommendation P.862, February 2006.

[155] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, “Perceptual eval- uation of speech quality (PESQ) - A new method for speech quality assessment of telephone networks and codecs,” in ICASSP, May 2001.

[156] ITU-T, “Recommendation p.800.1 (07/06),” July 2006.

[157] Y. Yoshida and M. Abe, “An algorithm to reconstruct wide-band speech from narrow-band speech based on codebook mapping,” in ICSLP, 1994.

[158] R. Hu, K. Venkatesh, and D. V. Anderson, “Speech bandwidth extension by improved codebook mapping towards increased phonetic classification,” in In- terspeech, 2005.

[159] P. Jax and P. Vary, “Wideband extension of telephone speech using a hidden Markov model,” in ICASSP, 2003.

[160] A. Uncini, F. Gobbi, and F. Piazza, “Frequency recovery of narrow-band speech using adaptive spline neural networks,” in ICASSP, 1999.

[161] M. Nilsson, H. Gustafsson, S. V. Andersen, and W. B. Kleijn, “Gaussian mix- ture model based mutual information estimation between frequency bands in speech,” in ICASSP, 2002.

[162] P. Jax and P. Vary, “Feature selection for improved bandwidth extension of speech signals,” in ICASSP, 2004.

[163] H. Gustafsson, U. A. Lindgren, and I. Claesson, “Low-complexity feature-mapped speech bandwidth extension,” IEEE Trans. on Audio, Speech, and Language Processing, vol. 14, no. 2, pp. 577–588, March 2006.

[164] H. Yasukawa, “Restoration of wideband signal from telephone speech using linear prediction residual error filtering,” in Proc. IEEE Workshop on Speech Coding, 1996.

[165] J. Epps, Wide-band Extension of Narrow-band Speech for Enhancement and Coding, Ph.D. thesis, University of New South Wales, Australia, 2000.

[166] E. Moulines and Y. Sagisaka, “Voice conversion: state of the art and perspec- tives,” Special issue of Speech Communication, vol. 16, no. 2, February 1995.

[167] Z. Ghahramani and G. E. Hinton, “Variational learning for switching state- space models,” Neural Computation, vol. 12, no. 4, pp. 831 – 864, April 2000.

[168] A. Doucet, N. J. Gordon, and V. Krishnamurthy, “Particle filters for state estimation of jump Markov linear systems,” IEEE Trans. on Signal Processing, vol. 49, no. 3, pp. 613–624, March 2001.

[169] T. Eriksson and H. G. Kang, “Pitch quantization in low bit-rate speech coding,” in ICASSP, 1999.

[170] K. K. Paliwal and W. B. Kleijn, Quantization of LPC parameters, pp. 433–466, Elsevier, 1995.

[171] Y. Ephraim and R. M. Gray, “A unified approach for encoding clean and noisy sources by means of waveform and autoregressive vector quantization,” IEEE Transaction on Information Theory, vol. IT-34, pp. 826–834, July 1988.