Markov-Switching GARCH Models and Applications to Digital Speech Processing

Ari Abramson

Markov-Switching GARCH Models and Applications to Digital Speech Processing

Research Thesis

As Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy

Ari Abramson

Submitted to the Senate of the Technion—Israel Institute of Technology
Tevet 5768, Haifa, December 2007


The Research Thesis was Done Under the Supervision of Associate Professor Israel Cohen in the Department of Electrical Engineering.

Acknowledgement

I wish to express my deep gratitude and appreciation to Prof. Israel Cohen for his guidance and dedicated supervision. Thank you for your professional support, for your encouragement to strive for perfection, and for many valuable suggestions throughout all the stages of this research.

I would also like to thank Kuti Avargel for many fruitful discussions and Dr. Emanuël Habets for valuable discussions and for giving me the opportunity to expand my research to speech dereverberation.

Special thanks to my parents, Miri and Moshe, and to my beloved Efrat, who encouraged and supported me all the way.

The Generous Financial Help of The Technion, The Israel Science Foundation (Grant no. 1085/05), and The European Commission's IST Program Under Project Memories is Gratefully Acknowledged.

Contents

1 Introduction
   1.1 Markov-switching GARCH models
   1.2 Speech enhancement
       1.2.1 Spectral modeling of speech signals
   1.3 Other speech processing applications
   1.4 Detection and estimation of speech signals
   1.5 Thesis structure
   1.6 List of publications

2 Research Methods
   2.1 GARCH Models and stationarity analysis
       2.1.1 Markov-switching GARCH model
   2.2 Time-frequency GARCH model and spectral speech enhancement
       2.2.1 Time-frequency GARCH model
       2.2.2 Variance estimation
       2.2.3 Spectral enhancement
   2.3 Single-channel blind source separation
   2.4 Speech dereverberation

3 Stationarity Analysis of MS-GARCH Processes
   3.1 Introduction
   3.2 Stationarity of Markov-switching GARCH models
       3.2.1 MSG-I model
       3.2.2 MSG-II model


       3.2.3 Comparison of stationarity conditions
   3.3 Relation to other works
   3.4 Conclusions
   3.A Proof of Theorem 3.2
   3.B Equivalence with Haas condition

4 MS-GARCH Process in the STFT Domain
   4.1 Introduction
   4.2 Markov-switching time-frequency GARCH model
       4.2.1 Time-frequency GARCH model
       4.2.2 MSTF-GARCH formulation
       4.2.3 Stationarity of an MSTF-GARCH process
   4.3 Restoration of noisy MSTF-GARCH process
   4.4 Estimation efficiency
   4.5 Model estimation
   4.6 Experimental results
       4.6.1 MSTF-GARCH signals
       4.6.2 Speech signals
   4.7 Conclusions
   4.A Application to Speech Enhancement
       4.A.1 Introduction
       4.A.2 Model formulation
       4.A.3 Model estimation
       4.A.4 Spectral enhancement of noisy speech
       4.A.5 Experimental results and discussion

5 State Smoothing in MS-GARCH Models
   5.1 Introduction
   5.2 Problem formulation
   5.3 State probability smoothing
       5.3.1 Generalized forward-backward recursions
       5.3.2 Generalized stable backward recursion

   5.4 Experimental results
   5.5 Conclusions

6 Simultaneous Detection and Estimation
   6.1 Introduction
   6.2 Classical speech enhancement
   6.3 Reformulation of the speech enhancement problem
   6.4 Quadratic spectral amplitude cost function
   6.5 Relation to spectral subtraction
   6.6 A priori SNR estimation
   6.7 Experimental results
       6.7.1 Comparison with the STSA estimator
       6.7.2 Speech enhancement under nonstationary noise environment
   6.8 Conclusions
   6.A Risk derivation
   6.B Speech Enhancement Under Multiple Hypotheses
       6.B.1 Introduction
       6.B.2 Problem formulation
       6.B.3 Optimal estimation under a given detection
       6.B.4 Experimental results
       6.B.5 Conclusions

7 Single-Sensor Audio Source Separation
   7.1 Introduction
   7.2 Codebook-Based Separation
       7.2.1 Simultaneous Classification and Estimation
       7.2.2 Joint Classification and Estimation
   7.3 GMM Vs. GARCH Codebook
   7.4 Implementation of the Algorithm
   7.5 Experimental Results
       7.5.1 Experimental setup and quality measures
       7.5.2 Simulation results

   7.6 Conclusions
   7.A Derivation of (7.10)
   7.B Derivation of (7.11)

8 Speech Dereverberation Using GARCH Modeling
   8.1 Introduction
   8.2 Dual-microphone dereverberation
   8.3 Late reverberant spectral estimation
   8.4 Modeling early reverberation using GARCH
       8.4.1 Spectral variance estimation
       8.4.2 Speech presence probability
   8.5 Experimental results
   8.6 Conclusions

9 Research Summary and Future Directions
   9.1 Research summary
   9.2 Future research directions

Bibliography

List of Figures

3.1 Stationarity regions for two-state Markov chains with GARCH of order (1, 1)

4.1 SNR improvements obtained by using MSTF-GARCH based estimators
4.2 Trace of instantaneous output SNR achieved by the proposed algorithm
4.3 Typical traces of one-frame-ahead conditional variance estimates for speech signals
4.4 Typical traces of estimated squared absolute values for speech signal
4.5 Speech spectrograms and waveforms
4.6 Conditional speech presence probability

5.1 State smoothing error rate for 3-state MSTF-GARCH models

6.1 Independent detection and estimation system, and strongly coupled detection and estimation system

6.2 Gain curves of G1, G0, and the total detection and estimation system gain curve, compared with the STSA gain under signal presence uncertainty
6.3 Signals in the time domain: sinusoidal signal with stationary noise
6.4 Amplitudes of the STFT coefficients along the time-trajectory corresponding to the frequency of the sinusoidal signal
6.5 Signals in the time domain: sinusoidal signal with stationary and transient noise
6.6 Amplitudes of the STFT coefficients along the time-trajectory corresponding to the frequency of the sinusoidal signal
6.7 Spectrograms and waveforms of speech signal degraded by engine noise and a siren noise


6.8 Gain curves for p(H1) = 0.8, C01 = 5, C10 = 3, and Gmin = −15 dB
6.9 Speech degraded by keyboard typing noise, spectrograms and waveforms

7.1 A cascade classification and estimation scheme
7.2 Block diagram of the proposed algorithm
7.3 Quality measures for MMSE estimation as functions of the number of GARCH states
7.4 Trade-off between residual interference and signal distortion resulting from changing the false detection and missed detection parameters
7.5 Original and mixed signals. (a) Speech signal: "Draw every outer line first, then fill in the interior"; (b) piano signal (Für Elise); (c) mixed signal
7.6 Separation of speech and music signals. (a) speech signal reconstructed by using the GMM-based algorithm; (b) speech signal reconstructed using the proposed approach; (c) piano signal reconstructed by using the GMM algorithm; (d) piano signal reconstructed using the proposed approach

8.1 Dual-microphone speech dereverberation system
8.2 SegSIR and LSD as functions of the number of GARCH states
8.3 Spectrograms and waveforms of reverberant and processed speech signals

List of Tables

4.1 Vector form of the recursive MSTF-GARCH signal estimation

6.1 Segmental SNR and Log Spectral Distortion Obtained by Using Either the Simultaneous Detection and Estimation Approach or the STSA Estimator in Stationary Noise Environment
6.2 Segmental SNR, Log Spectral Distortion and PESQ Score Under Transient Noise
6.3 Segmental SNR and Log Spectral Distortion Obtained Using the OM-LSA and the Proposed Algorithm

7.1 Averaged Quality Measures for the Estimated Speech Signals Using 3-state GARCH Model and 8-state GMM
7.2 Averaged Quality Measures for the Estimated Music Signals Using 3-state GARCH Model and 8-state GMM

8.1 SegSIR and LSD obtained by using the decision-directed approach and the proposed MS-GARCH-based approach, with d = 0.5 m
8.2 SegSIR and LSD obtained by using the decision-directed approach and the proposed MS-GARCH-based approach, with d = 1 m

Abstract

This dissertation addresses theory and applications of generalized autoregressive conditional heteroscedasticity (GARCH) models with Markov regimes for digital speech processing. The GARCH model is widely used in the field of econometrics for deriving volatility forecasts of economic rates, and it was recently proposed in the field of speech processing for applications such as speech enhancement, speech recognition, and voice activity detection. GARCH models explicitly parameterize the time-varying volatility by using both past conditional variances and past squared innovations (prediction errors), while taking into account excess kurtosis (i.e., heavy-tailed distribution) and volatility clustering.

In this thesis, we develop a new statistical model for nonstationary signals in the joint time-frequency domain based on GARCH formulation with Markov regimes. The proposed model exploits the advantages of both the conditional heteroscedasticity structure of GARCH models and the time-varying characteristics of hidden Markov chains. The main motivation for this research is spectral modeling of speech signals for hands-free communication applications such as speech enhancement, nonstationary noise reduction, dereverberation, and audio source separation.

We analyze the asymptotic stationarity of Markov-switching GARCH (MS-GARCH) processes in the general case of (p, q)-order GARCH models with finite-state Markov chains. Necessary and sufficient conditions for asymptotic wide-sense stationarity are developed for several model formulations which are known in the literature. The properties of the proposed model are investigated and algorithms are developed for conditional variance estimation, as well as for signal estimation in noisy environments. The proposed model, together with the corresponding estimation algorithms, is shown to be useful for applications of speech enhancement and speech dereverberation. In addition, a state smoothing algorithm is

developed for estimating the sequence of active states. Furthermore, a new formulation for the speech enhancement problem is proposed in this thesis, which incorporates simultaneous operations of detection and estimation. A detector for speech presence in the short-time Fourier transform domain is combined with an estimator, which jointly minimizes a cost function that takes into account both detection and estimation errors. We show that the proposed simultaneous detection and estimation approach enables greater noise reduction than an estimation-only approach, without further degrading the speech signal.

A simultaneous classification and estimation approach together with GARCH modeling is employed for developing an algorithm for single-sensor audio source separation. We show that for mixtures of speech and music signals, an improved source separation can be achieved compared to using Gaussian mixture models for both signals. Moreover, cost parameters enable one to control the trade-off between missed and false detection of the desired signal, and correspondingly the trade-off between signal distortion and residual interference.

Notation

x                   scalar / time-domain signal
x (boldface)        column vector
X_tk, X(t, k)       time-frequency coefficient
{A}_ij, a_ij        the (i, j) element of matrix A
A^{-1}              matrix inverse
diag{x}             diagonal matrix with the vector x on its diagonal
eig{A}              the spectrum of matrix A
tr{·}               trace
|x|                 absolute value
|A|                 determinant
(·)^T, (·)′         transpose operation
(·)^H               Hermitian
(·)*                complex conjugate
p(·)                probability / probability density
E{·}                expectation
Var{·}              variance
Cov{·}              covariance
det(·)              determinant
I_ν(·)              modified Bessel function of order ν
J_ν(·)              Bessel function of order ν
Γ(·)                Gamma function
1F1(a; b; x)        confluent hypergeometric function
ρ(·)                spectral radius


1                   column vector of ones
0_m                 m × m matrix of zeros
I_m                 m × m identity matrix
⊗                   Kronecker product
⊙                   term-by-term vector multiplication
(÷)                 term-by-term vector division
∇                   gradient
R^N                 N-dimensional real-valued vectors
R^N_+               N-dimensional positive real-valued vectors
C^N                 N-dimensional complex-valued vectors
‖·‖                 Euclidean norm
R(·)                real part
I(·)                imaginary part

Abbreviations

AIR        Acoustic impulse response
AR         Autoregressive
ARCH       Autoregressive conditional heteroscedasticity
ARMA       Autoregressive moving average
DPC        Direct path compensation
DSB        Delay and sum beamformer
GARCH      Generalized autoregressive conditional heteroscedasticity
GMM        Gaussian mixture model
HMM        Hidden Markov model
HMP        Hidden Markov process
IMCRA      Improved minima-controlled recursive averaging

IRH0       Interference reduction
LRSVE      Late reverberant spectral variance estimator
LSA        Log spectral amplitude
LSD        Log spectral distortion
MAP        Maximum a posteriori
ML         Maximum likelihood
MMSE       Minimum mean-square error
MSE        Mean-square error
MS-GARCH / MSG   Markov-switching GARCH
MSTF-GARCH       Markov-switching time-frequency GARCH
NE         Noise estimator
OM-LSA     Optimally modified log-spectral amplitude


PESQ       Perceptual evaluation of speech quality
PSD        Power spectral density
QSA        Quadratic spectral amplitude
RIR        Room impulse response
SegSNR     Segmental signal-to-noise ratio
SegSIR     Segmental signal-to-interference ratio
SNR        Signal-to-noise ratio
STFT       Short-time Fourier transform
STSA       Short-term spectral amplitude
TF-GARCH   Time-frequency generalized autoregressive conditional heteroscedasticity
VAD        Voice activity detector

Chapter 1

Introduction

This dissertation addresses the theory of generalized autoregressive conditional heteroscedasticity (GARCH) models with Markov regimes and their applications to signal processing, and in particular to digital speech processing.

The GARCH model explicitly parameterizes a time-varying volatility by using both recent conditional variances and recent squared innovations. This model is widely used in the field of econometrics for the analysis and forecasting of volatility in financial markets. Recently, this model was proposed for speech processing applications, such as speech enhancement, speech recognition, and voice activity detection. In this thesis we formulate a new complex-valued Markov-switching GARCH (MS-GARCH) model for nonstationary signals in the joint time-frequency domain. The MS-GARCH model exploits the advantages of both the conditional heteroscedasticity structure of GARCH models and the time-varying characteristics of hidden Markov chains. The basic motivation for our research is spectral modeling of speech signals, where examples of applications include, e.g., noise reduction in communication systems (both background and transient noise), speech enhancement and dereverberation in hands-free communication, and audio source separation.

The thesis starts with an asymptotic stationarity analysis of MS-GARCH processes. For processes with time-varying variances, conditions for asymptotic wide-sense stationarity are useful to ensure a process with a finite second-order moment. We then formulate a new complex-valued MS-GARCH model for nonstationary signals in the short-time Fourier transform (STFT) domain. The properties of the model are

investigated and algorithms are developed for causal, as well as noncausal, estimation in noisy environments. The proposed model is shown to be useful for spectral modeling of speech signals for the applications of speech enhancement and speech dereverberation. A basic property of speech signals is that their expansion coefficients are sparse in the STFT domain. Therefore, a reliable detector may significantly improve performance in noisy environments. We propose a new formulation for the speech enhancement problem, which incorporates simultaneous operations of detection and estimation. This approach is applied to develop speech enhancement algorithms under stationary, as well as transient, noise. A simultaneous classification and estimation approach together with GARCH modeling is employed to develop an algorithm for single-sensor audio source separation. In this chapter we briefly describe the scientific background for the main topics of this research and specify the structure of the thesis.

1.1 Markov-switching GARCH models

The GARCH model is widely used in the field of econometrics, both by practitioners and by researchers [1–5]. The model represents a powerful tool for analysis and forecasting of volatility in financial markets. This model, first introduced by Bollerslev [1] as a generalization of the ARCH model [2], explicitly parameterizes the time-varying volatility by using both recent conditional variances and recent squared innovations. GARCH models preserve the persistence of the process volatility in the sense that small variations tend to follow small variations and large variations tend to follow large variations. Incorporating GARCH models with hidden Markov chains, where each state (regime) of the chain implies a different GARCH behavior, extends the dynamic formulation of the model and enables a better fit for a process with a more complex time-varying volatility structure [6–11]. However, a major drawback of such models is that estimating the volatility with switching regimes requires knowledge of the entire history of the process, including the regime path. Consequently, Markov-switching ARCH models were proposed [12, 13], which avoid problems of path dependency in a noiseless environment. The conditional variance in ARCH models depends on previous observations only, so the regime path does not have to be known for constructing the conditional variance for a given regime.

In [6], a variant of the Markov-switching GARCH (MS-GARCH) model was introduced, relying on the assumption that the conditional variance given the current regime depends on the expectation of the previous conditional variances rather than on their values. Accordingly, the conditional variance depends on some finite, state-dependent, expected conditional variances via their conditional state probabilities. Klaassen [7] proposed modifying this model by manipulating the current regime and all available observations while evaluating the expectation of previous conditional variances. A different method for reducing the dependency of the conditional variance on past regimes has recently been proposed in [8]. Accordingly, a Markov chain governs the ARCH parameters while the autoregressive behavior of the conditional variance is subject to the assumption that past conditional variances are in the same regime as that of the current conditional variance. These variants of MS-GARCH models were developed for improved volatility forecasts of financial time series under the possible existence of shocks.

MS-GARCH processes, as well as the standard (single-state) GARCH process, are nonstationary as their second-order moments change recursively over time. However, if these processes are asymptotically wide-sense stationary, then their variances are guaranteed to be finite. A necessary and sufficient condition for the stationarity of a (single-regime) GARCH(p, q) process has been developed in [1]. A deep analysis of the probabilistic structure of the MS-GARCH model is derived in [14], with conditions for the existence of moments of any order. In [15–17], stationarity analysis has been derived for some models of conditional heteroscedasticity, and conditions for the asymptotic stationarity of some AR and ARMA models with Markov regimes have been derived in [18–22]. However, for the MS-GARCH models, stationarity conditions are known in the literature only for some special cases. In [7], necessary (but not necessarily sufficient) conditions for stationarity are developed for the special case of two regimes and GARCH modeling of order (1, 1). A necessary and sufficient stationarity condition has been developed in [8] for a specific MS-GARCH model formulation, but only in the case of GARCH(1, 1) behavior in each regime. We introduce a comprehensive approach for stationarity analysis of MS-GARCH models, which manipulates a backward recursion of the model's second-order moment. A recursive formulation of the state-dependent conditional variances is developed and the corresponding conditions for stationarity are obtained. In particular, we derive necessary and sufficient conditions for the asymptotic wide-sense stationarity of two different variants of MS-GARCH processes, and obtain expressions for their asymptotic variances in the general case of m-state Markov chains and (p, q)-order GARCH processes.

Recently, GARCH models have been employed for modeling speech signals in the time-frequency domain [23–25], for speech recognition applications [26], and for voice activity detection [27]. Speech signals in the STFT domain demonstrate both variability clustering and heavy-tail behavior, similarly to financial time series [25]. Motivated by these characteristics, it was proposed to model the conditional variance of speech signals in the STFT domain by a complex GARCH model. This model has been shown to be useful for speech enhancement applications [23–25], but it relies on the assumption that the model parameters are time-invariant. It is commonly assumed in econometrics that the analyzed process is observed in a noiseless environment, so that its past observations provide a complete specification of its current conditional variance, for any given regime. In our proposed MS-GARCH approach for speech modeling, the desired signal is generally observed in a noisy environment. Accordingly, we developed an algorithm for conditional variance estimation which is based on iterating propagation and update steps with regime conditional probabilities. In addition, we developed a state smoothing algorithm which generalizes existing algorithms used for state smoothing in hidden Markov processes.

1.2 Speech enhancement

The problem of spectral enhancement of noisy speech signals from a single microphone has attracted considerable research effort for over thirty years, and is still an active research area, e.g., [28–40]. This problem is often formulated as estimation of speech spectral components from a degraded signal which consists of statistically independent additive noise. A variety of different approaches for spectral enhancement of noisy speech signals have been introduced over the years. One of the earlier methods is spectral subtraction [29, 30]. Accordingly, an estimate of the clean signal is obtained by subtracting an estimate of the power spectral density (PSD) of the background noise from the short-term PSD of the degraded signal. The square root of the result is considered as an estimate for the spectral magnitude of the desired signal, while the phase of the noisy signal is used as the desired phase. This method generally results in random fluctuations in the residual noise, known as musical noise, which may be annoying and disturb the perception of the enhanced speech [41]. Many variations of the spectral subtraction method have been developed over the years to cope with the musical residual noise phenomenon [42–45]. Statistical methods for speech enhancement are designed to minimize the expected value of some distortion measure between the clean and estimated signals [32–34, 36, 38, 46, 47]. These approaches require the presumption of reliable statistical models for the speech and noisy signals and specification of a perceptually meaningful distortion measure. Distortion measures which are of particular interest in speech enhancement applications are the squared error [32, 48], the squared error of the short-term spectral amplitude (STSA) [33] and the squared error of the log-spectral amplitude (LSA) [34, 38]. To enable further attenuation of the additive noise, estimation under speech presence uncertainty is often considered [32, 33, 37, 38, 41, 49, 50]. In addition, to eliminate residual noise in case of speech absence, a voice activity detector (VAD) is often incorporated with the estimator output [51–56]. Subspace methods for speech enhancement attempt to decompose the vector space of the noisy signal into a signal-plus-noise subspace and a noise-only subspace [57–60]. Spectral enhancement is then performed by removing the noise subspace and estimating the speech coefficients from the signal-plus-noise subspace. Another spectral enhancement method relies on modeling speech signal vectors based on hidden Markov models (HMMs) [35, 61–63]. The probability distributions of the speech (and in some applications also of the noise) are estimated from long training sequences of clean samples. The speech signal is then estimated from the noisy observation based on the trained model according to some distortion criteria. HMMs are successfully applied for speech recognition applications, e.g., [64, 65]. However, for the application of speech enhancement they were not found to be sufficiently refined models [66].
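For illustration of the spectral subtraction idea described above, a minimal sketch is given below. It operates on precomputed STFT coefficients; the per-bin noise PSD input and the spectral-floor constant are assumptions made here for illustration and do not correspond to any particular algorithm cited above.

```python
import numpy as np

def spectral_subtraction(Y, noise_psd, floor=1e-3):
    """Magnitude-domain spectral subtraction (illustrative sketch).

    Y          : complex STFT coefficients of the noisy signal, shape (frames, bins)
    noise_psd  : estimated noise power spectral density per bin, shape (bins,)
    floor      : spectral floor preventing negative power estimates
    """
    noisy_power = np.abs(Y) ** 2
    # subtract the noise PSD estimate from the short-term PSD of the degraded signal
    clean_power = np.maximum(noisy_power - noise_psd, floor * noisy_power)
    # the square root gives the spectral magnitude estimate; the noisy phase is retained
    return np.sqrt(clean_power) * np.exp(1j * np.angle(Y))
```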

1.2.1 Spectral modeling of speech signals

Spectral enhancement of speech signals often relies on the assumption that the spectral coefficients of the speech signal (as well as of the noise signal) are statistically independent, with a conditional distribution which may be considered as, e.g., Gaussian [33, 34, 38], super-Gaussian [47], Laplace [67], or Gamma [68]. Although the statistical independence assumption simplifies the design of the optimal estimator under any of the assumed distributions, it does not hold in reality since consecutive spectral magnitudes are strongly correlated [24, 25, 41, 69]. In fact, in many of the above-mentioned speech enhancement algorithms, the time-frequency dependent speech spectral variances are estimated using the decision-directed approach [33, 70], which relies on the strong correlation of successive spectral variances. Let x and d denote a speech signal and an uncorrelated noise signal, respectively, and let y = x + d denote the observed signal. Applying the STFT to the observed signal we obtain Y_{tk} = X_{tk} + D_{tk}, where t and k denote the time-frame and frequency-bin indices, respectively. The decision-directed estimator for the speech spectral variance is given by

\hat{\lambda}_{tk} = \alpha \, \big|\hat{X}_{t-1,k}\big|^2 + (1-\alpha)\, \max\left\{ |Y_{tk}|^2 - \lambda_{d,tk},\; 0 \right\}    (1.1)

where \hat{\lambda}_{tk} denotes the estimate of the speech spectral variance and \lambda_{d,tk} denotes the spectral variance of the noise spectral coefficient. The parameter \alpha (0 \le \alpha \le 1) is a weighting factor (typically chosen close to one) that controls the trade-off between noise reduction and transient distortion brought into the signal [70]. A larger value of \alpha results in a greater reduction of the musical noise phenomenon, but at the expense of attenuated speech onsets and audible modifications of transient components. It is worth noting that the noise spectral variance also needs to be estimated. This can be practically obtained during speech-absence intervals, or from the noisy observations by using the minima-controlled recursive averaging algorithm [38, 71] or the minimum statistics approach [72].

A relaxed statistical model for speech signals was proposed in [69]. This model is based on the assumptions that speech spectral phases are iid random variables, and the speech spectral component X_{tk} is a zero-mean complex Gaussian random variable with iid real and imaginary components. The sequence of speech spectral variances \{\lambda_{tk} \mid t = 0, 1, \ldots\} is a random process, and each spectral variance is correlated with the sequence of the spectral magnitudes. However, given a specific spectral variance \lambda_{tk}, the spectral magnitude at the same time-frequency bin is statistically independent of other spectral magnitudes. This model was the first to treat both the sequence of spectral coefficients \{X_{tk}\} and the sequence of spectral variances \{\lambda_{tk}\} at a specific frequency bin as random processes.

Recently, it was proposed to model speech spectral coefficients using a GARCH model [23–25]. This model explicitly parameterizes the time-varying volatility (conditional variance) at a specific frequency-bin index by using both recent conditional variances and recent squared values of the spectral coefficients. It was shown that under perfect detection of the speech spectral coefficients, improved performance is achieved by using the GARCH model compared to using the decision-directed approach. In this dissertation, we introduce a GARCH model with Markov regimes for speech spectral modeling. This model exploits the advantages of both the conditional heteroscedasticity structure of GARCH models and the time-varying characteristics of hidden Markov chains. The model parameters are allowed to change in time according to the state of a hidden Markov chain and may be ascribed to switching between speech phonemes or different speakers. We develop model-based algorithms, which are shown to be useful for speech enhancement applications, speech dereverberation, and acoustic source separation.
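To make the recursion in (1.1) concrete, the following minimal sketch propagates the decision-directed variance estimate along a frequency bin. The Wiener-type amplitude estimate used here to form |X̂_{t-1,k}|² is an assumption made for illustration (the cited works use STSA/LSA-type estimators), and all names and parameter values are hypothetical.

```python
import numpy as np

def decision_directed_variance(Y, noise_var, alpha=0.98):
    """Decision-directed speech spectral variance estimate, cf. (1.1) (illustrative sketch).

    Y          : complex STFT coefficients of the noisy signal, shape (frames, bins)
    noise_var  : noise spectral variance per time-frequency bin, shape (frames, bins)
    alpha      : weighting factor, typically chosen close to one
    """
    T, K = Y.shape
    lam = np.zeros((T, K))
    X_prev_sq = np.zeros(K)                      # |X_hat_{t-1,k}|^2, initialized to zero
    for t in range(T):
        ml_term = np.maximum(np.abs(Y[t]) ** 2 - noise_var[t], 0.0)
        lam[t] = alpha * X_prev_sq + (1.0 - alpha) * ml_term
        # a Wiener-type amplitude estimate stands in for X_hat here (assumed for illustration)
        gain = lam[t] / (lam[t] + noise_var[t] + 1e-12)
        X_prev_sq = (gain * np.abs(Y[t])) ** 2
    return lam
```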

1.3 Other speech processing applications

While the classical problem of speech enhancement was extensively studied during recent decades, the applications of hands-free communication and digital storage of audio signals raised interesting problems such as speech dereverberation, nonstationary noise reduction, and acoustic blind source separation. Speech signals that are received by a microphone distant from the speech source usually contain reverberation. The sound wave produced by the speaker propagates outward from the source. The wavefronts reflect off the walls and other objects and are superimposed at the microphone. These reflections can degrade the fidelity and intelligibility of speech signals and the performance of automatic speech recognition systems [73]. The received reverberant sound generally consists of a direct sound, reflections that arrive shortly after the direct sound (commonly called early reverberation), and reflections that arrive after the early reverberation (or late reverberation) [73–75]. While the early reflections mainly contribute to spectral coloration and may even improve the intelligibility of the speech sound, the late reverberation changes the waveform's temporal envelope as decaying 'tails' are added at sound offsets. This may cause a distant and echo-ey sound quality [73, 76]. Algorithms for reverberation reduction (or dereverberation) can be divided into two classes. Algorithms in the first class are based on an estimation of the acoustic impulse response

(AIR). The desired signal is, in that case, estimated by deconvolution methods, e.g., [77–79]. Two drawbacks of these algorithms are, however, that they have been shown to be sensitive to small changes in the AIR, and that in some cases the order of the AIR needs to be known [80, 81]. Methods in the second category try to suppress reverberation without estimating the AIR, e.g., [74, 82–84]. The spectral coefficients of the desired signal are estimated using some statistical model, while trying to suppress the undesired reverberation. We develop a dual-microphone speech dereverberation algorithm for noisy environments, which is aimed at suppressing late reverberation and background noise. The spectral variance of the late reverberation is obtained with adaptively-estimated direct path compensation, and an MS-GARCH model is used to estimate the spectral variance of the desired signal, which includes the direct sound and early reverberation.

Blind separation of mixed audio signals received by a single microphone has been a challenging problem for many years. Examples of applications include separation of speakers [85, 86], separation of different musical sources (e.g., different musical instruments) [85, 87, 88], separation of speech or singing voice from background music [89–92], and signal enhancement in nonstationary noise environments [35, 62, 93–95]. In case the signals are received by multiple microphones, spatial filtering may be employed, as well as mutual statistical information between the received signals, e.g., [96–102]. However, if several sources are recorded by a single microphone, some a priori information is necessary to enable a reasonable separation performance.

In [94, 95] speech and nonstationary noise signals are assumed to evolve as autoregressive (AR) processes, while the a priori statistical information (codebooks) is obtained using a training phase and includes several sets of linear prediction coefficients. In [87, 88, 90] the acoustic signals are modeled by Gaussian mixture models (GMMs). In [35, 62] the acoustic signals are modeled by hidden Markov models (HMMs) with AR sub-sources. The assumed statistical models provide a priori information about the distinct signals, which together with training sequences of signals and appropriately extracted codebooks enable source separation from signal mixtures. GMM and AR-based codebooks are generally insufficient for representing statistically rich signals such as speech signals [92], since each state specifies a predetermined probability density function (pdf), or a mixture of pdf's. We develop a new algorithm for single-sensor audio source separation of speech and music signals, which is based on MS-GARCH modeling of the speech signals and a GMM for the music signals. Since in the MS-GARCH model the active state specifies only the evolution of the spectral variances over time, the corresponding statistical model can take almost any values for the spectral variances. The separation of the speech from the music signal is obtained by a classification and estimation approach, which enables control of the trade-off between residual interference and signal distortion.

1.4 Detection and estimation of speech signals

In many signal processing applications, as well as communication applications, the signal to be estimated is not surely present in the available noisy observation. Therefore, as specified in Section 1.2, algorithms often try to estimate the signal under uncertainty (i.e., using some a priori probability for the existence of the signal) [32, 33, 38, 41], or alternatively, apply an independent detector for signal presence [51–56]. This detector may be designed based on the noisy observation, or on the estimated signal. The spectral coefficients of the speech signal are generally sparse in the STFT domain in the sense that speech is present only in some of the frames, and in each frame only some of the frequency bins contain the significant part of the signal energy. Therefore, both signal estimation and detection are generally required while processing noisy speech signals [38, 41, 103]. However, existing algorithms often focus on estimating the spectral coefficients rather than detecting their existence. The spectral-subtraction algorithm [29, 30] contains an elementary detector for speech activity in the time-frequency domain, but it generates musical noise caused by falsely detecting noise peaks as bins that contain speech, which are randomly scattered in the STFT domain. Subspace approaches for speech enhancement [57, 59, 60, 104] decompose the vector of the noisy signal into a signal-plus-noise subspace and a noise subspace, and the speech spectral coefficients are estimated after removing the noise subspace. Accordingly, these algorithms are aimed at detecting the speech coefficients and subsequently estimating their values. McAulay and Malpass [32] were the first to propose a speech spectral estimator under a two-state model. They derived a maximum likelihood (ML) estimator for the speech spectral amplitude under speech-presence uncertainty. Ephraim and Malah followed this approach of signal estimation under speech presence uncertainty and derived an estimator which minimizes the mean-square error (MSE) of the short-term spectral amplitude (STSA) [33]. In [49], speech presence probability is evaluated to improve the minimum MSE (MMSE) of the LSA estimator, and in [38] a further improvement of the MMSE-LSA estimator is achieved based on a two-state model.

A reliable detector for speech activity in the time-frequency domain is of major importance, not only to improve noise reduction, but also for noise estimation, speech coding and speech recognition. Many VAD algorithms are designed on a frame-by-frame basis, e.g., [51–56], or the activity of speech is detected in each time-frequency bin, e.g., [38, 103]. The detection of speech presence by using a microphone array has recently received much attention, e.g., [105–108]. Spriet et al. [109] analyzed the impact of speech detection errors on the noise reduction performance of multichannel systems and showed that a reliable speech detector is crucial to achieve a potentially better speech enhancement performance.

Approaches for the design of coupled operations of signal detection and estimation have been proposed for some communication applications, e.g., [110–113]. In [114] a method for optimal simultaneous classification and estimation has been proposed by minimizing the erroneous classification probability in the worst case under a false alarm constraint. Middleton et al. [115, 116] were the first to propose simultaneous signal detection and estimation within the framework of statistical decision theory. They proposed some dual schemes for the two operations while considering several coupling methods between the two operations. The detector is generally optimized with the knowledge of the specific structure of the estimator, and the estimator is optimized in the sense of minimizing a Bayes risk associated with the combined operations.

In this research, we reformulate the speech enhancement problem as a joint problem of speech activity detection and spectral estimation. The problem formulation assumes sparsity of the speech spectral coefficients and introduces coupled operations of detection and estimation using a combined Bayes risk. The Bayes risk incorporates both the cost of estimation errors and the cost of faulty detection, whether it is a missed detection of speech components or a false detection. While considering the problem of single-channel audio source separation, multiple hypotheses are used for both signals. In that case we incorporate a simultaneous classification and estimation scheme for the separation. In both cases, the combined cost yields cost parameters that enable control of the trade-off between signal distortion, caused by missed detection of speech components, and residual noise resulting from false detection.

1.5 Thesis structure

This thesis is organized as follows. Chapter 2 briefly outlines the basic theories and methods which were used during this research. The original contribution of this research starts in Chapter 3.

In Chapter 3, we develop a comprehensive approach for stationarity analysis of MS-GARCH models. We consider the general case of m-state Markov chains and (p, q)-order GARCH processes, and specify the unconditional variance of the process using the expectation of the regime-dependent conditional variances. No history knowledge of the process is assumed, except for the model parameters. The expectation of the conditional variance at a given regime is then recursively constructed from the conditional expectation of both previous conditional and unconditional variances. Consequently, we obtain a complete recursion for the expected vector of state-dependent conditional variances. The recursive vector form is constructed by means of a representative matrix which is built from the model parameters. We show that constraining the largest absolute eigenvalue of the representative matrix to be less than one is necessary and sufficient for the convergence of the unconditional variance, and therefore for the asymptotic stationarity of the process. We derive stationarity conditions for the general formulation of the two variants of MS-GARCH models introduced by Klaassen [7] and Haas et al. [8]. We show that our results reduce in some degenerate cases to the stationarity conditions developed by Bollerslev [1], and by Klaassen and Haas et al. Furthermore, we show that the stationarity conditions developed by Klaassen are not only necessary but also sufficient for asymptotic stationarity of his model.

In Chapter 4, we introduce a Markov-switching time-frequency GARCH (MSTF-GARCH) model for speech signals in the STFT domain. The MSTF-GARCH model exploits the advantages of both the conditional heteroscedasticity structure of GARCH models and the time-varying characteristics of hidden Markov chains. The expansion coefficients are considered as nonstationary random signals in the time-frequency domain and modeled as multivariate complex GARCH processes with Markov-switching regimes. A corresponding recursive algorithm is developed for signal restoration in a noisy environment. The conditional variance is estimated by iterating propagation and update steps with regime conditional probabilities. The model parameters are estimated from a training data set prior to the signal restoration using an ML approach, and the number of states is assumed to be known. We show that the derivation in [117] of bounds on the MSE of a composite source signal estimation is applicable for obtaining an upper bound on the MSE of a single-step MSTF-GARCH estimation. Experimental results demonstrate the improved performance of the proposed algorithm for restoration of an MSTF-GARCH process compared to using estimators which assume a smaller number of regimes than the process actually has. Furthermore, it is demonstrated that the squared absolute values of speech coefficients in the STFT domain are better evaluated by using the MSTF-GARCH model than by using the decision-directed approach.

In Appendix 4.A, we present an application of the MSTF-GARCH model to speech enhancement. We employ the MSTF-GARCH model by assuming different Markov chains in distinct frequency subbands with identical state transition probabilities. The GARCH parameters are state dependent and frequency variant. We define an additional state for the case where speech coefficients are absent (or below a certain threshold level) and introduce a parameter estimation method which is computationally more efficient than the traditional ML approach. Furthermore, the probability of the speech-absence state can be used as a soft voice activity detector, which is naturally generated in the reconstruction algorithm. Experimental results demonstrate improved noise reduction performance while preserving weak components of the speech signal.

Chapter 5 addresses the problem of state smoothing in MS-GARCH processes in noisy environments. The dependency of the conditional variance on past observations and past active regimes is taken into consideration as we generalize both the forward-backward recursions [118] and the stable backward recursion [119, 120]. We derive two recursive steps for the evaluation of conditional densities of future observations. The first step is an upward recursion which manipulates the future observations for the evaluation of their conditional densities, corresponding to all possible future paths. The second step is a backward recursion which integrates over these paths to evaluate the future densities required for the noncausal state probability. The computational complexity of the generalized recursions grows exponentially with the number of future observations employed for the fixed-lag smoothing. However, experimental results demonstrate that the significant part of the improvement in performance, compared to using causal estimation, is achieved by considering only a few future observations.

In Chapter 6, we propose a new formulation for the speech enhancement problem based on simultaneous operations of detection and estimation. A detector for the speech coefficients is combined with an estimator, which jointly minimize a cost function that takes into account both estimation and detection errors. Under speech presence, the cost is proportional to a quadratic spectral amplitude (QSA) error [33], while under speech absence, the distortion depends on a certain attenuation factor [29, 38, 70]. We derive a combined detector and estimator with cost parameters that enable control of the trade-off between speech distortion, caused by missed detection of speech components, and residual musical noise resulting from false detection. The combined solution generalizes the well-known STSA algorithm, which involves merely estimation under signal presence uncertainty. In addition, we propose a modification of the decision-directed a priori SNR estimator, which is suitable for transient-noise environments. Experimental results show that the simultaneous detection and estimation yields better noise reduction than the STSA algorithm while not degrading the speech signal. The advantage of using a suitable indicator for transient noise is demonstrated in a nonstationary noise environment, where the proposed algorithm facilitates suppression of transient noise with a controlled level of speech distortion.

Appendix 6.B introduces a closely related application of transient noise removal using a practical detector for the presence of transient noise. We formulate a speech enhancement problem under multiple hypotheses, assuming some indicator or detector for the presence of noise transients in the STFT domain is available. Cost parameters control the trade-off between speech distortion and residual transient noise. We derive an optimal signal estimator that employs the available detector and show that the resulting estimator generalizes the optimally-modified log-spectral amplitude (OM-LSA) estimator [38].

Experimental results demonstrate the improved performance obtained by the proposed algorithm, compared to using the OM-LSA.

Chapter 7 deals with the problem of single-channel audio source separation. A novel approach is proposed for single-channel blind source separation of speech and music signals. This approach includes a new codebook for speech signals, as well as a new separation algorithm which relies on a simultaneous classification and estimation procedure. The codebook is based on GARCH modeling of speech signals and a GMM for music signals. Two methods are proposed for classification and estimation. One is based on simultaneous operations of classification and estimation which jointly minimize a combined Bayes risk. The second method employs a given (non-optimal) classifier, and applies an estimator which is optimally designed to yield a controlled level of residual interference and signal distortion. The GARCH model for the speech signal with several states of parameters enables smooth (diagonal) covariance matrices with possible state switching. Experimental results demonstrate that for mixtures of speech and piano signals it is more advantageous to model the speech signal by GARCH than by GMM, and the codebook generated by the GARCH model yields significantly improved separation performance. In addition, the classification and estimation approach enables the user to control the trade-off between the distortion of the desired signal caused by missed detection, and the amount of the residual signal resulting from false detection.

In Chapter 8, we consider the problem of speech dereverberation using MS-GARCH modeling. We develop an improved dual-microphone speech dereverberation algorithm which relies on MS-GARCH modeling of the desired early speech component, which consists of the direct sound and early reverberation. The model is applied to distinct frequency subbands and specifies the volatility clustering of successive spectral coefficients, while a speech-absence state is used for evaluating the speech presence probability. Furthermore, an adaptive approach is developed to estimate the parameter for the direct path compensation (DPC) directly from the observed signals. Experimental results show that improved results can be obtained by using MS-GARCH modeling rather than the decision-directed approach. Furthermore, by using the proposed algorithm, the performance obtained with a blindly estimated DPC parameter is comparable to that obtained with an optimal DPC parameter that is calculated from the actual AIR, which is unknown in practice.

Chapter 9 summarizes the main contributions of this dissertation and presents some future research directions.

1.6 List of publications

The chapters of this thesis are based on the following publications:

Chapter 3 is based on:

1. A. Abramson and I. Cohen, On the stationarity of Markov switching GARCH processes, Econometric Theory, vol. 23, no. 3, pp. 485-500, 2007.

Chapter 4 is based on:

2. A. Abramson and I. Cohen, Recursive supervised estimation of a Markov-switching GARCH process in the short-time Fourier transform domain, IEEE Trans. on Signal Processing, vol. 55, no. 7, pp. 3227-3238, July 2007.

3. A. Abramson and I. Cohen, Asymptotic stationarity of Markov-switching time-frequency GARCH processes, in Proc. 30th IEEE Internat. Conf. Acoust. Speech Signal Processing, ICASSP-06, Toulouse, France, May 2006, pp. III 452-455.

Appendix 4.A is based on:

4. A. Abramson and I. Cohen, Markov-switching GARCH model and application to speech enhancement in subbands, in Proc. 10th Internat. Workshop on Acoust. Echo and Noise Control, IWAENC-2006, Paris, France, September 2006. (Best student paper).

Chapter 5 is based on:

5. A. Abramson and I. Cohen, State smoothing in Markov-switching time-frequency GARCH models, IEEE Signal Processing Letters, vol. 13, no. 6, pp. 377-380, June 2006.

Chapter 6 is based on:

6. A. Abramson and I. Cohen, Simultaneous detection and estimation approach for speech enhancement, IEEE Trans. Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2348-2359, Nov. 2007.

Appendix 6.B is based on:

7. A. Abramson and I. Cohen, Enhancement of speech signals under multiple hypotheses using an indicator for transient noise presence, in Proc. 31st IEEE Internat. Conf. Acoust. Speech Signal Processing, ICASSP-07, pp. IV 533-536, Honolulu, Hawaii, Apr. 2007. (Best student paper finalist).

Chapter 7 is based on:

8. A. Abramson and I. Cohen, Single-sensor audio source separation using classification and estimation approach and GARCH modeling, submitted to IEEE Trans. Audio, Speech, and Language Processing.

Chapter 8 is based on:

9. A. Abramson, Emanuël A. P. Habets, S. Gannot, and I. Cohen, Dual-microphone speech dereverberation using GARCH modeling, to appear in Proc. 32nd IEEE Internat. Conf. Acoust. Speech Signal Processing, ICASSP-08, Las Vegas, Apr. 2008.

Chapter 2

Research Methods

In this chapter, we briefly review research methods which were useful during this research. We start by introducing the GARCH model formulation. We then continue by introducing a specific formulation of the MS-GARCH model with its known conditions for asymptotic stationarity. The GARCH modeling approach for speech spectral coefficients is briefly reviewed together with the variance estimation algorithm, as well as the spectral enhancement approach. Finally, we briefly review existing methods for single-sensor blind source separation and speech dereverberation by using spectral enhancement.

2.1 GARCH Models and stationarity analysis

A linear (p, q)-order GARCH model is defined as follows. Let εt denote a real-valued discrete-time , and let ψt denote the information set (σ-field) of all information through time t. The GARCH(p, q) process is then given by [1]

\varepsilon_t = \sigma_t v_t    (2.1)

where \{v_t\} are iid random variables with zero mean, unit variance, and some predetermined probability density. The conditional variance of the process, \sigma_t^2 = E\{\varepsilon_t^2 \mid \psi_{t-1}\}, evolves as

\sigma_t^2 = \xi + \sum_{i=1}^{q} \alpha_i \varepsilon_{t-i}^2 + \sum_{j=1}^{p} \beta_j \sigma_{t-j}^2    (2.2)

where

p \ge 0, \quad q > 0,
\xi > 0, \quad \alpha_i \ge 0, \quad i = 1, \ldots, q,
\beta_j \ge 0, \quad j = 1, \ldots, p.

For p = 0 the process reduces to the ARCH(q) process [2], and for q = p = 0 the process \varepsilon_t is simply white noise. The nonnegativity of the parameters \alpha_i, i = 1, \ldots, q, and \beta_j, j = 1, \ldots, p, together with \xi > 0, are sufficient to ensure a positive conditional variance \sigma_t^2. However, since the conditional variance is a time-varying random process, it is not guaranteed in general that the variance of the process is finite.

Theorem 2.1. [1] The GARCH(p, q) process as defined in (2.1) and (2.2) is (asymptotically) wide-sense stationary with E(\varepsilon_t) = 0, \lim_{t\to\infty} \mathrm{Var}(\varepsilon_t) = \xi \left(1 - \sum_{i=1}^{q}\alpha_i - \sum_{j=1}^{p}\beta_j\right)^{-1}, and \mathrm{Cov}(\varepsilon_t, \varepsilon_\tau) = 0 for t \ne \tau, if and only if \sum_{i=1}^{q}\alpha_i + \sum_{j=1}^{p}\beta_j < 1.

The proof can be found in [1, 5]. Theorem 2.1 gives an important constraint on the model parameters such that the second-order moment of the GARCH process is finite. It also shows that under this condition, the process is asymptotically wide-sense stationary.
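As a quick numerical illustration of (2.1)-(2.2) and Theorem 2.1, the following minimal sketch simulates a GARCH(1, 1) process with Gaussian innovations and compares the sample variance with the asymptotic variance; the parameter values and the Gaussian choice for v_t are assumptions made here for illustration only.

```python
import numpy as np

def simulate_garch11(xi, alpha, beta, n, seed=0):
    """Simulate a GARCH(1,1) process per (2.1)-(2.2) with Gaussian innovations."""
    rng = np.random.default_rng(seed)
    eps = np.zeros(n)
    sigma2 = np.empty(n)
    sigma2[0] = xi / (1.0 - alpha - beta)        # start at the asymptotic variance
    eps[0] = np.sqrt(sigma2[0]) * rng.standard_normal()
    for t in range(1, n):
        # conditional variance recursion (2.2) with p = q = 1
        sigma2[t] = xi + alpha * eps[t - 1] ** 2 + beta * sigma2[t - 1]
        eps[t] = np.sqrt(sigma2[t]) * rng.standard_normal()
    return eps, sigma2

xi, alpha, beta = 0.1, 0.2, 0.7
assert alpha + beta < 1, "Theorem 2.1: alpha + beta < 1 for wide-sense stationarity"
eps, _ = simulate_garch11(xi, alpha, beta, n=100_000)
print(np.var(eps), xi / (1.0 - alpha - beta))    # sample vs. asymptotic variance
```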

2.1.1 Markov-switching GARCH model

Let S_t \in \{1, \ldots, m\} denote the (unobserved) regime at discrete time t and let s_t be a realization of S_t, assuming that \{S_t\} is a first-order stationary Markov chain with transition probabilities a_{ij} \triangleq p(S_t = j \mid S_{t-1} = i), a transition probability matrix A, \{A\}_{ij} = a_{ij}, and stationary probabilities \pi = [\pi_1, \pi_2, \ldots, \pi_m]', \pi_i \triangleq p(S_t = i), where ' denotes the transpose operation.

Incorporating GARCH models with a hidden Markov chain, where each state of the chain (regime) allows a different GARCH behavior and thus a different volatility structure, extends the dynamic formulation of the model and potentially enables improved forecasts of the volatility [6–11]. Unfortunately, the volatility of a GARCH process with switching regimes depends on the entire history of the process, including the regime path, which makes the derivation of a volatility estimator impractical. Many different variants have been formulated for Markov-switching GARCH models. One popular formulation in econometrics is the model proposed by Klaassen [7] as a modification of Gray's model [6]. These models integrate out the unobserved regime path so that the conditional variance can be constructed from previous observations only. Accordingly, given the Markovian active state s_t \in \{1, 2\}, the conditional variance follows [7]:

\sigma_{t,s_t}^2 = \xi_{s_t} + \alpha_{s_t} \varepsilon_{t-1}^2 + \beta_{s_t} E\left\{ \sigma_{t-1,S_{t-1}}^2 \mid s_t, \psi_{t-1} \right\}.    (2.3)

As can be seen from (2.3), this model formulation originally assumes a degenerate model with only a two-state Markov chain and GARCH(1, 1) in each state. Define a 2 \times 2 matrix C with elements c_{ij} = p(S_{t-1} = j \mid S_t = i)(\alpha_i + \beta_i). Then, we have the following theorem:

Theorem 2.2. [7] Necessary conditions for asymptotic wide-sense stationarity of the process defined in (2.1) and (2.3) are c_{11}, c_{22} < 1 and \det(I - C) > 0. The asymptotic conditional variances are then given by:

\begin{bmatrix} \sigma_1^2 \\ \sigma_2^2 \end{bmatrix} = (I - C)^{-1} \begin{bmatrix} \xi_1 \\ \xi_2 \end{bmatrix}.    (2.4)

Proof. The unconditional variance under s_t can be obtained by taking expectation which is conditioned on the active state

\sigma_{t,s_t}^2 = \xi_{s_t} + \alpha_{s_t} E\left\{ \varepsilon_{t-1}^2 \mid s_t \right\} + \beta_{s_t} E\left\{ \sigma_{t-1,S_{t-1}}^2 \mid s_t \right\}
             = \xi_{s_t} + \alpha_{s_t} E\left\{ E\left\{ \varepsilon_{t-1}^2 \mid s_{t-1}, s_t \right\} \mid s_t \right\} + \beta_{s_t} E\left\{ E\left\{ \sigma_{t-1,S_{t-1}}^2 \mid s_{t-1}, s_t \right\} \mid s_t \right\}
             = \xi_{s_t} + \alpha_{s_t} E\left\{ \sigma_{t-1,s_{t-1}}^2 \mid s_t \right\} + \beta_{s_t} E\left\{ \sigma_{t-1,s_{t-1}}^2 \mid s_t \right\}
             = \xi_{s_t} + (\alpha_{s_t} + \beta_{s_t})\, E\left\{ \sigma_{t-1,s_{t-1}}^2 \mid s_t \right\}.    (2.5)

Next, assume that \sigma_{t,1}^2 and \sigma_{t,2}^2 are time invariant, and denote them by \sigma_1^2 and \sigma_2^2, respectively. Then

\begin{bmatrix} \sigma_1^2 \\ \sigma_2^2 \end{bmatrix} = \begin{bmatrix} \xi_1 \\ \xi_2 \end{bmatrix} + C \begin{bmatrix} \sigma_1^2 \\ \sigma_2^2 \end{bmatrix}    (2.6)

where c_{ij} = p(S_{t-1} = j \mid S_t = i)(\alpha_i + \beta_i) and

p(s_{t-1} \mid s_t) = \frac{p(s_t \mid s_{t-1})\, p(s_{t-1})}{p(s_t \mid S_{t-1}=1)\, p(S_{t-1}=1) + p(s_t \mid S_{t-1}=2)\, p(S_{t-1}=2)}.    (2.7)

Under the assumption that \sigma_1^2 and \sigma_2^2 exist, (I - C) is invertible and we have (2.4).

The necessary conditions for the existence of the unconditional variances are derived as follows. Since we have sufficient conditions for the positivity of the conditional variances (i.e., $\xi_s > 0$ and $\alpha_s, \beta_s \geq 0$ for $s = 1, 2$), all elements of $(I - C)^{-1}$ must be nonnegative and $(I - C)^{-1}$ must not have a zero row. Since $c_{12} = \alpha_1 + \beta_1 - c_{11}$ and $c_{21} = \alpha_2 + \beta_2 - c_{22}$, we can write

$$(I - C)^{-1} = \frac{1}{\det(I - C)} \begin{bmatrix} 1 - c_{22} & \alpha_1 + \beta_1 - c_{11} \\ \alpha_2 + \beta_2 - c_{22} & 1 - c_{11} \end{bmatrix}. \qquad (2.8)$$
Since
$$\alpha_i + \beta_i - c_{ii} = (\alpha_i + \beta_i)\left[1 - p(S_{t-1} = i \mid S_t = i)\right] \geq 0, \qquad (2.9)$$
the nonnegativity of (2.8) implies that $\det(I - C) > 0$, so that $1 - c_{11} \geq 0$ and $1 - c_{22} \geq 0$. But $c_{11}$ and $c_{22}$ must not equal one; otherwise, $\det(I - C) \leq 0$. Therefore, necessary conditions for the existence of stationary variances are $c_{11}, c_{22} < 1$ and $\det(I - C) > 0$.
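Theorem 2.2 can likewise be checked numerically. The sketch below (illustrative Python with hypothetical parameter values; the function name is ours) builds $C$ from the transition matrix and the stationary probabilities via Bayes' rule, tests $c_{11}, c_{22} < 1$ and $\det(I - C) > 0$, and evaluates the asymptotic conditional variances (2.4).

```python
import numpy as np

def msg_two_state_variances(xi, alpha, beta, A):
    """Numerical check of Theorem 2.2 for a two-state MS-GARCH(1, 1).

    xi, alpha, beta : length-2 arrays of state-dependent parameters
    A               : 2x2 transition matrix, A[i, j] = p(S_t = j | S_{t-1} = i)
    Returns the asymptotic conditional variances of (2.4), or None if the
    necessary conditions c11, c22 < 1 and det(I - C) > 0 fail.
    """
    # stationary probabilities of a two-state chain
    pi = np.array([A[1, 0], A[0, 1]]) / (A[0, 1] + A[1, 0])
    # c_ij = p(S_{t-1} = j | S_t = i) (alpha_i + beta_i),
    # with p(S_{t-1} = j | S_t = i) = pi_j A[j, i] / pi_i  (Bayes' rule)
    C = np.array([[pi[j] * A[j, i] / pi[i] * (alpha[i] + beta[i])
                   for j in range(2)] for i in range(2)])
    I = np.eye(2)
    if C[0, 0] < 1 and C[1, 1] < 1 and np.linalg.det(I - C) > 0:
        return np.linalg.inv(I - C) @ xi          # (2.4)
    return None

A = np.array([[0.95, 0.05], [0.10, 0.90]])        # illustrative chain
print(msg_two_state_variances(np.array([0.1, 0.3]),
                              np.array([0.05, 0.20]),
                              np.array([0.90, 0.75]), A))
```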

In Chapter 3 we derive conditions which are both necessary and sufficient for asymptotic stationarity, considering any finite-state Markov chain and any (p, q)-order GARCH model in each state.

2.2 Time-frequency GARCH model and spectral speech enhancement

2.2.1 Time-frequency GARCH model

Recall that $x$ and $d$ represent the speech and uncorrelated additive noise signals, respectively, and $y = x + d$ represents the observed signal. Applying the STFT to the observed signal, we have in the time-frequency domain

Ytk = Xtk + Dtk . (2.10)

Let $\mathcal{X}^{\tau} = \left\{ X_{tk} \mid t = 0, 1, ..., \tau, \; k = 0, ..., K-1 \right\}$ represent the set of clean spectral coefficients up to frame $\tau$. Let $H_1^{tk}$ and $H_0^{tk}$ denote the hypotheses of speech presence and absence, respectively, in the time-frequency bin $(t, k)$, and let

$$\lambda_{tk|\tau} \triangleq E\left\{ |X_{tk}|^2 \mid H_1^{tk}, \mathcal{X}^{\tau} \right\} \qquad (2.11)$$
denote the conditional variance of $X_{tk}$ under the hypothesis that speech is present in the time-frequency bin $(t, k)$, given the clean spectral coefficients up to time-frame $\tau$. The time-frequency GARCH (TF-GARCH) model relies on the following assumptions [23,25]:

1. Given $\{\lambda_{tk}\}$ and the state of speech presence in each time-frequency bin ($H_1^{tk}$ or $H_0^{tk}$), the speech spectral coefficients $\{X_{tk}\}$ are generated by

$$X_{tk} = \sqrt{\lambda_{tk}}\, V_{tk}, \qquad (2.12)$$
where $V_{tk} \mid H_0^{tk}$ are identically zero, and $V_{tk} \mid H_1^{tk}$ are statistically independent complex random variables with zero mean, unit variance, and independent and identically distributed (iid) real and imaginary parts:

$$H_1^{tk}: \; E\{V_{tk}\} = 0, \;\; E\{|V_{tk}|^2\} = 1, \qquad H_0^{tk}: \; V_{tk} = 0. \qquad (2.13)$$

2. Under $H_1^{tk}$, the real and imaginary parts of $V_{tk}$ are iid with a Gaussian probability density.

3. The conditional variance $\lambda_{tk|t-1}$, referred to as the one-frame-ahead conditional variance, is a random process which evolves as a GARCH(1, 1) process:

$$\lambda_{tk|t-1} = \lambda_{\min} + \alpha\, |X_{t-1,k}|^2 + \beta \left( \lambda_{t-1,k|t-2} - \lambda_{\min} \right) \qquad (2.14)$$
where

$$\lambda_{\min} > 0, \quad \alpha \geq 0, \quad \beta \geq 0, \quad \alpha + \beta < 1. \qquad (2.15)$$

4. The noise spectral coefficients $\{D_{tk}\}$ are zero-mean statistically independent Gaussian random variables, with iid real and imaginary parts.

The first assumption implies that the speech spectral coefficients $X_{tk} \mid H_1^{tk}$ are conditionally zero-mean statistically independent random variables given their variances $\{\lambda_{tk}\}$. However, the GARCH formulation parameterizes the correlation between successive conditional variances at the same frequency-bin index.

2.2.2 Variance estimation

Since the conditional variance in the TF-GARCH model depends on the entire history of the process, the conditional variances cannot be reconstructed when only a noisy signal is observed, and they need to be estimated. The spectral variance is estimated from the noisy observations in two steps. First, the conditional variance estimate $\hat{\lambda}_{tk|t-1}$ is updated using the additional information $Y_{tk}$; then, for the next frame, the conditional variance is propagated one frame ahead in time based on the model formulation.

Specifically, an estimate for $\lambda_{tk|t}$ is obtained by calculating its conditional mean under $H_1^{tk}$ given $Y_{tk}$ and $\hat{\lambda}_{tk|t-1}$. By definition, $\lambda_{tk|t} = |X_{tk}|^2$. Hence, the update step is obtained by [24,69]:

$$\hat{\lambda}_{tk|t} = E\left\{ |X_{tk}|^2 \mid H_1^{tk}, \hat{\lambda}_{tk|t-1}, Y_{tk} \right\} = \frac{\hat{\xi}_{tk|t-1}}{1 + \hat{\xi}_{tk|t-1}} \left( \lambda_{d,tk} + \frac{\hat{\xi}_{tk|t-1}}{1 + \hat{\xi}_{tk|t-1}}\, |Y_{tk}|^2 \right) \qquad (2.16)$$
where $\lambda_{d,tk} = E\{|D_{tk}|^2\}$ and $\hat{\xi}_{tk|t-1} \triangleq \hat{\lambda}_{tk|t-1} / \lambda_{d,tk}$ is the a priori SNR. Substituting (2.16) into the model formulation (2.14), we have the propagation step:

$$\begin{aligned}
\hat{\lambda}_{tk|t-1} &= E\left\{ \lambda_{tk|t-1} \mid H_1^{t-1,k}, \hat{\lambda}_{t-1,k|t-2}, Y_{t-1,k} \right\} \\
&= \lambda_{\min} + \alpha\, E\left\{ |X_{t-1,k}|^2 \mid H_1^{t-1,k}, \hat{\lambda}_{t-1,k|t-2}, Y_{t-1,k} \right\} + \beta \left( \hat{\lambda}_{t-1,k|t-2} - \lambda_{\min} \right) \\
&= \lambda_{\min} + \alpha\, \hat{\lambda}_{t-1,k|t-1} + \beta \left( \hat{\lambda}_{t-1,k|t-2} - \lambda_{\min} \right). \qquad (2.17)
\end{aligned}$$
This two-step estimation method allows estimation of the conditional variances from the noisy coefficients. It is important to note that the model parameters are generally estimated from a training set using a maximum likelihood (ML) approach [5].
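As an illustration of the two-step recursion, the following minimal sketch tracks the spectral variance of a single frequency bin from noisy coefficients using the update step (2.16) and the propagation step (2.17); it assumes known model parameters ($\lambda_{\min}$, $\alpha$, $\beta$) and a known noise variance $\lambda_d$, and all names are ours.

```python
import numpy as np

def tf_garch_variance_track(Y, lam_d, lam_min, alpha, beta):
    """Track the speech spectral variance of one frequency bin from noisy
    STFT coefficients Y[t], following the two-step recursion (2.16)-(2.17).

    Returns (lam_prop, lam_upd): one-frame-ahead and updated estimates.
    """
    T = len(Y)
    lam_prop = np.empty(T)   # \hat{lambda}_{t|t-1}  (propagation step)
    lam_upd = np.empty(T)    # \hat{lambda}_{t|t}    (update step)
    lam_prop[0] = lam_min    # simple initialization
    for t in range(T):
        if t > 0:
            # Propagation step (2.17)
            lam_prop[t] = lam_min + alpha * lam_upd[t - 1] \
                          + beta * (lam_prop[t - 1] - lam_min)
        # Update step (2.16): a priori SNR, then conditional mean of |X|^2
        xi = lam_prop[t] / lam_d
        g = xi / (1.0 + xi)
        lam_upd[t] = g * (lam_d + g * np.abs(Y[t]) ** 2)
    return lam_prop, lam_upd

# Toy usage with synthetic noisy coefficients
rng = np.random.default_rng(0)
Y = rng.standard_normal(50) + 1j * rng.standard_normal(50)
lam_prop, lam_upd = tf_garch_variance_track(Y, lam_d=2.0,
                                            lam_min=0.1, alpha=0.2, beta=0.7)
```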

2.2.3 Spectral enhancement

Having an estimate for the spectral variances of the speech spectral coefficients, $\hat{\lambda}_{tk}$, an estimator for the coefficients $X_{tk}$ is obtained by minimizing the expected distortion given $\hat{\lambda}_{tk}$, $\lambda_{d,tk}$, $Y_{tk}$ and the a posteriori speech presence probability $q_{tk} \triangleq p\left( H_1^{tk} \mid Y_{tk} \right)$ [41]:
$$\min_{\hat{X}_{tk}} \; E\left\{ d\left( X_{tk}, \hat{X}_{tk} \right) \mid q_{tk}, \hat{\lambda}_{tk}, \lambda_{d,tk}, Y_{tk} \right\}. \qquad (2.18)$$

In particular, restricting ourselves to a squared error distortion measure of the form

$$d\left( X_{tk}, \hat{X}_{tk} \right) = \left[ g\left( \hat{X}_{tk} \right) - \tilde{g}\left( X_{tk} \right) \right]^2 \qquad (2.19)$$
where $g(X)$ and $\tilde{g}(X)$ are specific functions which determine the fidelity criteria, the optimal estimator is calculated from

$$\begin{aligned}
g\left( \hat{X}_{tk} \right) &= E\left\{ \tilde{g}(X_{tk}) \mid q_{tk}, \hat{\lambda}_{tk}, \lambda_{d,tk}, Y_{tk} \right\} \\
&= q_{tk}\, E\left\{ \tilde{g}(X_{tk}) \mid H_1^{tk}, \hat{\lambda}_{tk}, \lambda_{d,tk}, Y_{tk} \right\} + (1 - q_{tk})\, E\left\{ \tilde{g}(X_{tk}) \mid H_0^{tk}, Y_{tk} \right\}. \qquad (2.20)
\end{aligned}$$
Fidelity criteria that are of particular interest for speech enhancement applications are the MMSE of the STSA [33] and the MMSE of the LSA [34]. The MMSE-STSA estimator is derived by substituting into (2.19) the functions

$$g\left( \hat{X}_{tk} \right) = \hat{X}_{tk}, \qquad \tilde{g}(X_{tk}) = \begin{cases} |X_{tk}|, & \text{under } H_1^{tk} \\ G_{\min}\, |Y_{tk}|, & \text{under } H_0^{tk}. \end{cases} \qquad (2.21)$$
Let $\gamma_{tk} = |Y_{tk}|^2 / \lambda_{d,tk}$ denote the a posteriori SNR and let $\upsilon_{tk} \triangleq \gamma_{tk}\, \hat{\xi}_{tk} / (1 + \hat{\xi}_{tk})$. The resulting estimator is given by

$$\hat{X}_{tk} = \left[ q_{tk}\, G_{\mathrm{STSA}}\left( \hat{\xi}_{tk}, \gamma_{tk} \right) + (1 - q_{tk})\, G_{\min} \right] Y_{tk} \qquad (2.22)$$
where [33]
$$G_{\mathrm{STSA}}(\xi, \gamma) \triangleq \frac{\sqrt{\pi \upsilon}}{2 \gamma} \exp\left( -\frac{\upsilon}{2} \right) \left[ (1 + \upsilon)\, I_0\!\left( \frac{\upsilon}{2} \right) + \upsilon\, I_1\!\left( \frac{\upsilon}{2} \right) \right], \qquad (2.23)$$
and $I_\nu(\cdot)$ denotes the modified Bessel function of order $\nu$. The MMSE-LSA estimator is obtained by using the functions

$$g\left( \hat{X}_{tk} \right) = \log \hat{X}_{tk}, \qquad \tilde{g}(X_{tk}) = \begin{cases} \log |X_{tk}|, & \text{under } H_1^{tk} \\ \log\left( G_{\min}\, |Y_{tk}| \right), & \text{under } H_0^{tk}, \end{cases} \qquad (2.24)$$
and the resulting estimator follows

$$\hat{X}_{tk} = \left[ G_{\mathrm{LSA}}\left( \hat{\xi}_{tk}, \gamma_{tk} \right) \right]^{q_{tk}} G_{\min}^{\,1 - q_{tk}}\; Y_{tk} \qquad (2.25)$$
where [34]
$$G_{\mathrm{LSA}}(\xi, \gamma) = \frac{\xi}{1 + \xi} \exp\left( \frac{1}{2} \int_{\upsilon}^{\infty} \frac{e^{-x}}{x}\, dx \right). \qquad (2.26)$$
It is important to note that both estimators (2.22) and (2.25) are insensitive to the phase estimation error, and they are combined with the phase of the noisy signal [33].
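To make the gain expressions concrete, the sketch below evaluates $G_{\mathrm{STSA}}$ (2.23) and $G_{\mathrm{LSA}}$ (2.26) with SciPy's modified Bessel and exponential-integral routines, and combines them with a speech presence probability $q$ as in (2.22) and (2.25). It is only an illustration of the formulas above, not the implementation used in the thesis.

```python
import numpy as np
from scipy.special import i0, i1, exp1

def gain_stsa(xi, gamma):
    """MMSE-STSA gain (2.23)."""
    v = gamma * xi / (1.0 + xi)
    return (np.sqrt(np.pi * v) / (2.0 * gamma)) * np.exp(-v / 2.0) * \
           ((1.0 + v) * i0(v / 2.0) + v * i1(v / 2.0))

def gain_lsa(xi, gamma):
    """MMSE-LSA gain (2.26); exp1(v) is the exponential integral of v."""
    v = gamma * xi / (1.0 + xi)
    return (xi / (1.0 + xi)) * np.exp(0.5 * exp1(v))

def enhance_bin(Y, xi, gamma, q, G_min=0.1, use_lsa=True):
    """Apply (2.25) (log-spectral) or (2.22) (spectral amplitude) to one
    noisy coefficient Y, given a priori SNR xi, a posteriori SNR gamma and
    speech presence probability q."""
    if use_lsa:
        G = gain_lsa(xi, gamma) ** q * G_min ** (1.0 - q)    # (2.25)
    else:
        G = q * gain_stsa(xi, gamma) + (1.0 - q) * G_min     # (2.22)
    return G * Y   # the noisy phase is retained

print(enhance_bin(1.0 + 1.0j, xi=2.0, gamma=3.0, q=0.8))
```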

2.3 Single-channel blind source separation

Separation of a mixture of signals observed via a single sensor is an ill-posed problem, and some a priori information is required to enable reasonable reconstruction. In [87–89], a GMM is proposed for the signals' codebook in the STFT domain, and in [94,95] an AR model is proposed with different sets of prediction coefficients for each of the signals in the time domain. However, each set of AR coefficients, together with the excitation variance, corresponds to a specific covariance matrix in the STFT domain, similarly to the GMM. Under each of these models, each framed signal is considered as generated from some specific distribution which is related to the codebook with some probability, and a frame-by-frame separation is applied. Let $s_1, s_2 \in \mathbb{C}^N$ denote the vectors of the STFT expansion coefficients of signals $s_1(n)$ and $s_2(n)$, respectively, for some specific frame index. Let $q_1$ and $q_2$ denote the active states of the codebooks corresponding to signals $s_1$ and $s_2$, respectively, with known a priori probabilities $p_1(i) \triangleq p(q_1 = i)$, $i = 0, ..., m_1$ and $p_2(j) \triangleq p(q_2 = j)$, $j = 0, ..., m_2$, where $\sum_i p_1(i) = \sum_j p_2(j) = 1$. Given that $q_1 = i$ and $q_2 = j$, $s_1$ and $s_2$ are assumed conditionally zero-mean complex-valued Gaussian random vectors with known diagonal covariance matrices, i.e., $s_1 \sim \mathcal{CN}\left(0, \Sigma_1^{(i)}\right)$ and $s_2 \sim \mathcal{CN}\left(0, \Sigma_2^{(j)}\right)$. For the AR model [35, 62, 94, 95], each set of prediction coefficients in the time domain corresponds to a specific covariance matrix in the STFT domain, up to scaling by the excitation variance. Assuming sufficiently long frames, these covariance matrices are considered as diagonal [95]. Based on a given codebook, it is proposed in [88] and [95] to first find the active pair of states $\{i, j\} = \{q_1 = i, q_2 = j\}$ using a maximum a posteriori (MAP) criterion:

$$\left\{ \hat{i}, \hat{j} \right\} = \arg\max_{i,j}\; p\left( x \mid i, j \right) p\left( i, j \right) \qquad (2.27)$$
where $x = s_1 + s_2$, $p(\cdot \mid i, j) = p(\cdot \mid q_1 = i, q_2 = j)$, and for statistically independent signals $p(i, j) = p_1(i)\, p_2(j)$. Subsequently, conditioned on these states (i.e., classification), the desired signal may be reconstructed in the MMSE sense by

$$\hat{s}_1 = E\left\{ s_1 \mid x, \hat{i}, \hat{j} \right\} = \Sigma_1^{(\hat{i})} \left( \Sigma_1^{(\hat{i})} + \Sigma_2^{(\hat{j})} \right)^{-1} x \;\triangleq\; W_{\hat{i}\hat{j}}\, x \qquad (2.28)$$

and similarly $\hat{s}_2 = W_{\hat{j}\hat{i}}\, x$.¹ Alternatively [35, 88, 90], the desired signal may be reconstructed in the MMSE sense directly from

$$\hat{s}_1 = E\left\{ s_1 \mid x \right\} = E_{ij}\left\{ E\left[ s_1 \mid x, i, j \right] \right\} = \sum_{i,j} p\left( i, j \mid x \right) W_{ij}\, x. \qquad (2.29)$$
In case of additional uncorrelated stationary noise in the mixed signal, i.e.,

x = s1 + s2 + d (2.30) with d (0, Σ), the covariance matrix of the noise signal is added to the covariance ∼ CN matrix of the interfering signal, and then the signal estimators remain in the same forms.

2.4 Speech dereverberation

A generalized statistical reverberation model has been proposed in [73–75]. Accordingly, the acoustic impulse response (AIR) $h(n)$ can be split into two segments, $h_e(n)$ and $h_l(n)$:

$$h(n) = \begin{cases} h_e(n), & 0 \leq n \leq T_r \\ h_l(n), & n \geq T_r \\ 0, & \text{otherwise}. \end{cases} \qquad (2.31)$$
The value $T_r$ is chosen such that $h_e(n)$ contains the direct path and $h_l(n)$ contains all later reflections. To enable modeling of the energy related to the direct path, the

following model is used:

$$h_e(n) = \begin{cases} b_e(n)\, e^{-\delta n}, & 0 \leq n \leq T_r \\ 0, & \text{otherwise}, \end{cases} \qquad (2.32)$$
where $b_e(n)$ is a white zero-mean Gaussian stationary noise signal and $\delta$ is linked to the reverberation time², $T_{60}$. The reverberation component $h_l(n)$ follows

$$h_l(n) = \begin{cases} b_l(n)\, e^{-\delta n}, & n \geq T_r \\ 0, & \text{otherwise}, \end{cases} \qquad (2.33)$$
where $b_l(n)$ is a white zero-mean Gaussian stationary noise signal. It is assumed that $b_e(n)$ and $b_l(n)$ are uncorrelated, and the energy envelope of $h(n)$ can be expressed as

$$E_h\left\{ h^2(n) \right\} = \begin{cases} \sigma_e^2\, e^{-2\delta n}, & 0 \leq n \leq T_r \\ \sigma_l^2\, e^{-2\delta n}, & n \geq T_r \\ 0, & \text{otherwise}, \end{cases} \qquad (2.34)$$
where $\sigma_e^2$ and $\sigma_l^2$ denote the variances of $b_e(n)$ and $b_l(n)$, respectively, and generally $\sigma_e^2 \geq \sigma_l^2$. Considering a speech signal $x(n)$ which propagates towards a microphone through a room with AIR $h(n)$, the received signal can be denoted as

y (n)= xe (n)+ xl (n)+ d (n) , (2.35)

where $x_e(n)$ is the early speech component, $x_l(n)$ is the late reverberant signal, and $d(n)$ is uncorrelated additive noise. The spectral enhancement approach to speech dereverberation aims at estimating the early speech component from the noisy observation by using a time-frequency dependent gain function. Consequently, in the STFT domain we have

$$\hat{X}_e(t, k) = G(t, k)\, Y(t, k). \qquad (2.36)$$

The gain function may be obtained by means of spectral subtraction [74] or in the MMSE-LSA sense [75]. In any case, the spectral variances of both the early speech component, $\lambda_e(t, k) = E\left\{ |X_e(t, k)|^2 \right\}$, and the late reverberant signal, $\lambda_l(t, k) = E\left\{ |X_l(t, k)|^2 \right\}$, are estimated from the observed signal based on the reverberation model. Accordingly, while evaluating the gain function, the a priori and a posteriori SNRs are given by

$$\hat{\xi}(t, k) = \frac{\hat{\lambda}_e(t, k)}{\hat{\lambda}_l(t, k) + \lambda_d(t, k)}, \qquad \hat{\gamma}(t, k) = \frac{|Y(t, k)|^2}{\hat{\lambda}_l(t, k) + \lambda_d(t, k)}. \qquad (2.37)$$

²The reverberation time, $T_{60}$, is defined as the time for the reverberation level to decay to 60 dB below the initial level.
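A minimal sketch of how the SNRs of (2.37) feed a dereverberation gain for one time-frequency bin. The estimate of the late reverberant variance $\hat{\lambda}_l$ is taken as given here (in practice it follows from the statistical reverberation model above), and the LSA-type gain and the spectral floor are illustrative choices, not the thesis implementation.

```python
import numpy as np
from scipy.special import exp1

def dereverb_gain(Y_tk, lam_e_hat, lam_l_hat, lam_d, G_min=0.1):
    """Evaluate the SNRs of (2.37) and a corresponding spectral gain for one
    time-frequency bin. An LSA-type gain is used purely as an example; the
    thesis mentions spectral subtraction [74] and MMSE-LSA [75]."""
    denom = lam_l_hat + lam_d
    xi = lam_e_hat / denom                # a priori SNR, (2.37)
    gamma = np.abs(Y_tk) ** 2 / denom     # a posteriori SNR, (2.37)
    v = gamma * xi / (1.0 + xi)
    G = (xi / (1.0 + xi)) * np.exp(0.5 * exp1(v))
    return max(G, G_min) * Y_tk           # (2.36) with a spectral floor

print(dereverb_gain(0.8 + 0.3j, lam_e_hat=1.0, lam_l_hat=0.5, lam_d=0.2))
```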

Chapter 3

Stationarity Analysis of Markov-Switching GARCH Processes¹

GARCH models with Markov-switching regimes are often used for volatility analysis of financial time series. Such models imply less persistence in the conditional variance than the standard GARCH model, and potentially provide a significant improvement in volatility forecast. Nevertheless, conditions for asymptotic wide-sense stationarity have been derived only for some degenerate models. In this chapter, we introduce a comprehensive approach for stationarity analysis of Markov-switching GARCH models, which manipulates a backward recursion of the model's second-order moment. A recursive formulation of the state-dependent conditional variances is developed and the corresponding conditions for stationarity are obtained. In particular, we derive necessary and sufficient conditions for the asymptotic wide-sense stationarity of two different variants of Markov-switching GARCH processes, and obtain expressions for their asymptotic variances in the general case of m-state Markov chains and (p, q)-order GARCH processes.

1This chapter is based on [121].

3.1 Introduction

Volatility analysis of financial time series is of major importance in many financial applications. The generalized autoregressive conditional heteroscedasticity (GARCH) model [1] has been applied quite extensively in the field of econometrics, both by practitioners and by researchers, and has been shown to be useful for analyzing and forecasting the volatility of time-varying processes such as those pertaining to financial markets. Incorporating GARCH models with a hidden Markov chain, where each state of the chain (regime) allows a different GARCH behavior and thus a different volatility structure, extends the dynamic formulation of the model and potentially enables improved forecasts of the volatility [6–11]. Unfortunately, the volatility of a GARCH process with switching regimes depends on the entire history of the process, including the regime path, which makes the derivation of a volatility estimator impractical.

Cai [12] and Hamilton and Susmel [13] applied the idea of regime-switching parameters to the ARCH specification. The conditional variance of an ARCH model depends only on past observations, and accordingly the restriction to ARCH models avoids problems of infinite path dependency. Gray [6], Klaassen [7] and Haas, Mittnik and Paolella [8] proposed different variants of Markov-switching GARCH models, which also avoid the problem of dependency on the regime's path. Gray introduced a Markov-switching GARCH model relying on the assumption that the conditional variance at any regime depends on the expectation of previous conditional variances, rather than on their values. Accordingly, the conditional variance depends only on some finite set of past state-dependent expected values via their conditional state probabilities, and thus can be constructed from past observations. Klaassen proposed modifying Gray's model by conditioning the expectation of previous conditional variances on all available observations and also on the current regime. A different concept of Markov-switching GARCH model has recently been introduced by Haas, Mittnik and Paolella. Accordingly, a finite state-space Markov chain is assumed to govern the ARCH parameters, while the autoregressive behavior of the conditional variance is subject to the assumption that past conditional variances are in the same regime as the current one.

Markov-switching GARCH processes, as well as the standard GARCH process, are nonstationary, as their second-order moments change recursively over time. However, if these processes are asymptotically wide-sense stationary, then their variances are guaranteed to be finite. A necessary and sufficient condition for the stationarity of a (single-regime) GARCH(p, q) process has been developed in [1]. A condition for the stationarity of a natural, path-dependent Markov-switching GARCH(p, q) model has been developed in [122], and in [14] a deep analysis of the probabilistic structure of that model is derived, with conditions for the existence of moments of any order. In [15–17], stationarity analysis has been derived for some mixing models of conditional heteroscedasticity, and conditions for the asymptotic stationarity of some AR and ARMA models with Markov regimes have been derived in [18–22]. However, for the Markov-switching GARCH models described above, which avoid the dependency of the conditional variance on the chain's history, stationarity conditions are known in the literature only for some special cases. Klaassen [7] developed necessary (but not necessarily sufficient) conditions for stationarity of his model in the special case of two regimes and GARCH modeling of order (1, 1). A necessary and sufficient stationarity condition has been developed by Haas, Mittnik and Paolella [8] for their Markov-switching GARCH model, but only in the case of GARCH(1, 1) behavior in each regime.

In this chapter, we develop a comprehensive approach for stationarity analysis of Markov-switching GARCH models in the general case of m-state Markov chains and (p, q)-order GARCH processes. We specify the unconditional variance of the process using the expectation of the regime-dependent conditional variances, and assume no history knowledge of the process except for the model parameters. The expectation of the conditional variance at a given regime is then recursively constructed from the conditional expectation of both previous conditional and unconditional variances. Consequently, we obtain a complete recursion for the expected vector of state-dependent conditional variances. The recursive vector form is constructed by means of a representative matrix which is built from the model parameters. We show that constraining the largest absolute eigenvalue of the representative matrix to be less than one is necessary and sufficient for the convergence of the unconditional variance, and therefore for the asymptotic stationarity of the process. We derive stationarity conditions for the general formulation of the two variants of Markov-switching GARCH models introduced by Klaassen and Haas et al.

We show that our results reduce in some degenerate cases to the stationarity conditions developed by Bollerslev [1], Klaassen [7] and Haas et al. [8]. Furthermore, we show that the stationarity conditions developed by Klaassen are not only necessary but also sufficient for asymptotic stationarity of his model. This chapter is organized as follows: In Section 3.2, we review the variants proposed by Klaassen and Haas et al. for Markov-switching GARCH models, and develop comprehensive necessary and sufficient conditions for asymptotic stationarity appropriate for the general formulation of the models. In Section 3.3, we derive relations between our results and previous works.

3.2 Stationarity of Markov-switching GARCH models

Let $S_t \in \{1, ..., m\}$ denote the (unobserved) regime at a discrete time $t$ and let $s_t$ be a realization of $S_t$, assuming that $\{S_t\}$ is a first-order stationary Markov chain with transition probabilities $a_{ij} \triangleq p(S_t = j \mid S_{t-1} = i)$, a transition probability matrix $A$, $\{A\}_{ij} = a_{ij}$, and stationary probabilities $\pi = [\pi_1, \pi_2, ..., \pi_m]'$, $\pi_i \triangleq p(S_t = i)$, where $'$ denotes the transpose operation. Let $\mathcal{I}_t$ denote the observation set up to time $t$, and let $\{v_t\}$ be a zero-mean unit-variance random process with independent and identically distributed elements. Given that $S_t = s_t$, a Markov-switching GARCH model of order (p, q) can be formulated as

$$\varepsilon_t = \sigma_{t,s_t}\, v_t \qquad (3.1)$$
where the conditional variance of the process, $\sigma^2_{t,s_t} = E\left\{ \varepsilon_t^2 \mid S_t = s_t, \mathcal{I}_{t-1} \right\}$, is a function of $p$ previous conditional variances and $q$ previous squared observations. Klaassen [7] and Haas, Mittnik and Paolella [8] proposed different variants of Markov-switching GARCH models. The former is a modification of Gray's model [6]. Each of these overcomes the problem of dependency on the regime's path encountered when naturally integrating the GARCH model with switching regimes. However, conditions for these models to be asymptotically wide-sense stationary, and therefore to guarantee a

finite second-order moment, are known only for some special cases. Klaassen developed necessary conditions for the stationarity of his model in the case of a two-state Markov chain and GARCH of order (1, 1). Haas et al. gave a necessary and sufficient stationarity condition for their model, but this condition is restricted to a first-order GARCH model in each of the regimes (i.e., p = q = 1). We first review these variants of Markov-switching GARCH models, which we call MSG-I and MSG-II, respectively. Then we develop necessary and sufficient conditions for their asymptotic wide-sense stationarity and derive their stationary variances.

3.2.1 MSG-I model

Gray [6] proposed to model the conditional variance of a Markov-switching GARCH model as dependent on the expectation of its past values over the entire set of states, rather than dependent on past states and the corresponding conditional variances. Accordingly, the state dependent conditional variance follows

$$\begin{aligned}
\sigma^2_{t,s_t} &= \xi_{s_t} + \sum_{i=1}^{q} \alpha_{i,s_t}\, \varepsilon^2_{t-i} + \sum_{j=1}^{p} \beta_{j,s_t}\, E\left( \varepsilon^2_{t-j} \mid \mathcal{I}_{t-j-1} \right) \\
&= \xi_{s_t} + \sum_{i=1}^{q} \alpha_{i,s_t}\, \varepsilon^2_{t-i} + \sum_{j=1}^{p} \beta_{j,s_t} \sum_{s_{t-j}=1}^{m} p\left( S_{t-j} = s_{t-j} \mid \mathcal{I}_{t-j-1} \right) \sigma^2_{t-j,s_{t-j}}, \qquad (3.2)
\end{aligned}$$
and the following constraints

$$\xi_{s_t} > 0, \quad \alpha_{i,s_t} \geq 0, \quad \beta_{j,s_t} \geq 0, \quad i = 1, ..., q, \;\; j = 1, ..., p, \;\; s_t = 1, ..., m \qquad (3.3)$$
are sufficient for the positivity of the conditional variance. Gray's model integrates out the unobserved regime path so that the conditional variance can be constructed from previous observations only. As a consequence, there is no path dependency problem although GARCH effects are still allowed. Empirical analysis of modeling financial time series demonstrates that this Markov-switching GARCH model implies less persistence in the conditional variance than the standard GARCH model, and in addition, its one-step ahead volatility forecast significantly outperforms the single-regime GARCH model (see, for instance [6,9,11]).

Klaassen [7] proposed modifying Gray's model by replacing $p\left( S_{t-j} = s_{t-j} \mid \mathcal{I}_{t-j-1} \right)$ in (3.2) by $p\left( S_{t-j} = s_{t-j} \mid \mathcal{I}_{t-1}, S_t = s_t \right)$ while evaluating $\sigma^2_{t,s_t}$. Consequently, all available observations are used, as well as the given regime in which the conditional variance is calculated. The conditional variance according to Klaassen's model (denoted here as MSG-I) is given by

$$\sigma^2_{t,s_t} = \xi_{s_t} + \sum_{i=1}^{q} \alpha_{i,s_t}\, \varepsilon^2_{t-i} + \sum_{j=1}^{p} \beta_{j,s_t} \sum_{s_{t-j}=1}^{m} p\left( S_{t-j} = s_{t-j} \mid \mathcal{I}_{t-1}, S_t = s_t \right) \sigma^2_{t-j,s_{t-j}}, \qquad (3.4)$$
and the same constraints (3.3) are sufficient for the positivity of the conditional variance. Both models integrate out the unobserved regimes for evaluating the conditional variance. However, Klaassen's model employs all the available information while Gray's model employs only part of it, since it does not utilize all the available observations and the assumed regime in which the conditional variance is being calculated. Specifically, if the process' regimes are highly persistent, then both the current state $s_t$ and the previous innovation

$\varepsilon_{t-1}$ give much information about previous states, and thus the conditional probability of $s_{t-1}$ given all the observations up to time $t-1$ and the next state is substantially different from the probability of $s_{t-1}$ conditioned only on observations up to time $t-2$ [7]. In contrast to Gray, Klaassen does manipulate this information in his model while evaluating the expectation of previous conditional variances. Furthermore, the formulation (3.4) better exploits the available information, and its structure yields straightforward expressions for the multi-step ahead volatility forecasts [7,9]. The unconditional variance of the MSG-I process, defined in (3.1) and (3.4), can be calculated as follows:

$$E\left( \varepsilon_t^2 \right) = E_{\mathcal{I}_{t-1},S_t}\left[ E\left( \varepsilon_t^2 \mid \mathcal{I}_{t-1}, s_t \right) \right] = E_{S_t}\left[ E_{\mathcal{I}_{t-1}}\left( \sigma^2_{t,s_t} \mid s_t \right) \right] = \sum_{s_t=1}^{m} \pi_{s_t}\, E_{\mathcal{I}_{t-1}}\left( \sigma^2_{t,s_t} \mid s_t \right). \qquad (3.5)$$
For notational simplicity, we shall use $E(\cdot \mid s_t)$ and $p(\cdot \mid s_t)$ to represent $E(\cdot \mid S_t = s_t)$ and $p(\cdot \mid S_t = s_t)$, respectively, where $s_t$ represents the regime realization at time $t$. Furthermore, we shall use $E_t(\cdot)$ to denote the expectation over the information up to time $t$, i.e., $E_{\mathcal{I}_t}(\cdot)$. The expectation of the regime-dependent conditional variance follows
$$E_{t-1}\left( \sigma^2_{t,s_t} \mid s_t \right) = \xi_{s_t} + \sum_{i=1}^{q} \alpha_{i,s_t}\, E_{t-1}\left( \varepsilon^2_{t-i} \mid s_t \right) + \sum_{j=1}^{p} \beta_{j,s_t} \sum_{s_{t-j}=1}^{m} E_{t-1}\left[ p\left( s_{t-j} \mid \mathcal{I}_{t-1}, s_t \right) \sigma^2_{t-j,s_{t-j}} \mid s_t \right], \qquad (3.6)$$
where the expectation over $\varepsilon^2_{t-i}$ can be obtained by
$$E_{t-1}\left( \varepsilon^2_{t-i} \mid s_t \right) = \sum_{s_{t-i}=1}^{m} \int \varepsilon^2_{t-i}\, p\left( \mathcal{I}_{t-1} \mid s_t, s_{t-i} \right) p\left( s_{t-i} \mid s_t \right) d\mathcal{I}_{t-1} = \sum_{s_{t-i}=1}^{m} p\left( s_{t-i} \mid s_t \right) E_{t-1}\left( \varepsilon^2_{t-i} \mid s_{t-i}, s_t \right). \qquad (3.7)$$
Note that given the active state at time $t-i$, the expected squared value is independent of any future states. Therefore,

$$\begin{aligned}
E_{t-1}\left( \varepsilon^2_{t-i} \mid s_{t-i}, s_t \right) &= E_{t-1}\left( \varepsilon^2_{t-i} \mid s_{t-i} \right) \\
&= \int_{\mathcal{I}_{t-i-1}} \int_{\varepsilon_{t-i}} \varepsilon^2_{t-i}\, p\left( \varepsilon_{t-i} \mid \mathcal{I}_{t-i-1}, s_{t-i} \right) p\left( \mathcal{I}_{t-i-1} \mid s_{t-i} \right) d\varepsilon_{t-i}\, d\mathcal{I}_{t-i-1} \\
&= E_{t-i-1}\left[ E\left( \varepsilon^2_{t-i} \mid \mathcal{I}_{t-i-1}, s_{t-i} \right) \mid s_{t-i} \right] \\
&= E_{t-i-1}\left( \sigma^2_{t-i,s_{t-i}} \mid s_{t-i} \right). \qquad (3.8)
\end{aligned}$$
Furthermore, the conditional expectation over the conditional variance in (3.6), weighted by the current state probability, can be obtained by

$$\begin{aligned}
E_{t-1}\left[ p\left( s_{t-j} \mid \mathcal{I}_{t-1}, s_t \right) \sigma^2_{t-j,s_{t-j}} \mid s_t \right] &= \int_{\mathcal{I}_{t-1}} \sigma^2_{t-j,s_{t-j}}\, p\left( s_{t-j} \mid \mathcal{I}_{t-1}, s_t \right) p\left( \mathcal{I}_{t-1} \mid s_t \right) d\mathcal{I}_{t-1} \\
&= \int_{\mathcal{I}_{t-1}} \sigma^2_{t-j,s_{t-j}}\, p\left( \mathcal{I}_{t-1} \mid s_{t-j}, s_t \right) p\left( s_{t-j} \mid s_t \right) d\mathcal{I}_{t-1} \\
&= p\left( s_{t-j} \mid s_t \right) E_{t-j-1}\left( \sigma^2_{t-j,s_{t-j}} \mid s_{t-j} \right). \qquad (3.9)
\end{aligned}$$
Consequently, the expectation of the conditional variance at a given regime $s_t$ can be recursively constructed, according to the model definitions, from both expectations of previous conditional variances and expected squared values given the current regime $s_t$. Let $r = \max\{p, q\}$ and define $\alpha_{i,s} \triangleq 0$ for all $i > q$ and $\beta_{i,s} \triangleq 0$ for all $i > p$. Then, by substituting (3.7), (3.8) and (3.9) into (3.6) we obtain

$$E_{t-1}\left( \sigma^2_{t,s_t} \mid s_t \right) = \xi_{s_t} + \sum_{i=1}^{r} \sum_{s_{t-i}=1}^{m} \left( \alpha_{i,s_t} + \beta_{i,s_t} \right) p\left( s_{t-i} \mid s_t \right) E_{t-i-1}\left( \sigma^2_{t-i,s_{t-i}} \mid s_{t-i} \right), \qquad (3.10)$$
and applying Bayes' rule we have

$$p\left( s_{t-i} \mid s_t \right) = \frac{\pi_{s_{t-i}}}{\pi_{s_t}}\, p\left( s_t \mid s_{t-i} \right) = \frac{\pi_{s_{t-i}}}{\pi_{s_t}} \left\{ A^i \right\}_{s_{t-i}, s_t}. \qquad (3.11)$$
The expected state-dependent conditional variance (3.10) is recursively generated from a weighted sum of its previous expected values through their conditional probabilities and the model parameters. Let $\xi \triangleq [\xi_1, ..., \xi_m]'$ and let $\mathcal{K}^{(i)}$ be an m-by-m matrix with elements

$$\left\{ \mathcal{K}^{(i)} \right\}_{s,\tilde{s}} \triangleq \left( \alpha_{i,s} + \beta_{i,s} \right) \frac{\pi_{\tilde{s}}}{\pi_s} \left\{ A^i \right\}_{\tilde{s}, s}, \qquad s, \tilde{s} = 1, ..., m, \qquad (3.12)$$
and let $h_t \triangleq \left[ E_{t-1}\left( \sigma^2_{t,1} \mid S_t = 1 \right), ..., E_{t-1}\left( \sigma^2_{t,m} \mid S_t = m \right) \right]'$ be an m-by-1 vector of the expected state-dependent conditional variances. Then, we have

$$h_t = \xi + \sum_{i=1}^{r} \mathcal{K}^{(i)}\, h_{t-i}. \qquad (3.13)$$
Define the rm-by-1 vectors $\tilde{h}_t \triangleq \left[ h_t', h_{t-1}', ..., h_{t-r+1}' \right]'$ and $\tilde{\xi} \triangleq \left[ \xi', 0, ..., 0 \right]'$, and let
$$\Psi_I \triangleq \begin{bmatrix} \mathcal{K}^{(1)} & \mathcal{K}^{(2)} & \cdots & \cdots & \mathcal{K}^{(r)} \\ I_m & 0_m & \cdots & \cdots & 0_m \\ 0_m & I_m & & & \vdots \\ \vdots & & \ddots & & \vdots \\ 0_m & \cdots & 0_m & I_m & 0_m \end{bmatrix} \qquad (3.14)$$
be an mr-by-mr matrix, where $I_m$ represents the identity matrix of size m-by-m and $0_m$ is an m-by-m matrix of zeros. Then a recursive vector form of the expected conditional variance (3.13) can be written as

$$\tilde{h}_t = \tilde{\xi} + \Psi_I\, \tilde{h}_{t-1}, \qquad t \geq 0, \qquad (3.15)$$
with some initial conditions $\tilde{h}_{-1}$. Let $\rho(\cdot)$ denote the spectral radius of a matrix, i.e., its largest eigenvalue in modulus, and let $\Lambda_I$ be an m-by-m square matrix built from the mr-by-mr matrix $(I - \Psi_I)^{-1}$ such that $\left\{ \Lambda_I \right\}_{ij} = \left\{ (I - \Psi_I)^{-1} \right\}_{ij}$, $i, j = 1, ..., m$. Then we have the following theorem:

Theorem 3.1. An MSG-I process as defined by (3.1) and (3.4) is asymptotically wide-sense stationary with variance $\lim_{t\to\infty} E\left( \varepsilon_t^2 \right) = \pi' \Lambda_I\, \xi$, if and only if $\rho\left( \Psi_I \right) < 1$.²

Proof. The recursive equation (3.15) can be written as

$$\tilde{h}_t = \Psi_I^t\, \tilde{h}_0 + \sum_{i=0}^{t-1} \Psi_I^i\, \tilde{\xi}, \qquad t \geq 0. \qquad (3.16)$$
According to the matrix convergence theorem (e.g., [124, pp. 327-329]), a necessary and sufficient condition for the convergence of (3.16) as $t \to \infty$ is $\rho(\Psi_I) < 1$. Under this condition, $\Psi_I^t$ converges to zero as $t$ goes to infinity and $\sum_{i=0}^{t-1} \Psi_I^i$ converges to $(I - \Psi_I)^{-1}$, where the matrix $(I - \Psi_I)$ is then guaranteed to be invertible. Therefore, if $\rho(\Psi_I) < 1$, equation (3.16) yields
$$\lim_{t\to\infty} \tilde{h}_t = (I - \Psi_I)^{-1}\, \tilde{\xi}. \qquad (3.17)$$

By definition, the first m elements of $\tilde{h}_t$ constitute the vector $h_t$, while the first m elements of $\tilde{\xi}$ constitute the vector $\xi$, and the remaining elements of $\tilde{\xi}$ are zeros. Consequently,

$$\lim_{t\to\infty} h_t = \Lambda_I\, \xi \qquad (3.18)$$
and using (3.5) we have
$$\lim_{t\to\infty} E\left( \varepsilon_t^2 \right) = \pi' \Lambda_I\, \xi. \qquad (3.19)$$
Otherwise, if $\rho(\Psi_I) \geq 1$, the expected variance goes to infinity as the time index grows.
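Theorem 3.1 lends itself to a direct numerical check. The sketch below (illustrative Python; parameter values are arbitrary) assembles the blocks $\mathcal{K}^{(i)}$ of (3.12), builds the companion matrix $\Psi_I$ of (3.14), tests $\rho(\Psi_I) < 1$, and evaluates the asymptotic variance $\pi' \Lambda_I \xi$ of (3.19).

```python
import numpy as np

def stationary_probs(A):
    """Stationary distribution of A: left eigenvector for eigenvalue 1."""
    vals, vecs = np.linalg.eig(A.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return v / v.sum()

def msg1_stationarity(xi, alpha, beta, A):
    """Theorem 3.1 check for an m-state MSG-I model of order (p, q).

    xi    : (m,)   state constants xi_s
    alpha : (q, m) array, alpha[i-1, s-1] = alpha_{i,s}
    beta  : (p, m) array, beta[j-1, s-1]  = beta_{j,s}
    A     : (m, m) transition matrix (rows sum to one)
    Returns (spectral radius of Psi_I, asymptotic variance or None).
    """
    m = A.shape[0]
    r = max(alpha.shape[0], beta.shape[0])
    a = np.zeros((r, m)); a[:alpha.shape[0]] = alpha    # alpha_{i,s} = 0, i > q
    b = np.zeros((r, m)); b[:beta.shape[0]] = beta      # beta_{j,s}  = 0, j > p
    pi = stationary_probs(A)
    K, Ai = [], np.eye(m)
    for i in range(1, r + 1):
        Ai = Ai @ A                                     # A^i
        # {K^(i)}_{s,s~} = (alpha_{i,s}+beta_{i,s}) (pi_{s~}/pi_s) {A^i}_{s~,s}
        Ki = np.array([[(a[i-1, s] + b[i-1, s]) * pi[st] / pi[s] * Ai[st, s]
                        for st in range(m)] for s in range(m)])
        K.append(Ki)
    Psi = np.zeros((m * r, m * r))
    Psi[:m, :] = np.hstack(K)                           # first block row of (3.14)
    Psi[m:, :-m] = np.eye(m * (r - 1))                  # identity sub-diagonal
    rho = max(abs(np.linalg.eigvals(Psi)))
    if rho >= 1.0:
        return rho, None
    Lam = np.linalg.inv(np.eye(m * r) - Psi)[:m, :m]    # Lambda_I
    return rho, pi @ Lam @ xi                           # (3.19)

rho, var = msg1_stationarity(np.array([0.1, 0.3]),
                             np.array([[0.1, 0.3]]),    # alpha_{1,s}
                             np.array([[0.7, 0.5]]),    # beta_{1,s}
                             np.array([[0.9, 0.1], [0.2, 0.8]]))
print(rho, var)
```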

3.2.2 MSG-II model

Another variant of Markov-switching GARCH model has recently been proposed by Haas, Mittnik and Paolella [8]. This model assumes that a Markov chain controls the ARCH parameters at each regime (i.e., $\xi_s$ and $\alpha_{i,s}$), while the autoregressive behavior in each regime is subject to the assumption that past conditional variances are in the same regime as that of the current conditional variance. Specifically, the vector of conditional variances

$\sigma_t^2 \triangleq \left[ \sigma^2_{t,1}, \sigma^2_{t,2}, ..., \sigma^2_{t,m} \right]'$ is given by

²Note that for a matrix with nonnegative elements, there exists a real eigenvalue which is equal to the spectral radius [123, p. 288].

$$\sigma_t^2 = \xi + \sum_{i=1}^{q} \alpha_i\, \varepsilon^2_{t-i} + \sum_{j=1}^{p} B^{(j)}\, \sigma^2_{t-j}, \qquad (3.20)$$

where $\alpha_i \triangleq [\alpha_{i,1}, ..., \alpha_{i,m}]'$, $i = 1, ..., q$, and $\beta_j \triangleq [\beta_{j,1}, ..., \beta_{j,m}]'$, $j = 1, ..., p$, are vectors of state-dependent GARCH parameters, and $B^{(j)} \triangleq \operatorname{diag}\{\beta_j\}$ is a diagonal matrix with the elements of $\beta_j$ on its diagonal. The same constraints which are sufficient to ensure a positive conditional variance in the MSG-I model (3.3) are also applied here to guarantee the positivity of the conditional variance.

Note that the conditional variance at a specific regime depends on previous conditional variances of the same regime through the diagonal matrices $B^{(j)}$. Consequently, this model allows derivation of the conditional variance at a given time from past observations only. Furthermore, the MSG-II model is analytically more tractable than the MSG-I model [8], and its conditional variance can be straightforwardly constructed since the conditional variance at a specific time does not depend on previous state probabilities but only on previous observations and previous conditional variances.

Let $\alpha_i$ be an m-by-1 vector of zeros for $i > q$ and let $B^{(j)} = 0_m$ for $j > p$. Let $\Omega^{(i)}$ denote an m²-by-m² block matrix of basic dimension m-by-m

$$\Omega^{(i)} \triangleq \begin{bmatrix} \Omega^{(i)}_{11} & \Omega^{(i)}_{21} & \cdots & \Omega^{(i)}_{m1} \\ \Omega^{(i)}_{12} & \Omega^{(i)}_{22} & \cdots & \Omega^{(i)}_{m2} \\ \vdots & & & \vdots \\ \Omega^{(i)}_{1m} & \Omega^{(i)}_{2m} & \cdots & \Omega^{(i)}_{mm} \end{bmatrix}, \qquad (3.21)$$
with each block given by

$$\Omega^{(i)}_{s\tilde{s}} \triangleq p\left( S_{t-i} = s \mid S_t = \tilde{s} \right) \left( \alpha_i\, e_s' + B^{(i)} \right), \qquad s, \tilde{s} = 1, ..., m, \qquad (3.22)$$
where $e_s$ is an m-by-1 vector of all zeros, except its sth element which is one. We define an rm²-by-rm² matrix by

$$\Psi_{II} \triangleq \begin{bmatrix} \Omega^{(1)} & \Omega^{(2)} & \cdots & \cdots & \Omega^{(r)} \\ I_{m^2} & 0_{m^2} & \cdots & \cdots & 0_{m^2} \\ 0_{m^2} & I_{m^2} & & & \vdots \\ \vdots & & \ddots & & \vdots \\ 0_{m^2} & \cdots & 0_{m^2} & I_{m^2} & 0_{m^2} \end{bmatrix}. \qquad (3.23)$$
Let $\Lambda_{II}$ be an m²-by-m² matrix which is built from the rm²-by-rm² matrix $(I - \Psi_{II})^{-1}$ such that $\left\{ \Lambda_{II} \right\}_{ij} = \left\{ (I - \Psi_{II})^{-1} \right\}_{ij}$, $i, j = 1, ..., m^2$. Let $\bar{\pi} \triangleq \left[ \pi_1 e_1', \pi_2 e_2', ..., \pi_m e_m' \right]'$; then we get the following theorem for the stationarity condition of an MSG-II process:

Theorem 3.2. An MSG-II process as defined by (3.1) and (3.20) is asymptotically wide-sense stationary with variance $\lim_{t\to\infty} E\left( \varepsilon_t^2 \right) = \bar{\pi}' \Lambda_{II}\, \xi$, if and only if $\rho\left( \Psi_{II} \right) < 1$.

The proof is given in Appendix 3.A.

3.2.3 Comparison of stationarity conditions

It has been pointed out in [8] that stationarity of the MSG-II model with p = q = 1 requires that the regression parameters satisfy $\beta_{1,s} < 1$ for all $s$. It follows from (3.20) that for a general order (p, q) it is necessary that $\sum_{i=1}^{p} \beta_{i,s} < 1$. However, the reaction parameters

$\alpha_{i,s}$ may become rather large, in correspondence with the regime probabilities. For the MSG-I model, the reaction parameters $\alpha_{i,s}$ as well as the regression parameters $\beta_{i,s}$ may be larger than one, provided that the corresponding regime probabilities are sufficiently small. Furthermore, in the representative matrix $\Psi_I$ (3.14) the reaction parameters and the regression parameters are weighted by the same weights $p\left( S_{t-i} = s \mid S_t = \tilde{s} \right)$. Consequently, for a given state $s$, the values of $\alpha_{i,s}$ and $\beta_{i,s}$ in the MSG-I model have the same contribution to the model stationarity³, but for the MSG-II model, each of them affects the heteroscedasticity evolution differently. Figure 3.1 illustrates the stationarity regions for the MSG-I model (solid line) and the MSG-II model (dashed-dotted line), in the case of two-state Markov chains and GARCH of order (1, 1). In (a), the regime transition probabilities are

$a_{1,1} = 0.6$ and $a_{2,2} = 0.7$ and the reaction parameters are $\alpha_{1,1} = 0.4$ and $\alpha_{1,2} = 0.5$. The stationarity region is the interior intersection of each curve and the two axes. In (b), $a_{1,1} = 0.2$, $a_{2,2} = 0.3$ are considered with reaction parameters $\alpha_{1,1} = 0.8$ and $\alpha_{1,2} = 0.2$. For the MSG-I model, stationarity is allowed with regression parameters larger than one, while for the MSG-II model, $\beta_{1,1}$ and $\beta_{1,2}$ must both be smaller than one for stationarity. In both cases $\pi_2 > \pi_1$; however, in (a) the stationarity region of the MSG-II model is contained in the stationarity region of the MSG-I model, while in (b), in which case $\alpha_{1,1} \gg \alpha_{1,2}$, for $\beta_{1,1} \in [0.2, 0.55]$ stationarity is achieved with a larger $\beta_{1,2}$ for the MSG-II model than for the MSG-I model.

³This also holds for the natural extension of GARCH(p, q) to Markov-switching, which has been analyzed in [122].

3.3 Relation to other works

Klaassen [7] developed conditions which are necessary, but not necessarily sufficient, for asymptotic stationarity of a two-state MSG-I model of order (1, 1). Consider the 2-by-2 matrix C with elements

$$c_{ij} = a_{ji} \left( \alpha_{1,i} + \beta_{1,i} \right) \pi_j / \pi_i, \qquad i, j = 1, 2. \qquad (3.24)$$

Klaassen showed that the stationary variance of the process is given by

$$\sigma^2 = \pi' (I - C)^{-1} \xi, \qquad (3.25)$$
and that the conditions

$$c_{11}, c_{22} < 1 \quad \text{and} \quad \det(I - C) > 0, \qquad (3.26)$$
are necessary to ensure that the stationary variance is finite and positive. For the special case of our analysis for GARCH orders of (1, 1) and the MSG-I model with two states, the representative matrix $\Psi_I$ reduces to the matrix $C$ and the stationary variance reduces to the expression given in (3.25). Metzler showed [125] that for a nonnegative matrix $C$ (i.e., $c_{ij} \geq 0$), $\rho(C) < 1$ if and only if all of the principal minors of $(I - C)$ are positive. Furthermore, together with the Hawkins-Simon condition [126], $\rho(C)$ is less than one if and only if $(I - C)^{-1}$ has no negative elements. Therefore, for the nonnegative matrix $C$, the condition $c_{11}, c_{22} < 1$ implies $\det(I - C) > 0$ and it is equivalent to $\rho(C) < 1$. Accordingly, the conditions of Klaassen are not only necessary but also sufficient for asymptotic stationarity.

[Figure 3.1 appears here: stationarity regions plotted in the $(\beta_{1,2}, \beta_{1,1})$ plane, panels (a) and (b).]

Figure 3.1: Stationarity regions for two-state Markov chains with GARCH of order (1, 1) corresponding to MSG-I (solid line) and MSG-II (dashed-dotted line). The regime transition probabilities and the reaction parameters are (a) $a_{1,1} = 0.6$, $a_{2,2} = 0.7$ and $\alpha_{1,1} = 0.4$, $\alpha_{1,2} = 0.5$; (b) $a_{1,1} = 0.2$, $a_{2,2} = 0.3$ and $\alpha_{1,1} = 0.8$, $\alpha_{1,2} = 0.2$.

A necessary and sufficient condition for asymptotic stationarity of an MSG-II model of order (1, 1) has been developed by Haas et al. [8]. Accordingly, the largest eigenvalue in modulus of an m²-by-m² block matrix $D$ is constrained to be less than one, where

$$D = \begin{bmatrix} D_{11} & D_{21} & \cdots & D_{m1} \\ D_{12} & D_{22} & \cdots & D_{m2} \\ \vdots & & & \vdots \\ D_{1m} & D_{2m} & \cdots & D_{mm} \end{bmatrix} \qquad (3.27)$$
is built from matrices $D_{ij}$ of size m-by-m which are obtained by

$$D_{ij} = a_{ij} \left( B^{(1)} + \alpha_1\, e_j' \right). \qquad (3.28)$$
The stationarity analysis in [8] for an MSG-II process employs a forward recursive calculation of the expected conditional variance, assuming some initial conditions. As a result, the probabilities of state transitions, $a_{ij}$, are used for evaluating the expectation of the one-step ahead conditional variance. Our analysis manipulates a backward recursion of the conditional variance expectation, and thus it uses the stationary probabilities of the Markov chain, along with the transition probabilities, to generate previous conditional state probabilities $p\left( s_{t-i} \mid s_t \right)$. Therefore, when we degenerate the MSG-II model to order p = q = 1 (which is the case analyzed in [8]), the block matrices $D$ (3.27) and $\Psi_{II}$ (3.23) are not identical; specifically, for that order of model we have $\Psi_{II} = \Omega^{(1)}$ and

$$\Omega^{(1)}_{ij} = \frac{\pi_i\, a_{ij}}{\pi_j\, a_{ji}}\, D_{ji}. \qquad (3.29)$$

Although our representative matrix $\Psi_{II}$ and that developed in [8] do not share the same elements, we show in Appendix 3.B that their eigenvalues are identical and therefore both conditions are equivalent for that order of MSG-II. A special case of any of the MSG models is the degenerate case of a single regime of order (p, q) (the models then reduce to a standard GARCH(p, q) model). In that case, the representative matrices are equal, $\Psi_I = \Psi_{II}$. Francq, Roussignol and Zakoïan [122] developed a stationarity condition for the natural case of a Markov-switching GARCH model, in which the conditional variance depends on the active regime path. For the special case of a single-regime model they obtain the transition matrix of a standard GARCH(p, q) model, which is equal to that derived, e.g., by substituting $\mathcal{K}^{(i)} =$

$\alpha_{i,1} + \beta_{i,1}$ and $I_1 = 1$ in (3.14). They showed that constraining the spectral radius of that matrix to be less than one is equivalent to Bollerslev's condition for the asymptotic wide-sense stationarity of a GARCH(p, q) model, $\sum_{i=1}^{r} \left( \alpha_{i,1} + \beta_{i,1} \right) < 1$ [1].

3.4 Conclusions

Conditions for asymptotic wide-sense stationarity of random processes with time-variant distributions are useful for ensuring the existence of a finite asymptotic volatility of the process. We developed a comprehensive approach for stationarity analysis of Markov-switching GARCH processes where finite state-space Markov chains control the switching between regimes, and GARCH models of order (p, q) are active in each regime. Necessary and sufficient conditions for the asymptotic stationarity are obtained by constraining the spectral radius of representative matrices, which are built from the model parameters. These matrices also enable derivation of compact expressions for the stationary variance of the processes.

3.A Proof of Theorem 3.2

In this appendix we prove Theorem 3.2, which gives a necessary and sufficient condition for the asymptotic wide-sense stationarity of the MSG-II model and also its stationary variance. Following (3.20) and (3.5), the expectation of the MSG-II conditional variance under a chain state $s$ follows:

$$E_{t-1}\left( \sigma^2_{t,s} \mid s_t \right) = \xi_s + \sum_{i=1}^{q} \alpha_{i,s}\, E_{t-1}\left( \varepsilon^2_{t-i} \mid s_t \right) + \sum_{j=1}^{p} \beta_{j,s}\, E_{t-1}\left( \sigma^2_{t-j,s} \mid s_t \right) \qquad (3.30)$$
where, using (3.7) and (3.8),

$$E_{t-1}\left( \varepsilon^2_{t-i} \mid s_t \right) = \sum_{s_{t-i}=1}^{m} p\left( s_{t-i} \mid s_t \right) E_{t-i-1}\left( \sigma^2_{t-i,s_{t-i}} \mid s_{t-i} \right), \qquad (3.31)$$
and

$$E_{t-1}\left( \sigma^2_{t-j,s} \mid s_t \right) = E_{S_{t-j}}\left[ E_{t-1}\left( \sigma^2_{t-j,s} \mid s_{t-j}, s_t \right) \right] = \sum_{s_{t-j}=1}^{m} p\left( s_{t-j} \mid s_t \right) E_{t-j-1}\left( \sigma^2_{t-j,s} \mid s_{t-j} \right). \qquad (3.32)$$

The main difference between an MSG-II model and an MSG-I model is that the conditional variance depends on previous conditional variances of the same regime, regardless of the past regime path. By contrast, for the MSG-I model, the conditional variance is a linear combination of past state-dependent conditional variances, where for each one the state is conditioned to be the active one. Consequently, the computation of the unconditional variance for an MSG-II model requires the terms $E_{t-j-1}\left( \sigma^2_{t-j,s} \mid s_{t-j} \right)$ for all $s = 1, ..., m$, while in the case of the MSG-I model, only $E_{t-j-1}\left( \sigma^2_{t-j,s_{t-j}} \mid s_{t-j} \right)$ is relevant to calculate the expectation of the unconditional variance. Accordingly, an m²-by-1 vector is necessary to represent the $E_{t-1}\left( \sigma^2_{t,s} \mid s_t \right)$ elements, and an rm²-by-rm² matrix is employed for the recursive formulation. By substituting (3.32) and (3.31) into (3.30) we have

$$E_{t-1}\left( \sigma^2_{t,s} \mid s_t \right) = \xi_s + \sum_{i=1}^{r} \sum_{s_{t-i}=1}^{m} p\left( s_{t-i} \mid s_t \right) \left[ \alpha_{i,s}\, E_{t-i-1}\left( \sigma^2_{t-i,s_{t-i}} \mid s_{t-i} \right) + \beta_{i,s}\, E_{t-i-1}\left( \sigma^2_{t-i,s} \mid s_{t-i} \right) \right]. \qquad (3.33)$$
Let $g_t(s, s_t) \triangleq E_{t-1}\left( \sigma^2_{t,s} \mid s_t \right)$, and let $g_t \triangleq \left[ g_t(1,1), g_t(2,1), ..., g_t(m,1), g_t(1,2), ..., g_t(m,m) \right]'$ be a vector of expected, state-dependent, conditional variances. Then, a recursive formulation of the conditional variance is given by

$$\tilde{g}_t = \tilde{\xi} + \Psi_{II}\, \tilde{g}_{t-1}, \qquad t \geq 0, \qquad (3.34)$$

where $\tilde{g}_t \triangleq \left[ g_t', g_{t-1}', ..., g_{t-r+1}' \right]'$. The completion of this proof follows the proof of Theorem 3.1.

3.B Equivalence with Haas condition

In this appendix we show that the eigenvalues of the matrices $D$ (3.27) and $\Psi_{II} = \Omega^{(1)}$ (3.23) are equal for the case of an m-state MSG-II model of order (1, 1). Let $\tilde{D}$ denote an m²-by-m² matrix that is given by

$$\tilde{D} \triangleq \begin{bmatrix} B^{(1)} + \alpha_1 e_1' & 0_m & \cdots & 0_m \\ 0_m & B^{(1)} + \alpha_1 e_2' & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0_m \\ 0_m & \cdots & 0_m & B^{(1)} + \alpha_1 e_m' \end{bmatrix}, \qquad (3.35)$$
and let $\otimes$ denote the Kronecker product. Then $D = \tilde{D}\, (A' \otimes I_m)$. Let $1_m$ denote an m-by-1 vector of ones and let $P \triangleq \operatorname{diag}(\pi \otimes 1_m)$. By substituting (3.28) into (3.29), we have
$$\Omega^{(1)}_{ij} = \frac{\pi_i}{\pi_j}\, a_{ij} \left( B^{(1)} + \alpha_1 e_i' \right) \qquad (3.36)$$
and
$$\Omega^{(1)} = P^{-1}\, (A' \otimes I_m)\, \tilde{D}\, P. \qquad (3.37)$$
Therefore, $\Omega^{(1)}$ and $(A' \otimes I_m)\, \tilde{D}$ are similar matrices, and the spectrum of $\Omega^{(1)}$, $\operatorname{eig}\left\{ \Omega^{(1)} \right\}$, satisfies
$$\operatorname{eig}\left\{ \Omega^{(1)} \right\} = \operatorname{eig}\left\{ (A' \otimes I_m)\, \tilde{D} \right\} = \operatorname{eig}\left\{ \tilde{D}\, (A' \otimes I_m) \right\} = \operatorname{eig}\left\{ D \right\}. \qquad (3.38)$$
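The similarity argument of (3.35)–(3.38) is easy to verify numerically. The sketch below builds $\tilde{D}$, $D = \tilde{D}(A' \otimes I_m)$ and $\Omega^{(1)} = P^{-1}(A' \otimes I_m)\tilde{D}P$ for arbitrary illustrative MSG-II(1, 1) parameters and compares the spectra of $D$ and $\Omega^{(1)}$.

```python
import numpy as np

def spectra_match(alpha1, beta1, A, pi):
    """Numerical check of (3.38): eig{Omega^(1)} = eig{D} for an m-state
    MSG-II model of order (1, 1).

    alpha1, beta1, pi : length-m arrays (pi = stationary probabilities of A)
    A                 : m x m transition matrix, A[i, j] = p(S_t=j | S_{t-1}=i)
    """
    m = len(alpha1)
    B1 = np.diag(beta1)
    # D_tilde = blockdiag(B^(1) + alpha_1 e_j'), j = 1..m   (3.35)
    Dtil = np.zeros((m * m, m * m))
    for j in range(m):
        ej = np.zeros(m); ej[j] = 1.0
        Dtil[j*m:(j+1)*m, j*m:(j+1)*m] = B1 + np.outer(alpha1, ej)
    AkI = np.kron(A.T, np.eye(m))                 # A' (x) I_m
    D = Dtil @ AkI                                # representative matrix of Haas et al.
    P = np.diag(np.kron(pi, np.ones(m)))          # P = diag(pi (x) 1_m)
    Omega1 = np.linalg.inv(P) @ AkI @ Dtil @ P    # (3.37)
    eD = np.sort_complex(np.linalg.eigvals(D))
    eO = np.sort_complex(np.linalg.eigvals(Omega1))
    return np.allclose(eD, eO)

A = np.array([[0.9, 0.1], [0.3, 0.7]])
pi = np.array([0.75, 0.25])                       # stationary distribution of A
print(spectra_match(np.array([0.1, 0.4]), np.array([0.6, 0.3]), A, pi))  # True
```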

Chapter 4

Markov-Switching GARCH Process in the Short-Time Fourier Transform Domain¹

In this chapter, we introduce a Markov-switching generalized autoregressive conditional heteroscedasticity (GARCH) model for nonstationary processes with time-varying volatility structure in the short-time Fourier transform (STFT) domain. The expansion coefficients in the STFT domain are modeled as a multivariate complex GARCH process with Markov-switching regimes. The GARCH formulation parameterizes the correlation between sequential conditional variances, while the Markov chain allows the process to switch between regimes of different GARCH formulations. We obtain a necessary and sufficient condition for the asymptotic wide-sense stationarity of the model, and develop a recursive algorithm for signal restoration in a noisy environment. The conditional variance is estimated by iterating propagation and update steps with regime conditional probabilities, while the model parameters are evaluated a priori from a training data set. Experimental results demonstrate the performance of the proposed algorithm. In Appendix 4.A, we introduce an application of the Markov-switching GARCH model in the STFT domain to speech enhancement. A GARCH model is utilized with Markov-switching regimes, where the parameters are assumed to be frequency variant. The model parameters are evaluated in each frequency subband and a special state (regime) is

1This chapter is based on [127, 128].


defined for the case where speech coefficients are absent or below a threshold level. The problem of speech enhancement under speech presence uncertainty is addressed, and it is shown that a soft voice activity detector may be inherently incorporated within the algorithm. Experimental results demonstrate the potential of our proposed model to improve noise reduction while retaining weak components of the speech signal.

4.1 Introduction

The generalized autoregressive conditional heteroscedasticity (GARCH) model is widely used in the field of econometrics for volatility forecast derivation of economic rates. This model, first introduced by Bollerslev [1] as a generalization of the ARCH model [2], explicitly parameterizes the time-varying volatility by using both recent conditional variances and recent squared innovations. GARCH models preserve the persistence of the process volatility in the sense that small variations tend to follow small variations and large variations tend to follow large variations. Incorporating GARCH models with hidden Markov chains, where each state (regime) of the chain implies a different GARCH behavior, extends the dynamic formulation of the model and enables a better fit for a process with a more complex time-varying volatility structure [7–9]. However, a major drawback of such models is that estimating the volatility with switching regimes requires knowledge of the entire history of the process, including the regime path. Consequently, Cai [12] and Hamilton and Susmel [13] proposed a Markov-switching ARCH model, which avoids problems of path dependency in a noiseless environment. The conditional variance in ARCH models depends on previous observations only, so the Markov chain does not have to be known for constructing the conditional variance for a given regime. Gray [6] introduced a variant of Markov-switching GARCH model relying on the assumption that the conditional variance given the current regime is dependent on the expectation of the previous conditional variances rather than their values. Accordingly, the conditional variance depends on some finite, state-dependent, expected conditional variances via their conditional state probabilities. Klaassen [7] proposed modifying Gray's model by manipulating the current regime and all available observations while evaluating the expectation of previous conditional variances. A different method for reducing the dependency of the conditional variance on past regimes has recently been proposed by Haas, Mittnik and Paolella [8]. Accordingly, a Markov chain governs the ARCH parameters while the autoregressive behavior of the conditional variance is subject to the assumption that past conditional variances are in the same regime as that of the current conditional variance. Gray, Klaassen and Haas et al. developed their variants of Markov-switching GARCH models for improved volatility forecasts of financial time series under possible existence of shocks. They assumed that a process is observed in a noiseless environment so that its past observations provide a complete specification of its current conditional variance, for any given regime.

Recently, GARCH models have been employed for modeling speech signals in the time-frequency domain [23–25]. Speech signals in the short-time Fourier transform (STFT) domain demonstrate both "variability clustering" and heavy-tail behavior similarly to financial time series [25]. Motivated by these characteristics, it was proposed to model the conditional variance of speech signals in the STFT domain by a complex, K-dimensional GARCH model, with statistically independent elements (given past information) sharing the same GARCH specification. This time-frequency GARCH (TF-GARCH) model has been shown useful for speech enhancement applications, but it relies on the assumption that the model parameters are time-invariant. In [26], a GARCH model has been utilized in the time domain for speech recognition applications. The model parameters, characterizing the speech phonemes, are assumed speaker independent and time-varying. It was shown that by estimating the GARCH specifications for each speech segment and using the parameters as part of the signal characteristics, speech recognition performance can be improved.

In this chapter, we introduce a Markov-switching time-frequency GARCH (MSTF-GARCH) model which exploits the advantages of both the conditional heteroscedasticity structure of GARCH models and the time-varying characteristics of hidden Markov chains. Modeling probability density functions of speech signals by utilizing hidden Markov models has been found useful in speech recognition applications [63–65], and modeling the speech spectral coefficients as hidden Markov processes with a probability density prototype in each frame was applied to the problem of speech enhancement [35, 62]. Here we model the expansion coefficients of nonstationary random signals in the time-frequency domain as multivariate complex GARCH processes with Markov-switching regimes, and obtain a necessary and sufficient condition for the asymptotic wide-sense stationarity of the model. A corresponding recursive algorithm is developed for signal restoration in a noisy environment. The conditional variance is estimated by iterating propagation and update steps with regime conditional probabilities. The model parameters are estimated from a training data set prior to the signal restoration using a maximum-likelihood (ML) approach, and the number of states is assumed to be known. We show that the derivation in [117] of bounds on the mean-square error (MSE) of a composite source signal estimation is applicable for obtaining an upper bound on the MSE of a single-step MSTF-GARCH estimation. Experimental results demonstrate the improved performance of the proposed algorithm for restoration of an MSTF-GARCH process compared to using an estimator which assumes a stationary process, and compared to using an estimator which assumes a smaller number of regimes than the process actually has. Furthermore, it is demonstrated that the squared absolute values of speech coefficients in the STFT domain are better evaluated by using the MSTF-GARCH model than by using the decision-directed approach. This chapter is organized as follows. In Section 4.2, we introduce the Markov-switching time-frequency GARCH model and obtain a necessary and sufficient condition for its asymptotic wide-sense stationarity. In Section 4.3, we address the problem of signal estimation from noisy observations. In Section 4.4, we derive an upper bound on a single estimation step mean-square error. In Section 4.5, we address the problem of model estimation. Finally, in Section 4.6 we provide some experimental results which demonstrate restoration of an MSTF-GARCH process from noisy observations, and estimation of conditional variances and squared absolute values in the STFT domain from noisy speech signals.

4.2 Markov-switching time-frequency GARCH model

In this section we briefly review the TF-GARCH model [25], and introduce a new time-frequency GARCH model with Markov-switching regimes, which allows further flexibility in the formulation of the time variation of the conditional variance.

4.2.1 Time-frequency GARCH model

Let $\left\{ X_{tk} \mid t = 0, ..., T-1, \; k = 0, ..., K-1 \right\}$ be the coefficients of a time-frequency transformation of a discrete-time signal $x$ (e.g., STFT coefficients), where $t$ is the time frame index and $k$ is the frequency-bin index. Let $X_t \triangleq [X_{t,0}, ..., X_{t,K-1}]'$ be the vector of spectral coefficients at time frame $t$, let $\mathcal{X}^{\tau} = \mathcal{X}_0^{\tau} \triangleq \left\{ X_t \mid t = 0, ..., \tau \right\}$ represent the set of spectral coefficients up to time $\tau$, and let $\lambda_{tk|\tau} \triangleq E\left\{ |X_{tk}|^2 \mid \mathcal{X}^{\tau} \right\}$ denote the conditional variance of the spectral coefficient at time-frequency bin $(t, k)$, given the clean spectral coefficients up to time $\tau$. Let $\{V_t\} \in \mathbb{C}^K$ be a complex Gaussian random process with $V_t \sim \mathcal{CN}(0, I_K)$, where $I_K$ is a K-by-K identity matrix. A K-dimensional time-frequency GARCH model of order (p, q) is defined as follows [25]:

$$X_{tk} = \sqrt{\lambda_{tk|t-1}}\, V_{tk}, \qquad k = 0, ..., K-1 \qquad (4.1)$$
$$\lambda_{t|t-1} = \zeta \cdot 1 + \sum_{i=1}^{q} \alpha_i\, X_{t-i} \odot X^*_{t-i} + \sum_{j=1}^{p} \beta_j\, \lambda_{t-j|t-j-1}, \qquad (4.2)$$
where $1$ denotes a vector of ones, $\odot$ denotes a term-by-term multiplication and $^*$ denotes complex conjugation. The conditional variance vector, $\lambda_{t|t-1} = E\left\{ X_t \odot X_t^* \mid \mathcal{X}^{t-1} \right\}$, referred to as the one-frame-ahead conditional variance [25], is a linear function of the coefficients' past squared values and conditional variances, where

$$\zeta > 0, \quad \alpha_i \geq 0, \; i = 1, ..., q, \quad \beta_j \geq 0, \; j = 1, ..., p, \qquad (4.3)$$
are sufficient constraints for the positivity of the conditional variance [1]. The time-frequency GARCH model has been introduced in [23] for modeling speech signals in the STFT domain, but the parameters of the GARCH model are assumed time invariant. Extending this model such that the model parameters may vary with time introduces additional flexibility in the model formulation, which may result in better characterization of speech signals and improved restoration in noisy environments.

4.2.2 MSTF-GARCH formulation

Let $S_t$ denote the (unobserved) state at time $t$ and let $s_t$ be a realization of $S_t$, assuming $\{S_t\}$ is a first-order Markov chain. Let $\mathcal{I}^t \triangleq \left\{ \mathcal{X}^t, \mathcal{S}^t \right\}$ denote all available information up to time $t$, which contains the clean signal coefficients and the regime path up to time $t$, $\mathcal{S}^t \triangleq \{s_0, ..., s_t\}$. Denote by $\lambda_{tk|t-1,s_t} \triangleq E\left\{ |X_{tk}|^2 \mid \mathcal{I}^{t-1}, s_t \right\}$ the one-frame-ahead conditional variance of the spectral coefficient $X_{tk}$ given the information up to time $t-1$ and the chain state $s_t$. We assume that the spectral coefficients are generated by an m-state Markov-switching time-frequency GARCH process of order (p, q), denoted by $X_{tk} \sim$ MSTF-GARCH(p, q), which follows:

$$X_{tk} = \sqrt{\lambda_{tk|t-1,s_t}}\, V_{tk}, \qquad k = 0, ..., K-1, \qquad (4.4)$$
and the one-frame-ahead conditional variance evolves as follows:

$$\lambda_{t|t-1,s_t} = \zeta_{s_t}\, 1 + \sum_{i=1}^{q} \alpha_{i,s_t}\, X_{t-i} \odot X^*_{t-i} + \sum_{j=1}^{p} \beta_{j,s_t}\, \lambda_{t-j|t-j-1,s_{t-j}}, \qquad (4.5)$$
where

$$\zeta_s > 0, \quad \alpha_{i,s} \geq 0, \quad \beta_{j,s} \geq 0, \quad i = 1, ..., q, \;\; j = 1, ..., p, \;\; s = 1, ..., m \qquad (4.6)$$
are sufficient constraints for the positivity of the one-frame-ahead conditional variance. It follows from (4.4) and (4.5) that the conditional density of the coefficients depends on past values (through previous conditional variances) and also on the regime path up to the current time. As considered in previous works on TF-GARCH, we assume that the model parameters are frequency-invariant. This restriction can be easily relaxed for the case of frequency (or sub-band) dependent parameters, i.e., $\zeta_{k,s}$, $\alpha_{i,k,s}$ and $\beta_{i,k,s}$, but the complexity of the model estimation then grows rapidly (see Section 4.5). GARCH models provide a rich class of possible parametrizations of conditional heteroscedasticity (i.e., time-varying volatility), and the hidden Markov chain allows these GARCH formulations to switch along time. Volatility persistence naturally arises in a single-regime GARCH model. However, the existence of a Markov chain with different GARCH parameters allows the process to switch between regimes of different volatility formulations and different levels of volatility.
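To illustrate the model definition (4.4)–(4.6), the following sketch simulates K frequency bins of a two-state MSTF-GARCH(1, 1) process; the Markov chain, the complex Gaussian driving noise, and all parameter values are illustrative assumptions, not from the thesis.

```python
import numpy as np

def simulate_mstf_garch(T, K, zeta, alpha, beta, A, s0=0, seed=0):
    """Simulate an m-state MSTF-GARCH(1, 1) process following (4.4)-(4.5).

    zeta, alpha, beta : length-m arrays of state-dependent parameters
    A                 : m x m transition matrix (rows sum to one)
    Returns X (T x K complex coefficients) and the regime path s (length T).
    """
    rng = np.random.default_rng(seed)
    m = len(zeta)
    X = np.zeros((T, K), dtype=complex)
    lam = np.full(K, zeta[s0])        # one-frame-ahead conditional variances
    s = np.empty(T, dtype=int)
    state = s0
    for t in range(T):
        s[t] = state
        # complex Gaussian driving noise V_t ~ CN(0, I_K)
        V = (rng.standard_normal(K) + 1j * rng.standard_normal(K)) / np.sqrt(2)
        X[t] = np.sqrt(lam) * V                                        # (4.4)
        # draw the next regime and propagate the conditional variance  (4.5)
        state = rng.choice(m, p=A[state])
        lam = zeta[state] + alpha[state] * np.abs(X[t]) ** 2 + beta[state] * lam
    return X, s

X, s = simulate_mstf_garch(T=200, K=8,
                           zeta=np.array([0.05, 0.5]),
                           alpha=np.array([0.1, 0.3]),
                           beta=np.array([0.8, 0.5]),
                           A=np.array([[0.95, 0.05], [0.10, 0.90]]))
```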

4.2.3 Stationarity of an MSTF-GARCH process

The conditional variance of a GARCH process, and in particular of a Markov-switching GARCH process, changes recursively over time. Consequently, asymptotic wide-sense stationarity is required to ensure a finite second-order moment [7,8,121]. Necessary and sufficient conditions for the asymptotic stationarity of three variants of GARCH models with Markov-switching regimes have been derived in [121]. Those models generalize the models of Gray [6], Klaassen [7] and Haas et al. [8], but they all differ from our MSTF-GARCH model, which is a multivariate, complex-valued process that entails the regime path for the construction of the conditional variance from past observations. A necessary and sufficient condition for asymptotic wide-sense stationarity of an MSTF-GARCH process has been derived in [128]. For the completeness of this chapter we briefly summarize these results:

Assuming a stationary Markov chain with stationary probabilities πs = p (St = s), the unconditional variance of the process can be calculated using (4.4) and (4.5):

$$E\left\{ X_t \odot X_t^* \right\} = \sum_{s_t} \pi_{s_t}\, E\left\{ X_t \odot X_t^* \mid s_t \right\} = \sum_{s_t} \pi_{s_t}\, E\left( \lambda_{t|t-1,s_t} \right), \qquad (4.7)$$
where

$$E\left( \lambda_{t|t-1,s_t} \right) = \zeta_{s_t}\, 1 + \sum_{i=1}^{q} \alpha_{i,s_t}\, E\left( X_{t-i} \odot X^*_{t-i} \mid s_t \right) + \sum_{j=1}^{p} \beta_{j,s_t}\, E\left( \lambda_{t-j|t-j-1,S_{t-j}} \mid s_t \right), \qquad (4.8)$$
and
$$E\left( \lambda_{t-i|t-i-1,S_{t-i}} \mid s_t \right) = \sum_{s_{t-i}} p\left( s_{t-i} \mid s_t \right) E\left( \lambda_{t-i|t-i-1,s_{t-i}} \right). \qquad (4.9)$$

Note that $E\left( \lambda_{t|t-1,s_t} \right)$ denotes the expected value of the conditional variance under the regime $S_t = s_t$, but $E\left( \lambda_{t|t-1,S_t} \mid \cdot \right)$ denotes a conditional expectation of the conditional variance at time $t$ where the active regime at that time is unknown. Since no prior information is given, we have

$$E\{\mathbf{X}_{t-i}\odot\mathbf{X}_{t-i}^* \mid s_t\} = \sum_{s_{t-i}} p(s_{t-i} \mid s_t)\, E\{\boldsymbol{\lambda}_{t-i|t-i-1,s_{t-i}}\}\,, \qquad (4.10)$$
and consequently we obtain [128]

$$E\{\boldsymbol{\lambda}_{t|t-1,s_t}\} = \zeta_{s_t}\mathbf{1} + \sum_{i=1}^{r} \sum_{s_{t-i}} (\alpha_{i,s_t} + \beta_{i,s_t})\, \{A^i\}_{s_{t-i},s_t}\, \frac{\pi_{s_{t-i}}}{\pi_{s_t}}\, E\{\boldsymbol{\lambda}_{t-i|t-i-1,s_{t-i}}\}\,, \qquad (4.11)$$
where $r \triangleq \max\{p,q\}$, $\alpha_{i,s_t} \triangleq 0$ for all $i > q$, $\beta_{i,s_t} \triangleq 0$ for all $i > p$, and $A$ is the transition probability matrix, i.e., $\{A\}_{ij} = a_{ij} = p(S_t = j \mid S_{t-1} = i)$. Define $m$-by-$m$ matrices $\mathcal{K}_i$, $i = 1, \ldots, r$, with elements

$$\{\mathcal{K}_i\}_{s,\tilde{s}} \triangleq (\alpha_{i,s} + \beta_{i,s})\, \{A^i\}_{\tilde{s},s}\, \frac{\pi_{\tilde{s}}}{\pi_s}\,, \qquad s, \tilde{s} = 1, \ldots, m\,, \qquad (4.12)$$
and an $mr$-by-$mr$ matrix as follows

$$\Psi \triangleq \begin{bmatrix} \mathcal{K}_1 & \mathcal{K}_2 & \cdots & \cdots & \mathcal{K}_r \\ I_m & 0 & \cdots & \cdots & 0 \\ 0 & I_m & \ddots & & \vdots \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & I_m & 0 \end{bmatrix}\,. \qquad (4.13)$$
Let $\rho(\cdot)$ denote the spectral radius of a matrix, i.e., its largest eigenvalue in modulus, and let $\Phi$ be an $m$-by-$m$ square matrix built from the $mr$-by-$mr$ matrix $(I - \Psi)^{-1}$ such that $\{\Phi\}_{ij} = \{(I - \Psi)^{-1}\}_{ij}$, $i, j = 1, \ldots, m$. Then a necessary and sufficient condition for asymptotic wide-sense stationarity of an MSTF-GARCH process is $\rho(\Psi) < 1$, and the asymptotic covariance matrix of the process is then a diagonal matrix (see [128] for a detailed proof):

$$\lim_{t\to\infty} E\{\mathbf{X}_t \mathbf{X}_t^H\} = (\boldsymbol{\pi}\, \Phi\, \boldsymbol{\zeta})\, I_K\,, \qquad (4.14)$$

where $\boldsymbol{\zeta} \triangleq [\zeta_1, \ldots, \zeta_m]'$, $\boldsymbol{\pi}$ is the row vector of the stationary probabilities of the Markov chain, and $(\cdot)^H$ denotes the Hermitian transpose operation.

This stationarity condition is a necessary and sufficient condition for the existence of a finite second-order moment of the process. It implies that in some regimes (but not in all of them) the conditional variance may grow over time (i.e., $\sum_i \alpha_{i,s} + \sum_j \beta_{j,s} > 1$ for some states $s$), and yet the unconditional variance can remain finite [121, 128].
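For illustration, the following minimal Python sketch builds the matrix of (4.12)-(4.13) for an MSTF-GARCH(1,1) model (where $r = 1$ and $\Psi = \mathcal{K}_1$), checks the stationarity condition $\rho(\Psi) < 1$, and evaluates the asymptotic variance $\boldsymbol{\pi}\Phi\boldsymbol{\zeta}$ of (4.14). All numerical parameter values are illustrative and not taken from this thesis.

```python
# Minimal numerical check of the stationarity condition rho(Psi) < 1 for an
# MSTF-GARCH(1,1) model; parameter values below are illustrative only.
import numpy as np

def stationary_probs(A):
    """Stationary distribution pi of a transition matrix A (rows sum to 1)."""
    evals, evecs = np.linalg.eig(A.T)
    pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    return pi / pi.sum()

def psi_matrix(A, alpha, beta, pi):
    """Build the m-by-m matrix with elements (4.12) for i = 1 (order (1,1))."""
    m = len(alpha)
    K = np.empty((m, m))
    for s in range(m):
        for s_tilde in range(m):
            K[s, s_tilde] = (alpha[s] + beta[s]) * A[s_tilde, s] * pi[s_tilde] / pi[s]
    return K

A = np.array([[0.95, 0.05], [0.5, 0.5]])   # assumed transition matrix
alpha = np.array([0.2, 0.8])               # assumed GARCH parameters per regime
beta = np.array([0.5, 0.3])                # note alpha + beta > 1 in regime 2
zeta = np.array([0.01, 0.05])
pi = stationary_probs(A)

Psi = psi_matrix(A, alpha, beta, pi)
rho = max(abs(np.linalg.eigvals(Psi)))
print("spectral radius:", rho, "-> wide-sense stationary" if rho < 1 else "-> not stationary")

# Asymptotic variance per bin, eq. (4.14): pi * Phi * zeta, with Phi = (I - Psi)^{-1} for r = 1.
if rho < 1:
    Phi = np.linalg.inv(np.eye(len(alpha)) - Psi)
    print("asymptotic variance per bin:", pi @ Phi @ zeta)
```

In this example one regime has $\alpha_s + \beta_s > 1$, yet the spectral radius remains below one, illustrating the remark above that the unconditional variance can stay finite even when a single regime is explosive.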

4.3 Restoration of noisy MSTF-GARCH process

In this section we develop a recursive algorithm for the restoration of MSTF-GARCH processes observed in additive stationary noise.

A hidden Markov process is a discrete-time finite-state Markov chain observed through a memoryless invariant channel, where the chain state is assumed to be hidden but the transition probabilities between sequential states are assumed to be known. As a consequence of the memoryless channel, the conditional density of the observed signal at time $t$ (say $\mathbf{X}_t$) given the chain state $s_t$ depends only on the given state and not on previous observations, i.e., the conditional density of a hidden Markov process (HMP) satisfies $p(\mathbf{X}_t \mid s_t, \mathbf{X}_{t-1}, \mathbf{X}_{t-2}, \ldots) = p(\mathbf{X}_t \mid s_t)$. Combining GARCH models with hidden Markov chains, where each state is assumed to have a different GARCH formulation, introduces further complexity when trying to forecast or estimate the process, since the conditional variance of the process evolves as a function of previous conditional variances, as implied by (4.5). Consequently, the conditional density depends on the entire history of the process, i.e., past values and active states. To avoid this problem, several variants of GARCH processes with Markov-switching regimes have been proposed, e.g., [6-8]. These models formulate the conditional variance in each regime as dependent on past signal observations only. However, these variants of Markov-switching GARCH models have been developed for the purpose of forecasting the volatility of financial time series, assuming that the process is observed in a noiseless environment and that all past clean signal values are given.

We use an MSTF-GARCH(1,1) model, as defined in (4.4) and (4.5), to model complex, nonstationary random signals, and we develop a recursive signal estimation algorithm for restoring the clean signal and its second-order moment from noisy observations. The order $(1,1)$ is chosen for computational simplicity, since higher $(p,q)$ orders imply strong dependency between successive conditional variances. Therefore, $p = q = 1$ is generally assumed in applications of Markov-switching GARCH modeling, e.g., [6-8, 12, 13]. Let $\{X_{tk}\}$ and $\{D_{tk}\}$ denote the spectral coefficients of the signal and of an uncorrelated additive noise signal, respectively, and let $Y_{tk} = X_{tk} + D_{tk}$ represent the observed signal.

Let $\mathbf{X}_t$ be a $K$-dimensional complex-valued stochastic process which evolves as an $m$-state first-order MSTF-GARCH, i.e., $\mathbf{X}_t \sim$ MSTF-GARCH$(1,1)$, and let $\mathbf{D}_t$ represent a $K$-dimensional complex Gaussian random noise, $\mathbf{D}_t \sim \mathcal{CN}(0, \mathbf{R}^d)$, with known diagonal covariance matrix $\mathbf{R}^d = \mathrm{diag}\{\boldsymbol{\sigma}^2\}$. We assume that all MSTF-GARCH model parameters are known, i.e., the initial regime probabilities $\pi^{(0)}$, the transition probability matrix $A$, and the GARCH$(1,1)$ parameters in each of the $m$ regimes. Let

$$\phi \triangleq \left\{\pi^{(0)}, A, \zeta_1, \ldots, \zeta_m, \alpha_1, \ldots, \alpha_m, \beta_1, \ldots, \beta_m\right\}$$
be the set of parameters which specifies the model, where for a first-order process we denote $\alpha_s \triangleq \alpha_{1,s}$ and $\beta_s \triangleq \beta_{1,s}$. In practice, the model parameters $\phi$ are estimated from a set of clean training signals, as is generally done with hidden Markov models [35, 62, 64, 129], while the covariance matrix of the noise process can be estimated using the minimum statistics [72] or the minima-controlled recursive averaging algorithms [38, 71]. The problem of model estimation is addressed in Section 4.5.

The spectral restoration problem is generally formulated as deriving an estimator $\hat{X}_{tk}$ for the spectral coefficients, such that the expected value of a certain distortion measure is minimized. We develop a recursive estimator for the signal's spectral coefficients and for their absolute squared values in the sense of minimum mean-square error (MMSE), and we then extend this framework to signal restoration in the sense of MMSE of the log-spectral amplitude (LSA), which is often used in speech enhancement applications; see, for instance, [34, 38].

Let $\mathcal{Y}^\tau = \mathcal{Y}_0^\tau \triangleq \{\mathbf{Y}_t \mid t = 0, \ldots, \tau\}$ be the set of observations up to time $\tau$. The causal MMSE estimator of the coefficients $\mathbf{X}_t$ given the noisy observations up to time $t$ is obtained as follows:

$$E\{\mathbf{X}_t \mid \mathcal{Y}^t\} = \sum_{s_t} p(s_t \mid \mathcal{Y}^t)\, E\{\mathbf{X}_t \mid s_t, \mathcal{Y}^t\}\,. \qquad (4.15)$$
Denote the state-dependent, one-frame-ahead conditional covariance matrix of the clean signal as

$$\mathbf{R}^x_{s_t} \triangleq E\{\mathbf{X}_t \mathbf{X}_t^H \mid s_t, \mathcal{I}^{t-1}\}\,. \qquad (4.16)$$
Following the model formulation, this covariance matrix is a function of $\mathbf{R}^x_{s_{t-1}}$ and $\mathbf{X}_{t-1}$ only. However, the clean signal values are usually unavailable, and so is the sequence of active states, so the evaluation of (4.16) requires all the available observations. To overcome this problem, we assume that, given the current regime, past estimated conditional covariances are sufficient statistics for the conditional variance estimation [24]. Accordingly, given the set of estimated one-frame-ahead conditional variances $\hat{\Lambda}_t \triangleq \{\hat{\boldsymbol{\lambda}}_{t|t-1,S_t} \mid S_t = 1, \ldots, m\}$, which is computed from the observations up to time $t-1$, we may use the following signal estimator:
$$\hat{\mathbf{X}}_t = \sum_{s_t} p\left(s_t \mid \hat{\Lambda}_t, \mathbf{Y}_t\right)\, E\left\{\mathbf{X}_t \mid s_t, \hat{\mathbf{R}}^x_{s_t}, \mathbf{Y}_t\right\}\,, \qquad (4.17)$$
where under a Gaussian model

$$E\left\{\mathbf{X}_t \mid s_t, \hat{\mathbf{R}}^x_{s_t}, \mathbf{Y}_t\right\} = \hat{\mathbf{R}}^x_{s_t}\left(\hat{\mathbf{R}}^x_{s_t} + \mathbf{R}^d\right)^{-1}\mathbf{Y}_t\,. \qquad (4.18)$$
Note that $\hat{\mathbf{R}}^x_{s_t}$ is a $K$-by-$K$ diagonal matrix (since the $\{V_{tk}\}$ are statistically independent) with the estimated state-dependent conditional variances $\hat{\lambda}_{tk|t-1,s_t}$ on its diagonal. This state-dependent conditional variance can be recursively estimated in the MMSE sense by calculating its conditional expectation under $s_t$, given the observation $\mathbf{Y}_{t-1}$ and the previous set of estimated conditional variances:

$$\begin{aligned} \hat{\boldsymbol{\lambda}}_{t|t-1,s_t} &\triangleq E\left\{\boldsymbol{\lambda}_{t|t-1,S_t} \mid s_t, \hat{\Lambda}_{t-1}, \mathbf{Y}_{t-1}; \phi\right\} \\ &= \zeta_{s_t}\mathbf{1} + \alpha_{s_t} E\left\{\mathbf{X}_{t-1}\odot\mathbf{X}_{t-1}^* \mid s_t, \hat{\Lambda}_{t-1}, \mathbf{Y}_{t-1}; \phi\right\} + \beta_{s_t} E\left\{\boldsymbol{\lambda}_{t-1|t-2,S_{t-1}} \mid s_t, \hat{\Lambda}_{t-1}, \mathbf{Y}_{t-1}; \phi\right\}\,. \qquad (4.19) \end{aligned}$$
The conditional second-order moment in (4.19) can be obtained by

$$\begin{aligned} \hat{\boldsymbol{\lambda}}_{t-1|t-1,s_t} &\triangleq E\left\{\mathbf{X}_{t-1}\odot\mathbf{X}_{t-1}^* \mid s_t, \hat{\Lambda}_{t-1}, \mathbf{Y}_{t-1}; \phi\right\} \\ &= \sum_{s_{t-1}} p\left(s_{t-1} \mid s_t, \hat{\Lambda}_{t-1}, \mathbf{Y}_{t-1}; \phi\right) E\left\{\mathbf{X}_{t-1}\odot\mathbf{X}_{t-1}^* \mid s_{t-1}, s_t, \hat{\Lambda}_{t-1}, \mathbf{Y}_{t-1}; \phi\right\} \\ &= \sum_{s_{t-1}} p\left(s_{t-1} \mid s_t, \hat{\Lambda}_{t-1}, \mathbf{Y}_{t-1}; \phi\right) E\left\{\mathbf{X}_{t-1}\odot\mathbf{X}_{t-1}^* \mid s_{t-1}, \hat{\boldsymbol{\lambda}}_{t-1|t-2,s_{t-1}}, \mathbf{Y}_{t-1}; \phi\right\} \\ &\triangleq \sum_{s_{t-1}} p\left(s_{t-1} \mid s_t, \hat{\Lambda}_{t-1}, \mathbf{Y}_{t-1}; \phi\right) \hat{\boldsymbol{\lambda}}_{t-1|t-1,s_{t-1}}\,. \qquad (4.20) \end{aligned}$$
The expected one-frame-ahead conditional variance in (4.19), given the one-frame-ahead regime, can be obtained by:

$$\begin{aligned} \hat{\boldsymbol{\lambda}}_{t-1|t-2,s_t} &\triangleq E\left\{\boldsymbol{\lambda}_{t-1|t-2,S_{t-1}} \mid s_t, \hat{\Lambda}_{t-1}, \mathbf{Y}_{t-1}; \phi\right\} \\ &= \sum_{s_{t-1}} p\left(s_{t-1} \mid s_t, \hat{\Lambda}_{t-1}, \mathbf{Y}_{t-1}; \phi\right) E\left\{\boldsymbol{\lambda}_{t-1|t-2,S_{t-1}} \mid s_{t-1}, s_t, \hat{\Lambda}_{t-1}; \phi\right\} \\ &= \sum_{s_{t-1}} p\left(s_{t-1} \mid s_t, \hat{\Lambda}_{t-1}, \mathbf{Y}_{t-1}; \phi\right) \hat{\boldsymbol{\lambda}}_{t-1|t-2,s_{t-1}}\,. \qquad (4.21) \end{aligned}$$
The third lines of (4.20) and (4.21) rely on the fact that, given all observations up to time $t-1$ and given the state $s_{t-1}$, the second-order moment of the process at that time, and also its conditional variance, are independent of any future state. Moreover, notice that $\hat{\boldsymbol{\lambda}}_{t|t,s_t}$ and $\hat{\boldsymbol{\lambda}}_{t|t,s_{t+1}}$ in (4.20) represent the expected second-order moment of the process based on information up to time $t$, given the chain state at the same time and given the next state, respectively. Similarly, $\hat{\boldsymbol{\lambda}}_{t|t-1,s_t}$ and $\hat{\boldsymbol{\lambda}}_{t|t-1,s_{t+1}}$ in (4.21) represent the expectation of the one-frame-ahead conditional variance at time $t$ given the chain state $s_t$ and given the chain state at the next time step, respectively. The MMSE estimate of the process' second-order moment $\hat{\boldsymbol{\lambda}}_{t|t,s_t}$ in (4.20), given the estimated one-frame-ahead conditional variance of the same regime $\hat{\boldsymbol{\lambda}}_{t|t-1,s_t}$ in (4.21), can be obtained by

$$\begin{aligned} \hat{\boldsymbol{\lambda}}_{t|t,s_t} &= E\left\{\mathbf{X}_t \odot \mathbf{X}_t^* \mid s_t, \hat{\boldsymbol{\lambda}}_{t|t-1,s_t}, \mathbf{Y}_t\right\} \\ &= \hat{\mathbf{R}}^x_{s_t}\left(\hat{\mathbf{R}}^x_{s_t} + \mathbf{R}^d\right)^{-1}\boldsymbol{\sigma}^2 + \left[\hat{\mathbf{R}}^x_{s_t}\left(\hat{\mathbf{R}}^x_{s_t} + \mathbf{R}^d\right)^{-1}\right]^2 \left(\mathbf{Y}_t \odot \mathbf{Y}_t^*\right)\,, \quad s_t = 1, \ldots, m\,, \qquad (4.22) \end{aligned}$$
similarly to the method in [24] applied to the case of a single-regime spectral GARCH. Following the notation in [24], we call (4.22) the update step, as it updates the estimate of the signal's second-order moment at time $t$ from its estimated one-frame-ahead conditional variance, using the new observation $\mathbf{Y}_t$. Substituting (4.20), (4.21) and (4.22) into (4.19), we obtain the propagation step, which propagates ahead in time to obtain a conditional variance estimate at the next time, $t+1$ (assuming regime $s_{t+1}$), using the available information up to the current time $t$:

$$\hat{\boldsymbol{\lambda}}_{t+1|t,s_{t+1}} = \zeta_{s_{t+1}}\mathbf{1} + \alpha_{s_{t+1}} \hat{\boldsymbol{\lambda}}_{t|t,s_{t+1}} + \beta_{s_{t+1}} \hat{\boldsymbol{\lambda}}_{t|t-1,s_{t+1}}\,, \quad s_{t+1} = 1, \ldots, m\,. \qquad (4.23)$$
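For one regime and one frame, the recursion of (4.18), (4.22) and (4.23) reduces to element-wise operations because all covariance matrices are diagonal. The following minimal sketch illustrates this; the mixing over the previous regime, (4.20)-(4.21), and the state-probability weighting are omitted here, and all function and variable names are illustrative.

```python
# Per-bin sketch of one regime's recursion: state-conditional Wiener estimate (4.18),
# update step (4.22) and propagation step (4.23). Arrays hold one value per bin.
import numpy as np

def wiener_estimate(lambda_pred, noise_var, Y):
    """E{X_t | s_t, R_hat, Y_t} = R_hat (R_hat + R_d)^{-1} Y_t for a diagonal model, eq. (4.18)."""
    gain = lambda_pred / (lambda_pred + noise_var)
    return gain * Y

def update_step(lambda_pred, noise_var, Y):
    """lambda_{t|t,s}: MMSE estimate of |X_t|^2 given the prediction and the new frame, eq. (4.22)."""
    gain = lambda_pred / (lambda_pred + noise_var)
    return gain * noise_var + gain ** 2 * np.abs(Y) ** 2

def propagation_step(zeta, alpha, beta, lambda_updated, lambda_pred):
    """lambda_{t+1|t,s}: one-frame-ahead conditional variance for the next frame, eq. (4.23)."""
    return zeta + alpha * lambda_updated + beta * lambda_pred
```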

Let $\hat{\Lambda}^t \triangleq \{\hat{\Lambda}_0, \hat{\Lambda}_1, \ldots, \hat{\Lambda}_t\}$ be the set of the recursively estimated conditional variances up to time $t$. Then we can use all previous estimates to recursively evaluate the probability $p(s_{t-1} \mid s_t, \hat{\Lambda}^{t-1}, \mathbf{Y}_{t-1}; \phi)$ in (4.20) and (4.21) by
$$p\left(s_{t-1} \mid s_t, \hat{\Lambda}^{t-1}, \mathbf{Y}_{t-1}; \phi\right) = p\left(s_{t-1} \mid \hat{\Lambda}^{t-1}, \mathbf{Y}_{t-1}; \phi\right) a_{s_{t-1},s_t} \Big/\, p\left(s_t \mid \hat{\Lambda}^{t-1}, \mathbf{Y}_{t-1}; \phi\right)\,, \qquad (4.24)$$
where

$$p\left(s_t \mid \hat{\Lambda}^{t-1}, \mathbf{Y}_{t-1}; \phi\right) = \sum_{s_{t-1}} p\left(s_{t-1} \mid \hat{\Lambda}^{t-1}, \mathbf{Y}_{t-1}; \phi\right) a_{s_{t-1},s_t}\,. \qquad (4.25)$$

The conditional state probability on the right-hand side of (4.25) can be obtained by

$$\begin{aligned} p\left(s_t \mid \hat{\Lambda}^t, \mathbf{Y}_t; \phi\right) &= \frac{b\left(\mathbf{Y}_t, s_t \mid \hat{\Lambda}^t; \phi\right)}{b\left(\mathbf{Y}_t \mid \hat{\Lambda}^t; \phi\right)} \\ &= \frac{b\left(\mathbf{Y}_t \mid s_t, \hat{\boldsymbol{\lambda}}_{t|t-1,s_t}\right) p\left(s_t \mid \hat{\Lambda}^{t-1}, \mathbf{Y}_{t-1}; \phi\right)}{\sum_{s_t} b\left(\mathbf{Y}_t \mid s_t, \hat{\boldsymbol{\lambda}}_{t|t-1,s_t}\right) p\left(s_t \mid \hat{\Lambda}^{t-1}, \mathbf{Y}_{t-1}; \phi\right)}\,, \qquad (4.26) \end{aligned}$$
where $b(\cdot \mid \cdot)$ denotes a conditional density function. Specifically, $b(\mathbf{Y}_t \mid s_t, \hat{\boldsymbol{\lambda}}_{t|t-1,s_t})$ is the observation conditional density, which is a complex Gaussian with zero mean and covariance matrix $\hat{\mathbf{R}}^x_{s_t} + \mathbf{R}^d$,

$$b\left(\mathbf{Y}_t \mid s_t, \hat{\boldsymbol{\lambda}}_{t|t-1,s_t}\right) = \frac{1}{\pi^K \left|\hat{\mathbf{R}}^x_{s_t} + \mathbf{R}^d\right|} \exp\left\{-\mathbf{Y}_t^H \left(\hat{\mathbf{R}}^x_{s_t} + \mathbf{R}^d\right)^{-1} \mathbf{Y}_t\right\}\,. \qquad (4.27)$$
Computing the conditional density $b(\mathbf{Y}_t \mid s_t, \hat{\boldsymbol{\lambda}}_{t|t-1,s_t})$ tends to be numerically unstable for large values of $K$, since the diagonal values of its covariance matrix (i.e., $\hat{\boldsymbol{\lambda}}_{t|t-1,s_t}$) are typically of the same order of magnitude. Therefore, $b(\mathbf{Y}_t \mid s_t, \hat{\boldsymbol{\lambda}}_{t|t-1,s_t})$ tends to zero or infinity exponentially fast as $K$ increases. It is therefore useful to recursively evaluate a normalized density $\tilde{b}(\mathbf{Y}_t \mid s_t, \hat{\boldsymbol{\lambda}}_{t|t-1,s_t})$ as follows:
$$\tilde{b}\left(Y_{t,0}, \ldots, Y_{t,\tilde{k}} \mid s_t, \hat{\boldsymbol{\lambda}}_{t|t-1,s_t}\right) = \frac{\tilde{b}\left(Y_{t,0}, \ldots, Y_{t,\tilde{k}-1} \mid s_t, \hat{\boldsymbol{\lambda}}_{t|t-1,s_t}\right) b\left(Y_{t,\tilde{k}} \mid s_t, \hat{\boldsymbol{\lambda}}_{t|t-1,s_t}\right)}{\sum_{s_t} \tilde{b}\left(Y_{t,0}, \ldots, Y_{t,\tilde{k}-1} \mid s_t, \hat{\boldsymbol{\lambda}}_{t|t-1,s_t}\right) b\left(Y_{t,\tilde{k}} \mid s_t, \hat{\boldsymbol{\lambda}}_{t|t-1,s_t}\right)} \qquad (4.28)$$
for $\tilde{k} = 0, \ldots, K-1$, and substitute it into (4.26). As can be seen from (4.26), this normalization of $b(\mathbf{Y}_t \mid s_t, \hat{\boldsymbol{\lambda}}_{t|t-1,s_t})$ does not affect the value of $p(s_t \mid \hat{\Lambda}^t, \mathbf{Y}_t; \phi)$.

The causal one-frame-ahead conditional variance and the conditional second-order moment of the process can be obtained by
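As an illustration of (4.24)-(4.28), the following sketch updates the state posterior from one frame to the next, computing each regime's Gaussian log-likelihood and normalizing in the log domain; this serves the same purpose as the bin-by-bin normalization (4.28). The function and variable names are illustrative.

```python
# Minimal sketch of the state-posterior recursion (4.25)-(4.26) with log-domain
# normalization in place of the bin-by-bin normalization (4.28).
import numpy as np

def log_gaussian_density(Y, lambda_pred, noise_var):
    """log b(Y_t | s_t, lambda_{t|t-1,s_t}) for a diagonal complex Gaussian model, eq. (4.27)."""
    var = lambda_pred + noise_var
    return -np.sum(np.log(np.pi * var) + np.abs(Y) ** 2 / var)

def state_posterior(prev_post, A, Y, lambda_pred_per_state, noise_var):
    """p(s_t | Lambda^t, Y^t) from the previous posterior and the transition matrix A."""
    prior = prev_post @ A                        # p(s_t | Lambda^{t-1}, Y_{t-1}), eq. (4.25)
    log_b = np.array([log_gaussian_density(Y, lam, noise_var)
                      for lam in lambda_pred_per_state])
    log_post = np.log(prior + 1e-300) + log_b    # unnormalized log posterior, cf. (4.26)
    log_post -= log_post.max()                   # normalization, same role as (4.28)
    post = np.exp(log_post)
    return post / post.sum()
```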

$$\hat{\boldsymbol{\lambda}}_{t|t-1} = \sum_{s_t} p\left(s_t \mid \hat{\Lambda}^t, \mathbf{Y}_t\right) E\left\{\boldsymbol{\lambda}_{t|t-1,S_t} \mid s_t, \hat{\Lambda}_t\right\} = \sum_{s_t} p\left(s_t \mid \hat{\Lambda}^t, \mathbf{Y}_t\right) \hat{\boldsymbol{\lambda}}_{t|t-1,s_t}\,, \qquad (4.29)$$

$$\hat{\boldsymbol{\lambda}}_{t|t} = \sum_{s_t} p\left(s_t \mid \hat{\Lambda}^t, \mathbf{Y}_t\right) E\left\{\mathbf{X}_t \odot \mathbf{X}_t^* \mid s_t, \hat{\boldsymbol{\lambda}}_{t|t-1,s_t}, \mathbf{Y}_t\right\} = \sum_{s_t} p\left(s_t \mid \hat{\Lambda}^t, \mathbf{Y}_t\right) \hat{\boldsymbol{\lambda}}_{t|t,s_t}\,, \qquad (4.30)$$
while a state smoothing (i.e., noncausal state probability estimation) for the path-dependent MSTF-GARCH model has been derived in [130] and may be employed for noncausal estimation.

The causal recursive MMSE signal restoration algorithm, presented in (4.17) to (4.26), has a compact vector form with respect to the regimes vector. Let $\mathbf{s}_t \triangleq [S_t = 1, \ldots, S_t = m]'$ be the regimes vector at time $t$, and let

$$\boldsymbol{\rho}_t(\mathbf{s}_\tau) \triangleq \left[p\left(S_\tau = 1 \mid \hat{\Lambda}^t, \mathbf{Y}_t\right), \ldots, p\left(S_\tau = m \mid \hat{\Lambda}^t, \mathbf{Y}_t\right)\right]' \qquad (4.31)$$
be the probabilities of the regimes vector $\mathbf{s}_\tau$ conditioned on all observations up to frame $t$. Let $\mathbf{C}_t$ be a regimes probability matrix at time $t$ conditioned on the next regime and all available observations up to time $t$, i.e., $c_{t,ij} = p(S_t = i \mid S_{t+1} = j, \hat{\Lambda}^t, \mathbf{Y}_t)$, $i, j = 1, \ldots, m$. Let $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ represent the vectors of the $m$ regimes' GARCH parameters, i.e., $\boldsymbol{\alpha} \triangleq [\alpha_1, \ldots, \alpha_m]'$ and $\boldsymbol{\beta} \triangleq [\beta_1, \ldots, \beta_m]'$. Let $\hat{\boldsymbol{\lambda}}_{tk|\tau_1, \mathbf{s}_{\tau_2}} \triangleq \left[\hat{\lambda}_{tk|\tau_1, S_{\tau_2}=1}, \ldots, \hat{\lambda}_{tk|\tau_1, S_{\tau_2}=m}\right]'$ be an $m \times 1$ vector of the $k$th-bin estimated conditional variances based on observations up to time $\tau_1$, and the corresponding $m$ regimes vector $\mathbf{s}_{\tau_2}$. Denote by $\mathbf{a}^{(i)}$ and $\mathbf{c}_t^{(i)}$ the $i$th columns of the matrices $A$ and $\mathbf{C}_t$, respectively, and let $(\div)$ denote a term-by-term division of two vectors. A step-by-step vector form of the causal signal estimation procedure is described in Table 4.1.

The algorithm summarized in Table 4.1 estimates both the spectral coefficients and their conditional variance in the MMSE sense. A more general signal enhancement problem is formulated as minimization of the following distortion measure:

$$E\left\{\left|f(X_{tk}) - f(\hat{X}_{tk})\right|^2 \;\middle|\; \mathcal{Y}^t\right\}\,, \qquad (4.32)$$
where $f(X)$ is a Borel integrable function. The estimator can be found from

$$f\left(\hat{X}_{tk}\right) = E\left\{f(X_{tk}) \mid \mathcal{Y}^t\right\}\,, \qquad (4.33)$$
where
$$E\left\{f(X_{tk}) \mid \mathcal{Y}^t\right\} = \sum_{s_t} p\left(S_t = s_t \mid \mathcal{Y}^t\right) E\left\{f(X_{tk}) \mid s_t, \mathcal{Y}^t\right\}\,. \qquad (4.34)$$
The log-spectral amplitude MMSE estimator, obtained by substituting $f(X) = \log|X|$ into (4.33), is of particular importance in speech enhancement applications; see for instance [34, 38]. The LSA estimator [34] is given by

Table 4.1: Vector form of the recursive MSTF-GARCH signal estimation

Initialization:
$\boldsymbol{\rho}_{-1}(\mathbf{s}_0) = \boldsymbol{\pi}$
$\hat{\boldsymbol{\lambda}}_{-1,k|-2,\mathbf{s}_0} = \hat{\boldsymbol{\lambda}}_{-1,k|-1,\mathbf{s}_0} = \mathbf{0}_{m\times 1}\,, \quad k = 0, \ldots, K-1$

For $t = 0, \ldots, T-1$:
$\hat{\boldsymbol{\lambda}}_{tk|t-1,\mathbf{s}_t} = \boldsymbol{\zeta} + \boldsymbol{\alpha} \odot \hat{\boldsymbol{\lambda}}_{t-1,k|t-1,\mathbf{s}_t} + \boldsymbol{\beta} \odot \hat{\boldsymbol{\lambda}}_{t-1,k|t-2,\mathbf{s}_t}\,, \quad k = 0, \ldots, K-1$
$\hat{\boldsymbol{\lambda}}_{tk|t,\mathbf{s}_t} = \left[\hat{\boldsymbol{\lambda}}_{tk|t-1,\mathbf{s}_t} \odot \sigma_k^2\mathbf{1} + \hat{\boldsymbol{\lambda}}^2_{tk|t-1,\mathbf{s}_t} \cdot |Y_{tk}|^2 \div \left(\hat{\boldsymbol{\lambda}}_{tk|t-1,\mathbf{s}_t} + \sigma_k^2\mathbf{1}\right)\right] \div \left(\hat{\boldsymbol{\lambda}}_{tk|t-1,\mathbf{s}_t} + \sigma_k^2\mathbf{1}\right)\,, \quad k = 0, \ldots, K-1$
$b\left(\mathbf{Y}_t \mid s_t, \hat{\boldsymbol{\lambda}}_{t|t-1,s_t}\right) = \pi^{-K}\left|\hat{\mathbf{R}}^x_{s_t} + \mathbf{R}^d\right|^{-1}\exp\left\{-\mathbf{Y}_t^H\left(\hat{\mathbf{R}}^x_{s_t} + \mathbf{R}^d\right)^{-1}\mathbf{Y}_t\right\}\,, \quad s_t = 1, \ldots, m$
$\mathbf{B}_t \triangleq \mathrm{diag}\left\{b\left(\mathbf{Y}_t \mid s_t, \hat{\boldsymbol{\lambda}}_{t|t-1,s_t}\right)\right\}$
$\boldsymbol{\rho}_t(\mathbf{s}_t) = \mathbf{B}_t\,\boldsymbol{\rho}_{t-1}(\mathbf{s}_t)\left[\mathbf{1}'\,\mathbf{B}_t\,\boldsymbol{\rho}_{t-1}(\mathbf{s}_t)\right]^{-1}$
$\boldsymbol{\rho}_t(\mathbf{s}_{t+1}) = A'\,\boldsymbol{\rho}_t(\mathbf{s}_t)$
For $i = 1, \ldots, m$: $\mathbf{c}_t^{(i)} = \mathbf{a}^{(i)} \odot \boldsymbol{\rho}_t(\mathbf{s}_t)\,/\,\rho_t(s_{t+1} = i)$
$\hat{\boldsymbol{\lambda}}_{tk|t,\mathbf{s}_{t+1}} = \mathbf{C}_t'\,\hat{\boldsymbol{\lambda}}_{tk|t,\mathbf{s}_t}\,, \quad k = 0, \ldots, K-1$
$\hat{\boldsymbol{\lambda}}_{tk|t-1,\mathbf{s}_{t+1}} = \mathbf{C}_t'\,\hat{\boldsymbol{\lambda}}_{tk|t-1,\mathbf{s}_t}\,, \quad k = 0, \ldots, K-1$
$\hat{\mathbf{X}}_{t|t,s_t} = \hat{\mathbf{R}}^x_{s_t}\left(\hat{\mathbf{R}}^x_{s_t} + \mathbf{R}^d\right)^{-1}\mathbf{Y}_t$
$\hat{X}_{tk} = \boldsymbol{\rho}_t'(\mathbf{s}_t)\,\hat{\mathbf{X}}_{tk|t,\mathbf{s}_t}\,, \quad k = 0, \ldots, K-1$

$$\widehat{|X_{tk}|} = \exp\left(E\left\{\log|X_{tk}| \mid \lambda_{tk}, Y_{tk}\right\}\right) = G\left(\xi_{tk}, \vartheta_{tk}\right) |Y_{tk}|\,, \qquad (4.35)$$
where
$$\xi_{tk} \triangleq \frac{\lambda_{tk}}{\sigma_{tk}^2}\,, \qquad \gamma_{tk} \triangleq \frac{|Y_{tk}|^2}{\sigma_{tk}^2}\,, \qquad \vartheta_{tk} \triangleq \frac{\gamma_{tk}\,\xi_{tk}}{1 + \xi_{tk}}\,, \qquad (4.36)$$
and
$$G(\xi, \vartheta) = \frac{\xi}{1+\xi} \exp\left(\frac{1}{2}\int_\vartheta^\infty \frac{e^{-t}}{t}\,dt\right)\,. \qquad (4.37)$$
Here $\xi_{tk}$ and $\gamma_{tk}$ represent the a priori and a posteriori SNRs, respectively [33]. By substituting (4.35) into (4.34), and combining the result with the phase of the noisy signal [34], we obtain the spectral coefficient estimator in the MMSE-LSA sense:

$$\hat{X}_{tk} = Y_{tk} \prod_{s_t} G\left(\hat{\xi}_{tk,s_t}, \hat{\vartheta}_{tk,s_t}\right)^{p(s_t \mid \mathcal{Y}^t)}\,, \qquad (4.38)$$
where
$$\hat{\xi}_{tk,s_t} \triangleq \frac{\hat{\lambda}_{tk|t,s_t}}{\sigma_{tk}^2}\,, \qquad \hat{\vartheta}_{tk,s_t} \triangleq \frac{\gamma_{tk}\, \hat{\xi}_{tk,s_t}}{1 + \hat{\xi}_{tk,s_t}}\,, \qquad (4.39)$$
and $p(s_t \mid \mathcal{Y}^t)$ is recursively estimated using (4.26).
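The gain (4.37) can be evaluated with the exponential integral $E_1$; the following sketch combines it with the multi-regime weighting of (4.38). The clipping of $\hat{\vartheta}$ away from zero is a numerical safeguard introduced here, not part of the estimator, and all names are illustrative.

```python
# Minimal sketch of the LSA gain (4.37) and the multi-regime LSA estimate (4.38).
import numpy as np
from scipy.special import exp1  # exponential integral E_1

def lsa_gain(xi, theta):
    """G(xi, theta) = xi/(1+xi) * exp(0.5 * E_1(theta)), eq. (4.37)."""
    return xi / (1.0 + xi) * np.exp(0.5 * exp1(theta))

def ms_lsa_estimate(Y, post, lambda_per_state, noise_var):
    """X_hat = Y * prod_s G(xi_s, theta_s)^{p(s | Y^t)}, eq. (4.38), using the noisy phase."""
    amp = np.abs(Y).astype(float)
    gamma = np.abs(Y) ** 2 / noise_var                 # a posteriori SNR
    for s, p_s in enumerate(post):
        xi = lambda_per_state[s] / noise_var           # a priori SNR per regime, eq. (4.39)
        theta = np.maximum(gamma * xi / (1.0 + xi), 1e-10)  # numerical safeguard only
        amp *= lsa_gain(xi, theta) ** p_s
    return amp * np.exp(1j * np.angle(Y))
```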

4.4 Estimation efficiency

In this section we analyze the mean-square error of a one-step-ahead MMSE estimate obtained with the proposed recursive algorithm. The recursive formulation of the MSTF-GARCH yields an accumulated error in the estimation of the variance and of the signal. However, for each regime and in each frame the algorithm evaluates the conditional variance as a weighted sum of previously estimated conditional variances and squared absolute values, (4.20), (4.21). These weights are proportional to the conditional densities $b(\mathbf{Y}_t \mid s_t, \hat{\boldsymbol{\lambda}}_{t|t-1,s_t})$ in (4.26). Consequently, an overestimation of the conditional variance on a specific frame can be followed in the algorithm by giving a high probability (i.e., a higher weight) to a regime with small parameters, which compensates for the previous overestimation. Similarly, an underestimation of the conditional variance can be compensated by giving a high probability to a regime with large parameters.

Assume that the process is observed perfectly (without noise) up to time $t-1$ and that the regime path is known up to that time. Then $\boldsymbol{\lambda}_{t-1|t-2,s_{t-1}}$ can be calculated by (4.5). Following Ephraim and Merhav [117], who derive bounds for the MSE of composite source signal estimation, we assume that (i) the Markov chain is stationary and the necessary and sufficient condition for a bounded variance is satisfied; (ii) $\hat{\boldsymbol{\lambda}}_{t|t,s_t}$ is square integrable with respect to $b(\mathbf{Y}_t \mid \boldsymbol{\lambda}_{t|t-1,s_t})$ and $b(\mathbf{Y}_t \mid \boldsymbol{\lambda}_{t|t-1,\tilde{s}_t})$; and (iii) the regime transition probabilities are positive, i.e., $a_{ij} \ge a_{\min} > 0$ for all $i, j = 1, \ldots, m$.

The one-step-ahead MMSE estimator (4.17) is unbiased in the sense that $E\{\hat{\mathbf{X}}_t\} = E\{\mathbf{X}_t\}$, and following [117] we obtain an upper bound for the variance of the one-step-ahead estimation error, assuming that the process is observed in additive, independent stationary noise. The one-step-ahead MSE is given by

$$e_t^2 \triangleq \frac{1}{K}\, \mathrm{tr}\, E\left\{\left(\mathbf{X}_t - \hat{\mathbf{X}}_t\right)\left(\mathbf{X}_t - \hat{\mathbf{X}}_t\right)^H\right\}\,, \qquad (4.40)$$
where the signal estimator $\hat{\mathbf{X}}_t$ follows

$$\hat{\mathbf{X}}_t = E\left\{\mathbf{X}_t \mid \mathcal{I}^{t-1}, \mathbf{Y}_t\right\} = E\left\{\mathbf{X}_t \mid s_{t-1}, \boldsymbol{\lambda}_{t-1|t-2,s_{t-1}}, \mathbf{X}_{t-1}, \mathbf{Y}_t\right\} = E\left\{\mathbf{X}_t \mid \Lambda_t, \mathbf{Y}_t\right\}\,. \qquad (4.41)$$

Under the above assumptions the MSE can be written as [117, eqs. (13)-(17)]

$$e_t^2 = \mu_t^2 + \eta_t^2\,, \qquad (4.42)$$
where

$$\mu_t^2 \triangleq \frac{1}{K}\, \mathrm{tr}\, E\left\{\mathrm{cov}\left(\mathbf{X}_t \mid \Lambda_t, s_t, \mathbf{Y}_t\right)\right\} = \frac{1}{K}\, \mathrm{tr}\, E\left\{\mathrm{cov}\left(\mathbf{X}_t \mid \boldsymbol{\lambda}_{t|t-1,s_t}, s_t, \mathbf{Y}_t\right)\right\}\,, \qquad (4.43)$$
$$\eta_t^2 \triangleq \frac{1}{2}\sum_{s_t \ne \tilde{s}_t} E\left\{p\left(s_t \mid s_{t-1}, \Lambda_t, \mathbf{Y}_t\right) p\left(\tilde{s}_t \mid s_{t-1}, \Lambda_t, \mathbf{Y}_t\right) g\left(s_t, \tilde{s}_t, \Lambda_t, \mathbf{Y}_t\right)\right\}\,, \qquad (4.44)$$
and

$$g\left(s_t, \tilde{s}_t, \Lambda_t, \mathbf{Y}_t\right) \triangleq \frac{1}{K}\, \mathrm{tr}\left\{\left(\hat{\mathbf{X}}_{t|t,s_t} - \hat{\mathbf{X}}_{t|t,\tilde{s}_t}\right)\left(\hat{\mathbf{X}}_{t|t,s_t} - \hat{\mathbf{X}}_{t|t,\tilde{s}_t}\right)^H\right\} = g\left(s_t, \tilde{s}_t, \boldsymbol{\lambda}_{t|t-1,s_t}, \boldsymbol{\lambda}_{t|t-1,\tilde{s}_t}, \mathbf{Y}_t\right)\,. \qquad (4.45)$$
The state probabilities in (4.44) can be evaluated using (4.26):

$$p\left(s_t \mid s_{t-1}, \Lambda_t, \mathbf{Y}_t\right) = \frac{b\left(\mathbf{Y}_t \mid s_t, \boldsymbol{\lambda}_{t|t-1,s_t}\right) a_{s_{t-1},s_t}}{\sum_{s_t} b\left(\mathbf{Y}_t \mid s_t, \boldsymbol{\lambda}_{t|t-1,s_t}\right) a_{s_{t-1},s_t}}\,, \qquad (4.46)$$
and the signal estimate given the state $s_t$ is given by

$$\hat{\mathbf{X}}_{t|t,s_t} = \mathbf{W}_{s_t} \mathbf{Y}_t\,, \qquad (4.47)$$

where $\mathbf{W}_{s_t}$ is the conditional Wiener filter: $\mathbf{W}_{s_t} \triangleq \mathbf{R}^x_{s_t}\left(\mathbf{R}^x_{s_t} + \mathbf{R}^d\right)^{-1}$. Substituting (4.47) into (4.45), we have

$$g\left(s_t, \tilde{s}_t, \Lambda_t, \mathbf{Y}_t\right) = \frac{1}{K}\, \mathbf{Y}_t^H \left(\mathbf{W}_{s_t} - \mathbf{W}_{\tilde{s}_t}\right)^H \left(\mathbf{W}_{s_t} - \mathbf{W}_{\tilde{s}_t}\right) \mathbf{Y}_t \triangleq \frac{1}{K}\, \mathbf{Y}_t^H \mathbf{W}^2_{\tilde{s}_t s_t} \mathbf{Y}_t\,. \qquad (4.48)$$
The one-step-ahead MSE, $e_t^2$, is decomposed into two positive terms, $\mu_t^2$ and $\eta_t^2$. The first is the MSE of the estimator $\hat{\mathbf{X}}_{t|t,s_t}$, which relies on knowing the true regime at time $t$, and is therefore the optimal estimator in the MMSE sense. This term is evaluated by substituting (4.47) into (4.43):

$$\mu_t^2 = \frac{1}{K}\sum_{s_t} a_{s_{t-1}s_t}\, \mathrm{tr}\left\{\mathbf{W}_{s_t}\mathbf{R}^d\right\}\,. \qquad (4.49)$$
The second term, $\eta_t^2$, is a weighted sum of cross-error terms which depend on pairs of the process regimes. This term is difficult to evaluate, but it is upper bounded by [117, eqs. (18) and (23)]
$$\eta_t^2 \le \frac{1}{2}\, a_{\min}^{-2} \sum_{s_t \ne \tilde{s}_t} \left(I_{s_t}(\tilde{s}_t) + I_{\tilde{s}_t}(s_t)\right)\,, \qquad (4.50)$$
where
$$I_{s_t}(\tilde{s}_t) \le \frac{1}{K}\, \mathrm{tr}\left\{\mathbf{W}^2_{s_t\tilde{s}_t}\mathbf{Q}_{\tilde{s}_t}\right\} \left( \left|\mathbf{R}_\lambda(s_t,\tilde{s}_t)\right| \cdot \left|\mathbf{Q}_{s_t}\right|^{-\lambda} \cdot \left|\mathbf{Q}_{\tilde{s}_t}\right|^{\lambda-1} + \frac{\mathrm{tr}\left\{\mathbf{W}^2_{s_t\tilde{s}_t}\mathbf{R}_\lambda(s_t,\tilde{s}_t)\right\}}{\mathrm{tr}\left\{\mathbf{W}^2_{s_t\tilde{s}_t}\mathbf{Q}_{\tilde{s}_t}\right\}} \right) \qquad (4.51)$$
with $\lambda > 0$ [117, eqs. (31) to (39) and (54) to (60)], $\mathbf{Q}_{s_t}$ denotes the covariance matrix of the noisy signal given the regime $s_t$, and $\mathbf{R}_\lambda(s_t, \tilde{s}_t)$ is defined by

$$\mathbf{R}_\lambda(s_t, \tilde{s}_t) \triangleq \left[\lambda \mathbf{Q}_{s_t}^{-1} + (1-\lambda)\, \mathbf{Q}_{\tilde{s}_t}^{-1}\right]^{-1}\,. \qquad (4.52)$$
In the derivation of (4.51) it is assumed that $\mathbf{R}_\lambda(s_t, \tilde{s}_t)$ is positive definite [117]. Since

$\mathbf{Q}_{s_t}$ is a diagonal matrix with positive eigenvalues, $\mathbf{R}_\lambda(s_t, \tilde{s}_t)$ is positive definite for any $0 < \lambda < 1$. Substituting (4.51) into (4.50) and using the diagonality of the covariance matrices, we obtain an upper bound for the cross-error term
$$\eta_t^2 \le \frac{1}{a_{\min}^2 K} \sum_{s_t \ne \tilde{s}_t} \left[ \mathrm{tr}\left\{\mathbf{W}^2_{s_t\tilde{s}_t}\mathbf{Q}_{\tilde{s}_t}\right\} \cdot \left|\mathbf{R}_\lambda(s_t, \tilde{s}_t)\right| \cdot \left|\mathbf{Q}_{s_t}\right|^{-\lambda} \cdot \left|\mathbf{Q}_{\tilde{s}_t}\right|^{\lambda-1} + \mathrm{tr}\left\{\mathbf{W}^2_{s_t\tilde{s}_t}\mathbf{R}_\lambda(s_t, \tilde{s}_t)\right\} \right] \qquad (4.53)$$
for $0 < \lambda < 1$. It is worthwhile noting that our MSE analysis follows the analysis in [117]; however, the latter deals with a memoryless regime-switching process and a Toeplitz covariance matrix, whereas in our case neither assumption holds.

4.5 Model estimation

In this section we address the problem of estimating the model parameters $\phi \triangleq \{\pi^{(0)}, A, \zeta_1, \ldots, \zeta_m, \alpha_1, \ldots, \alpha_m, \beta_1, \ldots, \beta_m\}$. The ML estimation approach is commonly used for estimating the parameters of GARCH models (e.g., [1, 5, 7]) and also for estimating the transition probability matrices (e.g., [129]). The model parameters are estimated from a training data set of $N$ clean signals of lengths $T_n$, $n = 1, \ldots, N$. Let $\mathbf{X}_t^{(n)}$ denote the spectral coefficients of the $n$th clean training signal and let $\mathcal{X}^{\tau,(n)} \triangleq \{\mathbf{X}_t^{(n)} \mid t = 0, \ldots, \tau\}$. The conditional distribution of the vector $\mathbf{X}_t^{(n)}$ given its past observations is a mixture of zero-mean Gaussian vectors with diagonal covariance matrices $\hat{\mathbf{R}}^{x,(n)}_{s_t}$:

$$b\left(\mathbf{X}_t^{(n)} \mid \mathcal{X}^{t-1,(n)}\right) = \sum_{s_t} p\left(s_t \mid \mathcal{X}^{t-1,(n)}\right) b\left(\mathbf{X}_t^{(n)} \mid s_t, \hat{\mathbf{R}}^{x,(n)}_{s_t}\right)\,. \qquad (4.54)$$

Given a set of model parameters $\phi$, the diagonal covariance matrix of the density $b(\mathbf{X}_t^{(n)} \mid s_t, \hat{\mathbf{R}}^{x,(n)}_{s_t})$ can be recursively estimated by using the estimation algorithm introduced in Section 4.3, where in this case the signal observations are known. Assuming that the process is asymptotically wide-sense stationary and that the training sequences are sufficiently long, the initial state probabilities, $\pi^{(0)}$, and the initial conditional variance, $\boldsymbol{\lambda}_{0|-1,s_0}$, have a negligible contribution to the total likelihood. Therefore, it is convenient to choose the stationary values as the initial values in the following optimization problem, i.e., $\hat{\boldsymbol{\lambda}}_{0k|-1,s_0} = \hat{\Phi}\hat{\boldsymbol{\zeta}}$ as the initial conditional variances and $\pi^{(0)} = \hat{\boldsymbol{\pi}}$ as the initial state probabilities.

The conditional log-likelihood of the training set is given by

$$\mathcal{L}(\phi) = \sum_n \sum_{t=0}^{T_n-1} \log b\left(\mathbf{X}_t^{(n)} \mid \mathcal{X}^{t-1,(n)}; \phi\right)\,. \qquad (4.55)$$

Using the constraints in (4.6) and imposing that $\hat{A}$ be a transition probability matrix, the ML estimates of the model parameters $\phi$ can be obtained by solving the following nonlinear constrained optimization problem:

$$\hat{\phi} = \arg\max_{\phi}\; \mathcal{L}(\phi) \qquad \text{s.t.} \quad \hat{\zeta}_i > 0\,, \quad \hat{\alpha}_i \ge 0\,, \quad \hat{\beta}_i \ge 0\,, \quad \sum_{j=1}^{m} \hat{a}_{ij} = 1 \quad \forall\, i \in \{1, \ldots, m\}\,. \qquad (4.56)$$
For a given parameter set $\phi$, the sequence of state-dependent conditional variances $\{\hat{\Lambda}_t\}$ can be evaluated recursively according to the method described in Section 4.3, and so can the set of conditional state probabilities $p(s_t \mid \mathcal{X}^{t-1})$. The conditional log-likelihood (4.55) can then be numerically maximized under the linear constraints of (4.56) as specified in [6, 7], or by using sequential quadratic programming [131, 132]. The computational complexity required for the model estimation is much higher than that required for a single-regime GARCH model, since $m^2$ parameters are to be estimated for the transition probability matrix and an additional $3m$ GARCH parameters are to be evaluated. However, using the Markov-switching model, the optimization problem needs to be solved only once, prior to the restoration procedure. It is well known that the optimal set of parameters $\phi$ is not necessarily unique in a Markovian model [129], and in addition the numerical optimization may only guarantee a local maximum of the likelihood function. However, the flexibility of the model enables better results than those achievable with a single-regime GARCH model [7, 8]. This is also shown in our simulation results, both for MSTF-GARCH processes and for speech signals.
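As an illustration of the optimization in (4.55)-(4.56), the following sketch fits the degenerate single-regime case ($m = 1$), for which the conditional variance recursion does not depend on a hidden regime path; the multi-regime likelihood additionally requires the recursive state-probability evaluation of Section 4.3, and the row-sum constraints on $\hat{A}$ can be added as equality constraints to the same solver. Initial values, bounds, and function names are illustrative.

```python
# Minimal sketch of constrained ML fitting for the single-regime case of (4.55)-(4.56).
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(params, X):
    """-L(phi) of (4.55) for one clean training signal X of shape (T, K), complex valued."""
    zeta, alpha, beta = params
    T, K = X.shape
    lam = np.full(K, zeta / max(1.0 - alpha - beta, 1e-6))  # stationary initial variance
    nll = 0.0
    for t in range(T):
        nll += np.sum(np.log(np.pi * lam) + np.abs(X[t]) ** 2 / lam)
        lam = zeta + alpha * np.abs(X[t]) ** 2 + beta * lam  # eq. (4.5) with p = q = 1
    return nll

def fit_garch(X):
    x0 = np.array([0.01, 0.3, 0.3])                          # illustrative initial guess
    bounds = [(1e-6, None), (0.0, 1.0), (0.0, 1.0)]          # zeta > 0, alpha, beta >= 0
    res = minimize(neg_log_likelihood, x0, args=(X,), method="SLSQP", bounds=bounds)
    return res.x
```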

4.6 Experimental results

In this section we demonstrate the performance of the proposed algorithm when applied to restoration of noisy MSTF-GARCH signals, and to estimation of conditional variances and squared absolute values of speech signals in the STFT domain.

4.6.1 MSTF-GARCH signals

The proposed model estimation and signal restoration algorithm has been applied to MSTF-GARCH models of 3 and 5 regimes, degraded by additive independent white noise with 0 to 15 dB input signal-to-noise ratio (SNR). For each state space ($m = 3, 5$), a set of 20 stationary models has been simulated, with parameters uniformly distributed on the interval $(0, 1]$. For each model, the parameters $\phi$ are estimated from a set of 10 training signals, each of time length $T = 100$ and dimension $K = 100$. The estimated parameters are employed for the restoration of a test set containing 20 noisy signals of the same size, and four types of estimated variances are compared by incorporating them into the signal's recursive MMSE estimator of (4.17) and (4.18). The "theoretical limit" refers to the estimator which exploits the true conditional variances, $\lambda_{t|t-1,s_t}$, of the simulated process. This estimator is the optimal estimator in the MMSE sense, and its performance is compared with those of the recursive estimators. The "MSTF-GARCH, true model" refers to the recursive signal estimator described in Section 4.3, which uses the true parameter set $\phi$, whereas the "MSTF-GARCH, $m = i$" estimator employs a set of estimated parameters $\hat{\phi}$, assuming that the model has $i$ regimes. For the "MSTF-GARCH, $m = i$" estimator, the parameter set $\hat{\phi}$ is estimated using the ML approach as described in Section 4.5. The performance of our algorithm is also compared with that of an estimator that assumes a "constant variance" process. For that estimator (only), the vector of "stationary" variances is evaluated for each noisy signal from the corresponding clean signal.

Figure 4.1(a) shows the SNR improvement obtained by using the different estimators when applied to 3-state MSTF-GARCH signals. It can be seen that even when a small number of regimes is assumed, the MSTF-GARCH estimator still outperforms the "constant variance" estimator, and the results achieved by assuming 3 or 5 regimes are comparable to those obtained by using the true model parameters. Figure 4.1(b) shows estimation results for 5-state MSTF-GARCH processes, under the assumption of 1, 3, 5 or 7 regimes. The estimation performance improves as the number of assumed regimes increases, but using a larger number of regimes than the true number (e.g., 7 instead of 5, or 5 instead of 3) yields less accurate results.

The time-varying behavior of the recursive estimator is demonstrated for a 5-state MSTF-GARCH signal degraded by additive white noise with 5 dB SNR. Figure 4.2 shows the trace of the instantaneous output SNR in each time frame, obtained by the optimal estimator, the recursive estimators assuming 1 or 5 regimes (i.e., "MSTF-GARCH, m = 1, 5") and a "constant variance" estimator. The varying volatility of the process implies time-varying performance for all of these estimators.


Figure 4.1: SNR improvements obtained by using different MSTF-GARCH based estimators when applied to (a) 3-state MSTF-GARCH signals and (b) 5-state MSTF-GARCH signals. MSTF-GARCH models with various numbers of regimes are considered and compared with the true MSTF-GARCH parameters, the theoretical limit and a constant variance estimator.


Figure 4.2: Trace of instantaneous output SNR achieved by the proposed algorithm when applied to a realization of a 5-state MSTF-GARCH process degraded by additive white noise with 5 dB SNR, and restored by an MSTF-GARCH estimator assuming 1 and 5 states.

Nevertheless, under the assumption of 5 regimes, our recursive estimator follows the optimal estimator with a relatively small degradation in performance. The single-regime estimator yields results comparable to those of the 5-regime estimator for frames with high input SNR. However, for frames with low input SNR, the results obtained by the single-regime estimator are comparable to those obtained by the "constant variance" estimator.

4.6.2 Speech signals

The idea of using different states for the enhancement of speech signals was first introduced by Drucker [28]. He assumed five categories of speech signals, comprising fricatives, stops, vowels, glides, and nasals. The application of HMMs to speech enhancement requires a higher number of states [35, 62], since these models allow only a single density, or a finite set of mixture densities, for the spectral coefficients in each state. The GARCH-based models allow continuous values of conditional variances with possible transients resulting from switching states. Hence, a small number of states may be sufficient for the representation of the coefficients' second-order moments. Furthermore, the dynamics of the spectral coefficients are frequency dependent. Therefore, we assume different parameters in different sub-bands. The speech signals used in our evaluation are taken from the TIMIT database. The training set includes 10 different utterances from 10 different speakers, half male and half female. The speech signals are sampled at 8 kHz and normalized to the same energy. Transformation into the STFT domain is obtained by using a half-overlapping Hamming analysis window of 32 milliseconds length. We consider 1-, 3- and 5-state MSTF-GARCH models for the speech signals and estimate the one-frame-ahead conditional variance for test speech signals that are not in the training set. Figure 4.3 shows typical estimates of the one-frame-ahead conditional variance, $\hat{\lambda}_{tk|t-1}$, at frequencies of 1, 2 and 3 kHz, using the different MSTF-GARCH models and assuming independent model parameters in each frequency sub-band. The estimated conditional variances are compared with the clean signal's squared absolute value $|X_{tk}|^2$. It can be seen that by increasing the number of regimes, the conditional variance yields a better prediction of the squared absolute value of the signal. Moreover, it can be seen that the conditional variance estimated by a single-regime model is smoother than that estimated by a multi-regime model, and the latter better tracks rapid changes in the signal's energy with possible switching of regimes. During the first few frames, the speech signal is absent and thus, as long as the squared absolute value is below the minimum variance allowed by the model, the predicted variances are determined by the model threshold. However, the predicted variances converge to the squared absolute value as soon as the latter exceeds this threshold. A larger number of states may allow a better representation of the conditional variance in different magnitude ranges and different volatilities, at the expense of greater computational complexity. Many speech enhancement algorithms employ the decision-directed approach for the speech spectral variance estimation [33, 70]. Accordingly,

$$\hat{\lambda}^{\mathrm{DD}}_{tk} = \max\left\{\bar{\alpha}\left|\hat{X}_{t-1,k}\right|^2 + (1-\bar{\alpha})\left(|Y_{tk}|^2 - \sigma_k^2\right)\,,\; \xi_{\min}\,\sigma_k^2\right\}\,, \qquad (4.57)$$
where $\bar{\alpha}$ ($0 \le \bar{\alpha} \le 1$) is a weighting factor that controls the trade-off between noise reduction and transient distortion introduced into the signal. A larger value of $\bar{\alpha}$ results in a greater reduction of the musical noise phenomenon, but at the expense of attenuated speech onsets and audible modifications of transient components. The parameter $\xi_{\min}$ is a lower bound on the a priori SNR.
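For reference, a minimal sketch of the decision-directed estimate (4.57) is given below; the default weighting factor $\bar{\alpha}$ and floor $\xi_{\min}$ are illustrative values, not the settings used in the experiments of this section.

```python
# Minimal sketch of the decision-directed variance estimate (4.57) for one frame.
import numpy as np

def decision_directed(X_prev_hat, Y, noise_var, alpha_bar=0.98, xi_min=10 ** (-25 / 10)):
    """lambda_DD_{tk}, given the previous amplitude estimate X_prev_hat (per-bin arrays)."""
    ml_term = np.abs(Y) ** 2 - noise_var                      # instantaneous (subtractive) term
    lam = alpha_bar * np.abs(X_prev_hat) ** 2 + (1.0 - alpha_bar) * ml_term
    return np.maximum(lam, xi_min * noise_var)                # lower bound xi_min * sigma^2
```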


Figure 4.3: Typical traces of one-frame-ahead conditional variance estimates for speech signals at frequencies of (a) 1 kHz, (b) 2 kHz and (c) 3 kHz. The conditional variances are estimated by MSTF-GARCH models with a single state (dashed-dotted line), 3 states (dotted line) and 5 states (dashed line), and compared with the clean signal's squared absolute value (solid line).


Figure 4.4: Typical traces of estimated squared absolute values for a speech signal at a frequency of 2 kHz. The variances are estimated by a 5-state MSTF-GARCH model (dashed-dotted line) and by the decision-directed approach (dotted line), and compared with the clean signal's squared absolute value (solid line). The SNRs are (a) 0 dB and (b) 10 dB.

The GARCH modeling enables an analytical derivation of the decision-directed estimator [69]. Considering the degenerate case of a single-state, single-frequency ARCH(1) model (i.e., $\beta = 0$), the update step (4.22) can be written as

$$\hat{\lambda}_{tk|t} = \bar{\alpha}_{tk}\, \hat{\lambda}_{tk|t-1} + (1 - \bar{\alpha}_{tk})\left(|Y_{tk}|^2 - \sigma_k^2\right) \qquad (4.58)$$
with
$$\bar{\alpha}_{tk} \triangleq 1 - \frac{\hat{\lambda}^2_{tk|t-1}}{\left(\hat{\lambda}_{tk|t-1} + \sigma_k^2\right)^2}\,, \qquad (4.59)$$
and $0 < \bar{\alpha}_{tk} < 1$. Substituting the propagation step (4.23) for $\hat{\lambda}_{tk|t-1}$ into (4.58) with $\alpha = 1$, we obtain

$$\hat{\lambda}_{tk|t} = \bar{\alpha}_{tk}\, E\left\{|X_{t-1,k}|^2 \mid \mathcal{Y}^{t-1}\right\} + (1 - \bar{\alpha}_{tk})\left(|Y_{tk}|^2 - \sigma_k^2\right) + \bar{\alpha}_{tk}\,\zeta\,. \qquad (4.60)$$
For $\zeta \ll E\{|X_{t-1,k}|^2 \mid \mathcal{Y}^{t-1}\}$, (4.60) is similar to the decision-directed variance estimate (4.57) with $\bar{\alpha}_{tk} \equiv \bar{\alpha}$, where $E\{|X_{t-1,k}|^2 \mid \mathcal{Y}^{t-1}\}$ stands for $|\hat{X}_{t-1,k}|^2$, the squared absolute value of the spectral coefficient estimate based on the observations

$\mathcal{Y}^{t-1}$. Accordingly, the degenerate ARCH-based variance estimate with $\alpha = 1$ and a low-valued $\zeta$ is closely related to the decision-directed estimator with a time-varying, frequency-dependent weighting factor $\bar{\alpha}_{tk}$. However, the GARCH (and ARCH) modeling approach treats the spectral variance as a random process, whereas the decision-directed approach assumes the spectral variance is a parameter which is heuristically evaluated. In addition, the decision-directed approach thresholds the estimated variance to be larger

than $\xi_{\min}\sigma_k^2$, while in the GARCH modeling the lower bound is inherently incorporated into the variance estimation. Since $\hat{\lambda}_{tk|t-1} > \zeta$, from (4.22) we obtain the following lower bound:
$$\hat{\lambda}_{tk|t} > \frac{\zeta}{\zeta + \sigma_k^2}\,\sigma_k^2 + \left(\frac{\zeta}{\zeta + \sigma_k^2}\right)^2 |Y_{tk}|^2 > 0\,. \qquad (4.61)$$
Modeling the spectral coefficients as an MSTF-GARCH allows further flexibility for the variance estimation. Figure 4.4 demonstrates the estimated squared absolute values of a speech signal corrupted by white Gaussian noise with SNRs of (a) 0 dB and (b) 10 dB. The signal's squared absolute value at a frequency of 2 kHz is compared with its estimated variance using a 5-state MSTF-GARCH model and using the decision-directed approach.

It shows that the MSTF-GARCH approach with 5 states yields a better estimate of the squared absolute value under both high and low SNR conditions, especially in low-energy bins. Furthermore, the MSTF-GARCH approach enables better tracking of rapid changes in the coefficients' energy than the decision-directed approach. The difference between Figure 4.3 and Figure 4.4 is that the former demonstrates the prediction of the coefficients' variances (i.e., the conditional variance) in a noiseless environment, while the latter shows the estimation of their second-order moments in a noisy environment. The variance prediction has a small delay in tracking rapid changes, and the update step yields a better estimate of the squared absolute value in high-energy bins. However, when noisy observations are employed, low-energy bins may be below the noise level, and thus the estimation may be less accurate (for both the MSTF-GARCH approach and the decision-directed approach). Figures 4.3 and 4.4 demonstrate that the proposed MSTF-GARCH model, when compared to a single-regime model or to the decision-directed approach, improves the variance prediction and the squared-absolute-value estimation of speech signals in the STFT domain. Still, one needs to derive a frequency-dependent model and to estimate the signal presence probability in each time-frequency bin of the noisy speech signal based on the proposed model, which is a subject for further research.

4.7 Conclusions

We have proposed a statistical model for nonstationary processes with a time-varying volatility structure in the STFT domain. Exploiting the advantages of both the conditional heteroscedasticity structure of GARCH models and the time-varying characteristics of hidden Markov chains, we model the expansion coefficients as a multivariate, complex GARCH process with Markov-switching regimes. The correlation between successive coefficients in the time-frequency domain is taken into consideration by using the GARCH formulation, which specifies the conditional variance as a linear function of its past values and past squared innovations. The time-varying structure of the conditional variance is determined by a hidden Markov chain which allows a different GARCH formulation in each state.

We showed that an ML estimate of the model can be practically obtained from training signals (assuming that the number of states is known), and developed a recursive algorithm for estimating the signal and its conditional variance in the STFT domain from its noisy observations. The conditional variance is recursively estimated for any regime by iterating propagation and update steps, while the evaluation of the regime conditional probabilities is based on the recursive correlation of the process. Experimental results demonstrate the improved performance of the proposed recursive algorithm compared to using an estimator which assumes a stationary process, even when the number of assumed regimes is smaller than the true number. When the number of assumed regimes approaches the true one, the recursive estimator yields restoration results comparable to those achievable by using the true model parameters. The conditional variance of an MSTF-GARCH process, as well as the instantaneous SNR in each frame, changes over time. It is demonstrated that the recursive estimation approach suffers a relatively small performance degradation compared to the theoretical estimation limit in the MMSE sense. Performance evaluation with real speech signals demonstrates better variance estimation when using a multi-regime model, compared to using a single-regime model, and improved squared-absolute-value estimation in a noisy environment compared to using the decision-directed approach.

Several extensions of this work, which may be interesting for further research, include analysis of the algorithm's sensitivity to the number of assumed states, the parameter values and the training set; generalization of the multivariate complex Markov-switching GARCH model such that the conditional covariance matrix is not necessarily diagonal and the correlation between distinct frequency bins is also taken into account; and, finally, estimation of the signal presence probability in the time-frequency domain and modification of the recursive signal estimation algorithm under signal presence uncertainty.

4.A Application of Markov-Switching GARCH Model to Speech Enhancement in Subbands2

In this appendix, we introduce an application of the Markov-switching GARCH model to spectral speech enhancement. A GARCH model is utilized with Markov-switching regimes, where the parameters are assumed to be frequency variant. The model parameters are evaluated in each frequency subband, and a special state (regime) is defined for the case where speech coefficients are absent or below a threshold level. The problem of speech enhancement under speech presence uncertainty is addressed, and it is shown that a soft voice activity detector may be inherently incorporated within the algorithm. Experimental results demonstrate the potential of the proposed model to improve noise reduction while retaining weak components of the speech signal.

4.A.1 Introduction

Statistical modeling of speech signals in the short-time Fourier transform (STFT) domain is of much interest in many speech enhancement applications. The Gaussian model [33] enables the derivation of useful estimators for the speech expansion coefficients, such as the minimum mean-square error (MMSE) estimator of the short-term spectral amplitude (STSA), as well as the MMSE estimator of the log-spectral amplitude (LSA) [33, 34]. Recently, a generalized autoregressive conditional heteroscedasticity (GARCH) model has been introduced for statistically modeling speech signals in the STFT domain [24]. However, that model assumes that the parameters are both time and frequency invariant, and it also requires an independent detector for speech activity in the time-frequency domain. A Markov-switching time-frequency GARCH (MSTF-GARCH) model has been proposed in [127] for modeling nonstationary signals in the time-frequency domain. Accordingly, the parameters are allowed to change in time according to the state of a hidden Markov chain (e.g., switching between speech phonemes), but the parameters are still frequency invariant. The model is estimated from training signals using the maximum likelihood (ML) approach, and a recursive algorithm has been derived for conditional variance estimation and signal reconstruction from noisy observations.

However, not only may different phonemes result in different GARCH parameters; speech signals are generally also characterized by different volatility and energy levels in different frequency bands. Therefore, different parameters may better represent different frequency subbands.

2This appendix is based on [133].

In this appendix, we modify the MSTF-GARCH model by assuming different Markov chains in distinct subbands, with identical state transition probabilities. The GARCH parameters are state dependent and frequency variant. We define an additional state for the case where speech coefficients are absent (or below a certain threshold level) and introduce a parameter estimation method which is computationally more efficient than the traditional ML approach. Furthermore, the probability of the speech-absence state, which is naturally generated in the reconstruction algorithm, can be used as a soft voice activity detector. Experimental results demonstrate improved noise reduction performance while preserving weak components of the speech signal.

Section 4.A.2 introduces the statistical model. In Section 4.A.3, we show how the model parameters can be estimated, and in Section 4.A.4, we derive the speech enhancement algorithm based on the proposed model. Finally, in Section 4.A.5 we evaluate the performance of the proposed algorithm.

4.A.2 Model formulation

Let $\{X_{tk} \mid t = 0, 1, \ldots, T-1\,,\; k = 0, 1, \ldots, K-1\}$ denote the coefficients of a speech signal in the STFT domain, where $t$ is the time-frame index and $k$ is the frequency-bin index. Let $\{v_{tk}\}$ be iid complex Gaussian random variables with zero mean and unit variance, let $\kappa_n$ denote the $n$th frequency subband with $n \in \{1, 2, \ldots, N\}$ and $N < K$, and let $s_t$ denote a realization of the state $S_t(\kappa_n)$ of subband $\kappa_n$ at frame $t$. We assume that the speech coefficients in each subband follow

$$X_{tk} = \sqrt{\lambda_{tk|t-1,s_t}}\; v_{tk}\,, \qquad k \in \kappa_n\,, \qquad (4.62)$$
$$\lambda_{tk|t-1,s_t} = \lambda_{\min,n,s_t} + \alpha_{n,s_t} |X_{t-1,k}|^2 + \beta_{n,s_t}\left(\lambda_{t-1,k|t-2,s_{t-1}} - \lambda_{\min,n,s_{t-1}}\right)\,, \qquad (4.63)$$
where $\lambda_{\min,n,s_t} > 0$ and $\alpha_{n,s_t}, \beta_{n,s_t} \ge 0$ are sufficient constraints for the positivity of the one-frame-ahead conditional variance, given that the initial conditions satisfy $\lambda_{0k|-1,s_0} \ge \lambda_{\min,n,s_0}$ for all $k \in \kappa_n$ and $s_0 = 0, 1, \ldots, m$. Note that the model formulation in [127] is slightly different. We assume that the parameters are frequency dependent, while each $\lambda_{\min,n,s_t}$ defines the minimum value of the conditional variance in subband $\kappa_n$ under $S_t(\kappa_n) = s_t$. Let $a_{s_{t-1},s_t} \triangleq p(S_t = s_t \mid S_{t-1} = s_{t-1})$, let $\pi_s$ denote the stationary probability of state $s$, and let $\Psi$ be an $(m+1) \times (m+1)$ matrix with elements
$$\psi_{s+1,\tilde{s}+1} = a_{s,\tilde{s}}\, (\alpha_{n,s} + \beta_{n,s})\, \frac{\pi_{\tilde{s}}}{\pi_s}\,, \qquad s, \tilde{s} = 0, 1, \ldots, m\,. \qquad (4.64)$$
Then a necessary and sufficient condition for asymptotic wide-sense stationarity of the model defined in (4.62) and (4.63) is $\rho(\Psi) < 1$, where $\rho(\cdot)$ denotes the spectral radius [128]. This condition is also necessary to ensure a finite second-order moment for the process. The unconditional expectation of the state-dependent one-frame-ahead conditional variance follows

$$\begin{aligned} E\left\{\lambda_{tk|t-1,s_t}\right\} = \lambda_{\min,n,s_t} &+ \alpha_{n,s_t} E\left\{|X_{t-1,k}|^2 \mid s_t\right\} \\ &+ \beta_{n,s_t} E\left\{\lambda_{t-1,k|t-2,S_{t-1}} \mid s_t\right\} - \beta_{n,s_t} E\left\{\lambda_{\min,n,S_{t-1}} \mid s_t\right\} \qquad (4.65) \end{aligned}$$
with

$$E\left\{\lambda_{\min,n,S_{t-1}} \mid s_t\right\} = \sum_{s_{t-1}} p(s_{t-1} \mid s_t)\, \lambda_{\min,n,s_{t-1}} = \sum_{s_{t-1}} a_{s_{t-1},s_t}\, \frac{\pi_{s_{t-1}}}{\pi_{s_t}}\, \lambda_{\min,n,s_{t-1}}\,. \qquad (4.66)$$
Therefore, the stationary variance of the process is given by (see [128])

$$\lim_{t\to\infty} E\left\{|X_{tk}|^2\right\} = \boldsymbol{\pi}\,(I_{m+1} - \Psi)^{-1}\, \tilde{\boldsymbol{\lambda}}_{\min,n}\,, \qquad (4.67)$$
where $\boldsymbol{\pi}$ is a row vector of the stationary probabilities, $I_{m+1}$ is the identity matrix of order $m+1$,
$$\tilde{\boldsymbol{\lambda}}_{\min,n} \triangleq \left[\tilde{\lambda}_{\min,n,0}, \tilde{\lambda}_{\min,n,1}, \ldots, \tilde{\lambda}_{\min,n,m}\right]^T \qquad (4.68)$$
and
$$\tilde{\lambda}_{\min,n,s} \triangleq \lambda_{\min,n,s} - \frac{\beta_{n,s}}{\pi_s} \sum_{\tilde{s}} \pi_{\tilde{s}}\, a_{s,\tilde{s}}\, \lambda_{\min,n,\tilde{s}}\,. \qquad (4.69)$$

4.A.3 Model estimation

The estimation of a GARCH model with Markov regimes is generally obtained from a training set using the ML approach [5, 13]. However, the maximization of the likelihood function is numerically unstable for multi-regime processes, and generally only a local maximum can be obtained. Assuming an $(m+1)$-state Markov chain with a GARCH model of order $(1,1)$ in each regime, the maximization involves $(m+1)^2$ variables for the transition probabilities and an additional $3(m+1)$ variables for the GARCH parameters (three in each regime). Speech signals in the STFT domain demonstrate different levels of magnitude in different subbands, and the coefficients are generally sparse. Therefore, we limit the conditional variances in each subband to a dynamic range of $\eta_g$ dB and define a special state for the speech-absence hypothesis. Let $\zeta_g \triangleq \max_{t,k} |X_{tk}|^2$ and $\zeta_n \triangleq \max_{t,\, k\in\kappa_n} |X_{tk}|^2$ denote the global maximum energy and the local maximum energy of the coefficients (in subband $\kappa_n$), respectively. Then, for the speech-absence state (namely, $s_t = 0$), we set

$$\lambda_{\min,n,0} = 10^{\log_{10}\zeta_g - \eta_g/10}\,, \qquad \alpha_{n,0} = \beta_{n,0} = 0\,. \qquad (4.70)$$

Under speech presence, a local dynamic range of $\eta_\ell$ dB ($\eta_\ell < \eta_g$) is assumed for the conditional variances. Furthermore, the parameters $\lambda_{\min,n,s}$, $s > 0$, are chosen to enable the tracking of transients between different magnitude levels, which result in switching of the active state. Without loss of generality, we sort the states according to the minimum variance level, such that

$$\lambda_{\min,n,1} = \max\left\{\lambda_{\min,n,0}\,,\; 10^{\log_{10}\zeta_n - \eta_\ell/10}\right\}\,, \qquad (4.71)$$
and for $s = 2, \ldots, m$, the $\lambda_{\min,n,s}$ are log-spaced between $\lambda_{\min,n,1}$ and $\zeta_n$. Each state practically represents a different floor level for the spectral coefficients' variance. The parameters

$\alpha_{n,s}, \beta_{n,s}$ for $s > 0$ set the volatility level of the conditional variance, and they are chosen as follows. Assuming an immutable state $s$, the stationary variance follows

$$\lambda_{\infty,n,s} \triangleq \lim_{t\to\infty,\, k\in\kappa_n} \lambda_{tk|t-1,s} = \lambda_{\min,n,s}\, \frac{1 - \beta_{n,s}}{1 - \alpha_{n,s} - \beta_{n,s}}\,, \qquad (4.72)$$
provided that $\alpha_{n,s} + \beta_{n,s} < 1$. Since different states are related to different dynamic ranges in ascending order, we constrain $\lambda_{\infty,n,s} \le \lambda_{\min,n,s+1}$ and therefore
$$\frac{1 - \beta_{n,s}}{1 - \alpha_{n,s} - \beta_{n,s}} \le \frac{\lambda_{\min,n,s+1}}{\lambda_{\min,n,s}}\,. \qquad (4.73)$$
The autoregressive parameters $\beta_{n,s}$ are chosen experimentally, while the moving-average parameters $\alpha_{n,s}$ are chosen to satisfy equality in (4.73). Although the clean signal is assumed to be available for the model estimation, only the high-energy values in each subband are actually needed. These values can be practically estimated from the noisy coefficients using the spectral subtraction approach. The state transition probabilities can be estimated from test signals such that each active state is determined by the energy level of the subband.
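The following sketch illustrates the parameter selection described above: the speech-absence floor (4.70), the log-spaced floors starting from (4.71), and $\alpha_{n,s}$ chosen to satisfy equality in (4.73) for a user-chosen $\beta_{n,s}$. The default dynamic ranges and $\beta$ follow the values quoted in Section 4.A.5; the handling of the top state, and all names, are choices made for this sketch only.

```python
# Minimal sketch of subband parameter selection per (4.70)-(4.73); values illustrative.
import numpy as np

def subband_parameters(zeta_g, zeta_n, m=3, eta_g=50.0, eta_l=20.0, beta=0.8):
    """Return (lambda_min, alpha, beta) arrays of length m+1 for one subband."""
    lam_min = np.empty(m + 1)
    lam_min[0] = zeta_g * 10 ** (-eta_g / 10)                    # speech-absence floor, eq. (4.70)
    lam_min[1] = max(lam_min[0], zeta_n * 10 ** (-eta_l / 10))   # eq. (4.71)
    lam_min[1:] = np.logspace(np.log10(lam_min[1]), np.log10(zeta_n), m)  # log-spaced floors
    alphas = np.zeros(m + 1)                                     # alpha_{n,0} = beta_{n,0} = 0
    betas = np.zeros(m + 1)
    for s in range(1, m + 1):
        # ratio of consecutive floors; the top state reuses the previous ratio (sketch choice)
        ratio = lam_min[s + 1] / lam_min[s] if s < m else lam_min[s] / lam_min[s - 1]
        # equality in (4.73): (1 - beta) / (1 - alpha - beta) = ratio  =>  solve for alpha
        alphas[s] = (1.0 - beta) * (1.0 - 1.0 / ratio)
        betas[s] = beta
    return lam_min, alphas, betas
```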

4.A.4 Spectral enhancement of noisy speech

Let $D_{tk}$ denote the spectral coefficients of a noise signal which is uncorrelated with the speech signal, and assume that $D_{tk} \sim \mathcal{CN}(0, \sigma_{tk}^2)$. Let $Y_{tk} = X_{tk} + D_{tk}$ be the noisy observations and let $\mathcal{Y}^t \triangleq \{Y_{\tau k} \mid \tau = 0, 1, \ldots, t\,,\; k = 0, 1, \ldots, K-1\}$ denote the set of the observed coefficients up to time $t$. The noise variance $\sigma_{tk}^2$ is assumed to be known, and it can be practically estimated using the improved minima-controlled recursive averaging approach [71]. Reconstruction of the one-frame-ahead conditional variances of the speech coefficients is carried out recursively for each state by

$$\begin{aligned} \hat{\lambda}_{tk|t-1,s_t} = \lambda_{\min,n,s_t} &+ \alpha_{n,s_t} E\left\{|X_{t-1,k}|^2 \mid \mathcal{Y}^{t-1}, s_t\right\} \\ &+ \beta_{n,s_t} E\left\{\lambda_{t-1,k|t-2,S_{t-1}} \mid \mathcal{Y}^{t-1}, s_t\right\} - \beta_{n,s_t} E\left\{\lambda_{\min,n,S_{t-1}} \mid \mathcal{Y}^{t-1}, s_t\right\}\,, \qquad (4.74) \end{aligned}$$
where

$$E\left\{|X_{t-1,k}|^2 \mid \mathcal{Y}^{t-1}, s_t\right\} = \sum_{s_{t-1}} p\left(s_{t-1} \mid s_t, \mathcal{Y}^{t-1}\right) E\left\{|X_{t-1,k}|^2 \mid \mathcal{Y}^{t-1}, s_{t-1}\right\} \triangleq \sum_{s_{t-1}} p\left(s_{t-1} \mid s_t, \mathcal{Y}^{t-1}\right) \hat{\lambda}_{t-1,k|t-1,s_{t-1}}\,, \qquad (4.75)$$
$$E\left\{\lambda_{t-1,k|t-2,S_{t-1}} \mid \mathcal{Y}^{t-1}, s_t\right\} \simeq \sum_{s_{t-1}} p\left(s_{t-1} \mid s_t, \mathcal{Y}^{t-1}\right) \hat{\lambda}_{t-1,k|t-2,s_{t-1}}\,, \qquad (4.76)$$
and
$$E\left\{\lambda_{\min,n,S_{t-1}} \mid \mathcal{Y}^{t-1}, s_t\right\} = \sum_{s_{t-1}} p\left(s_{t-1} \mid s_t, \mathcal{Y}^{t-1}\right) \lambda_{\min,n,s_{t-1}}\,. \qquad (4.77)$$
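A minimal sketch of the per-state variance reconstruction (4.74)-(4.77) for one subband and one frame is given below; the backward state probabilities are assumed to be available from the recursion of [127], and all names and shapes are illustrative.

```python
# Minimal sketch of (4.74)-(4.77): mix past per-state estimates through the backward
# probabilities p(s_{t-1} | s_t, Y^{t-1}), then apply the GARCH recursion with floors.
import numpy as np

def subband_variance_propagation(lam_min, alpha, beta, back_prob,
                                 lam_prev_updated, lam_prev_pred):
    """
    lam_min, alpha, beta:  (m+1,) GARCH parameters of the subband
    back_prob:             (m+1, m+1), back_prob[s_prev, s] = p(s_{t-1}=s_prev | s_t=s, .)
    lam_prev_updated:      (m+1, K_n) per-state estimates of |X_{t-1,k}|^2, eq. (4.75)
    lam_prev_pred:         (m+1, K_n) per-state lambda_{t-1,k|t-2}, eq. (4.76)
    returns:               (m+1, K_n) per-state lambda_{tk|t-1,s_t}, eq. (4.74)
    """
    mixed_updated = back_prob.T @ lam_prev_updated   # eq. (4.75)
    mixed_pred = back_prob.T @ lam_prev_pred         # eq. (4.76)
    mixed_floor = back_prob.T @ lam_min              # eq. (4.77)
    return (lam_min[:, None] + alpha[:, None] * mixed_updated
            + beta[:, None] * (mixed_pred - mixed_floor[:, None]))
```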

A detailed algorithm for the conditional variance restoration is described in [127]. Having an estimate of the speech coefficients' second-order moment under each state, $\hat{\lambda}_{tk|t,s_t}$, estimates of the speech coefficients are obtained by minimizing the mean-square error of the log-spectral amplitude (LSA). Let

$$\hat{\xi}_{tk,s_t} \triangleq \frac{\hat{\lambda}_{tk|t,s_t}}{\sigma_{tk}^2}\,, \qquad \hat{\vartheta}_{tk,s_t} \triangleq \frac{\hat{\xi}_{tk,s_t}}{1 + \hat{\xi}_{tk,s_t}} \cdot \frac{|Y_{tk}|^2}{\sigma_{tk}^2}\,. \qquad (4.78)$$
Then, the LSA estimate of the speech coefficients is given by

$$\hat{X}_{tk} = Y_{tk} \prod_{s_t} G\left(\hat{\xi}_{tk,s_t}, \hat{\vartheta}_{tk,s_t}\right)^{p(s_t \mid \mathcal{Y}^t)}\,, \qquad (4.79)$$
where
$$G(\xi, \vartheta) = \frac{\xi}{1+\xi} \exp\left(\frac{1}{2}\int_\vartheta^\infty \frac{e^{-t}}{t}\,dt\right) \qquad (4.80)$$
is the LSA gain function [34], and the state probabilities $p(s_t \mid \mathcal{Y}^t)$ are evaluated according to [127].

4.A.5 Experimental results and discussion

In this section, we demonstrate the application of the proposed model to speech enhancement and to speech presence probability estimation. The enhancement evaluation includes two objective quality measures: segmental SNR and log-spectral distortion (LSD). The speech signals used in our evaluation are taken from the TIMIT database. The signals are sampled at 16 kHz, degraded by a nonstationary factory noise, and transformed into the STFT domain using half-overlapping Hamming windows of 32 msec length. Twenty subbands are considered, with global and local dynamic ranges of $\eta_g = 50$ dB and $\eta_\ell = 20$ dB, and four-state Markov chains (i.e., $m = 3$) for each subband. The autoregressive parameters used in our simulations are $\beta_{n,s} = 0.8$ for all $n$ and $s > 0$. In each subband, the state persistence probability is 0.8, and the $a_{s,\tilde{s}}$ are chosen equal for all $s \ne \tilde{s}$. Figure 4.5 demonstrates the spectrograms and waveforms of a clean signal, a noisy signal with an SNR of 5 dB, and the enhanced signal obtained by the proposed algorithm. It shows that the background noise is highly attenuated while weak speech components are retained, even when noise transients occur. Furthermore, the segmental SNR and the LSD are improved.


Figure 4.5: Speech spectrograms and waveforms. (a) Clean signal: "Draw every outer line."; (b) speech corrupted by factory noise with 5 dB SNR (LSD = 6.68 dB, SegSNR = 0.05 dB); (c) speech reconstructed by using a 4-state model (LSD = 3.14 dB, SegSNR = 6.76 dB).


Figure 4.6: Conditional speech presence probability obtained by the proposed algorithm and by the decision-directed based algorithm.

A subjective study of speech spectrograms and informal listening tests confirms that the quality of the enhanced speech is improved by using frequency-dependent parameters which are derived from the different energy levels. The conditional speech presence probability resulting from the enhancement algorithm is compared with the statistical model-based voice activity detector (with hang-over) of Sohn et al. [52] when applied to subbands. The latter evaluates the conditional likelihood $\mathcal{L}_t \triangleq p(\mathcal{Y}^t \mid S_t \ne 0)/p(\mathcal{Y}^t \mid S_t = 0)$ by utilizing the decision-directed approach for the a priori SNR estimation (assuming only two states). The conditional speech presence probability is obtained by $p(S_t \ne 0 \mid \mathcal{Y}^t) = \mu\mathcal{L}_t/(1 + \mu\mathcal{L}_t)$, where $\mu \triangleq p(S_t \ne 0)/p(S_t = 0)$ is the a priori probability ratio. Figure 4.6 demonstrates the speech presence probabilities achieved when both algorithms are applied to a speech signal corrupted by white Gaussian noise with an SNR of 15 dB. The instantaneous SNR is defined as the ratio between the norms of the clean signal and the noise signal in each subband. It can be seen that the speech presence probability derived from our proposed algorithm yields a higher dynamic range for the probabilities and much lower values for low-energy coefficients. Furthermore, the probabilities ascribed to each instantaneous SNR have a higher variance, resulting from the Markovian nature of the model.

Chapter 5

State Smoothing in Markov-Switching GARCH Models1

In this chapter, we address the problem of state smoothing in path-dependent Markov-switching generalized autoregressive conditional heteroscedasticity (GARCH) processes. We develop a smoothing algorithm which extends the forward-backward recursions of Chang and Hancock and the stable backward recursion of Lindgren, Askar and Derin. Two recursive steps are derived for the evaluation of conditional densities of future observations. The first step is an upward recursion which manipulates the future observations for the evaluation of their conditional densities, and the second step is a backward recursion which integrates over the possible future paths. Experimental results demonstrate the improvement in performance compared to using causal estimation.

5.1 Introduction

State estimation is of both theoretical and practical importance whenever the underlying statistical model switches regimes over time [5, 129]. State smoothing (i.e., noncausal state estimation) of hidden Markov processes (HMPs) was originally introduced by Chang and Hancock [118]. Their solution for estimating the noncausal state probability, which is implemented using forward-backward recursions, decouples a forward recursion for the evaluation of the joint probability density of the current state and all observations up to the same time, and a backward recursion for obtaining the future observations' density given the current state.

1This chapter is based on [130].

Lindgren [119] and Askar and Derin [120] developed an alternative, stable backward recursion for state smoothing in HMPs. Kim [134] extended the stable backward recursion to nonmemoryless autoregressive hidden Markov processes (AR-HMPs), where both the current state (regime) and a finite set of past values are required for the conditional density evaluation; see also [5, Chap. 22].

Generalized autoregressive conditional heteroscedasticity (GARCH) models, and also Markov-switching GARCH (MS-GARCH) models, are widely used in the field of econometrics for deriving volatility forecasts of rates [7, 8, 12, 13], and they have recently been utilized in several signal processing applications. In [135], GARCH modeling has been applied to spatially non-uniform noise in multichannel signal processing. In [26], a regime-switching GARCH model has been utilized for speech recognition, and a complex-valued GARCH model has been proposed in [24, 25] for modeling speech signals in the short-time Fourier transform (STFT) domain for the application of speech enhancement. Generally, when incorporating GARCH processes with switching regimes, the volatility evaluation requires knowledge of the pertinent history of the regime-switching GARCH process, including the regime path [12, 13]. Properties of path-dependent MS-GARCH models have been studied by Francq et al. [122]. They showed that, in order to estimate the model parameters, the conditional likelihood depends on all the possible paths, and that for a Markov-switching ARCH model (in which case there is no dependency on past active regimes) the forward-backward recursions can be employed for the conditional likelihood evaluation. The complex-valued GARCH model has been shown to be useful in speech enhancement applications [24, 25]. Motivated by extending the dynamic formulation of the time-frequency GARCH model and enabling a better fit for processes with more complicated time-varying statistical behavior, a Markov-switching time-frequency GARCH (MSTF-GARCH) model has been introduced [127]. However, existing smoothing solutions are inapplicable in the case of a path-dependent MS-GARCH model, since both past observations and the regime path are required for the conditional variance estimation, whereas existing smoothing techniques rely on the assumption that, given the current state, past active regimes are statistically independent of future densities.

In this chapter, we develop a state smoothing approach for MSTF-GARCH processes. The dependency of the conditional variance on past observations and past active regimes are taken into consideration as we generalize both the forward-backward recursions of Chang and Hancock [118] and the stable backward recursion of Lindgren [119] and Askar and Derin [120]. We derive two recursive steps for the evaluation of conditional densities of future observations. The first step is an upward recursion which manipulates the future observations for the evaluation of their conditional densities, corresponding to all possible future paths. The second step is a backward recursion which integrates over these paths to evaluate the future densities required for the noncausal state probability. The compu- tational complexity of the generalized recursions grows exponentially with the number of future observations employed for the fixed-lag smoothing. However, experimental results demonstrate that the significant part of the improvement in performance, compared to using causal estimation, is achieved by considering a few future observations. The organization of this chapter is as follows: In Section 5.2, we introduce the Markov- switching time-frequency GARCH model and formulate the state smoothing problem. In Section 5.3 we develop generalized forward-backward recursions as well as generalized stable backward recursion, and derive our noncausal state probability approach. Finally, in Section 5.4 we provide experimental results which demonstrate state smoothing for noisy Markov-switching time-frequency GARCH processes.

5.2 Problem formulation

Let X CK be a K-dimensional random vector at a discrete time t, and let X , k t ∈ tk ∈ 0, ..., K 1 be its kth element. Let t2 = X t t t represent the data set from { − } Xt1 { t | 1 ≤ ≤ 2} time t up to t and let t , t. Let S denote the (unobserved) state at time t and 1 2 X X0 t let st be a realization of St, assuming St is a first-order Markov chain with transition probabilities a , p (S = s S = s ). Let t , t, t denote all available stst+1 t+1 t+1 | t t I {X S } information up to time t, where t , t = s , ..., s . We assume that X are generated S S0 { 0 t} tk by an m-state Markov-switching time-frequency GARCH process of order (1, 1) which follows [127]: X = λ V , k =0, ..., K 1 , (5.1) tk tk|t−1 tk − q 94 CHAPTER 5. STATE SMOOTHING IN MS-GARCH MODELS where V are iid complex-valued random variables with zero-mean, unit variance and { tk} some known probability density. Given the state st, the conditional variance of Xtk, λ = E X 2 t−1,s , is a linear function of the previous conditional variance tk|t−1,st {| tk| | I t} and squared absolute value:

λ λ = ξ + α X 2 + β λ , (5.2) tk|t−1 ≡ tk|t−1,st st st | t−1,k| st t−1,k|t−2 where ξ > 0, α 0, and β 0, s = 1, ..., m are sufficient constrains for the s s ≥ s ≥ positivity of the conditional variance. Let Ψ be an m-by-m matrix with elements

ψij = aji(αi + βi)πj/πi (5.3) where π = p(S = i) is the stationary probability of state i, and let ρ( ) represent the i t · spectral radius of a matrix. Then, a necessary and sufficient condition for the process defined in (5.1) and (5.2) to be asymptotically wide-sense stationary is ρ(Ψ) < 1 [128].

Let Yt = Xt +Dt denote the observed noisy signal, where Dt denotes the noise process which is uncorrelated with the signal Xt, and let Dt be a zero-mean complex-valued Gaus- sian random process with a diagonal covariance matrix E D DH = diag σ2 , where t t { } ( )H denotes the Hermitian transpose operation. The state conditional probability of a · Markov-switching model, p (s τ ), is of considerable theoretical and practical importance t|Y for signal restoration and state sequence estimation (e.g., [127,129]). Solutions of the state smoothing problem, i.e., τ > t, are normally obtained for HMPs using the forward-backward recursions [118] or the stable backward recursion [119, 120]. Extensions of these recursions for nonmemoryless AR-HMPs [134], [5, Chap. 22], are based on the quality that st and a finite set of past clean observations give complete statistical knowledge of future densities. However, in case of a path-dependent MS-GARCH model, a recursive formulation specifies the conditional distribution of the process as dependent on both past observations and the regime path, and therefore existing smoothing solutions are inapplicable.

5.3 State probability smoothing

In this Section, we develop the noncausal state probability for the model defined in (5.1) and (5.2). The smoothed probability is derived by generalizing both the forward-backward 5.3. STATE PROBABILITY SMOOTHING 95 recursions [118] and the stable backward recursion [119,120].

5.3.1 Generalized forward-backward recursions

Assume that the conditional variance of the process is recursively estimated for any given state (e.g., as proposed in [127]) and assume that the set of the recursively estimated conditional variances at time t, Λˆ , λˆ S =1, ..., m , with the observed signal t t|t−1,St | t Yt are sufficient statistics for the next conditionaln variance estimationo for any given regime

ˆ τ ∗ τ2 τ1 [24,127]. Let λ 2 = E Xτ2 X , , τ2 τ1 > τ0 denote the vector of τ2|τ1,Sτ0 ⊙ τ2 | Sτ0 Y ≥ estimated conditional variances at time τ2 based on the observations up to time τ1 and on the given set of active regimes τ2 , where denotes a term-by-term multiplication and ∗ Sτ0 ⊙ denotes complex conjugation. Let

g λ , Y , E X X∗ S = s , λ , Y (5.4) t|t−1,st t t ⊙ t | t t t|t−1,st t   where the function g( ) is determined based on the statistical model of V [24]. Define · { tk} the generalized forward density by

α s , t , f s , Λˆ , Y (5.5) t Y t t t    and the generalized backward density by

β t+L t+l−1, t+l−1 , f t+L t+l−1, Λˆ , t+l−1 . (5.6) Yt+l | St Y Yt+l | St t Yt    Then, by substituting l = 1 we have

f s , t+L Λˆ = α s , t β t+L s , t , (5.7) t Yt | t t Y Yt+1 | t Y     and the noncausal state probability can be obtained by

p s t+L = p s Λˆ , t+L t|Y t | t Yt t t+L t  α (st, ) β  st, = Y Yt+1 | Y . (5.8) α (s , t) β t+L s , t st t Y Yt+1 | t Y Proposition 5.1. The generalized forwardP density of an MSTF-GARCH(1,1) process, α (s , t), satisfies the following recursion: t Y t t−1 α s , = f Y s , λˆ α s , a − , (5.9) t Y t | t t|t−1,st t−1 Y st 1st st−1    X  with the initial condition α (s , Y )= p(s )f (Y s ). 0 0 0 0 | 0 96 CHAPTER 5. STATE SMOOTHING IN MS-GARCH MODELS

Proof. The generalized forward density is obtained by

α s , t = f Y s , Λˆ f s , Λˆ . (5.10) t Y t | t t t t      Given the active regime, the state-dependent conditional variance is sufficient for the conditional density. Furthermore, Λˆ t and Λˆ t−1, Yt−1 represent the same statistical information. Hence n o

α s , t = f Y s , λˆ f s , Λˆ , Y , (5.11) t Y t | t t|t−1,st t t−1 t−1      where

t−1 f s , Λˆ , Y = α s , a − . (5.12) t t−1 t−1 t−1 Y st 1st st−1   X  Substituting (5.12) into (5.11) we obtain the recursive formulation for the generalized forward density2.

Proposition 5.2. The generalized backward density of an MSTF-GARCH(1,1) process, β t+L s , t , satisfies the following two-step recursion: Yt+1 | t Y Step I: Forl =1, ..., L and all t+l: St

ˆ + ˆ + −1 ˆ + −1 λ t l = ξs + 1 + αs + λ t l + βst+1 λ t l (5.13) t+l|t+l−1,St t l t l t+l−1|t+l−1,St t+l−1|t+l−2,St

λˆ t+l = g λˆ t+l, Yt+l . (5.14) t+l|t+l,St t+l|t+l−1,St   Step II: For l = L, ..., 1 and all t+l: St

f t+L t+l, λˆ , t+l−1 = β t+L t+l, t+l f Y t+l, λˆ , t+l−1 Yt+l | St t|t−1,st Yt Yt+l+1 | St Y t+l | St t|t−1,st Yt     (5.15) β t+L t+l−1, t+l−1 = f t+L t+l, λˆ , t+l−1 a , (5.16) Yt+l | St Y Yt+l | St t|t−1,st Yt st+l−1st+l st+l  X   with β t+L t+L, t+L =1 as the initial condition for the second step, and where 1 Yt+L+1 | St Y denotes a vector of ones. 

2The initial conditions for the generalized forward recursion have negligible effect on the conditional densities assuming an asymptotic stationary process which is sufficiently long. Therefore, the initial conditional variance f (Y s ) can be estimated by using the state-dependent stationary density of the 0 | 0 process. 5.3. STATE PROBABILITY SMOOTHING 97

Proof. The generalized backward density β t+L s , t = f t+L s , t can be ob- Yt+1 | t Y Yt+1 | t Y tained by   β t+L s , t = f t+L t+1, λˆ , Y a , (5.17) Yt+1 | t Y Yt+1 | St t|t−1,st t stst+1 st+1  X   where the multivariate density f t+L t+1, λˆ , Y in (5.17) can be obtained by Yt+1 | St t|t−1,st t   f t+L t+1, λˆ , Y = β t+L t+1, t+1 f Y t+1, λˆ , Y . (5.18) Yt+1 | St t|t−1,st t Yt+2 | St Y t+1 | St t|t−1,st t      From (5.17) and (5.18) we recursively obtain for any l =1, ..., L :

β t+L t+l−1, t+l−1 = f t+L t+l, λˆ , t+l−1 a (5.19) Yt+l | St Y Yt+l | St t|t−1,st Yt st+l−1st+l st+l  X   and

f t+L t+l, λˆ , t+l−1 = β t+L t+l, t+l f Y t+l, λˆ , t+l−1 . Yt+l | St t|t−1,st Yt Yt+l+1 | St Y t+l | St t|t−1,st Yt     (5.20) The conditional density f Y t+l, λˆ , t+l−1 in (5.20) is the density of the ob- t+l | St t|t−1,st Yt served data at time t + l conditioned on the regime path t+l, the recursively estimated St conditional variance at time t given st, and also on all observations from time t up to time t + l 1. This density has a diagonal covariance matrix with the following conditional − variance on its diagonal:

E Y Y∗ t+l, λˆ , t+l−1 t+l ⊙ t+l | St t|t−1,st Yt n 2 o = σ + λˆ t+l t+l|t+l−1,St = σ2 + ξ 1 + α E X X∗ t+l, λˆ , t+l−1 st+l st+l { t+l−1 ⊙ t+l−1 | St t|t−1,st Yt } + β E λ t+l−1, λˆ , t+l−2 . (5.21) st+l { t+l−1|t+l−2 | St t|t−1,st Yt }

The expected absolute squared value of the signal at a specific time given the active regime is independent of any future regimes, hence

∗ t+l t+l−1 ˆ ˆ + −1 E Xt+l−1 Xt+l−1 t , λt|t−1,st , t = λ t l ⊙ | S Y t+l−1|t+l−1,St n o = g λˆ t+l−1 , Yt+l−1 (5.22). t+l−1|t+l−2,St   Combining (5.22) with (5.21), we obtain Step I of the generalized backward recursion ((5.13) and (5.14)), and from (5.19) and (5.20) we obtain Step II ((5.15) and (5.16)). 98 CHAPTER 5. STATE SMOOTHING IN MS-GARCH MODELS

Step I is an upward recursion which manipulates the future observations for estimating their conditional variances corresponding to all possible regime sequences t+L. Step II St is a backward recursion, which integrates the Step I results to evaluate the generalized backward density. Each step of the generalized backward recursion is calculated for mL+1 regime sequences, and therefore the computational complexity increases exponentially with the delay L. However, as the correlation of the current state and future observations decreases along time, small values of L sufficiently enhance the chain sequence estimation, as can be seen in the experimental results.

5.3.2 Generalized stable backward recursion

The stable backward recursion is derived by using the smoothed probability of two se- quential states, which is given by [120]:

t+1 t+L t t+L f , f st+1 p t+1 t+L = St Yt+1 |Y |Y . (5.23) t t+L t S |Y f st+1, t+1   Y |Y  Under the assumption that Λˆ t, Yt are sufficient statistics for the next state-dependent conditional variance estimation,n weo obtain

f t+1, t+L t = f s , t+L s , t p s t St Yt+1 |Y t+1 Yt+1 | t Y t |Y  = f t+L t+1, t p s s , t p s t Yt+1 | St Y t+1 | t Y t |Y = f t+L t+1, λˆ  , Y a  ps Λˆ ,Y (5.24) Yt+1 | St t|t−1,st t stst+1 t | t t     and

f s , t+L t = f t+L s , t p s t t+1 Yt+1 |Y Yt+1 | t+1 Y t+1 |Y  = f t+L s , λˆ  p s Λˆ , Y . (5.25) Yt+1 | t+1 t+1|t,st+1 t+1 | t t     By substituting (5.24) and (5.25) into (5.23) and integrating out all states at time t + 1, we obtain the following backward recursion for the smoothed state probability:

t+L t+1 ˆ t+L f t+1 t , λt|t−1,st , Yt astst+1 p st+1 t+L Y | S |Y p st = p st Λˆ t, Yt ,(5.26) |Y |  t+L ˆ  ˆ  st+1 f t+1 st+1, λt+1|t,st+1 p st+1 Λt, Yt    X Y | |     where the conditional density f t+L t+1, λˆ , Y can be derived from the Yt+1 | St t|t−1,st t generalized backward recursion (5.13)-(5.16). However, the conditional density 5.4. EXPERIMENTAL RESULTS 99 f t+L s , λˆ in the denominator of (5.26) requires calculation of a similar Yt+1 | t+1 t+1|t,st+1 recursion which is not informed of the regime st. Although the stable backward recursion is known to be numerically more stable than the forward-backward recursions, the instability of the latter is insignificant for short delays and the former requires computation of the generalized backward recursion twice, one for evaluating f t+L t+1, λˆ , Y and one for f t+L s , λˆ . Yt+1 | St t|t−1,st t Yt+1 | t+1 t+1|t,st+1    

5.4 Experimental results

The generalized state smoothing has been applied to state detection in noisy MSTF- GARCH(1, 1) processes with 3 states and 5 to 15 dB signal-to-noise ratios (SNRs). Twenty random stationary models have been simulated with an unconditional Gaussian model and uniformly distributed parameters on the intervals (0, 1/3], (1/3, 2/3] and (2/3, 1] for each state respectively. For each model 20 signals are considered, each of dimension ˆ K = 100 and time length T = 100. The conditional variances λt|t−1,st are estimated using the recursive approach of [127]. Figure 5.1 shows the detection error rate p (ˆs = s ) for t 6 t casual estimation as well as for noncausal estimation with up to L = 4 samples delay. It can be seen that the state detection monotonically improves with the increase of the delay. However, the most significant improvement is achieved by using up to 2 future samples, and the contribution of additional future observations decays along time.

5.5 Conclusions

We have derived state smoothing for Markov-switching time-frequency GARCH process, in which case the conditional variances depend on both past observations and the regime path. Our noncausal state probability solution generalizes both the standard forward- backward recursions and the stable backward recursion of HMP by capturing both the signal correlation along time and its conditioning on the regime path. Accordingly, the backward recursion requires two recursive steps for evaluating the conditional density of the given future observations corresponding to all optional future paths. Although the computational complexity of the generalized backward recursion grows exponentially 100 CHAPTER 5. STATE SMOOTHING IN MS-GARCH MODELS

0.03

0.02 Error rate

0.01

0 1 2 3 4 Delay L (samples)

Figure 5.1: State smoothing error rate for 3-state MSTF-GARCH models with SNRs of 5 dB (triangle), 10 dB (asterisk) and 15 dB (circle). with the delay, a small number of future observations contribute with the most significant improvement to the state estimation. Combining the generalized recursions with the recursive signal restoration algorithm of [127] facilitates a noncausal signal restoration, which is a subject for further research. Chapter 6

Simultaneous Detection and Estimation Approach for Speech Enhancement1

In this chapter, we present a simultaneous detection and estimation approach for speech enhancement. A detector for speech presence in the short-time Fourier transform do- main is combined with an estimator, which jointly minimizes a cost function that takes into account both detection and estimation errors. Cost parameters control the trade- off between speech distortion, caused by missed detection of speech components, and residual musical noise resulting from false-detection. Furthermore, a modified decision- directed a priori signal-to-noise ratio (SNR) estimation is proposed for transient-noise environments. Experimental results demonstrate the advantage of using the proposed si- multaneous detection and estimation approach with the proposed a priori SNR estimator, which facilitate suppression of transient noise with a controlled level of speech distortion. In Appendix 6.B we formulate a speech enhancement problem under multiple hypothe- ses, assuming an indicator or detector for the transient noise presence is available in the short-time Fourier transform (STFT) domain. Hypothetical presence of speech or tran- sient noise is considered in the observed spectral coefficients, and cost parameters control the trade-off between speech distortion and residual transient noise. An optimal esti- mator, which minimizes the mean-square error of the log-spectral amplitude, is derived,

1This chapter is based on [136].

101 102 CHAPTER 6. SIMULTANEOUS DETECTION AND ESTIMATION while taking into account the probability of erroneous detection. Experimental results demonstrate the improved performance in transient noise suppression, compared to using the optimally-modified log-spectral amplitude estimator.

6.1 Introduction

Optimal design of efficient speech enhancement algorithms has attracted significant re- search effort for several decades. Speech enhancement systems often operate in the short- time Fourier transform (STFT) domain, where the speech spectral coefficients are esti- mated from the spectral coefficients of the degraded signal. The spectral coefficients of the speech signal are generally sparse in the STFT domain in the sense that speech is present only in some of the frames, and in each frame only some of the frequency-bins contain the significant part of the signal energy. However, existing algorithms often focus on estimating the spectral coefficients rather than detecting their existence. The spectral- subtraction algorithm [29, 30] contains an elementary detector for speech activity in the time-frequency domain, but it generates musical noise caused by falsely detecting noise peaks as bins that contain speech, which are randomly scattered in the STFT domain. Subspace approaches for speech enhancement [57, 59, 60, 104] decompose the vector of the noisy signal into a signal-plus-noise subspace and a noise subspace, and the speech spectral coefficients are estimated after removing the noise subspace. Accordingly, these algorithms are aimed at detecting the speech coefficients and subsequently estimating their values. McAulay and Malpass [32] were the first to propose a speech spectral es- timator under a two-state model. They derived a maximum likelihood (ML) estimator for the speech spectral amplitude under speech-presence uncertainty. Ephraim and Malah followed this approach of signal estimation under speech presence uncertainty and derived an estimator which minimizes the mean-square error (MSE) of the short-term spectral amplitude (STSA) [33]. In [49], speech presence probability is evaluated to improve the minimum MSE (MMSE) of the log-spectral amplitude (LSA) estimator, and in [38] a fur- ther improvement of the MMSE-LSA estimator is achieved based on a two-state model. Under speech absence hypothesis Cohen and Berdugo [38] considered a constant attenu- ation factor to enable a more natural residual noise, characterized by reduced musicality. 6.1. INTRODUCTION 103

Under slowly time-varying noise conditions, an estimator which minimizes the MSE of the STSA or the LSA under speech presence uncertainty may yield reasonable re- sults [33, 38]. However, under quickly time-varying noise conditions, abrupt transients may not be sufficiently attenuated, since speech is falsely detected with some positive probability. Reliable detectors for speech activity and noise transients are necessary to fur- ther attenuate noise transients without much degrading the speech components [107,137]. Despite the sparsity of speech coefficients in the time-frequency domain and the impor- tance of signal detection for noise suppression performance, common speech enhancement algorithms deal with speech detection independently of speech estimation. Even when a voice activity detector is available in the STFT domain (e.g., [51–56,108]), it is not straightforward to consider the detection errors when designing the optimal speech es- timator. High attenuation of speech spectral coefficients due to missed detection errors may significantly degrade speech quality and intelligibility, while falsely detecting noise transients as speech-contained bins, may produce annoying musical noise.

In this chapter, we present a novel formulation of the speech enhancement problem, which incorporates simultaneous operations of detection and estimation. A detector for the speech coefficients is combined with an estimator, which jointly minimizes a cost func- tion that takes into account both estimation and detection errors. Under speech-presence, the cost is proportional to a quadratic spectral amplitude (QSA) error [33], while under speech-absence, the distortion depends on a certain attenuation factor [29,38,70]. We derive a combined detector and estimator with cost parameters that enable to control the trade-off between speech distortion, caused by missed detection of speech components, and residual musical noise resulting from false-detection. The combined solution gen- eralizes the well-known STSA algorithm, which involves merely estimation under signal presence uncertainty. In addition, we propose a modification of the decision-directed a priori signal-to-noise ratio (SNR) estimator, which is suitable for transient-noise envi- ronments. Experimental results show that the simultaneous detection and estimation yields better noise reduction than the STSA algorithm while not degrading the speech signal. The advantage of using a suitable indicator for transient noise is demonstrated in a nonstationary noise environment, where the proposed algorithm facilitates suppression of transient noise with a controlled level of speech distortion. 104 CHAPTER 6. SIMULTANEOUS DETECTION AND ESTIMATION

The chapter is organized as follows. In Section 6.2, we briefly review classical speech enhancement under signal presence uncertainty. In Section 6.3, we reformulate the speech enhancement problem in the STFT domain as a simultaneous detection and estimation problem. In Section 6.4, we derive the combined solution for a QSA distortion function. In Section 6.5, we relate our proposed approach to the spectral-subtraction approach. In Section 6.6, we present an a priori SNR estimator suitable for transient noise envi- ronments, and in Section 6.7 we demonstrate the performance of the proposed approach compared to existing algorithms, both under stationary and transient-noise environments.

6.2 Classical speech enhancement

In this section, we present the classical approach for spectral speech enhancement in non- stationary noise environments, assuming that some indicator for transient noise activity is available. Let x (n) and d (n) denote speech and uncorrelated additive noise signals, and let y (n) = x (n)+ d (n) be the observed signal. Applying the STFT to the observed signal, we have

Yℓk = Xℓk + Dℓk , (6.1) where ℓ = 0, 1, ... is the time frame index and k = 0, 1, ..., K 1 is the frequency-bin − ℓk ℓk index. Let H1 and H0 denote, respectively, speech presence and absence hypotheses in the time-frequency bin (ℓ,k), i.e.,

ℓk H1 : Yℓk = Xℓk + Dℓk

ℓk H0 : Yℓk = Dℓk . (6.2)

We assume that the noise expansion coefficients can be represented as the sum of two

s t s uncorrelated noise components Dℓk = Dℓk + Dℓk, where Dℓk denotes a quasi-stationary t noise component and Dℓk denotes a highly nonstationary transient component. The tran- sient components are generally rare, but they may be of high energy and thus cause significant degradation to speech quality and intelligibility. However, in many applica- tions, a reliable indicator for the transient noise activity may be available in the system. For example, in an emergency car (e.g., police or ambulance) the engine noise may be 6.2. CLASSICAL SPEECH ENHANCEMENT 105 considered as quasi-stationary, but activating a siren results in a highly nonstationary noise which is perceptually very annoying. Since the sound generation in the siren is nonlinear, linear echo cancelers, e.g., [138], may be inappropriate. In a computer-based communication system, a transient noise such as a keyboard typing noise may be present in addition to quasi-stationary background office noise. Another example is a digital cam- era, where activating the lens-motor (zooming in/out) may result in high-energy transient noise components, which degrade the recorded audio. In the above examples, an indicator for the transient noise activity may be available, i.e., siren source signal, keyboard output signal and the lens-motor controller output. Furthermore, given that a transient noise source is active, a detector for the transient noise in the STFT domain may be designed and its spectrum can be estimated based on training data. The objective of a speech enhancement system is to reconstruct the spectral coefficients of the speech signal such that under speech-presence a certain distortion measure between the spectral coefficient and its estimate, d Xℓk, Xˆℓk , is minimized, and under speech- absence a constant attenuation of the noisy coefficient would be desired to maintain a natural background noise [38, 70]. Although the speech expansion coefficients are not necessarily present, most classical speech enhancement algorithms try to estimate the spectral coefficients rather than detecting their existence, or try to independently design detectors and estimators. The well-known spectral subtraction algorithm estimates the speech spectrum by subtracting the estimated noise spectrum from the noisy squared absolute coefficients [29, 30], and thresholding the result by some desired residual noise level. Thresholding the spectral coefficients is in fact a detection operation in the time- frequency domain, in the sense that speech coefficients are assumed to be absent in the low-energy time-frequency bins and present in noisy coefficients whose energy is above the threshold. McAulay and Malpass were the first to propose a two-state model for the speech signal in the time-frequency domain [32]. Accordingly, the MMSE estimator follows [115]

Xˆ = E X Y ℓk { ℓk | ℓk} = E X Y ,Hℓk p Hℓk Y . (6.3) ℓk | ℓk 1 1 | ℓk   The resulting estimator does not detect speech components, but rather, a soft-decision 106 CHAPTER 6. SIMULTANEOUS DETECTION AND ESTIMATION is performed to further attenuate the signal estimate by the a posteriori speech presence probability. Ephraim and Malah followed the same approach and derived an estimator which minimizes the MSE of the STSA under signal presence uncertainty [33]. Accord- ingly,

Xˆ = E X Y ,Hℓk p Hℓk Y . (6.4) ℓk | ℓk| | ℓk 1 1 | ℓk

 

ℓk Both in [32] and [33], under H0 the speech components are assumed zero and the a priori ℓk probability of speech presence is both time and frequency invariant, i.e., p H1 = p (H1). In [38, 49], the speech presence probability is evaluated for each frequency-bin  and time- frame to improve the performance of the MMSE-LSA estimator [34]. Further improvement

ℓk of the MMSE-LSA suppression rule can be achieved by considering under H0 a constant attenuation factor Gf << 1, which is determined by subjective criteria for residual noise naturalness, see also [70]. The OM-LSA estimator [38] is given by

ℓk ℓk p(H1 | Yℓk) 1−p H | Y Xˆ = exp E log X Y ,Hℓk (G Y ) ( 1 ℓk) . (6.5) ℓk | ℓk| | ℓk 1 f | ℓk|

  

Suppose that an indicator for the presence of transient noise components is available in a highly nonstationary noise environment, then high-energy transients may be attenuated by using one of the above-mentioned estimators (6.3)–(6.5) and heuristically setting the

ℓk a priori speech presence probability p H1 to a sufficiently small value. Unfortunately, this also results in suppression of desired speech components and intolerable degradation of speech quality. In general, an estimation-only approach under signal presence uncer-

ℓk tainty produces larger speech degradation for small p H1 , since the optimal estimate is attenuated by the a posteriori speech presence probability.  On the other hand, increasing ℓk p H1 prevents the estimator from sufficiently attenuating noise components. Integrat- ing a jointly optimal detector and estimator into the speech enhancement system may significantly improve the speech enhancement performance under highly non-stationary noise conditions and may allow further reduction of transient components without much degradation of the desired signal. 6.3. REFORMULATION OF THE SPEECH ENHANCEMENT PROBLEM 107 6.3 Reformulation of the speech enhancement prob- lem

In this section, we reformulate the speech enhancement as a simultaneous detection and estimation problem. Middleton and Esposito [115] were the first to propose simultaneous signal detection and estimation within the framework of statistical decision theory. A decision space,

ℓk ℓk ℓk η0 , η1 , is assumed for the detection operation where under the decision ηj , signal ℓk ˆ ˆ hypothesis Hj is accepted and a corresponding estimate Xℓk = Xℓk,j is considered. The detection and estimation are strongly coupled so that the detector is optimized with the knowledge of the specific structure of the estimator, and the estimator is optimized in the sense of minimizing a Bayesian risk associated with the combined operations. For notation simplification, we omit the time-frequency indices (ℓ,k). Let

C X, Xˆ 0 (6.6) j ≥   denote the cost of making a decision ηj (and choosing an estimator Xˆj) where X is the desired signal. Then, the Bayes risk of the two operations associated with simultaneous detection and estimation is defined by [115,116]

1 R = C X, Xˆ p (η Y ) p (Y X) p (X) dXdY (6.7) j j | | j=0 Ωy Ωx X Z Z   where Ωx and Ωy are the spaces of the speech and noisy signals, respectively. The si- multaneous detection and estimation approach is aimed at jointly minimizing the Bayes risk over both the decision rule and the corresponding signal estimate. Let q , p (H1) denote the a priori speech presence probability and let XR and XI denote the real and imaginary parts of the expansion coefficient X. Then, the a priori distribution of the speech expansion coefficient follows

p (X)= q p (X H )+(1 q) p (X H ) , (6.8) | 1 − | 0 where p (X H ) = δ (X) and δ (X) , δ (X ,X ) denotes the Dirac-delta function. The | 0 R I cost function Cj X, Xˆ may be defined differently whether H1 or H0 is true. Therefore,   108 CHAPTER 6. SIMULTANEOUS DETECTION AND ESTIMATION we let C X, Xˆ , C X, Xˆ H (6.9) ij j | i     denote the cost which is conditioned on the true hypothesis2. The cost function

Cij X, Xˆ depends on both the true signal value and its estimate under the decision

ηj and therefore couples the operations of detection and estimation. By substituting (6.8) into (6.7) we obtain

R = p (Y X) p (η0 Y ) q p (X H1) C10 X, Xˆ Ωy Ωx | | | Z Z n h   + (1 q) p (X H ) C X, Xˆ − | 0 00 + p (η Y ) q p (X H ) C X,iXˆ 1 | | 1 11 + (1 q) ph(X H ) C X,Xˆ dXdY . (6.10) − | 0 01  io Let

rij (Y )= Cij X, Xˆ p (X Hi) p (Y X) dX (6.11) Ωx | | Z   denote a risk associated with the pair H , η and the observation Y . Then, the combined { i j} Bayes risk follows

R = p (η Y ) [ qr (Y )+(1 q) r (Y )] 0 | 10 − 00 ZΩy + p (η Y ) [ qr (Y )+(1 q) r (Y )] dY . (6.12) 1 | 11 − 01

Since the detector’s decision under a given observation is binary, i.e., p (η Y ) 0, 1 , j | ∈{ } for minimizing the combined risk we first evaluate the optimal estimator under each of the decisions, then the optimal decision rule is derived based on the optimal estimators

Xˆ0, Xˆ1 to further minimize the combined risk. The two-stage minimization guaranties minimum combined risk [116]. The optimal nonrandom decision rule which minimizes the combined risk (6.12) is given by: Decide η (i.e., p (η Y )=1) if 1 1 |

q [r (Y ) r (Y )] (1 q) [r (Y ) r (Y )] , (6.13) 10 − 11 ≥ − 01 − 00 otherwise, decide η0.

2Note that X = 0 implies that H is true and X = 0 implies H so the sub-index i may seem to be 0 6 1 redundant. However, this notation simplifies the subsequent formulations. 6.3. REFORMULATION OF THE SPEECH ENHANCEMENT PROBLEM 109

(a) (b)

Figure 6.1: (a) Independent detection and estimation system; (b) strongly coupled detection and estimation system.

The optimal estimator under a decision ηj is obtained from (6.12) by

arg min qr1j (Y )+(1 q) r0j (Y ) . (6.14) Xˆj { − }

Note that rij (Y ) depends on the estimate Xˆj through the cost function. Figure 6.1 shows a block diagram of the simultaneous detection and estimation scheme compared with an independent detection and estimation system. The standard, non-coupled detec- tion and estimation system (a) consists of an estimator and a detector which independently chooses to accept or reject the estimator output. In the simultaneous detection and es- timation scheme, the estimator is obtained by (6.14) and the interrelated decision rule

(6.13) chooses the appropriate estimator, Xˆ0 or Xˆ1, for minimizing the combined Bayes risk. Since the risk rij (Y ) is a function of the signal estimate Xˆj, the decision rule (6.13) requires knowledge of the estimator under any of its own decisions. Therefore, the arrow between the estimation and the detection blocks is unidirectional. It is important to note that the optimal estimator (6.14) minimizes the Bayes risk under any given decision rule, even if the detector is not optimal and/or is unknown to the estimator. The cost function associated with the pair H , η is generally defined by { i j}

Cij X, Xˆ = bij dij X, Xˆ , (6.15)     where dij X, Xˆ is an appropriate distortion measure and the cost parameters bij control the trade-off between the costs associated with the pairs H , η . That is, a high valued { i j} b01 raises the cost of a false alarm, (i.e., decision of speech presence when speech is actually absent) which may result in residual musical noise. Similarly, b10 is associated with the cost of missed detection of a signal component, which may cause perceptual 110 CHAPTER 6. SIMULTANEOUS DETECTION AND ESTIMATION signal distortion. Under a correct classification, normalized cost parameters are generally used, b = b = 1. However, d ( , ) is not necessarily zero since estimation errors are 00 11 ii · · still possible even when there is no detection error. Contrary to the approach in [115,116,139], we do not reject the signal estimator when a decision η is made. Instead, we allow the estimator Xˆ = 0 to compensate 0 0 6 for any detection errors and to reduce potential musical noise and audible distortions. Furthermore, when speech is indeed absent the distortion function is defined to allow some natural background noise level such that under H0 the attenuation factor will be lower bounded by a constant gain floor Gf << 1 as proposed in [24,29,38,70].

6.4 Quadratic spectral amplitude cost function

In this section, we derive a speech simultaneous detection and estimation scheme for a QSA cost function. The distortion measure of the QSA cost function is defined by

2 X Xˆ i =1 , ˆ j dij X, X = | | − 2 (6.16)   ˆ Gf Y Xj i =0 ,    | | −    and is related to the STSA suppression rule of Ephraim and Malah [33]. We assume that both X and D are statistically independent, zero-mean, complex-valued Gaussian random variables with variances λx and λd, respectively. Let ξ , λx/λd denote the a priori SNR under hypothesis H , let γ , Y 2/λ denote the a posteriori SNR and let 1 | | d υ , γξ/ (1 + ξ). For evaluating the optimal detector and estimator under the QSA cost function we denote by X , a ejα and Y , R ejθ the clean and noisy spectral coefficients, respectively, where a = X and R = Y . Accordingly, the pdf of the speech expansion | | | | coefficient under H1 satisfies

a a2 p (a, α H )= exp . (6.17) | 1 πλ −λ x  x  The combined risk under the QSA cost function is independent of the signal phase nor the estimation phase. Therefore, we definea ˆj = Xˆj as the estimated amplitude under

6.4. QUADRATIC SPECTRAL AMPLITUDE COST FUNCTION 111

ηj. Substituting the QSA cost function into (6.14) we have

∞ 2π 2 aˆj = arg min q b1j (a aˆ) p (a, α H1) p (Y a, α) dαda aˆ − | |  Z0 Z0 + (1 q) b (G R aˆ)2 p (Y H ) , (6.18) − 0j f − | 0 and by constraining the derivative according toa ˆ to equal zero, we obtain

∞ 2π aˆj [b1j Λ(Y )+ b0j ]= b1j Λ(Y ) a p (a, α H1) p (Y a, α) dαda/p (Y H1)+b0j Gf R 0 0 | | | Z Z (6.19) where Λ (Y ) is the generalized likelihood ratio defined by [33]

q p (Y H ) Λ(Y ) , | 1 (1 q) p (Y H ) − | 0 q eυ = . (6.20) (1 q) 1+ ξ − Note that given the a priori speech presence probability, the generalized likelihood ratio is a function of the a priori and a posteriori SNRs, Λ(ξ,γ). Using [33] we observe that

∞ 2π a p (a, α H ) p (Y a, α) dαda/p (Y H ) | 1 | | 1 Z0 Z0 √πυ υ υ υ = exp (1 + υ) I + υ I R 2γ − 2 0 2 1 2   h    i , GSTSA (ξ,γ) R , (6.21) where I ( ) denotes the modified Bessel function of order ν. ν · Let

φj (ξ,γ) , b1j Λ(ξ,γ)+ b0j . (6.22)

Then, by using the phase of the noisy signal [33] we obtain from (6.19) and (6.21) the optimal estimation under the decision η , j 0, 1 : j ∈{ }

−1 Xˆj = [b1j Λ(ξ,γ) GSTSA (ξ,γ)+ b0j Gf ] φj (ξ,γ) Y

, Gj (ξ,γ) Y . (6.23)

For evaluating the optimal decision rule we need to compute the risk rij(Y ). Under H1 112 CHAPTER 6. SIMULTANEOUS DETECTION AND ESTIMATION we obtain

γ b exp 1+ξ ξ υ r (Y ) = 1j − G2 γ + (1 + υ) G √πυ exp 1j π 1+ ξ  j 1+ ξ − j − 2  υ υ   (1 + υ) I + υ I × 0 2 1 2 γ h    io b exp 1+ξ ξ = 1j − G2 γ + (1 + υ) 2γG G , (6.24) π 1+ ξ  j 1+ ξ − j STSA  

(see proof in the Appendix) where Gj holds for Gj(ξ,γ), the gain function under the

QSA cost function and the decision ηj which is defined in (6.23), and GSTSA holds for

GSTSA(ξ,γ) which is defined in (6.21). For deriving the risk under H , r (Y ), we observe p (X ,X H ) = δ (X ,X ). 0 0j R I | 0 R I Consequently,

∞ r (Y ) = b [G (ξ,γ) G ]2 Y 2 p (X ,X H ) p (Y X ,X ) dX dX 0j 0j j − f | | R I | 0 | R I R I Z Z−∞ b  = 0j [G (ξ,γ) G ]2 γ e−γ . (6.25) π j − f

Substituting (6.24) and (6.25) into (6.13), we obtain the optimal decision rule under the QSA cost function:

ξ Λ(ξ,γ) b G2 G2 + (1 + υ)(b 1)+2(G b G ) G 10 0 − 1 (1 + ξ) γ 10 − 1 − 10 0 STSA   η1 2 2 ≷ b01 (G1 Gf ) (G0 Gf ) . (6.26) η0 − − −

To conclude the above results, simultaneous detection and estimation from noisy ob- servations requires (i) calculating the gain factor under any of the decisions using (6.23), and (ii) finding the optimal decision ηj using (6.26). The corresponding signal estimate is obtained by applying the gain Gj to the noisy observation. Figure 6.2 demonstrates attenuation curves under QSA cost function as a function of the instantaneous SNR defined by γ 1, for several a priori SNRs, using the parameters − q = 0.8, (as proposed in [33]) G = 25 dB and cost parameters b = 5 and b = 1.1. f − 01 10 The gains G1 (dashed line), G0 (dotted line) and the total detection and estimation system gain (solid line) are compared to the STSA gain under signal presence uncertainty of Ephraim and Malah [33] (dashed-dotted line). The a priori SNRs range from 15 dB − to 15 dB. Not only that the cost parameters shape the STSA gain curve, when combined 6.4. QUADRATIC SPECTRAL AMPLITUDE COST FUNCTION 113 with the detector the proposed method provides a significant non-continuous modification of the standard STSA estimator. For example, for a priori SNRs of ξ = 5 and ξ = 15 − dB, as shown in Figure 6.2(b) and (d) respectively, as long as the instantaneous SNR is higher than about 2 dB (for ξ = 5 dB) or 5 dB (for ξ = 15 dB), the detector − − − decision is η1, while for lower instantaneous SNRs, the detector decision is η0. Note that if an ideal detector for the speech coefficients would be available, a more significantly non-continuous gain would be desired to block the noise-only coefficients. However, in the proposed simultaneous detection and estimation approach the detector is not ideal but optimized to minimize the combined risk and the non-continuity of the system gain depends on the chosen cost parameters as well as on the gain floor. As shown in our experimental results, this non-continues gain function may yield greater noise reduction with slightly higher level of musicality, while not degrading speech quality. It is of interest to examine the asymptotic behavior of the estimator (6.23) under each of the decisions. When the cost parameter associated with false alarm is much smaller than the generalized likelihood ratio, i.e., b01 << Λ(ξ,γ), the spectral gain of the estimator under the decision η1 is G1 (ξ,γ) ∼= GSTSA (ξ,γ), which is optimal when the signal is surely present. However, if b01 >> Λ(ξ,γ), the spectral gain under η1 needs to compensate the possibility of a high-cost false-decision made by the detector and thus

G1 (ξ,γ) ∼= Gf . On the other hand, if the cost parameter associated with missed detection −1 is small and we have b10 << Λ(ξ,γ) , then G0 (ξ,γ) ∼= Gf (i.e., estimation where speech −1 is surely absent) but under b10 >> Λ(ξ,γ) , in order to overcome the high cost related to missed detection, we have G0 (ξ,γ) ∼= GSTSA (ξ). Recall that Λ(ξ,γ) = p (H Y ) (6.27) 1+Λ(ξ,γ) 1 | is the a posteriori probability for speech presence [33], it can be shown that the proposed estimator (6.23) generalizes the well-known STSA estimator. For the case of b =1 i, j ij ∀ we have

Xˆ = [ p (H Y ) G (ξ,γ)+(1 p (H Y )) G ] Y 0 1 | STSA − 1 | f

= Xˆ1 . (6.28)

In that case the detection operation is not required since the estimation is independent of 114 CHAPTER 6. SIMULTANEOUS DETECTION AND ESTIMATION

−6 −12

−14 −7

−16 −8 −18

−20 −9

−22

Gain [dB] Gain [dB] −10 −24 G −11 −26 1 G 0 −28 −12 G (detection & estimation) G (q<1) −30 STSA −13 −25 −20 −15 −10 −5 0 5 10 15 20 25 −25 −20 −15 −10 −5 0 5 10 15 20 25 Instantaneous SNR (γ−1) [dB] Instantaneous SNR (γ−1) [dB]

(a) (b) −1

0 −2

−2 −3

−4 −4 Gain [dB] Gain [dB] −6 −5 −8 −6 −10 −7 −25 −20 −15 −10 −5 0 5 10 15 20 25 −25 −20 −15 −10 −5 0 5 10 15 20 25 Instantaneous SNR (γ−1) [dB] Instantaneous SNR (γ−1) [dB]

(c) (d)

Figure 6.2: Gain curves of G1 (dashed line), G0 (dotted line) and the total detection and estimation system gain curve (solid line), compared with the STSA gain under signal presence uncertainty (dashed-dotted line). The a priori SNRs are (a) ξ = 15 dB, (b) ξ = 5 dB, − − (c) ξ = 5 dB and (d) ξ = 15 dB. 6.5. RELATION TO SPECTRAL SUBTRACTION 115 the decision rule. If we also set Gf to zero, the estimation reduces to the STSA suppression rule under signal presence uncertainty [33]. The simultaneous detection and estimation approach requires the calculation of two gain functions, G0(ξ,γ) and G1(ξ,γ), and the decision rule. However, as can be seen from

(6.23), both G0(ξ,γ) and G1(ξ,γ) are linear functions of GSTSA(ξ,γ) and the generalized likelihood ratio Λ(ξ,γ). In addition, the decision rule (6.26) requires the calculation of a second-order polynomial. Therefor, the additional complexity of the simultaneous detec- tion and estimation approach is insignificant compared to the STSA estimator [33], which also requires the calculation of the gain function GSTSA(ξ,γ) (6.21) and the generalized likelihood function (6.31).

6.5 Relation to spectral subtraction

The general formulation of the spectral subtraction approach assumes a spectral estimator which can be written as [29,30]

1 1 Y τ τ τ τ τ ℓk Xˆℓk = max ( Yℓk µE [ Dℓk ]) , βE [ Dℓk ] (6.29) | | − | | | | Yℓk n o | | where E [ D τ ] is the τ-order moment of the noise spectral coefficient, µ 1 represents | ℓk| ≥ an over-subtraction factor, and 0 < β << 1 represents spectral floor factor. Boll [30] considered τ = 1 while Berouti et al. [29] used τ = 2. McAulay and Malpass [32] showed that under a Gaussian statistical model, spectral subtraction with τ = 2, µ = 1 and β =0 yields a maximum-likelihood estimator for the speech spectral variance. The spectral subtraction scheme (6.29) classifies high-energy time-frequency bins as active speech bins, and only in these bins the signal is estimated. Low-energy bins below a given threshold are classified as noise-only bins, and set to some background noise level for reducing the residual musical noise. Consequently, low-energy bins that contain the speech signal are not detected, while noise peaks are detected as speech bins. When the over-subtraction factor µ is increased, fewer noise peaks are detected as speech and therefore the residual musical noise is reduced at the expanse of deterioration of speech quality. The spectral floor βE [ D τ ]1/τ “fills-in” the valleys of the residual noise, which | ℓk| yields a more natural noise with less annoying musicality [29]. However, a large β reduces 116 CHAPTER 6. SIMULTANEOUS DETECTION AND ESTIMATION the background noise suppression. Further reduction of the musical noise may be achieved by local smoothing of the noisy spectral values prior to noise subtraction. As a result, noise peaks are attenuated and the spectral estimation error can be reduced [30]. However, as the speech signal is highly nonstationary, its intelligibility may be dramatically decreased when the smoothing parameter increases.

The classical spectral subtraction approach heuristically combines a detector and an estimator for the speech spectral coefficients while the parameters µ, β and the smoothing length control the trade-off between the residual musical noise and the speech quality. In the proposed simultaneous detection and estimation approach, the detector is optimally designed jointly with the estimator. The residual noise musicality is controlled by both the spectral gain floor Gf which bounds the attenuation and the false-alarm cost parameter b01. A high-valued false-alarm cost parameter (with relation to the generalized likelihood ratio) reduces the estimation gain under η1, which compensates for a false-detection.

The amount of speech distortion is affected by the missed detection parameter b10, which increases the estimation gain under η0. Since the decision rule depends on both parameters as well as on the gain floor, it is the combination of the three parameters that control the trade-offs between noise reduction and speech distortion.

The different behaviors of the spectral subtraction and the simultaneous detection and estimation approach are illustrated in Figures 6.3 and 6.4. The signals in the time domain are shown in Figure 6.3. The clean signal is a sinusoidal wave which is active only in a specific time interval and the noisy signal contains white Gaussian noise with SNR of 5 dB. The noisy signal is transformed into the STFT domain using half-overlapping Hamming windows of 256 taps. The signal enhanced by spectral subtraction with τ = 2, µ = 1 and β = 0.2 is shown in Figure 6.3(c) and the signal enhanced by using the proposed algorithm is shown in Figure 6.3(d) with b = 3, b = 5, G = 20 dB 01 10 f − and q = 0.8. The a priori SNR needed for the simultaneous detection and estimation approach is estimated using the decision-directed approach as will be defined in (6.30), with a weighting factor α = 0.92 and ξ = 20 dB as the lower bound for the a min − priori SNR, while the variance of the background noise coefficients is evaluated from the noise signal (for both algorithms). The amplitudes of the signals in the STFT domain (at the specific frequency band of the desired signal’s frequency) are shown in Figure 6.5. RELATION TO SPECTRAL SUBTRACTION 117

(a)

(b)

(c)

(d)

Time

Figure 6.3: Signals in the time domain. (a) Clean sinusoidal signal; (b) noisy signal; (c) enhanced signal obtained by using the spectral-subtraction estimator; (d) enhanced signal obtained by using the detection and estimation approach. 50 40 30 20

Amplitude [dB] 10 0 10 20 30 40 50 60 Frame

Figure 6.4: Amplitudes of the STFT coefficients along the time-trajectory corresponding to the frequency of the sinusoidal signal: noisy signal (dotted line), spectral subtraction (dashed line), and simultaneous detection and estimation (solid line). 118 CHAPTER 6. SIMULTANEOUS DETECTION AND ESTIMATION

6.4. It can be seen that when the desired signal is absent, high-energy noise components are falsely detected by the spectral subtraction algorithm which potentially results in an annoying musical noise. The detection and estimation algorithm results in a higher attenuation of the noise peaks and smoother and more natural background noise while not increasing the audible distortion in the enhanced signal. Furthermore, it may seem from Figure 6.4 that when the desired signal is active and the instantaneous SNR is high, both algorithms imply similar results. However, in time frames where the desired signal is present, the spectral subtraction approach results in higher residual noise in frequencies where the signal is absent or of low SNR. Therefore, the enhanced signal using the spectral subtraction approach is inferior to the enhanced signal using the detection and estimation approach even in time intervals where the signal is present, as can be seen from Figures 6.3(c) and (d).

6.6 A priori SNR estimation

Speech enhancement in the STFT domain generally relies on an estimation-only approach under signal presence uncertainty e.g., [32, 33, 38]. The a priori SNR is often estimated by using the decision-directed approach [33]. Accordingly, in each time-frequency bin we compute ξˆ = max αG2 ξˆ ,γ γ (1 α)(γ 1) , ξ (6.30) ℓk ℓ−1,k ℓ−1,k ℓ−1,k − ℓk − min n   o where α (0 α 1) is a weighting factor that controls the trade-off between noise ≤ ≤ reduction and transient distortion introduced into the signal, and ξmin is a lower bound for the a priori SNR which is necessary for reducing the residual musical noise in the enhanced signal [33, 70]. Since the a priori SNR is defined under the assumption that

ℓk H1 is true, it is proposed in [38] to replace the gain G in (6.30) by GH1 which represents the spectral gain when the signal is surely present (i.e., q = 1). Increasing the value of α results in a greater reduction of the musical noise phenomena, at the expense of further attenuation of transient speech components (e.g., speech onsets) [70]. By using the proposed approach with high cost for false speech detection, the musical noise can be reduced without increasing the value of α, which enables rapid changes in the a priori SNR estimate. The lower bound for the a priori SNR is related to the spectral gain floor 6.6. A PRIORI SNR ESTIMATION 119

Gf since both imply a lower bound on the spectral gain. The latter parameter is used to evaluate both the optimal detector and estimator while taking into account the desired residual noise level. The decision-directed estimator is widely used, but is not suitable for transient noise environments, since a high-energy noise burst may yield an instantaneous increase in the a posteriori SNR and a corresponding increase in ξˆℓk as can be seen from (6.30). The spectral gain would then be higher than the desired value, and the transient noise ˆs component would not be sufficiently attenuated. Let λdℓk denote the estimated spectral ˆt variance of the stationary noise component and let λdℓk denote the estimated spectral variance of the transient component. The former may be practically estimated by using the improved minima-controlled recursive averaging (IMCRA) algorithm [38,71] or by

t using the minimum-statistics approach [72], while λdℓk may be evaluated based on a ˆ training phase as assumed in [140]. The total variance of the noise component is λdℓk = ˆs ˆt t λdℓk +λdℓk . Note that λdℓk = 0 in time-frequency bins where the transient noise is inactive. Since the a priori SNR is highly dependent on the noise variance, we first estimate the speech spectral variance by

λˆ = max αG2 ξˆ ,γ Y 2 (1 α) Y 2 λˆ , λ (6.31) xℓk H1 ℓ−1,k ℓ−1,k | ℓ−1,k| − | ℓk| − dℓk min n     o ˆs ˆ ˆ ˆ where λmin = ξmin λdℓk . Then, the a priori SNR is evaluated by ξℓk = λxℓk /λdℓk . It is straightforward to show that in a stationary noise environment the proposed a priori

SNR estimator reduces to the decision-directed estimator (6.30), with GH1 substituting G. However, under the presence of a transient noise component, the proposed method yields a lower a priori SNR estimate, which enables higher attenuation of the high-energy transient noisy component. Furthermore, to allow further reduction of the transient noise component to the level of the residual stationary noise, we modify the gain floor by ˜ ˆs ˆ Gf = Gf λdℓk /λdℓk as proposed in [141]. The different behaviors under transient noise conditions of the proposed modified decision-directed a priori SNR estimator and the decision-directed estimator as proposed in [38] are illustrated in Figures 6.5 and 6.6. Figure 6.5 shows the signals in the time do- main: the analyzed signal contains a sinusoidal wave which is active in only two specific segments. The noisy signal contains both additive white Gaussian noise with 5 dB SNR 120 CHAPTER 6. SIMULTANEOUS DETECTION AND ESTIMATION

(a)

(b)

(c)

(d)

(e) Time

Figure 6.5: Signals in the time domain. (a) Clean sinusoidal signal; (b) noisy signal with both stationary and transient components; (c) enhanced signal obtained by using the STSA and the decision-directed estimators; (d) enhanced signal obtained by using the STSA and the modified a priori SNR estimators ;(e) enhanced signal obtained by using the detection and estimation approach and the modified a priori SNR estimator. 50

40

30

20 Amplitude [dB]

10

0 10 20 30 40 50 60 Frame

Figure 6.6: Amplitudes of the STFT coefficients along time-trajectory corresponding to the frequency of the sinusoidal signal: noisy signal (light solid line), STSA with decision-directed estimation (dotted line), STSA with the modified a priori SNR estimator (dashed-dotted line) and simultaneous detection and estimation with the modified a priori SNR estimator (dark solid line). 6.7. EXPERIMENTAL RESULTS 121 and high-energy transient noise components. The signal enhanced by using the decision- directed estimator and the STSA suppression rule is shown in Figure 6.5(c). The signal enhanced by using the modified a priori SNR estimator and the STSA suppression rule is shown in Figure 6.5(d), and the result obtained by using the proposed modified a priori SNR estimation with the detection and estimation approach is shown in Figure 6.5(d) (us- ing the same parameters as in the previous section). Both the decision-directed estimator and the modified a priori SNR estimator are applied with α =0.98 and ξ = 20 dB. min − Clearly, in stationary noise intervals, and where the SNR is high, similar results are ob- tained by both a priori SNR estimators. However, the proposed modified a priori SNR estimator obtain higher attenuation of the transient noise, whether it is incorporated with the STSA or the simultaneous detection and estimation approach. Figure 6.6 shows the amplitudes of the STFT coefficients of the noisy and enhanced signals at the frequency band which contains the desired sinusoidal component. Accordingly, the modified a priori SNR estimator enables a greater reduction of the background noise, particularly transient noise components. Moreover, it can be seen that using the simultaneous detection and es- timation yields better attenuation of both the stationary and background noise compared to the STSA estimator, even while using the same a priori SNR estimator.

6.7 Experimental results

In our experimental study, we first evaluate the detection and estimation approach com- pared with the STSA suppression rule under a stationary noise environment. Then, we consider the problem of hands-free communication in an emergency car, and demonstrate the advantage of the modified a priori SNR estimator together with the simultaneous detection and estimation approach under transient noise environment. Speech signals are taken from the TIMIT database [142], sampled at 16 kHz and degraded by additive noise. The test signals include 16 speech utterances from 16 different speakers, half male half female. The noisy signals are transformed into the STFT domain using half-overlapping Hamming windows of 32 msec length, and the background-noise spectrum is estimated by using the IMCRA algorithm (for all the considered enhancement algorithms) [38,71]. The performance evaluation in our study includes objective quality measures, a subjec- 122 CHAPTER 6. SIMULTANEOUS DETECTION AND ESTIMATION

Table 6.1: Segmental SNR and Log Spectral Distortion Obtained by Using Either the Simul- taneous Detection and Estimation Approach or the STSA Estimator in Stationary Noise Envi- ronment.

Input SNR Input Signal Detection & Estimation STSA (α = 0.98) STSA (α = 0.92) dB SegSNR LSD SegSNR LSD SegSNR LSD SegSNR LSD -5 -6.801 20.897 1.255 7.462 0.085 9.556 -0.684 10.875 0 -3.797 16.405 4.136 5.242 3.169 6.386 2.692 7.391 5 0.013 12.130 5.98 3.887 5.266 4.238 5.110 4.747 10 4.380 8.194 6.27 3.143 5.93 3.167 6.014 3.157 tive study of spectrograms and informal listening tests. The first quality measure is the segmental SNR defined by [143]

K−1 2 1 n=0 x (n + ℓK/2) SegSNR = 10 log10 K−1 2 , (6.32) T ( n=0 [x (n + ℓK/2) xˆ (n + ℓK/2)] ) |L| Xℓ∈L P − where represents the set of framesP which contain speech, denotes the number of L |L| elements in , K = 512 is the number of samples per frame and the operator confines L T the SNR at each frame to a perceptually meaningful range between 10 dB and 35 dB. − The second quality measure is log-spectral distortion (LSD) which is defined by

1 L−1 K/2 2 1 1 2 LSD = 10 log X 10 log Xˆ , (6.33) L K/2+1 10 C ℓk − 10 C ℓk  Xℓ=0  Xk=0 h i  where X , max X 2, ǫ is a spectral power clipped such that the log-spectrum dynamic C {| | } range is confined to about 50 dB, that is, ǫ = 10−50/10 max X 2 . The third quality · ℓ,k {| ℓk| } measure (used in Section 6.7-B) is the perceptual evaluation of speech quality (PESQ) score [144].

6.7.1 Comparison with the STSA estimator

In this section, the suppression rule results from the proposed simultaneous detection and estimation approach is compared to the STSA estimation [33] for stationary white Gaus- sian noise with SNRs in the range [ 5, 10] dB. For both algorithms the a priori SNR is − estimated by the decision-directed approach (6.30) with ξ = 15 dB, and the a priori min − 6.7. EXPERIMENTAL RESULTS 123 speech presence probability is qℓk = 0.8, as proposed in [33]. For the STSA estimator a decision-directed estimation [38] with α = 0.98 reduces the residual musical noise but generally implies transient distortion of the speech signal [33,70]. However, the inherent detector obtained by the simultaneous detection and estimation approach may improve the residual noise reduction and therefore a lower weighting factor α may be used to allow lower speech distortion. Indeed, we have found out that for the simultaneous detection and estimation approach α = 0.92 implies better results, while for the STSA algorithm, better results are achieved with α = 0.98. The cost parameters for the simultaneous detection and estimation should be chosen according to the system specification, i.e., whether the quality of the speech signal or the amount of noise reduction is of higher importance. Table 6.1 summarizes the average segmental SNR and LSD for these two enhancement algorithms, with cost parameters b = 10 and b = 2, and G = 15 dB 01 10 f − for the simultaneous detection and estimation algorithm. The results for the STSA algo- rithm are presented for α =0.98 as well as for α =0.92 (note that for the STSA estimator

Gf = 0 is considered as originally proposed). It shows that the simultaneous detection and estimation yields improved segmental SNR and LSD, while a greater improvement is achieved for lower input SNR. Informal subjective listening tests and inspection of spec- trograms demonstrate improved speech quality with higher attenuation of the background noise. However, since the weighting factor used for the a priori SNR estimate is lower, and the gain function is discontinuous, the residual noise resulting from the simultaneous detection and estimation algorithm is slightly more musical than that resulting from the STSA algorithm (examples are available online [145]).

6.7.2 Speech enhancement under nonstationary noise environ- ment

In this section, we demonstrate the potential advantage of the simultaneous detection and estimation approach with the proposed a priori SNR estimator under transient noise. We consider a hands-free communication in an emergency car (police car, ambulance etc.) where the engine noise is assumed quasi-stationary. However, activating the emergency siren significantly degrades the perceptual quality and intelligibility of the speech signal, 124 CHAPTER 6. SIMULTANEOUS DETECTION AND ESTIMATION

Table 6.2: Segmental SNR, Log Spectral Distortion and PESQ Score Under Transient Noise.

SegSNR LSD PESQ Input Signal -6.703 6.587 2.017 OM-LSA -4.94 5.338 2.141 STSA 4.502 3.580 2.839 Detection and estimation

b01 = b10 =1.5 5.761 3.236 3.072 Detection and estimation

b01 = b10 =5 6.506 3.141 3.071 since its energy is much higher than that of the speech signal. The sound generation in a siren is nonlinear, which produces harmonics not present in the original signal (siren source signal), as can be seen in Figure 6.7(b). However, using the available siren source signal, a reliable indicator in the time-frequency domain for the presence of siren noise, and an

t estimate for the variance of the transient noise, λdℓk , may be designed in a training phase. Note that standard echo-cancellation algorithms are not suitable for eliminating noise generated by nonlinear systems and nonlinear algorithms may be required (e.g., [146,147]).

The proposed approach is compared with the STSA algorithm [33] and the OM-LSA algorithm [38]. The speech presence probability required for the OM-LSA estimator as well as for the simultaneous detection and estimation approach is estimated as proposed in [38], while for the STSA estimatorq ˆℓk = 0.8 is used as originally proposed in [33]. However, since the a priori SNR estimate has a major importance under transient noise, the proposed modified decision-directed estimator is applied both for the simultaneous detection and estimation approach and for the STSA algorithm with ξ = 20 dB. min − For the simultaneous detection and estimation algorithm α = 0.92 is used while for the STSA algorithm α =0.98 (as shown in Section 6.7.1 to be more appropriate for the STSA estimator). For the OM-LSA algorithm, the decision-directed estimator with α = 0.92 is implemented as specified in [38] and the gain floor is G = 20 dB. Figure 6.7 shows f − waveforms and spectrograms of a clean signal, noisy signal and enhanced signals. The noisy signal contains engine car noise with 0 dB SNR and additional siren noise with 1 dB − SNR, such that the total SNR is about 3 dB. The speech enhanced by using the OM-LSA − 6.7. EXPERIMENTAL RESULTS 125

8 0 8 0

−10 −10 6 6

−20 −20 4 4

−30 −30 Frequency [kHz] Frequency [kHz] 2 −40 2 −40

0 −50 0 −50

−60 −60 Amplitude Amplitude −70 −70 0 0.5 1 1.5 2 2.5 3 0 0.5 1 1.5 2 2.5 3 Time [Sec] Time [Sec]

(a) (b)

8 0 8 0

−10 −10 6 6

−20 −20 4 4

−30 −30 Frequency [kHz] Frequency [kHz] 2 −40 2 −40

0 −50 0 −50

−60 −60 Amplitude Amplitude −70 −70 0 0.5 1 1.5 2 2.5 3 0 0.5 1 1.5 2 2.5 3 Time [Sec] Time [Sec]

(c) (d)

8 0 8 0

−10 −10 6 6

−20 −20 4 4

−30 −30 Frequency [kHz] Frequency [kHz] 2 −40 2 −40

0 −50 0 −50

−60 −60 Amplitude Amplitude −70 −70 0 0.5 1 1.5 2 2.5 3 0 0.5 1 1.5 2 2.5 3 Time [Sec] Time [Sec]

(e) (f)

Figure 6.7: Speech spectrograms (in dB) and waveforms. (a) Clean speech signal: ”Draw every outer line first, then fill in the interior”; (b) speech degraded by engine car noise and siren noise with SNR of 3 dB; (c) speech enhanced by using the OM-LSA estimator; (d) speech enhanced − by using the STSA estimator (together with the modified a priori SNR estimator); (e) speech enhanced by using the simultaneous detection and estimation approach with b01 = b10 = 1.5;

(f) speech enhanced by using the simultaneous detection and estimation approach with b01 = b10 = 5. 126 CHAPTER 6. SIMULTANEOUS DETECTION AND ESTIMATION algorithm and the STSA algorithm are shown in Figures 6.7(c) and (d), respectively. The signal enhanced by using the simultaneous detection and estimation approach is shown in Figures 6.7(e) and (f) with b01 = b10 =1.5 and b01 = b10 = 5, respectively, and a gain floor of G = 20 dB. It can be seen that compared with the decision-directed-based f − OM-LSA algorithm, the modified a priori SNR estimator substantially contributes to the transient noise reduction, whether it is integrated with the simultaneous detection and estimation approach or with the STSA algorithm. However, the simultaneous detection and estimation approach which is combined with adapted speech presence probability and gain floor yields greater reduction of transient noise without affecting the quality of the enhanced speech signal. Averaged quality measures for the whole set of tested utterances are summarized in Table 6.2, for the same noise conditions. The results demonstrate improved speech quality obtained by using the modified a priori SNR estimator either while combined with the STSA or the simultaneous detection and estimation approach, applying the detection and estimation approach introduced additional improvement to the enhanced signal. Subjective listening tests confirm that the speech quality improvement achieved by using the proposed method is perceptually substantial (audio files are available online [145]).

6.8 Conclusions

We have presented a novel formulation of the single-channel speech enhancement problem in the time-frequency domain. Our formulation relies on coupled operations of detection and estimation in the STFT domain, and a cost function that combines both the esti- mation and detection errors. A detector for the speech coefficients and a corresponding estimator for their values are jointly designed to minimize a combined Bayes risk. In addition, cost parameters enable to control the trade-off between speech quality, noise reduction and residual musical noise. The proposed method generalizes the traditional spectral enhancement approach which considers estimation-only under signal presence uncertainty. In addition we propose a modified decision-directed a priori SNR estimator which is adapted to transient noise environment. Experimental results show greater noise reduction with improved speech quality when compared with the STSA suppression rules 6.A. RISK DERIVATION 127 under stationary noise. Furthermore, it is demonstrated that under transient noise envi- ronment, greater reduction of transient noise components may be achieved by exploiting reliable information for the a priori SNR estimation with simultaneous detection and estimation approach.

6.A Risk derivation

In this appendix we derive the risk r (Y ) . Under H , η we obtain 1j { 1 j} ∞ 2π r (Y )= b (a G R)2 p (a, α H ) p (Y a, α) dαda, (6.34) 1j 1j − j | 1 | Z0 Z0 and the multiplication of the two pdf’s implies

a a2 2R a cos(α θ) p (a, α H ) p (Y a, α)= exp γ + − , (6.35) | 1 | π2λ λ − λ − λ x d   d  −1 where λ , (1/λx +1/λd) . Integrating (6.34) with regard to the phase variable we obtain [148, eq. 3.339, 8.406.3]

2π 2R a cos(α θ) 2R exp − dα =2πJ i a , (6.36) λ 0 λ Z0  d   d  where J ( ) denotes the Bessel function of order zero. Note that in this appendix i , √ 1. 0 · − Using [149, eq. 13.3.1, 2] we have

∞ a2 2R λ a exp J i a da = eυ , (6.37) − λ 0 λ 2 Z0    d  and ∞ a2 2R λ1.5Γ(1.5) a2 exp J i a da = F (1.5;1; υ) , (6.38) − λ 0 λ 2Γ (1) 1 1 Z0    d  where Γ ( ) denotes the Gamma function with Γ (1) = 1 and Γ (1.5) = √π/2, and · 1F1 (a; b; x) is the confluent hypergeometric function [150, eq. A.1.31.c]

υ υ υ F (1.5;1; υ)= e 2 (1 + υ) I + υ I . (6.39) 1 1 0 2 1 2 h    i Using [149, eq. 13.3.2], [150, eq. A.1.19.c] we obtain

∞ a2 2R λ2Γ(2) a3 exp J i a da = F (2;1; υ) − λ 0 λ 2Γ (1) 1 1 Z0    d  λ2 = (1 + υ) eυ (6.40) 2 128 CHAPTER 6. SIMULTANEOUS DETECTION AND ESTIMATION

Substituting (6.35)–(6.40) into (6.34) yields

γ b exp 1+ξ ξ υ r (Y ) = 1j − G2 γ + (1 + υ) G √πυ exp 1j π 1+ ξ  j 1+ ξ − j − 2  υ υ   (1 + υ) I + υ I 0 2 1 2 γ h    io b exp 1+ξ ξ = 1j − G2 γ + (1 + υ) 2γG G . (6.41) π 1+ ξ  j 1+ ξ − j STSA   6.B. SPEECH ENHANCEMENT UNDER MULTIPLE HYPOTHESES 129 6.B Enhancement of Speech Signals Under Multi- ple Hypotheses Using an Indicator for Transient Noise Presence3

In this appendix, we formulate a speech enhancement problem under multiple hypothe- ses, assuming an indicator or detector for the transient noise presence is available in the short-time Fourier transform (STFT) domain. Hypothetical presence of speech or tran- sient noise is considered in the observed spectral coefficients, and cost parameters control the trade-off between speech distortion and residual transient noise. An optimal esti- mator, which minimizes the mean-square error of the log-spectral amplitude, is derived, while taking into account the probability of erroneous detection. Experimental results demonstrate the improved performance in transient noise suppression, compared to using the optimally-modified log-spectral amplitude estimator.

6.B.1 Introduction

Enhancement of speech signals is of great interest in many voice communication systems, whenever the source signal is corrupted by noise. In a highly non-stationary noise en- vironments, noise transients may be extremely annoying and significantly degrade the perceived quality and performances of subsequent coding or speech recognition systems. Existing speech enhancement algorithms, e.g., [32, 33, 38], are generally inadequate for eliminating non-stationary noise components. In some applications, an indicator for the transient noise activity may be available, e.g., a siren noise in an emergency car, lens-motor noise of a digital video camera or a keyboard typing noise in a computer-based communication system. The transient spectral variances can be estimated in such cases from training signals. However, applying a standard estimator to the spectral coefficients may result in removal of critical speech components in case of falsely detecting the speech components, or under-suppression of transient noise in case of miss detecting the noise transients. In this appendix, we formulate a speech enhancement problem under multiple hypothe-

3This appendix is based on [140]. 130 CHAPTER 6. SIMULTANEOUS DETECTION AND ESTIMATION ses, assuming some indicator or detector for the presence of noise transients in the STFT domain is available. Cost parameters control the trade-off between speech distortion and residual transient noise. We derive an optimal signal estimator that employs the avail- able detector and show that the resulting estimator generalizes the optimally-modified log-spectral amplitude (OM-LSA) estimator [38]. Experimental results demonstrate the improved performance obtained by the proposed algorithm, compared to using the OM- LSA. This appendix is organized as follows. In Section 6.B.2 we formulate the problem of spectral enhancement under multiple hypotheses. In Section 6.B.3 we derive the optimal estimator. In Section 6.B.4 we provide some experimental results and conclude in Section 6.B.5.

6.B.2 Problem formulation

Let x (n), ds (n) and dt (n) denote speech and two uncorrelated additive interference sig- nals, respectively, and let

y (n)= x (n)+ ds (n)+ dt (n) (6.42) be the observed signal. We assume that ds (n) is a quasi-stationary background noise while dt (n) is a highly non-stationary transient signal. The speech signal and the transient noise are not always present in the STFT domain, so we have four hypotheses for the noisy coefficients:

ℓk s H1s : Yℓk = Xℓk + Dℓk ,

ℓk s t H1t : Yℓk = Xℓk + Dℓk + Dℓk ,

ℓk s H0s : Yℓk = Dℓk ,

ℓk s t H0t : Yℓk = Dℓk + Dℓk , (6.43) where ℓ denotes the time frame index and k denotes the frequency-bin index. In many speech enhancement applications, an indicator for the transient source may be available, e.g., siren noise in an emergency car, keyboard typing in computer-based communication system and a lens-motor noise in a digital video camera. In such cases, a 6.B. SPEECH ENHANCEMENT UNDER MULTIPLE HYPOTHESES 131 priori information based on a training phase may yield a reliable detector for the transient noise. However, false detection of transient noise components when signal components are present may significantly degrade the speech quality and intelligibility. Furthermore, miss detection of transient noise components may result in a residual transient noise, which is perceptually annoying. Let ηℓk, j 0, 1 denote the detector decision in the time-frequency bin (ℓ,k), i.e.,a j ∈{ } transient component is classified as a speech component under η1 and as a noise component 4 under η0 . Let C10 denote the false-alarm cost with relation to the noise transient, i.e., cost of making a decision η0 when a noise transient is inactive or is not dominant w.r.t the speech component, and let the miss detection cost C01 be defined similarly. Let

d (x, y) , (log x log y )2 (6.44) | | − | | denote the squared log-amplitude distortion function, let A , X and let R , Y . ℓk | ℓk| ℓk | ℓk| Considering a realistic detector, we introduce the following criterion for the estimation of

ℓk the speech expansion coefficient under the decision ηj :

ˆ ℓk ℓk ℓk Aℓk = arg min C1jp H1s H1t ηj ,Yℓk Aˆ ∪ | E d X , Aˆ Y ,Hℓk Hℓk  × ℓk | ℓk 1s ∪ 1t + C hp Hℓk H ℓk ηℓk,Y d Gi R , Aˆ (6.45) 0j 0t ∪ 0s | j ℓk min ℓk   o where the costs of perfect detection C00 and C11 are normalized to one. That is, under speech presence we aim at minimizing the MSE of the LSA. Otherwise, a constant atten- uation Gmin << 1 is imposed for maintaining naturalness of the residual noise [38]. The cost parameters control the trade-off between speech distortion, consequent upon false detection of noise transients, and residual transient noise, resulting from miss detection of transient noise components.

6.B.3 Optimal estimation under a given detection

In this section we derive an optimal estimator for the speech signal under multiple hy- potheses.

4Note that the detector is used for discriminating between transient speech components and transient noise components, and therefore not employed when transients are absent. 132 CHAPTER 6. SIMULTANEOUS DETECTION AND ESTIMATION

Spectral Estimation

ℓk ℓk ℓk We first reduce the problem into two basic hypotheses, H1 and H0 . Under H1 , the speech component is assumed present and more dominant than the noise component. This hypothesis includes Hℓk as well as Hℓk given that X β Dt , where β > 0 is 1s 1t | ℓk| ≥ | ℓk| ℓk ℓk ℓk a predefined threshold parameter. The hypothesis H0 includes the cases H0s , H0t and also Hℓk with X < β Dt . Under Hℓk we estimate the speech in the MMSE-LSA 1t | ℓk| | ℓk| 1 ℓk sense, and under H0 we impose a constant attenuation to the noisy component. Note ℓk that ideally under H1t an estimate for the speech component would be desired. However, if the noise transient is much more dominant we would better apply the constant low attenuation to the noisy component to avoid a strong residual noisy transient. Let p , p ηℓk Hℓk . We are interested in detecting the interfering transient noise ij j | i so p01 is the probability  of a false alarm and p10 is the probability of miss detection. We assume that given any transient in the noisy coefficients, the detection error probability is independent of the observation and the signal-to-noise ratio (SNR). Therefore,

p ηℓk Hℓk,Y = p (6.46) j | i ℓk ij  and

p Hℓk ηℓk,Y = p p Hℓk Y /p ηℓk Y . (6.47) i | j ℓk ij i | ℓk j | ℓk    This assumption can be easily relaxed by employing a time-frequency dependent proba-

ℓk bility pij . Considering the two basic hypotheses and substituting (6.47) into (6.45) we obtain

ˆ ℓk Aℓk = arg min p1jC1jp H1 Yℓk Aˆ |   d X , Aˆ p X Y ,Hℓk dX × ℓk ℓk | ℓk 1 ℓk Z   + p C p Hℓk Y d G R , Aˆ , (6.48) 0j 0j 0 | ℓk min ℓk   o which yields

log Aˆ p C p Hℓk Y + p C p Hℓk Y = ℓk 1j 1j 1 | ℓk 0j 0j 0 | ℓk p C p Hℓk Y E log X Y ,Hℓk  1j 1j 1 | ℓk | ℓk| | ℓk 1 +p C p Hℓk Y log( G R ) . (6.49) 0j 0j 0 | ℓk min ℓk  6.B. SPEECH ENHANCEMENT UNDER MULTIPLE HYPOTHESES 133

5 Let ξℓk and γℓk denote the a priori and a posteriori SNRs, respectively , let υℓk ,

ξℓkγℓk/ (1 + ξℓk) and let

ℓk ℓk p H1 p Yℓk H1 Λ(ξℓk,γℓk) , | p Hℓk p Y Hℓk 0  ℓk | 0  p Hℓk eυℓk = 1   (6.50) ℓk 1+ ξ p H0  ℓk denote the generalized likelihood ratio [33]. Accordingly ,

p Hℓk Y =Λ(ξ ,γ ) / (1+Λ(ξ ,γ )) . (6.51) 1 | ℓk ℓk ℓk ℓk ℓk  Let

φj (ξℓk,γℓk)= p1jC1jΛ(ξℓk,γℓk)+ p0jC0j (6.52) and let ξ 1 ∞ e−t G (ξ,γ) , exp dt (6.53) LSA 1+ ξ 2 t  Zϑ  denote the LSA gain function [34]. Then, combining the magnitude estimate Aˆℓk with the phase of the noisy spectral coefficient Yℓk we obtain an optimal estimate under the ℓk decision ηj :

−1 φj ˆ p0j C0j p1j C1j Λ Xℓk = Gmin GLSA (ξℓk,γℓk) Yℓk h i , Gηj (ξℓk,γℓk) Yℓk , (6.54)

where Λ and φj hold for Λ(ξℓk,γℓk) and φj (ξℓk,γℓk), respectively.

In case of a decision η1 (i.e., transient component is classified as speech), the miss- detection cost C01 as well as the probabilities p01 and p11 control the trade-off between the attenuation associated with the hypothesis H1 and the constant attenuation under speech absence, Gmin. Under a decision η0, the trade-off is controlled by the false-alarm cost and the probabilities p00 and p10. Note that in case p = p and C = C for j 0, 1 , the estimator (6.54) reduces 0j 1j 0j 1j ∈{ } to the OM-LSA estimator [38] under any of the detector decisions, since in that case the decision made by the detector does not contribute any statistical information.

5Note that the noise variance depends on whether a transient component is present or not. This will be specified in the next subsection. 134 CHAPTER 6. SIMULTANEOUS DETECTION AND ESTIMATION

0 ξ=10 dB

−5

−10 Gain [dB]

Gη −15 1 ξ=−10 dB Gη 0 G LSA −20 G OMLSA

−20 −15 −10 −5 0 5 10 15 20 Instantaneous SNR (γ−1) [dB]

Figure 6.8: Gain curves for p(H ) = 0.8, C = 5, C = 3, G = 15 dB and false-detection 1 01 10 min − and miss-detection probabilities of p01 = p10 = 0.1.

Figure 6.8 shows attenuation curves as a function of the instantaneous SNR, γ 1, − for different a priori SNRs. The detection-dependent gains Gη0 (dashed-dotted line) and

Gη1 (dotted line) are compared to the LSA gain (dashed line) and the OM-LSA gain (solid line) [34, 38]. It shows that the cost parameters with the error probabilities of the detector shape the attenuation curve under any of the decisions made by the detector to compensate for any erroneous detection.

A priori and a posteriori SNR estimation

The spectrum of the background noise, λ , E Ds 2 , can be estimated by using the s,ℓk | ℓk| minima-controlled recursive averaging algorithm [71]. Th e a priori signal-to-stationary noise ratio ξs , λ /λ , where λ , E X 2 , is practically estimated using the ℓk x,ℓk s,ℓk x,ℓk | ℓk| decision-directed approach [33,38]. Given that a transien t noise is present, the transient noise spectrum may be estimated from a training phase. Therefore, under η0 we may estimate the a priori and a posteriori SNRs by using λˆs,ℓk + λˆt,ℓk as the estimate for the noise spectrum [141], where λt,ℓk is defined similarly to λs,ℓk. However, in case of an 6.B. SPEECH ENHANCEMENT UNDER MULTIPLE HYPOTHESES 135 erroneous detection, this approach may significantly distort the speech component, since both the a priori and a posteriori SNRs would be much smaller than their desired values. Therefore, we propose to smooth the noisy spectra

ζ = µζ + (1 µ) Y 2 , (6.55) ℓk ℓ−1,k − | ℓk|

ℓk with 0 <µ< 1. Accordingly, under a decision η0 we update the estimates such that

ℓk ˆ ˆs s η1 : ξℓk = ξℓk , γˆℓk =γ ˆℓk , λˆs λˆs ℓk ˆ ˆs d,ℓk s d,ℓk η0 : ξℓk = ξℓk , γˆℓk =γ ˆℓk . (6.56) ζℓk ζℓk

As a result, the outcome of falsely detecting transient noise is less destructive since ζℓk would be much smaller than λs,ℓk + λt,ℓk. However, in case of a perfect detection, ζℓk is a reliable estimator for the noise spectrum given that µ is sufficiently small. In addition, under the existence of a high energy transient component we would like to further atten- uate the noisy component to the level of the residual background noise. Therefore, under ℓk ˜ ˆ η0 we update Gmin = Gmin λs,ℓk/ζℓk. q

6.B.4 Experimental results

In this section, we demonstrate the application of the proposed algorithm to speech en- hancement in a computer-based communication system. The background office noise is slowly-varying while possible keyboard typing interference may exist. Since the keyboard signal is available to the computer, a reliable detector for the transient-like keyboard noise is assumed to be available based on a training phase but still, erroneous detections are reasonable. The speech signals are sampled at 16 kHz and degraded by a stationary back- ground noise with 15 dB SNR and a keyboard typing noise such that the total SNR is 0.8 dB. The STFT is applied to the noisy signal with Hamming windows of 32 msec length and 75% overlap. The transient noise detector is assumed to have an error probability of 10% and the miss-detection and false-detection costs are set to1.2. The weighting factor for the noisy spectra is µ =0.5. Figure 6.9 demonstrates the spectrograms and waveforms of a signal enhanced by using the proposed algorithm, compared to using the OM-LSA algorithm. It can be seen that 136 CHAPTER 6. SIMULTANEOUS DETECTION AND ESTIMATION

8 8

6 6

4 4

Frequency [kHz] 2 Frequency [kHz] 2

0 0

Amplitude 0 0.5 1 1.5 2 Amplitude 0 0.5 1 1.5 2 Time [Sec] Time [Sec]

(a) (b) 8 8

6 6

4 4

Frequency [kHz] 2 Frequency [kHz] 2

0 0

Amplitude 0 0.5 1 1.5 2 Amplitude 0 0.5 1 1.5 2 Time [Sec] Time [Sec]

(c) (d)

Figure 6.9: Speech spectrograms and waveforms. (a) Clean signal (”Draw any outer line first”); (b) noisy signal (office noise including keyboard typing noise, SNR=0.8 dB ); (c) speech enhanced by using the OM-LSA estimator; (d) speech enhanced by using the proposed algorithm. 6.B. SPEECH ENHANCEMENT UNDER MULTIPLE HYPOTHESES 137

Table 6.3: Segmental SNR and Log Spectral Distortion Obtained Using the OM-LSA and the Proposed Algorithm.

Method SegSNR [dB] LSD [dB] PESQ

Noisy speech -2.23 7.69 1.07 OM-LSA -1.31 6.77 0.97 Proposed Alg. 5.41 1.67 2.87

using our approach, the transient noise is significantly attenuated, while the OM-LSA is unable to eliminate the keyboard transients. The objective evaluation includes three quality measures: segmental SNR (SegSNR), log-spectral distortion (LSD) and perceptual evaluation of speech quality (PESQ) score. The results are summarized in Table 6.3. It can be seen that the proposed detection and estimation approach significantly improves speech quality compared to using the OM-LSA algorithm. Informal listening tests confirm that the annoying keyboard typing noise is dramatically reduced and the speech quality is significantly improved.

6.B.5 Conclusions

We have introduced a new approach for a single-channel speech enhancement in a highly non-stationary noise environment where a reliable detector for interfering transients is available. The speech expansion coefficients are estimated under multiple-hypotheses in the MMSE-LSA sense while considering possible erroneous detection. The proposed algorithm generalizes the OM-LSA estimator and enables greater suppression of transient noise components. 138 CHAPTER 6. SIMULTANEOUS DETECTION AND ESTIMATION Chapter 7

Single-Sensor Audio Source Separation Using Classification and Estimation Approach and GARCH Modeling1

In this chapter, we propose a new algorithm for single-sensor blind source separation of speech and music signals, which is based on generalized autoregressive conditional het- eroscedasticity (GARCH) modeling of the speech signals and Gaussian mixture modeling (GMM) of the music signals. The separation of the speech from the music signal is ob- tained by a classification and estimation approach, which enables to control the trade-off between residual interference and signal distortion. Experimental results demonstrate that for mixtures of speech and piano music signals, an improved source separation can be achieved compared to using Gaussian mixture model for both signals. The trade-off between signal distortion and residual interference is controlled by adjusting some cost parameters, which are shown to determine the missed and false detection rates in the proposed classification and estimation approach.

1This chapter is based on [151].

139 140 CHAPTER 7. SINGLE-SENSOR AUDIO SOURCE SEPARATION 7.1 Introduction

Separation of mixed audio signals received by a single microphone has been a challenging problem for many years. Examples of applications include separation of speakers [85,86], separation of different musical sources (e.g., different musical instruments) [85, 87, 88], separation of speech or singing voice from background music [89–92], and signal enhance- ment in nonstationary noise environments [35,62,93–95]. In case the signals are received by multiple microphones, spatial filtering may be employed as well as between the received signals, e.g., see [96] and references therein. However, for the un- derdetermined case of several sources which are recorded by a single microphone, some a priori information is necessary to enable reasonable separation performance. Existing al- gorithms for single-sensor audio source separation generally deal with two main problems. The first is to obtain appropriate statistical models for the mixed signals, i.e., codebook, and the second problem is the design of a separation algorithm.

In [94, 95] speech and nonstationary noise signals are assumed to evolve as mixtures of autoregressive (AR) processes in the time domain. The a priori statistical information (codebook), which in this case includes the sets of AR prediction coefficients, is obtained by using a training phase. In [87,88,90] the acoustic signals are modeled by Gaussian mix- ture models (GMMs), and in [35,62] the acoustic signals are modeled by hidden Markov models (HMMs) with AR sub-sources. The trained codebooks provide statistical infor- mation about the distinct signals, which enables source separation from signal mixtures. The desired signal may be reconstructed based on the assumed model by minimizing the mean-square error (mse) [35,88,90], or by a maximum a posteriori (MAP) approach [62]. However, in case of several sources received by a single sensor, separation performances are far from being perfect. Falsely assigning an interfering component to the desired sig- nal may cause an annoying residual interference, while falsely attenuating components of the desired signal may result in signal distortion and perceptual degradation.

GMM and AR-based codebooks are generally insufficient for source separation of sta- tistically rich signals such as speech signals since they only allow a finite set of probability density functions (pdf’s) [92, 152]. Recently, generalized autoregressive conditional het- eroscedasticity (GARCH) models have been proposed for modeling speech signals for 7.1. INTRODUCTION 141 speech enhancement [23, 24, 127, 133], speech recognition [26], and voice activity detec- tion [27] applications. The GARCH model takes into account the correlation between successive spectral variances and specifies a time-varying conditional variance (volatility) as a function of past spectral variances and squared-absolute values. As a result, the spec- tral variances may smoothly change along time and the pdf is much less restricted [1,5,25].

In this chapter, we propose a novel approach for single-sensor audio source separation of speech and music signals. We consider both problems of codebook design and the ability to control the trade-off between the residual interference and the distortion of the desired signal. Accordingly, the proposed approach includes a new codebook for speech signals, as well as a new separation algorithm which relies on a simultaneous classification and estimation method. The codebook is based on GARCH modeling of speech signals and Gaussian mixture modeling of music signals. We apply the models to distinctive frequency subbands, and define a specific state for the case that the signal is absent in the observed subband. The proposed separation algorithm relies on integrating a classifier and an estimator while reconstructing each signal. The classifier attempts at classifying the observed signal into the appropriate hypotheses of each of the signals, and the estimator output is based on the classification. Two methods are proposed for classification and estimation. One is based on simultaneous operations of classification and estimation while minimizing a combined Bayes risk. The second method employs a given (non-optimal) classifier, and applies an estimator which is optimally designed to yield a controlled level of residual interference and signal distortion. The GARCH model for the speech signal with several states of parameters enables smooth (diagonal) covariance matrices with possible state switching. Experimental results demonstrate that for mixtures of speech and piano signals it is more advantageous to model the speech signal by GARCH than GMM, and the codebook generated by the GARCH model yields significantly improved separation performance. In addition, the classification and estimation approach, together with the signal absence state, enables the user to control the trade-off between distortion of the desired signal caused by missed detection, and amount of residual interference resulting from false detection.

This chapter is organized as follows. In Section 7.2, we briefly review codebook- based methods for single-channel audio source separation. We formulate the simultaneous 142 CHAPTER 7. SINGLE-SENSOR AUDIO SOURCE SEPARATION classification and estimation problem for mixtures of signals and derive an optimal solution for the classifier and the combined estimator. Furthermore, we show that a constrained optimization with a given classifier yields the same estimator. In Section 7.3, we define the GARCH codebook which is considered for speech signals and review the recursive conditional variance estimation. In Section 7.4, we describe the implementation of the proposed algorithm, and in Section 7.5 we provide some experimental results for audio separation of speech and music signals.

7.2 Codebook-Based Separation

Separation of a mixture of signals observed via a single sensor is an ill posed problem. Some a priori information about the mixed signals is generally necessary to enable rea- sonable reconstructions. Benaroya et al. [87–89] proposed a GMM for the signals’ code- book in the short-time Fourier transform (STFT) domain, and in [35,62,94,95] mixtures of AR models are considered in the time domain. In each case, a set of clean similar signals is used to train the codebooks prior to the separation step. Although the AR pro- cesses are defined in the time domain, for process of length N with prediction coefficients 1, a , ..., a and innovation variance σ2, the covariance matrix is σ2(AT A)−1, where A is { 1 p} an N N lower triangular Toeplitz matrix with [1 a ... a 0 ... 0]T as the first column. If × 1 p the frame length N tends to infinity, the covariance matrix become circulant and hence diagonalized by the Fourier transform [35, 95]. Accordingly, each set of AR coefficients, together with the excitation variance, corresponds to a specific covariance matrix in the STFT domain similarly to the GMM. Therefore, under any of these models, each framed signal is considered as generated from some specific distribution, which is related to the codebook with some probability, and separation is applied on a frame-by-frame basis. We now start with brief introduction of existing codebooks and separation algorithms. Let s , s CN denote the vectors of the STFT expansion coefficients of signals s (n) 1 2 ∈ 1 and s2(n), respectively, for some specific frame index. Let q1 and q2 denote the active states of the codebooks corresponding to signals s1 and s2, respectively, with known a priori probabilities p1 (i) , p (q1 = i), i = 1, ..., m1 and p2 (j) , p (q2 = j), j = 1, ..., m2, and i p1 (i) = j p2 (j) = 1. Given that q1 = i and q2 = j, s1 and s2 are assumed P P 7.2. CODEBOOK-BASED SEPARATION 143 conditionally zero-mean complex-valued Gaussian random vectors (see, e.g., [153, p. 89]) with known diagonal covariance matrices, i.e., s 0, Σ(i) and s 0, Σ(j) . 1 ∼ CN 1 2 ∼ CN 2 Based on a given codebook, it is proposed in [88] and [95] to first find the active pair of states i, j = q = i, q = j using a MAP criterion: { } { 1 2 }

ˆi, ˆj = argmax p (x i, j) p (i, j) (7.1) i,j | n o where x = s +s , p ( i, j)= p ( q = i, q = j), and for statistically independent signals 1 2 ·| ·| 1 2 p (i, j) = p1 (i) p2 (j). Subsequently, conditioned on these states (i.e., classification), the desired signal may be reconstructed in the mmse sense by

ˆs = E s x,ˆi, ˆj 1 1 | −1 (nˆi) (ˆi) o (ˆj) = Σ1 Σ1 + Σ2 x   , Wˆiˆj x (7.2)

2 and similarly ˆs2 = Wˆjˆi x. Alternatively [35, 88, 90], the desired signal may be recon- structed in the mmse sense directly from

ˆs = E s x 1 { 1 | } = p (i, j x) W x . (7.3) | ij i,j X Note that in case of additional uncorrelated stationary noise in the mixed signal, i.e., x = s + s + d with d (0, Σ), the covariance matrix of the noise signal is added to 1 2 ∼ CN the covariance matrix of the interfering signal, and then the signal estimators remain in the same forms. Furthermore, without loss of generality, we may restrict ourselves to the problem of restoring the signal s1 from the observed signal x. In the following subsections, we introduce two related methods for separation. In Section 7.2.1 we formulate the problem of source separation as a simultaneous classification and estimation problem in the sense of statistical decision theory. A classifier is aimed at finding the appropriate states within the codebooks, and the estimator tries to estimate the desired signal based on the given classification. Coupled classifier and estimator jointly

2 Note that in this chapter the index i always refers to the signal s1 and the index j refers to the other −1 (j) (i) (j) signal s2. Therefore, Wji =Σ2 Σ1 +Σ2 .   144 CHAPTER 7. SINGLE-SENSOR AUDIO SOURCE SEPARATION minimize a combined Bayes risk, which penalizes for both classification and estimation errors. Relying on the fact that audio signals are generally sparse in the STFT domain, we define additional specific states for the codebook which represent signal absence, and consider false detection of the desired signal and missed detection. The false detection results in under-attenuation of the interfering signal. On the other hand, missed detection of the desired signal may result in removal of desired components and excessive distortion of the separated signals. To allow the user a control over the residual interference and the signal distortion, we introduce cost parameters which are related to missed detection and false detection of the desired signal. In Section 7.2.2, we introduce a slightly different formulation of optimal estimation under a given classifier. An independent (given) classifier may be applied, for example, by using the MAP classifier (7.1). Based on this classification, the signal estimation is derived by solving a constrained optimization with respect to the level of residual interference and signal distortion. We denote this approach as joint classification and estimation. We show that in case of degenerated simultaneous classification and estimation formulation, closely related solutions can be derived under both approaches.

7.2.1 Simultaneous Classification and Estimation

Simultaneous detection and estimation formulation was first proposed by Middleton et al. [115,116]. This scheme assumes coupled operations of detection and estimation which jointly minimize a combined Bayes risk. Recently, a similar approach has been proposed for speech enhancement in nonstationary noise environments [136]. It was shown that ap- plying simultaneous operations of speech detection and estimation in the STFT domain improves the enhanced signal compared to using an estimation only approach. Further- more, the contribution of the detector is more significant when the interfering signal is highly nonstationary. In this subsection we develop a simultaneous classification and estimation approach for a codebook-based single-channel audio source separation. By introducing cost parameters for classification errors the trade-off between residual inter- ference and signal distortion may be controlled.

Let η denote a classifier for the mixed signal, where ηij indicates that the mixed signal 7.2. CODEBOOK-BASED SEPARATION 145 x is classified to be associated with the pair of states i, j . Let { } ¯¯ ¯¯ Cij (s,ˆs) , bij s ˆs 2 (7.4) ij ij k − k2 denote the combined cost of classification and estimation, where we use a squared-error ¯i¯j distortion, and bij > 0 are parameters which impose a penalty for making a decision that ¯ i, j is the active pair while actually s was generated with covariance matrix Σ(i) and s { } 1 1 2 (¯j) ¯ ¯ with covariance matrix Σ2 (i.e., q1 = i and q2 = j). The combined risk of classification and estimation is then given by

¯¯ R = Cij (s ,ˆs ) p (x s ,¯i, ¯j) p (s ¯i, ¯j) p (¯i, ¯j) p (η x) ds dx . (7.5) ij 1 1 | 1 1 | ij | 1 i,j ¯ ¯ X Xi,j Z Z The simultaneous classification and estimation is aimed at finding the optimal esti- mator and classifier which jointly minimize the combined risk:

min R . (7.6) ˆs ηij , 1 { }

To derive a solution to (7.6) we first note that the signal s1 is independent of the value of q , hence p (s ¯i, ¯j)= p (s ¯i). Similarly, x given s is independent of the value of q . 2 1 | 1 | 1 1 ¯i¯j Accordingly, rij (x, ˆs1) which is defined by

¯¯ ¯¯ rij (x, ˆs )= Cij (s ,ˆs ) p (x s , ¯j) p (s ¯i) ds (7.7) ij 1 ij 1 1 | 1 1 | 1 Z denotes the average risk related to a decision η when the true pair is ¯i, ¯j . The combined ij { } risk (7.5) can be written as

¯¯ R = p (η x) p (¯i, ¯j) rij (x, ˆs ) dx . (7.8) ij | ij 1 i,j ¯ ¯ X Z Xi,j The classifier’s decision for a given observation is nonrandom. Therefore, given the ob- served signal x, the optimal estimator under a decision ηij made by the classifier [i.e., p(η x) = 1 for a particular pair i, j ] is obtained by ij | { } ¯ ¯ ¯i¯j ˆs1,ij = arg min p (i, j) rij (x, ˆs1) . (7.9) ˆs1 ¯ ¯ Xi,j Substituting (7.7) into (7.9) and setting the derivative to be equal to zero, we obtain the optimal estimate under ηij: ¯i¯j ¯ ¯ ¯ ¯ ¯i¯j bij p (x i, j) p (i, j) W¯i¯j x ˆs1,ij = ¯¯ | bij p (x ¯i, ¯j) p (¯i, ¯j) P ¯i¯j ij | , GijPx . (7.10) 146 CHAPTER 7. SINGLE-SENSOR AUDIO SOURCE SEPARATION

The derivation of (7.10) is given in Appendix 7.A. Note that in case the parameters ¯i¯j bij are all equal, then the estimator (7.10) reduces to the mmse estimator (7.3) and the estimation does not depend on the classification rule. ¯i¯j The average risk rij (x, ˆs1) is a function of the observed mixed signal and the optimal estimate under ηij. Let 1 denote a column vector of ones. Then, by substituting (7.10) into (7.7), we obtain (see Appendix 7.B)

¯i¯j ¯i¯j H 2 T (¯j) r (x, ˆs )= b p (x ¯i, ¯j) x W 2W¯¯G x + 1 Σ W¯¯1 . (7.11) ij 1 ij | ¯i¯j − ij ij 2 ij h  i From (7.6) and (7.8), the optimal classification rule ηij (x) is obtained by minimizing the weighted average risks over all pairs of states:

¯ ¯ ¯i¯j min p (i, j) rij (x, ˆs1) . (7.12) ij ¯ ¯ Xi,j

¯i¯j If we consider the degenerated case of equal parameters bij, then the averaged risk ¯i¯j rij (x, ˆs1) does not depend on i, j and therefore there is no specific pair which mini- mizes (7.12). However, as already mentioned above, in this case there is no need for a classification since the estimator does not depend on the decision rule. To summarize, minimizing the combined Bayes risk is obtained by first evaluating the optimal gain matrix G under each pair i, j using (7.10), and subsequently the optimal ij { } classifier chooses the appropriate pair (and the appropriate gain matrix) using (7.11) and (7.12). The combined solution guaranties minimum combined Bayes risk [116]. ¯i¯j The selection of the parameters bij is application dependent, since these parameters determine the penalty for choosing each set of states compared to all other sets. Recall we would like to define specific states for signal absence, we consider from now on that the signal states are q 0, 1, ..., m and q 0, 1, ..., m where q = 0 and q = 0 are 1 ∈{ 1} 2 ∈{ 2} 1 2 ¯i¯j ¯i ¯j the signal absence states. Accordingly, we define separable parameters bij = bi bj, where b0 with i = 0 is related to the cost of false detection and b¯i with ¯i = 0 is related to the i 6 0 6 cost of missed detection of the desired signal. Specifically, we define

b i =0, ¯i =0 1,m 6 ¯ bi =  b i =0, ¯i =0 (7.13) i  1,f  6  1 o.w.   7.2. CODEBOOK-BASED SEPARATION 147

¯j with b1,m, b1,f > 0, and for signal s2, bj is defined similarly (with parameters b2,m and b2,f ). By using this definition, we practically assume equal parameters (i.e., one) for all cases except for missed detection and false detection. As can be seen from (7.11)–(7.12), higher b2,m (or b2,f ) results in larger average risk which corresponds to this decision, and therefore, lower chances for the optimal detector to take this decision. However, as can be seen from (7.10), the high valued parameter raises the contribution of the corresponding state on the system estimate. If a parameter is smaller than one, than the chances of the detector to take this decision are higher, but, as the estimator (7.10) compensates for wrong decisions, this contribution on the system estimate would be low. Missing to detect the desired signal results, in general, in removal of desired signal components and therefore distort the desired signal estimate. On the other hand, false detection may result in residual interference. By affecting both the decision rule and the corresponding estimation, these parameters help to control the trade-off between residual interference when the desired signal is absent (resulting from false detection) and the distortion of the desired signal caused by missed detection. The computational complexity of the simultaneous classification and estimation ap- proach is higher than that associated with the sequential MAP classification and mmse estimation (7.1)–(7.2), or the mmse estimator (7.3). However, the estimator (7.10) is optimal, not only when combined with the optimal classifier (7.12), but also when com- bined with any given classifier [116]. Therefore, this estimator may be combined with a sub-optimal classifier [e.g., the MAP classifier given by (7.1)] to reduce the computational requirements, while still using parameters which compensate for false classification. In the following subsection we discuss this option of employing a non-ideal classifier and show that the same estimator (7.10) can be obtained by solving a constrained optimization problem. In this problem formulation it is shown that the cost parameters may also have the interpretation of Lagrange multipliers.

7.2.2 Joint Classification and Estimation

The application of a given classifier (e.g., a MAP classifier) followed by an estimator is shown in Figure 7.1. We denote this scheme as joint classification and estimation. 148 CHAPTER 7. SINGLE-SENSOR AUDIO SOURCE SEPARATION

ηij x = s1 + s2 Classifier Estimator ˆs1,ij

Figure 7.1: A cascade classification and estimation scheme.

In order to simplify the derivation, we assume in this subsection only signal absence or presence states i, j 0, 1 (i.e., m = m = 1) where i = 0 and i = 1 represent presence ∈{ } 1 2 and respectively absence of s1, and j similarly specifies the state of s2. The classifier is generally not ideal and may suffer from miss and false detections. Therefore, under false decision that the signal is absent when actually the signal is present, we may want to control the distortion level, while under false detection of signal components we wish to control the level of residual interference. Under the two hypotheses, the mean signal distortion is defined by

ε¯2(x) , p (q =1 x) E s ˆs 2 q =1, x (7.14) d 1 | k 1 − 1k2 | 1  and the mean residual interference is defined by

ε¯2(x) , p (q =0 x) E s ˆs 2 q =0, x . (7.15) r 1 | k 1 − 1k2 | 1 

Therefore, for a decision that signal is absent (i.e., η0j) we have the following problem

2 ˆs1,0j = arg min p (q1 =0 x) E s1 ˆs1 2 q1 =0, x ˆs1 | k − k |  s.t. ε¯2(x) σ2 , (7.16) d ≤ d while for a signal-presence decision (η1j) we have

2 ˆs1,1j = arg min p (q1 =1 x) E s1 ˆs1 2 q1 =1, x ˆs1 | k − k |  s.t. ε¯2(x) σ2 (7.17) r ≤ r

2 2 where σd and σr are bounds for the mean distortion and mean residual interference, respectively. The optimal estimator can be obtained by using a method similar to [57,104].

Under η0j the Lagrangian is defined by (e.g., [154]):

L (ˆs ,µ )= p (q =0 x) E s ˆs 2 q =0, x + µ ε¯2(x) σ2 (7.18) d 1 d 1 | k 1 − 1k2 | 1 d d − d   7.2. CODEBOOK-BASED SEPARATION 149 and µ ε¯2(x) σ2 =0 for µ 0 . (7.19) d d − d d ≥  ¯2 Under η1j the Lagrangian Lr (ˆs1,µr) is defined similarly using µr and εr(x). Then, ˆs1,0j (or

ˆs1,1j) is a stationary feasible point if it satisfies the gradient equation of the appropriate 3 Lagrangian [i.e., L (ˆs ,µ ) or L (ˆs ,µ )]. From s L (ˆs ,µ ) = 0 we have d 1 d r 1 r ▽ˆ1 d 1 d p (q =0 x) E s q =0, x + µ p (q =1 x) E s q =1, x ˆs = 1 | { | 1 } d 1 | { | 1 } 1,0j p (q =0 x)+ µ p (q =1 x) 1 | d 1 | p (q1 =0 x) ¯ p (¯j q1 =0, x) E s q1 =0, ¯j, x = | j | { | } p (q =0, ¯j x)+ µ p (q =1, ¯j x) ¯j 1 P | d ¯j 1 | µd p (q1 =1 x) ¯ p (¯j q1 =1, x) E s q1 =1, ¯j, x + P | j | P { | } p (q =0, ¯j x)+ µ p (q =1, ¯j x) ¯j 1 P | d ¯j 1 | ¯ p (q1 =0, ¯j x) W0¯jx + µd ¯ p (q1 =1, ¯j x) W1¯jx = j P | Pj | p (q =0, ¯j x)+ µ p (q =1, ¯j x) P ¯j 1 | d P¯j 1 | ¯i ¯ ¯ ¯i¯j µ˜Pd p (i, j x) W¯i¯jx P = | (7.20) µ˜¯i p (¯i, ¯j x) P ¯i¯j d | where P ¯ ¯i µd i =1 µ˜d = . (7.21)  1 ¯i =0  Similarly, under signal-presence decision we have  ¯i ¯¯ µ˜ p (¯i, ¯j x) W¯i¯jx ˆs = ij r | (7.22) 1,1j µ˜¯i p (¯i, ¯j x) P ¯i¯j r | with P ¯ ¯i µr i =0 µ˜r = . (7.23)  1 ¯i =1  Therefore, in general we can write  ¯i ¯ ¯ ¯i¯j µip (i, j x) W¯i¯jx ˆs1,ij = | (7.24) µ¯ip (¯i, ¯j x) P ¯i¯j i | with P

µd i =0, ¯i =1 ¯ µi =  µ i =1, ¯i =0 (7.25) i  r   1 o.w.

3  Note that as shown in [57,104], there is no closed form solution for the value of the Lagrange multiplier. Instead it is used as a non-negative parameter. 150 CHAPTER 7. SINGLE-SENSOR AUDIO SOURCE SEPARATION

¯j and the estimator (7.24) is the same as (7.10) with bj = 1, b1,m = µd, and b1,f = µr in ¯i¯j (7.13). Therefore, we can identify the parameters bij as non-negative Lagrange multipliers of a constrained optimization problem. In addition, if b1,m (or b1,f) equals zero, then the corresponding Lagrange multiplier also reduces to zero and the constraint in (7.16) [or (7.17)] is inapplicable. Therefore, the problem reduces to a standard conditional mmse problem, which results in the estimator (7.2) which assumes a perfect classifier. The main difference between the problem formulations in Sections 7.2.1 and 7.2.2 is that the former defines a classifier and a coupled estimator which are designed to minimize a combined Bayes risk, while the latter assumes a given classifier, and formulates a constrained optimization problem in order to find the optimal estimator for the given classification rule.

7.3 GMM Vs. GARCH Codebook

In this section, we introduce a new codebook for mixtures of speech and music signals. GMM was used in [88–90] for generating codebooks for speech signals as well as for music signals in the STFT domain, under the assumption of diagonal covariance matrices. The covariance matrices and the a priori state probabilities are estimated by either maximizing the log-likelihood of the trained signal using expectation-maximization algorithm [62,155] , or by using the k-means vector quantization algorithm [62,156]. Using a finite-state model with predetermined densities as in the case of GMM, mixture of AR models or HMM with AR sub-sources, the diagonal vector of the covariance matrices can take values only from RN a specific subspace of + spanned by the given codewords. This limitation for the pdf’s may restrict the usage of these models for statistically rich signals such as speech [92]. GARCH is a statistical model which explicitly parameterizes a time-varying condi- tional variance using past variances and squared absolute values, while considering volatil- ity clustering and excess kurtosis (i.e., heavy-tailed distribution) [1]. Expansion coeffi- cients of speech signals in the STFT domain, are clustered in the sense that successive mag- nitudes at a fixed frequency bin are highly correlated [25]. GARCH model has been found useful for modeling speech signals in speech enhancement applications [24], [127, 133], speech recognition [26], and voice activity detection [27]. It has been shown [127] that 7.3. GMM VS. GARCH CODEBOOK 151 spectral variance estimation resulting from this model is a generalization of the decision- directed estimator [33] with improved tracking of the speech spectral volatility. Therefore, we propose in this work to use GMM for modeling the music signal (say s2) and GARCH model with several states for the speech signal (say s1).

According to the GMM formalism, p2(j) is the a priori probability for the active state q = j, where conditioning on q = j, the vector in the STFT domain s 0, Σ(j) . 2 2 2 ∼ CN 2 For defining the GARCH modeling we first let s1(ℓ) denote the ℓth frame of s1 in the

STFT domain. We assume that s1(ℓ) is a mixture of GARCH processes of order (1, 1).

Then, given that q1 (ℓ)= iℓ is the active state at frame ℓ, s1 (ℓ) has a complex-normal pdf

(iℓ) (iℓ) with zero mean and a diagonal covariance matrix Σ1 = diag λℓ|ℓ−1 . The conditional (iℓ) n o variance vector λℓ|ℓ−1 is the vector of variances at frame ℓ conditioning on the information up to frame ℓ 1. This conditional variance is a linear function of the previous conditional − variance and squared absolute value:

λ(iℓ) = λ(iℓ) 1 + α(iℓ)s (ℓ 1) s∗ (ℓ 1) ℓ|ℓ−1 min 1 − ⊙ 1 − + β(iℓ) λ(iℓ−1) λ(iℓ−1) 1 (7.26) ℓ−1|ℓ−2 − min   where denotes a term-by-term vector multiplication, ∗ denotes complex conjugate, ⊙ (i ) and λ ℓ > 0 and α(iℓ), β(iℓ) 0 for i = 0, 1, ..., m are sufficient conditions for the min ≥ ℓ 1 (iℓ) (iℓ) positivity of the conditional variance [127,133]. In addition, α + β < 1 for all iℓ is a sufficient condition for a finite unconditional variance4 [5]. The conditional density results from (7.26) is time varying and depends on all past values (through previous conditional (i) variances) and also on the regime path up to the current time. While λmin set the lower bounds for the conditional variances in each state, the parameters α(i) and β(i) set the volatility level and the autoregression behavior of the conditional variances. Note that this model is a degenerated case of the Markov-switching GARCH (MS-GARCH) model [7,8,127]. In the MS-GARCH model the sequence of states is a first-order Markov chain with state transition probabilities p (q (ℓ)= i q (ℓ 1) = i ). However, to reduce the 1 ℓ | 1 − ℓ−1 model complexity and to allow a simpler online estimation procedure under the presence of a highly nonstationary interfering signal, we assume here that the state transition probabilities equal the a priori state probabilities, i.e., p (q (ℓ)= i q (ℓ 1) = i ) = 1 ℓ | 1 − ℓ−1 4For necessary and sufficient condition see [121]. 152 CHAPTER 7. SINGLE-SENSOR AUDIO SOURCE SEPARATION p1 (iℓ), similarly to the assumption used in [88] for the GMM approach.

(iℓ) It can be seen from (7.26) that the vector of conditional variances λℓ|ℓ−1 may take RN (iℓ) any values in + with lower bound λmin for each entry. However, even if the active

(iℓ) (iℓ) state is known, the covariance matrix Σ1 (or the vector of conditional variances λℓ|ℓ−1) is unknown and should be reconstructed recursively using all previous signal values and active states. Moreover, since both s1 and the Markov chain are random processes, the vector of conditional variances is also a random process which follows (7.26). As we only have a mixed observation, we may estimate this random process of conditional variances based on the recursive estimation algorithm proposed in [127]. Assume that we have an estimate for the set of conditional variances at frame ℓ based on information up to frame ˆ ˆ (iℓ) ℓ 1, Λℓ , λℓ|ℓ−1 , then, following the model definition an mmse estimate of the − iℓ next-frame conditionaln o variance follows

λˆ (iℓ+1) = E λ(q1(ℓ+1)) q (ℓ +1) = i , Λˆ , x (ℓ) ℓ+1|ℓ ℓ+1|ℓ | 1 ℓ+1 ℓ = λ(inℓ+1) 1 + α(iℓ+1)E s (ℓ) s∗ (ℓ) Λˆ , xo(ℓ) min 1 ⊙ 1 | ℓ + β(iℓ+1) E λ(q1(ℓn)) Λˆ , x (ℓ) E λ(q1(ℓ))o Λˆ , x (ℓ) 1 (7.27) ℓ|ℓ−1 | ℓ − min | ℓ  n o n o  for iℓ+1 =0, 1, ..., m1. Using

λˆ (iℓ,jℓ) , E s (ℓ) s∗ (ℓ) i , j , Λˆ , x (ℓ) ℓ|ℓ 1 ⊙ 1 | ℓ ℓ ℓ n −1 o −1 = Σˆ (iℓ) Σˆ (iℓ) + Σ(jℓ) Σ(jℓ)1+Σˆ (iℓ) Σˆ (iℓ) + Σ(jℓ) x (ℓ) x∗ (ℓ) (7.28) 1 1 2 2 1 1 2 ⊙       we obtain

E s (ℓ) s∗ (ℓ) Λˆ , x (ℓ) = p i , j Λˆ , x (ℓ) E s (ℓ) s∗ (ℓ) i , j , Λˆ , x (ℓ) 1 ⊙ 1 | ℓ ℓ ℓ | ℓ 1 ⊙ 1 | ℓ ℓ ℓ i ,j n o Xℓ ℓ   n o = p i , j Λˆ , x (ℓ) λˆ(iℓ,jℓ) (7.29) ℓ ℓ | ℓ ℓ|ℓ i ,j Xℓ ℓ  

E λ(q1(ℓ)) Λˆ , x (ℓ) = p i , j Λˆ E λ(q1(ℓ)) q (ℓ)= i , Λˆ , x (ℓ) ℓ|ℓ−1 | ℓ ℓ ℓ | ℓ ℓ|ℓ−1 | 1 ℓ ℓ i ,j n o Xℓ ℓ   n o p i , j Λˆ λˆ (iℓ) (7.30) ≈ ℓ ℓ | ℓ ℓ|ℓ−1 i ,j Xℓ ℓ   and E λ(q1(ℓ)) Λˆ , x (ℓ) = p i , j Λˆ λ(iℓ) . (7.31) min | ℓ ℓ ℓ | ℓ min i ,j n o Xℓ ℓ   7.4. IMPLEMENTATION OF THE ALGORITHM 153

A detailed recursive estimation algorithm is given in [127]. The model parameters, i.e., (i ) λ ℓ , α(iℓ), β(iℓ) m1 , can be estimated from a training set using a maximum-likelihood { min }iℓ=0 approach [5,7,127] or may be evaluated as proposed in [133] such that each state would represent a different level of the optional dynamic range of the signals’ energy. By using the recursive estimation algorithm, we evaluate for each time frame ℓ and for each state ˆ (iℓ) iℓ an estimate of the spectral covariance matrix Σ1 which is required for the separation algorithm. Hence, the sets Σˆ (iℓ) and p (i ) together with the GMM for the back- 1 1 ℓ iℓ iℓ { } ground music signal may ben employedo by the classification and estimation procedure to obtain an estimate for each signal. Note that for the GMM, each state defines a specific pdf which is known a priori while for the mixing GARCH model the covariance matrices in each state are time-varying and are recursively reconstructed.

7.4 Implementation of the Algorithm

The existing GMM-, AR-, and HMM-based algorithms generally estimate each frame of the signal in the STFT domain using a vector formulation. However, many spectral enhancement algorithms for speech signals treat each frequency bin separately, e.g., [33,34,38]. Subband-based audio processing has been proposed for automatic speech recognition, e.g., [157,158], for speech enhancement [133], and also for single-channel source separation [152]. Instead of applying a statistical model to the whole frame, each subband is assumed to follow a different statistical model. Considering the GARCH modeling, the parameters $\lambda_{\min}^{(i)}$ specify the lower bounds for the conditional variances under each state. Since speech signals are generally characterized by lower energy levels in higher frequency bands, it is advantageous to apply different model parameters in different subbands, as proposed in [133]. For the implementation of the proposed algorithm we assume $K < N$ linearly-spaced frequency subbands for each frame with independent model parameters. Moreover, the sparsity of the expansion coefficients in the STFT domain (of both speech and music signals) implies that in a specific time frame the signal may be present in some of the frequency subbands and absent (or of negligible energy) in others. Therefore, we define a specific state for signal absence in each subband $k \in \{1, \ldots, K\}$: $q_1 = 0$ and $q_2 = 0$.

For these states the pdf is assumed to be a zero-mean complex Gaussian with covariance matrix $\sigma_{\min,k}^2 I$. Note that in the GMM case each state corresponds to a specific predetermined Gaussian density, while in the GARCH case, by setting $\alpha^{(0)} = \beta^{(0)} = 0$ and $\lambda_{\min}^{(0)} = \sigma_{\min,k}^2$ for the $k$th subband, the covariance under $q_1 = 0$ is also time invariant and equals $\sigma_{\min,k}^2 I$. In our experimental study, independent models are assumed for each subband, and therefore the model training, the conditional variance estimation (in the case of the speech signal), and the separation algorithm are all applied independently in each subband. However, in general, some overlap may be considered between adjacent subbands to allow some dependency between adjacent bands, as well as cross-band state probabilities, as proposed in [152].

Prior to source separation, both the GMM and the GARCH model need to be estimated using a set of training signals. In the case of the GMM, for each state $j \ne 0$ we need to estimate (for each subband independently) the diagonal vector of the covariance matrix $\Sigma_2^{(j)}$ and the state probability $p_2(j)$. For the GARCH modeling, the state probabilities are also required; however, only three scalars are needed to represent the covariance matrix for any $i \ne 0$: $\lambda_{\min}^{(i)}$, $\alpha^{(i)}$, and $\beta^{(i)}$. In our application, the GMM is trained by using the k-means vector quantization algorithm [62,156]. This model is sensitive to the similarity between the training signals and the desired signal in the mixture, and to achieve a good representation, the spectral shapes in the trained and mixed signals need to be closely related, as applied, e.g., in [88,89,95]. For training the GARCH model we use the method proposed in [133]. Accordingly, the training signals are used only to calculate the peak energy in each of the subbands. Then, we set $\lambda_{\min}^{(0)} = \sigma_{\min,k}^2$ and $\alpha^{(0)} = \beta^{(0)} = 0$. For the speech presence states, $i \in \{1, \ldots, m_1\}$, the parameters are chosen as follows. The lower bounds $\lambda_{\min}^{(i)}$ are log-spaced in the dynamic range, i.e., between $\lambda_{\min}^{(0)}$ and the peak energy. The parameters $\beta^{(i)}$ are experimentally set to 0.8, and $\alpha^{(i)}$ are evaluated for each subband independently such that the stationary variance in the subband, under an immutable state, equals the lower bound of the next state (see [133] for details). This approach yields reasonable results since it represents the whole dynamic range of the signals' energy, while the conditional variance is updated every frame by using the past observation and the past conditional variance. In addition, since only the peak energy is required for each subband, this approach has relatively low sensitivity to the training set, and only the peak energy levels need to be similar to those of the test set.


Figure 7.2: Block diagram of the proposed algorithm.


A block diagram of the proposed separation algorithm is shown in Figure 7.2 for a single band (in practice, a similar process is applied in each subband independently). The observed signal is first transformed into the STFT domain. Then, two steps are applied for each frame $\ell$. First, the GARCH conditional covariance matrices $\hat{\Sigma}_1^{(i_\ell)} = \mathrm{diag}\{ \hat{\boldsymbol{\lambda}}_{\ell|\ell-1}^{(i_\ell)} \}$ are updated using (7.28) for any pair $\{i_\ell, j_\ell\}$, and then propagated one frame ahead using (7.27) to yield the conditional variance estimate for the next frame. Second, using the sets $\{ \hat{\Sigma}_1^{(i_\ell)} \}$ and $\{ \Sigma_2^{(j)} \}$, the simultaneous classification and estimation method is applied, yielding the estimates $\hat{\mathbf{s}}_1(\ell)$ and $\hat{\mathbf{s}}_2(\ell)$. Finally, the desired signals are obtained by inverse transforming the estimates into the time domain.

Considering a simultaneous classification and estimation approach, as proposed in Section 7.2.1, the interrelations between the classifier and the estimator are exploited: the classification rule is calculated by using the set of gain matrices $\{G_{ij}\}$, and the classifier's output, $\eta_{ij}$, specifies the gain matrix to be used. However, a cascade of classification and estimation (as considered in Section 7.2.2) may be applied as the classification and estimation block to enable a sub-optimal solution with lower computational cost; a sketch of this cascade is given below. In fact, the computational complexity of this sub-optimal method is comparable to that of the mmse estimator (7.3), since the a posteriori probabilities required for the MAP classifier are also used in the estimation step.
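A minimal sketch of the cascade variant mentioned above: a MAP classification of the active state pair from the pair posteriors, followed by Wiener-type estimation conditioned on the selected pair. The complex-Gaussian likelihood, strictly positive priors, and all names are assumptions of this sketch, not the exact expressions of Sections 7.2.1–7.2.2.

```python
import numpy as np

def classify_then_estimate(x, sigma1, sigma2, prior1, prior2):
    """MAP-classify the active state pair, then apply the corresponding Wiener gains.

    x              : (N,) mixed STFT frame
    sigma1, sigma2 : (m1+1, N), (m2+1, N) diagonal spectral covariances of each source
    prior1, prior2 : (m1+1,), (m2+1,) a priori state probabilities (assumed > 0)
    """
    # log-likelihood of the mixture under each pair, assuming complex Gaussian coefficients
    var_mix = sigma1[:, None, :] + sigma2[None, :, :]          # (m1+1, m2+1, N)
    loglik = -np.sum(np.log(np.pi * var_mix) + np.abs(x) ** 2 / var_mix, axis=-1)
    logpost = loglik + np.log(prior1)[:, None] + np.log(prior2)[None, :]

    i_hat, j_hat = np.unravel_index(np.argmax(logpost), logpost.shape)

    # Wiener estimation conditioned on the selected pair
    gain1 = sigma1[i_hat] / (sigma1[i_hat] + sigma2[j_hat])
    s1_hat = gain1 * x
    s2_hat = (1.0 - gain1) * x
    return s1_hat, s2_hat, (i_hat, j_hat)
```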

7.5 Experimental Results

In this section we present experimental results evaluating the performance of the proposed algorithm. In Section 7.5.1 we describe the experimental setup and the objective quality measures. Then, in Section 7.5.2 we present the experimental results, which focus on (i) evaluating the performance of the proposed codebook compared to using a GMM-only model (while using mmse estimation for both codebooks), and (ii) evaluating the performance of the proposed simultaneous classification and estimation approach in the sense of signal distortion and residual interference.

7.5.1 Experimental setup and quality measures

In our experimental study, we consider speech signals mixed with piano music of about the same level. In the test set, speech signals are taken from the TIMIT database [142] and include 8 different utterances by 8 different speakers, half male and half female. The speech signals are mixed with two different piano compositions (Für Elise by L. van Beethoven and Romance O' Blue by W. A. Mozart) to yield 16 different mixed signals. For each of the piano signals, the first 10 seconds are used to create the mixed signals, while the rest of each composition (about 4 min each) is used for training the model. For training the speech model, a set of signals not included in the test set was used, half male and half female speakers (about 30 sec in length). All signals in the experiment are normalized to the same energy level, sampled at 16 kHz, and transformed into the STFT domain using a half-overlapping Hamming window of 32 msec length. The GMM parameters (for the piano model) are trained using the k-means vector quantization algorithm, and the GARCH parameters are estimated using only the signal's peak energy in each subband, as described in Section 7.4. For each of the sources, K = 10 linearly spaced subbands are considered, and for the signal-absence state the covariance

matrix is set to $\sigma_{\min,k}^2 I$, where $\sigma_{\min,k}^2$ is 40 dB below the highest averaged energy in the $k$th subband. Furthermore, in each subband, only frames in which the energy is within 40 dB of the peak energy (in the same subband) are considered for training the GMM. The proposed algorithm is compared with the mmse estimator proposed in [88]. The latter algorithm assumes a single-band GMM for both signals (i.e., with K = 1) and is referred to in the following as the GMM-based algorithm. This model is trained on the same training sets using the k-means algorithm. For each of the algorithms, 4, 8, and 16 states are considered for the GMM, while the GARCH model is trained with up to 8 states per subband (excluding the signal-absence state). The performance evaluation in our study includes objective quality measures, a subjective study of waveforms and spectrograms, and informal listening tests. The first quality measure is the segmental SNR (in the time domain), which is defined in dB by [143]

\[
\text{SegSNR} = \frac{1}{|\mathcal{H}_1|} \sum_{\ell \in \mathcal{H}_1} \mathcal{T} \left\{ 10 \log_{10} \frac{\sum_{n=0}^{N-1} s^2(n + \ell N/2)}{\sum_{n=0}^{N-1} \left[ s(n + \ell N/2) - \hat{s}(n + \ell N/2) \right]^2} \right\} , \tag{7.32}
\]
where $\mathcal{H}_1$ represents the set of frames which contain the desired signal, $|\mathcal{H}_1|$ denotes the number of elements in $\mathcal{H}_1$, $N = 512$ is the number of samples per frame, and the operator $\mathcal{T}$ confines the SNR in each frame to a perceptually meaningful range between $-10$ dB and 35 dB. The second quality measure is the log-spectral distortion (LSD), which is defined in dB by [41]

\[
\text{LSD} = \frac{1}{L} \sum_{\ell=0}^{L-1} \left\{ \frac{1}{N/2+1} \sum_{f=0}^{N/2} \left[ 10 \log_{10} \mathcal{C}\{s(\ell,f)\} - 10 \log_{10} \mathcal{C}\{\hat{s}(\ell,f)\} \right]^2 \right\}^{1/2} , \tag{7.33}
\]
where $s(\ell,f)$ denotes the $f$th element of the spectral vector $\mathbf{s}(\ell)$ (i.e., $f$ denotes the frequency-bin index), and $\mathcal{C}\{x\} \triangleq \max\{|x|^2, \epsilon\}$ is a spectral power clipped such that the log-spectrum dynamic range is confined to about 50 dB, that is, $\epsilon = 10^{-50/10} \times \max_{\ell,f}\{|s(\ell,f)|^2\}$. Although the segmental SNR and the LSD are common measures for speech enhancement, for the application of source separation it was proposed in [159,160] to measure the signal-to-interference ratio (SIR). For source $\mathbf{s}_1$ we may write

\[
\hat{\mathbf{s}}_1 = \zeta_1 \mathbf{s}_1 + \zeta_2 \mathbf{s}_2 + \mathbf{d} \,. \tag{7.34}
\]

Accordingly, the segmental SIR for $\hat{\mathbf{s}}_1$ is defined in dB as follows:

\[
\text{SegSIR} = \sum_{\ell} \mathcal{T} \left\{ 10 \log_{10} \frac{\zeta_1(\ell)^2 \sum_{n=0}^{N-1} s_1^2(n + \ell N/2)}{\zeta_2(\ell)^2 \sum_{n=0}^{N-1} s_2^2(n + \ell N/2)} \right\} , \tag{7.35}
\]
where the parameters $\zeta_1(\ell)$ and $\zeta_2(\ell)$ are calculated for each segment as specified in [159].
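The quality measures (7.32)–(7.33) can be computed directly from their definitions; the sketch below assumes half-overlapping frames of $N = 512$ samples, a simple energy threshold for selecting the frames of $\mathcal{H}_1$, and hypothetical helper names.

```python
import numpy as np

def seg_snr(s, s_hat, n=512, lo=-10.0, hi=35.0, act_thresh=1e-6):
    """Segmental SNR (7.32): average over frames that contain the desired signal."""
    vals = []
    for start in range(0, len(s) - n + 1, n // 2):
        ref, est = s[start:start + n], s_hat[start:start + n]
        if np.sum(ref ** 2) < act_thresh:              # frame not in H_1
            continue
        snr = 10 * np.log10(np.sum(ref ** 2) / (np.sum((ref - est) ** 2) + 1e-12))
        vals.append(np.clip(snr, lo, hi))              # the operator T of (7.32)
    return float(np.mean(vals))

def lsd(S, S_hat, dyn_range_db=50.0):
    """Log-spectral distortion (7.33); S, S_hat are (L, N/2+1) spectral coefficients."""
    p_ref, p_est = np.abs(S) ** 2, np.abs(S_hat) ** 2
    eps = 10 ** (-dyn_range_db / 10) * p_ref.max()     # clipping level epsilon
    d = 10 * np.log10(np.maximum(p_ref, eps)) - 10 * np.log10(np.maximum(p_est, eps))
    return float(np.mean(np.sqrt(np.mean(d ** 2, axis=1))))
```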

The above-mentioned measures attempt to evaluate the averaged performance of the algorithm. The proposed classification and estimation approach enables one to control the trade-off between the level of residual interference resulting from false detection of the desired signal, and signal distortion resulting mainly from missed detection. To measure this trade-off while applying the algorithm on a subband basis, we propose to measure the distortion of the estimated signal and the amount of interference reduction. Now let $\mathcal{H}_1$ and $\mathcal{H}_0$ denote the sets of (subband) frames which contain the desired signal and in which the desired signal is absent, respectively. The signal distortion, denoted as $\text{LSD}_{\mathcal{H}_1}$, is evaluated using the LSD formulation (7.33) and averaged only over $\mathcal{H}_1$. The interference reduction is evaluated in dB by [161]:

\[
\text{IR}_{\mathcal{H}_0} = 10 \log_{10} \frac{\sum_{\ell \in \mathcal{H}_0} \| \hat{\mathbf{s}}_1(\ell) \|^2}{\sum_{\ell \in \mathcal{H}_0} \| \mathbf{s}_2(\ell) \|^2} \,. \tag{7.36}
\]

7.5.2 Simulation results

For evaluating the contribution of the proposed codebook (i.e., a GARCH model for the speech and a GMM for the music), the results obtained by using the proposed model are first compared with those obtained by using the GMM-based algorithm. Since the GMM-based algorithm employs an mmse estimator, the proposed algorithm was applied in this experiment using constant cost parameters; as shown in Section 7.2.1, this also yields mmse estimation. Figure 7.3 shows quality measures as a function of the number of GARCH states (the improvements in SegSNR and SegSIR are obtained by subtracting the initial values calculated for the mixed signals from those calculated for the processed signals). These results are shown for a 4-, 8-, or 16-state GMM used for the music signal. For comparison, the results obtained by using the GMM-based algorithm are shown with 4, 8, and 16 states (for both signals). Note that the two algorithms use a different number of subbands and a different statistical model; however, the signal estimate in both cases is in the mmse sense. It can be seen that employing a GARCH model for the speech signal significantly improves the separation results, and sometimes even a single-state GARCH model outperforms the GMM modeling with up to 16 states. Moreover, excluding the SegSIR measure for speech signals, the performance improves monotonically with the number of GARCH states (except for some cases with the 8-state GMM). However, the most significant improvement is obtained by using up to 5 states for the GARCH model with a 4- or 16-state GMM for the music. Informal listening tests verify that increasing the number of GARCH states from one to 3 or 5 significantly improves the reconstructed signals, and particularly the perceptual quality of the speech signal. Using three (or more) states for the speech model results in improved signal quality compared to using a GMM for both the speech and the music signals. The GMM-based algorithm preserves mainly low frequencies of the music signal, and the residual speech components sound somewhat scrappy. The proposed approach results in a more natural music signal which contains a higher range of frequencies; the residual speech signal also sounds more natural.


Figure 7.3: Quality measures for mmse estimation as functions of the number of GARCH states. The results (with different numbers of GMM states for the music signal) are compared with the GMM-based algorithm. Left column: results for speech signals; right column: results for music signals. Rows (from top to bottom): SegSNR improvement, LSD, and SegSIR improvement.


Figure 7.4: Trade-off between residual interference and signal distortion resulting from changing the false detection and missed detection parameters; (a) residual music signal and (b) speech signal distortion.

Next, we verify the performance of the proposed simultaneous classification and estimation method. As this method enables one to control the trade-off between residual interference and signal distortion, we examine the influence of the cost parameters on these measures. The proposed algorithm was applied to the test set with different cost parameters. Figure 7.4 shows the trade-off between signal distortion and the reduction of the residual interference when examining the estimated speech signals.

Table 7.1: Averaged Quality Measures for the Estimated Speech Signals Using 3-state GARCH Model and 8-state GMM.

Parameters [b1,m, b1,f, b2,m, b2,f]   SegSNR improvement   SegSIR improvement   IR_H0    LSD_H1
[1, 1, 1, 1]                          3.67                 11.67                -10.34   2.45
[10^-2, 10^2, 10^2, 10^-2]            3.80                 13.83                -15.39   3.59
[10^2, 10^-2, 10^-2, 10^2]            3.54                 11.00                 -9.55   2.04
[10^-1, 10^1, 10^1, 10^-1]            3.76                 17.73                -11.73   3.06
[10^2, 10^-2, 10^2, 10^-2]            3.68                 11.44                 -9.88   2.16

Table 7.2: Averaged Quality Measures for the Estimated Music Signals Using 3-state GARCH Model and 8-state GMM.

Parameters [b1,m, b1,f, b2,m, b2,f]   SegSNR improvement   SegSIR improvement   IR_H0    LSD_H1
[1, 1, 1, 1]                          4.91                  9.81                 -7.27   3.04
[10^-2, 10^2, 10^2, 10^-2]            5.46                  9.77                 -5.95   3.35
[10^2, 10^-2, 10^-2, 10^2]            4.72                 10.05                 -8.91   2.84
[10^-1, 10^1, 10^1, 10^-1]            5.21                  9.75                 -6.27   3.25
[10^2, 10^-2, 10^2, 10^-2]            4.89                  9.98                 -7.47   2.91


Figure 7.5: Original and mixed signals. (a) Speech signal: "Draw every outer line first, then fill in the interior"; (b) piano signal (Für Elise); (c) mixed signal.


Figure 7.6: Separation of speech and music signals. (a) speech signal reconstructed by using the GMM-based algorithm (SegSNR improvement = 0.76, LSD = 3.77, SegSIR improvement = 1.29); (b) speech signal reconstructed using the proposed approach (SegSNR improvement = 2.46, LSD = 3.56, SegSIR improvement = 8.61); (c) piano signal reconstructed by using the GMM algorithm (SegSNR improvement = -2.77, LSD = 4.34, SegSIR improvement = 2.50); (d) piano signal reconstructed using the proposed approach (SegSNR improvement = 0.32, LSD = 3.19, SegSIR improvement = 4.79).

The averaged interference reduction, $\text{IR}_{\mathcal{H}_0}$ (in this case the reduction of the residual music), and the averaged speech distortion, $\text{LSD}_{\mathcal{H}_1}$, are shown as functions of the false detection parameter for the speech signal, $b_{1,f}$, and for several values of the missed detection parameter. These results are evaluated using a 3-state GARCH model and an 8-state GMM, and the simultaneous classification and estimation method. It is shown that when the false detection parameter increases, the level of residual interference decreases and the signal distortion increases. Therefore, for a specific application these parameters may be chosen to achieve a desired trade-off between signal distortion and residual interference. In Tables 7.1 and 7.2, we provide quality measures for both types of signals using different sets of parameters. This test was conducted on the whole test set using the simultaneous classification and estimation approach. It can be seen that by using different parameters, improved performance may be achieved compared to using equal parameters (i.e., mmse estimation). However, as expected, different parameters are needed to achieve the best performance in the sense of different quality measures. Specifically, in the case of speech signals, the highest interference reduction is achieved with the parameters (from the tested sets) that correspond to the highest distortion; on the other hand, the lowest distortion is obtained with the lowest amount of interference reduction. Figures 7.5 and 7.6 demonstrate the separation of a specific mixture of speech and piano signals. The speech waveform, the piano waveform, and their mixture are shown in Figure 7.5, and Figure 7.6 shows the separated signals resulting from the 8-state GMM-based algorithm and from the proposed simultaneous classification and separation approach (using a 3-state GARCH model for the speech signal, an 8-state GMM for the piano signal, $b_{1,m} = b_{2,f} = 5$, and $b_{1,f} = b_{2,m} = 15$). It can be seen that for this particular mixture, when estimating the speech signal the proposed algorithm yields higher attenuation of the piano signal, and the estimation of the piano signal preserves more energy of the desired signal, especially in its second half.

7.6 Conclusions

We have proposed a new approach for single-channel audio source separation of acoustic signals, which is based on classifying the mixed signal into codebooks and estimating the subsources. Unlike other classical methods which apply estimation alone, or distinct operations of classification and estimation, in our method both operations are designed simultaneously, or the estimator is designed to compensate for erroneous classification. In addition, a new codebook is proposed for speech signals in the STFT domain based on the GARCH model. Accordingly, less restrictive pdfs are enabled in the STFT domain compared to GMM- or AR-based models. Experimental results show that for mixtures of speech and music (piano) signals, applying the proposed codebook significantly improves the separation results compared to using a GMM for both signals, even when using a smaller number of states. In addition, applying a simultaneous classification and estimation approach enables one to control the trade-off between signal distortion and residual interference.

The proposed classification and estimation method may be advantageously utilized for other codebooks and for different types of signals. However, the selection of the optimal parameters in the general case may be codebook- as well as application-dependent and may be a subject for further research. Furthermore, the GARCH modeling for speech signals may be combined with various statistical models for the music signals other than GMM, such as mixture of AR or HMM with AR subsources.

7.A Derivation of (7.10)

By setting the derivative of $\sum_{\bar i, \bar j} p(\bar i, \bar j)\, r_{ij}^{\bar i \bar j}(\mathbf{x}, \hat{\mathbf{s}}_1)$ in (7.9) to zero we obtain
\[
0 = \sum_{\bar i, \bar j} b_{ij}^{\bar i \bar j}\, p(\bar i, \bar j) \left[ \hat{\mathbf{s}}_{1,ij} \int p(\mathbf{x} \mid \mathbf{s}_1, \bar j)\, p(\mathbf{s}_1 \mid \bar i)\, d\mathbf{s}_1 - \int \mathbf{s}_1\, p(\mathbf{x} \mid \mathbf{s}_1, \bar j)\, p(\mathbf{s}_1 \mid \bar i)\, d\mathbf{s}_1 \right] \tag{7.37}
\]
where

\[
p(\mathbf{x} \mid \mathbf{s}_1, \bar j)\, p(\mathbf{s}_1 \mid \bar i) = p(\mathbf{x} \mid \mathbf{s}_1, \bar i, \bar j)\, p(\mathbf{s}_1 \mid \bar i, \bar j) = p(\mathbf{x} \mid \bar i, \bar j)\, p(\mathbf{s}_1 \mid \mathbf{x}, \bar i, \bar j) \,. \tag{7.38}
\]

Substituting (7.38) into (7.37) we obtain

\[
0 = \sum_{\bar i, \bar j} b_{ij}^{\bar i \bar j}\, p(\bar i, \bar j) \left[ \hat{\mathbf{s}}_{1,ij}\, p(\mathbf{x} \mid \bar i, \bar j) - p(\mathbf{x} \mid \bar i, \bar j)\, E\left\{ \mathbf{s}_1 \mid \mathbf{x}, \bar i, \bar j \right\} \right] \tag{7.39}
\]
and accordingly
\[
\hat{\mathbf{s}}_{1,ij} = \frac{\sum_{\bar i, \bar j} b_{ij}^{\bar i \bar j}\, p(\mathbf{x} \mid \bar i, \bar j)\, p(\bar i, \bar j)\, W_{\bar i \bar j}\, \mathbf{x}}{\sum_{\bar i, \bar j} b_{ij}^{\bar i \bar j}\, p(\mathbf{x} \mid \bar i, \bar j)\, p(\bar i, \bar j)} \,. \tag{7.40}
\]

7.B Derivation of (7.11)

The average risk is given by

\[
r_{ij}^{\bar i \bar j}(\mathbf{x}, \hat{\mathbf{s}}_1) = \int C_{ij}^{\bar i \bar j}(\mathbf{s}_1, \hat{\mathbf{s}}_1)\, p(\mathbf{x} \mid \mathbf{s}_1, \bar j)\, p(\mathbf{s}_1 \mid \bar i)\, d\mathbf{s}_1 = b_{ij}^{\bar i \bar j} \int \| \mathbf{s}_1 - \hat{\mathbf{s}}_1 \|_2^2\, p(\mathbf{x}, \mathbf{s}_1 \mid \bar i, \bar j)\, d\mathbf{s}_1 \,. \tag{7.41}
\]
To simplify the notation, we assume in this appendix that the active states of both signals are known, so we may omit the indices $\{\bar i, \bar j\}$. Furthermore, we use $\mathbf{s}$ to denote $\mathbf{s}_1$ and we assume diagonal covariance matrices $\Sigma_1 = \mathrm{diag}\{\sigma_1^2(1), \sigma_1^2(2), \ldots, \sigma_1^2(N)\}$ and $\Sigma_2 = \mathrm{diag}\{\sigma_2^2(1), \sigma_2^2(2), \ldots, \sigma_2^2(N)\}$. Following these notations we obtain
\[
\int \| \mathbf{s} \|_2^2\, p(\mathbf{x}, \mathbf{s})\, d\mathbf{s} = \sum_f \left[ \int |s(f)|^2\, p(x(f), s(f))\, ds(f) \prod_{f' \ne f} \int p(x(f'), s(f'))\, ds(f') \right] \tag{7.42}
\]
where in this appendix $s(f)$ and $x(f)$ denote the $f$th elements of the vectors $\mathbf{s}$ and $\mathbf{x}$,

respectively (i.e., $f$ denotes the frequency-bin index). Let $\lambda(f) \triangleq \left( \sigma_1^{-2}(f) + \sigma_2^{-2}(f) \right)^{-1}$, let $\xi(f) \triangleq \sigma_1^2(f)/\sigma_2^2(f)$, let $\gamma(f) \triangleq |x(f)|^2/\sigma_2^2(f)$, and let $\upsilon(f) \triangleq \xi(f)\gamma(f)/\left(1+\xi(f)\right)$. By integrating over both the real and imaginary parts of $s(f)$ and using [148, eq. 3.462.2] we obtain
\[
\int |s(f)|^2\, p(x(f), s(f))\, ds(f) = \frac{\xi(f)\left(1 + \upsilon(f)\right)}{\pi\left(1+\xi(f)\right)^2} \exp\left\{ -\frac{\gamma(f)}{1+\xi(f)} \right\} \tag{7.43}
\]
and

\[
\int p(x(f'), s(f'))\, ds(f') = p(x(f')) = \frac{\exp\left\{ -\frac{\gamma(f')}{1+\xi(f')} \right\}}{\pi\, \sigma_2^2(f')\left(1+\xi(f')\right)} \,. \tag{7.44}
\]

Let $\Xi \triangleq \mathrm{diag}\{\xi(1), \xi(2), \ldots, \xi(N)\}$ and $V \triangleq \mathrm{diag}\{\upsilon(1), \upsilon(2), \ldots, \upsilon(N)\}$. Substituting (7.43) and (7.44) into (7.42) we obtain

\[
\int \| \mathbf{s} \|_2^2\, p(\mathbf{x}, \mathbf{s})\, d\mathbf{s} = \sum_f \frac{\xi(f)\left(1+\upsilon(f)\right)}{\pi\left(1+\xi(f)\right)^2} \exp\left\{ -\frac{\gamma(f)}{1+\xi(f)} \right\} \prod_{f' \ne f} \frac{1}{\pi\, \sigma_2^2(f')\left(1+\xi(f')\right)} \exp\left\{ -\frac{\gamma(f')}{1+\xi(f')} \right\} = \frac{\mathbf{1}^T \Sigma_1 (I+V)(I+\Xi)^{-1} \mathbf{1}}{\pi^N \left| \Sigma_2 (I+\Xi) \right|} \exp\left\{ -\mathbf{x}^H \left( \Sigma_1 + \Sigma_2 \right)^{-1} \mathbf{x} \right\} \,. \tag{7.45}
\]
Let subscripts $R$ and $I$ denote the real and imaginary parts of a complex-valued variable, respectively, and let $g_{ij}(f)$ denote the $f$th diagonal element of the matrix $G_{ij}$. Then, using [148, eq. 3.462.2] we obtain

\[
\int \left( \hat{\mathbf{s}}^H \mathbf{s} + \mathbf{s}^H \hat{\mathbf{s}} \right) p(\mathbf{x}, \mathbf{s})\, d\mathbf{s} = 2 \int \left( \hat{\mathbf{s}}_R^T \mathbf{s}_R + \mathbf{s}_I^T \hat{\mathbf{s}}_I \right) p(\mathbf{x}, \mathbf{s})\, d\mathbf{s} = 2 \sum_f \left[ \int g_{ij}(f) \left[ x_R(f) s_R(f) + x_I(f) s_I(f) \right] p(x(f), s(f))\, ds(f) \prod_{f' \ne f} \int p(x(f'), s(f'))\, ds(f') \right] = 2\, \frac{\mathbf{1}^T G_{ij} \Sigma_2 V \mathbf{1}}{\pi^N \left| \Sigma_2 (I+\Xi) \right|} \exp\left\{ -\mathbf{x}^H \left( \Sigma_1 + \Sigma_2 \right)^{-1} \mathbf{x} \right\} \,. \tag{7.46}
\]
Finally,

\[
\int p(\mathbf{x}, \mathbf{s})\, d\mathbf{s} = p(\mathbf{x}) = \frac{\exp\left\{ -\mathbf{x}^H \left( \Sigma_1 + \Sigma_2 \right)^{-1} \mathbf{x} \right\}}{\pi^N \left| \Sigma_1 + \Sigma_2 \right|} \,. \tag{7.47}
\]
Substituting (7.45)–(7.47) into (7.42) and using $W = \Xi (I + \Xi)^{-1}$, we obtain

\[
r_{ij}^{\bar i \bar j}(\mathbf{x}, \hat{\mathbf{s}}_1) = b_{ij}^{\bar i \bar j}\, p(\mathbf{x}) \left[ \mathbf{x}^H \left( W^2 - 2 W G_{ij} \right) \mathbf{x} + \mathbf{1}^T \Sigma_2 W \mathbf{1} \right] = \frac{b_{ij}^{\bar i \bar j}}{\pi^N \left| \Sigma_1 + \Sigma_2 \right|} \exp\left\{ -\mathbf{x}^H \left( \Sigma_1 + \Sigma_2 \right)^{-1} \mathbf{x} \right\} \left[ \mathbf{x}^H \left( W^2 - 2 W G_{ij} \right) \mathbf{x} + \mathbf{1}^T \Sigma_2 W \mathbf{1} \right] \,. \tag{7.48}
\]
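Under the diagonal-covariance assumptions of this appendix, the closed-form risk (7.48) can be evaluated numerically for each hypothesis pair; a minimal sketch follows, with hypothetical names for the gain and cost parameters (how such risks are combined into the decision rule is specified in Section 7.2, not here).

```python
import numpy as np

def risk_7_48(x, sigma1, sigma2, g, b):
    """Evaluate (7.48) for one hypothesis pair with diagonal covariances.

    x              : (N,) observed STFT frame
    sigma1, sigma2 : (N,) diagonal entries of Sigma_1 and Sigma_2 under the pair
    g              : (N,) diagonal entries of the gain matrix G_ij
    b              : scalar cost parameter b_ij for the pair
    """
    xi = sigma1 / sigma2                       # a priori ratio, diagonal of Xi
    w = xi / (1.0 + xi)                        # W = Xi (I + Xi)^{-1}
    # p(x): zero-mean complex Gaussian with covariance Sigma_1 + Sigma_2
    px = np.exp(-np.sum(np.abs(x) ** 2 / (sigma1 + sigma2))) / \
         (np.pi ** len(x) * np.prod(sigma1 + sigma2))
    quad = np.sum((w ** 2 - 2.0 * w * g) * np.abs(x) ** 2) + np.sum(sigma2 * w)
    return b * px * quad
```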

Chapter 8

Dual-Microphone Speech Dereverberation Using GARCH Modeling

In this chapter, we develop a dual-microphone speech dereverberation algorithm for noisy environments, which is aimed at suppressing late reverberation and background noise. The spectral variance of the late reverberation is obtained with adaptively-estimated direct path compensation. A Markov-switching generalized autoregressive conditional heteroscedasticity (GARCH) model is used to estimate the spectral variance of the desired signal, which includes the direct sound and early reverberation. Experimental results demonstrate the advantage of the proposed algorithm compared to a decision-directed-based algorithm.

8.1 Introduction

In many speech communication systems the received signal is degraded by reverberation, as well as background noise. The reverberant signal consists of a direct sound, early reverberation, and late reverberation. Early reflections mainly contribute to coloration and tend to improve the intelligibility, whereas late reverberation causes a noise-like perception and degrades the fidelity and intelligibility of the speech signal.

1This chapter is based on [162].


Speech dereverberation algorithms can be divided into two classes. Algorithms in the first class are based on estimating and inverting the room impulse response (RIR), e.g., [163]. In the second class, algorithms try to suppress reverberation without estimating the RIR, e.g., [82]. Recently, Habets et al. [83] proposed a dual-microphone dereverberation system which is aimed at suppressing late reverberation that results from the tail of the RIR by applying a spectral enhancement approach. A direct path compensation (DPC) is applied to the late reverberant spectral variance estimate to enable better attenuation of the late reverberation with less distortion of the desired signal. However, the parameter of the DPC was evaluated directly from the RIR, which is unknown in practice. In addition, the a priori signal-to-noise ratio (SNR) required for the spectral enhancement is estimated by using the traditional decision-directed approach. Recently, the generalized autoregressive conditional heteroscedasticity (GARCH) model with Markov regimes has been shown to be useful for speech enhancement applications [127,133]. The model takes into account the strong correlation of successive spectral magnitudes, and is more appropriate than the decision-directed approach for speech spectral variance estimation in noisy environments.

In this chapter, we develop an improved dual-microphone speech dereverberation algorithm which relies on Markov-switching GARCH (MS-GARCH) modeling of the desired early speech component, which consists of the direct sound and early reverberation. The model is applied to distinct frequency subbands and specifies the volatility clustering of successive spectral coefficients, while a speech-absence state is used for evaluating the speech presence probability. Furthermore, an adaptive approach is developed to estimate the parameter for the DPC directly from the observed signals. Experimental results show that by using the MS-GARCH modeling rather than the decision-directed approach, improved results can be obtained. Furthermore, by using the proposed algorithm, the performance obtained with a blindly estimated DPC parameter is comparable to that obtained with an optimal DPC parameter calculated from the actual RIR, which is unknown in practice.

The chapter is organized as follows. In Section 8.2, we formulate the speech dereverberation problem and briefly review the algorithm proposed in [83]. In Section 8.3, we derive an adaptive estimator for the DPC parameter. In Section 8.4 we describe the MS-GARCH model which is used for the desired signal, and in Section 8.5 we present some experimental results which demonstrate the improved performance of the proposed algorithm.

8.2 Dual-microphone dereverberation

Consider an $M$-microphone array located in a reverberant environment. Let $\mathbf{a}_m(n) = \left[ a_{m,0}(n), \ldots, a_{m,L-1}(n) \right]^T$ denote the RIR at time $n$ from the source signal $s(n)$ to the $m$th microphone, and let $d_m(n)$ denote the noise component received at the $m$th microphone. The observed signals are then given by

\[
z_m(n) = \mathbf{a}_m^T(n)\, \mathbf{s}(n) + d_m(n) \tag{8.1}
\]
where $\mathbf{s}(n) = \left[ s(n), \ldots, s(n-L+1) \right]^T$. The RIR, $\mathbf{a}_m(n)$, can be divided into the direct path and early reflections, denoted by $\mathbf{a}_m^d(n)$, and late reflections, denoted by $\mathbf{a}_m^r(n)$. Accordingly,
\[
a_{m,j}(n) = \begin{cases} a_{m,j}^d(n) & 0 \le j < t_r \\ a_{m,j}^r(n) & t_r \le j < L \end{cases} \,, \tag{8.2}
\]
where $t_r$ is the time where the late reverberation starts (about 40 to 80 ms). Hence, the reverberant signal can be divided into two signals

\[
\mathbf{a}_m^T(n)\, \mathbf{s}(n) = x_m(n) + r_m(n) \,, \tag{8.3}
\]
where $x_m(n)$ is the desired early speech component, and $r_m(n)$ denotes the late reverberant component. Applying the short-time Fourier transform (STFT) to the observed signals, we have

\[
Z_m(\ell,k) = X_m(\ell,k) + R_m(\ell,k) + D_m(\ell,k) \,, \tag{8.4}
\]
where $\ell$ represents the frame index, and $k$ the frequency-bin index. At the output of a delay-and-sum beamformer (DSB) which is steered towards the desired source, we have the time-frequency signal

Y (ℓ,k)= X (ℓ,k)+ R (ℓ,k)+ D (ℓ,k) . (8.5)

Habets et al. [83] proposed a dual-microphone dereverberation algorithm which is aimed at estimating the early speech component. In the system, shown in Figure 8.1, it is assumed that the arrival times of the direct speech signals are aligned.


Figure 8.1: Dual microphone speech dereverberation system.

The lower branch is a late reverberant spectral variance estimator (LRSVE), $\hat{\lambda}_r(\ell,k)$, while the upper branch includes a beamformer, a background noise estimator (NE), $\hat{\lambda}_d(\ell,k)$, and a post-filter.

The spectral variance of the noise signal, $\lambda_d(\ell,k)$, can be estimated, e.g., using [71]. The a priori SNR
\[
\xi(\ell,k) = \frac{\lambda_x(\ell,k)}{\lambda_r(\ell,k) + \lambda_d(\ell,k)} \tag{8.6}
\]
is estimated using the decision-directed approach [33].

The desired spectral coefficients are estimated by minimizing the mean square error of the log-spectral amplitude (LSA) [34] by assuming two hypotheses, speech presence (H1) and absence (H0). The resulting optimally-modified LSA estimator is given by [38]

\[
\hat{X}(\ell,k) = G_{H_1}(\ell,k)^{\,p(\ell,k)}\; G_{H_0}(\ell,k)^{\,1-p(\ell,k)}\; Y(\ell,k) \,, \tag{8.7}
\]

where GH1 (ℓ,k) is the LSA gain under speech presence [34] and

\[
G_{H_0}(\ell,k) = G_{\min}\, \frac{\hat{\lambda}_d(\ell,k)}{\hat{\lambda}_d(\ell,k) + \hat{\lambda}_r(\ell,k)} \tag{8.8}
\]
to allow reduction of the late reverberant signal down to the noise floor [83]. In the next section we derive an adaptive estimator for the late reverberant spectral variance, and in Section 8.4 we formulate the MS-GARCH modeling applied to the desired signal. The speech presence probability $p(\ell,k)$ is discussed in Section 8.4.2.
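A small sketch of the gain combination (8.7)–(8.8); as an assumption of this sketch, a plain Wiener gain stands in for the LSA gain of [34], and all names and default values are illustrative.

```python
def dereverb_gain(Y, xi, lam_d, lam_r, p, g_min=0.032):
    """Apply the soft gain of (8.7) with G_H0 from (8.8) to one STFT coefficient.

    Y            : observed STFT coefficient after the DSB
    xi           : a priori SNR estimate (8.6)
    lam_d, lam_r : estimated noise and late-reverberant spectral variances
    p            : speech presence probability
    g_min        : minimum gain (about -30 dB; an illustrative value)
    """
    g_h1 = xi / (1.0 + xi)                    # placeholder for the LSA gain of [34]
    g_h0 = g_min * lam_d / (lam_d + lam_r)    # (8.8): decay down to the noise floor
    return (g_h1 ** p) * (g_h0 ** (1.0 - p)) * Y
```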

8.3 Late reverberant spectral estimation

The spectral variance of the late reverberation at each microphone, $\lambda_{r,m}(\ell,k)$, can be obtained based on Polack's statistical reverberation model of the RIR [83], using an estimate of the spectral variance of the reverberant signal, $\lambda_{b,m}(\ell,k) = E\left\{ \left| X_m(\ell,k) + R_m(\ell,k) \right|^2 \right\}$. Let $T_{60}(k)$ denote the reverberation time of the room in the $k$th frequency band, let

$\delta(k) = 3\ln(10)/T_{60}(k)$, let $R$ denote the frame rate of the STFT, and let $\alpha(k) = \exp\{-2\delta(k)R/f_s\}$. Then, the spectral variance of the late reverberant signal $\lambda_r(\ell,k)$ at the output of the DSB is estimated by

\[
\hat{\lambda}_r(\ell,k) = \alpha(k)^{\frac{t_r}{R}}\, \frac{1}{2} \sum_{m=1}^{2} \hat{\lambda}_{b,m}\!\left( \ell - \frac{t_r}{R},\, k \right) \,. \tag{8.9}
\]

However, to avoid overestimation of $\lambda_r(\ell,k)$ when the source-microphone distance is smaller than the critical distance (i.e., the energy of the direct path is larger than the energy of all reflections), it was proposed to compensate for the overestimation of the spectral variance of the reverberant signal using

\[
\hat{\lambda}'_{b,m}(\ell,k) = \frac{\kappa_m(\ell)}{1+\kappa_m(\ell)}\, \alpha(k)\, \hat{\lambda}'_{b,m}(\ell-1,k) + \frac{1}{1+\kappa_m(\ell)}\, \hat{\lambda}_{b,m}(\ell,k) \,, \tag{8.10}
\]
where $\kappa_m(\ell)$ is a compensation parameter which is related to the direct and reverberant energy at the $m$th microphone. The compensated estimate $\hat{\lambda}'_{b,m}(\ell,k)$ is then used in (8.9) as the spectral variance estimate of the reverberant signal. It was shown in [83] that applying this DPC prevents over-estimation of the late reverberant spectral variance and improves the quality of the output signal. However, the DPC parameter, $\kappa_m$, was calculated directly from the presumably known RIR. Here, we propose to estimate the parameter $\kappa_m$ adaptively. In case $\kappa_m$ is too large, the spectral variance $\hat{\lambda}'_{b,m}(\ell,k)$ could become larger than $\hat{\lambda}_{b,m}(\ell,k)$, which indicates that over-estimation can occur and that the value of $\kappa_m$ should be decreased. Furthermore, during the free decay which occurs after an offset of the source signal, $\hat{\lambda}'_{b,m}(\ell,k)$ should be equal to $\hat{\lambda}_{b,m}(\ell,k)$. Estimation of $\kappa_m$ could therefore be performed after a speech offset. Unfortunately, the detection of speech offsets is rather difficult. However, we can conclude that $\kappa_m$ should at least fulfill the following conditions: (i) $\hat{\lambda}_{b,m}(\ell,k) \ge \hat{\lambda}'_{b,m}(\ell,k)$; (ii) when speech is present and $\hat{\lambda}_{b,m}(\ell,k) < \hat{\lambda}'_{b,m}(\ell,k)$ the value of $\kappa_m$ can be increased; (iii) when $\hat{\lambda}_{b,m}(\ell,k) > \hat{\lambda}'_{b,m}(\ell,k)$

the value of $\kappa_m$ can be decreased slowly; and (iv) when $\hat{\lambda}_{b,m}(\ell,k) = \hat{\lambda}'_{b,m}(\ell,k)$ the value of $\kappa_m$ is assumed to be correct. Therefore, we can update $\kappa_m(\ell)$ when speech is present using

\[
\hat{\kappa}_m(\ell+1) = \max\left\{ \hat{\kappa}_m(\ell) + \mu_\kappa \left( \frac{\sum_k \hat{\lambda}'_{b,m}(\ell,k)}{\sum_k \hat{\lambda}_{b,m}(\ell,k)} - 1 \right),\; 0 \right\} \,, \tag{8.11}
\]
where $\mu_\kappa$ ($0 < \mu_\kappa < 1$) denotes the step size.
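A minimal sketch of one frame of the DPC recursion (8.10) and the adaptive update (8.11) for a single microphone; the array shapes, the step-size value, and the function name are assumptions of this sketch.

```python
import numpy as np

def dpc_step(lam_b, lam_b_prime_prev, kappa, alpha_k, mu_kappa=0.1):
    """One frame of the direct-path compensation for microphone m.

    lam_b            : (K,) spectral variance estimate of the reverberant signal at frame l
    lam_b_prime_prev : (K,) compensated estimate of the previous frame
    kappa            : current DPC parameter kappa_m(l)
    alpha_k          : (K,) per-band decay alpha(k) = exp(-2*delta(k)*R/fs)
    Returns the compensated variance (8.10) and the updated kappa (8.11);
    the kappa update should only be applied when speech is present.
    """
    lam_b_prime = (kappa / (1.0 + kappa)) * alpha_k * lam_b_prime_prev \
                  + (1.0 / (1.0 + kappa)) * lam_b
    kappa_next = max(kappa + mu_kappa * (np.sum(lam_b_prime) / np.sum(lam_b) - 1.0), 0.0)
    return lam_b_prime, kappa_next
```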

8.4 Modeling early reverberation using GARCH

Speech signals are characterized by time-varying energy levels and volatility. The spectral coefficients of the speech signal can be effectively characterized using an MS-GARCH model [127,133]. The GARCH parameters specify the volatility of the spectral coefficients, and the Markovian regimes allow the model to switch between different sets of GARCH parameters. Let $q_\ell \in \{0, \ldots, Q\}$ denote the active state of a first-order Markov chain at frame $\ell$ with known state-transition probabilities. Let $\lambda_{x,q_\ell}(\ell,k \mid \ell-1)$ denote the conditional spectral variance of the desired signal $X(\ell,k)$ conditioned on $q_\ell$ and on all information up to the previous frame, and let $\{V(\ell,k)\}$ be iid complex Gaussian random variables with zero mean and unit variance. We assume that the spectral coefficients of the desired signal follow an MS-GARCH model [127], i.e., given $q_\ell$,

\[
X(\ell,k) = \sqrt{\lambda_{x,q_\ell}(\ell,k \mid \ell-1)}\; V(\ell,k) \tag{8.12}
\]
where

\[
\lambda_{x,q_\ell}(\ell,k \mid \ell-1) = \lambda_{\min,q_\ell} + \alpha_{q_\ell} \left| X(\ell-1,k) \right|^2 + \beta_{q_\ell} \left[ \lambda_{x,q_{\ell-1}}(\ell-1,k \mid \ell-2) - \lambda_{\min,q_{\ell-1}} \right] \tag{8.13}
\]
with $\lambda_{\min,q_\ell} > 0$ and $\alpha_{q_\ell}, \beta_{q_\ell} \ge 0$ for $q_\ell = 0, \ldots, Q$. As can be seen from (8.12) and (8.13), the conditional spectral variances of successive frames at a specific frequency bin are strongly correlated. However, given the sequence of the conditional spectral variances and the active states, the spectral coefficients $\{X(\ell,k)\}$ are statistically independent. It was shown that the spectral variance estimation resulting from this model is a generalization of the decision-directed estimator with improved tracking of the speech spectral volatility [127].

8.4.1 Spectral variance estimation

Let $\mathcal{Y}^\ell = \{ Y(l,k) \mid l \le \ell \}$ denote the set of the observed spectral coefficients up to frame $\ell$. Given $\mathcal{Y}^\ell$, the set of conditional spectral variances can be recursively estimated using a propagation step

\[
\hat{\lambda}_{x,q_\ell}(\ell,k \mid \ell-1) = \lambda_{\min,q_\ell} + \alpha_{q_\ell} E\left\{ |X(\ell-1,k)|^2 \,\middle|\, \mathcal{Y}^{\ell-1}, q_\ell \right\} + \beta_{q_\ell} E\left\{ \lambda_x(\ell-1,k \mid \ell-2) \,\middle|\, \mathcal{Y}^{\ell-1}, q_\ell \right\} - \beta_{q_\ell} E\left\{ \lambda_{\min,q_{\ell-1}} \,\middle|\, \mathcal{Y}^{\ell-1}, q_\ell \right\} \tag{8.14}
\]
and an update step

\[
E\left\{ |X(\ell-1,k)|^2 \,\middle|\, \mathcal{Y}^{\ell-1}, q_\ell \right\} = \sum_{q_{\ell-1}} p\left( q_{\ell-1} \mid \mathcal{Y}^{\ell-1}, q_\ell \right) E\left\{ |X(\ell-1,k)|^2 \,\middle|\, \mathcal{Y}^{\ell-1}, q_{\ell-1} \right\} \triangleq \sum_{q_{\ell-1}} p\left( q_{\ell-1} \mid \mathcal{Y}^{\ell-1}, q_\ell \right) \hat{\lambda}_{x,q_{\ell-1}}(\ell-1,k \mid \ell-1) \,. \tag{8.15}
\]
A detailed estimation algorithm is given in [127]. The estimate of the spectral variance of the desired signal is then obtained by

\[
\hat{\lambda}_x(\ell,k) = \sum_{q_\ell} p\left( q_\ell \mid \mathcal{Y}^\ell \right) \hat{\lambda}_{x,q_\ell}(\ell,k \mid \ell) \,. \tag{8.16}
\]
Note that although the spectral variance is specified for each frequency bin independently, the Markovian state is frequency independent. However, since different frequency bands of speech signals are characterized by different energy levels and volatility, it was proposed in [133] to apply the model independently to distinct subbands. Furthermore, a simple model estimation approach was proposed such that each state represents a different energy level, and a specific state specifies signal absence. However, in our case the desired signal contains early reverberation, so that the spectral variance at speech offsets decays more smoothly than in the case of a nonreverberant signal. Consequently, an immediate transition from a state which represents high spectral energy to a state which represents very low energy would not be expected. Therefore, the state transition probabilities are set such that the probability of a progressive state transition is much higher than the probability of an immediate transition from the higher energy level to the lower.
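A sketch of one propagation step (8.14)–(8.15) and the state mixing (8.16) for a single frequency bin; the conditional state probabilities are assumed to be provided by the recursive algorithm of [127], and all names are illustrative.

```python
import numpy as np

def propagate_variance(lam_prev_post, lam_prev_prop, lam_min, alpha, beta, p_prev_given_q):
    """Propagation step (8.14) with the mixing of (8.15) for every candidate state q_l.

    lam_prev_post  : (Q+1,) lambda_hat_{x,q}(l-1, k | l-1) for each previous state q_{l-1}
    lam_prev_prop  : (Q+1,) lambda_hat_{x,q}(l-1, k | l-2) for each previous state q_{l-1}
    lam_min, alpha, beta : (Q+1,) MS-GARCH parameters per state
    p_prev_given_q : (Q+1, Q+1) p(q_{l-1} | Y^{l-1}, q_l); the row index is q_l
    """
    e_x2 = p_prev_given_q @ lam_prev_post    # (8.15)
    e_lam = p_prev_given_q @ lam_prev_prop   # expected past conditional variance
    e_min = p_prev_given_q @ lam_min         # expected lambda_min of the previous state
    return lam_min + alpha * e_x2 + beta * (e_lam - e_min)   # (8.14)

def mix_states(lam_post, p_state):
    """(8.16): combine the per-state (updated) variances with p(q_l | Y^l)."""
    return float(p_state @ lam_post)
```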

8.4.2 Speech presence probability

The a posteriori speech presence probability, $p(\ell,k)$, required for (8.7) is originally calculated [38] based on a Gaussian model from the a priori speech presence probability. The latter is evaluated based on the time-frequency distribution of the a priori SNR, $\xi(\ell,k)$. For a multi-sensor system, it was proposed in [83] to exploit the spatial information and to use an additional parameter $P_{\text{spatial}}(\ell,k)$ for the a priori probability, which is evaluated based on the spatial coherence between the microphone signals. In our case, the multi-state model for the speech spectral coefficients inherently results in a conditional probability for each state. Having a specific state for speech absence (say $q_\ell = 0$), we obtain a speech presence probability for each subband in each frame, $p\left( q_\ell \ne 0 \mid \mathcal{Y}^\ell \right)$. Accordingly, we define
\[
P_{\text{sb}}(\ell,k) = \begin{cases} p_h & p\left( q_\ell \ne 0 \mid \mathcal{Y}^\ell \right) > T_h \\ p_l & p\left( q_\ell \ne 0 \mid \mathcal{Y}^\ell \right) < T_l \\ p\left( q_\ell \ne 0 \mid \mathcal{Y}^\ell \right) & \text{otherwise} \end{cases} \tag{8.17}
\]
where $p_l \le T_l \le T_h \le p_h$ are constraint parameters for the subband speech presence probability. The subband probability, $P_{\text{sb}}(\ell,k)$, is employed as an additional multiplicative parameter in the evaluation of the a priori speech presence probability. Note that although we do not use a specific index for the subband, $p\left( q_\ell \ne 0 \mid \mathcal{Y}^\ell \right)$ is calculated for each subband independently, and therefore $P_{\text{sb}}(\ell,k)$ also carries a frequency-bin index.
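A short sketch of the constraint (8.17); the limit and threshold values below are hypothetical placeholders, not those of the thesis experiments.

```python
import numpy as np

def subband_presence(p_active, p_l=0.005, p_h=0.998, t_l=0.1, t_h=0.9):
    """Constrain the subband speech presence probability as in (8.17).

    p_active is p(q_l != 0 | Y^l) for the subband (scalar or array).
    """
    return np.where(p_active > t_h, p_h, np.where(p_active < t_l, p_l, p_active))
```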

8.5 Experimental results

In our experimental study, we consider synthetic RIRs which were generated using the image method. The speech signals, sampled at 8 kHz, include male and female speakers, each of 20 seconds. A moderate level of white Gaussian noise was added to each of the microphone signals. The distance between the two microphones is 0.15 meter, and the source-to-microphone distance was set to 0.5 and 1 meter (both smaller than the critical distance). When applying the MS-GARCH model, the model parameters are estimated from the noisy signal as proposed in [133]. Segmental signal-to-interference ratio (SegSIR) and log-spectral distortion (LSD) are used to evaluate the performance of the proposed algorithm, as well as informal listening tests and inspection of spectrograms.

Table 8.1: SegSIR and LSD obtained by using the decision-directed approach and the proposed MS-GARCH-based approach. T60 = 0.5 sec and d = 0.5 meter. In parentheses: results using optimal DPC parameters.

                     d=0.5 m, SNR=15 dB           d=0.5 m, SNR=20 dB
                     SegSIR [dB]   LSD [dB]       SegSIR [dB]   LSD [dB]
Unprocessed          5.849         4.875          7.284         2.681
Decision-directed    8.359         1.995          8.745         1.744
                     (8.783)       (1.825)        (9.230)       (1.535)
MS-GARCH             9.010         1.700          9.392         1.493
                     (9.265)       (1.606)        (9.715)       (1.367)

For the quality measures, the direct sound signal was used as the reference signal. Figure 8.2 shows experimental results of the proposed algorithm as a function of the number of GARCH states, and for several reverberation times. The input SNR is 15 dB and the source-to-microphone distance is 0.5 m. It can be seen that the performance improves monotonically with the number of states, but the most significant improvement is achieved by using up to 3 Markovian states. Tables 8.1 and 8.2 compare the performance of the proposed algorithm with that of the original algorithm [83], which employs a decision-directed estimator for the a priori SNR. The reverberation time is T60 = 0.5 sec, and the proposed algorithm was applied with a 3-state MS-GARCH model. In Table 8.1 the source-to-microphone distance is 0.5 meter, and in Table 8.2 the distance is 1 meter. In both algorithms, the DPC parameters $\kappa_1$ and $\kappa_2$ are blindly estimated adaptively, as proposed in Section 8.3, and the results shown in parentheses are obtained using the optimal values evaluated from the actual RIRs. It can be seen that the GARCH modeling is more advantageous than the decision-directed approach, and that the blindly estimated DPC parameters yield results comparable to those obtained using the optimal values.

Table 8.2: SegSIR and LSD obtained by using the decision-directed approach and the proposed MS-GARCH-based approach. T60 = 0.5 sec and d = 1 meter. In parentheses: results using optimal DPC parameters.

                     d=1 m, SNR=15 dB             d=1 m, SNR=20 dB
                     SegSIR [dB]   LSD [dB]       SegSIR [dB]   LSD [dB]
Unprocessed          2.295         6.379          2.864         4.578
Decision-directed    4.289         3.583          4.385         3.482
                     (4.452)       (3.455)        (4.578)       (3.333)
MS-GARCH             4.551         3.521          4.654         3.442
                     (4.941)       (3.390)        (5.110)       (3.298)

In Figure 8.3, the spectrogram and waveform of a noisy signal are shown with an input SNR of 20 dB and a source-to-microphone distance of 1 m. The smearing caused by the late reverberation and the background noise are reduced. Wave files are available online at: http://siglab.technion.ac.il/˜ ari a/Audio demos.htm.

8.6 Conclusions

We have developed a dual-microphone speech dereverberation algorithm for noisy environments which is based on MS-GARCH modeling of the desired early speech component. The spectral variance of the late reverberation is estimated from the observed signals while compensating for the energy of the direct path. The algorithm blindly operates in noisy and reverberant environments without any knowledge of the RIR, except for the reverberation time, which can be obtained blindly using, e.g., [164]. It is shown that, compared to the original algorithm which employs the decision-directed estimator [83], improved performance is obtained with little distortion to the desired signal.


Figure 8.2: SegSIR and LSD as functions of the number of GARCH states (solid line: T60 = 0.25 sec, dashed line: T60 = 0.5 sec, and dotted line: T60 = 0.75 sec).


Figure 8.3: Spectrograms and waveforms of (a) a noisy and reverberated speech signal, and (b) the processed signal.

Chapter 9

Research Summary and Future Directions

9.1 Research summary

In this thesis, we have introduced a new statistical model for nonstationary signals in the joint time-frequency domain, which is based on a complex-valued GARCH model with Markov regimes. The model exploits the advantages of both the conditional heteroscedasticity structure of GARCH models and the time-varying characteristics of hidden Markov chains. We have developed conditions for finite second-order moments and for asymptotic stationarity of the model, as well as for other MS-GARCH formulations which are used in econometrics. Moreover, we have developed recursive algorithms for the estimation of the conditional variance, as well as for signal restoration in noisy environments. A new formulation was proposed for the speech enhancement problem, based on simultaneous operations of speech detection and estimation. Considering the problem of single-sensor audio source separation, we have generalized the simultaneous detection and estimation formulation to a multi-hypothesis case and incorporated the proposed MS-GARCH model for the speech signal. The result is a new algorithm for single-sensor audio source separation which is based on classification and estimation and on GARCH modeling. The main contributions of the thesis chapters are as follows.

In Chapter 3, we developed a comprehensive approach for stationarity analysis of MS-GARCH processes where finite-state-space Markov chains control the switching between regimes, and GARCH models of order (p, q) are active in each regime. In the case of processes with time-varying variances, conditions for asymptotic wide-sense stationarity are useful to ensure the existence of a finite asymptotic second-order moment. These conditions also show how some Markovian regimes can allow the conditional variance to grow over time while the process still has a finite second-order moment. Necessary and sufficient conditions for the asymptotic stationarity were obtained by constraining the spectral radius of representative matrices, which were built from the model parameters. These matrices also enabled derivation of compact expressions for the stationary variance of the processes.

Next, in Chapter 4, we have proposed a statistical model for nonstationary processes with a time-varying volatility structure in the STFT domain, such as speech signals. Exploiting the advantages of both the conditional heteroscedasticity structure of GARCH models and the time-varying characteristics of hidden Markov chains, we modeled the expansion coefficients as a multivariate complex GARCH process with Markov-switching regimes. The correlation between successive coefficients in the time-frequency domain was taken into consideration by using the GARCH formulation, which specifies the conditional variance as a linear function of its past values and past squared innovations. The time-varying structure of the conditional variance was determined by a hidden Markov chain which allows a different GARCH formulation in each state. We developed a recursive algorithm for estimating the signal and its conditional variance in the STFT domain from noisy observations. The conditional variance is recursively estimated for any regime by iterating propagation and update steps, while the evaluation of the regime conditional probabilities is based on the recursive correlation of the process. Experimental results demonstrated the improved performance of the proposed recursive algorithm compared to using an estimator which assumes a stationary process, even when the number of assumed regimes is smaller than the true number. When the number of assumed regimes approaches the true one, the recursive estimator yields restoration results comparable to those achievable by using the true model parameters. It was demonstrated that the recursive estimation approach has relatively small performance degradation compared to the theoretical estimation limit in the MMSE sense. Performance evaluation with real speech signals demonstrated better variance estimation when using a multi-regime model compared to using a single-regime model, and improved squared absolute value estimation in a noisy environment compared to using the decision-directed approach.

Chapter 5 addressed the problem of noncausal estimation. We developed state smoothing (i.e., noncausal state probability estimation) for MS-GARCH processes, in which case the conditional variances depend on both past observations and the regime path. The state smoothing may be incorporated within the restoration algorithm to improve signal reconstruction as it employs further information. In addition, state smoothing may improve the probability evaluation for speech absence and therefore may result in improved VAD. Our noncausal state probability solution generalized both the standard forward-backward recursions and the stable backward recursion of HMPs by capturing both the signal correlation along time and its conditioning on the regime path. Accordingly, we showed that the backward recursion requires two recursive steps for evaluating the conditional density of the given future observations corresponding to all optional future paths. Although the computational complexity of the generalized backward recursion grows exponentially with the delay, it was shown that a small number of future observations contributes the most significant improvement to the state estimation.

In Chapter 6, a novel formulation of the single-channel speech enhancement problem was developed. The formulation relies on coupled operations of detection and estimation in the STFT domain, and on a cost function that combines both the estimation and detection errors. A detector for the speech coefficients and a corresponding estimator for their values were jointly designed to minimize a combined Bayes risk. In addition, cost parameters enable control of the trade-off between speech quality, noise reduction, and residual musical noise. The proposed method generalizes the traditional spectral enhancement approach which considers estimation only under signal presence uncertainty. In addition, we have proposed a modified decision-directed a priori SNR estimator which is adapted to transient noise environments. Experimental results showed greater noise reduction with improved speech quality when compared with the STSA suppression rules under stationary noise. Furthermore, it was demonstrated that under a transient noise environment, greater reduction of transient noise components may be achieved by exploiting reliable information for the a priori SNR estimation with the simultaneous detection and estimation approach.

In Chapter 7, we have proposed a novel approach for single-channel blind source separation of acoustic signals. The approach was based on classifying the mixed signal into appropriate sub-models related to a given codebook, and correspondingly estimating each of the sources. Unlike other classical methods which apply estimation alone, or distinct operations of classification and estimation, in our method both operations were designed simultaneously, or the estimator was designed to allow compensation for erroneous classification. A new codebook was proposed for speech signals in the STFT domain based on the GARCH model. Accordingly, less restrictive pdfs are allowed in the STFT domain compared to GMM- or AR-based models. Experimental results showed that for mixtures of speech and music signals, applying the proposed codebook significantly improves the separation results compared to using a GMM for both signals, even when using a smaller number of states. In addition, applying a simultaneous classification and estimation approach enables one to control the missed and false detection rates and the trade-off between signal distortion and residual interference.

Finally, in Chapter 8, we have developed a dual-microphone speech dereverberation algorithm for noisy environments, which was based on MS-GARCH modeling of the desired early speech component. The spectral variance of the late reverberation was estimated from the observed signals while compensating for the energy of the direct path. The algorithm blindly operates in noisy and reverberant environments without any knowledge of the RIR, except for the reverberation time. It was shown that, compared with the original algorithm which employs the decision-directed estimator, improved performance was obtained with little distortion to the desired signal.

9.2 Future research directions

In this thesis, we have proposed a complex-valued MS-GARCH model and developed model-based algorithms for speech processing applications. Several directions may be interesting for future research; here we discuss some of the main issues. More specific details are given in the conclusions of each chapter.

Multivariate GARCH model: The MS-GARCH model considered in this research formulates the correlation along time of successive spectral variances. In addition, all frequencies in a specific subband share the same Markovian state and the same GARCH parameters. However, given their conditional variances, spectral coefficients at a specific frame are assumed statistically independent. A general formulation for a multivariate MS-GARCH model may parameterize statistical dependency between different frequencies at the same time frame. Specifically, a single-state multivariate GARCH process $X_t \in \mathbb{C}^N$ can be formulated as a zero-mean process with covariance matrix $\Lambda_t$ which is given by [165,166]

\[
\Lambda_t = C + \sum_{i=1}^{q} \left( \sum_{j=1}^{k} A_{ij} X_{t-i} X_{t-i}^H A_{ij}^H \right) + \sum_{i=1}^{p} \left( \sum_{j=1}^{k} B_{ij} \Lambda_{t-i} B_{ij}^H \right) . \tag{9.1}
\]
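A minimal sketch of the covariance recursion (9.1); the positive-definiteness conditions discussed next (positive-definite $C$, real-valued $A_{ij}$ and $B_{ij}$) are assumed to hold, and the array shapes and names are illustrative.

```python
import numpy as np

def mv_garch_covariance(C, A, B, X_past, Lambda_past):
    """Compute Lambda_t from (9.1).

    C           : (N, N) positive-definite constant term
    A           : (q, k, N, N) real-valued coefficient matrices A_ij
    B           : (p, k, N, N) real-valued coefficient matrices B_ij
    X_past      : (q, N) past observations X_{t-1}, ..., X_{t-q}
    Lambda_past : (p, N, N) past covariances Lambda_{t-1}, ..., Lambda_{t-p}
    """
    Lam = C.astype(complex).copy()
    for i in range(A.shape[0]):
        outer = np.outer(X_past[i], X_past[i].conj())      # X_{t-i} X_{t-i}^H
        for j in range(A.shape[1]):
            Lam += A[i, j] @ outer @ A[i, j].conj().T
    for i in range(B.shape[0]):
        for j in range(B.shape[1]):
            Lam += B[i, j] @ Lambda_past[i] @ B[i, j].conj().T
    return Lam
```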

To guarantee positive definiteness of $\Lambda_t$, $C$ should be positive definite and $A_{ij}$ and $B_{ij}$ should be real-valued matrices. This formulation allows non-diagonal covariance matrices, such that different frequencies are correlated by the model definition. Considering voiced speech segments, modeling the correlation between different frequencies, such as between the pitch frequency and its harmonics, may significantly improve the performance of model-based algorithms.

Spectral variance estimation using speech detection: Integrating the simultaneous detection and estimation approach with MS-GARCH modeling for the spectral coefficients may improve both the conditional variance restoration and the detection operation. Specifically, a speech-absence state in the MS-GARCH formulation gives important information for speech presence. However, the spectral coefficients in some frequencies may be of negligible energy even under a speech-present state. Since the conditional variances are reconstructed recursively, the uncertainty assumption requires incorporation of a detection scheme within the propagation and update steps of the variance estimation to improve conditional variance estimation. Furthermore, since the MS-GARCH is a multi-state model, incorporation of a detection and estimation scheme requires generalization of the latter approach to a multi-hypothesis case.

Multichannel speech processing: The proposed MS-GARCH modeling, as well as the simultaneous detection and estimation approach, may be employed for developing improved multichannel speech processing algorithms, such as multichannel speech enhancement, beamforming, and relative transfer function (RTF) identification. A major drawback of many existing multichannel postfiltering techniques is that highly nonstationary noise components are not dealt with. The MS-GARCH model and the simultaneous detection and estimation approach for the speech coefficients may be incorporated within the postfiltering of a beamformer. Considering a generalized sidelobe canceler scheme [167,168], speech components are stronger at the beamformer output than in the noise reference signals, while noise components are strongest at the reference signals [107]. Accordingly, the beam-to-reference ratio may improve speech detection. In the blocking branch of the beamformer, which is aimed at creating the noise reference signals, integrating a reliable detector may yield better reduction of both the coherent and incoherent noise at the beamformer output, since the detection may improve the blocking of the speech components from leaking into the noise reference signals. While the detection and estimation which are applied in the postfiltering should be designed for better speech quality and perceptual intelligibility, the detection operation within the blocking branch should be designed for maximum blocking of speech components. The proposed statistical model may also be useful for designing an RTF identification scheme that is adapted to speech signals. A detection and estimation scheme may be utilized to overcome the uncertainty of speech presence in the time-frequency domain. In time-frequency bins where speech components are detected, their PSD needs to be estimated as well as the RTF. However, under speech absence, only the cross-PSD of noise components may be estimated. Consequently, RTF identification performance, as well as the rate of convergence, may be improved.

Bibliography

Bibliography

[1] T. Bollerslev, "Generalized autoregressive conditional heteroskedasticity," Journal of Econometrics, vol. 31, no. 3, pp. 307–327, 1986.

[2] R. F. Engle, “Autoregressive conditional heteroskedasticity with estimates of the variance of United Kingdom inflation,” Econometrica, vol. 50, no. 4, pp. 987–1007, July 1982.

[3] R. T. Baillie and T. Bollerslev, "Prediction in dynamic models with time-dependent conditional variances," Journal of Econometrics, vol. 52, pp. 91–113, 1992.

[4] J. D. Hamilton, “A new approach to the economic analysis of nonstationary time series and the business cycle,” Econometrica, vol. 57, pp. 357–384, March 1989.

[5] ——, Time Series Analysis. Princeton University Press, 1994.

[6] S. F. Gray, "Modeling the conditional distribution of interest rates as a regime-switching process," Journal of Financial Economics, vol. 42, pp. 27–62, September 1996.

[7] F. Klaassen, “Improving GARCH volatility forecasts with regime-switching GARCH,” Empirical Economics, vol. 27, no. 2, pp. 363–394, March 2002.

[8] M. Haas, S. Mittnik, and M. S. Paolella, “A new approach to Markov-switching GARCH models,” Journal of Financial Econometrics, vol. 2, no. 4, pp. 493–530, Autumn 2004.

[9] J. Marcucci, "Forecasting stock market volatility with regime-switching GARCH models," Studies in Nonlinear Dynamics and Econometrics, vol. 9, no. 4, Article 6, 2005.


[10] M. J. Dueker, “Markov switching in GARCH processes and mean reverting stock market volatility,” Journal of Business and Economic Statistics, vol. 15, no. 1, pp. 26–34, January 1997.

[11] M. Frömmel, "Modelling exchange rate volatility in the run-up to EMU using Markov switching GARCH model," Hannover University, Discussion paper, no. 306, October 2004.

[12] J. Cai, “A Markov model of switching-regime ARCH,” Journal of Business and Economics Statistics, vol. 12, no. 3, pp. 309–316, July 1994.

[13] J. D. Hamilton and R. Susmel, “Autoregressive conditional heteroskedasticity and changes in regime,” Journal of Econometrics, vol. 64, pp. 307–333, July 1994.

[14] C. Francq and J.-M. Zakoïan, "The L2-structure of standard and switching-regime GARCH models," Stochastic Processes and their Applications, vol. 115, no. 9, pp. 1557–1582, 2005.

[15] C. S. Wong and W. K. Li, "On a mixture autoregressive conditional heteroscedastic model," Journal of the American Statistical Association, vol. 96, pp. 982–985, September 2001.

[16] C. Alexander and E. Lazar, "Normal mixture GARCH(1,1): applications to exchange rate modelling," ISMA Centre Discussion Papers in Finance 2004-06, The Business School for Financial Markets, University of Reading, 2004.

[17] M. Haas, S. Mittnik, and M. S. Paolella, "Mixed normal conditional heteroskedasticity," Journal of Financial Econometrics, vol. 2, no. 2, pp. 211–250, 2004.

[18] M. Yang, “Some properties of vector autoregressive processes with Markov-switching coefficients,” Econometric Theory, vol. 16, pp. 23–43, 2000.

[19] C. Francq and J.-M. Zakoïan, "Comments on the paper by Minxian Yang: "Some Properties of Vector Autoregressive Processes with Markov-Switching Coefficients"," Econometric Theory, vol. 18, pp. 815–818, 2002.

[20] ——, “Stationarity of multivariate Markov-switching ARMA models,” Journal of Econometrics, vol. 102, pp. 339–364, 2001.

[21] J. Yao, "On square-integrability of an AR process with Markov switching," Statistics and Probability Letters, vol. 52, pp. 265–270, 2001.

[22] A. Timmermann, “Moments of Markov switching models,” Journal of Econometrics, vol. 96, pp. 75–111, 2000.

[23] I. Cohen, "Modeling speech signals in time-frequency domain using GARCH," Signal Processing, vol. 84, no. 12, pp. 2453–2459, Dec. 2004.

[24] ——, "Speech spectral modeling and enhancement based on autoregressive conditional heteroscedasticity models," Signal Processing, vol. 86, no. 4, pp. 698–709, Apr. 2006.

[25] ——, “From volatility modeling of financial time-series to stochastic modeling and enhancement of speech signals,” in Speech Enhancement, J. Benesty, S. Makino, and J. Chen, Eds. Springer, 2005, ch. 5, pp. 97–114.

[26] M. Abdolahi and H. Amindavar, "GARCH coefficients as feature for speech recognition in Persian isolated digit," in Proc. 30th IEEE Int. Conf. Acoust., Speech Signal Process., ICASSP-05, Philadelphia, Pennsylvania, May 2005, pp. I.957–I.960.

[27] R. Tahmasbi and S. Rezaei, "A soft voice activity detection using GARCH filter and variance Gamma distribution," IEEE Trans. Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1129–1134, May 2007.

[28] H. Drucker, “Speech processing in a high ambient noise environment,” IEEE Trans. Audio Electroacoust., vol. AU-16, no. 2, pp. 165–168, June 1968.

[29] M. Berouti, R. Schwartz, and J. Makhoul, "Enhancement of speech corrupted by acoustic noise," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., ICASSP-79, vol. 4, Apr. 1979, pp. 208–211.

[30] S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-27, no. 2, pp. 113–120, Apr. 1979.

[31] J. S. Lim and A. V. Oppenheim, “Enhancement and bandwidth compression of noisy speech,” Proc. IEEE, vol. 67, no. 12, pp. 1586–1604, Dec. 1979.

[32] R. J. McAulay and M. L. Malpass, “Speech enhancement using a soft-decision noise suppression filter,” IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-28, no. 2, pp. 137–145, Apr. 1980.

[33] Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator,” IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-32, no. 6, pp. 1109–1121, Dec. 1984.

[34] ——, "Speech enhancement using a minimum mean-square error log-spectral amplitude estimator," IEEE Trans. Acoust., Speech, Signal Processing, vol. 33, no. 2, pp. 443–445, Apr. 1985.

[35] Y. Ephraim, "A Bayesian estimation approach for speech enhancement using hidden Markov models," IEEE Trans. Signal Processing, vol. 40, no. 4, pp. 725–735, Apr. 1992.

[36] ——, “Statistical-model-based speech enhancement systems,” Proceedings of The IEEE, vol. 80, no. 10, pp. 1526–1555, Oct. 1992.

[37] V. I. Shin, D.-S. Kim, M. Y. Kim, and J. Kim, “Enhancement of noisy speech by using improved soft global decision,” in Eurospeech, Scandinavia, 2001.

[38] I. Cohen and B. Berdugo, “Speech enhancement for non-stationary environments,” Signal Processing, vol. 81, pp. 2403–2418, Nov. 2001.

[39] W. Fong, S. J. Godsill, A. Doucet, and M. West, "Monte Carlo smoothing with application to audio signal enhancement," IEEE Trans. Signal Processing, vol. 50, no. 2, pp. 438–449, Feb. 2002.

[40] Y. Hu and P. C. Loizou, "A perceptually motivated approach for speech enhancement," IEEE Trans. Speech Audio Processing, vol. 11, no. 5, pp. 457–465, Sept. 2003.

[41] I. Cohen and S. Gannot, “Spectral enhancement methods,” in Springer Handbook of Speech Processing, J. Benesty, M. M. Sondhi, and Y. Huang, Eds. Springer, 2007, ch. 45.

[42] D. E. Tsoukalas, J. N. Mourjopoulos, and G. Kokkinakis, “Speech enhancement based on audible noise suppression,” IEEE Trans. Speech Audio Processing, vol. 5, no. 5, pp. 497–514, Nov. 1997.

[43] Z. Goh, K. C. Tan, and T. G. Tan, "Postprocessing method for suppressing musical noise generated by spectral subtraction," IEEE Trans. Speech Audio Processing, vol. 6, no. 3, pp. 287–292, May 1998.

[44] B. L. Lim, Y. C. Tong, and J. S. Chang, "A parametric formulation of the generalized spectral subtraction method," IEEE Trans. Speech Audio Processing, vol. 6, no. 4, pp. 328–337, July 1998.

[45] H. Gustafsson, S. E. Nordholm, and I. Claesson, "Spectral subtraction using reduced delay convolution and adaptive averaging," IEEE Trans. Speech Audio Processing, vol. 9, no. 8, pp. 799–807, Nov. 2001.

[46] D. Burshtein and S. Gannot, “Speech enhancement using a mixture-maximum model,” IEEE Trans. Speech Audio Processing, vol. 10, no. 6, pp. 341–351, Sept. 2002.

[47] R. Martin, “Speech enhancement based on minimum mean-square error estimation and super-Gaussian priors,” IEEE Trans. Speech Audio Processing, vol. 13, no. 5, pp. 845–856, Sept. 2005.

[48] J. Chen, J. Benesty, Y. Huang, and S. Doclo, "New insights into the noise reduction Wiener filter," IEEE Trans. Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1218–1234, July 2006.

[49] D. Malah, R. V. Cox, and A. J. Accardi, “Tracking speech-presence uncertainty to improve speech enhancement in non-stationary noise environments,” in Proc. 24th IEEE Int. Conf. Acoust., Speech Signal Process., ICASSP-99, Phoenix, Arizona, Mar. 1999, pp. 789–792.

[50] Y. Ephraim and I. Cohen, “Recent advancements in speech enhancement,” in CRC Electrical Engineering Handbook, 2005.

[51] J. Sohn and W. Sung, “A voice activity detector employing soft decision based noise spectrum adaptation,” in Proc. 23rd IEEE Int. Conf. Acoust., Speech Signal Process., ICASSP-98, vol. 1, Seattle, Washington, May 1998, pp. 365–368.

[52] J. Sohn, N. S. Kim, and W. Sung, "A statistical model-based voice activity detection," IEEE Signal Processing Lett., vol. 6, no. 1, pp. 1–3, Jan. 1999.

[53] Y. D. Cho and A. Kondoz, “Analysis and improvement of a statistical model-based voice activity detector,” IEEE Signal Processing Lett., vol. 8, no. 10, pp. 276–278, Oct. 2001.

[54] S. Gazor and W. Zhang, "A soft voice activity detector based on a Laplacian-Gaussian model," IEEE Trans. Speech Audio Processing, vol. 11, no. 5, pp. 498–505, Sept. 2003.

[55] A. Davis, S. Nordholm, and R. Togneri, "Statistical voice activity detection using low-variance spectrum estimation and an adaptive threshold," IEEE Trans. Audio, Speech, and Language Processing, vol. 14, no. 2, pp. 412–423, Mar. 2006.

[56] J.-H. Chang, N. S. Kim, and S. K. Mitra, “Voice activity detection based on multiple statistical models,” IEEE Trans. Signal Processing, vol. 54, no. 6, pp. 1965–1976, June 2006.

[57] Y. Ephraim and H. L. Van Trees, "A signal subspace approach for speech enhancement," IEEE Trans. Speech Audio Processing, vol. 3, no. 4, pp. 251–266, July 1995.

[58] F. Asano, S. Hayamizu, T. Yamada, and S. Nakamura, "Speech enhancement based on the subspace method," IEEE Trans. Speech Audio Processing, vol. 8, no. 5, pp. 497–507, Sept. 2000.

[59] F. Jabloun and B. Champagne, “A perceptual signal subspace approach for speech enhancement in colored noise,” in Proc. 27th IEEE Int. Conf. Acoust., Speech Signal Process., ICASSP-02, Orlando, Florida, May 2002, pp. 569–572.

[60] H. Lev-Ari and Y. Ephraim, “Extension of the signal subspace speech enhancement approach to colored noise,” IEEE Signal Processing Lett., vol. 10, no. 4, pp. 104–106, Apr. 2003.

[61] B. H. Juang and L. R. Rabiner, "Mixture autoregressive hidden Markov models for speech signals," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-33, no. 6, pp. 1404–1413, Dec. 1985.

[62] Y. Ephraim, D. Malah, and B.-H. Juang, "On the application of hidden Markov models for enhancing noisy speech," IEEE Trans. Acoust., Speech, Signal Processing, vol. 37, no. 12, pp. 1846–1856, Dec. 1989.

[63] Y. Ephraim and W. J. Roberts, “Revisiting autoregressive hidden Markov modeling of speech signals,” IEEE Signal Processing Lett., vol. 12, no. 2, pp. 166–169, Feb. 2005.

[64] S. E. Levinson, L. R. Rabiner, and M. M. Sondhi, "An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition," Bell Systems Technical Journal, vol. 62, no. 4, pp. 1035–1074, Apr. 1983.

[65] N. Z. Tishby, "On the application of mixture AR hidden Markov models to text independent speaker recognition," IEEE Trans. Signal Processing, vol. 39, no. 3, pp. 563–570, Mar. 1991.

[66] Y. Ephraim, H. Lev-Ari, and W. J. J. Roberts, "A brief survey of speech enhancement," in The Electronic Handbook. CRC Press, 2005.

[67] R. Martin and C. Breithaupt, "Speech enhancement in the DFT domain using Laplacian speech priors," in Proc. Int. Workshop on Acoust. Echo and Noise Control, IWAENC-03, Kyoto, Japan, Sept. 2003, pp. 87–90.

[68] R. Martin, "Speech enhancement using MMSE short time spectral estimation with Gamma distributed speech priors," in Proc. 27th IEEE Int. Conf. Acoust., Speech Signal Process., ICASSP-02, Orlando, Florida, May 2002, pp. I-253–I-256.

[69] I. Cohen, “Relaxed statistical model for speech enhancement and a priori SNR estimation,” IEEE Trans. Speech Audio Processing, vol. 13, no. 5, pp. 870–881, Sept. 2005.

[70] O. Cappé, "Elimination of the musical noise phenomenon with the Ephraim and Malah noise suppressor," IEEE Trans. Speech Audio Processing, vol. 2, no. 2, pp. 345–349, Apr. 1994.

[71] I. Cohen, “Noise spectrum estimation in adverse environments: Improved minima controlled recursive averaging,” IEEE Trans. Speech Audio Processing, vol. 11, no. 5, pp. 466–475, Sept. 2003.

[72] R. Martin, “Noise power spectral density estimation based on optimal smoothing and minimum statistics,” IEEE Trans. Speech Audio Processing, vol. 9, pp. 504–512, July 2001.

[73] E. A. P. Habets, Single- and Multi-Microphone Speech Dereverberation using Spectral Enhancement. Ph.D. Thesis, Technische Universiteit Eindhoven, The Netherlands, June 2007.

[74] E. A. P. Habets and P. C. W. Sommen, "Speech dereverberation using spectral subtraction and a generalized statistical reverberation model," Submitted to Elsevier’s Speech Communication Journal.

[75] E. A. P. Habets, S. Gannot, I. Cohen, and P. C. W. Sommen, "Joint dereverberation and residual echo suppression of speech signals in a noisy environment," Submitted to IEEE Trans. Audio, Speech, and Language Processing.

[76] P. Naylor and N. Gaubitch, "Speech dereverberation," in Proc. Int. Workshop on Acoust. Echo and Noise Control, IWAENC-05, Eindhoven, The Netherlands, Sept. 2005.

[77] S. Haykin, Blind Deconvolution, 4th ed. Prentice Hall information and system sciences series, 1994.

[78] M. Triki and D. T. M. Slock, “Delay and predict equalization for blind speech dereverberation,” in Proc. 31st IEEE Int. Conf. Acoust., Speech Signal Process., ICASSP-06, Toulouse, France, May 2006, pp. 97–100.

[79] M. Delcroix, T. Hikichi, and M. Miyoshi, "Precise dereverberation using multichannel linear prediction," IEEE Trans. Audio, Speech, and Language Processing, vol. 15, no. 2, pp. 430–440, Feb. 2007.

[80] B. Radlovic, R. Williamson, and R. Kennedy, "Equalization in an acoustic reverberant environment: robustness results," IEEE Trans. Speech Audio Processing, vol. 8, no. 3, pp. 311–319, May 2000.

[81] Y. Huang and J. Benesty, "A class of frequency-domain adaptive approaches to blind multichannel identification," IEEE Trans. Signal Processing, vol. 51, no. 1, pp. 11–24, Jan. 2003.

[82] N. Gaubitch and P. Naylor, "Spatiotemporal averaging method for enhancement of reverberant speech," in Proc. of the 15th International Conference on Digital Signal Processing (DSP 2007), July 2007, pp. 607–610.

[83] E. Habets, S. Gannot, and I. Cohen, "Dual-microphone speech dereverberation in a noisy environment," in Proc. 6th IEEE Int. Symposium on Signal Process. and Information Technology, ISSPIT-2006, Vancouver, Canada, Aug. 2006, pp. 651–655.

[84] K. Lebart and J. Boucher, "A new method based on spectral subtraction for speech dereverberation," Acta Acustica united with Acustica, pp. 359–366, 2001.

[85] G.-J. Jang, T.-W. Lee, and Y.-H. Oh, "Single-channel signal separation using time-domain basis functions," IEEE Signal Processing Lett., vol. 10, no. 6, pp. 168–171, June 2003.

[86] M. V. S. Shashanka, B. Raj, and P. Smaragdis, “Sparse overcomplete decomposition for single channel speaker separation,” in Proc. 32nd IEEE Int. Conf. Acoust., Speech Signal Process., ICASSP-07, Honolulu, Hawaii, Apr. 2007, pp. 641–644.

[87] L. Benaroya and F. Bimbot, "Wiener based source separation with HMM/GMM using a single sensor," in Proc. 4th Int. Sym. on Independent Component Analysis and Blind Signal Separation (ICA), Nara, Japan, Apr. 2003, pp. 957–961.

[88] L. Benaroya, F. Bimbot, and R. Gribonval, "Audio source separation with a single sensor," IEEE Trans. Audio, Speech, and Language Processing, vol. 14, no. 1, pp. 191–199, Jan. 2006.

[89] L. Benaroya, F. Bimbot, G. Gravier, and R. Gribonval, "Experiments in audio source separation with one sensor for robust speech recognition," Speech Communication, vol. 48, no. 7, pp. 848–854, July 2006.

[90] A. Ozerov, P. Philippe, R. Gribonval, and F. Bimbot, "One microphone singing voice separation using source-adapted models," in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, NY, Oct. 2005, pp. 90–93.

[91] L. Benaroya, R. Blouet, C. Févotte, and I. Cohen, "Single sensor source separation based on Wiener filtering and multiple window STFT," in Proc. Int. Workshop on Acoust. Echo and Noise Control, IWAENC-06, Paris, France, Sept. 2006, paper no. 52, pp. 1–4.

[92] A. Ozerov, P. Philippe, F. Bimbot, and R. Gribonval, "Adaptation of Bayesian models for single-channel source separation and its application to voice/music separation in popular songs," IEEE Trans. Audio, Speech, and Language Processing, vol. 15, no. 5, pp. 1564–1578, July 2007.

[93] G. Hu and D. Wang, "Monaural speech segregation based on pitch tracking and amplitude modulation," IEEE Trans. Neural Networks, vol. 15, no. 5, pp. 1135–1150, Sept. 2004.

[94] S. Srinivasan, J. Samuelsson, and W. B. Kleijn, "Codebook-based Bayesian speech enhancement," in Proc. 30th IEEE Int. Conf. Acoust., Speech Signal Process., ICASSP-05, vol. 1, Philadelphia, USA, Mar. 2005, pp. 1077–1080.

[95] ——, "Codebook driven short-term predictor parameter estimation for speech enhancement," IEEE Trans. Audio, Speech, and Language Processing, vol. 14, no. 1, pp. 163–173, Jan. 2006.

[96] J.-F. Cardoso, “Blind signal separation: Statistical principles,” Proceedings of the IEEE, vol. 86, no. 10, pp. 2009–2025, Oct. 1998.

[97] E. Vincent, “Musical source separation using time-frequency source priors,” IEEE Trans. Audio, Speech, and Language Processing, vol. 14, no. 1, pp. 91–98, Jan. 2006.

[98] S. C. Douglas and X. Sun, “Convolutive blind separation of speech mixtures using the natural gradient,” Speech Communication, vol. 39, pp. 65–78, 2003.

[99] L. Parra and C. Spence, “Convolutive blind separation of non-stationary sources,” IEEE Trans. Speech Audio Processing, vol. 8, no. 3, pp. 320–327, May 2000.

[100] N. Mitianoudis and M. E. Davies, "Audio source separation in convolutive mixtures," IEEE Trans. Speech Audio Processing, vol. 11, no. 5, pp. 489–497, Sept. 2003.

[101] G.-J. Jang and T.-W. Lee, "Single-channel signal separation using time-domain basis functions," Journal of Machine Learning Research, vol. 4, pp. 1365–1369, 2003.

[102] J. R. Hopgood and P. J. W. Rayner, "Single channel nonstationary stochastic signal separation using time-varying filters," IEEE Trans. Signal Processing, vol. 51, no. 7, pp. 1739–1752, July 2003.

[103] K. V. Sørensen and S. V. Andersen, "Speech presence detection in the time-frequency domain using minimum statistics," in Proc. of the 6th Nordic Signal Processing Symposium, NORSIG-2004, Espoo, Finland, June 2004, pp. 340–343.

[104] Y. Hu and P. C. Loizou, “A generalized subspace approach for enhancing speech corrupted by colored noise,” IEEE Trans. Speech Audio Processing, vol. 11, no. 4, pp. 334–341, July 2003.

[105] I. Potamitis and E. Fishler, “Microphone array voice activity detection and noise suppression using wideband generalized likelihood ratio,” in Eurospeech, Geneva, Sept. 2003, pp. 525–528.

[106] J. Rosca, R. Balan, N. P. Fan, C. Beaugeant, and V. Gilg, "Multichannel voice detection in adverse environments," in Proc. European Signal Process. Conf., EUSIPCO-02, Toulouse, France, Sept. 2002, pp. 251–254.

[107] I. Cohen and B. Berdugo, "Multichannel signal detection based on transient beam-to-reference ratio," IEEE Signal Processing Lett., vol. 10, no. 9, pp. 259–262, Sept. 2003.

[108] I. Potamitis, "Estimation of speech presence probability in the field of microphone array," IEEE Signal Processing Lett., vol. 11, no. 12, pp. 956–959, Dec. 2004.

[109] A. Spriet, M. Moonen, and J. Wouters, “The impact of speech detection errors on the noise reduction performance of multi-channel Wiener filtering and generalized sidelobe cancellation,” Signal Processing, vol. 85, pp. 1073–1088, June 2005.

[110] T. G. Birdsall and J. O. Gobin, “Sufficient statistics and reproducing densities in simultaneous sequential detection and estimation,” IEEE Trans. Inform. Theory, vol. IT-19, no. 6, pp. 760–768, Nov. 1973.

[111] J. Goutsias and J. M. Mendel, “Optimal simultaneous detection and estimation of filtered discrete semi-Markov chains,” IEEE Trans. Inform. Theory, vol. 34, no. 3, pp. 551–568, May 1988.

[112] Z. Xie, C. K. Rushforth, R. T. Short, and T. K. Moon, "Joint signal detection and parameter estimation in multiuser communications," IEEE Trans. Commun., vol. 41, no. 7, pp. 1208–1216, Aug. 1993.

[113] G. K. Kaleh and R. Vallet, “Joint parameter estimation and symbol detection for linear or nonlinear unknown channels,” IEEE Trans. Commun., vol. 42, no. 7, pp. 2406–2413, July 1994.

[114] B. Baygün and A. O. Hero III, "Optimal simultaneous detection and estimation under a false alarm constraint," IEEE Trans. Inform. Theory, vol. 41, no. 3, pp. 688–703, May 1995.

[115] D. Middleton and F. Esposito, “Simultaneous optimum detection and estimation of signals in noise,” IEEE Trans. Inform. Theory, vol. IT-14, no. 3, pp. 434–444, May 1968.

[116] A. Fredriksen, D. Middleton, and D. Vandelinde, “Simultaneous signal detection and estimation under multiple hypotheses,” IEEE Trans. Inform. Theory, vol. IT- 18, no. 5, pp. 607–614, 1972.

[117] Y. Ephraim and N. Merhav, "Lower and upper bounds on the minimum mean-square error in composite source signal estimation," IEEE Trans. Inform. Theory, vol. 38, no. 6, pp. 1709–1724, Nov. 1992.

[118] R. W. Chang and J. C. Hancock, "On receiver structures for channels having memory," IEEE Trans. Inform. Theory, vol. IT-12, no. 4, pp. 463–468, Oct. 1966.

[119] G. Lindgren, "Markov regime models for mixed distributions and switching regressions," Scand. J. Statist., vol. 5, pp. 81–91, 1978.

[120] M. Askar and H. Derin, "A recursive algorithm for the Bayes solution of the smoothing problem," IEEE Trans. Automat. Contr., vol. AC-26, no. 2, pp. 558–561, Apr. 1981.

[121] A. Abramson and I. Cohen, “On the stationarity of GARCH processes with Markov switching regimes,” Econometric Theory, vol. 23, no. 3, pp. 485–500, 2007.

[122] C. Francq, M. Roussignol, and J.-M. Zakoïan, "Conditional heteroskedasticity driven by hidden Markov chains," Journal of Time Series Analysis, vol. 22, no. 2, pp. 197–220, 2001.

[123] R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge University Press, 1985.

[124] P. Lancaster and M. Tismenetsky, The Theory of Matrices, 2nd ed., W. Rheinbold, Ed. Academic Press, INC., 1985.

[125] L. A. Metzler, “A multiple-region theory of income and trade,” Econometrica, vol. 18, no. 4, pp. 329–354, October 1950.

[126] D. Hawkins and H. Simon, "Note: some conditions of macroeconomic stability," Econometrica, vol. 17, no. 4, pp. 245–248, July 1949.

[127] A. Abramson and I. Cohen, “Recursive supervised estimation of a Markov-switching GARCH process in the short-time Fourier transform domain,” IEEE Trans. Signal Processing, vol. 55, no. 7, pp. 3227–3238, July 2007.

[128] ——, "Asymptotic stationarity of Markov-switching time-frequency GARCH processes," in Proc. 31st IEEE Int. Conf. Acoust., Speech Signal Process., ICASSP-06, Toulouse, France, May 2006, pp. III-452–III-455.

[129] Y. Ephraim and N. Merhav, “Hidden Markov processes,” IEEE Trans. Inform. Theory, vol. 48, no. 6, pp. 1518–1569, June 2002.

[130] A. Abramson and I. Cohen, “State smoothing in Markov-switching time-frequency GARCH models,” IEEE Signal Processing Lett., vol. 13, no. 6, pp. 377–380, June 2006.

[131] P. Gill, W. Murray, and M. Wright, Practical Optimization. Academic Press, 1981.

[132] S. Han, “A globally convergent method for nonlinear programming,” Journal of Optimization Theory and Applications, vol. 22, no. 4, pp. 297–309, 1977.

[133] A. Abramson and I. Cohen, “Markov-switching GARCH model and application to speech enhancement in subbands,” in Proc. Int. Workshop on Acoust. Echo and Noise Control, IWAENC-06, Paris, France, Sept. 2006, paper no. 7, pp. 1–4.

[134] C. J. Kim, "Dynamic linear models with Markov-switching," Journal of Econometrics, vol. 60, pp. 1–22, 1994.

[135] H. Amiri, H. Amindavar, and R. L. Kirlin, “Array signal processing using GARCH noise modeling,” in Proc. 29th IEEE Int. Conf. Acoust., Speech Signal Process., ICASSP-04, Montreal, Canada, Mar. 2004, pp. II–105–II–108.

[136] A. Abramson and I. Cohen, “Simultaneous detection and estimation approach for speech enhancement,” IEEE Trans. Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2348–2359, Nov. 2007.

[137] A. Subramanya, M. L. Seltzer, and A. Acero, “Automatic removal of typed keystrokes from speech signals,” IEEE Signal Processing Lett., vol. 14, no. 5, pp. 363–366, May 2007.

[138] W. A. Harrison, J. S. Lim, and E. Singer, “A new application of adaptive noise cancellation,” IEEE Trans. Acoust., Speech, Signal Processing, vol. 34, no. 1, pp. 21–27, Feb. 1986.

[139] A. G. Jaffer and S. C. Gupta, "Coupled detection-estimation of Gaussian processes in Gaussian noise," IEEE Trans. Inform. Theory, vol. IT-18, no. 1, pp. 106–110, Jan. 1972.

[140] A. Abramson and I. Cohen, "Enhancement of speech signals under multiple hypotheses using an indicator for transient noise presence," in Proc. 32nd IEEE Int. Conf. Acoust., Speech Signal Process., ICASSP-07, Honolulu, Hawaii, Apr. 2007, pp. 553–556.

[141] E. Habets, I. Cohen, and S. Gannot, “MMSE log-spectral amplitude estimator for multiple interferences,” in Proc. Int. Workshop on Acoust. Echo and Noise Control., IWAENC-06, Paris, France, Sept. 2006.

[142] J. S. Garofolo, "Getting started with the DARPA TIMIT CD-ROM: an acoustic phonetic continuous speech database," Technical report, National Institute of Standards and Technology (NIST), Gaithersburg, Maryland (prototype as of December 1988).

[143] S. R. Quackenbush, T. P. Barnwell, and M. A. Clements, Objective Measures of Speech Quality. Englewood Cliffs, NJ: Prentice-Hall, 1988.

[144] ITU-T Rec. P.862, “Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs,” International Telecommunication Union, Geneva, Switzerland, Feb. 2001.

[145] A. Abramson Homepage. [Online]. Available: http://siglab.technion.ac.il/~ari_a/

[146] A. Guérin, G. Faucon, and R. Le Bouquin-Jeannès, "Nonlinear acoustic echo cancellation based on Volterra filters," IEEE Trans. Speech Audio Processing, vol. 11, no. 6, pp. 672–683, Nov. 2003.

[147] E. Haensler and G. Schmidt, Eds., Topics in Acoustic Echo and Noise Control. Springer-Verlag, 2006.

[148] I. S. Gradshteyn and I. M. Ryzhik, Table of Integrals, Series, and Products, 6th ed., A. Jeffrey and D. Zwillinger, Eds. Academic Press, 2000.

[149] G. N. Watson, A Treatise on the Theory of Bessel Functions, 2nd ed. Cambridge University Press, 1962.

[150] D. Middleton, An Introduction to Statistical Communication Theory, 2nd ed. IEEE Press, 1996.

[151] A. Abramson and I. Cohen, "Single-sensor blind source separation using classification and estimation approach and GARCH modeling," submitted to IEEE Trans. Audio, Speech, and Language Processing.

[152] M. J. Reyes-Gomez, D. P. W. Ellis, and N. Jojic, "Multiband audio modeling for single-channel acoustic source separation," in Proc. 29th IEEE Int. Conf. Acoust., Speech Signal Process., ICASSP-04, Montreal, Canada, May 2004, pp. 641–644.

[153] D. G. Manolakis, V. K. Ingle, and S. M. Kogon, Statistical and Adaptive Signal Processing: Spectral Estimation, Signal Modeling, Adaptive Filtering, and Array Processing. Boston, MA: McGraw-Hill, 2000.

[154] D. P. Bertsekas, Nonlinear Programming, 2nd ed. Athena Scientific, 1999.

[155] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of The Royal Statistical Society, vol. 39, no. 1, pp. 1–38, 1977.

[156] G. A. F. Seber, Multivariate Observations. New York: Wiley, 1984.

[157] H. Bourlard and S. Dupont, “Subband-based speech recognition,” in Proc. 22nd IEEE Int. Conf. Acoust., Speech Signal Process., ICASSP-97, Munich, Germany, Apr. 1997, pp. 1251–1254.

[158] S. Okawa, E. Bocchieri, and A. Potamianos, “Multi-band speech recognition in noisy environments,” in Proc. 23rd IEEE Int. Conf. Acoust., Speech Signal Process., ICASSP-98, Seattle, WA, USA, May 1998, pp. 641–644.

[159] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE Trans. Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.

[160] R. Gribonval, L. Benaroya, E. Vincent, and C. Févotte, "Proposals for performance measurement in source separation," in Proc. 4th Int. Sym. on Independent Component Analysis and Blind Signal Separation (ICA), Nara, Japan, Apr. 2003, pp. 763–768.

[161] S. Gannot, D. Burshtein, and E. Weinstein, "Signal enhancement using beamforming and nonstationarity with applications to speech," IEEE Trans. Signal Processing, vol. 49, no. 8, pp. 1614–1626, Aug. 2001.

[162] A. Abramson, E. A. P. Habets, S. Gannot, and I. Cohen, "Dual-microphone speech dereverberation using GARCH modeling," to appear in Proc. 33rd IEEE Int. Conf. Acoust., Speech Signal Process., ICASSP-08.

[163] Y. Huang and J. Benesty, "A class of frequency-domain adaptive approaches to blind multichannel identification," IEEE Trans. Signal Processing, vol. 51, no. 1, pp. 11–24, Jan. 2003.

[164] R. Ratnam, D. Jones, B. Wheeler, W. O’Brien Jr., C. Lansing, and A. Feng, “Blind estimation of reverberation time,” Journal of the Acoustical Society of America, vol. 114, no. 5, pp. 2877–2892, Nov. 2003.

[165] R. F. Engle and K. F. Kroner, "Multivariate simultaneous generalized ARCH," Econometric Theory, vol. 11, pp. 122–150, 1995.

[166] F. Comte and O. Lieberman, "Asymptotic theory for multivariate GARCH processes," Journal of Multivariate Analysis, vol. 84, no. 1, pp. 61–84, January 2003.

[167] S. Gannot and I. Cohen, “Adaptive beamforming and postfiltering,” in Springer Handbook of Speech Processing, J. Benesty, M. M. Sondhi, and Y. Huang, Eds. Springer, 2007, ch. 48.

[168] I. Cohen, S. Gannot, and B. Berdugo, "An integrated real-time beamforming and postfiltering system for nonstationary noise environments," EURASIP Journal on Applied Signal Processing, vol. 11, pp. 1064–1073, Oct. 2003.

מודלי GARCH בעלי משטר מיתוג מרקובי ויישומיהם לעיבוד ספרתי של אותות דיבור

ארי אברמסון

מודלי GARCH בעלי משטר מיתוג מרקובי ויישומיהם לעיבוד ספרתי של אותות דיבור

חיבור על מחקר

לשם מילוי חלקי של הדרישות לקבלת התואר דוקטור לפילוסופיה

ארי אברמסון

הוגש לסנט הטכניון - המכון טכנולוגי לישראל טבת תשס"ח חיפה דצמבר 2007

המחקר נעשה בהנחיית פרופסור/ח' ישראל כהן בפקולטה להנדסת חשמל

תודות

ברצוני להביע את תודתי והערכתי העמוקה לפרופ' ישראל כהן על הנחייתו המסורה. תודה על התמיכה המקצועית, על עידודך לשלמות ועל הרבה עצות מועילות לכל אורך שלבי המחקר.

תודה לשותפי למשרד קותי אברג'יל על הרבה דיונים פוריים. תודה לדר' עמנואל הבטס על דיונים מועילים ועל שאפשר לי להרחיב את המחקר לכיוון נוסף של ביטול הד אקוסטי.

תודה מיוחדת להורי מירי ומשה ולאהובתי אפרת על תמיכתם האינסופית לכל אורך הדרך.

אני מודה לטכניון, לקרן הלאומית למדע (מענק מס' 1085/05),ולאיחוד האירופי תחת פרוייקט Memories על התמיכה הכספית הנדיבה בהשתלמותי

Abstract

This research focuses on the theory of generalized autoregressive conditional heteroscedasticity (GARCH) models and their applications to signal processing, in particular to digital processing of speech signals. The GARCH model was introduced by Bollerslev in 1986 as a generalization of the ARCH model developed by Engle in 1982. The model provides a parametric description of time-varying volatility (conditional variance) in terms of past values of the conditional variance and past squared values of the process. It is widely used in economics for analyzing and forecasting the volatility of processes in financial markets. Recently, the model has also been proposed for speech processing applications such as speech enhancement, speaker recognition, and voice activity detection. In this research we propose a novel formulation of a complex-valued GARCH model with a Markov-switching regime (MS-GARCH) for nonstationary signals in the joint time-frequency domain. The model exploits both the properties of the GARCH model, which describes a process with a time-varying conditional variance, and the temporal dynamics of Markov chains. The main motivation for this research is the statistical description of speech signals in the time-frequency domain, with example applications that include noise reduction in communication systems (both background noise and transient noise), speech enhancement and dereverberation in communication systems in which the microphone is located far from the speaker, and separation of sound sources captured by a single microphone.

The thesis begins with an asymptotic stationarity analysis of the MS-GARCH model. An MS-GARCH process (like a single-regime GARCH process) is not stationary, since its second-order moment is time-varying. However, if such processes are asymptotically wide-sense stationary, the variance of the process is guaranteed to converge to a finite limit. Necessary and sufficient conditions for the asymptotic stationarity of a GARCH process were developed by Bollerslev, whereas for MS-GARCH models the stationarity analyses available in the literature address only degenerated formulations of the model. Moreover, even for these degenerated formulations, the conditions developed in some cases are necessary but not necessarily sufficient for asymptotic stationarity. In this research we propose a stationarity analysis of fully formulated MS-GARCH models and develop conditions that are both necessary and sufficient for the stationarity of several general model formulations proposed in the literature.

Speech signal coefficients in the short-time Fourier transform domain are characterized both by smoothness of the coefficients' variance and by a heavy-tailed distribution. These two important properties also characterize GARCH processes. Motivated by this, it has recently been proposed in the signal processing field to exploit the GARCH model for speech coefficients in the time-frequency domain for speech enhancement applications. However, these works assume that a detector for the speech coefficients is available, and in addition that the model parameters are fixed in time. The latter assumption restricts the use of the model in the case of a speaker change characterized by different parameters, or even a change of phonemes. We propose a new formulation of the MS-GARCH model for complex-valued nonstationary random processes, which is suitable for representing speech signals in the short-time Fourier transform domain. Each state of the hidden Markov chain defines a different set of GARCH parameters, thereby allowing a different representation for changes of phonemes and/or speakers, which may be associated with a change of the Markov-chain state.

When economic processes are considered, measurement noise is usually not an issue, and hence the process values observed up to a given time are sufficient for reconstructing the conditional variance of the process at the next time index. In contrast, when speech signals captured in a noisy environment are analyzed, only noisy measurements of the process are available, so neither the process values nor the conditional variance can be reconstructed exactly. For this purpose we developed a recursive algorithm for estimating the conditional variance of the process from noisy measurements, as well as an algorithm for estimating the signal itself. The algorithm consists of two steps, similarly to the Kalman filter: in the first step the conditional variance is propagated one step ahead based on the model definition, and in the second step the variance is updated using the noisy measurement at that time. In addition, we developed an algorithm for noncausal estimation of the Markovian state chain from noisy measurements. These algorithms were found to be efficient for restoring speech coefficients in applications of speech enhancement in noisy environments as well as attenuation of acoustic reverberation, with results that are better than those obtained by standard algorithms available in the literature.

Another important property of speech signals is the sparsity of the representation coefficients in the short-time Fourier transform domain. Apart from the fact that speech is absent in some time segments, even in segments where speech is present most of the energy is concentrated in a small fraction of the representation coefficients. Existing speech enhancement algorithms usually assume an estimator for the representation coefficients together with a separate detector for the presence of the coefficients; whenever the detector decides that a coefficient contains only a noise component, that coefficient can be rejected at the output of the restoration system. Alternatively, other algorithms estimate the coefficients under uncertainty: an a priori probability for the presence of the speech coefficients is assumed, and the coefficients are statistically estimated under this assumption. Approaches that combine coupled detectors and estimators operating jointly exist in the literature for signal processing as well as for various communication applications. In this research we propose a novel formulation of the speech enhancement problem, which includes a detector and an estimator that operate jointly and together minimize a generalized cost function. Two hypotheses are proposed for the speech coefficients, representing presence or absence of the representation coefficients at each point in the time-frequency domain. The generalized cost function includes a term for the estimation error and an additional term for the detection error, and the coupled detector and estimator are jointly designed to minimize the expected cost. Based on this formulation we develop a speech enhancement algorithm using appropriate cost functions. The proposed algorithm yields better performance than an algorithm that performs estimation only. Moreover, a major advantage of the proposed formulation is the ability to integrate an external detector for transient noise into the system, while preserving the optimality of the estimator even given this non-ideal detector. We present results showing that combining the proposed algorithm with a non-ideal detector for transient noise enables significant attenuation of that noise with only minor degradation of the desired speech components.

Another interesting problem studied in this thesis is the separation of sound sources captured by a single microphone (different speakers, different musical instruments, or a combination of singing/speech with music). When several sources are captured by several microphones, the mutual information available in the signals captured by the different microphones can be exploited and/or spatial filtering approaches can be applied; these approaches can sometimes achieve satisfactory results even when the number of sources exceeds the number of microphones. However, when several sources are captured by a single microphone, some a priori information about each of the sources is required to enable any separation. Existing algorithms for this problem assume a different statistical model (codebook) for each of the signals and statistically estimate each of the desired signals, where the codebook is learned from a training set containing signals that are supposed to statistically represent the desired signal. Existing models are based on Gaussian mixture models (GMM) or on autoregressive (AR) models with several sets of representation coefficients. In each of these cases, the possible covariance matrix for a given signal vector is restricted to the subspace spanned by the distributions defined in the codebook. For this source separation problem we propose a new algorithm based on GARCH modeling and on a combined classification and estimation approach. Since each of the signals is modeled with several hypotheses, the separation is in effect performed by a combination of a classifier and an estimator, and the use of cost parameters allows the user to control the trade-off between signal distortion, caused by missed detection of the desired signal coefficients, and residual interference, caused by false detection. This algorithm achieves better separation than an existing algorithm that assumes only a GMM model and performs statistical estimation only.