Markov-Switching GARCH Models and Applications to Digital Speech Processing

Research Thesis

Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

Ari Abramson

Submitted to the Senate of the Technion—Israel Institute of Technology
Tevet 5768, Haifa, December 2007
The Research Thesis was Done Under the Supervision of Associate Professor Israel Cohen in the Department of Electrical Engineering.
Acknowledgement
I wish to express my deep gratitude and appreciation to Prof. Israel Cohen for his guidance and dedicated supervision. Thank you for your professional support, for encouraging me to strive for perfection, and for many valuable suggestions throughout all stages of this research.
I would also like to thank Kuti Avargel for many fruitful discussions and Dr. Emanuël Habets for valuable discussions and for giving me the opportunity to expand my research to speech dereverberation.
Special thanks to my parents, Miri and Moshe, and to my beloved Efrat, who encouraged and supported me all the way.
The Generous Financial Help of The Technion, The Israel Science Foundation (Grant No. 1085/05), and The European Commission's IST Program Under Project MEMORIES is Gratefully Acknowledged.

Contents
1 Introduction
  1.1 Markov-switching GARCH models
  1.2 Speech enhancement
    1.2.1 Spectral modeling of speech signals
  1.3 Other speech processing applications
  1.4 Detection and estimation of speech signals
  1.5 Thesis structure
  1.6 List of publications

2 Research Methods
  2.1 GARCH models and stationarity analysis
    2.1.1 Markov-switching GARCH model
  2.2 Time-frequency GARCH model and spectral speech enhancement
    2.2.1 Time-frequency GARCH model
    2.2.2 Variance estimation
    2.2.3 Spectral enhancement
  2.3 Single-channel blind source separation
  2.4 Speech dereverberation

3 Stationarity Analysis of MS-GARCH Processes
  3.1 Introduction
  3.2 Stationarity of Markov-switching GARCH models
    3.2.1 MSG-I model
    3.2.2 MSG-II model
    3.2.3 Comparison of stationarity conditions
  3.3 Relation to other works
  3.4 Conclusions
  3.A Proof of Theorem 3.2
  3.B Equivalence with Haas condition

4 MS-GARCH Process in the STFT Domain
  4.1 Introduction
  4.2 Markov-switching time-frequency GARCH model
    4.2.1 Time-frequency GARCH model
    4.2.2 MSTF-GARCH formulation
    4.2.3 Stationarity of an MSTF-GARCH process
  4.3 Restoration of noisy MSTF-GARCH process
  4.4 Estimation efficiency
  4.5 Model estimation
  4.6 Experimental results
    4.6.1 MSTF-GARCH signals
    4.6.2 Speech signals
  4.7 Conclusions
  4.A Application to Speech Enhancement
    4.A.1 Introduction
    4.A.2 Model formulation
    4.A.3 Model estimation
    4.A.4 Spectral enhancement of noisy speech
    4.A.5 Experimental results and discussion

5 State Smoothing in MS-GARCH Models
  5.1 Introduction
  5.2 Problem formulation
  5.3 State probability smoothing
    5.3.1 Generalized forward-backward recursions
    5.3.2 Generalized stable backward recursion
  5.4 Experimental results
  5.5 Conclusions

6 Simultaneous Detection and Estimation
  6.1 Introduction
  6.2 Classical speech enhancement
  6.3 Reformulation of the speech enhancement problem
  6.4 Quadratic spectral amplitude cost function
  6.5 Relation to spectral subtraction
  6.6 A priori SNR estimation
  6.7 Experimental results
    6.7.1 Comparison with the STSA estimator
    6.7.2 Speech enhancement under nonstationary noise environment
  6.8 Conclusions
  6.A Risk derivation
  6.B Speech Enhancement Under Multiple Hypotheses
    6.B.1 Introduction
    6.B.2 Problem formulation
    6.B.3 Optimal estimation under a given detection
    6.B.4 Experimental results
    6.B.5 Conclusions

7 Single-Sensor Audio Source Separation
  7.1 Introduction
  7.2 Codebook-Based Separation
    7.2.1 Simultaneous Classification and Estimation
    7.2.2 Joint Classification and Estimation
  7.3 GMM vs. GARCH Codebook
  7.4 Implementation of the Algorithm
  7.5 Experimental Results
    7.5.1 Experimental setup and quality measures
    7.5.2 Simulation results
  7.6 Conclusions
  7.A Derivation of (7.10)
  7.B Derivation of (7.11)

8 Speech Dereverberation Using GARCH Modeling
  8.1 Introduction
  8.2 Dual-microphone dereverberation
  8.3 Late reverberant spectral estimation
  8.4 Modeling early reverberation using GARCH
    8.4.1 Spectral variance estimation
    8.4.2 Speech presence probability
  8.5 Experimental results
  8.6 Conclusions

9 Research Summary and Future Directions
  9.1 Research summary
  9.2 Future research directions

Bibliography

List of Figures
3.1 Stationarity regions for two-state Markov chains with GARCH of order (1, 1)

4.1 SNR improvements obtained by using MSTF-GARCH-based estimators
4.2 Trace of instantaneous output SNR achieved by the proposed algorithm
4.3 Typical traces of one-frame-ahead conditional variance estimates for speech signals
4.4 Typical traces of estimated squared absolute values for a speech signal
4.5 Speech spectrograms and waveforms
4.6 Conditional speech presence probability

5.1 State smoothing error rate for 3-state MSTF-GARCH models

6.1 Independent detection and estimation system, and strongly coupled detection and estimation system
6.2 Gain curves of G1, G0, and the total detection and estimation system gain curve, compared with the STSA gain under signal presence uncertainty
6.3 Signals in the time domain: sinusoidal signal with stationary noise
6.4 Amplitudes of the STFT coefficients along the time-trajectory corresponding to the frequency of the sinusoidal signal
6.5 Signals in the time domain: sinusoidal signal with stationary and transient noise
6.6 Amplitudes of the STFT coefficients along the time-trajectory corresponding to the frequency of the sinusoidal signal
6.7 Spectrograms and waveforms of a speech signal degraded by engine noise and a siren noise
6.8 Gain curves for p(H1) = 0.8, C01 = 5, C10 = 3, and Gmin = −15 dB
6.9 Speech degraded by keyboard typing noise: spectrograms and waveforms

7.1 A cascade classification and estimation scheme
7.2 Block diagram of the proposed algorithm
7.3 Quality measures for MMSE estimation as functions of the number of GARCH states
7.4 Trade-off between residual interference and signal distortion resulting from changing the false detection and missed detection parameters
7.5 Original and mixed signals: (a) speech signal "Draw every outer line first, then fill in the interior"; (b) piano signal (Für Elise); (c) mixed signal
7.6 Separation of speech and music signals: (a) speech signal reconstructed using the GMM-based algorithm; (b) speech signal reconstructed using the proposed approach; (c) piano signal reconstructed using the GMM-based algorithm; (d) piano signal reconstructed using the proposed approach

8.1 Dual-microphone speech dereverberation system
8.2 SegSIR and LSD as functions of the number of GARCH states
8.3 Spectrograms and waveforms of reverberant and processed speech signals

List of Tables
4.1 Vector form of the recursive MSTF-GARCH signal estimation

6.1 Segmental SNR and log spectral distortion obtained by using either the simultaneous detection and estimation approach or the STSA estimator in a stationary noise environment
6.2 Segmental SNR, log spectral distortion, and PESQ score under transient noise
6.3 Segmental SNR and log spectral distortion obtained using the OM-LSA and the proposed algorithm

7.1 Averaged quality measures for the estimated speech signals using a 3-state GARCH model and an 8-state GMM
7.2 Averaged quality measures for the estimated music signals using a 3-state GARCH model and an 8-state GMM

8.1 SegSIR and LSD obtained by using the decision-directed approach and the proposed MS-GARCH-based approach, with d = 0.5 m
8.2 SegSIR and LSD obtained by using the decision-directed approach and the proposed MS-GARCH-based approach, with d = 1 m

Abstract
This dissertation addresses theory and applications of generalized autoregressive conditional heteroscedasticity (GARCH) models with Markov regimes for digital speech processing. The GARCH model is widely used in the field of econometrics for deriving volatility forecasts of financial rates, and it was recently proposed in the field of signal processing for applications such as speech enhancement, speech recognition, and voice activity detection. GARCH models explicitly parameterize the time-varying volatility by using both past conditional variances and past squared innovations (prediction errors), while taking into account excess kurtosis (i.e., heavy-tailed distribution) and volatility clustering.
In this thesis, we develop a new statistical model for nonstationary signals in the joint time-frequency domain, based on a GARCH formulation with Markov regimes. The proposed model exploits the advantages of both the conditional heteroscedasticity structure of GARCH models and the time-varying characteristics of hidden Markov chains. The main motivation for this research is spectral modeling of speech signals for hands-free communication applications such as speech enhancement, nonstationary noise reduction, dereverberation, and audio source separation.
We analyze the asymptotic stationarity of Markov-switching GARCH (MS-GARCH) processes in the general case of (p, q)-order GARCH models with finite-state Markov chains. Necessary and sufficient conditions for asymptotic wide-sense stationarity are developed for several model formulations known in the literature. The properties of the proposed model are investigated, and algorithms are developed for conditional variance estimation, as well as for signal estimation in noisy environments. The proposed model, with the corresponding estimation algorithms, is shown to be useful for speech enhancement and speech dereverberation applications. In addition, a state smoothing algorithm is developed for estimating the sequence of active states. Furthermore, a new formulation of the speech enhancement problem is proposed in this thesis, which incorporates simultaneous operations of detection and estimation. A detector for speech presence in the short-time Fourier transform domain is combined with an estimator, jointly minimizing a cost function that takes into account both detection and estimation errors. We show that the proposed simultaneous detection and estimation approach enables greater noise reduction than an estimation-only approach, without further degrading the speech signal. A simultaneous classification and estimation approach, together with GARCH modeling, is employed for developing an algorithm for single-sensor audio source separation. We show that for mixtures of speech and music signals, improved source separation can be achieved compared to using Gaussian mixture models for both signals. Moreover, cost parameters enable one to control the trade-off between missed and false detection of the desired signal, and correspondingly the trade-off between signal distortion and residual interference.

Notation
x                  scalar / time-domain signal
x                  column vector
X_tk, X(t, k)      time-frequency coefficient
{A}_ij, a_ij       the (i, j) element of matrix A
A^{-1}             matrix inverse
diag{x}            diagonal matrix with the vector x on its diagonal
eig{A}             the spectrum of matrix A
tr{·}              trace
|x|                absolute value
|A|                determinant
(·)^T, (·)'        transpose operation
(·)^H              Hermitian transpose
(·)*               complex conjugate
p(·)               probability / probability density
E{·}               expectation
Var{·}             variance
Cov{·}             covariance
det(·)             determinant
I_ν(·)             modified Bessel function of order ν
J_ν(·)             Bessel function of order ν
Γ(·)               Gamma function
1F1(a; b; x)       confluent hypergeometric function
ρ(·)               spectral radius
1                  column vector of ones
0_m                m × m matrix of zeros
I_m                m × m identity matrix
⊗                  Kronecker product
⊙                  term-by-term vector multiplication
(÷)                term-by-term vector division
∇                  gradient
R^N                N-dimensional real-valued vectors
R^N_+              N-dimensional positive real-valued vectors
C^N                N-dimensional complex-valued vectors
‖·‖                Euclidean norm
R(·)               real part
I(·)               imaginary part

Abbreviations
AIR         Acoustic impulse response
AR          Autoregressive
ARCH        Autoregressive conditional heteroscedasticity
ARMA        Autoregressive moving average
DPC         Direct path compensation
DSB         Delay and sum beamformer
GARCH       Generalized autoregressive conditional heteroscedasticity
GMM         Gaussian mixture model
HMM         Hidden Markov model
HMP         Hidden Markov process
IMCRA       Improved minima-controlled recursive averaging
IR_H0       Interference reduction
LRSVE       Late reverberant spectral variance estimator
LSA         Log spectral amplitude
LSD         Log spectral distortion
MAP         Maximum a posteriori
ML          Maximum likelihood
MMSE        Minimum mean-square error
MSE         Mean-square error
MS-GARCH / MSG  Markov-switching GARCH
MSTF-GARCH  Markov-switching time-frequency GARCH
NE          Noise estimator
OM-LSA      Optimally modified log-spectral amplitude
PESQ        Perceptual evaluation of speech quality
PSD         Power spectral density
QSA         Quadratic spectral amplitude
RIR         Room impulse response
SegSNR      Segmental signal-to-noise ratio
SegSIR      Segmental signal-to-interference ratio
SNR         Signal-to-noise ratio
STFT        Short-time Fourier transform
STSA        Short-term spectral amplitude
TF-GARCH    Time-frequency generalized autoregressive conditional heteroscedasticity
VAD         Voice activity detector

Chapter 1
Introduction
This dissertation addresses the theory of generalized autoregressive conditional heteroscedasticity (GARCH) models with Markov regimes and their applications to signal processing, and in particular to digital speech processing.
The GARCH model explicitly parameterizes a time-varying volatility by using both recent conditional variances and recent squared innovations. This model is widely used in the field of econometrics for the analysis and forecasting of volatility in financial markets. Recently, this model was proposed for speech processing applications, such as speech enhancement, speech recognition, and voice activity detection. In this thesis we formulate a new complex-valued Markov-switching GARCH (MS-GARCH) model for nonstationary signals in the joint time-frequency domain. The MS-GARCH model exploits the advantages of both the conditional heteroscedasticity structure of GARCH models and the time-varying characteristics of hidden Markov chains. The basic motivation for our research is spectral modeling of speech signals; example applications include noise reduction in communication systems (of both background and transient noise), speech enhancement and dereverberation in hands-free communication, and audio source separation.
The thesis starts with an asymptotic stationarity analysis of MS-GARCH processes. For processes with time-varying variances, conditions for asymptotic wide-sense stationarity are useful to ensure a stable process with a finite second-order moment. We then formulate a new complex-valued MS-GARCH model for nonstationary signals in the short-time Fourier transform (STFT) domain. The properties of the model are
investigated and algorithms are developed for causal, as well as noncausal, estimation in noisy environments. The proposed model is shown to be useful for spectral modeling of speech signals in speech enhancement and speech dereverberation applications. A basic property of speech signals is that their expansion coefficients are sparse in the STFT domain. Therefore, a reliable detector may significantly improve performance in noisy environments. We propose a new formulation of the speech enhancement problem, which incorporates simultaneous operations of detection and estimation. This approach is applied to develop speech enhancement algorithms under stationary, as well as transient, noise. A simultaneous classification and estimation approach together with GARCH modeling is employed to develop an algorithm for single-sensor audio source separation. In this chapter we briefly describe the scientific background for the main topics of this research and specify the structure of the thesis.
1.1 Markov-switching GARCH models
The GARCH model is widely used in the field of econometrics, both by practitioners and by researchers [1–5]. The model represents a powerful tool for analysis and forecasting of volatility in financial markets. This model, first introduced by Bollerslev [1] as a generalization of the ARCH model [2], explicitly parameterizes the time-varying volatility by using both recent conditional variances and recent squared innovations. GARCH models preserve the persistence of the process volatility in the sense that small variations tend to follow small variations and large variations tend to follow large variations. Incorporating GARCH models with hidden Markov chains, where each state (regime) of the chain implies a different GARCH behavior, extends the dynamic formulation of the model and enables a better fit for a process with a more complex time-varying volatility structure [6–11]. However, a major drawback of such models is that estimating the volatility with switching regimes requires knowledge of the entire history of the process, including the regime path. Consequently, Markov-switching ARCH models were proposed [12, 13], which avoid problems of path dependency in a noiseless environment. The conditional variance in ARCH models depends on previous observations only, so the Markov chain does not have to be known for constructing the conditional variance for a given regime.
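The variance recursion behind this description can be sketched in a few lines. The following toy simulation of a real-valued GARCH(1,1) process (the thesis itself works with complex-valued STFT coefficients) only illustrates how the recursion produces volatility clustering; the parameter values are arbitrary illustrations, not taken from the thesis.

```python
import math
import random

def simulate_garch11(alpha0, alpha1, beta1, n, seed=0):
    """Simulate x_t = sigma_t * e_t with e_t ~ N(0, 1) and conditional
    variance sigma_t^2 = alpha0 + alpha1 * x_{t-1}^2 + beta1 * sigma_{t-1}^2."""
    rng = random.Random(seed)
    var = alpha0 / (1.0 - alpha1 - beta1)  # start at the asymptotic variance
    xs, variances = [], []
    for _ in range(n):
        x = math.sqrt(var) * rng.gauss(0.0, 1.0)
        xs.append(x)
        variances.append(var)
        var = alpha0 + alpha1 * x * x + beta1 * var  # GARCH(1,1) recursion
    return xs, variances

# Persistence alpha1 + beta1 = 0.9: large |x| tend to cluster in time.
xs, variances = simulate_garch11(0.1, 0.2, 0.7, 5000)
```

Plotting `xs` from such a run shows quiet stretches interrupted by bursts of large-amplitude samples, the clustering property described above.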
In [6], a variant of the MS-GARCH model was introduced, relying on the assumption that the conditional variance given the current regime depends on the expectation of the previous conditional variances rather than on their values. Accordingly, the conditional variance depends on a finite set of state-dependent expected conditional variances via their conditional state probabilities. Klaassen [7] proposed modifying this model by manipulating the current regime and all available observations while evaluating the expectation of previous conditional variances. A different method for reducing the dependency of the conditional variance on past regimes has recently been proposed in [8]. Accordingly, a Markov chain governs the ARCH parameters, while the autoregressive behavior of the conditional variance is subject to the assumption that past conditional variances are in the same regime as the current conditional variance. These variants of MS-GARCH models were developed for improved volatility forecasts of financial time series under the possible existence of shocks.
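To make the regime-switching mechanism concrete, here is a sketch of a generic, path-dependent MS-GARCH(1,1) simulator in which all three GARCH parameters switch with a hidden Markov regime. This is the naive variant whose conditional variance depends on the full regime path (the very dependency the variants discussed above are designed to avoid); the parameters and transition matrix below are illustrative assumptions only.

```python
import math
import random

def simulate_ms_garch(params, P, n, seed=1):
    """Simulate a path-dependent Markov-switching GARCH(1,1) process.

    params : list of (a0, a1, b1) tuples, one per regime
    P      : row-stochastic transition matrix, P[i][j] = p(s_t = j | s_{t-1} = i)
    The regime s_t follows the Markov chain, and the conditional variance is
    sigma_t^2 = a0[s_t] + a1[s_t] * x_{t-1}^2 + b1[s_t] * sigma_{t-1}^2."""
    rng = random.Random(seed)
    s = 0
    var = params[s][0] / (1.0 - params[s][1] - params[s][2])
    x_prev = 0.0
    xs, states = [], []
    for _ in range(n):
        # Draw the next regime from row P[s] of the transition matrix.
        u, acc = rng.random(), 0.0
        for j, pj in enumerate(P[s]):
            acc += pj
            if u <= acc:
                s = j
                break
        a0, a1, b1 = params[s]
        var = a0 + a1 * x_prev * x_prev + b1 * var  # regime-dependent recursion
        x_prev = math.sqrt(var) * rng.gauss(0.0, 1.0)
        xs.append(x_prev)
        states.append(s)
    return xs, states

# Two illustrative regimes: a "quiet" one and a "loud" one.
params = [(0.05, 0.1, 0.8), (0.5, 0.3, 0.6)]
P = [[0.95, 0.05], [0.10, 0.90]]
xs, states = simulate_ms_garch(params, P, 2000)
```

Note how the current variance feeds into the next step regardless of which regime generated it; estimating such a model from data therefore requires knowledge of the entire regime path, which is exactly the drawback the variants of [6–8] address.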
MS-GARCH processes, as well as the standard (single-state) GARCH process, are nonstationary, as their second-order moments change recursively over time. However, if these processes are asymptotically wide-sense stationary, then their variances are guaranteed to be finite. A necessary and sufficient condition for the stationarity of a (single-regime) GARCH(p, q) process has been developed in [1]. A deep analysis of the probabilistic structure of the MS-GARCH model is derived in [14], with conditions for the existence of moments of any order. In [15–17], stationarity analysis has been derived for some mixing models of conditional heteroscedasticity, and conditions for the asymptotic stationarity of some AR and ARMA models with Markov regimes have been derived in [18–22]. However, for MS-GARCH models, stationarity conditions are known in the literature only for some special cases. In [7], necessary (but not necessarily sufficient) conditions for stationarity are developed for the special case of two regimes and GARCH modeling of order (1, 1). A necessary and sufficient stationarity condition has been developed in [8] for a specific MS-GARCH model formulation, but only in the case of GARCH(1, 1) behavior in each regime. We introduce a comprehensive approach for stationarity analysis of MS-GARCH models, which manipulates a backward recursion of the model's second-order moment. A recursive formulation of the state-dependent conditional variances is developed and the corresponding conditions for stationarity are obtained. In particular, we derive necessary and sufficient conditions for the asymptotic wide-sense stationarity of two different variants of MS-GARCH processes, and obtain expressions for their asymptotic variances in the general case of m-state Markov chains and (p, q)-order GARCH processes.
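For the single-regime case, Bollerslev's condition mentioned above is simple to check: a GARCH(p, q) process is asymptotically wide-sense stationary if and only if the ARCH and GARCH coefficients sum to less than one, in which case the asymptotic variance is α0 / (1 − Σαi − Σβj). A minimal sketch of that check follows; the MS-GARCH conditions derived in this thesis are more involved and are not reproduced here.

```python
def garch_is_asymptotically_wss(alpha0, alphas, betas):
    """Check Bollerslev's condition for a single-regime GARCH(p, q) process.

    alpha0 : positive constant term of the variance recursion
    alphas : ARCH coefficients (weights on past squared innovations)
    betas  : GARCH coefficients (weights on past conditional variances)

    Returns (is_stationary, asymptotic_variance); the variance is None
    when the process is not asymptotically wide-sense stationary."""
    s = sum(alphas) + sum(betas)
    if alpha0 > 0 and s < 1.0:
        return True, alpha0 / (1.0 - s)
    return False, None

# GARCH(1,1) with persistence 0.9: stationary, asymptotic variance of 1
print(garch_is_asymptotically_wss(0.1, [0.2], [0.7]))
# Persistence >= 1: second-order moment diverges
print(garch_is_asymptotically_wss(0.1, [0.5], [0.6]))
```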
Recently, GARCH models have been employed for modeling speech signals in the time-frequency domain [23–25], for speech recognition [26], and for voice activity detection [27]. Speech signals in the STFT domain demonstrate both volatility clustering and heavy-tailed behavior, similarly to financial time series [25]. Motivated by these characteristics, it was proposed to model the conditional variance of speech signals in the STFT domain by a complex GARCH model. This model has been shown to be useful for speech enhancement applications [23–25], but it relies on the assumption that the model parameters are time-invariant. It is commonly assumed in econometrics that the analyzed process is observed in a noiseless environment, so that its past observations provide a complete specification of its current conditional variance for any given regime. In our proposed MS-GARCH approach for speech modeling, the desired signal is generally observed in a noisy environment. Accordingly, we develop an algorithm for conditional variance estimation which is based on iterating propagation and update steps with regime conditional probabilities. In addition, we develop a state smoothing algorithm which generalizes existing algorithms used for state smoothing in hidden Markov processes.
1.2 Speech enhancement
The problem of spectral enhancement of noisy speech signals from a single microphone has attracted considerable research effort for over thirty years, and is still an active research area, e.g., [28–40]. This problem is often formulated as estimation of speech spectral components from a degraded signal which consists of the speech and statistically independent additive noise. A variety of approaches for spectral enhancement of noisy speech signals have been introduced over the years. One of the earliest methods is spectral subtraction [29, 30]. Accordingly, an estimate of the clean signal is obtained by subtracting an estimate of the power spectral density (PSD) of the background noise from the short-term PSD of the degraded signal. The square root of the result is taken as an estimate of the spectral magnitude of the desired signal, while the phase of the noisy signal is used as the desired phase. This method generally results in random fluctuations of the residual noise, known as musical noise, which may be annoying and disturb the perception of the enhanced speech [41]. Many variations of the spectral subtraction method have been developed over the years to cope with the musical residual noise phenomenon [42–45]. Statistical methods for speech enhancement are designed to minimize the expected value of some distortion measure between the clean and estimated signals [32–34, 36, 38, 46, 47]. These approaches require the assumption of reliable statistical models for the speech and noise signals and the specification of a perceptually meaningful distortion measure. Distortion measures of particular interest in speech enhancement applications are the squared error [32, 48], the squared error of the short-term spectral amplitude (STSA) [33], and the squared error of the log spectral amplitude (LSA) [34, 38].
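The spectral subtraction rule just described amounts to a per-bin power subtraction followed by a square root. A minimal sketch for a single STFT frame; the `floor` parameter is an illustrative addition commonly used to limit musical noise, not part of the basic method as stated above.

```python
import math

def spectral_subtraction(noisy_power, noise_psd, floor=0.01):
    """Estimate clean spectral magnitudes for one STFT frame by subtracting
    a noise PSD estimate from the noisy short-term power, per frequency bin.
    Negative results are floored to a small fraction of the noisy power;
    the noisy phase would be reused when reconstructing the time signal."""
    mags = []
    for p_y, p_d in zip(noisy_power, noise_psd):
        p_x = max(p_y - p_d, floor * p_y)  # flooring limits musical noise
        mags.append(math.sqrt(p_x))
    return mags
```

The flooring step is exactly where the musical-noise artifact originates: isolated bins where the subtraction goes negative are clipped, leaving randomly scattered spectral peaks in the residual noise.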
To enable further attenuation of the additive noise, estimation under speech presence uncertainty is often considered [32, 33, 37, 38, 41, 49, 50]. In addition, to eliminate residual noise in case of speech absence, a voice activity detector (VAD) is often incorporated with the estimator output [51–56]. Subspace methods for speech enhancement attempt to decompose the vector space of the noisy signal into a signal-plus-noise subspace and a noise-only subspace [57–60]. Spectral enhancement is then performed by removing the noise subspace and estimating the speech coefficients from the signal-plus-noise subspace. Another spectral enhancement method relies on modeling vectors of the speech signal using hidden Markov models (HMMs) [35, 61–63]. The probability distributions of the speech (and in some applications also of the noise) are estimated from long training sequences of clean samples. The speech signal is then estimated from the noisy observation based on the trained model according to some distortion criterion. HMMs have been successfully applied to speech recognition, e.g., [64, 65]. However, for the application of speech enhancement they were not found to be sufficiently refined models [66].
1.2.1 Spectral modeling of speech signals
Spectral enhancement of speech signals often relies on the assumption that the spectral coefficients of the speech signal (as well as of the noise signal) are statistically independent, with a conditional distribution which may be considered as, e.g., Gaussian [33, 34, 38], super-Gaussian [47], Laplace [67], or Gamma [68]. Although the statistical independence assumption simplifies the design of the optimal estimator under any of the assumed distributions, it does not hold in reality, since consecutive spectral magnitudes are strongly correlated [24, 25, 41, 69]. In fact, in many of the above-mentioned speech enhancement algorithms, the time-frequency-dependent speech spectral variances are estimated using the decision-directed approach [33, 70], which relies on the strong correlation of successive spectral variances. Let x and d denote a speech signal and an uncorrelated noise signal, respectively, and let y = x + d denote the observed signal. Applying the STFT to the observed signal, we obtain Y_tk = X_tk + D_tk, where t and k denote the time-frame and frequency-bin indices, respectively. The decision-directed estimator for the speech spectral variance is given by

    λ̂_tk = α |X̂_{t−1,k}|² + (1 − α) max{ |Y_tk|² − λ_{d,tk}, 0 }        (1.1)

where λ̂_tk denotes the estimate of the speech spectral variance and λ_{d,tk} denotes the spectral variance of the noise spectral coefficient. The parameter α (0 ≤ α ≤ 1) is a weighting factor (typically chosen close to one) that controls the trade-off between noise reduction and transient distortion brought into the signal [70]. A larger value of α results in a greater reduction of the musical noise phenomenon, but at the expense of attenuated speech onsets and audible modifications of transient components. It is worth noting that the noise spectral variance also needs to be estimated. This can practically be obtained during speech absence intervals, or from the noisy observations by using the minima-controlled recursive averaging algorithm [38, 71] or the minimum statistics approach [72]. A relaxed statistical model for speech signals was proposed in [69]. This model is based on the assumptions that the speech spectral phases are iid random variables, and that the speech spectral component X_tk is a zero-mean complex Gaussian random variable with iid real and imaginary components. The sequence of speech spectral variances {λ_tk | t = 0, 1, ...} is a random process, and each spectral variance is correlated with the sequence of spectral magnitudes. However, given a specific spectral variance λ_tk, the spectral magnitude at the same time-frequency bin is statistically independent of the other spectral magnitudes. This model was the first to treat both the sequence of spectral coefficients {X_tk} and the sequence of spectral variances {λ_tk} at a specific frequency bin as random processes. Recently, it was proposed to model speech spectral coefficients using a GARCH model [23–25]. This model explicitly parameterizes the time-varying volatility (conditional variance) at a specific frequency-bin index by using both recent conditional variances and recent squared values of the spectral coefficients.
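The decision-directed rule of (1.1) translates into a one-line update per time-frequency bin. A sketch, assuming scalar inputs for a single bin; the default α = 0.98 is a typical illustrative choice, not a value prescribed by the thesis.

```python
def decision_directed_variance(x_prev_mag2, y_mag2, noise_var, alpha=0.98):
    """Decision-directed estimate of the speech spectral variance per (1.1):
    a weighted average of the previous frame's estimated clean power
    |X_{t-1,k}|^2 and the maximum-likelihood term max(|Y_tk|^2 - lambda_d, 0).

    alpha close to one reduces musical noise at the cost of attenuating
    speech onsets and smearing transients."""
    return alpha * x_prev_mag2 + (1.0 - alpha) * max(y_mag2 - noise_var, 0.0)
```

In a full enhancement loop this update would run once per bin per frame, with `x_prev_mag2` taken from the enhanced output of the previous frame, which is precisely what couples successive spectral variance estimates.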
It was shown that under perfect detection of the speech spectral coefficients, improved performance is achieved by using the GARCH model compared to using the decision-directed approach. In this dissertation, we introduce a GARCH model with Markov regimes for speech spectral modeling. This model exploits the advantages of both the conditional heteroscedasticity structure of GARCH models and the time-varying characteristics of hidden Markov chains. The model parameters are allowed to change in time according to the state of a hidden Markov chain, which may be ascribed to switching between speech phonemes or between different speakers. We develop model-based algorithms, which are shown to be useful for speech enhancement, speech dereverberation, and acoustic source separation.
1.3 Other speech processing applications
While the classical problem of speech enhancement has been extensively studied during recent decades, the applications of hands-free communication and digital storage of audio signals have raised interesting problems such as speech dereverberation, nonstationary noise reduction, and acoustic blind source separation. Speech signals that are received by a microphone distant from the speech source usually contain reverberation. The sound wave produced by the speaker propagates outward from the source. The wavefronts reflect off the walls and other objects and are superimposed at the microphone. These reflections can degrade the fidelity and intelligibility of speech signals, and the performance of automatic speech recognition systems [73]. The received reverberant sound generally consists of a direct sound, reflections that arrive shortly after the direct sound (commonly called early reverberation), and reflections that arrive after the early reverberation (late reverberation) [73–75]. While the early reflections mainly contribute to spectral coloration, and may even improve the intelligibility of the speech sound, the late reverberation changes the waveform's temporal envelope, as decaying 'tails' are added at sound offsets. This may cause a distant and echo-ey sound quality [73, 76]. Algorithms for reverberation reduction (or dereverberation) can be divided into two classes. Algorithms in the first class are based on an estimation of the acoustic impulse response
(AIR). The desired signal is, in that case, estimated by deconvolution methods, e.g., [77–79]. Two drawbacks of these algorithms, however, are that they have been shown to be sensitive to small changes in the AIR, and that in some cases the order of the AIR needs to be known [80, 81]. Methods in the second class try to suppress reverberation without estimating the AIR, e.g., [74, 82–84]. The spectral coefficients of the desired signal are estimated using some statistical model, while trying to suppress the undesired reverberation. We develop a dual-microphone speech dereverberation algorithm for noisy environments, which is aimed at suppressing late reverberation and background noise. The spectral variance of the late reverberation is obtained with an adaptively estimated direct-path compensation, and an MS-GARCH model is used to estimate the spectral variance of the desired signal, which includes the direct sound and early reverberation.
Blind separation of mixed audio signals received by a single microphone has been a challenging problem for many years. Examples of applications include separation of speakers [85, 86], separation of different musical sources (e.g., different musical instruments) [85, 87, 88], separation of speech or singing voice from background music [89-92], and signal enhancement in nonstationary noise environments [35, 62, 93-95]. When the signals are received by multiple microphones, spatial filtering may be employed, as well as mutual statistical information between the received signals, e.g., [96-102]. However, if several sources are recorded by a single microphone, some a priori information is necessary to enable reasonable separation performance.
In [94, 95], speech and nonstationary noise signals are assumed to evolve as autoregressive (AR) processes, while the a priori statistical information (codebooks) is obtained in a training phase and includes several sets of linear prediction coefficients. In [87, 88, 90], the acoustic signals are modeled by Gaussian mixture models (GMMs). In [35, 62], the acoustic signals are modeled by hidden Markov models (HMMs) with AR sub-sources. The assumed statistical models provide a priori information about the distinct signals, which, together with training sequences of signals and appropriately extracted codebooks, enables source separation from signal mixtures. GMM and AR-based codebooks are generally insufficient for representing statistically rich signals such as speech [92], since each state specifies a predetermined probability density function (pdf), or a mixture of pdf's. We develop a new algorithm for single-sensor audio source separation of speech and music signals, which is based on MS-GARCH modeling of the speech signal and a GMM for the music signal. Since in the MS-GARCH model the active state only specifies the evolution of the spectral variances along time, the corresponding statistical model allows the spectral variances to take almost any values. The separation of the speech from the music signal is obtained by a classification and estimation approach, which enables controlling the trade-off between residual interference and signal distortion.
1.4 Detection and estimation of speech signals
In many signal processing applications, as well as communication applications, the signal to be estimated is not surely present in the available noisy observation. Therefore, as specified in Section 1.2, algorithms often try to estimate the signal under uncertainty (i.e., using some a priori probability for the existence of the signal) [32, 33, 38, 41], or alternatively, apply an independent detector for signal presence [51-56]. This detector may be designed based on the noisy observation, or on the estimated signal. The spectral coefficients of the speech signal are generally sparse in the STFT domain, in the sense that speech is present only in some of the frames, and in each frame only some of the frequency bins contain the significant part of the signal energy. Therefore, both signal estimation and detection are generally required while processing noisy speech signals [38, 41, 103]. However, existing algorithms often focus on estimating the spectral coefficients rather than detecting their existence. The spectral-subtraction algorithm [29, 30] contains an elementary detector for speech activity in the time-frequency domain, but it generates musical noise caused by falsely detecting noise peaks as bins that contain speech, which are randomly scattered in the STFT domain. Subspace approaches for speech enhancement [57, 59, 60, 104] decompose the vector of the noisy signal into a signal-plus-noise subspace and a noise subspace, and the speech spectral coefficients are estimated after removing the noise subspace. Accordingly, these algorithms are aimed at detecting the speech coefficients and subsequently estimating their values. McAulay and Malpass [32] were the first to propose a speech spectral estimator under a two-state model. They derived a maximum likelihood (ML) estimator for the speech spectral amplitude under speech-presence uncertainty. Ephraim and Malah followed this approach of signal estimation under speech-presence uncertainty and derived an estimator which minimizes the mean-square error (MSE) of the short-term spectral amplitude (STSA) [33]. In [49], the speech presence probability is evaluated to improve the minimum MSE (MMSE) log-spectral amplitude (LSA) estimator, and in [38] a further improvement of the MMSE-LSA estimator is achieved based on a two-state model.
A reliable detector for speech activity in the time-frequency domain is of major importance, not only for improving noise reduction, but also for noise estimation, speech coding, and speech recognition. Many voice activity detection (VAD) algorithms are designed on a frame-by-frame basis, e.g., [51-56], or the activity of speech is detected in each time-frequency bin, e.g., [38, 103]. The detection of speech presence by using a microphone array has recently received much attention, e.g., [105-108]. Spriet et al. [109] analyzed the impact of speech detection errors on the noise reduction performance of multichannel systems and showed that a reliable speech detector is crucial for achieving better speech enhancement performance.
Approaches for the design of coupled operations of signal detection and estimation have been proposed for some communication applications, e.g., [110-113]. In [114], a method for optimal simultaneous classification and estimation has been proposed, which minimizes the worst-case erroneous classification probability under a false alarm constraint. Middleton et al. [115, 116] were the first to propose simultaneous signal detection and estimation within the framework of statistical decision theory. They proposed dual schemes for the two operations while considering several methods of coupling between them. The detector is generally optimized with knowledge of the specific structure of the estimator, and the estimator is optimized in the sense of minimizing a Bayes risk associated with the combined operations.
In this research, we reformulate the speech enhancement problem as a joint problem of speech activity detection and spectral estimation. The problem formulation assumes sparsity of the speech spectral coefficients, and it introduces coupled operations of detection and estimation using a combined Bayes risk. The Bayes risk incorporates both the cost of estimation errors and the cost of faulty detection, whether a missed detection of speech components or a false detection. When considering the problem of single-channel audio source separation, multiple hypotheses are used for both signals. In that case we incorporate a simultaneous classification and estimation scheme for the separation. In both cases, the combined cost results in cost parameters that enable controlling the trade-off between signal distortion, caused by missed detection of speech components, and residual noise, resulting from false detection.
1.5 Thesis structure
This thesis is organized as follows. Chapter 2 briefly outlines the basic theories and methods which were used during this research. The original contribution of this research starts in Chapter 3.

In Chapter 3, we develop a comprehensive approach for stationarity analysis of MS-GARCH models. We consider the general case of m-state Markov chains and (p, q)-order GARCH processes, and specify the unconditional variance of the process using the expectation of the regime-dependent conditional variances. No knowledge of the process history is assumed, except for the model parameters. The expectation of the conditional variance at a given regime is then recursively constructed from the conditional expectation of both previous conditional and unconditional variances. Consequently, we obtain a complete recursion for the expected vector of state-dependent conditional variances. The recursive vector form is constructed by means of a representative matrix which is built from the model parameters. We show that constraining the largest absolute eigenvalue of the representative matrix to be less than one is necessary and sufficient for the convergence of the unconditional variance, and therefore for the asymptotic stationarity of the process. We derive stationarity conditions for the general formulation of the two variants of MS-GARCH models introduced by Klaassen [7] and Haas et al. [8]. We show that our results reduce in some degenerate cases to the stationarity conditions developed by Bollerslev [1], and by Klaassen and Haas et al. Furthermore, we show that the stationarity conditions developed by Klaassen are not only necessary but also sufficient for asymptotic stationarity of his model.

In Chapter 4, we introduce a Markov-switching time-frequency GARCH (MSTF-GARCH) model for speech signals in the STFT domain.
The MSTF-GARCH model exploits the advantages of both the conditional heteroscedasticity structure of GARCH models and the time-varying characteristics of hidden Markov chains. The expansion coefficients are considered as nonstationary random signals in the time-frequency domain and modeled as multivariate complex GARCH processes with Markov-switching regimes. A corresponding recursive algorithm is developed for signal restoration in a noisy environment. The conditional variance is estimated by iterating propagation and update steps with regime conditional probabilities. The model parameters are estimated from a training data set prior to the signal restoration using an ML approach, and the number of states is assumed to be known. We show that the derivation in [117] of bounds on the MSE of composite source signal estimation is applicable for obtaining an upper bound on the MSE of a single-step MSTF-GARCH estimation. Experimental results demonstrate the improved performance of the proposed algorithm for restoration of an MSTF-GARCH process, compared to using an estimator which assumes a stationary process, and compared to using an estimator which assumes a smaller number of regimes than the process actually has. Furthermore, it is demonstrated that the squared absolute values of speech coefficients in the STFT domain are better evaluated by using the MSTF-GARCH model than by using the decision-directed approach.
In Appendix 4.A, we present an application of the MSTF-GARCH model to speech enhancement. We employ the MSTF-GARCH model by assuming different Markov chains in distinct frequency subbands with identical state transition probabilities. The GARCH parameters are state dependent and frequency variant. We define an additional state for the case where speech coefficients are absent (or below a certain threshold level) and introduce a parameter estimation method which is computationally more efficient than the traditional ML approach. Furthermore, the probability of the speech-absence state, which is naturally generated in the reconstruction algorithm, can be used as a soft voice activity detector. Experimental results demonstrate improved noise reduction performance while preserving weak components of the speech signal.
Chapter 5 addresses the problem of state smoothing in MS-GARCH processes in noisy environments. The dependency of the conditional variance on past observations and past active regimes is taken into consideration as we generalize both the forward-backward recursions [118] and the stable backward recursion [119, 120]. We derive two recursive steps for the evaluation of conditional densities of future observations. The first step is an upward recursion which manipulates the future observations for the evaluation of their conditional densities, corresponding to all possible future paths. The second step is a backward recursion which integrates over these paths to evaluate the future densities required for the noncausal state probability. The computational complexity of the generalized recursions grows exponentially with the number of future observations employed for the fixed-lag smoothing. However, experimental results demonstrate that the significant part of the improvement in performance, compared to using causal estimation, is achieved by considering only a few future observations.
In Chapter 6, we propose a new formulation for the speech enhancement problem based on simultaneous operations of detection and estimation. A detector for the speech coefficients is combined with an estimator, which jointly minimize a cost function that takes into account both estimation and detection errors. Under speech presence, the cost is proportional to a quadratic spectral amplitude (QSA) error [33], while under speech absence, the distortion depends on a certain attenuation factor [29, 38, 70]. We derive a combined detector and estimator with cost parameters that enable controlling the trade-off between speech distortion, caused by missed detection of speech components, and residual musical noise, resulting from false detection. The combined solution generalizes the well-known STSA algorithm, which involves merely estimation under signal presence uncertainty. In addition, we propose a modification of the decision-directed a priori SNR estimator, which is suitable for transient-noise environments. Experimental results show that the simultaneous detection and estimation yields better noise reduction than the STSA algorithm without degrading the speech signal. The advantage of using a suitable indicator for transient noise is demonstrated in a nonstationary noise environment, where the proposed algorithm facilitates suppression of transient noise with a controlled level of speech distortion.
Appendix 6.B introduces a closely related application of transient-noise removal using a practical detector for the presence of transient noise. We formulate a speech enhancement problem under multiple hypotheses, assuming some indicator or detector for the presence of noise transients in the STFT domain is available. Cost parameters control the trade-off between speech distortion and residual transient noise. We derive an optimal signal estimator that employs the available detector, and show that the resulting estimator generalizes the optimally-modified log-spectral amplitude (OM-LSA) estimator [38].
Experimental results demonstrate the improved performance obtained by the proposed algorithm, compared to using the OM-LSA.
Chapter 7 deals with the problem of single-channel audio source separation. A novel approach is proposed for single-channel blind source separation of speech and music signals. This approach includes a new codebook for speech signals, as well as a new separation algorithm which relies on a simultaneous classification and estimation procedure. The codebook is based on GARCH modeling of speech signals and a GMM for music signals. Two methods are proposed for classification and estimation. One is based on simultaneous operations of classification and estimation which jointly minimize a combined Bayes risk. The second method employs a given (non-optimal) classifier, and applies an estimator which is optimally designed to yield a controlled level of residual interference and signal distortion. The GARCH model for the speech signal, with several states of parameters, enables smooth (diagonal) covariance matrices with possible state switching. Experimental results demonstrate that for mixtures of speech and piano signals it is more advantageous to model the speech signal by GARCH than by GMM, and the codebook generated by the GARCH model yields significantly improved separation performance. In addition, the classification and estimation approach enables the user to control the trade-off between the distortion of the desired signal, caused by missed detection, and the amount of residual signal, resulting from false detection.
In Chapter 8, we consider the problem of speech dereverberation using MS-GARCH modeling. We develop an improved dual-microphone speech dereverberation algorithm which relies on MS-GARCH modeling of the desired early speech component, which consists of the direct sound and early reverberation. The model is applied to distinct frequency subbands and specifies the volatility clustering of successive spectral coefficients, while a speech-absence state is used for evaluating the speech presence probability. Furthermore, an adaptive approach is developed to estimate the parameter for the direct path compensation (DPC) directly from the observed signals. Experimental results show that improved results can be obtained by using the MS-GARCH modeling rather than the decision-directed approach. Furthermore, by using the proposed algorithm, the performance obtained with a blindly estimated DPC parameter is comparable to that obtained with an optimal DPC parameter calculated from the actual AIR, which is unknown in practice.

Chapter 9 summarizes the main contributions of this dissertation and presents some future research directions.
1.6 List of publications
The chapters of this thesis are based on the following publications:
Chapter 3 is based on:
1. A. Abramson and I. Cohen, On the stationarity of Markov switching GARCH processes, Econometric Theory, vol. 23, no. 3, pp. 485-500, 2007.
Chapter 4 is based on:
2. A. Abramson and I. Cohen, Recursive supervised estimation of a Markov-switching GARCH process in the short-time Fourier transform domain, IEEE Trans. on Signal Processing, vol. 55, no. 7, pp. 3227-3238, July 2007.
3. A. Abramson and I. Cohen, Asymptotic stationarity of Markov-switching time-frequency GARCH processes, in Proc. 30th IEEE Internat. Conf. Acoust. Speech Signal Processing, ICASSP-06, Toulouse, France, May 2006, pp. III-452-455.
Appendix 4.A is based on:
4. A. Abramson and I. Cohen, Markov-switching GARCH model and application to speech enhancement in subbands, in Proc. 10th Internat. Workshop on Acoust. Echo and Noise Control, IWAENC-2006, Paris, France, September 2006. (Best student paper).
Chapter 5 is based on:
5. A. Abramson and I. Cohen, State smoothing in Markov-switching time-frequency GARCH models, IEEE Signal Processing Letters, vol. 13, no. 6, pp. 377-380, June 2006.
Chapter 6 is based on:
6. A. Abramson and I. Cohen, Simultaneous detection and estimation approach for speech enhancement, IEEE Trans. Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2348-2359, Nov. 2007.
Appendix 6.B is based on:
7. A. Abramson and I. Cohen, Enhancement of speech signals under multiple hypotheses using an indicator for transient noise presence, in Proc. 31st IEEE Internat. Conf. Acoust. Speech Signal Processing, ICASSP-07, Honolulu, Hawaii, Apr. 2007, pp. IV-533-536. (Best student paper finalist).
Chapter 7 is based on:
8. A. Abramson and I. Cohen, Single-sensor audio source separation using classification and estimation approach and GARCH modeling, submitted to IEEE Trans. Audio, Speech, and Language Processing.

Chapter 8 is based on:
9. A. Abramson, Emanuël A. P. Habets, S. Gannot, and I. Cohen, Dual-microphone speech dereverberation using GARCH modeling, to appear in Proc. 32nd IEEE Internat. Conf. Acoust. Speech Signal Processing, ICASSP-08, Las Vegas, Apr. 2008.

Chapter 2
Research Methods
In this chapter, we briefly review research methods which were useful during this research. We start by introducing the GARCH model formulation. We then introduce a specific formulation of the MS-GARCH model with its known conditions for asymptotic stationarity. The GARCH modeling approach for speech spectral coefficients is briefly reviewed together with the variance estimation algorithm and the spectral enhancement approach. Finally, we briefly review existing methods for single-sensor blind source separation and for speech dereverberation using spectral enhancement.
2.1 GARCH Models and stationarity analysis
A linear (p, q)-order GARCH model is defined as follows. Let εt denote a real-valued discrete-time stochastic process, and let ψt denote the information set (σ-field) of all information through time t. The GARCH(p, q) process is then given by [1]
$$\varepsilon_t = \sigma_t v_t \qquad (2.1)$$

where $\{v_t\}$ are iid random variables with zero mean, unit variance, and some predetermined probability density. The conditional variance of the process, $\sigma_t^2 = E\{\varepsilon_t^2 \mid \psi_{t-1}\}$, evolves as

$$\sigma_t^2 = \xi + \sum_{i=1}^{q} \alpha_i \varepsilon_{t-i}^2 + \sum_{j=1}^{p} \beta_j \sigma_{t-j}^2 \qquad (2.2)$$

where
$$p \ge 0\,, \quad q > 0\,, \quad \xi > 0\,, \quad \alpha_i \ge 0\,,\ i = 1, \ldots, q\,, \quad \beta_j \ge 0\,,\ j = 1, \ldots, p\,.$$

For $p = 0$ the process reduces to the ARCH(q) process [2], and for $q = p = 0$ the process $\varepsilon_t$ is simply white noise. The nonnegativity of the parameters $\alpha_i$, $i = 1, \ldots, q$, and $\beta_j$, $j = 1, \ldots, p$, together with $\xi > 0$, is sufficient to ensure a positive conditional variance $\sigma_t^2$. However, since the conditional variance is a time-varying random process, it is not guaranteed in general that the variance of the process is finite.
Theorem 2.1. [1] The GARCH(p, q) process as defined in (2.1) and (2.2) is (asymptotically) wide-sense stationary with $E(\varepsilon_t) = 0$, $\lim_{t \to \infty} \mathrm{Var}(\varepsilon_t) = \xi / \left(1 - \sum_{i=1}^{q} \alpha_i - \sum_{j=1}^{p} \beta_j\right)$ and $\mathrm{Cov}(\varepsilon_t, \varepsilon_\tau) = 0$ for $t \ne \tau$, if and only if $\sum_{i=1}^{q} \alpha_i + \sum_{j=1}^{p} \beta_j < 1$.

The proof can be found in [1, 5]. Theorem 2.1 gives an important constraint on the model parameters such that the second-order moment of the GARCH process is finite. It also shows that under this condition, the process is asymptotically wide-sense stationary.
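As a quick numerical illustration of Theorem 2.1, the following sketch (with illustrative parameter values, not values used in this thesis) simulates a Gaussian GARCH(1,1) process according to (2.1)-(2.2) and compares the sample variance with the asymptotic value $\xi/(1-\alpha-\beta)$:

```python
import numpy as np

def simulate_garch(xi, alpha, beta, n, seed=0):
    """Simulate a GARCH(1,1) process: eps_t = sigma_t * v_t with v_t ~ N(0,1),
    and sigma_t^2 = xi + alpha * eps_{t-1}^2 + beta * sigma_{t-1}^2 (eq. 2.2)."""
    rng = np.random.default_rng(seed)
    eps = np.empty(n)
    # start the recursion at the asymptotic variance xi / (1 - alpha - beta)
    sigma2 = xi / (1.0 - alpha - beta)
    for t in range(n):
        eps[t] = np.sqrt(sigma2) * rng.standard_normal()
        sigma2 = xi + alpha * eps[t] ** 2 + beta * sigma2
    return eps

# Theorem 2.1: wide-sense stationarity requires alpha + beta < 1,
# and then lim Var(eps_t) = xi / (1 - alpha - beta).
xi, alpha, beta = 0.3, 0.1, 0.6
assert alpha + beta < 1
eps = simulate_garch(xi, alpha, beta, n=200_000)
print(np.var(eps))  # close to xi / (1 - alpha - beta) = 1.0
```

As the persistence $\alpha + \beta$ approaches one, convergence of the sample variance to the asymptotic value becomes slow, which is the volatility-clustering behavior that makes the stationarity condition binding in practice.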
2.1.1 Markov-switching GARCH model
Let $S_t \in \{1, \ldots, m\}$ denote the (unobserved) regime at a discrete time $t$ and let $s_t$ be a realization of $S_t$, assuming that $\{S_t\}$ is a first-order stationary Markov chain with transition probabilities $a_{ij} \triangleq p(S_t = j \mid S_{t-1} = i)$, a transition probability matrix $A = \{a_{ij}\}$, and stationary probabilities $\pi = [\pi_1, \pi_2, \ldots, \pi_m]'$, $\pi_i \triangleq p(S_t = i)$, where $'$ denotes the transpose operation. Incorporating GARCH models with a hidden Markov chain, where each state of the chain (regime) allows a different GARCH behavior and thus a different volatility structure, extends the dynamic formulation of the model and potentially enables improved forecasts of the volatility [6-11]. Unfortunately, the volatility of a GARCH process with switching regimes depends on the entire history of the process, including the regime path, which makes the derivation of a volatility estimator impractical. Many different variants have been formulated for Markov-switching GARCH models. One popular formulation in econometrics is the model proposed by Klaassen [7] as a modification of Gray's model [6]. These models integrate out the unobserved regime path so that the conditional variance can be constructed from previous observations only. Accordingly, given the Markovian active state $s_t \in \{1, 2\}$, the conditional variance follows [7]:
$$\sigma^2_{t,s_t} = \xi_{s_t} + \alpha_{s_t}\, \varepsilon^2_{t-1} + \beta_{s_t}\, E\left\{\sigma^2_{t-1,S_{t-1}} \mid s_t, \psi_{t-1}\right\}. \qquad (2.3)$$

As can be seen from (2.3), this model formulation originally assumes a degenerate model with only a two-state Markov chain and GARCH(1,1) in each state. Define a $2 \times 2$ matrix $C$ with elements $c_{ij} = p(S_{t-1} = j \mid S_t = i)\,(\alpha_i + \beta_i)$. Then, we have the following theorem:

Theorem 2.2. [7] Necessary conditions for asymptotic wide-sense stationarity of the process defined in (2.1) and (2.3) are $c_{11}, c_{22} < 1$ and $\det(I - C) > 0$. The asymptotic conditional variances are then given by:
$$\begin{bmatrix} \sigma_1^2 \\ \sigma_2^2 \end{bmatrix} = (I - C)^{-1} \begin{bmatrix} \xi_1 \\ \xi_2 \end{bmatrix}. \qquad (2.4)$$

Proof. The unconditional variance under $s_t$ can be obtained by taking an expectation which is conditioned on the active state
$$\begin{aligned} \sigma^2_{t,s_t} &= \xi_{s_t} + \alpha_{s_t}\, E\left\{\varepsilon^2_{t-1} \mid s_t\right\} + \beta_{s_t}\, E\left\{\sigma^2_{t-1,S_{t-1}} \mid s_t\right\} \\ &= \xi_{s_t} + \alpha_{s_t}\, E\left\{E\left\{\varepsilon^2_{t-1} \mid s_{t-1}, s_t\right\} \mid s_t\right\} + \beta_{s_t}\, E\left\{E\left\{\sigma^2_{t-1,S_{t-1}} \mid s_{t-1}, s_t\right\} \mid s_t\right\} \\ &= \xi_{s_t} + \alpha_{s_t}\, E\left\{\sigma^2_{t-1,s_{t-1}} \mid s_t\right\} + \beta_{s_t}\, E\left\{\sigma^2_{t-1,s_{t-1}} \mid s_t\right\} \\ &= \xi_{s_t} + (\alpha_{s_t} + \beta_{s_t})\, E\left\{\sigma^2_{t-1,s_{t-1}} \mid s_t\right\}. \end{aligned} \qquad (2.5)$$

Next, assume that $\sigma^2_{t,1}$ and $\sigma^2_{t,2}$ are time invariant, and denote them by $\sigma_1^2$ and $\sigma_2^2$, respectively. Then

$$\begin{bmatrix} \sigma_1^2 \\ \sigma_2^2 \end{bmatrix} = \begin{bmatrix} \xi_1 \\ \xi_2 \end{bmatrix} + C \begin{bmatrix} \sigma_1^2 \\ \sigma_2^2 \end{bmatrix} \qquad (2.6)$$

where $c_{ij} = p(S_{t-1} = j \mid S_t = i)\,(\alpha_i + \beta_i)$ and

$$p(s_{t-1} \mid s_t) = \frac{p(s_t \mid s_{t-1})\, p(s_{t-1})}{p(s_t \mid S_{t-1} = 1)\, p(S_{t-1} = 1) + p(s_t \mid S_{t-1} = 2)\, p(S_{t-1} = 2)}\,. \qquad (2.7)$$

Under the assumption that $\sigma_1^2$ and $\sigma_2^2$ exist, $(I - C)$ is invertible and we have (2.4).
The necessary conditions for the existence of the unconditional variances are derived as follows. Since we have sufficient conditions for the positivity of the conditional variances (i.e., $\xi_s > 0$ and $\alpha_s, \beta_s \ge 0$ for $s = 1, 2$), all elements of $(I - C)^{-1}$ must be nonnegative and $(I - C)^{-1}$ must not have a zero row. Since $c_{12} = \alpha_1 + \beta_1 - c_{11}$ and $c_{21} = \alpha_2 + \beta_2 - c_{22}$, we can write
$$(I - C)^{-1} = \frac{1}{\det(I - C)} \begin{bmatrix} 1 - c_{22} & \alpha_1 + \beta_1 - c_{11} \\ \alpha_2 + \beta_2 - c_{22} & 1 - c_{11} \end{bmatrix}. \qquad (2.8)$$

Since

$$\alpha_i + \beta_i - c_{ii} = (\alpha_i + \beta_i)\left[1 - p(S_{t-1} = i \mid S_t = i)\right] \ge 0\,, \qquad (2.9)$$

the nonnegativity of (2.8) implies that $\det(I - C) > 0$, so that $1 - c_{11} \ge 0$ and $1 - c_{22} \ge 0$. But $c_{11}$ and $c_{22}$ must not equal one; otherwise, $\det(I - C) \le 0$. Therefore, necessary conditions for the existence of stationary variances are $c_{11}, c_{22} < 1$ and $\det(I - C) > 0$.
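The quantities in Theorem 2.2 are straightforward to evaluate numerically. The sketch below (all parameter values are illustrative, not taken from the thesis) builds the matrix $C$ for a two-state chain via Bayes' rule, checks the necessary conditions, and solves (2.4) for the asymptotic regime-conditional variances:

```python
import numpy as np

# Klaassen's two-state MS-GARCH(1,1): a_ij = p(S_t = j | S_{t-1} = i)
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
alpha = np.array([0.2, 0.3])
beta = np.array([0.3, 0.6])
xi = np.array([0.2, 0.1])

# stationary probabilities pi: left eigenvector of A for eigenvalue 1
w, v = np.linalg.eig(A.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1.0))])
pi = pi / pi.sum()

# c_ij = p(S_{t-1} = j | S_t = i) (alpha_i + beta_i), with the backward
# probability from Bayes' rule: p(S_{t-1} = j | S_t = i) = a_ji pi_j / pi_i
C = np.empty((2, 2))
for i in range(2):
    for j in range(2):
        C[i, j] = (A[j, i] * pi[j] / pi[i]) * (alpha[i] + beta[i])

I = np.eye(2)
# necessary conditions of Theorem 2.2
assert C[0, 0] < 1 and C[1, 1] < 1 and np.linalg.det(I - C) > 0
sigma2 = np.linalg.solve(I - C, xi)   # eq. (2.4)
print(sigma2)   # asymptotic regime-conditional variances, both positive
```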
In Chapter 3 we derive conditions which are both necessary and sufficient for asymptotic stationarity, considering any finite-state Markov chain and any (p, q)-order GARCH model in each state.
2.2 Time-frequency GARCH model and spectral speech enhancement
2.2.1 Time-frequency GARCH model
Recall that $x$ and $d$ represent speech and uncorrelated additive noise signals, and $y = x + d$ represents the observed signal. Applying the STFT to the observed signal, we have in the time-frequency domain
$$Y_{tk} = X_{tk} + D_{tk}\,. \qquad (2.10)$$
Let $\mathcal{X}_\tau = \{X_{tk} \mid t = 0, 1, \ldots, \tau,\ k = 0, \ldots, K-1\}$ represent the set of clean spectral coefficients up to frame $\tau$. Let $H_1^{tk}$ and $H_0^{tk}$ denote the hypotheses of speech presence and absence, respectively, in the time-frequency bin $(t, k)$, and let
$$\lambda_{tk|\tau} \triangleq E\left\{|X_{tk}|^2 \mid H_1^{tk}, \mathcal{X}_\tau\right\} \qquad (2.11)$$

denote the conditional variance of $X_{tk}$ under the hypothesis that speech is present in the time-frequency bin $(t, k)$, given the clean spectral coefficients up to time frame $\tau$. The time-frequency GARCH (TF-GARCH) model relies on the following assumptions [23, 25]:
1. Given $\{\lambda_{tk}\}$ and the state of speech presence in each time-frequency bin ($H_1^{tk}$ or $H_0^{tk}$), the speech spectral coefficients $\{X_{tk}\}$ are generated by
$$X_{tk} = \sqrt{\lambda_{tk}}\, V_{tk}\,, \qquad (2.12)$$

where $V_{tk} \mid H_0^{tk}$ are identically zero, and $V_{tk} \mid H_1^{tk}$ are statistically independent complex random variables with zero mean, unit variance, and independent and identically distributed (iid) real and imaginary parts:
$$H_1^{tk}:\ E\{V_{tk}\} = 0\,,\quad E\{|V_{tk}|^2\} = 1\,; \qquad H_0^{tk}:\ V_{tk} = 0\,. \qquad (2.13)$$
2. Under $H_1^{tk}$, the real and imaginary parts of $V_{tk}$ are iid with a Gaussian probability density.
3. The conditional variance $\lambda_{tk|t-1}$, referred to as the one-frame-ahead conditional variance, is a random process which evolves as a GARCH(1,1) process:
$$\lambda_{tk|t-1} = \lambda_{\min} + \alpha\, |X_{t-1,k}|^2 + \beta\left(\lambda_{t-1,k|t-2} - \lambda_{\min}\right) \qquad (2.14)$$

where
$$\lambda_{\min} > 0\,, \quad \alpha \ge 0\,, \quad \beta \ge 0\,, \quad \alpha + \beta < 1\,. \qquad (2.15)$$
4. The noise spectral coefficients $\{D_{tk}\}$ are zero-mean statistically independent Gaussian random variables, with iid real and imaginary parts.
The first assumption implies that the speech spectral coefficients $X_{tk} \mid H_1^{tk}$ are conditionally zero-mean statistically independent random variables given their variances $\{\lambda_{tk}\}$. However, the GARCH formulation parameterizes the correlation between successive conditional variances at the same frequency-bin index.
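A minimal simulation of the generative model in assumptions 1-4 (speech presence assumed in every bin, parameter values purely illustrative) may clarify how (2.12) and (2.14) interact:

```python
import numpy as np

def simulate_tf_garch(T, K, lam_min, alpha, beta, seed=0):
    """Generate complex STFT coefficients X_{tk} under the TF-GARCH model:
    X_tk = sqrt(lambda_tk) V_tk (2.12), with the one-frame-ahead variance
    recursion of (2.14). H1 (speech presence) is assumed in every bin."""
    rng = np.random.default_rng(seed)
    X = np.zeros((T, K), dtype=complex)
    lam = np.full(K, lam_min)  # initial one-frame-ahead variances
    for t in range(T):
        # complex Gaussian V with iid real/imaginary parts, unit variance
        V = (rng.standard_normal(K) + 1j * rng.standard_normal(K)) / np.sqrt(2)
        X[t] = np.sqrt(lam) * V
        # eq. (2.14): variance recursion, independently per frequency bin
        lam = lam_min + alpha * np.abs(X[t]) ** 2 + beta * (lam - lam_min)
    return X

X = simulate_tf_garch(T=500, K=64, lam_min=0.01, alpha=0.3, beta=0.6)
print(X.shape)  # (500, 64)
```

Note that the recursion couples variances only along time within each frequency bin, so bursts of energy cluster in time, mimicking the volatility clustering of speech spectral coefficients.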
2.2.2 Variance estimation
Since the conditional variance in the TF-GARCH model depends on the entire history of the process, it cannot be reconstructed from the noisy observations and needs to be estimated. The spectral variance is estimated from the noisy observations in two steps. First, given the additional information $Y_{tk}$, the conditional variance estimate $\hat{\lambda}_{tk|t-1}$ is updated; then, the conditional variance for the next frame is propagated one frame ahead based on the model formulation.
Specifically, an estimate for $\lambda_{tk|t}$ is obtained by calculating its conditional mean under $H_1^{tk}$ given $Y_{tk}$ and $\hat{\lambda}_{tk|t-1}$. By definition, $\lambda_{tk|t} = |X_{tk}|^2$. Hence, the update step is obtained by [24, 69]:
$$\hat{\lambda}_{tk|t} = E\left\{|X_{tk}|^2 \mid H_1^{tk}, \hat{\lambda}_{tk|t-1}, Y_{tk}\right\} = \frac{\hat{\xi}_{tk|t-1}}{1+\hat{\xi}_{tk|t-1}}\,\lambda_{d,tk} + \left(\frac{\hat{\xi}_{tk|t-1}}{1+\hat{\xi}_{tk|t-1}}\right)^{\!2} |Y_{tk}|^2 \qquad (2.16)$$

where $\lambda_{d,tk} = E\{|D_{tk}|^2\}$ and $\hat{\xi}_{tk|t-1} \triangleq \hat{\lambda}_{tk|t-1}/\lambda_{d,tk}$ is the a priori SNR. Substituting (2.16) into the model formulation (2.14), we have the propagation step:
$$\hat{\lambda}_{tk|t-1} = E\left\{\lambda_{tk|t-1} \mid H_1^{t-1,k}, \hat{\lambda}_{t-1,k|t-2}, Y_{t-1,k}\right\} = \lambda_{\min} + \alpha\, E\left\{|X_{t-1,k}|^2 \mid H_1^{t-1,k}, \hat{\lambda}_{t-1,k|t-2}, Y_{t-1,k}\right\} + \beta\left(\hat{\lambda}_{t-1,k|t-2} - \lambda_{\min}\right) = \lambda_{\min} + \alpha\,\hat{\lambda}_{t-1,k|t-1} + \beta\left(\hat{\lambda}_{t-1,k|t-2} - \lambda_{\min}\right). \qquad (2.17)$$

This two-step estimation method allows estimation of the conditional variances from the noisy coefficients. It is important to note that the model parameters are generally estimated from a training set using a maximum likelihood (ML) approach [5].
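The update step (2.16) and the propagation step (2.17) can be sketched for a single frequency bin as follows (illustrative parameter values; the noise variance $\lambda_{d,tk}$ is assumed known here, whereas in practice it must itself be estimated):

```python
import numpy as np

def estimate_variances(Y, lam_d, lam_min, alpha, beta):
    """Two-step spectral variance estimation for one frequency bin.
    Y: noisy STFT coefficients along time; lam_d: noise spectral variance.
    Returns the predicted (t|t-1) and updated (t|t) variance estimates."""
    T = len(Y)
    lam_pred = np.empty(T)   # lambda_hat_{t|t-1}
    lam_upd = np.empty(T)    # lambda_hat_{t|t}
    lam_pred[0] = lam_min    # illustrative initialization
    for t in range(T):
        xi = lam_pred[t] / lam_d          # a priori SNR estimate
        W = xi / (1.0 + xi)               # Wiener-type weight
        # update step, eq. (2.16)
        lam_upd[t] = W * lam_d + W ** 2 * np.abs(Y[t]) ** 2
        # propagation step, eq. (2.17)
        if t + 1 < T:
            lam_pred[t + 1] = lam_min + alpha * lam_upd[t] \
                              + beta * (lam_pred[t] - lam_min)
    return lam_pred, lam_upd

rng = np.random.default_rng(0)
Y = rng.standard_normal(100) + 1j * rng.standard_normal(100)
lam_pred, lam_upd = estimate_variances(Y, lam_d=1.0, lam_min=0.01,
                                       alpha=0.3, beta=0.6)
```

Observe that the propagation step keeps every prediction at or above $\lambda_{\min}$, which mirrors the positivity constraint (2.15) of the model.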
2.2.3 Spectral enhancement
Having an estimate for the spectral variances of the speech spectral coefficients, $\hat{\lambda}_{tk}$, an estimator for the coefficients $X_{tk}$ is obtained by minimizing the expected distortion given $\hat{\lambda}_{tk}$, $\lambda_{d,tk}$, $Y_{tk}$ and the a posteriori speech presence probability $q_{tk} \triangleq p(H_1^{tk} \mid Y_{tk})$ [41]:

$$\min_{\hat{X}_{tk}}\ E\left\{d\left(X_{tk}, \hat{X}_{tk}\right) \mid q_{tk}, \hat{\lambda}_{tk}, \lambda_{d,tk}, Y_{tk}\right\}. \qquad (2.18)$$
In particular, restricting ourselves to a squared error distortion measure of the form
$$d\left(X_{tk}, \hat{X}_{tk}\right) = \left|g\left(\hat{X}_{tk}\right) - \tilde{g}\left(X_{tk}\right)\right|^2 \qquad (2.19)$$

where $g(X)$ and $\tilde{g}(X)$ are specific functions which determine the fidelity criterion, the optimal estimator is calculated from
$$g\left(\hat{X}_{tk}\right) = E\left\{\tilde{g}(X_{tk}) \mid q_{tk}, \hat{\lambda}_{tk}, \lambda_{d,tk}, Y_{tk}\right\} = q_{tk}\, E\left\{\tilde{g}(X_{tk}) \mid H_1^{tk}, \hat{\lambda}_{tk}, \lambda_{d,tk}, Y_{tk}\right\} + (1 - q_{tk})\, E\left\{\tilde{g}(X_{tk}) \mid H_0^{tk}, Y_{tk}\right\}. \qquad (2.20)$$

Fidelity criteria that are of particular interest for speech enhancement applications are the MMSE of the STSA [33] and the MMSE of the LSA [34]. The MMSE-STSA estimator is derived by substituting into (2.19) the functions
$$g\left(\hat{X}_{tk}\right) = |\hat{X}_{tk}|\,, \qquad \tilde{g}(X_{tk}) = \begin{cases} |X_{tk}|\,, & \text{under } H_1^{tk} \\ G_{\min}\, |Y_{tk}|\,, & \text{under } H_0^{tk}\,. \end{cases} \qquad (2.21)$$

Let $\gamma_{tk} = |Y_{tk}|^2/\lambda_{d,tk}$ denote the a posteriori SNR and let $\upsilon_{tk} \triangleq \gamma_{tk}\, \hat{\xi}_{tk}/(1+\hat{\xi}_{tk})$. The resulting estimator is given by
$$\hat{X}_{tk} = \left[q_{tk}\, G_{\mathrm{STSA}}\left(\hat{\xi}_{tk}, \gamma_{tk}\right) + (1 - q_{tk})\, G_{\min}\right] Y_{tk} \qquad (2.22)$$

where [33]

$$G_{\mathrm{STSA}}(\xi, \gamma) \triangleq \frac{\sqrt{\pi\upsilon}}{2\gamma} \exp\left(-\frac{\upsilon}{2}\right) \left[(1+\upsilon)\, I_0\!\left(\frac{\upsilon}{2}\right) + \upsilon\, I_1\!\left(\frac{\upsilon}{2}\right)\right], \qquad (2.23)$$

and $I_\nu(\cdot)$ denotes the modified Bessel function of order $\nu$. The MMSE-LSA estimator is obtained by using the functions
$$g\left(\hat{X}_{tk}\right) = \log |\hat{X}_{tk}|\,, \qquad \tilde{g}(X_{tk}) = \begin{cases} \log |X_{tk}|\,, & \text{under } H_1^{tk} \\ \log\left(G_{\min}\, |Y_{tk}|\right), & \text{under } H_0^{tk}\,, \end{cases} \qquad (2.24)$$

and the resulting estimator follows
$$\hat{X}_{tk} = \left[G_{\mathrm{LSA}}\left(\hat{\xi}_{tk}, \gamma_{tk}\right)\right]^{q_{tk}} \left[G_{\min}\right]^{1-q_{tk}} Y_{tk} \qquad (2.25)$$

where [34]

$$G_{\mathrm{LSA}}(\xi, \gamma) = \frac{\xi}{1+\xi} \exp\left(\frac{1}{2} \int_{\upsilon}^{\infty} \frac{e^{-x}}{x}\, dx\right). \qquad (2.26)$$

It is important to note that both estimators (2.22) and (2.25) are insensitive to the phase estimation error, and they are combined with the phase of the noisy signal [33].
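The gain functions (2.23) and (2.26) can be evaluated directly with standard special functions; the sketch below uses SciPy's modified Bessel functions and exponential integral (the input values are illustrative, not from the thesis):

```python
import numpy as np
from scipy.special import i0, i1, exp1

def g_stsa(xi, gamma):
    """MMSE-STSA gain, eq. (2.23): sqrt(pi*v)/(2*gamma) * exp(-v/2)
    * [(1+v) I0(v/2) + v I1(v/2)], with v = gamma * xi / (1 + xi)."""
    v = gamma * xi / (1.0 + xi)
    return (np.sqrt(np.pi * v) / (2.0 * gamma) * np.exp(-v / 2.0)
            * ((1.0 + v) * i0(v / 2.0) + v * i1(v / 2.0)))

def g_lsa(xi, gamma):
    """MMSE-LSA gain, eq. (2.26): xi/(1+xi) * exp(0.5 * E1(v)), where
    E1(v) = int_v^inf exp(-x)/x dx is the exponential integral."""
    v = gamma * xi / (1.0 + xi)
    return xi / (1.0 + xi) * np.exp(0.5 * exp1(v))

# at high SNR both gains approach the Wiener gain xi / (1 + xi)
print(g_stsa(100.0, 100.0), g_lsa(100.0, 100.0))
```

Applying either gain to $Y_{tk}$, as in (2.22) or (2.25), modifies the amplitude only; the noisy phase is retained, consistent with the phase insensitivity noted above.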
2.3 Single-channel blind source separation
Separation of a mixture of signals observed via a single sensor is an ill-posed problem, and some a priori information is required to enable reasonable reconstructions. In [87-89], a GMM is proposed for the signals' codebook in the STFT domain, and in [94, 95] an AR model is proposed with different sets of prediction coefficients for each of the signals in the time domain. However, each set of AR coefficients, together with the excitation variance, corresponds to a specific covariance matrix in the STFT domain, similarly to the GMM. Under each of these models, each framed signal is considered as generated from some specific distribution which is related to the codebook with some probability, and a frame-by-frame separation is applied. Let $s_1, s_2 \in \mathbb{C}^N$ denote the vectors of the STFT expansion coefficients of signals $s_1(n)$ and $s_2(n)$, respectively, for some specific frame index. Let $q_1$ and $q_2$ denote the active states of the codebooks corresponding to signals $s_1$ and $s_2$, respectively, with known a priori probabilities $p_1(i) \triangleq p(q_1 = i)$, $i = 0, \ldots, m_1$, and $p_2(j) \triangleq p(q_2 = j)$, $j = 0, \ldots, m_2$, and $\sum_i p_1(i) = \sum_j p_2(j) = 1$. Given that $q_1 = i$ and $q_2 = j$, $s_1$ and $s_2$ are assumed conditionally zero-mean complex-valued Gaussian random vectors with known diagonal covariance matrices, i.e., $s_1 \sim \mathcal{CN}(0, \Sigma_1^{(i)})$ and $s_2 \sim \mathcal{CN}(0, \Sigma_2^{(j)})$. For the AR model [35, 62, 94, 95], each set of prediction coefficients in the time domain corresponds to a specific covariance matrix in the STFT domain, up to scaling by the excitation variance. Assuming sufficiently long frames, these covariance matrices are considered diagonal [95]. Based on a given codebook, it is proposed in [88] and [95] to first find the active pair of states $\{i, j\} = \{q_1 = i, q_2 = j\}$ using a maximum a posteriori (MAP) criterion:
\{ \hat{i}, \hat{j} \} = \arg\max_{i,j}\; p( x \mid i, j )\, p( i, j ) \qquad (2.27)

where $x = s_1 + s_2$, $p(\cdot \mid i, j) = p(\cdot \mid q_1 = i, q_2 = j)$, and for statistically independent signals $p(i, j) = p_1(i)\, p_2(j)$. Subsequently, conditioned on these states (i.e., classification), the desired signal may be reconstructed in the MMSE sense by
\hat{s}_1 = E\left\{ s_1 \mid x, \hat{i}, \hat{j} \right\} = \Sigma_1^{(\hat{i})} \left( \Sigma_1^{(\hat{i})} + \Sigma_2^{(\hat{j})} \right)^{-1} x \,\triangleq\, W_{\hat{i}\hat{j}}\, x \qquad (2.28)
and similarly $\hat{s}_2 = W_{\hat{j}\hat{i}}\, x$.¹ Alternatively [35, 88, 90], the desired signal may be reconstructed in the MMSE sense directly from
\hat{s}_1 = E\{ s_1 \mid x \} = E_{ij}\left\{ E[ s_1 \mid x, i, j ] \right\} = \sum_{i,j} p( i, j \mid x )\, W_{ij}\, x . \qquad (2.29)

In case of additional uncorrelated stationary noise in the mixed signal, i.e.,
x = s_1 + s_2 + d \qquad (2.30)

with $d \sim \mathcal{CN}(0, \Sigma)$, the covariance matrix of the noise signal is added to the covariance matrix of the interfering signal, and then the signal estimators remain in the same forms.
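A minimal numeric sketch of the classification-and-separation scheme (2.27)–(2.28), assuming the diagonal codebook covariances are stored as arrays of per-bin spectral variances; the function name and data layout are illustrative:

```python
import numpy as np

def map_separate(x, Sigma1, Sigma2, p1, p2):
    """Sketch of (2.27)-(2.28): pick the MAP-active codebook pair, then apply
    the diagonal Wiener filter W_ij. Sigma1 (m1 x N) and Sigma2 (m2 x N) hold
    the diagonal codebook variances; p1, p2 are the prior state probabilities."""
    best, best_score = None, -np.inf
    for i in range(len(p1)):
        for j in range(len(p2)):
            var = Sigma1[i] + Sigma2[j]          # diagonal covariance of x under (i, j)
            # complex-Gaussian log-likelihood log p(x | i, j) plus log priors
            score = -np.sum(np.log(np.pi * var) + np.abs(x) ** 2 / var)
            score += np.log(p1[i]) + np.log(p2[j])
            if score > best_score:
                best_score, best = score, (i, j)
    i, j = best
    W = Sigma1[i] / (Sigma1[i] + Sigma2[j])      # diagonal of W_ij in (2.28)
    return W * x, (1.0 - W) * x                  # the two estimates sum back to x
```

Because the two Wiener filters are complementary, the estimated components always add up to the observed mixture, which is a quick sanity check on any implementation.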
2.4 Speech dereverberation
A generalized statistical reverberation model has been proposed in [73–75]. Accordingly, the acoustic impulse response (AIR) h(n) can be split into two segments, $h_e(n)$ and $h_l(n)$:
h(n) = \begin{cases} h_e(n) , & 0 \le n \le T_r \\ h_l(n) , & n \ge T_r \\ 0 , & \text{otherwise.} \end{cases} \qquad (2.31)

The value $T_r$ is chosen such that $h_e(n)$ contains the direct path, and $h_l(n)$ contains all later reflections. To enable modeling the energy related to the direct path, the
¹Note that in this section the index i always refers to the signal $s_1$ and the index j refers to the other signal $s_2$. Therefore, $W_{ji} = \Sigma_2^{(j)} \left( \Sigma_1^{(i)} + \Sigma_2^{(j)} \right)^{-1}$.

following model is used:
h_e(n) = \begin{cases} b_e(n)\, e^{-\delta n} , & 0 \le n \le T_r \\ 0 , & \text{otherwise,} \end{cases} \qquad (2.32)

where $b_e(n)$ is a white zero-mean Gaussian stationary noise signal and $\delta$ is linked to the reverberation time², $T_{60}$. The late reverberation component $h_l(n)$ follows
h_l(n) = \begin{cases} b_l(n)\, e^{-\delta n} , & n \ge T_r \\ 0 , & \text{otherwise,} \end{cases} \qquad (2.33)

where $b_l(n)$ is a white zero-mean Gaussian stationary noise signal. It is assumed that $b_e(n)$ and $b_l(n)$ are uncorrelated, and the energy envelope of h(n) can be expressed as
E_h\left\{ h^2(n) \right\} = \begin{cases} \sigma_e^2\, e^{-2\delta n} , & 0 \le n \le T_r \\ \sigma_l^2\, e^{-2\delta n} , & n \ge T_r \\ 0 , & \text{otherwise,} \end{cases} \qquad (2.34)

where $\sigma_e^2$ and $\sigma_l^2$ denote the variances of $b_e(n)$ and $b_l(n)$, respectively, and generally $\sigma_e^2 \ge \sigma_l^2$. Considering a speech signal x(n) which propagates towards a microphone through a room with AIR h(n), the received signal can be written as
y(n) = x_e(n) + x_l(n) + d(n) , \qquad (2.35)
where $x_e(n)$ is the early speech component, $x_l(n)$ is the late reverberant signal, and d(n) is an uncorrelated additive noise. The spectral enhancement approach to speech dereverberation aims at estimating the early speech component from the noisy observation by using a time-frequency dependent gain function. Consequently, in the STFT domain we have
\hat{X}_e(t, k) = G(t, k)\, Y(t, k) . \qquad (2.36)
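A minimal sketch of the multiplicative modification (2.36), with the a priori and a posteriori SNRs formed from the early-speech, late-reverberant, and noise variances as in the SNR definitions of this section. A simple Wiener-type gain is used here as an illustrative stand-in for the spectral-subtraction and MMSE-LSA gains of [74, 75]:

```python
import numpy as np

def dereverb_gain(Y, lam_e, lam_l, lam_d):
    """Per-bin SNRs as in (2.37) and a Wiener-type gain applied as in (2.36).
    lam_e, lam_l, lam_d: early-speech, late-reverberant, and noise variances."""
    interference = lam_l + lam_d             # late reverberation plus noise
    xi = lam_e / interference                # a priori SNR
    gamma = np.abs(Y) ** 2 / interference    # a posteriori SNR
    G = xi / (1.0 + xi)                      # Wiener gain built from the a priori SNR
    return G * Y, xi, gamma                  # early-speech estimate, SNRs
```

The arrays may be full time-frequency matrices, so one call enhances an entire spectrogram at once.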
The gain function may be obtained by means of spectral subtraction [74] or in the MMSE-LSA sense [75]. In any case, the spectral variances of both the early speech component, $\lambda_e(t, k) = E\{ |X_e(t, k)|^2 \}$, and of the late reverberant signal, $\lambda_l(t, k) = E\{ |X_l(t, k)|^2 \}$, are estimated from the observed signal based on the reverberation model.

²The reverberation time, $T_{60}$, is defined as the time for the reverberation level to decay to 60 dB below the initial level.

Accordingly, while evaluating the gain function, the a priori and a posteriori SNRs are given by
\hat{\xi}(t, k) = \frac{ \hat{\lambda}_e(t, k) }{ \hat{\lambda}_l(t, k) + \lambda_d(t, k) } , \qquad \hat{\gamma}(t, k) = \frac{ |Y(t, k)|^2 }{ \hat{\lambda}_l(t, k) + \lambda_d(t, k) } . \qquad (2.37)

Chapter 3
Stationarity Analysis of Markov-Switching GARCH Processes¹
GARCH models with Markov-switching regimes are often used for volatility analysis of financial time series. Such models imply less persistence in the conditional variance than the standard GARCH model, and potentially provide a significant improvement in volatility forecasting. Nevertheless, conditions for asymptotic wide-sense stationarity have been derived only for some degenerate models. In this chapter, we introduce a comprehensive approach for stationarity analysis of Markov-switching GARCH models, which manipulates a backward recursion of the model's second-order moment. A recursive formulation of the state-dependent conditional variances is developed and the corresponding conditions for stationarity are obtained. In particular, we derive necessary and sufficient conditions for the asymptotic wide-sense stationarity of two different variants of Markov-switching GARCH processes, and obtain expressions for their asymptotic variances in the general case of m-state Markov chains and (p, q)-order GARCH processes.
1This chapter is based on [121].
3.1 Introduction
Volatility analysis of financial time series is of major importance in many financial applications. The generalized autoregressive conditional heteroscedasticity (GARCH) model [1] has been applied quite extensively in the field of econometrics, both by practitioners and by researchers, and has been shown to be useful for analyzing and forecasting the volatility of time-varying processes such as those pertaining to financial markets. Combining GARCH models with a hidden Markov chain, where each state of the chain (regime) allows a different GARCH behavior and thus a different volatility structure, extends the dynamic formulation of the model and potentially enables improved forecasts of the volatility [6–11]. Unfortunately, the volatility of a GARCH process with switching regimes depends on the entire history of the process, including the regime path, which makes the derivation of a volatility estimator impractical.
Cai [12] and Hamilton and Susmel [13] applied the idea of regime-switching parameters to the ARCH specification. The conditional variance of an ARCH model depends only on past observations, and accordingly the restriction to ARCH models avoids problems of infinite path dependency. Gray [6], Klaassen [7], and Haas, Mittnik and Paolella [8] proposed different variants of Markov-switching GARCH models, which also avoid the problem of dependency on the regime's path. Gray introduced a Markov-switching GARCH model relying on the assumption that the conditional variance at any regime depends on the expectation of previous conditional variances, rather than on their values. Accordingly, the conditional variance depends only on some finite set of past state-dependent expected values via their conditional state probabilities, and thus can be constructed from past observations. Klaassen proposed modifying Gray's model by conditioning the expectation of previous conditional variances on all available observations and also on the current regime. A different concept of Markov-switching GARCH model has recently been introduced by Haas, Mittnik and Paolella. Accordingly, a finite state-space Markov chain is assumed to govern the ARCH parameters, while the autoregressive behavior of the conditional variance is subject to the assumption that past conditional variances are in the same regime as the current one.
Markov-switching GARCH processes, as well as the standard GARCH process, are nonstationary as their second-order moments change recursively over time. However, if these processes are asymptotically wide-sense stationary then their variances are guaranteed to be finite. A necessary and sufficient condition for the stationarity of a (single-regime) GARCH(p, q) process has been developed in [1]. A condition for the stationarity of a natural, path-dependent Markov-switching GARCH(p, q) model has been developed in [122], and in [14] a thorough analysis of the probabilistic structure of that model is derived with conditions for the existence of moments of any order. In [15–17], stationarity analysis has been derived for some mixing models of conditional heteroscedasticity, and conditions for the asymptotic stationarity of some AR and ARMA models with Markov regimes have been derived in [18–22]. However, for the Markov-switching GARCH models described above, which avoid the dependency of the conditional variance on the chain's history, stationarity conditions are known in the literature only for some special cases. Klaassen [7] developed necessary (but not necessarily sufficient) conditions for stationarity of his model in the special case of two regimes and GARCH modeling of order (1, 1). A necessary and sufficient stationarity condition has been developed by Haas, Mittnik and Paolella [8] for their Markov-switching GARCH model, but only in the case of GARCH(1, 1) behavior in each regime.
In this chapter, we develop a comprehensive approach for stationarity analysis of Markov-switching GARCH models, in the general case of m-state Markov chains and (p, q)-order GARCH processes. We specify the unconditional variance of the process using the expectation of the regime-dependent conditional variances, and assume no history knowledge of the process except for the model parameters. The expectation of the conditional variance at a given regime is then recursively constructed from the conditional expectation of both previous conditional and unconditional variances. Consequently, we obtain a complete recursion for the expected vector of state-dependent conditional variances. The recursive vector form is constructed by means of a representative matrix which is built from the model parameters. We show that constraining the largest absolute eigenvalue of the representative matrix to be less than one is necessary and sufficient for the convergence of the unconditional variance, and therefore for the asymptotic stationarity of the process. We derive stationarity conditions for the general formulation of the two variants of Markov-switching GARCH models introduced by Klaassen and Haas et al.
We show that our results reduce in some degenerate cases to the stationarity conditions developed by Bollerslev [1], Klaassen [7] and Haas et al. [8]. Furthermore, we show that the stationarity conditions developed by Klaassen are not only necessary but also sufficient for asymptotic stationarity of his model. This chapter is organized as follows: In Section 3.2, we review the variants proposed by Klaassen and Haas et al. for Markov-switching GARCH models, and develop comprehensive necessary and sufficient conditions for asymptotic stationarity appropriate for the general formulation of the models. In Section 3.3, we derive relations between our results and previous works.
3.2 Stationarity of Markov-switching GARCH models
Let $S_t \in \{1, \ldots, m\}$ denote the (unobserved) regime at a discrete time t and let $s_t$ be a realization of $S_t$, assuming that $\{S_t\}$ is a first-order stationary Markov chain with transition probabilities $a_{ij} \triangleq p(S_t = j \mid S_{t-1} = i)$, a transition probability matrix A, $\{A\}_{ij} = a_{ij}$, and stationary probabilities $\pi = [\pi_1, \pi_2, \ldots, \pi_m]'$, $\pi_i \triangleq p(S_t = i)$, where $'$ denotes the transpose operation. Let $\mathcal{I}_t$ denote the observation set up to time t, and let $\{v_t\}$ be a zero-mean unit-variance random process with independent and identically distributed elements. Given that $S_t = s_t$, a Markov-switching GARCH model of order (p, q) can be formulated as
\varepsilon_t = \sigma_{t, s_t}\, v_t \qquad (3.1)

where the conditional variance of the process, $\sigma_{t,s_t}^2 = E\{ \varepsilon_t^2 \mid S_t = s_t, \mathcal{I}_{t-1} \}$, is a function of p previous conditional variances and q previous squared observations. Klaassen [7] and Haas, Mittnik and Paolella [8] proposed different variants of Markov-switching GARCH models; the former is a modification of Gray's model [6]. Each of these overcomes the problem of dependency on the regime's path encountered when naturally integrating the GARCH model with switching regimes. However, conditions for these models to be asymptotically wide-sense stationary, and therefore to guarantee a
finite second-order moment, are known only for some special cases. Klaassen developed necessary conditions for the stationarity of his model in the case of a two-state Markov chain and GARCH of order (1, 1). Haas et al. gave a necessary and sufficient stationarity condition for their model, but this condition is restricted to a first-order GARCH model in each of the regimes (i.e., p = q = 1). We first review these variants of Markov-switching GARCH models, which we call MSG-I and MSG-II, respectively. Then we develop necessary and sufficient conditions for their asymptotic wide-sense stationarity and derive their stationary variances.
3.2.1 MSG-I model
Gray [6] proposed to model the conditional variance of a Markov-switching GARCH model as dependent on the expectation of its past values over the entire set of states, rather than dependent on past states and the corresponding conditional variances. Accordingly, the state dependent conditional variance follows
\sigma_{t,s_t}^2 = \xi_{s_t} + \sum_{i=1}^{q} \alpha_{i,s_t} \varepsilon_{t-i}^2 + \sum_{j=1}^{p} \beta_{j,s_t} E\left\{ \varepsilon_{t-j}^2 \mid \mathcal{I}_{t-j-1} \right\} = \xi_{s_t} + \sum_{i=1}^{q} \alpha_{i,s_t} \varepsilon_{t-i}^2 + \sum_{j=1}^{p} \beta_{j,s_t} \sum_{s_{t-j}=1}^{m} p\left( S_{t-j} = s_{t-j} \mid \mathcal{I}_{t-j-1} \right) \sigma_{t-j, s_{t-j}}^2 , \qquad (3.2)

and the following constraints
\xi_{s_t} > 0 , \quad \alpha_{i,s_t} \ge 0 , \quad \beta_{j,s_t} \ge 0 , \qquad i = 1, \ldots, q , \; j = 1, \ldots, p , \; s_t = 1, \ldots, m \qquad (3.3)

are sufficient for the positivity of the conditional variance. Gray's model integrates out the unobserved regime path so that the conditional variance can be constructed from previous observations only. As a consequence, there is no path-dependency problem although GARCH effects are still allowed. Empirical analysis of modeling financial time series demonstrates that this Markov-switching GARCH model implies less persistence in the conditional variance than the standard GARCH model, and in addition, its one-step-ahead volatility forecast significantly outperforms that of the single-regime GARCH model (see, for instance, [6, 9, 11]).
Klaassen [7] proposed modifying Gray's model by replacing $p( S_{t-j} = s_{t-j} \mid \mathcal{I}_{t-j-1} )$ in (3.2) by $p( S_{t-j} = s_{t-j} \mid \mathcal{I}_{t-1}, S_t = s_t )$ while evaluating $\sigma_{t,s_t}^2$. Consequently, all available observations are used, as well as the given regime in which the conditional variance is calculated. The conditional variance according to Klaassen's model (denoted here as MSG-I) is given by
\sigma_{t,s_t}^2 = \xi_{s_t} + \sum_{i=1}^{q} \alpha_{i,s_t} \varepsilon_{t-i}^2 + \sum_{j=1}^{p} \beta_{j,s_t} \sum_{s_{t-j}=1}^{m} p\left( S_{t-j} = s_{t-j} \mid \mathcal{I}_{t-1}, S_t = s_t \right) \sigma_{t-j, s_{t-j}}^2 , \qquad (3.4)

and the same constraints (3.3) are sufficient for the positivity of the conditional variance. Both models integrate out the unobserved regimes for evaluating the conditional variance. However, Klaassen's model employs all the available information while Gray's model employs only part of it, since it does not utilize all the available observations and the assumed regime in which the conditional variance is being calculated. Specifically, if the process regimes are highly persistent, then both the current state $s_t$ and the previous innovation
$\varepsilon_{t-1}$ give much information about previous states, and thus the conditional probability of $s_{t-1}$ given all the observations up to time $t-1$ and the next state is substantially different from the probability of $s_{t-1}$ conditioned only on observations up to time $t-2$ [7]. In contrast to Gray, Klaassen does exploit this information in his model while evaluating the expectation of previous conditional variances. Furthermore, the formulation (3.4) better exploits the available information, and its structure yields straightforward expressions for multi-step-ahead volatility forecasts [7, 9]. The unconditional variance of the MSG-I process, defined in (3.1) and (3.4), can be calculated as follows:
E\left( \varepsilon_t^2 \right) = E_{\mathcal{I}_{t-1}, S_t}\left[ E\left( \varepsilon_t^2 \mid \mathcal{I}_{t-1}, s_t \right) \right] = E_{S_t}\left[ E_{\mathcal{I}_{t-1}}\left( \sigma_{t,s_t}^2 \mid s_t \right) \right] = \sum_{s_t=1}^{m} \pi_{s_t}\, E_{\mathcal{I}_{t-1}}\left( \sigma_{t,s_t}^2 \mid s_t \right) . \qquad (3.5)

For notational simplicity, we shall use $E(\cdot \mid s_t)$ and $p(\cdot \mid s_t)$ to represent $E(\cdot \mid S_t = s_t)$ and $p(\cdot \mid S_t = s_t)$, respectively, where $s_t$ represents the regime realization at time t. Furthermore, we shall use $E_t(\cdot)$ to denote the expectation over the information up to time t, i.e., $E_{\mathcal{I}_t}(\cdot)$. The expectation of the regime-dependent conditional variance follows

E_{t-1}\left( \sigma_{t,s_t}^2 \mid s_t \right) = \xi_{s_t} + \sum_{i=1}^{q} \alpha_{i,s_t} E_{t-1}\left( \varepsilon_{t-i}^2 \mid s_t \right) + \sum_{j=1}^{p} \beta_{j,s_t} \sum_{s_{t-j}=1}^{m} E_{t-1}\left[ p\left( s_{t-j} \mid \mathcal{I}_{t-1}, s_t \right) \sigma_{t-j,s_{t-j}}^2 \,\middle|\, s_t \right] , \qquad (3.6)

where the expectation over $\varepsilon_{t-i}^2$ can be obtained by

E_{t-1}\left( \varepsilon_{t-i}^2 \mid s_t \right) = \sum_{s_{t-i}=1}^{m} \int_{\mathcal{I}_{t-1}} \varepsilon_{t-i}^2\, p\left( \mathcal{I}_{t-1} \mid s_t, s_{t-i} \right) p\left( s_{t-i} \mid s_t \right) d\mathcal{I}_{t-1} = \sum_{s_{t-i}=1}^{m} p\left( s_{t-i} \mid s_t \right) E_{t-1}\left( \varepsilon_{t-i}^2 \mid s_{t-i}, s_t \right) . \qquad (3.7)

Note that given the current active state, the expected squared value is independent of any future states. Therefore,
E_{t-1}\left( \varepsilon_{t-i}^2 \mid s_{t-i}, s_t \right) = E_{t-1}\left( \varepsilon_{t-i}^2 \mid s_{t-i} \right) = \int_{\mathcal{I}_{t-i-1}} \int_{\varepsilon_{t-i}} \varepsilon_{t-i}^2\, p\left( \varepsilon_{t-i} \mid \mathcal{I}_{t-i-1}, s_{t-i} \right) p\left( \mathcal{I}_{t-i-1} \mid s_{t-i} \right) d\varepsilon_{t-i}\, d\mathcal{I}_{t-i-1} = E_{t-i-1}\left[ E\left( \varepsilon_{t-i}^2 \mid \mathcal{I}_{t-i-1}, s_{t-i} \right) \,\middle|\, s_{t-i} \right] = E_{t-i-1}\left( \sigma_{t-i,s_{t-i}}^2 \mid s_{t-i} \right) . \qquad (3.8)

Furthermore, the conditional expectation over the conditional variance in (3.6), weighted by the current state probability, can be obtained by
E_{t-1}\left[ p\left( s_{t-j} \mid \mathcal{I}_{t-1}, s_t \right) \sigma_{t-j,s_{t-j}}^2 \,\middle|\, s_t \right] = \int_{\mathcal{I}_{t-1}} \sigma_{t-j,s_{t-j}}^2\, p\left( s_{t-j} \mid \mathcal{I}_{t-1}, s_t \right) p\left( \mathcal{I}_{t-1} \mid s_t \right) d\mathcal{I}_{t-1} = \int_{\mathcal{I}_{t-1}} \sigma_{t-j,s_{t-j}}^2\, p\left( \mathcal{I}_{t-1} \mid s_{t-j}, s_t \right) p\left( s_{t-j} \mid s_t \right) d\mathcal{I}_{t-1} = p\left( s_{t-j} \mid s_t \right) E_{t-j-1}\left( \sigma_{t-j,s_{t-j}}^2 \mid s_{t-j} \right) . \qquad (3.9)

Consequently, the expectation of the conditional variance at a given regime $s_t$ can be recursively constructed, according to the model definitions, from both the expectations of previous conditional variances and the expected squared values given the current regime $s_t$. Let $r = \max\{p, q\}$ and define $\alpha_{i,s} \triangleq 0$ for all $i > q$ and $\beta_{i,s} \triangleq 0$ for all $i > p$. Then, by substituting (3.7), (3.8) and (3.9) into (3.6) we obtain
E_{t-1}\left( \sigma_{t,s_t}^2 \mid s_t \right) = \xi_{s_t} + \sum_{i=1}^{r} \sum_{s_{t-i}=1}^{m} \left( \alpha_{i,s_t} + \beta_{i,s_t} \right) p\left( s_{t-i} \mid s_t \right) E_{t-i-1}\left( \sigma_{t-i,s_{t-i}}^2 \mid s_{t-i} \right) , \qquad (3.10)

and applying Bayes' rule we have
p\left( s_{t-i} \mid s_t \right) = \frac{ \pi_{s_{t-i}} }{ \pi_{s_t} }\, p\left( s_t \mid s_{t-i} \right) = \frac{ \pi_{s_{t-i}} }{ \pi_{s_t} } \left\{ A^i \right\}_{s_{t-i}, s_t} . \qquad (3.11)

The expected state-dependent conditional variance (3.10) is recursively generated from a weighted sum of its previous expected values through their conditioned probabilities and the model parameters. Let $\xi \triangleq [\xi_1, \ldots, \xi_m]'$ and let $\mathcal{K}^{(i)}$ be an m-by-m matrix with elements
\left\{ \mathcal{K}^{(i)} \right\}_{s, \tilde{s}} \triangleq \left( \alpha_{i,s} + \beta_{i,s} \right) \left\{ A^i \right\}_{\tilde{s}, s} \frac{ \pi_{\tilde{s}} }{ \pi_s } , \qquad s, \tilde{s} = 1, \ldots, m , \qquad (3.12)

and let $h_t \triangleq \left[ E_{t-1}\left( \sigma_{t,1}^2 \mid S_t = 1 \right), \ldots, E_{t-1}\left( \sigma_{t,m}^2 \mid S_t = m \right) \right]'$ be an m-by-1 vector of the expected state-dependent conditional variances. Then, we have
h_t = \xi + \sum_{i=1}^{r} \mathcal{K}^{(i)} h_{t-i} . \qquad (3.13)

Define the rm-by-1 vectors $\tilde{h}_t \triangleq \left[ h_t', h_{t-1}', \ldots, h_{t-r+1}' \right]'$ and $\tilde{\xi} \triangleq \left[ \xi', 0, \ldots, 0 \right]'$, and let

\Psi_I \triangleq \begin{bmatrix} \mathcal{K}^{(1)} & \mathcal{K}^{(2)} & \cdots & & \mathcal{K}^{(r)} \\ I_m & 0_m & \cdots & & 0_m \\ 0_m & I_m & & & 0_m \\ \vdots & & \ddots & & \vdots \\ 0_m & \cdots & 0_m & I_m & 0_m \end{bmatrix} \qquad (3.14)

be an mr-by-mr matrix, where $I_m$ represents the identity matrix of size m-by-m and $0_m$ is an m-by-m matrix of zeros. Then a recursive vector form of the expected conditional variance (3.13) can be written as
\tilde{h}_t = \tilde{\xi} + \Psi_I\, \tilde{h}_{t-1} , \qquad t \ge 0 , \qquad (3.15)

with some initial conditions $\tilde{h}_{-1}$. Let $\rho(\cdot)$ denote the spectral radius of a matrix, i.e., its largest eigenvalue in modulus, and let $\Lambda_I$ be an m-by-m square matrix built from the mr-by-mr matrix $(I - \Psi_I)^{-1}$ such that $\{ \Lambda_I \}_{ij} = \left\{ (I - \Psi_I)^{-1} \right\}_{ij}$, $i, j = 1, \ldots, m$. Then we have the following theorem:
Theorem 3.1. An MSG-I process as defined by (3.1) and (3.4) is asymptotically wide-sense stationary with variance $\lim_{t \to \infty} E( \varepsilon_t^2 ) = \pi' \Lambda_I \xi$, if and only if $\rho( \Psi_I ) < 1$.²
Proof. The recursive equation (3.15) can be written as
\tilde{h}_t = \Psi_I^t\, \tilde{h}_0 + \sum_{i=0}^{t-1} \Psi_I^i\, \tilde{\xi} , \qquad t \ge 0 . \qquad (3.16)

According to the matrix convergence theorem (e.g., [124, pp. 327–329]), a necessary and sufficient condition for the convergence of (3.16) as $t \to \infty$ is $\rho( \Psi_I ) < 1$. Under this condition, $\Psi_I^t$ converges to zero as t goes to infinity and $\sum_{i=0}^{t-1} \Psi_I^i$ converges to $(I - \Psi_I)^{-1}$, where the matrix $(I - \Psi_I)$ is then guaranteed to be invertible. Therefore, if $\rho( \Psi_I ) < 1$, equation (3.16) yields

\lim_{t \to \infty} \tilde{h}_t = \left( I - \Psi_I \right)^{-1} \tilde{\xi} . \qquad (3.17)
By definition, the first m elements of h˜t constitute the vector ht, while the first m elements of ξ˜ constitute the vector ξ, and the remaining elements of ξ˜ are zeros. Conse- quently,
\lim_{t \to \infty} h_t = \Lambda_I\, \xi \qquad (3.18)

and using (3.5) we have

\lim_{t \to \infty} E\left( \varepsilon_t^2 \right) = \pi' \Lambda_I\, \xi . \qquad (3.19)

Otherwise, if $\rho( \Psi_I ) \ge 1$, the expected variance goes to infinity as the time index grows.
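The stationarity test of Theorem 3.1 can be checked numerically. The sketch below is restricted to $r = \max(p, q) = 1$, where (3.14) reduces to $\Psi_I = \mathcal{K}^{(1)}$; the helper name is illustrative:

```python
import numpy as np

def msg1_stationary_variance(A, xi, alpha, beta):
    """Check Theorem 3.1 for an MSG-I model with r = max(p, q) = 1, where the
    representative matrix (3.14) reduces to Psi_I = K^{(1)}.
    A: m x m transition matrix (rows sum to one); xi, alpha, beta: m-vectors."""
    m = len(xi)
    # stationary probabilities pi: left eigenvector of A for the unit eigenvalue
    w, v = np.linalg.eig(A.T)
    pi = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    pi = pi / pi.sum()
    # K^{(1)} of (3.12): K[s, s~] = (alpha_s + beta_s) {A}_{s~, s} pi_{s~} / pi_s
    K = (alpha + beta)[:, None] * A.T * pi[None, :] / pi[:, None]
    rho = max(np.abs(np.linalg.eigvals(K)))
    if rho >= 1.0:                          # not asymptotically stationary
        return rho, np.inf
    h = np.linalg.solve(np.eye(m) - K, xi)  # lim h_t = Lambda_I xi, as in (3.18)
    return rho, float(pi @ h)               # pi' Lambda_I xi, as in (3.19)
```

With a single regime (m = 1) the routine collapses to the familiar GARCH(1, 1) variance $\xi / (1 - \alpha - \beta)$, which is a convenient consistency check.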
3.2.2 MSG-II model
Another variant of Markov-switching GARCH model has recently been proposed by Haas, Mittnik and Paolella [8]. This model assumes that a Markov chain controls the ARCH parameters at each regime (i.e., ξs and αi,s), while the autoregressive behavior in each regime is subject to the assumption that past conditional variances are in the same regime as that of the current conditional variance. Specifically, the vector of conditional variances
$\sigma_t^2 \triangleq \left[ \sigma_{t,1}^2, \sigma_{t,2}^2, \ldots, \sigma_{t,m}^2 \right]'$ is given by
²Note that for a matrix with nonnegative elements, there exists a real eigenvalue which is equal to the spectral radius [123, p. 288].
\sigma_t^2 = \xi + \sum_{i=1}^{q} \alpha_i\, \varepsilon_{t-i}^2 + \sum_{j=1}^{p} B^{(j)} \sigma_{t-j}^2 , \qquad (3.20)
where $\alpha_i \triangleq [\alpha_{i,1}, \ldots, \alpha_{i,m}]'$, $i = 1, \ldots, q$, and $\beta_j \triangleq [\beta_{j,1}, \ldots, \beta_{j,m}]'$, $j = 1, \ldots, p$, are vectors of state-dependent GARCH parameters, and $B^{(j)} \triangleq \mathrm{diag}\{\beta_j\}$ is a diagonal matrix with the elements of $\beta_j$ on its diagonal. The same constraints that are sufficient to ensure a positive conditional variance in the MSG-I model (3.3) are also applied here to guarantee the positivity of the conditional variance.
Note that the conditional variance at a specific regime depends on previous conditional variances of the same regime through the diagonal matrices $B^{(j)}$. Consequently, this model allows derivation of the conditional variance at a given time from past observations only. Furthermore, the MSG-II model is analytically more tractable than the MSG-I model [8], and its conditional variance can be straightforwardly constructed, since the conditional variance at a specific time does not depend on previous state probabilities but only on previous observations and previous conditional variances.
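Because (3.20) involves only diagonal matrices $B^{(j)}$, the MSG-II variance vector can be filtered recursively from the observations alone. A minimal sketch for p = q = 1 (illustrative function name):

```python
import numpy as np

def msg2_variance_filter(eps, xi, alpha1, beta1, sigma2_init):
    """Recursion (3.20) for p = q = 1: since B^{(1)} = diag(beta_1), every
    regime's conditional variance evolves elementwise from the observations,
    sigma2_t = xi + alpha_1 * eps_{t-1}^2 + beta_1 * sigma2_{t-1}."""
    sigma2 = np.asarray(sigma2_init, dtype=float)
    path = [sigma2]
    for e in eps:                # eps holds the past observations eps_0, eps_1, ...
        sigma2 = xi + alpha1 * e ** 2 + beta1 * sigma2
        path.append(sigma2)
    return np.array(path)        # path[t] is the m-vector sigma_t^2 at time t
```

Each row of the returned array is the full vector $\sigma_t^2$, so the active regime's variance can be read off once the regime probabilities are available.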
Let $\alpha_i$ be an m-by-1 vector of zeros for $i > q$ and let $B^{(j)} = 0_m$ for $j > p$. Let $\Omega^{(i)}$ denote an $m^2$-by-$m^2$ block matrix of basic dimension m-by-m,
\Omega^{(i)} \triangleq \begin{bmatrix} \Omega_{11}^{(i)} & \Omega_{21}^{(i)} & \cdots & \Omega_{m1}^{(i)} \\ \Omega_{12}^{(i)} & \Omega_{22}^{(i)} & \cdots & \Omega_{m2}^{(i)} \\ \vdots & \vdots & & \vdots \\ \Omega_{1m}^{(i)} & \Omega_{2m}^{(i)} & \cdots & \Omega_{mm}^{(i)} \end{bmatrix} , \qquad (3.21)

with each block given by
\Omega_{s\tilde{s}}^{(i)} \triangleq p\left( S_{t-i} = s \mid S_t = \tilde{s} \right) \left( \alpha_i\, e_s' + B^{(i)} \right) , \qquad s, \tilde{s} = 1, \ldots, m , \qquad (3.22)

where $e_s$ is an m-by-1 vector of all zeros, except its s-th element which is one. We define an $rm^2$-by-$rm^2$ matrix by
\Psi_{II} \triangleq \begin{bmatrix} \Omega^{(1)} & \Omega^{(2)} & \cdots & & \Omega^{(r)} \\ I_{m^2} & 0_{m^2} & \cdots & & 0_{m^2} \\ 0_{m^2} & I_{m^2} & & & 0_{m^2} \\ \vdots & & \ddots & & \vdots \\ 0_{m^2} & \cdots & 0_{m^2} & I_{m^2} & 0_{m^2} \end{bmatrix} . \qquad (3.23)

Let $\Lambda_{II}$ be an $m^2$-by-$m^2$ matrix which is built from the $rm^2$-by-$rm^2$ matrix $(I - \Psi_{II})^{-1}$ such that $\{ \Lambda_{II} \}_{ij} = \left\{ (I - \Psi_{II})^{-1} \right\}_{ij}$, $i, j = 1, \ldots, m^2$. Let $\bar{\pi} \triangleq \left[ \pi_1 e_1', \pi_2 e_2', \ldots, \pi_m e_m' \right]'$; then we get the following theorem for the stationarity condition of an MSG-II process:
Theorem 3.2. An MSG-II process as defined by (3.1) and (3.20) is asymptotically wide-sense stationary with variance $\lim_{t \to \infty} E( \varepsilon_t^2 ) = \bar{\pi}' \Lambda_{II}\, \xi$, if and only if $\rho( \Psi_{II} ) < 1$.
The proof is given in Appendix 3.A.
3.2.3 Comparison of stationarity conditions
It has been pointed out in [8] that stationarity of the MSG-II model with p = q = 1 requires the regression parameters to satisfy $\beta_{1,s} < 1$ for all s. It follows from (3.20) that for a general order (p, q), it is necessary that $\sum_{i=1}^{p} \beta_{i,s} < 1$ for all s. However, the reaction parameters
$\alpha_{i,s}$ may become rather large, in correspondence with the regime probabilities. For the MSG-I model, the reaction parameters $\alpha_{i,s}$ as well as the regression parameters $\beta_{i,s}$ may be larger than one, provided that the corresponding regime probabilities are sufficiently small. Furthermore, in the representative matrix $\Psi_I$ (3.14), the reaction parameters and the regression parameters are weighted by the same weights $p( S_{t-i} = s \mid S_t = \tilde{s} )$. Consequently, for a given state s, the values of $\alpha_{i,s}$ and $\beta_{i,s}$ in the MSG-I model have the same contribution to the model stationarity³, but for the MSG-II model, each of them affects the heteroscedasticity evolution differently. Figure 3.1 illustrates the stationarity regions for the MSG-I model (solid line) and the MSG-II model (dashed-dotted line), in the case of two-state Markov chains and GARCH of order (1, 1). In (a), the regime transition probabilities are
³This also holds for the natural extension of GARCH(p, q) to Markov-switching, which has been analyzed in [122].

$a_{1,1} = 0.6$ and $a_{2,2} = 0.7$, and the reaction parameters are $\alpha_{1,1} = 0.4$ and $\alpha_{1,2} = 0.5$. The stationarity region is the interior intersection of each curve and the two axes. In (b), $a_{1,1} = 0.2$ and $a_{2,2} = 0.3$ are considered, with reaction parameters $\alpha_{1,1} = 0.8$ and $\alpha_{1,2} = 0.2$. For the MSG-I model, stationarity is allowed with regression parameters larger than one, while for the MSG-II model, $\beta_{1,1}$ and $\beta_{1,2}$ must both be smaller than one for stationarity. In both cases $\pi_2 > \pi_1$; however, in (a) the stationarity region of the MSG-II model is contained in the stationarity region of the MSG-I model, while in (b), in which case $\alpha_{1,1} \gg \alpha_{1,2}$, for $\beta_{1,1} \in [0.2, 0.55]$ stationarity is achieved with a larger $\beta_{1,2}$ for the MSG-II model than for the MSG-I model.
3.3 Relation to other works
Klaassen [7] developed conditions which are necessary, but not necessarily sufficient, for asymptotic stationarity of a two-state MSG-I model of order (1, 1). Consider the 2-by-2 matrix C with elements
c_{ij} = a_{ji} \left( \alpha_{1,i} + \beta_{1,i} \right) \pi_j / \pi_i , \qquad i, j = 1, 2 . \qquad (3.24)
Klaassen showed that the stationary variance of the process is given by
\sigma^2 = \pi' \left( I - C \right)^{-1} \xi , \qquad (3.25)

and that the conditions
c_{11}, c_{22} < 1 \quad \text{and} \quad \det( I - C ) > 0 \qquad (3.26)

are necessary to ensure that the stationary variance is finite and positive. In the special case of our analysis with GARCH orders (1, 1) and an MSG-I model with two states, the representative matrix $\Psi_I$ reduces to the matrix C, and the stationary variance reduces to the expression given in (3.25). Metzler showed [125] that for a nonnegative matrix C (i.e., $c_{ij} \ge 0$), $\rho(C) < 1$ if and only if all of the principal minors of $(I - C)$ are positive. Furthermore, together with the Hawkins–Simon condition [126], $\rho(C)$ is less than one if and only if $(I - C)^{-1}$ has no negative elements. Therefore, for the nonnegative matrix C, the conditions $c_{11}, c_{22} < 1$ and $\det(I - C) > 0$ are together equivalent to $\rho(C) < 1$. Accordingly, the conditions of Klaassen are not only necessary but also sufficient for asymptotic stationarity.
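This equivalence can be corroborated numerically: for randomly drawn nonnegative 2-by-2 matrices playing the role of C from (3.24), Klaassen's conditions (3.26) agree with the spectral-radius test, and $(I - C)^{-1}$ is elementwise nonnegative whenever $\rho(C) < 1$. A minimal sketch:

```python
import numpy as np

# Draw random nonnegative 2x2 matrices in the role of C from (3.24) and check:
# Klaassen's conditions (3.26)  <=>  rho(C) < 1  (Metzler / Hawkins-Simon),
# and (I - C)^{-1} has no negative elements whenever rho(C) < 1.
rng = np.random.default_rng(0)
for _ in range(2000):
    C = rng.uniform(0.0, 1.5, size=(2, 2))
    rho = max(np.abs(np.linalg.eigvals(C)))
    klaassen = C[0, 0] < 1 and C[1, 1] < 1 and np.linalg.det(np.eye(2) - C) > 0
    assert klaassen == (rho < 1)
    if rho < 1:
        assert np.all(np.linalg.inv(np.eye(2) - C) >= -1e-12)
```

Note that the diagonal conditions alone do not suffice: a nonnegative matrix with large off-diagonal entries can have $c_{11}, c_{22} < 1$ yet $\det(I - C) < 0$, which is why (3.26) includes the determinant constraint.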
Figure 3.1: Stationarity regions for two-state Markov chains with GARCH of order (1, 1) corresponding to MSG-I (solid line) and MSG-II (dashed-dotted line). The regime transition probabilities and the reaction parameters are (a) $a_{1,1} = 0.6$, $a_{2,2} = 0.7$ and $\alpha_{1,1} = 0.4$, $\alpha_{1,2} = 0.5$; (b) $a_{1,1} = 0.2$, $a_{2,2} = 0.3$ and $\alpha_{1,1} = 0.8$, $\alpha_{1,2} = 0.2$.
A necessary and sufficient condition for asymptotic stationarity of an MSG-II model of order (1, 1) has been developed by Haas et al. [8]. Accordingly, the largest eigenvalue in modulus of an m2-by-m2 block matrix D is constrained to be less than one, where
D = \begin{bmatrix} D_{11} & D_{21} & \cdots & D_{m1} \\ D_{12} & D_{22} & \cdots & D_{m2} \\ \vdots & \vdots & & \vdots \\ D_{1m} & D_{2m} & \cdots & D_{mm} \end{bmatrix} \qquad (3.27)

is built from matrices $D_{ij}$ of size m-by-m which are obtained by
D_{ij} = a_{ij} \left( B^{(1)} + \alpha_1 e_j' \right) . \qquad (3.28)

The stationarity analysis in [8] for an MSG-II process employs a forward recursive calculation of the expected conditional variance, assuming some initial conditions. As a result, the probabilities of state transitions, $a_{ij}$, are used for evaluating the expectation of the one-step-ahead conditional variance. Our analysis manipulates a backward recursion of the conditional variance expectation, and thus it uses the stationary probabilities of the Markov chain, along with the transition probabilities, to generate previous conditional state probabilities $p( s_{t-i} \mid s_t )$. Therefore, when we degenerate the MSG-II model to order p = q = 1 (which is the case analyzed in [8]), the block matrices D (3.27) and $\Psi_{II}$ (3.23) are not identical; specifically, for that model order we have $\Psi_{II} = \Omega^{(1)}$ and
\Omega_{ij}^{(1)} = \frac{ \pi_i\, a_{ij} }{ \pi_j\, a_{ji} }\, D_{ji} . \qquad (3.29)
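The spectral equivalence proved in Appendix 3.B can be verified numerically by building $D = \tilde{D}(A' \otimes I_m)$ and the similarity transform of $\Omega^{(1)}$ for a random 3-state chain; a sketch with illustrative parameter values:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 3
A = rng.uniform(0.1, 1.0, (m, m))
A /= A.sum(axis=1, keepdims=True)              # transition matrix, rows sum to one
w, v = np.linalg.eig(A.T)                      # stationary probabilities pi
pi = np.real(v[:, np.argmin(np.abs(w - 1.0))])
pi /= pi.sum()
B1 = np.diag(rng.uniform(0.0, 0.5, m))         # B^{(1)} = diag(beta_1)
alpha1 = rng.uniform(0.0, 0.5, m)

# D~: block-diagonal with blocks B^{(1)} + alpha_1 e_s' (Appendix 3.B)
Dt = np.zeros((m * m, m * m))
for s in range(m):
    Dt[s*m:(s+1)*m, s*m:(s+1)*m] = B1 + np.outer(alpha1, np.eye(m)[s])

X = np.kron(A.T, np.eye(m))                    # A' (Kronecker) I_m
D = Dt @ X                                     # Haas et al. matrix: D = D~ (A' x I_m)
P = np.diag(np.kron(pi, np.ones(m)))           # P = diag(pi (Kronecker) 1_m)
Omega1 = np.linalg.inv(P) @ X @ Dt @ P         # Omega^{(1)} via the similarity transform

# identical characteristic polynomials => identical spectra
assert np.allclose(np.poly(D), np.poly(Omega1))
```

Since the two matrices are similar, the spectral-radius conditions built from either one accept and reject exactly the same parameter sets.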
Although our representative matrix $\Psi_{II}$ and that developed in [8] do not share the same elements, we show in Appendix 3.B that their eigenvalues are identical, and therefore both conditions are equivalent for that order of the MSG-II model. A special case of either of the MSG models is the degenerate case of a single regime of order (p, q), for which the models reduce to a standard GARCH(p, q) model. In that case, the representative matrices are equal, $\Psi_I = \Psi_{II}$. Francq, Roussignol and Zakoïan [122] developed a stationarity condition for the natural case of a Markov-switching GARCH model, in which the conditional variance depends on the active regime path. For the special case of a single-regime model they obtained the transition matrix of a standard GARCH(p, q) model, which is equal to that derived, e.g., by substituting $\mathcal{K}^{(i)} = \alpha_{i,1} + \beta_{i,1}$ and $I_1 = 1$ in (3.14). They showed that having the spectral radius of that matrix less than one is equivalent to Bollerslev's condition for the asymptotic wide-sense stationarity of a GARCH(p, q) model, $\sum_{i=1}^{r} ( \alpha_{i,1} + \beta_{i,1} ) < 1$ [1].

3.4 Conclusions
Conditions for asymptotic wide-sense stationarity of random processes with time-variant distributions are useful for ensuring the existence of a finite asymptotic volatility of the process. We developed a comprehensive approach for stationarity analysis of Markov- switching GARCH processes where finite state space Markov chains control the switching between regimes, and GARCH models of order (p, q) are active in each regime. Necessary and sufficient conditions for the asymptotic stationarity are obtained by constraining the spectral radius of representative matrices, which are built from the model parameters. These matrices also enable derivation of compact expressions for the stationary variance of the processes.
3.A Proof of Theorem 3.2
In this appendix we prove Theorem 3.2, which gives a necessary and sufficient condition for the asymptotic wide-sense stationarity of the MSG-II model, and also its stationary variance. Following (3.20) and (3.5), the expectation of the MSG-II conditional variance under a chain state s follows:
E_{t-1}\left( \sigma_{t,s}^2 \mid s_t \right) = \xi_s + \sum_{i=1}^{q} \alpha_{i,s}\, E_{t-1}\left( \varepsilon_{t-i}^2 \mid s_t \right) + \sum_{j=1}^{p} \beta_{j,s}\, E_{t-1}\left( \sigma_{t-j,s}^2 \mid s_t \right) , \qquad (3.30)

where, using (3.7) and (3.8),
E_{t-1}\left( \varepsilon_{t-i}^2 \mid s_t \right) = \sum_{s_{t-i}=1}^{m} p\left( s_{t-i} \mid s_t \right) E_{t-i-1}\left( \sigma_{t-i,s_{t-i}}^2 \mid s_{t-i} \right) , \qquad (3.31)

and
E_{t-1}\left( \sigma_{t-j,s}^2 \mid s_t \right) = E_{S_{t-j}}\left[ E_{t-1}\left( \sigma_{t-j,s}^2 \mid s_{t-j}, s_t \right) \right] = \sum_{s_{t-j}=1}^{m} p\left( s_{t-j} \mid s_t \right) E_{t-j-1}\left( \sigma_{t-j,s}^2 \mid s_{t-j} \right) . \qquad (3.32)
The main difference between the MSG-II and MSG-I models is that the conditional variance depends on previous conditional variances of the same regime, regardless of the past regime path. By contrast, for the MSG-I model, the conditional variance is a linear combination of past state-dependent conditional variances, where for each one the state is conditioned to be the active one. Consequently, the computation of the unconditional variance for an MSG-II model requires the terms $E_{t-j-1}( \sigma_{t-j,s}^2 \mid s_{t-j} )$ for all $s = 1, \ldots, m$, while in the case of the MSG-I model, only $E_{t-j-1}( \sigma_{t-j,s_{t-j}}^2 \mid s_{t-j} )$ is relevant for calculating the expectation of the unconditional variance. Accordingly, an $m^2$-by-1 vector is necessary to represent the elements $E_{t-1}( \sigma_{t,s}^2 \mid s_t )$, and an $rm^2$-by-$rm^2$ matrix is employed for the recursive formulation. By substituting (3.32) and (3.31) into (3.30) we have
E_{t-1}\left( \sigma_{t,s}^2 \mid s_t \right) = \xi_s + \sum_{i=1}^{r} \sum_{s_{t-i}=1}^{m} p\left( s_{t-i} \mid s_t \right) \left[ \alpha_{i,s}\, E_{t-i-1}\left( \sigma_{t-i,s_{t-i}}^2 \mid s_{t-i} \right) + \beta_{i,s}\, E_{t-i-1}\left( \sigma_{t-i,s}^2 \mid s_{t-i} \right) \right] . \qquad (3.33)

Let $g_t(s, s_t) \triangleq E_{t-1}( \sigma_{t,s}^2 \mid s_t )$, and let $g_t \triangleq [ g_t(1,1), g_t(2,1), \ldots, g_t(m,1), g_t(1,2), \ldots, g_t(m,m) ]'$ be a vector of expected, state-dependent, conditional variances. Then, a recursive formulation of the conditional variance is given by
$$\tilde{\mathbf{g}}_t=\tilde{\boldsymbol{\xi}}+\Psi_{\mathrm{II}}\,\tilde{\mathbf{g}}_{t-1}\,,\qquad t\geq 0\,,\qquad(3.34)$$
where $\tilde{\mathbf{g}}_t\triangleq\left[\mathbf{g}_t',\mathbf{g}_{t-1}',\ldots,\mathbf{g}_{t-r+1}'\right]'$. The completion of this proof follows the proof of Theorem 3.1.
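To see the recursion (3.33)–(3.34) in action, the following sketch iterates the scalar form of the MSG-II variance recursion for a hypothetical 2-state model of order (1, 1) and checks that it converges to a fixed point, as guaranteed when $\rho(\Psi_{\mathrm{II}})<1$. All numeric parameter values are invented for illustration; the backward probabilities $p(s_{t-1}\,|\,s_t)$ are computed from the stationary chain via Bayes' rule.

```python
# Iterating the MSG-II variance recursion (3.33)/(3.34) for a 2-state
# model of order (1, 1). All numeric parameter values are hypothetical,
# chosen only so that the stationarity condition holds.

# Transition matrix: A[i][j] = p(S_t = j | S_{t-1} = i)
A = [[0.9, 0.1],
     [0.2, 0.8]]
# Stationary probabilities of a 2-state chain (closed form)
pi = [A[1][0] / (A[0][1] + A[1][0]), A[0][1] / (A[0][1] + A[1][0])]

xi = [0.5, 1.0]        # xi_s
alpha = [0.2, 0.1]     # alpha_{1,s}
beta = [0.3, 0.5]      # beta_{1,s}

def step(g):
    """One application of (3.33): g[s][st] = E_{t-1}(sigma^2_{t,s} | s_t)."""
    new = [[0.0, 0.0], [0.0, 0.0]]
    for s in range(2):
        for st in range(2):
            acc = xi[s]
            for sp in range(2):                       # sp plays s_{t-1}
                p_back = A[sp][st] * pi[sp] / pi[st]  # p(s_{t-1} | s_t)
                # alpha weighs the realized-regime variance g[sp][sp];
                # beta weighs the same-regime variance g[s][sp] (MSG-II)
                acc += p_back * (alpha[s] * g[sp][sp] + beta[s] * g[s][sp])
            new[s][st] = acc
    return new

g = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(200):
    g = step(g)
# If rho(Psi_II) < 1 the iteration converges to the stationary
# state-dependent variances; a further step should leave g unchanged.
g_next = step(g)
drift = max(abs(g_next[s][st] - g[s][st]) for s in range(2) for st in range(2))
print(drift < 1e-9)  # → True
```

Since $\alpha_{1,s}+\beta_{1,s}\le 0.6$ in both regimes here, the recursion is a contraction in the max-norm and the fixed point equals $(I-\Psi_{\mathrm{II}})^{-1}\tilde{\boldsymbol{\xi}}$.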
3.B Equivalence with Haas condition
In this appendix we show that the eigenvalues of the matrices $D$ (3.27) and $\Psi_{\mathrm{II}}=\Omega^{(1)}$ (3.23) are equal for the case of an $m$-state MSG-II model of order $(1,1)$. Let $\tilde{D}$ denote an $m^2$-by-$m^2$ matrix that is given by
$$\tilde{D}\triangleq\begin{bmatrix}B^{(1)}+\boldsymbol{\alpha}_1\mathbf{e}_1' & \mathbf{0}_m & \cdots & \mathbf{0}_m\\ \mathbf{0}_m & B^{(1)}+\boldsymbol{\alpha}_1\mathbf{e}_2' & \ddots & \vdots\\ \vdots & \ddots & \ddots & \mathbf{0}_m\\ \mathbf{0}_m & \cdots & \mathbf{0}_m & B^{(1)}+\boldsymbol{\alpha}_1\mathbf{e}_m'\end{bmatrix},\qquad(3.35)$$
and let $\otimes$ denote the Kronecker product. Then $D=\tilde{D}\left(A'\otimes I_m\right)$. Let $\mathbf{1}_m$ denote an $m$-by-$1$ vector of ones and let $P\triangleq\mathrm{diag}\left(\boldsymbol{\pi}\otimes\mathbf{1}_m\right)$. By substituting (3.28) into (3.29), we have
$$\Omega^{(1)}_{ij}=a_{ij}\,\frac{\pi_i}{\pi_j}\left(B^{(1)}+\boldsymbol{\alpha}_1\mathbf{e}_i'\right)\qquad(3.36)$$
and
$$\Omega^{(1)}=P^{-1}\left(A'\otimes I_m\right)\tilde{D}\,P\,.\qquad(3.37)$$
Therefore, $\Omega^{(1)}$ and $\left(A'\otimes I_m\right)\tilde{D}$ are similar matrices, and the spectrum of $\Omega^{(1)}$, $\mathrm{eig}\{\Omega^{(1)}\}$, satisfies
$$\mathrm{eig}\left\{\Omega^{(1)}\right\}=\mathrm{eig}\left\{\left(A'\otimes I_m\right)\tilde{D}\right\}=\mathrm{eig}\left\{\tilde{D}\left(A'\otimes I_m\right)\right\}=\mathrm{eig}\left\{D\right\}.\qquad(3.38)$$
Chapter 4
Markov-Switching GARCH Process in the Short-Time Fourier Transform Domain¹
In this chapter, we introduce a Markov-switching generalized autoregressive conditional heteroscedasticity (GARCH) model for nonstationary processes with time-varying volatility structure in the short-time Fourier transform (STFT) domain. The expansion coefficients in the STFT domain are modeled as a multivariate complex GARCH process with Markov-switching regimes. The GARCH formulation parameterizes the correlation between sequential conditional variances, while the Markov chain allows the process to switch between regimes of different GARCH formulations. We obtain a necessary and sufficient condition for the asymptotic wide-sense stationarity of the model, and develop a recursive algorithm for signal restoration in a noisy environment. The conditional variance is estimated by iterating propagation and update steps with regime conditional probabilities, while the model parameters are evaluated a priori from a training data set. Experimental results demonstrate the performance of the proposed algorithm.
In Appendix 4.A, we introduce an application of the Markov-switching GARCH model in the STFT domain to speech enhancement. A GARCH model is utilized with Markov-switching regimes, where the parameters are assumed to be frequency variant. The model parameters are evaluated in each frequency subband, and a special state (regime) is defined for the case where speech coefficients are absent or below a threshold level. The problem of speech enhancement under speech presence uncertainty is addressed, and it is shown that a soft voice activity detector may be inherently incorporated within the algorithm. Experimental results demonstrate the potential of the proposed model to improve noise reduction while retaining weak components of the speech signal.
¹This chapter is based on [127, 128].
4.1 Introduction
The generalized autoregressive conditional heteroscedasticity (GARCH) model is widely used in the field of econometrics for deriving volatility forecasts of economic rates. This model, first introduced by Bollerslev [1] as a generalization of the ARCH model [2], explicitly parameterizes the time-varying volatility by using both recent conditional variances and recent squared innovations. GARCH models preserve the persistence of the process volatility in the sense that small variations tend to follow small variations and large variations tend to follow large variations. Incorporating GARCH models with hidden Markov chains, where each state (regime) of the chain implies a different GARCH behavior, extends the dynamic formulation of the model and enables a better fit for a process with a more complex time-varying volatility structure [7–9]. However, a major drawback of such models is that estimating the volatility with switching regimes requires knowledge of the entire history of the process, including the regime path. Consequently, Cai [12] and Hamilton and Susmel [13] proposed Markov-switching ARCH models, which avoid the problems of path dependency in a noiseless environment. The conditional variance in ARCH models depends on previous observations only, so the Markov chain need not be known for constructing the conditional variance under a given regime. Gray [6] introduced a variant of the Markov-switching GARCH model relying on the assumption that the conditional variance given the current regime depends on the expectation of the previous conditional variances rather than on their actual values. Accordingly, the conditional variance depends on a finite set of state-dependent expected conditional variances via their conditional state probabilities. Klaassen [7] proposed modifying Gray's model by conditioning on the current regime and all available observations while evaluating the expectation of previous conditional variances.
A different method for reducing the dependency of the conditional variance on past regimes has recently been proposed by Haas, Mittnik and Paolella [8]. Accordingly, a Markov chain governs the ARCH parameters, while the autoregressive behavior of the conditional variance is subject to the assumption that past conditional variances are in the same regime as the current conditional variance. Gray, Klaassen and Haas et al. developed their variants of Markov-switching GARCH models for improved volatility forecasts of financial time-series under the possible existence of shocks. They assumed that the process is observed in a noiseless environment, so that its past observations provide a complete specification of its current conditional variance, for any given regime.
Recently, GARCH models have been employed for modeling speech signals in the time-frequency domain [23–25]. Speech signals in the short-time Fourier transform (STFT) domain demonstrate both "variability clustering" and heavy-tail behavior, similarly to financial time-series [25]. Motivated by these characteristics, it was proposed to model the conditional variance of speech signals in the STFT domain by a complex, K-dimensional GARCH model with statistically independent elements (given past information) sharing the same GARCH specification. This time-frequency GARCH (TF-GARCH) model has been shown useful for speech enhancement applications, but it relies on the assumption that the model parameters are time-invariant. In [26], a GARCH model has been utilized in the time domain for speech recognition applications. The model parameters, characterizing the speech phonemes, are assumed speaker independent and time-varying. It was shown that by estimating the GARCH specifications for each speech segment and using the parameters as part of the signal characteristics, speech recognition performance can be improved.
In this chapter, we introduce a Markov-switching time-frequency GARCH (MSTF-GARCH) model which exploits the advantages of both the conditional heteroscedasticity structure of GARCH models and the time-varying characteristics of hidden Markov chains. Modeling probability density functions of speech signals by utilizing hidden Markov models has been found useful in speech recognition applications [63–65], and modeling the speech spectral coefficients as hidden Markov processes with a probability density prototype in each frame was applied to the problem of speech enhancement [35, 62]. Here we model the expansion coefficients of nonstationary random signals in the time-frequency domain as multivariate complex GARCH processes with Markov-switching regimes, and obtain a necessary and sufficient condition for the asymptotic wide-sense stationarity of the model. A corresponding recursive algorithm is developed for signal restoration in a noisy environment. The conditional variance is estimated by iterating propagation and update steps with regime conditional probabilities. The model parameters are estimated from a training data set prior to the signal restoration using a maximum-likelihood (ML) approach, and the number of states is assumed to be known. We show that the derivation in [117] of bounds on the mean-square error (MSE) of a composite source signal estimation is applicable for obtaining an upper bound on the MSE of a single-step MSTF-GARCH estimation. Experimental results demonstrate the improved performance of the proposed algorithm for restoration of an MSTF-GARCH process, compared to using an estimator which assumes a stationary process, and compared to using an estimator which assumes a smaller number of regimes than the process actually has.
Furthermore, it is demonstrated that the squared absolute values of speech coefficients in the STFT domain are better evaluated by using the MSTF-GARCH model than by using the decision-directed approach.
This chapter is organized as follows. In Section 4.2, we introduce the Markov-switching time-frequency GARCH model and obtain a necessary and sufficient condition for its asymptotic wide-sense stationarity. In Section 4.3, we address the problem of signal estimation from noisy observations. In Section 4.4, we derive an upper bound on the mean-square error of a single estimation step. In Section 4.5, we address the problem of model estimation. Finally, in Section 4.6, we provide experimental results which demonstrate the restoration of an MSTF-GARCH process from noisy observations, and the estimation of conditional variances and squared absolute values in the STFT domain from noisy speech signals.
4.2 Markov-switching time-frequency GARCH model
In this section we briefly review the TF-GARCH model [25], and introduce a new time-frequency GARCH model with Markov-switching regimes, which allows further flexibility in formulating the time variation of the conditional variance.
4.2.1 Time-frequency GARCH model
Let $\left\{X_{tk}\,|\,t=0,\ldots,T-1,\;k=0,\ldots,K-1\right\}$ be the coefficients of a time-frequency transformation of a discrete-time signal $x$ (e.g., STFT coefficients), where $t$ is the time-frame index and $k$ is the frequency-bin index. Let $\mathbf{X}_t\triangleq\left[X_{t,0},\ldots,X_{t,K-1}\right]'$ be the vector of spectral coefficients at time frame $t$, let $\mathcal{X}^\tau=\mathcal{X}_0^\tau\triangleq\left\{\mathbf{X}_t\,|\,t=0,\ldots,\tau\right\}$ represent the set of spectral coefficients up to time $\tau$, and let $\lambda_{tk|\tau}\triangleq E\left\{|X_{tk}|^2\,|\,\mathcal{X}^\tau\right\}$ denote the conditional variance of the spectral coefficient at time-frequency bin $(t,k)$, given the clean spectral coefficients up to time $\tau$. Let $\{\mathbf{V}_t\}\in\mathbb{C}^K$ be a complex Gaussian random process with $\mathbf{V}_t\sim\mathcal{CN}(0,I_K)$, where $I_K$ is a $K$-by-$K$ identity matrix. A $K$-dimensional time-frequency GARCH model of order $(p,q)$ is defined as follows [25]:
$$X_{tk}=\sqrt{\lambda_{tk|t-1}}\,V_{tk}\,,\qquad k=0,\ldots,K-1\,,\qquad(4.1)$$
$$\boldsymbol{\lambda}_{t|t-1}=\zeta\cdot\mathbf{1}+\sum_{i=1}^{q}\alpha_i\,\mathbf{X}_{t-i}\odot\mathbf{X}_{t-i}^{*}+\sum_{j=1}^{p}\beta_j\,\boldsymbol{\lambda}_{t-j|t-j-1}\,,\qquad(4.2)$$
where $\mathbf{1}$ denotes a vector of ones, $\odot$ denotes a term-by-term multiplication, and $^{*}$ denotes complex conjugation. The conditional variance vector, $\boldsymbol{\lambda}_{t|t-1}=E\left\{\mathbf{X}_t\odot\mathbf{X}_t^{*}\,|\,\mathcal{X}^{t-1}\right\}$, referred to as the one-frame-ahead conditional variance [25], is a linear function of the coefficients' past squared values and conditional variances, where
$$\zeta>0\,,\quad \alpha_i\geq 0\,,\; i=1,\ldots,q\,,\quad \beta_j\geq 0\,,\; j=1,\ldots,p\,,\qquad(4.3)$$
are sufficient constraints for the positivity of the conditional variance [1]. The time-frequency GARCH model has been introduced in [23] for modeling speech signals in the STFT domain, but the parameters of the GARCH model are assumed time-invariant. Extending this model such that the model parameters may vary with time introduces additional flexibility in the model formulation, which may result in better characterization of speech signals and improved restoration in noisy environments.
4.2.2 MSTF-GARCH formulation
Let $S_t$ denote the (unobserved) state at time $t$ and let $s_t$ be a realization of $S_t$, assuming $S_t$ is a first-order Markov chain. Let $\mathcal{I}^t\triangleq\left\{\mathcal{X}^t,\mathcal{S}^t\right\}$ denote all available information up to time $t$, which contains the clean signal coefficients and the regime path up to time $t$, $\mathcal{S}^t\triangleq\{s_0,\ldots,s_t\}$. Denote by $\lambda_{tk|t-1,s_t}\triangleq E\left\{|X_{tk}|^2\,|\,\mathcal{I}^{t-1},s_t\right\}$ the one-frame-ahead conditional variance of the spectral coefficient $X_{tk}$ given the information up to time $t-1$ and the chain state $s_t$. We assume that the spectral coefficients $X_{tk}$ are generated by an $m$-state Markov-switching time-frequency GARCH process of order $(p,q)$, denoted by $X_{tk}\sim$ MSTF-GARCH$(p,q)$, which follows:
$$X_{tk}=\sqrt{\lambda_{tk|t-1,s_t}}\,V_{tk}\,,\qquad k=0,\ldots,K-1\,,\qquad(4.4)$$
and the one-frame-ahead conditional variance evolves as follows:
$$\boldsymbol{\lambda}_{t|t-1,s_t}=\zeta_{s_t}\mathbf{1}+\sum_{i=1}^{q}\alpha_{i,s_t}\,\mathbf{X}_{t-i}\odot\mathbf{X}_{t-i}^{*}+\sum_{j=1}^{p}\beta_{j,s_t}\,\boldsymbol{\lambda}_{t-j|t-j-1,s_{t-j}}\,,\qquad(4.5)$$
where
$$\zeta_s>0\,,\quad \alpha_{i,s}\geq 0\,,\quad \beta_{j,s}\geq 0\,,\qquad i=1,\ldots,q\,,\; j=1,\ldots,p\,,\; s=1,\ldots,m\,,\qquad(4.6)$$
are sufficient constraints for the positivity of the one-frame-ahead conditional variance. It follows from (4.4) and (4.5) that the conditional density of the coefficients depends on past values (through previous conditional variances) and also on the regime path up to the current time. As in previous works on TF-GARCH, we assume that the model parameters are frequency-invariant. This restriction can easily be relaxed to the case of frequency (or sub-band) dependent parameters, i.e., $\zeta_{k,s}$, $\alpha_{i,k,s}$ and $\beta_{i,k,s}$, but the complexity of the model estimation then grows rapidly (see Section 4.5).
GARCH models provide a rich class of possible parametrizations of conditional heteroscedasticity (i.e., time-varying volatility), and the hidden Markov chain allows these GARCH formulations to switch along time. Volatility persistence naturally arises in a single-regime GARCH model. However, the existence of a Markov chain with different GARCH parameters allows the process to switch between regimes of different volatility formulations and different levels of volatility.
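The generative equations (4.4)–(4.6) can be sketched in a few lines. The snippet below simulates a hypothetical 2-regime MSTF-GARCH(1, 1) process with invented parameter values; note how the $\beta$ term propagates the conditional variance carrying the previous frame's regime, as in (4.5).

```python
import math
import random

# Hypothetical 2-regime MSTF-GARCH(1, 1) parameters (illustration only).
random.seed(0)
A = [[0.95, 0.05], [0.10, 0.90]]       # regime transition probabilities
zeta = [0.01, 0.5]                     # zeta_s
alpha = [0.1, 0.4]                     # alpha_s
beta = [0.2, 0.5]                      # beta_s
K, T = 4, 500                          # frequency bins, time frames

s = 0
lam_prev = [zeta[s]] * K               # lambda_{t-1 | t-2, s_{t-1}}
X_prev = [0j] * K
frames = []
for t in range(T):
    # draw the next regime from row s of the transition matrix
    s = 0 if random.random() < A[s][0] else 1
    # conditional variance (4.5): depends on |X_{t-1}|^2 and on the
    # previous conditional variance (which carries its own regime s_{t-1})
    lam = [zeta[s] + alpha[s] * abs(x) ** 2 + beta[s] * l
           for x, l in zip(X_prev, lam_prev)]
    # X_tk = sqrt(lambda) * V_tk with V_tk ~ CN(0, 1), per (4.4):
    # real and imaginary parts each have variance 1/2
    X = [math.sqrt(l) * complex(random.gauss(0, math.sqrt(0.5)),
                                random.gauss(0, math.sqrt(0.5)))
         for l in lam]
    frames.append(X)
    X_prev, lam_prev = X, lam

print(len(frames), len(frames[0]))  # → 500 4
```

The constraints (4.6) guarantee that every entry of `lam` stays strictly positive throughout the simulation.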
4.2.3 Stationarity of an MSTF-GARCH process
The conditional variance of a GARCH process, and in particular of a Markov-switching GARCH process, changes recursively over time. Consequently, asymptotic wide-sense stationarity is required to ensure a finite second-order moment [7, 8, 121]. Necessary and sufficient conditions for the asymptotic stationarity of three variants of GARCH models with Markov-switching regimes have been derived in [121]. Those models generalize the models of Gray [6], Klaassen [7] and Haas et al. [8], but they all differ from our MSTF-GARCH model, which is a multivariate, complex-valued process that entails the regime path for the construction of the conditional variance from past observations. A necessary and sufficient condition for the asymptotic wide-sense stationarity of an MSTF-GARCH process has been derived in [128]. For the completeness of this chapter we briefly summarize these results:
Assuming a stationary Markov chain with stationary probabilities $\pi_s=p\,(S_t=s)$, the unconditional variance of the process can be calculated using (4.4) and (4.5):
$$E\left\{\mathbf{X}_t\odot\mathbf{X}_t^{*}\right\}=\sum_{s_t}\pi_{s_t}E\left\{\mathbf{X}_t\odot\mathbf{X}_t^{*}\,|\,s_t\right\}=\sum_{s_t}\pi_{s_t}E\left\{\boldsymbol{\lambda}_{t|t-1,s_t}\right\},\qquad(4.7)$$
where
$$E\left\{\boldsymbol{\lambda}_{t|t-1,s_t}\right\}=\zeta_{s_t}\mathbf{1}+\sum_{i=1}^{q}\alpha_{i,s_t}E\left\{\mathbf{X}_{t-i}\odot\mathbf{X}_{t-i}^{*}\,|\,s_t\right\}+\sum_{j=1}^{p}\beta_{j,s_t}E\left\{\boldsymbol{\lambda}_{t-j|t-j-1,S_{t-j}}\,|\,s_t\right\},\qquad(4.8)$$
and
$$E\left\{\boldsymbol{\lambda}_{t-i|t-i-1,S_{t-i}}\,|\,s_t\right\}=\sum_{s_{t-i}}p\left(s_{t-i}\,|\,s_t\right)E\left\{\boldsymbol{\lambda}_{t-i|t-i-1,s_{t-i}}\right\}.\qquad(4.9)$$
Note that $E\left\{\boldsymbol{\lambda}_{t|t-1,s_t}\right\}$ denotes the expected value of the conditional variance under the regime $S_t=s_t$, whereas $E\left\{\boldsymbol{\lambda}_{t|t-1,S_t}\,|\,\cdot\right\}$ denotes a conditional expectation of the conditional variance at time $t$ where the active regime at that time is unknown. Since no prior information is given, we have
$$E\left\{\mathbf{X}_{t-i}\odot\mathbf{X}_{t-i}^{*}\,|\,s_t\right\}=\sum_{s_{t-i}}p\left(s_{t-i}\,|\,s_t\right)E\left\{\boldsymbol{\lambda}_{t-i|t-i-1,s_{t-i}}\right\},\qquad(4.10)$$
and consequently we obtain [128]
$$E\left\{\boldsymbol{\lambda}_{t|t-1,s_t}\right\}=\zeta_{s_t}\mathbf{1}+\sum_{i=1}^{r}\sum_{s_{t-i}}\left(\alpha_{i,s_t}+\beta_{i,s_t}\right)\frac{\pi_{s_{t-i}}}{\pi_{s_t}}\left\{A^i\right\}_{s_{t-i},s_t}E\left\{\boldsymbol{\lambda}_{t-i|t-i-1,s_{t-i}}\right\},\qquad(4.11)$$
where $r\triangleq\max\{p,q\}$, $\alpha_{i,s_t}\triangleq 0\;\forall\, i>q$, $\beta_{i,s_t}\triangleq 0\;\forall\, i>p$, and $A$ is the transition probability matrix, i.e., $\{A\}_{ij}=a_{ij}=p\,(S_t=j\,|\,S_{t-1}=i)$. Define $m$-by-$m$ matrices $\mathcal{K}_i$, $i=1,\ldots,r$, with elements
$$\left\{\mathcal{K}_i\right\}_{s,\tilde{s}}\triangleq\left(\alpha_{i,s}+\beta_{i,s}\right)\frac{\pi_{\tilde{s}}}{\pi_s}\left\{A^i\right\}_{\tilde{s},s}\,,\qquad s,\tilde{s}=1,\ldots,m\,,\qquad(4.12)$$
and an $mr$-by-$mr$ matrix as follows:
$$\Psi\triangleq\begin{bmatrix}\mathcal{K}_1 & \mathcal{K}_2 & \cdots & \mathcal{K}_r\\ I_m & 0 & \cdots & 0\\ 0 & I_m & \ddots & \vdots\\ \vdots & \ddots & \ddots & 0\\ 0 & \cdots & I_m & 0\end{bmatrix}.\qquad(4.13)$$
Let $\rho\,(\cdot)$ denote the spectral radius of a matrix, i.e., its largest eigenvalue in modulus, and let $\Phi$ be an $m$-by-$m$ square matrix built from the $mr$-by-$mr$ matrix $(I-\Psi)^{-1}$ such that $\{\Phi\}_{ij}=\left\{(I-\Psi)^{-1}\right\}_{ij}$, $i,j=1,\ldots,m$. Then a necessary and sufficient condition for asymptotic wide-sense stationarity of an MSTF-GARCH process is $\rho\,(\Psi)<1$, and the asymptotic covariance matrix of the process is then a diagonal matrix (see [128] for a detailed proof):
$$\lim_{t\to\infty}E\left\{\mathbf{X}_t\mathbf{X}_t^{H}\right\}=\left(\boldsymbol{\pi}\,\Phi\,\boldsymbol{\zeta}\right)I_K\,,\qquad(4.14)$$
where $\boldsymbol{\zeta}\triangleq\left[\zeta_1,\ldots,\zeta_m\right]'$, $\boldsymbol{\pi}$ is the row vector of the stationary probabilities of the Markov chain, and $(\cdot)^{H}$ denotes the Hermitian transpose operation.
This stationarity condition is a necessary and sufficient condition for the existence of a finite second-order moment of the process. It implies that in some regimes (but not in all of them) the conditional variance may grow over time (i.e., $\sum_i\alpha_{i,s}+\sum_j\beta_{j,s}>1$ for some states $s$) while the unconditional variance still remains finite [121, 128].
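As a numerical illustration of the condition $\rho(\Psi)<1$, the sketch below assembles the blocks $\mathcal{K}_i$ of (4.12) and the companion matrix $\Psi$ of (4.13) for a hypothetical 2-state model with $r=2$, and estimates the spectral radius by power iteration, which is valid here because $\Psi$ is entrywise nonnegative. All parameter values are invented for illustration.

```python
# Checking the stationarity condition rho(Psi) < 1 via (4.12)-(4.13) for a
# hypothetical 2-state MSTF-GARCH(2, 2) model (m = 2, r = 2).

m, r = 2, 2
A = [[0.9, 0.1], [0.2, 0.8]]           # transition probabilities a_ij
pi = [A[1][0] / (A[0][1] + A[1][0]), A[0][1] / (A[0][1] + A[1][0])]
alpha = [[0.10, 0.05], [0.05, 0.05]]   # alpha[i][s] for lag i+1, state s
beta = [[0.30, 0.10], [0.10, 0.05]]    # beta[i][s]

def matpow(M, k):
    n = len(M)
    R = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(k):
        R = [[sum(R[i][l] * M[l][j] for l in range(n)) for j in range(n)]
             for i in range(n)]
    return R

# Blocks of (4.12): {K_i}_{s,s~} = (a_{i,s}+b_{i,s}) (pi_s~/pi_s) {A^i}_{s~,s}
Ai = [matpow(A, i + 1) for i in range(r)]
Kblocks = [[[(alpha[i][s] + beta[i][s]) * pi[st] / pi[s] * Ai[i][st][s]
             for st in range(m)] for s in range(m)] for i in range(r)]

# Companion-form matrix Psi of (4.13), size mr-by-mr
n = m * r
Psi = [[0.0] * n for _ in range(n)]
for i in range(r):
    for s in range(m):
        for st in range(m):
            Psi[s][i * m + st] = Kblocks[i][s][st]
for i in range(r - 1):                  # identity sub-blocks
    for s in range(m):
        Psi[m * (i + 1) + s][m * i + s] = 1.0

# Power iteration for the spectral radius (Perron root of a nonnegative matrix)
v = [1.0] * n
rho = 0.0
for _ in range(500):
    w = [sum(Psi[i][j] * v[j] for j in range(n)) for i in range(n)]
    rho = max(abs(x) for x in w)
    v = [x / rho for x in w]
print("rho(Psi) =", round(rho, 4),
      "-> stationary" if rho < 1 else "-> not stationary")
```

A useful sanity check of (4.12) is that the row sums of $\mathcal{K}_i$ equal $\alpha_{i,s}+\beta_{i,s}$, since $\sum_{\tilde{s}}\pi_{\tilde{s}}\{A^i\}_{\tilde{s},s}=\pi_s$ for a stationary chain.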
4.3 Restoration of noisy MSTF-GARCH process
In this section we develop a recursive algorithm for the restoration of MSTF-GARCH processes observed in additive stationary noise.
A hidden Markov process is a discrete-time finite-state Markov chain observed through a memoryless time-invariant channel, where the chain state is assumed to be hidden but the transition probabilities between sequential states are assumed to be known. As a consequence of the memoryless channel, the conditional density of the observed signal at time $t$ (say $\mathbf{X}_t$) given the chain state $s_t$ depends only on the given state and not on previous observations; i.e., the conditional density of a hidden Markov process (HMP) satisfies $p\,(\mathbf{X}_t\,|\,s_t,\mathbf{X}_{t-1},\mathbf{X}_{t-2},\ldots)=p\,(\mathbf{X}_t\,|\,s_t)$. Combining GARCH models with hidden Markov chains, where each state is assumed to have a different GARCH formulation, introduces further complexity when trying to forecast or estimate the process, since the conditional variance of the process evolves as a function of previous conditional variances, as implied by (4.5). Consequently, the conditional density depends on the entire history of the process, i.e., past values and active states. To avoid this problem, several variants of GARCH processes with Markov-switching regimes have been proposed, e.g., [6–8]. These models formulate the conditional variance at any regime as dependent on past signal observations only. However, these variants of Markov-switching GARCH models have been developed for the purpose of forecasting the volatility of financial time-series, assuming that the process is observed in a noiseless environment and that all past clean signal values are given.
We use an MSTF-GARCH$(1,1)$ model, as defined in (4.4) and (4.5), to model complex, nonstationary random signals, and we develop a recursive signal estimation algorithm for restoring the clean signal and its second-order moment from noisy observations. The order $(1,1)$ is chosen for computational simplicity, since higher $(p,q)$ orders imply strong dependency between successive conditional variances; therefore, $p=q=1$ is generally assumed in applications of Markov-switching GARCH modeling, e.g., [6–8, 12, 13]. Let $\{X_{tk}\}$ and $\{D_{tk}\}$ denote the spectral coefficients of the signal and of an uncorrelated additive noise signal, respectively, and let $Y_{tk}=X_{tk}+D_{tk}$ represent the observed signal.
Let $\mathbf{X}_t$ be a $K$-dimensional complex-valued stochastic process which evolves as an $m$-state first-order MSTF-GARCH, i.e., $\mathbf{X}_t\sim$ MSTF-GARCH$(1,1)$, and let $\mathbf{D}_t$ represent a $K$-dimensional complex Gaussian random noise, $\mathbf{D}_t\sim\mathcal{CN}(0,R^d)$, with known diagonal covariance matrix $R^d=\mathrm{diag}\{\boldsymbol{\sigma}^2\}$. We assume that all MSTF-GARCH model parameters are known, i.e., the initial regime probability $\pi^{(0)}$, the transition probability matrix $A$, and the GARCH$(1,1)$ parameters in each of the $m$ regimes. Let $\phi\triangleq\left\{\pi^{(0)},A,\zeta_1,\ldots,\zeta_m,\alpha_1,\ldots,\alpha_m,\beta_1,\ldots,\beta_m\right\}$ be the set of parameters which specifies the model, where for a first-order process we denote $\alpha_s\triangleq\alpha_{1,s}$ and $\beta_s\triangleq\beta_{1,s}$. In practice, the model parameters $\phi$ are estimated from a set of clean training signals, as is generally done with hidden Markov models [35, 62, 64, 129], while the covariance matrix of the noise process can be estimated using the minimum statistics [72] or the minima-controlled recursive averaging algorithms [38, 71]. The problem of model estimation is addressed in Section 4.5.
The spectral restoration problem is generally formulated as deriving an estimator Xˆtk for the spectral coefficients, such that the expected value of a certain distortion measure is minimized. We develop a recursive estimator for the signal’s spectral coefficients and for their absolute squared values in the sense of minimum mean-square error (MMSE), and we then extend this framework to signal restoration in the sense of MMSE of the log-spectral amplitude (LSA), which is often used in speech enhancement applications, see for instance [34,38].
Let $\mathcal{Y}^\tau=\mathcal{Y}_0^\tau\triangleq\left\{\mathbf{Y}_t\,|\,t=0,\ldots,\tau\right\}$ be the set of observations up to time $\tau$. The causal MMSE estimator of the coefficients $\mathbf{X}_t$ given the noisy observations up to time $t$ is obtained as follows:
$$E\left\{\mathbf{X}_t\,|\,\mathcal{Y}^t\right\}=\sum_{s_t}p\left(s_t\,|\,\mathcal{Y}^t\right)E\left\{\mathbf{X}_t\,|\,s_t,\mathcal{Y}^t\right\}.\qquad(4.15)$$
Denote the state-dependent, one-frame-ahead conditional covariance matrix of the clean signal as
$$R^x_{s_t}\triangleq E\left\{\mathbf{X}_t\mathbf{X}_t^{H}\,|\,s_t,\mathcal{I}^{t-1}\right\}.\qquad(4.16)$$
Following the model formulation, this covariance matrix is a function of $R^x_{s_{t-1}}$ and $\mathbf{X}_{t-1}$ only. However, neither the clean signal values nor the sequence of active states are usually available, so the evaluation of (4.16) requires all the available observations. To overcome this problem, we assume that, given the current regime, past estimated conditional covariances are sufficient statistics for the conditional variance estimation [24]. Accordingly, given the set of estimated one-frame-ahead conditional variances $\hat{\Lambda}_t\triangleq\left\{\hat{\boldsymbol{\lambda}}_{t|t-1,S_t}\,|\,S_t=1,\ldots,m\right\}$, which incorporates the observations up to time $t-1$, we may use the following signal estimator:
$$\hat{\mathbf{X}}_t=\sum_{s_t}p\left(s_t\,|\,\hat{\Lambda}_t,\mathbf{Y}_t\right)E\left\{\mathbf{X}_t\,|\,s_t,\hat{R}^x_{s_t},\mathbf{Y}_t\right\},\qquad(4.17)$$
where under a Gaussian model
$$E\left\{\mathbf{X}_t\,|\,s_t,\hat{R}^x_{s_t},\mathbf{Y}_t\right\}=\hat{R}^x_{s_t}\left(\hat{R}^x_{s_t}+R^d\right)^{-1}\mathbf{Y}_t\,.\qquad(4.18)$$
Note that $\hat{R}^x_{s_t}$ is a $K$-by-$K$ diagonal matrix (since $\{V_{tk}\}$ are statistically independent) with the estimated state-dependent conditional variance $\hat{\boldsymbol{\lambda}}_{t|t-1,s_t}$ on its diagonal. This state-dependent conditional variance can be recursively estimated in the MMSE sense by calculating its conditional expectation under $s_t$, given the observation $\mathbf{Y}_{t-1}$ and the previous set of estimated conditional variances:
$$\begin{aligned}\hat{\boldsymbol{\lambda}}_{t|t-1,s_t}&\triangleq E\left\{\boldsymbol{\lambda}_{t|t-1,S_t}\,|\,s_t,\hat{\Lambda}_{t-1},\mathbf{Y}_{t-1};\phi\right\}\\&=\zeta_{s_t}\mathbf{1}+\alpha_{s_t}E\left\{\mathbf{X}_{t-1}\odot\mathbf{X}_{t-1}^{*}\,|\,s_t,\hat{\Lambda}_{t-1},\mathbf{Y}_{t-1};\phi\right\}+\beta_{s_t}E\left\{\boldsymbol{\lambda}_{t-1|t-2,S_{t-1}}\,|\,s_t,\hat{\Lambda}_{t-1},\mathbf{Y}_{t-1};\phi\right\}.\end{aligned}\qquad(4.19)$$
The conditional second-order moment in (4.19) can be obtained by
$$\begin{aligned}\hat{\boldsymbol{\lambda}}_{t-1|t-1,s_t}&\triangleq E\left\{\mathbf{X}_{t-1}\odot\mathbf{X}_{t-1}^{*}\,|\,s_t,\hat{\Lambda}_{t-1},\mathbf{Y}_{t-1};\phi\right\}\\&=\sum_{s_{t-1}}p\left(s_{t-1}\,|\,s_t,\hat{\Lambda}_{t-1},\mathbf{Y}_{t-1};\phi\right)E\left\{\mathbf{X}_{t-1}\odot\mathbf{X}_{t-1}^{*}\,|\,s_{t-1},s_t,\hat{\Lambda}_{t-1},\mathbf{Y}_{t-1};\phi\right\}\\&=\sum_{s_{t-1}}p\left(s_{t-1}\,|\,s_t,\hat{\Lambda}_{t-1},\mathbf{Y}_{t-1};\phi\right)E\left\{\mathbf{X}_{t-1}\odot\mathbf{X}_{t-1}^{*}\,|\,s_{t-1},\hat{\boldsymbol{\lambda}}_{t-1|t-2,s_{t-1}},\mathbf{Y}_{t-1};\phi\right\}\\&\triangleq\sum_{s_{t-1}}p\left(s_{t-1}\,|\,s_t,\hat{\Lambda}_{t-1},\mathbf{Y}_{t-1};\phi\right)\hat{\boldsymbol{\lambda}}_{t-1|t-1,s_{t-1}}\,.\end{aligned}\qquad(4.20)$$
The expected one-frame-ahead conditional variance in (4.19), given the one-frame-ahead regime, can be obtained by:
$$\begin{aligned}\hat{\boldsymbol{\lambda}}_{t-1|t-2,s_t}&\triangleq E\left\{\boldsymbol{\lambda}_{t-1|t-2,S_{t-1}}\,|\,s_t,\hat{\Lambda}_{t-1},\mathbf{Y}_{t-1};\phi\right\}\\&=\sum_{s_{t-1}}p\left(s_{t-1}\,|\,s_t,\hat{\Lambda}_{t-1},\mathbf{Y}_{t-1};\phi\right)E\left\{\boldsymbol{\lambda}_{t-1|t-2,S_{t-1}}\,|\,s_{t-1},s_t,\hat{\Lambda}_{t-1};\phi\right\}\\&=\sum_{s_{t-1}}p\left(s_{t-1}\,|\,s_t,\hat{\Lambda}_{t-1},\mathbf{Y}_{t-1};\phi\right)\hat{\boldsymbol{\lambda}}_{t-1|t-2,s_{t-1}}\,.\end{aligned}\qquad(4.21)$$
The third lines in (4.20) and (4.21) rely on the fact that, given all observations up to time $t-1$ and given the state $s_{t-1}$, the second-order moment of the process at that time, and also its conditional variance, are independent of any future state. Moreover, notice that $\hat{\boldsymbol{\lambda}}_{t|t,s_t}$ and $\hat{\boldsymbol{\lambda}}_{t|t,s_{t+1}}$ in (4.20) represent the expected second-order moment of the process based on information up to time $t$, given the chain state at the same time and given the next state, respectively. Similarly, $\hat{\boldsymbol{\lambda}}_{t|t-1,s_t}$ and $\hat{\boldsymbol{\lambda}}_{t|t-1,s_{t+1}}$ in (4.21) represent the expectation of the one-frame-ahead conditional variance at time $t$ given the chain state $s_t$ and given the chain state at the next time step, respectively.
The MMSE estimate of the process' second-order moment $\hat{\boldsymbol{\lambda}}_{t|t,s_t}$ in (4.20), given the estimated one-frame-ahead conditional variance of the same regime $\hat{\boldsymbol{\lambda}}_{t|t-1,s_t}$ (4.21), can be obtained by
$$\hat{\boldsymbol{\lambda}}_{t|t,s_t}=E\left\{\mathbf{X}_t\odot\mathbf{X}_t^{*}\,|\,s_t,\hat{\boldsymbol{\lambda}}_{t|t-1,s_t},\mathbf{Y}_t\right\}=\hat{R}^x_{s_t}\left(\hat{R}^x_{s_t}+R^d\right)^{-1}\boldsymbol{\sigma}^2+\left[\hat{R}^x_{s_t}\left(\hat{R}^x_{s_t}+R^d\right)^{-1}\right]^2\left(\mathbf{Y}_t\odot\mathbf{Y}_t^{*}\right),\qquad s_t=1,\ldots,m\,,\qquad(4.22)$$
similarly to the method in [24] applied to the case of a single-regime spectral GARCH. Following the notation in [24], we call (4.22) the update step, as it updates the estimate of the signal's second-order moment at time $t$ from its estimated one-frame-ahead conditional variance, using the new observation $\mathbf{Y}_t$. Substituting (4.20), (4.21) and (4.22) into (4.19), we obtain the propagation step, which propagates ahead in time to obtain a conditional variance estimate at the next time, $t+1$ (assuming regime $s_{t+1}$), using the available information up to the current time $t$:
$$\hat{\boldsymbol{\lambda}}_{t+1|t,s_{t+1}}=\zeta_{s_{t+1}}\mathbf{1}+\alpha_{s_{t+1}}\hat{\boldsymbol{\lambda}}_{t|t,s_{t+1}}+\beta_{s_{t+1}}\hat{\boldsymbol{\lambda}}_{t|t-1,s_{t+1}}\,,\qquad s_{t+1}=1,\ldots,m\,.\qquad(4.23)$$
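For diagonal covariances, the update step (4.22) and the propagation step (4.23) reduce to scalar operations per frequency bin. The sketch below illustrates one update/propagate cycle for a single bin and regime, with hypothetical parameter values; the regime-mixing steps (4.20)–(4.21) are omitted for brevity.

```python
# A per-bin sketch of the update step (4.22) and propagation step (4.23)
# for an m-state MSTF-GARCH(1, 1). Parameter values are hypothetical.

zeta, alpha, beta = [0.01, 0.5], [0.1, 0.4], [0.2, 0.5]   # per regime
sigma2 = 1.0                                              # noise variance

def update(lam_pred, y):
    """(4.22): second-moment estimate lambda_{t|t,s} from the predicted
    variance lambda_{t|t-1,s} and the new noisy observation y."""
    g = lam_pred / (lam_pred + sigma2)        # scalar Wiener gain
    return g * sigma2 + (g ** 2) * abs(y) ** 2

def propagate(lam_upd, lam_pred, s_next):
    """(4.23): one-frame-ahead variance lambda_{t+1|t,s_{t+1}} assuming
    regime s_{t+1}; both arguments are already conditioned on s_{t+1}."""
    return zeta[s_next] + alpha[s_next] * lam_upd + beta[s_next] * lam_pred

# One recursion step for a single bin, assuming regime s = 1:
lam_pred = 0.8            # lambda_{t|t-1, s=1} from the previous frame
y = 1.2 + 0.5j            # noisy STFT coefficient Y_tk
lam_upd = update(lam_pred, y)
lam_next = propagate(lam_upd, lam_pred, 1)
print(round(lam_upd, 3), round(lam_next, 3))  # → 0.778 1.211
```

Note how the update blends the prior variance (weighted by the Wiener gain) with the observed energy, while the propagation simply applies the regime's GARCH recursion to the two updated quantities.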
Let $\hat{\Lambda}^t\triangleq\left\{\hat{\Lambda}_0,\hat{\Lambda}_1,\ldots,\hat{\Lambda}_t\right\}$ be the set of recursively estimated conditional variances up to time $t$. We can then use all previous estimates to recursively evaluate the probability $p\,(s_{t-1}\,|\,s_t,\hat{\Lambda}^{t-1},\mathbf{Y}_{t-1};\phi)$ in (4.20) and (4.21) by
$$p\left(s_{t-1}\,|\,s_t,\hat{\Lambda}^{t-1},\mathbf{Y}_{t-1};\phi\right)=p\left(s_{t-1}\,|\,\hat{\Lambda}^{t-1},\mathbf{Y}_{t-1};\phi\right)a_{s_{t-1},s_t}\,\big/\,p\left(s_t\,|\,\hat{\Lambda}^{t-1},\mathbf{Y}_{t-1};\phi\right),\qquad(4.24)$$
where
$$p\left(s_t\,|\,\hat{\Lambda}^{t-1},\mathbf{Y}_{t-1};\phi\right)=\sum_{s_{t-1}}p\left(s_{t-1}\,|\,\hat{\Lambda}^{t-1},\mathbf{Y}_{t-1};\phi\right)a_{s_{t-1},s_t}\,.\qquad(4.25)$$
The conditional state probability on the right-hand side of (4.25) can be obtained by
$$p\left(s_t\,|\,\hat{\Lambda}^t,\mathbf{Y}_t;\phi\right)=\frac{b\left(\mathbf{Y}_t,s_t\,|\,\hat{\Lambda}^t;\phi\right)}{b\left(\mathbf{Y}_t\,|\,\hat{\Lambda}^t;\phi\right)}=\frac{b\left(\mathbf{Y}_t\,|\,s_t,\hat{\boldsymbol{\lambda}}_{t|t-1,s_t}\right)p\left(s_t\,|\,\hat{\Lambda}^{t-1},\mathbf{Y}_{t-1};\phi\right)}{\sum_{s_t}b\left(\mathbf{Y}_t\,|\,s_t,\hat{\boldsymbol{\lambda}}_{t|t-1,s_t}\right)p\left(s_t\,|\,\hat{\Lambda}^{t-1},\mathbf{Y}_{t-1};\phi\right)}\,,\qquad(4.26)$$
where $b\,(\cdot\,|\,\cdot)$ denotes a conditional density function. Specifically, $b\,(\mathbf{Y}_t\,|\,s_t,\hat{\boldsymbol{\lambda}}_{t|t-1,s_t})$ is the observation conditional density, which is a complex normal distribution with zero mean and covariance matrix $\hat{R}^x_{s_t}+R^d$,
$$b\left(\mathbf{Y}_t\,|\,s_t,\hat{\boldsymbol{\lambda}}_{t|t-1,s_t}\right)=\frac{1}{\pi^K\left|\hat{R}^x_{s_t}+R^d\right|}\exp\left(-\mathbf{Y}_t^{H}\left(\hat{R}^x_{s_t}+R^d\right)^{-1}\mathbf{Y}_t\right).\qquad(4.27)$$
Computing the conditional density $b\,(\mathbf{Y}_t\,|\,s_t,\hat{\boldsymbol{\lambda}}_{t|t-1,s_t})$ tends to be numerically unstable for large values of $K$, since the diagonal values of its covariance matrix (i.e., $\hat{\boldsymbol{\lambda}}_{t|t-1,s_t}$) are typically of the same order of magnitude. Therefore, $b\,(\mathbf{Y}_t\,|\,s_t,\hat{\boldsymbol{\lambda}}_{t|t-1,s_t})$ tends to zero or infinity exponentially fast as $K$ increases. It is therefore useful to recursively evaluate a normalized density $\tilde{b}\,(\mathbf{Y}_t\,|\,s_t,\hat{\boldsymbol{\lambda}}_{t|t-1,s_t})$ as follows:
$$\tilde{b}\left(Y_{t,0},\ldots,Y_{t,\tilde{k}}\,|\,s_t,\hat{\boldsymbol{\lambda}}_{t|t-1,s_t}\right)=\frac{\tilde{b}\left(Y_{t,0},\ldots,Y_{t,\tilde{k}-1}\,|\,s_t,\hat{\boldsymbol{\lambda}}_{t|t-1,s_t}\right)b\left(Y_{t,\tilde{k}}\,|\,s_t,\hat{\boldsymbol{\lambda}}_{t|t-1,s_t}\right)}{\sum_{s_t}\tilde{b}\left(Y_{t,0},\ldots,Y_{t,\tilde{k}-1}\,|\,s_t,\hat{\boldsymbol{\lambda}}_{t|t-1,s_t}\right)b\left(Y_{t,\tilde{k}}\,|\,s_t,\hat{\boldsymbol{\lambda}}_{t|t-1,s_t}\right)}\,,\qquad(4.28)$$
for $\tilde{k}=0,\ldots,K-1$, and substitute it into (4.26). As can be seen from (4.26), this normalization of $b\,(\mathbf{Y}_t\,|\,s_t,\hat{\boldsymbol{\lambda}}_{t|t-1,s_t})$ does not affect the value of $p\,(s_t\,|\,\hat{\Lambda}^t,\mathbf{Y}_t;\phi)$.
The causal one-frame-ahead conditional variance and the conditional second-order moment of the process can be obtained by
$$\hat{\boldsymbol{\lambda}}_{t|t-1}=\sum_{s_t}p\left(s_t\,|\,\hat{\Lambda}^t,\mathbf{Y}_t\right)E\left\{\boldsymbol{\lambda}_{t|t-1,S_t}\,|\,s_t,\hat{\Lambda}^t\right\}=\sum_{s_t}p\left(s_t\,|\,\hat{\Lambda}^t,\mathbf{Y}_t\right)\hat{\boldsymbol{\lambda}}_{t|t-1,s_t}\,,\qquad(4.29)$$
$$\hat{\boldsymbol{\lambda}}_{t|t}=\sum_{s_t}p\left(s_t\,|\,\hat{\Lambda}^t,\mathbf{Y}_t\right)E\left\{\mathbf{X}_t\odot\mathbf{X}_t^{*}\,|\,s_t,\hat{\boldsymbol{\lambda}}_{t|t-1,s_t},\mathbf{Y}_t\right\}=\sum_{s_t}p\left(s_t\,|\,\hat{\Lambda}^t,\mathbf{Y}_t\right)\hat{\boldsymbol{\lambda}}_{t|t,s_t}\,,\qquad(4.30)$$
while a state smoothing (i.e., a noncausal state probability estimation) for the path-dependent MSTF-GARCH model has been derived in [130] and may be employed for noncausal estimation.
The causal recursive MMSE signal restoration algorithm, presented in (4.17) to (4.26), has a compact vector form with respect to the regimes vector. Let $\mathbf{s}_t\triangleq\left[S_t=1,\ldots,S_t=m\right]'$ be the regimes vector at time $t$, and let
$$\boldsymbol{\rho}_t\left(\mathbf{s}_\tau\right)\triangleq\left[p\left(S_\tau=1\,|\,\hat{\Lambda}^t,\mathbf{Y}_t\right),\ldots,p\left(S_\tau=m\,|\,\hat{\Lambda}^t,\mathbf{Y}_t\right)\right]'\qquad(4.31)$$
be the probabilities of the regimes vector $\mathbf{s}_\tau$, conditioned on all observations up to frame $t$. Let $C_t$ be a regime probability matrix at time $t$, conditioned on the next regime and all available observations up to time $t$, i.e., $c_{t,ij}=p\,(S_t=i\,|\,S_{t+1}=j,\hat{\Lambda}^t,\mathbf{Y}_t)$, $i,j=1,\ldots,m$. Let $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ represent the vectors of the $m$ regimes' GARCH parameters, i.e., $\boldsymbol{\alpha}\triangleq\left[\alpha_1,\ldots,\alpha_m\right]'$ and $\boldsymbol{\beta}\triangleq\left[\beta_1,\ldots,\beta_m\right]'$. Let $\hat{\boldsymbol{\lambda}}_{tk|\tau_1,\mathbf{s}_{\tau_2}}\triangleq\left[\hat{\lambda}_{tk|\tau_1,S_{\tau_2}=1},\ldots,\hat{\lambda}_{tk|\tau_1,S_{\tau_2}=m}\right]'$ be an $m\times 1$ vector of the $k$th-bin estimated conditional variances based on observations up to time $\tau_1$ and the corresponding $m$ regimes vector $\mathbf{s}_{\tau_2}$. Denote by $\mathbf{a}^{(i)}$ and $\mathbf{c}_t^{(i)}$ the $i$th columns of the matrices $A$ and $C_t$, respectively, and let $\div$ denote a term-by-term division of two vectors. A step-by-step vector form of the causal signal estimation procedure is described in Table 4.1.
The algorithm, summarized in Table 4.1, estimates both the spectral coefficients and their conditional variance in the MMSE sense. A more general signal enhancement problem is formulated as minimization of the following distortion measure:
$$E\left\{\left|f\left(X_{tk}\right)-f\big(\hat{X}_{tk}\big)\right|^2\,\Big|\,\mathcal{Y}^t\right\},\qquad(4.32)$$
where $f(X)$ is a Borel integrable function. The estimator can be found from
$$f\big(\hat{X}_{tk}\big)=E\left\{f\left(X_{tk}\right)\,|\,\mathcal{Y}^t\right\},\qquad(4.33)$$
where
$$E\left\{f\left(X_{tk}\right)\,|\,\mathcal{Y}^t\right\}=\sum_{s_t}p\left(S_t=s_t\,|\,\mathcal{Y}^t\right)E\left\{f\left(X_{tk}\right)\,|\,s_t,\mathcal{Y}^t\right\}.\qquad(4.34)$$
The log-spectral amplitude MMSE estimator, obtained by substituting $f(X)=\log|X|$ into (4.33), is of particular importance in speech enhancement applications; see for instance [34, 38]. The LSA estimator [34] is given by
Table 4.1: Vector form of the recursive MSTF-GARCH signal estimation

Initialization:
$\boldsymbol{\rho}_{-1}(\mathbf{s}_0)=\boldsymbol{\pi}$
$\hat{\boldsymbol{\lambda}}_{-1,k|-2,\mathbf{s}_0}=\hat{\boldsymbol{\lambda}}_{-1,k|-1,\mathbf{s}_0}=\mathbf{0}_{m\times 1}\,,\quad k=0,\ldots,K-1$

For $t=0,\ldots,T-1$:
$\hat{\boldsymbol{\lambda}}_{tk|t-1,\mathbf{s}_t}=\boldsymbol{\zeta}+\boldsymbol{\alpha}\odot\hat{\boldsymbol{\lambda}}_{t-1,k|t-1,\mathbf{s}_t}+\boldsymbol{\beta}\odot\hat{\boldsymbol{\lambda}}_{t-1,k|t-2,\mathbf{s}_t}\,,\quad k=0,\ldots,K-1$
$\hat{\boldsymbol{\lambda}}_{tk|t,\mathbf{s}_t}=\hat{\boldsymbol{\lambda}}_{tk|t-1,\mathbf{s}_t}\odot\sigma_k^2\mathbf{1}\div\left(\hat{\boldsymbol{\lambda}}_{tk|t-1,\mathbf{s}_t}+\sigma_k^2\mathbf{1}\right)+\hat{\boldsymbol{\lambda}}_{tk|t-1,\mathbf{s}_t}^2\cdot|Y_{tk}|^2\div\left(\hat{\boldsymbol{\lambda}}_{tk|t-1,\mathbf{s}_t}+\sigma_k^2\mathbf{1}\right)^2\,,\quad k=0,\ldots,K-1$
$b\left(\mathbf{Y}_t\,|\,s_t,\hat{\boldsymbol{\lambda}}_{t|t-1,s_t}\right)=\pi^{-K}\left|\hat{R}^x_{s_t}+R^d\right|^{-1}\exp\left(-\mathbf{Y}_t^{H}\left(\hat{R}^x_{s_t}+R^d\right)^{-1}\mathbf{Y}_t\right),\quad s_t=1,\ldots,m$
$B_t\triangleq\mathrm{diag}\left\{b\left(\mathbf{Y}_t\,|\,s_t,\hat{\boldsymbol{\lambda}}_{t|t-1,s_t}\right)\right\}$
$\boldsymbol{\rho}_t(\mathbf{s}_t)=B_t\,\boldsymbol{\rho}_{t-1}(\mathbf{s}_t)\left[\mathbf{1}'B_t\,\boldsymbol{\rho}_{t-1}(\mathbf{s}_t)\right]^{-1}$
$\boldsymbol{\rho}_t(\mathbf{s}_{t+1})=A'\boldsymbol{\rho}_t(\mathbf{s}_t)$
for $i=1,\ldots,m$: $\mathbf{c}_t^{(i)}=\mathbf{a}^{(i)}\odot\boldsymbol{\rho}_t(\mathbf{s}_t)\,/\,\rho_t(s_{t+1}=i)$
$\hat{\boldsymbol{\lambda}}_{tk|t,\mathbf{s}_{t+1}}=C_t'\,\hat{\boldsymbol{\lambda}}_{tk|t,\mathbf{s}_t}\,,\quad k=0,\ldots,K-1$
$\hat{\boldsymbol{\lambda}}_{tk|t-1,\mathbf{s}_{t+1}}=C_t'\,\hat{\boldsymbol{\lambda}}_{tk|t-1,\mathbf{s}_t}\,,\quad k=0,\ldots,K-1$
$\hat{\mathbf{X}}_{t|t,s_t}=\hat{R}^x_{s_t}\left(\hat{R}^x_{s_t}+R^d\right)^{-1}\mathbf{Y}_t$
$\hat{X}_{tk}=\boldsymbol{\rho}_t'(\mathbf{s}_t)\,\hat{\mathbf{X}}_{tk|t,\mathbf{s}_t}\,,\quad k=0,\ldots,K-1$
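The regime-probability update $\boldsymbol{\rho}_t(\mathbf{s}_t)$ in Table 4.1 multiplies $K$ per-bin Gaussian densities, which is exactly the numerical-range problem that the normalization (4.28) addresses. An equivalent and common alternative, sketched below with hypothetical values, accumulates the per-bin log-likelihoods and normalizes once with the log-sum-exp trick.

```python
import math

# Log-domain computation of the state posteriors of (4.26), avoiding the
# exponential under/overflow discussed around (4.27)-(4.28). All numeric
# values are hypothetical.

def log_gauss(y, lam):
    """log of the complex-Gaussian density CN(0, lam) of one bin."""
    return -math.log(math.pi * lam) - abs(y) ** 2 / lam

def state_posteriors(Y, lam_pred, prior):
    """p(s_t | ...) of (4.26) computed in the log domain.
    Y: K noisy bins; lam_pred[s][k]: lambda_{tk|t-1,s} plus the noise
    variance; prior[s]: predicted state probability p(s_t | past)."""
    m = len(prior)
    logp = [math.log(prior[s]) + sum(log_gauss(y, lam_pred[s][k])
                                     for k, y in enumerate(Y))
            for s in range(m)]
    mx = max(logp)                               # log-sum-exp trick
    w = [math.exp(l - mx) for l in logp]
    z = sum(w)
    return [x / z for x in w]

Y = [0.3 + 0.1j, -1.2 + 0.8j, 0.05 - 0.02j, 2.0 + 1.0j]
lam_pred = [[0.2, 0.2, 0.2, 0.2],     # quiet regime (incl. noise)
            [2.0, 2.0, 2.0, 2.0]]     # loud regime (incl. noise)
post = state_posteriors(Y, lam_pred, [0.5, 0.5])
print(round(post[1], 3))  # → 1.0
```

Subtracting the maximum log-likelihood before exponentiating leaves the ratios in (4.26) unchanged, just as the recursive normalization (4.28) does.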
$$|\hat{X}_{tk}|=\exp\left(E\left\{\log|X_{tk}|\;\big|\;\lambda_{tk},Y_{tk}\right\}\right)=G\left(\xi_{tk},\vartheta_{tk}\right)|Y_{tk}|\,,\qquad(4.35)$$
where
$$\xi_{tk}\triangleq\frac{\lambda_{tk}}{\sigma^2_{tk}}\,,\qquad\gamma_{tk}\triangleq\frac{|Y_{tk}|^2}{\sigma^2_{tk}}\,,\qquad\vartheta_{tk}\triangleq\frac{\gamma_{tk}\,\xi_{tk}}{1+\xi_{tk}}\,,\qquad(4.36)$$
and
$$G(\xi,\vartheta)=\frac{\xi}{1+\xi}\exp\left(\frac{1}{2}\int_{\vartheta}^{\infty}\frac{e^{-t}}{t}\,dt\right).\qquad(4.37)$$
$\xi_{tk}$ and $\gamma_{tk}$ represent the a priori and a posteriori SNRs, respectively [33]. By substituting (4.35) into (4.34), and combining the result with the phase of the noisy signal [34], we obtain the spectral coefficient estimator in the MMSE-LSA sense:
$$\hat{X}_{tk}=Y_{tk}\prod_{s_t}\left[G\left(\hat{\xi}_{tk,s_t},\hat{\vartheta}_{tk,s_t}\right)\right]^{p(s_t|\mathcal{Y}^t)}\,,\qquad(4.38)$$
where
$$\hat{\xi}_{tk,s_t}\triangleq\frac{\hat{\lambda}_{tk|t,s_t}}{\sigma^2_{tk}}\,,\qquad\hat{\vartheta}_{tk,s_t}\triangleq\frac{\gamma_{tk}\,\hat{\xi}_{tk,s_t}}{1+\hat{\xi}_{tk,s_t}}\,,\qquad(4.39)$$
and $p\,(s_t\,|\,\mathcal{Y}^t)$ is recursively estimated using (4.26).
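The gain function (4.37) involves the exponential integral $\int_{\vartheta}^{\infty}e^{-t}/t\,dt$, for which a simple numerical sketch suffices for illustration. Here the integral is approximated by trapezoidal quadrature on a truncated interval (the tail beyond $t\approx 40$ is negligible); the chosen $\xi$ and $\vartheta$ values are hypothetical.

```python
import math

# A numerical sketch of the LSA gain (4.37); a library such as SciPy
# provides the exponential integral directly, but a stdlib-only
# approximation is used here for self-containment.

def expint(theta, upper=40.0, n=20000):
    """Approximate int_theta^inf exp(-t)/t dt by trapezoidal quadrature
    (theta > 0; the tail beyond `upper` is negligible)."""
    h = (upper - theta) / n
    f = lambda t: math.exp(-t) / t
    s = 0.5 * (f(theta) + f(upper))
    s += sum(f(theta + i * h) for i in range(1, n))
    return s * h

def lsa_gain(xi, theta):
    """G(xi, theta) of (4.37)."""
    return xi / (1.0 + xi) * math.exp(0.5 * expint(theta))

g = lsa_gain(xi=2.0, theta=1.0)
print(round(g, 3))  # → 0.744
```

In (4.38) the state-conditional gains are then combined geometrically, weighted by the state posteriors, before being applied to $|Y_{tk}|$.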
4.4 Estimation efficiency
In this section we analyze the mean-square error of a one-step-ahead MMSE estimation using the proposed recursive algorithm. The recursive formulation of the MSTF-GARCH yields an accumulated error in the estimation of the variance and the signal. However, for each regime and in each frame, the algorithm evaluates the conditional variance as a weighted sum of previously estimated conditional variances and squared absolute values (4.20), (4.21). These weights are proportional to the conditional densities $b\,(\mathbf{Y}_t\,|\,s_t,\hat{\boldsymbol{\lambda}}_{t|t-1,s_t})$ in (4.26). Consequently, an overestimation of the conditional variance in a specific frame can be followed by the algorithm assigning a higher probability (i.e., a higher weight) to a regime with small parameters, which compensates for the previous overestimation. Similarly, an underestimation of the conditional variance can be compensated by assigning a higher probability to a regime with large parameters.
Assume that the process is observed perfectly (without noise) up to time $t-1$ and that the regime path is known up to that time. Then, $\boldsymbol{\lambda}_{t-1|t-2,s_{t-1}}$ can be calculated by (4.5). Following Ephraim and Merhav [117], who derive bounds for the MSE of a composite source signal estimation, we assume that (i) the Markov chain is stationary and the necessary and sufficient condition for a bounded variance is satisfied; (ii) $\hat{\boldsymbol{\lambda}}_{t|t,s_t}$ is square integrable with respect to $b\,(\mathbf{Y}_t\,|\,\boldsymbol{\lambda}_{t|t-1,s_t})$ and $b\,(\mathbf{Y}_t\,|\,\boldsymbol{\lambda}_{t|t-1,\tilde{s}_t})$; and (iii) the regime transition probabilities are positive, i.e., $a_{ij}\geq a_{\min}>0\;\forall\, i,j=1,\ldots,m$.
The one-step-ahead MMSE estimator (4.17) is unbiased in the sense that $E\{\hat{\mathbf{X}}_t\}=E\{\mathbf{X}_t\}$, and following [117] we obtain an upper bound for the variance of the one-step-ahead estimation error, assuming that the process is observed in additive, independent stationary noise. The one-step-ahead MSE is given by
$$e_t^2 \triangleq \frac{1}{K}\,\mathrm{tr}\, E\left\{\left(X_t - \hat{X}_t\right)\left(X_t - \hat{X}_t\right)^H\right\}, \qquad (4.40)$$

where the signal estimator $\hat{X}_t$ follows
$$\begin{aligned} \hat{X}_t &= E\left\{X_t \mid \mathcal{I}^{t-1}, Y_t\right\} \\ &= E\left\{X_t \mid s_{t-1}, \lambda_{t-1|t-2,s_{t-1}}, X_{t-1}, Y_t\right\} \\ &= E\left\{X_t \mid \Lambda_t, Y_t\right\}. \end{aligned} \qquad (4.41)$$
Under the above assumptions, the MSE can be written as [117, eqs. (13)-(17)]
$$e_t^2 = \mu_t^2 + \eta_t^2\,, \qquad (4.42)$$

where
$$\mu_t^2 \triangleq \frac{1}{K}\,\mathrm{tr}\, E\left\{\mathrm{cov}\left(X_t \mid \Lambda_t, s_t, Y_t\right)\right\} = \frac{1}{K}\,\mathrm{tr}\, E\left\{\mathrm{cov}\left(X_t \mid \lambda_{t|t-1,s_t}, s_t, Y_t\right)\right\}, \qquad (4.43)$$

$$\eta_t^2 \triangleq \frac{1}{2} \sum_{s_t \neq \tilde{s}_t} E\left\{p\left(s_t \mid s_{t-1}, \Lambda_t, Y_t\right) p\left(\tilde{s}_t \mid s_{t-1}, \Lambda_t, Y_t\right) g\left(s_t, \tilde{s}_t, \Lambda_t, Y_t\right)\right\}, \qquad (4.44)$$

and
$$g\left(s_t, \tilde{s}_t, \Lambda_t, Y_t\right) \triangleq \frac{1}{K}\,\mathrm{tr}\left\{\left(\hat{X}_{t|t,s_t} - \hat{X}_{t|t,\tilde{s}_t}\right)\left(\hat{X}_{t|t,s_t} - \hat{X}_{t|t,\tilde{s}_t}\right)^H\right\} = g\left(s_t, \tilde{s}_t, \lambda_{t|t-1,s_t}, \lambda_{t|t-1,\tilde{s}_t}, Y_t\right). \qquad (4.45)$$

The state probabilities in (4.44) can be evaluated using (4.26):
$$p\left(s_t \mid s_{t-1}, \Lambda_t, Y_t\right) = \frac{b\left(Y_t \mid s_t, \lambda_{t|t-1,s_t}\right) a_{s_{t-1},s_t}}{\sum_{s_t} b\left(Y_t \mid s_t, \lambda_{t|t-1,s_t}\right) a_{s_{t-1},s_t}}\,, \qquad (4.46)$$

and the signal estimate given the state $s_t$ is given by
$$\hat{X}_{t|t,s_t} = W_{s_t} Y_t\,, \qquad (4.47)$$
where $W_{s_t}$ is the conditional Wiener filter: $W_{s_t} \triangleq R^x_{s_t}\left(R^x_{s_t} + R^d\right)^{-1}$. Substituting (4.47) into (4.45), we have

$$g\left(s_t, \tilde{s}_t, \Lambda_t, Y_t\right) = \frac{1}{K}\, Y_t^H \left(W_{s_t} - W_{\tilde{s}_t}\right)^H \left(W_{s_t} - W_{\tilde{s}_t}\right) Y_t \triangleq \frac{1}{K}\, Y_t^H W^2_{\tilde{s}_t s_t} Y_t\,. \qquad (4.48)$$

The one-step-ahead MSE, $e_t^2$, is decomposed into two positive terms, $\mu_t^2$ and $\eta_t^2$. The first is the MSE of the estimator $\hat{X}_{t|t,s_t}$, which relies on knowing the true regime at time $t$, and is therefore the optimal estimator in the MMSE sense. This term is evaluated by substituting (4.47) into (4.43):

$$\mu_t^2 = \frac{1}{K}\,\mathrm{tr} \sum_{s_t} a_{s_{t-1}s_t} W_{s_t} R^d\,. \qquad (4.49)$$

The second term, $\eta_t^2$, is a weighted sum of cross-error terms which depend on pairs of the process regimes. This term is difficult to evaluate, but it is upper bounded by [117, eqs. (18) and (23)]

$$\eta_t^2 \leq \frac{1}{2}\, a_{\min}^{-2} \sum_{s_t \neq \tilde{s}_t} \left(I_{s_t}(\tilde{s}_t) + I_{\tilde{s}_t}(s_t)\right), \qquad (4.50)$$

where

$$I_{s_t}(\tilde{s}_t) \leq \frac{1}{K} \left(\mathrm{tr}\left(W^2_{s_t \tilde{s}_t} Q_{\tilde{s}_t}\right) \cdot \left|R_\lambda\left(s_t, \tilde{s}_t\right)\right| \cdot \left|Q_{s_t}\right|^{-\lambda} \left|Q_{\tilde{s}_t}\right|^{\lambda-1} + \mathrm{tr}\left(W^2_{s_t \tilde{s}_t} R_\lambda\left(s_t, \tilde{s}_t\right)\right)\right) \qquad (4.51)$$

with $\lambda > 0$ [117, eqs. (31)-(39) and (54)-(60)], $Q_{s_t}$ denotes the covariance matrix of the noisy signal given the regime $s_t$, and $R_\lambda\left(s_t, \tilde{s}_t\right)$ is defined by

$$R_\lambda\left(s_t, \tilde{s}_t\right) \triangleq \left[\lambda\, Q_{s_t}^{-1} + (1-\lambda)\, Q_{\tilde{s}_t}^{-1}\right]^{-1}. \qquad (4.52)$$

In the derivation of (4.51) it is assumed that $R_\lambda\left(s_t, \tilde{s}_t\right)$ is positive definite [117]. Since $Q_{s_t}$ is a diagonal matrix with positive eigenvalues, $R_\lambda\left(s_t, \tilde{s}_t\right)$ is positive definite for any $0 < \lambda < 1$. Substituting (4.51) into (4.50) and using the diagonality of the covariance matrices, we obtain an upper bound on the cross-error term

$$\eta_t^2 \leq \frac{1}{a_{\min}^2 K} \sum_{s_t \neq \tilde{s}_t} \left(\mathrm{tr}\left(W^2_{s_t \tilde{s}_t} Q_{\tilde{s}_t}\right) \cdot \left|R_\lambda\left(s_t, \tilde{s}_t\right)\right| \cdot \left|Q_{s_t}\right|^{-\lambda} \cdot \left|Q_{\tilde{s}_t}\right|^{\lambda-1} + \mathrm{tr}\left(W^2_{s_t \tilde{s}_t} R_\lambda\left(s_t, \tilde{s}_t\right)\right)\right) \qquad (4.53)$$

for $0 < \lambda < 1$. It is worthwhile noting that our MSE analysis follows the analysis in [117], but the latter deals with a memoryless regime-switching process and a Toeplitz covariance matrix, whereas in our case neither assumption holds.

4.5 Model estimation
In this section, we address the problem of estimating the model parameters $\phi \triangleq \left\{\pi^{(0)}, A, \zeta_1, \ldots, \zeta_m, \alpha_1, \ldots, \alpha_m, \beta_1, \ldots, \beta_m\right\}$. The ML estimation approach is commonly used for estimating the parameters of GARCH models (e.g., [1,5,7]) and also for estimating transition probability matrices (e.g., [129]). The model parameters are estimated from a training data set of $N$ clean signals of lengths $T_n$, $n = 1, \ldots, N$. Let $X_t^{(n)}$ denote the spectral coefficients of the $n$th clean training signal and let $\mathcal{X}^{\tau,(n)} \triangleq \left\{X_t^{(n)} \mid t = 0, \ldots, \tau\right\}$. The conditional distribution of the vector $X_t^{(n)}$ given its past observations is a mixture of zero-mean Gaussian vectors with diagonal covariance matrices $\hat{R}^{x,(n)}_{s_t}$:
$$b\left(X_t^{(n)} \mid \mathcal{X}^{t-1,(n)}\right) = \sum_{s_t} p\left(s_t \mid \mathcal{X}^{t-1,(n)}\right) b\left(X_t^{(n)} \mid s_t, \hat{R}^{x,(n)}_{s_t}\right). \qquad (4.54)$$
Given a set of model parameters $\phi$, the diagonal covariance matrix of the density $b\left(X_t^{(n)} \mid s_t, R^{x,(n)}_{s_t}\right)$ can be recursively estimated by using the estimation algorithm introduced in Section 4.3, where the signal observations are known in this case. Assuming that the process is asymptotically wide-sense stationary and that the training sequences are sufficiently long, the initial state probabilities, $\pi^{(0)}$, and the initial conditional variances, $\lambda_{0|-1,s_0}$, have a negligible contribution to the total likelihood. Therefore, it is convenient to choose the stationary values as the initial values in the following optimization problem, i.e., $\hat{\lambda}_{0k|-1,s_0} = \hat{\Phi}\hat{\zeta}$ as the initial conditional variances and $\pi^{(0)} = \hat{\pi}$ as the initial state probabilities.
The conditional log-likelihood of the training set is given by
$$\mathcal{L}(\phi) = \sum_n \sum_{t=0}^{T_n-1} \log b\left(X_t^{(n)} \mid \mathcal{X}^{t-1,(n)}; \phi\right). \qquad (4.55)$$
Using the constraints in (4.6) and imposing $\hat{A}$ to be a transition probability matrix, the ML estimates of the model parameters $\phi$ can be obtained by solving the following nonlinear constrained optimization problem:
$$\hat{\phi} = \arg\max_{\phi}\, \mathcal{L}(\phi) \quad \text{s.t.} \quad \hat{\zeta}_i > 0\,,\ \hat{\alpha}_i \geq 0\,,\ \hat{\beta}_i \geq 0\,,\ \sum_{j=1}^M \hat{a}_{ij} = 1 \ \ \forall\, i \in \{1, \ldots, m\}\,. \qquad (4.56)$$

For a given parameter set $\phi$, the sequence of state-dependent conditional variances $\left\{\hat{\Lambda}_t\right\}$ can be evaluated recursively according to the method described in Section 4.3, and so can the set of conditional state probabilities $p\left(s_t \mid \mathcal{X}^{t-1}\right)$. The conditional log-likelihood (4.55) can then be numerically maximized under the linear constraints of (4.56) as specified in [6,7], or by using sequential quadratic programming [131,132]. The computational complexity required for the model estimation is much higher than that required for a single-regime GARCH model, since $m^2$ parameters are to be estimated for the transition probability matrix and, in addition, $3m$ GARCH parameters are to be evaluated. However, using the Markov-switching model, the optimization problem needs to be solved only once, prior to the restoration procedure. It is well known that the optimal set of parameters, $\phi$, is not necessarily unique in a Markovian model [129] and, in addition, the numerical optimization may only guarantee a local maximum of the likelihood function. However, the flexibility of the model enables better results than those achievable with a single-regime GARCH model [7,8]. This is also shown in our simulation results, both for MSTF-GARCH processes and for speech signals.
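As a concrete illustration of the constrained ML estimation, the following sketch fits the degenerate single-regime GARCH(1,1) case using sequential quadratic programming (SciPy's SLSQP solver). The helper names and the simulated data are ours; a full MSTF-GARCH fit would additionally include the transition-probability variables and the regime recursion of Section 4.3.

```python
import numpy as np
from scipy.optimize import minimize

def simulate_garch11(zeta, alpha, beta, T, rng):
    """Simulate complex spectral coefficients driven by a single-regime
    GARCH(1,1) conditional variance (assumed illustrative setup)."""
    lam = zeta / (1.0 - alpha - beta)          # start at the stationary variance
    x = np.empty(T, dtype=complex)
    for t in range(T):
        x[t] = np.sqrt(lam / 2) * (rng.standard_normal() + 1j * rng.standard_normal())
        lam = zeta + alpha * abs(x[t]) ** 2 + beta * lam
    return x

def neg_loglik(phi, x):
    """Negative conditional log-likelihood of a complex Gaussian GARCH(1,1)."""
    zeta, alpha, beta = phi
    lam = zeta / max(1.0 - alpha - beta, 1e-6)  # guard near the boundary
    nll = 0.0
    for xt in x:
        nll += np.log(np.pi * lam) + abs(xt) ** 2 / lam
        lam = zeta + alpha * abs(xt) ** 2 + beta * lam
    return nll

rng = np.random.default_rng(1)
x = simulate_garch11(zeta=0.2, alpha=0.3, beta=0.5, T=400, rng=rng)
res = minimize(neg_loglik, x0=[0.1, 0.2, 0.2], args=(x,), method="SLSQP",
               bounds=[(1e-6, None), (0.0, 0.99), (0.0, 0.99)],
               constraints=[{"type": "ineq", "fun": lambda p: 0.999 - p[1] - p[2]}])
zeta_hat, alpha_hat, beta_hat = res.x
```

The positivity and stationarity constraints of (4.56) map directly to the `bounds` and the inequality constraint above.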
4.6 Experimental results
In this section we demonstrate the performance of the proposed algorithm when applied to restoration of noisy MSTF-GARCH signals, and to estimation of conditional variances and squared absolute values of speech signals in the STFT domain.
4.6.1 MSTF-GARCH signals
The proposed model estimation and signal restoration algorithm has been applied to MSTF-GARCH models of 3 and 5 regimes, degraded by additive independent white noise with 0 to 15 dB input signal-to-noise ratio (SNR). For each state space (m =3, 5), a set of 20 stationary models have been simulated with uniformly distributed parameters on the 4.6. EXPERIMENTAL RESULTS 73 interval (0, 1]. For each model, the parameters, φ, are estimated from a set of 10 training signals, each of time length T = 100 and dimension K = 100. The estimated parameters are employed for restoration of a set of test signals containing 20 noisy signals of the same size, and basically 4 types of estimated variances are compared by incorporating them into the signal’s recursive MMSE estimator of (4.17) and (4.18). The “theoretical limit” is referred to as the estimator which exploits the true conditional variances, λt|t−1,st , of the simulated process. This estimator is the optimal estimator in the MMSE sense and its performance is compared with those of the recursive estimators. The “MSTF-GARCH, true model” is referred to as the recursive signal estimator, described in Section 4.3, which manipulates the true parameters set, φ, and the “MSTF-GARCH, m = i” estimator employs a set of estimated parameters, φ,ˆ assuming that the model has i regimes. For the “MSTF-GARCH, m = i” estimator, the set of parameters, φ,ˆ is estimated using the ML approach as described in Section 4.5. The performance of our algorithm is also compared with that of an estimator that assumes a “constant variance” process. For that estimator (only), the vector of “stationary” variances, are evaluated for each noisy signal from the corresponding clean signal.
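Test signals of this kind can be generated along the following lines. This is a sketch, not the exact simulation code used in the experiments: we assume a GARCH(1,1)-type recursion with a per-regime variance floor, driven by a hidden Markov chain, and all names and parameter values are ours.

```python
import numpy as np

def simulate_ms_garch(T, K, lam_min, alpha, beta, A, rng):
    """Simulate a T x K Markov-switching GARCH signal: a hidden Markov
    chain (transition matrix A) selects the regime, and each bin's
    conditional variance follows a GARCH(1,1)-type recursion with a
    per-regime variance floor (an assumed form of the model)."""
    m = len(lam_min)
    states = np.empty(T, dtype=int)
    X = np.zeros((T, K), dtype=complex)
    s_prev = 0
    for t in range(T):
        s = rng.choice(m, p=A[s_prev])
        if t == 0:
            lam = np.full(K, lam_min[s])     # start at the floor of the first state
        else:
            lam = lam_min[s] + alpha[s] * np.abs(X[t - 1]) ** 2 \
                  + beta[s] * (lam - lam_min[s_prev])
        # circularly-symmetric complex Gaussian draw with variance lam
        X[t] = np.sqrt(lam / 2) * (rng.standard_normal(K) + 1j * rng.standard_normal(K))
        states[t] = s
        s_prev = s
    return X, states

rng = np.random.default_rng(3)
A = np.full((3, 3), 0.1) + 0.7 * np.eye(3)   # sticky 3-state chain (rows sum to 1)
X, states = simulate_ms_garch(T=100, K=16,
                              lam_min=np.array([0.1, 0.5, 1.0]),
                              alpha=np.array([0.2, 0.3, 0.1]),
                              beta=np.array([0.5, 0.4, 0.6]), A=A, rng=rng)
```

Adding independent complex Gaussian noise to `X` then yields the noisy test signals on which the recursive estimators are evaluated.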
Figure 4.1(a) shows the SNR improvement obtained by using the different estimators when applied to 3-state MSTF-GARCH signals. It can be seen that even when a small number of regimes is assumed, the MSTF-GARCH estimator still outperforms the "constant variance" estimator, and the results achieved by assuming 3 or 5 regimes are comparable to those obtained by using the true model parameters. Figure 4.1(b) shows estimation results for 5-state MSTF-GARCH processes, under the assumption of 1, 3, 5 or 7 regimes. The estimation performance improves as the number of assumed regimes increases, but using a larger number of regimes than the true number (e.g., 7 instead of 5, or 5 instead of 3) yields less accurate results.
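The improvement measure plotted in Figure 4.1 is, in the standard reading, output SNR minus input SNR; a small sketch of that measure (function names ours, with a synthetic stand-in for an estimator's output):

```python
import numpy as np

def snr_db(x, ref):
    """SNR of an estimate/observation `x` against the clean reference."""
    return 10.0 * np.log10(np.sum(np.abs(ref) ** 2) / np.sum(np.abs(ref - x) ** 2))

def snr_improvement(clean, noisy, enhanced):
    """Output SNR minus input SNR."""
    return snr_db(enhanced, clean) - snr_db(noisy, clean)

rng = np.random.default_rng(2)
clean = rng.standard_normal(1000)
noise = 0.5 * rng.standard_normal(1000)
noisy = clean + noise
enhanced = clean + 0.25 * noise      # stand-in: residual noise reduced by a factor 4
gain = snr_improvement(clean, noisy, enhanced)
```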
Figure 4.1: SNR improvements obtained by using different MSTF-GARCH based estimators when applied to (a) 3-state MSTF-GARCH signals and (b) 5-state MSTF-GARCH signals. MSTF-GARCH models with various numbers of regimes are considered and compared with the true MSTF-GARCH parameters, the theoretical limit and a constant variance estimator.

The time-varying behavior of the recursive estimator is demonstrated for a 5-state MSTF-GARCH signal degraded by additive white noise with 5 dB SNR. Figure 4.2 shows a trace of the instantaneous output SNR for each time frame, obtained by the optimal estimator, the recursive estimators assuming 1 or 5 regimes (i.e., "MSTF-GARCH, $m = 1, 5$") and a "constant variance" estimator. The varying volatility of the process implies time-varying performance for all these estimators.

Figure 4.2: Trace of instantaneous output SNR achieved by the proposed algorithm when applied to a realization of a 5-state MSTF-GARCH process degraded by additive white noise with 5 dB SNR, and restored by an MSTF-GARCH estimator assuming 1 and 5 states.

Nevertheless, under the assumption of 5 regimes, our recursive estimator follows the optimal estimator with a relatively small degradation in performance. The single-regime estimator yields results comparable to the 5-regime estimator for frames with large input SNR. However, for frames with low input SNR, the results obtained by the single-regime estimator are comparable to those obtained by the "constant variance" estimator.
4.6.2 Speech signals
The idea of using different states for the enhancement of speech signals was first introduced by Drucker [28]. He assumed five categories of speech signals, comprising fricatives, stops, vowels, glides, and nasals. The application of HMMs to speech enhancement requires a higher number of states [35,62], since these models allow only a single density, or a finite set of mixture densities, for the spectral coefficients in each state. The GARCH-based models allow continuous values of conditional variances with possible transients resulting from switching states. Hence, a small number of states may be sufficient for the representation of the coefficients' second-order moments. Furthermore, the dynamics of the spectral coefficients are frequency dependent. Therefore, we assume different parameters in different subbands. The speech signals used in our evaluation are taken from the TIMIT database. The training set includes 10 different utterances from 10 different speakers, half male and half female. The speech signals are sampled at 8 kHz and normalized to the same energy. Transformation into the STFT domain is obtained by using a half-overlapping Hamming analysis window of 32 millisecond length. We consider 1-, 3- and 5-state MSTF-GARCH models for the speech signals and estimate the one-frame-ahead conditional variance for test speech signals not included in the training set. Figure 4.3 shows typical estimates of the one-frame-ahead conditional variance, $\hat{\lambda}_{tk|t-1}$, at frequencies of 1, 2 and 3 kHz, using the different MSTF-GARCH models and assuming independent model parameters in each frequency subband. The estimated conditional variances are compared with the clean signal's squared absolute value $\left|X_{tk}\right|^2$. It can be seen that by increasing the number of regimes, the conditional variance yields a better prediction of the squared absolute value of the signal.
Moreover, it can be seen that the conditional variance estimated by a single-regime model is smoother than that estimated by a multi-regime model, and the latter better tracks rapid changes in the signal's energy through possible switching of regimes. During the first few frames, the speech signal is absent and thus, as long as the squared absolute value is below the minimum variance allowed by the model, the predicted variances are determined by the model threshold. However, the predicted variances converge to the squared absolute value as soon as the latter exceeds this threshold. A larger number of states may allow a better representation of the conditional variance over different magnitude ranges and different volatilities, at the expense of greater computational complexity. Many speech enhancement algorithms employ the decision-directed approach for the speech spectral variance estimation [33,70]. Accordingly,
$$\hat{\lambda}^{\mathrm{DD}}_{tk} = \max\left\{\bar{\alpha}\left|\hat{X}_{t-1,k}\right|^2 + (1-\bar{\alpha})\left(\left|Y_{tk}\right|^2 - \sigma_k^2\right),\ \xi_{\min}\,\sigma_k^2\right\}, \qquad (4.57)$$

where $\bar{\alpha}$ ($0 \leq \bar{\alpha} \leq 1$) is a weighting factor that controls the trade-off between noise reduction and transient distortion introduced into the signal. A larger value of $\bar{\alpha}$ results in a greater reduction of the musical noise phenomenon, but at the expense of attenuated speech onsets and audible modifications of transient components. The parameter $\xi_{\min}$ is a lower bound on the a priori SNR.

Figure 4.3: Typical traces of one-frame-ahead conditional variance estimates for speech signals at frequencies (a) 1 kHz, (b) 2 kHz and (c) 3 kHz. The conditional variances are estimated by MSTF-GARCH models of a single state (dashed-dotted line), 3 states (dotted line) and 5 states (dashed line), and compared with the clean signal's squared absolute value (solid line).

Figure 4.4: Typical traces of estimated squared absolute values for a speech signal at a frequency of 2 kHz. The variances are estimated by a 5-state MSTF-GARCH model (dashed-dotted line) and the decision-directed approach (dotted line), and compared with the clean signal's squared absolute value (solid line). The SNRs are (a) 0 dB and (b) 10 dB.

The GARCH modeling enables an analytical derivation of the decision-directed estimator [69]. Considering the degenerate case of a single-state, single-frequency ARCH(1) model (i.e., $\beta = 0$), the update step (4.22) can be written as
$$\hat{\lambda}_{tk|t} = \bar{\alpha}_{tk}\,\hat{\lambda}_{tk|t-1} + \left(1 - \bar{\alpha}_{tk}\right)\left(\left|Y_{tk}\right|^2 - \sigma_k^2\right) \qquad (4.58)$$

with

$$\bar{\alpha}_{tk} \triangleq 1 - \frac{\hat{\lambda}^2_{tk|t-1}}{\left(\hat{\lambda}_{tk|t-1} + \sigma_k^2\right)^2}\,, \qquad (4.59)$$

and $0 < \bar{\alpha}_{tk} < 1$. Substituting the propagation step for $\hat{\lambda}_{tk|t-1}$ (4.23) into (4.58) with $\alpha = 1$, we obtain
$$\hat{\lambda}_{tk|t} = \bar{\alpha}_{tk}\, E\left\{\left|X_{t-1,k}\right|^2 \mid \mathcal{Y}^{t-1}\right\} + \left(1 - \bar{\alpha}_{tk}\right)\left(\left|Y_{tk}\right|^2 - \sigma_k^2\right) + \bar{\alpha}_{tk}\,\zeta\,. \qquad (4.60)$$

For $\zeta \ll E\left\{\left|X_{t-1,k}\right|^2 \mid \mathcal{Y}^{t-1}\right\}$, (4.60) is similar to the decision-directed variance estimation (4.57) with $\bar{\alpha}_{tk} \equiv \bar{\alpha}$ and with $E\left\{\left|X_{t-1,k}\right|^2 \mid \mathcal{Y}^{t-1}\right\}$ in place of $\left|\hat{X}_{t-1,k}\right|^2$, the squared absolute value of the spectral coefficient estimate based on the observations $\mathcal{Y}^{t-1}$. Accordingly, the degenerate ARCH-based variance estimation with $\alpha = 1$ and a low-valued $\zeta$ is closely related to the decision-directed estimator with a time-varying, frequency-dependent weighting factor $\bar{\alpha}_{tk}$. However, the GARCH (and ARCH) modeling approach treats the spectral variance as a random process, whereas the decision-directed approach assumes the spectral variance is a parameter which is heuristically evaluated. In addition, the decision-directed approach thresholds the estimated variance to be larger than $\xi_{\min}\sigma_k^2$, while in the GARCH modeling the lower bound is inherently incorporated into the variance estimation. Since $\hat{\lambda}_{tk|t-1} > \zeta$, from (4.22) we obtain the following lower bound:

$$\hat{\lambda}_{tk|t} > \frac{\zeta}{\zeta + \sigma_k^2}\,\sigma_k^2 + \frac{\zeta}{\zeta + \sigma_k^2}\left|Y_{tk}\right|^2 > 0\,. \qquad (4.61)$$

Modeling the spectral coefficients as an MSTF-GARCH process allows further flexibility in the variance estimation. Figure 4.4 demonstrates the estimated squared absolute values of a speech signal corrupted by white Gaussian noise with an SNR of (a) 0 dB and (b) 10 dB. The signal's squared absolute value at a frequency of 2 kHz is compared with its variance estimated using a 5-state MSTF-GARCH model and using the decision-directed approach.
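The relation between (4.57) and (4.58)-(4.59) can be made concrete with a short sketch (function names ours; `abar=0.98` is a typical decision-directed choice, not a value prescribed here):

```python
def decision_directed(x_prev_sq, y_sq, sigma2, abar=0.98, xi_min=1e-2):
    """Decision-directed variance estimate, cf. (4.57): fixed weighting
    factor `abar` and an explicit lower bound xi_min * sigma2."""
    return max(abar * x_prev_sq + (1.0 - abar) * (y_sq - sigma2),
               xi_min * sigma2)

def arch_update(lam_pred, y_sq, sigma2):
    """Degenerate ARCH-based update, cf. (4.58)-(4.59): the same recursion
    with a time-varying, SNR-dependent weighting factor."""
    abar = 1.0 - lam_pred ** 2 / (lam_pred + sigma2) ** 2
    return abar * lam_pred + (1.0 - abar) * (y_sq - sigma2), abar

# worst case y_sq = 0: the ARCH update stays positive without any explicit floor
lam_dd = decision_directed(x_prev_sq=1.0, y_sq=0.0, sigma2=1.0)
lam_arch, abar = arch_update(lam_pred=1.0, y_sq=0.0, sigma2=1.0)
```

Even with a zero observation, the ARCH update returns a strictly positive variance, consistent with the inherent lower bound of (4.61), whereas the decision-directed estimate relies on the external floor $\xi_{\min}\sigma_k^2$.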
This shows that the MSTF-GARCH approach with 5 states yields a better estimate of the squared absolute value under both high and low SNR conditions, especially in low-energy bins. Furthermore, the MSTF-GARCH approach enables better tracking of rapid changes in the coefficients' energy than the decision-directed approach. The difference between Figure 4.3 and Figure 4.4 is that the former demonstrates the prediction of the coefficients' variances (i.e., the conditional variance) in a noiseless environment, while the latter shows the estimation of their second-order moments in a noisy environment. The variance prediction has a small delay in tracking rapid changes, and the update step yields a better estimate of the squared absolute value in high-energy bins. However, when noisy observations are employed, low-energy bins may be below the noise level and thus the estimation may be less accurate (for both the MSTF-GARCH approach and the decision-directed approach). Figures 4.3 and 4.4 demonstrate that the proposed MSTF-GARCH model, when compared to a single-regime model or to the decision-directed approach, improves the variance prediction and the squared-absolute-value estimation of speech signals in the STFT domain. Still, one needs to derive a frequency-dependent model and to estimate the signal presence probability in each time-frequency bin of the noisy speech signal based on the proposed model, which is a subject for further research.
4.7 Conclusions
We have proposed a statistical model for nonstationary processes with a time-varying volatility structure in the STFT domain. Exploiting the advantages of both the conditional heteroscedasticity structure of GARCH models and the time-varying characteristics of hidden Markov chains, we model the expansion coefficients as a multivariate complex GARCH process with Markov-switching regimes. The correlation between successive coefficients in the time-frequency domain is taken into consideration by using the GARCH formulation, which specifies the conditional variance as a linear function of its past values and past squared innovations. The time-varying structure of the conditional variance is determined by a hidden Markov chain, which allows a different GARCH formulation in each state.
We showed that an ML estimate of the model can be practically obtained from training signals (assuming that the number of states is known), and developed a recursive algorithm for estimating the signal and its conditional variance in the STFT domain from its noisy observations. The conditional variance is recursively estimated for any regime by iterating propagation and update steps, while the evaluation of the regime conditional probabilities is based on the recursive correlation of the process. Experimental results demonstrate the improved performance of the proposed recursive algorithm compared to using an estimator which assumes a stationary process, even when the number of assumed regimes is smaller than the true number. When the number of assumed regimes approaches the true one, the recursive estimator yields comparable restoration results to those achievable by using the true model parameters. The conditional variance of an MSTF-GARCH process, as well as the instantaneous SNR on each frame, change over time. It is demonstrated that the recursive estimation approach has relatively small performance degradation compared to the theoretical estimation limit in the MMSE sense. Performance evaluation with real speech signals demonstrates better variance estimation when using a multi-regime model, compared to using a single-regime model, and improved squared absolute value estimation in a noisy environment compared to using the decision-directed approach. 
Several extensions of this work, which may be interesting for further research, include analysis of the algorithm's sensitivity to the number of assumed states, the parameter values and the training set; generalization of the multivariate complex Markov-switching GARCH model such that the conditional covariance matrix is not necessarily diagonal and the correlation between distinct frequency bins is also taken into account; and, finally, estimation of the signal presence probability in the time-frequency domain and modification of the recursive signal estimation algorithm under signal presence uncertainty.

4.A Application of Markov-Switching GARCH Model to Speech Enhancement in Subbands
In this appendix, we introduce an application of the Markov-switching GARCH model to spectral speech enhancement. A GARCH model is utilized with Markov-switching regimes, where the parameters are assumed to be frequency variant. The model parameters are evaluated in each frequency subband, and a special state (regime) is defined for the case where speech coefficients are absent or below a threshold level. The problem of speech enhancement under speech presence uncertainty is addressed, and it is shown that a soft voice activity detector may be inherently incorporated within the algorithm. Experimental results demonstrate the potential of our proposed model to improve noise reduction while retaining weak components of the speech signal.
4.A.1 Introduction
This appendix is based on [133].

Statistical modeling of speech signals in the short-time Fourier transform (STFT) domain is of much interest in many speech enhancement applications. The Gaussian model [33] enables the derivation of useful estimators for the speech expansion coefficients, such as the minimum mean-square error (MMSE) estimator of the short-term spectral amplitude (STSA) and the MMSE estimator of the log-spectral amplitude (LSA) [33,34]. Recently, a generalized autoregressive conditional heteroscedasticity (GARCH) model has been introduced for statistically modeling speech signals in the STFT domain [24]. However, that model assumes that the parameters are both time and frequency invariant, and it also requires an independent detector for speech activity in the time-frequency domain. A Markov-switching time-frequency GARCH (MSTF-GARCH) model has been proposed in [127] for modeling nonstationary signals in the time-frequency domain. Accordingly, the parameters are allowed to change in time according to the state of a hidden Markov chain (e.g., switching between speech phonemes), but the parameters are still frequency invariant. The model is estimated from training signals using the maximum likelihood (ML) approach, and a recursive algorithm has been derived for conditional variance estimation and signal reconstruction from noisy observations. However, not only may different phonemes result in different GARCH parameters; speech signals are also generally characterized by different volatility and energy levels in various frequency bands. Therefore, different parameters may better represent different frequency subbands.
In this appendix, we modify the MSTF-GARCH model by assuming different Markov chains in distinct subbands with identical state transition probabilities. The GARCH parameters are state dependent and frequency variant. We define an additional state for the case where speech coefficients are absent (or below a certain threshold level) and introduce a parameter estimation method which is computationally more efficient than the traditional ML approach. Furthermore, the probability of the speech-absence state, which is naturally generated in the reconstruction algorithm, can be used as a soft voice activity detector. Experimental results demonstrate improved noise reduction performance while preserving weak components of the speech signal.
Section 4.A.2 introduces the statistical model. In Section 4.A.3, we show how the model parameters can be estimated, and in Section 4.A.4, we derive the speech enhancement algorithm based on the proposed model. Finally, in Section 4.A.5, we evaluate the performance of the proposed algorithm.
4.A.2 Model formulation
Let $\left\{X_{tk} \mid t = 0, 1, \ldots, T-1,\ k = 0, 1, \ldots, K-1\right\}$ denote the coefficients of a speech signal in the STFT domain, where $t$ is the time-frame index and $k$ is the frequency-bin index. Let $\{v_{tk}\}$ be iid complex Gaussian random variables with zero mean and unit variance, and let $\kappa_n$ denote the $n$th frequency subband, with $n \in \{1, 2, \ldots, N\}$. Then

$$X_{tk} = \sqrt{\lambda_{tk|t-1,s_t}}\; v_{tk}\,, \quad k \in \kappa_n\,, \qquad (4.62)$$

$$\lambda_{tk|t-1,s_t} = \lambda_{\min,n,s_t} + \alpha_{n,s_t} \left|X_{t-1,k}\right|^2 + \beta_{n,s_t} \left(\lambda_{t-1,k|t-2,s_{t-1}} - \lambda_{\min,n,s_{t-1}}\right), \qquad (4.63)$$

where $\lambda_{\min,n,s_t} > 0$ and $\alpha_{n,s_t}, \beta_{n,s_t} \geq 0$ are sufficient constraints for the positivity of the one-frame-ahead conditional variance, given that the initial conditions satisfy $\lambda_{0k|-1,s_0} \geq \lambda_{\min,n,s_0}$ for all $k \in \kappa_n$ and $s_0 = 0, 1, \ldots, m$. Note that the model formulation in [127] is slightly different. We assume that the parameters are frequency dependent, and each $\lambda_{\min,n,s_t}$ defines the minimum value of the conditional variance in subband $\kappa_n$ under $S_t\left(\kappa_n\right) = s_t$. Let $a_{s_{t-1},s_t} \triangleq p\left(S_t = s_t \mid S_{t-1} = s_{t-1}\right)$, let $\pi_s$ denote the stationary probability of state $s$, and let $\Psi$ be an $(m+1) \times (m+1)$ matrix with elements

$$\psi_{s+1,\tilde{s}+1} = \frac{\pi_{\tilde{s}}}{\pi_s}\, a_{s,\tilde{s}} \left(\alpha_{n,s} + \beta_{n,s}\right), \quad s, \tilde{s} = 0, 1, \ldots, m\,. \qquad (4.64)$$

Then, a necessary and sufficient condition for asymptotic wide-sense stationarity of the model defined in (4.62) and (4.63) is $\rho(\Psi) < 1$, where $\rho(\cdot)$ denotes the spectral radius [128]. This condition is also necessary to ensure a finite second-order moment for the process. The unconditional expectation of the state-dependent one-frame-ahead conditional variance follows

$$E\left\{\lambda_{tk|t-1,s_t}\right\} = \lambda_{\min,n,s_t} + \alpha_{n,s_t} E\left\{\left|X_{t-1,k}\right|^2 \mid s_t\right\} + \beta_{n,s_t} E\left\{\lambda_{t-1,k|t-2,S_{t-1}} \mid s_t\right\} - \beta_{n,s_t} E\left\{\lambda_{\min,n,S_{t-1}} \mid s_t\right\} \qquad (4.65)$$

with

$$E\left\{\lambda_{\min,n,S_{t-1}} \mid s_t\right\} = \sum_{s_{t-1}} p\left(s_{t-1} \mid s_t\right) \lambda_{\min,n,s_{t-1}} = \sum_{s_{t-1}} \frac{\pi_{s_{t-1}}}{\pi_{s_t}}\, a_{s_{t-1},s_t}\, \lambda_{\min,n,s_{t-1}}\,. \qquad (4.66)$$

Therefore, the stationary variance of the process is given by (see [128])

$$\lim_{t \to \infty} E\left\{\left|X_{tk}\right|^2\right\} = \boldsymbol{\pi} \left(I_{m+1} - \Psi\right)^{-1} \tilde{\boldsymbol{\lambda}}_{\min,n}\,, \qquad (4.67)$$

where $\boldsymbol{\pi}$ is a row vector of the stationary probabilities, $I_{m+1}$ is the identity matrix of order $m+1$,

$$\tilde{\boldsymbol{\lambda}}_{\min,n} \triangleq \left[\tilde{\lambda}_{\min,n,0}, \tilde{\lambda}_{\min,n,1}, \ldots, \tilde{\lambda}_{\min,n,m}\right]^T \qquad (4.68)$$

and

$$\tilde{\lambda}_{\min,n,s} \triangleq \lambda_{\min,n,s} - \beta_{n,s} \sum_{\tilde{s}} \frac{\pi_{\tilde{s}}}{\pi_s}\, a_{\tilde{s},s}\, \lambda_{\min,n,\tilde{s}}\,. \qquad (4.69)$$

4.A.3 Model estimation

The estimation of a GARCH model with Markov regimes is generally obtained from a training set using the ML approach [5,13]. However, the maximization of the likelihood function is numerically unstable for multi-regime processes, and generally only a local maximum can be obtained. Assuming an $(m+1)$-state Markov chain with GARCH of order $(1,1)$ in each regime, the maximization process involves $(m+1)^2$ variables for the transition probabilities and an additional $3(m+1)$ variables for the GARCH parameters (three in each regime). Speech signals in the STFT domain exhibit different magnitude levels in different subbands, and the coefficients are generally sparse. Therefore, we limit the conditional variances in each subband to a dynamic range of $\eta_g$ dB and define a special state for the speech-absence hypothesis. Let $\zeta_g \triangleq \max_{t,k} \left|X_{tk}\right|^2$ and $\zeta_n \triangleq \max_{t,k \in \kappa_n} \left|X_{tk}\right|^2$ denote the global and local maximum energies of the coefficients (the latter in subband $\kappa_n$), respectively. Then, for the speech-absence state (namely, $s_t = 0$), we set

$$\lambda_{\min,n,0} = 10^{\log_{10}\zeta_g - \eta_g/10}\,, \qquad \alpha_{n,0} = \beta_{n,0} = 0\,. \qquad (4.70)$$

Under speech presence, a local dynamic range of $\eta_\ell$ dB ($\eta_\ell < \eta_g$) is assumed for the conditional variances. Furthermore, the parameters $\lambda_{\min,n,s}$, $s > 0$, are chosen to enable tracking of transients between different magnitude levels, which result in switching of the active state.
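Before fixing the remaining parameters, the wide-sense stationarity condition $\rho(\Psi) < 1$ of Section 4.A.2 can be checked numerically. The sketch below assumes the index convention $\psi_{s+1,\tilde{s}+1} = (\pi_{\tilde{s}}/\pi_s)\, a_{s,\tilde{s}} (\alpha_{n,s} + \beta_{n,s})$ for (4.64); all names and the toy two-state chain are ours.

```python
import numpy as np

def stationarity_matrix(A, pi, alpha, beta):
    """Build the (m+1)x(m+1) matrix Psi of (4.64); the process is
    asymptotically wide-sense stationary iff its spectral radius < 1."""
    m1 = len(pi)
    psi = np.empty((m1, m1))
    for s in range(m1):
        for st in range(m1):
            psi[s, st] = A[s, st] * (pi[st] / pi[s]) * (alpha[s] + beta[s])
    return psi

A = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([2/3, 1/3])            # stationary distribution of A (pi @ A == pi)
alpha = np.array([0.2, 0.3])
beta = np.array([0.5, 0.4])
psi = stationarity_matrix(A, pi, alpha, beta)
rho = np.max(np.abs(np.linalg.eigvals(psi)))   # spectral radius
```

With both regimes satisfying $\alpha_s + \beta_s = 0.7 < 1$, the matrix here is similar to $0.7A$, so the spectral radius is $0.7$ and the process is wide-sense stationary.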
Without loss of generality, we sort the states according to the minimum variance level, such that

$$\lambda_{\min,n,1} = \max\left\{\lambda_{\min,n,0},\ 10^{\log_{10}\zeta_n - \eta_\ell/10}\right\}, \qquad (4.71)$$

and for $s = 2, \ldots, m$, the $\lambda_{\min,n,s}$ are log-spaced between $\lambda_{\min,n,1}$ and $\zeta_n$. Each state practically represents a different floor level for the spectral coefficients' variance. The parameters $\alpha_{n,s}, \beta_{n,s}$ for $s > 0$ set the volatility level of the conditional variance and are chosen as follows. Assuming an immutable state $s$, the stationary variance follows

$$\lambda_{\infty,n,s} \triangleq \lim_{t \to \infty,\, k \in \kappa_n} \lambda_{tk|t-1,s} = \lambda_{\min,n,s}\, \frac{1 - \beta_{n,s}}{1 - \alpha_{n,s} - \beta_{n,s}}\,, \qquad (4.72)$$

provided that $\alpha_{n,s} + \beta_{n,s} < 1$. Since different states are related to different dynamic ranges in ascending order, we constrain $\lambda_{\infty,n,s} \leq \lambda_{\min,n,s+1}$ and therefore

$$\frac{1 - \beta_{n,s}}{1 - \alpha_{n,s} - \beta_{n,s}} \leq \frac{\lambda_{\min,n,s+1}}{\lambda_{\min,n,s}}\,. \qquad (4.73)$$

The autoregressive parameters, $\beta_{n,s}$, are chosen experimentally, while the moving-average parameters, $\alpha_{n,s}$, are chosen to satisfy equality in (4.73). Although the clean signal is assumed to be available for the model estimation, only the high-energy values in each subband are needed. These values can be practically estimated from the noisy coefficients using the spectral subtraction approach. The state transition probabilities can be estimated from test signals such that each active state is determined by the energy level of the subband.

4.A.4 Spectral enhancement of noisy speech

Let $D_{tk}$ denote the spectral coefficients of a noise signal which is uncorrelated with the speech signal, and assume that $D_{tk} \sim \mathcal{CN}\left(0, \sigma^2_{tk}\right)$. Let $Y_{tk} = X_{tk} + D_{tk}$ be the noisy observations and let $\mathcal{Y}^t \triangleq \left\{Y_{\tau k} \mid \tau = 0, 1, \ldots, t,\ k = 0, 1, \ldots, K-1\right\}$ denote the set of observed coefficients up to time $t$. The noise variance $\sigma^2_{tk}$ is assumed to be known; in practice, it can be estimated using the improved minima controlled recursive averaging approach [71].
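The subband parameter design of Section 4.A.3, with floors (4.70)-(4.71) log-spaced up to the subband peak and $\alpha$ set by equality in (4.73), can be sketched as follows (the function name, the treatment of the top state, and the toy values are ours):

```python
import numpy as np

def design_floors_and_alphas(zeta_n, lam_min_0, eta_l_db, m, beta):
    """Per-state variance floors lam_min[0..m] and the alpha parameters:
    lam_min[1] follows (4.71), lam_min[2..m] are log-spaced up to the
    subband peak energy zeta_n (one reading of the text), and each alpha
    satisfies equality in (4.73)."""
    lam_min_1 = max(lam_min_0, zeta_n * 10 ** (-eta_l_db / 10))
    lam_min = np.concatenate(([lam_min_0],
                              np.logspace(np.log10(lam_min_1), np.log10(zeta_n), m)))
    alpha = np.zeros(m + 1)
    for s in range(1, m):
        ratio = lam_min[s + 1] / lam_min[s]
        # equality in (4.73): (1 - beta)/(1 - alpha - beta) = ratio
        alpha[s] = (1 - beta) * (1 - 1 / ratio)
    # top state has no higher floor; alpha[m] is left at 0 here (a choice
    # of this sketch, not prescribed by the text)
    return lam_min, alpha

lam_min, alpha = design_floors_and_alphas(zeta_n=1.0, lam_min_0=1e-5,
                                          eta_l_db=20.0, m=3, beta=0.8)
```

Since the floors ascend, each `ratio` exceeds one, so every $\alpha_{n,s} \geq 0$ and $\alpha_{n,s} + \beta_{n,s} < 1$, keeping each per-state recursion stationary.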
Reconstruction of the one-frame-ahead conditional variances of the speech coefficients is carried out recursively for each state by

$$\hat{\lambda}_{tk|t-1,s_t} = \lambda_{\min,n,s_t} + \alpha_{n,s_t} E\left\{\left|X_{t-1,k}\right|^2 \mid \mathcal{Y}^{t-1}, s_t\right\} + \beta_{n,s_t} E\left\{\lambda_{t-1,k|t-2,S_{t-1}} \mid \mathcal{Y}^{t-1}, s_t\right\} - \beta_{n,s_t} E\left\{\lambda_{\min,n,S_{t-1}} \mid \mathcal{Y}^{t-1}, s_t\right\}, \qquad (4.74)$$

where

$$E\left\{\left|X_{t-1,k}\right|^2 \mid \mathcal{Y}^{t-1}, s_t\right\} = \sum_{s_{t-1}} p\left(s_{t-1} \mid s_t, \mathcal{Y}^{t-1}\right) E\left\{\left|X_{t-1,k}\right|^2 \mid \mathcal{Y}^{t-1}, s_{t-1}\right\} \triangleq \sum_{s_{t-1}} p\left(s_{t-1} \mid s_t, \mathcal{Y}^{t-1}\right) \hat{\lambda}_{t-1,k|t-1,s_{t-1}}\,, \qquad (4.75)$$

$$E\left\{\lambda_{t-1,k|t-2,S_{t-1}} \mid \mathcal{Y}^{t-1}, s_t\right\} \simeq \sum_{s_{t-1}} p\left(s_{t-1} \mid s_t, \mathcal{Y}^{t-1}\right) \hat{\lambda}_{t-1,k|t-2,s_{t-1}} \qquad (4.76)$$

and

$$E\left\{\lambda_{\min,n,S_{t-1}} \mid \mathcal{Y}^{t-1}, s_t\right\} = \sum_{s_{t-1}} p\left(s_{t-1} \mid s_t, \mathcal{Y}^{t-1}\right) \lambda_{\min,n,s_{t-1}}\,. \qquad (4.77)$$

A detailed algorithm for the conditional variance restoration is described in [127]. Having an estimate of the speech coefficient's second-order moment under each state, $\hat{\lambda}_{tk|t,s_t}$, estimates of the speech coefficients are obtained by minimizing the mean-square error of the log-spectral amplitude (LSA). Let

$$\hat{\xi}_{tk,s_t} \triangleq \frac{\hat{\lambda}_{tk|t,s_t}}{\sigma^2_{tk}}\,, \qquad \hat{\vartheta}_{tk,s_t} \triangleq \frac{\hat{\xi}_{tk,s_t}}{1 + \hat{\xi}_{tk,s_t}} \cdot \frac{\left|Y_{tk}\right|^2}{\sigma^2_{tk}}\,. \qquad (4.78)$$

Then, the LSA estimate of the speech coefficients is given by

$$\hat{X}_{tk} = Y_{tk} \prod_{s_t} G\left(\hat{\xi}_{tk,s_t}, \hat{\vartheta}_{tk,s_t}\right)^{p(s_t \mid \mathcal{Y}^t)}, \qquad (4.79)$$

where

$$G\left(\xi, \vartheta\right) = \frac{\xi}{1 + \xi} \exp\left(\frac{1}{2} \int_\vartheta^\infty \frac{e^{-t}}{t}\, dt\right) \qquad (4.80)$$

is the LSA gain function [34], and the state probabilities, $p\left(s_t \mid \mathcal{Y}^t\right)$, are evaluated according to [127].

4.A.5 Experimental results and discussion

In this section, we demonstrate the application of the proposed model to speech enhancement and to speech presence probability estimation. The enhancement evaluation includes two objective quality measures: segmental SNR and log-spectral distortion (LSD). The speech signals used in our evaluation are taken from the TIMIT database. The signals are sampled at 16 kHz, degraded by nonstationary factory noise and transformed into the STFT domain using half-overlapping Hamming windows of 32 msec length.
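The gain (4.80) of the previous section and its regime-weighted combination (4.79) can be sketched directly, since the integral in (4.80) is the exponential integral $E_1$ (function names and toy values are ours):

```python
import numpy as np
from scipy.special import exp1

def lsa_gain(xi, theta):
    """LSA gain (4.80): G(xi, theta) = xi/(1+xi) * exp(E1(theta)/2),
    with E1 the exponential integral appearing in (4.80)."""
    return xi / (1.0 + xi) * np.exp(0.5 * exp1(theta))

def multistate_lsa(y, xi, theta, p_state):
    """Regime-averaged LSA estimate (4.79): per-state gains combined as a
    product weighted by the state posterior probabilities."""
    return y * np.prod(lsa_gain(xi, theta) ** p_state)

g = lsa_gain(1.0, 1.0)
x_hat = multistate_lsa(1.0 + 0.0j, np.array([0.5, 2.0]),
                       np.array([0.4, 1.5]), np.array([0.3, 0.7]))
```

Because $E_1(\vartheta) > 0$, the gain always exceeds the Wiener-like factor $\xi/(1+\xi)$, which is the familiar behavior of the LSA estimator.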
Twenty subbands are considered with global and local dynamic ranges of $\eta_g = 50$ dB and $\eta_\ell = 20$ dB, and four-state Markov chains (i.e., $m = 3$) for each subband. The autoregressive parameters used in our simulations are $\beta_{n,s} = 0.8$ for all $n$ and $s > 0$. In each subband, the state persistence probability is 0.8 and the $a_{s,\tilde{s}}$ are equal for all $s \neq \tilde{s}$. Figure 4.5 demonstrates the spectrograms and waveforms of a clean signal, a noisy signal with SNR of 5 dB, and the enhanced signal obtained by the proposed algorithm. It shows that the background noise is highly attenuated while weak speech components are retained, even while noise transients occur. Furthermore, the segmental SNR and the LSD are improved.

Figure 4.5: Speech spectrograms and waveforms. (a) Clean signal: "Draw every outer line."; (b) speech corrupted by factory noise with 5 dB SNR (LSD = 6.68 dB, SegSNR = 0.05 dB); (c) speech reconstructed by using the 4-state model (LSD = 3.14 dB, SegSNR = 6.76 dB).

Figure 4.6: Conditional speech presence probability obtained by the proposed algorithm (MSTF-GARCH) and by the decision-directed (DD) based algorithm.

A subjective study of speech spectrograms and informal listening tests confirms that the quality of the enhanced speech is improved by using frequency-dependent parameters which are derived from the different energy levels. The conditional speech presence probability resulting from the enhancement algorithm is compared with the statistical model-based voice activity detector (with hang-over) of Sohn et al. [52] when applied to subbands.
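The two objective measures used above can be computed as in the following sketch; the frame length, clipping limits, and the logarithm guard are common illustrative choices and are not taken from the thesis:

```python
import numpy as np

def segmental_snr_db(clean, enhanced, frame_len=512, clip=(-10.0, 35.0)):
    """Frame-averaged segmental SNR in dB (frame length and clipping
    limits are illustrative defaults)."""
    snrs = []
    for i in range(len(clean) // frame_len):
        s = clean[i * frame_len:(i + 1) * frame_len]
        e = enhanced[i * frame_len:(i + 1) * frame_len]
        num, den = np.sum(s ** 2), np.sum((s - e) ** 2)
        if num > 0 and den > 0:
            snrs.append(np.clip(10.0 * np.log10(num / den), *clip))
    return float(np.mean(snrs))

def lsd_db(S, S_hat, eps=1e-10):
    """Log-spectral distortion in dB between magnitude-squared STFT arrays
    of shape (frames, bins); eps guards the logarithm."""
    d = 10.0 * np.log10((S + eps) / (S_hat + eps))
    return float(np.mean(np.sqrt(np.mean(d ** 2, axis=1))))
```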
The latter evaluates the conditional likelihood $\Lambda_t \triangleq p(\mathcal{Y}^t \mid S_t \neq 0)\,/\,p(\mathcal{Y}^t \mid S_t = 0)$ by utilizing the decision-directed approach for the a priori SNR estimation (assuming only two states). The conditional speech presence probability is obtained by $p(S_t \neq 0 \mid \mathcal{Y}^t) = \Lambda_t \mu / (1 + \Lambda_t \mu)$, where $\mu \triangleq p(S_t \neq 0)/p(S_t = 0)$ is the a priori probabilities ratio. Figure 4.6 demonstrates the speech presence probabilities achieved when both algorithms are applied to a speech signal corrupted by a white Gaussian noise with SNR of 15 dB. The instantaneous SNR is defined as the ratio between the norms of the clean signal and the noise signal in each subband. It can be seen that the speech presence probability derived from our proposed algorithm results in a higher dynamic range for the probabilities and in much lower values for low-energy coefficients. Furthermore, the probabilities ascribed to each instantaneous SNR have higher variance, resulting from the Markovian nature of the model.

Chapter 5

State Smoothing in Markov-Switching GARCH Models

In this chapter, we address the problem of state smoothing in path-dependent Markov-switching generalized autoregressive conditional heteroscedasticity (GARCH) processes. We develop a smoothing algorithm which extends the forward-backward recursions of Chang and Hancock and the stable backward recursion of Lindgren, Askar and Derin. Two recursive steps are derived for the evaluation of conditional densities of future observations. The first step is an upward recursion which manipulates the future observations for the evaluation of their conditional densities, and the second step is a backward recursion which integrates over the possible future paths. Experimental results demonstrate the improvement in performance compared to using causal estimation.
This chapter is based on [130].

5.1 Introduction

State estimation is of both theoretical and practical importance whenever the underlying statistical model switches regimes over time [5, 129]. State smoothing (i.e., noncausal state estimation) of hidden Markov processes (HMPs) was originally introduced by Chang and Hancock [118]. Their solution for estimating the noncausal state probability, which is implemented using forward-backward recursions, decouples into a forward recursion for the evaluation of the joint probability density of the current state and all observations up to the same time, and a backward recursion for obtaining the density of the future observations given the current state. Lindgren [119] and Askar and Derin [120] developed an alternative stable backward recursion for the state smoothing in HMPs. Kim [134] extended the stable backward recursion to nonmemoryless autoregressive hidden Markov processes (AR-HMPs), where both the current state (regime) and a finite set of past values are required for the conditional density evaluation; see also [5, chap. 22].

Generalized autoregressive conditional heteroscedasticity (GARCH) models, as well as Markov-switching GARCH (MS-GARCH) models, are widely used in the field of econometrics for volatility forecasting of economic rates [7, 8, 12, 13], and they have recently been utilized in several signal processing applications. In [135], GARCH modeling has been applied to spatially non-uniform noise in multichannel signal processing. In [26], a regime-switching GARCH model has been utilized for speech recognition, and a complex-valued GARCH model has been proposed in [24, 25] for modeling speech signals in the short-time Fourier transform (STFT) domain for the application of speech enhancement.
Generally, when incorporating GARCH processes with switching regimes, the volatility evaluation requires knowledge of the pertinent history of the regime-switching GARCH process, including the regime path [12, 13]. Properties of path-dependent MS-GARCH models have been studied by Francq et al. [122]. In order to estimate the model parameters, they showed that the conditional likelihood depends on all the possible paths, and for a Markov-switching ARCH model (in which case there is no dependency on past active regimes) they showed that the forward-backward recursions can be employed for the conditional likelihood evaluation. The complex-valued GARCH model has been shown to be useful in speech enhancement applications [24, 25]. Motivated by extending the dynamic formulation of the time-frequency GARCH model and enabling a better fit for a process with a more complicated time-varying statistical behavior, a Markov-switching time-frequency GARCH (MSTF-GARCH) model has been introduced [127]. However, existing smoothing solutions are inapplicable in the case of a path-dependent MS-GARCH model, since both past observations and the regime path are required for the conditional variance estimation, whereas existing smoothing techniques rely on the assumption that, given the current state, past active regimes are statistically independent of future densities.

In this chapter, we develop a state smoothing approach for MSTF-GARCH processes. The dependency of the conditional variance on past observations and past active regimes is taken into consideration as we generalize both the forward-backward recursions of Chang and Hancock [118] and the stable backward recursion of Lindgren [119] and Askar and Derin [120]. We derive two recursive steps for the evaluation of conditional densities of future observations.
The first step is an upward recursion which manipulates the future observations for the evaluation of their conditional densities, corresponding to all possible future paths. The second step is a backward recursion which integrates over these paths to evaluate the future densities required for the noncausal state probability. The computational complexity of the generalized recursions grows exponentially with the number of future observations employed for the fixed-lag smoothing. However, experimental results demonstrate that the significant part of the improvement in performance, compared to using causal estimation, is achieved by considering only a few future observations.

The organization of this chapter is as follows: In Section 5.2, we introduce the Markov-switching time-frequency GARCH model and formulate the state smoothing problem. In Section 5.3, we develop generalized forward-backward recursions as well as a generalized stable backward recursion, and derive our noncausal state probability approach. Finally, in Section 5.4 we provide experimental results which demonstrate state smoothing for noisy Markov-switching time-frequency GARCH processes.

5.2 Problem formulation

Let $\mathbf{X}_t \in \mathbb{C}^K$ be a $K$-dimensional random vector at a discrete time $t$, and let $X_{tk}$, $k \in \{0, \ldots, K-1\}$, be its $k$th element. Let $\mathcal{X}_{t_1}^{t_2} = \{\mathbf{X}_t \mid t_1 \le t \le t_2\}$ represent the data set from time $t_1$ up to $t_2$, and let $\mathcal{X}^t \triangleq \mathcal{X}_0^t$. Let $S_t$ denote the (unobserved) state at time $t$ and let $s_t$ be a realization of $S_t$, assuming $S_t$ is a first-order Markov chain with transition probabilities $a_{s_t s_{t+1}} \triangleq p(S_{t+1} = s_{t+1} \mid S_t = s_t)$. Let $\mathcal{I}^t \triangleq \{\mathcal{X}^t, \mathcal{S}^t\}$ denote all available information up to time $t$, where $\mathcal{S}^t \triangleq \mathcal{S}_0^t = \{s_0, \ldots, s_t\}$. We assume that the $X_{tk}$ are generated by an $m$-state Markov-switching time-frequency GARCH process of order $(1,1)$ which follows [127]:

$$X_{tk} = \sqrt{\lambda_{tk|t-1}}\; V_{tk}\,, \qquad k = 0, \ldots, K-1\,, \tag{5.1}$$
where $\{V_{tk}\}$ are iid complex-valued random variables with zero mean, unit variance, and some known probability density. Given the state $s_t$, the conditional variance of $X_{tk}$, $\lambda_{tk|t-1,s_t} = E\{|X_{tk}|^2 \mid \mathcal{I}^{t-1}, s_t\}$, is a linear function of the previous conditional variance and squared absolute value:

$$\lambda_{tk|t-1} \equiv \lambda_{tk|t-1,s_t} = \xi_{s_t} + \alpha_{s_t} |X_{t-1,k}|^2 + \beta_{s_t} \lambda_{t-1,k|t-2}\,, \tag{5.2}$$

where $\xi_s > 0$, $\alpha_s \ge 0$, and $\beta_s \ge 0$, $s = 1, \ldots, m$, are sufficient constraints for the positivity of the conditional variance. Let $\Psi$ be an $m$-by-$m$ matrix with elements

$$\psi_{ij} = a_{ji}\,(\alpha_i + \beta_i)\,\pi_j / \pi_i\,, \tag{5.3}$$

where $\pi_i = p(S_t = i)$ is the stationary probability of state $i$, and let $\rho(\cdot)$ represent the spectral radius of a matrix. Then, a necessary and sufficient condition for the process defined in (5.1) and (5.2) to be asymptotically wide-sense stationary is $\rho(\Psi) < 1$ [128].

Let $\mathbf{Y}_t = \mathbf{X}_t + \mathbf{D}_t$ denote the observed noisy signal, where $\mathbf{D}_t$ denotes the noise process which is uncorrelated with the signal $\mathbf{X}_t$, and let $\mathbf{D}_t$ be a zero-mean complex-valued Gaussian random process with a diagonal covariance matrix $E\{\mathbf{D}_t \mathbf{D}_t^H\} = \mathrm{diag}\{\boldsymbol{\sigma}^2\}$, where $(\cdot)^H$ denotes the Hermitian transpose operation. The state conditional probability of a Markov-switching model, $p(s_t \mid \mathcal{Y}^\tau)$, is of considerable theoretical and practical importance for signal restoration and state-sequence estimation (e.g., [127, 129]). Solutions of the state smoothing problem, i.e., $\tau > t$, are normally obtained for HMPs using the forward-backward recursions [118] or the stable backward recursion [119, 120]. Extensions of these recursions to nonmemoryless AR-HMPs [134], [5, Chap. 22], are based on the property that $s_t$ and a finite set of past clean observations give complete statistical knowledge of future densities.
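The stationarity condition $\rho(\Psi) < 1$ of (5.3) is easy to check numerically. The sketch below (function names are ours) builds $\Psi$ from a transition matrix and per-state GARCH parameters:

```python
import numpy as np

def stationary_probs(A):
    """Stationary distribution pi of a row-stochastic transition matrix A,
    with A[i, j] = a_ij = p(S_{t+1}=j | S_t=i)."""
    w, v = np.linalg.eig(A.T)
    pi = np.real(v[:, np.argmax(np.real(w))])
    return pi / pi.sum()

def is_wide_sense_stationary(A, alpha, beta):
    """Check rho(Psi) < 1 for the MS-GARCH(1,1) process of (5.1)-(5.2),
    where psi_ij = a_ji (alpha_i + beta_i) pi_j / pi_i as in (5.3)."""
    pi = stationary_probs(A)
    m = len(alpha)
    psi = np.array([[A[j, i] * (alpha[i] + beta[i]) * pi[j] / pi[i]
                     for j in range(m)] for i in range(m)])
    return bool(np.max(np.abs(np.linalg.eigvals(psi))) < 1.0)
```

Note that if $\alpha_s + \beta_s < 1$ for every state, each row of $\Psi$ sums to $\alpha_i + \beta_i$, so the condition always holds; the interesting cases are mixtures where some states are individually non-stationary.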
However, in the case of a path-dependent MS-GARCH model, a recursive formulation specifies the conditional distribution of the process as dependent on both past observations and the regime path, and therefore existing smoothing solutions are inapplicable.

5.3 State probability smoothing

In this section, we develop the noncausal state probability for the model defined in (5.1) and (5.2). The smoothed probability is derived by generalizing both the forward-backward recursions [118] and the stable backward recursion [119, 120].

5.3.1 Generalized forward-backward recursions

Assume that the conditional variance of the process is recursively estimated for any given state (e.g., as proposed in [127]), and assume that the set of the recursively estimated conditional variances at time $t$, $\hat{\Lambda}_t \triangleq \{\hat{\lambda}_{t|t-1,S_t} \mid S_t = 1, \ldots, m\}$, together with the observed signal $\mathbf{Y}_t$, are sufficient statistics for the next conditional variance estimation for any given regime [24, 127]. Let $\hat{\lambda}_{\tau_2|\tau_1, \mathcal{S}_{\tau_0}^{\tau_2}} = E\{\mathbf{X}_{\tau_2} \odot \mathbf{X}_{\tau_2}^* \mid \mathcal{S}_{\tau_0}^{\tau_2}, \mathcal{Y}^{\tau_1}\}$, $\tau_2 \ge \tau_1 > \tau_0$, denote the vector of estimated conditional variances at time $\tau_2$ based on the observations up to time $\tau_1$ and on the given set of active regimes $\mathcal{S}_{\tau_0}^{\tau_2}$, where $\odot$ denotes a term-by-term multiplication and $^*$ denotes complex conjugation. Let

$$g\left(\hat{\lambda}_{t|t-1,s_t}, \mathbf{Y}_t\right) \triangleq E\left\{\mathbf{X}_t \odot \mathbf{X}_t^* \mid S_t = s_t, \hat{\lambda}_{t|t-1,s_t}, \mathbf{Y}_t\right\}, \tag{5.4}$$

where the function $g(\cdot)$ is determined based on the statistical model of $\{V_{tk}\}$ [24]. Define the generalized forward density by

$$\alpha\left(s_t, \mathcal{Y}^t\right) \triangleq f\left(s_t, \hat{\Lambda}_t, \mathbf{Y}_t\right) \tag{5.5}$$

and the generalized backward density by

$$\beta\left(\mathcal{Y}_{t+l}^{t+L} \mid \mathcal{S}_t^{t+l-1}, \mathcal{Y}^{t+l-1}\right) \triangleq f\left(\mathcal{Y}_{t+l}^{t+L} \mid \mathcal{S}_t^{t+l-1}, \hat{\Lambda}_t, \mathcal{Y}_t^{t+l-1}\right). \tag{5.6}$$

Then, by substituting $l = 1$ we have

$$f\left(s_t, \mathcal{Y}_t^{t+L} \mid \hat{\Lambda}_t\right) = \alpha\left(s_t, \mathcal{Y}^t\right) \beta\left(\mathcal{Y}_{t+1}^{t+L} \mid s_t, \mathcal{Y}^t\right), \tag{5.7}$$

and the noncausal state probability can be obtained by

$$p\left(s_t \mid \mathcal{Y}^{t+L}\right) = p\left(s_t \mid \hat{\Lambda}_t, \mathcal{Y}_t^{t+L}\right) = \frac{\alpha\left(s_t, \mathcal{Y}^t\right) \beta\left(\mathcal{Y}_{t+1}^{t+L} \mid s_t, \mathcal{Y}^t\right)}{\sum_{s_t} \alpha\left(s_t, \mathcal{Y}^t\right) \beta\left(\mathcal{Y}_{t+1}^{t+L} \mid s_t, \mathcal{Y}^t\right)}\,. \tag{5.8}$$

Proposition 5.1.
The generalized forward density of an MSTF-GARCH(1,1) process, $\alpha(s_t, \mathcal{Y}^t)$, satisfies the following recursion:

$$\alpha\left(s_t, \mathcal{Y}^t\right) = f\left(\mathbf{Y}_t \mid s_t, \hat{\lambda}_{t|t-1,s_t}\right) \sum_{s_{t-1}} \alpha\left(s_{t-1}, \mathcal{Y}^{t-1}\right) a_{s_{t-1} s_t}\,, \tag{5.9}$$

with the initial condition $\alpha(s_0, \mathbf{Y}_0) = p(s_0)\, f(\mathbf{Y}_0 \mid s_0)$.

Proof. The generalized forward density is obtained by

$$\alpha\left(s_t, \mathcal{Y}^t\right) = f\left(\mathbf{Y}_t \mid s_t, \hat{\Lambda}_t\right) f\left(s_t, \hat{\Lambda}_t\right). \tag{5.10}$$

Given the active regime, the state-dependent conditional variance is sufficient for the conditional density. Furthermore, $\hat{\Lambda}_t$ and $\{\hat{\Lambda}_{t-1}, \mathbf{Y}_{t-1}\}$ represent the same statistical information. Hence

$$\alpha\left(s_t, \mathcal{Y}^t\right) = f\left(\mathbf{Y}_t \mid s_t, \hat{\lambda}_{t|t-1,s_t}\right) f\left(s_t, \hat{\Lambda}_{t-1}, \mathbf{Y}_{t-1}\right), \tag{5.11}$$

where

$$f\left(s_t, \hat{\Lambda}_{t-1}, \mathbf{Y}_{t-1}\right) = \sum_{s_{t-1}} \alpha\left(s_{t-1}, \mathcal{Y}^{t-1}\right) a_{s_{t-1} s_t}\,. \tag{5.12}$$

Substituting (5.12) into (5.11), we obtain the recursive formulation for the generalized forward density.²

Proposition 5.2. The generalized backward density of an MSTF-GARCH(1,1) process, $\beta(\mathcal{Y}_{t+1}^{t+L} \mid s_t, \mathcal{Y}^t)$, satisfies the following two-step recursion:

Step I: For $l = 1, \ldots, L$ and all $\mathcal{S}_t^{t+l}$:

$$\hat{\lambda}_{t+l|t+l-1, \mathcal{S}_t^{t+l}} = \xi_{s_{t+l}} \mathbf{1} + \alpha_{s_{t+l}}\, \hat{\lambda}_{t+l-1|t+l-1, \mathcal{S}_t^{t+l-1}} + \beta_{s_{t+l}}\, \hat{\lambda}_{t+l-1|t+l-2, \mathcal{S}_t^{t+l-1}} \tag{5.13}$$

$$\hat{\lambda}_{t+l|t+l, \mathcal{S}_t^{t+l}} = g\left(\hat{\lambda}_{t+l|t+l-1, \mathcal{S}_t^{t+l}}, \mathbf{Y}_{t+l}\right). \tag{5.14}$$

Step II: For $l = L, \ldots, 1$ and all $\mathcal{S}_t^{t+l}$:

$$f\left(\mathcal{Y}_{t+l}^{t+L} \mid \mathcal{S}_t^{t+l}, \hat{\lambda}_{t|t-1,s_t}, \mathcal{Y}_t^{t+l-1}\right) = \beta\left(\mathcal{Y}_{t+l+1}^{t+L} \mid \mathcal{S}_t^{t+l}, \mathcal{Y}^{t+l}\right) f\left(\mathbf{Y}_{t+l} \mid \mathcal{S}_t^{t+l}, \hat{\lambda}_{t|t-1,s_t}, \mathcal{Y}_t^{t+l-1}\right) \tag{5.15}$$

$$\beta\left(\mathcal{Y}_{t+l}^{t+L} \mid \mathcal{S}_t^{t+l-1}, \mathcal{Y}^{t+l-1}\right) = \sum_{s_{t+l}} f\left(\mathcal{Y}_{t+l}^{t+L} \mid \mathcal{S}_t^{t+l}, \hat{\lambda}_{t|t-1,s_t}, \mathcal{Y}_t^{t+l-1}\right) a_{s_{t+l-1} s_{t+l}}\,, \tag{5.16}$$

with $\beta(\mathcal{Y}_{t+L+1}^{t+L} \mid \mathcal{S}_t^{t+L}, \mathcal{Y}^{t+L}) = 1$ as the initial condition for the second step, and where $\mathbf{1}$ denotes a vector of ones.

²The initial conditions for the generalized forward recursion have a negligible effect on the conditional densities, assuming an asymptotically stationary process which is sufficiently long. Therefore, the initial conditional density $f(\mathbf{Y}_0 \mid s_0)$ can be estimated by using the state-dependent stationary density of the process.

Proof.
The generalized backward density $\beta(\mathcal{Y}_{t+1}^{t+L} \mid s_t, \mathcal{Y}^t) = f(\mathcal{Y}_{t+1}^{t+L} \mid s_t, \mathcal{Y}^t)$ can be obtained by

$$\beta\left(\mathcal{Y}_{t+1}^{t+L} \mid s_t, \mathcal{Y}^t\right) = \sum_{s_{t+1}} f\left(\mathcal{Y}_{t+1}^{t+L} \mid \mathcal{S}_t^{t+1}, \hat{\lambda}_{t|t-1,s_t}, \mathbf{Y}_t\right) a_{s_t s_{t+1}}\,, \tag{5.17}$$

where the multivariate density $f(\mathcal{Y}_{t+1}^{t+L} \mid \mathcal{S}_t^{t+1}, \hat{\lambda}_{t|t-1,s_t}, \mathbf{Y}_t)$ in (5.17) can be obtained by

$$f\left(\mathcal{Y}_{t+1}^{t+L} \mid \mathcal{S}_t^{t+1}, \hat{\lambda}_{t|t-1,s_t}, \mathbf{Y}_t\right) = \beta\left(\mathcal{Y}_{t+2}^{t+L} \mid \mathcal{S}_t^{t+1}, \mathcal{Y}^{t+1}\right) f\left(\mathbf{Y}_{t+1} \mid \mathcal{S}_t^{t+1}, \hat{\lambda}_{t|t-1,s_t}, \mathbf{Y}_t\right). \tag{5.18}$$

From (5.17) and (5.18) we recursively obtain, for any $l = 1, \ldots, L$:

$$\beta\left(\mathcal{Y}_{t+l}^{t+L} \mid \mathcal{S}_t^{t+l-1}, \mathcal{Y}^{t+l-1}\right) = \sum_{s_{t+l}} f\left(\mathcal{Y}_{t+l}^{t+L} \mid \mathcal{S}_t^{t+l}, \hat{\lambda}_{t|t-1,s_t}, \mathcal{Y}_t^{t+l-1}\right) a_{s_{t+l-1} s_{t+l}} \tag{5.19}$$

and

$$f\left(\mathcal{Y}_{t+l}^{t+L} \mid \mathcal{S}_t^{t+l}, \hat{\lambda}_{t|t-1,s_t}, \mathcal{Y}_t^{t+l-1}\right) = \beta\left(\mathcal{Y}_{t+l+1}^{t+L} \mid \mathcal{S}_t^{t+l}, \mathcal{Y}^{t+l}\right) f\left(\mathbf{Y}_{t+l} \mid \mathcal{S}_t^{t+l}, \hat{\lambda}_{t|t-1,s_t}, \mathcal{Y}_t^{t+l-1}\right). \tag{5.20}$$

The conditional density $f(\mathbf{Y}_{t+l} \mid \mathcal{S}_t^{t+l}, \hat{\lambda}_{t|t-1,s_t}, \mathcal{Y}_t^{t+l-1})$ in (5.20) is the density of the observed data at time $t+l$, conditioned on the regime path $\mathcal{S}_t^{t+l}$, the recursively estimated conditional variance at time $t$ given $s_t$, and also on all observations from time $t$ up to time $t+l-1$. This density has a diagonal covariance matrix with the following conditional variance on its diagonal:

$$
E\left\{\mathbf{Y}_{t+l} \odot \mathbf{Y}_{t+l}^* \mid \mathcal{S}_t^{t+l}, \hat{\lambda}_{t|t-1,s_t}, \mathcal{Y}_t^{t+l-1}\right\} = \boldsymbol{\sigma}^2 + \hat{\lambda}_{t+l|t+l-1, \mathcal{S}_t^{t+l}} = \boldsymbol{\sigma}^2 + \xi_{s_{t+l}} \mathbf{1} + \alpha_{s_{t+l}}\, E\left\{\mathbf{X}_{t+l-1} \odot \mathbf{X}_{t+l-1}^* \mid \mathcal{S}_t^{t+l}, \hat{\lambda}_{t|t-1,s_t}, \mathcal{Y}_t^{t+l-1}\right\} + \beta_{s_{t+l}}\, E\left\{\lambda_{t+l-1|t+l-2} \mid \mathcal{S}_t^{t+l-1}, \hat{\lambda}_{t|t-1,s_t}, \mathcal{Y}_t^{t+l-2}\right\}. \tag{5.21}
$$

The expected absolute squared value of the signal at a specific time, given the active regime, is independent of any future regimes; hence

$$E\left\{\mathbf{X}_{t+l-1} \odot \mathbf{X}_{t+l-1}^* \mid \mathcal{S}_t^{t+l}, \hat{\lambda}_{t|t-1,s_t}, \mathcal{Y}_t^{t+l-1}\right\} = \hat{\lambda}_{t+l-1|t+l-1, \mathcal{S}_t^{t+l-1}} = g\left(\hat{\lambda}_{t+l-1|t+l-2, \mathcal{S}_t^{t+l-1}}, \mathbf{Y}_{t+l-1}\right). \tag{5.22}$$

Combining (5.22) with (5.21), we obtain Step I of the generalized backward recursion ((5.13) and (5.14)), and from (5.19) and (5.20) we obtain Step II ((5.15) and (5.16)).
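To make the forward pass concrete, the sketch below runs the generalized forward recursion (5.9) for a scalar ($K = 1$) process, assuming complex Gaussian $V_{tk}$ and a known noise variance. The per-state variance propagation here collapses past states by their causal posterior weights, which is a simplification of the exact recursive estimator of [127]; all names and the parameter values in the test are illustrative:

```python
import numpy as np

def forward_state_probs(y, xi, alpha, beta, A, sigma2):
    """Causal state probabilities via the generalized forward recursion (5.9)
    for scalar observations y_t = x_t + d_t with complex Gaussian V_tk.

    State-dependent variances are propagated with a collapsing
    approximation (past states mixed by their posteriors), a
    simplification of the recursive estimator of [127].
    """
    T, m = len(y), len(xi)
    probs = np.zeros((T, m))
    lam = xi / (1.0 - alpha - beta)      # stationary initialization per state
    for t in range(T):
        pred = np.full(m, 1.0 / m) if t == 0 else probs[t - 1] @ A
        total = sigma2 + lam             # observation variance per state
        lik = np.exp(-np.abs(y[t]) ** 2 / total) / (np.pi * total)
        w = pred * lik
        probs[t] = w / w.sum()
        # E{|x_t|^2 | y_t, s}: posterior variance plus squared posterior mean
        gain = lam / total
        x2 = probs[t] @ (gain * sigma2 + gain ** 2 * np.abs(y[t]) ** 2)
        lam = xi + alpha * x2 + beta * (probs[t] @ lam)
    return probs
```

The exact algorithm would instead keep the state-conditional variances $\hat\lambda_{t|t-1,s_t}$ separately per regime, as in (5.11)–(5.12); the collapsed version is only meant to show the shape of the recursion.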
Step I is an upward recursion which manipulates the future observations for estimating their conditional variances, corresponding to all possible regime sequences $\mathcal{S}_t^{t+L}$. Step II is a backward recursion, which integrates the Step I results to evaluate the generalized backward density. Each step of the generalized backward recursion is calculated for $m^{L+1}$ regime sequences, and therefore the computational complexity increases exponentially with the delay $L$. However, as the correlation of the current state and future observations decreases along time, small values of $L$ suffice to enhance the chain sequence estimation, as can be seen in the experimental results.

5.3.2 Generalized stable backward recursion

The stable backward recursion is derived by using the smoothed probability of two sequential states, which is given by [120]:

$$p\left(\mathcal{S}_t^{t+1} \mid \mathcal{Y}^{t+L}\right) = \frac{f\left(\mathcal{S}_t^{t+1}, \mathcal{Y}_{t+1}^{t+L} \mid \mathcal{Y}^t\right) p\left(s_{t+1} \mid \mathcal{Y}^{t+L}\right)}{f\left(s_{t+1}, \mathcal{Y}_{t+1}^{t+L} \mid \mathcal{Y}^t\right)}\,. \tag{5.23}$$

Under the assumption that $\{\hat{\Lambda}_t, \mathbf{Y}_t\}$ are sufficient statistics for the next state-dependent conditional variance estimation, we obtain

$$
f\left(\mathcal{S}_t^{t+1}, \mathcal{Y}_{t+1}^{t+L} \mid \mathcal{Y}^t\right) = f\left(s_{t+1}, \mathcal{Y}_{t+1}^{t+L} \mid s_t, \mathcal{Y}^t\right) p\left(s_t \mid \mathcal{Y}^t\right) = f\left(\mathcal{Y}_{t+1}^{t+L} \mid \mathcal{S}_t^{t+1}, \mathcal{Y}^t\right) p\left(s_{t+1} \mid s_t, \mathcal{Y}^t\right) p\left(s_t \mid \mathcal{Y}^t\right) = f\left(\mathcal{Y}_{t+1}^{t+L} \mid \mathcal{S}_t^{t+1}, \hat{\lambda}_{t|t-1,s_t}, \mathbf{Y}_t\right) a_{s_t s_{t+1}}\, p\left(s_t \mid \hat{\Lambda}_t, \mathbf{Y}_t\right) \tag{5.24}
$$

and

$$
f\left(s_{t+1}, \mathcal{Y}_{t+1}^{t+L} \mid \mathcal{Y}^t\right) = f\left(\mathcal{Y}_{t+1}^{t+L} \mid s_{t+1}, \mathcal{Y}^t\right) p\left(s_{t+1} \mid \mathcal{Y}^t\right) = f\left(\mathcal{Y}_{t+1}^{t+L} \mid s_{t+1}, \hat{\lambda}_{t+1|t,s_{t+1}}\right) p\left(s_{t+1} \mid \hat{\Lambda}_t, \mathbf{Y}_t\right). \tag{5.25}
$$

By substituting (5.24) and (5.25) into (5.23) and integrating out all states at time $t+1$, we obtain the following backward recursion for the smoothed state probability:

$$p\left(s_t \mid \mathcal{Y}^{t+L}\right) = p\left(s_t \mid \hat{\Lambda}_t, \mathbf{Y}_t\right) \sum_{s_{t+1}} \frac{f\left(\mathcal{Y}_{t+1}^{t+L} \mid \mathcal{S}_t^{t+1}, \hat{\lambda}_{t|t-1,s_t}, \mathbf{Y}_t\right) a_{s_t s_{t+1}}\, p\left(s_{t+1} \mid \mathcal{Y}^{t+L}\right)}{f\left(\mathcal{Y}_{t+1}^{t+L} \mid s_{t+1}, \hat{\lambda}_{t+1|t,s_{t+1}}\right) p\left(s_{t+1} \mid \hat{\Lambda}_t, \mathbf{Y}_t\right)}\,, \tag{5.26}$$

where the conditional density $f(\mathcal{Y}_{t+1}^{t+L} \mid \mathcal{S}_t^{t+1}, \hat{\lambda}_{t|t-1,s_t}, \mathbf{Y}_t)$ can be derived from the generalized backward recursion (5.13)–(5.16). However, the conditional density
$f(\mathcal{Y}_{t+1}^{t+L} \mid s_{t+1}, \hat{\lambda}_{t+1|t,s_{t+1}})$ in the denominator of (5.26) requires calculation of a similar recursion which is not informed of the regime $s_t$. Although the stable backward recursion is known to be numerically more stable than the forward-backward recursions, the instability of the latter is insignificant for short delays, and the former requires computation of the generalized backward recursion twice: once for evaluating $f(\mathcal{Y}_{t+1}^{t+L} \mid \mathcal{S}_t^{t+1}, \hat{\lambda}_{t|t-1,s_t}, \mathbf{Y}_t)$ and once for $f(\mathcal{Y}_{t+1}^{t+L} \mid s_{t+1}, \hat{\lambda}_{t+1|t,s_{t+1}})$.

5.4 Experimental results

The generalized state smoothing has been applied to state detection in noisy MSTF-GARCH(1,1) processes with 3 states and 5 to 15 dB signal-to-noise ratios (SNRs). Twenty random stationary models have been simulated with an unconditional Gaussian model and parameters uniformly distributed on the intervals (0, 1/3], (1/3, 2/3] and (2/3, 1] for each state, respectively. For each model, 20 signals are considered, each of dimension $K = 100$ and time length $T = 100$. The conditional variances $\hat{\lambda}_{t|t-1,s_t}$ are estimated using the recursive approach of [127]. Figure 5.1 shows the detection error rate $p(\hat{s}_t \neq s_t)$ for causal estimation as well as for noncausal estimation with up to $L = 4$ samples delay. It can be seen that the state detection monotonically improves with the increase of the delay. However, the most significant improvement is achieved by using up to 2 future samples, and the contribution of additional future observations decays along time.

5.5 Conclusions

We have derived state smoothing for Markov-switching time-frequency GARCH processes, in which case the conditional variances depend on both past observations and the regime path. Our noncausal state probability solution generalizes both the standard forward-backward recursions and the stable backward recursion of HMPs by capturing both the signal correlation along time and its conditioning on the regime path.
Accordingly, the backward recursion requires two recursive steps for evaluating the conditional density of the given future observations, corresponding to all possible future paths. Although the computational complexity of the generalized backward recursion grows exponentially with the delay, a small number of future observations contributes the most significant part of the improvement to the state estimation. Combining the generalized recursions with the recursive signal restoration algorithm of [127] facilitates a noncausal signal restoration, which is a subject for further research.

Figure 5.1: State smoothing error rate for 3-state MSTF-GARCH models with SNRs of 5 dB (triangle), 10 dB (asterisk) and 15 dB (circle).

Chapter 6

Simultaneous Detection and Estimation Approach for Speech Enhancement

In this chapter, we present a simultaneous detection and estimation approach for speech enhancement. A detector for speech presence in the short-time Fourier transform domain is combined with an estimator, which jointly minimizes a cost function that takes into account both detection and estimation errors. Cost parameters control the trade-off between speech distortion, caused by missed detection of speech components, and residual musical noise resulting from false detection. Furthermore, a modified decision-directed a priori signal-to-noise ratio (SNR) estimation is proposed for transient-noise environments. Experimental results demonstrate the advantage of using the proposed simultaneous detection and estimation approach with the proposed a priori SNR estimator, which facilitates suppression of transient noise with a controlled level of speech distortion. In Appendix 6.B we formulate a speech enhancement problem under multiple hypotheses, assuming an indicator or detector for the transient noise presence is available in the short-time Fourier transform (STFT) domain.
Hypothetical presence of speech or transient noise is considered in the observed spectral coefficients, and cost parameters control the trade-off between speech distortion and residual transient noise. An optimal estimator, which minimizes the mean-square error of the log-spectral amplitude, is derived while taking into account the probability of erroneous detection. Experimental results demonstrate the improved performance in transient noise suppression, compared to using the optimally-modified log-spectral amplitude estimator.

This chapter is based on [136].

6.1 Introduction

Optimal design of efficient speech enhancement algorithms has attracted significant research effort for several decades. Speech enhancement systems often operate in the short-time Fourier transform (STFT) domain, where the speech spectral coefficients are estimated from the spectral coefficients of the degraded signal. The spectral coefficients of the speech signal are generally sparse in the STFT domain in the sense that speech is present only in some of the frames, and in each frame only some of the frequency bins contain the significant part of the signal energy. However, existing algorithms often focus on estimating the spectral coefficients rather than detecting their existence. The spectral-subtraction algorithm [29, 30] contains an elementary detector for speech activity in the time-frequency domain, but it generates musical noise caused by falsely detecting noise peaks as bins that contain speech, which are randomly scattered in the STFT domain. Subspace approaches for speech enhancement [57, 59, 60, 104] decompose the vector of the noisy signal into a signal-plus-noise subspace and a noise subspace, and the speech spectral coefficients are estimated after removing the noise subspace. Accordingly, these algorithms are aimed at detecting the speech coefficients and subsequently estimating their values.
McAulay and Malpass [32] were the first to propose a speech spectral estimator under a two-state model. They derived a maximum likelihood (ML) estimator for the speech spectral amplitude under speech-presence uncertainty. Ephraim and Malah followed this approach of signal estimation under speech presence uncertainty and derived an estimator which minimizes the mean-square error (MSE) of the short-term spectral amplitude (STSA) [33]. In [49], the speech presence probability is evaluated to improve the minimum MSE (MMSE) of the log-spectral amplitude (LSA) estimator, and in [38] a further improvement of the MMSE-LSA estimator is achieved based on a two-state model. Under the speech absence hypothesis, Cohen and Berdugo [38] considered a constant attenuation factor to enable a more natural residual noise, characterized by reduced musicality.

Under slowly time-varying noise conditions, an estimator which minimizes the MSE of the STSA or the LSA under speech presence uncertainty may yield reasonable results [33, 38]. However, under quickly time-varying noise conditions, abrupt transients may not be sufficiently attenuated, since speech is falsely detected with some positive probability. Reliable detectors for speech activity and noise transients are necessary to further attenuate noise transients without much degrading the speech components [107, 137]. Despite the sparsity of speech coefficients in the time-frequency domain and the importance of signal detection for noise suppression performance, common speech enhancement algorithms deal with speech detection independently of speech estimation. Even when a voice activity detector is available in the STFT domain (e.g., [51–56, 108]), it is not straightforward to consider the detection errors when designing the optimal speech estimator.
High attenuation of speech spectral coefficients due to missed detection errors may significantly degrade speech quality and intelligibility, while falsely detecting noise transients as speech-contained bins may produce annoying musical noise. In this chapter, we present a novel formulation of the speech enhancement problem, which incorporates simultaneous operations of detection and estimation. A detector for the speech coefficients is combined with an estimator, which jointly minimizes a cost function that takes into account both estimation and detection errors. Under speech presence, the cost is proportional to a quadratic spectral amplitude (QSA) error [33], while under speech absence, the distortion depends on a certain attenuation factor [29, 38, 70]. We derive a combined detector and estimator with cost parameters that control the trade-off between speech distortion, caused by missed detection of speech components, and residual musical noise resulting from false detection. The combined solution generalizes the well-known STSA algorithm, which involves merely estimation under signal presence uncertainty. In addition, we propose a modification of the decision-directed a priori signal-to-noise ratio (SNR) estimator, which is suitable for transient-noise environments. Experimental results show that the simultaneous detection and estimation yields better noise reduction than the STSA algorithm while not degrading the speech signal. The advantage of using a suitable indicator for transient noise is demonstrated in a nonstationary noise environment, where the proposed algorithm facilitates suppression of transient noise with a controlled level of speech distortion.

The chapter is organized as follows. In Section 6.2, we briefly review classical speech enhancement under signal presence uncertainty.
In Section 6.3, we reformulate the speech enhancement problem in the STFT domain as a simultaneous detection and estimation problem. In Section 6.4, we derive the combined solution for a QSA distortion function. In Section 6.5, we relate our proposed approach to the spectral-subtraction approach. In Section 6.6, we present an a priori SNR estimator suitable for transient noise environments, and in Section 6.7 we demonstrate the performance of the proposed approach compared to existing algorithms, both under stationary and transient-noise environments.

6.2 Classical speech enhancement

In this section, we present the classical approach for spectral speech enhancement in nonstationary noise environments, assuming that some indicator for transient noise activity is available. Let $x(n)$ and $d(n)$ denote speech and uncorrelated additive noise signals, and let $y(n) = x(n) + d(n)$ be the observed signal. Applying the STFT to the observed signal, we have

$$Y_{\ell k} = X_{\ell k} + D_{\ell k}\,, \tag{6.1}$$

where $\ell = 0, 1, \ldots$ is the time-frame index and $k = 0, 1, \ldots, K-1$ is the frequency-bin index. Let $H_1^{\ell k}$ and $H_0^{\ell k}$ denote, respectively, speech presence and absence hypotheses in the time-frequency bin $(\ell, k)$, i.e.,

$$H_1^{\ell k}: \; Y_{\ell k} = X_{\ell k} + D_{\ell k}$$
$$H_0^{\ell k}: \; Y_{\ell k} = D_{\ell k}\,. \tag{6.2}$$

We assume that the noise expansion coefficients can be represented as the sum of two uncorrelated noise components, $D_{\ell k} = D_{\ell k}^s + D_{\ell k}^t$, where $D_{\ell k}^s$ denotes a quasi-stationary noise component and $D_{\ell k}^t$ denotes a highly nonstationary transient component. The transient components are generally rare, but they may be of high energy and thus cause significant degradation to speech quality and intelligibility. However, in many applications, a reliable indicator for the transient noise activity may be available in the system. For example, in an emergency vehicle (e.g., police car or ambulance) the engine noise may be
considered quasi-stationary, but activating a siren results in a highly nonstationary noise which is perceptually very annoying. Since the sound generation in the siren is nonlinear, linear echo cancelers, e.g., [138], may be inappropriate. In a computer-based communication system, a transient noise such as keyboard-typing noise may be present in addition to quasi-stationary background office noise. Another example is a digital camera, where activating the lens motor (zooming in/out) may result in high-energy transient noise components, which degrade the recorded audio. In the above examples, an indicator for the transient noise activity may be available, i.e., the siren source signal, the keyboard output signal, or the lens-motor controller output. Furthermore, given that a transient noise source is active, a detector for the transient noise in the STFT domain may be designed, and its spectrum can be estimated based on training data.

The objective of a speech enhancement system is to reconstruct the spectral coefficients of the speech signal such that under speech presence a certain distortion measure between the spectral coefficient and its estimate, $d(X_{\ell k}, \hat{X}_{\ell k})$, is minimized, and under speech absence a constant attenuation of the noisy coefficient is desired to maintain a natural background noise [38, 70]. Although the speech expansion coefficients are not necessarily present, most classical speech enhancement algorithms try to estimate the spectral coefficients rather than detect their existence, or try to independently design detectors and estimators. The well-known spectral subtraction algorithm estimates the speech spectrum by subtracting the estimated noise spectrum from the noisy squared absolute coefficients [29, 30], and thresholding the result by some desired residual noise level.
Thresholding the spectral coefficients is in fact a detection operation in the time-frequency domain, in the sense that speech coefficients are assumed to be absent in the low-energy time-frequency bins and present in noisy coefficients whose energy is above the threshold. McAulay and Malpass were the first to propose a two-state model for the speech signal in the time-frequency domain [32]. Accordingly, the MMSE estimator follows [115]

$$\hat{X}_{\ell k} = E\left\{X_{\ell k} \mid Y_{\ell k}\right\} = E\left\{X_{\ell k} \mid Y_{\ell k}, H_1^{\ell k}\right\} p\left(H_1^{\ell k} \mid Y_{\ell k}\right). \tag{6.3}$$

The resulting estimator does not detect speech components; rather, a soft decision is performed to further attenuate the signal estimate by the a posteriori speech presence probability. Ephraim and Malah followed the same approach and derived an estimator which minimizes the MSE of the STSA under signal presence uncertainty [33]. Accordingly,

$$|\hat{X}_{\ell k}| = E\left\{|X_{\ell k}| \mid Y_{\ell k}, H_1^{\ell k}\right\} p\left(H_1^{\ell k} \mid Y_{\ell k}\right). \tag{6.4}$$

Both in [32] and [33], under $H_0^{\ell k}$ the speech components are assumed zero and the a priori probability of speech presence is both time and frequency invariant, i.e., $p(H_1^{\ell k}) = p(H_1)$. In [38, 49], the speech presence probability is evaluated for each frequency bin and time frame to improve the performance of the MMSE-LSA estimator [34]. Further improvement of the MMSE-LSA suppression rule can be achieved by considering under $H_0^{\ell k}$ a constant attenuation factor $G_f \ll 1$, which is determined by subjective criteria for residual noise naturalness; see also [70]. The OM-LSA estimator [38] is given by

$$\hat{X}_{\ell k} = \left[\exp\left(E\left\{\log |X_{\ell k}| \mid Y_{\ell k}, H_1^{\ell k}\right\}\right)\right]^{p(H_1^{\ell k} \mid Y_{\ell k})} \left(G_f\, |Y_{\ell k}|\right)^{1 - p(H_1^{\ell k} \mid Y_{\ell k})}. \tag{6.5}$$
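Since $\exp(E\{\log|X_{\ell k}| \mid Y_{\ell k}, H_1^{\ell k}\}) = G_{\mathrm{LSA}}\,|Y_{\ell k}|$ with the MMSE-LSA gain of [34], the OM-LSA rule (6.5) can be sketched as below. The a priori SNR $\xi$ and the presence probability $p(H_1 \mid Y)$ are taken as given inputs, and the default $G_f$ value is illustrative:

```python
import numpy as np
from scipy.special import exp1  # E1(x) = integral_x^inf e^(-t)/t dt

def om_lsa(Y, xi, sigma2, p_h1, G_f=0.1):
    """OM-LSA estimate of (6.5) for one noisy STFT coefficient Y.

    xi   -- a priori SNR, p_h1 -- speech presence probability p(H1 | Y),
    G_f  -- speech-absence attenuation factor (illustrative default).
    """
    v = xi / (1.0 + xi) * np.abs(Y) ** 2 / sigma2
    G_lsa = xi / (1.0 + xi) * np.exp(0.5 * exp1(v))   # MMSE-LSA gain [34]
    amp = (G_lsa * np.abs(Y)) ** p_h1 * (G_f * np.abs(Y)) ** (1.0 - p_h1)
    return amp * np.exp(1j * np.angle(Y))             # keep the noisy phase
```

At $p(H_1 \mid Y) = 1$ the rule reduces to the plain MMSE-LSA estimate, and at $p(H_1 \mid Y) = 0$ it applies the constant floor $G_f$, so the exponent performs a geometric interpolation between the two regimes.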