
Improving Denoising Auto-encoder Based Speech Enhancement With the Speech Parameter Generation Algorithm

Syu-Siang Wang∗‡, Hsin-Te Hwang†, Ying-Hui Lai‡, Yu Tsao‡, Xugang Lu§, Hsin-Min Wang† and Borching Su∗
∗ Graduate Institute of Communication Engineering, National Taiwan University, Taiwan. E-mail: [email protected]
† Institute of Information Science, Academia Sinica, Taipei, Taiwan. E-mail: [email protected]
‡ Research Center for Information Technology Innovation, Academia Sinica, Taipei, Taiwan. E-mail: [email protected]
§ National Institute of Information and Communications Technology, Japan

Abstract—This paper investigates the use of the speech parameter generation (SPG) algorithm, which has been successfully adopted in deep neural network (DNN)-based voice conversion (VC) and speech synthesis (SS), for incorporating temporal information to improve deep denoising auto-encoder (DDAE)-based speech enhancement. In our previous studies, we have confirmed that the DDAE could effectively suppress noise components from noise-corrupted speech. However, because the DDAE converts speech in a frame-by-frame manner, the enhanced speech shows some level of discontinuity even though context features are used as input to the DDAE. To handle this issue, this study proposes using the SPG algorithm as a post-processor to transform the DDAE-processed feature sequence into one with a smoothed trajectory. Two types of temporal information with SPG are investigated in this study: static-dynamic and context features. Experimental results show that the SPG with context features outperforms the SPG with static-dynamic features and the baseline system, which considers context features without SPG, in terms of standardized objective tests in different noise types and SNRs.

I. INTRODUCTION

A primary goal of speech enhancement (SE) is to reduce noise components and thus enhance the signal-to-noise ratio (SNR) of noise-corrupted speech. In a wide range of voice communication applications, SE serves as a key element to increase the quality and intelligibility of speech signals [1], [2], [3]. Generally, SE algorithms can be classified into two categories: unsupervised and supervised ones. The unsupervised algorithms are derived from probabilistic models of speech and noise signals. Notable examples include spectral subtraction [4], the Wiener filter [5], Kalman filtering [6], and the minimum mean-square-error (MMSE) spectral estimator [7]. These methods assume statistical models for speech and noise signals. The clean speech is estimated from the noisy observation without any prior information on the noise type or speaker identity. One limitation of these approaches is that accurate estimation of noise statistics can be very challenging, especially when the noise is non-stationary. In contrast, the supervised algorithms require a set of training data to learn a transformation structure that facilitates an online SE process. When a sufficient amount of training data is available, the supervised methods can achieve better performance than their unsupervised counterparts [8], [9], [10], [11]. Notable supervised SE algorithms include nonnegative matrix factorization (NMF) [9], [12], sparse coding [13], deep neural network (DNN) [11], [14], and deep denoising auto-encoder (DDAE) [8], [15] algorithms.

The DDAE-based SE method includes training and enhancement phases. In the training phase, we need to prepare paired clean and noisy training speech utterances, which are set as the output and input of the DDAE model, respectively. The parameters of the DDAE model are learned based on the minimum mean square error (MMSE) criterion with the goal of transforming the noisy speech to match the clean one. In the enhancement phase, the DDAE transforms the noisy speech into the enhanced one using the parameters learned in the training phase. Previous studies have shown that the DDAE approach can effectively remove noise components from noise-corrupted speech and provide better performance in terms of several standardized objective evaluation metrics, compared to conventional SE approaches [8], [15]. However, because the DDAE transforms acoustic features in a frame-by-frame manner, the enhanced speech shows some level of discontinuity even though context features are used as input to the DDAE model. In this study, we intend to incorporate the temporal trajectory information of a speech utterance to overcome the discontinuity issue of the DDAE.

The discontinuity issue is also found in the DNN-based speech synthesis (SS) [16] and DNN-based voice conversion (VC) [17] tasks.
Several approaches have been proposed to overcome it; among them, an effective approach is the speech parameter generation (SPG) algorithm. The SPG algorithm was first proposed for hidden Markov model (HMM)-based SS [18], [19], and was later applied in Gaussian mixture model (GMM)-based VC [20], DNN-based VC [17], [21], and DNN-based SS [16]. Previous studies have confirmed that two types of features are effective in covering temporal information, namely static-dynamic features and context features. The static-dynamic features are obtained by appending dynamic components to the original static ones, while the context features are prepared by attaching adjacent features to the center ones. The SPG algorithm generates speech with smooth temporal trajectories by using the dynamic or contextual features as constraints in the speech generation process. In this study, we use SPG as a post-processor to transform the DDAE-enhanced feature sequence into one with a smoothed trajectory. To conduct the static-dynamic-feature-based SPG, we use the static-dynamic features of the noisy speech and clean speech as the input and output of the DDAE model. Similarly, for the context-feature-based SPG, the context features of the noisy speech and clean speech are used as the input and output of the DDAE model. Experimental results show that the SPG-smoothed DDAE model with context features achieves better performance than the SPG-smoothed DDAE model with static-dynamic features. The results also confirm that the DDAE with SPG always outperforms the baseline system (i.e., the DDAE without SPG) in various standardized objective tests in different noise types and SNRs.

The remainder of the paper is organized as follows. The DDAE SE system is briefly introduced in Section II. The proposed SPG-smoothed DDAE SE framework and the experimental evaluations are presented in Sections III and IV, respectively. Finally, the summaries of our findings are given in Section V.

II. THE DEEP DENOISING AUTO-ENCODER

This section reviews the DDAE speech enhancement system. Brief mathematical derivations are also provided.

The DDAE-based speech enhancement method consists of two phases, namely the offline and online phases. The offline phase first prepares paired clean and noisy speech utterances, which are used as the output and input of the DDAE model, respectively. The parameters of the DDAE model are estimated by the MMSE criterion with the aim of perfectly transforming the noisy speech to the clean one. With the estimated DDAE model parameters, the noisy utterances are reconstructed into enhanced ones in the online phase.

Fig. 1. One hidden layer DAE model. $\mathbf{y}_i$ and $\hat{\mathbf{x}}_i$ denote the $i$-th training sample of the noisy and enhanced speech, respectively.

Figure 1 shows the block diagram of a one-hidden-layer denoising auto-encoder (DAE). In the figure, the DAE outputs the enhanced feature vector $\hat{\mathbf{x}}_i$ by

$\hat{\mathbf{x}}_i = \mathbf{W}_2 h(\tilde{\mathbf{y}}_i) + \mathbf{b}_2$,  (1)

where $\mathbf{W}_2$ and $\mathbf{b}_2$ are the connecting weights and bias vector for the reconstruction stage, and $h(\tilde{\mathbf{y}}_i)$ is obtained by

$h(\tilde{\mathbf{y}}_i) = \sigma(\tilde{\mathbf{y}}_i) = \dfrac{1}{1 + \exp(-\tilde{\mathbf{y}}_i)}$,  (2)

with

$\tilde{\mathbf{y}}_i = \mathbf{W}_1 \mathbf{y}_i + \mathbf{b}_1$,  (3)

where $\mathbf{W}_1$ and $\mathbf{b}_1$ are the connecting weights and bias vector for the encoding stage.

The parameters $\{\theta \mid \theta \in \mathbf{W}_1, \mathbf{W}_2, \mathbf{b}_1, \mathbf{b}_2\}$ are determined by optimizing the objective function in (4) over all the training sample vectors:

$\theta^{*} = \arg\min_{\theta} \left( L(\theta) + \alpha\,\psi(\mathbf{W}_1, \mathbf{W}_2) + \beta\,\phi(h(\tilde{\mathbf{y}}_i), \mathbf{y}_i) \right)$,  (4)

where $\alpha$ and $\beta$ are the weight-decay and sparsity-penalty parameters, respectively; $\psi(\mathbf{W}_1, \mathbf{W}_2) = \|\mathbf{W}_1\|_F^2 + \|\mathbf{W}_2\|_F^2$; $\phi(h(\tilde{\mathbf{y}}_i), \mathbf{y}_i)$ denotes the sparsity constraint, for which the Kullback-Leibler (KL) divergence [22] between two Bernoulli distributions is used in this study; and $L(\theta)$ is the distance between the clean and reconstructed feature vectors, defined as

$L(\theta) = \sum_{i=1}^{I} \|\mathbf{x}_i - \hat{\mathbf{x}}_i\|_2^2$,  (5)

where $I$ is the total number of training samples. A DDAE is a deep DAE consisting of more hidden layers.
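To make the training objective concrete, the following minimal sketch (in Python with NumPy) evaluates the forward pass in (1)-(3) and the regularized objective in (4)-(5) for a one-hidden-layer DAE. It is an illustrative reconstruction under stated assumptions, not the authors' implementation: the layer sizes, the hyperparameter values (alpha, beta, and the Bernoulli target rho), and all function names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_HID = 129, 300          # illustrative: feature dim, hidden nodes
W1 = 0.01 * rng.standard_normal((D_HID, D_IN))   # encoding weights, Eq. (3)
b1 = np.zeros(D_HID)
W2 = 0.01 * rng.standard_normal((D_IN, D_HID))   # reconstruction weights, Eq. (1)
b2 = np.zeros(D_IN)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dae_forward(y):
    """Eqs. (1)-(3): map a noisy feature vector y_i to an enhanced x_hat_i."""
    h = sigmoid(W1 @ y + b1)    # hidden code h(y~_i)
    return W2 @ h + b2, h       # enhanced vector x_hat_i

def objective(Y, X, alpha=1e-4, beta=1e-2, rho=0.05):
    """Eq. (4): reconstruction loss + weight decay + KL sparsity penalty."""
    loss, rho_hat = 0.0, np.zeros(D_HID)
    for y, x in zip(Y, X):                    # paired noisy/clean frames
        x_hat, h = dae_forward(y)
        loss += np.sum((x - x_hat) ** 2)      # L(theta), Eq. (5)
        rho_hat += h
    rho_hat /= len(Y)                         # mean hidden activation
    decay = np.sum(W1 ** 2) + np.sum(W2 ** 2)               # psi(W1, W2)
    kl = np.sum(rho * np.log(rho / rho_hat)                 # phi(.): KL between
                + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))  # Bernoullis [22]
    return loss + alpha * decay + beta * kl
```

In practice, the optimum in (4) would be found by gradient-based optimization of this objective over all training frames; a DDAE simply stacks several such encoding layers before the reconstruction stage.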

III. THE PROPOSED DDAE WITH SPG METHOD

Fig. 2. The proposed SPG smoothed DDAE (DAS) speech enhancement architecture.

Figure 2 shows the block diagram of the proposed DDAE with SPG (denoted as DAS) SE technique.

A. The training stage

At the training stage, after feature extraction, the noisy speech feature vector $\mathbf{Y} = [\mathbf{Y}_1^{\top}, \cdots, \mathbf{Y}_i^{\top}, \cdots, \mathbf{Y}_I^{\top}]^{\top}$ and the clean speech feature vector $\mathbf{X} = [\mathbf{X}_1^{\top}, \cdots, \mathbf{X}_i^{\top}, \cdots, \mathbf{X}_I^{\top}]^{\top}$ are used as the input and output of the DDAE, respectively, for constructing the model. The superscript $\top$ denotes vector transposition; $\mathbf{Y}_i$ and $\mathbf{X}_i$ are the noisy and clean speech feature vectors at frame $i$, respectively. Both feature vectors $\mathbf{Y}_i$ and $\mathbf{X}_i$ can be composed of either static-dynamic or context features. For example, if $\mathbf{Y}_i$ consists of static-dynamic features, then $\mathbf{Y}_i = [\mathbf{y}_i^{\top}, \Delta^{(1)}\mathbf{y}_i^{\top}, \Delta^{(2)}\mathbf{y}_i^{\top}]^{\top}$. The velocity $\Delta^{(1)}\mathbf{y}_i$ and acceleration $\Delta^{(2)}\mathbf{y}_i$ features can be calculated from the static features $\mathbf{y}_{i-1}$, $\mathbf{y}_i$, and $\mathbf{y}_{i+1}$ by

$\Delta^{(1)}\mathbf{y}_i = \dfrac{\mathbf{y}_{i+1} - \mathbf{y}_{i-1}}{2}, \qquad \Delta^{(2)}\mathbf{y}_i = \mathbf{y}_{i-1} - 2\mathbf{y}_i + \mathbf{y}_{i+1}$.  (6)

Similarly, $\mathbf{X}_i = [\mathbf{x}_i^{\top}, \Delta^{(1)}\mathbf{x}_i^{\top}, \Delta^{(2)}\mathbf{x}_i^{\top}]^{\top}$ can be obtained accordingly. If $\mathbf{Y}_i$ and $\mathbf{X}_i$ are composed of context features, the $n$ adjacent static features on each side are concatenated with the current static feature frame as $\mathbf{Y}_i = [\mathbf{y}_{i-n}^{\top}, \cdots, \mathbf{y}_i^{\top}, \cdots, \mathbf{y}_{i+n}^{\top}]^{\top}$ and $\mathbf{X}_i = [\mathbf{x}_{i-n}^{\top}, \cdots, \mathbf{x}_i^{\top}, \cdots, \mathbf{x}_{i+n}^{\top}]^{\top}$, respectively. In this paper, $n$ is set to one so that the context and static-dynamic feature vectors have the same dimension and contain the same amount of contextual information; a sketch of both constructions is given below. Finally, the DDAE model of the proposed DAS system can be constructed following the steps presented in Section II. It should be noted that the major difference between building the baseline DDAE model (described in Section II) and the DDAE model of the proposed DAS system is that the baseline system uses the context features as input and only the static features as output of the DDAE, whereas the proposed system uses the static-dynamic or context features as both the input and output of the DDAE.
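As a concrete illustration of the two constructions, the sketch below builds the static-dynamic vectors of (6) and the context vectors for $n = 1$ from a matrix of static features. The edge-replication boundary handling and the function names are our assumptions; the paper does not specify how the first and last frames are treated.

```python
import numpy as np

def static_dynamic(feats):
    """(I, D) static features -> (I, 3D) rows [y, delta1, delta2], per Eq. (6)."""
    padded = np.pad(feats, ((1, 1), (0, 0)), mode="edge")  # replicate boundaries
    delta1 = (padded[2:] - padded[:-2]) / 2.0              # velocity
    delta2 = padded[:-2] - 2.0 * feats + padded[2:]        # acceleration
    return np.hstack([feats, delta1, delta2])

def context(feats, n=1):
    """(I, D) static features -> (I, (2n+1)D) stacked adjacent frames."""
    padded = np.pad(feats, ((n, n), (0, 0)), mode="edge")
    num_frames = feats.shape[0]
    return np.hstack([padded[k:k + num_frames] for k in range(2 * n + 1)])
```

For $n = 1$, both functions return matrices of the same shape, matching the dimension-matched comparison described above.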

B. The enhancement stage

In the enhancement stage, after feature extraction from an utterance with a total number of $\hat{I}$ frames, the noisy speech feature vector $\hat{\mathbf{Y}}$ is first transformed by the well-trained DDAE model into the enhanced speech feature vector $\hat{\mathbf{X}} = [\hat{\mathbf{X}}_1^{\top}, \cdots, \hat{\mathbf{X}}_{\hat{i}}^{\top}, \cdots, \hat{\mathbf{X}}_{\hat{I}}^{\top}]^{\top}$. Both feature vectors $\hat{\mathbf{Y}}_{\hat{i}}$ and $\hat{\mathbf{X}}_{\hat{i}}$ can be composed of either static-dynamic features (e.g., $\hat{\mathbf{X}}_{\hat{i}} = [\hat{\mathbf{x}}_{\hat{i}}^{\top}, \Delta^{(1)}\hat{\mathbf{x}}_{\hat{i}}^{\top}, \Delta^{(2)}\hat{\mathbf{x}}_{\hat{i}}^{\top}]^{\top}$) or context features (e.g., $\hat{\mathbf{X}}_{\hat{i}} = [\hat{\mathbf{x}}_{\hat{i}-1}^{\top}, \hat{\mathbf{x}}_{\hat{i}}^{\top}, \hat{\mathbf{x}}_{\hat{i}+1}^{\top}]^{\top}$). Then, the SPG algorithm with the dynamic or context constraint is employed to generate a smooth static feature vector in the same manner as it is applied to DNN-based VC [17], [21], [23] and SS [16]:

$\hat{\mathbf{x}}_{spg} = f(\hat{\mathbf{X}}; \mathbf{U}^{-1}, \mathbf{M}) = (\mathbf{M}^{\top}\mathbf{U}^{-1}\mathbf{M})^{-1}\mathbf{M}^{\top}\mathbf{U}^{-1}\hat{\mathbf{X}}$,  (7)

where the matrix $\mathbf{U}^{-1}$ is composed of the pre-estimated covariance matrix $\mathbf{\Sigma}^{(XX)}$ of the clean training speech features as

$\mathbf{U}^{-1} = \mathrm{diag}\big[(\mathbf{\Sigma}_{1}^{(XX)})^{-1}, \cdots, (\mathbf{\Sigma}_{\hat{i}}^{(XX)})^{-1}, \cdots, (\mathbf{\Sigma}_{\hat{I}}^{(XX)})^{-1}\big]$.  (8)

The covariance matrix $\mathbf{\Sigma}^{(XX)}$ can be assumed diagonal, i.e.,

$\mathbf{\Sigma}^{(XX)} = \mathrm{diag}\big[\sigma^{2}(1), \cdots, \sigma^{2}(d), \cdots, \sigma^{2}(D)\big]$,  (9)

where $\sigma^{2}(d)$ is the sample variance of the $d$-th dimension of the clean speech feature vector. Finally, $\mathbf{M}$ in (7) is a weighting matrix for appending the dynamic features to the static ones when the enhanced speech feature vector $\hat{\mathbf{X}}$ is composed of the static-dynamic features. Similarly, $\mathbf{M}$ can be derived accordingly when the enhanced speech feature vector $\hat{\mathbf{X}}$ is composed of the context features.
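The sketch below solves (7) directly for the context-feature ($n = 1$) case: it assembles a dense $\mathbf{M}$ that maps the static trajectory to stacked context vectors (with boundary frames replicated) and a diagonal $\mathbf{U}^{-1}$ from per-dimension clean-speech variances, following (8)-(9). The construction is our illustrative assumption rather than the authors' implementation; a practical system would exploit the band structure of $\mathbf{M}^{\top}\mathbf{U}^{-1}\mathbf{M}$ instead of forming dense matrices.

```python
import numpy as np

def spg_smooth(X_hat, var):
    """
    X_hat: (I, 3D) DDAE-enhanced context vectors [x_{i-1}; x_i; x_{i+1}].
    var:   (3D,)   per-dimension sample variances of clean context features.
    Returns the (I, D) smoothed static trajectory solving Eq. (7).
    """
    num_frames, three_d = X_hat.shape
    d = three_d // 3
    # M maps the static trajectory (I*D,) to stacked context vectors (I*3D,).
    M = np.zeros((num_frames * three_d, num_frames * d))
    for i in range(num_frames):
        for k, j in enumerate((i - 1, i, i + 1)):  # previous, current, next frame
            j = min(max(j, 0), num_frames - 1)     # replicate boundary frames
            rows = slice(i * three_d + k * d, i * three_d + (k + 1) * d)
            M[rows, j * d:(j + 1) * d] = np.eye(d)
    U_inv = np.diag(np.tile(1.0 / var, num_frames))          # Eqs. (8)-(9)
    A = M.T @ U_inv @ M                                      # normal equations
    b = M.T @ U_inv @ X_hat.reshape(-1)
    return np.linalg.solve(A, b).reshape(num_frames, d)      # Eq. (7)
```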
IV. EXPERIMENTS

A. Experimental setup

The experiments were conducted on the Mandarin hearing in noise test (MHINT) database. The database includes 320 utterances, which were pronounced by a male native Mandarin speaker in a clean condition and recorded at a 16 kHz sampling rate. These utterances were further down-sampled to 8 kHz to form a more challenging task. Among the 320 utterances, we selected 250 and 50 utterances as the training and testing data, respectively. Two types of noises, namely the car and pink noises, were artificially added to the clean utterances to generate noisy speech utterances at 20, 10, 5, 0, and -5 dB signal-to-noise ratios (SNRs). The car noise (engine noise) is band-limited and corrupts the low-frequency speech content, while the pink noise corrupts larger portions of the speech spectrum. In addition, a DDAE model, consisting of three hidden layers with 300 nodes in each layer, was trained on the 250 paired clean and noisy speech utterances for each SNR and noise type. In the testing phase, each trained DDAE model was then used to transform the 50 noisy utterances into enhanced ones.

The speech waveform was first windowed into frames of 256 samples (32 ms) with a 128-sample frame shift. The samples in each window were then transformed into log-scale power spectrum coefficients. As mentioned earlier, both DDAE and DAS were implemented with both the static-dynamic (SD) and context (CX) features.
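A minimal sketch of the analysis front end described above is given next. The frame length, shift, and log-power mapping follow the description in the text; the Hamming analysis window and the small flooring constant are our assumptions, as the paper does not specify them.

```python
import numpy as np

def log_power_spectra(wave, frame_len=256, shift=128, eps=1e-10):
    """1-D waveform at 8 kHz -> (I, frame_len // 2 + 1) log-power frames."""
    window = np.hamming(frame_len)                 # assumed analysis window
    num_frames = 1 + (len(wave) - frame_len) // shift
    frames = np.stack([wave[i * shift:i * shift + frame_len] * window
                       for i in range(num_frames)])
    spec = np.fft.rfft(frames, axis=1)             # one-sided spectrum
    return np.log(np.abs(spec) ** 2 + eps)         # log-scale power spectrum
```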

B. Evaluation methods

In this study, all the SE systems were evaluated by (1) the quality test in terms of the perceptual evaluation of speech quality (PESQ) [24] and the hearing-aid speech quality index (HASQI) [25], (2) the perceptual test in terms of the hearing-aid speech perception index (HASPI) [26], and (3) the speech distortion index (SDI) test [27]. The score ranges of PESQ, HASQI, and HASPI are -0.5 to 4.5, 0 to 1, and 0 to 1, respectively. Higher PESQ and HASQI scores denote better quality, and higher HASPI scores represent better intelligibility. On the other hand, the SDI measures the degree of speech distortion, and a lower SDI value indicates smaller distortion and thus better performance.

C. Performance evaluation

For each test utterance in a specific condition, we obtained a PESQ, HASQI, HASPI, and SDI score. We report the evaluation results by averaging the scores over the 50 test utterances. In the first experiment, we compare the baseline DDAE method [8] and the conventional MMSE method [7]. Table I shows the scores of the unprocessed noisy speech, the MMSE enhanced speech, and the baseline DDAE enhanced speech in the 0 dB SNR condition, averaged over the two noise types (i.e., the car and pink noises). Here, the baseline DDAE system uses the context features (the concatenation of the static features of the current frame and the immediately preceding and following frames) as the DDAE input, and the static features of the current frame as the DDAE output.

TABLE I
AVERAGE SCORES OF THE UNPROCESSED NOISY SPEECH, MMSE ENHANCED SPEECH, AND DDAE ENHANCED SPEECH IN THE 0 DB SNR CONDITION.

Methods   Noisy   MMSE    DDAE
PESQ      1.690   1.921   2.270
HASQI     0.128   0.137   0.387
HASPI     0.665   0.761   0.998
SDI       0.564   0.349   0.327

From Table I, it can be seen that the objective scores of the MMSE enhanced speech are clearly better than those of the unprocessed noisy speech, i.e., higher PESQ and HASQI scores (quality test), a higher HASPI score (perception test), and a lower SDI value (distortion measurement). Moreover, the DDAE method outperforms the MMSE method in all metrics, which is consistent with the results of our previous study [8]. The evaluation results at the other SNR levels show similar trends; therefore, we do not show that set of results.

Next, we evaluate the use of the static-dynamic and context features in the proposed DAS SE system.

The results are shown in Table II. The testing condition is the same as that in Table I, i.e., we report the average scores over the car and pink noises in the 0 dB SNR condition. The DAS system with the static-dynamic features is denoted as DAS(SD), while the DAS system with the context features is denoted as DAS(CX).

TABLE II
AVERAGE RESULTS OF DDAE, DAS(SD), AND DAS(CX) IN THE 0 DB SNR CONDITION.

Methods   DAS(SD)   DAS(CX)
PESQ      2.349     2.550
HASQI     0.379     0.432
HASPI     0.997     0.999
SDI       0.319     0.226

From Table II, we can see that DAS(CX) outperforms DAS(SD) in all objective metrics, even though the CX features contain the same amount of contextual information as the SD features. A possible explanation is given by a previous study on a VC task [23]. In that study, it was shown that the context features (called multiple frames in [23]) have higher inter-dimensional correlation than the static-dynamic features. As a result, the context features might be more suitable for neural networks (NNs), because NNs are widely believed to be good at modeling features with strong inter-dimensional correlation. Moreover, from Tables I and II, it is obvious that both DAS(CX) and DAS(SD) outperform the baseline DDAE system, confirming the effectiveness of using the SPG algorithm for further processing the DDAE enhanced speech. This result demonstrates that the proposed DAS system (i.e., the DDAE with SPG) can produce a smoother static feature sequence than the baseline DDAE system (i.e., the DDAE without SPG), because the dynamic or contextual constraint is considered when the SPG algorithm is applied.

Finally, to further analyze the effectiveness of DAS(CX), which achieves the best performance in Table II, we compare it with the baseline DDAE system in different noise conditions. Figures 3 and 4 show the PESQ and HASQI scores at different SNRs (including -5, 5, 10, and 20 dB) in the car and pink noise conditions, respectively.

Fig. 3. Scores of (a) PESQ and (b) HASQI for DDAE and DAS(CX) enhanced speech utterances at -5, 5, 10, and 20 dB SNRs in the car noise condition.

Fig. 4. Scores of (a) PESQ and (b) HASQI for DDAE and DAS(CX) enhanced speech utterances at -5, 5, 10, and 20 dB SNRs in the pink noise condition.

From both figures, we first observe that the PESQ and HASQI scores obtained by DAS(CX) and DDAE consistently increase with the SNR. This result reveals that the objective measures (PESQ and HASQI) for both systems are positively correlated with the SNR values. In addition, DAS(CX) always outperforms DDAE at all SNRs. In particular, as shown in Fig. 3(a), DAS(CX) achieves a significant improvement over DDAE in the PESQ score. The result demonstrates that the quality (measured by PESQ and HASQI) of the DDAE enhanced speech can be consistently improved by the SPG algorithm.

Fig. 5. Scores of (a) HASPI and (b) SDI for DDAE and DAS(CX) enhanced speech utterances at -5, 5, 10, and 20 dB SNRs in the car noise condition.

Fig. 6. Scores of (a) HASPI and (b) SDI for DDAE and DAS(CX) enhanced speech utterances at -5, 5, 10, and 20 dB SNRs in the pink noise condition.

From Figs. 5 and 6, similar trends can be found: the performances of both DAS(CX) and DDAE improve with the SNR, and DAS(CX) outperforms DDAE at all SNRs in terms of the HASPI and SDI measures. The result indicates that both the intelligibility (measured by HASPI) and the distortion (measured by SDI) of the DDAE enhanced speech can be further improved by the SPG algorithm. From Figs. 3 to 6, we conclude that the proposed DAS(CX) system consistently outperforms the baseline DDAE and MMSE speech enhancement systems in different noise conditions.

V. CONCLUSION

In this paper, we have proposed incorporating the SPG algorithm into the DDAE speech enhancement system to handle the discontinuity issue, and have intensively investigated the use of two types of temporal information, namely the static-dynamic and context features, in the SPG algorithm in terms of standardized objective tests in different noise types and SNRs. The experimental results on the MHINT speech corpus have demonstrated that the performance of the DDAE speech enhancement system can be further improved by employing an SPG post-processor, and that the context features achieve better improvements than the static-dynamic features. In future work, we will evaluate the proposed DAS system on more noise types and SNRs. We will also apply the sequence error minimization criterion [17] to the DDAE and DAS speech enhancement systems.

VI. ACKNOWLEDGEMENTS

This work was supported by the Ministry of Science and Technology of Taiwan under contract MOST103-2221-E-001-003.

REFERENCES

[1] W. Hartmann, A. Narayanan, E. Fosler-Lussier, and D. Wang, "A direct masking approach to robust ASR," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 10, pp. 1993–2005, 2013.
[2] A. Stark and K. Paliwal, "Use of speech presence uncertainty with MMSE spectral energy estimation for robust automatic speech recognition," Speech Communication, vol. 53, no. 1, pp. 51–61, 2011.
[3] S. Doclo, M. Moonen, T. Van den Bogaert, and J. Wouters, "Reduced-bandwidth and distributed MWF-based noise reduction algorithms for binaural hearing aids," IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 1, pp. 38–51, 2009.
[4] S. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 27, no. 2, pp. 113–120, 1979.
[5] P. Scalart et al., "Speech enhancement based on a priori signal to noise estimation," in ICASSP, pp. 629–632, 1996.
[6] V. Grancharov, J. Samuelsson, and B. Kleijn, "On causal algorithms for speech enhancement," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 3, pp. 764–773, 2006.
[7] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 32, no. 6, pp. 1109–1121, 1984.
[8] X. Lu, Y. Tsao, S. Matsuda, and C. Hori, "Speech enhancement based on deep denoising autoencoder," in INTERSPEECH, pp. 436–440, 2013.
[9] N. Mohammadiha, P. Smaragdis, and A. Leijon, "Supervised and unsupervised speech enhancement using nonnegative matrix factorization," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 10, pp. 2140–2151, 2013.
[10] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, "A regression approach to speech enhancement based on deep neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 7–19, 2015.
[11] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, "An experimental study on speech enhancement based on deep neural networks," IEEE Signal Processing Letters, vol. 21, no. 1, pp. 65–68, 2014.
[12] K. W. Wilson, B. Raj, P. Smaragdis, and A. Divakaran, "Speech denoising using nonnegative matrix factorization with priors," in ICASSP, pp. 4029–4032, 2008.
[13] C. D. Sigg, T. Dikk, and J. M. Buhmann, "Speech enhancement using generative dictionary learning," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 6, pp. 1698–1712, 2012.
[14] A. Narayanan and D. Wang, "Ideal ratio mask estimation using deep neural networks for robust speech recognition," in ICASSP, pp. 7092–7096, 2013.
[15] X. Lu, Y. Tsao, S. Matsuda, and C. Hori, "Ensemble modeling of denoising autoencoder for speech spectrum restoration," in INTERSPEECH, pp. 885–889, 2014.
[16] Y. Qian, Y. Fan, W. Hu, and F. K. Soong, "On the training aspects of deep neural network (DNN) for parametric TTS synthesis," in ICASSP, pp. 3829–3833, 2014.
[17] F.-L. Xie, Y. Qian, Y. Fan, F. K. Soong, and H. Li, "Sequence error (SE) minimization training of neural network for voice conversion," in INTERSPEECH, 2014.
[18] K. Tokuda, T. Kobayashi, and S. Imai, "Speech parameter generation from HMM using dynamic features," in ICASSP, vol. 1, pp. 660–663, 1995.
[19] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura, "Speech parameter generation algorithms for HMM-based speech synthesis," in ICASSP, vol. 3, pp. 1315–1318, 2000.
[20] T. Toda, A. W. Black, and K. Tokuda, "Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, pp. 2222–2235, 2007.
[21] L.-H. Chen, Z.-H. Ling, L.-J. Liu, and L.-R. Dai, "Voice conversion using deep neural networks with layer-wise generative training," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 12, pp. 1859–1872, 2014.
[22] S. Kullback and R. A. Leibler, "On information and sufficiency," The Annals of Mathematical Statistics, vol. 22, no. 1, pp. 79–86, 1951.
[23] L.-H. Chen, Z.-H. Ling, and L.-R. Dai, "Voice conversion using generative trained deep neural networks with multiple frame spectral envelopes," in INTERSPEECH, 2014.
[24] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs," in ICASSP, vol. 2, pp. 749–752, 2001.
[25] J. M. Kates and K. H. Arehart, "The hearing-aid speech quality index (HASQI)," Journal of the Audio Engineering Society, vol. 58, no. 5, pp. 363–381, 2010.
[26] J. M. Kates and K. H. Arehart, "The hearing-aid speech perception index (HASPI)," Speech Communication, vol. 65, pp. 75–93, 2014.
[27] J. Chen, J. Benesty, Y. Huang, and E. Diethorn, "Fundamentals of noise reduction," in Springer Handbook of Speech Processing, ch. 43, 2008.