
Non-stationary Noise Cancellation Using Deep Autoencoder Based on Adversarial Learning

Kyung-Hyun Lim, Jin-Young Kim, and Sung-Bae Cho

Department of Computer Science, Yonsei University, Seoul, South Korea {lkh1075,seago0828,sbcho}@yonsei.ac.kr

Abstract. Studies have been conducted to obtain clean data from non-stationary noisy signals, which is one of the areas of speech enhancement. Since conventional methods rely on first-order statistics, intensive effort has gone into eliminating noise with deep learning methods. In the real environment, many types of noise are mixed with the target sound, making it difficult to remove only the noise. Moreover, most previous works modeled only a small amount of non-stationary noise, which is hard to apply in the real world. To cope with this problem, we propose a novel deep learning model that enhances the auditory signal through adversarial learning with two types of discriminators. One discriminator learns to distinguish a clean signal from the one enhanced by the generator, and the other is trained to recognize the difference between the eliminated noise signal and the real noise signal; in other words, the second discriminator learns the form of the noise. Besides, a novel learning method is proposed to stabilize the unstable adversarial learning process. To verify the performance of the proposed model, we use 100 kinds of noise, more than in the previous works. The experimental results show that the proposed model outperforms other conventional methods, including the state-of-the-art model, in removing non-stationary noise. The scale-invariant source-to-noise ratio (SI-SNR) is used as the objective evaluation metric, on which the proposed model achieves a score of 5.91, statistically significant under a t-test against the other methods.

Keywords: Speech enhancement · Noise cancellation · Non-stationary noise · Generative adversarial networks · Autoencoder

1 Introduction

Speech enhancement can be divided into several specific problems such as noise elimination and quality improvement. Since noise takes very diverse forms, it is difficult to remove without prior knowledge. In particular, the problem becomes harder for non-stationary noises. As shown in Fig. 1, it is difficult to recognize the non-stationary noise in the given signal [1], which is why modeling it is an essential issue for noise cancellation. Works to eliminate stationary noise have been conducted for a long time, but the traditional methods have many limitations because they use filtering based on a threshold [1]. If the signal is processed in the frequency domain through the Fourier transform (FT) as shown in Fig. 2, general noise does not overlap much with the clean signal in the frequency domain, so filtering can work to some extent. However, non-stationary noise is difficult to remove by filtering because the frequency range of the voice signal and the noise overlap, or the affected region is not constant. Many researchers have therefore studied eliminating noise by parameterizing it with deep learning [1–4]. Models based on recurrent neural networks (RNN) are used to model signal data, which is time-series data [5, 6]. Various conversions of the sound data are applied to these models as input, resulting in high cost. Other models include the generative adversarial network (GAN) method [6]. Pascual et al. proposed a model with excellent performance in noise removal, but it has the limitation of being vulnerable to non-stationary noise [6]. To overcome the limitations of the previous works, we propose a model that learns the form of the noise itself. The proposed model is trained with two discriminators in an adversarial learning structure. In addition, a novel method is proposed to stabilize the adversarial learning process. The proposed model reduces the preprocessing cost by using raw data as input and generalizes over various noises through adversarial learning with the two discriminators. Besides, non-stationary noise can be removed by the additional noise-aware discriminator.

© Springer Nature Switzerland AG 2019. H. Yin et al. (Eds.): IDEAL 2019, LNCS 11871, pp. 367–374, 2019. https://doi.org/10.1007/978-3-030-33607-3_40

Fig. 1. Example of (top) general noise and (bottom) non-stationary noise cancellation.

Fig. 2. Comparison of general noise with non-stationary noise in frequency domain.

2 Related Works

There are two main approaches to noise cancellation: filtering and deep learning. The filtering methods previously applied to noise cancellation are Wiener filtering and spectral subtraction [7, 8]. Both methods basically eliminate noise by removing a specific signal with high or low frequency based on a threshold. However, no matter how well the threshold is set, there can be several types of noise in the real world, and it is impossible to set a threshold that eliminates all of them. To address the limitation that filtering alone makes it difficult to construct a function mapping the complex relationship between noisy and clean signals, deep learning methods were introduced for noise cancellation [1]. Tamura and Waibel used a basic multilayer perceptron [1]. They argued that a neural network can find the relationship between two signals. However, the data used in their experiment were strongly constrained: only one speaker was used, and the vocabulary contained few words. Since then, various attempts have been made to create models that can eliminate diverse noises independently. Especially, since a signal is also time-series data, RNNs have been used as base models [9–11]. Parveen and Green showed that noise could be removed independently of the topic with speech data from 55 speakers, but their signals contained only eleven words and they experimented with a small amount of noise [11]. Another study attempted to remove noise using an autoencoder [12]. This model also showed good performance for general noise but considered only a small amount of noise. Since using raw data reduces information loss and preprocessing costs, Pascual et al. proposed an end-to-end model that first used adversarial learning in the field of noise cancellation [6]. However, it also had the limitation that non-stationary noise remained a vulnerability.
In the real environment, since the general noise can also be non-stationary due to various environmental changes, solving this problem is an important issue [13]. The proposed model includes a powerful noise-aware module to eliminate non-stationary noise, and its performance will be evaluated in Sect. 4.

3 The Proposed Method

3.1 Overview

The process of the proposed method is divided into three parts: building powerful discriminators, training the proposed generator, and removing the noise with a shadow generator. Training a model with an adversarial process is an unstable learning procedure [14, 15]. To prevent this problem, we use two generators having the same structure; only the first generator is trained adversarially. In the first phase, the original generator and the discriminators are learned through adversarial learning. In the second phase, only the shadow generator is learned, using the discriminators trained in the first phase. The intention is to create a generator that can learn steadily with a strong teacher. In the third phase, noise is removed using the learned shadow generator.
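The three phases above can be sketched as a training schedule. This is a hedged, minimal skeleton, not the authors' implementation: the update functions (`adversarial_step`, `shadow_step`, `enhance`) are hypothetical stand-ins for real gradient steps and inference.

```python
# Illustrative three-phase schedule; all step functions are hypothetical stubs.
def train(data, n_epochs, adversarial_step, shadow_step, enhance):
    log = []
    # Phase 1: adversarial learning of the original generator and discriminators.
    for _ in range(n_epochs):
        for batch in data:
            log.append(("adv", adversarial_step(batch)))
    # Phase 2: the shadow generator learns from the now-frozen discriminators.
    for _ in range(n_epochs):
        for batch in data:
            log.append(("shadow", shadow_step(batch)))
    # Phase 3: inference - the shadow generator removes noise from given signals.
    return log, [enhance(batch) for batch in data]
```

The key design point is that the discriminators are updated only in phase 1; in phase 2 they act purely as fixed teachers.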

3.2 Pretraining Discriminators

Figure 3 represents the adversarial learning structure of the original generator and the two discriminators in the first phase. The two discriminators have the same structure, composed of fully convolutional layers that extract features from the data. One is trained to classify the pair of clean and noisy signals as true and the pair of generated and noisy signals as false; its task is to distinguish between the clean signal and the generated signal. The other uses as its true pair the difference between the noisy signal and the clean signal, and as its false pair the difference between the noisy signal and the generated signal. This means it learns the form of the noise itself, the residual between the two signals, which helps the generator learn to produce a signal closer to the clean one. The generator is structured as an autoencoder. The encoder and decoder are composed of fully convolutional layers, and a U-NET structure is added to prevent information loss between the encoder and the decoder [16].

Fig. 3. The overall structure of the proposed method. Real noise data means the noise itself.

After the discriminators are learned, the generator is trained using them. Since Dx learns the true pair as true and the false pair as false, its objective function can be expressed as Eq. (1). Likewise, Dx′ has similar learning objectives, so its loss function can be defined as in Eq. (2).

\[
\max_{G_\theta} \min_{D_x} \left[ \{ D_x(T(x, \tilde{x})) - 1 \}^2 + \{ D_x(T(G_\theta(Z, \tilde{x}), \tilde{x})) \}^2 \right] + \lambda \| G_\theta(Z, \tilde{x}) - x \|_1 \tag{1}
\]

\[
\max_{G_\theta} \min_{D_{x'}} \left[ \{ D_{x'}(\tilde{x} - x) - 1 \}^2 + \{ D_{x'}(\tilde{x} - G_\theta(Z, \tilde{x})) \}^2 \right] + \lambda \| \{ G_\theta(Z, \tilde{x}) - \tilde{x} \} - (x - \tilde{x}) \|_1 \tag{2}
\]

where x denotes the clean signal, i.e., the target the model should produce, x̃ the noisy signal, Gθ one of the generators, and T concatenation. The noisy signal is used as the input of the encoder, and the output of the encoder together with Z, a sample from a prior distribution, is used as the input of the decoder. The generator adapts its parameters so that the discriminators classify its outputs as 'true'. Besides, the additional L1 loss drives the generator to produce a signal closer to the target [17]. Equation (1) minimizes the difference between the generated signal and the clean signal. Equation (2) produces a signal closer to reality by also minimizing the difference between the removed noise and the real noise [15].
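To make the roles of the two discriminators concrete, here is a minimal sketch of the least-squares adversarial losses of Eqs. (1)–(2) computed on toy scalar "signals". This is not the authors' code: `D_x`, `D_n`, and `G` are hypothetical callables standing in for the signal discriminator, the noise discriminator, and the generator, and the pairing is shown as extra arguments rather than channel concatenation.

```python
# D_x scores (signal, condition) pairs; D_n scores residual noise; G(z, noisy)
# produces an enhanced signal. All values here are plain floats for clarity.

def d_x_loss(D_x, G, x, x_noisy, z):
    """Discriminator 1: real pair (x, x_noisy) -> 1, fake pair (G(z, x_noisy), x_noisy) -> 0."""
    real = (D_x(x, x_noisy) - 1.0) ** 2
    fake = D_x(G(z, x_noisy), x_noisy) ** 2
    return real + fake

def d_n_loss(D_n, G, x, x_noisy, z):
    """Discriminator 2: real noise (x_noisy - x) -> 1, removed noise (x_noisy - G(...)) -> 0."""
    real = (D_n(x_noisy - x) - 1.0) ** 2
    fake = D_n(x_noisy - G(z, x_noisy)) ** 2
    return real + fake

def g_loss(D_x, D_n, G, x, x_noisy, z, lam=100.0):
    """Generator: fool both discriminators and stay L1-close to the clean target."""
    adv1 = (D_x(G(z, x_noisy), x_noisy) - 1.0) ** 2
    adv2 = (D_n(x_noisy - G(z, x_noisy)) - 1.0) ** 2
    l1 = abs(G(z, x_noisy) - x)
    return adv1 + adv2 + lam * l1
```

The weight `lam` is an assumption for illustration; the paper does not state its value.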

3.3 Training Shadow Generator

The adversarial learning method is difficult to train stably [14, 15]. Generators, which are in charge of a complex task compared to the discriminators, often diverge during learning [13]. To prevent this, we propose a shadow generator Gθ′. It is trained using the pretrained discriminators, which serve as powerful instructors for producing a clean signal. Gθ′ simply learns the target distribution from the discriminators; it is not trained adversarially. As shown in Fig. 4, Gθ, which learns adversarially, cannot converge easily and its loss oscillates, while Gθ′ continues to learn stably. The loss functions are defined as Eqs. (3) and (4), in a form similar to that of Gθ.

\[
\min_{G_{\theta'}} \{ D_x(G_{\theta'}(Z, \tilde{x})) - 1 \}^2 + \lambda \| G_{\theta'}(Z, \tilde{x}) - x \|_1 \tag{3}
\]

\[
\min_{G_{\theta'}} \{ D_{x'}(\tilde{x} - G_{\theta'}(Z, \tilde{x})) - 1 \}^2 + \lambda \| \{ G_{\theta'}(Z, \tilde{x}) - \tilde{x} \} - (x - \tilde{x}) \|_1 \tag{4}
\]

4 Experiments

4.1 Dataset and Evaluation Metric

To verify the proposed method, we use the noise sounds provided by Hu and Wang [18]. The noise data has 20 categories with several noises in each category. As clean signals, we use data recorded by 30 native English speakers producing 400 sentences each [19]. The dataset for learning is constructed by combining the noise data and the clean signal data as shown in Fig. 5. Non-stationary noises are selected from the noise data and divided into training and test data in an 8:2 ratio. We use the scale-invariant source-to-noise ratio (SI-SNR) as the evaluation metric to compare the similarity of signals [20]. The SI-SNR value is calculated using Eq. (5).

\[
C_{\text{target}} = \frac{\langle x, \tilde{x} \rangle \, x}{\|x\|^2}, \qquad
E_{\text{noise}} = \tilde{x} - C_{\text{target}}, \qquad
\text{SI-SNR} = 10 \log_{10} \frac{\|C_{\text{target}}\|^2}{\|E_{\text{noise}}\|^2} \tag{5}
\]
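Equation (5) can be implemented directly, following the definition of Luo and Mesgarani [20]. The sketch below uses plain Python lists rather than array libraries so each term of the equation is explicit; it assumes equal-length, non-silent signals.

```python
import math

def si_snr(x, x_est):
    """SI-SNR of estimate x_est against clean reference x (equal-length lists)."""
    dot = sum(a * b for a, b in zip(x, x_est))
    energy = sum(a * a for a in x)
    c_target = [dot / energy * a for a in x]            # projection of x_est onto x
    e_noise = [b - c for b, c in zip(x_est, c_target)]  # residual orthogonal to x
    num = sum(c * c for c in c_target)
    den = sum(e * e for e in e_noise)
    return 10.0 * math.log10(num / den)
```

Because `c_target` is a projection, rescaling both signals by the same factor leaves the value unchanged, which is the "scale-invariant" property the metric is named for.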

Fig. 4. The change of two generators loss according to epoch.

Fig. 5. Process of making noisy signal using noise data and clean signal data.
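The mixing step of Fig. 5 can be sketched as additive mixing of a noise clip into a clean clip at a chosen signal-to-noise ratio. This is a hypothetical illustration of the dataset-construction process: the paper does not specify the mixing SNRs, so `snr_db` and the scaling rule are assumptions.

```python
import math

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean/noise power ratio equals snr_db, then add."""
    p_clean = sum(s * s for s in clean) / len(clean)
    p_noise = sum(n * n for n in noise) / len(noise)
    scale = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return [s + scale * n for s, n in zip(clean, noise)]
```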

4.2 Results of Noise Cancellation

The left panels of Figs. 6 and 7 show the signals in the time and frequency domains. Comparing the waveforms of the clean signal and the noisy signal confirms that non-stationary noise has been applied to the clean signal, while the waveforms of the clean signal and the generated signal are almost identical. Four comparison groups are used to evaluate the performance of the proposed model: an autoencoder trained as a generator alone without adversarial learning, the traditional low-pass filtering method, the model of Pascual et al. [6], and the unprocessed noisy signal. The right panel of Fig. 6 shows the SI-SNR values of each model on the test data as a box plot. Table 1 shows the result of a t-test on the SI-SNR values.

Fig. 6. (Left) Waveform of input and output signals. (Right) Box plot using SI-SNR values of each model.

Fig. 7. Spectrogram of signals.

Table 1. Result of t-test on SI-SNR values.

             Ours     Autoencoder      Noisy signal   Pascual et al. [6]   Filtering
  Mean       5.971    4.176            1.121          −9.428               −16.982
  Variance   11.356   13.174           35.173         11.356               65.659
  p-value    –        2.29 × 10⁻¹³⁷    0.             0.                   0.

The t-test shows that the experimental results are statistically significant, and they demonstrate that our model outperforms the other conventional methods in canceling non-stationary noise.

5 Conclusions

In this paper, we propose a model that can eliminate more non-stationary and unseen noise than previous studies. We create a powerful instructor through adversarial learning between the discriminators and the generator. To solve the problem that the generator easily diverges during the adversarial learning process, a shadow generator Gθ′ with the same structure is provided to facilitate stable learning. Gθ′ learns from the powerful instructors and generates an enhanced signal from a given noisy signal. Several experiments show that the proposed method achieves the best performance, 5.91 in terms of SI-SNR. This means that our model can remove not only non-stationary noise but also various other types of noise. In future work, we will add comparative models to further demonstrate the objective performance of the model, adopt additional metrics used in the domain, and build and release a web demo system to test our model.

Acknowledgment. This work was supported by a grant funded by the 2019 IT promotion fund (Development of AI based Precision Medicine Emergency System) of the Korean government (Ministry of Science and ICT).

References

1. Tamura, S., Waibel, A.: Noise reduction using connectionist models. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 553–556 (1988)
2. Xu, Y., Du, J., Dai, L.R., Lee, C.H.: A regression approach to speech enhancement based on deep neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 7–19 (2014)
3. Kumar, A., Florencio, D.: Speech enhancement in multiple-noise conditions using deep neural networks. arXiv:1605.02427 (2016)
4. Weninger, F., et al.: Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR. In: International Conference on Latent Variable Analysis and Signal Separation, pp. 91–99 (2015)
5. Kim, J.Y., Bu, S.J., Cho, S.B.: Hybrid deep learning based on GAN for classifying BSR noises from in-vehicle sensors. In: International Conference on Hybrid Artificial Intelligent Systems, pp. 561–572 (2018)
6. Pascual, S., Bonafonte, A., Serra, J.: SEGAN: speech enhancement generative adversarial network. arXiv:1703.09452 (2017)
7. Lim, J., Oppenheim, A.: All-pole modeling of degraded speech. IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 26, no. 3, pp. 197–210 (1978)
8. Berouti, M., Schwartz, R., Makhoul, J.: Enhancement of speech corrupted by acoustic noise. In: International Conference on Acoustics, Speech, and Signal Processing, vol. 4, pp. 208–211 (1979)
9. Maas, A.L., Le, Q.V., O'Neil, T.M., Vinyals, O., Nguyen, P., Ng, A.Y.: Recurrent neural networks for noise reduction in robust ASR. In: InterSpeech, pp. 22–25 (2012)
10. Sun, L., Du, J., Dai, L.R., Lee, C.H.: Multiple-target deep learning for LSTM-RNN based speech enhancement. In: IEEE Hands-free Speech Communications and Microphone Arrays, pp. 136–140 (2017)
11. Parveen, S., Green, P.: Speech enhancement with missing data techniques using recurrent neural networks. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 733–736 (2004)
12. Lu, X., Tsao, Y., Matsuda, S., Hori, C.: Speech enhancement based on deep denoising autoencoder. In: InterSpeech, pp. 436–440 (2013)
13. Cohen, I.: Multichannel post-filtering in nonstationary noise environments. IEEE Transactions on Signal Processing, vol. 52, no. 5, pp. 1149–1160 (2004)
14. Goodfellow, I., et al.: Generative adversarial nets. In: Neural Information Processing Systems, pp. 2672–2680 (2014)
15. Kim, J.Y., Bu, S.J., Cho, S.B.: Zero-day malware detection using transferred generative adversarial networks based on deep autoencoders. Information Sciences, vol. 460–461, pp. 83–102 (2018)
16. Isola, P., Zhu, J.Y., Zhou, T., Efros, A.: Image-to-image translation with conditional adversarial networks. arXiv:1611.07004 (2016)
17. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
18. Hu, G., Wang, D.L.: A tandem algorithm for pitch estimation and voiced speech segregation. IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, pp. 2067–2079 (2010)
19. Valentini, C.: Noisy speech database for training speech enhancement algorithms and TTS models. University of Edinburgh, School of Informatics, Centre for Speech Technology Research (2016)
20. Luo, Y., Mesgarani, N.: TasNet: surpassing ideal time-frequency masking for speech separation. arXiv:1809.07454 (2018)