Self-Supervised Adversarial Multi-Task Learning for Vocoder-Based Monaural Speech Enhancement
Zhihao Du^1, Ming Lei^2, Jiqing Han^1, Shiliang Zhang^2
^1 School of Computer Science and Technology, Harbin Institute of Technology
^2 Speech Lab, Alibaba DAMO Academy
{duzhihao, jqhan}@hit.edu.cn, {lm86501, sly.zsl}@alibaba-inc.com

(This work was performed while the first author was an intern at Speech Lab, Alibaba DAMO Academy. This research was supported by the National Key Research and Development Program of China under Grant 2017YFB1002102 and the National Natural Science Foundation of China under Grant U1736210.)

Abstract

In our previous study, we introduced the neural vocoder into monaural speech enhancement, in which a flow-based generative vocoder is used to synthesize speech waveforms from Mel power spectra enhanced by a denoising autoencoder. This vocoder-based enhancement method outperforms several state-of-the-art models on a speaker-dependent dataset. However, we find a large gap between the enhancement performance on trained and untrained noises. Therefore, in this paper, we propose self-supervised adversarial multi-task learning (SAMLE) to improve the noise generalization ability. In addition, the speaker dependence of vocoder-based methods is also evaluated, which is important for real-life applications. Experimental results show that the proposed SAMLE further improves the enhancement performance on both trained and untrained noises, resulting in better noise generalization. Moreover, we find that vocoder-based enhancement methods can be made speaker-independent through large-scale training.

Index Terms: self-supervised, multi-task learning, vocoder-based speech enhancement

1. Introduction

Monaural speech enhancement aims at separating speech from noisy backgrounds using a single microphone. Since being formulated as a supervised learning problem, monaural speech enhancement has made huge progress with deep learning techniques [1]. Early supervised methods, such as the ideal ratio mask (IRM) [3] and spectral mapping [4], enhance only the magnitude spectrum of noisy speech and leave the noisy phase spectrum unchanged [2]. Recent studies suggest that the phase spectrum is also important for perceptual quality and speech intelligibility [5]. Therefore, many studies make an effort to enhance the magnitude and phase spectra simultaneously. In the complex spectrum domain, the magnitude and phase spectra are enhanced by estimating the complex ratio mask (CRM) [6] or their clean counterparts [7, 8]. In the waveform domain, WaveNet [9] and Wave-U-Net [10] are introduced and trained with the energy-conserving loss [11] and a frequency-domain loss [12], respectively.

In general, it is more difficult to enhance the complex spectrum and the waveform than the magnitude spectrum, due to the lack of clear structure in them. Inspired by progress in the speech synthesis community, we introduced the neural vocoder into monaural speech enhancement in our previous study [13]. Specifically, a denoising autoencoder is trained to reconstruct the clean Mel power spectrum, and a flow-based generative vocoder, FloWaveNet [14], is employed to synthesize the speech waveform from the enhanced features without using the noisy phase spectrum or predicting a clean one. As a result, vocoder-based enhancement methods bypass the problem of phase prediction and outperform several state-of-the-art models on a speaker-dependent dataset [13].

According to the results of our previous study, there is a large gap between the enhancement performance on trained and untrained noises. Therefore, in this paper, we propose self-supervised adversarial multi-task learning (SAMLE) to improve the noise generalization ability of vocoder-based enhancement methods. In addition, since vocoders are originally designed to synthesize waveforms for a specific speaker, previous vocoder-based enhancement methods are always trained and evaluated on a speaker-dependent dataset [13, 15, 16]. However, enhancement methods should be speaker-independent in real-life applications, i.e., the training and test speakers differ. Therefore, we also evaluate the speaker dependence of a flow-based generative vocoder (FloWaveNet) [14] and an autoregressive vocoder (WaveRNN) [17] on both the speech synthesis and enhancement tasks.

2. Vocoder-based enhancement method

The proposed method consists of a denoising autoencoder (DAE), an autoregressive vocoder, and a noise classifier. As shown in Fig. 1, the DAE attempts to reconstruct the clean Mel power spectrum from the noisy complex spectrum, and the autoregressive vocoder synthesizes the speech waveform from the predicted Mel power spectrum. Meanwhile, a noise classifier is involved to perform the self-supervised adversarial multi-task learning (SAMLE). As in [13], we first train the DAE and the vocoder separately, and then stack them together.

Figure 1: (Color online) An overview of the proposed vocoder-based enhancement method. The solid and dashed arrows indicate the forward and backward passes, respectively.

2.1. Denoising autoencoder

In the DAE, the encoder is trained to learn an intermediate representation h from the complex spectrum of the noisy speech, and the decoder aims at reconstructing the clean Mel power spectrum from h. The mean absolute error (MAE) is employed as the loss function, defined between the predicted Mel power spectrum \hat{S} and the clean one S:

\mathcal{L}_{\mathrm{MAE}}(S, \hat{S}) = \frac{1}{TF} \sum_{t=1}^{T} \sum_{f=1}^{F} \left| S(t, f) - \hat{S}(t, f) \right|    (1)

where T and F indicate the total numbers of time frames and frequency bins, respectively, and |·| denotes the absolute value. An optional alternative here is the mean square error (MSE); however, previous studies show that MAE leads to higher speech intelligibility and perceptual quality than MSE [18].

Different from our previous study [13], we feed the DAE with noisy complex spectra rather than noisy Mel power spectra. The complex spectrum can benefit speech enhancement in two respects. First, it contains not only the magnitude information but also the phases. By using phase features such as group delay, the performance of speech recognition and speaker identification has been improved [19, 20]; the complex spectrum may therefore provide another cue (phase information) for speech enhancement. Second, the complex spectrum has a higher frequency resolution than the Mel power spectrum, which provides more details for speech enhancement.

2.2. Self-supervised adversarial multi-task learning

2.2.1. Adversarial multi-task learning

Trained only with the reconstruction loss \mathcal{L}_{\mathrm{MAE}}, the DAE may suffer performance degradation on untrained noises, because the intermediate representation h is not restricted and may contain information about the background noises. To overcome this problem, an additional task, noise classification, is added to the DAE, resulting in a multi-task learning scheme. Specifically, a noise classifier C_\psi is involved, which is trained to distinguish the noise categories from the intermediate representations by minimizing the cross entropy:

\mathcal{L}_{\mathrm{CE}}(c, C_\psi(h)) = -\sum_{i} \sum_{k=1}^{K} c_{ik} \log C_\psi\left(h_i^{(1)}, h_i^{(2)}, \ldots, h_i^{(T)}\right)_k    (2)

where K and T are the total numbers of noise categories and time frames, c_i is a one-hot code of the noise category for noisy utterance i, and h_i^{(t)} is the intermediate representation of utterance i at frame t. While the noise classifier attempts to minimize the cross-entropy loss \mathcal{L}_{\mathrm{CE}}, the encoder tries to maximize it by adjusting the intermediate representation of each frame. Through this adversarial multi-task learning, the intermediate representation can become noise-invariant.

There are two ways to implement the adversarial training. One is adding a gradient reversal layer (GRL) [21] between the encoder and the noise classifier, as in domain adaptation methods [22, 23]. The other is updating the encoder and the noise classifier iteratively, as in generative adversarial networks [24]. In our preliminary experiments, we find that the latter obtains slightly better results; therefore, we employ the second way in the following experiments.

2.2.2. Self-supervised noise classification

To perform the adversarial multi-task learning, we need to assign a noise category to each noisy utterance. However, it is expensive and time-consuming to label thousands of noisy utterances manually. In addition, a noise can be assigned to several categories from different perspectives, which is another obstacle to manual noise labeling. Therefore, we propose a self-supervised noise classification method based on an automatic noise labeling criterion. According to the energy distribution in the frequency domain, we divide noises into three categories, i.e., "low-frequency", "high-frequency", and "full-frequency" noises. Formally, the labeling criterion is defined as follows:

c := \begin{cases} 0 & P_l > P_a/2 \\ 1 & P_h > P_a/2 \\ 2 & \text{otherwise} \end{cases}    (3)

P_l := \sum_{t=1}^{T} \sum_{f=1}^{\lfloor \alpha F \rfloor} |N(t, f)|^2    (4)

P_h := \sum_{t=1}^{T} \sum_{f=\lfloor \beta F \rfloor}^{F} |N(t, f)|^2    (5)

P_a := \sum_{t=1}^{T} \sum_{f=1}^{F} |N(t, f)|^2    (6)

where |N|^2 is the power spectrum of a noise, and P_l, P_h, and P_a represent the total energy in the low, high, and all frequency bins, respectively. \alpha and \beta are hyper-parameters that adjust the ranges of the low and high frequency bins (0 < \alpha \leq \beta < 1).

By applying the self-supervised noise classification to the adversarial multi-task learning, we obtain the self-supervised adversarial multi-task learning (SAMLE), whose optimization target is:

\min_{\theta, \varphi} \max_{\psi} \; \mathcal{L}_{\mathrm{MAE}}(S, D_\varphi(E_\theta(X))) - \lambda \mathcal{L}_{\mathrm{CE}}(c, C_\psi(E_\theta(X)))    (7)

where \lambda is a hyper-parameter that balances the feature reconstruction and the noise classification. Through SAMLE, the intermediate representations can become noise-invariant, which improves the noise generalization ability of the DAE.
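The MAE objective of Eq. (1) is just the mean of absolute differences over all time-frequency bins. A minimal NumPy sketch (function and array names are our own, not from the paper):

```python
import numpy as np

def mae_loss(S, S_hat):
    """Eq. (1): mean absolute error between the clean Mel power
    spectrum S and the predicted one S_hat, both of shape (T, F)."""
    T, F = S.shape
    return np.abs(S - S_hat).sum() / (T * F)

# Toy check: two 2x2 "spectra" differing by 1.0 in every bin.
print(mae_loss(np.zeros((2, 2)), np.ones((2, 2))))  # 1.0
```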
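The labeling criterion of Eqs. (3)-(6) can likewise be sketched directly. The values alpha = 0.25 and beta = 0.75 below are illustrative assumptions (the paper treats them as hyper-parameters), and the 0-based slicing is one plausible reading of the 1-based sum limits:

```python
import numpy as np

def label_noise(N, alpha=0.25, beta=0.75):
    """Eqs. (3)-(6): self-supervised noise label from the energy
    distribution of a noise spectrogram N of shape (T, F).
    alpha/beta (0 < alpha <= beta < 1) are illustrative values."""
    P = np.abs(N) ** 2                                # |N(t, f)|^2
    T, F = P.shape
    P_l = P[:, : int(np.floor(alpha * F))].sum()      # Eq. (4), low bins
    P_h = P[:, int(np.floor(beta * F)) - 1 :].sum()   # Eq. (5), bins floor(beta*F)..F (1-based)
    P_a = P.sum()                                     # Eq. (6), all bins
    if P_l > P_a / 2:
        return 0   # "low-frequency" noise
    if P_h > P_a / 2:
        return 1   # "high-frequency" noise
    return 2       # "full-frequency" noise
```

For example, a noise whose energy sits entirely in the lowest bins receives label 0, one concentrated in the highest bins receives label 1, and spectrally flat noise falls through to label 2.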
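Equation (7) couples the two losses into a min-max game: the classifier is updated to drive the cross entropy down, while the encoder/decoder minimize the reconstruction loss minus lambda times the cross entropy (so the encoder also pushes the cross entropy up). A NumPy sketch of the per-player losses, assuming the classifier outputs per-utterance category probabilities; lambda = 0.1 is an illustrative value:

```python
import numpy as np

def cross_entropy(c_onehot, probs, eps=1e-12):
    """Eq. (2): cross entropy between one-hot noise labels and the
    classifier's category probabilities, both of shape (B, K)."""
    return -np.sum(c_onehot * np.log(probs + eps))

def samle_losses(S, S_hat, c_onehot, probs, lam=0.1):
    """Per-player losses of the min-max target in Eq. (7).
    The encoder/decoder are updated to minimize l_enc_dec; the
    classifier is updated to minimize l_cls; the paper alternates
    the two updates rather than using a gradient reversal layer."""
    l_mae = np.abs(S - S_hat).mean()        # Eq. (1)
    l_ce = cross_entropy(c_onehot, probs)   # Eq. (2)
    l_enc_dec = l_mae - lam * l_ce          # minimized over theta, varphi
    l_cls = l_ce                            # minimized over psi
    return l_enc_dec, l_cls
```

Note the sign: maximizing Eq. (7) over psi is equivalent to minimizing the plain cross entropy, which is why `l_cls` carries no minus lambda factor.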