INTERSPEECH 2020 October 25–29, 2020, Shanghai, China

A Cross-channel Attention-based Wave-U-Net for Multi-channel Speech Enhancement

Minh Tri Ho1, Jinyoung Lee1, Bong-Ki Lee2, Dong Hoon Yi2, Hong-Goo Kang1

1Department of Electrical and Electronic Engineering, Yonsei University, Seoul, South Korea 2Artificial Intelligence Lab, LG Electronics Co., Seoul, South Korea [email protected]

Abstract performance and robustness are not reliable in harsh environ- ments. Recently, deep-learning based approaches have been In this paper, we present a novel architecture for multi- introduced to the first stage of enhancement [11, 12]. In ad- channel speech enhancement using a cross-channel attention- dition, end-to-end models have been designed to extract both based Wave-U-Net structure. Despite the advantages of utiliz- spectral and spatial features. Examples of this approach include ing spatial information as well as spectral information, it is chal- the Wave-U-Net model [7] and time-domain convolutional au- lenging to effectively train a multi-channel deep learning sys- toencoder model [13]. Since end-to-end structures only mimic tem in an end-to-end framework. With a channel-independent the behavior of beamforming at the first layer of the encoder, encoding architecture for spectral estimation and a strategy to their capabilities for spatial filtering should be limited by the extract spatial information through an inter-channel attention signal-to- ratio (SNR) of the input signal and the number mechanism, we implement a multi-channel speech enhance- of microphones. In our preliminary experiments, we found out ment system that has high performance even in reverberant and that this simple approach would not work in extremely low SNR extremely noisy environments. Experimental results show that conditions. the proposed architecture has superior performance in terms of To overcome the problem, we modify the Wave-U-Net signal-to- ratio improvement (SDRi), short-time ob- structure to process each channel separately, but utilize inter- jective intelligence (STOI), and phoneme error rate (PER) for channel information using a nonlinear spatial filter designed by speech recognition. a lightweight attention block. Motivated by the attention block Index Terms: Multi-channel Speech Enhancement, Wave-U- proposed in [14], we exploit inter-channel information in a more Net, Cross-Channel Attention straightforward manner which directly compares information in one channel with another. Moreover, since encoding each input 1. Introduction channel independently helps to preserve spatial information, the Speech enhancement aiming to enhance the quality of the target output of each encoding layer can be repeatedly utilized to esti- speech signal is crucial to improving the robustness to noise for mate inter-channel information speech recognition [1, 2]. With the development of deep learn- Our contributions in this paper are three-fold. First, we pro- ing, data-driven speech enhancement approaches have shown pose a novel modification of the well-known Wave-U-Net ar- breakthroughs when using a single microphone. In most of sin- chitecture for multi-channel speech enhancement. It separately gle channel approaches [3, 4, 5] the speech signal is first trans- encodes each channel, then interchanges information between formed into the frequency domain, after which time-frequency encoded channels after each downsampling block is efficiently (TF) masks are estimated to determine the amount of noise re- combined by a convolution size of one before passing to the duction in each TF bin. However, their performance improve- skip connection. Secondly, we introduce a cross-channel atten- ments are not significant in low signal-to-noise environments tion block to boost network performance by effectively exploit- because of their limitations on estimating the phase spectrum. ing spatial information of multi-channel data. To the best of our Williamson et al. [6] estimated the TF masks in the complex knowledge, our paper is the first work proving a model’s per- domain, but it was not easy to train the network. To over- formance in an artificial multi-channel acoustic scenario with come this limitation, Wave-U-Net [7] was proposed as a time all of the following four intricate challenges: minimum num- domain approach. Since the Wave-U-Net model directly es- ber of microphones (only two microphones with a small dis- timates the clean target signal waveform, it does not need to tance between them), varying positions of both speech and noise consider the phase problem and spectral efficiency caused by sources, reverberation and extremely low SNR conditions (-10 the time-frequency transformation with a fixed-length analysis dB, -5 dB cases). frame. The rest of this paper is organized as follows. Section When multiple microphones are available, the performance 2 gives a review of related works while section 3 describes of speech enhancement algorithms can be further improved our baseline model. Section 4 represents our proposed cross- because of spatially related information between the micro- channel attention Wave-U-Net architecture. Section 5 describes phones [8]. Statistical approaches such as beamforming [9] our experiment and analyses the experiment results, followed and multi-channel Wiener filtering [10] first estimate the di- by the conclusion in the final section. rection of arrival (DOA) between microphones, then enhance the incoming signal from the estimated source direction but 2. Related Works attenuate interference from other directions using a linear fil- ter [8]. Although the methods are fast and lightweight, their With the success of deep-learning based single-channel speech enhancement approaches, much research has been done on This work is supported and funded by LG Electronics Co., Ltd. combining models with traditional statistic-based beamforming

Copyright © 2020 ISCA 4049 http://dx.doi.org/10.21437/Interspeech.2020-2548 Concat Mic 1 Enhanced speech

Mic 2

Encoder ,

Mic 1

, 24 , 2, 2, 48

∙ ∙ ∙ 240 Decoder

/512, /512, 240

/4, 48 /4,

/2, 24 /2, 24 /2, 48 /4,

/

푇 /1024

푇 푇

( , 1) 푇

푇 푇 푇

푇 푇 Conv1d kernel size 1 Figure 1: Wave-U-Net structure for multi-channel data.

, Conv1d

DS stands for a downsampling block and US does for an Mic 2 kernel size 15 , 24 ,

∙ ∙ ∙ 240

/512, /512, 240

/4, 48 /4,

/2, 24 /2, 24 /2, 48 /2,

/4, 48 /4,

/1024 푇

푇 푇

( , 1) 푇

푇 푇 푇 upsampling block. L denotes the number of downsam- 푇 pling/upsampling blocks. Downsampling

Figure 2: Proposed Cross-channel attention-based Wave-U-Net algorithms. Typical works that manifest this idea include [11], structure for multi-channel data. Feature maps obtained at the [12] and [15]. The common strategy of these works is that the encoding layer are interchanged between channels after pro- neural network tries to reduce single channel noise using TF cessing each downsampling block. masks at the first stage, after which beamforming is applied to linearly integrate the multi-channel signals. This approach 4. Proposed Model not only gives better results than the pure statistical methods In this section, we propose a cross-channel attention-based but also shows robustness in terms of various noisy types and multi-channel Wave-U-Net model by modifying the baseline ar- SNR ranges. However, in this method, the neural network only chitecture described in Section 3. Fig. 2 illustrates the architec- learns the spectral characteristics of a single channel, not spa- ture of our proposed model. Although the proposed algorithm tial information. To address this limitation, recent deep learning can be generalized to an arbitrary number of channels, we fix approaches have used inter-channel features as additional inputs the number of channels to two for simplicity in this paper. to the network, such as generalized cross-correlation (GCC), in- teraural phase or level difference (IDP, ILD) [16, 17]. Although 4.1. Encoder Structure learning spectral information together with spatial features re- sults in performance improvements, adding spatial features as To provide flexibility for the processing of each channel and to the separated input makes it difficult to learn the mutual rela- explicitly utilize cross-channel relationships, the encoder pro- tionship between spatial and spectral information. cesses each channel independently. Feature maps from the en- coder of each channel are used as inputs to the cross-channel To handle multi-channel data, the author of Wave-U-Net attention block, then are interchanged between channels. The proposed that the first layer of the network takes into account main objective of the cross-channel attention block is to derive all input channels. A similar solution could be found in [13], the relationship between two channels. Details of this block are in which the authors used a dilated convolution in- described in subsection 4.3. At the bottleneck of the network, stead of the U-Net structure. However, when handling all the feature maps obtained by the encoder of each channel are con- input channels together, the first convolution layer only played catenated, after which they are projected into one feature map the role of performing nonlinear channel fusion. using a 1-D convolution layer. When the number of channels is greater than two, one channel is chosen as a reference channel 3. Wave-U-Net Baseline for Speech and feature maps are interchanged between the reference and Enhancement other channels. 4.2. Decoder Structure Wave-U-Net was originally developed for a singing voice source separation task by reconfiguring the U-Net structure [5] In each decoding layer, feature maps extracted from the encod- in the time domain. Based on an encoder-decoder structure, it ing layer are fused by a 1-D convolution layer with size 1. The introduces skip connections between the same layer in down- processing is different from the baseline structure that uses a sampling and upsampling blocks so that the high-level features direct skip connection between the same level layers of the en- of the decoding layer also include local features from the en- coder and decoder. The size 1 convolution has two main roles. coding later [7]. Firstly, features from the encoder are effectively combined in a manner such that the network itself can learn to drop or bring The single channel Wave-U-Net structure can be extended any features. Secondly, this convolution helps to reduce the to a multi-channel structure if the number of channels of the number of network parameters by half of the total feature map input waveform signal is increased to correspond to the number size. Details on the network parameters are summarized in Ta- of microphones. Therefore, the input shape of the multi-channel ble 1 in Section 5. model is T ×C, where T and C are the number of audio samples and channels, respectively. The second dimension is treated as 4.3. Cross-channel Attention Block a feature map of the first convolution layer. We use this simple form of multi-channel Wave-U-Net illustrated in Fig. 1 as the In real-life situations, the target source position does not change baseline for the proposed model. much compared to interference ones; therefore, the time delay

4050 Table 1: Network setup. l ∈ 1, 2, ..., L is layer index, for base- line: L = 10, for propose model: L = 12 Mic 1 ∙ ∙ ∙ (T, 1) Baseline Proposed model model Number of encoder 1 2 Mic 2 Encoder input (16384,2) (16384,1) ∙ ∙ ∙ (T, 1) Encoder Number of DS block 12 10 1Dconv kernel size 15 15 퐴푙 2 Mic 1 Number of kernel of th 24l 24l

Mic 2 l -1Dconv layer

1

푙 푙 +

푙 1Dconv kernel size 15 15

푁 푁

, , ,

푁 Absolute 푙 푙 Bottleneck

, , ∙

푙 푇 푇 Number of kernel 312 264

푇 value 푇푙, 푁푙 Number of decoder 1 1 Conv1d Number of DS block 12 10 Decoder Learnable 1Dconv kernel size 5 5 ∙ Sigmoid Number of kernel of Elementwise th 24l 24l multiplication l -1Dconv layer 푙 푙 푙 푙

1 푇 ,2푁 푇 ,2푁

푙 푙

+ Elementwise

, , 푁

, ,

푙 푙

, , addition

푇 푇 푇푙, 푁푙 푇푙, 푁푙 Afterward, the mask is transformed back to the original sig- 푙 푙 nal’s space by 1-D convolution with kernel size 1 and sigmoid 푋1 푋 2 activation. This process can be modeled as:

Figure 3: Cross-channel attention block (brown box) between M l = σ(f(σ [|tanh(f(Xl , Φl )) two encoders. The attention block is applied after every down- α,β 1 1 l l l l l sampling block in the encoder. (T ,N ) represents the shape of tanh(f(X2, Φ2))|], Φ3)), (2) feature map tensor at downsampling block l. where denotes element-wise multiplication between two ma- l T l×Nl l between channels in the voice active region is shorter than the trices. M ∈ R represents the mask and f(X, Φi) rep- one in the interference region. In addition, the power of the resents the 1-D convolution with kernel size 1 on the input X voice active region is likely to be higher than those of noise of and kernel Φi. Signals from one layer after multiplying with l ones even in low SNR cases. Our proposed cross-channel atten- the mask are added again to obtain the final output Ai: tion block utilizes this characteristic to emphasize voice active l l l l Ai = M Xi + Xi . (3) regions while attenuating directive interference regions. Fig. 3 illustrates a block diagram of the proposed cross-channel atten- The residual connection has two advantages. Firstly, it tion block. helps to avoid the gradient vanishing problem frequently ob- l served when multiple layers are used. Note that multiplying Xi is the feature map corresponding to the encoder of chan- nel i. The notation (.)l is the indicator of the attention block feature maps with the masking values in the range [0, 1] con- th l l l tinuously reduces the value over layers. Secondly, in case the located at the l downsampling block. Xi has shape T × N , where T l and N l are the number of samples in the time domain clean signal is mis-filtered out in a certain layer, this operation and the number of feature maps at the downsampling block lth, keeps the information so that it can still be processed in the next l respectively. At first, two feature maps are put into a 1-D con- layers. The output Ai of attention block corresponding to fea- l volution layer with kernel size 1, followed by a hyperbolic tan- ture map Xi of channel i is ”cross” concatenated to the feature l l gent (tanh) activation function. Inspired by works in [18] and map of another channel; in short, A1 is concatenated to X2 and [19], this convolution layer plays a role as a linear transforma- vice versa. tion of the input tensor into an intermediate space. On the other hand, the tanh activation bounds the input tensor to the range 5. Experiment of [−1, 1], which is useful when it is used as an input to the 5.1. Database Setup learnable sigmoid activation function later. Afterward, the two signals are element-wise multiplied with each other. Intuitively, 5.1.1. Room Simulation and Noise Setup the multiplication operation emphasizes the regions which are Multi-channel data used for this experiment is generated by a slowly varying in time and have high power; thus, we expect spatialization of single-channel data in artificially defined room that they would be highly related to voice active regions. Next, conditions. We used the image source method (ISM) [20] to we feed the absolute value of the multiplication result to the calculate the room impulse response to each microphone. Py- input of the learnable sigmoid function: roomacoustics library [21] was used to implement the ISM for 1 σα,β (x) = , (1) acoustic simulation with a room geometry of 8-meters length, 1 + e−α(x−β) 8-meters width, and 3-meters height. Reverberation was in- where σα,β is the sigmoid function parameterized by two pa- cluded with the first reflection order. Two omni-directional mi- rameters α and β. The sigmoid function works as a filter to crophones were fixed at the middle of the wall with 8 centime- reduce the noise components. The parameter β controls the ters horizontally apart. With these setups, we established a polar threshold highlighting signal whose value greater than it while co-ordinate with the pole at the middle of two microphones and attenuating signal with a smaller value. On the other hand, the the polar axis was perpendicular to the wall containing 2 micro- parameter α controls the softness of the mask; a large value of phones. The clean source was located in front of two micro- α pushes the signal close to saturated values 0 and 1. phones with the radial distance of 1 meter and the angle varied

4051 Table 2: Average SDR improvement (dB), STOI and PER (%)

Method SDRi STOI PER Noisy - 0.686 76.440 Ideal MVDR 10.134 0.834 55.125 Single channel 11.082 0.857 54.410 Wave-U-Net Figure 4: From left to right: Visualization of attention mask at Multichannel WaveUNet, layer 1, layer 3, layer 5 and layer 7. Darker color indicates 16.469 0.948 41.411 lower value. Noise type: NOISEX-92/destroyerops, SNR: 0 dB MSE loss Multichannel WaveUNet, 17.034 0.952 40.880 wSDR loss Proposed 18.032 0.961 39.323

5.3. Experiment Result We evaluated our proposed model via three objective measure- ments: signal-to-distortion ratio improvement (SDRi), short time objective intelligibility (STOI) and phoneme error rate (PER). The PER score was evaluated via the listen, attend and spell (LAS) model [26] by inputting 40-dimensional log-mel features of enhanced speech and comparing the model out- puts with the ground truth labels in the TIMIT data set. The LAS model was pre-trained with two pyramidal BLSTM lay- ers at the Listener and one attention-based LSTM layer for the AttendAndSpell module. We compared our model with the ideal MVDR beamformer, single channel Wave-U-Net which was trained with only one channel of the multi-channel data, Figure 5: Activation output’s spectrograms of 24 feature maps the baseline multi-channel Wave-U-Net with MSE loss and the of the first convolution layer of channel 1. Noise type: NOISEX- baseline multi-channel Wave-U-Net with wSDR loss. Exper- 92/destroyerops, SNR: 0 dB imental results were summarized in Table 2. The proposed method shows the best performance for all the three metrics. randomly from -30 to 30 degree. The noise source was located The visualization of attention masks in Fig. 4 illustrates in the room with the radial distance changing from 2 to 4 meters different mask’s behaviors at different downsampling blocks. while the angle varying from -90 to 90 degree. The clean and At early layers, the masks focus on certain feature map noise sources were placed so as the angle difference between channels. For example, at layer 1 the mask highlights the nd th th th them was at least 15 degree. 2 , 5 , 9 , 13 channels of the feature map. Among 24 fea- We used the well-known TIMIT database [22] for the clean ture map channels of the first downsampling block shown in source signal. For the noise source, we selected 15 noise types Fig. 5, these channels highlighted voice active region where it has high energy and small amount of noisy component. In addi- from 3 noise databases: NOISEX-92, AURORA-4 [23] and DE- nd th MAND database [24]. 15 noise types were separated into 11 tion, the 2 and 9 channels mainly present information of the high frequency (from 3-4kHz) part of speech, which relates to types for seen data and 4 types for unseen data. We covered th th a wide range of SNR cases, including an extremely low SNR the plosive sound. The 5 and 13 channels contain harmonic scenario, from -10 dB to 10 dB. information of speech, but disregard the low-frequency noise region. At deeper layers, the shape of masks gradually changes to handle time delay related information, proving that the masks 5.2. Network and Training Setup try to synchronize time delay between two channels. Masks at rd th Details of the network setup for our experiment are summarized the 3 and 5 layers are more selective in voice active region in Table 1. For the baseline Wave-U-Net structure, we kept the where only a few channels are highlighted. Moreover, masks same architecture in [7]. The proposed model contained two of the layers containing low-frequency information such as the th separate encoders for two channels as described in Section 3. 7 layer have very low value, which matches with the fact that To reduce the network size, we decreased the number of down- noise power is dominated at the low-frequency region. sampling blocks and upsampling blocks to 10 instead of 12. For the training setup, we trained both the baseline and pro- 6. Conclusion posed model using weighted signal-to-distortion (wSDR) loss In this paper, we proposed a cross-channel attention-based [25]. Our experiments showed that the wSDR loss performed Wave-U-Net for multi-channel speech enhancement, aiming better than the mean square error (MSE) loss because it helped to straightforwardly and efficiently exploit spatial information. to compensate for the time delay of the noisy signal. We di- Given the minimum number of microphones, our experimen- vided our data set into a training set, a validation set and a test tal results showed considerable improvements in terms of SDR, set with approximately 30,000 utterances (30 hours), 10,000 STOI and PER in an artificial room scenario with reverbera- utterances (10 hours) and 15,000 utterances, respectively. We tion and extremely low SNR conditions. For future develop- used ADAM optimizer with learning rate 0.0001, decay rates ments, we will be exploiting intra-channel temporal information β = 0.9, β = 0.999 and a batch size of 32. Early stopping 1 2 by flexibly introducing recurrent layers, and re-designing input was performed if there has been no improvement on the valida- layers to address the latency issue of the model. tion set for 20 epochs.

4052 7. References [20] J. B. Allen and D. A. Berkley, “Image method for efficiently sim- ulating small-room ,” 1976. [1] J. Benesty, S. Makino, and J. Chen, Speech enhancement. Springer Science & Business Media, 2005. [21] R. Scheibler, E. Bezzam, and I. Dokmanic, “Pyroomacoustics: A python package for audio room simulation and array processing [2] B. T. Atmaja, M. N. Farid, and D. Arifianto, “Speech enhance- algorithms,” 2018 IEEE International Conference on Acoustics, ment on smartphone voice recording,” 2016. Speech and Signal Processing (ICASSP), Apr 2018. [Online]. [3] A. Narayanan and D. Wang, “Ideal ratio mask estimation us- Available: http://dx.doi.org/10.1109/ICASSP.2018.8461310 ing deep neural networks for robust speech recognition,” in 2013 IEEE International Conference on Acoustics, Speech and Signal [22] J. S. Garofolo, “Timit acoustic phonetic continuous speech cor- Processing (ICASSP). IEEE, 2013, pp. 7092–7096. pus,” Linguistic Data Consortium, 1993. [4] P. Chandna, M. Miron, J. Janer, and E. Gomez,´ “Monoaural audio [23] D. Pearce and J. Picone, “Aurora working group: Dsr front end source separation using deep convolutional neural networks,” in lvcsr evaluation au/384/02,” Inst. for Signal & Inform. Process., International conference on latent variable analysis and signal Mississippi State Univ., Tech. Rep, 2002. separation. Springer, 2017, pp. 258–266. [24] J. Thiemann, N. Ito, and E. Vincent, “The diverse environments [5] A. Jansson, E. J. Humphrey, N. Montecchio, R. M. Bittner, A. Ku- multi-channel acoustic noise database (demand): A database of mar, and T. Weyde, “Singing voice separation with deep u-net multichannel environmental noise recordings,” in Proceedings of convolutional networks,” in ISMIR, 2017. Meetings on Acoustics ICA2013, vol. 19, no. 1. Acoustical So- ciety of America, 2013, p. 035081. [6] D. S. Williamson, Y. Wang, and D. Wang, “Complex ratio mask- ing for monaural speech separation,” IEEE/ACM transactions on [25] H.-S. Choi, J.-H. Kim, J. Huh, A. Kim, J.-W. Ha, and K. Lee, audio, speech, and language processing, vol. 24, no. 3, pp. 483– “Phase-aware speech enhancement with deep complex u-net,” 492, 2015. arXiv preprint arXiv:1903.03107, 2019. [7] D. Stoller, S. Ewert, and S. Dixon, “Wave-U-Net: A Multi-Scale [26] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend Neural Network for End-to-End Audio Source Separation,” arXiv and spell: A neural network for large vocabulary conversational e-prints, p. arXiv:1806.03185, Jun 2018. speech recognition,” in 2016 IEEE International Conference on [8] S. Gannot, E. Vincent, S. Markovich-Golan, and A. Ozerov, “A Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, consolidated perspective on multimicrophone speech enhance- pp. 4960–4964. ment and source separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 4, pp. 692–730, 2017. [9] M. Brandstein and D. Ward, Microphone arrays: signal process- ing techniques and applications. Springer Science & Business Media, 2013. [10] K. U. Simmer, J. Bitzer, and C. Marro, “Post-filtering techniques,” in Microphone arrays. Springer, 2001, pp. 39–60. [11] C. Boeddeker, H. Erdogan, T. Yoshioka, and R. Haeb-Umbach, “Exploring practical aspects of neural mask-based beamforming for far-field speech recognition,” in 2018 IEEE International Con- ference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 6697–6701. [12] H. Erdogan, J. R. Hershey, S. Watanabe, M. I. Mandel, and J. Le Roux, “Improved mvdr beamforming using single-channel mask prediction networks.” in Interspeech, 2016, pp. 1981–1985. [13] N. Tawara, T. Kobayashi, and T. Ogawa, “Multi-channel speech enhancement using time-domain convolutional denoising autoen- coder,” in Interspeech, 2019, pp. 86–90. [14] B. Tolooshams, R. Giri, A. H. Song, U. Isik, and A. Krish- naswamy, “Channel-attention dense u-net for multichannel speech enhancement,” arXiv preprint arXiv:2001.11542, 2020. [15] H. Erdogan, J. R. Hershey, S. Watanabe, M. I. Mandel, and J. Le Roux, “Improved mvdr beamforming using single-channel mask prediction networks.” in Interspeech, 2016, pp. 1981–1985. [16] X. Xiao, S. Watanabe, H. Erdogan, L. Lu, J. Hershey, M. L. Seltzer, G. Chen, Y. Zhang, M. Mandel, and D. Yu, “Deep beam- forming networks for multi-channel speech recognition,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 5745–5749. [17] Z.-Q. Wang and D. Wang, “Combining spectral and spatial features for deep learning based blind speaker separation,” IEEE/ACM Transactions on Audio, Speech, and Language Pro- cessing, vol. 27, no. 2, pp. 457–468, 2018. [18] O. Oktay, J. Schlemper, L. L. Folgoc, M. Lee, M. Heinrich, K. Mi- sawa, K. Mori, S. McDonagh, N. Y. Hammerla, B. Kainz et al., “Attention u-net: Learning where to look for the pancreas,” arXiv preprint arXiv:1804.03999, 2018. [19] R. Giri, U. Isik, and A. Krishnaswamy, “Attention wave-u-net for speech enhancement,” in 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2019, pp. 249–253.

4053