A Cross-Channel Attention-Based Wave-U-Net for Multi-Channel Speech Enhancement

INTERSPEECH 2020 October 25–29, 2020, Shanghai, China A Cross-channel Attention-based Wave-U-Net for Multi-channel Speech Enhancement Minh Tri Ho1, Jinyoung Lee1, Bong-Ki Lee2, Dong Hoon Yi2, Hong-Goo Kang1 1Department of Electrical and Electronic Engineering, Yonsei University, Seoul, South Korea 2Artificial Intelligence Lab, LG Electronics Co., Seoul, South Korea [email protected] Abstract performance and robustness are not reliable in harsh environments. Recently, deep-learning based approaches have been In this paper, we present a novel architecture for multi- introduced to the first stage of enhancement [11, 12]. In ad- channel speech enhancement using a cross-channel attention- dition, end-to-end models have been designed to extract both based Wave-U-Net structure. Despite the advantages of utiliz- spectral and spatial features. Examples of this approach include ing spatial information as well as spectral information, it is chal- the Wave-U-Net model [7] and time-domain convolutional au- lenging to effectively train a multi-channel deep learning sys- toencoder model [13]. Since end-to-end structures only mimic tem in an end-to-end framework. With a channel-independent the behavior of beamforming at the first layer of the encoder, encoding architecture for spectral estimation and a strategy to their capabilities for spatial filtering should be limited by the extract spatial information through an inter-channel attention signal-to-noise ratio (SNR) of the input signal and the number mechanism, we implement a multi-channel speech enhance- of microphones. In our preliminary experiments, we found out ment system that has high performance even in reverberant and that this simple approach would not work in extremely low SNR extremely noisy environments. Experimental results show that conditions. the proposed architecture has superior performance in terms of To overcome the problem, we modify the Wave-U-Net signal-to-distortion ratio improvement (SDRi), short-time ob- structure to process each channel separately, but utilize inter- jective intelligence (STOI), and phoneme error rate (PER) for channel information using a nonlinear spatial filter designed by speech recognition. a lightweight attention block. Motivated by the attention block Index Terms: Multi-channel Speech Enhancement, Wave-U- proposed in [14], we exploit inter-channel information in a more Net, Cross-Channel Attention straightforward manner which directly compares information in one channel with another. Moreover, since encoding each input 1. Introduction channel independently helps to preserve spatial information, the Speech enhancement aiming to enhance the quality of the target output of each encoding layer can be repeatedly utilized to esti- speech signal is crucial to improving the robustness to noise for mate inter-channel information speech recognition [1, 2]. With the development of deep learn- Our contributions in this paper are three-fold. First, we pro- ing, data-driven speech enhancement approaches have shown pose a novel modification of the well-known Wave-U-Net ar- breakthroughs when using a single microphone. In most of sin- chitecture for multi-channel speech enhancement. It separately gle channel approaches [3, 4, 5] the speech signal is first trans- encodes each channel, then interchanges information between formed into the frequency domain, after which time-frequency encoded channels after each downsampling block is efficiently (TF) masks are estimated to determine the amount of noise re- combined by a convolution size of one before passing to the duction in each TF bin. However, their performance improve- skip connection. Secondly, we introduce a cross-channel atten- ments are not significant in low signal-to-noise environments tion block to boost network performance by effectively exploit- because of their limitations on estimating the phase spectrum. ing spatial information of multi-channel data. To the best of our Williamson et al. [6] estimated the TF masks in the complex knowledge, our paper is the first work proving a model’s per- domain, but it was not easy to train the network. To over- formance in an artificial multi-channel acoustic scenario with come this limitation, Wave-U-Net [7] was proposed as a time all of the following four intricate challenges: minimum num- domain approach. Since the Wave-U-Net model directly es- ber of microphones (only two microphones with a small dis- timates the clean target signal waveform, it does not need to tance between them), varying positions of both speech and noise consider the phase problem and spectral efficiency caused by sources, reverberation and extremely low SNR conditions (-10 the time-frequency transformation with a fixed-length analysis dB, -5 dB cases). frame. The rest of this paper is organized as follows. Section When multiple microphones are available, the performance 2 gives a review of related works while section 3 describes of speech enhancement algorithms can be further improved our baseline model. Section 4 represents our proposed cross- because of spatially related information between the micro- channel attention Wave-U-Net architecture. Section 5 describes phones [8]. Statistical approaches such as beamforming [9] our experiment and analyses the experiment results, followed and multi-channel Wiener filtering [10] first estimate the di- by the conclusion in the final section. rection of arrival (DOA) between microphones, then enhance the incoming signal from the estimated source direction but 2. Related Works attenuate interference from other directions using a linear filter [8]. Although the methods are fast and lightweight, their With the success of deep-learning based single-channel speech enhancement approaches, much research has been done on This work is supported and funded by LG Electronics Co., Ltd. combining models with traditional statistic-based beamforming Copyright © 2020 ISCA 4049 http://dx.doi.org/10.21437/Interspeech.2020-2548 Concat Mic 1 Enhanced speech Mic 2 Encoder , Mic 1 , 24 , 2, 2, 48 ∙ ∙ ∙ 240 Decoder /512, /512, 240 /4, 48 /4, /2, 24 /2, 24 /2, 48 /4, / 푇 /1024 푇 푇 ( , 1) 푇 푇 푇 푇 푇 푇 Conv1d kernel size 1 Figure 1: Wave-U-Net structure for multi-channel data. , Conv1d DS stands for a downsampling block and US does for an Mic 2 kernel size 15 , 24 , ∙ ∙ ∙ 240 /512, /512, 240 /4, 48 /4, /2, 24 /2, 24 /2, 48 /2, /4, 48 /4, /1024 푇 푇 푇 ( , 1) 푇 푇 푇 푇 푇 upsampling block. L denotes the number of downsam- 푇 pling/upsampling blocks. Downsampling Figure 2: Proposed Cross-channel attention-based Wave-U-Net algorithms. Typical works that manifest this idea include [11], structure for multi-channel data. Feature maps obtained at the [12] and [15]. The common strategy of these works is that the encoding layer are interchanged between channels after pro- neural network tries to reduce single channel noise using TF cessing each downsampling block. masks at the first stage, after which beamforming is applied to linearly integrate the multi-channel signals. This approach 4. Proposed Model not only gives better results than the pure statistical methods In this section, we propose a cross-channel attention-based but also shows robustness in terms of various noisy types and multi-channel Wave-U-Net model by modifying the baseline ar- SNR ranges. However, in this method, the neural network only chitecture described in Section 3. Fig. 2 illustrates the architec- learns the spectral characteristics of a single channel, not spa- ture of our proposed model. Although the proposed algorithm tial information. To address this limitation, recent deep learning can be generalized to an arbitrary number of channels, we fix approaches have used inter-channel features as additional inputs the number of channels to two for simplicity in this paper. to the network, such as generalized cross-correlation (GCC), in- teraural phase or level difference (IDP, ILD) [16, 17]. Although 4.1. Encoder Structure learning spectral information together with spatial features results in performance improvements, adding spatial features as To provide flexibility for the processing of each channel and to the separated input makes it difficult to learn the mutual rela- explicitly utilize cross-channel relationships, the encoder pro- tionship between spatial and spectral information. cesses each channel independently. Feature maps from the encoder of each channel are used as inputs to the cross-channel To handle multi-channel data, the author of Wave-U-Net attention block, then are interchanged between channels. The proposed that the first layer of the network takes into account main objective of the cross-channel attention block is to derive all input channels. A similar solution could be found in [13], the relationship between two channels. Details of this block are in which the authors used a dilated convolution autoencoder in- described in subsection 4.3. At the bottleneck of the network, stead of the U-Net structure. However, when handling all the feature maps obtained by the encoder of each channel are con- input channels together, the first convolution layer only played catenated, after which they are projected into one feature map the role of performing nonlinear channel fusion. using a 1-D convolution layer. When the number of channels is greater than two, one channel is chosen as a reference channel 3. Wave-U-Net Baseline for Speech and feature maps are interchanged between the reference and Enhancement other channels. 4.2. Decoder Structure Wave-U-Net was originally developed for a singing voice source separation task by reconfiguring the U-Net structure [5] In each decoding layer, feature maps extracted from the encod- in the time domain. Based on an encoder-decoder structure, it ing layer are fused by a 1-D convolution layer with size 1. The introduces skip connections between the same layer in down- processing is different from the baseline structure that uses a sampling and upsampling blocks so that the high-level features direct skip connection between the same level layers of the en- of the decoding layer also include local features from the encoder and decoder.

Load more