Convolutional Gated Recurrent Neural Network Incorporating Spatial Features for Audio Tagging

Yong Xu, Qiuqiang Kong, Qiang Huang, Wenwu Wang, Mark D. Plumbley
Center for Vision, Speech and Signal Processing, University of Surrey, Guildford, UK
Email: {yong.xu, q.kong, q.huang, w.wang, m.plumbley}@surrey.ac.uk

Abstract—Environmental audio tagging is a newly proposed task to predict the presence or absence of a specific audio event in a chunk. Deep neural network (DNN) based methods have been successfully adopted for predicting the audio tags in the domestic audio scene. In this paper, we propose to use a convolutional neural network (CNN) to extract robust features from mel-filter banks (MFBs), spectrograms or even raw waveforms for audio tagging. Gated recurrent unit (GRU) based recurrent neural networks (RNNs) are then cascaded to model the long-term temporal structure of the audio signal. To complement the input information, an auxiliary CNN is designed to learn on the spatial features of stereo recordings. We evaluate our proposed methods on Task 4 (audio tagging) of the Detection and Classification of Acoustic Scenes and Events 2016 (DCASE 2016) challenge. Compared with our recent DNN-based method, the proposed structure can reduce the equal error rate (EER) from 0.13 to 0.11 on the development set. The spatial features can further reduce the EER to 0.10. The performance of the end-to-end learning on raw waveforms is also comparable. Finally, on the evaluation set, we get the state-of-the-art performance with 0.12 EER while the performance of the best existing system is 0.15 EER.

I. INTRODUCTION

Audio tagging (AT) aims at putting one or several tags on a sound clip. The tags are the sound events that occur in the audio clip, for example, "speech", "television", "percussion", "bird singing", and so on. Audio tagging has many applications in areas such as information retrieval [1], sound classification [2] and recommendation systems [3].

Many frequency domain audio features such as mel-frequency cepstrum coefficients (MFCCs) [4], mel filter bank features (MFBs) [5] and spectrograms [6] have been used for speech recognition related tasks [7] for many years. However, it is unclear how these features perform on non-speech audio processing tasks. Recently, MFCCs and MFBs were compared on the audio tagging task [8], and the MFBs achieved better performance in the framework of deep neural networks. The spectrogram has been suggested to be better than the MFBs in the sound event detection task [9], but has not yet been investigated in the audio tagging task.

Besides the frequency domain audio features, processing sound from the raw time domain waveforms has attracted a lot of attention recently [10], [11], [12]. However, most of these works are for speech recognition related tasks; there are few works investigating raw waveforms for environmental audio analysis. In common signal processing pipelines, the short-time Fourier transform (STFT) is adopted to transform raw waveforms into frequency domain features using a set of Fourier bases. Recent research [10] suggests that the Fourier basis sets may not be optimal and that better basis sets can be learned from raw waveforms directly using a large set of audio data. To learn the bases automatically, a convolutional neural network (CNN) is applied on the raw waveforms, which is similar to CNN processing on the pixels of an image [13]. Processing raw waveforms has seen many promising results in speech recognition [10] and in generating speech and music [14], with less research in non-speech sound processing.

Most audio tagging systems [8], [15], [16], [17] use mono channel recordings, or simply average the multiple channels as the input signal. However, this kind of merging strategy disregards the spatial information of the stereo audio. This is likely to decrease recognition accuracy because the intensity and phase of the sound received from different channels are different. For example, kitchen sound and television sound coming from different directions will have different intensities on different channels, depending on the direction of the sources. Multi-channel signals contain spatial information which could be used to help to distinguish different sound sources. Spatial features have been demonstrated to improve results in scene classification [18] and sound event detection [19]. However, there is little work using multi-channel information for the audio tagging task.

Our main contribution in this paper includes three parts. First, we show experimental results on different features, including MFBs and spectrograms as well as raw waveforms, on the audio tagging task of the DCASE 2016 challenge. Second, we propose a convolutional gated recurrent neural network (CGRNN), which is the combination of a CNN and the gated recurrent unit (GRU), to process non-speech sounds. Third, the spatial features are incorporated in the hidden layer to utilize the location information. The work is organized as follows: in Section II, the proposed CGRNN is presented for audio tagging. In Section III, the spatial features are illustrated and incorporated into the proposed method. The experimental setup and results are shown in Section IV and Section V. Section VI summarizes the work and foresees future work.

Fig. 1. The structure of the one-dimension CNN, which consists of one convolutional layer and one max-pooling layer. N filters with a fixed size F are convolved with the one-dimensional signal to get outputs p_t^i, i = 0, ..., (N − 1). x_t^i denotes the i-th dimension of the feature of the current frame.

Fig. 2. The structure of the simple recurrent neural network and its unfolded version. The current activation h_t is determined by the current input x_t and the previous activation h_{t−1}.

II. CONVOLUTIONAL GATED RECURRENT NETWORK FOR AUDIO TAGGING

Neural networks have several types of structures: the most common one is the deep feed-forward neural network. Another popular structure is the convolutional neural network (CNN), which is widely used in image classification [20], [13]. CNNs can extract robust features from pixel-level values for images [13] or from raw waveforms for speech signals [10]. The recurrent neural network (RNN) is a third structure, which is often used for sequence modeling, such as language models [21] and speech recognition [22]. In this section, we will introduce the convolutional neural network and the recurrent neural network with gated recurrent units.

A. One dimension convolutional neural network

Audio or speech signals are one dimensional. Fig. 1 shows the structure of a one-dimension CNN which consists of one convolutional layer and one max-pooling layer. N filters with a fixed size F are convolved with the one-dimensional signal to get outputs p_t^i, i = 0, ..., (N − 1). Given that the dimension of the input features is M, the activation h of the convolutional layer has (M − F + 1) values. The max-pooling size is also (M − F + 1), which means each filter gives one output value. This is similar to speech recognition work [10] where a CNN has been used to extract features from the raw waveform signal. The way each filter produces a single value can also be explained as a global pooling layer, which is a structural regularizer that explicitly enforces feature maps to be confidence maps of meaningful feature channels [23]. So N activations are obtained as the robust features derived from the basic features. As for the input feature size M, a short time window, e.g., 32 ms, is fed into the CNN each time; the long-term pattern will be learned by the following recurrent neural network. As for the filter size or kernel size, a large receptive field is set, considering that only one convolutional layer is designed in this work. For example, F = 400 and M = 512 are set in [10]. If the input features are raw waveforms, each filter of the CNN is actually learned as a finite impulse response (FIR) filter [12]. If spectrograms or mel-filter banks are fed into the CNN, the filtering operates along the frequency axis [24] to reduce frequency variations, such as the same audio pattern occurring at different pitches.
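To make the one-layer CNN with global max-pooling concrete, the following sketch builds the frame-level feature extractor for a raw-waveform frame with M = 512, F = 400 and N = 128, the values quoted above and in Section IV-B. It is an illustrative reconstruction using tf.keras, not the released implementation.

```python
# A minimal sketch (not the authors' released code) of the one-layer 1-D CNN
# from Section II-A: N filters of length F slide over an M-sample frame and a
# global max-pooling over the (M - F + 1) positions leaves one value per filter.
import numpy as np
import tensorflow as tf

M, F, N = 512, 400, 128   # frame length, filter length, number of filters

frame = tf.keras.Input(shape=(M, 1))                                # one 32 ms frame
conv = tf.keras.layers.Conv1D(N, F, activation="relu")(frame)       # shape: (M - F + 1, N)
pooled = tf.keras.layers.GlobalMaxPooling1D()(conv)                 # shape: (N,), one value per filter
frame_encoder = tf.keras.Model(frame, pooled, name="frame_encoder")

# Example: encode a random 32 ms frame into N robust activations.
x = np.random.randn(1, M, 1).astype("float32")
print(frame_encoder(x).shape)   # (1, 128)
```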

B. Gated recurrent unit based RNN

Recurrent neural networks have recently shown promising results in speech recognition [22]. Fig. 2 shows the basic idea of the RNN: the current activation h_t is determined by the current input x_t and the previous activation h_{t−1}. An RNN, with its capability to learn long-term patterns, is superior to a feed-forward DNN, because a feed-forward DNN is designed so that the input contextual features at each time step are independent. The hidden activations of the RNN are formulated as:

h_t = ϕ(W^h x_t + R^h h_{t−1} + b^h)   (1)

However, a simple recurrent neural network with the recurrent connection only on the hidden layer is difficult to train due to the well-known vanishing gradient or exploding gradient problems [25]. The long short-term memory (LSTM) structure [26] was proposed to overcome this problem by introducing an input gate, forget gate, output gate and cell state to control the information stream through time. The fundamental idea of the LSTM is the memory cell, which maintains its state through time [27].

As an alternative structure to the LSTM, the gated recurrent unit (GRU) was proposed in [28]. The GRU was demonstrated to be better than the LSTM in some tasks [29], and is formulated as follows [27]:

r_t = δ(W^r x_t + R^r h_{t−1} + b^r)   (2)

z_t = δ(W^z x_t + R^z h_{t−1} + b^z)   (3)

h̃_t = ϕ(W^h x_t + r_t ⊙ (R^h h_{t−1}) + b^h)   (4)

h_t = z_t ⊙ h_{t−1} + (1 − z_t) ⊙ h̃_t   (5)

where h_t, r_t and z_t are the hidden activations, reset gate values and update gate values at frame t, respectively. The weights applied to the input and recurrent hidden units are denoted as W^∗ and R^∗, respectively, and the biases are represented by b^∗. The functions δ(·) and ϕ(·) are the sigmoid and hyperbolic tangent activation functions. Compared to the LSTM, there is no separate memory cell in the GRU. The GRU also does not have an output gate, and combines the input and forget gates into an update gate z_t to balance between the previous activation h_{t−1} and the updated activation h̃_t, as shown in Eq. (5). The reset gate r_t can decide whether or not to forget the previous activation, as shown in Eq. (4). ⊙ in Eqs. (4) and (5) represents element-wise multiplication.
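As a concrete reading of Eqs. (2)-(5), the following NumPy sketch performs a single GRU step; the weight shapes and random initial values are placeholders for illustration only, whereas in the proposed system they are learned by back-propagation.

```python
# A NumPy sketch of one GRU step following Eqs. (2)-(5); weights here are random
# placeholders, not trained parameters.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W, R, b):
    """W, R, b are dicts keyed by 'r', 'z', 'h' holding the input weights,
    recurrent weights and biases of the reset gate, update gate and candidate."""
    r_t = sigmoid(W["r"] @ x_t + R["r"] @ h_prev + b["r"])                  # Eq. (2)
    z_t = sigmoid(W["z"] @ x_t + R["z"] @ h_prev + b["z"])                  # Eq. (3)
    h_tilde = np.tanh(W["h"] @ x_t + r_t * (R["h"] @ h_prev) + b["h"])      # Eq. (4)
    return z_t * h_prev + (1.0 - z_t) * h_tilde                             # Eq. (5)

n_in, n_hid = 128, 128
rng = np.random.default_rng(0)
W = {k: 0.01 * rng.standard_normal((n_hid, n_in)) for k in "rzh"}
R = {k: 0.01 * rng.standard_normal((n_hid, n_hid)) for k in "rzh"}
b = {k: np.zeros(n_hid) for k in "rzh"}

h = np.zeros(n_hid)
for x in rng.standard_normal((10, n_in)):   # a short sequence of CNN activations
    h = gru_step(x, h, W, R, b)
```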
C. Convolutional Gated Recurrent Network for audio tagging

Fig. 3 shows the framework of the convolutional gated recurrent neural network for audio tagging. The CNN is regarded as the feature extractor applied along short windows (e.g., 32 ms) of the basic features, e.g., MFBs, spectrograms or raw waveforms. The extracted robust features are then fed into the GRU-RNN to learn the long-term audio patterns. For the audio tagging task, there is a lot of background noise, and acoustic events may occur repeatedly and randomly along the whole chunk (without knowing the specific frame locations). The CNN can help to extract features that are robust against the background noise through the max-pooling operation, especially for the raw waveforms. Since the label of the audio tagging task is at the chunk level rather than the frame level, a large number of context frames are fed into the whole framework. The GRU-RNN can select related information from the long-term context for each audio event. To also utilize future information, a bi-directional GRU-RNN is designed in this work. Finally, the output of the GRU-RNN is mapped to the posteriors of the target audio events through one feed-forward neural layer with a sigmoid output activation function. This framework is flexible enough to be applied to any kind of features, especially raw waveforms. Raw waveforms have a large number of values, which leads to a high-dimension problem. However, the proposed CNN can learn on short windows like the short-time Fourier transform (STFT) process, and FFT-like basis sets or even mel-like filters can be learned for raw waveforms. Finally, a one-layer feed-forward DNN takes the final sequence of GRU activations to predict the posteriors of the tags.

Fig. 3. The framework of the convolutional gated recurrent neural network for audio tagging.

Binary cross-entropy is used as the loss function in our work, since it was demonstrated to be better than the mean squared error in [8] for labels with zero or one values. The loss can be defined as:

E = −∑_{n=1}^{N} ‖ T_n log T̂_n + (1 − T_n) log(1 − T̂_n) ‖   (6)

T̂_n = (1 + exp(−O))^{−1}   (7)

where E is the binary cross-entropy, and T̂_n and T_n denote the estimated and reference tag vectors at sample index n, respectively. The DNN linear output is denoted as O before the sigmoid activation function is applied. Adam [30] is used as the stochastic optimization method.
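Putting Sections II-A to II-C together, one possible realization of the whole pipeline in Keras is sketched below, using the settings later reported in Section IV-B (128 filters, three bi-directional GRU layers of 128 units, a 500-unit ReLU layer and 7 sigmoid outputs, binary cross-entropy and Adam). It is a hedged reconstruction: the frame count and the chunk-level aggregation step are assumptions, not details confirmed by the paper.

```python
# A hedged Keras sketch of the convolutional gated recurrent network in Fig. 3,
# assuming raw-waveform input frames of 512 samples (32 ms at 16 kHz).
import tensorflow as tf
from tensorflow.keras import layers

n_frames, frame_len = 249, 512   # assumed: 4 s chunk, 32 ms window, 16 ms hop

chunk = tf.keras.Input(shape=(n_frames, frame_len, 1))
# The CNN is applied to every 32 ms frame (weights shared across frames).
x = layers.TimeDistributed(layers.Conv1D(128, 400, activation="relu"))(chunk)
x = layers.TimeDistributed(layers.GlobalMaxPooling1D())(x)
# Three bi-directional GRU layers model the long-term structure of the chunk.
for _ in range(3):
    x = layers.Bidirectional(layers.GRU(128, return_sequences=True))(x)
x = layers.GlobalMaxPooling1D()(x)                 # assumed chunk-level aggregation
x = layers.Dense(500, activation="relu")(x)
tags = layers.Dense(7, activation="sigmoid")(x)    # posteriors of the 7 tags

model = tf.keras.Model(chunk, tags)
model.compile(optimizer="adam", loss="binary_crossentropy")   # Eq. (6), Adam [30]
```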

III. SPATIAL FEATURES INCORPORATED FOR AUDIO TAGGING

Spatial features can often offer additional cues to help solve signal processing problems. Many spatial features can be used for audio tagging, such as interaural phase differences or interaural time differences (IPDs or ITDs) [31] and interaural level differences (ILDs) [31]. The recordings of the audio tagging task of the DCASE 2016 challenge were made in home scenes, containing audio events such as TV, child speech and adult speech. The spatial features potentially give additional information to analyze the content of the audio, e.g., recognizing the TV audio event by knowing the specific direction of the TV sound. The IPD and ILD are defined as:

ILD(t, k) = 20 log_{10} | X_left(t, k) / X_right(t, k) |   (8)

IPD(t, k) = ∠( X_left(t, k) / X_right(t, k) )   (9)

where X_left(t, k) and X_right(t, k) denote the left-channel and right-channel complex spectra of the stereo audio, the operator |·| takes the absolute value of the complex value, and ∠(·) finds the phase angle. In this work, we also define the interaural magnitude difference (IMD), which is similar to the ILD; the IMD is defined in the linear domain while the ILD is defined in the logarithmic domain:

IMD(t, k) = |X_left(t, k)| − |X_right(t, k)|   (10)

Fig. 4 shows the structure of incorporating the spatial features (IMD/ILD/IPD, etc.) using an additional CNN. The activations learned from the basic features and the activations learned from the spatial features are then concatenated and fed into the GRU-RNN shown in Fig. 3.

Fig. 4. The structure of incorporating the spatial features (IMD/ILD/IPD, etc.) using an additional CNN. The activations learned from the basic features and the activations learned from the spatial features are concatenated to be fed into the GRU-RNN shown in Fig. 3.

The audio files of the audio tagging task of the DCASE 2016 challenge are recorded in a domestic home environment. There are severe reverberation, high-level background noise and multiple acoustic sources. These factors might influence the effectiveness of the IPD and ILD. Fig. 5 shows the spectrogram, IMD, ILD and IPD of one recording from the audio tagging task of the DCASE 2016 challenge. The IMD appears to have some meaningful patterns, while the ILD and the IPD seem to be random, which would lead to training difficulties for the classifier. From our empirical experiments, IPD and ILD do not appear to help improve the classifier performance, while IMD is beneficial. Similar results were reported in [19], where IPD was found not to be helpful for sound event detection in home scenes but helpful for event detection in residential areas. This may be because residential areas are open areas with less reverberation than indoor home environments. Hence we will use the IMD as our spatial feature in this work. The filter size of the CNN learned on the IMD is set the same as the related configuration for the spectrogram.

Fig. 5. The spectrogram, IMD, ILD and IPD of a recording from the audio tagging task of the DCASE 2016 challenge. The labels of this audio are "child speech" and "TV sound".
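For clarity, the sketch below computes the IMD of Eq. (10), together with the ILD and IPD of Eqs. (8) and (9), from a stereo chunk; the use of librosa and the STFT size/hop (matching the 32 ms / 16 ms framing of Section IV-B) are assumptions.

```python
# A sketch of the spatial features in Eqs. (8)-(10); eps guards divisions and logs.
import numpy as np
import librosa

def spatial_features(stereo, n_fft=512, hop=256, eps=1e-8):
    """stereo: array of shape (2, n_samples); returns IMD, ILD, IPD spectrograms."""
    X_left = librosa.stft(stereo[0], n_fft=n_fft, hop_length=hop)
    X_right = librosa.stft(stereo[1], n_fft=n_fft, hop_length=hop)
    imd = np.abs(X_left) - np.abs(X_right)                                        # Eq. (10), linear domain
    ild = 20.0 * np.log10((np.abs(X_left) + eps) / (np.abs(X_right) + eps))       # Eq. (8), log domain
    ipd = np.angle(X_left / (X_right + eps))                                      # Eq. (9), phase angle
    return imd, ild, ipd
```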

IV. DATA SET AND EXPERIMENTAL SETUP

A. DCASE 2016 audio tagging challenge

We conducted our experiments based on the DCASE 2016 audio tagging challenge [32]. This audio tagging task consists of a five-fold development set and an evaluation set, which are built based on the CHiME-home dataset [33]. The audio recordings were made in a domestic environment [34]. The audio data are provided as 4-second chunks at a 48 kHz sampling rate in stereo mode; we downsampled them to a 16 kHz sampling rate.

For each chunk, three annotators gave three labels, namely multi-label annotations. The discrepancies among annotators were then reduced by conducting a majority vote. The annotations are based on a set of seven audio events as presented in Table I. A detailed description of the annotation procedure is provided in [33].

TABLE I
SEVEN AUDIO EVENTS USED AS THE REFERENCE LABELS.

audio event   Event descriptions
'b'           Broadband noise
'c'           Child speech
'f'           Adult female speech
'm'           Adult male speech
'o'           Other identifiable sounds
'p'           Percussive sound events, e.g. footsteps, knock, crash
'v'           TV sounds or Video games

TABLE II
THE CONFIGURATIONS FOR THE FIVE-FOLD DEVELOPMENT SET AND THE FINAL EVALUATION SET OF THE DCASE 2016 AUDIO TAGGING TASK.

Fold index of development set   #Training chunks   #Test chunks
0                               4004               383
1                               3945               442
2                               3942               463
3                               4116               271
4                               4000               387
Evaluation set                  4387               816

B. Experimental setup

In the experiments below, we follow the standard specification of the DCASE 2016 audio tagging task [32]. On the development set, we use the official five folds for cross-validation. Table II shows the number of chunks in the training and test sets used for each fold, together with the final evaluation configuration.

The parameters of the networks are tuned based on heuristic experience. All of the CNNs have 128 filters or feature maps. Following [10], the filter sizes for MFBs, spectrograms and raw waveforms are 30, 200 and 400, respectively. These parameters form a large receptive field for each type of basic feature, considering that only one convolutional layer is designed. The CNN layer is followed by three RNN layers with 128 GRU blocks. One feed-forward layer with 500 ReLU units is finally connected to the 7 sigmoid output units. We pre-process each audio chunk by segmenting it using a 32 ms sliding window with a 16 ms hop size, and converting each segment into 40-dimension MFBs, 257-dimension spectrograms or 512-dimension raw waveforms. For performance evaluation, we use the equal error rate (EER) as the main metric, which is also suggested by the DCASE 2016 audio tagging challenge. The EER is defined as the point of equal false negative (FN) rate and false positive (FP) rate [35]. The source code for this paper can be downloaded from GitHub¹.

¹ https://github.com/yongxuUSTC/cnn rnn spatial audio tagging
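The following sketch illustrates one way to produce the per-frame inputs described above (40-dimension MFBs, 257-dimension spectrogram frames and 512-sample raw-waveform frames from a 16 kHz mono channel); the librosa calls and the log compression of the MFBs are assumptions, not the authors' exact pipeline.

```python
# A hedged pre-processing sketch for Section IV-B; not the authors' exact pipeline.
import numpy as np
import librosa

def chunk_features(y, sr=16000, win=512, hop=256):
    """y: mono 4-second chunk at 16 kHz; returns (frames, dim) arrays per feature type."""
    spec = np.abs(librosa.stft(y, n_fft=win, hop_length=hop, win_length=win)).T   # (frames, 257)
    mfb = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=win,
                                         hop_length=hop, n_mels=40)
    mfb = np.log(mfb + 1e-8).T                                                    # (frames, 40), assumed log scale
    raw = librosa.util.frame(y, frame_length=win, hop_length=hop).T               # (frames, 512)
    return mfb, spec, raw
```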

C. Compared methods

We compared our methods with state-of-the-art systems. Lidy-CQT-CNN [15] and Cakir-MFCC-CNN [16] won the first and second prizes of the DCASE 2016 audio tagging challenge [32]; they both used convolutional neural networks (CNNs) as the classifier. We also compare with our previous method [8], which demonstrated state-of-the-art performance using a de-noising auto-encoder (DAE) to learn robust features.

V. EXPERIMENTAL RESULTS AND ANALYSIS

In this section, the effectiveness of the IMD is first evaluated on the development set of Task 4 of the DCASE 2016 challenge for the different features, i.e., spectrograms, MFBs and raw waveforms. Then the final evaluation is presented by comparing with the state-of-the-art methods on the evaluation set of Task 4 of the DCASE 2016 challenge.
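Since the EER defined in Section IV-B is the metric reported in Tables III and IV, a small sketch of one common way to estimate it per tag from sigmoid scores and binary labels is given below; it uses the ROC curve from scikit-learn as an assumption and is not the official challenge evaluation script.

```python
# A sketch of per-tag EER estimation from sigmoid scores and 0/1 labels,
# picking the ROC operating point where the FN rate and FP rate are closest.
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))    # point where FN rate ~= FP rate
    return (fnr[idx] + fpr[idx]) / 2.0

# Example: average EER over the seven tags (y_true, y_score of shape (n_chunks, 7)).
# eers = [equal_error_rate(y_true[:, k], y_score[:, k]) for k in range(7)]
# print(np.mean(eers))
```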

A. The effectiveness of the IMD

TABLE III
EER COMPARISONS ON SEVEN LABELS AMONG THE SPECTROGRAM, THE RAW WAVEFORM AND THE MFB SYSTEMS WITH OR WITHOUT THE IMD INFORMATION, EVALUATED ON THE DEVELOPMENT SET OF THE DCASE 2016 AUDIO TAGGING CHALLENGE.

Dev set     c      m      f      v      p      b      o      ave
Spec        0.121  0.085  0.155  0.025  0.138  0.017  0.231  0.110
RAW         0.157  0.085  0.156  0.028  0.139  0.059  0.263  0.127
MFB         0.145  0.086  0.167  0.024  0.133  0.037  0.239  0.119
Spec-IMD    0.120  0.080  0.143  0.012  0.115  0.023  0.232  0.104
RAW-IMD     0.135  0.085  0.164  0.014  0.108  0.006  0.231  0.106
MFB-IMD     0.125  0.086  0.140  0.012  0.110  0.011  0.230  0.102

Table III shows the EER comparisons on seven labels among the spectrogram, the raw waveform and the MFB systems with or without the IMD information, evaluated on the development set of the DCASE 2016 audio tagging challenge. Firstly, we can compare the proposed convolutional gated recurrent neural networks on spectrograms, raw waveforms and MFBs. Spectrograms are better than MFBs, perhaps because the spectrogram has more detailed frequency information than the MFB. For example, spectrograms are much better than MFBs on child speech (denoted as 'c') and female speech (denoted as 'f'), where a lot of high frequency information exists. The raw waveforms are worse than the spectrograms and the MFBs. One possible reason is that the learned FIR filters are not stable when the whole training set is small (about 3.5 hours of audio in this work). The same explanation was given in [11] for the speech recognition task. [10] shows that raw waveforms can get better recognition accuracy compared with the mel-spectra on 2000 hours of Google voice search data.

With the help of the IMD spatial features, the EERs are improved compared to all of the corresponding basic features alone. The raw waveforms with IMD can even get comparable results with the spectrograms and the MFBs. The MFB-IMD combination is slightly better than Spec-IMD, which may be because the IMD is calculated from the left and right spectrograms: the IMD has some information in common with the spectrograms, which can be seen from Fig. 5, whereas the IMD is more complementary to the MFBs and the raw waveforms. The previous best performance on the development set of the DCASE 2016 audio tagging challenge was obtained in our recent work using a denoising auto-encoder [8] with 0.126 EER, but here we get better performance with 0.10 EER.

B. Overall evaluations

TABLE IV
EER COMPARISONS ON SEVEN LABELS AMONG LIDY-CQT-CNN [15], CAKIR-MFCC-CNN [16], DAE-DNN [8], AND THE PROPOSED SYSTEMS ON THE SPECTROGRAM, THE RAW WAVEFORM AND THE MFB WITH THE IMD INFORMATION, EVALUATED ON THE FINAL EVALUATION SET OF THE DCASE 2016 AUDIO TAGGING CHALLENGE.

Eval set    c      m      f      v      p      b      o      ave
Cakir [16]  0.250  0.159  0.250  0.027  0.208  0.022  0.258  0.168
Lidy [15]   0.210  0.182  0.214  0.035  0.168  0.032  0.320  0.166
DAE [8]     0.210  0.149  0.207  0.022  0.175  0.014  0.256  0.148
RAW-IMD     0.189  0.152  0.200  0.053  0.156  0.010  0.236  0.142
Spec-IMD    0.166  0.165  0.143  0.024  0.123  0.034  0.250  0.129
MFB-IMD     0.150  0.145  0.143  0.031  0.135  0.013  0.248  0.123

Table IV presents the EER comparisons on seven labels among Lidy-CQT-CNN [15], Cakir-MFCC-CNN [16], our previous DAE-DNN [8], and the proposed systems on the spectrogram, the raw waveform and the MFB with the IMD information, evaluated on the final evaluation set of the DCASE 2016 audio tagging challenge. The de-noising auto-encoder [8] was our recent work, which can outperform the leading system in the DCASE 2016 audio tagging challenge, namely Lidy-CQT-CNN [15]. Our proposed convolutional gated recurrent neural network incorporating the IMD features gives further improved performance. The MFB-IMD system obtains the best performance with 0.123 EER, which is the state-of-the-art performance on the evaluation set of the DCASE 2016 audio tagging challenge.

VI. CONCLUSION

In this paper, we propose a convolutional gated recurrent neural network (CGRNN) to learn on mel-filter banks (MFBs), spectrograms and even raw waveforms. The spatial features, namely the interaural magnitude differences (IMDs), are incorporated into the framework and are demonstrated to be effective in further improving the performance. The spectrogram gives better performance than the MFBs without the spatial features; however, the MFBs with the IMDs achieve the minimal EER, namely 0.102, on the development set of the DCASE 2016 audio tagging challenge. Raw waveforms give comparable performance on the development set. Finally, on the evaluation set of the DCASE 2016 audio tagging challenge, our proposed MFB-IMD system achieves the state-of-the-art performance with 0.123 EER.
It is still interesting to further explore in our future work why the MFB-IMD system is better than the Spec-IMD system. In addition, we will also investigate the proposed framework to model raw waveforms on larger training datasets to learn more robust filters.

ACKNOWLEDGMENT

This work was supported by the Engineering and Physical Sciences Research Council (EPSRC) of the UK under the grant EP/N014111/1. Qiuqiang Kong is partially supported by the China Scholarship Council (CSC).

REFERENCES

[1] E. Wold, T. Blum, D. Keislar, and J. Wheaten, "Content-based classification, search, and retrieval of audio," IEEE Multimedia, vol. 3, no. 3, pp. 27–36, 1996.
[2] D. Giannoulis, E. Benetos, D. Stowell, M. Rossignol, M. Lagrange, and M. D. Plumbley, "Detection and classification of acoustic scenes and events: an IEEE AASP challenge," in 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 1–4.
[3] P. Cano, M. Koppenberger, and N. Wack, "Content-based music audio recommendation," in Proceedings of the 13th Annual ACM International Conference on Multimedia, 2005, pp. 211–212.
[4] S. Molau, M. Pitz, R. Schlüter, and H. Ney, "Computing mel-frequency cepstral coefficients on the power spectrum," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2001, vol. 1, pp. 73–76.
[5] C. Nadeu, D. Macho, and J. Hernando, "Time and frequency filtering of filter-bank energies for robust HMM speech recognition," Speech Communication, vol. 34, no. 1, pp. 93–114, 2001.
[6] B. E. Kingsbury, N. Morgan, and S. Greenberg, "Robust speech recognition using the modulation spectrogram," Speech Communication, vol. 25, no. 1, pp. 117–132, 1998.
[7] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
[8] Y. Xu, Q. Huang, W. Wang, P. Foster, S. Sigtia, P. J. Jackson, and M. D. Plumbley, "Unsupervised feature learning based on deep models for environmental audio tagging," arXiv preprint arXiv:1607.03681v2, 2016.
[9] M. Espi, M. Fujimoto, K. Kinoshita, and T. Nakatani, "Exploiting spectro-temporal locality in deep learning based acoustic event detection," EURASIP Journal on Audio, Speech, and Music Processing, vol. 2015, no. 1, p. 1, 2015.
[10] T. N. Sainath, R. J. Weiss, A. Senior, K. W. Wilson, and O. Vinyals, "Learning the speech front-end with raw waveform CLDNNs," in Proc. Interspeech, 2015.
[11] Z. Tüske, P. Golik, R. Schlüter, and H. Ney, "Acoustic modeling with deep neural networks using raw time signal for LVCSR," in Proc. Interspeech, 2014, pp. 890–894.
[12] P. Golik, Z. Tüske, R. Schlüter, and H. Ney, "Convolutional neural networks for acoustic modeling of raw time signal in LVCSR," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[14] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "Wavenet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016.
[15] T. Lidy and A. Schindler, "CQT-based convolutional neural networks for audio scene classification and domestic audio tagging," IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2016). [Online]. Available: http://www.ifs.tuwien.ac.at/~schindler/pubs/DCASE2016b.pdf
[16] E. Cakır, T. Heittola, and T. Virtanen, "Domestic audio tagging with convolutional neural networks," IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2016). [Online]. Available: https://www.cs.tut.fi/sgn/arg/dcase2016/documents/challenge technical reports/Task4/Cakir 2016 task4.pdf
[17] S. Yun, S. Kim, S. Moon, J. Cho, and T. Kim, "Discriminative training of GMM parameters for audio scene classification and audio tagging," IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2016). [Online]. Available: http://www.cs.tut.fi/sgn/arg/dcase2016/documents/challenge technical reports/Task4/Yun 2016 task4.pdf
[18] H. Eghbal-Zadeh, B. Lehner, M. Dorfer, and G. Widmer, "CP-JKU submissions for DCASE-2016: A hybrid approach using binaural i-vectors and deep convolutional neural networks," DCASE2016 Challenge, Tech. Rep., 2016.
[19] S. Adavanne, G. Parascandolo, P. Pertilä, T. Heittola, and T. Virtanen, "Sound event detection in multichannel audio using spatial and harmonic features," IEEE Detection and Classification of Acoustic Scenes and Events workshop, 2016. [Online]. Available: http://www.cs.tut.fi/sgn/arg/dcase2016/documents/workshop/Adavanne-DCASE2016workshop.pdf
[20] D. C. Ciresan, U. Meier, J. Masci, L. Maria Gambardella, and J. Schmidhuber, "Flexible, high performance convolutional neural networks for image classification," in IJCAI Proceedings-International Joint Conference on Artificial Intelligence, vol. 22, no. 1, 2011, p. 1237.
[21] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur, "Recurrent neural network based language model," in Proc. Interspeech, 2010.
[22] A. Graves, A.-R. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6645–6649.
[23] M. Lin, Q. Chen, and S. Yan, "Network in network," arXiv preprint arXiv:1312.4400, 2013.
[24] O. Abdel-Hamid, L. Deng, and D. Yu, "Exploring convolutional neural network structures and optimization techniques for speech recognition," in Proc. Interspeech, 2013, pp. 3366–3370.
[25] R. Pascanu, T. Mikolov, and Y. Bengio, "On the difficulty of training recurrent neural networks," in Proc. ICML, 2013, pp. 1310–1318.
[26] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[27] Z. Wu and S. King, "Investigating gated recurrent networks for speech synthesis," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5140–5144.
[28] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," arXiv preprint arXiv:1406.1078, 2014.
[29] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," arXiv preprint arXiv:1412.3555, 2014.
[30] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[31] J. Blauert, Spatial Hearing: The Psychophysics of Human Sound Localization. MIT Press, 1997.
[32] http://www.cs.tut.fi/sgn/arg/dcase2016/task-audio-tagging.
[33] P. Foster, S. Sigtia, S. Krstulovic, J. Barker, and M. Plumbley, "CHiME-home: A dataset for sound source recognition in a domestic environment," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2015, pp. 1–5.
[34] H. Christensen, J. Barker, N. Ma, and P. Green, "The CHiME corpus: a resource and a challenge for computational hearing in multisource environments," in Proceedings of Interspeech, 2010, pp. 1918–1921.
[35] K. P. Murphy, Machine Learning: A Probabilistic Perspective. MIT Press, 2012.