
Aalborg Universitet

A CNN-Based Approach to Identification of Degradations in Speech Signals
Saishu, Yuki; Poorjam, Amir Hossein; Christensen, Mads Græsbøll

Published in: Eurasip Journal on Audio, Speech, and Music Processing
DOI (link to publication from Publisher): 10.1186/s13636-021-00198-4
Publication date: 2021
Document Version: Accepted author manuscript, peer reviewed version

Citation for published version (APA):
Saishu, Y., Poorjam, A. H., & Christensen, M. G. (2021). A CNN-Based Approach to Identification of Degradations in Speech Signals. Eurasip Journal on Audio, Speech, and Music Processing, 2021(1), 1-10. [9]. https://doi.org/10.1186/s13636-021-00198-4

RESEARCH

Yuki Saishu¹, Amir Hossein Poorjam¹,²* and Mads Græsbøll Christensen¹

Abstract

The presence of degradations in speech signals, which causes acoustic mismatch between training and operating conditions, deteriorates the performance of many speech-based systems.
A variety of enhancement techniques have been developed to compensate for the acoustic mismatch in speech-based applications. To apply these signal enhancement techniques, however, it is necessary to have prior information about the presence and the type of degradations in speech signals. In this paper, we propose a new convolutional neural network (CNN)-based approach to automatically identify the major types of degradations commonly encountered in speech-based applications, namely additive noise, nonlinear distortion, and reverberation. In this approach, a set of parallel CNNs, each detecting a certain degradation type, is applied to the log-mel spectrogram of audio signals. Experimental results using two different speech types, namely pathological voice and normal running speech, show that the proposed method is effective in detecting the presence and the type of degradations in speech signals and that it outperforms the state-of-the-art method. Using the score-weighted class activation mapping, we provide a visual analysis of how the network makes decisions when identifying different types of degradation in speech signals by highlighting the regions of the log-mel spectrogram which are more influential to the target degradation.

Keywords: Signal enhancement; Convolutional neural network; Identification of degradation; Quality control; Visualization

1 Introduction

Advances in portable devices such as smartphones and tablets, which are equipped with high-quality microphones, facilitate capturing and processing speech signals in a wide range of environments. However, the quality of the recordings is not necessarily as expected, as they might be subject to degradation. In practice, the presence of degradation during the operating time can deteriorate the performance of speech-based systems, such as speech recognition [1], speaker identification [2], and pathological voice analysis (assessment of the voice signal of a speaker with a voice disorder) [3, 4], mainly due to acoustic mismatch between training and operating conditions. The most common types of degradation typically encountered in speech-based applications are background noise, reverberation, and nonlinear distortion.

A speech signal degraded by additive noise, reverberation, and nonlinear distortion can be, respectively, modeled as follows:

$$x_n(t) = s(t) + e(t), \tag{1}$$
$$x_r(t) = s(t) \ast h(t), \tag{2}$$
$$x_d(t) = f(s(t)), \tag{3}$$

where $t$ is the time index, $s(t)$ is the clean speech signal recorded by a microphone in a noise-free and non-reverberant environment, $e(t)$ is an additive noise, $f(\cdot)$ represents a nonlinear function, $h(t)$ is a room impulse response (RIR), and $\ast$ indicates the convolution operation. We note that in reality, these degradations are even more complex. For example, they may be time-dependent. A variety of effective signal enhancement techniques have been developed to enhance a degraded speech signal, such as noise reduction [5, 6], dereverberation [7, 8], and restoration of some types of nonlinear distortion [9, 10]. Most of these enhancement algorithms have been designed to deal with a specific type of degradation in a signal, although recent research in comprehensive speech enhancement, dealing with both additive noise and reverberation, is promising [11, 12, 13]. Nevertheless, to properly compensate for the effects of degradations, it is necessary to know or obtain information about the presence and the type of degradations in speech signals. Since manual inspection of the signals is very time consuming, costly, and even impossible in many speech-based applications, an accurate degradation detection system would be useful to automatically identify the presence and type of degradations.

*Correspondence: [email protected]
¹ Audio Analysis Lab, CREATE, Aalborg University, Rendsburggade 14, 9000 Aalborg, Denmark
Full list of author information is available at the end of the article

There are a variety of approaches to identify different types of degradation in speech signals. For example, Ma et al. [14] proposed a hidden Markov model based approach to distinguish different types of noise in speech signals. In another study by Desmond et al. [15], reverberant signals are detected using a channel-specific statistical model. In [16, 17], clipping in speech signals, as an example of nonlinear distortion, is detected. Although effective, these approaches are focused on detecting a single, specific type of degradation. Multiclass classification, on the other hand, can be used to detect different types of degradations. In [18, 19], Poorjam et al. proposed two generalized multiclass classification-based approaches for detecting various types of degradation, which were investigated only on pathological voice signals and whose accuracy was still inadequate. Moreover, there is no control over the class assignment in these approaches when a new type of degradation is observed for which the classifier has not been trained. For example, clipping, packet loss, dynamic range compression, automatic gain control, and distortions due to using low-quality or improperly configured equipment are considered as new types of degradation for a multiclass classifier trained only with noisy and reverberant signals.

To overcome the limitations of the multiclass-based approaches, one can use a multilabel classification approach in which more than one class label may be assigned to each sample. Compared to the multiclass-based methods, this approach can better deal with some challenging cases, such as the presence of a new degradation type and the coexistence of more than one degradation. In the former case, the sample may be classified as none of the target classes. In the latter case, more than one detector can accept a signal subject to a mixture of degradations. One possible solution [...] meeting all requirements at the same time might not be feasible in some cases.

As an alternative solution, Poorjam et al. proposed a data-driven approach which uses a set of parallel Gaussian mixture models (GMMs) to detect three types of degradation in pathological voice signals, namely background noise, reverberation, and nonlinear distortion [4]. All detectors in this approach are similar in terms of complexity, underlying assumptions, and acoustic features, except that they are trained using different degraded signals. This approach is focused on pathological voices and, particularly, on sustained vowels.

In this paper, we propose a more accurate convolutional neural network (CNN)-based approach which can identify degradations not only in sustained vowels, but also in normal running speech. CNNs are computationally efficient deep neural networks that are able to learn complex patterns in the spectrogram of a speech signal. In this approach, we apply a set of parallel CNNs to the log-mel spectrograms of the signals. Each CNN model, trained with signals corrupted by a specific degradation type, is responsible for detecting the corresponding degradation in a test signal. The prediction scores of an unseen test sample can be used to associate multiple degradation labels with an observation and can be interpreted as the degree of contribution of each degradation in a degraded signal. Moreover, using the score-weighted class activation mapping (score-CAM) technique [20], we visually explain on what basis the CNN models make a specific decision in detecting different types of degradation by finding the regions in the mel-scale spectrograms of a degraded signal that are most influential to the scores of the target class. In this technique, different activation maps are applied to the input spectrogram, each perturbing a region of the spectrogram. Then, the effect of each activation map on the prediction scores is observed. The importance of each activation map is determined by the prediction score on the target class. Finally, a saliency map is generated by a weighted linear combination of all activation maps to visualize the internal representation in a CNN [20]. Since this technique does not require any modifications to the architecture of the network, it can be applied to a wide variety of CNN models.

The rest of this paper is organized as follows.
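To make the three degradation models of Eqs. (1) to (3) concrete, the following minimal sketch applies each of them to a synthetic tone. It is an illustration under stated assumptions and not the data generation procedure used in this work: the white Gaussian noise, the exponentially decaying random impulse response, the choice of hard clipping as the nonlinear function, and all names and parameter values (`add_noise`, `reverberate`, `clip`, `snr_db`, `threshold`) are hypothetical.

```python
import math
import random


def add_noise(s, snr_db, rng):
    """Eq. (1): clean signal plus additive noise.

    Assumption: the noise is white Gaussian, scaled to a target SNR in dB.
    """
    signal_power = sum(v * v for v in s) / len(s)
    noise_std = math.sqrt(signal_power / (10 ** (snr_db / 10)))
    return [v + rng.gauss(0.0, noise_std) for v in s]


def reverberate(s, h):
    """Eq. (2): clean signal convolved with an impulse response h."""
    out = [0.0] * (len(s) + len(h) - 1)
    for i, sv in enumerate(s):
        for j, hv in enumerate(h):
            out[i + j] += sv * hv
    return out


def clip(s, threshold):
    """Eq. (3): a nonlinear function of the clean signal.

    Assumption: hard clipping is used as one example of such a function.
    """
    return [max(-threshold, min(threshold, v)) for v in s]


rng = random.Random(0)
fs = 8000  # hypothetical sampling rate in Hz

# Stand-in for clean speech: 0.1 s of a 220 Hz tone.
s = [math.sin(2 * math.pi * 220 * n / fs) for n in range(800)]

# Toy impulse response: direct path followed by a decaying random tail.
h = [math.exp(-n / 40.0) * (1.0 if n == 0 else rng.uniform(-0.5, 0.5))
     for n in range(200)]

x_n = add_noise(s, snr_db=10, rng=rng)  # noisy signal
x_r = reverberate(s, h)                 # reverberant signal
x_d = clip(s, threshold=0.5)            # nonlinearly distorted signal
```

Each function produces one degradation in isolation, which mirrors how a detector dedicated to a single degradation type could be given its own training material; mixtures could be obtained by chaining the functions.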