<<


Auditory-inspired Morphological Processing of Speech Spectrograms: Applications in Automatic Speech Recognition and Speech Enhancement

Joyner Cadore · Francisco J. Valverde-Albacete · Ascensión Gallardo-Antolín · Carmen Peláez-Moreno

Received: date / Accepted: date

Abstract A set of auditory-inspired features for speech processing is presented in this paper. The proposed acoustic features combine spectral subtraction and two-dimensional non-linear filtering techniques most usually employed for image processing. In particular, morphological operations like erosion and dilation are applied to a noisy speech spectrogram that has been previously enhanced by a conventional spectral subtraction procedure and filtered by an auditory filterbank. These methods have been applied both to speech enhancement and recognition. In the first application, anisotropic structural elements on gray-scale spectrograms have been found to provide a better perceptual quality than isotropic ones, and reveal themselves as more appropriate for retaining the speech structure while removing background noise. A number of perceptual quality estimation measures have been employed for several Signal-to-Noise Ratios on the Aurora database. For speech recognition, the combination of spectral subtraction and auditory-inspired morphological filtering has been found to improve the recognition rate on a noisy Isolet database.

Keywords Spectral subtraction · Spectrogram · Morphological processing · Image filtering · Automatic Speech Recognition · Speech Enhancement · Auditory-based features

J. Cadore
Universidad Carlos III de Madrid
Avda. de la Universidad 30, 28911 Madrid, Spain.
E-mail: [email protected]

F.J. Valverde-Albacete
E-mail: [email protected]

A. Gallardo-Antolín
E-mail: [email protected]

C. Peláez-Moreno
E-mail: [email protected]

1 Introduction

Machine performance in Speech Recognition tasks is still far from that of humans [4].

The standard model conceptualizes the recognition process as a virtual channel in which word articulations (issued from the speaker's mind) are introduced and word percepts (selected among the receiver's hypotheses) are recognized. Data-driven statistical modeling techniques have made speech technologies lift off, enabling their introduction into a variety of everyday applications.

With increased exposure to the public and the rise of commercial expectations, the sophistication and improvement of Speech Technology relies more and more on the Machine-Learning capabilities of computers. But the emphasis on Machine-Learning algorithms often comes at the expense of genuine insight into the nature of Spoken Language [16], which has brought about an ever-widening divide between the interests of phoneticians and linguists, on the one hand, and speech technologists, on the other. Yet, the suspicion looms above the latter crowd that the performance of some technologies stemming from this model is plateauing (Speech Recognition performance, notably) and the field is in need of a new technological paradigm.

In ASR (Automatic Speech Recognition), the standard procedure begins with the encoding of the speech signal as a string of feature vectors (also called acoustic frames) [36]. This is followed by an acoustic decoding stage where a maximum-likelihood (ML) strategy is applied to a set of generative models (e.g. Hidden Markov Models, HMM) and the feature vectors to obtain a basic speech-unit percept correlate. Finally, the linguistic decoding process brings to bear the lexical and linguistic resources needed to obtain the word sequences [25, 1].

Among non-satisfactorily tackled challenges in ASR are various types of speech variability (rate, accent, dialect, gender, etc.) and background noise. One way to address these limitations is trying to imitate human acoustic capabilities, e.g. finding a more suitable auditory model.

In this paper we focus on the noise problem, where humans are known to perform remarkably well whilst machines still lag behind [36]. There are many solutions inspired by the human auditory system aimed at solving this issue, e.g. feature extraction based on the well-known mel-frequency cepstral coefficients (MFCC, [7]) and on the Gammatone-based coefficients (GTC, [41]). Other solutions use the so-called spectro-temporal features, which consider both the time and frequency domains in the feature extraction stage ([32, 19]).

Following that line of work, in this paper we use morphological filtering over the spectrogram of a noisy signal to mimic some properties of the Human Auditory System (HAS), such as frequency and temporal masking, thereby filtering out as much noise as possible [9]. Morphological filtering is an image-processing tool for extracting image components that are useful for purposes like thinning, pruning, structure enhancement and noise filtering [15]. Since our goal is to emphasize the areas of interest of the spectrogram, the structural element shapes should be based on the spectro-temporal masking properties of the HAS, as in [9]. Our approach consists of substituting binary structural elements by full gray-scale ones (thus avoiding the need for thresholding), and we propose anisotropic structural elements as a means to better characterize the response of the ear.

Previous to the application of these ideas to ASR tasks, we present some experiments on speech enhancement in order to assess the benefits of morphological filtering. Spectral Subtraction (SS, [3]) is used as a preprocessing stage before applying morphological filtering on the spectrogram, thus producing a perceptually enhanced signal where musical noise (mainly caused by SS) has been reduced. In order to thoroughly evaluate the filtered signals we use a large quantity of speech utterances, which, on the other hand, precludes the use of subjective quality measures. For this reason, estimations of these subjective opinions are computed from a set of objective quality measures [21, 22, 38, 2]. Finally, we apply this method to ASR tasks. In particular, we have applied morphological filtering on a cochleogram with structural elements designed to model the HAS masking effects. These new features have been tested on a hybrid MLP/HMM recognizer.

This paper is organized as follows: section 2 introduces the main notions of the HAS employed in this paper; in section 3, we present the preprocessing stage, spectrogram calculation and spectral subtraction. Section 4 is devoted to the explanation of our vision of morphological processing to imitate HAS capabilities. Finally, in sections 5 and 6 we describe two applications of the proposed method (Speech Enhancement and ASR, respectively) and end with some conclusions and ideas for future work in section 7.

2 Speech Processing at the Cochlear Level

2.1 Sound level

The basic quantity over which perception of sound is measured is sound pressure level, which is a normalized, logarithmic¹ sound pressure:

L_p = 20 log(p/p_0)  (dB SPL)    (1)

where p is sound pressure and p_0 = 20 µPa is the reference sound pressure, the lowest audible pressure for human ears at mid-frequencies.

Another quantity is sound (intensity) level, a normalized, logarithmic intensity level:

L_I = 10 log(I/I_0)  (dB SL)    (2)

where I ∝ p² is acoustic intensity, an energy-related quantity. When using I_0 = 10⁻¹² W/m² for reference, both levels can be equated and we drop the subindex:

L = 20 log(p/p_0) (dB SPL) = 10 log(I/I_0) (dB SL)    (3)

Both dB SPL and dB SL could, in this case, be simply notated as dB.

¹ Unless otherwise noted, log refers in this paper to base-10 logarithms.

2.2 Auditory filterbank description

It is widely accepted that the cochlea carries out a logarithmic compression of the auditory range whereby higher frequency intervals are represented with less detail than lower frequency ranges. This realisation stems from experiments to detect critical bands: the critical band is the frequency bandwidth around a center frequency whose components affect the sound level and pitch perception of the center frequency.

In this light, the notion of an auditory filterbank relates three concepts:

– A discretization of a frequency range into N bands.
– A choice of the center of the bands to be related to special frequencies or frequency ranges in the inner ear, which entails the definition of a frequency scale.
– A choice of the bandwidths and shapes of the different filters that takes into consideration the notion of critical bands.

The use of logarithmic frequency scales eases the conceptualization of phenomena like masking (see Figure 1), and we will consider several scales of logarithmic frequency: Mel, Bark and the Equivalent Rectangular Bandwidth-induced (ERB) scale. All of them use methods to calculate the critical bandwidths at different center frequencies and at the same time define scales of equal difference in perception of pitches/levels related to those center frequencies.

2.2.1 The Mel scale

The Mel scale is a very well-known logarithmic transformation of the frequency scale due to Stevens, Volkmann and Newman in 1937 [40]:

F_m(f) = 2595 · log(1 + f/0.7)    (4)

with F_m in mel and f in kHz. This frequency transformation is at the core of the most popular feature extraction procedure in ASR, the Mel Frequency Cepstral Coefficients (MFCC), where a filterbank of triangular overlapping filters uniformly distributed in the mel scale is usually employed. This is one of our choices for testing our thesis, as will be explained in section 6.

2.2.2 Critical band and critical-band rate scale

The Bark scale was first defined by Zwicker and colleagues, but the actual formulas for transforming from linear frequency to bark scale have been taken from [44, 17]:

F_z(f) = 13 · arctan(0.76 · f) + 3.5 · arctan((f/7.5)²)    (5)

with critical-band rate z in bark and f in kHz.

The formula for calculating the critical bandwidth for each frequency is [44, 17]:

BW_z(f) = 25 + 75 · (1 + 1.4 · f²)^0.69    (6)

with bandwidths in Hz and frequencies in kHz.

2.2.3 ERB and ERB-rate

The ERB was defined in [33, 14] as a more adjusted measurement of the critical band:

BW_ERB(f) = 6.23 · f² + 93.39 · f + 28.52  (f in kHz)    (7)

Based upon these bands a new logarithmic scale may be defined, the ERB-rate [33]:

F_ERB(f) = 11.17 · log((f + 0.312)/(f + 14.675)) + 43.0  (f in kHz)    (8)

or the ERB_N number [14, 34]:

ERB_N-number(f) = 21.4 · log(4.37 · f + 1)    (9)

This scaling is at the base of the Gammatone filterbank, an alternative to the one employed by MFCC that will also be employed in our tests (see section 6). This filterbank is defined in the time domain by its impulse response [35]:

f(t) = k · t^(n−1) · e^(−2πBt) · cos(2πf_c·t + φ)    (10)

where k defines the output gain, n is the order of the filter (in the range 3-5 the filter is a good approximation of the human auditory filter), B defines the bandwidth, f_c is the filter's central frequency and φ is the phase.

2.3 Masking

Masking is a phenomenon whereby the perception of some frequency is affected by another frequency, the masker frequency, to the extent that masked frequencies may disappear from perception. Our conceptual take is that all other masking behaviours (with narrow- and wide-band maskers) can be obtained by appropriate addition of tone maskers, so we will only consider this type of masker herein. A masking tone will be defined as²:

t_{F_m}(F) = L_m · δ(F − F_m)    (11)

where F is a choice of a logarithmic transformation as the ones presented in section 2.2, L_m is the SPL of the tone and F_m is the appropriately scaled masker frequency.

Cochlear masking has been studied mainly as it regards the influence of some frequencies on others simultaneously present in the spectrum, or simultaneous masking, or as regards the influence of the same frequencies at different time instants, or temporal masking.

² Although the original names of the scales are different, to pave the way for later notation, we are going to introduce our own names for the logarithmic scales, resp. m, z and ERB-rate.
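For concreteness, the scale and filter definitions above can be transcribed directly into code. The following sketch (ours, not from the paper) implements Eqs. (4), (5), (8) and (10) under the paper's base-10 log convention; the example at the end only illustrates how center frequencies would be placed uniformly on the mel scale.

import numpy as np

def hz_to_mel(f_khz):
    """Mel scale, Eq. (4): F_m(f) = 2595 log10(1 + f/0.7), f in kHz."""
    return 2595.0 * np.log10(1.0 + f_khz / 0.7)

def hz_to_bark(f_khz):
    """Critical-band rate, Eq. (5): z = 13 arctan(0.76 f) + 3.5 arctan((f/7.5)^2)."""
    return 13.0 * np.arctan(0.76 * f_khz) + 3.5 * np.arctan((f_khz / 7.5) ** 2)

def erb_rate(f_khz):
    """ERB-rate scale, Eq. (8), transcribed with the paper's base-10 logs."""
    return 11.17 * np.log10((f_khz + 0.312) / (f_khz + 14.675)) + 43.0

def gammatone_ir(fc_hz, bw_hz, fs=8000, n=4, duration=0.05, k=1.0, phi=0.0):
    """Gammatone impulse response, Eq. (10): k t^(n-1) exp(-2 pi B t) cos(2 pi fc t + phi)."""
    t = np.arange(0.0, duration, 1.0 / fs)
    return k * t ** (n - 1) * np.exp(-2.0 * np.pi * bw_hz * t) * np.cos(2.0 * np.pi * fc_hz * t + phi)

# Example: 40 band centers uniformly spaced on the mel scale, mapped back to kHz.
mels = np.linspace(hz_to_mel(0.0), hz_to_mel(4.0), 42)   # 40 bands + 2 edges
centers_khz = 0.7 * (10.0 ** (mels / 2595.0) - 1.0)      # inverse of Eq. (4)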

2.3.1 Simultaneous masking

Simultaneous masking is characterized by the minimum sound pressure level at which a test sound, probe or signal (normally a pure tone), is audible in the presence of a masker. By varying the frequency of the probe throughout the spectrum, a masking pattern may be obtained. As it happens, the shape and sound pressure level L_m of the masker largely determine the masking pattern.

Regarding the change of masking with masker parameters, [43] note that simultaneous masking is better represented in logarithmic scales. Indeed, Figure 1 shows the masked thresholds for masker tones of the same L_m = 60 dB SPL at different frequencies in linear (a), log-linear (b) and bark (c) scales (in variables f, F = log f and F_z, see section 2.2.2, respectively). We can see a greater regularity in spacing and masker slopes in the latter, a consistent finding in other auditorily-motivated scales.

Similarly, simultaneous masking produced by narrow-band maskers is level-dependent and therefore clearly a non-linear effect, as shown in Figure 2.
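To make the role of the frequency scale tangible, the sketch below builds a schematic two-slope masking pattern around a tone masker on the Bark axis, in the spirit of the regular patterns of Figure 1(c). The slope values are illustrative placeholders, not measurements reported in this paper.

import numpy as np

def tone_masking_pattern(z, z_m, L_m, s_low=27.0, s_high=15.0):
    """Illustrative triangular masked threshold (dB) around a tone masker at
    critical-band rate z_m with level L_m. s_low/s_high are the slopes in
    dB/Bark towards lower/higher frequencies -- placeholder values."""
    delta = z - z_m
    return np.where(delta < 0, L_m + s_low * delta, L_m - s_high * delta)

z = np.linspace(0.0, 24.0, 241)                    # Bark axis
pattern = tone_masking_pattern(z, z_m=8.5, L_m=60.0)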

2.3.2 Temporal masking

Premasking (or backwards masking) occurs before the appearance of the masker, while postmasking (or forward masking) happens after the masker. Premasking all but disappears around 50 ms before the masker, while the influence of postmasking may last as far as 500 ms after the masker [11, p. 196]. Hence, premasking might be important for the masking of occlusives or affricates, while postmasking will definitely be an issue for longer-lasting sounds like vowels or nasals, perhaps even fricatives.

Fig. 1: Masking curves for a 60 dB masker along the auditory range [taken from 45] in linear (a), log-linear (b) and bark scale (c).

Figure 3 shows the conventions used to describe temporal masking for long maskers: ∆t, with positive and negative values, describes the temporal delay since masker onset time, while t_d describes the time delay after masker turn-off.

2.3.3 Premasking

Premasking is noticeable about 20 ms prior to the masker regardless of its level. This would call for a dependency on level (to make the growing exponential rate higher the higher the level), but we model it with a constant slope of 8 ms to mimic postmasking with short bursts of duration less than 5 ms (see below).

2.3.4 Postmasking

There is the question of modelling post-masking with different lengths of masker:

– There is almost no decay for the first 5 ms after the masker is switched off. The values amount to those observed for simultaneous masking [10, p. 83]. This would seem to rule out exponential decay as a model for postmasking.
– Forward masking for a long-duration masker (longer than 5 ms) lasts for 200 ms regardless of noise level.
– For a uniform masking noise of 60 dB, when the masker lasts for around 200 ms the decaying time of the masker increases [10].

We are going to assume that the important model is that of masking evoked by short-duration maskers and let an additive model account for the decrease in slopes for longer masker durations.

Fig. 3: Regions within which premasking, simultaneous masking and postmasking occur [from 10, fig. 4.17, p. 78].

2.3.5 Addition of the Threshold in Quiet

The actual masked threshold caused by any masker in the previous models is seen as linear in logarithmic delay and logarithmic frequency for non-simultaneous and simultaneous masking, respectively. This would entail very low masking amounts for delays (resp. frequencies) far away from the time-frequency location of the masker.

Of course, this has to be corrected with the so-called threshold in quiet (THQ) to achieve the correct threshold surface. This threshold, under which no sound is perceived, can be modelled as [17]:

THQ_{T_S≥500}(f) = 3.64 · f^(−0.8) − 6.5 · e^(−0.6·(f−3.3)²) + 10⁻³ · f⁴
THQ_{T_S<500}(f, T_S) = THQ_{T_S≥500}(f) + (7.53 − 6.5 · 10⁻³ · f³) · log(500 − T_S)

where f is in kHz and T_S is the duration time of the masker employed in the empirical determination of these expressions.

Fig. 2: Threshold in quiet and masking curves for a 1 kHz masker [taken from 45].

A fitted model for post-masking is presented in [26] and later taken on by [17]:

M(t_d, L_m) = a · (b − log t_d) · (L_m − c)    (12)

where M is the amount of masking, t_d is the signal delay in ms as depicted in figure 3, L_m is the masker level in dB SL, and a, b and c are parameters obtained by fitting the curve to the data. In particular,

– a(L_m) is related to the slope of the time course of masking for a given masker level L_m.
– b is the logarithm of the probe-masker delay intercept.
– c is the intercept when the masker level is expressed in dB SL.

Note that this model disregards the variation of the amount of masking with masker frequency. This model has to be extended to take into consideration simultaneous masking.

The combination that immediately comes to mind, the maximum:

L_M(t_d, f, L_m) = max(M(t_d, L_m), THQ(f))    (13)

is clearly wrong on experimental accounts: the skirts of the presented masker seem to taper off smoothly towards the threshold in quiet. Nonetheless, it is an acceptable approximation in most frequency ranges, and we have assumed it as a flooring in the spectrograms resulting from our processing.

2.3.6 Final considerations

It seems that the masking capabilities of the cochlea co-evolved in the presence of a noise that has the peculiarity of raising masking thresholds uniformly, that is, giving a flat frequency response. This noise, adequately named uniformly masking noise (UMN) [42], has the aspect of a broadband lowpass noise (with cutoff frequency in the range of vocal frequencies). Interestingly, this can also be considered an idealised version of the spectrum of human speech.

For this reason and the preceding discussion, in this paper we will model the response of the cochlea to tone maskers and trust the morphological processing mechanism to provide the flattening action on the masked thresholds, as will be explained in section 4.
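The masking models of this section can be summarized in a few lines of code. In the sketch below, the fitting parameters a, b and c of Eq. (12) are placeholders (the paper leaves them to a fit against the data of [26]), and the threshold in quiet uses the standard Terhardt-style approximation assumed in our reconstruction above.

import numpy as np

def postmasking(td_ms, L_m, a=0.1, b=2.3, c=10.0):
    """Fitted post-masking model, Eq. (12): M = a (b - log10 td)(L_m - c).
    a, b and c are placeholder fitting parameters, not values from the paper."""
    return a * (b - np.log10(td_ms)) * (L_m - c)

def thq(f_khz):
    """Threshold in quiet for long maskers (T_S >= 500 ms), standard
    Terhardt-style approximation assumed for the printed formula."""
    return (3.64 * f_khz ** -0.8
            - 6.5 * np.exp(-0.6 * (f_khz - 3.3) ** 2)
            + 1e-3 * f_khz ** 4)

def masked_threshold(td_ms, f_khz, L_m):
    """Eq. (13): flooring of the masking amount by the threshold in quiet."""
    return np.maximum(postmasking(td_ms, L_m), thq(f_khz))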

3 Spectrogram and Spectral Subtraction

As revealed in section 2, the effects of auditory masking can be observed both in the time and frequency domains, requiring a two-dimensional representation to jointly consider the two of them. In this situation, a spectrogram is a natural choice since it expresses the speech signal spectral energy density as a function of time [28].

This popular depiction shows the temporal evolution of formant positions, harmonics and other components of speech. Spectrograms are usually displayed as gray-scale or heatmap images; typically, the larger energy magnitudes in the spectrum are displayed in white (or warm, in the case of heatmaps) colours and the valleys (e.g. silences) in dark (or cold) colours. This is illustrated in Figures 4a and 4b by the spectrograms for clean and noisy signals.

Fig. 4: Resulting spectrograms of a noisy utterance with added metro noise at 10 dB SNR after spectral subtraction: (a) clean spectrogram, (b) noisy spectrogram, (c) noisy spectrogram + spectral subtraction.

On the other hand, the application of an auditory-motivated scale in the frequency domain (see section 2.2) produces a more uniform representation of the simultaneous masking effects (see section 2.3.1) that is certainly more amenable to computational modelling, since the masking threshold becomes almost independent of the scaled frequency. An auditory spectrogram (sometimes referred to as a cochleogram) substitutes the usual linear spectral representation (resulting from a Discrete Fourier Transform) by auditory-motivated filterbanks uniformly distributed in a scaled frequency.

Another preprocessing technique employed in this paper is the conventional Spectral Subtraction (SS) procedure [3] which, applied on the noisy spectrogram as can be observed in figure 4c, will be regarded as our baseline system. However, this method is known to exhibit what is called musical noise, i.e., rough transitions between the speech signal and the areas with removed noise that become noticeable and unpleasant to a human listener. Among other aspects, our proposal aims at attenuating this pernicious behavior while preserving the main speech features.
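For reference, a minimal magnitude-domain spectral subtraction in the spirit of [3] can be sketched as follows; the oversubtraction factor and spectral floor are typical textbook choices, not values taken from this paper.

import numpy as np

def spectral_subtraction(noisy_mag, noise_mag, alpha=2.0, beta=0.01):
    """Magnitude spectral subtraction in the spirit of [3].
    noisy_mag: (freq, time) magnitude spectrogram of the noisy signal.
    noise_mag: (freq,) noise magnitude estimate, e.g. averaged over a
    leading non-speech segment. alpha oversubtracts; beta keeps a floor."""
    clean = noisy_mag - alpha * noise_mag[:, None]
    floor = beta * noisy_mag
    return np.maximum(clean, floor)

# Typical use: estimate the noise from the first few (non-speech) frames.
# noise_mag = noisy_mag[:, :10].mean(axis=1)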

4 Morphological Filtering of Speech Signals

Mathematical Morphology is a theory for the analysis of spatial structures [39] whose main application domain is in Image Processing, as a tool for thinning, pruning, structure enhancement, object marking, segmentation and noise filtering [15, 8]. It may be used on both black-and-white and gray-scale images, and in this paper we report on its use for morphological filtering (MF).

4.1 Morphological filtering for continuous signals

The basic operations in gray-scale morphology are erosion and dilation. Let S(f, t) and M(f, t) be two-dimensional signals. Their erosion (S ⊖ M) and dilation (S ⊕ M) are defined as the convolutions in R ∪ {±∞},

(S ⊖ M)(f, t) = ⋀_{(ϕ,τ)∈R²} { S(ϕ, τ) − M(ϕ − f, τ − t) }
(S ⊕ M)(f, t) = ⋁_{(ϕ,τ)∈R²} { S(ϕ, τ) + M(f − ϕ, t − τ) }

where ∧ is the min operation and ∨ the max operation, extended adequately to operate on infinite values. Erosion is used to shrink or reduce objects, while dilation, being the dual to erosion, produces an enlarging. Both are irreversible.

Opening S ∘ M and closing S • M are just the two compositions generated by erosion and dilation with a fixed M, called the structuring element (SE):

S ∘ M = (S ⊖ M) ⊕ M        S • M = (S ⊕ M) ⊖ M

Note also that for a fixed SE M, opening (resp. closing) is an idempotent, decreasing (resp. increasing) operator, that is, an interior (resp. closure) operator:

S ∘ M ≤ S        S • M ≥ S
S ∘ M = (S ∘ M) ∘ M        S • M = (S • M) • M

Opening and closing are used to remove small objects in images, typically noise; their behaviours with respect to, for instance, salt-and-pepper noise are dual to each other.

4.2 Morphological filtering for discrete signals

We use S with implicit frequency band index n and temporal frame k to represent an adequate discretization S(n, k) of signal S(f, t). Then erosion and dilation can be represented as matrix operations

S ⊖ M = {p ∈ R² | p = m − s, m ∈ M, s ∈ S}
S ⊕ M = {p ∈ R² | p = s + m, s ∈ S, m ∈ M}

Opening and closing adopt similar forms as matrices as they had for operators:

S ∘ M = (S ⊖ M) ⊕ M        S • M = (S ⊕ M) ⊖ M

4.3 Discrete morphological spectro-temporal filtering of speech

Let S be a discretized spectrogram and M a specific discrete spectro-temporal masker function taken as structuring element. As in [9], we use the opening operator to try to remove the remaining noise and enhance any component of the speech signal, obtaining an emphasized spectrogram that is subsequently added to the (possibly de-noised) spectrogram to produce the filtered speech signal,

Ŝ = S + S ∘ M

Notice from an example such as that of Figure 5(a) the irregular shapes of the acoustic objects of the spectrogram (i.e. formant and harmonic modulations). We decided to test different SEs to try to capture such dynamics. For a first choice, the filtered spectrogram of Figure 5(e) was obtained by pixel-wise adding those versions of the spectrogram obtained by morphological filtering with the three different SEs of Figure 6, at angles 0°, 45° and 90°, resulting in Figures 5(b), (c) and (d), respectively. The emphasized spectrogram in Figure 5(f) was then obtained by normalizing this filtered spectrogram and adding it to the original (de-noised) spectrogram.

Fig. 6: Anisotropic SE: rectangles of different sizes and angles.

Our second choice of structuring element is the single flat anisotropic SE in Figure 7. Its design is inspired by the masking effects of the human auditory system (HAS) both in time and frequency (see §2, [45, 12]). On the auditory-inspired frequency axis, the spread of masking for simultaneous masking was modelled symmetrically, as in figure 1. On the time axis, however, the masking effect is asymmetrical (see figure 3), as per the experimental data of Section 2.3.2. Outside these well-explored phenomena, that is, in places where a mixture of simultaneous and temporal masking would occur, the behaviour was extrapolated as shown in Figure 7.

The procedure to obtain and combine the emphasized spectrogram with the spectrogram was the same as already explained for the first SEs.
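A possible rendering of the emphasis operation Ŝ = S + S ∘ M with off-the-shelf tools is sketched below; scipy's grey_opening stands in for the paper's implementation, and rect_se approximates the flat rectangular SEs of Figure 6.

import numpy as np
from scipy import ndimage

def emphasize(spectrogram, se):
    """Opening-based emphasis: S_hat = S + (S o M), Sect. 4.3.
    For a flat SE, grey_opening with a boolean footprint implements S o M."""
    opened = ndimage.grey_opening(spectrogram, footprint=se)
    opened = opened / (np.abs(opened).max() + 1e-12)   # normalize before adding
    return spectrogram + opened

def rect_se(height, width, angle_deg=0):
    """Flat rectangular SE; rotation approximates the tilted variant of Fig. 6."""
    se = np.ones((height, width), dtype=bool)
    if angle_deg:
        se = ndimage.rotate(se.astype(float), angle_deg, order=0) > 0.5
    return se

# First choice of Sect. 4.3: pixel-wise addition of the three filterings
# (sizes here are illustrative, not the paper's):
# S_f = sum(ndimage.grey_opening(S, footprint=rect_se(3, 9, a)) for a in (0, 45, 90))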

(a) Noisy Spectrogram + SS (Figure 4c)

(b) Horizontal Rectangle (c) Tilted Rectangle (d) Vertical Rectangle

(e) The addition of (b), (c) and (d)

(f) The addition of (a) and (e)

Fig. 5: Morphological filtering on the noisy spectrogram of Figure 4(b), reproduced in (a) for comparison; (b), (c) and (d) are the result of filtering with the different anisotropic structural elements in Figure 6, while (e) is their pixel-wise addition and (f) the emphasis (addition) of (a) with (e).

The third choice was a single non-flat improvement on the second one: all the pixels were weighted as shown in Figure 8, but the rest of the procedure remained the same.
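The construction of such auditory-shaped SEs might look as follows. This is an illustrative sketch only: the extents (premasking and postmasking frames, frequency spread) and the weighting profile are design parameters that the paper specifies graphically (Figs. 7 and 8), not numerically.

import numpy as np

def auditory_se(n_freq=5, pre_frames=2, post_frames=6, flat=True):
    """Asymmetric time-frequency SE: symmetric spread along the (auditory)
    frequency axis, short premasking and longer postmasking along time.
    Extents and weights are illustrative placeholders."""
    width = pre_frames + 1 + post_frames
    fy = np.abs(np.arange(n_freq) - n_freq // 2)[:, None]   # frequency offset
    tx = np.arange(width)[None, :] - pre_frames             # time offset (0 = masker)
    decay = np.where(tx < 0, -tx / (pre_frames + 1.0), tx / (post_frames + 1.0))
    weights = 1.0 - 0.5 * (fy / (n_freq // 2 + 1.0)) - 0.5 * decay
    weights = np.clip(weights, 0.0, None)
    if flat:
        return weights > 0          # boolean footprint (aSE-2-like)
    return weights                  # graded, non-flat structure (aSE-3-like)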

(a) Continuous masker-inspired structural element considering pre- and post-masking as well as the upwards and downwards spread of simultaneous masking imposed by a certain flooring

Fig. 8: Top panel shows the 2-D view of a structural element. Bottom panel shows the 3-D view. The colors represent the weight of each pixel in the morphological operations.

(b) Discretization of masker

Fig. 7: Obtaining an auditorily-motivated structural el- ement: top, the estimated effect of a masker in the hu- man auditory system; bottom, the flat structuring ele- ment designed to emulate that response

The first two anisotropic SEs of Figures 6 and 7 were used in the speech enhancement task of Section 5, but for the ASR task of Section 6 we only report on the performance of the HAS-motivated anisotropic SEs of Figures 7 and 8, reflecting the fact that they obtained better results in the Speech Enhancement task.

In all cases, values below a certain threshold in the resulting combined spectrogram are replaced by a threshold value to avoid introducing musical noise in the filtered signal. This models the existence of a thresholding mechanism, as seen in Section 2.3.5.

5 Application to Speech Enhancement

In this section we present the evaluation of our proposed method, morphological filtering with different anisotropic SEs (Figs. 6 and 7), on a speech enhancement task. A block diagram of our proposed procedure can be observed in Figure 9.

Fig. 9: Proposed filtering, step by step.

5.1 General Description

The first step is to obtain the spectrogram of the noisy speech signal, sampled at 8 kHz. A resolution of 128 pixels (256-point FFT) is used on every spectrogram, as it was empirically determined to be appropriate for this task. Next, a conventional SS is applied on the noisy spectrogram and the contrast of the resulting gray-scale image is increased. The idea behind these operations is to emphasize the speech signal over the remaining noise to make it easier for the subsequent morphological filtering process. This last is performed by applying an opening operation. Finally, the filtered signal in the time domain is recovered using a conventional overlap-add method.
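Putting the stages together, the chain of Section 5.1 can be sketched with scipy's STFT/ISTFT pair, whose inverse performs the overlap-add step. spectral_subtraction, emphasize and rect_se refer to the earlier sketches; the contrast step is modelled as a simple power law, which is an assumption on our part.

import numpy as np
from scipy.signal import stft, istft

def enhance(x, fs=8000, nfft=256, gamma=1.5):
    """Sketch of the Sect. 5.1 pipeline: spectrogram -> SS -> contrast
    increase -> morphological opening-based emphasis -> overlap-add
    resynthesis with the noisy phase. gamma is a placeholder exponent."""
    f, t, X = stft(x, fs=fs, nperseg=nfft)
    mag, phase = np.abs(X), np.angle(X)
    noise = mag[:, :10].mean(axis=1)           # crude noise estimate
    mag = spectral_subtraction(mag, noise)     # see the Sect. 3 sketch
    mag = mag ** gamma                         # contrast increase (placeholder)
    mag = emphasize(mag, rect_se(3, 9))        # see the Sect. 4 sketch
    _, y = istft(mag * np.exp(1j * phase), fs=fs, nperseg=nfft)
    return y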

5.2 AURORA Database

The evaluation of the filtered signals with the proposed method was conducted on the AURORA Project database [20], which makes use of a speech database based on TI digits with artificially added noise over a range of SNRs.

We have considered four types of noise: metro, car, airport and restaurant noise. We employed around a thousand speech files, individually contaminated with additive noise at 5 different values of SNR (-5 dB, 0 dB, 5 dB, 10 dB and 15 dB). A clean speech signal with added noise (regardless of the SNR) is called in this paper a noisy signal.

5.3 Estimation of Perceptual Quality with Objective Quality Measures

We used three objective quality measures (OQM) to evaluate the filtered signals:

– Signal Distortion (Sig), adequate for the prediction of the distortion on speech.
– Background Noise (Bak), for predicting background intrusiveness.
– Overall Effect (Ovl), for predicting the overall quality.

All the OQM ([21], [22]) are evaluated using a five-point scale where 1 is the worst scenario and 5 the best. The OQM consist of linear combinations of the following measures, with the weights given in Table 1:

– Perceptual Evaluation of Speech Quality (PESQ): recommended by the ITU-T for end-to-end speech quality assessment, the PESQ score is able to predict subjective quality with good correlation in a very wide range of conditions, that may include noise, filtering, coding distortions, errors, delay and variable delay [2], [38].
– Weighted-Slope Spectral Distance (WSS): computes the weighted difference between the spectral slopes in each frequency band. The spectral slopes are obtained as the difference between spectral magnitudes in dB [27].
– Log-Likelihood Ratio (LLR): also referred to as the Itakura distance, it is a measure of the perceptual difference between an original spectrum (in our context, the clean speech signal) and a modified version of that spectrum (the filtered speech signal) [37].
– Segmental Signal-to-Noise Ratio (segSNR): this frame-based measure is formed by averaging frame-level SNR estimates [18].

Table 1: Weight of each measure on the Sig, Bak and Ovl composite measures.

Measure   PESQ    WSS      LLR      segSNR   Constant
Sig       0.603   -0.009   -1.029   0        3.093
Bak       0.478   -0.007   0        0.063    1.634
Ovl       0.805   -0.007   0.512    0        1.594

5.4 Experiments and Results

In order to evaluate the performance of our method we used the measures mentioned in section 5.3, with the code available in [29], considering the clean speech signal as the reference. We have compared five different combinations. The first one corresponds to spectral subtraction (SS) and the other four correspond to different morphological filterings: black and white mask with an isotropic SE (BW & iSE), black and white mask with an anisotropic SE (BW & aSE), gray-scale mask with an anisotropic SE (Gray & aSE) and gray-scale mask with a second anisotropic SE (Gray & aSE-2). The last two are the proposed methods: the anisotropic SEs are the rectangles, and the anisotropic SE-2 is the HAS-inspired one.

Results are presented using the following relative measure:

∆[%] = 100 · (FS − NS)/NS    (14)

in which FS is the measure value for the filtered signal and NS for the noisy one. Positive increments imply an improvement and negative ones a signal degradation with respect to the noisy signal.

Overall, similar trends have been observed for all of the noises, the results for car and metro on the one side, and restaurant and airport on the other, being very similar. Therefore, we have chosen metro and airport as a representative sample of them.
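Table 1 and Eq. (14) translate directly into code, assuming the four raw measures (PESQ, WSS, LLR, segSNR) have already been computed, e.g. with the tools of [29]:

# Linear combinations of Table 1 plus the relative improvement of Eq. (14).
WEIGHTS = {  # measure: (PESQ, WSS, LLR, segSNR, constant)
    "Sig": (0.603, -0.009, -1.029, 0.0,   3.093),
    "Bak": (0.478, -0.007,  0.0,   0.063, 1.634),
    "Ovl": (0.805, -0.007,  0.512, 0.0,   1.594),
}

def composite(pesq, wss, llr, segsnr):
    """Sig/Bak/Ovl composite measures (five-point scales)."""
    raw = (pesq, wss, llr, segsnr)
    return {name: sum(w * v for w, v in zip(ws[:4], raw)) + ws[4]
            for name, ws in WEIGHTS.items()}

def relative_delta(filtered_score, noisy_score):
    """Eq. (14): Delta[%] = 100 (FS - NS) / NS."""
    return 100.0 * (filtered_score - noisy_score) / noisy_score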
5.4.1 Metro Noise

Results for Metro noise and several SNRs, in terms of the relative measures with respect to the noisy signal, are shown in Figure 10.

Fig. 10: From the top panel to the bottom panel: relative measures for the Objective Quality Measures. Five different methods (SS: Spectral subtraction, BW: Black and white mask, Gray: Gray-scale mask, iSE: Isotropic SE, aSE: Anisotropic SE). Metro Noise was used.

The method with gray-scale mask and anisotropic SE (Gray & aSE) provides the best performance for high SNRs in terms of Sig. The method (Gray & aSE-2) is only better at low SNRs. The largest margin with respect to the other 3 methods is obtained for SNR = -5 dB.

With respect to the Bak measure, the (Gray & aSE) method achieves the best performance for SNRs of -5 dB, 0 dB and 5 dB. However, the filtering with black and white mask and isotropic SE (BW & iSE) reaches the highest values of Bak for higher SNRs (10 dB and 15 dB). It is worth mentioning that this OQM factors in segmental SNR (see Table 1), which is known to be very sensitive to misalignments.

Best results for the Ovl measure are obtained for SNRs of 0 dB, 5 dB and 10 dB when using (Gray & aSE) filtering. For SNR = -5 dB, (Gray & aSE-2) is the best. For SNR = 15 dB, the (BW & iSE) method is slightly better.

In summary, for the Metro noise, the proposed methods (and in general, the use of anisotropic structural elements) provide the best performance for low and medium SNRs (-5 dB, 0 dB and 5 dB). For higher SNRs, where the speech signal may not need to be denoised, the filtering with black and white mask and isotropic SE presents a similar or slightly better performance than the other methods, in terms of Bak.

5.4.2 Airport Noise

Figure 11 shows results for airport noise and several SNRs in terms of relative Sig, Bak and Ovl measures.

Fig. 11: From the top panel to the bottom panel: relative measures for the Objective Quality Measures. Five different methods (SS: Spectral subtraction, BW: Black and white mask, Gray: Gray-scale mask, iSE: Isotropic SE, aSE: Anisotropic SE). Airport Noise was used.

First of all, it is worth mentioning that for low SNRs all the evaluated methods produce degradations in the quality of the processed signals. One possible explanation of this fact is the acoustic nature of the Airport environment, in which babble noise is present. Spectrograms of the babble noise show the typical energy distribution of speech, making the denoising of the speech signals so contaminated more difficult.

As can be observed, in terms of the Sig and Ovl measures, the (Gray & aSE) and (Gray & aSE-2) methods achieve the best performance, but the latter is more suitable for the low range of SNRs. For the Bak measure, the (Gray & aSE-2) and (Gray & aSE) methods provide the highest performance in low and medium SNRs (-5 dB, 0 dB, 5 dB and 10 dB), respectively (except for SNR = 15 dB, in which both SS and (BW & iSE) filtering perform better).

6 Application to Automatic Speech Recognition

In this section we present the evaluation of our method in an automatic speech recognition task. In particular, we evaluate the performance of different spectro-temporal features derived from the morphological filtering of auditory-motivated spectrograms. Two different structuring elements are considered: the anisotropic SE aSE-2 in Figure 7 and its weighted version, aSE-3, in Figure 8.

Fig. 12: Block diagram of the ASR system.

6.1 General Description

The block diagram of the ASR system used in the experimentation is depicted in Figure 12. Firstly, a conventional Spectral Subtraction is applied on the noisy signal in order to emphasize as much as possible the speech signal over the remaining noise. Next, an auditory filtering is performed over this (partially) denoised spectrogram. Two different types of auditory filterbanks are considered: a set of triangular mel-scaled filters and a set of Gammatone filters (see subsections 2.2.1 and 2.2.3). In the next step, a morphological filtering using anisotropic elements is applied. Finally, in order to decorrelate the filterbank log-energies obtained in the previous stage, a Discrete Cosine Transform (DCT) is computed over them, yielding Mel-Frequency Cepstrum Coefficient (MFCC) features in the case of the mel-scaled filterbank and Gammatone-based (GTC) features in the case of the Gammatone filterbank.

For each type of features we train and test different MLP/HMM hybrid speech recognizers following the ISOLET testbed, as described in subsection 6.3.

6.2 Feature Extraction

As mentioned before, two types of acoustic parameters are considered: MFCC and GTC features. In both cases, speech is analyzed using a frame length of 25 ms and a frame shift of 10 ms after pre-emphasis and Hamming windowing.

MFCCs and GTCs are computed from N-channel mel-scaled and gammatone filterbanks, respectively. After the DCT, we take the coefficients from C0 to C12 plus the corresponding delta (∆) and acceleration (∆∆) coefficients. Thus, both MFCC and GTC feature vectors are constituted by 39 components.

In both cases, we have performed experiments considering N = {40, 80, 128} channels in the corresponding filterbanks.
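Schematically, the last stages of this front-end (log filterbank energies, possibly morphologically filtered, followed by the decorrelating DCT and the dynamic coefficients) could be rendered as below; the slope-based deltas are a simplification of the usual regression formulas.

import numpy as np
from scipy.fftpack import dct

def cepstral_features(log_energies, n_ceps=13):
    """log_energies: (n_bands, n_frames) log filterbank energies (mel- or
    gammatone-based), possibly after morphological filtering. Returns
    (3 * n_ceps, n_frames): C0-C12 plus delta and acceleration = 39 dims."""
    ceps = dct(log_energies, type=2, norm="ortho", axis=0)[:n_ceps, :]
    delta = np.gradient(ceps, axis=1)       # simple slope-based deltas
    accel = np.gradient(delta, axis=1)
    return np.vstack([ceps, delta, accel])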

6.3 ISOLET Testbed

In this application, we use the ISOLET testbed [13] and database [6]. ISOLET is a database of letters of the English alphabet spoken in isolation. The database consists of 7800 spoken letters (two productions of each letter pronounced by 150 different speakers). Specifically, we use a version called Noisy-ISOLET: the speech signals of ISOLET plus 8 different noise types at different SNRs (clean, 0 dB, 5 dB, 10 dB, 15 dB and 20 dB). The noise types are: speech babble, factory floor noise 2, car interior noise (Volvo), pink noise, F-16 cockpit noise, destroyer operations room noise, Leopard military vehicle noise and factory floor noise 1.

The experiments using the ISOLET testbed are performed over a hybrid MLP/HMM ASR system, whose fundamentals are described in [5]. A context of 5 frames is used (each one of 39 components), so the input of each MLP has 195 elements.

The hybrid MLP/HMM system is tested in two different conditions: in the first one, the system is trained using clean speech (mismatched case), whereas in the second one, the training set is composed of a balanced combination of speech contaminated with the different noises of the database at several SNRs (matched case). A 5-fold cross-validation procedure has been employed to improve statistical significance.
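The 5-frame context window feeding the MLP (5 × 39 = 195 inputs) amounts to stacking each frame with its neighbours; a minimal sketch, assuming edge frames are replicated:

import numpy as np

def stack_context(features, context=5):
    """features: (39, n_frames). Returns (39 * context, n_frames) by
    concatenating each frame with its neighbours (context must be odd);
    edges are padded by replication."""
    half = context // 2
    padded = np.pad(features, ((0, 0), (half, half)), mode="edge")
    frames = [padded[:, i:i + features.shape[1]] for i in range(context)]
    return np.vstack(frames)   # 195 x n_frames for context = 5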
6.4 Experiments with Morphological Filtering

This set of experiments was performed in order to study the impact of the Spectral Subtraction (SS), the auditory filtering (either mel or Gammatone filterbank) and the Morphological Filtering (MF) on the performance of the whole system. 40 bands have been employed in the auditory filterbank, together with the aSE-2 anisotropic structuring element. Recognition results together with their corresponding 95% confidence intervals are shown in Table 2.

Table 2: Recognition results in terms of WER [%]. The number of bands in these experiments was fixed to 40 (MFCC: Mel-frequency cepstral coefficients, GTC: Gammatone-based coefficients, SS: Spectral Subtraction, MF: Morphological Filtering).

Features         Mismatched      Matched
MFCC             51.80 ± 1.24    16.45 ± 0.92
MFCC + SS        40.85 ± 1.22    16.95 ± 0.93
MFCC + MF        54.03 ± 1.24    17.03 ± 0.93
MFCC + SS + MF   37.03 ± 1.20    17.05 ± 0.93
GTC              53.78 ± 1.24    17.15 ± 0.94
GTC + SS         40.28 ± 1.22    16.95 ± 0.93
GTC + MF         56.95 ± 1.23    17.43 ± 0.94
GTC + SS + MF    38.50 ± 1.21    16.85 ± 0.93

As can be observed, similar conclusions can be drawn from the results with MFCC and GTC features. First, in the mismatched condition, whereas SS clearly outperforms the corresponding baselines, with statistically significant performance differences, MF alone produces a slight increment of the WERs. Nevertheless, the sequential use of both techniques (SS + MF) improves the performance of the system compared to the baseline and SS cases. In particular, the improvement achieved by SS + MF with respect to SS is statistically significant for the MFCC parameterization. For the matched condition, no significant improvements are achieved by using SS, MF or SS + MF, and therefore our method seems to be more suitable for the mismatched case. It is worth pointing out that the matched case is harder to improve than the mismatched one, because higher speech recognition rates are achieved without any processing.

With respect to the auditory filterbank considered, MFCC achieves better results than GTC in all cases. However, the performance differences are not statistically significant.

Figure 13 shows the Recognition Rates achieved by the different techniques indicated in Table 2 as a function of the noise type for the mismatched condition. It can be observed that our method (bars 4 and 8) outperforms SS (bars 2 and 6) in all noise types for the MFCC parameterization and in 5 of the 7 noise types for the GTC parameterization. In this latter case, our method is only slightly worse for the babble and factory2 noises.

6.5 Experiments with different number of bands and structuring elements

As MF is applied on the output of the auditory filterbank analysis (in contrast to the speech enhancement task, in which MF is applied directly on the speech spectrogram), we carried out a set of experiments in order to analyse the influence of the number of frequency bands on the performance of the whole system. Note that there is a relationship between the number of bands and the size of the structuring element. Besides, we compared two structuring elements: the flat aSE-2 and its nonflat version aSE-3.

Fig. 13: Speech Recognition Rate [%] vs. Noise Types. Mismatched Case.

Table 3 contains the corresponding WERs as well as the confidence intervals calculated for a confidence of 95%.

Table 3: Recognition results in terms of WER [%]. All the experiments use Spectral Subtraction and Morphological Filtering with different numbers of bands and the structuring elements aSE-2 and aSE-3.

                        Mismatched                       Matched
Features   N-bands   aSE-2           aSE-3           aSE-2           aSE-3
MFCC       40        37.03 ± 1.20    37.02 ± 1.20    17.05 ± 0.93    16.53 ± 0.92
           80        37.70 ± 1.20    35.73 ± 1.19    16.93 ± 0.93    16.50 ± 0.92
           128       36.35 ± 1.19    38.60 ± 1.21    17.68 ± 0.95    17.00 ± 0.93
GTC        40        38.50 ± 1.21    38.38 ± 1.21    16.85 ± 0.93    16.70 ± 0.93
           80        38.13 ± 1.21    37.93 ± 1.20    17.53 ± 0.94    16.90 ± 0.93
           128       37.98 ± 1.20    37.40 ± 1.20    17.48 ± 0.94    16.90 ± 0.93

For the mismatched condition, the behaviours of MFCC and GTC differ with respect to the number of bands. In GTC, the results slightly improve as the number of bands increases for both structuring elements, although the differences are rather small. However, in MFCC, the best results are achieved with 128 bands and 80 bands, for aSE-2 and aSE-3, respectively. This observation suggests that MFCC are more sensitive to the size and shape of the structuring element.

On the other hand, for the matched condition, we obtain better recognition rates in all the cases with the nonflat SE. However, this condition seems to be sensitive to the number of bands. This result suggests the convenience of exploring new HAS-motivated nonflat SEs taking into account the non-uniform masking effect explained in subsection 2.3.

7 Conclusions and future work

In this paper we have explored an alternative to the morphological filtering for speech enhancement and noise compensation proposed in [9], introducing elements of the HAS. In particular, we have proposed the use of morphological filtering with auditory-inspired anisotropic structural elements applied over gray-scale spectrograms. These ideas have been further extended to feature extraction for ASR.

On the one hand, for speech enhancement, results demonstrate that the proposed methods (using the anisotropic elements aSE and aSE-2) provide a better performance than the other alternatives for SNRs of -5 dB, 0 dB and 5 dB, a very important range of SNRs for speech enhancement. Besides, the proposed methods seem to be more suitable for non-stationary noise. However, subjective measures of the different alternatives could also shed more light on the evaluation procedure, given that the objective estimates that we have employed in this paper have several limitations.

On the other hand, for automatic speech recognition, the combination of spectral subtraction and morphological filtering improves the recognition rates, regardless of the use of mel or Gammatone filterbanks. Moreover, we think that there is more room for improvement in the morphological filtering stage. In particular, the incorporation of an automatic gain control (AGC) [30, 31] into the Gammatone-filterbank features should be explored for ASR tasks. Other future lines of work include the use of other types of auditory filters, like those proposed in [23, 24], and alternative shapes for the anisotropic structural elements, trying to more precisely emulate the masking effects (in time and frequency) of the human ear. Experimentation on real noisy signals, instead of the artificially distorted ones employed in this paper, is also desirable.

Acknowledgements This work has been partially supported by the Spanish Ministry of Science and Innovation CICYT Project No. TEC2008-06382/TEC.

References

1. Baker, J.: The Dragon system - an overview. IEEE Trans. on Acoustics, Speech, and Signal Processing 23(1), 24-29 (1975)
2. Beerends, J., Hekstra, A., Rix, A., Hollier, M.: Perceptual evaluation of speech quality (PESQ), the new ITU standard for end-to-end speech quality assessment. Part II: Psychoacoustic model. Journal of the Audio Engineering Society 50(10), 765-778 (2002)

3. Berouti, M., Schwartz, R., Makhoul, J.: Enhancement of speech corrupted by acoustic noise. In: Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP'79, vol. 4, pp. 208-211. IEEE (1979)
4. ten Bosch, L., Kirchhoff, K.: Editorial note: Bridging the gap between human and automatic speech recognition. Speech Communication 49(5), 331-335 (2007)
5. Bourlard, H., Morgan, N.: Hybrid HMM/ANN systems for speech recognition: Overview and new research directions. Adaptive Processing of Sequences and Data Structures, pp. 389-417 (1998)
6. Cole, R., Muthusamy, Y., Fanty, M.: The ISOLET spoken letter database (2011). URL http://cslu.cse.ogi.edu/corpora/isolet
7. Davis, S., Mermelstein, P.: Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. Acoustics, Speech and Signal Processing, IEEE Transactions on 28(4), 357-366 (1980)
8. Dougherty, E.R., Lotufo, R.A.: Hands-on Morphological Image Processing, Tutorial Texts in Optical Engineering, vol. TT59. SPIE Press (2003)
9. Evans, N., Mason, J., Roach, M., et al.: Noise compensation using spectrogram morphological filtering. Proc. 4th IASTED International Conf. on Signal and Image Processing, pp. 157-161 (2002)
10. Fastl, H., Zwicker, E.: Psycho-acoustics: Facts and Models, 3rd edn. Springer (2007)
11. Florentine, M., Fastl, H., Buus, S.: Temporal integration in normal hearing, cochlear impairment, and impairment simulated by masking. The Journal of the Acoustical Society of America 84(1), 195-203 (1988)
12. Flynn, R., Jones, E.: Combined speech enhancement and auditory modelling for robust distributed speech recognition. Speech Communication 50(10), 797-809 (2008)
13. Gelbart, D., W., H., Holmberg, M., Morgan, N.: Noisy ISOLET and ISOLET testbeds (2011). URL http://www.icsi.berkeley.edu/Speech/papers/eurospeech05-onset/isolet/
14. Glasberg, B., Moore, B.: Derivation of auditory filter shapes from notched-noise data. Hearing Research 47(1-2), 103-138 (1990)
15. Gonzalez, R., Woods, R.: Digital Image Processing. Addison-Wesley (1993)
16. Greenberg, S.: The integration of phonetic knowledge in speech technology, Text, Speech and Language Technology, vol. 25, chap. From here to utility, pp. 107-132. Springer (2005)
17. Gunawan, T.S., Ambikairajah, E., Epps, J.: Perceptual speech enhancement exploiting temporal masking properties of human auditory system. Speech Communication 52, 381-393 (2010)
18. Hansen, J., Pellom, B.: An effective quality evaluation protocol for speech enhancement algorithms. In: ICSLP, Sydney, Australia, pp. 2819-2822. Citeseer (1998)
19. Heckmann, M., Domont, X., Joublin, F., Goerick, C.: A hierarchical framework for spectro-temporal feature extraction. Speech Communication 53, 736-752 (2010)
20. Hirsch, H., Pearce, D.: The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions. In: ASR2000 - Automatic Speech Recognition: Challenges for the new Millenium, ISCA Tutorial and Research Workshop (ITRW) (2000)
21. Hu, Y., Loizou, P.: Evaluation of objective measures for speech enhancement. In: Proc. Interspeech, pp. 1447-1450 (2006)
22. Hu, Y., Loizou, P.: Evaluation of objective quality measures for speech enhancement. Audio, Speech, and Language Processing, IEEE Transactions on 16(1), 229-238 (2008)
23. Irino, T., Patterson, R.: A time-domain, level-dependent auditory filter: The gammachirp. Journal of the Acoustical Society of America 101(1), 412-419 (1997)
24. Irino, T., Patterson, R.: A dynamic compressive gammachirp auditory filterbank. Audio, Speech, and Language Processing, IEEE Transactions on 14(6), 2222-2232 (2006)
25. Jelinek, F., Bahl, L., Mercer, R.: Design of a linguistic statistical decoder for the recognition of continuous speech. IEEE Trans. on Information Theory 21(3), 250-256 (1975)
26. Jesteadt, W., Bacon, S.P., Lehman, J.R.: Forward masking as a function of frequency, masker level, and signal delay. The Journal of the Acoustical Society of America 71(4), 950-962 (1982)
27. Klatt, D.: Prediction of perceived phonetic distance from critical-band spectra: a first step. In: Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP'82, vol. 7, pp. 1278-1281. IEEE (1982)
28. Loizou, P.: Speech enhancement: Theory and practice. CRC Press (2007)
29. Loizou, P.: Matlab software (2011). URL http://www.utdallas.edu/~loizou/speech/software.htm
30. Meddis, R.: Simulation of mechanical to neural transduction in the auditory receptor. J. Acoust. Soc. Am. 79(3), 702-711 (1986)
31. Meddis, R.: Simulation of auditory-neural transduction: Further studies. J. Acoust. Soc. Am. 83(3), 1056-1063 (1988)
32. Meyer, B., Kollmeier, B.: Robustness of spectro-temporal features against intrinsic and extrinsic variations in automatic speech recognition. Speech Communication 53, 753-767 (2010)
33. Moore, B., Glasberg, B.: Suggested formulae for calculating auditory-filter bandwidths and excitation patterns. The Journal of the Acoustical Society of America 74, 750 (1983)
34. Moore, B., Glasberg, B.: A revised model of loudness perception applied to cochlear hearing loss. Hearing Research 188(1-2), 70-88 (2004)
35. Patterson, R., Robinson, K., Holdsworth, J., McKeown, D., Zhang, C., Allerhand, M.: Complex sounds and auditory images. Auditory Physiology and Perception 83, 429-446 (1992)
36. Peláez-Moreno, C., García-Moral, A., Valverde-Albacete, F.: Analyzing phonetic confusions using formal concept analysis. The Journal of the Acoustical Society of America 128(3), 1377-1390 (2010)
37. Quackenbush, S., Barnwell, T., Clements, M.: Objective measures of speech quality. Prentice Hall, Englewood Cliffs, NJ (1988)
38. Rix, A., Hollier, M., Hekstra, A., Beerends, J.: Perceptual evaluation of speech quality (PESQ), the new ITU standard for end-to-end speech quality assessment. Part I: Time-delay compensation. Journal of the Audio Engineering Society 50(10), 755-764 (2002)
39. Serra, J., Soille, P. (eds.): Mathematical Morphology and its Application to Image Processing. Computational Imaging and Vision. Kluwer Academic (1994)
40. Stevens, S.S., Volkmann, J., Newman, E.B.: A scale for the measurement of the psychological magnitude of pitch. J. Acoust. Soc. Am. 8, 185-190 (1937)
41. Yin, H., Hohmann, V., Nadeu, C.: Acoustic features for speech recognition based on gammatone filterbank and instantaneous frequency. Speech Communication 53, 707-715 (2010)
42. Zwicker, E., Feldtkeller, R.: The ear as a communication receiver. Acoustical Society of America (1999)
43. Zwicker, E., Jaroszewski, A.: Inverse frequency dependence of simultaneous tone-on-tone masking patterns at low levels. The Journal of the Acoustical Society of America 71(6), 1508-1512 (1982)
44. Zwicker, E., Terhardt, E.: Analytical expressions for critical-band rate and critical bandwidth as a function of frequency. The Journal of the Acoustical Society of America 68, 1523 (1980)
45. Zwicker, E., Zwicker, U.: Audio engineering and psychoacoustics: Matching signals to the final receiver, the human auditory system. J. Audio Eng. Soc. 39(3), 115-126 (1991)