Recent Advances in Electrical Engineering

Comparison of Frequency-Warped Filter Banks in relation to Robust Features for Speaker Identification

SHARADA V CHOUGULE, Electronics and Telecomm. Engg. Dept., Finolex Academy of Management and Technology, Ratnagiri, Maharashtra, India, [email protected]
MAHESH S CHAVAN, Electronics Engg. Department, KIT’s College of Engineering, Kolhapur, Maharashtra, India, [email protected]

Abstract: - The use of psycho-acoustically motivated warping, such as mel-scale warping, is common in speaker recognition tasks; it was first applied to speech recognition. Mel-warped cepstral coefficients (MFCCs) are used in state-of-the-art speaker recognition systems as a standard acoustic feature set. Alternative frequency warping techniques such as the Bark and ERB rate scales can have performance comparable to mel-scale warping. In this paper, the performance of acoustic features generated using filter banks with Bark and ERB rate warping is investigated in relation to robust features for speaker identification. For this purpose, a sensor-mismatched database is used for closed-set text-dependent and text-independent cases. As MFCCs are very sensitive to mismatched conditions (any type of mismatch between the data used for training and evaluation), spectral subtraction is performed on the mismatched speech data to reduce additive noise. Normalization of feature vectors is also carried out over each frame to compensate for channel mismatch. Experimental analysis shows that the percentage identification rate for the text-dependent case using mel, Bark and ERB warped filter banks is comparable in mismatched conditions. However, for text-independent speaker identification, ERB rate warped filter bank features show improved performance over mel and Bark warped features for the same sensor-mismatched condition. It is also observed that, without any compensation (spectral subtraction or cepstral normalization), Bark warped filter bank features show somewhat superior speaker identification results for both the text-dependent and text-independent cases.

Key-Words: - Text-dependent/Text-independent speaker identification, MFCCs, Spectral Subtraction, Mel, Bark, ERB scale

1 Introduction

In recent years, human speech has become a popular form of personal identification for various real-world applications, such as banking over the telephone, secure access control systems and forensic applications. Recognizing a person using his or her voice (speech in terms of the utterance of a word, phrase or sentence) is mainly categorized by the type of end decision required. If the end decision is to accept or reject the identity claim of a person, the task is called speaker verification. If the end decision is to identify the best-matching person from a database of known speakers, the task is called speaker identification. In closed-set speaker identification, the unknown speaker belongs to the database of N known speakers, whereas in open-set speaker identification the input speaker may be outside the database of N known speakers. The four basic stages of a speaker recognition system are: i) feature extraction, ii) pattern formation in terms of a speaker model, iii) pattern matching and iv) decision making. Each of these stages is equally important for a robust system. (Robustness here refers to mismatched conditions between training and testing of the system for the speaker identification task.)

In this paper, the robustness of the speaker identification system is investigated using cepstral features warped with three different scales: Mel, Bark and ERB. In this experimental work, speech data with a mismatch in sensors (microphones) is used to test the robustness of text-dependent and text-independent systems.

The paper is organized as follows: Section II discusses feature extraction and recent research in relation to robust features. In Section III,

ISBN: 978-1-61804-260-6 157 Recent Advances in Electrical Engineering

a description of the formation of filter banks with mel, Bark and ERB scale warping is given. Results of the experiments are discussed in Section IV. Conclusions based on the analysis of results are given in Section V.

2 Warped Filter Banks

The speech signal contains a large amount of raw data (such as pauses between utterances or undesired distortions intercepting the speech) which does not actually carry any useful information. Feature extraction (with preprocessing of the speech signal) is the first stage in a speaker recognition system. The main purpose of feature extraction is to discard the redundant raw speech data and extract features which convey characteristics of the speaker. The result is a compact and more suitable representation of the raw speech data. In contrast to the speech recognition task, the extracted features should carry speaker-specific information in the form of acoustic vectors. For better accuracy of the system (in terms of identification or verification), the extracted features should be robust against undesired distortions such as noise or any type of mismatch between the data used for training and testing the system.

Psycho-acoustic studies have shown that the basilar membrane, located at the front end of the human auditory system, can be modeled as a bank of overlapping band pass filters, each tuned to a specific frequency (the characteristic frequency) and with a bandwidth that increases roughly logarithmically with increasing characteristic frequency. The bandwidths of these filters are known as critical bands and are similar in nature to the physiologically-based filters. Nerves at one point of the cochlea (inner ear) respond maximally at one particular frequency and less at nearby frequencies. Therefore the shape of the band pass filters is generally triangular.

Given the roughly logarithmically increasing width of the critical band filters, about 20 to 24 critical band filters can cover the maximum frequency range of 15000 Hz for human perception [5]. A means of mapping linear frequency to this perceptual representation is warping using either the mel scale, the Bark scale or the ERB (equal rectangular bandwidth) scale. All three scales are based on the human perception mechanism discussed above.

MFCCs (Mel Frequency Cepstral Coefficients) are the most widely used features in state-of-the-art speaker recognition systems [1]. MFCCs can be considered as filter bank processing adapted to speech specificities. The cepstral coefficients representing a feature vector are derived through the following steps:

i) Segmentation of the speech signal into frames of 20 to 30 msec (this frame duration is generally selected to extract the vocal tract features of a person).
ii) Window analysis of the framed speech, with a frame overlap of 10 msec.
iii) Spectrum representation and analysis using the FFT (magnitude only).
iv) Formation of a filter bank according to an auditory scale (mel-scale warping for MFCCs) and computation of the log energy of each filter. The logarithm is used because the ear works in decibels.
v) Use of the DCT to compress the information and decorrelate the source and filter information (related to the source-filter model of human speech production).

2.1 Frequency Scales

Frequency scales describe how the physical frequency of an incoming signal is related to the representation of that frequency by the human auditory system. Different psycho-acoustical techniques provide somewhat different estimates of the bandwidth of the band pass filters. The Mel scale [2], Bark scale [3] and ERB (Equal Rectangular Bandwidth) scale [3] are some of the widely used frequency scales based on the frequency-domain masking principle.

To analyze the performance of segmental features using perceptually motivated warping, filter banks are designed using the three types of non-linear frequency scales given below.

2.1.1 Mel-Scale

Mel-scale frequency is related to linear frequency by the empirical equation (1), which is approximately linear up to 1000 Hz and logarithmic above 1000 Hz. The mel scale, proposed by Stevens et al. [4], describes how a listener judges the distance between pitches. The reference point is obtained by defining a 1000 Hz tone, 40 dB above the listener’s threshold, to be 1000 mels.

$$f_{mel} = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \qquad (1)$$

2.1.2 Bark-Scale

Another means of mapping linear frequency to the perceptual representation is the Bark scale. In this mapping one Bark covers one critical band, with the functional relationship of the frequency f (in Hz) to the Bark value z given by [5] as:

$$z(f) = 13 \tan^{-1}\left(\frac{0.76 f}{1000}\right) + 3.5 \tan^{-1}\left(\left(\frac{f}{7500}\right)^{2}\right) \qquad (2)$$
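The warping functions of equations (1) and (2), together with the ERB-rate approximation given as equation (3) in Section 2.1.3, can be evaluated numerically as in the following minimal Python sketch. The function names are illustrative and not taken from the paper; the Bark formula assumes f in Hz, the logarithm in the ERB-rate formula is assumed to be the natural logarithm, and normalizing each scale by its largest value is an assumption about how a normalized comparison might be produced.

```python
import math


def hz_to_mel(f):
    """Equation (1): mel value for a frequency f in Hz."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def hz_to_bark(f):
    """Equation (2): Bark value for a frequency f in Hz."""
    return 13.0 * math.atan(0.76 * f / 1000.0) + 3.5 * math.atan((f / 7500.0) ** 2)


def hz_to_erb_rate(f):
    """Equation (3): ERB-rate value for f in Hz (natural log is an assumption)."""
    return 11.17 * math.log(1.0 + 46.065 * f / (f + 14678.49))


def normalized_scale(warp, freqs):
    """Warp each frequency and divide by the largest warped value (assumed normalization)."""
    values = [warp(f) for f in freqs]
    peak = max(values)
    return [v / peak for v in values]


if __name__ == "__main__":
    freqs = [100, 500, 1000, 2000, 4000, 8000]
    for name, warp in [("mel", hz_to_mel), ("bark", hz_to_bark), ("erb", hz_to_erb_rate)]:
        print(name, [round(v, 3) for v in normalized_scale(warp, freqs)])
```

All three functions increase monotonically with frequency, so each can serve as the warped axis on which filter centre frequencies are placed.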

2.1.3 ERB-Rate-Scale

The ERB scale is a measure that approximates the bandwidth of the filters in human hearing using rectangular band pass filters; several different approximations of the ERB scale exist [6]. The following is one such approximation relating the ERB and the frequency f:

$$\mathrm{ERB}(f) = 11.17 \log\left(1 + \frac{46.065 f}{f + 14678.49}\right) \qquad (3)$$

Figure (1) compares these three frequency scales in the range 100 Hz to 8000 Hz.

[Figure: Warped Frequency Scales; Mel, Bark and ERB curves; x-axis: Frequency (Hz), y-axis: Perceptual Scales (Normalized)]
Fig.1 Normalized Perceptual Scales

The filter banks designed using the above scales for a sampling rate of 16000 Hz are shown below:

[Figure panels: (a) Mel-Scale, (b) Bark Scale, (c) ERB Rate Scale; x-axis: Frequency (Hz), y-axis: Normalized Weights]
Fig.2 Warped Filter Banks for various Frequency Scales

3 Performance under Matched and Mismatched conditions

Various undesired distortions, such as background noise and channel noise, degrade the quality of speech data recorded with a distant microphone. Experimental evaluation of speaker recognition in noisy and multi-speaker environments is carried out in [11], using an LP (linear prediction) based combined spectral and temporal processing approach. A detailed study and overview of feature extraction methods in real-world conditions is given in [12]. The study in [13] proposes various alternative mel-frequency warped feature representations in the presence of interfering noise.

A closed-set speaker identification system is developed in which cepstral features are derived as discussed in Section II. To form the model of each speaker, the Linde-Buzo-Gray (LBG) algorithm [8] is used with a codebook size of 64. Thus a 39x64 dimensional codebook is formed for each speaker.
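The codebook-based speaker model described above can be sketched as follows. This is a minimal pure-Python illustration of LBG training by binary splitting and of closed-set identification by minimum average distortion. The function names, the additive splitting perturbation, the fixed number of refinement iterations, the squared Euclidean distance and the synthetic data helper are all assumptions made for illustration, not details taken from the paper.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(codebook, vec):
    """Index of the codeword closest to vec."""
    return min(range(len(codebook)), key=lambda i: squared_distance(codebook[i], vec))


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def lbg_codebook(vectors, size=64, eps=0.01, n_iter=20):
    """Train a codebook by LBG binary splitting; size is assumed to be a power of two."""
    codebook = [centroid(vectors)]
    while len(codebook) < size:
        # Split every codeword into two slightly shifted copies.
        codebook = [[c + s for c in cw] for cw in codebook for s in (eps, -eps)]
        for _ in range(n_iter):
            clusters = [[] for _ in codebook]
            for v in vectors:
                clusters[nearest(codebook, v)].append(v)
            # Keep the old codeword when a cluster receives no vectors.
            codebook = [centroid(cl) if cl else cw for cl, cw in zip(clusters, codebook)]
    return codebook


def average_distortion(codebook, vectors):
    """Mean distance from each vector to its nearest codeword."""
    return sum(squared_distance(codebook[nearest(codebook, v)], v) for v in vectors) / len(vectors)


def identify(codebooks, test_vectors):
    """Return the speaker label whose codebook gives the lowest average distortion."""
    return min(codebooks, key=lambda spk: average_distortion(codebooks[spk], test_vectors))


def synthetic_cloud(cx, cy, n=200):
    """Hypothetical 2-D stand-in for a speaker's feature vectors."""
    return [[random.gauss(cx, 1.0), random.gauss(cy, 1.0)] for _ in range(n)]
```

With 39-dimensional feature vectors and size=64, each speaker's codebook would hold 64 codewords of dimension 39, which corresponds to the 39x64 codebook described above.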

For performance evaluation, a multi-speaker, continuous (Hindi) speech database is used, which was generated by TIFR, Mumbai (India) [9] and made available by the Department of Information Technology, Government of India. The set consists of 100 speakers (TIFR India Hindi Database), each speaking 10 different Hindi sentences (of 6 to 8 millisecond), out of which two sentences are the same for all speakers. Initially the performance of the system is evaluated for matched conditions (the speech data used for training and testing the system is recorded with the same close-talking directional microphone).

The results, in terms of percentage identification rate, are given for the text-dependent and text-independent cases respectively.


[Figure panels: (a), (b)]
Fig.3 Performance of Speaker Identification System for various warping scales for (a) Text-Dependent Case (b) Text-Independent Case under Matched Conditions

From the above graphs, it is observed that there is a negligibly small difference in percentage identification rate between text-dependent and text-independent speaker identification under matched conditions.

[Figure panels: (a), (b)]
Fig.4 Performance of Speaker Identification System for various warping scales for (a) Text-Dependent Case (b) Text-Independent Case under Mismatched Conditions

From the plots in figure (4), it is observed that, under mismatched conditions, the percentage identification rate decreases for both the text-dependent and the text-independent case. The drop in identification rate is larger in the text-independent case than in the text-dependent case.

For the text-dependent case under mismatched conditions, the decrease in identification rate is due to the change of transducer between training and testing (training with a close-talking directional microphone and testing with a desk-mounted omni-directional microphone). As the phonetic content is the same for training and testing, the only reasons for the decrease in identification rate are channel mismatch and possibly some noise picked up during the testing phase (due to the desk-mounted microphone).

In the text-independent case, the identification rate drops further. Here the additional reason may be the difference in phonetic content between the training and testing sets.

4 Compensation against Mismatch Condition

Cepstral Mean Normalization is often used during the feature extraction phase of speaker recognition/verification systems to compensate for convolutional channel distortion of voice signals. Convolutional channel distortion can be caused by different microphones used in the training and testing phases, or by different transmission channels used during training and testing. While Cepstral Mean Normalization is relatively effective in removing convolutional distortion, it is not able to compensate for additive channel distortion. Therefore, it cannot remove an additive channel distortion such as white Gaussian noise, which occurs commonly in transmission channels [5],[6].

The magnitude or power estimate obtained with the STFT is susceptible to various types of additive noise (such as background noise). To compensate for additive noise and to restore the magnitude or power spectrum of the speech signal, spectral subtraction is used. The magnitude of the spectrum over a short duration (equal to the frame length) is obtained by discarding the phase information. The spectrum of the noise is then subtracted from the noisy speech spectrum, hence the name spectral subtraction. For this, the noise spectrum is estimated and updated over the periods when the signal is absent and only noise is present [7],[10]. The speech signal is thus enhanced by eliminating noise.

[Figure panels: (a), (b)]
Fig.5 Performance of Speaker Identification System for various warping scales, under Mismatched Conditions using compensation: (a) Text-Dependent Case (b) Text-Independent Case

From the experimental observations shown in figure (5), the percentage identification rate is increased considerably by the inclusion of spectral subtraction before passing the segmented speech through the warped filter banks. It is also observed that normalization of the cepstral features gives negligible improvement in correct identification.

5 Conclusion

In this paper, the performance of features based on three frequency-warped filter banks is studied for text-dependent and text-independent cases under matched and mismatched conditions.

For matched conditions it was observed that performance (percentage identification rate) is almost identical for the text-dependent and text-independent cases. Also, all three warped filter bank features give similar performance for codebook sizes above 32.

Under mismatched conditions, the identification rate for the text-independent case is much lower than for the text-dependent case. It is also observed that Bark warped filter bank features give somewhat improved performance compared to mel and ERB warped filter banks, for both the text-dependent and text-independent cases.

Using spectral subtraction to compensate for additive noise and normalizing the warped filter bank features to compensate for the mismatch between sensors, the performance of the speaker identification system is improved. It is observed that normalizing the filter bank features gives only a small improvement in percentage identification. Spectral subtraction estimates the noise interference due to sensor mismatch and eliminates that noise by subtracting the noise spectrum from the spectrum of the noisy speech. Surprisingly, it is observed that ERB rate warped features show superior performance for the text-independent case.

Overall, it is concluded that ERB warped feature vectors are more sensitive to mismatched conditions than mel and Bark scale features. Also, Bark-scale warped filter bank features give performance comparable to that of the mel scale without any compensation for mismatch. ERB-rate warped features give improved performance using spectral subtraction.

References:

[1] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences", IEEE Trans. Acoustics, Speech, and Signal Processing, Vol. 28, no. 4, pp. 357–366, August 1980.
[2] B. C. J. Moore and B. R. Glasberg, "A revision of Zwicker’s loudness model", Acustica – Acta Acustica, Vol. 82, pp. 335–345, 1996.
[3] D. O’Shaughnessy, Speech Communication: Human and Machine, Addison-Wesley, 1987.
[4] J. Volkmann, S. S. Stevens, and E. B. Newman, "A scale for the measurement of the psychological magnitude pitch (A)", J. Acoust. Soc. Am., vol. 8, no. 3, pp. 208–208, 1987.
[5] Thomas F. Quatieri, "Discrete Time Processing of Speech Signals - Principles and Practice", Pearson Education, 1997.
[6] Gish H., Karnofski K., Krasner M., Rouscos S., Schwartz R., Wolf J., "Investigation of Text-Independent Speaker Identification over Telephone Channels", in Proceedings of IEEE International Conference of Acoustics, Speech and Signal Processing, ICASSP 1985, Volume 10, pp. 379–382, Apr 1985.
[7] Saeed V. Vaseghi, "Advanced Digital Signal Processing and Noise Reduction", Second Edition, John Wiley & Sons Ltd, 2000.
[8] Y. Linde, A. Buzo, and R. M. Gray, "An algorithm for vector quantizer design", IEEE Trans. Commun., vol. 28, no. 1, pp. 84–95, 1980.
[9] Samudravijaya K, P. V. S. Rao, S. S. Agrawal, "Hindi Speech Database", Proceedings of International Conference on Spoken Language Processing, 2000, China.
[10] X. Menendez-Pidal, R. Chan, D. Wu, and M. Tanaka, "Compensation of channel and noise distortions combining normalization and speech enhancement techniques", Speech Communication, vol. 34, pp. 115–126, 2001.
[11] P. Krishnamoorthy and S. R. Mahadeva Prasanna, "Application of combined spectral and temporal processing methods for speaker recognition under noisy, reverberant or multi-speaker environment", Sadhana, Indian Academy of Sciences, Vol. 34, Part 5, October 2009, pp. 729–754.
[12] Tomi Kinnunen, Haizhou Li, "An overview of text independent speaker recognition from features to supervectors", Speech Communication, volume 52, pp. 12–40, 2010.
[13] T. Kinnunen, Md. J. Alam, P. Matejka, P. Kenny, J. Cernocky, D. O'Shaughnessy, "Frequency Warping and Robust Speaker Verification: A Comparison of Alternative Mel-Scale Representations", Proc. Interspeech 2013, pp. 3122–3126, Lyon, France, August 2013.
