Comparison of Frequency-Warped Filter Banks in Relation to Robust Features for Speaker Identification
Total Page:16
File Type:pdf, Size:1020Kb
Recent Advances in Electrical Engineering Comparison of Frequency-Warped Filter Banks in relation to Robust Features for Speaker Identification SHARADA V CHOUGULE MAHESH S CHAVAN Electronics and Telecomm. Engg.Dept. Electronics Engg. Department, Finolex Academy of Management and KIT’s College of Engineering, Technology, Ratnagiri, Maharashtra, India Kolhapur,Maharashtra,India [email protected] [email protected] Abstract: - Use of psycho-acoustically motivated warping such as mel-scale warping in common in speaker recognition task, which was first applied for speech recognition. The mel-warped cepstral coefficients (MFCCs) have been used in state-of-art speaker recognition system as a standard acoustic feature set. Alternate frequency warping techniques such as Bark and ERB rate scale can have comparable performance to mel-scale warping. In this paper the performance acoustic features generated using filter banks with Bark and ERB rate warping is investigated in relation to robust features for speaker identification. For this purpose, a sensor mismatched database is used for closed set text-dependent and text-independent cases. As MFCCs are much sensitive to mismatched conditions (any type of mismatch of data used for training evaluation purpose) , in order to reduce the additive noise, spectral subtraction is performed on mismatched speech data. Also normalization of feature vectors is carried out over each frame, to compensate for channel mismatch. Experimental analysis shows that, percentage identification rate for text-dependent case using mel, bark and ERB warped filter banks is comparably same in mismatched conditions. However, in case of text-independent speaker identification, ERB rate warped filter bank features shows improved performance than mel and bark warped features for the same sensor mismatched condition. Also it is observed that, without any compensation (spectral subtraction or cepstral normalization) bark warped filter bank features, shows somewhat superior speaker identification results for both text-dependent and text-independent case. Key-Words: - Text-dependant/Text-independent speaker identification, MFCCS, Spectral Subtraction, Mel, Bark , ERB scale 1 Introduction four basic stages of a speaker recognition system In recent years, human speech is becoming a most are: i) Feature extraction ii) Pattern formation in popular form of person identity for various real terms of speaker model iii) Pattern Matching and iv) world applications, such as banking over telephone, Decision making. Each of these stages are equally secure access control systems and also in forensic important for a robust system. (Robustness here is in applications. Recognizing a person using his or her relation to mismatched conditions between training voice (speech-in terms of utterance of a word, and testing of the system for speaker identification phrase or sentence) is mainly categorized based on task). type of end decision required. If the end decision is In this paper, the robustness of the speaker in terms of accepting or rejecting the identity claim identification system is investigated on cepstral of a person, then the task is called speaker features, warped using three different warped scales verification. And if the end decision is to identify a Mel, Bark and ERB respectively. In this person of best match amongst a set of known speech experimental work, speech data with mismatch in database, the result is in terms of speaker sensors (microphones) is used to test the robustness identification. In case of closed-set speaker for text-dependent and text-independent systems identification, the unknown speaker is from the set respectively. of N known speaker’s database, whereas open-set The paper is organized as follows: Section II carries speaker identification refers to an input speaker the discussion of feature extraction and recent which is outside the N known speaker database. The research in relation to robust features. In Section III, ISBN: 978-1-61804-260-6 157 Recent Advances in Electrical Engineering description of the formation of filter banks with mel, i) Segmentation of speech signal in frames of 20 to bark and ERB scale warping is given. Results of the 30 msec (this frame duration is generally selected to experiments are discussed in Section IV. Conclusion extract the vocal tract features of a person). based on analysis of results is given in Section V. ii) Window analysis of framed speech, with frame overlap of 10 msec. 2 Warped Filter Banks iii) Spectrum representation and analysis using FFT The speech signal consists of a large amount of raw (Magnitude only). data (such as pause between utterances or undesired iv) Formation of filter bank according to auditory distortions intercepting the speech) which is actually scales and finding energy of each filter on mel-scale does not carry any useful information. Feature warping. We use log because our ear works in extraction (with preprocessing of speech signal), is decibels. the first stage in speaker recognition system. The v) Use DCT for compressing the information and main purpose of feature extraction is to eliminate decorrelating the source and filter information the raw speech data and extract the features which (related to source-filter mode of human speech convey some characteristics of the speaker. It is a production). compact and more suitable representation of raw speech data. In contradictory to speech recognition 2.1 Frequency Scales task, the extracted features should carry speaker Frequency scales describe how the physical specific information in the form of acoustic vectors. frequency of an incoming signal is related to the For better accuracy of the system (in terms of representation of that frequency by the human identification or verification), the extracted features auditory system. Different psycho-acoustical should be robust against undesired distortions such techniques provides somewhat different estimates of as noise or any type of mismatch between data used the bandwidth of band pass filters. Mel-scale for training and testing the system. [2],Bark-scale[3], ERB(Equal Rectangular Psycho-acoustic studies proved that the basilar Bandwidth)-scale[3] are some of the widely used membrane, located at the front end of the human frequency scales based on frequency domain auditory system, can be modeled as a bank of masking principle. overlapping band pass filters, each tuned to a To analyze the performance of segmental features specific frequency (the characteristic frequency) and using perceptually motivated warping, filter banks with a bandwidth that increase roughly are designed using three types on non-linear logarithmically with increasing characteristic frequency scales as given below. frequency. The bandwidth of these filters are known as critical bands of hearing and are similar in the 2.1.1 Mel-Scale nature to the physiologically-based filters. Nerves at one point of cochlea (inner ear), responds maximum Mel-scale frequency is related to linear frequency at one particular frequency and less for nearby by empirical equation in (1), which is linear upto frequencies. Therefore shape of band pass filters is 1000 Hz and logarithmic above 1000 Hz. The MEL generally triangular. scale that was proposed by Stevens et al. [4] Given the roughly logarithmically increasing width describes how a listener judges the distance between of the critical band filters, about 20 to 24 critical pitches. The reference point is obtained by defining band filters can cover the maximum frequency a 1000 Hz tone 40 dB above the listener’s threshold range of 15000 Hz for human perception[5]. A to be 1000 mels. means of mapping linear frequency to this f fmel 2595*log10(1 ) perceptual representation is through warping using 700 either mel-scale, bark scale or ERB (equal (1) rectangular bandwidth) scale. All three scale are 2.1.2 Bark-Scale based on human perception mechanism discussed above. Another means of mapping linear frequency to the MFCCs (Mel Frequency Cepstral Coefficients) is perceptual representation is through bark scale. In the most popularly used feature extraction technique this mapping one bark covers one critical band with in state-of-art speaker recognition systems [1]. the functional relationship of the frequency f to the MFCCs can be considered as filter bank processing bark z given by [5] as: adapted to speech specificities. The cepstral coefficients representing a feature vector are derived through following steps: ISBN: 978-1-61804-260-6 158 Recent Advances in Electrical Engineering ERB Rate-Scale Filter Bank 1 0.9 f 0.8 zf13*tan11 (0.76* ) 3.5*tan ( ) 0.7 7500 0.6 (2) 0.5 Normalized Weights Normalized 0.4 0.3 2.1.3 ERB-Rate-Scale 0.2 0.1 0 0 1000 2000 3000 4000 5000 6000 7000 8000 The ERB scale is a measure that gives an Frequency (Hz) approximation to the bandwidth of filters in human (c) ERB Rate Scale hearing using rectangular band pass filters; several Fig.2 Warped Filter Banks for various Frequency different approximations of the ERB scale exist [6]. Scales The following is one of such approximations relating the ERB and the frequency f: 3 Performance under Matched and 46.065* f ERB( f ) 11.17*log(1 ) Mismatched conditions f 14678.49 Various undesired distortions like background noise, (3) channel noise damages the quality of speech data when recorded with distant microphone. Figure (1) compares these three frequency scales in Experimental evaluations for noisy and multi- the range 100 Hz to 8000 Hz. speaker environment for speaker recognition is Warped Frequency Scales 1 carried out in [11], using LP (linear prediction) Mel 0.9 Bark ERB based combined spectral and temporal processing 0.8 approach. A detailed study and overview of feature 0.7 0.6 extraction methods in real world conditions is 0.5 discussed in [12]. Studied in [13], proposes various 0.4 Perceptual Perceptual Scales (Normalized) 0.3 alternative mel frequency warped feature 0.2 representations in the presence of interfering noise. 0.1 A closed-set Speaker Identification system is 0 0 1000 2000 3000 4000 5000 6000 7000 8000 Frequency (Hz) developed in which cepstral features are derived as Fig.1 Normalized Perceptual Scales discussed in section II.