TECHNIQUES FOR FEATURE EXTRACTION IN SPEECH RECOGNITION SYSTEM : A COMPARATIVE STUDY

Urmila Shrawankar Research Student, (Computer Science & Engg.), SGB Amravati University [email protected]

Dr. Vilas Thakare Professor & Head, PG Dept. of Computer Science, SGB Amravati University, Amravati

ABSTRACT

The time-domain waveform of a speech signal carries all of the auditory information. From the phonological point of view, however, very little can be said on the basis of the waveform itself. Past research in mathematics, acoustics, and speech technology has provided many methods for converting data that can be considered as information if interpreted correctly. In order to find statistically relevant information in incoming data, it is important to have mechanisms for reducing the information of each segment of the audio signal to a relatively small number of parameters, or features. These features should describe each segment in such a characteristic way that similar segments can be grouped together by comparing their features. There are numerous interesting ways to describe a speech signal in terms of parameters. Each of them has its strengths and weaknesses; we present some of the most widely used methods along with their importance.

Keywords: Speech Recognition System, Signal Processing, Hybrid Feature Extraction Methods

I. INTRODUCTION

Speech is one of the most ancient ways to express ourselves. Today speech signals are also used in biometric recognition technologies and for communicating with machines.

Speech signals are slowly time-varying (quasi-stationary) signals. When examined over a sufficiently short period of time (5-100 msec), their characteristics are fairly stationary. If the signal characteristics change over a longer period, this reflects the different speech sounds being spoken.

The information in a speech signal is actually represented by the short-term amplitude spectrum of the speech waveform. This allows us to extract features based on the short-term amplitude spectrum of speech (phonemes).

The fundamental difficulty of speech recognition is that the speech signal is highly variable due to different speakers, speaking rates, contents and acoustic conditions.

The feature analysis component of an ASR system plays a crucial role in the overall performance of the system. Many feature extraction techniques are available; these include:
- Linear predictive analysis (LPC)
- Linear predictive cepstral coefficients (LPCC)
- Perceptual linear predictive coefficients (PLP)
- Mel-frequency cepstral coefficients (MFCC)
- Power spectral analysis (FFT)
- Mel scale cepstral analysis (MEL)
- Relative spectra filtering of log domain coefficients (RASTA)
- First order derivative (DELTA)
etc.

II. DIGITAL SIGNAL PROCESSING (DSP) TECHNIQUES

The signal processing front-end converts the speech waveform to some type of parametric representation, which is then used for further analysis and processing. Digital signal processing (DSP) techniques are the core of a speech recognition system. DSP methods are used in speech analysis, synthesis, coding, recognition, and enhancement, as well as in voice modification, speaker recognition, and language identification.

IV. LINEAR PREDICTIVE CODING (LPC)

LPC is one of the most powerful speech analysis techniques and is a useful method for encoding quality speech at a low bit rate. The basic idea behind linear predictive analysis is that a specific speech sample at the current time can be approximated as a linear combination of past speech samples.
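The basic idea of linear prediction — approximating the current sample as a linear combination of past samples — can be sketched as a least-squares fit. This is an illustrative toy of our own, not code from the paper:

```python
import numpy as np

# Toy frame: a decaying 440 Hz sinusoid sampled at 8 kHz.
fs = 8000
n = np.arange(320)                                   # one 40 ms frame
x = np.sin(2 * np.pi * 440 * n / fs) * np.exp(-n / 200)

# Linear prediction: approximate x[n] by sum_k a_k * x[n-k] over the
# previous p samples, with the a_k chosen to minimise the squared error.
p = 10
rows = np.array([x[i - p:i][::-1] for i in range(p, len(x))])
targets = x[p:]
a, *_ = np.linalg.lstsq(rows, targets, rcond=None)

err = np.sum((targets - rows @ a) ** 2) / np.sum(targets ** 2)
print(f"relative squared prediction error with p={p}: {err:.2e}")
```

A damped sinusoid satisfies an exact second-order recurrence, so the prediction error here is essentially zero; for real speech the residual is the excitation signal.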

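Before any of these analyses, the quasi-stationary signal is sliced into short overlapping frames, each multiplied by a window. A minimal sketch of this short-time analysis (the 16 kHz / 400-sample / 160-sample configuration matches the one cited later in Section XV; the Hamming window is our choice):

```python
import numpy as np

fs = 16000                       # sampling frequency (Hz)
frame_len = 400                  # 25 ms frames ...
frame_step = 160                 # ... advanced every 10 ms
signal = np.random.default_rng(0).standard_normal(fs)   # 1 s of toy audio

# Slice the signal into overlapping quasi-stationary frames and
# apply a Hamming window to each (the usual short-time analysis).
starts = range(0, len(signal) - frame_len + 1, frame_step)
window = np.hamming(frame_len)
frames = np.array([signal[s:s + frame_len] * window for s in starts])
print("frames:", frames.shape)   # (number of frames, samples per frame)
```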
III. FEATURE EXTRACTION

Theoretically, it should be possible to recognize speech directly from the digitized waveform. However, because of the large variability of the speech signal, it is better to perform feature extraction to reduce that variability. In particular, feature extraction eliminates various sources of information, such as whether the sound is voiced or unvoiced and, if voiced, the effects of periodicity or pitch, the amplitude of the excitation signal, the fundamental frequency, etc.

The reason for computing the short-term spectrum is that the cochlea of the human ear performs a quasi-frequency analysis. The analysis in the cochlea takes place on a nonlinear frequency scale (known as the Bark scale or the mel scale). This scale is approximately linear up to about 1000 Hz and approximately logarithmic thereafter. So, in feature extraction, it is very common to warp the frequency axis after the spectral computation.

The remainder of this paper is a summary of feature extraction techniques that are in use today, or that may be useful in the future, especially in the speech recognition area. Many of these techniques are also useful in other areas of speech processing.

Fig. LPC: The LPC processor

LPC Methodology

LP is a model based on human speech production. It utilizes a conventional source-filter model, in which the glottal, vocal tract, and lip radiation transfer functions are integrated into one all-pole filter that simulates the acoustics of the vocal tract. The principle behind LPC is to minimize the sum of squared differences between the original speech signal and the estimated speech signal over a finite duration; this yields a unique set of predictor coefficients. The predictor coefficients, denoted a_k, are estimated every frame, which is normally 20 ms long. Another important parameter is the gain G. The transfer function of the time-varying digital filter is given by

H(z) = G / (1 − Σ_{k=1..p} a_k z^(−k)),

where p is the prediction order: 10 for the LPC-10 algorithm and 18 for the improved algorithm that is utilized. The Levinson-Durbin recursion is used to compute the required parameters for the autocorrelation method (Deller et al., 2000).

The LPC analysis of each frame also involves deciding whether the frame is voiced or unvoiced; a pitch-detection algorithm is employed to determine the correct pitch period/frequency. It is important to re-emphasize that the pitch, gain and coefficient parameters vary with time from one frame to another. In practice the actual predictor coefficients are never used directly in recognition, since they typically show high variance; they are transformed to a more robust set of parameters known as cepstral coefficients.
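The Levinson-Durbin recursion for the autocorrelation method, and the standard conversion of predictor coefficients to cepstral coefficients, can be sketched as follows. This is a minimal illustration; the function and variable names are ours:

```python
import random

def levinson_durbin(r, p):
    """Solve the LPC normal equations given autocorrelations r[0..p],
    via the Levinson-Durbin recursion. Returns the predictor
    coefficients a_1..a_p and the final prediction-error energy."""
    a = [0.0] * (p + 1)
    e = r[0]
    for i in range(1, p + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / e
        new_a = a[:]
        new_a[i] = k                      # reflection coefficient
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        e *= 1.0 - k * k
    return a[1:], e

def lpc_to_cepstrum(a, n_ceps):
    """Standard recursion from predictor coefficients to LPC cepstral
    coefficients (the gain term is omitted here)."""
    c = [0.0] * n_ceps
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= len(a) else 0.0
        for k in range(1, n):
            if n - k <= len(a):
                acc += (k / n) * c[k - 1] * a[n - k - 1]
        c[n - 1] = acc
    return c

# Toy data: an AR(2) process; LPC analysis with p = 2 should recover
# coefficients close to the true values 0.9 and -0.5.
random.seed(0)
x = [0.0, 0.0]
for _ in range(2000):
    x.append(0.9 * x[-1] - 0.5 * x[-2] + random.gauss(0.0, 1.0))
r = [sum(x[n] * x[n - k] for n in range(k, len(x))) for k in range(3)]
a, e = levinson_durbin(r, 2)
print("estimated a_k:", [round(v, 2) for v in a])
print("first cepstra:", [round(v, 3) for v in lpc_to_cepstrum(a, 4)])
```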

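Several of the pipelines surveyed here begin with a first-order FIR high-pass (pre-emphasis) filter, as in step 1 of the combined scheme of Section XV. A minimal sketch; the coefficient alpha = 0.97 is a conventional choice, not a value taken from this paper:

```python
def pre_emphasis(x, alpha=0.97):
    """First-order FIR high-pass: y[n] = x[n] - alpha * x[n-1].
    The first sample is passed through unchanged."""
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]

# A constant (DC) input is almost entirely suppressed: after the first
# sample, every output value shrinks to 1 - alpha = 0.03.
y = pre_emphasis([1.0] * 6)
print(y)
```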
Performance Analysis

The following parameters are involved in the performance evaluation of LPC:
- Bit rate
- Overall delay of the system
- Computational complexity
- Objective performance evaluation

Types of LPC

The following are the types of LPC:
- Voice-excitation LPC
- Residual-excitation LPC
- Pitch-excitation LPC
- Multiple-excitation LPC (MPLPC)
- Regular-pulse-excited LPC (RPLP)
- Code-excited LPC (CELP)

V. MEL FREQUENCY CEPSTRAL COEFFICIENTS (MFCC)

The use of Mel-frequency cepstral coefficients can be considered one of the standard methods for feature extraction (Motlíček, 2002). The use of about 20 MFCC coefficients is common in ASR, although 10-12 coefficients are often considered sufficient for coding speech (Hagen et al., 2003).

Fig. MFCC: The complete MFCC pipeline

MFCC Methodology

The non-linear frequency scale used is an approximation to the Mel-frequency scale, which is approximately linear for frequencies below 1 kHz and logarithmic for frequencies above 1 kHz. This is motivated by the fact that the human auditory system becomes less frequency-selective as frequency increases above 1 kHz.

The MFCC features correspond to the cepstrum of the log filterbank energies. To calculate them, the log energy is first computed from the filterbank outputs as

E_t[m] = log( Σ_{n=0..N−1} |X_t[n]|^2 H_m[n] ),  m = 1, …, M,

where X_t[n] is the DFT of the t-th input speech frame, H_m[n] is the frequency response of the m-th filter in the filterbank, N is the window size of the transform, and M is the total number of filters. Then the discrete cosine transform (DCT) of the log energies is computed as

c_t[j] = Σ_{m=1..M} E_t[m] cos( π j (m − 1/2) / M ).

The most notable downside of MFCC is its sensitivity to noise, due to its dependence on the spectral form. Methods that utilize information in the periodicity of speech signals could be used to overcome this problem, although speech also contains aperiodic content (Ishizuka & Nakatani, 2006).

Since the human auditory system is sensitive to the time evolution of the spectral content of the signal, an effort is often made to include this information in the feature analysis. In order to capture the changes in the coefficients over time, first and second difference coefficients are computed as

Δc_t = c_{t+1} − c_{t−1}  and  ΔΔc_t = Δc_{t+1} − Δc_{t−1},

respectively. These dynamic coefficients are then concatenated with the static coefficients, o_t = [c_t, Δc_t, ΔΔc_t], making up the final output of the feature analysis representing the t-th speech frame.

In the HFCC-E scheme (Section VII) the filter bandwidth is decoupled from the filter spacing. This is in contrast to earlier MFCC implementations, where these were dependent variables. Another difference from MFCC is that in HFCC-E the filter bandwidth is derived from the equivalent rectangular bandwidth (ERB), based on the critical-bands concept introduced by Moore and Glasberg (1995), rather than from the Mel scale. Still, the centre frequency of the individual filters is computed using the Mel scale. Furthermore, in the HFCC-E scheme the filter bandwidth is further scaled by a constant.
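The MFCC pipeline of Section V — mel filterbank, log energies E_t[m], DCT to c_t[j] — and the first-difference (delta) features can be sketched as follows. This is a simplified illustration with our own parameter choices (e.g. 10 triangular filters); production front-ends add pre-emphasis, windowing and liftering:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters H_m[n] spaced uniformly on the mel scale."""
    edges = mel_to_hz(np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for n in range(l, c):
            fb[m - 1, n] = (n - l) / max(c - l, 1)
        for n in range(c, r):
            fb[m - 1, n] = (r - n) / max(r - c, 1)
    return fb

def mfcc(frame, fs, n_filters=10, n_ceps=6):
    """Log filterbank energies E[m], then DCT -> cepstra c[j]."""
    spec = np.abs(np.fft.rfft(frame)) ** 2          # |X_t[n]|^2
    fb = mel_filterbank(n_filters, len(frame), fs)
    E = np.log(fb @ spec + 1e-10)                   # log filterbank energies
    m = np.arange(n_filters)
    return np.array([np.sum(E * np.cos(np.pi * j * (m + 0.5) / n_filters))
                     for j in range(n_ceps)])

fs = 8000
t = np.arange(256) / fs
frames = [np.sin(2 * np.pi * (300 + 50 * i) * t) for i in range(5)]
C = np.array([mfcc(f, fs) for f in frames])          # static features
delta = C[2:] - C[:-2]                               # first difference over time
print("static:", C.shape, "delta:", delta.shape)
```

The DCT line is the discrete form of the c_t[j] equation above; concatenating `C` rows with the matching `delta` rows yields the combined feature vector o_t.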

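The nonlinear frequency warpings referred to throughout — mel for MFCC, Bark for PLP — can be sketched with closed-form approximations. These particular formulas are the standard ones from the literature, not formulas given in this paper:

```python
import math

def hz_to_mel(f):
    """Mel warping: approximately linear below 1 kHz, logarithmic above."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def hz_to_bark(f):
    """One common Hertz-to-Bark approximation (Zwicker-style)."""
    return 13.0 * math.atan(0.00076 * f) + 3.5 * math.atan((f / 7500.0) ** 2)

for f in (100, 500, 1000, 2000, 4000, 8000):
    print(f"{f:5d} Hz -> mel {hz_to_mel(f):7.1f}, Bark {hz_to_bark(f):5.2f}")
```

Note that mel(1000 Hz) is approximately 1000 mel, reflecting the near-linear behaviour of the scale below 1 kHz.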
VI. LFCC SPEECH FEATURES (LFCC-FB40)

The LFCC is computed as the MFCC-FB40, with the only difference that the Mel-frequency warping step is skipped. Thus, the desired frequency range is covered by a filter-bank of 40 equal-width, equal-height, linearly spaced filters. The bandwidth of each filter is 164 Hz, and the whole filter-bank covers the frequency range [133, 6857] Hz. Obviously, the equal bandwidth of all filters renders unnecessary any normalization of the area under each filter.

Computation of the LFCC:
- The N-point DFT is applied to the discrete time-domain input signal x(n).
- The filter bank is applied to the magnitude spectrum |X(k)|.
- The logarithmically compressed filter-bank outputs X_i are computed.
- Finally, the DCT is applied to the filter-bank outputs to obtain the LFCC-FB40 parameters.

Analogously to the MFCC-FB40, only the first J = 13 parameters are computed.

VII. HFCC-E OF SKOWRONSKI & HARRIS

[Skowronski & Harris, 2002] introduced the Human Factor Cepstral Coefficients (HFCC-E). The constant by which the filter bandwidths are scaled is what Skowronski and Harris labeled the E-factor; larger values of the E-factor, E = {4, 5, 6}, were reported.

VIII. PURE FFT

Despite the popularity of MFCCs and LPC, direct use of vectors containing the coefficients of the FFT power spectrum is also possible for feature extraction. Compared to methods exploiting knowledge about the human auditory system, the pure FFT spectrum carries comparatively more information about the speech signal. However, much of this extra information is located in the relatively higher frequency bands when high sampling rates are used (e.g., 44.1 kHz), which are not usually considered salient in speech recognition. The logarithm of the FFT spectrum is also often used to model loudness perception.

IX. POWER SPECTRAL ANALYSIS (FFT)

One of the more common techniques for studying a speech signal is the power spectrum. The power spectrum of a speech signal describes the frequency content of the signal over time. The first step towards computing the power spectrum is to perform a Discrete Fourier Transform (DFT), which computes the frequency information of the equivalent time-domain signal. Since a speech signal contains only real values, we can use a real-input Fast Fourier Transform (FFT) for increased efficiency. The resulting output contains both the magnitude and phase information of the original time-domain signal.

X. PERCEPTUAL LINEAR PREDICTION (PLP)

The Perceptual Linear Prediction (PLP) model was developed by Hermansky (1990). The goal of the original PLP model is to describe the psychophysics of human hearing more accurately in the feature extraction process. PLP is similar to LPC analysis in that it is based on the short-term spectrum of speech. In contrast to pure linear predictive analysis, however, PLP modifies the short-term spectrum of the speech by several psychophysically based transformations.

XI. PLP SPEECH FEATURES (PLP-FB19)

The PLP parameters rely on a Bark-spaced filter-bank of 18 filters covering the frequency range [0, 5000] Hz. Specifically, the PLP coefficients are computed as follows:
- The discrete time-domain input signal x(n) is subjected to the N-point DFT.
- The critical-band power spectrum is computed through discrete convolution of the power spectrum with the piece-wise approximation of the critical-band curve, where B is the Bark-warped frequency obtained through the Hertz-to-Bark conversion.
- Equal-loudness pre-emphasis is applied to the down-sampled critical-band spectrum.
- Intensity-loudness compression is performed.
- On the result obtained so far, an inverse DFT is performed to obtain the equivalent autocorrelation function.
- Finally, the PLP coefficients are computed after autoregressive modeling and conversion of the autoregressive coefficients to cepstral coefficients.

XII. MEL SCALE CEPSTRAL ANALYSIS (MEL)

Mel scale cepstral analysis is very similar to perceptual linear predictive analysis of speech: in both, the short-term spectrum is modified by psychophysically based spectral transformations. In this method, however, the spectrum is warped according to the Mel scale, whereas in PLP the spectrum is warped according to the Bark scale. The main difference between Mel scale cepstral analysis and perceptual linear prediction concerns the output cepstral coefficients. The PLP model uses an all-pole model to smooth the modified power spectrum, and the output cepstral coefficients are then computed based on this model. In contrast, Mel scale cepstral analysis uses cepstral smoothing to smooth the modified power spectrum; this is done by direct transformation of the log power spectrum to the cepstral domain using an inverse Discrete Fourier Transform (DFT).

XIII. RELATIVE SPECTRA FILTERING (RASTA)

To compensate for linear channel distortions, the analysis library provides the ability to perform RASTA filtering. The RASTA filter can be used either in the log-spectral or the cepstral domain. In effect, the RASTA filter band-passes each feature coefficient. Linear channel distortions appear as an additive constant in both the log-spectral and the cepstral domains. The high-pass portion of the equivalent band-pass filter alleviates the effect of convolutional noise introduced in the channel, while the low-pass portion helps smooth frame-to-frame spectral changes.

XIV. RASTA-PLP

Another popular speech feature representation is known as RASTA-PLP, an acronym for Relative Spectral Transform - Perceptual Linear Prediction, originally proposed by Hynek Hermansky.
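The RASTA band-pass filtering of Section XIII can be sketched as follows. The filter constants below are the ones commonly cited in the RASTA literature (e.g., the ICSI technical report in the references), not values given in this paper:

```python
def rasta_filter(x, pole=0.98):
    """Band-pass one feature-coefficient trajectory x[t] with the classic
    RASTA filter H(z) = 0.1*(2 + z^-1 - z^-3 - 2*z^-4) / (1 - pole*z^-1)."""
    xs = [0.0] * 4 + list(x)          # zero history before the first frame
    y, state = [], 0.0
    for t in range(4, len(xs)):
        fir = 0.1 * (2.0 * xs[t] + xs[t - 1] - xs[t - 3] - 2.0 * xs[t - 4])
        state = fir + pole * state    # single-pole integrator (IIR part)
        y.append(state)
    return y

# A constant additive offset (a linear channel distortion in the log
# domain) is rejected: the response to a constant decays toward zero.
out = rasta_filter([5.0] * 400)
print(f"response to a constant offset, last frame: {out[-1]:.4f}")
```

Because the FIR numerator coefficients sum to zero, the filter has a spectral zero at DC, which is exactly what removes the constant channel offset.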
PLP was proposed as a way of warping spectra to minimize the differences between speakers while preserving the important speech information [Herm90]. RASTA is a separate technique that applies a band-pass filter to the energy in each frequency subband, in order to smooth over short-term noise variations and to remove any constant offset resulting from static spectral coloration in the speech channel, e.g. from a telephone line [HermM94].

XV. COMBINED LPC & MFCC

[K.R. Aida-Zade, C. Ardil and S.S. Rustamov, 2006]

The determination algorithms for the MFCC and LPC coefficients expressing the basic speech features were developed by the authors, who suggested the combined use of MFCC and LPC cepstral coefficients to improve the reliability of a speech recognition system. The recognition system is divided into MFCC- and LPC-based recognition subsystems. The training and recognition processes are realized in both subsystems separately, and the recognition system accepts a decision only when both subsystems produce the same result. The authors claimed that this results in a decreased error rate during recognition.

Steps of the combined use of MFCC and LPC cepstral coefficients:
1. The speech signal is passed through a first-order FIR high-pass filter.
2. Voice activation detection (VAD): the endpoints of an utterance are located in the speech signal with the help of commonly used measures such as the short-term energy estimate Es, the short-term power estimate Ps, the short-term zero-crossing rate Zs, etc.
3. The mean and variance of these measures are calculated for the background noise, assuming that the first 5 blocks are background noise.
4. Framing.
5. Windowing.
6. Calculation of MFCC features.
7. Calculation of LPC features.
8. The speech recognition system consists of two subsystems, MFCC-based and LPC-based. These subsystems are trained by neural networks with MFCC and LPC features, respectively.

The recognition process stages are:
1. In the MFCC- and LPC-based recognition subsystems, the recognition processes are realized in parallel.
2. The recognition results of the MFCC- and LPC-based subsystems are compared, and the speech recognition system confirms the result that is confirmed by both subsystems.

Since the MFCC and LPC methods are applied to overlapping frames of the speech signal, the dimension of the feature vector depends on the dimension of the frames. At the same time, the number of frames depends on the length of the speech signal, the sampling frequency, the frame step, and the frame length. The authors use a sampling frequency of 16 kHz, a frame step of 160 samples, and a frame length of 400 samples.

The other problem of speech recognition is that the same speech has different time durations: even when the same person repeats the same utterance, its duration differs. The authors suggested that, to partially remove this problem, the time durations are scaled to the same length. When the dimension of the scale defined for the speech signal increases, the dimension of the feature vector corresponding to the signal also increases.

XVI. MATCHING PURSUIT (MP)

[Selina Chu, Shrikanth Narayanan, and Jay Kuo, 2009]

Desirable types of features should be robust, stable, and straightforward, with the representation being sparse and physically interpretable. The authors explained that with MP such a representation is possible. The advantages of this representation are the ability to capture the inherent structure within each type of signal and to map a large, complex signal onto a small, simple feature space. More importantly, it is conceivably more invariant to background noise and could capture characteristics in the signal where MFCCs tend to fail.

MP is a desirable method for providing a coarse representation while reducing the residual energy with as few atoms as possible. Since MP selects atoms in order of eliminating the largest residual energy, it lends itself to providing the most useful atoms, even after just a few iterations. The MP algorithm selects atoms in a stepwise manner among the set of waveforms in the dictionary that best correlate with the signal structures. The iteration can be stopped when the coefficient associated with the selected atom falls below a threshold, or when a certain overall number of atoms has been selected. Another common stopping criterion is the signal-to-residual energy ratio.

Because even the first few atoms found by MP naturally contain the most information, they are the most significant features. This also allows each signal to be mapped from a larger problem space into a point in a smaller feature space.

MP features are selected by the following process:
1. Windowing.
2. Decomposition.
3. The process stops after obtaining the atoms.
4. The frequency and scale parameters are recorded for each of these atoms.
5. The mean and the standard deviation corresponding to each parameter are found separately, resulting in 4 feature values.
6. Atoms are chosen to extract features for both training and test data.
7. The robustness of these features is enhanced by averaging.

XVII. INTEGRATED PHONEME SUBSPACE (IPS)

[Hyunsin Park, Tetsuya Takiguchi, and Yasuo Ariki, 2009]

Figure IPS: Block diagrams. (a) Feature extraction of MFCC and the proposed feature; (b) the Integrated Phoneme Subspace (IPS) transform.

The effectiveness of feature extraction methods has been confirmed in speech recognition and speech enhancement experiments, but it remains difficult to recognize observed speech in reverberant environments. If the impulse response of a room is longer than the length of the short-time Discrete Fourier Transform (DFT), the effects of reverberation are both additive and multiplicative in the power spectrum domain. Consequently, it becomes difficult to estimate the reverberant effects in the time or frequency domain. Therefore, the authors proposed a new data-driven speech feature extraction method, the "Integrated Phoneme Subspace (IPS) method", which is based on the logarithmic mel-frequency filter bank domain. In the experimental work the proposed feature is obtained by transform matrices that are linear and time-invariant.
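The greedy atom-selection loop at the heart of MP (Section XVI) can be sketched as follows. The cosine dictionary is a toy of our own choosing, not the dictionary used in the cited work:

```python
import numpy as np

def matching_pursuit(signal, dictionary, n_atoms):
    """Greedily pick the atom with the largest correlation to the
    current residual, subtract its contribution, and repeat."""
    residual = signal.astype(float).copy()
    picks = []
    for _ in range(n_atoms):
        corr = dictionary @ residual          # atoms are unit-norm rows
        i = int(np.argmax(np.abs(corr)))
        coef = corr[i]
        residual -= coef * dictionary[i]
        picks.append((i, coef))
    return picks, residual

# Toy dictionary: unit-norm cosines of different frequencies.
N = 128
n = np.arange(N)
D = np.array([np.cos(2 * np.pi * k * n / N) for k in range(1, 20)])
D /= np.linalg.norm(D, axis=1, keepdims=True)

# A signal built from two atoms; MP finds them in energy order,
# strongest first, exactly as described above.
x = 3.0 * D[4] + 1.0 * D[11]
picks, res = matching_pursuit(x, D, 2)
print("selected atoms:", [i for i, _ in picks])
print(f"residual energy: {np.sum(res ** 2):.2e}")
```

With an orthonormal dictionary the residual vanishes after the two true atoms are selected; with overcomplete dictionaries the loop is stopped by a threshold or atom budget, as the text explains.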
The MDL-based phoneme subspace selection experiment confirmed that the optimal subspace dimensions differ. The authors claimed that the experimental results for isolated word recognition under clean and reverberant conditions showed that the proposed method outperforms conventional MFCC. The proposed method can be combined with other methods, such as speech signal processing or model adaptation, to improve recognition accuracy in real-life environments.

XVIII. CONCLUDING HIGHLIGHTS

- The feature space of MFCC obtained using the DCT is not directly dependent on the speech data, so an observed signal with noise does not give good performance unless noise suppression methods are utilized.
- The MFCC reduces the frequency information of the speech signal to a small number of coefficients.
- In MFCC, the logarithmic operation attempts to model loudness perception in the human auditory system.
- MFCC is a very simplified model of auditory processing; it is easy and relatively fast to compute.
- The PLP features outperform MFCC in specific conditions.
- With LPC, reduced word error rates are found in difficult conditions as compared to PLP.
- The FFT-based approach is good for its linearity in the frequency domain and its computational speed.
- The FFT does not discard or distort information in any anticipatory manner.
- With the FFT, the representation of the signal remains easily interpretable for further analysis and post-processing.
- The effects of noise in the FFT spectrum can also be easily understood.
- ICA, applied to speech data in the time or time-frequency domain, gives good performance in phoneme recognition tasks.
- LDA, applied to speech data in the time-frequency domain, shows better performance than combined linear discriminants in the temporal and spectral domains in a continuous digit recognition task.
- The PLP features alone provide limited capability for dealing with linear channel distortions.
- Employing a RASTA filter makes PLP analysis more robust to linear spectral distortions.
- MEL analysis and PLP analysis of speech are similar in that the short-term spectrum is modified by psychophysically based spectral transformations.
- In MEL, the spectrum is warped according to the Mel scale.
- In PLP, the spectrum is warped according to the Bark scale.
- The PLP model uses an all-pole model to smooth the modified power spectrum; the output cepstral coefficients are then computed from this model.
- Mel scale cepstral analysis uses cepstral smoothing to smooth the modified power spectrum, by direct transformation of the log power spectrum to the cepstral domain using an inverse DFT.

XIX. CONCLUSION

We have discussed several feature extraction techniques along with their pros and cons. Some new methods have been developed by combining several techniques, and their authors have claimed improvements in performance. There is a need to develop new hybrid methods that give better performance in the robust speech recognition area.

REFERENCES

1. Selina Chu, Shrikanth Narayanan, and Jay Kuo: Environmental Sound Recognition With Time-Frequency Audio Features. IEEE Transactions on Audio, Speech, and Language Processing, Vol. 17, No. 6, August 2009.
2. Mostafa Hydari, Mohammad Reza Karami, Ehsan Nadernejad: Speech Signals Enhancement Using LPC Analysis Based on Inverse Fourier Methods. Contemporary Engineering Sciences, Vol. 2, No. 1, pp. 1-15, 2009.
3. Hyunsin Park, Tetsuya Takiguchi, and Yasuo Ariki: Integrated Phoneme Subspace Method for Speech Feature Extraction. EURASIP Journal on Audio, Speech, and Music Processing, Volume 2009, Article ID 690451, 6 pages, doi:10.1155/2009/690451.
4. Kevin M. Indrebo, Richard J. Povinelli, Michael T. Johnson: Minimum Mean-Squared Error Estimation of Mel-Frequency Cepstral Coefficients Using a Novel Distortion Model. IEEE Transactions on Audio, Speech, and Language Processing, Vol. 16, No. 8, November 2008.
5. Ishizuka K. & Nakatani T.: A feature extraction method using subband based periodicity and aperiodicity decomposition with noise robust frontend processing for automatic speech recognition. Speech Communication, Vol. 48, Issue 11, pp. 1447-1457, 2006.
6. K.R. Aida-Zade, C. Ardil and S.S. Rustamov: Investigation of Combined Use of MFCC and LPC Features in Speech Recognition Systems. Proceedings of World Academy of Science, Engineering and Technology, Volume 13, May 2006, ISSN 1307-6884.
7. Hagen A., Connors D.A. & Pellom B.L.: The Analysis and Design of Architecture Systems for Speech Recognition on Modern Handheld-Computing Devices. Proceedings of the 1st IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, pp. 65-70, 2003.
8. Mark D. Skowronski and John G. Harris: "Improving the Filter Bank of a Classic Speech Feature Extraction Algorithm". IEEE International Symposium on Circuits and Systems, Bangkok, Thailand, Vol. IV, pp. 281-284, May 25-28, 2003, ISBN 0-7803-7761-3.
9. M. D. Skowronski and J. G. Harris: "Increased MFCC filter bandwidth for noise-robust phoneme recognition". International Conference on Acoustics, Speech, and Signal Processing, Vol. 1, pp. 801-804, 2002.
10. M. D. Skowronski and J. G. Harris: "Human factor cepstral coefficients". December 2002, Cancun, Mexico.
11. Motlíček P.: Feature Extraction in Speech Coding and Recognition. Report, Oregon Graduate Institute of Science and Technology, Portland, US, pp. 1-50, 2002.
12. X. Huang, A. Acero, and H.-W. Hon: Spoken Language Processing: A Guide to Theory, Algorithm and System Development. Prentice Hall PTR, 2001.
13. Deller J.R., Hansen J.H.L. & Proakis J.G.: Discrete-Time Processing of Speech Signals. IEEE Press, 2000.
14. Karjalainen M.: Kommunikaatioakustiikka. Helsinki University of Technology, Laboratory of Acoustics and Audio Signal Processing, Technical Report 51, 1999.
15. Hermansky H.: Should Recognizers Have Ears? Speech Communication, Vol. 25, pp. 3-27, 1998.
16. Lee Y. & Hwang K.-W.: Selecting Good Speech Features for Recognition. ETRI Journal, Vol. 18, No. 1, 1996.
17. Moore B.C.J.: Hearing. Academic Press, San Diego, 1995.
18. Rabiner L. & Juang B.-H.: Fundamentals of Speech Recognition. 1993.
19. Hermansky H., Morgan N., Bayya A. & Kohn P.: RASTA-PLP Speech Analysis. Technical Report TR-91-069, International Computer Science Institute, Berkeley, CA, 1991.
20. Hermansky H.: Perceptual linear predictive (PLP) analysis of speech. Journal of the Acoustical Society of America, Vol. 87, pp. 1738-1752, 1990.
21. Rabiner L.R. & Schafer R.W.: Digital Processing of Speech Signals. Prentice-Hall, 1978.
22. Collins A.M. & Loftus E.F.: A spreading-activation theory of semantic processing. Psychological Review, Vol. 82, pp. 407-428, 1975.
23. Smith E.E., Shoben E.J. & Rips L.J.: Structure and process in semantic memory: A featural model for semantic decisions. Psychological Review, Vol. 81, pp. 214-241, 1974.