Appendix A
Linear Prediction Analysis of Speech

Linear prediction (LP) is one of the most important tools in speech analysis. The philosophy behind linear prediction is that a speech sample can be approximated as a linear combination of past samples. Then, by minimizing the sum of the squared differences between the actual speech samples and the linearly predicted ones over a finite interval, a unique set of predictor coefficients can be determined [15]. LP analysis decomposes the speech into two highly independent components: the vocal tract parameters (LP coefficients) and the glottal excitation (LP residual). It is assumed that speech is produced by exciting a linear time-varying filter (the vocal tract) by random noise for unvoiced speech segments, or by a train of pulses for voiced speech. Figure A.1 shows a model of speech production for LP analysis [163]. It consists of a time-varying filter H(z) which is excited by either a quasi-periodic pulse source or a random noise source.

[Fig. A.1 Model of speech production for LP analysis]

The most general predictor form in linear prediction is the autoregressive moving average (ARMA) model, where the speech sample s(n) is modelled as a linear combination of the past outputs and the present and past inputs [164-166]. It can be written mathematically as

    s(n) = -\sum_{k=1}^{p} a_k\, s(n-k) + G \sum_{l=0}^{q} b_l\, u(n-l), \quad b_0 = 1,    (A.1)

where a_k, 1 <= k <= p, b_l, 1 <= l <= q, and the gain G are the parameters of the filter. Equivalently, in the frequency domain, the transfer function of the linear prediction speech model is [164]

    H(z) = \frac{1 + \sum_{l=1}^{q} b_l z^{-l}}{1 + \sum_{k=1}^{p} a_k z^{-k}}.    (A.2)

H(z) is referred to as a pole-zero model. The zeros represent the nasals, and the poles represent the resonances (formants) of the vocal tract. When a_k = 0 for 1 <= k <= p, H(z) becomes an all-zero or moving average (MA) model. Conversely, when b_l = 0 for 1 <= l <= q, H(z) becomes an all-pole or autoregressive (AR) model [165]. For non-nasal voiced speech sounds the transfer function of the vocal tract has no zeros, whereas nasals and unvoiced sounds usually include zeros (anti-resonances) as well as poles (resonances) [163].

Generally the all-pole model is preferred for most applications because it is computationally more efficient and it matches the acoustic tube model of speech production. It can model sounds such as vowels well enough. The zeros arise only in nasals and in unvoiced sounds like fricatives, and these can be approximately modelled by including more poles [164]. In addition, the location of a pole is considerably more important perceptually than the location of a zero [167]. Moreover, an all-pole model is easier to solve: a pole-zero model requires solving a set of nonlinear equations, whereas an all-pole model requires only a set of linear equations. The transfer function of the all-pole model is [165]

    H(z) = \frac{G}{1 + \sum_{k=1}^{p} a_k z^{-k}}.    (A.3)

The number p implies that the past p output samples are being considered; it is also the order of the linear prediction.
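To make the all-pole model concrete, the minimal sketch below drives the filter of Eq. (A.3) with a quasi-periodic impulse train, as in the voiced branch of Fig. A.1. The order, coefficients, gain, and pitch period are illustrative placeholders chosen for stability, not values from the text.

```python
import numpy as np
from scipy.signal import lfilter

# All-pole synthesis per Eq. (A.3): H(z) = G / (1 + sum_k a_k z^{-k}).
# Coefficients below are made-up placeholders (roots inside the unit circle).
p = 10
a = np.array([1.0, -1.3, 0.7] + [0.0] * (p - 2))  # denominator [1, a_1, ..., a_p]
G = 0.5                                           # gain

# Voiced-speech excitation u(n): a quasi-periodic impulse train.
n_samples, pitch_period = 800, 80
u = np.zeros(n_samples)
u[::pitch_period] = 1.0

# Filter the excitation through H(z) to obtain the synthesized samples s(n).
s = lfilter([G], a, u)
```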
With this transfer function, we get a difference equation for synthesizing the speech samples s(n):

    s(n) = -\sum_{k=1}^{p} a_k\, s(n-k) + G\, u(n),    (A.4)

where the coefficients a_k are known as linear predictive coefficients (LPCs) and p is the order of the LP filter. The order should be selected such that there is at least one pair of poles for each formant. Generally, the prediction order is chosen using the relation

    p = 2 \times (BW + 1),    (A.5)

where BW is the speech bandwidth in kHz.

A.1 The Prediction Error Signal

The error signal or residual signal e(n) is the difference between the input speech and the estimated speech [165]:

    e(n) = s(n) + \sum_{k=1}^{p} a_k\, s(n-k).    (A.6)

Here the gain G is usually ignored to allow the parameterization to be independent of the signal intensity. In the z-domain, e(n) can be viewed as the output of the prediction filter A(z) applied to the input speech signal s(n), which is expressed as

    E(z) = A(z)\, S(z),    (A.7)

where

    A(z) = \frac{1}{H(z)} = 1 + \sum_{k=1}^{p} a_k z^{-k}; \quad G = 1.    (A.8)

The LP residual represents the excitation for the production of speech [168]. The residual is typically a series of pulses when derived from voiced speech, or noise-like when derived from unvoiced speech. The whole LP model can be decomposed into two parts, the analysis part and the synthesis part, as shown in Fig. A.2. The LP analysis filter removes the formant structure of the speech signal and leaves a lower-energy output, the prediction error, which is often called the LP residual or excitation signal. The synthesis part takes the error signal as an input [165]. The input is filtered by the synthesis filter 1/A(z), and the output is the speech signal.

[Fig. A.2 LP analysis and synthesis model]

A.2 Estimation of Linear Prediction Coefficients

There are two widely used methods for estimating the LP coefficients (LPCs): (i) autocorrelation and (ii) covariance. Both methods choose the short-term filter coefficients (LPCs) a_k in such a way that the energy in the error signal (residual) is minimized. For speech processing tasks, the autocorrelation method is used almost exclusively because of its computational efficiency and inherent stability, whereas the covariance method does not guarantee the stability of the all-pole LP synthesis filter [163, 169]. The autocorrelation method of computing LPCs is described below.

First, the speech signal s(n) is multiplied by a window w(n) to get the windowed speech segment s_w(n). Normally, a Hamming or Hanning window is used. The windowed speech signal is expressed as

    s_w(n) = s(n)\, w(n).    (A.9)

The next step is to minimize the energy in the residual signal. The residual energy E_p is defined as [165]

    E_p = \sum_{n=-\infty}^{\infty} e^2(n) = \sum_{n=-\infty}^{\infty} \left[ s_w(n) + \sum_{k=1}^{p} a_k\, s_w(n-k) \right]^2.    (A.10)

The values of a_k that minimize E_p are found by setting the partial derivatives of the energy E_p with respect to the LPC parameters equal to zero:

    \frac{\partial E_p}{\partial a_k} = 0, \quad 1 \le k \le p.    (A.11)

This results in the following p linear equations for the p unknown parameters a_1, ..., a_p:

    \sum_{k=1}^{p} a_k \sum_{n=-\infty}^{\infty} s_w(n-i)\, s_w(n-k) = -\sum_{n=-\infty}^{\infty} s_w(n-i)\, s_w(n), \quad 1 \le i \le p.    (A.12)

These linear equations can be expressed in terms of the autocorrelation function, because the autocorrelation function of the windowed segment s_w(n) is defined as

    R_s(i) = \sum_{n=-\infty}^{\infty} s_w(n)\, s_w(n+i), \quad 1 \le i \le p,    (A.13)

and the autocorrelation function is an even function, i.e., R_s(i) = R_s(-i).
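As a minimal sketch of the autocorrelation method, the snippet below windows a frame per Eq. (A.9), computes the autocorrelation values of Eq. (A.13), and solves the resulting normal equations; the Toeplitz structure it exploits is derived just below. The frame is random noise standing in for real audio, and the sampling rate and order are illustrative choices.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def autocorr(s_w: np.ndarray, p: int) -> np.ndarray:
    """R_s(i) = sum_n s_w(n) s_w(n+i) for i = 0..p, per Eq. (A.13)."""
    N = len(s_w)
    return np.array([np.dot(s_w[:N - i], s_w[i:]) for i in range(p + 1)])

fs, p = 8000, 10                      # illustrative sampling rate and LP order
s = np.random.randn(int(0.02 * fs))   # hypothetical 20-ms frame (noise stand-in)

s_w = s * np.hamming(len(s))          # windowing, Eq. (A.9)
R = autocorr(s_w, p)

# Normal equations R_s a = -r_s (see Eqs. (A.14)-(A.16) below);
# scipy's solve_toeplitz uses the Levinson recursion internally.
a = solve_toeplitz(R[:p], -R[1:p + 1])  # a = [a_1, ..., a_p]
```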
Substituting the values from Eq. (A.13) into Eq. (A.12), we get

    \sum_{k=1}^{p} R_s(|i-k|)\, a_k = -R_s(i), \quad 1 \le i \le p.    (A.14)

This set of p linear equations can be represented in the following matrix form [163]:

    \begin{bmatrix}
    R_s(0)   & R_s(1)   & \cdots & R_s(p-1) \\
    R_s(1)   & R_s(0)   & \cdots & R_s(p-2) \\
    \vdots   & \vdots   & \ddots & \vdots   \\
    R_s(p-1) & R_s(p-2) & \cdots & R_s(0)
    \end{bmatrix}
    \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{bmatrix}
    = -\begin{bmatrix} R_s(1) \\ R_s(2) \\ \vdots \\ R_s(p) \end{bmatrix}.    (A.15)

This can be summarized in vector-matrix notation as

    R_s\, a = -r_s,    (A.16)

where the p x p matrix R_s is known as the autocorrelation matrix. The resulting matrix is a Toeplitz matrix, in which all elements along a given diagonal are equal. This structure allows the linear equations to be solved by the Levinson-Durbin algorithm. Because of the Toeplitz structure of R_s, A(z) is minimum phase [163]. In the synthesis filter H(z) = 1/A(z), the zeros of A(z) become the poles of H(z); thus, the minimum phase of A(z) guarantees the stability of H(z).

Appendix B
MFCC Features

The MFCC feature extraction technique basically includes windowing the signal, applying the DFT, taking the log of the magnitude, warping the frequencies on a Mel scale, and then applying the inverse DCT. The various steps involved in MFCC feature extraction are described in detail below.

1. Pre-emphasis: Pre-emphasis refers to filtering that emphasizes the higher frequencies. Its purpose is to balance the spectrum of voiced sounds, which have a steep roll-off in the high-frequency region. For voiced sounds, the glottal source has an approximately -12 dB/octave slope [170]. However, when the acoustic energy radiates from the lips, this causes a roughly +6 dB/octave boost to the spectrum. As a result, a speech signal recorded with a microphone from a distance has approximately a -6 dB/octave slope downward compared to the true spectrum of the vocal tract. Therefore, pre-emphasis removes some of the glottal effects from the vocal-tract parameters. The most commonly used pre-emphasis filter is given by the transfer function

    H(z) = 1 - b z^{-1},    (B.1)

where the value of b controls the slope of the filter and is usually between 0.4 and 1.0 [170].
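A minimal sketch of Eq. (B.1) follows, assuming b = 0.97, a common choice within the 0.4-1.0 range quoted above:

```python
import numpy as np

def pre_emphasize(s: np.ndarray, b: float = 0.97) -> np.ndarray:
    """Apply H(z) = 1 - b z^{-1} of Eq. (B.1): y(n) = s(n) - b s(n-1)."""
    # Keep the first sample as-is, since s(-1) is undefined.
    return np.append(s[0], s[1:] - b * s[:-1])

y = pre_emphasize(np.random.randn(160))  # e.g., one 20-ms frame at 8 kHz
```

In the MFCC pipeline this filter is applied first, before windowing and the DFT.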