Linear Statistical Modeling of Speech and Its Applications -- Over 36 Year History of LPC
Fumitada ITAKURA
Graduate School of Engineering, Nagoya University
Furo-cho 1, Chikusa-ku, Nagoya 464-8603, Japan

Abstract

Linear prediction of the speech signal is now common knowledge for students, engineers and researchers who are interested in speech science and technology. The concept was first introduced to the international speech community in two presentations [6],[7] at the 6th ICA held in Tokyo in 1968. While Atal and Schroeder used the concept of linear prediction to encode the speech waveform efficiently, Itakura and Saito used the concept of maximum likelihood estimation of the speech spectrum to implement a new type of vocoder. Since then, these concepts have been investigated widely in order to achieve accurate speech analysis, efficient transmission of speech, and speech recognition. In this paper, the history of LPC is surveyed, mainly based on my personal experience.

1. Introduction

Homer Dudley is a foremost pioneer of the analysis and synthesis of speech. He invented the "Channel Vocoder" as early as 1939 at Bell Telephone Laboratories [1] and established the principles of the carrier nature of speech. Following in his footsteps, various speech analysis and synthesis methods have been proposed for the purpose of transmitting the speech signal as efficiently as possible. Most research has concentrated on finding feature parameters which express speech spectral characteristics efficiently. Among these methods, the most effective and successful one is based on the autoregressive or all-pole model of the speech signal.

From 1966 through 1968 [2],[3],[4] we proposed the maximum likelihood estimation of speech signal (MLES) and its application to the vocoder. However, the MLES approach itself was not very efficient due to its poor quantization characteristics. The PARCOR scheme was proposed in 1969 by F. Itakura, in which the all-pole model is specified by a set of parameters called PARCOR coefficients or reflection coefficients. It has been shown that a significant information reduction is accomplished through optimal non-linear quantization and bit allocation of the parameters. In 1975, the line spectral pair (LSP) representation of the all-pole model was proposed by F. Itakura; it is now widely used in many speech compression systems [14].

This paper describes the speech analysis and synthesis methods developed from the 1960's through the 1980's at the Electrical Communications Laboratories of the Nippon Telegraph and Telephone Public Corporation. These methods involve MLES, PARCOR, LSP and CSM speech analysis/synthesis.

2. All-pole speech production model based autoregressive (AR) stochastic process

Speech is a continuously time-varying process. By making reasonable assumptions, it is possible to develop linear models which are locally time-invariant for describing important speech events. A time series obtained by sampling a speech signal shows a significant autocorrelation between adjacent samples. The short time autocorrelation function is related to the running spectrum, which plays the most important role in speech perception.

Let $\{x[n],\ n = \text{integer}\}$ be a discrete time speech waveform sampled every $\Delta T$ seconds. Assume that $\{x[n-i],\ i = 0, \ldots, p\}$ are $(p+1)$ dimensional random variables taken from a stochastic process which is stationary over a short interval such as 30 milliseconds. When $x[n]$ is a real sampled value, the following autoregressive (AR) relation is assumed:

$$
\sum_{i=0}^{p} \alpha[i]\, x[n-i] = \sigma \varepsilon[n] \tag{1}
$$

where the $\alpha[i]$'s (with $\alpha[0] = 1$) are the AR or linear predictive coefficients, $\sigma\varepsilon[n]$ is the residual, and $\sigma$ is its rms value. An explicit representation of Eq. (1) in digital filter form is shown in Fig. 1, where $D$ is the unit time delay $\Delta T$. The transfer function $H(z)$ equivalent to Eq. (1) is

$$
H(z) = \frac{X(z)}{E(z)} = \frac{\sigma}{1 + \sum_{i=1}^{p} \alpha[i]\, z^{-i}} \tag{2}
$$

where $z = \exp(j\omega\Delta T)$, $(-\pi < \omega\Delta T < \pi)$.

Fig.1 All-pole (AR) model of speech production

3. Maximum likelihood estimation of speech spectrum

An approach to estimating the parameters in the frequency domain was first reported in [2]. It is assumed that the speech signal has the following characteristics.

(1) The speech production system can be represented by a transfer function with poles only.
(2) The speech signal is generated by adding a random white Gaussian signal to the system mentioned above. The averaged value of the random signal is zero, and its variance is assumed to be $\sigma^2$.

Based on these assumptions, parameters such as $\Theta = (\sigma^2, \alpha[1], \alpha[2], \ldots, \alpha[p])$ can be estimated from $X = (x[1], x[2], \ldots, x[N])$. The all-pole speech envelope is assumed to be

$$
\begin{aligned}
S_{AR}(\omega) &= \frac{1}{2\pi}\left| H(\exp(j\omega\Delta T)) \right|^2 \\
&= \frac{\sigma^2}{2\pi}\, \frac{1}{\left| \prod_{i=1}^{p} \left(1 - z_i z^{-1}\right) \right|^2} \\
&= \frac{\sigma^2}{2\pi}\, \frac{1}{\left| \sum_{i=0}^{p} \alpha[i] \exp(-j\omega i \Delta T) \right|^2} \\
&= \frac{\sigma^2}{2\pi}\, \frac{1}{\sum_{i=-p}^{p} A[i] \cos(\omega i \Delta T)}
\end{aligned}
\tag{3}
$$

where the $z_i$ are the roots of the polynomial equation $1 + \alpha[1] z^{-1} + \alpha[2] z^{-2} + \cdots + \alpha[p] z^{-p} = 0$ and $A[i]$ is given by

$$
A[i] = \sum_{j=0}^{p-|i|} \alpha[j]\, \alpha[j+|i|] \tag{4}
$$

The logarithmic likelihood function is

$$
\begin{aligned}
L(X|\Theta) &= -\frac{N}{2}\left[ \log(2\pi\sigma^2) + \frac{1}{\sigma^2} \sum_{k=-p}^{p} A[k]\, V[k] \right] \\
&= -\frac{N}{2}\left[ 2\log(2\pi) + \frac{1}{2\pi} \int_{-\pi}^{\pi} \left( \log S_{AR}(\omega) + \frac{S_{DFT}(\omega)}{S_{AR}(\omega)} \right) d\omega \right]
\end{aligned}
\tag{5}
$$

where $V[k]$ is the short time autocorrelation function of the samples $X$:

$$
V[i] = \frac{1}{N} \sum_{j=1}^{N-i} x[j]\, x[j+i] \tag{6}
$$

and $S_{DFT}(\omega)$ is the short-time power spectrum which is obtained by the discrete Fourier transformation of $X$:

$$
S_{DFT}(\omega) = \frac{1}{2\pi N} \left| \sum_{k=1}^{N} x[k] \exp(-j\omega k \Delta T) \right|^2 \tag{7}
$$

On the basis of these equations, when $X$ is given, the logarithmic likelihood ratio is represented by $V[k]$ alone. The approach which determines $\Theta$ by maximizing $L(X|\Theta)$ is called the maximum likelihood method. $L(X|\Theta)$ is first maximized with respect to $\sigma^2$ by computing $\hat{\sigma}^2 = \arg\max_{\sigma^2} L(X|\Theta)$. Consequently, the maximization of $L(X|\Theta)$ is equivalent to the minimization of

$$
\hat{\sigma}^2 = J(\alpha[1], \alpha[2], \ldots, \alpha[p]) = \sum_{i=0}^{p} \sum_{j=0}^{p} \alpha[i]\, V[i-j]\, \alpha[j] \tag{8}
$$

and the corresponding likelihood function is given by

$$
L(X|\Theta)\Big|_{\sigma^2 = \hat{\sigma}^2} = -\frac{N}{2} \log\left[ 2\pi e \left\{ \sum_{i=0}^{p} \sum_{j=0}^{p} \alpha[i]\, V[i-j]\, \alpha[j] \right\} \right] \tag{9}
$$

By maximizing the above expression with respect to $\{\alpha[i],\ i = 1, 2, \ldots, p\}$, we get the following Yule-Walker (or normal) equation:

$$
\sum_{j=0}^{p} V[i-j]\, \alpha[j] = 0, \quad (i = 1, 2, \ldots, p) \tag{10}
$$

It has been shown that this maximization of the likelihood function is equivalent to the minimization of the following spectral distance measure:

$$
\begin{aligned}
E\{d(\omega)\} &= \int_{-\pi}^{\pi} 2\left[ d(\omega) + \exp(-d(\omega)) - 1 \right] d\omega \\
&= \int_{-\pi}^{\pi} \left[ d^2(\omega) - \frac{2}{3!} d^3(\omega) + \frac{2}{4!} d^4(\omega) - \cdots \right] d\omega
\end{aligned}
\tag{11}
$$

where $d(\omega) = \log\{S_{AR}(\omega)/S_{DFT}(\omega)\}$ is the log spectral difference between the AR spectrum and the observed DFT spectrum. The measure is non-negative, and zero if and only if $S_{AR}(\omega) \equiv S_{DFT}(\omega)$ for all $\omega$; it is a measure of the goodness of fit between the two spectra $S_{AR}(\omega)$ and $S_{DFT}(\omega)$. The integrand of Eq. (11) is the penalty function against the mismatch between $S_{AR}(\omega)$ and $S_{DFT}(\omega)$, as shown in Fig. 2. Due to its asymmetry, it is very sensitive to underestimation, $S_{AR}(\omega) \ll S_{DFT}(\omega)$, but tolerant of overestimation, $S_{AR}(\omega) \gg S_{DFT}(\omega)$. For this reason, MLES is well suited for extracting spectral peaks such as the formants of speech.

Fig.2 Distance measure

Fig.3 Maximum Likelihood Vocoder

4. PARCOR speech analysis and synthesis method

The direct form speech synthesis filter $H(z)$ with parameters $\{\alpha[i]\}$ requires a higher quantization accuracy to maintain the stability of $H(z)$. To overcome this problem, the partial autocorrelation (PARCOR) lattice filter was invented in 1969 [4]. Since then, the optimum coding of the PARCOR coefficients has been extensively studied in order to improve the synthetic speech quality [13],[14].

4.1 Definition of PARCOR coefficients

The conventional autocorrelation coefficients $V[k]$ can be regarded as a measure of linear time shift dependency, but the set of these parameters is still redundant since there is significant dependency among them. The notion of partial autocorrelation was introduced to reduce the redundancy using successive linear prediction techniques. Suppose that $\hat{x}[n]$ and $\hat{x}[n-p]$ are predicted values of $x[n]$ and $x[n-p]$ from the intermediate samples $x[n-p+1], \ldots, x[n-2], x[n-1]$ in the forms

$$
\begin{aligned}
\hat{x}^{(p-1)}[n] &= -\sum_{i=1}^{p-1} \alpha^{(p-1)}[i]\, x[n-i], \\
\hat{x}^{(p-1)}[n-p] &= -\sum_{i=1}^{p-1} \beta^{(p-1)}[i]\, x[n-i]
\end{aligned}
\tag{12}
$$

where the prediction coefficients $\alpha^{(p-1)}[i]$ and $\beta^{(p-1)}[i]$ are determined to minimize the residuals

$$
E\{(x[n] - \hat{x}^{(p-1)}[n])^2\}
$$

Fig.3 Definition of PARCOR coefficient

Physically, these $k$ parameters correspond to reflection coefficients in the acoustic tube model of the vocal tract. The following relations can be obtained by substituting Eqs. (12), (13) and (14):

$$
\begin{aligned}
k_m &= \frac{w[m-1]}{u[m-1]} = \frac{\sum_{i=0}^{m-1} \alpha^{(m-1)}[i]\, V[m-i]}{\sum_{i=0}^{m-1} \alpha^{(m-1)}[i]\, V[i]}, \\
u[m] &= u[m-1]\,(1 - k_m^2)
\end{aligned}
\tag{15}
$$

$$
\begin{aligned}
\alpha^{(m)}[i] &= \alpha^{(m-1)}[i] - k_m\, \beta^{(m-1)}[i], \\
\beta^{(m)}[i] &= \beta^{(m-1)}[i-1] - k_m\, \alpha^{(m-1)}[i-1], \\
\beta^{(m-1)}[i] &= \alpha^{(m-1)}[m-i]
\end{aligned}
\tag{16}
$$

4.2 Direct PARCOR coefficient derivation

The PARCOR coefficients can also be derived directly from the speech signal. This is called the PARCOR lattice analysis method. The PARCOR coefficients have already been defined as the correlation coefficient between the residuals of the forward and of the backward predictions. The forward and backward residual linear operators are introduced:

$$
\begin{aligned}
\varepsilon_f^{(p-1)}[n] &= \sum_{i=0}^{p-1} \alpha^{(p-1)}[i]\, D^i x[n] = A_{p-1}(D)\, x[n], & A_{p-1}(D) &= \sum_{i=0}^{p-1} \alpha^{(p-1)}[i]\, D^i, \\
\varepsilon_b^{(p-1)}[n-p] &= \sum_{i=1}^{p} \beta^{(p-1)}[i]\, D^i x[n] = B_{p-1}(D)\, x[n], & B_{p-1}(D) &= \sum_{i=1}^{p} \beta^{(p-1)}[i]\, D^i
\end{aligned}
\tag{17}
$$

where $D$ is the delay operator for unit time.
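As an illustration of how the short-time autocorrelation of Eq. (6) feeds the order recursion of Eqs. (15) and (16), the following Python sketch computes the PARCOR coefficients $k_m$ and the predictor coefficients $\alpha[i]$, then checks that the result satisfies the normal equation (10). It is a minimal sketch and not the original implementation: the function names, the synthetic second-order test signal and its coefficients are assumptions chosen only for the example.

```python
import random


def short_time_autocorr(x, p):
    """Eq. (6): V[i] = (1/N) * sum_j x[j] * x[j+i], for i = 0..p."""
    n = len(x)
    return [sum(x[j] * x[j + i] for j in range(n - i)) / n for i in range(p + 1)]


def parcor_from_autocorr(v, p):
    """Order recursion of Eqs. (15)-(16), using the paper's sign convention.

    Returns (alpha, k, u): alpha[0] == 1, k[m-1] holds k_m,
    and u is the final residual power u[p].
    """
    alpha = [1.0] + [0.0] * p      # alpha^{(0)}: only alpha[0] = 1 is nonzero
    k = []
    u = v[0]                       # u[0] = V[0]
    for m in range(1, p + 1):
        # w[m-1] = sum_{i=0}^{m-1} alpha^{(m-1)}[i] * V[m-i]
        w = sum(alpha[i] * v[m - i] for i in range(m))
        k_m = w / u
        # beta^{(m-1)}[i] = alpha^{(m-1)}[m-i]
        beta = [alpha[m - i] for i in range(m + 1)]
        # alpha^{(m)}[i] = alpha^{(m-1)}[i] - k_m * beta^{(m-1)}[i]
        for i in range(m + 1):
            alpha[i] -= k_m * beta[i]
        u *= 1.0 - k_m * k_m       # u[m] = u[m-1] * (1 - k_m^2)
        k.append(k_m)
    return alpha, k, u


def normal_equation_residuals(v, alpha):
    """Left-hand side of Eq. (10) for i = 1..p; all entries should be ~0."""
    p = len(alpha) - 1
    return [sum(v[abs(i - j)] * alpha[j] for j in range(p + 1))
            for i in range(1, p + 1)]


# Synthetic AR(2) signal obeying Eq. (1); the coefficients are hypothetical.
random.seed(0)
true_alpha = [1.0, -0.9, 0.5]
x = [0.0, 0.0]
for _ in range(2000):
    x.append(-true_alpha[1] * x[-1] - true_alpha[2] * x[-2] + random.gauss(0.0, 1.0))
x = x[2:]

p = 2
v = short_time_autocorr(x, p)
alpha, k, u = parcor_from_autocorr(v, p)
residuals = normal_equation_residuals(v, alpha)
print("PARCOR k:", k)
print("alpha:", alpha)
print("residual power u[p]:", u)
```

Because $V[k]$ is computed with the $1/N$ normalization of Eq. (6), every $|k_m|$ stays below one, which is the property that makes the PARCOR parameters attractive for quantization compared with the direct-form $\alpha[i]$.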