Linear Statistical Modeling of Speech and its Applications -- Over 36 year history of LPC --

Fumitada ITAKURA

Graduate School of Engineering, Nagoya University, Furo-cho 1, Chikusa-ku, Nagoya 464-8603, Japan
[email protected]

Abstract

Linear prediction of the speech signal is now common knowledge for students, engineers and researchers who are interested in speech science and technology. The first introduction of the concept to the international speech community was two presentations [6],[7] at the 6th ICA held in Tokyo in 1968. While Atal and Schroeder used the concept of linear prediction to encode the speech waveform efficiently, Itakura and Saito used the concept of maximum likelihood estimation of the speech spectrum to implement a new type of vocoder. Since this epoch, these concepts have been investigated widely in order to achieve accurate speech analysis, efficient transmission of speech, and speech synthesis. In this paper, the history of LPC is surveyed, mainly based on my personal experience.

1. Introduction

Homer Dudley is a foremost pioneer of the analysis and synthesis of speech. He invented the "Channel Vocoder" as early as 1939 at Bell Telephone Laboratories [1], and he established the principles of the carrier nature of speech. Following in his footsteps, various speech analysis and synthesis methods have been proposed for the purpose of transmitting the speech signal as efficiently as possible. Most research has concentrated on finding feature parameters which express the speech spectral characteristics efficiently. Among these methods, the most effective and successful one is based on the autoregressive, or all-pole, model of the speech signal. From 1966 through 1968 [2],[3],[4] we proposed the maximum likelihood estimation of the speech spectrum (MLES) and its application to the vocoder. However, the MLES approach itself was not very efficient due to its poor quantization characteristics. The PARCOR scheme was proposed in 1969 by F. Itakura, in which the all-pole model is specified by a set of parameters called PARCOR coefficients, or reflection coefficients. It was shown that a significant information reduction is accomplished through optimal non-linear quantization and bit allocation of the parameters. In 1975, the line spectrum pair (LSP) representation of the all-pole model was proposed by F. Itakura, which is now widely used in many speech compression systems [14]. This paper describes the speech analysis and synthesis methods developed from the 1960's through the 1980's at the Electrical Communications Laboratories of the Nippon Telegraph and Telephone Public Corporation. These methods involve MLES, PARCOR, LSP and CSM speech analysis/synthesis.

2. All-pole speech production model based on an autoregressive (AR) stochastic process

Speech is a continuously time-varying process. By making reasonable assumptions, it is possible to develop linear models which are locally time-invariant for describing important speech events. A time series obtained by sampling a speech signal shows a significant autocorrelation between adjacent samples. The short-time autocorrelation function is related to the running spectrum, which plays the most important role in speech perception. Let {x[n], n = integer} be a discrete-time speech waveform sampled every ∆T seconds. Assume that {x[n−i], i = 0, ..., p} are (p+1)-dimensional random variables taken from a stochastic process which is stationary over a short interval such as 30 milliseconds. When x[n] is a real sampled value, the following autoregressive (AR) relation is assumed:

\sum_{i=0}^{p} \alpha[i]\, x[n-i] = \sigma\, \varepsilon[n]    (1)

where the α[i]'s (with α[0] = 1) are the AR or linear predictive coefficients, σε[n] is the residual and σ is its rms value. An explicit representation of Eq. (1) in digital filter form is shown in Fig. 1, where D is the unit time delay ∆T. The transfer function H(z) equivalent to Eq. (1) is

H(z) = \frac{X(z)}{E(z)} = \frac{\sigma}{1 + \sum_{i=1}^{p} \alpha[i]\, z^{-i}}    (2)

where z = exp(jω∆T), (−π < ω∆T < π).

Fig. 1  All-pole (AR) model of speech production
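To make Eqs. (1)-(2) concrete, here is a minimal numpy sketch of the all-pole production model of Fig. 1: it drives the recursion of Eq. (1) with an excitation sequence and returns the synthesized samples. The function name, the example coefficients and the white-noise excitation are illustrative assumptions, not part of the original system.

```python
import numpy as np

def ar_synthesize(alpha, sigma, excitation):
    """Run the AR relation of Eq. (1), rearranged as
    x[n] = sigma*eps[n] - sum_{i=1..p} alpha[i]*x[n-i], with alpha[0] = 1."""
    alpha = np.asarray(alpha, dtype=float)
    p = len(alpha) - 1
    x = np.zeros(len(excitation) + p)            # p leading zeros as initial state
    for n, e in enumerate(excitation):
        past = x[n + p - np.arange(1, p + 1)]    # x[n-1], ..., x[n-p]
        x[n + p] = sigma * e - np.dot(alpha[1:], past)
    return x[p:]

# Hypothetical 2nd-order example driven by white Gaussian noise.
rng = np.random.default_rng(0)
alpha = [1.0, -1.2, 0.8]                         # poles inside the unit circle
x = ar_synthesize(alpha, sigma=1.0, excitation=rng.standard_normal(16000))
```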

3. Maximum likelihood estimation of speech spectrum

An approach to estimating the parameters in the frequency domain was first reported in [2]. It is assumed that the speech signal has the following characteristics:
(1) The speech production system can be represented by a transfer function with poles only.
(2) The speech signal is generated by applying a random white Gaussian signal to the system mentioned above. The averaged value of the random signal is zero, and its variance is assumed to be σ².

Based on these assumptions, the parameters Θ = (σ², α[1], α[2], ..., α[p]) can be estimated from X = (x[1], x[2], ..., x[N]). The all-pole speech spectral envelope is assumed to be

S_{AR}(\omega) = \frac{1}{2\pi} \left| H(\exp(j\omega\Delta T)) \right|^{2}
= \frac{\sigma^{2}}{2\pi} \cdot \frac{1}{\prod_{i=1}^{p} \left| 1 - z_{i} z^{-1} \right|^{2}}
= \frac{\sigma^{2}}{2\pi} \cdot \frac{1}{\left| \sum_{i=0}^{p} \alpha[i] \exp(-j\omega i \Delta T) \right|^{2}}    (3)
= \frac{\sigma^{2}}{2\pi} \cdot \frac{1}{\sum_{i=-p}^{p} A[i] \cos(\omega i \Delta T)}

where the z_i are the roots of the polynomial 1 + α[1]z^{-1} + α[2]z^{-2} + ... + α[p]z^{-p} = 0 and A[i] is given by

A[i] = \sum_{j=0}^{p-i} \alpha[j]\, \alpha[j+i]    (4)

The logarithmic likelihood function is

L(X|\Theta) = \frac{-N}{2} \left[ \log(2\pi\sigma^{2}) + \frac{1}{\sigma^{2}} \sum_{k=-p}^{p} A[k]\, V[k] \right]    (5)
= \frac{-N}{2} \left[ \log(2\pi) + \frac{1}{2\pi} \int_{-\pi}^{\pi} \left\{ \log S_{AR}(\omega) + \frac{S_{DFT}(\omega)}{S_{AR}(\omega)} \right\} d\omega \right]

where V[k] is the short-time autocorrelation function of the samples X:

V[i] = \frac{1}{N} \sum_{j=1}^{N-i} x[j]\, x[j+i]    (6)

and S_{DFT}(ω) is the short-time power spectrum obtained by the discrete Fourier transformation of X:

S_{DFT}(\omega) = \frac{1}{2\pi N} \left| \sum_{k=1}^{N} x[k] \exp(-j\omega k \Delta T) \right|^{2}    (7)
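The quantities in Eqs. (6)-(7) can be computed directly. The following is a small numpy sketch under the simplifying assumptions that ∆T = 1, that the samples are indexed from 0, and that no analysis window is applied; the function names and the frequency grid are illustrative choices.

```python
import numpy as np

def short_time_autocorrelation(x, p):
    """V[i] = (1/N) sum_j x[j] x[j+i] for i = 0..p   (Eq. (6))."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    return np.array([np.dot(x[:N - i], x[i:]) / N for i in range(p + 1)])

def short_time_power_spectrum(x, n_freq=512):
    """S_DFT(omega) = |sum_k x[k] exp(-j omega k)|^2 / (2 pi N)   (Eq. (7)),
    evaluated on a uniform grid of omega in [0, pi)."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    omega = np.linspace(0.0, np.pi, n_freq, endpoint=False)
    dft = np.exp(-1j * np.outer(omega, np.arange(N))) @ x
    return omega, np.abs(dft) ** 2 / (2.0 * np.pi * N)
```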

On the basis of these equations, when X is given, the logarithmic likelihood is represented by V[k] alone. The approach which determines Θ by maximizing L(X|Θ) is called the maximum likelihood method. L(X|Θ) is first maximized with respect to σ², giving σ̂² = arg max L(X|Θ). Consequently, the maximization of L(X|Θ) is equivalent to the minimization of

\hat{\sigma}^{2} = J(\alpha[1], \alpha[2], \ldots, \alpha[p]) = \sum_{i=0}^{p} \sum_{j=0}^{p} \alpha[i]\, V[i-j]\, \alpha[j]    (8)

and the corresponding likelihood function is given by

L(X|\hat{\Theta}) = \frac{-N}{2} \log \left[ 2\pi e \sum_{i=0}^{p} \sum_{j=0}^{p} \alpha[i]\, V[i-j]\, \alpha[j] \right]    (9)

By maximizing the above expression with respect to {α[i], i = 1, 2, ..., p}, we get the following Yule-Walker (or normal) equations:

\sum_{j=0}^{p} V[i-j]\, \alpha[j] = 0, \qquad (i = 1, 2, \ldots, p)    (10)
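One direct way to obtain the α[i] of Eq. (10) is to solve the Toeplitz system of normal equations with a general linear solver and then evaluate Eq. (8) for the residual power; the PARCOR recursion of Section 4 solves the same system far more efficiently. The sketch below uses plain numpy, and the function name is an assumption.

```python
import numpy as np

def yule_walker(V):
    """Solve the normal equations of Eq. (10) for alpha[1..p] (alpha[0] = 1),
    given the short-time autocorrelations V[0..p], and return the minimized
    residual power sigma^2 of Eq. (8)."""
    V = np.asarray(V, dtype=float)
    p = len(V) - 1
    idx = np.abs(np.subtract.outer(np.arange(p + 1), np.arange(p + 1)))
    T = V[idx]                                    # Toeplitz matrix of V[|i-j|]
    a_tail = np.linalg.solve(T[1:, 1:], -V[1:])   # rows i = 1..p of Eq. (10)
    alpha = np.concatenate(([1.0], a_tail))
    sigma2 = alpha @ T @ alpha                    # quadratic form of Eq. (8)
    return alpha, sigma2
```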

It has been shown that this maximization of the likelihood function is equivalent to the minimization of the following spectral distance measure:

E\{d(\omega)\} = \frac{1}{2\pi} \int_{-\pi}^{\pi} \left[ d(\omega) + \exp(-d(\omega)) - 1 \right] d\omega    (11)
= \frac{1}{2\pi} \int_{-\pi}^{\pi} \left[ \frac{d^{2}(\omega)}{2!} - \frac{d^{3}(\omega)}{3!} + \frac{d^{4}(\omega)}{4!} - \cdots \right] d\omega

where d(ω) = log{S_AR(ω)/S_DFT(ω)} is the log spectral difference between the AR spectrum and the observed DFT spectrum. The measure is non-negative and zero if and only if S_AR(ω) ≡ S_DFT(ω) for all ω; it is therefore a measure of the goodness of fit between the two spectra S_AR(ω) and S_DFT(ω). The integrand of Eq. (11) is the penalty function against the mismatch between S_AR(ω) and S_DFT(ω), as shown in Fig. 2. Due to its asymmetry, it is very sensitive to underestimation, S_AR(ω) << S_DFT(ω), but tolerant of overestimation, S_AR(ω) >> S_DFT(ω). For this reason, MLES is well suited for extracting spectral peaks such as the formants of speech.
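The asymmetry of the integrand of Eq. (11) is easy to check numerically. The sketch below evaluates the penalty d + exp(−d) − 1 on given spectral values; the function name and the test values are illustrative only.

```python
import numpy as np

def spectral_match_penalty(S_ar, S_dft):
    """Mean of the integrand of Eq. (11), d + exp(-d) - 1, with
    d = log(S_AR / S_DFT), over the supplied frequency samples."""
    d = np.log(np.asarray(S_ar, dtype=float)) - np.log(np.asarray(S_dft, dtype=float))
    return np.mean(d + np.exp(-d) - 1.0)

# Underestimation (S_AR << S_DFT) is punished much harder than the same
# amount of overestimation, which is why MLES latches onto spectral peaks.
print(spectral_match_penalty([0.1], [1.0]))    # d = -log 10  ->  about 6.7
print(spectral_match_penalty([10.0], [1.0]))   # d = +log 10  ->  about 1.4
```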

Fig. 2  Distance measure        Fig. 3  Maximum Likelihood Vocoder

4. PARCOR speech analysis and synthesis method

The direct-form filter H(z) with parameters {α[i]} requires a high quantization accuracy to maintain the stability of H(z). To overcome this problem, the partial autocorrelation (PARCOR) lattice filter was invented in 1969 [4]. Since then, the optimum coding of the PARCOR coefficients has been extensively studied in order to improve the synthetic speech quality [13],[14].

4.1 Definition of PARCOR coefficients

The conventional autocorrelation coefficients V[k] can be regarded as a measure of linear time-shift dependency, but the set of these parameters is still redundant, since there is significant dependency among them. The notion of partial autocorrelation was introduced to reduce this redundancy using successive linear prediction techniques. Suppose that x̃[n] and x̃[n−p] are the predicted values of x[n] and x[n−p] obtained from the intermediate samples x[n−p+1], ..., x[n−2], x[n−1] in the forms

\tilde{x}^{(p-1)}[n] = -\sum_{i=1}^{p-1} \alpha^{(p-1)}[i]\, x[n-i]    (12)
\tilde{x}^{(p-1)}[n-p] = -\sum_{i=1}^{p-1} \beta^{(p-1)}[i]\, x[n-i]

where the prediction coefficients α^(p−1)[i] and β^(p−1)[i] are determined to minimize the residual powers

E\{ (x[n] - \tilde{x}^{(p-1)}[n])^{2} \} \quad \text{and} \quad E\{ (x[n-p] - \tilde{x}^{(p-1)}[n-p])^{2} \}    (13)

of the forward and backward prediction. The PARCOR coefficient k_p between x[n] and x[n−p] is defined as the correlation coefficient between the two residuals x[n] − x̃^(p−1)[n] and x[n−p] − x̃^(p−1)[n−p], as shown in Fig. 3:

\varepsilon_{f}^{(p-1)}[n] = \sum_{i=0}^{p-1} \alpha^{(p-1)}[i]\, x[n-i], \qquad
\varepsilon_{b}^{(p-1)}[n-p] = \sum_{i=1}^{p} \beta^{(p-1)}[i]\, x[n-i]

k_{p} = \frac{ E\{ \varepsilon_{f}^{(p-1)}[n]\, \varepsilon_{b}^{(p-1)}[n-p] \} }{ \sqrt{ E\{ \varepsilon_{f}^{(p-1)}[n]^{2} \}\, E\{ \varepsilon_{b}^{(p-1)}[n-p]^{2} \} } }    (14)

Physically, these k parameters correspond to the reflection coefficients in the acoustic tube model of the vocal tract. The following relations are obtained by substituting Eqs. (12), (13) and (14):

k_{m} = \frac{ \sum_{i=0}^{m-1} \alpha^{(m-1)}[i]\, V[m-i] }{ \sum_{i=0}^{m-1} \alpha^{(m-1)}[i]\, V[i] } = \frac{w[m-1]}{u[m-1]}, \qquad u[m] = u[m-1]\,(1 - k_{m}^{2})    (15)

\alpha^{(m)}[i] = \alpha^{(m-1)}[i] - k_{m}\, \beta^{(m-1)}[i], \qquad
\beta^{(m)}[i] = \beta^{(m-1)}[i-1] - k_{m}\, \alpha^{(m-1)}[i-1], \qquad
\beta^{(m-1)}[i] = \alpha^{(m-1)}[m-i]    (16)

4.2 Direct PARCOR coefficient derivation

The PARCOR coefficients can also be derived directly from the speech signal. This is called the PARCOR lattice analysis method. The PARCOR coefficients have already been defined as the correlation coefficient between the residuals of the forward and of the backward predictions. The forward and backward residual linear operators are introduced:

\varepsilon_{f}^{(p-1)}[n] = \sum_{i=0}^{p-1} \alpha^{(p-1)}[i]\, D^{i} x[n] = A_{p-1}(D)\, x[n], \quad \text{where } A_{p-1}(D) = \sum_{i=0}^{p-1} \alpha^{(p-1)}[i]\, D^{i}    (17)
\varepsilon_{b}^{(p-1)}[n-p] = \sum_{i=1}^{p} \beta^{(p-1)}[i]\, D^{i} x[n] = B_{p-1}(D)\, x[n], \quad \text{where } B_{p-1}(D) = \sum_{i=1}^{p} \beta^{(p-1)}[i]\, D^{i}

where D is the delay operator for unit time. From Eqs. (14) and (16), the following recursions are obtained:

A_{m}(D) = A_{m-1}(D) - k_{m} B_{m-1}(D), \qquad A_{0}(D) = 1    (18)
B_{m}(D) = D\left[ B_{m-1}(D) - k_{m} A_{m-1}(D) \right], \qquad B_{0}(D) = D

From these recursions we can implement a PARCOR lattice analysis/synthesis system, as shown in Fig. 4.
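Eqs. (15)-(16) amount to a Levinson-type recursion that converts the short-time autocorrelation V[0], ..., V[p] into the PARCOR coefficients k_1, ..., k_p and the order-p predictor. A compact numpy sketch, with an assumed function name and no windowing or conditioning of V, might look as follows.

```python
import numpy as np

def parcor_from_autocorrelation(V, p):
    """Recursion of Eqs. (15)-(16): return PARCOR coefficients k[1..p],
    the order-p predictor alpha[0..p], and the final residual energy u[p]."""
    V = np.asarray(V, dtype=float)
    alpha = np.zeros(p + 1)
    alpha[0] = 1.0                                 # alpha^(0)[0] = 1
    k = np.zeros(p + 1)                            # k[0] is unused
    u = V[0]                                       # u[0]: zeroth-order residual energy
    for m in range(1, p + 1):
        w = np.dot(alpha[:m], V[m:0:-1])           # w[m-1] = sum_i alpha^(m-1)[i] V[m-i]
        k[m] = w / u                               # Eq. (15)
        beta = alpha[:m + 1][::-1].copy()          # beta^(m-1)[i] = alpha^(m-1)[m-i]
        alpha[:m + 1] -= k[m] * beta               # Eq. (16)
        u *= 1.0 - k[m] ** 2                       # u[m] = u[m-1](1 - k_m^2)
    return k[1:], alpha, u
```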

Fig. 3  Definition of PARCOR (partial correlation) coefficients        Fig. 4  Lattice-form PARCOR vocoder system
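For the direct derivation of Section 4.2, the forward and backward residuals can be propagated through the lattice recursion of Eq. (18). The sketch below implements only the analysis side of the structure in Fig. 4 (the synthesis filter is the corresponding inverse lattice); the function name is an assumption, and the k values could come, for example, from the autocorrelation-based sketch above.

```python
import numpy as np

def parcor_lattice_analysis(x, k):
    """Pass x through p lattice sections with PARCOR coefficients k[0..p-1]
    (the k_1..k_p of the text) and return the order-p forward residual."""
    x = np.asarray(x, dtype=float)
    f = x.copy()                            # f_0[n] = A_0(D) x[n] = x[n]
    b = np.concatenate(([0.0], x[:-1]))     # b_0[n] = B_0(D) x[n] = x[n-1]
    for km in k:
        f_new = f - km * b                                   # A_m = A_{m-1} - k_m B_{m-1}
        b_new = np.concatenate(([0.0], (b - km * f)[:-1]))   # B_m = D (B_{m-1} - k_m A_{m-1})
        f, b = f_new, b_new
    return f
```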

4.3 Extraction of source parameters using the modified autocorrelation method

As the input signal passes through the PARCOR lattice analyzer, the autocorrelation between adjacent samples is gradually removed. If the number of sections p is sufficiently large, for example from 10 to 16, the spectral envelope of the input signal will be extracted almost completely, and the spectral envelope of the residual will be flattened. Thus only excitation source characteristics, such as signal amplitude, voicing, and pitch period, are contained in it. The signal amplitude is the root mean square value of the residual. The autocorrelation coefficients of the residual are computed to detect periodicity and to determine the pitch period; the pitch period is determined by searching for the maximum of the autocorrelation.

4.4 PARCOR speech synthesis method

Speech synthesis from the PARCOR coefficients and excitation source parameters is the inverse process of speech analysis. The excitation source is generated by controlling the pulse and white-noise generators according to pitch period, voicing and amplitude; it excites a time-varying filter composed of lattice sections to produce synthesized speech.

5. Line spectrum pair (LSP) speech analysis and synthesis method

The PARCOR method does have limitations with regard to data compression: PARCOR-synthesized speech quality deteriorates rapidly at bit rates lower than 2.4 kbps. There are two main reasons for this. Firstly, the required quantization accuracy for the PARCOR coefficients is four to eight bits per parameter [7,8,9,10]. Secondly, the spectral distortion due to parameter interpolation increases rapidly as the frame (parameter refreshing) period is increased. This chapter describes an approach to speech analysis and synthesis which compensates for these problems. This approach, called the "line spectrum pair" (LSP) method [16], again exploits the all-pole modeling of speech. The LSP parameters are LPC parameters in the frequency domain.

5.1 LSP speech analysis method

Let 1/A_p(z^{-1}) be a stable all-pole digital filter, with A_p(z^{-1}) = 1 + ∑_{i=1}^{p} α[i] z^{-i} as in Eq. (2), and define the following (p+1)th-order polynomials, which correspond to the recursion relations of Eq. (18):

P(z) = A_{p}(z^{-1}) - z^{-(p+1)} A_{p}(z)    (19)
Q(z) = A_{p}(z^{-1}) + z^{-(p+1)} A_{p}(z)

Theorem: All the roots of P(z) = 0 and Q(z) = 0 lie on the unit circle and alternate with one another. In this case, the all-pole filter 1/A_p(z^{-1}) is guaranteed to be stable. For even p, P(z) and Q(z) are factored as follows:

P(z) = (1 - z^{-1}) \prod_{i=1}^{p/2} (1 - 2\cos\omega_{2i}\, z^{-1} + z^{-2})    (20)
Q(z) = (1 + z^{-1}) \prod_{i=1}^{p/2} (1 - 2\cos\omega_{2i-1}\, z^{-1} + z^{-2})

It can be readily seen that the set of parameters (ω_1, ω_2, ω_3, ..., ω_{p−1}, ω_p) is derived from A_p(z^{-1}), and conversely, A_p(z^{-1}) is reconstructed from (ω_1, ω_2, ω_3, ..., ω_{p−1}, ω_p). Therefore the set {α_i, i = 1, ..., p} is equivalent to the set (ω_1, ω_2, ..., ω_p), just as it is to {k_i, i = 1, ..., p}. A considerable amount of computation is eliminated by an elegant algorithm for obtaining the roots of P(z) and Q(z), which exploits the fact that both have complex conjugate roots on the unit circle.
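As an illustration of the analysis side of Eqs. (19)-(20), the following sketch forms P(z) and Q(z) from the predictor coefficients and reads the LSP frequencies off their unit-circle roots. It simply calls numpy's general polynomial root finder rather than the efficient algorithm alluded to above, so it is a sketch only; the function name is an assumption.

```python
import numpy as np

def lsp_from_lpc(alpha):
    """Form P(z) and Q(z) of Eq. (19) from A_p(z^{-1}) = 1 + sum alpha[i] z^{-i}
    (alpha[0] = 1) and return the p LSP frequencies in (0, pi), sorted."""
    a = np.asarray(alpha, dtype=float)
    ext = np.concatenate((a, [0.0]))          # A_p(z^{-1}) padded to degree p+1
    rev = np.concatenate(([0.0], a[::-1]))    # z^{-(p+1)} A_p(z)
    P, Q = ext - rev, ext + rev               # Eq. (19)
    omegas = []
    for poly in (P, Q):
        roots = np.roots(poly[::-1])          # roots in the variable z^{-1}
        ang = np.angle(roots)
        omegas.extend(w for w in ang if 1e-6 < w < np.pi - 1e-6)  # drop z = +1, -1
    return np.sort(np.array(omegas))
```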

5.2 LSP speech synthesis method

The all-pole LSP filter is a digital filter whose transfer function is identical to 1/A_p(z^{-1}). The all-pole digital filter 1/A_p(z^{-1}) = 1/(1 − (1 − A_p(z^{-1}))) is realized by a digital filter with a negative feedback loop whose transfer function is 1 − A_p(z^{-1}). By using this relation, the LSP all-pole filter can be implemented by the signal flow graph shown in Fig. 5. Using this configuration, the generation of one output sample requires p multiplications and (3p+1) additions or subtractions.

Fig. 5  LSP all-pole filter for speech synthesis
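Conversely, A_p(z^{-1}) can be rebuilt from the LSP frequencies through the factorizations of Eq. (20) and then used as an all-pole synthesis filter. The sketch below assumes an even p and uses scipy.signal.lfilter as a stand-in for the specific flow graph of Fig. 5, so it reproduces the transfer function 1/A_p(z^{-1}) but not the p-multiplication structure described above; the function names are assumptions.

```python
import numpy as np
from scipy.signal import lfilter

def lpc_from_lsp(omegas):
    """Rebuild A_p(z^{-1}) from p LSP frequencies (p even) via Eq. (20),
    using A_p = (P + Q)/2."""
    w = np.sort(np.asarray(omegas, dtype=float))
    P = np.array([1.0, -1.0])                 # (1 - z^{-1})
    Q = np.array([1.0, 1.0])                  # (1 + z^{-1})
    for i, wi in enumerate(w):
        quad = np.array([1.0, -2.0 * np.cos(wi), 1.0])
        if i % 2 == 0:
            Q = np.convolve(Q, quad)          # omega_1, omega_3, ... belong to Q(z)
        else:
            P = np.convolve(P, quad)          # omega_2, omega_4, ... belong to P(z)
    return 0.5 * (P + Q)[:len(w) + 1]         # the degree-(p+1) terms cancel

def lsp_synthesize(omegas, excitation):
    """Drive the all-pole filter 1/A_p(z^{-1}) with an excitation signal."""
    a = lpc_from_lsp(omegas)
    return lfilter([1.0], a, excitation)
```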

5.3 Physical meaning of LSP

The LSP parameters have an interesting physical interpretation. If the vocal tract characteristics can be expressed by 1/A_p(z^{-1}), the vocal tract is modeled as a non-uniform acoustic tube consisting of p short sections of equal length. The acoustic tube is open at the lips terminal, and each section is numbered from the lips. Differences of vocal tract area between the adjacent sections n and n+1 cause reflection of the traveling wave. The reflection coefficients are equal to the nth PARCOR coefficients k_n. Section p+1, which corresponds to the glottis terminal, is terminated by a matched resistance. The excitation signal applied to the glottis drives the acoustic tube. Now consider a pair of artificial boundary conditions in which the acoustic tube is completely closed or completely open at the glottis. These conditions correspond to k_{p+1} = 1 and k_{p+1} = −1. Under these conditions, A_{p+1}(z) should be identical to either P(z) or Q(z). The acoustic tube is then lossless, and therefore the transfer function displays a line spectrum structure.
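The text identifies the PARCOR coefficients with the reflection coefficients of the lossless acoustic tube but does not state the area relation explicitly. Under one common convention (an assumption here, since sign and numbering conventions differ between references), adjacent section areas satisfy S_{n+1}/S_n = (1 + k_n)/(1 − k_n), which gives the following hypothetical sketch of an area function numbered from the lips.

```python
import numpy as np

def area_function_from_parcor(k, lips_area=1.0):
    """Illustrative vocal-tract area function: section areas from the lips
    towards the glottis, assuming S_{n+1}/S_n = (1 + k_n)/(1 - k_n)."""
    areas = [lips_area]
    for kn in k:
        areas.append(areas[-1] * (1.0 + kn) / (1.0 - kn))
    return np.array(areas)
```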

III - 2080 sections n and n+1 causes reflection of traveling wave. Several speech analysis and synthesis methods based on linear The reflection coefficients are equal to the nth PARCOR prediction, PARCOR , LSP and CSM have been described. coefficients k . Section p+1, which corresponds to the Their principles and physical interpretations have also been n presented. The characteristics of these methods have been glottis terminal, is terminated by a matched resistance. clarified by means of a number of objective and subjective The excitation signal applied to the glottis drives the experiments. As a result, the LSP speech analysis and acoustic tube. Now consider a pair of artificial boundary synthesis method has been found to be superior to other conditions, where the acoustic tube is completely closed methods. Today active research research efforts in narrow- or open at the glottis. These conditions correspond to band speech coding is focused on extremely low bit rate −1 transmission, under 1000 bps.[30],[31] A very interesting kkpp++11==−1 and 1 . Under these conditions, Az() p+1 problem remains on how to eliminate speech signal should be identical to either Pz() or Qz () . The acoustic redundancy. In most of these methods, redundancies in feature tube is now loss-less, and therefore the transfer function parameters in both space and time domains, are eliminated as displays a line spectrum structure. far as possible.

Conclusion

Several speech analysis and synthesis methods based on linear prediction, PARCOR, LSP and CSM have been described. Their principles and physical interpretations have also been presented. The characteristics of these methods have been clarified by means of a number of objective and subjective experiments. As a result, the LSP speech analysis and synthesis method has been found to be superior to the other methods. Today, active research efforts in narrow-band speech coding are focused on extremely low bit rate transmission, under 1000 bps [30],[31]. A very interesting problem remains: how to eliminate speech signal redundancy. In most of these methods, redundancies in the feature parameters, in both the space and time domains, are eliminated as far as possible.

References

[1] H. Dudley, "The Vocoder", Bell Labs. Rec., vol. 18, pp. 122-126, Dec. 1939.
[2] S. Saito and F. Itakura, "A theoretical consideration of statistically optimum methods for speech spectral density", Report No. 3107, Electrical Communication Laboratory, NTT, Tokyo (1966) (in Japanese).
[3] F. Itakura and S. Saito, "Analysis-synthesis transmission system based on maximum likelihood spectrum estimation", Proc. Conv. Acoust. Soc. Japan, 2-3-1 (Nov. 1967) 231-232 (in Japanese).
[4] F. Itakura and S. Saito, "Speech analysis-synthesis system based on the partial autocorrelation coefficients", Proc. Conv. Acoust. Soc. Japan, 2-2-6 (Oct. 1969) 199-200 (in Japanese).
[5] B. S. Atal and M. R. Schroeder, "Predictive coding of speech signals", Proc. 1967 Conf. Comm. and Process. (1967) 360-361.
[6] B. S. Atal and M. R. Schroeder, "Predictive coding of speech signals", Proc. Int. Cong. Acoust., C-5-4, 1968 (Tokyo).
[7] F. Itakura and S. Saito, "Analysis-synthesis telephony based on the maximum likelihood method", Proc. Int. Cong. Acoust., C-5-5, 1968 (Tokyo).
[8] B. S. Atal, "Speech analysis and synthesis by linear prediction of the speech wave", J. Acoust. Soc. Am., 47 (1970) 65(A).
[9] J. D. Markel, "Digital inverse filtering, a new tool for formant trajectory estimation", IEEE Trans., AU-20 (1972) 129-137.
[10] J. Makhoul and J. Wolf, "Linear prediction and the spectral analysis of speech", NTIS No. AD-749066, BBN Report No. 2304, Bolt Beranek and Newman, Inc., Cambridge, Massachusetts (1972).
[11] J. Makhoul, "Spectral analysis of speech by linear prediction", IEEE Trans., AU-21 (1973) 140-148.
[12] J. D. Markel and A. H. Gray, Jr., "On autocorrelation with application to speech analysis", IEEE Trans., AU-21 (1973) 69-79.
[13] N. Kitawaki and F. Itakura, "Efficient coding of speech by nonlinear quantization and nonuniformity of PARCOR coefficients", IECE Trans., Vol. J61-A, No. 6 (1978) 543-550 (in Japanese).
[14] Y. Tohkura, "Improvement of voice quality in PARCOR bandwidth compression system", IECE Trans., Vol. J61-A, No. 3 (1978) (in Japanese).

[15] M. J. Narasimha, K. Shenoi, and A. M. Peterson, "A Hilbert space approach to linear predictive analysis of speech signals", Tech. Report 3606-10, Radio Science Lab., Stanford Electronics Lab., Stanford University, California (1974).
[16] F. Itakura, "Line spectrum representation of linear predictor coefficients of speech signals", J. Acoust. Soc. Am., 57, Suppl. 1 (1975) S35.
[17] F. Itakura, "Line spectrum representation of linear predictors", Trans. of the Committee on Speech Research, S75-34 (1975) (in Japanese).
[18] S. Sagayama and F. Itakura, "Composite sinusoidal modeling applied to spectrum analysis of speech", Trans. of the Committee on Speech Research, S79-06 (1979) (in Japanese).
[19] S. Sagayama, "Symmetrical formulations of linear predictive coding and composite sinusoidal modeling", Proc. Annual Convention of Acoust. Soc. Japan, 1-1-1 (Oct. 1980) 359-360 (in Japanese).
[20] U. Grenander and G. Szegö, Toeplitz Forms and Their Applications, University of California Press (1958).
[21] F. Itakura and S. Saito, "Digital filtering techniques for speech analysis and synthesis", 7th Int. Cong. Acoust., Budapest, 25-C-1 (1971).
[22] N. Levinson, "The Wiener RMS (root mean square) error criterion in filter design and prediction", J. Math. Phys., 25 (1947) 261-278.
[23] N. Sugamura and F. Itakura, "Speech data compression by LSP speech analysis-synthesis technique", Trans. IECE 81/8, J64-A, No. 8 (1981) 599-606 (in Japanese).
[24] S. Sagayama, "Stability condition of LSP speech synthesis digital filters", Proc. Annual Conv. Acoust. Soc. Japan, 3-4-12 (March 1972) 153-154 (in Japanese).
[25] T. Kobayashi and F. Itakura, "An algorithm to extract line spectrum pair parameters", 2-7-9 (May 1980) 599-600 (in Japanese).
[26] F. Itakura and N. Sugamura, "Principle and organization of the LSP speech synthesizer", Trans. of the Committee on Speech Research, S79-46 (1979) (in Japanese).
[27] S. Sagayama, "Theoretical study of CSM, LSP, and their properties", Trans. of the Committee on Speech Research, The Acoustical Society of Japan, S82-14 (May 1982) (in Japanese).
[28] S. Sagayama and F. Itakura, "Composite sinusoidal modeling applied to spectral analysis of speech", Trans. IECE 81/2, J64-A, No. 2 (1981) 105-112 (in Japanese).
[29] F. Itakura and S. Saito, "PARCOR digital speech synthesizer", Proc. Annual Convention of Acoust. Soc. Japan, 2-1-2 (Oct. 1970) 121-122 (in Japanese).
[30] Y. Shiraki and M. Honda, "Very low bit rate speech coding based on joint segment quantization and variable length segment quantizer", J. Acoust. Soc. Am. Suppl. 1, vol. 79 (1986).
[31] Y. Shiraki and M. Honda, "LPC speech coding based on variable-length segment quantization", IEEE Trans. Acoust., Speech, Signal Processing, ASSP-36, No. 9 (1988) 1437-1444.
