
M-ARY PREDICTIVE CODING: A NONLINEAR MODEL FOR SPEECH

M. Vandana

Final Year B.E. (ECE), PSG College of Technology, Coimbatore 641004. [email protected]

ABSTRACT

Speech coding is pivotal in the ability of networks to support multimedia services. The technique currently used for speech coding is Linear Prediction, which models the vocal tract as an all-pole filter, i.e. using a linear difference equation. However, the physical nature of the vocal tract is itself a clue to its nonlinear nature. Developing a nonlinear model is difficult, both in the solution of the nonlinear equations involved and in the verification of nonlinear schemes.

In this paper, two nonlinear models, Quadratic Predictive Coding (QPC) and M-ary Predictive Coding (MPC), are proposed. The equations for the models were developed and the coefficients solved for. MATLAB was used for implementation and testing. This paper gives the equations and preliminary results. The proposed enhancements to MPC are also discussed in brief.

1. AIM

The objectives of this endeavour were:
i. To investigate the two nonlinear models proposed by the author for more effective speech coding.
ii. To develop one of the two models, MPC (M-ary Predictive Coding), into a complete coding scheme.

Various deductions were then drawn from the models and the results. These form the remainder of the paper.

2. INTRODUCTION

Speech is a very special signal for a variety of reasons. The most basic of these is the fact that speech is a non-stationary signal, which makes it a rather tough signal to analyze and model.

The second reason is that factors like intelligibility, coherence and other such 'human' characteristics play a vital role in speech analysis, as against statistical parameters like the Mean Square Error.

The third reason is from a communications point of view. The number of discrete values required to describe one second of speech amounts to 8000 (at the minimum). Due to concerns of bandwidth, compression is desirable.

2.1. Linear Predictive Coding

In the technique of linear prediction, the vocal tract is modeled as an all-pole filter[1]. That is, the transfer function of the vocal tract is written as

    H(z) = 1 / ( Σ(i=1..p) ai*z^-i )

In terms of a difference equation,

    y[n] = Σ(i=1..p) ai*y[n-i]

This means that each sample of the sequence describing speech is assumed to be a linear weighted combination of a finite number of previous samples of the sequence.

By the method of minimization of the Mean Square Error, the coefficients of the linear predictive filter (ai) can be determined[1]. This is the method currently used for speech coding, in the form of various implementations like CELP, RELP etc.

A look at the shape of the human throat reveals that the vocal tract is not linear: it is curved. The obvious question, then, is why the vocal tract has been modeled as an all-pole (linear) filter when in reality it is not linear. The answer is that the assumption of linearity greatly simplifies the required calculations: the final system to be solved is simply a set of linear simultaneous equations, and various methods to solve such systems (like the Levinson-Durbin algorithm) are widely known.

However, a simple implementation of LPC yields errors in the speech output at the receiver. Hence a nonlinear model was studied.

With the objectives of error minimization, intelligibility retention and computational simplicity in mind, this project attempted to develop two distinct nonlinear models for speech. Both models were investigated: each was given a mathematical expression, equations were developed for the respective parameters, and MATLAB was then used to code each model and test it against the above-mentioned objectives.

3. NONLINEAR MODELS PROPOSED BY THE AUTHOR

3.1. Quadratic Model for speech

In this model, instead of expressing each sample as a linear weighted sum of a finite number (p) of the values preceding it, each sample is expressed as a quadratic weighted sum of the same.

Preliminary equation for the quadratic model:

    y[n] = Σ(i=1..p) ( ai*y[n-i] + bi*y^2[n-i] )

3.2. M-ary Model for speech

In this model, the current sample of the sequence is modeled as a weighted sum of m previous sample values, with the power to which each sample value is raised increasing with its delay from the current sample. This attempts to model the physical fact that the longer the time elapsed since the occurrence of a sample value, the lesser its effect, i.e. the decay of each sample value with the progressive increase of time.

Preliminary equation for the M-ary model:

    y[n] = Σ(i=1..m) ai*y^i[n-i]

4. QUADRATIC MODEL

4.1. Development of Equations

The preliminary equation, as given above, is

    y[n] = Σ(i=1..p) ( ai*y[n-i] + bi*y^2[n-i] )

This is the equation of the model suggested.
However, this model is to be made to predict the speech signal. Hence, it is more accurate to denote the LHS of the above equation as yp[n] (the predicted value of y[n]) rather than y[n] itself:

    yp[n] = Σ(i=1..p) ( ai*y[n-i] + bi*y^2[n-i] )                          (4.1)

The prediction error is simply the difference between the actual speech signal and its predicted value:

    e[n] = y[n] - yp[n]

The method of minimization of the Mean Square Error was employed to compute the parameters. The mean square error is

    E(e^2[n]) = Σ(n=0..N-1) e^2[n]
              = Σ(n=0..N-1) ( y[n] - yp[n] )^2
              = Σ(n=0..N-1) { y[n] - Σ(i=1..p) ( ai*y[n-i] + bi*y^2[n-i] ) }^2    (4.2)

Setting the partial derivatives to zero,

    ∂E(e^2[n])/∂ai = 0   for i = 1 to p                                    (4.3)
    ∂E(e^2[n])/∂bi = 0   for i = 1 to p                                    (4.4)

Applying Equations 4.3 and 4.4 to 4.2,

    Σ(n=0..N-1) { y[n] - Σ(i=1..p) ( ai*y[n-i] + bi*y^2[n-i] ) } * y[n-j]   = 0
    Σ(n=0..N-1) { y[n] - Σ(i=1..p) ( ai*y[n-i] + bi*y^2[n-i] ) } * y^2[n-j] = 0
    for j = 1 to p.                                                        (4.5)

Simplifying and reversing the order of summation,

    Σ(n) y[n]*y[n-j]   = Σ(i=1..p) ai Σ(n) y[n-i]*y[n-j]   + Σ(i=1..p) bi Σ(n) y^2[n-i]*y[n-j]
    Σ(n) y[n]*y^2[n-j] = Σ(i=1..p) ai Σ(n) y[n-i]*y^2[n-j] + Σ(i=1..p) bi Σ(n) y^2[n-i]*y^2[n-j]
                                                                           (4.6)

Solving and using matrix notation,

    A = [Y2^-1*Y1 - Y4^-1*Y3]^-1 * [Y2^-1*Yj,1 - Y4^-1*Yj,2]               (4.7)
    B = [Y1^-1*Y2 - Y3^-1*Y4]^-1 * [Y1^-1*Yj,1 - Y3^-1*Yj,2]               (4.8)

A and B are the vectors of the coefficients ai and bi respectively, for i from 1 to p. Y1, Y2, Y3 and Y4 are square matrices such that

    Y1(l,m) = Σ(n=0..N-1) y[n-l]*y[n-m]                                    (4.9)
    Y2(l,m) = Σ(n=0..N-1) y[n-l]*y^2[n-m]                                  (4.10)
    Y3(l,m) = Σ(n=0..N-1) y^2[n-l]*y[n-m]                                  (4.11)
    Y4(l,m) = Σ(n=0..N-1) y^2[n-l]*y^2[n-m]                                (4.12)

(l and m take values from 1 to p.) Yj,1 and Yj,2 are column vectors such that

    Yj,1(j) = Σ(n=0..N-1) y[n]*y[n-j]                                      (4.13)
    Yj,2(j) = Σ(n=0..N-1) y[n]*y^2[n-j]                                    (4.14)

(j takes values from 1 to p.)
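The normal equations (4.6) can equivalently be written as one stacked block system, [[Y1, Y2], [Y3, Y4]] [A; B] = [Yj,1; Yj,2], and solved directly, which avoids the explicit inverses of Eqs. 4.7 and 4.8. The paper's own implementation was in MATLAB and is not reproduced here; the following Python/NumPy sketch is a re-creation for a single analysis frame, with the summation range truncated to n = p..N-1 so that every lag stays in bounds:

```python
import numpy as np

def qpc_coefficients(y, p):
    """Solve for the quadratic-prediction coefficients a_i, b_i of one frame.

    Builds the correlation matrices Y1..Y4 and vectors Yj1, Yj2
    (Eqs. 4.9-4.14) over a common summation range n = p..N-1, then solves
    the stacked normal equations [[Y1, Y2], [Y3, Y4]] [A; B] = [Yj1; Yj2],
    algebraically equivalent to the elimination formulas (4.7)-(4.8).
    """
    N = len(y)
    y2 = y ** 2
    n = np.arange(p, N)                      # keep every lag n-i in range
    lag = lambda u, k: u[n - k]
    rng = range(1, p + 1)
    Y1 = np.array([[np.dot(lag(y,  l), lag(y,  m)) for m in rng] for l in rng])
    Y2 = np.array([[np.dot(lag(y,  l), lag(y2, m)) for m in rng] for l in rng])
    Y3 = np.array([[np.dot(lag(y2, l), lag(y,  m)) for m in rng] for l in rng])
    Y4 = np.array([[np.dot(lag(y2, l), lag(y2, m)) for m in rng] for l in rng])
    Yj1 = np.array([np.dot(y[n], lag(y,  j)) for j in rng])
    Yj2 = np.array([np.dot(y[n], lag(y2, j)) for j in rng])
    M = np.block([[Y1, Y2], [Y3, Y4]])
    ab = np.linalg.solve(M, np.concatenate([Yj1, Yj2]))
    return ab[:p], ab[p:]                    # A = (a_i), B = (b_i)
```

For a signal that exactly follows a quadratic recursion with |y| < 1, this recovers the generating coefficients up to floating-point error.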

Thus, the coefficients ai and bi were determined to develop a quadratic model for the given speech signal.

4.2. Deductions from Mathematical Analysis, used in implementation

From the equation set above, the matrices Y1, Y2, Y3 and Y4 are square. Y1 and Y4 are symmetric, i.e. Y1(l,m) = Y1(m,l), and the same holds for Y4. Hence, with 2p coefficients to be estimated in all (i.e. p previous values being considered in determining the current sample), instead of p*p elements to be computed for each of these matrices, it is sufficient to compute only p(p+1)/2 elements for each. Similarly, the matrices Y2 and Y3 are transposes of each other, so computing one of the two is sufficient.

The net result of these deductions is that, of a total of 4*p*p computations that would originally have been performed, the actual number required is only

    N(C) = 2*{p(p+1)/2} + p*p = p(2p+1)                                    (4.15)

i.e. a saving of p(2p-1) terms in computation.

This quadratic model was implemented using MATLAB. However, the algorithm by itself did not yield the expected results. Moreover, this model does not have a physical significance in the way MPC can be shown to have (Section 5). Hence, work on QPC was suspended after investigation of the preliminary results, and attention was turned to M-ary Prediction for the reasons explained thereunder.

5. M-ARY PREDICTION

5.1. Basic Model and reasons for choice

The preliminary equation for the M-ary model is

    y[n] = Σ(i=1..m) ai*y^i[n-i]                                           (5.1)

An assumption that |y[x]| <= 1 for all x was made. (This assumption will later be supported in the implementation; hence its consequences at this point will correctly serve as the justification of the choice of this model.)

Observation of the model equation reveals that the power to which an element (a sample) is raised increases as its delay with respect to the current sample (i.e. the sample to which its contribution is to be estimated) increases.

In the production of speech, as in any physical process, the decay of energy occurs with the passage of time. Similarly, the effect of a sample at time t will decay to a much smaller value at time (t+dt), and as dt increases, this reduction of effect increases progressively. It is this decay that is modeled in the above equation: since it has been assumed that |y[x]| <= 1 for all x under consideration, as the delay i in the equation increases, the contribution of a sample to future samples keeps decreasing, which models the decay phenomenon in actual speech production.

Additionally, the model permits the setting of a value of m that represents the maximum number of previous values that affect the current sample. This can be interpreted as a measure of the Reverberation Time, which in turn helps to control the intelligibility of the reproduced speech.

Now that the reasons for the choice of this model and its constraints have been discussed, we move on to a brief look at the equations and the relevant results that were utilized in the implementation.

5.2. Development of Equations

The preliminary equation is

    y[n] = Σ(i=1..m) ai*y^i[n-i]                                           (5.1)

Again, since this is an approximate model that attempts to predict the signal, y[n] can be substituted with yp[n]. The instantaneous value of the error is

    e[n] = y[n] - yp[n]

The Mean Square Error is

    E(e^2[n]) = Σ(n=0..N-1) e^2[n]
              = Σ(n=0..N-1) ( y[n] - yp[n] )^2
              = Σ(n=0..N-1) { y[n] - Σ(i=1..m) ai*y^i[n-i] }^2             (5.2)

By the method of minimization of the Mean Square Error,

    ∂E(e^2[n])/∂ai = 0   for i = 1 to m                                    (5.3)

Applying Equation 5.3 to 5.2,

    Σ(n=0..N-1) { ( y[n] - Σ(i=1..m) ai*y^i[n-i] ) * y^j[n-j] } = 0        (5.4)
    for j = 1 to m.

Separating and reversing the order of summation,

    Σ(i=1..m) ai * Σ(n=0..N-1) y^i[n-i]*y^j[n-j] = Σ(n=0..N-1) y[n]*y^j[n-j]    (5.5)
    for j = 1 to m.

Writing this in matrix notation,

    Ac * A = Y0,j                                                          (5.6)

where A is the vector of M-ary Prediction coefficients, Ac is the constant coefficient matrix of the equation, and Y0,j is the constant vector:

    A(i) = ai, the i-th M-ary Prediction coefficient;

    Ac(i,j) = Σ(n=0..N-1) y^i[n-i] * y^j[n-j]                              (5.7)

    Y0,j(j) = Σ(n=0..N-1) y[n] * y^j[n-j]                                  (5.8)

In all the above equations, i and j range from 1 to m.
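As a concrete illustration of Eqs. 5.5-5.8, the following Python/NumPy sketch estimates the MPC coefficients of one frame and forms the prediction of Eq. 5.1. It is a re-creation, not the paper's MATLAB code, and it truncates the sums to n = m..N-1 so that every lag stays in range:

```python
import numpy as np

def mpc_coefficients(y, m):
    """Estimate the M-ary prediction coefficients a_1..a_m for one frame.

    Builds Ac(i,j) = sum_n y^i[n-i] * y^j[n-j]  (Eq. 5.7) and
    Y0(j) = sum_n y[n] * y^j[n-j]               (Eq. 5.8),
    then solves Ac * A = Y0 (Eq. 5.6). Assumes |y| <= 1 (Section 5.1).
    """
    N = len(y)
    n = np.arange(m, N)                      # keep every lag n-i in range
    Ac = np.empty((m, m))
    Y0 = np.empty(m)
    for j in range(1, m + 1):
        vj = y[n - j] ** j
        Y0[j - 1] = np.dot(y[n], vj)
        for i in range(j, m + 1):            # Ac is symmetric (Section 5.3)
            Ac[i - 1, j - 1] = Ac[j - 1, i - 1] = np.dot(y[n - i] ** i, vj)
    return np.linalg.solve(Ac, Y0)

def mpc_predict(y, a):
    """yp[n] = sum_{i=1..m} a_i * y^i[n-i]  (Eq. 5.1); first m samples copied."""
    m = len(a)
    yp = y.copy()
    for n in range(m, len(y)):
        yp[n] = sum(a[i - 1] * y[n - i] ** i for i in range(1, m + 1))
    return yp
```

The inner loop fills only the upper triangle and mirrors it, which is exactly the m(m+1)/2 saving noted in Section 5.3.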

5.3. Deductions from Mathematical Analysis, used in implementation

From Eq. 5.7 it can be observed that Ac(i,j) = Ac(j,i). This symmetry was used to reduce the number of computations required: from a total of m*m (summation-term) computations, the required number was reduced to m(m+1)/2.

Thus, the total number of computations (summations over n) required for the determination of the M-ary predictive coefficients is

    N(C) = m + m(m+1)/2                                                    (5.9)

(Note: comparing with Eq. 4.15, which gives N(C) for QPC, it can be observed that MPC requires fewer computations.)

5.4. Implementation

One key feature to be remembered in the analysis of the implementation is that the physical decay of a speech sample can only be modeled by the above equation when, in mathematical terms, |y[x]| <= 1.

The speech data acquisition was done in MATLAB using the simple wavread command. The speech sample sequence returned by this command is already normalized and lies in the range [-1, 1].

However, from a numerical-methods point of view, when |y[x]| << 1, higher and higher powers of the sample tend towards zero, i.e. they make a negligible contribution. Handling the speech data exactly as read in would thus result in a loss of fine detail. This was handled in the implementation by adaptively scaling the data as each frame is encountered, so that the maximum possible use of the amplitude range [-1, 1] is made. This prevents any loss of fine detail due to the relative smallness in magnitude of low-valued samples in a frame, given the limitations of resolution (here MATLAB's; in a real-world implementation, the limitations of the hardware).

The final stage in the implementation was a median filter with a window size of 5, to reduce the noise in the reproduced speech signal.

5.5. Parameters Identified after preliminary work

After the preliminary work, the identified parameters were:
1. The frame size (N in the above equations for the models)
2. The parameter m, representative of the Reverberation Time
3. The gradient of the signal within the observed frame
4. The number of overlapping samples

5.6. Add-ons for MPC

Upsampling is the add-on applied in all the MPC implementations whose statistics are tabulated in this paper. Upsampling is the resampling of the original data at a higher rate. It was found to reduce the first-order gradient of the signal and thus improve the quality of the speech output.

[Figure 1. First frame of 1000 samples]

[Figure 2. Second frame of 1000 samples]

TABLE 1: Comparison of gradient (maximum and mean) and MSE for two 1000-sample frames of speech, under conditions of no upsampling and of upsampling at a factor of 2.

As can be seen from the above figures and table, upsampling greatly improves the tracking of the signal trend by the M-ary Prediction coefficients. However, as can also be observed, the factor by which upsampling is performed is crucial. A large value would lead to excellent signal reproduction but may end up compromising the compression; a lower value, on the other hand, would increase errors but facilitate excellent compression. Hence a balanced choice of the upsampling factor needs to be made. This led to the use of adaptive upsampling for the coding of the speech signal.

The next issue is the upsampling factor for a frame. As is deducible from the above table, the gradient of the frame plays a pivotal role in the accuracy with which the frame can be coded and recovered. Hence, the gradient of the signal was studied and divided into zones, depending on which the upsampling factor was assigned.

The second add-on being considered for MPC is the application of silence-suppression schemes such as Activity Detection or Comfort Noise Generation algorithms[3]. These are currently under study.

5.7. Quantization

The quantization applied was of two kinds: logarithmic quantization[4] and adaptive companding. The former was applied with an A-law compander (A = 87.6) in MATLAB but was found unsatisfactory, as it caused a deterioration of speech quality unaccompanied by any significant improvement. Hence adaptive companding was investigated, wherein the number of bits per sample in the current analysis frame was decided by considering the distribution of the gradients within the frame. Mu-law companding was implemented here with the standard parameter of Mu = 255. (Results with and without adaptive companding are provided in Table 2 under Section 5.8.)

5.8. Results

So far, MPC with fixed upsampling has been studied for male speech. Sample waveforms are given below. Performance statistics for different MPC implementations are tabulated below and compared with the corresponding parameters for some standard speech coding schemes[3][5]. Since there is no scientific objective method to test speech quality[3], three measures, namely Mean Opinion Score (MOS), Segmented SNR (SSNR) and Mean Square Error (MSE), have been selected for comparison.
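Of the three measures, Segmented SNR is the least standardized. One common way to compute it, sketched below in Python/NumPy (with an illustrative frame length, not a value from the paper), is to average the per-frame SNR values in dB:

```python
import numpy as np

def segmented_snr(ref, test, frame_len=256, eps=1e-12):
    """Segmented (frame-wise) SNR in dB.

    The signals are split into non-overlapping frames, the SNR of each
    frame is computed separately, and the dB values are averaged.
    Averaging per-frame SNRs weights quiet regions more fairly than one
    global SNR, which is why it tracks perceptual quality more closely.
    frame_len = 256 is an illustrative choice, not taken from the paper.
    """
    n_frames = len(ref) // frame_len
    snrs = []
    for k in range(n_frames):
        r = ref[k * frame_len:(k + 1) * frame_len]
        e = r - test[k * frame_len:(k + 1) * frame_len]
        snrs.append(10.0 * np.log10((np.sum(r ** 2) + eps) / (np.sum(e ** 2) + eps)))
    return float(np.mean(snrs))
```

For example, a reconstruction that scales every sample by 0.9 leaves a residual of one tenth of the signal in each frame, giving a segmented SNR of 20 dB.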

[Figure 4. Waveforms for the first 1000 samples: upsampling factor of 10]

TABLE 2: Comparison of results (using fixed upsampling for male speech) with speech coding standards

    Scheme            Bitrate (KBps)   MOS   SEGSNR (dB)   MSE (*10^-3)
    MPC75             4.88             3.4   25.67         13.0
    MPC87.5           6.83             3.9   30.75         5.3
    MPC75-AC          4.28             3.0   11.63         34.3
    MPC87.5-AC        5.72             3.6   13.25         23.0
    USFS-1016 CELP¹   0.6              3.2   ---           7.6
    G.721 ADPCM¹      4.0              4.1   ---           19.38
    GSM EFR¹          1.53             4.0   ---           ---
    GSM²              1.625            3.5   ---           ---
    LPC-10²           0.3              2.3   ---           ---

    --- : data unavailable
    ¹ data from [3] and [5]
    ² data from [6]

Segmented SNR has been recorded since it provides a better indication of perceptual quality[4]. For the specified standards, though, since perceptual quality is of utmost concern, the SNR is not as meaningful[5].

The standards given for comparison do not include nonlinear coding techniques. One such nonlinear speech coding technique is sine wave synthesis of speech, which attempts to show that speech can be adequately described by the changing pattern of vocal resonances alone, disregarding the acoustic elements of speech[2]. Though implementation details are unavailable, the quality of its speech output is comparable to that of MPC75-AC.

6. CONCLUSION

The results for the implementations of MPC so far have been obtained with fixed upsampling only. The application of adaptive upsampling is yet to be investigated.

The frequency-domain significance of the model is also being studied. Since conventional transform techniques cannot be applied, statistical means are being investigated to study the frequency response of the M-ary Predictive Model.

7. REFERENCES

[1] Deller, J. R., J. G. Proakis, and J. H. Hansen, Discrete-Time Processing of Speech. New York: Macmillan Publishing, 1993, pp. 677-804.
[2] P. Ruben, R. Remez and J. Pardo, Sine Wave Synthesis of Speech. [Online] Available: www.haskins.yale.edu/Haskins/MISC/SWS/SWS.html
[3] J. Ahonen and A. Laine (1997, April), Realtime speech and voice transmission on the Internet. Helsinki University of Technology, Telecommunications Software and Multimedia Laboratory, Helsinki, Finland. [Online] Available: www.tml.hut.fi
[4] T. P. Barnwell III, K. Nayebi and C. Richardson, Speech Coding. New York: John Wiley and Sons, Inc., 1996, pp. 42-51.
[5] Shannon Wichman (1999, August), A Comparison of Speech Coding Algorithms: ADPCM vs. CELP. Department of Electrical Engineering, The University of Texas at Dallas. [Online] Available: http://www.utdallas.edu/~torlak/wireless/projects/wichman.pdf
[6] Audio and Speech (2001, October). [Online] Available: www.cs.columbia.edu/~hgs/teaching/ais/slides/audio_long.pdf