
R.J. McAulay and T.F. Quatieri
Speech Processing Based on a Sinusoidal Model

Using a sinusoidal model of speech, an analysis/synthesis technique has been developed that characterizes speech in terms of the amplitudes, frequencies, and phases of the component sine waves. These parameters can be estimated by applying a simple peak-picking algorithm to a short-time Fourier transform (STFT) of the input speech. Rapid changes in the highly resolved spectral components are tracked by using a frequency-matching algorithm and the concept of "birth" and "death" of the underlying sine waves. For a given frequency track, a cubic phase function is applied to a sine-wave generator, whose output is amplitude-modulated and added to the sine waves generated for the other frequency tracks, and this sum is the synthetic speech output. The resulting waveform preserves the general waveform shape and is essentially indistinguishable from the original speech. Furthermore, in the presence of noise the perceptual characteristics of the speech and the noise are maintained. It was also found that high-quality reproduction was obtained for a large class of inputs: two overlapping, superposed speech waveforms; music waveforms; speech in musical backgrounds; and certain marine biologic sounds. The analysis/synthesis system has become the basis for new approaches to such diverse applications as multirate coding for secure communications, time-scale and pitch-scale algorithms for speech transformations, and phase dispersion for the enhancement of AM radio broadcasts. Moreover, the technology behind the applications has been successfully transferred to private industry for commercial development.

Speech can be represented with a speech production model that views speech as the result of passing a glottal excitation waveform through a time-varying linear filter (Fig. 1), which models the resonant characteristics of the vocal tract. In many speech applications the glottal excitation can be assumed to be in one of two possible states, corresponding to voiced or unvoiced speech.

In attempts to design high-quality speech coders at mid-band rates, more general excitation models have been developed. Approaches that are currently popular are multipulse [1] and code-excited linear prediction (CELP) [2]. This paper also develops a more general model for glottal excitation, but instead of using impulses as in multipulse, or code-book excitations as in CELP, the excitation waveform is assumed to be composed of sinusoidal components of arbitrary amplitudes, frequencies, and phases.

Other approaches to analysis/synthesis that are based on sine-wave models have been discussed. Hedelin [3] proposed a pitch-independent sine-wave model for use in coding the baseband for speech compression. The amplitudes and frequencies of the underlying sine waves were estimated using Kalman filtering techniques, and each sine-wave phase was defined to be the integral of the associated instantaneous frequency.

Another sine-wave-based speech system is being developed by Almeida and Silva [4]. In contrast to Hedelin's approach, their system uses a pitch estimate to establish a harmonic set of sine waves. The sine-wave phases are computed from the STFT at the harmonic frequencies. To compensate for any errors that might be introduced as a result of the harmonic sine-wave representation, a residual waveform is coded, along with the underlying sine-wave parameters.

The Lincoln Laboratory Journal, Volume 1, Number 2 (1988)

Fig. 1 - The binary model of speech production illustrated here requires pitch, vocal-tract parameters, energy levels, and voiced/unvoiced decisions as inputs.

This paper derives a sinusoidal model for the speech waveform, characterized by the amplitudes, frequencies, and phases of its component sine waves, that leads to a new analysis/synthesis technique. The glottal excitation is represented as a sum of sine waves that, when applied to a time-varying vocal-tract filter, leads to the desired sinusoidal representation for speech waveforms (Fig. 2).

Fig. 2 - The sinusoidal speech model consists of an excitation and a vocal-tract response. The excitation waveform is characterized by the amplitudes, frequencies, and phases of the underlying sine waves of the speech; the vocal tract is modeled by the time-varying linear filter, the impulse response of which is h(t; \tau).

A parameter-extraction algorithm has been developed that shows that the amplitudes, frequencies, and phases of the sine waves can be obtained from the high-resolution short-time Fourier transform (STFT) by locating the peaks of the associated magnitude function. To synthesize speech, the amplitudes, frequencies, and phases estimated on one frame must be matched and allowed to evolve continuously into the set of amplitudes, frequencies, and phases estimated on a successive frame. These issues are resolved using a frequency-matching algorithm in conjunction with a solution to the phase-unwrapping and phase-interpolation problem.

A system was built and experiments were performed with it. The synthetic speech was judged to be of excellent quality, essentially indistinguishable from the original. The results of some of these experiments are discussed, and pictorial comparisons of the original and synthetic waveforms are presented.

The above sinusoidal transform system (STS) has found practical application in a number of speech problems. Vocoders have been developed that operate from 2.4 kbps to 4.8 kbps, providing good speech quality that increases more or less uniformly with increasing bit rate. In another area the STS provides high-quality speech transformations such as time-scale and pitch-scale modifications. Finally, a large research effort has resulted in a new technique for speech enhancement for AM radio broadcasting. These applications are considered in more detail later in the text.

Speech Production Model

In the speech production model, the speech waveform, s(t), is modeled as the output of a linear time-varying filter that has been excited by the glottal excitation waveform, e(t). The filter, which models the characteristics of the vocal tract, has an impulse response denoted by h(t; \tau). The speech waveform is then given by

    s(t) = \int_0^t h(t - \tau; t) \, e(\tau) \, d\tau .   (1)


If the glottal excitation waveform is represented as a sum of sine waves of arbitrary amplitudes, frequencies, and phases, then the model can be written as

    e(t) = \mathrm{Re} \sum_{l=1}^{L(t)} a_l(t) \exp\Big\{ j \Big[ \int_0^t \omega_l(\sigma) \, d\sigma + \phi_l \Big] \Big\}   (2)

where a_l(t) and \omega_l(t) are the time-varying amplitude and frequency of the l th sine wave and \phi_l is a fixed phase offset. Letting

    H(\omega; t) = M(\omega; t) \exp[ j \Phi(\omega; t) ]   (3)

represent the time-varying vocal-tract transfer function and assuming that the excitation parameters given in Eq. 2 are constant throughout the duration of the impulse response of the filter in effect at time t, then using Eqs. 2 and 3 in Eq. 1 leads to the speech model given by

    s(t) = \mathrm{Re} \sum_{l=1}^{L(t)} a_l(t) \, M[\omega_l(t); t] \exp\Big\{ j \Big[ \int_0^t \omega_l(\sigma) \, d\sigma + \Phi[\omega_l(t); t] + \phi_l \Big] \Big\} .   (4)

By combining the effects of the glottal and vocal-tract amplitudes and phases, the representation can be written more concisely as

    s(t) = \mathrm{Re} \sum_{l=1}^{L(t)} A_l(t) \exp[ j \psi_l(t) ]   (5)

where

    A_l(t) = a_l(t) \, M[\omega_l(t); t]   (6)

    \psi_l(t) = \int_0^t \omega_l(\sigma) \, d\sigma + \Phi[\omega_l(t); t] + \phi_l   (7)

represent the amplitude and phase of the l th sine wave along the frequency track \omega_l(t). Equations 5, 6, and 7 combine to provide a sinusoidal representation of a speech waveform. In order to use the model, the amplitudes, frequencies, and phases of the component sine waves must be extracted from the original speech waveform.

Estimation of Speech Parameters

In general, it is difficult to determine analytically which of the component sine waves and their amplitudes, frequencies, and phases are necessary to represent a speech waveform. Therefore, an estimator based on idealized speech waveforms was developed to extract these parameters. As restrictions on the speech waveform were relaxed in order to model real speech better, adjustments were made to the estimator to accommodate these changes.

In the development of the estimator, the time axis was first broken down into an overlapping sequence of frames, each of duration T. The center of the analysis window for the k th frame occurs at time t_k. Assuming that the vocal-tract and glottal parameters are constant over an interval of time that includes the duration of the analysis window and the duration of the vocal-tract impulse response, then Eq. 7 can be written as

    \psi_l(t) = (t - t_k) \, \omega_l^k + \theta_l^k   (8)

where the superscript k indicates that the parameters of the model may vary from frame to frame. Using Eq. 8 in Eq. 5, the synthetic speech contribution of frame k is

    s(t) = \sum_{l=1}^{L^k} \gamma_l^k \exp[ j (t - t_k) \, \omega_l^k ]   (9)

where

    \gamma_l^k = A_l^k \exp( j \theta_l^k )   (9a)

is the complex amplitude of the l th component of the L^k sine waves.
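Within one frame, Eqs. 5 and 9 reduce to a sum of cosines with constant amplitude, frequency, and phase. A minimal real-valued sketch (all names and parameter values are illustrative, not taken from the paper):

```python
import numpy as np

def sum_of_sines(amps, freqs_hz, phases, n_samples, fs):
    """Real-valued reading of Eq. 5 over one frame, parameters held constant:
    s(n) = sum_l A_l * cos(2*pi*f_l*n/fs + phi_l)."""
    n = np.arange(n_samples)
    s = np.zeros(n_samples)
    for A, f, phi in zip(amps, freqs_hz, phases):
        s += A * np.cos(2 * np.pi * f * n / fs + phi)
    return s

# Three harmonically related components, as in a voiced frame (illustrative values).
frame = sum_of_sines([1.0, 0.5, 0.25], [100.0, 200.0, 300.0],
                     [0.0, 0.3, -0.7], n_samples=200, fs=8000)
```

The estimation problem addressed next is the inverse of this generator: recovering the amplitudes, frequencies, and phases from the measured waveform.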


Since the measurements are made on digitized speech, sampled-data notation [s(n)] is used. In this respect the time index n corresponds to the uniform samples of t - t_k; therefore n ranges from -N/2 to N/2, with n = 0 reset to the center of the analysis window for every frame, and where N + 1 is the duration of the analysis window. The problem now is to fit the synthetic speech waveform in Eq. 9 to the measured waveform, denoted by y(n). A useful criterion for judging the quality of fit is the mean-squared error

    \epsilon^k = \sum_n | y(n) - s(n) |^2
               = \sum_n | y(n) |^2 - 2 \, \mathrm{Re} \sum_n y(n) \, s^*(n) + \sum_n | s(n) |^2 .   (10)

Substituting the speech model of Eq. 9 into Eq. 10 leads to the error expression

    \epsilon^k = \sum_n | y(n) |^2 - 2 \, \mathrm{Re} \sum_{l=1}^{L^k} (\gamma_l^k)^* \sum_n y(n) \exp( -j n \omega_l^k )
               + (N + 1) \sum_{l=1}^{L^k} \sum_{i=1}^{L^k} \gamma_l^k (\gamma_i^k)^* \, \mathrm{sinc}( \omega_l^k - \omega_i^k )   (11)

where sinc(x) = sin[(N + 1) x / 2] / [(N + 1) sin(x / 2)]. The task of the estimator is to identify a set of sine waves that minimizes Eq. 11.

Insights into the development of a suitable estimator can be obtained by restricting the class of input signals to the idealization of perfectly voiced speech, ie, speech that is periodic, hence having component sine waves that are harmonically related. In this case the synthetic speech waveform can be written as

    s(n) = \sum_{l=1}^{L^k} \gamma_l^k \exp( j n l \omega_0^k )   (12)

where \omega_0^k = 2\pi / \tau_0^k and where \tau_0^k is the pitch period, assumed to be constant over the duration of the k th frame. For the purpose of establishing the structure of the ideal estimator, it is further assumed that the pitch period is known and that the width of the analysis window is a multiple of \tau_0^k. Under these highly idealized conditions, the sinc(·) function in the last term of Eq. 11 reduces to

    \mathrm{sinc}( \omega_l^k - \omega_i^k ) = \mathrm{sinc}[ (l - i) \, \omega_0^k ] = 1 \text{ if } l = i, \; 0 \text{ if } l \neq i   (13)

where \omega_l^k = l \omega_0^k. Then the error expression reduces to

    \epsilon^k = \sum_n | y(n) |^2 - 2 (N + 1) \, \mathrm{Re} \sum_{l=1}^{L^k} (\gamma_l^k)^* \, Y(\omega_l^k) + (N + 1) \sum_{l=1}^{L^k} | \gamma_l^k |^2   (14)

where

    Y(\omega) = \frac{1}{N + 1} \sum_n y(n) \exp( -j n \omega )   (15)

is the STFT of the measurement signal. By completing the square in Eq. 14, the error can be written as

    \epsilon^k = \sum_n | y(n) |^2 + (N + 1) \sum_{l=1}^{L^k} \left[ | Y(\omega_l^k) - \gamma_l^k |^2 - | Y(\omega_l^k) |^2 \right]   (16)

from which it follows that the optimal estimate for the amplitude and phase is

    \hat{\gamma}_l^k = Y( l \omega_0^k )   (17)

which reduces the error to

    \epsilon^k = \sum_n | y(n) |^2 - (N + 1) \sum_{l=1}^{L^k} | Y( l \omega_0^k ) |^2 .   (18)

From this calculation it follows, therefore, that the error is minimized by selecting all of the harmonic frequencies in the speech bandwidth. Equations 15 and 17 completely specify the structure of the ideal estimator and show that the optimal estimator depends on the speech data through the STFT (Eq. 15). Although these results are equivalent to a Fourier-series representation of a periodic waveform, the results lead to an intuitive generalization to the more practical case. This is done by considering the function | Y(\omega) |^2 to be a continuous function of \omega. For the idealized voiced-speech case, this function (called a periodogram) will be pulselike in nature, with peaks occurring at all of the pitch harmonics. Therefore, the frequencies of the underlying sine waves correspond to the locations of the peaks of the periodogram, and the estimates of the amplitudes and phases are obtained by evaluating the STFT at the frequencies associated with the peaks of the periodogram.

This interpretation permits the extension of the estimator to a more generalized speech waveform, one that is not ideally voiced. This extension becomes evident when the STFT is calculated for the general sinusoidal speech model given in Eq. 9. In this case the STFT is simply given by

    Y(\omega) = \sum_{l=1}^{L^k} \gamma_l^k \, \mathrm{sinc}( \omega_l^k - \omega ) .   (19)

Assuming that the component sine waves have been properly resolved, so that

    \mathrm{sinc}( \omega_l^k - \omega_i^k ) \approx 0 \quad \text{for } l \neq i ,   (20)

then the periodogram can be written as

    | Y(\omega) |^2 = \sum_{l=1}^{L^k} | \gamma_l^k |^2 \, \mathrm{sinc}^2( \omega_l^k - \omega )   (21)

and, as before, the locations of the peaks of the periodogram correspond to the underlying sine-wave frequencies. The STFT samples at these frequencies correspond to the complex amplitudes. Therefore, provided Eq. 20 holds, the structure of the ideal estimator applies to a more general class of speech waveforms than perfectly voiced speech. Since, during steady voicing, neighboring frequencies are approximately separated by the width of the pitch frequency, Eq. 20 suggests that the desired resolution can be achieved most of the time by requiring that the analysis window be at least two pitch periods wide.

These properties are based on the assumption that the sinc(·) function is essentially zero outside of the region defined by Eq. 20. In fact, this approximation is not a valid one, because there will be sidelobes outside of this region due to the rectangular window implicit in the definition of the STFT. These sidelobes lead to leakage that compromises the performance of the estimator, a problem that is reduced, but not eliminated, by using the weighted STFT. Letting \tilde{Y}(\omega) denote the weighted STFT, ie,

    \tilde{Y}(\omega) = \sum_{n=-N/2}^{N/2} w(n) \, y(n) \exp( -j n \omega )   (22)

where w(n) represents the temporal weighting due to the window function, then the practical version of the idealized estimator estimates the frequencies of the underlying sine waves as the locations of the peaks of | \tilde{Y}(\omega) |. Letting these frequency estimates be denoted by \{ \hat{\omega}_l^k \}, then the corresponding complex amplitudes are simply

    \hat{\gamma}_l^k = \tilde{Y}( \hat{\omega}_l^k ) .   (23)

Provided the analysis window is "wide enough," then, in the absence of noise, \hat{A}_l^k will yield the value of an underlying sine wave, provided the window is scaled so that

    \sum_{n=-N/2}^{N/2} w(n) = 1 .   (24)

Using the Hamming window for the weighted STFT provided a very good sidelobe structure in that the leakage problem was eliminated; it did so at the expense of broadening the main lobes of the periodogram. In order to accommodate this broadening, the constraint implied by Eq. 20 must be revised to require that the window width be at least 2.5 times the pitch period. This revision maintains the resolution features that were needed to justify the optimality properties of the periodogram processor. Although the window width could be set on the basis of the instantaneous pitch, the analyzer is less sensitive to the performance of the pitch extractor if the window width is set on the basis of the average pitch instead. The pitch computed during strongly voiced frames is averaged using a 0.25-s time constant, and this averaged pitch is used to update, in real time, the width of the analysis window. During frames of unvoiced speech, the window is held fixed at the value obtained on the preceding voiced frame or 20 ms, whichever is smaller.

Once the width for a particular frame has been specified, the Hamming window is computed and normalized according to Eq. 24, and the STFT of the input speech is taken using a 512-point fast Fourier transform (FFT) for 4-kHz bandwidth speech. A typical periodogram for voiced speech, along with the amplitudes and frequencies that are estimated using the above procedure, is plotted in Fig. 3.
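The practical estimator (the normalized window of Eq. 24, the weighted STFT of Eq. 22, and peak picking) can be sketched as follows. This is an assumption-laden illustration, not the Lincoln Laboratory implementation: the function name, the frame length, the simple local-maximum rule, and the phase reference at the frame start (rather than the window center used in the text) are all choices made here for brevity.

```python
import numpy as np

def pick_peaks(y, fs, nfft=512):
    """Window the frame with a unit-sum Hamming window (Eq. 24), take a
    zero-padded DFT (Eq. 22), and return the amplitude, frequency, and phase
    at each local maximum of the magnitude spectrum."""
    w = np.hamming(len(y))
    w /= w.sum()                      # sum of w(n) = 1, so a peak height ~ A_l / 2
    Y = np.fft.rfft(y * w, nfft)
    mag = np.abs(Y)
    # Local maxima of |Y(w)| over the interior bins.
    peaks = [k for k in range(1, len(mag) - 1)
             if mag[k] > mag[k - 1] and mag[k] >= mag[k + 1]]
    freqs = np.array(peaks) * fs / nfft
    amps = 2.0 * mag[peaks]           # real signal: energy split between +/- f
    phases = np.angle(Y)[peaks]
    return amps, freqs, phases

fs = 8000
n = np.arange(160)                    # one 20-ms frame
y = 0.8 * np.cos(2 * np.pi * 500 * n / fs + 0.5)
amps, freqs, phases = pick_peaks(y, fs)
```

For a single 500-Hz cosine, the strongest picked peak recovers the amplitude, frequency, and phase to within the leakage of the Hamming sidelobes; a production analyzer would also make the window width pitch-adaptive, as described above.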

In order to apply the sinusoidal model to unvoiced speech, the frequencies corresponding to the periodogram peaks must be close enough to satisfy the requirement imposed by the Karhunen-Loeve expansion [5] for noiselike signals. If the window width is constrained to be at least 20 ms wide, on average, the corresponding periodogram peaks will be approximately 100 Hz apart, enough to satisfy the constraints of the Karhunen-Loeve sinusoidal representation for random noise. A typical periodogram for a frame of unvoiced speech, along with the estimated amplitudes and frequencies, is plotted in Fig. 4.

This analysis provides a justification for representing speech waveforms in terms of the amplitudes, frequencies, and phases of a set of sine waves. However, each sine-wave representation applies only to one analysis frame; different sets of these parameters are obtained for each frame. The next problem to address, then, is how to associate the amplitudes, frequencies, and phases measured on one frame with those found on a successive frame.

Fig. 3 - This is a typical periodogram of voiced speech. The amplitude peaks of the periodogram determine which frequencies are chosen to represent the speech waveform. The amplitudes of the underlying sine waves are denoted with an x.

Fig. 4 - This periodogram illustrates how the power is shifted to higher frequencies in unvoiced speech. The amplitudes of the underlying sine waves are denoted with an x.

Frame-to-Frame Peak Matching

If the number of periodogram peaks were constant from frame to frame, the peaks could simply be matched between frames on a frequency-ordered basis. In practice, however, there are spurious peaks that come and go because of the effects of sidelobe interaction. (The Hamming window doesn't completely eliminate sidelobe interaction.) Additionally, peak locations change as the pitch changes, and there are rapid changes in both the location and the number of peaks corresponding to rapidly varying regions of speech, such as at voiced/unvoiced transitions. The analysis system can accommodate these rapid changes through the incorporation of a nearest-neighbor frequency tracker and the concept of the "birth" and "death" of sinusoidal components.


A detailed description of this tracking algorithm is given in Ref. 6. An illustration of the effects of the procedure used to account for extraneous peaks is shown in Fig. 5. The results of applying the tracker to a segment of real speech are shown in Fig. 6, which demonstrates the ability of the tracker to adapt quickly through such transitory speech behavior as voiced/unvoiced transitions and mixed voiced/unvoiced regions.

Fig. 5 - Different modes used in the birth/death frequency-track-matching process. Note the death of two tracks during frames one and three and the birth of a track during the second frame.

Fig. 6 - These frequency tracks (frequency versus time) were derived from real speech using the birth/death frequency tracker; an unvoiced segment is followed by a voiced segment.
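The matching step can be sketched as a greedy nearest-neighbor search. This is a simplification of the tracker in Ref. 6, which also resolves contested matches; the function name and threshold are illustrative.

```python
def match_tracks(prev_freqs, cur_freqs, max_delta):
    """Greedy nearest-neighbor frequency matching with birth and death.
    Returns (pairs, deaths, births): matched index pairs, indices of tracks
    that die, and indices of newly born peaks."""
    pairs, deaths = [], []
    unused = set(range(len(cur_freqs)))
    for i, f in enumerate(prev_freqs):
        # Closest unclaimed peak in the current frame.
        best = min(unused, key=lambda j: abs(cur_freqs[j] - f), default=None)
        if best is not None and abs(cur_freqs[best] - f) <= max_delta:
            pairs.append((i, best))
            unused.remove(best)
        else:
            deaths.append(i)        # track dies: its amplitude is ramped to zero
    births = sorted(unused)         # unmatched new peaks are born at zero amplitude
    return pairs, deaths, births

pairs, deaths, births = match_tracks([100.0, 300.0], [105.0, 290.0, 500.0], 20.0)
# pairs -> [(0, 0), (1, 1)], deaths -> [], births -> [2]
```

Dying tracks are faded out and born tracks faded in, which is what lets the synthesizer ride through voiced/unvoiced transitions without clicks.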

The Synthesis System

After the execution of the preceding parameter-extraction and peak-matching procedures, the information captured in those steps can be used in conjunction with a synthesis system to produce natural-sounding speech. Since a set of amplitudes, frequencies, and phases is estimated for each frame, it might seem reasonable to estimate the original speech waveform on the k th frame by using the equation

    s^k(n) = \sum_{l=1}^{L^k} \hat{A}_l^k \cos( n \hat{\omega}_l^k + \hat{\theta}_l^k )   (25)

where n = 0, 1, 2, ..., S - 1 and where S is the length of the synthesis frame. Because of the time-varying nature of the parameters, however, this straightforward approach leads to discontinuities at the frame boundaries. The discontinuities seriously degrade the quality of the synthetic speech. To circumvent this problem, the parameters must be smoothly interpolated from frame to frame.

The most straightforward approach for performing this interpolation is to overlap and add time-weighted segments of the sinusoidal components. This process uses the measured amplitude, frequency, and phase (referenced to the center of the synthesis frame) to construct a sine wave, which is then weighted by a triangular window over a duration equal to twice the length of the synthesis frame. The time-weighted components corresponding to the lagging edge of the triangular window are added to the overlapping leading-edge components that were generated during the previous frame. Real-time systems using this technique were operated with frames separated by 11.5 ms and 20.0 ms, respectively. While the synthetic speech produced by the first system was quite good, and almost indistinguishable from the original speech, the longer frame interval produced synthetic speech that was rough and, though quite intelligible, deemed to be of poor quality. Therefore, the overlap-add synthesizer is useful only for applications that can support a high frame rate. But there are many practical applications that require lower frame rates. These applications require an alternative to the overlap-add interpolation scheme.

A method will now be described that interpolates the matched sine-wave parameters directly. Since the frequency-matching algorithm associates all of the parameters measured for an arbitrary frame, k, with a corresponding set of parameters for frame k + 1, then letting

    \{ \hat{A}_l^k, \hat{\omega}_l^k, \hat{\theta}_l^k \} \quad \text{and} \quad \{ \hat{A}_l^{k+1}, \hat{\omega}_l^{k+1}, \hat{\theta}_l^{k+1} \}   (25a)

denote the successive sets of parameters for the l th frequency track, a solution to the amplitude-interpolation problem is to take

    \hat{A}(n) = \hat{A}^k + \frac{ \hat{A}^{k+1} - \hat{A}^k }{S} \, n   (26)

where n = 0, 1, ..., S - 1 is the time sample into the k th frame. (The track subscript l has been omitted for convenience.)

Unfortunately, this simple approach cannot be used to interpolate the frequency and phase, because the measured phase, \hat{\theta}^k, is obtained modulo 2\pi (Fig. 7). Hence, phase unwrapping must be performed to ensure that the frequency tracks are "maximally smooth" across frame boundaries. One solution to this problem, which uses a phase-interpolation function that is a cubic polynomial, has been developed in Ref. 6.

Fig. 7 - Requirement for unwrapped phase. (a) Interpolation of wrapped phase. (b) Interpolation of unwrapped phase.

The phase-unwrapping procedure provides each frequency track with an instantaneous unwrapped phase such that the frequencies and phases at the frame boundaries are consistent with the measured values modulo 2\pi. The unwrapped phase accounts both for the rapid phase changes due to the frequency of each sinusoidal component and for the slowly varying phase changes due to the glottal pulse and the vocal-tract transfer function.

Letting \tilde{\theta}_l(n) denote the unwrapped phase function for the l th track, then the final synthetic waveform will be given by

    \hat{s}(n) = \sum_{l=1}^{L^k} \hat{A}_l(n) \cos[ \tilde{\theta}_l(n) ]   (27)

where \hat{A}_l(n) is given by Eq. 26, \tilde{\theta}_l(n) is the cubic phase-interpolation function of Ref. 6,

    \tilde{\theta}_l(n) = \hat{\theta}_l^k + \hat{\omega}_l^k n + \alpha_l n^2 + \beta_l n^3   (28)

(with \alpha_l and \beta_l chosen so that the phase and frequency match the measured values at the frame boundaries), and L^k is the number of sine waves estimated for the k th frame.

Experimental Results

Figure 8 gives a block diagram description of the complete analysis/synthesis system. A non-real-time floating-point simulation was developed initially to determine the effectiveness of the proposed approach in modeling real speech. The speech processed in the simulation was low-pass-filtered at 5 kHz, digitized at 10 kHz, and analyzed at 10-ms frame intervals. A 512-point FFT using a pitch-adaptive Hamming window, with a width 2.5 times the average pitch, was used and found to be sufficient for accurate peak estimation.
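The per-track interpolation (Eq. 26 together with a cubic phase of the form of Eq. 28) can be sketched as follows. Treat the closed-form coefficients and the choice of the unwrapping integer M as an assumption-laden sketch of the maximally smooth construction attributed to Ref. 6, not a verbatim transcription of the published algorithm; frequencies are in radians per sample.

```python
import numpy as np

def cubic_phase_track(theta0, omega0, theta1, omega1, S):
    """Cubic phase over one synthesis frame of S samples:
    theta(n) = theta0 + omega0*n + a*n^2 + b*n^3, with end conditions
    theta(S) = theta1 + 2*pi*M and theta'(S) = omega1."""
    # 'Maximally smooth' choice of the unwrapping multiple M.
    x = ((theta0 + omega0 * S - theta1) + (omega1 - omega0) * S / 2) / (2 * np.pi)
    M = np.round(x)
    d = theta1 + 2 * np.pi * M - theta0 - omega0 * S
    a = 3.0 / S**2 * d - (omega1 - omega0) / S
    b = -2.0 / S**3 * d + (omega1 - omega0) / S**2
    n = np.arange(S + 1)
    return theta0 + omega0 * n + a * n**2 + b * n**3

def synth_track(A0, A1, theta):
    """Eq. 26 (linear amplitude) driving the cosine generator of Eq. 27
    for a single track."""
    S = len(theta) - 1
    A = A0 + (A1 - A0) * np.arange(S + 1) / S
    return A * np.cos(theta)

theta = cubic_phase_track(0.5, 0.05, 1.2, 0.06, S=100)
y = synth_track(1.0, 0.8, theta)
```

Whatever M is chosen, the construction guarantees that the phase at the frame boundary lands on the measured value modulo 2*pi and that the instantaneous frequency matches the measurement there, which is exactly the continuity the direct use of Eq. 25 lacks.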


[Fig. 8 block diagram. Analysis: the windowed input speech is transformed with a DFT, and peak picking yields the amplitudes, frequencies, and phases (via tan^{-1}). Synthesis: frame-to-frame phase unwrapping and interpolation drive a sine-wave generator, the amplitudes are linearly interpolated from frame to frame, and all sine waves are summed to form the synthetic speech output.]
Fig. 8 - This block diagram of the sinusoidal analysis/synthesis system illustrates the major functions subsumed within the system. Neither voicing decisions nor residual waveforms are required.

The maximum number of peaks used in synthesis was set to a fixed number (~80); if excess peaks were obtained, only the largest peaks were used.

A large speech data base was processed with this system, and the synthetic speech was essentially indistinguishable from the original. A visual examination of the reconstructed passages shows that the waveform structure is essentially preserved. This is illustrated by Fig. 9, which compares the waveforms of the original speech and the reconstructed speech during an unvoiced/voiced speech transition. The comparison suggests that the quasi-stationary conditions imposed on the speech model are met and that the use of the parametric model based on the amplitudes, frequencies, and phases of a set of sine-wave components is justified for both voiced and unvoiced speech.

In another set of experiments it was found that the system was capable of synthesizing a broad class of signals, including multispeaker waveforms, music, speech in a music background, and marine biologic signals such as whale sounds. Furthermore, the reconstruction does not break down in the presence of noise. The synthesized speech is perceptually indistinguishable from the original noisy speech, with essentially no modification of the noise characteristics. Illustrations depicting the performance of the system in the face of the above degradations are provided in Ref. 6. More recently, a real-time system has been completed using the Analog Devices ADSP2100 16-bit fixed-point signal-processing chips, and performance equal to that of the simulation has been achieved.

Although high-quality analysis/synthesis of speech has been demonstrated using the amplitudes, frequencies, and phases of the peaks


of the high-resolution STFT, it is often argued that the ear is insensitive to phase, a proposition that forms the basis of much of the work in narrowband speech coders. The question arises whether or not the phase measurements are essential to the sine-wave synthesis procedure. An attempt to explore this question was made by replacing each cubic phase track with a phase function defined to be the integral of the instantaneous frequency [3,7]. In this case the instantaneous frequency was taken to be the linear interpolation of the frequencies measured at the frame boundaries, and the integration started from a zero value at the birth of the track and continued to be evaluated along that track until the track died. This "magnitude-only" reconstruction technique was applied to several sentences of speech, and, while the resulting synthetic speech was very intelligible and free of artifacts, it was perceived as being different from the original speech, having a somewhat mechanical quality. Furthermore, the differences were more pronounced for low-pitched (ie, pitch < ~100 Hz) speakers. An example of a waveform synthesized by the magnitude-only system is shown in Fig. 10(b). Compared to the original speech, shown in Fig. 10(a), the synthetic waveform is quite different because of the failure to maintain the true sine-wave phases. In an additional experiment the magnitude-only system was applied to the synthesis of noisy speech; the synthetic noise took on a tonal quality that was unnatural and annoying.

Fig. 9 - (a) Original speech. (b) Reconstructed speech. A voiced/unvoiced segment of speech reconstructed with the sinusoidal analysis/synthesis system is compared to the original. The waveforms are nearly identical, justifying the use of the sinusoidal transform model.

Fig. 10 - (a) Original speech. (b) Reconstructed speech. The original speech is compared to the segment that has been reconstructed using the magnitude-only system. The differences brought about by ignoring the true sine-wave phases are clearly demonstrated.

Vocoding

Since the parameters of the sinusoidal speech model are the amplitudes, frequencies, and phases of the underlying sine waves, and since, for a typical low-pitched speaker, there can be as many as 80 sine waves in a 4-kHz speech bandwidth, it isn't possible to code all of the parameters directly. Attempts at parameter reduction use an estimate of the pitch to establish a set of harmonically related sine waves that provide a best fit to the input speech waveform.


The amplitudes, frequencies, and phases of this reduced set of sine waves are then coded. Since the neighboring sine-wave amplitudes are naturally correlated, especially for low-pitched speakers, pulse-code modulation is used to encode the differential log amplitudes.

In the earlier versions of the vocoder all the sine-wave amplitudes were coded simply by allocating the available bits to the number to be coded. Since a low-pitched speaker can produce as many as 80 sine waves, then in the limit of 1 bit/sine-wave amplitude, 4000 bps would be required at a 50-Hz frame rate. For an 8-kbps channel, this leaves 4.0 kbps for coding the pitch, energy, and about 12 baseband phases. However, at 4.8 kbps and below, assigning 1 bit/amplitude immediately exhausts the coding budget, so no phases can be coded. Therefore, a more efficient amplitude encoder had to be developed for operation at these lower rates.

The increased efficiency is obtained by allowing the channel separation to increase logarithmically with frequency, thereby exploiting the critical-band properties of the ear. Rather than implement a set of bandpass filters to obtain the channel amplitudes, as is done in the channel vocoder, an envelope of the sine-wave amplitudes is constructed by linearly interpolating between sine-wave peaks and sampling at the critical-band frequencies. Moreover, the location of the critical-band frequencies was made pitch-adaptive, whereby the baseband samples are linearly separated by the pitch frequency, with the log-spacing applied to the high-frequency channels as required.

In order to preserve the naturalness at rates of 4.8 kbps and below, a synthetic phase model was employed that phase-locks all of the sine waves to the fundamental and adds a pitch-dependent quadratic phase dispersion and a voicing-dependent random phase to each sine wave [8]. Using this technique at 4.8 kbps, the synthesized speech achieved a diagnostic rhyme test (DRT) score of 94 (for three male speakers). A real-time, 4800-bps system that uses two ADSP2100 signal-processing chips has been successfully implemented. The technology developed from this research has been transferred to the commercial sector (CYLINK, Inc.), where a product is being produced that will be available in 1989.

Transformations

The goal of time-scale modification is to maintain the perceptual quality of the original speech while changing the apparent rate of articulation. This implies that the frequency trajectories of the excitation (and thus the pitch contour) are stretched or compressed in time and that the vocal tract changes at the modified rate. To achieve these rate changes, the system amplitudes and phases, and the excitation amplitudes and frequencies, along each frequency track are time-scaled. Since the parameter estimates of the unmodified synthesis are available as continuous functions of time, in theory any rate change is possible. Rate changes ranging between a compression of two and an expansion of two have been implemented with good results. Furthermore, the natural quality and smoothness of the original speech were preserved through transitions such as voiced/unvoiced boundaries. Besides the above constant rate changes, linearly varying and oscillatory rate changes have been applied to synthetic speech, resulting in natural-sounding speech that is free of artifacts [9].

Since the synthesis procedure consists of summing the sinusoidal waveforms for each of the measured frequencies, the procedure is ideally suited for performing various frequency transformations. The procedure has been employed to warp the short-time spectral envelope and pitch contour of the speech waveform and, conversely, to alter the pitch while preserving the short-time spectral envelope. These speech transformations can be applied simultaneously so that time- and frequency (or pitch)-scaling can occur together by simultaneously stretching and shifting frequency tracks. These joint operations can be performed with a continuously adjustable rate change.

Audio Preprocessing

The problem of preprocessing speech that is to be degraded by natural or man-made disturbances
arises in applications such as attempts to increase the coverage area of AM radio broadcasts and in improving ground-to-air communications in a noisy cockpit environment. The transmitter in such cases is usually constrained by a peak operating power, or the dynamic range of the system is limited by the sensitivity characteristics of the receiver or ambient-noise levels. Under these constraints, phase dispersion and dynamic-range compression, along with spectral shaping (pre-emphasis), are combined to reduce the peak-to-rms ratio of the transmitted signal. In this way, more average power can be broadcast without changing the peak-power output of the transmitter, thus increasing loudness and intelligibility at the receiver while adhering to peak-output power constraints.

This problem is similar to one in radar in which the signal is periodic and given as the output of a transmit filter the input of which consists of periodic pulses. The spectral magnitude of the filter is specified, and the phase of the filter is chosen so that, over one period, the signal is frequency-modulated (a linear frequency modulation is usually employed) and has a flat envelope over some specified duration. If there is a peak-power limitation on the transmitter, this approach imparts maximal energy to the waveform. In the case of speech, the voiced-speech signal is approximately periodic and can be modeled as the output of a vocal-tract filter the input of which is a pulse train. The important difference between the radar design problem and the speech-audio preprocessing problem is that, in the case of speech, there is no control over the magnitude and phase of the vocal-tract filter. The vocal-tract filter is characterized by a spectral magnitude and some natural phase dispersion. Thus, in order to take advantage of the radar signal-design solutions to dispersion, the natural phase dispersion must first be estimated and removed from the speech signal. The desired phase can then be introduced.

All three of these operations, the estimation of the natural phase, the removal of this phase, and its replacement with the desired phase, have been implemented using the STS. The dispersive phase introduced into the speech waveform is derived from the measured vocal-tract spectral envelope and a pitch-period estimate that changes as a function of time; the dispersive solution thus adapts to the time-varying characteristics of the speech waveform. This phase solution is sampled at the sine-wave frequencies and linearly interpolated across frames to form the system phase component of each sine wave [10].

This approach lends itself to coupling phase dispersion, dynamic-range compression, and pre-emphasis via the STS. An example of the application of the combined operations on a speech waveform is shown in Fig. 11. A system using these techniques produced an 8-dB reduction in peak-to-rms level, about 3 dB better than commercial processors. This work is currently being extended for further reduction of peak-to-rms level and for further improvement of quality and robustness. Overall system performance will be evaluated by field tests conducted by Voice of America over representative transmitter and receiver links.

Conclusions

A sinusoidal representation for the speech waveform has been developed; it extracts the amplitudes, frequencies, and phases of the component sine waves from the short-time Fourier transform. In order to account for spurious effects due to sidelobe interaction and time-varying voicing and vocal-tract events, sine waves are allowed to come and go in accordance with a birth/death frequency-tracking algorithm. Once contiguous frequencies are matched, a maximally smooth phase-interpolation function is obtained that is consistent with all of the frequency and phase measurements. This phase function is applied to a sine-wave generator, which is amplitude-modulated and added to the other sine waves to form the output speech. It is important to note that, except in updating the average pitch (used to adjust the width of the analysis window), no voicing decisions are used in the analysis/synthesis procedure.

In some respects the basic model has similarities to one that Flanagan has proposed [11, 12]. Flanagan argues that, because of the nature of
the peripheral auditory system, a speech waveform can be expressed as the sum of the outputs of a fixed filter bank. The amplitude, frequency, and phase measurements of the filter outputs are then used in various configurations of speech synthesizers. Although the present work is based on the discrete Fourier transform (DFT), which can be interpreted as a filter bank, the use of a high-resolution DFT in combination with peak picking renders a highly adaptive filter bank, since only a subset of all of the DFT filters is used at any one frame. It is the use of the frequency tracker and the phase interpolator that allows the filter bank to move with the highly resolved speech components. Therefore, the system fits into the framework Flanagan described but, whereas Flanagan's approach is based on the properties of the peripheral auditory system, the present system is designed on the basis of properties of the speech production mechanism.

Attempts to perform magnitude-only reconstruction were made by replacing the cubic phase tracks with a phase that was simply the integral of the instantaneous frequency. While the resulting speech was very intelligible and free of artifacts, it was perceived as being different in quality from the original speech; the differences were more pronounced for low-pitched (i.e., pitch below about 100 Hz) speakers. When the magnitude-only system was used to synthesize noisy speech, the synthetic noise took on a tonal quality that was unnatural and annoying. It was concluded that this latter property would render the system unsuitable for applications in which the speech would be subjected to additive acoustic noise.

While it may be tempting to conclude that the ear is not phase-deaf, particularly for low-pitched speakers, it may be that this is simply a property of the sinusoidal analysis/synthesis system. No attempts were made to devise an experiment that would resolve this question conclusively. It was felt, however, that the system was well-suited to the design and execution of such an experiment, since it provides explicit access to a set of phase parameters that are essential to the high-quality reconstruction of speech.

Using the frequency tracker and the cubic phase-interpolation function resulted in a functional description of the time-evolution of the amplitude and phase of the sinusoidal components of the synthetic speech. For time-scale, pitch-scale, and frequency modification of speech, and for speech-coding applications, such a functional model is essential. However, if the system is used merely to produce synthetic speech using a set of sine waves, then the frequency-tracking and phase-interpolation procedures are unnecessary. In this case, the interpolation is achieved by overlapping and adding time-weighted segments of each of the sinusoidal components. The resulting synthetic speech is essentially indistinguishable from the original speech as long as the frame rate is at least 100 Hz.

A fixed-point 16-bit real-time implementation of the system has been developed on the Lincoln Digital Signal Processors [13] and using the Analog Devices ADSP2100 processors. Diagnostic rhyme tests have been performed.

Fig. 11. Audio preprocessing using phase dispersion and dynamic-range compression illustrates the reduction in the peak-to-rms level.
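The peak-to-rms reduction illustrated in Fig. 11 can be made concrete with a toy numerical sketch. This is not the authors' implementation: the flat-amplitude harmonic series and the Schroeder quadratic phase rule used below are illustrative assumptions, standing in for a voiced-excitation spectrum and for the paper's pitch-dependent quadratic phase dispersion. A zero-phase harmonic sum is pulse-like and has a high crest factor; imposing quadratic phases spreads the energy in time and lowers the peak-to-rms ratio.

```python
import math

# Flat-amplitude harmonic series, a crude stand-in for voiced excitation.
FS = 16000          # sample rate (Hz)
F0 = 100            # fundamental frequency (Hz)
K = 32              # number of harmonics
PERIOD = FS // F0   # samples in one pitch period

def crest_factor(x):
    """Peak-to-rms ratio of a sequence."""
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return max(abs(v) for v in x) / rms

def multitone(phases):
    """One period of a sum of K harmonics with the given phases."""
    return [sum(math.cos(2 * math.pi * k * F0 * n / FS + phases[k - 1])
                for k in range(1, K + 1))
            for n in range(PERIOD)]

# Zero phases: all harmonics align at n = 0, giving a pulse-like waveform.
crest_zero = crest_factor(multitone([0.0] * K))

# Schroeder's quadratic phase rule, an illustrative substitute for the
# pitch-dependent quadratic dispersion described in the text.
schroeder = [-math.pi * k * (k - 1) / K for k in range(1, K + 1)]
crest_disp = crest_factor(multitone(schroeder))

print(f"crest factor, zero phase:      {crest_zero:.2f}")
print(f"crest factor, quadratic phase: {crest_disp:.2f}")
```

Dispersing the phases leaves the power spectrum, and hence the perceived timbre, essentially unchanged, which is why the trick suits broadcast preprocessing: only the waveform's envelope, not its spectral magnitude, is altered.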

About one DRT point is lost relative to the unprocessed speech of the same bandwidth, with the analysis/synthesis system operating at a 50-Hz frame rate. The system is used in research aimed at the development of a multirate speech coder [14]. A practical low-rate coder has been developed at 4800 bps and 2400 bps using two commercially available DSP chips. Furthermore, the resulting technology has been transferred to private industry for commercial development. The sinusoidal analysis/synthesis system has also been applied successfully to problems in time-scale, pitch-scale, and frequency modification of speech [9] and to the problem of speech enhancement for AM radio broadcasting [10].

Acknowledgments

The authors would like to thank their colleague, Joseph Tierney, for his comments and suggestions in the early stages of this work. The authors would also like to acknowledge the support and encouragement of the late Anton Segota of RADC/EEV, who was the Air Force program manager for this research from its inception in September 1983 until his death in March 1987.

This work was sponsored by the Department of the Air Force.
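The conclusions note that when the frequency tracker and cubic phase interpolator are bypassed, synthesis can instead overlap-add time-weighted segments of each sinusoidal component. The sketch below illustrates that overlap-add idea only; the triangular weighting, hop size, and parameter format are assumptions chosen for the example, not the authors' real-time implementation. Because adjacent triangular windows sum to one, a steady sinusoid is reconstructed exactly in the interior of the output.

```python
import math

def overlap_add_synth(frames, hop):
    """Overlap-add synthesis from per-frame sine-wave parameters.

    frames: one list per analysis frame of (amplitude, frequency, phase)
            triples, with frequency in radians/sample and phase referenced
            to absolute time zero.
    hop:    frame spacing in samples; each frame is weighted by a
            triangular window of length 2*hop, so adjacent windows sum to 1.
    """
    out = [0.0] * ((len(frames) + 1) * hop)
    for l, params in enumerate(frames):
        for n in range(2 * hop):
            t = l * hop + n                  # absolute sample index
            win = 1.0 - abs(n - hop) / hop   # triangular weight
            for amp, freq, phase in params:
                out[t] += amp * win * math.cos(freq * t + phase)
    return out

# Demo: a single steady sine wave should come back unchanged wherever the
# triangular windows overlap-add to unity.
hop, n_frames = 100, 6
amp, freq, phase = 0.8, 0.2, 0.5
frames = [[(amp, freq, phase)] for _ in range(n_frames)]
y = overlap_add_synth(frames, hop)
max_err = max(abs(y[t] - amp * math.cos(freq * t + phase))
              for t in range(hop, n_frames * hop + 1))
print(f"max interior error: {max_err:.2e}")
```

When the frame parameters change from frame to frame, the same windowing cross-fades between the two parameter sets, which is the interpolation role the cubic phase tracks would otherwise play.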


References

1. B.S. Atal and J.R. Remde, "A New Model of LPC Excitation for Producing Natural-Sounding Speech at Low Bit Rates," Int. Conf. on Acoustics, Speech, and Signal Processing 82 (IEEE, New York, 1982), p. 614.
2. M.R. Schroeder and B.S. Atal, "Code-Excited Linear Prediction (CELP): High Quality Speech at Very Low Bit Rates," Int. Conf. on Acoustics, Speech, and Signal Processing 85 (IEEE, New York, 1985), p. 937.
3. P. Hedelin, "A Tone-Oriented Voice-Excited Vocoder," Int. Conf. on Acoustics, Speech, and Signal Processing 81 (IEEE, New York, 1981), p. 205.
4. L.B. Almeida and F.M. Silva, "Variable-Frequency Synthesis: An Improved Harmonic Coding Scheme," Int. Conf. on Acoustics, Speech, and Signal Processing 84 (IEEE, New York, 1984), p. 27.5.1.
5. H. Van Trees, Detection, Estimation and Modulation Theory, Part I (John Wiley, New York, 1968), Chap. 3.
6. R.J. McAulay and T.F. Quatieri, "Speech Analysis/Synthesis Based on a Sinusoidal Representation," IEEE Trans. Acoust. Speech Signal Process. ASSP-34, 744 (1986).
7. R.J. McAulay and T.F. Quatieri, "Magnitude-Only Reconstruction Using a Sinusoidal Speech Model," Int. Conf. on Acoustics, Speech, and Signal Processing 84 (IEEE, New York, 1984), p. 27.6.1.
8. R.J. McAulay and T.F. Quatieri, "Multirate Sinusoidal Transform Coding at Rates from 2.4 kbps to 8.0 kbps," Int. Conf. on Acoustics, Speech, and Signal Processing 87 (IEEE, New York, 1987), p. 1645.
9. T.F. Quatieri and R.J. McAulay, "Speech Transformations Based on a Sinusoidal Representation," IEEE Trans. Acoust. Speech Signal Process. ASSP-34, 1449 (Dec. 1986).
10. T.F. Quatieri and R.J. McAulay, "Sinewave-Based Phase Dispersion for Audio Preprocessing," Int. Conf. on Acoustics, Speech, and Signal Processing 88 (IEEE, New York, 1988), p. 2558.
11. J.L. Flanagan, "Parametric Coding of Speech Spectra," J. Acoust. Soc. of America 68, 412 (1980).
12. J.L. Flanagan and S.W. Christensen, "Computer Studies on Parametric Coding of Speech Spectra," J. Acoust. Soc. of America 68, 420 (1980).
13. P.E. Blankenship, "LDVT: High Performance Minicomputer for Real-Time Speech Processing," EASCON '75 (IEEE, New York, 1975), p. 214-A.
14. R.J. McAulay and T.F. Quatieri, "Mid-Rate Coding Based on a Sinusoidal Representation of Speech," Int. Conf. on Acoustics, Speech, and Signal Processing 85 (IEEE, New York, 1985), p. 945.


ROBERT J. McAULAY is a senior staff member in the Speech Systems Technology Group. He received a BASc degree in engineering physics (with honors) from the University of Toronto in 1962, an MSc degree in electrical engineering from the University of Illinois in 1963, and a PhD degree in electrical engineering from the University of California, Berkeley, in 1967. In 1967 he joined Lincoln Laboratory and worked on problems in estimation theory and signal/filter design using optimal control techniques. From 1970 to 1975 he was a member of the Air Traffic Control Division and worked on the development of aircraft tracking algorithms. Since 1975 he has been involved with the development of robust narrow-band speech vocoders. Bob received the M. Barry Carlton award in 1978 for the best paper published in the IEEE Transactions on Aerospace and Electronic Systems. In 1987 he was a member of the panel on Removal of Noise from a Speech/Noise Signal for Bioacoustics and Biomechanics of the National Research Council's Committee on Hearing.

THOMAS F. QUATIERI is a staff member in the Speech Systems Technology Group, where he is working on problems in digital signal processing and their application to speech enhancement, speech coding, and data communications. He received a BS degree (summa cum laude) from Tufts University and SM, EE, and ScD degrees from the Massachusetts Institute of Technology in 1975, 1977, and 1979, respectively. He was previously a member of the Sensor Processing Technology group, involved in multidimensional digital signal processing and image processing. Tom received the 1982 Paper Award of the IEEE Acoustics, Speech and Signal Processing Society for the best paper by an author under thirty years of age. He is a member of the IEEE Digital Signal Processing Technical Committee, Tau Beta Pi, Eta Kappa Nu, and Sigma Xi.

The Lincoln Laboratory Journal, Volume 1, Number 2 (1988)