Speech Processing Based on a Sinusoidal Model
R.J. McAulay and T.F. Quatieri

The Lincoln Laboratory Journal, Volume 1, Number 2 (1988)

Using a sinusoidal model of speech, an analysis/synthesis technique has been developed that characterizes speech in terms of the amplitudes, frequencies, and phases of the component sine waves. These parameters can be estimated by applying a simple peak-picking algorithm to a short-time Fourier transform (STFT) of the input speech. Rapid changes in the highly resolved spectral components are tracked by using a frequency-matching algorithm and the concept of "birth" and "death" of the underlying sine waves. For a given frequency track, a cubic phase function is applied to a sine-wave generator, whose output is amplitude-modulated and added to the sine waves generated for the other frequency tracks, and this sum is the synthetic speech output. The resulting waveform preserves the general waveform shape and is essentially indistinguishable from the original speech. Furthermore, in the presence of noise the perceptual characteristics of the speech and the noise are maintained. It was also found that high-quality reproduction was obtained for a large class of inputs: two overlapping, superposed speech waveforms; music waveforms; speech in musical backgrounds; and certain marine biologic sounds. The analysis/synthesis system has become the basis for new approaches to such diverse applications as multirate coding for secure communications, time-scale and pitch-scale algorithms for speech transformations, and phase dispersion for the enhancement of AM radio broadcasts. Moreover, the technology behind the applications has been successfully transferred to private industry for commercial development.

Speech signals can be represented with a speech production model that views speech as the result of passing a glottal excitation waveform through a time-varying linear filter (Fig. 1), which models the resonant characteristics of the vocal tract. In many speech applications the glottal excitation can be assumed to be in one of two possible states, corresponding to voiced or unvoiced speech.

In attempts to design high-quality speech coders at mid-band rates, more general excitation models have been developed. Approaches that are currently popular are multipulse [1] and code-excited linear predictive coding (CELP) [2]. This paper also develops a more general model for glottal excitation, but instead of using impulses as in multipulse, or code-book excitations as in CELP, the excitation waveform is assumed to be composed of sinusoidal components of arbitrary amplitudes, frequencies, and phases.

Other approaches to analysis/synthesis that are based on sine-wave models have been discussed. Hedelin [3] proposed a pitch-independent sine-wave model for use in coding the baseband signal for speech compression. The amplitudes and frequencies of the underlying sine waves were estimated using Kalman filtering techniques, and each sine-wave phase was defined to be the integral of the associated instantaneous frequency.

Another sine-wave-based speech system is being developed by Almeida and Silva [4]. In contrast to Hedelin's approach, their system uses a pitch estimate to establish a harmonic set of sine waves. The sine-wave phases are computed from the STFT at the harmonic frequencies. To compensate for any errors that might be introduced as a result of the harmonic sine-wave representation, a residual waveform is coded, along with the underlying sine-wave parameters.

Fig. 1. The binary model of speech production illustrated here requires pitch, vocal-tract parameters, energy levels, and voiced/unvoiced decisions as inputs. [Block diagram: an impulse-train generator (driven by the pitch period) and a random-noise generator feed a vocal-tract filter, whose output is the speech.]

This paper derives a sinusoidal model for the speech waveform, characterized by the amplitudes, frequencies, and phases of its component sine waves, that leads to a new analysis/synthesis technique. The glottal excitation is represented as a sum of sine waves that, when applied to a time-varying vocal-tract filter, leads to the desired sinusoidal representation for speech waveforms (Fig. 2).

Fig. 2. The sinusoidal speech model consists of an excitation and vocal-tract response. The excitation waveform is characterized by the amplitudes, frequencies, and phases of the underlying sine waves of the speech; the vocal tract is modeled by the time-varying linear filter, the impulse response of which is $h(t; \tau)$. [Block diagram: sine-wave generation produces the excitation $e(t)$, which passes through the vocal tract $h(t; \tau)$ to give $s(t)$.]

A parameter-extraction algorithm has been developed that shows that the amplitudes, frequencies, and phases of the sine waves can be obtained from the high-resolution short-time Fourier transform (STFT), by locating the peaks of the associated magnitude function. To synthesize speech, the amplitudes, frequencies, and phases estimated on one frame must be matched and allowed to evolve continuously into the set of amplitudes, frequencies, and phases estimated on a successive frame. These issues are resolved using a frequency-matching algorithm in conjunction with a solution to the phase-unwrapping and phase-interpolation problem.

A system was built and experiments were performed with it. The synthetic speech was judged to be of excellent quality, essentially indistinguishable from the original. The results of some of these experiments are discussed and pictorial comparisons of the original and synthetic waveforms are presented.

The above sinusoidal transform system (STS) has found practical application in a number of speech problems. Vocoders have been developed that operate from 2.4 kbps to 4.8 kbps, providing good speech quality that increases more or less uniformly with increasing bit rate. In another area the STS provides high-quality speech transformations such as time-scale and pitch-scale modifications. Finally, a large research effort has resulted in a new technique for speech enhancement for AM radio broadcasting. These applications are considered in more detail later in the text.

Speech Production Model

In the speech production model, the speech waveform, $s(t)$, is modeled as the output of a linear time-varying filter that has been excited by the glottal excitation waveform, $e(t)$. The filter, which models the characteristics of the vocal tract, has an impulse response denoted by $h(t; \tau)$. The speech waveform is then given by

$$ s(t) = \int_0^t h(t - \tau; t)\, e(\tau)\, d\tau. \tag{1} $$

If the glottal excitation waveform is represented as a sum of sine waves of arbitrary amplitudes, frequencies, and phases, then the model can be written as

$$ e(t) = \mathrm{Re} \sum_{l=1}^{L(t)} a_l(t) \exp\left\{ j \left[ \int_0^t \omega_l(\sigma)\, d\sigma + \phi_l \right] \right\} \tag{2} $$

where, for the $l$th sinusoidal component, $a_l(t)$ and $\omega_l(t)$ represent the amplitude and frequency (Fig. 2). Because the sine waves will not necessarily be in phase, $\phi_l$ is included to represent a fixed phase offset. This model leads to a particularly simple representation for the speech waveform. Letting

$$ H(\omega; t) = M(\omega; t) \exp[\, j \Phi(\omega; t)] \tag{3} $$

represent the time-varying vocal-tract transfer function and assuming that the excitation parameters given in Eq. 2 are constant throughout the duration of the impulse response of the filter in effect at time $t$, then using Eqs. 2 and 3 in Eq. 1 leads to the speech model given by

$$ s(t) = \sum_{l=1}^{L(t)} a_l(t)\, M[\omega_l(t); t] \exp\left\{ j \left[ \int_0^t \omega_l(\sigma)\, d\sigma + \Phi[\omega_l(t); t] + \phi_l \right] \right\}. \tag{4} $$

By combining the effects of the glottal and vocal-tract amplitudes and phases, the representation can be written more concisely as

$$ s(t) = \sum_{l=1}^{L(t)} A_l(t) \exp[\, j \psi_l(t)] \tag{5} $$

where

$$ A_l(t) = a_l(t)\, M[\omega_l(t); t] \tag{6} $$

$$ \psi_l(t) = \int_0^t \omega_l(\sigma)\, d\sigma + \Phi[\omega_l(t); t] + \phi_l \tag{7} $$

represent the amplitude and phase of the $l$th sine wave along the frequency track $\omega_l(t)$. Equations 5, 6, and 7 combine to provide a sinusoidal representation of a speech waveform. In order to use the model, the amplitudes, frequencies, and phases of the component sine waves must be extracted from the original speech waveform.

Estimation of Speech Parameters

The key problem in speech analysis/synthesis is to extract from a speech waveform the parameters that represent a quasi-stationary portion of that waveform, and to use those parameters (or coded versions of them) to reconstruct an approximation that is "as close as possible" to the original speech. The parameter-extraction algorithm, or estimator, should be robust, as the parameters must often be extracted from a speech signal that has been contaminated with acoustic noise.

In general, it is difficult to determine analytically which of the component sine waves and their amplitudes, frequencies, and phases are necessary to represent a speech waveform. Therefore, an estimator based on idealized speech waveforms was developed to extract these parameters. As restrictions on the speech waveform were relaxed in order to model real speech better, adjustments were made to the estimator to accommodate these changes.

In the development of the estimator, the time axis was first broken down into an overlapping sequence of frames, each of duration $T$. The center of the analysis window for the $k$th frame occurs at time $t_k$. Assuming that the vocal-tract and glottal parameters are constant over an interval of time that includes the duration of the analysis window and the duration of the vocal-tract impulse response, then Eq. 7 can be written as

$$ \psi_l(t) = \omega_l^k (t - t_k) + \theta_l^k \tag{8} $$

where the superscript $k$ indicates that the parameters of the model may vary from frame to frame. Using Eq. 8 in Eq. 5, the synthetic speech waveform over frame $k$ can be written as

$$ s(t) = \sum_{l=1}^{L^k} \gamma_l^k \exp[\, j \omega_l^k (t - t_k)] \tag{9} $$

where

$$ \gamma_l^k = A_l^k \exp(\, j \theta_l^k) \tag{9A} $$

represents the complex amplitude of the $l$th component of the $L^k$ sine waves. Since the measurements are made on digitized speech, sampled-data notation [$s(n)$] is used.

$$ \mathrm{sinc}(\omega_l^k - \omega_i^k) = \mathrm{sinc}[(l - i)\, \omega_0^k] = \begin{cases} 1 & \text{if } l = i \\ 0 & \text{if } l \neq i \end{cases} \tag{13} $$
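To make the frame model of Eq. 9 and the peak-picking idea concrete, the following Python sketch synthesizes one frame from a small set of sine-wave parameters and then recovers those parameters from the peaks of a zero-padded spectrum magnitude. It is an illustration only, not the authors' implementation: the function names `synthesize_frame` and `pick_peaks`, the Hann window, the FFT length, the relative threshold `rel_floor`, and the three test tones are all assumptions introduced here. Frequencies are in radians per sample, and the sample index `n` is measured from the frame center, so the measured phase at each peak corresponds to the phase term in Eq. 9A.

```python
import numpy as np

def synthesize_frame(amps, freqs, phases, n):
    """Real part of Eq. 9 in sampled form: sum over l of A_l * cos(n * w_l + theta_l)."""
    n = np.asarray(n, dtype=float)
    freqs = np.asarray(freqs, dtype=float)
    # Complex amplitudes gamma_l = A_l * exp(j * theta_l), as in Eq. 9A.
    gamma = np.asarray(amps, dtype=float) * np.exp(1j * np.asarray(phases, dtype=float))
    return np.real(np.sum(gamma[:, None] * np.exp(1j * freqs[:, None] * n[None, :]), axis=0))

def pick_peaks(frame, nfft=4096, max_peaks=10, rel_floor=0.05):
    """Estimate (amps, freqs, phases) from local maxima of the windowed spectrum magnitude.

    Assumptions of this sketch: odd frame length, Hann window, and a threshold
    (rel_floor times the largest peak) to discard window sidelobes.
    """
    frame = np.asarray(frame, dtype=float)
    if len(frame) % 2 == 0 or len(frame) > nfft:
        raise ValueError("frame length must be odd and no longer than nfft")
    half = len(frame) // 2
    window = np.hanning(len(frame))
    window /= window.sum()  # a sinusoid of amplitude A then peaks near A/2
    xw = frame * window
    # Rotate so the frame center sits at index 0; phases are then center-referenced.
    buf = np.zeros(nfft)
    buf[:half + 1] = xw[half:]
    buf[nfft - half:] = xw[:half]
    spectrum = np.fft.rfft(buf)
    mag = np.abs(spectrum)
    # Interior bins that exceed their left neighbor and are not below their right one.
    inner = mag[1:-1]
    idx = np.flatnonzero((inner > mag[:-2]) & (inner >= mag[2:])) + 1
    idx = idx[mag[idx] >= rel_floor * mag.max()]
    # Keep the strongest peaks, then report them in order of increasing frequency.
    idx = np.sort(idx[np.argsort(mag[idx])[::-1][:max_peaks]])
    return 2.0 * mag[idx], 2.0 * np.pi * idx / nfft, np.angle(spectrum[idx])

# Hypothetical test signal: three well-separated tones in a 401-sample frame.
n = np.arange(-200, 201)
frame = synthesize_frame([1.0, 0.5, 0.25], [0.3, 0.9, 1.7], [0.4, -1.0, 2.0], n)
est_amps, est_freqs, est_phases = pick_peaks(frame)
```

In this sketch the frequency estimates are limited to the FFT bin spacing of 2π/nfft, which is why a long zero-padded transform is used. Matching peaks from one frame to the next and interpolating the phase between frames are separate steps that the sketch does not attempt.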