
Proc. Natl. Acad. Sci. USA Vol. 92, pp. 9932-9937, October 1995 Colloquium Paper

This paper was presented at a colloquium entitled "Human-Machine Communication by Voice," organized by Lawrence R. Rabiner, held by the National Academy of Sciences at The Arnold and Mabel Beckman Center in Irvine, CA, February 8-9, 1993.

Models of speech synthesis

ROLF CARLSON

Department of Speech Communication and Music Acoustics, Royal Institute of Technology (KTH), S-100 44 Stockholm, Sweden

ABSTRACT The term "speech synthesis" has been used for diverse technical approaches. In this paper, some of the approaches used to generate synthetic speech in a text-to-speech system are reviewed, and some of the basic motivations for choosing one method over another are discussed. It is important to keep in mind, however, that speech synthesis models are needed not just for speech generation but to help us understand how speech is created, or even how articulation can explain language structure. General issues such as the synthesis of different voices, accents, and multiple languages are discussed as special challenges facing the speech synthesis community.

The term "speech synthesis" has been used for diverse technical approaches. Unfortunately, any speech output from computers has been claimed to be speech synthesis, perhaps with the exception of playback of recorded speech.* Some of the approaches used to generate true synthetic speech as well as high-quality waveform concatenation methods are presented below.

*The foundations for speech synthesis based on acoustical or articulatory modeling can be found in Fant (1), Holmes et al. (2), Flanagan (3), Klatt (4), and Allen et al. (5). The paper by Klatt (73) gives an extensive review of the developments in speech synthesis technology.

Knowledge About Natural Speech

Synthesis development can be grouped into three main categories: acoustic models, articulatory models, and models based on the coding of natural speech. The last group includes both linear predictive coding and methods using stored speech waveforms. Acoustic and articulatory models have had a long history of development, while natural speech models represent a somewhat newer field. The first commercial systems were based on the acoustic terminal analog technique. However, at that time, the voice quality was not good enough for general use, and approaches based on coding attracted increased interest. Articulatory models have been under continuous development, but so far this field has not been exposed to commercial applications due to incomplete models and high processing costs.

We can position the different synthesis methods along a "knowledge about speech" scale. Obviously, articulatory synthesis needs considerable understanding of the speech act itself, while models based on coding use such knowledge only to a limited extent. All synthesis methods have to model something that is partly unknown. Unfortunately, artificial obstacles due to simplifications or lack of coverage will also be introduced. A trend in current speech technology, both in speech understanding and speech production, is to avoid explicit formulation of knowledge and to use automatic methods to aid the development of the system. Since such analysis methods lack the human ability to generalize, the generalization has to be present in the data itself. Thus, these methods need large amounts of speech data. Models working close to the waveform are now typically making use of increased unit sizes while still modeling prosody by rule. In the middle of the scale, "formant synthesis" is moving toward the articulatory models by looking for "higher-level parameters" or to larger prestored units. Articulatory synthesis, hampered by lack of data, still has some way to go but is yielding improved quality, due mostly to advanced analysis-synthesis techniques.

Flexibility and Technical Dimensions

The synthesis field can be viewed from many different angles. We can group the models along a "flexibility" scale. Multilingual systems demand flexibility. Individual voices, speaking styles, and accents also need a flexible system in which explicit transformations can be modeled. Most of these variations are continuous rather than discrete. The importance of separating the modeling of speech knowledge from acoustic realization must be emphasized in this context.

In the overview by Furui (6), synthesis techniques are divided into three main classes: waveform coding, analysis-synthesis, and synthesis by rule. The analysis-synthesis method is defined as a method in which human speech is transformed into parameter sequences, which are stored. The output is created by a synthesis based on concatenation of the prestored parameters. In a synthesis-by-rule system the output is generated with the help of transformation rules that control the synthesis model, such as a vocal tract model, a terminal analog, or some kind of coding.

It is not an easy task to place different synthesis methods into unique classes. Some of the common "labels" are often used to characterize a complete system rather than the model it stands for. A rule-based system using waveform coding is a perfectly possible combination, as is a concatenative system using a terminal analog or a rule-based diphone system using an articulatory model. In the following pages, synthesis models will be described from two different perspectives: the sound-generating part and the control part of the system.
THE SOUND-GENERATING PART

The sound-generating part of the synthesis system can be divided into two subclasses, depending on the dimensions in which the model is controlled. A vocal tract model can be controlled by spectral parameters such as formant frequency and bandwidth or by shape parameters such as vocal tract size and length. The source model that excites the vocal tract usually has parameters to control the shape of the source waveform. The combination of time-based and frequency-based controls is powerful in the sense that each part of the system is expressed in its most explanatory dimensions. A drawback of the combined approach can be that it makes interaction between the source and the filter difficult. However, the merits seem to outweigh the drawbacks.

Simple Waveform Concatenation

The most radical solution to the synthesizer problem is simply to have a set of prerecorded messages stored for reproduction. Simple coding of the speech wave might be performed in order to reduce the amount of memory needed. The quality is high, but the usage is limited to applications with few messages. If units smaller than sentences are used, the quality degenerates because of the problem of connecting the pieces without distortion and overcoming prosodic inconsistencies. One important and often forgotten aspect in this context is that a vocabulary change can be an expensive and time-consuming process, since the same speaker and recording facility have to be used as with the original material. The whole system might have to be completely rebuilt in order to maintain equal quality of the speech segments.
Analysis-Synthesis Systems

Synthesis systems based on coding have as long a history as the field itself. The underlying philosophy is that natural speech is analyzed and stored in such a way that it can be assembled into new utterances. Systems such as those from AT&T Bell Laboratories (7-9), Nippon Telephone & Telegraph (NTT) (10, 11), and ATR Interpreting Telephone Research Laboratories (ATR) (12, 13) are based on the source-filter technique, where the filter is represented in terms of linear predictive coding (LPC) or equivalent parameters. This filter is excited by a source model that can be of the same kind as the one used in terminal analog systems. The source must be able to handle all types of sounds: voiced, unvoiced, and mixtures of the two.

Considerable success has been achieved by systems that base sound generation on concatenation of natural speech units (14). Sophisticated techniques have been developed to manipulate these units, especially with respect to duration and fundamental frequency. The most important aspects of prosody can be imposed on synthetic speech without considerable loss of quality. The pitch-synchronous overlap-add (PSOLA) (15) methods are based on concatenation of waveform pieces. The frequency domain approach (FD-PSOLA) is used to modify the spectral characteristics of the signal; the time domain approach (TD-PSOLA) provides efficient solutions for real-time implementation of synthesis systems. Earlier systems like SOLA (16) and systems for divers' speech restoration also did direct processing of the waveform (17).

Fig. 1 shows the basic function of a PSOLA-type system. A data base of carefully selected utterances is recorded, and each pitch period is marked. The speech signal is split into a sequence of windowed samples of the speech wave. At resynthesis time the waveforms are added according to the desired pitch, amplitude, and duration.

FIG. 1. Example of the PSOLA method (18). (The panels of the original figure show windowed pitch periods being respaced for an increase and a decrease of the pitch.)
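To make the overlap-add operation concrete, the following Python sketch implements a bare-bones TD-PSOLA resynthesis step under simplifying assumptions: the pitch marks are taken as given (the analysis stage is assumed done), each grain is two periods long under a Hann window, and pitch and time are modified by constant factors. The function name and all parameters are illustrative; this is a reading of the principle, not the implementation of refs. 15 or 18.

```python
import numpy as np

def td_psola(x, marks, pitch_scale=1.0, time_scale=1.0):
    """Minimal TD-PSOLA: re-space windowed two-period grains taken
    around the given pitch marks (sample indices, one per period)."""
    x = np.asarray(x, dtype=float)
    marks = np.asarray(marks)
    periods = np.gradient(marks).astype(int)   # local pitch period at each mark
    y = np.zeros(int(len(x) * time_scale))
    t_syn = float(marks[0]) * time_scale       # first synthesis mark
    while t_syn < len(y):
        # Nearest analysis mark for this synthesis time (time-scale mapping).
        i = int(np.argmin(np.abs(marks - t_syn / time_scale)))
        T = max(int(periods[i]), 1)
        lo, hi = marks[i] - T, marks[i] + T
        if lo >= 0 and hi <= len(x):
            grain = x[lo:hi] * np.hanning(hi - lo)   # two-period windowed grain
            start = int(t_syn) - T
            a, b = max(start, 0), min(start + len(grain), len(y))
            y[a:b] += grain[a - start:b - start]
        t_syn += T / pitch_scale   # the grain spacing sets the new pitch
    return y
```

Raising pitch_scale packs the grains more densely, as in the pitch increase of Fig. 1, while time_scale stretches the mapping between analysis and synthesis time axes without changing the pitch.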

Source Models

The traditional source model for the voiced segments has been a simple or double impulse. This is one reason why text-to-speech systems from the 1980s have had serious problems, especially when different voices are modeled. While the male voice sometimes has been regarded as generally acceptable, an improved glottal source will open the way to more realistic synthesis of child and female voices and also to more naturalness and variation in male voices.

Most source models work in the time domain with different controls to manipulate the pulse shape (19-23). One version of such a voice source is the LF-model (24). It has a truncated exponential sinusoid followed by a variable cut-off 6 dB/octave low-pass filter modeling the effect of the return phase, that is, the time from maximum excitation of the vocal tract to complete closure of the vocal folds. Fig. 2 explains the function of the control parameters. In addition to the amplitude and fundamental frequency control, two parameters influence the amplitudes of the two to three lowest harmonics, and one parameter influences the high-frequency content of the spectrum. Another vocal source parameter is the diplophonia parameter (22), with which creak, laryngealization, or diplophonia can be simulated. This parameter influences the function of the voiced source in such a way that every second pulse is lowered in amplitude and shifted in time.

FIG. 2. Influence of the parameters Rg, Rk, and Ra on the differentiated glottal flow pulse shape and spectrum (25). The spectra are preemphasized by 6 dB/octave. (Fundamental frequency: 100 Hz.)
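The following Python sketch generates one period of an LF-type differentiated glottal flow pulse from the R-parameters of Fig. 2. It is a simplified, assumption-laden rendering of the model in ref. 24, not the published fitting procedure: Rg sets the glottal frequency (peak-flow time tp = T0/2Rg), Rk the open-phase asymmetry (te = tp(1 + Rk), assuming 0 < Rk < 1), and Ra the return-phase duration (ta = Ra T0); the return-phase constant is found by fixed-point iteration, and the growth factor by bisection so that the pulse integrates to zero net flow.

```python
import numpy as np

def lf_pulse(f0=100.0, Rg=1.0, Rk=0.4, Ra=0.01, fs=16000):
    """One period of LF-style differentiated glottal flow (excitation
    amplitude Ee normalized to 1); parameterization as in Fig. 2."""
    T0 = 1.0 / f0
    tp = T0 / (2.0 * Rg)          # instant of peak glottal flow
    te = tp * (1.0 + Rk)          # instant of main excitation
    ta = Ra * T0                  # effective return-phase duration
    tc = T0                       # complete closure at end of period
    wg = np.pi / tp

    # Return-phase constant eps from the closure condition
    # eps*ta = 1 - exp(-eps*(tc - te)), solved by fixed-point iteration.
    eps = 1.0 / ta
    for _ in range(50):
        eps = (1.0 - np.exp(-eps * (tc - te))) / ta

    t = np.arange(int(round(T0 * fs))) / fs
    op = t <= te                  # open phase; the rest is the return phase

    def wave(alpha):
        e = np.zeros_like(t)
        # Exponentially growing sinusoid, scaled so that E(te) = -1.
        e[op] = np.exp(alpha * (t[op] - te)) * np.sin(wg * t[op]) / (-np.sin(wg * te))
        # Exponential return phase, continuous at te and zero at tc.
        e[~op] = -(np.exp(-eps * (t[~op] - te)) - np.exp(-eps * (tc - te))) / (eps * ta)
        return e

    # Choose alpha so the pulse integrates to zero net flow (bisection).
    lo, hi = -1e4, 1e5
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if wave(mid).sum() > 0.0 else (lo, mid)
    return wave(0.5 * (lo + hi))
```

Lowering Rg or Rk reshapes the pulse near its peak and thereby the lowest harmonics, while increasing Ra lengthens the return phase and removes high-frequency energy, mirroring the spectra of Fig. 2.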
The generation of source models has to include adequate modeling of noise excitation in order to synthesize a natural change between voiced and unvoiced segments. The work of Rothenberg (26) can serve as a guide for future implementations. In some earlier work at the Royal Institute of Technology (KTH), we were able to use a model that included a noise source (27). High-quality synthesis of extralinguistic sounds such as laughter could be produced with this model in addition to reasonable voiced-unvoiced transitions.

The acoustic interactions between the glottal source and the vocal tract also must be considered (28). One of the major factors in this respect is the varying bandwidth of the formants. This is especially true for the first formant, which can be heavily damped during the open phase of the glottal source. However, it is not clear that such a variation can be perceived by a listener (29). Listeners tend to be rather insensitive to bandwidth variation (3). In more complex models the output is glottal opening rather than glottal flow. The subglottal cavities can then be included in an articulatory model.

Noise sources have attracted much less research effort than the voiced source. However, some aspects have been discussed by Stevens (30), Shadle (31), and Badin and Fant (32). Today, simple white noise typically is filtered by resonances that are stationary within each parameter frame. The new synthesizers do have some interaction between the voice source and the noise source, but the interaction is rather primitive. Transient sounds and aspiration dependent on vocal cord opening are still under development.

Formant-Based Terminal Analog

The traditional text-to-speech systems use a terminal analog based on formant filters. The vocal tract is simulated by a sequence of second-order filters in cascade, while a parallel structure is used mostly for the synthesis of consonants. One important advantage of a cascade synthesizer is the automatic setting of formant amplitudes. The disadvantage is that it sometimes can be hard to do detailed spectral matching between natural and synthesized spectra because of the simplified model. Parallel synthesizers such as that of Holmes (33) do not have this limitation.
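A minimal Python sketch may make the cascade structure concrete (the formant values below are illustrative and not taken from any of the synthesizers cited): each formant is one second-order resonator, and because the resonators sit in series, the relative formant amplitudes come out automatically, which is the advantage noted above.

```python
import numpy as np
from scipy.signal import lfilter

def resonator(f, bw, fs):
    """Second-order digital resonator with unity gain at DC,
    the standard terminal-analog building block."""
    c = -np.exp(-2.0 * np.pi * bw / fs)
    b = 2.0 * np.exp(-np.pi * bw / fs) * np.cos(2.0 * np.pi * f / fs)
    a = 1.0 - b - c
    return [a], [1.0, -b, -c]

def cascade(source, formants, fs):
    """Run the source waveform through the formant resonators in series."""
    y = np.asarray(source, dtype=float)
    for f, bw in formants:
        num, den = resonator(f, bw, fs)
        y = lfilter(num, den, y)
    return y

# A rough /a/-like vowel from a 100-Hz impulse train (values illustrative):
fs = 16000
src = np.zeros(fs // 2)
src[:: fs // 100] = 1.0
vowel = cascade(src, [(700, 80), (1100, 90), (2600, 120)], fs)
```

A parallel structure would instead require an explicit amplitude control for every branch, which is precisely the extra freedom, and the extra burden, that parallel synthesizers such as Holmes's exploit.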
The Klatt model is widely used in research for both general synthesis purposes and perceptual experiments. A simplified version of this system is used in all commercial products that stem from synthesis work at the Massachusetts Institute of Technology (MIT): MITalk (5), DECtalk, and the system at Speech Technology Laboratory (34). An improved version of the system has been commercialized as a research vehicle by Sensimetrics Corporation (35). Similar configurations were used in the ESPRIT/Polyglot project (36).

A formant terminal analog, GLOVE (37), based on the OVE synthesizer (38), has been developed at KTH and is used in current text-to-speech modeling (39, 40). The main difference between these and the Klatt model is the manner in which consonants are modeled. In the OVE a fricative is filtered by a zero-pole-pole configuration rather than by a parallel system. The same is true for the nasal branch of the synthesizer.

New parameters have been added to the terminal analog model so that it is now possible to simulate most human voices and to replicate an utterance without noticeable quality reduction. However, it is interesting to note that some voices are easier to model than others. Despite the progress, speech quality is not natural enough in all applications of text to speech. The main reasons for the limited success in formant-based synthesis can be explained by incomplete phonetic knowledge. It should be noted that the transfer of knowledge from phonetics to speech technology has not been an easy process. Another reason is that the efforts using formant synthesis have not explored control methods other than the explicit rule-based description.

Higher-Level Parameters

Since the control of a formant synthesizer can be a very complex task, some efforts have been made to help the developer. The "higher-level parameters" (35, 41) explore an intermediate level that is more understandable from the developer's point of view compared to the detailed synthesizer specifications. The goal of this approach is to find a synthesis framework to simplify the process and to incorporate the constraints that are known to exist within the process. A formant frequency should not have to be adjusted specifically by the rule developer depending on nasality or glottal opening. This type of adjustment might be better handled automatically according to a well-specified model. The same process should occur with other parameters such as bandwidths and glottal settings. The approach requires a detailed understanding of the relationship between acoustic and articulatory phonetics.

Articulatory Models

An articulatory model will ultimately be the most interesting and flexible solution for the sound-generating part of text-to-speech systems. Development is also advancing in this area, but the lack of reliable articulatory data and appropriate control strategies still presents challenges. One possible solution that has attracted interest is to automatically train neural networks to control such a synthesizer. Rahim et al. (42) and Bailly et al. (43) have explored such methods.

Articulatory models, now under improvement, stem from basic work carried out at such laboratories as AT&T Bell Laboratories, MIT, and KTH. At each time interval, an approximation of the vocal tract is used either to calculate the corresponding transfer function or to directly filter a source waveform. Different vocal tract models have been used based on varying assumptions and simplifications. The models by Flanagan et al. (44), Coker (45), and Mermelstein (46) have been studied by many researchers in their development of current articulatory synthesis.

The term "articulatory modeling" is often used rather loosely. Only part of the synthesis model is usually described in physical terms, while the remaining part is described in a simplified manner. Compare, for example, the difference between a tube model that models a static shape of the vocal tract and a dynamic physical model that actually describes how the articulators move. Thus, a complete articulatory model for speech synthesis has to include several transformations. The relationship between an articulatory gesture and a sequence of vocal tract shapes must be modeled. Each shape must be transformed into some kind of tube model with its acoustic characteristics. The acoustics of the vocal tract can then be modeled in terms of an electronic network. At this point, the developer can choose to use the network as such to filter the source signal. Alternatively, the acoustics of the network can be expressed in terms of resonances that can control a formant-based synthesizer. The main difference is the domain, time or frequency, in which the acoustics is simulated.

The developer has to choose at which level the controlling part of the synthesis system should connect to the synthesis model. All levels are possible, and many have been used. One of the pioneering efforts using articulatory synthesis as part of a text-to-speech system was done by AT&T Bell Laboratories (45). Lip, jaw, and tongue positions were controlled by rule. The final synthesis step was done by a formant-based terminal analog. Current efforts at KTH by Lin and Fant (47) use a parallel synthesizer with parameters derived from an articulatory model. In the development of articulatory modeling for text to speech, we can take advantage of parallel work on speech coding based on articulatory modeling (48). This work focuses not only on synthesizing speech but also on how to extract appropriate vocal tract configurations. Thus, it will also help us to get articulatory data through an analysis-synthesis procedure. This section has not dealt with the important work carried out to describe speech production in terms of physical models. The inclusion of such models still lies in the future, beyond the next generation of text-to-speech systems, but the results of these experiments will improve the current articulatory and terminal analog models.

THE CONTROL PART

Models of segmental coarticulation and other phonetic factors are an important part of a text-to-speech system. The control part of a synthesis system calculates the parameter values at each time frame. Two main types of approaches can be distinguished: rule-based methods that use an explicit formulation of existing knowledge and library-based methods that replace rules by a collection of segment combinations. Clearly, each approach has its advantages. If the data are coded in terms of targets and slopes, we need methods to calculate the parameter tracks. The efforts of Holmes et al. (2) and the filtered square wave approach by Liljencrants (49) provide some classical examples in this context.
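As a toy illustration of how a control component might turn segment targets into per-frame parameter tracks, the sketch below holds each target as a step and low-passes the steps into smooth transitions. This is only loosely in the spirit of the filtered-square-wave idea (49); the names, rates, and smoothing constant are invented for the example.

```python
import numpy as np

def parameter_track(segments, frame_rate=200.0, smooth_ms=20.0):
    """segments: (duration_seconds, target_value) pairs for one parameter.
    Returns one smoothed parameter value per frame."""
    steps = np.concatenate([
        np.full(int(round(dur * frame_rate)), float(target))
        for dur, target in segments
    ])
    alpha = np.exp(-1.0 / (smooth_ms * 1e-3 * frame_rate))  # one-pole constant
    track = np.empty_like(steps)
    acc = steps[0]
    for i, s in enumerate(steps):
        acc = alpha * acc + (1.0 - alpha) * s   # low-pass the step function
        track[i] = acc
    return track

# e.g. a second-formant track over three segments (values illustrative):
f2 = parameter_track([(0.08, 1800.0), (0.12, 1200.0), (0.10, 1500.0)])
```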
To illustrate the problem, I have chosen some recent work by Slater and Hawkins (50). The work was motivated by the need to improve the rule system in a text-to-speech system for British English. Data for the second formant frequency at the onset of a vowel after a velar stop and at the midpoint in the vowel were analyzed, and, as expected, a clear correlation between the frequencies at these positions could be noted. The data could be described by one, two, or three regression lines, depending on the need for accuracy. This could then be modeled by a set of rules. As an alternative, all data points can be listed. Unfortunately, the regression lines change their coefficients depending on a number of factors such as position and stress. To increase the coverage, we need to expand the analysis window and include more dimensions or increase the number of units. Eventually, we will reach a point where the rules become too complex or the data collection becomes too huge. This is the point where new dimensions such as articulatory parameters might be the ultimate solution.

Concatenation of Units

One of the major problems in concatenative synthesis is to make the best selection of units and describe how to combine them. Two major factors create problems: distortion because of spectral discontinuity at the connecting points and distortion because of the limited size of the unit set. Systems using elements of different lengths depending on the target phoneme and its function have been explored by several research groups. In a paper by Olive (8), a new method for concatenating "acoustic inventory elements" of different sizes is described. The system developed at ATR is also based on nonuniform units (13).

Special methods to generate a unit inventory have been proposed by the research group at NTT in Japan (10, 11). The synthesis allophones are selected with the help of the context-oriented clustering (COC) method. The COC searches for the phoneme sequences of different sizes that best describe the phoneme realization.

The context-oriented clustering approach is a good illustration of a current trend in speech synthesis: automatic methods based on data bases. The studies are concerned with much wider phonetic contexts than before. (It might be appropriate to remind the reader of similar trends in speech recognition.) One cannot take into account all possible coarticulation effects by simply increasing the number of units. At some point, the total number might be too high, or some units might be based on very few observations. In this case a normalization of data might be a good solution before the actual unit is chosen. The system will then become a rule-based system. However, the rules can be automatically trained from data in the same way as in speech recognition (51).

Rules and Notations

Development tools for text-to-speech systems have attracted considerable efforts. The publication of The Sound Pattern of English by Chomsky and Halle (52) impelled a new kind of synthesis system based on rewrite rules. Their ideas inspired researchers to create special rule compilers for text-to-speech developments in the early 1970s. New software is still being developed according to this basic principle, but the implementations vary depending on the developer's tastes. It is important to note that crucial decisions often are hidden in the systems. The rules might operate rule by rule or segment by segment. Other important decisions are based on the following questions: How is the backtracking organized? Can nonlinear phonology be used (53), as in the systems described by Hertz (54, 55) and the Institute for Perception Research (56, 57)? Are the default values in the library primarily referred to by labels or features? These questions might seem trivial, but we see many examples of how the explicit design of a system influences the thinking of the researcher.
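The flavor of such rewrite-rule systems can be suggested in a few lines of Python. The classical notation a -> b / c _ d ("rewrite a as b between c and d") maps naturally onto regular-expression substitution; the rules and symbols below are invented for illustration, and this interpreter applies each rule across the whole utterance before moving on, the "rule by rule" rather than "segment by segment" choice mentioned above.

```python
import re

# Toy rules in the a -> b / c _ d tradition; phones are space-separated,
# "#" is a word boundary and "'" a stress mark (all invented conventions).
RULES = [
    (r"z(?= #)", "s"),      # final devoicing:      z -> s / _ #
    (r"n(?= [pb])", "m"),   # place assimilation:   n -> m / _ p,b
    (r"a(?! ')", "@"),      # reduce unstressed a:  a -> @ unless stress follows
]

def apply_rules(phones):
    s = " ".join(phones + ["#"])
    for pattern, replacement in RULES:   # rule by rule over the whole string
        s = re.sub(pattern, replacement, s)
    return s.split()[:-1]

print(apply_rules(["b", "a", "n", "p", "a", "'", "z"]))
# -> ['b', '@', 'm', 'p', 'a', "'", 's']
```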
Automatic Learning

Synthesis has traditionally been based on very labor-intensive optimization work. Until recently, the notion of analysis by synthesis had been explored mainly by manual comparisons between hand-tuned spectral slices and a reference spectrum. The work of Holmes and Pearce (58) is a good example of how to speed up this process. With the help of a synthesis model, spectra are automatically matched against analyzed speech. Automatic techniques such as this will probably also play an important role in making speaker-dependent adjustments. One advantage of these methods is that the optimization is done in the same framework as that to be used in the production. The synthesizer constraints are thus already imposed in the initial state.
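The shape of such an automatic matching loop can be sketched in a few lines of Python. This is not the procedure of ref. 58, only an illustration of the principle: synthesize a spectrum from candidate formant settings, compare it with the analyzed frame, and keep whichever local adjustment reduces the spectral error, so that the optimization is carried out inside the synthesizer's own constraints. The target frame would come from, e.g., an FFT or LPC analysis of natural speech; all names and step sizes are invented.

```python
import numpy as np
from scipy.signal import freqz

def cascade_db(formant_freqs, bw, fs, n=256):
    """Log-magnitude spectrum of a cascade of second-order resonators."""
    h = np.ones(n, dtype=complex)
    for f in formant_freqs:
        c = -np.exp(-2 * np.pi * bw / fs)
        b = 2 * np.exp(-np.pi * bw / fs) * np.cos(2 * np.pi * f / fs)
        _, hf = freqz([1 - b - c], [1, -b, -c], worN=n, fs=fs)
        h *= hf
    return 20 * np.log10(np.abs(h) + 1e-9)

def match_formants(target_db, fs=16000, bw=100.0, step=50.0, passes=10):
    """Greedy coordinate search over formant frequencies that pulls the
    synthetic spectrum toward the analyzed frame `target_db`."""
    freqs = np.array([500.0, 1500.0, 2500.0])   # initial guesses (illustrative)
    err = lambda fv: np.sum((cascade_db(fv, bw, fs, len(target_db)) - target_db) ** 2)
    for _ in range(passes):
        for i in range(len(freqs)):
            cand = freqs[i] + np.array([-step, 0.0, step])
            trials = [err(np.r_[freqs[:i], [c], freqs[i + 1:]]) for c in cand]
            freqs[i] = cand[int(np.argmin(trials))]
    return freqs
```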
Methods for pitch-synchronous analysis will be of major importance in this context. Experiments such as the one presented by Talkin and Rowley (59) will lead to better estimates of pitch and vocal tract shape. These automatic procedures will, in the future, make it possible to gather a large amount of data. Lack of glottal source data currently is a major obstacle for the development of speech synthesis with improved naturalness.

Given that we have a collection of parameter data from analyzed speech corpora, we are in a good position to look for coarticulation rules and context-dependent variations. The collection of speech corpora also facilitates the possibilities of testing duration and intonation models (60-63).

SPEAKING CHARACTERISTICS AND SPEAKING STYLES

Currently available text-to-speech systems are not characterized by a great amount of flexibility, especially not when it comes to variations in voice or speaking style. On the contrary, the emphasis has been on a neutral way of reading, modeled after the reading of nonrelated sentences. There is, however, a very practical need for different speaking styles in text-to-speech systems. Such systems are now used in a variety of applications, and many more are projected as the quality is improved. The range of applications demands a variation close to that found in human speakers. General use in reading stock quotations, weather reports, electronic mail, or warning messages are examples in which humans would choose rather different ways of reading. Apart from these practical needs in text-to-speech systems, there is the scientific interest in formulating our understanding of human speech variability in explicit models.

The current ambition in speech synthesis research is to model natural speech at a global level, allowing for changes of speaker characteristics and speaking style. One obvious reason is the limited success in enhancing the general speech quality by only improving the segmental models. The speaker-specific aspects are regarded as playing a very important role in the acceptability of synthetic speech. This is especially true when the systems are used to signal semantic and pragmatic knowledge.

One interesting effort to include speaker characteristics in a complex system has been reported by the ATR group in Japan. The basic concept is to preserve speaker characteristics in interpreting systems (64). The proposed voice conversion technique consists of two steps: mapping code book generation of LPC parameters and a conversion synthesis using the mapping code book. The effort has stimulated much discussion, especially considering the application as such. The method has been extended from a frame-by-frame transformation to a segment-by-segment transformation (65).

One concern with this type of effort is that the speaker characteristics are specified through training without a specific higher-level model of the speaker. It would be helpful if the speaker characteristics could be modeled by a limited number of parameters. Only a small number of sentences might in this case be needed to adjust the synthesis to one specific speaker. The needs in both speech synthesis and speech recognition are very similar in this respect.

A voice conversion system that combines the PSOLA technique for modifying prosody with a source-filter decomposition that enables spectral transformations has been proposed (66). Duration-dependent vowel reduction has been another topic of research in this area. It seems that vowel reduction as a function of speech tempo is a speaker-dependent factor (67). Duration and intonation structures and pause insertion strategies reflecting variability in the dynamic speaking style are other important speaker-dependent factors. Parameters such as consonant-vowel ratio and source dynamics are typical parameters that must be considered in addition to basic physiological variations.

The differences between male and female speech have been studied by a few researchers (22, 68). A few systems, such as that of Syrdal (69), use a female voice as a reference speaker. The male voice differs from the female voice in many respects in addition to the physiological aspects. To a great extent, speaking habits are formed by the social environment, dialect region, sex, education, and by a communicative situation that may require formal or informal speech. A speaker's characteristics must be viewed as a complete description of the speaker in which all aspects are linked to each other in a unique framework (70, 71).
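A toy rendering of the mapping code book idea described earlier in this section (64) can be sketched as follows. It assumes the training material from the two speakers is already time-aligned frame for frame (in practice, a dynamic-time-warping step that this sketch omits), and it maps generic spectral frames rather than LPC parameters; all names are invented.

```python
import numpy as np

def train_mapping_codebook(src, tgt, k=64, iters=20, seed=0):
    """Cluster the source speaker's frames (plain k-means), then pair each
    code word with the mean of the target speaker's aligned frames."""
    rng = np.random.default_rng(seed)
    code = src[rng.choice(len(src), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = ((src[:, None, :] - code[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if np.any(labels == j):
                code[j] = src[labels == j].mean(0)
    mapped = np.stack([tgt[labels == j].mean(0) if np.any(labels == j) else code[j]
                       for j in range(k)])
    return code, mapped

def convert(frames, code, mapped):
    """Frame-by-frame conversion: replace each frame by its mapped code word."""
    labels = ((frames[:, None, :] - code[None, :, :]) ** 2).sum(-1).argmin(1)
    return mapped[labels]
```

The segment-by-segment extension (65) replaces the per-frame lookup with a mapping over longer stretches, which reduces frame-to-frame discontinuities.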
Multilingual Synthesis

Many societies in the world are increasingly multilingual. The situation in Europe is an especially striking example of this. Most of the population is in touch with more than one language. This is natural in multilingual societies such as Switzerland and Belgium. Most schools in Europe have foreign languages on their mandatory curriculum. With the opening of the borders in Europe, more and more people will be in direct contact with several languages on an almost daily basis. For this reason, text-to-speech devices, whether they are used professionally or not, ought to have a multilingual capability. Based on this understanding, many synthesis efforts are multilingual in nature.

The Polyglot project, supported by the European ESPRIT program, was a joint effort by several laboratories in several countries. The common software in this project was, to a great extent, language independent, and the language-specific features were specified by rules, lexica, and definitions rather than by the software itself. This is also the key to the multilingual effort at KTH. About one-third of the systems delivered by the company INFOVOX are multilingual. The synthesis work pursued at companies such as ATR, CNET, DEC, and AT&T Bell Laboratories is also multilingual. It is interesting to see that the world's research community is rather small. Several of the efforts are joint ventures, such as the CNET and CSTR British synthesis and the cooperation between Japanese (ATR) and U.S. partners. The Japanese company Matsushita even has a U.S. branch (STL) for its English effort, originally based on MITalk.

Speech Quality

The ultimate goal for synthesis research, with few exceptions, is to produce the highest speech quality possible. The quality and the intelligibility of speech are usually very difficult to measure. No single test is able to pinpoint where the problems lie. The Department of Psychology at Indiana University started a new wave of innovation in evaluation of synthesis systems, to which a number of groups have made subsequent substantial contributions. But we are still looking for a simple way to measure progress quickly and reliably as we continue development of speech synthesis systems. The recent work that has been done in the ESPRIT/SAM projects, the COCOSDA group, and special workshops will set new standards for the future.

CONCLUDING REMARKS

In this paper a number of different synthesis methods and research goals to improve current text-to-speech systems have been touched on. It might be appropriate to remind the reader that nearly all methods are based on historical developments, where new knowledge has been added piece by piece to old knowledge rather than by a sudden change of approach. Perhaps the most dramatic change is in the field of synthesis tools rather than in the understanding of the "speech code." However, considerable progress can be seen in terms of improved speech synthesis quality. Today, speech synthesis is an appreciated facility even outside the research world, especially as applied to speaking aids for persons with disabilities. New synthesis techniques under development in speech research laboratories will play a key role in future man-machine interaction. The ultimate test of our descriptions is our ability to successfully synthesize not only different voices and accents but also different speaking styles (72). Appropriate modeling of these factors will increase both the naturalness and intelligibility of synthetic speech.

I would like to thank Björn Granström for valuable discussions during the preparation of this paper. This work has been supported by grants from the Swedish National Board for Technical Development.

1. Fant, G. (1960) Acoustic Theory of Speech Production (Mouton, The Hague, The Netherlands).
2. Holmes, J., Mattingly, I. G. & Shearme, J. N. (1964) Lang. Speech 7, 127-143.
3. Flanagan, J. L. (1972) Speech Analysis, Synthesis and Perception (Springer, Berlin).
4. Klatt, D. K. (1976) IEEE Trans. ASSP-24.
5. Allen, J., Hunnicutt, M. S. & Klatt, D. (1987) The MITalk System (Cambridge Univ. Press, Cambridge, U.K.).
6. Furui, S. (1989) Digital Speech Processing, Synthesis, and Recognition (Dekker, New York).
7. Olive, J. P. (1977) Proc. ICASSP-77, 568-570.
8. Olive, J. P. (1990) in Proceedings of the ESCA Workshop on Speech Synthesis (Autrans, France).
9. Olive, J. P. & Liberman, M. Y. (1985) J. Acoust. Soc. Am. 78, Suppl. 1, S6.
10. Hakoda, K., Nakajima, S., Hirokawa, T. & Mizuno, H. (1990) in Proceedings of the International Conference on Spoken Language Processing (Kobe, Japan).
11. Nakajima, S. & Hamada, H. (1988) Proc. ICASSP-88.
12. Sagisaka, Y. (1988) Proc. ICASSP-88.
13. Sagisaka, Y., Kaiki, N., Iwahashi, N. & Mimura, K. (1992) in Proceedings of the International Conference on Spoken Language Processing (Banff, Canada).
14. Moulines, E., et al. (1990) Proc. ICASSP-90.
15. Charpentier, F. & Moulines, E. (1990) Speech Commun. 9, 453-467.
16. Roucos, S. & Wilgus, A. (1985) Proc. ICASSP-85, 493-496.
17. Liljencrants, J. (1974) Swedish Patent 362975.
18. Sagisaka, Y. (1990) IEEE Commun. Mag.
19. Ananthapadmanabha, T. V. (1984) STL-QPSR 2-3/1984, 1-24.
20. Hedelin, P. (1984) Proc. IEEE, 1.6.1-1.6.4.
21. Holmes, J. N. (1973) IEEE Trans. Audio Electroacoust. 21, 298-305.
22. Klatt, D. & Klatt, L. (1990) J. Acoust. Soc. Am. 87, 820-857.
23. Rosenberg, A. E. (1971) J. Acoust. Soc. Am. 53, 1632-1645.
24. Fant, G., Liljencrants, J. & Lin, Q. (1985) Speech Transm. Lab. Q. Status Rep. 4.
25. Gobl, C. & Karlsson, I. (1991) in Proceedings of the Vocal Fold Physiology Conference, eds. Gauffin & Hammarberg (Singular Publishing, San Diego).
26. Rothenberg, M. (1981) in Vocal Fold Physiology, eds. Stevens, K. M. & Hirano, M. (Univ. of Tokyo Press, Tokyo), pp. 303-323.
27. Rothenberg, M., Carlson, R., Granstrom, B. & Lindqvist-Gauffin, J. (1975) in Speech Communication (Almqvist & Wiksell, Stockholm), Vol. 2, pp. 235-243.
28. Bickley, C. & Stevens, K. (1986) J. Phon. 14, 373-382.
29. Ananthapadmanabha, T. V., Nord, L. & Fant, G. (1982) in Proceedings of the Representation of Speech in the Peripheral Auditory System (Elsevier, Amsterdam), pp. 217-222.
30. Stevens, K. N. (1971) J. Acoust. Soc. Am. 50, 1180-1192.
31. Shadle, C. H. (1985) Ph.D. thesis (Massachusetts Institute of Technology, Cambridge).
32. Badin, P. & Fant, G. (1989) in Proceedings of the European Conference on Speech Technology.
33. Holmes, J. (1983) Speech Commun. 2, 251-273.
34. Javkin, H., et al. (1989) Proc. ICASSP-89.
35. Williams, D., Bickley, C. & Stevens, K. (1992) in Proceedings of the International Conference on Spoken Language Processing (Banff, Canada), pp. 571-574.
36. Boves, L. (1991) J. Phon. 19.
37. Carlson, R., Granstrom, B. & Karlsson, I. (1991) Speech Commun. 10, 481-489.
38. Liljencrants, J. (1968) IEEE Trans. Audio Electroacoust. 16, 137-140.
39. Carlson, R., Granstrom, B. & Hunnicutt, S. (1982) Proc. ICASSP-82 3, 1604-1607.
40. Carlson, R., Granstrom, B. & Hunnicutt, S. (1991) in Advances in Speech, Hearing, and Language Processing, ed. Ainsworth, A. W. (JAI Press, London).
41. Stevens, K. & Bickley, C. (1991) J. Phon. 19.
42. Rahim, M., Goodyear, C., Kleijn, B., Schroeter, J. & Sondhi, M. (1993) J. Acoust. Soc. Am. 93, 1109-1121.
43. Bailly, G., Laboissière, R. & Schwartz, J. L. (1991) J. Phon. 19.
44. Flanagan, J. L., Ishizaka, K. & Shipley, K. L. (1975) Bell Syst. Tech. J. 54, 485-506.
45. Coker, C. H. (1976) Proc. IEEE 64, 452-460.
46. Mermelstein, P. (1973) J. Acoust. Soc. Am. 53, 1070-1082.
47. Lin, Q. & Fant, G. (1992) Proc. ICASSP-92.
48. Sondhi, M. M. & Schroeter, J. (1987) IEEE Trans. ASSP-35.
49. Liljencrants, J. (1969) STL-QPSR 4/1969, 43-50.
50. Slater, A. & Hawkins, S. (1992) in Proceedings of the International Conference on Spoken Language Processing (Banff, Canada).
51. Philips, M., Glass, J. & Zue, V. (1991) in Proceedings of the European Conference on Speech Communication and Technology.
52. Chomsky, N. & Halle, M. (1968) The Sound Pattern of English (Harper & Row, New York).
53. Pierrehumbert, J. B. (1987) The Phonetics of English Intonation (Indiana Univ. Press, Bloomington).
54. Hertz, S. R. (1991) J. Phon. 19.
55. Hertz, S. R., Kadin, J. & Karplus, K. J. (1985) Proc. IEEE 73, 11.
56. Van Leeuwen, H. C. & te Lindert, E. (1991) Proc. ICASSP-91.
57. Van Leeuwen, H. C. & te Lindert, E. (1993) Comput. Speech Lang. 7, 149-168.
58. Holmes, W. J. & Pearce, D. J. B. (1990) in Proceedings of the ESCA Workshop on Speech Synthesis (Autrans, France).
59. Talkin, D. & Rowley, M. (1990) in Proceedings of the ESCA Workshop on Speech Synthesis (Autrans, France).
60. Carlson, R. & Granstrom, B. (1986) Phonetica 43, 140-154.
61. Kaiki, N., Takeda, K. & Sagisaka, Y. (1990) in Proceedings of the International Conference on Spoken Language Processing (Kobe, Japan).
62. Riley, M. (1990) in Proceedings of the ESCA Workshop on Speech Synthesis (Autrans, France).
63. Van Santen, J. & Olive, J. P. (1990) Comput. Speech Lang. 4.
64. Abe, M., Shikano, K. & Kuwabara, H. (1990) in Proceedings of Speaker Characterisation in Speech Technology (Edinburgh, U.K.).
65. Abe, M. (1991) Proc. ICASSP-91.
66. Valbret, H., Moulines, E. & Tubach, J. P. (1992) Proc. ICASSP-92, I-145-I-148.
67. Van Son, R. J. J. H. & Pols, L. (1989) in Proceedings of the European Conference on Speech Communication and Technology.
68. Karlsson, I. (1992) Speech Commun. 11, 491-497.
69. Syrdal, A. K. (1992) J. Acoust. Soc. Am. 92.
70. Cohen, H. M. (1989) Ph.D. thesis (Univ. of California, Berkeley).
71. Eskenazi, M. & Lacheret-Dujour, A. (1991) Speech Commun. 10, 249-264.
72. Bladon, A., Carlson, R., Granstrom, B., Hunnicutt, S. & Karlsson, I. (1987) in Proceedings of the European Conference on Speech Technology, eds. Laver, J. & Jack, M. A. (Edinburgh, U.K.).
73. Klatt, D. K. (1987) J. Acoust. Soc. Am. 82, 737-793.