
RECENT ADVANCES IN MULTILINGUAL TEXT-TO-SPEECH SYNTHESIS

Bernd Möbius, Juergen Schroeter, Jan van Santen, Richard Sproat, Joseph Olive
AT&T Bell Laboratories, Murray Hill, NJ, USA

INTRODUCTION

In this paper we will discuss recent advances in multilingual text-to-speech (TTS) synthesis research at AT&T Bell Laboratories. The TTS system developed at AT&T Bell Laboratories generates synthetic speech by concatenating segments of natural speech. The architecture of the system is designed as a modular pipeline where each module handles one particular step in the process of converting text into speech. Besides conceptual and computational advantages, the modular structure has been instrumental in our effort toward a TTS system for multiple languages. This system will ultimately consist of a single set of modules, and any language-specific information will be represented in tables. In describing the TTS system, we will concentrate on the multilingual aspect, with a bias toward the German language.
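The table-driven modular pipeline just described can be illustrated with a toy sketch. This is not the Bell Labs code: the module names follow the paper, but the signatures, table entries, and phone symbols below are invented for illustration. Each module reads the shared data stream, adds its own annotations, and passes the stream on.

```python
# Toy sketch of a modular, table-driven TTS pipeline (invented data).
# Every module has the same signature and only ADDS information to the
# stream, mirroring the unidirectional information flow of Figure 1.
from typing import Callable, Dict, List

Stream = Dict[str, object]           # the uniform data structure passed along
Module = Callable[[Stream], Stream]  # every module conforms to this signature

# Language-specific knowledge lives in tables, not in the module code.
PRONUNCIATION_TABLE = {"hello": ["h", "@", "l", "oU"]}  # invented entries

def pronunciation(stream: Stream) -> Stream:
    stream["phones"] = [p for w in stream["words"]
                        for p in PRONUNCIATION_TABLE.get(w, [])]
    return stream

def duration(stream: Stream) -> Stream:
    # placeholder: assign a flat 80 ms to every phone
    stream["durations_ms"] = [80] * len(stream["phones"])
    return stream

def run_pipeline(text: str, modules: List[Module]) -> Stream:
    stream: Stream = {"text": text, "words": text.lower().split()}
    for module in modules:   # unidirectional: each step only adds annotations
        stream = module(stream)
    return stream

out = run_pipeline("hello", [pronunciation, duration])
print(out["phones"])        # ['h', '@', 'l', 'oU']
print(out["durations_ms"])  # [80, 80, 80, 80]
```

Because every module shares one signature and one data structure, a module can be swapped for an improved version, or the pipeline interrupted and inspected, at any point.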

A MODULAR ARCHITECTURE

The architecture of the TTS system is entirely modular. This design has a number of advantages (see [18] for a more detailed discussion) for system development and testing, and research. First, although the division of the TTS conversion problem into subproblems is always arbitrary to some extent, each module still corresponds to a well-defined subtask in TTS conversion. Second, from the system development point of view, members of a research team can work on different modules of the system, and an improved version of a given module can be integrated anytime, as long as the communication between the modules and the structure of the information to be passed along is defined. Third, it is possible to interrupt and (re-)initiate processing anywhere in the pipeline and assess TTS information at that point, or to insert tools or programs that modify TTS parameters.

The current English version of our TTS system consists of thirteen modules (Figure 1). Information flow is unidirectional, and each module adds information to the data stream. Inter-module communication is performed by way of a uniform set of data structures. The output of the unit concatenation module, however, is a stream of synthesis parameters; this information is finally used by the waveform synthesizer.

[Figure 1. Modules of the English TTS System. Pipeline from TEXT to SPEECH: Lemmatization, Accenting, Pronunciation, Phrasing, Phrasal Accents, Duration, Intonation, Amplitude, Glottal Source, Unit Selection, Unit Concatenation, Synthesis.]

The modular architecture has been instrumental in our effort to develop TTS systems for languages other than English. Currently, work is under way on nine languages: Mandarin Chinese, Taiwanese, Japanese, Mexican Spanish, Russian, Romanian, Italian, French, and German. In the initial stages of work on a new language much of the information needed to drive these modules is missing. Typically, we start with a phonetic representation of the phoneme system of the language in
question and build the acoustic inventory, whereas text analysis and prosodic components would be worked on in a later stage. In this case, default versions of the modules enable the researcher to get a reasonable synthetic speech output for a given input string which can consist of phone symbols and control sequences for various prosodic parameters.

While some modules, such as unit selection, unit concatenation, and waveform synthesis, have already been largely table-driven for some time, we recently integrated language-independent text analysis, duration, and intonation components. Thus, the Bell Labs TTS system can now be characterized as consisting of one single set of modules, where any language-specific information will be represented in, and retrieved from, tables.

We will now turn to a discussion of components in the light of multilingual TTS.

TEXT ANALYSIS

The first stage of TTS conversion involves the transformation of the input text into a linguistic representation from which actual synthesis can proceed. This linguistic representation includes information about the pronunciation of words (including such `non-standard' words as numerals, abbreviations, etc.), the relative prominence (accenting) of those words, and the division of the input sentences into (prosodic) phrases. Deriving each of these kinds of information presents interesting problems that can be solved using techniques from computational linguistics.

In the English TTS system, the information of how to pronounce a word resides in the pronunciation model itself. The model is basically a large set of word-specific pronunciation rules because it consists of a list of words and their pronunciations. An alternative approach [16] has been taken in the multilingual TTS system. Here lexical information is represented by (mostly morphological) annotations of the regular orthography. The generalized text analysis component computes linguistic analyses from text using a lexical toolkit that is based on state-of-the-art weighted finite-state transducer technology. First, input text is converted into a finite-state acceptor which is then composed sequentially with a set of transducers that go from the surface representation to lexical analysis. Since this yields all possible lexical analyses for a given input, a language model helps find the presumably `correct' or most appropriate analysis. The best path through the language model is then composed with a transducer that goes from lexical analysis to phonological representation and pronunciation (Figure 2).

[Figure 2. Generalized text analysis component for multilingual TTS based on finite-state transducer (FST) technology. FSTs connect the stages Morphological Analysis, Minimal Morphological Annotation, Surface Orthographic Form, and Pronunciation.]

To illustrate the relevance of morphological information for pronunciation, let us consider two examples from German. One of the rules of German pronunciation is that an /s/ preceding a /p/ followed by a vowel or liquid is pronounced as [ʃ]. This rule only applies, however, if the /s/ is part of the same morpheme as the following /p/, and this in turn can cause a problem in compounds such as "Sicherheitspanne" (in principle potentially either "Sicherheit+s+Panne" or "Sicherheit+Spanne"). Since compounding is highly productive in German, most particular instances of compounds that one encounters in text will not be found in the dictionary. In a case like "Sicherheitspanne" morphological analysis tells us that the /s/ must be the `Fugen-/s/' which regularly follows the noun-forming suffix "-heit", and therefore should not be pronounced as [ʃ] but as [s].

In the second example, the input string "Sucht" yields two possible morphological analyses: s'uch{verb}+t{3per sg pres indic} with the pronunciation [z'u:xt] vs. s'ucht{noun fem sg (nom|gen|dat|acc)} with the pronunciation [z'ʊxt]. Given equal probabilities for both alternatives, only information provided by a part-of-speech tagger or parser (e.g., [1]) can help disambiguate and determine the correct pronunciation.

The text analysis component also performs a tokenization of the input text into sentences and words. While some writing systems, e.g. Chinese, use a special symbol to mark the end of a sentence and nothing else, the situation is less fortunate in other writing systems. In German and many other European languages, a period is ambiguous in that it delimits sentences but also marks abbreviations. End-of-sentence detection, abbreviation, acronym and number expansion, word tokenization and other preprocessing problems are typically solved using a set of heuristics (e.g., [2] [10]).

Other linguistic information as derived by text processing includes information on parts of speech as well as on phrasing and accenting, and jointly forms the input to subsequent modules: segmental duration, intonation, unit selection and concatenation, and synthesis. In our multilingual systems, the new generalized text analysis component replaces all the modules up to segmental duration.
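The "Sucht" disambiguation above can be sketched in miniature. The real system produces the competing analyses with weighted finite-state transducers [16]; here plain Python dictionaries stand in for the transducers, and the analyses, SAMPA-style pronunciations, and tagger interface are simplified for illustration.

```python
# Toy illustration of morphological disambiguation for pronunciation.
# A surface form maps to all of its morphological analyses (as the FST
# cascade would produce them); a part-of-speech tag from context selects
# one analysis, and with it the pronunciation. Data is invented/simplified.

ANALYSES = {
    "sucht": [
        {"morph": "s'uch+t", "pos": "verb", "pron": "z'u:xt"},  # 'seeks'
        {"morph": "s'ucht",  "pos": "noun", "pron": "z'Uxt"},   # 'addiction'
    ],
}

def disambiguate(word: str, context_pos: str) -> dict:
    """Pick the analysis whose part of speech matches the tagger's output."""
    candidates = ANALYSES[word.lower()]
    for analysis in candidates:
        if analysis["pos"] == context_pos:
            return analysis
    return candidates[0]  # fall back to the first candidate

# "Er sucht ..." -> tagger says verb; "Die Sucht ..." -> tagger says noun.
print(disambiguate("sucht", "verb")["pron"])  # z'u:xt
print(disambiguate("Sucht", "noun")["pron"])  # z'Uxt
```

The point of the sketch is only the division of labor: the lexical component enumerates analyses, and contextual (tagger or parser) information resolves the ambiguity.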
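The period-ambiguity problem in tokenization can likewise be sketched with a single heuristic: a period ends a sentence unless it closes a known abbreviation. The abbreviation list below is an invented sample; real systems use much richer heuristics (e.g., [2] [10]).

```python
# Minimal sentence-splitting heuristic for period disambiguation.
# A token-final period ends the sentence unless the whole token is a
# known abbreviation. The abbreviation set is an invented sample.

ABBREVIATIONS = {"Dr.", "z.B.", "etc.", "Nr."}

def split_sentences(text: str) -> list:
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith(".") and token not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # flush any trailing material without a final period
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Dr. Meier kommt. Er bringt z.B. Bücher mit."))
# ['Dr. Meier kommt.', 'Er bringt z.B. Bücher mit.']
```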

SEGMENTAL DURATION

The duration module assigns a duration to each phonetic segment. Given the string of segments to be synthesized, each segment is tagged with a feature vector containing information on a variety of factors, such as segment identity, syllable stress, accent status, segmental context, or position in the phrase. The module as such is language-independent, with all language-specific information being stored in tables. Table construction is performed in two phases: inferential-statistical analysis of the data, and parameter fitting.

In the case of the German TTS system, we first designed a factorial scheme, i.e. the set of factors and distinctions on these factors that are known or expected to have a significant impact on segmental durations. An important requirement was that the factors can be computed from text. We then applied a quantitative duration model that is implemented as a particular instantiation of a `sums-of-products' model [12] whose parameters are fitted to a hand-segmented speech database [5]. This approach uses statistical techniques that are able to cope with the problem of confounding factors and factor levels, and with data sparsity.

During analysis, the segments are classified according to their position in the syllable (onset consonants, nuclei, coda consonants). Within these classes, subcategorizations were made in terms of phone types (e.g., voiceless stops, voiced fricatives, nasal consonants, etc.).

The data show rather homogeneous patterns in that speech sounds within a given phone class generally exhibit similar durational trends under the influence of the same combination of factors. Among the most important factors are: a) syllable stress (for nuclei, and to some extent for stops and fricatives in the onset); b) word class (for nuclei); c) presence of phrase and word boundaries (for coda consonants, and to some extent for nuclei). The analysis yields a comprehensive picture of durational characteristics of one particular speaker.

INTONATION

The intonation module computes a fundamental frequency contour (F0) by adding three types of time-dependent curves: a phrase curve, which depends on the type of phrase, e.g., declarative vs. interrogative; accent curves, one for each accent group (accented syllable followed by zero or more non-accented syllables); and perturbation curves, which capture the effects of obstruents on pitch in the post-consonantal vowel.

This approach shares some concepts with the so-called superpositional intonation models that have been applied to a number of languages (e.g., [3] [6]). These models analyze the F0 contour as a complex pattern that results from the superposition of several components, each of which has its own temporal domain. The weakness of the superpositional models is generally seen in their lack of precision as far as the alignment of the F0 contour with the internal temporal structure of the accent group is concerned.

The key novelty in our approach, however, is that we model in detail how the accent curves depend on the composition and duration of the accent groups. This is important because listeners are sensitive to small changes in alignment of pitch peaks with syllables. Previous findings on segmental effects of timing and height of pitch contours are integrated in the new model [13]. Similar to duration module construction, modeling these dependencies involves fitting of parameters to a speech corpus.

ACOUSTIC INVENTORY

The majority of units in the acoustic inventory are diphones, i.e., units that contain the transition between two adjacent phonetic segments, starting in the steady-state phase of the first segment and ending in the stable region of the second segment. Units to be stored in the acoustic inventory are chosen based on various criteria that include spectral discrepancy and energy measures. Contextual or coarticulatory effects can require the storage and use of context-sensitive `allophonic' units or even of triphones [7].

For example, the current acoustic inventory of the German TTS system consists of approximately 1250 units, including about 100 context-sensitive units. This inventory is sufficient to represent all phonotactically possible phone combinations for German. However, it will have to be augmented by units representing speech sounds that occur in common foreign words or names, e.g., the interdental fricatives and the /w/ glide for English, or nasalized vowels for French.

For acoustic inventory construction we use a new procedure [14] that performs an automated optimal element selection and cut point determination. The approach selects elements such that, for a given vowel, spectral discrepancies between elements for that vowel are jointly minimized, and the coverage of required elements is maximized. A toolset is provided that helps reduce the amount of manual labor involved in the selection of inventory elements.

Elements that have been selected for inclusion in the inventory are then extracted (`cut'), normalized in amplitude, indexed and stored in tables as acoustic inventory elements. The normalization done on a given element depends on the synthesis method used (see following section) and on the speech sounds involved in the element.

SELECTION, CONCATENATION, SYNTHESIS

The unit selection and concatenation modules select and connect the acoustic inventory elements. These modules retrieve the necessary units, assign new durations, pitch contours and amplitude profiles, and pass parameter vectors on to the synthesis module, which uses one of the synthesis methods described below to generate the output speech waveform.

The parametric waveform synthesis module provides flexible engines to assure the highest quality speech output for a given hardware platform and number of parallel channels running on that platform. Since usually more than 60% of the computational effort of the total TTS system is spent on waveform synthesis, hardware constraints can be met most easily by trading off quality vs. complexity in the algorithms used by the synthesizer.

Our TTS system uses vector-quantized LPC and a parametrized glottal waveform for synthesis. More recently, we have introduced a mixed formant/LPC representation and added waveform synthesis engines with varying degrees of complexity.

Each specific synthesis method requires its corresponding analysis scheme. In all cases, we use the pitch-synchronous analysis outlined in [19] with the option of using a mixed LPC/formant spectral representation [8]. The glottal waveform model used in the standard back end is that of Rosenberg [11]. Spectral tilt is implemented as a separate first-order FIR filter. Aspiration noise is added in the glottal open phase. Plosive transients are synthesized using the original LPC residual [9]. We also have the option to use the LPC residual throughout synthesis. For waveform synthesis schemes, we modified the analysis procedure to extract intervals of exactly two pitch periods (for voiced sounds; fixed 5 ms pseudo-periods for unvoiced sounds), and store the Hanning-windowed speech waveform. The system can, in principle, also accommodate articulatory parameters [4] [15] given that analysis and resynthesis yield sufficiently high quality.

SUMMARY

We presented recent advances and developments in our ongoing effort toward a TTS system with multilingual capability. The Bell Labs TTS system can also drive a `talking face', a visual speech synthesis module that provides `lip-reading' cues to enhance discrimination between confusable consonants such as nasals, or between labial and alveolar stops. In addition, a `talking head' contributes a visual personality to an application such as a computer's help system. In summary, we feel strongly that TTS systems have started to play an important role in everyday human-machine communications. In the future, TTS will sound increasingly `natural' where desired, and will talk in several languages while conveying several speaker personalities in each language.

REFERENCES

[1] Church K. (1988): "A stochastic parts program and noun phrase parser for unrestricted text". Proc. 2nd Conf. on Applied Natural Language Processing, 136-143

[2] Coker C., Church K., Liberman M. (1990): "Morphology and rhyming: Two powerful alternatives to letter-to-sound rules for speech synthesis". Proc. ESCA Workshop on Speech Synthesis (Autrans, France), 83-86

[3] Fujisaki H. (1983): "Dynamic characteristics of voice fundamental frequency in speech and singing". In P.F. MacNeilage (ed.), The production of speech (Springer, Berlin), 39-55

[4] Gupta S., Schroeter J. (1993): "Pitch-synchronous frame-by-frame and segment-based articulatory analysis-by-synthesis". Journal of the Acoustical Society of America 94, 2517-2530

[5] Kiel Corpus (1994): The Kiel corpus of read speech, Vol. 1 (CD-ROM, Univ. Kiel)

[6] Möbius B. (1993): Ein quantitatives Modell der deutschen Intonation – Analyse und Synthese von Grundfrequenzverläufen (Niemeyer, Tübingen)

[7] Olive J.P. (1990): "A new algorithm for a concatenative speech synthesis system using an augmented acoustic inventory of speech sounds". Proc. ESCA Workshop on Speech Synthesis (Autrans, France), 25-29

[8] Olive J. (1992): "Mixed spectral representation – Formants and linear predictive coding (LPC)". Journal of the Acoustical Society of America 92, 1837-1840

[9] Oliveira L.C. (1993): "Estimation of source parameters by frequency analysis". Proc. Eurospeech-93, 99-102

[10] Pavlovcik P. (1991): FREND reference manual (Technical Report, AT&T Bell Laboratories)

[11] Rosenberg A.E. (1971): "Effect of glottal pulse shape on the quality of natural vowels". Journal of the Acoustical Society of America 48, 583-590

[12] van Santen J.P.H. (1994): "Assignment of segmental duration in text-to-speech synthesis". Computer Speech and Language 8, 95-128

[13] van Santen J.P.H. (1996): "Segmental duration and speech timing". In Y. Sagisaka, W.N. Campbell, N. Higuchi (eds.), Computing Prosody (Springer, New York)

[14] van Santen J.P.H., Möbius B., Tanenblatt M. (1994): New procedures for constructing acoustic inventories (Technical Report, AT&T Bell Laboratories)

[15] Schroeter J., Sondhi M.M. (1994): "Techniques for estimating vocal-tract shapes from the speech signal". IEEE Trans. Speech and Audio Proc. 2(1) Part II, 133-150

[16] Sproat R. (1995): "A finite-state architecture for tokenization and grapheme-to-phoneme conversion for multilingual text analysis". In From text to tags: Issues in multilingual language analysis. Proc. ACL SIGDAT Workshop (Dublin, Ireland), 65-72

[17] Sproat R., Olive J. (1995): "Text to speech synthesis". AT&T Technical Journal 74(2), 35-44

[18] Sproat R., Olive J. (1996): "A modular architecture for multi-lingual text-to-speech". In J. van Santen, R. Sproat, J. Olive and J. Hirschberg (eds.), Progress in speech synthesis (Springer, New York)

[19] Talkin D., Rowley J. (1990): "Pitch-synchronous analysis and synthesis for TTS systems". Proc. ESCA Workshop on Speech Synthesis (Autrans, France), 55-58
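The `sums-of-products' duration model described in the SEGMENTAL DURATION section can be sketched as follows. A segment's duration is a sum of product terms, where each term multiplies parameters selected by the segment's feature vector. All factor names, levels, and parameter values below are invented; the real parameters are fitted to a hand-segmented corpus as in [12] [5].

```python
# Sketch of a sums-of-products duration model: duration is a sum of
# terms, each term a product of fitted parameters indexed by the
# segment's feature vector. All numbers here are invented.
from math import prod

# One entry per product term: factor name -> {factor level: parameter}.
TERMS = [
    {"identity": {"a:": 90.0, "t": 45.0},            # intrinsic duration (ms)
     "stress":   {"stressed": 1.3, "unstressed": 0.85}},
    {"identity": {"a:": 25.0, "t": 10.0},            # boundary lengthening (ms)
     "position": {"phrase_final": 1.6, "medial": 0.4}},
]

def duration_ms(features: dict) -> float:
    """Sum over terms of the product of the selected parameters."""
    return sum(prod(table[features[f]] for f, table in term.items())
               for term in TERMS)

seg = {"identity": "a:", "stress": "stressed", "position": "phrase_final"}
print(round(duration_ms(seg), 1))  # 90*1.3 + 25*1.6 = 157.0
```

The additive-multiplicative structure is what lets the model share parameters across factor combinations, which is how it copes with sparse and confounded training data.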
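The additive F0 computation of the INTONATION section (phrase curve plus accent curves; perturbation curves omitted) can also be sketched. The curve shapes and all numeric values below are invented for illustration; only the superposition of time-aligned components reflects the model described above.

```python
# Sketch of superpositional F0 generation: a declining phrase curve plus
# one bell-shaped accent curve per accent group, summed over time.
# Curve shapes and constants are invented for illustration.
import math

def phrase_curve(t, duration, start_hz=120.0, end_hz=90.0):
    """Linearly declining declarative phrase curve."""
    return start_hz + (end_hz - start_hz) * (t / duration)

def accent_curve(t, peak_time, height_hz=40.0, width=0.15):
    """Bell-shaped rise-fall anchored on the accented syllable."""
    return height_hz * math.exp(-((t - peak_time) / width) ** 2)

def f0(t, duration, accent_peaks):
    """F0 at time t: phrase curve plus all accent curves."""
    return phrase_curve(t, duration) + sum(accent_curve(t, p)
                                           for p in accent_peaks)

# F0 at the first accent peak of a 2 s declarative phrase with accent
# groups peaking at 0.5 s and 1.4 s:
print(round(f0(0.5, 2.0, [0.5, 1.4]), 1))  # 112.5 (phrase) + 40 (accent) = 152.5
```

In the actual model, the shape and alignment of each accent curve depend on the composition and duration of its accent group [13], which is the key refinement over earlier superpositional models.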