MULTILINGUAL TEXT ANALYSIS FOR TEXT-TO-SPEECH SYNTHESIS Richard Sproat
Speech Synthesis Research Department Bell Laboratories, Murray Hill, NJ, USA
ABSTRACT many cases the selection of the correct linguistic form for a ' normal- ized' item cannot be chosen before one has done a certain amount of We present a model of text analysis for text-to-speech (TTS) syn- linguistic analysis. A particularly compelling example can be found thesis based on weighted finite-state transducers, which serves as in Russian, where deciding how to read a percentage denoted with the text-analysis module of the multilingual Bell Labs TTS system. ' %' depends on complex contextual factors, unlike the situation in The transducers are constructed using a lexical toolkit that allows English. The first decision that needs to be made is whether or not declarative descriptions of lexicons, morphological rules, numeral- the number-percent string is modifying a following noun. Russian expansion rules, and phonological rules, inter alia. To date, the in general disallows noun-noun modification, so that an adjectival model has been applied to eight languages: Spanish, Italian, Ro- form of procent ' percent' must be used: 20% skidka 'twenty percent manian, French, German, Russian, Mandarin and Japanese.
discount' is rendered as dvadcati-procentnaja skidka (twenty [gen] -
;sg;fem] [nom;sg;fem] percent+adj[nom discount ). Not only does pro- 1. INTRODUCTION cent have to be in the adjectival form, but as with any Russian adjec- tive it must also agree in number, case and gender with the following The first task faced by any text-to-speech (TTS) system is the con- noun. Observe also that the word for 'twenty' must occur in the gen- version of input text into a linguistic representation. This is a com- itive case. In general, numbers which modify adjectives in Russian plex task since the written form of any language is at best an imper- must occur in the genitive case. If the percentage expression is not fect representation of the corresponding spoken forms. Among the modifying a following noun, then the nominal form procent is used, problems that one faces in handling ordinary text are the following: but this form appears in different cases depending upon the number it occurs with. With numbers ending in one (including compound 1. Some languages, such as Chinese, do not delimit words with numbers like twenty one), procent occurs in the nominative singular. whitespace. One is therefore required to 'reconstruct' word After so-called paucal numbers — two, three, four and their com- boundaries in TTS systems for such languages. pounds — the genitive singular procenta is used. After all other
2. Digit sequences need to be expanded into words, and more numbers one finds the genitive plural procentov.Sowehaveodin
;sg] [gen;sg]
generally into well-formed number names: so 243 in English procent (one percent[nom ), dva procenta (two percent ),
;pl] would generally be expanded as two hundred and forty three. and pyat' procentov (five percent[gen ). However, all of this pre- sumes that the percentage expression as a whole is in a non-oblique 3. Abbreviations must be expanded into full words. This can case. If the expression is in an oblique case, then both the number involve some amount of contextual disambiguation: so kg. can and procent show up in that case, with procent being in the singu- be either kilogram or kilograms, depending upon the context. lar if the number ends in one, and the plural otherwise: s odnym
4. Ordinary words and names need to pronounced. In many
;sg;masc] [instr;sg] procentom (with one [instr percent ) ' with one per-
languages, this requires morphological analysis: even in lan-
;pl] [instr;pl] cent' ; s pjat'ju procentami (with five [instr percent ) ' with guages with fairly 'regular' spelling, morphological structure five percent.' As with the adjectival forms, there is nothing peculiar is often crucial in determining the pronunciation of a word. about the behavior of the noun procent: all nouns exhibit identical 5. Prosodic phrasing is only sporadically marked (by punctua- behavior in combination with numbers. The complexity, of course, tion) in text, and phrasal accentuation is almost never marked. arises because the written form % gives no indication of what lin- guistic form it corresponds to. Furthermore, there is no way to cor- In many TTS systems the first three tasks — word segmentation, rectly expand this form without doing a substantial amount of anal- and digit and abbreviation expansion — would be classed under the ysis of the context, including some analysis of the morphological rubric of text normalization and would generally be handled prior properties of the surrounding words, as well as an analysis of the to, and often in a quite different fashion from the last two prob- relationship of the percentage expression to those words. lems, which fall more squarely within the domain of linguistic anal- ysis (but see [12], which treats numeral expansion as an instance of The obvious solution to this problem is to delay the decision on morphological analysis). One problem with this approach is that in how exactly to transduce symbols like or ' %' until one has enough information to make the decision in an informed manner. This sug- the MMA to the standard spelling kostra (ko stra) would, among
gests is a model where, say, an expression like ' 20%' in Russian other things, simply delete the stress marks: call this transducer
S L S is transduced into all possible renditions, and the correct form se- . The composition word , then computes the mapping from
lected from the lattice of possibilities by filtering out the illegal the lexical level to the surface orthographic level, and its inverse