
MULTILINGUAL TEXT ANALYSIS FOR TEXT-TO-SPEECH SYNTHESIS

Richard Sproat

Speech Synthesis Research Department, Bell Laboratories, Murray Hill, NJ, USA

ABSTRACT

We present a model of text analysis for text-to-speech (TTS) synthesis based on weighted finite-state transducers, which serves as the text-analysis module of the multilingual Bell Labs TTS system. The transducers are constructed using a lexical toolkit that allows declarative descriptions of lexicons, morphological rules, numeral-expansion rules, and phonological rules, inter alia. To date, the model has been applied to eight languages: Spanish, Italian, Romanian, French, German, Russian, Mandarin and Japanese.

1. INTRODUCTION

The first task faced by any text-to-speech (TTS) system is the conversion of input text into a linguistic representation. This is a complex task, since the written form of any language is at best an imperfect representation of the corresponding spoken forms. Among the problems that one faces in handling ordinary text are the following:

1. Some languages, such as Chinese, do not delimit words with whitespace. One is therefore required to 'reconstruct' word boundaries in TTS systems for such languages.

2. Digit sequences need to be expanded into words, and more generally into well-formed number names: so 243 in English would generally be expanded as two hundred and forty three.

3. Abbreviations must be expanded into full words. This can involve some amount of contextual disambiguation: so kg. can be either kilogram or kilograms, depending upon the context.

4. Ordinary words and names need to be pronounced. In many languages, this requires morphological analysis: even in languages with fairly 'regular' spelling, morphological structure is often crucial in determining the pronunciation of a word.

5. Prosodic phrasing is only sporadically marked (by punctuation) in text, and phrasal accentuation is almost never marked.

In many TTS systems the first three tasks — word segmentation, and digit and abbreviation expansion — would be classed under the rubric of text normalization, and would generally be handled prior to, and often in a quite different fashion from, the last two problems, which fall more squarely within the domain of linguistic analysis (but see [12], which treats numeral expansion as an instance of morphological analysis). One problem with this approach is that in many cases the correct linguistic form for a 'normalized' item cannot be chosen before one has done a certain amount of linguistic analysis. A particularly compelling example can be found in Russian, where deciding how to read a percentage denoted with '%' depends on complex contextual factors, unlike the situation in English. The first decision that needs to be made is whether or not the number-percent string is modifying a following noun. Russian in general disallows noun-noun modification, so an adjectival form of procent 'percent' must be used: 20% skidka 'twenty percent discount' is rendered as dvadcati-procentnaja skidka (twenty[gen] percent+adj[nom;sg;fem] discount[nom;sg;fem]). Not only does procent have to be in the adjectival form, but as with any Russian adjective it must also agree in number, case and gender with the following noun. Observe also that the form for 'twenty' must occur in the genitive case: in general, numbers which modify adjectives in Russian must occur in the genitive.

If the percentage expression is not modifying a following noun, then the nominal form procent is used, but this form appears in different cases depending upon the number it occurs with. With numbers ending in one (including compound numbers like twenty-one), procent occurs in the nominative singular. After so-called paucal numbers — two, three, four and their compounds — the genitive singular procenta is used. After all other numbers one finds the genitive plural procentov. So we have odin procent (one percent[nom;sg]), dva procenta (two percent[gen;sg]), and pyat' procentov (five percent[gen;pl]). However, all of this presumes that the percentage expression as a whole is in a non-oblique case. If the expression is in an oblique case, then both the number and procent show up in that case, with procent being in the singular if the number ends in one, and the plural otherwise: s odnym procentom (with one[instr;sg;masc] percent[instr;sg]) 'with one percent'; s pjat'ju procentami (with five[instr;pl] percent[instr;pl]) 'with five percent.' As with the adjectival forms, there is nothing peculiar about the behavior of the noun procent: all nouns exhibit identical behavior in combination with numbers. The complexity, of course, arises because the written form % gives no indication of what linguistic form it corresponds to. Furthermore, there is no way to correctly expand this form without doing a substantial amount of analysis of the context, including some analysis of the morphological properties of the surrounding words, as well as an analysis of the relationship of the percentage expression to those words.
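The case logic just described is compact enough to state procedurally. The following toy Python sketch is not from the paper; the transliterated endings, the small subset of oblique cases, and the treatment of the teens 11-14 (which real Russian sends to the genitive plural, a detail the prose glosses over) are assumptions of this sketch.

```python
# Toy sketch (not from the paper) of the case logic governing the
# nominal reading of Russian "procent".

def procent_form(number, oblique_case=None):
    """Return the form of 'procent' used after `number`.

    oblique_case=None means the percentage expression as a whole is in
    a direct (non-oblique) case; otherwise both number and noun surface
    in that case, singular iff the number ends in one.
    """
    if oblique_case is not None:
        endings = {"instr": ("om", "ami"), "dat": ("u", "am")}  # illustrative subset
        sg, pl = endings[oblique_case]
        singular = number % 10 == 1 and number % 100 != 11
        return "procent" + (sg if singular else pl)
    last, teens = number % 10, 11 <= number % 100 <= 14
    if last == 1 and not teens:
        return "procent"      # nominative singular:  odin procent
    if last in (2, 3, 4) and not teens:
        return "procenta"     # genitive singular:    dva procenta
    return "procentov"        # genitive plural:      pyat' procentov

assert procent_form(1) == "procent"
assert procent_form(2) == "procenta"
assert procent_form(5) == "procentov"
assert procent_form(1, "instr") == "procentom"   # s odnym procentom
assert procent_form(5, "instr") == "procentami"  # s pjat'ju procentami
```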

The obvious solution to this problem is to delay the decision on how exactly to transduce symbols like '%' until one has enough information to make the decision in an informed manner. This suggests a model where, say, an expression like '20%' in Russian is transduced into all possible renditions, and the correct form selected from the lattice of possibilities by filtering out the illegal forms. An obvious computational mechanism for accomplishing this is the finite-state transducer (FST). Indeed, since FSTs can be used to model (most) morphology and phonology [5, 8], to segment words in Chinese text [11], and to perform other text-analysis operations such as numeral expansion [9], this suggests a model of text analysis that is entirely based on regular relations. We present such a model below. More specifically, we present a model of text analysis for TTS based on weighted FSTs (WFSTs) [7], which serves as the text-analysis module of the multilingual Bell Labs TTS system. To date, the model has been applied to eight languages: Spanish, Italian, Romanian, French, German, Russian, Mandarin and Japanese. One property of this model that distinguishes it from most TTS text-analyzers is that such tasks as numeral expansion and word segmentation are not logically prior to other aspects of linguistic analysis. There is therefore no distinguished 'text-normalization' phase.
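Abstracting away from transducers, the delay-then-filter strategy can be sketched in a few lines of Python; the candidate strings, the '?' marking and the weights below are invented for illustration, not taken from the system described here.

```python
# Schematic (non-FST) rendering of the delay-then-filter idea:
# emit every rendition, mark context-dependent ones with '?',
# filter by context, then keep the cheapest survivor.

candidates = [
    ("?dvadcati-procentnaja", 2.0),  # adjectival reading, needs a noun
    ("dvadcat' procentov", 0.0),     # nominal reading
]

def disambiguate(cands, noun_follows):
    if noun_follows:
        # context licenses (indeed requires) the adjectival reading
        kept = [(r[1:], w) for r, w in cands if r.startswith("?")]
    else:
        kept = [(r, w) for r, w in cands if not r.startswith("?")]
    return min(kept, key=lambda rw: rw[1])[0]  # lowest-cost survivor

assert disambiguate(candidates, True) == "dvadcati-procentnaja"
assert disambiguate(candidates, False) == "dvadcat' procentov"
```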

2. OVERALL ARCHITECTURE

Let us start with the example of the lexical analysis and pronunciation of ordinary words, taking again an example from Russian. Russian orthography is often described as morphophonemic, meaning that the orthography represents not a surface phonemic level of representation, but a more abstract level. This description is correct, but from the point of view of predicting word pronunciation it is noteworthy that Russian, with a well-defined set of lexical exceptions, is almost completely phonemic: one can predict the pronunciation of most words in Russian from the spelling of those words, provided one knows the placement of lexical stress, since several Russian vowels undergo reduction to varying degrees depending upon their position relative to stressed syllables. The catch is that lexical stress usually depends upon knowing lexical properties of the word, including morphological class information.

To take a concrete example, consider the word kostra (Cyrillic костра) (bonfire + genitive singular). This word belongs to a class of masculine nouns where the lexical stress is placed on the inflectional ending, where there is one. Thus the stress pattern is kostr'a, and the pronunciation is (roughly) /kʌstrá/, with the first /o/ reduced to /ʌ/. Let us assume that the morphological representation for this word is something like kostr{noun}{masc}{inan}+'a{sg}{gen}, where for convenience we represent phonological and morphosyntactic information as part of the same string.¹ Assuming a finite-state model of lexical structure [5, 8], we can easily imagine a set of transducers M that map from that level into a level that gives the minimal morphologically-motivated annotation (MMA) necessary to pronounce the word. In this case, something like kostr'a would be appropriate. Call this lexical-to-MMA transducer L_word; such a transducer can be constructed by composing the lexical acceptor D with M, so that L_word = D ∘ M. A transducer that maps from the MMA to the standard spelling kostra would, among other things, simply delete the stress marks: call this transducer S. The composition L_word ∘ S then computes the mapping from the lexical level to the surface orthographic level, and its inverse, (L_word ∘ S)⁻¹ = S⁻¹ ∘ L_word⁻¹, computes the mapping from the surface to all possible lexical representations for the text word. A set of pronunciation rules compiled into a transducer P [2, 6] maps from the MMA to the (surface) phonological representation; note that by starting with the MMA, rather than with the more abstract lexical representation, the pronunciation rules do not need to duplicate information that is contained in L_word anyway. Mapping from a single orthographic word to its possible pronunciations thus involves composing the acceptor representing the word with the transducer S⁻¹ ∘ L_word⁻¹ ∘ L_word ∘ P (or, more fully, S⁻¹ ∘ M⁻¹ ∘ D ∘ M ∘ P).

For text elements such as numbers, abbreviations, and special symbols such as '%', the model just presented seems less persuasive, because there is no aspect of a string such as '25%' that indicates its pronunciation: such strings are purely logographic (or even ideographic), representing nothing about the phonology of the words involved.² For these cases we presume a direct mapping between all possible forms of procent and the symbol '%': call this transducer L_perc, a subset of L_special_symbol. Then L_perc⁻¹ maps from the symbol '%' to the various forms of procent. In the same way, the transducer (L_num L_perc)⁻¹ maps from numbers followed by the sign for percent into various possible (and some impossible) lexical renditions of that string, the various forms to be disambiguated using contextual information, as we shall show below. Abbreviations are handled in a similar manner: abbreviations such as kg (Cyrillic кг) in Russian show the same complexity of behavior as procent.

So far we have been discussing the mapping of single text words into their lexical renditions. The construction of an analyzer to handle a whole text is based on the observation that a text is simply constructed out of one or more instances of a text word coming from one of the models described above — either an ordinary word, an abbreviation, a number, a special symbol, or some combination of numbers with a special symbol — with each of these tokens separated by some combination of whitespace or punctuation. The structure of this model of lexical analysis is summarized in Figure 1.

[Figure 1 (diagram not reproduced): the lexical-analysis WFST is shown as S⁻¹ composed with the union of the inverted lexical models L_word, L_punc, L_special_symbol and L_num, interleaved with the space model.]

Figure 1: Structure of lexical analysis. Note that the space model corresponds to whitespace in (e.g.) Russian, but to ε in (e.g.) Chinese.

¹ Conversion between a 'flattened' representation of this kind and a hierarchical representation more in line with standard linguistic models of morphology and phonology is straightforward, and we will not dwell on this issue here.

² This is certainly not completely true in all such cases, as 'mixed' representations such as 1st and 2nd suggest. But such cases are most easily treated as also being logographic, at least in the present architecture.
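As a concrete, non-FST illustration of the composition S⁻¹ ∘ M⁻¹ ∘ D ∘ M ∘ P, the following Python sketch models each relation as a tiny hand-built dictionary for the single word kostra. The tag syntax and the ASCII stand-in phoneme string (with '@' for the reduced vowel and "'" for stress) are assumptions of this sketch, not the system's actual alphabets.

```python
# Non-FST illustration of S^-1 o M^-1 o D o M o P for the word kostra.
# Each relation is a tiny hand-built dictionary.

S = {"kostr'a": "kostra"}                  # MMA -> surface (drop stress mark)
M = {"kostr{noun}+a{sg;gen}": "kostr'a"}   # lexical -> MMA
D = {"kostr{noun}+a{sg;gen}"}              # lexical acceptor D
P = {"kostr'a": "k@str'a"}                 # MMA -> phonemes

def pronounce(surface):
    mmas = [m for m, s in S.items() if s == surface]           # apply S^-1
    lexes = [l for l, m in M.items() if m in mmas and l in D]  # M^-1, then D
    return [P[M[l]] for l in lexes]                            # M, then P

assert pronounce("kostra") == ["k@str'a"]
```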

We presume two models for space and punctuation. The model L_space maps between interword space and its potential lexical realizations, usually a word boundary, but in some cases a higher-level prosodic phrase boundary. Interword space is parameterized so that in European languages, for example, it corresponds to actual whitespace, whereas in Chinese or Japanese it corresponds to ε (the empty string). Similarly, the model L_punc maps between punctuation marks (possibly with flanking whitespace) and the lexical realization of those marks: in many, though not all, cases the punctuation mark may correspond to a prosodic phrase boundary.

The output of the lexical analysis WFST diagrammed in Figure 1 is a lattice of all possible lexical analyses of all words in the input sentence. Obviously, in general we want to remove contextually inappropriate analyses, and to pick the 'best' analysis in cases where one cannot make a categorical decision.
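Selecting a best analysis from such a lattice is at bottom a lowest-cost-path problem. Here is a minimal sketch over an explicitly coded lattice; the states, labels and weights are invented for illustration (the system itself uses a Viterbi best-path algorithm over a WFST lattice).

```python
# Minimal lowest-cost-path search over a toy analysis lattice.
import heapq

def best_path(arcs, start, final):
    """arcs: {state: [(label, weight, next_state), ...]}"""
    queue, seen = [(0.0, start, [])], set()
    while queue:
        cost, state, labels = heapq.heappop(queue)
        if state == final:
            return cost, labels          # cheapest path found first
        if state in seen:
            continue
        seen.add(state)
        for label, w, nxt in arcs.get(state, []):
            heapq.heappush(queue, (cost + w, nxt, labels + [label]))
    return None                          # no path to the final state

lattice = {
    0: [("pjat'", 0.0, 1), ("pjati", 2.0, 1)],
    1: [("procentov", 0.0, 2), ("procentn+oj", 4.0, 2)],
}
cost, labels = best_path(lattice, 0, 2)
assert (cost, labels) == (0.0, ["pjat'", "procentov"])
```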

This is accomplished by a set of one or more transducers — henceforth Λ — which are derived from rewrite rules or other expressions that examine contexts wider than the lexical word. Phrasal accentuation and prosodic phrasing are also handled by the language-model transducers.³ The output of composing the lexical analysis WFST with Λ is a lattice of contextually disambiguated lexical analyses. The lowest-cost path of this lattice is then selected, using a Viterbi best-path algorithm. Weights on the lattice may be hand-selected to disfavor certain lexical analyses (see the Russian percentage example detailed in the next section), or they may be genuine data-derived weight estimates, as in the case of the Chinese lexical analysis WFST, where the weights correspond to the negative log (unigram) probability of a particular lexical entry [11]. Given the best lexical analysis, one can then proceed to apply the phonological transducer (or set of transducers) P to the lexical analysis, or more properly to the lexical analysis composed with the lexical-to-MMA map M, as we saw above. Although the lexical-to-MMA map M was introduced as mapping from the lexical analyses of ordinary words to their MMA, if the map is constructed with sufficient care it can serve as the transducer for lexical analyses coming from any of the text-word models.

The construction of the WFSTs depends upon a lexical toolkit, which allows for the description of linguistic generalizations in linguistically sensible, human-readable form: the toolkit is thus similar in spirit to the Xerox tools [4, 3], though the latter do not allow weights in the descriptions. Considerations of space do not allow me to describe the toolkit here: the reader is referred to the longer version of this paper in [9], and also to [6] and [10], for a description.⁴

3. RUSSIAN PERCENTAGES

Let us return to the example of Russian percentage terms. Assume that we start with a fragment of text such as s 5% skidkoj (Cyrillic с 5% скидкой) 'with a five-percent discount'. This is first composed with the lexical analysis WFST to produce a set of possible lexical forms; see Figure 2.

[Figure 2 (lattice not reproduced): the candidate renditions include an adjectival nominative ?s pjat' -procentn+aja skidk+oj, the correct (boxed) adjectival instrumental ?s pjat'i -procentn+oj skidk+oj at weight 2.0, and a nominal instrumental s pjat'ju procent+ami skidk+oj at weight 4.0, among others; '?' marks the adjectival readings.]

Figure 2: Composition of s 5% skidkoj 'with a 5% discount' with the lexical analysis WFST to produce a range of possible lexical renditions for the phrase.

By default the lexical analyzer marks the adjectival readings of '%' with '?', meaning that they will be filtered out by the language-model WFSTs if contextual information does not save them. Weights on analyses mark constructions — usually oblique case forms — that are not in principle ill-formed, but are disfavored except in certain well-defined contexts. The correct analysis (boxed in Figure 2), for example, has a weight of 2.0, an arbitrary weight assigned to the oblique instrumental adjectival case form: the preferred form of the adjectival rendition of '%' is masculine, nominative, singular, if no constraints apply to rule it out. Next, the language-model WFSTs Λ are composed with the lexical analysis lattice; see Figure 3. The WFSTs Λ include transducers compiled from rewrite rules that ensure that the adjectival rendition of '%' is selected whenever there is a noun following the percent expression (the first block in Figure 3), and rules that ensure the correct case, number and gender of the adjectival form given the form of the following noun (the second block in Figure 3). In addition, a filter expressible as ¬(Σ* ? Σ*) removes any analyses containing the tag ? (the third block in Figure 3). The lowest-weight analysis among the remaining analyses is then selected. Finally, the lexical analysis is composed with M ∘ P to produce the phonemic transcription.

³ To date our multilingual systems have rather rudimentary lexical-class-based accentuation rules, and punctuation-based phrasing. Thus these components of the systems are not as sophisticated as the equivalent components of our English system; see [1, 13]. This is largely because the relevant research has not been done for most of the languages in question, rather than because of technical problems in fitting the results of that research into the model.

⁴ The toolkit is built on top of a WFST toolkit built by Michael Riley, Fernando Pereira and Mehryar Mohri of AT&T Research.
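The effect of the '?' convention and the final filter can be imitated directly on lists of tagged tokens. In this sketch the tag syntax is invented, and the weights and the case/number/gender agreement rules of the second rule block are omitted; the system itself expresses these steps as rewrite rules compiled into transducers.

```python
# Sketch of two of the steps Section 3 describes declaratively:
# a context rule that licenses the '?'-marked adjectival reading
# before a noun, then the filter not(Sigma* ? Sigma*) that discards
# analyses still containing '?'.

def apply_language_model(analyses, followed_by_noun):
    licensed = []
    for tokens in analyses:
        toks = list(tokens)
        if followed_by_noun:
            # contextual rule: erase the '?' mark, saving the reading
            toks = [t.replace("?", "") for t in toks]
        licensed.append(toks)
    # the filter: any analysis still containing '?' is removed
    return [t for t in licensed if not any("?" in tok for tok in t)]

cands = [["?procentn+oj", "skidk+oj"], ["procent+ami", "skidk+oj"]]
assert apply_language_model(cands, followed_by_noun=True) == [
    ["procentn+oj", "skidk+oj"], ["procent+ami", "skidk+oj"]]
assert apply_language_model(cands, followed_by_noun=False) == [
    ["procent+ami", "skidk+oj"]]
```

A fuller sketch would also weight the surviving analyses and hand the result to a best-path step, as described in Section 2.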

 

[Figure 3 (rule blocks not reproduced): three blocks of rewrite rules over the lattice symbols: (i) rules removing the '?' tag on procent/procentn analyses when a noun follows; (ii) rules enforcing case and number agreement of the adjectival form with the following noun; (iii) the filter ¬(Σ* ? Σ*), followed by BestPath. The final output for the lattice of Figure 2 is s pjat'i -procentn+oj[sg;ins] skidk+oj.]

Figure 3: Portion of the language-model WFSTs related to the rendition of percentages. The (correct) output of this sequence of transductions for the lattice from Figure 2 is shown at the bottom.

4. SIZE AND SPEED ISSUES

Table 1 gives the sizes of the lexical analysis WFSTs for German, Spanish, Russian and Mandarin. To a large extent these sizes accord with our intuitions about the difficulty of lexical processing in the various languages. So Russian is very large, correlating with the complexity of the morphology of that language; German is somewhat smaller. Mandarin has a small number of states, correlating with the fact that Mandarin words tend to be simple in terms of morphemic structure, but a relatively large number of arcs, due to the large character set involved. Sizes for the Spanish transducer are misleading, since the current Spanish system includes only minimal morphological analysis: note that morphological analysis is mostly unnecessary in Spanish for correct word pronunciation.

            States     Arcs
  German     77295   207859
  Russian   139592   495847
  Mandarin   48015   278905
  Spanish     8602    17236

Table 1: Sizes of lexical analysis WFSTs for selected languages.

While the transducers can be large, the performance (on, e.g., an SGI Indy) is acceptably fast for a TTS application. Slower performance is certainly observed, however, when the system is required to explore certain areas of the network, as for example in the case of expanding and disambiguating Russian number expressions. To date, no formal evaluations have been performed on the correctness of word pronunciation in the various languages under development, since there remains work to be done before the systems can be called complete. An evaluation of the correctness of word segmentation in the Mandarin system is reported in [11].

5. SUMMARY AND FUTURE WORK

The system for text analysis presented in this paper is a complete working system that has been used in the development of text-analyzers for several languages. In addition to German, Spanish, Russian and Mandarin, a system for Romanian has been built, and work on French, Japanese and Italian is underway. From the point of view of previous research on linguistic applications of finite-state transducers, some aspects of this work are familiar, some less so. Familiar, of course, are the applications to morphology, phonology, and syntax, though most previous work in these areas has not made use of weighted automata. More novel are the applications to text 'preprocessing', in particular numeral expansion and word segmentation. From the point of view of text-analysis models for text-to-speech, the approach is quite novel since, as described in the introduction, most previous work treats certain operations, such as word segmentation or numeral expansion, in a preprocessing phase that is logically prior to the linguistic analysis phase; we have argued here against this view. Areas of future work include incorporating decision-tree-based models of phrasing [13] and decision-list-based sense-disambiguation methods [14] into the text-analysis model, using the tree-compilation techniques described in [10] and similar tools.

6. REFERENCES

1. Julia Hirschberg. Pitch accent in context: Predicting intonational prominence from text. Artificial Intelligence, 63:305–340, 1993.

2. Ronald Kaplan and Martin Kay. Regular models of phonological rule systems. Computational Linguistics, 20:331–378, 1994.

3. Lauri Karttunen. Finite-state lexicon compiler. Technical Report P93-00077, Xerox Palo Alto Research Center, 1993.

4. Lauri Karttunen and Kenneth Beesley. Two-level rule compiler. Technical Report P92-00149, Xerox Palo Alto Research Center, 1992.

5. Kimmo Koskenniemi. Two-Level Morphology: A General Computational Model for Word-Form Recognition and Production. PhD thesis, University of Helsinki, Helsinki, 1983.

6. Mehryar Mohri and Richard Sproat. An efficient compiler for weighted rewrite rules. In 34th Annual Meeting of the Association for Computational Linguistics, Morristown, NJ, 1996. Association for Computational Linguistics.

7. Fernando Pereira, Michael Riley, and Richard Sproat. Weighted rational transductions and their application to human language processing. In ARPA Workshop on Human Language Technology, pages 249–254. Advanced Research Projects Agency, March 8–11 1994.

8. Richard Sproat. Morphology and Computation. MIT Press, Cambridge, MA, 1992.

9. Richard Sproat. Multilingual text analysis for text-to-speech synthesis. In Proceedings of the ECAI-96 Workshop on Extended Finite State Models of Language, Budapest, Hungary, 1996. European Conference on Artificial Intelligence.

10. Richard Sproat and Michael Riley. Compilation of weighted finite-state transducers from decision trees. In 34th Annual Meeting of the Association for Computational Linguistics, Morristown, NJ, 1996. Association for Computational Linguistics.

11. Richard Sproat, Chilin Shih, William Gale, and Nancy Chang. A stochastic finite-state word-segmentation algorithm for Chinese. Computational Linguistics, 22(3), 1996.

12. Christof Traber. SVOX: The Implementation of a Text-to-Speech System for German. Technical Report 7, Swiss Federal Institute of Technology, Zurich, 1995.

13. Michelle Wang and Julia Hirschberg. Automatic classification of intonational phrase boundaries. Computer Speech and Language, 6:175–196, 1992.

14. David Yarowsky. Homograph disambiguation in text-to-speech synthesis. In Jan van Santen, Richard Sproat, Joseph Olive, and Julia Hirschberg, editors, Progress in Speech Synthesis. Springer, New York, 1996.