DIXI - PORTUGUESE TEXT-TO-SPEECH SYSTEM

Luís C. Oliveira (INESC/IST), M. Céu Viana (CLUL), Isabel M. Trancoso (INESC/IST)

INESC: R. Alves Redol, 9, 1000 Lisboa, Portugal
CLUL: Av. 5 de Outubro, 85, 6º, 1000 Lisboa, Portugal

Abstract

This paper describes the software architecture of the Portuguese text-to-speech system DIXI¹. The system has three major modules. The first one contains the text normalizer and searches each word in the lexicon. The second one is a multi-level rule based module for lexical stress assignment, orthographic to phonetic transcription, metrically based prosodic patterning and for generating the evolution of the synthesizer parameters. The final module is the Klatt 80 formant synthesizer. The paper describes each of these main modules, emphasizing the particularities of text-to-speech synthesis in the Portuguese language.

Keywords: Speech Synthesis; Text-to-speech Systems; Portuguese Language; Synthesis-by-rule.

1 Introduction

The DIXI project is the result of the cooperation between the speech processing group of INESC and the phonetic group of CLUL and is, to our knowledge, the first text-to-speech system specifically designed for European Portuguese, from scratch.

Several guidelines were adopted in the design of this system. One of the priorities was to have a modular and flexible structure in order to allow its use as a tool for linguistic and phonetic research, and the development and evaluation of new models of sound wave production. The future extension of this system to other varieties of Portuguese, such as Brazilian Portuguese and varieties spoken in African countries, was another major guideline. The system was also designed bearing in mind its real-time implementation, namely by using efficient coding and by limiting the dictionary size. It runs on several platforms including Unix systems (e.g., VAXstations, DECstations, Suns, Alliant) and PCs running MS-DOS. Because the whole system can be transcribed into the C language and does not need to load files at runtime, it can be easily ported to a dedicated board.

For procedures applied at word level or below, a test set of about 25000 different forms was used. This constitutes a frequency corpus collected by CLUL for other purposes, comprising citation as well as inflected forms, and corresponding to about 715000 occurrences.

The three major modules of the DIXI system are depicted in the block diagram of fig. 1 and will be separately discussed in the following sections: text pre-processing in section 2, linguistic and phonetic processing in section 3, and finally, the formant synthesizer in section 4.

Figure 1: Block diagram of the DIXI system

2 Linguistic Pre-Processing

This first module performs the input text normalization and searches each word in the dictionary.

For efficiency reasons, the module is programmed directly in the C language, using functions for compiling and matching regular expressions, which simplifies code writing and legibility.
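As an illustration of this regular-expression style of normalization, the following minimal sketch detects dates in numerical format (such as 28/2/91 or 28-2-91) and keeps only those that are valid calendar dates. It is written in Python rather than in the C used by DIXI, and it is not taken from the DIXI sources: the pattern, the name find_valid_dates and the assumption that two-digit years belong to the 1900s are all hypothetical.

```python
import re
from datetime import date

# Hypothetical pattern: day, month and two-digit year separated by "/" or "-".
# The backreference \2 forces the same separator on both sides.
DATE_RE = re.compile(r"\b(\d{1,2})([/-])(\d{1,2})\2(\d{2})\b")


def find_valid_dates(text, century=1900):
    """Return (matched_text, date) pairs for the valid numerical dates in text.

    The century offset is an assumption made for this sketch only.
    """
    found = []
    for m in DATE_RE.finditer(text):
        day, month, year = int(m.group(1)), int(m.group(3)), int(m.group(4))
        try:
            found.append((m.group(0), date(century + year, month, day)))
        except ValueError:
            # Not a calendar date (e.g. 31/2/91): leave the string untouched
            # so that later steps can treat it as a numerical expression.
            continue
    return found
```

Compiling the pattern once and reusing it for every input string mirrors the compile-then-match usage mentioned above. Rejecting impossible dates keeps an ordinary numerical expression from being read aloud as a date.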

¹ Latin expression used at the end of a public speech.

The first step in the normalization procedure is the conversion of the eight-bit characters to an internal representation in seven-bit characters. This is particularly important for the Portuguese language, since it uses the cedilla (ç) and graphical stress marks in vowels (e.g. à, ê, í, õ) which are usually coded in the extended ASCII code using the eight-bit representation. Although there is an ISO standard for this extended set, it is not respected by all manufacturers, which led us into adopting two seven-bit characters for these symbols (e.g. c, `a e^ i' o~). There are also other symbols that must be replaced by Portuguese words (e.g. £ to libras) or by an internal representation (e.g. º to .o).

In the next step, the system searches the input string for dates in numerical format (e.g. 28/2/91, 28-2-91). Only valid dates are transcribed, in order to reduce the risk of translating a numerical expression.

The system contains a small dictionary of 95 abbreviation expansions which is searched when the current word ends with the symbols "." or "/", possibly followed by an extension (e.g. the Portuguese abbreviation for engineer - eng.º - which was previously normalized to eng.o, is now expanded to engenheiro).

The following step in the normalization procedure is the translation into words of all the characters that are neither letters nor punctuation marks (like #, $, %, *). Some of these characters have context dependent translations: for instance, "*" can be translated to asterisco (star) or to vezes (times) in the middle of a mathematical expression.

The translation of numerals is a common procedure in all text normalizers. The DIXI system can translate both ordinal numbers (e.g. 101º - cente'simo primeiro, 101ª - cente'sima primeira) and cardinal numbers (e.g. 101 - cento e um) in integer, fixed or floating point format.

Since not all keyboards can produce the Portuguese characters, the normalizer also accepts the stress marks separated from the vowels, as in 'a or a', and the cedilla separated from the c. This is especially useful for processing Unix electronic mail messages, which do not allow eight-bit characters, and it is also by far the most common way adopted by Portuguese users when typing on a foreign type of keyboard. Whenever necessary, the text normalizer changes the position of the mark or cedilla to the internal format position.

The last step of the normalization procedure is the processing of acronyms. The adopted strategy is to restrict spelling to acronyms with no vowels, and to let the rules take care of the others.

After input text normalization each word is searched in the dictionary and, if the search is successful, the entry is associated with it. In the current version, the system uses a small dictionary, containing the index of the word stress vowel, the phonetic transcription and the grammatical category of each form. The dictionary is used for exceptions to the phonetic transcription rules and for syntactic parsing of the utterance.

3 Linguistic and Phonetic Processing

Although a text-to-speech system can be seen as an attempt to model the linguistic and phonetic knowledge needed to produce natural speech from an abstract phonological representation, this is only partly true for the present version of DIXI. In fact, a complete model would require a much deeper understanding of some of the language specific phenomena in Portuguese. On the other hand, more pragmatic approaches can be justified in some parts for efficiency's sake.

The system uses an international alphabet (SAM-PA [10]), and was designed to allow the introduction of applicability conditions at the different levels of the linguistic processing. The two factors are important for its use as a research tool and for future extensions to other varieties of Portuguese.

With the exception of lexical stress assignment, the linguistic and phonetic module was built using a rule compiler combined with a set of auxiliary functions written in the C language. The use of a rule compiler has the advantage of imposing a more structured rule definition [6] and enabling the system development by researchers with less programming skills.

SCYLA, Speech Compiler for Your LAnguage, the rule compiler developed by CSELT [7], was selected because of three basic features: its multi-level structures, allowing each procedure to access simultaneously all the previous procedures' results; its ability to generate portable C code which can be optimized for the hardware where it is going to run; and, finally, its connection to a conventional procedural language for the operations more efficiently coded in this form.

3.1 Lexical stress assignment

Lexical stress assignment is one of the most important factors for a correct reading of European Portuguese, since stress dependent vowel reduction is one of its most striking characteristics. Unstressed vowels can undergo quality change, shortening, devoicing and deletion.

This assignment is a necessary step for words not included in the dictionary, without a graphical stress mark (´, ` or ^) and with more than two letters.

The stress is marked with the SAM phonetic alphabet symbol for primary stress (") and is located by a set of 18 rules which are basically the same as described in [3].

For efficiency's sake, we have decided to write these rules directly in the C language instead of using the rule compiler. Otherwise, stress could have been assigned by the same set of rhythmic rules that describe the relative prominence of syllables within a word.

In our test set, 88% of the forms need the stress vowel marking rules. The general rule is applied for 71% of the cases, and each one of the remaining rules never exceeds an application rate of 10%.

3.2 The segmental line

The first procedure of the rule system fills in the first level with the input text and the marks on the stress vowel. A number of different levels is also filled with the dictionary information for the words with an associated entry.

The first level, letter, is taken directly as the segmental line, without any grapheme-to-phoneme mapping rules. This option
is motivated by the regularity of Portuguese orthography, based mainly on phonological criteria. In the cases of e and o, where the same orthographic symbol can be associated with two different phonological representations, the low vowel is assumed. This approach, taken for rule simplicity as well as statistical reasons, handles homographs (e.g. pega [p"eg6] - magpie - and pega [p"Eg6] - a handle) as well as ambiguous word endings (e.g. maravilhosa [m6r6viL"Oz6] - marvelous - where osa is a suffix, and raposa [R6p"oz6] - fox - where it is not).

3.3 The syntactic parsing

A limited syntactic parsing is a common procedure in several text-to-speech systems for other languages [1] [9] [8]. In the DIXI system, a very crude syntactic analysis is performed by means of identifying punctuation marks and a set of grammatical words to which certain syntactic structures are normally associated. This step aims to identify the clause and sentence boundaries, modality and verb localization. This type of information is essential for a good performance of the prosodic parsing and phonetic transcription procedures, described below.

At this level, the program also searches a set of expressions indicating syntactic structures, always associated in Portuguese with a prosodic focus (e.g. até - even, o próprio - himself). If no such structures are found, a focus marker can be assigned to the first or last constituent of the sentence by a random process.

3.4 Prosodic parsing

In Portuguese, as well as in many other languages, the different syllables within a word are structured according to a rhythmic principle of alternation between strong and weak beats, the same kind of principles being used to group words into larger units. Using a grid and constituent model to account for the internal organization of syllables within words, a close relationship was found between the degrees of prominence attributed to the syllables by the model and their relative durations [4].

Those degrees of prominence also account for most of the variance observed in the relative durations of the segments within the syllables. Extending this type of analysis to prosodic domains above the word level, it is possible to account for the relative prominences in connected speech and predict its main prosodic properties.

As temporal and spectral reduction of unstressed vowels, as well as lengthening of stressed ones, play a central role in the naturalness and fluency of spoken Portuguese, DIXI performs a prosodic parsing of each utterance to be synthesized.

Based on rhythmic principles, a bottom-up prosodic parsing is made: segments are grouped into syllables, syllables into words, words into prosodic phrases and prosodic phrases into prosodic groups up to the level of the utterance. This gradual grouping into larger and larger units is achieved respecting the broad syntactic hierarchy, and prominences are attributed at each level of the analysis procedure.

Next, a top-down procedure of prosodic partition of the utterance is adopted. Relatively short utterances can be produced without any prosodic partition, but very long ones cannot. The existence of a partition at level N_x implies a partition at level N_{x+1}. The set of possible partitions is computed and different degrees of probability are assigned to each of them (e.g. eurhythmic partitions are the most favored; partitions with the longer prosodic group on the right side are preferred to those having the shorter group in this position). A random selection of the partition level is then made and the pauses (if any) are introduced.

The next step consists of a series of procedures for melody assignment and tonal association. The tonal features are considered as behaving independently, and are thus represented in distinct autosegmental lines, synchronized with the segmental line. The critical associations are marked and tonal features spread over different domains that can intersect on the segmental line.

The prosodic module generates an abstract representation in terms of phonetic features and prominence relations that determines the prosodic properties of the utterance.

3.5 Orthographic-phonetic transcription

Binary-valued features are associated to each token of the segmental line and the phonetic transcription of the utterance is performed. Some of the rules considered at this level are purely phonetically motivated, while others are sensitive to prosodic boundaries. A small set is also sensitive to the word grammatical category. The complete set of about 200 rules is basically the same as in [3] and accounts for allophonic variation within and across word boundaries. Re-syllabification is automatically triggered by certain rules, namely by those involving vowel deletion or diphthongization across word boundaries.

Using the test set referred to above, 92% of the words were correctly transcribed without resorting to a dictionary. In the remaining cases, a single error per word was generally detected. About 80% of the errors are due to the fact that there is no information on the morphological structure of the forms. Most of the remaining errors are exceptions to several rules. Both need to be included in the dictionary.

3.6 Phonetic values assignment

The phonetic transcription is rather broad and does not contain all the information needed to drive the control parameters of the synthesizer. Other domains of feature association need to be taken into account. This is motivated by the fact that while certain features associate with the segment as a whole, others do not. They are associated to parts of segments and spread to adjacent parts of other segments (e.g. nasality in vowels).

The notion of subsegment is, thus, explored and the first procedure at this level does the splitting of a segment into different parts. For instance, stops are decomposed into closure and burst, and trills into sequences of obstruent/vowel-like alternations, whose number and order are context-dependent. Binary-valued features are then transformed into n-ary ones.

Tables of default transition duration and target values for each one of the variable control parameters of the synthesizer are searched and synchronized at this subsegmental level. A set of
target modification rules are then applied. In the case of nasal vowels, for instance, these rules determine the nasal pole-zero pair distance and the amount of nasal murmur.

3.7 Synthesizer control

The synthesis strategy is similar to the one described in [1], that is, a target and transition model is used to draw all the control parameter tracks for the synthesizer. Transition variables define the transition type, the smoothing time constants and the discontinuity values.

The last step of this module estimates the parametric data, accounting for the relative influence of prosodic as well as segmental features. Reference lines, whose length and slope are determined by the prosodic structure, are used to scale prosodic properties such as F0 and energy. Segmental durations are also calculated on the basis of prosodic prominences, syllabic structure and segmental properties.

Finally, the synthesizer control parameters are computed at 5 millisecond intervals, by linear interpolation. In the current version, this module controls 18 of the synthesizer parameters.

4 The Formant Synthesizer

The DIXI system uses the Klatt80 [5] formant synthesizer with small changes in the voicing source.

One of the main reasons for this choice was the well known ability of this synthesizer to generate speech close to a human-like quality for different types of voices, provided that the necessary linguistic and phonetic knowledge is adequately embedded in the system, as shown by the performances of the MITalk and DECtalk systems.

Previous work using Klatt80 to generate stimuli for perceptual experiments [11][2] also showed that the acoustic patterns observed for Portuguese can be successfully imitated at natural-speech quality.

Furthermore, Portuguese vowel reduction and large consonant clusters resulting from vowel deletion are easier to produce using this model than by concatenation of segmental units.

For those reasons the Klatt80 synthesizer seemed like the natural choice for our system.

5 Conclusion

Although the DIXI system is still in an experimental stage, the results achieved so far can be considered encouraging, namely: the performance of the stress assignment and phonetic transcription modules, the versatility of the target and transition model and the results of prosodic parsing. Moreover, the intensity and fundamental frequency modulations, based on the random selection of a partition level, strongly contribute to making the synthetic speech much less monotonous.

Future work will include carrying out intelligibility and comprehension tests, the implementation of a more powerful syntactic parser and a better understanding of vowel reduction and related phenomena.

References

[1] Allen, J., Hunnicutt, M. S., Klatt, D. (1987). From Text to Speech: The MITalk System, Cambridge Univ. Press, Cambridge, U.K.

[2] Andrade, A. (1989). Um Estudo Experimental das Vogais Anteriores e Recuadas em Português: Implicações para a Teoria dos Traços Distintivos, Diss. CLUL-INIC, ms.

[3] Andrade, E., Viana, M. C. (1985). "Corso I - Um conversor de texto ortográfico em código fonético para o português", Tech. Rep., CLUL-INIC.

[4] Andrade, E., Viana, M. C. (1988). "Ainda sobre o ritmo e o acento em português", Actas do 4º Encontro da Associação Portuguesa de Linguística, Lisboa, 1988, pp. 3-15.

[5] Klatt, D. H. (1980). "Software for a cascade/parallel formant synthesizer", Journal of the Acoustical Society of America, 67, 971-995.

[6] Klatt, D. H. (1987). "Review of Text-to-Speech Conversion for English", Journal of the Acoustical Society of America, 82(3), 737-793.

[7] Lazzaretto, S., Nebbia, L. (1987). "SCYLA: Speech Compiler for Your Language", Proc. of the European Conf. on Speech Technology, Edinburgh, September 1987, 2, 381-384.

[8] O'Shaughnessy, D. D. (1989). "Parsing with a small dictionary for applications such as text to speech", Computational Linguistics, 15, 97-108.

[9] Sorin, Ch., Larreur, D., Llorca, R. (1987). "A rhythm based prosodic parser for text-to-speech systems in French", Proceedings of the XIth ICPhS, 1:125-128.

[10] Winski, R., Barry, W. J., Fourcin, A. (eds.) (1989). Support Available from SAM Project for other ESPRIT Speech and Language Work, Esprit Project 2589 (SAM), Multi-Lingual Speech Input/Output Assessment, Methodology and Standardisation.

[11] Stevens, K. N., Andrade, A., Viana, M. C. (1987). "Perception of Vowel Nasalization in VC Contexts: A Cross Language Study", Journal of the Acoustical Society of America, 82, S119(A).