DIXI - PORTUGUESE TEXT-TO-SPEECH SYSTEM

Luís C. Oliveira (INESC/IST), M. Céu Viana (CLUL), Isabel M. Trancoso (INESC/IST)

INESC: R. Alves Redol, 9, 1000 Lisboa, Portugal
CLUL: Av. 5 de Outubro, 85, 6º, 1000 Lisboa, Portugal

Abstract

This paper describes the software architecture of the Portuguese text-to-speech system DIXI¹. The system has three major modules. The first one contains the text normalizer and searches each word in the lexicon. The second one is a multi-level rule-based module for lexical stress assignment, orthographic to phonetic transcription, metrically based prosodic patterning and for generating the evolution of the synthesizer parameters. The final module is the Klatt 80 formant synthesizer. The paper describes each of these main modules, emphasizing the particularities of text-to-speech synthesis in the Portuguese language.

Keywords: Speech Synthesis; Text-to-speech Systems; Portuguese Language; Synthesis-by-rule.

1 Introduction

The DIXI project is the result of the cooperation between the speech processing group of INESC and the phonetic group of CLUL and is, to our knowledge, the first text-to-speech system specifically designed for European Portuguese, from scratch.

Several guidelines were adopted in the design of this system. One of the priorities was to have a modular and flexible structure in order to allow its use as a tool for linguistic and phonetic research, and for the development and evaluation of new models of sound wave production. The future extension of this system to other varieties of Portuguese, such as Brazilian Portuguese and the varieties spoken in African countries, was another major guideline. The system was also designed bearing in mind its real-time implementation, namely by using efficient coding and by limiting the dictionary size. It runs on several platforms including Unix systems (e.g., VAXstations, DECstations, Suns, Alliant) and PCs running MS-DOS. Because the whole system can be transcribed into the C language and does not need to load files at runtime, it can be easily ported to a dedicated board.

For procedures applied at word level or below, a test set of about 25000 different forms was used. This constitutes a frequency corpus collected by CLUL for other purposes, comprising citation as well as inflected forms, and corresponding to about 715000 occurrences.

Figure 1: Block diagram of the DIXI system

The three major modules of the DIXI system are depicted in the block diagram of fig. 1 and will be separately discussed in the following sections: text pre-processing in section 2, linguistic and phonetic processing in section 3, and finally, the formant synthesizer in section 4.

2 Linguistic Pre-Processing

This first module performs the input text normalization and searches each word in the dictionary.

For efficiency reasons, the module is programmed directly in the C language, using functions for compiling and matching regular expressions, which simplifies code writing and legibility.
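
To illustrate how regular expressions can drive a text normalizer of this kind, the following minimal sketch (written in Python rather than the C of the actual system, purely for brevity) locates candidate dates in numerical format, such as 28/2/91 or 28-2-91, and keeps only those that are valid calendar dates. The pattern, the function name find_valid_dates and the reading of two-digit years as 19xx are illustrative assumptions, not taken from DIXI.

```python
import re
from datetime import date

# Candidate dates: day, month, two-digit year, with the same separator
# ("/" or "-") in both positions. Hypothetical pattern, not DIXI's own.
DATE_RE = re.compile(r"(?<!\d)(\d{1,2})([/-])(\d{1,2})\2(\d{2})(?!\d)")


def find_valid_dates(text):
    """Return (matched_text, date) pairs for valid numerical dates in text."""
    found = []
    for m in DATE_RE.finditer(text):
        day, month, year = int(m.group(1)), int(m.group(3)), int(m.group(4))
        try:
            # Assumption: two-digit years are read as 19xx.
            parsed = date(1900 + year, month, day)
        except ValueError:
            # Not a real calendar date (e.g. 31/2/91): leave it untouched.
            continue
        found.append((m.group(0), parsed))
    return found


if __name__ == "__main__":
    for matched, parsed in find_valid_dates("Entre 28/2/91 e 31/2/91."):
        print(matched, parsed.isoformat())
```

Rejecting impossible dates at this stage means that a string which merely looks like a date is left for the later numeral rules instead of being read aloud as a date.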

¹ DIXI: Latin expression used at the end of a public speech.

The first step in the normalization procedure is the conversion of
the eight-bit characters to an internal representation in seven-bit characters. This is particularly important for the Portuguese language, since it uses the c cedilla (ç) and graphical stress marks in vowels (e.g. á, ê, í, õ) which are usually coded in the extended ASCII code using the eight-bit representation. Although there is an ISO standard for this extended set, it is not respected by all manufacturers, which led us to adopt two seven-bit characters for these symbols (e.g. c, `a e^ i' o~). There are also other symbols that must be replaced by Portuguese words (e.g. £ to libras) or by an internal representation (e.g. º to .o).

In the next step, the system searches the input string for dates in numerical format (e.g. 28/2/91, 28-2-91). Only valid dates are transcribed, in order to reduce the risk of misreading some other numerical expression as a date.

The system contains a small dictionary of 95 abbreviation expansions which is searched when the current word ends with the symbols "." or "/", possibly followed by an extension (e.g. the Portuguese abbreviation for engineer, engº, which was previously normalized to eng.o, is now expanded to engenheiro).

The following step in the normalization procedure is the translation into words of all the characters that are neither letters nor punctuation marks (like #, $, %, *). Some of these characters have context-dependent translations: for instance, "*" can be translated to asterisco (star) or, in the middle of a mathematical expression, to vezes (times).

The translation of numerals is a common procedure in all text normalizers. The DIXI system can translate cardinal numbers in integer, fixed or floating point format as well as ordinal numbers (e.g. 101 - cento e um, 101º - cente'simo primeiro, 101ª - cente'sima primeira).

Since not all keyboards can produce the Portuguese characters, the normalizer also accepts the stress marks separated from the vowels, as in 'a or a', and the cedilla separated from the c. This is especially useful for processing Unix electronic mail messages, which do not allow eight-bit characters, and it is also by far the most common convention adopted by Portuguese users when typing on a foreign type of keyboard. Whenever necessary, the text normalizer changes the position of the mark or cedilla to the internal format position.

The last step of the normalization procedure is the processing of acronyms. The adopted strategy is to restrict spelling to acronyms with no vowels, and to let the phonetic transcription rules take care of the others.

After input text normalization each word is searched in the dictionary and, if the search is successful, the entry is associated with it. In the current version, the system uses a small dictionary, containing the index of the stressed vowel of the word, the phonetic transcription and the grammatical category of each form. The dictionary is used for exceptions to the phonetic transcription rules and for syntactic parsing of the utterance.

3 Linguistic and Phonetic Processing

Although a text-to-speech system can be seen as an attempt to model the linguistic and phonetic knowledge needed to produce natural speech from an abstract phonological representation, this is only partly true for the present version of DIXI. In fact, a complete model would require a much deeper understanding of some of the language-specific phenomena in Portuguese. On the other hand, more pragmatic approaches can be justified in some parts for efficiency's sake.

The system uses an international alphabet (SAM-PA [10]), and was designed to allow the introduction of applicability conditions at the different levels of the linguistic processing. These two factors are important for its use as a research tool and for future extensions to other varieties of Portuguese.

With the exception of lexical stress assignment, the linguistic and phonetic module was built using a rule compiler combined with a set of auxiliary functions written in the C language. The use of a rule compiler has the advantage of imposing a more structured rule definition [6] and enabling system development by researchers with less programming skills.

SCYLA, Speech Compiler for Your LAnguage, the rule compiler developed by CSELT [7], was selected because of three basic features: its multi-level structure, allowing each procedure to access simultaneously the results of all the previous procedures; its ability to generate portable C code, which can be optimized for the hardware where it is going to run; and, finally, its connection to a conventional procedural language for the operations more efficiently coded in this form.

3.1 Lexical stress assignment

Lexical stress assignment is one of the most important factors for a correct reading of European Portuguese, since stress-dependent vowel reduction is one of its most striking characteristics. Unstressed vowels can undergo quality change, shortening, devoicing and deletion.

This assignment is a necessary step for words not included in the dictionary, without a graphical stress mark (´, ` or ^) and with more than two letters.

The stressed vowel is marked with the SAM phonetic alphabet symbol for primary stress (") and is located by a set of 18 rules which are basically the same as described in [3].

For efficiency's sake, we have decided to write these rules directly in the C language instead of using the rule compiler. Otherwise, stress could have been assigned by the same set of rhythmic rules that describe the relative prominence of syllables within a word.

In our test set, 88% of the forms need the stress marking rules. The general rule is applied in 71% of the cases, and none of the remaining rules exceeds an application rate of 10%.

3.2 The segmental line

The first procedure of the rule system fills in the first level with the input text and the marks on the stressed vowel. A number of different levels are also filled with the dictionary information for the words with an associated entry.

The first level, letter, is taken directly as the segmental line, without any grapheme-to-phoneme mapping rules. This option
is motivated by the regularity of Portuguese orthography, based mainly on phonological criteria. In the cases of e and o, where the same orthographic symbol can be associated with two different phonological representations, the low vowel is assumed. This approach, taken for rule simplicity as well as statistical reasons, handles homographs (e.g. pega [p"eg6] - magpie - and pega [p"Eg6] - a handle) as well as ambiguous word endings (e.g. maravilhosa [m6r6viL"Oz6] - marvelous - where osa is a suffix, and raposa [R6p"oz6] - fox - where it is not).

3.3 The syntactic parsing

A limited syntactic parsing is a common procedure in several text-to-speech systems for other languages [1] [9] [8]. In the DIXI system, a very crude syntactic analysis is performed by identifying punctuation marks and a set of grammatical words with which certain syntactic structures are normally associated. This step aims to identify the clause and sentence boundaries, modality and verb localization. This type of information is essential for a good performance of the prosodic parsing and phonetic transcription procedures, described below.

At this level, the program also searches for a set of expressions indicating syntactic structures that are always associated in Portuguese with a prosodic focus (e.g. até - even, o próprio - himself). If no such structures are found, a focus marker can be assigned to the first or last constituent of the sentence by a random process.

3.4 Prosodic parsing

In Portuguese, as well as in many other languages, the different syllables within a word are structured according to a rhythmic principle of alternation between strong and weak beats, the same kind of principles being used to group words into larger units.

Using a grid and constituent model to account for the internal organization of syllables within words, a close relationship was found between the degrees of prominence attributed to the syllables by the model and their relative durations [4].

Those degrees of prominence also account for most of the variance observed in the relative durations of the segments within the syllables. Extending this type of analysis to prosodic domains above the word level, it is possible to account for the relative prominences in connected speech and predict its main prosodic properties.

As temporal and spectral reduction of unstressed vowels, as well as lengthening of stressed ones, play a central role in the naturalness and fluency of spoken Portuguese, DIXI performs a prosodic parsing of each utterance to be synthesized.

Based on rhythmic principles, a bottom-up prosodic parsing is made: segments are grouped into syllables, syllables into words, words into prosodic phrases and prosodic phrases into prosodic groups up to the level of the utterance. This gradual grouping into larger and larger units is achieved respecting the broad syntactic hierarchy, and prominences are attributed at each level of the analysis procedure.

Next, a top-down procedure of prosodic partition of the utterance is adopted. Relatively short utterances can be produced without any prosodic partition, but very long ones cannot. The existence of a partition at level N_x implies a partition at the level N_{x+1}. The set of possible partitions is computed and different degrees of probability are assigned to each of them (e.g. eurhythmic partitions are the most favored; partitions with the longer prosodic group on the right side are preferred to those having the shorter group in this position). A random selection of the partition level is then made and the pauses (if any) are introduced.

The next step consists of a series of procedures for melody assignment and tonal association. The tonal features are considered as behaving independently, and are thus represented in distinct autosegmental lines, synchronized with the segmental line. The critical associations are marked and tonal features spread over different domains that can intersect on the segmental line.

The prosodic module generates an abstract representation in terms of phonetic features and prominence relations that determines the prosodic properties of the utterance.

3.5 Orthographic-phonetic transcription

Binary-valued features are associated with each token of the segmental line and the phonetic transcription of the utterance is performed. Some of the rules considered at this level are purely phonetically motivated, while others are sensitive to prosodic boundaries. A small set is also sensitive to the grammatical category of the word. The complete set of about 200 rules is basically the same as in [3] and accounts for allophonic variation within and across word boundaries. Re-syllabification is automatically triggered by certain rules, namely by those involving vowel deletion or diphthongization across word boundaries.

Using the test set referred to above, 92% of the words were correctly transcribed without resorting to a dictionary. In the remaining cases, a single error per word was generally detected. About 80% of the errors are due to the fact that there is no information on the morphological structure of the forms. Most of the remaining errors are exceptions to several rules. Both kinds of forms need to be included in the dictionary.

3.6 Phonetic values assignment

The phonetic transcription is rather broad and does not contain all the information needed to drive the control parameters of the synthesizer. Other domains of feature association need to be taken into account. This is motivated by the fact that while certain features associate with the segment as a whole, others do not. They are associated with parts of segments and spread to adjacent parts of other segments (e.g. nasality in vowels).

The notion of subsegment is, thus, explored and the first procedure at this level splits each segment into different parts. For instance, plosives are decomposed into closure and burst, and trills into sequences of obstruent/vowel-like alternations, whose number and order are context-dependent. Binary-valued features are then transformed into n-ary ones.

Tables of default transition duration and target values for each one of the variable control parameters of the synthesizer are searched and synchronized at this subsegmental level. A set of
target modification rules are then applied. In the case of nasal vowels, for instance, these rules determine the nasal pole-zero pair distance and the amount of nasal murmur.

3.7 Synthesizer control

The synthesis strategy is similar to the one described in [1], that is, a target and transition model is used to draw all the control parameter tracks for the synthesizer. Transition variables define the transition type, the smoothing time constants and the discontinuity values.

The last step of this module estimates the parametric data, accounting for the relative influence of prosodic as well as segmental features. Reference lines, whose length and slope are determined by the prosodic structure, are used to scale prosodic properties such as F0 and energy. Segmental durations are also calculated on the basis of prosodic prominences, syllabic structure and segmental properties.

Finally, the synthesizer control parameters are computed at 5-millisecond intervals, by linear interpolation. In the current version, this module controls 18 of the synthesizer parameters.

4 The Formant Synthesizer

The DIXI system uses the Klatt80 [5] formant synthesizer with small changes in the voicing source.

One of the main reasons for this choice was the well-known ability of this synthesizer to generate speech close to human-like quality for different types of voices, provided that the necessary linguistic and phonetic knowledge is adequately embedded in the system, as shown by the performance of the MITalk and DECtalk systems.

Previous work using Klatt80 to generate stimuli for perceptual experiments [11][2] also showed that the acoustic patterns observed for Portuguese can be successfully imitated at natural-speech quality.

Furthermore, Portuguese vowel reduction and the large consonant clusters resulting from vowel deletion are easier to produce using this model than by concatenation of segmental units.

For those reasons the Klatt80 synthesizer seemed like the natural choice for our system.

5 Conclusion

Although the DIXI system is still in an experimental stage, the results achieved so far can be considered encouraging, namely: the performance of the stress assignment and phonetic transcription modules, the versatility of the target and transition model and the results of prosodic parsing. Moreover, the intensity and fundamental frequency modulations, based on the random selection of a partition level, strongly contribute to making the synthetic speech much less monotonous.

Future work will include intelligibility and comprehension tests, the implementation of a more powerful syntactic parser and a better understanding of vowel reduction and related phenomena.

References

[1] Allen, J., Hunnicutt, M. S., Klatt, D. (1987). From Text to Speech: The MITalk System, Cambridge Univ. P., Cambridge, U.K.

[2] Andrade, A. (1989). Um Estudo Experimental das Vogais Anteriores e Recuadas em Português: Implicações para a Teoria dos Traços Distintivos, Diss. CLUL-INIC, ms.

[3] Andrade, E., Viana, M. C. (1985). "Corso I - Um conversor de texto ortográfico em código fonético para o português", Tech. Rep., CLUL-INIC.

[4] Andrade, E., Viana, M. C. (1988). "Ainda sobre o ritmo e o acento em português", Actas do 4º Encontro da Associação Portuguesa de Linguística, Lisboa, 1988, pp. 3-15.

[5] Klatt, D. H. (1980). "Software for a cascade/parallel formant synthesizer", Journal of the Acoustical Society of America, 67, 971-995.

[6] Klatt, D. H. (1987). "Review of Text-to-Speech Conversion for English", Journal of the Acoustical Society of America, 82(3), 737-793.

[7] Lazzaretto, S., Nebbia, L. (1987). "SCYLA: Speech Compiler for Your Language", Proc. of the European Conf. on Speech Technology, Edinburgh, September 1987, 2, 381-384.

[8] O'Shaughnessy, D. D. (1989). "Parsing with a small dictionary for applications such as text to speech", Computational Linguistics, 15, 97-108.

[9] Sorin, Ch., Larreur, D., Llorca, R. (1987). "A rhythm based prosodic parser for text-to-speech systems in French", Proceedings of the XIth ICPhS, 1:125-128.

[10] Winski, R., Barry, W. J., Fourcin, A. (eds.) (1989). Support Available from SAM Project for other ESPRIT Speech and Language Work, Esprit Project 2589 (SAM), Multi-Lingual Speech Input/Output Assessment, Methodology and Standardisation.

[11] Stevens, K. N., Andrade, A., Viana, M. C. (1987). "Perception of Vowel Nasalization in VC Contexts: A Cross Language Study", Journal of the Acoustical Society of America, 82, S119(A).
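
As an illustration of the final step of the synthesizer control stage (section 3.7), where control parameters are computed at 5-millisecond intervals by linear interpolation, the following minimal Python sketch samples one parameter track between successive targets. The function name sample_track, the representation of targets as (time, value) pairs and the F0 values in the usage example are hypothetical; transition types, smoothing time constants and discontinuity values are deliberately left out.

```python
FRAME_MS = 5  # frame interval stated in the text


def sample_track(targets, frame_ms=FRAME_MS):
    """Sample a piecewise-linear parameter track every frame_ms milliseconds.

    targets: list of (time_ms, value) pairs with strictly increasing times.
    Returns one value per frame, from the first to the last target time.
    """
    if not targets:
        return []
    if len(targets) == 1:
        return [float(targets[0][1])]
    frames = []
    seg = 0  # index of the target that opens the current segment
    t = targets[0][0]
    end = targets[-1][0]
    while t <= end:
        # Advance to the segment [t0, t1] that contains t.
        while seg < len(targets) - 2 and t > targets[seg + 1][0]:
            seg += 1
        (t0, v0), (t1, v1) = targets[seg], targets[seg + 1]
        frames.append(v0 + (v1 - v0) * (t - t0) / (t1 - t0))
        t += frame_ms
    return frames


if __name__ == "__main__":
    # Hypothetical F0 targets in Hz at 0, 20 and 30 ms.
    print(sample_track([(0, 100.0), (20, 120.0), (30, 110.0)]))
```

In a full system one such track would be produced for each controlled synthesizer parameter, with all tracks sharing the same frame clock.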