Portuguese Text-To-Speech System
DIXI - PORTUGUESE TEXT-TO-SPEECH SYSTEM

Luís C. Oliveira (INESC/IST), M. Céu Viana (CLUL), Isabel M. Trancoso (INESC/IST)

INESC, R. Alves Redol, 9, 1000 Lisboa, Portugal
CLUL, Av. 5 de Outubro, 85, 6º, 1000 Lisboa, Portugal

Abstract

This paper describes the software architecture of the Portuguese text-to-speech system DIXI¹. The system has three major modules. The first one contains the text normalizer and searches each word in the lexicon. The second one is a multi-level rule-based module for lexical stress assignment, orthographic to phonetic transcription, metrically based prosodic patterning and for generating the evolution of the synthesizer parameters. The final module is the Klatt 80 formant synthesizer. The paper describes each of these main modules, emphasizing the particularities of text-to-speech synthesis in the Portuguese language.

Keywords: Speech Synthesis; Text-to-speech Systems; Portuguese Language; Synthesis-by-rule.

¹ Latin expression used at the end of a public speech.

1 Introduction

The DIXI project is the result of the cooperation between the speech processing group of INESC and the phonetic group of CLUL and is, to our knowledge, the first text-to-speech system specifically designed for European Portuguese from scratch.

Several guidelines were adopted in the design of this system. One of the priorities was to have a modular and flexible structure, in order to allow its use as a tool for linguistic and phonetic research and for the development and evaluation of new models of sound wave production. The future extension of this system to other varieties of Portuguese, such as Brazilian Portuguese and the varieties spoken in African countries, was another major guideline. The system was also designed bearing in mind its real-time implementation, namely by using efficient coding and by limiting the dictionary size. It runs on several platforms, including Unix systems (e.g., VAXstations, DECstations, Suns, Alliant) and PCs running MS-DOS. Because the whole system can be transcribed into the C language and does not need to load files at run time, it can easily be ported to a dedicated board.

For procedures applied at word level or below, a test set of about 25000 different forms was used. This constitutes a frequency corpus collected by CLUL for other purposes, comprising citation as well as inflected forms, and corresponding to about 715000 occurrences.

The three major modules of the DIXI system are depicted in the block diagram of fig. 1 and will be discussed separately in the following sections: text pre-processing in section 2, linguistic and phonetic processing in section 3, and finally, the formant synthesizer in section 4.

Figure 1: Block diagram of the DIXI system

2 Linguistic Pre-Processing

This first module performs the input text normalization and searches each word in the dictionary. For efficiency reasons, the module is programmed directly in the C language, using functions for compiling and matching regular expressions, which simplifies code writing and improves legibility.

The first step in the normalization procedure is the conversion of the eight-bit characters to an internal representation in seven-bit characters. This is particularly important for the Portuguese language, since it uses the c cedilla (ç) and graphical stress marks on vowels (e.g. à, ê, í, õ), which are usually coded in the extended ASCII code using the eight-bit representation. Although there is an ISO standard for this extended set, it is not respected by all manufacturers, which led us to adopt a representation with two seven-bit characters for these symbols (e.g. "c,", "`a", "e^", "i'", "o~"). There are also other symbols that must be replaced by Portuguese words (e.g. £ to libras) or by an internal representation (e.g. º to .o).

In the next step, the system searches the input string for dates in numerical format (e.g. 28/2/91, 28-2-91). Only valid dates are transcribed, in order to reduce the risk of reading an ordinary numerical expression as a date.

The system contains a small dictionary of 95 abbreviation expansions, which is searched when the current word ends with the symbols "." or "/", possibly followed by an extension (e.g. the Portuguese abbreviation for engineer, eng.º, which was previously normalized to eng.o, is now expanded to engenheiro).

The following step in the normalization procedure is the translation into words of all the characters that are neither letters nor punctuation marks (like #, $, %, *). Some of these characters have context-dependent translations; for instance, "*" can be translated to asterisco (star) or to vezes (times) in the middle of a mathematical expression.

The translation of numerals is a common procedure in all text normalizers. The DIXI system can translate both ordinal numbers (e.g. 101º - cente'simo primeiro, 101ª - cente'sima primeira) and cardinal numbers (e.g. 101 - cento e um) in integer, fixed or floating point format.

Since not all keyboards can produce the Portuguese characters, the normalizer also accepts the stress marks separated from the vowels, as in 'a or a', and the cedilla separated from the c. This is especially useful for processing Unix electronic mail messages, which do not allow eight-bit characters, and it is also by far the most common convention adopted by Portuguese users when typing on a foreign type of keyboard. Whenever necessary, the text normalizer moves the mark or cedilla to its internal format position.

The last step of the normalization procedure is the processing of acronyms. The adopted strategy is to restrict spelling to acronyms with no vowels, and to let the phonetic transcription rules take care of the others.

After input text normalization, each word is searched in the dictionary and, if the search is successful, the entry is associated with it. In the current version, the system uses a small dictionary containing the index of the stressed vowel of the word, the phonetic transcription and the grammatical category of each form. The dictionary is used for exceptions to the phonetic transcription rules and for syntactic parsing of the utterance.

3 Linguistic and Phonetic Processing

Although a text-to-speech system can be seen as an attempt to model the linguistic and phonetic knowledge needed to produce [...] is only partly true for the present version of DIXI. In fact, a complete model would require a much deeper understanding of some of the language-specific phenomena in Portuguese. On the other hand, more pragmatic approaches can be justified in some parts for efficiency's sake.

The system uses an international alphabet (SAM-PA [10]), and was designed to allow the introduction of applicability conditions at the different levels of the linguistic processing. The two factors are important for its use as a research tool and for future extensions to other varieties of Portuguese.

With the exception of lexical stress assignment, the linguistic and phonetic module was built using a rule compiler combined with a set of auxiliary functions written in the C language. The use of a rule compiler has the advantage of imposing a more structured rule definition [6] and enabling system development by researchers with less programming skills.

SCYLA (Speech Compiler for Your LAnguage), the rule compiler developed by CSELT [7], was selected because of three basic features: its multi-level structure, which allows each procedure to access simultaneously the results of all the previous procedures; its ability to generate portable C code, which can be optimized for the hardware where it is going to run; and, finally, its connection to a conventional procedural language for the operations that are more efficiently coded in that form.

3.1 Lexical stress assignment

Lexical stress assignment is one of the most important factors for a correct reading of European Portuguese, since stress-dependent vowel reduction is one of its most striking characteristics. Unstressed vowels can undergo quality change, shortening, devoicing and deletion.

This assignment is a necessary step for words that are not included in the dictionary, have no graphical stress mark (´, ` or ^) and have more than two letters.

The stressed vowel is marked with the SAM phonetic alphabet symbol for primary stress (") and is located by a set of 18 rules, which are basically the same as those described in [3].

For efficiency's sake, we have decided to write these rules directly in the C language instead of using the rule compiler. Otherwise, stress could have been assigned by the same set of rhythmic rules that describe the relative prominence of syllables within a word.

In our test set, 88% of the forms need the stressed vowel marking rules. The general rule is applied in 71% of the cases, and none of the remaining rules exceeds an application rate of 10%.

3.2 The segmental line

The first procedure of the rule system fills in the first level with the input text and the marks on the stressed vowel. A number of different levels are also filled with the dictionary information for the words with an associated entry.

The first level, letter, is taken directly as the segmental line, without any grapheme-to-phoneme mapping rules.
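
The date step of the normalization procedure in section 2 (only valid dates are transcribed) can be illustrated with a minimal sketch. It is written in Python for brevity, although DIXI itself is programmed in C, and it is not the system's actual code. The function name parse_valid_date, the century parameter used to expand two-digit years, and the requirement that both separators be the same character are all assumptions of this sketch.

```python
import re
from datetime import date

# Day, month and two-digit year joined by one repeated separator ("/" or "-").
# Requiring the same separator twice is an assumption of this sketch.
DATE_PATTERN = re.compile(r"(\d{1,2})([/-])(\d{1,2})\2(\d{2})")


def parse_valid_date(token, century=1900):
    """Return a date if token is a valid numerical date, otherwise None.

    The two-digit year is expanded by adding `century`; this default is
    hypothetical and chosen only to match the 1991 examples in the text.
    """
    match = DATE_PATTERN.fullmatch(token)
    if match is None:
        return None  # not shaped like day/month/year
    day, _separator, month, year = match.groups()
    try:
        return date(century + int(year), int(month), int(day))
    except ValueError:
        # Shaped like a date but not a real calendar day (e.g. 31/2/91),
        # so the token is left for the numeral rules to handle.
        return None


if __name__ == "__main__":
    for token in ["28/2/91", "28-2-91", "31/2/91", "28/13/91"]:
        print(token, "->", parse_valid_date(token))
```

Under these assumptions, 28/2/91 and 28-2-91 are both accepted as 28 February 1991, while 31/2/91 and 28/13/91 are rejected and would remain ordinary numerical expressions.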