
Lexical Analyser for Syllable Matching Used in Speech Synthesis


Proceedings of the 6th WSEAS International Conference on Signal Processing, Robotics and Automation, Corfu Island, Greece, February 16-19, 2007 77

A Romanian Syllable-Based Text-To-Speech System

OVIDIU BUZA, GAVRIL TODEREAN Department of Telecommunications Technical University of Cluj-Napoca 26 – 28 G. Baritiu Str., 400027, Cluj-Napoca ROMANIA

Abstract: - In this article we present the way we have built a syllable-based TTS system for Romanian. The system contains a text analyser capable of separating syllables from the input text and detecting accentuation, a vocal database with recorded syllables, a unit matching module and a synthesizer. The analyser was built using a scanner generator, by means of two sets of phonetic rules. The vocal database was generated through an automated wave segmentation procedure.

Key-Words: - syllable text-to-speech system, rule-driven text analyser, automatic segmentation

1 Introduction
Concatenation of waveforms is a method used more and more nowadays because of the high level of naturalness of the produced speech. Corpus-based methods are among the best approaches, but they require great effort for database maintenance. Syllable-based methods can be an alternative, as they need only a limited unit database. The use of syllables in synthesis also leads to a good level of speech naturalness and a low concatenation error rate, because of the small number of concatenation points inside the synthesized text.

This article presents an original approach for constructing a syllable-based TTS system for Romanian. The syllable approach is very appropriate in our case, because spoken Romanian contains a large number of open vowels, which give a constant rhythm of speech and a similar manner of accentuating words. Also, Romanian contains a relatively small number of syllables, so we have obtained a vocal database of reduced size.

Our text-to-speech system consists of ([7]):
- a text analysis module that takes the input text and produces basic units, which in our approach are syllables, and prosody data, meaning the information about how words are accentuated;
- a unit matching module that generates the sequence of acoustic units according to the linguistic units detected from the input and the prosody data;
- a synthesis module that generates speech based on the acoustic unit sequences.

The particular aspects of our work are:
- the use of linguistic and phonetic rules, on the basis of which we have done the text analysis and obtained the appropriate units and prosody data;
- automatic generation of the vocal database from recorded sequences.

[Block diagram: Text -> Text Analysis -> Unit Matching -> Synthesis; Text Analysis passes basic units and prosody data to Unit Matching]

Fig.1. Main functionalities of our Text-to-Speech system


First of all we have built a linguistic analyzer module that is capable of splitting the input text into syllables. The next step was to determine accentuation by means of a phonetic analyzer. Then we have automatically produced a database with PCM-coded syllables of the Romanian language. Synthesis was done by concatenating acoustic units from the database and giving appropriate commands to the computer's Sound Blaster card for voice generation.

2 Text Analysis
The first stage in text analysis is the detection of linguistic units: sentences, words and segmental units, which in our approach are the syllables. Detection of sentences and words is done based on punctuation and literal separators. For the detection of syllables we had to design a set of linguistic rules for splitting words into syllables, inspired by Romanian syntax rules ([2], [3]).

The principle used in detecting linguistic units is illustrated in figure no. 2. Here we can see the structure of the text analyser, which consists of four modules designed for the detection of units, prosody information and unit processing. These modules are:
- a lexical analysis module for detection of basic units;
- a phonetic analysis module for generating prosody information;
- a high-level analysis module for detection of high-level units;
- a processing shell for unit processing.

[Block diagram: the text enters lexical analysis (lexical rules), which yields syllables, numbers and separators; phonetic analysis (phonetic rules) yields the stress position; high-level analysis (syntactical rules) yields words and sentences; the processing shell performs unit processing through its subroutines]

Fig.2. Text analyser for syllable detection

The processing shell accomplishes the unit processing task and controls the subsequent modules. The shell calls the high-level analyser, which returns the main syntactic units. The high-level analyser calls the lexical analyser to scan the input text and detect basic units. Then the phonetic analysis module is called to generate stress information.

The lexical analyser extracts text characters and clusters them into basic units. It detects alphabetical characters, numerical characters, special characters and punctuation marks. Using special lexical rules (presented in [8]), alphabetical characters are clustered as syllables, digits are clustered as numbers, and special characters and punctuation marks are used to determine word and sentence boundaries.

The phonetic analyser gets the syllables between two breaking characters and detects the stress position, i.e. the accentuated syllable of the corresponding word.
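The clustering performed by the lexical analyser can be sketched in Python as follows. This is only an illustration under stated assumptions: the authors' analyser is rule-generated and forms syllables directly, whereas this sketch stops at runs of letters, and the token names, the `scan` function and the separator table are hypothetical.

```python
import re

# Alternatives are tried in order at each position; REAL precedes INTEGER
# so that "3,5" is not split into two integers.
TOKEN = re.compile(
    r"""
    (?P<REAL>\d+[.,]\d+)      # digits, decimal point or comma, digits
  | (?P<INTEGER>\d+)          # plain run of digits
  | (?P<LETTERS>[^\W\d_]+)    # run of alphabetical characters
  | (?P<SEP>.)                # any other single character
    """,
    re.VERBOSE | re.DOTALL,
)

# Hypothetical sub-classification of sentence separators.
SENTENCE_SEPARATORS = {
    ".": "SEP_AFFIRMATIVE",
    "?": "SEP_INTERROGATIVE",
    "!": "SEP_IMPERATIVE",
}


def scan(text):
    """Return a list of (unit_type, characters) pairs for the input text."""
    units = []
    for match in TOKEN.finditer(text):
        kind, value = match.lastgroup, match.group()
        if kind == "SEP":
            if value.isspace():
                kind = "SEP_WORD"
            else:
                kind = SENTENCE_SEPARATORS.get(value, "SEP_SPECIAL")
        units.append((kind, value))
    return units
```

For example, `scan("Am 2 mere.")` yields a letter run, a word separator, an integer, another word separator, a second letter run and an affirmative sentence separator. In a full analyser each letter run would be further split into syllables by the lexical rules.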

The high-level analyser then takes the syllables, special characters and numbers provided by the lexical analyser, together with the prosodic information, and constructs the high-level units: words and sentences. Basic sentence verification is also done here. Finally, the processing shell takes the linguistic units provided by the previous levels and, by means of some computing subroutines, classifies and stores them in appropriate structures. From here the synthesis module will construct the acoustic waves and synthesize the text.

3 Lexical Analysis for Syllable Detection
The lexical analyzer is called by the higher-level modules for the detection of basic lexical units: syllables, breaking characters and numbers. The lexical analyzer is built using the LEX scanner generator [4]. LEX generates a lexical scanner starting from an input grammar that describes the parsing rules. The grammar is written in BNF standard form and specifies the character sequences that can be recognized from the input. These sequences refer to syllables, special characters, separators and numbers. The BNF grammar also specifies the actions to be taken in response to input matching, actions that will be accomplished by the processing shell subroutines.

The whole process realized by the lexical analyzer is illustrated in figure no. 3. As we can see, the input text is interpreted as a character string. First, the current character is classified into one of the following categories: digit, special character or separator, and alphanumeric character. Taking into account the left and right context, the current character and the characters already parsed are grouped to form a lexical unit: a syllable, a number or a separator. Specific production rules for each category indicate how each lexical unit is formed and classified, and also perform a subclassification of units (for numbers, whether they are integer or real; for separators, the type: word or sentence separator, affirmative, interrogative, imperative or special separator).

Once the unit type and subtype are identified, the corresponding character sequence is stored and transmitted to the high-level analyzer by means of specific actions, which are described in the next paragraph (Process syllable, Process number, Process separator).

[Diagram: each character (C) of the input text is classified as digit, separator or alphanumeric; the rules of each category produce Integer or Real numbers, separators Sep 1 ... Sep n, or syllables, which trigger Process number, Process separator and Process syllable respectively]

Fig.3. Lexical analyser for syllable detection

3.1 Specific actions of lexical analyser
Specific actions inform the high-level module about the matching of syllables, numbers and breaking characters. Inside the lexical parser three types of input response actions are defined as follows:

A. Process syllable: this is the action to be taken when a syllable is matched in a specific location of a word. Special attention is paid when a syllable is matched at the beginning of a word. In Romanian, different word decomposition rules apply when a character sequence occurs at the beginning, in the middle or in the final part of a word.

B. Process number: this is the action to be taken when a number is matched from the input. The number is identified as INTEGER or REAL type. In a future stage, numbers will be translated into orthographic alphabetical form.

C. Process separator: this is the action corresponding to the matching of a breaking character from the input. Breaking characters and punctuation marks are used for detecting word and sentence boundaries.

3.2 Syllable rules matching
For the syllable rule matching process inside the lexical analyser, two types of rule sets were made: a basic set consisting of three general rules, and a large set of exception rules which states the exceptions from the basic set.
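A minimal Python sketch of how the basic set could behave is given here, as an illustration only and not the authors' LEX grammar. It encodes the three general rules listed under (A) as (R1)-(R3) and, at each position, picks the rule with the longest match, counting the trailing context in the match length as LEX does and preferring the earlier rule on ties. The exception set is omitted. The letter classes, the function name `syllabify` and the use of end-of-word in place of {SEP} are assumptions made for this example.

```python
import re

V = "[aeiouăâî]"                 # assumed vowel class {VOC}
C = "[b-df-hj-np-tv-zșțşţ]"      # assumed consonant class {CONS}

# (pattern, length of trailing context). The context is counted when
# choosing the longest match but is not consumed.
BASIC_RULES = [
    (re.compile(rf"{C}*{V}"), 0),            # R1: {CONS}*{VOC}
    (re.compile(rf"{C}*{V}{C}(?={C})"), 1),  # R2: {CONS}*{VOC}{CONS}/{CONS}
    (re.compile(rf"{C}*{V}{C}*(?=$)"), 1),   # R3: {CONS}*{VOC}{CONS}*/{SEP}
]


def syllabify(word):
    """Split a single word into syllables using only the basic rules."""
    word = word.lower()
    pos, syllables = 0, []
    while pos < len(word):
        best_score, best_end = 0, None
        for pattern, context_len in BASIC_RULES:
            match = pattern.match(word, pos)
            if match is None:
                continue
            score = match.end() - pos + context_len
            if score > best_score:           # strict: earlier rule wins ties
                best_score, best_end = score, match.end()
        if best_end is None:                 # no rule applies to the rest
            syllables.append(word[pos:])
            break
        syllables.append(word[pos:best_end])
        pos = best_end
    return syllables
```

With these assumptions, `syllabify("carte")` returns `['car', 'te']` and `syllabify("student")` returns `['stu', 'dent']`. Words containing diphthongs or other special sequences are exactly the cases the exception set is meant to handle, so this sketch will split some of them incorrectly.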


(A) The basic set contains the general decomposition rules for Romanian. The first rule is that a syllable consists of a sequence of consonants followed by a vowel:

syllable={CONS}*{VOC} (R1)

The second rule states that a syllable can end with a consonant if the next syllable also begins with a consonant:

syllable={CONS}*{VOC}{CONS}/{CONS} (R2)

The third rule says that one or more consonants can be placed at the end of a syllable if this is the last syllable of a word:

syllable={CONS}*{VOC}{CONS}*/{SEP} (R3)

(B) The exception set is made up of the rules that are exceptions to the three rules above. These exceptions are placed in front of the basic rules. If no rule from the exception set is matched, then the syllable is treated by the basic rules. At this time, the exception set contains more than 100 rules. Rules are grouped in subsets that refer to similar character sequences. All these rules are explained in [7], [8].

4 Syllable Accentuation
The principle for determining syllable accentuation is shown in the following diagram:

[Diagram: the text parser delivers a WORD made of phonemes F1, F2, ..., Fk and a separator S to the phonetic analyser, which applies the phonetic rules and returns the stress position: Sn-3, Sn-2, Sn-1 or Sn]

Fig.4. The principle of detecting syllable accentuation

The parser returns the current word from the input stream. The word consists of a series of phonemes F1, F2, …, Fk and is delimited by a separator S. The phonetic analyser reads this word and detects syllable accentuation based on phonetic rules.

In Romanian, the stressed syllable can be one of the last four syllables of the word: Sn, Sn-1, Sn-2 or Sn-3 (Sn is the last syllable). Most often, the stress is placed on the last but one syllable. The rule set consists of this general rule (the Sn-1 syllable is stressed):

{LIT}+/{SEP} { return(SN_1);}

and a consistent set of exceptions, organized in classes of words having the same termination. The complete set of rules can be found in [7].

5 Vocal Database Construction
The vocal database includes a subset of the syllables of the Romanian language. Acoustic units were separated from male speech and normalized in pitch and amplitude.

Segmentation was done through an automated procedure which can distinguish silence from speech and voiced from unvoiced signal. Our approach uses time-domain analysis of the signal. After low-pass filtering of the signal, the zero-crossing samples (Zi) were detected. The minimum (mi) and maximum (Mi) points between two zeros were also computed.

Separation between silence and speech is done using an amplitude threshold Ts. In silence segments all MIN and MAX points have to be smaller than Ts in absolute value:

$|M_i| < T_s, \quad |m_i| < T_s, \qquad i = s, \ldots, s+n$ (1)

In (1), s is the segment index and n is the number of samples in that segment.

For speech segments, the distance between two adjacent zero-crossing points (Di = d(Zi, Zi+1)) is computed. A segment is assumed to be voiced if this distance is greater than a threshold distance V:

$D_i > V, \qquad i = s, \ldots, s+n$ (2)

[Figure: waveform with zero points Z1 ... Zn and the points A and B marked]

Fig.5. A voiced segment of speech
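Conditions (1) and (2) can be illustrated with a short Python sketch. It is a simplified illustration, not the authors' procedure: it labels each interval between two adjacent zero crossings on its own rather than whole segments, it skips the low-pass filter, and the names `zero_crossings` and `classify_intervals`, the label strings and the threshold values in the usage note are assumptions. Intervals that fail (2) are labelled `short` rather than unvoiced, because the paper decides on unvoiced only after a further look-ahead test.

```python
def zero_crossings(samples):
    """Indices k where the sign changes between samples[k-1] and samples[k]."""
    return [
        k
        for k in range(1, len(samples))
        if (samples[k - 1] < 0) != (samples[k] < 0)
    ]


def classify_intervals(samples, ts, v):
    """Label every interval between adjacent zero crossings.

    ts: amplitude threshold T_s of condition (1)
    v:  distance threshold V of condition (2), in samples
    Returns a list of (start, end, label) with end exclusive.
    """
    zeros = zero_crossings(samples)
    result = []
    for start, end in zip(zeros, zeros[1:]):
        chunk = samples[start:end]
        maximum, minimum = max(chunk), min(chunk)   # M_i and m_i
        distance = end - start                      # D_i = d(Z_i, Z_i+1)
        if abs(maximum) < ts and abs(minimum) < ts:
            label = "silence"                       # condition (1)
        elif distance > v:
            label = "voiced"                        # condition (2)
        else:
            label = "short"                         # fails (2)
        result.append((start, end, label))
    return result
```

With `ts=0.1` and `v=10`, for instance, long high-amplitude half-waves are labelled `voiced`, high-amplitude half-waves of only a few samples are labelled `short`, and low-amplitude intervals are labelled `silence`.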

For the zero points between A and B in figure 5 to be included in the voiced segment, a look-ahead technique has been applied. A maximum number of Nk zero points between Zi and Zi+k can be inserted in the voiced region if Di-1 > V and Di+k > V:

$D_j < V,\ j = i, \ldots, i+k-1,\ k \le N_k; \quad D_{i-1} > V; \quad D_{i+k} > V, \qquad i = s, \ldots, s+n$ (3)

Finally, if conditions (2) and (3) are not fulfilled, the current segment is assumed to be unvoiced.

After segmentation, voiced and unvoiced segments are coupled according to the syllable chain that is used in the vocal database construction process. Acoustic units are labelled and stored in the database. Each region boundary can be viewed with a special application and, if necessary, can be adjusted.

The vocal database with recorded syllables has a tree data structure. Each node in the tree corresponds to a syllable characteristic, and a leaf represents the appropriate syllable.

Units have been inserted in the database following this classification:
- by length of syllables: we have two, three or four character syllables (denoted S2, S3 and S4) and also single phonemes (S1);
- by position inside the word: initial or median (M) and final (F) syllables;
- by accentuation: stressed or accentuated (A) or normal (N) syllables.

This classification offers the advantage of reducing the time needed for the matching process between phonetic and acoustic units.

6 Unit Matching Process
The matching process is done according to the three-layer classification of units: number of characters in the syllable, accentuation and the place of the syllable inside the word.

If a syllable is not found in the vocal database, it will be constructed from other syllables and separate phonemes that are also recorded. The following situations may appear:
(a) The syllable is matched in the appropriate accentuated form. In this case the acoustic unit will be used directly for concatenation.
(b) The syllable is matched, but not its accentuation. In this case, the unit is reconstructed from other syllables and phonemes which abide by the necessary accentuation.
(c) The syllable is not matched at all, so it will be constructed from separate phonemes.

7 Implementation
The purpose of our work was to build a speech synthesis system based on concatenation of syllables. The system includes a syllable database in which we have recorded 386 two-character syllables (283 middle-word syllables and 103 ending-word syllables), the 139 most frequent three-character syllables and 37 four-character syllables. Syllables that are not included in the database are synthesized from existing syllables and separate phonemes that are also recorded.

The speech synthesis system first invokes the text analyzer for syllable detection, then the phonetic analyser for determining the accentuation. The appropriate unit (stressed or unstressed) is matched from the vocal database, and speech synthesis is accomplished by syllable concatenation.

[Diagram: Text -> Lexical analysis (syllable detection) -> Phonetic analysis (accentuation) -> Unit matching (vocal database) -> Unit concatenation and synthesis; basic units: syllables; prosodic info: stress]

Fig.6. The principle of our syllable-based speech synthesis system

8 Conclusions and Results
We have presented in this article a complete method for the construction of a syllable-based TTS system. Special efforts have been made to accomplish the text analysis stage. After serious research in the linguistic field, we have designed one set of rules for detecting word syllables and a second set for determining which syllable is accentuated in each word. Even though these sets are not complete, they already cover a good majority of cases. The lexical analyzer is entirely based on rules that currently assure more than 85% correct syllable detection, while the accentuation analyser provides a correct detection rate of about 75%.

The advantage of detecting syllables through a rule-driven analyser is the separation between syllable detection and system code (unlike [9], where the syllable detection algorithm is integrated in the source code), which makes the rules easy to read and access. Other authors ([1]) have used LEX only for the pre-processing stage of text analysis, and not for the unit detection process itself. Some methods support only a restricted domain ([6]), whereas our method supports the whole Romanian vocabulary. The rule-driven method also needs fewer resources than dictionary-based methods (like [5]).

Our automated segmentation method also assures fewer errors at concatenation points: waves begin and stop at zero points and contain an integer number of periods.

As for the speech synthesis outcome, the first results are encouraging, and after a post-recording stage of syllable normalization we have obtained a good quality of text synthesis. In future implementations, F0 adaptive correction at concatenation points will improve this performance.

References:
[1] Burileanu D., et al., A Parser-Based Text Preprocessor for Romanian Language TTS Synthesis, Proceedings of EUROSPEECH'99, Budapest, Hungary, vol.5, pp. 2063-2066, Sep. 1999.
[2] Constantinescu-Dobridor G., Sintaxa limbii române, Editura Ştiinţifică, Bucureşti, 1994.
[3] Ciompec G. et al., Limba română contemporană. Fonetică, fonologie, morfologie, Editura Didactică şi Pedagogică, Bucureşti, 1985.
[4] Free Software Foundation, Flex - a scanner generator, http://www.gnu.org/software/flex/manual, October 2005.
[5] Hunt A., Black A., Unit selection in a concatenative speech synthesis system using a large speech database, Proc. ICASSP '96, Atlanta, GA, May 1996, pp. 373-376.
[6] Lewis E., Tatham M., Word And Syllable Concatenation In Text-To-Speech Synthesis, Sixth European Conference on Speech Communications and Technology, pp. 615-618, ESCA, Sep. 1999.
[7] Buza O., Vocal interactive systems, doctoral paper, Electronics and Telecommunications Faculty, Technical University of Cluj-Napoca, 2005.
[8] Buza O., Toderean G., Syllable detection for Romanian text-to-speech synthesis, Sixth International Conference on Communications COMM'06, Bucharest, June 2006, pp. 135-138.
[9] Burileanu C. et al., Text-to-Speech Synthesis for Romanian Language: Present and Future Trends, http://www.racai.ro/books/awde/burileanu.html