7th International Conference on Spoken Language Processing (ICSLP 2002), ISCA Archive, http://www.isca-speech.org/archive, Denver, Colorado, USA, September 16-20, 2002

A TEXT-TO-SPEECH SYSTEM FOR TELUGU

Jithendra Vepa, Jahnavi Ayachitam, K V K Kalpana Reddy∗

Centre for Speech Technology Research, University of Edinburgh, Edinburgh, UK
Dept. of Computer Science, Jawaharlal Nehru Technological University, Hyderabad, INDIA

ABSTRACT

In this paper, a diphone-based Text-to-Speech (TTS) system for the Telugu language is presented. Telugu is one of the main south-Indian languages, spoken by more than 100 million people. Speech output is generated using the Festival Speech Synthesis System and the MBROLA synthesis engine. The design and collection of diphones and the voice building process are described. Our text analysis module and the methods used for segment duration and generation of pitch contours are briefly discussed. We also present the waveform generation techniques used in both the MBROLA and Festival synthesis systems.

1. INTRODUCTION

The automatic generation of speech from text, whether the text was directly entered into the computer by an operator or scanned and submitted to an OCR (Optical Character Recognition) system, is referred to as Text-to-Speech (TTS) synthesis.

Over the past five decades extensive work has been done on speech synthesis for the English language. However, not much work has been carried out on Indian languages, though most of them have phonetic writing systems, which means that letter-to-sound rules should be relatively straightforward. The first practical systems for automatic TTS from plain text have been available since the late 1970s (MITalk in 1979) [1] and the late 1980s (Klattalk in 1981, Prose-2000 in 1982, and DECTalk and Infovox in 1983) [2]. Since then, a number of different techniques have been developed and implemented to produce more natural and intelligible synthetic speech.

All these developments for English and other European languages form the basis for work on Indian languages. Work on TTS synthesisers for Indian languages has gathered pace since the early nineties. A complete TTS system for Hindi was developed by B. Yegnanarayana et al. of IIT Madras in 1994 [3]. Complete TTS systems for different Indian languages have also been developed. The systems development group at IIT Madras has developed TTS systems [4] for Telugu, Tamil and Hindi using the MBROLA (MultiBand Resynthesis Overlap-Add) speech synthesiser. However, they did not generate an Indian-language speech database for use with MBROLA [5]; instead they used one of the existing speech databases (Swedish). To our knowledge, there is no such database available for Telugu. This prompted us to develop a Telugu speech database, to use it with MBROLA and also with Festival [6] to generate more intelligible and natural synthetic Telugu speech.

A basic TTS system has two sets of modules, as shown in Fig. 1. The first set of modules analyses the text to determine the underlying structure of the sentence and the phonetic composition of each word. The other set of modules then transforms this abstract linguistic representation into a speech signal. The function of the individual modules is reported in detail in the following sections.

[Fig. 1. A basic TTS system: a natural language processing stage (text normalisation, letter-to-sound rules, phrasing and accents) converts TEXT into a LINGUISTIC REPRESENTATION, which a speech generation stage (duration pattern, intonation, waveform generation) converts into SPEECH.]

A fully developed speech synthesis system has many applications, including aids for the handicapped (such as reading aids for the blind), medical aids and teaching aids. This technology also facilitates the development of natural language interfaces for computers and makes learning the language easier, by enabling the development of multimedia-based language teaching systems. TTS systems, if introduced in railway and bus stations in India, would provide efficient communication for illiterate and visually impaired people during their travel. Hence, it is all the more necessary to develop such high quality TTS systems for Indian languages. This is our motivation.
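The two-stage organisation described above (text analysis producing a linguistic representation, then speech generation consuming it) can be sketched as follows. This is only an illustrative skeleton: the abbreviation table, its expansion, and the fixed 80 ms duration are placeholders invented for the example, not values from the actual system.

```python
# Minimal sketch of a two-stage TTS pipeline, as described above.
# The abbreviation table and durations are illustrative placeholders.

ABBREVIATIONS = {"ki.mi:": "kilomi:Tarlu"}  # hypothetical expansion

def analyse_text(text):
    """Stage 1: text analysis -> abstract linguistic representation."""
    tokens = text.split()
    words = [ABBREVIATIONS.get(t, t) for t in tokens]
    # For a phonetic script, letter-to-sound is close to identity;
    # here we naively take one symbol per phoneme.
    return [{"word": w, "phonemes": list(w)} for w in words]

def generate_speech(linguistic):
    """Stage 2: attach prosody and (here) just emit a synthesis plan."""
    plan = []
    for item in linguistic:
        for ph in item["phonemes"]:
            plan.append((ph, 80))  # (phoneme, duration in ms); 80 ms placeholder
    return plan

plan = generate_speech(analyse_text("amma ki.mi:"))
```

A real system would replace the identity letter-to-sound step with the phonetization rules of Section 3.1.2 and the fixed duration with the models of Section 3.2.2.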
This paper mainly focuses on the development of a new Telugu (the authors' native language) voice for use in the Festival TTS system, and also on producing synthetic speech using the MBROLA speech synthesis engine.

∗ Presently at the University of Central Florida, Orlando, US.

2. DIPHONE DATABASE CONSTRUCTION

Concatenative speech synthesis systems need an inventory of speech segments (units) to concatenate in order to produce synthetic speech. The choice of unit for concatenation is an open issue. Larger units mean fewer joins, but also a larger inventory. Shorter units result in a smaller inventory, but more joins and therefore more concatenation artefacts in synthesis due to mismatches between units. So there is a trade-off between unit size and number of joins. The smallest possible units for TTS would be phonemes, and the concatenation points would be phoneme boundaries. But these are acoustically unstable points. Hence, in order to place the joins at mid-phone positions, the smallest unit used is the diphone, which extends from the middle of one phoneme to the middle of the next phoneme. In the middle of a segment the articulators are at their target, and this region is hopefully relatively invariant. The typical inventory size is the square of the number of phonemes. Since diphones are also reasonable in number (around 1600 for English), they have formed the basis for most concatenation systems [2].

In our system there are 49 phonemes (including silence) in the Telugu language, based on its orthographic characters. These are listed with their romanized representation in Table 1.

[Table 1. Telugu phonemes and corresponding romanized representation. Vowels: a, a:, i, i:, u, u:, R, e, e:, ai, o, o:, au. Consonants: k, kh, g, gh, n~; c, ch, j, jh, N~; T, Th, D, Dh, N; t, th, d, dh, n; p, ph, b, bh, m; y, r, r', l, L; v, s', S, s, h.]

Then we derived a diphone set from this list of phonemes, by excluding some unused or very rarely occurring diphones in this language. Finally, there are around 1732 diphones in our inventory. To create the corpus, we need a word or short phrase which contains the required phoneme sequence for each diphone. We have not used nonsense carrier words to collect diphones, though this is the normal practice in most diphone synthesis systems. Instead, we have chosen natural carrier words, such that the diphones are centred in the word, and in some cases proper nouns (such as yamuna, bha:ratam, appaDamu) were used. We then carefully recorded all the words (spoken by a female speaker, one of the authors) directly to computer in a reasonably quiet environment (a sound-proof studio was not available). We used Diphone Studio [7] to extract the diphones from the words. These diphone files were used directly by the MBROLA synthesis engine. To work with Festival, we constructed a diphone index with the help of the FestVox documentation [8].

3. NATURAL LANGUAGE PROCESSING

The natural language processing module in a TTS system deals with the production of a correct phonetic and prosodic transcription of the input text [9]. It involves analysis of the text and generation of prosodic parameters. The output of this module is given to the waveform generation module, which transforms this symbolic information into speech.

3.1. Text Analysis

Text analysis is the task of transforming the input text into its abstract underlying linguistic (phonemic) representation. This is done in two modules: Text Normalisation and Phonetization.

3.1.1. Text Normalisation

This module breaks the input text into sentences. It identifies numbers, abbreviations, acronyms and idiomatics, and transforms them into full text. The module was built by considering frequently occurring abbreviations and symbols in the Telugu language, some of which are ki.mi:, ru:, gra:, $ etc. These abbreviations are replaced with their corresponding expansions. The conversion of numbers to words is also performed by this module. For example, 1234 is converted to oka veyyi reNDu vandala mupphai na:lugu.

The steps involved in this module are as follows: first, the tokeniser separates words by white space (tabs, new lines and spaces). Then each word is checked to see whether it contains only numbers, symbols or letters. Finally, these symbols, numbers and abbreviations are converted into word form (such as $ = Da:laru and 12 = panneNDu).

3.1.2. Phonetization

This module is concerned with converting the words into their corresponding phonetic representation. There are 49 phonemes in our system, shown in Table 1. Some sounds which are no longer in use are excluded from the phoneme list. There are also some sounds which are now part of spoken Telugu but have no orthographic representation, like the 'f' sound (as in "fish") and the 'z' sound (as in "zoo"). These sounds were adopted from languages like English and Urdu, and are not included in the phoneme list.

Because of the phonetic nature of Telugu, phonetization is somewhat less complex than for English and other European languages. There are very few words where the pronunciation differs from the spelling. One example is the combination of the 'h' phoneme with another consonant phoneme (e.g. Jahnavi and Brahma). In this case, even though 'h' appears first and the consonant appears next, the combination is pronounced as if the consonant comes first and 'h' comes next.

In the case of purna bindu (orthographically represented as a circle in Telugu script), it is replaced with the last phoneme of the varga of the very next consonant. Varga means a class in Telugu, e.g. k varga, c varga etc. If the purna bindu comes before 'c' it is replaced with 'N~' (e.g. kanchamu is kaN~chamu); if it comes before 'T' it is replaced with 'N'. One of the very few exceptions is the 'j-N~' combination, which is pronounced as 'g-N~' (jN~a:namu is pronounced as gN~a:namu). There are no heterophonic homographs, i.e. words that are pronounced differently even though they have the same spelling. This unambiguous nature of Telugu makes the phonetization process easy.

3.2. Prosodic Analysis

The term prosody refers to the characteristics of speech that make sentences flow in a perceptually natural and intelligible manner. The major components of prosody that can be recognised perceptually are pitch, loudness of the speaker and syllable length. Prosodic features have specific functions in speech communication. They are the result of variations in the acoustic parameters: fundamental frequency (F0), intensity (amplitude) and duration. Prosodic analysis can use the same techniques used in phonetic analysis, applying them to the task of generating prosodic information from plain text. Each word usually has one or more syllables which are stressed. The nature of this stress can, with a little simplification, be represented by a peak in pitch on the stressed syllables. Prosodic analysis, then, must be able to determine which syllables are stressed by examining the text input.

3.2.1. Predicting Pitch Contours

One aspect of prosody is the gradual rising or falling of pitch over the length of an entire sentence. This is represented by a pitch contour of the fundamental frequency of speech. While each word has its own intonation, the pitch of a natural sounding sentence follows this pitch contour. In general, intonation is generated in two steps:

• Prediction of accents on a per-syllable basis

• Prediction of F0 target values (this must be done after durations are predicted)

Presently we use a CART (Classification And Regression Trees) tree to predict phrase boundaries. A similar tree is used to predict which syllables are stressed. To generate F0 targets, we used the standard method available in Festival. For each phrase in the given utterance, F0 is generated using the mean (f0 mean) and standard deviation (f0 sd) of this speaker's F0. An imaginary line called the baseline is drawn from start to end, and for each accented syllable three targets are added (one at the start, one in the mid vowel, and one at the end). The start and end targets are at the baseline (in Hz) and the mid-vowel target is set to baseline + f0 sd. This model is not very complex, but it offers a very quick and easy way to generate F0 targets, though it is clearly not a perfect intonation module.

3.2.2. Predicting Phonemic Duration

Duration methods with varying levels of sophistication are available in the literature. Unlike F0, which is primarily dependent on levels of stress, duration depends on a number of factors, such as syllable location, phonetic identity and the surrounding segments [10]. We assigned mean durations for each phone in the phone set, computed from the speech database. Then we defined a set of rules to modify this average duration based on phonetic context. This has some inherent drawbacks, such as an inability to globally optimise the effect of each rule. We are also exploring other methods, such as duration prediction by the CART model, which is used by Festival for English, and the Sum-of-Products model proposed by van Santen [10].

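The baseline F0 target scheme described in Section 3.2.1 can be sketched as follows. The sketch assumes the baseline is set from the speaker's mean F0; the mean, standard deviation and syllable timings below are illustrative placeholders, not measurements from our speaker.

```python
# Sketch of the baseline F0 target model: for each accented syllable,
# three targets are placed -- the baseline at the syllable start and end,
# and baseline + f0_sd at the mid-vowel. Values are placeholders.

F0_MEAN = 210.0  # speaker mean F0 in Hz (placeholder)
F0_SD = 25.0     # speaker F0 standard deviation in Hz (placeholder)

def f0_targets(syllables):
    """syllables: list of (start_s, mid_vowel_s, end_s, accented) tuples.
    Returns (time, F0) targets for one phrase."""
    baseline = F0_MEAN  # flat baseline drawn from phrase start to end
    targets = []
    for start, mid, end, accented in syllables:
        if accented:
            targets.append((start, baseline))
            targets.append((mid, baseline + F0_SD))  # mid-vowel peak
            targets.append((end, baseline))
    return targets

# Two syllables, only the first accented:
targets = f0_targets([(0.0, 0.08, 0.2, True), (0.2, 0.3, 0.4, False)])
```

As in the paper's description, unaccented syllables contribute no targets, so the contour simply interpolates along the baseline between accents.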
4. WAVEFORM GENERATION

There are many different waveform generation techniques for speech synthesis, such as TD-PSOLA, LPC synthesis, MBROLA, residual LPC etc. The two methods used in our work are briefly discussed in the following sections.

4.1. MBROLA (MultiBand Resynthesis Overlap-Add)

This method, combining TD-PSOLA [11] with hybrid synthesis techniques [12], has been proposed for automatically resolving phase incoherences and spectral envelope mismatches. Multiband resynthesis PSOLA (MBR-PSOLA) uses a parametric synthesiser (a hybrid model) to resynthesise speech, introducing spectral characteristics that prevent mismatches from occurring. The MBR-PSOLA resynthesis operation was later modified into what is referred to as the MBROLA algorithm [9, 5], which exhibits very good database coding performance. It eliminates some problems of previous techniques such as TD-PSOLA and LPC synthesis by resynthesising the voiced parts of all segments with a constant synthesis pitch and possibly with fixed initial phases for each period. Concatenation is then also performed by linear interpolation in the time domain. Spectral envelope mismatches can thus be attenuated, at the expense of some additional buzziness introduced by the resynthesis operation.

4.2. Residual Excited LPC

Residual excited LPC synthesis (RELP) is used in Festival. During LPC analysis, the exact error or residual signal is also computed, which can be used for near-perfect reconstruction of the speech signal. If the residual is stored along with the LPC coefficients in the diphone database, it can be used at synthesis time. This method is similar to LPC synthesis, but uses the residual signal instead of an impulse train for excitation.

5. CONCLUSIONS AND FUTURE WORK

This paper presented a diphone-based Text-to-Speech system for Telugu. A female diphone voice for the Telugu language was built for use with the Festival and MBROLA synthesis engines. This diphone voice contains text analysis, duration and intonation modules. The text analysis module expands abbreviations, symbols and numbers in Telugu into word form. Currently the CART method is used to predict phrase boundaries and accents. F0 targets were generated using the mean and standard deviation of the speaker's F0. For waveform generation, either the Festival (residual LPC synthesis) or the MBROLA synthesis engine can be used.

Work is in progress to improve the various modules. In particular, during text analysis more abbreviations should be considered, and further work on a Telugu language grammar is required for processing more exceptions. Further research needs to be done on prosodic analysis in order to improve the naturalness of the synthetic speech.

6. ACKNOWLEDGEMENTS

We are grateful to Dr. K. Narayana Murty, Reader, HCU, Hyderabad, for his helpful discussions and valuable suggestions. We would also like to thank Mr. Baris Bozkurt, TCTS labs, for helping us to develop the MBROLA database. The authors also acknowledge Dr. Simon King of the Centre for Speech Technology Research (CSTR), Edinburgh, for his editorial comments.

7. REFERENCES

[1] J. Allen, S. Hunnicutt, R. Carlson and B. Granström, "MITalk-79: The MIT Text-to-Speech System", Journal of the Acoustical Society of America, Suppl. 1, 65, S130, 1979.

[2] D.H. Klatt, "Review of text-to-speech conversion for English", Journal of the Acoustical Society of America, Vol. 82, No. 3, pp. 737-793, 1987.

[3] B. Yegnanarayana, S. Rajendran, V.R. Ramachandran and A.S. Madhukumar, "Significance of knowledge sources for a text-to-speech system for Indian languages", Sadhana, Vol. 19, Part 1, pp. 147-169, 1994.

[4] Text to Speech in Indian Languages, Systems Development Laboratory, IIT Madras, India. www: http://acharya.iitm.ac.in/disabilities/tts.html

[5] MBROLA: TCTS labs, Faculté Polytechnique de Mons, Belgium. www: http://tcts.fpms.ac.be/synthesis/mbrola.html

[6] Festival: Alan Black, Paul Taylor, Richard Caley, Rob Clark, Centre for Speech Technology Research (CSTR), University of Edinburgh. www: http://www.cstr.ed.ac.uk/projects/festival

[7] Diphone Studio: Fluency Speech Technology, Netherlands. www: http://www.fluency.nl/dstudio/dstudio.htm

[8] FestVox: A. Black and K. Lenzo, "Building voices in the Festival speech synthesis system", www: http://festvox.org/festvox/festvox toc.html

[9] T. Dutoit, An Introduction to Text-to-Speech Synthesis, Kluwer Academic Publishers, 1997.

[10] J.P.H. van Santen, "Computation of timing in text-to-speech synthesis", in Speech Coding and Synthesis (W.B. Kleijn and K. Paliwal, eds.), New York: Elsevier, pp. 663-684, 1995.

[11] F. Charpentier and M.G. Stella, "Diphone synthesis using an overlap-add technique for speech waveform concatenation", Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pp. 2015-2018, 1986.

[12] T. Dutoit and H. Leich, "MBR-PSOLA: Text-to-speech synthesis based on an MBE re-synthesis of the segments database", Speech Communication, Vol. 13, No. 3-4, pp. 435-440, 1993.