A Text-To-Speech Synthesis System for Telugu
7th International Conference on Spoken Language Processing [ICSLP2002]
Denver, Colorado, USA, September 16-20, 2002
ISCA Archive: http://www.isca-speech.org/archive

Jithendra Vepa
Centre for Speech Technology Research, University of Edinburgh, Edinburgh, UK

Jahnavi Ayachitam, K V K Kalpana Reddy*
Dept. of Computer Science, Jawaharlal Nehru Technological University, Hyderabad, INDIA

*Present: University of Central Florida, Orlando, US

ABSTRACT

In this paper, a diphone-based Text-to-Speech (TTS) system for the Telugu language is presented. Telugu is one of the main south-Indian languages, spoken by more than 100 million people. Speech output is generated using the Festival Speech Synthesis System and the MBROLA synthesis engine. The design and collection of diphones and the voice building process are described. Our text analysis module and the methods used for segment duration and generation of pitch contours are briefly discussed. We also present the waveform generation techniques used in both the MBROLA and Festival synthesis systems.

1. INTRODUCTION

The automatic generation of speech from text, whether the text was directly entered into the computer by an operator or scanned and submitted to an OCR (Optical Character Recognition) system, is referred to as Text-to-Speech (TTS) synthesis. Over the past five decades extensive work has been done on speech synthesis for the English language. However, not much work has been carried out on Indian languages, even though most of them have phonetic writing systems, which means that letter-to-sound rules should be relatively straightforward. The first practical systems for automatic TTS from plain text have been available since the late 1970s (MITalk in 1979) [1] and early 1980s (Klattalk in 1981, Prose-2000 in 1982, and DECTalk and Infovox in 1983) [2]. Since then, a number of different techniques have been developed and implemented to produce more natural and intelligible synthetic speech.

All these developments for English and other European languages form the basis for work on Indian languages. Work on TTS synthesisers for Indian languages has gathered pace since the early nineties. A complete TTS system for Hindi was developed by B. Yegnanarayana et al of IIT Madras in 1994 [3]. Complete TTS systems for different Indian languages have also been developed. The system development group at IIT Madras has developed TTS systems [4] for Telugu, Tamil and Hindi using the MBROLA (MultiBand Resynthesis Overlap-Add) speech synthesiser. However, they have not generated an Indian language database for use with MBROLA [5], and instead used one of the existing speech databases (Swedish). To our knowledge, there is no such database available for Telugu. This prompted us to develop a Telugu speech database for use with MBROLA and also with Festival [6], to generate more intelligible and natural synthetic Telugu speech.

A basic TTS system has two sets of modules, as shown in Fig. 1. The first set of modules analyses the text to determine the underlying structure of the sentence and the phonetic composition of each word. The other set of modules then transforms this abstract linguistic representation into a speech signal. The detailed function of the individual modules is reported in the following sections.

[Figure: TEXT -> NATURAL LANGUAGE PROCESSING (Text Normalisation; Letter-to-Sound Rules; Phrasing, Accents) -> LINGUISTIC REPRESENTATION -> SPEECH GENERATION (Duration Pattern; Intonation; Waveform Generation) -> SPEECH]

Fig. 1. A basic TTS system.

A fully developed speech synthesis system has many applications, including aids for the handicapped (such as reading aids for the blind), medical aids and teaching aids. This technology also facilitates the development of natural language interfaces for computers and makes learning the language easier, by enabling the development of multimedia-based language teaching systems. TTS systems, if introduced in railway stations and bus stations in India, would provide an efficient means of communication for people who cannot read and for visually impaired people during their commute. Hence, it is all the more necessary to develop such high quality TTS systems for Indian languages. This is our motivation. This paper mainly focuses on the development of a new Telugu (the authors' native language) voice for use in the Festival TTS system, and on producing synthetic speech with the MBROLA speech synthesis engine.

2. DIPHONE DATABASE CONSTRUCTION

Concatenative speech synthesis systems need an inventory of speech segments (units) to concatenate and produce synthetic speech. The choice of unit for concatenation is an open issue. Larger units mean fewer joins, but also a larger inventory. Shorter units result in a smaller inventory, but more joins and therefore more concatenation artefacts in synthesis due to mismatch between units. So there is a trade-off between unit size and number of joins. The smallest possible units for TTS would be phonemes, and the concatenation points would be phoneme boundaries. But these are acoustically unstable points. Hence, to place the joins at mid-phone positions, the smallest practical unit is the diphone, which extends from the middle of one phoneme to the middle of the next phoneme. In the middle of a phone the articulators are at their target, so this point should be relatively invariant. The typical inventory size is about the square of the number of phonemes. Since diphones are also reasonable in number (around 1600 for English), they have formed the basis for most concatenation systems [2].

Our system uses 49 phonemes (including silence) for the Telugu language, based on its orthographic characters. These are listed with their romanized representation in Table 1. We then derived a diphone set from this list of phonemes, excluding diphones that are unused or occur very rarely in the language. Our final inventory contains around 1732 diphones.

Vowels:      a  a:  i  i:  u  u:  R  e  e:  ai  o  o:  au
Consonants:  k  kh  g  gh  ñ
             c  ch  j  jh  Ñ
             T  Th  D  Dh  N
             t  th  d  dh  n
             p  ph  b  bh  m
             y  r  r'  l  L  v  s'  S  s  h

Table 1. Telugu phonemes and corresponding romanized representation

To create the corpus, we need a word or short phrase which contains the required phoneme sequence for each diphone. We have not used nonsense carrier words to collect diphones, though this is the normal practice in most diphone synthesis systems. Instead, we have chosen natural carrier words such that the diphones are centred in the word; in some cases proper nouns (such as yamuna, bha:ratam and appaDamu) were used. We then carefully recorded all the words (spoken by a female speaker, one of the authors) directly to computer in a reasonably quiet environment (a sound-proof studio was not available). We used Diphone Studio [7] to extract the diphones from the words. We used these diphone files directly with the MBROLA synthesis engine. To work with Festival, we constructed a diphone index with the help of the FestVox documentation [8].

3. NATURAL LANGUAGE PROCESSING

The natural language processing module in a TTS system deals with the production of a correct phonetic and prosodic transcription of the input text [9]. It involves analysis of the text and generation of prosodic parameters. The output of this module is given to the waveform generation module, which transforms this symbolic information into speech.

3.1. Text Analysis

Text analysis is the task of transforming the input text into its abstract underlying linguistic (phonemic) representation. This can be done in two modules: Text Normalisation and Phonetization.

3.1.1. Text-Normalisation

This module breaks the input text into sentences. It identifies numbers, abbreviations, acronyms and idiomatics, and transforms them into full text. The module is built by considering frequently occurring abbreviations and symbols in the Telugu language, such as ki.mi:, ru:, gra: and $. These abbreviations are replaced with their corresponding expansions. The conversion of numbers to words is also performed by this module. For example, 1234 is converted to oka veyyi reNDu vandala mupphai na:lugu.

The steps involved in this module are as follows. First, the tokeniser separates words at white space (tabs, new lines and spaces). Then, each word is checked to see whether it contains only numbers, symbols or letters. Finally, the symbols, numbers and abbreviations are converted into word form (such as $=Da:laru and 12=panneNDu).

3.1.2. Phonetization

This module is concerned with converting the words to their corresponding phonetic representation. There are 49 phonemes in our system, shown in Table 1. Some sounds which are not in usage are excluded from the phoneme list. There are some sounds which are now part of spoken Telugu but do not have any orthographic representation, like the 'f' sound (as in "fish") and the 'z' sound (as in "zoo"). These sounds are adopted from languages like English and Urdu. They are not included in the phoneme list.

Because of the phonetic nature of Telugu, phonetization is somewhat less complex than for English and other European languages. There are very few words where the pronunciation differs from the spelling. For example the combination of the 'h' phoneme with any other

[...]

• Prediction of F0 target values (this must be done after durations are predicted)

Presently we use a CART (Classification And Regression Trees) model to predict phrase boundaries. A similar tree is used to predict which syllables are stressed. To generate F0 targets, we used the standard method available in Festival.
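As an illustration of the text-normalisation steps described in Section 3.1.1 (whitespace tokenisation, checking whether a token is a number, symbol or letters, then expansion into word form), the following is a minimal sketch. It is not the authors' implementation: the names `EXPANSIONS`, `classify` and `normalise` are hypothetical, and the table holds only the two mappings quoted in the text ($=Da:laru and 12=panneNDu). A real module would need a full abbreviation table and a general number-to-words routine, neither of which is specified here.

```python
# Hypothetical expansion table: only the two mappings quoted in the paper.
EXPANSIONS = {
    "$": "Da:laru",
    "12": "panneNDu",
}


def classify(token: str) -> str:
    """Label a token as 'number', 'letters', 'symbol' or 'mixed'."""
    if token.isascii() and token.isdigit():
        return "number"
    has_alpha = any(ch.isalpha() for ch in token)
    has_digit = any(ch.isdigit() for ch in token)
    if has_alpha and not has_digit:
        return "letters"   # romanized words and abbreviations such as ki.mi:
    if not has_alpha and not has_digit:
        return "symbol"    # e.g. $
    return "mixed"


def normalise(text: str) -> str:
    """Tokenise on white space and expand every token found in the table."""
    tokens = text.split()  # splits on tabs, new lines and spaces
    # Tokens without a known expansion are passed through unchanged.
    return " ".join(EXPANSIONS.get(tok, tok) for tok in tokens)


if __name__ == "__main__":
    for tok in "ru: 12 $".split():
        print(tok, classify(tok))
    print(normalise("ru:\t12\n$"))
```

Keeping classification separate from expansion means that a number-to-words function could later replace the table lookup for tokens labelled 'number' without touching the tokeniser.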