
A STUDY ON MULTI-LINGUAL AND CROSS-LINGUAL SPEECH SYNTHESIS FOR INDIAN LANGUAGES

Thesis submitted in partial fulfillment of the requirements for the degree of

MS by Research in Computer Science and Engineering

by

ELLURU NARESH KUMAR 201050029 [email protected]

LANGUAGE TECHNOLOGIES RESEARCH CENTER
International Institute of Information Technology
Hyderabad - 500 032, INDIA
July 2014

Copyright © Elluru Naresh Kumar, 2014
All Rights Reserved

International Institute of Information Technology
Hyderabad, India

CERTIFICATE

It is certified that the work contained in this thesis, titled “A STUDY ON MULTI-LINGUAL AND CROSS-LINGUAL SPEECH SYNTHESIS FOR INDIAN LANGUAGES” by ELLURU NARESH KUMAR, has been carried out under my supervision and is not submitted elsewhere for a degree.

Date                                        Adviser: Dr. Kishore Prahallad

To my Parents and guide

Acknowledgments

I would like to gratefully and sincerely thank my advisor Dr. Kishore Prahallad for his guidance, patience, understanding and continuous support of my research. I could not have had a better advisor and mentor for my Masters, and without his support I could not have finished my thesis. His dedication and discipline have inspired me as a researcher. I express special thanks to Prof. Yegnanarayana for his valuable suggestions during the meetings and seminars, which enlightened my career. I thank my colleagues and friends Venkatesh, Gautam, Aditya, Guruprasad, Bajibabu, Ronanki, Anand, Nivedita, Vishala, Gangamohan, BuchiBabu, Karthik, Bhargav, Santosh, Sivanand, Padmini, Sudharsana, Sreedhar and Rambabu. I thank all the IIIT Hyderabad participants in my listening tests for their valuable time and patience. Finally, I am grateful to my parents, who provided a happy environment for me so that I could concentrate on my work. Special thanks to my brothers Raghavendra, Pavan and Uday for their valuable suggestions during my research career.

Abstract

Keywords: Speech synthesis, text-to-speech systems for Indian languages, multi-lingual, cross-lingual.

This thesis deals with the design and development of speech databases for text-to-speech (TTS) systems in Indian languages, and with the issues involved in multi-lingual and cross-lingual synthesis. An important issue addressed in this thesis is the pronunciation of non-native text (foreign words). In this work, different mapping methods, such as phone-phone (P-P) and word-phone (W-P) mapping, have been experimented with to incorporate non-native (foreign) word pronunciation in Indian language TTS systems. In P-P mapping, the phones of the foreign language are substituted with phones of the native language. The results showed that the quality of the synthesized output is poor, as it carries a foreign language accent. This motivated us to use a technique that nativizes foreign words. To achieve this, we used the W-P mapping approach, in which a non-native word is mapped to a sequence of native phones so that it is produced in the form of a native speaker's pronunciation. Subjective evaluations were conducted, and the results show that native speakers prefer the quality of this synthesized output. Manual W-P mapping is a tedious task, and it is very difficult to include all English words in a single lexicon; hence we automatically generate letter-to-sound rules for foreign words.

As Indian languages share a common phonetic base, the pronunciations of phones across the languages are similar. The question we would like to ask in this work is whether a single TTS system is sufficient for inputs from multiple Indian languages. In cross-lingual synthesis, text of one language is synthesized using a TTS system built for another language. Here the phones of one language are mapped to the phones of another/neighbouring language. Subjective evaluations were conducted with this technique. The results show that native speakers do not prefer cross-lingual synthesis, as the durational properties of vowels and geminates vary across the languages. The usefulness of W-P mapping and a study of cross-lingual synthesis, with supporting analysis and results, are presented in this thesis. The contributions of this work are: (1) experiments and evidence that phone-phone mapping alone is not adequate for a multi-lingual TTS system, (2) the use of W-P mapping for building a multi-lingual TTS system for Indian languages, and (3) a study of cross-lingual synthesis using Indian languages.

Contents


1 Introduction to Text-to-speech system
  1.1 Text-to-speech synthesis systems
  1.2 Text Processing Module
      1.2.1 Text Normalization
      1.2.2 Grapheme-to-phoneme Conversion
      1.2.3 Prosodic Analysis
  1.3 Waveform generation module
      1.3.1 Articulatory speech synthesis
      1.3.2 Parametric synthesis
            1.3.2.1 Formant synthesis
            1.3.2.2 Linear prediction synthesis
      1.3.3 Concatenative synthesis
            1.3.3.1 Diphone synthesis
            1.3.3.2 Unit Selection Synthesis
      1.3.4 Statistical Parametric Speech Synthesis
  1.4 Evaluation of Text-to-speech systems
      1.4.1 Subjective evaluations
            1.4.1.1 Mean Opinion Score (MOS)
            1.4.1.2 ABTest
      1.4.2 Objective evaluations
            1.4.2.1 Mel-cepstral distortion (Objective Evaluation)
  1.5 Need for Multi-lingual and Cross-lingual Speech Synthesis
      1.5.1 Issues in existing approaches
  1.6 Thesis Statement
  1.7 Organization of Thesis

2 Speech Database Collection and Evaluation Strategies for TTS
  2.1 Introduction
      2.1.1 Nature of Indic Scripts
      2.1.2 Differences and Similarities in Indic scripts
      2.1.3 Digital Storage
            2.1.3.1 Handling font-data of Indic scripts
  2.2 Design and collection of Speech Databases
      2.2.1 Design of Text prompts
      2.2.2 Recording of Speech Databases


            2.2.2.1 Issues with Speech Recording
            2.2.2.2 Segmentation of recorded audio
  2.3 Framework for building baseline synthetic voices
  2.4 Objective Evaluation
  2.5 Blizzard challenge on Indian language speech databases
  2.6 Indian Language (IH) Tasks
      2.6.1 Participants in the Challenge
      2.6.2 Database Used
      2.6.3 Challenges
      2.6.4 Materials
      2.6.5 Evaluation
  2.7 Discussion and Results
      2.7.1 Results obtained for IH1.1 (Hindi) and IH1.3 (Kannada) Tasks
      2.7.2 Results obtained for IH1.2 (Bengali) and IH1.4 (Tamil) Tasks
      2.7.3 Discussion and Results
  2.8 Summary

3 Handling English words in Telugu Text-to-speech system
  3.1 Introduction
  3.2 Previous approaches towards Multi-lingual TTS
  3.3 Our approach
  3.4 Comparison of word-phone and phone-phone mapping
  3.5 Transliteration
  3.6 Automatic generation of word-phone mapping
      3.6.1 Epsilon Scattering Method
      3.6.2 Evaluation and Results
  3.7 Integrating word-phone mapping rules in TTS
  3.8 Summary

4 Cross-lingual synthesis in Indian languages
  4.1 Introduction
  4.2 Previous work
  4.3 Discussion on cross-lingual synthesis
  4.4 Phonetic Description of Telugu and Kannada
  4.5 Databases Used
  4.6 Experiments and Results
      4.6.1 Experimental results
      4.6.2 Analysis of segment durations
  4.7 Summary

5 Summary and Conclusion
  5.1 Summary of the work
  5.2 Conclusion of the work
  5.3 Future Work

List of Figures


1.1 Architecture of TTS
1.2 Architecture of Formant Synthesizer
1.3 Speech synthesis model based on LPC model

4.1 Synthesis outputs of TT and TK systems: (a) synthesized using TT, (b) synthesized using TK, (c) synthesized using TT, (d) synthesized using TK

List of Tables


1.1 Scale used in MOS

2.1 Classification of consonants of Indian languages
2.2 Statistics of the Wikipedia Text Corpus
2.3 Statistics of the OTS Corpus
2.4 Duration Statistics
2.5 Acoustic Phonetic features
2.6 MCD scores for CLUNITS and CLUSTERGEN voices
2.7 Participants in the Indian language tasks of Blizzard 2013
2.8 Users statistics for the tasks IH1.1 (Hindi), IH1.2 (Bengali), IH1.3 (Kannada), IH1.4 (Tamil)
2.9 MOS and WER scores for IH1.1 Hindi-Paid listeners
2.10 MOS and WER scores for IH1.3 Kannada-Paid listeners
2.11 MOS and WER scores for IH1.2 Bengali-Paid listeners
2.12 MOS and WER scores for IH1.4 Tamil-Paid listeners

3.1 English word computer represented as US English phone sequence, US English phone-Telugu phone mapping and English word-Telugu phone mapping
3.2 Perceptual evaluation scores for baseline Telugu TTS system with different pronunciation rules for English
3.3 Accuracy of prediction for English word - English phone mapping
3.4 Accuracy of prediction for English word - Telugu phone mapping
3.5 Perceptual results comparing systems T and T A

4.1 Consonant sounds of Telugu (TL) and Kannada (KN) based on MOA and POA
4.2 Perceptual results comparing systems TT and TK
4.3 Perceptual results comparing systems KK and KT
4.4 Scores and Counts for Short Vowels
4.5 Scores and Counts for Long Vowels
4.6 Scores and Counts for Geminates

Chapter 1

Introduction to Text-to-speech system

1.1 Text-to-speech synthesis systems

A text-to-speech (TTS) system converts input text into a spoken waveform. Text-to-speech systems are used in many applications such as car navigation systems, screen readers, voice mail, etc. The major components of a TTS system are 1) the text processing module and 2) the waveform generation module.

[Figure: block diagram in which the input text passes through Text Normalization, Phonemic Analysis and Prosodic Analysis (the text processing stage) and then through Speech Synthesis (the waveform generation stage) to produce speech.]

Figure 1.1 Architecture of TTS

Figure 1.1 shows the architecture of a TTS system. The text processing module includes three sub-parts: text normalization, phonemic analysis and prosodic analysis. The waveform generation module generates the speech output for the given input text.

1.2 Text Processing Module

Text analysis is responsible for converting the input text into a linguistic specification. The input text to this module is raw text. During text analysis, the input text is processed, non-standard words are expanded, and grapheme-to-phoneme conversion is performed.

1.2.1 Text Normalization

Text normalization is the transformation of text into a pronounceable form. The input raw text consists of standard words (common words and proper nouns) and non-standard words (NSWs) such as numbers, abbreviations, acronyms and idiomatic expressions. Pronunciations of standard words are present in a pronunciation dictionary called a lexicon; the pronunciation can also be generated using letter-to-sound rules. By definition, non-standard words comprise numerical patterns and alphabetical strings that cannot be found in a dictionary [Sproat et al., 1999]. Pronunciations of NSWs need to be generated using natural language processing. Text normalization is also useful for comparing two sequences of characters which are represented differently but mean the same, for example Don't vs Do not, and I'm vs I am. In English and other languages, there are hundreds of words that have the same representation in text but different pronunciations; the pronunciation varies depending on the context. These words are called homographs, and the process of disambiguating between them is called homograph disambiguation. The three different forms of homograph disambiguation are listed below; a small code sketch of such expansion follows the examples.

1. Number disambiguation: A numerical string can be pronounced differently depending on its context. For example the string 1943 can be pronounced in three different ways such as:

• Nineteen forty three (as a date)
• One nine four three (as a phone no.)
• One thousand nine hundred and forty three (as a quantifier)

2. Abbreviation disambiguation: Abbreviations frequently occur in raw text. These abbreviations need to be expanded. For example

• Ms is expanded as Miss.
• St. Joseph St. is expanded as Saint Joseph Street.
• Don't is expanded as Do not and can't is expanded as cannot.

3. Acronym disambiguation: Acronyms are a type of abbreviation made up of the initial letters of other words. Acronyms need to be converted to their expanded form. For example:

• ROM is expanded as Read Only Memory.
• RAM is expanded as Random Access Memory.
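To make the expansion step concrete, the following is a minimal sketch of NSW expansion, assuming a tiny hand-made abbreviation and acronym table and a digit-by-digit number reading; the tables and function names are illustrative, not the actual rules used in this work.

```python
# Minimal sketch of non-standard word (NSW) expansion (hypothetical tables,
# not the actual rules used in this thesis). Numbers are read digit-by-digit
# here; a real system would pick a date/quantifier reading from context.
import re

ABBREVIATIONS = {"Ms": "Miss", "St.": "Saint"}
ACRONYMS = {"ROM": "Read Only Memory", "RAM": "Random Access Memory"}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

def expand_token(token: str) -> str:
    if token in ACRONYMS:
        return ACRONYMS[token]
    if token in ABBREVIATIONS:
        return ABBREVIATIONS[token]
    if re.fullmatch(r"\d+", token):                 # NSW: numeric string
        return " ".join(DIGITS[int(d)] for d in token)
    return token                                    # standard word: leave as is

def normalize(text: str) -> str:
    return " ".join(expand_token(t) for t in text.split())

print(normalize("Call 1943 for the RAM manual"))
# -> "Call one nine four three for the Random Access Memory manual"
```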

1.2.2 Grapheme-to-phoneme Conversion

Grapheme-to-phoneme conversion involves converting orthographic symbols into a sequence of phonemes. In other words, Aksharas are orthographic representations of speech sounds which map to sequences of phonemes. A phoneme is a minimal sound unit of a spoken language. The concept of grapheme-to-phoneme rules is more applicable to English, Hindi, Bengali, Malayalam and Tamil, where the relationship between orthography and pronunciation is complex. There are primarily two approaches to the pronunciation of words in speech synthesis, namely 1) dictionary based and 2) rule based. In the first method, a dictionary or lexicon is prepared which stores all variations of the words with their correct pronunciation. The lexicon here is an inventory in which each entry is a word with its equivalent pronunciation; given an input word, the equivalent pronunciation is read out. Although this approach is very quick, it needs a large database to store all the words, and the system fails when it encounters a word not present in the dictionary. Coverage is a serious issue, especially for proper names, which are generally thought to be more difficult than ordinary words. Typically, to handle out-of-vocabulary (OOV) words, a grapheme-to-phoneme system is built using machine learning techniques [Black et al., 1998]. The second approach is a rule-based method where letters are pronounced based on a system of rules [Umeda, 1976, Allen, 1976]. Here a database is not required for generating the rules. However, it is difficult to handle all possible rules that can occur in a language.
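As a rough illustration of the two approaches, the sketch below first consults a small lexicon and then falls back to naive one-letter-to-one-phone rules for out-of-vocabulary words; the lexicon entries, phone labels and rules are hypothetical.

```python
# Minimal sketch of the two pronunciation strategies described above:
# a lexicon lookup first, then simple letter-to-sound rules for
# out-of-vocabulary words. The lexicon entries and rules are illustrative.

LEXICON = {
    "speech": ["s", "p", "iy", "ch"],
    "text":   ["t", "eh", "k", "s", "t"],
}

# Naive one-letter-to-one-phone fallback rules (purely illustrative).
LTS_RULES = {"a": "aa", "b": "b", "c": "k", "d": "d", "e": "eh",
             "i": "ih", "k": "k", "n": "n", "o": "ow", "s": "s", "t": "t"}

def grapheme_to_phoneme(word):
    word = word.lower()
    if word in LEXICON:                 # dictionary-based pronunciation
        return LEXICON[word]
    # rule-based fallback for OOV words
    return [LTS_RULES[ch] for ch in word if ch in LTS_RULES]

print(grapheme_to_phoneme("speech"))    # ['s', 'p', 'iy', 'ch']
print(grapheme_to_phoneme("cat"))       # OOV -> ['k', 'aa', 't']
```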

1.2.3 Prosodic Analysis

Prosody is the combination of stress pattern, duration and intonation in speech. It is the study of those aspects of speech that typically apply to a level above that of the individual phoneme. Features above the level of the phoneme are referred to as suprasegmentals. Stress refers to the degree of prominence of syllables in a word. Intonation refers to the rise and fall of the voice pitch. All languages use pitch as intonation, for example to convey emphasis, to express happiness or to raise a question. Pitch is the perceptual correlate of the fundamental frequency (F0). The mechanism of the vibration of the vocal cords is given in Borden and Harris [Borden and Harris, 1983]. Pitch is measured in Hertz (1 Hz is 1 cycle per second). The range of F0 for an individual speaker mainly depends on the length and mass of the vocal cords. In conversational speech this range is typically between approximately 80 and 200 Hz for males and between approximately 180 and 400 Hz for females, and for young children it can be considerably higher. Predicting the correct prosody from the input text is a difficult task. Modelling of intonation is an important task that affects the intelligibility and naturalness of the speech. The prosody of continuous speech depends on many separate aspects such as the meaning of the sentence, the characteristics of the speaker and emotions [Lemmetty, 1999]. Generally, intonation is distinguished as

• Rising intonation (when the pitch of the voice increases)

• Falling intonation (when the pitch of the voice decreases)

• Dipping intonation (when the pitch of the voice falls and then rises)

• Peaking intonation (when the pitch of the voice rises and then falls)

Along with intonation and pitch, the prediction of pauses also contributes to the naturalness of the synthesized speech. In the context of TTS, prosody analysis deals with predicting prosody for the input text.

1.3 Waveform generation module

There are many different techniques for waveform generation. A few of them are described below:

1.3.1 Articulatory speech synthesis

Articulatory synthesizers produce speech by direct modelling of the human articulatory mechanism [Klatt, 1987]. The human articulatory mechanism is described by the characteristics of the vocal tract (by means of a description of the vocal tract geometry) and by the placement of the potential sound sources within this geometry. In order to generate speech, the vocal tract positions for each phoneme are defined by means of parameters. The articulatory control parameters include lip aperture, lip protrusion, tongue tip position, tongue tip height, tongue position, and tongue height [Kroger and Brat, 1992]. At synthesis time, it is important to know which properties of the articulatory process have to be modelled for each phone in order to achieve high quality speech synthesis.

1.3.2 Parametric synthesis

Parametric synthesis consists of two different techniques, formant synthesis and linear prediction synthesis, which are described below:

1.3.2.1 Formant synthesis

[Figure: block diagram in which a rule system converts annotated phones into formant tracks and a pitch contour, which drive the formant synthesizer to produce synthetic speech.]

Figure 1.2 Architecture of Formant Synthesizer

Formant synthesis is based on the source-filter model of speech production. The basic assumption is that the vocal tract transfer function can be modeled by simulating formant frequencies and formant amplitudes. Each formant can be specified with a center frequency, bandwidth and amplitude, as shown in equation 1.1.

H_i(z) = \frac{1}{1 - 2e^{-\pi b_i} \cos(2\pi f_i)\, z^{-1} + e^{-2\pi b_i}\, z^{-2}}    (1.1)

This is a second order filter with center frequency f_i and bandwidth b_i. Formant synthesis does not use a database of speech samples but depends on rules to generate the parameters for synthesis of speech. At synthesis time, an artificial reconstruction of the formants is performed [Yegnanarayana et al., 1994]. This is done by exciting a set of resonators with a voicing source to achieve the desired speech spectrum. Therefore the synthesized speech output has an artificial, robotic sound and the goal of naturalness is not reached. Examples of early formant synthesizers are MITalk [Allen et al., 1979, 1987], KlatTalk [Klatt, 1982] and DecTalk [Klatt, 1990]. These types of synthesizers are used by visually impaired people to navigate computers using a screen reader.
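A minimal sketch of the resonator in equation 1.1 is given below, assuming the centre frequency and bandwidth are normalised by the sampling rate (the equation leaves this normalisation implicit) and using an impulse train as a stand-in voicing source; the formant values chosen are only illustrative.

```python
# Sketch of the second-order resonator of Eq. (1.1). f_hz and b_hz are given
# in Hz and normalised by the sampling rate fs. A formant synthesizer cascades
# several such filters and excites them with a voicing source.
import numpy as np
from scipy.signal import lfilter

def formant_resonator(x, f_hz, b_hz, fs):
    f, b = f_hz / fs, b_hz / fs                     # normalised frequency/bandwidth
    a1 = -2.0 * np.exp(-np.pi * b) * np.cos(2.0 * np.pi * f)
    a2 = np.exp(-2.0 * np.pi * b)
    return lfilter([1.0], [1.0, a1, a2], x)         # H(z) = 1 / (1 + a1 z^-1 + a2 z^-2)

fs = 16000
t = np.arange(fs // 10) / fs
source = np.zeros_like(t)
source[::160] = 1.0                                 # 100 Hz impulse train as voicing source
# cascade three resonators at roughly the first formants of a neutral vowel
y = source
for f_hz, b_hz in [(500, 60), (1500, 90), (2500, 120)]:
    y = formant_resonator(y, f_hz, b_hz, fs)
```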

1.3.2.2 Linear prediction synthesis

[Figure: LPC synthesis model in which an impulse train generator (controlled by the pitch period) and a random noise generator feed a voiced/unvoiced switch; the selected excitation u(n), scaled by a gain G, drives a time-varying digital filter controlled by the vocal tract parameters to produce the speech signal s(n).]

Figure 1.3 Speech synthesis model based on LPC model

Linear Prediction (LP) synthesis is the reconstruction of the signal after LP analysis. Previous methods for analyzing the speech signal start by transforming the acoustic features into spectral form by performing a short-time Fourier analysis of the speech wave. In formant synthesis, spectral analysis was done based on a rule-based system for studying the speech signals. Methods based on spectral analysis often do not provide a natural speech output. An alternative to formants is to use the spectral parameters of the vocal tract transfer function directly. LP synthesis is another source-filter method for speech synthesis. The idea behind LPC analysis is that a given speech sample at time n, s(n), can be approximated as a linear combination of the past p speech samples, such that

\hat{s}(n) = \sum_{k=1}^{p} a_k\, s(n-k)

e(n) = s(n) - \hat{s}(n) = s(n) - \sum_{k=1}^{p} a_k\, s(n-k)    (1.2)

The a_k are the linear prediction coefficients, which are obtained directly from the speech signal. A window length of 25 ms is used to calculate the system properties of the signal. As the spectral characteristics of the speech vary over time, the predictor coefficients at a given time instant n must be estimated from a short segment of the speech signal occurring around time n. The approach is to find the set of LP coefficients that minimizes the mean squared prediction error over a short segment of the speech waveform. For calculating the predictor coefficients, two suitable methods exist: the covariance method and the autocorrelation method. The autocorrelation method has the advantage of a one-to-one relationship with the predictor coefficients, as compared to the covariance method. The predictor coefficients are used to design a synthesis filter, which is excited by the prediction residual to produce speech [Markel and Gray, 1976]. The LP synthesis approach is used in standard speech coding techniques such as CELP (Code-Excited Linear Prediction).
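The following is a minimal sketch of the autocorrelation method for estimating the predictor coefficients a_k of equation 1.2 from one windowed frame; the order and frame length are example values, and the random frame merely stands in for real speech.

```python
# Minimal sketch of the autocorrelation method for LP coefficient estimation
# (order p = 10 here, purely as an example).
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_autocorrelation(frame, order=10):
    frame = frame * np.hamming(len(frame))          # 25 ms window in practice
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # solve the Toeplitz normal equations R a = r for the predictor coefficients
    return solve_toeplitz((r[:order], r[:order]), r[1:order + 1])

fs = 16000
frame = np.random.randn(int(0.025 * fs))            # stand-in for a real speech frame
a_k = lpc_autocorrelation(frame, order=10)
# prediction residual e(n) = s(n) - sum_k a_k s(n-k), as in Eq. (1.2)
residual = frame[10:] - np.array(
    [np.dot(a_k, frame[n - 10:n][::-1]) for n in range(10, len(frame))])
```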

1.3.3 Concatenative synthesis

While articulatory models suffer from a lack of adequate modeling of the articulators, parametric models require a large number of rules to model coarticulation and prosody. An alternative is to concatenate recorded speech segments, which produces the most natural-sounding speech [Black and Taylor, 1994]. Concatenative synthesis can broadly be classified into two types, namely diphone synthesis and unit selection.

1.3.3.1 Diphone synthesis

A diphone is a speech segment that starts in the middle of a phoneme (its stable part) and ends in the middle of the next phoneme. A diphone covers the transition between two phones and thus captures coarticulation. Diphone synthesis uses a minimal speech database which contains the diphones occurring in a language. The number of diphones depends on the phonotactics of the language: for a language with N phonemes, there will be on the order of N^2 diphones. For example, Telugu has about 2500 (50x50) diphones and English about 1936 (44x44). In diphone synthesis, only one example of each diphone is contained in the speech database. At runtime, the target prosody of a sentence is superimposed on these minimal units by means of DSP (Digital Signal Processing) techniques such as PSOLA [Moulines and Charpentier, 1990] or MBROLA [Dutoit et al., 1996]. The diphone synthesis technique uses extensive signal processing, which leads to an unavoidable degradation of the synthesized speech signal.

1.3.3.2 Unit Selection Synthesis

The approach of using an inventory of speech units is referred to as the unit selection approach. It can also be referred to as a data-driven or exemplar-based approach to speech synthesis. Unit selection synthesis uses a large database of speech recorded by a single speaker with natural prosody. The recorded speech is segmented based on the choice of the unit. The choice of unit may be the phone [Hunt and Black, 1996], diphone [Mark et al., 1998], syllable [Kishore and Black, 2003] or word. The unit selection algorithm selects the appropriate units from the database by minimizing the cost of concatenating the units. This cost is expressed in terms of two cost functions. The target cost estimates how well a database unit fits the target unit; the concatenation cost estimates the acoustic continuity of two consecutive units. The selection of units for synthesis is based on these two cost functions. The selection of a target unit in the speech database is based on acoustic parameters such as fundamental frequency (pitch), duration, prosodic characteristics and the phonetic identity of the preceding and following phonemes. A suitable set of candidate units is obtained using classification and regression trees: at each decision node the target feature is estimated, and the search stops at a leaf node when the desired number of candidate units is reached. The quality of synthesized speech depends on the quality of the recorded speech and the coverage of the units in the database. Thus, the larger the speech database, the more units an algorithm can choose from, which makes it easier to find suitable units for a given sentence. If no suitable units are present in the database, the output speech can sound very bad; if suitable units are present, the synthesized speech can sound natural, as the units are selected directly from the database.
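The selection can be viewed as a shortest-path (Viterbi) search over the candidate units; the sketch below assumes placeholder target_cost and concat_cost functions standing in for the real acoustic measures.

```python
# Minimal sketch of unit selection as a Viterbi search over candidate units,
# minimising the sum of target and concatenation costs. Candidate lists and
# the two cost functions are placeholders for real acoustic measures.
import numpy as np

def select_units(candidates, target_cost, concat_cost):
    """candidates[i] is the list of database units for target position i."""
    n = len(candidates)
    best = [np.array([target_cost(u, 0) for u in candidates[0]])]
    back = []
    for i in range(1, n):
        cur, ptr = [], []
        for u in candidates[i]:
            costs = [best[i - 1][j] + concat_cost(prev, u)
                     for j, prev in enumerate(candidates[i - 1])]
            j = int(np.argmin(costs))
            cur.append(costs[j] + target_cost(u, i))
            ptr.append(j)
        best.append(np.array(cur))
        back.append(ptr)
    # trace back the lowest-cost path
    j = int(np.argmin(best[-1]))
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    return [candidates[i][j] for i, j in enumerate(reversed(path))]
```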

1.3.4 Statistical Parametric Speech Synthesis

Statistical parametric speech (SPS) synthesis does not use any stored units from the database. Instead it fits a model to the speech corpus during the training phase and stores this model. Two methods of SPS synthesis are present: 1) CLUSTERGEN and 2) HMM (Hidden Markov Model) based speech synthesis. In CLUSTERGEN, the spectrum, fundamental frequency and durations are modelled separately. A CART (Classification and Regression Tree) model is built for each independently, based on context-dependent phonemes. At the synthesis stage, values are predicted from the corresponding cluster trees and used as parameters to the MLSA filter [Imai, 1983] to synthesize the speech. In HMM based speech synthesis, spectrum and excitation parameters are extracted from a speech database and modelled by context dependent HMMs. Apart from the source and system features, each HMM has state duration densities to capture the temporal structure of speech [Yoshimura et al., 1998]. As a result, the system models spectrum, pitch and durations simultaneously in a unified HMM framework [Yoshimura et al., 1999]. The model is parametric because the speech is represented by parameters rather than stored exemplars. It is statistical because it describes these parameters using statistics (means and variances of probability density functions) which capture the distribution of parameter values found in the training data. During synthesis, a sentence HMM is constructed by concatenating context dependent HMMs based on the context dependent labels. Spectrum and excitation parameters are predicted using the speech parameter generation algorithm [Tokuda et al., 2000]. The synthesized waveform is obtained by exciting the MLSA filter [Imai, 1983].

1.4 Evaluation of Text-to-speech systems

1.4.1 Subjective evaluations

1.4.1.1 Mean Opinion Score (MOS)

MOS is used for evaluating speech synthesis quality [Rix et al., 2006]. MOS is a subjective measure where listeners are asked to rate the synthesis quality on a scale from 1 (worst) to 5 (best). The listeners evaluate the synthetic speech based on the scale given in Table 1.1. The MOS is computed by averaging the scores of the set of listeners who participated in the listening test.

Table 1.1 Scale used in MOS

Scale   Meaning
1       Worst
2       Poor
3       Fair
4       Good
5       Best

1.4.1.2 ABTest

In this method, the same sentence is synthesized by two systems and the listener is asked to mark a preference based on the quality. The possible responses are 1 (first system), 2 (second system) and 3 (no preference), so listeners also have the choice of expressing no preference.

1.4.2 Objective evaluations

1.4.2.1 Mel-cepstral distortion (Objective Evaluation)

One of the widely used distance measures in speech synthesis is the Euclidean distance between Mel-frequency cepstral coefficients, known as the Mel-cepstral distortion (MCD). MCD is an objective error measure used to compute the cepstral distortion between the original and the synthesized Mel-cepstral coefficients. The lower the MCD value, the better the synthesized speech. MCD is defined as

\mathrm{MCD} = \frac{10}{\ln 10}\, \sqrt{2 \sum_{i=1}^{25} \left(c_i^{t} - c_i^{e}\right)^{2}}    (1.3)

where c_i^t and c_i^e denote the target and the predicted Mel-cepstral coefficients respectively. Typical MCD values for synthesized speech are in the range of 5 to 8.
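A minimal sketch of this computation is shown below, assuming the Mel-cepstral coefficients are available as per-frame arrays and that the per-frame distortion of equation 1.3 is averaged over all frames, with c_0 excluded as is common practice.

```python
# Minimal sketch of the MCD measure of Eq. (1.3), computed per frame over
# Mel-cepstral coefficients c_1..c_25 and averaged over all frames (c_0,
# the energy term, is excluded here).
import numpy as np

def mcd(target_mcep, synth_mcep):
    """target_mcep, synth_mcep: arrays of shape (frames, 26), column 0 = c_0."""
    diff = target_mcep[:, 1:26] - synth_mcep[:, 1:26]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))

# toy usage with random coefficients standing in for real analysis output
t = np.random.randn(100, 26)
e = t + 0.1 * np.random.randn(100, 26)
print(mcd(t, e))
```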

1.5 Need for Multi-lingual and Cross-lingual Speech Synthesis

The field of text-to-speech synthesis has undergone considerable changes over the past few years, especially in terms of the intelligibility and naturalness of the synthesized voices. However, in countries like India most speakers are multi-lingual. As a result, there is a need to build TTS systems which can handle text containing lexical items from multiple languages.

• Multi-lingual TTS: A multi-lingual TTS system handles input containing text from different languages (in this thesis we consider Telugu interspersed with English words).

• Cross-lingual TTS: A cross-lingual TTS system synthesizes input text of one language using a TTS system built for another language (in this thesis we consider Telugu and Kannada).

Earlier works have attempted to build multi-lingual TTS systems [Campbell, 1998, Quazza et al., 2001, Badino et al., 2004]. In Campbell, et al. [Campbell and Black, 1996] the authors built separate systems for each language. When the language of the input text changes, the TTS system also has to be changed. This can only be done between two sentences, not in the middle of a sentence, and the resulting output speech sounds very unnatural due to the different voices involved. To overcome this, a multi-lingual approach can be implemented by handling different languages with the same voice [Campbell, 1998]. In [Traber et al., 1999, Latorre et al., 2006a] the authors built polyglot speech synthesis systems. This method involves recording a multi-language speech corpus from someone who is fluent in multiple languages; this speech corpus is then used to build a multi-lingual TTS system. The cross-lingual approach is closely related to the multi-lingual method. There are primarily two ways of achieving it. Chen, et al. [Chen et al., 2012] achieved it by building two independent HMM based TTS systems from data recorded by a bilingual speaker, which were then used in a framework where the HMM states were shared across the two languages in decision-tree based clustering. The other approach is frame-level selection [Qian et al., 2009], in which the source language frames are mapped to the closest target language frames; the mapping is based on minimizing the distance between speech feature vectors.

1.5.1 Issues in existing approaches

• From Campbell, et al. [Campbell and Black, 1996], it is observed that switching TTS systems can sound unnatural to listeners when they hear different voices.

• In Deprez, et al. [Deprez et al., 2007], phones of the foreign language are mapped to the closest sound units of the primary language, and the appropriate phones are concatenated using a unit selection synthesizer. If the mapped phones do not exist in the source language, then the quality of the synthesized output is poor.

• From Traber, et al. [Traber et al., 1999], it is observed that the primary issue with polyglot speech synthesis is that it requires the development of a combined phone set. Also, finding a speaker fluent in multiple languages is not an easy task.

1.6 Thesis Statement

We primarily investigate the issues of multi-lingual and cross-lingual synthesis in Indian languages. The problems addressed in this thesis are:

1. Design and development issues that are involved for Indian language speech databases in the framework of unit selection and statistical parametric synthesis.

2. Synthesis of English words using a Telugu TTS system.

3. Obtaining the desired pronunciation of English words using a small training set.

4. Text of one language can typically be synthesized by a system built for another language, especially in the case of Indian languages, because of their common phonetic base. This raises the issue of measuring the quality of synthesis in a cross-lingual synthesis system. We investigate why cross-lingual synthesis does not sound natural to native speakers.

1.7 Organization of Thesis

The remainder of the thesis is organized in the following fashion:

• Chapter two gives an overview of the development of speech databases for Indian languages. It discusses various issues in unit selection and statistical parametric speech synthesis. It also explains the framework for evaluation of Indian language audio databases.

• Chapter three deals with the pronunciation of English words using a Telugu phone set. The hypothesis is that while pronouncing an English word, a native speaker of Telugu mentally maps the English word to a sequence of Telugu phones, as opposed to simply substituting English phones with the corresponding Telugu phones. Based on this hypothesis, experiments were conducted and subjective evaluations were done. The results indicate that word-phone mapping is more suitable than phone-phone mapping for a multi-lingual system.

• Chapter four explains the issue of cross-lingual synthesis in Indian languages. It hypothesizes that a bilingual speaker imposes the mother tongue effect while speaking another language. To verify our hypothesis, we synthesized Telugu text using a Kannada TTS system and conducted subjective evaluations. The results of the evaluations show that native speakers of Telugu prefer Telugu text synthesized by a Telugu TTS, as compared to Telugu text synthesized by a Kannada TTS system.

Chapter 2

Speech Database Collection and Evaluation Strategies for TTS

2.1 Introduction

In this chapter, we explain the methodology used to design the speech databases for Indian languages and discuss the importance of the same. A set of baseline TTS systems is built for Indian languages using phone and grapheme as the basic units.

2.1.1 Nature of Indic Scripts

Indic scripts have originated from the ancient Brahmi script. The Indian languages have a common phonetic base. The phonemes (the minimal sound units) are divided into two groups: vowels and consonants. Systematic combinations of elements of these groups result in the basic units of the writing system, which are referred to as Aksharas. The properties of Aksharas are as follows:

• An Akshara is an orthographic representation of sounds in an Indian language

• Aksharas are syllabic in nature

• The typical forms of Aksharas are V, CV, CCV, and CCCV. Aksharas thus have the generalized form C*V, where C denotes a consonant and V a vowel.

• An Akshara always ends with a vowel or nasalized vowel.

• White space is used as word boundary.

• The scripts are written from left to right.

The shape of an Akshara depends on the composition of its constituent consonants and vowels, and on the sequence of the consonants. In defining the shape of an Akshara, one of the consonant symbols acts as a pivotal symbol. Depending on the context, an Akshara can have a complex structure, with the same or other consonant and vowel symbols placed on top of, below, before, after or sometimes surrounding the pivotal symbol.

Table 2.1 Classification of consonants of Indian languages

                      Manner of articulation (MOA)
Place of         Unvoiced                  Voiced                    Nasals   Semivowels   Fricatives
articulation     Unaspirated   Aspirated   Unaspirated   Aspirated
(POA)
Velar            k             kh          g             gh          ng       -            h
Palatal          ch            chh         j             jh          nj       y            sh
Alveolar         t:            t:h         d:            d:h         nd       r            shh
Dental           t             th          d             dh          n        -            s
Bilabial         p             ph          b             bh          m        v            -

2.1.2 Differences and Similarities in Indic scripts

Except for English and Urdu, most of the other languages used in India share a common phonetic base, i.e., they share a common set of speech sounds. This common phonetic base consists of around 50 phones, including 15 vowels and 33 consonants. In addition to sharing a common phonetic base, some languages such as Hindi, Marathi and Nepali also share a common script called Devanagari. Other languages such as Gujarati, Bengali, Oriya, Telugu, Kannada, Tamil and Assamese have distinct scripts descended from Brahmi. The property that separates these languages can be attributed to the phonotactics of each language rather than to the scripts and speech sounds. Phonotactics are the permissible combinations of phones that can co-occur in a language. Based on phonotactics, the distribution of syllables encountered in each language is different. Another dimension in which the Indian languages differ significantly is prosody, which includes the duration, intonation and prominence associated with each syllable in a word or a sentence. The Indic languages have a number of consonants, each of which represents a distinctive sound. These are classified in terms of manner of articulation, place of articulation and voicing.

Manner of articulation: Manner of articulation (MOA) represents how the airflow is modified by the vocal tract during the production of a sound. The degree of stricture at various articulatory positions produces different types of sounds such as stops, nasals and fricatives: 1) stops are sounds produced by complete closure, 2) nasals are sounds produced by stoppage in the vocal tract and release through the nose, and 3) fricatives are sounds produced by passing airflow through a constriction in the vocal tract.

Place of articulation: Place of articulation (POA) represents the location or point of constriction made along the vocal tract by the articulators. Consonant sounds are more closely associated with POA than vowel sounds; the air coming from the lungs is constrained in some way to create consonant sounds. Along the vocal tract, the places are the velum, palate, teeth and lips, and the corresponding sounds are velar, alveolar, palatal, dental and bilabial. The common sounds in Indic languages and their MOA and POA are shown in Table 2.1.

2.1.3 Digital Storage

Prior to Unicode and ISCII (Indian Standard Code for Information Interchange), there was no standardized format to store Indian language text in digital form. Nowadays, formats like ASCII (American Standard Code for Information Interchange), ISCII and Unicode are commonly used to store Indic text data in digital form. With the advent of Unicode, the scripts of Indian languages have their own unique representation. This has standardized the representation of Aksharas and their rendering on the computer screen. However, the key-in mechanism for these Aksharas has not been standardized. It is hard for a layman user of a computer to remember and key in the Unicode of these scripts directly. Thus, soft keyboards and keyboard layouts on top of QWERTY keyboards are still followed. Transliteration, i.e., mapping the Aksharas of Indian languages to the English alphabet for keying in, is another popular mode. Once these Aksharas are keyed in, they are internally processed and converted into Unicode characters. Due to the non-standardization of the key-in mechanism for Indian language scripts, it has to be explicitly addressed during the development of text processing modules in text-to-speech systems and user interfaces [Arokia Raj et al., 2007].

2.1.3.1 Handling font-data of Indic scripts

The text available in a font-encoding format is referred to as font-data. Before crawling text from various domains it is necessary to identify the font type, because the data can be encoded in different font types. The usage of Unicode-based news portals and web pages is widespread and increasing, since Unicode supports almost every language currently in use. To handle the diversified formats of Indian scripts, such as Unicode and ASCII, it is essential to use a meta storage format. A transliteration scheme to map Unicode letters to the IT3 (Indian Transliteration) [Lavanya et al., 2005] phonetic notation has been developed.
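A minimal sketch of such a mapping is given below; the Devanagari-to-IT3 entries are illustrative stand-ins rather than the official IT3 table, and inherent-vowel handling is deliberately omitted.

```python
# Minimal sketch of mapping Unicode Aksharas to a common phonetic notation.
# The labels below are illustrative stand-ins, not the official IT3 table.
DEVANAGARI_TO_IT3 = {
    "\u0905": "a",    # अ
    "\u0906": "aa",   # आ
    "\u0915": "k",    # क (inherent vowel handling omitted here)
    "\u093e": "aa",   # ा (vowel sign)
    "\u094d": "",     # ् (virama: suppresses the inherent vowel)
}

def to_it3(text: str) -> str:
    return " ".join(DEVANAGARI_TO_IT3.get(ch, ch) for ch in text)

print(to_it3("\u0915\u093e"))   # क + ा rendered as "k aa"
```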

2.2 Design and collection of Speech Databases

Text data crawled from the web cannot be directly used for processing. Some amount of modification is required before the text data can be used. The major factors taken into account while designing the speech corpus are:

• Text selection: Most of the text available in Indian languages is in the form of news data or blogs, which are under copyright. Hence, we chose to use Wikipedia articles in Indian languages as our text corpus. The articles of Wikipedia are in the public domain, so we could select a set of sentences, record speech data and release it in the public domain without any copyright infringement.

• Choice of language and dialect: We used the Wikipedia dump of Indian languages released in 2008. This dump consists of 17 Indian languages. We chose to build speech databases for Telugu, Tamil, Malayalam, Marathi, Hindi, Bengali and Kannada. These languages were chosen because the total number of articles in each of them is more than 10,000 and native speakers of these languages were available on the campus. Table 2.2 shows the statistics of the text corpus collected for these languages.

• Speaker selection: To construct the speech database, a process of speaker selection was carried out. A group of four to five native speakers (who volunteered for speech data collection) was asked to record 5-10 minutes of speech. A speaker was selected based on how pleasant the voice was and how amenable the speech was for signal processing manipulations.

Each of these languages has several dialects. As a first step, we chose to record the speech in the dialect the native speaker was most comfortable with. The native speakers who volunteered to record speech data were all in the age range of 20-30. During the recording process, they were made aware that the speech data being recorded would be released in the public domain, and a written consent was taken from each of the speakers.

Table 2.2 Statistics of the Wikipedia Text Corpus

Language     No. of       Words                   Syllables               Phones
             sentences    Total       Unique      Total       Unique      Total       Unique
Bengali      54825        1830902     510197      1689005     4883        2851838     47
Hindi        44100        1361878     376465      942079      6901        1466610     58
Kannada      30330        360560      257782      3037748     5580        1697888     52
Malayalam    84000        1608333     699390      3157561     15259       5352120     51
Marathi      30850        810152      270913      1012066     2352        1452175     57
Tamil        99650        1888462     857850      3193292     10525       5688710     35
Telugu       90400        2297183     763470      3193292     9417        4940154     51

2.2.1 Design of Text prompts

Text selection for an unrestricted domain is not an easy task. The important aspects which need to be considered in sentence selection are the length of the sentence and the ease of pronunciation of the words. The length of the sentences and the number of sentences define the size of the database. It is important to have an optimal speech corpus balanced in terms of phonetic coverage and diversity in the realization of the units. Typically, the larger the database, the greater the coverage of units (syllables, diphones or phones). In this work, for each language, a set of 1000 sentences was selected as described in [Black et al., 2002a]. This optimal set was selected using the script in Festvox [Black et al., 2002b] that applies the criteria listed below to select the optimal text corpus; a small sketch of such a selection filter is given after the list. Table 2.3 shows the statistics of the optimal text corpus selected for all the languages.

Table 2.3 Statistics of the OTS Corpus

Language     No. of       Words              Syllables          Phones             Avg. words
             sentences    Total    Unique    Total    Unique    Total    Unique    per line
Bengali      1000         7877     2285      25757    866       37287    47        7
Hindi        1000         8273     2145      19771    890       30723    58        8
Kannada      1000         6652     2125      25004    851       37651    51        6
Malayalam    1000         6356     2077      21620    1191      38548    48        6
Marathi      1000         7601     2097      25558    660       37629    57        7
Tamil        1000         7045     2182      23284    930       42134    35        7
Telugu       1000         7347     2310      24743    997       40384    51        7

• A sentence is selected from the text corpus, if it has the length of 5-15 words.

• Each word in the sentence should be among the 5000 most frequent words.

• Strange characters should be avoided. Capitals should not be used at the beginning of each sentence, and should not be used at the end.
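The sketch below (referred to above) illustrates such a selection filter under the stated length and frequency criteria; it is not the actual Festvox script, and the frequency list is simply derived from the crawled corpus itself.

```python
# Minimal sketch of the sentence-selection filter described above: keep
# sentences of 5-15 words whose words all fall among the 5000 most frequent
# words of the corpus.
from collections import Counter

def select_prompts(sentences, top_n=5000, min_len=5, max_len=15):
    freq = Counter(w for s in sentences for w in s.split())
    frequent = {w for w, _ in freq.most_common(top_n)}
    selected = []
    for s in sentences:
        words = s.split()
        if min_len <= len(words) <= max_len and all(w in frequent for w in words):
            selected.append(s)
    return selected
```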

2.2.2 Recording of Speech Databases

The speech data was recorded in a professional recording studio using a standard headset microphone connected to a Zoom handy recorder. We used a handy recorder as it was highly mobile and easy to operate. By using a headset, the distance from the microphone to the mouth and the recording level were kept constant. A set of 50 utterances was recorded in a single wave file. After each utterance, the speaker was instructed to pause briefly and start the next utterance. This avoided the start-stop for each utterance. The recording was typically clean and had minimal background disturbance. In spite of the care taken, there were mistakes in the utterances due to wrong pronunciation or repeated pronunciation of a word. Any mistake made while recording was rectified either by re-recording those utterances or by correcting the corresponding transcription to suit the utterance.

2.2.2.1 Issues with Speech Recording

Issues involved in recording of speech databases are as follows:

• During recording of the utterances, the speaker may wrongly pronounce the words.

Table 2.4 Duration Statistics

Language     Duration (hh:mm)    Avg. duration of each utterance (sec)
Bengali      1:39                5.94
Hindi        1:12                4.356
Kannada      1:41                6.05
Malayalam    1:37                5.832
Marathi      1:56                6.98
Tamil        1:28                5.27
Telugu       1:31                5.472

• Mobile phone vibrations or other interference during the recording process affect the quality of the recordings.

• Repeat recordings occur when the speaker makes a mistake while recording and corrects it by simply uttering the same word again.

These issues can be corrected manually by listening to the audio files and re-recording the affected utterances separately.

2.2.2.2 Segmentation of recorded audio

Each wave file consists of at least 50 utterances, so there is a need to segment it in order to separate the individual utterances. We used the Zero Frequency Filtering (ZFF) technique to automatically segment the utterances. ZFF has been shown to detect voiced and unvoiced regions in a speech signal with high accuracy [Murty et al., 2009]. The duration of unvoiced regions was subjected to a threshold, which resulted in each wave file being sliced into 50 utterances. A manual check was applied to ensure that each of the utterances matched the corresponding text. Table 2.4 shows the total duration of the speech database and the average duration of each utterance for all the languages.
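The sketch below is not the ZFF method itself; it is a simple energy-based stand-in that slices a long recording wherever a low-energy stretch exceeds a duration threshold, which conveys the same segmentation idea. The thresholds are arbitrary example values.

```python
# Simple energy-based utterance splitter (a stand-in, not ZFF): cut the
# recording wherever a low-energy (pause) run exceeds min_pause_s seconds.
import numpy as np

def split_on_pauses(signal, fs, frame_ms=20, energy_thresh=1e-4, min_pause_s=0.4):
    """signal: 1-D numpy array of samples; returns a list of utterance arrays."""
    hop = int(fs * frame_ms / 1000)
    energy = np.array([np.mean(signal[i:i + hop] ** 2)
                       for i in range(0, len(signal) - hop, hop)])
    silent = energy < energy_thresh
    utterances, start, run = [], 0, 0
    for i, s in enumerate(silent):
        run = run + 1 if s else 0
        if run * frame_ms / 1000.0 >= min_pause_s:      # long pause found
            end = (i - run + 1) * hop
            if end > start:
                utterances.append(signal[start:end])
            start = (i + 1) * hop
            run = 0
    utterances.append(signal[start:])                   # trailing segment
    return utterances
```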

2.3 Framework for building baseline synthetic voices

In this section, we describe the framework used for building baseline synthetic voices in Indian languages. The steps are as follows: 1) definition of the phone set and the acoustic-phonetic properties of each phone, 2) incorporation of letter-to-sound rules, 3) incorporation of syllabification rules, 4) prominence marking of each syllable, 5) phrase break prediction, 6) choice of unit size in synthesis and 7) prosody modeling. While there is some clarity on the phone set and the corresponding acoustic-phonetic features, the rest of these issues are largely unexplored for speech synthesis in Indian languages.

To build the prototype voices, we used the IT3 transliteration scheme for representing the scripts of Indian languages. A phone set was defined for each language based on our experience. Table 2.5 shows the acoustic-phonetic features defined for these phones.

Letter-to-sound (LTS) rules are a generalization of how letters are mapped into sounds. Any TTS system requires LTS rules, which generate the best guess given the training data. LTS rules are more straightforward for Indian languages, as their scripts are phonetic in nature. There is generally a close relationship between what is written and what is pronounced, and thus the necessity of a pronunciation dictionary does not arise in the case of Indian languages. However, among these languages, LTS (grapheme-to-phoneme or Akshara-to-sound) rules are more applicable to Hindi, Bengali, Tamil, Malayalam and Marathi as compared to Telugu and Kannada. Moreover, there hardly exists a decent set of LTS rules that one could use readily. Hence, we did not use any LTS rules in this phase of building voices. Our hope was that phone level units, when clustered based on context, would produce the appropriate sounds.

Syllabification is another issue. Syllabification is the separation of a word into syllables. Aksharas are syllable-like units in which each unit must have a vowel nucleus. The most common type of syllable in Indian languages has a consonant-vowel (CV) structure; other kinds of syllables are CCVC, CVC, VC and VCC. These are known as acoustic syllables, which differ from Aksharas. For example, the Aksharas of the word /amma/ (meaning mother) correspond to /a/ /mma/, whereas the acoustic syllables are /am/ and /ma/. The pronunciation of a phoneme depends on its position inside a syllable. Given that syllabification is specific to each language, we used Aksharas as syllables in these current builds. The syllable boundaries are marked at the vowel positions. If the number of consonants between two vowels is more than one, the first consonant is treated as the coda of the previous syllable and the rest of the consonant cluster is treated as the onset of the next syllable.
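A minimal sketch of this syllabification rule is given below, assuming the input is already a list of phones and using an illustrative vowel set.

```python
# Sketch of the rule above: boundaries at vowels; with more than one consonant
# between vowels, the first consonant closes the previous syllable and the
# remaining cluster opens the next one. The vowel set is illustrative only.
VOWELS = {"a", "aa", "i", "ii", "u", "uu", "e", "ee", "o", "oo", "ai", "au"}

def syllabify(phones):
    syllables, current, pending = [], [], []
    for p in phones:
        if p in VOWELS:
            if current and len(pending) > 1:
                current.append(pending[0])          # first consonant -> coda
                syllables.append(current)
                current = pending[1:]               # rest of cluster -> onset
            elif current:
                syllables.append(current)
                current = pending
            else:
                current = pending
            current = current + [p]
            pending = []
        else:
            pending.append(p)
    if pending:
        current.extend(pending)                     # trailing consonants join last syllable
    if current:
        syllables.append(current)
    return syllables

# /amma/ -> [['a', 'm'], ['m', 'a']], i.e. the acoustic syllables /am/ /ma/
print(syllabify(["a", "m", "m", "a"]))
```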

The concept of syllable-level prominence is more relevant for Indian languages. Prominence patterns play an important role in text-to-speech systems. Given that there is hardly any research on syllable- level prominence for Indian languages, we assigned primary prominence to the first syllable in the word.

The input text may have one or more phrases, where each phrase, organized with a syntactic structure, contains one or more words. Locating the phrase breaks in text is one of the most important tasks for generating natural and intelligible speech. It should be noted that prosodic phrase breaks differ significantly from syntactic phrase breaks. Although phrase break prediction has been widely explored in synthesis, generating these breaks with appropriate durations has not received much attention. Appropriate modeling of prosodic phrase breaks requires part-of-speech (POS) tags, and a POS tagger is hardly available for all Indian languages. In the current build, we used punctuation marks as indicators of phrase breaks.

Table 2.5 Acoustic Phonetic features

Feature name            Feature values                                      Range
Phone type              Vowel/Consonant                                     2
Vowel length            Short/Long/diphthong/schwa                          4
Vowel height            High/middle/low                                     3
Vowel frontness         Front/mid/back                                      3
Lip rounding            +/-                                                 2
Consonant type          Stop/fricative/affricate/nasal/lateral              5
Place of articulation   Labial/alveolar/palatal/labio-dental/dental/velar   6
Consonant voicing       Voiced/unvoiced                                     2
Aspiration              +/-                                                 2
Cluster                 +/-                                                 2
Nukta                   +/-                                                 2

The choice of unit for synthesis may be the phone, diphone, syllable, or grapheme. The selection of a unit (diphone or syllable) is inherently bound to the speech database, as it requires balance in terms of coverage of all the units. This is one of the major issues. In our current systems, we used phone and grapheme units separately.

2.4 Objective Evaluation

To compute the MCD (Mel-Cepstral Distortion), we took 100 test sentences from the relevant databases and synthesized them with different units such as the phone and the grapheme. In this work, both unit selection (CLUNITS) and statistical parametric (CLUSTERGEN) voices [Black and Taylor, 1997, Black, 2006, Kawahara et al., 1999] were built for the Indian languages. Table 2.6 gives the MCD scores for each technique. The results in Table 2.6 indicate that CLUSTERGEN voices have lower MCD scores than CLUNITS voices. This is primarily because of the use of natural durations in the MCD computation for CLUSTERGEN voices. Among the CLUSTERGEN voices, the Hindi and Tamil voices have higher MCD scores. This could be attributed to the lack of appropriate letter-to-sound rules in these builds. However, it was interesting to note a lower MCD score for the CLUSTERGEN voice of Bengali, in spite of not using any letter-to-sound rules.

2.5 Blizzard challenge on Indian language speech databases

The Blizzard challenge, originally started by Black and Tokuda [Black and Tokuda, 2005] is a well established challenge in the field of speech synthesis. [Black and Tokuda, 2005, Bennett, 2005, Bennett and Black, 2006, Frazer and King, 2007, Karaiskos et al., 2008, King and Karaiskos, 2009, 2010, 2011, 2012] are summary papers which provide information about previous challenges. This chapter describes the Indian language tasks in Blizzard challenge 2013. The Indian language tasks consisted of data

18 Table 2.6 MCD scores for CLUNITS and CLUSTERGEN voices

                MCD
Language        Phone                        Grapheme
                CLUNITS    CLUSTERGEN        CLUSTERGEN
Bengali         7.74       4.96              5.36
Hindi           7.09       5.24              5.83
Kannada         6.90       5.01              5.43
Malayalam       7.78       5.1               5.31
Marathi         7.08       4.4               4.68
Tamil           8.0        5.30              5.55
Telugu          6.55       4.39              4.55

Table 2.7 Participants in the Indian language tasks of Blizzard 2013

Short Name    Details                                                                        Method
Natural       Natural speech                                                                 Human
I2            Institute of Infocomm                                                          Unit selection
DFKI          Deutsches Forschungszentrum für Künstliche Intelligenz                         Hybrid
CMU           Carnegie Mellon University                                                     HMM
NITECH        Nagoya Institute of Technology                                                 HMM
USTC          National Engineering Laboratory of Speech & Language Information Processing    Hybrid
ILSP          Institute for Language and Speech Processing, Innoetics                        Unit selection
S4A           Simple4All project consortium                                                  HMM
MILE-TTS      Dept. of Electrical Engg., Indian Institute of Science                         Unit selection

from four languages: Hindi, Bengali, Kannada and Tamil taken from IIIT-H Indic databases [Prahallad et al., 2012]. Eight participants from across the world used the speech data provided as well as the corresponding text data in UTF-8, to build synthetic voices, which were then evaluated by means of listening test. In the following sections, we describe the Indian language tasks in the Blizzard 2013 challenge. Following that we discuss the results obtained for various tasks.

2.6 Indian Language (IH) Tasks

2.6.1 Participants in the Challenge

The Indian language tasks of the Blizzard challenge 2013 had eight participants, listed in Table 2.7. To anonymize the results, the systems are identified using letters, with A denoting natural speech and D to R denoting the systems submitted by the participants in the challenge.

19 2.6.2 Database Used

Speech and text data from four Indian languages, 1) Hindi, 2) Bengali, 3) Kannada, and 4) Tamil, were released from the IIIT-H Indic database [Prahallad et al., 2012]. The details of the speech data are listed in Table 2.4. Along with the speech data, the corresponding text was provided in UTF-8 format. Table 2.3 shows the statistics of the text data for the four languages. No other information, such as segment labels, was provided as part of the challenge. However, there was no restriction on the participants learning or using information like phone sets or labels from other resources [Prahallad et al., 2012, Murthy et al., 2010].

2.6.3 Challenges

Participants were asked to build synthetic voices from the databases in accordance with the rules of the challenge. The Indian language (IH) tasks were numbered from IH1.1 to IH1.4, corresponding to the four languages listed below:

• IH1.1 - Hindi

• IH1.2 - Bengali

• IH1.3 - Kannada

• IH1.4 - Tamil

2.6.4 Materials

The participants were asked to synthesize one hundred test sentences as a part of a listening test. The sentences used in the listening tests were selected as follows:

• WPD (Wikipedia): Distinct sentences were selected which are not a part of the IIIT-H Indic database.

• SUS (Semantically Unpredictable Sentences): Distinct semantically unpredictable sentences which are not part of WPD or the IIIT-H Indic database. The resulting sentences are hard to understand and remember even when spoken by human speakers. The semantically unpredictable sentences were randomly selected from text, and POS tagging was performed by running the IIITH-LTRC shallow parser [Avinesh and Gali, 2007]. The words were then reordered as: Subject Object Verb Conjunction Subject Object Verb.

2.6.5 Evaluation

The participants were asked to synthesize the complete test set, out of which a subset was used in the listening tests. The listening tests for IH1.1 and IH1.3 consisted of eight sections, and the listening tests for IH1.2 and IH1.4 consisted of the first two sections. The sections of the listening test are listed below.

1. Similarity section in WPD
2. Naturalness for WPD
3. Naturalness for SUS
4. Naturalness for SUS
5. Naturalness for SUS
6. Multiple-dimension evaluation for SUS
7. Intelligibility for SUS
8. Intelligibility for SUS

Multiple-dimension evaluation of sentences contained the following sections, in which the listeners provided their response.

• Overall impression ("bad" to "excellent")

• Pleasantness ("very unpleasant" to "very pleasant")

• Speech pauses ("speech pauses confusing/unpleasant" to "speech pauses appropriate/pleasant")

• Stress ("stress unnatural/confusing" to "stress natural")

• Intonation ("melody did not fit the sentence type" to "melody fitted the sentence type")

• Emotion ("no expression of emotions" to "authentic expression of emotions")

• Listening effort ("very exhausting" to "very easy")

The methodology of scoring in the various sections of the listening test is described below:

• Similarity: The listener plays a few samples of the original speaker and one synthetic sample. The listener then chooses a response representing how similar the synthetic voice sounded compared to the original speaker's voice, on the following scale:

1. Sounds like a totally different person
2. Sounds like a different person
3. Sounds more or less like the same person
4. Sounds like the same person
5. Sounds exactly like the same person

• Naturalness: The listener listens to a sample of synthetic speech and chooses a score representing how natural or unnatural the sentence sounded, on the following scale:

1. Completely unnatural 2. Mostly unnatural 3. Equally natural and unnatural 4. Mostly natural 5. Completely natural

Table 2.8 User statistics for the tasks IH1.1 (Hindi), IH1.2 (Bengali), IH1.3 (Kannada), IH1.4 (Tamil)

Task     Paid Users    Online Volunteers    Total
IH1.1    55            71                   126
IH1.2    62            22                   84
IH1.3    84            17                   101
IH1.4    47            16                   63

• Intelligibility: Listeners listen to an utterance and type in what they hear, from which the Word Error Rate (WER) is computed. In each section of the listening test, a listener heard one example from each system, including one natural speech sample where available. A Latin Square design was employed to ensure that no listener heard the same sentence more than once, which is particularly important for testing intelligibility.
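As a rough illustration of how such a score can be obtained, the Python sketch below computes the standard edit-distance based WER between a reference transcription and a listener's typed response; it is a minimal example and not the evaluation script used in the challenge.

def word_error_rate(reference, hypothesis):
    # WER = (substitutions + deletions + insertions) / number of reference words
    ref, hyp = reference.split(), hypothesis.split()
    # Standard Levenshtein (edit) distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the cat sat", "the cat sit"))   # 0.333...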

2.7 Discussion and Results

Two types of listeners were selected for the listening tests:

• Paid users: Native speakers, generally aged between 16 and 30.

• Online volunteers: Recruited by broadcasting requests through mailing lists and similar channels.

Table 2.8 shows the statistics of the different listener types for the tasks (IH1.1 to IH1.4). In this work, only the results obtained from paid listeners are presented, since paid listeners are more careful and attentive in perceptual evaluation tests.

For the purpose of discussing the results obtained from the four tasks, we group the IH1.1 (Hindi) and IH1.3 (Kannada) tasks into one group and the IH1.2 (Bengali) and IH1.4 (Tamil) tasks into another group. The reasons for doing so are twofold. Firstly, the test sets for the IH1.1 and IH1.3 tasks contained data from both the WPD and SUS datasets, whereas the test sets for IH1.2 and IH1.4 contained data from only the WPD dataset. Secondly, the letter-to-sound rules in Hindi and Kannada are less complex than the letter-to-sound rules in Bengali and Tamil. In the results discussed below, System A refers to natural speech.

2.7.1 Results obtained for IH1.1 (Hindi) and IH1.3 (Kannada) Tasks

Table 2.9 shows the mean opinion scores for similarity and naturalness on the WPD dataset, and the mean opinion scores for naturalness and the word error rates on the SUS dataset, for the IH1.1 (Hindi) task. For the IH1.3 (Kannada) task, Table 2.10 shows the same set of scores.

Table 2.9 MOS and WER scores for IH1.1 (Hindi) - Paid listeners

          WPD (Similarity to        WPD (Naturalness)      SUS (Naturalness)      SUS (WER)
          original speaker)
System    Mean    Std. Dev.         Mean    Std. Dev.      Mean    Std. Dev.      Mean    Std. Dev.
A         4.2     0.95              4.7     0.72           4.6     0.83           34 %    24 %
D         2.3     1.20              2.4     0.87           2.3     0.97           43 %    26 %
E         2.1     1.22              2.7     1.15           2.5     0.90           47 %    27 %
F         1.7     0.99              2.6     1.25           2.3     0.99           53 %    23 %
I         2.5     1.29              2.8     1.16           2.8     0.97           57 %    26 %
K         2.8     1.16              3.5     0.86           3.5     0.98           43 %    25 %
L         2.8     1.25              3.7     1.07           3.2     1.23           53 %    26 %
P         2.2     1.11              2.8     1.08           2.7     1.10           50 %    24 %

2.7.2 Results obtained for IH1.2 (Bengali) and IH1.4 (Tamil) Tasks

Table 2.11 shows the mean opinion scores for similarity and naturalness on the WPD dataset for the IH1.2 (Bengali) task. For the IH1.4 (Tamil) task, Table 2.12 shows the same set of scores.

2.7.3 Discussion

A study of the WPD and SUS mean opinion scores for naturalness (Tables 2.9, 2.10), for both IH1.1 (Hindi) and IH1.3 (Kannada), shows that the scores obtained by the systems for both the WPD and SUS datasets are in similar ranges. This can be explained by the fact that Indian languages have relatively free word order, and so the word reordering during the generation of SUS sentences may not have an effect on the output of the system. As a result, the outputs for both WPD and SUS sentences are scored similarly for naturalness. An examination of the WPD mean opinion scores for similarity to the original speaker (Tables 2.9, 2.10, 2.11, 2.12) shows that System L has the highest score among all systems for IH1.1 (Hindi), IH1.2 (Bengali) and IH1.4 (Tamil), and the third highest score for IH1.3 (Kannada). A similar examination of the WPD mean opinion scores for naturalness (Tables 2.9, 2.10, 2.11, 2.12) shows that System L again has the highest score among all systems for all four languages. Analysis of the SUS mean opinion scores for naturalness (Tables 2.9, 2.10) shows that System L has the second highest score for IH1.1 (Hindi) and the highest score for IH1.3 (Kannada). This shows that System L scores high, both in similarity to the original speaker and in naturalness of output, for both the WPD and SUS datasets. However, System L does not perform as well in terms of the SUS WER.

Table 2.10 MOS and WER scores for IH1.3 (Kannada) - Paid listeners

          WPD (Similarity to        WPD (Naturalness)      SUS (Naturalness)      SUS (WER)
          original speaker)
System    Mean    Std. Dev.         Mean    Std. Dev.      Mean    Std. Dev.      Mean    Std. Dev.
A         4.1     1.3               4.5     0.81           4.4     1.0            48 %    31 %
D         2.0     1.3               2.9     1.26           2.8     1.1            50 %    28 %
F         1.8     1.3               2.5     1.31           2.2     1.1            62 %    29 %
I         2.2     1.4               2.5     1.16           2.4     1.1            72 %    26 %
K         1.7     1.2               3.0     1.18           3.0     1.2            55 %    29 %
L         2.5     1.5               3.7     1.15           3.1     1.2            57 %    29 %
P         2.8     1.5               3.5     1.02           2.8     1.2            57 %    27 %
R         3.0     1.5               3.4     1.01           2.6     1.2            67 %    26 %

Table 2.11 MOS scores for IH1.2 (Bengali) - Paid listeners

          WPD (Similarity to        WPD (Naturalness)
          original speaker)
System    Mean    Std. Dev.         Mean    Std. Dev.
A         4.7     0.49              4.7     0.53
D         2.1     0.97              2.6     0.76
F         2.1     1.15              2.4     0.90
I         2.3     1.01              2.1     0.83
K         2.7     1.11              3.3     1.01
L         3.5     1.14              3.8     0.83
P         2.9     1.19              3.0     0.89

Table 2.12 MOS scores for IH1.4 (Tamil) - Paid listeners

          WPD (Similarity to        WPD (Naturalness)
          original speaker)
System    Mean    Std. Dev.         Mean    Std. Dev.
A         4.4     0.99              4.3     0.78
D         2.7     1.00              2.5     0.97
F         1.9     0.83              2.2     0.89
I         1.8     0.84              2.3     1.02
K         2.7     1.13              3.1     0.96
L         3.2     1.29              3.9     1.03
P         2.8     1.16              3.2     1.05

In terms of SUS WER (Tables 2.9, 2.10), System D has the best performance for both IH1.1 (Hindi) and IH1.3 (Kannada). However, its performance in terms of mean opinion scores for similarity and naturalness (on both the WPD and SUS datasets) is poor for all four languages (IH1.1 (Hindi), IH1.2 (Bengali), IH1.3 (Kannada), IH1.4 (Tamil)).

2.8 Summary

In this chapter, a method for designing a speech corpus has been applied to collect a sufficient amount of speech data for Indian languages. To avoid manual segmentation for separating the individual utterances, a zero frequency filtering method was used. The speech databases were used to build text-to-speech systems in the festival framework. The baseline voices were built using phone and grapheme units. We conducted subjective and perceptual evaluations of these synthesizers, and the results showed that the synthesized speech was better for the grapheme based system. The online evaluation referred to as the Blizzard Challenge was conducted to rate the synthesis systems and to identify the effectiveness of different techniques. The results show that System L has the highest scores both in similarity to the original speaker and in naturalness of output, for both the WPD and SUS datasets. In terms of SUS WER, System D has the best performance.

Chapter 3

Handling English words in Telugu Text-to-speech system

3.1 Introduction

In the previous chapter we discussed issues in building speech databases for Indian languages. This chapter discusses multi-lingual systems, specifically the problem of pronouncing English words using a Telugu text-to-speech system. Two types of experiments were conducted for Indian English pronunciation: (1) phone-phone (P-P) mapping and (2) word-phone (W-P) mapping. In P-P mapping, the phones of the foreign language are substituted with the phones of the native language, whereas in W-P mapping, a foreign word is mapped to a sequence of phones of the native language. To validate the above experiments, we conducted perceptual evaluations. The results indicate that word-phone mapping is suitable for multi-lingual systems. This chapter also provides the algorithm for automatic generation of grapheme-to-phoneme conversion using CART.

3.2 Previous approaches towards Multi-lingual TTS

Campbell [Campbell and Black, 1996] showed that pronunciation of foreign words in the input text could be handled using the CHATR system. Each foreign phoneme was mapped to the nearest phoneme of the source language. The mapped units were selected from a large recorded database by considering their prosodic and acoustic characteristics, and the selected segments were concatenated to produce the output speech. The resulting output speech had an accented voice. In [Campbell and Black, 1996], the authors built separate systems for each language. When the language of the input text changes, the TTS system has to be changed. This can only be done between two sentences but not in the middle of a sentence. As a result, the output speech sounds very unnatural due to the different voices involved. Mobius et al. [Mobius et al., 1996] developed TTS systems for nine languages: Mandarin Chinese, Taiwanese, Japanese, Mexican Spanish, Russian, Romanian, Italian, French, and German. This system consisted of a single set of modules to synthesize all these languages, and any language-specific information was represented in tables. The architecture of this system was designed as a modular pipeline where each module handled one specific task. A researcher could work on one module of the system at

a time, and an improved version of a given module could be integrated without disturbing the remainder of the system. Thus the system could deal with only one language at each call to the synthesis module. When multilingual text had to be synthesized, the system had to switch between TTS engines, and the resultant speech sounded like independent sentences being synthesized. Campbell [Campbell, 1998] built a multi-lingual TTS in the framework of unit selection synthesis. Firstly, the target speech was synthesized using the voice of a native speaker of the target language, and then a sequence of segments was selected from the speech corpus. Black et al. [Black and Lenzo, 2004] showed that a multi-lingual TTS could be built using data recorded by speakers who are fluent in multiple languages. But if the speaker is not fully bilingual, the resulting synthesizers are accented. This was shown with a US English speech synthesizer built using data from a Scottish English speaker and a Chinese English speaker; US listeners perceived the accent difference very easily. In [Traber et al., 1999], the authors relied on a phone mapping algorithm to handle different text inputs. Multi-lingual units were selected from a recorded database and concatenated to generate the output speech. The results showed that the quality of speech was not natural due to the non-uniformity of the units. Using the Loquendo TTS, Badino et al. [Badino et al., 2004] used the approach of classifying words based on their phonetic transcription. Every word was represented according to its language. Each foreign phoneme was mapped to a similar phoneme of the source language, based on the assumption that two phonemes can be judged similar when they have similar phonetic articulatory features. In this case the output speech sounds unnatural. In Traber et al. [Traber et al., 1999], the authors used a mixed-lingual approach with a diphone inventory. This method involved recording a multi-language speech corpus by someone who was fluent in multiple languages. The recorded speech corpus was then used to build a multilingual TTS system, called a polyglot TTS. The primary issue with polyglot speech synthesis was that it requires the development of a combined phoneset. This is a time consuming process requiring linguistic knowledge of both languages. Also, finding a speaker fluent in multiple languages is not an easy task. In [Sproat, 1996, Traber, 1995], the authors showed that weighted finite-state transducers could be applied to multi-lingual text analysis. The input text was transcribed using morphological and syntactic analyzers for determining sentence accent levels and prosodic phrase boundaries. The output of the analyzers was used to map a foreign constituent to the corresponding constituent of the source language. Deprez et al. [Deprez et al., 2007] used two approaches: 1) phone mapping and 2) creation of a multi-lingual speech database. In the phone mapping method, phones of the foreign language were mapped to the closest sound unit of the primary language, and the appropriate phones were concatenated using a unit selection synthesizer. The problem with phone mapping was the availability of all phones in the source language. To overcome this, a multi-lingual speech database was created. The speech database was transcribed with the phones of multiple languages, and the phones were annotated with a language tag. The phonemes of the input transcription were mapped to the phonemes in the speech database using the language tag.
The limitations of this approach are that it is difficult to find a good voice talent, and that the development of a multi-lingual speech database is a time consuming, tedious and costly activity. An accented unit selection

voice was developed in [Mayfield et al., 2005] using the Swift TTS system. An accented lexicon file was created and trained with context-dependent phonemes to specify the phone sequences for the target words. The authors showed that by taking phonetic context into consideration, a better phonetic match could be achieved. The results on the synthesized output showed that foreign accented synthesis could affect both intelligibility and the perception of accentedness. In [Latorre et al., 2006a, 2005b, 2006b] the authors proposed combining monolingual corpora from several languages to create a single polyglot average voice. This average voice was then transformed to the voice of any real speaker of one of these languages. To synthesize speech, the given text was converted into a sequence of HMM states. For a given HMM sequence, speech parameter vectors were generated and adapted to the voice of a specific speaker by means of techniques such as maximum likelihood linear regression (MLLR) [Leggetter and Woodland, 2005] or MAP adaptation. The resulting parameter vectors were combined with F0 and synthesized with the MLSA filter [Imai, 1983]. The primary issues with polyglot speech synthesis were 1) the inclusion of prosodic information and 2) the requirement of developing a combined phoneset incorporating phones from all the languages under consideration. As a result, native speakers found it difficult to understand the output speech. In this thesis we propose an approach for the pronunciation of English words in a Telugu TTS system. The distinction between our approach and polyglot synthesis [Latorre et al., 2006a, 2005b, 2006b] is that the authors used multiple speaker recordings to build a single TTS system, whereas in our case we apply W-P mapping using a single TTS system.

3.3 Our approach

Our hypothesis is that while pronouncing an English word, a native speaker of Telugu mentally maps the English word to a sequence of Telugu phones, as opposed to simply substituting English phones with the corresponding Telugu phones. To verify our hypothesis, a preliminary comparison of P-P and W-P mapping was done with the help of a subjective evaluation test. The results showed that W-P mapping is preferred by native Telugu speakers. Manual W-P mapping is a tedious task and it is very difficult to include all the English words in a single lexicon. So, it is necessary to build a model to predict the pronunciation from the input text itself. Human beings have the ability to pronounce unseen words in a reasonable manner. Here we adopt a similar mechanism automatically in our method through a letter-to-sound (LTS) rule based system [Black et al., 1998]. One of the major issues is addressing the accents of standard English pronunciations. To overcome this problem, an Indian-English (IE) dictionary has been prepared to make the pronunciations comfortable for the Indian audience. Also, the desired pronunciation must be attained using a small training set. Since an IE dictionary is not readily available, we started with an initial set of 1000 English words from the CMU Dictionary and manually changed the pronunciation as per a standard IE accent. A baseline automated LTS rule based system [Black et al., 1998] was built using the initial set. Later, instead of adding the most

frequently occurring words, random new entries were added to the lexicon. New words were added iteratively until the desired pronunciations were obtained. A perceptual study was conducted comparing manually prepared and automatically generated word-phone mappings. We prepared two sets of 25 utterances. The first set (set-A) was prepared by manual mapping of English words to Telugu phones. The second set (set-B) was prepared by the automated LTS rule system. A perceptual listening test (AB-test) was conducted using 10 native Telugu speakers (subjects). Each subject was asked to listen to an utterance from set-A and from set-B and was asked whether he/she could find any difference between the two utterances. The listeners were also asked to give their mean opinion scores (MOS), i.e., a score between 1 (worst) and 5 (best), for each utterance. The results showed that the listeners preferred both sets equally; perceptually there was no significant difference between the manually prepared and the automated system. The details of phone-phone and word-phone mapping are explained in Section 3.4. The accuracy of the automatic generation of English word to Telugu phone mappings is explained in Section 3.5. The results obtained from the perceptual studies are explained in Section 3.7.

3.4 Comparison of word-phone and phone-phone mapping

Table 3.1 The English word "computer" represented as a US English phone sequence, US English phone - Telugu phone (P-P) mapping, and English word - Telugu phone (W-P) mapping, with the corresponding IPA transcriptions

Representation           Phone sequence             IPA
US English phones        /k ax m p y uw t er/       [k @ m p j u t 3~]
Phone-phone mapping      /k e m p y uu t: r/        [k e m p j u: ú r]
Word-phone mapping       /k a m p y uu t: a r/      [k a m p j u: ú a r]

Table 3.1 shows an example of the word computer represented as a US English phone sequence, an English phone - Telugu phone mapping and an English word - Telugu phone mapping, along with the corresponding IPA transcription. The English word - Telugu phone mapping is not a one-to-one mapping, as it is in the case of the English phone - Telugu phone mapping. Each letter has a correspondence with one or more phones. As some letters do not have an equivalent pronunciation sound (the letter is not mapped to any phone), the term epsilon is used whenever a letter does not have a mapping to a phone.
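To make the contrast concrete, the following minimal Python sketch encodes the two strategies for the word in Table 3.1; the small dictionaries are illustrative assumptions rather than a complete phone inventory or lexicon.

# Phone-phone (P-P) mapping: each US English phone is replaced, one-to-one,
# by a nearby Telugu phone (entries follow the example in Table 3.1).
P2P = {"k": "k", "ax": "e", "m": "m", "p": "p", "y": "y",
       "uw": "uu", "t": "t:", "er": "r"}

def phone_phone(us_phones):
    return [P2P[p] for p in us_phones]

# Word-phone (W-P) mapping: the orthographic word is mapped directly to a
# Telugu phone sequence; letters may map to one phone, several phones, or
# to epsilon (no phone at all).
W2P = {"computer": ["k", "a", "m", "p", "y", "uu", "t:", "a", "r"]}

def word_phone(word):
    return W2P[word.lower()]

print(phone_phone(["k", "ax", "m", "p", "y", "uw", "t", "er"]))
# ['k', 'e', 'm', 'p', 'y', 'uu', 't:', 'r']
print(word_phone("Computer"))
# ['k', 'a', 'm', 'p', 'y', 'uu', 't:', 'a', 'r']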

To compare word-phone (W-P) mapping and phone-phone (P-P) mapping, we manually prepared word-phone and phone-phone mappings for 10 bilingual utterances and synthesized them using our baseline Telugu TTS system. We then performed perceptual listening evaluations on these synthesized utterances, using five native speakers of Telugu as the subjects of the evaluations. The perceptual listening evaluations were set up both as MOS (mean opinion score) evaluations and as ABX evaluations. An explanation of MOS and ABX evaluations is given in Section 3.7. Table 3.2 shows the results of these evaluations.

         MOS                   ABX
         W-P      P-P          W-P      P-P     No Pref.
         3.48     2.66         32/50    4/50    14/50

Table 3.2 Perceptual evaluation scores for baseline Telugu TTS system with different pronunciation rules for English

An examination of the results in Table 3.2 shows that the manually prepared word-phone mapping is preferred perceptually when compared to the manual phone-phone mapping. The MOS score of 3.48 indicates that native speakers accept W-P mapping for pronouncing English words in a Telugu TTS. For the remainder of this chapter, we focus exclusively on word-phone mapping. We propose a method of automatically generating these word-phone mappings from data. We evaluate our approach by generating a word-phone mapping which maps each English word to a Telugu phone sequence (henceforth called EW-TP mapping). We report the accuracy of learning the word-phone mappings both on a held-out test set and on a test set from a different domain. Finally, we incorporate this word-phone mapping in our baseline Telugu TTS system and demonstrate its usefulness by means of perceptual listening tests.

3.5 Unicode Transliteration

In a multi-lingual environment, the input to a TTS system is in the form of Unicode text. Unicode supports codes for almost every language currently in use. To handle the diverse formats of Indian scripts, such as Unicode and ASCII, it is essential to use a common meta-storage format. A transliteration scheme maps the Unicode letters to the IT3 phonetic notation [Lavanya et al., 2005], i.e., a mapping from one system of writing into another. Thus IT3 transliteration is used as the common representation scheme for all Indic data formats.
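As a rough illustration only, the Python sketch below performs character-level substitution from Unicode Telugu letters into a Latin phonetic notation; the three table entries and the output notation are assumptions for illustration, and a real IT3 converter must additionally handle vowel signs (matras), the halant and conjunct consonants.

# Illustrative Unicode-to-IT3-style mapping (entries are assumptions).
IT3_MAP = {
    "\u0C05": "a",    # TELUGU LETTER A
    "\u0C06": "aa",   # TELUGU LETTER AA
    "\u0C15": "ka",   # TELUGU LETTER KA (inherent vowel assumed)
}

def to_it3(text):
    # Characters without a mapping are passed through unchanged.
    return "".join(IT3_MAP.get(ch, ch) for ch in text)

print(to_it3("\u0C05\u0C15"))   # "aka"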

3.6 Automatic generation of word-phone mapping

In the previous section, it was mentioned that the number of letters in a word and the number of phones in its pronunciation do not have a one-to-one match. Each letter specifies a phonetic correspondence with one or more phones. If a letter is not mapped to a phone, then epsilon is used. As we require

a fixed-size learning vector to build a model for learning word-phone mapping rules, we need to align the letter (grapheme) and phone sequences. For this we use the automatic epsilon scattering method.

3.6.1 Epsilon Scattering Method

The idea in automatic epsilon scattering is to estimate the probability of a letter (grapheme) G matching a phone P, and then use string alignment to introduce epsilons, maximizing the probability of the word's alignment path. Once all the words have been aligned, the association probabilities are calculated again, and so on until convergence. The algorithm for automatic epsilon scattering is given below [Lenzo and Black, 1998].

Algorithm for Epsilon Scattering:

/* Initialize prob(G, P), the probability of G matching P */
1. for each word_i in the training set:
      count, with string alignment, all possible G/P associations
      for all possible epsilon positions in the phonetic transcription
/* EM loop */
2. for each word_i in the training set:
      alignment path = argmax over alignments of prod_{i,j} P(G_i, P_j)
      compute prob_new(G, P) on the alignment path
3. if (prob != prob_new) go to 2
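A simplified Python sketch of this procedure is given below. It assumes each lexicon entry has at least as many letters as phones, initializes the association counts crudely from letter/phone co-occurrence within each word (a stand-in for counting all possible alignments), and runs a fixed number of EM iterations instead of testing convergence; it illustrates the idea rather than reproducing the Festival implementation.

import math
from collections import defaultdict

EPS = "_eps_"  # epsilon: the letter maps to no phone

def normalize(counts):
    probs = {}
    for g, row in counts.items():
        total = sum(row.values())
        probs[g] = {p: c / total for p, c in row.items()}
    return probs

def init_probs(lexicon):
    # Crude initialization: every letter of a word is associated with every
    # phone of that word, plus epsilon.
    counts = defaultdict(lambda: defaultdict(float))
    for letters, phones in lexicon:
        for g in letters:
            counts[g][EPS] += 1.0
            for p in phones:
                counts[g][p] += 1.0
    return normalize(counts)

def best_alignment(letters, phones, probs):
    # Dynamic programming: each letter is aligned either to the next phone or
    # to epsilon, maximizing the sum of log probabilities.
    assert len(phones) <= len(letters), "assumes no more phones than letters"
    n, m = len(letters), len(phones)
    NEG = float("-inf")
    score = [[NEG] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    score[0][0] = 0.0
    for i in range(1, n + 1):
        g = letters[i - 1]
        for j in range(0, min(i, m) + 1):
            # option 1: letter i-1 maps to epsilon
            if score[i - 1][j] > NEG:
                s = score[i - 1][j] + math.log(probs[g].get(EPS, 1e-12))
                if s > score[i][j]:
                    score[i][j], back[i][j] = s, (j, EPS)
            # option 2: letter i-1 maps to phone j-1
            if j > 0 and score[i - 1][j - 1] > NEG:
                s = score[i - 1][j - 1] + math.log(probs[g].get(phones[j - 1], 1e-12))
                if s > score[i][j]:
                    score[i][j], back[i][j] = s, (j - 1, phones[j - 1])
    aligned, i, j = [], n, m
    while i > 0:                      # backtrack: one symbol per letter
        j, sym = back[i][j]
        aligned.append(sym)
        i -= 1
    return list(reversed(aligned))

def epsilon_scattering(lexicon, iterations=10):
    probs = init_probs(lexicon)
    for _ in range(iterations):       # EM loop
        counts = defaultdict(lambda: defaultdict(float))
        for letters, phones in lexicon:
            for g, sym in zip(letters, best_alignment(letters, phones, probs)):
                counts[g][sym] += 1.0
        probs = normalize(counts)
    return probs

# Tiny usage example with an assumed two-word toy lexicon:
lexicon = [(list("book"), ["b", "uu", "k"]), (list("ball"), ["b", "aa", "l"])]
probs = epsilon_scattering(lexicon)
print(best_alignment(list("book"), ["b", "uu", "k"], probs))  # one symbol per letter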

3.6.2 Evaluation and Results

Once the alignment between each word and the corresponding phone sequence was complete, we built two phone models using Classification and Regression Trees (CART). For the first model, we used data from the CMU pronunciation dictionary, where each English word had been aligned to a sequence of US English phones (EW-EP mapping). The second model was the EW-TP mapping. Once both models had been built, they were used to predict the mapped phone sequences for each English word in the test data. For the purposes of testing, we performed the prediction on both held-out test data as well as on test data from a different domain. The held-out test data was prepared by removing every ninth word from the lexicon. As we knew the correct phone sequence for each word in the test data, a ground truth against which to compute the accuracy of prediction was available. We measured the accuracy of the prediction both at the letter level and at the word level. At the letter level, the accuracy was computed by counting the number of times the predicted letter-to-phone mapping matched the ground truth. For computing the accuracy at the word level, we counted the number of times the predicted phone sequence of each word in the test data matched the actual phone sequence for that word (derived from the ground truth). We also varied the size of the training data and then computed the prediction accuracy for each model, in order to study the effect of training data size on the prediction accuracy.
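A minimal sketch of these two accuracy measures is shown below; it assumes that each prediction is already aligned letter-by-letter with its reference (epsilons included), as produced from the aligned lexicon.

def letter_and_word_accuracy(predictions, references):
    # predictions, references: lists of per-word phone sequences, one symbol
    # (phone or epsilon) per letter of the word.
    letter_hits = letter_total = word_hits = 0
    for pred, ref in zip(predictions, references):
        letter_total += len(ref)
        letter_hits += sum(p == r for p, r in zip(pred, ref))
        word_hits += int(pred == ref)
    return letter_hits / letter_total, word_hits / len(references)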

Training set    Held-out (%)             Testing (%)
size            Letters     Words        Letters     Words
1000            92.04       39           81.43       16.6
2000            94.25       44.98        82.47       17.5
5000            94.55       47           84.40       25.1
10000           95.82       59.86        89.46       44.7
100000          94.09       56.37        93.27       55.10

Table 3.3 Accuracy of prediction for English word - English phone mapping

Training set    Held-out (%)             Testing (%)
size            Letters     Words        Letters     Words
1000            92.37       28           82.22       18.8
2000            94.34       45.45        83.79       25.1
5000            95.89       68.2         88.40       42.7
10000           96.54       71.67        94.74       70.9

Table 3.4 Accuracy of prediction for English word - Telugu phone mapping

Tables 3.3 and 3.4 show the accuracy of the models. An examination of the results in the two tables shows that incrementally increasing the size of the training data results in an increase in prediction accuracy. Native speakers of Indian languages tend to pronounce what is written; as a result, there are fewer variations in word-phone mapping as compared to US English. This is reflected in our results, which show that the word level prediction accuracy is higher for EW-TP mapping as compared to EW-EP mapping.

3.7 Integrating word-phone mapping rules in TTS

For the purpose of perceptual evaluations we built a baseline TTS system for Telugu using the HMM based speech synthesis technique [Zen et al., 2007]. To conduct perceptual evaluations of the word-phone mapping rules built from data in Section 3.6.2, we incorporated these rules in our Telugu TTS system. This system is henceforth referred to as T_A. A set of 25 bilingual sentences was synthesized by the Telugu TTS, and ten native speakers of Telugu performed perceptual evaluations on the synthesized utterances. As a baseline, we also synthesized the same 25 sentences by incorporating manually written word-phone mappings for the English words, instead of using the automatically generated word-phone mapping rules. We refer to this system as T_M. The perceptual evaluations were set up both as MOS (mean opinion score) evaluations and as ABX evaluations. In the MOS evaluations, the listeners were asked to rate the synthesized utterances from all systems on a scale of 1 to 5 (1 being worst and 5 best), and the average score for each system was calculated. This average is the MOS score for that system. In a typical ABX evaluation, the listeners

are presented with the same set of utterances synthesized using two systems A and B, and are asked to mark their preference for either A or B. The listeners also have the option of marking no preference. In this case, the listeners were asked to mark their preference between T_A and T_M. The results of the perceptual evaluations are shown in Table 3.5.
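For illustration, the short Python sketch below tallies such MOS and ABX responses; the response encoding and the example numbers are assumptions, not the actual listener data.

from collections import Counter

def mos(scores):
    # scores: list of 1-5 ratings collected for one system
    return sum(scores) / len(scores)

def abx_tally(choices):
    # choices: list of "A", "B" or "none" responses from the AB test
    counts, total = Counter(choices), len(choices)
    return {c: "%d/%d" % (counts.get(c, 0), total) for c in ("A", "B", "none")}

print(mos([4, 3, 4, 3, 3]))                          # 3.4
print(abx_tally(["A", "none", "B", "none", "A"]))    # {'A': '2/5', 'B': '1/5', 'none': '2/5'}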

         MOS                   ABX Test
         T_M      T_A          T_M       T_A       No Pref.
         3.48     3.43         51/250    38/250    161/250

Table 3.5 Perceptual results comparing systems T_M and T_A

An examination of the results shows that perceptually there is no significant preference for the manual system over the automated system. The MOS scores also show that there is no significant difference between the ratings of the manual and the automated system.

3.8 Summary

This chapter discussed a method for building a multi-lingual text-to-speech system for Telugu and English. The issue addressed here is the pronunciation of English words using a Telugu TTS system. An English word is mapped to a sequence of Telugu phones instead of substituting the English phones with corresponding Telugu phones. Manual preparation of an Indian-English dictionary is a difficult task, so a method for automatically building letter-to-sound rules for synthesizing foreign words has been shown in this chapter. We conducted perceptual evaluations of our method. The effectiveness of the automatic generation of letter-to-sound rules was shown by computing the accuracy of prediction of the lexicons. We have also shown that these lexicons can be applied successfully in multi-lingual TTS systems.

Chapter 4

Cross-lingual synthesis in Indian languages

4.1 Introduction

In the previous chapter we discussed multi-lingual TTS in the context of Indian languages and the advantages of word-phone sequence mapping. In this chapter, we discuss cross-lingual synthesis in the context of Indian languages.

4.2 Previous work

There have been various methods implemented for cross-lingual synthesis. Chu et al. [Chu et al., 2003] developed a bilingual TTS for Mandarin and English which maintains sentence level intonation even for mixed-lingual text. They implemented a language-dispatching module which took the text input and applied language identification. The text was then passed to the corresponding language-specific unit selection module. The unit selection module of the system was shared across languages. To avoid the annoyance of different voices when multi-lingual text was input, the same speaker was used to create the speech databases for both languages. The main issue addressed was synthesizing the speech with richer intonation in spite of language switching. The Soft Prediction Only (SPO) technique was applied to normalize the pitch for both languages, as English is a stress language and Mandarin is a tonal language. A Prosodic Constraint Oriented (PCO) approach was used for unit selection during synthesis. In Xianglin et al. [Xianglin et al., 2010], the authors built language-dependent average voice models using data recorded by multiple speakers. Speaker characteristics were minimized using a linear transformation technique. As a result, the effect of the acoustic parameters could be easily perceived in the output, and hence the synthesis quality was poor. In Wu et al. [Wu et al., 2008], the authors showed that cross-lingual synthesis could be done using a speaker adaptation approach. The authors built an average voice model using data recorded by multiple speakers. The input text was transformed to the output language using state mapping information. The primary issue with this method was that the input language information was not considered in the average voice model for conversion into output language speech. As a result

the synthesis quality was degraded.

Qian et al. [Qian et al., 2009] developed three independent HMM based systems using data recorded by bilingual and monolingual speakers. Each leaf node in the bilingual speaker's HMM was mapped to a leaf node in the monolingual HMM to generate the output speech.

Liang et al. [Liang et al., 2007] achieved cross-lingual synthesis by building two independent HMM based TTS systems from data recorded by a bilingual speaker, which were then used in a framework where the HMM states were shared across the two languages in decision-tree based clustering. Latorre et al. [Latorre et al., 2005a] discussed a method for approximating the sounds of languages not included in the polyglot training data. The sounds were approximated from one language to another by means of the similarity between the articulatory features of the source and target phones. These features are derived from the International Phonetic Alphabet (IPA) representation of the phones. When no similar articulatory features were found between the source and target, an ad-hoc assignment was made by a linguistic expert. Chen et al. [Chen et al., 2009] provided a method of phone-to-phone mapping. Each source language phone was mapped to the nearest neighbour phone of the target language. This system uses sub-phonemic state knowledge to map to the corresponding target phone. The issue with this method is that the synthesis quality may degrade if the closest phone is not available. Oura et al. [Oura et al., 2012] discussed a speech recognition transcription based unsupervised cross-lingual adaptation method. The transcription input was applied to the TTS system according to state-level information. The input language HMM states were mapped to the output language HMMs by minimizing the distance between the HMM states using a divergence distance measure. Chen et al. [Chen et al., 2012] used a frame-level approach. The frames of the source language were mapped to the closest frames of the target language. The mapping was based on minimizing the distance between speech feature vectors. Liang et al. [Liang et al., 2010] used an ASR (Automatic Speech Recognition) transcription based unsupervised cross-lingual adaptation method. The transcription input was applied to the TTS system by marginalising over decision tree and HMM state mapping. The performance of the system was computed by means of subjective and objective evaluations. It was observed that language mismatch was the main problem, so introducing some extra techniques to alleviate the mismatch before speaker adaptation would be helpful.

Wu et al. [Wu et al., 2009] proposed the state-level mapping approach. They used two approaches: (i) transform mapping and (ii) data mapping. In the transform mapping approach, intra-lingual adaptation was first carried out in the input language. Following this, the transforms were applied to the states of the output language acoustic model using the state-level information. Alternatively, a data mapping approach was proposed in which states belonging to the input language acoustic model were replaced by states belonging to the output language acoustic model according to the derived state mapping. The resulting transformed state emission probabilities could be directly used for synthesis in the output language. In this case, the transform mapping from source to target language was a difficult task because

the audio databases were recorded by different speakers. However, they assumed that if the recordings were made by a bilingual speaker, the mismatch between the two speech databases could be reduced. In this thesis we propose a framework for cross-lingual synthesis for Indian languages. We show that it is difficult to use a TTS system built for one language for synthesizing text of another language.

4.3 Discussion on cross-lingual synthesis

In our method, we address the problem of cross-lingual synthesis for Indian languages. We consider the neighbouring languages Telugu and Kannada. We built separate TTS systems for each language, using data recorded by a bilingual speaker. Our hypothesis is that a bilingual speaker subconsciously imposes the effect of his/her mother tongue while speaking the second (non-native) language. We synthesized text of the first language using the TTS system built for the second language. Synthesizing related languages with a single TTS system might not be acceptable to native speakers of the language; therefore it is necessary to separate the language creation process from the TTS system. Even if the phones of the new language exist in the TTS system, the quality of synthesis may not be acceptable. In [Winters et al., 2008], the authors showed that listeners could easily discriminate a speaker's voice across two languages which are phonologically similar. Our assumption is that a single TTS system cannot be used for multiple language inputs. We seek to answer the following two questions: 1. How do we measure whether or not a speaker sounds similar in two different systems? 2. Does synthetic speech which has been generated from a non-native system sound like a native speaker? Both languages were analyzed at the phonological and orthographic levels. From the viewpoint of Telugu, the phones in Kannada are acoustically identical, but coarticulation effects play a role in discriminating sounds in the two languages. A perceptual study was conducted with Telugu native speakers. The results showed that Telugu text synthesized by the Telugu TTS (TT) system was preferred over Telugu text synthesized by the Kannada TTS (TK) system. This is because the TK system has more variation in the durations of the articulatory positions. This phenomenon can be explained as follows. While producing read speech, humans pronounce words not in isolation but in a continuum. It is noticed that natural speech is typically shorter in duration and less accurately pronounced, especially for word-final vowels and geminates. To verify this, we compared the synthesized recordings with the original audio databases. We found that the segments of long vowels ([A:], [E:], [I:], [O:], [U:]) and geminates were extended in TK and shortened in TT. While synthesizing input text with the two different systems, the durations of phonemes vary across the systems. In this case, a bilingual speaker can easily be distinguished by linguistic cues that are shared across the two phonemically similar languages. It is difficult to find a bilingual speaker who is fluent in both Telugu and Kannada without the influence of the neighbouring language.

Table 4.1 Consonant sounds of Telugu (TL) and Kannada (KN) based on MOA and POA

MOA                            Velar       Palatal      Alveolar     Dental      Bilabial
Unvoiced Unaspirated    TL     k [k]       ch [tS]      t: [ú]       t [t]       p [p]
                        KN     k [k]       ch [tS]      t: [ú]       t [t]       p [p]
Unvoiced Aspirated      TL     kh [kh]     chh [tSh]    t:h [úh]     th [tH]     ph [ph]
                        KN     kh [kh]     chh [tSh]    t:h [úh]     th [tH]     ph [ph]
Voiced Unaspirated      TL     g [g]       j [dZ]       d: [ã]       d [d]       b [b]
                        KN     g [g]       j [dZ]       d: [ã]       d [d]       b [b]
Voiced Aspirated        TL     gh [gH]     jh [dZH]     d:h [ãh]     dh [dH]     bh [bH]
                        KN     gh [gH]     jh [dZH]     d:h [ãh]     dh [dH]     bh [bH]
Nasals                  TL     ng [N]      nj [ñ]       nd [ï]       n [n]       m [m]
                        KN     ng [N]      nj [ñ]       nd [ï]       n [n]       m [m]
Semivowels              TL     y [j], r [r], l [l], v [V]
                        KN     y [j], r [r], l [l], v [V]
Fricatives              TL     h [h], sh [ç], shh [ù], s [s]
                        KN     h [h], sh [ç], shh [ù], s [s]

The details of the classification of sounds in the Kannada and Telugu languages are explained in Section 4.4, and the experiments and results are explained in Section 4.6.

4.4 Phonetic Description of Telugu and Kannada

Indian languages have a number of consonants, each of which represents a distinctive sound. These are classified in terms of manner of articulation (MOA), place of articulation (POA) and voicing. Indian language scripts have a one-to-one correspondence between what is written and what is spoken. The basic consonant sounds of Kannada and Telugu are shown in Table 4.1. Taking this into consideration, one could assume that there should not be any difference between synthesizing native and non-native text. However, we found that the duration and energy of the speech signal varied across TL and KN. In this work we focus on the duration of segments in the two systems. The details of the experiments are explained in the next section.

4.5 Databases Used

Experiments were conducted using the Telugu and Kannada languages. Separate TTS systems were built for the two languages, using data recorded by a bilingual speaker. For Telugu, the IIIT-NK database was used. The text consists of isolated sentences collected from news stories, selected to be phonetically balanced. This database consists of 1631 utterances, and the size of the audio database is about 2.5 hours. The Kannada database is taken from the IIIT-H Indic corpus [Prahallad et al., 2012].

This database consists of 1000 utterances selected so as to cover phonetically balanced sentences. The size of the audio database is about 1.5 hours.

4.6 Experiments and Results

4.6.1 Experimental results

To evaluate the synthesis quality of the cross-lingual system, we built two baseline TTS systems, for Telugu and Kannada. These TTS systems were built using the HMM based speech synthesis technique [Kawahara et al., 1999].

A set of 20 sentences was collected from Telugu and Kannada Chandamama stories. We synthesized Telugu text using the Telugu system (TT), Telugu text using the Kannada system (TK), Kannada text using the Kannada system (KK) and Kannada text using the Telugu system (KT).

A perceptual listening test was conducted using five Telugu native speakers. The evaluations were set up as both MOS and ABX evaluations. In the MOS evaluations, the listeners were asked to give their mean opinion scores, i.e., a score between 1 (worst) and 5 (best), for each utterance. In the ABX evaluation, the same sentence is synthesized by two systems and the listener is asked to mark a preference based on quality: 1 (first), 2 (second), or 3 (no preference). The results show that the listeners preferred the TT and KK systems. The results of the perceptual evaluations are shown in Tables 4.2 and 4.3.

Table 4.2 Perceptual results comparing systems TT and TK

         MOS                   ABX Test
         TT       TK           TT        TK        No Pref.
         3.65     3.40         55/100    33/100    12/100

Table 4.3 Perceptual results comparing systems KK and KT

         MOS                   ABX Test
         KK       KT           KK        KT        No Pref.
         3.86     2.94         76/100    6/100     18/100

Tables 4.2 and 4.3 show that native language synthesis was preferred by the respective listeners. We observed that the phone durations varied in the case of the non-native synthesis systems. In the majority of cases, the increase or decrease in phone duration was observed for long vowels, short vowels and geminates. Taking this into consideration, we compared the quality of the TTS systems with natural recordings from the audio databases. Manually annotated labels were compared with the synthesized speech labels. The variability in duration was easily perceived in the synthesized outputs. The observations

clearly show that, in the case of long vowels, the probability of a long vowel becoming short is high in Telugu and low in Kannada.

We selected the Telugu words /prajalatoo/ and /chepparu/ from the test sentences. The corresponding sentences were synthesized using the TT and TK systems. Examples are shown for both categories, long vowels and geminates, in Figure 4.1. The synthesized outputs showed that the effect of duration could be perceived at the locations of long vowels and geminates, while the durations of singletons in the two systems were nearly the same. In case (a), the duration of [oo] is nearly 100 ms and in case (b) it is about 176 ms. In case (c), the duration of [pp] is nearly 96 ms and in case (d) it is about 216 ms.

Although the phonetic units of the Kannada and Telugu languages are the same, the experimental results showed that the effect of non-native synthesis can easily be perceived. The Telugu synthesis system was not preferred for synthesizing Kannada text, because the speaker's first language accent was imposed on the TTS system built for the Kannada language. To verify this, a detailed analysis was carried out on the natural recordings.

Figure 4.1 Synthesis outputs of the TT and TK systems. (a) Synthesized using TT (b) Synthesized using TK (c) Synthesized using TT (d) Synthesized using TK

Table 4.4 Scores and counts for short vowels

Phonemes    Kannada          Telugu
a → a0      0.998 (3119)     0.983 (2469)
a → a1      0.002 (6)        0.017 (43)
u → u0      0.995 (639)      0.969 (1108)
u → u1      0.005 (3)        0.031 (36)

Table 4.5 Scores and counts for long vowels

Phonemes    Kannada        Telugu
ei → ei0    0 (0)          0.137 (43)
ei → ei1    1 (221)        0.863 (270)
oo → oo0    0.016 (1)      0.071 (22)
oo → oo1    0.984 (62)     0.929 (287)
aa → aa0    0.005 (3)      0.039 (47)
aa → aa1    0.995 (658)    0.961 (1150)

4.6.2 Analysis of segment durations

In each language, 30 minutes of data was chosen and the phoneme segments corresponding to long vowels, short vowels and geminates were manually labeled. Short and long realizations were differentiated by tagging them with "0" and "1" respectively. We calculated the probability of a segment having a short or a long duration using the labeled data. Maximum likelihood estimation was used to determine the probability of a phoneme segment having a short or long duration:

p(α → β) = Count(α → β) / Count(α)        (4.1)

where Count(α → β) is the number of times the rule α → β is observed in the corpus, and Count(α) is the number of times α is observed in the corpus. The probability scores for short vowels, long vowels and geminates are shown in Tables 4.4, 4.5 and 4.6, respectively. The second and third columns represent the likelihood scores and the total counts of the phone segments in the two languages. In the case of short vowels, the probability of a short vowel having a long duration is higher in Telugu than in Kannada. In the case of long vowels and geminates, the probability of the segment having a short duration is higher in Telugu than in Kannada.
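A minimal Python sketch of this maximum likelihood estimate over labeled segments is shown below; the toy label list is an illustrative assumption, not the corpus counts reported in Tables 4.4, 4.5 and 4.6.

from collections import Counter

def duration_probabilities(labels):
    # labels: list of (phoneme, realization) pairs from the manually labeled data,
    # e.g. ("aa", "aa1") meaning the long vowel /aa/ was realized long.
    rule_counts = Counter(labels)                          # Count(alpha -> beta)
    phone_counts = Counter(phone for phone, _ in labels)   # Count(alpha)
    return {(a, b): c / phone_counts[a] for (a, b), c in rule_counts.items()}

# Toy example: nine long and one shortened realization of /aa/.
labels = [("aa", "aa1")] * 9 + [("aa", "aa0")]
print(duration_probabilities(labels))   # {('aa', 'aa1'): 0.9, ('aa', 'aa0'): 0.1}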

4.7 Summary

This chapter discussed cross-lingual synthesis for the Kannada and Telugu languages. We hypothesized that a single TTS system cannot be used for multiple language inputs. To verify this assumption, we synthesized Telugu text using both the Telugu TTS and the Kannada TTS separately. We conducted subjective

evaluations on these synthesizers. The results showed that the Telugu listeners preferred the TT system over the TK system. We noticed that the Telugu TTS engine converted the Telugu text into a more natural form than the Kannada TTS. Taking this into consideration, we compared these synthesizers with natural speech, and it is clear that the influence of the first language accent affects the TTS system built for the second language.

Table 4.6 Scores and counts for geminates

Phonemes        Kannada       Telugu
pp → pp0        0.05 (1)      0.2 (6)
pp → pp1        0.95 (22)     0.8 (24)
ll → ll0        0 (0)         0.18 (12)
ll → ll1        1 (125)       0.82 (56)
chch → chch0    0 (0)         0.16 (3)
chch → chch1    1 (5)         0.84 (16)
nn → nn0        0 (0)         0.1 (12)
nn → nn1        1 (109)       0.9 (115)

Chapter 5

Summary and Conclusion

5.1 Summary of the work

In this thesis, a method for designing a speech corpus has been applied to collect a sufficient amount of speech data for Indian languages. The speech databases were used to build text-to-speech systems in the festival framework. The baseline voices were built using phone and grapheme units. We conducted subjective and perceptual evaluations of these synthesizers, and the results showed that the synthesized speech was better for the grapheme based system. The online evaluation referred to as the Blizzard Challenge was conducted to rate the synthesis systems and to identify the effectiveness of different techniques. The results show that System L has the highest scores both in similarity to the original speaker and in naturalness of output, for both the WPD and SUS datasets. In terms of SUS WER, System D has the best performance. As Indian languages provide an opportunity for building multi-lingual TTS systems, we experimented with a multi-lingual TTS system for the Telugu and English languages. The issue addressed here is the pronunciation of English words in an Indian language context. An English word is mapped to a sequence of Telugu phones instead of substituting the English phones with corresponding Telugu phones. Since manual preparation of an Indian-English dictionary is a difficult task, a method for automatically building letter-to-sound rules for synthesizing foreign words has been shown. We conducted perceptual evaluations of our method. The effectiveness of the automatic generation of letter-to-sound rules was shown by computing the accuracy of prediction of the lexicons. We have also shown that these lexicons can be applied successfully in multi-lingual TTS systems. The cross-lingual TTS system is related to the multi-lingual TTS system. Our assumption is that a single TTS system cannot be used for multiple language inputs using Indian languages. We verified our approach by comparing Telugu text synthesized using the Telugu TTS with Telugu text synthesized using the Kannada TTS. We conducted subjective evaluations of these synthesizers. The results showed that the listeners preferred the TT system over the TK system. To verify our hypothesis, we compared these synthesizers with natural speech, and it is clear that the influence of the first language accent affects the TTS system built for the second language.

5.2 Conclusion of the work

The experiments conducted in this thesis investigate the following: 1) how an English word can be synthesized in an Indian language, and 2) the problem of synthesizing text of one language with a system built for another language. To address the pronunciation of English words in the context of an Indian language, an inference is drawn from the hypothesis that a native speaker of an Indian language maps an English word to a sequence of native phones (W-P), as opposed to simply substituting English phones with corresponding native phones (P-P). In the P-P approach, the phones of the English word are mapped to the phones of the Telugu language. Subjective evaluations were performed using this technique, and it was observed that native Telugu speakers did not prefer the quality of the synthesis. This motivated mapping the English word to a sequence of Telugu phones. Subjective evaluations were conducted, and the results showed that native speakers preferred this approach. Using this technique, we built a Telugu multi-lingual TTS which can handle both English and Telugu text as input. To reduce the manual effort of mapping a word to a phone sequence, automated letter-to-sound rules were developed. To know whether a single TTS is sufficient for multiple language inputs in an Indian language context, we synthesized Telugu text using a TTS system built for Kannada. Here the phones of Telugu are mapped to the phones of Kannada. Subjective evaluations show that non-native synthesis was not preferred by the Telugu native speakers. Taking this into consideration, we compared these synthesizers with natural recordings from the audio databases. The observations clearly showed that the influence of the first language accent affects the TTS system built for the second language.

5.3 Future Work

• In cross-lingual synthesis, the synthetic speech has a source language accent when text of the first language is synthesized using a TTS system built for the second language. We will analyze signal processing methods to improve the output speech.

• So far, we have experimented with multi-lingual TTS for one Indian language. In the future, we want to extend multi-lingual TTS to more Indian languages.

• We built TTS systems for Indian languages, but the quality of the synthesis output was not natural. The following techniques need to be investigated to improve the synthesis quality:

1. To build a grapheme based synthesis system in a statistical framework using HMM based synthesis.
2. The choice of syllable versus phone units is to be addressed in statistical parametric synthesis.
3. Appropriate POS tags are required for phrase break prediction.
4. Syllable level prominence is required.
5. Applying letter-to-sound rules to the Bengali, Tamil and Malayalam languages.

Related Publications

1. E. Naresh Kumar, Anandaswarup Vadapalli, E. Veera Raghavendra, Hema A. Murthy and Kishore Prahallad, "Is word-to-phone mapping better than phone-phone mapping for handling English words?", in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013), Sofia, August 2013.

2. Kishore Prahallad, E. Naresh Kumar, Venkatesh Keri, S. Rajendran, Alan W Black “ The IIIT-H Indic Speech Databases” in Proceedings of Interspeech-2012, Portland, Oregon, USA.

3. Kishore Prahallad, Anandaswarup Vadapalli, E. Naresh Kumar, Gautam Mantena, Bhargav Pulugundla, Peri Bhaskararao, Hema A. Murthy, Simon King, Vasilis Karaiskos and Alan W Black, "The Blizzard Challenge 2013 – Indian Language Task", in Proceedings of the Blizzard Challenge Workshop 2013, 2013.

4. E. Naresh Kumar, Peri Bhaskararao, Kishore Prahallad “A Study on Cross-lingual synthesis in Indian languages” not accepted to Speech Prosody, 2014.

Bibliography

J. Allen. Synthesis of speech from unrestricted text. In Proceedings of the IEEE, volume 64, pages 433–442, April, 1976.

J. Allen, S. Hunnicutt, R. Carlson, and B. Granstrom. MITalk-79: The 1979 MIT Text-to-Speech system. In J. Acoust. Soc. America, volume 65, pages 507–507, Cambridge, 1979.

J. Allen, M. S. Hunnicutt, and D. H. Klatt. From Text-to-Speech: The MITalk System. Cambridge University Press, Cambridge, 1987.

A. Arokia Raj, T. Sarkar, S.C. Pammi, S. Yuvaraj, M. Bansal, K. Prahallad, and A.W. Black. Text processing for Text-to-Speech systems in Indian languages. In Proceedings of 6th ISCA Speech Synthesis Workshop SSW6, Bonn, Germany, 2007.

P. Avinesh and K. Gali. Part-of-Speech tagging and chunking using conditional random fields and transformation based learning. In Proceedings of the IJCAI-07 workshop on shallow parsing in South Asian Languages, 2007.

L. Badino, C. Barolo, and S. Quazza. A general approach to TTS reading of mixed-language texts. In INTERSPEECH 2004 - ICSLP, Jeju Island, Korea, October 4-8, 2004.

C.L. Bennett. Large scale evaluation of corpus-based synthesizers: Results and lessons from the Blizzard Challenge 2005. 2005.

C.L. Bennett and A.W. Black. The Blizzard Challenge 2006. In Blizzard Challenge Workshop, Interspeech 2006 - ICSLP satellite event, 2006.

A.W. Black. CLUSTERGEN: a statistical parametric synthesizer using trajectory modeling. In Proceedings of Interspeech 2006, ICSLP, Pittsburgh, PA, 2006.

A.W. Black and K. Lenzo. Multilingual Text-to-Speech synthesis. In Proceedings of ICASSP, Montreal, Canada, 2004.

A.W. Black and P. Taylor. CHATR: a generic speech synthesis system. In COLING94, pages 983–986, Kyoto, Japan, 1994.

A.W. Black and P. Taylor. Automatically clustering similar units for unit selection in speech synthesis. In Proceedings of Eurospeech, pages 601–604, 1997.

A.W. Black and K. Tokuda. The Blizzard Challenge - 2005 : Evaluating corpus-based speech synthesis on common datasets. In Proceedings of Interspeech 2005, Lisbon, 2005.

A.W. Black, K. Lenzo, and V. Pagel. Issues in building general letter to sound rules. In Proceedings of 3rd ESCA Workshop on Speech Synthesis, Jenolan Caves, Australia, 1998.

A.W. Black and K. Lenzo. Building voices in the Festival speech synthesis system. 2002a. URL http://festvox.org/bsv/.

A.W. Black and K. Lenzo. 2002b.

G.J. Borden and K.S Harris. Speech Science Primer: Physiology, Acoustics and Perception of Speech. Williams and Wilkins, Baltimore, London, 1983.

N. Campbell. Foreign language speech synthesis. In Proceedings of ESCA/COCOSDA workshop on speech synthesis, Jenolan Caves, Australia, 1998.

N. Campbell and A.W. Black. CHATR: A Multi-lingual speech re-sequencing synthesis system. In Institute of Electronic, Information and Communication Engineers, Spring Meeting, Tokyo, 1996.

C-P. Chen, Y-C. Huang, C-H. Wu, and K-D. Lee. Cross-lingual frame selection method for polyglot speech synthesis. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012.

Y-N. Chen, Y. Jiao, Y. Qian, and F.K. Soong. State mapping for cross-language speaker adaptation in TTS. In Proceedings of ICASSP 2009, 2009.

H. Chu, H. Peng, Z. Zhao, Y. Niu, and E. Chang. Microsoft mulan - a bilingual TTS system. In Proceedings of ICASSP, volume 1, pages 264–267, 2003.

F. Deprez, J. Odijk, and J. Moortel. Introduction to multilingual corpus-based concatenative speech synthesis. In Proceedings of INTERSPEECH, 2007.

T. Dutoit, V. Pagel, N. Pierret, Bataille F., and O. van der Vrecken. The MBROLA project: Towards a set of high quality speech synthesizers of use for non commercial purposes. In Proceedings of ICSLP, 1996.

M. Frazer and S. King. The Blizzard Challenge 2007. In Proceedings Blizzard Workshop 2007 (in Proc. SSW6), 2007.

A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proceedings of ICASSP 96, volume 1, pages 373–376, Atlanta, Georgia, 1996.

S. Imai. Cepstral analysis synthesis on the Mel-Frequency scale. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 8, pages 93–96, 1983.

V. Karaiskos, S. King, R. Clark, and C. Mayo. The Blizzard Challenge 2008. In Proceedings Blizzard Workshop 2008, 2008.

H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigne. Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds. In Speech Communication, pages 187–207, 1999.

S. King and V. Karaiskos. The Blizzard Challenge 2009. In Proceedings Blizzard Workshop 2009, 2009.

S. King and V. Karaiskos. The Blizzard Challenge 2010. In Proceedings Blizzard Workshop 2010, 2010.

S. King and V. Karaiskos. The Blizzard Challenge 2011. In Proceedings Blizzard Workshop 2011, 2011.

S. King and V. Karaiskos. The Blizzard Challenge 2012. In Proceedings Blizzard Workshop 2012, 2012.

S.P. Kishore and A.W. Black. Unit size in unit selection speech synthesis. In Proceedings of EUROSPEECH, pages 1317–1320, 2003.

D.H. Klatt. The Klatt talk Text-to-Speech conversion system. In Proceedings of the international conference on acoustics, speech and signal processing, pages 1589–1592, Paris, 1982.

D.H. Klatt. Review of Text-to-Speech conversion for English. Journal of the Acoustical Society of America, 82, 1987.

D.H. Klatt. DECtalk user's manual, Digital Equipment Corporation Report. 1990.

Kroger and Brat. Minimal rules for articulatory speech synthesis. In Proceedings of EUSIPCO92, pages 331–334, 1992.

J. Latorre, K. Iwano, and S. Furui. Cross-language synthesis with a polyglot synthesizer. In Proceedings of Interspeech-2005, pages 1477–1480, September, 2005a.

J. Latorre, K. Iwano, and S. Furui. Polyglot synthesis using a mixture of monolingual corpora. In Proceedings of ICASSP, volume 1, pages 1–4, 2005b.

J. Latorre, K. Iwano, and S. Furui. New approach to the polyglot speech generation by means of an HMM-based speaker adaptable synthesizer. Speech Communication, 48(10):1227–1242, 2006a.

J. Latorre, K. Iwano, and S. Furui. New approach to polyglot synthesis: How to speak any language with anyone's voice. In ISCA Workshop on Multilingual Speech and Language Processing, Stellenbosch, South Africa, 2006b.

P. Lavanya, P. Kishore, and G.R. Madhavi. A simple approach for building transliteration editors for Indian languages. Journal of Zhejiang University Science, pages 1354–1361, 2005.

C.J. Leggetter and P.C. Woodland. Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Computer Speech and Language, 9:171–185, 1995.

S. Lemmetty. Review of speech synthesis technology. Master's thesis, Helsinki University of Technology, 1999.

K. Lenzo and A.W. Black. Letter to sound rules for accented lexicon compression. In Proceedings of the 1998 International Conference on Spoken Language Processing, 1998.

H. Liang, Y. Qian, and F.K. Soong. An HMM-based bilingual (Mandarin-English) TTS. In Proceedings of 6th ISCA Speech Synthesis Workshop, pages 137–142, 2007.

H. Liang, J. Dines, and L. Saheer. A comparison of supervised and unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis. In Proceedings of ICASSP, pages 4598–4601, 2010.

M. Beutnagel, A. Conkie, and A.K. Syrdal. Diphone synthesis using unit selection. In Proceedings of 3rd ESCA/COCOSDA Workshop on Speech Synthesis, pages 185–190, Jenolan Caves, 1998.

J. Markel and A. Gray. Linear Prediction of Speech. Springer-Verlag, 1976.

L. Mayfield Tomokiyo, A.W. Black, and K.A. Lenzo. Foreign accents in synthetic speech: Development and evaluation. In Proceedings of INTERSPEECH, pages 1469–1472, 2005.

B. Mobius, J. Schroeter, J. van Santen, R. Sproat, and J. Olive. Recent advances in multilingual Text-to-Speech synthesis. In Fortschritte der Akustik (DAGA), DPG, Bad Honnef, 1996.

E. Moulines and F. Charpentier. Pitch-synchronous waveform processing techniques for Text-to-Speech synthesis using diphones. Speech Communication, 9:453–467, December, 1990.

H.A. Murthy et al. Building unit selection speech synthesis in Indian languages: An initiative by an Indian Consortium. In Proceedings of COCOSDA, Kathmandu, Nepal, 2010.

K.S.R. Murty, B. Yegnanarayana, and M.A. Joseph. Characterization of glottal activity from speech signals. IEEE Signal Processing Letters, 16:469–472, 2009. doi: 10.1109/LSP.2009.2016829.

K. Oura, J. Yamagishi, M. Wester, S. King, and K. Tokuda. Analysis of unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis using KLD-based transform mapping. Speech Communication, pages 703–714, 2012.


K. Prahallad, N.K. Elluru, V. Keri, S. Rajendran, and A.W. Black. The IIIT-H Indic speech databases. In Proceedings of Interspeech, Portland, USA, Sept., 2012.

Y. Qian, H. Liang, and F.K. Soong. A cross-language state sharing and mapping approach to bilingual (Mandarin-English) TTS. IEEE Transactions on Audio, Speech, and Language Processing, pages 1231–1239, August, 2009.

S. Quazza, L. Donetti, L. Moisa, and P. Luigi Salza. ACTOR: a multilingual unit-selection speech synthesis system. In 4th ISCA Workshop on Speech Synthesis, Perthshire, page 209, 2001.

A.W. Rix, J.G. Beerends, D.S. Kim, P. Kroon, and O. Ghitza. Objective assessment of speech and audio quality, technology and applications. IEEE Transactions on Audio, Speech and Language Processing, 14:1890–1901, 2006.

R. Sproat. Multilingual text analysis for Text-to-Speech synthesis. In Proceedings of the ICSLP, Philadelphia, October, 1996.

R. Sproat, A.W. Black, S. Chen, S. Kumar, M. Ostendorf, and C. Richards. Normalization of non-standard words. In WS'99 Final Report, 1999.

K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithm for HMM-based speech synthesis. In Proceedings of ICASSP, pages 1315–1318, 2000.

C. Traber. SVOX: The implementation of a Text-to-Speech system for German. PhD thesis, ETH Zurich, 1995.

C. Traber, K. Huber, K. Nedir, B. Pfister, E. Keller, and B. Zellner. From multilingual to polyglot speech synthesis. In Proceedings of Eurospeech, pages 835–838, 1999.

N. Umeda. Linguistic rules for Text-to-Speech synthesis. In Proceedings of the IEEE, volume 64, pages 443–451, April, 1976.

S.J. Winters, S.V. Levi, and D.B. Pisoni. Identification and discrimination of bilingual talkers across languages. Journal of the Acoustical Society of America, pages 4524–4538, 2008.

Y-J. Wu, S. King, and K. Tokuda. Cross-lingual speaker adaptation for HMM-based synthesis. In Proceedings of ISCSLP, pages 9–12, 2008.

Y-J. Wu, Y. Nankaku, and K. Tokuda. State mapping based method for cross-lingual speaker adaptation in HMM-based synthesis. In Proceedings of INTERSPEECH, pages 528–531, 2009.

X. Peng, K. Oura, Y. Nankaku, and K. Tokuda. Cross-lingual speaker adaptation for HMM-based speech synthesis considering differences between language-dependent average voices. In Proceedings of ICSP, 2010.

B. Yegnanarayana, S. Rajendran, V.R. Ramachandran, and A.S. Madhukumar. Significance of knowledge sources for a Text-to-Speech system for Indian languages. Sadhana, pages 147–169, 1994.

T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modelling for HMM-based speech synthesis. In Proceedings of ICSLP, pages 29–32, 1998.

T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proceedings of Eurospeech, Budapest, Hungary, 1999.

H. Zen, T. Nose, J. Yamagishi, S. Sako, T. Masuko, A.W. Black, and K. Tokuda. The HMM-based speech synthesis system version 2.0. In Proceedings of ISCA SSW6, 2007.
