A Bilingual Voice for Unit Selection Synthesis

Steinthor Steingrimsson

Supervisors: Dr. Korin Richmond and Dr. Robert Clark


Master of Science in Speech and Language Processing
Theoretical and Applied Linguistics
School of Philosophy, Psychology and Language Sciences
University of Edinburgh

2004

Abstract

This dissertation presents the process of building a voice for a unit selection speech synthesiser capable of speaking in two languages. Furthermore, it reports two experiments and their results. One is concerned with the unit selection engine and whether and when it is likely to select 'foreign' units from a shared phone set. The other tests how natural native speakers of each language judge words synthesised using a shared speech database to be, compared to the same words synthesised using only the same voice's database for the target language. Discussion is provided on these results, along with the results of statistical tests run on the data from the perceptual experiment.

Acknowledgements

I would like to thank my supervisors, Korin Richmond and Robert Clark, for their helpful comments and feedback throughout the summer, as well as for sharing their expertise on Festival with me. Mike Bennett is always happy to help with all possible Linux, Unix and network related problems in the computer lab, and for that I am grateful. Eiríkur Rögnvaldsson helped me obtain a pronunciation lexicon for Icelandic, and Skrudda Publishers provided me with the texts that made up my corpus; without these resources I would not have been able to carry out this project. Thanks also to my fellow students on the program for being good spirited and encouraging throughout the year. Finally, I would like to thank all the good people that participated in my experiment for their invaluable help.

Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text. This work has not been submitted for any other degree or professional qualification except as specified.

(Steinthor Steingrimsson)


Table of Contents

1 Introduction
1.1 Problems
1.2 Previous research
1.3 Overview and aim of the study
1.4 Structure of thesis

2 Resources
2.1 The Languages
2.2 The Synthesiser
2.3 Phone sets
2.3.1 English phone set
2.3.2 Icelandic phone set
2.3.3 The common phones
2.4 Pronunciation Lexicons and letter-to-sound rules
2.4.1 English
2.4.2 Icelandic
2.5 Corpora for text selection and experiments
2.5.1 Phonetisation
2.5.2 Text selection
2.5.3 The set of recording prompts

3 Building the synthetic voice
3.1 Recordings
3.2 Processing the waveforms
3.3 Voice definitions
3.4 The joint database

4 Evaluation of bilingual voice
4.1 Unit selection for bilingual synthesis
4.1.1 Synthesising English using English and Icelandic speech data
4.1.2 Synthesising Icelandic using Icelandic and English speech data
4.2 Perceptual evaluation
4.2.1 Icelandic
4.2.2 English

5 Discussions and Conclusion
5.1 Defining a new language
5.2 Synthesising
5.3 Unit Selection Experiment
5.4 Naturalness of Synthesis Evaluation
5.5 Future work
5.6 Conclusion

Appendix A - Phone Set

Bibliography

Chapter 1

Introduction

Unit selection speech synthesis is currently the state-of-the-art technique for synthesising speech. It uses fragments of natural speech, chosen from a recorded inventory of utterances, to produce new utterances by concatenating units that are similar at their boundaries. During the speech database creation process, each recorded utterance is split into smaller units. Many kinds of units can be used: half-phones, phones, diphones, triphones, syllables, words and even phrases. But diphones (phone-to-phone transitions) are the predominant unit of choice. Diphones are chosen because the production of each phone is affected by the neighbouring phones, and with the diphone units starting and ending in mid-phone, they incorporate most of the co-articulation and transition effects. The units to use for concatenation are selected by calculating join costs based on differences between the characteristics of candidate phones. This technique has shown its potential to exhibit greater naturalness than other current techniques.
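As a minimal illustration of the diphone units just described (the function name and naming of units are illustrative): a diphone spans from the middle of one phone to the middle of the next, so a sequence of n phones yields n-1 diphone units.

    def diphones(phones):
        """Return the diphone names covering a phone sequence."""
        return [a + "-" + b for a, b in zip(phones, phones[1:])]

    print(diphones(["h", "ae", "v", "dh", "i", "s"]))
    # -> ['h-ae', 'ae-v', 'v-dh', 'dh-i', 'i-s']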

The theoretical maximum number of diphones in a language is the number of phones defined in the language squared. English can be transcribed using 42 different phonetic symbols, giving a theoretical maximum of 1764 diphones. On the other hand, languages usually have considerably fewer diphones than the theoretical maximum, and for the English set of 42 distinctive phones, only about 1300 diphones would be needed (Huang et al. 2001, p. 807). Getting full diphone coverage when recording any given language can be a difficult task (Möbius 2001). Furthermore, if context is added, the task becomes much harder. The distribution of diphones is such that it is quite probable for a given sentence to contain at least one rare diphone (Clark et al. 2004). For a synthesiser to be able to produce natural sounding speech, a substantial amount of carefully selected data is therefore required.

While simple multilingual speech systems, having distinct databases and settings for each language, are not uncommon, a limited amount of work has been done on polyglot speech synthesis. A polyglot speech synthesis system can have many advantages, such as being able to synthesise foreign words naturally within utterances in another language, reading multilingual texts without necessitating a switch between voices, or using foreign language speech units to produce words commonly pronounced with foreign phones. Such a system might also be desirable for multinational organizations wanting to use a single voice on a telephone network used by people of different linguistic backgrounds, or for tourist information at museums, airports or other sites where it is essential to provide information in multiple languages. A polyglot voice might also share resources between languages, making the building of each of the languages more economical, or offer a wider variety of context, giving lower join costs and thus increasing the possibility of acceptable synthesis. This last point is important because, when building a polyglot voice, it is very difficult and expensive to cover each language extensively, so a trade-off is very likely, covering fewer rare diphones in each language.

1.1 Problems

For a polyglot voice for a unit selection speech synthesis system to sound as natural as possible, it has to have a full inventory of carefully selected data for each language it is supposed to be able to synthesise. The work needed to gather the data for one language thus has to be multiplied by the number of languages it is supposed to speak, in order to obtain the same quality in each language. This can be costly, and even prohibitive, as the recordings would take more time and the speaker's voice characteristics would thus be more likely to fluctuate, decreasing naturalness. Much of the data gathered for each language is for covering units which are relatively rare. As any given language contains phones also used in other languages, it is quite possible that many diphones are common to many languages. If such foreign language diphones could be used for a given target language, some of them covering the rare diphones, without decreasing the naturalness of the synthesis, a lot less recording would have to be done for each language.

For sharing units between languages to give the desired results, units having the same description in different languages would have to be close to identical, or at least very similar. A careful definition of phones, consistent between languages, is therefore essential. The phones might have to be defined quite narrowly, and possibly this would entail a substantial growth of the phone sets. All allophones may have to be accounted for, and further definitions might even be needed, such as prosodic information. If very narrow distinctions are needed, that might eliminate, or at least considerably reduce, the prospects of a polyglot voice synthesis system sharing resources being more economical than a conventional system.

1.2 Previous research

Although rather limited work has been done on multilingual voices for unit selection speech synthesis, some of the problems involved have been studied to some extent.

Researchers at Telia Research in Sweden have studied the use of foreign phones in Swedish, and its implications for speech synthesis (Eklund & Lindström 1998, 2001, Lindström & Eklund 2000). Their hypothesis was that some foreign speech sounds are commonly used in everyday Swedish by a large part of the population. To test their hypothesis, they constructed a set of sentences containing English speech sounds they judged to be possible candidates for the process; none of these sounds are normally included in descriptions of the Swedish phonological system. They recorded the sentences read out by 460 subjects, aged 15 to 75, from all around Sweden. Their research indicates that the majority of Swedish speakers add English speech sounds to their phone repertoire when pronouncing some English words within Swedish sentences. These sounds are normally not included in the description of Swedish, and they do not have a phonemic or allophonic function. The term 'xenophone' is suggested for these kinds of phones (Eklund & Lindström 1998). Two examples of this process are given in table 1.1: the xenophones occur in the words 'Jackson' in a Swedish sentence about Michael Jackson, and 'the World' in a Swedish sentence about the song 'We are the World'.

Jackson      47.8%   0.4%    51.2%
the World    38.5%   57.5%   1.3%

Table 1.1: Examples of xenophones in Swedish (the phonetic symbols labelling the columns are not recoverable in this copy).

Eklund and Lindström assert that a Swedish TTS system must be capable of producing the appropriate pronunciation of foreign names or words in running Swedish texts, as users will be less prone to accept a speech synthesis system with a lower level of competence than themselves. On the other hand, if the xenophone dimension of a synthesis system is ’maximised’, certain users might be left behind, especially in regard to languages that are not as widely known as English (Eklund & Lindström 2001).

The researchers stress that they only inspected the use of English phones in Swedish and that their results should not be 'translated' to other languages, as the inclusion of xenophones no doubt varies between languages. It is nevertheless not a far-fetched assumption that many other languages may have similar processes.

Traber et al. (1999) experimented with building a quadrilingual synthesiser, speaking German, Italian, English and French. Their approach was to integrate the system for all the languages, using shared algorithms, a shared diphone inventory and a single lexicon with language tags. To ease the lexicon acquisition process they did not opt for a phonetic alphabet that distinguishes the sounds of the languages, but retained the common transcription standards used for the individual languages in the lexica they obtained. They report that the quality of their synthesiser is not acceptable (Traber et al. 1999).

1.3 Overview and aim of the study

In the study delineated in this thesis, the possibilities of pooling together speech data from multiple languages for synthesis will be inspected. Such a shared database could well be useful for synthesising utterances including xenophones, equivalents of the sentences used in the Swedish research described above. In this project, however, the focus is on the speech sounds common to both languages. The objective is therefore to build a voice speaking more than one language using a shared inventory for identical and similar diphones. The inspiration comes from Traber et al. (1999), although the methodology will be somewhat different: in their research all possible language resources were shared, but here everything will be kept separate except for the speech data. Furthermore, the symbol sets are modified to a small degree so that phones from different languages likely to sound identical are transcribed using the same symbol, while others are not.

The aim is twofold. First, to investigate how rarely or often a unit selection engine selects foreign phones when it has the possibility of using units from both languages. Second, to investigate how using a shared phone set, sharing phones that have the same general characteristics in two languages, affects the perceived naturalness of the synthesised speech. Synthesis using the shared database will be compared to synthesis using only speech data recorded in the target language. The focus will be predominantly on rare diphones, to see whether using a multilingual database can be useful for getting better rare diphone coverage, as a rare diphone in one language might be common in another. For these purposes a voice will be built capable of synthesising speech in two languages, English and Icelandic.

If the results of the perceptual experiments favour the joint database, it would suggest that designing the speech database using methods similar to the ones used in this project could be an efficient part of a polyglot speech synthesis system.

1.4 Structure of thesis

In the thesis the process of building the voice will be outlined, as well as the experiments, and interpretation of and discussion on the results will be provided.

The thesis is divided into five chapters, the first being this introduction. The second chapter describes the resources used and the preparation work that had to be done. The third chapter describes the voice building and implications thereof. The fourth chapter illustrates the design of the experiments, presents their results and offers interpretation of the results. The fifth chapter provides discussion on the work and results described and concludes the thesis.

Chapter 2

Resources

The two languages chosen for the voice to be built in are English and Icelandic. The Festival speech synthesis system was used for synthesis, with the new multisyn engine for unit selection (Clark et al. 2004). In this chapter descriptions are given of the principal aspects of these basic components, the languages and the synthesis system. Other necessary components (phone sets, lexica and letter-to-sound rules, and corpora) will also be introduced.

2.1 The Languages

The languages were chosen for convenience: being a native Icelandic speaker and a fluent English speaker, I could record my own voice for the speech database. My level of fluency in English is reasonably high, and although my exact renditions may not sound like RP English, I was instructed to speak RP in school and do speak consistently, making the same or very similar distinctions. Icelandic is a North Germanic language spoken by around 300,000 people in Iceland, its closest relatives being Faroese and Norwegian. It is, like Russian, Latin and Ancient Greek, a highly inflected language with variable inflection patterns. Icelandic is somewhat unusual for a European language in having an aspiration contrast in its stops, rather than a voicing contrast. The Icelandic sonorants, on the other hand, all exhibit a voicing contrast. Length is also contrastive for many phonemes, including all the vowels and a few consonants (Rögnvaldsson 1989). Long consonants are never preceded by a long vowel, though, and they always stand between two short vowels. Furthermore, unlike English, stress is non-phonological in Icelandic. Icelandic is an SVO language, but the inflectional system allows for a great deal of freedom in word order.

While English has been the predominant language used in speech synthesis research, not much work has been done in terms of Icelandic speech synthesis. To the best of my knowledge, the only speech synthesiser with the capability of synthesising Icelandic was made in the early 1990s by Infovox. Not much was published on that work, and I was not able to acquire any literature on the research carried out for designing that system.

2.2 The Synthesiser

The Festival system was used, a well established speech synthesis platform, using the new multisyn unit selection engine. The algorithm it uses for unit selection is a conventional one. It predicts a target utterance structure and proposes suitable candidates from the inventory for each target unit. A Viterbi algorithm is then used to search for the best candidate sequence, the sequence with the lowest target and join costs. If a requested diphone is missing from the inventory, a back-off procedure, involving diphone substitution, is used (Clark et al. 2004).
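A compact sketch of this kind of search is given below. It is an illustration, not the multisyn implementation: target_cost and join_cost are placeholder cost functions, and the data structures are assumptions made for the example.

    def viterbi_select(targets, candidates, target_cost, join_cost):
        """targets: list of target diphone specifications.
        candidates: one list of candidate units per target."""
        # best[i][j] = (cost of best path ending in candidate j of target i, backpointer)
        best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]
        for i in range(1, len(targets)):
            row = []
            for cand in candidates[i]:
                tc = target_cost(targets[i], cand)
                prev_cost, prev_idx = min(
                    (best[i - 1][k][0] + join_cost(prev, cand), k)
                    for k, prev in enumerate(candidates[i - 1]))
                row.append((prev_cost + tc, prev_idx))
            best.append(row)
        # Trace back from the cheapest final state to recover the unit sequence.
        j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
        path = []
        for i in range(len(targets) - 1, -1, -1):
            path.append(candidates[i][j])
            j = best[i][j][1]
        return list(reversed(path))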

The target cost is the sum of a set of weighted functions, each adding a penalty if some feature of the candidate diphone does not match the target. For calculating the join cost, three sets of acoustic parameters are used: spectral discontinuities, estimated by finding the Euclidean distance between two vectors of 12 MFCCs from either side of the potential join point, and pitch and energy mismatches, also estimated by calculating Euclidean distances between the corresponding coefficients across the join (Clark et al. 2004).
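As a small numerical sketch of this join cost, assuming per-side feature records holding a 12-dimensional MFCC vector plus F0 and energy values (the weights and field names are placeholders, not the values multisyn uses):

    import numpy as np

    def join_cost(left, right, w_spec=1.0, w_f0=1.0, w_energy=1.0):
        """left/right: features of the frames on either side of a candidate join."""
        spectral = np.linalg.norm(np.asarray(left["mfcc"]) - np.asarray(right["mfcc"]))
        pitch = abs(left["f0"] - right["f0"])          # F0 mismatch across the join
        energy = abs(left["energy"] - right["energy"])  # energy mismatch across the join
        return w_spec * spectral + w_f0 * pitch + w_energy * energy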

2.3 Phone sets

In general, phones are defined only within a single language, although their representative IPA symbol is usually shared with other languages. As the aim of the research is to experiment with the prospect of sharing similar phonemes in two languages, the symbol set should use identical symbols for such phones, but different symbols for phones considered too distinct, although they might traditionally be represented with the same IPA symbols when the languages are being described on their own. Furthermore, the phonetic inventory has to be readable by Festival and by the HTK toolkit used for forced alignment, as discussed in section 3.2. Various attempts have been made to define computer readable symbol sets not just targeted at one language at a time, but designed for many languages. One notable such attempt is the Worldbet. The Worldbet was designed on the underlying principle "that any spectrally and temporally distinct speech sound (not including pitch) which is phonemic in some language should have a separate base symbol." (Hieronymus 1993)

The idea is based on the IPA alphabet, using base symbols that should be a concatenation of something representing an IPA symbol and diacritics. The benefits of such a symbol set are obvious for the purposes of multilingual speech databases, allowing for easy comparison of speech data between languages, and the possibility of sharing common data for purposes of synthesis. The disadvantages, on the other hand, include its liberal use of various ASCII symbols, such as ampersands (&), asterisks (*) and question marks (?), which are all likely to pose problems with parsing using regular expressions. Because of the accuracy of the Worldbet, it has hundreds of symbols, some only slightly different from others. It can therefore be cumbersome for humans to decipher some of the symbols made up of strings of many different characters, when one character would be sufficient when only working with one or a few languages. These disadvantages may be significant reasons why the symbol set has not become popular for general use. In spite of these disadvantages, occasional researchers have used it to describe their speech databases, e.g. Dijkstra et al. (2004).

Some of the more common machine readable phonetic alphabets include the Edinburgh Machine Readable Phonetic Alphabet (MRPA), used along with keywords in the Unisyn lexicon and post-lexical rules. That package will be discussed further in section 2.4. The Esprit Speech Assessment Methodology Phonetic Alphabet (SAMPA) was originally devised for six European languages in the late 80's (Wells 1997). Due to its initial design, using it for other multiple languages can cause collisions. An extended version, SAMPA Extended (X-SAMPA), has thus been proposed, carrying a similar concept as the Worldbet, extending the basic conventions of SAMPA so as to make provisions for every symbol on the IPA chart, making it in principle possible to produce a machine-readable phonetic transcription for all known languages (Wells 1995).

Symbol sets had to be chosen or devised that could cover the two languages and make the proper distinctions. The readily available English Unisyn resources use the MRPA symbol set, as mentioned earlier, although tools are made available to map it onto other symbol sets. The choice of a symbol set for Icelandic is largely based on the SAMPA symbols used in the Icelandic lexicon obtained. Both lexica will be discussed further in section 2.4.

The MRPA set works perfectly with the tools to be used, but some of the SAMPA characters might pose problems for Festival and/or HTK. As no information could be found on conventions for mapping non-English phonemes onto the MRPA set, a compromise between the two was decided on: mapping the SAMPA symbols used in the Icelandic lexicon onto X-SAMPA, adjusting the set to one acceptable to both HTK and Festival, and changing the symbols for the phones common to both languages to the MRPA symbols used in the English resources. Furthermore, a prefix "IS_" was added to all symbols denoting Icelandic-only phones, so they would not mistakenly be mixed with identical symbols in the English symbol set standing for a different phone. Obviously an arrangement such as this one would not be ideal for a setting of more than two or three languages, as it might become quite confusing. But it is well suited to the purposes of the work carried out and described in this thesis.
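The convention can be illustrated with a small sketch; the excerpt of the mapping is taken from the common phones listed later (table 2.3), the function name is illustrative, and the full symbol choice is given in appendix A.

    COMMON_TO_MRPA = {
        "f": "f", "v": "v", "T": "th", "D": "dh", "s": "s", "h": "h",
        "m": "m", "n": "n", "N": "ng", "l": "l",
        "i:": "ii", "A:": "aa", "u:": "uu", "O:": "oo",
    }

    def joint_symbol(xsampa, shared_with_english):
        """Map an Icelandic X-SAMPA-style symbol into the joint phone set."""
        if shared_with_english:
            return COMMON_TO_MRPA[xsampa]       # shared phone: use the MRPA symbol
        # Icelandic-only phone: prefix it so it cannot collide with an English
        # symbol that looks the same but stands for a different phone.
        return "IS_" + xsampa

    print(joint_symbol("T", shared_with_english=True))     # -> th
    print(joint_symbol("9y", shared_with_english=False))   # -> IS_9y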

2.3.1 English phone set

The Unisyn package discussed in section 2.4 contains many different dialects of English (Fitt 2000), but as the speaker was taught to speak RP English, the RP specifications were chosen as the most likely to make the same distinctions. To be able to pair the corresponding phonemes of the two languages together, I mapped the symbol sets used in the pronunciation lexicons of the two languages into the X-SAMPA symbol set, and described the phonemes the symbols stand for. The English phone set is provided in table 2.1.

The descriptions cover the following phone features, for vowels: length (short, long, diphthong, schwa), height (close, close-mid, mid, open-mid, open), frontness (front, central, back), and lip rounding (rounded, unrounded). For consonants: type (stop, affricate, fricative, nasal, lateral, approximant), place of articulation (labial, alveolar, velar, palatal, post-alveolar, labio-dental, dental, glottal), and consonant voicing (voiced, unvoiced). To map the symbol set onto the X-SAMPA set and to describe the symbols, I used descriptions and mappings in EN1 (1998), Wells (1995) and Ladefoged (2001a).

2.3.2 Icelandic phone set

The Icelandic phone set used by the lexicon is an altered version of SAMPA. I use the phone set used in the obtained lexicon described in section 2.4, but to be able to easily identify phones common to the English phone set, I mapped the symbol set onto X-SAMPA, just as I did with the English one. The Icelandic phone set is provided in table 2.2.

A brief introduction to the language was given in section 2.1. The descriptions cover the following phone features, for vowels: length (short, long, diphthong, long diphthong), height (close, close-mid, mid, open-mid, open), frontness (front, central, back), lip rounding (rounded, unrounded), and whether the vowels are lax or not. For consonants: consonant type (stop, fricative, nasal, lateral, trill), place of articulation (labial, alveolar, velar, palatal, labio-dental, dental, glottal), consonant voicing (voiced, unvoiced), and whether the consonant stops are aspirated or not. To map the symbol set onto the X-SAMPA set and to describe the symbols, I used the descriptions in Wells (1995) and Rögnvaldsson (1989, 1993). It is noteworthy that the phone set does not make distinctions between long and short consonants. This choice is based on the phone set defined in the obtained pronunciation lexicon, which was built for speech recognition, where these distinctions may not be necessary. The lexicon is discussed further in section 2.4.

mrpa   X-SAMPA   Description
p      p         unvoiced labial stop
t      t         unvoiced alveolar stop
k      k         unvoiced velar stop
b      b         voiced labial stop
d      d         voiced alveolar stop
g      g         voiced velar stop
ch     tS        unvoiced palatal affr.
jh     dZ        voiced palatal affr.
s      s         unvoiced alveolar fric.
z      z         voiced alveolar fric.
sh     S         unvoiced post-alveolar fric.
zh     Z         voiced post-alveolar fric.
f      f         unvoiced labio-dental fric.
v      v         voiced labio-dental fric.
th     T         unvoiced dental fric.
dh     D         voiced dental fric.
h      h         unvoiced glottal fric.
m      m         voiced labial nasal
m!     m=        syllab. voiced lab. nasal
n      n         voiced alveolar nasal
n!     n=        syllab. voiced alv. nasal
ng     N         voiced velar nasal
l      l         voiced alveolar lateral
lw     l         voiced alveolar lateral
l!     l=        syllab. voiced alv. lat.
r      r         voiced alveolar approx.
y      j         voiced palatal approx.
w      w         voiced labial approx.
e      e         short close-mid front unr.
a      {         short lax open front unr.
aa     A:        long open back unr.
ou     @U        diphthong
o      Q         short open back rounded
oo     O:        long open-mid back rounded
ii     i:        long close front unr.
iy     i         close front unr.
i      I         lax close-mid front unr.
@      @         schwa mid central unr.
uh     V         short open-mid back unr.
u      U         lax close-mid back rounded
uu     u:        long close back rounded
uw     M         close back unrounded
ei     eI        diphthong
ai     aI        diphthong
oi     OI        diphthong
ow     aU        diphthong
i@     I@        diphthong
@@r    3:        long open-mid central unr.
eir    e@        diphthong
ur     U@        diphthong

Table 2.1: English phone set

Lex.   X-SAMPA   Description
p      p_h       unv. lab. asp. stop
b      b_0       unv. lab. stop
t      t_h       unv. alv. asp. stop
d      d_0       unv. alv. stop
c      c_h       unv. pal. asp. stop
J_     J_0       unv. pal. stop
k      k_h       unv. velar asp. stop
g      g_0       unv. velar stop
f      f         unv. lab.-dent. fric.
v      v         voiced lab.-dent. fric.
T      T         unv. dental fric.
D      D         voiced dental fric.
s      s         unv. alveolar fric.
C      C         unv. palatal fric.
j      j\        voiced pal. fric.
x      x         unv. velar fric.
G      G         voiced velar fric.
h      h         unv. glottal fric.
m      m         voiced labial nasal
m0     m_0       unv. labial nasal
n      n         voiced alv. nasal
n0     n_0       unv. alv. nasal
JJ_    J         voiced palatal nasal
J0J_   J\_0      unv. pal. nasal
N      N         voiced velar nasal
N0     N_0       unv. velar nasal
l      l         voiced alv. lateral
l0     l_0       unv. alv. lateral
r      r         voiced alv. trill
r0     r_0       unv. alv. trill
9      9         short op.-mid front rnd.
i      i         short tense cl. front unr.
I      I         short lax cl.-mid front unr.
E      E         short op.-mid front unr.
a      A         short op. back unr.
Y      Y         short lax cl.-mid front rnd.
u      u         short tense cl. back rnd.
O:     O:        long op.-mid back rnd.
O      O         short op.-mid back rnd.
i:     i:        long tense cl. front unr.
I:     I:        long lax cl.-mid front unr.
E:     E:        long op.-mid front unr.
a:     A:        long op. back unr.
Y:     Y:        long lax cl.-mid front rnd.
9:     9:        long op. front rnd.
u:     u:        long tense cl. back rnd.
au     au        diphthong
au:    au:       long diphthong
ei     ei        diphthong
ei:    ei:       long diphthong
9y     9y        diphthong
9y:    9y:       long diphthong
ai     ai        diphthong
ai:    ai:       long diphthong
Yi     Yi        diphthong
Yi:    Yi:       long diphthong
Oi     Oi        diphthong
Oi:    Oi:       long diphthong
ou     ou        diphthong
ou:    ou:       long diphthong

Table 2.2: Icelandic phone set

2.3.3 The common phones

When building polyglot voices, the common phone sets should probably be adapted to the speaker for each language, especially if he does not have native-like competence in one or more of the languages, as he might replace some phones with others a native might not use. These phones might, on the other hand, be identical to ones he uses in another language. It is therefore likely that such an adaptation might give better results. Additionally, as the phones in the phone sets are defined to make distinctions within languages, many of them represent phonemes, not making distinctions between allophones of the same phoneme. This can make it hard to decide whether or not a phone from one language's phone set is the same as a phone in another language's phone set. In cases where two sounds can be used to distinguish words in one language, but in another they are allophones of the same phoneme, this might especially become a problem. Such scenarios are quite common in the world's languages and are discussed briefly in Ladefoged (2001b). When deciding which phones from the phone sets described above can be considered common to both sets, this must be kept in mind.

The phones having the same descriptions were chosen as initial potential candidates. A few utterances were recorded in both languages and the spectrograms then compared. The recorded utterance prompts were written so as to show the same strings of potential candidates together in both languages. Figures 2.1 and 2.2 show examples of spectrograms used for these purposes, for the English sentence 'I have this toe' and the Icelandic sentence 'Það hafðist þó':

Figure 2.1: Spectrogram of string ’h ae v dh i s t’ in English sentence.

When figure 2.1 is compared to figure 2.2, it seems that there is not a great difference between the ’h’, ’v’, ’dh’, ’i’ and ’s’ sounds in the two languages. Upon inspection, the formants are similar, indicating the sounds may be close enough. All the candidate sounds were inspected like that, and the following were considered to be the same in the two languages:

MRPA   X-SAMPA
f      f
v      v
th     T
dh     D
s      s
h      h
m      m
n      n
ng     N
l      l
iy     i
i      I
ii     i:
aa     A:
uu     u:
oo     O:

Table 2.3: Common phones

Figure 2.2: Spectrogram of string 'h IS_A v dh i s d' in Icelandic sentence.

In addition to the phones described identically, some other phones seemed to be very much alike in the two languages. The aspirated stops in Icelandic seem to sound just like the (almost) corresponding English stops. The speaker (I) also seems to pronounce the unaspirated Icelandic stops very much like he pronounces the voiced English stops. Figures 2.3 and 2.4 show parts of the English utterance 'He said: "powow Coba Cobana"' and the Icelandic utterance 'Hann býr á Kópaskeri'.

Figure 2.3: Spectrogram of string ’ow k ou b aa k’ in English sentence.

The diphthongs 'ou' and 'ow' also seem to be very similar to the Icelandic 'ou:' and 'au:', as do the stops. A speaker with native-like competence in both languages would probably make more distinction between these phones, but that does not seem to be the case here. Analysis of the phones in table 2.4 revealed them to be sufficiently similar that they could be added to the shared inventory.

Figure 2.4: Spectrogram of string ’ow k ou b aa s’ in Icelandic sentence.

MRPA   X-SAMPA (EN)   X-SAMPA (IS)
p      p              p_h
b      b              b_0
t      t              t_h
d      d              d_0
k      k              k_h
g      g              g_0
ou     @U             ou:
ow     aU             au:

Table 2.4: Similar phones

Although at a glance these phones seem to be very much alike, this part may be the most delicate part of the system. For the common set to be just right, this might need some thorough research. Such research might also reveal whether further distinctions should be made in the phone sets. For this project, though, no such thorough research was undertaken; only a preliminary inspection was carried out.

The complete phone set uses the MRPA symbols for the English and common phones, but an adaptation of X-SAMPA for the ones only in Icelandic. The final choice of symbols is illustrated and explained in appendix A.

2.4 Pronunciation Lexicons and letter-to-sound rules

Methods for English text-to-phones conversion are available with Festival. Therefore, only a lexicon and/or lts-rules had to be obtained or designed for Icelandic. When transcribing text, Festival then uses the selected pronunciation lexicon, and falls back to the lts-rules if a given word form does not have an entry in the lexicon.

2.4.1 English

The Unisyn lexicon and post-lexical rules are distributed with Festival. The Unisyn Lexicon was designed to be accent independent, and thus make synthesis of regional accent possible without defining a new lexicon for each and every accent. It provides all the features necessary to describe regional pronunciation at the segmental level. Some regional features, such as intonation or duration are not included though, as they are beyond the scope of a lexicon. The Unisyn lexicon, unilex, defines abstract units that stand for possible dialect variance, thus making it possible to derive lexica for the many dialects of English from a single lexicon (Fitt 2000).

As previously noted in section 2.1, the speaker was instructed to speak RP English in school, and thus makes the same or very close to the same distinctions. What is important here is not to define a lexicon for the exact renditions of the speaker's speech, but a fit to the distinctions he makes. As it is out of the scope of this project to investigate the exact distinctions the speaker may make, the RP lexicon was chosen as the most likely match to the speaker's pronunciation.

2.4.2 Icelandic

A pronunciation lexicon was obtained, containing just under 56 thousand entries. The lexicon was designed for a speech recognition system and contained no POS-tags or syllabification (Rögnvaldsson 2004). Furthermore, it contained multiple entries for some word forms, giving transcriptions for different dialects of Icelandic, although no tags accompanied the entries to identify the dialect.

Word              Transcription
áhyggjusvip       au:hIJ_Ysvi:b
áhætta            au:haihda
áhættan           au:haihdan
áhættu            au:haihdY
áhættuíþróttir    au:haihdYi:TrouhdIdnar
áhættuna          au:haihdYna
áhættunni         au:haihdYnI
áhættusamt        au:haihdYsam0d
áhættusamt        au:haihdYsamt

Table 2.5: Entries from the lexicon.

The first column is the word form and the second column is the SAMPA-style transcription. The lowest two entries have the same word form but are transcribed for different dialects.
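A sketch of how such a whitespace-separated lexicon can be read, with word forms that have multiple (dialect-dependent) transcriptions set aside for manual checking as described under 'Dialects' below; the file handling and names are illustrative.

    from collections import defaultdict

    def load_lexicon(path):
        entries = defaultdict(list)          # word form -> list of transcriptions
        with open(path, encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue
                word, transcription = line.split(maxsplit=1)
                entries[word].append(transcription.strip())
        return entries

    def split_duplicates(entries):
        unique = {w: t[0] for w, t in entries.items() if len(t) == 1}
        multiple = {w: t for w, t in entries.items() if len(t) > 1}
        return unique, multiple              # 'multiple' goes to a separate file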

Missing part-of-speech information

Icelandic homographs, different words or word forms with identical orthographic representation, usually have the same pronunciation within a dialect. But there are a few exceptions to this rule, necessitating POS tags in a pronunciation lexicon for speech synthesis. Common examples of this, with X-SAMPA transcriptions, are:

Halli     proper noun, nom.                 [h A l I]
Halli     proper noun, dat.                 [h A d_0 l I]

brúnni    noun, dat., w/ suffixed article   [b r u n I]
brúnni    adj., comp.                       [b r u d_0 n I]

Table 2.6: Four different words, but only two different orthographies.

As the pronunciation lexicon is missing these tags, it is unlikely to be able to produce accurate pronunciation of many utterances.

Syllabification issues

The lexicon did not contain any information on syllabification within the words. In Icelandic, syllabification is not very complicated, but needs morphological analysis. All syllables in Icelandic contain one and only one vowel. For stress, the general rule is that words have primary stress on the first syllable and secondary stress on all other odd syllables in the word, but inflectional suffixes are not stressed (Þráinsson 1995). Compound words with affixes in mid-word can have different stress patterns though, as well as function words, which are often unstressed when spoken in context.
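As a naive illustration of the stress rule quoted above, stress could be marked simply by counting vowel nuclei; this ignores the morphological analysis that correct Icelandic syllabification actually requires, the vowel list is abbreviated, and the whole function is only a rough approximation.

    ICELANDIC_NUCLEI = {"i", "I", "E", "a", "Y", "u", "O", "9",
                        "i:", "I:", "E:", "a:", "Y:", "u:", "O:", "9:",
                        "au", "ei", "ou", "ai", "9y", "Yi", "Oi",
                        "au:", "ei:", "ou:", "ai:", "9y:", "Yi:", "Oi:"}

    def mark_stress(phones):
        """Return (phone, stress) pairs; 1 = primary, 2 = secondary, 0 = none."""
        out, syllable = [], 0
        for p in phones:
            if p in ICELANDIC_NUCLEI:
                syllable += 1
                if syllable == 1:
                    out.append((p, 1))      # primary stress on the first syllable
                elif syllable % 2 == 1:
                    out.append((p, 2))      # secondary stress on other odd syllables
                else:
                    out.append((p, 0))
            else:
                out.append((p, 0))
        return out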

The work involved in implementing a parser for syllabifying Icelandic and marking stress is out of the scope of this project. When the lexicon is compiled, the Festival lexicon compiler syllabifies the Icelandic entries according to English rules. This syllabification seems to be a relatively good approximation to correct Icelandic syllabification. The compiler marks all syllables as unstressed. As there is no information on stress anyway, marking all the syllables as unstressed should not have a negative effect on the synthesis of Icelandic using that lexicon. But as will be discussed further in chapter 4, this might affect the unit selection from the multilingual speech database.

Dialects

The lexicon was designed for a speech recognition system, and is therefore supposed to cover the most common dialects. But as there are no dialect tags for words with multiple entries, the lexicon has to be modified to be useful for speech synthesis. Fortunately, Icelandic dialects differ in pronunciation in only a very limited part of the vocabulary. Therefore, moving the multiple entries to another file and examining a reasonable proportion of them, selecting the transcriptions that suit the pronunciation of the project's speaker, was an achievable task. The multiple entries were removed from the lexicon, about 3000 of them were checked, and from these 3000, about 1000 were selected as having a correct transcription of the wanted dialect. Afterward, the lexicon had about 46 thousand different word forms.

Training letter-to-sound rules

Before the first round of training letter-to-sound rules, the aforementioned 1000 dialect-specific words and transcriptions were added to the lexicon, and all abbreviations removed for the sake of coherency.

The training process described in the Festvox manual was followed. After pre-processing the lexicon, a set of allowable pairings of letters to phones was defined. Ready-made scripts were then used for the construction of probabilities of each letter/phone pair, which are used to align the letters to the corresponding set of phones, or _epsilons_ when a letter is skipped for whatever reason. The data is then extracted by letter, suitable for training, and finally CART models are built for predicting phones from letters and the letters' context (Black & Lenzo 2003).

When a held-out set of 10% of the data was tested with lts-rules trained on 90% of the lexicon's entries, the lts-rules predicted 95.692% of the data correctly.
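The held-out evaluation can be sketched as follows; train_lts and predict stand in for the Festvox CART training and prediction steps and are assumptions, not the actual tools.

    import random

    def evaluate_lts(lexicon, train_lts, predict, test_fraction=0.1, seed=0):
        """lexicon: list of (word, transcription) pairs."""
        entries = list(lexicon)
        random.Random(seed).shuffle(entries)
        n_test = int(len(entries) * test_fraction)
        test, train = entries[:n_test], entries[n_test:]
        model = train_lts(train)
        correct = sum(1 for word, ref in test if predict(model, word) == ref)
        return correct / len(test)    # the text reports 95.692% at this stage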

To try to determine whether specific types of errors were causing the bad predictions, a random sample of 1500 words was taken from the lexicon and checked to see if something could be learned from it. No specific error pattern seemed to be prominent, but the errors found in the 1500 words were fixed, and the words added to the lexicon again. Using lts-rules derived from that lexicon, a list of the 2000 most frequent words in the corpus (discussed in section 2.5) that were not in the lexicon was checked, fixed and added to the lexicon. Lts-rules were derived from that lexicon, reaching 95.765% accuracy in guessing the correct transcription in a held-out set of 10%.

The rest of the corpus word list was now transcribed. The words in the list were added to the lexicon, adding up to a pronunciation lexicon of ca. 106 thousand words, although at least 3-4% could be expected to be incorrect. This was done for convenience, to speed up the process of checking how any given word in the corpus would be transcribed, by not having to run it through Festival. The lexicon would also serve as a corpus for the experiments described in chapter 4. The lts-rules were thus never actually used in the experiments, but only to build the lexicon of all words in the corpus.

The final change to the lexicon and letter-to-sound rules was made after the recordings: all the recorded words were checked manually and those that had incorrect transcriptions were fixed. This was done so that the forced alignment process described in section 3.2 would be more effective.

The effects of missing information, especially the syllabification, are likely to give somewhat strange synthesis, as stress can change pitch, duration or intensity of a given syllable. This was found to be the case when the voice had been built and tested informally. In designing the experiment I try to minimize this effect, but this will be discussed further in chapter 4.

2.5 Corpora for text selection and experiments

Before the recordings can take place, a set of sentences with good coverage of the language's phonetic structure has to be in place. Discussion on that can for example be found in Black & Lenzo (2003). English recording prompts with reasonable diphone coverage were available to me. A set of 460 phonetically balanced sentences from the TIMIT set were used, and for increased coverage the first 340 sentences were used from a set Yoko Saikachi generated for an MSc project at the University of Edinburgh (Saikachi 2003). The 166,600 words in the lexicon were used as the basis for the experiments, as outlined in chapter 4. Not doing my own text selection for English had minimal disadvantages, the main one being that I could not control the length of the recording prompts. The length of the recording prompts is not crucial for the quality of the synthetic voice, at least not if it is reasonable with regard to the purpose of the synthesiser. On the other hand, longer sentences can be harder to record, because if the speaker makes a mistake in reading a sentence out, he has to start it all over again. The longer a sentence is, the more probable such an error is.

No corpus of Icelandic text, or phonetically balanced sentences, was available for the purposes of this project. It therefore had to be built. For that, texts in Icelandic had to be obtained. One way considered was to extract text from web sites in Icelandic and build a corpus from the extracted texts. Before opting for that I contacted a book publisher and asked for the text of published books. I was given the text of ten books about history and topics of general interest, which constituted the corpus. After expanding abbreviations, all sentences were split at commas, to make them shorter and thus easier to record. The corpus totalled 127,338 such utterances, 1,143,786 words and 90,522 different word forms.

2.5.1 Phonetisation

Having obtained text, built a corpus and expanded the abbreviations in it, the automatically trained rules discussed in section 2.4.2 were used to transcribe the corpus. Upon inspection of diphone frequencies, they were found to have the expected Zipf distribution, with the most common diphone appearing 98,548 times and 132 types appearing only once. In total there were 5,718,814 diphone tokens, and 2,271 different diphone types (out of a possible 60x59 = 3,540).
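The frequency count behind these figures amounts to counting adjacent phone pairs over the transcribed corpus; a sketch is given below, where transcribe stands in for the lexicon/LTS lookup and is an assumption.

    from collections import Counter

    def diphone_frequencies(utterances, transcribe):
        counts = Counter()
        for utterance in utterances:
            phones = transcribe(utterance)             # e.g. ['h', 'a:', 'v', ...]
            counts.update(zip(phones, phones[1:]))     # each adjacent pair is a diphone
        return counts

    # counts.most_common(1) gives the most frequent diphone type (98,548 tokens in
    # this corpus); len(counts) gives the number of diphone types (2,271 here).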


Figure 2.5: Number of Icelandic diphones in corpus, plotted against frequency. 24 Chapter 2. Resources

2.5.2 Text selection

The goal of the text selection process is to get as good diphone coverage as possible in reasonably few sentences. A simple way to build a set of recording prompts is to select a subset from a big set of sentences. For this process I set some constraints on what kind of sentences I wanted, and wrote a greedy algorithm to select the optimal set of sentences from the corpus, covering what had to be covered in as few sentences as possible.

Constraints and greedy algorithm

For easier recording, the sentence size was limited to be at least 4 words and at most 16 words. As explained before, the sentences were split in two where there were commas. The resulting sentences in the target range were found to be 59.8% (76,108 sentences) of the total, and diphone coverage to be 91.4% of distinct diphones. It should be kept in mind though that the lts-rules are error prone, and some of the diphones might actually not exist in the spoken language, but be products of a flawed rule set.

The greedy algorithm designed for this was modelled on principles a number of researchers have built upon, for example François & Boëffard (2001). It is an iterative technique to select the optimal subset of sentences from a large set. Initially, a set of diphones is defined, and a certain number of diphones of each type are wanted in the target subset. The number is based on the frequency of the diphones in the corpus. Each iteration of the algorithm selects one sentence, deemed to be the optimal one, based on a calculated prominence score. The diphone types all have a given score according to how rare they are in the corpus, with the highest score for the rarest diphones. The sum of these scores for all the diphones constituting a sentence is divided by the number of diphones in the sentence, to get the prominence score. The sum is divided by the number of diphones to try to prevent longer sentences from getting higher scores and thus being selected on the grounds of being long, instead of containing a high proportion of rare units. When the intended number of diphones of a certain type has been reached, its score is changed to zero. After each round, all sentences that got a score of zero are deleted from the list. When no sentences are left in the list, the algorithm terminates and the subset of optimal sentences is saved to a file. Table 2.7 shows the algorithm's criteria for choosing diphones based on frequency, with the minimum number of each type and the initial score given.

Frequency      Number   Score
1-10           1        16
10-100         2        8
100-1000       3        4
1000-10000     4        2
over 10000     5        1

Table 2.7: Numbers of diphones to select and initial score given, based on frequency.
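A sketch of the selection loop described above, using the bands of table 2.7; the function and variable names are illustrative and this is not the exact script used.

    def score_and_quota(freq):
        # Frequency bands from table 2.7: rarer types score higher and need fewer tokens.
        if freq <= 10:
            return 16, 1
        if freq <= 100:
            return 8, 2
        if freq <= 1000:
            return 4, 3
        if freq <= 10000:
            return 2, 4
        return 1, 5

    def select_prompts(sentences, corpus_frequencies):
        """sentences: list of (text, [diphone, ...]) pairs within the length limits."""
        score, quota = {}, {}
        for d, f in corpus_frequencies.items():
            score[d], quota[d] = score_and_quota(f)

        def prominence(item):
            # Summed diphone scores divided by sentence length, so long sentences
            # are not favoured merely for being long.
            _, diphones = item
            return sum(score.get(d, 0) for d in diphones) / len(diphones)

        selected, pool = [], list(sentences)
        while pool:
            best = max(pool, key=prominence)
            selected.append(best[0])
            pool.remove(best)
            for d in best[1]:
                if d in quota:
                    quota[d] -= 1
                    if quota[d] <= 0:
                        score[d] = 0      # type covered: contributes nothing from now on
            # Drop sentences whose diphones are all covered (score zero).
            pool = [s for s in pool if any(score.get(d, 0) > 0 for d in s[1])]
        return selected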

Although the design of the greedy algorithm used here is in principle the same as the one used in François & Boëffard (2001), their approach is somewhat different. Their aim is to get good coverage of triphone units, not diphones as is being done here. They do not aim to cover all triphones, but preprocess the corpus so as to leave the rarest triphones out (those with fewer than 10 instances in the whole corpus), thus condensing the corpus and the resulting speech database. They also do not seem to make a distinction between out-of-cover and in-cover units, as is done here by setting the score of units selected often enough to zero. They want a minimum of 10 tokens for each triphone, and do not base that on frequency. This results in about 5 hours of speech.

For the work described here more extreme restrictions on the speech database have to be made. As the time for recording speech for this project is rather limited, a set resulting in about an hour is desired. Furthermore, coverage of all rare diphones is wanted. A few runs of trial and error resulted in the weights and numbers chosen, giving an appropriate set of sentences.

2.5.3 The set of recording prompts

To fulfil the criteria set, the greedy algorithm selected 1,243 sentences, out of a total of 76,108 of the length specified. Before recording, a few of the sentences were deleted, as they contained mostly foreign words or abbreviations not expanded, and were thus incorrectly transcribed and unsuitable for recording. That left a remainder of 1,210 sentences. The statistics for the main characteristics of the prompt set can be seen in table 2.8, along with the statistics for the same characteristics in the English recording prompts.

                                          ENG      ICE
Total diphones                            33,883   31,004
Diphone types                             1,689    2,005
Frequency of most common diphone          519      438
Number of diphones occurring only once    279      351
Number of sentences                       800      1,210
Number of words                           8,021    6,757
Number of word types                      3,381    3,232

Table 2.8: Characteristics of prompt sets.

The distribution of the diphones is typical for sets like these. Although a great majority of the diphones in the corpus are present, the Zipf-like distribution has proportionately more mid-frequency diphones. This can easily be seen by comparing figures 2.6 and 2.7 to the distribution of the Icelandic corpus presented in figure 2.5.


Figure 2.6: Number of Icelandic diphones in recordings plotted against frequency.


Figure 2.7: Number of English diphones in recordings plotted against frequency.

Chapter 3

Building the synthetic voice

The voice building process is laid out in a document distributed with the multisyn unit selection engine (CST 2004). It includes recording a set of prompts designed for optimal diphone coverage. The recordings are split into files, each file containing a recording of one of the recording set prompts. MFCCs are generated for the utterances, the files are then labelled, pitchmarks are generated for the wavefiles, and the wavefiles are normalised. Then the utterance structure is built, durations are inspected for outliers, and fundamental frequency track contours, normalised coefficients for use in join costs, and LPC coefficients are generated. Finally a pause model is added, and the speech database should be ready for synthesis.

When the speech database is ready, the voice has to be defined. The lexicon and other language specific information have to be specified, setting everything in place for synthesis.

In this chapter the most relevant of the above-mentioned points will be explained, and how some of the speech data will be used jointly for both languages.


3.1 Recordings

The recordings for both languages, English and Icelandic, were carried out in 6 days and 17 sessions, in the linguistic department’s near-anechoic studio. The speaker was a 27 year old male, a native Icelandic speaker and a fluent English speaker, namely myself. The prompts specified in section 2.5.3 were used, 800 English utterances and 1210 Icelandic ones.

The prompts were read one at a time with pauses in between, recorded in 16-bit mono and sampled at 16 kHz. In order to automatically split the session files into a set of files carrying one prompt each, a 500 ms tone at 7 kHz was added in between all the prompts at recording time. A script from the multisyn package, using tools from the speech tools library, was used to split the waveforms and erase false starts. Some cleaning had to be done, as a substantial number of the files contained false starts that should have been erased, but were not. After cleaning the files, I was left with 65 minutes of English speech and 68 minutes of Icelandic speech for the database. Clark et al. (2004) argue that the ARCTIC data set for English, a set of 36 thousand phones totalling about 1.4 hours of speech, is about right for good unit selection synthesis, if prosodic context is not considered in any detail.
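The splitting step can be illustrated with a hypothetical re-implementation; this is not the multisyn script, and the frame size, band and thresholds are assumptions. The idea is to locate stretches of roughly 0.4 s or more in which the 7 kHz band dominates the spectrum, and to cut the session file between consecutive stretches.

    import wave
    import numpy as np

    def find_tone_regions(path, frame_len=0.01, min_dur=0.4, band_ratio=0.5):
        """Return (start, end) times, in seconds, of 7 kHz marker-tone regions."""
        with wave.open(path, "rb") as wf:          # assumes 16-bit mono, as recorded
            rate = wf.getframerate()
            samples = np.frombuffer(wf.readframes(wf.getnframes()), dtype=np.int16)
        n = int(rate * frame_len)
        freqs = np.fft.rfftfreq(n, 1.0 / rate)
        in_band = (freqs > 6500) & (freqs < 7500)
        tone_frames = []
        for i in range(0, len(samples) - n, n):
            spectrum = np.abs(np.fft.rfft(samples[i:i + n].astype(float))) ** 2
            tone_frames.append(spectrum[in_band].sum() > band_ratio * (spectrum.sum() + 1e-9))
        regions, start = [], None
        for idx, is_tone in enumerate(tone_frames + [False]):
            if is_tone and start is None:
                start = idx
            elif not is_tone and start is not None:
                if (idx - start) * frame_len >= min_dur:
                    regions.append((start * frame_len, idx * frame_len))
                start = None
        return regions   # cut the session file between consecutive regions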

Although the speech databases are not much more than an hour for each language, the total number of diphones is close to the 36 thousand in the ARCTIC data set. As I will use some of the 'foreign' language speech data to synthesise each of the two languages, the size of the total speech data used for each language might be very close to that of the ARCTIC set.

3.2 Processing the waveforms

Festival utterance structures need to be built for each of the utterances in the database in order for the engine to be able to synthesise speech. This process requires labels for segments, syllables, words, phrases, F0 targets, and intonation events. Ideally this should all be carefully hand labelled, but for most purposes that is impractical (Lenzo 2000). The labels are therefore obtained automatically. This process is an important one: as the essence of this method of speech synthesis is to join units, the quality of the synthesised speech will be determined largely by the quality of the segmental labels.

Techniques to do automatic segmental labelling often borrow from automatic speech recognition (ASR), as the task is essentially a simplified recognition task. In this case the outcome is known, and is used to help estimate the exact position of each speech sound within the utterance. This is thus called ’forced alignment’. Methods for doing forced alignment have involved Hidden Markov Models and Dynamic Time Warping, both common in ASR.

Although speech recognition methods are not designed to find the exact positions of phones within utterances, but rather aim to correctly predict as high a percentage of spoken words as possible, transcriptions created by these methods have been found to be consistent and reproducible. Based on that, they have been regarded as a reasonable choice (Hosom 2000).

HMM-based forced alignment is applied using HTK, the Hidden Markov Model Toolkit. The script supplied with the multisyn package trains the models iteratively, using the same phone set as used in other stages of the process. It uses the model at each stage to predict a closer model at the next by adjusting the boundaries. The tool can also predict whether vowel reduction took place for any given phone in the utterance. As it was unclear what vowel reductions might take place in Icelandic, and there are no obvious reductions in Icelandic like the reductions to schwa in English for instance, the substitution option was only used when aligning the English utterances.

The multisyn defaults parameterise the speech as 12 Mel-scale cepstral coefficients, energy, deltas and delta-deltas. A relatively short window size of 10 ms is used with a short 2 ms shift, generating more consistent boundary positions and fewer bad labelling errors than using a larger frame shift or longer window (Clark et al. 2004).

Pitchmarks are then generated for the wavefiles. To automatically generate good pitchmarks, ideal settings have to be found for the voice recorded, and some recording-specific fine tuning should be done. Good settings should give fairly good pitch marking, with most of the voiced sounds having a label for every spike in the waveform.

Automatic pitch marking is very hard to do perfectly, so a component is incorporated in the multisyn package that indicates bad marking. The component uses a normalised version of the log likelihood score for each segment and a flag indicating whether a segment is too short to have a meaningful pitch-marking (Clark et al. 2004). A casual inspection seemed to indicate that the number of bad pitch marks varied greatly between recording sessions, indicating that the settings for pitch marking might have to be fine tuned for each recording session for optimal results.

Utterance structure files are built for each utterance in the speech database, in order for Festival to be able to use them at run-time. The utterance structures contain streams of items, such as word, pos-tag and stress, and information about abnormalities of segments such as bad pitch marking or bad duration outliers. Festival loads these utterance structures in order to utilize the information therein on speech features, which allow the synthesiser to exploit the utterances for synthesising speech. Features can also be dumped from the speech database to train various models (Black & Lenzo 2003).

Other information is generated for each of the segments: segment duration information to mark outliers, fundamental frequency track contours, normalised MFCC coefficients and LPC coefficients are generated and stored on individual phones for use in calculating join costs for concatenating units.

3.3 Voice definitions

Having built the utterance structures for both languages, the last thing to do before being able to use the voice to synthesise speech, is to define the voice within the system.

The basic components for this are the phone sets and definitions of word pronunciation, either by letter-to-sound rules or by using a lexicon. The phone sets have already been defined as discussed in 2.3, and the lexicon work and acquisition process has already been discussed in some detail in 2.4. Therefore, in this step the only thing that had to be done was to compile the Icelandic lexicon for Festival to make it ready for synthesis.

Token processing rules, post-lexical rules and a phrasing method are all defined for English and available with Festival. Setting up the English voice for proper synthesis was straightforward. On the other hand, this all had to be defined for Icelandic to get a fully working synthesiser. As the experiments to be carried out would only include single words, none of these definitions were essential for the project. These components were therefore not included in the language definition. The synthesised speech suffers as a result in cases of utterances longer than one word, but for the purposes of this project that is not important, although having post-lexical rules defined might or might not have improved phone alignments in the speech database. Post-lexical rules in Icelandic are usually optional, and they are relatively few (Rögnvaldsson 1993). In spite of that, a comprehensive list of post-lexical rules could not be acquired, nor did I manage to define them myself. With that in mind, the prompts were read so as to try to minimize the effects of post-lexical rules. Unfortunately, the substitution process at the forced alignment stage does not accept substitutions other than vowel reduction (Clark et al. 2004). An effort to use substitutions as a means of getting more precise transcriptions for the speech data could thus not be made.

3.4 The joint database

The experiments planned, as described in chapter 4, require some method of pooling the common diphones from both of the languages. The common diphones are made up of the phones listed in chapter 2 as common and similar. These phones have the same symbols in both languages, while all phones not intended to be shared were given distinct symbols, as listed in Appendix A.

Table 3.1 lists the numbers of potentially common diphones, that is, units having only phones from the shared phone set, and, as a subset of that, the diphones that are actually common to both languages in the sense that both languages' recordings contain them. These are the shared units the Multisyn engine can select from to synthesise both languages.

The number of potentially common diphones is listed to show the theoretical maximum of shareable diphones from each set. This maximum could hardly ever be reached, but if data were added to the recordings of the other language, the number of diphones common in the recordings could grow slowly towards the theoretical maximum, until the other language had perfect diphone coverage.

                                      English                       Icelandic
                              Potentially   Common in       Potentially   Common in
                              common        recordings      common        recordings
Diphone types                        472          391              435          391
Diphone tokens                    13,212       11,821           10,532        9,366
Total diphones in recordings              33,883                        31,004

Table 3.1: Common diphones

For each of the languages, the files storing the utterance structures, cepstral coefficients and other information used for synthesis and unit selection were copied to a joint location. In the voice specification files, this common database was specified as the speech unit inventory, although other language specific information was kept separate. Two new voice definitions were built, one for each language, identical to the originals except that the database specification pointed to the joint inventory. This worked as intended, and Festival searched through the bilingual data to find optimal paths for synthesising utterances.
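A minimal sketch of this pooling step is given below: the per-language files are copied into one shared directory which both voice definitions then point at. The directory names and file extensions are assumptions about the on-disk layout, not the actual paths used by Multisyn; file names are assumed to be language-prefixed so that nothing clashes.

    import shutil
    from pathlib import Path

    SOURCES = [Path("voices/en_voice"), Path("voices/is_voice")]   # hypothetical locations
    JOINT = Path("voices/joint_voice")

    # utterance structures, join-cost coefficients, pitchmarks and waveforms
    for sub in ("utt", "coef", "pm", "wav"):
        (JOINT / sub).mkdir(parents=True, exist_ok=True)
        for source in SOURCES:
            for f in (source / sub).glob("*"):
                shutil.copy(f, JOINT / sub / f.name)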

Chapter 4

Evaluation of bilingual voice

The aim of the project is to try to understand the effects of using a bilingual speech database to synthesise either one of the languages in the database. If the unit selection method prefers a foreign diphone when it could use a target language diphone, it does so because the total cost (target and join costs combined) is lower. If the foreign diphone truly contains the same phones as used in the target language, this should therefore, at least as far as the technology goes, be the best sequence of units with which to build the word or utterance.

With the experiments, I want first to inspect how common it is for the unit selection engine to select foreign diphones, and second to ask people to listen to some of the words the unit selection engine prefers to synthesise with foreign units, and have them compare them to the same words synthesised using only the target language speech data.

To be better able to keep all conditions other than the foreign unit the same, to make comparison easier, and to make it more likely that participants in the perceptual experiments would focus on the difference being inspected, it was decided to use only single words instead of utterances of more than one word. This also minimizes the effects of the missing language specific definitions for Icelandic, with only the missing syllabification left to affect the outcome.


4.1 Unit selection for bilingual synthesis

For the experiments, only diphones covered by my recordings, as outlined in section 2.5.3, can be used. The first experiment to be carried out is an inspection of how frequently the unit selection engine selects foreign units. The experiment is in six parts for each of the languages. All the parts involve synthesising single words having certain characteristics. They are as follows.

1. One shared rare1 diphone in mid-word. No other diphones part of the shared inventory.

2. One shared common2 diphone in mid-word. No other diphones part of the shared inventory.

3. One shared rare diphone adjacent to a shared start or end. No other diphones part of the shared inventory.

4. One shared common diphone adjacent to a shared start or end. No other diphones part of the shared inventory.

5. One rare diphone in a word and at least two common diphones.

6. All diphones in word are common to both languages. Each word has at least three diphones.

These criteria were chosen for three main reasons: to investigate whether foreign diphones are more likely to replace rare diphones than common ones; to investigate whether reducing the number of necessary cross-language joins (by placing the shared diphone next to a shared start or end) makes it more likely for foreign units to be selected; and to investigate whether foreign units become more likely to be selected as more, or all, of the diphones in a word are shared.

1) and 2) on the one hand, and 3) and 4) on the other, contain the same contrast of rare vs. common units, and are therefore intended to provide evidence or clues about the first question. The contrast between the same groups regrouped, 1) and 2) versus 3) and 4), is meant to investigate whether foreign diphones are more likely to be selected when they only have to join a target language unit on one side and can join another unit of their own language on the other. This is tested by experimenting only with words having one shared diphone next to a start or end which is itself shared, thus making it possible for the unit selection engine to select a joining unit of the same language as the foreign diphone. The last two criteria are meant to inspect what effect it has to allow multiple, or even all, foreign diphones in a word.

1 Rare diphones are diphones that occur 5 times or less in the recordings for the language in question.
2 Common diphones are diphones that occur 50 times or more in the recordings for the language in question.

For each of the parts I inspect how many words in the lexicon meet the criteria for each group, and then how many of these words are synthesised using foreign units.
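A loose sketch of how such words might be picked out of a lexicon is given below. The frequency thresholds (5 or fewer occurrences for rare, 50 or more for common) follow the definitions used here, but the diphone notation, the data structures and the exact reading of the group criteria are simplifying assumptions.

    def diphones(phones):
        """Word-internal diphones plus the start and end diphones, with '#' marking word edges."""
        padded = ["#"] + list(phones) + ["#"]
        return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]

    def classify(phones, counts, shared):
        """Return the group number (1-6) a word falls into, or None if it fits no group."""
        dis = diphones(phones)
        shared_dis = [d for d in dis if d in shared]
        rare = [d for d in shared_dis if counts.get(d, 0) <= 5]
        common = [d for d in shared_dis if counts.get(d, 0) >= 50]

        if len(dis) >= 3 and len(shared_dis) == len(dis):
            return 6                                          # every diphone in the word is shared
        if len(rare) == 1 and len(common) >= 2:
            return 5                                          # one rare plus at least two common
        edge = {dis[0], dis[-1]}
        if len(shared_dis) == 1 and shared_dis[0] not in edge:
            return 1 if rare else 2 if common else None       # single shared diphone, mid-word
        if len(shared_dis) == 2 and any(d in edge for d in shared_dis):
            inner = next((d for d in shared_dis if d not in edge), None)
            if inner is not None and counts.get(inner, 0) <= 5:
                return 3                                      # rare shared diphone next to a shared end
            if inner is not None and counts.get(inner, 0) >= 50:
                return 4
        return None

    counts = {"#-a": 60, "a-t": 3, "t-#": 95}                 # invented frequencies
    print(classify(["a", "t"], counts, shared={"a-t", "t-#"}))   # -> 3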

This experiment is mainly to see how common it is for the unit selection system to choose units from the other language under different circumstances. It should be expected that, on the whole, it is more common for the unit selection process to select from the target language recordings, as diphones there are more likely to be adjacent to the same diphones they are adjacent to in the word being synthesised. These effects could be inspected to some extent by counting the frequency of bigrams in the database containing the diphone in question and the preceding or following one, and comparing the languages in that respect. I will make do with a more basic inspection: looking at how often the Multisyn engine selects a certain diphone from each language, and comparing that to the frequency of the diphone in question in each language's recorded speech data. The distribution of diphones in each language probably has a great effect on how often Multisyn chooses a foreign or non-foreign diphone, and the expected frequency is probably determined largely by the frequency of bigrams or trigrams of diphones. What this experiment investigates is thus only whether the usage of foreign units seems to be a common or uncommon phenomenon, and whether it seems to be more likely for rare diphones than for common diphones.

The frequency of choice from each language for each diphone in each of the six parts of the experiment will be reported. The ratio of the occurrence of the diphone in the target language, as compared to its occurrence in the database for both languages, will also be calculated. The distribution of diphones in each language changes this frequency quite a lot, and the expected frequency should be closer to the frequency of bigrams or trigrams of the diphones, although experimentation and solid data are needed to state exactly to what extent.

From the shared diphones in the recordings, sets of rare and common diphones are selected for each of the languages. Rare diphones are defined as those occurring 5 times or less in the recorded data for each language, and common diphones as those occurring 50 times or more.

For all the parts of the first experiment, words are selected from the lexica having diphones in one or both of the two sets, so as to meet the criteria for the six different groups, and they are grouped accordingly. The total numbers of diphones in the two lexica are listed in tables 4.1 and 4.2.

Total diphones in Icelandic lexicon:                 1,089,192
Total common diphones:                                 254,636
Total start/end phones:                                218,436
Total common start/end:                                131,559
Total rare diphone types from the common subset:           107
Total common diphone types from the common subset:          43

Table 4.1: Diphones in Icelandic lexicon

Total diphones in English lexicon:                   1,379,670
Total common diphones:                                 379,205
Total start/end phones:                                333,200
Total common start/end:                                228,957
Total rare diphone types from the common subset:            98
Total common diphone types from the common subset:          54

Table 4.2: Diphones in English lexicon

Before the unit selection could be inspected for each of the groups, words had to be selected from the lexicon meeting the criteria of one of the six defined groups. The English lexicon has about 167,000 entries, and if a word has many derived forms in the lexicon, only the first one (in alphabetical order) is selected for use in the experiment. The total number of words in the English lexicon meeting the criteria of one of the six groups is 11,325, or 6.8% of the words in the lexicon. The Icelandic lexicon has about 109,000 entries, and the total number of words there meeting the criteria of one of the six groups is 8,601, or 7.9% of the lexicon's total. It should be noted that a great majority of the words in each lexicon contain at least one common diphone, but as words containing diphones falling between the rare and common groups are not included, the biggest part is eliminated.

The number of words in each group is actually somewhat lower than this, because some of the words could not be synthesised due to missing diphones. As the diphones were supposed to be selected for the experiment based on availability, this may seem rather odd. On the other hand, automatic processes during voice building are bound to affect the inventory in some ways. During the forced alignment, silences are added between words wherever the speaker pauses, and some of the vowels (in the English recordings) may have been reduced. Furthermore, some sounds may have been omitted by the speaker. This situation is thus not entirely unanticipated.

Group                    English   Icelandic
1) rare, mid-word             48          62
2) common, mid-word          718        1108
3) rare + start/end          189         138
4) common + start/end       3211        2173

Table 4.3: Number of words synthesised in the first four groups

The experiments were run by synthesising the word lists meeting the criteria for each group, and saving the utterance structures for the synthesised words in a file. A script was then run on the files, checking for the common diphones and whether they were selected from files containing Icelandic or English data. Matrices with some discussion are provided below. Some of the words in the word lists contained diphones not covered by the inventory of the target language; they were therefore omitted.

Some words in group 5 and all words in group 6 had the potential to be synthesised entirely with foreign units. Table 4.4 provides information on the division of words with regard to the original language of the units used.

                        English Words                                Icelandic Words
          All Engl.  All Icel.  Mix       Total        All Engl.  All Icel.  Mix       Total
          Diph.      Diph.      of Both   Words        Diph.      Diph.      of Both   Words
Group 5       1,337         15     2,807    4,159              8      1,863      2,069    3,940
Group 6       1,453         26     1,411    2,890             15        640        445    1,100

Table 4.4: Language units used for synthesising words in group 5 and 6.

Of the 6 groups, the sixth was the only one for which all the words could be synthesised, with no missing diphones. This was true for both languages, suggesting that it is not a coincidence. The sixth group differs from the others in having words made up solely of diphones from the shared inventory. This might indicate that foreign units are used in cases of missing target-language units, but as only a tiny minority of words are synthesised with foreign units alone, the foreign units might mostly be used when no or very few other options are left. In the next two sections this will be studied further.

4.1.1 Synthesising English using English and Icelandic speech data.

After running all the words in all the groups through the synthesiser, the utterance structure generated for each word was used to extract the information on which utterance files were used for each segment. That information was then used to determine in which language the recording of the segment was made. Table 4.5, a summary of the English diphone tables3, provides information on which language the common diphones used to synthesise the words came from, i.e. whether English or Icelandic units were generally selected.

3Available on-line at http://www.ling.ed.ac.uk/~s0344328/unitsSelected.htm

Gr.     EN   0 ad.  1 ad.  2 ad.     IS   0 ad.  1 ad.  2 ad.  EN/Tot  IsRec  EnRec   %Eng
E1      33      19      9      5     15     15      0      0   68.8%    682    106  13.5%
E2     702     159    462     81     16     16      0      0   97.8%   2369   6593  73.6%
E3     111      70     38      3     78     33     45      0   58.7%    759    139  15.5%
E4    3104      81   2212    811    107     18     89      0   96.7%   2404   7212  75.0%
E5   11783    3036   6742   2005   4231   1997   2095    139   73.6%   8608  10901  55.9%
E6   10255    1829   6473   1953   2268    762   1364    142   81.9%   8108  10829  57.2%

Table 4.5: Common diphones selected in English word unit selection experiments.

The table's headings stand for: Gr. = Group; 0 ad. = no units from the same utterance immediately adjacent to the diphone; 1 ad. = one unit from the same utterance adjacent to the diphone; 2 ad. = both adjacent diphones from the same source utterance as the diphone in question; EN/Tot = English diphones selected divided by total diphones selected; IsRec = number of Icelandic diphones in the inventory of the same type as the diphones in the group; EnRec = number of English diphones in the inventory of the same type as the diphones in the group; %Eng = English diphones as a percentage of the total diphones of that type in the inventory; E1-E6 are the 6 groups of English words.

In table 4.5 it is apparent that even when the shared diphones are rare (E1 and E3), the majority of diphones selected by the unit selection engine come from the target language (English) recordings. This also applies to diphones that are not part of clusters of two or more diphones taken from the same utterance in the speech database. It is hard to say from this kind of experimentation alone what this implies, although if the phones were really the same in both languages, the selection ratio should be close to the language ratio for these diphones in the common inventory. The phones defined to be 'identical' or 'similar' in the two languages may in fact differ in some way, perhaps in terms of aspiration or other phonological features that are contrastive in one language but not in the other. This could very possibly affect join costs, and therefore reduce the chance of these units being selected. Although this is likely to be a real effect, only further tests can show that, and other factors might also be at work.

The common diphones should be more likely to be selected from the target language recordings. We can see that the diphones studied in E2 and E4 in the table are about three times as common for English as they are for Icelandic in the speech database, but the unit selection engine chooses the English ones almost all the time. Even if the English diphones that stand next to two adjacent units from the same source utterance (the '2 ad.' column) are not counted, so that the zero cells in the Icelandic data cancel out the equivalent cells in the English data, and the ratio in the speech database is taken into account, it still seems more likely for the target language to be selected.

Gr.     EN   0 ad.  1 ad.  2 ad.  Icel.   0 ad.  1 ad.  2 ad.  IS/Tot  IsRec  EnRec   %Ice
I1       8       8      0      0     54      28     23      3   87.1%    119    508  19.0%
I2       1       1      0      0   1107     106    848    153   99.9%   4019   2620  60.5%
I3      48      32     16      0     90      36     43     11   65.2%    178    887  16.7%
I4      32       1     31      0   2138      66   1138    934   98.5%   4519   2695  62.6%
I5    2442    1503    914     25  13206    3531   7482   2193   84.4%   8883  11472  43.6%
I6     596     136    412     48   3662     592   2382    688   86.0%   8062   8926  47.5%

Table 4.6: Common diphones selected in Icelandic word unit selection experiments.

The last two groups, E5 and E6, are a mixture of rare and common diphones, and the statistics seem to reflect that. The E6 group is made up solely of words having only shared diphones, although some such words are also in E5. This gives the unit selection engine the possibility of selecting units entirely from the non-target language. In spite of that, and the fact that the corresponding diphones in the database have a ratio not too far from 50/50, the target language phones are much more likely to be selected: only 26 words out of 2,890 are synthesised with Icelandic units only, while 1,453 words use only English units.

4.1.2 Synthesising Icelandic using Icelandic and English speech data.

Table 4.6 is the table for unit selections in Icelandic words, corresponding to table 4.5 for the English data. The explanations below table 4.5 also explain the headers for this table.

Table 4.6 indicates that the unit selection for Icelandic shows similar trends to those for English, although they seem to be even more decisive. The rare Icelandic diphones are not as common in the English speech data as the rare English diphones are in the Icelandic data, but the ratio of common Icelandic diphones in the shared speech inventory is lower than the ratio of the common English diphones. In spite of that, all the groups give more decisive results for Icelandic than they do for English: while 106 common Icelandic diphones are selected to join two other units from different utterances, only 1 English diphone is, about a tenth of the corresponding figure in the English experiment. As with these tendencies in English, it is hard to say why the effect is more extreme for Icelandic. There might be less variance in the speech sounds in Icelandic, making most of the English units differ too much from the Icelandic ones.

Many other factors might also be at work here. It is not really possible to state the reason for this variance based on this somewhat primitive experiment. What can be stated is that for the voice used in the experiment, the unit selection engine is not very likely to select foreign units when synthesising either of the two languages. Discussion of whether this supports any more general conclusion is given in section 5.3.

The extremes of these selections are partly due to the difference in diphone frequency between the languages: the more commonplace the units are in a given language, the more likely they are to fit the exact criteria of a certain context. If the other language has an equal number of these phones and is still unlikely to be selected, it is likely that the phones are in general somehow different in that language. The reason for this difference might simply be that it is not the same speech sound, or that context affects speech in such a way that not only the adjacent phones but also the phones adjacent to them affect the production of a phone. An aspect of the speech database highly likely to influence the selection is the stress marking. The Icelandic data is all marked unstressed, while the English data has proper stress distinctions. The target cost calculated by Festival includes a stress cost among other factors, and as the default settings for target costs were used, stress carries a weight of 10%. This means that only unstressed vowels get proper stress costs calculated, giving all other foreign phones wrong target costs and making them look more expensive than they really are. This may explain the apparent tendency for Icelandic synthesis to be far more likely to select Icelandic diphones than English ones: when synthesising English, Icelandic units that are actually stressed but marked unstressed get too low a target cost for unstressed targets, whereas when synthesising Icelandic, all English units except the unstressed ones always receive a high stress cost, whether that is right or not. This might also affect the perception of the synthesised speech. As part-of-speech information is also missing for Icelandic, and the syllabification is not always correct, the target cost might be skewed even further. The POS cost is weighted at 6% and the position-in-syllable cost at 5%. Other weights in the target cost either concern utterances longer than one word, or factors that are likely to be language independent, such as the bad pitch cost and the position-in-word cost. Possibly, disabling target costs might give better results when using the shared inventory with this voice.
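To make the stress problem concrete, the sketch below computes a toy target cost as a weighted sum of feature mismatches, using the weights mentioned above (stress 10%, POS 6%, position in syllable 5%). The 0/1 penalties, the feature names and the example values are assumptions; Multisyn's real target cost uses more features and its own scoring.

    # toy target cost: weighted 0/1 mismatches between target and candidate features
    WEIGHTS = {"stress": 0.10, "pos": 0.06, "syl_position": 0.05}

    def target_cost(target, candidate):
        return sum(w * (target.get(f) != candidate.get(f)) for f, w in WEIGHTS.items())

    # an Icelandic candidate is always marked unstressed and has no POS or syllable
    # information, so it looks expensive for any stressed English target
    target_stressed = {"stress": 1, "pos": "n", "syl_position": "onset"}
    icelandic_cand  = {"stress": 0, "pos": None, "syl_position": None}
    english_cand    = {"stress": 1, "pos": "n", "syl_position": "onset"}
    print(f"{target_cost(target_stressed, icelandic_cand):.2f}")   # 0.21: penalised on all three features
    print(f"{target_cost(target_stressed, english_cand):.2f}")     # 0.00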

4.2 Perceptual evaluation

Having inspected the frequency of usage of foreign language diphones, testing to see whether there is any difference in the naturalness of synthesis depending on the usage of foreign diphones is a logical next step. A perceptual experiment was set up on the web, in two parts, one for each language.4

Participants were asked to listen to word pairs and to evaluate each instance of a word based on how natural it sounded to them. The scale was of each subject's own choosing, so as not to constrain their evaluation. Each pair consisted of one word synthesised using only target language units, while the other was synthesised using the shared inventory and had at least one foreign diphone, depending on which group of words it belonged to. Four groups of words were defined, and for each language I tested six word pairs from each group. The groups were defined as follows:

1. One word contains one foreign diphone in the middle of the word (the diphone has two joins). All other units in the word are exactly the same as in the non-foreign word. Both words have the same number of joins, in the same places.

2. One word contains one foreign diphone, which joins to an end phone also in the foreign language. Other things are the same as in 1).

4The front page of the experiment with instructions for subjects and examples as they appeared for the English native speakers, is available on-line at http://www.ling.ed.ac.uk/~s0344328/experiment/

3. One word contains more than one foreign diphone. The non-foreign word has joins in all the same places though.

4. One of the two words is synthesised with all-foreign units. The other has none. (Didn’t look at joins here.)

The idea is that each group has more foreign diphones than the preceding group. The first two groups are quite similar though, both having one foreign diphone from the shared inventory; but while the diphones in the first group are in the middle of the words and have two joins, the diphones in the second group stand next to an end unit from the same language, so the foreign diphone only joins the target language in one place.

Both words in each pair being evaluated should have joins in the same places, so that the conditions are as similar as possible and the difference between the words in each pair can most likely be attributed to the difference between using a foreign or a non-foreign diphone. The words were selected randomly from a pool of words meeting each group's criteria, with the additional criterion that the word would have joins in exactly the same places when synthesised using only the target language speech data.

Each participant is asked to mark each pair twice in the test, so that they can be tested for reliability. When the web page was set up, the order of the pairs was randomized, as was the order within pairs. Having done that, no further randomization was done and all participants listened to the pairs in the same order. Each page had six word pairs, with eight pages in total. If both instances of the same pair came up on the same page, one was moved to the previous or next page. Native speakers of each language were asked to evaluate the pairs. With few exceptions, the English native speakers were recruited from the staff and students of the Linguistics, Psychology and Informatics departments at the University of Edinburgh. To evaluate the Icelandic word pairs, friends and relatives of this work's author were recruited.
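The ordering procedure can be sketched as follows. The pair names, the page size of six and the strategy for moving a clashing instance to a neighbouring page are illustrative assumptions; in the real experiment this order was fixed once when the pages were built and served identically to every participant.

    import random

    random.seed(0)                                            # the order was fixed once for everyone

    pairs = [f"pair{i:02d}" for i in range(1, 25)]            # 6 word pairs x 4 groups
    instances = [(p, random.choice(["O-first", "B-first"]))   # randomised order within the pair
                 for p in pairs for _ in range(2)]            # each pair is judged twice
    random.shuffle(instances)                                 # randomised order of presentation

    pages = [instances[i:i + 6] for i in range(0, len(instances), 6)]

    # if both instances of a pair land on the same page, move one to a neighbouring page
    for i, page in enumerate(pages):
        names = [name for name, _ in page]
        for item in list(page):
            if names.count(item[0]) > 1:
                neighbour = pages[i + 1] if i + 1 < len(pages) else pages[i - 1]
                page.remove(item)
                neighbour.append(item)
                names.remove(item[0])

    for number, page in enumerate(pages, 1):
        print(number, [name for name, _ in page])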

35 subjects participated in the evaluation of the Icelandic pairs, while 17 participated in the evaluation of the English pairs. I treated the data for the two languages separately, as separate data sets were being tested. For both languages, the subjects were initially tested for internal reliability. I used a model of reliability called 'strict parallel', which assumes that all items have equal variances and equal error variances across replications, and also assumes equal means across items (LEA 2002). This gives a value called the estimated reliability of scale (ERS), which is similar to Cronbach's Alpha, another statistic for estimating a test's reliability, with values in the range 0.0 to 1.0. I accepted all participants who had a high value for the estimated reliability of scale, and decided that the margin for a value to be considered high enough would be 0.75, based on what values of Cronbach's Alpha are considered high (Coolican 1999, p. 171). The participants that did not reach a high enough value were rejected.
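SPSS's 'strict parallel' model is not reproduced here, but a comparable per-participant consistency check can be sketched with Cronbach's Alpha over the repeated judgements, which the text treats as the analogous statistic. The data layout (one row per word pair, one column per replication) and the example scores are assumptions.

    import numpy as np

    def cronbach_alpha(scores):
        """scores: rows = word-pair instances, columns = replications of the judgement."""
        scores = np.asarray(scores, dtype=float)
        k = scores.shape[1]                         # number of replications (two here)
        item_vars = scores.var(axis=0, ddof=1)      # variance of each replication column
        total_var = scores.sum(axis=1).var(ddof=1)  # variance of the summed judgements
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    # one participant's scores for the same 24 pairs, judged twice (invented values)
    first_pass  = np.array([3, 5, 2, 4, 4, 1, 5, 3, 2, 4, 5, 1, 3, 4, 2, 5, 1, 3, 4, 2, 5, 3, 4, 1])
    second_pass = np.array([3, 4, 2, 4, 5, 1, 5, 3, 2, 4, 4, 2, 3, 4, 2, 5, 1, 3, 4, 2, 5, 3, 5, 1])
    alpha = cronbach_alpha(np.column_stack([first_pass, second_pass]))
    print(f"alpha = {alpha:.2f}   accepted: {alpha >= 0.75}")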

Working only with the subset of reliable subjects, I wanted to test two hypotheses. I was mainly interested in whether the type of speech database affected the score given by the participants, that is, whether using target language speech only or using the shared speech data would give better results than the other. Furthermore, I wanted to see whether the score was affected by which of the four groups the words were in. If that were the case, a trend test could be informative about the direction of the effect.5 The raw data for both languages, and the reliability scores for each subject, are in Appendix D.

As the judgements made by participants vary considerably, with each participant making up his or her own scale, each participant's data was reduced to ordinal level, starting with 1 for the smallest value. This is recommended for data gathered on an unstandardised, invented scale of human judgement (Coolican 1999, p. 224). The median value was then taken for each group and each type of speech database being tested, for each of the subjects. Thus, only 8 values were used for each subject, one for each of the conditions tested. With these data, testing commenced.
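A small sketch of this reduction is shown below: one participant's raw scores are converted to ranks and a single median is kept for each combination of speech database type (O = target language only, B = both) and word group. The raw scores and the labelling of judgements are invented for illustration.

    from statistics import median
    from scipy.stats import rankdata

    raw_scores = [70, 55, 80, 40, 65, 90, 30, 85]          # one participant, 8 judgements
    labels = [("O", 1), ("O", 1), ("B", 1), ("B", 1),      # (database type, group) per judgement
              ("O", 2), ("O", 2), ("B", 2), ("B", 2)]

    ranks = rankdata(raw_scores)                           # ordinal level, 1 for the smallest

    reduced = {}
    for (db, group), r in zip(labels, ranks):
        reduced.setdefault((db, group), []).append(r)
    reduced = {condition: median(values) for condition, values in reduced.items()}
    print(reduced)                                         # one median rank per condition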

4.2.1 Icelandic

The reliability tests indicated that 5 of the 35 participants were not reliable enough. Therefore the data of 30 participants were used.

5The raw data for both languages, and the reliability scores for each subject, can be found on-line: http://www.ling.ed.ac.uk/~s0344328/experiment/data.html

As already explained, the first test investigates the difference, if any, between using only target language speech for synthesising and using both languages. For each group I have one value for each of the two speech database types from each subject, i.e. four pairs of values per subject in total. Statistical testing will be carried out for each group separately, and for all the groups as a whole. I start with the separate groups.

The hypotheses are always the same for the five tests to be carried out:

H0: The independent variable, speech database type, does not affect the score.

H1: The independent variable, speech database type, affects the score.

To be able to reject H0, the significance test should return a value of p<0.05; the test is two-tailed. This is a related design, as the word pairs are matched. The independent variable is the speech database type: only target language (O) or both languages (B). The Wilcoxon matched pairs signed ranks test was used. It looks at the differences between paired values, the direction of those differences, and the rank of these differences relative to other differences. It adds up the ranks of the positive and negative differences and tells us, in effect, how unlikely we are to get such a low rank total for one group (Coolican 1999, p. 317). It thus assesses whether the differences between ranks are small enough to ignore. The test is well suited to a design such as the one described above.
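Each per-group comparison can be reproduced in outline with a standard implementation of the test, as sketched below; the paired median ranks here are invented, and while the thesis reports SPSS Z statistics, SciPy reports the W statistic and the p-value instead.

    from scipy.stats import wilcoxon

    # one median rank per subject for each database type, in matched (paired) order
    scores_O = [6.0, 5.5, 7.0, 4.0, 6.5, 5.0, 7.5, 6.0, 5.5, 6.0]
    scores_B = [5.0, 5.0, 6.0, 3.5, 6.0, 4.0, 7.0, 5.0, 5.0, 5.5]

    stat, p = wilcoxon(scores_O, scores_B)                 # two-tailed by default
    print(f"W = {stat}, p = {p:.3f}")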

Group 1) One word contains one "foreign" diphone in the middle of the word

There are 30 values of the dependent variable (one per subject) for each level of the independent variable. Initially all 60 values are ranked together. Then a Wilcoxon test is run on the data.

The Wilcoxon matched pairs signed ranks test does not indicate any significant difference between the perception of speech based on the inventory used for synthesising, in this group (Z = -0.979, p=0.328). The H0 can therefore not be rejected.

B has a lower rank than O:   9 times
B is higher than O:          4 times
The ranks are tied:         17 times

Table 4.7: Group 1: Differences between ranks of independent variable

Group 2) One word contains one "foreign" diphone, which joins to an end phone also in the foreign language. Other things are the same as in 1).

There are 30 values of the dependent variable (one per subject) for each level of the independent variable. I start with ranking all 60 values together. Then I run the Wilcoxon test on the data.

The Wilcoxon matched pairs signed ranks test indicates that there is a significant difference between the perception of speech in this group, based on which speech database is used (Z = -3.782, p<0.001). The H0 is therefore rejected.

B has a lower rank than O:  21 times
B is higher than O:          1 time
The ranks are tied:          8 times

Table 4.8: Group 2: Differences between ranks of independent variable

Group 3) One word contains more than one foreign diphone. The non-foreign word has joins in all the same places though.

There are 30 values of the dependent variable (one per subject) for each level of the independent variable. I start with ranking all 60 values together. Then I run the Wilcoxon test on the data.

The Wilcoxon matched pairs signed ranks test indicates that there is a significant difference between the perception of speech in this group, based on which speech database is used (Z = -4.459, p<0.001). The H0 is therefore rejected.

B has a lower rank than O:  26 times
B is higher than O:         never
The ranks are tied:          4 times

Table 4.9: Group 3: Differences between ranks of independent variable

Group 4) One of the two words is synthesised with all-foreign units. The other has none. (Didn’t look at joins here.)

There are 30 values of the dependent variable (one per subject) for each level of the independent variable. I start with ranking all 60 values together. Then I run the Wilcoxon test on the data.

The Wilcoxon matched pairs signed ranks test indicates that there is a significant difference between the perception of speech in this group, based on which speech database is used (Z = -4.660, p<0.001). The H0 is therefore rejected.

B has a lower rank than O:  28 times
B is higher than O:          1 time
The ranks are tied:          1 time

Table 4.10: Group 4: Differences between ranks of independent variable

Finally I run the Wilcoxon test on the data from all the groups together.

For the 30 subjects and all the groups, we compare 30 x 4 = 120 instances of each level of the independent variable to the other. All 240 values (120 for each of the two levels) are ranked together.

The Wilcoxon matched pairs signed ranks test indicates that there is a significant difference between the perception of the speech data based on which speech database is used (Z = -7.732, p<0.001). The H0 is therefore rejected.

B has a lower rank than O:  84 times
B is higher than O:          6 times
The ranks are tied:         30 times

Table 4.11: All groups: Differences between ranks of independent variable

These tests indicate that using foreign phones gives significantly less natural speech for all groups except the one where there should be the least difference between the two sets. This might imply that the more foreign diphones a synthesised word has, the less natural it becomes. To find out whether that is likely to be true, all the groups are tested to see if they are significantly different. If they are, I will proceed to test whether there is a significant trend from the least different to the most different.

The groups are different, with different words. As there is no one-to-one correspondence between groups, this is an unrelated design. The independent variable is the type of group, and as there are four types of groups, it has 4 levels. To see if the groups differ significantly from one another I thus use the Kruskal-Wallis test. The Kruskal-Wallis test tells us whether three or more ranked samples of data differ significantly among themselves. It returns the probability of all samples coming from an identical population; if that probability is lower than the usual alpha level of p<0.05, we can reject the null hypothesis (Coolican 1999, p. 381).

The hypotheses are as follows:

H0: The independent variable does not affect the score, the distributions are identical.

H1: The independent variable, group type, affects the score.

For this test, I do not mix the scores for the two different databases, but test them separately, in order to keep as many variables as possible stable other than the one I actually want to test. A test indicating that the O database scores differ significantly between groups would indicate a flaw in my data, as all those words should come from the same population and have the same characteristics. The data from the B speech database, on the other hand, might or might not differ significantly between groups.

Before I run the test, I rank all scores in all the conditions, starting with one for the smallest. For each test I have 120 values, 30 for each group. That is 1 value from each subject for each group.
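In outline, this between-groups comparison can be run as sketched below with a standard Kruskal-Wallis implementation, which ranks the pooled scores internally; the per-group values here are invented and much shorter than the real 30 per group.

    from scipy.stats import kruskal

    group1 = [8, 7, 8, 6, 7, 8, 5, 7, 6, 8]
    group2 = [4, 3, 5, 4, 2, 3, 4, 5, 3, 4]
    group3 = [5, 6, 4, 5, 6, 5, 4, 6, 5, 5]
    group4 = [3, 2, 4, 3, 2, 3, 2, 4, 3, 2]

    h, p = kruskal(group1, group2, group3, group4)
    print(f"H = {h:.3f}, p = {p:.4f}")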

I first tested the O data, and with mean ranks 84.35, 22.15, 53.02 and 82.48, the Kruskal-Wallis test indicated a significant difference between groups (H (Chi-square) = 63.954, df=3 and p<0.001). This probably indicates that the random sample of words for each group is not big enough, and is therefore possibly skewing the results.

I rank the 120 values of the B data, starting with one for the smallest, and run the Kruskal-Wallis test on that. If these results are significant, I will run a trend test on both this and the O data, to see if there is a trend based on the number of foreign diphones in a group.

The Kruskal-Wallis test does indeed indicate a significant difference between the groups, with mean ranks of 103.98, 37.90, 64.48 and 35.63 (H (Chi-square) = 75.293, df=3 and p<0.001).

As the differences between groups are significant for both types of databases, I run a trend test for both of them. If the trend test predicts the same trend for both kinds of speech database usage, nothing can be deduced from the results other than that the sample of words used in the experiment is flawed, and another experiment would have to be run to learn anything about the differences between groups. If only the B database shows the predicted trend, I can assume that, although the words used for the experiment are not ideal, the results do indeed indicate something meaningful. The means for the O data do not seem to indicate a trend, while the means for the B data might, although the means for groups 2 and 3 do not follow the predicted order. To be on the safe side I run tests for both.

The Jonckheere trend test is designed for inspecting trends, and makes the same kind of assumptions about the data as the Kruskal-Wallis test does. I use it and order the data based on how many foreign diphones are in the words, predicting the highest score for Group 1 and the lowest for Group 4.
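SciPy has no built-in Jonckheere-Terpstra test, so the sketch below only computes the statistic by hand: the number of cross-group pairs consistent with the predicted ordering, summed over all ordered pairs of groups (ties are simply ignored here, and no p-value is derived). The group data are invented; the thesis used SPSS for the actual test.

    from itertools import combinations

    def jonckheere_statistic(groups):
        """groups: samples ordered by the predicted trend (here: decreasing naturalness score)."""
        j = 0
        for higher, lower in combinations(groups, 2):
            j += sum(x > y for x in higher for y in lower)   # pairs agreeing with the trend
        return j

    # median ranks per subject for the four word groups, ordered Group 1 .. Group 4
    group1 = [8, 7, 8, 6, 7]
    group2 = [6, 5, 6, 5, 4]
    group3 = [5, 4, 5, 3, 4]
    group4 = [3, 2, 3, 2, 1]
    print(jonckheere_statistic([group1, group2, group3, group4]))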

The hypotheses are thus:

H0: There is not a trend like the one described in H1.

H1: Group 1 > Group 2 > Group 3 > Group 4.

The Jonckheere-Terpstra test indicates that for the B data there is a trend: the higher the number of foreign diphones in a word (in the order: 1M, 1ME, MM, All), the lower the rating for naturalness becomes (J-T = -6.240, p<0.001).

When I run the test on the O data, the Jonckheere-Terpstra test does not indicate any trend (J-T = 1.053, p=0.292). That should have been expected, as those data should have similar characteristics in all groups. Although the Kruskal-Wallis test indicates that the characteristics are not similar, the differences between groups seem to be random.

The results of the experiments on the Icelandic data suggest that there is a relation between the number of foreign diphones in a word synthesised with Festival using the Multisyn unit selection engine and how natural the word is judged to be. If the English data suggest the same, it is likely this holds for systems based on at least some other pairs of languages. Further discussion of this is provided in section 5.4.

4.2.2 English

Unfortunately, only 17 subjects participated in the English language synthesiser experiment. Even more unfortunately, 10 of these people failed the reliability test. Therefore, only 7 people's marks can be used to run the statistical tests for this part of the experiment. Although the subjects are not many, statistical tests can still be run on the data to determine whether they suggest significant differences between groups. Considering that the scores are based on the judgements of so few individuals, the results from the statistical tests should be taken with some caution. They will therefore only be viewed as a rough indicator of whether the different unit inventories seem to have a similar effect on English speech synthesis as they do on Icelandic.

The data have the same characteristics as the data described in sections 4.2 and 4.2.1, so the same tests are used. The first set of tests uses the Wilcoxon test to investigate whether the independent variable, speech database type, has any effect on the score in each of the four word groups. The test is additionally used to investigate whether it affects all the groups as a whole, just as was done with the Icelandic test set. The hypotheses are the same as before:

H0: The independent variable, speech database type, does not affect the score.

H1: The independent variable, speech database type, affects the score.

Group 1) One word contains one "foreign" diphone in the middle of the word

There are 7 values of the dependent variable (one per subject) for each level of the independent variable. I start with ranking all 14 values together. Then I run the Wilcoxon test on the data.

The Wilcoxon matched pairs signed ranks test indicates a significant difference between the perception of speech based on the inventory used for synthesising, in this group (Z = -2.032, p=0.042). The H0 can therefore be rejected.

Group 2) One word contains one "foreign" diphone, which joins to an end phone also in the foreign language. Other things are the same as in 1).

The Wilcoxon matched pairs signed ranks test does not indicate a significant difference between the perception of speech based on the inventory used for synthesising, in this group (Z = -0.406, p=0.684). The H0 can therefore not be rejected.

B has a lower rank than O:   5 times
B is higher than O:         never
The ranks are tied:          2 times

Table 4.12: Group 1: Differences between ranks of independent variable

B has a lower rank than O:   2 times
B is higher than O:          3 times
The ranks are tied:          2 times

Table 4.13: Group 2: Differences between ranks of independent variable

Group 3) One word contains more than one foreign diphone. The non-foreign word has joins in all the same places though.

The Wilcoxon matched pairs signed ranks test indicates a significant difference between the perception of speech based on the inventory used for synthesising, in this group (Z = -2.207, p=0.027). The H0 can therefore be rejected.

Group 4) One of the two words is synthesised with all-foreign units. The other has none. (Didn’t look at joins here.)

The Wilcoxon matched pairs signed ranks test does not indicate a significant difference between the perception of speech based on the inventory used for synthesising, in this group (Z = -0.734, p=0.463). The H0 can therefore not be rejected.

Finally I run the Wilcoxon test on the data from all the groups together. For the 7 subjects and all the groups, we compare 7 x 4 = 28 instances of each level of the independent variable to the other. All 56 values (28 for each of the two levels) are ranked together.

B has a lower rank than O:   6 times
B is higher than O:         never
The ranks are tied:          1 time

Table 4.14: Group 3: Differences between ranks of independent variable

B has a lower rank than O:   4 times
B is higher than O:          2 times
The ranks are tied:          1 time

Table 4.15: Group 4: Differences between ranks of independent variable

B has a lower rank than O:  17 times
B is higher than O:          5 times
The ranks are tied:          6 times

Table 4.16: All groups: Differences between ranks of independent variable

The Wilcoxon matched pairs signed ranks test indicates that there is a significant difference between the perception of the speech data based on which speech database is used (Z = -3.101, p=0.002). The H0 is therefore rejected.

Although the final test indicates a significant difference in naturalness depending on which speech database is used, it is not as decisive as it is for the Icelandic data. It is also noteworthy that the only Icelandic group that was not significantly different was one of the two significantly different groups in English. If a difference such as this held even when the number of participants was tripled or quadrupled, it would mean that the trend is probably different for English than for Icelandic. As with the Icelandic data, tests are run to see if there is a significant difference between groups. If there is none, the seemingly different pattern of the trend can be disqualified. To test for differences between groups, the Kruskal-Wallis test is used, as before.

I first tested the O data. With mean ranks 13.57, 14.43, 16.43 and 13.57, the Kruskal-Wallis test did not indicate a significant difference between groups (H (Chi-square) = .569, df=3 and p=0.904).

The B data was then tested, and with mean ranks 10.79, 18.36, 13.57 and 15.29, the Kruskal-Wallis test did not indicate a significant difference between groups here either (H (Chi-square) = 3.130, df=3 and p=0.372).

As the tests do not indicate any significant difference between groups in either case, it can be assumed that the data were, in both cases, drawn from the same population.

Therefore a trend test is unnecessary, as it won't say anything significant. The Kruskal-Wallis test does not give significant results, and therefore it cannot be said of the groups that they are significantly different. But some of them might still be significantly different from others, which would indicate that the two groups giving significant results for the difference between synthesised words, based on the speech inventory used, are indeed what they seem to be. It is also quite safe to assume that the results of the Wilcoxon test inspecting differences between speech databases for all the groups as a whole are solid.

Chapter 5

Discussions and Conclusion

In the project, a voice was built capable of speaking two languages, either by using only target language recordings for concatenation, or by using a shared pool of speech data. The voice building involved defining a shared phone set, text selection for one of the two languages, recording of both and building the unit selection voice for Festival and the Multisyn engine. Furthermore, some experiments were run to try to evaluate the usefulness of a bilingual speech inventory, in terms of frequency of using the foreign phones, and naturalness of speech synthesised with phones recorded in two different languages.

5.1 Defining a new language

There were some complications in defining a new language, Icelandic, for Festival. This was mostly due to insufficient resources: no POS-tagger is available for the language, I did not have access to defined post-lexical rules, and the somewhat flawed pronunciation lexicon did not have any information about syllabification or stress. The quality of the resulting synthesiser is therefore limited to some extent. Acquiring these resources would thus be a logical next step for improving the Icelandic voice and the possibility of building other reasonably good voices in the language. A small corpus of Icelandic texts was built and used for generating a list of recording prompts with reasonable diphone coverage. van Santen & Buchsbaum (1997) argued that the instability of frequency distributions across text corpora poses a risk for systems relying too much on the frequency distribution of one particular corpus, as it might be missing quite a few units that another corpus might have, even abundantly in some cases. Using a bigger corpus with samples of many different texts from various sources might therefore be a better strategy for getting good diphone coverage.

5.2 Synthesising

When the voice had been built, experimentation was carried out. I found that some diphones that should have been covered by the recording prompts were missing. Some of the words that could not be synthesised with only their own language's speech data could be synthesised with the joint inventory, as the other language made up for the missing diphones. This showed that using multiple languages can be helpful in such cases. The problem of the missing diphones underlines that for text selection, a big and diverse database should be considered. The ten books my corpus consisted of are probably too limited. But even when using a huge text corpus, there is no guarantee that all possible diphones will be covered. Although using multiple languages does not guarantee that either, the different distribution of diphones between languages is a sufficient reason to give it some thought.

5.3 Unit Selection Experiment

The first experiment was supposed to investigate how frequently foreign units were chosen when Festival had the choice of using both types. It indicated that in a great majority of cases, a target language unit was chosen by the Multisyn unit selection engine. Multisyn bases its selection of units on target and join costs, so the study shows these are generally lower when all the units are from the same language. Some of this is highly likely to be attributable to effects of context, even of phones other than the adjacent ones, on the diphone's waveform, and therefore on the cepstral coefficients used to calculate join costs. Through such effects the diphones might become quite language specific. Another effect highly likely to discriminate between the languages is the effect of the target cost. Some of the target cost weights should have been disabled, as they based their score on insufficient and sometimes wrong information.

The two speech databases were built separately, and normalized separately. If the normalization of the speech data differs, that affects join costs, and thus makes it less likely for foreign phones to be chosen. The recordings were done in 17 sessions. When running the automatic pitch marking it was apparent that the recordings varied. The parameters were fine-tuned to the first session, and were very good for marking the pitch in that session and a few others. For some other sessions, on the other hand, most utterances had one or more bad pitch marks. This indicates that there was some variation in the recordings, which would give worse join costs between sessions. The two languages were never recorded in the same session.

Another likely reason is that one or both of the phones making up each half of the diphone might be somewhat different from the corresponding phone in the target language. Different languages use different phonological features to contrast between similar sounds, and may not pay attention to other factors that are not contrastive in that language. An example of this is the contrastive features of stops in English and Icelandic: in English voicing is contrastive, but in Icelandic it is aspiration. To deal with this, it might be necessary to include in the phone set all allophones of all phonemes in the language, thus making it easier to evaluate whether or not a given phone in one phone set has a corresponding phone in another language. This method does have some drawbacks. As the phones could become quite numerous in some languages, the number of rare diphones would grow, as would the problem of missing diphones. The number of utterances needed for recording is also likely to grow, making the voice building more expensive and laborious. This is therefore unlikely to be a cost-effective method. Another way might be to build a voice in more than two languages. A voice using speech data recorded in many languages is likely to have access to more variable phones, and the unit selection engine might thus be more likely to take advantage of foreign phones. This remains to be studied, though.

5.4 Naturalness of Synthesis Evaluation

To test the naturalness of the speech synthesised using the two different methods, a perceptual test was devised and set up on the web. The effect of using foreign phones in Icelandic synthesis was quite decisive, showing that the more English diphones a synthesised word contained, the less natural it was judged to be, compared to words with fewer or no English diphones.

The results for English were not as decisive. There is a significant difference between the synthesised words based on whether the shared data were used or only the target language data. Two of the four groups also showed significant results in the same direction, but upon inspecting whether the groups were significantly different from each other, the results were not significant. The English results are based on the marks of only seven participants and should thus be taken with caution; they do not mirror the Icelandic results and are somewhat ambiguous, so more participants might clear up the picture.

5.5 Future work

If further work is done with the voice used in the project described here, the first steps should be to improve it. There are a few things that might improve it without much effort. Normalizing all the speech data together might give slightly better results when sharing the databases. Disabling or tuning the target costs, and possibly the join costs, is likely to give at least slightly better results if the right parameters are found. An important thing to do for the voice to work properly as a polyglot voice would be to investigate thoroughly the differences and similarities of the phones in the two languages. Such detailed information is imperative for sharing the phones, at least if they are not supposed to sound foreign.

If Icelandic is to be synthesised, it is very important to define syllabification rules and post-lexical rules, to obtain a POS-tagger, and to build a proper pronunciation lexicon for Icelandic. Such tools are not only an invaluable help to the synthesiser at run time, but also help the voice building process and the text selection for recording prompts.

The main benefit of using foreign phones would be to allow for less recording in the languages the voice speaks. But it might be possible to use the foreign phones differently. Eklund & Lindström (1998) consider the lack of 'xenophones', foreign language phones, in a speech synthesis system to be a big drawback. They say the majority of Swedes use foreign phones under some circumstances in day-to-day spoken Swedish. A voice built on data from recordings in multiple languages could also be used to take advantage of the foreign speech inventory simply to produce the foreign phones an ordinary person would use. To test this with my voice, research would have to be done on whether, and when, English phones are likely to be used in spoken Icelandic (or, a much less likely scenario, Icelandic phones in English). These results might then be used to synthesise such speech and test whether it sounds more or less natural than using only Icelandic (or other target-language) units. There might also be a reason to look at individual diphones, or diphones belonging to certain phone classes, and see whether they are more or less likely to sound natural in a foreign context. Informal inspection of the English/Icelandic voice indicates that vowels are more likely to sound foreign than consonants. Looking at differences between word classes in different languages might well be worthwhile.

This study only investigated the possibilities of a voice recorded in two languages. A voice in more languages might give completely different results, and even a voice in some other pair of languages might show different trends. The results reported here do not show that the same holds for all languages. A similar trend can probably be expected, but there may be numerous factors not considered during this project that are partly responsible for the results.

If good phone sets and lts-rules which make the right distinctions between phones in many different languages can be built, it may be interesting to try to optimize recording sets for the languages by running a greedy algorithm jointly on sentences in all the languages, as sketched below. Even though such sets would be assumed to 'speak' a given language with a strange foreign-sounding accent, that does not necessarily have to be bad if the pronunciation is consistent and clear.
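A minimal sketch of such a joint greedy selection is given below: sentences from all languages are pooled, and the sentence adding the most not-yet-covered diphones is picked repeatedly. The sentence identifiers, the diphone notation and the stopping rule are assumptions, and the sentences are taken to be phonetised into diphone sets already.

    def greedy_select(sentences, max_prompts):
        """sentences: id -> set of diphones it contains; returns chosen ids and covered diphones."""
        covered = set()
        chosen = []
        for _ in range(max_prompts):
            best = max(sentences, key=lambda s: len(sentences[s] - covered), default=None)
            if best is None or not (sentences[best] - covered):
                break                                   # nothing new left to cover
            covered |= sentences.pop(best)
            chosen.append(best)
        return chosen, covered

    pool = {
        "en_001": {"p-a", "a-t", "t-#"},
        "is_001": {"p-a", "a-r", "r-#"},
        "en_002": {"s-a", "a-t", "t-#", "k-a"},
    }
    prompts, covered = greedy_select(dict(pool), max_prompts=2)
    print(prompts, sorted(covered))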

5.6 Conclusion

Before concluding this dissertation, it should be noted that the results of the experiments are affected, to a greater or lesser extent, by the fact that the speaker is a native speaker of Icelandic, speaking English with an Icelandic accent. As discussed in chapter 2, the English lts-rules and lexicon are calibrated to someone speaking the RP dialect of English. The renderings of some speech sounds might therefore have been different from what was presupposed by the system, affecting the forced alignment and perhaps other factors. The voice was also quite 'raw', with little adjustment to the default settings for building the voice, and no manual fixes.

Furthermore, some English speakers might have perceived some of the words synthesised using Icelandic-accented English units as unnatural. For tests such as the ones run during this project, a voice that is native-like in both languages might be better, or a voice that has been worked on to make it as natural as possible.

For now, the most important aspects of building a polyglot voice may be defining the phone sets and designing and building the speech data inventory, the heart of all unit selection speech synthesisers. But if such systems are to become a viable option, a substantial amount of research into the different areas concerning multilingual voices for speech synthesis is needed.

Chapter 6

Appendix A - Phone Set

The inventory of phones in the speech databases for the two languages includes the following phones. The English phones are designated with MRPA symbols, while the Icelandic phones which are not common to the English phone set are denoted with X-SAMPA symbols, with the prefix 'IS_' added to prevent the systems used from mistaking them for English phones using identical symbols. Some of the X-SAMPA symbols furthermore had to be adapted to HTK and Festival, as one or both of the programs does not accept numbers or backslashes in transcription symbols. The complete phone set for the bilingual speech database is therefore the following:

MRPA    X-SAMPA   Description
p       p         unvoiced labial stop
t       t         unvoiced alveolar stop
k       k         unvoiced velar stop
b       b         voiced labial stop
d       d         voiced alveolar stop
g       g         voiced velar stop
ch      tS        unvoiced palatal affricate
jh      dZ        voiced palatal affricate

Table 6.1: All phones defined for both languages (1 of 3).


MRPA    X-SAMPA   Description
s       s         unvoiced alveolar fricative
z       z         voiced alveolar fricative
f       f         unvoiced labio-dental fricative
v       v         voiced labio-dental fricative
th      T         unvoiced dental fricative
dh      D         voiced dental fricative
h       h         unvoiced glottal fricative
m       m         voiced labial nasal
m!      m=        syllabic voiced labial nasal
n       n         voiced alveolar nasal
n!      n=        syllabic voiced alveolar nasal
ng      N         voiced velar nasal
l       l         voiced alveolar lateral
lw      l         voiced alveolar lateral
l!      l=        syllabic voiced alveolar lateral
r       r         voiced alveolar approximant
y       j         voiced palatal approximant
w       w         voiced labial approximant
e       e         short close-mid front unrounded
a       {         short lax open front unrounded
aa      A:        long open back unrounded
ou      @U        diphthong close-mid central unrounded
o       Q         short open back rounded
oo      O:        long open-mid back rounded
ii      i:        long close front unrounded
iy      i         close front unrounded
i       I         lax close-mid front unrounded
@       @         schwa mid central unrounded
uh      V         short open-mid back unrounded
u       U         lax close-mid back rounded
uu      u:        long close back rounded
uw      M         close back unrounded
ei      eI        diphthong
ai      aI        diphthong
oi      OI        diphthong
ow      aU        diphthong
i@      I@        diphthong
@@r     3:        long open-mid central unrounded
eir     e@        diphthong open-mid front unrounded
ur      U@        diphthong close-mid back rounded

Table 6.2: All phones defined for both languages (2 of 3).

MRPA      X-SAMPA   Description
IS_c_h    c_h       unvoiced palatal aspirated stop
IS_gj_o   J\_0      unvoiced palatal stop
IS_C      C         unvoiced palatal fricative
IS_j      j         voiced palatal fricative
IS_x      x         unvoiced velar fricative
IS_G      G         voiced velar fricative
IS_m_o    m_0       unvoiced labial nasal
IS_n_o    n_0       unvoiced alveolar nasal
IS_J      J         voiced palatal nasal
IS_J_o    J_0       unvoiced palatal nasal
IS_ng_o   N_0       unvoiced velar nasal
IS_l_o    l_0       unvoiced alveolar lateral
IS_r      r         voiced alveolar trill
IS_r_o    r_0       unvoiced alveolar trill
IS_E      E         short open-mid front unrounded
IS_A      A         short open back unrounded
IS_Y      Y         short lax close-mid front rounded
IS_Q      9         short open-mid front rounded
IS_u      u         short tense close back rounded
IS_O      O         short open-mid back rounded
IS_I:     I:        long lax close-mid front unrounded
IS_E:     E:        long open-mid front unrounded
IS_Y:     Y:        long lax close-mid front rounded
IS_Q:     9:        long open-mid front rounded
IS_ei     ei        diphthong
IS_ei:    ei:       long diphthong
IS_Qy     9y        diphthong
IS_Qy:    9y:       long diphthong
IS_ai     ai        diphthong
IS_ai:    ai:       long diphthong
IS_Yi     Yi        diphthong
IS_Yi:    Yi:       long diphthong
IS_Oi     Oi        diphthong
IS_Oi:    Oi:       long diphthong
IS_ou     ou        diphthong
IS_au     au        diphthong

Table 6.3: All phones defined for both languages (3 of 3).
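
To make the renaming of the Icelandic symbols described at the start of this appendix more concrete, the following minimal sketch (in Python) maps an Icelandic X-SAMPA symbol to an HTK/Festival-safe phone name. The substitution rules and the function name are illustrative assumptions inferred from the tables above; this is not the actual script used to build the voice.

# Minimal illustrative sketch: rename Icelandic X-SAMPA symbols so that HTK
# and Festival accept them (no digits or backslashes), adding the 'IS_' prefix.
# The substitution list is an assumption inferred from Tables 6.1 to 6.3.
SUBSTITUTIONS = [
    ("J\\", "gj"),  # voiced palatal stop: the backslash is not accepted
    ("_0", "_o"),   # voiceless diacritic: the digit 0 is not accepted
    ("9", "Q"),     # open-mid front rounded vowel: digit replaced by a letter
]

def xsampa_to_phone_name(symbol):
    """Return an HTK/Festival-safe phone name for an Icelandic X-SAMPA symbol."""
    name = symbol
    for old, new in SUBSTITUTIONS:
        name = name.replace(old, new)
    return "IS_" + name

if __name__ == "__main__":
    for s in ["c_h", "J\\_0", "m_0", "9:", "ei:"]:
        print(s, "->", xsampa_to_phone_name(s))
    # c_h -> IS_c_h, J\_0 -> IS_gj_o, m_0 -> IS_m_o, 9: -> IS_Q:, ei: -> IS_ei:

Under a convention of this kind, the renamed symbols remain easy to trace back to their X-SAMPA originals while satisfying the constraints of both tools.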

Bibliography

Black, A. W. & Lenzo, K. A. (2003), ‘Building synthetic voices’, on-line: http://www.festvox.org/bsv/.

Clark, R. A. J., Richmond, K. & King, S. (2004), Festival 2 - build your own general purpose unit selection speech synthesiser, SSW5-2004, pp. 173–178.

Coolican, H. (1999), Research Methods and Statistics in Psychology, 3rd edn, Hodder & Stoughton.

CST (2004), Notes, procedure to build a new Multisyn voice.

Dijkstra, J., Pols, L. C. W. & van Son, R. J. J. H. (2004), Frisian TTS, an example of bootstrapping TTS for minority languages, SSW5-2004, pp. 97–102.

Eklund, R. & Lindström, A. (1998), How to handle "foreign" sounds in Swedish text-to-speech conversion: Approaching the 'xenophone' problem, Vol. 7 of Proceedings of ICSLP 98, Sydney, pp. 2831–2834.

Eklund, R. & Lindström, A. (2001), ‘Xenophones: An investigation of phone set expansion in Swedish and implications for speech recognition and speech synthesis’, Speech Communication 35(1–2), 81–102.

EN1 (1998), on-line: http://www.mit.edu/afs/sipb/user/kenta/festival/festival/lib/voices/english/en1_mbrola/en1/en1.txt.

Fitt, S. (2000), Documentation and User Guide to Unisyn Lexicon and Post-Lexical Rules, CSTR University of Edinburgh.


François, H. & Boëffard, O. (2001), Design of an optimal continuous speech database for text-to-speech synthesis considered as a set covering problem, Proc. of Eurospeech, Aalborg, Denmark.

Hieronymus, J. L. (1993), ‘ASCII phonetic symbols for the world’s languages: Worldbet’, Journal of the International Phonetic Association.

Hosom, J. P. (2000), Automatic Time Alignment of Phonemes Using Acoustic-Phonetic Information, PhD thesis, Oregon Graduate Institute of Science and Technology.

Huang, X., Acero, A. & Hon, H.-W. (2001), Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, Prentice-Hall.

Ladefoged, P. (2001a), A Course in Phonetics, 4th edn, Heinle & Heinle, Boston, USA.

Ladefoged, P. (2001b), Vowels and Consonants, Blackwell, Malden MA, USA and Oxford, UK.

LEA (2002), SPSS 11.5 - help.

Black, A. W. & Lenzo, K. A. (2000), ‘Building voices in the Festival speech synthesis system’.

Lindström, A. & Eklund, R. (2000), How foreign are "foreign" speech sounds? Implications for speech recognition and speech synthesis, RTO Meeting Proceedings 28. Papers and reports presented at the Tutorial and Workshop held in Leusden, The Netherlands, 13–14 September 1999.

Möbius, B. (2001), Rare events and closed domains: Two delicate concepts in speech synthesis, number 117 in ‘SSW4-2001’.

Þráinsson, H. (1995), Handbók um málfræði [Handbook of grammar], Námsgagnastofnun, Reykjavik, Iceland.

Rögnvaldsson, E. (1989), Íslensk hljóðfræði [Icelandic phonetics], Málvísindastofnun Háskóla Íslands, Reykjavik, Iceland.

Rögnvaldsson, E. (1993), Íslensk hljóðkerfisfræði [Icelandic phonology], Málvísindastofnun Háskóla Íslands, Reykjavik, Iceland.

Rögnvaldsson, E. (2004), ‘The Icelandic speech recognition project Hjal’, Nordisk Sprogteknologi. Årbog, pp. 239–242.

Saikachi, Y. (2003), Building a unit selection voice for Festival, Master’s thesis, University of Edinburgh.

Traber, C., Huber, K., Nedir, K., Pfister, B., Keller, E. & Zellner, B. (1999), From multilingual to polyglot speech synthesis, Proceedings of Eurospeech, pp. 835–838.

van Santen, J. P. H. & Buchsbaum, A. L. (1997), Methods for optimal text selection, in ‘Proc. Eurospeech ’97’, Rhodes, Greece, pp. 553–556.

Wells, J. C. (1995), ‘Computer-coding the IPA: a proposed extension of SAMPA’, on-line: http://www.phon.ucl.ac.uk/home/sampa/x-sampa.htm.

Wells, J. C. (1997), Handbook of Standards and Resources for Spoken Language Systems, Mouton de Gruyter, Berlin and New York, chapter ‘SAMPA computer readable phonetic alphabet’, Part IV, section B.