2006:113 CIV MASTER'S THESIS

Feasibility Study on a Text-To-Speech Synthesizer for Embedded Systems

Linnea Hammarstedt

Luleå University of Technology
MSc Programmes in Engineering, Electrical Engineering
Department of Computer Science and Electrical Engineering
Division of Signal Processing

2006:113 CIV - ISSN: 1402-1617 - ISRN: LTU-EX--06/113--SE

Preface

This is a master's degree project commissioned by and performed at Teleca Systems GmbH in Nürnberg at the department of Speech Technology. Teleca is an IT services company focused on developing and integrating advanced information technology solutions. Today Teleca possesses a speech recognition system including a grapheme-to-phoneme module, i.e., an algorithm converting text into phonetic notation. Their future objective is to develop a Text-To-Speech system including this module. The purpose of this work, from Teleca's point of view, is to investigate a possible solution for converting phonetic notation into speech that is suitable for an embedded implementation platform.

I would like to thank Dr. Andreas Kiessling at Teleca for his support and patient discussions during this work, and Dr. Stefan Dobler, the head of the department, for giving me the opportunity to experience this interesting field of speech technology. Finally, I wish to thank all the other personnel of the department for their constant practical support.


Abstract

A system converting textual information into speech is usually denoted a TTS (Text-To-Speech) system. The design of such a system varies depending on its purpose and platform requirements. In this thesis a TTS synthesizer designed for an embedded system operating on an arbitrary vocabulary has been evaluated and partially implemented in Matlab, constituting a base for further development. The focus of this thesis is on the speech generation part, which involves the conversion from phonetic notation into synthetic speech. The chosen TTS system is the so-called Time Domain PSOLA (TD-PSOLA), which suits the implementation and platform requirements well. It concatenates segments of recorded speech and changes their prosodic characteristics with the Pitch Synchronous Overlap and Add (PSOLA) technique. Each segment extends from the mid-point of one phone to the mid-point of the next and is referred to as a diphone.

The quality of the generated synthesized speech is quite satisfactory for the test sentences applied. Some disturbances still occur as a consequence of mismatches, such as differing spectral properties of the segments and pitch detection errors, but further development can reduce these.


Contents

1 Introduction
  1.1 Introduction to TTS Systems
  1.2 Linguistic Analysis Module
  1.3 Speech Generation Module
    1.3.1 Rule-Based Synthesis
    1.3.2 Concatenative-Based Synthesis
  1.4 Project Focus

2 Theory
  2.1 Segment Data Preparation
    2.1.1 Segment Format and Speech Corpus Selection
    2.1.2 Preparation Process
    2.1.3 Segment Representation
  2.2 Speech Synthesis
    2.2.1 Synthesizing Process
    2.2.2 Prosodic Information
  2.3 PSOLA Method
    2.3.1 PSOLA Operation Process
    2.3.2 Modification of Prosody
    2.3.3 TD-PSOLA as Speech Synthesizer
  2.4 Extension of TD-PSOLA into MBR-PSOLA
    2.4.1 Re-synthesis Process
    2.4.2 Spectral Envelope Interpolation
    2.4.3 Multi-Band Excitation Model
    2.4.4 Benefits with the respective PSOLA Methods
  2.5 Utilized Data from External TTS Projects
    2.5.1 Festival and FestVox
    2.5.2 MBROLA

3 Implementation
  3.1 Segment Data Preparation
    3.1.1 Segment Information Modification
    3.1.2 Pitch Marks Modification
    3.1.3 Speech Corpus Modification
    3.1.4 Additive Modifications
  3.2 Speech Synthesis
    3.2.1 Input Format
    3.2.2 Segment List Generator
    3.2.3 Prosody Modification
    3.2.4 Segment Concatenation

4 Evaluation
  4.1 Analysis of the Segment and Input Data
    4.1.1 Pitch Marks
    4.1.2 Spectral Mismatch
    4.1.3 Fundamental Frequencies
  4.2 Solution Analysis
    4.2.1 Window Choice for the ST-signal Extraction
    4.2.2 Frequency Modification
    4.2.3 Duration Modification
    4.2.4 Word Border Information

5 Discussion
  5.1 Conclusions
    5.1.1 Comparison of TD- and MBR-PSOLA
  5.2 Further Work
    5.2.1 Proceedings for Teleca
    5.2.2 Possible Quality Improvements

A SAMPA Notation for British English

B MRPA - SAMPA Lexicon for British English

C Licence for CSTR's British Diphone Database

List of abbreviations

IPA: International Phonetic Alphabet
MBE: Multi-Band Exciter
MBR-PSOLA: Multi-Band Re-synthesis PSOLA
MBROLA: short for MBR-PSOLA
MOS: Mean Opinion Score
MRPA: Machine Readable Phonetic Alphabet
OLA: Overlap and Add
PSOLA: Pitch-Synchronous Overlap and Add
SAMPA: Speech Assessment Methods Phonetic Alphabet
TD-PSOLA: Time Domain PSOLA
TTS: Text-To-Speech
V/UV: Voiced/Un-Voiced


Chapter 1

Introduction

The possibility of producing synthesized speech from plain textual information, through so-called Text-To-Speech (TTS) systems, has attracted extensive interest in many technical areas. Different methods with varying quality and properties exist, and the development is still continuing. The purpose of this thesis is to define and evaluate a TTS synthesizer suitable for embedded systems. The work is performed at Teleca Systems GmbH in Nürnberg, and its focus is set by the requirements of the company.

Today, Teleca holds a module able to transform text into phonetic notation, originally developed for another speech application. This module is assumed to be usable also in a TTS system, and the starting point for this project is hence phonetic notation. The developed system is restricted to British English, but the theoretical descriptions are valid for an arbitrary language. Since the starting level is phonetic notation, it is not strictly correct to consider the investigated system a Text-To-Speech system. However, for simplicity, and since the process of going from phonetic notation to speech is a major part of a TTS system, the term TTS is nevertheless used in this thesis to describe the evaluated overall process.

In this study a TD-PSOLA (Time Domain-Pitch Synchronous Overlap and Add) [Dut97] synthesizer is investigated and implemented in Matlab. The result is evaluated, and suggestions for further work towards completing the system are given. A possible extension of this method with the Multi-Band Excitation (MBE) [GL88] model is presented theoretically, together with its benefits and disadvantages.

The following sections briefly describe the main principles of a general TTS system as well as some existing classifications and groupings. The ambition of the latter description is to show which choices have been made and to explain why.

1.1 Introduction to TTS Systems

A TTS synthesizer is a computer-based system that takes a text string as input and converts it into synthetic speech waveforms. The methods and equipment needed for this process vary depending on the physical restrictions of the implementation platform and on development costs. Two main hardware restrictions are the storage properties, such as capacity and memory type, and the clock rate of the processor.


The synthesis process can, for all methods, be divided into the two main modules presented in Figure 1.1. The first step transcribes the input text into a linguistic format, which is usually expressed as phonetic notations (phones) together with additional information about their prosody. The term prosody refers to properties of each phone such as duration and pitch. The output is then used in the second block to construct the final synthetic speech waveform.

[Figure: Text → Linguistic Analysis → Phonemes & Prosody Info → Speech Generation → Speech]

Figure 1.1: Division of a general TTS system into two main modules.

1.2 Linguistic Analysis Module

In almost all languages, the textual representation of a word does not directly correspond to its pronunciation. The position of letters within a word and the word's appearance within the sentence affect the pronunciation considerably, as do additional characters such as punctuation marks and the content of the sentence. An alternative symbolic representation is therefore needed to capture this hidden information. Usually a language can be described by 20 to 60 different phonetic characters [Lem99], when the information about its melody is excluded. To also be able to describe the pitch characteristics, additional prosodic information is needed. Converting text into a linguistic representation requires a large set of different rules and exceptions depending on the language. This process can be described through three main parts [Dut97]:

1. Text analysis

2. Automatic phonetization

3. Prosody generation

The first part functions as a pre-processing phase. It identifies special characters and notations, such as numbers and abbreviations, and converts them into full text when needed. Many words have different pronunciations depending on their meaning, and hence a contextual analysis is performed to categorize the words. The last step in the text analysis part is to find the structure of the text and to organize the utterance into clauses and phrases. After the text analysis phase, an automatic phonetization is performed, focusing on single words. The letter representation is automatically transcribed into a phonetic format using a dictionary-based or rule-based strategy, or a combination of both.

The former strategy divides the words into morphemes¹ and then converts them using a morpheme-to-phoneme dictionary. A large database is required for the dictionary to function in a general environment, together with additional transcription rules for un-matched morphemes. A mixture of the two strategies is also present in the case of a general rule-based text conversion; here, an exception dictionary is needed for the words that do not follow the defined pronunciation rules.

The last part of the linguistic transcription process is to add the prosodic information. This is applied as additional information and hence the phonemes are not further changed. Prosodic features are created by grouping syllables and words into larger segments. A hierarchical classification of these groups then leads to the resulting prosody description, usually presented as pitch definitions and phonetic durations.
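Returning to the dictionary-based strategy above, the following Matlab sketch is a purely illustrative toy example of a morpheme-to-phoneme lookup with an exception dictionary; the dictionaries, the phone symbols and the assumption that the morpheme split is already given are invented for the illustration and are not part of any real phonetization module.

    % Toy illustration of dictionary-based phonetization with an exception
    % dictionary. All entries are invented for the example.
    morphemeDict  = containers.Map( ...
        {'un', 'believe', 'able'}, ...
        {{'V','n'}, {'b','I','l','i:','v'}, {'@','b','l'}});
    exceptionDict = containers.Map({'one'}, {{'w','V','n'}});

    word = 'unbelievable';
    if isKey(exceptionDict, word)
        phones = exceptionDict(word);                  % irregular pronunciation
    else
        parts  = {'un', 'believe', 'able'};            % morpheme split, assumed given
        phones = {};
        for m = 1:numel(parts)
            phones = [phones, morphemeDict(parts{m})]; %#ok<AGROW> append phones
        end
    end
    % phones now holds the phonetic transcription as a cell array of symbols.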

1.3 Speech Generation Module

The best synthetic result for generating speech would be achieved by having recordings of all existing words stored in a huge database. The input generated by the linguistic analysis module could then simply be used to find and return the desired words. However, a TTS system able to work for arbitrary text input would in this case require an almost infinite number of recorded words. A more efficient, and thereby also more complex, speech generation system is therefore needed.

There exist several methods for generating unrestricted speech in an implementation-realistic manner, i.e., with a limited amount of storage and a limited number of operations. The synthesis can be done either explicitly, using models of the vocal tract, or implicitly, based on pre-recorded sounds [Dut97]. Implementation of these approaches results in two different classifications:

1. Rule-based synthesis for explicit operations.

2. Concatenative-based synthesis for implicit operations.

1.3.1 Rule-Based Synthesis

Creating rule-based synthesizers requires a careful study of how the different sounds of the human voice are produced. The modelling is then usually represented either by articulatory parameters or by formants² [SO95]. In the former case, several parameters hold information about, for example, the shapes and movements of the lips and tongue, the glottal aperture, the cord tension and the lung pressure [Lem99]. In the latter case, a set of rules is used to determine the formant parameters necessary to synthesize a desired utterance.

These rule-based synthesizers require a considerably higher computational load than other common methods. Moreover, the synthesized speech sounds rather unnatural, because the modelling is complicated and speech cannot be modelled with complete accuracy. On the other hand, they are space-efficient, since no speech segments need to be stored, and they can in principle easily be adjusted to a new speaker with a different voice and dialect.

¹ A morpheme is the smallest language unit that carries a semantic interpretation. For example, the word 'unbelievable' can be divided into the three morphemes un-believe-able.
² A formant is a peak in an acoustic frequency spectrum.

1.3.2 Concatenative-Based Synthesis

In concatenative-based synthesizers, segments of pre-recorded speech are connected (concatenated) to produce the desired utterances. The longer the segments used for synthesis, the better the quality achieved. On the other hand, using segments that each consist of several phonemes is, as mentioned previously for whole words, not realistic for an unrestricted application area, due to the database size. In this case, most often so-called diphones (see subsection 2.1.3) are used, consisting of two phonemes and the transition in between.

The method principally used for concatenating speech segments is the PSOLA (Pitch-Synchronous Overlap and Add) technique, or plain OLA if the pitch-synchronizing step is excluded, which is described more closely in section 2.3. Several methods build on these concatenation operations, with varying pre-processing of the database and different ways of applying the desired prosody. Most common are the TD-PSOLA (Time Domain-PSOLA) and the MBR-PSOLA (Multi-Band Re-synthesis PSOLA), both described in the following chapter.

An evaluation of and comparison between four classical concatenative-based syntheses is described in [Dut94], involving the TD- and MBR-PSOLA methods, an LPC (Linear Predictive Coding) synthesizer and a synthesizer based on the Hybrid H/S (Harmonic/Stochastic) method³. The study implies that the TD- and MBR-PSOLA methods are very run-time efficient, since they are estimated to require an operational load of 7 operations per sample, while the other two need at least ten times more. Additionally, the two PSOLA-based methods have better intelligibility and naturalness; only in terms of fluidity is the Hybrid H/S model ranked higher than TD-PSOLA.

The main difference between the TD- and MBR-PSOLA methods appears in the pre-processing stage. In the latter synthesizer the segments are further normalized and equalized, which is beneficial for data compression and speech fluidity, but at the cost of naturalness. The implementation of this method is also more complex, as can be seen in the next chapter, where a more theoretical description of the two synthesizers is given.

A drawback of concatenative-based synthesizers is the large memory space needed for the stored segments. Additionally, the synthesizer cannot change speaker characteristics, as rule-based systems can. Some important benefits, though, are the simplicity of implementation, the natural sound and the few real-time operations needed [SO95].

³ See [Dut97] for a description of the LPC and Hybrid H/S models.

1.4 Project Focus

The speech recognition system existing at Teleca today includes a grapheme-to-phoneme module. This module is assumed to be usable as the linguistic part of a TTS system, converting text into phonetic notation. However, this is only an assumption and a future implementation task, and it is therefore not put into practice in this project⁴. The starting level for the TTS system described in this thesis is hence phonetic notation combined with prosodic information, assumed to be presented in a defined format suiting the chosen synthesis model.

Since the future implementation platform for the desired TTS system is an embedded device working in real time, a method with a low operational load and small data storage is required. According to what is described in the previous section, this means that a concatenative-based synthesis method using the run-time efficient PSOLA technique is preferable. To minimize the data storage for a TTS system with an arbitrary vocabulary, the segment database should consist of pure phoneme recordings. However, since it is known that the transition between two phonemes is more important for the understanding of speech than the stable state itself [Dut97], a segment database consisting of diphones (with one transition point present in each segment) is preferable. Naturally, the more transitions each segment includes, the better the speech is understood, but this would also require a larger database. The memory load will increase approximately with a power of two for each additional transition included in each segment, and very soon an unrealistic number of segments is reached. Therefore, to reduce the data storage size, this TTS system is based on diphone segments.

The most common PSOLA-based synthesizers are the TD- and the MBR-PSOLA. The former method is more widely used and requires a somewhat simpler implementation. For this project the TD-PSOLA method is chosen, basically because the MBR-PSOLA is more or less an extension of TD-PSOLA, and hence the implemented system can still be developed further. It would, though, be interesting to implement an MBR-PSOLA synthesizer as well and compare it with the TD-PSOLA, but due to the time restrictions of the project this is not done.

For the implementation of the TTS system on an external device it is preferable to have the algorithm expressed in C or in device-dependent assembler code. In this thesis the program is implemented entirely in Matlab because of its good analysis possibilities and tools. When the optimal solution is found, the code can relatively easily be translated into C code.

⁴ The reason for making an assumption rather than an implementation is discussed in subsection 5.2.1.

Chapter 2

Theory

The main operations needed for generating speech from a phonetic description with a concatenative-based synthesizer can be divided into two main processes: a segment data preparation process and a speech synthesis process. The former creates the data underlying the synthesis and is performed once. It operates on collected speech data and restructures the data into a format suitable for the synthesizer. Information useful for the future synthesis process is calculated and applied either as additional data or as a recalculation of the collected data. The speech synthesis process consists of the functions that operate on the phonetic input together with the generated data, and it produces the resulting speech. It is independent of the language of the stored segments; only the defined segment format, i.e., the length of each speech unit, is required.

2.1 Segment Data Preparation

2.1.1 Segment Format and Speech Corpus Selection

The initial two steps in building a concatenative-based synthesizer are to determine the segment format and to collect a speech corpus. The term segment format refers to the number of phonetic notes present in each speech unit, together with information about where in a phonetic note the segment starts and ends. Usually the number of notes is fixed to a certain value, as in the cases described in subsection 2.1.3 below, but the segments could also have varying length, as in the case of words. The selection of the segment format is a trade-off between operational load and complexity, storage requirements and speech quality. Longer segments result in fewer concatenation points, and hence a simpler TTS system and better preserved naturalness. On the other hand, using a segment format of several phonetic notes requires a large number of recorded speech segments. For each new phonetic part included, an almost exponential increase¹ in memory size is required as a consequence of the growing number of possible combinations.

¹ According to combinatorial theory, the number of permutations is

    P(n, k) = \frac{n!}{(n-k)!} = n(n-1)(n-2)\cdots(n-(k-1)) \approx n^{k} \quad \text{(for small } k \text{ and large } n\text{)},

where n denotes the total number of phonetic notes and k the number of notes per segment.
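As a purely illustrative calculation (the inventory size of 45 phonemes is a round number within the 40 to 50 range cited later in this chapter, not a figure taken from the thesis), the approximation above gives roughly

    n = 45:\quad n^{1} = 45 \ \text{(phonemes)}, \qquad n^{2} = 2025 \ \text{(diphones)}, \qquad n^{3} = 91\,125 \ \text{(triphones)},

which is consistent with the later statement that English needs on the order of 1500 to 2000 diphones once non-occurring combinations are discarded.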


The segment data preparation process is based on a recorded speech corpus, and the number of segments to be included in the corpus is derived from the chosen segment format. It is preferable to record several versions of each segment and later choose the most appropriate recording in the preparation process. The resulting quality of the TTS system depends to a large extent on the quality of the speech corpus. The recorded data should be noiseless and read by one single person, and, to facilitate the future segment concatenation, it should be spoken as monotonously and with as stable an energy as possible.

2.1.2 Preparation Process

Figure 2.1 displays the operations involved in the segment data preparation process of a general concatenative-based TTS system. It starts at speech corpus level and a description of each block is presented below.

[Figure: Speech Corpus → Selective Segmentation (also producing Segment Information) → Speech Analysis → Equalization → Synthesis Segments]

Figure 2.1: Block scheme for the segment data preparation process of a general concatenative-based speech generator.

Selective Segmentation

The recorded speech stored in the Speech Corpus database usually consists of complete words intended to be divided into the defined segment format. This segment extraction is performed either by marking the segment end points or by cutting out and storing the desired parts. Finding the optimal cutting points is a time-consuming process, since an automatic segmentation function is hard to develop, and it therefore needs to be done more or less by hand. Secondly, the most appropriate speech segment is chosen if several recordings per segment exist. Information about the segments is calculated and stored in the Segment Information database, for example the length of the segments and, when using segments consisting of at least two phonemes², the mid position of the transition appearing between the phonemes. Finally, the extracted speech segments are transmitted to the next function.

² Defined as the mental abstraction of a phonetic note, see the next subsection.

Speech Analysis

The operations involved in the Speech Analysis part mainly depend on the chosen synthesis method of the TTS system. In some cases the speech segments are recalculated to better resemble each other, as in the case of normalization. Additional information about each segment, for instance pitch marks, is for some methods stored in the Synthesis Segment database. Later in this chapter, the pre-processing calculations needed for a TTS synthesizer are described for two different PSOLA-based systems.

Equalization

One operation concerning all concatenative-based methods is the energy equalization. When speech data is recorded from a human, it is never spoken with a constant volume. This energy variation can lead to clearly audible mismatches when the different segments are concatenated, so before the speech segments are finally stored in the Synthesis Segment database, this equalization is applied. It has been found [Dut97] that the amplitude energy differs between phonemes according to the type of sound and where in the mouth it is produced. To preserve this natural energy variation, the equalization process is applied within each group of equal phones. Note that the term equalization is used here in contrast to the term normalization: an energy normalization process would set the energy of all segments to an averaged value, and the natural energy variation would then be lost.

2.1.3 Segment Representation

A phoneme is the linguistic representation of a phonetic event and is thereby defined as the smallest unit an utterance can be divided into. The phoneme is often incorrectly identified with a phone, but the latter describes a pronounced phonetic note while the phoneme corresponds to the mental abstraction of it. In other words, a phoneme can be defined as the categorization of a group of related phones [Dut97]. To represent all phones in a language, a set of about 40 to 50 basic phonemes is needed, depending on the language and the desired transcription accuracy [Lem99].

A diphone describes the transition between two phonemes. It starts in the middle of the steady region of one phoneme and ends in the middle of the steady region of the next. In this way the concatenation points always appear at the most steady state of a phone, and compared to using phoneme segments, the spectral mismatch at the concatenation point is decreased. As an example, the word 'mean' with the phonetic description [m, i:, n] (in SAMPA notation, see further on in this section) corresponds to the diphone description [#-m, m-i:, i:-n, n-#], where # denotes a short silence. Figure 2.2 displays the signal representation of the two phones m and i: spoken consecutively, together with a classification of their different regions. The number of diphones needed to represent a language is basically the square of the number of phonemes, disregarding some non-existing phoneme combinations. This means that the English language requires approximately 1500 to 2000 diphones [HAH+98].

The most general alphabet for representing phonemes is the International Phonetic Alphabet (IPA). It has the capability to express all spoken languages of the world with one standard notation.

[Figure: waveform with the steady region of m, the transition region and the steady region of i: marked, together with the cutting points, the transition point and the extent of the diphone]

Figure 2.2: Signal representation of the two phones m and i: spoken consecutively, with the defined concepts marked.

This notation is composed of a large set of different characters which are mostly not represented by ASCII codes. In computer systems it is, however, preferable to use an alphabet composed of a restricted number of ASCII characters and combinations of them. The SAMPA (Speech Assessment Methods Phonetic Alphabet) is one of the most popular machine-readable phonetic alphabets used today. It has a simple structure consisting of ASCII characters, with at most two characters combined. In Appendix A the SAMPA notation for British English is listed together with descriptions of pronunciation, classified according to how the phones are produced.

Another, less widely used, phonetic alphabet is the MRPA (Machine-Readable Phonetic Alphabet). It was developed by CSTR (the Centre for Speech Technology Research) at the University of Edinburgh in a project called Festival, which is further described in subsection 2.5.1 below. This alphabet closely resembles SAMPA, but uses only lower-case letters together with the character @. The MRPA can be mapped directly onto SAMPA notation, as can be seen in the appended MRPA-SAMPA lexicon in Appendix B.

Recorded speech phones can be classified according to their waveform as voiced or unvoiced signals, usually denoted V and UV, respectively. A voiced signal contains a clearly identifiable fundamental frequency with clear harmonics, while an unvoiced signal has a frequency spectrum resembling noise. Many speech phones, however, consist of a mixture of these two classes, with varying V and UV proportions in different frequency regions, and therefore a ratio V/UV is introduced, where 1 corresponds to a purely voiced signal and 0 to a purely unvoiced one.
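To make the diphone representation concrete, the following Matlab sketch (not part of the thesis implementation; the function name and the silence padding are illustrative assumptions) expands a phoneme sequence into the corresponding diphone names, reproducing the 'mean' example above.

    function diphones = phonemes_to_diphones(phonemes)
    % PHONEMES_TO_DIPHONES  Expand a phoneme sequence into diphone names.
    %   Illustrative sketch: pads the utterance with the silence symbol '#'
    %   and joins neighbouring phonemes with '-'.
        padded   = [{'#'}, phonemes(:)', {'#'}];        % leading/trailing silence
        diphones = cell(1, numel(padded) - 1);
        for k = 1:numel(diphones)
            diphones{k} = [padded{k} '-' padded{k+1}];  % e.g. 'm-i:'
        end
    end

    % Example: phonemes_to_diphones({'m', 'i:', 'n'})
    % returns {'#-m', 'm-i:', 'i:-n', 'n-#'}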

2.2 Speech Synthesis

2.2.1 Synthesizing Process

A model of the speech synthesis process is shown in Figure 2.3, with a description of each block below. The figure describes a general concatenative-based TTS system starting at phonetic notation level.

[Figure: Phonemes & Prosody Info → Segment List Generator (using the Segment Information database) → Segment File Collector (using the Synthesis Segments database) → Prosody Modification → Segment Concatenation → Waveform Generator → Speech]

Figure 2.3: Block scheme for the run-time operating functions of a general concatenative-based speech synthesizer.

Segment List Generator

In this block, the phonetic input notation is transformed into the pre-defined segment format of the synthesizer. The structure of the prosodic information is then changed so that it corresponds to the defined format. This operation requires information about the stored segments, which is found in the Segment Information database together with data needed for further operations, such as the segment file addresses. The functions following this block operate on one segment at a time, and the segment list generator therefore transmits the segment transcriptions with the corresponding information one by one.

Segment File Collector

The Segment File Collector reads the current speech segment file from the Synthesis Segment database, according to the file address received from the Segment List Generator, and transmits the file further.

Prosody Modification

In this block, the desired prosodic properties are applied to the speech segment. These properties usually refer to pitch and time duration (see subsection 2.2.2 below) and are applied with a method that depends on the synthesis algorithm. This process is described for PSOLA-based systems in section 2.3.

Segment Concatenation

The method for concatenating two segments is independent of the chosen segment format. For a good concatenation result, the two segments should have prosodic characteristics that are as similar as possible in their concatenation parts, i.e., at the end of the first segment and the beginning of the second. At these points the segments are assumed to have equal fundamental frequencies, and the cutting points are assumed to appear at the same position within the pitch period. A concatenation using the PSOLA technique is described in section 2.3.

Before the concatenation of the segments, a smoothing of discontinuities may be performed. The concatenation process itself usually results in a smoothing of the end parts of the segments, but, for instance, in the method described in section 2.4 a spectral envelope smoothing is performed by linear interpolation (which is also described in that section). Since the concatenation (and/or smoothing) of one segment depends on the shape of the next one, a one-segment delay in the concatenation block is required. This delay forces concatenative-based synthesizers to be partly non-causal. However, the smoothing of one segment never depends on any non-adjacent segment [Dut97], and the non-causality of the system is therefore strictly limited.

Waveform Generator

In some cases, for instance in rule-based TTS systems, the sound segment is stored parametrically or described by certain rules, and a Waveform Generator is then needed to decode the sound into a perceptible format, i.e., by generating sound waves.

2.2.2 Prosodic Information

In linguistics, the term prosody refers to certain properties of a speech signal, usually audible changes in pitch, loudness and syllable length. Other properties related to prosody are, for example, speech rate and rhythm, though these are not as commonly used in TTS systems. The pitch information is one of the most essential prosodic properties. It describes the 'melody' of the utterance and thereby prevents the output of a TTS system from sounding monotonous. Additionally, a stressed syllable can be marked by a fast and large pitch change, as well as by an increase in its duration. The syllable length also varies with the position within a word and is hence another important parameter for speech synthesis. The desired pitch can be expressed as a sequence of frequency labels, each consisting of a time of appearance and a value. The last common prosodic property mentioned is loudness, which can also be defined as energy intensity. This parameter is, however, only of interest when producing emotional speech, since it is approximately constant in speech with a normal temper [Dut97].

2.3 PSOLA Method

The purpose of the PSOLA (Pitch-Synchronous Overlap and Add) technique is to change pitch and duration of an audible signal without performing any operations in the frequency domain. This process can be divided into two main steps:

1. decomposition of the signal into separate but overlapping parts, and

2. recombination of the parts by means of overlap-adding (OLA) with desired pitch and duration modification considered.

The operations involved in these two steps are described in more detail in the following subsection.

2.3.1 PSOLA Operation Process

A signal s(t)³ is decomposed into several short-time signals (ST-signals) s_i(t) by windows generated as time-shifted versions of a window w(t). The window is centralized around each pitch mark pm_i of the original signal. A pitch mark is defined as the time of the maximum signal value within one period T_{0_i} of the instantaneous fundamental frequency F_0, according to Figure 2.4. If the signal is cut so that pm_0 = 0, each pitch mark can be described as

    pm_i = \sum_{n=1}^{i} T_{0_n}, \quad i \in \mathbb{N}. \qquad (2.1)

As described, these pitch marks correspond to the time shifts of the windows, and the extraction of a general ST-signal can thus be expressed as

    s_i(t) = s(t)\, w(t - pm_i). \qquad (2.2)

Note that the variable t is used as time index, which usually denotes continuous time. Time indexing in discrete time, as in this case, is most often denoted by n, but for better understandability⁴ the variable t is used. If the ST-signals are added together again but with a different time shift, i.e., with a new pitch mark vector pm', the reconstructed signal will have changed fundamental frequencies and is generated by

    s'(t) = \sum_{i} s_i(t - pm'_i). \qquad (2.3)

If the original signal is strictly periodic, the period times T_{0_i} are equal for all i and the pitch mark vector in (2.1) simplifies to pm_i = i T_0. The decomposition and recombination described in equations (2.2) and (2.3), respectively, can thus for periodic signals be redefined as

    s_{i,per}(t) = s_{per}(t)\, w(t - i T_0), \qquad (2.4)

    s'_{per}(t) = \sum_{i} s_{i,per}(t - i T'_0). \qquad (2.5)

³ For a TTS system, this indicates one speech segment.
⁴ According to the author.

[Figure: (a) the original signal s(t) with pitch marks pm_{i-1}, pm_i, pm_{i+1}, the local period T_{0_i} and a window w(t) centred on pm_i; (b) the extracted ST-signal s_i(t)]

Figure 2.4: Signal representation of the phone i:. (a) Original signal s(t) with a window w(t) centralized around pm_i. (b) Extracted ST-signal s_i(t).
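As a minimal Matlab sketch of the decomposition in (2.2) (not taken from the thesis; the Hann window shape, the variable names and the use of a window length of roughly two local periods, motivated further below, are assumptions for illustration):

    function st = extract_st_signals(s, pm)
    % EXTRACT_ST_SIGNALS  Window a speech segment into ST-signals, cf. (2.2).
    %   s  - speech segment (column vector of samples)
    %   pm - pitch mark positions in samples (ascending)
    %   st - cell array of windowed ST-signals, one per interior pitch mark
        st = cell(1, numel(pm));
        for i = 2:numel(pm)-1
            T0   = round((pm(i+1) - pm(i-1)) / 2);       % local period estimate
            L    = 2*T0 + 1;                             % about two periods long
            win  = 0.5*(1 - cos(2*pi*(0:L-1)'/(L-1)));   % Hann window
            idx  = (pm(i)-T0):(pm(i)+T0);                % samples centred on pm(i)
            keep = idx >= 1 & idx <= numel(s);           % clip at segment borders
            st{i} = s(idx(keep)) .* win(keep);
        end
    end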

Theoretically, the reconstruction comprising a pitch change as described in (2.5) can be performed perfectly. This means that s'_{per}(t) has the same spectral properties as s_{per}(t), with only a constant change of its fundamental frequency and harmonics. This statement can be proved by the Poisson formula [DL93]: if a signal

    f(t) \;\xleftrightarrow{\;\mathcal{F}\;}\; F(\omega),

then

    \sum_{n=-\infty}^{+\infty} f(t - nT_0) \;\xleftrightarrow{\;\mathcal{F}\;}\; \frac{2\pi}{T_0} \sum_{n=-\infty}^{+\infty} F\!\left(\frac{n}{T_0}\right) \delta\!\left(\omega - \frac{n}{T_0}\right). \qquad (2.6)

In words, the formula implies that summing an infinite number of shifted versions of a given signal f(t) results in sampling its Fourier transform with a sampling period equal to the inverse of the time shift T_0. The spectral envelope is hence preserved while the new harmonics are evenly spread, and the claim of a theoretically perfect time shift of a periodic signal is confirmed.

As previously described, the windows used for creating the ST-signals s_i(t) are separated by the length of the local T_0. If a window size much larger than this is used, spectral lines⁵ will appear in the spectrum of the ST-signal [Dut97]. These spectral lines can prevent s(t) from being harmonized, since the sampling of the frequency domain (cf. (2.6)) can result in frequency values at spectral dips. On the other hand, using a too narrow window will produce a very rough harmonization with approximated frequency values. Choosing a window size in between these two cases results in an optimal window size of about twice the period time. This results in a window overlap

⁵ A spectral line is a dominant absence or presence of a certain frequency (a spectral dip or peak).

for the ST-signal extraction in (2.2) of one period time T_0, when a window size of exactly 2T_0 is used. The Poisson formula presumes an infinite number of equal ST-signals as input, see (2.6), which is only achieved for a stationary signal. Speech, however, is a non-periodic signal, but with a relatively slowly varying frequency spectrum. This property of quasi-stationarity means that equation (2.6) can be used for speech, but with restricted summation boundaries and hence a somewhat less perfect result. Furthermore, the described windowing requires a slow variation of the fundamental frequency, since the window size is defined as approximately 2T_0 and this condition only holds if T_{0_i} ≈ T_{0_{i+1}}. Consequently, the requirement of a quasi-stationary signal also arises from the windowing step.

2.3.2 Modification of Prosody

The pitch of the signal is changed by applying the new pitch mark vector pm' as in equation (2.3). Each pitch mark pm'_j corresponds to a point in the original pm-vector, as shown in Figure 2.5, but with changed distances in between. If a pitch change by a factor k is desired, the new instantaneous period is T'_0 = T_0 / k. The factor k can take different values for each pm'_j, but only with relatively small variations, in order to preserve the quasi-stationarity assumption. A consequence of this pitch shift method is that the duration of the resulting signal is changed inversely proportionally to k. To obtain the desired signal duration, the number of ST-signals must then be changed, which is done by either duplicating or removing some ST-signals before concatenation. The resulting expression for pm' when applying the prosody is therefore

    pm'_j = \sum_{n=1}^{j} \frac{T_{0_{a(n)}}}{k_{a(n)}}, \qquad (2.7)

where a(j) consists of indices of the original ST-signals indexed by i, i.e., the vector a indicates which ST-signals are used. Thus a acts as a transfer function from pm to pm'.

[Figure: the original pitch marks pm_1 ... pm_6 on the t axis are mapped onto the new pitch marks pm'_1 ... pm'_9 on the t' axis]

Figure 2.5: Schematic example of a transfer function for the pitch mark vector regarding pitch and duration modification. Here a = {1, 2, 2, 3, 4, 4, 5, 6, 6}.

An expression for the recombination of the extracted ST-signals, including both pitch and duration modification, can now be obtained by combining (2.3) and (2.7) into

    s'(t) = \sum_{j} s_{a(j)}\!\left(t - \sum_{n=1}^{j} \frac{T_{0_{a(n)}}}{k_{a(n)}}\right). \qquad (2.8)
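The following Matlab sketch illustrates the recombination in (2.7) and (2.8) (a sketch only, not the thesis implementation; the uniform resampling used to build the index vector a, the rounding to whole samples and the variable names are assumptions):

    function y = psola_modify(st, T0, k, d)
    % PSOLA_MODIFY  Recombine ST-signals with new pitch and duration, cf. (2.7)-(2.8).
    %   st - cell array of windowed ST-signals (all assumed non-empty)
    %   T0 - local pitch periods in samples, one per ST-signal
    %   k  - pitch modification factor (>1 raises the pitch)
    %   d  - duration modification factor (>1 lengthens the signal)
        N = numel(st);
        M = max(1, round(N * d * k));       % number of output ST-signals
        a = round(linspace(1, N, M));       % transfer vector a(j): duplicates/removals
        newT0 = round(T0(a) / k);           % new local periods, cf. (2.7)
        pmNew = cumsum(newT0);              % new pitch mark positions
        y = zeros(pmNew(end) + numel(st{a(end)}), 1);
        for j = 1:M
            x   = st{a(j)}(:);
            idx = pmNew(j) + (0:numel(x)-1).';
            y(idx) = y(idx) + x;            % overlap-add at the new pitch marks
        end
        y = y(1:pmNew(end));                % trim trailing zeros
    end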

2.3.3 TD-PSOLA as Speech Synthesizer

The PSOLA operation presented above describes the prosody modification part of a TTS system. An extension of this technique is TD-PSOLA (Time Domain-PSOLA), which functions as a complete speech synthesizer while keeping all operations in the time domain. With this method it is possible to change both pitch and time duration by a factor in the range of 0.5 to 2 without any notable change in the position and bandwidth of the formants [Dut97].

The PSOLA operation used in this TTS system requires information about the pitch mark locations of each segment. This is usually generated in the speech analysis step of the segment data preparation part (see Figure 2.1) by a pitch detection algorithm. However, detecting the fundamental frequency in a signal with a low V/UV value is difficult and sometimes even impossible (when V/UV ≈ 0), and the pitch marking must therefore often partly be done by hand. For purely unvoiced signals no fundamental frequency exists, and the pitch marks are placed at a fixed distance, approximately the average pitch period of the speech segments.

At the point of concatenation, three different types of mismatches can appear as a consequence of varying segment characteristics: pitch, harmonic phase and spectral envelope mismatches. All of these mismatches can lead to a degradation of the quality. In the case of differing pitches, the PSOLA process can eliminate the mismatch by placing the recombination windows equally for both segments. However, since PSOLA is an approximate method, the process changes the spectral properties of the segments; if the pitch then has to be changed rather differently, as in the case of a relatively large pitch difference, a spectral mismatch will appear. A second case of audible mismatch occurs when two voiced signals have harmonics with different corresponding phases. The phases of the fundamental frequency, though, are implicitly equalized through the pitch marking process, since the mark is always placed at the highest peak of the pitch period.

Compensating for the spectral mismatches described above requires operations in the frequency domain. Since the current synthesizer operates in the time domain, a compensation of these mismatches, for example by smoothing, cannot be done. However, in the special case of ST-signals with equal length (which occurs when the original pitch is constant), a spectral envelope smoothing in the time domain can be performed. This is further described in the next section.

2.4 Extension of TD-PSOLA into MBR-PSOLA

An extension of the TD-PSOLA synthesizer has been developed by Dutoit and Leich [DL93], which involves a re-synthesis of the segment database. This extended TTS method, called MBR-PSOLA (Multi-Band Re-synthesis PSOLA), has the purpose of performing a more specific normalization in the pre-processing stage than in the case of TD-PSOLA, and of not requiring additional pitch mark files. The resulting segments are stored, and the same synthesis method can be used as before, together with a quality-improving interpolation block. Figure 2.6 displays this extension from the original TD-PSOLA synthesizer (gray blocks) to an MBR-PSOLA TTS system.

[Figure: Phonemes & Prosody Info and the Segment Information Database feed the speech synthesis chain, to which a Linear Interpolation block is added; the TD-PSOLA Segments pass through an added Segment Re-synthesis block and become MBR-PSOLA Segments, which are used to produce the Speech output]

Figure 2.6: Extension of TD-PSOLA into MBR-PSOLA. The white blocks refer to the added operations.

2.4.1 Re-synthesis Process

The segment re-synthesis process consists of two normalization steps. First, the speech segments are recalculated to achieve a constant pitch throughout the entire database. As a consequence, the window positioning later performed in the PSOLA process can be given one fixed value for all segments, relative to the start of the constant pitch period, and therefore no additional pitch mark information is needed. The second re-synthesis operation comprises a harmonic phase normalization of the voiced signals. These phases are set to fixed values, valid for all segments. The choice of these phase values affects the sound quality considerably: constant or linearly distributed harmonic phases lead to a rather metallic sound, while a better quality is achieved by giving the phases randomly distributed values. Additionally, tests performed in [DL93] have shown that keeping the phases of the high-frequency harmonics at their original values actually improves the quality. Using an upper limit of about 2000 Hz for which harmonics to normalize proved to be the best choice: if a higher value was used, no enhancement was noticed, while a too low value resulted in worse quality.

The method for re-synthesizing the TD-PSOLA segment database using the MBE operations described in the next subsection is shown in Figure 2.7. First, each segment is windowed into ST-signals according to equation (2.2). This, however, requires known pitch marks, and in this case no such information is available. Instead, the window size and position are calculated using a constant F_0 for the whole database, estimated from a rough average of the overall pitch (1/T_{0_{av}}) of the complete corpus. The pitch mark pm_i can hence be replaced with i T_{0_{av}}. Each ST-signal is then parameterized according to the MBE model (described in the next subsection) into

• harmonic amplitudes (sampled spectral envelope),

• harmonic phases, and

• narrowband noise variances.

In the calculation of these parameters, a voiced/unvoiced classification of the signal is included. This information is used to control whether the signal will be modified (in the voiced case) or returned unchanged (in the unvoiced case). Before the final storing of the segments, the normalized ST-signals are concatenated with the OLA (Overlap and Add) method.

[Figure: TD-PSOLA Segments → windowing with w(t) → MBE Parametrization → Parametric Normalization (controlled by the V/UV decision) → Segment Synthesis → OLA → MBR-PSOLA Segments]

Figure 2.7: Segment re-synthesis process using the MBE model.

2.4.2 Spectral Envelope Interpolation

Another benefit of having constant pitch and identical harmonic phases is that spectral matching at the concatenation point of the synthesizer can be performed by a linear interpolation in the time domain between the ST-signals. The normalization implies that this so-called direct temporal interpolation, described below, is equivalent to an interpolation of the spectral envelope [DL93], which is what is wanted. Furthermore, the constant length of the segments simplifies the interpolation by allowing a direct position mapping between the samples or parameters.

If the segments s^L and s^R (the left and right segments) are to be concatenated, the two overlapping ST-signals can be denoted s^L_0 and s^R_0, respectively. Each s^X_n is described by the speech sample or parameter set p^X_n, where X refers to the segment (L or R) and n to its window or ST-signal. The vector p^X_n is of constant length, which is required for the vector operations. Suppose the difference |p^L_0 - p^R_0| is to be divided onto N_L windows on the left segment and N_R on the right; the spectral smoothing can then be expressed as

    p'^{L}_{-i} = p^{L}_{-i} + (p^{R}_0 - p^{L}_0)\, \frac{N_L - i}{2 N_L}, \quad i = 0, 1, \ldots, N_L - 1, \qquad (2.9)

    p'^{R}_{j} = p^{R}_{j} + (p^{L}_0 - p^{R}_0)\, \frac{N_R - j}{2 N_R}, \quad j = 0, 1, \ldots, N_R - 1, \qquad (2.10)

where p'^{L}_{-i} and p'^{R}_{j} denote the new interpolated values of the samples or parameters describing the ST-signals s^L_{-i} and s^R_{j}, respectively.

The optimum number of ST-signals to use for the interpolation, i.e., N_L and N_R, varies between the different segments. It is preferable to avoid ST-signals from the transition part in the spectral smoothing, and since the length of a segment and its transition position vary, a segment-dependent selection of the number of smoothed windows is optimal. Additionally, spectral smoothing is only applied to voiced signals, and the selection of which segments to use can be achieved with the same segment classification. This selection of the number of windows to use, i.e., the segment classification, is based on the V/UV information calculated by the MBE analysis process described in the following subsection.
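As a Matlab sketch of the direct temporal interpolation in (2.9) and (2.10) (illustrative only; it assumes the ST-signals have already been re-synthesized to constant length and are stored column-wise, and the matrix layout and names are assumptions, not the thesis code):

    function [L, R] = interpolate_envelopes(L, R, NL, NR)
    % INTERPOLATE_ENVELOPES  Spectral envelope smoothing by direct temporal
    % interpolation, cf. (2.9)-(2.10).
    %   L, R   - ST-signals (or parameter sets) of the left and right segment,
    %            one column each; L(:,end) is s_0^L and R(:,1) is s_0^R
    %   NL, NR - number of windows to smooth on each side
        d = R(:,1) - L(:,end);                               % p_0^R - p_0^L
        for i = 0:NL-1
            L(:,end-i) = L(:,end-i) + d * (NL - i) / (2*NL); % eq. (2.9)
        end
        for j = 0:NR-1
            R(:,1+j) = R(:,1+j) - d * (NR - j) / (2*NR);     % eq. (2.10)
        end
    end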

2.4.3 Multi-Band Excitation Model

The Multi-Band Excitation (MBE) model was originally designed for speech storage compression in voice codecs [GL88]. It is based on a parameterization of the frequency domain of a speech signal, and since it includes information about the harmonic frequencies it is ideal to use for pitch and phase normalization. Below follows a description of the MBE parameterization of an arbitrary short-time speech signal.

Suppose a voiced ST-signal s_w(t) has the Fourier transform S_w(ω) according to Figure 2.8(a). This frequency-domain signal can be modelled as the product of its spectral envelope H_w(ω) (with phase included) and an excitation spectrum |E_w(ω)| [GL88],

    \hat{S}_w(\omega) = H_w(\omega)\, |E_w(\omega)|. \qquad (2.11)

If the fundamental frequency ω_0 of the signal is known, the excitation spectrum can be expressed as a combination of a periodic spectrum |P_w(ω)|, which is based on ω_0, and a random noise spectrum |U_w(ω)| with variance σ². The periodic spectrum consists of peaks of equal amplitude appearing at the fundamental frequency and its harmonics, as shown in Figure 2.8(c). A frequency band with a width equal to the distance between two harmonic peaks, centred on a harmonic, is defined as a harmonic band. A V/UV analysis is performed on S_w(ω) for each harmonic band and expressed in a binary representation using a threshold value, see Figure 2.8(d). The two spectral signals are combined using the V/UV information to generate |E_w(ω)| by

    |E_w(\omega)| = V/UV(\omega) \cdot |P_w(\omega)| + \bigl(1 - V/UV(\omega)\bigr) \cdot |U_w(\omega)|, \qquad (2.12)

and these different spectrum parts can be seen in Figure 2.8(c)-(f). Figure 2.8(b) displays

Figure 2.8: Example of an MBE modelled signal. (a) Original spectrum, (b) Spectral envelope, (c) Periodic spectrum, (d) V/UV information, (e) Noise spectrum, (f) Excitation spectrum, (g) Synthetic spectrum.

the spectral envelope |H_w(ω)|, which is usually represented by one sample value per harmonic, in both voiced and unvoiced regions, to reduce the number of parameters. Finally, the resulting synthetic signal spectrum Ŝ_w(ω), calculated as described above, can be seen in Figure 2.8(g).

The estimation of the parameters in this method is based on the least-squares error between the synthesized spectrum |Ŝ_w(ω)| and the original spectrum |S_w(ω)|. This approach is usually termed an analysis-by-synthesis method. First, the spectral envelope and the periodic spectrum are estimated in the least-squares sense. Then the V/UV decisions are made by comparing the resulting spectrum to the original for each harmonic band, using a threshold value on the error to determine whether the band is voiced or unvoiced.
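As a rough Matlab illustration of how the per-band V/UV decisions combine the periodic and noise spectra in (2.12) (a sketch only; the sampling of all spectra on a common frequency grid, the band index vector and the names are assumptions, not part of the MBE model as published):

    function E = mbe_excitation(P, U, vuv, bandIdx)
    % MBE_EXCITATION  Combine periodic and noise spectra into the excitation
    % spectrum, cf. (2.12).
    %   P, U    - periodic and noise magnitude spectra on the same frequency grid
    %   vuv     - binary voiced/unvoiced decision per harmonic band
    %   bandIdx - for every frequency bin, the harmonic band it belongs to
        vuvPerBin = reshape(vuv(bandIdx), size(P));    % expand band decisions to bins
        E = vuvPerBin .* P + (1 - vuvPerBin) .* U;
    end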

2.4.4 Benefits with the respective PSOLA Methods

TD-PSOLA:

• High naturalness of the synthesized speech because of 'untouched' segments.
• Less sensitive to analysis errors regarding V/UV classification [Dut94].
• Simpler data preparation step.

MBR-PSOLA:

• No mismatch in harmonic phase and pitch.
• No external pitch marks needed, implicitly calculated.
• Simple spectral smoothing possible.
• Good database compression potential.

2.5 Utilized Data from External TTS Projects

There exist numerous companies and universities researching and offering products in the area of TTS systems. The availability of these results varies between owners, but in most cases the solutions are proprietary. The research projects mentioned in this section all, to some extent, underlie the TTS system investigated in this thesis.

2.5.1 Festival and FestVox

The CSTR (Centre for Speech Technology Research) is an interdisciplinary research centre at the University of Edinburgh. One of their project products, the Festival Speech Synthesis System, contains a full concatenative-based TTS system with different synthesis methods implemented. Except for the PSOLA-based synthesizer, the software for the various TTS systems is distributed under a free license [Fesa]. The latest version is Festival 2.0, which has been developed for a number of languages: British and American English, Spanish and Welsh.

A further improvement of the Festival TTS system has been developed at Carnegie Mellon University (CMU) through their FestVox project. The aim of this project is to make the building of new synthetic voices more systematic and better documented. FestVox 2.0 is the latest version, released in January 2003 [Fesb], with software that is free to use without restrictions, for both commercial and non-commercial purposes. The resources involved are presented by FestVox on their homepage and contain, among other things, two voice databases consisting of all possible diphones for American and British English. They were developed by CMU and CSTR, respectively, and include waveforms, laryngograph (EGG) files, hand-corrected labels of start, stop and transition points, and extracted pitch marks. The pitch marks are not hand corrected and thus not completely reliable.

The data used in this thesis are extracted from the British database called CSTR UK RAB Diphone. (A detailed description of the licence is attached in Appendix C.) The data in question are as follows:

1. Recorded speech corpus spoken by one British male speaker covering all possible diphones for British English. The data is stored as wave-files comprising a set of 2001 nonsense words with a sampling rate of 16 kHz and a precision of 16 bits.

2. A list of all diphones described in MRPA notation, comprising 2005 items. Each diphone is complemented with information about the corresponding sound segment, consisting of the name of the wave file, labels with the start and stop positions of the diphone segment, and a label for its transition point. The position labels are hand corrected and expressed in seconds with three-decimal precision.

3. Data files with extracted pitch marks. Each pitch mark file corresponds to one wave file, and hence 2001 such files exist. The pitch mark positions are expressed in seconds with seven-decimal precision.

The notation of the diphones used in the segment information list (point 2 in the list above) follows the MRPA scheme described in Appendix B. Additional information has been included in the form of three extra symbols: #, a consonant-cluster marker and $. The first corresponds to a short silence, while the second indicates a consonant cluster, i.e., two consonants appearing within a word instead of between two words. This can be exemplified by the notation t - r meaning the /tr/ as in 'true' and not as in 'fat rat'. The last character, $, is investigated in subsection 3.2.2 and found to symbolize a word border between a plosive and a vowel.

2.5.2 MBROLA

Another partly freely available TTS system is provided by the MBROLA project, initiated by the TCTS Lab of the Faculté Polytechnique de Mons in Belgium [MBR]. A product with the same name as the project, MBROLA, has been developed, consisting of a speech synthesizer based on the MBR-PSOLA technique. It takes a list of phonemes as input, together with prosodic information consisting of phoneme durations and a pitch description. The design of the system requires a diphone database, but apart from this it can operate on any voice and language, assuming the defined input format. Since the starting level is phonetic notation, MBROLA is a phoneme-to-speech system rather than a complete TTS system.

The MBROLA synthesizer is provided free of charge only for non-commercial and non-military applications. Originally it came with one single segment database, a French-speaking male voice. Today the system is available for several different languages and voices through cooperation with other research labs and companies contributing their diphone databases. The download package from MBROLA includes example sentences to use as input and an executable file. No source code is available, since the algorithm is protected. The example sentences are stored in the format

    phoneme  duration  pitch  position  pitch  position  ...

with SAMPA notation for the phonemes and the duration expressed in milliseconds. The position of each (optional) pitch definition [Hz] is given in percent of the specified phoneme duration. Several pitch definitions can appear, and a linear pitch interpolation between the pitch positions is then intended. In total there are three different test files, which together comprise almost 30 seconds of speech.
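As a small Matlab sketch of reading one line of this input format (illustrative only; the ordering of the pitch and position values within each pair follows the line above, and the function name and the example line are assumptions, not taken from the MBROLA documentation):

    function [phoneme, durationMs, pitchPairs] = parse_input_line(line)
    % PARSE_INPUT_LINE  Read a phoneme, its duration in milliseconds and any
    % number of pitch/position value pairs from one whitespace-separated line.
        parts      = strsplit(strtrim(line));      % split on whitespace
        phoneme    = parts{1};
        durationMs = str2double(parts{2});
        values     = str2double(parts(3:end));     % remaining numbers, if any
        pitchPairs = reshape(values, 2, []).';     % one row per pitch definition
    end

    % Hypothetical example: parse_input_line('i: 110 120 50')
    % gives phoneme = 'i:', durationMs = 110 and one pair [120 50].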

Chapter 3

Implementation

As the implementation tool for the TTS system designed in this thesis, Matlab is used throughout, based on Matlab version 7.0. The synthesizer follows the TD-PSOLA model presented in the previous chapter, with diphones as the segment format, and combines the Festival diphone databases with the input format of MBROLA using SAMPA notation. The original corpus is stored as wave files with a sampling frequency of 16 kHz. This frequency is kept during the synthesis process, and the output is stored as a wave file. In this chapter the implementation is described step by step, divided into a Segment Data Preparation part and a Speech Synthesis part. When a phoneme is given by its phonetic description, the MRPA alphabet is intended unless another phonetic alphabet is explicitly stated.

3.1 Segment Data Preparation

The segment data preparation process operates on the three Festival databases listed in subsection 2.5.1. It creates the Segment Information database and the Synthesis Segment database, the latter consisting of diphone segments and pitch mark vectors. In principle, the operation model follows the structure displayed in Figure 2.1. The difference in this case is that the pitch mark vectors already exist and are used to define the cutting points in the speech corpus. Furthermore, information about where each diphone appears and about its transition point is known. The process in this case is better described by Figure 3.1, where the pitch mark operation block corresponds to the Speech Analysis block in Figure 2.1, and the segment information operations and the diphone extraction block are part of the Selective Segmentation block. The databases denoted Pitch Marks and Diphone Segments together represent the Synthesis Segment database block in Figure 2.1. Below follows a description of the modifications performed on the three original databases.

3.1.1 Segment Information Modification

The segment information file available from Festival contains data about each diphone in the format

diphone file name start point transition point end point


Figure 3.1: Block scheme of the segment data preparation process performed on Festival’s databases. (The blocks comprise the original Segment Information, Speech Corpus and Pitch Marks databases, the exclusion of unneeded diphones, valid pitch mark extraction, diphone extraction, diphone information recalculation and energy equalization, and the modified Segment Information, Pitch Marks and Diphone Segments databases.)

where the diphones are noted in MRPA with the character '-' as separation between the phones. The given points are expressed in seconds and indicate where in the stated file the diphone appears.

As will be described in subsection 3.2.2, a special character is used for denoting different versions of a recorded segment. It is only added to phones with a length of one character and, in almost all cases, only on one side of the diphone. The exceptions are the notations K-R, where K ∈ [k,p,t] and R ∈ [r,w,y,l]. To simplify the TTS system these diphones are removed, and each phone part is thereby restricted to at most two characters. This simplification does not affect the result considerably, since the first phone part of these segments is hardly audible and was therefore probably originally intended only for special cases. Before the removal, though, the segment addresses of these special versions overwrite the addresses of the corresponding plain diphones K-R, i.e., their sound files are used for K-R.

When the diphones are extracted from the speech corpus and stored in new files (see subsection 3.1.3), the start point is no longer needed in the segment information file, and the file addresses are changed. Instead of the start points, the position of the first pitch mark is included. This information is used in the synthesis process when the time duration of the diphone is calculated. Since one period time T0 is overlapped in the OLA process, this time loss has to be considered to obtain the correct length of the output.

3.1.2 Pitch Marks Modification

If the diphones are cut at the positions of pitch marks, the phase of the fundamental frequency is the same at the start point for all segments. The defined cutting points are therefore given the position of the closest pitch mark, which is not the case in the original information file. The pitch mark database contains pitch marks for the whole speech corpus, i.e., also for the parts outside a diphone region. The pitch marks valid for each diphone are extracted and stored in the modified pitch mark database.

3.1.3 Speech Corpus Modification

The modified start and end points, which appear at pitch mark positions, are used for the extraction of the diphones. As can be seen in Figure 3.1, an energy equalization is performed before the final storing. This equalization focuses on each group of related phones by setting the pitch periods to be concatenated, i.e., the first or last period of the diphone, to the average value for each phoneme. For some phones, as in the case of plosives, there is a short silence before and after the pronounced part of the phone. An equalization of these phones is therefore not realistic, and after a subjective energy analysis also the two fricatives th and dh are excluded, since they comprise the same signal characteristics as plosives. The different versions of each phoneme, i.e., a phoneme with or without the additional characters described in subsection 3.2.2, are classified as the same phoneme in the equalization calculations.

The equalization is linearly distributed at sample level over each diphone. As an example, if the diphone m-i: of sample length L is to be equalized with a factor a for m and b for i:, each sample value s(n) is multiplied by the factor a + ((b − a)/L) · n.
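A minimal Matlab sketch of this linear energy equalization is given below; the segment and the factors a and b are hypothetical example values, and only the multiplication with the factor a + ((b − a)/L)·n is taken from the description above.

% Sketch of the linearly distributed energy equalization of one diphone.
s = randn(1, 2000);          % placeholder for a diphone waveform
a = 0.8;                     % equalization factor for the first phone (example)
b = 1.2;                     % equalization factor for the second phone (example)

L = length(s);
n = 0:L-1;                   % sample index
g = a + (b - a) / L .* n;    % linearly interpolated gain over the segment
s_eq = s .* g;               % equalized segment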

3.1.4 Additive Modifications

When the pitch marks are displayed together with their corresponding diphone segments, several mismatches are found, referring to missing or considerably misplaced pitch marks. About ten percent of the segments contain two consecutive pitch periods whose lengths differ by a factor of 2 or more, i.e., one of the pitch periods is at least twice as long as the other one. The pitch marks of these segments are considered unreliable and are therefore adjusted by hand. It is also found that the second phone part of the diphone m-p does not comprise a whole pitch period, and hence it is lengthened by one period time. During the quality evaluation it was found that the diphone au-@@ was incorrectly pitch marked (see 4.1.1), and hence it is also corrected by hand.
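This consistency check of the pitch marks can be sketched as follows, where the example pitch mark vector is hypothetical and only the ratio-of-two criterion is taken from the text above.

% Sketch: flag a segment whose pitch mark vector pm (sample indices)
% contains two consecutive pitch periods with a length ratio >= 2.
pm = [10 160 310 620 770];        % example pitch mark positions (samples)

T = diff(pm);                     % pitch period lengths
ratio = max(T(2:end) ./ T(1:end-1), T(1:end-1) ./ T(2:end));
suspect = any(ratio >= 2);        % true if the marks should be checked by hand

if suspect
    disp('Pitch marks of this segment should be inspected by hand.');
end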

3.2 Speech Synthesis

The Speech Synthesis process is divided into the four main blocks presented in Figure 2.3. In this thesis the signals are not coded into a parametric form, and hence the Waveform Generator block can be excluded. The Segment File Collector simply reads the stored data and includes no further operations; it is hence not described in a subsection below.

3.2.1 Input Format

The structure of the input for this TTS system is based on the MBROLA format described in subsection 2.5.2. The only data required is the phonemes expressed in SAMPA; the other parts (duration and pitch definition) are optional, though the units of the values must be as described. If the pitch information is missing, the original frequencies are used, without any interpolation between the varying pitches of the diphones. Input without duration information is only allowed if the pitch definitions are missing as well. In that case the duration of each phoneme is given a certain value, as described in the following subsection.

Another optional input is the information about the borders between the words. If a small pause is intended, the pause character of the SAMPA notation (mapped to # in MRPA) can be used. In the case of a word border without a pause, the system identifies the border by the character ',' in the input sequence. This word border information can improve the quality of the result, since some diphones are pronounced differently depending on whether they appear within a word or at the border between two words. The diphones concerned are presented later in this chapter.

3.2.2 Segment List Generator

The operation process of the Segment List Generator part of the synthesizer consists of the following six steps:

1. If the input does not include duration values for the phonemes, certain default values are given, see below.

2. Mapping of the SAMPA-denoted phonetic input into MRPA notation according to the lexicon in Appendix B. The result is transcribed into diphone format.

3. Applying the additive notations (the within-word marker and $) for the different diphone positions, as described below.

4. Reading the information corresponding to the current diphones from the Segment Information database.

5. Calculation of the distribution of the desired duration between the two current phone parts as explained below.

6. Expressing the positions of the input frequencies relative to a diphone instead of a phone. For a description, see the last part of this section.

Default Values for Phoneme Duration

If the input only consists of phonetic notation, the durations of the phonemes are given default values. These values are based on the duration values used in the French TTS test described in [Dut94], where the French phonemes are given the following durations: [a,E,9,i,O,y,u] = 70 ms, [e,2,o,a∼,o∼,e∼] = 170 ms, fricatives = 100 ms, sonorant liquids = 80 ms, sonorant nasals = 80 ms, plosives = 100 ms. The phonemes in the square brackets are denoted with the French SAMPA notation and describe the

French vowels. These are listed in [GMW97] and are freely interpreted for English such that the first set corresponds to the vowels classified as checked vowels together with the central vowel, while the second matches the free vowels; see Appendix A for this phoneme classification. The phone groups affricates and sonorant glides are not defined and are freely interpreted as having the same values as the fricatives and the two sonorant groups, respectively. Finally, the pause character is given the duration value 150 ms.

Applying the described duration values results in a rather slow sounding word or sentence. A halving of these values results in a more natural speech rhythm. The resulting duration values are summarized in Table 3.1 below.

Duration   Phoneme Group
50 ms      fricatives, plosives, affricates
40 ms      sonorants (liquids, nasals and glides)
35 ms      checked and central vowels
85 ms      free vowels
150 ms     pauses

Table 3.1: Default value for the duration of each phoneme group.
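These default values can be gathered in a small lookup function as sketched below; the group labels are assumed names for the classes of Table 3.1 and not identifiers from the actual implementation.

function d = default_duration(group)
% Sketch: default phoneme duration in ms for a given (assumed) group label.
% Example: default_duration('plosive') returns 50.
switch group
    case {'fricative', 'plosive', 'affricate'}
        d = 50;
    case {'liquid', 'nasal', 'glide'}          % the sonorant groups
        d = 40;
    case {'checked_vowel', 'central_vowel'}
        d = 35;
    case 'free_vowel'
        d = 85;
    case 'pause'
        d = 150;
    otherwise
        error('Unknown phoneme group: %s', group);
end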

Additional Diphone Notations

For some diphones there exist two different recorded speech segments in the database obtained from Festival. The different versions are separated in the diphone description by one of two additional MRPA characters, depending on the content. As mentioned in subsection 2.5.1, the first character indicates a diphone consisting of consonants that appears within a word rather than between two words. The second character, $, always appears in combination with a vowel, and its purpose was found, through listening tests of the diphone segments, to indicate a word border between a vowel and a plosive. Notice the opposite events indicated by the two additional characters: the first indicates a diphone within a word, while $ refers to a diphone between two words. The additional diphone segments included in this thesis are the following combinations:

1. [plosive] - [sonorant liquid or glide],

2. [plosive]$ - [vowel] with the vowel @ excluded, and

3. [vowel] - $[plosive] with @ and t excluded.

In the case of no information about the word borders, the special within-word character according to case 1 above is still added to the diphone segments satisfying that condition. The audible difference between two diphone segments with and without this additional notation is explained in section 4.2. In the original database from Festival there exist more diphone combinations including the two additional characters than the three cases presented above. These combinations, however, do not give any important audible difference in the resulting speech output and are excluded from this TTS system.

Duration Distribution

The desired phone duration is distributed proportionally over the diphone segments involved, i.e., over the phone parts of the segments to be concatenated. This can be described through the example of having a phoneme a as input, with a desired duration d, where a is to be synthesized by the two diphones m-a and a-t. The length of each diphone segment is read from the Segment Information database, together with the time positions of the first pitch mark and the transition point. The time lengths of the phone parts of the two diphones can be denoted m2|a1 and a2|t1, respectively. The first part of a diphone is calculated as the time between the first pitch mark and the transition point, and the last part corresponds to the remaining time from the transition point to the end of the segment. The duration d is then applied by changing a1 and a2 into the new time lengths of the phone parts, denoted a1' and a2'. The proportional distribution can then be expressed as

a1'/a2' = a1/a2,   and   d = a1' + a2'.
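A small Matlab sketch of this proportional distribution is given below, with hypothetical example values for a1, a2 and d.

% Sketch of the proportional duration distribution for a phoneme 'a'
% synthesized from the diphones m-a and a-t.
a1 = 0.040;          % length of the 'a' part of m-a in seconds (example)
a2 = 0.060;          % length of the 'a' part of a-t in seconds (example)
d  = 0.080;          % desired duration of the phoneme 'a' in seconds

a1_new = d * a1 / (a1 + a2);   % keeps a1'/a2' = a1/a2
a2_new = d * a2 / (a1 + a2);   % and a1' + a2' = d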

Frequency Distribution

The positions of the defined frequency marks are at the input denoted in percent of the duration of a phoneme. Each phoneme can have zero, one or several frequency marks, and between two marks an interpolation is intended. Two different interpolation methods have been implemented, requiring different data calculated from this operation block - a Local and a Comprehensive frequency interpolation. The latter method is selected for the final TTS synthesizer, and in the evaluation subsection 4.2.2 both methods are compared.

The Comprehensive Frequency Interpolation method consists of a purely linear interpolation between the frequency marks. First the percentage expression is recalculated into a time notation for the phonemes, using the time data from the Segment Information database. Two vectors are then created, consisting of the positions in time of each frequency mark and of each boundary point, respectively, expressed from the first point of the very first diphone. By a simple linear interpolation, the desired frequencies at the beginning and end of each diphone segment are calculated and transferred further. In other words, the last point of one segment is given the same value as the first point of the next. For diphones that include frequency marks, the positions of these marks are also appended, expressed in time from the beginning of the diphone segment.

In the case when no frequency information is available, the frequency marks are based on the frequencies of the stored segments. Since the position of the first pitch mark is included in the Segment Information database, the initial frequency of each segment is known. This value is set as the beginning frequency of a diphone and copied to the end of the previous one. The interpolation in the Prosody Modification step can then be applied without further changes for this case of no frequency information.

The Local Frequency Interpolation, chosen not to be used, is based on an interpolation only within a diphone. If a diphone does not include a frequency mark, the original frequencies are used. An example of this mapping from a phone-based to a diphone-based positioning of a frequency mark is modelled in Figure 3.2. The information that a mark exists in the previous diphone is applied by copying the value of this mark and placing

it at the beginning of the current and at the end of the old diphone, i.e., around the cutting point as shown in the example figure. The broken line shows the intended future interpolation.

Figure 3.2: Example of a mapping by local interpolation of the frequency marks f1, f2 and f3 from phone into diphone based positioning. Each dot corresponds to a frequency mark and the broken line displays the intended interpolation. (The figure shows two phones X and Y, divided at the cutting point into diphones A and B, with the transition points marked.)
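A sketch of the comprehensive interpolation step is given below, where the frequency marks, already converted from percent into absolute time, are linearly interpolated onto the diphone boundary points. The example values are hypothetical, and the clamping of boundary points lying outside the first or last mark is an assumption not stated above.

% Sketch of the comprehensive (purely linear) frequency interpolation.
% markTime and markHz are the frequency marks expressed in absolute time
% from the first point of the very first diphone (example values).
markTime  = [0.02 0.10 0.25 0.40];     % s
markHz    = [120  140  115  130];      % Hz
boundTime = [0 0.08 0.17 0.30 0.45];   % diphone boundary points, s

boundHz = interp1(markTime, markHz, boundTime, 'linear');
% Boundary points before the first or after the last mark are simply
% clamped to the nearest mark value (an assumption made for this sketch).
boundHz(boundTime <= markTime(1))   = markHz(1);
boundHz(boundTime >= markTime(end)) = markHz(end);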

3.2.3 Prosody Modification

The prosody modification is performed with the PSOLA method described in section 2.3. The desired frequencies and durations are applied by generating a new pitch mark vector according to (2.7), as described below. Finally, the PSOLA decomposition and recombination is performed.

When frequency marks exist for the current segment, the new pitch mark vector is calculated keeping the length of the original vector. The period times corresponding to the desired frequency values are placed according to their mark positions, and a linear interpolation is performed between the values. In the frequency application method chosen for the synthesizer, all positions in the pitch mark vector are automatically given a value. In the second method, however, special cases must be considered. If the frequency marks appear on only one side of the transition point, as for Diphone A in Figure 3.2, the original period time values are used for the non-interpolated part, starting from the boundary point. In the case of no frequency marks at all, the whole original pitch mark vector is used as the new vector.

Once the new pitch mark vector is calculated, the defined duration is applied. First, the relation r between the desired and the current duration is calculated for both phone parts, where the current duration is obtained from the last element of the new pitch mark vector. The relation r is then used for distributing the original pitch period indices into the transfer vector a in (2.7), by dividing the distance from the first to the last index into numbers separated by r and rounding them to integer values. For phone parts consisting of three or more pitch periods, the pitch period including the boundary point is excluded in the creation of the transfer vector a and then added after the distribution of the indices. See subsection 4.2.3 in the next chapter for a justification of this operation.
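One possible reading of the construction of the new pitch mark vector is sketched below: the desired period times are interpolated over the same number of marks as in the original vector and then accumulated into new mark positions. The example values and the exact placement of the frequency marks are assumptions.

% Sketch: build a new pitch mark vector with the same number of marks as
% the original, from desired frequencies given at some of the marks.
pm_orig = [0 140 290 430 580 720 870];   % original pitch marks (samples)
fs      = 16000;                         % sampling frequency in Hz
markIdx = [1 4 7];                       % marks where a frequency is defined
markHz  = [110 130 120];                 % desired frequencies at those marks

N      = length(pm_orig);
T_new  = round(fs ./ interp1(markIdx, markHz, 1:N, 'linear'));  % period times
pm_new = pm_orig(1) + [0 cumsum(T_new(2:end))];                 % new positions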

The decomposition of a segment into ST-signals, equation (2.2), is performed with a Hanning [Den98] window positioned by the original pitch mark vector. The size of the window is twice the instantaneous period T0, which refers to the pitch period calculated from the previous pitch mark to the current, window-centering mark. The extracted ST-signals are then recombined using the new pitch mark vector with the mapped indices a included.
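The decomposition and recombination can be sketched as follows, with a Hanning window of twice the local pitch period centred at each original pitch mark and the windowed ST-signals overlap-added at the new pitch mark positions. The mark vectors and the index mapping a are hypothetical example values, and the first mark is skipped for simplicity.

% Sketch of the PSOLA decomposition/recombination with a Hanning window.
x      = randn(1, 1200);          % placeholder for a diphone segment
pm     = [150 300 450 600 750];   % original pitch marks (samples, examples)
pm_new = [150 290 430 570 710];   % new pitch marks (samples, examples)
a      = [1 2 3 4 5];             % mapping of original ST-signals (example)

y = zeros(1, max(pm_new) + 400);
for k = 2:length(pm_new)
    i   = a(k);                                       % ST-signal to re-use
    T0  = pm(i) - pm(i-1);                            % local pitch period
    n   = -T0:T0-1;                                   % window span of 2*T0
    w   = 0.5 * (1 - cos(2*pi*(0:2*T0-1)/(2*T0-1)));  % Hanning window
    st  = x(pm(i) + n) .* w;                          % extracted ST-signal
    pos = pm_new(k) + n;                              % overlap-add position
    y(pos) = y(pos) + st;
end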

3.2.4 Segment Concatenation

Since the segments are windowed with a Hanning window around each pitch mark and then added by the OLA method, both ends of the segments remain smoothed, having the value 0 at the very first and last points. The concatenation between two diphones is performed using the same technique as for the duration modification, i.e., the OLA, where the ST-signals to be concatenated are the first and last pitch periods. This requires a delay of one diphone (see the Segment Concatenation subsection in 2.2.1), and the overlap is positioned such that the last point of the older segment is added onto the first pitch mark point of the newer.

Chapter 4

Evaluation

An evaluation of a TTS system can be performed from several aspects. The main evaluation issues relevant for an embedded-system implementation area, as is the case in this thesis, are:

• Storage properties – The size of the database needed, which, besides the amount of information, also depends on the coding possibilities of the data and the choice of codec.

• Computational complexity – Number and type of operations per sample needed for synthesizing a text.

• Usability – Investigation of suitable application areas, such as reading e-mails or items in a menu, and the possibility to extend the TTS system to function for other languages than English.

• Speech quality – The perceived characteristics of the synthesized speech in the aspects of intelligibility, fluidity and naturalness.

• Implementation costs – The time needed to develop the system, including the collection of data, and the cost of the required hardware devices.

The TTS synthesizer developed in this thesis is designed to suit the defined implementation area. Since this is a feasibility study on a TTS system for embedded devices, it cannot be definitively evaluated with respect to all the issues above. The evaluation presented in this chapter refers to the quality of the synthesized speech and of the chosen solution. Because of the restricted time frame for this project, an extensive quality evaluation cannot be performed. For an overall quality judgment of a speech synthesis system, a MOS (Mean Opinion Score) [RM95] analysis is usually performed, which requires many listeners for a reliable result. The listener grades the quality in the three categories – intelligibility, fluidity and naturalness – on a five-level MOS scale from 1 to 5, where 1 corresponds to bad and 5 to excellent.

The audible evaluations described in this chapter are performed by only one listener, i.e., the author. The quality conclusions are based on relatively clear differences and are assumed to hold for an arbitrary listener. Additionally, the judgment of the quality can only be relative and not absolute. This arises from the lack of an original or maximum

quality reference signal, and therefore only a comparison between different synthetic speech signals can be performed. The intelligibility of the evaluated speech signals is very good in all cases and is hence not generally mentioned in the quality judgments.

4.1 Analysis of the Segment Database and Input Data

4.1.1 Pitch Marks

As mentioned in the previous chapter, the pitch marks obtained from Festival are not always correct. In the implementation part, an analysis is performed of the relation between two consecutive pitch periods, and in the cases where a diphone contains a difference ratio of 2 or more, the pitch mark files are corrected by hand. A lot of errors still exist in the database, resulting in audible mismatches in the synthesized speech. One example of this can be seen in Figure 4.1, where the upper signal shows the diphone segment au-@@ and the lower a part of a sentence created with the speech synthesizer containing the current diphone. The dotted lines display the pitch marks, and the misplacement can clearly be seen as an irregular signal pattern in the lower figure, which audibly results in an annoying disturbance.

Figure 4.1: Pitch misplacement. (a) The diphone au-@@ with corresponding pitch marks. (b) Same diphone with prosody modification.

4.1.2 Spectral Mismatch

In the generated synthetic speech files a major spectral mismatch has been found. It appears in the combination of the two diphones k$-au and au-@@ and results in an audibly annoying disturbance. The two diphone segments are presented in Figure 4.2(a) with their end parts to be concatenated visible. The result of the concatenation can be seen in Figure 4.2(b). When the two segments are listened to separately, the quality sounds good, and therefore it can be concluded that the spectral mismatch is the reason for the disturbance.

Figure 4.2: (a) Parts of the two diphones k$-au and au-@@. (b) Concatenation of the diphones with prosody modification included.

4.1.3 Fundamental Frequencies

The speech corpus that the diphone segments are extracted from is spoken with a somewhat hoarse voice and at a relatively low frequency. The latter can be seen in the histogram in Figure 4.3, where the pitch periods for the whole segment database are included, resulting in an average value for the fundamental frequencies of 109 Hz. The values in the histogram are based on the pitch marks, and since it is found that these comprise some misleading values, the result is not fully reliable. If the errors are assumed symmetrically distributed, the average value would be the same as for a perfect analysis. The variance of the distribution of the real frequency values is, however, presumably somewhat smaller.

In the test data received from MBROLA, the average fundamental frequency is much higher than that of the recorded diphone segments. Figure 4.4 displays the distribution of the desired frequency values, having an average value of 140 Hz. The duration of time that each frequency is intended to have is not considered as in the previous histogram, but the spread of the values is still representative.

Figure 4.3: Histogram of the fundamental frequencies of the diphone segment database (mean value = 108.6 Hz).

Figure 4.4: Histogram of the fundamental frequencies of the test files (mean value = 139.6 Hz).
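The fundamental frequencies summarized in the histograms can be computed from the pitch mark vectors as in the following sketch, assuming one pitch mark vector per segment expressed in sample indices.

% Sketch: fundamental frequencies from pitch mark vectors (sample indices).
fs  = 16000;                               % sampling frequency in Hz
pms = {[10 155 300 446], [22 180 335]};    % example pitch mark vectors

f0 = [];
for k = 1:length(pms)
    T  = diff(pms{k});                     % pitch periods in samples
    f0 = [f0, fs ./ T];                    % fundamental frequencies in Hz
end
meanF0 = mean(f0);                         % compare with Figure 4.3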

4.2 Solution Analysis

4.2.1 Window Choice for the ST-signal Extraction

If the PSOLA operations are performed on a segment without any prosody modifications involved, meaning pm′ = pm, the output signal should be a good approximation of the input1. A rather common window to use in the decomposition process is the Hanning window, and the outcome of this window choice can be evaluated by using a step function as input. The resulting output signal can be seen in Figure 4.5, together with the results of using two other common window functions, the Hamming and the Blackman. The pitch periods are given a constant value corresponding to a fundamental frequency of about the average pitch of the stored segments. With a sampling frequency of 16 kHz this results in a window length of 145 samples and a position shift of the windows of 290 samples.

In the case of the Hamming window a slight amplification can be seen, together with sharp ends of the windows. If an ST-signal were extracted with this window, a discontinuity would appear at its ends. The step response using a Blackman window function results, on the other hand, in a substantial oscillation. To fulfil the desire of a good approximation of the input, the chosen window must be symmetrical in amplitude when no bias level is included. This is the case for the Hanning window, resulting in a constant output amplitude with the value 1, as can be seen in Figure 4.5.

1 The terms input and output in this section refer to the signal before and after the PSOLA operations are applied.

Figure 4.5: Step response of the PSOLA method with three different windows (Hanning, Hamming and Blackman).
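The step response test can be reproduced with a sketch of the following kind, where the PSOLA decomposition and recombination are applied to a constant signal with unchanged pitch marks (pm′ = pm). The pitch period and signal length are example values chosen near the average pitch of the database.

% Sketch: reconstruction of a constant (step) signal when pm' = pm.
fs  = 16000;
T0  = 145;                        % assumed constant pitch period in samples
x   = ones(1, 20*T0);             % step/constant input signal
pm  = T0+1 : T0 : length(x)-T0;   % pitch marks every T0 samples

N = 2*T0;                                    % window length of 2*T0
w = 0.5 * (1 - cos(2*pi*(0:N-1)/(N-1)));     % Hanning window

y = zeros(size(x));
for k = 1:length(pm)
    idx = pm(k) + (-T0:T0-1);     % samples covered by this window
    y(idx) = y(idx) + x(idx) .* w;
end
% With the Hanning window, y stays approximately 1 in the interior of the
% covered region, illustrating the symmetry in amplitude discussed above.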

4.2.2 Frequency Modification

As described in the Implementation chapter, two different methods for the application of the frequency marks have been implemented. The result of these interpolation methods can be seen in Figure 4.6, where the resulting fundamental frequencies are displayed for each pitch period of a certain sentence.

Local Frequency Interpolation

The first method, where interpolation is only performed within a diphone, results in a rather unnatural sounding speech signal with low fluidity. It has even worse quality than in the case of no frequency modification at all. The reason for this poor quality is most probably the major frequency difference between the recorded speech segments and the input data, see Figures 4.3 and 4.4, which results in relatively fast pitch changes. When the desired frequency marks are down-scaled with a factor of 0.85, the quality of the output is considerably better.


Figure 4.6: Frequency periods for the sentence I think it’s not impossible. (a) Comprehensive frequency interpolation. (b) Frequency interpolation between original values. (c) No frequency modification. (d) Local frequency interpolation.

This factor is arbitrarily chosen with the single purpose of reducing the pitch change. The consequence is that the described local frequency interpolation can still be a useful method.

The benefit of this method is that in the Segment List Generation step no information about the future diphone is required, i.e., no non-causal operation is needed. Secondly, if no frequency marks exist for several consecutive diphones, the natural frequency variation is preserved. This can be compared to the comprehensive interpolation case, where the pitch is kept approximately constant when frequency marks are missing, resulting in a monotone signal.

Comprehensive Frequency Interpolation

The method consisting of an interpolation between all pitch marks is the best solution for the pitch modification found in this project. It gives a high fluidity, and only at relatively high frequency changes does the speech signal lose some naturalness. The drawback

with a comprehensive frequency interpolation is that it requires a good prosody generating method in the Linguistic Analysis module to avoid monotony and keep the naturalness. The MBROLA test data fulfils the conditions for good prosodic information, but since it contains some relatively high frequencies a slight metallic sound is observed.

Frequency Interpolation between Original Values

Finally, an analysis is performed of the frequency interpolation case where only the frequencies of the recorded segments are considered, as described in subsection 3.2.2. The quality is lower than in the comprehensive interpolation case in terms of naturalness, but the fluidity is still high. The output signal is compared to the case of no interpolation at all, which results in a slightly better fluidity for the interpolated signal but a somewhat worse naturalness. The latter is probably a consequence of some phones being forced, because of the interpolation, to frequencies that are not realistic for that phoneme.

4.2.3 Duration Modification

The system is designed for receiving information about the duration of each phoneme. However, as mentioned in 3.2.2, default values for the duration are applied when this information is missing. The resulting speech output with these applied is still intelligible, but the naturalness and especially the fluidity are decreased. If the original durations of the stored segments are used instead of the default values, the result sounds even less natural and fluid, while the intelligibility is kept at about the same level as for the default values.

In the distribution of the change in duration, the pitch period including the boundary point is excluded and then added after the distribution operation. This is favourable for both compression and stretching of the speech segment. If a signal is to be shortened, i.e., some pitch periods will be removed, it is preferable to keep the transition part of the signal for a smoother phoneme changeover, and this is guaranteed with this solution. On the contrary, for a time stretch of the signal the pitch periods are duplicated, and a repetition of the transition part results, for some diphones, in a decrease in naturalness. An example of this is shown in Figure 4.7 below, where the diphone -aI is stretched without the special consideration of the transition point. This results in an audible ’pre-echo’ of the phone aI, as can be seen in the figure.


Figure 4.7: Transition part of the stretched diphone -aI with boundary point included in duration distribution.

4.2.4 Word Border Information

When no information about the borders between the words exists, the diphone segments representing an appearance within a word are used. This leads to a minor degradation in naturalness of the synthesized speech, present as a slight slur. Since the additional diphones only exist for some combinations (see subsection 3.2.2), this quality decrease occurs rather seldom.

Chapter 5

Discussion

5.1 Conclusions

The quality of a speech signal generated by the developed TTS synthesizer is rather satisfying when full prosodic information is received. Still, further mismatches and program errors can be found, since not all possible phone and diphone combinations have been investigated.

In the previous chapter it is described that the local interpolation method results in an even worse quality than the case of not performing any frequency modifications at all. This local interpolation method combines original and desired frequencies, and since these pitches differ considerably it results in fast frequency variations. The condition for using the PSOLA method (section 2.3) is to have a quasi-stationary signal. This assumption does not hold in this case, which is probably the reason for the quality decrease. The conclusion is that a frequency mismatch, as in the non-modifying case, is better than fast frequency changes.

Another subjective conclusion is the importance of well-defined durations in contrast to frequencies. Whole sentences can be created with relatively good quality without any desired frequencies as input, as long as the durations are well defined. This holds both with and without an interpolation between the original frequency values. The result is slightly more fluent with the interpolation applied, but without this operation the naturalness is somewhat higher.

The application of the default duration values does, as stated, not generate a satisfying result. However, when a single word is to be synthesized these values can give a fairly good result for most inputs. This requires the interpolation between the original frequencies to be used in order to avoid a too poor fluidity.

A drawback of the speech data that this system is based on is the hoarseness of the recorded voice. The hoarseness is audibly easy to distinguish and is unfortunately also amplified by the PSOLA process. Tests with different filters have been performed to reduce this hoarseness, but so far no significant improvement has been found. A satisfying reduction of the hoarse part of the voice is only achieved when a rather high frequency change is applied, as in the comprehensive interpolation case shown in Figure 4.6. This, on the other hand, generates a somewhat metallic sound, which some listeners might consider a worse degradation. The subjective conclusion is that a metallic sound is preferable to hoarseness.


In the recombination part of the PSOLA method the same windows are used as in the decomposition part. When the windows are placed closer together for the recombination, i.e., when a pitch increase is desired, the amplitude of the resulting signal is increased. This could be compensated by a downscaling of the window, but is not considered in this thesis. So far, though, no disturbances corresponding to an amplitude mismatch have been heard.

5.1.1 Comparison of TD- and MBR-PSOLA

In available publications about the MBR-PSOLA synthesizer it is stated that this method has numerous benefits, mainly regarding speech quality, compared with TD-PSOLA. This is the reason why the method is mentioned in the theoretical part, together with the fact that the implemented system can also be further developed into an MBR-PSOLA version. A benefit concerning Teleca is that no pitch marking is required, resulting in a more flexible system that is thereby easier to implement on various speech corpora. This can, as an example, be useful when the system is to be implemented for several languages. The only parts of an MBR-PSOLA system to be changed for an implementation of another language are the corpus with its corresponding segment extraction and possibly some extra phonetic characters specific to that language.

The speech synthesizer developed by MBROLA, which is based on the MBR-PSOLA method, is tested with the same sentences as the system described in this thesis. Despite the fact that the input data is originally designed for the MBROLA synthesizer and that this method is stated to give better quality, the TD-PSOLA based system is subjectively found to achieve higher quality. The output from MBROLA has a slightly better fluidity, but the major lead in naturalness for the TD-PSOLA synthesizer makes the investigated system preferable.

5.2 Further Work

Before the developed TTS system can completely function as a speech synthesizer in a Matlab environment, some further operations need to be performed. First, all the pitch mark files have to be investigated and corrected by hand. It is then preferable to perform a deep analysis of all possible diphone combinations for detecting hidden errors. The last operation is to optimize the implemented code with respect to computational time and to reduce possible run-time errors, such as exceeding vector bounds.

The next step towards an embedded TTS system is to translate the program code to, preferably, C code. The optimal structure of the Matlab code does not coincide with the optimal one for C, and since the implementation platform is a low-level device a careful code optimization is required. Additionally, the coding of the segment database affects the quality of the result, and it is hence advisable to perform a proper analysis of different codecs.

5.2.1 Proceedings for Teleca

The grapheme-to-phoneme module existing at Teleca today is a rule-based system generated by training a decision tree. It exists for several languages, including American

English, and is under training for British English, which the developed TTS synthesizer is designed for. The module converts textual input into phonetic notation, but without any prosodic information included. As stated above, information about the duration of each phone is more or less required by the synthesizer. If whole sentences are to be produced, frequency information is also preferable. A further extension of the grapheme-to-phoneme module is hence required before a final TTS system can be produced.

The company is a member of a speech databank group holding an enormous amount of recorded speech for several languages. The speech data have been recorded over a long period of time by several companies and speakers, grouped according to certain desired fields. It is therefore assumed that this data does not cover all possible diphones spoken by one speaker, which is a requirement for the category of generally working TTS systems described in this thesis.

5.2.2 Possible Quality Improvements

Disturbances caused by a spectral mismatch between two diphones, as shown in Figure 4.2, can easily be reduced in the case of MBR-PSOLA, owing to the possibility of a spectral envelope interpolation as described in 2.4.2. Since the quality of the TD-PSOLA was subjectively graded higher despite the presence of several mismatches, a mismatch between the harmonic phases does not appear to have any considerable influence on the result.

According to theory, the reason why the spectral envelope interpolation could not be performed in the TD-PSOLA synthesizer is the absence of the pitch synchronizing and harmonic phase normalizing operation. However, after the segment concatenation the pitch of the signals is almost, or at least could be, constant for a restricted duration of time. If new ST-signals around a mismatched concatenation point are then extracted, the interpolation mentioned above could be applied. The phase of each harmonic would then differ from window to window, but hopefully this disturbance would be perceived as less annoying than the spectral mismatch.

Bibliography

[CST] CSTR. Webpage. .

[Den98] Philip Denbigh. System Analysis & Signal Processing. Addison-Wesley, Har- low, 1998.

[DL93] T. Dutoit and H. Leich. MBR-PSOLA: Text-To-Speech Synthesis Based on an MBE Re-Synthesis of the Segments Database. Speech Communication, 13(3-4):435–440, 1993.

[Dut94] Thierry Dutoit. High Quality Text-To-Speech Synthesis: A Comparison of Four Candidate Algorithms. In Proceedings of ICASSP ’94, volume 1, pages 565–568, Adelaide, Australia, April 1994.

[Dut97] Thierry Dutoit. An Introduction to Text-To-Speech Synthesis. Kluwer Aca- demic Publishers, Dordrecht, 1997.

[Fesa] CSTR’s project Festival. Homepage. .

[Fesb] FestVox. Homepage. .

[GL88] D. W. Griffin and J. S. Lim. Multiband Excitation Vocoder. IEEE Trans. Acoustics, Speech and Signal Processing, ASSP-36(8):1223–1235, 1988.

[GMW97] Dafydd Gibbon, Roger Moore, and Richard Winski, editors. Handbook of Standards and Resources for Spoken Language Systems. Mouton de Gruyter, Berlin, 1997.

[HAH+98] H.-W. Hon, A. Acero, X. Huang, J. Liu, and M. Plumpe. Automatic Genera- tion of Synthesis Units for Trainable Text-To-Speech Systems. In Proceedings of ICASSP ’98, volume 1, pages 2293–2296, Seattle, 1998.

[Lem99] Sami Lemmetty. Review of Speech Synthesis Technology. Master’s thesis, Helsinki University of Technology, Finland, 1999.

[MBR] MBROLA. Homepage. .

[RM95] Ravi P. Ramachandran and Richard Mammone, editors. Modern Methods of Speech Processing. Kluwer, Boston, 1995.


[SO95] Richard Sproat and Joseph Olive. An Approach to Text-To-Speech Synthesis. In W. B. Kleijn and K. K. Paliwal, editors, Speech Coding and Synthesis, chapter 17. Elsevier Science, Amsterdam, Holland, 1995.

Appendix A

SAMPA Notation for British English

Extraction from [GMW97]:

Consonants

The standard English consonant system is traditionally considered to comprise 17 obstruents (6 plosives, 2 affricates and 9 fricatives) and 7 sonorants (3 nasals, 2 liquids and 2 semivowel glides). With the exception of the fricative /h/, the obstruents are usually classified in pairs as ”voiceless” and ”voiced”, although the presence or absence of periodicity in the signal resulting from laryngeal vibration is not a reliable feature distinguishing the two classes. They are better considered ”fortis” (strong) and ”lenis” (weak), with duration of constriction and intensity of the noise component signalling the distinction.

The six plosives are p b t d k g:

Symbol   Word   Transcription
p        pin    pIn
b        bin    bIn
t        tin    tIn
d        din    dIn
k        kin    kIn
g        give   gIv

The ”lenis” stops are most reliably voiced intervocalically; aspiration duration following the release in the fortis stops varies considerably with context, being practically absent following /s/, and varying with degree of stress syllable-initially.

The two phonemic affricates are tS and dZ:

Symbol   Word   Transcription
tS       chin   tSIn
dZ       gin    dZIn

As with the lenis stop consonants, /dZ/ is most reliably voiced between vowels. There are nine fricatives, f v T D s z S Z h:


Symbol   Word      Transcription
f        fin       fIn
v        vim       vIm
T        thin      TIn
D        this      DIs
s        sin       sIn
z        zing      zIN
S        shin      SIn
Z        measure   "meZ@
h        hit       hIt

Intervocalically the lenis fricatives are usually fully voiced, and they are often weakened to approximants (fricationless continuants) in unstressed position.

The sonorants are three nasals m n N, two liquids r l, and two sonorant glides w j:

Symbol   Word    Transcription
m        mock    mQk
n        knock   nQk
N        thing   TIN
r        wrong   rQN
l        long    lQN
w        wasp    wQsp
j        yacht   jQt

Vowels The English vowels fall into two classes, traditionally known as ”short” and ”long” but, owing to the contextual effect on duration of following ”fortis” and ”lenis” consonants (traditional ”long” vowels preceding fortis consonants can be shorter than ”short” vowels preceding lenis consonants), they are better described as ”checked” (not occurring in a stressed syllable without a following consonant) and ”free”.

The checked vowels are I e { Q V U:

Symbol   Word   Transcription
I        pit    pIt
e        pet    pet
{        pat    p{t
Q        pot    pQt
V        cut    kVt
U        put    pUt

There is a short central vowel, normally unstressed:

Symbol   Word      Transcription
@        another   @"nVD@

The free vowels comprise monophthongs and diphthongs, although no hard and fast line can be drawn between these categories. They can be placed in three groups according to their final quality: i: eI aI OI, u: @U aU, 3: A: O: I@ e@ U@. They are exemplified as follows:

Symbol   Word     Transcription
i:       ease     i:z
eI       raise    reIz
aI       rise     raIz
OI       noise    nOIz
u:       lose     lu:z
@U       nose     n@Uz
aU       rouse    raUz
3:       furs     f3:z
A:       stars    stA:z
O:       cause    kO:z
I@       fears    fI@z
e@       stairs   ste@z
U@       cures    kjU@z

The vowels /i:/ and /u:/ in unstressed syllables vary in their pronunciation between a close [i]/[u] and a more open [I]/[U]. Therefore it is suggested that /i/ and /u/ be used as indeterminacy symbols.

Symbol   Word    Transcription
i        happy   "h{pi
u        into    "Intu

1. Notational variants. Differently from the notation set out above:

1. It is possible to transcribe English long vowels without using length marks, thus /i u 3 A O/. This is phonemically unambiguous, although it does remove the option of restricting the symbols [i u] to the use just described, for the phonemically indeterminate weak vowels.

2. The symbol /E/ is quite widely used in place of /e/ for the vowel of ”pet”.

3. In an older notation, now no longer in general use, paired short and long vowels were transcribed using the same vowel symbol with and without length marks, thus /i/ in ”pit”, /i:/ in ”ease”; /O/ in ”pot”, /O:/ in ”cause”.

2. Additional symbols. For some purposes and some varieties of English it is useful to give explicit symbolization to the glottal stop and/or the voiceless velar fricative:

Symbol   Word      Transcription
?        network   ne?w3:k
x        loch      lQx

Appendix B

MRPA - SAMPA Lexicon for British English

Extraction from [CST]:

MRPA   SAMPA   Example      MRPA   SAMPA   Example
p      p       put          zh     Z       measure
b      b       but          y      j       yes
t      t       ten          ii     i:      bean
d      d       den          aa     A:      barn
k      k       can          oo     O:      born
m      m       man          uu     u:      boon
n      n       not          @@     3:      burn
l      l       like         i      I       pit
r      r       run          e      e       pet
f      f       full         a      {       pat
v      v       very         uh     V       putt
s      s       some         o      Q       pot
z      z       zeal         u      U       good
h      h       hat          @      @       about
w      w       went         ei     eI      bay
g      g       game         ai     aI      buy
ch     tS      chain        oi     OI      boy
jh     dZ      Jane         ou     @U      no
ng     N       long         au     aU      now
th     T       thin         i@     I@      peer
dh     D       then         e@     e@      pair
sh     S       ship         u@     U@      poor


Appendix C

Licence for CSTR’s British Diphone Database

Centre for Speech Technology Research University of Edinburgh, UK Copyright (c) 1996,1997 All Rights Reserved.

Permission to use, copy, modify, distribute this database for any purpose is hereby granted without fee, subject to the following conditions:

1. Redistributions retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in an encoded form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the redistribution.

3. Neither the name of the University nor the names of contributors to this work may be used to endorse or promote products derived from this software without specific prior written permission.

THE UNIVERSITY OF EDINBURGH AND THE CONTRIBUTORS TO THIS WORK DISCLAIM ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS, IN NO EVENT SHALL THE UNIVERSITY OF EDINBURGH NOR THE CONTRIBUTORS BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS WORK.
