Feasibility Study on a Text-To-Speech Synthesizer for Embedded Systems
2006:113 CIV
MASTER'S THESIS

Linnea Hammarstedt

Luleå University of Technology
MSc Programmes in Engineering, Electrical Engineering
Department of Computer Science and Electrical Engineering
Division of Signal Processing

2006:113 CIV - ISSN: 1402-1617 - ISRN: LTU-EX--06/113--SE

Preface

This is a master's degree project commissioned by and performed at Teleca Systems GmbH in Nürnberg, at the department of Speech Technology. Teleca is an IT services company focused on developing and integrating advanced software and information technology solutions. Today Teleca possesses a speech recognition system that includes a grapheme-to-phoneme module, i.e., an algorithm converting text into phonetic notation. Their future objective is to develop a Text-To-Speech system that includes this module. The purpose of this work, from Teleca’s point of view, is to investigate a possible solution for converting phonetic notation into speech that is suitable for an embedded implementation platform.

I would like to thank Dr. Andreas Kiessling at Teleca for his support and patient discussions during this work, and Dr. Stefan Dobler, the head of department, for giving me the possibility to experience this interesting field of speech technology. Finally, I wish to thank all the other personnel of the department for their consistently practical support.

Abstract

A system converting textual information into speech is usually denoted a TTS (Text-To-Speech) system. The design of such a system varies depending on its purpose and platform requirements. In this thesis a TTS synthesizer designed for an embedded system operating on an arbitrary vocabulary has been evaluated and partially implemented in Matlab, constituting a base for further development. The focus of this thesis is on the speech generation part, which involves the conversion from phonetic notation into synthetic speech.
The chosen TTS method is the so-called Time Domain PSOLA (TD-PSOLA), which suits the implementation and platform requirements well. It concatenates segments of recorded speech and changes their prosodic characteristics with the Pitch Synchronous Overlap and Add (PSOLA) technique. Each segment extends from the midpoint of one phone to the midpoint of the next and is referred to as a diphone.

The quality of the generated synthesized speech is fairly satisfactory for the test sentences used. Some disturbances still occur as a consequence of mismatches, such as different spectral properties of the segments and pitch detection errors, but these can be reduced with further development.

Contents

1 Introduction
  1.1 Introduction to TTS Systems
  1.2 Linguistic Analysis Module
  1.3 Speech Generation Module
    1.3.1 Rule-Based Synthesis
    1.3.2 Concatenative-Based Synthesis
  1.4 Project Focus
2 Theory
  2.1 Segment Data Preparation
    2.1.1 Segment Format and Speech Corpus Selection
    2.1.2 Preparation Process
    2.1.3 Segment Representation
  2.2 Speech Synthesis
    2.2.1 Synthesizing Process
    2.2.2 Prosodic Information
  2.3 PSOLA Method
    2.3.1 PSOLA Operation Process
    2.3.2 Modification of Prosody
    2.3.3 TD-PSOLA as Speech Synthesizer
  2.4 Extension of TD-PSOLA into MBR-PSOLA
    2.4.1 Re-synthesis Process
    2.4.2 Spectral Envelope Interpolation
    2.4.3 Multi-Band Excitation Model
    2.4.4 Benefits with the respective PSOLA Methods
  2.5 Utilized Data from External TTS Projects
    2.5.1 Festival and FestVox
    2.5.2 MBROLA
3 Implementation
  3.1 Segment Data Preparation
    3.1.1 Segment Information Modification
    3.1.2 Pitch Marks Modification
    3.1.3 Speech Corpus Modification
    3.1.4 Additive Modifications
  3.2 Speech Synthesis
    3.2.1 Input Format
    3.2.2 Segment List Generator
    3.2.3 Prosody Modification
    3.2.4 Segment Concatenation
4 Evaluation
  4.1 Analysis of the Segment Database and Input Data
    4.1.1 Pitch Marks
    4.1.2 Spectral Mismatch
    4.1.3 Fundamental Frequencies
  4.2 Solution Analysis
    4.2.1 Window Choice for the ST-signal Extraction
    4.2.2 Frequency Modification
    4.2.3 Duration Modification
    4.2.4 Word Border Information
5 Discussion
  5.1 Conclusions
    5.1.1 Comparison of TD- and MBR-PSOLA
  5.2 Further Work
    5.2.1 Proceedings for Teleca
    5.2.2 Possible Quality Improvements
A SAMPA Notation for British English
B MRPA - SAMPA Lexicon for British English
C Licence for CSTR’s British Diphone Database

List of abbreviations

IPA        International Phonetic Alphabet
MBE        Multi-Band Exciter
MBR-PSOLA  Multi-Band Re-synthesis PSOLA
MBROLA     short for MBR-PSOLA
MOS        Mean Opinion Score
MRPA       Machine Readable Phonetic Alphabet
OLA        Overlap and Add
PSOLA      Pitch-Synchronous Overlap and Add
SAMPA      Speech Assessment Methods Phonetic Alphabet
TD-PSOLA   Time Domain PSOLA
TTS        Text-To-Speech
V/UV       Voiced/Un-Voiced

Chapter 1 Introduction

The possibility of producing synthesized speech from plain text, by means of so-called Text-To-Speech (TTS) systems, has attracted extensive interest in many technical areas. Different methods with varying quality and properties exist, and development is ongoing. The purpose of this thesis is to define and evaluate a TTS synthesizer suitable for embedded systems. The work was performed at Teleca Systems GmbH in Nürnberg, and its focus is determined by the requirements of the company.
Today, Teleca has a module that transforms text into phonetic notation, originally developed for another speech application. This module is assumed to be usable in a TTS system as well, and the starting point for this project is hence phonetic notation. The developed system is restricted to British English, but the theoretical descriptions are valid for an arbitrary language. Since the starting level is phonetic notation, it is not strictly correct to consider the investigated system a Text-To-Speech system. However, for simplicity, and since the process of going from phonetic notation to speech is a major part of a TTS system, the term TTS is nevertheless used in this thesis to describe the evaluated overall process.

In this study a TD-PSOLA (Time Domain Pitch Synchronous Overlap and Add) [Dut97] synthesizer is investigated and implemented in Matlab. The result is evaluated, and suggestions for further work towards completing the system are given. A possible extension of this method with the Multi-Band Excitation (MBE) [GL88] model is presented theoretically, together with its benefits and disadvantages.

The following sections briefly describe the main principles of a general TTS system as well as some existing classifications and groupings. The aim of the latter description is to show what choices have been made and to explain why.

1.1 Introduction to TTS Systems

A TTS synthesizer is a computer-based system that takes a text string as input and converts it into synthetic speech waveforms. The methods and equipment needed for this process vary depending on the physical restrictions of the implementation platform and on development costs. Two main hardware restrictions are the storage properties, such as capacity and memory type, and the clock rate of the processor.

For all methods, the synthesis process can be divided into the two main modules presented in Figure 1.1.
The first module transcribes the input text into a linguistic representation, usually expressed as phonetic notation (phones) together with additional information about prosody. The term prosody refers to properties of each phone such as duration and pitch. These outputs are then used in the second module to construct the final synthetic speech waveform.

Text -> Linguistic Analysis -> Phonemes & Prosody Info -> Speech Generation -> Speech
Figure 1.1: Division of a general TTS system into two main modules.

1.2 Linguistic Analysis Module

In almost all languages the written form of a word does not directly correspond to its pronunciation. The position of letters within a word and the position of the word within the sentence affect the pronunciation considerably, as do additional characters such as punctuation marks and the content of the sentence. An alternative symbolic representation is therefore needed to capture this hidden information. Usually a language can be described by 20 to 60 different phonetic characters [Lem99], excluding information about its melody. To describe the pitch characteristics as well, additional prosodic information is needed.

Converting text into a linguistic representation requires a large set of rules and exceptions that depend on the language. This process can be described in three main parts [Dut97]:

1. Text analysis
2. Automatic phonetization
3. Prosody generation

The first part functions as a pre-processing phase. It identifies special characters and notations, such as numbers and abbreviations, and converts them into full text when needed. Several words can have different pronunciations depending on their meaning, and hence a contextual analysis is performed to categorize the words.
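The pre-processing step described above, expanding abbreviations and numbers into full text, can be illustrated with a minimal Python sketch. It is not the method used in this thesis or in Teleca's module. The abbreviation table, the function names number_to_words and normalize, and the restriction to integers from 0 to 999 are all assumptions made for illustration. A real text analysis stage would also need the contextual analysis mentioned above, which is omitted here.

```python
import re

# Hypothetical abbreviation table (assumption); a real system would rely on
# a much larger, language-specific lexicon.
ABBREVIATIONS = {"Dr.": "Doctor", "e.g.": "for example", "etc.": "et cetera"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer in the range 0..999 in British English."""
    if not 0 <= n <= 999:
        raise ValueError("only 0..999 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" and " + number_to_words(rest) if rest else "")


def normalize(text):
    """Expand known abbreviations and short digit runs into full words."""
    for abbreviation, expansion in ABBREVIATIONS.items():
        text = text.replace(abbreviation, expansion)
    # The word boundaries leave digit runs longer than three digits
    # (for example years) untouched instead of guessing how to read them.
    return re.sub(r"\b\d{1,3}\b",
                  lambda match: number_to_words(int(match.group())),
                  text)


print(normalize("Dr. Smith has 42 books, e.g. 3 on TTS."))
# prints: Doctor Smith has forty-two books, for example three on TTS.
```

Plain string replacement and a single regular expression keep the sketch short, but they ignore context: decimal numbers, dates, and abbreviations that coincide with ordinary words at the end of a sentence would all need dedicated rules.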