Review of Speech Synthesis Technology
16 pages, PDF, 1,020 KB
Recommended publications
The Development of Accented English Synthetic Voices, by Promise Tshepiso Malatji
Dissertation submitted in fulfilment of the requirements for the degree of Master of Science in Computer Science in the Faculty of Science and Agriculture (School of Mathematical and Computer Sciences) at the University of Limpopo, 2019. Supervisor: Mr M.J.D. Manamela; co-supervisor: Dr T.I. Modipa.
Dedication: In memory of my grandparents, Cecilia Khumalo and Alfred Mashele, who always believed in me!
Declaration: I declare that "The Development of Accented English Synthetic Voices" is my own work, that all the sources I have used or quoted have been indicated and acknowledged by means of complete references, and that this work has not been submitted before for any other degree at any other institution.
Acknowledgements: I want to recognise the following people for their individual contributions to this dissertation: my brother, Mr B.I. Khumalo, and the whole family for their unconditional love, support and understanding; both my supervisors, Mr M.J.D. Manamela and Dr T.I. Modipa, for their guidance, motivation, and support; the Telkom Centre of Excellence for Speech Technology for providing the resources and support to make this study a success; my colleagues in the Department of Computer Science, Messrs V.R. Baloyi and L.M. Kola, for always motivating me; Mr T.J. Sefara for taking the time to participate in the study; and the six Computer Science undergraduate students who sacrificed their precious time to participate in data collection.
Commercial Tools in Speech Synthesis Technology
International Journal of Research in Engineering, Science and Management, Volume 2, Issue 12, December 2019, www.ijresm.com, ISSN (Online): 2581-5792.
D. Nagaraju, Associate Professor, Dept. of Computer Science, Audisankara College of Engineering and Technology, Gudur, India; R. J. Ramasree, Professor, Dept. of Computer Science, Rastriya Sanskrit VidyaPeet, Tirupati, India; K. Kishore, K. Vamsi Krishna, and R. Sujana, UG students, Dept. of Computer Science, Audisankara College of Engineering and Technology, Gudur, India.
Abstract: This study paper proposes a new emotional speech synthesis system for Telugu (ESST). Its main objective is to map the state of today's speech synthesis technology and to focus on promising methods for the future. Literature and articles in this area usually concentrate on a single method, a single synthesizer, or a very narrow slice of the technology; in this paper the whole speech synthesis area, with as many methods, techniques, applications, and products as possible, is under investigation. Unfortunately, this means that in some cases very detailed information cannot be given here, but it may be found in the references cited.
From the paper body: … phonetic and prosodic information. These two phases are usually called high-level and low-level synthesis. The input text might be, for example, data from a word processor, standard ASCII from e-mail, a mobile text message, or scanned text from a newspaper. The character string is then preprocessed and analyzed into a phonetic representation, which is usually a string of phonemes with some additional information for correct intonation, duration, and stress. Speech sound is finally generated with the low-level synthesizer from this information.
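The high-level/low-level split described in this excerpt can be illustrated with a minimal Python sketch. The toy lexicon, the fixed durations, and the sine-tone "low-level synthesizer" below are invented placeholders, not part of the ESST system or of any real synthesizer.

```python
import numpy as np

# Toy lexicon mapping words to phoneme labels (invented, ARPAbet-like).
LEXICON = {"hello": ["HH", "AH", "L", "OW"], "world": ["W", "ER", "L", "D"]}

def high_level(text):
    """Text analysis: split into words, look up phonemes, attach duration and pitch."""
    phones = []
    for word in text.lower().split():
        for ph in LEXICON.get(word, ["SIL"]):
            # A fixed duration (ms) and a flat pitch stand in for real prosody rules.
            phones.append({"phone": ph, "dur_ms": 90, "f0_hz": 120.0})
    return phones

def low_level(phones, fs=16000):
    """Waveform generation: a pitched tone per phone, in place of a real synthesizer."""
    chunks = []
    for p in phones:
        n = int(fs * p["dur_ms"] / 1000)
        t = np.arange(n) / fs
        chunks.append(0.3 * np.sin(2 * np.pi * p["f0_hz"] * t))
    return np.concatenate(chunks) if chunks else np.zeros(0)

audio = low_level(high_level("hello world"))
print(audio.size, "samples generated")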
Virtual Pop: Gender, Ethnicity, and Identity in Virtual Bands and Vocaloid
Alicia Stark, Cardiff University School of Music, 2018. Presented in partial fulfilment of the requirements for the degree Doctor of Philosophy in Musicology.
Contents: Abstract; Dedication; Acknowledgements.
Introduction: existing studies of virtual bands; research questions; methodology; thesis structure.
Chapter 1, 'You've Come a Long Way, Baby:' The History and Technologies of Virtual Bands: categories of virtual bands; an animated anthology, the rise in popularity of animation; Alvin and the Chipmunks and their successors; virtual bands for all ages, available on your TV; virtual bands in other types of media; creating the voice; reproducing the body; conclusion.
Chapter 2, 'Almost Unreal:' Towards a Theoretical Framework for Virtual Bands: defining reality and virtual reality; applying theories of 'realness' to virtual bands; understanding multimedia; applying theories of multimedia to virtual bands; the voice in virtual bands; agency, transformation through technology; conclusion.
Chapter 3, 'Inside, Outside, Upside Down:' Gender and Ethnicity in Virtual Bands: gender; ethnicity; case studies of Dethklok, Josie and the Pussycats, and Studio Killers; conclusion.
Chapter 4, 'Spitting Out the Demons:' Gorillaz' Creation Story and the Construction of Authenticity: academic discourse on Gorillaz; masculinity in Gorillaz; ethnicity in Gorillaz; Gorillaz fandom; conclusion.
Models of Speech Synthesis (Rolf Carlson, Department of Speech Communication and Music Acoustics, Royal Institute of Technology, S-100 44 Stockholm, Sweden)
Proc. Natl. Acad. Sci. USA, Vol. 92, pp. 9932-9937, October 1995, Colloquium Paper. This paper was presented at a colloquium entitled "Human-Machine Communication by Voice," organized by Lawrence R. Rabiner, held by the National Academy of Sciences at The Arnold and Mabel Beckman Center in Irvine, CA, February 8-9, 1993.
Abstract: The term "speech synthesis" has been used for diverse technical approaches. In this paper, some of the approaches used to generate synthetic speech in a text-to-speech system are reviewed, and some of the basic motivations for choosing one method over another are discussed. It is important to keep in mind, however, that speech synthesis models are needed not just for speech generation but to help us understand how speech is created, or even how articulation can explain language structure. General issues such as the synthesis of different voices, accents, and multiple languages are discussed as special challenges facing the speech synthesis community.
From the paper body: … need large amounts of speech data. Models working close to the waveform are now typically making use of increased unit sizes while still modeling prosody by rule. In the middle of the scale, "formant synthesis" is moving toward the articulatory models by looking for "higher-level parameters" or to larger prestored units. Articulatory synthesis, hampered by lack of data, still has some way to go but is yielding improved quality, due mostly to advanced analysis-synthesis techniques. (Section: Flexibility and Technical Dimensions.)
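As a rough illustration of the formant synthesis approach mentioned in this excerpt, the sketch below passes a pulse train through cascaded two-pole resonators. The formant frequencies, bandwidths, and the simple gain normalization are generic textbook values for a static /a/-like vowel, not parameters from Carlson's work.

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000
f0 = 110.0                                        # fundamental frequency in Hz
formants = [(730, 90), (1090, 110), (2440, 170)]  # (frequency, bandwidth) pairs in Hz

# Impulse train as a crude stand-in for the glottal source (one second of audio).
excitation = np.zeros(fs)
excitation[::int(fs / f0)] = 1.0

wave = excitation
for freq, bw in formants:
    r = np.exp(-np.pi * bw / fs)                  # pole radius from the bandwidth
    theta = 2 * np.pi * freq / fs                 # pole angle from the formant frequency
    a = [1.0, -2.0 * r * np.cos(theta), r * r]    # two-pole resonator denominator
    b = [1.0 - r]                                 # rough gain normalization
    wave = lfilter(b, a, wave)

wave = wave / np.max(np.abs(wave))                # scale to [-1, 1]
print("synthesized", wave.size, "samples of a static vowel")
```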
Masterarbeit (Master's Thesis): Creation of a Speech Database and an Analysis Program in the Context of Speech Synthesis with Spectral Models
Submitted for the academic degree of Master of Science to the Department of Mathematics, Natural Sciences and Computer Science of the Technische Hochschule Mittelhessen. Tobias Platen, August 2014. First examiner: Prof. Dr. Erdmuthe Meyer zu Bexten; second examiner: Prof. Dr. Keywan Sohrabi.
Declaration in lieu of an oath: I hereby affirm that I produced this work independently and used only the literature and aids indicated. The work has not previously been submitted in the same or a similar form to any other examination authority and has not been published.
Contents:
1 Introduction: motivation; goals; historical speech synthesizers (the speaking machine, the Vocoder and the Voder, linear predictive coding); modern speech synthesis algorithms (formant synthesis, concatenative synthesis).
2 Spectral models for speech synthesis: convolution, Fourier transform and vocoders; the phase vocoder; spectral model synthesis (harmonic trajectories, shape invariance…).
A Singing Voice Synthesizer Based on MBROLA for Mandarin (Liu Ning)
Liu Ning, "Un synthétiseur de la voix chantée basé sur MBROLA pour le mandarin," Journées d'Informatique Musicale (JIM 2012), Mons, Belgium, 9-11 May 2012. HAL Id: hal-03041805, https://hal.archives-ouvertes.fr/hal-03041805. Author affiliation: CICM (Centre de recherche en Informatique et Création Musicale), Université Paris VIII, MSH Paris Nord, [email protected].
Abstract: In this article we present a project to develop a singing voice synthesizer for Mandarin Chinese based on MBROLA. Our objective is to develop a synthesizer that can run in real time and that is capable of …
From the paper body: … as a function of the melody and the lyrics read from a MIDI file [7]. Nevertheless, singing voice synthesis for the Chinese language still has technical limitations; for example, a system driven by reading a MIDI file runs offline rather than in real time.
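To make the MBROLA-based approach concrete, here is a small sketch that turns toy (syllable, note, duration) data into an MBROLA-style .pho file, assuming the common .pho line layout of a phoneme label, a duration in milliseconds, and percent-position/F0 pairs. The phoneme labels and the score are invented and do not reflect Liu Ning's actual Mandarin voice or pipeline.

```python
def midi_to_hz(note):
    """Convert a MIDI note number to a frequency in Hz."""
    return 440.0 * 2 ** ((note - 69) / 12)

# (phonemes of the syllable, MIDI note number, total duration in ms); toy data only.
score = [(["n", "i"], 67, 400), (["h", "a", "o"], 64, 600)]

lines = []
for phones, note, dur_ms in score:
    f0 = midi_to_hz(note)
    per_phone = dur_ms // len(phones)
    for ph in phones:
        # Hold the note's pitch flat across the phone: F0 anchors at 0% and 100%.
        lines.append(f"{ph} {per_phone} 0 {f0:.0f} 100 {f0:.0f}")

with open("song.pho", "w") as f:
    f.write("\n".join(lines) + "\n")
print("\n".join(lines))
```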
eSpeak: Speech Synthesis
Software Requirements Specification for eSpeak: Speech Synthesis, Version 1.48.15. Prepared by Dimitrios Koufounakis, January 10, 2018. Copyright © 2002 by Karl E. Wiegers; permission is granted to use, modify, and distribute this document.
Table of contents: Revision History; 1. Introduction (purpose; document conventions; intended audience and reading suggestions; project scope; references); 2. Overall Description …
Fully Generated Scripted Dialogue for Embodied Agents
Kees van Deemter (University of Aberdeen, Scotland, UK; corresponding author, email [email protected]), Brigitte Krenn (Austrian Research Center for Artificial Intelligence (OEFAI), University of Vienna, Austria), Paul Piwek (Centre for Research in Computing, The Open University, UK), Martin Klesen (German Research Center for Artificial Intelligence (DFKI), Saarbruecken, Germany), Marc Schröder (DFKI, Saarbruecken, Germany), and Stefan Baumann (University of Koeln, Germany).
Abstract: This paper presents the NECA approach to the generation of dialogues between Embodied Conversational Agents (ECAs). This approach consists of the automated construction of an abstract script for an entire dialogue (cast in terms of dialogue acts), which is incrementally enhanced by a series of modules and finally "performed", by means of text, speech and body language, by a cast of ECAs. The approach makes it possible to automatically produce a large variety of highly expressive dialogues, some of whose essential properties are under the control of a user. The paper discusses the advantages and disadvantages of NECA's approach to Fully Generated Scripted Dialogue (FGSD), and explains the main techniques used in the two demonstrators that were built. The paper can be read as a survey of issues and techniques in the construction of ECAs, focussing on the generation of behaviour (i.e., on information presentation) rather than on interpretation.
Keywords: Embodied Conversational Agents, Fully Generated Scripted Dialogue, Multimodal Interfaces, Emotion Modelling, Affective Reasoning, Natural Language Generation, Speech Synthesis, Body Language.
1. Introduction: A number of scientific disciplines have started, in the last decade or so, to join forces to build Embodied Conversational Agents (ECAs): software agents with a human-like synthetic voice and …
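The pipeline the abstract describes, an abstract script of dialogue acts that is enriched step by step before being performed, can be sketched schematically as follows. All field names, enrichment steps, and agent names are invented for illustration; they are not the NECA modules themselves.

```python
from dataclasses import dataclass, field

@dataclass
class DialogueAct:
    speaker: str
    act_type: str                  # e.g. "greet", "inform", "disagree"
    content: str
    annotations: dict = field(default_factory=dict)

def add_text(act):
    # Stand-in for natural language generation: just reuse the raw content.
    act.annotations["text"] = act.content
    return act

def add_prosody(act):
    act.annotations["pitch"] = "raised" if act.act_type == "disagree" else "neutral"
    return act

def add_gesture(act):
    act.annotations["gesture"] = "head_shake" if act.act_type == "disagree" else "nod"
    return act

script = [DialogueAct("agent_a", "greet", "Hello!"),
          DialogueAct("agent_b", "disagree", "I am not convinced.")]

# Each enrichment step adds one layer of annotation to the whole script.
for step in (add_text, add_prosody, add_gesture):
    script = [step(act) for act in script]

for act in script:
    print(act.speaker, act.annotations)
```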
Text-To-Speech Synthesis System for Marathi Language Using Concatenation Technique
Submitted to Dr. Babasaheb Ambedkar Marathwada University, Aurangabad, India, for the award of Doctor of Philosophy, by Mr. Sangramsing Nathsing Kayte, supervised by Dr. Bharti W. Gawali, Professor, Department of Computer Science and Information Technology, Dr. Babasaheb Ambedkar Marathwada University, Aurangabad, November 2018.
Abstract: A speech synthesis system is a computer-based system that should be able to read any text aloud in a particular language, or in multiple languages. This is also called text-to-speech synthesis, or TTS for short. Communication plays an important role in everyone's life; it usually refers to speaking, writing, or sending a message to another person. Speech is one of the most important ways for humans to communicate, and there have been a great number of efforts to incorporate speech into communication between humans and computers. Demand for speech-processing technologies such as speech recognition, dialogue processing, natural language processing, and speech synthesis is increasing. These technologies are useful for human-to-human applications such as spoken language translation systems, and for human-to-machine communication such as control interfaces for handicapped persons. TTS is one of the key technologies in speech processing.
A TTS system converts ordinary orthographic text into an acoustic signal that is indistinguishable from human speech. For developing a natural human-machine interface, the TTS system can be used as a way for the computer to talk back, like a human voice. TTS can be a voice for people who cannot speak, and people wishing to learn a new language can use a TTS system to learn the pronunciation.
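The concatenation technique named in the title can be sketched as joining prerecorded unit waveforms with a short crossfade. The unit inventory below is a toy stand-in (noise bursts instead of recorded Marathi diphones), and the crossfade length is an arbitrary choice rather than anything taken from the thesis.

```python
import numpy as np

def crossfade_concat(units, fs=16000, fade_ms=10):
    """Join waveform units with a short linear crossfade at each boundary."""
    fade = int(fs * fade_ms / 1000)
    ramp = np.linspace(0.0, 1.0, fade)
    out = units[0].astype(float)
    for u in units[1:]:
        u = u.astype(float)
        out[-fade:] = out[-fade:] * (1.0 - ramp) + u[:fade] * ramp  # overlap the seam
        out = np.concatenate([out, u[fade:]])
    return out

# Stand-in "recorded units": noise bursts in place of real diphone recordings.
rng = np.random.default_rng(0)
inventory = {"na": rng.standard_normal(3200), "ma": rng.standard_normal(3200)}
speech = crossfade_concat([inventory["na"], inventory["ma"]])
print(speech.shape)
```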
Wormed Voice Workshop Presentation
micro_research, December 27, 2017.
1. Some worm poetry and songs:
"The WORM was for a long time desirous to speake, but the rule and order of the Court enjoyned him silence, but now strutting and swelling, and impatient, of further delay, he broke out thus..." [Michael Maier]
He worshipped the worm and prayed to the wormy grave.
Serpent Lucifer, how do you do?
Of your worms and your snakes I'd be one or two;
For in this dear planet of wool and of leather
'Tis pleasant to need neither shirt, sleeve, nor shoe,
And have arm, leg, and belly together.
Then aches your head, or are you lazy?
Sing, 'Round your neck your belly wrap,
Tail-a-top, and make your cap
Any bee and daisy.
Two pigs' feet, two mens' feet, and two of a hen;
Devil-winged; dragon-bellied; grave-jawed, because grass
Is a beard that's soon shaved, and grows seldom again.
Worm writing: the the the the,eeeronencoug,en sthistit, d.).dupi w m,tinsprsool itav f bometaisp- pav wheaigelic..)a?? orerdi mise we ich'roo bish ftroo htothuloul mespowouklain- duteavshi wn,jis, sownol hof." m,tisorora angsthyedust,es, fofald,junss ownoug brad,)fr m fr,aA?a????ck;A?stelav aly, al is.'rady'lfrdil owoncorara wns t.) sh'r, oof ofr,a? ar,a???????a? fu mo towess,eethen hrtolly-l,."tigolav ict,a???!ol, w..'m,elyelil,tstreamas..n gotaillas.tansstheatsea f mb ispot inici t.) owar.**1 wnshigigholoothtith orsir.tsotic.'m, sotamimoledug imootrdeavet..t,) sh s,tranciror."wn sieee h asinied.tiear wspilotor,) bla av.nicord,ier.dy'et.*tite m.)..*d, hrouceto hie, ig il m, bsomoug,.t.'l,t, olitel bs,.nt,.dotr tat,)aa? htotitedont,j alesil, starar,ja taie ass.nishiceroouldseal fotitoonckysil, m oitispl o anteeeaicowousomirot.
Design and Implementation of Text to Speech Conversion for Visually Impaired People
International Journal of Applied Information Systems (IJAIS), ISSN 2249-0868, Foundation of Computer Science FCS, New York, USA, Volume 7, No. 2, April 2014, www.ijais.org. Itunuoluwa Isewon (corresponding author), Jelili Oyelade, and Olufunke Oladipupo, Department of Computer and Information Sciences, Covenant University, PMB 1023, Ota, Nigeria.
Abstract: A text-to-speech synthesizer is an application that converts text into spoken words, by analyzing and processing the text using Natural Language Processing (NLP) and then using Digital Signal Processing (DSP) technology to convert this processed text into a synthesized speech representation. Here, we developed a useful text-to-speech synthesizer in the form of a simple application that converts inputted text into synthesized speech, reads it out to the user, and can then save the result as an .mp3 file. The development of a text-to-speech synthesizer will be of great help to people with visual impairment and will make going through large volumes of text easier.
Figure 1: A simple but general functional diagram of a TTS system. [2]
Keywords: Text-to-speech synthesis, Natural Language Processing, Digital Signal Processing.
2. Overview of Speech Synthesis: Speech synthesis can be described as the artificial production of human speech [3]. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or hardware.
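An application of the kind described here can be assembled today with an off-the-shelf library. The sketch below uses the third-party gTTS package (assumed to be installed; it needs an internet connection because it calls an online service) and is an illustrative stand-in, not the system implemented in the paper.

```python
from gtts import gTTS

def read_aloud_to_mp3(text, out_path="speech.mp3", lang="en"):
    """Synthesize `text` and save it as an MP3 file, returning the path."""
    tts = gTTS(text=text, lang=lang)
    tts.save(out_path)   # gTTS writes MP3 audio to the given path
    return out_path

if __name__ == "__main__":
    print(read_aloud_to_mp3("This text will be spoken and saved as an MP3 file."))
```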
Feasibility Study on a Text-To-Speech Synthesizer for Embedded Systems
2006:113 CIV, Master's Thesis, Linnea Hammarstedt, Luleå University of Technology, MSc Programmes in Engineering, Electrical Engineering, Department of Computer Science and Electrical Engineering, Division of Signal Processing. ISSN 1402-1617; ISRN LTU-EX--06/113--SE.
Preface: This is a master degree project commissioned by and performed at Teleca Systems GmbH in Nürnberg at the department of Speech Technology. Teleca is an IT services company focused on developing and integrating advanced software and information technology solutions. Today Teleca possesses a speech recognition system including a grapheme-to-phoneme module, i.e., an algorithm converting text into phonetic notation. Their future objective is to develop a text-to-speech system including this module. The purpose of this work, from Teleca's point of view, is to investigate a possible solution for converting phonetic notation into speech suitable for an embedded implementation platform. I would like to thank Dr. Andreas Kiessling at Teleca for his support and patient discussions during this work, and Dr. Stefan Dobler, the head of department, for giving me the possibility to experience this interesting field of speech technology. Finally, I wish to thank all the other personnel of the department for their consistently practical support.
Abstract: A system converting textual information into speech is usually denoted a TTS (Text-To-Speech) system. The design of this system varies depending on its purpose and platform requirements. In this thesis a TTS synthesizer designed for an embedded system operating on an arbitrary vocabulary has been evaluated and partially implemented in Matlab, constituting a base for further development.
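The grapheme-to-phoneme step mentioned in the preface can be sketched as a dictionary lookup with a letter-by-letter fallback. The lexicon entries and the fallback table below are toy examples, not Teleca's module.

```python
# Exception dictionary first, then a crude letter-to-sound fallback.
LEXICON = {"speech": ["S", "P", "IY", "CH"], "one": ["W", "AH", "N"]}
LETTER_TO_SOUND = {"a": "AE", "b": "B", "c": "K", "d": "D", "e": "EH",
                   "n": "N", "o": "OW", "p": "P", "s": "S", "t": "T"}

def grapheme_to_phoneme(word):
    """Return a phoneme list: dictionary lookup with a naive per-letter fallback."""
    word = word.lower()
    if word in LEXICON:
        return LEXICON[word]
    return [LETTER_TO_SOUND.get(ch, "?") for ch in word]

for w in ["speech", "one", "stop"]:
    print(w, "->", grapheme_to_phoneme(w))
```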