Techniques and Challenges in Speech Synthesis

Final Report for ELEC4840B

David Ferris - 3109837

04/11/2016

A thesis submitted in partial fulfilment of the requirements for the degree of Bachelor of Engineering in Electrical Engineering at The University of Newcastle, Australia.

Abstract

The aim of this project was to develop and implement an English language Text-to-Speech synthesis system. This first involved an extensive study of the mechanisms of human speech production, a review of modern techniques in speech synthesis, and an analysis of the tests used to evaluate the effectiveness of synthesized speech. It was determined that a diphone synthesis system was the most effective choice for the scope of this project.

A diphone synthesis system operates by concatenating sections of recorded human speech, with each section containing exactly one phonetic transition, or diphone. Given a database containing recordings of all possible phonetic transitions within a language, the system can produce any word by concatenating the correct diphone sequence. A method of automatically identifying and extracting diphones from prompted speech was designed, allowing a speaker to record a complete diphone database in less than 40 minutes. The Carnegie Mellon University Pronouncing Dictionary (CMUdict) was used to determine the pronunciation of known words, and a system for smoothing the transitions between diphone recordings was designed and implemented.

CMUdict was then used to train a maximum-likelihood prediction system that determines the pronunciation of English words not present in the dictionary. Using this, the system found an identical or reasonably similar pronunciation for over 76% of the training set. A Part of Speech tagger was then designed to find the lexical class of each word within a sentence (lexical classes being categories such as nouns, verbs, and adjectives).
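The diphone expansion and concatenation described above can be sketched as follows. This is a toy illustration, not the project's implementation: the `diphone_db` mapping, the `SIL` silence marker, and the fixed-length linear crossfade are assumptions made purely for the sketch.

```python
import numpy as np

def diphones_for(phones):
    """Expand a phone sequence into the diphone names needed to say it,
    padding with silence so the first and last transitions are covered."""
    padded = ["SIL"] + list(phones) + ["SIL"]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]

def synthesise(phones, diphone_db, fade=64):
    """Concatenate diphone waveforms, smoothing each join with a short
    linear crossfade of `fade` samples."""
    out = np.zeros(0)
    for name in diphones_for(phones):
        wav = np.asarray(diphone_db[name], dtype=float)
        if out.size >= fade and wav.size >= fade:
            ramp = np.linspace(0.0, 1.0, fade)
            out[-fade:] = out[-fade:] * (1.0 - ramp) + wav[:fade] * ramp
            wav = wav[fade:]
        out = np.concatenate([out, wav])
    return out

# "dove" (the bird), Arpabet D-AH-V, expands to four diphones:
print(diphones_for(["D", "AH", "V"]))
# ['SIL-D', 'D-AH', 'AH-V', 'V-SIL']
```

In the system itself, the database entries come from the automatic diphone extraction described in this report, and the transition-smoothing method is more involved than the placeholder crossfade used here.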
This tagger lets the system automatically identify the correct pronunciation of some Heterophonic Homographs: words which are spelled the same way but pronounced differently. For example, the word “dove” is pronounced two different ways in the phrase “I dove towards the dove”, depending on its use as a verb or a noun. On a test data set, this implementation found the correct lexical class of a word in context 76.8% of the time.

A method of altering the pitch, duration, and volume of the produced voice over time was designed: a combination of the time-domain Pitch Synchronous Overlap-Add (PSOLA) algorithm and a novel approach referred to as Unvoiced Speech Duration Shifting (USDS), the latter designed from an understanding of the mechanisms of natural speech production. Combining the two approaches minimises distortion of the voice when shifting pitch or duration, while maximising computational efficiency by operating in the time domain. This capability was used to add correct lexical stress to vowels within words.

A text tokenisation system was developed to handle arbitrary text input, allowing pronunciation of numerical input tokens and the insertion of appropriate pauses for punctuation. Methods for further improving sentence-level naturalness were discussed. Finally, the system was tested with listeners for its intelligibility and naturalness.

Acknowledgements

I want to thank Kaushik Mahata for supervising this project, providing advice and encouragement on the direction my research should take, and discussing the project with me as it developed. I also want to thank my friends Alice Carden, Josh Morrison-Cleary, and Megan McKenzie for sitting down and making weird sounds into a microphone for hours to donate their voices to this project. I promise never to use your voices for evil.
In addition, many thanks to those who listened to the synthesis system and provided feedback, from the barely intelligible beginnings to the more intelligible final product. Particular thanks to KC, You Ben, That Ben, Dex, and Kat, who helped with formal intelligibility and naturalness tests. More broadly, I want to thank the staff of the Engineering and Mathematics faculties of the University of Newcastle for teaching me over the course of my degree. Without the knowledge imparted through their classes, this project would have been impossible. Finally, I want to thank my parents and friends for putting up with me excitedly talking about various aspects of linguistics and signal processing for an entire year.

List of Contributions

The key contributions of this project are as follows:

- Completed an extensive background review of acoustics and linguistics,
- Reviewed and compared the various tests used to evaluate the intelligibility and naturalness of speech synthesis systems,
- Reviewed and compared different techniques currently used for speech synthesis,
- Designed and implemented a system to automatically separate a speech waveform of a prompted monophone into sections of initial excitation, persistence, and return to silence,
- Designed and implemented a system to automatically extract a prompted diphone from a given speech waveform,
- Used these to automate the construction of several English language diphone databases,
- Designed and implemented a computationally efficient method of smoothly concatenating recorded diphone waveforms,
- Used the above in conjunction with a machine-readable pronunciation dictionary to produce synthesized English speech,
- Designed and implemented a data-driven method of converting arbitrary alphabetic words into corresponding English-language pronunciations,
- Designed and implemented a trigram-based Part of Speech tagging system to identify the lexical class of words within an input sentence,
- Used this Part of Speech data to determine the correct pronunciation of Heterophonic Homographs,
- Designed and implemented a novel method for arbitrarily altering the volume, fundamental frequency, and duration of synthesized speech on the phone level,
- Designed and implemented a text pre-processing system to convert arbitrarily punctuated input text into a format which the system can synthesise,
- Designed software allowing a user to define the target volume, pitch, and duration of produced speech based on sentence punctuation, lexical class of input words, and word frequency to improve system prosody,
- Combined all of the above into a comprehensive English language Text To Speech diphone synthesis system, and
- Experimentally evaluated the intelligibility and naturalness of said system.

David Ferris

Kaushik Mahata

Table of Contents

Abstract
Acknowledgements
List of Contributions
List of Figures
List of Tables
1. Introduction
   1.1. Report Outline
   1.2. Other Remarks on this Report
2. Background on Linguistics and Technologies
   2.1. Acoustics
      2.1.1. The Human Ear
   2.2. Human Speech Production
      2.2.1. Classification of Speech Units
      2.2.2. Classification of Words
      2.2.3. Prosodic Features of Speech
   2.3. Transcribing Phonetic Information
      2.3.1. International Phonetic Alphabet (IPA)
      2.3.2. Arpabet
   2.4. Encoding and Visualising Audio Information
      2.4.1. Recording Digital Audio Signals
      2.4.2. Recording Human Speech
      2.4.3. Time Domain Representation