Realtime and Accurate Musical Control of Expression in Voice Synthesis


Nicolas d'Alessandro

A dissertation submitted to the Faculty of Engineering of the University of Mons for the degree of Doctor of Philosophy in Applied Science. Supervisor: Prof. T. Dutoit

Abstract

In the early days of speech synthesis research, understanding voice production attracted the attention of scientists aiming to produce intelligible speech. Later, the need for more natural voices led researchers to use prerecorded voice databases, containing speech units that are reassembled by a concatenation algorithm. As computer capacities grew, the length of these units increased, going from diphones to non-uniform units in the so-called unit selection framework, following a strategy referred to as "take the best, modify the least".

Today the new challenge in voice synthesis is the production of expressive speech or singing. The mainstream solution to this problem is based on the "there is no data like more data" paradigm: emotion-specific databases are recorded and emotion-specific units are segmented.

In this thesis, we propose to revisit the expressive speech synthesis problem from its original voice production grounds. We also assume that the expressivity of a voice synthesis system relies on its interactive properties rather than strictly on the coverage of the recorded database.

To reach our goals, we develop the Ramcess software system, an analysis/resynthesis pipeline which aims at providing interactive and realtime access to the voice production mechanism. More precisely, this system makes it possible to browse a connected speech database and to dynamically modify the value of several glottal source parameters.

In order to achieve these voice transformations, a connected speech database is recorded and the Ramcess analysis algorithm is applied. Ramcess analysis relies on the estimation of glottal waveforms and vocal tract impulse responses from the prerecorded voice samples. We cascade two promising glottal flow analysis algorithms, ZZT and ARX-LF, as a way of reinforcing the whole analysis process.

Then the Ramcess synthesis engine computes the convolution of the previously estimated glottal source and vocal tract components within a realtime pitch-synchronous overlap-add architecture. A new model for producing the glottal flow signal is proposed. This model, called SELF, is a modified LF model which covers a larger palette of phonation types and solves some problems encountered in realtime interaction.

Variations in the glottal flow behavior are perceived as modifications of voice quality along several dimensions, such as tenseness or vocal effort. In the Ramcess synthesis engine, glottal flow parameters are modified through several dimensional mappings, in order to give access to the perceptual dimensions of a voice quality control space.

The expressive interaction with the voice material is done through a new digital musical instrument called the HandSketch: a tablet-based controller, played vertically, with extra FSR sensors. In this work, we describe how this controller is connected to voice quality dimensions, and we also discuss the long-term practice of this instrument.
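To make this chain concrete, here is a minimal Python/NumPy sketch of the three ingredients described above: a textbook LF glottal flow derivative standing in for the SELF model (whose actual formulation differs, see Chapter 4), a toy mapping from a perceptual tenseness control to LF timing parameters, and a pitch-synchronous overlap-add loop convolving each glottal pulse with a vocal tract impulse response. All names and numbers here (quality_to_lf, lf_period, the parameter ranges, the one-formant vocal tract) are assumptions invented for the example, not the Ramcess implementation.

```python
# Minimal sketch of an LF-source + vocal-tract + pitch-synchronous OLA
# synthesis chain. Illustrative assumptions throughout, not Ramcess/SELF.
import numpy as np

def quality_to_lf(tenseness):
    """Toy mapping from a perceptual 'tenseness' in [0, 1] to LF timing
    parameters, expressed as fractions of the period. Purely illustrative."""
    tp = 0.45                        # instant of maximum glottal flow
    te = 0.70 - 0.15 * tenseness     # instant of maximum excitation (GCI)
    ta = 0.06 - 0.05 * tenseness     # return phase: short = tense, long = lax
    return tp, te, ta

def lf_period(f0, fs, tp, te, ta, Ee=1.0):
    """One period of the textbook LF glottal flow derivative:
       open phase   (0 <= t <= te):  E0 * exp(a*t) * sin(pi*t/tp)
       return phase (te < t <= tc): -Ee/(eps*ta) * (exp(-eps*(t-te)) - exp(-eps*(tc-te)))
    with tc = 1 in time normalized to the period."""
    n = max(int(round(fs / f0)), 8)
    t = np.arange(n) / n             # normalized time in [0, 1)
    tc = 1.0

    # Solve eps*ta = 1 - exp(-eps*(tc - te)) by fixed-point iteration.
    eps = 1.0 / ta
    for _ in range(30):
        eps = (1.0 - np.exp(-eps * (tc - te))) / ta

    def pulse(alpha):
        e = np.zeros(n)
        op = t <= te
        # Amplitude condition: the open phase reaches -Ee at t = te.
        E0 = -Ee / (np.exp(alpha * te) * np.sin(np.pi * te / tp))
        e[op] = E0 * np.exp(alpha * t[op]) * np.sin(np.pi * t[op] / tp)
        e[~op] = -Ee / (eps * ta) * (np.exp(-eps * (t[~op] - te))
                                     - np.exp(-eps * (tc - te)))
        return e

    # Area balance: pick alpha by bisection so the derivative integrates
    # to ~zero over the cycle (the glottal flow returns to baseline).
    lo, hi = 0.0, 20.0
    for _ in range(40):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if pulse(mid).sum() > 0 else (lo, mid)
    return pulse(0.5 * (lo + hi))

def synthesize(f0, duration, vt_ir, tenseness, fs=44100):
    """Pitch-synchronous overlap-add: one glottal pulse per period,
    each convolved with the vocal tract impulse response."""
    n_out = int(duration * fs)
    out = np.zeros(n_out + len(vt_ir) + fs)      # margin for the last frame
    pos = 0
    while pos < n_out:
        tp, te, ta = quality_to_lf(tenseness)
        src = lf_period(f0, fs, tp, te, ta)
        frame = np.convolve(src, vt_ir)          # source * filter
        out[pos:pos + len(frame)] += frame       # overlap-add on the GCI grid
        pos += len(src)                          # hop by one pitch period
    return out[:n_out]

# Toy single-formant vocal tract (assumption): decaying cosine around 700 Hz.
fs = 44100
t_ir = np.arange(int(0.02 * fs)) / fs
vt_ir = np.exp(-300.0 * t_ir) * np.cos(2 * np.pi * 700.0 * t_ir)
voice = synthesize(f0=110.0, duration=1.0, vt_ir=vt_ir, tenseness=0.7, fs=fs)
```

Because the hop equals exactly one pitch period, each new pulse can be generated with updated LF parameters, which is what lets voice quality be modified continuously, as the realtime interaction described above requires.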
Compared to the usual prototyping of multimodal interactive systems, and more particularly of digital musical instruments, the work on Ramcess and HandSketch has been structured quite differently. Indeed, our prototyping process, called the Luthery Model, is rather inspired by traditional instrument making and based on embodiment. The Luthery Model also leads us to propose the Analysis-by-Interaction (AbI) paradigm, a methodology for approaching signal analysis problems. The main idea is that if a signal is not observable, it can be imitated with an appropriate digital instrument and a highly skilled practice. The signal can then be studied by analyzing the imitative gestures.

Acknowledgements

First, I would like to thank Prof. Thierry Dutoit. Five years ago, when I came to his office and told him about doing a PhD thesis in "something related to music", he was able to understand this project, see its potential, and – more than anything – trust me.

This PhD thesis has been the opportunity to meet extraordinary collaborators, and I would like to highlight some precious encounters: Prof. Caroline Traube, who definitely helped me dive into the computer music world; Prof. Christophe d'Alessandro and Boris Doval, for their deep knowledge of voice quality and long discussions about science and music; and Prof. Baris Bozkurt, for his wisdom, experience and open mind.

There is nothing like a great lab. With TCTS, I have been really lucky. I would like to thank all my workmates for their support and their availability for all the questions that I had. More particularly, I would like to thank Alexis Moinet and Thomas Dubuisson for their boundless involvement in several common projects.

This thesis also results from the strong wish of some people to enable interdisciplinary research on multimodal user interfaces: the eNTERFACE workshops. I would like to thank Prof. Benoît Macq for encouraging me to come back each year with new projects. I also would like to thank FRIA/FNRS (grant no. 16756) and the Région Wallonne (numediart project, grant no. 716631) for their financial support.

Family and friends have been awesome to me during the hard times I went through while writing this thesis. Special thanks to L. Berquin, L. Moletta, B. Mahieu, L. Verfaillie, S. Paço-Rocchia, S. Pohu and V. Cordy for sharing about projects, performance and art; A.-L. Porignaux, S. Baclin, X. Toussaint and B. Carpent for reading and correcting my thesis; M. Astrinaki for these endless discussions; and A. Zara for this great journey in understanding embodied emotions, and for the great picture of the HandSketch (Figure 7.1).

Finally, I warmly thank my parents for their unconditional trust and love, and Laurence Baclin, my wife, without whom this thesis would just not have been possible.

Contents

1 Introduction
  1.1 Motivations
  1.2 Speech vs. singing
  1.3 An interactive model of expressivity
  1.4 Analysis-by-Interaction: embodied research
  1.5 Overview of the RAMCESS framework
  1.6 Outline of the manuscript
  1.7 Innovative aspects of this thesis
  1.8 About the title and the chapter quote

2 State of the Art
  2.1 Producing the voice
    2.1.1 Anatomy of the vocal apparatus
    2.1.2 Source/filter model of speech
  2.2 Behavior of the vocal folds
    2.2.1 Parameters of the glottal flow in the time domain
    2.2.2 The Liljencrants-Fant model (LF)
    2.2.3 Glottal flow parameters in the frequency domain
    2.2.4 The mixed-phase model of speech
    2.2.5 The causal/anticausal linear model (CALM)
    2.2.6 Minimum of glottal opening (MGO)
  2.3 Perceptual aspects of the glottal flow
    2.3.1 Dimensionality of the voice quality
    2.3.2 Intra- and inter-dimensional mappings
  2.4 Glottal flow analysis and source/tract separation
    2.4.1 Drawbacks of source-unaware practices
    2.4.2 Estimation of the GF/GFD waveforms
    2.4.3 Estimation of the GF/GFD parameters
  2.5 Background in singing voice synthesis

3 Glottal Waveform Analysis and Source/Tract Separation
  3.1 Introduction
  3.2 Working with connected speech
    3.2.1 Recording protocol
    3.2.2 Phoneme alignment
    3.2.3 GCI marking on voiced segments
    3.2.4 Intra-phoneme segmentation
  3.3 Validation of glottal flow analysis on real voice
    3.3.1 Non-accessibility of the sub-glottal pressure
    3.3.2 Validation techniques used in the improvement of ZZT-based results
    3.3.3 Separability of ZZT patterns
    3.3.4 Noisiness of the anticausal component
    3.3.5 Model-based validation criteria
  3.4 Estimation of the glottal formant
    3.4.1 Shifting the analysis frame around GCI_k
    3.4.2 Evaluation of glottal formant frequency
    3.4.3 Fitting of the LF model
  3.5 Joint estimation of source/filter parameters
    3.5.1 Error estimation on a sub-codebook
    3.5.2 Error-based re-shifting
    3.5.3 Frame-by-frame resynthesis
  3.6 Evaluation of the analysis process
    3.6.1 Relevance and stability of source parameters
    3.6.2 Mean modeling error
  3.7 Conclusions

4 Realtime Synthesis of Expressive Voice
  4.1 Introduction
  4.2 Overview of the RAMCESS synthesizer
  4.3 SELF: spectrally-enhanced LF model
    4.3.1 Inconsistencies in LF and CALM transient behaviors
    4.3.2 LF with spectrally-generated return phase
  4.4 Voice quality control
    4.4.1 Mono-dimensional mapping: the "presfort" approach
    4.4.2 Realtime implementation of the phonetogram effect
    4.4.3 Vocal effort and tension
  4.5 Data-driven geometry-based vocal tract
  4.6 Conclusions

5 Extending the Causal/Anticausal Description
  5.1 Introduction
  5.2 Causality of sustained sounds
  5.3 Mixed-phase analysis of instrumental sounds
    5.3.1 Trumpet: effect of embouchure
    5.3.2 Trumpet: effect of intensity
    5.3.3 Violin: proof of concept
  5.4 Mixed-phase synthesis of instrumental sounds
  5.5 Conclusions

6 Analysis-by-Interaction: Context and Motivations
  6.1 Introduction