Realtime and Accurate Musical Control of Expression in Voice Synthesis
Nicolas d'Alessandro

A dissertation submitted to the Faculty of Engineering of the University of Mons for the degree of Doctor of Philosophy in Applied Science. Supervisor: Prof. T. Dutoit.

Abstract

In the early days of speech synthesis research, understanding voice production attracted the attention of scientists with the goal of producing intelligible speech. Later, the need to produce more natural voices led researchers to use prerecorded voice databases, containing speech units, reassembled by a concatenation algorithm. With the growth of computing capacities, the length of units increased, going from diphones to non-uniform units, in the so-called unit selection framework, using a strategy referred to as "take the best, modify the least".

Today the new challenge in voice synthesis is the production of expressive speech or singing. The mainstream solution to this problem is based on the "there is no data like more data" paradigm: emotion-specific databases are recorded and emotion-specific units are segmented.

In this thesis, we propose to revisit the expressive speech synthesis problem from its original voice production grounds. We also assume that the expressivity of a voice synthesis system relies more on its interactive properties than strictly on the coverage of the recorded database.

To reach our goals, we develop the Ramcess software system, an analysis/resynthesis pipeline which aims at providing interactive and realtime access to the voice production mechanism. More precisely, this system makes it possible to browse a connected speech database, and to dynamically modify the value of several glottal source parameters.

In order to achieve these voice transformations, a connected speech database is recorded, and the Ramcess analysis algorithm is applied. Ramcess analysis relies on the estimation of glottal waveforms and vocal tract impulse responses from the prerecorded voice samples. We cascade two promising glottal flow analysis algorithms, ZZT and ARX-LF, as a way of reinforcing the whole analysis process.

Then the Ramcess synthesis engine computes the convolution of the previously estimated glottal source and vocal tract components, within a realtime pitch-synchronous overlap-add architecture. A new model for producing the glottal flow signal is proposed. This model, called SELF, is a modified LF model which covers a larger palette of phonation types and solves some problems encountered in realtime interaction.

Variations in the glottal flow behavior are perceived as modifications of voice quality along several dimensions, such as tenseness or vocal effort. In the Ramcess synthesis engine, glottal flow parameters are modified through several dimensional mappings, in order to give access to the perceptual dimensions of a voice quality control space.

The expressive interaction with the voice material is done through a new digital musical instrument, called the HandSketch: a tablet-based controller, played vertically, with extra FSR sensors. In this work, we describe how this controller is connected to voice quality dimensions, and we also discuss the long-term practice of this instrument.
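As an illustration of the synthesis principle summarized above, the sketch below shows a pitch-synchronous overlap-add loop in which each glottal cycle is obtained by convolving an estimated glottal source pulse with a vocal tract impulse response, then added at the corresponding glottal closure instant. This is only a minimal, hypothetical Python sketch, not the actual Ramcess engine: the function and variable names are assumptions, and in the real system the pulses would be generated by the SELF model under realtime voice quality control rather than precomputed.

    # Hypothetical sketch of pitch-synchronous overlap-add resynthesis.
    # Each cycle = convolution of a glottal source pulse with a vocal tract
    # impulse response, overlap-added at its glottal closure instant (GCI).
    import numpy as np

    def resynthesize(glottal_pulses, vt_responses, gcis, length):
        """glottal_pulses[i], vt_responses[i]: 1-D arrays for the i-th cycle;
        gcis[i]: sample index where the i-th cycle is anchored."""
        out = np.zeros(length)
        for pulse, vt, gci in zip(glottal_pulses, vt_responses, gcis):
            if gci >= length:
                continue
            cycle = np.convolve(pulse, vt)       # source/filter convolution
            end = min(gci + len(cycle), length)
            out[gci:end] += cycle[:end - gci]    # overlap-add at the GCI
        return out

A realtime implementation would run this loop incrementally, cycle by cycle, with the source parameters updated between cycles by the voice quality mappings.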
Compared to the usual prototyping of multimodal interactive systems, and more particularly of digital musical instruments, the work on Ramcess and HandSketch has been structured quite differently. Indeed, our prototyping process, called the Luthery Model, is rather inspired by traditional instrument making and based on embodiment. The Luthery Model also leads us to propose the Analysis-by-Interaction (AbI) paradigm, a methodology for approaching signal analysis problems. The main idea is that if a signal is not observable, it can be imitated with an appropriate digital instrument and a highly skilled practice. Then the signal can be studied by analyzing the imitative gestures.

Acknowledgements

First, I would like to thank Prof. Thierry Dutoit. Five years ago, when I came to his office and told him about doing a PhD thesis in "something related to music", he could understand this project, see its potential and, more than anything, trust me.

This PhD thesis has been the opportunity to meet extraordinary collaborators. I would like to highlight some precious meetings: Prof. Caroline Traube, who definitely helped me dive into the computer music world; Prof. Christophe d'Alessandro and Boris Doval, for their deep knowledge of voice quality and long discussions about science and music; and Prof. Baris Bozkurt, for his wisdom, experience and open mind.

There is nothing like a great lab, and with TCTS I have been really lucky. I would like to thank all my workmates for their support and availability for all the questions that I had. In particular, I would like to thank Alexis Moinet and Thomas Dubuisson for their boundless involvement in several common projects.

This thesis also results from the strong wish of some people to enable interdisciplinary research on multimodal user interfaces. These teams share a common name: the eNTERFACE workshops. I would like to thank Prof. Benoît Macq for encouraging me to come back each year with new projects. I also would like to thank FRIA/FNRS (grant no. 16756) and the Région Wallonne (numediart project, grant no. 716631) for their financial support.

Family and friends have been awesome with me during the hard times I went through while writing this thesis. Special thanks to L. Berquin, L. Moletta, B. Mahieu, L. Verfaillie, S. Paço-Rocchia, S. Pohu and V. Cordy for sharing about projects, performance and art; A.-L. Porignaux, S. Baclin, X. Toussaint and B. Carpent for reading and correcting my thesis; M. Astrinaki for those endless discussions; and A. Zara for this great journey in understanding embodied emotions, and for the great picture of the HandSketch (Figure 7.1). Finally, I warmly thank my parents for their unconditional trust and love, and Laurence Baclin, my wife, without whom this thesis would just not have been possible.

Contents

1 Introduction
  1.1 Motivations
  1.2 Speech vs. singing
  1.3 An interactive model of expressivity
  1.4 Analysis-by-Interaction: embodied research
  1.5 Overview of the RAMCESS framework
  1.6 Outline of the manuscript
  1.7 Innovative aspects of this thesis
  1.8 About the title and the chapter quote

2 State of the Art
  2.1 Producing the voice
    2.1.1 Anatomy of the vocal apparatus
    2.1.2 Source/filter model of speech
  2.2 Behavior of the vocal folds
    2.2.1 Parameters of the glottal flow in the time domain
    2.2.2 The Liljencrants-Fant model (LF)
    2.2.3 Glottal flow parameters in the frequency domain
    2.2.4 The mixed-phase model of speech
    2.2.5 The causal/anticausal linear model (CALM)
    2.2.6 Minimum of glottal opening (MGO)
  2.3 Perceptual aspects of the glottal flow
    2.3.1 Dimensionality of the voice quality
    2.3.2 Intra- and inter-dimensional mappings
  2.4 Glottal flow analysis and source/tract separation
    2.4.1 Drawbacks of source-unaware practices
    2.4.2 Estimation of the GF/GFD waveforms
    2.4.3 Estimation of the GF/GFD parameters
  2.5 Background in singing voice synthesis

3 Glottal Waveform Analysis and Source/Tract Separation
  3.1 Introduction
  3.2 Working with connected speech
    3.2.1 Recording protocol
    3.2.2 Phoneme alignment
    3.2.3 GCI marking on voiced segments
    3.2.4 Intra-phoneme segmentation
  3.3 Validation of glottal flow analysis on real voice
    3.3.1 Non-accessibility of the sub-glottal pressure
    3.3.2 Validation techniques used in the improvement of ZZT-based results
    3.3.3 Separability of ZZT patterns
    3.3.4 Noisiness of the anticausal component
    3.3.5 Model-based validation criteria
  3.4 Estimation of the glottal formant
    3.4.1 Shifting the analysis frame around GCI_k
    3.4.2 Evaluation of glottal formant frequency
    3.4.3 Fitting of the LF model
  3.5 Joint estimation of source/filter parameters
    3.5.1 Error estimation on a sub-codebook
    3.5.2 Error-based re-shifting
    3.5.3 Frame-by-frame resynthesis
  3.6 Evaluation of the analysis process
    3.6.1 Relevance and stability of source parameters
    3.6.2 Mean modeling error
  3.7 Conclusions

4 Realtime Synthesis of Expressive Voice
  4.1 Introduction
  4.2 Overview of the RAMCESS synthesizer
  4.3 SELF: spectrally-enhanced LF model
    4.3.1 Inconsistencies in LF and CALM transient behaviors
    4.3.2 LF with spectrally-generated return phase
  4.4 Voice quality control
    4.4.1 Mono-dimensional mapping: the "presfort" approach
    4.4.2 Realtime implementation of the phonetogram effect
    4.4.3 Vocal effort and tension
  4.5 Data-driven geometry-based vocal tract
  4.6 Conclusions

5 Extending the Causal/Anticausal Description
  5.1 Introduction
  5.2 Causality of sustained sounds
  5.3 Mixed-phase analysis of instrumental sounds
    5.3.1 Trumpet: effect of embouchure
    5.3.2 Trumpet: effect of intensity
    5.3.3 Violin: proof of concept
  5.4 Mixed-phase synthesis of instrumental sounds
  5.5 Conclusions

6 Analysis-by-Interaction: Context and Motivations
  6.1 Introduction