Cantor Digitalis: Chironomic Parametric Synthesis of Singing Lionel Feugère1, Christophe D’Alessandro1,2 *, Boris Doval2 and Olivier Perrotin1
Total Page:16
File Type:pdf, Size:1020Kb
Feugère et al. EURASIP Journal on Audio, Speech, and Music Processing (2017) 2017:2 DOI 10.1186/s13636-016-0098-5 RESEARCH Open Access Cantor Digitalis: chironomic parametric synthesis of singing Lionel Feugère1, Christophe d’Alessandro1,2 *, Boris Doval2 and Olivier Perrotin1 Abstract Cantor Digitalis is a performative singing synthesizer that is composed of two main parts: a chironomic control interface and a parametric voice synthesizer. The control interface is based on a pen/touch graphic tablet equipped with a template representing vocalic and melodic spaces. Hand and pen positions, pen pressure, and a graphical user interface are assigned to specific vocal controls. This interface allows for real-time accurate control over high-level singing synthesis parameters. The sound generation system is based on a parametric synthesizer that features a spectral voice source model, a vocal tract model consisting of parallel filters for vocalic formants and cascaded with anti-resonance for the spectral effect of hypo-pharynx cavities, and rules for parameter settings and source/filter dependencies between fundamental frequency, vocal effort, and formants. Because Cantor Digitalis is a parametric system, every aspect of voice quality can be controlled (e.g., vocal tract size, aperiodicities in the voice source, vowels, and so forth). It offers several presets for different voice types. Cantor Digitalis has been played on stage in several public concerts, and it has also been proven to be useful as a tool for voice pedagogy. The aim of this article is to provide a comprehensive technical overview of Cantor Digitalis. Keywords: Digital musical instrument, Voice synthesis, Gestural control, Singing voice 1 Introduction documentation. However, the scientific basis and techni- Cantor Digitalis is a singing instrument, i.e., a perfor- cal details underlying Cantor Digitalis have never been mative singing synthesis system. It allows for expressive published and discussed. The aim of the present work is to musical control of high-quality vocal sounds. Expres- provide a comprehensive technical description of Cantor sive musical control is provided by an effective human- Digitalis, including the interface, the formant synthesizer, computer interface that captures the player’s gestures and and the singing synthesis rules. converts them into synthesis control parameters [1, 2]. The sound synthesis components of Cantor Digitalis High-quality vocal sounds are produced by the synthe- are in the tradition of formant synthesis. Apart from sis engine, which features a specially designed formant tape-based music using recorded voices and vocoders, synthesizer and an elaborate set of singing rules [3–5]. synthetic voices first appeared in contemporary music Cantor Digitalis is a musical instrument, and it is reg- pieces thanks to the “Chant” program [6]. “Chant” was ularly played on stage by Chorus Digitalis1,thechoirof based on a formant voice synthesizer and synthesis by Cantor Digitalis. The expressiveness and sound quality of rules, i.e., a parametric model of voice production3.Other this innovative musical instrument have been recognized, research groups also proposed rule-based formant synthe- as it was awarded the first prize of the 2015 Interna- sizers [7, 8]. The main advantage of parametric synthesis is tional Margaret Guthman Musical Instrument Competi- its flexibility and economy in terms of memory and com- tion (Georgia Institute of Technology)2. Cantor Digitalis is putational load. The next generation of voice synthesis distributed as a free software, accompanied by a detailed systems was based on recording, concatenation, and mod- ification of real voice samples4 or statistical parametric *Correspondence: [email protected] synthesis [9]. A formant synthesizer is preferred for Can- 1LIMSI, CNRS, Université Paris-Saclay, Bât 508, rue John von Neumann, Campus Universitaire, F-91405, Orsay Cedex, France tor Digitalis because flexibility and real time are the main 2Sorbonne Universités, UPMC Univ Paris 06, CNRS, UMR 7190, Institut Jean Le issues for performative singing synthesis. Rond d’Alembert, 4 place Jussieu, F-75005, Paris, France © The Author(s). 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. Feugère et al. EURASIP Journal on Audio, Speech, and Music Processing (2017) 2017:2 Page 2 of 19 Singing instruments have been proposed by different noise (aspiration noise in the voice resulting in breathiness research groups [3, 10–13]. The graphic tablet has been and structural aperiodicities such as vocal jitter or shim- proposed for approximately a decade [10, 14, 15] for con- mer resulting in roughness or hoarseness), and vocal tract trolling intonation and voice source variation. This inter- size (or larynx height). All high-level parameters are listed face appeared as a very effective choice. It has been exten- below: sively tested for intonation control in speech and singing synthesis [16, 17]. Additionally, this interface allows much • Pitch P corresponds to the perceived melodic expressiveness [18] because it takes advantage of the dimension of voice sounds. It is often the most accuracy and precision acquired through writing/drawing important musical dimension. gestures. This is the interface chosen for Cantor Digitalis. • Vocal effort E corresponds to the dynamics, i.e., This article describes the three main original com- perceived force of vocal sounds. It is also an essential ponents of Cantor Digitalis: the interface, the synthesis musical dimension. engine, and the rules for converting the input of the • Vowel height H defines the openness or closeness of former into parameters for the latter. The next section the vowel and corresponds to the vertical axis in the presents the general architecture of Cantor Digitalis and vocalic triangle, a classical two-dimensional the main issues and choices related to chironomic control representation for vowels. This dimension is related of the singing voice. Section 3 presents the parametric for- to the vertical position of the tongue, which depends mant synthesizer. Section 4 describes the synthesis rules, on the aperture of the jaw. i.e., the transformation of parameters issued by the chi- • Vowel backness V defines the front-back position of ronomic interface into synthesizer parameters. Section 5 the vowel and corresponds to the horizontal axis of discusses the obtained results, illustrated by audio-visual the vocalic triangle. It is related to the position of the files, and proposes some directions for future work. tongue relative to the teeth and the back of the mouth. • The noise dimension in vocal sounds can be 2 Chironomic control of the singing voice decomposed into two components. The first The Cantor Digitalis architecture is illustrated in Fig. 2. component is roughness or hoarseness R due to It is composed of three layers: the interface, the synthe- structural aperiodicities, i.e., random pitch period or sis/mapping rules, and the parametric synthesizer. This amplitude perturbations. It defines the hoarse or architecture follows the path of music production, from rough quality of the voice. the player to sound. Initially, the musician plans to pro- • Breathiness B is the second noise dimension. Leakage duce a given musical phrase, with a given vowel, given at the glottis produces aspiration or breath noise in dynamics, given voice quality, and so forth. The planned the voice. The extreme case of breathiness is musical task is then expressed through hand gestures whispering, with no fold vibration, resulting in related to the interface, i.e., through motions of a sty- unvoiced vowels. lus and fingers on the graphic tablet (after selection of • Tenseness T defines the tense/lax quality of the the voice type and other presets). The interface captures voice, i.e., the degree of adduction/abduction of the high-level parameters that are perceptually relevant to the vocal folds. player, such as the vowel quality or pitch. These high-level • Vocal tract size S defines the apparent vocal tract size parameters are then converted into low-level synthesis of the singer. Vocal tract size is singer dependent, but parameters through a layer of synthesis/mapping rules. it also varies according to the larynx position or lips Low-level synthesis parameters drive the parametric voice rounding for the same individual. synthesizer for sound sample production. The resulting • Pitch range is singer dependent. A typical singer sound is played back; listened to by the musician, who range is approximately 2 octaves. For simplicity, a reacts accordingly; and the perception-action loop for unique pitch range size of 3 octaves is implemented. performative singing synthesis is closed. Before address- To play either low (e.g., bass) or high (e.g., soprano) ing the control method itself, the control parameters must voices, a pitch offset parameter P0 is introduced. be identified. • Laryngeal vibration mechanism M defines the vibration mode of the vocal folds used by the singer. 2.1 Singing voice parameters Only chest and falsetto mechanisms are used, Cantor Digitalis is restricted to vocalic sounds (the case of corresponding to M = 1 and M = 2, respectively. consonants being considerably more difficult for real-time high-quality musical control [5, 19]). The corresponding All these dimensions are expressed in normalized units parameters are pitch, voice force (or vocal effort), voice (between 0 and 1), except P0 and M, and are summa- quality, and vowel label. The main perceived dimensions rized in Table 1. For most musical performances, only of voice quality are [20] voice tension (lax/tense voice), vowel label, pitch, and vocal effort are controlled with the Feugère et al. EURASIP Journal on Audio, Speech, and Music Processing (2017) 2017:2 Page 3 of 19 Table 1 High-level parameter control targets. The top of Fig.