Glitch Free FM Vocal Synthesis
Chris Chafe
Center for Computer Research in Music and Acoustics, Stanford University

Copyright: © 2013 Chris Chafe et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 Unported License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

ABSTRACT

Frequency Modulation (FM) and other audio rate non-linear modulation techniques like Waveshaping Digital Synthesis, Amplitude Modulation (AM) and their variants are well-known techniques for generating complex sound spectra. Kleimola [1] provides a comprehensive and up-to-date description of the entire family. One shared trait is that synthesizing vocal sounds and other harmonically-structured sounds comprised of formants can be problematic because of an obstacle which causes distortions when intensifying time-varying controls.

Large deflections of pitch or phoneme parameters cause jumps in the required integer approximations of formant center frequencies. Trying to imitate human vocal behavior with its often wide prosodic and expressive excursions causes audible clicks. A partial solution lay buried in some code from the 80’s. This, combined with a phase-synchronous oscillator bank described in Lazzarini and Timoney [2], produces uniform harmonic components which ensure artifact-free, exact formant spectra even under the most extreme dynamic conditions. The paper revisits singing and speech synthesis using the classic FM single modulator / multiple-carrier structure pioneered by Chowning [3]. The revised method is implemented in Faust and is as efficient as its predecessor technique. Dynamic controls arrive multiplexed via an audio rate “articulation stream” which interfaces conveniently with sample-synchronous algorithms written in ChucK. FM for singing synthesis can now be “abused” with radical time-varying controls. It also has potential as an efficient means for low-bandwidth analysis – resynthesis speech coding. Applications of the technique for sonification and in concert music are described.

1. INTRODUCTION

Synthesis of singing voice by computer has a history which begins in the very first years of computer music. The song Daisy Bell (Bicycle Built for Two) was sung by a computer in 1961 in an arrangement by Max Mathews and Joan Miller with vocal synthesis by John Kelly and Carol Lochbaum when the Bell Telephone Laboratories experiments with digital music synthesis were only 4 years old. It was an early case of analysis – resynthesis speech coding providing a means for singing synthesis. Over the decades, most synthesis techniques have been applied to emulate the singing voice (additive, subtractive, physical model, FOF, etc.). The quest continues more than fifty years later with composers attracted to vocal synthesizers like Yamaha’s Vocaloid¹ where they can explore a fascination with musical personalities of singers which never existed. This paper joins a thread which began with Chowning’s work in the late 70’s, early 80’s involving FM for vocal synthesis and which has been virtually languishing since its early use in a few musical works.

¹ Vocaloid3 uses a triphone frequency-domain concatenative synthesis engine.

John Chowning’s FM singing voice method was first described in his 1980 article [3] prior to completing Phonē at IRCAM (1981). The multi-channel tape piece features a wide variety of singing voices and morphing of vocal timbres with other FM-generated timbres such as gongs. The technique creates multiple formants with independent tunings using multiple carriers and a shared modulator. Two formants are used for his version of a soprano voice “eee” and three formants for his spectrally-rich basso profondissimo. A later version adds a third formant to the soprano model in a synthesis of the vowel “ahh” [4]. Pitch vibrato which causes synchronous spectral modulation is especially effective and Chowning has often demonstrated how crucial this is to rendering vowels convincingly. “It is striking that the tone only fuses and becomes a unitary percept with the addition of the pitch fluctuations, thus spectral envelope does not make a voice!” [3].

The method has an inherent shortcoming which limits the amount of vibrato excursion and limits phoneme transitions to nearby phonemes. Beyond these limits noticeable artifacts occur which are caused by discrete shifts of formant center frequency. Discontinuities are perceived as clicks and result from integer shifts in the carrier to modulator ratio $c : m$ which are required in order to track a desired formant center frequency $f_c$ for a given pitch $f_p$. The modulating oscillator is always set to $f_p$, so $m = 1.0$. The carrier ratio $c$ is an integer approximation and quantization of the actual real ratio $f_c / f_p$.

Formant synthesis with FM is essentially contradictory to the physics. The harmonic nature of voiced sound allows only harmonic number ratios $c \in \mathbb{N}_{\geq 1}$ for the carrier. Where physical sound production is an excitation – resonance mechanism with independent tuning of both elements, FM models can only approximate the resonance frequencies of the latter when constrained to produce harmonic spectra. The inherent problem is that these approximations are discontinuous in frequency. In practice, this severely limits the amount of pitch skew, constraining not only vibrato but also portamento and glissando to small ranges. In the method’s original form, it is impossible to shift ratios without causing glitches like those shown in the spectrograms of Fig. 1.

[Figure 1: spectrograms, Frequency (Hz) from 0 to 4000 against Time (s) from 0 to 4. (a) vibrato; (b) vowel alternation “eee–ooo–eee”.]
Figure 1. Clicks always occur when a transition to a new formant center frequency $f_c$ forces a carrier oscillator to change its harmonic ratio. FM vocal formants use a $c : m$ ratio where $c \in \mathbb{N}_{\geq 1}$ and $m = f_p$, where $f_p$ is the desired pitch.

[Figure 2: spectrograms, Frequency (Hz) from 0 to 4000 against Time (s) from 0 to 4. (a) vibrato – deglitched; (b) vowel alternation – deglitched.]
Figure 2. Result of applying the solution adopted from Le Brun to the synthesis shown in Fig. 1.

Finding a solution became necessary in order to use FM as the synthesis engine for a sonification project involving brain signals. FM singing offers advantages for this body of work which attempts to fashion a singing choir directly from the mind. The goal isn’t what one probably first imagines, e.g., an ensemble of mind-controlled voices. Instead, this is a technology for auditory display of the rapid fluctuations of EEG and electrocorticography (ECoG) recordings. Singing voice synthesis has its attractions in that it can allude to imagery of “inner voices” but it is also particularly apt because of the ease with which listeners lock on to patterns of phonemic and other voice-like timbral transitions. The range of data encountered in brain recordings (from quiescent to seizure) and a desire to have a very flexible mapping strategy have been motivations for the present investigation into solving the discontinuity problem. The completed work will reach the public as a gallery installation (exploring recorded data) and as a medical monitoring device for detecting seizures (with the singing voice controlled directly from electrodes in real time).

2. EARLY SOLUTION

Marc Le Brun described digital waveshaping synthesis in 1979 as a generalized paradigm for non-linear modulation synthesis [5]. FM is a special case of waveshaping synthesis and in devising a way to avoid the discontinuity problem for waveshaping, Le Brun also solved it for the FM case. Le Brun’s solution remains unpublished (until now) with one exception: Bill Schottstaedt has preserved it as a synthesis instrument in the Common Lisp Music (CLM) project [6]. From the code comment, “Vox, an elaborate multi-carrier FM instrument is the voice instrument written by Marc Le Brun, used in Colony and other pieces.”

Vox avoids the integer ratio shift discontinuities by implementing a cross-fading solution. Two carriers corresponding to even and odd harmonic numbers are assigned to each formant, “bracketing” the true formant center frequency. Their assignments are made from the two nearest harmonics: one is the nearest lower harmonic $f_{lower} = \lfloor f_c / f_p \rfloor$ while the other is the nearest upper harmonic $f_{upper} = \lceil f_c / f_p \rceil$. The assignment of harmonics to individual oscillators is dynamic and depends on whether they are even numbered or odd numbered. When an oscillator is required to change its harmonic number the other will be approaching the actual target $f_c / f_p$. The two carrier oscillators’ amplitudes sum to unity in a mixture whose gains are complementary and linearly determined by proximity to the target. The key feature which makes the [...]

[...] mixed for each formant. These are the pair being cross-faded to combat the clicks in what I will now label as the “first-order problem.”

The cross-fade technique assumes that the energy of all coincident pairs of spectral lines will sum arithmetically. However, this assumption does not take phase into account. A “second-order problem” is caused by phase interference of coincident spectral lines. These are the spectral lines (carrier and sideband frequencies) of the two overlapping formant generators which fill out the spectral envelope of the formant.
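As an illustration of the first-order cross-fade attributed to Vox in Section 2, the following Python sketch computes the harmonic numbers and gains for one formant. It is not Le Brun's code and not the Faust implementation; the function names `quantized_ratio` and `crossfade_carriers` are hypothetical, the formant is assumed to lie at or above the pitch, and the upper harmonic is taken as the lower harmonic plus one (rather than a ceiling) so that an exactly integer ratio still yields two distinct harmonics.

```python
import math


def quantized_ratio(fc: float, fp: float) -> int:
    """Single-carrier case: nearest harmonic number (at least 1) to fc / fp."""
    return max(1, round(fc / fp))


def crossfade_carriers(fc: float, fp: float):
    """Bracket fc / fp with two harmonics and give them complementary gains.

    Returns ((even_harmonic, gain), (odd_harmonic, gain)).
    Assumes fc >= fp so that the lower harmonic is at least 1.
    """
    ratio = fc / fp
    lower = math.floor(ratio)
    # Assumption: lower + 1 instead of ceil(), so an exactly integer ratio
    # still yields two distinct harmonics (the upper one then has gain 0).
    upper = lower + 1
    upper_gain = ratio - lower        # linear in proximity to the target
    lower_gain = 1.0 - upper_gain     # complementary, so the gains sum to 1
    lower_pair, upper_pair = (lower, lower_gain), (upper, upper_gain)
    # Fixed oscillator roles: one carrier always holds the even harmonic,
    # the other always holds the odd harmonic.
    if lower % 2 == 0:
        return lower_pair, upper_pair
    return upper_pair, lower_pair


if __name__ == "__main__":
    fp = 100.0
    for fc in (290.0, 300.0, 310.0):
        print(fc, quantized_ratio(fc, fp), crossfade_carriers(fc, fp))
```

Sweeping the formant from 290 Hz to 310 Hz at a 100 Hz pitch, the odd carrier stays on harmonic 3 throughout, while the even carrier moves from harmonic 2 to harmonic 4 only at the instant its gain reaches zero. The sketch covers amplitudes only; it does not address the phase interference described as the second-order problem.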