Challenges and Perspectives on Real-Time Singing Voice Synthesis S´Intesede Voz Cantada Em Tempo Real: Desaﬁos E Perspectivas

Revista de Informatica´ Teorica´ e Aplicada - RITA - ISSN 2175-2745 Vol. 27, Num. 04 (2020) 118-126 RESEARCH ARTICLE Challenges and Perspectives on Real-time Singing Voice Synthesis S´ıntesede voz cantada em tempo real: desafios e perspectivas Edward David Moreno Ordonez˜ 1, Leonardo Araujo Zoehler Brum1* Abstract: This paper describes the state of art of real-time singing voice synthesis and presents its concept, applications and technical aspects. A technological mapping and a literature review are made in order to indicate the latest developments in this area. We made a brief comparative analysis among the selected works. Finally, we have discussed challenges and future research problems. Keywords: Real-time singing voice synthesis — Sound Synthesis — TTS — MIDI — Computer Music Resumo: Este trabalho trata do estado da arte do campo da s´ıntese de voz cantada em tempo real, apre- sentando seu conceito, aplicaçoes˜ e aspectos tecnicos.´ Realiza um mapeamento tecnologico´ uma revisao˜ sistematica´ de literatura no intuito de apresentar os ultimos´ desenvolvimentos na area,´ perfazendo uma breve analise´ comparativa entre os trabalhos selecionados. Por fim, discutem-se os desafios e futuros problemas de pesquisa. Palavras-Chave: S´ıntese de Voz Cantada em Tempo Real — S´ıntese Sonora — TTS — MIDI — Computaçao˜ Musical 1Universidade Federal de Sergipe, Sao˜ Cristov´ ao,˜ Sergipe, Brasil *Corresponding author: [email protected] DOI: http://dx.doi.org/10.22456/2175-2745.107292 • Received: 30/08/2020 • Accepted: 29/11/2020 CC BY-NC-ND 4.0 - This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. 1. Introduction voice synthesis embedded systems, through the description of its concept, theoretical premises, main used techniques, latest From a computational vision, the aim of singing voice syn- developments and challenges for future research. This work thesis is to generate a song, given its musical notes and lyrics is organized as follows: Section 2 describes the theoretical [1]. Hence, it is a branch of text-to-speech (TTS) technology requisites that serve as base for singing voice synthesis in gen- [2] with the application of some techniques of musical sound eral; Section 3 discusses about singing synthesis techniques; synthesis. Section 4 presents a technological mapping of the patents An example of application of singing voice synthesizers registered for this area; in Section 5 the systematic review of is in the educational area. Digital files containing the singing literature is shown. Section 6 contains a comparative analysis voice can be easily created, shared and modified in order to among the selected works; Section 7 discusses the challenges facilitate the learning process, which dispenses the human and future tendencies for this field; finally, Section 8 presents presence of a singer as reference or even recordings. a brief conclusion. This kind of synthesis can also be used for artistic purposes [3]. Investments on the “career” of virtual singers, like 2. Theoretical framework Hatsue Miku, in Japan, have been made, which includes live The problem domain on singing voice synthesis is multidis- shows where the singing voice is generated by Vocaloid syn- ciplinary: beyond computer science, it depends on concepts thesizer, from Yamaha, and the singer image is projected by from acoustics, phonetics, music theory and signal processing. holograms. The following subsections present some concepts from each The applications on singing voice synthesis technology mentioned knowledge area. have been increased by the development of real-time synthesizers, like Vocaloid Keyboard [4], whose virtual singer was 2.1 Elements of Acoustics implemented by an embedded system into a keytar, allowing Sound is a physical phenomenon produced by a variation its user the execution of an instrumental performance. of air pressure levels over time, coming from a vibratory The present article presents a review about real-time singing source, for example, and a guitar string. Due to its undulatory Challenges on Real-time Singing Synthesis nature, sound is measured by physical quantities like period, frequency, and amplitude. Period consists in the duration of a complete wave cycle; frequency is the inverse of period, and indicates the quantity of cycles per second that the sound wave presents; amplitude is the maximum value of the pressure variation in relation to the equilibrium point of the oscillation [5]. The period and amplitude of a simple soundwave can be viewed in Figure 1. Figure 2. Envelope shape and its four phases.[5] 2.2 Sound qualities and music theory In regard of the perception of the sound phenomenon by the human sense of audition, the physical components previously described have a relation with the so-called sound qualities: Figure 1. A simple soundwave. pitch, which permits us to distinguish between bass and treble sounds, is directly proportional to the fundamental fre- The simplest soundwaves are the so-called sinusoids, which quency; consist in a single frequency, but this kind of sound is pro- intensity, which depends on amplitude and allows us to duced neither by nature, nor by conventional musical instru- stablish the difference between strong sounds and weak ones; ments. Such sources generate complex sounds, composed by The timbre, related to the waveshape and its envelope, is several frequencies. The lowest of them is called fundamental the quality that makes possible the perception of each sound frequency. When the values of the other frequencies are mul- source as they have their own “voice”. tiples of the fundamental one, the sound is said to be periodic. The sounds of a piano and a guitar, for example, are dis- When the opposite is true, the sound is called aperiodic. The tinguishable by their timbre, even if they have the same pitch superposition of frequencies results in a waveshape, which is and intensity. The fourth quality of sound is its duration over typical for each sound source. It describes a curve called en- time. velope, obtained from the maximum values of the waveshape It is important to remark that complex periodic sounds oscillation. are usually perceived as they have a definite pitch, which cor- The envelope shape is commonly decomposed into four responds to the fundamental frequency, for these sounds are phases, denominated by the following abbreviation ADSR. called musical sounds. On the other hand, aperiodic sounds ADSR means: do not have a distinguishable pitch and they are denominated as noise, although they are also employed in music, especially (i) attack, which corresponds to the time between the be- in percussion instruments. ginning of the execution of the sound and its maximum Since the development of the music theory in the Western amplitude; World, some ways of selection and organization of musical sounds for artistic purposes are established, as well graphical (ii) decay, the necessary time from the maximum amplitude representations of them, according to their qualities. The to a constant one; sensation of similitude between a certain sound and another one with twice the value of the fundamental frequency of the (iii) sustain, the interval of time where the amplitude keeps first, allowed the division of the range of audible frequency such constant state; and into musical scales, composed by individual sounds called by the music theory as notes. The succession of sounds with (iv) release, from the end of the constant state to the return different pitches or, in other words, the succession of different to silence [5]. musical notes is the part of music denominated melody. The rhythm is another part of music and it can be defined Figure 2 shows a schematic chart with an envelope shape as the movement of sounds regulated by their duration. The and its four phases indicated by its initial letters. duration of notes is measured by and arbitrary unit called R. Inform. Teor.´ Apl. (Online) • Porto Alegre • V. 27 • N. 4 • p.119/126 • 2020 Challenges on Real-time Singing Synthesis tempo, based on the proportion that the notes keep in relation vibrations that result from the obstruction of the airflow by to each other’s durations. The tempos are grouped in equal body parts such as lips or tongue. On the other hand, vowels portions into structures called bars. have a periodic nature. In modern musical notation, melodies are written in two Another difference between vowels and consonants is re- dimensions: symbols with different shapes, called musical lated to the role that each one plays in the syllable, which is figures, indicate the duration of sounds. Their succession is a unit whose definition is difficult to be given, but it can be made horizontally in a staff, which is a set of five lines and described as a sonorous chain composed by acclivities, peaks four spaces that indicate, according to the position that the and declivities of sonority. According to the frame/content musical figures occupy in it vertically, the different pitches. theory [7], speech is organized in syllabic frames, which con- Figure 3 shows a melody written in a staff. sist in cycles of opening and closure of mouth. In each frame, there is a segmental content, the phonemes. This content has Three structures: attack, nucleus, and coda. Consonantal phonemes are located in attack or coda, while a vowel form the syllable nucleus [8]. Furthermore, there are phonemes called semivowels, which are present in diphthongs and can Figure 3. Melody written in a staff. be found in the syllabic boundaries. In an acoustic perspective, the syllable is a waveshape where consonants and semivowels The quality of intensity is treated by a part of music the- occupy the phases of attack (acclivity) and release (decliv- ory called dynamics. It is indicated in musical notation by ity), while vowels are in sustain phase (peak or nucleus). In symbols located below the staff.

Load more