The Vocaloid Phenomenon of Vocal Synthesis and Sample Concatenation
Cole Masaitis
University of Mary Washington
Department of Music
Dr. Mark Snyder
January 30, 2017

Imagine a future in which a singing voice synthesizer, rather than a real human, sang to you. What if I told you that this future has been around for quite some time? Vocaloid, computer software that matches this exact description, has been blowing up in Japan for over a decade, while other regions, including the West, have been largely in the dark about it. In reality, Vocaloid has been slowly but surely seeping into pop culture in other countries as well.

Vocaloid originated in the early 2000s and was developed by Hideki Kenmochi, whom some refer to as “the father” of the software, for a research project at Pompeu Fabra University in Barcelona, Spain. Following his time at the university, Yamaha Corporation funded his research, which allowed for the further development of his creation; since then, the software has evolved into the worldwide phenomenon that exists today. Vocaloid employs sample concatenation, the sequencing of recorded samples, to replicate the human voice; each voicebank found in the Vocaloid editor programs is built from actual recordings of a different individual. Originally, it could only pronounce vowels, but by 2003 the team had released a product that could sing simple words. Over the years the product went through several iterations before it reached the modern form of vocal synthesis it exemplifies today: Yamaha’s Vocaloid 4 software, with more advanced phonetic, linguistic, and vocal capabilities than ever before.

For an original composition, I decided to research the Vocaloid voice-synthesizing software and incorporate it into a lecture recital performance of a new composition.
This composition uses modern instrumentation and human forms of expression to contrast with the inherently more robotic voice of the Vocaloid. The first step, however, was to delve into as much background research on Vocaloid as possible, beginning with the question anyone would likely ask: “So what exactly is Vocaloid?”1

According to an early write-up by the creator, Hideki Kenmochi, and Hayato Ohshita, “Vocaloid is a singing synthesizer developed by the Yamaha Corporation. It is one of the few singing synthesizers that are available to end-users, and is the most widely used in the world currently. It provides the product users not merely a synthesis engine but an integrated environment in which the user can generate a singing voice easily and use it for music production.”2 In her Bandcamp article Vocaloids: Our Friends Electric, Mariana Timony describes it in slightly more accessible terms, stating that Vocaloids “are synthesized vocal singing software first developed by Yamaha in 2003 that has, quite literally, taken on a life” of its own. She also compares Vocaloid to the synthesized instruments, such as “violins, harpsichords” and more, found in music-creation software called digital audio workstations (DAWs), such as “Garageband”, Logic Pro X, and Ableton Live. Timony says “Vocaloid software offers to its users a sort of singer in a box. Human voice actresses and actors sing various phonemes, which are compiled to create voicebanks of various synthesized sounds for anyone with a laptop for less than $200.
Once installed, the Vocaloid is programmable through a piano-roll interface, with parameters for adjusting pitch, vibrato, tone, clarity, and other vocal characteristics.”3 Kenmochi also notes that, to create melodies, one must be familiar with the Musical Instrument Digital Interface (MIDI), which appears in the Score Editor: a piano-roll view, literally a piano keyboard laid out vertically, in which users click in the notes and durations they wish to use. Essentially, he mentions that in the Score Editor, “the user can input notes, lyrics, and optionally some expressions”. He goes on to explain that “Editor is designed specifically for Vocaloid”, and that “anyone can type out their lyrics as if they were normally writing the words, and they are converted into phonetic symbols by scanning a built-in pronunciation dictionary”.4

1 "My Vocaloid - History" 2017
2 Kenmochi and Ohshita 2007
3 Timony 2017

To break these concepts down a bit further, the Merriam-Webster dictionary defines phonemes, the sounds that phonetic symbols represent, as “abstract units of the phonetic system of a language that correspond to a set of similar speech sounds which are perceived to be a single distinctive sound in the language”. English has forty-four phonemes in all, each contributing a different sound to how we form the language.5 Returning to Kenmochi, he shares an example of how this works. Suppose you would like to concatenate (or link together) the sequence s-e, e, e-t, which are all the sounds that make up the word “set”: “The spectral envelope of sustained [e] at each frame is generated by interpolating [e] in the end of [s-e] and [e] in the beginning of [e-t]. By doing these processes, there is theoretically no timbre or tone quality gap in the concatenation.”6

In the New York Times article Can I Get That Song in Elvis, Please?, author Bill Werde does an excellent job of simplifying this information for his readers.
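As a brief aside before Werde’s analogy, Kenmochi’s [s-e], [e], [e-t] example above can be made concrete with a small sketch. This is only an illustration under stated assumptions, not Vocaloid’s actual implementation: the one-word pronunciation dictionary, the vowel set, the function names, and the envelope numbers are all hypothetical, and linear blending is assumed because the source says only that the envelopes are interpolated.

```python
# Minimal sketch of the "set" example. All names and numbers are illustrative
# assumptions, not Vocaloid's actual data or API.

# Hypothetical one-word pronunciation dictionary (a real editor ships a large one).
PRONUNCIATION = {"set": ["s", "e", "t"]}
VOWELS = {"a", "e", "i", "o", "u"}


def lyric_to_phonemes(word):
    """Look up the phonetic symbols for a typed lyric."""
    return PRONUNCIATION[word.lower()]


def phonemes_to_units(phonemes):
    """Turn phonemes into concatenation units: each transition between
    neighbours, plus a sustained unit for any vowel between two transitions."""
    units = []
    for i in range(len(phonemes) - 1):
        units.append(f"{phonemes[i]}-{phonemes[i + 1]}")
        nxt = phonemes[i + 1]
        if i + 1 < len(phonemes) - 1 and nxt in VOWELS:
            units.append(nxt)
    return units


def interpolate_envelopes(start_env, end_env, n_frames):
    """Linearly blend two spectral envelopes over n_frames frames.
    Linear blending is an assumption; the source only says 'interpolating'."""
    frames = []
    for k in range(n_frames):
        t = k / (n_frames - 1) if n_frames > 1 else 0.0
        frames.append([(1 - t) * a + t * b for a, b in zip(start_env, end_env)])
    return frames


units = phonemes_to_units(lyric_to_phonemes("set"))  # s-e, e, e-t

# Made-up envelope values (amplitude per frequency band) for the [e] sound
# at the end of [s-e] and at the start of [e-t].
e_at_end_of_se = [1.0, 0.5, 0.2]
e_at_start_of_et = [0.8, 0.7, 0.4]

sustained_e = interpolate_envelopes(e_at_end_of_se, e_at_start_of_et, 5)
```

In this sketch the first frame of the sustained [e] equals the envelope at the end of [s-e] and the last frame equals the envelope at the start of [e-t], which is the sense in which the joins carry no timbre gap.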
Werde explains that, in layman’s terms, Vocaloid is more easily understood as an “Audio Font: musical notation and lyrics can be translated into the chosen voice, then saved for replay, just as a word processor might translate a text into Helvetica or Times New Roman and print out as many times as you like”.7 He also says, “These fonts are made up of a database of phonemes (vowels and consonants), the basic sounds that make up any language. To create the database, technicians record a singer performing as many as 60 pages including thousands of scripted articulations (like “epp, pep, lep”). Assorted pitches and techniques like glissandos and legatos are also thrown in the mix; with all the combinations, the process takes a week of five-hour singing days”. According to Werde, Ed Stratton, the managing director of Zero-G Limited, a company that licensed the Vocaloid technology from Yamaha, says that the fonts resulting from all those arduous hours of recording are “reminiscent” of the singer’s voice. Werde also mentions that “it requires a deep knowledge of phonetics and audio engineering to create new fonts”.8

4 Kenmochi and Ohshita 2007
5 Reithaug 2007
6 Kenmochi and Ohshita 2007
7 Werde 2003

In the 2014 Red Bull Music Academy article The Making of Vocaloid, Jordi Bonada, a senior researcher in the Music Technology Group at Pompeu Fabra University who assisted and supervised Kenmochi, recalls: “We realized it might be a better idea to record not just a song from a particular singer, but a set of vocal exercises with a great phonetics range, and build a model capable of singing any song”.9 Interestingly enough, Kenmochi mentions in the same article that “One style we can’t really do with Vocaloid now is very rough singing. The program assumes you can detect pitch, it’s basic frequency. But in rough voices, you sometimes can’t.
We want to improve that”.10 Around the time of this article, a newer version of the software, Vocaloid 4, added a growl feature. It does not quite give Vocaloid voices the harsh quality of a screaming vocalist in a metal band, but it lets users add a gritty, raspy quality to certain sections of the vocals to make them more intense.11

8 Werde 2003
9 St. Michel 2014
10 St. Michel 2014

Most Vocaloids are developed and recorded to sing in one specific language, although a few can sing in several. The most common languages are Japanese and English; others include Spanish, Catalan, Chinese, and Korean.12

In the dissertation From Voder to Vocaloid: A Media History of Voice, Sarah Bell recognizes how difficult it is for a synthesizer to come even close to a human sound. She explains that, as with all models of sample concatenation, the Vocaloid imitations are essentially skewed versions of “intricate processes such as attack, breathiness, airflow, and breath controls”. Bell also gives a few examples of the parameters in the Vocaloid software that can be changed and edited.13 These include velocity, dynamics, breathiness, clearness, and brightness. The author of VOCALOID Tutorial - Using the Parameters, who goes by the username Kevinayp, adds opening, gender factor, portamento timing (which controls gliding from one note to another without defining the notes along the way), pitch bend, and pitch bend sensitivity.14 Sample concatenation can be used to Frankenstein together many different possibilities, thanks to the inhuman capabilities that inherently separate Vocaloid synthesizers from human vocals.
Although they can travel avenues that the human voice cannot, they are also limited and lack certain inherent human qualities.15

11 Corporation 2017
12 "List Of Vocaloid Products" 2017
13 Bell 2015
14 "VOCALOID Tutorial - Using The Parameters" 2008
15 Bell 2015

Leon and Lola were the first voicebanks, or fonts, released by Zero-G; both debuted at the annual NAMM (National Association of Music Merchants) Show on January 15, 2004.