The Phenomenon of Vocal Synthesis and Sample Concatenation

Cole Masaitis

University of Mary Washington

Department of Music

Dr. Mark Snyder

January 30, 2017

Masaitis, 1

Imagine a future in which a synthesized voice, rather than a real human, sang to you. What if I told you that this particular future has already been around for quite some time? Vocaloid, computer software that matches this exact description, has been booming in Japan for over a decade, while other regions, including the West, have remained largely in the dark about it. In reality, Vocaloid was created back in the early 2000s and has been slowly but surely seeping into pop culture in other countries as well.

Vocaloid originated in the early 2000s and was developed by Hideki Kenmochi, whom some refer to as “the father” of the software, for a research project at Pompeu Fabra University. Following his time at the university, Yamaha funded his research, which allowed for the further development of his creation, and the software has since evolved into the worldwide phenomenon that Vocaloid is today.

Vocaloid employs sample concatenation, a sequencing of recorded samples that replicates the human voice, based on actual recordings of a different individual for each voicebank found in the Vocaloid editor programs. Originally, its pronunciation abilities were very limited, and by 2003 the team had released a product that was able to sing simple words. Over the years the product went through several iterations before reaching its modern form, Yamaha’s Vocaloid 4 software, which has more advanced phonetic, linguistic, and vocal capabilities than ever before.

For my own project, I decided to research the Vocaloid voice synthesizing software and incorporate it into a lecture recital performance of a new composition. This composition includes modern instrumentation and human forms of expression to contrast with the inherently more robot-like voice of the Vocaloid. The first step, however, was delving into as much background research on Vocaloid as possible, starting with what anyone would likely ask first: “So what exactly is Vocaloid?”1

According to an early write-up by the creator, Hideki Kenmochi, and Hayato Ohshita, “Vocaloid is a singing synthesizer developed by the Yamaha Corporation. It is one of the few singing synthesizers that are available to end-users, and is the most widely used in the world currently. It provides the product users not merely a synthesis engine but an integrated environment in which the user can generate a singing voice easily and use it for music production.”2

In Mariana Timony’s Bandcamp article Vocaloids: Our Friends Electric, she describes it in a slightly more accessible manner. Timony states that Vocaloids “are synthesized vocal singing software first developed by Yamaha in 2003 that has, quite literally, taken on a life of its own.” She also compares Vocaloid to synthesized instruments such as “violins, harpsichords,” and others found in music-creation software called Digital Audio Workstations, or DAWs, like GarageBand, Logic Pro X, Ableton Live, and so on. Timony says, “Vocaloid software offers to its users a sort of singer in a box. Human voice actresses and actors sing various [...], which are compiled to create voicebanks of various synthesized sounds for anyone with a laptop for less than $200. Once installed, the Vocaloid is programmable through a piano-roll interface, with parameters for adjusting pitch, vibrato, tone, clarity, and other vocal characteristics.”3

Kenmochi also says that in order to create melodies, one must be familiar with the Musical Instrument Digital Interface (MIDI) language, which appears in the Score Editor: a view that is literally a piano keyboard rolled out vertically, in which users click in the notes and durations they wish to use. Essentially, he mentions that in the Score Editor, “the user can input notes, lyrics, and optionally some expressions.” He goes on to explain that the “Editor is designed specifically for Vocaloid,” and that “anyone can type out their lyrics as if they were normally writing the words, and they are converted into phonetic symbols by scanning a built-in pronunciation dictionary.”4

To break down these concepts a bit more, the Merriam-Webster dictionary defines phonemes, or phonetic symbols, as “abstract units of the phonetic system of a language that correspond to a set of similar speech sounds which are perceived to be a single distinctive sound in the language.” There are forty-four phonemes in English, each contributing different sounds to how we form our language.5 Going back to Kenmochi, he shares an example of how this works, describing how one would concatenate (or link together) the sequence s-e, e, e-t, which covers all the sounds that make up the word “set”: “The spectral envelope of sustained [e] at each frame is generated by interpolating [e] in the end of [s-e] and [e] in the beginning of [e-t]. By doing these processes, there is theoretically no [...] or tone quality gap in the concatenation.”6

1 "My Vocaloid - History" 2017
2 Kenmochi and Ohshita 2007
3 Timony 2017
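The two steps Kenmochi describes, looking lyrics up in a pronunciation dictionary and interpolating the spectral envelope of a sustained vowel, can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not Yamaha's implementation: the tiny dictionary, the function names, the representation of a spectral envelope as a short list of numbers, and the choice of plain linear interpolation are all hypothetical and were picked for this example.

```python
# Hypothetical pronunciation dictionary; a real one would hold
# many thousands of entries and use proper phonetic symbols.
PRONUNCIATIONS = {
    "set": ["s", "e", "t"],
    "pep": ["p", "e", "p"],
}

# Assumption: only these symbols count as sustainable vowels.
VOWELS = {"a", "e", "i", "o", "u"}


def lyrics_to_phonemes(lyrics):
    """Convert typed lyrics to phonemes by scanning the dictionary."""
    return [PRONUNCIATIONS[word.lower()] for word in lyrics.split()]


def phonemes_to_units(phonemes):
    """Turn one word's phonemes into concatenation units.

    ["s", "e", "t"] becomes ["s-e", "e", "e-t"]: a transition into
    the vowel, the sustained vowel, and a transition out of it.
    """
    units = []
    for i, phoneme in enumerate(phonemes):
        if i > 0:
            units.append(f"{phonemes[i - 1]}-{phoneme}")
        if phoneme in VOWELS:
            units.append(phoneme)
    return units


def interpolate_envelopes(start_env, end_env, n_frames):
    """Blend two spectral envelopes across the sustained frames.

    start_env stands for [e] at the end of [s-e], end_env for [e] at
    the beginning of [e-t]. Linear blending is an assumption; the
    source only says the envelope is generated "by interpolating".
    """
    frames = []
    for k in range(n_frames):
        weight = (k + 1) / (n_frames + 1)
        frames.append(
            [(1 - weight) * a + weight * b for a, b in zip(start_env, end_env)]
        )
    return frames


if __name__ == "__main__":
    for word in lyrics_to_phonemes("set"):
        print(phonemes_to_units(word))
    print(interpolate_envelopes([1.0, 0.0], [0.0, 1.0], 3))
```

For the word “set,” the sketch produces the units s-e, e, and e-t, matching Kenmochi's example, and each sustained frame moves a little further from the first envelope toward the second, so no abrupt jump occurs at either join.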

In the New York Times article Could I Get That Song in Elvis, Please?, author Bill Werde does an excellent job of simplifying this information for his readers. Werde explains that Vocaloid, in layman’s terms, could be more easily seen as an “audio font: musical notation and lyrics can be translated into the chosen voice, then saved for replay, just as a word processor might translate a text into Helvetica or Times New Roman and print out as many times as you like.”7 He also says, “These fonts are made up of a database of phonemes (vowels and consonants), the basic sounds that make up any language. To create the database, technicians record a singer performing as many as 60 pages including thousands of scripted articulations (like “epp, pep, lep”). Assorted pitches and techniques like glissandos and legatos are also thrown in the mix; with all the combinations, the process takes a week of five-hour singing days.” According to Werde, Ed Stratton, the managing director of Zero-G Limited, a company that licensed Vocaloid technology from Yamaha, says the fonts that result from all those arduous hours of recording are “reminiscent” of the singer’s voice. Werde also mentions that “it requires a deep knowledge of phonetics and audio engineering to create new fonts.”8

4 Kenmochi and Ohshita 2007
5 Reithaug 2007
6 Kenmochi and Ohshita 2007
7 Werde 2003
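Werde's “audio font” can be pictured as a lookup table of recorded articulations at assorted pitches. The following Python sketch is a hedged illustration of that idea only. The `Sample` record, the file names, and the rule of choosing the recording closest in pitch to the requested note are assumptions made for this example; the sources cited here do not describe how Vocaloid actually selects samples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One recorded articulation (all fields are hypothetical)."""
    articulation: str  # scripted syllable, e.g. "pep"
    midi_pitch: int    # pitch at which the singer performed it
    path: str          # made-up file name for the recording


def build_voicebank(samples):
    """Group recordings by articulation, like glyphs in a font."""
    bank = {}
    for sample in samples:
        bank.setdefault(sample.articulation, []).append(sample)
    return bank


def pick_sample(bank, articulation, target_pitch):
    """Return the recording whose pitch is nearest the target.

    Nearest-pitch selection is an assumption of this sketch, not a
    documented Vocaloid behavior.
    """
    candidates = bank[articulation]
    return min(candidates, key=lambda s: abs(s.midi_pitch - target_pitch))


if __name__ == "__main__":
    bank = build_voicebank([
        Sample("pep", 60, "pep_c4.wav"),
        Sample("pep", 67, "pep_g4.wav"),
        Sample("lep", 60, "lep_c4.wav"),
    ])
    print(pick_sample(bank, "pep", 65).path)
```

Even in this toy form, the table grows as articulations multiplied by pitches, which gives a sense of why Werde reports that recording a full voicebank takes a week of five-hour singing days.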

Jordi Bonada, a member of the Music Technology Group at Pompeu Fabra University and one of the senior researchers who assisted and supervised Kenmochi, says in the 2014 Red Bull Music Academy article The Making of Vocaloid, “We realized it might be a better idea to record not just a song from a particular singer, but a set of vocal exercises with a great phonetics range, and build a model capable of singing any song.”9 Interestingly enough, Kenmochi mentions in the same article that “One style we can’t really do with Vocaloid now is very rough singing. The program assumes you can detect pitch, its basic frequency. But in rough voices, you sometimes can’t. We want to improve that.”10

Right around the time of this article, a growl feature was added in one of the newer versions of the software, Vocaloid 4. It does not quite give Vocaloid voices the harsh quality of a screaming vocalist in a metal band, but rather a gritty, raspy tone that users can apply to certain sections of their vocals to make them more intense.11 Specific Vocaloids have been made for certain languages, and there are a few that can sing in multiple languages, but generally they are developed and recorded to sing in one specific language. The most common languages are Japanese and English, but others include Spanish, Catalan, Chinese, and Korean.12

8 Werde 2003
9 St. Michel 2014
10 St. Michel 2014

In the dissertation From Voder to Vocaloid: A Media History of Voice Synthesis, Sarah Bell recognizes the difficulty of even coming close to a human sound with a synthesizer. She explains that, as with all models of sample concatenation, the Vocaloid imitations are essentially skewed versions of “intricate processes such as attack, breathiness, airflow, and breath controls.” Bell goes on to give a few examples of the parameters in Vocaloid software that can be changed and edited.13 These include velocity, dynamics, breathiness, clearness, and brightness. The author of Vocaloid Tutorial - Using the Parameters, who goes by the username Kevinayp, adds opening, gender factor, portamento timing (gliding from one note to another without defining the notes along the way), pitch bend, and pitch bend sensitivity.14

Sample concatenation can be used to Frankenstein together many different possibilities, thanks to the inhuman capabilities that separate Vocaloid synthesizers from human vocals. Although they are capable of traveling avenues that human voices cannot, they are also limited and lack certain inherent human qualities.15 Vocaloids by the names of Leon and Lola were the first voicebanks, or fonts, to be released by Zero-G, and both debuted at the annual National Association of Music Merchants (NAMM) Show on January 15, 2004. NAMM is a show where a plethora of big-name music companies show off their newest products each year, and plenty of musicians, up-and-coming as well as famous, stop by to check out all the soon-to-be-released gear.16 Leon and Lola were designed as “Soul Singers” as a marketing decision, according to Werde quoting Stratton, but Stratton also mentions that this is just the beginning. He essentially claimed that future Vocaloids would be made with other voices in different styles as well, such as “operatic or choir boy voices.”17 Since Leon and Lola’s debut, the capabilities of Vocaloid synthesis have grown immensely, from control over more parameters to bigger voicebanks in multiple languages, and more.

11 Corporation 2017
12 "List Of Vocaloid Products" 2017
13 Bell 2015
14 "VOCALOID Tutorial - Using The Parameters" 2008
15 Bell 2015

Nowadays, using software like Vocaloid is certainly another sign of the future, since laptop musicians and home studios are already widespread among music writers.18 By incorporating these voicebanks, independent artists essentially have professional singers to write lyrics for, and their Vocaloid of choice can sing whatever they want.19 For those who can’t sing, the software acts as a tool that gives a voice to people who want to create music. As mentioned above, Vocaloid’s voices are sampled in a fashion very similar to sampling in hip-hop and other genres. The approach is also similar to the MIDI sample libraries that are now commonplace in the music-making community, which let a user play a violin, drums, or even simple singing thanks to the hours of recording done by the libraries’ creators to capture every dynamic, pitch, articulation, and other minor detail. These details make it hard nowadays to even tell the difference between a real orchestra and a MIDI orchestra. One of the harder and less commonly synthesized instruments is the voice.

In theory, a Vocaloid could also be built by meticulously hand-picking vocal sounds out of recordings of long-gone singers and stars, so that their voices live on in new music. If not enough material exists, new singers who sound similar could help complete the library. One could technically use any singer’s voice in completely new settings, from Ariana Grande singing Happy Birthday to specifics like making Bono into an opera singer in a duet with Tom Waits. Before long, these possibilities and others we would never have imagined could occur thanks to the progress of Vocaloid and vocal synthesis technology.20 Fast-forward to more recent years, and a program called UTAU, a singing voice synthesis tool with parameters similar to Vocaloid’s, has appeared. This means that people can now literally, through hours of work, create singing voicebanks out of their favorite singers to use at their disposal.21

16 "Introduction To Vocaloid" 2017
17 Werde 2003
18 Werde 2003
19 Werde 2003

Because of the concepts behind Vocaloid, and although it was created for professional use, the software has turned into a pop-star brand, whether anyone expected it or not. One voice in particular, the voicebank of Hatsune Miku, has risen to fame; since its debut it has become the most popular Vocaloid to date, as well as the mascot for the company. Prior to Miku, “the company released [...] and [...], who were released a few years after Leon and Lola. Their product box art featured characters drawn in an anime style which proved to be a successful marketing strategy.”22 Due to the success of the character avatars given to Vocaloid products, all of the voicebanks released later on had them as well. The Vocaloid brand, and Miku in particular, has been rising in popularity internationally as a pop star, and her name literally translates to “The First Sound of the Future,” as shared in the aforementioned Bandcamp article. Timony states that within the last year fans attended a sold-out North American tour where she “sang hit song after hit song.” The important thing to remember is that anyone can use Miku or any other voicebank, which means that any user who creates songs for her is the star along with Miku. In other words, Miku is entirely the face of her audience and fans.23 Tons of merchandise is made of Vocaloids, as with big-name pop stars.24

20 Werde 2003
21 "Singing Voice Synthesis Tool UTAU" 2008
22 "Introduction To Vocaloid" 2017

To give an idea of exactly how big Vocaloid and Miku have gotten, the Celebrity Net Worth website shared an article by Paula Wilson entitled The Biggest Pop Star in Japan Right Now is a Hologram. A Hologram Who Sells Out Stadiums Full of Real People. In this article, Wilson writes that Miku was the opening act for Lady Gaga’s “ArtRave: The Artpop Ball” tour and was featured in Pharrell Williams’ remix of “Last Night, Good Night” by Livetune. Wilson also tells her readers that a Miku release became the first Vocaloid album to top the Oricon weekly album charts at #1, bumping big-name Justin Bieber to #2.25 Miku has also been a featured performer on the Late Show with David Letterman, and is the featured performer on sold-out global tours, either solo or with a human band to back her.26 An up-and-coming band that performed on Jimmy Fallon recently wrote one of its most popular songs to date featuring Miku. A Vocaloid was used in popular artist Porter Robinson’s song “Sad Machine.” The animated cartoon Bee and Puppycat uses a Vocaloid as the voice of the character Puppycat. The Japanese animated film Paprika is noted as having one of the first film scores to use Vocaloid on the soundtrack, with the Vocaloid Lola. In her Bandcamp article Vocaloids: Our Friends Electric, Timony mentions that Miku has been used to advertise for big-name companies, Google among them.27 In the National Public Radio (NPR) article The New Wave of Cartoon Bands, Neda Ulaby specifically mentions that “Hatsune Miku stars in a new Toyota Corolla commercial aimed at the Asian-American market.”28 Vocaloid video games have also been made, such as the Project DIVA series.29

23 Timony 2017
24 Werde 2003
25 Wilson 2014
26 Wagstaff 2015

Gorillaz creator Dan Takemura mentions in Could I Get That Song in Elvis, Please? that he would want to use the software to create sounds that human voices could not, adding that “it’s the imperfections in a voice, the happy accidents, the human-ness that are often what’s best in a song.” Interestingly, Gorillaz is another form of pop stardom involving fictional pop idols, with Takemura often behind the scenes and not actually performing. Instead, Takemura has the characters he created go in his stead, much as the Vocaloid community does with the different Vocaloid singers, such as Miku.30

Vocaloid is a crowdsourced medium in which fans and fellow music makers drive the superstardom of Miku and other Vocaloids, meaning that the characters are exactly what the audience makes them out to be. Fans are allowed to create art, music, shows, movies, and more non-commercially. This means that the characters and voices can be used freely until a work featuring a specific Vocaloid is advertised and making a decent profit from merchandise. For example, if a music maker begins to sell their music and it becomes popular, Crypton Future Media must be contacted so that the artist is given permission, and depending on how much money is involved, it may or may not be split between the parties. According to the website Piapro, Miku and other characters “can be used for business purposes, and you are allowed to receive money in compensation for the usage of the characters” if and only if you contact Crypton to ask their permission first.31

27 Timony 2017
28 Ulaby 2011
29 "My Vocaloid - History" 2017
30 Werde 2003

Lastly, since Vocaloid is such a unique tool for musical expression, I decided to use one in my senior composition for Music Seminar with Dr. Snyder. This composition was inspired by many things, including video game and movie soundtracks, with slight influences from modern classical composers like Nobuo Uematsu, Yasunori Mitsuda, Joe Hisaishi, and more. There are certainly more unconscious influences in this piece, but some of the ones I am aware of are artists such as Defeater, Touché Amoré, Tiny Moving Parts, and Pianos Become the Teeth, as well as genres like post-rock, pop, math rock, and more.

As for what the song is about, that’s a little harder to define strictly. It takes my musical influences and gravitates toward themes like nostalgia, sadness, bittersweetness, haunting, anger, regret, and plenty more. If I had to say it was about anything specifically, I suppose it’s about how growing up seems never-ending; about always coping with change, successes, and losses; about understanding that your emotions are valid to you and don’t necessarily have to align with other people’s; and about never letting people patronize you or tell you how you should feel. Another big theme and influence behind the music was understanding that while we are always trying to become a better version of ourselves and change into something new, we need to learn not to dwell on the past, but to learn from our past lessons and skills and to respect the experiences that have shaped who we are, rather than trying to keep everything in the past. Essentially, just be mindful of where you’ve been, because it helps you know where you want to go in the future. The Vocaloid acts as a symbol for the future while singing about the past. New and old musical concepts were used and redesigned for this piece, mixing new things I’ve learned with old influences and using the seasons as a symbol for coping with the harsh, sudden, and inevitable changes in life.

31 "Piapro.Net" 2017


Bibliography

"Believe In Music". 2017. NAMM.Org. https://www.namm.org/.

Bell, Sarah. 2015. "From Voder To Vocaloid: A Media History Of Voice Synthesis". Ph.D., The University of Utah.

Boxall, Andy. 2016. "‘Vocaloids’ Aren’t Characters, They’re Instruments Changing The Way Music Is Made". Digital Trends. http://www.digitaltrends.com/music/hatsune-miku-creative-revolution-musicians/.

Corporation, Yamaha. 2017. "Tone Rion V4". Vocaloid.Com. https://www.vocaloid.com/en/products/show/v4l_tone_rion_en.

"For Creators". 2017. Piapro.Net. http://piapro.net/intl/en_for_creators.html.

Hutchinson, Kate. 2014. "Hatsune Miku: Japan’s Holographic Pop Star Might Be The Future Of Music". The Guardian. https://www.theguardian.com/music/2014/dec/05/hatsune-miku-japan-hologram-pop-star.

"Introduction To Vocaloid". 2017. Google Docs. https://docs.google.com/document/d/1Feqm4ScOPFh_GJqGZ6-wUAgT76rb1wBW3oirBnB-dFY/edit#.

Kenmochi, Hideki, and Hayato Ohshita. 2007. "VOCALOID - Commercial Singing Synthesizer Based On Sample Concatenation". Interspeech 2007. http://www.interspeech2007.org/Technical/ssc_files/Yamaha/VOCALOID_Interspeech.pdf.

Lalwani, Mona. 2016. "It Takes A Village: The Rise Of Virtual Pop Star Hatsune Miku". Engadget. https://www.engadget.com/2016/02/02/hatsune-miku/.

"List Of Vocaloid Products". 2017. En.Wikipedia.Org. https://en.wikipedia.org/wiki/List_of_Vocaloid_products#References.

"My Vocaloid - FAQs". 2017. Stanford.Edu. http://stanford.edu/~trzhao/CS73N/faqs.html.

"My Vocaloid - History". 2017. Stanford.Edu. http://stanford.edu/~trzhao/CS73N/articles/history.html.

"My Vocaloid - Introduction". 2017. Stanford.Edu. http://stanford.edu/~trzhao/CS73N/intro.html.

Reithaug, Dawn. 2007. Orchestrating Success In Reading. West Vancouver, B.C.: Stirling Head Enterprises.

Robinson, Daniel. 2014. "If You Don't Go See Virtual Pop Star Hatsune Miku In Concert, You're Insane". Noisey. https://noisey.vice.com/en_us/article/if-you-dont-go-see-virtual-pop-star-hatsune-miku-in-concert-youre-insane.

"Singing Voice Synthesis Tool UTAU". 2008. Utau2008.Web.Fc2.Com. http://utau2008.web.fc2.com/.

St. Michel, Patrick. 2014. "Red Bull Music Academy". Daily.Redbullmusicacademy.Com. http://daily.redbullmusicacademy.com/2014/11/vocaloid-feature.

"The MIDI Language". 2017. Harfesoft.De. http://www.harfesoft.de/aixphysik/sound/midi/pages/midibnin.html.

Timony, Mariana. 2017. "Vocaloids: Our Friends Electric". Bandcamp Daily. https://daily.bandcamp.com/2017/02/16/vocaloids-our-friends-electric/.

Ulaby, Neda. 2011. "The New Wave Of Cartoon Bands". NPR.Org. http://www.npr.org/sections/therecord/2011/06/30/137529117/the-new-wave-of-cartoon-bands.

"VOCALOID Tutorial - Using The Parameters". 2008. Vocaloidism. http://vocaloidism.com/vocaloid-tutorial-using-the-parameters/.

"Vocaloid Wiki". 2017. Vocaloid.Wikia.Com. http://vocaloid.wikia.com/wiki/Vocaloid_Wiki.

Wagstaff, Keith. 2015. "Japanese Hologram Pop Star Is Coming To America". NBC News. http://www.nbcnews.com/tech/innovation/japanese-hologram-pop-star-hatsune-miku-coming-america-n463016.

Werde, Bill. 2003. "MUSIC; Could I Get That Song In Elvis, Please?". Nytimes.Com. http://www.nytimes.com/2003/11/23/arts/music-could-i-get-that-song-in-elvis-please.html.

Wilson, Kara. 2017. "Phoneme: Definition, Segmentation & Examples - Video & Lesson Transcript | Study.Com". Study.Com. http://study.com/academy/lesson/phoneme-definition-segmentation-examples.html.

Wilson, Paula. 2014. "The Biggest Pop Star In Japan Right Now Is A Hologram. A Hologram Who Sells Out Stadiums Full Of Real People.". Celebrity Net Worth. http://www.celebritynetworth.com/articles/entertainment-articles/biggest-pop-star-japan-right-now-hologram/.

Zushi, Yo. 2017. "Crowd-Sourced Pop Singer Hatsune Miku Reveals The True Nature Of Stardom". Newstatesman.Com. http://www.newstatesman.com/culture/observations/2017/03/crowd-sourced-pop-singer-hatsune-miku-reveals-true-nature-stardom.