A Singing Voice Dataset
Total Page:16
File Type:pdf, Size:1020Kb
VOCALSET: A SINGING VOICE DATASET Julia Wilkins1;2 Prem Seetharaman1 Alison Wahl2;3 Bryan Pardo1 1 Computer Science, Northwestern University, Evanston, IL 2 School of Music, Northwestern University, Evanston, IL 3 School of Music, Ithaca College, Ithaca, NY [email protected] ABSTRACT Gender and Voice Type Distribution We present VocalSet, a singing voice dataset of a capella 11 10 Voice Type singing. Existing singing voice datasets either do not 9 Baritone capture a large range of vocal techniques, have very few 8 Bass singers, or are single-pitch and devoid of musical context. 7 VocalSet captures not only a range of vowels, but also a 6 Bass−Baritone 5 Countertenor diverse set of voices on many different vocal techniques, Count sung in contexts of scales, arpeggios, long tones, and ex- 4 Mezzo−Soprano 3 cerpts. VocalSet has recordings of 10.1 hours of 20 pro- Soprano 2 fessional singers (11 male, 9 female) performing 17 differ- 1 Tenor ent different vocal techniques. This data will facilitate the 0 development of new machine learning models for singer F M identification, vocal technique identification, singing gen- Gender eration and other related applications. To illustrate this, we establish baseline results on vocal technique classification Figure 1. Distribution of singer gender and voice type. and singer identification by training convolutional network VocalSet data comes from 20 professional male and female classifiers on VocalSet to perform these tasks. singers ranging in voice type. 1. INTRODUCTION plored [2,5]. VocalSet can be used in a similar way, but for VocalSet is a singing voice dataset containing 10.1 hours singing voice generation. Our dataset can also be used to of recordings of professional singers demonstrating both train systems for vocal technique transfer (e.g. transform- standard and extended vocal techniques in a variety of mu- ing a sung tone without vibrato into one with vibrato) and sical contexts. Existing singing voice datasets aim to cap- singer style transfer (e.g. transforming a female singing ture a focused subset of singing voice characteristics, and voice to a male singing voice). Deep learning models for generally consist of fewer than five singers. VocalSet con- multi-speaker source separation have shown great success tains recordings from 20 different singers (11 male, 9 fe- for speech [7, 21]. They work less well on singing voice. male) performing a variety of vocal techniques on 5 vow- This is likely because they were never trained on a vari- els. The breakdown of singer demographics is shown in ety of singers and singing techniques. This dataset could Figure 1 and Figure 3, and the ontology of the dataset is be used to train machine learning models to separate mix- shown in Figure 4. VocalSet improves the state of exist- tures of multiple singing voices. The dataset also con- ing singing voice datasets and singing voice research by tains recordings of the same musical material with different capturing not only a range of vowels, but also a diverse modulation patterns (vibrato, straight, trill, etc), making it set of voices on many different vocal techniques, sung in useful for training models or testing algorithms that per- contexts of scales, arpeggios, long tones, and excerpts. form unison source separation using modulation pattern as Recent generative audio models based on machine a cue [17, 22]. Other obvious uses for such data are train- learning [11, 25] have mostly focused on speech applica- ing models to identify singing technique, style [9,19], or a tions, using multi-speaker speech datasets [6, 13]. Gen- unique singer’s voice [1, 10, 12, 14]. eration of musical instruments has also recently been ex- The structure of this article is as follows: we first com- pare VocalSet to existing singing voice datasets and cover c Julia Wilkins, Prem Seetharaman, Alison Wahl, Bryan existing work in singing voice analysis and applications. Pardo. Licensed under a Creative Commons Attribution 4.0 International We then describe the collection and recording process for License (CC BY 4.0). Attribution: Julia Wilkins, Prem Seetharaman, Alison Wahl, Bryan Pardo. “VocalSet: A Singing Voice Dataset”, 19th VocalSet and detail the structure of the dataset. Finally, we International Society for Music Information Retrieval Conference, Paris, illustrate the utility of VocalSet by implementing baseline France, 2018. classification systems for identifying vocal technique and 468 Proceedings of the 19th ISMIR Conference, Paris, France, September 23-27, 2018 469 Vibrato Straight Breathy Vocal Fry Lip Trill 4096 4096 4096 4096 4096 2048 2048 2048 2048 2048 Hz Hz Hz Hz Hz 1024 1024 1024 1024 1024 512 512 512 512 512 0 0.6 1.2 1.8 2.4 3 3.6 4.2 4.8 0 0.6 1.2 1.8 2.4 3 3.6 4.2 4.8 0 0.6 1.2 1.8 2.4 3 3.6 4.2 4.8 0 0.6 1.2 1.8 2.4 3 3.6 4.2 4.8 0 0.6 1.2 1.8 2.4 3 3.6 4.2 4.8 Time Time Time Time Time Trill Trillo Inhaled Belt Spoken 4096 4096 4096 4096 4096 2048 2048 2048 2048 2048 Hz Hz Hz Hz Hz 1024 1024 1024 1024 1024 512 512 512 512 512 0 0.6 1.2 1.8 2.4 3 3.6 4.2 4.8 0 0.6 1.2 1.8 2.4 3 3.6 4.2 4.8 0 0.6 1.2 1.8 2.4 3 3.6 4.2 4.8 0 0.5 1 1.5 2 2.5 3 3.5 4 0 0.6 1.2 1.8 2.4 3 3.6 4.2 4.8 Time Time Time Time Time Figure 2. Mel spectrograms of 5-second samples of the 10 techniques used in our vocal technique classification model. All samples are from Female 2, singing scales, except “Trill”, “Trillo”, and “Inhaled” which are found only in the Long Tones section of the dataset, and “Spoken” which is only in the Excerpts section. singer identification, trained on VocalSet. range of vocal techniques that one would not necessarily classify within a single genre so that models trained on VocalSet could generalize well to many different singing 2. RELATED WORK voice tasks. A few singing voice datasets already exist. The Phona- tion Modes Dataset [18] captures a range of vocal sounds, We illustrate the utility of VocalSet by implementing but limits the included techniques to ’breathy’, ’pressed’, baseline systems trained on VocalSet for identifying vo- ’flow’, and ’neutral’. The dataset consists of a large num- cal technique and singer identification. Prior work on vo- ber of sustained, sung vowels on a wide range of pitches cal technique identification includes work that explored from four singers. While this dataset does contain a sub- the salient features of singing voice recordings in order to stantial range of pitches, the pitches are isolated, lacking better understand what distinguishes one person’s singing any musical context (e.g. a scale, or an arpeggio). This voice from another as well as differences in sung vow- makes it difficult to model changes between pitches. Vo- els [4, 12], and work using source separation and F0 es- calSet consists of recordings within musical contexts, al- timation to allow a user to edit the vocal technique used in lowing for this modeling. The techniques listed above that a recorded sample [8]. are observed in the Phonation Modes Dataset are based on the different formations of the throat when singing Automated singer identification has been a topic of in- and not necessarily on musical applications of these tech- terest since at least 2001 [1,10,12,14]. Typical approaches niques. Our dataset focuses on a broader range of tech- use shallow classifiers and hand-crafted features (e.g. mel niques in singing, such as vibrato, trill, vocal fry, and in- ceptral coefficients) [16,24]. Kako et al. [9] identifies four haled singing. See Table 2 for the full set of techniques in singing styles music style using the phase plane. Their our dataset. work was not applied to specific vocal technique classi- The Vocobox dataset 1 focuses on single vowel and fication, likely due to the lack of a suitable dataset. We hy- consonant vocal samples. While they feature a broad range pothesize that deep models have not been proposed in this of pitches, they only capture data from one singer. Our data area due to the scarcity of high-quality training data with contains 20 singers, with a wide range of voice types and multiple singers. The VocalSet data addresses these issues. singing styles over a larger range of pitches. We illustrate this point by training deep models for singer The Singing Voice Dataset [3] contains over 70 vocal identification and vocal technique classification using this recordings of 28 professional, semi-professional, and am- data. ateur singers performing predominantly Chinese Opera. This dataset does capture a large range of voices, like Vo- For singing voice generation, [20] synthesized singing calSet. However, it does not focus on the distinction be- voice by replicating distinct and natural acoustic features tween vocal techniques but rather on providing extended of sung voice. In this work, we focus on classification tasks excerpts of one genre of music. VocalSet provides a wide rather than generation tasks. However, VocalSet could be applied to generation tasks as well, and we hope our mak- 1 https://github.com/vocobox/human-voice-dataset ing this dataset available will facilitate that research. 470 Proceedings of the 19th ISMIR Conference, Paris, France, September 23-27, 2018 Age and Gender Distribution data collection.