Dadagp: a Dataset of Tokenized Guitarpro Songs for Sequence Models
Total Page:16
File Type:pdf, Size:1020Kb
DADAGP: A DATASET OF TOKENIZED GUITARPRO SONGS FOR SEQUENCE MODELS Pedro Sarmento1 Adarsh Kumar2;3 CJ Carr4 Zack Zukowski4 Mathieu Barthet1 Yi-Hsuan Yang2;5 1 Queen Mary University of London , UK 2 Academia Sinica, Taiwan 3 Indian Institute of Technology Kharagpur, India 4 Dadabots 5 Taiwan AI Labs {p.p.sarmento, m.barthet}@qmul.ac.uk, [email protected], [email protected] ABSTRACT Originating in the Renaissance and burgeoning in the dig- ital era, tablatures are a commonly used music notation system which provides explicit representations of instru- ment fingerings rather than pitches. GuitarPro has estab- lished itself as a widely used tablature format and soft- ware enabling musicians to edit and share songs for mu- sical practice, learning, and composition. In this work, we present DadaGP, a new symbolic music dataset com- prising 26,181 song scores in the GuitarPro format cover- ing 739 musical genres, along with an accompanying tok- enized format well-suited for generative sequence models such as the Transformer. The tokenized format is inspired Figure 1. An excerpt from a GuitarPro song notation using by event-based MIDI encodings, often used in symbolic tablatures and score for two guitars, bass and drums. music generation models. The dataset is released with an encoder/decoder which converts GuitarPro files to to- kens and back. We present results of a use case in which emphasis is on the action (symbol-to-action), contrary to DadaGP is used to train a Transformer-based model to gen- descriptive forms of notation, which establishes a symbol- erate new songs in GuitarPro format. We discuss other rel- to-pitch relationship. This characteristic makes tablatures evant use cases for the dataset (guitar-bass transcription, an intuitive and inclusive device for music reading and music style transfer and artist/genre classification) as well learning, which can explain their large prevalence for mu- as ethical implications. DadaGP opens up the possibility to sic score sharing over the Internet [3,4]. Often represented train GuitarPro score generators, fine-tune models on cus- as non-standardised text files that require no specific soft- tom data, create new styles of music, AI-powered song- ware to read or write, tablatures’ online dissemination has writing apps, and human-AI improvisation. surpassed more sophisticated music notation formats, such as Music XML or MIDI [3]. However, tablature represen- tations that rely solely on text have limitations from a user 1. INTRODUCTION perspective. For example, it is common that rhythm indica- Historically, tablatures’ proliferation is closely linked to tions are discarded, preventing a comprehensive transcrip- the lute repertoire, compositions that roughly span from tion of the music and automatic playback. Tablature edi- the 16th century onwards, and are still available today [1]. tion software (e.g. GuitarPro 1 , PowerTab 2 , TuxGuitar 3 ) arXiv:2107.14653v1 [cs.SD] 30 Jul 2021 In opposition to standard notational practices (usually re- can be regarded as a solution for this problem, keeping ferred to as staff notation), in a tablature system for string the prescriptive approach, and supporting rhythm notations instruments each staff line on the score represents a string and playback. By supporting the annotation of multiple in- of the instrument, substituting a representation of pitch by struments, as observable in Figure 1, these tools account a given location on said instrument (i.e. a fingering) [2]. for an interactive music experience, either for songwriting Tablatures are a prescriptive type of notation, where the or music learning purposes. The release of this dataset intends to leverage the GuitarPro format used by the before-mentioned software © P. Sarmento, A.Kumar, CJ Carr, Z. Zukowski, M. Barthet to support guitar and bands/ensembles’ related research and Y. Yang. Licensed under a Creative Commons Attribution 4.0 Inter- within the MIR community, focusing specifically on the national License (CC BY 4.0). Attribution: P. Sarmento, A.Kumar, CJ Carr, Z. Zukowski, M. Barthet and Y. Yang, “DadaGP: A Dataset of Tok- 1 https://www.guitar-pro.com/ enized GuitarPro Songs for Sequence Models”, in Proc. of the 22nd Int. 2 http://www.power-tab.net/guitar.php Society for Music Information Retrieval Conf., Online, 2021. 3 https://sourceforge.net/projects/tuxguitar/ task of symbolic music generation. The contributions of string and fret positions, chords and beats [17]. Further- this paper are: (1) a dataset of over 25,000 songs in Gui- more, the Guitar Playing Techniques dataset [18] contains tarPro and token format, together with statistics on its fea- 6,580 clips of notes together with playing technique an- tures and metadata, (2) an algorithm and Python software notations. Likewise, the IDMT-SMT-Guitar dataset [19] to convert between any GuitarPro file and a dedicated to- also comprises short excerpts that include annotations of ken format suitable for sequence models 4 , (3) results from single notes, playing techniques, note clusters, and chords. its main use case, the task of symbolic music generation, Lately, Chen et al. compiled a dataset of 333 tablatures and (4) a discussion about further applications for DadaGP of fingerstyle guitar, created specifically for the purpose of and its ethical implications. music generation [20]. In this paper, we first present some relevant background To the authors best knowledge, there exists no multi- concerning previously released music datasets in symbolic instrument dataset that is able to combine the ease of use of format. In Section 3, we discuss advantages of tab-based symbolic formats whilst providing guitar (and bass) play- datasets for MIR research. We then describe, in Section 4, ing technique information. Such expressive information is the details of the DadaGP dataset, its encoder/decoder sup- lacking in other formats, and GuitarPro appears as a viable port tool, the features it encompasses and the ones it lacks. resource for music analysis and generation. Within Section 5 we present a use case of symbolic music generation using our proposed dataset, supported by previ- ous approaches concerning databases of symbolic music. Section 6 proposes additional applications for the dataset. 3. MOTIVATIONS: WHY GUITARPRO? Finally, in Section 7 we explain the steps needed in order to acquire the dataset, further pointing out some ethical con- GuitarPro is both a software and a file format, widely used siderations in Section 8. by guitar and bass players, but also by bands. It is mostly utilized for tasks such as music learning and practicing, where musicians simply read or play along a given song, 2. BACKGROUND and for music notation, in which composers/bands use the software to either support the songwriting process, or sim- Since its release in 1983, the MIDI (Music Instrument Dig- ply as a means for ease of distribution once compositions ital Inferfaces) standard has remained highly ubiquitous. are done. As an example of the software’s widespread Unsurprisingly, MIDI has been the most recurrent option dissemination, the tablature site Ultimate Guitar 6 hosts a in terms of musical notation formats, concerning datasets catalogue of over 200,000 user-submitted GuitarPro files, released within the MIR community, either targeting music containing notations of commercial music, mostly from the generation purposes, that lately have boomed by leverag- genres of rock and metal. One of the main motivations ing deep learning approaches, or aiming for musical anal- for the creation of DadaGP is to engage the MIR com- ysis, musicology or purely information retrieval ends. A munity into research that leverages the expressive infor- comprehensive overview of previously released datasets in mation, instrumental parts and song diversity in formats symbolic format is presented in [5]. The authors present such as GuitarPro. Although GuitarPro is a paid software, MusPy, a toolkit for symbolic music generation, that na- free alternatives such as TuxGuitar are capable of edit- tively supports a total of eleven datasets. Considering cu- ing/exporting into GuitarPro format. Moreover, GuitarPro mulative song duration, the top five datasets are the Lakh files can be easily imported into MuseScore 7 , a free soft- MIDI dataset [6], the MAESTRO dataset [7], the Wikifo- 5 ware notoriously known for music notation, which also nia Lead Sheet dataset , the Essen Folk Song database [8], possesses tablature features. However, using MuseScore and the NES Music database [9]. With respect to music no- might present some occasional incompatibilities, specif- tation formats, these datasets employ MIDI, MusicXML ically those regarding the selection of instruments (e.g. and ABC. Recently, the GiantMIDI-Piano dataset [10], drums are often imported as piano, and the correspond- comprising 10,854 unique piano solo pieces, the POP909 ing MIDI instruments need to be manually switched). An- dataset [11] and the Ailabs.tw Pop1K7 dataset [12], con- other important motivation for the release of this dataset is taining respectively piano arrangements of 909 and 1,748 that it is possible to make conversions between GuitarPro popular songs, were also released, all relying on MIDI for- and MIDI files. This can be done inside any of the afore- mat. This standardisation around MIDI is useful for there mentioned software, by simply exporting into MIDI, or by are several Python libraries to work with this format, such scripting. Thus, by converting the dataset’s GuitarPro files as music21 [13], mido [14], pretty_midi [15], and jSym- into MIDI, MIDI-based music feature extraction functions bolic [16]. available (e.g. Python libraries referenced in Section 2) Regarding guitar-oriented research, previous dataset re- can be applied. Finally, we believe that our dataset is able leases have not particularly targeted automatic music gen- to provide researchers with the information present in stan- eration goals, instead focusing on guitar transcription or dard MIDI datasets, while including at the same time pre- playing technique detection. The GuitarSet consists of scriptive information useful for guitar-oriented research.