Acousticbrainz: a Community Platform for Gathering Music Information Obtained from Audio

ACOUSTICBRAINZ: A COMMUNITY PLATFORM FOR GATHERING MUSIC INFORMATION OBTAINED FROM AUDIO Alastair Porteryz, Dmitry Bogdanovy, Robert Kayez, Roman Tsukanovz, Xavier Serray yMusic Technology Group, Universitat Pompeu Fabra, Barcelona, Spain zMetaBrainz Foundation alastair.porter,dmitry.bogdanov,[email protected] rob,[email protected] ABSTRACT For example, existing datasets for genre classification are of insufficient size with respect to both the number of in- We introduce the AcousticBrainz project, an open plat- stances per class and the ability of these instances to ac- form for gathering music information. At its core, Acous- curately represent the entire musical genre space [4]. A ticBrainz is a database of music descriptors computed from list of datasets commonly used in MIR is provided in [1]. audio recordings using a number of state-of-the-art Mu- Half of them have fewer than 10,000 instances, although sic Information Retrieval algorithms. Users run a supplied in recent years there have been attempts to create larger feature extractor on audio files and upload the analysis re- datasets. Building such datasets would allow research at sults to the AcousticBrainz server. All submissions include the scale of the requirements of commercial applications. a MusicBrainz identifier allowing them to be linked to var- In general however, the creation of datasets may be dif- ious sources of editorial information. The feature extractor ficult for researchers due to a number of reasons: is based on the open source Essentia audio analysis library. From the data submitted by the community, we run classi- • Gathering and sharing datasets require legal considera- fiers aimed at adding musically relevant semantic informa- tions with regard to the distribution of copyrighted ma- tion. These classifiers can be developed by the community terial [7]. using tools available on the AcousticBrainz website. All • Collections which are hand-created may be biased in data in AcousticBrainz is freely available and can be ac- their contents and annotations, especially if they are cessed through the website or API. For AcousticBrainz to created by only one person, or if they are created for be successful we need to have an active community that the evaluation of a specific task or algorithm (such as contributes to and uses this platform, and it is this commu- the GZTAN dataset, commonly used to evaluate audio nity that will define the actual uses and applications of its feature-based genre classification algorithms [10]). data. One project in recent years to address some of these is- sues is the Million Song Dataset (MSD) [1]. At the time 1. INTRODUCTION of its release, this was the largest dataset of music descrip- One of the biggest bottlenecks in many Music Information tors in the MIR community and has gained a lot of atten- Retrieval (MIR) tasks is the access to large amounts of tion for its size and breadth of music content, as well as music data, in particular to audio features extracted from the simplicity of accessing its data. The MSD relies on commercial music recordings. Most approaches to tasks the EchoNest API 1 to compute its descriptors, a commer- such as music classification, auto-tagging, music similar- cial product which is closed to academic inspection. Some ity and music recommendation, are based on using audio downsides of this approach include: features obtained from well-established audio signal pro- cessing algorithms. This is a time consuming process that • Implementation details of the algorithms used to com- is beyond the possibilities of any individual researcher. It pute the descriptors are unknown and it is impossible to may not be possible for researchers to gather this much review the quality of their implementation. information, annotate it according to their needs, or com- • The dataset is fixed in time, and does not appear to have pute the required features at the scale required for the task. been updated with new features, or music released since it was created. c Alastair Porter, Dmitry Bogdanov, Robert Kaye, Roman The MSD has been further expanded with features com- Tsukanov, Xavier Serra. puted using open source algorithms, on audio samples Licensed under a Creative Commons Attribution 4.0 International Li- cense (CC BY 4.0). Attribution: Alastair Porter, Dmitry Bogdanov, from 7digital.com [9]. As this dataset reflects the MSD, Robert Kaye, Roman Tsukanov, Xavier Serra. “AcousticBrainz: a com- it is also fixed in time, and features do not represent the munity platform for gathering music information obtained from audio”, whole recording, but only the sample. 16th International Society for Music Information Retrieval Conference, 2015. 1 http://developer.echonest.com Based on these considerations, we believe that there is companies. 5 Every item in the database is uniquely iden- still space for a large dynamic dataset consisting of music tified by an MBID and many companies and organizations features calculated with open algorithms. rely on these IDs as identifiers for music-related concepts. MBIDs can be used to retrieve data from external services 2. ACOUSTICBRAINZ which understand them (e.g., Last.fm, WikiData), and are also a part of the Music Ontology. We are introducing a new platform, AcousticBrainz, 2 to assist with the gathering of musical data from the mu- 2.2 Current submission statistics sic enthusiast and research community, and to provide researchers with large datasets of recordings to work with. All of the source code in AcousticBrainz is open, 3 en- 160000 couraging sharing of algorithms between contributors and 140000 providing the ability for people to improve on the work of 120000 others. All submitted and generated data is freely available 100000 under a Creative Commons CC-0 license (public domain). The platform is split into three categories: feature ex- 80000 traction, data storage, and the creation of musical semantic Frequency 60000 information. A feature extractor, based on algorithms in 40000 the Essentia audio analysis library [2], can be downloaded 20000 by anyone who wishes to contribute data to the project. 0 They run this extractor on their personal computer, giv- 1940 1950 1960 1970 1980 1990 2000 2010 2020 ing audio files as input. The output of this extractor is a Year JSON file for each audio track containing descriptors (see Section 2.3.3). A submission tool provided with the ex- Figure 1: Release years of submissions from file metadata. tractor automatically uploads the JSON files to the Acous- ticBrainz server. A database stores submissions and makes the data avail- Format Count able via an API. AcousticBrainz only stores descriptors mp3 1,784,778 of audio, and never the actual audio itself. Submissions flac 777,826 are identified by the MusicBrainz identifier (MBID) of the vorbis 83,867 input audio file. These stable identifiers let us uniquely aac 64,733 and unambiguously refer to a music recording, and can alac 29,481 also let us obtain additional editorial information from wmav2 4,019 MusicBrainz and from other services that also understand other 1,320 MBIDs. To encourage experimentation with the data, the Acous- Table 1: Number of submissions per audio codec. ticBrainz website lets anybody create, annotate, and share their own datasets consisting of recordings present in the database. A search interface lets users query for recordings At this time, 6 the AcousticBrainz database has audio based on editorial data from MusicBrainz or extracted fea- features submitted for 1,671,701 unique recording MBIDs. tures and add the results to the dataset. From these datasets We keep duplicate submissions from different sources, re- users can build classifier models which can be used to es- sulting in a total of 2,747,094 submissions. For these timate characteristics of any recording present in Acous- submissions we also have metadata information available ticBrainz. from MusicBrainz, including 99,159 artists and 165,394 releases. The duplicates consist of analyzed features of 2.1 MusicBrainz various source audio files, with differing codecs, encoders, and bit rates. These duplicates let us see real-world ex- MusicBrainz 4 is a community-maintained open encyclo- amples of the effect of different codecs and encoding pa- pedia of music information. It contains editorial metadata rameters on our descriptors. We have collected submission for many musical concepts, including Artists (individuals, for 538,614 unique MBIDs (807,307 including duplicates) groups, and other people associated with musical events), for audio files encoded using a lossless codec (FLAC and Releases, Recordings, and Works. It also contains relation- ALAC), which is in itself is a large database. The most ships between items, and to other external databases. Data common audio format for submitted files is MP3, with is entered manually by a large community of volunteers more submissions than for all other formats combined (Ta- (editors), who also vote on changes made by other editors ble 1). 94% of submitted files contain year metadata. We to ensure its quality. It is used by a number of commercial show a histogram of the year that submissions were re- 2 http://acousticbrainz.org leased in Figure 1. The majority of tracks are from the 3 https://github.com/metabrainz/ acousticbrainz-server 5 https://metabrainz.org/customers 4 https://musicbrainz.org 6 July 7 2015 1990s and first decade of the 2000s. As tracks submit- call “high-level” data). The high-level data is computed ted to AcousticBrainz require a MBID this distribution on the server without needing to access audio files. The may also be reflective of the content in the MusicBrainz community can moderate the models and the good ones database. The current size of the database (containing all are used to compute high-level data for all AcousticBrainz JSON file submissions) is approximately 118GB, split be- submissions. tween 102GB of low-level data (average file size 40kB), and 12GB of high-level (file size 4kB).

Load more