Discovering Metadata Inconsistencies
11th International Society for Music Information Retrieval Conference (ISMIR 2010)

DISCOVERING METADATA INCONSISTENCIES

Bruno Angeles, Cory McKay, Ichiro Fujinaga
CIRMMT, Schulich School of Music, McGill University
bruno.angeles@mail.mcgill.ca, [email protected], [email protected]

ABSTRACT

This paper describes the use of fingerprinting-based querying to identify metadata inconsistencies in music libraries, as well as the updates made to the jMusicMetaManager software in order to perform the analysis. Test results are presented for both the Codaich database and a generic library of unprocessed metadata. Statistics were computed in order to evaluate the differences between a manually maintained library and an unprocessed collection when comparing metadata with values on a MusicBrainz server queried by fingerprinting.

1. INTRODUCTION

1.1 Purpose

Metadata is useful in organizing information, but in large collections of data it can be tedious to keep that information consistent. Whereas decision making in previous environments such as traditional libraries was limited to a small number of highly trained and meticulous people, the democratization of music brought about by the digital age poses new challenges for metadata maintenance, as music can now be obtained from diverse and potentially noisy sources. The contributors to many popular metadata repositories tend to be much less meticulous, and may have limited expertise. The research presented here proposes a combination of metadata management software, acoustic fingerprinting, and the querying of a metadata database in order to discover possible errors and inconsistencies in a local music library.

Metadata entries are compared between a library of manually maintained files and a metadata repository, as well as between a collection of unprocessed metadata and the same repository, in order to highlight the differences between the two.

1.2 Metadata and Its Value

Metadata is information about data. In music, it is information related to recordings and their electronic versions (such as the performer, recording studio, or lyrics), although it can also be event information about artists or other attributes not immediately linked to a recorded piece of music. Corthaut et al. [1] present 21 semantically related clusters of metadata, covering a wide range of information that illustrates the variety of metadata that can be found in music. Lai and Fujinaga [6] suggest more than 170 metadata elements organized into five types, in research pertaining to metadata for phonograph recordings. Casey et al. [2] distinguish factual from cultural metadata. The most widely used implementation of musical metadata is the ID3 format associated with MP3 files [5].

The main problem with metadata is its inconsistency. Because it is stored in databases containing thousands, if not millions, of entries, the data is often supplied by many people with differing approaches. Spelling mistakes may go unnoticed for a long time, and information such as an artist's name might be spelled in different but equally valid ways. Additionally, several metadata labels, most notably genre, are highly subjective. When maintaining large databases of music, valid and consistent metadata facilitates the retrieval and classification of recordings, be it for Music Information Retrieval (MIR) purposes or simply for playback.

1.3 Metadata Repositories and Maintenance Software

Metadata repositories are databases that include information about recordings. When they are supported by an Application Programming Interface (API), they provide users with a convenient way of accessing the stored metadata. Existing music metadata repositories include MusicBrainz, Discogs, Last.fm, and Allmusic, to name just a few [3].

Several software solutions provide ways to access, use, and adjust musical metadata. These include MusicBrainz Picard, MediaMonkey, jMusicMetaManager, and Mp3tag. The first three applications support fingerprinting in some form, but Mp3tag does not. GNAT [10] allows querying by metadata or fingerprinting to explore aggregated Semantic Web data. jMusicMetaManager is the only application that performs extensive automated internal textual metadata error detection, and it also produces numerous useful summary statistics not made available by the alternatives.

1.4 Acoustic Fingerprinting

Acoustic fingerprinting is a procedure in which audio recordings are automatically analyzed and deterministically associated with a key that consumes considerably less space than the original recording. The purpose of the key in our context is to retrieve metadata for a given recording using only the audio information, which is more reliable than relying on the often highly noisy metadata packaged with recordings. Among other attributes, fingerprinting algorithms are distinguished by their execution speed; their robustness to noise and to various types of filtering; and their transparency to encoding formats and associated compression schemes [1].

The fingerprinting service used in this paper is that of MusicIP (now known as AmpliFIND). It is based on Portable Unique Identifier (PUID) codes [9], which are computed using the GenPUID software. The PUID format was chosen for its association with the MusicBrainz API.

1.5 Method

This research uses jMusicMetaManager [7] (a Java application for maintaining metadata), Codaich [7] (a database of music with manually maintained metadata), a reference library of music labeled with unprocessed metadata, and a local MusicBrainz server at McGill University's Music Technology Area. Reports were generated in jMusicMetaManager, which was improved as part of this project by the addition of fingerprinting-based querying, in order to find the percentage of metadata that was identical between the manually maintained library and the MusicBrainz server. In addition to comparing the artist, album, and title fields, a statistic was computed indicating how often all three of these fields matched between the local library and the metadata server, a statistic that we refer to as "identical metadata." Raimond et al. [10] present a similar method, with the ultimate objective of accessing information on the Semantic Web.

An unprocessed test collection, consisting of music files obtained from file sharing services, was used in order to provide a comparison between unmaintained and manually maintained metadata. This unprocessed library is referred to as the "reference library" in this paper.

2. JMUSICMETAMANAGER

jMusicMetaManager [7] is a piece of software designed to automatically detect metadata inconsistencies and errors in musical collections, as well as to generate descriptive profiling statistics about such collections. The software is part of the jMIR [8] music information retrieval software suite, which also includes audio, MIDI, and cultural feature extractors; metalearning machine learning software; and

When metadata is inconsistent, multiple spellings are often found for entries that should be identical. For example, uncorrected occurrences of both "Lynyrd Skynyrd" and "Leonard Skinard," or of the multiple valid spellings of composers such as "Stravinsky," would be problematic for an artist identification system, which would incorrectly perceive them as different artists.

At its simplest level, jMusicMetaManager calculates the Levenshtein (edit) distance between each pair of entries for a given field. A threshold, dynamically weighted by the length of the strings, is then used to determine whether two entries are likely to correspond to the same true value. This is done separately for each of the artist, composer, title, album, and genre fields. In the case of titles, recording length is also considered, as two recordings might correctly have the same title but be performed entirely differently.

This approach, while helpful, is too simplistic to detect the full range of problems that one finds in practice. Additional pre-processing was therefore implemented, and additional post-modification distances were calculated, in order to reduce the edit distance of strings that probably refer to the same thing, thus making it easier to detect the corresponding inconsistency. For example:

- Occurrences of "The " were removed (e.g., "The Police" should match "Police").
- Occurrences of " and " were replaced with " & ".
- Personal titles were converted to abbreviations (e.g., "Doctor John" to "Dr. John").
- Instances of "in'" were replaced with "ing" (e.g., "Breakin' Down" to "Breaking Down").
- Punctuation and brackets were removed (e.g., "R.E.M." to "REM").
- Track numbers at the beginnings of titles, as well as extra spaces, were removed.

In all, jMusicMetaManager can perform 23 pre-processing operations. Furthermore, an additional type of processing can be performed in which word orders are rearranged (e.g., "Ella Fitzgerald" should match "Fitzgerald, Ella," and "Django Reinhardt & Stéphane Grappelli" should match "Stéphane Grappelli & Django Reinhardt"). Word subsets can also be considered (e.g., "Duke Ellington" might match "Duke Ellington & His Orchestra").

jMusicMetaManager also automatically generates a variety of HTML-formatted statistical reports about music collections, including multiple data summary views and
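As a rough illustration of the length-weighted edit-distance test described above, the following Python sketch flags two field values as probable duplicates when their Levenshtein distance is small relative to the longer string. This is not jMusicMetaManager's actual (Java) implementation, and the 25% ratio is an assumed stand-in for its dynamic, length-dependent threshold.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def probable_same_value(a: str, b: str, ratio: float = 0.25) -> bool:
    """Flag two field entries as likely referring to the same true value
    when their edit distance is small relative to the longer string."""
    threshold = max(1, int(ratio * max(len(a), len(b))))
    return levenshtein(a, b) <= threshold
```

For title fields, the paper notes that recording length is also compared before two entries are treated as duplicates; that additional check is omitted from this sketch.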
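The pre-processing and word-order operations described above can be approximated as follows. This Python sketch is illustrative only: the real software performs 23 operations in Java, and the regular expressions and abbreviation table here are simplified assumptions.

```python
import re

# Assumed sample of "personal title" abbreviations; the software's full list
# is presumably larger.
ABBREVIATIONS = {"Doctor": "Dr.", "Mister": "Mr.", "Saint": "St."}

def normalize(value: str) -> str:
    s = re.sub(r"^\d+[\s.\-]+", "", value)    # strip leading track numbers ("01 - ...")
    s = re.sub(r"^The\s+", "", s)             # "The Police" -> "Police"
    s = s.replace(" and ", " & ")             # " and " -> " & "
    for word, abbr in ABBREVIATIONS.items():  # "Doctor John" -> "Dr. John"
        s = s.replace(word, abbr)
    s = s.replace("in'", "ing")               # "Breakin' Down" -> "Breaking Down"
    s = re.sub(r"[^\w\s&]", "", s)            # drop punctuation/brackets: "R.E.M." -> "REM"
    return re.sub(r"\s+", " ", s).strip()     # collapse extra spaces

def reordered_match(a: str, b: str) -> bool:
    """Word-order rearrangement: 'Ella Fitzgerald' vs 'Fitzgerald, Ella'."""
    return sorted(normalize(a).lower().split()) == sorted(normalize(b).lower().split())

def word_subset_match(a: str, b: str) -> bool:
    """Word subsets: 'Duke Ellington' vs 'Duke Ellington & His Orchestra'."""
    wa = set(normalize(a).lower().split())
    wb = set(normalize(b).lower().split())
    return wa <= wb or wb <= wa
```

Note that applying the abbreviation and punctuation steps in sequence maps both "Doctor John" and "Dr. John" to the same normalized form, which is the point: the edit distance between probable variants drops to zero before the comparison is made.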
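The "identical metadata" statistic described in the method section, i.e. the fraction of recordings whose artist, album, and title all agree between the local library and the fingerprint-matched server values, can be sketched as below. The (artist, album, title) tuple layout and the use of a fingerprint key as the record identifier are assumptions made for illustration, not the software's actual data model.

```python
def identical_metadata_rate(local: dict, remote: dict) -> float:
    """local and remote map a recording identifier (e.g., a fingerprint key)
    to an (artist, album, title) tuple; only identifiers present in both
    collections are compared, and a recording counts as a match only when
    all three fields are identical."""
    shared = local.keys() & remote.keys()
    if not shared:
        return 0.0
    matches = sum(1 for rid in shared if local[rid] == remote[rid])
    return matches / len(shared)
```

With this definition, a single differing field (such as "REM" versus "R.E.M.") disqualifies the whole recording, which is why the normalization steps discussed earlier matter so much for the reported percentages.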