Discovering Metadata Inconsistencies

Total Page:16

File Type:pdf, Size:1020Kb

Discovering Metadata Inconsistencies 11th International Society for Music Information Retrieval Conference (ISMIR 2010) DISCOVERING METADATA INCONSISTENCIES Bruno Angeles Cory McKay Ichiro Fujinaga CIRMMT CIRMMT CIRMMT Schulich School of Music Schulich School of Music Schulich School of Music McGill University McGill University McGill University bruno.angeles@mail. [email protected] [email protected] mcgill.ca ABSTRACT suggest more than 170 metadata elements organized into five types, in research pertaining to metadata for This paper describes the use of fingerprinting-based querying in identifying metadata inconsistencies in music phonograph recordings. Casey et al. [2] distinguish libraries, as well as the updates to the jMusicMeta- factual from cultural metadata. The most widely used Manager software in order to perform the analysis. Test implementation of musical metadata is the ID3 format results are presented for both the Codaich database and a associated with MP3 files [5]. generic library of unprocessed metadata. Statistics were The main problem with metadata is its inconsistency. computed in order to evaluate the differences between a The fact that it is stored in databases containing manually-maintained library and an unprocessed thousands, if not millions, of entries often means that the collection when comparing metadata with values on a data is supplied by several people who may have different MusicBrainz server queried by fingerprinting. approaches. Spelling mistakes may go unnoticed for a long time, and information such as an artist’s name might 1. INTRODUCTION be spelled in different but equally valid ways. 1.1 Purpose Additionally, several metadata labels—most notably Metadata is useful in organizing information, but in large genre—are highly subjective. collections of data it can be tedious to keep that When maintaining large databases of music, valid and information consistent. Whereas decision making in consistent metadata facilitates the retrieval and previous environments such as traditional libraries was classification of recordings, be it for Music Information limited to a small number of highly trained and Retrieval (MIR) purposes or simply for playback. meticulous people, the democratization of music brought 1.3 Metadata Repositories and Maintenance Software about by the digital age poses new challenges in terms of Metadata repositories are databases that include metadata maintenance, as music can now be obtained information about recordings. When they are supported from diverse and potentially noisy sources. by an Application Programming Interface (API), they The contributors to many popular metadata provide users with a convenient way of accessing the repositories tend to be much less meticulous, and may stored metadata. Existing music metadata repositories have limited expertise. The research presented here include MusicBrainz, Discogs, Last.fm, and Allmusic, to proposes a combination of metadata management name just a few [3]. software, acoustic fingerprinting, and the querying of a Several software solutions exist that provide ways to metadata database in order to discover possible errors and access, use, and adjust musical metadata. These include inconsistencies in a local music library. MusicBrainz Picard, MediaMonkey, jMusicMeta- Metadata entries are compared between a library of Manager, and Mp3tag. The first three applications manually-maintained files and a metadata repository, as support fingerprinting in some form, but Mp3tag does well as between a collection of unprocessed metadata and not. GNAT [10] allows querying by metadata or the same repository, in order to highlight the possible fingerprinting to explore aggregated semantic Web data. differences between the two. jMusicMetaManager is the only application that performs 1.2 Metadata and its Value extensive automated internal textual metadata error detection, and it also produces numerous useful summary Metadata is information about data. In music, it is statistics not made available by the alternatives. information related to recordings and their electronic version (such as the performer, recording studio, or 1.4 Acoustic Fingerprinting lyrics), although it can also be event information about Acoustic fingerprinting is a procedure in which audio artists or other attributes not immediately linked to a recordings are automatically analyzed and recorded piece of music. Corthaut et al. present 21 deterministically associated with a key that consumes semantically related clusters of metadata [1], covering a considerably less space than the original recording. The wide range of information that illustrates the variety of purpose of using the key in our context is to retrieve metadata that can be found in music. Lai and Fujinaga [6] metadata for a given recording using only the audio 195 11th International Society for Music Information Retrieval Conference (ISMIR 2010) information, which is more reliable than using the often found for entries that should be identical. For example, highly noisy metadata packaged with recordings. uncorrected occurrences of both ―Lynyrd Skynyrd‖ and Among other attributes, fingerprinting algorithms are ―Leonard Skinard‖ or of the multiple valid spellings of distinguished by their execution speed; their robustness to composers such as ―Stravinsky‖ would be problematic for noise and to various types of filtering; and their an artist identification system that would incorrectly transparency to encoding formats and associated perceive them as different artists. compression schemes [1]. At its simplest level, jMusicMetaManager calculates The fingerprinting service used in this paper is that of the Levenshtein (edit) distance between each pair of MusicIP (now known as AmpliFIND). It is based on entries for a given field. A threshold is then used to Portable Unique Identifier (PUID) codes [9]. These are determine whether two entries are likely to, in fact, computed using the GenPUID piece of software. The correspond to the same true value. This threshold is PUID format was chosen for its association with the dynamically weighted by the length of the strings. This is MusicBrainz API. done separately once each for the artist, composer, title, album, and genre fields. In the case of titles, recording 1.5 Method length is also considered, as two recordings might This research uses jMusicMetaManager [7] (a Java correctly have the same title but be performed entirely application for maintaining metadata), Codaich [7] (a differently database of music with manually-maintained metadata), a This approach, while helpful, is too simplistic to detect reference library of music labeled with unprocessed the full range of problems that one finds in practice. metadata, and a local MusicBrainz server at McGill Additional pre-processing was therefore implemented and University’s Music Technology Area. Reports were additional post-modification distances were calculated. generated in jMusicMetaManager, an application for This was done in order to reduce the edit distance of music metadata maintenance which was improved as part strings that probably refer to the same thing, thus making of this project by the addition of fingerprinting-based it easier to detect the corresponding inconsistency. For querying. This was done in order to find the percentage of example: metadata that was identical between the manually- Occurences of ―The ‖ were removed (e.g., ―The maintained metadata and that found on the MusicBrainz Police‖ should match ―Police‖). server of metadata. In addition to comparing the artist, album, and title fields, a statistic was computed indicating Occurrences of ― and ‖ were replaced with ― & ‖ how often all three of these specific fields matched Personal titles were converted to abbreviations between the local library and the metadata server, a (e.g., ―Doctor John‖ to ―Dr. John‖). statistic that we refer to as ―identical metadata.‖ Raimond Instances of ―in’‖ were replaced with ―ing‖ (e.g., et al. [10] present a similar method, with the ultimate ―Breakin’ Down‖ to ―Breaking Down‖). objective of accessing information on the Semantic Web. Punctuation and brackets were removed (e.g., An unprocessed test collection, consisting of music ―R.E.M.‖ to ―REM‖). files obtained from file sharing services, was used in Track numbers from the beginnings of titles and order to provide a comparison between unmaintained and extra spaces were removed. manually-maintained metadata. This unprocessed library In all, jMusicMetaManager can perform 23 pre- is referred to as the ―reference library‖ in this paper. processing operations. Furthermore, an additional type of processing can be performed where word orders are 2. JMUSICMETAMANAGER rearranged (e.g., ―Ella Fitzgerald‖ should match ―Fitzgerald, Ella,‖ and ―Django Reinhardt & Stéphane jMusicMetaManager [7] is a piece of software designed Grappelli‖ should match ―Stéphane Grappelli & Django to automatically detect metadata inconsistencies and Reinhardt‖). Word subsets can also be considered (e.g., errors in musical collections, as well as generate ―Duke Ellington‖ might match ―Duke Ellington & His descriptive profiling statistics about such collections. The Orchestra‖). software is part of the jMIR [8] music information jMusicMetaManager also automatically generates a retrieval software suite, which also includes audio, MIDI, variety of HTML-formatted statistical reports about music and cultural feature extractors; metalearning machine collections, including multiple data summary views and learning software; and
Recommended publications
  • Beets Documentation Release 1.5.1
    beets Documentation Release 1.5.1 Adrian Sampson Oct 01, 2021 Contents 1 Contents 3 1.1 Guides..................................................3 1.2 Reference................................................. 14 1.3 Plugins.................................................. 44 1.4 FAQ.................................................... 120 1.5 Contributing............................................... 125 1.6 For Developers.............................................. 130 1.7 Changelog................................................ 145 Index 213 i ii beets Documentation, Release 1.5.1 Welcome to the documentation for beets, the media library management system for obsessive music geeks. If you’re new to beets, begin with the Getting Started guide. That guide walks you through installing beets, setting it up how you like it, and starting to build your music library. Then you can get a more detailed look at beets’ features in the Command-Line Interface and Configuration references. You might also be interested in exploring the plugins. If you still need help, your can drop by the #beets IRC channel on Libera.Chat, drop by the discussion board, send email to the mailing list, or file a bug in the issue tracker. Please let us know where you think this documentation can be improved. Contents 1 beets Documentation, Release 1.5.1 2 Contents CHAPTER 1 Contents 1.1 Guides This section contains a couple of walkthroughs that will help you get familiar with beets. If you’re new to beets, you’ll want to begin with the Getting Started guide. 1.1.1 Getting Started Welcome to beets! This guide will help you begin using it to make your music collection better. Installing You will need Python. Beets works on Python 3.6 or later. • macOS 11 (Big Sur) includes Python 3.8 out of the box.
    [Show full text]
  • Beets Documentation Release 1.3.3
    beets Documentation Release 1.3.3 Adrian Sampson February 27, 2014 Contents i ii beets Documentation, Release 1.3.3 Welcome to the documentation for beets, the media library management system for obsessive-compulsive music geeks. If you’re new to beets, begin with the Getting Started guide. That guide walks you through installing beets, setting it up how you like it, and starting to build your music library. Then you can get a more detailed look at beets’ features in the Command-Line Interface and Configuration references. You might also be interested in exploring the plugins. If you still need help, your can drop by the #beets IRC channel on Freenode, send email to the mailing list, or file a bug in the issue tracker. Please let us know where you think this documentation can be improved. Contents 1 beets Documentation, Release 1.3.3 2 Contents CHAPTER 1 Contents 1.1 Guides This section contains a couple of walkthroughs that will help you get familiar with beets. If you’re new to beets, you’ll want to begin with the Getting Started guide. 1.1.1 Getting Started Welcome to beets! This guide will help you begin using it to make your music collection better. Installing You will need Python. (Beets is written for Python 2.7, but it works with 2.6 as well. Python 3.x is not yet supported.) • Mac OS X v10.7 (Lion) and 10.8 (Mountain Lion) include Python 2.7 out of the box; Snow Leopard ships with Python 2.6.
    [Show full text]
  • Downloading a Resource from Another Server, Looks for Keywords in the Document and Then Searches the Contents for Hyperlinks
    Embedding intelligence in enhanced music mapping agents By MARNITZ CORNELL GRAY DISSERTATION submitted in fulfilment of the requirements for the degree MASTER OF SCIENCE In COMPUTER SCIENCE in the FACULTY OF SCIENCE at the UNIVERSITY OF JOHANNESBURG SUPERVISOR: PROF. E.M. EHLERS SEPTEMBER 2007 Abstract Keywords: Pluggable Intelligence, Intelligent Music Selection Artificial Intelligence has been an increasing focus of study over the past years. Agent technology has emerged as being the preferred model for simulating intelligence [Jen00a]. Focus is now turning to inter-agent communication [Jen00b] and agents that can adapt to changes in their environment. Digital music has been gaining in popularity over the past few years. Devices such as Apple’s iPod have sold millions. These devices have the capability of holding thousands of songs. Managing such a device and selecting a list of songs to play from so many can be a difficult task. This dissertation expands on agent types by creating a new agent type known as the Modifiable Agent. The Modifiable Agent type defines agents which have the ability to modify their intelligence depending on what data they need to analyse. This allows an agent to, for example, change from being a goal based to a learning based agent, or allows an agent to modify the way in which it processes data. Digital music is a growing field with devices such as the Apple iPod revolutionising the industry. These devices can store large amounts of songs and as such, make it very difficult to navigate as they usually don’t include devices such as a mouse or keyboard.
    [Show full text]
  • 5. Controlling the Cocktail Audio Pro X100
    www.cocktailaudio.co.uk Contents Cocktail Audio PRO X100 ............................................................................................ 1 Safety Notice .................................................................................................................. 1 Caution: .......................................................................................................................... 1 Warranty & Disclaimer: ................................................................................................. 1 Lawful Usage & Digital Music Licensing: .................................................................... 2 1. Introduction ................................................................................................................ 2 2. Product ....................................................................................................................... 4 3. Hard drive (HDD) installation/setup:......................................................................... 5 4. Installation & connections: ........................................................................................ 5 5. Controlling the Cocktail Audio Pro X100 ................................................................. 5 6. Initial setup................................................................................................................. 5 7. Setting up the network ............................................................................................... 7 8. How to rip (encode) an audio CD: ............................................................................
    [Show full text]
  • Release V2.5.6
    MusicBrainz Picard Release v2.5.6 Feb 16, 2021 MusicBrainz Picard User Guide by Bob Swift is licensed under CC0 1.0. To view a copy of this license, visit https://creativecommons.org/publicdomain/zero/1.0 CONTENTS 1 Introduction 1 1.1 Picard Can. ...........................................2 1.2 Picard Cannot. .........................................2 1.3 Limitations...........................................2 2 Contributing to the Project3 3 Acknowledgements4 3.1 Editor and English Language Lead..............................4 3.2 Translation Teams.......................................4 3.3 Contributors..........................................4 4 Glossary of Terms 6 5 Getting Started 10 5.1 Download & Install Picard................................... 10 5.2 Main Screen.......................................... 12 5.3 Status Icons........................................... 18 6 Configuration 20 6.1 Screen Setup.......................................... 20 6.2 Action Options......................................... 21 6.3 Option Settings......................................... 21 7 Tags & Variables 66 7.1 Basic Tags........................................... 66 7.2 Advanced Tags......................................... 70 7.3 Basic Variables......................................... 72 7.4 File Variables.......................................... 73 7.5 Advanced Variables...................................... 74 7.6 Classical Music Tags...................................... 75 7.7 Tags from Plugins......................................
    [Show full text]
  • Descriptive Metadata in the Music Industry 1
    DESCRIPTIVE METADATA IN THE MUSIC INDUSTRY 1 Descriptive Metadata In The Music Industry: Why It Is Broken And How To Fix It Tony Brooke December 2014 Author Note Tony Brooke is a media asset manager in San Francisco, California. He holds a Master’s degree in Library and Information Science (with a concentration in media asset management and audiovisual metadata) from San José State University, and has been an audio engineer in San Francisco since 1992. Correspondence regarding this should be made to Tony Brooke, Silent Way Media Asset Management, San Francisco, (415) 826-2888. http://www.silentway.com/research This is a corrected postprint of an article originally published in two parts, February- March 2014 in the Journal of Digital Media Management. Keywords: metadata, standards, music, descriptive, open, proprietary, database, identifier, credits, schema, persistent DESCRIPTIVE METADATA IN THE MUSIC INDUSTRY 2 Table Of Contents Abstract 3 Introduction 4 Descriptive Metadata: Now And Then 5 Terminology 8 Reasons for the Lack of Descriptive Metadata 17 The Silos 20 The DDEX Suite of Standards and CCD 30 Part Two 33 Why Hasn’t This Been Fixed Yet? 33 Toward A Globally Unique Abstracted Persistent Identifier (GUAPI) 40 Proposed Study: Quantifying Descriptive Metadata Value 49 Research Questions 49 Participants 50 Data Collection Instrument 50 Procedure 51 Conclusion And Recommendations 53 References 59 Appendix: Acronym Reference 67 Document Control Version history Version Date Version notes 1.0 June 19, 2013 Initial draft, circulated privately, APA citations. 1.1 July 14, 2013 Revised draft. 2.0 October 15, 2013 Significant rewrite, with different conclusion.
    [Show full text]
  • Managing File Names with Metadata
    A File By Any Other Name: Managing File Names with Metadata Name1 Name2 Name3 Name4 Affiliation1 Affiliation2 Affiliation3 Email1 Email2 Email3/4 Abstract it or recognize it when we look at it later. In order to help File names are an ancient abstraction, a string of characters users to find and remember files, they often contain a bounty to uniquely identify a file and help users remember the con- of useful metadata about the file. tents of a file when they look for it later. However, they are However current approaches to file names have flaws: not an ideal solution for these tasks. Names often contain • They are unstructured semantic metadata which is opaque useful semantic metadata providedby the user, but this meta- to the system data is opaque to the rest of the system. The formatting of • Formatting is up to the user, and is error prone and incon- the file name is up to the user, and as a result, metadata in sistent, making it hard to find files later file names is often error-prone, inconsistently formatted, and • hard to search for later. File names need to be more mean- Changing a file name destroys information ingful and reliable, while simplifying application design and Consider the following file name drawn from the experi- encouraging users and applications to provide more meta- ments for this paper: data for search. createfiles_HDD_truenames_100000files_1threads. We describe our prototype file system, TrueNames, a data POSIX compliant file system which demonstrates an al- Looked at in one light, this is a long, arbitrary, error prone ternate approach to naming and metadata, by providing string of characters.
    [Show full text]
  • Unify Your Local Music Library
    Unify your local music library [WORKSHOP] Introducing MusicBrainz, Picard, and ultimate music categorization Justin W. Flory (licensed under CC-BY-SA 4.0) Twitter: @jflory7 Blog: blog.justinwflory.com : Wat? ● Open source music metadata encyclopedia (est. 2000) ○ Creative Commons ● Store, categorize, sort (all languages and scripts) ● Used by… ○ Developers: Build cool music apps! ○ Commercial users: Update local music databases with live updates from MusicBrainz (Spotify, Amazon, Google, Last.fm, etc.) ○ Users (that’s us!): Tagging your own music Metadata matters Visibility Find your music! Record keeping Write music history! Research Discover music science! Is MusicBrainz really that big? Let’s look at some numbers… ● Artists: 1,189,376 ● Releases: 1,769,002 ● Recordings: 17,152,131 ● Artist locations: ○ Countries: 259 ○ Cities: 74,087 ○ Islands: 96 ● Editor accounts: 1,781,906 registered (1,329 active last week) But what about my audio dramas? Yes, you can have your audio dramas, all 257 of them. Introducing… ● Tool to sort, scan, and correct metadata for you ○ References MusicBrainz database ● Puts the right metadata in the right place ○ Album name, artist name, album artwork, release year… ○ Helpful for music tracking services (e.g. Last.fm / Libre.fm) ○ Helpful for cloud music players (upload and forget) ● Available for Windows, macOS, Linux Installing and using Available for… almost everything picard.musicbrainz.org ● Windows / macOS: Installers available ● Linux: ○ Ubuntu: sudo add-apt-repository ppa:musicbrainz-developers/stable sudo apt-get update ○ Fedora: sudo dnf install picard chromaprint-tools Why do I need chromaprint-tools? ○ Arch Linux, Debian, Gentoo, OpenSUSE, and others (see the docs) Exploring the MusicBrainz Picard interface Conceptual demo (1) Hey, look, this album doesn’t have album artwork, but I have it on my computer.
    [Show full text]
  • Making Metadata: the Case of Musicbrainz
    Making Metadata: The Case of MusicBrainz Jess Hemerly [email protected] May 5, 2011 Master of Information Management and Systems: Final Project School of Information University of California, Berkeley Berkeley, CA 94720 Making Metadata: The Case of MusicBrainz Jess Hemerly School of Information University of California, Berkeley Berkeley, CA 94720 [email protected] Summary......................................................................................................................................... 1! I.! Introduction .............................................................................................................................. 2! II.! Background ............................................................................................................................. 4! A.! The Problem of Music Metadata......................................................................................... 4! B.! Why MusicBrainz?.............................................................................................................. 8! C.! Collective Action and Constructed Cultural Commons.................................................... 10! III.! Methodology........................................................................................................................ 14! A.! Quantitative Methods........................................................................................................ 14! Survey Design and Implementation.....................................................................................
    [Show full text]
  • The Big Book of Itunes
    MakeUseOf.com presents The Big Book of With links to Useful new Cool apps to articles on tips you use with iTunes MakeUseOf never knew iTunes By Jackson Chung Author & Staff writer The Big Book of iTunes 2009 Introduction This manual was created with the intention of introducing iTunes to beginners and to provide basic information and instructions to perform various tasks when using iTunes on both Mac and Windows. This manual also contains some very useful tips on how to achieve various actions such as transferring iTunes libraries from Windows to Mac and using iTunes as your alarm clock. iTunes can actually be more than just a music player. Also, this manual will introduce cool, new software to complement and extend the functionality of iTunes. This manual will begin with the very basics, introducing iTunes and organizing music. Then it will move on to more advanced topics like integrating Last.fm and using Applescripts. I hope you will find The Big Book of iTunes a useful tool for doing more with iTunes. Table of Contents Getting Started 7 Why should I use iTunes? 7 Running iTunes for the first time 7 Iʼve started up iTunes, now what? 8 Organizing Your Music 8 Why do I need to organize my music? 8 Iʼve got my library in order, now what? 9 Why should I add lyrics? 9 What are Cover art? 10 iTunes How-Toʼs 11 I used iTunes in Windows. How do I transfer my library over to a Mac? 11 Itʼs all about sharing 15 How do I share my iTunes library? 15 How do I listen to other shared libraries? 15 Other sharing options 16 One of them is Mojo.
    [Show full text]
  • Musicbrainz Picard Release V2.3.2
    MusicBrainz Picard Release v2.3.2 Dec 11, 2020 MusicBrainz Picard User Guide by Bob Swift is licensed under CC0 1.0. To view a copy of this license, visit https://creativecommons.org/publicdomain/zero/1.0 CONTENTS 1 Introduction 1 1.1 Documentation Credits & Updates...............................2 1.2 Picard Can. ...........................................2 1.3 Picard Cannot. .........................................2 1.4 Limitations...........................................3 2 Glossary of Terms 4 3 Getting Started 7 3.1 Download & Install Picard...................................7 3.2 Main Screen..........................................8 3.3 Status Icons...........................................9 4 Configuration 11 4.1 Screen Setup.......................................... 11 4.2 Action Options......................................... 12 4.3 Option Settings......................................... 12 5 Tags & Variables 52 5.1 Basic Tags........................................... 52 5.2 Advanced Tags......................................... 57 5.3 Basic Variables......................................... 58 5.4 Advanced Variables...................................... 60 5.5 Classical Music Tags...................................... 61 5.6 Tags from Plugins....................................... 61 5.7 Other Information....................................... 64 6 Scripting 65 6.1 Syntax............................................. 65 6.2 Metadata Variables....................................... 65 7 Scripting Functions 66 7.1 Assignment
    [Show full text]
  • Discovering Metadata Inconsistencies
    DISCOVERING METADATA INCONSISTENCIES Bruno Angeles Cory McKay Ichiro Fujinaga CIRMMT CIRMMT CIRMMT Schulich School of Music Schulich School of Music Schulich School of Music McGill University McGill University McGill University bruno.angeles@mail. [email protected] [email protected] mcgill.ca ABSTRACT suggest more than 170 metadata elements organized into five types, in research pertaining to metadata for This paper describes the use of fingerprinting-based querying in identifying metadata inconsistencies in music phonograph recordings. Casey et al. [2] distinguish libraries, as well as the updates to the jMusicMeta- factual from cultural metadata. The most widely used Manager software in order to perform the analysis. Test implementation of musical metadata is the ID3 format results are presented for both the Codaich database and a associated with MP3 files [5]. generic library of unprocessed metadata. Statistics were The main problem with metadata is its inconsistency. computed in order to evaluate the differences between a The fact that it is stored in databases containing manually-maintained library and an unprocessed thousands, if not millions, of entries often means that the collection when comparing metadata with values on a data is supplied by several people who may have different MusicBrainz server queried by fingerprinting. approaches. Spelling mistakes may go unnoticed for a long time, and information such as an artist’s name might 1. INTRODUCTION be spelled in different but equally valid ways. 1.1 Purpose Additionally, several metadata labels—most notably Metadata is useful in organizing information, but in large genre —are highly subjective. collections of data it can be tedious to keep that When maintaining large databases of music, valid and information consistent.
    [Show full text]