Audio Content-Based Music Retrieval
Tutorial

Meinard Müller
Saarland University and MPI Informatik
[email protected]

Joan Serrà
Artificial Intelligence Research Institute, Spanish National Research Council
[email protected]

Meinard Müller
2007 Habilitation, Bonn, Germany
2007 MPI Informatik, Saarland, Germany
Senior Researcher, Multimedia Information Retrieval & Music Processing
6 PhD Students

Joan Serrà
2007-2011 Ph.D., Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain
Postdoctoral Researcher, Artificial Intelligence Research Institute, Spanish National Research Council
Music information retrieval, machine learning, time series analysis, complex networks, ...

Music Retrieval
Textual metadata
– Traditional retrieval
– Searching for artist, title, …
Rich and expressive metadata
– Generated by experts
– Crowd tagging, social networks
Content-based retrieval
– Automatic generation of tags
– Query-by-example

Query-by-Example
Query: Beethoven, Symphony No. 5, Bernstein (1962)
Hits in the database:
– Beethoven, Symphony No. 5: Bernstein (1962), Karajan (1982), Gould (1992)
– Beethoven, Symphony No. 9
– Beethoven, Symphony No. 3
– Haydn, Symphony No. 94
Retrieval tasks: Audio identification, Audio matching, Version identification, Category-based music retrieval

Query-by-Example Taxonomy
Specificity level: from high specificity to low specificity
Granularity level: from fragment-based retrieval to document-based retrieval
Retrieval tasks (listed from high to low specificity):
– Audio identification
– Audio matching
– Version identification
– Category-based music retrieval

Overview
Part I: Audio identification
Part II: Audio matching
Coffee Break
Part III: Version identification
Part IV: Category-based music retrieval

Audio Identification
Database: Huge collection consisting of all audio recordings (feature representations) that may potentially be identified.
Goal: Given a short query audio fragment, identify the original audio recording the query is taken from.
Notes:
– Instance of fragment-based retrieval
– High specificity
– A specific rendition of the piece is identified, not the piece of music itself

Application Scenario
User hears music playing in the environment
User records music fragment (5-15 seconds) with mobile phone
Audio fingerprints are extracted from the recording and sent to an audio identification service
Service identifies audio recording based on fingerprints
Service sends back metadata (track title, artist) to user

Audio Fingerprints
An audio fingerprint is a content-based compact signature that summarizes a piece of audio content.
Requirements:
Discriminative power
– Ability to accurately identify an item within a huge number of other items (informative, characteristic)
– Low probability of false positives
– Recorded query excerpt is only a few seconds long
– Large audio collection on the server side (millions of songs)
Invariance to distortions
– Recorded query may be distorted and superimposed with other audio sources
– Background noise
– Pitching (audio played faster or slower)
– Equalization
– Compression artifacts
– Cropping, framing
– …
Compactness
– Reduction of complex multimedia objects
– Reduction of dimensionality
– Making indexing feasible
– Allowing for fast search
Computational simplicity
– Computational efficiency
– Extraction of fingerprint should be simple
– Size of fingerprints should be small

Literature (Audio Identification)
Allamanche et al. (AES 2001)
Cano et al. (AES 2002)
Haitsma/Kalker (ISMIR 2002)
Kurth/Clausen/Ribbrock (AES 2002)
Wang (ISMIR 2003)
…
Dupraz/Richard (ICASSP 2010)
Ramona/Peeters (ICASSP 2011)

Fingerprints (Shazam)
Steps:
1. Spectrogram
– Efficiently computable
– Standard transform
– Robust
2. Peaks (local maxima)
Robustness of the peaks (query peaks versus differing peaks):
– Noise, reverb, room acoustics, equalization
– Audio codec
– Superposition of other audio sources
[Figure: spectrogram with marked peaks, Frequency (Hz) over Time (seconds), intensity as gray level]
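The peak-picking step of the Fingerprints (Shazam) slides can be sketched as follows. This is a minimal illustration, assuming the magnitude spectrogram is already available as a list of time frames with one value per frequency bin. The function name `constellation_map`, the neighbourhood sizes `tau` and `kappa`, and the threshold `min_mag` are hypothetical choices made for this example and do not come from the slides.

```python
def constellation_map(spec, tau=1, kappa=1, min_mag=0.0):
    """Return the (frame, bin) positions that are local maxima of `spec`.

    spec[n][k] is the magnitude at time frame n and frequency bin k.
    A point is kept if it exceeds `min_mag` and is at least as large as
    every value in the (2*tau+1) x (2*kappa+1) neighbourhood around it.
    All parameter names are assumptions made for this sketch.
    """
    num_frames = len(spec)
    num_bins = len(spec[0]) if spec else 0
    peaks = []
    for n in range(num_frames):
        for k in range(num_bins):
            value = spec[n][k]
            if value <= min_mag:
                continue
            # Collect the neighbourhood, clipped at the spectrogram borders.
            neighbourhood = [
                spec[i][j]
                for i in range(max(0, n - tau), min(num_frames, n + tau + 1))
                for j in range(max(0, k - kappa), min(num_bins, k + kappa + 1))
            ]
            if value >= max(neighbourhood):
                peaks.append((n, k))
    return peaks


# Toy spectrogram: 4 time frames x 4 frequency bins (made-up values).
toy_spec = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.3, 0.1],
    [0.1, 0.3, 0.2, 0.7],
    [0.0, 0.1, 0.1, 0.2],
]
print(constellation_map(toy_spec))  # [(1, 1), (2, 3)]
```

In this toy case, multiplying every magnitude by the same gain leaves the peak positions unchanged, which gives a small hint of why peak positions tolerate some level changes. Real distortions such as noise or codecs are of course not covered by this example.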
Matching Fingerprints (Shazam)
Database document (constellation map)
Query document (constellation map)
1. Shift query across database document
2. Count matching peaks
3. High count indicates a hit (document ID & position)
[Figure: constellation maps of database document and query document, Frequency (Hz) over Time (seconds); plot of #(matching peaks) over Shift (seconds)]
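The shift-and-count matching of constellation maps can be sketched as below. This is a brute-force illustration under simplifying assumptions: peaks are integer (frame, bin) pairs, shifts are whole frames, and a match requires exact coincidence. The function names `match_counts` and `best_shift` and the example data are hypothetical.

```python
def match_counts(db_peaks, query_peaks):
    """Count coinciding peaks for every non-negative shift of the query.

    Both arguments are iterables of (frame, bin) pairs. The query is moved
    so that its first frame sits at `shift`, and the number of query peaks
    that land exactly on a database peak is recorded for each shift.
    """
    db = set(db_peaks)
    query = list(query_peaks)
    if not db or not query:
        return {}
    query_start = min(n for n, _ in query)
    last_frame = max(n for n, _ in db)
    counts = {}
    for shift in range(last_frame + 1):
        counts[shift] = sum(
            (n - query_start + shift, k) in db for n, k in query
        )
    return counts


def best_shift(db_peaks, query_peaks):
    """Return (shift, count) for the shift with the most matching peaks."""
    counts = match_counts(db_peaks, query_peaks)
    shift = max(counts, key=counts.get)
    return shift, counts[shift]


# Made-up database document and a query cut from frames 3..7 of it,
# with one spurious peak (2, 9) added to mimic a distortion.
database = [(0, 5), (1, 2), (3, 7), (4, 4), (6, 1), (7, 6), (9, 3)]
query = [(0, 7), (1, 4), (2, 9), (3, 1), (4, 6)]
print(best_shift(database, query))  # (3, 4)
```

Four of the five query peaks coincide at a shift of 3 frames and none at any other shift, so the count curve has a single clear maximum. Scanning every shift of every document this way is expensive, which is the motivation for an index.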
Indexing (Shazam)
Index the fingerprints using hash lists
Hashes correspond to (quantized) frequencies
Hash list consists of time positions (and document IDs)
N = number of spectral peaks
B = #(bits) used to encode spectral peaks
2^B = number of hash lists
N / 2^B = average number of elements per list
Problem:
– Individual peaks are not characteristic
– Hash lists may be very long
– Not suitable for indexing
[Figure: hash lists Hash 1, Hash 2, …, Hash 2^B attached to a constellation map, Frequency (Hz) over Time (seconds)]

Indexing (Shazam)
Idea: Use pairs of peaks to increase specificity of hashes
1. Peaks
2. Fix anchor point
3. Define target zone
4. Use pairs of points
5. Use every point as anchor point
[Figure: anchor point and target zone in a constellation map, Frequency (Hz) over Time (seconds)]
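The indexing idea can be sketched as follows: peak pairs are turned into hashes, the hashes are stored in lists of (document ID, time position), and a query votes for (document, shift) combinations. This is only an illustration under stated assumptions. The hash layout (anchor bin, target bin, time difference), the target-zone parameters `dt_min`, `dt_max`, `df_max`, all function names, and the example numbers are hypothetical and not taken from the slides.

```python
from collections import defaultdict


def average_list_length(num_peaks, bits):
    """Average hash-list length N / 2^B for N peaks and B-bit hashes."""
    return num_peaks / 2 ** bits


def pair_hashes(peaks, dt_min=1, dt_max=3, df_max=4):
    """Yield (hash, anchor_frame) for every anchor/target pair of peaks.

    Every peak serves as anchor. Its target zone (an assumption of this
    sketch) contains peaks that lie dt_min..dt_max frames later and at
    most df_max bins away in frequency.
    """
    ordered = sorted(peaks)
    for n0, k0 in ordered:
        for n1, k1 in ordered:
            if dt_min <= n1 - n0 <= dt_max and abs(k1 - k0) <= df_max:
                yield (k0, k1, n1 - n0), n0


def build_index(documents):
    """Map each hash to a list of (document ID, anchor frame)."""
    index = defaultdict(list)
    for doc_id, peaks in documents.items():
        for h, frame in pair_hashes(peaks):
            index[h].append((doc_id, frame))
    return index


def identify(index, query_peaks):
    """Return (document ID, shift, votes) with the most votes, or None."""
    votes = defaultdict(int)
    for h, query_frame in pair_hashes(query_peaks):
        for doc_id, doc_frame in index.get(h, ()):
            votes[(doc_id, doc_frame - query_frame)] += 1
    if not votes:
        return None
    (doc_id, shift), count = max(votes.items(), key=lambda item: item[1])
    return doc_id, shift, count


# Made-up constellation maps; the query is cut from frames 3..7 of doc_a.
documents = {
    "doc_a": [(0, 5), (1, 2), (3, 7), (4, 4), (6, 1), (7, 6), (9, 3)],
    "doc_b": [(0, 1), (2, 8), (3, 3), (5, 6), (6, 2), (8, 7)],
}
index = build_index(documents)
print(identify(index, [(0, 7), (1, 4), (3, 1), (4, 6)]))  # ('doc_a', 3, 3)
print(average_list_length(10**9, 10))  # 976562.5
```

With single-peak hashes of B = 10 bits and a hypothetical N = 10^9 peaks, each list would hold close to a million entries on average. A pair hash combines two frequencies and a time difference, so it can spend more bits and spread the same entries over many more, much shorter lists.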