Advanced Course Computer Science: Music Processing, Summer Term 2010
Audio Retrieval
Meinard Müller, Saarland University and MPI Informatik
[email protected]

Overview (Audio Retrieval)
- Audio identification (audio fingerprinting)
- Audio matching
- Cover song identification
Overview (Audio Retrieval)
- Audio identification (audio fingerprinting): Allamanche et al. (AES 2001), Cano et al. (IEEE MMSP 2002), Kurth/Clausen/Ribbrock (AES 2002), Wang (ISMIR 2003), Shrestha/Kalker (ISMIR 2004)
- Audio matching
- Cover song identification
Audio Identification
Shazam application scenario ("The Moment"):
- User hears music playing in the environment: radio (car, home, work), TV and cinema, clubs and bars, cafes, shops, restaurants
- User records a music fragment (5-15 seconds) with a mobile phone
- Audio fingerprints are extracted from the recording and sent to a service
- The server identifies the audio recording based on the fingerprints
- The server sends back metadata (track title, artist) to the user
[Wang, ISMIR 2003]
Audio Identification
Shazam application scenario: target audience.
An audio fingerprint is a content-based compact signature that summarizes a piece of audio content.
Requirements:
- Discriminative power
- Invariance to distortions
- Compactness
- Computational simplicity
[Wang, ISMIR 2003]
Audio Identification
The requirements in more detail:
- Discriminative power: the ability to accurately identify an item within a huge number of other items (informative, high entropy), with a low probability of false positives. The recorded query excerpt is only a few seconds long, while the audio collection on the server side is large (millions of songs).
- Invariance to distortions: the recorded query may be distorted and superimposed with other audio sources. Typical distortions are background noise, pitching (audio played faster or slower), equalization, compression artifacts, and cropping/framing.
- Compactness: reduction of complex multimedia objects and of dimensionality; the size of the fingerprint should be small, making indexing feasible and allowing for fast search.
- Computational simplicity: computational efficiency; the extraction of the fingerprint should be simple.
Matching Fingerprints (Shazam)
- For each database document (audio file), generate reproducible landmarks; each landmark occurs at a time position.
- For each landmark, generate a "fingerprint" that characterizes its location. Do the same for the query fragment.
- Generate a list of matching fingerprints (matches between query and database document). Each match is represented by a pair (t_database, t_query) of time positions.
- A matching segment is characterized by a set M of pairs, each having the same time difference: t_database - t_query = constant for (t_database, t_query) in M.
- False positives have random time differences, so filter out the cruft with a histogram over the time differences; the score is the size of the largest histogram peak.
[Wang, ISMIR 2003]
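The histogram-based scoring described above can be sketched as follows. This is a minimal illustration; the match list and the bin width are made-up examples, not parameters from the paper.

```python
from collections import Counter

def match_score(matches, bin_width=1):
    """Score a candidate document from a list of fingerprint matches.

    Each match is a pair (t_database, t_query) of time positions.
    A true hit produces many matches with the same time difference,
    so the score is the size of the largest histogram bin of
    t_database - t_query; random false positives spread out over
    many bins and contribute little.
    """
    histogram = Counter(
        (t_db - t_q) // bin_width for (t_db, t_q) in matches
    )
    if not histogram:
        return 0, None
    offset, score = max(histogram.items(), key=lambda kv: kv[1])
    return score, offset * bin_width

# Five matches consistent with an offset of 40, plus two random outliers
matches = [(40, 0), (42, 2), (45, 5), (47, 7), (50, 10), (13, 9), (88, 3)]
score, offset = match_score(matches)  # score 5 at offset 40
```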
Matching Fingerprints (Shazam)
Scatter plot of matching hash locations (time in query vs. time in database) and the corresponding histogram of time differences:
- A true match shows up as a diagonal in the scatter plot, i.e., as a clear peak in the histogram of time offsets.
- No peak → no matching segment.
[Wang, ISMIR 2003]
Matching Fingerprints (Shazam)
In the example, the histogram peak reveals a matching segment starting at position 40.
[Wang, ISMIR 2003]

Fingerprints (Shazam)
Step 1: Spectrogram (frequency in Hertz over time in seconds)
- Standard transform
- Efficiently computable
- Robust
[Wang, ISMIR 2003]
Fingerprints (Shazam)
Step 2: Peaks ("constellation map")
- Robust to noise, reverb, and room acoustics
- Tend to survive through voice codecs
Problem: individual peaks have low entropy and are not suitable for indexing.
[Wang, ISMIR 2003]

Fingerprints (Shazam)
Steps 3 and 4: Target zone and pairs of peaks
- Fix an anchor point and define a target zone
- Use pairs of points (anchor point, point in the target zone)
- Use every point as an anchor point
[Wang, ISMIR 2003]
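One simple way to extract a constellation map is a plain local-maximum criterion, sketched below. The neighborhood size and the toy spectrogram are arbitrary illustrative choices; a real system would additionally enforce a target density of peaks per second.

```python
import numpy as np

def constellation_map(spectrogram, neighborhood=1):
    """Return (time, frequency) indices of local spectral peaks.

    A bin counts as a peak if it is the unique maximum of the
    surrounding (2*neighborhood+1)^2 square. Border bins are skipped
    for simplicity.
    """
    n_freq, n_time = spectrogram.shape
    r = neighborhood
    peaks = []
    for f in range(r, n_freq - r):
        for t in range(r, n_time - r):
            patch = spectrogram[f - r:f + r + 1, t - r:t + r + 1]
            m = patch.max()
            # unique maximum of its neighborhood -> a constellation point
            if spectrogram[f, t] == m and np.sum(patch == m) == 1:
                peaks.append((t, f))
    return peaks
```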
Indexing (Shazam)
A hash is formed between an anchor point and each point in its target zone, using the two frequency values and the time difference. The fan-out (taking pairs of peaks) may cause a combinatorial explosion in the number of tokens; however, this can be controlled by the size of the target zone. Using more complex hashes increases specificity (leading to much smaller hash buckets) and speed (making the retrieval much faster).

Definitions:
- N = number of spectral peaks
- p = probability that a spectral peak can be found in the (noisy and distorted) query
- F = fan-out of the target zone, e.g. F = 10
- B = number of bits used to encode a spectral peak and a time difference

Consequences:
- F · N = number of tokens to be indexed (F times as many tokens in query and database, respectively)
- 2^(B+B) = increase of specificity (2^(B+B+B) instead of 2^B)
- p² = probability of a hash to survive
- p · (1 - (1 - p)^F) = probability that at least one hash survives per anchor point

Example (F = 10, B = 10):
- Memory requirements: F · N = 10 · N
- Speedup factor: 2^(B+B) / F² ≈ 10^6 / 10^2 = 10000
[Wang, ISMIR 2003]
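The pairing of an anchor with the points in its target zone can be sketched as below. The target-zone bounds, the fan-out default, and the three 10-bit hash fields are illustrative assumptions, not Wang's exact values; the packing of (f_anchor, f_target, dt) into ~3B bits is what raises specificity from 2^B to about 2^(3B) at the cost of F times as many tokens.

```python
def build_hashes(peaks, fan_out=10, dt_min=1, dt_max=64, df_max=128):
    """Turn a constellation map into (hash, anchor_time) tokens.

    peaks: list of (time, frequency) pairs, sorted by time.
    Each anchor is paired with up to `fan_out` later peaks inside its
    target zone (time difference in [dt_min, dt_max], frequency
    difference at most df_max).
    """
    hashes = []
    for i, (t1, f1) in enumerate(peaks):
        paired = 0
        for t2, f2 in peaks[i + 1:]:
            dt = t2 - t1
            if dt < dt_min:
                continue
            if dt > dt_max or paired >= fan_out:
                break  # peaks are time-sorted, so no later peak qualifies
            if abs(f2 - f1) <= df_max:
                # pack (f1, f2, dt) into three ~10-bit fields
                h = (f1 << 20) | (f2 << 10) | dt
                hashes.append((h, t1))
                paired += 1
    return hashes
```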
Results (Shazam)
- Test dataset of 10000 tracks
- Search time: 5 to 500 milliseconds
- Recognition rate as a function of the signal/noise ratio (dB), shown for query lengths of 15, 10, and 5 seconds: longer queries remain recognizable at lower SNR.
[Wang, ISMIR 2003]

Conclusions (Shazam)
Many parameters to choose:
- Temporal and spectral resolution in the spectrogram
- Peak picking strategy
- Target zone and fan-out parameter
- Hash function
- …
[Wang, ISMIR 2003]

Conclusions (Audio Identification)
- Identifies an audio recording (not a piece of music)
- Highly robust to noise, artifacts, deformations
- May even handle superimposed recordings
- Does not allow identifying studio recordings from queries taken from live recordings
- Does not generalize to identifying different interpretations of the same piece of music

Overview (Audio Retrieval)
- Audio identification (audio fingerprinting)
- Audio matching
- Cover song identification
Audio Matching
References: Pickens et al. (ISMIR 2002), Müller/Kurth/Clausen (ISMIR 2005), Suyoto et al. (IEEE TASLP 2008), Kurth/Müller (IEEE TASLP 2008)
Various interpretations of Beethoven's Fifth: Bernstein, Karajan, Scherbakov (piano), MIDI (piano)
Audio Matching
Given: a large music database containing several recordings of the same piece of music, interpretations by various musicians, and arrangements in different instrumentations.
Goal: given a short query audio clip, identify all corresponding audio clips of similar musical content, irrespective of the specific interpretation and instrumentation, automatically and efficiently (query-by-example paradigm).
General strategy:
- Normalized and smoothed chroma features: correlate with the harmonic progression; robust to variations in dynamics, timbre, articulation, and local tempo.
- Robust matching procedure: efficient, robust to global tempo variations, scalable using an index structure.
[Müller et al., ISMIR 2005]
Feature Design
Beethoven's Fifth: Bernstein.
Processing chain: audio signal → subband decomposition (88 bands) → chroma energy distribution (12 bands) → quantization and statistics → convolution (smoothing), normalization, downsampling → CENS features.
Two stages:
- Stage 1: local chroma energy distribution features
- Stage 2: normalized short-time statistics
CENS = Chroma Energy Normalized Statistics.
Example settings: resolution of 10 features/second with a feature window size of 200 milliseconds, or 1 feature/second with a window size of 4000 milliseconds. The coarser resolution smooths out local differences, as seen when comparing Bernstein vs. Sawallisch.

Matching Procedure
- Compute CENS feature sequences for the query and the database.
- Evaluate a global distance function by comparing the query feature sequence against every database subsequence.
Query: Beethoven's Fifth / Bernstein, first 20 seconds.
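A sliding-window version of the global distance function might be sketched as follows. The averaged inner-product distance is one reasonable choice for unit-norm chroma vectors, not necessarily the paper's exact measure.

```python
import numpy as np

def matching_curve(query, database):
    """Distance between the query and every database subsequence.

    query: (L, 12) CENS-like feature sequence, rows normalized.
    database: (M, 12) feature sequence, M >= L.
    Returns an array of length M - L + 1; local minima indicate matches.
    1 - mean inner product behaves like an averaged cosine distance
    for unit-norm chroma vectors.
    """
    L = len(query)
    dists = np.empty(len(database) - L + 1)
    for i in range(len(dists)):
        window = database[i:i + L]
        dists[i] = 1.0 - np.mean(np.sum(query * window, axis=1))
    return dists
```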
Matching Procedure
Query: Beethoven's Fifth / Bernstein, first 20 seconds.
The local minima of the distance function yield the best audio matches; the example retrieves matches 1 through 6 in order of increasing distance.
Global Tempo Variations
Query: Beethoven's Fifth / Bernstein, first 20 seconds.
Problem: Karajan's interpretation is much faster, so the fixed-tempo distance function is useless for it.
Solution: make the Bernstein query faster (resample its feature sequence) and compute a new distance function; do this for various tempi and minimize over all resulting distance functions.

Experiments
- Audio database: > 110 hours, 16.5 GB
- Preprocessing: CENS features, 40.3 MB
- Query clip: 20 seconds
- Query response time: < 10 seconds
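The multiple-tempo strategy above, simulating global tempo changes by resampling the query feature sequence and minimizing over the resulting distance curves, might look like this. The tempo set and the distance measure are illustrative assumptions.

```python
import numpy as np

def resample_features(seq, factor):
    """Simulate a tempo change by index-resampling a feature sequence.

    factor > 1 shortens the sequence (faster tempo), factor < 1
    lengthens it.
    """
    n = max(1, int(round(len(seq) / factor)))
    idx = np.minimum((np.arange(n) * factor).astype(int), len(seq) - 1)
    return seq[idx]

def multi_tempo_curve(query, database, tempi=(0.66, 0.8, 1.0, 1.25, 1.5)):
    """Minimize the matching curve over several tempo-scaled queries."""
    curves = []
    for f in tempi:
        q = resample_features(query, f)
        L = len(q)
        d = [1.0 - np.mean(np.sum(q * database[i:i + L], axis=1))
             for i in range(len(database) - L + 1)]
        curves.append(d)
    m = min(len(c) for c in curves)      # align curve lengths
    return np.min([c[:m] for c in curves], axis=0)
```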
Experiments
Example queries: Beethoven's Fifth / Bernstein, first 20 seconds; Shostakovich Waltz / Chailly, first 27 and first 21 seconds.

Index-based Matching
Indexing stage (quantization of the feature space):
- Convert the database into a feature sequence (chroma/CENS)
- Quantize the features with respect to a fixed codebook
- Create an inverted file index: it contains an inverted list for each codebook vector; each list contains feature indices in ascending order
[Kurth/Müller, IEEE-TASLP 2008]
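The inverted file index and an exact-match lookup can be sketched as below. Real retrieval adds fault tolerance (fuzziness, admissible mismatches); the strict list intersection here is the simplest case.

```python
from collections import defaultdict

def build_inverted_index(quantized_features):
    """Inverted file index: codebook id -> ascending list of positions.

    quantized_features: sequence of codebook indices, one per feature
    frame of the database (obtained by nearest-neighbor quantization).
    """
    index = defaultdict(list)
    for position, codeword in enumerate(quantized_features):
        index[codeword].append(position)  # positions arrive in order
    return index

def candidate_matches(index, quantized_query):
    """Positions where the whole quantized query could start.

    Intersects the (shifted) inverted lists of the query's codewords;
    a fault-tolerant variant would replace the intersection by a
    softer union/counting scheme.
    """
    candidates = None
    for offset, codeword in enumerate(quantized_query):
        starts = {p - offset for p in index.get(codeword, [])}
        candidates = starts if candidates is None else candidates & starts
        if not candidates:
            return set()
    return candidates
```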
Index-based Matching
How to derive a good codebook?
- Codebook selection by unsupervised learning: the Linde-Buzo-Gray (LBG) algorithm, similar to k-means, adjusted to spheres of suitable size R
- Codebook selection based on musical knowledge
Quantization is then performed using nearest neighbors.
Index-based Matching: LBG algorithm
Steps:
1. Initialization of codebook vectors
2. Assignment
3. Recalculation
4. Iteration (back to 2.) until convergence

For the sphere variant (2D example), each iteration consists of assignment, recalculation, and projection onto the sphere.
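The four steps above can be sketched as a plain LBG/k-means iteration. This omits the sphere-projection refinement; the first-k-points initialization is a simplifying assumption (random or split-based initialization is common).

```python
import numpy as np

def lbg(data, k, iterations=50):
    """Codebook training in the spirit of Linde-Buzo-Gray / k-means.

    1. Initialization of codebook vectors (here simply the first k points)
    2. Assignment: each data vector goes to its nearest codeword
    3. Recalculation: each codeword becomes the mean of its cell
    4. Iteration (back to 2.) until convergence
    The sphere variant would add a projection step after recalculation.
    """
    codebook = data[:k].astype(float).copy()
    labels = np.zeros(len(data), dtype=int)
    for _ in range(iterations):
        # assignment step: nearest codeword for every data vector
        dists = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # recalculation step: centroid of each Voronoi cell
        new_codebook = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else codebook[j]
            for j in range(k)
        ])
        if np.allclose(new_codebook, codebook):
            break
        codebook = new_codebook
    return codebook, labels
```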
Index-based Matching: Codebook using musical knowledge
Observation: chroma features capture harmonic information (example templates: C major, C# major).
Experiments show that for more than 95% of all chroma features, more than 50% of the energy lies in at most 4 components.
Idea: choose the codebook to contain the n-chord templates for n = 1, 2, 3, 4:

  n            1    2    3    4    total
  # templates  12   66   220  495  793
Index-based Matching: Codebook using musical knowledge
Additional consideration of harmonics in the chord templates. Example: 1-chord C.

  Harmonic        1    2    3    4    5    6
  Pitch           C3   C4   G4   C5   E5   G5
  Frequency (Hz)  131  262  392  523  654  785
  Chroma          C    C    G    C    E    G

Replace the pure chord template by a template with suitable weights for the harmonics.

Index-based Matching: Quantization
Original chromagram and its projections onto the codebooks: original, LBG-based, model-based.
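Folding harmonics into a template, following the C example above, might look like this. The geometric weight decay is an assumption (the slide only says "suitable weights"); the harmonic-to-chroma mapping reproduces the table (C, C, G, C, E, G).

```python
import numpy as np

CHROMA = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def harmonic_template(root, n_harmonics=6, decay=0.6):
    """12-dim chroma template for a single pitch including harmonics.

    The h-th harmonic lies 12*log2(h) semitones above the fundamental
    (C3 -> C3, C4, G4, C5, E5, G5 for h = 1..6); its weight decays
    geometrically with h (decay factor is an illustrative choice).
    """
    template = np.zeros(12)
    root_idx = CHROMA.index(root)
    for h in range(1, n_harmonics + 1):
        semitone_offset = int(round(12 * np.log2(h)))
        template[(root_idx + semitone_offset) % 12] += decay ** (h - 1)
    return template / np.linalg.norm(template)

t = harmonic_template('C')  # energy concentrated on chroma C, G, E
```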
Index-based Matching: Query and retrieval stage
- A query consists of a short audio clip (10-40 seconds)
- Specification of a fault-tolerance setting: fuzziness of the query, number of admissible mismatches, tolerance to tempo variations, tolerance to modulations

Retrieval results
- Medium-sized database: 500 pieces, 112 hours of audio, mostly classical music
- Selection of various queries: 36 queries with durations between 10 and 40 seconds and hand-labelled matches in the database
- Indexing leads to a speed-up factor between 15 and 20 (depending on query length)
- Only small degradation in precision and recall
Index-based Matching: Retrieval results
Average precision vs. average recall, compared for three settings: no index, LBG-based index, model-based index.

Conclusions (Index-based Matching)
- The described method is suitable for medium-sized databases: the index is assumed to reside in main memory, and the inverted lists may become long.
- The goal was to find all meaningful matches, so a high degree of fault tolerance is required (fuzziness, mismatches); the number of list intersections and unions may explode.
- What to do when dealing with millions of songs? Can the quantization be avoided?
- Better indexing and retrieval methods are needed: kd-trees, locality sensitive hashing, …
Conclusions (Audio Matching)
- Strategy: absorb variations at the feature level. Chroma yields invariance to timbre, normalization yields invariance to dynamics, smoothing yields invariance to local time deviations.
- Strategy: global matching with exact matching of multiple scaled queries. Tempo variations are simulated by feature resampling; different queries correspond to different tempi; indexing is possible.
- Strategy: dynamic time warping (subsequence variant). More flexible, in particular for longer queries, but indexing is hard.
Application: Audio Matching
Overview (Audio Retrieval)
- Audio identification (audio fingerprinting)
- Audio matching
- Cover song identification: Gómez/Herrera (ISMIR 2006), Casey/Slaney (ISMIR 2006), Serrà (ISMIR 2007), Ellis/Poliner (ICASSP 2007), Serrà/Gómez/Herrera/Serra (IEEE TASLP 2008)
Cover Song Identification
Goal: given a music recording of a song or piece of music, find all corresponding music recordings within a huge collection that can be regarded as a version, interpretation, or cover song: live versions, versions adapted to a particular country/region/language, contemporary versions of an old song, radically different interpretations of a musical piece. This is an instance of document-based retrieval!
Motivation:
- Automated organization of music collections ("Find me all covers of …")
- Musical rights management
- Learning about music itself ("understanding the essence of a song")
Cover Song Identification
Nearly anything can change, but something doesn't: often it is the chord progression and/or the melody that is preserved. Examples of what may differ between versions:
- Key: Bob Dylan vs. Avril Lavigne, Knockin' on Heaven's Door
- Timbre: Metallica vs. Apocalyptica, Enter Sandman
- Tempo: Nirvana, Polly (Incesticide album) vs. Polly (Unplugged)
- Lyrics: Black Sabbath, Paranoid vs. Cindy & Bert, Der Hund von Baskerville
- Recording conditions and song structure: AC/DC, High Voltage vs. High Voltage (live)
[Serrà et al., IEEE-TASLP 2009]
Cover Song Identification
How to compare two different songs? Pipeline for a pair of songs A and B:
chroma sequence (song A) and chroma sequence (song B) → optimal transposition → binary similarity matrix → dynamic programming / local alignment → similarity score
Steps:
- Feature computation
- Dealing with different keys
- Local similarity measure
- Global similarity measure
[Serrà et al., IEEE-TASLP 2009]
Cover Song Identification
Feature computation (example: Bob Dylan vs. Avril Lavigne, Knockin' on Heaven's Door):
- Chroma features correlate with the harmonic progression and are robust to changes in timbre and instrumentation; normalization introduces invariance to dynamics.
- Enhancement strategies: a model for considering harmonics, compensation of tuning differences, finer resolution (1, 1/2, 1/3 semitone resolution → 12/24/36-dimensional chroma features) [Gómez, PhD 2006]
Dealing with different keys:
- Compute average chroma vectors for each song
- Consider cyclic shifts of the chroma vectors to simulate transpositions
- Determine optimal shift indices so that the shifted chroma vectors are matched with minimal cost
- Transpose the songs accordingly
Cyclic Chroma Shifts
Given two chroma vectors x and y in the 12-dimensional feature space:
- Fix a local cost measure c.
- Let σ denote the cyclic shift operator on the 12 chroma bins; shifts compose, and σ^12 is the identity.
- Compute the cost between x and the shifted σ^i(y) for every shift index i = 0, …, 11.
- The minimizing shift index (i = 3 in the running example) determines the optimal transposition.
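The shift search above can be sketched in a few lines. The normalized inner-product cost is one reasonable choice of local cost measure, not necessarily the one used in the paper.

```python
import numpy as np

def optimal_transposition_index(x, y):
    """Find the cyclic chroma shift of y that best matches x.

    x, y: 12-dimensional (e.g. averaged) chroma vectors.
    sigma^i rotates the chroma bins by i semitones; the cost here is
    1 - normalized inner product.
    Returns (best shift index, array of costs for i = 0..11).
    """
    x = x / np.linalg.norm(x)
    y = y / np.linalg.norm(y)
    costs = np.array([1.0 - np.dot(x, np.roll(y, i)) for i in range(12)])
    return int(np.argmin(costs)), costs
```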
Cyclic Chroma Shifts
What is a good local cost measure for chroma space? Euclidean distance? Cosine distance?
Is the chroma space Euclidean? Probably not! For example, C is musically closer to G than to C#.
Idea: use a very coarse binary cost measure that indicates whether two chroma vectors share the same tonal root.
[Serrà et al., IEEE-TASLP 2009]
Cyclic Chroma Shifts
From the original local cost measure c one obtains a cost matrix based on c. Thresholding c yields a binary cost measure c_b (cost 0 for matching entries, 1 otherwise) and a binary cost matrix based on c_b.
Think positive: converting costs into scores turns the binary cost matrix into a binary similarity matrix with entries +1 (match) and -1 (mismatch) between songs A and B.
[Serrà et al., IEEE-TASLP 2009]

Local Alignment
Assumption: two songs are considered similar if they contain possibly long subsegments that possess a similar harmonic progression.
Task: let X = (x_1, …, x_N) and Y = (y_1, …, y_M) be the two chroma sequences of the two given songs, and let S be the resulting similarity matrix. Find the maximum similarity of a subsequence of X and a subsequence of Y.
[Serrà et al., IEEE-TASLP 2009]
Local Alignment
- Classical DTW: global correspondence between X and Y.
- Subsequence DTW: a subsequence of Y corresponds to all of X.
- Local alignment: a subsequence of Y corresponds to a subsequence of X.
Note: this problem is also known from bioinformatics. The Smith-Waterman algorithm is a well-known algorithm for performing local sequence alignment; that is, for determining similar regions between two nucleotide or protein sequences.
Strategy: we use a variant of the Smith-Waterman algorithm.
Local Alignment
Computation of the accumulated score matrix D from the given binary similarity (score) matrix S:
- The zero entry allows for starting a new alignment at any cell without penalty.
- A gap penalty g penalizes "inserts" and "deletes" in the alignment.
- The best local alignment score is the highest value in D.
- The best local alignment ends at the cell of highest value; its start is obtained by backtracking to the first cell of value zero.
Example: Knockin' on Heaven's Door; binary similarity matrix (entries +1/-1) between the Bob Dylan version and the Guns and Roses version.
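A minimal Smith-Waterman-style scoring on a ±1 similarity matrix, with a single gap penalty g as on the slides, can be sketched as follows. The recursion shown is the textbook variant, not necessarily Serrà's exact formulation.

```python
import numpy as np

def local_alignment_score(S, gap=0.5):
    """Accumulated score matrix D for local alignment.

    S: similarity matrix with entries +1 (match) / -1 (mismatch).
    D(n, m) = max(0,                      # start a new alignment
                  D(n-1, m-1) + S(n, m),  # diagonal step
                  D(n-1, m) - gap,        # "delete"
                  D(n, m-1) - gap)        # "insert"
    Returns (best local alignment score, D). The alignment ends at the
    maximal cell and starts at the first zero cell reached by
    backtracking.
    """
    N, M = S.shape
    D = np.zeros((N + 1, M + 1))
    for n in range(1, N + 1):
        for m in range(1, M + 1):
            D[n, m] = max(0.0,
                          D[n - 1, m - 1] + S[n - 1, m - 1],
                          D[n - 1, m] - gap,
                          D[n, m - 1] - gap)
    return float(D.max()), D
```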
Local Alignment
Example: Knockin' on Heaven's Door (Bob Dylan vs. Guns and Roses). In the accumulated score matrix, the cell with the maximal score (94.2) marks the end of the best local alignment; backtracking from this cell yields the alignment path of maximal score.
Cover Song Identification
Query: Bob Dylan – Knockin' on Heaven's Door
Retrieval result:

  Rank   Recording                                   Score
  1.     Guns and Roses: Knockin' On Heaven's Door   94.2
  2.     Avril Lavigne: Knockin' On Heaven's Door    86.6
  3.     Wyclef Jean: Knockin' On Heaven's Door      83.8
  4.     Bob Dylan: Not For You                      65.4
  5.     Guns and Roses: Patience                    61.8
  6.     Bob Dylan: Like A Rolling Stone             57.2
  7.-14. …

The matching subsequences of the top hits are found by the local alignment.
Cover Song Identification
Query: AC/DC – Highway To Hell
Retrieval result:

  Rank    Recording                                    Score
  1.      AC/DC: Hard As a Rock                        79.2
  2.      Hayseed Dixie: Dirty Deeds Done Dirt Cheap   72.9
  3.      AC/DC: Let There Be Rock                     69.6
  4.      AC/DC: TNT (Live)                            65.0
  5.-11.  …
  12.     Hayseed Dixie: Highway To Hell               30.4
  13.     AC/DC: Highway To Hell (live)                21.0
  14.     …

Conclusions (Cover Song Identification)
- Harmony-based approach.
- The binary cost measure is a good trade-off between robustness and expressiveness.
- The measure is suitable for document retrieval, but seems to be too coarse for audio matching applications.
- Every song has to be compared with every other song → the method does not scale to large data collections.
- What are suitable indexing methods?
Conclusions (Audio Retrieval)

  Retrieval task    Audio identification        Audio matching               Cover song identification
  Identification    Concrete audio recording    Different interpretations    Different versions
  Query             Short fragment (5-10 s)     Audio clip (10-40 s)         Entire song
  Retrieval level   Subsequence                 Subsequence                  Document
  Features          Spectral peaks (abstract)   Chroma (harmony)             Chroma (harmony)
  Indexing          Hashing                     Inverted lists               No indexing