Advanced Course Computer Science: Music Processing, Summer Term 2010
Audio Retrieval
Meinard Müller, Saarland University and MPI Informatik
[email protected]

Overview (Audio Retrieval)
- Audio identification (audio fingerprinting)
- Audio matching
- Cover song identification
Overview (Audio Retrieval)
- Audio identification (audio fingerprinting): Allamanche et al. (AES 2001), Cano et al. (IEEE MMSP 2002), Kurth/Clausen/Ribbrock (AES 2002), Wang (ISMIR 2003), Shrestha/Kalker (ISMIR 2004)
- Audio matching
- Cover song identification
Audio Identification
Shazam application scenario ("The Moment"):
- User hears music playing in the environment: radio (car, home, work), TV and cinema, clubs and bars, cafes, shops, restaurants
- User records a music fragment (5-15 seconds) with a mobile phone
- Audio fingerprints are extracted from the recording and sent to a service
- The server identifies the audio recording based on the fingerprints
- The server sends back metadata (track title, artist) to the user
[Wang, ISMIR 2003]
Audio Identification
Shazam application scenario: target audience.
An audio fingerprint is a content-based compact signature that summarizes a piece of audio content.
Requirements:
- Discriminative power
- Invariance to distortions
- Compactness
- Computational simplicity
[Wang, ISMIR 2003]
Audio Identification
The requirements in more detail:
- Discriminative power: the ability to accurately identify an item within a huge number of other items (informative, high entropy), with a low probability of false positives. The recorded query excerpt is only a few seconds long, while the audio collection on the server side is large (millions of songs).
- Invariance to distortions: the recorded query may be distorted and superimposed with other audio sources. Typical distortions are background noise, pitching (audio played faster or slower), equalization, compression artifacts, and cropping/framing.
- Compactness: reduction of complex multimedia objects and of dimensionality; the size of the fingerprint should be small, making indexing feasible and allowing for fast search.
- Computational simplicity: computational efficiency; the extraction of the fingerprint should be simple.
Matching Fingerprints (Shazam)
- For each database document (audio file), generate reproducible landmarks; each landmark occurs at a time position.
- For each landmark, generate a "fingerprint" that characterizes its location. Do the same for the query fragment.
- Generate a list of matching fingerprints (matches between query and database document). Each match is represented by a pair (t_database, t_query) of time positions.
- A matching segment is characterized by a set M of pairs, each having the same time difference: t_database - t_query = constant for (t_database, t_query) in M.
- False positives have random time differences, so filter out the cruft with a histogram over the time differences; the score is the size of the largest histogram peak.
[Wang, ISMIR 2003]
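The histogram-based scoring described above can be sketched as follows. This is a minimal illustration; the match list and the bin width are made-up examples, not parameters from the paper.

```python
from collections import Counter

def match_score(matches, bin_width=1):
    """Score a candidate document from a list of fingerprint matches.

    Each match is a pair (t_database, t_query) of time positions.
    A true hit produces many matches with the same time difference,
    so the score is the size of the largest histogram bin of
    t_database - t_query; random false positives spread out over
    many bins and contribute little.
    """
    histogram = Counter(
        (t_db - t_q) // bin_width for (t_db, t_q) in matches
    )
    if not histogram:
        return 0, None
    offset, score = max(histogram.items(), key=lambda kv: kv[1])
    return score, offset * bin_width

# Five matches consistent with an offset of 40, plus two random outliers
matches = [(40, 0), (42, 2), (45, 5), (47, 7), (50, 10), (13, 9), (88, 3)]
score, offset = match_score(matches)  # score 5 at offset 40
```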
Matching Fingerprints (Shazam)
Scatter plot of matching hash locations (time in query vs. time in database) and the corresponding histogram of time differences:
- A true match shows up as a diagonal in the scatter plot, i.e., as a clear peak in the histogram of time offsets.
- No peak → no matching segment.
[Wang, ISMIR 2003]
Matching Fingerprints (Shazam)
In the example, the histogram peak reveals a matching segment starting at position 40.
[Wang, ISMIR 2003]

Fingerprints (Shazam)
Step 1: Spectrogram (frequency in Hertz over time in seconds)
- Standard transform
- Efficiently computable
- Robust
[Wang, ISMIR 2003]
Fingerprints (Shazam)
Step 2: Peaks ("constellation map")
- Robust to noise, reverb, and room acoustics
- Tend to survive through voice codecs
Problem: individual peaks have low entropy and are not suitable for indexing.
[Wang, ISMIR 2003]

Fingerprints (Shazam)
Steps 3 and 4: Target zone and pairs of peaks
- Fix an anchor point and define a target zone
- Use pairs of points (anchor point, point in the target zone)
- Use every point as an anchor point
[Wang, ISMIR 2003]
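One simple way to extract a constellation map is a plain local-maximum criterion, sketched below. The neighborhood size and the toy spectrogram are arbitrary illustrative choices; a real system would additionally enforce a target density of peaks per second.

```python
import numpy as np

def constellation_map(spectrogram, neighborhood=1):
    """Return (time, frequency) indices of local spectral peaks.

    A bin counts as a peak if it is the unique maximum of the
    surrounding (2*neighborhood+1)^2 square. Border bins are skipped
    for simplicity.
    """
    n_freq, n_time = spectrogram.shape
    r = neighborhood
    peaks = []
    for f in range(r, n_freq - r):
        for t in range(r, n_time - r):
            patch = spectrogram[f - r:f + r + 1, t - r:t + r + 1]
            m = patch.max()
            # unique maximum of its neighborhood -> a constellation point
            if spectrogram[f, t] == m and np.sum(patch == m) == 1:
                peaks.append((t, f))
    return peaks
```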
Indexing (Shazam)
A hash is formed between an anchor point and each point in its target zone, using the two frequency values and the time difference. The fan-out (taking pairs of peaks) may cause a combinatorial explosion in the number of tokens; however, this can be controlled by the size of the target zone. Using more complex hashes increases specificity (leading to much smaller hash buckets) and speed (making the retrieval much faster).

Definitions:
- N = number of spectral peaks
- p = probability that a spectral peak can be found in the (noisy and distorted) query
- F = fan-out of the target zone, e.g. F = 10
- B = number of bits used to encode a spectral peak and a time difference

Consequences:
- F · N = number of tokens to be indexed (F times as many tokens in query and database, respectively)
- 2^(B+B) = increase of specificity (2^(B+B+B) instead of 2^B)
- p² = probability of a hash to survive
- p · (1 - (1 - p)^F) = probability that at least one hash survives per anchor point

Example (F = 10, B = 10):
- Memory requirements: F · N = 10 · N
- Speedup factor: 2^(B+B) / F² ≈ 10^6 / 10^2 = 10000
[Wang, ISMIR 2003]
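The pairing of an anchor with the points in its target zone can be sketched as below. The target-zone bounds, the fan-out default, and the three 10-bit hash fields are illustrative assumptions, not Wang's exact values; the packing of (f_anchor, f_target, dt) into ~3B bits is what raises specificity from 2^B to about 2^(3B) at the cost of F times as many tokens.

```python
def build_hashes(peaks, fan_out=10, dt_min=1, dt_max=64, df_max=128):
    """Turn a constellation map into (hash, anchor_time) tokens.

    peaks: list of (time, frequency) pairs, sorted by time.
    Each anchor is paired with up to `fan_out` later peaks inside its
    target zone (time difference in [dt_min, dt_max], frequency
    difference at most df_max).
    """
    hashes = []
    for i, (t1, f1) in enumerate(peaks):
        paired = 0
        for t2, f2 in peaks[i + 1:]:
            dt = t2 - t1
            if dt < dt_min:
                continue
            if dt > dt_max or paired >= fan_out:
                break  # peaks are time-sorted, so no later peak qualifies
            if abs(f2 - f1) <= df_max:
                # pack (f1, f2, dt) into three ~10-bit fields
                h = (f1 << 20) | (f2 << 10) | dt
                hashes.append((h, t1))
                paired += 1
    return hashes
```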
Results (Shazam)
- Test dataset of 10000 tracks
- Search time: 5 to 500 milliseconds
- Recognition rate as a function of the signal/noise ratio (dB), shown for query lengths of 15, 10, and 5 seconds: longer queries remain recognizable at lower SNR.
[Wang, ISMIR 2003]

Conclusions (Shazam)
Many parameters to choose:
- Temporal and spectral resolution in the spectrogram
- Peak picking strategy
- Target zone and fan-out parameter
- Hash function
- …
[Wang, ISMIR 2003]

Conclusions (Audio Identification)
- Identifies an audio recording (not a piece of music)
- Highly robust to noise, artifacts, deformations
- May even handle superimposed recordings
- Does not allow identifying studio recordings from queries taken from live recordings
- Does not generalize to identifying different interpretations of the same piece of music

Overview (Audio Retrieval)
- Audio identification (audio fingerprinting)
- Audio matching
- Cover song identification
Audio Matching
References: Pickens et al. (ISMIR 2002), Müller/Kurth/Clausen (ISMIR 2005), Suyoto et al. (IEEE TASLP 2008), Kurth/Müller (IEEE TASLP 2008)
Various interpretations of Beethoven's Fifth: Bernstein, Karajan, Scherbakov (piano), MIDI (piano)
Audio Matching
Given: a large music database containing several recordings of the same piece of music, interpretations by various musicians, and arrangements in different instrumentations.
Goal: given a short query audio clip, identify all corresponding audio clips of similar musical content, irrespective of the specific interpretation and instrumentation, automatically and efficiently (query-by-example paradigm).
General strategy:
- Normalized and smoothed chroma features: correlate with the harmonic progression; robust to variations in dynamics, timbre, articulation, and local tempo.
- Robust matching procedure: efficient, robust to global tempo variations, scalable using an index structure.
[Müller et al., ISMIR 2005]
Feature Design
Beethoven's Fifth: Bernstein.
Processing chain: audio signal → subband decomposition (88 bands) → chroma energy distribution (12 bands) → quantization and statistics → convolution (smoothing), normalization, downsampling → CENS features.
Two stages:
- Stage 1: local chroma energy distribution features
- Stage 2: normalized short-time statistics
CENS = Chroma Energy Normalized Statistics.
Example settings: resolution of 10 features/second with a feature window size of 200 milliseconds, or 1 feature/second with a window size of 4000 milliseconds. The coarser resolution smooths out local differences, as seen when comparing Bernstein vs. Sawallisch.

Matching Procedure
- Compute CENS feature sequences for the query and the database.
- Evaluate a global distance function by comparing the query feature sequence against every database subsequence.
Query: Beethoven's Fifth / Bernstein, first 20 seconds.
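A sliding-window version of the global distance function might be sketched as follows. The averaged inner-product distance is one reasonable choice for unit-norm chroma vectors, not necessarily the paper's exact measure.

```python
import numpy as np

def matching_curve(query, database):
    """Distance between the query and every database subsequence.

    query: (L, 12) CENS-like feature sequence, rows normalized.
    database: (M, 12) feature sequence, M >= L.
    Returns an array of length M - L + 1; local minima indicate matches.
    1 - mean inner product behaves like an averaged cosine distance
    for unit-norm chroma vectors.
    """
    L = len(query)
    dists = np.empty(len(database) - L + 1)
    for i in range(len(dists)):
        window = database[i:i + L]
        dists[i] = 1.0 - np.mean(np.sum(query * window, axis=1))
    return dists
```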
Matching Procedure
Query: Beethoven's Fifth / Bernstein, first 20 seconds.
The local minima of the distance function yield the best audio matches; the example retrieves matches 1 through 6 in order of increasing distance.
Global Tempo Variations
Query: Beethoven's Fifth / Bernstein, first 20 seconds.
Problem: Karajan's interpretation is much faster, so the fixed-tempo distance function is useless for it.
Solution: make the Bernstein query faster (resample its feature sequence) and compute a new distance function; do this for various tempi and minimize over all resulting distance functions.

Experiments
- Audio database: > 110 hours, 16.5 GB
- Preprocessing: CENS features, 40.3 MB
- Query clip: 20 seconds
- Query response time: < 10 seconds
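The multiple-tempo strategy above, simulating global tempo changes by resampling the query feature sequence and minimizing over the resulting distance curves, might look like this. The tempo set and the distance measure are illustrative assumptions.

```python
import numpy as np

def resample_features(seq, factor):
    """Simulate a tempo change by index-resampling a feature sequence.

    factor > 1 shortens the sequence (faster tempo), factor < 1
    lengthens it.
    """
    n = max(1, int(round(len(seq) / factor)))
    idx = np.minimum((np.arange(n) * factor).astype(int), len(seq) - 1)
    return seq[idx]

def multi_tempo_curve(query, database, tempi=(0.66, 0.8, 1.0, 1.25, 1.5)):
    """Minimize the matching curve over several tempo-scaled queries."""
    curves = []
    for f in tempi:
        q = resample_features(query, f)
        L = len(q)
        d = [1.0 - np.mean(np.sum(q * database[i:i + L], axis=1))
             for i in range(len(database) - L + 1)]
        curves.append(d)
    m = min(len(c) for c in curves)      # align curve lengths
    return np.min([c[:m] for c in curves], axis=0)
```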
Experiments
Example queries: Beethoven's Fifth / Bernstein, first 20 seconds; Shostakovich Waltz / Chailly, first 27 and first 21 seconds.

Index-based Matching
Indexing stage (quantization of the feature space):
- Convert the database into a feature sequence (chroma/CENS)
- Quantize the features with respect to a fixed codebook
- Create an inverted file index: it contains an inverted list for each codebook vector; each list contains feature indices in ascending order
[Kurth/Müller, IEEE-TASLP 2008]
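The inverted file index and an exact-match lookup can be sketched as below. Real retrieval adds fault tolerance (fuzziness, admissible mismatches); the strict list intersection here is the simplest case.

```python
from collections import defaultdict

def build_inverted_index(quantized_features):
    """Inverted file index: codebook id -> ascending list of positions.

    quantized_features: sequence of codebook indices, one per feature
    frame of the database (obtained by nearest-neighbor quantization).
    """
    index = defaultdict(list)
    for position, codeword in enumerate(quantized_features):
        index[codeword].append(position)  # positions arrive in order
    return index

def candidate_matches(index, quantized_query):
    """Positions where the whole quantized query could start.

    Intersects the (shifted) inverted lists of the query's codewords;
    a fault-tolerant variant would replace the intersection by a
    softer union/counting scheme.
    """
    candidates = None
    for offset, codeword in enumerate(quantized_query):
        starts = {p - offset for p in index.get(codeword, [])}
        candidates = starts if candidates is None else candidates & starts
        if not candidates:
            return set()
    return candidates
```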
Index-based Matching
How to derive a good codebook?
- Codebook selection by unsupervised learning: the Linde-Buzo-Gray (LBG) algorithm, similar to k-means, adjusted to spheres of suitable size R
- Codebook selection based on musical knowledge
Quantization is then performed using nearest neighbors.
Index-based Matching: LBG algorithm
Steps:
1. Initialization of codebook vectors
2. Assignment
3. Recalculation
4. Iteration (back to 2.) until convergence

For the sphere variant (2D example), each iteration consists of assignment, recalculation, and projection onto the sphere.
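The four steps above can be sketched as a plain LBG/k-means iteration. This omits the sphere-projection refinement; the first-k-points initialization is a simplifying assumption (random or split-based initialization is common).

```python
import numpy as np

def lbg(data, k, iterations=50):
    """Codebook training in the spirit of Linde-Buzo-Gray / k-means.

    1. Initialization of codebook vectors (here simply the first k points)
    2. Assignment: each data vector goes to its nearest codeword
    3. Recalculation: each codeword becomes the mean of its cell
    4. Iteration (back to 2.) until convergence
    The sphere variant would add a projection step after recalculation.
    """
    codebook = data[:k].astype(float).copy()
    labels = np.zeros(len(data), dtype=int)
    for _ in range(iterations):
        # assignment step: nearest codeword for every data vector
        dists = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # recalculation step: centroid of each Voronoi cell
        new_codebook = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else codebook[j]
            for j in range(k)
        ])
        if np.allclose(new_codebook, codebook):
            break
        codebook = new_codebook
    return codebook, labels
```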
Index-based Matching: Codebook using musical knowledge
Observation: chroma features capture harmonic information (example templates: C major, C# major).
Experiments show that for more than 95% of all chroma features, more than 50% of the energy lies in at most 4 components.
Idea: choose the codebook to contain the n-chord templates for n = 1, 2, 3, 4:

  n            1    2    3    4    total
  # templates  12   66   220  495  793
Index-based Matching: Codebook using musical knowledge
Additional consideration of harmonics in the chord templates. Example: 1-chord C.

  Harmonic        1    2    3    4    5    6
  Pitch           C3   C4   G4   C5   E5   G5
  Frequency (Hz)  131  262  392  523  654  785
  Chroma          C    C    G    C    E    G

Replace the pure chord template by a template with suitable weights for the harmonics.

Index-based Matching: Quantization
Original chromagram and its projections onto the codebooks: original, LBG-based, model-based.
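Folding harmonics into a template, following the C example above, might look like this. The geometric weight decay is an assumption (the slide only says "suitable weights"); the harmonic-to-chroma mapping reproduces the table (C, C, G, C, E, G).

```python
import numpy as np

CHROMA = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def harmonic_template(root, n_harmonics=6, decay=0.6):
    """12-dim chroma template for a single pitch including harmonics.

    The h-th harmonic lies 12*log2(h) semitones above the fundamental
    (C3 -> C3, C4, G4, C5, E5, G5 for h = 1..6); its weight decays
    geometrically with h (decay factor is an illustrative choice).
    """
    template = np.zeros(12)
    root_idx = CHROMA.index(root)
    for h in range(1, n_harmonics + 1):
        semitone_offset = int(round(12 * np.log2(h)))
        template[(root_idx + semitone_offset) % 12] += decay ** (h - 1)
    return template / np.linalg.norm(template)

t = harmonic_template('C')  # energy concentrated on chroma C, G, E
```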
Index-based Matching: Query and retrieval stage
- A query consists of a short audio clip (10-40 seconds)
- Specification of a fault-tolerance setting: fuzziness of the query, number of admissible mismatches, tolerance to tempo variations, tolerance to modulations

Retrieval results
- Medium-sized database: 500 pieces, 112 hours of audio, mostly classical music
- Selection of various queries: 36 queries with durations between 10 and 40 seconds and hand-labelled matches in the database
- Indexing leads to a speed-up factor between 15 and 20 (depending on query length)
- Only small degradation in precision and recall
Index-based Matching: Retrieval results
Average precision vs. average recall, compared for three settings: no index, LBG-based index, model-based index.

Conclusions (Index-based Matching)
- The described method is suitable for medium-sized databases: the index is assumed to reside in main memory, and the inverted lists may become long.
- The goal was to find all meaningful matches, so a high degree of fault tolerance is required (fuzziness, mismatches); the number of list intersections and unions may explode.
- What to do when dealing with millions of songs? Can the quantization be avoided?
- Better indexing and retrieval methods are needed: kd-trees, locality sensitive hashing, …
Conclusions (Audio Matching)
- Strategy: absorb variations at the feature level. Chroma yields invariance to timbre, normalization yields invariance to dynamics, smoothing yields invariance to local time deviations.
- Strategy: global matching with exact matching of multiple scaled queries. Tempo variations are simulated by feature resampling; different queries correspond to different tempi; indexing is possible.
- Strategy: dynamic time warping (subsequence variant). More flexible, in particular for longer queries, but indexing is hard.
Application: Audio Matching
Overview (Audio Retrieval)
- Audio identification (audio fingerprinting)
- Audio matching
- Cover song identification: Gómez/Herrera (ISMIR 2006), Casey/Slaney (ISMIR 2006), Serrà (ISMIR 2007), Ellis/Poliner (ICASSP 2007), Serrà/Gómez/Herrera/Serra (IEEE TASLP 2008)
Cover Song Identification
Goal: given a music recording of a song or piece of music, find all corresponding music recordings within a huge collection that can be regarded as a version, interpretation, or cover song: live versions, versions adapted to a particular country/region/language, contemporary versions of an old song, radically different interpretations of a musical piece. This is an instance of document-based retrieval!
Motivation:
- Automated organization of music collections ("Find me all covers of …")
- Musical rights management
- Learning about music itself ("understanding the essence of a song")
Cover Song Identification
Nearly anything can change, but something doesn't: often it is the chord progression and/or the melody that is preserved. Examples of what may differ between versions:
- Key: Bob Dylan vs. Avril Lavigne, Knockin' on Heaven's Door
- Timbre: Metallica vs. Apocalyptica, Enter Sandman
- Tempo: Nirvana, Polly (Incesticide album) vs. Polly (Unplugged)
- Lyrics: Black Sabbath, Paranoid vs. Cindy & Bert, Der Hund von Baskerville
- Recording conditions and song structure: AC/DC, High Voltage vs. High Voltage (live)
[Serrà et al., IEEE-TASLP 2009]
Cover Song Identification
How to compare two different songs? Pipeline for a pair of songs A and B:
chroma sequence (song A) and chroma sequence (song B) → optimal transposition → binary similarity matrix → dynamic programming / local alignment → similarity score
Steps:
- Feature computation
- Dealing with different keys
- Local similarity measure
- Global similarity measure
[Serrà et al., IEEE-TASLP 2009]
Cover Song Identification
Feature computation (example: Bob Dylan vs. Avril Lavigne, Knockin' on Heaven's Door):
- Chroma features correlate with the harmonic progression and are robust to changes in timbre and instrumentation; normalization introduces invariance to dynamics.
- Enhancement strategies: a model for considering harmonics, compensation of tuning differences, finer resolution (1, 1/2, 1/3 semitone resolution → 12/24/36-dimensional chroma features) [Gómez, PhD 2006]
Dealing with different keys:
- Compute average chroma vectors for each song
- Consider cyclic shifts of the chroma vectors to simulate transpositions
- Determine optimal shift indices so that the shifted chroma vectors are matched with minimal cost
- Transpose the songs accordingly
Cyclic Chroma Shifts
Given two chroma vectors x and y in the 12-dimensional feature space:
- Fix a local cost measure c.
- Let σ denote the cyclic shift operator on the 12 chroma bins; shifts compose, and σ^12 is the identity.
- Compute the cost between x and the shifted σ^i(y) for every shift index i = 0, …, 11.
- The minimizing shift index (i = 3 in the running example) determines the optimal transposition.
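The shift search above can be sketched in a few lines. The normalized inner-product cost is one reasonable choice of local cost measure, not necessarily the one used in the paper.

```python
import numpy as np

def optimal_transposition_index(x, y):
    """Find the cyclic chroma shift of y that best matches x.

    x, y: 12-dimensional (e.g. averaged) chroma vectors.
    sigma^i rotates the chroma bins by i semitones; the cost here is
    1 - normalized inner product.
    Returns (best shift index, array of costs for i = 0..11).
    """
    x = x / np.linalg.norm(x)
    y = y / np.linalg.norm(y)
    costs = np.array([1.0 - np.dot(x, np.roll(y, i)) for i in range(12)])
    return int(np.argmin(costs)), costs
```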
Cyclic Chroma Shifts
What is a good local cost measure for chroma space? Euclidean distance? Cosine distance?
Is the chroma space Euclidean? Probably not! For example, C is musically closer to G than to C#.
Idea: use a very coarse binary cost measure that indicates whether two chroma vectors share the same tonal root.
[Serrà et al., IEEE-TASLP 2009]
Cyclic Chroma Shifts
From the original local cost measure c one obtains a cost matrix based on c. Thresholding c yields a binary cost measure c_b (cost 0 for matching entries, 1 otherwise) and a binary cost matrix based on c_b.
Think positive: converting costs into scores turns the binary cost matrix into a binary similarity matrix with entries +1 (match) and -1 (mismatch) between songs A and B.
[Serrà et al., IEEE-TASLP 2009]

Local Alignment
Assumption: two songs are considered similar if they contain possibly long subsegments that possess a similar harmonic progression.
Task: let X = (x_1, …, x_N) and Y = (y_1, …, y_M) be the two chroma sequences of the two given songs, and let S be the resulting similarity matrix. Find the maximum similarity of a subsequence of X and a subsequence of Y.
[Serrà et al., IEEE-TASLP 2009]
Local Alignment
- Classical DTW: global correspondence between X and Y.
- Subsequence DTW: a subsequence of Y corresponds to all of X.
- Local alignment: a subsequence of Y corresponds to a subsequence of X.
Note: this problem is also known from bioinformatics. The Smith-Waterman algorithm is a well-known algorithm for performing local sequence alignment; that is, for determining similar regions between two nucleotide or protein sequences.
Strategy: we use a variant of the Smith-Waterman algorithm.
Local Alignment
Computation of the accumulated score matrix D from the given binary similarity (score) matrix S:
- The zero entry allows for starting a new alignment at any cell without penalty.
- A gap penalty g penalizes "inserts" and "deletes" in the alignment.
- The best local alignment score is the highest value in D.
- The best local alignment ends at the cell of highest value; its start is obtained by backtracking to the first cell of value zero.
Example: Knockin' on Heaven's Door; binary similarity matrix (entries +1/-1) between the Bob Dylan version and the Guns and Roses version.
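A minimal Smith-Waterman-style scoring on a ±1 similarity matrix, with a single gap penalty g as on the slides, can be sketched as follows. The recursion shown is the textbook variant, not necessarily Serrà's exact formulation.

```python
import numpy as np

def local_alignment_score(S, gap=0.5):
    """Accumulated score matrix D for local alignment.

    S: similarity matrix with entries +1 (match) / -1 (mismatch).
    D(n, m) = max(0,                      # start a new alignment
                  D(n-1, m-1) + S(n, m),  # diagonal step
                  D(n-1, m) - gap,        # "delete"
                  D(n, m-1) - gap)        # "insert"
    Returns (best local alignment score, D). The alignment ends at the
    maximal cell and starts at the first zero cell reached by
    backtracking.
    """
    N, M = S.shape
    D = np.zeros((N + 1, M + 1))
    for n in range(1, N + 1):
        for m in range(1, M + 1):
            D[n, m] = max(0.0,
                          D[n - 1, m - 1] + S[n - 1, m - 1],
                          D[n - 1, m] - gap,
                          D[n, m - 1] - gap)
    return float(D.max()), D
```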
Local Alignment
Example: Knockin' on Heaven's Door (Bob Dylan vs. Guns and Roses). In the accumulated score matrix, the cell with the maximal score (94.2) marks the end of the best local alignment; backtracking from this cell yields the alignment path of maximal score.
Cover Song Identification
Query: Bob Dylan – Knockin' on Heaven's Door
Retrieval result:

  Rank   Recording                                   Score
  1.     Guns and Roses: Knockin' On Heaven's Door   94.2
  2.     Avril Lavigne: Knockin' On Heaven's Door    86.6
  3.     Wyclef Jean: Knockin' On Heaven's Door      83.8
  4.     Bob Dylan: Not For You                      65.4
  5.     Guns and Roses: Patience                    61.8
  6.     Bob Dylan: Like A Rolling Stone             57.2
  7.-14. …

The matching subsequences of the top hits are found by the local alignment.
Cover Song Identification
Query: AC/DC – Highway To Hell
Retrieval result:

  Rank    Recording                                    Score
  1.      AC/DC: Hard As a Rock                        79.2
  2.      Hayseed Dixie: Dirty Deeds Done Dirt Cheap   72.9
  3.      AC/DC: Let There Be Rock                     69.6
  4.      AC/DC: TNT (Live)                            65.0
  5.-11.  …
  12.     Hayseed Dixie: Highway To Hell               30.4
  13.     AC/DC: Highway To Hell (live)                21.0
  14.     …

Conclusions (Cover Song Identification)
- Harmony-based approach.
- The binary cost measure is a good trade-off between robustness and expressiveness.
- The measure is suitable for document retrieval, but seems to be too coarse for audio matching applications.
- Every song has to be compared with every other song → the method does not scale to large data collections.
- What are suitable indexing methods?
Conclusions (Audio Retrieval)

  Retrieval task    Audio identification        Audio matching               Cover song identification
  Identification    Concrete audio recording    Different interpretations    Different versions
  Query             Short fragment (5-10 s)     Audio clip (10-40 s)         Entire song
  Retrieval level   Subsequence                 Subsequence                  Document
  Features          Spectral peaks (abstract)   Chroma (harmony)             Chroma (harmony)
  Indexing          Hashing                     Inverted lists               No indexing