
Tutorial: Audio Content-Based Music Retrieval

Meinard Müller
Saarland University and MPI Informatik
[email protected]
 2007 Habilitation, Bonn, Germany
 2007 MPI Informatik, Saarland, Germany (Senior Researcher)
 Multimedia Information Retrieval & Music Processing
 6 PhD students

Joan Serrà
Artificial Intelligence Research Institute
Spanish National Research Council
[email protected]
 2007-2011 Ph.D., Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain
 Postdoctoral Researcher, Artificial Intelligence Research Institute, Spanish National Research Council
 Music information retrieval, machine learning, time series analysis, complex networks, ...

Music Retrieval

 Textual metadata
 – Traditional retrieval
 – Searching for artist, title, ...
 Rich and expressive metadata
 – Generated by experts
 – Crowd tagging, social networks
 Content-based retrieval
 – Automatic generation of tags
 – Query-by-example

Query-by-Example

Query: Beethoven, Symphony No. 5 / Bernstein (1962)

Hits:
 Beethoven, Symphony No. 5: Bernstein (1962), Karajan (1982), Gould (1992)
 Beethoven, Symphony No. 9
 Beethoven, Symphony No. 3
 Haydn, Symphony No. 94

Taxonomy

Retrieval tasks, ordered by specificity level (high → low) and granularity level (fragment-based → document-based retrieval):
 Audio identification (high specificity, fragment-based retrieval)
 Audio matching
 Version identification
 Category-based music retrieval (low specificity, document-based retrieval)

Overview

Part I: Audio identification
Part II: Audio matching
Coffee Break
Part III: Version identification
Part IV: Category-based music retrieval


Audio Identification

Database: Huge collection consisting of all audio recordings (feature representations) to be potentially identified.

Goal: Given a short query audio fragment, identify the original audio recording the query is taken from.

Notes:
 Instance of fragment-based retrieval
 High specificity
 Not the piece of music is identified, but a specific rendition of the piece

Application Scenario

 User hears music playing in the environment
 User records a music fragment (5-15 seconds) with a mobile phone
 Audio fingerprints are extracted from the recording and sent to an audio identification service
 Service identifies the audio recording based on the fingerprints
 Service sends back metadata (track title, artist) to the user

Audio Fingerprints

An audio fingerprint is a content-based compact signature that summarizes a piece of audio content.

Requirements:
 Discriminative power (informative, characteristic)
 – Ability to accurately identify an item within a huge number of other items
 – Low probability of false positives
 Invariance to distortions
 – Recorded query may be distorted and superimposed with other audio sources: background noise, pitching (audio played faster or slower), equalization, compression artifacts, cropping, framing, ...
 Compactness
 – Recorded query excerpt is only a few seconds long
 – Large audio collection on the server side (millions of songs)
 – Reduction of complex multimedia objects, reduction of dimensionality, making indexing feasible
 Computational simplicity
 – Extraction of the fingerprint should be simple
 – Size of the fingerprints should be small
 – Allowing for fast search

Literature (Audio Identification)

 Allamanche et al. (AES 2001)
 Cano et al. (AES 2002)
 Haitsma/Kalker (ISMIR 2002)
 Kurth/Clausen/Ribbrock (AES 2002)
 Wang (ISMIR 2003)
 Dupraz/Richard (ICASSP 2010)
 Ramona/Peeters (ICASSP 2011)

Fingerprints (Shazam)

Steps:
1. Spectrogram
 – Standard transform, efficiently computable, robust
2. Peak picking (local maxima of the spectrogram)

[Figure: spectrogram with extracted peaks; time (seconds) vs. frequency (Hz), color = intensity]

Robustness of the peaks:
 Noise, reverb, room acoustics, equalization
 Audio codec
 Superposition of other audio sources (some peaks differ, but many survive)
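The two steps above (spectrogram, then local-maximum peak picking) can be sketched as follows. This is a minimal illustration using NumPy/SciPy, not the actual Shazam implementation; the window size, neighborhood size, and the mean-energy threshold are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def constellation_map(signal, sr, win=1024, hop=512, nbhd=15):
    """Compute a magnitude spectrogram and keep only local maxima (peaks)."""
    # 1. Magnitude spectrogram via windowed FFT frames
    frames = [signal[i:i + win] * np.hanning(win)
              for i in range(0, len(signal) - win, hop)]
    spec = np.abs(np.fft.rfft(np.array(frames), axis=1)).T  # (freq, time)
    # 2. A point is a peak if it equals the maximum of its neighborhood
    #    (and lies above a simple global energy threshold)
    peaks = (spec == maximum_filter(spec, size=nbhd)) & (spec > spec.mean())
    result = []
    for f, t in zip(*np.nonzero(peaks)):
        result.append((t * hop / sr, f * sr / win))  # (seconds, Hz)
    return result

# Example: a 440 Hz tone should yield peaks concentrated near 440 Hz
sr = 8000
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
peaks = constellation_map(tone, sr)
```

The sparse list of (time, frequency) pairs is exactly the "constellation map" the following slides match against.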

Matching Fingerprints (Shazam)

Database document and query are both represented as constellation maps.

1. Shift the query across the database document
2. Count the matching peaks for each shift
3. A high count indicates a hit (document ID & position)

[Figure: #(matching peaks) plotted over the shift (seconds) shows a pronounced maximum at the matching position]

Indexing (Shazam)

 Index the fingerprints using hash lists
 Hashes correspond to (quantized) peak frequencies
 Each hash list consists of time positions (and document IDs)
 N = number of spectral peaks
 B = #(bits) used to encode a spectral peak
 2^B = number of hash lists
 N / 2^B = average number of elements per list

Problem:
 Individual peaks are not characteristic
 Hash lists may be very long
 Not suitable for indexing
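The shift-and-count matching of constellation maps can be sketched as follows, using hypothetical peak coordinates quantized to a time/frequency grid:

```python
from collections import Counter

def best_shift(db_peaks, query_peaks):
    """Shift the query constellation over the database one and count
    coinciding (time, freq) peaks; the shift with the highest count wins."""
    votes = Counter()
    # For every pair of query/database peaks with equal frequency, the
    # time difference is a candidate shift; matching peaks vote for it.
    for qt, qf in query_peaks:
        for dt, df in db_peaks:
            if df == qf:
                votes[dt - qt] += 1
    shift, count = votes.most_common(1)[0]
    return shift, count

# Toy example: the query is the database fragment starting at time 5
db = [(0, 100), (2, 250), (5, 100), (6, 300), (7, 250), (9, 400)]
query = [(0, 100), (1, 300), (2, 250)]
shift, count = best_shift(db, query)  # shift 5 matches all 3 query peaks
```

A high vote count at one shift corresponds to the pronounced maximum in the matching curve shown on the slides.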

Indexing (Shazam)

Idea: Use pairs of peaks to increase the specificity of the hashes

1. Compute peaks
2. Fix an anchor point
3. Define a target zone
4. Use pairs of points: a hash is formed between the anchor point and each point in the target zone, consisting of two frequency values and a time difference: (f1, f2, Δt)
5. Use every point as an anchor point

 The fan-out (taking pairs of peaks) may cause a combinatorial explosion in the number of tokens. However, this can be controlled by the size of the target zone.
 Using more complex hashes increases specificity (leading to much smaller hash lists) and speed (making the retrieval much faster).
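Steps 1-5 above can be sketched as follows; the fan-out and maximum time difference are illustrative parameters, and the target zone is simplified to "the next few peaks in time":

```python
def shazam_hashes(peaks, fan_out=10, max_dt=63):
    """Form (f1, f2, dt) hashes between each anchor peak and up to
    `fan_out` subsequent peaks in its target zone."""
    peaks = sorted(peaks)                          # sort by time
    hashes = []
    for i, (t1, f1) in enumerate(peaks):           # every peak is an anchor
        paired = 0
        for t2, f2 in peaks[i + 1:]:               # peaks in the target zone
            dt = t2 - t1
            if dt > max_dt:
                break
            if dt > 0:
                hashes.append(((f1, f2, dt), t1))  # hash plus anchor time
                paired += 1
                if paired == fan_out:
                    break
    return hashes

peaks = [(0, 100), (1, 300), (2, 250), (4, 100)]
hs = shazam_hashes(peaks, fan_out=2)
```

With fan-out F this produces about F tokens per peak, which is the F · N memory cost discussed on the next slide.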

Indexing (Shazam)

Definitions:
 N = number of spectral peaks
 p = probability that a spectral peak can be found in the (noisy and distorted) query
 F = fan-out of the target zone, e.g. F = 10
 B = #(bits) used to encode a spectral peak or a time difference

Consequences:
 F · N = #(tokens) to be indexed
 2^(B+B) = increase of specificity (2^(B+B+B) instead of 2^B)
 p^2 = probability of a hash to survive
 p · (1 - (1 - p)^F) = probability that at least one hash survives per anchor point

Example: F = 10 and B = 10
 Memory requirements: F · N = 10 · N
 Speedup factor: 2^(B+B) / F^2 ≈ 10^6 / 10^2 = 10,000 (F times as many tokens in query and database, respectively)

Conclusions (Shazam)

Many parameters to choose:
 Temporal and spectral resolution of the spectrogram
 Peak-picking strategy
 Target zone and fan-out parameter
 Hash function
 ...

Fingerprints (Philips)

Steps:
1. Spectrogram (long window)
 – Standard transform, efficiently computable, robust
 – Coarse temporal resolution, large overlap of windows
 – Robust to temporal distortions
2. Consider a limited frequency range
 – 300-2000 Hz: the most relevant spectral range (perceptually)
3. Logarithmic frequency axis (Bark scale)
 – 33 bands (roughly Bark scale)
 – Coarse frequency resolution, robust to spectral distortions
4. Binarization
 – Local thresholding: sign of energy differences, taken simultaneously along the time and frequency axes
 – Result: a sequence of 32-bit vectors (sub-fingerprints)
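The binarization in step 4 can be sketched as follows, following the Haitsma/Kalker description: bit (n, m) encodes the sign of the energy difference taken along both the frequency and the time axis. The random band-energy matrix is a placeholder for a real 33-band spectrogram.

```python
import numpy as np

def sub_fingerprints(energy):
    """energy: (frames, 33) band energies -> (frames-1, 32) bit matrix.
    Each row is one 32-bit sub-fingerprint."""
    # difference along frequency: E(n, m) - E(n, m+1)
    df = energy[:, :-1] - energy[:, 1:]
    # difference of those along time: current frame minus previous frame
    dtf = df[1:, :] - df[:-1, :]
    # keep only the sign -> 32 bits per frame
    return (dtf > 0).astype(np.uint8)

rng = np.random.default_rng(0)
energy = rng.random((257, 33))      # 257 frames x 33 Bark-like bands
bits = sub_fingerprints(energy)     # shape (256, 32): one fingerprint block
```

Because only signs of differences are kept, slowly varying gain or equalization changes tend to leave most bits intact.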

Fingerprints (Philips)

Sub-fingerprint:
 32-bit vector
 Not characteristic enough on its own

Fingerprint block:
 256 consecutive sub-fingerprints
 Covers roughly 3 seconds
 Blocks overlap

Matching Fingerprints (Philips)

Database documents are represented as sequences of fingerprint blocks; the query is a single fingerprint block.

1. Shift the query across the database document
2. Calculate a block-wise bit error rate (BER)
3. A low BER indicates a hit

[Figure: BER plotted over the shift (seconds); the BER drops sharply at the matching position]
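The block-wise BER matching can be sketched as follows: slide a 256 × 32-bit query block over the database bit sequence and compute the fraction of differing bits at each shift. The data here is synthetic, with 5% of the query bits flipped to simulate noise.

```python
import numpy as np

def ber_curve(db_bits, query_bits):
    """Bit error rate between the query block and each aligned
    database block; a low BER indicates a hit."""
    n_frames, _ = query_bits.shape
    shifts = db_bits.shape[0] - n_frames + 1
    return np.array([
        np.mean(db_bits[s:s + n_frames] != query_bits)  # fraction of flipped bits
        for s in range(shifts)
    ])

rng = np.random.default_rng(1)
db = rng.integers(0, 2, (1000, 32), dtype=np.uint8)
query = db[300:556].copy()                   # query = block starting at frame 300
query[rng.random(query.shape) < 0.05] ^= 1   # flip 5% of the bits (noise)
curve = ber_curve(db, query)
hit = int(curve.argmin())
```

At the true position the BER stays near the bit-flip rate, while unrelated blocks hover around 0.5, which is what makes a simple threshold test workable.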

Indexing (Philips)

Note:
 Individual sub-fingerprints (32 bits) are not characteristic
 Fingerprint blocks (256 sub-fingerprints, 8 kbit) are used

Problem:
 Computing the BER between the query fingerprint block and every database fingerprint block is expensive
 The chance that a complete fingerprint block survives unchanged is low
 Exact hashing of entire blocks is therefore problematic

Strategy:
 Only sub-fingerprints are indexed using hashing
 Exact sub-fingerprint matches are used to identify candidate fingerprint blocks in the database
 The BER is only computed between the query fingerprint block and the candidate fingerprint blocks
 The procedure terminates when a database fingerprint block is found whose BER falls below a certain threshold

Indexing (Philips)

1. Efficient search for exact matches of sub-fingerprints (anchor points)
2. Calculate the BER only for blocks containing anchor points

Conclusions (Philips)

 Comparison of binary fingerprint blocks expressing tempo-spectral changes
 Usage of a shingling-like technique → see [Casey et al. 2008, IEEE-TASLP] for a similar approach applied to a more general retrieval task
 Acceleration using hash-based search for anchor points (sub-fingerprints)
 Concepts of fault tolerance are required to increase robustness
 Susceptible to distortions in specific frequency bands (e.g. equalization) or to superpositions with other sources

Conclusions (Audio Identification)

 Basic techniques used in the Shazam and Philips systems
 Many more ways to define robust audio fingerprints
 Delicate trade-off between specificity, robustness, and efficiency
 The audio recording is identified (not a piece of music)
 Does not allow for identifying a studio recording using a query taken from a different recording
 Does not generalize to identifying different interpretations or versions of the same piece of music


Audio Matching

Example: Beethoven's Fifth in various interpretations: Bernstein, Karajan, Scherbakov (piano), MIDI (piano)

Database: Audio collection containing:
 Several recordings of the same piece of music
 Different interpretations by various musicians
 Arrangements in different instrumentations

Goal: Given a short query audio fragment, find all corresponding audio fragments of similar musical content.

Notes:
 Instance of fragment-based retrieval
 Medium specificity
 A single document may contain several hits
 Cross-modal retrieval also feasible

Application Scenario
 Content-based retrieval
 Cross-modal retrieval

Literature (Audio Matching)
 Pickens et al. (ISMIR 2002)
 Müller/Kurth/Clausen (ISMIR 2005)
 Suyoto et al. (IEEE TASLP 2008)
 Casey et al. (IEEE TASLP 2008)
 Kurth/Müller (IEEE TASLP 2008)
 Yu et al. (ACM MM 2010)

Audio Matching: Two main ingredients

1.) Audio features
 Robust but discriminating
 Chroma-based features: correlate to the harmonic progression
 Robust to variations in dynamics, timbre, articulation, and local tempo

2.) Matching procedure
 Efficient
 Robust to local and global tempo variations
 Scalable using index structures

Audio Features

Example: Chromatic scale
 Spectrogram (frequency in Hz) → log-frequency spectrogram (MIDI pitch)
 Chroma representation: summing octave-equivalent pitches
 Normalized chroma representation (Euclidean norm per frame)

Example: Beethoven's Fifth (Karajan vs. Scherbakov)
 Chroma representation (normalized, 10 Hz feature rate)
 Smoothing (2 seconds) + downsampling (factor 5) → 2 Hz
 Smoothing (4 seconds) + downsampling (factor 10) → 1 Hz

 There are many ways to implement chroma features
 Properties may differ significantly
 Appropriateness depends on the respective application
 http://www.mpi-inf.mpg.de/resources/MIR/chromatoolbox/
 MATLAB implementations for various chroma variants
 ISMIR 2011, Poster Session (PS2), Tuesday 13-15
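One simple way to obtain such features, out of the many variants mentioned above: fold a log-frequency (pitch) representation into 12 chroma bins, normalize each frame, then smooth and downsample. A sketch, not the Chroma Toolbox implementation; the MIDI offset and smoothing parameters are illustrative assumptions.

```python
import numpy as np

def pitch_to_chroma(pitch_energy, midi_start=21):
    """pitch_energy: (pitches, frames), e.g. 88 piano pitches from MIDI 21.
    Sum octave-equivalent pitches into 12 chroma bins and normalize
    each frame to unit Euclidean norm."""
    n_pitches, n_frames = pitch_energy.shape
    chroma = np.zeros((12, n_frames))
    for p in range(n_pitches):
        chroma[(midi_start + p) % 12] += pitch_energy[p]
    norms = np.linalg.norm(chroma, axis=0)
    return chroma / np.maximum(norms, 1e-9)

def smooth_downsample(chroma, win=5, factor=5):
    """Average over `win` frames, then keep every `factor`-th frame
    (e.g. 10 Hz features -> 2 Hz features for factor 5)."""
    kernel = np.ones(win) / win
    smoothed = np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode='same'), 1, chroma)
    return smoothed[:, ::factor]

energy = np.abs(np.random.default_rng(2).normal(size=(88, 100)))
chroma = pitch_to_chroma(energy)
coarse = smooth_downsample(chroma)
```

The smoothing/downsampling stage is what trades temporal resolution for robustness to local tempo deviations, as used in the matching experiments below.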

Matching Procedure

 Compute chroma feature sequences for the query and the database
 Query sequence of length M (small); database sequence of length N (very large)
 Compare the query with every length-M subsequence of the database sequence → matching curve
 Local minima of the matching curve indicate hits

Example query: Beethoven's Fifth / Bernstein (first 20 seconds)
Database: Bach, Beethoven/Bernstein, Beethoven/Sawallisch, Shostakovich
→ The matching curve exhibits pronounced minima (hits 1-7) at the occurrences of the queried passage.

Problem: How to deal with tempo differences? Karajan is much faster than Bernstein!
→ For the Beethoven/Karajan recording, the matching curve does not indicate any hits.

1. Strategy: Usage of local warping
 – Warping strategies are computationally expensive and hard to index.
2. Strategy: Usage of multiple scaling
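The basic matching curve can be sketched as follows: compare the length-M query with every length-M subsequence of the database sequence using an averaged cosine-type distance. The data is synthetic; a real run would use chroma sequences as above.

```python
import numpy as np

def matching_curve(db, query):
    """db: (12, N) and query: (12, M) column-normalized chroma sequences.
    Returns a curve of length N - M + 1; entry n is the average
    distance (1 - cosine similarity) of the query to db[:, n:n+M]."""
    _, N = db.shape
    _, M = query.shape
    curve = np.empty(N - M + 1)
    for n in range(N - M + 1):
        sims = np.sum(db[:, n:n + M] * query, axis=0)  # per-frame cosine
        curve[n] = 1.0 - sims.mean()
    return curve

rng = np.random.default_rng(3)
db = np.abs(rng.normal(size=(12, 200)))
db /= np.linalg.norm(db, axis=0)      # unit-norm columns
query = db[:, 50:70].copy()           # the query occurs at position 50
curve = matching_curve(db, query)
best = int(curve.argmin())
```

Local minima of `curve` play the role of the hits marked on the slides; the argmin recovers the known position of the planted query.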

Matching Procedure: Multiple scaling

 Query resampling simulates tempo changes
 Compute one matching curve per resampled query
 Minimize over all curves
 The resulting curve is similar to a warping curve

Experiments

 Audio database ≈ 110 hours, 16.5 GB
 Preprocessing → chroma features, 40.3 MB
 Query clip ≈ 20 seconds
 Retrieval time ≈ 10 seconds (using MATLAB)
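The multiple-scaling strategy can be sketched as: resample the query to several tempi, compute one matching curve per tempo, and take the pointwise minimum. The scaling factors and the crude nearest-frame resampling are illustrative assumptions.

```python
import numpy as np

def matching_curve(db, query):
    _, N = db.shape
    _, M = query.shape
    return np.array([1.0 - np.sum(db[:, n:n + M] * query, axis=0).mean()
                     for n in range(N - M + 1)])

def resample(seq, factor):
    """Crude feature-level resampling simulating a tempo change."""
    M = seq.shape[1]
    idx = (np.arange(int(round(M * factor))) / factor).astype(int)
    return seq[:, np.clip(idx, 0, M - 1)]

def multi_scale_curve(db, query, factors=(0.8, 0.9, 1.0, 1.1, 1.25)):
    """One matching curve per resampled query; minimize pointwise."""
    out = np.full(db.shape[1], np.inf)
    for f in factors:
        c = matching_curve(db, resample(query, f))
        out[:len(c)] = np.minimum(out[:len(c)], c)
    return out

rng = np.random.default_rng(4)
db = np.abs(rng.normal(size=(12, 300)))
db /= np.linalg.norm(db, axis=0)
query = db[:, 100:140].copy()        # exact occurrence at position 100
curve = multi_scale_curve(db, query)
best = int(np.argmin(curve))
```

Because each scaled query is matched exactly (no warping), the curves remain indexable, which is the point of this strategy over local warping.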

Experiments

Query: Beethoven's Fifth / Bernstein (first 20 seconds)

 Rank | Piece                                 | Position (seconds)
  1   | Beethoven's Fifth / Bernstein         | 0-21
  2   | Beethoven's Fifth / Bernstein         | 101-122
  3   | Beethoven's Fifth / Karajan           | 86-103
  ... | ...                                   | ...
  10  | Beethoven's Fifth / Karajan           | 252-271
  11  | Beethoven (Liszt) Fifth / Scherbakov  | 0-19
  12  | Beethoven's Fifth / Sawallisch        | 275-296
  13  | Beethoven (Liszt) Fifth / Scherbakov  | 86-103
  14  | Schumann Op. 97,1 / Levine            | 28-43

Query: Shostakovich, Waltz / Chailly (first 21 seconds)
Expected hits: Shostakovich/Chailly and Shostakovich/Yablonsky

 Rank | Piece                           | Position (seconds)
  1   | Shostakovich / Chailly          | 0-21
  2   | Shostakovich / Chailly          | 41-60
  3   | Shostakovich / Chailly          | 180-198
  4   | Shostakovich / Yablonsky        | 1-19
  5   | Shostakovich / Yablonsky        | 36-52
  6   | Shostakovich / Yablonsky        | 156-174
  7   | Shostakovich / Chailly          | 144-162
  8   | Bach BWV 582 / Chorzempa        | 358-373
  9   | Beethoven Op. 37,1 / Toscanini  | 12-28
  10  | Beethoven Op. 37,1 / Pollini    | 202-218

Indexing

 The matching procedure is linear in the size of the database
 Retrieval time was 10 seconds for 110 hours of audio
 → Much too slow
 → Does not scale to millions of songs
 → Need for indexing methods

Indexing

General procedure:
 Convert the database into a feature sequence (chroma)
 Quantize the features with respect to a fixed codebook
 Create an inverted file index
 – contains an inverted list for each codebook vector
 – each list contains feature indices in ascending order

[Kurth/Müller, IEEE-TASLP 2008]

How to derive a good codebook?
 Codebook selection by unsupervised learning
 – Linde-Buzo-Gray (LBG) algorithm, similar to k-means
 – adjust the algorithm to spheres of suitable size
 Codebook selection based on musical knowledge
 Quantization using nearest neighbors
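The quantize-then-invert procedure can be sketched as follows, with a toy codebook and nearest-neighbor quantization (not the codebooks used in the paper):

```python
import numpy as np

def quantize(features, codebook):
    """Assign each feature vector (column of `features`) to its
    nearest codebook vector (row of `codebook`)."""
    # squared distances between all features and all codebook vectors
    d = ((features.T[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def build_inverted_index(features, codebook):
    """codebook id -> list of feature (time) indices, in ascending order."""
    index = {c: [] for c in range(len(codebook))}
    for n, c in enumerate(quantize(features, codebook)):
        index[int(c)].append(n)     # ascending by construction
    return index

codebook = np.eye(3)                       # 3 toy codebook vectors
features = codebook[[0, 2, 1, 0, 0, 1]].T  # (dim, frames)
index = build_inverted_index(features, codebook)
```

At query time, the inverted lists of the query's codewords are intersected (with fault-tolerance relaxations), instead of scanning the whole feature sequence.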

Indexing: LBG algorithm

Steps:
1. Initialization of the codebook vectors
2. Assignment of features to the nearest codebook vectors
3. Recalculation of the codebook vectors
4. Iteration (back to 2.) until convergence

LBG algorithm for spheres (2D example):
 Assignment
 Recalculation
 Projection onto the unit sphere
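Steps 1-4, with the sphere projection, can be sketched as a k-means variant whose centroids are re-normalized after each recalculation. A toy 2D example with a deterministic initialization; the original implementation differs in its initialization and stopping criteria.

```python
import numpy as np

def lbg_spherical(data, k, iters=20):
    """LBG / k-means on unit-norm data: 1. init, 2. assignment,
    3. recalculation + projection onto the unit sphere, 4. iterate."""
    # 1. deterministic initialization from the data itself
    codebook = data[np.linspace(0, len(data) - 1, k).astype(int)].copy()
    labels = np.zeros(len(data), dtype=int)
    for _ in range(iters):
        # 2. assign every feature to its nearest codebook vector
        d = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # 3. recalculate each codebook vector and project onto the sphere
        for c in range(k):
            members = data[labels == c]
            if len(members):
                mean = members.mean(axis=0)
                codebook[c] = mean / np.linalg.norm(mean)
    return codebook, labels

# toy data: two clusters of unit vectors around 0 and 90 degrees
rng = np.random.default_rng(6)
angles = np.concatenate([rng.normal(0.0, 0.1, 50),
                         rng.normal(np.pi / 2, 0.1, 50)])
data = np.stack([np.cos(angles), np.sin(angles)], axis=1)
codebook, labels = lbg_spherical(data, k=2)
```

The projection step keeps the codebook on the same unit sphere as the normalized chroma features, so nearest-neighbor quantization remains meaningful.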

Indexing: Codebook using musical knowledge

 Observation: Chroma features capture harmonic information
 Examples: C major and C# major chord templates
 Choose the codebook to contain all n-chord templates for n = 1, 2, 3, 4
 Experiments: For more than 95% of all chroma features, more than 50% of the energy lies in at most 4 chroma components

 n | 1  | 2  | 3   | 4   | total
 # | 12 | 66 | 220 | 495 | 793
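The codebook sizes in the table are simply binomial coefficients: the number of n-element subsets of the 12 chroma values.

```python
from math import comb

# number of n-chord templates over 12 chroma values, n = 1..4
sizes = {n: comb(12, n) for n in range(1, 5)}
total = sum(sizes.values())   # overall codebook size
```

So the model-based codebook has 12 + 66 + 220 + 495 = 793 entries, matching the table.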

Indexing: Codebook using musical knowledge

Additional consideration of harmonics in the chord templates.

Example: 1-chord C

 Harmonic       | 1   | 2   | 3   | 4   | 5   | 6
 Pitch          | C3  | C4  | G4  | C5  | E5  | G5
 Frequency (Hz) | 131 | 262 | 392 | 523 | 654 | 785
 Chroma         | C   | C   | G   | C   | E   | G

Replace each plain template by a weighted template with suitable weights for the harmonics.

Indexing: Quantization

[Figure: original chromagram and its projections onto the codebooks (original vs. LBG-based vs. model-based)]

Indexing: Query and retrieval stage

 Query consists of a short audio clip (10-40 seconds)
 Specification of a fault-tolerance setting:
 – fuzziness of the query
 – number of admissible mismatches
 – tolerance to tempo variations
 – tolerance to modulations

Retrieval results

 Medium-sized database: 500 pieces, 112 hours of audio, mostly ...
 Selection of various queries: 36 queries, durations between 10 and 40 seconds, hand-labelled matches in the database
 Indexing leads to a speed-up factor between 15 and 20 (depending on query length)
 Only a small degradation in average precision and average recall (no index vs. LBG-based index vs. model-based index)

Indexing: Conclusions

 Described method is suitable for medium-sized databases
 – the index is assumed to reside in main memory
 – inverted lists may be long
 – a high degree of fault tolerance is required (fuzziness, mismatches)
 – the number of list intersections and unions may explode
 Goal was to find all meaningful matches
 What to do when dealing with millions of songs?
 Can the quantization be avoided?
 Better indexing and retrieval methods needed: kd-trees, locality-sensitive hashing, ...

Conclusions (Audio Matching)

Audio features
 Strategy: Absorb variations already at the feature level
 Chroma → invariance to timbre
 Normalization → invariance to dynamics
 Smoothing → invariance to local time deviations
 Message: There is no standard chroma feature! Variants can make a huge difference!

Matching procedure
 Strategy: Exact matching with multiple scaled queries
 – simulate tempo variations by feature resampling
 – different queries correspond to different tempi
 – indexing possible
 Strategy: Dynamic time warping (subsequence variant)
 – more flexible (in particular for longer queries)
 – indexing hard

Feature Design

 Enhancement of chroma features
 Usage of the audio matching framework for evaluating the quality of the obtained audio features
 Usage of matching curves as a mid-level representation to reveal a feature's robustness and discriminative capability

M. Müller and S. Ewert (2010): Towards Timbre-Invariant Audio Features for Harmony-Based Music. IEEE Transactions on Audio, Speech & Language Processing, Vol. 18, No. 3, pp. 649-662.

Motivation: Audio Matching

[Figure: matching curve with the four occurrences (1-4) of the main theme marked on the time axis (seconds)]

Chroma Features

[Figure: chroma representations (chroma scale over time in seconds) of the first and third occurrence of the theme; the two chromagrams differ noticeably]

How to make chroma features more robust to timbre changes?
Idea: Discard timbre-related information

[Müller/Ewert, IEEE-TASLP 2010]

MFCC Features and Timbre

[Figure: MFCC coefficients plotted over time (seconds)]

 The lower MFCCs capture timbre
 Idea: Discard the lower MFCCs to achieve timbre invariance

[Müller/Ewert, IEEE-TASLP 2010]

Enhancing Timbre Invariance

Steps:
1. Log-frequency spectrogram (short-time pitch energy, pitch scale over time)
2. Log (amplitude)
3. DCT → pitch-frequency cepstral coefficients (PFCC)
4. Discard the lower coefficients [1:n-1], keep the upper coefficients [n:120]
5. Inverse DCT
6. Chroma folding & normalization

→ CRP (Chroma DCT-Reduced Log-Pitch) features

[Müller/Ewert, IEEE-TASLP 2010]
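Steps 1-6 can be sketched as follows. This is a minimal sketch following the description above; the pitch-to-chroma offset and the log-compression constant are assumptions, not the published CRP parameters.

```python
import numpy as np
from scipy.fft import dct, idct

def crp(pitch_energy, n=55, eps=1e-9):
    """pitch_energy: (120 pitches, frames) short-time pitch energies."""
    v = np.log(pitch_energy + eps)                 # 2. log compression
    c = dct(v, axis=0, norm='ortho')               # 3. DCT over the pitch axis
    c[:n - 1, :] = 0.0                             # 4. discard coefficients 1..n-1
    w = idct(c, axis=0, norm='ortho')              # 5. inverse DCT
    chroma = np.zeros((12, v.shape[1]))            # 6. chroma folding
    for p in range(120):
        chroma[p % 12] += w[p]
    # normalize each frame (Euclidean norm)
    return chroma / np.maximum(np.linalg.norm(chroma, axis=0), eps)

energy = np.abs(np.random.default_rng(7).normal(size=(120, 50))) + 0.1
features = crp(energy, n=55)
```

Zeroing the lower cepstral coefficients plays the same role as discarding the lower MFCCs: it removes the slowly varying spectral-envelope (timbre) component before folding to chroma.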

Chroma versus CRP

Example: Shostakovich Waltz, first vs. third occurrence of the theme
 Standard chroma: the two occurrences look rather different
 CRP(55), i.e. n = 55: the two representations become much more similar

CRP = Chroma DCT-Reduced Log-Pitch

[Müller/Ewert, IEEE-TASLP 2010]

Quality: Audio Matching

Query: Shostakovich, Waltz / Yablonsky (3rd occurrence)
Database: Shostakovich/Chailly and Shostakovich/Yablonsky recordings

Quality: Audio Matching

Query: Shostakovich, Waltz / Yablonsky (3rd occurrence). [Figure: matching curves for Shostakovich/Chailly and Shostakovich/Yablonsky, comparing standard Chroma (Chroma Pitch) with CRP(55).]

Quality: Audio Matching

Query: Free in You / Indigo Girls (1st occurrence). [Figure: matching curves for Free in You/Indigo Girls and Free in You/Dave Cooley, comparing standard Chroma (Chroma Pitch) with CRP(55).]

Chroma Toolbox

 There are many ways to implement chroma features
 Their properties may differ significantly
 Appropriateness depends on the respective application
 MATLAB implementations for various chroma variants: http://www.mpi-inf.mpg.de/resources/MIR/chromatoolbox/
 ISMIR 2011, Poster Session (PS2), Tuesday 13-15

Overview

Part I: Audio identification
Part II: Audio matching
Coffee Break
Part III: Version identification
Part IV: Category-based music retrieval

Version Identification

Goal: Given a music recording or a sufficiently long excerpt of it, find all corresponding music recordings within a collection that can be regarded as a version or reinterpretation of the same piece.

Versions: Alternative recordings, renditions, arrangements or performances of the same musical piece.
 Live versions
 Cover songs
 Tributes
 Demos
 Parodies
 Contemporary versions of an old song
 Radically different interpretations of a musical piece
 …

Version Identification

Motivation
 Search and organization of music collections
– Near-duplicate detection/discovery of music documents
 Musical rights management and licenses
– Radio broadcast monitoring, advertisements, ...
– Aid in court decisions/music copyright infringement (e.g. Müllensiefen & Pendzich, Mus. Sci., 2009)
 “Creative” application contexts
– Composers (e.g. assess originality of ideas)
– Musicologists (e.g. track origins of a motive)
– Understanding (what’s the essence of a song?)
 Fun/leisure

What can change between versions? Examples:
 Key: Bob Dylan vs. Avril Lavigne, “Knockin’ on Heaven’s Door”
 Harmonization: Duke Ellington vs. James Aebersold, “Take the A Train”
 Timbre: “Enter Sandman”
 Tempo: Nirvana, “Polly” [Incesticide] vs. “Polly” [Unplugged]
 Timing: Getz vs. Rosa Passos & Ron Carter, “Garota de Ipanema”
 Lyrics: Black Sabbath, “Paranoid” vs. Cindy & Bert, “Der Hund der Baskerville”
 Recording conditions: AC/DC, “High Voltage” vs. “High Voltage” [live]
 Structure: Queen vs. Molotov, “Bohemian Rhapsody”

Version Identification

Everything changes, but there must be some “essence” that allows us to identify versions. How can we compare two different recordings? What is kept? What allows us to still identify such pieces?

Tonal information: the played notes, their temporal succession, combinations and organization.

Main idea:
 Exploit common information (tonal information)
 Be invariant against potential changes (timbre, main tonality, tempo, structure, …)

Version Identification: Approach A

[Figure: main building blocks of a version identification system.]

[Serrà et al., New J. Phys., 2009]

Version Identification: Approach A — Feature extraction

Chroma (PCP) features
 Correlate to harmonic and melodic progression
 Robust to changes in timbre and instrumentation
 Normalization introduces invariance to dynamics

Concrete variant: Harmonic Pitch Class Profiles (HPCP) [Gómez, PhD, 2006]

[Serrà et al., New J. Phys., 2009]

Version Identification: Approach A — Key invariance

Example: “Knockin’ on Heaven’s Door”, Bob Dylan vs. Avril Lavigne.

Cyclic chroma shifts:
 Compute average chroma vectors for each song
 Consider cyclic shifts of the chroma vectors to simulate transpositions
 Fix a local cost measure and compute the cost between x and each shifted y
 Determine the optimal shift indices so that the shifted chroma vectors are matched with minimal cost
 Transpose the songs accordingly
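A minimal numpy sketch of this transposition search. The function name and the cosine-based cost (1 minus cosine similarity between normalized average chroma vectors) are our choices, not prescribed by the slides.

```python
import numpy as np

def optimal_transposition_index(x, y):
    """Return (best_shift, costs): the cyclic shift of y that best
    matches x, using 1 - cosine similarity as the local cost measure.
    x, y: global (average) 12-dimensional chroma vectors of two songs."""
    x = x / (np.linalg.norm(x) + 1e-12)
    costs = []
    for shift in range(12):                 # all 12 transpositions
        y_s = np.roll(y, shift)
        y_s = y_s / (np.linalg.norm(y_s) + 1e-12)
        costs.append(1.0 - float(x @ y_s))
    return int(np.argmin(costs)), costs
```

The minimizing shift index is then applied to all chroma frames of one song before the local-similarity stage.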

Cyclic Chroma Shifts

[Figure: cost between x and the cyclically shifted σ^i(y) as a function of the shift index i = 0, 1, …, 11, shown for σ(y), σ²(y), σ³(y), and σ¹¹(y); here the minimizing shift index is 3.]

Version Identification: Approach A [Serrà et al., New J. Phys., 2009]

Version Identification: Approach A — Local similarity

Phase space embedding (smoothing): take m consecutive samples, separated by some delay τ, and concatenate them to form a higher-dimensional vector:

x̂_i = [x_i, x_{i−τ}, x_{i−2τ}, …, x_{i−(m−1)τ}]

The local similarity stage thus consists of phase space embedding (smoothing) followed by a cross-recurrence plot (thresholding).
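The embedding can be sketched as follows. This is a forward-windowed variant (stacking frames i, i+τ, …, i+(m−1)τ rather than past frames, which is equivalent up to indexing), and the name `delay_embed` is ours.

```python
import numpy as np

def delay_embed(X, m=2, tau=1):
    """Phase-space embedding of a feature sequence.

    X: array of shape (frames, dims), e.g. a chroma sequence.
    Each embedded vector stacks m frames separated by a delay of tau
    frames, so the output has shape (frames - (m-1)*tau, m*dims)."""
    T = X.shape[0] - (m - 1) * tau
    return np.hstack([X[i * tau : i * tau + T] for i in range(m)])
```

Comparing embedded vectors instead of single frames effectively smooths the cost matrix over short temporal context windows.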

[Kantz & Schreiber, Camb. Univ. Press, 2004; Marwan et al., Phys. Rep., 2007]

Version Identification: Approach A — Local similarity: Smoothing

[Figure: local dissimilarity (cost) matrix before and after smoothing; with smoothing, tempo changes get blurred.]

An alternative smoothing strategy is to filter the cost matrix along several (e.g. 8) plausible tempo directions and take the minimum [Müller & Kurth, ICASSP, 2006].

Version Identification: Approach A — Local similarity

Cross recurrence plot: thresholding the local cost (dissimilarity) measure [Marwan et al., Phys. Rep., 2007].

Possible thresholding strategies for the cross recurrence plot:
 Absolute threshold
 Percentage of the mean cost found
 Taking each frame’s K nearest neighbors
 …

[Marwan et al., Phys. Rep., 2007; Serrà et al., New J. Phys., 2009]
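One of the listed strategies, per-frame K-nearest-neighbor thresholding, might be sketched like this. The mutual-neighbor condition (requiring the neighborhood relation in both directions) and the Euclidean cost are our choices for illustration.

```python
import numpy as np

def cross_recurrence_plot(X, Y, k=2):
    """Binary cross-recurrence plot via K-nearest-neighbor thresholding.

    X: (N, d), Y: (M, d) embedded feature sequences.
    R[i, j] = 1 when y_j is among the k nearest neighbors of x_i AND
    x_i is among the k nearest neighbors of y_j."""
    # local cost (dissimilarity) matrix, Euclidean
    D = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
    # per-row and per-column distance thresholds at the k-th neighbor
    row_thr = np.sort(D, axis=1)[:, k - 1][:, None]
    col_thr = np.sort(D, axis=0)[k - 1, :][None, :]
    return ((D <= row_thr) & (D <= col_thr)).astype(int)
```

The resulting binary matrix serves directly as the reward matrix R for the alignment stage.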

Alignment

 Classical DTW: global correspondence between X and Y
 Subsequence DTW: a subsequence of Y corresponds to X
 Local alignment: a subsequence of Y corresponds to a subsequence of X

Dynamic Time Warping (DTW): an optimal warping between two time series (from beginning to end) is found by dynamic programming (“accumulating” local costs + backtracking):

D(n, m) = C(n, m) + min{ D(n−1, m−1), D(n, m−1), D(n−1, m) }

Global Alignment

Dynamic Time Warping (DTW): we can apply global path constraints, e.g. the Sakoe-Chiba band or the Itakura parallelogram, to restrict the admissible warping paths.

Alternatively, local path constraints modify the recursion, e.g.:

D(n, m) = C(n, m) + min{ D(n−1, m−1), D(n−1, m−2), D(n−2, m−1) }

Subsequence Alignment

Subsequence DTW (a subsequence of Y corresponds to X):
 Compute the accumulated dissimilarity matrix as before, but allow the match to start anywhere in Y
 Backtrack from the lowest value in a given end region until the start region is reached
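A sketch of the accumulation and end-point selection described above (backtracking to the start frame is again omitted):

```python
import numpy as np

def subsequence_dtw(C):
    """Subsequence DTW: find the subsequence of Y best matching all of X.

    C: (N, M) cost matrix between query X (rows) and database Y (cols).
    Same recursion as DTW, but the first row is initialized with C itself
    so a match may start at any column of Y; the best end column is the
    minimum of the last row. Returns (cost, end_index)."""
    N, M = C.shape
    D = np.zeros((N, M))
    D[0] = C[0]                      # the match may start anywhere in Y
    for n in range(1, N):
        D[n, 0] = D[n - 1, 0] + C[n, 0]
        for m in range(1, M):
            D[n, m] = C[n, m] + min(D[n - 1, m - 1],
                                    D[n - 1, m],
                                    D[n, m - 1])
    end = int(np.argmin(D[-1]))
    return float(D[-1, end]), end
```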

Local Alignment

Task: Let X = (x_1, …, x_N) and Y = (y_1, …, y_M) be two time series. Find the maximum similarity of a subsequence of X and a subsequence of Y.

Notes:
 This problem is also known from bioinformatics. The Smith-Waterman algorithm is a well-known algorithm for performing local sequence alignment, i.e., for determining similar regions between two nucleotide or protein sequences.
 Similar concepts have been developed in parallel in the nonlinear time series analysis community (physics). Cross recurrence plots and their recurrence quantification measures are also a good starting point.

Strategy:
 Develop our own variant of these algorithms
 Differentiate between matches and mismatches
 Specifically tackle structure changes and small temporal disruptions (both very important for version identification)

Think positive! Compute an accumulated score matrix S from a binary similarity (reward) matrix R:

S(n, m) = R(n, m) + max{ S(n−1, m−1), S(n−1, m−2), S(n−2, m−1) }   if R(n, m) > 0

S(n, m) = max{ 0, S(n−1, m−1) − g, S(n−1, m−2) − g, S(n−2, m−1) − g }   otherwise

 g penalizes “inserts” and “deletes” in the alignment
 The zero entry allows for jumping to any cell without penalty
 The best local alignment score is the highest value in S
 The best local alignment ends at the cell of highest value
 Its start is obtained by backtracking to the first cell of value zero
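A direct transcription of the score recursion (the value of g and the toy reward matrix in the example are arbitrary; out-of-range S entries count as zero):

```python
import numpy as np

def local_alignment_score(R, g=0.5):
    """Accumulated score matrix S from a binary reward matrix R.

    If R(n,m) > 0:
        S(n,m) = R(n,m) + max(S(n-1,m-1), S(n-1,m-2), S(n-2,m-1))
    else:
        S(n,m) = max(0, S(n-1,m-1)-g, S(n-1,m-2)-g, S(n-2,m-1)-g)
    The best local alignment score is the maximum of S."""
    N, M = R.shape
    S = np.zeros((N + 2, M + 2))          # 2-cell zero border for the
    for n in range(2, N + 2):             # (n-2, m-2)-type predecessors
        for m in range(2, M + 2):
            prev = max(S[n - 1, m - 1], S[n - 1, m - 2], S[n - 2, m - 1])
            if R[n - 2, m - 2] > 0:
                S[n, m] = R[n - 2, m - 2] + prev
            else:
                S[n, m] = max(0.0, prev - g)
    return S[2:, 2:]
```

Backtracking from the maximal cell to the first zero-valued cell then yields the pair of matching subsequences.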

Local Alignment Score

Example: “Knockin’ on Heaven’s Door”, Bob Dylan vs. Guns and Roses. [Figure: accumulated score matrix; the cell with maximal score (94.2) marks the end of the alignment path of maximal score, which reveals the matching subsequences.]

Version Identification: Conclusions

 Main idea: Extract common, shared information and be invariant towards potential changes
– Tonal information (notes)
– Timbre, tempo/rhythm, key/harmonicity, structure, lyrics, ...
 Example approach
– Chroma (PCP) features
– Transposition indices (circular shifts)
– Local similarity measure (subsequence + thresholding)
– Local alignment
 Many other approaches exist!
– Casey et al., IEEE-TASLP, 2008: Shingling + Indexing
– Marolt, IEEE-TMM, 2008: Mid-level representation + 2D Spectrum
– Ellis & Poliner, ICASSP, 2007: Beat tracking + Correlation
(cf. Serrà, PhD, “Literature Review”, 2011)


Category-based Retrieval

Goal: Find music recordings within a collection that share a specific, predefined characteristic or trait.

Humans naturally group objects with similar characteristics or traits into categories (or classes).

Category: any word or concept that can be used to describe a set of recordings sharing some property. Also referred to as class, type or kind.

 Categories are one of the most basic ways to represent and structure human knowledge
 They encapsulate characteristic, intrinsic features of a group of items
 Prototypes can be built, and potential category members can be compared to these
 Categories can overlap and have some kind of “organization” (e.g. fuzzy and hierarchical)

[Zbikowski, Oxford Univ. Press, 2002]

Category-based Retrieval

Category-based query examples:
 Genres (e.g. “bossa-nova”)
 Moods (e.g. “sad”)
 Specific instruments (e.g. “extremely distorted electric guitar”)
 Instrument ensembles
 Key/mode (e.g. “G minor pieces”)
 Dynamics (e.g. “lots of loudness changes”)
 Type of recording (e.g. “live recording”)
 Epoch (e.g. “music from the 80s”)
 …

Category-based Retrieval

Objective: Label/annotate music documents (ideally with some confidence and probability measurements) in a way that makes sense to a user.

Categories usually imply a taxonomy:
 Based on the opinion of experts, on the “wisdom of the crowd” (folksonomy), on a single user (personomy), …
 Taxonomies can be organized, their members can overlap, and different relation types can be applied (ontologies) …

Unsupervised:
 No taxonomy. Categories emerge from the data organization under a predefined similarity measure
 Typical clustering algorithms: K-means, hierarchical clustering, SOM, growing hierarchical SOM, …

Supervised:
 Use a taxonomy
 Map features to categories (usually without describing the rules of this mapping explicitly)
 Typical supervised learning algorithms: KNN, neural networks, LDA, SVM, trees, …

Motivation
 Search and organization of music collections (discovery of music documents)
 Playlist generation
 Music recommendation
 Music similarity
 Fun/leisure
 ...

[Tversky, Psych. Revs., 1977; Bogdanov et al., IEEE-TMM, 2011]

Typical Approach: Automatic Annotation

Pipeline: labeled audio → feature extraction → data processing (feature transformation and selection) → classifier training and validation. Unlabeled audio then goes through feature extraction and classification, yielding an annotation together with Prob(Annotation) and Conf(Annotation).

Feature extraction

Features depend on the task! (e.g., MFCCs for key estimation?)
 Temporal: ZCR, energy, loudness, rise time, onset asynchrony, amplitude modulation, …
 Spectral: MFCC, LPC, spectral centroid, spectral bandwidth, bark-band energies, …
 Tonal: Chroma/PCP, melody, chords, key, …
 Rhythmic: Beat histogram, IOIs, rhythm transform, FP, …

Features are extracted on a short-time, frame-by-frame basis.

Typical Approach: Automatic Annotation — Feature extraction

Frame-wise features are usually summarized to represent global characteristics of the whole audio excerpt:
 Component-wise means
 Component-wise variances
 Covariances
 Higher-order statistical moments (skewness, kurtosis, …)
 Derivatives (1st, 2nd, …)
 Max, min, percentiles, histograms, ...
(+ more exotic summarizations)

Typical Approach: Automatic Annotation — Data processing

Feature transformation and feature selection:
 Rescale each feature so that all of them have the same “influence” (normalize, standardize, … Gaussianize?)
 Detect highly correlated features (try to decorrelate them?)
 Remove uninformative features
 Reduce the dimensionality
 Learn about the features and the task!

Principal Component Analysis (PCA):
 Linear transform
 Eliminates redundancy
 Decorrelates variables
 Can be used to reduce the number of features
 Assumes some “homogeneity” of the distributions
 We lose the information of which specific feature is relevant
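PCA can be sketched via the SVD of the centered data matrix; this minimal version (name `pca` is ours) returns the projected, decorrelated scores sorted by explained variance.

```python
import numpy as np

def pca(X, n_components):
    """PCA via SVD of the centered data matrix X (items x features).
    Returns the projection onto the top principal components; the
    resulting components are decorrelated and sorted by variance,
    so the transform can also be used for dimensionality reduction."""
    Xc = X - X.mean(axis=0)                       # center each feature
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T               # projected scores
```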

Typical Approach: Automatic Annotation — Feature selection

Fisher’s ratio:
 Ratio between inter- and intra-class scatter
 Measures the discriminative power of feature j between classes i and i’
 The ratio is high when the classes are well separated

Many other feature selection algorithms:
 Correlation-based
 Chi-squared
 Consistency
 Gain ratio
 Information gain
 Bootstrapping iteratively
 Classifier-affine feature selection
– Sequential backward selection (removing bad features)
– Sequential forward selection (selecting good features)
 …

There is no standard way of performing feature selection.

[Witten & Frank, Elsevier, 2005]
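Fisher's ratio for one feature and two classes can be written as a one-liner (squared mean difference over the sum of within-class variances; the small epsilon is ours, to guard against zero variance):

```python
import numpy as np

def fisher_ratio(x_a, x_b):
    """Fisher's ratio of a single feature between two classes:
    inter-class scatter over intra-class scatter. High values mean
    the feature separates the two classes well.
    x_a, x_b: 1-D arrays of the feature's values in each class."""
    return (x_a.mean() - x_b.mean()) ** 2 / (x_a.var() + x_b.var() + 1e-12)
```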

Typical Approach: Automatic Annotation — Classifiers

 Bayesian/probabilistic classifiers (often with the naïve conditional independence assumption)
 Gaussian mixtures
 Neural networks
 K nearest neighbors
 Classifier trees
 Support vector machines

Many other classifiers exist:
 Logistic regression
 Perceptrons
 Random forests
 Hidden Markov models
 Boosting
 Ensembles of classifiers
 …

[Witten & Frank, Elsevier, 2005]
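As an illustration of one of the simplest classifiers listed above, a K-nearest-neighbor sketch (Euclidean distance and integer class labels assumed; the function name is ours):

```python
import numpy as np

def knn_classify(X_train, y_train, X_test, k=3):
    """K-nearest-neighbor classification: each test item receives the
    majority label among its k nearest training items (Euclidean)."""
    # all pairwise distances, shape (n_test, n_train)
    d = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nn = np.argsort(d, axis=1)[:, :k]          # indices of the k nearest
    labels = y_train[nn]                       # (n_test, k) label matrix
    # majority vote per test item
    return np.array([np.bincount(row).argmax() for row in labels])
```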

Typical Approach: Automatic Annotation

Machine learning / data mining software packages are available for all of these methods.

Category-based Retrieval — Classifiers: Issues

Always perform a visual inspection of your features (check that there are no things such as inf or nan).

Do not classify M items with D features when D is larger than or comparable to M. As a (personal) rule of thumb: M > 20·D.

Be careful when comparing approaches that use a different number of features.

Classifiers: Issues

If possible, use the same number of instances per class:
 “Normalize” by randomly sampling the most populated classes
 Run several experiments (Monte Carlo)

Avoid overfitting. [Figures: a complex decision boundary fits the training set perfectly, whereas the diagonal line is the ideal solution for the problem; re-trained on a smaller dataset, the complex boundary misclassifies new data (3 errors), while a simple line makes only 1 error.]

How to avoid overfitting:
 Use learning algorithms that intrinsically, by design, generalize well
 Pursue simple classifiers
 Evaluate with the most suitable measure for your task (unbiased and low-variance statistical estimators)
 Estimate the performance of a model with data you did not use to construct the model

Classifiers: Issues

Validation schemes:
 Hold-out cross-validation
 k-fold cross-validation
 LOOCV, stratified cross-validation, nested cross-validation, …
 With all of them: ensure a sufficient number of repetitions

Check the default parameters of your classifier. Examples:
 Minimum number of instances per tree branch
 Number of neighbors in a KNN
 Complexity parameter in an SVM
 Distance-based classifiers: which distance? effect of normalization/re-scaling?
 …
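A k-fold cross-validation sketch, paired with a toy distance-based (nearest-class-mean) classifier; all names here are ours, and the model interface `train_and_predict(X_tr, y_tr, X_te)` is just one convenient convention.

```python
import numpy as np

def cross_val_accuracy(X, y, train_and_predict, k=5, seed=0):
    """k-fold cross-validation: estimate accuracy on data the model was
    not trained on. train_and_predict(X_tr, y_tr, X_te) -> predictions."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))          # shuffle before splitting
    folds = np.array_split(idx, k)
    accs = []
    for i in range(k):
        te = folds[i]
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        pred = train_and_predict(X[tr], y[tr], X[te])
        accs.append(np.mean(pred == y[te]))
    return float(np.mean(accs))

def nearest_mean_classifier(X_tr, y_tr, X_te):
    """Toy distance-based classifier: assign each test item to the
    class with the nearest mean feature vector."""
    classes = np.unique(y_tr)
    means = np.stack([X_tr[y_tr == c].mean(axis=0) for c in classes])
    d = np.linalg.norm(X_te[:, None, :] - means[None, :, :], axis=2)
    return classes[np.argmin(d, axis=1)]
```

Repeating the whole procedure with different seeds gives the repetitions recommended above.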

Classifiers: Issues

If you evaluate your features, demonstrate their usefulness/increment in accuracy:
 Use several classifiers (include a “one-rule classifier”?)
 Use several datasets (music collections)
 Always compute/report random baselines (average accuracy, but also their std, min, and max values)
 Include a baseline feature set in the evaluation

Classifiers: Issues

Test for statistical significance!

Category-based Retrieval: Conclusions

Scenario: retrieve music based on labels automatically annotated from audio; assign a label/class, a probability of that label, and a confidence value.

Automatic annotation — three main steps for supervised learning:
 Feature extraction
 Feature transformation and selection
 Classification

Issues with classification tasks: overfitting, statistical significance, number of features, testing several methods/datasets, randomizations, …

Wrap Up


References (Audio Identification)

 E. Allamanche, J. Herre, O. Hellmuth, B. Fröba, and M. Cremer. AudioID: Towards content-based identification of audio material. In Proc. 110th AES Convention, 2001.
 P. Cano, E. Batlle, T. Kalker, and J. Haitsma. A review of algorithms for audio fingerprinting. In Proc. MMSP, pp. 169–173, 2002.
 P. Cano, E. Batlle, T. Kalker, and J. Haitsma. A review of audio fingerprinting. Journal of VLSI Signal Processing Systems, 41(3):271–284, 2005.
 E. Dupraz and G. Richard. Robust frequency-based audio fingerprinting. In Proc. IEEE ICASSP, pp. 281–284, 2010.
 J. Haitsma and T. Kalker. A highly robust audio fingerprinting system. In Proc. ISMIR, pp. 107–115, 2002.
 Y. Ke, D. Hoiem, and R. Sukthankar. Computer vision for music identification. In Proc. IEEE CVPR, pp. 597–604, 2005.

References (Audio Identification, cont.)
 F. Kurth, A. Ribbrock, and M. Clausen. Identification of highly distorted audio material for querying large scale databases. In Proc. AES Convention, 2002.
 M. Ramona and G. Peeters. Audio identification based on spectral modeling of Bark-bands energy and synchronization through onset detection. In Proc. IEEE ICASSP, pp. 477–480, 2011.
 J. S. Seo, J. Haitsma, and T. Kalker. Linear speed-change resilient audio fingerprinting. In Proc. IEEE WMPCA, 2002.
 A. Wang. An industrial strength audio search algorithm. In Proc. ISMIR, pp. 7–13, 2003.

References (Audio Matching)
 C. Fremerey, M. Müller, F. Kurth, and M. Clausen. Automatic mapping of scanned sheet music to audio recordings. In Proc. ISMIR, pp. 413–418, 2008.
 F. Kurth and M. Müller. Efficient index-based audio matching. IEEE Trans. on Audio, Speech, and Language Processing, 16(2):382–395, 2008.
 M. Müller. Information Retrieval for Music and Motion. Springer Verlag, 2007.
 M. Müller and S. Ewert. Towards timbre-invariant audio features for harmony-based music. IEEE Trans. on Audio, Speech, and Language Processing, 18(3):649–662, 2010.
 M. Müller, F. Kurth, and M. Clausen. Audio matching via chroma-based statistical features. In Proc. ISMIR, pp. 288–295, 2005.
 J. Pickens, J. P. Bello, G. Monti, M. Sandler, T. Crawford, and M. Dovey. Polyphonic score retrieval using polyphonic audio queries: a harmonic modeling approach. Journal of New Music Research, 32(2):223–236, 2003.

 I. S. H. Suyoto, A. L. Uitdenbogerd, and F. Scholer. Searching musical audio using symbolic queries. IEEE Trans. on Audio, Speech, and Language Processing, 16(2):372–381, 2008.
 Y. Yu, M. Crucianu, V. Oria, and L. Chen. Local summarization and multi-level LSH for retrieving multi-variant audio tracks. In Proc. ACM Multimedia, pp. 341–350, 2009.

References (Version Identification)
 M. Casey, C. Rhodes, and M. Slaney. Analysis of minimum distances in high-dimensional musical spaces. IEEE Trans. on Audio, Speech, and Language Processing, 16(5), 2008.
 D. P. W. Ellis and G. E. Poliner. Identifying cover songs with chroma features and dynamic programming beat tracking. In Proc. IEEE ICASSP, vol. 4, pp. 1429–1432, 2007.
 R. Foucard, J. L. Durrieu, M. Lagrange, and G. Richard. Multimodal similarity between musical streams for cover version detection. In Proc. IEEE ICASSP, pp. 5514–5517, 2010.
 H. Kantz and T. Schreiber. Nonlinear Time Series Analysis, 2nd ed. Cambridge University Press, 2004.
 M. Marolt. A mid-level representation for melody-based retrieval in audio collections. IEEE Trans. on Multimedia, 10(8):1617–1625, 2008.
 N. Marwan, M. C. Romano, M. Thiel, and J. Kurths. Recurrence plots for the analysis of complex systems. Physics Reports, 438(5):237–329, 2007.

References (Version Identification, cont.)
 D. Müllensiefen and M. Pendzich. Court decisions on music plagiarism and the predictive value of similarity algorithms. Musicae Scientiae, Discussion Forum 4B, 207–238, 2009.
 S. Ravuri and D. P. W. Ellis. Cover song detection: from high scores to general classification. In Proc. IEEE ICASSP, pp. 55–58, 2010.
 J. Serrà. Identification of versions of the same musical composition by processing audio descriptions. PhD Thesis, Universitat Pompeu Fabra, 2011.
 J. Serrà, E. Gómez, P. Herrera, and X. Serra. Chroma binary similarity and local alignment applied to cover song identification. IEEE Trans. on Audio, Speech, and Language Processing, 16(6):1138–1152, 2008.
 J. Serrà, X. Serra, and R. G. Andrzejak. Cross recurrence quantification for cover song identification. New Journal of Physics, 11:093017, 2009.
 J. Serrà, H. Kantz, X. Serra, and R. G. Andrzejak. Predictability of music descriptor time series and its application to cover song detection. IEEE Trans. on Audio, Speech, and Language Processing. In press.
 W. H. Tsai, H. M. Yu, and H. M. Wang. Using the similarity of main melodies to identify cover versions of popular songs for music document retrieval. Journal of Information Science and Engineering, 24(6):1669–1687, 2008.

References (Category-Based Music Retrieval)
 C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2007.
 D. Bogdanov, J. Serrà, N. Wack, P. Herrera, and X. Serra. Unifying low-level and high-level music similarity measures. IEEE Trans. on Multimedia, 13(4):687–701, 2011.
 P. Cano and M. Koppenberger. Automatic sound annotation. In Proc. IEEE Workshop on Machine Learning for Signal Processing, 2004.
 R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2nd ed. John Wiley & Sons, 2000.
 Z. Fu, G. Lu, K. M. Ting, and D. Zhang. A survey of audio-based music classification and annotation. IEEE Trans. on Multimedia, 13(2):303–319, 2011.
 A. K. Jain, R. P. W. Duin, and J. Mao. Statistical pattern recognition: a review. IEEE Trans. on Pattern Analysis and Machine Intelligence, 22(1):4–37, 2000.
 C. Laurier, O. Meyers, J. Serrà, M. Blech, P. Herrera, and X. Serra. Indexing music by mood: design and implementation of an automatic content-based annotator. Multimedia Tools and Applications, 48(1):161–184, 2010.
 M. Mandel and D. P. W. Ellis. Song-level features and support vector machines for music classification. In Proc. ISMIR, pp. 594–599, 2005.
 N. Scaringella, G. Zoia, and D. Mlynek. Automatic genre classification of music content: a survey. IEEE Signal Processing Magazine, 23(2):133–141, 2006.
 J. Shen, M. Wang, S. Yan, H. Pang, and X. Hua. Effective music tagging through advanced statistical modeling. In Proc. SIGIR, 2010.
 E. Tsunoo, T. Akase, N. Ono, and S. Sagayama. Musical mood classification by rhythm and bass-line unit pattern analysis. In Proc. IEEE ICASSP, pp. 265–268, 2010.
 D. Turnbull, L. Barrington, D. Torres, and G. Lanckriet. Semantic annotation and retrieval of music and sound effects. IEEE Trans. on Audio, Speech, and Language Processing, 16(2):467–476, 2008.
 A. Tversky. Features of similarity. Psychological Review, 84(4):327–352, 1977.
 G. Tzanetakis and P. Cook. Musical genre classification of audio signals. IEEE Trans. on Speech and Audio Processing, 10(5):293–302, 2002.
 K. West. Novel techniques for audio music classification and search. PhD Thesis, University of East Anglia, 2008.
 I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed. Elsevier, 2005.
 L. M. Zbikowski. Conceptualizing Music: Cognitive Structure, Theory and Analysis. Oxford University Press, 2002.