Music Information Retrieval for Jazz
Dan Ellis
Laboratory for the Recognition and Organization of Speech and Audio (LabROSA)
Dept. of Electrical Engineering, Columbia University, New York, USA

{dpwe,thierry}@ee.columbia.edu http://labrosa.ee.columbia.edu/

1. Music Information Retrieval
2. Automatic Tagging
3. Musical Content
4. Future Work

MIR for Jazz - Dan Ellis - 2012-11-15

Machine Listening
• Extracting useful information from ... like (we) animals do
[Figure: the machine-listening task space - tasks from detect (VAD, speech/music discrimination) through classify (ASR, music transcription, environment awareness) to describe (emotion, music recommendation, narration), across the speech, music, and environmental-sound domains: "sound intelligence"]

1. The Problem
• We have a lot of music ... can computers help?
• Applications
◦ archive organization
◦ music recommendation
◦ musicological insight?

Music Information Retrieval (MIR)
• Small field that has grown since ~2000
◦ musicologists, engineers, librarians
◦ significant commercial interest
• MIR as the musical analog of text IR: find stuff in large archives
• Popular tasks
◦ genre classification
◦ chord, melody, full transcription
◦ music recommendation
• Annual evaluations
◦ "standard" test corpora - mostly pop

2. Automatic Tagging

• Statistical pattern recognition: matching the signal to training examples
[Figure: pipeline - sensor → signal → pre-processing/segmentation → segment → feature extraction → feature vector → classification → class → post-processing]
• Need
◦ feature design
◦ labeled training examples
• Applications
◦ genre ◦ instrumentation ◦ artist ◦ studio ...

Features: MFCC
• Mel-Frequency Cepstral Coefficients: the standard features from speech recognition

0.5 Sound 0

−0.5 FFT X[k] 0.25 0.255 0.26 0.265 0.27 time / s spectra 10 5

Mel scale 0 4 freq. warp 0x 10 1000 2000 3000 freq / Hz 2 1 0 log |X[k]| 0 5 10 15 freq / Mel 100 audspec 50 0

IFFT 0 5 10 15 freq / Mel 200

cepstra 0

−200 Truncate 0 10 20 30 quefrency
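The chain can be sketched in NumPy/SciPy; the triangular filterbank, frame and filter sizes, and the 440 Hz test frame are illustrative choices, not the talk's exact parameters:

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, sr, n_fft=512, n_mels=20, n_ceps=13):
    """One frame: FFT -> mel-scale warp -> log (audspec) -> DCT -> truncate."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n_fft))
    # triangular mel filterbank from 0 Hz to the Nyquist frequency
    pts = mel_to_hz(np.linspace(0, hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_mels, len(spec)))
    for i in range(n_mels):
        lo, c, hi = bins[i], bins[i + 1], bins[i + 2]
        if c > lo:
            fbank[i, lo:c] = (np.arange(lo, c) - lo) / (c - lo)
        if hi > c:
            fbank[i, c:hi] = (hi - np.arange(c, hi)) / (hi - c)
    audspec = np.log(fbank @ spec + 1e-10)          # log mel spectrum
    return dct(audspec, norm='ortho')[:n_ceps]      # keep low-quefrency cepstra

sr = 16000
frame = np.sin(2 * np.pi * 440 * np.arange(400) / sr)   # 25 ms at 16 kHz
print(mfcc(frame, sr).shape)                             # (13,)
```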

Representing Audio
• MFCCs are short-time features (25 ms)
• Sound is a "trajectory" in MFCC space

• Describe a whole track by its statistics
[Figure: "VTS_04_0001" - spectrogram → MFCC features over time → MFCC covariance matrix (mean and covariance over all frames)]
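A sketch of the track-level statistics: summarize the frame-by-frame MFCCs as a mean vector plus covariance matrix, flattened into one fixed-length feature vector:

```python
import numpy as np

def track_stats(mfccs):
    """Summarize a track as the mean and covariance of its MFCC frames.

    mfccs: array of shape (n_frames, n_dims), one MFCC vector per frame.
    Returns the per-dimension mean concatenated with the upper triangle
    of the covariance matrix (it is symmetric, so that is sufficient).
    """
    mu = mfccs.mean(axis=0)
    cov = np.cov(mfccs, rowvar=False)
    iu = np.triu_indices(cov.shape[0])
    return np.concatenate([mu, cov[iu]])

# e.g. 1000 frames of 20-dimensional MFCCs
feats = track_stats(np.random.randn(1000, 20))
print(feats.shape)   # 20 + 20*21/2 = 230 dimensions
```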

MFCCs for Music
• Can resynthesize MFCCs by shaping noise
◦ gives an idea of the information retained

[Figure: "Freddie Freeloader" - original spectrogram, the 12 MFCCs (mel quefrency), and the noise resynthesis, 0-60 s]

Ground Truth (Mandel & Ellis ’08)
• MajorMiner: free-text tags for 10 s clips
◦ 400 users, 7,500 unique tags, 70,000 taggings

• Example: drum, bass, piano, jazz, slow, instrumental, saxophone, soft, quiet, club, ballad, smooth, soulful, easy_listening, swing, improvisation, 60s, cool, light

Classification
[Figure: pipeline - sound → chop into 10 s blocks → MFCC (20 dims) + Δ + Δ² (60 dims) → covariance Σ → standardize (µ, σ) → one-vs-all SVM per tag (C, γ; 399 samples) → average precision against ground truth]

• MFCC features + human ground truth + standard machine learning tools
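A sketch of that combination using scikit-learn (an assumption; the talk does not name an implementation), on stand-in random features and labels:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy stand-in data: 200 ten-second blocks, 230-dim covariance features,
# and one binary tag.  Real features and MajorMiner labels replace these.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 230))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

scaler = StandardScaler().fit(X)               # standardize each dimension
clf = SVC(kernel='rbf', C=1.0, gamma='scale')  # one classifier per tag
clf.fit(scaler.transform(X), y)
acc = clf.score(scaler.transform(X), y)        # training accuracy
print(acc)
```

In the one-vs-all setup, one such classifier is trained per tag, each scoring every 10 s block independently.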

Classification Results
• Classifiers trained from the top 50 tags
• Example: "01 Soul Eyes"

[Figure: spectrogram of "Soul Eyes" (log frequency, 0-320 s) and per-tag classifier outputs over time for the top 50 tags, from drum, guitar, male, synth, rock up through jazz, hip_hop, rap, ... to trance and club, color-coded from about -2 to +1.5]

3. Musical Content
• MFCCs (and speech recognizers) don't respect pitch

◦ pitch is important: visible in the spectrogram
[Figure: log-frequency spectrogram, 46-54 s]
• Pitch-related tasks
◦ note transcription
◦ chord transcription
◦ matching by musical content ("cover songs")

Note Transcription (Poliner & Ellis ’05, ’06, ’07)
• Training data and features:
◦ MIDI, multi-track recordings, playback piano, and resampled audio (less than 28 min of training audio)
◦ normalized magnitude STFT

• Classification:
◦ N binary SVMs (one for each note)
◦ independent frame-level classification on a 10 ms grid
◦ distance to the class boundary as the posterior

• Temporal smoothing:
◦ a two-state (on/off) independent HMM for each note, with parameters learned from the training data
◦ find the Viterbi sequence for each note
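The smoothing step can be sketched as a tiny Viterbi decode; the `p_stay` self-transition probability here is an assumed value, whereas the original system learned the transition parameters from data:

```python
import numpy as np

def viterbi_smooth(post, p_stay=0.9):
    """Two-state (off/on) Viterbi smoothing of frame-level note posteriors.

    post: P(note on) for each 10 ms frame.  Returns a 0/1 state sequence.
    """
    T = len(post)
    log_emit = np.log(np.vstack([1.0 - post, post]) + 1e-12)   # (2, T)
    log_A = np.log(np.array([[p_stay, 1.0 - p_stay],
                             [1.0 - p_stay, p_stay]]))
    delta = log_emit[:, 0].copy()           # best log-score per state
    back = np.zeros((2, T), dtype=int)      # traceback pointers
    for t in range(1, T):
        scores = delta[:, None] + log_A     # scores[i, j]: from state i to j
        back[:, t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[:, t]
    states = np.zeros(T, dtype=int)
    states[-1] = delta.argmax()
    for t in range(T - 1, 0, -1):           # trace back
        states[t - 1] = back[states[t], t]
    return states

post = np.array([0.1, 0.2, 0.9, 0.4, 0.95, 0.9, 0.1, 0.05])
print(viterbi_smooth(post))   # isolated dips in the posterior get smoothed over
```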

Polyphonic Transcription
• Real music excerpts + ground truth: MIREX 2007
• Frame-level transcription: estimate the set of all notes present on a 10 ms grid

[Figure: frame-level transcription scores (0-1.25): Precision, Recall, Acc, Etot, Esubs, Emiss, Efa]
• Note-level transcription: group frame-level predictions into note-level transcriptions by estimating onsets/offsets
[Figure: note-level transcription scores: Precision, Recall, Ave. F-measure, Ave. Overlap]

Chroma Features (Fujishima 1999)
• Idea: project onto 12 semitones regardless of octave
◦ maintains the main "musical" distinction
◦ invariant to octave equivalence
◦ no need to worry about ... (Warren et al. 2003)
[Figure: spectrogram and its 12-bin chroma representation (A, C, D, F, G, ...) over time]
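A minimal sketch of the semitone folding (simple nearest-pitch-class binning; the 55 Hz reference and FFT size are illustrative choices):

```python
import numpy as np

def chroma(spec, sr, n_fft, fmin=55.0):
    """Fold an FFT magnitude spectrum onto 12 pitch classes by
    nearest-semitone binning.  fmin plays the role of the reference
    frequency k0 (55 Hz = A1 here, an arbitrary choice)."""
    C = np.zeros(12)
    freqs = np.arange(1, len(spec)) * sr / n_fft     # skip the DC bin
    semis = 12.0 * np.log2(freqs / fmin)             # semitones above fmin
    bins = np.round(semis).astype(int) % 12          # nearest pitch class
    np.add.at(C, bins, spec[1:] ** 2)                # accumulate energy
    return C

sr, n_fft = 16000, 4096
x = np.sin(2 * np.pi * 440 * np.arange(n_fft) / sr)  # A4 = 440 Hz
C = chroma(np.abs(np.fft.rfft(x)), sr, n_fft)
print(C.argmax())   # pitch class 0 (A, relative to the 55 Hz reference)
```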

$$C(b) = \sum_{k=0}^{N} B\big(12\log_2(k/k_0) - b\big)\,W(k)\,|X[k]|^2$$

◦ W(k) is a weighting; B(b) selects bins whose pitch class matches b (mod 12)

Chroma Resynthesis (Ellis & Poliner 2007)
• Chroma describes the notes in an octave ... but not the octave
• Can resynthesize by presenting all octaves ... with a smooth envelope
◦ "Shepard tones": the octave is ambiguous

$$y_b(t) = \sum_{o=1}^{M} W\big(o + \tfrac{b}{12}\big)\cos\big(2^{\,o + b/12}\,\omega_0 t\big)$$

[Figure: Shepard tone spectra (level/dB vs. frequency) and resynthesis spectrogram; the "endless sequence" illusion]
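Shepard tones can be sketched directly; the 27.5 Hz base frequency, octave count, and Gaussian envelope below are illustrative choices, not the talk's exact parameters:

```python
import numpy as np

def shepard(b, sr=16000, dur=0.5, n_oct=6, f0=27.5):
    """Shepard tone for chroma class b: the same pitch class in every
    octave, under a smooth spectral envelope so the octave is ambiguous."""
    t = np.arange(int(sr * dur)) / sr
    y = np.zeros_like(t)
    for o in range(1, n_oct + 1):
        f = f0 * 2.0 ** (o + b / 12.0)                 # this octave's frequency
        w = np.exp(-0.5 * ((o + b / 12.0 - n_oct / 2.0) / 1.5) ** 2)  # envelope W
        y += w * np.cos(2 * np.pi * f * t)
    return y / np.abs(y).max()

tone = shepard(b=0)   # chroma class 0 (A, given f0 = 27.5 Hz)
print(tone.shape)
```

Played back, `tone` has an unambiguous pitch class but no clear octave.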

Chroma Example
• Simple Shepard-tone resynthesis
◦ can also reimpose the broad spectrum from the MFCCs
[Figure: "Freddie Freeloader" - original spectrogram, chroma representation, and Shepard-tone resynthesis, 0-60 s]

Onset Detection (Bello et al. 2005)
• Simplest thing is the energy envelope:

$$e(n_0) = \sum_{n=-W/2}^{W/2} w[n]\,|x(n + n_0)|^2$$

◦ emphasis on high frequencies helps: compare $\sum_f |X(f,t)|$ with $\sum_f f\,|X(f,t)|$
[Figure: "Harnoncourt" and "Maracatu" examples - spectrograms and both envelope variants, 0-10 s]
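The energy envelope can be sketched directly from its definition; window and hop sizes here are illustrative choices:

```python
import numpy as np

def energy_envelope(x, win=256, hop=128):
    """Windowed energy e(n0) = sum_n w[n] |x(n + n0)|^2, evaluated
    every `hop` samples."""
    w = np.hanning(win)
    return np.array([np.sum(w * x[n0:n0 + win] ** 2)
                     for n0 in range(0, len(x) - win, hop)])

sr = 8000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t) * (t % 0.5 < 0.1)   # two short note bursts
env = energy_envelope(x)
print(env.shape)
```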

Tempo Estimation
• Beat tracking (may) need a global tempo period τ
◦ otherwise the problem lacks the "optimal substructure" needed for dynamic programming

• Pick the peak in the autocorrelation of the onset envelope, after applying a "human preference" window
◦ check for sub-beat periods (primary vs. secondary tempo period)
[Figure: onset strength envelope (8-12 s); raw and windowed autocorrelation vs. lag (0-4 s), with the primary and secondary tempo periods marked]
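A sketch of the peak-picking idea: autocorrelate the onset envelope and weight by a tempo-preference window (the log-normal window centered at 0.5 s is an assumed shape):

```python
import numpy as np

def tempo_period(env, frame_rate, max_lag_s=4.0):
    """Pick the global tempo period as the biggest peak of the onset
    envelope's autocorrelation, weighted by a "human preference" window."""
    env = env - env.mean()
    n = int(max_lag_s * frame_rate)
    ac = np.array([np.dot(env[:-lag], env[lag:]) for lag in range(1, n)])
    lags = np.arange(1, n) / frame_rate                  # lag in seconds
    pref = np.exp(-0.5 * (np.log2(lags / 0.5)) ** 2)     # prefer lags near 0.5 s
    return lags[np.argmax(ac * pref)]

frame_rate = 100
env = np.zeros(1000)
env[::50] = 1.0                        # a pulse every 0.5 s
print(tempo_period(env, frame_rate))   # ~0.5
```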

Beat Tracking by Dynamic Programming
• To optimize

$$C(\{t_i\}) = \sum_{i=1}^{N} O(t_i) + \sum_{i=2}^{N} F(t_i - t_{i-1}, \tau_p)$$

◦ define C*(t) as the best score up to time t
◦ then build up recursively (with traceback P(t)):

$$C^*(t) = O(t) + \max_{\tau}\{\alpha F(t - \tau, \tau_p) + C^*(\tau)\}$$
$$P(t) = \arg\max_{\tau}\{\alpha F(t - \tau, \tau_p) + C^*(\tau)\}$$

◦ the final beat sequence {t_i} is the best C* plus back-trace
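A sketch of the recursion on a synthetic onset envelope; the interval score F (squared log deviation from the tempo period) and the weight α are assumed forms:

```python
import numpy as np

def beat_track(onset, period, alpha=100.0):
    """DP beat tracking: C*(t) = O(t) + max_tau {alpha*F(t - tau, period)
    + C*(tau)}, searching tau over intervals of 0.5x to 2x the period."""
    T = len(onset)
    cstar = onset.astype(float).copy()
    back = np.full(T, -1, dtype=int)
    for t in range(T):
        lo = max(0, t - 2 * period)
        hi = max(0, t - period // 2)
        if hi <= lo:
            continue
        taus = np.arange(lo, hi)
        F = -(np.log((t - taus) / period) ** 2)   # penalize off-period intervals
        scores = alpha * F + cstar[taus]
        best = int(np.argmax(scores))
        if scores[best] > 0:                      # only chain onto positive scores
            cstar[t] += scores[best]
            back[t] = taus[best]
    beats = [int(np.argmax(cstar))]               # best end point, then back-trace
    while back[beats[-1]] >= 0:
        beats.append(int(back[beats[-1]]))
    return beats[::-1]

onset = np.zeros(400)
onset[::50] = 1.0                      # an onset every 50 frames
print(beat_track(onset, period=50))    # beats land on the impulses
```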

Beat Tracking Results
• Prefers drums & a steady tempo
[Figure: "Soul Eyes" - onset envelope with tracked beats (0-25 s), and the autocorrelation (period in 4 ms samples)]

Beat-Synchronous Chroma
• Record one chroma vector per beat
◦ a compact representation of the harmonies
[Figure: "Freddie Freeloader" - spectrogram; beat-synchronous chroma (beats 0-130); Shepard-tone resynthesis; resynthesis with the MFCC spectrum reimposed]
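The per-beat pooling can be sketched as simple averaging of frame-level chroma between beat boundaries:

```python
import numpy as np

def beat_sync(chroma, beat_frames):
    """Average frame-level chroma between consecutive beat times:
    one 12-dimensional column per beat interval."""
    cols = [chroma[:, s:e].mean(axis=1)
            for s, e in zip(beat_frames[:-1], beat_frames[1:])]
    return np.stack(cols, axis=1)

chroma = np.random.rand(12, 100)        # 12 chroma bins x 100 frames
beats = [0, 25, 50, 75, 100]            # beat boundaries in frames
print(beat_sync(chroma, beats).shape)   # (12, 4)
```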

Chord Recognition
• Beat-synchronous chroma look like chords
[Figure: chroma bins over time, with A-C-E, C-E-G, B-D-G, and A-C-D-F patterns visible]
◦ can we transcribe them?
• Two approaches
◦ manual templates (prior knowledge)
◦ learned models (from training data)
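The template approach can be sketched with binary major/minor triad templates and cosine similarity (the template set and scoring here are illustrative):

```python
import numpy as np

NOTES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def chord_templates():
    """Binary chroma templates for the 12 major and 12 minor triads."""
    temps, names = [], []
    for root in range(12):
        for suffix, intervals in [('', (0, 4, 7)), (':min', (0, 3, 7))]:
            t = np.zeros(12)
            t[[(root + i) % 12 for i in intervals]] = 1.0
            temps.append(t / np.linalg.norm(t))
            names.append(NOTES[root] + suffix)
    return np.array(temps), names

def label(chroma_vec, temps, names):
    """Pick the template with the highest cosine similarity."""
    v = chroma_vec / (np.linalg.norm(chroma_vec) + 1e-12)
    return names[int(np.argmax(temps @ v))]

temps, names = chord_templates()
c_major = np.zeros(12); c_major[[0, 4, 7]] = 1.0   # C, E, G
print(label(c_major, temps, names))                 # 'C'
```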

Chord Recognition System (Sheh & Ellis 2003)
• Analogous to speech recognition
◦ Gaussian models of the features for each chord
◦ Hidden Markov Models for the chord transitions

[Figure: system diagram - audio → beat track → band-pass filter (100-1600 Hz at test, 25-400 Hz at train) → chroma → beat-synchronous chroma features → HMM + Viterbi → chord labels; training side: labels resampled to the beat grid, 24 root-normalized Gaussian chord models (12 major, 12 minor), and a 24x24 transition matrix from counted chord transitions]

Chord Recognition
• Often works: "Let It Be" (06-Let It Be)

[Figure: "Let It Be" - spectrogram; ground-truth chords (C, G, A:min, C, G, F, C ... F:maj7, F:maj6, A:min/b7 ...); beat-synchronous chroma; recognized chords (C, G, a, F, C, G, F, C ... G, a)]
• But only about 60% of the time

What did the models learn?
• Chord model centers (means) indicate chord "templates":

[Figure: "PCP_ROT family model means (train18)" - mean chroma templates for the DIM, DOM7, MAJ, MIN, and MIN7 chord families, plotted C through B (for C-root chords)]

Chords for Jazz
• How many chord types?
[Figure: "Freddie Freeloader" - log-frequency spectrogram (0-60 s); beat-synchronous chroma; chord likelihoods with the Viterbi path; chord-based chroma reconstruction (beats 0-130)]

Future Work

• Matching items
◦ cover songs / standards
◦ similar instruments, styles
[Figure: beat-chroma matrices for "Between the Bars" by Elliott Smith and the Glenn Phillips cover (skewed by -17 beats, transposed 2 semitones), and their pointwise product]
• Analyzing musical content
◦ solo transcription & modeling
◦ musical structure
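A simplified sketch of the matching idea: normalize two beat-chroma matrices, try all 12 transpositions, and keep the best pointwise-product score (the full system also searches over beat skews, which this sketch omits):

```python
import numpy as np

def match_score(A, B):
    """Cover-song match between two beat-chroma matrices (12 x n_beats):
    best normalized pointwise-product score over all 12 chroma rotations."""
    n = min(A.shape[1], B.shape[1])
    A, B = A[:, :n], B[:, :n]
    A = A / (np.linalg.norm(A) + 1e-12)
    B = B / (np.linalg.norm(B) + 1e-12)
    return max(float(np.sum(A * np.roll(B, r, axis=0))) for r in range(12))

rng = np.random.default_rng(1)
X = rng.random((12, 300))
Y = np.roll(X, 2, axis=0)    # the same "piece", transposed 2 semitones
Z = rng.random((12, 300))    # an unrelated "piece"
print(match_score(X, Y) > match_score(X, Z))   # transposed copy matches better
```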

• And so much more...

Summary
• Finding musical similarity at large scale

[Figure: overview - music audio yields low-level features (melody and notes, key and chords, tempo and beat, structure) feeding browsing, classification and similarity, music modeling, and generation, in support of discovery, production, and curiosity]
