Extracting and Using Music Audio Information
Dan Ellis
Laboratory for the Recognition and Organization of Speech and Audio
Dept. of Electrical Engineering, Columbia University, NY USA
http://labrosa.ee.columbia.edu/
1. Motivation: Music Collections
2. Music Information
3. Music Similarity
4. Music Structure Discovery
Music Audio Information - Ellis 2007-11-02 p. 1 /42

LabROSA Overview
[Diagram: LabROSA research areas (information extraction, recognition, separation, retrieval) applied to speech, music, and environmental audio, built on machine learning and signal processing]
1. Managing Music Collections
• A lot of music data available: e.g. 60 GB of MP3 ≈ 1000 hr of audio, 15k tracks
• Management challenge: how can computers help?
• Application scenarios: personal music collections, discovering new music, "music placement"
Learning from Music
• What can we infer from 1000 h of music? Common patterns: sounds, melodies, chords, form; what is and what isn't music
• Data-driven musicology?
• Applications: modeling/description/coding, computer-generated music, curiosity...
[Figure: scatter of PCA(3:6) of 12x16 beat-chroma features]
The Big Picture
[Flow diagram: music audio → low-level features → {melody and notes, key and chords, structure, tempo and beat} → classification and similarity (browsing, discovery), music modeling and generation (production), structure discovery (curiosity); annotated ".. so far"]
2. Music Information
• How to represent music audio?
• Audio features: spectrogram, MFCCs, bases
• Musical elements: notes, beats, chords, phrases; requires transcription
• Or something in between, optimized for a certain task?
[Figure: spectrogram, frequency 0-4000 Hz vs. time 0-5 s]
Transcription as Classification (Poliner & Ellis '05, '06, '07)
• Exchange signal models for data: treat transcription as a pure classification problem (feature representation → feature vector)
• Training data and features: MIDI, multi-track recordings, playback piano, and resampled audio (less than 28 min of training audio); normalized magnitude STFT
• Classification: N binary SVMs (one per note); independent frame-level classification on a 10 ms grid; distance to the class boundary used as a posterior
• Temporal smoothing: two-state (on/off) independent HMM per note, with parameters learned from training data; find the Viterbi sequence for each note
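The smoothing step can be sketched in code: a two-state (off/on) Viterbi decode over one note's frame-level posteriors. This is a minimal NumPy illustration with an assumed self-transition probability, not the transition parameters learned from training data in the actual system:

```python
import numpy as np

def viterbi_smooth(posteriors, p_stay=0.9):
    """Smooth frame-level note posteriors with a 2-state (off/on) HMM.

    posteriors: (n_frames,) array of P(note on | frame), e.g. an SVM
    decision value mapped to [0, 1].
    p_stay: illustrative self-transition probability (assumed here).
    Returns the Viterbi state sequence (0 = off, 1 = on).
    """
    n = len(posteriors)
    # log transition matrix: rows = previous state, cols = next state
    logA = np.log(np.array([[p_stay, 1 - p_stay],
                            [1 - p_stay, p_stay]]))
    # log observation likelihoods for the off/on states
    eps = 1e-12
    logB = np.log(np.vstack([1 - posteriors + eps, posteriors + eps]))
    delta = np.zeros((2, n))
    psi = np.zeros((2, n), dtype=int)
    delta[:, 0] = np.log(0.5) + logB[:, 0]
    for t in range(1, n):
        scores = delta[:, t - 1][:, None] + logA  # [prev, next]
        psi[:, t] = np.argmax(scores, axis=0)     # best predecessor
        delta[:, t] = scores[psi[:, t], [0, 1]] + logB[:, t]
    # backtrace from the best final state
    states = np.zeros(n, dtype=int)
    states[-1] = int(np.argmax(delta[:, -1]))
    for t in range(n - 2, -1, -1):
        states[t] = psi[states[t + 1], t + 1]
    return states
```

With a high self-transition probability, isolated one-frame detections are suppressed while sustained runs of high posterior survive.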
Polyphonic Transcription
• Real music excerpts + ground truth: MIREX 2007
• Frame-level transcription: estimate the fundamental frequency of all notes present on a 10 ms grid
[Bar chart: Precision, Recall, Acc, Etot, Esubs, Emiss, Efa]
• Note-level transcription: group frame-level predictions into note-level transcriptions by estimating onsets/offsets
[Bar chart: Precision, Recall, Ave. F-measure, Ave. Overlap]

Beat Tracking (Ellis '06, '07)
• Goal: one feature vector per 'beat' (tatum), for tempo normalization and efficiency
• "Onset Strength Envelope": O(t) = Σ_f max(0, Δ_t log |X(t, f)|), summed over mel bands
[Figure: mel spectrogram (freq / mel) and onset strength envelope, 0-15 s]
• Autocorrelation + window → global tempo estimate
[Figure: windowed autocorrelation of the onset envelope, peak at 168.5 BPM; lag in 4 ms samples, 0-1000]

Beat Tracking
• Dynamic programming finds beat times {t_i}: optimize Σ_i O(t_i) + Σ_i W((t_{i+1} − t_i − τ_p)/β)
  where O(t) is the onset strength envelope (local score), W is a log-Gaussian window (transition cost), τ_p is the default beat period per the measured tempo, and β sets its width
• Incrementally find the best predecessor at every time; backtrace from the largest final score to get the beats:
  C*(t) = γ O(t) + (1−γ) max_τ { W((t − τ − τ_p)/β) C*(τ) }
  P(t) = argmax_τ { W((t − τ − τ_p)/β) C*(τ) }
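The recursion can be sketched as follows. This is a minimal illustration, not the tuned system: the window width, the γ weighting, the Gaussian (rather than log-Gaussian) window, and the half-to-double-period predecessor search range are all assumptions made for brevity:

```python
import numpy as np

def track_beats(onset_env, period, beta=0.7, gamma=0.5):
    """Dynamic-programming beat tracker sketch for the recursion
    C*(t) = gamma*O(t) + (1-gamma) * max_tau W(.) * C*(tau).

    onset_env: onset strength envelope O(t), one value per frame.
    period: default beat period tau_p in frames (from the tempo estimate).
    beta, gamma: illustrative window-width and weighting parameters.
    Returns beat times as frame indices.
    """
    n = len(onset_env)
    C = np.zeros(n)       # best cumulative score with a beat at t
    P = np.zeros(n, int)  # backpointer: best predecessor beat
    for t in range(n):
        # search predecessors between half and twice the beat period back
        lo, hi = max(0, t - 2 * period), max(0, t - period // 2)
        C[t] = gamma * onset_env[t]
        if hi > lo:
            taus = np.arange(lo, hi)
            # window on the deviation of the gap from the ideal period
            W = np.exp(-0.5 * ((t - taus - period) / (beta * period)) ** 2)
            best = int(np.argmax(W * C[taus]))
            C[t] += (1 - gamma) * W[best] * C[taus[best]]
            P[t] = taus[best]
    # backtrace from the largest final score
    beats = [int(np.argmax(C))]
    while P[beats[-1]] > 0 and beats[-1] - P[beats[-1]] >= period // 2:
        beats.append(int(P[beats[-1]]))
    return beats[::-1]
```

On a synthetic envelope with impulses every 10 frames and `period=10`, the backtrace recovers beats aligned to the impulses.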
Beat Tracking
• DP will bridge gaps (non-causal): there is always a best path
[Figure: Alanis Morissette, "All I Want": spectrogram (freq / Bark band) with a gap, plus tracked beats, 182-192 s]
• 2nd place in MIREX 2006 Beat Tracking, compared against McKinney & Moelants human tapping data
[Figure: test 2 (Bragg): spectrogram (freq / Bark band) and McKinney & Moelants per-subject tap data, 0-15 s]

Chroma Features
• Chroma features convert spectral energy into musical weights in a canonical octave, i.e. 12 semitone bins
[Figure: piano chromatic scale: spectrogram (freq / kHz vs. time / sec) and instantaneous-frequency chroma (bins A-G vs. time / frames)]
• Can resynthesize as "Shepard tones": all octaves at once
[Figure: Shepard tone spectra (level / dB vs. freq / Hz) and Shepard tone resynthesis spectrogram]
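A basic chroma computation can be sketched from an STFT. This is the simple bin-folding variant, not the instantaneous-frequency chroma shown in the figure; the 30 Hz low-frequency cutoff and the A-based pitch-class origin are illustrative choices:

```python
import numpy as np

def chroma_from_stft(mag, sr, n_fft, tuning_hz=440.0):
    """Fold an STFT magnitude spectrum into 12 chroma bins, mapping
    spectral energy onto a canonical octave of semitone classes.

    mag: (n_bins, n_frames) magnitude spectrogram, n_bins = n_fft//2 + 1.
    sr, n_fft: sample rate and FFT length used to compute `mag`.
    Returns a (12, n_frames) chroma matrix, one row per pitch class.
    """
    freqs = np.arange(mag.shape[0]) * sr / n_fft
    chroma = np.zeros((12, mag.shape[1]))
    for k, f in enumerate(freqs):
        if f < 30:            # skip DC and sub-audio bins
            continue
        # semitones above A4, wrapped to a pitch class (A = 0)
        pc = int(np.round(12 * np.log2(f / tuning_hz))) % 12
        chroma[pc] += mag[k]
    # normalize each frame to unit max to discount overall loudness
    peaks = chroma.max(axis=0, keepdims=True)
    return chroma / np.maximum(peaks, 1e-9)
```

A pure 440 Hz tone lands in the A pitch class (bin 0 under this origin) regardless of which octave it is played in.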
Key Estimation (Ellis ICASSP '07)
• Covariance of chroma reflects key
• Normalize by transposing for best fit:
  fit a single-Gaussian model to the aligned chroma of one piece;
  find the ML rotation of each other piece against the aligned global model;
  model all transposed pieces together;
  iterate until convergence
[Figure: chroma covariances (A-G axes) for Beatles tracks: Taxman, Eleanor Rigby, I'm Only Sleeping, Love You To, Yellow Submarine, She Said She Said, Good Day Sunshine, And Your Bird Can Sing]
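The "find ML rotation" step can be sketched as follows. As a simplification, correlation of a piece's mean chroma with a reference mean-chroma model stands in for the full Gaussian likelihood used on the slide:

```python
import numpy as np

def best_rotation(chroma, model_mean):
    """Find the transposition (chroma rotation) that best aligns a piece
    to a reference model.

    chroma: (12, n_frames) chroma matrix for the piece.
    model_mean: (12,) mean chroma of the reference model.
    Returns (rotation, rotated_chroma).
    """
    mean = chroma.mean(axis=1)
    # score all 12 possible transpositions against the model
    scores = [np.dot(np.roll(mean, -r), model_mean) for r in range(12)]
    r = int(np.argmax(scores))
    return r, np.roll(chroma, -r, axis=0)
```

Iterating this alignment against a model re-estimated from all transposed pieces gives the convergence loop described above.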
Chord Transcription (Sheh & Ellis '03)
• "Real Books" give chord transcriptions but no exact timing, just like speech transcripts
  e.g. The Beatles, "A Hard Day's Night": G Cadd9 G F6 G Cadd9 G F6 G C D G C9 G ...
• Use EM to simultaneously learn and align chord models
[Diagram: EM training, illustrated with speech-style phone models]
• Model inventory + labelled training data
• Initialization: uniform parameters Θ_init, initial alignments
• E-step: probabilities of unknowns, p(q_n | X_1^N, Θ_old)
• M-step: maximize parameters, Θ: max E[log p(X, Q | Θ)]
• Repeat until convergence
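The learn-and-align loop can be sketched as a hard-EM (Viterbi-style) procedure. This is a simplified stand-in for the slide's HMM/EM training: mean-chroma templates and squared-error distances replace full Gaussian likelihoods, and the alignment is a monotonic DP through the known chord order:

```python
import numpy as np

def align_chords(chroma, chord_seq, n_iter=5):
    """Hard-EM sketch: alternate (E) a DP forced alignment of frames to
    an untimed chord sequence with (M) re-estimating chord templates.

    chroma: (12, n_frames); chord_seq: chord labels in score order.
    Returns frame labels (indices into chord_seq).
    """
    n = chroma.shape[1]
    k = len(chord_seq)
    # init: uniform segmentation of frames across the chord sequence
    labels = np.minimum((np.arange(n) * k) // n, k - 1)
    for _ in range(n_iter):
        # M-step: mean chroma template per position in the sequence
        templates = np.stack([chroma[:, labels == j].mean(axis=1)
                              if np.any(labels == j) else np.zeros(12)
                              for j in range(k)])
        # E-step: monotonic DP alignment (stay on a chord or advance by 1)
        dist = ((templates[:, :, None] - chroma[None]) ** 2).sum(axis=1)
        D = np.full((k, n), np.inf)
        back = np.zeros((k, n), int)
        D[0, 0] = dist[0, 0]
        for t in range(1, n):
            for j in range(k):
                stay = D[j, t - 1]
                move = D[j - 1, t - 1] if j > 0 else np.inf
                back[j, t] = j if stay <= move else j - 1
                D[j, t] = min(stay, move) + dist[j, t]
        # backtrace, forcing the path to end on the final chord
        labels = np.zeros(n, int)
        labels[-1] = k - 1
        for t in range(n - 1, 0, -1):
            labels[t - 1] = back[labels[t], t]
    return labels
```

Each pass sharpens the templates, which in turn sharpens the alignment: the same alternation the full EM performs in expectation.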
Chord Transcription
• Frame-level accuracy:
  Feature   Recog.   Alignment
  MFCC      8.7%     22.0%
  PCP_ROT   21.7%    76.0%
  (random ~3%)
• MFCCs are poor (can overtrain); PCPs are better (rotation helps generalization)
• Needed more training data...
[Figure: Beatles, "Eight Days a Week" (Beatles For Sale, 4096 pt): pitch-class intensity vs. time, 16.27-24.84 s; true: E G D Bm G / align: E G D Bm G / recog: E G Bm Am Em7 Bm Em7]
3. Music Similarity
• The most central problem: it motivates extracting musical information and supports real applications (playlists, discovery)
• But do we need content-based similarity? It must compete with collaborative filtering and with fingerprinting + metadata
• Maybe ... for the Future of Music: connect listeners directly to musicians
Discriminative Classification (Mandel & Ellis '05)
• Classification as a proxy for similarity
• Distribution models: fit a GMM to each training artist's MFCCs, then label a test song by the minimum KL divergence to the artist models
• vs. SVM: compute song-level features from the MFCCs and pick the artist with a DAG-SVM
[Diagram: GMM + min-KL pipeline vs. song-feature + DAG-SVM pipeline]
Segment-Level Features (Mandel & Ellis '07)
• Statistics of spectra and envelope define a point in feature space, for SVM classification and Euclidean similarity
[Embedded poster: M. Mandel and D. Ellis, "LabROSA's audio music similarity and classification systems", MIREX 2007]
• Spectral features are the same as Mandel and Ellis (2005): mean and unwrapped covariance of MFCCs
• Temporal features are similar to those in Rauber et al. (2002):
  break the input into overlapping 10-second clips;
  analyze the mel spectrum of each clip, averaging the clips' features;
  combine mel frequencies into magnitude bands for low, low-mid, high-mid, and high frequencies;
  an FFT in time gives the modulation frequency; keep the magnitude of the lowest 20% of frequencies;
  a DCT in modulation frequency gives the envelope cepstrum;
  stack the four bands' envelope cepstra into one feature vector
• Each feature is normalized across all of the songs to be zero-mean, unit-variance
• A DAG-SVM gives n-way classification of songs (Platt et al., 2000); the distance between songs is the Euclidean distance between their feature vectors
• During testing, some songs were a top similarity match for many songs; re-normalizing each song's feature vector avoided this problem
[Bar charts: Audio Music Similarity and Audio Classification results across MIREX 2007 systems]

References
• M. Mandel and D. Ellis. Song-level features and support vector machines for music classification. In J. Reiss and G. Wiggins, editors, Proc. ISMIR, pages 594-599, 2005.
• J. C. Platt, N. Cristianini, and J. Shawe-Taylor. Large margin DAGs for multiclass classification. In S. A. Solla, T. K. Leen, and K.-R. Mueller, editors, Advances in Neural Information Processing Systems 12, pages 547-553, 2000.
• A. Rauber, E. Pampalk, and D. Merkl. Using psycho-acoustic models and self-organizing maps to create a hierarchical structuring of music by sound similarity. In M. Fingerhut, editor, Proc. ISMIR, pages 71-80, 2002.

MIREX 2007, 3rd Music Information Retrieval Evaluation eXchange, 26 September 2007, Vienna, Austria
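The temporal-feature pipeline (band envelope → modulation spectrum → DCT) can be sketched as follows. The band count, cepstrum size, and clip handling here are illustrative assumptions, not the exact MIREX configuration:

```python
import numpy as np

def envelope_cepstra(mel_spec, n_keep=None, n_ceps=8):
    """For each band of a clip's mel spectrum, take the FFT of the
    magnitude envelope over time (the modulation spectrum), keep the
    lowest ~20% of modulation frequencies, and apply a DCT to get an
    envelope cepstrum; stack the bands into one feature vector.

    mel_spec: (n_bands, n_frames) mel magnitude envelopes for one clip.
    Returns a vector of n_bands * n_ceps values.
    """
    n_bands, n_frames = mel_spec.shape
    n_keep = n_keep or max(2, n_frames // 5)      # lowest 20% of mod freqs
    feats = []
    for band in mel_spec:
        mod = np.abs(np.fft.rfft(band))[:n_keep]  # modulation magnitudes
        # DCT-II over modulation frequency gives the envelope cepstrum
        k = np.arange(n_keep)
        basis = np.cos(np.pi * np.outer(np.arange(n_ceps), k + 0.5) / n_keep)
        feats.append(basis @ mod)
    return np.concatenate(feats)
```

The resulting vectors would then be z-scored across songs before SVM training, per the normalization step above.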
MIREX'07 Results
• One system for similarity and classification
[Bar charts from the same poster: Audio Music Similarity (systems PS, GT, LB, CB1, TL1, ME, TL2, CB2, CB3, BK1, PC, BK2) and Audio Classification (genre, mood, composer, and artist tasks)]

Active-Learning Playlists
• SVMs are well suited to "active learning": solicit labels on the items closest to the current decision boundary
• Automatic player with "skip" = ground-truth data collection, feeding active-SVM automatic playlist generation
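The active-learning step (query the item nearest the decision boundary) can be sketched as follows. A least-squares linear scorer stands in for the SVM, and the synthetic features and play/skip labels are purely illustrative:

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares linear scorer as a stand-in for the SVM:
    returns weights w such that sign(features @ w) predicts the label."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # add a bias column
    w, *_ = np.linalg.lstsq(Xb, 2 * y - 1, rcond=None)
    return w

def next_query(w, candidates):
    """Pick the unlabeled song closest to the decision boundary: the
    item whose play/skip label is most informative for the model."""
    Xb = np.hstack([candidates, np.ones((len(candidates), 1))])
    margins = np.abs(Xb @ w)
    return int(np.argmin(margins))

# usage sketch (synthetic stand-in for song feature vectors):
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = (X[:, 0] > 0).astype(int)   # play (1) vs. skip (0)
w = fit_linear(X[:10], y[:10])  # seed model from first listener feedback
i = next_query(w, X[10:])       # index of the song to present next
```

Each skip or play refits the model, and the next queried song again sits near the boundary, so labels accumulate where they help most.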
Cover Song Detection (Ellis & Poliner '07)
• "Cover songs" = reinterpretations of a piece: different instrumentation and character, so no match with "timbral" features
[Figure: "Let It Be", The Beatles vs. Nick Cave, verse 1 spectrograms (freq / kHz vs. time / sec)]
• Need a different representation: beat-synchronous chroma features
[Figure: beat-synchronous chroma for both versions (chroma bins A-G vs. beats)]
Beat-Synchronous Chroma Features
• Beat times + chroma features on 30 ms frames → average chroma within each beat
• Compact; sufficient?
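The averaging step is a one-liner per beat; a minimal sketch, assuming frame-level chroma and detected beat frame indices are already available:

```python
import numpy as np

def beat_sync_chroma(chroma, beat_frames):
    """Average chroma columns within each inter-beat span to get one
    12-dimensional vector per beat.

    chroma: (12, n_frames) frame-level chroma (e.g. 30 ms frames).
    beat_frames: increasing frame indices of detected beats.
    Returns a (12, n_beats - 1) matrix, one column per beat span.
    """
    cols = []
    for start, end in zip(beat_frames[:-1], beat_frames[1:]):
        cols.append(chroma[:, start:end].mean(axis=1))
    return np.stack(cols, axis=1)
```

This collapses tempo differences between versions of a piece, which is what makes the representation usable for cover-song matching.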