
Extracting and Using Music Audio Information
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio (LabROSA)
Dept. of Electrical Engineering, Columbia University, NY USA
http://labrosa.ee.columbia.edu/

1. Motivation: Music Collections
2. Music Information
3. Music Similarity
4. Music Structure Discovery

LabROSA Overview

[Diagram: LabROSA research areas: information extraction, recognition, separation, and retrieval, applied to music, environmental sound, and speech, built on machine learning and signal processing]

1. Managing Music Collections
• A lot of music data is available: e.g. 60 GB of MP3 ≈ 1000 hr of audio, 15k tracks
• Management challenge: how can computers help?

• Application scenarios: personal music collection, discovering new music, “music placement”

Learning from Music
• What can we infer from 1000 h of music? common patterns: melodies, chords, form; what is and what isn’t music
[Figure: scatter of PCA(3:6) of 12x16 beat-chroma patches]
• Data driven?
• Applications: modeling/description/coding, computer generated music, curiosity...

The Big Picture
[Diagram: music audio → low-level features → melody and notes, key and chords, tempo and beat, structure → browsing, classification and similarity, music modeling, generation; serving discovery, production, and curiosity]
.. so far

2. Music Information
• How to represent music audio?
• Audio features: spectrogram, MFCCs, bases
[Figure: spectrogram, frequency 0-4000 Hz over 0-5 s]
• Musical elements: notes, beats, chords, phrases; requires transcription
• Or something in between? optimized for a certain task?

Transcription as Classification (Poliner & Ellis ’05, ’06, ’07)
• Exchange signal models for data: treat transcription as a pure classification problem
• Feature representation → feature vector. Training data and features: MIDI, multi-track recordings, playback piano, & resampled audio (less than 28 minutes of training audio); normalized magnitude STFT
• Classification → posteriors. N binary SVMs (one for each note); independent frame-level classification on a 10 ms grid; distance to the class boundary as the posterior
• HMM smoothing. Temporal smoothing: a two-state (on/off) independent HMM for each note, parameters learned from the training data; find the Viterbi sequence for each note
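A minimal sketch of this pipeline, assuming numpy and scikit-learn; the LinearSVC choice and the sigmoid squashing of the SVM distance-to-boundary are stand-ins for details the slide leaves open:

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_note_classifiers(X, Y):
    """One binary SVM per note. X: (n_frames, n_bins) normalized magnitude
    STFT frames; Y: (n_frames, n_notes) binary piano-roll labels."""
    return [LinearSVC(C=1.0, dual=False).fit(X, Y[:, k]) for k in range(Y.shape[1])]

def smooth_note(dec_values, p_stay=0.9):
    """Two-state (off/on) HMM Viterbi smoothing for one note; emission
    pseudo-posteriors come from squashing the SVM decision values."""
    post = 1.0 / (1.0 + np.exp(-dec_values))                # P(note on | frame)
    emit = np.log(np.vstack([1.0 - post, post]) + 1e-9)     # (2, T)
    trans = np.log(np.array([[p_stay, 1 - p_stay],
                             [1 - p_stay, p_stay]]))
    T = len(dec_values)
    delta = np.zeros((2, T)); psi = np.zeros((2, T), dtype=int)
    delta[:, 0] = emit[:, 0]
    for t in range(1, T):
        cand = delta[:, t - 1, None] + trans                # cand[i, j]: from i to j
        psi[:, t] = cand.argmax(axis=0)
        delta[:, t] = cand.max(axis=0) + emit[:, t]
    path = np.empty(T, dtype=int); path[-1] = delta[:, -1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[path[t + 1], t + 1]
    return path                                             # 1 where the note is on

# Sketch usage: one on/off track per note per 10 ms frame
# piano_roll = np.array([smooth_note(svm.decision_function(X_test)) for svm in svms])
```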

Polyphonic Transcription
• Real music excerpts + ground truth (MIREX 2007)
• Frame-level transcription: estimate the fundamental frequency of all notes present on a 10 ms grid
[Bar chart: Precision, Recall, Acc, Etot, Esubs, Emiss, Efa per MIREX 2007 entry]
• Note-level transcription: group frame-level predictions into note-level transcriptions by estimating onset/offset
[Bar chart: Precision, Recall, Ave. F-measure, Ave. Overlap per MIREX 2007 entry]

Beat Tracking (Ellis ’06, ’07)
• Goal: one feature vector per ‘beat’ (tatum), for tempo normalization and efficiency
• “Onset Strength Envelope”: O(t) = Σ_f max(0, Δ_t log |X(t, f)|)
[Figure: mel spectrogram with the onset strength envelope below, 0-15 s]

• Autocorrelation + window → global tempo estimate
[Figure: windowed autocorrelation of the onset envelope, peak at 168.5 BPM; lag in 4 ms samples]

Beat Tracking
• Dynamic programming finds the beat times {t_i}: optimizes γ Σ_i O(t_i) + (1-γ) Σ_i W((t_{i+1} - t_i - τ_p)/β)
  where O(t) is the onset strength envelope (local score), W(·) is a log-Gaussian window (transition cost), and τ_p is the default beat period per the measured tempo
• Incrementally find the best predecessor at every time; backtrace from the largest final score to get the beats

C*(t) = γ O(t) + (1-γ) max_τ { W((t - τ - τ_p)/β) C*(τ) }
P(t) = argmax_τ { W((t - τ - τ_p)/β) C*(τ) }
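A sketch of the onset envelope and the DP recursion; the parameter values are illustrative guesses, and the log-Gaussian window is applied additively here rather than multiplicatively as written above:

```python
import numpy as np

def onset_strength(log_mel):
    """O(t) = sum_f max(0, delta_t log|X(t, f)|), per the slide's formula.
    log_mel: (n_bands, n_frames) log-magnitude mel spectrogram."""
    return np.maximum(0.0, np.diff(log_mel, axis=1)).sum(axis=0)

def track_beats(O, period, beta=0.7, gamma=0.1):
    """DP beat tracking sketch: trade local onset strength against a
    log-Gaussian penalty on deviation from the target period (in frames)."""
    n = len(O)
    score = np.array(O, dtype=float)
    prev = np.full(n, -1)
    for t in range(n):
        # candidate predecessors between half and twice the beat period back
        lo, hi = max(0, t - 2 * period), max(0, t - period // 2)
        if hi > lo:
            taus = np.arange(lo, hi)
            w = -0.5 * (np.log((t - taus) / period) / beta) ** 2  # log-Gaussian window
            cand = w + score[taus]
            k = cand.argmax()
            score[t] = gamma * O[t] + (1 - gamma) * cand[k]
            prev[t] = taus[k]
    beats = [int(score.argmax())]            # backtrace from the largest final score
    while prev[beats[-1]] >= 0:
        beats.append(int(prev[beats[-1]]))
    return beats[::-1]
```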

Beat Tracking
• DP will bridge gaps (non-causal): there is always a best path
[Figure: Alanis Morissette, “All I Want”: Bark-band spectrogram around a gap, tracked beats overlaid, 182-192 s]
• 2nd place in MIREX 2006 Beat Tracking; compared to McKinney & Moelants human data
[Figure: test 2 (Bragg), McKinney + Moelants subject data: Bark-band spectrogram plus histogram of subject taps vs. time]

Chroma Features
• Chroma features convert spectral energy into musical weights on a canonical chromatic scale, i.e. 12 bins
[Figure: piano chromatic scale: spectrogram and IF chroma vs. time]

• Can resynthesize as “Shepard tones”: all octaves at once
[Figure: 12 Shepard tone spectra (level/dB vs. frequency) and Shepard-tone resynthesis spectrogram]
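A coarse stand-in for the slide's instantaneous-frequency chroma, assuming a plain magnitude STFT; each FFT bin simply votes for its nearest semitone:

```python
import numpy as np

def simple_chroma(S, sr, n_fft, tuning_hz=440.0):
    """Fold an STFT magnitude matrix (n_bins x n_frames) onto 12 pitch classes."""
    freqs = np.arange(1, S.shape[0]) * sr / n_fft        # skip the DC bin
    midi = 69.0 + 12.0 * np.log2(freqs / tuning_hz)      # fractional MIDI note numbers
    pitch_class = np.round(midi).astype(int) % 12
    chroma = np.zeros((12, S.shape[1]))
    for c in range(12):
        chroma[c] = S[1:][pitch_class == c].sum(axis=0)  # sum energy per pitch class
    return chroma / (chroma.max(axis=0, keepdims=True) + 1e-9)
```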

Key Estimation (Ellis, ICASSP ’07)
• Covariance of chroma reflects key
• Normalize by transposing for best fit: model the aligned chroma of one piece with a single Gaussian; find the ML rotation of each other piece against the global model; model all the transposed pieces together; iterate until convergence
[Figure: aligned chroma covariance models for Beatles tracks (Taxman; Eleanor Rigby; I’m Only Sleeping; Love You To; Yellow Submarine; She Said She Said; Good Day Sunshine; And Your Bird Can Sing), each shown against the aligned global model]
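A toy version of the iterate-until-convergence loop, using only mean chroma vectors rather than the full single-Gaussian models on the slide:

```python
import numpy as np

def align_keys(chroma_means, n_iter=10):
    """Iteratively rotate each piece's 12-dim mean chroma to best fit a
    global model, then re-estimate the model (sketch of the ML-rotation idea)."""
    rots = np.zeros(len(chroma_means), dtype=int)
    model = np.mean(chroma_means, axis=0)
    for _ in range(n_iter):
        for i, v in enumerate(chroma_means):
            # circular shift maximizing correlation with the global model
            rots[i] = max(range(12), key=lambda r: float(np.dot(np.roll(v, r), model)))
        model = np.mean([np.roll(v, r) for v, r in zip(chroma_means, rots)], axis=0)
    return rots, model   # rotations = relative transpositions; model ~ key profile
```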

Chord Transcription (Sheh & Ellis ’03)
• “Real Books” give chord transcriptions but no exact timing, just like speech transcripts
[Excerpt: The Beatles, “A Hard Day’s Night”: G Cadd9 G F6 G Cadd9 G F6 G C D G C9 G ... Bm Em Bm G Em C D G ... C9 G Cadd9 Fadd9]
• Use EM to simultaneously learn and align the chord models
[Diagram: EM training loop, borrowed from speech recognition: model inventory + labelled training data → initialization with uniform parameters Θ_init and uniform alignments → E-step: probabilities of the unknowns, p(q_n | X_1..N, Θ_old) → M-step: maximize via Θ: max E[log p(X, Q | Θ)] → repeat until convergence]
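A compact sketch of the EM alignment, with a hard Viterbi E-step and diagonal Gaussians standing in for the full posteriors in the diagram:

```python
import numpy as np

def gauss_ll(X, mean, var):
    """Per-frame log-likelihood under a diagonal-covariance Gaussian."""
    return -0.5 * (((X - mean) ** 2 / var) + np.log(2 * np.pi * var)).sum(axis=1)

def align(X, chord_seq, means, vars_):
    """Monotonic Viterbi alignment of frames to the untimed chord sequence."""
    n, m = len(X), len(chord_seq)
    ll = np.stack([gauss_ll(X, means[c], vars_[c]) for c in chord_seq])  # (m, n)
    D = np.full((m, n), -np.inf); D[0, 0] = ll[0, 0]
    for t in range(1, n):
        for j in range(m):
            step = D[j - 1, t - 1] if j > 0 else -np.inf
            D[j, t] = max(D[j, t - 1], step) + ll[j, t]
    path = np.zeros(n, dtype=int); path[-1] = m - 1          # backtrace
    for t in range(n - 1, 0, -1):
        j = path[t]
        path[t - 1] = j if j == 0 or D[j, t - 1] >= D[j - 1, t - 1] else j - 1
    return path                                              # frame -> chord index

def train_em(songs, n_iter=5, n_chords=24, d=12):
    """Alternate alignment (hard E-step) and Gaussian re-estimation (M-step).
    songs: list of (feature_matrix, chord_label_sequence) pairs; parameters
    start uniform, as in the diagram."""
    means = np.zeros((n_chords, d)); vars_ = np.ones((n_chords, d))
    for _ in range(n_iter):
        frames = [[] for _ in range(n_chords)]
        for X, seq in songs:
            for x, j in zip(X, align(X, seq, means, vars_)):
                frames[seq[j]].append(x)
        for c in range(n_chords):
            if frames[c]:
                F = np.array(frames[c])
                means[c] = F.mean(axis=0)
                vars_[c] = F.var(axis=0) + 1e-3
    return means, vars_
```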

Chord Transcription: Frame-Level Accuracy

Feature   Recog.   Alignment
MFCC      8.7%     22.0%
PCP_ROT   21.7%    76.0%
(random ~3%)

• MFCCs are poor (can overtrain); PCPs better (ROT helps generalization)
[Figure: Beatles, “Eight Days a Week” (4096-pt frames): pitch-class intensity vs. time, with chord sequences: true: E G D Bm G; align: E G D Bm G; recog: E G Bm Am Em7 Bm Em7]
• Needed more training data...

3. Music Similarity
• The most central problem... motivates extracting musical information; supports real applications (playlists, discovery)
• But do we need content-based similarity? it must compete with collaborative filtering, and with fingerprinting + metadata
• Maybe ... for the Future of Music: connect listeners directly to musicians

Discriminative Classification (Mandel & Ellis ’05)
• Classification as a proxy for similarity
• Distribution models: train a GMM per artist on MFCC frames; label a test song by minimum KL divergence
• vs. SVM: song-level features into a DAG-SVM over artists
[Diagrams: the GMM/KL pipeline and the DAG-SVM pipeline, each running from training MFCCs to an artist decision for a test song]

Segment-Level Features (Mandel & Ellis ’07)
• Statistics of spectra and envelope define a point in feature space for SVM classification, Euclidean similarity...
[The slide reproduces the MIREX 2007 poster “LabROSA’s audio music similarity and classification systems” (M. Mandel and D. Ellis), summarized here:]
• Feature extraction: spectral features are the mean and unwrapped covariance of MFCCs, as in Mandel and Ellis (2005); temporal features are similar to those in Rauber et al. (2002): break the input into overlapping 10-second clips; analyze the mel spectrum of each clip, averaging the clips’ features; combine mel frequencies into low, low-mid, high-mid, and high magnitude bands; an FFT in time gives the modulation frequency, keeping the magnitude of the lowest 20% of frequencies; a DCT in modulation frequency gives the envelope cepstrum; stack the four bands’ envelope cepstra into one feature vector; normalize each feature across all songs to zero mean, unit variance
• Similarity and classification: a DAG-SVM gives n-way classification of songs (Platt et al., 2000); the distance between songs is the Euclidean distance between their feature vectors; during testing some songs were a top similarity match for many songs, and re-normalizing each song’s feature vector avoided this problem

References:
M. Mandel and D. Ellis. Song-level features and support vector machines for music classification. In Proc. ISMIR, pages 594-599, 2005.
J. C. Platt, N. Cristianini, and J. Shawe-Taylor. Large margin DAGs for multiclass classification. In Advances in Neural Information Processing Systems 12, pages 547-553, 2000.
A. Rauber, E. Pampalk, and D. Merkl. Using psycho-acoustic models and self-organizing maps to create a hierarchical structuring of music by sound similarity. In Proc. ISMIR, pages 71-80, 2002.

MIREX’07 Results
• One system for both similarity and classification
[Bar charts from the poster: Audio Music Similarity scores across 12 entries (PS, GT, LB, CB1, TL1, ME, TL2, CB2, CB3, BK1, PC, BK2) and Audio Classification accuracies (genre, mood, composer, artist tasks) for entries including IM, ME, TL, GT, KL, CL, GH]

Active-Learning Playlists
• SVMs are well suited to “active learning”: solicit labels on the items closest to the current boundary
• Automatic player with “skip” = ground-truth data collection + active-SVM automatic playlist generation
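A sketch of the active-SVM loop, assuming scikit-learn; ask_listener is a hypothetical stand-in for the player's skip/accept feedback:

```python
import numpy as np
from sklearn.svm import SVC

def ask_listener(song_idx):
    """Hypothetical UI hook: in the slide's player, a 'skip' supplies the
    negative label. Stubbed with a coin flip so the sketch runs."""
    return np.random.randint(2)

def active_playlist(X, seed_pos, seed_neg, n_rounds=5):
    """Margin-based active learning: each round, query the listener about
    the unlabeled song nearest the current SVM decision boundary."""
    labeled = {i: 1 for i in seed_pos}
    labeled.update({i: 0 for i in seed_neg})
    clf = None
    for _ in range(n_rounds):
        idx = np.fromiter(labeled, dtype=int)
        clf = SVC(kernel="rbf").fit(X[idx], [labeled[i] for i in idx])
        pool = np.setdiff1d(np.arange(len(X)), idx)
        if len(pool) == 0:
            break
        margins = np.abs(clf.decision_function(X[pool]))
        query = int(pool[margins.argmin()])      # most uncertain song
        labeled[query] = ask_listener(query)
    return clf   # the final boundary ranks songs for the playlist
```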

Cover Song Detection (Ellis & Poliner ’07)
• “Cover songs” = reinterpretations of a piece: different instrumentation, character; no match with “timbral” features
[Figure: spectrograms of “Let It Be”, verse 1: The Beatles vs. Nick Cave]
• Need a different representation! beat-synchronous chroma features
[Figure: beat-synchronous chroma features for both versions, 12 chroma bins over ~25 beats]

Beat-Synchronous Chroma Features
• Beat tracking + chroma features on 30 ms frames → average the chroma within each beat
• Compact; sufficient?
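The averaging step itself is tiny; a sketch assuming per-frame chroma plus frame and beat times from the tracker above:

```python
import numpy as np

def beat_sync_chroma(chroma, frame_times, beat_times):
    """Average per-frame chroma (12 x n_frames) within each beat interval,
    giving one 12-dim column per beat."""
    edges = np.searchsorted(frame_times, beat_times)
    cols = [chroma[:, a:b].mean(axis=1)
            for a, b in zip(edges[:-1], edges[1:]) if b > a]
    M = np.array(cols).T
    return M / (M.max(axis=0, keepdims=True) + 1e-9)   # normalize each beat column
```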

Matching: Global Correlation
• Cross-correlate entire beat-chroma matrices ... at all possible transpositions
• Implicit combination of match quality and duration
[Figure: beat-chroma matrices for Elliott Smith and Glen Phillips, “Between the Bars” (~500 beats at 281 BPM tatum), and their cross-correlation over skew in beats and transposition in semitones]
• One good matching fragment is sufficient...?
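A sketch of the global match score: per-chroma-row 1-D cross-correlations summed over rows, maximized over the 12 transpositions (the system's normalization and peak-picking details are simplified here):

```python
import numpy as np

def cover_match_score(A, B):
    """Cross-correlate two beat-chroma matrices (12 x nA, 12 x nB) over all
    12 transpositions and all relative skews; return the best peak."""
    best = -np.inf
    for rot in range(12):
        Br = np.roll(B, rot, axis=0)                    # transpose by 'rot' semitones
        # summing the rows' 1-D cross-correlations gives the 2-D match per skew
        xc = sum(np.correlate(A[c], Br[c], mode="full") for c in range(12))
        score = xc.max() / (np.linalg.norm(A) * np.linalg.norm(Br))
        best = max(best, score)
    return best
```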

MIREX 06 Results
• Cover song contest: 30 songs x 11 versions of each (!) (the data has not been disclosed)
• 8 systems compared (4 cover-song + 4 similarity)
• Found 761/3300 correct matches = 23% recall; next best: 11%; guess: 3%
[Figure: # true covers in the top 10 retrieved, per query song, for systems CS, DE, KL1, KL2, KWL, KWT, LR, TP (cover-song vs. similarity systems)]

Cross-Correlation Similarity
• Use the cover-song approach to find similarity: e.g. a similar note/instrumentation sequence may sound very similar to judges
• Numerous variants: try on chroma (melody/harmony) and MFCCs (timbre); try full search (xcorr) or landmarks (indexable); compare to random and to segment-level stats
• Evaluate by subjective tests modeled after MIREX similarity

Cross-Correlation Similarity
• Human web-based judgments: binary similar/not-similar ratings for speed; 6 users x 30 queries x 10 candidate returns
[The slide reproduces the evaluation section of the paper. Table 1, “Results of the subjective similarity evaluation”, pools the binary ratings per algorithm out of a possible 180 (6 raters x 30 queries); a conservative binomial significance test requires a difference of around 13 votes (7%) to consider two algorithms different:]

Algorithm                      Similar count
(1) Xcorr, chroma              48/180 = 27%
(2) Xcorr, MFCC                48/180 = 27%
(3) Xcorr, combo               55/180 = 31%
(4) Xcorr, combo + tempo       34/180 = 19%
(5) Xcorr, combo at boundary   49/180 = 27%
(6) Baseline, MFCC             81/180 = 45%
(7) Baseline, rhythmic         49/180 = 27%
(8) Baseline, combo            88/180 = 49%
Random choice 1                22/180 = 12%
Random choice 2                28/180 = 16%

• Cross-correlation is inferior to the baseline... but is getting somewhere, even with the combined score evaluated only at the reference boundary

Cross-Correlation Similarity
• Results are not overwhelming .. but the database is only a few thousand clips

“Anchor Space” (Berenzweig & Ellis ’03)
• Acoustic features describe each song .. but from a signal, not a perceptual, perspective .. and not the differences between songs
• Use genre classifiers to define a new space: prototype genres are the “anchors”
[Diagram: audio input → GMM anchor models → n-dimensional vector of posteriors p(a_1|x) ... p(a_n|x) in “anchor space” for each of two songs (classes i and j) → similarity computation via KL divergence, EMD, etc.]
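A sketch of the conversion: per-frame anchor posteriors from a set of fitted GMMs (e.g. scikit-learn GaussianMixture objects, an assumed stand-in for the slide's anchor models), assuming equal priors:

```python
import numpy as np

def to_anchor_space(frames, anchor_gmms):
    """Map feature frames (n_frames x d) to anchor posteriors p(a_k | x).
    anchor_gmms: one fitted model per anchor genre, each providing
    score_samples() = per-frame log-likelihood; equal priors assumed."""
    loglik = np.stack([g.score_samples(frames) for g in anchor_gmms], axis=1)
    loglik -= loglik.max(axis=1, keepdims=True)        # stabilize the exponent
    post = np.exp(loglik)
    return post / post.sum(axis=1, keepdims=True)      # (n_frames, n_anchors)
```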

“Anchor Space”
• Frame-by-frame high-level categorizations; compare to raw features?
[Figure: scatter of Madonna vs. Bowie frames in cepstral features (third vs. fifth cepstral coefficient) and in anchor space (Electronica vs. Country posteriors)]
• Properties in distributions? dynamics?

‘Playola’ Similarity Browser

Ground-truth data (Ellis et al. ’02)
• Hard to evaluate Playola’s ‘accuracy’: user tests... ground truth?
• “Musicseer” online survey/game: ran for 9 months in 2002; > 1,000 users, > 20k judgments
• http://labrosa.ee.columbia.edu/projects/musicsim/

“Semantic Bases”
• Describe a segment in human-relevant terms: e.g. anchor space, but more so
• Need ground truth... what words do people use?
• MajorMiner game: 400 users, 7,500 unique tags, 70,000 taggings, 2,200 10-sec clips used
• Train classifiers...

4. Music Structure Discovery
• Use the many examples to map out the “manifold” of music audio ... and hence define the subset that is music
[Figure: log-likelihoods of test tracks under 32-component GMMs trained on 1000 MFCC20 frames per artist (aerosmith, beatles, bryan_adams, creedence_clearwater_revival, dave_matthews_band, depeche_mode, fleetwood_mac, garth_brooks, genesis, green_day, madonna, metallica, pink_floyd, queen, rolling_stones, roxette, tina_turner, u2): artist model vs. test track]
• Problems: alignment/registration of data; factoring & abstraction; separating parts?

Eigenrhythms: Drum Pattern Space (Ellis & Arroyo ’04)
• Pop songs are built on a repeating “drum loop”: variations on a few bass, snare, hi-hat patterns
• Eigen-analysis (or ...) to capture the variations? by analyzing lots of (MIDI) data, or from audio
• Applications: music categorization, “beat box” synthesis, insight

Aligning the Data
• Need to align patterns prior to modeling...
  tempo (stretch): by inferring BPM & normalizing
  downbeat (shift): correlate against a ‘mean’ template

Eigenrhythms (PCA)
• Need 20+ eigenvectors for good coverage of 100 training patterns (1200 dims)
• Eigenrhythms both add and subtract
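PCA over the aligned patterns is only a few lines; a sketch where each row is one aligned 1200-dim drum pattern:

```python
import numpy as np

def eigenrhythms(patterns, n_components=20):
    """PCA via SVD on aligned drum patterns (n_patterns x 1200, e.g.
    3 instruments x 400 time samples). Returns the mean pattern and the
    top principal components, the 'eigenrhythms'."""
    mean = patterns.mean(axis=0)
    U, s, Vt = np.linalg.svd(patterns - mean, full_matrices=False)
    return mean, Vt[:n_components]

# A pattern is then approximated as mean + weights @ components,
# where the signed weights both add and subtract beat-weight.
```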

Posirhythms (NMF)
[Figure: six NMF basis patterns (“posirhythms” 1-6), each showing hi-hat (HH), snare (SN), and bass drum (BD) weight over 4 beats at 120 BPM, sampled at 200 Hz]
• Nonnegative: only adds beat-weight
• Capturing some structure
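A minimal multiplicative-update NMF (Lee & Seung style) as a stand-in for whatever factorization produced the posirhythms; the input must be nonnegative (e.g. onset strengths):

```python
import numpy as np

def nmf(V, rank=6, n_iter=200, eps=1e-9):
    """Factor V (n_patterns x dims, V >= 0) as W @ H with W, H >= 0 using
    multiplicative updates for squared error; the rows of H play the role
    of the 'posirhythms'."""
    rng = np.random.default_rng(0)
    n, d = V.shape
    W = rng.random((n, rank)); H = rng.random((rank, d))
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update bases
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update activations
    return W, H
```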

Eigenrhythm BeatBox
• Resynthesize rhythms from the eigen-space

Melody Clustering
• Goal: find ‘fragments’ that recur in melodies .. across a large music database .. trade data for model sophistication
[Diagram: training data → melody extraction → 5-second fragments → VQ clustering → top clusters]
• Data sources: pitch tracker, or MIDI training data
• Melody fragment representation: DCT(1:20); removes the average, smooths detail
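A sketch of the fragment representation, assuming a 100 Hz pitch track so that 500 frames ≈ 5 seconds; dropping DCT coefficient 0 removes the average (transposition) and truncation smooths fine detail:

```python
import numpy as np
from scipy.fftpack import dct

def melody_fragment_features(pitch_track, frag_len=500, n_coef=20):
    """Cut a per-frame pitch contour into fixed-length fragments and keep
    DCT coefficients 1..20 of each, ready for k-means / VQ clustering."""
    frags = [pitch_track[i:i + frag_len]
             for i in range(0, len(pitch_track) - frag_len + 1, frag_len)]
    feats = [dct(np.asarray(f, dtype=float), norm="ortho")[1:n_coef + 1]
             for f in frags]
    return np.array(feats)   # (n_fragments, 20)
```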

Melody Clustering
• Clusters match the underlying contour
• Some interesting matches: e.g. Pink + Nsync

Beat-Chroma Fragment Codebook
• Idea: find the most popular music fragments, e.g. perfect cadence, rising melody, ...?
• Clustering a large enough database should reveal these; but: registration of phrase boundaries, transposition
• Need to deal with really large datasets, e.g. 100k+ tracks with multiple landmarks in each; but: Locality Sensitive Hashing can help: it quickly finds ‘most’ points within a certain radius (see the sketch below)
• Experiments in progress...
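A minimal sign-random-projection LSH sketch (one standard LSH family; the slide does not specify which was used): near-duplicate beat-chroma fragments tend to share hash keys, so candidate neighbors come from bucket lookups instead of a linear scan.

```python
import numpy as np
from collections import defaultdict

class RandomProjectionLSH:
    """Hash a vector to one key per table by the sign pattern of random
    projections; similar vectors collide with high probability."""
    def __init__(self, dim, n_bits=16, n_tables=8, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(n_tables, n_bits, dim))
        self.tables = [defaultdict(list) for _ in range(n_tables)]

    def _keys(self, v):
        return [tuple((P @ v > 0).astype(int)) for P in self.planes]

    def add(self, v, item_id):
        for table, key in zip(self.tables, self._keys(v)):
            table[key].append(item_id)

    def query(self, v):
        cands = set()
        for table, key in zip(self.tables, self._keys(v)):
            cands.update(table[key])
        return cands   # verify candidates with an exact distance check
```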

Conclusions

[Diagram: the same map as “The Big Picture”: music audio → melody and notes, key and chords, tempo and beat, structure → low-level features, browsing, classification and similarity, music modeling, generation; serving discovery, production, and curiosity]

• Lots of data + noisy transcription + weak clustering ⇒ musical insights?
