Cultural and Acoustic Approaches to Music Retrieval
Brian Whitman, Music Mind and Machine Group, MIT Media Lab

Outline
• Music retrieval
  – Similarity
  – Signal feature extraction
  – Cultural feature extraction
  – Learning tasks
  – Evaluation
• “Grounding” (audio/audience)

Take Home
• How to:
  – Analyze audio from music
  – Extract cultural features from music
  – Form them into representations & learn relations
  – Evaluate models of similarity
• Bigger message:
  – Retrieval is still new & undefined; there is no clear best approach or method
  – Signal-only approaches have problems
  – Culture-only approaches have problems
  – Unsupervised methods (no bias) built from observation seem to be our best bet

Music-IR
• Understanding music for organization, indexing, recommendation, browsing
• Score / Audio / Library
• Conferences: ISMIR, DAFX, AES, WEDELmusic, WASPAA, ICMC, SIGIR, MMSP, ICME, ACM-MM

Music-IR Applications
• In development: filename search; CF-style recommendation; song ID; MIDI parsing (key finding, tempo from score, score indexing); trusted-source recommendation; score intelligence
• Challenging: genre ID; beat tracking; audio similarity; structure mapping; monophonic parameterization; robust song ID; theme finding; artist ID
• Still unsolved: source separation & friends (instrument ID, transcription, vocal removal)

What can we do?
• Commercial:
  – Copyright
  – Buzz tracking
  – Recommendation
  – Query-by-X (humming, singing, description, example)
  – Content-IR
• Research:
  – Cognitive models
  – Grounding music / music acquisition
  – Machine listeners
  – Synthesis (rock star generation)

Who’s We?
• (Completely unexhaustive list)
• Larger commercial interests: Sony, NEC, HP, MELCO, Apple
• Record industry: BMG, WarnerMG, AOL/TW, MTV
• Metadata providers: AMG, Yahoo, Google, Moodlogic
• Academia: CMU, UMich, MTG, Queen Mary & CCL, IRCAM, MIT, CNMAT, CCRMA, Columbia

Similarity

Why is similarity important?
• A “perfect” similarity metric would solve all music understanding problems
• Unfortunately, we don’t even know what a perfect metric would do
• Biggest problems:
  – Time scale: segments vs. songs vs. albums vs. career, e.g. D(Madonna_1988, Madonna_2003)
  – “Outside the world” features (cf. female hip-hop vs. female teen-pop)
  – Context dependence!

Views of similarity
• Many different measures of similarity; just a few examples:
  – Music theory: melody, harmony, structure
  – Perception: loudness, pitch, texture, beat, timbre
  – Culture: genre, date, lyrics
  – Cognition: experience, reference

Rules of similarity
• Self-similarity: D(A,A) = 0
• Positivity: D(A,B) = 0 iff A = B
• Symmetry: D(A,B) = D(B,A)
  – Influential similarity breaks symmetry
• Triangle inequality: D(A,B) <= D(A,C) + D(B,C)
  – Partial similarity breaks the triangle
• (Both failure modes are checked in the code sketch after the Ground Truth slide)

Specifics of Artist Similarity
• Artists: Beatles, Jason Falkner, Michael Penn
• D(Penn, Beatles) = 0.1, but D(Beatles, Penn) = 0.5
• D(Penn, Falkner) = 0.1 and D(Falkner, Penn) = 0.1
• “Influential similarity” breaks symmetry

Specifics of Artist Similarity 2
• A = Jason Falkner, B = Jon Brion, C = the Grays
• D(Falkner, Grays) = 0.1 (Falkner was a member of the Grays)
• D(Brion, Grays) = 0.1 (Brion was a member of the Grays)
• D(Falkner, Brion) = 0.5
• Is D(A,B) <= D(A,C) + D(B,C)? No: 0.5 is not <= 0.1 + 0.1!
• “Partial similarity” breaks the triangle inequality

Ground Truth
• Evaluation of similarity is very hard
• Use metadata sources? Editors? Human evaluation?
• What is rock? What about “Electronic” or “World Music”?
• Is similarity opinion or fact?
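To make the symmetry and triangle failures above concrete, here is a minimal sketch (mine, not from the talk) that scans a pairwise distance table for metric-axiom violations, seeded with the artist distances quoted on the two slides:

```python
from itertools import permutations

# Pairwise artist distances quoted on the slides above.
D = {
    ("Penn", "Beatles"): 0.1, ("Beatles", "Penn"): 0.5,    # influence is one-way
    ("Penn", "Falkner"): 0.1, ("Falkner", "Penn"): 0.1,
    ("Falkner", "Grays"): 0.1, ("Grays", "Falkner"): 0.1,
    ("Brion", "Grays"): 0.1, ("Grays", "Brion"): 0.1,
    ("Falkner", "Brion"): 0.5, ("Brion", "Falkner"): 0.5,  # shared band, unlike solo work
}

def check_metric_axioms(D):
    """Print every symmetry or triangle-inequality violation found in D."""
    for (a, b), d in D.items():
        if (b, a) in D and D[(b, a)] != d:
            print(f"symmetry broken: D({a},{b})={d}, D({b},{a})={D[(b, a)]}")
    names = {n for pair in D for n in pair}
    for a, b, c in permutations(sorted(names), 3):
        if all(p in D for p in [(a, b), (a, c), (b, c)]):
            if D[(a, b)] > D[(a, c)] + D[(b, c)]:
                print(f"triangle broken: D({a},{b}) > D({a},{c}) + D({b},{c})")

check_metric_axioms(D)   # flags Beatles/Penn and Falkner/Brion/Grays
```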
Experimental Paradigm
• Pipeline: collect data → extract relevant features → train for similarity → evaluation
• Drawing on: basic acoustics, auditory perception, pattern recognition, grounding, data mining, music theory, machine learning, psychology, DSP, NLP

Acoustic Features

Packaging Perception
• Algorithms like observations: pieces of an image, frames of a movie, words in a sentence, paragraphs in a document
• Most ML/similarity metrics look for correlations in the given observations
• Example: d = 8, l = 4, i.e. four observations, 8 dimensions each

The (time)Frame Problem
• So how do we make multiple observations from music? The answer depends on time scale: what are you trying to do?
  – Instrument ID → windows at control rate
  – Beat detection → measures
  – Song structure → verses, choruses, etc.
  – Song similarity → in-song chunks (arbitrary?)
  – Artist similarity → entire songs
  – Career scale → albums

Time Domain Measures
• Never forget the time domain!
  – Signal entropy
  – Onset/offset/envelope detection

Full-Width Spectral Slices
• Small chunks of audio (2 beats?): “car seek”
• 40 KB/sec, 10,000 dimensions per observation!

Power Spectral Density
• FFT, magnitude only
• The PSD is the mean of the STFT time frames over your window; multiple FFT windows make up each time slice
• e.g. 512 dimensions, 2 KB per slice
• (Sketched in code at the end of this section)

Log-space FFT
• The PSD on its own isn’t perceptually sound
• Here we rescale the bins to grow logarithmically and choose 16 slices
• Each frame (which could vary in length) is now represented as 16 dimensions

MFCC
• Mel-frequency cepstral coefficients, built on Mel-scale frequencies (slide shows the Mel pitch vs. frequency-in-Hz curve)
• Cepstrum: FFT → log → inverse FFT, with Mel-scale weighting
• Same number of points as the input
  – Low-order coefficients: spectral envelope shape
  – High-order coefficients: fundamental frequency and harmonics

MFCC, continued
• Keep the low-order coefficients (analysis does better with <13)
• MFCC applications: speech recognition, speaker identification, instrument identification, music similarity
• (A library-based sketch follows at the end of this section)

Principal Components Analysis
• Better decorrelation: each component has the maximal variance
• Decompose the STFT matrix into eigenvectors via SVD; save only n of them
• The representation is the coefficients of the weighting matrix
• PCA on PSD (sketched in code at the end of this section)

PCA / SVD / NMF
• Singular value decomposition (SVD)

The (time)Frame Problem #2
• Time is important to music!
• But for most features, time is ignored
• We could swap around the observations (randomize the song!) and the results would be the same
• Instead:
  – Beat features
  – State paths
  – Learning tricks

Autocorrelation Statistics
• View the FFT instead as a “repetition counter”
  – Text case: a quick frequency estimator
• “FFT of the FFT” (aka the “beatogram”)

Beat Detection with Autocorrelation
• If the music is periodic, you’ll get the beat (see the tempo sketch at the end of this section)

Song Path Regularization
• MPEG-7 (Casey et al.): cluster each spectral frame into one of 20 states
• Now a song is just a symbolic message: “AABCHGFKGA...”
• Much easier for ML to handle (see the quantization sketch at the end of this section)

Approximate Pattern Matching
• Dynamic programming
• Perception-based methods, e.g. melody matching
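The PSD and log-space FFT slides describe a small, reproducible pipeline. Below is a minimal sketch assuming numpy/scipy and a mono signal array x at sample rate sr; the function names are mine, not the author’s:

```python
import numpy as np
from scipy.signal import stft

def psd_slice(x, sr, n_fft=1024):
    """PSD of one audio chunk: mean STFT magnitude over the window."""
    _, _, Z = stft(x, fs=sr, nperseg=n_fft)
    return np.abs(Z).mean(axis=1)

def log_bins(psd, sr, n_bins=16, f_min=50.0):
    """Pool a linear-frequency PSD into 16 log-spaced bins, as on the slide."""
    freqs = np.linspace(0, sr / 2, len(psd))
    edges = np.geomspace(f_min, sr / 2, n_bins + 1)    # log-spaced band edges
    # Assumes n_fft gives enough resolution that no band is empty.
    return np.array([psd[(freqs >= lo) & (freqs < hi)].mean()
                     for lo, hi in zip(edges[:-1], edges[1:])])
```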
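For MFCCs it is easier to lean on a library than to hand-roll the Mel filterbank. A sketch using librosa (my choice; the slides name no toolkit), keeping only the low-order coefficients as recommended above; "song.wav" is a placeholder path:

```python
import librosa

# Load audio and compute 13 low-order MFCCs per frame.
y, sr = librosa.load("song.wav", sr=22050, mono=True)
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # shape: (13, n_frames)

# One common baseline: summarize the song as its mean MFCC vector.
song_vector = mfccs.mean(axis=1)
```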
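The PCA-on-PSD step reduces to a single SVD call over a matrix of spectral frames; a numpy sketch following the slide’s description (my code, not the original):

```python
import numpy as np

def pca(frames, n_components=20):
    """PCA of an (observations x dimensions) matrix of spectral frames via SVD."""
    X = frames - frames.mean(axis=0)              # center each dimension
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    components = Vt[:n_components]                # directions of maximal variance
    coeffs = X @ components.T                     # low-dimensional representation
    return coeffs, components
```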
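The “FFT of the FFT” beat idea can be approximated with plain autocorrelation of an onset-strength envelope: the strongest lag in a plausible tempo range gives the beat period. A sketch under my own assumptions (an envelope sampled at frame_rate frames per second, long enough to cover the slowest tempo):

```python
import numpy as np

def estimate_tempo(envelope, frame_rate, bpm_range=(60, 180)):
    """Estimate tempo in BPM from the autocorrelation of an onset envelope."""
    e = envelope - envelope.mean()
    ac = np.correlate(e, e, mode="full")[len(e) - 1:]   # keep lags >= 0
    lo = int(frame_rate * 60 / bpm_range[1])            # shortest beat period
    hi = int(frame_rate * 60 / bpm_range[0])            # longest beat period
    lag = lo + np.argmax(ac[lo:hi])                     # assumes len(ac) > hi
    return 60.0 * frame_rate / lag
```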
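The MPEG-7 song-path idea is, at heart, vector quantization. Here scikit-learn’s k-means stands in for the actual MPEG-7 procedure (an assumption on my part): cluster the spectral frames into 20 states and read the song off as a string of state labels, ready for the string-matching methods above:

```python
from sklearn.cluster import KMeans

def song_to_states(frames, n_states=20):
    """Quantize (n_frames x n_dims) spectral frames into a symbolic state string."""
    labels = KMeans(n_clusters=n_states, n_init=10).fit_predict(frames)
    return "".join(chr(ord("A") + lab) for lab in labels)   # e.g. "AABCHGFKGA..."
```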
Cultural Features

Two-Way IR
• So much information is going the other way!
  – “My favorite song” → P2P collections
  – “Timbaland produced the new Missy record” → online playlists
  – “Uninspired electro-glitch rock” → informal reviews
  – “Reminds me of my ex-girlfriend” → query habits
• (Diagram: sound & score on one side, artists & listeners on the other, with information flowing both ways)

Acoustic vs. Cultural Representations
• Acoustic:
  – Instrumentation
  – Short-time (timbral)
  – Mid-time (structural)
  – Usually all we have
• Cultural:
  – Long-scale time
  – Inherent user model
  – Listener’s perspective
  – Two-way IR
• Questions to ask of each: Which genre? Describe this. Which artist? Do I like this? What instruments? 10 years ago? Which style?

Representation Uses
• Acoustic: artist ID, genre ID, audio similarity, copyright protection
• Cultural: style ID, recommendation, query by description, cultural similarity
• Combined: auto-description of music, community synthesis, high-accuracy style ID, an acoustic/cultural user knob

Creating a Cultural Representation
• Where do people talk about music?
  – Web mining
  – USENET mining
  – Discussion groups, weblogs
  – Usage statistics
  – P2P mining

“Community Metadata”
• Combine all types of mined data: P2P, web, Usenet, and future sources
• Long-term time aware
• One comparable representation via a Gaussian kernel: machine-learning friendly

Data Collection Overview
• Web/Usenet crawl:
  – Web crawls for artist names
  – Retrieved documents are parsed for: unigrams, bigrams and trigrams; artist names; noun phrases; adjectives
• P2P crawl: robots watch the OpenNap network for songs shared in collections
• Specific collections: playlists, AMG, reviews

Web Searching
• Augmented search patterns, e.g. for the band “War”:
  – “War”
  – “War” +music
  – “War” +music +review
• Use the top 50 results for parsing (100 for Usenet)
• HTML documents have their tags removed and are sentence-split

Language Processing for IR
• Web page to feature vector; example sentence: “XTC was one of the smartest — and catchiest — British pop bands to emerge from the punk and new wave explosion of the late '70s.”
• n1 (unigrams): XTC, was, one, of, the, smartest, and, catchiest, British, pop, bands, to, emerge, from, punk, new, wave, ...
• n2 (bigrams): XTC was, was one, one of, of the, the smartest, smartest and, and catchiest, catchiest British, British pop, pop bands, bands to, to emerge, emerge from, from the, the punk, punk and, and new, new wave, ...
• n3 (trigrams): XTC was one, was one of, one of the, of the smartest, the smartest and, smartest and catchiest, and catchiest British, catchiest British pop, British pop bands, pop bands to, bands to emerge, to emerge from, emerge from the, from the punk, the punk and, punk and new, and new wave, ...
• np (noun phrases): catchiest British pop bands; British pop bands; punk and new wave explosion
• art (artist names): XTC
• adj (adjectives): smartest, catchiest, British, new, late

How-to “Klepmit”
1) Web parse (lynx --dump, parsers, tag removal, etc.)
2) N-gram counting (Perl, etc.; sketched in code below)
3) Part-of-speech tagging (Brill, Alembic)
4) Noun-phrase chunking (Penn’s baseNP, Columbia’s
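Step 2 of the recipe above (n-gram counting) is the easiest to sketch. Assuming tag-stripped, sentence-split text as in the XTC example, this builds the unigram/bigram/trigram counts from the Language Processing slide (a Python stand-in for the original Perl):

```python
from collections import Counter

def ngram_counts(sentences, max_n=3):
    """Count unigrams, bigrams and trigrams over sentence-split text."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for sentence in sentences:
        words = sentence.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[n][" ".join(words[i:i + n])] += 1
    return counts

sents = ["XTC was one of the smartest and catchiest British pop bands"]
print(ngram_counts(sents)[2].most_common(3))   # three most frequent bigrams
```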