Cultural and Acoustic Approaches to Music Retrieval

Brian Whitman, Music Mind and Machine Group, MIT Media Lab

Outline

• Music retrieval
  – Similarity
  – Signal feature extraction
  – Cultural feature extraction
  – Learning tasks
  – Evaluation
• “Grounding” (audio/audience)

Take Home

• How to:
  – Analyze audio from music
  – Extract cultural features from music
  – Form them into representations & learn relations
  – Evaluate models of similarity
• Bigger message:
  – Retrieval is still new & undefined; there is no clear best approach or method
  – Signal-only approaches have problems
  – Culture-only approaches have problems
  – Unsupervised methods (no bias) built from observation seem to be our best bet

Music-IR

• Understanding music for organization, indexing, recommendation, browsing
• Score / Audio / Library
• Conferences: ISMIR, DAFX, AES, WEDELmusic, WASPAA, ICMC, SIGIR, MMSP, ICME, ACM-MM

Music IR Applications

Development:
• Filename search
• CF-style recommendation
• Song ID
• MIDI parsing: key finding, tempo from score, score indexing
• Trusted-source recommendation

Challenging:
• Genre ID
• Beat tracking
• Audio similarity
• Structure mapping
• Monophonic parameterization
• Robust song ID
• Artist ID

Still unsolved:
• Source separation & friends: instrument ID, transcription, vocal removal
• Theme finding
• Score intelligence

What can we do?

• Commercial:
  – Copyright
  – Buzz tracking
  – Recommendation
  – Query-by-X: humming, singing, description, example
  – Content-IR
• Research:
  – Cognitive models
  – Grounding music / music acquisition
  – Machine listeners
  – Synthesis (rock star generation)

Who’s We?

• (A completely non-exhaustive list)
• Larger commercial interests:
  – Sony, NEC, HP, MELCO, Apple
• Record industry:
  – BMG, WarnerMG, AOL/TW, MTV
• Metadata providers:
  – AMG, Yahoo, Google, Moodlogic
• Academia:
  – CMU, UMich, MTG, Queen Mary & CCL, IRCAM, MIT, CNMAT, CCRMA, Columbia

Similarity

Why is similarity important?

• A “perfect” similarity metric solves all music understanding problems
• Unfortunately, we don’t even know what a perfect metric would do

• Biggest problems:
  – Time scale
    • Segments vs. songs vs. albums vs. career
    • D(Madonna_1988, Madonna_2003)
  – “Outside the world” features
    • cf. female hip-hop vs. female teen-pop
  – Context dependence!

Views of similarity

• Many different measures of similarity. – Just a few examples:

Music theory: melody, harmony, structure
Culture: genre, date, lyrics
Perception: loudness, pitch, texture, beat, timbre
Cognition: experience, reference

Rules of similarity

• Self-similarity: D(A,A) = 0
• Positivity: D(A,B) = 0 iff A = B
• Symmetry: D(A,B) = D(B,A)
  – Influential similarity breaks symmetry
• Triangle inequality: D(A,B) <= D(A,C) + D(B,C)
  – Partial similarity breaks the triangle

Specifics of Artist Similarity

(Diagram: Beatles, Jason Falkner, Michael Penn)

D(Penn, Beatles) = 0.1    D(Beatles, Penn) = 0.5
D(Penn, Falkner) = 0.1    D(Falkner, Penn) = 0.1

“Influential similarity” breaks symmetry.

Specifics of Artist Similarity 2

(Diagram: A = Jason Falkner, B = Jon Brion, C = the Grays)

D(Falkner, Grays) = 0.1  (Falkner is a member of the Grays)
D(Brion, Grays) = 0.1    (Brion is a member of the Grays)
D(Falkner, Brion) = 0.5

D(A,B) <= D(A,C) + D(B,C)?  0.5 is not <= 0.1 + 0.1!
“Partial similarity” breaks the triangle inequality.

Ground Truth

• Evaluation of similarity is very hard
• Use metadata sources? Editors? Human evaluation?
• What is rock?
  – “Electronic”, “World Music”
• Is similarity opinion or fact?

Experimental paradigm

Collect data → extract relevant features → train for similarity → evaluation

(Fields involved: basic acoustics, auditory perception, pattern recognition, grounding, data mining, music theory, machine learning, psychology, DSP, NLP)

Acoustic Features

Packaging Perception

• Algorithms like observations
  – Pieces of an image, frames of a movie, words in a sentence, paragraphs in a document

& most ML/sim metrics look for correlations in the given observations:

d = 8, l = 4: four observations, 8 dimensions each

The (time)Frame Problem

So how do we make multiple observations from music?

The answer depends on time scale. What are you trying to do?

  Instrument ID      →  window at control rate
  Beat detection     →  measures
  Song structure     →  verses, choruses, etc.
  Song similarity    →  in-song chunks (arbitrary?)
  Artist similarity  →  entire songs
  Career scale       →  albums

Time Domain Measures

• Never forget the time domain!
  – Signal entropy
  – Onset/offset/envelope detection

Full Width Spectral Slices

• Small chunks of audio (2 beats?)
  – “Car seek”

• 40 KB/sec, 10,000 dimensions per observation!

Power Spectral Density

• FFT, magnitude only: the PSD is the mean of the STFT time frames over your window
• Multiple FFT windows make up each time slice

(Figure: STFT matrix M of a song by Artist #1, frequency vs. time, frames 1, 2, …, n)

• 512 dimensions, 2 KB per observation
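A minimal sketch of this step (assuming numpy/scipy; the FFT size and hop are illustrative, not the exact settings used here):

    import numpy as np
    from scipy.signal import stft

    def psd_observation(x, sr, n_fft=1024, hop=512):
        """One PSD observation: mean of the squared-magnitude STFT frames over a window."""
        _, _, Z = stft(x, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
        power = np.abs(Z) ** 2        # shape: (n_fft // 2 + 1, n_frames)
        return power.mean(axis=1)     # one ~512-dimensional vector per chunk of audio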

Log-space FFT

• PSD on its own isn’t perceptually sound.

Here we rescale the bins to grow logarithmically and choose 16 slices.

• Each frame (which could vary in length) is now represented as 16 dimensions.

MFCC

• Mel-frequency cepstral coefficients
  – Mel-scale frequencies

(Figure: the mel scale — pitch in mels vs. frequency in Hz)

Cepstrum

• Cepstrum pipeline: FFT → log → FFT⁻¹, with mel-scale weighting applied to the spectrum

• Same number of points as the input
  – Low-order coefficients: spectral envelope shape
  – High-order coefficients: fundamental frequency and harmonics

MFCC

• Keep low-order coefficients
  – (analysis does better with <13)
• MFCC applications:
  – Speech recognition
  – Speaker identification
  – Instrument identification
  – Music similarity
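A hedged example of extracting the low-order coefficients (assuming librosa is installed; the file path is hypothetical):

    import librosa

    y, sr = librosa.load("song.wav", sr=22050)           # hypothetical input file
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # keep the low-order coefficients
    # mfcc has shape (13, n_frames); each column is one observation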

Principal Components Analysis

• Better decorrelation: each component has the maximal variance

• Decompose the STFT matrix into eigenvectors via SVD
• Save only n of them: the representation is the coefficients of the weighting matrix
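A sketch of PCA on a matrix of PSD frames via the SVD (numpy only; 20 components as used later in the talk):

    import numpy as np

    def pca_svd(frames, n_components=20):
        """frames: (n_observations, n_dims) matrix of PSD slices."""
        mean = frames.mean(axis=0)
        centered = frames - mean
        _, _, Vt = np.linalg.svd(centered, full_matrices=False)
        components = Vt[:n_components]       # principal directions (eigenvectors)
        coeffs = centered @ components.T     # low-dimensional coefficients per frame
        return coeffs, components, mean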

(Figure: PCA on PSD)

PCA / SVD / NMF

• Singular value decomposition (SVD)

The (time)Frame Problem #2

• Time is important to music!
• But for most features, time is ignored
• We could swap around the observations (randomize the song!) and the results would be the same
• Instead:
  – Beat features
  – State paths
  – Learning tricks

Autocorrelation Statistics

• View the FFT instead as a “repetition counter”
  – Text case, quick frequency estimator
• “FFT of the FFT” (a.k.a. the “beatogram”)

Beat Detection with Autoco

• If the music is periodic, you’ll get the beat
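A rough sketch of the idea (assumes an onset/loudness envelope has already been computed; the tempo search range is an arbitrary choice):

    import numpy as np

    def dominant_period(envelope, frame_rate, min_s=0.25, max_s=2.0):
        """Estimate the strongest repetition period (in seconds) of an envelope."""
        env = envelope - envelope.mean()
        ac = np.correlate(env, env, mode="full")[len(env) - 1:]   # lags >= 0
        lo, hi = int(min_s * frame_rate), int(max_s * frame_rate)
        lag = lo + np.argmax(ac[lo:hi])    # strongest peak in a plausible beat range
        return lag / frame_rate            # 60 / period gives BPM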

Song Path Regularization

• MPEG-7 (Casey et al.)

Cluster each spectral frame into one of 20 states. Now a song is just a symbolic message: “AABCHGFKGA...”

Much easier for ML to handle.

Approximate Pattern Matching

• Dynamic programming
• Perception-based methods, e.g., melody matching

Cultural Features

Two-Way IR

• So much going the other way!

  “My favorite song”                          →  P2P collections
  “Timbaland produced the new Missy record”   →  online playlists
  “Uninspired electro-glitch rock”            →  informal reviews
  “Reminds me of my ex-girlfriend”            →  query habits

Sound & Artists Score Listeners Acoustic vs. Cultural Representations

• Acoustic:
  – Instrumentation
  – Short-time (timbral)
  – Mid-time (structural)
  – Usually all we have
• Cultural:
  – Long-scale time
  – Inherent user model
  – Listener’s perspective
  – Two-way IR

Which genre? Which artist? What instruments? Which style?
Describe this. Do I like this? 10 years ago?

Representation Uses

• Acoustic:
  – Artist ID
  – Genre ID
  – Audio similarity
  – Copyright protection
• Cultural:
  – Style ID
  – Recommendation
  – Query by description
  – Cultural similarity

Combined:
• Auto description of music
• Community synthesis
• High-accuracy style ID
• Acoustic/cultural user knob

Creating a Cultural Representation

• Where do people talk about music?
  – Web mining
  – USENET mining
  – Discussion groups, weblogs
  – Usage statistics
  – P2P mining

“Community Metadata”

• Combine all types of mined data
  – P2P, web, Usenet, future?
• Long-term time aware
• One comparable representation via gaussian kernel
  – Machine learning friendly

Data Collection Overview

• Web/Usenet crawl:
  – Web crawls for artist names
  – Retrieved documents are parsed for:
    • Unigrams, bigrams and trigrams
    • Artist names
    • Noun phrases
    • Adjectives
• P2P crawl:
  – Robots watch the OpenNap network for shared songs in collections
• Specific collections:
  – Playlists, AMG, reviews

Web searching

• Augmented search patterns, e.g. for the band “War”:
  – “War”
  – “War” +music
  – “War” +music +review
• Use the top 50 results for parsing (100 for Usenet)
• HTML documents have their tags removed and are split into sentences

Language Processing for IR

• Web page to feature vector

(Figure: HTML page → sentence chunks → term extraction)

Example sentence: “XTC was one of the smartest — and catchiest — British pop bands to emerge from the punk and new wave explosion of the late '70s.”

  n1: XTC, was, one, of, the, smartest, and, catchiest, British, pop, bands, to, emerge, from, punk, new, wave, …
  n2: “XTC was”, “was one”, “one of”, “of the”, “the smartest”, “smartest and”, “and catchiest”, “catchiest British”, “British pop”, “pop bands”, “bands to”, “to emerge”, “emerge from”, “from the”, “the punk”, “punk and”, “and new”, “new wave”, …
  n3: “XTC was one”, “was one of”, “one of the”, “of the smartest”, “the smartest and”, “smartest and catchiest”, “and catchiest British”, “catchiest British pop”, “British pop bands”, “pop bands to”, “bands to emerge”, “to emerge from”, “emerge from the”, “from the punk”, “the punk and”, “punk and new”, “and new wave”, …
  np: “smartest catchiest British pop bands”, “catchiest British pop bands”, “pop bands”, “punk and new wave”, “explosion”
  art: “XTC”
  adj: “British”, “new”, “late”

How-to “Klepmit”

1) Web parse (lynx --dump, parsers, tag removal, etc.)
2) N-gram counting (Perl etc.)
3) Part-of-speech tagging (Brill, Alembic)
4) Noun-phrase chunking (Penn’s baseNP, Columbia’s LinkIT)
5) Math: TF-IDF, SVD, LSA (Perl, Matlab)
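A loose sketch of steps 2–4, swapping in NLTK’s stock tokenizer, tagger and a regexp chunker for Brill and baseNP (assumes NLTK and its tokenizer/tagger data are installed; the NP grammar is a simplification):

    import nltk
    from nltk.util import ngrams

    sentence = "XTC was one of the smartest and catchiest British pop bands."
    tokens = nltk.word_tokenize(sentence)

    # n-gram counting (n1, n2, n3)
    n1, n2, n3 = tokens, list(ngrams(tokens, 2)), list(ngrams(tokens, 3))

    # part-of-speech tagging; adjective terms are anything tagged JJ*
    tagged = nltk.pos_tag(tokens)
    adjectives = [w for w, tag in tagged if tag.startswith("JJ")]

    # crude noun-phrase chunking: optional adjectives followed by nouns
    chunker = nltk.RegexpParser("NP: {<JJ.*>*<NN.*>+}")
    noun_phrases = [" ".join(w for w, _ in st.leaves())
                    for st in chunker.parse(tagged).subtrees()
                    if st.label() == "NP"]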

Part-of-speech Tagging

• A rule-based tagger is simplest
• Takes each lexeme / term and assigns it a class
• English-specific for now
• We use Brill’s tagger (1992)

What’s a Noun Phrase?

• Simple NP chunking algorithm:
  – Find a noun
  – Extend the selection to include the most description relating to that noun
    • “loud rock,” “rock,” “tons of loud rock,” “rock lobster,” “not rock”
• Far more descriptive than an n-gram, but far more rare (less overlap)

Artist Terms

• From the OpenNap crawl we have a list of ~10,000 artist names
• If any co-occur in the document set, we store them as a separate feature
• Meant to capture direct references

Adjective Terms

• We store all adjectives separately
• Human readable
• Far smaller in size
• Higher descriptive power

Counting Terms

• Each term (n1,n2,np,art,adj) has two outputs: fd and ft

• fd is ‘document frequency:’ – How often does this term occur everywhere? (High for ‘the’, ‘loud’)

• ft is ‘term frequency’:
  – How often does this term occur for our artist?

Example: Portishead

(Example source document: a fan review of Portishead, reproduced below)

Artist: Portishead Album: Pnyc Released: 1998 Review by: punkis This live CD was taped at a 1997 concert at the Roseland in NYC and features the band and a backing symphony. For those of you who don't know, Portishead is generally lumped into the 'Trip Hop' category. While I don't really consider it to be Trip Hop, it is excellent and extremely undefinable music. Most of the music has a dark and moody sound which is one of the reasons I find it so appealing. One of the hallmarks of the music has got to be the background effects supplied by DJ and sometimes percusionist Geoff Barrows. What makes his backgrounds so impressive is the fact that he rarely samples other bands' music as most DJ's do. Most of the samples are supplied by the band themselves. Don't think it's possible to have a DJ "scratching" to songs backed by a symphony? Well he pulls it off amazingly. To top it all off the lead singer, Beth Gibbons, has one of the most beutiful and unique voices I have ever heard. Her voice blends perfectly with the backdrop supplied by the band and symphony. Her lyrics comes off as full of passion and from the heart. The CD contains 11 tracks, all of which are standout performances. I am a big fan of Porishead and am probably a bit biased but if I were stranded on a desert island with only 10 CD's this would be one of them. One of the reasons I find Portishead's music so appealing is the fact that you can listen to it while your working which is a big plus for me. … Example: Portishead

0.05849832780086525 0.00205761316872428 1 pnyc 0.04873439082011475 0.00205761316872428 0.833090323299701545 seductolicious 0.045765611633875105 0.00205761316872428 0.782340510478629184 nikf 0.04502780153977759 0.00205761316872428 0.769728011594744813 looooooooooooooooove 0.04458939264328485 0.00205761316872428 0.762233628199972707 mushed 0.04276625320786995 0.00205761316872428 0.731067960668055087 hearvey 0.04200705731394355 0.00205761316872428 0.718089882106377453 scoona 0.04023738237810095 0.00205761316872428 0.687838163768944486 goldfrappp 0.0394621471343028 0.00205761316872428 0.674585900448920431 lwiegard5 0.0371548452170544 0.00205761316872428 0.635143714595288686 mysterons 0.036329127459367 0.00205761316872428 0.621028477652136496 jbadder 0.0353292090145824 0.00205761316872428 0.603935366064597917 percusionist 0.03481608212147135 0.00205761316872428 0.595163715448228327 fey3 0.03454773869346735 0.00205761316872428 0.590576517179630452 archiveportishead 0.033169375534645 0.00205761316872428 0.567014080258109379 aaahhhhh 0.0317767653758542 0.00205761316872428 0.543208097913940528 beatpolygram 0.0255355612682091 0.00205761316872428 0.436517798511009796 bartended 0.04863557858376511 0.00411522633744856 0.415700588479434622 alternativetrip 0.048424050082601515 0.00411522633744856 0.413892600891450387 raidohead 0.0242043551088777 0.00205761316872428 0.413761487187667831 comcompellingreviewssecondlaunch25 0.047971915485609945 0.00411522633744856 0.410028092161126624 rocksuperstardom 0.02386934673366835 0.00205761316872428 0.40803468459683557 43dummy 0.04748065385618642 0.00411522633744856 0.405829154790678757 attendent 0.04725023910964264 0.00411522633744856 0.403859741687725309 epicly 0.047174158768802715 0.00411522633744856 0.403209463776372775 stringsounds 0.045013477088948785 0.00411522633744856 0.384741571093960311 tryer 0.02244439692044485 0.00205761316872428 0.383675871844543126 spose 0.0207467962881131 0.00205761316872428 0.354656228101724195 porishead 0.04076602034605685 0.00411522633744856 0.34843748427159211 carpella Term Coverage

• n2,np far greater in size

(Figure: relative term-set sizes for n1, n2, np, adj, art)

Experiment

• Can community metadata predict an edited similar-artist list?
• Problems:
  – Incomplete ground truth
  – Editor-dependent
• Best way to evaluate CM?

Ground Truth

• AMG similar lists:

• “Soft” ground truth: relative comparisons are useful

What’s a good scoring metric?

• Count the number of shared terms?
  – What about ‘the’? ‘music’? ‘web’?
  – Shouldn’t more specific terms be worth more? (“Electronic gamelan rock”)

What’s a good scoring metric?

• TF-IDF provides a natural weighting:

    s(f_t, f_d) = f_t / f_d

• More ‘rare’ co-occurrences mean more
  – i.e. two artists sharing the term “heavy metal banjo” vs. “”
• But…

Curse of Specificity

• Straight TF-IDF breaks with:
  – Typos
  – Band members’ names, proper nouns
  – Lingo
  – Technical terms / web terms

Smooth the TF-IDF

• Reward ‘mid-ground’ terms:

    s(f_t, f_d) = f_t · e^(-(log(f_d) - μ)² / (2σ²))

Smoothing Function

• Inputs are term and document frequency with mean and standard deviation:

    s(f_t, f_d) = f_t · e^(-(log(f_d) - μ)² / (2σ²))

• We use a mean μ of 6 and a standard deviation σ of 0.9

Example

• For Portishead:

Experiments

• Will two known-similar artists have a higher overlap than two random artists?
• Use 2 metrics:
  – Straight TF-IDF sum
  – Smoothed gaussian sum
• On each term type
• Similarity is a sum over all shared terms:

    S(a,b) = Σ s(f_t, f_d)
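A small sketch of both metrics and the summed similarity (numpy only; how the two artists’ term frequencies are combined for a shared term is not spelled out here, so taking the smaller of the two is a guess):

    import numpy as np

    def tfidf(ft, fd):
        return ft / fd

    def smoothed(ft, fd, mu=6.0, sigma=0.9):
        return ft * np.exp(-(np.log(fd) - mu) ** 2 / (2 * sigma ** 2))

    def similarity(terms_a, terms_b, doc_freq, score=smoothed):
        """terms_*: {term: term frequency for that artist}; doc_freq: {term: document frequency}."""
        shared = set(terms_a) & set(terms_b)
        # combining the two artists' term frequencies with min() is an assumption
        return sum(score(min(terms_a[t], terms_b[t]), doc_freq[t]) for t in shared)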

TF-IDF Sum Results

• Accuracy: % of artist pairs that were predicted similar correctly (S(a,b) > S(a,random))
• Improvement = S(a,b) / S(a,random)

              N1     N2     Np     Adj    Art
Accuracy      78%    80%    82%    69%    79%
Improvement   7.0x   7.7x   5.2x   6.8x   6.9x

Gaussian Smoothed Results

• Gaussian does far better on the larger term types (n1,n2,np)

              N1     N2     Np     Adj    Art
Accuracy      83%    88%    85%    63%    79%
Improvement   3.4x   2.7x   3.0x   4.8x   8.2x

P2P Similarity

• Crawling P2P networks
• Download user → song relations
• Similarity inferred from collections?
• Similarity metric:

    S(a,b) = (C(a,b) / C(b)) · (1 - |C(a) - C(b)| / C(c))
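A minimal sketch of this metric (reading C(c) as a normalizing count, e.g. for the most popular artist, which is an assumption):

    def p2p_similarity(c_ab, c_a, c_b, c_norm):
        """c_ab: collections with both artists; c_a, c_b: collections with each artist;
        c_norm: normalizing count (assumed: the most popular artist's count)."""
        return (c_ab / c_b) * (1 - abs(c_a - c_b) / c_norm)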

P2P Crawling Logistics

• Many freely available scripting ‘agents’ for P2P networks
• Easier: OpenNap, Gnutella, Soulseek
  – No real authentication/social protocol
• Harder: Kazaa, DirectConnect, Hotline/KDX/etc.
• Usual algorithm: search for a random band name, browse the collections of matching clients

(Figure: overall accuracy)

Classification

(Machine intelligence on features)

Why use classification?

• With our feature vectors in place we need a model for new data
• Simplest case: define a distance metric for two points in your feature space
• But usually you want to regularize observations

Distance Measures

• Minkowski (geometric) distance:

    D_M(x, x') = ( Σ_{k=1..d} |x_k - x'_k|^q )^(1/q)

  – Manhattan distance (q=1), Euclidean distance (q=2)

• Cosine distance:

    D_C(x, x') = (x · x') / (|x| |x'|)

• Mahalanobis distance:

    D_S(x, x')² = (x - x') S⁻¹ (x - x')ᵀ
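Quick numpy versions of these (the covariance matrix S for Mahalanobis is supplied by the caller):

    import numpy as np

    def minkowski(x, y, q=2):
        return np.sum(np.abs(x - y) ** q) ** (1.0 / q)   # q=1 Manhattan, q=2 Euclidean

    def cosine(x, y):
        return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

    def mahalanobis_sq(x, y, S):
        d = x - y
        return float(d @ np.linalg.inv(S) @ d)           # squared Mahalanobis distance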

Other Distance Measures

• Edit distance:
  – Cost of DP adds/deletes/substitutions
  – (Words:) cost of letter adds/deletes, etc.
• “Erdos” / jump distance:
  – Length of the connecting path
  – (Words:) thesaurus lookups to get from word A to word B

Classification algorithms

• Gaussian mixture models (GMM)
  – Fit multiple multi-dimensional gaussians to best encapsulate the training classes
• Expectation-maximization (EM)
• Number of gaussians defined a priori

Gaussian Mixture Models

• Clustering
  – k-means
  – Expectation maximization (EM)

Classification algorithms

• Support vector machines (SVM)
  – Find an optimally separating hyperplane between classified data
  – Train a machine for each individual class, and classify by calculating the best-performing machine

Support Vector Machines

• SVMs find the optimal regression line through data embedded in a kernel space
• Data is defined by a distance matrix, a.k.a. the “Gram matrix” or “kernel space”
• Usually only two parameters:
  – Kernel parameter (e.g. the gaussian’s variance width)
  – Max Lagrangians (~ generalization)
• Work surprisingly well if your data is OK

Kernel Space

(Figure: the Gram matrix of observed pairwise kernel values)

    K(x_i, x_j) = e^(-||x_i - x_j||² / (2δ²))

• The distance function represents the data
  – (a gaussian works well for audio)
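A sketch of building that Gram matrix (numpy only; the kernel width δ is left to the caller):

    import numpy as np

    def gaussian_gram(X, delta):
        """X: (l, d) observations. Returns the l x l gaussian-kernel (Gram) matrix."""
        sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        return np.exp(-sq_dists / (2 * delta ** 2))      # entries fall roughly in 0..1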

SVMs 2

(Figure: rock vs. jazz points separated by a hyperplane)

• The blue line is the support vector (there can be many; the more SVs, the less robust)

• In effect, consider an SVM a compressor: for the purposes of detecting genre, all you need is one vector!

SVMs 3

(Figure: new rock/jazz points classified relative to the support vector)

• New data is classified as being on either side of the SV. The kernel distance of the new point from the SV is its magnitude.

• Confidence thresholding is sometimes important in large frame-type problems.

SVM Logistics: Software

• MATLAB SVM toolbox
• Roll your own RLSC
• NODElib (Flake, NECI / Overture)
• SVMfu (Rifkin, MIT)
• Java libsvm

SVM Tips and Tricks

• d << l, the more l the better • Confidence threshold! • Multi-class-- bias is important • Audio: – gaussian kernel, variance at mean of training set squared. Start C at 10, if you don’t trust your data move it up to 100. – Check your kernel! Visualize the Gram matrix if you can, check the distribution, it should be 0..1 or so “Multiclass” SVMs

• SVMs are 1/0 machines (or -1..1 for regression)
• Multiple SVMs with winner-take-all work
• Other avenues: feed the SVM outputs to a NN or linear classifier
• Don’t throw away all class outputs!

Severe multi-class problem

(Figure: observations assigned to classes a, B, C, D, E, F, G, with unknowns “?”)

1. Incorrect ground truth
2. Bias
3. Large number of output classes

Regularized least-squares classification (RLSC)

• (Rifkin 2002)

(Figure: the l × l Gram matrix K of pairwise kernel values)

    (K + I/C) c_t = y_t        ⇒        c_t = (K + I/C)⁻¹ y_t

• c_t = machine for class t
• y_t = truth vector for class t
• C = regularization constant (10)
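A sketch of RLSC training on top of the Gram matrix above (using a linear solve instead of an explicit inverse; C = 10 as on the slide):

    import numpy as np

    def train_rlsc(K, Y, C=10.0):
        """K: (l, l) Gram matrix; Y: (l, n_classes) matrix of +1/-1 truth vectors.
        Solves (K + I/C) c_t = y_t for every class at once."""
        return np.linalg.solve(K + np.eye(K.shape[0]) / C, Y)

    def rlsc_scores(K_new, machines):
        """K_new: kernel values between new and training points, shape (m, l)."""
        return K_new @ machines        # per-class scores for each new point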

Anchor Models

• Berenzweig/Ellis/Lawrence 02, Whitman/Flake/Lawrence 01

(Figure: anchor classifiers for Rock, Jazz, Elec)

Observations are put in ‘anchor space’, one dimension per classifier. Then we use a distance metric (B/E/L) or a neural network (W/F/L) to measure similarity of observations.

• Semantically regularizing observations

Anchor Model Steps

• Train n SVMs, one for each anchor
  – On half the song space; use the other half to evaluate the anchors
• Run all songs and future songs through the feature extractor, then the trained SVMs
  – The new ‘anchor space’ is the mean of the SVM posteriors: an n-dimensional vector
• Sort similarity by a chosen distance function
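A rough sketch of the anchor-space mapping using scikit-learn SVMs with probability outputs (a stand-in for the original implementations; anchor names, features and labels are placeholders):

    import numpy as np
    from sklearn.svm import SVC

    def train_anchors(frames, anchor_labels):
        """frames: (n_frames, d) features; anchor_labels: {anchor_name: 0/1 label per frame}."""
        return {name: SVC(kernel="rbf", probability=True).fit(frames, y)
                for name, y in anchor_labels.items()}

    def to_anchor_space(anchors, song_frames):
        """A song becomes the mean posterior of each anchor SVM over its frames."""
        return np.array([clf.predict_proba(song_frames)[:, 1].mean()
                         for clf in anchors.values()])

Similarity is then just a distance between these anchor-space vectors.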

Anchor Model Demo

Problems with Anchors

• Anchor makers (us) are biased!
  – “World Music” is a huge insult!
  – “Rock/Pop” doesn’t mean anything
  – “Jazz” is ill-defined
  – “Classical” is too broad
  – “Electronic” is inane (IDM especially)
• (Automatically grounding terms is better!)

Time Embedding

• Observations can be windowed with deltas on memory:

• Quick hack, but it often works well!
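A sketch of the hack (stack each frame with its difference from the previous frame):

    import numpy as np

    def embed_with_deltas(frames):
        """frames: (n_frames, d). Returns (n_frames - 1, 2d): each frame plus its delta."""
        deltas = np.diff(frames, axis=0)
        return np.hstack([frames[1:], deltas])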

Artist ID (Case Study)

Artist Identification

• Good test of “music intelligence”
  – Cultural factors, many time scales
  – Vocal identity, aesthetic similarity
  – Easy ground truth!
• Commercial uses:
  – “Copy-cop” efficiency
  – Recommendation
• Problems:
  – “Album effect”
  – “Arto effect”
  – Scaling

Approaches to Artist ID

• Whitman/Flake/Lawrence 01
  – SVM artist anchors into a NN, PSD-type features
  – 80% on small sets, 20% on a mid-size set
• Berenzweig/Ellis/Lawrence 02
  – Genre anchor models, GMM → NN, MFCC features
  – 30-40% on a very large set
• Kim/Whitman 02
  – Vocal features, GMM, SVM; 40% on a mid-size set
• Whitman 03
  – Grounded anchor models, PSD features
  – 66% on a mid*2 set

Why Not a Song Database?

• Very high (90%+!) accuracy on song ID
• But the entire database needs to be stored and maintained
• We should stick to eigenartists

Evaluation

(How to know when you’re right.)

Evaluation

• Similarity is intrinsically human
• Most agents are tuned to scientist bias
• OK idea: get usage data
  – Impossible to fill the graph!
• Better idea: surveys and experts
  – Low turnout, expensive/illegal
• Best idea: have the system become ‘trusted’
  – Similarity as suggestion, not prediction

Getting Similarity Data

• Usage observations:
  – Sales data, P2P data, website clicks, hard-drive searches, iPod contents
  – (Does owning == preference?)
  – (Does owning == similarity?)
  – (Privacy)

The Graph Problem

(Figure: songs × songs similarity matrix)

For us to fill in the graph we’ll need (n*n)/2 song relations.

What if we just sampled?

Why can’t we assume that unconnected matches have zero similarity?
(Figure: red likes 3 songs, yellow likes 2)

Feedback

(Figure: songs × songs matrix with only popular rows filled in)

If something is unpopular we can’t make predictions about it.
Isn’t the whole point of music similarity to find “hidden gems”?

CF’s Feedback

• CF should only be a last resort!
• Recommendations are intrinsically weighted by popularity!
• Unfortunately, people don’t seem to mind

Better Evaluations

• Surveys, explicit recommendations
  – “No one wants to take a survey.”
• Ask the experts
  – AMG, etc. list similar artists
  – (Not songs)

Case Study: Poperdos

• Ellis/Whitman/Berenzweig/Lawrence
• 224,000 observations like:
  – “X is more similar to Y than Z.”
• Artist level
• Fun game, blog hook, good advertising

Trusted Authority

• AMG doesn’t publish as a recommender, and yet thousands use them as one
• “Leave one out” metric to evaluate a recommender:
  – P(song|user) (the baseline is usually tiny)
• Instead, marketing + consistent recommendations + timeliness will win

Grounding

(Audio + Audience)

Music intelligence

(Diagram: structure, recommendation, genre/style ID, artist ID, song similarity, synthesis)

• Extracting salience from a signal
• Learning is features and regression

(Figure: classifier output, e.g. “ROCK/POP” vs. “Classical”)

Better understanding through semantics

(Diagram: structure, recommendation, genre/style ID, artist ID, song similarity, synthesis)

Loud college rock with electronics.

• What if a system learned the meaning of the underlying perception?
• How can we get context to computationally influence understanding?

Using context to learn descriptions of perception

• “Grounding” meanings (Harnad 1990): defining terms by linking them to the ‘outside world’

Query-by-description as an evaluation case

• QBD: “Play me something loud with an electronic beat.”
• With what probability can we accurately describe music?
• Training: we play the computer songs by a bunch of artists, and have it read about the artists on the Internet.
• Testing: we play the computer more songs by different artists and see how well it can describe them.

The audio data

• Large set of music audio
  – Minnowmatch testbed (1000 albums)
  – Most popular on OpenNap, August 2001
  – 51 artists randomly chosen, 5 songs each
• Each 2-second frame is an observation:
  – Time domain → PSD → PCA to 20 dimensions
  – (2 sec audio → 512-dim PSD → 20-dim PCA)

“Community metadata”

• The description side is the adjective vectors from CM

Learning formalization

• Learn the relation between audio and naturally encountered description
• Can’t trust the target class!
  – Opinion
  – Counterfactuals
  – Wrong artist
  – Not musical
• 200,000 possible terms (output classes!)
  – (For this experiment we limit it to adjectives)

Learning QBD

  Audio features, artist 0, frame 1    “Electronic” 0.30   “Loud” 0.30   “Talented” 2.0
  Audio features, artist 0, frame 2    “Electronic” 0.30   “Loud” 0.30   “Talented” 2.0
  Audio features, artist 0, frame 3    “Electronic” 0.30   “Loud” 0.30   “Talented” 2.0
  Audio features, artist 1, frame 1    “Electronic” 0.1    “Loud” 3.23   “Talented” 0.4
  Audio features, artist 1, frame 2    “Electronic” 0.1    “Loud” 3.23   “Talented” 0.4
  Audio features, artist 3, frame 1    “Electronic” 0      “Loud” 0.95   “Talented” 0
  Audio features, artist 3, frame 2    “Electronic” 0      “Loud” 0.95   “Talented” 0
  Audio features, artist 3, frame 3    “Electronic” 0      “Loud” 0.95   “Talented” 0

Computational benefit

• (n classes, d input dimensions, l examples)
• Store 2 l×l Gram matrices in memory, or train n SVMs?
• RLSC (time depends on d, l; memory depends on l):
  – 1.5 hrs for precomputing & inverting
  – 256 MB of space for both Gram matrices
• SVM (time depends on d, n, l; memory depends on d, l):
  – 250 hrs for training 1/10th of the SVMs
  – Max 16 MB cache needed

QBD evaluation results

• Compute ‘weighted precision’ P(p)P(n)

• Usual IR evaluations are worthless here because of the incredibly low baseline, mistrust of the data, and bias
• What is important are the deltas!

Per-term accuracy

  Good terms            Bad terms
  Electronic   33%      Annoying      0%
  Digital      29%      Dangerous     0%
  Gloomy       29%      Fictional     0%
  Unplugged    30%      Magnetic      0%
  Acoustic     23%      Pretentious   1%
  Dark         17%      Gator         0%
  Female       32%      Breaky        0%
  Romantic     23%      Sexy          1%
  Vocal        18%      Wicked        0%
  Happy        13%      Lyrical       0%
  Classical    27%      Worldwide     2%

  Baseline = 0.14%

• Good term set as a restricted grammar?

Time-aware audio features

• MPEG-7 derived state paths (Casey 2001)
• Music as a discrete path through time
• Regularized to 20 states

(Figure: state path, 0.1 s per state)

Per-term accuracy

  Good terms          Bad terms
  Busy       42%      Artistic    0%
  Steady     41%      Homeless    0%
  Funky      39%      Hungry      0%
  Intense    38%      Great       0%
  Acoustic   36%      Awful       0%
  African    35%      Warped      0%
  Melodic    27%      Illegal     0%
  Romantic   23%      Cruel       0%
  Slow       21%      Notorious   0%
  Wild       25%      Good        0%
  Young      17%      Okay        0%

• Weighted accuracy (to allow for bias)

Synthesizing opinion

• “What does loud mean?”

• Weighted mean of labeled observations

Semantic decomposition

• Music models from unsupervised methods find statistically significant parameters
• Can we identify the optimal semantic attributes for understanding music?

(Figure: example attribute axes — female/male, angry/calm)

The linguistic expert

• Some semantic attachment requires ‘lookups’ to an expert

(Figure: observed terms “Dark”, “Big”, “Light”, “Small”, and an unknown “?”)

Linguistic expert

• Perception + language: “Dark”, “Big”, “Light”, “Small” observed
• Lookups to the linguistic expert:

    (axes: Big ↔ Small, Dark ↔ Light)

• Allows you to infer a new gradation:

    (the “?” item placed along the Big ↔ Small, Dark ↔ Light axes)

Parameters: synants of “quiet”

“The antonym of every synonym and the synonym of every antonym.”

(Figure: synant set for “quiet” — “soft”, “noisy”, “thundering”, “clangorous”, “hard” — linked by synonym and antonym edges)

Learning the knobs

• Nonlinear dimension reduction
  – Isomap
• Like PCA/NMF/MDS, but:
  – Meaning oriented
  – Better perceptual distance
  – Only feed polar observations as input
• Future data can be quickly semantically classified with guaranteed expressivity
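A sketch of recovering a one-dimensional “knob” with scikit-learn’s Isomap (the feature matrix here is a random placeholder; real input would be frames judged “quiet” or “loud”):

    import numpy as np
    from sklearn.manifold import Isomap

    X = np.random.randn(200, 20)                 # placeholder for polar audio observations
    iso = Isomap(n_neighbors=10, n_components=1)
    knob = iso.fit_transform(X)                  # knob[:, 0] is the single quiet..loud parameter
    # new observations can be placed on the same knob with iso.transform()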

(Figure: learned quiet .. loud and male .. female knobs)

Parameter understanding

• Some knobs aren’t intrinsically 1-D
• Color spaces & user models!

Top descriptive parameters

• All P(a) of terms in anchor synant sets averaged
• e.g. P(quiet) = 0.2, P(loud) = 0.4, P(quiet-loud) = 0.3
• A sorted list gives the best grounded parameter map

  Good parameters                 Bad parameters
  Big – little             30%    Evil – good                  5%
  Present – past           29%    Bad – good                   0%
  Unusual – familiar       28%    Violent – nonviolent         1%
  Low – high               27%    Extraordinary – ordinary     0%
  Male – female            22%    Cool – warm                  7%
  Hard – soft              21%    Red – white                  6%
  Loud – soft              19%    Second – first               4%
  Smooth – rough           14%    Full – empty                 0%
  Vocal – instrumental     10%    Internal – external          0%
  Minor – major            10%    Foul – fair                  5%

What’s next

• Human evaluation
  – cf. Reiger/Carlson
  – “Can we trust the Internet for community meaning?”
• Time-aware features
• Learning parameter spaces
  – “fast .. slow”, “loud .. soft”
  – Knobs for retrieval/synthesis
• Bootstrapping terms from the expert
• Hierarchy learning

Real-time

• “Description synthesis”
• RLSC can’t run in real time, so we train SVMs for the high-scoring terms

Reverse: semantic synthesis

• “What does college rock sound like?”
• Meaning as transition probabilities

“Loud rock with electronics”

Future: music acquisition

• Short-term music model: auditory scene to events
• Structural music model: recurring patterns in music streams
• Language of music: relating artists to descriptions (cultural representation)
• Music acceptance models: the path of music through a social network

Grounding sound, “what does loud mean?”

Semantics of music: “what does rock mean?”

What makes a song popular?

Semantic synthesis