
Heterogeneous Embedding for Subjective Artist Similarity

Brian McFee and Gert Lanckriet
Computer Audition Laboratory, University of California, San Diego

Overview

Goal: Construct a similarity measure between artists that agrees with human perception.

Similarity: Similarity between artists is expressed by relative comparisons (x,y,z): artist x is more similar to y than to z.

Consistency: Artist similarity is inherently subjective, and may vary from person to person. How can we quantify consistency?

Embedding: Our similarity measure is defined by the Euclidean distance between artists. Given the variety of features available, what is the best way to combine them? Human feedback will help us construct an optimal embedding from the input features.

Quantifying Consistency

Data: We use aset400 [2]:
  412 popular artists
  16385 similarity measurements

Direct inconsistency: Disagreement on the direction of an edge between the pairs (x,y) and (x,z).

General inconsistency: Higher-order disagreements can be removed by finding maximal acyclic subgraphs.

Results:

  Quantity                                               Edges     Share
  Total number of edges                                  16385     100%
  Retained edges after pruning direct inconsistencies    13420     81.9%
  Average size of maximal consistent subgraphs           8975.2    54.8%
  Number of edges included in all acyclic subgraphs      8598      52.5%

Embedding Algorithm

Idea: View each artist in heterogeneous feature spaces by using multiple kernels.

Problem: Features may disagree with human perception. Features are not all equally informative.

Solution: Construct an optimal embedding from the feature spaces by learning projections. The optimization is constrained to match human perception measurements by Partial Order Embedding [1].

[Figure: artists are viewed in feature spaces 1, 2, ..., m, which are combined into an optimized embedding; training uses human perception measurements (x,y,z).]

Input Features

Text
  Tags: 7737 tags from last.fm; TF-IDF cosine kernel.
  Biography: 16753 words from artist biographies; TF-IDF cosine kernel.

Acoustic
  MFCC: 13 MFCCs + first and second derivatives; modeled by Gaussian mixtures; probability product kernel.
  Chroma: 12-d summary of pitch distribution; modeled by full-covariance Gaussian; symmetrized KL-divergence kernel.
  Semantic Multinomial (SM): distribution over 149 auto-tags; derived from MFCCs [3]; probability product kernel.

Example

[Figure: The metal/punk region of an optimized embedding. Artists shown: Linkin Park, Kid Rock, Queensryche, Disturbed, Rammstein, Rage Against the Machine, Pennywise, New Found Glory, Rancid, Iron Maiden, SR-71, MxPx, Marilyn Manson, Sublime, Goldfinger, blink-182, Skid Row, White Zombie, Green Day, Bloodhound Gang, The Offspring, Me First and the Gimme Gimmes, Orgy, Tool, Nine Inch Nails, Wheatus, Faith No More, Foo Fighters, 311, Filter, Finger Eleven, Smash Mouth, Tesla.]

Similarity Prediction

Evaluation: Direct inconsistencies are pruned and the data is split for 10-fold cross-validation.
  Average training set size: 6252.7 edges
  Average test set size: 1149.6
Note that the test set has not been processed for internal consistency, so 100% accuracy is not possible.

Prediction task:
  1. Embed the training set (learn projections)
  2. Map unseen artists into the space
  3. Use distance to predict similarities (x,y,z) where x is unseen

Results: Prediction accuracy before (Native) and after (Optimized) learning an optimal embedding.

  Features          Native    Optimized
  MFCC              0.535     0.620
  Chroma            0.507     0.561
  SM                0.554     0.590
  MFCC+Chroma       0.522     0.611
  MFCC+SM           0.556     0.614
  Tags              0.705     0.776
  Biography         0.514     0.705
  Tags+Bio          0.640     0.790
  Tags+MFCC         0.693     0.773
  Tags+Bio+MFCC     0.640     0.783
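Step 3 of the prediction task, using distance to predict a relative comparison (x,y,z), can be illustrated with a minimal sketch. It assumes the embedding coordinates are already available as a dictionary; the learned projections and the mapping of unseen artists are not shown. The function names and the toy coordinates are hypothetical and do not come from the poster.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def predict_triplet(embedding, x, y, z):
    """Return True if x is predicted to be more similar to y than to z."""
    return euclidean(embedding[x], embedding[y]) < euclidean(embedding[x], embedding[z])

def triplet_accuracy(embedding, triplets):
    """Fraction of (x, y, z) comparisons that the embedding distances agree with."""
    correct = sum(predict_triplet(embedding, x, y, z) for x, y, z in triplets)
    return correct / len(triplets)

# Toy 2-d embedding with made-up coordinates (assumption, for illustration only).
embedding = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (3.0, 0.0), "d": (0.0, 5.0)}

# Each triplet (x, y, z) claims that x is more similar to y than to z.
test_triplets = [("a", "b", "c"), ("a", "d", "b"), ("c", "b", "d")]

accuracy = triplet_accuracy(embedding, test_triplets)
print(f"accuracy = {accuracy:.3f}")
```

In this toy case the second comparison contradicts the distances, so two of the three comparisons are predicted correctly. A test set that contains contradictory comparisons caps the achievable accuracy in the same way, which is why 100% accuracy is not possible on unprocessed test data.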

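The pruning of direct inconsistencies can be sketched as follows. This is an illustrative sketch rather than the authors' implementation, and it rests on two assumptions: each comparison (x,y,z) becomes a directed edge from the pair (x,y) to the pair (x,z), and when both directions of an edge are present, both are dropped. All function names are hypothetical. The acyclicity check uses Kahn's algorithm; finding maximal acyclic subgraphs is not shown.

```python
def pair(a, b):
    """Unordered artist pair, used as a graph vertex."""
    return tuple(sorted((a, b)))

def build_edges(triplets):
    """Turn each (x, y, z) into a directed edge (x,y) -> (x,z) (assumed direction)."""
    return {(pair(x, y), pair(x, z)) for x, y, z in triplets}

def prune_direct(edges):
    """Drop every edge whose reverse is also present (assumption: both are removed)."""
    return {(u, v) for (u, v) in edges if (v, u) not in edges}

def is_acyclic(edges):
    """Kahn's algorithm: a DAG can be emptied by repeatedly removing zero-indegree vertices."""
    nodes = {n for e in edges for n in e}
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    removed = 0
    while ready:
        u = ready.pop()
        removed += 1
        for v in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    return removed == len(nodes)

# Toy comparisons with made-up artist labels (assumption, for illustration only).
triplets = [
    ("a", "b", "c"),  # a is more similar to b than to c
    ("a", "c", "b"),  # contradicts the comparison above
    ("a", "b", "d"),
]

edges = build_edges(triplets)
consistent = prune_direct(edges)
print(len(edges), "edges before pruning,", len(consistent), "after")
```

Here the first two comparisons disagree on the direction of the edge between (a,b) and (a,c), so only the third comparison survives pruning and the remaining graph is acyclic.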
Graph Processing

1. Humans supply similarity measurements (x,y,z).
2. Similarity measurements are combined to form a directed graph over pairs, relating the more similar pair (x,y) to the less similar pair (x,z).
3. Inconsistencies are pruned to reveal a directed acyclic graph (DAG).
4. The DAG is simplified to a minimal equivalent edge set. These edges form the constraints of the optimization.

References

[1] Brian McFee and Gert Lanckriet. Partial order embedding with multiple kernels. In Proceedings of the 26th International Conference on Machine Learning, 2009.

[2] D. Ellis, B. Whitman, A. Berenzweig and S. Lawrence. The quest for ground truth in musical artist similarity. In Proceedings of the International Symposium on Music Information Retrieval (ISMIR2002), pp. 170-177, 2002.

[3] Douglas Turnbull, Luke Barrington, David Torres, and Gert Lanckriet. Semantic annotation and retrieval of music and sound effects. IEEE Transactions on Audio, Speech and Language Processing, 16(2):467-476, February 2008.