Effects of Album and Artist Filters in Audio Similarity Computed for Very Large Music Databases
Arthur Flexer∗ and Dominik Schnitzer∗†
∗Austrian Research Institute for Artificial Intelligence, Freyung 6/6, A-1010 Vienna, Austria, [email protected]
†Department of Computational Perception, Johannes Kepler University Linz, Altenberger Str. 69, A-4040 Linz, Austria, [email protected]

Computer Music Journal, 34:3, pp. 20–28, Fall 2010. © 2010 Massachusetts Institute of Technology.

In music information retrieval, one of the central goals is to automatically recommend music to users based on a query song or query artist. This can be done using expert knowledge (e.g., www.pandora.com), social meta-data (e.g., www.last.fm), collaborative filtering (e.g., www.amazon.com/mp3), or by extracting information directly from the audio (e.g., www.muffin.com). In audio-based music recommendation, a well-known effect is the dominance, in recommendation lists, of songs from the same artist as the query song.

This effect has been studied mainly in the context of genre-classification experiments. Because ground truth with respect to music similarity usually does not exist, genre classification is widely used to evaluate music similarity measures. Each song is labelled as belonging to a music genre using, e.g., the advice of a music expert; high genre classification accuracy is taken to indicate a good similarity measure. If, in genre classification experiments, songs from the same artist are allowed in both training and test sets, this can lead to over-optimistic results, since usually all songs from an artist have the same genre label. It can be argued that in such a scenario one is doing artist classification rather than genre classification. One could even speculate that the specific sound of an album (mastering and production effects) is being classified. Pampalk, Flexer, and Widmer (2005) therefore proposed a so-called “artist filter”, which ensures that a given artist’s songs are either all in the training set or all in the test set. Those authors found that the use of such an artist filter can lower classification results quite considerably (from 71 percent down to 27 percent for one of their music collections). These over-optimistic accuracy results due to not using an artist filter have been confirmed in other studies (Flexer 2006; Pampalk 2006). Other results suggest that the use of an artist filter not only lowers genre classification accuracy but may also erode the differences in accuracy between different techniques (Flexer 2007).

All these results were achieved on rather small databases (from 700 to 15,000 songs). Often whole albums from an artist were part of the database, perhaps even more than one. These specifics of the databases are often unclear and not properly documented. The present article extends these results by analyzing a very large data set (over 250,000 songs) containing multiple albums from individual artists. We try to answer the following questions:

1. Is there an album and artist effect even in very large databases?
2. Is the album effect larger than the artist effect?
3. What is the influence of database size on music recommendation and classification?

As will be seen, we find that the artist effect does exist in very large databases, and that the album effect is bigger than the artist effect.

Data

For our experiments we used a data set D(ALL) of S = 254,398 song excerpts (30 seconds each) from a popular Web store selling music. The freely available preview song excerpts were obtained with an automated Web-crawl. All meta-information (artist name, album title, song title, genres) is parsed automatically from the HTML code. The excerpts are from U = 18,386 albums by A = 1,700 artists. Of the 280 different hierarchical genres in the store, only the G = 22 general ones at the top of the hierarchy are kept for further analysis (e.g., “Pop/General” is kept, but not “Pop/Vocal Pop”). The names of the genres, with the percentages of songs belonging to each genre, are given in Table 1. (Each song is allowed to belong to more than one genre, hence the percentages in Table 1 add up to more than 100 percent.) The genre information is identical for all songs on an album. The numbers of genre labels per album are given in Figure 1. Our database was set up so that every artist contributes between 6 and 29 albums (see Figure 2).

Figure 1. Percentages (y-axis) of albums having 1, 2, 3, 4, or 5 to 8 genre labels (x-axis).

Figure 2. Percentages (y-axis) of artists having 6, 7, ..., 19, or 20 to 29 albums (x-axis).
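A minimal sketch of this genre-filtering step and of the multi-membership percentages of Table 1, under the assumption that genre labels arrive as hierarchical strings and that the 22 general genres act as a whitelist (the matching rule and all names here are illustrative, not the authors' code):

```python
# Sketch: keep only the 22 general top-of-hierarchy genres and compute
# multi-membership percentages as in Table 1. Whitelist rule and names
# are illustrative assumptions.
from collections import Counter

GENERAL_GENRES = {"Pop", "Classical", "Broadway", "Soundtracks",
                  "Christian/Gospel", "New Age", "Miscellaneous",
                  "Opera/Vocal", "Alternative Rock", "Rock", "Rap/Hip-Hop",
                  "R&B", "Hard Rock/Metal", "Classic Rock", "Country",
                  "Jazz", "Children's Music", "International", "Latin Music",
                  "Folk", "Dance & DJ", "Blues"}  # G = 22

def general_labels(hierarchical_genres):
    """Map e.g. ['Pop/General', 'Pop/Vocal Pop'] to {'Pop'}; subgenre
    entries such as 'Pop/Vocal Pop' are discarded."""
    kept = set()
    for g in hierarchical_genres:
        # Some general genres ('Christian/Gospel', 'Rap/Hip-Hop', ...)
        # contain a slash themselves, so check the full label first.
        if g in GENERAL_GENRES:
            kept.add(g)
        else:
            top, _, rest = g.partition("/")
            if rest == "General" and top in GENERAL_GENRES:
                kept.add(top)
    return kept

def genre_percentages(songs):
    """songs: iterable of genre-label lists, one per song excerpt.
    Returns {genre: percentage}; sums to >100% with multiple membership."""
    counts, n = Counter(), 0
    for genres in songs:
        n += 1
        counts.update(general_labels(genres))
    return {g: 100.0 * c / n for g, c in counts.items()}
```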
Table 1. Percentages of Songs Belonging to the 22 Genres, with Multiple Membership Allowed

Genre               Percentage
Pop                 49.79
Classical           12.89
Broadway             7.45
Soundtracks          1.00
Christian/Gospel    10.20
New Age              2.48
Miscellaneous        6.11
Opera/Vocal          3.24
Alternative Rock    27.13
Rock                51.78
Rap/Hip-Hop          0.98
R&B                  4.26
Hard Rock/Metal     15.85
Classic Rock        15.95
Country              4.07
Jazz                 6.98
Children's Music     7.78
International        9.69
Latin Music          0.54
Folk                11.18
Dance & DJ           5.24
Blues               11.24

To study the influence of the size of the database on results, we created random non-overlapping splits of the entire data set: D(1/2) is two data sets with a mean of 127,199 song excerpts each; D(1/20) is twenty data sets with a mean of 12,719.9 song excerpts; and D(1/100) is one hundred data sets with a mean of 2,543.98 song excerpts. An artist, with all of his or her albums, is always a member of a single data set.
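Such an artist-conditional split can be sketched as follows (a minimal illustration under our own assumptions about the data layout; each artist, together with all albums and songs, lands in exactly one of the k subsets):

```python
# Sketch: random non-overlapping splits D(1/k) that never separate an
# artist's songs. Data layout and names are illustrative assumptions.
import random

def artist_level_splits(songs_by_artist, k, seed=0):
    """songs_by_artist: dict mapping artist -> list of song ids.
    Returns k disjoint song-id lists covering the whole data set."""
    artists = list(songs_by_artist)
    random.Random(seed).shuffle(artists)
    splits = [[] for _ in range(k)]
    for i, artist in enumerate(artists):
        splits[i % k].extend(songs_by_artist[artist])  # all songs together
    return splits

# Example: D(1/20) -> twenty data sets averaging S / 20 excerpts each.
# splits = artist_level_splits(songs_by_artist, k=20)
```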
Methods

We compare two approaches based on different parameterizations of the data. Whereas mel-frequency cepstral coefficients (MFCCs) are a relatively direct representation of the spectral information of a signal, and therefore of the specific “sound” or “timbre” of a song, fluctuation patterns (FPs) are a more abstract kind of feature describing the amplitude modulation of the loudness per frequency band. It is our hypothesis that MFCCs are more prone to pick up the production and mastering effects of a single album, as well as the specific “sound” of an artist (voice, instrumentation, etc.).

Mel-Frequency Cepstral Coefficients and Single Gaussians (G1)

We use the following approach to music similarity based on spectral similarity. For a given music collection of songs, it consists of the following steps:

1. For each song, compute MFCCs for short overlapping frames.
2. Train a single Gaussian (G1) to model each of the songs.
3. Compute a similarity matrix between all songs using the symmetrized Kullback-Leibler divergence between the respective G1 models.

The 30-second song excerpts in MP3 format are converted to 22,050-Hz mono audio signals. We divide the raw audio data into non-overlapping frames of short duration and use MFCCs to represent the spectrum of each frame. MFCCs are a perceptually meaningful and spectrally smoothed representation of audio signals, and they are now a standard technique for computation of spectral similarity in music analysis (see, e.g., Logan 2000). The frame size for computation of MFCCs in our experiments was 46.4 msec (1,024 samples). We used the first 25 MFCCs for all our experiments. A G1 with full covariance represents the MFCCs of each song (Mandel and Ellis 2005). For two single Gaussians, $p(x) = \mathcal{N}(x; \mu_p, \Sigma_p)$ and $q(x) = \mathcal{N}(x; \mu_q, \Sigma_q)$, the closed form of the Kullback-Leibler divergence is defined as (Penny 2001):

$$KL_N(p \| q) = \frac{1}{2}\left(\log\frac{\det(\Sigma_p)}{\det(\Sigma_q)} + \operatorname{Tr}\left(\Sigma_p^{-1}\Sigma_q\right) + (\mu_p - \mu_q)^T\,\Sigma_p^{-1}\,(\mu_q - \mu_p) - d\right) \quad (1)$$

where $\operatorname{Tr}(M)$ denotes the trace of the matrix $M$, $\operatorname{Tr}(M) = \sum_{i=1}^{n} m_{i,i}$. The divergence is symmetrized by computing:

$$KL_{\mathrm{sym}} = \frac{KL_N(p \| q) + KL_N(q \| p)}{2} \quad (2)$$

Fluctuation Patterns and Euclidean Distance (FP)

FPs (Pampalk 2001; Pampalk, Rauber, and Merkl 2002) describe the amplitude modulation of the loudness per frequency band and are based on ideas developed in Fruehwirt and Rauber (2001). For a given music collection of songs, computation of music similarity based on FPs consists of the following steps:

1. For each song, compute an FP.
2. Compute a similarity matrix between all songs using the Euclidean distance between the FPs.

Closely following the implementation outlined in Pampalk (2006), an FP is computed by: (i) cutting an MFCC spectrogram into three-second segments; (ii) using an FFT to compute amplitude modulation frequencies of loudness (range 0–10 Hz) for each segment and frequency band; (iii) weighting the modulation frequencies based on a model of perceived fluctuation strength; and (iv) applying filters to emphasize certain patterns and smooth the result. The resulting FP is a 12 × 30 matrix for each song (12 frequency bands according to the twelve critical bands of the Bark scale [Zwicker and Fastl 1999], and 30 modulation frequencies ranging from 0 to 10 Hz). The distance between two FPs, $FP^i$ and $FP^j$, is computed as the Euclidean distance:

$$D(FP^i, FP^j) = \sqrt{\sum_{k=1}^{12}\sum_{l=1}^{30}\left(FP^i_{k,l} - FP^j_{k,l}\right)^2} \quad (3)$$
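Equation 3 amounts to the Euclidean norm of the difference of two 12 × 30 matrices; a minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def fp_distance(fp_i, fp_j):
    """Euclidean distance of Equation 3 between two 12 x 30
    fluctuation-pattern matrices (NumPy arrays)."""
    return np.sqrt(((fp_i - fp_j) ** 2).sum())
```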
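Similarly, the G1 model and the divergences of Equations 1 and 2 can be sketched in NumPy. All names here are ours, and we write the closed form in the orientation given by Penny (2001), i.e., with a positive quadratic (Mahalanobis) term; the result is the same once symmetrized as in Equation 2.

```python
import numpy as np

def fit_g1(mfccs):
    """Fit a single full-covariance Gaussian (G1) to a song's MFCC
    frames; mfccs has shape (n_frames, 25)."""
    return mfccs.mean(axis=0), np.cov(mfccs, rowvar=False)

def kl_gauss(mu_p, cov_p, mu_q, cov_q):
    """Closed-form KL divergence between two Gaussians, cf. Equation 1
    (standard orientation, Penny 2001)."""
    d = mu_p.size
    inv_q = np.linalg.inv(cov_q)
    diff = mu_p - mu_q
    _, logdet_p = np.linalg.slogdet(cov_p)  # log-determinants for stability
    _, logdet_q = np.linalg.slogdet(cov_q)
    return 0.5 * (logdet_q - logdet_p + np.trace(inv_q @ cov_p)
                  + diff @ inv_q @ diff - d)

def kl_sym(p, q):
    """Symmetrized divergence of Equation 2; p and q are (mu, cov) pairs."""
    return 0.5 * (kl_gauss(*p, *q) + kl_gauss(*q, *p))
```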