UNSUPERVISED LEARNING OF SPARSE FEATURES FOR SCALABLE AUDIO CLASSIFICATION

Mikael Henaff, Kevin Jarrett, Koray Kavukcuoglu and Yann LeCun
Courant Institute of Mathematical Sciences, New York University
[email protected] ; [email protected]

ABSTRACT

In this work we present a system to automatically learn features from audio in an unsupervised manner. Our method first learns an overcomplete dictionary which can be used to sparsely decompose log-scaled spectrograms. It then trains an efficient encoder which quickly maps new inputs to approximations of their sparse representations using the learned dictionary. This avoids the expensive iterative procedures usually required to infer sparse codes. We then use these sparse codes as inputs for a linear Support Vector Machine (SVM). Our system achieves 83.4% accuracy in predicting genres on the GTZAN dataset, which is competitive with current state-of-the-art approaches. Furthermore, the use of a simple linear classifier combined with a fast feature extraction system allows our approach to scale well to large datasets.

1. INTRODUCTION

Over the past several years much research has been devoted to designing feature extraction systems to address the many challenging problems in music information retrieval (MIR). Considerable progress has been made using task-dependent features that rely on hand-crafted signal processing techniques (see [13] and [26] for reviews). An alternative approach is to use features that are instead learned automatically. This has the advantage of generalizing well to new tasks, particularly if the features are learned in an unsupervised manner.

Several systems to automatically learn useful features from data have been proposed over the years. Recently, Restricted Boltzmann Machines (RBMs), Deep Belief Networks (DBNs) and sparse coding (SC) algorithms have enjoyed a good deal of attention in the computer vision community. These have led to solid, state-of-the-art results on several object recognition benchmarks [8, 15, 23, 30].

Some of these methods have also begun receiving interest as a means to automatically learn features from audio data. The authors of [22] explored the use of sparse coding with learned dictionaries in the time domain for the purpose of genre recognition. Convolutional DBNs were used in [16] to learn features from speech and music spectrograms in an unsupervised manner. Using a similar method, but with supervised fine-tuning, the authors of [12] were able to achieve 84.3% accuracy on the Tzanetakis genre dataset, which is one of the best reported results to date.

Despite their theoretical appeal, systems that automatically learn features also bring a specific set of challenges. One drawback of DBNs noted by the authors of [12] was their long training time, as well as the large number of hyper-parameters to tune. Furthermore, several authors using sparse coding algorithms have found that once the dictionary is learned, inferring sparse representations of new inputs can be slow, as it usually relies on some kind of iterative procedure [14, 22, 30]. This in turn can limit the real-time applicability or scalability of the system.

In this paper, we investigate a sparse coding method called Predictive Sparse Decomposition (PSD) [11, 14, 15] that attempts to automatically learn useful features from audio data while addressing some of these drawbacks. Like many sparse coding algorithms, it involves learning a dictionary from a corpus of unlabeled data, such that new inputs can be represented as sparse linear combinations of the dictionary's elements. It differs in that it also trains an encoder that efficiently maps new inputs to approximations of their optimal sparse representations under the learned dictionary. As a result, once the dictionary is learned, inferring the sparse representations of new inputs is very efficient, making the system scalable and suitable for real-time applications.

2. THE ALGORITHM

2.1 Sparse Coding Algorithms

The main idea behind sparse coding is to express signals $x \in \mathbb{R}^n$ as sparse linear combinations of basis functions chosen from an overcomplete set. Letting $B \in \mathbb{R}^{n \times m}$ ($n < m$) denote the matrix whose columns are the basis functions $b_j \in \mathbb{R}^n$, with weights $z = (z_1, \ldots, z_m)$, this relationship can be written as:

$$x = \sum_{j=1}^{m} z_j b_j = Bz \qquad (1)$$

where most of the $z_i$ are zero. Overcomplete sparse representations tend to be good features for classification systems, as they provide a succinct representation of the signal, are robust to noise, and are more likely to be linearly separable due to their high dimensionality.

Directly inferring the optimal sparse representation $z$ of a signal $x$ given a dictionary $B$ requires a combinatorial search, which is intractable in high-dimensional spaces. Therefore, various alternatives have been proposed. Matching Pursuit methods [21] offer a greedy approximation to the solution. Another popular approach, called Basis Pursuit [7], involves minimizing the loss function:

$$L_d(x, z, B) = \frac{1}{2}\|x - Bz\|_2^2 + \lambda\|z\|_1 \qquad (2)$$

with respect to $z$. Here $\lambda$ is a hyper-parameter setting the tradeoff between accurate approximation of the signal and sparsity of the solution. It has been shown that the solution to (2) is the same as the optimal solution, provided it is sparse enough [10]. A number of works have focused on efficiently solving this problem [1, 7, 17, 20]; however, they still rely on a computationally expensive iterative procedure, which limits the system's scalability and real-time applicability.

2.2 Learning Dictionaries

In classical sparse coding, the dictionary is composed of known functions such as sinusoids, gammatones, wavelets or Gabors. One can also learn dictionaries that are adaptive to the type of data at hand. This is done by first initializing the basis functions to random unit vectors, and then iterating the following procedure:

1. Get a sample signal $x$ from the training set.

2. Calculate its optimal sparse code $z^*$ by minimizing (2) with respect to $z$. Simple optimization methods such as gradient descent can be used, or more sophisticated approaches such as [1, 7, 20].

3. Keeping $z^*$ fixed, update $B$ with one step of stochastic gradient descent: $B \leftarrow B - \nu \frac{\partial L_d}{\partial B}$, where $L_d$ is the loss function in (2). The columns of $B$ are then rescaled to unit norm, to avoid trivial minimizations of the loss function where the code coefficients go to zero while the bases are scaled up.

There is evidence that sparse coding could be a strategy employed by the brain in the early stages of visual and auditory processing. The authors of [24] found that basis functions learned on natural images using the above procedure resembled the receptive fields in the visual cortex. In an analogous experiment [28], basis functions learned on natural sounds were found to be highly similar to gammatone functions, which have been used to model the action of the basilar membrane in the inner ear.

2.3 Predictive Sparse Decomposition

In order to avoid the iterative procedure typically required to infer sparse codes, several works have focused on developing nonlinear, trainable encoders which can quickly map inputs to approximations of their optimal sparse codes [11, 14, 15]. The encoder's architecture is denoted $z = f_e(x, U)$, where $x$ is an input signal, $z$ is an approximation of its sparse code, and $U$ collectively designates all the trainable parameters of the encoder. The encoder is trained by minimizing the encoder loss $L_e(x, U)$, defined as the squared error between the predicted code $z$ and the optimal sparse code $z^*$ obtained by minimizing (2), for every input signal $x$ in the training set:

$$L_e(x, U) = \frac{1}{2}\|z^* - f_e(x, U)\|_2^2 \qquad (3)$$

Specifically, the encoder is trained by iterating the following process:

1. Get a sample signal $x$ from the training set and compute its optimal sparse code $z^*$ as described in the previous section.

2. Keeping $z^*$ fixed, update $U$ with one step of stochastic gradient descent: $U \leftarrow U - \nu \frac{\partial L_e}{\partial U}$, where $L_e$ is the loss function in (3).

In this paper, we adopt a simple encoder architecture given by:

$$f_e(x, W, b) = h_\theta(Wx + b) \qquad (4)$$

where $W$ is a filter matrix, $b$ is a vector of trainable biases, and $h_\theta$ is the shrinkage function given by $h_\theta(x)_i = \mathrm{sgn}(x_i)(|x_i| - \theta_i)_+$ (Figure 1). The shrinkage function sets any code components below a certain threshold $\theta$ to zero, which helps ensure that the predicted code will be sparse.

[Figure 1. Shrinkage function with $\theta = 1$.]

Training the encoder is done by iterating the above process, with $U = \{W, b, \theta\}$. Note that once the encoder is trained, inferring sparse codes is very efficient, as it essentially requires a single matrix-vector multiplication.
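To make the above procedure concrete, the following NumPy sketch implements Eqs. (1)-(4). It is our own illustrative reconstruction, not the authors' code: a plain ISTA solver stands in for the FISTA algorithm the paper uses, and the dictionary and encoder updates are folded into a single loop for brevity, whereas the paper learns the dictionary first and then trains the encoder. The learning rate `nu` and sparsity weight `lam` are hypothetical settings.

```python
import numpy as np

def shrink(u, theta):
    """Shrinkage non-linearity h_theta of Eq. (4): soft-thresholds each
    component, zeroing anything whose magnitude falls below theta."""
    return np.sign(u) * np.maximum(np.abs(u) - theta, 0.0)

def ista(x, B, lam, n_iter=100):
    """Iterative shrinkage-thresholding solver for the loss in Eq. (2).
    The paper uses the faster FISTA variant [1]; the fixed point is the same."""
    L = np.linalg.norm(B, 2) ** 2            # Lipschitz constant of the gradient
    z = np.zeros(B.shape[1])
    for _ in range(n_iter):
        z = shrink(z - B.T @ (B @ z - x) / L, lam / L)
    return z

def psd_step(x, B, W, b, theta, lam=0.5, nu=0.01):
    """One stochastic training step of PSD: update the dictionary B on the
    decoder loss (2), then the encoder parameters U = {W, b, theta} on (3)."""
    z_star = ista(x, B, lam)                 # optimal sparse code for x
    # Dictionary update: gradient of 0.5 * ||x - B z*||^2 w.r.t. B,
    # followed by rescaling the columns of B to unit norm.
    B -= nu * np.outer(B @ z_star - x, z_star)
    B /= np.linalg.norm(B, axis=0, keepdims=True)
    # Encoder update: gradient of 0.5 * ||z* - h_theta(Wx + b)||^2.
    u = W @ x + b
    err = shrink(u, theta) - z_star
    g = err * (np.abs(u) > theta)            # gradient flows only through active units
    W -= nu * np.outer(g, x)
    b -= nu * g
    theta += nu * g * np.sign(u)             # gradient step on theta (see Eq. 3)
    return B, W, b, theta

# Hypothetical usage with n = 96 CQT channels and m = 512 dictionary atoms:
# B = np.random.randn(96, 512); B /= np.linalg.norm(B, axis=0)
# W = np.zeros((512, 96)); b = np.zeros(512); theta = np.full(512, 0.1)
# for x in training_frames:
#     B, W, b, theta = psd_step(x, B, W, b, theta)
```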

3. LEARNING AUDIO FEATURES

In this section we describe the features learned on music data using PSD.

3.1 Dataset

We used the GTZAN dataset, first introduced in [29], which has since been used in several works as a benchmark for the genre recognition task [2, 3, 6, 12, 18, 25]. The dataset consists of 1000 30-second audio clips, each belonging to one of 10 genres: blues, classical, country, disco, hiphop, jazz, metal, pop, reggae and rock. The classes are balanced so that there are 100 clips from each genre. All clips are sampled at 22050 Hz.

3.2 Preprocessing

To begin with, we divided each clip into short frames of 1024 samples each, corresponding to 46.4 ms of audio. There was a 50% overlap between consecutive frames. We then applied a Constant-Q transform (CQT) to each frame, with 96 filters spanning four octaves from C2 to C6 at quarter-tone resolution. For this we used the toolbox provided by the authors of [27]. An important property of the CQT is that the center frequencies of the filters are logarithmically spaced, so that consecutive notes in the musical scale are linearly spaced. We then applied subtractive and divisive local contrast normalization (LCN) as described in [15], which consisted of two stages. First, from each point in the CQT spectrogram we subtracted the average of its neighborhood along both the time and frequency axes, weighted by a Gaussian window. Each point was then divided by the standard deviation of the new neighborhood, again weighted by a Gaussian window. This enforces competition between neighboring points in the spectrogram, so that low-energy signals are amplified while high-energy ones are muted. The entire process can be seen as a simple form of automatic gain control. (A code sketch of this normalization is given at the end of this section.)

[Figure 2. A random subset of the 512 basis functions learned on full CQT frames. The horizontal axis represents log-frequency and ranges from 67 Hz to 1046 Hz.]

3.3 Features Learned on Frames

We then learned dictionaries on all frames in the dataset, using the process described in 2.2. The dictionary size was set to 512, so as to get overcomplete representations. Once the dictionary was learned, we trained the encoder to predict sparse representations using the process in 2.3. In both cases, we used the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) [1] to compute optimal sparse codes. Some of the learned basis functions are displayed in Figure 2. One can see single notes and what appear to be series of linearly spaced notes, which could correspond to chords, harmonics or harmonies. Note that some of the basis functions appear to be inverted, since the code coefficients can be negative. A number of the learned basis functions also seem to have little recognizable structure.

3.4 Features Learned on Octaves

We also tried learning separate dictionaries on each of the four octaves, in order to capture local frequency patterns. To this end we divided each frame into four octaves, each consisting of 24 channels, and learned dictionaries of size 128 on each one. We then trained four separate encoders to predict the sparse representations for each of the four octaves. Some of the learned basis functions are shown in Figure 3. Interestingly, we find that a number of basis functions correspond to known chords or intervals: minor thirds, perfect fifths, sevenths, major triads, etc. A number of basis functions also appear to be similar versions of each other shifted across frequency. Other functions have their energy spread out across frequency, which could correspond to sounds caused by percussive instruments.
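As an illustration of the normalization step of Section 3.2, the following sketch applies subtractive and divisive LCN to a CQT spectrogram using SciPy's Gaussian filter. The neighborhood width `sigma` and the divisor floor are our assumptions; the paper does not report these settings.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def lcn(S, sigma=3.0, eps=1e-8):
    """Subtractive + divisive local contrast normalization of a CQT
    spectrogram S (frequency x time), in the spirit of [15]."""
    # Stage 1: subtract the Gaussian-weighted mean of each point's
    # neighborhood along both the frequency and time axes.
    V = S - gaussian_filter(S, sigma)
    # Stage 2: divide by the Gaussian-weighted standard deviation of
    # the neighborhood of the mean-subtracted result.
    local_std = np.sqrt(gaussian_filter(V ** 2, sigma))
    # Flooring the divisor at its mean keeps near-silent regions from
    # being blown up by tiny divisors (a common LCN refinement).
    c = max(local_std.mean(), eps)
    return V / np.maximum(local_std, c)
```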

3.5 Feature Extraction

Once the dictionaries were learned and the encoders trained to accurately predict sparse codes, we ran all inputs through their respective encoders to obtain their sparse representations under the learned dictionaries. In the case of dictionaries learned on individual octaves, for each frame we concatenated the sparse representations of each of its four octaves, all of length 128, into a single vector of size 512. Extracting sparse features for the entire dataset, which contains over 8 hours of audio, took less than 3 minutes, which shows that this feature extraction system is scalable to industrial-size music databases.
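Per-frame inference under this scheme amounts to four small matrix-vector products, which is what makes extraction fast. A minimal sketch, assuming the shrinkage encoder of Eq. (4) and four trained `(W, b, theta)` triples:

```python
import numpy as np

def encode_frame(frame, encoders):
    """Map one 96-channel CQT frame to a 512-dim sparse feature vector:
    each 24-channel octave goes through its own encoder, and the four
    128-dim codes are concatenated."""
    codes = []
    for i, (W, b, theta) in enumerate(encoders):     # W is 128 x 24
        u = W @ frame[24 * i : 24 * (i + 1)] + b
        codes.append(np.sign(u) * np.maximum(np.abs(u) - theta, 0.0))
    return np.concatenate(codes)                     # 4 * 128 = 512 components
```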

4. CLASSIFICATION USING LEARNED FEATURES

We now describe the results of using our learned features as inputs for genre classification. We used a linear Support Vector Machine (SVM) as a classifier, via the LIBSVM library [5]. Linear SVMs are fast to train and scale well to large datasets, which is an important consideration in MIR.
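The paper uses LIBSVM; as a rough stand-in, the sketch below trains a linear SVM on aggregate feature vectors with scikit-learn. The regularization constant `C` and the placeholder data are our assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC    # stand-in for LIBSVM's linear SVM

rng = np.random.default_rng(0)
X = rng.random((200, 512))           # placeholder aggregate features (see 4.1)
y = rng.integers(0, 10, 200)         # placeholder genre labels, 10 classes
clf = LinearSVC(C=1.0).fit(X, y)     # one-vs-rest linear SVM
window_labels = clf.predict(X)       # one genre prediction per window
```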

4.1 Aggregated Features

Several authors have found that aggregating frame-level features over longer time windows substantially improves classification performance [2, 3, 12]. Adopting a similar approach, we computed aggregate features for each song by summing sparse codes over 5-second time windows overlapping by half. We applied absolute-value rectification to the codes beforehand, to prevent components of different sign from canceling each other out. Since each sparse code records which dictionary elements are present in a given CQT frame, these aggregate feature vectors can be thought of as histograms recording the number of occurrences of each dictionary element in the time window.
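A sketch of this aggregation step, assuming one 512-dim sparse code per row of `codes`. The default window length in frames is an estimate on our part: with 1024-sample frames at 50% overlap and 22050 Hz, 5 seconds is roughly 215 frames.

```python
import numpy as np

def aggregate(codes, window=215):
    """Sum absolute-value-rectified sparse codes over 5-second windows
    overlapping by half, producing one histogram-like vector per window."""
    hop = window // 2                      # 50% overlap between windows
    rect = np.abs(codes)                   # rectify so opposite signs don't cancel
    starts = range(0, len(rect) - window + 1, hop)
    return np.stack([rect[s:s + window].sum(axis=0) for s in starts])
```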

[Figure 3. Some of the functions learned on individual octaves. The horizontal axis represents log-frequency. Recall that each octave consists of 24 channels a quarter tone apart. Channel numbers corresponding to peaks are indicated. (a) A minor third (two notes 3 semitones apart). (b) A perfect fourth (two notes 5 semitones apart). (c) A perfect fifth (two notes 7 semitones apart). (d) A quartal chord (each note is 5 semitones apart). (e) A major triad. (f) A percussive sound.]

4.2 Classification

To produce predictions for each song, we voted over all aggregate feature vectors in the song and chose the genre with the highest number of votes. Following standard practice, classification performance was measured by 10-fold cross-validation. For each fold, 100 songs were randomly selected to serve as a test set, with the remaining 900 serving as training data. This procedure was repeated 10 times, and the results averaged to produce a final classification accuracy.
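A sketch of the per-song vote, assuming `clf` is any trained classifier with a `predict` method (such as the linear SVM above); tie-breaking by lowest label index is an artifact of this sketch, not a detail from the paper.

```python
import numpy as np

def predict_song(clf, song_windows):
    """Classify every aggregate window vector of a song, then return the
    genre receiving the highest number of votes (Section 4.2)."""
    votes = clf.predict(song_windows)              # one label per window
    labels, counts = np.unique(votes, return_counts=True)
    return labels[np.argmax(counts)]               # majority genre
```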

Our classification results, along with several other results from the literature, are shown in Figure 4. We see that PSD features learned on individual octaves perform significantly better than those learned on entire frames.¹ Furthermore, our approach outperforms many existing systems which use hand-crafted features. The two systems that significantly outperform our own rely on sophisticated classifiers based on sparse representations (SRC) or compressive sampling (CSC). The fact that our method is still able to reach competitive performance while using a simple classifier indicates that the learned features were able to capture useful properties of the audio that distinguish between genres. One possible interpretation is that some of the basis functions depicted in Figure 3 represent chords specific to certain genres. For example, perfect fifths (e.g. power chords) are very common in rock, blues and country but rare in jazz, whereas quartal chords, which are common in jazz and classical, are seldom found in rock or blues.

¹ In an effort to capture chords which might be split between two of the octaves, we also tried dividing the frequency range into 7 octaves overlapping by half, and similarly learning features on each one. However, this did not yield an increase in accuracy.

Classifier   | Features                       | Acc. (%)
CSC          | Many features [6]              | 92.7
SRC          | Auditory cortical feat. [25]   | 92
RBF-SVM      | Learned using DBN [12]         | 84.3
Linear SVM   | Learned using PSD on octaves   | 83.4 ± 3.1
AdaBoost     | Many features [2]              | 83
Linear SVM   | Learned using PSD on frames    | 79.4 ± 2.8
SVM          | Daubechies wavelets [19]       | 78.5
Log. Reg.    | Spectral covariance [3]        | 77
LDA          | MFCC + other [18]              | 71
Linear SVM   | Auditory cortical feat. [25]   | 70
GMM          | MFCC + other [29]              | 61

Figure 4. Genre recognition accuracy of various algorithms on the GTZAN dataset. Our results, with standard deviations, are marked in bold.

4.3 Discussion

Our results show that automatic feature learning is a viable alternative to using hand-crafted features. Our approach performed better than most systems that pair signal-processing feature extractors with standard classifiers such as SVMs, Nearest Neighbors or Gaussian Mixture Models. Another positive point is that our feature extraction system is very fast, and the use of a simple linear SVM makes this method viable on datasets of any size. Furthermore, the fact that the features are learned in an unsupervised manner means that they are not limited to a particular task, and could be used for other MIR tasks such as chord recognition or autotagging.

We also found that features learned on octaves performed better than features learned on entire frames. This could be due to the fact that in the second case we are learning four times as many parameters as in the first, which could lead to overfitting. Another possibility is that features learned on octaves tend to capture relationships between fundamental notes, whereas features learned on entire frames also seem to capture patterns between fundamentals and their harmonics, which could be less useful for distinguishing between genres.

One aspect that needs mentioning is that since we performed the unsupervised feature learning on the entire dataset (which includes the training and test sets, without labels, for each of the cross-validation folds), our system is technically akin to "transductive learning". Under this paradigm, test samples are known in advance, and the system is simply asked to produce labels for them. We subsequently conducted a single experiment in which features were learned on the training set only, and obtained an accuracy of 80%. Though lower than our overall accuracy, this result is still within the range observed across the 10 different cross-validation experiments, which went from 77% to 87%. The seemingly large deviation in accuracy is likely due to the variation of class distributions between folds.

There are a number of directions in which we would like to extend this work. A first step would be to apply our system to different MIR tasks, such as autotagging. Furthermore, the small size of the GTZAN dataset does not exploit the system's ability to leverage large amounts of data in a tractable amount of time. For this, the Million Song Dataset [4] would be ideal. A limitation of our system is that it ignores temporal dependencies between frames. A possible remedy would be to learn features on time-frequency patches instead. Preliminary experiments we conducted in this direction did not yield improved results, as many "learned" basis functions resembled noise. This requires further investigation. We could also try training a second layer of feature extractors on top of the first, since a number of works have demonstrated that using multiple layers can improve classification performance [12, 15, 16].

5. CONCLUSION

In this paper, we have investigated the ability of PSD to automatically learn useful features from constant-Q spectrograms. We found that the learned features capture information about which chords are being played in a particular frame. Furthermore, these learned features can perform at least as well as hand-crafted features for the task of genre recognition. Finally, the system we proposed is fast and uses a simple linear classifier, which scales well to large datasets.

In future work, we will apply this method to larger datasets, as well as a wider range of MIR tasks. We will also experiment with different ways of capturing temporal dependencies between frames. Finally, we will investigate using hierarchical systems of feature extractors to learn higher-level features.

6. REFERENCES

[1] A. Beck and M. Teboulle: "A Fast Iterative Shrinkage-Thresholding Algorithm with Application to Wavelet-Based Image Deblurring," ICASSP '09, 2009.

[2] J. Bergstra, N. Casagrande, D. Erhan, D. Eck and B. Kegl: "Aggregate features and AdaBoost for music classification," Machine Learning, 65(2-3):473-484, 2006.

[3] J. Bergstra, M. Mandel and D. Eck: "Scalable genre and tag prediction using spectral covariance," Proceedings of the 11th International Conference on Music Information Retrieval (ISMIR), 2010.

[4] T. Bertin-Mahieux, D. Ellis, B. Whitman and P. Lamere: "The Million Song Dataset," Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR), 2011.

[5] C.-C. Chang and C.-J. Lin: "LIBSVM: a library for support vector machines," software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001.

[6] K. Chang, J. Jang and C. Iliopoulos: "Music genre classification via compressive sampling," Proceedings of the 11th International Conference on Music Information Retrieval (ISMIR), pp. 387-392, 2010.

[7] S.S. Chen, D.L. Donoho and M.A. Saunders: "Atomic Decomposition by Basis Pursuit," SIAM Journal on Scientific Computing, 20(1):33-61, 1999.

[8] A. Courville, J. Bergstra and Y. Bengio: "A Spike and Slab Restricted Boltzmann Machine," Journal of Machine Learning Research, W&CP 15, 2011.

[9] I. Daubechies, M. Defrise and C. De Mol: "An Iterative Thresholding Algorithm for Linear Inverse Problems with a Sparsity Constraint," Communications on Pure and Applied Mathematics, 57:1413-1457, 2004.

[10] D.L. Donoho and M. Elad: "Optimally sparse representation in general (nonorthogonal) dictionaries via L1 minimization," Proceedings of the National Academy of Sciences, 100(5):2197-2202, 2003.

[11] K. Gregor and Y. LeCun: "Learning Fast Approximations of Sparse Coding," Proceedings of the 27th International Conference on Machine Learning (ICML), Haifa, Israel, 2010.

[12] P. Hamel and D. Eck: "Learning Features from Music Audio with Deep Belief Networks," Proceedings of the 11th International Conference on Music Information Retrieval (ISMIR), 2010.

[13] P. Herrera-Boyer, G. Peeters and S. Dubnov: "Automatic Classification of Musical Instrument Sounds," Journal of New Music Research, 32(1), March 2003.

[14] K. Kavukcuoglu, M.A. Ranzato and Y. LeCun: "Fast Inference in Sparse Coding Algorithms with Applications to Object Recognition," Computational and Biological Learning Laboratory, Technical Report CBLL-TR-2008-12-01, 2008.

[15] Y. LeCun, K. Kavukcuoglu and C. Farabet: "Convolutional Networks and Applications in Vision," Proc. International Symposium on Circuits and Systems (ISCAS '10), IEEE, 2010.

[16] H. Lee, Y. Largman, P. Pham and A. Ng: "Unsupervised feature learning for audio classification using convolutional deep belief networks," Advances in Neural Information Processing Systems (NIPS) 22, 2009.

[17] H. Lee, A. Battle, R. Raina and A.Y. Ng: "Efficient Sparse Coding Algorithms," Advances in Neural Information Processing Systems (NIPS), 2006.

[18] T. Li and G. Tzanetakis: "Factors in automatic musical genre classification," IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, New York, 2003.

[19] T. Li, M. Ogihara and Q. Li: "A comparative study on content-based music genre classification," Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '03), 2003.

[20] Y. Li and S. Osher: "Coordinate descent optimization for l1 minimization with application to compressed sensing; a greedy algorithm," Inverse Problems and Imaging, 3(3):487-503, 2009.

[21] S. Mallat and Z. Zhang: "Matching Pursuits with time-frequency dictionaries," IEEE Transactions on Signal Processing, 41(12):3397-3415, 1993.

[22] P.-A. Manzagol, T. Bertin-Mahieux and D. Eck: "On the use of sparse time relative auditory codes for music," Proceedings of the 9th International Conference on Music Information Retrieval (ISMIR), 2008.

[23] M. Norouzi, M. Ranjbar and G. Mori: "Stacks of Convolutional Restricted Boltzmann Machines for Shift-Invariant Feature Learning," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.

[24] B. Olshausen and D. Field: "Emergence of simple-cell receptive field properties by learning a sparse code for natural images," Nature, 1996.

[25] Y. Panagakis, C. Kotropoulos and G.R. Arce: "Music genre classification using locality preserving non-negative tensor factorization and sparse representations," Proceedings of the 10th International Conference on Music Information Retrieval (ISMIR), pp. 249-254, 2009.

[26] G. Peeters: "A large set of audio features for sound description (similarity and classification) in the CUIDADO project," Technical Report, IRCAM, 2004.

[27] C. Schoerkhuber and A. Klapuri: "Constant-Q transform toolbox for music processing," 7th Sound and Music Computing Conference, Barcelona, Spain, 2010.

[28] E. Smith and M. Lewicki: "Efficient Auditory Coding," Nature, 2006.

[29] G. Tzanetakis and P. Cook: "Musical Genre Classification of Audio Signals," IEEE Transactions on Speech and Audio Processing, 10(5):293-302, 2002.

[30] J. Yang, K. Yu, Y. Gong and T. Huang: "Linear Spatial Pyramid Matching Using Sparse Coding for Image Classification," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.