UNSUPERVISED LEARNING OF SPARSE FEATURES FOR SCALABLE AUDIO CLASSIFICATION

Mikael Henaff, Kevin Jarrett, Koray Kavukcuoglu and Yann LeCun
Courant Institute of Mathematical Sciences, New York University
[email protected] ; [email protected]

ABSTRACT

In this work we present a system to automatically learn features from audio in an unsupervised manner. Our method first learns an overcomplete dictionary which can be used to sparsely decompose log-scaled spectrograms. It then trains an efficient encoder which quickly maps new inputs to approximations of their sparse representations using the learned dictionary. This avoids the expensive iterative procedures usually required to infer sparse codes. We then use these sparse codes as inputs for a linear Support Vector Machine (SVM). Our system achieves 83.4% accuracy in predicting genres on the GTZAN dataset, which is competitive with current state-of-the-art approaches. Furthermore, the use of a simple linear classifier combined with a fast feature extraction system allows our approach to scale well to large datasets.

1. INTRODUCTION

Over the past several years much research has been devoted to designing feature extraction systems to address the many challenging problems in music information retrieval (MIR). Considerable progress has been made using task-dependent features that rely on hand-crafted signal processing techniques (see [13] and [26] for reviews). An alternative approach is to use features that are instead learned automatically. This has the advantage of generalizing well to new tasks, particularly if the features are learned in an unsupervised manner.

Several systems to automatically learn useful features from data have been proposed over the years. Recently, Restricted Boltzmann Machines (RBMs), Deep Belief Networks (DBNs) and sparse coding (SC) algorithms have enjoyed a good deal of attention in the computer vision community. These have led to solid, state-of-the-art results on several object recognition benchmarks [8, 15, 23, 30].

Some of these methods have also begun receiving interest as a means to automatically learn features from audio data. The authors of [22] explored the use of sparse coding with learned dictionaries in the time domain for the purpose of genre recognition. Convolutional DBNs were used in [16] to learn features from speech and music spectrograms in an unsupervised manner. Using a similar method, but with supervised fine-tuning, the authors of [12] were able to achieve 84.3% accuracy on the Tzanetakis genre dataset, which is one of the best reported results to date.

Despite their theoretical appeal, systems that automatically learn features also bring a specific set of challenges. One drawback of DBNs noted by the authors of [12] was their long training time, as well as the large number of hyper-parameters to tune. Furthermore, several authors using sparse coding algorithms have found that once the dictionary is learned, inferring sparse representations of new inputs can be slow, as it usually relies on some kind of iterative procedure [14, 22, 30]. This in turn can limit the real-time applicability or scalability of the system.

In this paper, we investigate a sparse coding method called Predictive Sparse Decomposition (PSD) [11, 14, 15] that attempts to automatically learn useful features from audio data while addressing some of these drawbacks. Like many sparse coding algorithms, it involves learning a dictionary from a corpus of unlabeled data, such that new inputs can be represented as sparse linear combinations of the dictionary's elements. It differs in that it also trains an encoder that efficiently maps new inputs to approximations of their optimal sparse representations under the learned dictionary. As a result, once the dictionary is learned, inferring the sparse representations of new inputs is very efficient, making the system scalable and suitable for real-time applications.

2. THE ALGORITHM

2.1 Sparse Coding Algorithms

The main idea behind sparse coding is to express signals $x \in \mathbb{R}^n$ as sparse linear combinations of basis functions chosen from an overcomplete set. Letting $B \in \mathbb{R}^{n \times m}$ ($n < m$) denote the matrix whose columns are the basis functions $b_j \in \mathbb{R}^n$, with weights $z = (z_1, \ldots, z_m)$, this relationship can be written as:

$$x = \sum_{j=1}^{m} z_j b_j = Bz \qquad (1)$$

where most of the $z_i$ are zero. Overcomplete sparse representations tend to be good features for classification systems, as they provide a succinct representation of the signal, are robust to noise, and are more likely to be linearly separable due to their high dimensionality.

Directly inferring the optimal sparse representation $z$ of a signal $x$ given a dictionary $B$ requires a combinatorial search, which is intractable in high-dimensional spaces. Therefore, various alternatives have been proposed. Matching Pursuit methods [21] offer a greedy approximation to the solution. Another popular approach, called Basis Pursuit [7], involves minimizing the loss function:

$$L_d(x, z, B) = \frac{1}{2}\|x - Bz\|_2^2 + \lambda\|z\|_1 \qquad (2)$$

with respect to $z$. Here $\lambda$ is a hyper-parameter setting the tradeoff between accurate approximation of the signal and sparsity of the solution. It has been shown that the solution to (2) is the same as the optimal solution, provided it is sparse enough [10]. A number of works have focused on efficiently solving this problem [1, 7, 17, 20]; however, they still rely on a computationally expensive iterative procedure, which limits the system's scalability and real-time applicability.

2.2 Learning Dictionaries

In classical sparse coding, the dictionary is composed of known functions such as sinusoids, gammatones, wavelets or Gabors. One can also learn dictionaries that are adaptive to the type of data at hand. This is done by first initializing the basis functions to random unit vectors, and then iterating the following procedure:

1. Get a sample signal $x$ from the training set.

2. Calculate its optimal sparse code $z^*$ by minimizing (2) with respect to $z$. Simple optimization methods such as gradient descent can be used, or more sophisticated approaches such as [1, 7, 20].

3. Keeping $z^*$ fixed, update $B$ with one step of stochastic gradient descent: $B \leftarrow B - \nu \frac{\partial L_d}{\partial B}$, where $L_d$ is the loss function in (2). The columns of $B$ are then rescaled to unit norm, to avoid trivial minimizations of the loss function where the code coefficients go to zero while the bases are scaled up.

There is evidence that sparse coding could be a strategy employed by the brain in the early stages of visual and auditory processing. The authors of [24] found that basis functions learned on natural images using the above procedure resembled the receptive fields in the visual cortex. In an analogous experiment [28], basis functions learned on natural sounds were found to be highly similar to gammatone functions, which have been used to model the action of the basilar membrane in the inner ear.

2.3 Predictive Sparse Decomposition

In order to avoid the iterative procedure typically required to infer sparse codes, several works have focused on developing nonlinear, trainable encoders which can quickly map inputs to approximations of their optimal sparse codes [11, 14, 15]. The encoder's architecture is denoted $z = f_e(x, U)$, where $x$ is an input signal, $z$ is an approximation of its sparse code, and $U$ collectively designates all the trainable parameters of the encoder. The encoder is trained by minimizing the encoder loss $L_e(x, U)$, defined as the squared error between the predicted code $z$ and the optimal sparse code $z^*$ obtained by minimizing (2), for every input signal $x$ in the training set:

$$L_e(x, U) = \frac{1}{2}\|z^* - f_e(x, U)\|_2^2 \qquad (3)$$

Specifically, the encoder is trained by iterating the following process:

1. Get a sample signal $x$ from the training set and compute its optimal sparse code $z^*$ as described in the previous section.

2. Keeping $z^*$ fixed, update $U$ with one step of stochastic gradient descent: $U \leftarrow U - \nu \frac{\partial L_e}{\partial U}$, where $L_e$ is the loss function in (3).

In this paper, we adopt a simple encoder architecture given by:

$$f_e(x, W, b) = h_\theta(Wx + b) \qquad (4)$$

where $W$ is a filter matrix, $b$ is a vector of trainable biases, and $h_\theta$ is the shrinkage function given by $h_\theta(x)_i = \mathrm{sgn}(x_i)(|x_i| - \theta_i)_+$ (Figure 1). The shrinkage function sets any code components below a certain threshold $\theta$ to zero, which helps ensure that the predicted code will be sparse.

[Figure 1. Shrinkage function with $\theta = 1$.]

Training the encoder is done by iterating the above process, with $U = \{W, b, \theta\}$. Note that once the encoder is trained, inferring sparse codes is very efficient, as it essentially requires a single matrix-vector multiplication.
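To make the above procedure concrete, the following NumPy sketch implements Eqs. (1)-(4). It is our own illustrative reconstruction, not the authors' code: a plain ISTA solver stands in for the FISTA algorithm the paper uses, and the dictionary and encoder updates are folded into a single loop for brevity, whereas the paper learns the dictionary first and then trains the encoder. The learning rate `nu` and sparsity weight `lam` are hypothetical settings.

```python
import numpy as np

def shrink(u, theta):
    """Shrinkage non-linearity h_theta of Eq. (4): soft-thresholds each
    component, zeroing anything whose magnitude falls below theta."""
    return np.sign(u) * np.maximum(np.abs(u) - theta, 0.0)

def ista(x, B, lam, n_iter=100):
    """Iterative shrinkage-thresholding solver for the loss in Eq. (2).
    The paper uses the faster FISTA variant [1]; the fixed point is the same."""
    L = np.linalg.norm(B, 2) ** 2            # Lipschitz constant of the gradient
    z = np.zeros(B.shape[1])
    for _ in range(n_iter):
        z = shrink(z - B.T @ (B @ z - x) / L, lam / L)
    return z

def psd_step(x, B, W, b, theta, lam=0.5, nu=0.01):
    """One stochastic training step of PSD: update the dictionary B on the
    decoder loss (2), then the encoder parameters U = {W, b, theta} on (3)."""
    z_star = ista(x, B, lam)                 # optimal sparse code for x
    # Dictionary update: gradient of 0.5 * ||x - B z*||^2 w.r.t. B,
    # followed by rescaling the columns of B to unit norm.
    B -= nu * np.outer(B @ z_star - x, z_star)
    B /= np.linalg.norm(B, axis=0, keepdims=True)
    # Encoder update: gradient of 0.5 * ||z* - h_theta(Wx + b)||^2.
    u = W @ x + b
    err = shrink(u, theta) - z_star
    g = err * (np.abs(u) > theta)            # gradient flows only through active units
    W -= nu * np.outer(g, x)
    b -= nu * g
    theta += nu * g * np.sign(u)             # gradient step on theta (see Eq. 3)
    return B, W, b, theta

# Hypothetical usage with n = 96 CQT channels and m = 512 dictionary atoms:
# B = np.random.randn(96, 512); B /= np.linalg.norm(B, axis=0)
# W = np.zeros((512, 96)); b = np.zeros(512); theta = np.full(512, 0.1)
# for x in training_frames:
#     B, W, b, theta = psd_step(x, B, W, b, theta)
```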

3. LEARNING AUDIO FEATURES

In this section we describe the features learned on music data using PSD.

3.1 Dataset

We used the GTZAN dataset, first introduced in [29], which has since been used in several works as a benchmark for the genre recognition task [2, 3, 6, 12, 18, 25]. The dataset consists of 1000 30-second audio clips, each belonging to one of 10 genres: blues, classical, country, disco, hiphop, jazz, metal, pop, reggae and rock. The classes are balanced so that there are 100 clips from each genre. All clips are sampled at 22050 Hz.

3.2 Preprocessing

To begin with, we divided each clip into short frames of 1024 samples each, corresponding to 46.4 ms of audio. There was a 50% overlap between consecutive frames. We then applied a Constant-Q transform (CQT) to each frame, with 96 filters spanning four octaves from C2 to C6 at quarter-tone resolution. For this we used the toolbox provided by the authors of [27]. An important property of the CQT is that the center frequencies of the filters are logarithmically spaced, so that consecutive notes in the musical scale are linearly spaced. We then applied subtractive and divisive local contrast normalization (LCN) as described in [15], which consisted of two stages. First, from each point in the CQT spectrogram we subtracted the average of its neighborhood along both the time and frequency axes, weighted by a Gaussian window. Each point was then divided by the standard deviation of the new neighborhood, again weighted by a Gaussian window. This enforces competition between neighboring points in the spectrogram, so that low-energy signals are amplified while high-energy ones are muted. The entire process can be seen as a simple form of automatic gain control. (A code sketch of this normalization is given at the end of this section.)

[Figure 2. A random subset of the 512 basis functions learned on full CQT frames. The horizontal axis represents log-frequency and ranges from 67 Hz to 1046 Hz.]

3.3 Features Learned on Frames

We then learned dictionaries on all frames in the dataset, using the process described in 2.2. The dictionary size was set to 512, so as to get overcomplete representations. Once the dictionary was learned, we trained the encoder to predict sparse representations using the process in 2.3. In both cases, we used the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) [1] to compute optimal sparse codes. Some of the learned basis functions are displayed in Figure 2. One can see single notes and what appear to be series of linearly spaced notes, which could correspond to chords, harmonics or harmonies. Note that some of the basis functions appear to be inverted, since the code coefficients can be negative. A number of the learned basis functions also seem to have little recognizable structure.

3.4 Features Learned on Octaves

We also tried learning separate dictionaries on each of the four octaves, in order to capture local frequency patterns. To this end we divided each frame into four octaves, each consisting of 24 channels, and learned dictionaries of size 128 on each one. We then trained four separate encoders to predict the sparse representations for each of the four octaves. Some of the learned basis functions are shown in Figure 3. Interestingly, we find that a number of basis functions correspond to known chords or intervals: minor thirds, perfect fifths, sevenths, major triads, etc. A number of basis functions also appear to be similar versions of each other shifted across frequency. Other functions have their energy spread out across frequency, which could correspond to sounds caused by percussive instruments.
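As an illustration of the normalization step of Section 3.2, the following sketch applies subtractive and divisive LCN to a CQT spectrogram using SciPy's Gaussian filter. The neighborhood width `sigma` and the divisor floor are our assumptions; the paper does not report these settings.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def lcn(S, sigma=3.0, eps=1e-8):
    """Subtractive + divisive local contrast normalization of a CQT
    spectrogram S (frequency x time), in the spirit of [15]."""
    # Stage 1: subtract the Gaussian-weighted mean of each point's
    # neighborhood along both the frequency and time axes.
    V = S - gaussian_filter(S, sigma)
    # Stage 2: divide by the Gaussian-weighted standard deviation of
    # the neighborhood of the mean-subtracted result.
    local_std = np.sqrt(gaussian_filter(V ** 2, sigma))
    # Flooring the divisor at its mean keeps near-silent regions from
    # being blown up by tiny divisors (a common LCN refinement).
    c = max(local_std.mean(), eps)
    return V / np.maximum(local_std, c)
```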

3.5 Feature Extraction

Once the dictionaries were learned and the encoders trained to accurately predict sparse codes, we ran all inputs through their respective encoders to obtain their sparse representations under the learned dictionaries. In the case of dictionaries learned on individual octaves, for each frame we concatenated the sparse representations of each of its four octaves, all of length 128, into a single vector of size 512. Extracting sparse features for the entire dataset, which contains over 8 hours of audio, took less than 3 minutes, which shows that this feature extraction system is scalable to industrial-size music databases.
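Per-frame inference under this scheme amounts to four small matrix-vector products, which is what makes extraction fast. A minimal sketch, assuming the shrinkage encoder of Eq. (4) and four trained `(W, b, theta)` triples:

```python
import numpy as np

def encode_frame(frame, encoders):
    """Map one 96-channel CQT frame to a 512-dim sparse feature vector:
    each 24-channel octave goes through its own encoder, and the four
    128-dim codes are concatenated."""
    codes = []
    for i, (W, b, theta) in enumerate(encoders):     # W is 128 x 24
        u = W @ frame[24 * i : 24 * (i + 1)] + b
        codes.append(np.sign(u) * np.maximum(np.abs(u) - theta, 0.0))
    return np.concatenate(codes)                     # 4 * 128 = 512 components
```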

4. CLASSIFICATION USING LEARNED FEATURES

We now describe the results of using our learned features as inputs for genre classification. We used a linear Support Vector Machine (SVM) as a classifier, via the LIBSVM library [5]. Linear SVMs are fast to train and scale well to large datasets, which is an important consideration in MIR.
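The paper uses LIBSVM; as a rough stand-in, the sketch below trains a linear SVM on aggregate feature vectors with scikit-learn. The regularization constant `C` and the placeholder data are our assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC    # stand-in for LIBSVM's linear SVM

rng = np.random.default_rng(0)
X = rng.random((200, 512))           # placeholder aggregate features (see 4.1)
y = rng.integers(0, 10, 200)         # placeholder genre labels, 10 classes
clf = LinearSVC(C=1.0).fit(X, y)     # one-vs-rest linear SVM
window_labels = clf.predict(X)       # one genre prediction per window
```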

4.1 Aggregated Features

Several authors have found that aggregating frame-level features over longer time windows substantially improves classification performance [2, 3, 12]. Adopting a similar approach, we computed aggregate features for each song by summing sparse codes over 5-second time windows overlapping by half. We applied absolute-value rectification to the codes beforehand, to prevent components of different sign from canceling each other out. Since each sparse code records which dictionary elements are present in a given CQT frame, these aggregate feature vectors can be thought of as histograms recording the number of occurrences of each dictionary element in the time window.
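A sketch of this aggregation step, assuming one 512-dim sparse code per row of `codes`. The default window length in frames is an estimate on our part: with 1024-sample frames at 50% overlap and 22050 Hz, 5 seconds is roughly 215 frames.

```python
import numpy as np

def aggregate(codes, window=215):
    """Sum absolute-value-rectified sparse codes over 5-second windows
    overlapping by half, producing one histogram-like vector per window."""
    hop = window // 2                      # 50% overlap between windows
    rect = np.abs(codes)                   # rectify so opposite signs don't cancel
    starts = range(0, len(rect) - window + 1, hop)
    return np.stack([rect[s:s + window].sum(axis=0) for s in starts])
```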

[Figure 3. Some of the functions learned on individual octaves. The horizontal axis represents log-frequency. Recall that each octave consists of 24 channels a quarter tone apart. Channel numbers corresponding to peaks are indicated. (a) A minor third (two notes 3 semitones apart). (b) A perfect fourth (two notes 5 semitones apart). (c) A perfect fifth (two notes 7 semitones apart). (d) A quartal chord (each note is 5 semitones apart). (e) A major triad. (f) A percussive sound.]

4.2 Classification

To produce predictions for each song, we voted over all aggregate feature vectors in the song and chose the genre with the highest number of votes. Following standard practice, classification performance was measured by 10-fold cross-validation. For each fold, 100 songs were randomly selected to serve as a test set, with the remaining 900 serving as training data. This procedure was repeated 10 times, and the results averaged to produce a final classification accuracy.
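A sketch of the per-song vote, assuming `clf` is any trained classifier with a `predict` method (such as the linear SVM above); tie-breaking by lowest label index is an artifact of this sketch, not a detail from the paper.

```python
import numpy as np

def predict_song(clf, song_windows):
    """Classify every aggregate window vector of a song, then return the
    genre receiving the highest number of votes (Section 4.2)."""
    votes = clf.predict(song_windows)              # one label per window
    labels, counts = np.unique(votes, return_counts=True)
    return labels[np.argmax(counts)]               # majority genre
```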

Our classification results, along with several other results from the literature, are shown in Figure 4. We see that PSD features learned on individual octaves perform significantly better than those learned on entire frames.¹ Furthermore, our approach outperforms many existing systems which use hand-crafted features. The two systems that significantly outperform our own rely on sophisticated classifiers based on sparse representations (SRC) or compressive sampling (CSC). The fact that our method is still able to reach competitive performance while using a simple classifier indicates that the learned features were able to capture useful properties of the audio that distinguish between genres. One possible interpretation is that some of the basis functions depicted in Figure 3 represent chords specific to certain genres. For example, perfect fifths (e.g. power chords) are very common in rock, blues and country but rare in jazz, whereas quartal chords, which are common in jazz and classical, are seldom found in rock or blues.

¹ In an effort to capture chords which might be split between two of the octaves, we also tried dividing the frequency range into 7 octaves overlapping by half, and similarly learning features on each one. However, this did not yield an increase in accuracy.

Classifier   | Features                       | Acc. (%)
CSC          | Many features [6]              | 92.7
SRC          | Auditory cortical feat. [25]   | 92
RBF-SVM      | Learned using DBN [12]         | 84.3
Linear SVM   | Learned using PSD on octaves   | 83.4 ± 3.1
AdaBoost     | Many features [2]              | 83
Linear SVM   | Learned using PSD on frames    | 79.4 ± 2.8
SVM          | Daubechies wavelets [19]       | 78.5
Log. Reg.    | Spectral covariance [3]        | 77
LDA          | MFCC + other [18]              | 71
Linear SVM   | Auditory cortical feat. [25]   | 70
GMM          | MFCC + other [29]              | 61

Figure 4. Genre recognition accuracy of various algorithms on the GTZAN dataset. Our results, with standard deviations, are marked in bold.

4.3 Discussion

Our results show that automatic feature learning is a viable alternative to using hand-crafted features. Our approach performed better than most systems that pair signal-processing feature extractors with standard classifiers such as SVMs, Nearest Neighbors or Gaussian Mixture Models. Another positive point is that our feature extraction system is very fast, and the use of a simple linear SVM makes this method viable on datasets of any size. Furthermore, the fact that the features are learned in an unsupervised manner means that they are not limited to a particular task, and could be used for other MIR tasks such as chord recognition or autotagging.

We also found that features learned on octaves performed better than features learned on entire frames. This could be due to the fact that in the second case we are learning four times as many parameters as in the first, which could lead to overfitting. Another possibility is that features learned on octaves tend to capture relationships between fundamental notes, whereas features learned on entire frames also seem to capture patterns between fundamentals and their harmonics, which could be less useful for distinguishing between genres.

One aspect that needs mentioning is that since we performed the unsupervised feature learning on the entire dataset (which includes the training and test sets, without labels, for each of the cross-validation folds), our system is technically akin to "transductive learning". Under this paradigm, test samples are known in advance, and the system is simply asked to produce labels for them. We subsequently conducted a single experiment in which features were learned on the training set only, and obtained an accuracy of 80%. Though lower than our overall accuracy, this result is still within the range observed across the 10 different cross-validation experiments, which went from 77% to 87%. The seemingly large deviation in accuracy is likely due to the variation of class distributions between folds.

There are a number of directions in which we would like to extend this work. A first step would be to apply our system to different MIR tasks, such as autotagging. Furthermore, the small size of the GTZAN dataset does not exploit the system's ability to leverage large amounts of data in a tractable amount of time. For this, the Million Song Dataset [4] would be ideal. A limitation of our system is that it ignores temporal dependencies between frames. A possible remedy would be to learn features on time-frequency patches instead. Preliminary experiments we conducted in this direction did not yield improved results, as many "learned" basis functions resembled noise. This requires further investigation. We could also try training a second layer of feature extractors on top of the first, since a number of works have demonstrated that using multiple layers can improve classification performance [12, 15, 16].

5. CONCLUSION

In this paper, we have investigated the ability of PSD to automatically learn useful features from constant-Q spectrograms. We found that the learned features capture information about which chords are being played in a particular frame. Furthermore, these learned features can perform at least as well as hand-crafted features for the task of genre recognition. Finally, the system we proposed is fast and uses a simple linear classifier, which scales well to large datasets.

In future work, we will apply this method to larger datasets, as well as a wider range of MIR tasks. We will also experiment with different ways of capturing temporal dependencies between frames. Finally, we will investigate using hierarchical systems of feature extractors to learn higher-level features.

6. REFERENCES

[1] A. Beck and M. Teboulle: "A Fast Iterative Shrinkage-Thresholding Algorithm with Application to Wavelet-Based Image Deblurring," ICASSP '09, 2009.

[2] J. Bergstra, N. Casagrande, D. Erhan, D. Eck and B. Kegl: "Aggregate features and AdaBoost for music classification," Machine Learning, 65(2-3):473-484, 2006.

[3] J. Bergstra, M. Mandel and D. Eck: "Scalable genre and tag prediction using spectral covariance," Proceedings of the 11th International Conference on Music Information Retrieval (ISMIR), 2010.

[4] T. Bertin-Mahieux, D. Ellis, B. Whitman and P. Lamere: "The Million Song Dataset," Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR), 2011.

[5] C.-C. Chang and C.-J. Lin: "LIBSVM: a library for support vector machines," software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001.

[6] K. Chang, J. Jang and C. Iliopoulos: "Music genre classification via compressive sampling," Proceedings of the 11th International Conference on Music Information Retrieval (ISMIR), pp. 387-392, 2010.

[7] S.S. Chen, D.L. Donoho and M.A. Saunders: "Atomic Decomposition by Basis Pursuit," SIAM Journal on Scientific Computing, 20(1):33-61, 1999.

[8] A. Courville, J. Bergstra and Y. Bengio: "A Spike and Slab Restricted Boltzmann Machine," Journal of Machine Learning Research, W&CP 15, 2011.

[9] I. Daubechies, M. Defrise and C. De Mol: "An Iterative Thresholding Algorithm for Linear Inverse Problems with a Sparsity Constraint," Communications on Pure and Applied Mathematics, 57:1413-1457, 2004.

[10] D.L. Donoho and M. Elad: "Optimally sparse representation in general (nonorthogonal) dictionaries via L1 minimization," Proceedings of the National Academy of Sciences, 100(5):2197-2202, 2003.

[11] K. Gregor and Y. LeCun: "Learning Fast Approximations of Sparse Coding," Proceedings of the 27th International Conference on Machine Learning (ICML), Haifa, Israel, 2010.

[12] P. Hamel and D. Eck: "Learning Features from Music Audio with Deep Belief Networks," Proceedings of the 11th International Conference on Music Information Retrieval (ISMIR), 2010.

[13] P. Herrera-Boyer, G. Peeters and S. Dubnov: "Automatic Classification of Musical Instrument Sounds," Journal of New Music Research, 32(1), March 2003.

[14] K. Kavukcuoglu, M.A. Ranzato and Y. LeCun: "Fast Inference in Sparse Coding Algorithms with Applications to Object Recognition," Computational and Biological Learning Laboratory, Technical Report CBLL-TR-2008-12-01, 2008.

[15] Y. LeCun, K. Kavukcuoglu and C. Farabet: "Convolutional Networks and Applications in Vision," Proc. International Symposium on Circuits and Systems (ISCAS '10), IEEE, 2010.

[16] H. Lee, Y. Largman, P. Pham and A. Ng: "Unsupervised feature learning for audio classification using convolutional deep belief networks," Advances in Neural Information Processing Systems (NIPS) 22, 2009.

[17] H. Lee, A. Battle, R. Raina and A.Y. Ng: "Efficient Sparse Coding Algorithms," Advances in Neural Information Processing Systems (NIPS), 2006.

[18] T. Li and G. Tzanetakis: "Factors in automatic musical genre classification," IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, New York, 2003.

[19] T. Li, M. Ogihara and Q. Li: "A comparative study on content-based music genre classification," Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '03), 2003.

[20] Y. Li and S. Osher: "Coordinate descent optimization for l1 minimization with application to compressed sensing; a greedy algorithm," Inverse Problems and Imaging, 3(3):487-503, 2009.

[21] S. Mallat and Z. Zhang: "Matching Pursuits with time-frequency dictionaries," IEEE Transactions on Signal Processing, 41(12):3397-3415, 1993.

[22] P.-A. Manzagol, T. Bertin-Mahieux and D. Eck: "On the use of sparse time relative auditory codes for music," Proceedings of the 9th International Conference on Music Information Retrieval (ISMIR), 2008.

[23] M. Norouzi, M. Ranjbar and G. Mori: "Stacks of Convolutional Restricted Boltzmann Machines for Shift-Invariant Feature Learning," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.

[24] B. Olshausen and D. Field: "Emergence of simple-cell receptive field properties by learning a sparse code for natural images," Nature, 1996.

[25] Y. Panagakis, C. Kotropoulos and G.R. Arce: "Music genre classification using locality preserving non-negative tensor factorization and sparse representations," Proceedings of the 10th International Conference on Music Information Retrieval (ISMIR), pp. 249-254, 2009.

[26] G. Peeters: "A large set of audio features for sound description (similarity and classification) in the CUIDADO project," Technical Report, IRCAM, 2004.

[27] C. Schoerkhuber and A. Klapuri: "Constant-Q transform toolbox for music processing," 7th Sound and Music Computing Conference, Barcelona, Spain, 2010.

[28] E. Smith and M. Lewicki: "Efficient Auditory Coding," Nature, 2006.

[29] G. Tzanetakis and P. Cook: "Musical Genre Classification of Audio Signals," IEEE Transactions on Speech and Audio Processing, 10(5):293-302, 2002.

[30] J. Yang, K. Yu, Y. Gong and T. Huang: "Linear Spatial Pyramid Matching Using Sparse Coding for Image Classification," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.