
MUSIC GENRE CLASSIFICATION USING SPECTRAL ANALYSIS AND SPARSE REPRESENTATION OF THE SIGNALS

Mehdi Banitalebi-Dehkordi, Amin Banitalebi-Dehkordi

ABSTRACT

In this paper, we propose a robust music genre classification method based on a sparse FFT-based feature extraction method, which exploits the discriminating power of spectral analysis of non-stationary audio signals and the capability of sparse-representation-based classifiers. The feature extraction method combines two sets of features, namely short-term features (extracted from windowed signals) and long-term features (extracted from a combination of the extracted short-time features). Experimental results demonstrate that the proposed feature extraction method leads to a sparse representation of audio signals. As a result, a significant reduction in the dimensionality of the signals is achieved. The extracted features are then fed into a sparse-representation-based classifier (SRC). Our experimental results on the GTZAN database demonstrate that the proposed method outperforms the other state-of-the-art SRC approaches. Moreover, the computational efficiency of the proposed method is better than that of the other Compressive Sampling (CS)-based classifiers.

Index Terms: Feature Extraction, Compressive Sampling, Genre Classification

1. INTRODUCTION

Audio classification provides useful information for understanding the content of both audio and audio-visual recordings. Audio information can be classified from different points of view. Among them, the generic classes of music have attracted a lot of attention [1]. Low bit-rate audio coding is an application that can benefit from distinguishing music classes [2]. In previous studies, various classification schemes and feature extraction methods have been used for this purpose. Most music genre classification algorithms resort to the so-called bag-of-features approach [3-4], which models audio signals by the long-term statistical distribution of their short-time features. Features commonly exploited for music genre classification can be roughly classified into timbral texture, rhythmic, and pitch content features, or their combinations [3-4]. Once descriptive features have been extracted, pattern recognition algorithms are employed to classify them into genres. These features can be categorized into two types, namely short-term and long-term features. The short-term features are derived from a short segment (such as a frame). In contrast, long-time features usually characterize the variation of spectral shape or beat information within a long segment. The short and long segments are also referred to as the "analysis window" and the "texture window", respectively [5]. In short-time estimates, the signal is partitioned into successive frames using small windows. If an appropriate window size is chosen, the signal within each frame can be considered stationary. The windowed signal is then transformed into another representation space in order to achieve good discrimination and/or energy compaction properties. The length of a short-term analysis window depends on the signal type. In music classification, the length of a short-term window is influenced by the adopted audio coding scheme [6]. A music data window of length less than 50 ms is usually referred to as a short-term window [7].

Non-stationary signals such as audio signals can be modeled as the product of a narrow-bandwidth low-pass process modulating a higher-bandwidth carrier [7]. The low-pass content of these signals cannot be effectively captured with an analysis window that is too short. To remedy this deficiency of short-term feature analysis, we propose a simple but very effective method for combining the short-term and long-term features. Optimum selection of the number of features also plays an important role in this context. Too few features can fail to encapsulate sufficient information, while too many features usually degrade the performance since some may be irrelevant. Moreover, too many features could also entail excessive computation, reducing the system's efficiency. As a result, we need an effective method for feature extraction and/or selection. Compressive or Sparse Sampling turns out to be an appropriate tool for such a purpose. In this paper, a Compressive Sampling (CS)-based classifier, which is based on the theory of sparse representation and reconstruction of signals, is presented for music genre classification. The innovative aspect of our approach lies in the adopted method of combining short-term (extracted from windowed signals) and long-term (extracted from a combination of the extracted short-time features) signal characteristics to make a decision.

An important issue in audio classification algorithms that has not been widely investigated is the effect of background noise on the classification performance. In fact, a classification algorithm trained using clean sequences may fail to work properly when the actual testing sequences contain background noise at a certain signal-to-noise ratio (SNR). For practical applications in which environmental sounds are involved in audio classification tasks, noise robustness is an essential characteristic of the processing system. We show that the proposed feature extraction and data classification algorithm is robust to background noise.

The rest of this paper is organized as follows. In the next section, the proposed feature extraction algorithm is described. The corresponding CS-based classifier is then introduced in Section 3. The experimental settings and results are detailed in Section 4, leading to conclusions in Section 5.

2. COMPRESSIVE SENSING BACKGROUND

We sample the audio signal $x(t)$ at the Nyquist rate and process it in frames of $N$ samples. Each frame is then an $N \times 1$ vector $x$, which can be represented as $x = \Psi X$, where $\Psi$ is an $N \times N$ matrix whose columns are the similarly sampled basis functions $\psi_i(t)$, and $X$ is the vector that chooses the linear combinations of the basis functions. $X$ can be thought of as $x$ in the domain of $\Psi$, and it is $X$ that is required to be sparse for compressed sensing to perform well. We say that $X$ is $K$-sparse if it contains only $K$ non-zero elements. In other words, $x$ can be exactly represented by a linear combination of $K$ basis functions.

It is also important to note that compressed sensing will also recover signals that are not truly sparse, as long as these signals are highly compressible, meaning that most of the energy of $x$ is contained in a small number of the elements of $X$.

Given a measurement sample $Y \in \mathbb{R}^{M}$ and a dictionary $D \in \mathbb{R}^{M \times N}$ (the columns of $D$ are referred to as the atoms), we seek a vector solution $X$ satisfying:

$$\min \|X\|_0 \quad \text{s.t.} \quad Y = DX \tag{1}$$

In the above equation, $\|X\|_0$ (known as the $\ell_0$ norm) is the number of non-zero coefficients of $X$ [5].

In Fig. 1, consider a signal $X$ of length $N$ that has $K$ non-zero coefficients on the sparse basis matrix $\Psi$, and consider also an $M \times N$ measurement basis matrix $\Phi$, $M \ll N$, where the rows of $\Phi$ are incoherent with the columns of $\Psi$. In matrix notation, we have $X = \Psi\theta$, in which $\theta$ can be approximated using only $K \ll N$ non-zero entries. The CS theory states that such a signal $X$ can be reconstructed by taking linear, non-adaptive measurements as follows [10]:

$$Y = \Phi X = \Phi\Psi\theta = A\theta \tag{2}$$

where $Y$ represents an $M \times 1$ sampled vector and $A = \Phi\Psi$ is an $M \times N$ matrix. The reconstruction is equivalent to finding the signal's sparse coefficient vector $\theta$, which can be cast as an $\ell_0$ optimization problem:

$$\min \|\theta\|_0 \quad \text{s.t.} \quad Y = \Phi\Psi\theta = A\theta \tag{3}$$

Generally, (3) is NP-hard, and this $\ell_0$ optimization is replaced with an $\ell_1$ optimization [10] as follows:

$$\min \|\theta\|_1 \quad \text{s.t.} \quad Y = \Phi\Psi\theta = A\theta \tag{4}$$

From (4) we will recover $\theta$ with high probability if enough measurements are taken. In general, the $\ell_p$ norm is defined as:

$$\|a\|_p = \Big( \sum_j |a_j|^p \Big)^{1/p} \tag{5}$$

Equation (4) can be reformulated as:

$$\min_{\theta} \|Y - \Phi\Psi\theta\|_2 \quad \text{s.t.} \quad \|\theta\|_0 = K \tag{6}$$

There are a variety of algorithms to perform the reconstructions in (4) and (6); in this paper we make use of OMP [11] to solve (6).

OMP is a relatively efficient iterative algorithm that produces one component of $\hat{\theta}$ in each iteration, and thus allows for simple control of the sparsity of $\hat{\theta}$. As the true sparsity is often unknown, the OMP algorithm is run for a pre-determined number of iterations, $K$, resulting in $\hat{\theta}$ being $K$-sparse.

Figure 1. The measurement of Compressive Sampling

3. FEATURE EXTRACTION

In many music genre classification algorithms, the authors use the raw audio signal to determine the music classes, which affects computational complexity. In [1-4], the computational complexity of the algorithms is high, since they use the high-dimensional received signals. In [11], the authors also used the high-dimensional received signals, but they tried to reduce computational complexity through a feature extraction process that selects the useful data for music genre classification.

Timbral texture features are frequently used in various music information retrieval systems [8]. Some timbral texture features that are widely used for audio classification have been summarized in Table I [1]. Among them, MFCC (Mel-frequency cepstral coefficient representation), spectral centroids, spectral roll-off, spectral flux and zero crossings are short-time features, thus their statistics are computed over a texture window. The low-energy feature is a long-time feature.

In previous speech processing research, various feature extraction methods have been used. Scheirer and Slaney proposed features such as 4-Hz modulation energy and spectral centroids [2]. Various content-based features have been proposed for applications such as sound classification and music information retrieval (MIR) [3-4]. These features can be categorised into two types, namely short-term and long-term features. The short-term features are derived from a short segment (such as a frame). In contrast, long-time features usually characterise the variation of spectral shape or beat information within a long segment.
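The distinction between short-time features and their statistics over a texture window can be illustrated with a minimal sketch. It is not the authors' sparse FFT-based feature extraction, which is not detailed in this part of the paper. It only computes two of the timbral features named above (spectral centroid and zero-crossing rate) per frame, then summarises them by mean and standard deviation over a texture window. The function names, the Hann window, the 1024-sample frame (about 46 ms at an assumed 22050 Hz sampling rate, below the 50 ms short-term limit mentioned in the Introduction) and the 40-frame texture window are all illustrative assumptions.

```python
import numpy as np


def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames, one frame per row."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]


def short_time_features(x, sr, frame_len=1024, hop=512):
    """Per-frame spectral centroid (Hz) and zero-crossing rate.

    frame_len and hop are assumed values, not taken from the paper.
    """
    frames = frame_signal(x, frame_len, hop)
    # Magnitude spectrum of each Hann-windowed frame.
    mag = np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1))
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)
    # Spectral centroid: magnitude-weighted mean frequency.
    centroid = (mag * freqs).sum(axis=1) / (mag.sum(axis=1) + 1e-12)
    # Zero-crossing rate: fraction of adjacent sample pairs that change sign.
    signs = np.signbit(frames)
    zcr = np.mean(signs[:, 1:] != signs[:, :-1], axis=1)
    return np.column_stack([centroid, zcr])


def texture_window_stats(feats, win=40):
    """Mean and standard deviation of short-time features per texture window.

    win (frames per texture window) is an assumed value.
    """
    n = feats.shape[0] // win
    blocks = feats[: n * win].reshape(n, win, feats.shape[1])
    return np.hstack([blocks.mean(axis=1), blocks.std(axis=1)])


# Synthetic example: a 2-second, 440 Hz tone stands in for real audio.
sr = 22050
t = np.arange(2 * sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)
short_term = short_time_features(tone, sr)      # one row per analysis window
long_term = texture_window_stats(short_term)    # one row per texture window
print(short_term.shape, long_term.shape)
```

For the pure tone, the per-frame centroid stays close to 440 Hz and the zero-crossing rate close to 2 x 440 / 22050, which gives a quick sanity check before applying the same functions to real recordings.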