Arxiv:1804.01149V1 [Cs.SD] 3 Apr 2018
Total Page:16
File Type:pdf, Size:1020Kb
Music Genre Classification using Machine Learning Techniques Hareesh Bahuleyan University of Waterloo, ON, Canada [email protected] Abstract audio file. The first model described in this paper Categorizing music files according to their uses convolutional neural networks (Krizhevsky genre is a challenging task in the area et al., 2012), which is trained end-to-end on the of music information retrieval (MIR). In MEL spectrogram of the audio signal. In the sec- this study, we compare the performance ond part of the study, we extract features both in of two classes of models. The first is a the time domain and the frequency domain of the deep learning approach wherein a CNN audio signal. These features are then fed to con- model is trained end-to-end, to predict the ventional machine learning models namely Logis- genre label of an audio signal, solely us- tic Regression, Random Forests (Breiman, 2001), ing its spectrogram. The second approach Gradient Boosting (Friedman, 2001) and Support utilizes hand-crafted features, both from Vector Machines which are trained to classify the the time domain and frequency domain. given audio file. The models are evaluated on the Audio Set We train four traditional machine learning dataset (Gemmeke et al., 2017). We classifiers with these features and compare compare the proposed models and also study the their performance. The features that con- relative importance of different features. tribute the most towards this classification The rest of this paper is organized as follows. task are identified. The experiments are Section2 describes the existing methods in the lit- conducted on the Audio set data set and we erature for the task of music genre classification. report an AUC value of 0:894 for an en- Section3 is an overview of the the dataset used semble classifier which combines the two in this study and how it was obtained. The pro- proposed approaches.1 posed models and the implementation details are discussed in Section4. The results are reported in 1 Introduction Section 5.2, followed by the conclusions from this With the growth of online music databases and study in Section6. easy access to music content, people find it in- creasing hard to manage the songs that they lis- 2 Literature Review ten to. One way to categorize and organize songs Music genre classification has been a widely stud- arXiv:1804.01149v1 [cs.SD] 3 Apr 2018 is based on the genre, which is identified by some characteristics of the music such as rhyth- ied area of research since the early days of the mic structure, harmonic content and instrumen- Internet. Tzanetakis and Cook(2002) addressed tation (Tzanetakis and Cook, 2002). Being able this problem with supervised machine learning ap- to automatically classify and provide tags to the proaches such as Gaussian Mixture model and k- music present in a user’s library, based on genre, nearest neighbour classifiers. They introduced 3 would be beneficial for audio streaming services sets of features for this task categorized as tim- such as Spotify and iTunes. This study explores bral structure, rhythmic content and pitch con- the application of machine learning (ML) algo- tent. Hidden Markov Models (HMMs), which rithms to identify and classify the genre of a given have been extensively used for speech recognition tasks, have also been explored for music genre 1The code has been opensourced and is available at https://github.com/HareeshBahuleyan/ classification (Scaringella and Zoia, 2005; Soltau music-genre-classification et al., 1998). Support vector machines (SVMs) with different distance metrics are studied and animal sounds and so on2. This study requires compared in Mandel and Ellis(2005) for classi- only the audio files that belong to the music cat- fying genre. egory, specifically having one of the seven genre In Lidy and Rauber(2005), the authors dis- tags shown in Table1. cuss the contribution of psycho-acoustic features for recognizing music genre, especially the impor- Table 1: Number of instances in each genre class tance of STFT taken on the Bark Scale (Zwicker and Fastl, 1999). Mel-frequency cepstral coef- Genre Count ficients (MFCCs), spectral contrast and spectral 1 Pop Music 8100 roll-off were some of the features used by (Tzane- 2 Rock Music 7990 takis and Cook, 2002). A combination of visual and acoustic features are used to train SVM and 3 Hip Hop Music 6958 AdaBoost classifiers in Nanni et al.(2016). 4 Techno 6885 With the recent success of deep neural net- 5 Rhythm Blues 4247 works, a number of studies apply these techniques to speech and other forms of audio data (Abdel- 6 Vocal 3363 Hamid et al., 2014; Gemmeke et al., 2017). Rep- 7 Reggae Music 2997 resenting audio in the time domain for input to neural networks is not very straight-forward be- Total 40540 cause of the high sampling rate of audio signals. However, it has been addressed in Van Den Oord The number of audio clips in each category et al.(2016) for audio generation tasks. A com- has also been tabulated. The raw audio clips of mon alternative representation is the spectrogram these sounds have not been provided in the Audio of a signal which captures both time and frequency Set data release. However, the data provides the information. Spectrograms can be considered as YouTubeID of the corresponding videos, along images and used to train convolutional neural net- with the start and end times. Hence, the first task works (CNNs) (Wyse, 2017). A CNN was de- is to retrieve these audio files. For the purpose of veloped to predict the music genre using the raw audio retrieval from YouTube, the following steps MFCC matrix as input in Li et al.(2010). In were carried out: Lidy and Schindler(2016), a constant Q-transform (CQT) spectrogram was provided as input to the 1. A command line program called CNN to achieve the same task. youtube-dl (Gonzalez, 2006) was This work aims to provide a comparative study utilized to download the video in the mp4 between 1) the deep learning based models which format. only require the spectrogram as input and, 2) the traditional machine learning classifiers that need 2. The mp4 files are converted into the desired to be trained with hand-crafted features. We also wav format using an audio converter named investigate the relative importance of different fea- ffmpeg (Tomar, 2006) (command line tool). tures. Each wav file is about 880 KB in size, which means that the total data used in this study is ap- 3 Dataset proximately 34 GB. In this work, we make use of Audio Set, which is 4 Methodology a large-scale human annotated database of sounds This section provides the details of the data pre- (Gemmeke et al., 2017). The dataset was cre- processing steps followed by the description of ated by extracting 10-second sound clips from a the two proposed approaches to this classification total of 2.1 million YouTube videos. The audio problem. files have been annotated on the basis of an on- tology which covers 527 classes of sounds includ- 2https://research.google.com/audioset/ ing musical instruments, speech, vehicle sounds, ontology/index.html Figure 1: Sample spectrograms for 1 audio signal from each music genre Figure 2: Convolutional neural network architecture (Image Source: Hvass Tensorflow Tutorials) 4.1 Data Pre-processing the image class. In this study, the sound wave can In order to improve the Signal-to-Noise Ratio be represented as a spectrogram, which in turn can (SNR) of the signal, a pre-emphasis filter, given be treated as an image (Nanni et al., 2016)(Lidy by Equation1 is applied to the original audio sig- and Schindler, 2016). The task of the CNN is to nal. use the spectrogram to predict the genre label (one of seven classes). y(t) = x(t) − α ∗ x(t − 1) (1) 4.2.1 Spectrogram Generation where, x(t) refers to the original signal, and y(t) refers to the filtered signal and α is set to 0.97. A spectrogram is a 2D representation of a signal, Such a pre-emphasis filter is useful to boost ampli- having time on the x-axis and frequency on the tudes at high frequencies (Kim and Stern, 2012). y-axis. A colormap is used to quantify the mag- nitude of a given frequency within a given time 4.2 Deep Neural Networks window. In this study, each audio signal was con- verted into a MEL spectrogram (having MEL fre- Using deep learning, we can achieve the task of quency bins on the y-axis). The parameters used music genre classification without the need for to generate the power spectrogram using STFT are hand-crafted features. Convolutional neural net- listed below: works (CNNs) have been widely used for the task of image classification (Krizhevsky et al., 2012). • Sampling rate (sr) = 22050 The 3-channel (RGB) matrix representation of an image is fed into a CNN which is trained to predict • Frame/Window size (n fft) = 2048 • Time advance between frames (hop size) In this study, a CNN architecture known as = 512 (resulting in 75% overlap) VGG-16, which was the top performing model in the ImageNet Challenge 2014 (classification + lo- • Window Function: Hann Window calization task) was used (Simonyan and Zisser- • Frequency Scale: MEL man, 2014). The model consists of 5 convolutional blocks (conv base), followed by a set of densely • Number of MEL bins: 96 connected layers, which outputs the probability that a given image belongs to each of the possible • Highest Frequency (f max) = sr/2 classes. 4.2.2 Convolutional Neural Networks For the task of music genre classification using spectrograms, we download the model architec- From the Figure1, one can understand that there ture with pre-trained weights, and extract the conv exists some characteristic patterns in the spectro- base.