
Music Genre Classification using Mid-Level Features

Hannah Bollar, Shivangi Misra, Taylor Shelby
University of Pennsylvania
[email protected], [email protected], [email protected]

Abstract

The goal of our project was to classify music genres using more human-intuitive features. We used mid-level features such as those found in sheet music, including pitch, rhythm, tempo, and instruments. For a large dataset, this information is obtainable from MusicXML files. In order to extract relevant patterns in temporal and spatial information from these features, long short-term memory (LSTM) recurrent neural networks were used for the classification task. Despite a messy data set and minimal features, we were able to classify 5 genres with an accuracy of approximately 63%.

1 Credits

Though we made our own initial derivations and explanations for how we would implement this project, we found both Bahuleyan's Music Genre Classification using Machine Learning Techniques and McKay's Automatic Genre Classification of MIDI Recordings to be incredibly helpful throughout this process (Bahuleyan, 2018; McKay, 2004).

2 Introduction

Music genre classification has been addressed by machine learning from a variety of perspectives and approaches (Z. Fu and Zhang, 1981). There is inherent difficulty in this classification due to the somewhat subjective nature of the definitions of genre, since the genres themselves have been developed over time based on a collective musical understanding instead of a defined set of pre-written rules. These characteristics are often difficult to pinpoint, as they may be due to patterns in the rhythm of the song, the instruments used to play the song, the lyrical content of the song, or a combination of these and more.

Since we do not clearly understand how humans classify genre, it is hard to select features to pass into various machine learning algorithms without first understanding how strongly each feature acts as a defining attribute of a given genre. Much of the current research into music genre classification focuses on low-level audio-based features, such as amplitude modulation and zero crossing rate (Z. Fu and Zhang, 1981); however, there are many mid-level features fundamental to the typical human approach to genre classification which are difficult to extract from frequency analysis.

3 Background

3.1 Music and Features

Features such as rhythm, harmony, and chord progression are key indicators, used directly by musicians and intuitively by non-experts, yet these are difficult to extract from audio spectrogram data. Though research is underway regarding better ways to extract these features from audio, it seems prudent to first demonstrate their effectiveness as identifiers (McKay, 2004). To do this, we propose extracting features including rhythm, harmony, and chord progression from MIDI data files. It has been shown that using these mid-level features has the potential to provide results comparable to low-level features, with a smaller feature space (C. Pérez-Sancho and Iñesta, 2009; M. G. Armentano and Cardoso, 2016); however, we believe there is more to be explored.

To work with these features, we extract data from MusicXML files. As there is no large database of these, we found a MIDI dataset and converted the files into XMLs using MuseScore (Mus). MIDI's (Musical Instrument Digital Interface) formatting acts as a music storage and transfer protocol which carries 16 channels of information, each containing event messages that dictate notation, pitch, velocity, vibrato, panning, or clock signals for tempo in songs. MusicXML also includes this data, but in a readable text format, explicitly mentioning things such as the pitch and rhythm type of notes, which makes it ideal for extracting the mid-level features we wish to utilize. This process is explained in detail in section 4.3.

3.2 Algorithms

Due to our disparate featurization, we spent some time deciding between Recurrent Neural Networks (RNNs) and Support Vector Machines (SVMs). The following is an explanation of each, along with their pros and cons.

RNNs use internal states of the nodes to process information as well as the incoming stream of data. These nodes act as a layering system with a directed connection to every other node in the following layer. The nodes themselves have an activation state, weight, and tags marking them as input, output, or hidden nodes depending on the layer in which they are placed.

One main problem with RNNs is that they often face the "vanishing gradient" problem: the update value returned from the partial derivative of the error function with respect to the current weight is too small. This leads to the weight not changing enough, or not noticeably at all (due to floating-point precision limits); it can either heavily slow down training while stuck in this chasm or stop the network from progressing entirely. To get around this issue while still using RNNs, we consider the Long Short-Term Memory (LSTM) variation of RNNs for our work. This variation is beneficial since it uses feedback connections (instead of just feedforward ones) to augment the error and update calculations.

SVMs use classification and regression analysis to analyze data. Using this technique, we would need to implement multiclass SVMs with a kernel trick for our output to converge as expected. These have been used with some success in other music classification machine learning research, so we believe them to be a viable alternative (M. I. Mandel, 2005). However, given that our feature set is different from the ones commonly used for music classification, we opted not to use them for our project.

3.3 Thoughts and Expectations

Experiments using mid-level features similar to those that we have used have resulted in accuracies of up to 85%; however, with our limited feature set, we expect accuracy closer to that of genre classification experiments using equally limited features, approximately 40% (Z. Fu and Zhang, 1981).

Even with those expectations, we did foresee some issues. It is possible that the mid-level features we have selected will not provide enough data on their own (since we hand-picked them based on usability and accessibility), and even the way we have chosen to input these features to our network can generate differing results compared to other input representations. This further reinforces the well-established idea that music genre classification is a complex problem, dependent on many features that even present-day scholars are still trying to understand (Z. Fu and Zhang, 1981).

Furthermore, due to the sequential nature of music chord progressions, we have chosen to both use RNNs and encode some sequence into our features by utilizing multiple measures at a time. This is something from which we expected an interesting output, since we had not yet determined whether using both of these is redundant or critical to capturing the sequential aspect of music.

4 Implementation Details

4.1 Genres

From a music theory standpoint, chord progressions are a key factor in distinguishing between genres, so we will encode both chords and some representation of their sequence as features for our model.

Because there are hundreds of potential genres, we limit our space to the following defined genres - 'Classical', 'Country', 'Disco', 'Jazz', 'Modern' (inclusive of Pop and Rock genres) - so that we can work with a more defined data set. This small subset of genres allows us to limit the scope of our research, as well as comment on our aforementioned hypothesis based on conclusive experimental evidence. These genres were selected to provide examples of both very different genres (classical and pop) and potentially similar genres (disco and pop).
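For reference, the snippet below sketches the MIDI-to-MusicXML conversion step described in section 3.1 as one might script it around a locally installed MuseScore binary. The binary name ("mscore"), folder layout, and timeout are illustrative assumptions rather than part of our actual pipeline.

```python
# Minimal sketch: convert a folder of MIDI files to MusicXML by shelling out
# to a locally installed MuseScore binary. Assumes the binary is on PATH as
# "mscore" and that it infers the output format from the ".musicxml"
# extension; adjust names and paths for your own setup.
import subprocess
from pathlib import Path

MIDI_DIR = Path("data/midi")          # hypothetical input folder
XML_DIR = Path("data/musicxml")       # hypothetical output folder
XML_DIR.mkdir(parents=True, exist_ok=True)

for midi_path in sorted(MIDI_DIR.glob("*.mid")):
    xml_path = XML_DIR / (midi_path.stem + ".musicxml")
    try:
        # MuseScore's command-line converter: mscore -o <output> <input>
        subprocess.run(["mscore", "-o", str(xml_path), str(midi_path)],
                       check=True, timeout=120)
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        # Many files in the data set are corrupted; skip them rather than abort.
        print(f"skipping unconvertible file: {midi_path.name}")
```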

4.2 Database

The database we used is the Lakh MIDI Dataset (Raffel, 2016a; Raffel, 2016b). This database has some pre-labeled examples for genre, which was very useful for our project. It is also large enough that we could split off a specific percentage of our data set as training data, while the rest was divided into cross-validation and test data. The MIDI data of these files was used to extract the actual feature data and pair it with labels.

However, the data-set genre labels are user generated, so there is not a firm consensus as to their accuracy. As we were reviewing the data set for troubleshooting, we did find several examples in genres that did not even remotely seem relevant, but the data set was too large to sort through and resolve all of these individually. This introduction of noise almost certainly detracted from our accuracy, though the extent of its impact is difficult to determine. Furthermore, there were many corrupted files in the data set. Rock in particular was already somewhat small, so when the majority of the rock files could not produce viable features, we combined pop with rock due to their often similar nature.

4.3 Feature Extraction to Usable Information

We ultimately used four main types of features: tempo, instruments, notes, and rhythm.

Tempo in beats per minute was read directly from the XML file, and then placed into one of 7 bins, listed in Table 1.

< 40   41-70   71-100   101-130   131-160   161-180   > 180

Table 1: Tempo Categories

Instruments were also placed into 7 categories, which can be seen with some examples in Table 2; a full list is available in Table 6 in the appendix. An unknown category was included to make the system more robust and allow for unique instruments.

Brass       …
Strings     …, Guitar
Woodwind    Flute, …
Keyboard    Piano, Synthesizer
Percussion  Drums, Tambourine
Voice       Soprano, Alto
Unknown     Theremin

Table 2: Instrument Categories

Notes and rhythm for a single measure were paired in a 14x24 matrix. The 14 columns represented the 12 scale degrees, plus one for rest and one for un-pitched percussive hits. The pitch {C, D, E, ...} of a note corresponds to a different number (scale degree in {1, 2, 3, ...}) depending on the key in which the note is played. For example, in C major, C corresponds to the first spot in the array, C#/Db to the second, and so on, while in G major, C corresponds to the sixth spot. This made it possible to compare chords across key signatures without explicitly labelling them, since a C one chord (1, 3, 5) would have the same scale degrees as a G one chord (1, 3, 5), despite being comprised of different notes. Although this ignores chord inversions (3, 5, 1) and (5, 1, 3), for our purposes it is a firm enough foundation to provide distinguishing characteristics between genres. Archetypal jazz seventh chords, for example, do not rely on inversion to be distinguished from the typical one, four, and five chords of pop.

Like the scale degree and the key signature of the song, the rhythm is adaptive based on the time signature of the song. We only included 4 very common time signatures, listed in Table 4. So, for a 4/4 song in C, a quarter note C would turn on ones in the first column for the first 6 rows. An example measure and its resulting table are included in Figure 1 and Table 3; the hit column is excluded for space. Notes smaller than 32nd notes were ignored, due to their relative infrequency in real sheet music. In 4/4 time, for example, a quarter note is 1/4 of the measure, so 6 of the 24 slots are flipped to 1. In 3/4 time, however, it is 1/3 of the measure, so 8 of the 24 slots are flipped to 1. This proved problematic when actually using the test data set, but this will be discussed in more detail in section 5.1.

Using the measure depicted in Figure 1 as an example, we can fill out one measure in Table 3. As you can see, the whole-note chord causes C, E, and G to be turned on for the whole measure, while D and rest are only turned on while they are present. In this way, we encoded the notes, rhythms, and their sequence in one set of features.

Figure 1: Example Measure (C Major)
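To make the encoding above concrete, the following is a simplified sketch of the two steps just described: binning a tempo value into the categories of Table 1, and filling a 24-slot by 14-column measure array from notes given as (column, start slot, length in slots). The helper names and the direct (column, slot) input format are simplifying assumptions; the actual extraction reads this information from the MusicXML.

```python
# Simplified illustration of the feature encoding described above.
# A measure is a 24 x 14 binary array: 24 time slots per measure,
# 12 chromatic scale-degree columns plus one "rest" and one "hit" column.
import numpy as np

TEMPO_BIN_EDGES = [40, 70, 100, 130, 160, 180]    # upper edges of the Table 1 bins

def tempo_category(bpm):
    """Return the index (0-6) of the tempo bin from Table 1."""
    for i, upper in enumerate(TEMPO_BIN_EDGES):
        if bpm <= upper:
            return i
    return len(TEMPO_BIN_EDGES)                   # > 180

SLOTS_PER_MEASURE = 24                            # finest subdivision we keep
N_COLUMNS = 14                                    # 12 scale degrees + rest + hit
REST_COL, HIT_COL = 12, 13

def encode_measure(notes):
    """Fill one measure array from (column, start_slot, n_slots) tuples,
    where column is the 0-based scale-degree column (or REST_COL / HIT_COL)."""
    measure = np.zeros((SLOTS_PER_MEASURE, N_COLUMNS), dtype=np.int8)
    for column, start, length in notes:
        end = min(start + length, SLOTS_PER_MEASURE)   # clip anything past the bar line
        measure[start:end, column] = 1
    return measure

# A measure in the spirit of Figure 1: a whole-note C/E/G chord (24 slots in
# 4/4) plus a quarter-note D (6 slots) somewhere in the melody line.
example = encode_measure([(0, 0, 24), (4, 0, 24), (7, 0, 24),   # C, E, G
                          (2, 6, 6)])                           # D
```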

Table 3: Example measure note/rhythm array (24 time-slot rows by 13 columns, one per scale degree plus rest; C, E, and G are on in all 24 rows, while the D and rest columns are on only in the slots where they sound; the hit column is excluded for space)

Note type        4/4   3/4   6/8   2/4
32nd              0     1     1     1
16th              1     2     2     2
8th               3     4     4     6
dotted 8th        4     6     6     8
quarter           6     8     8    12
dotted quarter    9    12    12    18
half             12    16    16    24
dotted half      18    24    24    NA
whole            24    NA    NA    NA

Table 4: Time signature based rhythm allocation (slots out of 24 per note type)

4.4 The Chosen Pipeline

Following our analysis from section 3.2, we decided to utilize RNNs due to their temporal capabilities. We used the LSTM modules from the PyTorch library to generate a network where the hidden states and cell weights are initialized with random values. Each LSTM cell has an input, forget, cell, and output gate with their respective weight vectors and activation functions. The operation of each LSTM cell can be represented with the following equations (PyT):

i_t = σ(W_{ii} x_t + b_{ii} + W_{hi} h_{t-1} + b_{hi})
f_t = σ(W_{if} x_t + b_{if} + W_{hf} h_{t-1} + b_{hf})
g_t = tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{t-1} + b_{hg})
o_t = σ(W_{io} x_t + b_{io} + W_{ho} h_{t-1} + b_{ho})
c_t = f_t ∗ c_{t-1} + i_t ∗ g_t
h_t = o_t ∗ tanh(c_t)

where h_t is the hidden state at time t, c_t is the cell state at time t, x_t is the input at time t, h_{t-1} is the hidden state of the layer at time t-1 (or the initial hidden state at time 0), and i_t, f_t, g_t, o_t are the input, forget, cell, and output gates, respectively. σ is the sigmoid function and ∗ is the Hadamard product. (W_{ii}, b_{ii}), (W_{if}, b_{if}), (W_{hi}, b_{hi}), (W_{hf}, b_{hf}), (W_{ig}, b_{ig}), (W_{hg}, b_{hg}), (W_{io}, b_{io}), and (W_{ho}, b_{ho}) are the weights and biases associated with the transfer of information between the individual gates of the LSTM cell.

For the task of music genre classification, our network architecture consists of LSTM layers, followed by a fully connected layer and a SoftMax layer at the end to generate prediction probabilities for each genre. A sample LSTM network is shown in Figure 2.

Figure 2: Sample LSTM Classifier Network

Songs are comprised of many measures, so we chose to feed in sets of 10 or 3 measures at a time along with the combined tempo/instrument information (14x1), resulting in a 14x241 or 14x73 array for each feature passed into the network. This meant we were able to get many features out of longer songs, and make use of song snippets as short as 3 measures.

We experimented with different numbers and sizes of the LSTM hidden layers to find an architecture which performs with the best possible accuracy. These results are presented and discussed in section 5.1.
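As an illustration of this pipeline, the following is a minimal sketch of the kind of LSTM classifier described above, built with PyTorch's nn.LSTM; the hidden width, layer count, and input packing shown are illustrative choices rather than our exact configuration.

```python
# Minimal sketch of the LSTM -> fully connected -> SoftMax classifier described
# above. Hidden width, layer count, and input packing are illustrative choices.
import torch
import torch.nn as nn

N_FEATURES = 14   # 12 scale degrees + rest + hit (plus the tempo/instrument column)
N_GENRES = 5      # Classical, Country, Disco, Jazz, Modern

class GenreLSTM(nn.Module):
    def __init__(self, hidden_size=28, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(input_size=N_FEATURES, hidden_size=hidden_size,
                            num_layers=num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, N_GENRES)

    def forward(self, x):
        # x: (batch, sequence length, 14); e.g. sequence length 241 for a
        # 10-measure snippet (10 * 24 slots + 1 tempo/instrument column).
        _, (h_n, _) = self.lstm(x)            # h_n: (num_layers, batch, hidden)
        logits = self.fc(h_n[-1])             # final layer's last hidden state
        return torch.softmax(logits, dim=1)   # per-genre prediction probabilities

# Example forward pass on a batch of 8 random 10-measure snippets.
model = GenreLSTM()
probs = model(torch.rand(8, 241, N_FEATURES))
```

In training, the SoftMax output would typically be paired with a negative log-likelihood or cross-entropy loss; the explicit SoftMax here simply mirrors the architecture description.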

Number of measures   Two LSTM layers,   Two LSTM layers,   Single LSTM layer,
in data              width 14           width 28           width 28
3                    train: 62.84%      train: 63.08%      train: 62.98%
                     val:   63.46%      val:   64.16%      val:   63.50%
                     test:  63.47%      test:  63.85%      test:  63.41%
10                   train: 63.40%      train: 63.06%      train: 63.46%
                     val:   63.34%      val:   63.48%      val:   63.05%
                     test:  63.13%      test:  64.88%      test:  64.23%

Table 5: Classifier Accuracies

5 Discussion

5.1 Results and Analysis

We tried varying the structure of the RNN in both the number of hidden layers and the number of hidden dimensions, but the accuracy did not vary by more than a few percent. We also tested 10-measure-long features and 3-measure-long features, and had similar results for both. These values are presented in Table 5. This could indicate several things. Perhaps the sequential nature was not as important as we thought, or only close-range sequence matters, or the RNN structure accounted for the sequence regardless of the number of measures passed in simultaneously. Features such as instrument and tempo that remain constant throughout the song may also be weighted more by the classifier in making its decision.

We also studied the average test error of the classifier for the different music genres we considered, shown in Figure 3. These errors were calculated using samples from the test data in batches of size 1000. From the bar graph, we see that the classical genre had the lowest error. This is because it is probably the most distinct from the other genres we used. We also found that the data set we used had a higher proportion of classical songs than of any other genre. Due to a combination of these factors, the classifier learned to distinguish this genre more accurately than the rest. The higher test errors for songs from the other four genres can also be justified since it is possible that they have common features, or mixed music styles that the classifier failed to recognize. To illustrate with an example, 'Uptown Funk' is a song that one may classify as belonging to both the 'disco' and 'pop' genres.

Figure 3: Testing Error per Genre

6 Conclusion

Our overall accuracy of around 63% is higher than the approximately 40% we anticipated from research with a similar number of features (Z. Fu and Zhang, 1981), indicating the potential of the features we selected. As expected, the network performed best on the most distinct genre (classical), and had the most difficulty with disco, a genre from which many modern songs borrow elements.

6.1 Future Work

Possible future work includes comparing the SVM and RNN results instead of just picking one, and determining which of the features individually do the best job of distinguishing the different genres, and thus which we can ignore when building future networks. Additionally, for our data conversion, time signature changes beat length for storage. This works, but is not optimal for feature comparison, as it makes note patterns which should look rhythmically the same look different in different time signatures.

In other feature-based expansion, our handling of rests has significant potential to be improved. We currently offset each rest based on the x-locations of nearby notes such that the ordering of the rests compared to the notes is correct. However, many of the MIDI files from the data set we used were machine-converted from audio, meaning they are not 100% accurate to the original sheet music. This leads to issues such as doubled-up rests, 128th notes, and strange offsets. To account for this, we ignored any notes which were offset past the end of the measure. Obviously this is not ideal, and finding some way to utilize this data would potentially improve the training data for rhythm by a large margin.

Additionally, while we handled the features we thought had the most potential as our inputs, there are many more features which could be explored at this mid level. One example would be to include more specific instruments, instead of the broad categories we chose, since some instruments are highly specific to certain genres. Some other mid-level features, such as dynamics, would be difficult to encode from XML data, but could provide a nice augmentation to research done in audio-based fields.

Another option would be to compare based on time period within one genre, seeing if, for example, modern jazz is drastically different from older jazz. The release date of a song could also be passed in as a feature, and could be very helpful for genres such as classical and disco.

Final Thoughts

6.2 Project Video

Link to our project video.

Acknowledgments

We would like to thank Dr. Roth, the teaching assistants, MuseScore (Mus), and this institution for assistance with this project.

References

Musescore. https://musescore.org/en. Accessed: 2019-10-25.

Pytorch. https://pytorch.org/docs/stable/nn. Accessed: 2019-11-20.

H. Bahuleyan. 2018. Music genre classification using machine learning techniques.

C. Pérez-Sancho, D. Rizo, and J. M. Iñesta. 2009. Genre classification using chords and stochastic language models. Connection Science, 21(2-3):145–159.

M. G. Armentano, W. A. De Noni, and Hernán F. Cardoso. 2016. Genre classification of symbolic pieces of music. Journal of Intelligent Information Systems, 48(3):579–599.

M. I. Mandel and D. P. W. Ellis. 2005. Music genre classification using machine learning techniques.

C. McKay. 2004. Automatic genre classification of MIDI recordings. Master of Arts thesis.

C. Raffel. 2016a. The Lakh MIDI dataset v0.1. https://colinraffel.com/projects/lmd/. Accessed: 2019-10-25.

C. Raffel. 2016b. Learning-based methods for comparing sequences, with applications to audio-to-MIDI alignment and matching.

Z. Fu, G. Lu, K. M. Ting, and D. Zhang. 1981. A survey of audio-based music classification and annotation. IEEE Transactions on Multimedia, 13(2):303–319.

Appendix

Full instrument categorization:

Brass       trumpet, …, tuba, …, flugelhorn, …, bugle
Strings     guitar, bass, violin, viola, cello, banjo, mandolin, ukulele, harp
Woodwind    piccolo, flute, …, clarinet, …, recorder, vuvuzela, shenai, …, bandoneon
Keyboard    piano, synthesizer, harpsichord
Percussion  timpani, drumset, drums, cymbals, triangle, tambourine, xylophone, marimba, vibraphone, bells
Voice       soprano, alto, tenor, bass
Unknown     left blank, added as necessary

Table 6: Instrument Categories (Complete)
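As a small illustration of how the categorization in Table 6 can be applied during feature extraction, the sketch below maps an instrument name read from a MusicXML part to one of the seven categories, falling back to Unknown for anything unlisted. The lookup structure and substring matching rule are a simplification of our own, and the category lists are abbreviated.

```python
# Simplified lookup from an instrument name (as read from a MusicXML part)
# to one of the seven instrument categories of Table 6. Lists abbreviated;
# anything not matched falls into the "Unknown" category, as described above.
INSTRUMENT_CATEGORIES = {
    "Brass": ["trumpet", "tuba", "flugelhorn", "bugle"],
    "Strings": ["guitar", "bass", "violin", "viola", "cello",
                "banjo", "mandolin", "ukulele", "harp"],
    "Woodwind": ["piccolo", "flute", "clarinet", "recorder",
                 "vuvuzela", "shenai", "bandoneon"],
    "Keyboard": ["piano", "synthesizer", "harpsichord"],
    "Percussion": ["timpani", "drumset", "drums", "cymbals", "triangle",
                   "tambourine", "xylophone", "marimba", "vibraphone", "bells"],
    "Voice": ["soprano", "alto", "tenor", "bass"],
}

def instrument_category(name):
    """Return the category for an instrument name, or "Unknown" if unlisted."""
    lowered = name.lower()
    for category, instruments in INSTRUMENT_CATEGORIES.items():
        if any(instr in lowered for instr in instruments):
            return category
    return "Unknown"

assert instrument_category("Grand Piano") == "Keyboard"
assert instrument_category("Theremin") == "Unknown"
```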