Classifying Musical Scores By Instrument

Bryce Cai
Stanford University
Stanford, CA 94305
[email protected]

Michael Svolos
Stanford University
Stanford, CA 94305
[email protected]

Abstract

A given line of music can be played with greater or lesser ease by different musical instruments, owing to the particulars of their construction and design; ideally, these instrumental idioms are taken into consideration when writing a part for an instrument. The question posed by this paper is thus whether these idioms can be used to classify instrumental parts by instrument. We trained a multiclass linear classifier over 860 instrumental parts from Classical-era musical scores. Our one-vs.-one algorithm achieved 23% accuracy, exceeding the range baseline by 9 percentage points.

1 Task definition

Instrumental music from the Classical era (1750-1825) is written with an individual line of music for each instrument. The lines are played at the same time to create the sound of the complete piece, whether it's a Bach piece for solo flute, a Mozart concerto for clarinet and orchestra, or a Haydn string quartet. Every wind or string instrument has its own range of notes that it is capable of playing; for instance, the flute cannot play below the B below middle C. Each instrument also has a different, much more nebulous set of limitations on what it can easily play. These limitations are a consequence of the way each instrument was designed physically and acoustically. An example is a quick passage on the trombone that requires many long jumps of the slide; such a passage is technically possible, but it would require great skill at the instrument. Similarly, passages with large leaps in pitch might be difficult on wind instruments, since they require quick changes in mouth position, but are easy on the violin, since the player can simply place a finger on the string for the new note.

This idea, that different instruments can play different parts with varying ease, is known as instrumental idiom. It's important to note that instrumental idiom is a quality of instruments, not of parts. A composer could write any part for any instrument, given players capable of playing it. The best composers, though, are able to write beautiful and expressive music that is also well suited to the instruments they're writing for. This makes it easier for performers to play their music well and is seen as part of the craft of composition.

This brings us to the fundamental question of this paper: are the differences in instrumental parts and idiom large enough to distinguish which instrument a part is for? For this task, we measure success with accuracy, i.e., how many parts were correctly labeled out of the total.

2 Infrastructure

One of the challenges of this project was collecting data. There were a number of options available to us: would we use audio files of pieces? Images of the score? Score data in some musical score file format? Each of these options had its own advantages and drawbacks. For instance, there are many more recordings of pieces or scans of scores than there are digitized scores, but analyzing audio or images is much more difficult and outside the scope of this project. For this reason, we decided to work with digitized scores. We downloaded our scores manually from the Mutopia Project, a collection of public-domain, digitized scores of pieces of classical music.

There are a number of file formats for sheet music, including MusicXML, MIDI, and file formats unique to certain score editors such as MuseScore or Sibelius. MIDI is a general standard recognized by most editors, so the downloaded scores were digitized in this format and carried the .mid suffix. The process of translating scores into this format is time-consuming, so only about 2,000 scores were available for download on the website.

MIDI files are organized into tracks: one master track that contains general information such as the title of the song and time resolution data, and one or more other tracks for each line of music. To process our data, we used a Python toolkit called python-midi to split each file into tracks so that each track represented one instrumental part to classify. Here, each track is a list of MIDI events. These include track-wide information such as the instrument name (which is how we verified our classifier's guesses) as well as events for the pitch and timing of each individual note.

To filter our data, we only evaluated instruments that are mostly monophonic, i.e., that play only one note at a time (the trumpet qualifies; a polyphonic instrument such as the piano does not). We also excluded instruments with fewer than 40 parts in the dataset, and capped all instruments at 100 parts. Finally, we cut off each part at 1000 MIDI events. This left us with ten instruments and 860 total parts.
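As a concrete illustration of this pipeline, the following minimal sketch splits a MIDI file into per-instrument parts using the python-midi package. The exact way a track names its instrument (via track-name or instrument-name meta-events) is an assumption, and the helper is ours rather than our exact preprocessing code.

import midi  # https://github.com/vishnubob/python-midi

MAX_EVENTS = 1000  # each part is truncated at 1000 MIDI events

def split_into_parts(path):
    """Return a list of (instrument_name, events) pairs, one per track."""
    pattern = midi.read_midifile(path)  # a Pattern is a list of Tracks
    parts = []
    for track in pattern:
        name = None
        for event in track:
            # Track/instrument name meta-events carry the ground-truth label.
            if isinstance(event, (midi.TrackNameEvent, midi.InstrumentNameEvent)):
                name = event.text
                break
        # Keep only tracks that actually contain notes (skips the master track).
        if name is not None and any(isinstance(e, midi.NoteOnEvent) for e in track):
            parts.append((name, list(track)[:MAX_EVENTS]))
    return parts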

Figure 1: Number of parts per instrument.

3 Approach

3.1 Baseline and Oracle

A simple baseline for this classification problem is choosing an instrument at random for each assignment. Since there are 10 instruments, this baseline gives 10% accuracy.

A more advanced baseline restricts this random choice from the set of all instruments to the set of instruments that are able to play the piece in the first place. This was done by examining the full range of a part, noting its highest and lowest notes. If that range fell outside the playable range of an instrument, that instrument was excluded; the assignment was then chosen randomly from the instruments remaining (i.e., the set of instruments for which the entire piece lies within the playable range). This range baseline gave an improved accuracy of 14%.

Oracles for this problem are either poor or very difficult to find. One rudimentary oracle is reading the true assignment beforehand, which always gives 100% accuracy, but this oracle is not very useful, as all it shows is that the maximum accuracy of a classifier is 100%. A more useful oracle involves human classification, using experts trained in known instrumental idioms to classify a part. However, such subjects are hard to find; assigning a piece to one of 10 instruments requires expert knowledge of playing each of those instruments. As another rudimentary oracle, our own attempts at assigning instruments to pieces resulted in an accuracy of approximately 35%; for reference, this oracle was determined with non-expert musician knowledge. A good oracle involving more expert knowledge would likely have higher accuracy. We would also expect a well-trained classifier to be more accurate than our oracle, since most of our own knowledge was range-based.
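The range baseline can be stated compactly in code. The sketch below assumes each instrument's playable range is stored as a (lowest, highest) MIDI pitch pair; the RANGES entries are illustrative stand-ins rather than the exact table we used.

import random

RANGES = {"flute": (59, 96), "violin": (55, 103)}  # hypothetical entries

def range_baseline(part_pitches, ranges=RANGES):
    lo, hi = min(part_pitches), max(part_pitches)
    # Keep only instruments whose playable range covers the whole part.
    candidates = [inst for inst, (low, high) in ranges.items()
                  if low <= lo and hi <= high]
    # Fall back to all instruments if the part fits none of the ranges.
    return random.choice(candidates or list(ranges))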

3.2 Multiclass Classification

Since our problem is one of multiclass classification, we implemented and compared several methods in an effort to find the most accurate one.

3.2.1 One-vs.-one

The main method implemented was the one-vs.-one multiclass classification approach, which reduces the problem of classifying a part into one of $n$ categories (for our problem, $n = 10$) to $\binom{n}{2} = n(n-1)/2$ binary classification problems, one between each possible pair of instruments. Instead of training one $n$-ary classifier, we train $\binom{n}{2}$ binary classifiers.

One implementation of one-vs.-one classification is as follows. During training, each data point (in our case, a solo part) is passed to the $n - 1$ binary classifiers that compare that data point's category (in our case, that part's instrument) with any other category. In our implementation, these binary classifiers are simply linear classifiers. After training, we have $\binom{n}{2} = n(n-1)/2$ binary classifiers comparing pairs of categories, each trained on the subset of the data labeled with either of the two categories it compares.

Using this classifier on a new data point involves passing the point through each of the $n(n-1)/2$ binary classifiers and counting the number of times each category "wins" a comparison (i.e., the number of classifiers that assign the point to that category). The category with the most wins is the one the point is assigned to. Ties are broken arbitrarily (in our case, by the first instrument in a fixed order passed into the model).
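To make the voting scheme concrete, here is a minimal sketch of one-vs.-one prediction. It assumes the trained binary classifiers are supplied as callables that take a feature vector and return whichever of their two instruments they prefer; the function and argument names are ours.

from itertools import combinations

def ovo_predict(x, instruments, binary_classifiers):
    """binary_classifiers maps each (i, j) pair to a callable returning i or j."""
    wins = {inst: 0 for inst in instruments}
    for i, j in combinations(instruments, 2):
        wins[binary_classifiers[(i, j)](x)] += 1
    # max() breaks ties in favor of the first instrument in the fixed order.
    return max(instruments, key=lambda inst: wins[inst])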

3.2.2 One-vs.-rest

Another approach to linear multiclass classification that we attempted was one-vs.-rest, where $n$ binary classifiers are trained, one per category, on the whole set of training data. Each data point is processed by the classifier for each instrument $i$ as a training point for "$i$" or "not $i$." In this way, all $n$ classifiers process every training example, and the problem of multiclass classification is again reduced to a series of binary classification problems. The final classification in one-vs.-rest depends on a confidence score rather than the raw binary decisions, since a test point can otherwise be classified into multiple categories or into none at all. For the comparison to be meaningful, the confidence scores must be kept on the same scale across the different binary classifiers.
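A minimal sketch of one-vs.-rest prediction follows, assuming each per-instrument classifier exposes its learned weights and that features are sparse dictionaries; using a plain dot product as the confidence score is one simple choice among several.

def ovr_predict(x, weight_vectors):
    """weight_vectors maps each instrument to its binary classifier's weights."""
    def confidence(inst):
        w = weight_vectors[inst]
        return sum(w.get(f, 0.0) * v for f, v in x.items())  # sparse dot product
    # The instrument whose "i vs. not-i" classifier is most confident wins.
    return max(weight_vectors, key=confidence)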

3.2.3 Implementation

Again, our model took as input musical parts formatted as MIDI files. These files were processed with the python-midi package, which can be found at https://github.com/vishnubob/python-midi. As mentioned, MIDI files consist only of the instrumental parts themselves, with no audio data. (In other words, we are concerned only with the written part and nothing recording-related, which would likely be a much more difficult problem.)

Our implementation of one-vs.-one multiclass classification simplified the $\binom{n}{2} = n(n-1)/2$ binary classifiers into $\binom{n}{2}$ comparisons over a set of $n$ feature vectors, one for each instrument. Each binary classifier is then a function of the two feature vectors $v_i$ and $v_j$ of the instruments $i$ and $j$ it compares and the feature vector $w$ of the input solo: it assigns the solo to instrument $i$ if $w \cdot v_i > w \cdot v_j$ and to $j$ if $w \cdot v_j > w \cdot v_i$, with ties broken arbitrarily. Our classifiers were trained using stochastic gradient descent with hinge loss and regularization on the weight vectors. Our step size was 0.1, our regularization parameter was 0.0001, and we trained for 300 epochs.
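The following sketch shows a training loop consistent with this description: one learned vector per instrument, updated by SGD on a pairwise hinge loss with L2 shrinkage. The hyperparameters come from the text above; the exact form of the update is our assumption, chosen to match the decision rule $w \cdot v_i > w \cdot v_j$.

from collections import defaultdict

ETA, LAMBDA, EPOCHS = 0.1, 0.0001, 300  # step size, regularization, epochs

def dot(u, v):
    return sum(u.get(f, 0.0) * x for f, x in v.items())

def train(data, instruments):
    """data is a list of (features, instrument) pairs; features are sparse dicts."""
    vecs = {inst: defaultdict(float) for inst in instruments}
    for _ in range(EPOCHS):
        for w, true_inst in data:
            for other in instruments:
                if other == true_inst:
                    continue
                # Hinge loss on the margin between the true and other instrument.
                if dot(vecs[true_inst], w) - dot(vecs[other], w) < 1:
                    for f, x in w.items():
                        vecs[true_inst][f] += ETA * x
                        vecs[other][f] -= ETA * x
        # L2 shrinkage applied once per epoch to every instrument vector.
        for v in vecs.values():
            for f in v:
                v[f] *= 1 - ETA * LAMBDA
    return vecs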

3.2.4 Features

Our implementation incorporated a number of features, listed below; a sketch of the extraction step follows the list.

• Pitch frequency (i.e., the number of times each note pitch appears). Note that notes that are the "same" but separated by one or more octaves (e.g., middle C and high C) count as different pitches.
• Highest note pitch.
• Lowest note pitch.
• Average note pitch.
• The number of times each pair of successive pitches (m, n) occurs (i.e., the number of times pitch n follows pitch m in the piece).
• The number of times each trigram of successive pitches (ℓ, m, n) occurs.
• Average note length.
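Here is a minimal feature-extraction sketch covering the features above. It assumes each part has already been reduced to a list of (MIDI pitch, duration) pairs; the key names and the L2 normalization are our illustrative choices.

def extract_features(notes):
    """notes is a non-empty list of (midi_pitch, duration_in_ticks) pairs."""
    pitches = [p for p, _ in notes]
    feats = {}
    for p in pitches:  # pitch frequency
        feats[("pitch", p)] = feats.get(("pitch", p), 0) + 1
    for a, b in zip(pitches, pitches[1:]):  # successive-pitch bigrams
        feats[("bigram", a, b)] = feats.get(("bigram", a, b), 0) + 1
    for a, b, c in zip(pitches, pitches[1:], pitches[2:]):  # trigrams
        feats[("trigram", a, b, c)] = feats.get(("trigram", a, b, c), 0) + 1
    feats[("highest",)] = max(pitches)
    feats[("lowest",)] = min(pitches)
    feats[("avg_pitch",)] = sum(pitches) / len(pitches)
    feats[("avg_length",)] = sum(d for _, d in notes) / len(notes)
    # Normalize so parts of different lengths are comparable (our choice of norm).
    norm = sum(v * v for v in feats.values()) ** 0.5
    return {k: v / norm for k, v in feats.items()}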

The pitch-related features were implemented to detect patterns in the range and frequency of pitches. The note-length feature was implemented to capture specialization in speed between instruments. The feature vector for a given part is then a normalized sparse vector containing all of these features. We experimented with the weight of each category of features to maximize classification accuracy; discussion of this follows in the Error Analysis section. Our training set consisted of the first 70% of each instrument's subset of the database of solos (so it contained 70% of the solos for each instrument). Our test set consisted of the remaining 30%.

4 Literature Review

The subject of categorizing instruments and instrumental sound has long been of interest to musicians. Much work has been done at the audio level, where multiclass classification over audio input has shown success in determining the instrument playing an audio sample. AI techniques, such as machine learning with feature extraction [1] and neural networks [2], have been applied with significant success. Projects such as MIT's PixelPlayer have also been developed to extract the sound of a particular instrument from an audio input in which that instrument may be in the background. With regard to score-based analysis independent of any audio input, however, not much work has been done. Similar studies have involved classifying scores by musical genre or era (e.g., Baroque, Classical, etc.).

One such study, by Armentano et al., also approached a problem of multiclass classification of musical scores using MIDI input [3]. This study's goal was to build a classifier for one of nine musical genres, including jazz, soul, punk, and modern classic. Like our approach, the classifier was trained on modified MIDI input from which features were extracted, tested by classifying new test data into one of many categories, and updated with user feedback (i.e., accuracy). These features included factors calculated from the prevalence of notes, the prevalence of transitions from each note to its successor, and more complex aspects such as rhythm and sound strength (i.e., dynamics). Similar to our approach, classification used a voting scheme to assign a genre to a given musical file. Implementation on a dataset of 225 pieces, with a knowledge base implemented on MongoDB, yielded success at the root genre level as high as 70%-80%. While on a much simpler scale and addressing a different problem (that of instrument assignment), our project takes a complementary approach to a problem similar to that of [3].

5 Error Analysis

This project analyzed two major questions: what was the best approach for linear multiclass classification, and what features were most important in classification?

5.1 Comparing Approaches

The one-vs.-rest approach was found to have subpar accuracy; even with some semblance of normalization (e.g., normalizing the total weight of note frequency), the accuracy of the classifier was approximately 5%, lower than even the random-choice baseline. This was likely not entirely due to the method itself but also partly due to imperfect normalization; as more features were implemented, feature vector normalization became less structured, leading to one instrument (in many cases, violin or cello) being favored because it produced the greatest dot-product magnitude.

The one-vs.-one approach, on the other hand, was found to have an accuracy of 23.2%. While imperfect and well below the accuracy of even our rudimentary human oracle, the gap between the one-vs.-one method and both the one-vs.-rest method and the baselines is still significant: the accuracy of the one-vs.-one method was 9 percentage points greater than the range baseline and more than double the random baseline. The remaining inaccuracy can likely be explained by the simplicity of the features used; known instrumental idioms rely on longer sequences of more complex combinations of note features, such as a very quick sequence of very high notes that a violin would have a much easier time playing than a French horn. Such idioms could not be captured entirely by our simple feature extractor.

5.1.1 Variation With Instruments

Another question posed was: are certain instruments more easily classified? To analyze this question, we examined the accuracy of each individual binary classifier in the one-vs.-one classifier. Table 1 shows, for each instrument i, the fraction of parts correctly classified by the binary classifiers comparing instrument i with some other instrument j.

All of these accuracies lie between 49.5% and 69%. Since a completely random classifier would give an accuracy of 50%, many of them are not far removed from random. Nevertheless, most single-instrument classifications are significantly more accurate than random (even if that difference is only 5 percentage points). The classifier seems to be better at classifying lower non-string instruments, such as parts for bassoon and French horn; this could be because our choice of instruments to analyze, with only cello, bassoon, and French horn playing in a relatively low range, gave more possible choices of instrument (and more ambiguity) to higher-pitched parts. Naturally, the problem of classification becomes harder with more valid classifications, especially when two categories are not easily distinguished (e.g., two instruments with similar ranges, for which features such as note range are irrelevant to classification).

Table 1: Accuracy of individual binary classifiers in one-vs.-one.

Instrument     Accuracy
Cello          0.55117
Oboe           0.57151
Bassoon        0.68570
Flute          0.68544
French Horn    0.56389
Viola          0.57524
Trombone       0.62732
Violin         0.49570
Clarinet       0.53000
Trumpet        0.58230

These results could also depend greatly on the choice of input database. Given a generic, nondistinctive database of scores for a certain instrument, the classifier would have no distinguishing features of instrumental idiom to learn, leading to relatively poorer accuracy. Larger datasets would likely boost classifier accuracy.

5.2 Analyzing Features

As we tried different tunings of our algorithm, we found that removing some features and amplifying others improved total accuracy. Specifically, removing the bigram and trigram feature sets and amplifying the highest-note, lowest-note, and average-pitch features (by adding features for each of these values squared and cubed) increased accuracy. We believe this is because the sheer number of bigram and trigram features outweighs the more salient features such as range. This exhibits one of the limitations of the algorithm we chose: instrument ranges are hard cutoffs, similar to constraints in a constraint satisfaction problem, but in a linear classifier there is no way to tell the algorithm that some feature values completely exclude some classifications.
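As an illustration, this tweak can be expressed as a small post-processing step on the feature dictionaries; the key convention mirrors the extraction sketch in Section 3.2.4 and is ours, not the exact code we ran.

def amplify_range_features(feats):
    """Drop bigram/trigram features and add squared and cubed range features."""
    pruned = {k: v for k, v in feats.items()
              if k[0] not in ("bigram", "trigram")}
    for name in ("highest", "lowest", "avg_pitch"):
        v = pruned[(name,)]
        pruned[(name, "squared")] = v ** 2
        pruned[(name, "cubed")] = v ** 3
    return pruned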

6 Conclusion

In summary, this project endeavored to use multiclass classification methods to classify musical scores by instrument. Using a one-vs.-one approach, we achieved a substantive improvement over our baselines. A big takeaway from this project is the importance of large datasets: our project involved training many classifiers, each of which had only approximately 200 data points, which was likely detrimental to accuracy. If we were to move forward with this research, one area on which we would focus effort is finding or creating additional data. We would also consider using more complex models such as neural networks to classify our data, since they are more likely to capture the complex structure of each feature vector. We could also apply a clustering algorithm to see how our data is arranged in feature space.

References

[1] Simmermacher, C., Deng, J. D., & Cranefield, S. (2006). Feature Analysis and Classification of Classical Musical Instruments: An Empirical Study. Industrial Conference on Data Mining.
[2] Toghiani-Rizi, B., & Windmark, M. (2016). Musical Instrument Recognition Using Their Distinctive Characteristics in Artificial Neural Networks. Uppsala University.
[3] Armentano, G. M., De Noni, W. A., & Cardoso, H. F. (2016). Genre Classification of Symbolic Pieces of Music. Springer Science.
