
Classification of Musical Timbre Using Bayesian Networks

Patrick J. Donnelly*† and John W. Sheppard†
*School of Music, Montana State University, P.O. Box 173420, Bozeman, Montana 59715, USA
†Department of Computer Science, Montana State University, P.O. Box 173880, Bozeman, Montana 59715, USA
{patrick.donnelly2, john.sheppard}@cs.montana.edu

Computer Music Journal, 37:4, pp. 70–86, Winter 2014. doi:10.1162/COMJ_a_00210. © 2014 Massachusetts Institute of Technology.

Abstract: In this article, we explore the use of Bayesian networks for identifying the timbre of musical instruments. Peak spectral amplitude in ten frequency windows is extracted for each of 20 time windows to be used as features. Over a large data set of 24,000 audio examples covering the full musical range of 24 different common orchestral instruments, four different Bayesian network structures, including naive Bayes, are examined and compared with two support vector machines and a k-nearest neighbor classifier. Classification accuracy is examined by instrument, instrument family, and data set size. Bayesian networks with conditional dependencies in the time and frequency dimensions achieved 98 percent accuracy in the instrument classification task and 97 percent accuracy in the instrument family identification task. These results demonstrate a significant improvement over the previous approaches in the literature on this data set. Additionally, we tested our Bayesian approach on the widely used Iowa musical instrument data set, with similar results.

The identification of musical instruments in audio recordings is a frequently explored, yet unsolved, machine learning problem. Despite a number of experiments in the literature over the years, no single feature-extraction scheme or learning approach has emerged as a definitive solution to this classification problem.

The ability of a computer to learn to identify musical instruments is an important problem within the field of music information retrieval, with high commercial value. For instance, companies could automatically index their music libraries based on the musical instruments present in a recording, allowing search and retrieval by specific musical instrument. Timbre identification is also important to the tasks of musical genre categorization, automatic score creation, and track separation.

This work investigates classification of single, monophonic musical instruments using several different Bayesian network structures and a feature-extraction scheme based on a psychoacoustic definition of timbre. The results of this seminal use of graphical models in the task of musical instrument classification are compared with the baseline algorithms of support vector machines and a k-nearest neighbor classifier.

Timbre

When a musical instrument plays a note, we perceive a musical pitch, the instrument playing that note, and other aspects, like loudness. Timbre, or tone color, is the psychoacoustic property of sound that allows the human brain to readily distinguish between two instances of the same note, each played on a different instrument. The primary musical pitch we perceive is usually the first harmonic partial, known as the fundamental frequency. Pitched instruments are those whose partials are approximate integer multiples of the fundamental frequency. With the exception of unpitched percussion, orchestral instruments are pitched. The perception of timbre depends on the presence of harmonics (i.e., spectrum), as well as the fine timing (envelope) of each harmonic constituent (partial) of the musical signal (Donnelly and Limb 2009).

Algorithms

This work compares three types of algorithms on the machine learning task of timbre classification. This section briefly explains each of the algorithms we used.

Nearest Neighbor

The k-nearest neighbor (k-NN) is a common instance-based learning algorithm in which a previously unknown example is classified with the most common class amongst its k nearest neighbors, where k is a small positive integer. A neighbor is determined by the application of some distance metric D(·, ·), such as Euclidean distance, in a multidimensional feature space. Formally, let X be a space of points where each point x_i ∈ X is defined as x_i = {x_i1, ..., x_id} with class label c_i, and let X_tr ⊂ X be a set of training examples. For a query x_q ∈ X − X_tr, find r ∈ X_tr such that ∀x ∈ X_tr, x ≠ r, D(x_q, r) < D(x_q, x), and return the associated class label c_r (Cover and Hart 1967). In other words, each query example f_q in the test set is compared to a subset of examples from the training set using a distance metric, and the most common class label among its k nearest neighbors is assigned to f_q.

Support Vector Machine

The support vector machine (SVM) algorithm constructs a hyperplane in high-dimensional space that represents the largest margin separating two classes of data. To support multiclass problems, the SVM is often implemented as a series of "one-versus-all" binary classifiers.

The SVM is a discriminant-based method for classification or regression, following the approach of Vapnik (1999). It is defined as the hyperplane w · φ(f) − b = 0 that solves the following quadratic programming problem:

    minimize  (1/2) ||w||^2 + C Σ_i ξ_i                         (1)

    subject to:  y_i (w · φ(f_i) − b) ≥ 1 − ξ_i,  ξ_i ≥ 0       (2)

where f_i is a vector of features, w is the discriminant vector, C is a regularizing coefficient, ξ_i is a slack variable, b is the bias offset, y_i is the class label such that y_i ∈ {−1, +1}, and the kernel function K(f_i, f_j) = φ(f_i) · φ(f_j) is the inner product of the basis function φ.

When K(f_i, f_j) = f_i · f_j (i.e., φ is the identity), the SVM is a linear classifier. When the kernel is a non-linear function, such as a polynomial (Equation 3), the features are projected into a higher-order space. This allows the algorithm to fit the maximum-margin hyperplane in the transformed feature space, which is no longer linear in the original space (Boser, Guyon, and Vapnik 1992):

    K(f_i, f_j) = (f_i · f_j)^δ                                 (3)
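The k-NN decision rule described in the Nearest Neighbor section can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation; the two-dimensional feature vectors and instrument labels below are toy stand-ins for the spectral features used in the study, and k = 3 is an arbitrary choice.

```python
from collections import Counter
import math

def euclidean(a, b):
    # D(x, y): Euclidean distance in a d-dimensional feature space
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(query, training, k=3):
    """Return the most common class label among the k nearest
    training examples; `training` is a list of (features, label)."""
    neighbors = sorted(training, key=lambda ex: euclidean(query, ex[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy 2-D feature vectors standing in for spectral features
train = [((0.0, 0.1), "flute"), ((0.1, 0.0), "flute"),
         ((0.9, 1.0), "cello"), ((1.0, 0.9), "cello"), ((0.8, 0.8), "cello")]
print(knn_classify((0.05, 0.05), train))  # flute
print(knn_classify((0.9, 0.85), train))   # cello
```

Note that with k = 3 the second query is outvoted 3-to-0 by the cello examples even though a lone flute example may be comparably close in other data sets, which is why k is kept small and odd in practice.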
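The kernel trick behind Equation 3 can be verified numerically: for δ = 2 and two-dimensional inputs, the polynomial kernel equals an ordinary inner product after the explicit basis expansion φ(x) = (x1², √2·x1·x2, x2²). This is a small sketch of that identity, not code from the article; the example vectors are arbitrary.

```python
import math

def poly_kernel(fi, fj, delta=2):
    # Equation 3: K(f_i, f_j) = (f_i . f_j)^delta
    return sum(a * b for a, b in zip(fi, fj)) ** delta

def phi(x):
    # Explicit basis expansion for delta = 2 in two dimensions
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

fi, fj = (1.0, 2.0), (3.0, 0.5)
lhs = poly_kernel(fi, fj)                           # kernel in input space
rhs = sum(a * b for a, b in zip(phi(fi), phi(fj)))  # inner product in expanded space
print(lhs, rhs)  # both 16.0
```

The point of the kernel is that the SVM never needs to compute φ explicitly; the inner product in the higher-order space is obtained directly in the original feature space.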
Bayesian Networks

Bayesian networks are probabilistic graphical models that are composed of random variables, represented as nodes, and their conditional dependencies, represented as directed edges. The joint probability of the variables represented in the directed, acyclic graph can be calculated as the product of the individual probabilities of each variable, conditioned on the node's parent variables. The Bayesian classifier without latent variables is defined as:

    classify(f) = argmax_{c ∈ C} P(c) Π_{f ∈ f} P(f | parent(f))    (4)

where P(c) is the prior probability of class c and P(f | parent(f)) is the conditional probability of feature f given the values of the variable's parents. The classifier finds the class label that has the highest probability of explaining the values of the feature vector (Friedman, Geiger, and Goldszmidt 1997).

Previous Work

Beginning with the initial investigations of psychoacoustician John Grey (1977), the task of musical instrument identification has relied on clustering techniques. Fujinaga and MacMillan (2000) created a k-NN system that achieved 68 percent instrument classification on a large database of 23 different recorded instruments. Kaminskyj and Czaszejko (2005) used k-NN to achieve 93 percent instrument classification and 97 percent instrument family recognition on a set of 19 instruments.

In the 2000s, investigators began to explore other techniques. In a seminal study using SVM, Marques and Moreno (1999) classified 200 msec of recorded audio for eight musical instruments, using 16 Mel-frequency cepstral coefficients as features. The authors achieved 70 percent accuracy using a "one-versus-all" multi-class SVM with a polynomial kernel, which outperformed the 63 percent accuracy … the efficacy of the SVM on the family identification task for a data set that included non-Western instruments. Liu and Xie (2010) achieved 87 percent accuracy on a set of eight instrument families covering both Western and Chinese instruments.

Table 1. Comparison of Three Approaches to Instrument Identification

    Instruments    SVM     k-NN    QDA
    17             80.2    73.5    77.2
    20             78.5    74.5    75.0
    27             69.7    65.7    68.5
    Family         77.6    76.2    80.8

Agostini, Longari, and Pollastri (2003) compared a support vector machine (SVM), a k-nearest neighbor (k-NN) classifier, and quadratic discriminant analysis (QDA) in the task of identification of musical instruments. The authors compared three different sets of instruments of varying size as well as identification of the instrumental family (strings, woodwinds, or brass); percent accuracy is listed. Boldface values indicate the approach with the highest accuracy for each experiment.

Although k-NN and SVM remain the most commonly used systems for timbre classification, a few other approaches have been utilized. Kostek (2004) used a multilayer feedforward neural network to identify twelve musical instruments playing a wide variety of articulations, using a combination of MPEG-7 and wavelet-based features. She achieved 71 percent accuracy, ranging from 55 percent correct identification of the English horn to 99 percent correct identification of the piano. Like many other studies, Kostek noted that the most common misclassification occurred between instruments within the same family and that performance deteriorated as the number of musical instruments increased. Another study (Wieczorkowska 1999) used a binary decision tree, a variation of the C4.5 algorithm (Quinlan 1993), to classify 18 instruments using 62 features, yielding 68 percent classification accuracy. On a limited set of 6 instruments, Benetos, Kotti, and Kotropoulos (2006) achieved 95 percent accuracy using a non-negative matrix factorization classifier with MPEG-7 spectral features.

Several recent studies have explored the utility of temporal information in instrument classification tasks.
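When each feature's only parent is the class node, the classifier of Equation 4 reduces to naive Bayes. The following sketch estimates P(c) and P(f | c) by counting over discrete features and applies Equation 4; it is an illustration under simplifying assumptions (hypothetical register/brightness features rather than the paper's spectral features, and no smoothing of zero counts).

```python
from collections import defaultdict

def train_nb(examples):
    """Estimate P(c) and P(feature value | c) from (features, label) pairs."""
    prior = defaultdict(float)
    cond = defaultdict(float)
    for feats, label in examples:
        prior[label] += 1
        for idx, val in enumerate(feats):
            cond[(label, idx, val)] += 1
    for key in list(cond):
        cond[key] /= prior[key[0]]       # normalize counts into P(f | c)
    total = sum(prior.values())
    for label in prior:
        prior[label] /= total            # normalize counts into P(c)
    return prior, cond

def classify(feats, prior, cond):
    # Equation 4 with the class as each feature's only parent (naive Bayes)
    def score(c):
        p = prior[c]
        for idx, val in enumerate(feats):
            p *= cond[(c, idx, val)]
        return p
    return max(prior, key=score)

# Toy discrete features: (register, brightness) -> instrument
data = [(("low", "bright"), "violin"), (("low", "dark"), "viola"),
        (("low", "bright"), "violin"), (("high", "bright"), "violin"),
        (("high", "dark"), "viola")]
prior, cond = train_nb(data)
print(classify(("low", "bright"), prior, cond))  # violin
```

The Bayesian network structures compared in the article add conditional dependencies between feature nodes in the time and frequency dimensions; in that case each factor becomes P(f | parent(f)) for the appropriate parent rather than the class alone.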