Music Genre Classification Using Similarity Functions
12th International Society for Music Information Retrieval Conference (ISMIR 2011)

MUSIC GENRE CLASSIFICATION USING SIMILARITY FUNCTIONS

Yoko Anan, Kohei Hatano, Hideo Bannai and Masayuki Takeda
Department of Informatics, Kyushu University
{yoko.anan, hatano, bannai, takeda}@inf.kyushu-u.ac.jp

ABSTRACT

We consider music classification problems. A typical machine learning approach is to use support vector machines with some kernels. This approach, however, does not seem to be successful enough for classifying music data in our experiments. In this paper, we follow an alternative approach: we employ a (dis)similarity-based learning framework proposed by Wang et al. This (dis)similarity-based approach has a theoretical guarantee that one can obtain accurate classifiers using (dis)similarity measures under a natural assumption. We demonstrate the effectiveness of our approach in computational experiments using Japanese MIDI data.

1. INTRODUCTION

Music classification is an important problem in information retrieval from music data. A great deal of research has tackled the problem (see, e.g., [1,3,4,10,11,14,18]), as highly accurate music classifiers are useful for music search and feature extraction.

One typical approach to classifying music is to represent each piece of music as a feature vector, which is then classified by standard machine learning methods. On the other hand, finding good features for music classification is a non-trivial task; examples include the performance worm [15], the performance alphabet [16], and other approaches [1,10,11,18].

Another popular approach in machine learning is to use support vector machines (SVMs) with kernels [7–9,12,19]. One way to improve the accuracy of music classification is to design a good kernel for music data. This approach, however, does not seem to have been very successful so far. As we will show later, well-known string kernels such as n-gram kernels [12] and mismatch kernels [8] for texts do not obtain satisfactory results for music classification in our experiments. Further, to design a kernel, the function needs to be positive semidefinite, which is a limitation when we try to exploit the structure of music to improve classification accuracy.

In this paper, we follow an alternative approach. We employ a (dis)similarity-based learning framework proposed by Wang et al. [20]. This (dis)similarity-based approach has a theoretical guarantee that one can obtain accurate classifiers using (dis)similarity measures under a natural assumption. In addition, an advantage of this approach is that it can use any (dis)similarity measure, which does not have to be positive semidefinite, and any kind of data.

Further, we combine this (dis)similarity-based learning approach with the 1-norm soft margin optimization formulation [5,22]. An advantage of the formulation is that it is useful for feature selection because of the sparse nature of the underlying solution. In other words, the formulation helps us to find "relevant" instances (i.e., pieces of music data) for classifying music. Such relevant instances might contain representative features of the class, and might therefore be useful for extracting good features.

For simplicity, throughout the paper we deal only with classification problems over symbolic music data such as MIDI files. Thus we do not consider audio signal data, and we assume (dis)similarity functions over texts. Note that our framework using (dis)similarity functions does not depend on the data format: we can deal with audio signal data as well if we employ (dis)similarity functions over signals.

We demonstrate the effectiveness of our approach in computational experiments using Japanese music data. Our approach, combined with non-positive-semidefinite (dis)similarity measures such as edit distance, shows better performance than SVMs with string kernels.
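Since edit distance is used later as a concrete non-positive-semidefinite dissimilarity over symbolic sequences, a minimal sketch of how such a d(x, x′) could be computed is shown below. This is only an illustration under the assumption that a MIDI file has already been reduced to a string of pitch symbols; it is not the paper's actual preprocessing.

```python
def edit_distance(s, t):
    """Levenshtein distance between two symbol sequences (e.g. pitch strings)."""
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return prev[-1]

# Toy usage on two hypothetical pitch sequences: one substitution apart.
print(edit_distance("CDEFG", "CDEEG"))  # -> 1
```

Any such function mapping pairs of instances to non-negative reals can play the role of d in the framework reviewed next, whether or not a corresponding positive semidefinite kernel exists.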
2. LEARNING FRAMEWORK USING A DISSIMILARITY FUNCTION

In this section, we review a learning framework using a dissimilarity function, proposed by Wang et al. [20]. Let X be the instance space. We assume that a dissimilarity function d(x, x′) is a function from X × X to R⁺. A pair (x, y) of an instance x ∈ X and a label y ∈ {−1, 1} is called an example. For instance, X might be some set of MIDI data, and then an example is a pair of a MIDI file and a positive or negative label. The learner is given a set S of examples, where each example is drawn randomly and independently from an unknown distribution P over X × {−1, +1}. Then the learner is supposed to output a hypothesis h(x): X → {−1, 1}. The goal of the learner is to minimize the error of the hypothesis h w.r.t. the distribution P, i.e., the probability that h misclassifies the label of a randomly drawn example (x, y) according to P, Pr_{(x,y)∼P}(h(x) ≠ y). In particular, we assume that a hypothesis is constructed using a dissimilarity function d. Also, we will use the notation sgn[a] = 1 if a > 0 and −1 otherwise.

Then we give the definition of a "good" dissimilarity function.

Definition 1 (Strong (ε, η)-goodness, Wang et al. [20])
A dissimilarity function d(x, x′) is said to be strongly (ε, η)-good if at least a 1 − ε probability mass of examples z = (x, y) satisfies

  Pr_{z′,z″∼P}( d(x, x′) < d(x, x″) | y′ = y, y″ = −y ) ≥ 1/2 + η/2,   (1)

where the probability is over random examples z′ = (x′, y′) and z″ = (x″, y″).
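As a rough illustration (not part of the original paper), the condition in Definition 1 can be probed empirically on a labeled sample: for each example, estimate the probability in (1) by sampling same-label and opposite-label instances, and count how many examples fall below the 1/2 + η/2 threshold. The sketch below assumes d is any dissimilarity function such as the edit distance above, and that each label occurs at least twice in the sample.

```python
import random

def goodness_profile(examples, d, eta, n_pairs=200, seed=0):
    """Plug-in estimate of epsilon in Definition 1 for a given eta.

    examples : list of (x, y) pairs with y in {-1, +1}
    d        : dissimilarity function d(x, x')
    Returns the fraction of examples whose estimated probability
    Pr[d(x, x') < d(x, x'') | y' = y, y'' = -y] is below 1/2 + eta/2.
    """
    rng = random.Random(seed)
    failures = 0
    for i, (x, y) in enumerate(examples):
        same = [xj for j, (xj, yj) in enumerate(examples) if yj == y and j != i]
        diff = [xj for xj, yj in examples if yj != y]
        # Monte Carlo estimate of the conditional probability in (1).
        hits = sum(d(x, rng.choice(same)) < d(x, rng.choice(diff))
                   for _ in range(n_pairs))
        if hits / n_pairs < 0.5 + eta / 2:
            failures += 1
    return failures / len(examples)
```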
Roughly speaking, this definition says that, for most random examples z = (x, y) and random positive and negative examples, the instance x is likely to be closer to the instance with the same label. Then, under the natural assumption that the given dissimilarity function d is (ε, η)-good, we can construct an accurate classifier based on d, as shown in the following theorem.

Theorem 1 (Wang et al. [20]) If d is a strongly (ε, η)-good dissimilarity function, then with probability at least 1 − δ over the choice of m = (4/η²) ln(1/δ) pairs of examples (z′_i, z″_i) with labels y′_i = 1, y″_i = −1, i = 1, 2, ..., m, the following classifier F(x) = sgn[f(x)], where

  f(x) = (1/m) Σ_{i=1}^m sgn[ d(x, x″_i) − d(x, x′_i) ],

has an error rate of no more than ε + δ. That is,

  Pr_{z∼P}( F(x) ≠ y ) = Pr_{z∼P}( y f(x) ≤ 0 ) ≤ ε + δ.

This theorem says that an unweighted voting classifier consisting of sufficiently many randomly drawn examples is accurate enough with high probability. We should note that the existence of an (ε, η)-good dissimilarity function might be too restrictive in some cases. For such cases, Wang et al. also proposed more relaxed definitions of good dissimilarity functions. Under such relaxed definitions, it can be shown that there exists a weighted combination

  f(x) = Σ_i w_i h_i(x),

where each w_i ≥ 0, Σ_i w_i = 1, h_i(x) = sgn[ d(x, x″_i) − d(x, x′_i) ], and x′_i and x″_i are positive and negative instances, respectively, such that sgn[f(x)] is accurate enough (see [20] for the details).

3. OUR FORMULATION

In this section, we consider how to find an accurate weighted combination of base classifiers, each consisting of a pair of a positive and a negative instance. To do so, we employ 1-norm soft margin optimization, which is a standard formulation of classification problems in machine learning (see, e.g., [5,21]). Simply put, the problem is to find a linear combination of base classifiers (or a hyperplane over the space defined by the base classifiers) which has a large margin with respect to the examples, where the margin of a linear combination w with respect to an example z is a distance between w and z. In fact, the large margin generalization theory (e.g., [17]) guarantees that a weighted combination of base classifiers is likely to have higher accuracy when it has a larger margin w.r.t. the examples. A further advantage of 1-norm soft margin optimization is that the resulting linear combination of base classifiers is likely to be sparse, since we regularize the 1-norm of the weight vector. This property is useful for feature selection tasks.

3.1 The 1-norm soft margin formulation

Suppose that we are given a set S = {(x_1, y_1), ..., (x_m, y_m)}, where each (x_i, y_i) is an example in X × {−1, +1}. Here, following the dissimilarity-based approach of the previous section, we assume the set of hypotheses H = { h(x) = sgn[ d(x_j, x) − d(x_i, x) ] | x_i and x_j are positive and negative instances in S, respectively }. For simplicity of notation, we denote H as H = {h_1, ..., h_n}, where n is the number of pairs of positive and negative examples in S. Then, the 1-norm soft margin optimization problem is formulated as follows (e.g., [5,21]):

  max_{ρ∈R, b∈R, w∈R^n, ξ∈R^m}   ρ − (1/ν) Σ_{i=1}^m ξ_i    (2)

  sub. to   y_i( Σ_j w_j h_j(x_i) + b ) ≥ ρ − ξ_i   (i = 1, ..., m),
            w ≥ 0,   Σ_{j=1}^n w_j = 1,
            ξ ≥ 0.

Here the term y_i( Σ_j w_j h_j(x_i) + b ) represents the margin of the hyperplane (w, b) w.r.t. the example (x_i, y_i).
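Problem (2) is a linear program in the variables (ρ, b, w, ξ), so any LP solver can be used. The following sketch builds base hypotheses h_j(x) = sgn[d(x_neg, x) − d(x_pos, x)] from all positive–negative pairs in S and solves (2) with scipy.optimize.linprog; the function name, data encoding, and solver choice are our own assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import linprog

def one_norm_soft_margin(instances, labels, d, nu):
    """Solve problem (2) as an LP. Variables are ordered [w (n), b, rho, xi (m)].

    instances : list of instances (e.g. pitch strings), labels : +/-1 per instance,
    d : dissimilarity function, nu : soft-margin parameter.
    """
    y = np.asarray(labels, dtype=float)
    m = len(instances)
    pos = [x for x, yi in zip(instances, y) if yi > 0]
    neg = [x for x, yi in zip(instances, y) if yi < 0]

    # One base hypothesis per (positive, negative) pair: h(x) = sgn[d(q, x) - d(p, x)].
    pairs = [(p, q) for p in pos for q in neg]
    n = len(pairs)
    H = np.array([[1.0 if d(q, x) - d(p, x) > 0 else -1.0 for (p, q) in pairs]
                  for x in instances])              # m x n matrix of h_j(x_i)

    # Objective: maximize rho - (1/nu) * sum(xi)  ==  minimize -rho + (1/nu) * sum(xi).
    c = np.concatenate([np.zeros(n), [0.0, -1.0], np.full(m, 1.0 / nu)])

    # Margin constraints: -y_i (H_i w + b) + rho - xi_i <= 0.
    A_ub = np.hstack([-(y[:, None] * H), -y[:, None], np.ones((m, 1)), -np.eye(m)])
    b_ub = np.zeros(m)

    # Simplex constraint on the weights: sum_j w_j = 1.
    A_eq = np.concatenate([np.ones(n), [0.0, 0.0], np.zeros(m)])[None, :]
    b_eq = np.array([1.0])

    # w >= 0 and xi >= 0; b and rho are free.
    bounds = [(0, None)] * n + [(None, None), (None, None)] + [(0, None)] * m
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    w, b, rho = res.x[:n], res.x[n], res.x[n + 1]
    return w, b, rho, pairs
```

A new instance x would then be classified as sgn[ Σ_j w_j h_j(x) + b ], and the pairs receiving nonzero weight w_j correspond to the "relevant" instances mentioned in the introduction.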