A Distance Model for Rhythms

Jean-Fran¸cois Paiement [email protected] Yves Grandvalet [email protected] IDIAP Research Institute, Case Postale 592, CH-1920 Martigny, Switzerland Samy Bengio [email protected] Google, 1600 Amphitheatre Pkwy, Mountain View, CA 94043, USA Douglas Eck [email protected] Universit´ede Montr´eal, Department of Computer Science, CP 6128, Succ. Centre-Ville, Montr´eal, Qu´ebec H3C 3J7, Canada

Abstract nested periodic components. Such a hierarchy is im- Modeling long-term dependencies in time se- plied in western notation, where different levels ries has proved very difficult to achieve with are indicated by kinds of notes (whole notes, half notes, traditional machine learning methods. This quarter notes, etc.) and where bars establish measures problem occurs when considering music data. of an equal number of beats. Meter and rhythm pro- In this paper, we introduce a model for vide a framework for developing musical melody. For rhythms based on the distributions of dis- example, a long melody is often composed by repeating tances between subsequences. A specific im- with variation shorter sequences that fit into the met- plementation of the model when consider- rical hierarchy (e.g. sequences of 4, 8 or 16 measures). ing Hamming distances over a simple rhythm It is well know in that distance patterns representation is described. The proposed are more important than the actual choice of notes in model consistently outperforms a standard order to create coherent music (Handel, 1993). In this Hidden Markov Model in terms of conditional work, distance patterns refer to distances between sub- prediction accuracy on two different music sequences of equal length in particular positions. For databases. instance, measure 1 may be always similar to measure 5 in a particular musical genre. In fact, even random music can sound structured and melodic if it is built by 1. Introduction repeating random subsequences with slight variation. Many algorithms have been proposed for audio beat Reliable models for music would be useful in a broad tracking (Dixon, 2007; Scheirer, 1998). Probabilistic range of applications, from contextual music genera- models have also been proposed for tempo tracking tion to on-line music recommendation and retrieval. and inference of rhythmic structure in musical audio However, modeling music involves capturing long-term (Whiteley et al., 2007; Cemgil & Kappen, 2002). The dependencies in time series, which has proved very dif- goal of these models is to align rhythm events with ficult to achieve with traditional statistical methods. the metrical structure. However, simple Markovian as- Note that the problem of long-term dependencies is sumptions are used to model the transitions between not limited to music, nor to one particular probabilis- rhythms themselves. Hence, these models do not take tic model (Bengio et al., 1994). into account long-term dependencies. A few generative Music is characterized by strong hierarchical depen- models have already been proposed for music in gen- dencies determined in large part by meter, the sense eral (Pachet, 2003; Dubnov et al., 2003). While these of strong and weak beats that arises from the inter- models generate impressive musical results, we are not action among hierarchical levels of sequences having aware of quantitative comparisons between models of

th music with machine learning standards, as it is done Appearing in Proceedings of the 25 International Confer- in Section 3 in terms of out-of-sample prediction ac- ence on Machine Learning, Helsinki, Finland, 2008. Copy- right 2008 by the author(s)/owner(s). curacy. In this paper, we focus on modeling rhyth- A Distance Model for Rhythms mic sequences, ignoring for the moment other aspects a learning process requiring a prohibitive amount of of music such as pitch, timbre and dynamics. How- data: in order to learn long range interactions, the ever, by capturing aspects of global temporal struc- training set should be representative of the joint dis- ture in music, this model should be valuable for full tribution of subsequences. To overcome this problem, melodic prediction and generation: combined with an we summarize the joint distribution of subsequences audio transcription algorithm, it should help improve by the distribution of distances between these sub- the poor performance of state-of-the-art transcription sequences. This summary is clearly not a sufficient systems; it could as well be included in genre classifiers statistics for the distribution of subsequences, but its or automatic composition systems (Eck & Schmidhu- distribution can be learned from a limited number of ber, 2002); used to generate rhythms, the model could examples. The resulting model, which generates dis- act as a drum machine or automatic accompaniment tances, is then used to recover subsequences. system which learns by example. Our main contribution is to propose a generative 2.2. Decomposition of Distances model for distance patterns, specifically designed for l l Let D(x ) = (di,j)ρ×ρ be the distance matrix asso- capturing long-term dependencies in rhythms. In Sec- l l l l ciated with each sequence x , where di,j = d(yi, yj). tion 2, we describe the model, detail its implemen- Since D(xl) is symmetric and contains only zeros on tation and present an algorithm using this model for the diagonal, it is completely characterized by the up- rhythm prediction. The algorithm solves a constrained per triangular matrix of distances without the diago- optimization problem, where the distance model is nal. Hence, used to filter out rhythms that do not comply with the inferred structure. The proposed model is evalu- ρ−1 ρ l Y Y l ated in terms of conditional prediction error on two p(D(x )) = p(di,j|Sl,i,j) (1) distinct databases in Section 3 and a discussion fol- i=1 j=i+1 lows. where S = {dl | (1 < s < j and 1 ≤ r < s) 2. Distance Model l,i,j r,s (2) or (s = j and 1 ≤ r < i)} . In this Section, we present a generative model for In words, we order the elements column-wise and do distance patterns and its application to rhythm se- a standard factorization, where each random variable quences. Such a model is appropriate for most music depends on the previous elements in the ordering. data, where distances between subsequences of data Hence, we do not assume any conditional indepen- exhibit strong regularities. dence between the distances.

2.1. Motivation Since d(yi, yj) is a metric, we have that d(yi, yj) ≤ d(y , y ) + d(y , y ) for all i, j, k ∈ {1, . . . , ρ}. This in- l l l m i k k j Let x = (x1, . . . , xm) ∈ R be the l-th rhythm se- equality is usually referred to as the triangle inequality. 1 n quence in a dataset X = {x ,..., x } where all the Defining sequences contain m elements. Suppose that we con- l l l struct a partition of this sequence by dividing it into αi,j = min (dk,j + di,k) and l l l k∈{1,...,(i−1)} ρ parts defined by yi = (x1+(i−1)m/ρ, . . . , xim/ρ) with l l l (3) βi,j = max (|dk,j − di,k|) , i ∈ {1, . . . , ρ}. We are interested in modeling the dis- k∈{1,...,(i−1)} tances between these subsequences, given a suitable we know that given previously observed (or sampled) metric d(y , y ): m/ρ × m/ρ → . As was pointed i j R R R distances, constraints imposed by the triangle inequal- out in Section 1, the distribution of d(y , y ) for each i j ity on dl are simply specific choice of i and j may be more important when i,j modeling rhythms (and music in general) than the ac- l l l βi,j ≤ di,j ≤ αi,j . (4) tual choice of subsequences yi. One may observe that the boundaries given in Eq. (3) Hidden Markov Models (HMM) (Rabiner, 1989) are contain a subset of the distances that are on the con- commonly used to model temporal data. In princi- ditioning side of each factor in Eq. (1) for each indexes ple, an HMM is able to capture complex regularities i and j. Thus, constraints imposed by the triangle in- in patterns between subsequences of data, provided equality can be taken into account when modeling each its number of hidden states is large enough. However, factor of p(D(xl)): each dl must lie in the interval im- when dealing with music, such a model would lead to i,j posed by previously observed/sampled distances given A Distance Model for Rhythms in Eq. (4). Figure 1 shows an example where ρ = 4. eral edit distance such as the Levenshtein distance. l Using Eq. (1), the distribution of d2,4 would be condi- However, this approach would not make sense psycho- l l l l tioned on d1,2, d1,3, d2,3, and d1,4, and Eq. (4) reads acoustically: doing an insertion or a deletion in a l l l l l rhythm produces a translation that alters dramatically |d1,2 − d1,4| ≤ d2,4 ≤ d1,2 + d1,4. Then, if subsequences l l l l the nature of the sequence. Putting it another way, y1 and y2 are close and y1 and y4 are also close, we l l rhythm perception heavily depends on the position on know that y2 and y4 cannot be far. Conversely, if sub- sequences yl and yl are far and yl and yl are close, which rhythmic events occur. In the remainder of this 1 2 1 4 l l l paper, di,j is the Hamming distance between subse- we know that y2 and y4 cannot be close. quences yi and yj. We now have to encode our belief that melodies of the same musical genre have a common distance structure. dl dl dl For instance, drum beats in rock music can be very 1,2 1,3 1,4 repetitive, except in the endings of every four mea- sures, without regard to the actual beats being played. This should be accounted for in the distributions of the corresponding dl . With Hamming distances, the dl dl i,j 2,3 2,4 l conditional distributions of di,j in Eq. (1) should be modeled by discrete distributions, whose range of pos- sible values must obey Eq. (4). Hence, we assume that l l l l the random variables (di,j − βi,j)/(αi,j − βi,j) should dl 3,4 be identically distributed for l = 1, . . . , n. As an ex- ample, suppose that measures 1 and 4 always tend to be far away, that measures 1 and 3 are close, and that Figure 1. Each circle represents the random variable asso- measures 3 and 4 are close; Triangle inequality states ciated with the corresponding factor in Eq. (1), when ρ = 4. that 1 and 4 should be close in this case, but the de- l For instance, the conditional distribution for d2,4 possibly sired model would still favor a solution with the great- depends on the variables associated to the grey circles. est distance possible within the constrains imposed by triangle inequalities.

All these requirements are fulfilled if we model di,j − 2.3. Modeling Relative Distances Between βi,j by a binomial distribution of parameters (αi,j − Rhythms βi,j, pi,j), where pi,j is the probability that two sym- bols of subsequences y and y differ. With this choice, We want to model rhythms in a music dataset X con- i j the conditional probability of getting d = β + δ sisting of melodies of the same musical genre. We i,j i,j would be first quantize the database by segmenting each song in m time steps and associate each note to the near- B(δ, αi,j, βi,j, pi,j) =   est time step, such that all melodies have the same αi,j − βi,j (5) (p )δ(1 − p )(αi,j −βi,j −δ) , length m1. It is then possible to represent rhythms by δ i,j i,j sequences containing potentially three different sym- bols: 1) Note onset, 2) Note continuation, and 3) Si- with 0 ≤ pi,j ≤ 1. If pi,j is close to zero/one, the lence. When using quantization, there is a one to one relative distance between subsequences yi and yj is mapping between this representation and the set of all small/large. However, the binomial distribution is not possible rhythms. Using this representation, symbol 2 flexible enough since there is no indication that the can never follow symbol 3. Let A = {1, 2, 3}; in the distribution of di,j − βi,j is unimodal. We thus model remaining of this paper, we assume that xl ∈ Am for each di,j − βi,j with a binomial mixture distribution in all xl ∈ X . order to allow multiple modes. We thus use c l When using this representation, d can simply be cho- X (k) (k) i,j p(d = β + δ|S ) = w B(δ, α , β , p ) sen to be the Hamming distance (i.e. counting the i,j i,j i,j i,j i,j i,j i,j k=1 number of positions on which corresponding symbols (6) are different.) One could think of using more gen- (k) Pc (k) with wi,j ≥ 0, k=1 wi,j = 1 for every indexes i and 1This hypothesis is not fundamental in the proposed j, and Si,j defined similarly as in Eq. (2). Parameters model and could easily be avoided if one would have to (1) (c−1) (1) (c) deal with more general datasets. θi,j = {wi,j , . . . , wi,j } ∪ {pi,j , . . . , pi,j } A Distance Model for Rhythms can be learned with the EM algorithm (Dempster Estimation procedures may not be well-defined and et al., 1977) on rhythm data for a specific music style. asymptotic theory may not hold if a model is not iden- tifiable. However, the model defined in Eq. (6) is iden- In words, we model the difference between the ob- tifiable if α − β > 2c − 1 (Titterington et al., 1985, served distance dl between two subsequences and the i,j i,j i,j p.40). While this is the case for most d , we observed minimum possible value β for such a difference by a i,j i,j that this condition is sometimes violated. Whatever binomial mixture. happens, there is no impact on the estimation because The parameters θi,j can be initialized to arbitrary val- we only care about what happens at the distribution ues before applying the EM algorithm. However, as level: there may be several parameters leading to the the likelihood of mixture models is not a convex func- same distribution, some components may vanish in the tion, one may get better models and speed up the fitting process, but this is easily remedied, and EM be- learning process by choosing sensible values for the haves well. initial parameters. In the experiments reported in Sec- As stated in Section 1, musical patterns form hierarchi- tion 3, the k-means algorithm for clustering (Duda cal structures closely related to meter (Handel, 1993). et al., 2000) was used. More precisely, k-means was Thus, the distribution of p(D(xl)) can be computed used to partition the values (dl − βl )/(αl − βl ) i,j i,j i,j i,j for many numbers of partitions within each rhythmic into c clusters corresponding to each component of the (1) (c) sequence. Let P = {ρ1, . . . ρh} be a set of numbers of mixture in Eq. (6). Let {µi,j , . . . , µi,j } be the cen- partitions to be considered by our model, where h is (1) (c) troids and {ni,j , . . . , ni,j } the number of elements in the number of such numbers of partitions. The choice each of these clusters. We initialize the parameters θi,j of P depends on the domain of application. Following with meter, P may have dyadic2 tree-like structure when (k) n modeling music (e.g. P = {2, 4, 8, 16}). Let D (xl) w(k) = i,j and p(k) = µ(k). ρr i,j n i,j i,j be the distance matrix associated with sequence xl di- We then follow a standard approach (Bilmes, 1997) vided into ρr parts. Estimating the joint probability Qh l to apply the EM algorithm to the binomial mixture r=1 p(Dρr (x )) with the EM algorithm as described l in this section leads to a model of the distance struc- in Eq. (6). Let zi,j ∈ {1, . . . , c} be a hidden variable l tures in music datasets. Suppose we consider 16 bars telling which component density generated di,j. For every iteration of the EM algorithm, we first compute songs with four beats per bar. Using P = {8, 16} would mean that we consider pairs of distances be- ψ tween every group of two measures (ρ = 8), and every p(zl = k|dl , αl , βl , θˆ ) = k,i,j,l i,j i,j i,j i,j i,j Pc single measures (ρ = 16). t=1 ψt,i,j,l One may argue that our proposed model for long-term where θˆ are the parameters estimated in the previous i,j dependencies is rather unorthodox. However, simpler iteration, or the parameters guessed with k-means on models like Poisson or Bernoulli process (we are work- the first iteration of EM, and ing in discrete time) defined over the whole sequence ψ =w ˆ(k)B(dl , αl , βl , p(k)) . would not be flexible enough to represent the particu- k,i,j,l i,j i,j i,j i,j lar long-term structures in music. Then, the parameters can be updated with 2.4. Conditional Prediction Pn l l l l l l ˆ (d − β )p(z = k|d , α , β , θi,j) p(k) = l=1 i,j i,j i,j i,j i,j i,j For most music applications, it would be particularly i,j Pn l l l l l l ˆ (α − β )p(z = k|d , α , β , θi,j) l=1 i,j i,j i,j i,j i,j i,j helpful to know which sequencex ˆs,..., xˆm maximizes p(ˆx ,..., xˆ |x , . . . , x ). Knowing which musical and s m 1 s−1 events are the most likely given the past s − 1 obser- n (k) 1 X vations would be useful both for prediction and gen- w = p(zl = k|dl , αl , βl , θˆ ). i,j n i,j i,j i,j i,j i,j eration. Note that in the remaining of the paper, we l=1 refer to prediction of musical events given past obser- This process is repeated until convergence. vations only for notational simplicity. The distance model presented in this paper could be used to predict Note that using mixture models for discrete data is known to lead to identifiability problems. Identifiabil- 2Even when considering non-dyadic measures (e.g. a ity refers here to the uniqueness of the representation three-beat waltz), the very large majority of the hierarchi- cal levels in metric structures follow dyadic patterns (Han- (up to an irrelevant permutation of parameters) of any del, 1993) in most tonal music. distribution that can be modeled by a mixture. A Distance Model for Rhythms any part of a music sequence given any other part with Algorithm 1 Simple optimization algorithm to max- only minor modifications. imize p(ˆxi,..., xˆm|x1, . . . , xi−1) While the described modeling approach captures long Initializex ˆs,..., xˆm using Eq. (9) range interactions in the music signal, it has two short- Initialize end = false comings. First, it does not model local dependen- while end = false do cies: it does not predict how the distances in the Set end = true for j = s to m do smallest subsequences (i.e. with length smaller than ∗ Setx ˆj = arg max[log pHMM(x |x1, . . . , xs−1) + m/ max(P)) are distributed on the events contained a∈A in these subsequences. Second, as the mapping from Ph ∗ λ r=1 log p(Dρr (x1, . . . , xs−1, x ))] ∗ sequences to distances is many to one, there exists where x = (ˆxs,..., xˆj−1, a, xˆj+1,..., xˆm) l several admissible sequences x for a given set of dis- if xˆj has been modified in the last step then tances. These limitations are addressed by using an- Set end = false other sequence learner designed to capture short-term end if dependencies between musical events. Here, we use end for a standard Hidden Markov Model (HMM) (Rabiner, end while 1989) displayed in Figure 2, following standard graph- Outputx ˆs,..., xˆm. ical model formalism. Each node is associated to a random variable and arrows denote conditional depen- dencies. Learning the parameters of the HMM can be probabilities for obvious computational reasons: done as usual with the EM algorithm.

maxx˜s,...,x˜m [log pHMM(˜xs,..., x˜m|x1, . . . , xs−1) Ph l +λ r=1 log p(Dρr (x ))] , (8) ... where tuning λ has the same effect as choosing a threshold P0 in Eq. (7) and can be done by cross- validation. xl xl xl 1 2 3 Multidimensional Scaling (MDS) is an algorithm that tries to embed points (here “local” subsequences) into a potentially lower dimensional space while trying to Figure 2. Hidden Markov Model. Each node is associated be faithful to the pairwise affinities given by a “global” to a random variable and arrows denote conditional de- pendencies. During training of the model, white nodes are distance matrix. Here, we propose to consider the pre- hidden while grey nodes are observed. diction problem as finding sequences that maximize the likelihood of a “local” model of subsequences un- der the constraints imposed by a “global” generative The two models are trained separately using their re- model of distances between subsequences. In other spective version of the EM algorithm. For predicting words, solving problem (7) is similar to finding points the continuation of new sequences, they are combined between which distances are as close as possible to a by choosing the sequence that is most likely according given set of distances (i.e. minimizing a stress func- to the local HMM model, provided it is also plausible tion in MDS). Naively trying all possible subsequences (m−s+1) regarding the model of long-term dependencies. Let to maximize (8) leads to O(|A| ) computations. l l Instead, we propose to search the space of sequences pHMM(x ) be the probability of observing sequence x estimated by the HMM after training. The final pre- using a variant of the Greedy Max Cut (GMC) method dicted sequence is the solution of the following opti- (Rohde, 2002) that has proven to be optimal in terms mization problem: of running time and performance for binary MDS op- timization.  The subsequencex ˆ ,..., xˆ can be simply initialized  max pHMM(˜xs,..., x˜m|x1, . . . , xs−1) s m  x˜s,...,x˜m with h (7) Y l  subject to p(Dρr (x )) ≥ P0 ,  (ˆx ,..., xˆ ) = max p (˜x ,..., x˜ |x , . . . , x )  r=1 s m HMM s m 1 s−1 x˜s,...,x˜m (9) where P0 is a threshold. In practice, one solves a La- using the local HMM model. Then, Algorithm 1 car- grangian formulation of problem (7), where we use log- ries on complete optimization. A Distance Model for Rhythms

For each position, we try every admissible symbol of estimate and to interpret, other performance criteria, the alphabet and test if a change increases the proba- such as ratings provided by a panel of experts, should bility of the sequence. We stop when no further change be more appropriate to evaluate the relevance of music can increase the value of the utility function. Obvi- models. We plan to define such an evaluation protocol ously, many other methods could have been used to in future work. We used 5-fold double cross-validation search the space of possible sequencesx ˆs,..., xˆm, such to estimate the accuracies. Double cross-validation is as simulated annealing (Kirkpatrick et al., 1983). We a recursive application of cross-validation that enables chose Algorithm 1 for its simplicity and the fact that to jointly optimize the hyper-parameters of the model it yields excellent results, as reported in the following and evaluate its generalization performance. Standard section. cross-validation is applied to each subset of K −1 folds with each hyper-parameter setting and tested with the 3. Experiments best estimated setting on the remaining hold-out fold. The reported accuracies are the averages of the results Two rhythm databases from different musical genres of each of the K applications of simple cross-validation were used to evaluate the proposed model. Firstly, 47 during this process. jazz standards melodies (Sher, 1988) were interpreted For the baseline HMM model, double cross-validation and recorded by the first author in MIDI format. Ap- optimizes the number of possible states for the hidden propriate rhythmic representations as described in Sec- variables. 2 to 20 possible states were tried in the re- tion 2.3 have been extracted from these files. The com- ported experiments. In the case of the model with dis- plexity of the rhythm sequences found in this corpus is tance constraints, referred to as the global model, the representative of the complexity of common jazz and hyper-parameters that were optimized are the num- pop music. We used the last 16 bars of each song to ber of possible states for hidden variables in the local train the models, with four beats per bar. Two rhyth- HMM model (i.e. 2 to 20), the Lagrange multiplier mic observations were made for each beat, yielding ob- λ, the number of components c (common to all dis- served sequences of length 128. We also used a subset tances) for each binomial mixture, and the choice of of the Nottingham database 3 consisting of 53 tradi- P, i.e. which partitions of the sequences to consider. tional British folk dance tunes called “hornpipes”. In Values of λ ranging between 0.1 and 4 and values of this case, we used the first 16 bars of each song to train c ranging between 2 and 5 were tried during double the models, with four beats per bar. Three rhyth- cross-validation. Since music data commonly shows mic observations were made for each beat, yielding strong dyadic structure following meter, many subsets observed sequences of length 192. The sequences from of P = {2, 4, 8, 16} were allowed during double cross- this second database contain no silence (i.e. rests), validation. leading to sequences with binary states. Note that the baseline HMM model is a poor bench- The goal of the proposed model is to predict or gener- mark on this task, since the predicted sequence, when ate rhythms given previously observed rhythm pat- prediction consists in choosing the most probable sub- terns. As pointed out in Section 1, such a model sequence given previous observations, only depends on could be particularly useful for music information re- the state of the hidden variable in position s, where s trieval, transcription, or music generation applica- is the index of the last observation. This observation tions. Let εt = 1 ifx ˆt = xt, and 0 otherwise, with i i i implies that the number of possible states for the hid- xt = (xt , . . . , xt ) a test sequence, andx ˆt the output 1 m i den variables of the HMM upper-bounds the number of of the evaluated prediction model on the i-th posi- different sequences that the HMM can predict. How- tion when given (xt , . . . , xt ) with s < i. Assume that 1 s ever, this behavior of the HMM does not harm the the dataset is divided into K folds T ,...,T (each 1 K validity of the reported experiments. The main goal containing different sequences), and that the k-th fold of this quantitative study is to measure to what extent T contains n test sequences. When using cross- k k distance patterns are present in music data and how validation, the accuracy Acc of an evaluated model well these dependencies can be captured by the pro- is given by posed model. What we really want to measure is how K m much gain we observe in terms of out-of-sample predic- 1 X 1 X 1 X t Acc = εi . (10) tion accuracy when using an arbitrary model if we im- K nk m − s k=1 t∈Tk i=s+1 pose additional constraints based on distance patterns. That being said, it would be interesting to measure the Note that, while the prediction accuracy is simple to effect of appending distance constraints to more com- 3http://www.cs.nott.ac.uk/~ef/music/database.htm. plex music prediction models (Pachet, 2003; Dubnov A Distance Model for Rhythms

Table 1. Accuracy (the higher the better) for best models Table 3. Accuracy over the last 64 positions for many sets on the jazz standards database. of partitions P on the jazz database, given the first 64 observations. The higher the better.

Observed Predicted HMM Global P Global 32 96 34.5% 54.6% 64 64 34.5% 55.6% {2} 49.3% 96 32 41.6% 47.2% {2, 4} 49.3% {2, 4, 8} 51.4% {2, 4, 8, 16} 55.6% Table 2. Accuracy (the higher the better) for best models on the hornpipes database.

Observed Predicted HMM Global 4. Conclusion 48 144 75.1% 83.0% 96 96 75.6% 82.1% The main contribution of this paper is the design 144 48 76.6% 80.1% and evaluation of a generative model for distance pat- terns in temporal data. The model is specifically well-suited to music data, which exhibits strong reg- et al., 2003) in future work. ularities in dyadic distance patterns between subse- quences. Reported conditional prediction accuracies Results in Table 1 for the jazz standards database show that such regularities are present in music data show that considering distance patterns significantly and can be effectively captured by the proposed model. improves the HMM model. One can observe that the Moreover, learning distributions of distances between baseline HMM model performs much better when try- subsequences really helps for accurate rhythm predic- ing to predict the last 32 symbols. This is due to the tion. Rhythm prediction can be seen as the first step fact that this database contains song endings. Such towards full melodic prediction and generation. A endings contain many silences and, in terms of accu- promising approach would be to apply the proposed racy, a useless model predicting silence at any position model to melody prediction. It could also be read- performs already well. On the other hand, the end- ily used to increase the performance of transcription ings are generally different from the rest of the rhythm algorithms, genre classifiers, or even automatic com- structures, thus harming the performance of the global position systems. model when just trying to predict the last 32 symbols. Results in Table 2 for the hornpipes database again The choice of the HMM to initialize the model is not show that the prediction accuracy of the global model optimal. However, this has no impact on the validity is consistently better than the prediction accuracy of of the reported results, since our goal was to show the the HMM, but the difference is less marked. This is importance of distance patterns between subsequences mainly due to the fact that this dataset only contains in rhythm data. In order to sample to models to gen- two symbols, associated to note onset and note con- erate subjectively good results (Pachet, 2003; Dubnov tinuation. Moreover, the frequency of these symbols is et al., 2003), one could use other benchmark and ini- quite unbalanced, making the HMM model much more tialization techniques, such as repetition of common accurate when almost always predicting the most com- patterns. mon symbol. Finally, besides being fundamental in music, modeling In Table 3, the set of partitions P is not optimized distance between subsequences should also be useful in by double cross-validation. Results are shown for dif- other application domains, such as in natural language ferent fixed sets of partitions. The best results are processing. Being able to characterize and constrain reached with “deeper” dyadic structure. This is a good the relative distances between various parts of a se- indication that the basic hypothesis underlying the quence of bags-of-concepts could be an efficient means proposed model is well-suited to music data, namely to improve performance of automatic systems such as that dyadic distance patterns exhibit strong regulari- machine translation (Och & Ney, 2004). On a more ties in music data. We did not compute accuracies for general level, learning constraints related to distances ρ > 16 because it makes no sense to estimate distribu- between subsequences can boost the performance of tion of distances between too short subsequences. ”short memory” models such as the HMM. A Distance Model for Rhythms

Acknowledgments Och, F. J., & Ney, H. (2004). The alignment template approach to statistical machine translation. Com- This work was supported in part by the IST Program putational Linguistics, 30, 417–449. of the European Community, under the PASCAL Net- work of Excellence, IST-2002-506778, funded in part Pachet, F. (2003). The continuator: Musical interac- by the Swiss Federal Office for Education and Science tion with style. Journal of New Music Research, 32, (OFES) and the Swiss NSF through the NCCR on 333–341. IM2. Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recogni- References tion. Proceedings of the IEEE, 77, 257–285. Bengio, Y., Simard, P., & Frasconi, P. (1994). Learn- Rohde, D. L. T. (2002). Methods for binary multidi- ing long-term dependencies with gradient descent is mensional scaling. Neural Comput., 14, 1195–1232. difficult. IEEE Transactions on Neural Networks, 5, 157–166. Scheirer, E. (1998). Tempo and beat analysis of acous- tic musical signals. Journal of the Acoustical Society Bilmes, J. (1997). A gentle tutorial on the EM algo- of America, 103, 588–601. rithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. Sher, C. (Ed.). (1988). The New Real Book, vol. 1-3. Sher Music Co. Cemgil, A. T., & Kappen, H. J. (2002). Rhythm quan- tization and tempo tracking by sequential Monte Titterington, D. M., Smith, A. F. M., & Makov, U. E. Carlo. Advances in Neural Information Processing (1985). Statistical analysis of finite mixture distri- Systems 14 (pp. 1361–1368). butions. Wiley. Whiteley, N., Cemgil, A. T., & Godsill, S. J. (2007). Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Sequential inference of rhythmic structure in mu- Maximum likelihood from incomplete data via the sical audio. Proc. of IEEE Int. Conf. on Acous- EM algorithm. Journal of the Royal Statistical So- tics, Speech and Signal Processing (ICASSP 07) (pp. ciety, 39, 1–38. 1321–1324). Dixon, S. (2007). Evaluation of the audio beat tracking system beatroot. Journal of New Music Research, 36, 39–50.

Dubnov, S., Assayag, G., Lartillot, O., & Bejerano, G. (2003). Using machine-learning methods for musical style modeling. IEEE Computer, 10.

Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern classification, second edition. Wiley Inter- science.

Eck, D., & Schmidhuber, J. (2002). Finding temporal structure in music: Blues improvisation with LSTM recurrent networks. Neural Networks for Signal Pro- cessing XII, Proc. 2002 IEEE Workshop (pp. 747– 756). New York: IEEE.

Handel, S. (1993). Listening: An introduction to the perception of auditory events. Cambridge, Mass.: MIT Press.

Kirkpatrick, S., Gelatt, C. D., & Vecchi, M. P. (1983). Optimization by simulated annealing. Sci- ence, Number 4598, 13 May 1983, 220, 4598, 671– 680.