COMPOSITION IDENTIFICATION IN OTTOMAN-TURKISH MAKAM MUSIC USING TRANSPOSITION-INVARIANT PARTIAL AUDIO-SCORE ALIGNMENT

Sertan Şentürk, Xavier Serra
Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain
{sertan.senturk, xavier.serra}@upf.edu

ABSTRACT

The composition information of audio recordings is highly valuable for many tasks such as automatic music description and music discovery. Given a music collection, two typical scenarios are retrieving the composition(s) performed in an audio recording and retrieving the audio recording(s) in which a composition is performed. We present a composition identification methodology for these two tasks, which makes use of music scores. Our methodology first attempts to align a fragment of the music score of a composition with an audio recording. Next, it computes a similarity from the best obtained alignment. A true audio-score pair emits a high similarity value. We repeat this procedure between all audio recordings and music scores, and filter the true pairs by a simple approach using logistic regression. The methodology is specialized according to the culture-specific aspects of Ottoman-Turkish makam music (OTMM), achieving 0.96 and 0.95 mean average precision (MAP) for the composition retrieval and performance retrieval tasks, respectively. We hope that our method will be useful in creating semantically linked music corpora for cultural heritage and preservation, semantic web applications and musicological studies.

1. INTRODUCTION

Version identification is an important task in music information retrieval which aims to automatically find the versions of a music piece in a collection of audio recordings [1, 2]. For popular music such as rap, pop and rock, the task aims to identify the covers of an original audio recording. For classical music traditions, a more relevant task is associating compositions with audio performances. The composition information is highly useful in many other computational tasks such as automatic content description and music discovery (e.g. searching for the performances of a composition in a music collection).

For classical music cultures, music collections consisting of music scores and audio recordings along with editorial metadata are desirable in many applications involving cultural heritage archival, music preservation and musicological studies. Composition identification is a crucial step linking performances and compositions during the creation of such music corpora from unlabeled musical data [3]. Composition information can be used to generate and improve linked musical data, enhance music content description and facilitate navigation in semantic web applications. Consider a scenario where a musician uploads an interpretation of a composition to a platform such as SoundCloud or YouTube. The performed compositions can be automatically identified and labeled semantically using an ontology, e.g. [4]. Next, the performance can be linked with related concepts (e.g. form, composer, score) available in other sources, such as biographies of the performing artist, the music score of the composition, or the musical and editorial metadata stored in open encyclopedias such as MusicBrainz and Wikipedia. Such a scheme would facilitate searching, accessing and navigating relevant music content in a more informed manner. Likewise, tasks such as enhanced listening and music recommendation may also benefit from the musical data linked via automatic composition identification.

Due to inherent characteristics of the oral tradition and the practice of Ottoman-Turkish makam music (OTMM), performances of the same piece may be substantially different from one another. This aspect brings certain challenges for the computational analysis and retrieval of OTMM (Section 2). In this paper, we propose a composition identification methodology which makes use of the available music scores of the relevant compositions via partial audio-score alignment. The methodology is designed to address the culture-specific challenges brought by OTMM.

To the best of our knowledge, our methodology is the first automatic composition identification proposed for OTMM. We consider two composition identification scenarios: 1) identifying the compositions performed in an audio recording, 2) identifying the audio recordings in which a composition is performed. Note that there might not be any relevant audio recordings for some compositions, and vice versa. Our methodology also aims to identify such cases. Our contributions can be summarized as:

1. The first composition identification methodology applied to Ottoman-Turkish makam music
2. An open and editorially complete dataset for composition identification in OTMM (Section 5.1)
3. Extending the state of the art in transposition-invariant partial audio-score alignment for OTMM by introducing subsequence dynamic time warping (Section 4.2.2)
4. Simplifications and generalizations of the fragment selection and fragment duration steps used in the score-informed tonic identification method proposed by [5], and verification of this method on a larger dataset as a side product of the composition identification experiments (Table 1)

For reproducibility purposes, relevant materials such as musical examples, data and results are open and publicly available via the CompMusic website (http://compmusic.upf.edu/node/306).

The rest of the paper is structured as follows: Section 2 provides an introduction to Ottoman-Turkish makam music. Section 3 defines the composition identification tasks we are dealing with. Section 4 explains the methodology applied to both composition identification scenarios. Section 5 presents the experimental setup, the test dataset and the results. Section 6 discusses the obtained results. Section 7 wraps up the paper with a brief conclusion.

2. OTTOMAN-TURKISH MAKAM MUSIC

The melodic structure of most of the traditional music repertoires of Turkey follows the concept of makam [6]. Currently, Arel-Ezgi-Uzdilek (AEU) theory is the mainstream theory for OTMM [6]. AEU theory argues that there are 24 unequal intervals in an octave and that a whole tone is divided into 9 equidistant intervals. These intervals can be approximated from 53-TET (tone equal tempered) intervals, each of which is termed a Holdrian comma (Hc) [6].

For centuries, OTMM has been predominantly an oral tradition. Since the start of the 20th century, a notational representation extending standard Western music notation has been used in OTMM complementary to the oral practice [7]. This notation typically follows the rules of AEU theory.

Below we list some of the characteristics of OTMM which pose challenges for composition identification:

• There is no definite tonic frequency (e.g. A4 = 440 Hz) in the performances. The performed tonic is occasionally transposed due to instrument/vocal range or aesthetic reasons [6]. This necessitates automatic tonic identification for any fully-automatic alignment method (Section 4.2).
• The performances of OTMM occasionally include improvisations played before, after or even within a composition. It is also common to repeat, insert or omit sections of a composition.
• Until the 20th century, most of this music was strictly transmitted from a master to the students within the oral tradition. This resulted in the musical material propagating differently in different "schools." Therefore, performances of the same composition may differ from each other substantially.
• OTMM is a heterophonic music tradition. Musicians simultaneously perform the same "melodic idea;" yet they are supposed to show their virtuosity by changing the tuning and the intonation of some intervals, adding embellishments and/or inserting, repeating and omitting notes and phrases during the performance. Melody extraction algorithms might not perform well in recordings with substantial heterophonic interactions [8].
• Most of the scores of OTMM are descriptive, and they are sometimes transcribed centuries later. The scores typically notate basic, monophonic melodic lines. They do not usually indicate the heterophony, intonation deviations and other expressive elements observed in the performances.

In the experiments, we focus on peşrev and saz semaisi, which are the two most common instrumental forms of the classical repertoire. Both peşrev and saz semaisi typically consist of four non-repeating sections called hane and a repetitive section called teslim performed between these hanes.

3. PROBLEM DEFINITION

Given a specific music collection, two basic composition identification scenarios are:

1. Composition retrieval: identification of the compositions which are performed in an audio recording.
2. Performance retrieval: identification of the audio recordings in which a composition is performed.

These scenarios are ranked retrieval problems, where the query is an audio recording and the retrieved documents are the compositions in the composition retrieval task, and vice versa. In both cases, the common step is to estimate whether a composition and an audio recording are relevant to one another. The relevances in the composition identification problems are binary, i.e. 1 if the composition and the audio recording are paired and 0 otherwise.

The results in both cases can be aggregated by applying this step to multiple documents and queries. Nevertheless, there might be situations where it may be impossible or impractical to process the whole collection, for example due to restricted access to copyrighted music material or the lack of computational resources in fast-query applications (e.g. real-time composition identification in mobile applications). Moreover, both scenarios might require different constraints to obtain better results and/or to process more efficiently. For example, a good performance retrieval method should find multiple relevant audio recordings for a composition; on the other hand, only the top-ranked documents are important in composition retrieval, since more than a single composition is rarely performed in the queried audio recordings (Section 5.1). In this paper, we deal with these two tasks separately and leave the joint retrieval task as a future direction to explore.
4. METHODOLOGY

We assume that the scores of the compositions are available, and we estimate the relevance by partially aligning the score of a composition (n) with the audio recording of a performance (m). The alignment step in our methodology is based on the score-informed tonic identification procedure described in [5], which we use to obtain the best possible alignment between a score and an audio recording in a manner invariant of the transposition of the performance (Section 4.2). Next, we compute a similarity value ∈ [0, 1] between the composition and the performance from the best alignment path. We observe a high similarity value if the composition (n) is indeed performed in the audio recording (m) (Section 4.3). The block diagram of the transposition-invariant partial audio-score alignment is given in Figure 1. The alignment process is repeated between each audio recording and music score, and a similarity value is obtained for each composition and performance pair in our collection. Finally, the performance-composition pairs with low similarity values are discarded using outlier detection (Section 4.4) and the relevant pairs are obtained.

4.1 Feature Extraction

In audio-score alignment of Eurogenetic music, features which can capture the harmony, such as chroma features [9, 10], are typically used. In [8], it is shown that predominant melody performs better for OTMM due to the melodic nature of the music tradition. In our method we follow the melodic features proposed for audio-score alignment of OTMM in [5] and [8].

From the audio recording (m), we extract a predominant melody using a version of the methodology proposed in [11], optimized for makam music [12] (the implementation is available at https://github.com/sertansenturk/predominantmelodymakam). The pitch precision of the predominant melody is taken as 7.5 cents (≈ 1/3 Hc), which is a suitable value for tracking pitch deviations in makam music [13]. The frame period of the extracted predominant melody is downsampled from ∼2.9 ms to ∼46 ms, which is shown to be sufficient for audio-score alignment in OTMM [8]. We denote the predominant melody extracted from the audio recording (m) as X^(m) = (x_1^(m), ..., x_{I^(m)}^(m)), m ∈ [1 : M], where M is the number of audio recordings in the collection and I^(m) is the number of samples in the audio predominant melody (Figure 1d).

From the machine-readable score of the composition (n), we first pick a short fragment (either from the start of the score or from the repetition indicated in the score). We also try different fragment durations in Section 5. Then we sample the note symbols in the note sequence of the selected fragment according to their durations at the nominal tempo indicated in the score [8]. In practice, the previous note is commonly sustained in the place of a rest, so we omit the rests in the score and add their duration to the previous note [8]. The sampled symbols are mapped to the theoretical scale degrees in cents according to AEU theory, such that the tonic symbol is assigned to 0 cents (Figure 1b). The generated synthetic pitch track has a sampling period of ∼46 ms, equal to the frame period of the predominant melody. We denote the synthetic melody computed from the score of the composition (n) as Y^(n) = (y_1^(n), ..., y_{J^(n)}^(n)), n ∈ [1 : N], where N is the number of compositions with scores in the collection and J^(n) is the number of samples in the synthetic melody (Figure 1b).
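To make the score-side feature concrete, the sketch below samples a fragment's notes into such a synthetic pitch track. It is a minimal illustration under our own assumptions: the (pitch-in-cents, duration) note representation is a simplification we introduce here, not the SymbTr format itself.

```python
import numpy as np

HOP = 0.046  # seconds; the frame period shared with the predominant melody

def synthesize_melody(notes):
    """Sample a score fragment into a synthetic pitch track in cents.

    notes: list of (pitch_cents, duration_sec) pairs, where pitch_cents is the
    AEU scale degree relative to the tonic (tonic = 0 cents) and the duration
    follows the nominal tempo in the score. Rests are assumed to have been
    merged into the preceding note beforehand, as described above.
    """
    track = []
    for pitch_cents, duration in notes:
        n_frames = max(1, int(round(duration / HOP)))  # hold each note
        track.extend([pitch_cents] * n_frames)
    return np.asarray(track, dtype=float)
```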
Notice that the unit of the pitch values in the audio predominant melody X^(m) is Hertz, whereas the unit of the pitch values in the synthetic melody Y^(n) is cents. For proper alignment of Y^(n) within X^(m) (provided that they are related to each other), X^(m) has to be normalized with respect to the tonic frequency.

To identify the tonic, we first compute a pitch class distribution from the audio predominant melody [5]. We use kernel-density estimation to obtain a smooth pitch class distribution without spurious peaks [5]. We select the bin width of the distribution as 7.5 cents (the same as the pitch precision of the audio predominant melody) and use a Gaussian kernel with a standard deviation of 15 cents (≈ 2/3 Hc), so that a pitch value contributes in an interval slightly smaller than a semitone, which is reported as optimal for this task [5]. The width of the kernel is selected as 75 cents center to tail (i.e. 5 times the standard deviation), as the contributions to samples beyond this width are redundant.

Finally, we pick the peaks of the distribution as the tonic candidates [5, 13] (Figure 1e). We denote the tonic candidates for the audio recording (m) as C^(m) = {c_1^(m), ..., c_{K^(m)}^(m)}, where K^(m) is the number of peaks in the pitch class distribution (Figure 1e). Note that the candidates correspond to (stable) pitch classes instead of frequencies. This choice reduces the computational complexity, as we will compute "octave-wrapped" pitch distances in the alignment step (Section 4.2, Equation 2).
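The tonic-candidate step can be illustrated as follows. This sketch approximates the kernel-density estimate by accumulating a truncated Gaussian kernel on 7.5-cent bins and picks the peaks with scipy.signal.find_peaks; the function names and the choice of peak picker are ours, not the authors' implementation.

```python
import numpy as np
from scipy.signal import find_peaks

def pitch_class_distribution(pitch_hz, ref_hz=440.0, bin_width=7.5, kernel_std=15.0):
    """Smoothed, octave-wrapped pitch class distribution (bins in cents).

    Approximates the kernel-density estimate described above; ref_hz is an
    arbitrary reference used only for the Hz-to-cent mapping.
    """
    pitch_hz = pitch_hz[pitch_hz > 0]                  # drop unvoiced frames
    cents = 1200.0 * np.log2(pitch_hz / ref_hz)
    pitch_classes = np.mod(cents, 1200.0)              # wrap into one octave

    bins = np.arange(0.0, 1200.0, bin_width)
    dist = np.zeros_like(bins)
    for pc in pitch_classes:
        d = np.abs(bins - pc)
        d = np.minimum(d, 1200.0 - d)                  # circular bin distance
        dist += np.where(d <= 5 * kernel_std,          # truncate at 75 cents
                         np.exp(-0.5 * (d / kernel_std) ** 2), 0.0)
    return bins, dist / dist.sum()

def tonic_candidates(bins, dist):
    """Peaks of the distribution are the candidate tonic pitch classes.
    (Circular boundary handling at 0/1200 cents is omitted for brevity.)"""
    peaks, _ = find_peaks(dist)
    return bins[peaks]
```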
4.2 Transposition-Invariant Partial Alignment

Assuming a candidate c_k^(m) obtained from the pitch class distribution as the tonic frequency, we normalize each pitch sample in the audio predominant melody to the cent scale by:

    \hat{x}_i^{(m,k)} = 1200 \log_2 \left( x_i^{(m)} / c_k^{(m)} \right)    (1)

Note that there are 1200 cents in an octave. We denote the predominant melody normalized with respect to the tonic candidate c_k^(m) as X̂^(m,k). Next, we attempt to align the score fragment to the corresponding location in the audio recording by searching for the synthetic melody Y^(n), computed from the selected score fragment, in the normalized audio predominant melody X̂^(m,k). We compare two methods for partial alignment: 1) the Hough transform, and 2) subsequence DTW.

4.2.1 Hough Transform

The Hough transform is a simple yet effective parametric line detection method [14]. It has previously been used for section-level audio-score alignment [8], tonic identification [5] and tempo estimation [15] in OTMM, and found to produce results comparable to methodologies using complex models such as hierarchical hidden Markov models [15]. Nevertheless, it cannot handle extensive tempo deviations or insertions, repetitions and omissions in a musical phrase, since it is a linear operation.

If the Hough transform is selected for partial alignment, a distance matrix is computed between the synthetic pitch track Y^(n) and the normalized audio predominant melody X̂^(m,k) (Figure 1g). Each element D(i, j) is computed as:

    D(i,j) = \min \left( \left| \hat{x}_i^{(m,k)} - y_j^{(n)} \right| \bmod 1200,\; 1200 - \left| \hat{x}_i^{(m,k)} - y_j^{(n)} \right| \bmod 1200 \right)    (2)

where \hat{x}_i^{(m,k)} denotes the i-th sample of the normalized audio predominant melody and y_j^{(n)} the j-th sample of the synthetic pitch track. This distance may be interpreted as the shortest distance in cents between two pitch classes. It is not affected by octave errors in the predominant melody or in the tonic.

If the selected score fragment is performed within the audio recording and the predominant melody is normalized with the correct tonic frequency, the distance matrix will show diagonal blob(s) of low distance values. The projection of the trajectory formed by the blob onto the audio axis indicates the time interval in the audio recording where the score fragment is performed. To make the line segments more prominent, we binarize the distance matrix using the binarization criteria proposed in [8] and obtain a binary similarity matrix B. We compute each element B(i, j) of the binary similarity matrix as:

    B(i,j) = \begin{cases} 1, & D(i,j) \leq \alpha \\ 0, & D(i,j) > \alpha \end{cases}    (3)

Here two pitch values are considered to belong to the same note if their distance is less than the binarization distance threshold α (in cents). We take α = 50 cents, which is reported as an optimum of this value for makam music [8].

As can be seen in Figure 1g, these blobs can be approximated as line segments. To detect the line segments, we apply the Hough transform to the binary similarity matrix (Figure 1h). We restrict the searched angles to between −63.43° and −26.57°, which allows the alignment to have a tempo deviation between 0.5 and 2 times the nominal tempo indicated in the score. From the obtained transformation matrix, we select the highest peak, which indicates the most prominent line segment [14]. The alignment path is simply the sequence of points along the line segment that accumulated the selected peak. An example alignment found by the Hough transform can be seen in Figure 1h.

Figure 1. Block diagram of the transposition-invariant partial audio-score alignment using the Hough transform: a) a short fragment selected from the score, b) the synthetic pitch track computed from the score fragment, c) the audio recording, d) the predominant melody extracted from the audio recording, e) the pitch class distribution computed from the audio predominant melody and its detected peaks, f) the set of predominant melodies normalized with respect to the detected peaks, g) the set of distance matrices between the synthetic melody and the normalized predominant melodies, h) the set of binary similarity matrices computed from the distance matrices, along with the alignment path obtained using the Hough transform displayed on top of the binary similarity matrices, i) the similarity value computed for the alignment path. All of the blocks except the linear alignment step are the same for partial alignment using SDTW.
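Equations 2 and 3 vectorize naturally; a minimal sketch with our own helper names:

```python
import numpy as np

def pitch_class_distance(x_hat, y):
    """Eq. 2: pairwise shortest pitch class distance in cents.

    x_hat: normalized audio predominant melody in cents, shape (I,)
    y:     synthetic melody in cents, shape (J,)
    """
    diff = np.abs(x_hat[:, None] - y[None, :]) % 1200.0
    return np.minimum(diff, 1200.0 - diff)

def binarize(dist, alpha=50.0):
    """Eq. 3: 1 where two pitches are close enough to belong to the same note."""
    return (dist <= alpha).astype(np.uint8)
```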

4.2.2 Subsequence DTW

Dynamic programming, and more specifically dynamic time warping (DTW), is a state-of-the-art methodology for many relevant tasks such as cover song identification [1, 2] and audio-score alignment [16, 17]. Unlike the Hough transform, DTW is robust to tempo changes and to musical insertions, deletions and repetitions. However, it can be prone to pathological warpings. We use subsequence DTW (SDTW), a typical variant used when one of the time series is a subsequence of the other [18, 19]. In this variant, the start and end of a path are allowed to be anywhere within the target. We refer the readers to [19, Chapter 4] for a thorough explanation of DTW and SDTW.

Using SDTW, we compute each element A(i, j) of the accumulated cost matrix recursively as:

    A(i,j) = \begin{cases} 0, & j = 0 \\ \infty, & i = 0,\; j \neq 0 \\ D(i,j) + \min \big( A(i-1,j-1),\, A(i-2,j-1),\, A(i-1,j-2) \big), & \text{otherwise} \end{cases}    (4)

where out-of-range entries are treated as \infty. As seen above, we select the step size condition as {(2, 1), (1, 1), (1, 2)}. Analogous to the angle restriction in the Hough transform (Section 4.2.1), this step size ensures that the intra-tempo variations in any path stay between half and double the nominal tempo indicated in the score. Moreover, we use Equation 2 as the local distance measure to calculate the accumulated cost matrix. Also notice that the accumulated cost matrix is extended with a zeroth row and column, initialized so as to enable subsequence matching. Finally, we back-track the path p^(m,n,k) ending at arg min_i A(i, J^(n)) (remember that J^(n) is the length of the synthetic melody), which emits the lowest accumulated cost [19, Chapter 4].
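A compact sketch of the subsequence DTW described above, following Equation 4 with the {(1, 1), (2, 1), (1, 2)} step sizes; the zeroth column lets a path start anywhere along the audio axis. This is our own minimal reading of the procedure, not the authors' implementation.

```python
import numpy as np

def sdtw(dist):
    """Subsequence DTW over a local distance matrix (Eq. 4).

    dist: shape (I, J); rows are audio frames, columns are score frames.
    Returns the alignment path as (i, j) pairs, 1-indexed as in the paper.
    Step sizes {(1, 1), (2, 1), (1, 2)} bound the tempo deviation to [0.5, 2].
    """
    I, J = dist.shape
    A = np.full((I + 1, J + 1), np.inf)
    A[:, 0] = 0.0                        # a path may start anywhere in the audio
    steps = [(1, 1), (2, 1), (1, 2)]
    choice = np.zeros((I + 1, J + 1), dtype=int)

    for i in range(1, I + 1):
        for j in range(1, J + 1):
            costs = [A[i - di, j - dj] if (i >= di and j >= dj) else np.inf
                     for di, dj in steps]
            k = int(np.argmin(costs))
            A[i, j] = dist[i - 1, j - 1] + costs[k]
            choice[i, j] = k

    if not np.isfinite(A[1:, J]).any():
        return []                        # no feasible subsequence match
    i, j = int(np.argmin(A[1:, J])) + 1, J   # end point with the lowest cost
    path = []
    while j > 0:                         # back-track to the zeroth column
        path.append((i, j))
        di, dj = steps[choice[i, j]]
        i, j = i - di, j - dj
    return path[::-1]
```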

4.3 Similarity Computation

Using either the Hough transform or SDTW, we obtain a path p^(m,n,k) = (p_1^(m,n,k), ..., p_{L^(m,n,k)}^(m,n,k)) between the audio recording of the performance (m) and the score of the composition (n) using the tonic candidate c_k^(m), with p_l^(m,n,k) = (r_l^(m,n,k), q_l^(m,n,k)), r_l^(m,n,k) ∈ [1, I^(m)], q_l^(m,n,k) ∈ [1, J^(n)] and l ∈ [1 : L^(m,n,k)], where L^(m,n,k) is the length of the path p^(m,n,k). We compute a similarity s^(m,n,k) ∈ [0, 1] between the score fragment and the audio recording for the tonic candidate c_k^(m) by:

    s^{(m,n,k)} = \frac{\sum_l B \left( r_l^{(m,n,k)}, q_l^{(m,n,k)} \right)}{L^{(m,n,k)}}    (5)

s^(m,n,k) gives us a measure of how closely the score fragment is followed by the corresponding time interval in the audio recording indicated by the path. For example, if the differences between the matched values of the audio predominant melody and the synthetic melody are always below 50 cents, the similarity is 1.

For the partial alignment between the score of the composition (n) and the audio recording (m), we obtain a set of alignment similarities S^(m,n) = {s^(m,n,1), ..., s^(m,n,K^(m))}, where s^(m,n,k) is the alignment similarity between the composition (n) and the audio recording (m) for the tonic candidate c_k^(m), k ∈ [1 : K^(m)].

The similarity between the composition (n) and the audio recording (m) is simply taken as the maximum alignment similarity value, i.e. s^(m,n) = max(S^(m,n)).

Figure 2 shows the similarities computed between the performances in our audio collection (Section 5.1) and the composition "Acemaşiran Peşrevi" (http://musicbrainz.org/work/01412a5d-1858-43b3-b5b0-78f383675e9b). In this example, the similarities of the relevant audio recordings are much higher than those of the non-relevant ones.

Note that finding a true pair also implies correctly identifying the tonic pitch class [5], i.e. ζ^(m,n) = arg max_{c_k^(m)} (S^(m,n)), where ζ^(m,n) is the estimated tonic pitch class of the performance of the composition in the audio recording.

Figure 2. Similarity vs. Mahalanobis distance between the composition "Acemaşiran Peşrevi" and the audio recordings in the dataset (relevant and irrelevant recordings marked separately), and the kernel density estimate computed from the similarity values between the audio recordings and the composition.
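Equation 5 amounts to averaging the binary similarity along the alignment path and keeping the best tonic candidate. A sketch reusing the pitch_class_distance, binarize and sdtw helpers from the earlier sketches (our naming; it assumes unvoiced frames have already been removed from the audio pitch track):

```python
import numpy as np

def path_similarity(B, path):
    """Eq. 5: fraction of path points that land on matching pitch pairs."""
    return sum(B[i - 1, j - 1] for i, j in path) / len(path)

def recording_vs_score(x_hz, y_cents, tonic_candidates_hz):
    """s^(m,n): the best alignment similarity over all tonic candidates."""
    best = 0.0
    for tonic in tonic_candidates_hz:
        x_hat = 1200.0 * np.log2(x_hz / tonic)    # Eq. 1: normalize to cents
        D = pitch_class_distance(x_hat, y_cents)  # Eq. 2
        path = sdtw(D)                            # Eq. 4: partial alignment
        if path:
            best = max(best, path_similarity(binarize(D), path))
    return best
```

The candidate maximizing the similarity also yields the estimated tonic pitch class ζ^(m,n) described above.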
4.4 Irrelevant Document Rejection

In many common retrieval scenarios, including composition identification, the users are only interested in checking the top documents [20]. After applying partial audio-score alignment between the query and each document, we rank the documents with respect to the similarities obtained. We then reject the documents with low similarities according to an automatically learned threshold.

As seen in Figure 2, the relevant documents stand out as "outliers" among the irrelevant documents with respect to the similarities they emit. To fetch the relevant documents per query, one can apply "outlier detection" using the similarities between each document and the query. Outlier detection is a common problem with many applications, such as fraud detection and server malfunction detection [21].

Upon inspecting the similarity values emitted by irrelevant documents, we have noticed that the values roughly follow a normal distribution (Figure 2). However, the distributions observed for each query have a different mean and variance. This is expected, since the similarity computation could be affected by several factors such as the melodic complexities of the score fragment and the audio performance, as well as the quality of the extracted audio predominant melody. To deal with this variability, we compute the Mahalanobis distance of each similarity value to the distribution represented by the other similarity values (Figure 2). (Note that the Mahalanobis distances shown in Figure 2 are smaller than what a "real" normal distribution would produce, because of the contribution of the true pairs to the distribution.) The Mahalanobis distance is a unitless and scale-invariant distance metric, which outputs the distance between a point and a distribution in standard deviations.

To reject irrelevant documents, we apply a simple method where all documents below a certain threshold are rejected. To learn the decision boundary for thresholding, we apply logistic regression [20], a simple binary classification model, to the similarity values and the Mahalanobis distances of labeled data (Section 5.1). The training step is explained in Section 5 in more detail.

After eliminating the documents according to the learned decision boundary, we add a last document called none to the end of the list. This document indicates that the query might not have any relevant document in the collection, if all of the documents above it are irrelevant.
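In one dimension, the Mahalanobis distance of a similarity value to the remaining values reduces to an absolute z-score, and the rejection step can be sketched with scikit-learn's LogisticRegression. The feature layout below is our own illustrative reading of the description above, not the authors' exact setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def mahalanobis_1d(sims):
    """Distance of each similarity to the distribution of the other values,
    in standard deviations (the 1-D special case of the Mahalanobis distance)."""
    sims = np.asarray(sims, dtype=float)
    out = np.empty_like(sims)
    for i in range(len(sims)):
        rest = np.delete(sims, i)
        out[i] = abs(sims[i] - rest.mean()) / (rest.std() + 1e-12)
    return out

def fit_rejector(train_sims, train_dists, train_labels):
    """Learn the decision boundary from annotated pairs (Section 5)."""
    X = np.column_stack([train_sims, train_dists])
    return LogisticRegression().fit(X, train_labels)

def retrieve(model, sims):
    """Rank the documents for one query and reject those below the boundary."""
    dists = mahalanobis_1d(sims)
    keep = model.predict(np.column_stack([sims, dists])).astype(bool)
    ranked = sorted(np.flatnonzero(keep), key=lambda d: -sims[d])
    return ranked + ["none"]   # the special trailing document (Section 4.4)
```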

5. EXPERIMENTS

In the experiments, we compare the two alignment methods (Hough transform vs. SDTW). We try to align either the repetition in the score, as done in [5], or the start of the score, as a simpler alternative and for the case when the structure information is not available in the score. We search for the optimal fragment duration between 4 and 24 seconds.

As mentioned in Section 3, we evaluate the performance retrieval and the composition retrieval tasks separately. To test the document rejection step, we use 10-fold cross validation. We run the transposition-invariant partial audio-score alignment between each score fragment and audio recording (Section 4.2) and then compute the similarity value for each performance-composition pair in the training set (Section 4.3). We also compute the Mahalanobis distance for each query (a performance in the composition retrieval task, and vice versa). We apply logistic regression to the similarity values and the Mahalanobis distances computed for each annotated audio-score pair (with the binary relevances 0 or 1), and learn a decision boundary between the relevant and irrelevant documents. Then, given a query (a composition in the performance retrieval task, and vice versa) from the testing set, we carry out all the steps explained in Section 4 and reject all the documents (performances in the performance retrieval task, and vice versa) "below" the decision boundary.

We use mean average precision (MAP) [20] to evaluate the methodology. MAP can be considered a summary of how a method performs across different queries and numbers of documents retrieved per query. For the document rejection step, we report the average MAP obtained from the MAPs of each testing set. We also conduct 3-way ANOVA tests on the MAPs obtained from each testing set to find whether there are significant differences between the alignment methods, fragment locations and fragment durations. For all results below, the term "significant" has the following meaning: the claim is statistically significant at the p = 0.01 level, as determined by a multiple comparison test using the Tukey-Kramer statistic.
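For reference, mean average precision with binary relevances can be computed as follows; a generic sketch of the standard metric [20], not code from the paper.

```python
def average_precision(ranked_relevances):
    """AP for one query, given the binary relevance of each ranked document.
    A query with no relevant documents gets 0, as noted in Section 5.2."""
    hits, precisions = 0, []
    for rank, relevant in enumerate(ranked_relevances, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0

def mean_average_precision(per_query_relevances):
    """MAP: the mean of the per-query average precisions."""
    return sum(map(average_precision, per_query_relevances)) / len(per_query_relevances)
```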

5.1 Dataset

For our experiments, we gathered a collection of 743 audio recordings and 146 music scores of different peşrev and saz semaisi compositions. The audio recordings are selected from the CompMusic corpus [22]. These recordings are either in the public domain or commercially available. The scores are selected from the SymbTr score collection [23]. SymbTr scores are given in a machine-readable format, which stores the duration and symbol of each note. The structural divisions in the compositions (i.e. the start and end note of each section) and the nominal tempo are also indicated in the scores.

We manually labeled the compositions performed in each audio recording. In the dataset there are 360 recordings associated with 87 music scores, forming 362 audio-score pairs. This information, along with other relevant metadata such as the releases, performers and composers, is stored in MusicBrainz (http://musicbrainz.org/). Figure 3 shows the histogram of the number of relevant compositions per audio recording and the number of relevant audio recordings per composition. The number of recordings for a particular composition in our collection may be as many as 11. On the other hand, the releases of OTMM are typically organized such that there is a single composition performed in each track. For this reason, we were only able to obtain two audio recordings in which two compositions are performed. Note that the tonic frequency changes in the performances of each composition in these two recordings.

Figure 3. The number of relevant documents for the queries: a) histogram of the number of relevant audio recordings per score, b) histogram of the number of relevant scores per audio recording.

The average cardinalities of the compositions per audio recording and the audio recordings per composition are 0.49 and 2.48, respectively. Notice that we have also included some compositions in our data collection which do not have any relevant performances, and vice versa (Figure 3). Our methodology also aims to identify such queries without relevant documents. If we consider this case as an additional, special "document" called none, the average cardinality for compositions per audio recording and audio recordings per composition is 1.00 and 2.88, respectively.

5.2 Results

Before document rejection, the MAP is around 0.47 for both the composition retrieval and performance retrieval tasks, using either of the alignment methods, fragment locations and fragment durations longer than 8 seconds. The MAP is low before document rejection because the queries without relevant documents will practically have 0 average precision. Figure 4 shows the composition retrieval and performance retrieval results before document rejection, only for the queries with relevant documents. The retrieval results before document rejection show that most of the audio-score pairs may be found by partial audio-score alignment using a score fragment of at least 12 seconds. Although the Hough transform performs slightly better than SDTW, these increases are not significant for fragment durations longer than 8 seconds.

Figure 4. MAP for the composition and performance retrieval tasks before document rejection, across different methods (Hough transform or SDTW), fragment locations (start or repetition) and durations (4-24 seconds). Only the queries with at least one relevant document are considered.

Figure 5 shows the average MAPs from all queries obtained using different fragment durations, fragment locations and partial alignment methods in the 10-fold cross validation scheme. The best average MAP is 0.96 for composition retrieval, using either the Hough transform or SDTW and aligning 24 seconds from the start. For performance retrieval, the best average MAP of 0.95 is achieved using the Hough transform and aligning 16 seconds from the start. When we inspect the average MAPs obtained from the queries without any relevant documents (Figure 6), we observe that the document rejection step always achieves an average MAP higher than 0.95 for all the parameter combinations in the composition retrieval task, and an average MAP close to or higher than 0.9 for all the parameter combinations in the performance retrieval task.

Figure 5. MAP for the composition and performance retrieval tasks after document rejection, across different methods, fragment locations and durations. All queries are considered.

Figure 6. MAP for the composition and performance retrieval tasks after document rejection, across different methods, fragment locations and durations. Only the queries with no relevant documents are considered.

When we inspect the alignment results, we find that the score fragments were aligned properly in most of the cases. Moreover, the tonic is identified almost perfectly for all the audio recordings by aligning the relevant scores (Table 1); we achieved 100% accuracy over the 362 audio-score pairs by aligning at least 12 seconds from the repetition using the Hough transform.

                            Duration (sec.)
  Method  Location      4    8   12   16   20   24
  Hough   Start        30   15    2    3    2    2
  Hough   Repetition   14    5    0    0    0    0
  SDTW    Start        32    6    3    3    3    3
  SDTW    Repetition   24    3    1    2    3    3

Table 1. Number of errors in tonic identification.
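The 10-fold evaluation of the rejection step described above can be outlined as follows; a schematic reusing average_precision from the earlier sketch, with a hypothetical (query × document) array layout rather than the authors' actual data structures.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def cross_validated_map(sims, dists, labels, n_splits=10):
    """Schematic 10-fold evaluation of the document rejection step.

    sims, dists, labels: hypothetical (n_queries, n_documents) arrays holding
    the similarity, Mahalanobis distance and annotated binary relevance of
    every query-document pair.
    """
    fold_maps = []
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_q, test_q in kfold.split(sims):
        X = np.column_stack([sims[train_q].ravel(), dists[train_q].ravel()])
        model = LogisticRegression().fit(X, labels[train_q].ravel())
        aps = []
        for q in test_q:
            keep = model.predict(np.column_stack([sims[q], dists[q]])).astype(bool)
            ranked = sorted(np.flatnonzero(keep), key=lambda d: -sims[q, d])
            aps.append(average_precision([labels[q, d] for d in ranked]))
        fold_maps.append(np.mean(aps))
    return float(np.mean(fold_maps))
```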

6. DISCUSSION

The results show that even aligning an 8-second fragment is highly effective; nevertheless, the optimal fragment duration for composition identification is around 16 seconds. Using a fragment duration longer than 16 seconds is not necessary, since it increases the computation time without any significant benefit to identification performance. The results further show that aligning the start of the score is sufficient, and there is no need to exploit the structure information to select a fragment from the repetition as in [5].

If a fragment of 16 seconds from the start of the score is selected, the Hough transform and SDTW produce the same results in both the composition retrieval and performance retrieval tasks. One surprising case is the lower MAPs obtained in the performance retrieval task using SDTW to align the repetition. Although the drop is not significant for fragment durations longer than 12 seconds, we observed that SDTW tends to align irrelevant subsequences in the performances with score fragments that have similar note-symbol sequences but different durations.

Both the Hough transform and SDTW have a complexity of O(I^(m) J^(n)), where I^(m) is the length of the predominant melody extracted from the audio recording (m) and J^(n) is the length of the synthetic melody generated from the score of the composition (n). Nonetheless, the Hough transform is applied to a sparse, binary similarity matrix, hence it can operate faster than SDTW. Moreover, the Hough transform is a simpler algorithm. These properties make the Hough transform an alternative to more complex alignment algorithms when precision in intra-alignment (e.g. at the note level) is not necessary. Given these observations, we select alignment of the first 16 seconds of the score using the Hough transform as the optimal setting.

For the score fragments longer than 8 seconds, the tonic identification errors always occur in two historical recordings, where the recording speed (hence the pitch) is not stable, and in another recording, where the musicians sometimes play the repetition by transposing the melodic intervals by a fifth. Even though tonic identification fails in these cases, the fragments are correctly aligned to the score. For such recordings, the stability of the tonic frequency can be assessed, and the tonic frequency can be refined locally by referring to the aligned tonic notes in the alignment path computed using SDTW.

From Figure 5, we can observe that by using a simple outlier detection step based on logistic regression, we were able to reject most of the irrelevant documents in both the composition retrieval and performance retrieval scenarios. By comparing Figure 4 with Figure 5, we can also conclude that this step does not remove many relevant documents, providing reliable performance and composition matches. The usefulness of this step is most evident when the results for the queries with no relevant documents are checked (Figure 6). For such queries, since all the documents typically have a low, comparable similarity, our methodology is able to reject almost all the irrelevant documents. From Figure 6, we can also observe that the document rejection step is robust to changes in the fragment duration, the fragment location and the alignment method.

7. CONCLUSION

In this paper, we presented a methodology to identify the relevant compositions and performances in a collection consisting of audio recordings and music scores, using transposition-invariant partial audio-score alignment. To the best of our knowledge, our methodology is the first automatic composition identification proposed for OTMM. The methodology is highly successful, achieving 0.96 MAP in retrieving the compositions performed in a recording and 0.95 MAP in retrieving the audio recordings where a composition is performed. What is more, our methodology is not only reliable in identifying relevant compositions and audio recordings but also in identifying the cases when there are no relevant documents for a given query. Our algorithm additionally identifies the tonic frequency of the performance of each composition in the audio recording almost perfectly, as a result of the partial audio-score alignment.

Our results indicate that the Hough transform can be a cheaper yet effective alternative to alignment methods with more temporal flexibility, such as SDTW, in finding musically relevant patterns. As a next step, we would like to evaluate our method on more forms, possibly with shorter structural elements, such as the vocal form şarkı. We would also like to investigate network analysis methods to identify the relevant performances and compositions jointly. Our method can easily be adapted to neighboring music cultures such as Greek, Armenian, Azerbaijani, Arabic and Persian music, which share similar melodic characteristics. We hope that our method will be a starting point for future studies in automatic composition identification, and facilitate future research and applications on linked data, automatic music description, discovery and archival.

Acknowledgments

We are thankful to Dr. Joan Serrà for his suggestions on this work. This work is partly supported by the European Research Council under the European Union's Seventh Framework Program, as part of the CompMusic project (ERC grant agreement 267583).
8. REFERENCES

[1] D. P. Ellis and G. E. Poliner, "Identifying 'cover songs' with chroma features and dynamic programming beat tracking," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 4, 2007, pp. 1429–1432.
[2] J. Serrà, X. Serra, and R. G. Andrzejak, "Cross recurrence quantification for cover song identification," New Journal of Physics, vol. 11, no. 9, 2009.
[3] V. Thomas, C. Fremerey, M. Müller, and M. Clausen, "Linking sheet music and audio – challenges and new approaches," in Multimodal Music Processing, ser. Dagstuhl Follow-Ups. Dagstuhl, Germany: Schloss Dagstuhl–Leibniz-Zentrum für Informatik, 2012, vol. 3, pp. 1–22.
[4] Y. Raimond, S. A. Abdallah, M. Sandler, and F. Giasson, "The music ontology," in Proceedings of the 8th International Conference on Music Information Retrieval (ISMIR), Vienna, Austria, 2007, pp. 417–422.
[5] S. Şentürk, S. Gulati, and X. Serra, "Score informed tonic identification for makam music of Turkey," in Proceedings of the 14th International Society for Music Information Retrieval Conference (ISMIR), Curitiba, Brazil, 2013, pp. 175–180.
[6] E. B. Ederer, "The theory and praxis of makam in classical Turkish music 1910–2010," Ph.D. dissertation, University of California, Santa Barbara, September 2011.
[7] E. Popescu-Judetz, Meanings in Turkish Musical Culture. Istanbul: Pan Yayıncılık, 1996.
[8] S. Şentürk, A. Holzapfel, and X. Serra, "Linking scores and audio recordings in makam music of Turkey," Journal of New Music Research, vol. 43, no. 1, pp. 34–52, 2014.
[9] E. Gómez, "Tonal description of music audio signals," Ph.D. dissertation, Universitat Pompeu Fabra, 2006.
[10] M. Müller, F. Kurth, and M. Clausen, "Audio matching via chroma-based statistical features," in Proceedings of the 6th International Conference on Music Information Retrieval (ISMIR), 2005.
[11] J. Salamon and E. Gómez, "Melody extraction from polyphonic music signals using pitch contour characteristics," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 6, pp. 1759–1770, 2012.
[12] H. S. Atlı, B. Uyar, S. Şentürk, B. Bozkurt, and X. Serra, "Audio feature extraction for exploring Turkish makam music," in Proceedings of the 3rd International Conference on Audio Technologies for Music and Media, Bilkent University, Ankara, Turkey, 2014.
[13] A. C. Gedik and B. Bozkurt, "Pitch-frequency histogram-based music information retrieval for Turkish music," Signal Processing, vol. 90, no. 4, pp. 1049–1063, 2010.
[14] R. O. Duda and P. E. Hart, "Use of the Hough transformation to detect lines and curves in pictures," Communications of the ACM, vol. 15, no. 1, pp. 11–15, 1972.
[15] A. Holzapfel, U. Şimşekli, S. Şentürk, and A. T. Cemgil, "Section-level modeling of musical audio for linking performances to scores in Turkish makam music," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia, 2015, pp. 141–145.
[16] M. Müller and D. Appelt, "Path-constrained partial music synchronization," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2008, pp. 65–68.
[17] B. Niedermayer, "Accurate audio-to-score alignment – data acquisition in the context of computational musicology," Ph.D. dissertation, Johannes Kepler Universität, Linz, February 2012.
[18] X. Anguera and M. Ferrarons, "Memory efficient subsequence DTW for query-by-example spoken term detection," in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), 2013, pp. 1–6.
[19] M. Müller, Information Retrieval for Music and Motion. Springer, 2007.
[20] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008.
[21] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly detection: A survey," ACM Computing Surveys, vol. 41, no. 3, pp. 15:1–15:58, July 2009.
[22] B. Uyar, H. S. Atlı, S. Şentürk, B. Bozkurt, and X. Serra, "A corpus for computational research of Turkish makam music," in Proceedings of the 1st International Digital Libraries for Musicology Workshop, London, United Kingdom, 2014, pp. 57–63.
[23] K. Karaosmanoğlu, "A Turkish makam music symbolic database for music information retrieval: SymbTr," in Proceedings of the 13th International Society for Music Information Retrieval Conference (ISMIR), 2012, pp. 223–228.