Phrase-based Rāga Recognition using Vector Space Modeling

Sankalp Gulati*, Joan Serrà^, Vignesh Ishwar*, Sertan Şentürk* and Xavier Serra*

*Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain ^Telefonica Research, Barcelona, Spain

The 41st IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 2016

Indian Art Music

q Indian subcontinent: India, Pakistan, Bangladesh, Sri Lanka, Nepal and Bhutan
q Indian Art Music: Hindustani music
q Rāga: melodic framework
§ Grammar for improvisation and composition

Automatic Rāga Recognition

[Diagram] Training corpus of rāgas (Yaman, Shankarabharnam, Todī, Darbari, Kalyan, Bageśrī, Kambhojī, Hamsadhwani, Des, Kirvani, Behag, Begada) → Rāga recognition system → Rāga label

Rāga Characterization

Rāga Characterization: Svaras

[Figure] Predominant pitch contour (time in s) and pitch salience histogram: normalized salience vs. frequency in cents (fref = tonic frequency; 1 bin = 10 cents, ref = 55 Hz), with peaks marked at the lower Sa, middle (tonic) Sa and higher Sa

q P. Chordia and S. Şentürk, "Joint recognition of raag and tonic in North Indian music," Computer Music Journal, vol. 37, no. 3, pp. 82–98, 2013.
q G. K. Koduri, S. Gulati, P. Rao, and X. Serra, "Rāga recognition based on pitch distribution methods," Journal of New Music Research, vol. 41, no. 4, pp. 337–350, 2012.

Rāga Characterization: Intonation

[Figure] Pitch contour and pitch salience histogram (as above)

q G. K. Koduri, V. Ishwar, J. Serrà, and X. Serra, "Intonation analysis of rāgas in Carnatic music," Journal of New Music Research, vol. 43, no. 1, pp. 72–93, 2014.
q H. G. Ranjani, S. Arthi, and T. V. Sreenivas, "Carnatic music analysis: Shadja, swara identification and verification in alapana using stochastic models," in IEEE WASPAA, 2011, pp. 29–32.
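The histogram on these slides is essentially a tonic-normalized pitch salience distribution. Below is a rough illustrative sketch, with an assumed bin width and range (the exact parameters of the cited PCD-based methods differ), of how such a histogram can be computed from a cents-valued pitch contour:

```python
import numpy as np

def pitch_salience_histogram(cents, bin_width=10, lo=-1200, hi=2400):
    """Histogram of a tonic-normalized pitch contour given in cents.

    bin_width, lo and hi are illustrative assumptions, not the cited papers' settings.
    """
    cents = np.asarray(cents, dtype=float)
    cents = cents[np.isfinite(cents)]                  # discard unvoiced frames
    edges = np.arange(lo, hi + bin_width, bin_width)
    hist, edges = np.histogram(cents, bins=edges)
    hist = hist / max(hist.max(), 1)                   # normalized salience in [0, 1]
    return hist, edges

# Toy contour concentrated around the tonic (0 cents) and the fifth (700 cents)
rng = np.random.default_rng(0)
contour = np.concatenate([rng.normal(0, 20, 500), rng.normal(700, 20, 300)])
hist, edges = pitch_salience_histogram(contour)
print(edges[np.argmax(hist)])   # left edge (in cents) of the most salient bin
```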

Rāga Characterization: Ārōh-Avrōh

[Figure] Pitch contour (time in s)

q Ascending-descending pattern; melodic progression
q Approaches: melodic progression templates, n-gram distributions, hidden Markov models

q S. Shetty and K. K. Achary, "Raga mining of Indian music by extracting arohana-avarohana pattern," Int. Journal of Recent Trends in Engineering, vol. 1, no. 1, pp. 362–366, 2009.
q V. Kumar, H. Pandya, and C. V. Jawahar, "Identifying ragas in Indian music," in 22nd Int. Conf. on Pattern Recognition (ICPR), 2014, pp. 767–772.
q P. V. Rajkumar, K. P. Saishankar, and M. John, "Identification of Carnatic raagas using hidden Markov models," in IEEE 9th Int. Symposium on Applied Machine Intelligence and Informatics (SAMI), 2011, pp. 107–110.

Rāga Characterization: Melodic motifs

[Figure] Pitch contours with melodic motifs for Rāga A, Rāga B and Rāga C

q R. Sridhar and T. V. Geetha, "Raga identification of Carnatic music for music information retrieval," International Journal of Recent Trends in Engineering, vol. 1, no. 1, pp. 571–574, 2009.
q S. Dutta, S. PV Krishnaraj, and H. A. Murthy, "Raga verification in Carnatic music using longest common segment set," in Int. Soc. for Music Information Retrieval Conf. (ISMIR), 2015, pp. 605–611.

Goal
q Automatic rāga recognition

[Diagram] Training corpus of rāgas (Yaman, Shankarabharnam, Todī, Darbari, Kalyan, Bageśrī, Kambhojī, Hamsadhwani, Des, Harikambhoji, Kirvani, Atana, Behag, Kapi, Begada), each recording represented by its pitch contour (time in s) → Rāga recognition system → Rāga label

Block Diagram: Proposed Approach

[Block diagram] Carnatic music collection →
Melodic Pattern Discovery (Data Processing → Intra-recording Pattern Discovery → Inter-recording Pattern Detection) →
Melodic Pattern Clustering (Pattern Network Generation → Similarity Thresholding → Community Detection) →
Feature Extraction (Vocabulary Extraction → Term-frequency Feature Extraction → Feature Normalization) →
Feature Matrix

q S. Gulati, J. Serrà, V. Ishwar, and X. Serra, "Mining melodic patterns in large audio collections of Indian art music," in Int. Conf. on Signal Image Technology & Internet Based Systems - MIRA, Marrakesh, Morocco, 2014, pp. 264–271.

Proposed Approach: Pattern Discovery

Data Processing → Intra-recording Pattern Discovery → Inter-recording Pattern Detection

Data Processing:
- Predominant pitch estimation
- Downsampling
- Hz to cents conversion
- Tonic normalization
- Brute-force segmentation
- Uniform time-scaling
- Segment filtering (flat / non-flat segments)
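As a concrete illustration of the tonic normalization and Hz-to-cents conversion steps, here is a minimal sketch (not the authors' implementation; the contour and tonic value are assumed for the example):

```python
import numpy as np

def hz_to_cents(pitch_hz, tonic_hz):
    """Convert a predominant-pitch contour from Hz to cents relative to the tonic.

    Unvoiced frames (pitch <= 0) are mapped to NaN so later stages can skip them.
    """
    pitch_hz = np.asarray(pitch_hz, dtype=float)
    cents = np.full(pitch_hz.shape, np.nan)
    voiced = pitch_hz > 0
    cents[voiced] = 1200.0 * np.log2(pitch_hz[voiced] / tonic_hz)
    return cents

# Example: a short contour with one unvoiced frame, assumed tonic of 220 Hz
contour_hz = [220.0, 246.9, 0.0, 261.6, 293.7]
print(hz_to_cents(contour_hz, tonic_hz=220.0))
```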

Intra-recording Pattern Discovery:
- Seed pattern discovery within each recording (Recording 1, Recording 2, …, Recording N)

Inter-recording Pattern Detection:
- Inter-recording search
- Rank refinement

Proposed Approach: Pattern Clustering

Pattern Network Generation → Similarity Thresholding → Community Detection

Pattern Network Generation:
- Undirected network of melodic patterns
q M. E. J. Newman, "The structure and function of complex networks," Society for Industrial and Applied Mathematics (SIAM) Review, vol. 45, no. 2, pp. 167–256, 2003.

Similarity Thresholding:
- Similarity threshold Ts
q S. Maslov and K. Sneppen, "Specificity and stability in topology of protein networks," Science, vol. 296, no. 5569, pp. 910–913, 2002.

Community Detection:
q V. D. Blondel, J. L. Guillaume, R. Lambiotte, and E. Lefebvre, "Fast unfolding of communities in large networks," Journal of Statistical Mechanics: Theory and Experiment, vol. 2008, no. 10, p. P10008, 2008.
q S. Fortunato, "Community detection in graphs," Physics Reports, vol. 486, no. 3, pp. 75–174, 2010.
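To make the clustering stage concrete, here is a minimal sketch under stated assumptions (a pairwise similarity matrix from the discovery stage, a given threshold Ts, and NetworkX with Louvain community detection available); it is not the authors' implementation:

```python
import numpy as np
import networkx as nx
from networkx.algorithms.community import louvain_communities

def cluster_patterns(similarity, Ts, seed=0):
    """Build an undirected pattern network, threshold it at Ts, and detect communities.

    `similarity` is an assumed symmetric (n_patterns x n_patterns) matrix of
    melodic similarities; `Ts` is the similarity threshold discussed above.
    """
    n = similarity.shape[0]
    graph = nx.Graph()
    graph.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if similarity[i, j] >= Ts:                      # similarity thresholding
                graph.add_edge(i, j, weight=similarity[i, j])
    # Louvain-style community detection (Blondel et al., 2008), as cited above
    return louvain_communities(graph, weight="weight", seed=seed)

# Toy example with a random symmetric similarity matrix
rng = np.random.default_rng(1)
sim = rng.random((8, 8))
sim = (sim + sim.T) / 2.0
print(cluster_patterns(sim, Ts=0.6))
```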

Proposed Approach: Feature Extraction

q Analogy with text/document classification: phrases ↔ words, rāga ↔ topic, recordings ↔ documents
q Term Frequency-Inverse Document Frequency (TF-IDF)

Vocabulary Extraction → Term-frequency Feature Extraction → Feature Normalization

[Example] Term-frequency matrix of phrase-cluster occurrence counts per recording:
1 0 4
2 2 0
1 1 0
3 3 3

F1(p, r) = 1 if f(p, r) > 0, and 0 otherwise        (1)

F2(p, r) = f(p, r)                                  (2)

F3(p, r) = f(p, r) × irf(p, R)                      (3)

irf(p, R) = log( N / |{r ∈ R : p ∈ r}| )            (4)

where f(p, r) is the raw frequency of occurrence of phrase p in recording r, R is the set of recordings, N is the number of recordings, and |{r ∈ R : p ∈ r}| is the number of recordings in which phrase p occurs.
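A minimal sketch of these three feature variants, assuming a (recordings × phrase clusters) matrix of raw occurrence counts f(p, r) from the clustering stage (illustrative code, not the authors' implementation):

```python
import numpy as np

def phrase_features(counts):
    """Compute the F1, F2 and F3 phrase-frequency features defined in Eqs. (1)-(4)."""
    counts = np.asarray(counts, dtype=float)
    N = counts.shape[0]                           # number of recordings in R
    F1 = (counts > 0).astype(float)               # presence / absence, Eq. (1)
    F2 = counts.copy()                            # raw frequency, Eq. (2)
    n_rec_with_phrase = np.maximum((counts > 0).sum(axis=0), 1)
    irf = np.log(N / n_rec_with_phrase)           # inverse recording frequency, Eq. (4)
    F3 = counts * irf                             # frequency weighted by irf, Eq. (3)
    return F1, F2, F3

# Toy example: 3 recordings, 4 phrase clusters
counts = np.array([[1, 2, 1, 3],
                   [0, 2, 1, 3],
                   [4, 0, 0, 3]])
F1, F2, F3 = phrase_features(counts)
print(F3)
```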

Evaluation: Music Collection
q Corpus: CompMusic Carnatic music
§ Commercially released music (~325 CDs)
§ Metadata available in MusicBrainz
q Datasets: subsets of the corpus
§ DB40rāga
¨ 480 audio recordings, 124 hours of music
¨ 40 diverse rāgas
¨ 310 compositions, 62 unique artists
§ DB10rāga
¨ 10-rāga subset of DB40rāga

http://compmusic.upf.edu/node/278

Evaluation: Classification methodology
q Experimental setup
§ Stratified 12-fold cross-validation (balanced folds)
§ Repeat the experiment 20 times
§ Evaluation measure: mean classification accuracy
q Classifiers
§ Multinomial, Gaussian and Bernoulli naive Bayes (NBM, NBG and NBB)
§ SVM with a linear and an RBF kernel, and with SGD learning (SVML, SVMR and SGD)
§ Logistic regression (LR) and random forest (RF)
q Statistical significance
§ Mann-Whitney U test (p < 0.01)
§ Multiple comparisons: Holm-Bonferroni method

q H. B. Mann and D. R. Whitney, "On a test of whether one of two random variables is stochastically larger than the other," The Annals of Mathematical Statistics, vol. 18, no. 1, pp. 50–60, 1947.
q S. Holm, "A simple sequentially rejective multiple test procedure," Scandinavian Journal of Statistics, vol. 6, no. 2, pp. 65–70, 1979.
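For illustration, a minimal sketch of this evaluation setup with the current scikit-learn API (the study used scikit-learn 0.15.1 with default classifier parameters; the feature matrix and labels below are toy placeholders, not the paper's data):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(120, 300))   # toy term-frequency features (F2-like)
y = np.repeat(np.arange(10), 12)          # 10 raagas x 12 recordings each

accuracies = []
for rep in range(20):                     # repeat the experiment 20 times
    cv = StratifiedKFold(n_splits=12, shuffle=True, random_state=rep)
    scores = cross_val_score(MultinomialNB(), X, y, cv=cv, scoring="accuracy")
    accuracies.append(scores.mean())

print("mean classification accuracy: %.3f" % np.mean(accuracies))
```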

q Comparison with the state of the art
§ S1: Pitch-class-distribution (PCD)-based method with a 1NN classifier using the Bhattacharyya distance (PCD120, PCDfull)
§ S2: Parameterized pitch-distribution-based method (PDparam)
§ Both methods run on our datasets with their original implementations, using the same predominant melody and tonic as input
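A rough sketch of the S1 baseline as described above: 1-nearest-neighbour classification over pitch-class distributions with the Bhattacharyya distance. The PCDs here are assumed, pre-computed 120-bin histograms; this is not the original authors' implementation:

```python
import numpy as np

def bhattacharyya_distance(p, q, eps=1e-12):
    """Bhattacharyya distance between two (unnormalized) histograms."""
    p = p / (p.sum() + eps)
    q = q / (q.sum() + eps)
    return -np.log(np.sum(np.sqrt(p * q)) + eps)

def predict_1nn(train_pcds, train_labels, test_pcd):
    """Label of the training PCD closest to the test PCD."""
    dists = [bhattacharyya_distance(test_pcd, pcd) for pcd in train_pcds]
    return train_labels[int(np.argmin(dists))]

# Toy example with random "PCDs" for two raagas
rng = np.random.default_rng(0)
train_pcds = rng.random((6, 120))
train_labels = ["Yaman", "Yaman", "Yaman", "Des", "Des", "Des"]
print(predict_1nn(train_pcds, train_labels, rng.random(120)))
```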

q P. Chordia and S. Şentürk, "Joint recognition of raag and tonic in North Indian music," Computer Music Journal, vol. 37, no. 3, pp. 82–98, 2013.
q G. K. Koduri, V. Ishwar, J. Serrà, and X. Serra, "Intonation analysis of rāgas in Carnatic music," Journal of New Music Research, vol. 43, no. 1, pp. 72–93, 2014.

Results

Table 1. Accuracy (in %) of the different methods (Mtd) for the two datasets (db) using different classifiers and features (Ftr); M denotes the proposed method.

db         Mtd  Ftr       NBM   NBB   LR    SVML  1NN
DB10rāga   M    F1        90.6  74    84.1  81.2  -
DB10rāga   M    F2        91.7  73.8  84.8  81.2  -
DB10rāga   M    F3        90.5  74.5  84.3  80.7  -
DB10rāga   S1   PCD120    -     -     -     -     82.2
DB10rāga   S1   PCDfull   -     -     -     -     89.5
DB10rāga   S2   PDparam   37.9  11.2  70.1  65.7  -
DB40rāga   M    F1        69.6  61.3  55.9  54.6  -
DB40rāga   M    F2        69.6  61.7  55.7  54.3  -
DB40rāga   M    F3        69.5  61.5  55.9  54.5  -
DB40rāga   S1   PCD120    -     -     -     -     66.4
DB40rāga   S1   PCDfull   -     -     -     -     74.1
DB40rāga   S2   PDparam   20.8  2.6   51.4  44.2  -

q Highest accuracy of M: 91.7% on DB10rāga and 69.6% on DB40rāga (a significant drop as the number of rāgas grows)
q For both datasets, F1, F2 and F3 yield nearly the same accuracy for each classifier, with no statistically significant difference: the mere presence or absence of a melodic phrase, irrespective of its frequency of occurrence, is sufficient for rāga recognition
q PCD120 uses the 120 s window reported as optimal by Chordia and Şentürk; PCDfull computes the PCD over the entire recording

Error Analysis

q Confusions concentrated among allied rāgas (e.g., Kalyāṇi, Varāḷi, Tōḍi, Pūrvikalyāṇi)

Error Analysis: complementary with S1
q Errors of M (F1) and S1 (PCDfull) are largely complementary

Summary
q Phrase-based rāga recognition using VSM is a successful strategy
q Mere presence/absence of melodic phrases is enough to recognize the rāga
q Multinomial naive Bayes classifier outperforms the rest
q Similarity threshold chosen from the topological properties of the pattern network
q Complementary errors in prediction compared to PCD-based methods (future work)

Resources
q Rāga dataset: http://compmusic.upf.edu/node/278
q Demo: http://dunya.compmusic.upf.edu/pattern_network/
q CompMusic project: http://compmusic.upf.edu/
q Related datasets: http://compmusic.upf.edu/datasets

Phrase-based Rāga Recognition using Vector Space Modeling

Sankalp Gulati*, Joan Serrà^, Vignesh Ishwar*, Sertan Şentürk* and Xavier Serra*

*Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain ^Telefonica Research, Barcelona, Spain

The 41st IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 2016