Bundeli Folk-Song Genre Classification with Knn And
Total Page:16
File Type:pdf, Size:1020Kb
Bundeli Folk-Song Genre Classification with kNN and SVM Ayushi Pandey Indranil Dutta Dept. of Computational Linguistics Dept. of Computational Linguistics The EFL University The EFL University [email protected] [email protected] Abstract hand, which encompasses several administrative While large data dependent techniques districts in India. While the 2001 census identi- fies 3,070,000 Bundeli speakers, Ethnologue esti- have made advances in between-genre 1 classification, the identification of sub- mates 20,000,000 speakers . Inspite of the large types within a genre has largely been over- population, Bundeli, often considered a dialect of looked. In this paper, we approach auto- Western Hindi, would be considered an under- matic classification of within-genre Bun- resourced language because of the lack of tex- deli folk music into its subgenres; Gaari, tual and literary resources that are available. Bun- Rai and Phag. Bundeli, which is a domi- delkhand, however, is home to a rich tradition of nant dialect spoken in a large belt of Ut- lyrical styles and genres that are both performa- tar Pradesh and Madhya Pradesh has a tive and poetic. In addition to the major gen- rich resource of folk songs and an atten- res; Gaaris, Rais and Phags, several other gen- dant folk tradition. First, we success- res can be found in this region, including Sora, Hori, and Limtera. With the exception of sev- fully demonstrate that a set of common 2 stopwords in Bundeli can be used to per- eral Phags , Bundeli is a resource poor language, form broad genre classification between in that there are no available sources of textual standard Bundeli text (newspaper corpus) material. Using a bag-of-words technique on the and lyrics. We then establish the prob- lyrics corpus, and Term Frequency-Inverse Docu- lem of structural and lexical similarity in ment Frequency tfidf scores, we report on the per- within-genre classification using n-grams. formance of a k-Nearest Neighbour (kNN) clas- Finally, we classify the lyrics data into sifier. We demonstrate that a 10-fold kNN cross the three genres using popular machine- validation exhibits an accuracy of nearly 85% in learning classifiers: Support Vector Ma- classifying within the Bundeli folk genres when a chine (SVM) and kNN classifiers achiev- k of 2 neighbours is used. Using the SVM clas- ing 91.3% and 85% and accuracy respec- sification, we achieve an accuracy of over 90%. tively. We also use a Na¨ıve Bayes clas- Following a brief description of related classifica- sifier which returns an accuracy of 75%. tion techniques applied to song genre classifica- Our results underscore the need to extend tion within Indian and Western music, in general, popular classification techniques to sparse in section 2 below, we provide a detailed account and small corpora, so as to perform hith- of methods used for creating news and folk song erto neglected within genre classification lyrics corpora in section 3. Following that, in sec- and also exhibit that well known classi- tion 4, we emphasize the need to use commonly- fiers can also be employed in classifying removed stopwords towards affecting a classifica- ‘small’ data. tion between news and song lyrics corpora. In this section, we also report on the feasibility of 1 Introduction using probability density functions on word level Bundeli is spoken in regions of Madhya Pradesh 1http://www.ethnologue.com/language/bns and Uttar Pradesh, in a region known as Bundelk-133 2http://www.kavitakosh.org D S Sharma, R Sangal and J D Pawar. Proc. of the 11th Intl. Conference on Natural Language Processing, pages 133–138, Goa, India. December 2014. c 2014 NLP Association of India (NLPAI) n-grams to better understand lexical and structural be successfully used to classify, relatively small similarity within-genres. In section 5, we present corpus of Bundeli folk songs into specific genres; detailed classification analyses on the song lyrics namely, Gaari, Rai and Phag. corpus and show the extent of accuracy that can be achieved when popular machine learning tech- 3 Corpus Creation niques are employed to classify within folk genres, The three genres Gaari, Phag and Rai were cho- that exhibit a great deal of lexical and structural sen because they are commercially available in the overlap. In Section 6, we present the conclusions form of analog audio tapes. These audio tapes our analysis and also future directions for further were digitized and converted to MP3 format for research. transcription and future analyses. As far as stylis- tic register go, gaaris are marriage songs that are 2 Related Work sung in repeated choruses. The chorus is period- ically repeated after each verse. Phags are lyrical Song genre classification as one of the primary poems, showing rich lexical diversity and rhyth- tasks of music information retrieval, has been ap- mic meter. Rais are dance songs, sung to the beat proached from analysis and classification of au- of the increasing speed of the Dholak (local per- dio signal features (Kini et al., 2011; Jothilak- cussion instrument). In Rais, we see the usage shmi and Kathiresan, 2012; Tzanetakis and Cook, of repeated lines more than any other genre. In 2002); retrieval of lyrics based features (Howard Gaaris and Rais, a significant level of semantic et al., 2011; Mayer et al., 2008) and approaches overlap can be predicted. Both use simple, con- which use both audio and lyrics based features versational terms, and owing to their content being for genre classification (Mayer and Rauber, 2011; repeated, they pose a challenge for classification. Neumayer and Rauber, 2007). Within the West- These patterns can be evinced from repetitive cho- ern context, both popular, classical, and folk mu- ruses as in Figure 1, below. sic genres have been classifed using common ma- chine learning algorithms. The techniques used for classification include both stochastic and prob- abilistic methods; Hidden Markov Models (Chai and Vercoe, 2001), Machine-learning etc. How- ever, within the Indian context most all work on song genre classification has been restricted to audio feature vector extraction and classifica- tion. More precisely, various classification tech- niques such as Gaussian Mixture Models (GMM), k-nearest neighbour (kNN) classifiers and Support Vector Machine classifiers (SVM) have been em- ployed to classify between Hindustani, Carnatic, Ghazal, Folk and Indian Western genres (Kini Figure 1: Panel A depicts Gaari with repeated cho- et al., 2011) and north Indian devotional music rus. Panel B shows Rais with repeated lines. Panel (Jothilakshmi and Kathiresan, 2012). The only in- C shows the lyrical form of the Phag stance of lyrics based classification has been ex- plored in the context of Bollywood music in an Therefore, within-genre classification remains a effort to identify specific features of Bollywood problem at the textual level because there is signif- song lyrics using Complex Networks (Behl and icant lexical overlap between Gaaris and Rais. Choudhury, 2011). However, classification of folk genres has not received any attention so far. In 3.1 Song Corpus this paper, we propose a two-fold approach, first, The songs come from various sources. Our first we suggest that classification between broad text source was a collection from an oral tradition of genres such as news and songs can be success- singing from regions near Sagar, Madhya Pradesh. fully accomplished using common stopwords in Online videos and audio cassestes from a Jhansi Bundeli. And second, we also demonstrate that based casette company, Kanhaiya Casette, became big data based machine learning approaches could134 our second source. Phags were mined from a web resource for poetry and prose which contains col- Type of text Number 3 lections of the famous Bundeli poet Isuri . Isuri’s News Articles 98 phags, however, are also available in a few text- Gaaris (G) 37 books on Bundeli folk culture. The lyrics were or- Rais (R) 39 thographically transcribed by listening to the song Phags (P) 40 files in MP3 format in Devanagari. Owing to poor Total songs 116 audio quality of the songs, they required a native (G+R+P) Bundeli speaker for transcription, the first author being one. Orthographic normalisation was main- Table 1: Details of the news and song corpora tained for words that used non-contrastive phone- mic or morphological features. For example, the function word “se” and “sein” meaning “from” this classification. One way ANOVA with Broad was normalised to “sein”. Similarly for “un- Genre as predictor shows a significant main effect honein” and “unne”, meaning third person erga- F[1,26]=5.438;p<0.05. As the boxplot in Figure tive was normalised to “unne”. Where necessary, 2 shows, stopword token frequencies are signifi- similar normalization procedures were used to ho- cantly higher in news as compared to songs. Thus, mogenize orthographic variation. Since there was stopwords when included in any corpus, can help no available list of stopwords, we adopted the cor- classify news from songs, even though, more com- responding stoplist from Hindi. We narrowed the monly, stopwords are excluded during the prepro- scope of the exhaustive stopword list by select- cessing stages of standard classification-based ap- ing only the most frequent stopwords occurring in proaches. songs. We also performed a test of lexical diversity to classify between the news and song corpus. Using 3.2 News corpus Python scripts we calculated the type-token ratio Before analysing the songs within themselves, we of the two broad genres. The news corpus had a needed to establish a differentiation between a lexical diversity of 0.028% while the song corpus generalised corpus and a song corpus. The only had nearly half the lexical diversity of the news available online resource was the publication of a corpus, i.e., 0.014%.