Received: March 9, 2017

Neural Network Based Indian Folk Song Classification Using MFCC and LPC

Malay Bhatt 1*, Tejas Patalia 2

1Rai University, Ahmedabad, 2V.V.P. Engineering College, Rajkot, India * Corresponding author’s Email: [email protected]

Abstract: A large number of videos are uploaded on the web or added as situational songs in movies. The classification of folk dance videos is essential for dance education, for preserving cultural heritage, and for companies that aim to provide better customer-oriented services. India is a country with many regional languages, and each region has its own popular folk dances. Four different Indian folk dances, namely 'Garba', 'Lavani', 'Bhangra' and 'Ghoomar', are considered. A Folk Dance Classification Framework is proposed which extracts the audio signal from a video, takes a fragment of 125 seconds from the beginning, separates it into a set of small segments, calculates Mel-frequency Cepstral Coefficients (MFCC) and Linear Predictive Coding (LPC) coefficients, generates a high dimensional feature vector, reduces dimensionality using Principal Component Analysis (PCA) and classifies segments using a Scaled Conjugate Gradient Neural Network. The performances of the chosen classifiers, K-Nearest Neighbor, Naïve Bayes and Neural Network, are compared. The class labels of all segments are clubbed together, and a class label is assigned to a folk dance song by majority voting. The system achieves more than 90% accuracy.

Keywords: Classification; Folk dance song; Mel-frequency Cepstral Coefficients; Neural network; Linear predictive coding

1. Introduction

Human generated tags are used to categorize and describe the living and non-living things of the vast universe. The tags used for dance are known as 'Dance Genre'. A dance song comprises purposely selected sequences of human movement accompanied by vocals and background music. Folk music and folk dances constitute a significant part of the folk heritage around the world. The preservation of folk music and choreographies and their dissemination to the younger generations is a very important issue, since folk dance forms an important part of a country's or region's history and culture [1]. Many years ago, classification of dance genre was purely manual, through conversation with the public and experts' opinion. Automatic dance genre classification is essential because of resource digitization and the plenty of new audio and video songs that are added every year. Currently, most of the popular sites provide meta-data or text based annotation. Automatic dance song analysis will be one of the services used by music distributors to catch the attention of music lovers and to achieve maximum profit. For more than a decade, researchers have been paying attention to audio signal analysis and classification.

The earliest civilizations discovered in the Indian subcontinent are those of Mohenjo Daro and Harappa in the Indus valley, dated to about 6000 B.C. It seems that by that time dance had achieved a deemed measure of discipline, and it is likely that it played some important role in society. Two beautiful little statuettes of bronze dancing girls were found at Mohenjo Daro, in 1926 and 1931 respectively [2].

There are mainly two popular dance categories in India: Classical Dance and Folk Dance. Classical Dance was practiced in courts, temples and on special

International Journal of Intelligent Engineering and Systems, Vol.10, No.3, 2017 DOI: 10.22266/ijies2017.0630.19

occasions. Folk dance is a term broadly used to describe all forms of folk and tribal dances in regions across India. Folk dance forms are practiced in groups in rural areas as an expression of daily work and rituals. Folk and tribal dances are performed for every possible occasion: to celebrate the arrival of seasons, the birth of a child, a wedding and festivals [3]. The dances burst with verve and vitality. Men and women perform some dances exclusively, while in some performances they dance together. On most occasions, the dancers sing themselves, while being accompanied by artists on instruments. Some of the popular folk dances performed across villages and cities are 'Bhangra', 'Garba', 'Lavani', 'Ghoomar' and 'Bihu' [4]. The effect of various factors, along with a comparative analysis, on automatic Indian music recognition, classification and retrieval is described in [5]. The artist (singer) significantly affects genre based classification of dance songs. Songs sung by the same artist in both training and testing result in over-optimistic accuracy. Artist filters for genre based song classification are discussed in [6]. We have therefore collected dance songs of the same genre by different artists to eliminate the effect of the artist.

Lavani: Lavani is a popular folk form of Maharashtra. Traditionally, the songs are sung by female artists, but male artists may occasionally sing Lavani. The dance format associated with Lavani is known as 'Tamasha'. This format involves the dancer ('Tamasha Bai'), the helping dancer ('Maavshi'), the drummer ('Dholki Vaala') and the flute boy ('Baasnuri Vaala') [3].

Bhangra: Bhangra is a form of dance-oriented folk music of Punjab. The present musical style is derived from non-traditional musical accompaniment to the riffs of Punjab called by the same name [3].

Garba: Garba is a folk dance of Gujarat. It has been adopted all over the world. A Garba song has a lot of music in it; the traditional music is composed of drums, flute, etc. It is more about music than vocals [3].

Ghoomar: Ghoomar is a folk dance of Rajasthan. It has also been adopted widely in Bollywood movies and has a lot of music in it [3].

Dance and song sequences have been an integral component of films across the country. With the introduction of sound to cinema in the film 'Alam Ara' in 1931, choreographed dance sequences became ubiquitous in Hindi and other Indian films [7]. Often in movies the actors do not themselves sing the songs they dance to, but have another artist sing in the background; for an actor to sing in the song is unlikely but not rare. The dances in Bollywood can range from slow dancing to a more upbeat hip hop style. The dancing itself is a fusion of Indian classical, Indian folk, Indian tribal, belly dancing, jazz, hip hop and everything else. A video song can be extracted from Indian movies or downloaded from popular sites such as YouTube, which provides a large number of songs rapidly. It is difficult for a normal user to identify the folk dance form, because each form pertains to a particular region, and folk dances have cultural differences too. Classification of a dance form can be done using audio and/or visual cues.

Folk dance classification based on visual cues involves a large amount of video processing, such as key-frame extraction, shot boundary detection, moving object (dancer) detection, feature based posture identification, recognition and classification. Visual cues based classification is heavily affected by how the actual dance song is choreographed and how it is picturized by cameramen. Many choreographers and cameramen prefer different kinds of shots, like Top View, Side View, Long Shot, Close Up or circular motion. It is very difficult to recognize the actual dance move under such different camera situations. When the words are important, only Close Up shots are used, so it is not possible to detect the posture. Folk dance classification based on audio cues, in contrast, involves less processing of data and gives fast recognition with good accuracy. Figure 1 shows (a) a Ghoomar and (b) a Garba shot.

Indian movies consist of two main events: Non-Song and Song. The audio part can easily be separated from the extracted video song. Each dance form has its own non-vocal and vocal part. The non-vocal part comprises instruments which are widely used in that specific state/region. The vocal part also differs from region to region. Classification based on the vocal part is possible if the dance forms belong to different regions, each having its own natural language. Natural Language Processing gives a necessary clue regarding the region and possible dance form, but has the limitation that it cannot recognize different folk dances of the same region. The non-vocal segment distinguishes different folk dances of the same region, as the proportionate use of instruments varies significantly.


Figure 2 provides a flow chart of a content based retrieval system. An audio query is the audio file that is given as input to the system. The features of the input audio are calculated. A query of the extracted features is then generated and compared with the features of all the audio files present in the database. Based on the similarity measures, the system retrieves the required audio files from the database and presents them as the result.

The remainder of this paper is organized as follows: Section 2 covers the literature review, Section 3 gives a detailed view of the proposed Folk Dance Song Classification system, Section 4 discusses the dataset and experimental setup, Section 5 describes results and a comparison among the chosen classifiers, and Section 6 concludes our work.

Figure.1 Indian folk dance: (a) Rajasthani Ghoomar and (b) Gujarati Garba

Table 1. Popular Folk Dances in Bollywood Movies

Movie Name | Song Title | Song Type
Bajirao Mastani (2015) | Pinga | Lavani
Aiyaa | Sava Dollar | Lavani
Agneepath (2012) | Chikni Chameli | Lavani
Ferrari Ki Sawaari (2012) | Mala Jau De | Lavani
Inkaar (1978) | Mungda Mungda | Lavani
Singham Returns (2014) | Aata Majhi Satakli | Lavani
Sailaab (1990) | Humko Aajkal Hai Intezaar | Lavani
Lagaan (2001) | Radha Kaise Na Jale | Garba
Kai Po Che (2013) | Shubharambh | Garba
Jai Santoshi Maa (1975) | Jai Santoshi Maa | Garba
Hum Dil De Chuke Sanam (1999) | Dholi Taro Baje | Garba
Ramleela (2013) | Nagada Sang Dhol | Garba
Aa Ab Laut Chalein (1999) | Yehi Hai Pyaar | Garba
Jab We Met (2007) | Nagada Nagada | Bhangra
Rang De Basanti (2006) | Rang De Basanti | Bhangra
Love Aaj Kal (2009) | Ahun Ahun | Bhangra
Mujhse Shaadi Karogi (2004) | Aaja Soniye | Bhangra
Band Baaja Baaraat (2010) | Ainvayi Ainvayi | Bhangra
Paheli (2005) | Laaga Re Jal Laaga | Ghoomar

2. Literature survey

Kapsoras et al. considered classification of Greek folk dances of the Western Macedonia region as a sub-activity of human action recognition, which comes under computer vision. The creation and preservation of a folk dance database is beneficial for research and cultural heritage, and also connects the youth to their roots. Risks and precautions associated with each folk dance are stored as well, which makes the database easy for new learners and thus useful in education. Highlighted challenges of Greek dance are that most of the folk dances are group performances and that long skirts hide the legs. They considered various video songs of two popular folk dances, namely Lotzia and Capetan Loukas. The Extended Independent Subspace Algorithm and Space-Time Interest Points (STIPs) are used for feature extraction. K-means clustering is applied to the extracted features for codebook generation. A histogram over the codebook is constructed for each sequence using vector quantization. A Support Vector Machine with a chi-square kernel is used on the histogram for classification. The highest accuracy, 89%, is achieved with STIPs [8].

Figure.2 Content Based Audio Retrieval System


Samanta et al. in [9] build a method based on visual features for classification of Indian classical dances, namely Kathak, Bharatnatyam and Odissi. The authors propose a pose descriptor which is a combination of gradient and optical flow. The 168-dimensional descriptor, based on a histogram of oriented optical flow, is calculated in a hierarchical manner to represent each frame of a sequence. The pose basis is learned using an on-line dictionary learning technique, and each video is represented sparsely as a dance descriptor by pooling the pose descriptors of all frames. Finally, dance videos are classified using a support vector machine (SVM) with intersection kernel, with an average accuracy of 83.33% for Bharatnatyam, 90% for Kathak and 86.6% for Odissi.

Kaynak et al. developed audio-visual speech recognition using lip geometric features, which include the horizontal and vertical lip aperture and the first-order derivative of the lip corner. They classified a 20-hour audio-video speech database of isolated utterances from 22 nonnative English speakers using a Hidden Markov Model. This kind of system has a fixed camera position and a mostly static background for speech recognition [10]. Extraction of dialogue and action scenes from movies using audio cues is investigated in [11].

Song et al. in [12] have proposed a system for classifying audio information into four different classes: speech, non-speech, music and non-pure speech. For feature extraction, a sampling frequency of 8000 Hz is used with a frame size of 32 milliseconds, which equals 256 samples per frame (0.032*8000 = 256), and 25% overlap is considered between each pair of adjacent frames. Features like the low short time energy ratio (LSTER), high zero crossing rate ratio (HZCRR), bandwidth and noise rate were extracted from the audio, and decision tree based classification was carried out. Reasonable classification accuracy is achieved on 12000 audio clips in 'wav' format with a total duration of 200 minutes. The database contains 1236 music clips, 5935 pure speech clips, 1793 silence clips and 3036 non-pure speech clips. The highest accuracy achieved is 95% for silence and the lowest is 88.8% for non-pure speech.

Silla Jr. et al. in [13] have worked on an automatic genre classification system using ensemble classifiers, based on the extraction of multiple feature vectors from a single music piece. Three segments are extracted: the first 30 seconds from the beginning, one from the middle and one from the end. Features from the 30-second segments are extracted with 1153 audio samples per second.

Zhang and Kuo in [14] have developed a hierarchical system for audio classification and retrieval. They describe that audio features are generally divided into two categories: perceptual and physical. Perceptual features are subjective and related to human perception, for example pitch, timbre and loudness. Physical features are those that can be mathematically calculated using equations, for example the spectrum and the zero crossing rate (ZCR). Their system consists of three stages. The first stage is coarse-level audio classification and segmentation, in which samples are classified and segmented into speech, music, environmental sound and silence. Environmental sounds are further classified in the second stage using a Hidden Markov Model. The third stage implements query-by-example audio retrieval.

Tzanetakis et al. in [15] focused on music genre classification of audio signals. An audio classification hierarchy is proposed in which, at level 0, the audio signal is considered; at level 1, music and speech are separated; at level 2, music is further classified into classical, country, disco, hiphop, jazz, rock, blues, reggae, pop and metal, while speech is separated into male, female and environment (sports etc.). Classical and jazz music are further classified at level 4. The software used by them is part of MARSYAS, which is available under the GNU Public License. They achieved 61% accuracy for 10 musical genres, with timbral texture, rhythmic content and pitch content proposed as features. Similarly, Nopthaisong et al. performed Thai music classification based on audio cues in [16].

Chu et al. contributed to environmental sound recognition using time-frequency audio features. Environmental sound recognition is essential for understanding the scene or context surrounding an audio sensor. They proposed the Matching Pursuit (MP) algorithm to obtain effective time-frequency features. Matching Pursuit generates a flexible, intuitive and physically interpretable set of features using a dictionary of atoms. The MP-based features are adopted to supplement the MFCC features to achieve higher recognition accuracy. They considered 14 different environmental sounds and used k-Nearest Neighbour and Gaussian Mixture Model classifiers. The average accuracy achieved for 14 classes is 83.9%, and for 7 classes the accuracy is higher than 90%. The highest accuracy achieved with kNN (K=32) is 77.3% [17].

Palecek and Chaloupka focused in [18] on audio-visual speech recognition under environmental noise. The white and the babble

noise are added to the audio signal. Discrete Cosine Transform and Active Appearance Model (AAM) based features are identified and extracted from the video signals. These features are then enhanced through Hierarchical Linear Discriminant Analysis and normalized using a z-score approach for all speakers. MFCC based audio features are clubbed with the visual features using a middle fusion approach. Speech recognition based only on visual cues is affected by many factors, like high computational cost, the speaker's appearance, lighting conditions, the camera's relative position, and whether the camera is moving or stationary. They also mention that the speech recognition rate is very low even for very small vocabularies when based on visual cues only. An audio-visual speech database with recordings of 35 speakers in the Czech language is used. Word recordings of 30 speakers are used for training and word recordings of the remaining 5 speakers are used for testing, with a digital web camera kept in front of the speakers. They carried out three experiments: a 14-state whole word Hidden Markov Model for audio only, a 14-state whole word Hidden Markov Model for visual only, and a 2-stream 14-state whole word HMM for combined audio-visual speech recognition. They achieved a 79.2% recognition rate by combining AAM based features with audio features.

Chen et al. highlighted mixed type audio classification using a support vector machine. They took a 1 hour movie database and segmented it into 5 second segments. These five second segments are further windowed into 1 second and classified into five different classes: pure speech, pure music, environment sound, speech with music, and music with environment sound. They took the variance of the zero crossing rate, silence ratio, harmonic ratio and sub-band energy as features, and considered four different representations for each feature: min, max, (min+max)/2 and mean. It is also observed that the variance of the zero crossing rate is non-linearly separable, and hence a support vector machine (one-against-all approach) with a Gaussian kernel is selected for classification. They experimented with different values of the error penalty (C) and 'sigma', and found that SVM outperforms the kNN, NN and Naive Bayesian approaches. Because of the broad spectrum of environment sounds (opening of doors, sound of footsteps, sounds of nature and sounds of animals) clubbed with music, the highest accuracy achieved is 78.649% [19].

Li and Tzanetakis focused on factors like timbral texture, rhythmic content and pitch content features for classification of audio signals. For pairwise comparison, LIBSVM is used by the authors for multi-class classification, and Multicategory Proximal Support Vector Machines are used with multi-objective functions for classification. The best classification result is obtained by Linear Discriminant Analysis on the full feature set of 30 features, which includes MFCC, FFT, pitch and beat [20].

Figure.3 Proposed Framework for Folk Dance Classification

3. Proposed framework

The folk dance classification framework is depicted in Fig.3. It consists of mainly 9 phases, shown in the figure.

3.1. Audio extraction

Plenty of folk dance videos are available on various sites on the internet. Folk dance videos for 'Garba', 'Lavani', 'Bhangra' and 'Ghoomar' are downloaded, and the audio part is separated from the video part. 40 different dance videos are collected for each class; in total, 160 videos are collected from various web sites.

3.2. Resampling

It is observed that the audio files obtained from the previous phase have different sampling rates. It is essential for further processing that all audio files have the same frame rate. All extracted audio files

are resampled to a 24 Kbps frame rate (24000 bits per second).

3.3. Dance fragment extraction

It is not necessary to analyze the whole audio file for the classification of the folk dance. It is found that the duration of a dance song is normally between 5 and 8 minutes. We have taken the initial 125 seconds of each dance for processing. The extraction process is described in Fig.4. Let us assume that the fragmented audio files, each of 125 seconds, are numbered D1, D2, ..., DN. In our case N is 160. Each fragmented audio file is further decomposed into portions of 5 seconds. Let D(i,j) denote the jth portion of the ith dance file. The total number of portions in each dance file is 125/5 = 25.

Total_portions = Σ_{i=1}^{N} Σ_{j=1}^{M} D(i,j) = 4000    (1)

In our case, N = 160 and M = 25. The base of any automatic analysis system is the feature vectors. A large number of time and frequency domain features have been devised for automatic music recognition.

Figure.4 Dance Fragment Extraction and Dance Segments Generation

3.4. MFCC feature extraction

MFCCs are widely used for speech recognition, singer identification, etc. Audio song classification based on listeners' taste using MFCC is explained, with 82% accuracy, in [1]. Wei et al. in [21] have developed a system for audio classification using a sampling rate of 22.020 KHz with MFCC and other features; the accuracy of their system was recorded at more than 77%.

Each extracted segment of the song fragment is taken as input for MFCC feature extraction. The number of samples in each portion is 5x24000 = 120000. We have selected a frame size of 15 milliseconds (15x24 samples) and a frame shift of 10 milliseconds (10x24 samples) for analysis. In total, 13 MFCC features are extracted. The pre-emphasis coefficient for MFCC is taken as 0.97. Let the feature matrix obtained for each segment of the dance fragment be PxK, where P represents the total number of overlapping frames in one portion and K the number of MFCC coefficients. Here K is 13 and P is 499.

MFCC feature extraction involves the following steps:
1. Tw = 15 ms (frame duration)
2. Ts = 10 ms (frame shift)
3. Frame_length = round(1E-3 * Tw * fs)
4. Frame_shift = round(1E-3 * Ts * fs)
5. Pre-emphasis: speech(t) = speech(t) - α x speech(t-1), with α = 0.97
6. For each frame:
   1. Compute the Discrete Fourier Transform (DFT)
   2. Take the log of the amplitude of the DFT
   3. Perform Mel scaling and smoothing using a triangular filterbank with 26 filters equally spaced between the lower and upper frequency: Mel_freq = 1127 x log(1 + freq/700)
   4. Perform the Discrete Cosine Transform (DCT)
   5. Take the initial 13 (N) cepstral coefficients
   6. Perform cepstral liftering: Cep_lifter = 1 + 0.5 * L * sin(pi * [0:N-1]/L), where L is the sine lifter parameter (here 22) and N is the number of cepstral coefficients

3.5. Linear predictive coding (LPC)

Linear predictive coding (LPC) is defined as a digital method for encoding an analog signal in which a particular value is predicted by a linear function of the past values of the signal. It was first proposed as a method for encoding human speech by the United States Department of Defense in federal standard 1015, published in 1984 [22]. At a particular time t, the speech sample s(t) is represented as a linear sum of the p previous samples. The LPC coefficients are computed using the equation given below.


s(t) = -a(2)s(t-1) - a(3)s(t-2) - ... - a(p+1)s(t-p)    (2)

Here, p is the order of the polynomial and A = [1 a(2) a(3) ... a(p+1)]. The most important aspect of LPC is the linear predictive filter, which allows the value of the next sample to be determined by a linear combination of previous samples. Normally, speech is sampled at 8000 samples/second with 8 bits used to represent each sample, which gives a rate of 64000 bits/second. Linear predictive coding reduces this to 2400 bits/second. At this reduced rate the speech has a distinctive synthetic sound and there is a noticeable loss of quality; however, the speech is still audible and can still be easily understood. Since there is information loss, linear predictive coding is a lossy form of compression. In the proposed framework, 13 coefficients are extracted with p = 12 (a 12th order polynomial).

Figure.5 Vector generation for 5 Second Segment
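The feature extraction of Sections 3.4 and 3.5 can be sketched in numpy as below. This is a minimal illustration, not the authors' MATLAB code: the FFT size (512), the 0 to fs/2 filterbank range and the Levinson-Durbin recursion for the LPC coefficients are our assumptions, while the 24 KHz rate and the 15 ms / 10 ms framing match the paper and reproduce its 499 frames per 5 second portion.

```python
import numpy as np

def hz_to_mel(f):
    # Mel scale as in the paper: mel = 1127 * ln(1 + f/700)
    return 1127.0 * np.log(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * np.expm1(m / 1127.0)

def frame_signal(x, frame_len, frame_shift):
    # slice the signal into overlapping frames (rows)
    n = 1 + max(0, (len(x) - frame_len) // frame_shift)
    idx = np.arange(frame_len)[None, :] + frame_shift * np.arange(n)[:, None]
    return x[idx]

def mfcc(x, fs=24000, tw_ms=15, ts_ms=10, n_filt=26, n_ceps=13,
         alpha=0.97, lifter=22, nfft=512):
    flen = int(round(1e-3 * tw_ms * fs))     # 360 samples at 24 kHz
    fshift = int(round(1e-3 * ts_ms * fs))   # 240 samples at 24 kHz
    # pre-emphasis: s'(t) = s(t) - alpha * s(t-1)
    x = np.append(x[0], x[1:] - alpha * x[:-1])
    frames = frame_signal(x, flen, fshift)
    mag = np.abs(np.fft.rfft(frames, nfft))  # DFT amplitude per frame
    # triangular filterbank, equally spaced on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filt + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filt, nfft // 2 + 1))
    for i in range(n_filt):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):
            fbank[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[i, k] = (r - k) / max(r - c, 1)
    feat = np.log(mag @ fbank.T + 1e-10)     # log mel-filterbank energies
    # DCT-II, keeping the first n_ceps coefficients
    n = np.arange(n_filt)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filt))
    ceps = feat @ dct.T
    # cepstral liftering: 1 + 0.5*L*sin(pi*k/L)
    lift = 1.0 + 0.5 * lifter * np.sin(np.pi * np.arange(n_ceps) / lifter)
    return ceps * lift

def lpc(x, order=12):
    """LPC coefficients A = [1, a(2), ..., a(p+1)] via Levinson-Durbin."""
    x = np.asarray(x, dtype=float)
    r = np.correlate(x, x, mode="full")[len(x) - 1: len(x) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1]
        err *= (1.0 - k * k)
    return a

# one 5 s portion at 24 kHz -> 499 overlapping 15 ms frames, 13 MFCCs each
portion = np.random.default_rng(0).normal(size=5 * 24000)
print(mfcc(portion).shape)   # (499, 13)
print(lpc(portion).shape)    # (13,)
```

Framing the 120000-sample portion with a 360-sample window and 240-sample shift yields exactly the P = 499 rows stated in the text, and the 12th-order LPC gives the 13 coefficients that are later concatenated with the 13 MFCCs.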

3.6. Dance segment feature vector generation

A feature vector is generated by combining the 13 MFCC coefficients with the 13 LPC coefficients. Each 5 second song segment thus comprises a 499x26 feature matrix; a pictorial view of the process is given in Fig.5. Each dance fragment (part of a song) is a combination of 25 segments, each of size 499x26. This way, the size of a dance fragment is 25x499x26, which is equivalent to 12475x26.

3.7. Dimensionality reduction

Principal Component Analysis (PCA) is a widely used method for dimensionality reduction [23][24]. Principal components are obtained by a linear transformation to a new set of features which are uncorrelated, ordered by importance, and retain as much variance as possible. The Singular Value Decomposition method is used for the extraction of the principal components. The dimensionality reduction process converts the 12475x26 matrix into a 2000x26 matrix, as depicted in Fig.6.

Figure.6 Dimensionality Reduction for a Song Fragment

3.8. Dance song segment classification

Classification accuracy is compared using the 3 different approaches described here.

3.8.1. K-Nearest neighbour (kNN) classification [25]

The k-nearest neighbor method originated in the 1950s. It belongs to the lazy learners' category, as it does less work when a training tuple is presented and more work at the time of classification. It is a supervised learning approach in which a new query is classified based on the majority of its k nearest neighbors. Commonly used distance measures are the Euclidean distance, Mahalanobis distance and Manhattan distance. 10-fold cross validation is used for performance comparison. The value of K is taken as 1; higher values of K do not improve the performance.

The Euclidean distance between two tuples X1 = (x11, x12, ..., x1N) and X2 = (x21, x22, ..., x2N) is defined as follows:

dist(X1, X2) = sqrt( Σ_{i=1}^{N} (x1i - x2i)^2 )    (3)

3.8.2. Naive Bayesian method [26]

The Naive Bayes algorithm calculates a discriminant function for each of the n possible classes. It assumes that each feature of an audio clip is drawn independently from a normal distribution and classifies according to the Bayes optimal decision rule.
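The PCA and kNN stages above might be sketched as follows. This is a toy numpy illustration, not the authors' pipeline: the cluster data, the number of retained components and the helper names are invented, and it shows the standard column-space projection via SVD together with the Euclidean 1-NN rule of Eq. (3), whereas the paper reduces the number of rows of the 12475x26 fragment matrix.

```python
import numpy as np

def pca_svd(X, n_components):
    """Project rows of X onto the top principal components via SVD."""
    mu = X.mean(axis=0)
    Xc = X - mu
    # economy SVD; rows of Vt are the (uncorrelated, variance-ordered) directions
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T, Vt[:n_components], mu

def nn_classify(query, train_X, train_y):
    """1-NN using the Euclidean distance of Eq. (3)."""
    d = np.sqrt(((train_X - query) ** 2).sum(axis=1))
    return train_y[int(np.argmin(d))]

# toy data: two well-separated clusters standing in for two dance classes
rng = np.random.default_rng(0)
a = rng.normal(0.0, 0.1, size=(20, 26))   # class 1 segment vectors
b = rng.normal(1.0, 0.1, size=(20, 26))   # class 2 segment vectors
X = np.vstack([a, b])
y = np.array([1] * 20 + [2] * 20)

Z, components, mu = pca_svd(X, n_components=5)
q = (np.full(26, 0.95) - mu) @ components.T   # project a query the same way
print(nn_classify(q, Z, y))  # → 2
```

The query must be centered with the training mean and projected with the same components; skipping either step would place it in a different coordinate system than the training data.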



3.8.3. Neural network method

Neural networks are very popular in machine learning. Automatic speech recognition based on MFCC, LPC and perceptual linear prediction (PLP) features together, using a multilayer feed-forward neural network with back-propagation, is discussed in [27].

The scaled conjugate gradient (SCG) algorithm [28] is a fast supervised learning algorithm based on conjugate directions. Its performance is far better than that of the standard back propagation algorithm and of the conjugate gradient algorithm with line search. SCG does not include user dependent parameters and avoids a line search in each iteration to determine an appropriate step size. Back propagation is used to calculate the derivatives of performance with respect to the weight and bias variables.

Training terminates when one of the following conditions is satisfied [29]:
1) The maximum number of epochs is reached
2) The maximum amount of time is exceeded
3) Performance is minimized to the goal
4) The performance gradient falls below min_grad
5) Validation performance has increased more than max_fail times since the last time it decreased

3.9. Folk Dance Classification

Song classification is very important because each 5 second song segment may contain pure vocals (male or female) or extensive use of a musical instrument which does not dominate the remaining segments. Here, majority voting over the segment labels is used in the final decision for folk dance song classification.

Algorithm for Folk Dance Classification
For each song used in testing
    For each song segment i
        Extract class_label and store into calculated_segment_label[i]
        if (calculated_segment_label[i] == '1') then C1_class = C1_class + 1
        elseif (calculated_segment_label[i] == '2') then C2_class = C2_class + 1
        elseif (calculated_segment_label[i] == '3') then C3_class = C3_class + 1
        elseif (calculated_segment_label[i] == '4') then C4_class = C4_class + 1
        end
    end
    Out_class = argmax(C1_class, C2_class, C3_class, C4_class)
end

4. Experimental Setup

A novel folk dance database is constructed for four Indian folk dances widely popular in Indian movies. The folk dance songs are extracted from movies and albums which are publicly available on CDs/DVDs/MP3s. A collection of 40 songs is prepared for each folk dance, and the proposed approach is evaluated using these 160 popular songs. To maintain uniformity, all the audio songs extracted from video are resampled at a rate of 24 Kbps. We have used MATLAB R2013a for the experiments, on a laptop with an i7-5500U processor at 2.40 GHz and 8 GB RAM.

Table 2. Neural Network Parameters used in MATLAB

Parameter | Value
Epochs | 1000
Goal | 0
Time | Inf
Min_grad | 1e-6
Max_fail | 10
Sigma | 5.0e-5
Lambda | 5.0e-7
Hidden Layer Neurons | 50
Input Neurons | 26 (13 MFCC + 13 LPC)
Output Classes | 4
Performance | Mean Squared Error

Table 3. Folk Song Database

Id | Folk Dance Name | State | No. of Songs
D1 | Garba | Gujarat | 40
D2 | Lavani | Maharashtra | 40
D3 | Bhangra | Punjab | 40
D4 | Ghoomar | Rajasthan | 40

5. Experimental Results

The Folk Dance Recognition System is divided into three main parts: the training phase, validation and the testing phase. The Folk Dance Model (FDM) is generated; 80% of the data is used for training and the remaining 20% is used for testing. A reliable measure of classifier accuracy can be obtained by holdout, random subsampling, k-fold cross validation or bootstrap techniques. To calculate the accuracy of our system, we have used the 5-fold random cross validation method. The accuracy of the proposed system is computed using Eq. (4). Fig.7 and Fig.8 show the confusion



matrices for the song segments and for the whole songs, respectively.
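The majority voting rule of Section 3.9 can be sketched in a few lines of Python; the helper name and the toy label sequence below are ours, invented for illustration rather than taken from the paper's MATLAB code.

```python
from collections import Counter

def classify_song(segment_labels):
    """Assign a song-level label by majority vote over its segment labels.

    `segment_labels` holds the per-segment classifier outputs, i.e. the
    class ids 1-4 used in the paper for the four folk dances.
    """
    votes = Counter(segment_labels)
    label, _count = votes.most_common(1)[0]
    return label

# Example: 25 segment predictions for one 125-second fragment
segments = [2] * 14 + [1] * 6 + [3] * 5   # class 2 wins the vote
print(classify_song(segments))  # → 2
```

Voting over 25 segments makes the song-level decision robust to a few segments dominated by pure vocals or an atypical instrument mix, which is exactly the motivation given in Section 3.9.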

Accuracy (%) = (DIC / TD) x 100    (4)

Here, DIC denotes the number of correctly identified dances and TD the total number of dances used for testing. ROC stands for Receiver Operating Characteristic. ROC curves come from signal detection theory and show the trade-off between the true positive rate and the false positive rate for a given model. The area under the ROC curve is a measure of the accuracy of the model; the closer the ROC curve of a model is to the diagonal line, the less accurate the model [30]. A ROC curve for song segments (not for whole songs) is plotted in Fig. 9. It is clear from the testing ROC curve that the model is highly accurate, as the curve rises steeply from zero. The accuracy obtained via the different classification algorithms is given in Table 4.

Figure.8 Whole Folk Dance Song Test Data
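Eq. (4) and a confusion matrix of the kind shown in Fig. 7 can be computed as sketched below; the function name and the toy predictions are invented for illustration.

```python
import numpy as np

def accuracy_and_confusion(true_labels, pred_labels, n_classes=4):
    """Accuracy per Eq. (4) plus a confusion matrix over the dance classes."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    dic = int((true_labels == pred_labels).sum())   # correctly identified dances
    td = len(true_labels)                           # total dances tested
    acc = 100.0 * dic / td
    conf = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(true_labels, pred_labels):
        conf[t - 1, p - 1] += 1                     # classes are numbered 1..4
    return acc, conf

# hypothetical song-level predictions for 8 test songs, two per class
true = [1, 1, 2, 2, 3, 3, 4, 4]
pred = [1, 1, 2, 3, 3, 3, 4, 4]
acc, conf = accuracy_and_confusion(true, pred)
print(acc)  # 87.5
```

Row i of the confusion matrix counts how the songs of true class i were labelled, so the off-diagonal entries localize exactly which dance forms the classifier confuses.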

Figure.9 ROC Curve for Song Segment

Figure.7 Folk Dance Song Segment Confusion Matrices for Training and Testing

Table 4. Classification Accuracy Comparison

K-Nearest Neighbour | Neural Network | Naive Bayesian
85% | 90.8% | 60%

6. Conclusion

Indian folk dance plays a vital role in human life and is widely used in Bollywood movies. 'Garba', 'Lavani', 'Ghoomar' and 'Bhangra' are the four different folk dances considered for classification. For each class 40 different songs are considered, which gives a total of 160 songs. MFCC and LPC based features are extracted from the audio, and classification accuracy is measured using the K-Nearest Neighbour, Neural Network and Naive Bayesian approaches. More than 90% classification accuracy is achieved with the Neural Network. Similar to the folk dance songs present in Bollywood movies, other song genres like Qawwali; events

International Journal of Intelligent Engineering and Systems, Vol.10, No.3, 2017 DOI: 10.22266/ijies2017.0630.19

Received: March 9, 2017 182

like War, Violence, Wedding; the voices of animals; [13] C.N. Silla Jr, C. A Kaestner, and A. L. Koerich. chirping of birds can be classified. This can be "Automatic classification using considered as a step towards 'Content Based ensemble of classifiers", In: Proc. of IEEE Bollywood Movie Mining'. International Conf. on Systems, Man & Cybernetics, pp. 1687-1692, 2007. References [14] T. Zhang, and C.C. Kuo, "Hierarchical classification of audio data for archiving and [1] R. Sharma, Y. S. Murthy, S.G. Koolagudi, retrieving", In: Proc. of IEEE International Conf. "Audio Songs Classification Based on Music on. Acoustics, Speech, and Signal Processing, Patterns", In: Proc. of Second International pp. 3001-3004, 1999. Conf. On Computer and Communication [15] G. Tzanetakis, and P. Cook, "Musical genre Technologies, pp. 157-166, 2015. classification of audio signals", IEEE [2] R. Massey. India’s dances: Their History, transactions on Speech and Audio Processing Technique, and Repertoire, Abhinav Vol.10, No.5, pp. 293-302, 2002. Publications, p. 26, 2004. [16] C. Nopthaisong, and M.M. Hasan, "Automatic [3] https://en.wikipedia.org/wiki/Indian_folk_music music classification and retreival: Experiments [4] D. Hoiberg, Students' Britannica India, Vol. 2, with Thai music collection", In: Proc. of IEEE Popular Prakashans, p.392, 2000 International Conf on Information and [5] T. C. Nagavi, and N.U. Bhajantri, "Overview of Communication Technology, pp. 76-81, 2007. automatic Indian music information recognition, [17] S. Chu, S. Narayanan, and C.C. J. Kuo, classification and retrieval systems", In: Proc. of "Environmental sound recognition with time– the International Conf. on Recent Trends in frequency audio features." IEEE Transactions Information Systems, pp. 111-116, 2011. on Audio, Speech, and Language Processing, [6] A. Flexer, "A closer look on artist filters for Vol.17, No.6, pp. 1142-1158, 2009. musical genre classification", World, Vol.19, [18] K. Palecek and J. 
Chaloupka, "Audio-visual No.122, p.16-7, 2007. speech recognition in noisy audio [7] S. Shreshthova, Between cinema and environments", In: Proc. of 36th IEEE performance: Globalizing Bollywood Dance, Telecommunications and Signal Processing, pp. ProQuest. p.372, 2008. 484-487, 2013. [8] I. Kapsouras, S. Karanikolos, N. Nikolaidis, & [19] L. Chen, S. Gunduz, and M.T. Ozsu, "Mixed A. Tefas, "Folk dance recognition using a bag of type audio classification with support vector words approach and ISA/STIP features", In: machine", In: Proc. of IEEE International Conf. Proc. of the 6th Balkan Conf. in Informatics, pp. on Multimedia and Expo, pp. 781-784, 2006. 71-74, 2013. [20] T. Li, and G. Tzanetakis, "Factors in automatic [9] S. Samanta, P. Purkait, & B.Chanda, "Indian musical genre classification of audio signals", Classical Dance classification by learning dance In: Proc. of IEEE Workshop on Applications of pose bases". In: Proc. of the IEEE Workshop on Signal Processing to Audio and Acoustics, pp. Applications of Computer Vision, pp. 265-270, 143-146, 2003. 2012. [21] W. Zhang, Q. Zhao, Y. Liu, and M. Pang, "Co- [10] M.N. Kaynak, Q.Zhi, A.D. Cheok, K. Sengupta, training Approach for Label-Minimized Audio Z. Jian, and K.C.Chung, "Analysis of lip Classification", In: Proc. of IEEE International geometric features for audio-visual speech Conf. on Measuring Technology and recognition", IEEE Transactions on Systems, Mechatronics Automation, pp. 860-863, 2010. Man and Cybernetics, Part A: Systems and [22] M.N. Raja, P.R. Jangid, and S.M. Gulhane, Humans, Vol..34, No.4, pp.564-570, 2004. "Linear Predictive Coding", International [11] L. Chen, S.J. Rizvi, and M.T. Ozsu, Journal of Engineering Sciences & Technology, "Incorporating audio cues into dialog and action pp.373-379, 2015. scene extraction", In: Proc. of Electronic [23] M. J. Zaki, W. Meira Jr, and W. Meira, Data Imaging, pp. 252-263, 2003 mining and analysis: fundamental concepts and [12] Y. Song, W.H. 
Wang, and F.J. Guo, "Feature algorithms, Cambridge University press, extraction and classification for audio pp.187-201, 2014 information in news video." In: Proc. of IEEE [24] V. Panagiotou and N. Mitianoudis, "PCA International Conf. on. Wavelet Analysis and summarization for audio song identification Pattern Recognition, pp. 43-46, 2009. using Gaussian mixture models", In: Proc. of

International Journal of Intelligent Engineering and Systems, Vol.10, No.3, 2017 DOI: 10.22266/ijies2017.0630.19

Received: March 9, 2017 183

18th international conf. on digital signal processing, pp.1-6, 2013. [25] H. Maniya, M. Hasan, and K.P. Patel, "Comparative study of naive bayes classifier and KNN for Tuberculosis", In: Proc. of International Conf. on Web Services Computing, pp.22-6, 2011. [26] D. W. Aha, "A study of instance based algorithms for supervised learning tasks: Mathematical, empirical, and psychological evaluations", Doctoral Dissertation, University of California, 1990. [27] N. Dave, "Feature extraction methods LPC, PLP and MFCC in speech recognition", International Journal for Advance Research in Engineering and Technology, Vol.1, No.6, pp. 1-4, 2013. [28] M.F.Moller, "A scaled conjugate gradient algorithm for fast supervised learning", Neural Networks ,Vol.6, No. 4, pp.525-533,1993. [29] http://in.mathworks.com/help/nnet/ref/trainscg.h tml [30] G. M. Williams, D.A. Ramirez, M.M. Hayat, and A.S. Huntington, "Instantaneous receiver operating characteristic performance of multi- gain-stage APD photoreceivers", IEEE Journal of the Electronic Devices Society ,Vol.1, No.6, pp. 145–153, 2013.

International Journal of Intelligent Engineering and Systems, Vol.10, No.3, 2017 DOI: 10.22266/ijies2017.0630.19