Indian Dance Form Recognition from Videos

Ankita Bisht∗, Riya Bora∗, Goutam Kumar∗, Pushkar Shukla† and Balasubramanian Raman‡

Department of Computer Science, Indian Institute of Technology Roorkee, India.

Abstract—Classical dance forms are an integral part of the Indian culture and heritage. Therefore, preserving and comprehending these dance forms is a relevant problem in the context of the digital preservation of the Indian heritage. In this paper, we propose a novel framework to classify Indian classical dance forms from videos. Representations are extracted through a Deep Convolutional Neural Network (DCNN) and optical flow, and these representations are then trained on a multi-class linear Support Vector Machine (SVM). Furthermore, a novel dataset is introduced to evaluate the performance of the proposed framework. The framework achieves an accuracy of 75.83% when tested on 211 videos.

Keywords—dance form recognition, Indian classical dance, deep convolutional neural network (DCNN), optical flow, multi-class SVM.

I. INTRODUCTION

'Shastriya Nritya' or Indian classical dance is a hypernym that encompasses the different performing arts associated with the Indian culture. These dance forms are mostly performed around a given theme and reveal a lot of information about ancient Indian traditions, cultures and customs. These dance forms are a combination of a variety of head movements, facial expressions and hand movements. The Sangeet Natak Akademi recognizes a total of 8 different classical dance forms, listed below:

• Bharatnatyam
• Kathak
• Kathakali
• Kuchipudi
• Manipuri
• Mohiniyattam
• Odissi
• Sattriya

Fig. 1. Samples of the different types of Indian classical dance forms that are recognized by the framework.

The prime motivation behind the paper is to develop a framework capable of recognizing different types of Indian classical dances that can aid in the efficient retrieval and storage of Indian classical dance forms. These dance forms have been an integral part of the Indian culture. Preserving and digitally archiving these dance forms is vital with respect to the overall conservation of the Indian cultural heritage. Further, the Internet boom has resulted in an outburst of unorganized media content related to these dance forms. Therefore, it is important to efficiently organize and store these types of digital media.

There are several challenges associated with recognizing Indian classical dance forms. Firstly, unlike many western dances, facial expression plays an important role in Indian classical dance. Secondly, these dance forms comprise many different hand and body postures, so recognizing these gestures and postures is a daunting task. Thirdly, these dance forms involve movements that are also common to other dance forms; therefore, they cannot be recognized solely on the basis of hand and body movements.

The previous studies on Indian dance form recognition have mostly focused upon individual dance forms, and have been conducted on relatively small data-sets. Moreover, there are no publicly available data-sets to test frameworks for Indian dance form recognition. The major contribution of this paper lies in proposing a framework capable of recognizing Indian dance forms. The framework relies on state-of-the-art Deep Convolutional Neural Networks (DCNN) and optical flow for extracting representations to recognize Indian dance forms. Further, we introduce a publicly available data-set comprising 626 videos of 7 different Indian classical dance forms.

The rest of the paper has been organized as follows. A study of the previous work on Indian classical dance forms, and on other dance forms in general, is presented in Section 2. The proposed methodology for detecting dance forms is presented in Section 3. Section 4 gives the details of the data-set that has been introduced. The results are discussed in Section 5. Section 6 concludes the paper.

II. LITERATURE REVIEW

As per our knowledge, very little research has been carried out on recognizing Indian dance forms. The previous works have mostly focused upon a single dance form. A significant amount of research has been done on expression and gesture recognition for various Indian classical dance forms. Saha et al. [7] proposed a gesture recognition algorithm for the Indian classical style, giving an accuracy of 86.8%. The Kinect sensor used here generates the skeleton of a human body, out of which eleven coordinates are used to discriminate between 'Anger', 'Fear', 'Happiness', 'Sadness' and 'Relaxation'. Furthermore, their system provides an option to check whether the emotion is positive or negative. Soumya et al. [11] proposed a model to recognize different hand gestures in a dance form using an Artificial Neural Network (ANN). Devi et al. [1] gave a simple two-level classification method for single-hand gestures, termed 'Asamyukta Hastas', in Sattriya dance; the average accuracy is 75.45%. Saha et al. [8] proposed a technique to recognize postures with the use of a Kinect sensor, categorizing them into vertical symmetry, horizontal symmetry, and angular symmetry. The approach results in a recognition rate of 86.75% for five subjects. Devi et al. [2] also surveyed glove-based and vision-based dance gesture recognition, where different representations were proposed as per the needs of the dance forms and various classifiers were employed. Their work focused on static cues. An accuracy of 92.7% was obtained on a limited data set. A considerable amount of work has been done by Mallik et al. [5] in recognizing Indian dance forms. A part of that work includes recognizing Bharatnatyam and Odissi with the use of the Scale-Invariant Feature Transform (SIFT). SIFT representations were later trained on a Support Vector Machine classifier (linear kernel, cost factor of 2.0) on the Weka machine learning framework, resulting in a recognition rate of 92.78%.

III. PROPOSED ARCHITECTURE

It was observed by the authors that the different cues associated with Indian classical dance forms can be categorized into two types. The static cues comprise body postures, facial expressions, costumes, etc., whereas hand, head, leg and body movements are dynamic cues associated with the movements of the dancer. Separate frameworks were designed to capture the static and the dynamic cues. A block diagram representation of the proposed approach is presented in Figure 2. The framework relies on deep spatio-temporal descriptors for dance form recognition. Twelve key frames were selected from the entire video in order to capture static cues like body posture, facial expressions and costumes. Deep representations from the pre-trained Alex-Net [4] network were extracted from each of these frames. The representations were then combined together and trained using a multi-class linear Support Vector Machine (SVM).

Fig. 2. A block diagram representing the proposed architecture. The different matrices of a frame are fed to the input layer of a convolution network for training, and the resulting representations are given as input to a linear SVM for final classification.

Two frameworks were designed to capture the dynamic cues in the dance videos. The optical flow of each of these frames was calculated in order to estimate the velocity of pixels in these frames. A second round of optical flow was applied to these images in order to estimate the acceleration of individual pixels. Deep representations were then extracted from both matrices of images. These sets of representations were concatenated individually and were trained on two separate multi-class linear support vector machines. The output scores of the three SVMs were combined during the test phase to produce the final output.

1) Optical Flow Normalization: For a given pixel, the value of the optical flow is calculated as

    V(x, y) = u(x, y)^2 + v(x, y)^2    (6)

The maximum and minimum values of V can be expressed as

    V_max = max_{(x,y) ∈ C} (u(x, y)^2 + v(x, y)^2)    (7)

    V_min = min_{(x,y) ∈ C} (u(x, y)^2 + v(x, y)^2)    (8)

Therefore,

    f(x, y) = V(x, y) / (V_max − V_min) × 255    (9)

where f(x, y) is the normalized optical flow value and C is the set of all pixel coordinates in the image. The value of f(x, y) lies between 0 and 255 and can be displayed as a gray image.

A. Optical flow

In 1981, Horn and Schunck [3] derived an equation for optical flow estimation. The aim of the algorithm was to determine the velocity, or net displacement, of pixels across successive frames. Let the intensity of a given pixel P(x, y) at time t be equal to I(x, y, t). After a small change of time (t + ∆t), the pixel will have moved to a new position P(x + ∆x, y + ∆y), and its intensity value changes to I(x + ∆x, y + ∆y, t + ∆t). Assuming the movement of the pixel to be very small or negligible, it can be assumed that

    dI(x, y, t)/dt = 0    (1)

Therefore we can say that:

    I(x, y, t) = I(x + ∆x, y + ∆y, t + ∆t)    (2)

The above equation forms the basis of optical flow. The optical flow of an image can be broken down into orthogonal components. Let these components be u and v along the x coordinate and y coordinate respectively, where

    u = dx/dt,  v = dy/dt    (3)

Applying a Taylor series expansion to Eq. 2:

    (∂I/∂x)(dx/dt) + (∂I/∂y)(dy/dt) + ∂I/∂t = 0    (4)

i.e.

    I_x u + I_y v + I_t = 0    (5)

B. Using Deep CNNs for extracting representations

Deep learning models are trained on large sets of labeled data and learn representations from the data without manual interference. Convolutional neural networks (CNNs) are among the most popular deep learning algorithms and have been widely used for image recognition and classification tasks [9]. Pre-trained CNNs were chosen over hand-crafted CNNs due to the small size of the training dataset. The proposed architecture follows the Alex-Net framework [4]. The frames are reshaped to 227×227 as per the needs of the model. The model consists of a 227×227×3 input layer, a 96-filter 11×11 conv layer, 3×3 max-pooling, a 128-filter 5×5 conv layer, 3×3 max-pooling, a 256-filter 3×3 conv layer, two 192-filter 3×3 conv layers, 3×3 pooling, a 4096×1 fc-6 layer, a 4096×1 fc-7 layer and a 1000×1 fc-8 layer. Representations were extracted from the 'fc7' layer of the neural network; the dimension of the extracted representation is 4096×1. The features of each frame in the video are then concatenated to form a single feature matrix for the video. We have worked with 5 different CNNs for a better recognition rate. The features from the fc7 layer of each network were extracted and stored in a single feature vector. The representations were extracted from 12 different frames from across the entire video.

C. CNN on optical flow

Applying CNNs on optical flow is an efficient technique for capturing motion in videos. They were first introduced by Simonyan and Zisserman [10]. Since then, CNNs on optical flow have become quite popular and have been applied to tasks like activity recognition from videos [10] and egocentric video indexing [6]. The input to these models are the optical flows of consecutive frames, stacked over one another. Our approach differs from previously existing models that have used optical flow in that the proposed framework relies on two different CNNs applied separately to different types of optical flow matrices. While the first CNN is applied on the optical flow of images, the optical flows of consecutive optical flows are fed to the second CNN. Applying optical flow on the optical flows of consecutive frames gives us the acceleration of each pixel in an image. The acceleration component is then captured using the second CNN.

TABLE I. COMPARISON OF THE INDIVIDUAL SUB-FRAMEWORKS WITH THE PROPOSED APPROACH.

    Approach                          Accuracy
    Optical Flow1 + Multi-Class SVM   54.92%
    Optical Flow2 + Multi-Class SVM   61.15%
    Frame + Multi-Class SVM           73.11%
    Proposed Approach                 75.83%

TABLE II. TOTAL NUMBER OF VIDEOS PRESENT IN THE DATA-SET FOR DIFFERENT STYLES OF INDIAN CLASSICAL DANCE FORMS.

    Dance Form      Number of videos
    Bharatnatyam    82
    Kathak          99
    Kuchipudi       87
    Manipuri        70
    Mohiniyattam    110
    Odissi          85
    Sattriya        93

IV. DATA-SET DESCRIPTION

The data set used in our paper comprises 626 video recordings, collected from YouTube, that belong to the following 7 dance forms: Bharatnatyam, Kathak, Kuchipudi, Manipuri, Mohiniyattam, Odissi and Sattriya. It was ensured that the dance videos were clear and effective, with minimum background activity. A detailed description of the dataset has been provided in Table II. A total of 420 videos were considered for training purposes. The remaining videos in each class were considered for testing.

V. EXPERIMENTS

A. Experimental Setup

The experiments were conducted on a system with the following configuration: an Intel Core i7 (7th generation) processor and an NVIDIA Quadro 2000 GPU. MATLAB 2016 was used to perform the experiments.

D. Classification

The representations extracted from each of the three CNNs are trained separately on a multi-class Support Vector Machine. While the SVM algorithm was traditionally designed for binary classification, there have been a few extensions of SVM to multi-class classification. A one-vs-all strategy was employed while training the multi-class SVM. The SVM was trained with a linear kernel; the value of C was set to 1 and the kernel offset was set to 0.

For a given test example, each of the three SVMs produces a score. The final score for each class is calculated as

    FinalScore = w1 × Score1 + w2 × Score2 + w3 × Score3    (10)

where w1, w2 and w3 are the weights associated with each SVM. The weights are estimated empirically, and the values of w1, w2 and w3 are found to be 0.75, 0.10 and 0.25 respectively. A given test example is assigned the label with the highest final score, as calculated from Eq. 10.

B. Experimental analysis

Two different sets of experiments were performed on the dataset described in Section 4.

Fig. 3. A bar chart showing the distribution of training and testing data for the different dance forms.

In the first set of experiments, the authors wanted to evaluate the performance of the individual parts of the model and compare them with the proposed model. A comparison between the sub-parts and the complete framework is presented in Table I. Here, Optical Flow2 refers to the framework that was fed with the optical flow of the optical flow of consecutive frames. An overall accuracy of 75.83% was achieved by the model. Two interesting observations were made from Table I. Firstly, the performance of the model is better when it is fed with the optical flows of optical flow matrices for consecutive frames. Secondly, the proposed approach yielded a higher accuracy than the individual parts of the model.

During the second set of experiments, the focus was on analyzing the performance of the model for the different classes. A confusion matrix for the different classes is presented in Table III. The mean average precision and the mean average recall for the different classes are presented in Table IV. Across all 7 classes, the highest mean average precision of 100% was obtained by Kathak and Kuchipudi, and the highest mean average recall of 100% was obtained by Kathak. Odissi achieved the lowest mean average precision of 48.148%, while Manipuri had the lowest mean average recall of 50%.

                  Bharatnatyam  Kathak  Kuchipudi  Manipuri  Mohiniyattam  Odissi  Sattriya
    Bharatnatyam  0.6579        0.0000  0.0000     0.1316    0.0000        0.1842  0.0263
    Kathak        0.0000        1.0000  0.0000     0.0000    0.0000        0.0000  0.0000
    Kuchipudi     0.0000        0.0000  1.0000     0.0000    0.0000        0.0000  0.0000
    Manipuri      0.0000        0.0000  0.0000     0.7143    0.0000        0.2857  0.0000
    Mohiniyattam  0.0000        0.0000  0.0698     0.0000    0.8140        0.0000  0.1163
    Odissi        0.2593        0.0000  0.2593     0.0000    0.0000        0.4815  0.0000
    Sattriya      0.0227        0.0000  0.0227     0.0000    0.3182        0.0000  0.6364

TABLE III. CONFUSION MATRIX FOR THE DIFFERENT TYPES OF INDIAN CLASSICAL DANCE FORMS.
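The one-vs-all training and the weighted score fusion of Eq. (10) can be sketched with scikit-learn. This is a minimal sketch under assumptions: the three feature sets are synthetic stand-ins for the frame, optical-flow and flow-of-flow representations, and `LinearSVC` replaces the paper's MATLAB SVM; only C = 1 and the weights (0.75, 0.10, 0.25) come from the paper.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(42)
n_train, n_classes = 60, 7
y = np.arange(n_train) % n_classes  # every class represented in training data
# Synthetic stand-ins for the three representations (frame, flow, flow-of-flow)
X1, X2, X3 = (rng.standard_normal((n_train, 32)) + y[:, None] for _ in range(3))

# One linear one-vs-all SVM per representation, C = 1 as in the paper
svms = [LinearSVC(C=1.0, max_iter=10000).fit(X, y) for X in (X1, X2, X3)]
w = (0.75, 0.10, 0.25)  # empirically chosen weights from the paper

def final_scores(x1, x2, x3):
    """FinalScore = w1*Score1 + w2*Score2 + w3*Score3 (Eq. 10)."""
    return sum(wi * clf.decision_function(xi)
               for wi, clf, xi in zip(w, svms, (x1, x2, x3)))

scores = final_scores(X1[:1], X2[:1], X3[:1])     # shape (1, 7)
predicted_label = int(scores.argmax(axis=1)[0])   # label with the highest fused score
```

Note that `decision_function` returns one score per class under the one-vs-rest scheme, so the fused matrix can be arg-maxed directly.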

Fig. 4. Samples of correctly classified videos from the data-set.
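The per-class precision and recall reported below, and a row-normalized confusion matrix like Table III, are computed from test labels and predictions. A sketch with scikit-learn, where the labels are synthetic placeholders (an assumption; the paper's actual predictions are not available):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Synthetic stand-in labels for the 7 dance-form classes
rng = np.random.default_rng(7)
y_true = np.arange(200) % 7                     # every class appears in the test set
noise = rng.integers(0, 7, size=200)
y_pred = np.where(rng.random(200) < 0.75, y_true, noise)  # roughly 75% correct

cm = confusion_matrix(y_true, y_pred, labels=np.arange(7))
cm_norm = cm / cm.sum(axis=1, keepdims=True)    # each row sums to 1
precision = precision_score(y_true, y_pred, average=None, zero_division=0)
recall = recall_score(y_true, y_pred, average=None, zero_division=0)
print(cm_norm.shape, precision.shape, recall.shape)  # (7, 7) (7,) (7,)
```

With `average=None`, both metrics return one value per class, matching the per-class layout of the table below.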

TABLE IV. PRECISION AND RECALL FOR DIFFERENT DANCE FORMS.

    Dance Form      Precision   Recall
    Bharatnatyam    65.789%     75.758%
    Kathak          100%        100%
    Kuchipudi       100%        59.259%
    Manipuri        71.429%     50.00%
    Mohiniyattam    81.395%     71.429%
    Odissi          48.148%     59.091%
    Sattriya        63.636%     82.353%

VI. CONCLUSION

Indian classical dance is an important aspect of the Indian culture. In this paper, we proposed a novel framework to recognize 7 different types of Indian classical dance forms. The proposed technique relies on a combination of 3 sub-frameworks that utilize representations from pre-trained convolutional neural networks and optical flows. We also introduced a new dataset of 626 videos covering these 7 Indian classical dance forms. An overall accuracy of 75.83% is achieved by the proposed framework.

REFERENCES

[1] M. Devi and S. Saharia. A two-level classification scheme for single-hand gestures of Sattriya dance. In Accessibility to Digital World (ICADW), 2016 International Conference on, pages 193–196. IEEE, 2016.
[2] M. Devi, S. Saharia, and D. Bhattacharyya. Dance gesture recognition: A survey. International Journal of Computer Applications, 122(5), 2015.
[3] B. K. Horn and B. G. Schunck. Determining optical flow. Artificial Intelligence, 17(1-3):185–203, 1981.
[4] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[5] A. Mallik, S. Chaudhury, and H. Ghosh. Nrityakosha: Preserving the intangible heritage of Indian classical dance. Journal on Computing and Cultural Heritage (JOCCH), 4(3):11, 2011.
[6] Y. Poleg, A. Ephrat, S. Peleg, and C. Arora. Compact CNN for indexing egocentric videos. In Applications of Computer Vision (WACV), 2016 IEEE Winter Conference on, pages 1–9. IEEE, 2016.
[7] S. Saha, S. Ghosh, A. Konar, and A. K. Nagar. Gesture recognition from Indian classical dance using Kinect sensor. In Computational Intelligence, Communication Systems and Networks (CICSyN), 2013 Fifth International Conference on, pages 3–8. IEEE, 2013.
[8] S. Saha, R. Lahiri, A. Konar, B. Banerjee, and A. K. Nagar. Human skeleton matching for e-learning of dance using a probabilistic neural network. In Neural Networks (IJCNN), 2016 International Joint Conference on, pages 1754–1761. IEEE, 2016.
[9] P. Shukla, T. Gupta, A. Saini, P. Singh, and R. Balasubramanian. A deep learning framework for recognizing developmental disorders. In Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on, pages 705–714. IEEE, 2017.
[10] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pages 568–576, 2014.
[11] C. Soumya and M. Ahmed. Artificial neural network based identification and classification of images of Bharatanatyam gestures. In Innovative Mechanisms for Industry Applications (ICIMIA), 2017 International Conference on, pages 162–166. IEEE, 2017.