Indian Dance Form Recognition from Videos
Ankita Bisht∗, Riya Bora∗, Goutam Kumar∗, Pushkar Shukla† and Balasubramanian Raman‡
Department of Computer Science, Indian Institute of Technology Roorkee, India.

Abstract—Classical dance forms are an integral part of Indian culture and heritage. Preserving and comprehending these dance forms is therefore a relevant problem in the digital preservation of Indian heritage. In this paper, we propose a novel framework to classify Indian classical dance forms from videos. Representations are extracted through a Deep Convolutional Neural Network (DCNN) and optical flow, and are then used to train a multi-class linear support vector machine (SVM). Furthermore, a novel dataset is introduced to evaluate the performance of the proposed framework. The framework achieves an accuracy of 75.83% when tested on 211 videos.

Keywords-dance form recognition, Indian classical dance, deep convolutional neural network (DCNN), optical flow, multi-class SVM.

I. INTRODUCTION

'Shastriya Nritya', or Indian classical dance, is an umbrella term for the different performing arts that are associated with Indian culture. These dance forms are mostly performed around a given theme and reveal a great deal about ancient Indian traditions, cultures and customs. They combine a variety of head movements, facial expressions and hand movements. The Sangeet Natak Akademi recognizes a total of 8 different classical dance forms, which are listed below:

• Bharatnatyam
• Sattriya
• Kathak
• Kuchipudi
• Odissi
• Kathakali
• Manipuri
• Mohiniyattam

Fig. 1. Samples of the different types of Indian classical dance forms that are recognized by the framework.

The prime motivation behind the paper is to develop a framework capable of recognizing different types of Indian classical dance, which can aid the efficient retrieval and storage of Indian classical dance forms. These dance forms have been an integral part of Indian culture. Preserving and digitally archiving them is vital to the overall conservation of Indian cultural heritage. Further, the Internet boom has resulted in an outburst of unorganized media content related to these dance forms. Therefore, it is important to efficiently organize and store these types of digital media.

There are several challenges associated with recognizing Indian classical dance forms. Firstly, unlike in many western dances, facial expression plays an important role in Indian classical dance. Secondly, these dance forms comprise different hand and body postures, and recognizing these gestures and postures is a daunting task. Thirdly, these dance forms involve movements that are also common to other dance forms. Therefore, they cannot be recognized solely on the basis of hand and body movements.

Previous studies on Indian dance form recognition have mostly focused on individual dance forms and have been conducted on relatively small datasets. Moreover, there are no publicly available datasets to test frameworks for Indian dance form recognition. The major contribution of the paper lies in proposing a framework capable of recognizing Indian dance forms. The framework relies on state-of-the-art Deep Convolutional Neural Networks (DCNN) and optical flow to extract representations for recognizing Indian dance forms. Further, we introduce a publicly available dataset comprising 626 videos of 7 different Indian classical dance forms.

The rest of the paper is organized as follows. A study of the previous work on Indian classical dance forms, and on other dance forms in general, is presented in Section 2. The proposed methodology for detecting dance forms is presented in Section 3. Section 4 gives the details of the dataset that has been introduced. The results are discussed in Section 5. Section 6 concludes the paper.

II. LITERATURE REVIEW

To the best of our knowledge, very little research has been carried out on recognizing Indian dance forms. Previous works have mostly focused on a single dance form. A significant amount of research has been done on expression and gesture recognition in various Indian classical dance forms. Saha et al. [7] proposed a gesture recognition algorithm for the Indian classical style, giving an accuracy of 86.8%. The Kinect sensor used there generates the skeleton of a human body, out of which eleven coordinates are used to discriminate between 'Anger', 'Fear', 'Happiness', 'Sadness' and 'Relaxation'. Furthermore, their system provides an option to check whether the emotion is positive or negative. Soumya et al. [11] proposed a model to recognize different hand gestures in a dance form using an Artificial Neural Network (ANN). Devi et al. [1] gave a simple two-level classification method for single-hand gestures, termed 'Asamyukta Hastas', in Sattriya dance. The average accuracy is 75.45%. Saha et al. [8] proposed a technique to recognize postures with the use of a Kinect sensor, categorizing them into vertical symmetry, horizontal symmetry and angular symmetry. The approach results in a recognition rate of 86.75% for five subjects.

Devi et al. [2] also proposed glove-based and vision-based dance gesture recognition. Different representations were proposed as per the needs of the dance forms, and various classifiers were employed. Their work focused on static cues. An accuracy of 92.7% was obtained on a limited dataset. A considerable amount of work on recognizing Indian dance forms has been done by Mallik et al. [5]. A part of that work includes recognizing Bharatnatyam and Odissi with the use of the Scale-Invariant Feature Transform (SIFT). The SIFT representations were then used to train a Support Vector Machine classifier (linear kernel, cost factor of 2.0) in the Weka machine learning framework, resulting in a recognition rate of 92.78%.

III. PROPOSED ARCHITECTURE

The authors observed that the different cues associated with Indian classical dance forms can be categorized into two types. The static cues comprise body postures, facial expressions, costumes, etc., whereas hand, head, leg and body movements are dynamic cues associated with the movements of the dancer. Separate frameworks were designed to capture static and dynamic cues. A block diagram of the proposed approach is presented in Figure 2. The framework relies on deep spatio-temporal descriptors for dance form recognition. Twelve key frames were selected from the entire video in order to capture static cues such as body posture, facial expressions and costumes. Deep representations from the pre-trained AlexNet [4] network were extracted from each of these frames. The representations were then combined and used to train a multi-class linear Support Vector Machine (SVM).

Fig. 2. A block diagram representing the proposed architecture. Different matrices of a frame are fed to the input layer of the convolutional network for training, and the extracted representations are given as input to a linear SVM for final classification.

Two frameworks were designed to capture the dynamic cues in the dance videos. The optical flow of each of these frames was calculated in order to estimate the velocity of the pixels in these frames. A second round of optical flow was applied to the resulting images in order to estimate the acceleration of individual pixels. Deep representations were then extracted from both matrices of images. These sets of representations were concatenated individually and used to train two separate multi-class linear support vector machines. The output scores of the three SVMs were combined during the test phase to produce the final output.

A. Optical flow

In 1981, Horn and Schunck [3] derived an equation for optical flow estimation. The aim of the algorithm is to determine the velocity, or net displacement, of pixels between successive frames. Let the intensity of a given pixel P(x, y) at time t be equal to I(x, y, t). After a small time interval ∆t, the pixel will have moved to a new position P(x + ∆x, y + ∆y), so its intensity becomes I(x + ∆x, y + ∆y, t + ∆t). Assuming that the movement of the pixel is very small or negligible, it can be assumed that

$$\frac{dI(x, y, t)}{dt} = 0 \tag{1}$$

1) Optical Flow Normalization: For a given pixel (x, y), the value of the optical flow V(x, y) is calculated as

$$V(x, y) = u(x, y)^2 + v(x, y)^2 \tag{6}$$

The maximum and minimum values of V can be expressed as

$$V_{max} = \max_{(x, y) \in C} \left( u(x, y)^2 + v(x, y)^2 \right) \tag{7}$$

$$V_{min} = \min_{(x, y) \in C} \left( u(x, y)^2 + v(x, y)^2 \right) \tag{8}$$

Therefore,

$$f(x, y) = \frac{V(x, y)}{V_{max} - V_{min}} \times 255 \tag{9}$$

where f(x, y) is the normalized optical flow value and C is the set of all pixel coordinates in the image. The value of f(x, y) ranges from 0 to 255 and can be displayed as a gray image.

B. Using Deep CNNs for extracting representations

Deep learning models are trained on large sets of labeled data and learn representations from the data without manual intervention. Convolutional neural networks (CNNs) are among the most popular deep learning algorithms and have been widely used for image recognition and classification tasks [9]. A pre-trained CNN was chosen over a hand-crafted CNN because of the small size of the training dataset. The proposed architecture follows the AlexNet framework [4]. The frames are reshaped to 227 × 227 as required by the model. The model consists of a 227 × 227 × 3 input layer, 96 × 11 × 11 conv, 3 × 3 max pooling, 128 × 5 × 5 conv, 3 × 3 max pooling, 256 × 3 × 3 conv, 192 × 3 × 3 conv, 192 × 3 × 3 conv, 3 × 3 pooling, 4096 × 1 fc-6, 4096 × 1 fc-7 and 1000-way fc-8 layers.
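The spatial sizes implied by the layer list above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the strides and paddings are those of the standard AlexNet and are an assumption here, since the text gives only kernel sizes, and the helper names `conv_out` and `alexnet_spatial_sizes` are hypothetical.

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * pad - kernel) // stride + 1


# (name, kernel, stride, pad). Kernel sizes come from the text; strides and
# paddings are the standard AlexNet values and are an assumption.
ALEXNET_LAYERS = [
    ("conv1", 11, 4, 0),
    ("pool1", 3, 2, 0),
    ("conv2", 5, 1, 2),
    ("pool2", 3, 2, 0),
    ("conv3", 3, 1, 1),
    ("conv4", 3, 1, 1),
    ("conv5", 3, 1, 1),
    ("pool5", 3, 2, 0),
]


def alexnet_spatial_sizes(input_size=227):
    """Return (layer name, spatial size) pairs, starting from the input."""
    sizes = [("input", input_size)]
    size = input_size
    for name, kernel, stride, pad in ALEXNET_LAYERS:
        size = conv_out(size, kernel, stride, pad)
        sizes.append((name, size))
    return sizes


if __name__ == "__main__":
    for name, size in alexnet_spatial_sizes():
        print(f"{name}: {size} x {size}")
```

With a 227 × 227 input, this gives 55 × 55 after the first convolution and 6 × 6 after the last pooling layer, so under these assumptions fc-6 receives 6 × 6 feature maps.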
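The optical flow normalization of Eqs. (6) to (9) can also be sketched in code. This is a minimal sketch, not the authors' implementation. It assumes that the horizontal and vertical flow components u and v are given as equally sized nested lists, and that Eq. (9) scales V(x, y) by 255 / (V_max − V_min). It returns an all-zero image when V_max equals V_min, a choice made here only to avoid division by zero. The function names are hypothetical.

```python
def flow_magnitude(u, v):
    """Eq. (6): V(x, y) = u(x, y)^2 + v(x, y)^2 for every pixel."""
    return [
        [ux * ux + vx * vx for ux, vx in zip(u_row, v_row)]
        for u_row, v_row in zip(u, v)
    ]


def normalize_flow(u, v):
    """Eqs. (7)-(9): scale V by 255 / (V_max - V_min) over all pixels."""
    V = flow_magnitude(u, v)
    flat = [value for row in V for value in row]
    v_max, v_min = max(flat), min(flat)  # Eqs. (7) and (8)
    span = v_max - v_min
    if span == 0:
        # Assumption: a constant flow field maps to a black image.
        return [[0.0 for _ in row] for row in V]
    return [[value / span * 255 for value in row] for row in V]


if __name__ == "__main__":
    u = [[0, 3], [1, 0]]
    v = [[0, 4], [0, 2]]
    for row in normalize_flow(u, v):
        print(row)
```

When V_min is 0, which holds whenever at least one pixel is static, the result lies in the range 0 to 255 and can be rounded to integers for display as a gray image.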