
Nrityantar: Pose oblivious Indian classical dance sequence classification system

Vinay Kaushik∗, Prerana Mukherjee∗, Brejesh Lall
Dept. of Electrical Engineering, IIT Delhi
[email protected], [email protected], [email protected]
∗Equal contribution

ABSTRACT

In this paper, we attempt to advance the research work done in human action recognition to a rather specialized application, namely Indian Classical Dance (ICD) classification. The variation in such dance forms in terms of hand and body postures, facial expressions or emotions, and head orientation makes pose estimation an extremely challenging task. To circumvent this problem, we construct a pose-oblivious shape signature which is fed to a sequence learning framework. The pose signature representation is done as a two-fold process. First, we represent the person-pose in the first frame of a dance video using symmetric Spatial Transformer Networks (STN) to extract good person object proposals and a CNN-based parallel single person pose estimator (SPPE). Next, the pose bases are converted to pose flows by assigning a similarity score between successive poses followed by non-maximal suppression. Instead of feeding a simple chain of joints into the sequence learner, which generally hinders the network performance, we constitute a feature vector of normalized distance vectors, flow, and angles between anchor joints, which captures the adjacency configuration in the skeletal pattern. Thus, the kinematic relationship amongst the body joints across the frames, obtained via pose estimation, helps in better establishing the spatio-temporal dependencies. We present an exhaustive empirical evaluation of state-of-the-art deep network based methods for dance classification on the ICD dataset.

CCS Concepts

• Computing methodologies → Supervised Learning; Feature selection; Computer Vision; Image Processing

Keywords

Pose Signature; Dance Classification; Action Recognition; Motion and Video Analysis; Deep Learning; Supervised Learning; LSTM
1. INTRODUCTION

Research in action recognition from video sequences has expedited in recent years with the advent of large-scale data sources like ActivityNet [4] and ImageNet [13] and the availability of high computing resources. Along with other computer vision areas, Convolutional Neural Networks (ConvNets) have been extended to this task as well, utilizing spatio-temporal filtering approaches [10], multi-channel input streams [26, 11], and dense trajectory estimation with optical flow [5]. However, the extent of success is not as overwhelming as in the case of image classification and recognition tasks. Long Short Term Memory networks (LSTMs) capture temporal context such as pose and motion information effectively in action recognition [22]. We attempt to advance the research work done in human action recognition to a rather specialized domain of Indian Classical Dance (ICD) classification.

There is a rich cultural heritage prevalent in the Indian Classical Dance or Shashtriya Nritya¹ forms. The seven Indian classical dance forms include Bharatnatyam, Kathak, Kathakali, Odissi, Manipuri, Kuchipudi, and Mohiniattam. Each category varies in the hand gestures (or hasta mudras), body postures (or anga), facial expressions or emotions depicting the nava-rasas², head orientation, as well as the rhythmic musical patterns that accompany them. Even the dressing attires and make-up differ hugely across them. All these put together constitute the complex rule engines which govern the gesture and movement patterns in these dance forms. In this work, however, we do not address the gesture aspect of these complex dance forms. In this paper, we classify these categories based on human body-pose estimation utilizing sequential learning.

Instead of feeding a simple chain of joints into the sequence learner, which generally hinders the network performance, we constitute a feature vector of normalized distance vectors which captures the adjacency configuration in the skeletal pattern. Thus, the kinematic relationship amongst the body joints across the frames helps in better establishing the spatio-temporal dependencies.

During an Indian classical dance performance, the dancer may often be required to sit in a squat or half-sitting pose, adopt a cross-legged pose, or take circular movements or turns, thus leading to severe body-occlusion which renders human pose estimation a highly challenging task. The addition of certain dress attires further complicates the pose estimation. In the case of Manipuri dance, women dancers wear an elaborately decorated barrel-shaped long skirt which is stiffened at the bottom and closed near the top, due to which the leg positions are obscured. In some dance forms, such as Kathakali, only the facial expressions are highlighted and the dancers wear heavy masks. In this work, we leverage the feature strength by constructing a dance pose shape signature which is fed to a sequence learning framework. It encodes the motion information along with the geometric constraints to handle the aforementioned problems. In this paper, we address six of the oldest Indian dance classes, namely Bharatnatyam, Kathak, Odissi, Kuchipudi, Manipuri and Mohiniattam. Bharatnatyam is one of the most popular ICDs, belonging to the southern belt of India, having originated in Tamil Nadu. Kuchipudi emanated from the Andhra Pradesh and Telangana regions. Mohiniattam belongs to Kerala. Kathak originated in the northern part of India. Manipuri and Odissi are from the eastern part of India, from Manipur and Orissa respectively. We have not considered the Kathakali ICD as it mainly involves facial expressions and does not contain enough visual dance stances to be catered for in the adopted classification pipeline. The different dance forms are depicted in Fig. 1.

The remaining sections of the paper are organized as follows. In Sec. 2 we discuss related work in dance and action recognition classification. In Sec. 3, we outline the proposed methodology to detect the dancing person, track the dancer's pose and movement, and characterize the dancer's trajectory to understand their behavior. In Sec. 4, we discuss experimental results, and we conclude the paper in Sec. 5.

¹ "Shastriya Nritya" is the equivalent of "Classical dance".
² "Rasa" is an emotion experienced by the audience, created by the facial expression or the feeling of the actor. "Nava" refers to nine such emotions, namely: erotic, humor, pathetic, terrible, heroic, fearful, odious, wondrous and peaceful.

Figure 1: Indian Classical Dance (ICD) Forms: (a) Bharatnatyam (b) Kuchipudi (c) Manipuri (d) Kathak (e) Mohiniattam (f) Odissi

2. RELATED WORKS

2.1 Handcrafted feature based approaches

Most of the traditional works in the action recognition domain consist of constructing handcrafted features to discriminate action labels [15, 21, 14, 2, 6]. These features were then encoded into Bag of Visual Words (BoVW) or Fisher Vector (FV) encodings, and classifier models were trained on the encodings to classify the videos with suitable labels. However, these feature sets were not optimized and failed to capture the temporal dependencies in the video streams needed to encode high-level contextual information. Histograms of Oriented Gradients [6] have remained the de-facto choice for capturing shape information in the action recognition task. Histograms of Optical Flow magnitude and orientation (HOF) [7] model the motion; they obtain better performance since they represent the dynamic content of the cuboid volume. Space-time interest point (STIP) detectors [14] are extensions of 2D interest point detectors that incorporate temporal information. As in the 2D case, STIP features are stable under rotation, viewpoint, scale and illumination changes. Spatio-temporal corners are located in regions that exhibit a high variation of image intensity in all three directions (x, y, t). This requires that such corner points are located at spatial corners such that they invert motion in two consecutive frames (high temporal gradient variation). They are identified from local maxima of a cornerness function computed for all pixels across spatial and temporal scales. A holistic representation of each action sequence is then given by a vector of features. Motion history images [1], described using Hu moments, are utilized as global space-time shape descriptors.
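For concreteness, a HOG descriptor like the one discussed above can be computed with scikit-image; this is a minimal sketch, and the stand-in frame and the cell/block parameters are illustrative assumptions, not the settings used in the cited works.

```python
import numpy as np
from skimage.feature import hog

# Stand-in grayscale frame; a real pipeline would use a video frame.
frame = np.random.rand(128, 64)

# 9 orientation bins over 8x8-pixel cells, normalized in 2x2-cell blocks,
# in the spirit of the defaults popularized by Dalal and Triggs [6].
descriptor = hog(frame, orientations=9, pixels_per_cell=(8, 8),
                 cells_per_block=(2, 2), block_norm='L2-Hys')
print(descriptor.shape)  # flattened shape descriptor for one frame
```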
2.2 Deep feature based approaches

With deep learning networks, most computer vision tasks have seen remarkable accuracy gains. Human action recognition can be solved more effectively if comprehensive 3D pose information is exploited in the deep networks [10, 24]. In [25], the authors utilize the co-occurrence of the skeletal joints in a sequence learning framework; however, since a joint optimization framework is involved, it is computationally expensive. In [20], the authors investigate a two-stream ConvNet architecture and jointly model the spatio-temporal information using separate networks. They also demonstrate that multi-frame dense optical flow yields improved performance in spite of limited training data. In [23], the authors utilize trajectory-constrained pooling to aggregate the discriminative convolutional features into trajectory-pooled deep-convolutional descriptors. They construct two normalization methods for spatio-temporal and channel information. The trajectory-pooled deep-convolutional descriptor performs sampling and pooling on aggregated deep features, thus improving the discriminative ability.

2.3 Pose feature based approaches

Convolutional Neural Network (CNN) architectures [3, 17] can provide excellent results for 2D human pose estimation on images. However, handling occlusion is quite a challenging task. In [18], the authors obtain pose proposals by localizing pose classes using anchor poses; the pose proposals are then scored to obtain the classification score and regressed independently. Since pose regression techniques result in a single pose estimate, they cannot handle multimodal pose distributions. In order to capture such information, the pose estimates can be discretized into various bins, followed by pose classification. In [16], the authors construct a classification-regression framework that utilizes a classification network to generate a discretized multimodal pose estimate, followed by a regression network to refine the discretized estimate into a continuous one. The architecture incorporates various loss functions to enable different models. In [26], the authors incorporate pose, motion and color channel information into a Markovian model to exploit spatial and temporal information for localization and classification.
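The following toy numpy sketch illustrates the discretize-then-refine idea behind the classification-regression framework of [16] on a single joint angle; the bin count and the stand-in network outputs are assumptions for illustration only.

```python
import numpy as np

K = 12                                     # assumed number of discrete pose bins
edges = np.linspace(-np.pi, np.pi, K + 1)
centers = (edges[:-1] + edges[1:]) / 2.0

def classify_then_refine(class_probs, residuals):
    """Pick the most probable bin (classification stage), then add that
    bin's regressed offset (regression stage) for a continuous estimate."""
    k = int(np.argmax(class_probs))
    return centers[k] + residuals[k]

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(K))          # stand-in classifier output
offsets = rng.uniform(-0.1, 0.1, K)        # stand-in per-bin regressed offsets
print(classify_then_refine(probs, offsets))
```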
3. PROPOSED METHODOLOGY

In this section, the proposed framework and its main components are discussed in detail, including the recognition of a dance form from a sequence of frames in a video using LSTM networks, and feature extraction for the video frames. First, we extract features using uniform frame skipping in a sequence of frames, such that the skipping of frames does not affect the sequence of the action in the video. The feature vector comprises multiple components. We perform feature extraction using a 3-tier framework combining Inception V3 features, 3D CNN features and novel pose signatures. The second component trains our model for dance classification. It takes a chunk of features as input, where a chunk is defined as a collection of features for the selected frames in the video. The feature chunks are then fed to an LSTM network in order to classify the various dance forms. The output of the LSTM layer is connected to 2 fully connected layers with Batch Normalization at each stage, and is finally connected to a softmax layer of size 6, corresponding to the 6 ICD dance forms.

3.1 Feature Construction: Inception V3 Features, 3D CNN, Pose Signature

The feature vector for training our LSTM model is an embedding of various deep features (both 2D and 3D) and geometric cues into a single vector, learnt to classify the classical dance type. The 2D spatial cues are learnt using InceptionV3 network weights trained on the ImageNet dataset [8]. The 3D temporal cues are described using ResNext-101 with cardinality 32, trained on the Kinetics dataset [12], which contains 400 video classes. To construct the pose signature, geometric information is used to build features comprising pose, flow and other structural information. The variation in Indian classical dance forms in terms of hand and body postures and head orientation makes pose estimation an extremely challenging task. To circumvent this problem, we construct a pose-oblivious shape signature which is fed to a sequence learning framework. The pose estimation is done as a two-fold process using the AlphaPose framework. First, we represent the person-pose in a frame of a dance video using symmetric Spatial Transformer Networks (STN) to extract good person object proposals and a CNN-based parallel single person pose estimator (SPPE). Next, the pose bases are converted to pose flows by assigning a similarity score between successive poses, followed by non-maximal suppression. The video frames fed into the AlphaPose framework yield a skeletal pose comprising anchor joints at the various body parts of a human being. The anchor joints are used to further compute various geometric features, resulting in a pose signature (as explained in Sec. 3.2). The feature generation strategy is described in detail below.

Inception V3 features: These features are widely used for the classification of images. The input size is 299x299x3 and the output layer gives a feature of size 1x1x2048. We utilize the pre-trained ImageNet model used for image classification. This has several advantages over other networks: it adds Batch Normalization to the previous architectures, replaces the 5x5 convolution layers in Inception V2 with two 3x3 convolution layers, and has the benefit of an added BN-Auxiliary layer, i.e. the fully connected layer of the auxiliary classifier is also normalized, not just the convolutional layers.

ResNext-101 Kinetics features: This network is pre-trained on the Kinetics dataset with 400 video classes. We use 16 consecutive frames for generating a 3D feature. The architecture comprises 101 convolutional layers with skip connections and a cardinality of 32. The output can be represented as a feature vector of size 2048, or 16x128 for 16 frames; thus, it gives a 128-dimensional feature vector per frame.

3.2 Pose Signature

After the AlphaPose estimation, we construct a novel pose signature which consists of the normalized distances and angles between the anchor joints in the pose vector. The pose vector constitutes 16 anchor joints, namely 1: footright, 2: kneeright, 3: hipright, 4: hipleft, 5: kneeleft, 6: footleft, 7: hipcenter, 8: spine, 9: shouldercenter, 10: head, 11: handright, 12: elbowright, 13: shoulderright, 14: shoulderleft, 15: elbowleft, 16: handleft, as shown in Fig. 3. We utilize joint 7 as the reference point for taking the normalized distances to all other anchor joints; the distance metric used is the Euclidean norm. We also incorporate the angles between the key anchor joints to characterize the dance stances. The angles are taken between these ordered anchor joint pairs: {1,3}, {2,7}, {4,6}, {5,7}, {11,13}, {9,12}, {9,15}, {14,16}. We embed the normalized distances between the leg joints ({1,6}, {2,5}) and hand joints ({11,16}, {12,15}) in the pose signature as well. Next, we calculate the flow vectors between the anchor joints in successive frames to characterize the temporal dependencies, and we also compute the flow directionality for the anchor joints across successive frames. Thus, in total, the pose signature constitutes a 75-D vector. Instead of simply feeding a chain of joints into the sequence learner, which generally hinders the network performance, we constitute a pose signature which captures the adjacency configuration in the skeletal pattern. Hence, the kinematic relationship amongst the body joints across the frames, obtained via the pose estimation step, helps in better establishing the spatio-temporal dependencies.

Figure 3: (a) Pose of a Kathak dancer (b) Visualization of the anchor joints
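A minimal numpy sketch of this construction follows. The 75 dimensions are assumed here to decompose as 15 reference-joint distances + 4 leg/hand distances + 8 angles + 32 flow components + 16 flow directions (which matches the stated total), but the exact ordering and the normalizing length are assumptions.

```python
import numpy as np

REF = 6  # 0-based index of joint 7 (hipcenter) in a (16, 2) array of (x, y) joints

# 0-based versions of the paper's 1-based joint pairs
ANGLE_PAIRS = [(0, 2), (1, 6), (3, 5), (4, 6), (10, 12), (8, 11), (8, 14), (13, 15)]
EXTRA_DISTS = [(0, 5), (1, 4), (10, 15), (11, 14)]   # leg and hand pairs

def pose_signature(joints, prev_joints):
    """One plausible 75-D layout consistent with Sec. 3.2 (assumed ordering)."""
    # Assumed normalizer: hipcenter-to-head length (the paper does not specify it).
    scale = np.linalg.norm(joints[REF] - joints[9]) + 1e-8
    dists = [np.linalg.norm(joints[j] - joints[REF]) / scale
             for j in range(16) if j != REF]                      # 15 values
    extra = [np.linalg.norm(joints[a] - joints[b]) / scale
             for a, b in EXTRA_DISTS]                             # 4 values
    angles = [np.arctan2(*(joints[b] - joints[a])[::-1])
              for a, b in ANGLE_PAIRS]                            # 8 values
    flow = (joints - prev_joints).ravel() / scale                 # 32 values (dx, dy per joint)
    direction = np.arctan2(flow[1::2], flow[0::2])                # 16 flow angles
    return np.concatenate([dists, extra, angles, flow, direction])  # 15+4+8+32+16 = 75

sig = pose_signature(np.random.rand(16, 2), np.random.rand(16, 2))
assert sig.shape == (75,)
```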
3.3 Sequence Learning Framework

The feature vectors obtained from the 3D CNN, the pose signature and the ImageNet features are concatenated and fed into a sequence learning framework. We utilize Long Short Term Memory networks for dance sequence classification. Usually, Recurrent Neural Networks (RNNs) are difficult to train with activation functions such as tanh and sigmoid due to the problems of vanishing and exploding gradients [9]. LSTMs can be used in place of RNNs to learn long-term dependencies, and they solve the aforementioned problems. An LSTM consists of one self-connected memory cell c and three multiplicative units: the input gate i, forget gate f and output gate o.

Given an input sequence x = (x_0, ..., x_T), the activations of the memory cell and the three gates are given as follows:

i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)    (1)
f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)    (2)
c_t = f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)    (3)
o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)        (4)
h_t = o_t \tanh(c_t)                                                (5)

where \sigma(\cdot) is the sigmoid function, the W terms are the weight matrices between two units, and h denotes the output values of an LSTM cell.

We feed the input feature vector into the LSTM network. The concatenated feature is a 2251-D vector. We utilize a dense network after the LSTM layer, followed by a softmax layer, for dance sequence classification, as shown in Fig. 2(b). The constructed feature vector is able to characterize the spatio-temporal dependencies between the anchor joints.

Figure 2: (a) Architecture of the Nrityantar framework (b) Block diagram of the sequence learning framework in Nrityantar. The numbers in the layers indicate the number of features.
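A literal numpy transcription of one time step of Eqs. (1)-(5) is given below. Note that the W_{ci}, W_{cf} and W_{co} terms are peephole connections, treated here as diagonal (element-wise) vectors as is conventional; the hidden size is an illustrative assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One step of Eqs. (1)-(5); p holds the weights and biases."""
    i_t = sigmoid(p['Wxi'] @ x_t + p['Whi'] @ h_prev + p['Wci'] * c_prev + p['bi'])   # (1)
    f_t = sigmoid(p['Wxf'] @ x_t + p['Whf'] @ h_prev + p['Wcf'] * c_prev + p['bf'])   # (2)
    c_t = f_t * c_prev + i_t * np.tanh(p['Wxc'] @ x_t + p['Whc'] @ h_prev + p['bc'])  # (3)
    o_t = sigmoid(p['Wxo'] @ x_t + p['Who'] @ h_prev + p['Wco'] * c_t + p['bo'])      # (4)
    h_t = o_t * np.tanh(c_t)                                                          # (5)
    return h_t, c_t

d_in, d_hid = 2251, 128  # feature size from Sec. 3.3; hidden size is assumed
rng = np.random.default_rng(0)
p = {k: rng.normal(scale=0.01,
                   size=(d_hid, d_in) if k.startswith('Wx')
                   else (d_hid, d_hid) if k.startswith('Wh')
                   else d_hid)  # peephole weights and biases are vectors
     for k in ['Wxi', 'Whi', 'Wci', 'bi', 'Wxf', 'Whf', 'Wcf', 'bf',
               'Wxc', 'Whc', 'bc', 'Wxo', 'Who', 'Wco', 'bo']}
h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_hid), np.zeros(d_hid), p)
```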

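At the model level, a minimal Keras sketch of the pipeline in Fig. 2(b) could look as follows. The per-frame concatenation 2048 (InceptionV3) + 128 (ResNext-101) + 75 (pose signature) = 2251 and the quarter/half dense sizing follow Secs. 3.1-3.2 and 4.2; the LSTM width and the ReLU activations are assumptions, as the paper does not state them.

```python
from tensorflow.keras import layers, models

SEQ_LEN, N_CLASSES = 48, 6
FEAT = 2048 + 128 + 75        # InceptionV3 + ResNext-101 + pose signature = 2251
lstm_units = 256              # assumed; the paper does not state the LSTM width

model = models.Sequential([
    layers.LSTM(lstm_units, input_shape=(SEQ_LEN, FEAT)),
    layers.Dense(lstm_units // 4, activation='relu'),  # quarter of LSTM output (Sec. 4.2)
    layers.BatchNormalization(),
    layers.Dense(lstm_units // 8, activation='relu'),  # half of the first dense layer
    layers.BatchNormalization(),
    layers.Dense(N_CLASSES, activation='softmax'),     # 6 ICD classes
])
model.summary()
```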
We utilize the cross-entropy loss function given below.

Training Loss: We use the Adam optimizer with categorical cross-entropy for training. The categorical cross-entropy is:

-\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} \mathbb{1}_{y_i \in C_c} \log p_{\text{model}}[y_i \in C_c]    (6)

The double sum is over the observations (video features) i, whose number is N, and the categories c, whose number is C. The term \mathbb{1}_{y_i \in C_c} is the indicator function of the i-th observation belonging to the c-th category, and p_{\text{model}}[y_i \in C_c] is the probability predicted by the model for the i-th observation to belong to the c-th category. When there are more than two categories, the neural network outputs a vector of C probabilities, each giving the probability that the network input should be classified as belonging to the respective category.
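Eq. (6) transcribes directly into numpy; the toy batch below is illustrative only, with the one-hot rows of y_true playing the role of the indicator function.

```python
import numpy as np

def categorical_cross_entropy(y_true_onehot, y_pred_probs):
    """Direct transcription of Eq. (6) for one-hot labels."""
    N = y_true_onehot.shape[0]
    return -np.sum(y_true_onehot * np.log(y_pred_probs + 1e-12)) / N

# Toy batch: 3 videos, 6 dance classes, 0.80 probability on the true class.
y_true = np.eye(6)[[0, 3, 5]]
y_pred = np.full((3, 6), 0.04)
y_pred[[0, 1, 2], [0, 3, 5]] = 0.80
print(categorical_cross_entropy(y_true, y_pred))  # = -log(0.8) ≈ 0.223
```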
4. EXPERIMENTAL RESULTS

In this section, we evaluate our model and compare it with five other architectures on a benchmark dataset for Indian classical dance: the ICD dataset [19]. We also discuss overfitting issues and the computational efficiency of the proposed model.

4.1 Evaluation Dataset and Parameter Settings

The ICD dataset mainly consists of videos curated from YouTube for primarily the 3 oldest dance forms, Bharatnatyam, Kathak and Odissi, with video class annotations. Each class consists of 30 video clips of maximum resolution 400×350 and of 25 sec maximum duration. We have augmented the dataset with YouTube videos for the classes Manipuri, Kuchipudi and Mohiniattam. During data processing, we further clipped the video segments into 5-6 second chunks of frames at 25 fps to generate a maximum of 150 frames. The train-to-test ratio for evaluation was set to 7:3. The resultant dataset poses several challenges, including varying illumination, shadow effects of dancers on stage, similar dance stances, etc. The low accuracy of the skeleton joint coordinates and partially missing body parts in some sequences make this dataset very challenging. Tab. 1 shows the parameter settings of our proposed model on the evaluated dataset. For training, we utilize the Adam optimizer with a learning rate of 0.0004 and a decay of 0.000001. The loss function is categorical cross-entropy. Training is done for a maximum of 100 epochs, with early stopping if the validation error does not improve for 5 consecutive epochs.

Table 1: Parameters used for training the LSTM

    Optimizer          Adam
    Loss               Categorical cross-entropy
    Learning rate      0.0001
    Decay              0.000001
    Feature length     2251
    Output length      6
    Sequence length    48 frames (one training sample, uniformly distributed)
    Batch size         32
    Maximum epochs     100
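A hedged Keras training configuration matching Tab. 1 might look as follows, continuing the model sketch above. `X_train`/`y_train` are assumed pre-extracted feature stacks; the legacy `decay` argument mirrors the table's decay setting and is specific to older Keras optimizer versions.

```python
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam

# Learning rate 0.0001 per Tab. 1 (the running text quotes 0.0004).
model.compile(optimizer=Adam(learning_rate=1e-4, decay=1e-6),
              loss='categorical_crossentropy', metrics=['accuracy'])

stopper = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
model.fit(X_train, y_train,
          validation_split=0.3,   # approximates the paper's 7:3 split
          batch_size=32, epochs=100, callbacks=[stopper])
```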
IEEE tron approach with reasonable accuracy (40-50%) but lag- Transactions on pattern analysis and machine ging behind LSTM (60-70%). intelligence, 23(3):257–267, 2001. We have also shown the evaluation on various possible permutations of the feature sets to validate the efficacy of [2] M. Bregonzio, S. Gong, and T. Xiang. Recognising the concatenation on improving discriminative ability with- action as clouds of space-time interest points. In out hindering the network’s training performance. Tab. 4 Computer Vision and Pattern Recognition, 2009. provides an evaluation on various possible combination of CVPR 2009. IEEE Conference on, pages 1948–1955. the feature sets. In order to further improve the feature IEEE, 2009. representation and model the spatio-temporal dependencies, [3] A. Bulat and G. Tzimiropoulos. Human pose we incorporate Residual Network with multiple cardinalities estimation via convolutional part heatmap regression. (also known as ResNext) using 3D CNN architecture. The In European Conference on Computer Vision, pages ResNext-101 was trained on Kinetics dataset, which com- 717–732. Springer, 2016. prises of 400 video classes. The ResNext architecture uti- [4] F. Caba Heilbron, V. Escorcia, B. Ghanem, and lizes 16 consecutive frames as input and outputs 3D feature J. Carlos Niebles. Activitynet: A large-scale video vector with size 2048 (or 16x128). We implemented ResNext benchmark for human activity understanding. In for all frames in the videos (with sequence length>48). The Proceedings of the IEEE Conference on Computer output is then saved in a JSON format with segment length Vision and Pattern Recognition, pages 961–970, 2015. 16 per video. We convert this data into a feature vector of [5] G. Ch´eron, I. Laptev, and C. Schmid. P-cnn: 128 dimensionality for the input images and then selected 48 Pose-based cnn features for action recognition. In frames again uniformly for training. The training input is of Proceedings of the IEEE international conference on size (32,48,128). The training is computationally less expen- computer vision, pages 3218–3226, 2015. sive due to reduced feature vector. The results were better [6] N. Dalal and B. Triggs. Histograms of oriented than InceptionV3 which utilised only spatial information, gradients for human detection. In Computer Vision but since ResNext captures flow i.e. temporal information, and Pattern Recognition, 2005. CVPR 2005. IEEE Table 3: Comparison of various features and their combinations for Dance Classification Average Method Bharatnatyam Kathak Kuchipudi Manipuri Mohiniattam Odissi Accuracy InceptionV3 80.48 62.71 28.90 30.64 98.41 53.62 59.1 Pose Signature 88.15 79.31 43.65 41.50 96.82 71.01 67.41 Kinetics 84.21 68.96 56.34 30.18 96.82 57.97 65.61 InceptionV3+ 51.31 67.24 65.87 67.92 93.65 73.91 68.98 Pose Signature InceptionV3+ 85.52 68.96 63.49 24.53 96.82 57.97 67.19 Kinetics InceptionV3+ Pose Signature+ 86.84 87.93 70.63 24.52 93.65 63.76 72.35 Kinetics

[1] A. F. Bobick and J. W. Davis. The recognition of human movement using temporal templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(3):257–267, 2001.
[2] M. Bregonzio, S. Gong, and T. Xiang. Recognising action as clouds of space-time interest points. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 1948–1955. IEEE, 2009.
[3] A. Bulat and G. Tzimiropoulos. Human pose estimation via convolutional part heatmap regression. In European Conference on Computer Vision, pages 717–732. Springer, 2016.
[4] F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. Carlos Niebles. ActivityNet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 961–970, 2015.
[5] G. Chéron, I. Laptev, and C. Schmid. P-CNN: Pose-based CNN features for action recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 3218–3226, 2015.
[6] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 886–893. IEEE, 2005.
[7] N. Dalal, B. Triggs, and C. Schmid. Human detection using oriented histograms of flow and appearance. In European Conference on Computer Vision, pages 428–441. Springer, 2006.
[8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
[9] S. Hochreiter, Y. Bengio, P. Frasconi, J. Schmidhuber, et al. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, 2001.
[10] S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):221–231, 2013.
[11] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014.
[12] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[14] I. Laptev. On space-time interest points. International Journal of Computer Vision, 64(2-3):107–123, 2005.
[15] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[16] S. Mahendran, H. Ali, and R. Vidal. A mixed classification-regression framework for 3D pose estimation from 2D images. arXiv preprint arXiv:1805.03225, 2018.
[17] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pages 483–499. Springer, 2016.
[18] G. Rogez, P. Weinzaepfel, and C. Schmid. LCR-Net: Localization-classification-regression for human pose. In CVPR 2017 - IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[19] S. Samanta, P. Purkait, and B. Chanda. Indian classical dance classification by learning dance pose bases. 2012.
[20] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pages 568–576, 2014.
[21] H. Wang and C. Schmid. Action recognition with improved trajectories. In Proceedings of the IEEE International Conference on Computer Vision, pages 3551–3558, 2013.
[22] L. Wang, Y. Qiao, and X. Tang. Action recognition and detection by combining motion and appearance features. THUMOS14 Action Recognition Challenge, 1(2):2, 2014.
[23] L. Wang, Y. Qiao, and X. Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4305–4314, 2015.
[24] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In European Conference on Computer Vision, pages 20–36. Springer, 2016.
[25] W. Zhu, C. Lan, J. Xing, W. Zeng, Y. Li, L. Shen, X. Xie, et al. Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks. In AAAI, volume 2, page 6, 2016.
[26] M. Zolfaghari, G. L. Oliveira, N. Sedaghat, and T. Brox. Chained multi-stream networks exploiting pose, motion, and appearance for action classification and detection. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2923–2932. IEEE, 2017.