
Nrityantar: Pose oblivious Indian classical dance sequence classification system

Vinay Kaushik∗, Prerana Mukherjee∗, Brejesh Lall
Dept. of Electrical Engineering, IIT Delhi
[email protected], [email protected], [email protected]
∗Equal contribution

ABSTRACT

In this paper, we attempt to advance the research work done in human action recognition to a rather specialized application, namely Indian Classical Dance (ICD) classification. The variation in such dance forms in terms of hand and body postures, facial expressions or emotions, and head orientation makes pose estimation an extremely challenging task. To circumvent this problem, we construct a pose-oblivious shape signature which is fed to a sequence learning framework. The pose signature representation is done as a two-fold process. First, we represent the person-pose in the first frame of a dance video using symmetric Spatial Transformer Networks (STN) to extract good person object proposals and a CNN-based parallel single person pose estimator (SPPE). Next, the pose bases are converted to pose flows by assigning a similarity score between successive poses followed by non-maximal suppression. Instead of feeding a simple chain of joints into the sequence learner, which generally hinders the network performance, we constitute a feature vector of normalized distance vectors, flow, and angles between anchor joints, which captures the adjacency configuration in the skeletal pattern. Thus, the kinematic relationship amongst the body joints across the frames, obtained via pose estimation, helps in better establishing the spatio-temporal dependencies. We present an exhaustive empirical evaluation of state-of-the-art deep network based methods for dance classification on the ICD dataset.

CCS Concepts

• Computing methodologies → Supervised Learning; Feature selection; Computer Vision; Image Processing

Keywords

Pose Signature; Dance Classification; Action Recognition; Motion and Video Analysis; Deep Learning; Supervised Learning; LSTM
1. INTRODUCTION

Research in action recognition from video sequences has expedited in recent years with the advent of large-scale data sources like ActivityNet [4] and ImageNet [13] and the availability of high computing resources. Along with other computer vision areas, Convolutional Neural Networks (ConvNets) have been extended to this task as well, utilizing spatio-temporal filtering approaches [10], multi-channel input streams [26, 11], and dense trajectory estimation with optical flow [5]. However, the extent of success is not as overwhelming as in the case of image classification and recognition tasks. Long Short Term Memory networks (LSTMs) capture temporal context such as pose and motion information effectively in action recognition [22]. We attempt to advance the research work done in human action recognition to a rather specialized domain of Indian Classical Dance (ICD) classification.

There is a rich cultural heritage prevalent in the Indian Classical Dance or Shashtriya Nritya¹ forms. The seven Indian classical dance forms include Bharatnatyam, Kathak, Kathakali, Odissi, Manipuri, Kuchipudi, and Mohiniattam. Each category varies in the hand gestures (or hasta mudras), body postures (or anga), facial expressions or emotions depicting the nava-rasas², head orientation, as well as the rhythmic musical patterns that accompany them. Even the dressing attires and make-up differ hugely across them. All these put together constitute the complex rule engines which govern the gesture and movement patterns in these dance forms. In this work, however, we do not address the gesture aspect of these complex dance forms. In this paper, we classify these categories based on human body-pose estimation utilizing sequential learning.

Instead of feeding a simple chain of joints into the sequence learner, which generally hinders the network performance, we constitute a feature vector of normalized distance vectors which captures the adjacency configuration in the skeletal pattern. Thus, the kinematic relationship amongst the body joints across the frames helps in better establishing the spatio-temporal dependencies.

During an Indian classical dance performance, the dancer may often be required to sit in a squat or half-sitting pose, adopt a cross-legged pose, or take circular movements or turns, thus leading to severe body-occlusion which renders human pose estimation a highly challenging task. The addition of certain dress attires further complicates the pose estimation. In the case of Manipuri dance, women dancers wear an elaborately decorated barrel-shaped long skirt which is stiffened at the bottom and closed near the top, due to which the leg positions are obscured. In some dance forms, such as Kathakali, only the facial expressions are highlighted and the dancers wear heavy masks. In this work, we leverage the feature strength by constructing a dance pose shape signature which is fed to a sequence learning framework. It encodes the motion information along with the geometric constraints to handle the aforementioned problems. In this paper, we address six of the oldest Indian dance classes, namely Bharatnatyam, Kathak, Odissi, Kuchipudi, Manipuri and Mohiniattam. Bharatnatyam is one of the most popular ICDs, belonging to the southern belt of India, having originated in Tamil Nadu. Kuchipudi emanated from the Andhra Pradesh and Telangana regions. Mohiniattam belongs to Kerala. Kathak originated in the northern part of India. Manipuri and Odissi are from the eastern part of India, from Manipur and Orissa respectively. We have not considered the Kathakali ICD as it mainly involves facial expressions and does not contain enough visual dance stances to be catered for in the adopted classification pipeline. The different dance forms are depicted in Fig. 1.

The remaining sections of the paper are organized as follows. In Sec. 2 we discuss related work in dance and action recognition classification. In Sec. 3, we outline the proposed methodology to detect the dancing person, track the dancer's pose and movement, and characterize the dancer's trajectory to understand their behavior. In Sec. 4, we discuss experimental results, and we conclude the paper in Sec. 5.

¹ "Shastriya Nritya" is the equivalent of "Classical dance".
² "Rasa" is an emotion experienced by the audience, created by the facial expression or the feeling of the actor. "Nava" refers to nine such emotions, namely: erotic, humor, pathetic, terrible, heroic, fearful, odious, wondrous and peaceful.

Figure 1: Indian Classical Dance (ICD) Forms: (a) Bharatnatyam (b) Kuchipudi (c) Manipuri (d) Kathak (e) Mohiniattam (f) Odissi

2. RELATED WORKS

2.1 Handcrafted feature based approaches

Most of the traditional works in the action recognition domain consist of constructing handcrafted features to discriminate action labels [15, 21, 14, 2, 6]. These features were then encoded into Bag of Visual Words (BoVW) or Fisher Vector (FV) encodings, and classifier models were trained on the encodings to classify the videos with suitable labels. However, these feature sets were not optimized and failed to capture the temporal dependencies in the video streams needed to encode high-level contextual information. Histograms of Oriented Gradients [6] have remained the de-facto choice for capturing shape information in the action recognition task. Histograms of Optical Flow magnitude and orientation (HOF) [7] model the motion; they obtain better performance since they represent the dynamic content of the cuboid volume. Space-time interest point (STIP) detectors [14] are extensions of 2D interest point detectors that incorporate temporal information. As in the 2D case, STIP features are stable under rotation, viewpoint, scale and illumination changes. Spatio-temporal corners are located in regions that exhibit a high variation of image intensity in all three directions (x, y, t). This requires that such corner points are located at spatial corners such that they invert motion in two consecutive frames (high temporal gradient variation). They are identified from local maxima of a cornerness function computed for all pixels across spatial and temporal scales. A holistic representation of each action sequence is then given by a vector of features. Motion history images [1], described using Hu moments, are utilized as global space-time shape descriptors.
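For concreteness, a HOG descriptor like the one discussed above can be computed with scikit-image; this is a minimal sketch, and the stand-in frame and the cell/block parameters are illustrative assumptions, not the settings used in the cited works.

```python
import numpy as np
from skimage.feature import hog

# Stand-in grayscale frame; a real pipeline would use a video frame.
frame = np.random.rand(128, 64)

# 9 orientation bins over 8x8-pixel cells, normalized in 2x2-cell blocks,
# in the spirit of the defaults popularized by Dalal and Triggs [6].
descriptor = hog(frame, orientations=9, pixels_per_cell=(8, 8),
                 cells_per_block=(2, 2), block_norm='L2-Hys')
print(descriptor.shape)  # flattened shape descriptor for one frame
```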
2.2 Deep feature based approaches

With deep learning networks, most computer vision tasks have seen remarkable accuracy gains. Human action recognition can be solved more effectively if comprehensive 3D pose information is exploited in the deep networks [10, 24]. In [25], the authors utilize the co-occurrence of the skeletal joints in a sequence learning framework; however, since a joint optimization framework is involved, it is computationally expensive. In [20], the authors investigate a two-stream ConvNet architecture and jointly model the spatio-temporal information using separate networks. They also demonstrate that multi-frame dense optical flow yields improved performance in spite of limited training data. In [23], the authors utilize trajectory-constrained pooling to aggregate the discriminative convolutional features into trajectory-pooled deep-convolutional descriptors. They construct two normalization methods for spatio-temporal and channel information. The trajectory-pooled deep-convolutional descriptor performs sampling and pooling on aggregated deep features, thus improving the discriminative ability.

2.3 Pose feature based approaches

Convolutional Neural Network (CNN) architectures [3, 17] can provide excellent results for 2D human pose estimation on images. However, handling occlusion is quite a challenging task. In [18], the authors obtain pose proposals by localizing pose classes using anchor poses; the pose proposals are then scored to obtain the classification score and regressed independently. Since pose regression techniques result in a single pose estimate, they cannot handle multimodal pose distributions. In order to capture such information, the pose estimates can be discretized into various bins, followed by pose classification. In [16], the authors construct a classification-regression framework that utilizes a classification network to generate a discretized multimodal pose estimate, followed by a regression network to refine the discretized estimate into a continuous one. The architecture incorporates various loss functions to enable different models. In [26], the authors incorporate pose, motion and color channel information into a Markovian model to exploit spatial and temporal information for localization and classification.
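The following toy numpy sketch illustrates the discretize-then-refine idea behind the classification-regression framework of [16] on a single joint angle; the bin count and the stand-in network outputs are assumptions for illustration only.

```python
import numpy as np

K = 12                                     # assumed number of discrete pose bins
edges = np.linspace(-np.pi, np.pi, K + 1)
centers = (edges[:-1] + edges[1:]) / 2.0

def classify_then_refine(class_probs, residuals):
    """Pick the most probable bin (classification stage), then add that
    bin's regressed offset (regression stage) for a continuous estimate."""
    k = int(np.argmax(class_probs))
    return centers[k] + residuals[k]

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(K))          # stand-in classifier output
offsets = rng.uniform(-0.1, 0.1, K)        # stand-in per-bin regressed offsets
print(classify_then_refine(probs, offsets))
```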
3. PROPOSED METHODOLOGY

In this section, the proposed framework and its main components are discussed in detail, including the recognition of a dance form from a sequence of frames in a video using LSTM networks, and feature extraction for the video frames. First, we extract features using uniform frame skipping in a sequence of frames, such that the skipping of frames does not affect the sequence of the action in the video. The feature vector comprises multiple components. We perform feature extraction using a 3-tier framework combining Inception V3 features, 3D CNN features and novel pose signatures. The second component trains our model for dance classification. It takes a chunk of features as input, where a chunk is defined as a collection of features for the selected frames in the video. The feature chunks are then fed to an LSTM network in order to classify the various dance forms. The output of the LSTM layer is connected to 2 fully connected layers with Batch Normalization at each stage, and is finally connected to a softmax layer of size 6, corresponding to the 6 ICD dance forms.

3.1 Feature Construction: Inception V3 Features, 3D CNN, Pose Signature

The feature vector for training our LSTM model is an embedding of various deep features (both 2D and 3D) and geometric cues into a single vector, learnt to classify the classical dance type. The 2D spatial cues are learnt using InceptionV3 network weights trained on the ImageNet dataset [8]. The 3D temporal cues are described using ResNext-101 with cardinality 32, trained on the Kinetics dataset [12], which contains 400 video classes. To construct the pose signature, geometric information is used to build features comprising pose, flow and other structural information. The variation in Indian classical dance forms in terms of hand and body postures and head orientation makes pose estimation an extremely challenging task. To circumvent this problem, we construct a pose-oblivious shape signature which is fed to a sequence learning framework. The pose estimation is done as a two-fold process using the AlphaPose framework. First, we represent the person-pose in a frame of a dance video using symmetric Spatial Transformer Networks (STN) to extract good person object proposals and a CNN-based parallel single person pose estimator (SPPE). Next, the pose bases are converted to pose flows by assigning a similarity score between successive poses, followed by non-maximal suppression. The video frames fed into the AlphaPose framework yield a skeletal pose comprising anchor joints at the various body parts of a human being. The anchor joints are used to further compute various geometric features, resulting in a pose signature (as explained in Sec. 3.2). The feature generation strategy is described in detail below.

Inception V3 features: These features are widely used for the classification of images. The input size is 299x299x3 and the output layer gives a feature of size 1x1x2048. We utilize the pre-trained ImageNet model used for image classification. This has several advantages over other networks: it adds Batch Normalization to the previous architectures, replaces the 5x5 convolution layers in Inception V2 with two 3x3 convolution layers, and has the benefit of an added BN-Auxiliary layer, i.e. the fully connected layer of the auxiliary classifier is also normalized, not just the convolutional layers.

ResNext-101 Kinetics features: This network is pre-trained on the Kinetics dataset with 400 video classes. We use 16 consecutive frames for generating a 3D feature. The architecture comprises 101 convolutional layers with skip connections and a cardinality of 32. The output can be represented as a feature vector of size 2048, or 16x128 for 16 frames; thus, it gives a 128-dimensional feature vector per frame.

3.2 Pose Signature

After the AlphaPose estimation, we construct a novel pose signature which consists of the normalized distances and angles between the anchor joints in the pose vector. The pose vector constitutes 16 anchor joints, namely 1: footright, 2: kneeright, 3: hipright, 4: hipleft, 5: kneeleft, 6: footleft, 7: hipcenter, 8: spine, 9: shouldercenter, 10: head, 11: handright, 12: elbowright, 13: shoulderright, 14: shoulderleft, 15: elbowleft, 16: handleft, as shown in Fig. 3. We utilize joint 7 as the reference point for taking the normalized distances to all other anchor joints; the distance metric used is the Euclidean norm. We also incorporate the angles between the key anchor joints to characterize the dance stances. The angles are taken between these ordered anchor joint pairs: {1,3}, {2,7}, {4,6}, {5,7}, {11,13}, {9,12}, {9,15}, {14,16}. We embed the normalized distances between the leg joints ({1,6}, {2,5}) and hand joints ({11,16}, {12,15}) in the pose signature as well. Next, we calculate the flow vectors between the anchor joints in successive frames to characterize the temporal dependencies, and we also compute the flow directionality for the anchor joints across successive frames. Thus, in total, the pose signature constitutes a 75-D vector. Instead of simply feeding a chain of joints into the sequence learner, which generally hinders the network performance, we constitute a pose signature which captures the adjacency configuration in the skeletal pattern. Hence, the kinematic relationship amongst the body joints across the frames, obtained via the pose estimation step, helps in better establishing the spatio-temporal dependencies.

Figure 3: (a) Pose of a Kathak dancer (b) Visualization of the anchor joints
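A minimal numpy sketch of this construction follows. The 75 dimensions are assumed here to decompose as 15 reference-joint distances + 4 leg/hand distances + 8 angles + 32 flow components + 16 flow directions (which matches the stated total), but the exact ordering and the normalizing length are assumptions.

```python
import numpy as np

REF = 6  # 0-based index of joint 7 (hipcenter) in a (16, 2) array of (x, y) joints

# 0-based versions of the paper's 1-based joint pairs
ANGLE_PAIRS = [(0, 2), (1, 6), (3, 5), (4, 6), (10, 12), (8, 11), (8, 14), (13, 15)]
EXTRA_DISTS = [(0, 5), (1, 4), (10, 15), (11, 14)]   # leg and hand pairs

def pose_signature(joints, prev_joints):
    """One plausible 75-D layout consistent with Sec. 3.2 (assumed ordering)."""
    # Assumed normalizer: hipcenter-to-head length (the paper does not specify it).
    scale = np.linalg.norm(joints[REF] - joints[9]) + 1e-8
    dists = [np.linalg.norm(joints[j] - joints[REF]) / scale
             for j in range(16) if j != REF]                      # 15 values
    extra = [np.linalg.norm(joints[a] - joints[b]) / scale
             for a, b in EXTRA_DISTS]                             # 4 values
    angles = [np.arctan2(*(joints[b] - joints[a])[::-1])
              for a, b in ANGLE_PAIRS]                            # 8 values
    flow = (joints - prev_joints).ravel() / scale                 # 32 values (dx, dy per joint)
    direction = np.arctan2(flow[1::2], flow[0::2])                # 16 flow angles
    return np.concatenate([dists, extra, angles, flow, direction])  # 15+4+8+32+16 = 75

sig = pose_signature(np.random.rand(16, 2), np.random.rand(16, 2))
assert sig.shape == (75,)
```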
3.3 Sequence Learning Framework

The feature vectors obtained from the 3D CNN, the pose signature and the ImageNet features are concatenated and fed into a sequence learning framework. We utilize Long Short Term Memory networks for dance sequence classification. Usually, Recurrent Neural Networks (RNNs) are difficult to train with activation functions such as tanh and sigmoid due to the problems of vanishing and exploding gradients [9]. LSTMs can be used in place of RNNs to learn long-term dependencies, and they solve the aforementioned problems. An LSTM consists of one self-connected memory cell c and three multiplicative units: the input gate i, forget gate f and output gate o.

Given an input sequence x = (x_0, ..., x_T), the activations of the memory cell and the three gates are given as follows:

i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)    (1)
f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)    (2)
c_t = f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)    (3)
o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)        (4)
h_t = o_t \tanh(c_t)                                                (5)

where \sigma(\cdot) is the sigmoid function, the W terms are the weight matrices between two units, and h denotes the output values of an LSTM cell.

We feed the input feature vector into the LSTM network. The concatenated feature is a 2251-D vector. We utilize a dense network after the LSTM layer, followed by a softmax layer, for dance sequence classification, as shown in Fig. 2(b). The constructed feature vector is able to characterize the spatio-temporal dependencies between the anchor joints.

Figure 2: (a) Architecture of the Nrityantar framework (b) Block diagram of the sequence learning framework in Nrityantar. The numbers in the layers indicate the number of features.
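A literal numpy transcription of one time step of Eqs. (1)-(5) is given below. Note that the W_{ci}, W_{cf} and W_{co} terms are peephole connections, treated here as diagonal (element-wise) vectors as is conventional; the hidden size is an illustrative assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One step of Eqs. (1)-(5); p holds the weights and biases."""
    i_t = sigmoid(p['Wxi'] @ x_t + p['Whi'] @ h_prev + p['Wci'] * c_prev + p['bi'])   # (1)
    f_t = sigmoid(p['Wxf'] @ x_t + p['Whf'] @ h_prev + p['Wcf'] * c_prev + p['bf'])   # (2)
    c_t = f_t * c_prev + i_t * np.tanh(p['Wxc'] @ x_t + p['Whc'] @ h_prev + p['bc'])  # (3)
    o_t = sigmoid(p['Wxo'] @ x_t + p['Who'] @ h_prev + p['Wco'] * c_t + p['bo'])      # (4)
    h_t = o_t * np.tanh(c_t)                                                          # (5)
    return h_t, c_t

d_in, d_hid = 2251, 128  # feature size from Sec. 3.3; hidden size is assumed
rng = np.random.default_rng(0)
p = {k: rng.normal(scale=0.01,
                   size=(d_hid, d_in) if k.startswith('Wx')
                   else (d_hid, d_hid) if k.startswith('Wh')
                   else d_hid)  # peephole weights and biases are vectors
     for k in ['Wxi', 'Whi', 'Wci', 'bi', 'Wxf', 'Whf', 'Wcf', 'bf',
               'Wxc', 'Whc', 'bc', 'Wxo', 'Who', 'Wco', 'bo']}
h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_hid), np.zeros(d_hid), p)
```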

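At the model level, a minimal Keras sketch of the pipeline in Fig. 2(b) could look as follows. The per-frame concatenation 2048 (InceptionV3) + 128 (ResNext-101) + 75 (pose signature) = 2251 and the quarter/half dense sizing follow Secs. 3.1-3.2 and 4.2; the LSTM width and the ReLU activations are assumptions, as the paper does not state them.

```python
from tensorflow.keras import layers, models

SEQ_LEN, N_CLASSES = 48, 6
FEAT = 2048 + 128 + 75        # InceptionV3 + ResNext-101 + pose signature = 2251
lstm_units = 256              # assumed; the paper does not state the LSTM width

model = models.Sequential([
    layers.LSTM(lstm_units, input_shape=(SEQ_LEN, FEAT)),
    layers.Dense(lstm_units // 4, activation='relu'),  # quarter of LSTM output (Sec. 4.2)
    layers.BatchNormalization(),
    layers.Dense(lstm_units // 8, activation='relu'),  # half of the first dense layer
    layers.BatchNormalization(),
    layers.Dense(N_CLASSES, activation='softmax'),     # 6 ICD classes
])
model.summary()
```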
We utilize the cross-entropy loss function given below.

Training Loss: We use the Adam optimizer with categorical cross-entropy for training. The categorical cross-entropy is:

-\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} \mathbb{1}_{y_i \in C_c} \log p_{\text{model}}[y_i \in C_c]    (6)

The double sum is over the observations (video features) i, whose number is N, and the categories c, whose number is C. The term \mathbb{1}_{y_i \in C_c} is the indicator function of the i-th observation belonging to the c-th category, and p_{\text{model}}[y_i \in C_c] is the probability predicted by the model for the i-th observation to belong to the c-th category. When there are more than two categories, the neural network outputs a vector of C probabilities, each giving the probability that the network input should be classified as belonging to the respective category.
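Eq. (6) transcribes directly into numpy; the toy batch below is illustrative only, with the one-hot rows of y_true playing the role of the indicator function.

```python
import numpy as np

def categorical_cross_entropy(y_true_onehot, y_pred_probs):
    """Direct transcription of Eq. (6) for one-hot labels."""
    N = y_true_onehot.shape[0]
    return -np.sum(y_true_onehot * np.log(y_pred_probs + 1e-12)) / N

# Toy batch: 3 videos, 6 dance classes, 0.80 probability on the true class.
y_true = np.eye(6)[[0, 3, 5]]
y_pred = np.full((3, 6), 0.04)
y_pred[[0, 1, 2], [0, 3, 5]] = 0.80
print(categorical_cross_entropy(y_true, y_pred))  # = -log(0.8) ≈ 0.223
```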
4. EXPERIMENTAL RESULTS

In this section, we evaluate our model and compare it with five other architectures on a benchmark dataset for Indian classical dance: the ICD dataset [19]. We also discuss overfitting issues and the computational efficiency of the proposed model.

4.1 Evaluation Dataset and Parameter Settings

The ICD dataset mainly consists of videos curated from YouTube for primarily the 3 oldest dance forms, Bharatnatyam, Kathak and Odissi, with video class annotations. Each class consists of 30 video clips of maximum resolution 400×350 and of 25 sec maximum duration. We have augmented the dataset with YouTube videos for the classes Manipuri, Kuchipudi and Mohiniattam. During data processing, we further clipped the video segments into 5-6 second chunks of frames at 25 fps to generate a maximum of 150 frames. The train-to-test ratio for evaluation was set to 7:3. The resultant dataset poses several challenges, including varying illumination, shadow effects of dancers on stage, similar dance stances, etc. The low accuracy of the skeleton joint coordinates and partially missing body parts in some sequences make this dataset very challenging. Tab. 1 shows the parameter settings of our proposed model on the evaluated dataset. For training, we utilize the Adam optimizer with a learning rate of 0.0004 and a decay of 0.000001. The loss function is categorical cross-entropy. Training is done for a maximum of 100 epochs, with early stopping if the validation error does not improve for 5 consecutive epochs.

Table 1: Parameters used for training the LSTM

    Optimizer          Adam
    Loss               Categorical cross-entropy
    Learning rate      0.0001
    Decay              0.000001
    Feature length     2251
    Output length      6
    Sequence length    48 frames (one training sample, uniformly distributed)
    Batch size         32
    Maximum epochs     100
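A hedged Keras training configuration matching Tab. 1 might look as follows, continuing the model sketch above. `X_train`/`y_train` are assumed pre-extracted feature stacks; the legacy `decay` argument mirrors the table's decay setting and is specific to older Keras optimizer versions.

```python
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam

# Learning rate 0.0001 per Tab. 1 (the running text quotes 0.0004).
model.compile(optimizer=Adam(learning_rate=1e-4, decay=1e-6),
              loss='categorical_crossentropy', metrics=['accuracy'])

stopper = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
model.fit(X_train, y_train,
          validation_split=0.3,   # approximates the paper's 7:3 split
          batch_size=32, epochs=100, callbacks=[stopper])
```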
IEEE tron approach with reasonable accuracy (40-50%) but lag- Transactions on pattern analysis and machine ging behind LSTM (60-70%). intelligence, 23(3):257–267, 2001. We have also shown the evaluation on various possible permutations of the feature sets to validate the efficacy of [2] M. Bregonzio, S. Gong, and T. Xiang. Recognising the concatenation on improving discriminative ability with- action as clouds of space-time interest points. In out hindering the network’s training performance. Tab. 4 Computer Vision and Pattern Recognition, 2009. provides an evaluation on various possible combination of CVPR 2009. IEEE Conference on, pages 1948–1955. the feature sets. In order to further improve the feature IEEE, 2009. representation and model the spatio-temporal dependencies, [3] A. Bulat and G. Tzimiropoulos. Human pose we incorporate Residual Network with multiple cardinalities estimation via convolutional part heatmap regression. (also known as ResNext) using 3D CNN architecture. The In European Conference on Computer Vision, pages ResNext-101 was trained on Kinetics dataset, which com- 717–732. Springer, 2016. prises of 400 video classes. The ResNext architecture uti- [4] F. Caba Heilbron, V. Escorcia, B. Ghanem, and lizes 16 consecutive frames as input and outputs 3D feature J. Carlos Niebles. Activitynet: A large-scale video vector with size 2048 (or 16x128). We implemented ResNext benchmark for human activity understanding. In for all frames in the videos (with sequence length>48). The Proceedings of the IEEE Conference on Computer output is then saved in a JSON format with segment length Vision and Pattern Recognition, pages 961–970, 2015. 16 per video. We convert this data into a feature vector of [5] G. Ch´eron, I. Laptev, and C. Schmid. P-cnn: 128 dimensionality for the input images and then selected 48 Pose-based cnn features for action recognition. In frames again uniformly for training. The training input is of Proceedings of the IEEE international conference on size (32,48,128). The training is computationally less expen- computer vision, pages 3218–3226, 2015. sive due to reduced feature vector. The results were better [6] N. Dalal and B. Triggs. Histograms of oriented than InceptionV3 which utilised only spatial information, gradients for human detection. In Computer Vision but since ResNext captures flow i.e. temporal information, and Pattern Recognition, 2005. CVPR 2005. IEEE Table 3: Comparison of various features and their combinations for Dance Classification Average Method Bharatnatyam Kathak Kuchipudi Manipuri Mohiniattam Odissi Accuracy InceptionV3 80.48 62.71 28.90 30.64 98.41 53.62 59.1 Pose Signature 88.15 79.31 43.65 41.50 96.82 71.01 67.41 Kinetics 84.21 68.96 56.34 30.18 96.82 57.97 65.61 InceptionV3+ 51.31 67.24 65.87 67.92 93.65 73.91 68.98 Pose Signature InceptionV3+ 85.52 68.96 63.49 24.53 96.82 57.97 67.19 Kinetics InceptionV3+ Pose Signature+ 86.84 87.93 70.63 24.52 93.65 63.76 72.35 Kinetics

[1] A. F. Bobick and J. W. Davis. The recognition of human movement using temporal templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(3):257–267, 2001.
[2] M. Bregonzio, S. Gong, and T. Xiang. Recognising action as clouds of space-time interest points. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 1948–1955. IEEE, 2009.
[3] A. Bulat and G. Tzimiropoulos. Human pose estimation via convolutional part heatmap regression. In European Conference on Computer Vision, pages 717–732. Springer, 2016.
[4] F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. Carlos Niebles. ActivityNet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 961–970, 2015.
[5] G. Chéron, I. Laptev, and C. Schmid. P-CNN: Pose-based CNN features for action recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 3218–3226, 2015.
[6] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 886–893. IEEE, 2005.
[7] N. Dalal, B. Triggs, and C. Schmid. Human detection using oriented histograms of flow and appearance. In European Conference on Computer Vision, pages 428–441. Springer, 2006.
[8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
[9] S. Hochreiter, Y. Bengio, P. Frasconi, J. Schmidhuber, et al. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, 2001.
[10] S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):221–231, 2013.
[11] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014.
[12] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[14] I. Laptev. On space-time interest points. International Journal of Computer Vision, 64(2-3):107–123, 2005.
[15] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[16] S. Mahendran, H. Ali, and R. Vidal. A mixed classification-regression framework for 3D pose estimation from 2D images. arXiv preprint arXiv:1805.03225, 2018.
[17] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pages 483–499. Springer, 2016.
[18] G. Rogez, P. Weinzaepfel, and C. Schmid. LCR-Net: Localization-classification-regression for human pose. In CVPR 2017 - IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[19] S. Samanta, P. Purkait, and B. Chanda. Indian classical dance classification by learning dance pose bases. 2012.
[20] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pages 568–576, 2014.
[21] H. Wang and C. Schmid. Action recognition with improved trajectories. In Proceedings of the IEEE International Conference on Computer Vision, pages 3551–3558, 2013.
[22] L. Wang, Y. Qiao, and X. Tang. Action recognition and detection by combining motion and appearance features. THUMOS14 Action Recognition Challenge, 1(2):2, 2014.
[23] L. Wang, Y. Qiao, and X. Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4305–4314, 2015.
[24] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In European Conference on Computer Vision, pages 20–36. Springer, 2016.
[25] W. Zhu, C. Lan, J. Xing, W. Zeng, Y. Li, L. Shen, X. Xie, et al. Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks. In AAAI, volume 2, page 6, 2016.
[26] M. Zolfaghari, G. L. Oliveira, N. Sedaghat, and T. Brox. Chained multi-stream networks exploiting pose, motion, and appearance for action classification and detection. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2923–2932. IEEE, 2017.