Hierarchical Recurrent Neural Network for Skeleton Based Action Recognition

Yong Du, Wei Wang, Liang Wang
Center for Research on Intelligent Perception and Computing, CRIPAC
Nat'l Lab of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
{yong.du, wangwei, wangliang}@nlpr.ia.ac.cn

Abstract

Human actions can be represented by the trajectories of skeleton joints. Traditional methods generally model the spatial structure and temporal dynamics of the human skeleton with hand-crafted features and recognize human actions by well-designed classifiers. In this paper, considering that recurrent neural network (RNN) can model the long-term contextual information of temporal sequences well, we propose an end-to-end hierarchical RNN for skeleton based action recognition. Instead of taking the whole skeleton as the input, we divide the human skeleton into five parts according to human physical structure, and then separately feed them to five subnets. As the number of layers increases, the representations extracted by the subnets are hierarchically fused to be the inputs of higher layers. The final representations of the skeleton sequences are fed into a single-layer perceptron, and the temporally accumulated output of the perceptron is the final decision. We compare with five other deep RNN architectures derived from our model to verify the effectiveness of the proposed network, and also compare with several other methods on three publicly available datasets. Experimental results demonstrate that our model achieves the state-of-the-art performance with high computational efficiency.

Figure 1: An illustrative sketch of the proposed hierarchical recurrent neural network. The whole skeleton is divided into five parts, which are fed into five bidirectional recurrent neural networks (BRNNs). As the number of layers increases, the representations extracted by the subnets are hierarchically fused to be the inputs of higher layers. A fully connected layer and a softmax layer are performed on the final representation to classify the actions.

1. Introduction

As an important branch of computer vision, action recognition has a wide range of applications, e.g., intelligent video surveillance, robot vision, human-computer interaction, game control, and so on [15, 36]. Traditional studies about action recognition mainly focus on recognizing actions from videos recorded by 2D cameras. But actually, human actions are generally represented and recognized in the 3D space. The human body can be regarded as an articulated system including rigid bones and hinged joints, which are further combined into four limbs and a trunk [31]. Human actions are composed of the motions of these limbs and trunk, which are represented by the movements of human skeleton joints in the 3D space [37]. Currently, reliable joint coordinates can be obtained from cost-effective depth sensors using real-time skeleton estimation algorithms [27, 28]. Effective approaches should therefore be investigated for skeleton based action recognition.

Human skeleton based action recognition is generally considered as a time series problem [5, 17], in which the characteristics of body postures and their dynamics over time are extracted to represent a human action. Most of the existing skeleton based action recognition methods explicitly model the temporal dynamics of skeleton joints by using Temporal Pyramids (TPs) [19, 31, 33] and Hidden Markov Models (HMMs) [20, 34, 35]. The TPs methods are generally restricted by the width of the time windows and can only utilize limited contextual information. As for HMMs, it is very difficult to obtain the temporally aligned sequences and the corresponding emission distributions. Recently, recurrent neural networks (RNNs) with Long-Short Term Memory (LSTM) [8, 10] neurons have been used for action recognition [1, 11, 16].
All this work just uses a single-layer RNN as a sequence classifier, without part-based and hierarchical fusion.

In this paper, taking full advantage of deep RNN in modelling the long-term contextual information of temporal sequences, we propose a hierarchical RNN for skeleton based action recognition. Fig. 1 shows the architecture of the proposed network, in which the temporal representations of low-level body parts are modeled by bidirectional recurrent neural networks (BRNNs) and combined into the representations of high-level parts.

The human body can be roughly decomposed into five parts, i.e., two arms, two legs and one trunk, and human actions are composed of the movements of these body parts. Given this fact, we divide the human skeleton into the five corresponding parts, and feed them into five bidirectionally recurrently connected subnets (BRNNs) in the first layer. To model the movements of neighboring skeleton parts, we concatenate the representation of the trunk subnet with those of the other four subnets, respectively, and then input these concatenated results to four BRNNs in the third layer, as shown in Fig. 1. With a similar procedure, the representations of the upper body and the lower body are obtained in the fifth layer, and that of the whole body in the seventh layer. Up to now, we have finished the representation learning of skeleton sequences. Finally, a fully connected layer and a softmax layer are performed on the obtained representation to classify the actions. It should be noted that, to overcome the vanishing gradient problem when training RNN [8, 12], we adopt LSTM neurons in the last BRNN layer.

In the experiments, we compare with five other deep RNN architectures derived from our proposed model to verify the effectiveness of the proposed network, and compare with several methods on three publicly available datasets. Experimental results demonstrate that our method achieves the state-of-the-art performance with high computational efficiency. The main contributions of our work can be summarized as follows. Firstly, to the best of our knowledge, we are the first to provide an end-to-end solution for skeleton based action recognition by using a hierarchical recurrent neural network. Secondly, by comparing with five other derived deep RNN architectures, we verify the effectiveness of the necessary parts of the proposed network, e.g., the bidirectional connection, LSTM neurons in the last BRNN layer, and hierarchical skeleton part fusion. Finally, we demonstrate that our proposed model can handle skeleton based action recognition very well without sophisticated preprocessing.

The remainder of this paper is organized as follows. In Section 2, we introduce the related work on skeleton based action recognition. In Section 3, we first review the background of RNN and LSTM, and then illustrate the details of the proposed network. Experimental results and discussion are presented in Section 4. Finally, we conclude the paper in Section 5.

2. Related Work

In this section, we briefly review the existing literature that closely relates to the proposed model, including three categories of approaches representing temporal dynamics by local features, sequential state transitions and RNN.

Approaches with local features: By clustering the extracted joints into five parts, Wang et al. [32] use the spatial and temporal dictionaries of the parts to represent actions, which can capture the spatial structure of the human body and its movements. Chaudhry et al. [2] encode the skeleton structure with a spatial-temporal hierarchy, and exploit Linear Dynamical Systems to learn the dynamic features. Vemulapalli et al. [31] utilize rotations and translations to represent the 3D geometric relationships of body parts in a Lie group, and then employ Dynamic Time Warping (DTW) and Fourier Temporal Pyramid (FTP) to model the temporal dynamics. Instead of modelling the temporal evolution of features, Luo et al. [19] develop a novel dictionary learning method combined with Temporal Pyramid Matching to keep the temporal dynamics. To represent both human motions and correlative objects, Wang et al. [33] first extract the local occupancy patterns from the appearance around skeleton joints, and then process them with FTP to obtain the temporal structure. Zanfir et al. [38] propose a moving pose descriptor for capturing postures and the dynamics of skeleton joints. Using five joints' coordinates and their temporal differences as inputs, Cho and Chen [4] perform action recognition with a hybrid multi-layer perceptron. In the above methods, the local temporal dynamics is generally represented within a certain time window or by differential quantities, which cannot globally capture the temporal evolution of actions.
Approaches with sequential state transitions: Lv et al. [20] extract local features of individual joints and partial combinations of joints, and train HMMs to capture the action dynamics. Based on skeletal joint features, Wu and Shao [34] adopt a deep forward neural network to estimate the emission probabilities of the hidden states in HMM, and then infer action sequences. To accurately calculate the similarity between two sequences with Dynamic Manifold Warping, Gong et al. [5] perform both temporal segmentation and alignment with structured time series representations. Though HMM can model the temporal evolution of actions, the input sequences have to be segmented and aligned, which in itself is a very difficult task.

Approaches with RNN: The combination of RNN and perceptron can directly classify sequences without any segmentation. By obtaining sequential representations with a 3D convolutional neural network, Baccouche et al. [1] propose a LSTM-RNN to recognize actions. Regarding the histograms of optical flow as inputs, Grushin et al. [11] use LSTM-RNN for robust action recognition and achieve good results on the KTH dataset. Considering that the LSTM-RNNs employed in [1] and [11] are both unidirectional with only one hidden layer, Lefebvre et al. [16] propose a bidirectional LSTM-RNN with one forward hidden layer and one backward hidden layer for gesture classification.

All the work above just uses RNN as a sequence classifier, while we propose an end-to-end solution including both feature learning and sequence classification. Considering the fact that human actions are composed of the motions of human body parts, we use RNN in a hierarchical way.

3. Our Model

In order to put our proposed model into context, we first review the recurrent neural network (RNN) and the Long-Short Term Memory (LSTM) neuron.
Then we propose a hierarchical bidirectional RNN to solve the problem of skeleton based action recognition. Finally, five related deep RNNs with different architectures are also introduced.

3.1. Review of RNN and LSTM

The main difference between RNN and feedforward networks is the presence of feedback loops, which produce recurrent connections in the unfolded network. With the recurrent structure, RNN can model the contextual information of a temporal sequence. Given an input sequence x = (x^0, ..., x^{T-1}), the hidden states of a recurrent layer h = (h^0, ..., h^{T-1}) and the output of a single hidden layer RNN y = (y^0, ..., y^{T-1}) can be derived as follows [8, 9, 10]:

    h^t = \mathcal{H}(W_{xh} x^t + W_{hh} h^{t-1} + b_h)    (1)
    y^t = \mathcal{O}(W_{ho} h^t + b_o)    (2)

where W_{xh}, W_{hh} and W_{ho} denote the connection weights from the input layer x to the hidden layer h, from the hidden layer h to itself, and from the hidden layer to the output layer y, respectively; b_h and b_o are two bias vectors, and \mathcal{H}(\cdot) and \mathcal{O}(\cdot) are the activation functions in the hidden layer and the output layer.

Generally, it is very difficult to train RNNs (especially deep RNNs) with the commonly-used activation functions, e.g., the tanh and sigmoid functions, due to the vanishing gradient and error blowing up problems [8, 12]. To solve these problems, the Long-Short Term Memory (LSTM) architecture has been proposed [10, 13], which replaces the nonlinear units in traditional RNNs. Fig. 2 illustrates an LSTM memory block with a single cell. It contains one self-connected memory cell c and three multiplicative units, i.e., the input gate i, the forget gate f and the output gate o, which can store and access the long-range contextual information of a temporal sequence.

Figure 2: Long Short-Term Memory block with one cell [8].

The activations of the memory cell and the three gates are given as follows:

    i^t = \sigma(W_{xi} x^t + W_{hi} h^{t-1} + W_{ci} c^{t-1} + b_i)    (3)
    f^t = \sigma(W_{xf} x^t + W_{hf} h^{t-1} + W_{cf} c^{t-1} + b_f)    (4)
    c^t = f^t c^{t-1} + i^t \tanh(W_{xc} x^t + W_{hc} h^{t-1} + b_c)    (5)
    o^t = \sigma(W_{xo} x^t + W_{ho} h^{t-1} + W_{co} c^t + b_o)    (6)
    h^t = o^t \tanh(c^t)    (7)

where \sigma(\cdot) is the sigmoid function, and all the matrices W are the connection weights between two units.

In order to utilize both the past and future context for every point in the sequence, Schuster and Paliwal [26] proposed the bidirectional recurrent neural network (BRNN), which presents the sequence forwards and backwards to two separate recurrent hidden layers. These two recurrent hidden layers share the same output layer. A bidirectional recurrent neural network is illustrated in Fig. 3. It should be noted that we can easily obtain an LSTM-BRNN just by replacing the nonlinear units in Fig. 3 with LSTM blocks.

Figure 3: The architecture of bidirectional recurrent neural network [8].
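To make Eqns. (3)-(7) concrete, the following minimal NumPy sketch computes one LSTM time step. The dictionary-based parameter layout is our own naming, and treating the peephole terms W_{ci}, W_{cf}, W_{co} as full matrices (exactly as the equations are written) is an assumption, not a claim about the authors' implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step, Eqns. (3)-(7). W maps names to matrices, b to bias vectors."""
    i_t = sigmoid(W['xi'] @ x_t + W['hi'] @ h_prev + W['ci'] @ c_prev + b['i'])    # Eqn (3)
    f_t = sigmoid(W['xf'] @ x_t + W['hf'] @ h_prev + W['cf'] @ c_prev + b['f'])    # Eqn (4)
    c_t = f_t * c_prev + i_t * np.tanh(W['xc'] @ x_t + W['hc'] @ h_prev + b['c'])  # Eqn (5)
    o_t = sigmoid(W['xo'] @ x_t + W['ho'] @ h_prev + W['co'] @ c_t + b['o'])       # Eqn (6)
    h_t = o_t * np.tanh(c_t)                                                       # Eqn (7)
    return h_t, c_t
```

Running this recursion from t = 0 to T-1 gives the forward layer of an LSTM-BRNN; running it again over the time-reversed sequence gives the backward layer.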

3.2. Hierarchical RNN for Skeleton Based Action Recognition

According to human physical structure, the human skeleton can be decomposed into five parts, i.e., two arms, two legs and one trunk. Simple human actions are performed by only one of these parts, e.g., punching forward and kicking forward mainly depend on swinging an arm and a leg, respectively. Some actions come from moving the upper body or the lower body, e.g., bending down mainly relates to the upper body. More complex actions are composed of the motions of all five parts, e.g., running and swimming need to coordinate the movements of the arms, legs and trunk. To effectively recognize various human actions, modelling the movements of these individual parts and their combinations is very necessary. Benefiting from the power of RNN to access contextual information, we propose a hierarchical bidirectional RNN for skeleton based action recognition. Different from traditional methods, which model the spatial structure and temporal dynamics with hand-crafted features and recognize actions by well-designed classifiers, our model provides an end-to-end solution for action recognition.

The framework of the proposed model is shown in Fig. 4. We can see that our model is composed of 9 layers, i.e., bl1-bl4, fl1-fl3, fc and sl, each of which presents a different structure and thus plays a different role in the whole network. In the first layer bl1, the five skeleton parts are fed into five corresponding bidirectionally recurrently connected subnets (BRNNs). To model the neighboring skeleton parts, i.e., left arm-trunk, right arm-trunk, left leg-trunk, and right leg-trunk, we combine the representation of the trunk subnet with that of each of the other four subnets to obtain four new representations in the fusion layer fl1. Similar to the layer bl1, these resulting four representations are separately fed into four BRNNs in the layer bl2. To model the upper and lower body, the representations of the left arm-trunk and right arm-trunk BRNNs are further combined to obtain the upper body representation, while the representations of the left leg-trunk and right leg-trunk BRNNs are combined to obtain the lower body representation in the layer fl2. The newly obtained two representations are then fed into two BRNNs in the layer bl3, and the representations of these two BRNNs are fused again in the layer fl3 to represent the whole body. The temporal dynamics of the whole body representation is further modelled by another BRNN in the layer bl4. From the viewpoint of feature learning, these stacked BRNNs can be considered to extract the spatial and temporal features of the skeleton sequences. After obtaining the final features of a skeleton sequence, a fully connected layer fc and a softmax layer sl are performed to classify the action.

Figure 4: The architecture of our proposed model.

As mentioned in Section 3.1, the LSTM architecture can effectively overcome the vanishing gradient problem while training RNNs [8, 12, 13]. However, we adopt LSTM neurons only in the last recurrent layer (bl4); the first three BRNN layers all use the tanh activation function. This is a trade-off between improving the representation ability and avoiding overfitting: the number of weights in an LSTM block is generally several times more than that in a tanh neuron, and it is very easy to overfit the network with limited training sequences.
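The fusion schedule of Fig. 4 can be written down compactly. The sketch below is our own summary of the wiring; the part names and the dictionary layout are hypothetical, not identifiers from the paper:

```python
import numpy as np

# Inputs to bl1: one (T x D_part) coordinate sequence per skeleton part.
PARTS = ['left_arm', 'right_arm', 'trunk', 'left_leg', 'right_leg']

# fl1: each limb is concatenated with the trunk -> four inputs to bl2.
FL1 = [('left_arm', 'trunk'), ('right_arm', 'trunk'),
       ('left_leg', 'trunk'), ('right_leg', 'trunk')]

# fl2: arm-trunk pairs form the upper body, leg-trunk pairs the lower body -> bl3.
FL2 = {'upper_body': ('left_arm+trunk', 'right_arm+trunk'),
       'lower_body': ('left_leg+trunk', 'right_leg+trunk')}

# fl3: upper and lower body form the whole body -> bl4 (the LSTM-BRNN).
FL3 = {'whole_body': ('upper_body', 'lower_body')}

def fuse(reps, a, b):
    """Frame-wise concatenation of two (T x D) part representations."""
    return np.concatenate([reps[a], reps[b]], axis=1)
```

Every fusion is a per-frame concatenation, so the temporal alignment between parts is preserved all the way up to the final whole-body BRNN.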

3.3. Training

Training the proposed model contains a forward pass and a backward pass.

Forward pass: For the i-th BRNN layer bl_i at time t, given the q-th input I_{iq}^{t} and the tanh activation function, the corresponding q-th representations of the forward layer and the backward layer, \overrightarrow{h}_{iq}^{t} and \overleftarrow{h}_{iq}^{t}, are defined as follows:

    \overrightarrow{h}_{iq}^{t} = \tanh\big(W_{I_{iq}\overrightarrow{h}_{iq}} I_{iq}^{t} + W_{\overrightarrow{h}_{iq}\overrightarrow{h}_{iq}} \overrightarrow{h}_{iq}^{t-1} + b_{\overrightarrow{h}_{iq}}\big)    (8)
    \overleftarrow{h}_{iq}^{t} = \tanh\big(W_{I_{iq}\overleftarrow{h}_{iq}} I_{iq}^{t} + W_{\overleftarrow{h}_{iq}\overleftarrow{h}_{iq}} \overleftarrow{h}_{iq}^{t+1} + b_{\overleftarrow{h}_{iq}}\big)    (9)

where all the matrices W and vectors b are the corresponding connection weights and biases.

For the following fusion layer fl_i at time t, the p-th newly concatenated representation, which serves as the input of the (i+1)-th BRNN layer bl_{i+1}, is

    I_{(i+1)p}^{t} = \overrightarrow{h}_{ij}^{t} \oplus \overleftarrow{h}_{ij}^{t} \oplus \overrightarrow{h}_{ik}^{t} \oplus \overleftarrow{h}_{ik}^{t}    (10)

where \oplus denotes the concatenation operator, \overrightarrow{h}_{ij}^{t} and \overleftarrow{h}_{ij}^{t} are the hidden representations of the forward and backward layers of the j-th part in the i-th BRNN layer, and \overrightarrow{h}_{ik}^{t} and \overleftarrow{h}_{ik}^{t} are those of the k-th part in the i-th layer.

For the last BRNN layer bl4 with LSTM neurons, the forward output \overrightarrow{h}_{bl4}^{t} and backward output \overleftarrow{h}_{bl4}^{t} can be derived from Eqns. (3)-(7).
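As a sanity check on Eqns. (8)-(10), here is a minimal NumPy sketch of one tanh BRNN layer followed by the fusion step. The argument names (U for the recurrent weights, f/b suffixes for the two directions) are our own shorthand:

```python
import numpy as np

def tanh_brnn(I, W_f, U_f, b_f, W_b, U_b, b_b):
    """Bidirectional tanh recurrent layer over an input sequence I of shape (T, D).
    Returns the forward (Eqn 8) and backward (Eqn 9) hidden sequences."""
    T, H = I.shape[0], b_f.shape[0]
    h_fwd, h_bwd = np.zeros((T, H)), np.zeros((T, H))
    for t in range(T):                                   # forward layer: past -> future
        prev = h_fwd[t - 1] if t > 0 else np.zeros(H)
        h_fwd[t] = np.tanh(W_f @ I[t] + U_f @ prev + b_f)
    for t in reversed(range(T)):                         # backward layer: future -> past
        nxt = h_bwd[t + 1] if t + 1 < T else np.zeros(H)
        h_bwd[t] = np.tanh(W_b @ I[t] + U_b @ nxt + b_b)
    return h_fwd, h_bwd

def fusion(hf_j, hb_j, hf_k, hb_k):
    """Eqn (10): frame-wise concatenation of parts j and k into the next layer's input."""
    return np.concatenate([hf_j, hb_j, hf_k, hb_k], axis=1)
```

Stacking three such layers according to the fusion schedule, and replacing the tanh recursion with LSTM steps in the fourth, reproduces the feature extractor of Fig. 4 up to the fully connected layer.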
Combining \overrightarrow{h}_{bl4}^{t} and \overleftarrow{h}_{bl4}^{t} as the input to the fully connected layer fc, the output O^t of the layer fc is

    O^t = W_{\overrightarrow{h}_{bl4}} \overrightarrow{h}_{bl4}^{t} + W_{\overleftarrow{h}_{bl4}} \overleftarrow{h}_{bl4}^{t}    (11)

where W_{\overrightarrow{h}_{bl4}} and W_{\overleftarrow{h}_{bl4}} are the connection weights from the forward and backward layers of bl4 to the layer fc.

Finally, the outputs of the layer fc are accumulated across the T frames of the sequence, and the accumulated results {A_k} are normalized by the softmax function to get the probability of each class p(C_k):

    A = \sum_{t=0}^{T-1} O^t    (12)
    p(C_k) = e^{A_k} \Big/ \sum_{i=0}^{C-1} e^{A_i}    (13)

where C is the number of classes of human actions.

The objective function of our model is to minimize the negative log-likelihood [8]:

    \mathcal{L}(\Omega) = -\sum_{m=0}^{M-1} \ln \sum_{k=0}^{C-1} \delta(k - r)\, p(C_k \mid \Omega_m)    (14)

where \delta(\cdot) is the Kronecker delta function, and r denotes the groundtruth label of the sequence \Omega_m. There are M sequences in the training set \Omega.

Backward pass: We use the back-propagation through time (BPTT) algorithm [8] to obtain the derivatives of the objective function with respect to all the weights, and minimize the objective function by stochastic gradient descent [8].
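The classification stage of Eqns. (11)-(14) amounts to a per-frame linear map, a temporal accumulation and a softmax. A minimal sketch follows; the max-subtraction for numerical stability is our addition, not part of the paper:

```python
import numpy as np

def classify_sequence(h_fwd, h_bwd, W_f, W_b):
    """Eqns. (11)-(13): fc outputs per frame, accumulated over time, then softmax.
    h_fwd/h_bwd have shape (T, H); W_f/W_b have shape (C, H)."""
    O = h_fwd @ W_f.T + h_bwd @ W_b.T     # Eqn (11); shape (T, C)
    A = O.sum(axis=0)                     # Eqn (12): accumulate over the T frames
    A = A - A.max()                       # numerical stabilization (our addition)
    return np.exp(A) / np.exp(A).sum()    # Eqn (13): class probabilities

def nll(p, r):
    """Eqn (14), contribution of one sequence with groundtruth label r."""
    return -np.log(p[r])
```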
3.4. Five Comparative Architectures

In order to verify the effectiveness of the proposed network, we compare it with five other architectures derived from our proposed model. As illustrated before, we propose a hierarchically bidirectional RNN (HBRNN-L) for skeleton based action recognition (the suffix "-L" means that only the last recurrent layer consists of LSTM neurons, and the rest likewise). To prove the importance of the bidirectional connection, a similar network with unidirectional connections is proposed, which is called hierarchically unidirectional RNN (HURNN-L). To verify the role of part-based feature extraction and hierarchical fusion, we compare with a deep bidirectional RNN (DBRNN-L), which is directly stacked with several RNNs taking the whole human skeleton as the input. Furthermore, we compare with a deep unidirectional RNN (DURNN-L), which adopts neither the bidirectional connection nor the hierarchical fusion. To further investigate whether LSTM neurons in the last recurrent layer are useful to overcome the vanishing/exploding gradient problem in RNN, we examine another two architectures, DURNN-T and DBRNN-T. Here DURNN-T and DBRNN-T are networks similar to DURNN-L and DBRNN-L, but with the tanh activation function in all layers. It should be noted that all six architectures have five learnable layers, i.e., four recurrent hidden layers and one fully connected layer, and the number of neurons in the fully connected layer is equal to the number of action categories.

4. Experiments

In this section, we evaluate our model and compare it with the five other architectures and several recent works on three benchmark datasets: the MSR Action3D Dataset [18], the Berkeley Multimodal Human Action Dataset (Berkeley MHAD) [22], and the Motion Capture Dataset HDM05 [21]. We also discuss the overfitting issues and the computational efficiency of the proposed model.

4.1. Evaluation Datasets

MSR Action3D Dataset: It is generated by a Microsoft Kinect-like depth sensor and is widely used in action recognition. This dataset consists of 20 actions performed by 10 subjects in an unconstrained way two or three times each, giving 557 valid samples with 22077 frames. All sequences are captured at 15 FPS, and each frame in a sequence contains 20 skeleton joints. The low accuracy of the skeleton joint coordinates and the partial fragments missing in some sequences make this dataset very challenging.

Berkeley MHAD: It is captured by a multimodal acquisition system, in which an optical motion capture system is used to capture the 3D positions of active LED markers at a frequency of 480 Hz. This dataset contains 659 sequences of 11 actions performed by 12 subjects with 5 repetitions of each action. In each frame of a sequence, there are 35 joints accurately extracted according to the 3D marker trajectories.

Motion Capture Dataset HDM05: It is captured by an optical marker-based technology at a frequency of 120 Hz, and contains 2337 sequences of 130 actions performed by 5 non-professional actors, with 31 joints in each frame. To our knowledge, this dataset is currently the largest depth sequence database which provides the skeleton joint coordinates for action recognition. As stated in [4], some samples of these 130 actions should be classified into the same category, e.g., jogging starting from air and jogging starting from floor are the same action, and jogging 2 steps and jogging 4 steps belong to the same "jogging" action. After sample combination, the actions are reduced to 65 categories.

Table 1: The parameter settings of our proposed model and the five compared models on the three evaluation datasets. DU.T is short for DURNN-T, and the rest likewise. LLi indicates the i-th learnable layer (bl_i in HBRNN-L).

           |                MSR Action3D                  |            Berkeley MHAD & HDM05
Layer      | DU.T  DB.T  DU.L  DB.L  HU.L     HB.L        | DU.T  DB.T  DU.L  DB.L  HU.L      HB.L
LL1 (bl1)  | 80    40    80    40    30 × 5   15 × 2 × 5  | 90    60    80    40    40 × 5    15 × 2 × 5
LL2 (bl2)  | 120   80    120   80    60 × 4   30 × 2 × 4  | 180   120   160   80    80 × 4    30 × 2 × 4
LL3 (bl3)  | 240   120   180   100   90 × 2   60 × 2 × 2  | 240   120   180   100   100 × 2   60 × 2 × 2
LL4 (bl4)  | 120   80    80    40    80 × 1   40 × 2 × 1  | 120   60    90    60    90 × 1    60 × 2 × 1
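For concreteness, the HB.L columns of Tab. 1 can be transcribed as a small configuration of (units per direction, number of parallel subnets) at each BRNN level; this encoding is our own, hypothetical structure, not code from the paper:

```python
# HBRNN-L hidden sizes from Tab. 1, one (units_per_direction, num_subnets) pair per bl1..bl4.
HBRNN_L_CONFIG = {
    'MSR_Action3D':       [(15, 5), (30, 4), (60, 2), (40, 1)],
    'BerkeleyMHAD_HDM05': [(15, 5), (30, 4), (60, 2), (60, 1)],
}
```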

4.2. Preprocessing and Parameter Settings

In our proposed model, all the human skeleton joints are divided into five parts, i.e., two arms, two legs and one trunk, as illustrated in Fig. 5. There are 4 joints for each part in the MSR Action3D dataset. For the Berkeley MHAD and HDM05 datasets, the joint numbers of the arms, legs and trunk are 7, 7, 7 and 7, 5, 7, respectively.

Figure 5: The human skeleton joints are divided into five parts in these three datasets (MSR Action3D, Berkeley MHAD, HDM05).

Given that human actions are independent of their absolute spatial positions, we normalize the skeleton joints to a unified coordinate system. The origin of the coordinate system is defined as

    O = (J_{hip\,center} + J_{hip\,left} + J_{hip\,right})/3    (15)

where J_{hip\,center} is the 3D coordinate of the hip center, and the other two terms have similar meanings.

To improve the signal-to-noise ratio of the raw data, we preprocess the data with a simple Savitzky-Golay smoothing filter [25], designed as

    f_i = (-3x_{i-2} + 12x_{i-1} + 17x_i + 12x_{i+1} - 3x_{i+2})/35    (16)

where x_i denotes the skeleton joint coordinates in the i-th frame, and f_i denotes the filtered result.

Considering that the trajectories of the skeleton joints vary smoothly, we sample frames from the sequences at a fixed interval to reduce the computational cost: every 16 frames for the Berkeley MHAD dataset and every 4 frames for the HDM05 dataset. We do not sample frames from the MSR Action3D dataset due to its limited frame rate (15 FPS) and average sequence length (less than 40 frames).

Tab. 1 shows the parameter settings of our proposed model and the five compared models on the three evaluation datasets. Each value in the table indicates the number of neurons used in the corresponding layer, e.g., the number 30 × 5 (LL1, HU.L) means that each unidirectional subnet in the first learnable layer of HURNN-L has 30 neurons, and the number 15 × 2 × 5 (LL1, HB.L) indicates that each bidirectional subnet in the first BRNN layer (bl1) of HBRNN-L has 15 × 2 neurons. The six networks on the same dataset have roughly the same number of weights.

It should be noted that the results of all the comparative methods on the three datasets are taken from their corresponding papers.
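The preprocessing chain of Eqns. (15)-(16) plus the fixed-interval sampling can be sketched in a few lines of NumPy. The joint-index arguments and the choice to leave the two boundary frames unfiltered are our assumptions:

```python
import numpy as np

def normalize_skeleton(J, hip_center, hip_left, hip_right):
    """Eqn (15): translate all joints so the origin is the mean of the three hip joints.
    J has shape (T, num_joints, 3); the hip_* arguments are dataset-specific joint indices."""
    origin = (J[:, hip_center] + J[:, hip_left] + J[:, hip_right]) / 3.0
    return J - origin[:, None, :]

def smooth(x):
    """Eqn (16): 5-point Savitzky-Golay filter along the time axis of x, shape (T, ...).
    The first and last two frames are left as-is in this sketch."""
    f = x.astype(float).copy()
    for i in range(2, len(x) - 2):
        f[i] = (-3*x[i-2] + 12*x[i-1] + 17*x[i] + 12*x[i+1] - 3*x[i+2]) / 35.0
    return f

def subsample(seq, step):
    """Fixed-interval frame sampling (step = 16 for Berkeley MHAD, 4 for HDM05)."""
    return seq[::step]
```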
4.3. Experimental Results and Analysis

MSR Action3D Dataset: Although several validation methods on this dataset are summarized in [24], we follow the standard protocol provided in [18]. In this protocol, the dataset is divided into three action sets AS1, AS2 and AS3; the samples of subjects 1, 3, 5, 7, 9 are used for training, while the samples of subjects 2, 4, 6, 8, 10 are used for testing. We compare the proposed model HBRNN-L with Li et al. [18], Chen et al. [3], Gowayyed et al. [6], Vemulapalli et al. [31], and the five variant architectures DURNN-T, DBRNN-T, DURNN-L, DBRNN-L and HURNN-L. The experimental results are shown in Tab. 2. We can see that our proposed HBRNN-L achieves the best average accuracy and outperforms the four methods with hand-crafted features [3, 6, 18, 31], and the performances of the two derived models HURNN-L and DBRNN-L are promising. It should be noted that although Chen et al. [3] and Vemulapalli et al. [31] achieve the best performance on the action sets AS1 and AS3, respectively, our HBRNN-L outperforms them with respect to the average accuracy. Furthermore, HBRNN-L performs consistently well on all three action sets, which indicates that it is more robust to various data.

Table 2: Experimental results on the MSR Action3D Dataset.

Method                          AS1     AS2     AS3     Ave.
Li et al., 2010 [18]            72.9    71.9    79.2    74.7
Chen et al., 2013 [3]           96.2    83.2    92.0    90.47
Gowayyed et al., 2013 [6]       92.39   90.18   91.43   91.26
Vemulapalli et al., 2014 [31]   95.29   83.87   98.22   92.46
DURNN-T                         75.24   75.00   81.08   77.11
DBRNN-T                         81.90   80.36   88.29   83.52
DURNN-L                         87.62   91.96   90.01   89.86
DBRNN-L                         88.57   93.75   95.50   92.61
HURNN-L                         92.38   93.75   94.59   93.57
HBRNN-L                         93.33   94.64   95.50   94.49

The fact that HBRNN-L obtains higher average accuracy than HURNN-L, DBRNN-L and DURNN-L proves the importance of the bidirectional connection and hierarchical feature extraction. All the networks with LSTM neurons in the last recurrent layer (with suffix "-L") are better than their corresponding networks with tanh activation functions (with suffix "-T"), which verifies the effectiveness of LSTM neurons in the proposed network.

Figure 6: Confusion matrices of HBRNN-L on the MSR Action3D dataset. (a) AS1, 93.33%; (b) AS2, 94.64%; (c) AS3, 95.50%.

The confusion matrices on the three action sets are shown in Fig. 6. We can see that the misclassifications mainly occur among several very similar actions. For example, in Fig. 6a the action "PickupAndThrow" is often misclassified as "Bend", while the action "ForwardPunch" is misclassified as "TennisServe". Actually, "PickupAndThrow" just has one more "throw" move than "Bend", and the "throw" move often occupies only a few frames in the sequence, so it is very difficult to distinguish these two actions. The actions "ForwardPunch" and "TennisServe" share a large overlap in their sequences; distinguishing them is also very challenging with only joint coordinates.

Berkeley MHAD: We follow the experimental protocol proposed in [22] on this dataset. The 384 sequences of the first 7 subjects are used for training, while the 275 sequences of the last 5 subjects are used for testing.
We compare our proposed model with Ofli et al. [23], Vantigodi et al. [30], Vantigodi et al. [29], Kapsouras et al. [14] and Chaudhry et al. [2], as well as DURNN-T, DBRNN-T, DURNN-L, DBRNN-L and HURNN-L. The experimental results are shown in Tab. 3. We can see that HBRNN-L achieves 100% accuracy with only simple preprocessing and performs better than the five derived RNN architectures, which proves the advantages of the proposed model once again. Meanwhile, the six RNN architectures obtain higher accuracy than Ofli et al. [23], Vantigodi et al. [30], Vantigodi et al. [29] and Kapsouras et al. [14], and results comparable with Chaudhry et al. [2], which means that our proposed model provides an effective end-to-end solution for modelling temporal dynamics in action sequences.

Table 3: Experimental results on the Berkeley MHAD.

Method                        Acc.(%)   Method     Acc.(%)
Ofli et al., 2014 [23]        95.37     DURNN-T    98.55
Vantigodi et al., 2013 [30]   96.06     DBRNN-T    99.27
Vantigodi et al., 2014 [29]   97.58     DURNN-L    98.55
Kapsouras et al., 2014 [14]   98.18     DBRNN-L    99.64
Chaudhry et al., 2013 [2]     99.27     HURNN-L    99.64
Chaudhry et al., 2013 [2]     100       HBRNN-L    100

HDM05: We follow the experimental protocol proposed in [4] and perform 10-fold cross-validation on this dataset. We compare our proposed model with Cho and Chen [4] and the five other architectures DURNN-T, DBRNN-T, DURNN-L, DBRNN-L and HURNN-L. The experimental results are shown in Tab. 4. The proposed model HBRNN-L obtains the state-of-the-art average accuracy of 96.92% with a standard deviation of 0.50. The derived models HURNN-L, DBRNN-L and DURNN-L also obtain excellent results.

Table 4: Experimental results on the HDM05.

Method                    Ave.(%)   Std.
Cho and Chen, 2013 [4]    95.59     0.76
DURNN-T                   94.63     1.16
DBRNN-T                   94.79     1.11
DURNN-L                   96.62     0.53
DBRNN-L                   96.70     0.51
HURNN-L                   96.70     0.41
HBRNN-L                   96.92     0.50

Figure 7: Two typical confusion matrices of HBRNN-L on the HDM05 dataset, with accuracies of (a) 97.52% and (b) 96.04%. The numbers on the horizontal and vertical axes correspond to the action categories [4].

Two typical confusion matrices from the 10-fold cross-validation of HBRNN-L are shown in Fig. 7. We can see that our model performs well on most of the actions. The misclassifications mainly come from the following categories: "5-depositHighR", "6-depositLowR", "7-depositMiddleR", "10-grabHighR", "11-grabLowR", and "12-grabMiddleR". Further checking the "grab" and "deposit" related skeleton sequences, we find that these two categories of actions share similar spatial and temporal variations. Both can be decomposed into three sub-actions in chronological order: stretching out one hand, grabbing or depositing something, and drawing back the hand. The minor differences between grabbing and depositing something make it difficult to distinguish these two kinds of actions. It should be noted that although the original 130 actions are reduced to 65 categories, there are still several confusing categories, e.g., "39-sitDownChair" and "42-sitDownTable", which should belong to the same action. Without the context of the actions, e.g., recognizing the chair and table from their appearances, it is very difficult to distinguish such actions just from skeleton streams.

4.4. Discussion

Overfitting issues: The experiments show that the models with suffix "-L" easily overfit, while those with suffix "-T" always underfit during training; the latter is probably due to the vanishing gradient problem caused by using the tanh activation function in all layers. In order to overcome the overfitting problem in our proposed HBRNN-L and the other derived networks with suffix "-L", we adopt strategies such as adding input noise, adding weight noise and early stopping [7, 8]. In our practice, we find that adding weight noise is more effective than adding input noise, and the commonly-used dropout strategy [39] does not work here. For the underfitting problem of the models with suffix "-T", we use a retraining strategy, tuning the learning rate and adding various levels of input noise and weight noise.

Computational efficiency: We take the Berkeley MHAD dataset as an example to illustrate the efficiency of HBRNN-L. With a C++ implementation on a 3.2 GHz CPU system, each training epoch consisting of 384 sequences takes 50 s (127 ms per sequence on average). After about 30 epochs, we can get an accuracy greater than 98%. During testing, it takes 52.46 ms per sequence (about 234 frames per sequence). It should be mentioned that HURNN-L, which achieves comparable performance with HBRNN-L, runs much faster and is more suitable for online applications.

5. Conclusion and Future Work

In this paper, we proposed an end-to-end hierarchical recurrent neural network for skeleton based action recognition. We first divide the human skeleton into five parts and feed them to five subnets. As the number of layers increases, the representations in the subnets are hierarchically fused to be the inputs of higher layers. A perceptron is performed on the learned representations of the skeleton sequences to obtain the final recognition results. Experimental results on three publicly available datasets demonstrate the effectiveness of the proposed network.

As we analyzed on the HDM05 and MSR Action3D datasets, similar human actions are very difficult to distinguish from the skeleton joints alone. In the future, we will consider combining more features, e.g., object appearance, into the proposed hierarchical recurrent neural network.

Acknowledgement

This work was supported by the National Basic Research Program of China (2012CB316300) and the National Natural Science Foundation of China (61175003, 61135002, 61202328, 61420106015, U1435221).

References

[1] M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt. Sequential deep learning for human action recognition. In Human Behavior Understanding, pages 29-39. Springer, 2011.
[2] R. Chaudhry, F. Ofli, G. Kurillo, R. Bajcsy, and R. Vidal. Bio-inspired dynamic 3D discriminative skeletal features for human action recognition. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 471-478. IEEE, 2013.
[3] C. Chen, K. Liu, and N. Kehtarnavaz. Real-time human action recognition based on depth motion maps. Journal of Real-Time Image Processing, pages 1-9, 2013.
[4] K. Cho and X. Chen. Classifying and visualizing motion capture sequences using deep neural networks. CoRR, abs/1306.3874, 2013.
[5] D. Gong, G. Medioni, and X. Zhao. Structured time series analysis for human action segmentation and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7), 2014.
[6] M. A. Gowayyed, M. Torki, M. E. Hussein, and M. El-Saban. Histogram of oriented displacements (HOD): Describing trajectories of human joints for action recognition. In International Joint Conference on Artificial Intelligence, pages 1351-1357. AAAI Press, 2013.
[7] A. Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pages 2348-2356, 2011.
[8] A. Graves. Supervised Sequence Labelling with Recurrent Neural Networks, volume 385. Springer, 2012.
[9] A. Graves and N. Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In International Conference on Machine Learning, pages 1764-1772, 2014.
[10] A. Graves, A. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks.
In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6645-6649. IEEE, 2013.
[11] A. Grushin, D. D. Monner, J. A. Reggia, and A. Mishra. Robust human action recognition via long short-term memory. In International Joint Conference on Neural Networks, pages 1-8. IEEE, 2013.
[12] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, 2001.
[13] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.
[14] I. Kapsouras and N. Nikolaidis. Action recognition on motion capture data using a dynemes and forward differences representation. Journal of Visual Communication and Image Representation, 25(6):1432-1445, Aug. 2014.
[15] J. Koutník, J. Schmidhuber, and F. Gomez. Evolving deep unsupervised convolutional networks for vision-based reinforcement learning. In Conference on Genetic and Evolutionary Computation, pages 541-548. ACM, 2014.
[16] G. Lefebvre, S. Berlemont, F. Mamalet, and C. Garcia. BLSTM-RNN based 3D gesture classification. In Artificial Neural Networks and Machine Learning, pages 381-388. Springer, 2013.
[17] K. Li and Y. Fu. Prediction of human activity by discovering temporal sequence patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(8), 2014.
[18] W. Li, Z. Zhang, and Z. Liu. Action recognition based on a bag of 3D points. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pages 9-14. IEEE, 2010.
[19] J. Luo, W. Wang, and H. Qi. Group sparsity and geometry constrained dictionary learning for action recognition from depth maps. In IEEE International Conference on Computer Vision, pages 1809-1816. IEEE, 2013.
1, 2 networks for skeletal joints based action segmentation and [20] F. Lv and R. Nevatia. Recognition and segmentation of 3- recognition. In IEEE International Conference on Computer d human action using hmm and multi-class adaboost. In Vision, 2014. 1, 2 European Conference on Computer Vision, pages 359–372. [35] L. Xia, C.-C. Chen, and J. Aggarwal. View invariant human Springer, 2006. 1, 2 action recognition using histograms of 3d joints. In IEEE [21] M. Muller,¨ T. Roder,¨ M. Clausen, B. Eberhardt, B. Kruger,¨ Computer Society Conference on Computer Vision and Pat- and A. Weber. Documentation mocap database hdm05. tern Recognition Workshops, pages 20–27. IEEE, 2012. 1 Technical Report CG-2007-2, Universitat¨ Bonn, June 2007. [36] X. Yang and Y. Tian. Super normal vector for activity recog- 5 nition using depth sequences. IEEE Conference on Com- [22] F. Ofli, R. Chaudhry, G. Kurillo, R. Vidal, and R. Bajcsy. puter Vision and Pattern Recognition. 1 Berkeley mhad: A comprehensive multimodal human action [37] M. Ye, Q. Zhang, L. Wang, J. Zhu, R. Yang, and J. Gall. A database. In IEEE Workshop on Applications of Computer survey on human motion analysis from depth data. In Time- Vision, pages 53–60. IEEE, 2013. 5, 7 of-Flight and Depth Imaging. Sensors, Algorithms, and Ap- [23] F. Ofli, R. Chaudhry, G. Kurillo, R. Vidal, and R. Bajcsy. plications, pages 149–187. Springer, 2013. 1 Sequence of the most informative joints (smij): A new rep- [38] M. Zanfir, M. Leordeanu, and C. Sminchisescu. The moving resentation for human skeletal action recognition. Journal of pose: An efficient 3d kinematics descriptor for low-latency Visual Communication and Image Representation, 25(1):24– action recognition and detection. In IEEE International Con- 38, 2014. 7 ference on Computer Vision, pages 2752–2759. IEEE, 2013. [24] J. R. Padilla-Lopez,´ A. A. Chaaraoui, and F. Florez-Revuelta.´ 2 A discussion on the validation tests employed to compare [39] W. Zaremba, I. Sutskever, and O. Vinyals. Recurrent neural human action recognition methods using the MSR action3d network regularization. CoRR, abs/1409.2329, 2014. 8 dataset. CoRR, abs/1407.7390, 2014. 6 [25] A. Savitzky and M. J. Golay. Smoothing and differentiation of data by simplified least squares procedures. Analytical Chemistry, 36(8):1627–1639, 1964. 6 [26] M. Schuster and K. K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on , 45(11):2673–2681, 1997. 3 [27] J. Shotton, T. Sharp, A. Kipman, A. Fitzgibbon, M. Finoc- chio, A. Blake, M. Cook, and R. Moore. Real-time human pose recognition in parts from single depth images. Commu- nications of the ACM, 56(1):116–124, 2013. 1 [28] J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. arXiv preprint arXiv:1406.2984, 2014. 1 [29] S. Vantigodi and V. B. Radhakrishnan. Action recognition from motion capture data using meta-cognitive rbf network