
SPATIAL-TEMPORAL AUGMENTATION BASED ON LSTM NETWORK FOR SKELETON-BASED HUMAN ACTION RECOGNITION

Juanhui Tu1, Hong Liu1, Fanyang Meng1,2, Mengyuan Liu3, Runwei Ding1

1Key Laboratory of Machine Perception, Peking University, Shenzhen Graduate School 2Shenzhen Institute of Information Technology, China 3School of Electrical and Electronic Engineering, Nanyang Technological University {[email protected], [email protected], [email protected], [email protected], [email protected]}

ABSTRACT

Data augmentation is known to be of crucial importance for the generalization of RNN-based methods for skeleton-based human action recognition. Traditional data augmentation methods artificially adopt various transformations merely in the spatial domain, which lack effective temporal representation. This paper extends the traditional Long Short-Term Memory (LSTM) and presents a novel LSTM autoencoder network (LSTM-AE) for spatial-temporal data augmentation. In the LSTM-AE, the LSTM network preserves the temporal information of skeleton sequences, and the autoencoder architecture can automatically eliminate irrelevant and redundant information. Meanwhile, a regularized cross-entropy loss is defined to guide the LSTM-AE to learn more suitable representations of skeleton data. Experimental results on the currently largest NTU RGB+D dataset and the public SmartHome dataset verify that the proposed model outperforms the state-of-the-art methods and can easily be integrated with most RNN-based action recognition models.

Index Terms— 3D Action Recognition, Long Short-Term Memory, Data Augmentation, Autoencoder

1. INTRODUCTION

Human action recognition has been used in a wide range of applications, such as video surveillance [1], human-machine interaction [2], and video analysis [3]. With the wide spread of depth sensors such as Microsoft Kinect, action recognition using 3D skeleton sequences has attracted a lot of research attention. Many advanced approaches have been proposed [4–6], especially methods based on the Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM). Despite significant progress, the generalization ability of RNN models is still a research focus.

Relation to prior work: As neural networks often require a lot of data to improve generalization and reduce the risk of over-fitting, data augmentation is an explicit form of regularization that is widely used during the training of deep neural networks [7–10]. It aims at enlarging the training dataset from existing data using various transformations. Wang et al. [7] proposed rotation, scaling and shear transformations as data augmentation techniques based on 3D transformation to make better use of the limited supply of training data. Ke et al. [8] employed a cropping technique to increase the number of samples. Yang et al. [9] exploited horizontal flip as a data augmentation method without losing any information. Li et al. [10] designed different data augmentation strategies, such as 3D coordinate random rotation, Gaussian noise and video crop, to augment the scale of the original dataset.

However, the aforementioned data augmentation methods only leverage various transformations in the spatial domain, which ignore the effective representation in the temporal domain. For instance, the horizontal flip method confuses the temporal information of skeleton sequences. Different from previous works, our proposed LSTM autoencoder network (LSTM-AE) can retain the temporal representation of skeleton sequences. In essence, the above methods add interference information unrelated to classification to expand the dataset, and deep neural networks are then utilized to learn suitable features related to classification. In contrast, with the characteristic of the autoencoder, our LSTM-AE can eliminate irrelevant information such as noise. In consequence, based on samples generated from the LSTM-AE, deep neural networks can directly learn discriminative features related to classification. Moreover, the proposed regularized cross-entropy loss enables original samples to be consistent with generated samples at the semantic level.

Our main contributions are as follows: (1) A novel spatial-temporal data augmentation network (LSTM-AE) is designed to generate samples which reserve both the spatial and temporal representation of skeleton sequences, and can be integrated with various RNN-based models. (2) A regularized cross-entropy loss is defined to guide the LSTM-AE to learn more suitable representations of skeleton sequences.

This work is supported by National Natural Science Foundation of China (NSFC, No. U1613209, 61340046, 61673030), Natural Science Foundation of Guangdong Province (No. 2015A030311034), Scientific Research Project of Guangdong Province (No. 2015B010919004), Specialized Research Fund for Strategic and Prospective Industrial Development of Shenzhen City (No. ZLZBCXLJZI20160729020003), Scientific Research Project of Shenzhen City (No. JCYJ20170306164738129), Shenzhen Key Laboratory for Intelligent Multimedia and Virtual Reality (No. ZDSYS201703031405467).


Fig. 1. (a) Overall framework of the end-to-end RNN-based method, which consists of the LSTM autoencoder network and RNN-based models. (b) The contrastive network with LSTM-AE. (c) The baseline LSTM network (RNN-based method without LSTM-AE).

2. THE PROPOSED METHOD

In this section, the overall framework of the end-to-end RNN-based method for skeleton-based human action recognition is illustrated in Fig. 1(a). It consists of the LSTM autoencoder network (LSTM-AE) and RNN-based models. Fig. 1(b) and (c) are listed for comparison with our proposed method. The remainder of this section is organized as follows: we first describe the LSTM-AE, then introduce the three RNN-based models that we adopt in our experimental section. Finally, the regularized cross-entropy loss of the LSTM-AE is introduced.

2.1. LSTM Autoencoder Network

RNN is a powerful model for sequential data modeling and feature extraction [11], which is designed to preserve temporal information. Due to the vanishing gradient and error blowing up problems [12, 13], the standard RNN can barely store information for long periods of time. The advanced RNN architecture LSTM [13] mitigates this problem. An LSTM neuron contains a memory cell c_t which has a self-connected recurrent edge of weight 1. At each time step t, the neuron can choose to write, reset or read the memory cell, governed by the input gate i_t, forget gate f_t, and output gate o_t:

    i_t, f_t, o_t = σ(W_x x_t + W_h h_{t-1} + b)
    g_t = tanh(W_{xg} x_t + W_{hg} h_{t-1} + b_g)
    c_t = f_t ∗ c_{t-1} + i_t ∗ g_t                                          (1)
    h_t = o_t ∗ tanh(c_t)

We employ LSTM neurons to build the proposed LSTM-AE. The network is capable of retaining the effective temporal information of skeleton sequences, which is different from traditional data transformations in the spatial domain. As shown in Fig. 1(a), for a skeleton sequence as input, the input data X and the reconstruction data X̄ produced by the autoencoder architecture are fed to the RNN-based models in parallel. In this way, they share the weight parameters of the RNN-based models during network training. D(X) and D(X̄) are the outputs of the RNN-based models respectively, and the final output of the RNN-based models of the LSTM-AE is represented as D(X) + D(X̄). Fig. 1(b) shows the contrastive network with LSTM-AE; it does not take the original data X as an additional input to the RNN-based models. The contrastive network is utilized to demonstrate the validity of the LSTM-AE architecture.

The autoencoder architecture of the LSTM-AE comprises an encoder and a decoder. The encoder and decoder share a similar structure, i.e., stacking several LSTM layers. The number of LSTM layers used to construct the autoencoder architecture is flexible. Suppose both the encoder and the decoder contain two LSTM layers, as shown in Fig. 1(a); then the number of neurons of the second LSTM layer in the encoder is equal to that of the first LSTM layer in the decoder, which corresponds to the compression dimension K. Notably, different compression dimensions affect the data reconstruction capability. To be more specific, in the encoder step, the input data X are mapped to a compressed data representation f(x) in a low-dimensional subspace. In the decoder step, the compressed data representation f(x) is mapped to a vector X̄ in the original data space.

2.2. RNN-based Models

Since the LSTM network is capable of modeling long-term temporal dynamics and automatically learning feature representations, many recent works leverage LSTM neurons as basic units to build deep architectures that recognize human actions from raw skeleton inputs. The compared LSTM architectures are introduced as follows:

Deep LSTM network (baseline): According to [14, 15], as shown in Fig. 1(c), we build the baseline LSTM network by stacking three LSTM layers, called the deep LSTM network, followed by one fully-connected layer.

Deep Bidirectional LSTM (BLSTM) network: The idea of BLSTM is derived from the bidirectional RNN [16], which processes sequence data in both forward and backward directions with two separate hidden layers. We use BLSTM instead of LSTM to implement the baseline, which generates a new BLSTM network.

Deep LSTM-zoneout (LSTMZ) network: Zoneout [17] is a method for regularizing RNNs. Instead of discarding (setting to zero) the output of each hidden neuron with a probability during training, as dropout does, zoneout stochastically forces some hidden units to maintain their previous values at each timestep. Hence, the computation of c_t and h_t is changed as follows:

    c_t = d_t^c ∗ c_{t-1} + (1 − d_t^c) ∗ (f_t ∗ c_{t-1} + i_t ∗ g_t)                       (2)
    h_t = d_t^h ∗ h_{t-1} + (1 − d_t^h) ∗ (o_t ∗ tanh(f_t ∗ c_{t-1} + i_t ∗ g_t))           (3)

where d_t^c and d_t^h are the zoneout masks. Based on zoneout, we build the deep LSTM-zoneout network. Its architecture is similar to that of the deep LSTM network, including three LSTM-zoneout layers and one fully-connected layer.
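To make the above concrete, the listing below is a minimal PyTorch sketch of a zoneout LSTM step implementing Eqs. (2) and (3), together with the three-layer baseline of Fig. 1(c). Class names (ZoneoutLSTMCell, DeepLSTM) and default sizes (150-D frames, 100 neurons, 60 classes) are illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn

class ZoneoutLSTMCell(nn.Module):
    """One LSTM step with zoneout on the cell and hidden states, as in Eqs. (2)-(3)."""

    def __init__(self, input_size, hidden_size, zoneout_prob=0.1):
        super().__init__()
        self.cell = nn.LSTMCell(input_size, hidden_size)
        self.zoneout_prob = zoneout_prob

    def forward(self, x_t, state):
        h_prev, c_prev = state
        # Ordinary LSTM update of Eq. (1): c_new = f*c_prev + i*g, h_new = o*tanh(c_new)
        h_new, c_new = self.cell(x_t, (h_prev, c_prev))
        if self.training:
            # Binary zoneout masks d_t^c and d_t^h: a 1 keeps the previous value
            d_c = torch.bernoulli(torch.full_like(c_new, self.zoneout_prob))
            d_h = torch.bernoulli(torch.full_like(h_new, self.zoneout_prob))
            c_t = d_c * c_prev + (1.0 - d_c) * c_new          # Eq. (2)
            h_t = d_h * h_prev + (1.0 - d_h) * h_new          # Eq. (3)
        else:
            # Expected-value behaviour at test time, analogous to dropout
            c_t = self.zoneout_prob * c_prev + (1.0 - self.zoneout_prob) * c_new
            h_t = self.zoneout_prob * h_prev + (1.0 - self.zoneout_prob) * h_new
        return h_t, c_t

class DeepLSTM(nn.Module):
    """Baseline of Fig. 1(c): three stacked LSTM layers followed by one fully-connected layer."""

    def __init__(self, input_dim=150, hidden_dim=100, num_classes=60):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers=3, batch_first=True, dropout=0.5)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):                      # x: (batch, time, input_dim)
        out, _ = self.lstm(x)
        return self.fc(out[:, -1])             # class scores from the last time step

Replacing nn.LSTM by a bidirectional LSTM, or unrolling ZoneoutLSTMCell over the time axis, would give the BLSTM and LSTMZ variants in the same spirit.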

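Similarly, the encoder-decoder of Fig. 1(a) can be sketched as two stacked-LSTM blocks around the K-dimensional representation f(x), with one RNN-based model (e.g. the DeepLSTM above) scoring both X and the reconstruction X̄ through shared weights. This is a minimal reading of the figure; layer sizes and class names are our own assumptions.

import torch.nn as nn

class LSTMAutoencoder(nn.Module):
    """Encoder-decoder of the LSTM-AE: X -> f(x) (K dims) -> X_bar in the original space."""

    def __init__(self, input_dim=150, hidden_dim=100, k_dim=100):
        super().__init__()
        self.enc1 = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.enc2 = nn.LSTM(hidden_dim, k_dim, batch_first=True)     # compression dimension K
        self.dec1 = nn.LSTM(k_dim, k_dim, batch_first=True)          # first decoder layer also has K neurons
        self.dec2 = nn.LSTM(k_dim, input_dim, batch_first=True)      # back to the original data space

    def forward(self, x):                          # x: (batch, time, input_dim)
        f_x, _ = self.enc2(self.enc1(x)[0])        # compressed representation f(x)
        x_bar, _ = self.dec2(self.dec1(f_x)[0])    # reconstruction X_bar
        return x_bar

class LSTMAEClassifier(nn.Module):
    """X and X_bar pass through one RNN-based model with shared weights (Fig. 1(a))."""

    def __init__(self, rnn_model, autoencoder):
        super().__init__()
        self.rnn_model = rnn_model                 # e.g. DeepLSTM, a BLSTM or an LSTMZ network
        self.autoencoder = autoencoder

    def forward(self, x):
        x_bar = self.autoencoder(x)
        logits_x = self.rnn_model(x)               # D(X)
        logits_xbar = self.rnn_model(x_bar)        # D(X_bar), same parameters as above
        return logits_x, logits_xbar, x_bar

During training the two sets of scores are kept separate so that the loss in Sec. 2.3 can weigh them individually, while the final prediction uses the sum D(X) + D(X̄).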
2.3. Regularized Loss Function

To guide the proposed LSTM-AE to learn more discriminative and suitable representations of skeleton sequences, we formulate a regularized loss function with cross-entropy loss for a sequence as:

    L = loss(ŷ_c, y) + loss(ŷ_r, y) + λ · ||X − X̄||_2                                      (4)

where y = (y_1, y_2, ..., y_C)^T denotes the groundtruth label for each skeleton sequence and C represents the total number of classes. If a skeleton sequence belongs to the i-th class, y_i equals one; otherwise, y_i equals zero. loss(ŷ_c, y) denotes the classification loss and loss(ŷ_r, y) denotes the reconstruction loss. The cross-entropy loss is utilized to formulate the two losses as:

    loss(ŷ_c, y) = − Σ_{i=1}^{C} y_i log ŷ_c^i,    loss(ŷ_r, y) = − Σ_{i=1}^{C} y_i log ŷ_r^i          (5)

Specifically, as shown in Fig. 1(a), for X as the input to the RNN-based models, ŷ_c^i in the first term indicates the probability that the sequence is predicted as the i-th class. For X̄ as the input to the RNN-based models, ŷ_r^i in the second term indicates the probability that the sequence is predicted as the i-th class. The scalar λ balances the significance between the reconstruction loss and the classification loss. The regularization term minimizes the difference between X and X̄ under the l2 norm. The regularized loss function is designed to ensure consistency at the semantic level between X and X̄.
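A minimal PyTorch sketch of Eqs. (4) and (5), assuming the model returns the two sets of class scores and the reconstruction as in the listing above; F.cross_entropy realizes the cross-entropy terms and lambda_reg plays the role of λ (the function name is ours).

import torch
import torch.nn.functional as F

def lstm_ae_loss(logits_x, logits_xbar, x, x_bar, labels, lambda_reg=1.0):
    """Regularized loss of Eq. (4): classification + reconstruction cross-entropy + lambda * ||X - X_bar||_2."""
    loss_cls = F.cross_entropy(logits_x, labels)       # loss(y_hat_c, y) in Eq. (5), computed from D(X)
    loss_rec = F.cross_entropy(logits_xbar, labels)    # loss(y_hat_r, y) in Eq. (5), computed from D(X_bar)
    l2_term = torch.norm(x - x_bar, p=2)               # l2 distance between X and X_bar
    return loss_cls + loss_rec + lambda_reg * l2_term

The scalar λ balances the l2 regularization term against the two cross-entropy terms; Sec. 3.3 compares λ ∈ {0, 1, 10}.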

3. EXPERIMENTS

Experiments are conducted on two challenging 3D action datasets which have limited training data. We first introduce the NTU RGB+D dataset [18] and the SmartHome dataset [19], and then present and analyze the experimental results.

3.1. Datasets and Protocols

NTU RGB+D dataset [18] contains 60 actions performed by 40 subjects from various views, generating 56,880 skeleton sequences. This dataset contains noisy skeleton joints (see Fig. 2(c)), which bring an extra challenge for recognition. Following the cross-subject protocol in [18], we split the dataset into 40,320 training samples and 16,560 test samples. Following the cross-view protocol in [18], the training and test sets have 37,920 and 18,960 samples respectively. There is a great difference between the training set and the test set for the same kind of action (see Fig. 2(a)(b)). We use this dataset to show that our spatial-temporal data augmentation method can alleviate this type of difference.

Fig. 2. Snaps from NTU RGB+D dataset [18]: (a) drink water (training set), (b) drink water (test set), (c) noisy skeletons.

SmartHome dataset [19] contains 6 types of actions performed 6 times by 9 subjects in 5 situations from a single view, resulting in 1,620 sequences. Due to occlusions and the unconstrained poses of action performers, skeleton joints contain much noise. The noisy skeleton snaps of the action "wave" are illustrated in Fig. 3. Following the cross-subject protocol in [19], we use subjects #1, 3, 5, 7, 9 for training and subjects #2, 4, 6, 8 for testing.

Fig. 3. Skeletons of action "wave" from SmartHome dataset [19] in five situations: sit, stand, with a pillow, with a laptop, with a person.

3.2. Implementation Details

The implementation is based on the PyTorch toolbox with one NVIDIA GeForce GTX 1080 GPU, and our code is open source (https://github.com/Damilytutu/LSTMAE). For the LSTM-AE, the numbers of LSTM layers in the encoder and decoder are both set to two. For the RNN-based models, each LSTM layer is composed of 100 LSTM neurons, and the number of neurons of the FC layer is equal to the number of action classes. Dropout [20] with a probability of 0.5 is used to alleviate overfitting. Adam [21] is adopted to train all the networks, and the initial learning rate is set to 0.001. For the deep LSTM-zoneout network, the zoneout probability is set to 0.1. The batch sizes for the SmartHome dataset and the NTU RGB+D dataset are 32 and 256 respectively.
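For illustration, one training step with these settings could be wired together from the earlier sketches roughly as follows (Adam, learning rate 0.001, batch size 256 for NTU RGB+D, K = 100). The random tensors stand in for a real data loader, the sequence length of 100 frames is an assumption, and the released code may organize this differently.

import torch

# Stand-in batch: 256 sequences of 100 frames with 150-D skeleton coordinates (assumed length).
x = torch.randn(256, 100, 150)
labels = torch.randint(0, 60, (256,))

model = LSTMAEClassifier(rnn_model=DeepLSTM(input_dim=150, hidden_dim=100, num_classes=60),
                         autoencoder=LSTMAutoencoder(input_dim=150, hidden_dim=100, k_dim=100))
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)   # initial learning rate from Sec. 3.2

model.train()
logits_x, logits_xbar, x_bar = model(x)                      # D(X), D(X_bar), reconstruction
loss = lstm_ae_loss(logits_x, logits_xbar, x, x_bar, labels, lambda_reg=1.0)
optimizer.zero_grad()
loss.backward()
optimizer.step()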
For tively on NTU RGB+D, and obtain 78.82% using cross-subject cross-view protocol, our method outperforms deep LSTM protocol on SmartHome. network (baseline), deep BLSTM network and deep LSTMZ 4. CONLUSIONS network by 4.32%, 3.87% and 2.83%, respectively. These This paper presents a novel spatial-temporal data augmenta- results indicate that models trained with LSTM-AE have sig- tion network (LSTM-AE), generating samples which reserve nificant improvement, validating our method is applicable to the spatial and temporal information of skeleton sequences. various LSTM architectures. Meanwhile, the architecture of autoencoder can eliminate the - Comparing with data augmentation methods. Fig.6 irrelevant information. Besides, a regularized cross-entropy shows the performance among Scale, Rotation and our LSTM- loss is designed to guide the LSTM-AE to learn more underly- AE on the NTU(cross-subject), NTU(cross-view) and S- ing and discriminative representations. Experiments conducted martHome dataset respectively. Our LSTM-AE outperforms on public SmartHome and NTU RGB+D datasets demonstrate the other two methods. Especially in comparison with the that our method outperforms the state-of-the-art methods, and baseline without any augmentation, our LSTM-AE brings up can be integrated with most of the RNN-based models.

5. REFERENCES

[1] J. Zheng, Z. Jiang, and R. Chellappa, "Cross-view action recognition via transferable dictionary learning," IEEE TIP, vol. 25, no. 6, pp. 2542–2556, 2016.

[2] H. Liu, Q. He, and M. Liu, "Human action recognition using adaptive hierarchical depth motion maps and gabor filter," in Proc. ICASSP, pp. 1847–1851, 2017.

[3] M. Liu, H. Liu, and C. Chen, "Enhanced skeleton visualization for view invariant human action recognition," PR, vol. 68, pp. 346–362, 2017.

[4] J. Aggarwal and X. Lu, "Human activity recognition from 3D data: A review," PRL, vol. 48, pp. 70–80, 2014.

[5] L. Presti and M. La Cascia, "3D skeleton-based human action classification: A survey," PR, vol. 53, pp. 130–147, 2016.

[6] J. Zhang, W. Li, P. Ogunbona, P. Wang, and C. Tang, "RGB-D-based action recognition datasets: A survey," PR, vol. 60, pp. 86–105, 2016.

[7] H. Wang and L. Wang, "Modeling temporal dynamics and spatial configurations of actions using two-stream recurrent neural networks," arXiv preprint arXiv:1704.02581, 2017.

[8] Q. Ke, M. Bennamoun, S. An, F. Sohel, and F. Boussaid, "A new representation of skeleton sequences for 3D action recognition," in Proc. CVPR, 2017.

[9] W. Yang, T. Lyons, H. Ni, C. Schmid, L. Jin, and J. Chang, "Leveraging the path signature for skeleton-based human action recognition," arXiv preprint arXiv:1707.03993, 2017.

[10] B. Li, Y. Dai, X. Cheng, H. Chen, Y. Lin, and M. He, "Skeleton based action recognition using translation-scale invariant image mapping and multi-scale deep CNN," arXiv preprint arXiv:1704.05645, 2017.

[11] A. Graves, "Supervised sequence labelling with recurrent neural networks," Springer, vol. 385, 2012.

[12] S. Hochreiter, Y. Bengio, P. Frasconi, J. Schmidhuber, et al., "Gradient flow in recurrent nets: the difficulty of learning long-term dependencies," 2001.

[13] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[14] W. Zhu, C. Lan, J. Xing, W. Zeng, Y. Li, L. Shen, and X. Xie, "Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks," in Proc. AAAI, vol. 2, pp. 3697–3703, 2016.

[15] S. Song, C. Lan, J. Xing, W. Zeng, and J. Liu, "An end-to-end spatio-temporal attention model for human action recognition from skeleton data," in Proc. AAAI, pp. 4263–4270, 2017.

[16] M. Schuster and K. Paliwal, "Bidirectional recurrent neural networks," IEEE TSP, vol. 45, no. 11, pp. 2673–2681, 1997.

[17] D. Krueger, T. Maharaj, J. Kramár, M. Pezeshki, et al., "Zoneout: Regularizing RNNs by randomly preserving hidden activations," arXiv preprint arXiv:1606.01305, 2016.

[18] A. Shahroudy, J. Liu, T. Ng, and G. Wang, "NTU RGB+D: A large scale dataset for 3D human activity analysis," in Proc. CVPR, pp. 1010–1019, 2016.

[19] M. Liu, Q. He, and H. Liu, "Fusing shape and motion matrices for view invariant action recognition using 3D skeletons," in Proc. ICIP, 2017.

[20] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," JMLR, vol. 15, no. 1, pp. 1929–1958, 2014.

[21] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint, 2014.

[22] Y. Du, W. Wang, and L. Wang, "Hierarchical recurrent neural network for skeleton based action recognition," in Proc. CVPR, pp. 1110–1118, 2015.

[23] J. Liu, A. Shahroudy, D. Xu, and G. Wang, "Spatio-temporal LSTM with trust gates for 3D human action recognition," in Proc. ECCV, pp. 816–833, 2016.

[24] S. Zhang, X. Liu, and J. Xiao, "On geometric features for skeleton-based action recognition using multilayer LSTM networks," in Proc. WACV, pp. 148–157, 2017.

[25] J. Liu, G. Wang, P. Hu, L. Duan, and A. Kot, "Global context-aware attention LSTM networks for 3D action recognition," in Proc. CVPR, vol. 7, pp. 1647–1656, 2017.

[26] Y. Du, Y. Fu, and L. Wang, "Skeleton based action recognition with convolutional neural network," in Proc. ACPR, pp. 579–583, 2015.

[27] P. Wang, Z. Li, Y. Hou, and W. Li, "Action recognition based on joint trajectory maps using convolutional neural networks," in Proc. ACM Multimedia, pp. 102–106, 2016.
