Spatiotemporal Learning of Dynamic Gestures from 3D Point Cloud Data

Joshua Owoyemi and Koichi Hashimoto

The authors are with the Department of System Information Sciences, Graduate School of Information Sciences, Tohoku University, Aoba-ku, Sendai 980-8579, Japan (email: [email protected]; [email protected]).

Abstract— In this paper, we demonstrate an end-to-end spatiotemporal gesture learning approach for 3D point cloud data using a new gestures dataset of point clouds acquired from a 3D sensor. Nine classes of gestures were learned from gesture sample data. We mapped point cloud data into dense occupancy grids, and time steps of the occupancy grids are then used as inputs into a 3D convolutional neural network which learns the spatiotemporal features in the data without explicit modeling of gesture dynamics. We also introduced a 3D region of interest jittering approach for point cloud data augmentation. This resulted in an increased classification accuracy of up to 10% when the augmented data is added to the original training data. The developed model is able to classify gestures from the dataset with 84.44% accuracy. We propose that point cloud data will be a more viable data type for scene understanding and motion recognition, as 3D sensors become ubiquitous in the years to come.

I. INTRODUCTION

In recent years, understanding human motion has been gaining more popularity [1] in robotics, mainly for human-robot interaction (HRI) [2] applications. This knowledge is useful because of the increase in situations where humans and robots share working and living spaces. There is therefore a need for robots to 'understand' the intentions of humans through gesture and action recognition, and through interpretation and prediction of human motions. This will help them work effectively with humans and ensure the safety of both parties. While there has been a lot of research using 2D sensors and images for this purpose [3] [4], 3D sensors have gained traction because of increased accessibility to low-cost 3D sensors such as the Kinect [5], allowing for more uses of 3D data.

In this paper we are interested in learning human actions and gestures from 3D point cloud data. According to [6], actions are more generic whole-body movements, while gestures are more fine-grained upper-body movements performed by a user that have a meaning in a particular context. In our previous work [7], we developed a model to predict the intentions of human arm motions in a workspace. We further evaluate this approach for dynamic gestures in this paper.

Researchers have used 2D data to achieve some remarkable progress in action and gesture recognition [8] [9] [10] [3]. However, using 2D images has proven difficult in some situations, such as varying illumination conditions and cluttered backgrounds [11]. On the other hand, 3D data such as point clouds and depth maps offer advantages such as illumination invariance [12] and can capture appropriate information about the exact size and shape of an object in its physical space, allowing precise and accurate data usage for 3D manipulation, coordination and visualization [13] [14].

In most of the reviewed works, researchers used depth maps [15] or skeleton representations [16] [17] from 3D sensors to analyze or learn human motion. To our knowledge, there have been few works on gesture recognition based only on 3D point cloud data. This might be largely because point cloud data are unorganized, making them difficult to use directly as model inputs. Therefore, considerable preprocessing has to be done in order to convert the raw data into a usable format. In our case, we make use of an occupancy grid representation, with unit cells referred to as voxels. This representation can be used for point cloud spaces of arbitrary size, subject to the resolution of the voxels.

The key contributions of our work are as follows: 1. We demonstrate spatiotemporal learning from point cloud data through a new dataset of common Japanese gestures. 2. We develop a 3D CNN model which learns gestures end-to-end from a 3D representation of the point cloud stream and outputs the corresponding gesture class performed by the human. 3. We evaluate the 3D CNN model on the new dataset of common Japanese gestures.

II. RELATED WORK

With 3D sensors becoming ubiquitous in computer vision and robotic applications, the task of motion and gesture recognition from 3D data is one of increasing practical relevance. A general approach to gesture recognition from 3D data involves feature extraction from the data, followed by a possible dimensionality reduction, and application of a classifier on the resulting processed data. Some of the earlier approaches include: a Hidden Markov Model (HMM) based classifier [18] using the size of an ellipsoid containing an object and the length of the vector from the center of mass to extreme points; analysis of the Fourier transform and Radon transform of a self-similarity matrix of features obtained from actions [19]; and computation of local spin image descriptors [14] or Local Surface Normal (LSN) based descriptors [20]. A different approach was the use of multi-view projection of the point cloud into view images and describing hand gestures by extracting and fusing features in the view images [21], claiming that the conversion of the feature space increases intra-class similarity and reduces inter-class similarity. In contrast to these aforementioned approaches, our approach does not rely on hand-engineered features or descriptors from point cloud data. Rather, we use a 3D convolutional neural network (3D CNN) based model which is able to both automatically build relevant features and classify inputs from example data.

A. Other 3D Data Types

Apart from point clouds, 3D data are also represented as depth maps, which are images that contain information relating to the distance of the surfaces of scene objects from a viewpoint. Depth maps have proven useful in gesture recognition [4] [22] [23], mostly because the data is in 2D, which makes it easy to apply popular feature extraction approaches. However, we argue that our approach is applicable not only to gesture learning but to action learning in general. While depth data offers only a single point of view of the 3D space, using raw point cloud data allows us to define arbitrary regions of interest (ROIs) when using sensors that provide a 3D field of view. An example would be a LIDAR sensor mounted on a self-driving car: we can create multiple ROIs depending on the location of objects in the scene and individually analyze their actions.

Secondarily, researchers also use skeletal features, provided by the sensor's software development kit (SDK), as in the case of the Kinect (https://msdn.microsoft.com/en-us/library/dn799271.aspx), or extracted manually from depth data [16]. These methods [17] [24], however, involve manually modeling or engineering the features for the gestures or actions to be learned. Our approach, on the other hand, does not involve explicitly modeling the dynamics of the gestures or actions learned. Instead, the gestures are learned end-to-end, from the input directly to the gesture class using supervised learning.
B. Gesture and Action Recognition with CNN

CNNs have proven effective in pattern recognition tasks [25] and are known to outperform hand-engineered, feature-based approaches in image classification, object recognition and similar tasks [26] [4] [3]. For gesture recognition, CNNs have been used to achieve state-of-the-art results [8] [17] [9] [24], utilizing different kinds of feature representations and data types. Asadi-Aghbolaghi et al. [6] present a good survey on deep learning for action recognition in image sequences.

In this paper, our model is inspired by the early fusion model in [27], where consecutive frames in a video are fed into a 2D CNN in order to classify the actions in the video stream. Similarly, we feed consecutive frames of point cloud data, but into a 3D CNN, to learn the actions performed in the point cloud stream. To demonstrate the spatiotemporal learning capability of our approach, we collected a new dataset of common Japanese gestures (see Fig. 1). These were chosen arbitrarily by watching videos and asking randomly chosen Japanese people to tell us the common gestures they use in day-to-day life.

Fig. 1. Sample frames of the Dataset of Common Japanese Gestures. We collected one class of no gestures and 9 classes of gestures. The gesture classes are (1) No gesture, (2) Come here, (3) Me, (4) No thank you, (5) Money, (6) Peace, (7) Not allowed, (8) OK, (9) I'm sorry, (10) I got it!.

III. METHOD

In this section we describe the methods we used to learn spatiotemporal features from point cloud data. First, we present the data collected, then the training data preparation. We also describe the data augmentation approach employed to improve the efficiency of the learning process.

A. Data Collection

We collected training data as point cloud frames acquired from a Kinect sensor. The dataset consists of a class labeled 'No Gesture' and nine other gesture classes, making a total of ten classes. The classes of gestures are: 1. No Gesture, 2. Come, 3. Me, 4. No Thank You, 5. Money, 6. Peace, 7. Not Allowed, 8. OK, 9. I'm Sorry, and 10. I Got It!. The dataset was collected with a Kinect sensor facing the subjects, as shown in Fig. 2. There were 5 subjects, with each subject repeating the gestures at least 30 times per gesture. A total of 87,156 point cloud frames were collected for training and 29,758 data frames for testing. Sample frames of the dataset are shown in Fig. 1 for four consecutive frames of each gesture class. An ROI was determined to specify the volume in the scene where the subject is expected to be. Therefore, the sensor only captures the upper part of the subject. The body was not removed from the point cloud because some gestures also involve bodily movements and the use of both hands. An example is the gesture "I'm sorry", which involves slightly bending or bowing the head with both hands clapped together in front of the head.

Fig. 2. The setup for point cloud data collection. The subject sits in front of a Kinect sensor at a considerable distance. The ROI is -0.50 m to 0.50 m, -0.30 m to 0.40 m, and 0.50 m to 1.40 m in the x, y and z axes respectively.

B. Training Data Preparation

To prepare the training data, point clouds from a 3D sensor are converted into 3D occupancy grids, with each point (x, y, z) in the point cloud discretely mapped to a voxel coordinate (i, j, k) by updating the occupancy grid, similar to [28]. Each voxel has an initial value V_{ijk}^{0} = 0 and is updated by

V_{ijk}^{p} = V_{ijk}^{p-1} + z^{p}    (1)

where \{z^{p}\}_{p=1}^{P} is a sequence of range measurements that either hit (z^{p} = 1) or pass through (z^{p} = 0) a given voxel, and p indexes the individual points in the point cloud. This process is illustrated in Fig. 3, alongside the data augmentation approach which we explain in the next subsection. We also define a "look-back window" size m corresponding to the number of prior time steps to consider when recognizing the gesture at time t. Hence, for each input time step, we have a data sample tensor \mathbf{V}_t = \{V_t, V_{t-1}, V_{t-2}, \ldots, V_{t-m+1}\} paired with the label Y_t, the corresponding gesture class. Therefore, we aim to find a set of parameters of the non-linear function that maps the input \mathbf{V}_t to the corresponding label Y_t, given by (2).

Y_t = f(\mathbf{V}_t)    (2)

C. Training Data Augmentation

We doubled the amount of training data by carrying out the data augmentation approach we call '3D ROI jittering' on the original dataset. This was achieved by applying a translation vector [\alpha_x, \alpha_y, \alpha_z]^{T} to the original 3D ROI of each gesture performance in the dataset. As illustrated in Fig. 3, we are essentially shifting the ROI around in space while the position of the point cloud data is fixed. This is intended to simulate spatial variations in the performance of the gestures.

For an ROI ([x_1, x_2], [y_1, y_2], [z_1, z_2]) and jittering size \alpha, the jittered ROI can be obtained as:

\begin{bmatrix} x_{i,aug} \\ y_{i,aug} \\ z_{i,aug} \\ 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & \alpha_x \\ 0 & 1 & 0 & \alpha_y \\ 0 & 0 & 1 & \alpha_z \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_i \\ y_i \\ z_i \\ 1 \end{bmatrix}    (3)

where the vector [\alpha_x, \alpha_y, \alpha_z]^{T} is chosen randomly from the set \{0, \alpha\}^{3} \cup \{0, -\alpha\}^{3}, the union of permutations with replacement of \{0, \alpha\} and \{0, -\alpha\}. These permutations, therefore, represent the different jittering configurations we can have for a chosen jitter size.
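As a concrete illustration of the preprocessing above, the following minimal NumPy sketch crops a point cloud to the ROI, accumulates voxel hits as in (1), and stacks the last m grids into the sample tensor of (2). All names (ROI_MIN, to_occupancy_grid, and so on) are ours rather than the authors' released code; the ROI values are taken from Fig. 2 and the 50 mm voxel size from Fig. 3.

```python
import numpy as np

# ROI extents from Fig. 2 (x, y, z in metres) and a 50 mm voxel size.
# This yields a 20 x 14 x 18 grid here, close to the 21 x 14 x 18 grid
# reported in the paper (exact dimensions depend on boundary handling).
ROI_MIN = np.array([-0.50, -0.30, 0.50])
ROI_MAX = np.array([ 0.50,  0.40, 1.40])
VOXEL_SIZE = 0.05

def to_occupancy_grid(points, roi_min=ROI_MIN, roi_max=ROI_MAX, voxel=VOXEL_SIZE):
    """Map an (N, 3) point cloud to a voxel grid, in the spirit of Eq. (1).

    Each point that hits a voxel contributes z^p = 1; points outside the ROI
    are discarded. Ray traversal for pass-throughs (z^p = 0) is omitted, since
    those terms add nothing to the sum.
    """
    dims = np.ceil((roi_max - roi_min) / voxel).astype(int)
    grid = np.zeros(dims, dtype=np.float32)                  # V^0_ijk = 0
    inside = np.all((points >= roi_min) & (points < roi_max), axis=1)
    idx = ((points[inside] - roi_min) / voxel).astype(int)
    np.add.at(grid, (idx[:, 0], idx[:, 1], idx[:, 2]), 1.0)  # V^p = V^(p-1) + z^p
    return grid

def lookback_window(grids, t, m=4):
    """Stack the last m occupancy grids into the sample tensor of Eq. (2)."""
    return np.stack([grids[t - s] for s in range(m)], axis=0)

# Usage: frames is a list of (N_i, 3) arrays from the sensor stream.
# grids = [to_occupancy_grid(f) for f in frames]
# sample = lookback_window(grids, t=10, m=4)   # shape (4, 20, 14, 18)
```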

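The 3D ROI jittering of (3) then amounts to translating the crop box by a random vector before re-voxelizing, while the points themselves stay fixed. A hedged sketch, reusing the hypothetical ROI_MIN, ROI_MAX and to_occupancy_grid names from the previous snippet:

```python
import itertools
import random
import numpy as np

def sample_jitter(alpha):
    """Draw (a_x, a_y, a_z) from {0, a}^3 U {0, -a}^3, as described after Eq. (3)."""
    configs = set(itertools.product([0.0, alpha], repeat=3)) | \
              set(itertools.product([0.0, -alpha], repeat=3))
    return np.array(random.choice(sorted(configs)))

def jitter_roi(roi_min, roi_max, alpha=0.10):
    """Apply the homogeneous translation of Eq. (3) to the ROI corners."""
    shift = sample_jitter(alpha)
    T = np.eye(4)
    T[:3, 3] = shift                                  # translation part of Eq. (3)
    jmin = (T @ np.append(roi_min, 1.0))[:3]
    jmax = (T @ np.append(roi_max, 1.0))[:3]
    return jmin, jmax

# Usage: augment a recorded frame by re-voxelizing it inside a jittered ROI.
# jmin, jmax = jitter_roi(ROI_MIN, ROI_MAX, alpha=0.10)   # 10 cm jitter
# grid_aug = to_occupancy_grid(points, jmin, jmax)
```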

Fig. 3 (panels): point cloud within the ROI (1.0 m x 0.7 m x 0.9 m) mapped to an occupancy grid of 21 x 14 x 18 voxels; the original ROI is shown alongside the jittered ROIs.

Fig. 3. Input occupancy grid, mapped from point clouds. The dimension of the occupancy grid depends on the size chosen for the voxels; here we used a voxel size of 50 mm. There is also a data augmentation step which involves randomly jittering the ROI by a size α across a combination of the axes. To augment a point cloud sample, the original ROI is jittered randomly in the x-axis, the y-axis, the z-axis, or a combination of the three axes.

D. 3D CNN Model

CNN models are known to be suited to representing non-linear relationships, as in (2), involving multidimensional input spaces, such as image classification and object detection problems [25]. Here, we are interested in using a CNN model to learn gestures from 3D data, hence the use of a 3D CNN.

CNNs are characterized by convolution operations between an input tensor I and a convolution kernel K (see the illustration in Fig. 4). For 3D inputs we have an output

A(i, j, k) = (K * I)(i, j, k) = \sum_{l} \sum_{m} \sum_{n} I(i - l, j - m, k - n) K(l, m, n)    (4)

which is the activation of the node (i, j, k) of the feature map in the next layer.

Fig. 4. 3D convolution illustration. A 3D kernel, represented by the multicoloured cubes, is applied to inputs from the previous layer. Here we also have a temporal dimension, and the kernel weights are shared across this dimension. For the input layer, contiguous occupancy grid time steps represent the temporal dimension of the input.

Deep CNN models achieve automatic feature construction by stacking multiple convolutional layers, where higher layers capture more complex or discriminating features. Formally, each layer's output is a set of activations A from the layer's Rectified Linear Units (ReLUs) [29], which are functions of the kernel weights W and biases b to be optimized. Equations (5) to (7) represent the operations at the layers, where l denotes an intermediate layer and L the output layer.

A^{0} = \mathbf{V}_t    (5)
A^{l} = \mathrm{ReLU}(W^{l} A^{l-1} + b^{l})    (6)
\hat{Y} = \mathrm{Softmax}(W^{L} A^{L-1} + b^{L})    (7)

We use configurations similar to our previous work [7], with modifications relating to the size of the voxel space. Our CNN model configuration is as follows: as shown in Fig. 5, we have a total of 7 layers, that is, 4 convolutional layers, 2 fully connected layers and the output layer. The 3D convolutional layers were designed to extract spatial features in each input time step and temporal features across time steps of the input. The first and second layers use 5x5x5 convolution kernels, followed by a 3x3x3 kernel in the third layer and a 2x2x2 kernel in the fourth layer. A stride of 2x2x2 was used throughout the convolutional layers. We apply max pooling [30] on the second and fourth layers to reduce the dimensionality of the parameters connected to the next layers.

E. Training Details

We trained our final model using 5-fold cross-validation, employing early stopping with a patience of 3 in each cross-validation cycle. This prevents the model from overfitting and helps to choose a good trade-off between accuracy and training loss. For optimization, we used the Adam optimizer [31], and we used dropout [32] of 0.3 on the fully connected layers to further prevent overfitting. During training, we reduce the learning rate by a factor of 0.3 after the validation loss has not decreased for 3 epochs. All training was done on an Intel(R) Xeon(R) CPU E5-2637 v4 @ 3.50GHz x 12 with 4 NVIDIA X GPUs. The Keras library (https://keras.io/) with a TensorFlow (https://www.tensorflow.org/) backend was used for the model implementation.

IV. EVALUATION

The developed model was evaluated on the test data kept apart during data collection. We used a window size of 4 time steps for our training and evaluation. The following subsections outline details of the evaluations carried out on the model.

A. Data Augmentation

Training with augmented data not only increased the amount of training data used by 100% but also helped to compensate for the spatial variation in the gestures, since the subject could perform the gesture in different positions within the ROI. We compared the results of training with and without augmented data and found that adding augmented data to our training samples increased test accuracy by up to 10%. This is true for the different classifiers that we evaluated, and it is a significant improvement over using only the collected data. A summary of the results from data augmentation is shown in Fig. 6, which compares different jitter sizes on different classifiers.

B. Models Comparison

We compared the 3D CNN model with a random forest model, an off-the-shelf classifier. This comparison is to evaluate whether the 3D CNN model has a considerable accuracy advantage over an off-the-shelf classifier. On the other hand, LSTM models [33] are known to perform better on time series or temporal data, so we also compared the 3D CNN model with an LSTM variant.
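The LSTM variant is described in detail below (see Fig. 7). As a rough Keras sketch under our own assumptions (filter counts, LSTM width and dropout placement are not from the paper), each time step is passed through a shared "mini" 3D CNN via TimeDistributed and the per-step features are fed to an LSTM:

```python
from tensorflow.keras import layers, models

NUM_CLASSES = 10
TIME_STEPS, DEPTH, HEIGHT, WIDTH = 4, 18, 21, 14   # input dims as reported in Fig. 5

def build_3dcnn_lstm(lstm_units=128):
    """3D CNN + LSTM variant: per-time-step spatial features, temporal LSTM."""
    inputs = layers.Input(shape=(TIME_STEPS, DEPTH, HEIGHT, WIDTH, 1))
    # Shared "mini" 3D CNN applied to every time step.
    cnn = models.Sequential([
        layers.Conv3D(32, 3, strides=2, activation='relu', padding='same',
                      input_shape=(DEPTH, HEIGHT, WIDTH, 1)),
        layers.Conv3D(64, 3, strides=2, activation='relu', padding='same'),
        layers.Flatten(),
    ])
    x = layers.TimeDistributed(cnn)(inputs)          # (batch, T, features)
    x = layers.LSTM(lstm_units)(x)                   # temporal modelling
    x = layers.Dropout(0.3)(x)
    outputs = layers.Dense(NUM_CLASSES, activation='softmax')(x)
    return models.Model(inputs, outputs)

# model = build_3dcnn_lstm()
# model.compile(optimizer='adam', loss='categorical_crossentropy',
#               metrics=['accuracy'])
```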

Fig. 5 (pipeline): Model Input (3D occupancy grids, 4x18x21x14) -> 3DConv (64, 5, 2) -> 3DConv (64, 3, 2) -> MaxPool (2, 2, 2) -> 3DConv (96, 2, 2) -> 3DConv (128, 2, 2) -> MaxPool (2, 2, 2) -> Full (512) -> Full (128) -> Output (class). The convolution tuples appear to denote (filters, kernel, stride).
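Read with that (filters, kernel, stride) interpretation, the pipeline maps roughly onto the Keras sketch below. We assume early fusion of the 4 time steps along the channel axis and 'same' padding throughout; the authors' exact layer arguments, input ordering and dropout placement may differ.

```python
from tensorflow.keras import layers, models

NUM_CLASSES = 10
GRID = (18, 21, 14)     # occupancy grid (voxel) dimensions
TIME_STEPS = 4          # look-back window, stacked as input channels (early fusion)

def build_3dcnn():
    """Sketch of the 7-layer model of Fig. 5: 4 x Conv3D, 2 x Dense, softmax output."""
    return models.Sequential([
        layers.Input(shape=GRID + (TIME_STEPS,)),
        layers.Conv3D(64, 5, strides=2, activation='relu', padding='same'),
        layers.Conv3D(64, 3, strides=2, activation='relu', padding='same'),
        layers.MaxPooling3D(pool_size=(2, 2, 2), padding='same'),
        layers.Conv3D(96, 2, strides=2, activation='relu', padding='same'),
        layers.Conv3D(128, 2, strides=2, activation='relu', padding='same'),
        layers.MaxPooling3D(pool_size=(2, 2, 2), padding='same'),
        layers.Flatten(),
        layers.Dense(512, activation='relu'),
        layers.Dropout(0.3),            # dropout of 0.3 on fully connected layers
        layers.Dense(128, activation='relu'),
        layers.Dropout(0.3),
        layers.Dense(NUM_CLASSES, activation='softmax'),
    ])
```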

Fig. 5. The architecture of the developed 3D CNN model. Inputs to the model are time steps of occupancy grids converted from point cloud data. Here, we show the first four filters of the convolutional layers and the activation maps in the fully connected layers. The red cells show the activations in the filters of each layer. Filters that are black signify no activation for the given input.
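The training protocol of Section III-E (5-fold cross-validation, early stopping with patience 3, learning-rate reduction by a factor of 0.3 after 3 stagnant epochs, Adam) maps onto standard Keras callbacks and a scikit-learn fold splitter. The sketch below is our reading of that description; the epoch count and batch size are assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.optimizers import Adam

def cross_validate(build_fn, X, Y, folds=5, epochs=50, batch_size=32):
    """5-fold cross-validation with early stopping and LR reduction (Sec. III-E)."""
    scores = []
    for train_idx, val_idx in KFold(n_splits=folds, shuffle=True).split(X):
        model = build_fn()                              # e.g. build_3dcnn()
        model.compile(optimizer=Adam(),
                      loss='categorical_crossentropy',
                      metrics=['accuracy'])
        callbacks = [
            EarlyStopping(monitor='val_loss', patience=3),
            ReduceLROnPlateau(monitor='val_loss', factor=0.3, patience=3),
        ]
        model.fit(X[train_idx], Y[train_idx],
                  validation_data=(X[val_idx], Y[val_idx]),
                  epochs=epochs, batch_size=batch_size,
                  callbacks=callbacks, verbose=2)
        scores.append(model.evaluate(X[val_idx], Y[val_idx], verbose=0)[1])
    return float(np.mean(scores))
```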

TABLE I
MODELS PERFORMANCE COMPARISON

Model                              Accuracy
Random Forest                      67.64%
3D CNN                             75.80%
3D CNN + Augmentation              84.44%
3D CNN + LSTM                      74.47%
3D CNN + LSTM + Augmentation       81.82%

Fig. 6. Evaluation and comparison of augmentation jitter sizes, using jitter sizes of 5 cm, 10 cm and 15 cm. A jitter size of 0 cm signifies that no augmentation was performed. It is observed that the accuracies of both the 3D CNN and the off-the-shelf classifier increased after applying data augmentation. The accuracies increase from 57.4% to 67.64% for the random forest model, from 74.47% to 81.98% for the 3D CNN LSTM model, and from 75.8% to 82.8% for the 3D CNN model. However, a further increase in the jitter size did not yield a significant increase in accuracy; rather, accuracy becomes worse, first and most evidently in the random forest model and later for the other models.

The LSTM variant was built by replacing the first fully connected layer in the model with an LSTM layer and passing each time step of the input through an individual mini 3D CNN network to learn spatial features, and then into the LSTM layer to learn the temporal features. The LSTM variant of the 3D CNN architecture is shown in Fig. 7. Even though these two models are not directly equivalent, we used similar hyperparameters for training in order to have similar conditions. In our evaluation, the 3D CNN LSTM model did not perform better than the 3D CNN. A summary of these evaluation results is shown in Table I.

We see that the 3D CNN model is able to learn both the spatial and temporal relationships in the data presented; furthermore, it outperforms the LSTM variant of the model for this particular problem. The confusion matrix for our final model is shown in Fig. 8.

Fig. 7. The architecture of the LSTM variant of the 3D CNN of our final model. The input time steps are separated into individual mini CNN networks in order that the spatial features are learned in the CNN layers and then the temporal features in the LSTM layers. This achieved a classification accuracy of 81.82%. (Pipeline: Inputs -> 3D Conv Layers -> LSTM Layer -> Fully-Connected Layers -> Output Layer.)

Fig. 8. Confusion matrix of the predictions on the test set, showing the performance of the model in each class. The class 5 gesture, Peace, has the lowest accuracy, while class 6, Not Allowed, has the highest accuracy.

V. CONCLUSIONS

We showed an end-to-end approach for spatiotemporal gesture learning from point cloud data. Our data augmentation approach achieved an increase in model accuracy of up to 10%. One limitation of this work is the difficulty of working with higher-resolution representations of point clouds: a smaller voxel size would dramatically increase the dimension of the training data, and subsequently the training and computation time involved. At some point a smaller voxel size becomes infeasible because of limited computational resources and memory. Future work could address computation- and memory-efficient representations of point clouds, or other data augmentation schemes that compensate for the anthropometry of the training subjects, thereby covering a wider range of users.

ACKNOWLEDGMENT

This work is partially supported by JSPS Grant-in-Aid 16H06536.

REFERENCES

[1] M. Ye, Q. Zhang, L. Wang, J. Zhu, R. Yang, and J. Gall, "A survey on human motion analysis from depth data," Lecture Notes in Computer Science (Time-of-Flight and Depth Imaging. Sensors, Algorithms, and Applications), vol. 8200 LNCS, pp. 149–187, 2013.
[2] S. Nikolaidis, K. Gu, R. Ramakrishnan, and J. Shah, "Efficient model learning for human-robot collaborative tasks," arXiv, pp. 1–9, 2014. [Online]. Available: http://arxiv.org/abs/1405.6341
[3] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger, Eds. Curran Associates, Inc., 2014, pp. 568–576.
[4] P. Wang, W. Li, S. Liu, Y. Zhang, Z. Gao, and P. Ogunbona, "Large-scale continuous gesture recognition using convolutional neural networks," 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 13–18, Dec 2016.
[5] G. Marin, F. Dominio, and P. Zanuttigh, "Hand gesture recognition with Leap Motion and Kinect devices," International Conference on Image Processing (ICIP), pp. 1565–1569, 2014.
[6] M. Asadi-Aghbolaghi, A. Clapés, M. Bellantonio, H. J. Escalante, V. Ponce-López, X. Baró, I. Guyon, S. Kasaei, and S. Escalera, Deep Learning for Action and Gesture Recognition in Image Sequences: A Survey. Cham: Springer International Publishing, 2017, pp. 539–578.
[7] J. Owoyemi and K. Hashimoto, "Learning human motion intention with 3D convolutional neural network," in 2017 IEEE International Conference on Mechatronics and Automation (ICMA), 2017, pp. 1810–1815.
[8] S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221–231, 2013.
[9] C. Lea, A. Reiter, R. Vidal, and G. D. Hager, "Segmental spatiotemporal CNNs for fine-grained action segmentation," Lecture Notes in Computer Science, vol. 9907 LNCS, pp. 36–52, 2016.
[10] C. Lea, R. Vidal, and G. D. Hager, "Learning convolutional action primitives for fine-grained action recognition," Proceedings - IEEE International Conference on Robotics and Automation, vol. 2016-June, pp. 1642–1649, 2016.
[11] S. S. Rautaray and A. Agrawal, "Vision based hand gesture recognition for human computer interaction: a survey," Artificial Intelligence Review, vol. 43, no. 1, pp. 1–54, 2012.
[12] B. Feng, F. He, X. Wang, Y. Wu, H. Wang, S. Yi, and W. Liu, "Depth-projection-map-based bag of contour fragments for robust hand gesture recognition," IEEE Transactions on Human-Machine Systems, vol. 47, no. 4, pp. 511–523, Aug 2017.
[13] Y. Song, J. Tang, F. Liu, and S. Yan, "Body surface context: A new robust feature for action recognition from depth videos," IEEE Transactions on Circuits and Systems for Video Technology, vol. 24, no. 6, pp. 952–964, June 2014.
[14] B. Apostol, C. R. Mihalache, and V. Manta, "Using spin images for hand gesture recognition in 3D point clouds," System Theory, Control and Computing (ICSTCC), 2014 18th International Conference, pp. 544–549, 2014.
[15] P. Wang, W. Li, Z. Gao, J. Zhang, C. Tang, and P. Ogunbona, "Action recognition from depth maps using deep convolutional neural networks," IEEE Transactions on Human-Machine Systems, vol. 46, no. 4, pp. 1–12, 2015.
[16] S. Wu, F. Jiang, and D. Zhao, "Hand gesture recognition based on skeleton of point clouds," 2012 IEEE 5th International Conference on Advanced Computational Intelligence, ICACI 2012, pp. 566–569, 2012.
[17] Z. Ding, P. Wang, P. O. Ogunbona, and W. Li, "Investigation of different skeleton features for CNN-based 3D action recognition," arXiv preprint arXiv:1705.00835, 2017.
[18] M. Cholewa and P. Sporysz, "Classification of dynamic sequences of 3D point clouds," Lecture Notes in Computer Science, vol. 8467 LNAI, no. PART 1, pp. 672–683, 2014.
[19] M. Asadi-Aghbolaghi and S. Kasaei, "View invariant human action recognition using Fourier-based and Radon-based point cloud analysis," 2014 7th International Symposium on Telecommunications, IST 2014, pp. 66–71, 2014.
[20] A. Abdolmaleki, M. Movahedi, N. Lau, and L. P. Reis, "RoboCup 2012: Robot Soccer World Cup XVI," Lecture Notes in Computer Science, vol. 7500, pp. 237–248, 2013.
[21] C. Liang, Y. Song, and Y. Zhang, "Hand gesture recognition using view projection from point cloud," in IEEE International Conference on Image Processing (ICIP), 2016, pp. 4413–4417.
[22] B. Feng, F. He, X. Wang, Y. Wu, H. Wang, S. Yi, and W. Liu, "Depth-projection-map-based bag of contour fragments for robust hand gesture recognition," IEEE Transactions on Human-Machine Systems, vol. 47, no. 4, pp. 511–523, 2016.
[23] J. Imran and P. Kumar, "Human action recognition using RGB-D sensor and deep convolutional neural networks," 2016 International Conference on Advances in Computing, Communications and Informatics, ICACCI 2016, pp. 144–148, 2016.
[24] C. Li, Y. Hou, P. Wang, and W. Li, "Joint distance maps based action recognition with convolutional neural networks," IEEE Signal Processing Letters, vol. 24, no. 5, pp. 624–628, 2017.
[25] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[26] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, pp. 1–9, 2012.
[27] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and F. F. Li, "Large-scale video classification with convolutional neural networks," Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1725–1732, 2014.
[28] D. Maturana and S. Scherer, "VoxNet: A 3D convolutional neural network for real-time object recognition," IROS, pp. 922–928, 2015.
[29] A. L. Maas, A. Y. Hannun, and A. Y. Ng, "Rectifier nonlinearities improve neural network acoustic models," Proceedings of the 30th International Conference on Machine Learning, vol. 28, p. 6, 2013.
[30] D. Scherer, A. Müller, and S. Behnke, "Evaluation of pooling operations in convolutional architectures for object recognition," Artificial Neural Networks – ICANN 2010, pp. 92–101, 2010.
[31] D. P. Kingma and J. L. Ba, "Adam: A method for stochastic optimization," International Conference on Learning Representations 2015, pp. 1–15, 2015.
[32] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," ArXiv e-prints, pp. 1–18, 2012. [Online]. Available: http://arxiv.org/abs/1207.0580
[33] K. Greff, R. K. Srivastava, J. Koutnik, B. R. Steunebrink, and J. Schmidhuber, "LSTM: A search space odyssey," IEEE Transactions on Neural Networks and Learning Systems, 2016.