<<


Article

3D Skeletal Joints-Based Hand Gesture Spotting and Classification

Ngoc-Hoang Nguyen, Tran-Dac-Thinh Phan, Soo-Hyung Kim, Hyung-Jeong Yang and Guee-Sang Lee *

Department of Artificial Intelligence Convergence, Chonnam National University, 77 Yongbong-ro, Gwangju 500-757, Korea; [email protected] (N.-H.N.); [email protected] (T.-D.-T.P.); [email protected] (S.-H.K.); [email protected] (H.-J.Y.) * Correspondence: [email protected]

Abstract: This paper presents a novel approach to continuous dynamic hand gesture recognition. Our approach contains two main modules: gesture spotting and gesture classification. Firstly, the gesture spotting module pre-segments the video sequence with continuous gestures into isolated gestures. Secondly, the gesture classification module identifies the segmented gestures. In the gesture spotting module, the motion of the hand palm and fingers is fed into the Bidirectional Long Short-Term Memory (Bi-LSTM) network for gesture spotting. In the gesture classification module, three residual 3D Convolutional Neural Networks based on ResNet architectures (3D_ResNet) and one Long Short-Term Memory (LSTM) network are combined to efficiently utilize the multiple data channels such as RGB, Optical Flow, Depth, and the 3D positions of key joints. The promising performance of our approach is obtained through experiments conducted on three public datasets—the Chalearn LAP ConGD dataset, 20BN-Jester, and the NVIDIA Dynamic Hand Gesture Dataset. Our approach outperforms the state-of-the-art methods on the Chalearn LAP ConGD dataset.

Keywords: continuous hand gesture recognition; gesture spotting; gesture classification; multi-modal features; 3D skeletal; CNN

1. Introduction

Nowadays, the role of dynamic hand gesture recognition has become crucial in vision-based applications for human-computer interaction and telecommunications, due to its convenience and genuineness. There are many successful approaches to isolated hand gesture recognition with the recent development of neural networks, but in real-world systems, continuous dynamic hand gesture recognition remains a challenge due to the diversity and complexity of the sequence of gestures.

Initially, most continuous hand gesture recognition approaches were based on traditional methods such as Conditional Random Fields (CRF) [1], the Hidden Markov Model (HMM), Dynamic Time Warping (DTW), and Bézier curves [2]. Recently, deep learning methods based on convolutional neural networks (CNN) and recurrent neural networks (RNN) [3–7] have gained popularity.

The majority of continuous dynamic hand gesture recognition methods [3–6] include two separate procedures: gesture spotting and gesture classification. They utilized spatial and temporal features to improve the performance mainly in gesture classification. However, there are limitations in the performance of gesture spotting due to the inherent variability in the duration of a gesture. In existing methods, gestures are usually spotted by detecting transitional frames between two gestures. Recently, an approach [7] simultaneously performed the tasks of gesture spotting and gesture classification, but it turned out to be suitable only for feebly segmented videos.

Most recent research [8–11] intently focuses on improving the performance of the gesture classification phase, while the gesture spotting phase is often neglected on the assumption that isolated pre-segmented gesture sequences are available as input to the gesture classification.


However, in real-world systems, spotting of the gesture segmentation plays a crucial role in the whole process of gesture recognition; hence, it greatly affects the final recognition performance. In paper [3], they segmented the videos into sets of images and used them to predict the fusion score, which means they performed the gesture spotting and gesture classification simultaneously. The authors in [5] utilized the connectionist temporal classification to detect the nucleus of the gesture and the no-gesture class to assist the gesture classification without requiring explicit pre-segmentation. In [4,6], the continuous gestures are spotted into isolated gestures based on the assumption that the hands will always be put down at the end of each gesture, which turned out to be inconvenient. It does not work well for all situations, such as in "zoom in" and "zoom out" gestures, i.e., when only the fingers move while the hand stands still.

In this paper, we propose a spotting-classification algorithm for continuous dynamic hand gestures in which we separate the two tasks as in [4,6] but avoid the existing problems of those methods. In the spotting module, as shown in Figure 1, the continuous gestures from the unsegmented and unbounded input stream are firstly segmented into individually isolated gestures based on the 3D key joints extracted from each frame by the 3D human pose and hand pose extraction algorithms. The time series of 3D key poses are fed into the Bidirectional Long Short-Term Memory (Bi-LSTM) network with connectionist temporal classification (CTC) [12] for gesture spotting.

Figure 1. Gesture Spotting-Classification Module.

The isolated gestures segmented using the gesture spotting module are classified in the gesture classification module with a multi-modal M-3D network. As indicated in Figure 1, in the gesture classification module, the M-3D network is built by combining multi-modal data inputs which comprise the RGB, Optical Flow, Depth, and 3D pose information data channels. Three residual 3D Convolutional Neural Network stream networks based on ResNet architectures (3D_ResNet) [13] for the RGB, Optical Flow, and Depth channels, along with an LSTM network for the 3D pose channel, are effectively combined using a fusion layer for gesture classification.

The preliminary version of this paper has appeared in [14]. In this paper, depth information has been considered together with the 3D skeleton joints information, with extensive experiments resulting in upgraded performance.

The remainder of this paper is organized as follows. In Section 2, we review the related works. The proposed continuous dynamic hand gesture recognition algorithm is discussed in Section 3. In Section 4, the experiments with the proposed algorithms conducted on three published datasets—the Chalearn LAP ConGD dataset, 20BN-Jester, and the NVIDIA Dynamic Hand Gesture Dataset—are presented with discussions. Finally, we conclude the paper in Section 5.

2. Related Works

In general, the continuous dynamic gesture recognition task is more complicated than the isolated gesture recognition task: the sequence of gestures from an unsegmented and unbounded input stream must be separated into complete individual gestures, called gesture spotting or gesture segmentation, before classification. The majority of recent researchers solve the continuous dynamic gesture recognition task using two separate processes—gesture spotting and gesture recognition [1,4–6].

In the early years, the approaches for gesture spotting were commonly based on traditional machine learning techniques for time series problems such as Conditional Random Fields (CRF) [1], the Hidden Markov Model (HMM) [2], and Dynamic Time Warping (DTW) [3]. Yang et al. [1] presented a CRF threshold model that recognized gestures based on a system vocabulary for labeling sequence data. Similar to the method introduced by Yang, Lee et al. [2] proposed an HMM-based method, which recognized gestures by likelihood threshold estimation of the input pattern. Celebi et al. [3] proposed a template matching algorithm, i.e., the weighted DTW method, which used the time sequence of the weighted joint positions obtained from a Kinect sensor to compute the similarity of two sequences. Krishnan et al. [15] presented a method using the Adaptive Boosting algorithm based on a threshold model for gesture spotting from continuous accelerometer data and an HMM model for gesture classification. The limitations of these methods are that the model parameters have to be decided through experience and that the algorithms are sensitive to noise.

In the recent past, with the success of deep learning applications in computer vision, deep learning approaches have been utilized for hand gesture recognition to achieve impressive performance compared to traditional methods. The majority of the methods using recurrent neural networks (RNN) [16–18] or CNN [8,10,19–21] focus only on isolated gesture recognition, which ignores the gesture spotting phase. After the dataset for continuous gesture spotting, the Chalearn LAP ConGD dataset, was provided, a number of methods were proposed to solve both phases of gesture spotting and gesture recognition [3,4,6]. Naguri et al. [6] applied 3D motion data input from infrared sensors to an algorithm based on CNN and LSTM to distinguish gestures. In this method, they segmented gestures by detecting transition frames between two isolated gestures. Similarly, Wang et al. [3] utilized transition frame detection using a two-stream CNN to spot gestures. In another approach proposed by Chai et al. [4], continuous gestures were spotted based on the hand position detected by Faster R-CNN, and isolated gestures were classified by two parallel recurrent neural networks (SRNN) with RGB-D data input. A multi-modal network, which combines a Gaussian-Bernoulli Deep Belief Network (DBN) with skeleton data input and a 3DCNN model with RGB-D data, was effectively utilized for gesture classification by Di et al. [7]. Tran et al. [22] presented a CNN-based method using a Kinect camera for the spotting and classification of hand gestures. However, the gesture spotting was done manually from a pre-specified hand shape or finger-tip pattern, and the classification of hand gestures used only a fundamental 3DCNN network without employing an LSTM network. The system is based on the Kinect sensor, so a comparison on a commonly used public dataset is almost impossible. Recently, Molchanov et al. [5] proposed a method for joint gesture spotting and gesture recognition using a zero or negative lag procedure through a recurrent three-dimensional convolutional neural network (R3DCNN). This network is highly effective in recognizing weakly segmented gestures from multi-modal data.

In this paper, we propose an effective algorithm for both the spotting and classification tasks by utilizing extracted 3D human and hand skeletal features.

3. Proposed Algorithm

In this section, we intently focus on the proposed method using two main modules: gesture spotting and gesture classification. For all frames of a continuous gesture video, the speed of the hand and fingers, estimated from the extracted 3D human pose and 3D hand pose, is utilized to segment the continuous gestures. The isolated gesture segmented by the gesture spotting module is classified using the proposed M-3D network with RGB, Optical Flow, Depth, and 3D key joints information.

3.1. Gesture Spotting

The gesture spotting module is shown on the left of Figure 1. All frames of the continuous gesture sequence are utilized to extract the 3D human pose using the algorithm proposed in [23]. From the RGB hand ROI localized from the 3D hand palm position Jh(x,y,z) when the hand palm stands still over the spine base joint, we use a 3D hand pose estimation algorithm to effectively extract the 3D positions of the finger joints. From the extracted 3D human pose, the hand speed vhand is estimated using the movement distance of the hand joint between two consecutive frames.

• 3D human pose extraction: From each RGB frame, we obtain a 3D human pose by using one of the state-of-the-art methods for 2D/3D human pose estimation in the wild, the pose-hgreg-3d network. This network has been proposed by Zhou et al. [23], who provide a model pre-trained on the Human3.6M dataset [24], the largest dataset providing both 2D and 3D annotations of human poses in 3.6 million RGB images. This network is a fast, simple, and accurate neural network based on 3D geometric constraints for weakly-supervised learning of the 3D pose with 2D joint annotations extracted through a state-of-the-art 2D pose estimation method, i.e., the stacked hourglass network of Newell et al. [25]. In our proposed approach, we use this 3D human pose estimation network to extract the exact 3D hand joint information, which is effectively utilized for both the gesture spotting and gesture recognition tasks.

Let Jh(xhk, yhk, zhk) and Jh(xhk−1, yhk−1, zhk−1) be the 3D positions of the hand joint at the kth and (k − 1)th frames, respectively. The hand speed is estimated as

v_{hand} = \alpha \cdot \sqrt{(x_{hk} - x_{hk-1})^2 + (y_{hk} - y_{hk-1})^2 + (z_{hk} - z_{hk-1})^2}  (1)

where α is the frame rate. The finger speed is estimated from the change in distance between the 3D positions of the fingertips of the thumb and the index finger over consecutive frames. Let Jft(xftk, yftk, zftk) and Jfi(xink, yink, zink) denote the 3D positions of the fingertips of the thumb and the index finger at the kth frame, respectively. The distance between the two fingertips at the kth frame is given as

d_{fk} = \sqrt{(x_{ftk} - x_{ink})^2 + (y_{ftk} - y_{ink})^2 + (z_{ftk} - z_{ink})^2}  (2)

where dfk and dfk−1 represent the distances at the kth frame and the previous frame, respectively. The finger speed vfinger is estimated as

v_{finger} = \alpha \cdot \left( d_{fk} - d_{fk-1} \right)  (3)
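To make Equations (1)–(3) concrete, a minimal sketch of the speed computation from per-frame 3D joint positions is given below. The array layout (one row per frame) and the frame-rate default are illustrative assumptions, not details fixed by the paper.

```python
import numpy as np

def speed_features(hand, thumb_tip, index_tip, fps=30.0):
    """Per-frame hand and finger speeds following Equations (1)-(3).

    hand, thumb_tip, index_tip: arrays of shape (T, 3) holding the 3D
    positions J_h, J_ft, J_fi over T consecutive frames (assumed layout).
    fps: frame rate, i.e., the scale factor alpha.
    """
    # Equation (1): hand speed from the displacement of the hand joint.
    v_hand = fps * np.linalg.norm(np.diff(hand, axis=0), axis=1)

    # Equation (2): thumb tip to index fingertip distance at each frame.
    d_f = np.linalg.norm(thumb_tip - index_tip, axis=1)

    # Equation (3): finger speed from the change of that distance.
    v_finger = fps * np.diff(d_f)

    # The paper sums these two speeds below (Equation (4)) to form the
    # spotting feature fed to the Bi-LSTM.
    return v_hand, v_finger
```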

The combined speed is computed from vhand and vfinger extracted at each frame as

v_k = v_{hand} + v_{finger}  (4)

and is used as the input of the Bi-LSTM network to spot gestures from video streams, as shown in Figure 2. In our network, the connectionist temporal classification (CTC) loss [12] is used to identify whether the sequence frames are gesture frames or transition frames.

• 3D hand pose extraction: Using the hand palm location detected by the 3D human pose estimation network, we extract the hand ROI and use it for 3D hand pose extraction when the hand palm stands still over the spine base joint. We estimate the 3D hand pose by using the real-time 3D hand joint tracking network of OccludedHands proposed by Mueller et al. [26], additionally fine-tuned with the hand pose dataset of the Stereo Hand Pose Tracking Benchmark [27]. In this method, they

utilized both RGB and Depth information to robustly and accurately localize the hand center position and regress the 3D joints from the 2D hand position heat-map. Firstly, they used a CNN network called HALNet to estimate the heat-map of the hand center and then crop the hand region. Secondly, they applied another CNN network called JORNet to the cropped hand frame to generate a heat-map of 2D hand joints and regress the 3D hand joint positions from it. The Stereo Hand Pose Tracking Benchmark is a large dataset for 2D and 3D hand pose estimation with 21 joint points for 18,000 images. Due to the robustness and accuracy of its performance, the 3D positions of the thumb and index fingertips detected by the network are used for the finger speed calculation and other recognition features. In the case where a predicted joint becomes invisible with very low confidence, we estimate this joint position based on its last known position.

• LSTM: An LSTM network is a recurrent neural network of a special kind, in which the current network output is influenced by previously memorized inputs. The network can learn the contextual information of a temporal sequence. In an LSTM network, the gates and memory cells at time t are given as follows:

\begin{aligned}
i_t &= \sigma(W_i[x_t, h_{t-1}] + b_i),\\
f_t &= \sigma(W_f[x_t, h_{t-1}] + b_f),\\
o_t &= \sigma(W_o[x_t, h_{t-1}] + b_o),\\
\tilde{c}_t &= \tanh(W_c[x_t, h_{t-1}] + b_c),\\
c_t &= f_t * c_{t-1} + i_t * \tilde{c}_t,\\
h_t &= \tanh(c_t) * o_t
\end{aligned}  (5)

where i, f, and o are the vectors of the input, forget, and output gates, respectively. \tilde{c}_t and c_t are called the "candidate" hidden state and the internal memory of the unit. h_t represents the output hidden state. σ(.) is the sigmoid function, while W and b are the connection weight matrices and bias vectors, respectively.
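For readers who prefer code to notation, a minimal NumPy rendering of one time step of Equation (5) is sketched below; the concatenation order [x_t, h_{t−1}] and the weight shapes are assumptions made only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following Equation (5).

    W and b are dicts with keys 'i', 'f', 'o', 'c'; each W[k] has shape
    (hidden, input + hidden) and each b[k] has shape (hidden,).
    """
    z = np.concatenate([x_t, h_prev])        # [x_t, h_{t-1}]
    i_t = sigmoid(W['i'] @ z + b['i'])       # input gate
    f_t = sigmoid(W['f'] @ z + b['f'])       # forget gate
    o_t = sigmoid(W['o'] @ z + b['o'])       # output gate
    c_tilde = np.tanh(W['c'] @ z + b['c'])   # candidate hidden state
    c_t = f_t * c_prev + i_t * c_tilde       # internal memory
    h_t = np.tanh(c_t) * o_t                 # output hidden state
    return h_t, c_t
```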

Figure 2. Gesture segmentation with Bi_LSTM and CTC loss.

• Bi-LSTM network: While the output of a single forward LSTM network depends only on previous input features, the Bi-LSTM network is known as an effective method for sequence labeling tasks, which benefits from both previous and future input features. Bi-LSTM can be considered as a stack of two LSTM layers, in which a forward LSTM layer utilizes the previous input features while the backward LSTM layer captures the future input features. Because the Bi-LSTM network considers both previous and future input features, it is effective at classifying each frame in the sequence as a gesture frame or a transition frame. The prediction error can be reduced by using Bi-LSTM instead of LSTM.


• Connectionist temporal classification: Connectionist temporal classification (CTC) is a loss function which is highly effective for sequential label prediction problems. The proposed algorithm utilizes CTC to detect whether the sequence frames are gesture frames or transition frames, with input taken from the sequence of Soft-Max layer outputs.
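As a hedged illustration of the spotting network described above, the sketch below wires the per-frame speed feature v_k into a bidirectional LSTM whose per-frame scores are trained with the CTC loss. The class inventory (blank, transition, gesture), the 50-unit hidden size taken from Section 4.2, and the use of PyTorch are our assumptions for illustration, not details prescribed by the paper.

```python
import torch
import torch.nn as nn

class GestureSpotter(nn.Module):
    """Bi-LSTM over the scalar speed feature v_k producing per-frame class scores."""
    def __init__(self, hidden_size=50, num_classes=3):  # 0: CTC blank, 1: transition, 2: gesture
        super().__init__()
        self.bilstm = nn.LSTM(input_size=1, hidden_size=hidden_size,
                              batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, v):          # v: (batch, T, 1)
        h, _ = self.bilstm(v)      # (batch, T, 2 * hidden_size)
        return self.head(h)        # (batch, T, num_classes)

# Training step with the CTC loss (assumed setup for illustration).
model = GestureSpotter()
ctc = nn.CTCLoss(blank=0)

v = torch.randn(4, 20, 1)                                  # 4 windows of 20 frames
logits = model(v)                                          # (4, 20, 3)
log_probs = logits.log_softmax(dim=-1).permute(1, 0, 2)    # (T, batch, classes) for CTC

targets = torch.tensor([2, 1, 2, 1, 2, 1, 2, 1])           # concatenated label sequences
input_lengths = torch.full((4,), 20, dtype=torch.long)
target_lengths = torch.tensor([2, 2, 2, 2])
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```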

3.2. Gesture Classification

The isolated gestures segmented by the gesture spotting module are classified into individual gesture classes in the gesture recognition module. The proposed gesture recognition module is a multi-modal network called the M-3D network. This model is based on a multi-channel network with multiple data modalities, as shown on the right of Figure 1. In our approach, from each frame of a video, we extract the optical flow and the 3D pose (hand joint, thumb tip, and index fingertip joint) information as multi-channel feature inputs to the model. Optical flow is determined from two adjacent frames. There are several existing methods for optical flow extraction, such as Farneback [28], MPEG flow [29], and Brox flow [30]. The quality of the motion information in the optical flow clearly affects the performance of the gesture recognition model. Therefore, the Brox flow technique is applied in our approach, as it has better quality compared to other optical flow extraction techniques. While the key hand and finger joint positions are extracted by the 3D human pose and 3D hand pose extraction networks presented in Section 3.1, we only focus on the two most important joints, the thumb tip and the index fingertip, which can describe all gesture types. Our gesture classification algorithm is based on the combination of three 3D_ResNet stream networks for the RGB, Optical Flow, and Depth channels with an LSTM network for the 3D key joint features.

• Three-stream RGB, Optical Flow, and Depth 3D_ResNet networks: The 3D_CNN framework is regarded as one of the best frameworks for spatiotemporal feature learning. The 3D_ResNet network is an improved version of the residual 3D_CNN framework based on the ResNet [31] architecture. The effectiveness of 3D_ResNet has been proved by its remarkable performance in action video classification. The single 3D_ResNet is described in Figure 3. The 3D_ResNet consists of a 3D convolutional layer followed by a batch normalization layer and a rectified-linear unit layer. Each RGB and Optical Flow stream model is pre-trained on the largest action video classification dataset, the Sports-1M dataset [32]. Input videos are resampled into 16-frame clips before being fed into the network. Let a resampled sequence of 16 RGB frames be Vc = {xc1, xc2, ..., xc16}, the Optical Flow frames be Vof = {xof1, xof2, ..., xof16}, and the Depth frames be Vd = {xd1, xd2, ..., xd16}, and let the operation functions of the 3D_ResNet networks for the RGB, Optical Flow, and Depth modalities be Θc(.), Θof(.), and Θd(.), respectively. Hence, the prediction probabilities of the single stream networks for i classes are

P_c\{p_1, p_2, \ldots, p_{16} \mid V_c\} = \Theta_c(V_c)  (6)

P_{of}\{p_1, p_2, \ldots, p_{16} \mid V_{of}\} = \Theta_{of}(V_{of})  (7)

P_D\{p_1, p_2, \ldots, p_{16} \mid V_D\} = \Theta_D(V_D)  (8)

where pi is the prediction probability of the video belonging to the ith class.

• LSTM network with 3D pose information: In dynamic gesture recognition, temporal information learning plays a critical role in the performance of the model. In our approach, we utilize the temporal features by tracking the trajectory of the hand palm together with the specific thumb tip and index fingertip joints. The LSTM framework is suitably proposed to learn these features for the gesture classification task. The parameters of our LSTM refer to the approach in [33]. The input vector sequence of the LSTM network is defined as Vj = {vj1, vj2, ..., vj16}, where vjk = {Jh(xhk, yhk, zhk), Jft(xftk, yftk, zftk), Jfi(xink, yink, zink)} is a 9 × 1 vector containing the 3D position information of the key joints at the kth frame. The input of the LSTM network corresponding to a single frame of the sequence of 16 sampled frames in a gesture video is thus a tensor of 1 × 9 numbers. The prediction probability output using LSTM with input Vj is

P_L\{p_1, p_2, \ldots, p_{16} \mid V_j\} = \Theta_L(V_j)  (9)

where Θ_L(.) denotes the operation function of the LSTM network.

• Multi-modality fusion: The results of the multiple different channel networks are fused in the final fusion layer to predict a gesture class. It is a fully connected layer where the number of output units is equal to the number of classes of the dataset. The output probability of each class is estimated by the pre-trained last fusion layer with the Θ_fusion(.) operation function:

P\{p_1, p_2, \ldots, p_{16}\} = \Theta_{fusion}\big(P_c\{p_1, \ldots, p_{16} \mid V_c\},\ P_{of}\{p_1, \ldots, p_{16} \mid V_{of}\},\ P_D\{p_1, \ldots, p_{16} \mid V_D\},\ P_L\{p_1, \ldots, p_{16} \mid V_j\}\big)  (10)

The performance of the gesture recognition task is improved by combining the temporal information learning by the LSTM network with the spatiotemporal feature learning by 3D_ResNet, which is proved through the experimental results.

Figure 3. The overview of the 3D_ResNet architecture. The figure shows the number of feature maps and the kernel size of the 3D convolutional layer (3D Conv), the batch normalization layer (Batch-Norm), and the Rectified-Linear unit layer (ReLU).

4. Experiments and Results

In this section, we describe the experiments that evaluate the performance of the proposed approach on three public datasets: the 20BN_Jester dataset [34], the NVIDIA Dynamic Hand Gesture dataset [5], and the Chalearn LAP ConGD dataset [35].

4.1. Datasets

• 20BN_Jester dataset: a large dataset collected from 148,092 densely-labeled RGB video clips for hand gesture recognition tasks with 27 gesture classes. The dataset is divided into three subsets: the training set with 118,562 videos, 14,787 videos for the validation set, and 14,743 videos (without labels) for the test set. This dataset has only been used for the gesture classification module.
• NVIDIA Dynamic Hand Gesture dataset: a collection of 1532 feebly segmented dynamic hand gesture RGB-Depth videos captured using a SoftKinetic DS325 sensor with a frame rate of 30 fps from 20 subjects for 25 gesture classes. The continuous data streams are captured in an indoor car with both dim and bright lighting conditions. These weakly segmented gesture videos include the preparation, nucleus, and transition frames of a gesture.

• Chalearn LAP ConGD dataset: a large dataset containing 47,933 gesture instances with 22,535 RGB-Depth videos for both the continuous gesture spotting and gesture recognition tasks. The dataset includes 249 gestures performed by 21 different individuals. This dataset is further divided into three subsets: training set (14,314 videos), validation set (4179 videos), and test set (4042 videos).

The summary of the three datasets is shown in Table 1.

Table 1. Summary of the three datasets used in our experiments.

Dataset | Number of Classes | Number of Videos | Videos for Train/Validation/Test Set | Gesture Segmentation Task Provided
20BN_Jester | 27 | 148,092 | 118,562/14,787/14,743 | No
NVIDIA Hand Gesture | 25 | 1532 | 1050/−/428 | Yes
Chalearn LAP ConGD | 249 | 22,535 (47,933 instances) | 14,314/4179/4042 | Yes

4.2. Training Process

• Network training for hand gesture spotting: To train the Bi-LSTM network for the segmentation of continuous gestures, we firstly use a pre-trained 3D human pose extraction network (on the Human3.6M dataset) and a pre-trained 3D hand pose extraction network (on the Stereo Hand Pose Tracking dataset) to extract the 3D positions of the key poses. The quality of the human and hand pose extraction algorithms is demonstrated in Figure 4. Using those extracted input features for the network, we train the Bi_LSTM network with the gesture segmentation labels provided by the training set of the Chalearn LAP ConGD dataset.


Figure 4. (a) The 2D and 3D human pose estimation examples and (b) the 2D and 3D hand pose estimation examples.

The Bi-LSTM network is trained with the CTC loss for predicting the sequence of binary output values that classifies whether a frame belongs to a gesture frame or a transition frame. In the Bi_LSTM, the input layer has 20 time-steps, the hidden layer has 50 memory units, and the last fully connected layer outputs one binary value per time-step with a sigmoid activation function. The efficient ADAM optimization algorithm [36] is applied to find the optimal weights of the network. The spotting output of the Bi-LSTM network for a given speed input is displayed in Figure 5.

• Network training for hand gesture classification: The single-stream networks (pre-trained on the Sports-1M dataset) are separately fine-tuned on the huge Chalearn LAP ConGD dataset. The weights of each fine-tuned 3D_CNN stream network are learned using ADAM optimization with a learning rate starting at 0.0001 and reduced by half every 10 epochs over 200 epochs. Ensemble modeling with 5 3D_ResNet models is applied to increase the classification accuracy. The LSTM network parameters are selected through the observation of experimental results. The optimal LSTM model parameters are 3 memory blocks and 256 LSTM cells per memory block. The pre-trained LSTM network is trained with a learning rate of 0.0001 for 1000 epochs. After pre-training each streaming network, we retrain these networks on the specific dataset. Finally, we concatenate the prediction probability outputs of these trained models to train the weights of the last fusion fully connected layer for gesture classification. Besides training with the 3D_ResNet framework, we also train with the 3D_CNN framework to prove the effectiveness of the proposed algorithm.
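The learning-rate schedule described above (ADAM, initial rate 0.0001, halved every 10 epochs over 200 epochs) can be expressed, for example, with a PyTorch step scheduler; the placeholder model and training-loop skeleton below are assumptions made only so the sketch runs end to end.

```python
import torch

# `model` stands in for one fine-tuned 3D_ResNet stream (placeholder here).
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(200):
    # ... one training epoch over the gesture clips would go here ...
    optimizer.zero_grad()
    loss = model(torch.randn(4, 10)).sum()   # placeholder loss
    loss.backward()
    optimizer.step()
    scheduler.step()   # halve the learning rate every 10 epochs
```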

Figure 5. Example of sequence frame segmentation by the Bi_LSTM network. The blue line is the given speed input, and the red line is the gesture spotting output (a value of 1.0 indicates the gesture frames).

4.3. Results and Analysis

• Hand gesture spotting: To prove the performance of the proposed hand gesture spotting module, we evaluated the model on two datasets—the NVIDIA Dynamic Hand Gesture dataset and the Chalearn LAP ConGD dataset. The frame-wise accuracy metric and the edit distance score [37] are used to measure the gesture segmentation performance. The results and the comparison with other methods are shown in Tables 2 and 3. From the results shown in these tables, our proposed approach achieved the best performance compared to other methods on both datasets. Our approach gets higher frame-wise accuracy and edit distance scores on the NVIDIA Dynamic Hand Gesture dataset and the Chalearn LAP ConGD dataset than existing works. The significantly improved experimental results proved the effectiveness of the proposed approach.
• Hand gesture classification: The performance of our gesture classification module is evaluated by experiments conducted on the 20BN_Jester dataset (without the Depth modality) and the NVIDIA Dynamic Hand Gesture dataset. The accuracy comparison with other approaches for isolated dynamic hand gesture classification is shown in Table 3. Table 3 shows that our gesture recognition module obtained a positive result. The recognition performance is improved by using the 3D data information of the key joints. Moreover, the recognition performance of our method is among the top performers of existing approaches, with an accuracy of 95.6% on the 20BN-Jester dataset and an accuracy of 82.4% on the NVIDIA Hand Gesture dataset.
• Continuous hand gesture spotting-classification: To entirely evaluate our approach on continuous dynamic hand gesture spotting and recognition, we apply the Jaccard index [3] for measuring the performance. For a given gesture video, the Jaccard index estimates the average relative overlap between the ground truth and the predicted sequences of frames.
A sequence S is given by the ith class gesture label and the binary ground truth vector Gs,i, while the binary prediction vector for the ith class is denoted as Ps,i. The binary vectors Gs,i and Ps,i have 1-values indicating the corresponding frames in which the ith gesture class is being performed. The Jaccard index for the given sequence S is then computed by the following formula:

J_{s,i} = \frac{G_{s,i} \cap P_{s,i}}{G_{s,i} \cup P_{s,i}}  (11)

When Gs,i and Ps,i are empty vectors, the Jaccard index Js,i is set to 0. For a given sequence S containing l_s true class labels, the Jaccard index is estimated as

J_s = \frac{1}{l_s} \sum_{i=1}^{l_s} J_{s,i}  (12)

For all testing sequences of n gestures, s = {s1, s2, ..., sn}, the mean Jaccard index \bar{J}_s is used for evaluation as follows:

\bar{J}_s = \frac{1}{n} \sum_{j=1}^{n} J_{s,j}  (13)

The spotting-recognition performance comparison of our proposed approach to the existing methods, evaluated on the test set of the Chalearn LAP ConGD dataset, is shown in Table 4.
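The evaluation metric in Equations (11)–(13) can be computed directly from the binary frame masks; the sketch below is a straightforward reading of those formulas, with the dictionary-based input layout being our own assumption.

```python
import numpy as np

def jaccard_sequence(gt_masks, pred_masks):
    """Jaccard index J_s for one video (Equations (11)-(12)).

    gt_masks, pred_masks: dicts mapping a gesture class i to binary NumPy
    frame vectors G_{s,i} and P_{s,i} of equal length (assumed layout).
    Only the classes present in the ground truth are averaged, as in Eq. (12).
    """
    scores = []
    for i, g in gt_masks.items():
        p = pred_masks.get(i, np.zeros_like(g))
        union = np.logical_or(g, p).sum()
        inter = np.logical_and(g, p).sum()
        scores.append(0.0 if union == 0 else inter / union)   # Eq. (11)
    return float(np.mean(scores)) if scores else 0.0          # Eq. (12)

def mean_jaccard(videos):
    """Equation (13): average J_s over all test sequences (gt, pred) pairs."""
    return float(np.mean([jaccard_sequence(g, p) for g, p in videos]))
```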

Table 2. Gesture spotting performance comparison with different methods on the NVIDIA Hand Gesture dataset and the Chalearn LAP ConGD dataset. Bold values are the highest indices.

Method | NVIDIA Hand Gesture: Frame-Wise Accuracy | NVIDIA Hand Gesture: Edit Distance Score | Chalearn LAP ConGD: Frame-Wise Accuracy | Chalearn LAP ConGD: Edit Distance Score
2S-RNN [4] | 80.3 | 74.8 | 87.5 | 86.3
Proposed in [3] | 84.6 | 79.6 | 90.8 | 88.4
R-3DCNN [5] | 90.1 | 88.4 | 90.4 | 90.1
Our proposed | 91.2 | 89.6 | 93.1 | 93.8

Table 3. Gesture classification performance comparison of different methods on the 20BN_Jester dataset and the NVIDIA Hand Gesture dataset. Bold values are the highest indices.

Method | Accuracy on 20BN-Jester Dataset | Accuracy on NVIDIA Hand Dataset
iDT-HOG [2] | - | 59.1
C3D [8] | 91.6 | 69.3
R-3DCNN [5] | 95.0 | 79.3
MFFs [9] | 96.2 | 84.7
3D_ResNet (RGB) | 92.8 | 75.5
3D_ResNet (RGB + Optical flow) | 93.3 | 77.8
Ours M3D | 95.6 | 82.4

Table 4. The spotting-recognition performance comparison of our proposed approach to existing methods on the test set of the Chalearn LAP ConGD dataset. Bold values are the highest indices.

Method | Mean Jaccard Index
2S-RNN [4] | 0.5162
Proposed in [3] (RGB + Depth) | 0.5950
R-3DCNN [5] | 0.5154
Our proposed (3D_CNN) | 0.5982
Our proposed | 0.6159

From the results illustrated in Table 4, the mean Jaccard index on the test set of the Chalearn LAP ConGD dataset shows that the proposed method achieves satisfactory performance on this dataset. By using the 3D key joint features and multiple modalities, the recognition performance is significantly enhanced.

5. Discussion

In Section 4.3, we have shown the effectiveness of our method on the three datasets. In terms of hand gesture spotting, we obtain the best results for both metrics on the NVIDIA Dynamic Hand Gesture dataset and the Chalearn LAP ConGD dataset. The extraction of the human pose and hand pose helps us track the hand movement more accurately and detect the beginning and the end of each sequence, avoiding the minor motions that could contaminate the following classification task. For the task of hand gesture classification, Table 3 presents the efficiency of adding modalities to our model on both the 20BN_Jester dataset and the NVIDIA Dynamic Hand Gesture dataset. Different views of the data are crucial to the performance of hand gesture classification. Continuous gesture classification is more difficult when there are several kinds of gestures in one video, which means that the capability of gesture spotting greatly influences the performance of gesture classification. In Table 4, we obtain the best results when performing both tasks on the Chalearn LAP ConGD dataset.

6. Conclusions

In this paper, we presented an effective approach for continuous dynamic hand gesture spotting and recognition from RGB input data. The continuous gesture sequences are firstly segmented into separate gestures by utilizing the motion speed of the key 3D poses as the input of the Bi-LSTM network. After that, each segmented gesture is classified in the gesture classification module using a multi-modal M-3D network. In this network, three 3D_ResNet stream networks for the RGB, Optical Flow, and Depth data channels and an LSTM network for the 3D key pose feature channel are effectively combined for gesture classification purposes. The results of the experiments conducted on the ChaLearn LAP ConGD dataset, the NVIDIA Hand Gesture dataset, and the 20_BN Jester dataset proved the effectiveness of our proposed method. In the future, we will try to include other modalities to improve the performance. The tasks of gesture spotting and classification in this paper are performed separately in two steps. The upcoming plan is to perform both tasks with one end-to-end model so that the approach is more practical for real-world problems.

Author Contributions: Conceptualization, G.-S.L. and N.-H.N.; methodology, N.-H.N.; writing—review and editing, N.-H.N., T.-D.-T.P., and G.-S.L.; supervision, G.-S.L., S.-H.K., and H.-J.Y.; project administration, G.-S.L., S.-H.K., and H.-J.Y.; funding acquisition, G.-S.L., S.-H.K., and H.-J.Y. All authors have read and agreed to the published version of the manuscript.

Funding: This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (NRF-2020R1A4A1019191) and also by the Bio & Medical Technology Development Program of the National Research Foundation (NRF) funded by the Korean government (MSIT) (NRF-2019M3E5D1A02067961).

Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.

Data Availability Statement: Not applicable.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Yang, H.-D.; Sclaroff, S.; Lee, S.-W. Language Spotting with a Threshold Model Based on Conditional Random Fields. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 31, 1264–1277. [CrossRef] [PubMed]
2. Słapek, M.; Paszkiel, S. Detection of gestures without begin and end markers by fitting into Bézier curves with least squares method. Pattern Recognit. Lett. 2017, 100, 83–88. [CrossRef]

3. Wang, H.; Wang, P.; Song, Z.; Li, W. Large-scale multimodal gesture recognition using heterogeneous networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), Venice, Italy, 22–29 October 2017; pp. 3129–3137.
4. Chai, X.; Liu, Z.; Yin, F.; Liu, Z.; Chen, X. Two Streams Recurrent Neural Networks for Large-Scale Continuous Gesture Recognition. In Proceedings of the 2016 23rd International Conference on Pattern Recognition (ICPR), Cancun, Mexico, 4–8 December 2016; pp. 31–36.
5. Molchanov, P.; Yang, X.; Gupta, S.; Kim, K.; Tyree, S.; Kautz, J. Online detection and classification of dynamic hand gestures with recurrent 3D convolutional neural networks. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 4207–4215.
6. Naguri, C.R.; Bunescu, R.C. Recognition of Dynamic Hand Gestures from 3D Motion Data Using LSTM and CNN Architectures. In Proceedings of the 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), Cancun, Mexico, 18–21 December 2017; pp. 1130–1133.
7. Wu, D.; Pigou, L.; Kindermans, P.-J.; Le, N.D.-H.; Shao, L.; Dambre, J.; Odobez, J.-M. Deep Dynamic Neural Networks for Multimodal Gesture Segmentation and Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 1583–1597. [CrossRef] [PubMed]
8. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning Spatiotemporal Features with 3D Convolutional Networks. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 4489–4497.
9. Kopuklu, O.; Kose, N.; Rigoll, G. Motion Fused Frames: Data Level Fusion Strategy for Hand Gesture Recognition. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, 18–22 June 2018; pp. 2184–21848.
10. Narayana, P.; Beveridge, J.R.; Draper, B.A. Gesture Recognition: Focus on the Hands. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 5235–5244.
11. Zhu, G.; Zhang, L.; Mei, L.; Shao, J.; Song, J.; Shen, P. Large-scale Isolated Gesture Recognition Using Pyramidal 3D Convolutional Networks. In Proceedings of the 2016 23rd International Conference on Pattern Recognition (ICPR), Cancun, Mexico, 4–8 December 2016; pp. 19–24.
12. Graves, A.; Fernández, S.; Gomez, F.; Schmidhuber, J. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; pp. 369–376.
13. Hara, K.; Kataoka, H.; Satoh, Y. Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet? In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6546–6555.
14. Hoang, N.N.; Lee, G.-S.; Kim, S.-H.; Yang, H.-J. Continuous Hand Gesture Spotting and Classification Using 3D Finger Joints Information. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 539–543.
15. Krishnan, N.C.; Lade, P.; Panchanathan, S. Activity gesture spotting using a threshold model based on Adaptive Boosting. In Proceedings of the 2010 IEEE International Conference on Multimedia and Expo, Singapore, 19–23 July 2010; pp. 155–160.
16. Ullah, A.; Ahmad, J.; Muhammad, K.; Sajjad, M.; Baik, S.W. Action Recognition in Video Sequences using Deep Bi-Directional LSTM with CNN Features. IEEE Access 2018, 6, 1155–1166. [CrossRef]
17. Donahue, J.; Hendricks, L.A.; Guadarrama, S.; Rohrbach, M.; Venugopalan, S.; Saenko, K.; Darrell, T. Long-term Recurrent Convolutional Networks for Visual Recognition and Description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 2625–2634.
18. Du, Y.; Wang, W.; Wang, L. Hierarchical recurrent neural network for skeleton based action recognition. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1110–1118.
19. Zhu, T.; Zhou, Y.; Xia, Z.; Dong, J.; Zhao, Q. Progressive Filtering Approach for Early Human Action Recognition. Int. J. Control Autom. Syst. 2018, 16, 2393–2404. [CrossRef]
20. Ding, Z.; Chen, Y.; Chen, Y.L.; Wu, X. Similar Hand Gesture Recognition by Automatically Extracting Distinctive Features. Int. J. Control Autom. Syst. 2017, 15, 1770–1778. [CrossRef]
21. Zhu, T.; Xia, Z.; Dong, J.; Zhao, Q. A Sociable Human-robot Interaction Scheme Based on Body Emotion Analysis. Int. J. Control Autom. Syst. 2019, 17, 474–485. [CrossRef]
22. Tran, D.-S.; Ho, N.-H.; Yang, H.-J.; Baek, E.-T.; Kim, S.-H.; Lee, G. Real-Time Hand Gesture Spotting and Recognition Using RGB-D Camera and 3D Convolutional Neural Network. Appl. Sci. 2020, 10, 722. [CrossRef]
23. Zhou, X.; Huang, Q.; Sun, X.; Xue, X.; Wei, Y. Towards 3D Human Pose Estimation in the Wild: A Weakly-Supervised Approach. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 398–407.
24. Ionescu, C.; Papava, D.; Olaru, V.; Sminchisescu, C. Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 36, 1325–1339. [CrossRef] [PubMed]
25. Newell, A.; Yang, K.; Deng, J. Stacked Hourglass Networks for Human Pose Estimation. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 483–499.

26. Müeller, F.; Mehta, D.; Sotnychenko, O.; Sridhar, S.; Casas, D.; Theobalt, C. Real-Time Hand Tracking Under Occlusion from an Egocentric RGB-D Sensor. In Proceedings of the 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), Venice, Italy, 22–29 October 2017; pp. 1284–1293.
27. Zhang, J.; Jiao, J.; Chen, M.; Qu, L.; Xu, X.; Yang, Q. 3D Hand Pose Tracking and Estimation Using Stereo Matching. arXiv 2016, arXiv:1610.07214.
28. Farnebäck, G. Two-Frame Motion Estimation Based on Polynomial Expansion. In Proceedings of the Scandinavian Conference on Image Analysis, Halmstad, Sweden, 29 June–2 July 2003; Springer: Berlin/Heidelberg, Germany, 2003; pp. 363–370.
29. Kantorov, V.; Laptev, I. Efficient Feature Extraction, Encoding and Classification for Action Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 2593–2600.
30. Brox, T.; Bruhn, A.; Papenberg, N.; Weickert, J. High Accuracy Optical Flow Estimation Based on a Theory for Warping. In Proceedings of the European Conference on Computer Vision, Prague, Czech Republic, 11–14 May 2004; Springer: Berlin/Heidelberg, Germany, 2004; pp. 25–36.
31. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
32. Karpathy, A.; Toderici, G.; Shetty, S.; Leung, T.; Sukthankar, R.; Fei-Fei, L. Large-Scale Video Classification with Convolutional Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1725–1732.
33. Sarkar, A.; Gepperth, A.; Handmann, U.; Kopinski, T. Dynamic Hand Gesture Recognition for Mobile Systems Using Deep LSTM. In Proceedings of the 9th International Conference on Intelligent Human Computer Interaction, Evry, France, 11–13 December 2017; pp. 19–31.
34. Materzynska, J.; Berger, G.; Bax, I.; Memisevic, R. The Jester Dataset: A Large-Scale Video Dataset of Human Gestures. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Korea, 28–29 October 2019.
35. Wan, J.; Zhao, Y.; Zhou, S.; Guyon, I.; Escalera, S.; Li, S.Z. ChaLearn Looking at People RGB-D Isolated and Continuous Datasets for Gesture Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 56–64.
36. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, 5–8 May 2015.
37. Lea, C.; Vidal, R.; Hager, G.D. Learning convolutional action primitives for fine-grained action recognition. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation (ICRA), Stockholm, Sweden, 16–21 May 2016; pp. 1642–1649.