Recognizing American Sign Language Manual Signs from RGB-D Videos

Longlong Jing∗ · Elahe Vahdani∗ · Matt Huenerfauth · Yingli Tian†

Abstract In this paper, we propose a 3D Convolutional Neural Network (3DCNN) based multi-stream framework to recognize American Sign Language (ASL) manual signs (consisting of movements of the hands, as well as non-manual face movements in some cases) in real-time from RGB-D videos, by fusing multimodality features including hand gestures, facial expressions, and body poses from multi-channels (RGB, depth, motion, and skeleton joints). To learn the overall temporal dynamics in a video, a proxy video is generated by selecting a subset of frames for each video, which are then used to train the proposed 3DCNN model. We collect a new ASL dataset, ASL-100-RGBD, which contains 42 RGB-D videos captured by a Microsoft Kinect V2 camera, each of 100 ASL manual signs, including RGB channel, depth maps, skeleton joints, face features, and HDface. The dataset is fully annotated for each semantic region (i.e. the time duration of each word that the human signer performs). Our proposed method achieves 92.88% accuracy for recognizing 100 ASL words in our newly collected ASL-100-RGBD dataset. The effectiveness of our framework for recognizing hand gestures from RGB-D videos is further demonstrated on the Chalearn IsoGD dataset and achieves 76% accuracy, which is 5.51% higher than the state-of-the-art work in terms of average fusion by using only 5 channels instead of 12 channels in the previous work.

Keywords American Sign Language Recognition · Hand Gesture Recognition · RGB-D Video Analysis · Multimodality · 3D Convolutional Neural Networks · Proxy Video

L. Jing and E. Vahdani
Department of Computer Science, The Graduate Center, The City University of New York, NY, 10016.
E-mail: {ljing, evahdani}@gradcenter.cuny.edu
∗Equal Contribution

M. Huenerfauth
Golisano College of Computing and Information Sciences, the Rochester Institute of Technology (RIT), Rochester, NY, USA.
E-mail: [email protected]

Y. Tian
Department of Electrical Engineering, The City College, and the Department of Computer Science, the Graduate Center, the City University of New York, NY, 10031.
E-mail: [email protected]
† Corresponding Author

1 Introduction

The focus of our research is to develop a real-time system that can automatically identify ASL manual signs (individual words, which consist of movements of the hands, as well as facial expression changes) from RGB-D videos. However, our broader goal is to create useful technologies that would support ASL education, which would utilize this technology for identifying ASL signs and provide ASL students immediate feedback about whether their signing is fluent or not.

There are more than one hundred sign languages used around the world, and ASL is used throughout the U.S. and Canada, as well as other regions of the world, including West Africa and Southeast Asia. Within the U.S.A., about 28 million people today are Deaf or Hard-of-Hearing (DHH) [1]. There are approximately 500,000 people who use ASL as a primary language [59], and since there are significant linguistic differences between English and ASL, it is possible to be fluent in one language but not in the other.

In addition to the many members of the Deaf community who may prefer to communicate in ASL, there are many individuals who seek to learn the language. Due to a variety of educational factors and childhood language exposure, researchers have measured lower levels of English literacy among many deaf adults in the U.S. [79]. Studies have shown that deaf children raised in homes with exposure to ASL have better literacy as adults, but it can be challenging for parents, teachers, and other adults in the life of a deaf child to rapidly gain fluency in ASL. The study of ASL as a foreign language in universities significantly increased, by 16.4% from 2006 to 2009, which ranked ASL as the 4th most studied language at colleges [23]. Thus, there are many individuals who would benefit from a flexible way to practice their ASL signing skills, and our research investigates technologies for recognizing signs performed in color and depth videos, as discussed in [30].

While the development of user-interfaces for educational software was described in our prior work [30], this article instead focuses on the development and evaluation of our ASL recognition technologies, which underlie our educational tool. Beyond this specific application, technology to automatically recognize ASL signs from videos could enable new communication and accessibility technologies for people who are DHH, which may allow these users to input information into computing systems by performing sign language or may serve as a foundation for future research on machine translation technologies for sign languages.

The rest of this article is structured as follows: Section 1.1 provides a summary of relevant ASL linguistic details, and Section 1.2 motivates and defines the scope of our contributions. Section 2 surveys related work in ASL recognition, gesture recognition in videos, and some video-based ASL corpora (collections of linguistically labeled video recordings). Section 3 describes our framework for ASL recognition, Section 4 describes the new dataset of 100 ASL words captured by an RGB-D camera which is used in this work, and Section 5 presents the experiments to evaluate our ASL recognition model and the extension of our framework to the Chalearn IsoGD dataset. Finally, Section 6 summarizes the proposed work.

1.1 ASL Linguistics Background

ASL is a natural language conveyed through movements and poses of the hands, body, head, eyes, and face [80]. Most ASL signs consist of the hands moving, pausing, and changing orientation in space. Individual ASL signs (words) consist of a sequence of several phonological segments, which include:

– An essential parameter of a sign is the configuration of the hand, i.e. the degree to which each of the finger joints are bent, which is commonly referred to as the "handshape." In ASL, there are approximately 86 handshapes which are commonly used [62], and the hand may transit between handshapes during the production of a single sign.
– During an ASL sign, the signer's hands will occupy specific locations and will perform movement through space. Some signs are performed by a single hand, but most are performed using both of the signer's hands, which move through the space in front of their head and torso. During two-handed signs, the two hands may have symmetrical movements, or the signer's dominant hand (e.g. the right hand of a right-handed person) will have greater movements than the non-dominant hand.
– The orientation of the palm of the hand in 3D space is also a meaningful aspect of an ASL sign, and this parameter may differentiate pairs of otherwise identical signs.
– Some signs co-occur with specific "non-manual signals," which are generally facial expressions that are characterized by specific eyebrow movement, head tilt/turn, or head movement (e.g., forward-backward relative to the torso).

As discussed in [60], facial expressions in ASL are most commonly utilized to convey information about entire sentences or phrases, and these classes of facial expressions are commonly referred to as "syntactic facial expressions." While some researchers, e.g. [57], have investigated the identification of facial expressions that extend across multiple words to indicate grammatical information, in this paper, we describe our work on recognizing manual signs, which consist of movements of the hands and facial expression changes.

In addition to "syntactic" facial expressions that extend across multiple words in an ASL sentence, there exists another category of facial expressions which is specifically relevant to the task of recognizing individual signs: "lexical facial expressions," which are considered as a part of the production of an individual ASL word (see examples in Fig. 1). Such facial expressions are therefore essential for the task of sign recognition. For instance, words with negative semantic polarity, e.g. NONE or NEVER, tend to occur with a negative facial expression consisting of a slight head shake and nose wrinkle. In addition, there are specific ASL signs that almost always occur in a context in which a specific ASL syntactic facial expression occurs. For instance, some question words, e.g. WHO or WHAT, tend to co-occur with a syntactic facial expression (brows furrowed, head tilted forward), which indicates that an entire sentence is a WH-word Question. Thus, the occurrence of such a facial expression may be useful evidence to consider when building a sign-recognition system for such words.

Fig. 1 Example images of lexical facial expressions along with hand gestures for signs: NEVER, WHO, and WHAT. For NEVER, the signer shakes her head side-to-side slightly, which is a Negative facial expression in ASL. For WHO and WHAT, the signer is furrowing the brows and slightly tilting the head forward, which is a WH-word Question facial expression in ASL.

1.2 Motivations and Scope of Contributions

As discussed in Section 2.1, most prior ASL recognition research typically focuses on isolated hand gestures of a restricted vocabulary. In this paper, we propose a 3D multi-stream framework to recognize a set of grammatical ASL words in real-time from RGB-D videos, by fusing multi-modality features including hand gestures, facial expressions, and body poses from multi-channels (RGB, depth, motion, and skeleton joints). In an extension to our previous work [89] and [86], the main contributions of the proposed framework can be summarized as follows:

– We propose a 3D multi-stream framework by using 3D convolutional neural networks for ASL recognition in RGB-D videos by fusing multi-channels including RGB, depth, motion, and skeleton joints.
– We propose a random temporal augmentation strategy to augment the training data to handle the widely diverse videos of relatively small datasets.
– We create a new ASL dataset, ASL-100-RGBD, including multiple modalities (facial movements, hand gestures, and body pose) and multiple channels (RGB, depth, skeleton joints, and HDface) by collaborating with ASL linguistic researchers; this dataset contains annotation of the time duration when each ASL word is performed by the human in the video. The dataset will be released to the public with the publication of this article.
– We further evaluate the proposed framework to recognize hand gestures on the Chalearn LAP IsoGD dataset [81], which consists of 249 gesture classes in RGB-D videos. The accuracy of our framework is 5.51% higher than the state-of-the-art work in terms of average fusion using fewer channels (5 channels instead of 12).

2 Related Work

2.1 RGB-D Based ASL Recognition

Sign language (SL) recognition has been studied for three decades since the first attempt to recognize Japanese SL by Tamura and Kawasaki in 1988 [77]. The existing SL recognition research can be classified as sensor-based methods, including data gloves and body trackers to capture and track hand and body motions [20,35,45,50], and non-intrusive camera-based methods by applying computer vision technologies [9,10,13,14,24,29,39,42–44,53,66–69,75,83,85]. While much research in this area focuses on the hands, there is also some research focusing on linguistic information conveyed by the face and head of a human performing sign language, such as [5,48,51,57]. More details about SL recognition can be found in these survey papers [19,64].

As cost-effective consumer depth cameras have become available in recent years, such as the RGB-D cameras of Microsoft Kinect V2 [4], Intel RealSense [2], and Orbbec Astra [3], it has become practical to capture high resolution RGB videos and depth maps as well as to track a set of skeleton joints in real time. Compared to traditional 2D RGB images, RGB-D images provide both photometric and geometric information. Therefore, recent research work has been motivated to investigate ASL recognition using both RGB and depth information [6,8,12,32,70,72,84,86,88,89]. In this article, we briefly summarize ASL recognition methods based on RGB-D images or videos.

Some early work of SL recognition based on RGB-D cameras only focused on a very small number of signs from static images [40,70,72]. Pugeault and Bowden proposed a multi-class random forest classification method to recognize 24 static ASL fingerspelling alphabet letters by ignoring the letters j and z (as they involve motion) and by combining both appearance and depth information of handshapes captured by a Kinect camera [70]. Keskin et al. [40] recognized 24 static handshapes of the ASL alphabet, based on scale invariant features extracted from depth images and then fed to a Randomized Decision Forest for classification at the pixel level, where the final recognition label was voted based on a majority. Ren et al. proposed a modified Finger-Earth Mover's Distance metric to recognize static handshapes for 10 digits captured using a Kinect camera [72].

While these systems only used static RGB and depth images, some studies employed RGB-D videos for ASL recognition. Zafrulla et al. developed a hidden Markov model (HMM) to recognize 19 ASL signs collected by a Kinect and compared the performance with that from colored-glove and accelerometer sensors [88]. For the Kinect data, they also compared the system performance between the signer seated and standing and found that higher accuracy resulted when the users were standing. Yang developed a hierarchical conditional random field method to recognize 24 manual ASL signs (seven one-handed and 17 two-handed) from the handshape and motion in RGB-D videos [84]. Lang et al. [49] presented an HMM framework to recognize 25 signs of German Sign Language using depth-camera specific features. Mehrotra et al. [56] employed a support vector machine (SVM) classifier to recognize 37 signs of Indian Sign Language based on 3D skeleton points captured using a Kinect camera. Almeida et al. [6] also employed an SVM classifier to recognize 34 signs of Brazilian Sign Language using handshape, movement, and position captured by a Kinect. Jiang et al. proposed to recognize 34 signs of Chinese Sign Language based on the color images and the skeleton joints captured by a Kinect camera [32]. Recently, Kumar et al. [47] combined a Kinect camera with a Leap Motion sensor to recognize 50 signs of Indian Sign Language.

As discussed above, SL consists of hand gestures, facial expressions, and body poses. However, most existing work has focused only on hand gestures without combining them with facial expressions and body poses. While a few attempted to combine hand and face [5,41,48,57,68,83], they only use RGB videos. To the best of our knowledge, we believe that this is the first work that combines multi-channel RGB-D videos (RGB and depth) with fusion of multi-modality features (hand, face, and body) for ASL recognition.

2.2 CNN for Action and Hand Gesture Recognition

Since the work of AlexNet [46], which makes use of the powerful computation ability of GPUs, deep neural networks (DNNs) have enjoyed a renaissance in various areas of computer vision, such as image classification [17,76], object detection [25,28], image description [16,36], and others. Many efforts have been made to extend CNNs from the image to the video domain [21], which is more challenging since video data are much larger than images; therefore, handling video data in the limited GPU memory is not tractable. An intuitive way to extend image-based CNN structures to the video domain is to perform the fine-tuning and classification process on each frame independently, and then conduct a late fusion, such as average scoring, to predict the action class of the video [37]. To incorporate temporal information in the video, [73] introduced a two-stream framework. One stream was based on RGB images, and the other, on stacked optical flows. Although that work proposed an innovative way to learn temporal information using a CNN structure, in essence, it was still image-based, since the third dimension of stacked optical flows collapsed immediately after the first convolutional layer.

To model the sequential information of extracted features from different segments of a video, [16] and [87] proposed to input features into Recurrent Neural Network (RNN) structures, and they achieved good results for action recognition. The former emphasized pooling strategies and how to fuse different features, while the latter focused on how to train an end-to-end DNN structure that integrates CNNs with RNNs. These networks mainly use a CNN to extract spatial features; then an RNN is applied to extract the temporal information of the spatial features. 3DCNN was recently proposed to learn spatio-temporal features with 3D convolution operations [15,27,31,33,34,71,78], and has been widely used in video analysis tasks such as video captioning and action detection. 3DCNN is usually trained with fixed-length clips (usually 16 frames [27,78]) and later fusion is performed to obtain the final category of the entire video. Hara et al. [27] proposed the 3D-ResNet by replacing all the 2D kernels in 2D-ResNet with 3D convolution operations. With its advantage of avoiding gradient vanishing and explosion, the 3D-ResNet outperforms many complex networks.

ASL recognition shares properties with video action recognition; therefore, many networks for video action recognition have been applied to this task. Pigou et al. proposed temporal residual networks for gesture and sign language recognition [68] and temporal convolutions on top of the features extracted by a 2DCNN for gesture recognition [67]. Huang et al. proposed a Hierarchical Attention Network with Latent Space (LS-HAN), which eliminates the pre-processing of the temporal segmentation [29]. Pu et al. proposed to employ a 3D residual convolutional network (3D-ResNet) to extract visual features, which are then fed to a stacked dilated convolution network with connectionist temporal classification to map the visual features into text sentences [69]. Camgoz et al. attempted to generate spoken language translations from sign language video [10]. Camgoz et al. also proposed SubUNets for simultaneous hand shape and continuous sign language recognition [9]. Cui et al. proposed a weakly supervised framework to train the network from videos with ordered gloss labels but no exact temporal locations for continuous sign language recognition [14]. In prior work, our research team proposed a 3D-FCRNN for ASL recognition by combining a 3DCNN and a fully connected RNN [86].

2.3 Public Camera-based ASL Datasets

As discussed in Section 2.1, technology to recognize ASL signs from videos could enable new educational tools or assistive technologies for people who are DHH, and there has been significant prior research on sign language recognition. However, a limiting factor for much of this research has been the scarcity of video recordings of sign language that have been annotated with time interval labels of the words that the human has performed in the video: For ASL, there have been some annotated video-based datasets [63] or collections of motion capture recordings of humans wearing special sensors [54]. Most publicly available datasets, e.g. [22,41], contain general ASL vocabularies from RGB videos, and a few with RGB-D channels.

2D Camera-based ASL databases: The American Sign Language Linguistic Research Project (ASLLRP) dataset contains video clips of signing from the front and side and includes a close-up view of the face [63], with annotations for 19 short narratives (1,002 utterances) and 885 additional elicited utterances from four Deaf native ASL signers; annotation includes: the start and end points of each sign, a unique gloss label for each sign, part of speech, and start and end points of a range of non-manual behaviors (e.g., raised/lowered eyebrows, head position and periodic head movements, expressions of the nose and mouth) also labeled with respect to the linguistic information that they convey (serving to mark, e.g., different sentence types, topics, negation, etc.).

Fig. 2 Generating a representative proxy video by our proposed random temporal augmentation. (a) Eight consecutive frames from a video clip of an ASL sign. (b) Randomly sampled eight frames from the video clip of the same ASL sign. With the same number of frames, the proxy video preserves more temporal dynamics of the ASL sign.

Dreuw et al. [18] produced several subsets from the ASLLRP dataset as benchmark databases for automatic recognition of isolated and continuous sign language.

The American Sign Language Lexicon Video Dataset (ASLLVD) [7] is a large dataset of videos of isolated signs from ASL. It contains video sequences of about 3,000 distinct signs, each produced by 1 to 6 native ASL signers recorded by four cameras under three views: front, side, and face region, along with annotations of those sequences, including start/end frames and class label (i.e., gloss-based identification) of every sign, as well as hand and face locations at every frame.

The RVL-SLLL ASL Database [55] consists of three sets of ASL videos with distinct motion patterns, distinct handshapes, and structured sentences, respectively. These videos were captured from 14 native ASL signers (184 videos per signer) under different lighting conditions. For annotation, the videos with distinct motion patterns or distinct handshapes are saved as separate clips. However, there are no detailed annotations for the videos of structured sentences, which limits the usefulness of the database.

RGB-D Camera-based ASL and Gesture Databases: Recently, some RGB-D databases have been collected for hand gesture and SL recognition, for ASL or other sign languages [12,22,66]. Here we only briefly summarize RGB-D databases for ASL.

The "Spelling-It-Out" dataset consists of 24 static handshapes of the ASL fingerspelling alphabet, ignoring the letters j and z as they involve motion, from four signers; each signer repeats 500 samples for each letter in front of a Kinect camera [70]. The NTU dataset consists of 10 static hand gestures for digits 1 to 10 and was collected from 10 subjects by a Kinect camera. Each subject performs 10 different poses with variations in hand orientation, scale, and articulation for the same gesture, and there is a color image and the corresponding depth map for each [72].

The Chalearn LAP IsoGD dataset [81] is a large-scale hand gesture RGB-D dataset, which is derived from the Chalearn Gesture Dataset (CGD 2011) [26]. This dataset consists of 47,933 RGB-D video clips falling into 249 classes of hand gestures including mudras (Hindu/Buddhist hand gestures), Chinese numbers, and diving signals. Although it is not about ASL recognition, it can be used to learn RGB-D features from different environment settings. Using the learned features as a pretrained model, the fine-tuned ASL recognition model will be more robust to handle different backgrounds and scales (e.g. distance variations between the Kinect camera and the signer).

To support our research, we have collected and annotated a novel RGB-D ASL dataset, ASL-100-RGBD, described in Section 4, with the following properties:

– 100 ASL signs have been collected, which are performed by 15 individual signers (often with multiple recordings from each signer).
– The ASL-100-RGBD dataset has been captured using a Kinect V2 camera and contains multiple channels including RGB, depth, skeleton joints, and HDface.
– Each video consists of the 100 ASL words with time-coded annotations in collaboration with ASL computational linguistic researchers.
– The 100 ASL words have been strategically selected to support sign recognition technology for ASL education tools (many of these words consist of hand gestures and facial expression changes), with the detailed vocabulary composition described in Section 4.

[Fig. 3 diagram: proxy video generation followed by 3DCNN modeling with four streams (RGB Body Network, Depth Body Network, RGB Hands Network, RGB Face Network), whose outputs are fused to produce the final prediction; see the caption below.]
Fig. 3 The pipeline of the proposed multi-channel multi-modality 3DCNN framework for ASL recognition. The multiple channels contain RGB, Depth, and Optical flow while the multiple modalities include hand gestures, facial expressions and body poses. While the full size image is used to represent body pose, to better model hand gestures and the facial expression changes, the regions of hands and face are obtained from the RGB image based on the location guided by skeleton joints. The whole framework consists of two main components: proxy video generation and 3DCNN modeling. First, proxy videos are generated for each ASL sign by selecting a subset of frames spanning the whole video clip of each ASL sign, to represent the overall temporal dynamics. Then the generated proxy videos of RGB, Depth, Optical flow, RGB of hands, and RGB of face are fed into the multi-stream 3DCNN component. The predictions of these networks are weighted to obtain the final results of ASL recognition.
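The caption above notes that the hand and face regions are cropped from the RGB frame using the tracked skeleton joints. Below is a minimal sketch of such a joint-guided crop; the Kinect joint names, the crop size, and the helper name are our illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def crop_region(frame, center_xy, size=112):
    """Crop a square patch (e.g., a hand or the face) centered on a skeleton joint.

    frame     : H x W x 3 uint8 RGB image
    center_xy : (x, y) pixel coordinates of the guiding joint
    size      : side length of the square crop (illustrative value)
    """
    h, w = frame.shape[:2]
    x, y = int(center_xy[0]), int(center_xy[1])
    half = size // 2
    # Clamp the window so it stays inside the image.
    x0, x1 = max(0, x - half), min(w, x + half)
    y0, y1 = max(0, y - half), min(h, y + half)
    patch = frame[y0:y1, x0:x1]
    # Pad back to size x size if the joint is near the border.
    out = np.zeros((size, size, 3), dtype=frame.dtype)
    out[: patch.shape[0], : patch.shape[1]] = patch
    return out

# Hypothetical usage, assuming `joints` maps Kinect joint names to 2D positions:
# left_hand  = crop_region(rgb_frame, joints["HandLeft"])
# right_hand = crop_region(rgb_frame, joints["HandRight"])
# face       = crop_region(rgb_frame, joints["Head"])
```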

3 The Proposed Method for ASL Recognition

The pipeline of our proposed method is illustrated in Fig. 3. There are two main components in the framework: random temporal augmentation to generate proxy videos (which are representative of the overall temporal dynamics of the video clip of an ASL sign) and a 3DCNN to recognize the class label of the sign.

3.1 Random Temporal Augmentation for Proxy Video Generation

The performance of a deep neural network greatly depends on the amount of training data. Large-scale training data and different data augmentation techniques are usually needed for deep networks to avoid over-fitting. During training, different kinds of data augmentation techniques, such as random resizing and random cropping of images, are already widely applied in 3DCNN training. In order to capture the overall temporal dynamics, we apply a random temporal augmentation to generate a proxy video for each sign video clip channel, by selecting a subset of frames, which has proved to be very effective for our proposed framework.

Videos are often redundant in the temporal dimension, and some consecutive frames are very similar without observable difference, as shown in Fig. 2 (a), which displays 8 consecutive frames in a video clip of an ASL sign, while the proxy video in Fig. 2 (b) displays the 8 frames selected from the same video clip by random temporal augmentation. With the same number of frames, the proxy video provides more temporal dynamics. Thus, proxy videos are generated to represent the overall temporal dynamics for each ASL word. The process of proxy video generation by random sampling is formulated in Eq. (1) below:

Si = random(⌊N/T⌋) + ⌊N/T⌋ × i,    (1)

where N is the number of frames of a sign video, T is the number of sampled frames from the video, Si is the i-th sampled frame, and random(⌊N/T⌋) generates one random number in the range [0, ⌊N/T⌋) for every i. To generate the proxy video, each video is uniformly divided into T intervals, and one frame is randomly sampled from every interval. If the total number of frames of a video is less than T, it is padded with the last frame to the length of T. These proxy videos make it feasible to train deep neural networks on the proposed dataset.
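As a concrete illustration of Eq. (1), the following sketch samples one frame index per interval and pads short clips with the last frame, as described above; the function and variable names are ours.

```python
import random

def sample_proxy_indices(num_frames, T=16):
    """Randomly sample one frame index from each of T uniform intervals (Eq. 1).

    num_frames : N, total frames in the sign clip
    T          : number of frames in the generated proxy video
    Returns a list of T frame indices; clips shorter than T frames are
    padded with the last frame, as described in the text.
    """
    if num_frames < T:
        return list(range(num_frames)) + [num_frames - 1] * (T - num_frames)
    step = num_frames // T                       # floor(N / T)
    # random.randrange(step) draws an integer in [0, step), matching Eq. (1).
    return [random.randrange(step) + step * i for i in range(T)]
```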

Fig. 4 The full list of the 100 ASL words in our “ASL-100-RGBD” dataset under 6 semantic categories. These ASL words have been strategically selected to support sign recognition technology for ASL education tools (many of these words consist of both hand gestures and facial expression changes.)

3.2 3D Convolutional Neural Network

3DCNN was first proposed for video action recognition [31], and was improved in C3D [78] by using a similar architecture to VGG [74]. It obtained state-of-the-art performance for several video recognition tasks. The difference between the 2DCNN and 3DCNN operation is that 3DCNN has an extra temporal dimension, which can capture the spatial and temporal information between video frames more effectively.

After the emergence of C3D, many 3DCNN models were proposed for video action recognition [11,15,71]. 3D-ResNet is the 3D version of ResNet, which introduced identity mapping to avoid gradient vanishing and explosion, which makes the training of very deep convolutional neural networks feasible. Compared to 2D-ResNet, the size of the convolution kernel is w × h × t (w is the width of the kernel, h is the height of the kernel, and t is the temporal dimension of the kernel) in 3D-ResNet, while it is w × h in 2D-ResNet. In this paper, 3D-ResNet is chosen as the base network for ASL recognition.

In order to handle the three important elements of ASL recognition (hand gesture, facial expression, and body pose), a hybrid framework is designed including two 3DCNN networks: one for the full body, to capture the full body movements including hands and face with the inputs of the multi-channel proxy videos generated from the full images including RGB, depth, and optical flow; and another for hand and face, to capture the coordinates of hands and face with the inputs of the multi-channel proxy videos generated from the cropped regions of the left hand, right hand, and face. Note that for the Hand-Face network, RGB and depth channels are employed for hand regions. Optical flow is not employed since it cannot accurately track the quick and large hand motions. For the face regions, only the RGB channel is employed since facial expressions generally change much less in depth. The prediction results of the networks are weighted to obtain the final prediction of each ASL sign.

The optical flow images are calculated by stacking the x-component, the y-component, and the magnitude of the flow. Each value in the image is then rescaled to 0 and 255. This practice has yielded good performance in other studies [16,87]. As observed in the experimental results, by fusing all the features generated by RGB, optical flow, and depth images, the performance can be improved, which indicates that complementary information is provided by different channels in training deep neural networks.

4 Proposed ASL Dataset: "ASL-100-RGBD"

As mentioned in Section 2.3, a new dataset has been collected for this research in collaboration with ASL computational linguistic researchers, from native ASL signers (individuals who have been using the language since very early childhood) who performed a word list of 100 ASL signs (see the full list of ASL words in Fig. 4) by using a Kinect V2 camera. Participants responded affirmatively to the following screening question: Did you use ASL at home growing up, or did you attend a school as a very young child where you used ASL? Participants were provided with a slide-show presentation that asked them to perform a sequence of 100 individual ASL signs, without lowering their hands between words. Since this new dataset includes 100 signs with RGB and depth data, we refer to it as the "ASL-100-RGBD" dataset.

During the recording session, a native ASL signer met the participant and conducted the session: prior research in ASL computational linguistics has emphasized the importance of having only native signers present when recording ASL videos so that the signer does not produce English-influenced signing [54]. Several videos were recorded of each of the 15 people, while they signed the 100 ASL signs. Typically three videos were recorded from each person, to produce a total collection of 42 videos (each video contains all the 100 signs) and 4,200 samples of ASL signs.

To facilitate this collection process, we have developed a recording system based on a Kinect 2.0 RGB-D camera to capture multimodality (facial expressions, hand gestures, and body poses) from multiple channels of information (RGB video and depth video) for ASL recognition. The recordings also include skeleton and HDface information. The video resolution is 1920 x 1080 pixels for the RGB channel and 512 x 424 pixels for the depth channel, respectively.

The 100 ASL signs in this collection were selected to strategically support research on sign recognition for ASL education applications, and the words were chosen based on vocabulary that is traditionally included in introductory ASL courses. Specifically, as discussed in [30], our recognition system must identify a subset of ASL words that relate to a list of errors often made by students who are learning ASL. Our proposed educational tool [30] would receive as input a video of a student who is performing ASL sentences, and the system would automatically identify whether the student's performance may include one of several dozen errors, which are common among students learning ASL. As part of the operation of this system, we require a sign-recognition component that can identify when a video of a person includes any of these 100 words during the video (and the time period of the video when this word occurs). When one of these 100 key words is identified, then the system will consider other properties of the signer's movements [30], to determine whether the signer may have made a mistake in their signing.

For instance, the 100 ASL signs include words related to questions (e.g. WHO, WHAT), time-phrases (e.g. TODAY, YESTERDAY), negation (e.g. NOT, NEVER), and other categories of words that relate to key grammar rules of ASL. A full listing of the words included in this dataset appears in Fig. 4. Note that there is no one-to-one mapping between English words and ASL signs, and some ASL signs have variations in their performance, e.g. due to geographic/regional differences or other factors. For this reason, some words in Fig. 4 appear with integers after their name, e.g. THURSDAY and THURSDAY2, to reflect more than one variation in how the ASL word may be produced. For instance, THURSDAY indicates a sign produced by the signer's dominant hand in the "H" alphabet-letter handshape gently circling in space; whereas, THURSDAY2 indicates a sign produced with the signer's dominant hand quickly switching from the alphabet-letter handshape of "T" to "H" while held in space in front of the torso. Both are commonly used ASL signs for the concept of "Thursday"; they simply represent two different ASL words that could be used for the same concept.

As shown in Fig. 4, the words are grouped into 6 semantic categories (Negative, WH Questions, Yes/No Questions, Time, Pointing, and Conditional), which in some cases suggest particular facial expressions that are likely to co-occur with these words when they are used in ASL sentences. For instance, time-related phrases that appear at the beginning of ASL sentences tend to co-occur with a specific facial expression (head tilted back slightly and to the side, with eyebrows raised). Additional details about how detecting words in these various categories would be useful in the context of educational software appear in [30].

After the videos were collected from participants, the videos were analyzed by a team of ASL linguists, who produced time-coded annotations for each video. The linguists used a coding scheme in which an English identifier label was used to correspond to each of the ASL words used in the videos, in a consistent manner across the videos. For example, all of the time spans in the videos when the human performed the ASL word "NOT" were labeled with the English string "NOT" in our linguistic annotation.

Fig. 5 demonstrates several frames of each channel of an ASL sign from our dataset including RGB, skeleton joints (25 joints for every frame), depth map, basic face features (5 main face components), and HDFace (1,347 points). With the publication of this article, the ASL-100-RGBD dataset¹ will be released to the research community.

¹ Some example videos can be found at our research website http://media-lab.ccny.cuny.edu/wordpress/datecode/

Fig. 5 Four sample frames of each channel of an ASL sign from our dataset including RGB, skeleton joints (25 joints for every frame), depth map, basic face features (5 main face components), and HDFace (1,347 points).

5 Experiments and Discussions

In this section, extensive experiments are conducted to evaluate the proposed approach on the newly collected "ASL-100-RGBD" dataset and the Chalearn LAP IsoGD dataset [81].

5.1 Implementation Details

The same 3D-ResNet architecture is employed for all experiments. Different channels and modalities are fed to the network as input. The input channels are RGB, Depth, RGBflow (i.e. optical flow of RGB images), and Depthflow (i.e. optical flow of depth images) of modalities including hands, face, and full body. The fusion of different channels is studied and compared.

Our proposed model is trained in PyTorch on four Titan X GPUs. To avoid over-fitting, a pretrained model from the Kinetics or Chalearn dataset is employed, and the following data augmentation techniques were used: random cropping (using a patch size of 112 × 112) and random rotation (with a random number of degrees in a range of [−10, 10]). The models are then fine-tuned for 50 epochs with an initial learning rate of λ = 3 × 10−3, reduced by a factor of 10 after every 25 epochs.

To apply the pre-trained 3D-ResNet model, which expects 3 bands in RGB image format, to one-channel depth images or optical flow images, the depth images are simply converted to 3 bands in RGB image format. For the optical flow images, the pre-trained 3D-ResNet model takes the x-component, the y-component, and the magnitude of the flow as the R, G, and B bands in the RGB format.

5.2 Experiments on ASL-100-RGBD

To prepare the training and testing for evaluation of the proposed method on the "ASL-100-RGBD" dataset, we first extracted the video clips for each ASL sign. We use 3,250 ASL clips for training (75% of the data) and the remaining 25% of the ASL clips for testing. To ensure a subject independent evaluation, no signer appears in both the training and testing datasets. To augment the data, a new 16-frame proxy video is generated from each video by selecting a different subset of frames for each epoch during the training phase.
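To make the per-epoch augmentation above concrete, here is a hedged sketch of a PyTorch dataset that draws a fresh proxy video each time a clip is requested; it assumes the sample_proxy_indices helper sketched in Section 3.1 and in-memory clips stored as (N, C, H, W) tensors, which is a simplification of the actual data pipeline.

```python
from torch.utils.data import Dataset

class ProxyVideoDataset(Dataset):
    """Hypothetical wrapper that re-samples a new proxy video for every request,
    so each training epoch sees a different subset of frames of each clip."""

    def __init__(self, clips, labels, T=16, transform=None):
        self.clips = clips          # list of per-clip tensors, each (N, C, H, W)
        self.labels = labels
        self.T = T
        self.transform = transform  # e.g., random crop / rotation applied to the clip

    def __len__(self):
        return len(self.clips)

    def __getitem__(self, idx):
        frames = self.clips[idx]
        indices = sample_proxy_indices(frames.shape[0], self.T)
        proxy = frames[indices]                     # (T, C, H, W)
        if self.transform is not None:
            proxy = self.transform(proxy)
        # 3DCNNs in PyTorch expect (C, T, H, W) per sample.
        return proxy.permute(1, 0, 2, 3), self.labels[idx]
```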

5.2.1 Effects of Data Augmentations

The training dataset, which contains 3,250 ASL video clips of 100 ASL manual signs, is relatively small for 3DCNN training and could easily cause an over-fitting problem. In order to extract more representative temporal dynamics as well as avoid over-fitting, a random temporal augmentation technique is applied to generate proxy videos (a new proxy video for each epoch) for each ASL clip. The ASL recognition results of using the proposed proxy video (16 frames per video) are compared with the traditional method (using the same number of consecutive frames). The network, 3DResNet-34, does not converge when trained with 16 consecutive frames, while the network trained with proxy videos obtained 68.4% on the testing dataset. This is likely because the majority of movements in these videos are from the hands, and the consecutive frames could not effectively represent the temporal and spatial information. Therefore, the network could not distinguish the clips based on only 16 consecutive frames. We also evaluate the effect of random cropping (using a patch size of 112 × 112) and random rotation (with a random number of degrees in a range of [−10, 10]).

Table 1 lists the effects of different data augmentation techniques on the performance of recognizing 100 ASL words using only the RGB channel. With the proxy videos, the 3DCNN model obtains 68.4% accuracy on the testing data for recognizing 100 ASL signs. By adding the random crop, the performance is improved by 4.4%, and adding the random rotation further improves the performance to 75.9%. In the following experiments, proxy videos together with random crop and random rotation are employed to augment the data.

Table 1 The comparison of the performance of different data augmentation methods on only the RGB channel with 16 frames for recognizing 100 ASL manual signs. All the models are pretrained on Kinetics and finetuned on the ASL-100-RGBD dataset. The best performance is achieved with the random proxy video, random crop, and random rotation.

Random Proxy Video      –               √       √       √
Random Crop             –               –       √       √
Random Rotation         –               –       –       √
Performance             Not converging  68.4%   72.8%   75.9%

5.2.2 Effects of Network Architectures

In this experiment, the ASL recognition results of different numbers of layers (18, 34, 50, and 101) for 3DResNet are compared on full RGB, optical flow, and depth images. As shown in Table 2, 3DResNet-18, 3DResNet-50, and 3DResNet-101 achieve comparable results on the RGB channel. However, the performance on the optical flow and depth channels is much lower than that of the RGB channel because the network has been pre-trained on the Kinetics dataset, which contains only RGB images. As shown in Table 2, 3DResNet-34 obtained the best performance for all RGB, optical flow, and depth channels. Hence, 3DResNet-34 is chosen for all the subsequent experiments.

Table 2 The effects of the number of layers for 3DResNet with 16 frames on RGB, optical flow, and depth channels. All the models are pretrained on Kinetics and finetuned on the ASL-100-RGBD dataset.

Network        RGB (%)   Optical Flow (%)   Depth (%)
3DResNet-18    73.2      61.9               65.0
3DResNet-34    75.9      62.8               66.5
3DResNet-50    72.3      55.4               62.0
3DResNet-101   72.5      55.0               61.5

5.2.3 Effects of Pre-trained Models

To evaluate the effects of pre-trained models, we fine-tune 3DResNet-34 with pretrained models from the Kinetics [38] and the Chalearn LAP IsoGD [81] datasets, respectively. The Kinetics dataset consists of RGB videos of diverse human actions which involve different parts of the body, while the Chalearn LAP IsoGD dataset contains both RGB and depth videos of various hand gestures including mudras (Hindu/Buddhist hand gestures), Chinese numbers, and diving signals, as shown in Fig. 6.

Fig. 6 Example images of three datasets. ASL-100-RGBD: various ASL signs. Kinetics dataset: consisting of diverse human actions, involving different parts of the body. Chalearn IsoGD: various hand gestures including mudras (Hindu/Buddhist hand gestures) and diving signals.

The results are shown in Table 3. The temporal duration is fixed to 16 and the channels are RGB, Depth, and RGBflow. In all channels, the performance using the pretrained models from the Chalearn dataset is better than with pretrained models from the Kinetics dataset. This is probably because all the videos in the Chalearn dataset are focused on hand gestures, and the network trained on this dataset can learn prior knowledge of hand gestures. The Kinetics dataset consists of general videos from YouTube, and the network focuses on the prior knowledge of motions. Therefore, for each channel the pretrained model on the same channel of the Chalearn dataset is used in the subsequent experiments.

Table 3 The comparison of the performance of recognizing 100 ASL words on 3DResNet-34 with different pretrained models.

Channels    Kinetics (%)   Chalearn (%)
RGB         75.9           76.38
Depth       66.5           68.18
RGB Flow    62.8           66.79
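Section 5.1 describes feeding single-channel depth maps and two-channel optical flow to an RGB-pretrained 3D-ResNet by packing them into three bands. The sketch below shows one way to do this with OpenCV; the Farneback flow estimator and the per-frame min-max rescaling are our assumptions, since the paper does not specify which optical flow algorithm was used.

```python
import cv2
import numpy as np

def flow_to_rgb_bands(prev_gray, next_gray):
    """Pack dense optical flow into 3 bands (x, y, magnitude) rescaled to [0, 255],
    so an RGB-pretrained 3D-ResNet can consume it (see Section 5.1)."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag = np.linalg.norm(flow, axis=2)
    bands = np.dstack([flow[..., 0], flow[..., 1], mag])
    # Rescale jointly over the frame; a simplification of the paper's rescaling.
    bands = cv2.normalize(bands, None, 0, 255, cv2.NORM_MINMAX)
    return bands.astype(np.uint8)

def depth_to_rgb_bands(depth_frame):
    """Replicate a single-channel depth map into 3 identical bands."""
    d8 = cv2.normalize(depth_frame.astype(np.float32), None, 0, 255, cv2.NORM_MINMAX)
    return cv2.merge([d8, d8, d8]).astype(np.uint8)
```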

5.2.4 Effects of Temporal Duration of Proxy Videos

We study the effects of temporal duration (i.e. the number of frames used in proxy videos) by finetuning 3DResNet-34 on the ASL-100-RGBD dataset with proxy videos of 16, 32, and 64 frames, respectively. Note that the same temporal duration is also used to train the corresponding pre-trained model on the Chalearn dataset. Results are shown in Table 4. The network with 64 frames achieves the best performance. Therefore, 3D-ResNet-34 with 64 frames is used in all the following experiments.

Table 4 The comparison of the performance of networks with different temporal durations (i.e. number of frames used in proxy videos). All the models are pretrained on the Chalearn dataset and finetuned on the ASL-100-RGBD dataset using the same temporal duration.

Channel     16 frames (%)   32 frames (%)   64 frames (%)
RGB         76.38           80.73           87.83
Depth       68.18           74.21           81.93
RGB Flow    66.79           71.74           80.51

5.2.5 Effects of Different Input Channels

In this section, we examine the fusion results of different input channels. The RGB channel provides global spatial and temporal appearance information, the depth channel provides the distance information, and the optical flow channel captures the motion information. The network is finetuned on the three input channels respectively. The average fusion is obtained by weighting the predicted results.

Table 5 shows the performance of ASL recognition on the ASL-100-RGBD dataset for each input channel and different fusions. While the RGB channel alone achieves 87.83%, by fusing with optical flow, the performance is boosted up to 89.02%. With the fusion of all three channels (RGB, Optical flow, and Depth), the performance is further improved to 89.91%. This indicates that the depth and optical flow channels contain complementary information to the RGB channel for ASL recognition.

Table 5 The comparison of the performance of networks with different input channels and their fusions. All the models are pretrained on the Chalearn dataset and finetuned on the ASL-100-RGBD dataset with 64 frames.

Channels      RGB      Depth    Optical Flow   RGB+Depth+Flow   RGB+Flow   RGB+Depth
Performance   87.83%   81.93%   80.51%         89.91%           89.02%     89.71%

5.2.6 Effects of Different Modalities

We attain further insight into the learned features of the 3DCNN model for the RGB channel. In Fig. 7 we visualize some examples of the attention maps of the fifth convolution layer on our test dataset, generated by the trained RGB 3DCNN model for ASL recognition. These attention maps are computed by averaging the magnitude of activations of the convolution layer, which reflects the attention of the network. The attention maps show that the model mostly focused on the hands and face of the signer during the ASL recognition process.

Fig. 7 Example RGB images and their corresponding attention maps from the fifth convolution layer of the 3DResNet-34 on our test dataset of ASL recognition, in which the hands and face have most of the attention.

Hence, we conduct experiments to analyze the effects of each modality (hand gestures, facial expression, and body poses) with the RGB channel. As shown in Fig. 3, the hand regions and the face regions are obtained from the RGB image based on the location guided by skeleton joints. The performance of each modality and their fusions are summarized in Table 6.

Table 6 The comparison of the performance of different modalities and their fusions. All the models are pretrained on the Chalearn dataset and finetuned on the ASL-100-RGBD dataset with 64 frames.

Modalities    Body     Hand    Body+Hand   Body+Hand+Face
Performance   87.83%   80.9%   89.81%      91.5%

In addition to the accuracy of ASL sign recognition, we further analyzed the accuracy of the six categories (see Fig. 4 for details) for each modality and their combinations in Table 7. For the categories that involve many facial expressions, such as Question (Yes/No) and Negative, the accuracy of the hand modality is improved by more than 15% after fusion with the face modality. For the Conditional category, which utilizes more subtle facial expressions, the accuracy of the hand modality is not improved after fusion with the face modality.

Table 7 The performance (%) of different modalities and their fusions on the six categories listed in Fig. 4: Conditional (Cond), Negative (Neg), Pointing (Point), Question (WH), Yes/No Question (Y/N), and Time. The last column is the accuracy (%) over all ASL signs.

Modalities        Cond    Neg    Point   WH     Y/N    Time   Acc
Hand              90.0    78.1   68.4    84.3   68.4   81.4   80.9
Body              100.0   87.4   84.2    88.0   89.5   87.6   87.83
Body+Hand         90.9    86.6   89.5    88.7   94.7   90.2   89.81
Body+Hand+Face    90.9    93.3   84.2    90.6   84.2   91.8   91.5
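The average fusion used in Sections 5.2.5–5.2.7 combines the predictions of the separately trained streams. A minimal sketch of such late fusion over per-stream scores is shown below; whether the original system averages softmax probabilities or raw scores, and with which exact weights, is not stated, so equal weighting is assumed here.

```python
import torch
import torch.nn.functional as F

def fuse_streams(logits_per_stream, weights=None):
    """Late (average) fusion: combine per-stream class scores into one prediction.

    logits_per_stream : list of (batch, num_classes) tensors, one per channel/modality
    weights           : optional per-stream weights; equal weighting by default
    """
    if weights is None:
        weights = [1.0 / len(logits_per_stream)] * len(logits_per_stream)
    probs = [w * F.softmax(l, dim=1) for w, l in zip(weights, logits_per_stream)]
    fused = torch.stack(probs, dim=0).sum(dim=0)
    return fused.argmax(dim=1)   # predicted ASL sign per video
```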

5.2.7 Fusions of Different Channels and Modalities 5.3.2 Effects of Different Channels and Modalities The fusion results of different input channels and modali- ties on ASL-100-RGBD dataset are shown in Table 8. The We evaluate the effects of different channels including RGB, experiments are based on 3DResNet-34 with 64 frames, pre- RGB flow, Depth, and Depth flow. Because the Chalearn trained on Chalearn dataset. Among all the models, fusion of dataset is designed for hand gesture recognition, we fur- RGB+Depth+Hands RGB+ Face RGB achieves the best ther analyze the effects of different hands (left and right), as performance with 92.88% accuracy. Adding RGBflow to this well as the whole body. We develop a method to distinguish combination results in 92.48% accuracy which is compara- left and right hands in Chalearn Isolated Gesture dataset, ble but not improved since the channels have redundant in- and will release the coordinates of hands (distinguished be- formation. tween right and left hands) with the publication of this arti- cle. Since the Chalearn dataset is collected for recognizing Table 8 Performance of 3DResNet-34 with 64 from fusion of different hand gestures, here, the face channel is not employed. channels and modalities. We train 12 3D-ResNet-34 networks with 64 frames by Channels Fusions using different combinations of channels and modalities re- √ √ √ RGB √ √ √ √ spectively and show the results in table 11. The accuracy Depth √ √ √ of right hand is significantly higher than the left hand. The RGBflow √ √ √ √ reason is that for most of the gestures in Chalearn dataset, RGB of Hands √ √ √ RGB of Face the right hand is dominant and the left hand does not move Performance 91.19% 92.48% 92.48% 92.88% much for many hand gestures.

Table 11 Performance of 3D-ResNet-34 with 64 frames on Chalearn 5.3 Experiments on Chalearn LAP IsoGD dataset Dataset for different channels and modalities. Channel Global Channel (%) Left Hand (%) Right Hand (%) 5.3.1 Effects of Network Architectures RGB 58.32 18.01 48.58 Depth 63.16 19.43 54.15 RGB Flow 60.26 21.97 48.79 The 3D-ResNet is pre-trained on Kinetics [38] for all the Depth Flow 55.37 20.28 47.07 experiments in this section. To find the best network archi- tecture for Chalearn dataset, the parameters of 3D-ResNet 5.3.3 Effects of Fusions on different channels and are studied on RGB videos. The results are shown in Table Modalities 9. By changing the number of layers to 18, 34, 50 while fix- ing the temporal duration to 32, ResNet-34 achieved the best Here we analyze the effects of average fusion on different accuracy. channels and modalities. The results are shown in Table 12. Using only RGB and Depth channels, the accuracy is 67.58% Table 9 Ablation study of number of layers of the network on RGB which is improved to 69.97% by adding RGB flow. We ob- videos of Chalearn Dataset. serve that among all different triplets of channels, Right Hand Network Temporal Duration Accuracy RGB + Depth + RGBflow has the highest accuracy at 73.32%. ResNet-18 32 52.69% By applying the average fusion on four channels RGB+ RG- ResNet-34 32 56.28% Bflow+ Right Hand RGB + Right Hand Depth, our model ResNet-50 32 54.57% achieves the accuracy about 75.88% which outperforms the Recognizing American Sign Language Manual Signs from RGB-D Videos 13 average fusion results of all previous work on Chalearn dataset. While the authors of FOANet [61] had reported a 12% boost In the-state-of-the-art work of [61], the accuracy of average from using sparse fusion in their original experiments, our fusion is 71.93% for 7 channels and 70.37% for 12 channels, experiments do not reveal such a boost when implementing respectively. a system following the technical details provided in [61]. Finally, the average fusion of all global channels (RGB, Table 14 lists the accuracy on individual channels of our RGB flow, Depth, Depth flow) and Right hand channels ( network and FOANet [61]. In this table, the values inside the Right hand RGB, Right hand RGB flow, Right hand Depth, parenthesis represent the accuracy of FOANet. As shown in Right hand Depth flow) resulted in 76.04% accuracy and the the table, in the Global channel, our framework outperforms accuracy of 12 channels together resulted in 75.68%. This FOANet in all the four channels by 10% to 25%. Also, for means that the 12 channels contain redundant information, the RGB of Right Hand, we obtain a comparable accuracy and adding more channels does not necessarily improve the ( 48%) as FOANet. However, FOANet is outperforming our results. results in the Right Hand for Depth, RGBflow, and Depth- flow by nearly 10%. From our experiments, the performance of ”Global” channels (whole body) in general is superior to Table 12 Performance of 3DResNet-34 with 64 frames for fusion of different channels and modalities on Charlearn dataset. the Local channels (Right/ Left Hand) because the Global Channels Fusions channels include more information. By using the similar ar- √ √ √ √ chitecture, FOANet reported 64% accuracy from Depth of RGB √ √ √ √ Depth √ √ √ √ Right Hand and 38% from Depth of the entire frame. In- RGBflow √ √ √ RGB of Right Hand √ √ stead, our framework achieves more consistent results. 
For Depth of Right Hand example, in our framework the accuracy of Depth channel Performance 67.58% 69.97% 73.32% 75.53% 75.88% is higher than RGB and RGBflow for both Global and Right Hand, while the accuracy in FOANet for Depth and RGB are almost the same in the Global channel (around 40%) but 5.3.4 Comparison with the-state-of-the-arts very different in the Right Hand channel (17% difference.) Our framework achieves accuracy of 75.88% and 76.04% from the fusion of 5 and 8 channels, respectively, on Chalearn Table 14 The accuracy (%) of 12 channels on the test set of Chalearn IsoGD dataset. Table 13 lists the state-of-the-art results from IsoGD Dataset. Comparison between our framework and FOANet [61]. Chalearn IsoGD competition 2017 as well as a recent paper, The values inside the parenthesis belong to FOANet. FOANet [61]. As shown in the table, in terms of the Average Channel Global Channel (%) Left Hand (%) Right Hand (%) Fusion, our framework achieves around 6% higher accuracy RGB 58.32 (41.27) 18.01 (16.63) 48.58 (47.41) than the-state-of-the-arts methods. Depth 63.16 (38.50) 19.43 (24.06) 54.15 (64.44) RGB Flow 60.26 (50.96) 21.97 (24.02) 48.79 (59.69) Depth Flow 55.37 (42.02) 20.28 (22.71) 47.07 (58.79) Table 13 Comparison with State-of-the-art Results on Chalearn IsoGD Dataset. Framework Accuracy on Test Set (%) 6 Conclusion Our Results 76.04 FOANet (Average Fusion) [61] 70.37 Miao et al. (ASU) [58] 67.71 In this paper, we have proposed a 3DCNN-based multi-channel SYSU-IEEE 67.02 and multi-modality framework, which learns complemen- Lostoy 65.97 tary information and embeds the temporal dynamics in videos Wang et al. (AMRL) [82] 65.59 to recognize manual signs of ASL from RGB-D videos. To Zhang et al. (XDETVP) [90] 60.47 validate our proposed method, we collaborate with ASL ex- perts to collect an ASL dataset of 100 manual signs includ- ing both hand gestures and facial expressions with full anno- It is worth noting that FOANet [61] reported the accu- tation on the word labels and temporal boundaries (starting racy of 82.07% by applying Sparse Fusion on the softmax and ending points.) The experimental results have demon- scores of 12 channels (combinations of right hand, left hand, strated that by fusing multiple channels in our proposed frame- and whole body while each has 4 channels of RGB, Depth, work, the accuracy of recognizing ASL signs can be im- RGBflow and Depthflow). The purpose of using sparse fu- proved. This technology for identifying the appearance of sion is to learn which channels are important for each ges- specific ASL words has valuable applications for technolo- ture. The accuracy of FOANet framework using average fu- gies that can benefit people who are DHH [9, 42, 43, 48, sion is 70.37% which is around 6% lower than our results 52, 65, 68]. As an additional contribution, our “ASL-100- and nearly 12% lower than the accuracy of sparse fusion. RGBD” dataset, will be released to enable other members of 14 Longlong Jing∗ et al. the research community to use this resource for training or 16. Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., evaluation of models for ASL recognition. The effectiveness Venugopalan, S., Saenko, K., Darrell, T.: Long-term recurrent con- of the proposed framework is also evaluated on the Chalearn volutional networks for visual recognition and description. In: Proceedings of the IEEE conference on Computer Vision and Pat- IsoGD Dataset. Our method has achieved 75.88% accuracy tern Recognition, pp. 
6 Conclusion

In this paper, we have proposed a 3DCNN-based multi-channel and multi-modality framework, which learns complementary information and embeds the temporal dynamics of videos to recognize ASL manual signs from RGB-D videos. To validate the proposed method, we collaborated with ASL experts to collect an ASL dataset of 100 manual signs, including both hand gestures and facial expressions, with full annotation of the word labels and temporal boundaries (starting and ending points). The experimental results demonstrate that fusing multiple channels in the proposed framework improves the accuracy of recognizing ASL signs. This technology for identifying the appearance of specific ASL words has valuable applications for technologies that can benefit people who are DHH [9, 42, 43, 48, 52, 65, 68]. As an additional contribution, our "ASL-100-RGBD" dataset will be released to enable other members of the research community to use this resource for training or evaluation of models for ASL recognition. The effectiveness of the proposed framework is also evaluated on the Chalearn IsoGD dataset; our method achieves 75.88% accuracy using only 5 channels, which is 5.51% higher than the state-of-the-art work using 12 channels in terms of average fusion.

Acknowledgements This material is based upon work supported by the National Science Foundation under award numbers 1400802, 1400810, and 1462280.

References

1. American deaf and hard of hearing statistics. https://www.nidcd.nih.gov/health/statistics/quick-statistics-hearing
2. Intel realsense technology: Observe the world in 3d. https://www.intel.com/content/www/us/en/architecture-and-technology/realsense-overview.html (2018)
3. Orbbec astra. https://orbbec3d.com/product-astra/ (2018)
4. Set up kinect for windows v2 or an xbox kinect sensor with kinect adapter for windows. https://support.xbox.com/en-US/xbox-on-windows/accessories/kinect-for-windows-v2-setup (2018)
5. von Agris, U., Knorr, M., Kraiss, K.F.: The significance of facial features for automatic sign language recognition. In: Proceedings of IEEE International Conference on Automatic Face & Gesture Recognition (2008)
6. Almeida, S.G.M., Guimarães, F.G., Ramírez, J.: Feature extraction in brazilian sign language recognition based on phonological structure and using rgb-d sensors. Expert Systems with Applications 41(16), 7259–7271 (2014)
7. Athitsos, V., Neidle, C., Sclaroff, S., Nash, J., Stefan, A., Yuan, Q., Thangali, A.: The asl lexicon video dataset. In: Proceedings of CVPR 2008 Workshop on Human Communicative Behaviour Analysis. IEEE (2008)
8. Buehler, P., Everingham, M., Huttenlocher, D.P., Zisserman, A.: Upper body detection and tracking in extended signing sequences. International Journal of Computer Vision 95(2), 180 (2011)
9. Camgöz, N.C., Hadfield, S., Koller, O., Bowden, R.: Subunets: End-to-end hand shape and continuous sign language recognition. In: ICCV, vol. 1 (2017)
10. Camgöz, N.C., Hadfield, S., Koller, O., Ney, H., Bowden, R.: Neural sign language translation. CVPR 2018 Proceedings (2018)
11. Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pp. 4724–4733. IEEE (2017)
12. Chai, X., Li, G., Lin, Y., Xu, Z., Tang, Y., Chen, X., Zhou, M.: Sign language recognition and translation with kinect. In: Proceedings of IEEE International Conference on Automatic Face and Gesture Recognition (2013)
13. Charles, J., Pfister, T., Everingham, M., Zisserman, A.: Automatic and efficient human pose estimation for sign language videos. International Journal of Computer Vision 110(1), 70–90 (2014)
14. Cui, R., Liu, H., Zhang, C.: Recurrent convolutional neural networks for continuous sign language recognition by staged optimization. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
15. Diba, A., Fayyaz, M., Sharma, V., Karami, A.H., Mahdi Arzani, M., Yousefzadeh, R., Van Gool, L.: Temporal 3D ConvNets: New architecture and transfer learning for video classification. ArXiv e-prints (2017)
16. Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., Darrell, T.: Long-term recurrent convolutional networks for visual recognition and description. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2625–2634 (2015)
17. Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Darrell, T.: Decaf: A deep convolutional activation feature for generic visual recognition. arXiv preprint arXiv:1310.1531 (2013)
18. Dreuw, P., Forster, J., Ney, H.: Tracking benchmark databases for video-based sign language recognition. In: Proc. ECCV International Workshop on Sign, Gesture, and Activity (2010)
19. Er-Rady, A., Thami, R.O.H., Faizi, R., Housni, H.: Automatic sign language recognition: A survey. In: Proceedings of the 3rd International Conference on Advanced Technologies for Signal and Image Processing (2017)
20. Fang, G., Gao, W., Zhao, D.: Large-vocabulary continuous sign language recognition based on transition-movement models. IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans 37(1) (2007)
21. Fernando, B., Gavves, E., Oramas, J., Ghodrati, A., Tuytelaars, T.: Rank pooling for action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(4), 773–787 (2017)
22. Forster, J., Schmidt, C., Hoyoux, T., Koller, O., Zelle, U., Piater, J.H., Ney, H.: Rwth-phoenix-weather: A large vocabulary sign language recognition and translation corpus. In: LREC, pp. 3785–3789 (2012)
23. Furman, N., Goldberg, D., Lusin, N.: Enrollments in languages other than english in united states institutions of higher education, fall 2010. Retrieved from http://www.mla.org/2009 enrollmentsurvey (2010)
24. Gattupalli, S., Ghaderi, A., Athitsos, V.: Evaluation of deep learning based pose estimation for sign language recognition. In: Proceedings of the 9th ACM International Conference on Pervasive Technologies Related to Assistive Environments, p. 12. ACM (2016)
25. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pp. 580–587. IEEE (2014)
26. Guyon, I., Athitsos, V., Jangyodsuk, P., Escalante, H.J.: The chalearn gesture dataset (cgd 2011). Machine Vision and Applications 25(8), 1929–1951 (2014)
27. Hara, K., Kataoka, H., Satoh, Y.: Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet? In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6546–6555 (2018)
28. He, K., Zhang, X., Ren, S., Sun, J.: Spatial pyramid pooling in deep convolutional networks for visual recognition. In: Computer Vision–ECCV 2014, pp. 346–361. Springer (2014)
29. Huang, J., Zhou, W., Zhang, Q., Li, H., Li, W.: Video-based sign language recognition without temporal segmentation. arXiv preprint arXiv:1801.10111 (2018)
30. Huenerfauth, M., Gale, E., Penly, B., Pillutla, S., Willard, M., Hariharan, D.: Evaluation of language feedback methods for student videos of american sign language. ACM Transactions on Accessible Computing (TACCESS) 10(1), 2 (2017)
31. Ji, S., Xu, W., Yang, M., Yu, K.: 3d convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(1), 221–231 (2013)
32. Jiang, Y., Tao, J., Ye, W., Wang, W., Ye, Z.: An isolated sign language recognition system using rgb-d sensor with sparse coding. In: Proceedings of IEEE 17th International Conference on Computational Science and Engineering (2014)
33. Jing, L., Yang, X., Tian, Y.: Video you only look once: Overall temporal convolutions for action recognition. Journal of Visual Communication and Image Representation 52, 58–65 (2018)

34. Jing, L., Ye, Y., Yang, X., Tian, Y.: 3d convolutional neural network with multi-model framework for action recognition. In: 2017 IEEE International Conference on Image Processing (ICIP), pp. 1837–1841. IEEE (2017)
35. Kadous, M.: Machine recognition of auslan signs using power gloves: towards large-lexicon recognition of sign language. In: Proceedings of the Workshop on the Integration of Gesture in Language and Speech, pp. 165–174 (1996)
36. Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. arXiv preprint arXiv:1412.2306 (2014)
37. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: CVPR (2014)
38. Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., et al.: The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)
39. Kelly, D., McDonald, J., Markham, C.: A person independent system for recognition of hand postures used in sign language. Pattern Recognition Letters 31(11), 1359–1368 (2010)
40. Keskin, C., Kıraç, F., Kara, Y., Akarun, L.: Hand pose estimation and hand shape classification using multi-layered randomized decision forests. In: Proceedings of the European Conference on Computer Vision, pp. 852–863 (2012)
41. Koller, O., Forster, J., Ney, H.: Continuous sign language recognition: Towards large vocabulary statistical recognition systems handling multiple signers. Computer Vision and Image Understanding 141, 108–125 (2015)
42. Koller, O., Ney, H., Bowden, R.: Deep learning of mouth shapes for sign language. In: Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 85–91 (2015)
43. Koller, O., Ney, H., Bowden, R.: Deep hand: How to train a cnn on 1 million hand images when your data is continuous and weakly labelled. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3793–3802 (2016)
44. Koller, O., Zargaran, S., Ney, H., Bowden, R.: Deep sign: Enabling robust statistical continuous sign language recognition via hybrid cnn-hmms. International Journal of Computer Vision 126(12), 1311–1325 (2018)
45. Kong, W., Ranganath, S.: Towards subject independent continuous sign language recognition: A segment and merge approach. Pattern Recognition 47(3), 1294–1308 (2014)
46. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
47. Kumar, P., Gauba, H., Roy, P.P., Dogra, D.P.: A multimodal framework for sensor based sign language recognition. Neurocomputing 259, 21–38 (2017)
48. Kumar, P., Roy, P.P., Dogra, D.P.: Independent bayesian classifier combination based sign language recognition using facial expression. Information Sciences 428, 30–48 (2018)
49. Lang, S., Block, M., Rojas, R.: Sign language recognition using kinect. In: Proceedings of the International Conference on Artificial Intelligence and Soft Computing, pp. 394–402 (2012)
50. Liang, R.H., Ouhyoung, M.: A real-time continuous gesture recognition system for sign language. In: Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition, pp. 558–567 (1998)
51. Liu, J., Liu, B., Zhang, S., Yang, F., Yang, P., Metaxas, D.N., Neidle, C.: Recognizing eyebrow and periodic head gestures using crfs for non-manual grammatical marker detection in asl. In: Proc. of the 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG) (2013)
52. Liu, W., Fan, Y., Li, Z., Zhang, Z.: Rgbd video based human hand trajectory tracking and gesture recognition system. Mathematical Problems in Engineering 2015 (2015)
53. Liu, Z., Huang, F., Tang, G.W.L., Sze, F.Y.B., Qin, J., Wang, X., Xu, Q.: Real-time sign language recognition with guided deep convolutional neural networks. In: Proceedings of the 2016 Symposium on Spatial User Interaction, pp. 187–187. ACM (2016)
54. Lu, P., Huenerfauth, M.: Cuny american sign language motion-capture corpus: first release. In: Proceedings of the 5th Workshop on the Representation and Processing of Sign Languages: Interactions between Corpus and Lexicon, The 8th International Conference on Language Resources and Evaluation (LREC 2012), Istanbul, Turkey (2012)
55. Martínez, A.M., Wilbur, R.B., Shay, R., Kak, A.C.: The rvl-slll asl database. In: Proc. of IEEE International Conference on Multimodal Interfaces (2002)
56. Mehrotra, K., Godbole, A., Belhe, S.: Indian sign language recognition using kinect sensor. In: Proceedings of the International Conference on Image Analysis and Recognition, pp. 528–535 (2015)
57. Metaxas, D., Liu, B., Yang, F., Yang, P., Michael, N., Neidle, C.: Recognition of nonmanual markers in asl using non-parametric adaptive 2d-3d face tracking. In: Proc. of the Int. Conf. on Language Resources and Evaluation (LREC), European Language Resources Association (2012)
58. Miao, Q., Li, Y., Ouyang, W., Ma, Z., Xu, X., Shi, W., Cao, X., Liu, Z., Chai, X., Liu, Z., et al.: Multimodal gesture recognition based on the resc3d network. In: ICCV Workshops, pp. 3047–3055 (2017)
59. Mitchell, R.E., Young, T.A., Bachleda, B., Karchmer, M.A.: How many people use asl in the united states? why estimates need updating. Sign Language Studies 6(3), 306–335 (2006)
60. Mulrooney, K.: American Sign Language Demystified, Hard Stuff Made Easy. McGraw Hill (2010)
61. Narayana, P., Beveridge, J.R., Draper, B.A.: Gesture recognition: Focus on the hands. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5235–5244 (2018)
62. Neidle, C., Thangali, A., Sclaroff, S.: Challenges in development of the american sign language lexicon video dataset (asllvd) corpus. In: Proceedings of the Language Resources and Evaluation Conference (LREC) (2012)
63. Neidle, C., Vogler, C.: A new web interface to facilitate access to corpora: Development of the asllrp data access interface (dai). In: Proc. 5th Workshop on the Representation and Processing of Sign Languages: Interactions between Corpus and Lexicon, LREC (2012)
64. Ong, S.C., Ranganath, S.: Automatic sign language analysis: A survey and the future beyond lexical meaning. IEEE Pattern Analysis and Machine Intelligence 27(6), 873–891 (2005)
65. Palmeri, M., Vella, F., Infantino, I., Gaglio, S.: Sign languages recognition based on neural network architecture. In: International Conference on Intelligent Interactive Multimedia Systems and Services, pp. 109–118. Springer (2017)
66. Pigou, L., Dieleman, S., Kindermans, P.J., Schrauwen, B.: Sign language recognition using convolutional neural networks. In: Proceedings of European Conference on Computer Vision Workshops, pp. 572–578 (2014)
67. Pigou, L., Van Den Oord, A., Dieleman, S., Van Herreweghe, M., Dambre, J.: Beyond temporal pooling: Recurrence and temporal convolutions for gesture recognition in video. International Journal of Computer Vision 126(2-4), 430–439 (2018)
68. Pigou, L., Van Herreweghe, M., Dambre, J.: Gesture and sign language recognition with temporal residual networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3086–3093 (2017)
69. Pu, J., Zhou, W., Li, H.: Dilated convolutional network with iterative optimization for continuous sign language recognition. In: IJCAI, pp. 885–891 (2018)
70. Pugeault, N., Bowden, R.: Spelling it out: Real-time asl fingerspelling recognition. In: Proc. of IEEE International Conference on Computer Vision Workshops, pp. 1114–1119 (2011)

71. Qiu, Z., Yao, T., Mei, T.: Learning spatio-temporal representation with pseudo-3d residual networks. In: The IEEE International Conference on Computer Vision (ICCV) (2017)
72. Ren, Z., Yuan, J., Meng, J., Zhang, Z.: Robust part-based hand gesture recognition using kinect sensor. IEEE Trans. on Multimedia 15, 1110–1120 (2013)
73. Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems, pp. 568–576 (2014)
74. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
75. Starner, T., Weaver, J., Pentland, A.: Real-time american sign language recognition using desk and wearable computer based video. IEEE Pattern Analysis and Machine Intelligence 20(12), 1371–1375 (1998)
76. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. arXiv preprint arXiv:1409.4842 (2014)
77. Tamura, S., Kawasaki, S.: Recognition of sign language motion images. Pattern Recognition 21(4), 343–353 (1988)
78. Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3d convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4489–4497 (2015)
79. Traxler, C.B.: The stanford achievement test: National norming and performance standards for deaf and hard-of-hearing students. Journal of Deaf Studies and Deaf Education 5(4), 337–348 (2000)
80. Valli, C., Lucas, C., Mulrooney, K.J., Villanueva, M.: Linguistics of American Sign Language: An Introduction. Gallaudet University Press (2011)
81. Wan, J., Li, S., Zhao, Y., Zhou, S., Guyon, I., Escalera, S.: Chalearn looking at people rgb-d isolated and continuous datasets for gesture recognition. In: Proceedings of CVPR Workshops. IEEE (2016)
82. Wang, H., Wang, P., Song, Z., Li, W.: Large-scale multimodal gesture recognition using heterogeneous networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3129–3137 (2017)
83. Yang, H., Sclaroff, S., Lee, S.: Sign language spotting with a threshold model based on conditional random fields. IEEE Pattern Analysis and Machine Intelligence 31(7), 1264–1277 (2009)
84. Yang, H.D.: Sign language recognition with the kinect sensor based on conditional random fields. Sensors 15, 135–147 (2015)
85. Yang, R., Sarkar, S., Loeding, B.: Handling movement epenthesis and hand segmentation ambiguities in continuous sign language recognition using nested dynamic programming. IEEE Pattern Analysis and Machine Intelligence 32(3), 462–477 (2010)
86. Ye, Y., Tian, Y., Huenerfauth, M.: Recognizing american sign language gestures from within continuous videos. The 8th IEEE Workshop on Analysis and Modeling of Faces and Gestures (AMFG), in conjunction with CVPR 2018 (2017)
87. Yue-Hei Ng, J., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R., Toderici, G.: Beyond short snippets: Deep networks for video classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4694–4702 (2015)
88. Zafrulla, Z., Brashear, H., Starner, T., Hamilton, H., Presti, P.: American sign language recognition with the kinect. In: Proceedings of the International Conference on Multimodal Interfaces, pp. 279–286 (2011)
89. Zhang, C., Tian, Y., Huenerfauth, M.: Multi-modality american sign language recognition. In: Proceedings of IEEE International Conference on Image Processing (ICIP) (2016)
90. Zhang, L., Zhu, G., Shen, P., Song, J., Shah, S.A., Bennamoun, M.: Learning spatiotemporal features using 3dcnn and convolutional lstm for gesture recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3120–3128 (2017)