

Article

Spelling Correction Real-Time American Sign Language Alphabet Translation System Based on YOLO Network and LSTM

Miguel Rivera-Acosta 1 , Juan Manuel Ruiz-Varela 1 , Susana Ortega-Cisneros 1,* , Jorge Rivera 2 , Ramón Parra-Michel 1 and Pedro Mejia-Alvarez 1

1 Advanced Studies and Research Center (CINVESTAV), National Polytechnic Institute (IPN), Zapopan 45019, Mexico; [email protected] (M.R.-A.); [email protected] (J.M.R.-V.); [email protected] (R.P.-M.); [email protected] (P.M.-A.)
2 CONACYT—Advanced Studies and Research Center (CINVESTAV), National Polytechnic Institute (IPN), Guadalajara Campus, Zapopan 45019, Mexico; [email protected]
* Correspondence: [email protected]; Tel.: +52-33-3777-3600

Abstract: In this paper, we present a novel approach that aims to solve one of the main challenges in hand gesture recognition tasks in static images: to compensate for the accuracy lost when trained models are used to interpret completely unseen data. The model presented here consists of two main data-processing stages. A deep neural network (DNN) for performing segmentation and classification is used, in which multiple architectures and input image sizes were tested and compared to derive the best model in terms of accuracy and processing time. For the experiments presented in this work, the DNN models were trained with 24,000 images of 24 signs from the alphabet and fine-tuned with 5200 images of 26 generated signs. The system was real-time tested with a community of 10 persons, yielding a mean average precision and processing rate of 81.74% and 61.35 frames-per-second, respectively. As a second data-processing stage, a bidirectional long short-term memory neural network was implemented and analyzed for adding spelling correction capability to our system, which scored a training accuracy of 98.07% with a dictionary of 370 words, thus increasing the robustness in completely unseen data, as shown in our experiments.

Keywords: american sign language; artificial intelligence; deep learning; deep neural networks; hand posture recognition

Received: 16 March 2021; Accepted: 13 April 2021; Published: 27 April 2021

1. Introduction

Deafness and hearing loss are severe problems that have been affecting society for many years. Unfortunately, according to data from the World Health Organization (WHO) [1], from year to year, the number of people with disabling hearing loss (DHL) will be increasing worldwide. The WHO announces that by 2050 the number of people with DHL will rise from 466 million (2019) to 933 million [2].

These projections highlight the importance of developing a real-time system to facilitate the communication of people with DHL with those who have no knowledge of sign languages. Such relevance can be proven by the extensive research aiming to develop real-time sign language translators with different approaches found in the state-of-the-art.

Sign language is a language that uses hand and facial expressions for communication. It is mainly used by people with DHL for communication with others; nevertheless, hand kinematics are also used for other purposes, such as human–machine interfacing and other human activity recognition (HAR) tasks.

Considering this, computer-based sign language recognition (SLR) is an important topic that could provide many benefits in a wide variety of activities if it is efficiently performed.


One of the most common approaches to capture information for handshape classification is through RGB images. Nevertheless, solving SLR in images in an efficient manner is a challenging task due to many factors. First, images represent a vast amount of information, where most of the scene's area is irrelevant for the interpretation of the gesture. In addition, due to the nature of non-spoken languages, gestures are widely variable in many aspects, e.g., scene lighting, image noise, sensor specifications, among others. Thus, the compilation and labeling of images covering all of these cases represent a hard workload, making it difficult to develop a system that provides good results on unseen data.

Sign language recognition mainly consists of three tasks: hand segmentation within the scene, handshape classification, and handshape translation. Previous works can be classified by the sensor used for hand segmentation and hand gesture extraction.

Microsoft Kinect is a device that provides RGB and depth images. Therefore, hand segmentation is commonly performed using depth information, because the hand is usually closer to the sensor than the rest of the user's body and the background [3,4]. In [3], a colored glove is used for hand segmentation with the Kinect device. These authors proposed a per-pixel classifier, where features of a set of points in the hand area are extracted to classify 24 static American Sign Language (ASL) alphabet signs. Nevertheless, due to the size and nature of the Kinect device, portability is a serious drawback for this approach.

Another method for hand feature extraction is to use sensory gloves. This approach avoids the hand segmentation problem. Therefore, many glove-based systems are found in the state-of-the-art [5,6]. This approach yields good results with low computation requirements, since no image processing is required for hand feature extraction; however, the necessity of a glove is a huge drawback for this methodology, resulting in high implementation costs.

An event-based neuromorphic sensor is used in [7] for the translation of 24 static ASL signs. Their proposed method is based on the generation of a characteristic array that contains information about the contour of the handshape; then a feedforward artificial neural network implemented in a field-programmable gate array (FPGA) is used for the classification of the characteristic array.

As modern algorithms for image processing are emerging, RGB and RGB-D cameras are gaining popularity. Specifically, deep learning and machine-learning algorithms reach high scores in image classification tasks, i.e., deep neural networks (DNN) and support vector machines (SVM) have already been applied in SLR [8,9].

In recent years, the convolutional neural network (CNN) algorithm has been gaining relevance in image classification due to the promising results obtained in real-world applications. Therefore, many CNN-based sign language translators are found in the state-of-the-art [10–12]. Although CNNs are excellent tools for classifying handshapes in SLR, they lack hand segmentation capability, requiring specialized algorithms for this task. Thus, DNNs for object detection in images are being proposed and implemented to perform SLR, such as faster R-CNN [13,14], where real-time translation has been reported.

Razieh Rastgoo et al. [15] used a multimodal approach for handshape recognition using static visual data.
Their proposal is completely based on deep learning; a fine-tuned CNN is used to detect hands in the input images, and seven restricted Boltzmann machines are used to recognize the final hand label. The authors use two modalities of input, RGB images and depth information. The system was trained using four open-sourced datasets, one RGB image dataset and three RGB-depth image datasets. Nevertheless, the experimental results presented by the authors are based on noisy images from datasets (noise was introduced artificially).

Modern devices are emerging, yielding the possibility of designing highly efficient HAR systems based on trainable models. A leap motion controller (LMC) is being used for this purpose. The LMC is an optical sensor that captures the position of the user's hands, allowing researchers to perform handshape recognition without requiring special algorithms for hand segmentation.

Since the LMC automatically detects and tracks the user's hands, it is used to classify dynamic gestures, as in [16], where the authors proposed a dynamic gesture recognition system.

Their approach consists of the following stages: a series of frames are captured from the LMC. These frames contain a set of parameters, such as the acceleration, speed, and position of the hand, palm, and fingers. Using this information, they compute a set of features, such as the Euclidean distance between a fingertip and the palm, angles, among others; these feature arrays are concatenated to create a full dynamic gesture vector. Lastly, a bidirectional long short-term memory is used to classify the time-based gesture vector. Although their proposal scored an accuracy of 95.23%, the training dataset contains only 312 samples, which is significantly smaller than our image datasets.

The proposal of Jordan J. Bird et al. [17] is based on a fused modality of LMC and visual data; according to the results presented by the authors (on unseen data), this multimodal approach scored 76.5% accuracy.

A mixed CNN-LSTM model called DeepConvLSTM for the classification of 60 ASL signs with a leap motion sensor is presented in [18]. The authors used data augmentation to reduce overfitting, for a maximum testing accuracy of 91.1% using the DeepConvLSTM model, which outperformed recurrent neural networks for SLR. Thus, according to the authors, this demonstrates the importance of using convolutional layers for HAR. Nevertheless, as shown by [17], besides the portability loss, RGB sensors score higher accuracy than the LMC on unseen data.

A dynamic hand gesture (air-written digits) feedforward network-based recognition system is proposed in [19]. The authors use an inertial measurement unit (IMU); because their model is implemented on an FPGA device and takes advantage of FPGA parallelism, their system can learn in real time from time-dependent IMU data.

As proved by the state-of-the-art proposals in sign language recognition, for many years researchers have been focusing on increasing sign recognition accuracy by using more complex algorithms or modern and complex sensors, even multiple sensors. Besides considerably reducing portability, these approaches increase overall system complexity and costs. Moreover, focusing on increasing recognition accuracy only partially solves the problem, for two reasons. First, in most of the literature, the reported accuracy is calculated using data from the same dataset or captured under similar conditions. Therefore, this accuracy is expected to be drastically lower under real-world conditions. The second reason is that previous proposals have a major drawback that must be highlighted; the predictions of recognition systems based on a single sensor often depend only on the result of a single classifier that only considers information of the present time. Thus, if their classifier makes a wrong prediction, the SLR system's output will be wrong. On the other hand, to solve the above-mentioned issues, we propose a handshape classifier whose output does not depend on a single prediction, but information from the past, present and future is used for computing the system output, increasing the robustness of our system in real-world conditions, as shown in our experiments.

This problem yields the main motivation of this work: to propose a methodology to compensate for the accuracy loss of SLR in RGB images when unseen data are presented to a trained model, with a single sensor under real-time conditions. Initially, we analyze two models of DNNs for performing the classification of 24 classes of ASL alphabet signs plus two additional signs, i.e., 26 static signs in total.
These experiments were carried out with completely unseen data (images were captured in different background scenes and with a different camera sensor) to support our hypothesis. With these experiments' findings, we select the most suitable DNN model; the selection metric is the model with the highest accuracy while accomplishing real-time restrictions. Following, we present an approach for dismissing DNN misclassifications with a bidirectional long short-term memory (LSTM), i.e., the LSTM is used for performing spelling corrections on the predictions of the DNN. Next, the results of a series of experiments of the here presented multimodality model with a community of 10 users are presented.

Results have shown that our proposal can achieve higher accuracy than those in the state-of-the-art in the field of sign language recognition in RGB images.

Figure 1 shows a block diagram of the process performed by the here presented American Sign Language alphabet translator (ASLAT). It mainly consists of three processing stages. The first stage is an image to letter conversion, which performs the sign language recognition task; this stage contains an image resizer and an object detection DNN, whose input is the image streaming from the camera, and whose output consists of a time-series of letters that are the predictions of the DNN. The second stage is a letter sequence to word conversion that removes all equal consecutive predictions (e.g., RREEEDD is converted into RED). The third data processing stage is a word-spelling correction module, consisting of a bidirectional LSTM neural network, whose input is the reduced letter sequence and whose output is the predicted translated word of its dictionary. In our work, this correction feature is designed to compensate for accuracy lost in the ASL alphabet translator. Nevertheless, it can be used to enhance the results in other DNN applications by considering past, present and future predictions.

Figure 1. Processing stages of the proposed American Sign Language alphabet translator.
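To make the second stage concrete, the following minimal sketch (in Python, with function and variable names of our own choosing, not code from the paper) collapses the DNN letter stream by removing equal consecutive predictions:

```python
def collapse_letter_stream(letters):
    """Remove equal consecutive predictions, e.g. 'RREEEDD' -> 'RED'.

    Minimal sketch of the letter-sequence-to-word stage described above,
    not the authors' implementation.
    """
    reduced = []
    for letter in letters:
        if not reduced or reduced[-1] != letter:
            reduced.append(letter)
    return "".join(reduced)


print(collapse_letter_stream("RREEEDD"))   # RED
print(collapse_letter_stream("HHEELLOO"))  # HELO
```

Note that a word such as HELLO is reduced to HELO at this stage; restoring the double letter is left to the word-spelling correction module, as described above.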

In summary, the main contributions of this work are listed below:
• A novel approach for solving SLR is proposed, where the output does not depend on a single algorithm (classifier), but a second stage is added for considering past, present and future input information, thus increasing the robustness of the system in unseen data;
• Proof that, with our SLR multimodality model, the interpretation accuracy of complex words can be improved by adding a bidirectional LSTM for spelling correction concerning state-of-the-art approaches;
• A methodology to generate SLR systems using regression DNNs yielding high accuracy at real-time speeds, which was demonstrated in our experiments with different users;
• An open-source ASL dataset that can be used for training ASL translation systems and natural language processing algorithms.

The rest of this paper is organized as follows: Section 2 presents the methodology and resources used during the development of this work. Section 3 contains the results of our proposal. Section 4 presents the discussions.

2. Materials and Methods

Our proposal is based on an open-access top object detection network, trained with an open-access ASL alphabet dataset [20], and fine-tuned with an own created image dataset called ASLYset.

In addition, a semi-character level word spelling correction (WSC) was designed. The purpose of the WSC is to add output sequence to word conversion and spelling correction features to the SLR. Currently, its dictionary contains 370 words (words the WSC is trained to correct, including pronouns, colors, numbers, animals, names, verbs, and objects). However, since artificial intelligence is used, it can correct more words by properly generating the training dataset.


The here presented ASLAT is based on a graphical user interface, which contains a frame showing the real-time images captured by the camera and three lines of text containing, in descending order, the SLR response for the current image, the sentence obtained from the sign sequence to word converter, and the corrected sentence resulting from the WSC, as shown in Figure 2. However, these lines were added for testing purposes, since a text-to-speech stage will be added in future releases.

Figure 2. Proposed American Sign Language alphabet translator graphical user interface.

The experiments for testing the here presented ASLAT for the processing of live camera streaming (Figure 3) were carried out with two hardware configurations using Nvidia GPUs (1 × 4 GB GTX-1050 and 1 × 8 GB GTX-1080, Santa Clara, CA, USA) [21].

Figure 3. American Sign Language alphabet translator experiments environment.

Three groups of user volunteers participated in the generation of the datasets and the testing experiments of this ASLAT. The first group consisted of 4 persons (male) aged between 23 and 33 years. They generated the ASLYset used for fine-tuning the DNN networks.

The second group consisted of 5 persons (3 persons of the first group plus two volunteers), who generated the testing dataset (not considered for training nor fine-tuning) used for calculating the results presented in Section 3. The third group, formed by 10 persons (8 males and 2 females) aged between 23 and 35 years, tested the whole ASLAT in real-time conditions with the word spelling correction feature included.

The open-source ASL alphabet dataset and the ASLYset were used to train two different DNN architectures with three different input image sizes, which were analyzed to select the architectures with higher mean average precision (mAP) [22] while maintaining real-time speed to be included in the ASLAT.

2.1. YOLO Networks

You only look once (YOLO) is an open-source object detection model developed by Joseph Redmon et al. [23–25]. YOLO networks use a single CNN to predict bounding boxes and class probabilities for those boxes in a single network evaluation. According to the authors, this behavior allows the YOLO algorithm to detect objects in images at a faster frame rate and with higher accuracy than other object detection algorithms, such as deformable parts models [26], R-CNN [27], Fast R-CNN [28], and faster R-CNN [29].

In the state-of-the-art, YOLO networks are commonly used for many applications in the field of object detection in images, mainly due to their capacity to perform real-time video processing with high accuracy. YOLO networks have been successfully applied in pedestrian detection [30], vehicle license plate detection [31], medical applications as in [32], and sign language recognition [33], where 99.91% (mAP@50) is reported.
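As an illustration (not the authors' code), a Darknet-format YOLO model such as the ones used here could be run on a single frame through OpenCV's DNN module as sketched below; the configuration, weight, and class-name file names are hypothetical placeholders.

```python
import cv2
import numpy as np

# Hypothetical file names; any Darknet-format YOLO model trained on the
# 26 ASL alphabet signs could be loaded the same way.
net = cv2.dnn.readNetFromDarknet("asl_yolo.cfg", "asl_yolo.weights")
classes = [line.strip() for line in open("asl.names")]

frame = cv2.imread("sample_sign.jpg")
blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (352, 352), swapRB=True, crop=False)
net.setInput(blob)
outputs = net.forward(net.getUnconnectedOutLayersNames())

# Keep the single most confident detection, since one sign is expected per frame.
best_score, best_class = 0.0, None
for output in outputs:            # one array per YOLO detection layer
    for det in output:            # det = [cx, cy, w, h, objectness, class scores...]
        scores = det[5:]
        class_id = int(np.argmax(scores))
        score = float(det[4] * scores[class_id])
        if score > best_score:
            best_score, best_class = score, classes[class_id]
print(best_class, best_score)
```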

2.2. Images Dataset

To design a functional ASLAT, a proper image dataset must be generated, and every image must be labeled to train the YOLO networks. These images must meet some requirements to increase the robustness of the ASLAT, such as variation in orientation, distance (camera to hand), hand shape, light conditions, among others.

For the training of the DNNs, an open-source ASL alphabet dataset was used [20], which consists of 87,000 RGB images of 29 classes, of which 26 are the letters A–Z and 3 classes are for SPACE, DELETE, and NOTHING. The images have a size of 200 × 200 pixels and were generated with variations in terms of persons, background and lighting conditions. However, for the purpose of this work, we used 24,000 images of this dataset, consisting of 1000 images for every A–Y letter. Since the here presented ASLAT is based on static image processing, the J and Z signs were excluded from the experiments due to their movement-based nature. The 24,000 images of the ASL alphabet dataset were labeled in YOLO format using the LabelingMaster [34] software; the bounding box was placed on the hand shape only, so the most relevant features of every ASL sign are prioritized, as can be seen in sign "A" (marked with a blue box) of Figure 4.
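For reference, a YOLO-format label stores one text line per object with the class index and the bounding box center and size normalized to the image dimensions. The helper below, written purely for illustration, converts a pixel-coordinate hand box into such a line.

```python
def yolo_label_line(class_id, box, img_w, img_h):
    """box = (x_min, y_min, x_max, y_max) in pixels -> 'class cx cy w h' YOLO line."""
    x_min, y_min, x_max, y_max = box
    cx = (x_min + x_max) / 2.0 / img_w
    cy = (y_min + y_max) / 2.0 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"

# Example: sign "A" (class 0) tightly boxed around the hand in a 200 x 200 image.
print(yolo_label_line(0, (60, 40, 150, 170), 200, 200))
```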

Figure 4. Samples of the American Sign Language fingerspelling dataset.


We generated an ASL alphabet fingerspelling dataset (called ASLYset) for fine-tuning the DNNs. All the images in the ASLYset were captured using an HP Wide Vision HD RGB camera. This dataset was generated with four users (non-speakers of ASL) who were asked to perform the signs shown in Figure 5 [7]. It consists of 5200 images of size 416 × 416; volunteers hand-spelled 24 ASL alphabet signs (the whole alphabet excluding "J" and "Z") and two additional signs called "SP" and "FN", and each volunteer generated 50 images for each of the 26 signs. All of the images were labeled for object detection using the YOLO format; however, this dataset could be useful for training or testing other computer vision algorithms (e.g., faster R-CNN) by properly labeling the images. The image labeling was performed using the methodology mentioned above, where only the hand shape was marked (arm position and orientation were not considered), so the most relevant features of each sign were prioritized for learning. The ASLYset was divided into two categories, training (ASLYtrain) and testing (ASLYtest) datasets of 3900 and 1300 images, respectively. This dataset can be downloaded at https://data.mendeley.com/datasets/xs6mvhx6rh/1 (accessed on 14 April 2021).
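Since each of the 26 signs has 200 images (4 volunteers × 50), the 3900/1300 division corresponds to holding out 50 images per sign for ASLYtest. A minimal sketch of such a per-sign split is shown below; the directory layout and file names are assumptions, not part of the released dataset.

```python
import random
from pathlib import Path

# Assumed layout: ASLYset/<SIGN>/<image>.jpg with a matching YOLO .txt label.
random.seed(0)
train, test = [], []
for sign_dir in sorted(Path("ASLYset").iterdir()):
    images = sorted(sign_dir.glob("*.jpg"))
    random.shuffle(images)
    test.extend(images[:50])     # 50 test images per sign -> 1300 in total
    train.extend(images[50:])    # remaining 150 per sign  -> 3900 in total

Path("ASLYtrain.txt").write_text("\n".join(str(p) for p in train))
Path("ASLYtest.txt").write_text("\n".join(str(p) for p in test))
```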

Figure 5. American Sign Language alphabet used in ASLYset.

A sample of the generated ASLYset resized to 352 × 352 can be seen in Figure 6.

Figure 6. Samples of the signs H, R, K, F, A, and C of the ASLYset.


The presented experimental results were obtained using a dataset generated for this specific purpose and not considered for training (this testing set will be called testset in the rest of this paper). The testset contains 1300 images from five volunteers, consisting of 50 images per sign. All the images were captured with an e-CAM132_TX2 RGB camera (a different sensor than the one used for the training and fine-tuning datasets) and in two different background scenes. A sample of the testset resized to 352 × 352 can be seen in Figure 7.

Figure 7. Samples of the signs B, E, I, S, G, and D of the testing dataset.

The ASLYset dataset contains the following contributions:
• All of the images contained in this dataset were captured in a complex background scene, with different persons and handshapes; this allows training DNNs with more effectiveness for real-world conditions;
• Researchers in the field of computer vision for natural language processing, sign language recognition, and human–computer interfacing using handshape commands can benefit from this dataset;
• These data can be used to train DNNs for object detection in images to perform American Sign Language alphabet translation;
• The dataset does not contain only images for ASL alphabet translation, but every image is labeled with YOLO format for a faster and easier usage of these data;
• The good practical results obtained by using DNNs greatly depend on the quality and amount of data used for training; this dataset can be used to train a YOLO network or to enrich another dataset for sign language recognition.

2.3. YOLO Network Models

To obtain a real-time ASLAT with high mAP, some experiments were performed to select the proper YOLO network architecture for this specific application. For this purpose, two architectures (called A1 and A2 in the rest of the paper) were tested and analyzed (Figure 8). The architectures are based on the networks YOLOv3-tiny and YOLOv3, respectively [35]. Each architecture was implemented with three different input image sizes to obtain a wider variety of options for selecting the most suitable architecture. The input image sizes are 288 × 288, 352 × 352, and 416 × 416, based on sizes used by the authors in [24].


Figure 8. A1 and A2 You only look once (YOLO) architectures used for the experiments.

Table 1 shows the layer parameters of YOLO models A1 and A2.

Table 1. Architectures A1 and A2 layers parameters description.

A1 (416 × 416) 1:
C1—416 × 416, 32 × 3 × 3, 1
P2—416 × 416, 2 × 2, 2
C3—208 × 208, 64 × 3 × 3, 1
P4—208 × 208, 2 × 2, 2
C5—104 × 104, 128 × 3 × 3, 1
P6—104 × 104, 2 × 2, 2
C7—52 × 52, 256 × 3 × 3, 1
P8—52 × 52, 2 × 2, 2
C9—26 × 26, 512 × 3 × 3, 1
P10—26 × 26, 2 × 2, 1
C11—26 × 26, 1024 × 3 × 3, 1
C12—26 × 26, 256 × 1 × 1, 1
C13—26 × 26, 512 × 3 × 3, 1
C14—26 × 26, 93 × 1 × 1, 1
D15
C16—26 × 26, 128 × 1 × 1, 1
U17
C18—52 × 52, 256 × 1 × 1, 1
C19—52 × 52, 93 × 1 × 1, 1
D20

A2 (352 × 352) 1:
C1—352 × 352, 32 × 3 × 3, 1
C2—352 × 352, 64 × 3 × 3, 2
C3—176 × 176, 32 × 1 × 1, 1
C4—176 × 176, 64 × 3 × 3, 1
C5—176 × 176, 128 × 3 × 3, 2
C6—88 × 88, 64 × 1 × 1, 1
C7—88 × 88, 128 × 3 × 3, 1
C8—88 × 88, 256 × 3 × 3, 2
C9—44 × 44, 128 × 1 × 1, 1
C10—44 × 44, 512 × 3 × 3, 1
C11—44 × 44, 1024 × 3 × 3, 2
C12—22 × 22, 512 × 1 × 1, 1
C13—22 × 22, 1024 × 3 × 3, 1
C14—22 × 22, 512 × 1 × 1, 1
C15—22 × 22, 1024 × 3 × 3, 2
C16—11 × 11, 512 × 1 × 1, 1
C17—11 × 11, 1024 × 3 × 3, 1
C18—11 × 11, 512 × 1 × 1, 1
C19—11 × 11, 1024 × 3 × 3, 1
C20—11 × 11, 93 × 1 × 1, 1
D21
C22—11 × 11, 256 × 1 × 1, 1
U23
C24—22 × 22, 256 × 1 × 1, 1
C25—22 × 22, 512 × 3 × 3, 1
C26—22 × 22, 256 × 1 × 1, 1
C27—22 × 22, 512 × 3 × 3, 1
C28—22 × 22, 256 × 1 × 1, 1
C29—22 × 22, 512 × 3 × 3, 1
C30—22 × 22, 93 × 1 × 1, 1
D31
C32—22 × 22, 128 × 1 × 1, 1
U33
C34—44 × 44, 128 × 1 × 1, 1
C35—44 × 44, 256 × 3 × 3, 1
C36—44 × 44, 128 × 1 × 1, 1
C37—44 × 44, 256 × 3 × 3, 1
C38—44 × 44, 128 × 1 × 1, 1
C39—44 × 44, 256 × 3 × 3, 1
C40—44 × 44, 93 × 1 × 1, 1
D41

1 Cxx is a convolutional layer with parameters IW × IH, NF × FW × FH, ST; Pxx is a pooling layer with parameters IW × IH, FW × FH, ST; where IW, IH, NF, FW, FH, and ST are input feature map width and height, number of filters, filter width and height, and stride, respectively. Uxx and Dxx denote upsample and detection layers.

Electronics 2021, 10, x FOR PEER REVIEW 11 of 19

2.4. Word Spelling Correction

As previously mentioned, the main processing module of the WSC is a bidirectional LSTM-based Recurrent Neural Network (RNN) [36]. Its input consists of the sequence of letters obtained from the SLR subsystem. Nevertheless, as seen in Figure 1, to decrease the number of letters that form the output phrase, the ASLAT contains a sign sequence to word converter, which omits equal subsequent letters; this yields the misspelling of words that contain this property (e.g., HELLO, DIFFERENCE); however, this can be dismissed by the WSC. Since the RNN is trained to receive a single word at a time, the WSC is required to detect when all the letters representing the desired word have been issued to the SLR subsystem; this is achieved by issuing the "SP" sign (Figure 5). The "FN" sign is used to clear the output phrase. Nevertheless, it can be used to include the text-to-speech capability in future releases.

RNN Architecture

The used semi-character level RNN is based on the work presented in [37], whose input word codification and methodology to generate training words were used to design our WSC. As above-mentioned, each time step (t) is represented by a word issued to the RNN. Every word is encoded in a vector of size 3N, where N is the number of possible letters forming the words (24 ASL alphabet signs). Thus, the size of the input vector (x) of the RNN is 72. Vector (x) is formed by the concatenation of three sub-vectors (b, i, e) corresponding to the position of the letters in the word, as seen in the following equation:

x(t) = (b(t), i(t), e(t)),   (1)

Vectors b and e are one-hot representations of the beginning and ending letters of the word, respectively, and vector i represents the internal letters. For example, the word "ENVIRONMENT" is encoded as b = {E = 1}, i = {N = 3, V = 1, I = 1, R = 1, O = 1, M = 1, E = 1}, e = {T = 1}, with the remaining elements equal to zero. The output of the RNN is a classification dense layer using a softmax activation function (calculated with the next equation), where xn is output n of the LSTM dense layer and W is the number of outputs, i.e., the number of words in the dictionary (370 for the experiments), as shown in Figure 9:

y(n) = e^xn / (∑m = 1..W e^xm),   (2)
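The sketch below illustrates the encoding of Equation (1) for the 24-letter sign alphabet and one possible Keras realization of the bidirectional LSTM with a softmax dense layer of Figure 9. The bidirectional LSTM structure, the input width (3N = 72) and the output size (370 words) come from the text; the hidden size (128) and the use of Keras are our assumptions.

```python
import numpy as np
import tensorflow as tf

ALPHABET = "ABCDEFGHIKLMNOPQRSTUVWXY"   # the 24 static ASL alphabet letters
IDX = {c: i for i, c in enumerate(ALPHABET)}
N = len(ALPHABET)                        # 24, so the input vector size is 3N = 72

def encode_word(word):
    """Semi-character encoding x = (b, i, e) of Equation (1)."""
    b, i, e = np.zeros(N), np.zeros(N), np.zeros(N)
    word = word.upper()
    b[IDX[word[0]]] = 1                  # one-hot beginning letter
    e[IDX[word[-1]]] = 1                 # one-hot ending letter
    for c in word[1:-1]:                 # bag of internal letters
        i[IDX[c]] += 1
    return np.concatenate([b, i, e])     # shape (72,)

# b = {E: 1}, i = {N: 3, V: 1, I: 1, R: 1, O: 1, M: 1, E: 1}, e = {T: 1}
x = encode_word("ENVIRONMENT")

# Hypothetical bidirectional LSTM with a softmax dense layer (Equation (2)).
DICT_SIZE = 370                          # words in the WSC dictionary
model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 3 * N)), # one 72-dim vector per word (time step)
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, return_sequences=True)),
    tf.keras.layers.Dense(DICT_SIZE, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```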

Figure 9. Recurrent neural network (RNN) proposed model.

2.5. Spelling Correction Dataset

Since most of the open-sourced spelling correction datasets are extracted from digital documents, their misspelled words mostly represent keyboard typing errors. On the other hand, the WSC is intended to correct SLR misclassifications, which greatly depend on ASL sign similarities (such as letters "T" and "N"; see Figure 5). The RNN training dataset is based on 235 English sentences used on a daily basis. Using these sentences, 11,750 training sentences (formed by incorrect words) were generated by adding errors with the following methodology.

All words containing three or more letters are randomly modified by replacing a single letter of vector i with one of its most similar ASL signs. For example, the word "GRATEFUL" was modified, resulting in incorrect words such as "GRAMEFUL" and "GRATSFUL". In addition, as previously mentioned, one of two equal subsequent letters is removed, i.e., the word "APPOINTMENT" was modified, forming the words "APOINTSENT" and "APOINTMENT".
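A sketch of this noisy-word generation is given below; the table of visually similar handshapes contains only the few confusions named in the text and serves purely as an illustration of the procedure, not as the authors' full similarity table.

```python
import random

# Illustrative subset of visually similar ASL handshapes mentioned in the text.
SIMILAR = {"T": ["M", "N"], "N": ["T", "M"], "M": ["N"], "E": ["S"], "F": ["R", "W"]}

def add_sign_noise(word, rng=random):
    """Replace one internal letter with a similar sign, e.g. GRATEFUL -> GRAMEFUL."""
    if len(word) < 3:
        return word
    candidates = [k for k in range(1, len(word) - 1) if word[k] in SIMILAR]
    if not candidates:
        return word
    k = rng.choice(candidates)
    return word[:k] + rng.choice(SIMILAR[word[k]]) + word[k + 1:]

def drop_double_letter(word):
    """Remove one of two equal subsequent letters, e.g. APPOINTMENT -> APOINTMENT."""
    for k in range(1, len(word)):
        if word[k] == word[k - 1]:
            return word[:k] + word[k + 1:]
    return word

print(add_sign_noise("GRATEFUL"))        # e.g. GRAMEFUL or GRANEFUL
print(drop_double_letter("APPOINTMENT")) # APOINTMENT
```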

3. Results

3.1. YOLO Models Results

All the metrics used for selecting the proper YOLO network architecture were calculated using the testset. On the other hand, ASLYtest was used for performing a proper comparison of our method with state-of-the-art proposals. The criterion to select the YOLO network architecture was the architecture with the highest mAP that accomplished a real-time frames per second (fps) rate; since the here presented translator is intended to process live camera streaming, where 15 and 30 fps are common framerates, 20 fps was considered as real-time in our experiments since it provided smooth video streaming.

Table 2 shows the results obtained from the YOLO network architectures presented in Figure 8 by using the testset (see Figure 7). Although A2 (416 × 416), with 81.76%, resulted in the highest mAP, its 16.65 fps processing rate was shown not to be sufficient to translate the American Sign Language alphabet properly at real-time speed in the experiments. On the other hand, A2 (352 × 352) presented a higher mAP than all the other models while offering a real-time response time. Therefore, this YOLO network model was the most suitable for the ASLAT based on the previously mentioned evaluation metrics. Thus, architecture A2 (352 × 352) was used to carry out the experiments performed by processing live-camera streaming and testing with the WSC system.

Table 2. YOLO Network models performance.

                 Frame-Rate (fps)                    mAP@50 (%) ASLYtest/testset
Image Size       288 × 288  352 × 352  416 × 416     288 × 288     352 × 352     416 × 416
A1 GTX-1050      47.15      36.00      29.54         97.07/72.66   98.86/74.26   99.46/75.63
A1 GTX-1080      122.24     95.32      72.90
A2 GTX-1050      29.81      22.85      16.65         98.06/69.87   99.81/81.74   99.85/81.76
A2 GTX-1080      81.79      61.35      46.03
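The selection rule can be stated compactly: among the candidates, keep those that reach the 20 fps threshold on the target GPU and pick the highest testset mAP. The short sketch below applies it to the GTX-1050 figures of Table 2.

```python
# (architecture, input size): (fps on GTX-1050, testset mAP@50 %) from Table 2
candidates = {
    ("A1", 288): (47.15, 72.66), ("A1", 352): (36.00, 74.26), ("A1", 416): (29.54, 75.63),
    ("A2", 288): (29.81, 69.87), ("A2", 352): (22.85, 81.74), ("A2", 416): (16.65, 81.76),
}
REAL_TIME_FPS = 20.0
real_time = {k: v for k, v in candidates.items() if v[0] >= REAL_TIME_FPS}
best = max(real_time, key=lambda k: real_time[k][1])
print(best)  # ('A2', 352): the highest mAP among the real-time models
```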

The per-sign average precision (AP) for every architecture is shown in Figure 10, where signs "M" and "N" are the letters with the lowest AP for most of the models due to their similarities with other signs. Other signs, such as "F" and "X", showed a large discrepancy in some models. This was the consequence of three main factors. First, overfitting for some models and signs. Second, under some lighting and hand rotation conditions, these signs could be misinterpreted as other signs, such as "R" and "W" in the case of sign "F". In addition, since a different camera sensor was used for calculating the AP and the raw camera resolutions and aspect ratios were considerably different, signs could be significantly deformed during the image rescaling process. However, this could be solved by enhancing the training dataset by including images with varying aspect ratios from different camera sensors or by applying data augmentation as in [18].



Figure 10. YOLO networks average precision per sign using testset.

Since the here presented ASLAT is designed to be used by a single person at a time (in real-time conditions), the SLR subsystem is configured to present the prediction with the highest probability. Besides this, the SLR is designed in such a way that a response is issued only if eight (this number is configurable) consecutive images are classified as the same sign.

Figure 11 presents the recall of model A2 (352 × 352) for every sign using the 1300 images for testing (testset). This represents how successful the model is at detecting presented signs, since it depends on the true positives and false negatives. This metric is important to our system since, as previously mentioned, eight consecutive signs must be detected to be issued to the letter sequence to word converter (see Figure 1).
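A minimal sketch of this stabilization rule (a letter is issued only after eight consecutive identical classifications) is shown below; the class and function names are illustrative, not taken from the authors' implementation.

```python
class SignDebouncer:
    """Emit a letter only after `required` consecutive identical predictions."""

    def __init__(self, required=8):      # eight frames, configurable as in the text
        self.required = required
        self.last = None
        self.count = 0

    def update(self, prediction):
        if prediction == self.last:
            self.count += 1
        else:
            self.last, self.count = prediction, 1
        if self.count == self.required:
            return prediction            # issue the letter exactly once
        return None

deb = SignDebouncer()
stream = ["H"] * 8 + ["E"] * 8           # per-frame DNN predictions
letters = [s for s in (deb.update(p) for p in stream) if s]
print(letters)                           # ['H', 'E']
```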

Figure 11. Sign language recognition (SLR) system signs detection recall using testset.

Recall was calculated using:

Recall(n) = TP(n)/(TP(n) + FN(n)), 26 ≥ n ≥ 1,   (3)

where TP is true positive and FN is false negative. The mean recall of YOLO model A2 (352 × 352) is 71.23%, which was calculated using:

Mean recall = ∑n = 1..26 Recall(n)/26, (4)
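Equations (3) and (4) amount to the per-sign computation sketched below; the (TP, FN) counts used here are only a few example rows of the Figure 12 confusion matrix, each out of 50 test images per sign.

```python
def recall(tp, fn):
    """Equation (3): per-sign recall."""
    return tp / (tp + fn)

# Example per-sign (TP, FN) counts; the full set comes from the Figure 12 matrix.
counts = {"A": (24, 26), "B": (48, 2), "P": (50, 0)}
recalls = {sign: recall(tp, fn) for sign, (tp, fn) in counts.items()}
mean_recall = sum(recalls.values()) / len(recalls)  # Equation (4), over all 26 signs
print(recalls, mean_recall)
```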

To better understand the performance of A2 (352 × 352), Figure 12 shows a confusion matrix of the selected YOLO model. The confusion matrix was generated with 50 images per sign (testing set). In Figure 12, rows and columns represent targets and predictions, respectively; N/D means not detected. As can be seen in the confusion matrix, where false positives (FP) and false negatives (FN) are shown, signs such as "X", "M", and "F" present the highest FN; as above-mentioned, overfitting and sign deformation during image rescaling can explain these misclassifications. The signs with the most similarities, such as "N" and "T", present the highest FP.


Figure 12. YOLO model A2 (352 × 352) confusion matrix.

3.2. Bidirectional LSTM Results

The RNN was trained to correct words within an own generated dictionary. The dictionary consists of 235 English sentences, which are of common use on a daily basis. A sample is presented in Table 3. The training resulted in an accuracy of 98.07%.

Table 3. Recurrent Neural Network (RNN) training dataset sample.

Electronics 2021, 10, x FOR PEER REVIEW 14 of 19

Electronics 2021, 10, 1035 13 of 18 Recall was calculated using:

Recall(n) = TP(n)/(TP(n) + FN(n)), 26 ≥ n ≥ 1, (3) where TP is true positive, and FN is false-negative. The mean recall of YOLO model A2 where TP(352 is ×true352) positive, is 71.23%, and which FN is was false- calculatednegative. using: The mean recall of YOLO model A2 (352 × 352) is 71.23%, which was calculated using:

Mean recall = n = 1...26 Recall(n)/26, (4) Mean recall = ∑n = 1..26∑ Recall(n)/26, (4)

To betterTo understand better understand the performance the performance of A2 (352 of A2 × 352), (352 Figure× 352), 12 Figure shows 12 a shows confusion a confusion matrix ofmatrix the selected of the selected YOLO model. YOLO model.The conf Theusion confusion matrix was matrix generated was generated with 50 withimages 50 images per signper (testing sign (testingset). In set).Figure In 12, Figure rows 12 and, rows columns and columns represent represent target and target predictions, and predictions, respectively;respectively; N/D is not N/D detected. is not detected.As can be Asseen can in the be seenconfusion in the matrix, confusion where matrix, false wherepos- false itives (FP)positives and false (FP) negatives and false (FN) negatives are shown; (FN) signs, are shown;such as signs,“X”, “M”, such and as “F” “X”, presents “M”, and “F” presents the highest FN, as above-mentioned, overfitting, and sign deformation during the highest FN, as above-mentioned, overfitting, and sign deformation during image image rescaling can explain these misclassifications, and signs with the most similarities, rescaling can explain these misclassifications, and signs with the most similarities, such such as, “N”, and “T” presents the higher FP. as, “N”, and “T” presents the higher FP.

A 24 0 0 0 0 0 0 0 2 0 0 2 0 0 0 0 0 0 12 0 0 0 0 0 0 0 10 B 0 48 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 C 0 0 36 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 13 D 0 0 0 35 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 13 E 0 0 0 0 44 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6 F 0 0 0 0 0 24 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 24 G 0 0 1 0 0 0 34 3 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 11 H 0 0 0 0 0 0 2 40 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8 I 0 0 0 4 0 0 0 0 36 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 9 K 0 0 0 0 0 0 0 0 0 41 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 9 L 0 0 0 0 0 0 0 0 0 1 38 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 10 M 0 0 0 0 0 0 0 0 0 0 0 12 10 0 0 0 0 0 1 0 0 0 0 0 0 0 27 N 1 0 0 0 0 0 0 0 0 0 0 2 32 0 1 0 0 0 0 1 0 0 1 0 0 0 12 O 0 0 1 0 0 0 0 0 0 0 0 0 0 38 2 0 0 0 0 0 0 0 0 0 0 0 9 P 0 0 0 0 0 0 0 0 0 0 0 0 0 0 50 0 0 0 0 0 0 0 0 0 0 0 0 Q 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 48 0 0 0 0 0 0 0 0 0 0 1 R 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 22 0 0 0 0 0 7 0 0 0 21 S 0 0 0 0 0 1 0 0 0 0 0 1 4 0 0 0 0 34 2 0 0 0 0 0 0 0 8 T 4 1 0 0 0 0 0 0 0 0 0 0 12 0 0 0 0 0 21 0 0 0 0 0 0 0 12 U 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 44 0 0 0 0 0 0 5 V 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 41 1 0 0 0 0 6 W 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 38 0 0 0 0 9 X 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 8 0 0 3 0 0 10 0 0 0 27 Y 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 39 0 0 8 SP 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 47 0 3 FN 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 50 0 A B C D E F G H I K L M N O P Q R S T U V W X Y SP FN N/D

Figure 12. YOLO model A2 (352 × 352) confusion matrix.
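For clarity, Equations (3) and (4) can be reproduced directly from the confusion-matrix counts. The following minimal sketch illustrates one possible computation, where non-detections simply add to the false negatives; the class counts below are placeholders, not the actual values of Figure 12.

```python
# Sketch: per-class recall (Equation (3)) and mean recall (Equation (4))
# computed from confusion-matrix counts. The numbers are placeholders.
import numpy as np

def per_class_recall(confusion: np.ndarray) -> np.ndarray:
    """confusion[i, j] = samples of target class i predicted as class j.
    A trailing N/D (not detected) column, if present, only adds to FN."""
    tp = np.diag(confusion[:, :confusion.shape[0]])   # correct predictions
    fn = confusion.sum(axis=1) - tp                    # everything else in the row
    return tp / (tp + fn)

# Toy 3-class example with an extra N/D column.
conf = np.array([[45, 3, 2, 0],
                 [ 4, 40, 4, 2],
                 [ 1, 2, 44, 3]])
recall = per_class_recall(conf)
print(recall, "mean recall:", recall.mean())   # mean is taken over 26 classes in the paper
```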

3.2. Bidirectional LSTM Results

The RNN was trained to correct words within a custom-generated dictionary. The dictionary consists of 235 English sentences of common daily use; a sample is presented in Table 3. Training resulted in an accuracy of 98.07%.
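The following minimal Keras sketch only illustrates the general idea of a character-level bidirectional LSTM spelling corrector; the alphabet, maximum word length, layer width, and training pairs below are illustrative assumptions and do not reproduce the exact WSC configuration used in this work.

```python
# Minimal sketch of a character-level bidirectional LSTM word spelling corrector.
# Layer sizes, MAX_LEN and the alphabet are illustrative assumptions.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

ALPHABET = "abcdefghijklmnopqrstuvwxyz "   # 26 letters plus a padding space
MAX_LEN = 12                               # assumed maximum word length
NUM_CHARS = len(ALPHABET)

def encode(word: str) -> np.ndarray:
    """One-hot encode a word, padded with spaces to MAX_LEN."""
    word = word.lower().ljust(MAX_LEN)[:MAX_LEN]
    onehot = np.zeros((MAX_LEN, NUM_CHARS), dtype=np.float32)
    for i, ch in enumerate(word):
        onehot[i, ALPHABET.index(ch)] = 1.0
    return onehot

# Noisy word in, corrected word out, character by character.
model = models.Sequential([
    layers.Input(shape=(MAX_LEN, NUM_CHARS)),
    layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
    layers.TimeDistributed(layers.Dense(NUM_CHARS, activation="softmax")),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# Tiny illustrative pairs in the spirit of Table 3 (noisy -> correct).
pairs = [("liue", "like"), ("mave", "make"), ("restaubant", "restaurant")]
x = np.stack([encode(noisy) for noisy, _ in pairs])
y = np.stack([encode(correct) for _, correct in pairs])
model.fit(x, y, epochs=5, verbose=0)
```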

Table 3. Recurrent Neural Network (RNN) training dataset sample.

Correct Sentence | Wrong Sentence
I would like to make an appointment | i would liue to mave an apointsent
is there a restaurant close to us | is theue a restaubant clote to us
oscar is helping me with my homework | oscar is helpsng me winh my honework
hello I am cesar | hslo i am cesyr
I think it tastes good | i thimk it tases god

3.3. Real-Time Experiment Results

The experiments consisted of processing live camera streaming with the previously mentioned hardware, using the model A2 (352 × 352); an HP Wide Vision HD RGB camera was used for capturing the images. The process to test the ASLAT consisted of 10 persons who are not ASL speakers (including the dataset volunteers) being asked to spell a sentence found in the dictionary, obtaining the results shown in Table 4. Similarly to the dataset generation process, users were asked to perform the signs with variations in hand position, allowing natural variations such as rotation and occlusion, since these problems are often present in non-controlled environments.

As seen in Table 4, the WSC produces some severely wrong predictions, such as turning the words "licenxe" and "tabout" into "outside" and "vacant", respectively, even though their Levenshtein distance to the ground-truth words is 1. For the case of "licenxe", the noisy-word generator produced training words such as lscense, lioense, licsnse, licemse, licese, licetse, and licente, but "licenxe" itself was not found in the training dataset (not even the letter "x"), which suggests that the LSTM model was overfitting to the noisy training words. Therefore, we upgraded the LSTM training dataset by improving our noisy-word generation methodology and by optimizing the RNN model.
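As an illustration of the kind of noisy-word generation described above, the following sketch corrupts a dictionary word with random substitutions and deletions; the edit probabilities and the helper name corrupt are assumptions for illustration only, not the exact methodology used to build the training set.

```python
# Sketch of a noisy-word generator producing corrupted spellings
# (substitutions and deletions) of a dictionary word.
import random
import string

def corrupt(word: str, n_edits: int = 1, seed=None) -> str:
    rng = random.Random(seed)
    chars = list(word)
    for _ in range(n_edits):
        pos = rng.randrange(len(chars))
        if rng.random() < 0.5 and len(chars) > 2:
            del chars[pos]                                    # deletion, e.g. "licese"
        else:
            chars[pos] = rng.choice(string.ascii_lowercase)   # substitution, e.g. "licemse"
    return "".join(chars)

# Example: generate a handful of noisy variants of "license".
print({corrupt("license", seed=s) for s in range(8)})
```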

Table 4. American Sign Language alphabet translator (ASLAT) real-time results.

User | Sentence | A2 (352 × 352) Prediction | WSC
1 | I like pancakes | i like patcakes | i like pancakes
2 | Is there a restaurant close to us | is there a reistaurat close to us | is there a restaurant close to us
3 | Please wait for me | pletse wait for me | please wait for me
4 | Andrea plays the piano | atdrea plays the siatosts | andrea plays the piano
5 | The boss is busy | the bcs is brusy | the boss is busy
6 | We need to rent a room | we ned to rent a rom | we need to rent a room
7 | I am getting my license | si am getting my licenxe | of am getting my outside
8 | Could be better | could be beaer | could be better
9 | I should leave | i shculd letve | i should leave
10 | I am not sure about that | xi am not sure tabout thta | of am not sure vacant wanna

Table 5 shows the Levenshtein distance between the correct sentence and the SLR output, and between the correct sentence and the WSC output. Since the WSC output represents the whole system response, its misclassifications have large repercussions on the output sentence, as in the sentences of users 7 and 10, where the WSC misclassifications significantly increased the Levenshtein distance between ground truth and prediction. This is a major drawback of our system; therefore, the WSC must be improved to reduce these errors as much as possible.

Table 5. Levenshtein distance between sentence and A2 (352 × 352) prediction and sentence and word spelling correction (WSC) output.

User | Levenshtein Distance: Sentence/A2 (352 × 352) Prediction | Levenshtein Distance: Sentence/WSC
1 | 1 | 0
2 | 2 | 0
3 | 1 | 0
4 | 6 | 0
5 | 3 | 0
6 | 2 | 0
7 | 2 | 8
8 | 2 | 0
9 | 2 | 0
10 | 4 | 11
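The distances in Table 5 follow the standard Levenshtein definition, i.e., the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. A minimal dynamic-programming sketch is shown below.

```python
# Standard dynamic-programming Levenshtein distance, as used in Table 5 to
# compare predictions against the ground-truth sentences.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Example from Table 4, user 3: a single substituted character.
assert levenshtein("please wait for me", "pletse wait for me") == 1
```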

A video of the ASLAT presented here working in real time can be seen in the Supplementary Materials.

Table 6 presents a comparison of the translation system presented herein with other proposals found in the state-of-the-art. The two accuracy values reported for our work correspond to the mAP@50 of the A2 (352 × 352) network on the test set and to the mAP obtained with completely unseen data, respectively, since not all works provide results with completely unseen data.

Table 6. Comparison of the proposed translator with state-of-the-art gesture recognition approaches. Note that not all works are based on visual data.

Proposal | Sensor | Classification Level | Dataset Samples | Accuracy (%)
Ours | RGB | Letters (single sign) | 29,200 | 99.81/81.74 (mAP@50)
[9] | RGB | Letters (single sign) | 2400 | 99.5
[10] | RGB–depth | Letters (single sign) | 48,000 | 99.9
[33] | RGB | Letters (single sign) | 4547 | 99.91 (mAP@50)
[15] | RGB–depth | Letters (single sign) | 2524–131,000 | 99.31–98.13
[19] | IMU | Numbers (dynamic gesture) | 1000 | 98.60
[16] | LMC | Words (dynamic gesture) | 300–480 | 96.67–95.23
[11] | RGB | Letters (single sign) | 14,781 | 95.54
[17] | RGB–LMC | Gestures (dynamic–BSL) | 16,498 | 94.44
[13] | RGB | Letters (single sign) | 130,000 | 93.33
[18] | LMC | Words and numbers | 16,890 | 91.1

4. Discussion

Approximately 6.1% of the world's population is affected by hearing loss and, according to the WHO, this number will increase in the future. Therefore, there are many related works aiming to solve hand gesture recognition [7–19]. Modern algorithms and resources are now available that make it viable to develop an easy-to-use ASLAT with spelling correction features for common users, as presented herein.

Nevertheless, efficiently performing sign language recognition in images is a challenging task due to many factors. First, images represent a vast amount of information, and most of the scene's area is irrelevant for the interpretation of the gesture. Therefore, the segmentation and classification of the relevant/irrelevant information in every image consume most of the processing time. Thus, a specialized algorithm is required for these tasks, increasing the system complexity and making it difficult to fulfill real-time timing requirements. In addition, due to the nature of non-spoken languages, gestures vary widely in many aspects, including handshape, scene lighting, image noise, and sensor specifications, among others. Thus, compiling and labeling images that cover all these cases represents a heavy workload, making it difficult to develop a system that provides good results on unseen data.

The experiments have shown that the system presented in this paper can translate 24 ASL alphabet signs, along with two additional signs, in real time and in an efficient manner, with a mAP@50 of 99.81% at a processing rate of 61.35 frames per second. Under real-time conditions, the recognition of the letters with the most similarities, such as "N", "M", and "T", can still be optimized to improve precision. However, by adding word spelling correction capability, these misclassifications are almost fully removed in controlled environments, as shown in the experiments. Since text is naturally sequential, RNNs are often used for text classification [38] and word spelling correction [37].

This work presents a proposal for the challenging task of sign language recognition by providing a method for the segmentation, classification, and translation of the American Sign Language alphabet in static RGB images. Multiple experiments were carried out with unseen data and with images captured by a camera sensor whose specifications differ from those used to generate the training dataset. Although the precision on unseen data is low compared to the training accuracy, the results obtained with our novel proposal of adding a second prediction-correction stage show that the ASL translator presented here is feasible for real-time processing under real-world conditions. However, our proposal has limitations that must be resolved for this. In the experiments, a set of 26 static gestures was considered for the interpretation of a non-spoken language. Nevertheless, in practice, complex dynamic gestures (where not only the hand is involved) are used for everyday interaction. Thus, additional classes are required to allow the practical interpretation of gestural languages.

In addition, as shown in the results, some signs present drastically lower accuracy on unseen data, and one of the causes is model overfitting. To solve this, a richer dataset must be used for training the models, not only by adding more images under the same conditions or using data augmentation, but also by including more sensors, background scenes, and handshapes in the experiments. In addition, a model pretrained with the COCO dataset [39] could be used rather than training models from scratch. This could lead to better SLR performance.

Since a DNN is used for the segmentation and classification of the handshape, the required computational resources are massive in order to fulfill the timing restrictions of real-time processing. This is particularly challenging for devices with low levels of hardware parallelism, such as microprocessors. Thus, quantized models must be included in the experiments to analyze how feasible the implementation of this ASL translator is on mobile devices such as smartphones.

In addition, since RNNs are trained to recognize time-based patterns, they can be used to classify and correct the sequence of words obtained with the SLR and WSC subsystems, along with movement-based ASL alphabet signs such as the letters "J" and "Z", and even complex non-alphabet movement-based ASL signs such as "Hello". This represents several advantages since, currently, all the words that form a phrase must be issued one by one, making it difficult to use this ASLAT for long periods. By adding this feature, the output phrase could be rendered as sound, enabling efficient real-time communication between persons affected by hearing loss or speech difficulties and those with no knowledge of sign languages.

Currently, the SLR subsystem is configured to present a single sign as the classification response (i.e., the one with the highest probability), but, due to the methodology used to design this ASLAT (YOLO–LSTM synergy), a more accurate translation could be achieved by training the WSC to accept as input, for example, the two signs with the highest confidence. This could be useful in some cases. For example (hypothetical case), if the word "leave" is being spelled, when the letter "a" is issued, the SLR classification results could be t = 85% and a = 84%; with the current methodology, only the information of the letter "t" is considered in the WSC (i.e., the information of "a" is discarded even though it is the correct letter). Thus, by considering more than one SLR classification response, a higher accuracy could be achieved. However, this experimentation will be performed in future releases.
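As a rough illustration of this last idea (future work, not the implemented pipeline), the following sketch keeps the two highest-confidence SLR classifications per issued sign; the helper name and the score format are assumptions.

```python
# Rough sketch: keep the two highest-confidence SLR classifications per issued
# sign so that the WSC could later weigh both candidates instead of only the
# arg-max. Helper name and score format are assumed for illustration.
def top_k_signs(class_scores, k=2):
    """class_scores maps a sign label to its detection confidence in [0, 1]."""
    return sorted(class_scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Hypothetical case from the text: spelling "leave" and issuing the letter "a".
scores = {"t": 0.85, "a": 0.84, "s": 0.10}
print(top_k_signs(scores))  # [('t', 0.85), ('a', 0.84)] -> both could be fed to the WSC
```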

Supplementary Materials: The following are available online at https://www.mdpi.com/article/10.3390/electronics10091035/s1.

Author Contributions: Conceptualization, M.R.-A., J.M.R.-V., S.O.-C., J.R. and P.M.-A.; methodology, M.R.-A., J.M.R.-V. and S.O.-C.; software, M.R.-A., J.M.R.-V. and J.R.; validation, M.R.-A., J.M.R.-V. and P.M.-A.; formal analysis, M.R.-A., J.M.R.-V. and S.O.-C.; investigation, M.R.-A. and J.M.R.-V.; resources, S.O.-C. and P.M.-A.; writing the paper, M.R.-A., J.M.R.-V., S.O.-C., J.R. and P.M.-A.; reviewing the paper, M.R.-A., J.R. and R.P.-M.; funding acquisition, R.P.-M. All authors have read and agreed to the published version of the manuscript.

Funding: This research was funded by CONACYT, grant number 336595.

Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. World Health Organization. Deafness and Hearing Loss. Available online: https://www.who.int/news-room/fact-sheets/detail/deafness-and-hearing-loss (accessed on 9 September 2020).
2. World Health Organization. WHO Global Estimates on Prevalence of Hearing Loss, Prevention of Deafness WHO. 2018. Available online: https://www.who.int/deafness/Global-estimates-on-prevalence-of-hearing-loss-for-website.pptx?ua=1 (accessed on 9 September 2020).
3. Dong, C.; Leu, M.C.; Yin, Z. Sign Language Alphabet Recognition Using Microsoft Kinect. In Proceedings of the 2015 IEEE Conference on CVPRW, Boston, MA, USA, 7–12 June 2015.
4. Hee-Deok, Y. Sign Language Recognition with the Kinect Sensor Based on Conditional Random Fields. Sensors 2015, 15, 135–147.
5. Cemil, O.; Ming, C.L. American Sign Language word recognition with a sensory glove using artificial neural networks. Eng. Appl. Artif. Intell. 2011, 4, 1204–1213.
6. Ognjan, L.; Miroslav, P. Hand gesture recognition using low-budget data glove and cluster-trained probabilistic neural network. Assem. Autom. 2014, 34, 94–105.
7. Rivera-Acosta, M.; Ortega-Cisneros, S.; Rivera, J.; Sandoval-Ibarra, F. American Sign Language Alphabet Recognition Using a Neuromorphic Sensor and an Artificial Neural Network. Sensors 2017, 17, 2176. [CrossRef]
8. Jie, G.; Wengang, Z.; Houqiang, L.; Weiping, L. Sign Language Recognition Using Real-Sense. In Proceedings of the 2015 IEEE China Summit and International Conference on Signal and Information Processing (ChinaSIP), Chengdu, China, 12–15 July 2015.
9. Md Azher, U.; Shayhan, A.C. Hand Sign Language Recognition for Bangla Alphabet using Support Vector Machine. In Proceedings of the International Conference on Innovations in Science, Engineering and Technology (ICISET), Dhaka, Bangladesh, 28–29 October 2016.
10. Wenjin, T.; Ming, C.L.; Zhaozheng, Y. American Sign Language alphabet recognition using Convolutional Neural Networks with multiview augmentation and inference fusion. Eng. Appl. Artif. Intell. 2018, 76, 202–213.
11. Sarfaraz, M.; Harish, C.T.; Adhyan, S. American Sign Language Character Recognition Using Convolution Neural Network. Smart Computing and Informatics. Smart Innov. Syst. Technol. 2018, 78, 403–412.
12. Yuancheng, Y.; Yingli, T.; Matt, H.; Yingya, L. Recognizing American Sign Language Gestures from within Continuous Videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2064–2073.
13. Dinesh, S.; Sivaprakash, S.; Keshav, M.; Ramya, K. Real-Time American Sign Language Recognition with Faster Regional Convolutional Neural Networks. Int. J. Innov. Res. Sci. Eng. Technol. 2018, 7, 297–305.
14. Oishee, B.H.; Mohammad, I.J.; Md, S.I.; Al-Farabi, A.; Alving, S.P. Real Time Bangladeshi Sign Language Detection using Faster R-CNN. In Proceedings of the International Conference on Innovation in Engineering and Technology (ICIET), Dhaka, Bangladesh, 27–28 December 2018.
15. Rastgoo, R.; Kiani, K.; Escalera, S. Multi-Modal Deep Hand Sign Language Recognition in Still Images Using Restricted Boltzmann Machine. Entropy 2018, 20, 809. [CrossRef] [PubMed]
16. Yang, L.; Chen, J.; Zhu, W. Dynamic Hand Gesture Recognition Based on a Leap Motion Controller and Two-Layer Bidirectional Recurrent Neural Network. Sensors 2020, 20, 2106. [CrossRef] [PubMed]
17. Jordan, J.B.; Anikó, E.; Diego, R.F. Recognition via Late Fusion of Computer Vision and Leap Motion with Transfer Learning to American Sign Language. Sensors 2020, 20, 5151.
18. Vincent, H.; Tomoya, S.; Gentiane, V. Convolutional and Recurrent Neural Network for Human Activity Recognition: Application on American Sign Language. PLoS ONE 2020, 15, 1–12.
19. Kim, M.; Cho, J.; Lee, S.; Jung, Y. IMU Sensor-Based Hand Gesture Recognition for Human-Machine Interfaces. Sensors 2019, 19, 3827. [CrossRef] [PubMed]
20. Akash. ASL Alphabet Image Data Set for Alphabets in the American Sign Language. 2018. Available online: https://www.kaggle.com/grassknoted/asl-alphabet (accessed on 9 September 2020).
21. Nvidia. CUDA GPUs. Available online: https://developer.nvidia.com/cuda-gpus (accessed on 9 September 2020).
22. Padilla, R.; Passos, W.L.; Dias, T.L.B.; Netto, S.L.; da Silva, E.A.B. A Comparative Analysis of Object Detection Metrics with a Companion Open-Source Toolkit. Electronics 2021, 10, 279. [CrossRef]
23. Joseph, R.; Santosh, D.; Ross, G.; Ali, F. You Only Look Once: Unified, Real-Time Object Detection. arXiv 2016, arXiv:1506.02640.
24. Joseph, R.; Ali, F. YOLO9000: Better, Faster, Stronger. arXiv 2016, arXiv:1612.08242.
25. Joseph, R.; Ali, F. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
26. Pedro, F.; Ross, B.; David, M.; Deva, R. Object Detection with Discriminatively Trained Part Based Models. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 32, 1627–1645.
27. Ross, G.; Jeff, D.; Trevor, D.; Jitendra, M. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014.
28. Ross, G. Fast R-CNN. In Proceedings of the International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015.
29. Shaoqing, R.; Kaiming, H.; Ross, G.; Jian, S. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv 2015, arXiv:1506.01497.
30. Wenbo, L.; Jianwu, D.; Yangping, W.; Song, W. Pedestrian Detection Based on YOLO Network Model. In Proceedings of the 2018 IEEE International Conference on Mechatronics and Automation, Changchun, China, 5–8 August 2018.
31. Weidong, M.; Xiangpeng, L.; Qi, W.; Qingpeng, Z.; Yanqiu, L. New approach to vehicle license plate location based on new model YOLO-L and plate pre-identification. IET Image Process. 2019, 13, 1041–1049.
32. Zuzanna, K.; Jacek, S. Bones detection in the pelvic area on the basis of YOLO neural network. In Proceedings of the 19th International Conference Computational Problems of Electrical Engineering, Banska Stiavnica, Slovakia, 9–12 September 2018.
33. Steve, D.; Nanik, S.; Chastine, F. Recognition using YOLO Method. IOP Conf. Ser. Mater. Sci. Eng. 2021, 1077, 012029. [CrossRef]
34. Tzutalin. LabelImg. Git Code. 2015. Available online: https://github.com/tzutalin/labelImg/ (accessed on 9 September 2020).

35. YOLO: Real-Time Object Detection. Available online: https://pjreddie.com/darknet/yolo/ (accessed on 5 July 2019).
36. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [CrossRef] [PubMed]
37. Keisuke, S.; Kevin, D.; Matt, P.; Benjamin, V. Robust Word Recognition via Semi-Character Recurrent Neural Network. arXiv 2017, arXiv:1608.02214.
38. Pengfei, L.; Xipeng, Q.; Xuanjing, H. Recurrent Neural Network for Text Classification with Multi-Task Learning. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16), New York City, NY, USA, 9–15 July 2016.
39. Lin, T.-Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, L.; Dollar, P. Microsoft COCO: Common Objects in Context. arXiv 2015, arXiv:1405.0312v3.