
Article

Multi-Stroke Thai Finger-Spelling Recognition System with Deep Learning

Thongpan Pariwat and Pusadee Seresangtakul *

Natural Language and Speech Processing Laboratory, Department of Computer Science, Faculty of Science, Khon Kaen University, Khon Kaen 40002, Thailand; [email protected]
* Correspondence: [email protected]; Tel.: +66-80-414-0401

Abstract: Sign language is a type of language for the hearing impaired that the general public commonly does not understand. A sign language recognition system therefore serves as an intermediary between the two sides. As a communication tool, a multi-stroke Thai finger-spelling sign language (TFSL) recognition system featuring deep learning was developed in this study. This research uses a vision-based technique on a complex background, with semantic segmentation performed with dilated convolutions for hand segmentation, hand strokes separated using optical flow, and feature learning and classification performed with a convolutional neural network (CNN). We then compared five CNN structures. The first format set the number of filters to 64 and the filter size to 3 × 3 with 7 layers; the second format used 128 filters, each 3 × 3 in size, with 7 layers; the third format used a number of filters in ascending order over 7 layers, all with an equal 3 × 3 filter size; the fourth format used a number of filters in ascending order with filter sizes arranged from large to small over 7 layers; the final format was a structure based on AlexNet. As a result, the average accuracy was 88.83%, 87.97%, 89.91%, 90.43%, and 92.03%, respectively. We implemented the CNN structure based on AlexNet to create models for the multi-stroke TFSL recognition system. The experiment was performed using isolated videos of the 42 Thai letters, which are divided into three categories consisting of one stroke, two strokes, and three strokes. The results presented an 88.00% average accuracy for one stroke, 85.42% for two strokes, and 75.00% for three strokes.

Keywords: TFSL recognition system; deep learning; semantic segmentation; optical flow; complex background

1. Introduction

Hearing-impaired people around the world use sign language as their medium for communication. However, sign language is not universal, with one hundred varieties used around the world [1]. One of the most widely used types of sign language is American Sign Language (ASL), which is used in the United States, Canada, West Africa, and Southeast Asia and influences Thai Sign Language (TSL). Typically, sign language is divided into two forms: gesture language and finger-spelling. Gesture language, or sign language proper, involves the use of hand gestures, facial expressions, and the mouth and nose to convey meanings and sentences. This type is used for communication between deaf people in everyday life, focusing on terms such as eating, ok, sleep, etc. Finger-spelling sign language is used for spelling letters of the written language with one's fingers and is used to spell the names of people, places, animals, and objects.

ASL is the foundation of Thai finger-spelling sign language (TFSL). TFSL was invented in 1953 by Khunying Kamala Krairuek using American finger-spelling as a prototype to represent the 42 Thai consonants, 32 vowels, and 6 intonation marks [2]. All forty-two Thai letters can be presented with a combination of twenty-five hand gestures.


For this purpose, number signs are combined with alphabet signs to create additional meanings. For example, the K sign combined with the 1 sign (K + 1) means 'ข' (/kh/). Consequently, TFSL strokes can be organized into three groups (one-stroke, two-stroke, and three-stroke groups) with 15 letters, 24 letters, and 3 letters, respectively. However, 'ฃ' (/kh/) and 'ฅ' (/kh/) have not been used since Thailand's Royal Institute Dictionary announced their discontinuation [3], and they are no longer used in Thai finger-spelling sign language.

This study is part of a research series on TFSL recognition that focuses on one stroke performed by a signer standing in front of a blue background. Global and local features with a support vector machine (SVM) using the radial basis function (RBF) kernel were applied and achieved 91.20% accuracy [4]; however, there were still problems with highly similar letters. Therefore, a new TFSL recognition system was developed by applying a pyramid histogram of oriented gradients (PHOG) and local features with k-nearest neighbors (KNN), providing an accuracy of up to 97.6% [3]. However, past research has been conducted only on one-stroke, 15-character TFSL under a simple background, which does not cover all types of TFSL. There is also other research on TFSL recognition systems that uses a variety of techniques, as shown in Table 1.

Table 1. Research on Thai finger‑spelling sign language (TFSL) recognition systems.

Researchers | No. of Signs | Method | Dataset | Accuracy (%)
Adhan and Pintavirooj [5] | 42 | Black glove with 6 sphere markers, geometric invariants and ANN | 1050 | 96.19
Saengsri et al. [6] | 16 | Glove with motion tracking | N/A | 94.44
Nakjai and Katanyukul [2] | 25 | CNN | 1375 | 91.26
Chansri and Srinonchat [7] | 16 | HOG and ANN | 320 | 83.33
Silanon [8] | 16 | HOG and ANN | 2100 | 78.00

The study of TFSL recognition systems for practical applications requires a variety of techniques, since TFSL has multi-stroke gestures and combines hand gestures to convey letters, and the system must also handle complex backgrounds and a wide range of light intensities. Deep learning is a tool that is increasingly being used in sign language recognition [2,9,10], face recognition [11], object recognition [12], and other areas. This technology is used for solving complex problems such as object detection [13], image segmentation [14], and image recognition [12]. With this technique, feature learning is implemented instead of handcrafted feature extraction. A convolutional neural network (CNN) is a feature learning method that can be applied to recognition and can provide high performance. However, deep learning requires a large amount of data for training, and such data could use simple or complex background images. In a picture with a simple background, the object and background colors are clearly different, so the image can be used as training data without cutting out the background. A complex background image, on the other hand, features objects and backgrounds with similar colors. Cutting out the background yields training images for deep learning that feature only the objects of interest without distraction. This can be done using semantic segmentation, which applies deep learning to image segmentation and requires labeling the objects of interest to separate them from the background. Autonomous driving is one application of semantic segmentation, used to identify objects on the road, and it can also be used for a wide range of other applications.

This study is based on the challenges of the TFSL recognition system. There are up to 42 Thai letters that are spelled with the fingers, i.e., with combinations of multi-stroke sign language gestures. The gestures for spelling several letters are similar, and the recognition system should work under a complex background. There are many studies on TFSL recognition systems; however, none of them address multi-stroke TFSL recognition using a vision-based technique with a complex background. The main contributions of this research are the application of a new framework for a multi-stroke Thai finger-spelling sign language recognition system to videos under a complex background and a variety of light intensities by separating people's hands from the complex background via semantic segmentation, detecting changes in the strokes of hand gestures with optical flow, and learning features with a CNN under the structure of AlexNet. This system supports recognition covering all 42 characters of TFSL.

The proposed method focuses on developing a multi-stroke TFSL recognition system with deep learning that can act as a communication medium between the hearing impaired and the general public. Semantic segmentation is applied to hand segmentation for complex background images, and optical flow is used to separate the strokes of sign language. The processes of feature learning and classification use a CNN.

2. Review of Related Literature

Research on sign language recognition systems has been developed in sequence with a variety of techniques to obtain a system that can be used in the real world. Deep learning is one of the most popular methods effectively applied to sign language recognition systems. Nakjai and Katanyukul [2] used a CNN to develop a recognition system for TFSL with a black background image and compared a 3-layer CNN, a 6-layer CNN, and HOG, finding that the 3-layer CNN with 128 filters on every layer had a mean average precision (mAP) of 91.26%. Another study that applied deep learning to a sign language recognition system was published by Lean Karlo S. et al. [9], who developed a sign language recognition system with a CNN under a simple background. The dataset was divided into 3 groups: alphabet, number, and static word. The results show that alphabet recognition had an average accuracy of 90.04%, number recognition had an average accuracy of 93.44%, and static word recognition had an average accuracy of 97.52%. The total average of the system was 93.67%. Rahim et al. [15] applied deep learning to a non-touch sign word recognition system using hybrid segmentation and CNN feature fusion. This system used SVM to recognize sign language. The research results under a real-time environment provided an average accuracy of 97.28%.

Various sign language studies aim to develop a vision-based approach using, e.g., image and video processing, object detection, image segmentation, and recognition systems. Vision-based sign language recognition systems can also be divided into two types of backgrounds: simple backgrounds [3,4,9,16] and complex backgrounds [17–19]. A simple background entails the use of a single color such as green, blue, or white. This helps hand segmentation work more easily, and the system can achieve high recognition accuracy. Pariwat et al. [4] used sign language images on a blue background, with the signer wearing a black jacket, in a TFSL recognition system with SVM on the RBF kernel, including global and local features. The average accuracy was 91.20%. Pariwat et al. [3] also used a simple background in a system that was developed by combining PHOG and local features with KNN. This combination enhanced the average accuracy to levels as high as 97.80%.

Anh et al. [10] presented a recognition system using deep learning in a video sequence format. A simple background, such as a white background, was used in training and testing, providing an accuracy of 95.83%. Using a simple background makes it easier to develop a sign language recognition system and provides high-accuracy results, but this approach is not practical due to the complexity of the backgrounds in real-life situations.

Studying sign language recognition systems with complex backgrounds is challenging. Complex backgrounds consist of a variety of elements that are difficult to distinguish, which requires more complex methods. Chansri et al. [7] developed a Thai sign language recognition system using Microsoft Kinect to help locate a person's hand in a complex background. This study used the fusion of depth and color video, HOG features, and a neural network, resulting in an accuracy of 84.05%. Ayman et al. [17] proposed the use of Microsoft Kinect for Arabic sign language recognition systems with complex backgrounds using histogram of oriented gradients and principal component analysis (HOG-PCA) with SVM. The resulting accuracy value was 99.2%.
In summary, research on sign language recognition systems using vision-based techniques remains a challenge for real-world applications.


3. The Proposed Method

The proposed method is a development of the multi-stroke TFSL recognition system covering 42 letters. The system consists of four parts: (1) creating a SegNet model via semantic segmentation using dilated convolutions; (2) creating a hand-segmented library with the SegNet model; (3) TFSL model creation with a CNN; and (4) classification processes using optical flow for hand stroke detection and the CNN model for classification. Figure 1 shows an overview of the system.


Figure 1. Multi-stroke Thai finger-spelling sign language recognition framework.

3.1. Hand Segmentation Model Creation

Segmentation is one of the most important processes for recognition systems. In particular, sign language recognition systems require the separation of the hands from the background image. Complex backgrounds are a challenging task for hand segmentation because they feature a wide range of colors and lighting. Semantic segmentation has been effectively applied to complex background images, and it is used to describe images at a pixel level with a class label [20]. For example, in an image containing people, trees, cars, and signs, semantic segmentation can find separate labels for the objects depicted in the image; it is applied to autonomous vehicles, robotics, human–computer interaction, etc.

Dilated convolution is used in segmentation tasks and has consistently improved accuracy performance [21]. Dilated convolution provides a way to exponentially increase the receptive view (global view) of the network while providing linear parameter accretion. In general, dilated convolution applies the filter to the input with a spacing defined by the dilation rate. For example, in Figure 2, a 1-dilated convolution (a) refers to the normal convolution with a 3 × 3 receptive field, a 2-dilated convolution (b) refers to one-pixel spacing with a receptive field of size 7 × 7, and a 4-dilated convolution (c) refers to three-pixel spacing with a receptive field of size 15 × 15. The red dots represent the 3 × 3 input filters, and the green area is the receptive area of these inputs.
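The 3 → 7 → 15 growth illustrated in Figure 2 arises when 3 × 3 convolutions with dilation rates 1, 2, and 4 are applied successively, as in [21]. The short sketch below is only an illustration of that arithmetic and is not code from the paper.

```python
# Receptive-field growth of successively applied 3x3 dilated convolutions,
# reproducing the 3 -> 7 -> 15 progression of Figure 2 (illustrative sketch).

def receptive_field(kernel_size: int, dilations: list[int]) -> int:
    """Receptive field of a stack of stride-1 dilated convolutions."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d   # each layer adds (k - 1) * dilation pixels
    return rf

if __name__ == "__main__":
    for i in range(1, 4):
        dilations = [2 ** j for j in range(i)]           # [1], [1, 2], [1, 2, 4]
        print(dilations, receptive_field(3, dilations))  # -> 3, 7, 15
```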


Figure 2. Dilated convolution on a 2D image: (a) the normal convolution uses a 3 × 3 receptive field; (b) one-pixel spacing gives a 7 × 7 receptive field; (c) three-pixel spacing gives a receptive field of size 15 × 15.

This study employs semantic segmentation using dilated convolutions to create a hand-segmented library. The process for creating the segmented library is shown in Figure 3. First, image files and image labels for hand signs are created as training data. Second, a semantic segmentation network model (SegNet model) is created. Each convolution layer consists of dilated convolution, batch normalization, and a rectified linear unit (ReLU), with the final layer used for softmax and pixel classification. Once training is completed, the resulting SegNet model is ready to be used in the next step: it is applied in the process of creating the hand-segmented library.
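A minimal PyTorch sketch of such a dilated-convolution segmentation network is given below. It only illustrates the block structure just described (dilated convolution, batch normalization, ReLU, and a per-pixel classifier); the 32 filters per block and the dilation sequence follow Section 4.2.1, while the two-class output (hand vs. background) and the framework choice are assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn

def dilated_block(in_ch: int, out_ch: int, dilation: int) -> nn.Sequential:
    """One block: dilated convolution + batch normalization + ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=dilation, dilation=dilation),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class HandSegNet(nn.Module):
    """Pixel-wise hand/background classifier built from dilated convolutions."""
    def __init__(self, num_classes: int = 2, filters: int = 32):
        super().__init__()
        dilations = [1, 2, 4, 8, 16, 1]            # dilation factors from Section 4.2.1
        blocks, in_ch = [], 3
        for d in dilations:
            blocks.append(dilated_block(in_ch, filters, d))
            in_ch = filters
        self.features = nn.Sequential(*blocks)
        # 1x1 convolution gives per-pixel class scores; softmax is applied by the
        # cross-entropy loss during training (the pixel classification layer).
        self.classifier = nn.Conv2d(filters, num_classes, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# Example: a 320 x 240 RGB hand image yields a 2-channel score map of the same size.
scores = HandSegNet()(torch.randn(1, 3, 240, 320))
print(scores.shape)  # torch.Size([1, 2, 240, 320])
```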

Figure 3. Hand segmentation model creation via semantic segmentation.

3.2. Hand-Segmented Library Creation

Creating a hand-segmented library is done as preparation for training, with the input hand sign images used to create a library consisting of the 25 gestures of TFSL. Each gesture has 5000 images, for a total of 125,000 images. The procedure includes the following steps, as shown in Figure 4. The first step is semantic segmentation, applying the SegNet model from the previous process to separate the hands from the background. The second step is to label the position of the hand as a binary image. The third step is to apply a Gaussian filter to reduce the noise of the image and to remove small objects from the binary image, leaving only areas of wide space. The fourth step sorts the areas in ascending order, selects the largest area (the area of the hand), and creates a bounding box at the position of the hand to frame only the features of interest. The last step is to crop the image by applying the cropped binary image to the input image, removing everything but the region of interest, and then resizing the image to 150 × 250 pixels. Examples of images from the hand-segmented library are shown in Figure 5. A sketch of these post-processing steps is given after Figure 4.



Figure 4. The process of creating a hand-segmented library.
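Steps 3–5 above (noise removal, largest-region selection, bounding box, crop, and resize to 150 × 250) can be sketched with OpenCV as follows. The function name, the 5 × 5 blur kernel, and the minimum-area guard are illustrative assumptions; only the 150 × 250 output size comes from the text.

```python
import cv2
import numpy as np

def crop_hand(image: np.ndarray, hand_mask: np.ndarray,
              min_area: int = 500) -> np.ndarray:
    """Blur the mask, keep the largest blob, and crop/resize the hand region.

    `image` is the original BGR frame; `hand_mask` is the binary hand label
    produced by the segmentation model (values 0 or 255).
    """
    # Step 3: Gaussian filtering and removal of small objects.
    mask = cv2.GaussianBlur(hand_mask, (5, 5), 0)
    _, mask = cv2.threshold(mask, 127, 255, cv2.THRESH_BINARY)
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask)

    # Step 4: keep only the largest component (the hand) and its bounding box.
    areas = stats[1:, cv2.CC_STAT_AREA]          # skip the background label 0
    if len(areas) == 0 or areas.max() < min_area:
        raise ValueError("no hand region found")
    hand_label = 1 + int(np.argmax(areas))
    x, y, w, h = (stats[hand_label, cv2.CC_STAT_LEFT],
                  stats[hand_label, cv2.CC_STAT_TOP],
                  stats[hand_label, cv2.CC_STAT_WIDTH],
                  stats[hand_label, cv2.CC_STAT_HEIGHT])

    # Step 5: apply the mask to the image, crop, and resize to 150 x 250.
    hand_only = cv2.bitwise_and(image, image,
                                mask=(labels == hand_label).astype(np.uint8) * 255)
    crop = hand_only[y:y + h, x:x + w]
    return cv2.resize(crop, (150, 250))
```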

Figure 5. Examples of the hand-segmented library files.

3.3. TFSL Model Creation

This process uses data training methods to learn features and create a CNN classification model, as shown in Figure 6. The training starts with introducing data from the hand-segmented library into deep learning via a CNN, which consists of multiple layers in the feature detection process, each with convolution (Conv), ReLU, and pooling [22].


Figure 6. Training process with convolutional neural network (CNN).

Three processes are repeated for each cycle of layers, with each layer learning to detect different features. Convolution is a layer that learns features from the input image. For convolution, small squares of input data are used to learn the image features by preserving the relationship between pixels. The image matrix and the filter (or kernel) are combined mathematically with the dot product, as shown in Figure 7.
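As a concrete illustration of this dot-product view of convolution, the toy NumPy routine below slides a kernel over an image and takes the dot product at each position; the example image and kernel values are arbitrary and not taken from the paper.

```python
import numpy as np

def conv2d_valid(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """'Valid' 2D convolution: slide the kernel and take dot products."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Each output value is the dot product of the kernel with one patch.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1, 0, -1]] * 3, dtype=float)   # simple vertical-edge filter
print(conv2d_valid(image, kernel))                 # 3 x 3 feature map
```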


Figure 7. Learning features with a convolution filter [23].

ReLU is an activation function that allows an algorithm to work faster and more effectively. The function returns 0 if it receives any negative input, but for any positive value x it returns that value, as presented in Figures 8 and 9. Thus, ReLU can be written as Equation (1):

f(x) = max(0, x)   (1)

Figure 8. The ReLU function.

Figure 9. An example of the values after using the ReLU function [20].

Pooling resizes the data to a smaller size, with the details of the input remaining intact. It also has the advantage of reducing the computational load and helping to mitigate overfitting. There are two types of pooling layers, max and average pooling, as illustrated in Figure 10.
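A small NumPy illustration of the two pooling types on a 4 × 4 input (the values are arbitrary and chosen only for the example):

```python
import numpy as np

def pool2x2(x: np.ndarray, mode: str = "max") -> np.ndarray:
    """Non-overlapping 2x2 pooling (stride 2) over a 2D array."""
    h, w = x.shape
    blocks = x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 2],
              [7, 2, 9, 1],
              [3, 1, 4, 8]], dtype=float)
print(pool2x2(x, "max"))   # [[6. 5.] [7. 9.]]
print(pool2x2(x, "mean"))  # [[3.5  2.5 ] [3.25 5.5 ]]
```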


Figure 10. An example of average and max pooling [23].

The final layer of the CNN architecture is the classification layer, consisting of 2 sub-layers: the fully connected (FC) and softmax layers. The fully connected layer uses the feature map matrix, in the form of a vector derived from the feature detection process, to create predictive models. The softmax function layer provides the classification output.

3.4. Classification

In the classification process, video files showing isolated TFSL were used as the input. The video input files consisted of one stroke, two strokes, and three strokes. These files were taken with a digital camera and required the signer to act with the right hand, wearing a black jacket, while standing in front of a complex background. The whole process of testing involved the following steps: motion segmentation by optical flow, splitting the strokes of the hand gestures, hand segmentation via the SegNet model, classification, combining signs, and displaying the text, as shown in Figure 11.


Figure 11. Classification and display text process.

Motion segmentation used Lucas and Kanade's optical flow calculation method, which tracks the whole image using the pyramid algorithm. The tracking starts with the top layer of the pyramid and runs down to the bottom. Motion segmentation was performed by optical flow using the orientation and magnitude of the flow vectors. Calculating the orientation and magnitude of the optical flow was accomplished using frame-by-frame calculations considering 2D motion. In each frame, the angle of the motion at each point varies, depending on the magnitude and direction of the motion observed from the vector relative to the (x, y) axes. The direction of motion can be obtained from Equation (2) [24]:

θ = tan⁻¹(y/x)   (2)

where θ is the direction of the vector v = [x, y]T, x is coordinate x, and y is coordinate y in the range of −π/2 + ((b − 1)/B)π ≤ θ < −π/2 + (b/B)π, where b is the block number and B is the number of bins.

The magnitude of the optical flow is the vector length, found as the length of each vector between the previous frame and the current frame. The magnitude can be calculated from Equation (3) [24]:

m = √(x² + y²)   (3)



where m is the magnitude, x is coordinate x, and y is coordinate y.

In this research, motion was distinguished by finding the mean of all magnitudes within each frame of the split hand sign. The mean of the magnitude was high when the hand signals were changed, as shown in Figure 12.

Figure 12. The mean value of the magnitude in each frame of the video files.

Splitting the strokes of the hand was achieved by comparing each frame with a threshold value. Testing various threshold values, it was found that a magnitude value of 1–3 means slight movement, and a value equal to 4 or more indicates a change in the stroke of the hand signal. The threshold value in this experiment was therefore set to 4. If the mean magnitude of a frame was lower than the threshold value, then s = 0 (meaning), but if the value was more than the threshold value, then s = 1 (meaningless), as shown in Figure 13. After that, 10 frames between the start and the end of an s = 0 (meaning) segment represent the hand sign.


Figure 13. The strokes of the hand sign.
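A rough sketch of this motion-segmentation step is shown below. It uses OpenCV's pyramidal Lucas-Kanade tracker on corner features as a stand-in for the tracking described above and thresholds the per-frame mean magnitude at 4, as in the text; the feature-detection parameters and the use of goodFeaturesToTrack are assumptions made only for illustration.

```python
import cv2
import numpy as np

def stroke_states(video_path: str, threshold: float = 4.0) -> list[int]:
    """Per-frame stroke indicator: 0 = steady hand sign, 1 = stroke changing.

    Tracks corner features with pyramidal Lucas-Kanade optical flow and
    thresholds the mean displacement magnitude (threshold 4, as in the text).
    """
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    prev_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    states = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                                      qualityLevel=0.01, minDistance=7)
        if pts is None:
            states.append(0)
        else:
            nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None)
            good = status.ravel() == 1
            dx, dy = (nxt[good] - pts[good]).reshape(-1, 2).T
            mag, _ = cv2.cartToPolar(dx, dy)      # per-vector magnitude, Equation (3)
            states.append(1 if mag.mean() >= threshold else 0)
        prev_gray = gray
    cap.release()
    return states   # runs of 0 between changes delimit the individual strokes
```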


After extracting the image from the hand stroke separation, the next step was the hand segmentation process. This process involves the separation of hands from the background using the SegNet model, as illustrated in Figure 14.


Figure 14. Image of hand segmentation after splitting the strokes of the hand.

The next step was to classify the hand segmentation images with the CNN model, predicting 10 frames of each hand sign. The system votes on the prediction results in descending order and selects the most frequently predicted hand sign. As shown in Figure 11, after passing the classification process, the results can be predicted as the K sign and the 3 sign. After that, the signs are combined and then translated into Thai letters. The rules for the combination of signs are shown in Table 2.

Table 2. Combination rules for TFSL.

No. | Rule | Alphabet
1 | K | ก (/k/)
2 | K + 1 | ข (/kʰ/)
3 | K + 2 | ค (/kʰ/)
4 | K + 3 | ฆ (/kʰ/)
5 | T | ต (/t/)
6 | T + 1 | ถ (/tʰ/)
7 | T + 2 | ฐ (/tʰ/)
8 | T + 3 | ฒ (/tʰ/)
9 | T + 4 | ฑ (/tʰ/)
10 | T + 5 | ฏ (/t̚/)
11 | S | ส (/s/)
12 | S + 1 | ศ (/s/)
13 | S + 2 | ษ (/s/)
14 | S + Q | ซ (/s/)
15 | P | พ (/pʰ/)
16 | P + 1 | ป (/p/)
17 | P + 2 | ผ (/pʰ/)
18 | P + 3 | ภ (/pʰ/)
19 | H | ห (/h/)
20 | H + 1 | ฮ (/h/)
21 | B | บ (/b/)
22 | R | ร (/r/)
23 | W | ว (/w/)
24 | D | ด (/d/)
25 | D + 1 | ฎ (/d/)
26 | F | ฟ (/f/)
27 | F + 1 | ฝ (/f/)
28 | L | ล (/l/)
29 | L + 1 | ฬ (/l/)
30 | J1 + J2 | จ (/t͡ɕ/)
31 | Y | ย (/j/)
32 | Y + 1 | ญ (/j/)
33 | M | ม (/m/)
34 | N | น (/n/)
35 | N + 1 | ณ (/n/)
36 | N + G | ง (/ŋ/)
37 | T + H | ท (/tʰ/)
38 | T + H + 1 | ธ (/tʰ/)
39 | C + H | ฉ (/t͡ɕʰ/)
40 | C + H + 1 | ช (/t͡ɕʰ/)
41 | C + H + 2 | ฌ (/t͡ɕʰ/)
42 | A | อ (/ʔ/)

4. Experiments and Results

4.1. Data Collection

Data collection for this research was divided into two types: images from 6 locations for training, and videos for testing at 3 locations (Library, Computer Lab2, and Office1) with slightly different camera views.


The images and videos were taken with a digital camera and required the signer to use his or her right hand to make a hand signal while standing in front of a complex background at the 6 locations presented in Figure 15.

Figure 15. The six locations of data collection: (a) Library; (b) Computer Lab1; (c) Computer Lab2; (d) Canteen; (e) Office1; and (f) Office2.

The hand sign images used for the training were divided into 2 groups: the training data for the SegNet model and the training data for the CNN model, with 25 signs in total. The training data for the SegNet model used original images and image labels, 320 × 240 pixels in size, totaling 10,000 images. The second group contained images that were hand-segmented via semantic segmentation and were contained in the hand-segmented library. These images were 150 × 250 pixels in size. The training dataset contained 25 signs, as presented in Figure 16, each with 5000 images for a total of 125,000 images.

Figure 16. Twenty-five hand signs for 42 Thai letters.

The videos used in the testing process were isolated TFSL videos featuring the 42 Thai letters. These videos were taken from a hand signer group consisting of 4 people, and each letter was recorded 5 times, totaling (42 × 4 × 5) 840 files.



4.2. Experimental Results

The experiments in this study were divided into three parts. The first part involved the evaluation of segmentation performance using the intersection over union (IoU); the second part was the experiment used to determine the efficiency of the CNN model; and the last part was the testing of the multi-stroke TFSL recognition system.

4.2.1. Intersection over Union Areas (IoU)

IoU is a statistical method that measures the consistency between a predicted bounding box and the ground truth by dividing the overlapping area between the prediction and the ground truth by the union area between the prediction and the ground truth, as in Equation (4) and shown in Figure 17 [18]. A high IoU value (closer to 1) refers to a bounding box with high accuracy. A threshold of 0.5 determines whether the predicted bounding box is accurate, as shown in Figure 18. According to the standard Pascal Visual Object Classes Challenge 2007, acceptable IoU values must be greater than 0.5 [13].

IoU = Area of overlap between bounding boxes / Area of union between bounding boxes.   (4)
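Equation (4) translates directly into code. The sketch below assumes boxes given as (x1, y1, x2, y2) corner coordinates; it is illustrative rather than the evaluation code used in the paper.

```python
def iou(box_a: tuple[float, float, float, float],
        box_b: tuple[float, float, float, float]) -> float:
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A prediction is usually counted as correct when IoU > 0.5.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```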

Figure 17. The intersection over union areas (IoU) [25].
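To make Equation (4) concrete, the following is a minimal sketch of how IoU can be computed for two axis-aligned bounding boxes given as (x1, y1, x2, y2) corner coordinates; the function name, box format, and example coordinates are illustrative assumptions rather than the authors' implementation.

```python
def bbox_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2), per Equation (4)."""
    # Corners of the intersection rectangle.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])

    # Area of overlap (zero when the boxes do not intersect).
    overlap = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    # Area of union = sum of the two box areas minus the overlap.
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - overlap

    return overlap / union if union > 0 else 0.0


# A predicted box is accepted when IoU > 0.5 (the Pascal VOC 2007 criterion).
print(bbox_iou((15, 15, 65, 65), (20, 20, 70, 70)))  # about 0.68 -> accepted
```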

Figure 18. IoU is evaluated as follows: the left IoU is 0.4034, which is poor; the middle IoU = 0.7330, which is good; and the right IoU = 0.9264, which is excellent [13].

The research used sign language gestures on a complex background along with label images, all of which were 320 × 240 pixels in size. A total of 10,000 images passed through semantic segmentation training using dilated convolutions. The dilation rate configuration and the structure of the convolutions used for training are as follows. The training network consists of five blocks of convolution. Each block includes 32 convolution filters (3 × 3 in size) with a different dilation factor, batch normalization, and ReLU; the dilation factors are 1, 2, 4, 8, 16, and 1, respectively. The training options were MaxEpochs = 500 and MiniBatchSize = 64. According to the IoU performance assessment, the average IoU was 0.8972, which is excellent.
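For illustration only, the sketch below expresses this segmentation network in PyTorch; it is not the authors' code. It stacks 3 × 3 convolutions with 32 filters and dilation factors 1, 2, 4, 8, 16, and 1, each followed by batch normalization and ReLU, and ends with a 1 × 1 convolution that scores every pixel. The RGB input, the two output classes (hand and background), and the 'same' padding are assumptions.

```python
import torch
import torch.nn as nn

class DilatedSegNet(nn.Module):
    """Sketch of the dilated-convolution segmentation network described above."""

    def __init__(self, num_classes=2):  # hand vs. background (assumed)
        super().__init__()
        layers, in_ch = [], 3  # RGB input (assumed)
        # 3 x 3 convolutions with 32 filters and dilation factors 1, 2, 4, 8, 16, 1.
        for d in (1, 2, 4, 8, 16, 1):
            layers += [
                nn.Conv2d(in_ch, 32, kernel_size=3, padding=d, dilation=d),
                nn.BatchNorm2d(32),
                nn.ReLU(inplace=True),
            ]
            in_ch = 32
        self.features = nn.Sequential(*layers)
        # 1 x 1 convolution producing per-pixel class scores.
        self.classifier = nn.Conv2d(32, num_classes, kernel_size=1)

    def forward(self, x):
        return self.classifier(self.features(x))


# A 320 x 240 RGB image yields a same-sized map of class scores.
scores = DilatedSegNet()(torch.randn(1, 3, 240, 320))
print(scores.shape)  # torch.Size([1, 2, 240, 320])
```

Because dilation enlarges the receptive field without any pooling, the per-pixel label resolution of the 320 × 240 input is preserved end to end.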


4.2.2. Experiments on the CNN Models

Experiments to determine the effectiveness of the CNN models in this research used the 5-fold cross validation method. The input hand sign images for this experiment comprised 125,000 images from the hand-segmented library, consisting of 25 finger-spelling signs with 5000 images each. This dataset was divided into 5 groups of 25,000 images, and each round of 5-fold cross validation used 100,000 images for training and 25,000 images for testing. This experiment compared the effectiveness of the configuration, the number of filters, and the filter sizes of five different CNN formats.
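The 5-fold protocol described above can be sketched as follows; this is a minimal illustration using scikit-learn, and the index array merely stands in for the 125,000 hand-segmented images.

```python
import numpy as np
from sklearn.model_selection import KFold

# Stand-in for the 125,000 hand-segmented images (25 signs x 5000 images each).
indices = np.arange(125_000)

# Each of the 5 rounds uses 100,000 images for training and 25,000 for testing.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kfold.split(indices), start=1):
    print(f"fold {fold}: train={len(train_idx)}, test={len(test_idx)}")
    # ...train the CNN on train_idx and evaluate on test_idx here...
```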
The first format used seven layers of 64 filters with a 3 × 3 filter size. The second format used 128 filters, each 3 × 3 in size, with 7 layers. The third format sorted the number of filters in ascending order over 7 layers, starting from 2, 5, 10, 20, 40, 80, and 160, respectively, with all filter sizes equal to 3 × 3. The fourth format used an ascending number of filters with filter sizes decreasing from large to small over 7 layers (number of filters/filter size: 4/11 × 11, 8/5 × 5, 16/5 × 5, 32/3 × 3, 64/3 × 3, 128/3 × 3, 256/3 × 3). The final format was a structure based on AlexNet, which defined the numbers and sizes of the filters as follows (number of filters/filter size): 96/11 × 11, 256/5 × 5, 384/3 × 3, 384/3 × 3, 256/3 × 3, 256/3 × 3 [26]. The structure details are shown in Figure 19.

Figure 19. The five CNN formats for training the model.
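As an illustration of the fifth format, the sketch below assembles the reported filter counts and sizes (96/11 × 11, 256/5 × 5, 384/3 × 3, 384/3 × 3, 256/3 × 3, 256/3 × 3) into a PyTorch-style model with 25 outputs, one per finger-spelling sign. The strides, padding, pooling positions, input size, and classification head are assumptions borrowed from the usual AlexNet design, not details given in the paper.

```python
import torch
import torch.nn as nn

class Format5Net(nn.Module):
    """Sketch of the AlexNet-based fifth format (filter counts/sizes from the text)."""

    def __init__(self, num_classes=25):  # 25 finger-spelling hand signs
        super().__init__()
        cfg = [  # (number of filters, filter size, max-pool afterwards?)
            (96, 11, True), (256, 5, True), (384, 3, False),
            (384, 3, False), (256, 3, False), (256, 3, True),
        ]
        layers, in_ch = [], 3
        for out_ch, k, pool in cfg:
            layers += [nn.Conv2d(in_ch, out_ch, kernel_size=k, padding=k // 2),
                       nn.ReLU(inplace=True)]
            if pool:
                layers.append(nn.MaxPool2d(kernel_size=2))
            in_ch = out_ch
        self.features = nn.Sequential(*layers)
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(256, num_classes))

    def forward(self, x):
        return self.head(self.features(x))


# One 25-way score vector per segmented hand image (the input size is assumed).
print(Format5Net()(torch.randn(1, 3, 227, 227)).shape)  # torch.Size([1, 25])
```

Under the paper's protocol, a model of each format is trained and evaluated once per fold of the 5-fold split, and the per-fold accuracies are summarized in Table 3.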


The CNN model experiment used the 5-fold cross validation method, and the results are shown in Table 3. The experiment found that the structure based on the AlexNet format achieved an average accuracy of 92.03%. The fourth format, with 7 layers, an ascending number of filters, and filter sizes decreasing from large to small, offered an average accuracy of 90.04%. Format 3, which uses an ascending number of filters over seven layers, followed with an average accuracy of 89.91%. The first and second formats use a static number of filters across their seven layers (64 and 128 filters, respectively), with average accuracies of 88.23% and 87.97%.

Table 3. Five‑fold cross validation for CNN model testing.

Five-Fold Cross Validation | Format 1 Accuracy (%) | Format 2 Accuracy (%) | Format 3 Accuracy (%) | Format 4 Accuracy (%) | Format 5 Accuracy (%)
1st | 80.12 | 80.60 | 83.43 | 84.17 | 87.72
2nd | 87.12 | 85.90 | 86.28 | 86.15 | 88.96
3rd | 90.06 | 90.34 | 92.44 | 92.53 | 93.98
4th | 93.84 | 92.82 | 95.52 | 94.87 | 97.00
5th | 90.00 | 90.20 | 91.86 | 92.47 | 92.50
Average accuracy | 88.23 | 87.97 | 89.91 | 90.04 | 92.03

Comparing the five CNN accuracy results, it was found that the fifth format provided the highest accuracy. Therefore, we used this format for training to create the models for the multi-stroke TFSL recognition system, divided into 112,500 training images (90%) and 12,500 testing images (10%). The accuracy of the model in this experiment was 99.98%.

4.2.3. The Experiment of the Multi-Stroke TFSL Recognition System

This recognition system focuses on user-independent testing via the isolated TFSL video format. The accuracy of each alphabet was tested 20 times. The experiment was divided into three groups: one stroke, as shown in Table 4; the two-stroke results, which are presented in Table 5; and the three-stroke results, which are shown in Table 6.

Table 4. The results of the one-stroke TFSL recognition system.

No. | Thai Letters | Rule | Correct | Incorrect | Accuracy (%)
1 | ก (/k/) | K | 13 | 7 | 65.00
2 | ต (/t/) | T | 18 | 2 | 90.00
3 | ส (/s/) | S | 18 | 2 | 90.00
4 | พ (/pʰ/) | P | 20 | 0 | 100.00
5 | ห (/h/) | H | 20 | 0 | 100.00
6 | บ (/b/) | B | 17 | 3 | 85.00
7 | ร (/r/) | R | 15 | 5 | 75.00
8 | ว (/w/) | W | 17 | 3 | 85.00
9 | ด (/d/) | D | 20 | 0 | 100.00
10 | ฟ (/f/) | F | 20 | 0 | 100.00
11 | ล (/l/) | L | 20 | 0 | 100.00
12 | ย (/j/) | Y | 20 | 0 | 100.00
13 | ม (/m/) | M | 18 | 2 | 90.00
14 | น (/n/) | N | 11 | 9 | 55.00
15 | อ (/ʔ/) | A | 17 | 3 | 85.00
Average accuracy | 88.00


Table 5. The results of the two-stroke TFSL recognition system.

No. | Thai Letters | Rule | Classification Correct | Classification Incorrect | Accuracy (%) | Strokes Detection Correct | Strokes Detection Incorrect | Error Rate (%)
1 | ข (/kʰ/) | K + 1 | 17 | 3 | 85.00 | 19 | 1 | 5
2 | ค (/kʰ/) | K + 2 | 16 | 4 | 80.00 | 16 | 4 | 20
3 | ฆ (/kʰ/) | K + 3 | 20 | 0 | 100.00 | 20 | 0 | 0
4 | ถ (/tʰ/) | T + 1 | 17 | 3 | 85.00 | 20 | 0 | 0
5 | ฐ (/tʰ/) | T + 2 | 15 | 5 | 75.00 | 20 | 0 | 0
6 | ฒ (/tʰ/) | T + 3 | 20 | 0 | 100.00 | 20 | 0 | 0
7 | ฑ (/tʰ/) | T + 4 | 20 | 0 | 100.00 | 20 | 0 | 0
8 | ฏ (/t̚/) | T + 5 | 17 | 3 | 85.00 | 20 | 0 | 0
9 | ศ (/s/) | S + 1 | 11 | 9 | 55.00 | 16 | 4 | 20
10 | ษ (/s/) | S + 2 | 15 | 5 | 75.00 | 18 | 2 | 10
11 | ซ (/s/) | S + Q | 15 | 5 | 75.00 | 20 | 0 | 0
12 | ป (/p/) | P + 1 | 20 | 0 | 100.00 | 20 | 0 | 0
13 | ผ (/pʰ/) | P + 2 | 18 | 2 | 90.00 | 20 | 0 | 0
14 | ภ (/pʰ/) | P + 3 | 18 | 2 | 90.00 | 18 | 2 | 10
15 | ฮ (/h/) | H + 1 | 18 | 2 | 90.00 | 20 | 0 | 0
16 | ฎ (/d/) | D + 1 | 20 | 0 | 100.00 | 20 | 0 | 0
17 | ฝ (/f/) | F + 1 | 18 | 2 | 90.00 | 20 | 0 | 0
18 | ฬ (/l/) | L + 1 | 17 | 3 | 85.00 | 17 | 3 | 15
19 | จ (/t͡ɕ/) | J1 + J2 | 20 | 0 | 100.00 | 20 | 0 | 0
20 | ญ (/j/) | Y + 1 | 19 | 1 | 95.00 | 20 | 0 | 0
21 | ณ (/n/) | N + 1 | 11 | 9 | 55.00 | 20 | 0 | 0
22 | ง (/ŋ/) | N + G | 12 | 8 | 60.00 | 20 | 0 | 5
23 | ท (/tʰ/) | T + H | 17 | 3 | 85.00 | 20 | 0 | 0
24 | ฉ (/t͡ɕʰ/) | C + H | 19 | 1 | 95.00 | 20 | 0 | 0
Average accuracy | 85.42 | Average of error rate | 3.54

In terms of performance, the one-stroke TFSL recognition system shown in Table 4 had an average accuracy of 88.00%. According to the results, the system encountered discrimination problems with three groups of very similar sign language gestures: (a) a group of gestures that differ only in the position of the thumb, used for the letters T, S, M, N, and A, which lowered the accuracy for the letter N to 55%; (b) gestures made with two fingers (the index and middle fingers) that differ only slightly in the position of the thumb, namely the letter K and the number 2, which lowered the accuracy of the letter K to 65%; and (c) gestures that resemble raising a single finger: the R gesture involves crossing the index finger and ring finger, which is similar to the gesture for the number 1 (holding up one index finger), resulting in an accuracy of 75% for the letter R. These three groups of similar sign language gestures, shown in Figure 20, account for the low average accuracy.

(a) T, S, N, M, and A gestures

(b) K and 2 gestures

(c) R and 1 gestures

Figure 20. A group of one-stroke TFSL signs with many similarities.

Table 5 shows the results of the two-stroke TFSL recognition system. The results of the experiment show that groups with very similar signs, such as T, S, M, N, and A, affected the accuracy rates of the two-stroke recognition: T + 2, S + 2, and S + Q provided an accuracy of 75%, N + G provided an accuracy of 60%, and S + 1 and N + 1 offered an accuracy of 55%. The gesture for the number 2 is similar to the K gesture, which affected the accuracy of K + 2 (80%). The similarity between sign language gestures reduced the average accuracy of the two-stroke recognition system to 85.42%. For the two-stroke system, we also measured the error rate of stroke detection. The results show that sign language gestures with slight hand movements, such as K + 2 and S + 1, had a high error rate of 20%, affecting the overall accuracy of the system. The average error rate was 3.54%.

Three-stroke TFSL combines three sign language gestures to display one letter; it consists of three letters, formed by T + H + 1, C + H + 1, and C + H + 2. The results showed that the accuracy of T + H + 1 was 85%, that of C + H + 2 was 75%, and that of C + H + 1 was 65%, indicating a low accuracy value. This low accuracy was due to several reasons, such as faulty stroke detection caused by the use of multi-stroke sign language and the similarity of signs. The average overall accuracy was 75%. Because the three strokes involve distinct posture movements, the stroke detection error rate is slight, with an average of 5%, as shown in Table 6.

Table 6. The results of the three-stroke TFSL recognition system.

No. | Thai Letters | Rule | Classification Correct | Classification Incorrect | Accuracy (%) | Stroke Detection Correct | Stroke Detection Incorrect | Error Rate (%)
1 | ธ (/tʰ/) | T + H + 1 | 17 | 3 | 85.00 | 20 | 0 | 0
2 | ช (/t͡ɕʰ/) | C + H + 1 | 13 | 7 | 65.00 | 19 | 1 | 5
3 | ฌ (/t͡ɕʰ/) | C + H + 2 | 15 | 5 | 75.00 | 18 | 2 | 10
Average accuracy | 75.00 | Average of error rate | 5.00

Table 7 shows that the multi-stroke TFSL recognition system has an overall average accuracy of 85.60%. The experiment used the training images for CNN modeling and tested with multi-stroke isolated TFSL. The results show that one-stroke signs were tested 300 times, with 264 correct and 36 incorrect classifications, for an average accuracy of 88.00%. For two-stroke signs, the average accuracy was 85.42% from 480 tests, with 410 correct and 70 incorrect. Three-stroke signs were tested 60 times, with 45 correct and 15 incorrect, for an average accuracy of 75.00%. Overall stroke detection showed an average error rate of 3.70%. The results show that sign language gestures with little movement affected stroke detection and caused misclassification.

Table 7. The overall results of the multi‑stroke TFSL recognition system experiment.

No. | Stroke Group | Classification Correct | Classification Incorrect | Accuracy (%) | Stroke Detection Correct | Stroke Detection Incorrect | Error Rate (%)
1 | One-stroke | 264 | 36 | 88.00 | - | - | -
2 | Two-stroke | 410 | 70 | 85.42 | 464 | 16 | 3.54
3 | Three-stroke | 45 | 15 | 75.00 | 57 | 3 | 5.00
Average accuracy | 85.60 | Average of error rate | 3.70
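To illustrate how a sequence of recognized strokes is finally mapped to a Thai letter, the sketch below uses a small rule table drawn from a few of the combinations reported in Tables 4–6; the dictionary, function name, and lookup scheme are illustrative assumptions rather than the authors' implementation.

```python
# Hypothetical rule table: recognized stroke sequences -> Thai letters,
# using a few of the combinations reported in Tables 4-6.
COMBINATION_RULES = {
    ("K",): "ก",           # one stroke
    ("T",): "ต",
    ("K", "1"): "ข",        # two strokes
    ("T", "H"): "ท",
    ("T", "H", "1"): "ธ",   # three strokes
    ("C", "H", "1"): "ช",
    ("C", "H", "2"): "ฌ",
}

def strokes_to_letter(strokes):
    """Map a detected stroke sequence to a Thai letter (None if no rule matches)."""
    return COMBINATION_RULES.get(tuple(strokes))

print(strokes_to_letter(["T", "H", "1"]))  # ธ, a three-stroke letter
```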

5. Conclusions

This research focused on the development of a multi-stroke TFSL recognition system that supports complex backgrounds using deep learning. The study compared the performance of five CNN model formats. According to the results, the first and second formats are static CNN architectures in which the number and size of the filters are the same in every layer, and they provided the lowest accuracy. The third format increases the number of filters from layer to layer while keeping the filter size fixed, which results in higher accuracy. The fourth format also uses an ascending number of filters; its first layer uses large filters to learn global features, whereas the later layers use smaller filters to learn local features, resulting in higher accuracy still. The fifth format is a mixed architecture designed on the basis of the AlexNet structure: its first and second convolution layers increase the number of filters, the third and fourth layers raise the number of filters relative to the second layer and hold it constant, and the fifth and sixth layers use a fixed number of filters that is lower than in the two previous layers. For a similar purpose, large filters are used in the first layer to learn global features, while smaller filters are used in the subsequent layers to learn local features. The results show that this mixed architecture, with global feature learning followed by local feature learning, achieved the best accuracy.

Using the fifth format to train the system, the overall results indicate that the factors affecting the system's average accuracy are (1) very similar sign language gestures, which negatively affect classification and lower the average accuracy; (2) low-movement finger-spelling gestures, which hinder the detection of multiple spelling gestures and cause faulty stroke detection; and (3) the spelling of multi-stroke sign language, in which excessive movement can sometimes be detected between gestural changes, resulting in recognition errors. A solution could be to improve the motion detection so that the strokes of the sign language are separated more accurately, or to apply a long short-term memory network (LSTM) to enhance the recognition accuracy. In conclusion, the study results demonstrate that similarities among gestures and strokes in Thai finger-spelling decreased the accuracy. Further development of the TFSL recognition system is needed in future studies.

Author Contributions: Conceptualization, T.P. and P.S.; methodology, T.P.; validation, P.S.; investigation, T.P. and P.S.; data curation, T.P.; writing—original draft preparation, T.P. and P.S.; writing—review and editing, T.P. and P.S.; supervision, P.S. All authors have read and agreed to the published version of the manuscript.

Funding: This research received no external funding.

Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.

Data Availability Statement: The data presented in this study are available on request from the corresponding author. The data are not publicly available due to the copyright of the Natural Language and Speech Processing Laboratory, Department of Computer Science, Faculty of Science, Khon Kaen University, Khon Kaen.

Acknowledgments: This study was funded by the Computer Science Department, Faculty of Science, Khon Kaen University, Khon Kaen, Thailand. The authors would like to extend their thanks to Sotsuksa Khon Kaen School for their assistance and support in the development of the Thai finger-spelling sign language data.

Conflicts of Interest: The authors declare no conflict of interest.

References 1. Zhang, C.; Tian, Y.; Huenerfauth, M. Multi‑Modality American Sign Language Recognition. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 2881–2885. [CrossRef] 2. Nakjai, P.; Katanyukul, T. Hand Sign Recognition for Thai Finger Spelling: An Application of Convolution Neural Network. J. Signal Process. Syst. 2019, 91, 131–146. [CrossRef] 3. Pariwat, T.; Seresangtakul, P. Thai Finger‑Spelling Sign Language Recognition Employing PHOG and Local Features with KNN. Int. J. Adv. Soft Comput. Appl. 2019, 11, 94–107. 4. Pariwat, T.; Seresangtakul, P. Thai Finger‑Spelling Sign Language Recognition Using Global and Local Features with SVM. In Proceedings of the 2017 9th International Conference on Knowledge and Smart Technology (KST), Chonburi, Thailand, 1–4 February 2017; pp. 116–120. [CrossRef] 5. Adhan, S.; Pintavirooj, C. Thai Sign Language Recognition by Using Geometric Invariant Feature and ANN Classification. In Proceedings of the 2016 9th Biomedical Engineering International Conference (BMEiCON), Laung Prabang, , 7–9 December 2016; pp. 1–4. [CrossRef] 6. Saengsri, S.; Niennattrakul, V.; Ratanamahatana, C.A. TFRS: Thai Finger‑Spelling Sign Language Recognition System. InPro‑ ceedings of the 2012 Second International Conference on Digital Information and Communication Technology and It’s Applica‑ tions (DICTAP), , Thailand, 16–18 May 2012; pp. 457–462. [CrossRef] 7. Chansri, C.; Srinonchat, J. Reliability and Accuracy of Thai Sign Language Recognition with Kinect Sensor. In Proceedings of the 2016 13th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI‑CON), , Thailand, 28 June–1 July 2016; pp. 1–4. [CrossRef] 8. Silanon, K. Thai Finger‑Spelling Recognition Using a Cascaded Classifier Based on Histogram of Orientation Gradient Features. Comput. Intell. Neurosci. 2017, 2017, 1–11. [CrossRef][PubMed] 9. Tolentino, L.K.S.; Juan, R.O.S.; Thio‑ac, A.C.; Pamahoy, M.A.B.; Forteza, J.R.R.; Garcia, X.J.O. Static Sign Language Recognition Using Deep Learning. Int. J. Mach. Learn. Comput. 2019, 9, 821–827. [CrossRef] 10. Vo, A.H.; Pham, V.‑H.; Nguyen, B.T. Deep Learning for Vietnamese Sign Language Recognition in Video Sequence. Int. J. Mach. Learn. Comput. 2019, 9, 440–445. [CrossRef] 11. Goel, T.; Murugan, R. Deep Convolutional—Optimized Kernel Extreme Learning Machine Based Classifier for Face Recognition. Comput. Electr. Eng. 2020, 85, 106640. [CrossRef] 12. Wang, N.; Wang, Y.; Er, M.J. Review on Deep Learning Techniques for Marine Object Recognition: Architectures and Algorithms. Control Eng. Pract. 2020, 104458. [CrossRef] 13. Cowton, J.; Kyriazakis, I.; Bacardit, J. Automated Individual Pig Localisation, Tracking and Behaviour Metric Extraction Using Deep Learning. IEEE Access 2019, 7, 108049–108060. [CrossRef] 14. Chen, L.‑C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Con‑ volutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848. [CrossRef][PubMed] 15. Rahim, M.A.; Islam, M.R.; Shin, J. Non‑Touch Sign Word Recognition Based on Dynamic Hand Gesture Using Hybrid Segmen‑ tation and CNN Feature Fusion. Appl. Sci. 2019, 9, 3790. [CrossRef] 16. Koller, O.; Forster, J.; Ney, H. Continuous Sign Language Recognition: Towards Large Vocabulary Statistical Recognition Sys‑ tems Handling Multiple Signers. Comput. Vis. 
Image Underst. 2015, 141, 108–125. [CrossRef]

17. Hamed, A.; Belal, N.A.; Mahar, K.M. Arabic Sign Language Alphabet Recognition Based on HOG‑PCA Using Microsoft Kinect in Complex Backgrounds. In Proceedings of the 2016 IEEE 6th International Conference on Advanced Computing (IACC), Bhimavaram, India, 27–28 February 2016; pp. 451–458. [CrossRef] 18. Dahmani, D.; Larabi, S. User‑Independent System for Sign Language Finger Spelling Recognition. J. Vis. Commun. Image Repre‑ sent. 2014, 25, 1240–1250. [CrossRef] 19. Auephanwiriyakul, S.; Phitakwinai, S.; Suttapak, W.; Chanda, P.; Theera‑Umpon, N. Thai Sign Language Translation Using Scale Invariant Feature Transform and Hidden Markov Models. Pattern Recognit. Lett. 2013, 34, 1291–1298. [CrossRef] 20. Yu, F.; Koltun, V. Multi‑Scale Context Aggregation by Dilated Convolutions. In Proceedings of the 2016 International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, 2–4 May 2016. 21. Li, Y.; Zhang, X.; Chen, D. CSRNet: Dilated Convolutional Neural Networks for Understanding the Highly Congested Scenes. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1091–1100. [CrossRef] 22. Hosoe, H.; Sako, S.; Kwolek, B. Recognition of JSL Finger Spelling Using Convolutional Neural Networks. In Proceedings of the 2017 Fifteenth IAPR International Conference on Machine Vision Applications (MVA), Nagoya, Japan, 8–12 May 2017; pp. 85–88. [CrossRef] 23. Bunrit, S.; Inkian, T.; Kerdprasop, N.; Kerdprasop, K. Text‑Independent Speaker Identification Using Deep Learning Model of Convolution Neural Network. Int. J. Mach. Learn. Comput. 2019, 9, 143–148. [CrossRef] 24. Colque, R.V.H.M.; Junior, C.A.C.; Schwartz, W.R. Histograms of Optical Flow Orientation and Magnitude to Detect Anomalous Events in Videos. In Proceedings of the 2015 28th SIBGRAPI Conference on Graphics, Patterns and Images, Salvador, Bahia, Brazil, 26–29 August 2015; pp. 126–133. [CrossRef] 25. Cheng, S.; Zhao, K.; Zhang, D. Abnormal Water Quality Monitoring Based on Visual Sensing of Three‑Dimensional Motion Behavior of Fish. Symmetry 2019, 11, 1179. [CrossRef] 26. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Commun. ACM 2017, 60, 84–90. [CrossRef]