International Conference on Digital Transformation and Applications (ICDXA) 2020

Gesture Recognition: Malaysian Sign Language Recognition using Convolutional Neural Network

Alison Liew Shu Lien1 and Lim Khai Yin2

Department of Computer Science and Embedded Systems, Tunku Abdul Rahman University College, Kuala Lumpur. [email protected] and 2limky@tarc.edu.my

ABSTRACT. Sign language is a communication medium for the deaf and vocally impaired. However, this language is not practised in public because the deaf community is a minority, and it takes time to learn and skilled manpower to assist the deaf in public interaction. Thus, this study aims to produce a Malaysian Sign Language Recognition (MSLR) application to recognise MSL alphabets and help normal people communicate with the deaf. The proposed work involves a few stages: background subtraction to detect the moving hand, skin segmentation based on skin tones using the YCbCr (Luminance, Chrominance) colour space for robustness to illumination, and a 2D Convolutional Neural Network (CNN) model for feature extraction and classification of 24 alphabets. As MSL alphabets are similar to American Sign Language (ASL) alphabets, the ASL dataset from the University of Surrey's Centre for Vision, Speech and Signal Processing is used for model training and testing. Evaluation criteria include micro averages of Precision, Recall and F1-Score. The test accuracy is 79.54%, with misclassifications on letters such as 'E' and 'Q' due to the signing orientation and similarity in finger articulation.

KEYWORDS: 2D CNN, Computer Vision, Fingerspelling, Machine Learning, Malaysian Sign Language Recognition.

1 INTRODUCTION

Malaysian Sign Language (MSL) was developed from American Sign Language (ASL) upon the latter's introduction into the Penang School for the Deaf in the mid 1970s, and the two sign languages have since shared approximately 75% similarity (Hurlbut, 2010).

The deaf-mutes face a communication barrier with those of normal hearing. Qualified interpreters are in short supply, with fewer than 100 certified MSL interpreters (Lau, 2017). Given the inconvenience faced by signers in public, there is a need for sign language recognition (SLR) applications for implementation at hospitality centers. Presently, most research into the development of SLR involves complex gear such as special gloves with embedded sensors, attached to a translating console with wires or connected wirelessly via Bluetooth or WiFi. The accumulated hardware procurement costs would make such a solution unfeasible at small business establishments or in homes for personal usage (Erard, 2017).

Thus, this research aims to develop a vision-based solution that recognises MSL signs without depending on special hardware, and to compare the efficiency of the proposed method with existing research in SLR.

The paper is organised as follows: Section 2 briefly discusses the literature review, Section 3 covers the proposed methodology, and its results are presented in Section 4. Finally, Section 5 concludes the study.

2 LITERATURE REVIEW

One of the popular algorithms used in SLR is the Neural Network (NN). In (Xiao, Zhao and Huan, 2018), the framework used a 2D Convolutional Neural Network (CNN) to extract colour and depth features and a Dynamic Bayesian Network (DBN) for classification. The results were 95.3% training and 94.88% validation accuracy, respectively. The authors noted that the DBN structure becomes increasingly complex and computationally expensive as more features and nodes are used to train larger datasets. However, their research omitted the interference between the skin and the face, which, in a real-world setting, would negatively impact the recognition accuracy and make it inconsistent with the reported results.


In (Akmeliawati, Ooi and Kuang, 2007), a two-layer feed-forward NN was used to recognise MSL in real-time. The authors transformed the RGB (Red, Green, Blue) images to YCbCr and performed colour segmentation on custom-made gloves for subsequent retrieval of hand centroid features. Their system achieved a 90% recognition rate but did not specify the dataset size or the nature of the recognition results, i.e. whether train, test or validation accuracy was reported.

Another CNN algorithm is the pre-trained GoogLeNet, which yielded 72% accuracy in ASL (Garcia, 2016). However, it generalised weakly on datasets that GoogLeNet was not previously trained on.

In (Aryanie and Heryadi, 2015), K-Nearest Neighbors (KNN) was used to classify the ASL alphabets a-k. The recognition rate was 28.6% when k=5 and Principal Component Analysis (PCA) was used to reduce the dimensions of the colour features. The low recognition rate was due to PCA's inability to discern highly correlated features in certain alphabets with similar appearances. Moreover, their dataset was small and limited to 11 alphabets.

Unlike other algorithms, CNN acts as a feature extractor for spatial and scaling-, rotation- and translation-invariant features (Xiao, Zhao and Huan, 2018). These features improve recognition results in the case of hand pose changes (Kuznetsova and Leal-Taixé, 2013). With rectified linear unit (ReLU) nonlinearity, CNN trains faster on large datasets. Overfitting is overcome with regularisation by dropout. However, its operation requires high computation power and longer training time. Its softmax loss function overcomes the limitation of the Support Vector Machine loss, which maps each letter's scores to probabilities only indirectly and thereby reduces recognition rates (Garcia, 2016).

SLR involves large datasets; hence CNN is more suitable than KNN, which otherwise incurs a higher computation cost to find the k nearest samples as more unseen data arrive (Kumar and Manjunatha, 2017). As this study trains on 2D static image data, a 2D CNN is used, since 2D convolutions offer better validation accuracy here than the 3D convolutions present in 3D CNNs.

Thus, this paper works on several hypotheses. First, scale-, rotation- and translation-invariant features would yield better results than colour, depth and hand shape features, which do not address the problem of different gesturing orientations. Second, using a 2D CNN for large-dataset feature extraction and classification would reduce the complexity of SLR experiments in terms of workflow execution, time and computational resources. Lastly, a 2D CNN can produce equal or higher training, validation and testing accuracies than non-NN approaches.

3 RESEARCH METHODOLOGY

The proposed MSLR is constructed with the framework shown in Figure 1.

Figure 1: Flowchart of the proposed MSLR framework.

3.1 Video Acquisition

During real-time testing, live-feed video acquisition is done via a 0.9 MP HP Wide Vision HD webcam on a laptop with an Intel(R) Core(TM) i5-7200U processor (2.50-2.71 GHz) and 8 GB RAM. The video contains a complex background with varying illumination, and the frames are mirrored.

3.2 Skin Colour Segmentation

The BGR frames are converted to YCbCr for better resistance to illumination changes (Kolkur et al., 2017).


Their pixels are compared to skin thresholds to build a skin mask and remove unnecessary background elements. The skin threshold used is the one proposed by (Alberto et al., 2011), giving the mask in Eq. (5):

    mask(x, y) = 1, if pixel(x, y) ≥ threshold
                 0, otherwise                                  (5)
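To make this step concrete, the following is a minimal OpenCV sketch of YCbCr skin masking. The Cb/Cr bounds shown are commonly used illustrative values rather than the exact thresholds from (Alberto et al., 2011), and the helper name skin_mask is hypothetical:

```python
import cv2
import numpy as np

# Illustrative skin bounds in (Y, Cr, Cb) order; substitute the exact
# thresholds from (Alberto et al., 2011) here.
SKIN_LOWER = np.array([0, 133, 77], dtype=np.uint8)
SKIN_UPPER = np.array([255, 173, 127], dtype=np.uint8)

def skin_mask(frame_bgr):
    """Convert a BGR frame to YCrCb and threshold it into a binary skin mask."""
    # Note: OpenCV names the colour space YCrCb (channel order Y, Cr, Cb).
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    return cv2.inRange(ycrcb, SKIN_LOWER, SKIN_UPPER)  # 255 = skin, 0 = other
```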

3.4 Hand Detection

Solely conducting background subtraction may eliminate unwanted background elements that interfere with hand detection, but it still detects non-skin-coloured moving objects. Therefore, the background-subtracted mask is combined with the skin mask using a bitwise AND. The rationale is that the combination of the masks results in the detection of only a moving, skin-coloured object, i.e. the user's hand. Figure 2 shows the hand detection after combining both the background subtraction and skin masks.
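A minimal sketch of this mask combination, assuming OpenCV's MOG2 background subtractor (the paper does not name the specific background subtraction algorithm used) and the skin_mask helper from the previous sketch:

```python
import cv2

# MOG2 is one common choice of background subtractor; this is an assumption,
# not necessarily the algorithm used in the paper.
bg_subtractor = cv2.createBackgroundSubtractorMOG2(detectShadows=False)

def hand_mask(frame_bgr):
    """Keep only pixels that are both moving (foreground) and skin-coloured."""
    fg = bg_subtractor.apply(frame_bgr)               # motion/foreground mask
    _, fg = cv2.threshold(fg, 127, 255, cv2.THRESH_BINARY)
    return cv2.bitwise_and(fg, skin_mask(frame_bgr))  # bitwise AND of masks
```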


3.7 Dataset

Since MSL alphabets are derived from ASL, the ASL FingerSpelling Dataset is used. It contains 120,000 samples of RGB and depth images of 24 classes of statically gestured alphabets (excluding J and Z), performed by 5 users in similar lighting and background (Pugeault and Bowden, 2011). The PNG-format images are close-ups of hand gestures, sized at an average dimension of 150x150 pixels. In this study, 66,080 RGB images are used for training, validation and testing.

3.8 Evaluation Metrics

The multi-class classification performance is evaluated with a confusion matrix and micro-averages of Precision, the fraction of true positives (TP) among positive predictions; Recall (Sensitivity), the fraction of TP among actual positive events; and F1-Score, the harmonic mean of Precision and Recall. Eq. (6), (7) and (8) are the formulae for the micro-averages:

    Precision = Σ TP_i / Σ (TP_i + FP_i)                         (6)
    Recall    = Σ TP_i / Σ (TP_i + FN_i)                         (7)
    F1-Score  = 2 × (Precision × Recall) / (Precision + Recall)  (8)

where the sums run over the class labels i = 1, ..., k, k is the number of class labels, and FP and FN represent false positives and false negatives, respectively.

4 EXPERIMENTAL RESULTS

In this experiment, 20 trials of model training are performed on an 8-core Intel Xeon E5-1620 3.5 GHz CPU with GTX 1080 GPU acceleration, using Python 3.6. A 70/30 train-validation split was done on 53,298 images. For testing, 12,782 images are used. Additionally, data augmentation is performed during model training: horizontal and vertical flips, a rotation range of 20°, a zoom range of 0.2, a shear range of 0.2 and a brightness range between 0.5 and 2.0.

With hyperparameters tuned to an L2 weight decay of 1e-3, 70% dropout, a learning rate of 1e-3, 100 epochs and a batch size of 512, the model's training duration is 4295 seconds. The results are 98.2% train accuracy, 84.26% validation accuracy and 79.54% test accuracy. These hyperparameters produce the highest test accuracy and the lowest overfitting between the model's train and test accuracies among the 20 training trials with different hyperparameter settings. Table 1 shows the performance evaluation of the model.

Table 1: Classification report of the proposed model

Alphabet       Precision   Recall   F1-Score   Total
A              0.75        0.92     0.83       545
B              0.93        1.00     0.96       534
C              0.96        0.97     0.97       540
D              0.95        0.92     0.93       544
E              0.55        0.15     0.23       539
F              0.98        0.91     0.94       527
G              0.96        1.00     0.98       537
H              0.98        1.00     0.99       534
I              0.92        0.83     0.87       525
K              1.00        0.72     0.84       569
L              1.00        0.97     0.98       514
M              0.68        0.76     0.72       525
N              0.63        0.79     0.70       530
O              0.57        0.97     0.72       532
P              0.77        0.41     0.54       527
Q              0.18        0.01     0.01       529
R              0.80        0.99     0.88       562
S              0.41        0.99     0.58       526
T              0.53        0.49     0.51       531
U              0.96        0.93     0.94       529
V              0.92        0.93     0.92       527
W              0.96        1.00     0.98       513
X              0.78        0.48     0.60       522
Y              0.93        0.97     0.95       521
Micro avg      0.80        0.80     0.80       12782
Macro avg      0.80        0.80     0.77       12782
Weighted avg   0.80        0.80     0.77       12782

The alphabet 'Q' has the lowest TP among the 24 alphabets, with only 3 out of 529 images of 'Q' correctly predicted, resulting in low precision (0.18), recall (0.01) and F1-Score (0.01), as shown in Table 1.

5 DISCUSSION

In this study, a 2D CNN is used to extract features from gestured alphabets and classify them into 24 alphabets. Instead of using a pre-trained model, building and training a CNN model from scratch provides better control over hyperparameter tuning and over the network's capacity to generalise on a specific dataset. Furthermore, the model eases the extraction of translation-, rotation- and scale-invariant features without requiring extra work and resources to construct handcrafted features. Additionally, the reduced data pre-processing needs, without dependence on handcrafted features, also lower computation power usage, allowing the model to run smoothly in real-time on a standard laptop.

The increase of the L2 weight decay by a factor of 10, to 1e-3, reduced the overfitting between train and test accuracies. A possible cause of this decreased overfitting is that the larger L2 coefficient imposes a higher penalty on the network's squared weight values. The network weights therefore become smaller, which increases the network's stability by minimising changes in the output when there are small changes in the input.
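For reproduction, the following is a minimal Keras sketch assembling the hyperparameters and augmentation settings reported in Section 4 (L2 weight decay of 1e-3, 70% dropout, learning rate of 1e-3, batch size of 512, 100 epochs, ADAM optimiser, 64x64 grayscale input). The layer layout itself is hypothetical, as the paper does not specify the network architecture:

```python
from tensorflow.keras import layers, models, optimizers, regularizers
from tensorflow.keras.preprocessing.image import ImageDataGenerator

NUM_CLASSES = 24                     # static alphabets, excluding J and Z
l2 = regularizers.l2(1e-3)           # L2 weight decay reported in Section 4

# Hypothetical layer layout; filter counts and depth are assumptions.
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', kernel_regularizer=l2,
                  input_shape=(64, 64, 1)),          # 64x64 grayscale input
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu', kernel_regularizer=l2),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation='relu', kernel_regularizer=l2),
    layers.Dropout(0.7),                             # 70% dropout as reported
    layers.Dense(NUM_CLASSES, activation='softmax'),
])
model.compile(optimizer=optimizers.Adam(learning_rate=1e-3),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Augmentation settings taken from Section 4.
augmenter = ImageDataGenerator(horizontal_flip=True,
                               vertical_flip=True,
                               rotation_range=20,
                               zoom_range=0.2,
                               shear_range=0.2,
                               brightness_range=(0.5, 2.0))

# x_train/y_train are placeholders for the prepared images and one-hot labels:
# model.fit(augmenter.flow(x_train, y_train, batch_size=512), epochs=100)
```

A per-class report in the style of Table 1, including the micro, macro and weighted averages, can then be obtained with scikit-learn's classification_report on the held-out test predictions.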


The alphabet with an extremely low TP is 'Q'. This misclassification has caused the model's accuracy to suffer. Table 2 tabulates the alphabet alongside the letters it was commonly misclassified as, which returned high FP. Figure 4 shows 'Q' with the letters the model confused it with.

Table 2: Frequently misclassified alphabets

            TP                   FP
Alphabet    Total     Alphabet   Total
Q           3         O          202
                      T          126

Figure 4: Left: 'Q'. Middle: 'O'. Right: 'T'.

For example, 'Q' was commonly mistaken for 'O' and 'T', as seen in Figure 4, because of the closely set fingers. Another argument supporting the misclassification phenomenon is that the 'Q' dataset images revealed the gesture was made in different rotations, such as the direction in which the thumb and index finger, pressed together in a pinch shape, are pointing. For instance, in Figure 4 (Left), the thumb and index finger were pointing towards the camera. The same can be said for other alphabets, which were signed with varying degrees of rotation and clarity by each of the 5 signers. Thus, the similarity in hand-signing posture and the perspective of gestures made in front of the camera affected how the classifier learned the features of each alphabet.

In comparison with related work on the same dataset, the proposed methodology outperforms the GoogLeNet approach, which scored 72% recognition accuracy (Garcia, 2016). Additionally, it outperforms the Multi-class Random Forest (MCRF) approach in (Pugeault and Bowden, 2011), which yielded 75% mean precision using combined hand appearance and intensity information extracted with Gabor filters. However, the proposed model is second to the Multi-layered Random Forest (MLRF) approach suggested by Kuznetsova, which achieved a shorter training time of 821 seconds and 97.4% recognition accuracy (Kuznetsova and Leal-Taixé, 2013). This is because MLRF utilises lower training time and low memory consumption. Moreover, errors made in the earlier levels of MLRF are not propagated to subsequent levels, hence increasing classification accuracy. The authors (Kuznetsova and Leal-Taixé, 2013) also used scaling-, rotation- and translation-invariant features extracted with the Ensemble of Shape Functions (ESF) descriptor.

When compared with the colour, depth, intensity and hand contour features used in the NN-related SLR literature and in MCRF, the proposed features contributed to a training accuracy of 98.2%, higher than the other features. Thus, the proposed features contribute to better model accuracy than intensity and depth features. The common conclusion drawn from the related work and the experimental results is that recognition accuracies suffered due to the similarities in the appearance of different signed gestures.

6 CONCLUSION

This study presents a real-time method for recognising MSL alphabets by implementing a 2D CNN model for both feature extraction and classification. After training and testing, the CNN model using the ADAM optimiser reveals micro-averages of Precision, Recall and F1-Score of 0.80, 0.80 and 0.80 respectively. Test accuracy is 79.54%. For the pre-processing of the dataset and real-time frames, grayscale images are resized to 64x64, and Gaussian blurring and morphological opening with a 3x3 kernel are applied. Furthermore, background subtraction and skin segmentation based on the YCbCr colour space are used for hand segmentation and detection in an environment with a non-static background and changing illumination.
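A minimal OpenCV sketch of this pre-processing chain (the exact step ordering and the Gaussian kernel size are assumptions where the paper does not state them; the 3x3 opening kernel is as reported):

```python
import cv2
import numpy as np

KERNEL = np.ones((3, 3), np.uint8)   # 3x3 opening kernel, as reported

def preprocess(frame_bgr):
    """Grayscale -> 64x64 resize -> Gaussian blur -> morphological opening."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    small = cv2.resize(gray, (64, 64))
    blurred = cv2.GaussianBlur(small, (3, 3), 0)   # blur kernel size assumed
    return cv2.morphologyEx(blurred, cv2.MORPH_OPEN, KERNEL)
```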


The strength of this study is that the model runs fast enough on a standard laptop. Hand detection was successful with both skin segmentation and background subtraction, without having to rely on a separate classifier for recognising and detecting hands, thus further allowing the application to run with a lower computation power requirement. Furthermore, the YCbCr colour space threshold adopted has proven robust in bright illumination. The results reveal that the proposed features contributed to better accuracy than colour, depth, intensity and hand centroid features. The 2D CNN reduced the experimental complexity when used for both feature extraction and classification. In terms of performance against non-NNs, the proposed model lacks accuracy compared to MLRF but performs better than MCRF. Thus, this paper's hypotheses have been proven.

Future work consists of improving the hand detection function of the system with a separate classifier trained to recognise hands. Pre-trained models like VGG or ResNet are worth exploring to classify MSL, for their shorter training duration and higher accuracy due to pre-trained weights in image classification. Early stopping can be employed to avoid overtraining the model; it may minimise overfitting on the training dataset, which could cause poor performance on the test dataset.

REFERENCES

Akmeliawati, R., Ooi, M. P. and Kuang, Y. C. (2007) 'Real-Time Malaysian Sign Language Translation using Colour Segmentation and Neural Network', (May), pp. 1-3.
Alberto, J. et al. (2011) 'Explicit Image Detection using YCbCr Space Color Model as Skin Detection', in Applications of Mathematics and Computer Engineering. Mexico City, pp. 123-128.
Aryanie, D. and Heryadi, Y. (2015) 'American Sign Language-Based Finger-spelling Recognition using k-Nearest Neighbors Classifier', 2015 3rd International Conference on Information and Communication Technology (ICoICT). IEEE, pp. 533-536. doi: 10.1109/ICoICT.2015.7231481.
Erard, M. (2017) Why Sign Language Gloves Don't Help Deaf People, The Atlantic.
Garcia, B. (2016) Real-time American Sign Language Recognition with Convolutional Neural Networks.
Hurlbut, H. M. (2010) 'Malaysian Sign Language: A phonological statement', (1972), pp. 157-178.
Hussain, Z., Naaz, A. and Uddin, M. N. (2016) 'Moving Object Detection Based on Background Subtraction & Frame Differencing Technique', International Journal of Advanced Research in Computer and Communication Engineering (IJARCCE), 5(5), pp. 817-819. doi: 10.17148/IJARCCE.2016.55200.
Kolkur, S. et al. (2017) 'Human Skin Detection Using RGB, HSV and YCbCr Color Models', 137, pp. 324-332.
Kumar, B. P. P. and Manjunatha, M. B. (2017) 'A Hybrid Gesture Recognition Method for American Sign Language', Indian Journal of Science and Technology, 10(January), pp. 1-12. doi: 10.17485/ijst/2017/v10i1/109389.
Kuznetsova, A. and Leal-Taixé, L. (2013) 'Real-time sign language recognition using a consumer depth camera'. doi: 10.1109/ICCVW.2013.18.
Lau, C. (2017) 'Acute shortage of sign language interpreters in Malaysia', MalaysiaKini, 18 March, p. 2. Available at: https://www.malaysiakini.com/news/376165
Pugeault, N. and Bowden, R. (2011) 'Spelling It Out: Real-Time ASL Fingerspelling Recognition', in 1st IEEE Workshop on Consumer Depth Cameras for Computer Vision. Barcelona, Spain, pp. 1-6.
Xiao, Q., Zhao, Y. and Huan, W. (2018) 'Multi-sensor data fusion for sign language recognition based on dynamic Bayesian network and convolutional neural network', Multimedia Tools and Applications, pp. 1-18.
