International Conference on Digital Transformation and Applications (ICDXA) 2020

Gesture Recognition: Malaysian Sign Language Recognition using Convolutional Neural Network

Alison Liew Shu Lien1 and Lim Khai Yin2

Department of Computer Science and Embedded Systems, Tunku Abdul Rahman University College, Kuala Lumpur. [email protected] and 2limky@tarc.edu.my

ABSTRACT. Sign language is a communication medium for the deaf and vocally impaired. However, this language is not practised in public because the deaf community is a minority, and it takes time to learn and skilled manpower to assist the deaf in public interaction. Thus, this study aims to produce a Malaysian Sign Language Recognition (MSLR) application to recognise MSL alphabets and help normal people communicate with the deaf. The proposed work involves a few stages: background subtraction to detect the moving hand, skin segmentation based on skin tones using the YCbCr (Luminance, Chrominance) colour space for robustness to illumination, and a 2D Convolutional Neural Network (CNN) model for feature extraction and classification of 24 alphabets. As MSL alphabets are similar to American Sign Language (ASL) alphabets, the ASL dataset from the University of Surrey's Centre for Vision, Speech and Signal Processing is used for model training and testing. Evaluation criteria include micro averages of Precision, Recall and F1-Score. The test accuracy is 79.54%, with misclassifications on letters such as 'E' and 'Q' due to the signing orientation and similarity in finger articulation.

KEYWORDS: 2D CNN, Computer Vision, Fingerspelling, Machine Learning, Malaysian Sign Language Recognition.

1 INTRODUCTION

Malaysian Sign Language (MSL) was developed from American Sign Language (ASL) upon the latter's introduction into the Penang School for the Deaf in the mid 1970s, and the two sign languages have since shared approximately 75% similarity (Hurlbut, 2010).

The deaf-mutes face a communication barrier with those of normal hearing. Qualified interpreters are in short supply, with fewer than 100 certified MSL interpreters (Lau, 2017). Given the inconvenience faced by signers in public, there is a need for sign language recognition (SLR) applications for implementation at hospitality centers. Presently, most research into the development of SLR involves complex gear such as special gloves with embedded sensors, attached to a translating console with wires or connected wirelessly via Bluetooth or WiFi. The accumulated hardware procurement costs would make such a solution unfeasible at small business establishments or in homes for personal usage (Erard, 2017).

Thus, this research aims to develop a vision-based solution that recognises MSL signs without depending on special hardware, and to compare the efficiency of the proposed method with existing research in SLR.

The paper is organised as follows: Section 2 briefly discusses the literature review, Section 3 covers the proposed methodology, and its results are presented in Section 4. Finally, Section 5 concludes the study.

2 LITERATURE REVIEW

One of the popular algorithms used in SLR is the Neural Network (NN). In (Xiao, Zhao and Huan, 2018), the framework used a 2D Convolutional Neural Network (CNN) to extract colour and depth features and a Dynamic Bayesian Network (DBN) for classification. The results were 95.3% training and 94.88% validation accuracy, respectively. The authors noted that the DBN structure becomes increasingly complex and computationally expensive as more features and nodes are used to train larger datasets. However, their research omitted the interference between the skin and the face, which, in a real-world setting, would negatively impact the recognition accuracy and make it inconsistent with the reported results.


In (Akmeliawati, Ooi and Kuang, 2007), a two-layer feed-forward NN was used to recognise MSL in real-time. The authors transformed the RGB (Red, Green, Blue) images to YCbCr and performed colour segmentation on custom-made gloves for subsequent retrieval of hand centroid features. Their system achieved a 90% recognition rate but did not specify the dataset size or the nature of the recognition results, i.e. whether train, test or validation accuracy was reported.

Another CNN algorithm is the pre-trained GoogLeNet, which yielded 72% accuracy in ASL (Garcia, 2016). However, it generalised weakly on datasets that GoogLeNet was not previously trained on.

In (Aryanie and Heryadi, 2015), K-Nearest Neighbors (KNN) was used to classify the ASL alphabets a-k. The recognition rate was 28.6% when k=5 and Principal Component Analysis (PCA) was used to reduce the dimensions of the colour features. The low recognition rate was due to PCA's inability to discern highly correlated features in certain alphabets with similar appearances. Moreover, their dataset was small and limited to 11 alphabets.

Unlike other algorithms, CNN acts as a feature extractor for spatial and scaling-, rotation- and translation-invariant features (Xiao, Zhao and Huan, 2018). These features improve recognition results in the case of hand pose changes (Kuznetsova and Leal-Taixé, 2013). With rectified linear unit (ReLU) nonlinearity, CNN trains faster on large datasets. Overfitting is overcome with regularisation by dropout. However, its operation requires high computation power and longer training time. Its softmax loss function overcomes the limitation of the Support Vector Machine loss, which maps each letter's scores to probabilities only indirectly and thereby reduces recognition rates (Garcia, 2016).

SLR involves large datasets; hence CNN is more suitable than KNN, which otherwise incurs a higher computation cost to find the k nearest samples as more unseen data arrive (Kumar and Manjunatha, 2017). As this study trains on 2D static image data, a 2D CNN is used, since 2D convolutions offer better validation accuracy here than the 3D convolutions present in 3D CNNs.

Thus, this paper works on several hypotheses. First, scale-, rotation- and translation-invariant features would yield better results than colour, depth and hand shape features, which do not address the problem of different gesturing orientations. Second, using a 2D CNN for large-dataset feature extraction and classification would reduce the complexity of SLR experiments in terms of workflow execution, time and computational resources. Lastly, a 2D CNN can produce equal or higher training, validation and testing accuracies than non-NN approaches.

3 RESEARCH METHODOLOGY

The proposed MSLR is constructed with the framework shown in Figure 1.

Figure 1: Flowchart of the proposed MSLR framework.

3.1 Video Acquisition

During real-time testing, live-feed video acquisition is done via a 0.9 MP HP Wide Vision HD webcam on a laptop with an Intel(R) Core(TM) i5-7200U processor (2.50-2.71 GHz) and 8 GB RAM. The video contains a complex background with varying illumination, and the frames are mirrored.

3.2 Skin Colour Segmentation

The BGR frames are converted to YCbCr for better resistance to illumination changes (Kolkur et al., 2017).


Their pixels are compared to skin thresholds to build a skin mask and remove unnecessary background elements. The skin threshold used is the one proposed by (Alberto et al., 2011), giving the mask in Eq. (5):

    mask(x, y) = 1, if pixel(x, y) ≥ threshold
                 0, otherwise                                  (5)
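To make this step concrete, the following is a minimal OpenCV sketch of YCbCr skin masking. The Cb/Cr bounds shown are commonly used illustrative values rather than the exact thresholds from (Alberto et al., 2011), and the helper name skin_mask is hypothetical:

```python
import cv2
import numpy as np

# Illustrative skin bounds in (Y, Cr, Cb) order; substitute the exact
# thresholds from (Alberto et al., 2011) here.
SKIN_LOWER = np.array([0, 133, 77], dtype=np.uint8)
SKIN_UPPER = np.array([255, 173, 127], dtype=np.uint8)

def skin_mask(frame_bgr):
    """Convert a BGR frame to YCrCb and threshold it into a binary skin mask."""
    # Note: OpenCV names the colour space YCrCb (channel order Y, Cr, Cb).
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    return cv2.inRange(ycrcb, SKIN_LOWER, SKIN_UPPER)  # 255 = skin, 0 = other
```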

3.4 Hand Detection

Solely conducting background subtraction may eliminate unwanted background elements that interfere with hand detection, but it still detects non-skin-coloured moving objects. Therefore, the background-subtracted mask is combined with the skin mask using a bitwise AND. The rationale is that the combination of the masks results in the detection of only a moving, skin-coloured object, i.e. the user's hand. Figure 2 shows the hand detection after combining both the background subtraction and skin masks.
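A minimal sketch of this mask combination, assuming OpenCV's MOG2 background subtractor (the paper does not name the specific background subtraction algorithm used) and the skin_mask helper from the previous sketch:

```python
import cv2

# MOG2 is one common choice of background subtractor; this is an assumption,
# not necessarily the algorithm used in the paper.
bg_subtractor = cv2.createBackgroundSubtractorMOG2(detectShadows=False)

def hand_mask(frame_bgr):
    """Keep only pixels that are both moving (foreground) and skin-coloured."""
    fg = bg_subtractor.apply(frame_bgr)               # motion/foreground mask
    _, fg = cv2.threshold(fg, 127, 255, cv2.THRESH_BINARY)
    return cv2.bitwise_and(fg, skin_mask(frame_bgr))  # bitwise AND of masks
```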


3.7 Dataset

Since MSL alphabets are derived from ASL, the ASL FingerSpelling Dataset is used. It contains 120,000 samples of RGB and depth images of 24 classes of statically gestured alphabets (excluding J and Z), performed by 5 users in similar lighting and background (Pugeault and Bowden, 2011). The PNG-format images are close-ups of hand gestures, sized at an average dimension of 150x150 pixels. In this study, 66,080 RGB images are used for training, validation and testing.

3.8 Evaluation Metrics

The multi-class classification performance is evaluated with a confusion matrix and micro-averages of Precision, the fraction of true positives (TP) among positive predictions; Recall (Sensitivity), the fraction of TP among actual positive events; and F1-Score, the harmonic mean of Precision and Recall. Eq. (6), (7) and (8) are the formulae for the micro-averages:

    Precision = Σ TP_i / Σ (TP_i + FP_i)                         (6)
    Recall    = Σ TP_i / Σ (TP_i + FN_i)                         (7)
    F1-Score  = 2 × (Precision × Recall) / (Precision + Recall)  (8)

where the sums run over the class labels i = 1, ..., k, k is the number of class labels, and FP and FN represent false positives and false negatives, respectively.

4 EXPERIMENTAL RESULTS

In this experiment, 20 trials of model training are performed on an 8-core Intel Xeon E5-1620 3.5 GHz CPU with GTX 1080 GPU acceleration, using Python 3.6. A 70/30 train-validation split was done on 53,298 images. For testing, 12,782 images are used. Additionally, data augmentation is performed during model training: horizontal and vertical flips, a rotation range of 20°, a zoom range of 0.2, a shear range of 0.2 and a brightness range between 0.5 and 2.0.

With hyperparameters tuned to an L2 weight decay of 1e-3, 70% dropout, a learning rate of 1e-3, 100 epochs and a batch size of 512, the model's training duration is 4295 seconds. The results are 98.2% train accuracy, 84.26% validation accuracy and 79.54% test accuracy. These hyperparameters produce the highest test accuracy and the lowest overfitting between the model's train and test accuracies among the 20 training trials with different hyperparameter settings. Table 1 shows the performance evaluation of the model.

Table 1: Classification report of the proposed model

Alphabet       Precision   Recall   F1-Score   Total
A              0.75        0.92     0.83       545
B              0.93        1.00     0.96       534
C              0.96        0.97     0.97       540
D              0.95        0.92     0.93       544
E              0.55        0.15     0.23       539
F              0.98        0.91     0.94       527
G              0.96        1.00     0.98       537
H              0.98        1.00     0.99       534
I              0.92        0.83     0.87       525
K              1.00        0.72     0.84       569
L              1.00        0.97     0.98       514
M              0.68        0.76     0.72       525
N              0.63        0.79     0.70       530
O              0.57        0.97     0.72       532
P              0.77        0.41     0.54       527
Q              0.18        0.01     0.01       529
R              0.80        0.99     0.88       562
S              0.41        0.99     0.58       526
T              0.53        0.49     0.51       531
U              0.96        0.93     0.94       529
V              0.92        0.93     0.92       527
W              0.96        1.00     0.98       513
X              0.78        0.48     0.60       522
Y              0.93        0.97     0.95       521
Micro avg      0.80        0.80     0.80       12782
Macro avg      0.80        0.80     0.77       12782
Weighted avg   0.80        0.80     0.77       12782

The alphabet 'Q' has the lowest TP among the 24 alphabets, with only 3 out of 529 images of 'Q' correctly predicted, resulting in low precision (0.18), recall (0.01) and F1-Score (0.01), as shown in Table 1.

5 DISCUSSION

In this study, a 2D CNN is used to extract features from gestured alphabets and classify them into 24 alphabets. Instead of using a pre-trained model, building and training a CNN model from scratch provides better control over hyperparameter tuning and over the network's capacity to generalise on a specific dataset. Furthermore, the model eases the extraction of translation-, rotation- and scale-invariant features without requiring extra work and resources to construct handcrafted features. Additionally, the reduced data pre-processing needs, without dependence on handcrafted features, also lower computation power usage, allowing the model to run smoothly in real-time on a standard laptop.

The increase of the L2 weight decay by a factor of 10, to 1e-3, reduced the overfitting between train and test accuracies. A possible cause of this decreased overfitting is that the larger L2 coefficient imposes a higher penalty on the network's squared weight values. The network weights therefore become smaller, which increases the network's stability by minimising changes in the output when there are small changes in the input.
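For reproduction, the following is a minimal Keras sketch assembling the hyperparameters and augmentation settings reported in Section 4 (L2 weight decay of 1e-3, 70% dropout, learning rate of 1e-3, batch size of 512, 100 epochs, ADAM optimiser, 64x64 grayscale input). The layer layout itself is hypothetical, as the paper does not specify the network architecture:

```python
from tensorflow.keras import layers, models, optimizers, regularizers
from tensorflow.keras.preprocessing.image import ImageDataGenerator

NUM_CLASSES = 24                     # static alphabets, excluding J and Z
l2 = regularizers.l2(1e-3)           # L2 weight decay reported in Section 4

# Hypothetical layer layout; filter counts and depth are assumptions.
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', kernel_regularizer=l2,
                  input_shape=(64, 64, 1)),          # 64x64 grayscale input
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu', kernel_regularizer=l2),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation='relu', kernel_regularizer=l2),
    layers.Dropout(0.7),                             # 70% dropout as reported
    layers.Dense(NUM_CLASSES, activation='softmax'),
])
model.compile(optimizer=optimizers.Adam(learning_rate=1e-3),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Augmentation settings taken from Section 4.
augmenter = ImageDataGenerator(horizontal_flip=True,
                               vertical_flip=True,
                               rotation_range=20,
                               zoom_range=0.2,
                               shear_range=0.2,
                               brightness_range=(0.5, 2.0))

# x_train/y_train are placeholders for the prepared images and one-hot labels:
# model.fit(augmenter.flow(x_train, y_train, batch_size=512), epochs=100)
```

A per-class report in the style of Table 1, including the micro, macro and weighted averages, can then be obtained with scikit-learn's classification_report on the held-out test predictions.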


The alphabet with an extremely low TP is 'Q'. This misclassification has caused the model's accuracy to suffer. Table 2 tabulates the alphabet alongside the letters it was commonly misclassified as, which returned high FP. Figure 4 shows 'Q' with the letters the model confused it with.

Table 2: Frequently misclassified alphabets

            TP                   FP
Alphabet    Total     Alphabet   Total
Q           3         O          202
                      T          126

Figure 4: Left: 'Q'. Middle: 'O'. Right: 'T'.

For example, 'Q' was commonly mistaken for 'O' and 'T', as seen in Figure 4, because of the closely set fingers. Another argument supporting the misclassification phenomenon is that the 'Q' dataset images revealed the gesture was made in different rotations, such as the direction in which the thumb and index finger, pressed together in a pinch shape, are pointing. For instance, in Figure 4 (Left), the thumb and index finger were pointing towards the camera. The same can be said for other alphabets, which were signed with varying degrees of rotation and clarity by each of the 5 signers. Thus, the similarity in hand-signing posture and the perspective of gestures made in front of the camera affected how the classifier learned the features of each alphabet.

In comparison with related work on the same dataset, the proposed methodology outperforms the GoogLeNet approach, which scored 72% recognition accuracy (Garcia, 2016). Additionally, it outperforms the Multi-class Random Forest (MCRF) approach in (Pugeault and Bowden, 2011), which yielded 75% mean precision using combined hand appearance and intensity information extracted with Gabor filters. However, the proposed model is second to the Multi-layered Random Forest (MLRF) approach suggested by Kuznetsova, which achieved a shorter training time of 821 seconds and 97.4% recognition accuracy (Kuznetsova and Leal-Taixé, 2013). This is because MLRF utilises lower training time and low memory consumption. Moreover, errors made in the earlier levels of MLRF are not propagated to subsequent levels, hence increasing classification accuracy. The authors (Kuznetsova and Leal-Taixé, 2013) also used scaling-, rotation- and translation-invariant features extracted with the Ensemble of Shape Functions (ESF) descriptor.

When compared with the colour, depth, intensity and hand contour features used in the NN-related SLR literature and in MCRF, the proposed features contributed to a training accuracy of 98.2%, higher than the other features. Thus, the proposed features contribute to better model accuracy than intensity and depth features. The common conclusion drawn from the related work and the experimental results is that recognition accuracies suffered due to the similarities in the appearance of different signed gestures.

6 CONCLUSION

This study presents a real-time method for recognising MSL alphabets by implementing a 2D CNN model for both feature extraction and classification. After training and testing, the CNN model using the ADAM optimiser reveals micro-averages of Precision, Recall and F1-Score of 0.80, 0.80 and 0.80 respectively. Test accuracy is 79.54%. For the pre-processing of the dataset and real-time frames, grayscale images are resized to 64x64, and Gaussian blurring and morphological opening with a 3x3 kernel are applied. Furthermore, background subtraction and skin segmentation based on the YCbCr colour space are used for hand segmentation and detection in an environment with a non-static background and changing illumination.
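A minimal OpenCV sketch of this pre-processing chain (the exact step ordering and the Gaussian kernel size are assumptions where the paper does not state them; the 3x3 opening kernel is as reported):

```python
import cv2
import numpy as np

KERNEL = np.ones((3, 3), np.uint8)   # 3x3 opening kernel, as reported

def preprocess(frame_bgr):
    """Grayscale -> 64x64 resize -> Gaussian blur -> morphological opening."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    small = cv2.resize(gray, (64, 64))
    blurred = cv2.GaussianBlur(small, (3, 3), 0)   # blur kernel size assumed
    return cv2.morphologyEx(blurred, cv2.MORPH_OPEN, KERNEL)
```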


The strength of this study is that the model runs fast enough on a standard laptop. Hand detection was successful with both skin segmentation and background subtraction, without having to rely on a separate classifier for recognising and detecting hands, thus further allowing the application to run with a lower computation power requirement. Furthermore, the YCbCr colour space threshold adopted has proven robust in bright illumination. The results reveal that the proposed features contributed to better accuracy than colour, depth, intensity and hand centroid features. The 2D CNN reduced the experimental complexity when used for both feature extraction and classification. In terms of performance against non-NNs, the proposed model lacks accuracy compared to MLRF but performs better than MCRF. Thus, this paper's hypotheses have been proven.

Future work consists of improving the hand detection function of the system with a separate classifier trained to recognise hands. Pre-trained models like VGG or ResNet are worth exploring to classify MSL, for their shorter training duration and higher accuracy due to pre-trained weights in image classification. Early stopping can be employed to avoid overtraining the model; it may minimise overfitting on the training dataset, which could cause poor performance on the test dataset.

REFERENCES

Akmeliawati, R., Ooi, M. P. and Kuang, Y. C. (2007) 'Real-Time Malaysian Sign Language Translation using Colour Segmentation and Neural Network', (May), pp. 1-3.
Alberto, J. et al. (2011) 'Explicit Image Detection using YCbCr Space Color Model as Skin Detection', in Applications of Mathematics and Computer Engineering. Mexico City, pp. 123-128.
Aryanie, D. and Heryadi, Y. (2015) 'American Sign Language-Based Finger-spelling Recognition using k-Nearest Neighbors Classifier', 2015 3rd International Conference on Information and Communication Technology (ICoICT). IEEE, pp. 533-536. doi: 10.1109/ICoICT.2015.7231481.
Erard, M. (2017) Why Sign Language Gloves Don't Help Deaf People, The Atlantic.
Garcia, B. (2016) Real-time American Sign Language Recognition with Convolutional Neural Networks.
Hurlbut, H. M. (2010) 'Malaysian Sign Language: A phonological statement', (1972), pp. 157-178.
Hussain, Z., Naaz, A. and Uddin, M. N. (2016) 'Moving Object Detection Based on Background Subtraction & Frame Differencing Technique', International Journal of Advanced Research in Computer and Communication Engineering (IJARCCE), 5(5), pp. 817-819. doi: 10.17148/IJARCCE.2016.55200.
Kolkur, S. et al. (2017) 'Human Skin Detection Using RGB, HSV and YCbCr Color Models', 137, pp. 324-332.
Kumar, B. P. P. and Manjunatha, M. B. (2017) 'A Hybrid Gesture Recognition Method for American Sign Language', Indian Journal of Science and Technology, 10(January), pp. 1-12. doi: 10.17485/ijst/2017/v10i1/109389.
Kuznetsova, A. and Leal-Taixé, L. (2013) 'Real-time sign language recognition using a consumer depth camera'. doi: 10.1109/ICCVW.2013.18.
Lau, C. (2017) 'Acute shortage of sign language interpreters in Malaysia', MalaysiaKini, 18 March, p. 2. Available at: https://www.malaysiakini.com/news/376165
Pugeault, N. and Bowden, R. (2011) 'Spelling It Out: Real-Time ASL Fingerspelling Recognition', in 1st IEEE Workshop on Consumer Depth Cameras for Computer Vision. Barcelona, Spain, pp. 1-6.
Xiao, Q., Zhao, Y. and Huan, W. (2018) 'Multi-sensor data fusion for sign language recognition based on dynamic Bayesian network and convolutional neural network', Multimedia Tools and Applications, pp. 1-18.
