International Journal of Recent Technology and Engineering (IJRTE) ISSN: 2277-3878, Volume-8, Issue-2S11, September 2019

A Real Time Malaysian Sign Language Detection Algorithm Based on YOLOv3

Mohamad Amar Mustaqim Mohamad Asri, Zaaba Ahmad, Itaza Afiani Mohtar, Shafaf Ibrahim

Abstract— Sign language is a language that involves the use of hand gestures. It is a medium for the hearing impaired (deaf or mute) to communicate with others. However, in order to communicate with a hearing impaired person, the communicator has to have knowledge of sign language. This is to ensure that the message delivered by the hearing impaired person is understood. This project proposes a real time Malaysian sign language detection based on the Convolutional Neural Network (CNN) technique utilizing the You Only Look Once version 3 (YOLOv3) algorithm. Sign language images from web sources and frames of recorded sign language videos were collected. The images were labelled as either alphabets or movements. Once the preprocessing phase was completed, the system was trained and tested on the Darknet framework. The system achieved 63 percent accuracy with learning saturation (overfitting) at 7000 iterations. Once it is successfully conducted, this model will be integrated with other platforms in the future, such as a mobile application.

Keywords: Convolutional Neural Network (CNN), Sign Language Translation, YOLO.

I. INTRODUCTION

Communication is the key in our daily life. Communicating is the process of exchanging information between sender and receiver through any medium available. The understanding between two parties is very important to ensure that the message delivered is well interpreted by the receiver. Deaf people use sign language as a medium to communicate with others. Sign language is a way of communication that is based on hand movement and visual orientation. Sign language experts stated that the visual cues used consist of handshape (the way the hand and fingers form a sign), the location of the hand, palm orientation and movement of the hand as its features [1]. This clarifies that each hand gesture or movement has a different meaning to be represented.

Different countries have different sign languages because sign language itself was developed by the deaf communities based on their local culture, and is heavily influenced by and translated from the spoken language [1]. Experts believe that sign language is unique within its communities and not simply extracted from spoken language [2]. The sign that represents the word "Doctor" in American Sign Language (ASL) is formed by signing the letter "D" with the pointer finger straightened and then pointing it to the left-hand pulse. In contrast, in Malaysian Sign Language (MSL), a signer acts like a doctor wearing a stethoscope to represent the word doctor [3]. Therefore, the choice of word and sign was developed based on the culture of the region and the communities' understanding of certain words.

In Malaysia, there are two types of sign language used among the hearing impaired communities, which are Malaysian Sign Language (MSL) and Kod Tangan Bahasa Melayu (KTBM). MSL is an informal language that was created naturally by the deaf communities, while KTBM is a formal language involving handshape movement cued with speech that was introduced by the government in the education system. KTBM is a teaching module released in 1985 by the Ministry of Education, which was adapted from American Sign Language (ASL) and converted into Malay for education purposes in school [4]. Meanwhile, MSL is like a layman language that is usually used as a communication medium among the deaf communities [5].

In a real situation, when normal people meet deaf people, communication breakdown happens due to the different styles of communication. In order to communicate with a hearing impaired person, knowledge of sign language is necessary, but it becomes a barrier for those who do not learn the language. The common constraint faced by deaf people in communication is the absence of a sign language interpreter [6]. This assertion is supported by Ting, a sign language teacher at the Sarawak Society for the Deaf, who claimed that people are not interested in learning sign language due to the low demand for sign language classes among normal people [7].

Revised Version Manuscript Received on September 16, 2019.
Mohamad Amar Mustaqim Mohamad Asri, Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA, Perak Branch Tapah Campus, 35400 Tapah Road, Perak, Malaysia.
Zaaba Ahmad, Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA, Perak Branch Tapah Campus, 35400 Tapah Road, Perak, Malaysia.
Itaza Afiani Mohtar, Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA, Perak Branch Tapah Campus, 35400 Tapah Road, Perak, Malaysia.
Shafaf Ibrahim, Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA, Melaka Branch Jasin Campus, 77300 Merlimau, Melaka, Malaysia.


II. BACKGROUND OF STUDY

A. Object detection

Fig. 1: Object detection framework

Object detection is an approach that enables the computer to recognize the class an object belongs to and to locate the object. The purpose of object detection is to determine what type of objects are present and where each object is located in an image or video. The first step is object localization, in which the algorithm predicts whether there is interesting information in the image and where instances are present. This region is bounded by a large set of bounding boxes covering the entire image. For instance, in the Region-based Convolutional Neural Network (RCNN), a selective search algorithm was adopted to generate the region proposals.

The selective search approach groups the pixels of an image and clusters the pixelated groups. It begins by extracting the pixels of an image, and then groups the nearest neighbouring pixels in order to reduce the correlation between the two pixels. The full image, as the largest segment, is achieved by iterating this process [8]. The second step of object detection is the evaluation of the visual features extracted from the image, or feature extraction. Feature extraction refers to the process of identifying the key points in an image (interest points) that can help to define the image's contents, such as corners, shapes, edges, and blobs [9]. Next, the extracted features are evaluated through matrix computation, and the output matrix is used to determine the pattern and class of the object. The third step in object detection is the combination of multiple overlapping bounding boxes into a single box by using non-max suppression. During object classification and localization on each grid cell of the image, it is possible that more than one grid cell will consider the centre of the object to be inside it, which may produce multiple bounding boxes for a single detection. To solve this problem, non-max suppression groups those boxes into one by choosing the highest probability as the most confident detection among the bounding boxes. Non-max suppression is an algorithm that selects the highest-scoring detection and suppresses the other, lower-scoring detections that bound the same object [10]. As a result, the bounding box with the highest classification probability is kept while the remaining overlapping boxes are suppressed, and it is taken as the output. Fig. 2 shows a general process that is usually followed in object detection experiments.

Fig. 2: Example of object detection procedure [11]

Based on the experiment process conducted by [11], the pre-processing starts with dataset construction, where the images are collected and labelled to form a dataset that is further split into training, validation and test sets. Model construction involves creating a CNN structure and setting its parameters. Next, the training and validation sets are used in the model training step. After model training, model prediction is executed by first testing on the test set and utilizing the result for model evaluation.
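To make the third step concrete, the following is a minimal sketch of the greedy non-max suppression procedure described above, written in NumPy; the corner-box format, the function name and the 0.5 overlap threshold are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop overlapping boxes.

    boxes  -- (N, 4) array of [x1, y1, x2, y2] corners
    scores -- (N,) array of confidence scores
    Returns indices of the boxes that survive suppression.
    """
    order = np.argsort(scores)[::-1]          # highest score first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        # Intersection of the best box with every remaining box
        x1 = np.maximum(boxes[best, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[best, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[best, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[best, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_best = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_best + areas - inter)
        # Keep only the boxes that do not overlap the best box too much
        order = order[1:][iou < iou_threshold]
    return keep
```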
B. Convolutional Neural Network (CNN)

Convolutional Neural Network (CNN) is a deep neural network that consists of more than two layers of neural network. CNNs are primarily used to perform image classification, object recognition and object detection in today's technology. CNNs are comprised of learnable weights and biases of neurons, which work by receiving inputs and performing dot product computations that then determine the output of the network.

Unlike a regular Neural Network, the neurons of a CNN architecture are arranged in three dimensions, known as width, height and depth. The depth of a CNN does not refer to the number of layers in the network, but to the dimension of the activation volume instead. Fig. 3 presents a comparison between a CNN and a regular Neural Network. Convolutional Neural Networks have a sequence of layers that perform different functions. There are three main types of layers in a CNN architecture, which are the Convolutional layer, the Pooling layer and the Fully Connected layer. The Convolutional layer is the layer that operates a dot product computation between the weights and a small region connected to the input [12].

Fig. 3: Comparison between regular Neural Network (top) and Convolutional Neural Network (bottom) [12]

In the Pooling layer, the dimensionality of the network is reduced continuously in order to decrease the number of parameters and computation, which also controls overfitting and shortens the training time [12]. The class scores are computed in the Fully Connected layer, which then produces the output of the network.
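As a minimal illustration of the three layer types, the sketch below stacks one Convolutional, one Pooling and one Fully Connected layer in PyTorch; the layer sizes and the 26-class output are illustrative choices, not the network used in this project:

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Minimal CNN showing the three layer types described above."""
    def __init__(self, num_classes=26):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)  # Convolutional layer
        self.pool = nn.MaxPool2d(2)                             # Pooling layer halves H and W
        self.fc = nn.Linear(16 * 16 * 16, num_classes)          # Fully Connected layer

    def forward(self, x):                          # x: (batch, 3, 32, 32)
        x = self.pool(torch.relu(self.conv(x)))    # -> (batch, 16, 16, 16)
        x = x.flatten(1)                           # flatten the activation volume
        return self.fc(x)                          # -> (batch, num_classes) class scores

scores = TinyCNN()(torch.randn(1, 3, 32, 32))      # (1, 26) class scores
```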


C. YOLO (You Only Look Once)

YOLO is an advanced deep learning object detection implementation that was introduced by [13]. Initially, object detection worked by sliding a small window, known as a classifier, across the image to make a prediction. This process consumes more time, since the classifier, which is the small window, runs repeatedly through the entire image until the most certain prediction is determined. However, Ng also reported that the YOLO approach is entirely distinct from other object detectors [14]. YOLO looks at the image just once, which is one of the main reasons for its name, and detects objects quickly.

YOLO sees the entire image during training and test time, and it encodes information related to object classes based on their appearance. The YOLO network also predicts the bounding boxes by using feature extraction from the entire image. During detection, the YOLO algorithm divides the image into numerous separate grid cells, an S x S grid. The centre of an object that falls into a grid cell is taken as the representation of that object, and each grid cell predicts a number of bounding boxes and confidence scores for those boxes. Every bounding box created has five predictions: x, y, w, h and confidence. The (x, y) coordinates are the centre of the box relative to the bounds of the grid cell, while the width (w) and height (h) are predicted relative to the whole image. The confidence prediction represents the intersection over union (IoU) between the ground truth box and the predicted box. Fig. 4 shows how the YOLO algorithm approach works in detecting objects. Moreover, Redmon also expressed that the YOLO architecture was inspired by the GoogLeNet model, but modified with 24 convolutional layers followed by two fully connected layers [13]. Fig. 5 describes the architecture of the YOLO network.

Fig. 4: YOLO operation [12]

Fig. 5: YOLO architecture [13]
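Since the confidence score is defined through the IoU between the predicted and ground truth boxes, a minimal sketch of the IoU computation may help; it assumes boxes in YOLO's centre/width/height parameterization, and the function name and example values are illustrative:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (cx, cy, w, h),
    the centre/size parameterization that YOLO predicts."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    # Overlap rectangle (zero if the boxes do not intersect)
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

print(iou((0.5, 0.5, 0.4, 0.4), (0.6, 0.5, 0.4, 0.4)))  # 0.6 for heavily overlapping boxes
```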

In [14], the authors introduce the latest version of YOLO, named YOLO9000, a real-time object detection system that can detect over 9000 object categories. It is built on the improved YOLOv2 model, which performs standard detection tasks like PASCAL VOC and COCO and uses a novel multi-scale training method, so that the same model can run at varying sizes with good speed and accuracy. Later, YOLOv3 was introduced with better performance and a fast detector category for when speed is important [15]. This project runs on YOLOv3, which uses multi-label classification. For example, the output labels may be "alphabet A" and "sign gesture I", which are not mutually exclusive. YOLOv3 replaces the softmax function with independent logistic classifiers, which calculate the likelihood that the input belongs to a specific label. YOLOv3 then uses binary cross-entropy loss for each label instead of utilizing mean square error in calculating the classification loss. This significantly reduces the computational complexity by avoiding the softmax function altogether.
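A minimal sketch of the idea behind these independent logistic classifiers: each class score is passed through a sigmoid on its own and scored with binary cross-entropy, so several labels can be active at once. The example values are illustrative, and this is not the full YOLOv3 loss, which also includes box and objectness terms:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def multilabel_bce(logits, targets):
    """Independent logistic classifiers with per-label binary cross-entropy,
    as in YOLOv3: every class score is judged on its own, so several
    labels (e.g. two non-exclusive sign classes) can be active at once."""
    p = sigmoid(logits)
    eps = 1e-7                                   # guard against log(0)
    return -np.mean(targets * np.log(p + eps) + (1 - targets) * np.log(1 - p + eps))

logits = np.array([2.1, -1.3, 0.7])              # raw class scores from the network
targets = np.array([1.0, 0.0, 1.0])              # two labels active simultaneously
print(multilabel_bce(logits, targets))
```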

III. METHODOLOGY

The system design of this project is shown in Fig. 6. It is separated into a Training model and a Detection model, where the Training model consists of the labelled and trained dataset, and the Detection model is the testing process of this project. The system begins to work by receiving sign language images as input in real time. The YOLO algorithm processes the image by identifying the existence of trained images in the input image. If a trained image exists in the input, a bounding box with a label covering the expected object is produced.

Fig. 6: System design

A. Darknet setup

Darknet is an open source neural network framework for YOLO object detection that is written in C and CUDA. In order to execute the Darknet neural network on an independent machine, some dependencies have to be installed. The installation and compilation of Darknet followed AlexeyAB's Github, as the instructions and steps mentioned there are clear and easy to follow. Firstly, the files from the mentioned Github were cloned. Other dependencies required to compile Darknet were also downloaded, such as CUDA 10, OpenCV and the CUDA Deep Neural Network (cuDNN) library. Moreover, it was suggested to use a machine that has a GPU with CUDA compute capability of 3.3 or above, as object detection requires high computation capability [16].

After all the minimum requirements for the Darknet setup were prepared, the cloned files from AlexeyAB's Github were compiled with CUDA, OpenCV and the cuDNN library using Microsoft Visual Studio 2015. Once the compilation completed, the darknet.exe program file was created automatically, marking that the Darknet compilation was a success.
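As a hedged illustration only: once darknet.exe exists, the build can be smoke-tested by running the standard 'detector test' inference command of the AlexeyAB fork on a sample image, for example via a small Python wrapper. All file paths below are illustrative and assume the stock repository layout:

```python
import subprocess

# Smoke-test the compiled binary ('detector test' is the standard Darknet
# inference command; cfg/weights/image paths are illustrative placeholders).
result = subprocess.run(
    ["darknet.exe", "detector", "test",
     "cfg/coco.data", "cfg/yolov3.cfg", "yolov3.weights", "data/dog.jpg"],
    capture_output=True, text=True)
print(result.stdout)   # prints the detected classes with confidence scores
```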


B. Dataset preparation

During the data collection of this project, a sign language dataset from an online source, a Kaggle dataset consisting of images of the 26 alphabets of American Sign Language, was collected. Each sign consists of 3000 images (.jpg) of 400x400 resolution with variations in brightness. As the alphabet of Malaysian Sign Language was partially emulated from American Sign Language, some of the letters, such as A, B and C, can be used directly in this project. The sign language images in the dataset were further divided into two types, referred to as static and dynamic. In this context, static sign language refers to a single hand movement, while multiple hand movements were identified as dynamic sign language. This project focused on static sign language, as it is easier to label compared to dynamic sign language. After the images were collected, they were labelled using 'yolo marker', an image annotation tool by AlexeyAB on Github [16]. Fig. 7 presents an example of the image labelling conducted during the dataset preparation. The annotation information, which contains the coordinates of the object in each image, was stored in a text file to be processed in the training session.

Fig. 7: Image labelling
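The annotation text files mentioned above follow Darknet's one-line-per-object format: a class index followed by the box centre and size, normalized to the image dimensions. A minimal sketch of the conversion from pixel corners, with an illustrative function name and example box:

```python
def to_yolo_line(class_id, x1, y1, x2, y2, img_w=400, img_h=400):
    """Convert pixel corners to a Darknet label line:
    '<class> <cx> <cy> <w> <h>' with all values normalized to [0, 1]."""
    cx = (x1 + x2) / 2 / img_w
    cy = (y1 + y2) / 2 / img_h
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"

# Letter 'A' (class 0) occupying a 200x240 px region of a 400x400 image
print(to_yolo_line(0, 100, 80, 300, 320))   # -> "0 0.500000 0.500000 0.500000 0.600000"
```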

C. Model training

After the preparation of the dataset was completed, the labelled images were trained using the Darknet framework. Before the training process was executed, the required parameters, such as the number of classes and filters, were edited in the config file, and the other parameters were copied from the default YOLOv3 config file. Next, the .names file, which contains the list of objects to be labelled, was created. Furthermore, the training process was executed using the pre-trained convolutional weights darknet53.conv.74, compiled in the Darknet framework. The training process was run until the iterations reached 9000, as suggested by AlexeyAB, who recommends that the ideal number of training iterations should be around 4000 – 9000. Fig. 8 shows the training process that was executed.

Fig. 8: Model training process

D. Model testing

After the training process of the model was completed, the model testing process was executed to determine the accuracy of the trained weights. In the Darknet framework, the trained weights are saved to a .weights file after every 1000 iterations. Therefore, if 9000 iterations are executed, there will be nine different sets of trained weights. Fig. 9 shows the weights validation process executed in this project. After the most accurate weights were identified, these weights were tested in real time, whereby the video was captured using a web-cam. The results of the other weights were tabulated and are discussed in the next section.

Fig. 9: Weight accuracy validation process
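As a hedged sketch of how such a validation loop can be driven: the AlexeyAB fork provides a 'detector map' command that reports precision, recall, average IoU and mAP for one weights file, so the nine saved weights can be checked one by one. The file names below are hypothetical placeholders, not the paper's actual paths:

```python
import subprocess

# Nine weights files, one saved after every 1000 iterations (names are hypothetical).
WEIGHTS = [f"backup/yolo-msl_{i}000.weights" for i in range(1, 10)]

for w in WEIGHTS:
    # 'detector map' prints precision, recall, TP/FP/FN, average IoU and mAP
    # for one saved weights file evaluated against the validation set.
    subprocess.run(["darknet.exe", "detector", "map",
                    "data/obj.data", "cfg/yolo-msl.cfg", w], check=True)
```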


IV. RESULTS AND DISCUSSION

After the validation process was conducted, the mean average precision (mAP) results, which depict the accuracy of the detection, were recorded. Table 1 summarizes the weights accuracy that was validated after the training phase.

Table 1. Performance results by iterations

Iterations | Precision | Recall | TP   | FP   | FN   | Average IoU | mAP score | Total Detection Time (s)
1000       | 0.26      | 0.13   | 1208 | 3438 | 7792 | 24.06%      | 14.7%     | 316
2000       | 0.30      | 0.36   | 3283 | 7629 | 5717 | 18.98%      | 37.12%    | 317
3000       | 0.33      | 0.38   | 3445 | 7041 | 5555 | 20.73%      | 37.34%    | 318
4000       | 0.38      | 0.45   | 4091 | 6690 | 4909 | 24.43%      | 55.14%    | 317
5000       | 0.44      | 0.53   | 4768 | 6162 | 4232 | 28.53%      | 56.02%    | 320
6000       | 0.47      | 0.57   | 5109 | 5674 | 3891 | 30.54%      | 58.8%     | 320
7000       | 0.49      | 0.60   | 5417 | 5537 | 3583 | 32.67%      | 63.06%    | 318
8000       | 0.46      | 0.57   | 5118 | 5936 | 3882 | 30.47%      | 57.32%    | 318
9000       | 0.46      | 0.57   | 5118 | 5936 | 3882 | —           | 57.32%    | 319

Based on the observations in Table 1, TP means detecting Malaysian Sign Language correctly as in the trained dataset, FP means mistakenly detecting other objects in the images as Malaysian Sign Language, while FN represents the Malaysian Sign Language signs that are not detected. The weights trained to 7000 iterations have the highest accuracy among the others, with a 0.49 precision rate, 0.60 recall and a 63.06% mAP score. These weights also have the highest TP value and the lowest FN and FP values compared to the other weights. However, the weights still cannot be considered good for a detection model, because a good detection model should have an accuracy close to 100%, usually 95% and above. Besides, from the accuracy of the trained weights, it was suspected that the training process suffered from overfitting, where objects in images from the training dataset can be detected, but objects in other images cannot. Fig. 10 visualizes the weights accuracy by iterations of this project.
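The precision and recall values in Table 1 follow directly from the TP, FP and FN counts; for instance, the 7000-iteration row can be reproduced as follows:

```python
tp, fp, fn = 5417, 5537, 3583          # 7000-iteration counts from Table 1

precision = tp / (tp + fp)             # fraction of detections that are correct
recall = tp / (tp + fn)                # fraction of ground-truth signs that were found

print(f"precision = {precision:.2f}")  # 0.49
print(f"recall    = {recall:.2f}")     # 0.60
```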


Fig. 10: Weights accuracy by iterations

From Fig. 10, it can be seen that the accuracy of the weights starts to drop drastically after 7000 iterations and decreases continuously. It is believed that iteration 7000 was the peak of the training, where the learning was at its maximum, and that overfitting happened after 7000 iterations. Overfitting happens because of the pre-trained model's neural network architecture, where the learning rate cannot be controlled and the neural network was fitted closely to the training set, which then leads to difficulty in generalizing and predicting new data. Moreover, when this model was tested in a real-world situation by executing it through a webcam, it sometimes detected objects that it was not trained to detect, such as classifying a white wall as the letter 'A', and sometimes it gave no response or detection when a certain sign was shown. Therefore, it can be confirmed that this project's detection model fits the training data too well. The results also show that this project has low robustness and accuracy in detecting Malaysian Sign Language. This might be due to a few factors that influence the accuracy of the detection, such as high image noise, a lack of variety of distractions in the object images and an insufficient number of training images.

V. CONCLUSION

This paper proposed a study on automatic Malaysian Sign Language detection using a Convolutional Neural Network (the YOLO approach). The project focuses on the translation of hand gesture movements comprising the alphabets and movements of Malaysian Sign Language. An open source framework for the YOLO algorithm, Darknet, was implemented in conducting this project. The performance of the translation (weights files) was evaluated by iterations within the Darknet framework. It was found that the best weights were at 7000 iterations, and it is believed that the model experiences overfitting after 7000 iterations. The performance and consistency of the detection and identification system are not very promising, with an accuracy of 63.06 percent from training and 72 percent from actual system implementation results. Based on the development carried out, the system is able to detect the sign language, but it is highly dependent on the captured video image, the background and the angle of the hand. For future work, we plan to investigate a deeper domain of the sign language or a bigger dataset, and to explore other popular approaches such as recurrent and generative adversarial neural networks, which have achieved outstanding results in other research fields.

VI. ACKNOWLEDGMENT

This research has been supported by Universiti Teknologi MARA Perak Branch, Tapah Campus.

REFERENCES

1. W. C. Stokoe, "Sign language structure: An outline of the visual communication systems of the American deaf," Journal of Deaf Studies and Deaf Education, 10(1), 2005, pp. 3-37.
2. A. F. B. M. Sahid, W. S. W. Ismail, and D. A. Ghani, "Malay Sign Language (MSL) for beginner using Android application," 1st International Conference on Information and Communication Technology, 2017, pp. 189–193.
3. N. F. Baharuddin, personal communication, October 26, 2018.
4. M. S. Shaari, S. Ahmad, T. K. Hoe, and W. Z. Abu, Bahasa isyarat Malaysia. Selangor: Persekutuan Orang Pekak Malaysia, 2000.
5. K. F. Khairuddin, S. Miles, and W. McCracken, "Deaf learners' experiences in Malaysian schools: Access, equality and communication," Social Inclusion, 6(2), 2018, pp. 46–55.
6. L. J. Muir and I. E. G. Richardson, "Perception of sign language and its application to visual communications for deaf people," Journal of Deaf Studies and Deaf Education, 10(4), 2005, pp. 390-401.
7. A. Lai, Understanding the silent world. 2015, Available: https://www.thestar.com.my/metro/focus/2015/05/05/understanding-the-silent-world-society-on-a-mission-to-spread-awareness-of-deaf-culture/.
8. A. Kanezaki and T. Harada, "3D selective search for obtaining object candidates," IEEE International Conference on Intelligent Robots and Systems, 2015, pp. 82–87.
9. E. Salahat and M. Qasaimeh, "Recent advances in features extraction and description algorithms: A comprehensive survey," IEEE International Conference on Industrial Technology, 2017, pp. 1059–1063.
10. J. Hosang, R. Benenson, and B. Schiele, "Learning non-maximum suppression," 30th IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6469–6477.
11. Z. Wu, X. Chen, Y. Gao, and Y. Li, "Rapid target detection in high resolution remote sensing images using YOLO model," International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences - ISPRS Archives, 42(3), 2018, pp. 1915–1920.
12. I. Gogul and V. S. Kumar, "Flower species recognition system using convolution neural networks and transfer learning," IEEE 4th International Conference on Signal Processing, Communication and Networking, 2017, pp. 1-6.
13. J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779-788.
14. J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7263-7271.
15. J. Redmon and A. Farhadi, YOLOv3: An incremental improvement. 2018, Available: https://arxiv.org/pdf/1804.02767.pdf.
16. AlexeyAB, AlexeyAB/darknet - GitHub repository. Available: https://github.com/AlexeyAB/darknet#yolo-v3-in-other-frameworks.
