Sign Number Recognition

Iwan Njoto Sandjaja
Informatics Engineering Department
Petra Christian University
Surabaya, Indonesia
[email protected]

Nelson Marcos, PhD
Software Technology Department
De La Salle University
[email protected]

Abstract—A sign language number recognition system lays down the foundation for handshape recognition, addresses real and current problems in signing in the deaf community, and leads to practical applications. The input for the sign language number recognition system is 5000 Filipino Sign Language number video files with a frame size of 640 x 480 pixels at 15 frames/second. The color-coded gloves use fewer colors than the color-coded gloves in existing research. The system extracts important features from the video using a multi-color tracking algorithm which is faster than existing color tracking algorithms because it does not use a recursive technique. Next, the system learns and recognizes the Filipino Sign Language numbers in training and testing phases using the Hidden Markov Model (HMM). The feature extraction could track 92.03% of all objects. The recognizer could recognize Filipino Sign Language numbers with 85.52% average accuracy.

Keywords- computer vision, human computer interaction, sign language recognition, hidden markov model, hand tracking, multi-color tracking

I. INTRODUCTION

Sign language is local, in contrast with the general opinion which assumes it is universal. Different countries, and at times even regions within a country, have their own sign language. In the Philippines, for example, there are 13 variations of Filipino Sign Language based on regions [1].

Sign language is a natural language for the deaf. It is a kind of visual language conveyed primarily via hand and arm movements (called manual articulators, which consist of the dominant hand and non-dominant hand) accompanied by other parts of the body, such as facial expression, eye movement, eyebrow movement, cheek movement, tongue movement, and lip motion (called non-manual signals) [2]. Most hearing people do not understand any sign language and know very little about deafness in general. Although many deaf people lead successful and productive lives, this communication barrier can have problematic effects on many aspects of their lives.

There are three main categories in sign language recognition, namely handshape classification, isolated sign language recognition, and continuous sign classification. Handshape classification, or finger-spelling recognition, is one of the main topics in sign language recognition, since finger-spelling can express not only some concepts but also special transition states in temporal sign language. During the period of 1994-1998, finger-spelling was the focus of sign language recognition research. Sign language is just a string of signs. Isolated words are widely considered the basic unit in sign language, and many researchers [3] [4] focus on isolated sign language recognition.

Some researchers [5] [6] also pay attention to continuous sign language recognition. A lot of work on continuous sign language recognition applies HMMs for recognition. The use of HMMs offers the advantage of being able to segment a data stream into its continuous signs implicitly, and thus bypasses the hard problem of segmentation entirely.

The system architectures for sign language recognition can be categorized into two main classifications based on their input. The first is datagloves-based, whose input comes from gloves with sensors. The weakness of this approach is that it limits the signer's freedom of movement; its advantage is higher accuracy. The second is vision-based, whose input comes from a camera (a stereo camera or a web/USB camera). The weaknesses of this approach are lower accuracy and higher computing power consumption. The advantages are that it is cheaper and less constraining than datagloves. To make human hand tracking easier, color-coded gloves are usually used. A combination of both architectures, called a hybrid/mixed architecture, is also possible.

In the vision-based approach, the architecture of the system is usually divided into two main parts. The first part is feature extraction. The feature extraction part should extract important features from the video using computer vision or image processing methods such as background subtraction, pupil detection, hand tracking, and hierarchical feature characterization (shape, orientation, and position). The second part is the recognizer. From the features already extracted and characterized, the recognizer should be able to learn the pattern from training data and recognize testing data correctly. The recognizer employs machine learning algorithms; Artificial Neural Networks (ANN) and Hidden Markov Models (HMM) are the most commonly used.

In the vision-based approach, the camera could be single, two or more, or special 3D. A stereo camera uses a two-camera configuration that imitates how human eyes work. The most recent approach is the virtual stereo camera, which uses only one camera and generates the second camera virtually.

The main problem in sign language recognition is that it involves multiple channels and simultaneous events, which create a combinatorial explosion and a huge search space. This research also addresses problems in computer vision, machine learning, machine translation, and linguistics. This research could be seen as a computer-based lexicon/dictionary from sign language phrases, specifically numbers, into words/texts.

II. RELATED LITERATURE

A vision-based medium-vocabulary sign language recognition (SLR) system [3] is developed using tied-mixture density hidden Markov models (TMDHMM). Their experiment is based on a single frontal view; only a USB color camera is employed, placed in front of the signer to collect the Chinese Sign Language (CSL) video data, with an image size of 320 x 240 pixels. In this system, the recognition vocabulary contains 439 CSL signs, including 223 two-handed and 216 one-handed signs. Their experimental results show that the proposed methods could achieve an average recognition accuracy of 92.5% on 439 signs: 93.3% on two-handed signs and 91.7% on one-handed signs, respectively.

In recent research on sign language recognition [4], a novel viewpoint-invariant method for sign language recognition is proposed. The recognition task is converted to a verification task under the proposed method. This method verifies the uniqueness of a virtual stereo vision system, which is formed by the observation and the template. The recognition vocabulary of this research contains 100 CSL signs. The image resolution is 320 x 240 pixels. The proposed method achieves an accuracy of 92% at rank 2.

In recent work [7], the design and implementation of a hand mimicking system is discussed. The system captures hand movement, analyzes it using MATLAB, and produces a 3D hand graphical model that imitates the movements of the user's hand using OpenGL. The system captures hand movement using two cameras and approximates the user's 3D hand pose using stereovision techniques. Based on the tests, the obtained average difference of ideal hand part orientation from actual orientation is about 20 degrees. Furthermore, 72.38% of the measured hand angular orientations deviate by less than 20 degrees, while 86.35% of the test cases have an angular orientation error of less than 45 degrees.

Two real-time hidden Markov model-based systems for recognizing sentence-level continuous American Sign Language (ASL) using a single camera to track the user's unadorned hands were presented [5]. For this recognition system, sentences of the form "personal pronoun, verb, noun, adjective, (the same) personal pronoun" are to be recognized. Six personal pronouns, nine verbs, twenty nouns, and five adjectives are included, making up a total lexicon of forty words. The first system observes the user from a desk-mounted camera and achieves 91.9% word accuracy. The second system mounts the camera in a cap worn by the user and achieves 96.8% accuracy (97% with unrestricted grammar).

A portable letter sign language translator is developed using a specialized glove with flex/bend sensors [8]. Their system translates hand-spelled words to letters through the use of a Personal Digital Assistant (PDA). They use a Neuro-Fuzzy Classifier (NEFCLASS) for the letter translation algorithm. Their system cannot recognize the letters M and N, but it can recognize the other letters with a minimum accuracy of 65% (letter Z), a maximum accuracy of 100%, and an average accuracy of 90.2%.

A framework for recognizing American Sign Language (ASL) is developed using hidden Markov models [6]. The data set consists of 499 sentences, between 2 and 7 signs long, with a total of 1604 signs from a 22-sign vocabulary. They collect these data with an Ascension Technologies MotionStar™ system at 60 frames per second. In addition, they collect data from the right hand with a Virtual Technologies Cyberglove™, which records wrist yaw, pitch, and the joint and abduction angles of the fingers, also at 60 frames per second. The results show clearly that the quadrilateral-based description of the handshape (95.21%) is far more robust than the raw joint angles (83.15%). The best result is achieved using PaHMM monitoring the right-hand movement channel and the right-hand handshape channel, with 88.89% sentence accuracy and 96.15% word accuracy.

CopyCat, an educational computer game that utilizes computer gesture recognition technology to develop American Sign Language (ASL) skills in children ages 6-11, is presented [9]. Data from the children's signing is recorded using an IEEE 1394 video camera and wireless accelerometers mounted in colored gloves. The dataset consists of 541 phrase samples and 1,959 individual sign samples of five children signing game phrases from a 22-word vocabulary. The vocabulary is limited to a subset of ASL which includes single- and double-handed signs, but does not include more complex linguistic constructions such as classifier manipulation, facial gestures, and level emphasis. Each phrase is a description of an encounter for the game character, Iris the cat. The students can give warning of a predator's presence, such as "go chase snake", or identify the location of a hidden kitten, such as "white kitten behind wagon". Brashear et al. achieve an average word accuracy of 93.39% for the user-dependent models. The user-independent models are generated by training on a dataset consisting of four children and testing on the remaining child's dataset. Brashear et al. achieve an average word accuracy of 86.28% for the user-independent models. They achieve on average 92.96% word-level accuracy with 1.62% standard deviation when they choose samples across all samples and users (training and testing using data from all students).

A vision-based interface for controlling a computer mouse via 2D and 3D hand gestures was presented [10] [11]. The proposed algorithm addresses three different subproblems: (a) hand hypothesis generation (i.e., a hand appears in the field of view for the first time); (b) hand hypothesis tracking in the presence of multiple, potentially occluding objects (i.e., previously detected hands move arbitrarily in the field of view); and (c) hand hypothesis removal (i.e., a tracked hand disappears from the field of view). Their algorithm also involves simple prediction, using a simple linear rule to predict the location of hand hypotheses at time t based on their locations at times t-2 and t-1. Having already defined the contour of a hand, finger detection is performed by evaluating a curvature measure on contour points at several scales. As confirmed by several experiments, the proposed interface achieves accurate mouse positioning, smooth cursor movement, and reliable recognition of gestures activating button events. Owing to these properties, their interface can be used as a virtual mouse for controlling any Windows application.

III. ARCHITECTURAL DESIGN

The general architectural design for sign language number recognition is shown in Fig. 1. The input of the sign language number recognition system is Filipino Sign Language number video. In general, there are two main modules in the sign language recognition architecture: the feature extraction module and the recognizer module. The feature extraction module extracts important features from the video per frame. The recognizer module learns and recognizes the video from its features.

Figure 1. System Architecture

A. Feature Extraction

The feature extraction module uses the OpenCV library [12]. The detailed flowchart of feature extraction is shown in Fig. 2. The feature extraction module consists of face detection, hand tracking, and feature characterization. The face detection module is used to detect the face area. The hand tracking module tracks the movements of the dominant hand and the non-dominant hand. Feature characterization takes important features such as the position of the face as reference, the position of the dominant hand and its fingers, the area of the dominant hand and each finger, the orientation of the dominant hand and each finger, the position of the non-dominant hand, the area of the non-dominant hand, and the orientation of the non-dominant hand. In this research, the feature characterization used as feature vectors is the position of the dominant hand's thumb in x and y coordinates and the x and y coordinates of the other fingers relative to the thumb position.

For each frame in the video, the feature extraction module begins by calling a smoothing procedure to eliminate noise from the camera. The frame size is 640 x 480 pixels in BGR (Blue, Green, and Red) color space. After smoothing the frame, the feature extraction module converts the frame's color space from BGR to Hue, Saturation, and Value (HSV).

The output of feature characterization becomes the input for the second block, which is the recognizer. The recognizer employs the Hidden Markov Model. The recognizer consists of two main parts: the training module and the testing module. In the training module, the recognizer learns the pattern of sign language numbers using annotated input from the feature extraction module. In the testing or verification module, the recognizer receives unknown input which has never been learned before, yet is annotated for verification purposes.
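The per-frame preprocessing described above is performed with OpenCV calls in the actual system (a smoothing pass, then a BGR-to-HSV conversion over the whole 640 x 480 frame). As a minimal standard-library sketch, the function below shows what the color-space step computes for a single pixel:

```python
import colorsys

def bgr_to_hsv(b, g, r):
    """Convert one 8-bit BGR pixel to HSV, the color space used for
    hue-based glove tracking. Hue is returned in degrees; saturation
    and value are in [0, 1]. OpenCV performs this conversion on the
    whole smoothed frame; this per-pixel version is illustrative only."""
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return h * 360.0, s, v
```

Working in HSV separates chromaticity (hue) from lighting (value), which is why the glove colors can be isolated by hue ranges even as natural lighting changes during recording.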

Figure 2. Feature extraction flowchart
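In the flowchart above, each glove color of the dominant hand is tracked within its own hue range, so deciding which finger a tracked blob belongs to reduces to an interval lookup. A sketch with hypothetical, non-overlapping hue ranges (the calibrated minimum/maximum hue values used in this research are not listed in the paper):

```python
# Hypothetical, non-overlapping hue ranges in degrees, one per glove
# color of the dominant hand. The actual system calibrates each range
# by scanning the minimum and maximum hue of that finger's color;
# the values below are illustrative only.
FINGER_HUE_RANGES = {
    "thumb":  (0, 20),
    "index":  (40, 70),
    "middle": (100, 130),
    "ring":   (170, 200),
    "little": (260, 290),
}

def finger_for_hue(hue):
    """Return the finger whose calibrated hue range contains `hue`,
    or None if the hue falls outside every range."""
    for finger, (lo, hi) in FINGER_HUE_RANGES.items():
        if lo <= hue <= hi:
            return finger
    return None
```

Because the ranges are chosen not to overlap, each pixel (and hence each tracked ellipse) maps to at most one finger.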

The saturation and value filtering module extracts the hue channel based on specific saturation and value parameters. The saturation and value filtering gives two outputs: a skin frame and a color frame. The skin frame is processed by the skin tracking module.

The skin tracking module is basically the color tracking module with different size filtering, because the face and the non-dominant hand have a larger area than each finger of the dominant hand. The skin tracking procedure produces a face ellipse. The skin tracking module also draws a black-filled contour of the face on the color frame, to remove the lips from the color frame.

The color frame is processed by the color tracking module as shown in Fig. 2, with a different hue-range parameter for each finger. The hue parameter of each finger is found by searching the maximum and minimum hue value of that finger; the hue parameters of the fingers do not overlap one another. The color tracking module gives the ellipse area for each finger as its result.

The Merge algorithm simply executes the cvFindContours procedure again with the connected contours from the color tracking procedure as input. Each contour is fitted with an ellipse. If the number of ellipses is more than two, the Merge algorithm returns the first ellipse; thus the Merge algorithm always returns the first detected ellipse. By always returning the first ellipse, the Merge algorithm avoids a long or endless recursive process.

The next module is the Draw and Print module, which draws each ellipse and prints its parameters in XML format. Fig. 3 shows a sample frame captured with the resulting ellipses.

There are 6 color trackers: one color tracker for skin color (face and non-dominant hand) and 5 color trackers for the dominant hand, one for each finger. The ellipse and its parameters are shown in Fig. 2. The whole process is repeated until no more frames remain to be processed.

Finally, the feature characterization converts each ellipse and its parameters into feature vectors. The feature vectors contain the position of the dominant hand's thumb in x and y coordinates and the x and y coordinates of the other fingers relative to the thumb position. The first two feature vector elements are taken from the center coordinate (x, y) of the thumb ellipse; the rest are taken from the distance between the thumb and each other finger in x and y. Thus, there are 10 feature vector elements for each frame. The feature vectors are saved in XML format.

B. Recognizer

The recognizer learns the pattern from the feature vectors generated by the feature extraction module using machine learning algorithms. The Hidden Markov Model (HMM) is used as the machine learning algorithm. The Cambridge University HMM Toolkit [13] is chosen as the HMM library.

The recognizer consists of three main parts. The first part is the data preparation module, in which the recognizer generates all directories and files needed for the HMM processes. The second part is the HInit module, in which the recognizer creates an HMM model for each sign language number and initializes the models with forward-backward algorithms, using labeled training feature vectors from the feature extraction module as input. The last part is the HModels module, in which the recognizer uses the labeled training feature vectors from the feature extraction module to re-estimate the HMM model parameters using the Baum-Welch method. After the re-estimation of the HMM model parameters has finished, the recognizer recognizes the testing data, which is not included in the training data yet is already labeled for verification purposes. The Viterbi algorithm is used to recognize the testing data. Lastly, the recognizer interprets and evaluates the results and generates a report.
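In the testing phase, an observation sequence is scored against each sign-number model and the best-scoring model wins. The actual system does this with HTK on continuous feature vectors; the discrete, log-space sketch below only illustrates the Viterbi recursion involved (all probabilities are assumed nonzero so logarithms are defined):

```python
import math

def viterbi_log_prob(obs, start_p, trans_p, emit_p):
    """Log-probability of the best state path through one HMM for a
    discrete observation sequence -- the score the recognizer would
    compare across the per-number models. `start_p[s]`, `trans_p[p][s]`,
    and `emit_p[s][o]` are ordinary (nonzero) probabilities."""
    n_states = len(start_p)
    # initialization: best log-score of ending in state s after obs[0]
    v = [math.log(start_p[s]) + math.log(emit_p[s][obs[0]])
         for s in range(n_states)]
    # recursion: extend each state by its best predecessor
    for o in obs[1:]:
        v = [max(v[p] + math.log(trans_p[p][s]) for p in range(n_states))
             + math.log(emit_p[s][o])
             for s in range(n_states)]
    return max(v)
```

Running this for every number model and taking the argmax mirrors what HTK's recognition step does over the 10-state models trained here.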

IV. RESULTS AND ANALYSIS

A. Feature Extraction

Table I shows the summary of results from the feature extraction module in terms of time. The feature extraction module ran for 5 hours, 42 minutes, and 35 seconds to extract the features of 5000 Filipino Sign Language number videos. Feature extraction took a long time because it had to play all the videos one by one. Playing a video of the numbers 1-9 takes 2 seconds (±30 frames).

Table I. Feature extraction results in terms of time

             Time (HH:MM:SS)
Start time   2:32:55 PM
End time     8:15:30 PM
Duration     05:42:35

Figure 3. Sample image with resulting ellipses

Playing a video of the numbers 10-109, 201-209, 301-309, 401-409, 501-509, 601-609, 701-709, 801-809, and 901-909 takes 3 seconds (±45 frames). Videos of the remaining numbers take 4 seconds to play (±60 frames). Thus, the total time for playing all the videos was 5 hours, 16 minutes, and 50 seconds. The feature extraction module took a little longer than this because it had to switch from one video to another and save the results to XML files.

Table II shows the summary of results from the feature extraction module in terms of accuracy. For each frame there are five objects to be tracked, representing the five fingers. A non-trackable object means the color tracking module cannot track the object (the finger). An incorrectly trackable object means the color tracking module detects the object but finds more than one object, even though the Merge algorithm is already applied.

Table II. Feature extraction results in terms of accuracy

Result                          Objects
Correctly trackable objects     1,322,537
Non-trackable objects           109,814
Incorrectly trackable objects   4,664
Total objects                   1,437,015
%Tracking                       92.03%

The feature extraction module could track 1,322,537 of 1,437,015 objects; in other words, 92.03% of all objects could be tracked. The little finger was the most frequently untrackable object because of its small size. The second most untrackable object was the index finger, because it was occluded by the thumb at the beginning and the end of each video. The color tracking also detected more than one object even with the Merge algorithm applied, but only in a very small number of cases (about 0.32% of all objects).

The causes of untrackable objects were occlusion, image blurring because of fast object movement, and changes in lighting conditions. Occlusion happened when one finger occluded another finger; most of the time, the index finger was occluded by the thumb at the beginning and the end of each video. Image blurring happened when the hand moved too fast, for example when signing twin numbers (11, 22, 33, and so on) and tens (10, 20, 30, etc.). The changes in lighting conditions happened because the video was recorded using natural light from 9am to 3pm; the movement of the hand also created shadows and changed the lighting.

B. Recognizer

Two types of validation methods were used in this research. The first was five-fold validation. Five-fold validation generated five sets of testing and training data from the five video samples (A, B, C, D, and E) for each number. For set A, the first sample of each sign language number was used as testing data and the other samples of the same number as training data; for set B, the second sample was used, and so on until set E, which used the last sample. Thus, five-fold validation created five validation sets, each consisting of 4000 data for training and 1000 data for testing.

The second validation procedure was leave-one-out validation, which used all of the data except one sample as training data and the remaining one as test data, repeated for every possible permutation; this can take a very long time. For this research, leave-one-out validation created 120 sets of testing and training data.

Initially, this research began with a four-state HMM model and then increased the number of states until the maximum accuracy was found. After that, the experiment continued by adding skip states. The 10-state HMM without skip states has the highest average accuracy, 85.52%.

Table III. Recognizer results

                   Set A    Set B    Set C    Set D    Set E
Time (HH:MM:SS)
  HInit            32:08    31:32    31:36    31:45    31:33
  HModels          02:50    03:01    03:13    03:15    03:08
  Total            34:57    34:33    34:49    34:59    34:42
Result
  Correct          767      881      890      888      850
  Wrong            233      119      110      112      150
  Total samples    1000     1000     1000     1000     1000
  Accuracy         76.70%   88.10%   89.00%   88.80%   85.00%

The recognizer using the 10-state HMM without skip states achieved 85.52% accuracy on average. The maximum accuracy was 89.00%, using set C as input; the minimum was 76.70%, using set A. Set A has the lowest accuracy since it was the model's first video attempt at each sign, so the first sample differed significantly from the other samples. Set C has the highest accuracy, probably because the model had already become used to making the sign and produced more consistent signs. The average accuracy of leave-one-out validation is 85.52%, the same as the five-fold validation result. This happened because the samples are similar to one another.
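The five-fold construction described above (sample i of every number held out as testing data for fold i, the other four samples used for training) can be sketched as:

```python
def five_fold_sets(samples_per_number):
    """Build the five validation sets described above. For fold i,
    sample i (A..E) of every sign language number is a test item and
    the remaining four samples are training items.

    `samples_per_number` maps each number to its list of five samples
    [A, B, C, D, E] (any per-sample feature data)."""
    folds = []
    for i in range(5):  # sets A..E
        train, test = [], []
        for number, samples in samples_per_number.items():
            test.append((number, samples[i]))
            train.extend((number, s)
                         for j, s in enumerate(samples) if j != i)
        folds.append((train, test))
    return folds
```

With 1000 distinct numbers and five samples each, every fold contains 4000 training items and 1000 testing items, matching the set sizes reported above.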
V. CONCLUSION

The sign language number recognition system in this research provided a model for recognizing sign language numbers that is suitable for the numbers in Filipino Sign Language. The system was also evaluated in terms of accuracy and time. The feature extraction could track 92.03% of all objects in 5 hours, 16 minutes, and 50 seconds using an Intel Core 2 Duo E4400 2 GHz computer with 2GB of memory. It can be concluded from the feature extraction results that this research has implemented computer vision techniques for robust and real-time color tracking, used in the feature extraction of the dominant hand and of skin (face and non-dominant hand).

The recognizer could recognize Filipino Sign Language numbers using the features from the feature extraction module. The 10-state HMM without skip states has the highest average accuracy, 85.52%. The total average running time for the 10-state HMM without skip states was 34 minutes and 48 seconds. The leave-one-out validation for the 10-state HMM without skip states resulted in the same accuracy, 85.52%. This research is the pioneer in sign language recognition in the Philippines. Thus, it is far from perfect, but it provides a framework to be extended in future research.

The video used as input in this research could be improved, because the framing of the model seems too far away and the model held her hand too far down. In natural discourse, the placement of the dominant hand is about 3-4 inches to the side (and an inch or so in front) of the mouth, which is called the finger-spelling space. Deaf signers who converse never look at the interlocutor's hand but at the eyes; this close placement of the hand to the face enables the signer to use peripheral vision to catch the manual signal completely. The signing space should be the 3-dimensional space from mid-torso to the top of the head, with a third of the shoulder width in addition on either side.

The video samples could also include all the variants of each number. For instance, there are two ways of signing each of 10, 16, 17, 18, and 19, and there are additional unique signs for 21, 23, and 25 (with internal movement).

For further research, it is advisable to use other color systems such as YCrCb or CIE instead of HSV, and to use a more advanced color tracking algorithm such as K-means or another tracking algorithm such as the Lucas-Kanade feature tracker. Another possibility is using only skin color, without gloves, together with fingertip detection algorithms for feature extraction. The recognizer module could use other machine learning algorithms for time series data, such as fuzzy clustering or neuro-fuzzy methods. The exploration of the grammar features of the Hidden Markov Model Toolkit is also possible in further research.

REFERENCES

[1] Philippine Federation of the Deaf (2005). Filipino Sign Language: A compilation of signs from regions of the Philippines, Part 1. Philippine Federation of the Deaf.
[2] Philippine Deaf Resource Center & Philippine Federation of the Deaf (2004). An Introduction to Filipino Sign Language, Parts I-III. Philippine Deaf Resource Center, Inc., City, Philippines.
[3] Zhang, L.-G., Chen, Y., Fang, G., Chen, X., & Gao, W. (2004). A vision-based sign language recognition system using tied-mixture density HMM. In ICMI '04: Proceedings of the 6th International Conference on Multimodal Interfaces, pages 198-204, New York, NY, USA. ACM.
[4] Wang, Q., Chen, X., Zhang, L.-G., Wang, C., & Gao, W. (2007). Viewpoint invariant sign language recognition. Computer Vision and Image Understanding, 108:87-97.
[5] Starner, T., Weaver, J., & Pentland, A. (1998). Real-time American Sign Language recognition using desk and wearable computer based video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1371-1375.
[6] Vogler, C. & Metaxas, D. (2004). Handshapes and movements: Multiple-channel ASL recognition. In Springer Lecture Notes in Artificial Intelligence, Proceedings of Gesture Workshop '03, Genova, Italy, pages 247-258.
[7] Fabian, E. A., Or, I., Sosuan, L., & Uy, G. (2007). Vision-based hand mimicking system. In ROVISP07: Proceedings of the International Conference on Robotics, Vision, Information, and Signal Processing, Penang, Malaysia.
[8] Aguilos, V. S., Mariano, C. J. L., Mendoza, E. B. G., Orense, J. P. D., & Ong, C. Y. (2007). APoL: A portable letter sign language translator. Master's thesis, De La Salle University, Manila.
[9] Brashear, H., Henderson, V., Park, K.-H., Hamilton, H., Lee, S., & Starner, T. (2006). American Sign Language recognition in game development for deaf children. In Assets '06: Proceedings of the 8th International ACM SIGACCESS Conference on Computers and Accessibility, pages 79-86, New York, NY, USA. ACM.
[10] Argyros, A. A. & Lourakis, M. I. A. (2004). Real time tracking of multiple skin-colored objects with a possibly moving camera. In Proceedings of the European Conference on Computer Vision (ECCV '04), volume 3, pages 368-379, Prague, Czech Republic. Springer-Verlag.
[11] Argyros, A. A. & Lourakis, M. I. A. (2006). Vision-based interpretation of hand gestures for remote control of a computer mouse. In ECCV Workshop on HCI, pages 40-51, Graz, Austria. Springer-Verlag. LNCS 3979.
[12] Intel Software Product Open Source (2007). Open Source Computer Vision Library [online]. Available: http://www.intel.com/technology/computing/opencv/ (March 6, 2008).
[13] Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Liu, X., Moore, G., Odell, J., Ollason, D., Povey, D., Valtchev, V., & Woodland, P. (2006). The HTK Book [online]. Cambridge University Engineering Department. Available: http://htk.eng.cam.ac.uk/docs/docs.shtml (March 6, 2008).