Lip reading: Japanese recognition by tracking temporal changes of lip shape

Koshi Odagiri and Yoichi Muraoka
Graduate School of Fundamental/Computer Science and Engineering, Waseda University, Tokyo, Japan

Abstract— In this paper, we propose a vision-based approach to recognizing Japanese vowels. Traditional research has dealt with lip size, lip width and lip height, but our method deals with lip shape. It focuses on temporal changes of the lip shape, and we define a new feature value to recognize vowels. There are many conventional studies, but their datasets are captured in specific environments, such as well-lighted rooms or with speakers wearing lipstick. We instead use Active shape models to extract the lip area and calculate feature values, so our technique is not influenced by the environment, and we show that the feature values are robust. In our experiments, an average accuracy of about 80% was obtained, which is the same as the vowel recognition rate of Japanese people who use lip reading. We conclude that our method is helpful for recognition.

Keywords: lip reading, vowel recognition, lip extraction

1. Introduction

Today, audio-based speech recognition is well developed and is used in game hardware, car navigation systems and cell phones; however, such systems cannot be used in noisy environments. Speech recognition by hearing-impaired people is basically based on sign language, but some people use lip reading. Therefore we can say that visual information improves the performance of audio speech recognition in bad environments.

Recognizing the mouth area is very important for lip reading. We classify recognition methods into two types: color-based recognition, such as the snake algorithm[1], and model-based recognition, such as Active shape models[2]. Color-based recognition is influenced by the brightness of the environment. Model-based recognition, on the other hand, is not influenced by lighting, but it needs training datasets of faces.

Lip reading experiments are classified into four types: letter recognition, word recognition, sentence recognition and semantic recognition. The Japanese language has hiragana letters and an unclear grammar, so sentence recognition and semantic recognition are not robust and need a lot of learned data. Japanese pronunciations consist of hiragana letters, Japanese speakers show differences in mouth shape when they speak vowels, and almost all sounds are based on the 5 vowels /a/, /i/, /u/, /e/ and /o/. Therefore single sound recognition of vowels is important. There are two types of single sound recognition: recognition from static lip images and tracking temporal changes of the lip.

In this paper, we propose a method of letter recognition that focuses on temporal changes of lip shape using model-based lip extraction for lip reading.

2. Related works

In this section, we discuss previous related works and show the direction of our method.

Uchimura's study[3] performs letter recognition from static images. They use histograms of gray scale images to recognize the lip area, and their letter recognition method uses mouth size and mouth width. Because they use static lip images, specifying the sections between letters is difficult, and the method is unsuitable for extension to word and sentence recognition.

Saitoh and Konishi's study[4] uses a color-based method, and their letter recognition uses temporal changes of lip size and lip aspect ratio. Their method achieved 93.8% accuracy on average, but it is not robust because it is color-based.

Fig. 1: Lip area extraction by color-based method

Figure 1 and figure 2 show results of lip area extraction using a color-based method. We experimented with lip extraction based on the RGB information of the image. Figure 1 shows that this method can capture almost all of the lip area, but it also captures non-lip areas. In figure 2, we changed the threshold of the color comparison.

Fig. 2: Lip area extraction by different threshold

The figures show that this color-based algorithm is clearly influenced by the background and by the setting of the thresholds.
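To make this sensitivity concrete, the following is a minimal sketch of a color-based segmentation of the kind tried for figures 1 and 2. The paper does not state the actual color comparison or threshold, so the red-dominance rule and the threshold parameter below are assumptions for illustration only.

    import numpy as np

    def lip_mask_rgb(image, threshold=1.3):
        """Naive color-based lip segmentation.

        Marks pixels whose red channel dominates the green channel by more
        than `threshold` as lip candidates.  This is only an illustrative
        stand-in for the RGB rule used for the paper's figures 1 and 2.

        image     : H x W x 3 uint8 RGB array
        threshold : red/green ratio above which a pixel is kept
        """
        rgb = image.astype(np.float32) + 1.0      # avoid division by zero
        red, green = rgb[..., 0], rgb[..., 1]
        return (red / green) > threshold          # boolean lip-candidate mask

Small changes to the threshold move the mask between the two behaviours seen in figures 1 and 2 (too much background included versus parts of the lips dropped), which is exactly the sensitivity that motivates the model-based approach.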

Fig. 3: Lip area extraction by active shape models

On the other hand, we propose a model-based lip extraction method. Figure 3 shows lip extraction by Active shape models using the same image as the face above. Clearly, the model-based method extracts the lip area correctly and in detail, and our method deals with the lip shape much more minutely. As mentioned in the above section, Japanese is pronounced as hiragana letters, and the differences between consonants are very small. Therefore Uchimura's method, based on mouth size and width, and Saitoh's method, based on mouth size and aspect ratio, are unsuitable for extension to consonant recognition. We propose a robust method based on model-based lip extraction and on tracking temporal changes of feature points on the lip shape to recognize vowels, and our method solves the above problems.

3. Method

We use a model-based method for lip area extraction in these experiments. In this section, we propose a method for recognition of utterances from visual information using lip features.

3.1 Initialization

First, we use 68 points to make the Active shape models learn faces, and we use 19 of those points as features. Figure 3 shows the 68 points learned by the Active shape models. In this experiment, we define a section of utterance as one segment between a mouth close and the next mouth close. Experimentally, one section has about 30 to 70 frames, so we adjust every section to 50 frames. To adjust for movement of the mouth, we normalize mouth size and inclination by the width between the features on both sides of the closed mouth contour in the first frame.

3.2 Feature value

To track temporal changes, we use feature values computed from the features of the lip contours, including the inside of the mouth. Figure 4 shows our definition of the feature value in these experiments. Our feature value is defined as the width between the center point of the contour and each point, so the feature values express where the features are.

Fig. 4: Features of lip area and feature value

Therefore the feature values are formulated as

    V = \sqrt{(\alpha_x - C_x)^2 + (\alpha_y - C_y)^2}    (1)

where V is the feature value, \alpha is a feature point, and C is the center feature of the mouth.
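The two steps above can be summarized in a short sketch, assuming the ASM landmark coordinates are already available for every frame. Formula (1) gives the distance from the mouth-center feature to each of the 19 lip-contour features, and each utterance section is then adjusted to 50 frames; the paper does not say how that adjustment is done, so linear interpolation over time is assumed here, and the function names are illustrative.

    import numpy as np

    def feature_values(lip_points, center_point):
        """Formula (1): distance between each of the 19 lip-contour
        features and the center feature of the mouth.

        lip_points   : (19, 2) array of (x, y) coordinates from the ASM fit
        center_point : (2,) array, the center feature of the mouth
        """
        diffs = np.asarray(lip_points, dtype=np.float64) - np.asarray(center_point, dtype=np.float64)
        return np.hypot(diffs[:, 0], diffs[:, 1])     # shape (19,)

    def adjust_to_50_frames(sequence, target_len=50):
        """Section 3.1: an utterance section (mouth close to the next mouth
        close, roughly 30-70 frames) is adjusted to a fixed 50 frames.
        Linear interpolation over time is an assumption, not the paper's
        stated procedure.

        sequence : (T, 19) array of per-frame feature values
        """
        sequence = np.asarray(sequence, dtype=np.float64)
        src = np.linspace(0.0, 1.0, num=sequence.shape[0])
        dst = np.linspace(0.0, 1.0, num=target_len)
        return np.stack([np.interp(dst, src, sequence[:, k])
                         for k in range(sequence.shape[1])], axis=1)

The normalization of mouth size and inclination described in section 3.1 would be applied to the landmark coordinates before these two steps.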


3.3 Relation between feature values

In this paragraph, we explain the relation between feature values. Figure 5 compares the feature values of 5 different people, calculated as in the previous paragraph, for the top feature of the mouth during /a/. Those features change largely for the vowel /a/.

Fig. 5: Feature values of /a/ by 5 people

In addition, figure 6 shows the relation of the temporal changes between vowels; we can see the differences between the vowels in the figure.

Fig. 6: Feature values of vowels

Considering the previous two graphs, we can recognize vowels by the feature values we propose, which are obtained by formula 1.

3.4 Learning values

We calculate the average of the feature values of formula 1 over the training samples for each vowel, and we use those averaged values to recognize an input vowel. The learned data are therefore obtained by

    D_{tvp} = \frac{\sum_{n=0}^{N} V_{np}}{N}    (2)

where D_{tvp} is the learned feature value at time t for vowel v and feature p, N is the number of datasets, p indexes the features of the lip area, and V is the value obtained by formula 1.

3.5 Matching method

For recognition of vowels, we use the following formula to calculate which vowel is most likely for the input:

    S_v = \sum_{t=0}^{T} \sum_{n=1}^{19} |X_{tvn} - D_{tvn}|    (3)

where S_v is the evaluated value of a vowel, T is the number of frames, X_{tvn} is the input vowel, and D is the value calculated by formula 2. We evaluate S_v for each vowel by formula 3, and the vowel with the smallest evaluated value is taken as the matching vowel for the input.
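A minimal sketch of the learning and matching steps, assuming every utterance has already been converted to a (50, 19) array of feature values as in the earlier sketch. The dict-based interface and the names below are illustrative, not the paper's implementation.

    import numpy as np

    def learn_vowel_templates(training_samples):
        """Formula (2): the learned value D[t, p] for a vowel is the average
        over its N training samples of the feature value at frame t, feature p.

        training_samples : dict mapping vowel -> (N, 50, 19) array
        returns          : dict mapping vowel -> (50, 19) template
        """
        return {vowel: np.mean(samples, axis=0)
                for vowel, samples in training_samples.items()}

    def match_vowel(input_values, templates):
        """Formula (3): S_v is the sum over frames and features of the
        absolute difference between the input X and the learned template D;
        the vowel with the smallest S_v is the recognition result.

        input_values : (50, 19) array for the input utterance
        templates    : dict mapping vowel -> (50, 19) template
        """
        scores = {vowel: float(np.sum(np.abs(input_values - template)))
                  for vowel, template in templates.items()}
        best = min(scores, key=scores.get)
        return best, scores

Because every section has been adjusted to the same 50-frame length in section 3.1, the frame-by-frame absolute differences of formula (3) can be taken directly, without any further temporal alignment.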

4. Experiments

In this section, we describe the implementation of our method, the experiments, and the results of our system.

4.1 Setup

We implemented a system based on the proposed method. The system is divided into the following 2 parts.

Fig. 7: Chart of learning part of system

Figure 7 is a chart of the learning part of our system. First we input a vowel and calculate the feature values by our method, and then those values are learned into a database.

Fig. 8: Chart of estimating part of system

Figure 8 is a chart of the estimating part of our system. The estimating part has the same processes as the learning part up to the calculation of the feature values, but the next step is a comparing process. The comparing process is done by the matching method of section 3 using the learned database. Finally, we get an estimated answer from the system.

Table 1: Environment of experiments
  OS                    Windows 7 Professional 64bit edition
  CPU                   Intel Core 2 Extreme X9650
  Memory                4 GByte
  Camera                Logicool 2-MP Webcam C600h
  Resolution of camera  640px x 480px
  FPS during capturing  30fps

Our system was run in the environment of table 1. We used a web camera, which means that the system was run with a camera poorer than the camera of an iPhone 4.

We captured 20 people speaking the 5 vowels in front of the camera, 3 times each, and we used the data of 15 of those people as the valid dataset. A valid dataset is defined as one that is not blurred and whose feature points can be recognized by the Active shape models. Our datasets were captured against various backgrounds such as laboratories, houses and meeting rooms.

In our experiment, we used leave-one-out cross-validation for the evaluation, and we evaluated the following 2 situations, sketched below.

1) Using captured vowels other than the vowel sample being tested
2) Using captured vowels other than the person being tested
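We read the two situations as leave-one-utterance-out and leave-one-speaker-out protocols; that reading, the data layout and the names below are assumptions for illustration. The sketch applies the averaging of formula (2) and the matching of formula (3) inside the cross-validation loop for situation (2).

    import numpy as np

    def leave_one_speaker_out_accuracy(samples):
        """Situation (2): for each speaker, learn the vowel templates from
        all other speakers (formula 2) and classify that speaker's
        utterances (formula 3).

        samples : list of (speaker_id, vowel, values) tuples,
                  where values is a (50, 19) feature-value array
        """
        vowels = sorted({v for _, v, _ in samples})
        speakers = sorted({s for s, _, _ in samples})
        correct, total = 0, 0
        for held_out in speakers:
            train = [s for s in samples if s[0] != held_out]
            test = [s for s in samples if s[0] == held_out]
            templates = {v: np.mean([x for _, vv, x in train if vv == v], axis=0)
                         for v in vowels}
            for _, true_vowel, x in test:
                scores = {v: np.sum(np.abs(x - t)) for v, t in templates.items()}
                correct += int(min(scores, key=scores.get) == true_vowel)
                total += 1
        return correct / total

Situation (1) is the same loop with individual utterances, rather than speakers, held out, so the remaining recordings of the same person can contribute to the templates.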
4.2 Results

Table 2 shows the results of the above experiments.

Table 2: Results of our experiments (the accuracy rates correspond to the above evaluation situations)
  Vowel    Accuracy rate of (1)  Accuracy rate of (2)
  /a/      76%                   75%
  /i/      92%                   90%
  /u/      67%                   69%
  /e/      84%                   82%
  /o/      72%                   76%
  Average  78.2%                 78.4%

The average accuracy rates are about 80%. In Sekiyama's research[5], the average accuracy of vowel recognition by Japanese people who use lip reading is about 80%, so our study reaches the same accuracy. The wrong estimations occurred mostly between /a/ and /o/ and between /u/ and /o/; those cases are often reported in other papers.

4.3 Discussion

Figure 9 shows a comparison of the largest differences in the temporal changes of the feature points between two training datasets defined as in section 4.1. Clearly, there is no difference between the two datasets, so our method produces robust feature values and can deal with the vowels of unknown people.

Fig. 9: Comparison between two trained datasets

Figure 10 shows some of the biggest differences of the feature values between /u/ and /o/; we deal here with the confusion between /u/ and /o/ as a case of wrong estimation. Clearly, the figure shows the /o/ vowel closer than /u/ to the input vowel /u/. Two reasons are considered for this case. The first is the precision problem of the Active shape models: when extracting lip feature points, the method occasionally tracks a wrong face model. This problem occurs because the training face dataset is not large enough, and it also makes the feature points blurred. The second is a speaker problem: in our experiments, there was a tendency for people not to open their mouths widely when they spoke, which makes the differences between vowels too small. As a result, blurred feature points make our system output wrong recognitions.

Fig. 10: Feature values of vowels

We mentioned two studies[3][4] in section 2, and we compare the results here. Table 3 shows the results of the two studies. Our average accuracy rate is inferior to the related works, but for some vowels our method is superior.

Table 3: The results of related works
  Vowel    Uchimura's study  Saitoh's study
  /a/      90%               95.8%
  /i/      70%               91.8%
  /u/      100%              96.9%
  /e/      100%              88.3%
  /o/      70%               96.2%
  Average  86%               93.8%

5. Conclusion

We have described a vowel recognition method based on tracking temporal changes of lip feature points. The results show that our method can produce robust feature values for Japanese vowel recognition, and we conclude that it is widely applicable to lip reading systems. As mentioned in the above section, lip tracking by the Active shape models is still blurred, so there is room for improvement in the lip tracking. This method was evaluated on vowels, so we are extending it to consonants as the next step, and to word and sentence recognition in the future.

References

[1] M. Kass, A. Witkin and D. Terzopoulos. "Snakes: Active Contour Models". International Journal of Computer Vision, pp. 321-331, 1988.
[2] T.F. Cootes, D.H. Cooper, C.J. Taylor and J. Graham. "Active shape models - their training and application". Computer Vision and Image Understanding, pp. 38-59, 1995.
[3] Keiichi Uchimura, Junji Michida, Masami Tokou, Teizo Aida. "Discrimination of Japanese vowels by image analysis". The Transactions of the Institute of Electronics, Information and Communication Engineers, pp. 2700-2702, 1988.
[4] Takeshi Saitoh, Mitsugu Hisaki, Ryosuke Konishi. "Japanese Phone Classification Based on Mouth Cavity Region". IEICE Technical Report, pp. 161-166, 2007.
[5] Kaoru Sekiyama, Kazuki Joe, Michio Umeda. "Lipreading Japanese syllables". ITEJ Technical Report, 12(1), pp. 33-40, 1988.