Lip reading: Japanese recognition by tracking temporal changes of lip shape

Koshi Odagiri and Yoichi Muraoka
Graduate School of Fundamental/Computer Science and Engineering, Waseda University, Tokyo, Japan

Abstract— In this paper, we propose a vision-based approach to recognizing Japanese vowels. Traditional research has dealt with lip size, lip width and lip height, but our method deals with lip shape. It focuses on temporal changes of the lip shape, and we define a new feature value to recognize vowels. There are many conventional studies, but their datasets are captured in specific environments, such as well-lighted rooms or with speakers wearing lipstick. We instead use Active shape models to extract the lip area and calculate feature values, so our technique is not influenced by the environment, and we show that the feature values are robust. In our experiments, an average accuracy of about 80% was obtained, which is the same as the vowel recognition rate of Japanese people who use lip reading. We conclude that our method is helpful for recognition.

Keywords: lip reading, vowel recognition, lip extraction

1. Introduction

Today, audio-based speech recognition is well developed and is used in game hardware, car navigation systems and cell phones; however, such systems cannot be used in noisy environments. Speech recognition by hearing-impaired people is basically based on sign language, but some people use lip reading. Therefore we can say that visual information improves the performance of audio speech recognition in bad environments.

Recognizing the mouth area is very important for lip reading. We classify recognition methods into two types: color-based recognition, such as the snake algorithm[1], and model-based recognition, such as Active shape models[2]. Color-based recognition is influenced by the brightness of the environment. Model-based recognition, on the other hand, is not influenced by lighting, but it needs training datasets of faces.

Lip reading experiments are classified into four types: letter recognition, word recognition, sentence recognition and semantic recognition. The Japanese language has hiragana letters and an unclear grammar, so sentence recognition and semantic recognition are not robust and need a lot of learned data. Japanese pronunciations consist of hiragana letters, Japanese speakers show differences in mouth shape when they speak vowels, and almost all sounds are based on the 5 vowels /a/, /i/, /u/, /e/ and /o/. Therefore single sound recognition of vowels is important. There are two types of single sound recognition: recognition from static lip images and tracking temporal changes of the lip.

In this paper, we propose a method of letter recognition that focuses on temporal changes of lip shape using model-based lip extraction for lip reading.

2. Related works

In this section, we discuss previous related works and show the direction of our method.

Uchimura's study[3] performs letter recognition from static images. They use histograms of gray scale images to recognize the lip area, and their letter recognition method uses mouth size and mouth width. Because they use static lip images, specifying the sections between letters is difficult, and the method is unsuitable for extension to word and sentence recognition.

Saitoh and Konishi's study[4] uses a color-based method, and their letter recognition uses temporal changes of lip size and lip aspect ratio. Their method achieved 93.8% accuracy on average, but it is not robust because it is color-based.

Fig. 1: Lip area extraction by color-based method

Figure 1 and figure 2 show results of lip area extraction using a color-based method. We experimented with lip extraction based on the RGB information of the image. Figure 1 shows that this method can capture almost all of the lip area, but it also captures non-lip areas. In figure 2, we changed the threshold of the color comparison.

Fig. 2: Lip area extraction by different threshold

The figures show that this color-based algorithm is clearly influenced by the background and by the setting of the thresholds.
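To make this sensitivity concrete, the following is a minimal sketch of a color-based segmentation of the kind tried for figures 1 and 2. The paper does not state the actual color comparison or threshold, so the red-dominance rule and the threshold parameter below are assumptions for illustration only.

    import numpy as np

    def lip_mask_rgb(image, threshold=1.3):
        """Naive color-based lip segmentation.

        Marks pixels whose red channel dominates the green channel by more
        than `threshold` as lip candidates.  This is only an illustrative
        stand-in for the RGB rule used for the paper's figures 1 and 2.

        image     : H x W x 3 uint8 RGB array
        threshold : red/green ratio above which a pixel is kept
        """
        rgb = image.astype(np.float32) + 1.0      # avoid division by zero
        red, green = rgb[..., 0], rgb[..., 1]
        return (red / green) > threshold          # boolean lip-candidate mask

Small changes to the threshold move the mask between the two behaviours seen in figures 1 and 2 (too much background included versus parts of the lips dropped), which is exactly the sensitivity that motivates the model-based approach.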

Fig. 3: Lip area extraction by active shape models

On the other hand, we propose a model-based lip extraction method. Figure 3 shows lip extraction by Active shape models using the same image as the face above. Clearly, the model-based method extracts the lip area correctly and in detail, and our method deals with the lip shape much more minutely. As mentioned in the above section, Japanese is pronounced as hiragana letters, and the differences between consonants are very small. Therefore Uchimura's method, based on mouth size and width, and Saitoh's method, based on mouth size and aspect ratio, are unsuitable for extension to consonant recognition. We propose a robust method based on model-based lip extraction and on tracking temporal changes of feature points on the lip shape to recognize vowels, and our method solves the above problems.

3. Method

We use a model-based method for lip area extraction in these experiments. In this section, we propose a method for recognition of utterances from visual information using lip features.

3.1 Initialization

First, we use 68 points to make the Active shape models learn faces, and we use 19 of those points as features. Figure 3 shows the 68 points learned by the Active shape models. In this experiment, we define a section of utterance as one segment between a mouth close and the next mouth close. Experimentally, one section has about 30 to 70 frames, so we adjust every section to 50 frames. To adjust for movement of the mouth, we normalize mouth size and inclination by the width between the features on both sides of the closed mouth contour in the first frame.

3.2 Feature value

To track temporal changes, we use feature values computed from the features of the lip contours, including the inside of the mouth. Figure 4 shows our definition of the feature value in these experiments. Our feature value is defined as the width between the center point of the contour and each point, so the feature values express where the features are.

Fig. 4: Features of lip area and feature value

Therefore the feature values are formulated as

    V = \sqrt{(\alpha_x - C_x)^2 + (\alpha_y - C_y)^2}    (1)

where V is the feature value, \alpha is a feature point, and C is the center feature of the mouth.
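The two steps above can be summarized in a short sketch, assuming the ASM landmark coordinates are already available for every frame. Formula (1) gives the distance from the mouth-center feature to each of the 19 lip-contour features, and each utterance section is then adjusted to 50 frames; the paper does not say how that adjustment is done, so linear interpolation over time is assumed here, and the function names are illustrative.

    import numpy as np

    def feature_values(lip_points, center_point):
        """Formula (1): distance between each of the 19 lip-contour
        features and the center feature of the mouth.

        lip_points   : (19, 2) array of (x, y) coordinates from the ASM fit
        center_point : (2,) array, the center feature of the mouth
        """
        diffs = np.asarray(lip_points, dtype=np.float64) - np.asarray(center_point, dtype=np.float64)
        return np.hypot(diffs[:, 0], diffs[:, 1])     # shape (19,)

    def adjust_to_50_frames(sequence, target_len=50):
        """Section 3.1: an utterance section (mouth close to the next mouth
        close, roughly 30-70 frames) is adjusted to a fixed 50 frames.
        Linear interpolation over time is an assumption, not the paper's
        stated procedure.

        sequence : (T, 19) array of per-frame feature values
        """
        sequence = np.asarray(sequence, dtype=np.float64)
        src = np.linspace(0.0, 1.0, num=sequence.shape[0])
        dst = np.linspace(0.0, 1.0, num=target_len)
        return np.stack([np.interp(dst, src, sequence[:, k])
                         for k in range(sequence.shape[1])], axis=1)

The normalization of mouth size and inclination described in section 3.1 would be applied to the landmark coordinates before these two steps.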


3.3 Relation between feature values

In this paragraph, we explain the relation between feature values. Figure 5 compares the feature values of 5 different people, calculated as in the previous paragraph, for the top feature of the mouth during /a/. Those features change largely for the vowel /a/.

Fig. 5: Feature values of /a/ by 5 people

In addition, figure 6 shows the relation of the temporal changes between vowels; we can see the differences between the vowels in the figure.

Fig. 6: Feature values of vowels

Considering the previous two graphs, we can recognize vowels by the feature values we propose, which are obtained by formula 1.

3.4 Learning values

We calculate the average of the feature values of formula 1 over the training samples for each vowel, and we use those averaged values to recognize an input vowel. The learned data are therefore obtained by

    D_{tvp} = \frac{\sum_{n=0}^{N} V_{np}}{N}    (2)

where D_{tvp} is the learned feature value at time t for vowel v and feature p, N is the number of datasets, p indexes the features of the lip area, and V is the value obtained by formula 1.

3.5 Matching method

For recognition of vowels, we use the following formula to calculate which vowel is most likely for the input:

    S_v = \sum_{t=0}^{T} \sum_{n=1}^{19} |X_{tvn} - D_{tvn}|    (3)

where S_v is the evaluated value of a vowel, T is the number of frames, X_{tvn} is the input vowel, and D is the value calculated by formula 2. We evaluate S_v for each vowel by formula 3, and the vowel with the smallest evaluated value is taken as the matching vowel for the input.
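A minimal sketch of the learning and matching steps, assuming every utterance has already been converted to a (50, 19) array of feature values as in the earlier sketch. The dict-based interface and the names below are illustrative, not the paper's implementation.

    import numpy as np

    def learn_vowel_templates(training_samples):
        """Formula (2): the learned value D[t, p] for a vowel is the average
        over its N training samples of the feature value at frame t, feature p.

        training_samples : dict mapping vowel -> (N, 50, 19) array
        returns          : dict mapping vowel -> (50, 19) template
        """
        return {vowel: np.mean(samples, axis=0)
                for vowel, samples in training_samples.items()}

    def match_vowel(input_values, templates):
        """Formula (3): S_v is the sum over frames and features of the
        absolute difference between the input X and the learned template D;
        the vowel with the smallest S_v is the recognition result.

        input_values : (50, 19) array for the input utterance
        templates    : dict mapping vowel -> (50, 19) template
        """
        scores = {vowel: float(np.sum(np.abs(input_values - template)))
                  for vowel, template in templates.items()}
        best = min(scores, key=scores.get)
        return best, scores

Because every section has been adjusted to the same 50-frame length in section 3.1, the frame-by-frame absolute differences of formula (3) can be taken directly, without any further temporal alignment.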

4. Experiments

In this section, we describe the implementation of our method, the experiments, and the results of our system.

4.1 Setup

We implemented a system based on the proposed method. The system is divided into the following 2 parts.

Fig. 7: Chart of learning part of system

Figure 7 is a chart of the learning part of our system. First we input a vowel and calculate the feature values by our method, and then those values are learned into a database.

Fig. 8: Chart of estimating part of system

Figure 8 is a chart of the estimating part of our system. The estimating part has the same processes as the learning part up to the calculation of the feature values, but the next step is a comparing process. The comparing process is done by the matching method of section 3 using the learned database. Finally, we get an estimated answer from the system.

Table 1: Environment of experiments
  OS                    Windows 7 Professional 64bit edition
  CPU                   Intel Core 2 Extreme X9650
  Memory                4 GByte
  Camera                Logicool 2-MP Webcam C600h
  Resolution of camera  640px x 480px
  FPS during capturing  30fps

Our system was run in the environment of table 1. We used a web camera, which means that the system was run with a camera poorer than the camera of an iPhone 4.

We captured 20 people speaking the 5 vowels in front of the camera, 3 times each, and we used the data of 15 of those people as the valid dataset. A valid dataset is defined as one that is not blurred and whose feature points can be recognized by the Active shape models. Our datasets were captured against various backgrounds such as laboratories, houses and meeting rooms.

In our experiment, we used leave-one-out cross-validation for the evaluation, and we evaluated the following 2 situations, sketched below.

1) Using captured vowels other than the vowel sample being tested
2) Using captured vowels other than the person being tested
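We read the two situations as leave-one-utterance-out and leave-one-speaker-out protocols; that reading, the data layout and the names below are assumptions for illustration. The sketch applies the averaging of formula (2) and the matching of formula (3) inside the cross-validation loop for situation (2).

    import numpy as np

    def leave_one_speaker_out_accuracy(samples):
        """Situation (2): for each speaker, learn the vowel templates from
        all other speakers (formula 2) and classify that speaker's
        utterances (formula 3).

        samples : list of (speaker_id, vowel, values) tuples,
                  where values is a (50, 19) feature-value array
        """
        vowels = sorted({v for _, v, _ in samples})
        speakers = sorted({s for s, _, _ in samples})
        correct, total = 0, 0
        for held_out in speakers:
            train = [s for s in samples if s[0] != held_out]
            test = [s for s in samples if s[0] == held_out]
            templates = {v: np.mean([x for _, vv, x in train if vv == v], axis=0)
                         for v in vowels}
            for _, true_vowel, x in test:
                scores = {v: np.sum(np.abs(x - t)) for v, t in templates.items()}
                correct += int(min(scores, key=scores.get) == true_vowel)
                total += 1
        return correct / total

Situation (1) is the same loop with individual utterances, rather than speakers, held out, so the remaining recordings of the same person can contribute to the templates.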
4.2 Results

Table 2 shows the results of the above experiments.

Table 2: Results of our experiments (the accuracy rates correspond to the above evaluation situations)
  Vowel    Accuracy rate of (1)  Accuracy rate of (2)
  /a/      76%                   75%
  /i/      92%                   90%
  /u/      67%                   69%
  /e/      84%                   82%
  /o/      72%                   76%
  Average  78.2%                 78.4%

The average accuracy rates are about 80%. In Sekiyama's research[5], the average accuracy of vowel recognition by Japanese people who use lip reading is about 80%, so our study reaches the same accuracy. The wrong estimations occurred mostly between /a/ and /o/ and between /u/ and /o/; those cases are often reported in other papers.

4.3 Discussion

Figure 9 shows a comparison of the largest differences in the temporal changes of the feature points between two training datasets defined as in section 4.1. Clearly, there is no difference between the two datasets, so our method produces robust feature values and can deal with the vowels of unknown people.

Fig. 9: Comparison between two trained datasets

Figure 10 shows some of the biggest differences of the feature values between /u/ and /o/; we deal here with the confusion between /u/ and /o/ as a case of wrong estimation. Clearly, the figure shows the /o/ vowel closer than /u/ to the input vowel /u/. Two reasons are considered for this case. The first is the precision problem of the Active shape models: when extracting lip feature points, the method occasionally tracks a wrong face model. This problem occurs because the training face dataset is not large enough, and it also makes the feature points blurred. The second is a speaker problem: in our experiments, there was a tendency for people not to open their mouths widely when they spoke, which makes the differences between vowels too small. As a result, blurred feature points make our system output wrong recognitions.

Fig. 10: Feature values of vowels

We mentioned two studies[3][4] in section 2, and we compare the results here. Table 3 shows the results of the two studies. Our average accuracy rate is inferior to the related works, but for some vowels our method is superior.

Table 3: The results of related works
  Vowel    Uchimura's study  Saitoh's study
  /a/      90%               95.8%
  /i/      70%               91.8%
  /u/      100%              96.9%
  /e/      100%              88.3%
  /o/      70%               96.2%
  Average  86%               93.8%

5. Conclusion

We have described a vowel recognition method based on tracking temporal changes of lip feature points. The results show that our method can produce robust feature values for Japanese vowel recognition, and we conclude that it is widely applicable to lip reading systems. As mentioned in the above section, lip tracking by the Active shape models is still blurred, so there is room for improvement in the lip tracking. This method was evaluated on vowels, so we are extending it to consonants as the next step, and to word and sentence recognition in the future.

References

[1] M. Kass, A. Witkin and D. Terzopoulos. "Snakes: Active Contour Models". International Journal of Computer Vision, pp. 321-331, 1988.
[2] T.F. Cootes, D.H. Cooper, C.J. Taylor and J. Graham. "Active shape models - their training and application". Computer Vision and Image Understanding, pp. 38-59, 1995.
[3] Keiichi Uchimura, Junji Michida, Masami Tokou, Teizo Aida. "Discrimination of Japanese vowels by image analysis". The Transactions of the Institute of Electronics, Information and Communication Engineers, pp. 2700-2702, 1988.
[4] Takeshi Saitoh, Mitsugu Hisaki, Ryosuke Konishi. "Japanese Phone Classification Based on Mouth Cavity Region". IEICE Technical Report, pp. 161-166, 2007.
[5] Kaoru Sekiyama, Kazuki Joe, Michio Umeda. "Lipreading Japanese syllables". ITEJ Technical Report, 12(1), pp. 33-40, 1988.