Vis Comput DOI 10.1007/s00371-012-0751-7

ORIGINAL ARTICLE

Automatic visual speech segmentation and recognition using directional motion history images and Zernike moments

Ayaz A. Shaikh · Dinesh K. Kumar · Jayavardhana Gubbi

© Springer-Verlag 2012

Abstract  An appearance-based visual speech recognition technique using only video signals is presented. The proposed technique is based on the use of directional motion history images (DMHIs), which is an extension of the popular optical-flow method for object tracking. Zernike moments of each DMHI are computed in order to perform the classification. The technique incorporates automatic temporal segmentation of isolated utterances. The segmentation of isolated utterances is achieved using pair-wise pixel comparison. A support vector machine is used for classification and the results are based on the leave-one-out paradigm. Experimental results show that the proposed technique achieves better performance in viseme recognition than others reported in the literature. The benefit of the proposed visual speech recognition method is that it is suitable for real-time applications due to the quick motion tracking system and the fast classification method employed. It has applications in command and control using lip-movement-to-text conversion, and can be used in noisy environments and also for assisting speech impaired persons.

Keywords  Motion analysis · Temporal segmentation · Directional motion history image · Optical flow · Zernike moments

A.A. Shaikh (✉) · D.K. Kumar
School of Electrical and Computer Engineering and Health Innovations Research Institute, RMIT University, Melbourne, Vic 3001, Australia
e-mail: [email protected]

D.K. Kumar
e-mail: [email protected]

J. Gubbi
ISSNIP, Dept of Electrical and Electronic Engineering, The University of Melbourne, Melbourne, Vic 3010, Australia
e-mail: [email protected]

1 Introduction

Human speech perception is greatly improved by seeing a speaker's lip movements as well as listening to the voice. However, mainstream automatic speech recognition (ASR) has focused almost exclusively on the latter: the acoustic signal. Recent advancements have led to purely acoustic-based ASR systems yielding excellent results in quiet or noiseless environments. However, the recognition error increases considerably in the real world due to the existence of environmental noise. Noise robust algorithms such as feature compensation [1], the nonlocal means denoising method [2], variable frame rate analysis [3], etc. have presented significant improvement in speech recognition under noisy environments; however, such algorithms are not entirely immune to noise. To overcome this limitation, non-audio speech modalities have been considered to augment acoustic information [4]. A number of options with non-audio speech modalities have been proposed, such as visual, mechanical and muscle activity-based sensing of the facial movement [5, 6], facial plethysmogram, Electromagnetic Articulography (EMA) to capture the movement of fixed points on the articulators [7] and measuring intra-oral pressure [8]. However, these systems require sensors to be placed on the face of the person and are thus intrusive and impractical in most situations. Speech recognition based on the visual speech signal is the least intrusive [9], non-constraining and noise robust. Systems that recognize speech from the shape and movement of the speech articulators such as the lips, tongue and teeth of the speaker have been considered to overcome the shortcomings of acoustic speech recognition [10], and these systems are known as Visual Speech Recognition (VSR) systems. The hardware for such a system can be as simple as a webcam or a camera built into a mobile phone.

In the past 30 years, various techniques have been proposed for visual speech classification.
One technique that has been proposed is based on the motion history image (MHI) [10]. The MHI is an appearance-based template-matching approach represented by a single image (temporal template), generated by taking binary difference images between successive motion image frames and superposing them so that older frames have smaller weights. The advantage of the temporal template-based method is that the continuous image sequence is condensed into a gray scale image while the dominant motion information is preserved. Therefore, it can represent a motion sequence in a better and more compact manner. Moreover, this approach is also less sensitive to silhouette noise such as holes, shadows and missing parts. The MHI method is inexpensive to compute because it keeps a history of temporal changes at each pixel location [11]. However, the latest motion template overwrites older motion templates, causing self-occlusion [12], resulting in inaccurate lip motion description, and therefore it may cause inexact viseme recognition. The other common shortcomings of the existing techniques are that they depend on manual segmentation to identify the start and the end frames of an utterance from the video sequence and are sensitive to the speaking speed. There is a need for automatic segmentation of the visual data to separate the individual utterances without human intervention. Earlier work where automatic segmentation was performed has typically considered the combination of the audio and visual data, and thus the temporal speech segmentation in AVSR systems is based on audio signals [13–15].

In this research, the issue of occlusion has been overcome by the use of Directional MHIs (DMHIs) based on optical-flow computation. Instead of a single MHI image, the DMHI-based technique represents four directions of motion, i.e., up, down, left and right. Automatic temporal segmentation is achieved by an ad hoc method known as the pair-wise pixel comparison method [16]. The system is made insensitive to the speed of speaking over two stages: firstly, at the time of the optical-flow computation, similar subsequent images of the video containing no difference (zero difference) in energy are excluded from the optical-flow computation; secondly, each optical-flow image is normalized before computing the DMHIs. As a result, the proposed system is suitable for subject independent use, i.e., for people with varying speaking speeds.

This article is organized as follows: Sect. 2 covers related work on audiovisual speech recognition and visual-only speech recognition based on shape parameters such as mouth height and width, global image intensity, and lip motion. In Sect. 3, we present the detailed method for the VSR system. Section 4 discusses the experimental results and analysis after presenting the temporal segmentation method. Finally, Sect. 5 concludes the article with some strategies for future work.

2 Related work

This section presents some related research on AudioVisual Speech Recognition (AVSR) and VSR. VSR involves the process of interpreting the visual information contained in a video in order to extract the information necessary to establish communication at a perceptual level between humans and computers. Potamianos et al. [9] have presented a detailed analysis of audiovisual speech recognition approaches, their progress and challenges. Generally, the systems reported in the literature are concerned with advancing theoretical solutions to various subtasks associated with the development of AVSR systems. There are very few papers that have considered the complete system. The major trends in the development of AVSR can be divided into the following categories: audio feature extraction, visual feature extraction, temporal segmentation, audiovisual feature fusion and classification of the features to identify the utterance. The proposed visual-only system does not have any audio data, and hence audio feature extraction and its fusion with visual features are not related to this work.

The visual feature extraction techniques that have been applied in the development of VSR systems can be categorized into: shape-based (geometric), intensity/image-based and motion-based. The first automatic visual speech recognition system, reported by Petajan [4] in 1984, was based on shape-based features such as mouth height and width extracted from binary images. In general, the shape-based feature extraction techniques attempt to identify the lips in the image, based either on geometrical templates that encode a standard set of mouth shapes [17] or on the application of active contours [18]. In [19], a system called the "image-input microphone" determined the mouth width and height from the video and derived the corresponding vocal-tract transfer function used to synthesize the speech. In another approach, the researchers [20] focused on visualizing the most important articulator of speech, the tongue: they placed an ultrasound probe beneath the chin, along with a video camera focused on the speaker's lips, to compute the visual features for a speech synthesizer. Since these approaches require extensive training and complex algorithms for marking the lip contours, and placing the ultrasound probe is impractical, they are unsuited to real-time systems.

The other approaches for visual feature extraction are based on image intensity, which is highly subjective and ineffective in most real-time situations. In the image-based or appearance-based approach, researchers have used the pixel values representing the mouth area as the feature vector either directly [21], or after some feature reduction technique such as principal components analysis [22], vector quantization [23], or linear discriminant analysis projection and maximum likelihood linear transform feature rotation [24]. Potamianos et al. [9] reported their recognition results in terms of Word Error Rate (WER), achieving a WER of 35 % with visual-only features on an office-Digit dataset in the speaker independent scenario. Zhao et al. [25] introduced local spatiotemporal descriptors, instead of global parameters, to recognize isolated spoken phrases based solely on visual input, obtaining a speaker independent recognition rate of approximately 62 % and a speaker dependent result of around 70 %. The advantage of the image-based approach is that no information is lost. A common criticism of this approach is that it tends to be very sensitive to changes in illumination, position, camera distance, rotation and speaker [23].

In more standard approaches, model-based methods are considered in which a geometric model of the lip contour is applied. Typical examples are deformable templates [26], snakes [27], active shape models (ASM) [18], active appearance models (AAM) [28] and multiscale spatial analysis (MSA). All of these approaches were presented by Matthews et al. [29] to extract the visual features. However, these techniques are computationally expensive and require accurate labeling of the lip contours in the training data to create the lip models before being used for feature extraction. This limits the performance of such techniques when applied to real-time systems.

In contrast to the image-based and model-based approaches, others aim at explicitly extracting relevant visual speech features based on motion analysis. For example, in [30] oral cavity features including width, height, area, perimeter and their ratios and derivatives were used as input for the recognizer and achieved a 25 % recognition rate for a group of sentences. In [31], descriptors of the mouth derived from optical-flow data were used as visual features to recognize the connected digits (0–9). In [32], lip contour geometric features (LCGFs) and lip motion velocity features (LMVFs) of side-face images are calculated. The technique achieved digit recognition errors of 24 % using a visual-only method with LCGF, 21.9 % with LMVF and 26 % with the combined LCGF and LMVF. Other motion-based techniques based on MHI have been reported in [10] and [33]. Several variants of the MHI method have been proposed to improve some of its constraints, and these have been used in several applications. MHI methods enhanced to the Motion Flow History [34], the Pixel Change History [11], the Intra-Motion History Image based on front-MHI and rear-MHI, etc. have been proposed for 2D motion recognition. In the 3D paradigm, view-invariant Motion History Volumes or 3D History Models are proposed by [35]. One of the key constraints of the MHI method is its motion overwriting problem due to self-occlusion, which happens when the motion is repeated in the same location at different times within the utterance [12, 36]. The Multilevel MHI (MMHI) [36] and Hierarchical Motion History Histogram (HMHH) [12] approaches are proposed along with the DMHI method to solve this overwriting problem. However, the DMHI method, which is the combination of optical flow and MHI, has outperformed these variants in solving the overwriting problem [37].

Once the features of the visual data have been extracted, these have to be classified. The most widely used classifier for AVSR systems is the HMM, because it models the changes of the states and is a very popular method for traditional audio-only speech recognition [38]. Various variants of HMMs have also been used for audiovisual ASR, such as HMMs with non-Gaussian continuous observation probabilities [39]. Moreover, additional methods to overcome the difference in the speed of speaking for classification have been employed in audiovisual ASR systems: dynamic time warping (DTW), used by Petajan [4], is computationally expensive and inaccurate, while other classifiers that allow the difference among speakers to be considered for classifying the visual data have used artificial neural networks (ANN) [40, 41], hybrid ANN-DTW systems [42], hybrid ANN-HMM [43] and, recently, support vector machines (SVM) [44]. SVM is based on the structural risk minimization principle, in contrast to the empirical risk minimization on which many classifiers are based. Ganapathiraju [45] reports very good results on audio speech recognition with a hybrid SVM-HMM system. On the visual part, Gordan et al. [46] obtained a high recognition rate using simple visual features, showing the suitability of SVMs for visual speech recognition.

This paper has employed SVM because of the non-temporal type of the ZM features and considering the ability of SVM to find a globally optimum decision function to separate the different classes of data; the larger the separating distance, the higher the generalization power will be. While the use of a separating hyper-plane makes SVM suitable as a binary classifier, groups of SVMs can solve multi-class problems such as the classification of utterances.

The main goal of the work reported in this paper was to develop and test a VSR system that classifies the utterance based on visual data alone, and performs automatic temporal segmentation of the visual data without any audio cues. This research needed to resolve some of the issues that have not been overcome till now, such as the self-occlusion and overwriting that affect MHI-based systems, automatic segmentation of the video data, and the need to compensate for the variation in the speed of speaking.

3 Methodology

In this section, we briefly discuss the data set used, and the overall approach using DMHIs based on optical flow.

3.1 Dataset

In our experiments, 14 visemes of the English language are considered; the visemes are defined in the Facial Animation Parameters (FAP) of the MPEG-4 standard. The dataset used in this study was recorded by Yau et al. [10] in a typical office environment. An inexpensive web camera was focused on the mouth region of the speaker and was fixed throughout the experiments. Factors such as window size (240 × 320 pixels), view angle of the camera, background and illumination were kept constant for each speaker. Seven volunteers (four males and three females) participated in the experiments, with each speaker recording 14 visemes at a sampling rate of 30 frames/second. This was repeated 10 times to generate sufficient variability.¹

¹ The experimental procedure was approved by the Human Experiments Ethics Committee of RMIT University (ASEHAPP 12-10).

3.2 Computation of DMHIs

The Motion History Image (MHI) can be used to describe the direction of motion in an image sequence. The intensity of each pixel in the MHI is a function of the motion density at that location, and therefore the temporal difference of these pixel values makes the MHI a temporal template. One of the advantages of the MHI representation is that a range of times, from frame to frame up to several seconds, may be encoded in a single frame. So, the MHI can span the time scale of human visual speech. The MHI Hτ(x, y, t) can be computed from an update function Ψ(x, y, t), where pixels are brighter where there is recent movement and darker where the movements are older:

$$H_\tau(x,y,t) = \begin{cases} \tau & \text{if } \Psi(x,y,t) = 1 \\ \max\bigl(0,\, H_\tau(x,y,t-1) - \delta\bigr) & \text{otherwise} \end{cases} \quad (1)$$

where x, y and t show the position and time, Ψ(x, y, t) signals object presence (or motion) in the current video image, τ decides the temporal duration of the MHI, and δ is the decay parameter. Some possible image processing techniques to define Ψ(x, y, t) are background subtraction or image differencing. Usually, the MHI is generated from a binarized image, obtained from frame subtraction [10], using a predefined threshold value ℘ to obtain a motion or no-motion classification:

$$\Psi_B(x,y,t) = \begin{cases} 1 & \text{if } \mathit{Diff}(x,y,t) \ge \wp \\ 0 & \text{otherwise} \end{cases} \quad (2)$$

where Diff(x, y, t), with difference distance Δ, is as follows:

$$\mathit{Diff}(x,y,t) = \bigl| I(x,y,t) - I(x,y,t \pm \Delta) \bigr| \quad (3)$$

where I(x, y, t) is the intensity value of the pixel at coordinate (x, y) in the t-th frame of the image sequence.
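As an illustration of Eqs. (1)–(3), the following is a minimal NumPy sketch of the classical frame-differencing MHI update. The function and parameter names (`update_mhi`, `tau`, `delta`, `wp`) are ours and not from the paper, and the parameter values are only placeholders.

```python
import numpy as np

def update_mhi(mhi, prev_frame, curr_frame, tau=255, delta=10, wp=25):
    """One step of the frame-differencing MHI update of Eqs. (1)-(3).

    mhi, prev_frame, curr_frame: 2-D arrays of equal size (gray scale).
    tau: value written where motion is detected (temporal duration).
    delta: decay subtracted where no motion is detected.
    wp: threshold on the absolute frame difference (Eq. (2)).
    """
    diff = np.abs(curr_frame.astype(np.float32) -
                  prev_frame.astype(np.float32))           # Eq. (3), with Delta = 1 frame
    motion = diff >= wp                                     # Eq. (2): binarized update function
    mhi = np.where(motion, float(tau),
                   np.maximum(0.0, mhi - delta))            # Eq. (1)
    return mhi

# Usage on a list of gray-scale mouth-region frames (e.g. 240x320):
# mhi = np.zeros_like(frames[0], dtype=np.float32)
# for prev, curr in zip(frames[:-1], frames[1:]):
#     mhi = update_mhi(mhi, prev, curr)
```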

Instead of the frame subtraction method to calculate the update function presented in Eq. (1), this research employs the probabilistic model of optical flow developed by Sun et al. [47] to compute the DMHIs. Optical flow is a measure of the visually apparent motion of objects between two images and measures the spatiotemporal variations of video data. The word apparent implies that the optical flow does not consider the movement of the objects in the real 3D space, but the motion in the image space. It is observed that there is inter- and intra-subject variation in speech speed. This variation in speed can give rise to different perceptual impressions and can cause inexact viseme recognition. To compensate for the variation in the speed of speaking, similar subsequent frames which result in zero energy difference between frames are filtered out using the mean square error (MSE) given in Eq. (4). However, due to the effect of environmental noise on the video recording, some energy differences are expected, and a threshold value should therefore be used to determine whether or not to calculate the optical flow. Removing similar sequential images has the additional advantage of reducing the computational load while calculating the optical flow.

$$\mathrm{MSE} = \frac{1}{m \times n} \sum_{x=1}^{m} \sum_{y=1}^{n} \bigl[ I_t(x,y) - I_{t-1}(x,y) \bigr]^2 \quad (4)$$

The optical-flow computation on consecutive frames (denoted by Ψ(x, y, t)) provides the horizontal and vertical components of the flow, Ψx and Ψy. These components are then half-wave rectified into four non-negative separate channels Ψx+, Ψy+, Ψx− and Ψy−, constrained such that Ψx = Ψx+ − Ψx− and Ψy = Ψy+ − Ψy−. Figure 1 depicts the flow diagram of this flow vector computation.

Fig. 1 Conceptual framework of optical-flow vector separation into four directions

Based on the four directions, each optical-flow image is normalized according to the threshold value ξ, where ξ is computed according to Otsu's [48] global threshold method. Based on these normalized image sequences, four separate optical-flow motion history templates are created after deriving the four optical-flow components:

$$H_\tau^{x-}(x,y,t) = \begin{cases} \tau & \text{if } \Psi_x^-(x,y,t) > \xi \\ \max\bigl(0,\, H_\tau^{x-}(x,y,t-1) - \delta\bigr) & \text{otherwise} \end{cases}$$

$$H_\tau^{x+}(x,y,t) = \begin{cases} \tau & \text{if } \Psi_x^+(x,y,t) > \xi \\ \max\bigl(0,\, H_\tau^{x+}(x,y,t-1) - \delta\bigr) & \text{otherwise} \end{cases}$$

$$H_\tau^{y-}(x,y,t) = \begin{cases} \tau & \text{if } \Psi_y^-(x,y,t) > \xi \\ \max\bigl(0,\, H_\tau^{y-}(x,y,t-1) - \delta\bigr) & \text{otherwise} \end{cases}$$

$$H_\tau^{y+}(x,y,t) = \begin{cases} \tau & \text{if } \Psi_y^+(x,y,t) > \xi \\ \max\bigl(0,\, H_\tau^{y+}(x,y,t-1) - \delta\bigr) & \text{otherwise} \end{cases} \quad (5)$$

For the positive and negative horizontal directions, $H_\tau^{x+}(x,y,t)$ and $H_\tau^{x-}(x,y,t)$ are set up as motion history templates; $H_\tau^{y+}(x,y,t)$ and $H_\tau^{y-}(x,y,t)$ represent the positive and negative vertical directions (up and down, respectively). Feature vectors are computed from these four history templates by employing Zernike moments for classification and recognition.
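The following sketch shows how the four directional templates of Eq. (5), together with the MSE frame filter of Eq. (4) and the Otsu threshold [48], could be computed. The paper uses the learned optical-flow model of Sun et al. [47]; here OpenCV's Farneback dense flow is substituted purely to keep the example self-contained, and the function name and parameter values are ours.

```python
import cv2
import numpy as np

def directional_mhis(frames, tau=255.0, delta=10.0, mse_thresh=1.0):
    """Sketch of DMHI computation (Eq. (5)) from a list of gray-scale frames.

    Farneback flow stands in for the Sun et al. [47] model used in the paper;
    tau, delta and mse_thresh are illustrative values.
    """
    h, w = frames[0].shape
    # Four history templates: x-, x+, y-, y+ (left, right, up, down).
    H = {k: np.zeros((h, w), np.float32) for k in ("x-", "x+", "y-", "y+")}
    prev = frames[0]
    for curr in frames[1:]:
        # Skip near-identical frames (Eq. (4)) to compensate for speaking speed.
        mse = np.mean((curr.astype(np.float32) - prev.astype(np.float32)) ** 2)
        if mse < mse_thresh:
            continue
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        fx, fy = flow[..., 0], flow[..., 1]
        # Half-wave rectification into four non-negative channels.
        chans = {"x+": np.maximum(fx, 0), "x-": np.maximum(-fx, 0),
                 "y+": np.maximum(fy, 0), "y-": np.maximum(-fy, 0)}
        for k, c in chans.items():
            # Otsu's method [48] expects 8-bit input; normalize each channel first.
            c8 = cv2.normalize(c, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
            xi, _ = cv2.threshold(c8, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
            # Eq. (5): write tau where the flow exceeds xi, otherwise decay by delta.
            H[k] = np.where(c8 > xi, tau, np.maximum(0.0, H[k] - delta))
        prev = curr
    return H
```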

Figure 2 shows the complete system flow diagram of the proposed visual speech recognition system based on directional motion history images (DMHIs) and the SVM classifier. Once the temporal segmentation (described in Sect. 4) of an isolated utterance is performed, the cropped video containing a viseme is fed to the system for optical-flow-based DMHI computation as described above.

Fig. 2 Flow chart for the visual speech recognition (L: Left, R: Right, U: Up, D: Down)

3.3 Feature extraction using Zernike moments

In order to classify an image from a large dataset, the image features need to be invariant to scale and rotation. Features should have sufficient discriminating power and noise immunity for retrieval from the large image dataset. Zernike Moments (ZMs) are image moments or features having the desired properties such as rotation invariance, robustness to noise, expression efficiency and multilevel representation for describing the shapes of patterns [49]. ZMs have been demonstrated to outperform other image moments, such as geometric, Legendre and complex moments, in terms of sensitivity to image noise, information redundancy and capability for image representation [50]. Our proposed method uses ZMs as visual speech features to represent the approximate image of the DMHI. Before computing the visual speech features, the DMHIs are resized using bicubic interpolation from the rectangular size of 240 × 320 pixels to 240 × 240 pixels, so that no precision is lost and the DMHIs become square images.

Zernike moments are computed by projecting each DMHI image function f(x, y) onto the orthogonal Zernike polynomial Vnl of order n with repetition l, defined within a unit circle. The center of the image is taken as the origin and the pixel coordinates are mapped to the range of the unit circle (i.e. x² + y² ≤ 1) as follows:

$$V_{nl}(\rho,\theta) = R_{nl}(\rho)\, e^{-jl\theta}; \qquad j = \sqrt{-1} \quad (6)$$

where Rnl is the real-valued radial polynomial, given by

$$R_{nl}(\rho) = \sum_{k=0}^{\frac{n-|l|}{2}} (-1)^{k}\, \frac{(n-k)!}{k!\,\bigl(\frac{n+|l|}{2}-k\bigr)!\,\bigl(\frac{n-|l|}{2}-k\bigr)!}\; \rho^{\,n-2k} \quad (7)$$

with |l| ≤ n and (n − |l|) even. The main advantage of this approach is the simple rotational property of the features [49]. Zernike moments are also independent features due to the orthogonality of the Zernike polynomial Vnl [50]. The Zernike moment Znl of order n with repetition l is given by

$$Z_{nl} = \frac{n+1}{\pi} \int_{0}^{2\pi}\!\!\int_{0}^{1} V_{nl}^{*}(\rho,\theta)\, f(\rho,\theta)\, \rho\, d\rho\, d\theta \quad (8)$$

where f(ρ, θ) is the intensity distribution of the approximate image of the DMHI mapped to a unit circle of radius ρ and angle θ, where x = ρ cos θ and y = ρ sin θ.
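A minimal sketch of the ZM feature extraction is given below, assuming the four DMHIs from the earlier sketch. The `mahotas` library's `zernike_moments` is used here as a stand-in for Eq. (8); with degree 14 it returns the 64 moment magnitudes per image described in the text, giving the 256-dimensional vector. The helper name and the disc radius are our choices.

```python
import numpy as np
import cv2
import mahotas  # provides mahotas.features.zernike_moments

def dmhi_feature_vector(dmhis, degree=14):
    """Build the 256-dimensional feature vector from the four DMHIs.

    dmhis: dict with keys "x-", "x+", "y-", "y+" (left, right, up, down).
    Uses |Z_nl| magnitudes (rotation invariant, Eq. (10)); with degree=14,
    mahotas returns 64 moments per image, i.e. 4 x 64 = 256 features.
    """
    feats = []
    for key in ("x-", "x+", "y-", "y+"):
        img = dmhis[key]
        # Resize 240x320 -> 240x240 with bicubic interpolation (square image).
        sq = cv2.resize(img, (240, 240), interpolation=cv2.INTER_CUBIC)
        # Zernike moment magnitudes over a disc covering the square image.
        zm = mahotas.features.zernike_moments(sq, radius=120, degree=degree)
        feats.append(zm)
    return np.concatenate(feats)   # shape (256,)
```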

Figure 3 shows the square-to-circular transformation performed for the computation of the ZMs, which transforms the square image function f(x, y) in terms of the x–y axes to a circular image function f(ρ, θ) in terms of the i–j axes.

Fig. 3 Square-to-circular transformation of the DMHI image

To illustrate the rotational characteristics of ZMs, consider β as the angle of rotation of the image. The resulting rotated ZM Z′nl is

$$Z'_{nl} = Z_{nl}\, e^{-jl\beta} \quad (9)$$

where Znl is the Zernike moment of the original image. Equation (9) demonstrates that rotation of an image results in a phase shift of the Zernike moments [49]. The absolute values of the Zernike moments are therefore rotation invariant [49], as shown in Eq. (10):

$$\bigl|Z'_{nl}\bigr| = \bigl|Z_{nl}\bigr| \quad (10)$$

This paper uses the absolute values of the Zernike moments, |Z′nl|, as the rotation invariant features of the DMHI. An optimum number of Zernike moments needs to be selected to trade off between the dimensionality of the feature vectors and the amount of information represented by the features. 64 Zernike moments, comprising the 0th order moments up to the 14th order moments, have been used as features to represent the approximate images of the DMHIs of each viseme; the number of Zernike moments required for representing the DMHIs was determined empirically.

In the DMHI method, we have four history components. Considering 64 Zernike moments, comprising the 0th order moments up to the 14th order moments, for each of the DMHIs (up, down, left, right), we compute a 256-dimensional feature vector to represent each utterance.

3.4 Support vector machine classifier

This work required the classification of 14 viseme classes using the 256 extracted features. The SVM classifier is able to find the optimal hyper-plane that separates clusters of vectors in such a way that the classes with one category of the features are on one side of the plane and the classes with the other category of the features are on the other side of the plane. The vectors near the hyper-plane are the support vectors. SVM is based on the structural risk minimization principle, in contrast to the empirical risk minimization on which many classifiers are based. An SVM with a radial basis function (RBF) kernel was employed to classify the 256 features. The kernel width parameter gamma = 0.25 and the error penalty parameter C = 2 were optimized by iterative experiments (grid search). The implementation was carried out using the libSVM library [51].

To evaluate the proposed classification method, we used the measures of accuracy, sensitivity and specificity, defined as

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \times 100\ \% \quad (11)$$

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} \times 100\ \% \quad (12)$$

$$\mathrm{Specificity} = \frac{TN}{FP + TN} \times 100\ \% \quad (13)$$

where TP = True Positive, TN = True Negative, FP = False Positive, and FN = False Negative.

Leave-one-out cross-validation was performed to assess the performance of the proposed algorithm. This means that out of N samples from each of the 14 classes of all subjects, N − 1 of them are used to train the classifier and the remaining one to test it. This process is repeated 10 times for each class (i.e., ten-fold cross validation), each time leaving a different sample out.
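A minimal sketch of the classification and evaluation step is shown below. scikit-learn's SVC (which wraps libSVM [51]) is used with the gamma and C values quoted above; the per-class leave-one-out protocol of the paper is simplified here to plain leave-one-out over all samples, and the function name is ours.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut
from sklearn.metrics import confusion_matrix

def evaluate_visemes(X, y):
    """X: (n_samples, 256) ZM feature matrix; y: viseme labels (14 classes)."""
    y_true, y_pred = [], []
    for train_idx, test_idx in LeaveOneOut().split(X):
        clf = SVC(kernel="rbf", gamma=0.25, C=2.0)   # values reported in the paper
        clf.fit(X[train_idx], y[train_idx])
        y_true.append(y[test_idx][0])
        y_pred.append(clf.predict(X[test_idx])[0])

    # Per-class accuracy, sensitivity and specificity (Eqs. (11)-(13)),
    # treating each viseme as the "positive" class in turn.
    cm = confusion_matrix(y_true, y_pred)
    results = {}
    for i, label in enumerate(np.unique(y)):
        tp = cm[i, i]
        fn = cm[i, :].sum() - tp
        fp = cm[:, i].sum() - tp
        tn = cm.sum() - tp - fn - fp
        results[label] = {
            "accuracy": 100.0 * (tp + tn) / (tp + tn + fp + fn),
            "sensitivity": 100.0 * tp / (tp + fn),
            "specificity": 100.0 * tn / (fp + tn),
        }
    return results
```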
4 Temporal viseme segmentation

Temporal segmentation of an isolated viseme or utterance from a continuous video of repeated utterances is important for visual speech recognition. Its purpose is to identify the start and the end frames of an utterance in the sequence of utterances. To segment sequential utterances into single visemes, we used an ad hoc method of temporal segmentation based on pair-wise pixel comparison [16] of consecutive images for 14 different mouth movements/activities.

Figure 4 shows the steps followed in the ad hoc temporal segmentation method. Figure 4a indicates the squared mean difference of the gray scale intensities of corresponding pixels in accumulative frames (only three visemes are shown for clarity). The results of moving-average-window smoothing and further Gaussian smoothing are shown in Fig. 4b and 4c; finally, selecting an appropriate threshold value of 0.09 leads to the required temporal segmentation. The result of the temporal segmentation (as unit step-pulse-shaped representations, i.e., two pulses represent an utterance) is shown in Fig. 4d. It is clear from the figure that the ad hoc scheme followed is highly effective in viseme segmentation.
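A minimal sketch of this pipeline is given below. The threshold of 0.09 and the smoothing steps follow the text; the window length, Gaussian sigma and the [0, 1] frame scaling are not specified in the paper and are assumptions, as are the function and parameter names.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def segment_utterances(frames, window=15, sigma=5, threshold=0.09):
    """Pair-wise pixel comparison segmentation sketch (cf. Fig. 4).

    frames: list of gray-scale frames scaled to [0, 1].
    Returns a list of (start_frame, end_frame) pairs.
    """
    # Fig. 4a: squared mean difference of consecutive frames.
    d = np.array([np.mean((frames[i + 1] - frames[i]) ** 2)
                  for i in range(len(frames) - 1)])
    # Fig. 4b: moving-average-window smoothing.
    d = np.convolve(d, np.ones(window) / window, mode="same")
    # Fig. 4c: further Gaussian smoothing.
    d = gaussian_filter1d(d, sigma)
    # Fig. 4d: threshold to obtain an activity mask, then extract segments.
    active = d > threshold
    segments, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i
        elif not a and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(active)))
    return segments
```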

Fig. 4 Results of temporal segmentation. (a) Squared mean difference of accumulative frames' gray scale intensities. (b) Result of smoothing data by moving average window. (c) Result of further smoothing by Gaussian filtering. (d) Segmented data (blue blocks indicate starting and ending points)

5 Results and discussions

Figure 5 shows the results of the temporal segmentation of all 14 visemes for one subject. The results comparing the automatic and manual segmentation of this subject for the first three epochs are tabulated in Table 1. The results for the other subjects were very similar, and the results from all the subjects and all the visemes in terms of start frame error rate and end frame error rate have been calculated using

$$\frac{1}{10} \sum_{i=1}^{10} \bigl| \mathit{Start\_Frame}_{\mathrm{manual}\_i} - \mathit{Start\_Frame}_{\mathrm{auto}\_i} \bigr| \quad (14)$$

$$\frac{1}{10} \sum_{i=1}^{10} \bigl| \mathit{End\_Frame}_{\mathrm{manual}\_i} - \mathit{End\_Frame}_{\mathrm{auto}\_i} \bigr| \quad (15)$$

and tabulated in Tables 2 and 3. From Tables 2 and 3, the average error between the automatic and manual segmentation for all subjects and all utterances is 2.98 frames, which is around 1.5 frames on either side of an utterance. It can also be seen from Tables 2 and 3 that subjects 6 and 7 have comparatively higher frame error rates; by manual observation it was found that both subjects had comparatively larger head movements in the videos during the utterances and had darker skin tones.

A variety of feature extraction and classification algorithms have been suggested in the lip-reading literature, and it is quite difficult to compare the results, as they are rarely tested on a common audiovisual database. However, it is observed from Table 4 that the average accuracy of 98 % based on the DMHI technique is the best compared to the state-of-the-art techniques presented in [9, 25, 28, 31, 32] in the visual-only scenario.

Table 4 shows the average accuracy of identifying the visemes for all seven subjects for 14 different visemes using DMHIs and MHI (for comparison). The results indicate that DMHIs outperformed MHI in identifying the utterance on all accounts, as it can address the overwriting problem significantly. While the average accuracy (98 % and 93.66 %) and specificity (99.7 % and 99 %) of the two techniques were comparable, the average sensitivity of DMHI was much better than that of MHI, with the sensitivity of DMHIs being 75.7 % while that of MHI was 24.4 %.
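As a small worked illustration of Eqs. (14) and (15), which produce the per-viseme values in Tables 2 and 3, the following sketch computes the average absolute frame error; the function name is ours and the example uses only the first three /a/ epochs from Table 1.

```python
import numpy as np

def frame_error(manual, auto):
    """Average absolute frame error over the recorded epochs (Eqs. (14)-(15)).

    manual, auto: arrays of start (or end) frame indices obtained by manual
    annotation and by the automatic segmentation, respectively.
    """
    return float(np.mean(np.abs(np.asarray(manual, float) -
                                np.asarray(auto, float))))

# Example with the /a/ start frames of Table 1 (first three epochs only):
# frame_error([24, 76, 128], [27, 75, 128])  -> 1.33
```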

Fig. 5 Results of temporal segmentation of all 14 visemes for a single user using the proposed method

Fig. 5 (Continued)

Thus, from the results, it is evident that the DMHIs outperformed MHI in recognizing the lip movements for the different visemes.

The results indicate that the proposed method using DMHI is more sensitive in recognizing the correct viseme and leads to fewer false negatives. The proposed method is based on advanced optical-flow analysis [47], in which a standard incremental multi-resolution technique is used to estimate flow fields with large displacements. The optical flow estimated at a coarse level is used to warp the second image toward the first at the next finer level, and a flow increment is calculated between the first image and the warped second image. In building the pyramid, each level is recursively down-sampled from its nearest lower level. The method is robust against lighting changes. The direction of motion of the lips is an important feature, and it is provided by the optical-flow-based DMHIs. Contrary to the DMHI, the standard MHI is a gray scale representation of the differences of successive binary images of a video. By representing the ZMs of only the MHI as features, the information about their direction is lost, which is critical in visual speech recognition. Hence, comparing the sensitivity values in Table 4 suggests that the ZMs of DMHIs are successful in representing the lip movement, vindicating our hypothesis. The sensitivity and the unique property of rotational invariance of ZMs ensure that the feature representation is independent of the subject and the style with which they speak.

Another important aspect is the dataset and the experimental protocol of classification which was followed. The dataset used in this work is based on visemes, which are the fundamental visual units of human speech and which can be extended to words and sentences by concatenation, while most of the other work is based on the digits (0–9).

Table 1 Results of temporal segmentation for 14 visemes (three epochs)

Viseme          Epoch 1           Epoch 2           Epoch 3
                Start    End      Start    End      Start    End

/a/   Manual    24       56       76       107      128      158
      Auto      27       56       75       107      128      158
/ch/  Manual    19       53       70       101      123      157
      Auto      19       53       70       102      122      155
/e/   Manual    49       78       102      133      171      211
      Auto      48       76       102      134      170      209
/g/   Manual    43       86       100      144      165      199
      Auto      43       83       100      142      163      198
/th/  Manual    1        33       58       95       120      155
      Auto      1        34       60       95       119      155
/i/   Manual    22       55       74       107      130      161
      Auto      24       54       73       107      129      161
/m/   Manual    21       55       86       120      154      188
      Auto      22       56       86       119      153      186
/n/   Manual    80       116      136      173      203      241
      Auto      79       116      136      175      204      240
/o/   Manual    26       58       76       114      134      169
      Auto      26       57       75       115      133      167
/r/   Manual    16       58       79       119      142      182
      Auto      19       57       81       118      147      182
/s/   Manual    22       59       80       120      143      181
      Auto      22       58       78       119      146      184
/t/   Manual    16       35       64       85       107      131
      Auto      15       34       63       83       108      132
/u/   Manual    64       101      122      160      180      214
      Auto      63       101      121      159      179      215
/v/   Manual    11       49       61       105      113      153
      Auto      11       48       62       101      113      152

The features of the numbers/digits are comparatively more discriminative, as confirmed by our observations. Moreover, in earlier work, Hidden Markov Models (HMM) [52] and a feed-forward multilayer perceptron (MLP) artificial neural network (ANN) with back propagation [41] have already been investigated using the same dataset. The mean recognition rate for the identification of nine visemes was reported as 93.57 % using the HMM and 84.7 % using the ANN. One important thing to note is that, in contrast to the work presented here, the ANN and HMM were tested in subject dependent scenarios. In the proposed method, the samples from all the subjects were used in training the classifier, which introduces a lot of inter-subject variation. The features and the classifier chosen were successful in countering these effects, as reflected by the results obtained.

6 Conclusions and future work

This paper has presented a new visual speech recognition technique employing directional motion history images, which represent four directions of motion; the computation of the DMHIs is based on optical flow. Reliable temporal segmentation is a major problem in automatic visual speech recognition, and the proposed system incorporates automatic temporal segmentation of the video data. The experimental system demonstrates that this technique performs very well in terms of high accuracy and high sensitivity. However, the overwriting problem may still occur in the proposed technique if the speech is continuous or long motion sequences are considered in the videos. Considering the issue with long motion sequences, some basic visual units (i.e. short motion sequences/lip motion sequences) should be defined; continuous speech can then be segmented into those visual units, and hence

Table 2 Average frame error rate of 10 epochs at start of an utterance

Start frame   Subject 1   Subject 2   Subject 3   Subject 4   Subject 5   Subject 6   Subject 7   Average

/a/           1.3         0.3         0.7         1           1.3         1.3         2           1.13
/ch/          0.3         0.7         0.3         1.7         1           7.7         4.7         2.34
/e/           0.7         2           0.7         0.3         1           1.7         1           1.06
/g/           0.7         1.7         1           0.7         1           2           3.3         1.49
/th/          1           0.7         1           1           1.7         5.7         3.3         2.06
/i/           1           0.7         1.3         1.7         0           3.3         3.3         1.61
/m/           0.7         0.3         1           2.7         0.7         3           1.3         1.39
/n/           0.7         1           1           0.7         1           3           0.7         1.16
/o/           0.7         0.3         0.7         1           0.7         2           5.7         1.59
/r/           3.3         1.7         0.3         1.3         1           5.7         3.7         2.43
/s/           1.7         0.3         0.3         2           1.7         0.3         2           1.19
/t/           1           1.3         1           0.7         7           1           3.3         2.19
/u/           1           1.7         0.7         1.3         4           1           1           1.53
/v/           0.3         0.3         0.3         1           1           1           3.3         1.03
Average       1.03        0.93        0.74        1.22        1.65        2.77        2.76

Average error of start frame for all subjects, all visemes: 1.48

Table 3 Average frame error rate of 10 epochs at end of an utterance

End frame     Subject 1   Subject 2   Subject 3   Subject 4   Subject 5   Subject 6   Subject 7   Average

/a/           1           2           0.3         0.7         2.3         1           2.3         1.37
/ch/          1           0.7         0.3         1.3         2           0.7         4.3         1.47
/e/           1.7         1.7         2.7         1           1.3         0.3         1           1.39
/g/           2           0.3         1.3         1.3         0.3         0.7         2.3         1.17
/th/          0.3         0.7         1           0.7         2           1.7         3           1.34
/i/           0.3         2.3         1           0.7         0.7         1.7         2.7         1.34
/m/           1.3         1.7         1.3         1           1.3         1           1           1.23
/n/           1           2           0.3         1           1.3         2.7         2           1.47
/o/           1.3         0.3         0.7         1.3         0.7         3.3         3           1.51
/r/           0.7         2.3         1           2.3         0.3         0.7         3           1.47
/s/           1.7         1           1           1.3         1           2.3         2.3         1.51
/t/           1.3         0.3         0           0.7         0.3         1           2.3         0.84
/u/           0.7         1.3         1           1.7         0.7         3.7         1           1.44
/v/           2           1           0.7         2           1.3         1.7         4.3         1.86
Average       1.16        1.26        0.9         1.21        1.11        1.61        2.46

Average error of end frame for all subjects, all visemes: 1.29

the continuous speech can be recognized by concatenating those visual units.

To overcome the inter-subject variation in the style or speed of speaking, the system was made insensitive to the speed of speaking over two stages; as a result, the system is suitable for subject independent use. The technique computed the Zernike Moments from each of the directional motion history images, which are rotation invariant features and are useful in real-time systems. Finally, the classification is performed using a support vector machine classifier, which ensures convergence at a global minimum. Unlike most other work, which is based on a single subject, this work considered all subjects together, using the leave-one-out paradigm.

Table 4 Classification results of individual one class SVM for 14 visemes (all values in %)

                      DMHI                                    MHI
Visemes      Specificity  Sensitivity  Accuracy      Specificity  Sensitivity  Accuracy

1   /a/      99.9         74.3         98.1          99.7         15.7         93.7
2   /ch/     99.7         71.4         97.7          96.7         28.6         91.8
3   /e/      99.5         77.1         97.9          99.8         15.7         93.8
4   /g/      99.8         70           97.7          99.7         11.4         93.3
5   /th/     100          74.3         98.2          98.7         21.4         93.2
6   /i/      99.5         75.7         97.8          98.7         35.7         94.2
7   /m/      99.8         90           99.1          98.4         52.9         95.1
8   /n/      99.5         71.4         97.4          99.3         18.6         93.6
9   /o/      99.9         80           98.5          98.6         25           93.6
10  /r/      99.6         72.9         97.7          100          17.1         94.1
11  /s/      99.6         70           97.4          99           24.2         93.6
12  /t/      99.6         72.9         97.7          98.9         25.1         93.6
13  /u/      99.7         81.4         98.4          99.1         24.7         93.8
14  /v/      99.9         78.6         98.4          99.1         25.6         93.8

References

1. Xiaodong, C., Alwan, A.: Noise robust speech recognition using feature compensation based on polynomial regression of utterance SNR. IEEE Trans. Speech Audio Process. 13(6), 1161–1172 (2005)
2. Haitian, X., Zheng-Hua, T., Dalsgaard, P., Lindberg, B.: Robust speech recognition by nonlocal means denoising processing. IEEE Signal Process. Lett. 15, 701–704 (2008)
3. Zheng-Hua, T., Lindberg, B.: Low-complexity variable frame rate analysis for speech recognition and voice activity detection. IEEE J. Sel. Top. Signal Process. 4(5), 798–807 (2010)
4. Petajan, E.: Automatic lipreading to enhance speech recognition. In: IEEE Global Telecommunications Conference, Atlanta, GA, USA, pp. 265–272. IEEE Computer Society Press, Los Alamitos (1984)
5. Arjunan, S.P., Kumar, D.K., Yau, W.C., Weghorn, H.: Unspoken vowel recognition using facial electromyogram. In: Engineering in Medicine and Biology Society, EMBS'06, 28th Annual International Conference of the IEEE, Aug. 30 2006–Sept. 3 2006, pp. 2191–2194 (2006)
6. Schultz, T., Wand, M.: Modeling coarticulation in EMG-based continuous speech recognition. Speech Commun. 52(4), 341–353 (2010). doi:10.1016/j.specom.2009.12.002
7. Medizinelektronik, C.: (2008). http://www.articulograph.de/
8. Soquet, A., Saerens, M., Lecuit, V.: Complementary cues for speech recognition, pp. 1645–1648 (1999)
9. Potamianos, G., Neti, C., Gravier, G., Garg, A., Senior, A.W.: Recent advances in the automatic recognition of audiovisual speech. Proc. IEEE 91(9), 1306–1326 (2003)
10. Yau, W.C., Kumar, D.K., Arjunan, S.P.: Visual speech recognition using dynamic features and support vector machines. Int. J. Image Graph. 8(3), 419–437 (2008)
11. Xiang, T., Gong, S.: Beyond tracking: modelling activity and understanding behaviour. Int. J. Comput. Vis. 67(1), 21–51 (2006)
12. Meng, H., Pears, N., Bailey, C.: Motion information combination for fast human action recognition. In: Proc. Computer Vision Theory and Applications, Spain, pp. 21–28 (2007)
13. Ma, J., Cole, R., Pellom, B., Ward, W., Wise, B.: Accurate automatic visible speech synthesis of arbitrary 3D models based on concatenation of diviseme motion capture data. Comput. Animat. Virtual Worlds 15(5), 485–500 (2004)
14. Govokhina, O., Bailly, G., Breton, G.: Learning optimal audiovisual phasing for a HMM-based control model for facial animation (2007)
15. Musti, U., Toutios, A., Ouni, S., Colotte, V., Wrobel-Dautcourt, B., Berger, M.O.: HMM-based automatic visual speech segmentation using facial data. In: Proc. of Interspeech 2010, pp. 1401–1404 (2010)
16. Koprinska, I., Carrato, S.: Temporal video segmentation: a survey. Signal Process. Image Commun. 16(5), 477–500 (2001)
17. Da Silveira, L.G., Facon, J., Borges, D.L.: Visual speech recognition: a solution from feature extraction to words classification. In: XVI Brazilian Symposium on Computer Graphics and Image Processing, SIBGRAPI 2003, pp. 399–405 (2003)
18. Luettin, J., Thacker, N.A., Beet, S.W.: Active shape models for visual speech feature extraction. NATO ASI Ser. Comput. Syst. Sci. 150, 383–390 (1996)
19. Otani, K., Hasegawa, T.: The image input microphone—a new nonacoustic speech communication system by media conversion from oral motion images to speech. IEEE J. Sel. Areas Commun. 13(1), 42–48 (1995)
20. Hueber, T., Chollet, G., Denby, B., Stone, M., Zouari, L.: Ouisper: corpus based synthesis driven by articulatory data. In: 16th International Congress of Phonetic Sciences, pp. 2193–2196 (2007)
21. Yuhas, B.P., Goldstein, M.H. Jr., Sejnowski, T.J.: Integration of acoustic and visual speech signals using neural networks. IEEE Commun. Mag. 27(11), 65–71 (1989)
22. Bregler, C., Konig, Y.: Eigenlips for robust speech recognition. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP-94, 19–22 Apr 1994, pp. 669–672 (1994)
23. Silsbee, P.L., Bovik, A.C.: Computer lipreading for improved accuracy in automatic speech recognition. IEEE Trans. Speech Audio Process. 4(5), 337–351 (1996)
24. Potamianos, G., Luettin, J., Neti, C.: Hierarchical discriminant features for audio-visual LVCSR. In: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'01), vol. 161, pp. 165–168 (2001)
25. Zhao, G., Barnard, M., Pietikainen, M.: Lipreading with local spatiotemporal descriptors. IEEE Trans. Multimed. 11(7), 1254–1265 (2009)

26. Yuille, A.L., Hallinan, P.W., Cohen, D.S.: Feature extraction from faces using deformable templates. Int. J. Comput. Vis. 8(2), 99–111 (1992)
27. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: active contour models. Int. J. Comput. Vis. 1(4), 321–331 (1988)
28. Papandreou, G., Katsamanis, A., Pitsikalis, V., Maragos, P.: Adaptive multimodal fusion by uncertainty compensation with application to audiovisual speech recognition. IEEE Trans. Audio Speech Lang. Process. 17(3), 423–435 (2009)
29. Matthews, I., Cootes, T., Bangham, J., Cox, S., Harvey, R.: Extraction of visual features for lipreading. IEEE Trans. Pattern Anal. Mach. Intell. 24(2), 198–213 (2002)
30. Goldschen, A.J., Garcia, O.N., Petajan, E.: Continuous optical automatic speech recognition by lipreading. In: Conference Record of the Twenty-Eighth Asilomar Conference on Signals, Systems and Computers, 31 Oct–2 Nov, pp. 572–577 (1994)
31. Mase, K., Pentland, A.: Automatic lipreading by optical-flow analysis. Syst. Comput. Jpn. 22(6), 67–76 (1991)
32. Iwano, K., Yoshinaga, T., Tamura, S., Furui, S.: Audio-visual speech recognition using lip information extracted from side-face images. EURASIP J. Audio Speech Music Process. 2007(1), 4 (2007)
33. Rajavel, R., Sathidevi, P.S.: A novel algorithm for acoustic and visual classifiers decision fusion in audio-visual speech recognition system. Signal Process. 4(1), 23–37 (2010)
34. Venkatesh Babu, R., Ramakrishnan, K.: Recognition of human actions using motion history information extracted from the compressed video. Image Vis. Comput. 22(8), 597–607 (2004)
35. Weinland, D., Ronfard, R., Boyer, E.: Free viewpoint action recognition using motion history volumes. Comput. Vis. Image Underst. 104(2–3), 249–257 (2006)
36. Valstar, M., Patras, I., Pantic, M.: Facial action unit recognition using temporal templates. In: 13th IEEE International Workshop on Robot and Human Interactive Communication, ROMAN, 20–22 Sept 2004, pp. 253–258 (2004)
37. Ahad, M.: Analysis of motion self-occlusion problem due to motion overwriting for human activity recognition. J. Multimed. 5(1), 36–46 (2010)
38. Young, S., Evermann, G., Kershaw, D., Moore, G., Odell, J., Ollason, D., Valtchev, V., Woodland, P.: The HTK Book, vol. 2. Entropic Cambridge Research Laboratory, Cambridge (1997)
39. Su, Q., Silsbee, P.L.: Robust audiovisual integration using semi-continuous hidden Markov models. In: Proc. International Conference on Spoken Language Processing, Philadelphia, PA, pp. 42–45 (2002)
40. Krone, G., Talk, B., Wichert, A., Palm, G.: Neural architectures for sensor fusion in speech recognition. In: Proc. European Tutorial Workshop on Audio-Visual Speech Processing, Rhodes, Greece, pp. 57–60 (1997)
41. Yau, W., Kumar, D., Arjunan, S.: Voiceless speech recognition using dynamic visual speech features. In: Proc. of HCSNet Workshop on the Use of Vision in HCI, Canberra, Australia, pp. 93–101. Australian Computer Society, Inc., Canberra (2006)
42. Duchnowski, P., Meier, U., Waibel, A.: See me, hear me: integrating automatic speech recognition and lip-reading. In: Proceedings of the International Conference on Spoken Language and Processing, Yokohama, Japan, pp. 547–550. Citeseer, Princeton (1994)
43. Heckmann, M., Berthommier, F., Kroschel, K.: A hybrid ANN/HMM audio-visual speech recognition system. In: International Conference on Auditory-Visual Speech Processing, Aalborg, Denmark, pp. 189–194. Citeseer, Princeton (2001)
44. Yau, W., Kant Kumar, D., Chinnadurai, T.: Lip-reading technique using spatio-temporal templates and support vector machines. In: Progress in Pattern Recognition, Image Analysis and Applications, pp. 610–617 (2008)
45. Ganapathiraju, A., Hamaker, J., Picone, J.: Hybrid SVM/HMM architectures for speech recognition. In: International Conference on Spoken Language Processing, pp. 504–507. Citeseer, Princeton (2000)
46. Gordan, M., Kotropoulos, C., Pitas, I.: A support vector machine-based dynamic network for visual speech recognition applications. EURASIP J. Appl. Signal Process. 2002(1), 1248–1259 (2002)
47. Sun, D., Roth, S., Lewis, J., Black, M.: Learning optical flow. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) Computer Vision—ECCV 2008. Lecture Notes in Computer Science, vol. 5304, pp. 83–97. Springer, Berlin (2008)
48. Otsu, N.: A threshold selection method from gray-level histograms. IEEE Trans. Syst. Man Cybern. 9(1), 62–66 (1979)
49. Khotanzad, A., Hong, Y.H.: Invariant image recognition by Zernike moments. IEEE Trans. Pattern Anal. Mach. Intell. 12(5), 489–497 (1990)
50. Teh, C.H., Chin, R.T.: On image analysis by the methods of moments. IEEE Trans. Pattern Anal. Mach. Intell. 10(4), 496–513 (1988)
51. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines (2001)
52. Yau, W.C.: Video Analysis of Mouth Movement Using Motion Templates for Computer-Based Lip-Reading. RMIT University, Melbourne (2008)

Ayaz A. Shaikh obtained his Bachelor of Engineering (Electronics) in 1999 and M.Phil (Telecommunication) in 2008 from the Mehran University of Engineering and Technology (MUET), Jamshoro, Pakistan. He received his PhD degree from the School of Electrical and Computer Engineering, RMIT University, Melbourne, Australia in 2012. He has been a faculty member of MUET since 2003 (currently on leave). Dr. Shaikh has published a couple of conference and journal papers. He has been working as a sessional academic at RMIT University from 2009 to date. His current research interests include computer vision, image and video processing, pattern recognition and digital signal processing. He is a member of IEEE and the IEEE Computer Society.

Dinesh K. Kumar is a Professor of Biosignals at RMIT University and has extensive work in developing signal classification techniques for biosignals, audio and video data. He has developed bio-inspired artificial intelligence based systems that have been patented: (i) 'audio identification' (A2004901712), (ii) 'video analysis tools' (A2004901711), and (iii) 'information in sound' (USPO HH 2005). Dinesh's expertise is in identifying events from noisy signal recordings and in separating artefacts from the targeted signals. He has published over 175 refereed papers, with 96 papers and seven patent applications in the last 5 years. Dinesh is an award winning research student supervisor, having been selected as the Best Supervisor for 2005 and 2007. He has also been awarded an Expert Professorial Scholarship by the Federal government of Brazil with a Conselho Nacional de Pesquisa e Desenvolvimento Grant towards providing his expertise for Biosignals (2008). His work in Biosignals has been recognised by the European Union, where he has been awarded the Erasmus Mundus Professorial Scholarship (2009).

Jayavardhana Gubbi received the Bachelor of Engineering degree from Bangalore University, Bengaluru, India, in 2000, and the Ph.D. degree from the University of Melbourne, Melbourne, Vic., Australia, in 2007. For three years, he was a Research Assistant at the Indian Institute of Science, where he was engaged in speech technology for Indian languages. Dr. Gubbi is a Research Fellow in the Department of Electrical and Electronic Engineering at The University of Melbourne. Currently, from 2010 to 2014, he is an ARC Australian Postdoctoral Fellow Industry (APDI) working on an industry linkage grant in video processing. His current research interests include Video Processing, Internet of Things and Ubiquitous healthcare devices. He has coauthored more than 40 papers in peer reviewed journals, conferences, and book chapters over the last ten years. Dr. Gubbi has served as conference secretary and publications chair in several international conferences in the area of wireless sensor networks, signal processing and pattern recognition.