Vis Comput DOI 10.1007/s00371-012-0751-7

ORIGINAL ARTICLE

Automatic visual speech segmentation and recognition using directional motion history images and Zernike moments

Ayaz A. Shaikh · Dinesh K. Kumar · Jayavardhana Gubbi

© Springer-Verlag 2012

Abstract  An appearance-based visual speech recognition technique using only video signals is presented. The proposed technique is based on the use of directional motion history images (DMHIs), which is an extension of the popular optical-flow method for object tracking. Zernike moments of each DMHI are computed in order to perform the classification. The technique incorporates automatic temporal segmentation of isolated utterances. The segmentation of isolated utterances is achieved using pair-wise pixel comparison. A support vector machine is used for classification and the results are based on the leave-one-out paradigm. Experimental results show that the proposed technique achieves better performance in viseme recognition than others reported in the literature. The benefit of the proposed visual speech recognition method is that it is suitable for real-time applications due to the quick motion tracking system and the fast classification method employed. It has applications in command and control using lip-movement-to-text conversion, and can be used in noisy environments and also for assisting speech impaired persons.

Keywords  Motion analysis · Temporal segmentation · Directional motion history image · Optical flow · Zernike moments

A.A. Shaikh (✉) · D.K. Kumar
School of Electrical and Computer Engineering and Health Innovations Research Institute, RMIT University, Melbourne, Vic 3001, Australia
e-mail: [email protected]

D.K. Kumar
e-mail: [email protected]

J. Gubbi
ISSNIP, Dept of Electrical and Electronic Engineering, The University of Melbourne, Melbourne, Vic 3010, Australia
e-mail: [email protected]

1 Introduction

Human speech perception is greatly improved by seeing a speaker's lip movements as well as listening to the voice. However, mainstream automatic speech recognition (ASR) has focused almost exclusively on the latter: the acoustic signal. Recent advancements have led to purely acoustic-based ASR systems yielding excellent results in quiet or noiseless environments. However, the recognition error increases considerably in the real world due to the existence of environmental noise. Noise robust algorithms such as feature compensation [1], the nonlocal means denoising method [2], variable frame rate analysis [3], etc. have presented significant improvement in speech recognition under noisy environments; however, such algorithms are not entirely immune to noise. To overcome this limitation, non-audio speech modalities have been considered to augment acoustic information [4]. A number of options with non-audio speech modalities have been proposed, such as visual, mechanical and muscle activity-based sensing of the facial movement [5, 6], facial plethysmogram, Electromagnetic Articulography (EMA) to capture the movement of fixed points on the articulators [7] and measuring intra-oral pressure [8]. However, these systems require sensors to be placed on the face of the person and are thus intrusive and impractical in most situations. Speech recognition based on the visual speech signal is the least intrusive [9], non-constraining and noise robust. Systems that recognize speech from the shape and movement of the speech articulators such as the lips, tongue and teeth of the speaker have been considered to overcome the shortcomings of acoustic speech recognition [10], and these systems are known as Visual Speech Recognition (VSR) systems. The hardware for such a system can be as simple as a webcam or a camera built into a mobile phone.

In the past 30 years, various techniques have been proposed for visual speech classification.
One technique that has been proposed is based on the motion history image (MHI) [10]. The MHI is an appearance-based template-matching approach represented by a single image (temporal template), generated by taking binary difference images between successive motion image frames and superposing them so that older frames have smaller weights. The advantage of the temporal template-based method is that the continuous image sequence is condensed into a gray scale image while the dominant motion information is preserved. Therefore, it can represent a motion sequence in a better and more compact manner. Moreover, this approach is also less sensitive to silhouette noise such as holes, shadows and missing parts. The MHI method is inexpensive to compute because it keeps a history of temporal changes at each pixel location [11]. However, the latest motion template overwrites older motion templates, causing self-occlusion [12], resulting in inaccurate lip motion description, and therefore it may cause inexact viseme recognition. The other common shortcomings of the existing techniques are that they depend on manual segmentation to identify the start and the end frames of an utterance from the video sequence and are sensitive to the speaking speed. There is a need for automatic segmentation of the visual data to separate the individual utterances without human intervention. Earlier work where automatic segmentation was performed has typically considered the combination of the audio and visual data, and thus the temporal speech segmentation in AVSR systems is based on audio signals [13–15].

In this research, the issue of occlusion has been overcome by the use of Directional MHIs (DMHIs) based on optical-flow computation. Instead of a single MHI image, the DMHI-based technique represents four directions of motion, i.e., up, down, left and right. Automatic temporal segmentation is achieved by an ad hoc method known as the pair-wise pixel comparison method [16]. The system is made insensitive to the speed of speaking over two stages: firstly, at the time of the optical-flow computation, similar subsequent images of the video containing no difference (zero difference) in energy are excluded from the optical-flow computation; secondly, each optical-flow image is normalized before computing the DMHIs. As a result, the proposed system is suitable for subject independent use, i.e., for people with varying speaking speeds.

This article is organized as follows: Sect. 2 covers related work on audiovisual speech recognition and visual-only speech recognition based on shape parameters such as mouth height and width, global image intensity, and lip motion. In Sect. 3, we present the detailed method for the VSR system. Section 4 discusses the experimental results and analysis after presenting the temporal segmentation method. Finally, Sect. 5 concludes the article with some strategies for future work.

2 Related work

This section presents some related research on AudioVisual Speech Recognition (AVSR) and VSR. VSR involves the process of interpreting the visual information contained in a video in order to extract the information necessary to establish communication at a perceptual level between humans and computers. Potamianos et al. [9] have presented a detailed analysis of audiovisual speech recognition approaches, their progress and challenges. Generally, the systems reported in the literature are concerned with advancing theoretical solutions to various subtasks associated with the development of AVSR systems. There are very few papers that have considered the complete system. The major trends in the development of AVSR can be divided into the following categories: audio feature extraction, visual feature extraction, temporal segmentation, audiovisual feature fusion and classification of the features to identify the utterance. The proposed visual-only system does not have any audio data, and hence audio feature extraction and its fusion with visual features are not related to this work.

The visual feature extraction techniques that have been applied in the development of VSR systems can be categorized into: shape-based (geometric), intensity/image-based and motion-based. The first automatic visual speech recognition system, reported by Petajan [4] in 1984, was based on shape-based features such as mouth height and width extracted from binary images. In general, the shape-based feature extraction techniques attempt to identify the lips in the image, based either on geometrical templates that encode a standard set of mouth shapes [17] or on the application of active contours [18]. In [19], a system called the "image-input microphone" determined the mouth width and height from the video and derived the corresponding vocal-tract transfer function used to synthesize the speech. In another approach, the researchers [20] focused on visualizing the most important articulator of speech, the tongue: they placed an ultrasound probe beneath the chin, along with a video camera focused on the speaker's lips, to compute the visual features for a speech synthesizer. Since these approaches require extensive training and complex algorithms for marking the lip contours, and placing the ultrasound probe is impractical, they are unsuited to real-time systems.

The other approaches for visual feature extraction are based on image intensity, which is highly subjective and ineffective in most real-time situations. In the image-based or appearance-based approach, researchers have used the pixel values representing the mouth area as the feature vector either directly [21], or after some feature reduction technique such as principal components analysis [22], vector quantization [23], or linear discriminant analysis projection and maximum likelihood linear transform feature rotation [24]. Potamianos et al. [9] reported their recognition results in terms of Word Error Rate (WER), achieving a WER of 35 % with visual-only features on an office-Digit dataset in the speaker independent scenario. Zhao et al. [25] introduced local spatiotemporal descriptors, instead of global parameters, to recognize isolated spoken phrases based solely on visual input, obtaining a speaker independent recognition rate of approximately 62 % and a speaker dependent result of around 70 %. The advantage of the image-based approach is that no information is lost. A common criticism of this approach is that it tends to be very sensitive to changes in illumination, position, camera distance, rotation and speaker [23].

In more standard approaches, model-based methods are considered in which a geometric model of the lip contour is applied. Typical examples are deformable templates [26], snakes [27], active shape models (ASM) [18], active appearance models (AAM) [28] and multiscale spatial analysis (MSA). All of these approaches were presented by Matthews et al. [29] to extract the visual features. However, these techniques are computationally expensive and require accurate labeling of the lip contours in the training data to create the lip models before being used for feature extraction. This limits the performance of such techniques when applied to real-time systems.

In contrast to the image-based and model-based approaches, others aim at explicitly extracting relevant visual speech features based on motion analysis. For example, in [30] oral cavity features including width, height, area, perimeter and their ratios and derivatives were used as input for the recognizer and achieved a 25 % recognition rate for a group of sentences. In [31], descriptors of the mouth derived from optical-flow data were used as visual features to recognize the connected digits (0–9). In [32], lip contour geometric features (LCGFs) and lip motion velocity features (LMVFs) of side-face images are calculated. The technique achieved digit recognition errors of 24 % using a visual-only method with LCGF, 21.9 % with LMVF and 26 % with the combined LCGF and LMVF. Other motion-based techniques based on MHI have been reported in [10] and [33]. Several variants of the MHI method have been proposed to improve some of its constraints, and these have been used in several applications. MHI methods enhanced to the Motion Flow History [34], the Pixel Change History [11], the Intra-Motion History Image based on front-MHI and rear-MHI, etc. have been proposed for 2D motion recognition. In the 3D paradigm, view-invariant Motion History Volumes or 3D History Models are proposed by [35]. One of the key constraints of the MHI method is its motion overwriting problem due to self-occlusion, which happens when the motion is repeated in the same location at different times within the utterance [12, 36]. The Multilevel MHI (MMHI) [36] and Hierarchical Motion History Histogram (HMHH) [12] approaches are proposed along with the DMHI method to solve this overwriting problem. However, the DMHI method, which is the combination of optical flow and MHI, has outperformed these variants in solving the overwriting problem [37].

Once the features of the visual data have been extracted, these have to be classified. The most widely used classifier for AVSR systems is the HMM, because it models the changes of the states and is a very popular method for traditional audio-only speech recognition [38]. Various variants of HMMs have also been used for audiovisual ASR, such as HMMs with non-Gaussian continuous observation probabilities [39]. Moreover, additional methods to overcome the difference in the speed of speaking for classification have been employed in audiovisual ASR systems: dynamic time warping (DTW), used by Petajan [4], is computationally expensive and inaccurate, while other classifiers that allow the difference among speakers to be considered for classifying the visual data have used artificial neural networks (ANN) [40, 41], hybrid ANN-DTW systems [42], hybrid ANN-HMM [43] and, recently, support vector machines (SVM) [44]. SVM is based on the structural risk minimization principle, in contrast to the empirical risk minimization on which many classifiers are based. Ganapathiraju [45] reports very good results on audio speech recognition with a hybrid SVM-HMM system. On the visual part, Gordan et al. [46] obtained a high recognition rate using simple visual features, showing the suitability of SVMs for visual speech recognition.

This paper has employed SVM because of the non-temporal type of the ZM features and considering the ability of SVM to find a globally optimum decision function to separate the different classes of data; the larger the separating distance, the higher the generalization power will be. While the use of a separating hyper-plane makes SVM suitable as a binary classifier, groups of SVMs can solve multi-class problems such as the classification of utterances.

The main goal of the work reported in this paper was to develop and test a VSR system that classifies the utterance based on visual data alone, and performs automatic temporal segmentation of the visual data without any audio cues. This research needed to resolve some of the issues that have not been overcome till now, such as the self-occlusion and overwriting that affect MHI-based systems, automatic segmentation of the video data, and the need to compensate for the variation in the speed of speaking.

3 Methodology

In this section, we briefly discuss the data set used, and the overall approach using DMHIs based on optical flow.

3.1 Dataset

In our experiments, 14 visemes of the English language are considered; the visemes are defined in the Facial Animation Parameters (FAP) of the MPEG-4 standard. The dataset used in this study was recorded by Yau et al. [10] in a typical office environment. An inexpensive web camera was focused on the mouth region of the speaker and was fixed throughout the experiments. Factors such as window size (240 × 320 pixels), view angle of the camera, background and illumination were kept constant for each speaker. Seven volunteers (four males and three females) participated in the experiments, with each speaker recording 14 visemes at a sampling rate of 30 frames/second. This was repeated 10 times to generate sufficient variability.¹

¹ The experimental procedure was approved by the Human Experiments Ethics Committee of RMIT University (ASEHAPP 12-10).

3.2 Computation of DMHIs

The Motion History Image (MHI) can be used to describe the direction of motion in an image sequence. The intensity of each pixel in the MHI is a function of the motion density at that location, and therefore the temporal difference of these pixel values makes the MHI a temporal template. One of the advantages of the MHI representation is that a range of times, from frame to frame up to several seconds, may be encoded in a single frame. So, the MHI can span the time scale of human visual speech. The MHI Hτ(x, y, t) can be computed from an update function Ψ(x, y, t), where pixels are brighter where there is recent movement and darker where the movements are older:

$$H_\tau(x,y,t) = \begin{cases} \tau & \text{if } \Psi(x,y,t) = 1 \\ \max\bigl(0,\, H_\tau(x,y,t-1) - \delta\bigr) & \text{otherwise} \end{cases} \quad (1)$$

where x, y and t show the position and time, Ψ(x, y, t) signals object presence (or motion) in the current video image, τ decides the temporal duration of the MHI, and δ is the decay parameter. Some possible image processing techniques to define Ψ(x, y, t) are background subtraction or image differencing. Usually, the MHI is generated from a binarized image, obtained from frame subtraction [10], using a predefined threshold value ℘ to obtain a motion or no-motion classification:

$$\Psi_B(x,y,t) = \begin{cases} 1 & \text{if } \mathit{Diff}(x,y,t) \ge \wp \\ 0 & \text{otherwise} \end{cases} \quad (2)$$

where Diff(x, y, t), with difference distance Δ, is as follows:

$$\mathit{Diff}(x,y,t) = \bigl| I(x,y,t) - I(x,y,t \pm \Delta) \bigr| \quad (3)$$

where I(x, y, t) is the intensity value of the pixel at coordinate (x, y) in the t-th frame of the image sequence.
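As an illustration of Eqs. (1)–(3), the following is a minimal NumPy sketch of the classical frame-differencing MHI update. The function and parameter names (`update_mhi`, `tau`, `delta`, `wp`) are ours and not from the paper, and the parameter values are only placeholders.

```python
import numpy as np

def update_mhi(mhi, prev_frame, curr_frame, tau=255, delta=10, wp=25):
    """One step of the frame-differencing MHI update of Eqs. (1)-(3).

    mhi, prev_frame, curr_frame: 2-D arrays of equal size (gray scale).
    tau: value written where motion is detected (temporal duration).
    delta: decay subtracted where no motion is detected.
    wp: threshold on the absolute frame difference (Eq. (2)).
    """
    diff = np.abs(curr_frame.astype(np.float32) -
                  prev_frame.astype(np.float32))           # Eq. (3), with Delta = 1 frame
    motion = diff >= wp                                     # Eq. (2): binarized update function
    mhi = np.where(motion, float(tau),
                   np.maximum(0.0, mhi - delta))            # Eq. (1)
    return mhi

# Usage on a list of gray-scale mouth-region frames (e.g. 240x320):
# mhi = np.zeros_like(frames[0], dtype=np.float32)
# for prev, curr in zip(frames[:-1], frames[1:]):
#     mhi = update_mhi(mhi, prev, curr)
```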

Instead of the frame subtraction method to calculate the update function presented in Eq. (1), this research employs the probabilistic model of optical flow developed by Sun et al. [47] to compute the DMHIs. Optical flow is a measure of the visually apparent motion of objects between two images and measures the spatiotemporal variations of video data. The word apparent implies that the optical flow does not consider the movement of the objects in the real 3D space, but the motion in the image space. It is observed that there is inter- and intra-subject variation in speech speed. This variation in speed can give rise to different perceptual impressions and can cause inexact viseme recognition. To compensate for the variation in the speed of speaking, similar subsequent frames which result in zero energy difference between frames are filtered out using the mean square error (MSE) given in Eq. (4). However, due to the effect of environmental noise on the video recording, some energy differences are expected, and a threshold value should therefore be used to determine whether or not to calculate the optical flow. Removing similar sequential images has the additional advantage of reducing the computational load while calculating the optical flow.

$$\mathrm{MSE} = \frac{1}{m \times n} \sum_{x=1}^{m} \sum_{y=1}^{n} \bigl[ I_t(x,y) - I_{t-1}(x,y) \bigr]^2 \quad (4)$$

The optical-flow computation on consecutive frames (denoted by Ψ(x, y, t)) provides the horizontal and vertical components of the flow, Ψx and Ψy. These components are then half-wave rectified into four non-negative separate channels Ψx+, Ψy+, Ψx− and Ψy−, constrained such that Ψx = Ψx+ − Ψx− and Ψy = Ψy+ − Ψy−. Figure 1 depicts the flow diagram of this flow vector computation.

Fig. 1 Conceptual framework of optical-flow vector separation into four directions

Based on the four directions, each optical-flow image is normalized according to the threshold value ξ, where ξ is computed according to Otsu's [48] global threshold method. Based on these normalized image sequences, four separate optical-flow motion history templates are created after deriving the four optical-flow components:

$$H_\tau^{x-}(x,y,t) = \begin{cases} \tau & \text{if } \Psi_x^-(x,y,t) > \xi \\ \max\bigl(0,\, H_\tau^{x-}(x,y,t-1) - \delta\bigr) & \text{otherwise} \end{cases}$$

$$H_\tau^{x+}(x,y,t) = \begin{cases} \tau & \text{if } \Psi_x^+(x,y,t) > \xi \\ \max\bigl(0,\, H_\tau^{x+}(x,y,t-1) - \delta\bigr) & \text{otherwise} \end{cases}$$

$$H_\tau^{y-}(x,y,t) = \begin{cases} \tau & \text{if } \Psi_y^-(x,y,t) > \xi \\ \max\bigl(0,\, H_\tau^{y-}(x,y,t-1) - \delta\bigr) & \text{otherwise} \end{cases}$$

$$H_\tau^{y+}(x,y,t) = \begin{cases} \tau & \text{if } \Psi_y^+(x,y,t) > \xi \\ \max\bigl(0,\, H_\tau^{y+}(x,y,t-1) - \delta\bigr) & \text{otherwise} \end{cases} \quad (5)$$

For the positive and negative horizontal directions, $H_\tau^{x+}(x,y,t)$ and $H_\tau^{x-}(x,y,t)$ are set up as motion history templates; $H_\tau^{y+}(x,y,t)$ and $H_\tau^{y-}(x,y,t)$ represent the positive and negative vertical directions (up and down, respectively). Feature vectors are computed from these four history templates by employing Zernike moments for classification and recognition.
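The following sketch shows how the four directional templates of Eq. (5), together with the MSE frame filter of Eq. (4) and the Otsu threshold [48], could be computed. The paper uses the learned optical-flow model of Sun et al. [47]; here OpenCV's Farneback dense flow is substituted purely to keep the example self-contained, and the function name and parameter values are ours.

```python
import cv2
import numpy as np

def directional_mhis(frames, tau=255.0, delta=10.0, mse_thresh=1.0):
    """Sketch of DMHI computation (Eq. (5)) from a list of gray-scale frames.

    Farneback flow stands in for the Sun et al. [47] model used in the paper;
    tau, delta and mse_thresh are illustrative values.
    """
    h, w = frames[0].shape
    # Four history templates: x-, x+, y-, y+ (left, right, up, down).
    H = {k: np.zeros((h, w), np.float32) for k in ("x-", "x+", "y-", "y+")}
    prev = frames[0]
    for curr in frames[1:]:
        # Skip near-identical frames (Eq. (4)) to compensate for speaking speed.
        mse = np.mean((curr.astype(np.float32) - prev.astype(np.float32)) ** 2)
        if mse < mse_thresh:
            continue
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        fx, fy = flow[..., 0], flow[..., 1]
        # Half-wave rectification into four non-negative channels.
        chans = {"x+": np.maximum(fx, 0), "x-": np.maximum(-fx, 0),
                 "y+": np.maximum(fy, 0), "y-": np.maximum(-fy, 0)}
        for k, c in chans.items():
            # Otsu's method [48] expects 8-bit input; normalize each channel first.
            c8 = cv2.normalize(c, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
            xi, _ = cv2.threshold(c8, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
            # Eq. (5): write tau where the flow exceeds xi, otherwise decay by delta.
            H[k] = np.where(c8 > xi, tau, np.maximum(0.0, H[k] - delta))
        prev = curr
    return H
```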

Figure 2 shows the complete system flow diagram of the proposed visual speech recognition system based on directional motion history images (DMHIs) and the SVM classifier. Once the temporal segmentation (described in Sect. 4) of an isolated utterance is performed, the cropped video containing a viseme is fed to the system for optical-flow-based DMHI computation as described above.

Fig. 2 Flow chart for the visual speech recognition (L: Left, R: Right, U: Up, D: Down)

3.3 Feature extraction using Zernike moments

In order to classify an image from a large dataset, the image features need to be invariant to scale and rotation. Features should have sufficient discriminating power and noise immunity for retrieval from the large image dataset. Zernike Moments (ZMs) are image moments or features having the desired properties such as rotation invariance, robustness to noise, expression efficiency and multilevel representation for describing the shapes of patterns [49]. ZMs have been demonstrated to outperform other image moments, such as geometric, Legendre and complex moments, in terms of sensitivity to image noise, information redundancy and capability for image representation [50]. Our proposed method uses ZMs as visual speech features to represent the approximate image of the DMHI. Before computing the visual speech features, the DMHIs are resized using bicubic interpolation from the rectangular size of 240 × 320 pixels to 240 × 240 pixels, so that no precision is lost and the DMHIs become square images.

Zernike moments are computed by projecting each DMHI image function f(x, y) onto the orthogonal Zernike polynomial Vnl of order n with repetition l, defined within a unit circle. The center of the image is taken as the origin and the pixel coordinates are mapped to the range of the unit circle (i.e. x² + y² ≤ 1) as follows:

$$V_{nl}(\rho,\theta) = R_{nl}(\rho)\, e^{-jl\theta}; \qquad j = \sqrt{-1} \quad (6)$$

where Rnl is the real-valued radial polynomial, given by

$$R_{nl}(\rho) = \sum_{k=0}^{\frac{n-|l|}{2}} (-1)^{k}\, \frac{(n-k)!}{k!\,\bigl(\frac{n+|l|}{2}-k\bigr)!\,\bigl(\frac{n-|l|}{2}-k\bigr)!}\; \rho^{\,n-2k} \quad (7)$$

with |l| ≤ n and (n − |l|) even. The main advantage of this approach is the simple rotational property of the features [49]. Zernike moments are also independent features due to the orthogonality of the Zernike polynomial Vnl [50]. The Zernike moment Znl of order n with repetition l is given by

$$Z_{nl} = \frac{n+1}{\pi} \int_{0}^{2\pi}\!\!\int_{0}^{1} V_{nl}^{*}(\rho,\theta)\, f(\rho,\theta)\, \rho\, d\rho\, d\theta \quad (8)$$

where f(ρ, θ) is the intensity distribution of the approximate image of the DMHI mapped to a unit circle of radius ρ and angle θ, where x = ρ cos θ and y = ρ sin θ.
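A minimal sketch of the ZM feature extraction is given below, assuming the four DMHIs from the earlier sketch. The `mahotas` library's `zernike_moments` is used here as a stand-in for Eq. (8); with degree 14 it returns the 64 moment magnitudes per image described in the text, giving the 256-dimensional vector. The helper name and the disc radius are our choices.

```python
import numpy as np
import cv2
import mahotas  # provides mahotas.features.zernike_moments

def dmhi_feature_vector(dmhis, degree=14):
    """Build the 256-dimensional feature vector from the four DMHIs.

    dmhis: dict with keys "x-", "x+", "y-", "y+" (left, right, up, down).
    Uses |Z_nl| magnitudes (rotation invariant, Eq. (10)); with degree=14,
    mahotas returns 64 moments per image, i.e. 4 x 64 = 256 features.
    """
    feats = []
    for key in ("x-", "x+", "y-", "y+"):
        img = dmhis[key]
        # Resize 240x320 -> 240x240 with bicubic interpolation (square image).
        sq = cv2.resize(img, (240, 240), interpolation=cv2.INTER_CUBIC)
        # Zernike moment magnitudes over a disc covering the square image.
        zm = mahotas.features.zernike_moments(sq, radius=120, degree=degree)
        feats.append(zm)
    return np.concatenate(feats)   # shape (256,)
```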

Figure 3 shows the square-to-circular transformation performed for the computation of the ZMs, which transforms the square image function f(x, y) in terms of the x–y axes to a circular image function f(ρ, θ) in terms of the i–j axes.

Fig. 3 Square-to-circular transformation of the DMHI image

To illustrate the rotational characteristics of ZMs, consider β as the angle of rotation of the image. The resulting rotated ZM Z′nl is

$$Z'_{nl} = Z_{nl}\, e^{-jl\beta} \quad (9)$$

where Znl is the Zernike moment of the original image. Equation (9) demonstrates that rotation of an image results in a phase shift of the Zernike moments [49]. The absolute values of the Zernike moments are therefore rotation invariant [49], as shown in Eq. (10):

$$\bigl|Z'_{nl}\bigr| = \bigl|Z_{nl}\bigr| \quad (10)$$

This paper uses the absolute values of the Zernike moments, |Z′nl|, as the rotation invariant features of the DMHI. An optimum number of Zernike moments needs to be selected to trade off between the dimensionality of the feature vectors and the amount of information represented by the features. 64 Zernike moments, comprising the 0th order moments up to the 14th order moments, have been used as features to represent the approximate images of the DMHIs of each viseme; the number of Zernike moments required for representing the DMHIs was determined empirically.

In the DMHI method, we have four history components. Considering 64 Zernike moments, comprising the 0th order moments up to the 14th order moments, for each of the DMHIs (up, down, left, right), we compute a 256-dimensional feature vector to represent each utterance.

3.4 Support vector machine classifier

This work required the classification of 14 viseme classes using the 256 extracted features. The SVM classifier is able to find the optimal hyper-plane that separates clusters of vectors in such a way that the classes with one category of the features are on one side of the plane and the classes with the other category of the features are on the other side of the plane. The vectors near the hyper-plane are the support vectors. SVM is based on the structural risk minimization principle, in contrast to the empirical risk minimization on which many classifiers are based. An SVM with a radial basis function (RBF) kernel was employed to classify the 256 features. The kernel width parameter gamma = 0.25 and the error penalty parameter C = 2 were optimized by iterative experiments (grid search). The implementation was carried out using the libSVM library [51].

To evaluate the proposed classification method, we used the measures of accuracy, sensitivity and specificity, defined as

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \times 100\ \% \quad (11)$$

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} \times 100\ \% \quad (12)$$

$$\mathrm{Specificity} = \frac{TN}{FP + TN} \times 100\ \% \quad (13)$$

where TP = True Positive, TN = True Negative, FP = False Positive, and FN = False Negative.

Leave-one-out cross-validation was performed to assess the performance of the proposed algorithm. This means that out of N samples from each of the 14 classes of all subjects, N − 1 of them are used to train the classifier and the remaining one to test it. This process is repeated 10 times for each class (i.e., ten-fold cross validation), each time leaving a different sample out.
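A minimal sketch of the classification and evaluation step is shown below. scikit-learn's SVC (which wraps libSVM [51]) is used with the gamma and C values quoted above; the per-class leave-one-out protocol of the paper is simplified here to plain leave-one-out over all samples, and the function name is ours.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut
from sklearn.metrics import confusion_matrix

def evaluate_visemes(X, y):
    """X: (n_samples, 256) ZM feature matrix; y: viseme labels (14 classes)."""
    y_true, y_pred = [], []
    for train_idx, test_idx in LeaveOneOut().split(X):
        clf = SVC(kernel="rbf", gamma=0.25, C=2.0)   # values reported in the paper
        clf.fit(X[train_idx], y[train_idx])
        y_true.append(y[test_idx][0])
        y_pred.append(clf.predict(X[test_idx])[0])

    # Per-class accuracy, sensitivity and specificity (Eqs. (11)-(13)),
    # treating each viseme as the "positive" class in turn.
    cm = confusion_matrix(y_true, y_pred)
    results = {}
    for i, label in enumerate(np.unique(y)):
        tp = cm[i, i]
        fn = cm[i, :].sum() - tp
        fp = cm[:, i].sum() - tp
        tn = cm.sum() - tp - fn - fp
        results[label] = {
            "accuracy": 100.0 * (tp + tn) / (tp + tn + fp + fn),
            "sensitivity": 100.0 * tp / (tp + fn),
            "specificity": 100.0 * tn / (fp + tn),
        }
    return results
```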
4 Temporal viseme segmentation

Temporal segmentation of an isolated viseme or utterance from a continuous video of repeated utterances is important for visual speech recognition. Its purpose is to identify the start and the end frames of an utterance in the sequence of utterances. To segment sequential utterances into single visemes, we used an ad hoc method of temporal segmentation based on pair-wise pixel comparison [16] of consecutive images for 14 different mouth movements/activities.

Figure 4 shows the steps followed in the ad hoc temporal segmentation method. Figure 4a indicates the squared mean difference of the gray scale intensities of corresponding pixels in accumulative frames (only three visemes are shown for clarity). The results of moving-average-window smoothing and further Gaussian smoothing are shown in Fig. 4b and 4c; finally, selecting an appropriate threshold value of 0.09 leads to the required temporal segmentation. The result of the temporal segmentation (as unit step-pulse-shaped representations, i.e., two pulses represent an utterance) is shown in Fig. 4d. It is clear from the figure that the ad hoc scheme followed is highly effective in viseme segmentation.
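A minimal sketch of this pipeline is given below. The threshold of 0.09 and the smoothing steps follow the text; the window length, Gaussian sigma and the [0, 1] frame scaling are not specified in the paper and are assumptions, as are the function and parameter names.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def segment_utterances(frames, window=15, sigma=5, threshold=0.09):
    """Pair-wise pixel comparison segmentation sketch (cf. Fig. 4).

    frames: list of gray-scale frames scaled to [0, 1].
    Returns a list of (start_frame, end_frame) pairs.
    """
    # Fig. 4a: squared mean difference of consecutive frames.
    d = np.array([np.mean((frames[i + 1] - frames[i]) ** 2)
                  for i in range(len(frames) - 1)])
    # Fig. 4b: moving-average-window smoothing.
    d = np.convolve(d, np.ones(window) / window, mode="same")
    # Fig. 4c: further Gaussian smoothing.
    d = gaussian_filter1d(d, sigma)
    # Fig. 4d: threshold to obtain an activity mask, then extract segments.
    active = d > threshold
    segments, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i
        elif not a and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(active)))
    return segments
```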

Fig. 4 Results of temporal segmentation. (a) Squared mean difference of accumulative frames' gray scale intensities. (b) Result of smoothing data by moving average window. (c) Result of further smoothing by Gaussian filtering. (d) Segmented data (blue blocks indicate starting and ending points)

5 Results and discussions

Figure 5 shows the results of the temporal segmentation of all 14 visemes for one subject. The results comparing the automatic and manual segmentation of this subject for the first three epochs are tabulated in Table 1. The results for the other subjects were very similar, and the results from all the subjects and all the visemes in terms of start frame error rate and end frame error rate have been calculated using

$$\frac{1}{10} \sum_{i=1}^{10} \bigl| \mathit{Start\_Frame}_{\mathrm{manual}\_i} - \mathit{Start\_Frame}_{\mathrm{auto}\_i} \bigr| \quad (14)$$

$$\frac{1}{10} \sum_{i=1}^{10} \bigl| \mathit{End\_Frame}_{\mathrm{manual}\_i} - \mathit{End\_Frame}_{\mathrm{auto}\_i} \bigr| \quad (15)$$

and tabulated in Tables 2 and 3. From Tables 2 and 3, the average error between the automatic and manual segmentation for all subjects and all utterances is 2.98 frames, which is around 1.5 frames on either side of an utterance. It can also be seen from Tables 2 and 3 that subjects 6 and 7 have comparatively higher frame error rates; by manual observation it was found that both subjects had comparatively larger head movements in the videos during the utterances and had darker skin tones.

A variety of feature extraction and classification algorithms have been suggested in the lip-reading literature, and it is quite difficult to compare the results, as they are rarely tested on a common audiovisual database. However, it is observed from Table 4 that the average accuracy of 98 % based on the DMHI technique is the best compared to the state-of-the-art techniques presented in [9, 25, 28, 31, 32] in the visual-only scenario.

Table 4 shows the average accuracy of identifying the visemes for all seven subjects for 14 different visemes using DMHIs and MHI (for comparison). The results indicate that DMHIs outperformed MHI in identifying the utterance on all accounts, as it can address the overwriting problem significantly. While the average accuracy (98 % and 93.66 %) and specificity (99.7 % and 99 %) of the two techniques were comparable, the average sensitivity of DMHI was much better than that of MHI, with the sensitivity of DMHIs being 75.7 % while that of MHI was 24.4 %.
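As a small worked illustration of Eqs. (14) and (15), which produce the per-viseme values in Tables 2 and 3, the following sketch computes the average absolute frame error; the function name is ours and the example uses only the first three /a/ epochs from Table 1.

```python
import numpy as np

def frame_error(manual, auto):
    """Average absolute frame error over the recorded epochs (Eqs. (14)-(15)).

    manual, auto: arrays of start (or end) frame indices obtained by manual
    annotation and by the automatic segmentation, respectively.
    """
    return float(np.mean(np.abs(np.asarray(manual, float) -
                                np.asarray(auto, float))))

# Example with the /a/ start frames of Table 1 (first three epochs only):
# frame_error([24, 76, 128], [27, 75, 128])  -> 1.33
```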

Fig. 5 Results of temporal segmentation of all 14 visemes for a single user using the proposed method

Fig. 5 (Continued)

Thus, from the results, it is evident that the DMHIs outperformed MHI in recognizing the lip movements for the different visemes.

The results indicate that the proposed method using DMHI is more sensitive in recognizing the correct viseme and leads to fewer false negatives. The proposed method is based on advanced optical-flow analysis [47], in which a standard incremental multi-resolution technique is used to estimate flow fields with large displacements. The optical flow estimated at a coarse level is used to warp the second image toward the first at the next finer level, and a flow increment is calculated between the first image and the warped second image. In building the pyramid, each level is recursively down-sampled from its nearest lower level. The method is robust against lighting changes. The direction of motion of the lips is an important feature, and it is provided by the optical-flow-based DMHIs. Contrary to the DMHI, the standard MHI is a gray scale representation of the differences of successive binary images of a video. By representing the ZMs of only the MHI as features, the information about their direction is lost, which is critical in visual speech recognition. Hence, comparing the sensitivity values in Table 4 suggests that the ZMs of DMHIs are successful in representing the lip movement, vindicating our hypothesis. The sensitivity and the unique property of rotational invariance of ZMs ensure that the feature representation is independent of the subject and the style with which they speak.

Another important aspect is the dataset and the experimental protocol of classification which was followed. The dataset used in this work is based on visemes, which are the fundamental visual units of human speech and which can be extended to words and sentences by concatenation, while most of the other work is based on the digits (0–9).

Table 1 Results of temporal segmentation for 14 visemes (three epochs)

Viseme          Epoch 1           Epoch 2           Epoch 3
                Start    End      Start    End      Start    End

/a/   Manual    24       56       76       107      128      158
      Auto      27       56       75       107      128      158
/ch/  Manual    19       53       70       101      123      157
      Auto      19       53       70       102      122      155
/e/   Manual    49       78       102      133      171      211
      Auto      48       76       102      134      170      209
/g/   Manual    43       86       100      144      165      199
      Auto      43       83       100      142      163      198
/th/  Manual    1        33       58       95       120      155
      Auto      1        34       60       95       119      155
/i/   Manual    22       55       74       107      130      161
      Auto      24       54       73       107      129      161
/m/   Manual    21       55       86       120      154      188
      Auto      22       56       86       119      153      186
/n/   Manual    80       116      136      173      203      241
      Auto      79       116      136      175      204      240
/o/   Manual    26       58       76       114      134      169
      Auto      26       57       75       115      133      167
/r/   Manual    16       58       79       119      142      182
      Auto      19       57       81       118      147      182
/s/   Manual    22       59       80       120      143      181
      Auto      22       58       78       119      146      184
/t/   Manual    16       35       64       85       107      131
      Auto      15       34       63       83       108      132
/u/   Manual    64       101      122      160      180      214
      Auto      63       101      121      159      179      215
/v/   Manual    11       49       61       105      113      153
      Auto      11       48       62       101      113      152

The features of the numbers/digits are comparatively more discriminative, as confirmed by our observations. Moreover, in earlier work, Hidden Markov Models (HMM) [52] and a feed-forward multilayer perceptron (MLP) artificial neural network (ANN) with back propagation [41] have already been investigated using the same dataset. The mean recognition rate for the identification of nine visemes was reported as 93.57 % using the HMM and 84.7 % using the ANN. One important thing to note is that, in contrast to the work presented here, the ANN and HMM were tested in subject dependent scenarios. In the proposed method, the samples from all the subjects were used in training the classifier, which introduces a lot of inter-subject variation. The features and the classifier chosen were successful in countering these effects, as reflected by the results obtained.

6 Conclusions and future work

This paper has presented a new visual speech recognition technique employing directional motion history images, which represent four directions of motion; the computation of the DMHIs is based on optical flow. Reliable temporal segmentation is a major problem in automatic visual speech recognition, and the proposed system incorporates automatic temporal segmentation of the video data. The experimental system demonstrates that this technique performs very well in terms of high accuracy and high sensitivity. However, the overwriting problem may still occur in the proposed technique if the speech is continuous or long motion sequences are considered in the videos. Considering the issue with long motion sequences, some basic visual units (i.e. short motion sequences/lip motion sequences) should be defined; continuous speech can then be segmented into those visual units, and hence

Table 2 Average frame error rate of 10 epochs at start of an utterance

Start frame   Subject 1   Subject 2   Subject 3   Subject 4   Subject 5   Subject 6   Subject 7   Average

/a/           1.3         0.3         0.7         1           1.3         1.3         2           1.13
/ch/          0.3         0.7         0.3         1.7         1           7.7         4.7         2.34
/e/           0.7         2           0.7         0.3         1           1.7         1           1.06
/g/           0.7         1.7         1           0.7         1           2           3.3         1.49
/th/          1           0.7         1           1           1.7         5.7         3.3         2.06
/i/           1           0.7         1.3         1.7         0           3.3         3.3         1.61
/m/           0.7         0.3         1           2.7         0.7         3           1.3         1.39
/n/           0.7         1           1           0.7         1           3           0.7         1.16
/o/           0.7         0.3         0.7         1           0.7         2           5.7         1.59
/r/           3.3         1.7         0.3         1.3         1           5.7         3.7         2.43
/s/           1.7         0.3         0.3         2           1.7         0.3         2           1.19
/t/           1           1.3         1           0.7         7           1           3.3         2.19
/u/           1           1.7         0.7         1.3         4           1           1           1.53
/v/           0.3         0.3         0.3         1           1           1           3.3         1.03
Average       1.03        0.93        0.74        1.22        1.65        2.77        2.76

Average error of start frame for all subjects, all visemes: 1.48

Table 3 Average frame error rate of 10 epochs at end of an utterance

End frame     Subject 1   Subject 2   Subject 3   Subject 4   Subject 5   Subject 6   Subject 7   Average

/a/           1           2           0.3         0.7         2.3         1           2.3         1.37
/ch/          1           0.7         0.3         1.3         2           0.7         4.3         1.47
/e/           1.7         1.7         2.7         1           1.3         0.3         1           1.39
/g/           2           0.3         1.3         1.3         0.3         0.7         2.3         1.17
/th/          0.3         0.7         1           0.7         2           1.7         3           1.34
/i/           0.3         2.3         1           0.7         0.7         1.7         2.7         1.34
/m/           1.3         1.7         1.3         1           1.3         1           1           1.23
/n/           1           2           0.3         1           1.3         2.7         2           1.47
/o/           1.3         0.3         0.7         1.3         0.7         3.3         3           1.51
/r/           0.7         2.3         1           2.3         0.3         0.7         3           1.47
/s/           1.7         1           1           1.3         1           2.3         2.3         1.51
/t/           1.3         0.3         0           0.7         0.3         1           2.3         0.84
/u/           0.7         1.3         1           1.7         0.7         3.7         1           1.44
/v/           2           1           0.7         2           1.3         1.7         4.3         1.86
Average       1.16        1.26        0.9         1.21        1.11        1.61        2.46

Average error of end frame for all subjects, all visemes: 1.29

the continuous speech can be recognized by concatenating those visual units.

To overcome the inter-subject variation in the style or speed of speaking, the system was made insensitive to the speed of speaking over two stages; as a result, the system is suitable for subject independent use. The technique computed the Zernike Moments from each of the directional motion history images, which are rotation invariant features and are useful in real-time systems. Finally, the classification is performed using a support vector machine classifier, which ensures convergence at a global minimum. Unlike most other work, which is based on a single subject, this work considered all subjects together, using the leave-one-out paradigm.

Table 4 Classification results of individual one class SVM for 14 visemes (all values in %)

                      DMHI                                    MHI
Visemes      Specificity  Sensitivity  Accuracy      Specificity  Sensitivity  Accuracy

1   /a/      99.9         74.3         98.1          99.7         15.7         93.7
2   /ch/     99.7         71.4         97.7          96.7         28.6         91.8
3   /e/      99.5         77.1         97.9          99.8         15.7         93.8
4   /g/      99.8         70           97.7          99.7         11.4         93.3
5   /th/     100          74.3         98.2          98.7         21.4         93.2
6   /i/      99.5         75.7         97.8          98.7         35.7         94.2
7   /m/      99.8         90           99.1          98.4         52.9         95.1
8   /n/      99.5         71.4         97.4          99.3         18.6         93.6
9   /o/      99.9         80           98.5          98.6         25           93.6
10  /r/      99.6         72.9         97.7          100          17.1         94.1
11  /s/      99.6         70           97.4          99           24.2         93.6
12  /t/      99.6         72.9         97.7          98.9         25.1         93.6
13  /u/      99.7         81.4         98.4          99.1         24.7         93.8
14  /v/      99.9         78.6         98.4          99.1         25.6         93.8

References

1. Xiaodong, C., Alwan, A.: Noise robust speech recognition using feature compensation based on polynomial regression of utterance SNR. IEEE Trans. Speech Audio Process. 13(6), 1161–1172 (2005)
2. Haitian, X., Zheng-Hua, T., Dalsgaard, P., Lindberg, B.: Robust speech recognition by nonlocal means denoising processing. IEEE Signal Process. Lett. 15, 701–704 (2008)
3. Zheng-Hua, T., Lindberg, B.: Low-complexity variable frame rate analysis for speech recognition and voice activity detection. IEEE J. Sel. Top. Signal Process. 4(5), 798–807 (2010)
4. Petajan, E.: Automatic lipreading to enhance speech recognition. In: IEEE Global Telecommunications Conference, Atlanta, GA, USA, pp. 265–272. IEEE Computer Society Press, Los Alamitos (1984)
5. Arjunan, S.P., Kumar, D.K., Yau, W.C., Weghorn, H.: Unspoken vowel recognition using facial electromyogram. In: Engineering in Medicine and Biology Society, EMBS'06, 28th Annual International Conference of the IEEE, Aug. 30 2006–Sept. 3 2006, pp. 2191–2194 (2006)
6. Schultz, T., Wand, M.: Modeling coarticulation in EMG-based continuous speech recognition. Speech Commun. 52(4), 341–353 (2010). doi:10.1016/j.specom.2009.12.002
7. Medizinelektronik, C.: (2008). http://www.articulograph.de/
8. Soquet, A., Saerens, M., Lecuit, V.: Complementary cues for speech recognition, pp. 1645–1648 (1999)
9. Potamianos, G., Neti, C., Gravier, G., Garg, A., Senior, A.W.: Recent advances in the automatic recognition of audiovisual speech. Proc. IEEE 91(9), 1306–1326 (2003)
10. Yau, W.C., Kumar, D.K., Arjunan, S.P.: Visual speech recognition using dynamic features and support vector machines. Int. J. Image Graph. 8(3), 419–437 (2008)
11. Xiang, T., Gong, S.: Beyond tracking: modelling activity and understanding behaviour. Int. J. Comput. Vis. 67(1), 21–51 (2006)
12. Meng, H., Pears, N., Bailey, C.: Motion information combination for fast human action recognition. In: Proc. Computer Vision Theory and Applications, Spain, pp. 21–28 (2007)
13. Ma, J., Cole, R., Pellom, B., Ward, W., Wise, B.: Accurate automatic visible speech synthesis of arbitrary 3D models based on concatenation of diviseme motion capture data. Comput. Animat. Virtual Worlds 15(5), 485–500 (2004)
14. Govokhina, O., Bailly, G., Breton, G.: Learning optimal audiovisual phasing for a HMM-based control model for facial animation (2007)
15. Musti, U., Toutios, A., Ouni, S., Colotte, V., Wrobel-Dautcourt, B., Berger, M.O.: HMM-based automatic visual speech segmentation using facial data. In: Proc. of Interspeech 2010, pp. 1401–1404 (2010)
16. Koprinska, I., Carrato, S.: Temporal video segmentation: a survey. Signal Process. Image Commun. 16(5), 477–500 (2001)
17. Da Silveira, L.G., Facon, J., Borges, D.L.: Visual speech recognition: a solution from feature extraction to words classification. In: XVI Brazilian Symposium on Computer Graphics and Image Processing, SIBGRAPI 2003, pp. 399–405 (2003)
18. Luettin, J., Thacker, N.A., Beet, S.W.: Active shape models for visual speech feature extraction. NATO ASI Ser. Comput. Syst. Sci. 150, 383–390 (1996)
19. Otani, K., Hasegawa, T.: The image input microphone—a new nonacoustic speech communication system by media conversion from oral motion images to speech. IEEE J. Sel. Areas Commun. 13(1), 42–48 (1995)
20. Hueber, T., Chollet, G., Denby, B., Stone, M., Zouari, L.: Ouisper: corpus based synthesis driven by articulatory data. In: 16th International Congress of Phonetic Sciences, pp. 2193–2196 (2007)
21. Yuhas, B.P., Goldstein, M.H. Jr., Sejnowski, T.J.: Integration of acoustic and visual speech signals using neural networks. IEEE Commun. Mag. 27(11), 65–71 (1989)
22. Bregler, C., Konig, Y.: Eigenlips for robust speech recognition. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP-94, 19–22 Apr 1994, pp. 669–672 (1994)
23. Silsbee, P.L., Bovik, A.C.: Computer lipreading for improved accuracy in automatic speech recognition. IEEE Trans. Speech Audio Process. 4(5), 337–351 (1996)
24. Potamianos, G., Luettin, J., Neti, C.: Hierarchical discriminant features for audio-visual LVCSR. In: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'01), vol. 161, pp. 165–168 (2001)
25. Zhao, G., Barnard, M., Pietikainen, M.: Lipreading with local spatiotemporal descriptors. IEEE Trans. Multimed. 11(7), 1254–1265 (2009)

26. Yuille, A.L., Hallinan, P.W., Cohen, D.S.: Feature extraction from faces using deformable templates. Int. J. Comput. Vis. 8(2), 99–111 (1992)
27. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: active contour models. Int. J. Comput. Vis. 1(4), 321–331 (1988)
28. Papandreou, G., Katsamanis, A., Pitsikalis, V., Maragos, P.: Adaptive multimodal fusion by uncertainty compensation with application to audiovisual speech recognition. IEEE Trans. Audio Speech Lang. Process. 17(3), 423–435 (2009)
29. Matthews, I., Cootes, T., Bangham, J., Cox, S., Harvey, R.: Extraction of visual features for lipreading. IEEE Trans. Pattern Anal. Mach. Intell. 24(2), 198–213 (2002)
30. Goldschen, A.J., Garcia, O.N., Petajan, E.: Continuous optical automatic speech recognition by lipreading. In: Conference Record of the Twenty-Eighth Asilomar Conference on Signals, Systems and Computers, 31 Oct–2 Nov, pp. 572–577 (1994)
31. Mase, K., Pentland, A.: Automatic lipreading by optical-flow analysis. Syst. Comput. Jpn. 22(6), 67–76 (1991)
32. Iwano, K., Yoshinaga, T., Tamura, S., Furui, S.: Audio-visual speech recognition using lip information extracted from side-face images. EURASIP J. Audio Speech Music Process. 2007(1), 4 (2007)
33. Rajavel, R., Sathidevi, P.S.: A novel algorithm for acoustic and visual classifiers decision fusion in audio-visual speech recognition system. Signal Process. 4(1), 23–37 (2010)
34. Venkatesh Babu, R., Ramakrishnan, K.: Recognition of human actions using motion history information extracted from the compressed video. Image Vis. Comput. 22(8), 597–607 (2004)
35. Weinland, D., Ronfard, R., Boyer, E.: Free viewpoint action recognition using motion history volumes. Comput. Vis. Image Underst. 104(2–3), 249–257 (2006)
36. Valstar, M., Patras, I., Pantic, M.: Facial action unit recognition using temporal templates. In: 13th IEEE International Workshop on Robot and Human Interactive Communication, ROMAN, 20–22 Sept 2004, pp. 253–258 (2004)
37. Ahad, M.: Analysis of motion self-occlusion problem due to motion overwriting for human activity recognition. J. Multimed. 5(1), 36–46 (2010)
38. Young, S., Evermann, G., Kershaw, D., Moore, G., Odell, J., Ollason, D., Valtchev, V., Woodland, P.: The HTK Book, vol. 2. Entropic Cambridge Research Laboratory, Cambridge (1997)
39. Su, Q., Silsbee, P.L.: Robust audiovisual integration using semi-continuous hidden Markov models. In: Proc. International Conference on Spoken Language Processing, Philadelphia, PA, pp. 42–45 (2002)
40. Krone, G., Talk, B., Wichert, A., Palm, G.: Neural architectures for sensor fusion in speech recognition. In: Proc. European Tutorial Workshop on Audio-Visual Speech Processing, Rhodes, Greece, pp. 57–60 (1997)
41. Yau, W., Kumar, D., Arjunan, S.: Voiceless speech recognition using dynamic visual speech features. In: Proc. of HCSNet Workshop on the Use of Vision in HCI, Canberra, Australia, pp. 93–101. Australian Computer Society, Inc., Canberra (2006)
42. Duchnowski, P., Meier, U., Waibel, A.: See me, hear me: integrating automatic speech recognition and lip-reading. In: Proceedings of the International Conference on Spoken Language and Processing, Yokohama, Japan, pp. 547–550. Citeseer, Princeton (1994)
43. Heckmann, M., Berthommier, F., Kroschel, K.: A hybrid ANN/HMM audio-visual speech recognition system. In: International Conference on Auditory-Visual Speech Processing, Aalborg, Denmark, pp. 189–194. Citeseer, Princeton (2001)
44. Yau, W., Kant Kumar, D., Chinnadurai, T.: Lip-reading technique using spatio-temporal templates and support vector machines. In: Progress in Pattern Recognition, Image Analysis and Applications, pp. 610–617 (2008)
45. Ganapathiraju, A., Hamaker, J., Picone, J.: Hybrid SVM/HMM architectures for speech recognition. In: International Conference on Spoken Language Processing, pp. 504–507. Citeseer, Princeton (2000)
46. Gordan, M., Kotropoulos, C., Pitas, I.: A support vector machine-based dynamic network for visual speech recognition applications. EURASIP J. Appl. Signal Process. 2002(1), 1248–1259 (2002)
47. Sun, D., Roth, S., Lewis, J., Black, M.: Learning optical flow. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) Computer Vision—ECCV 2008. Lecture Notes in Computer Science, vol. 5304, pp. 83–97. Springer, Berlin (2008)
48. Otsu, N.: A threshold selection method from gray-level histograms. IEEE Trans. Syst. Man Cybern. 9(1), 62–66 (1979)
49. Khotanzad, A., Hong, Y.H.: Invariant image recognition by Zernike moments. IEEE Trans. Pattern Anal. Mach. Intell. 12(5), 489–497 (1990)
50. Teh, C.H., Chin, R.T.: On image analysis by the methods of moments. IEEE Trans. Pattern Anal. Mach. Intell. 10(4), 496–513 (1988)
51. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines (2001)
52. Yau, W.C.: Video Analysis of Mouth Movement Using Motion Templates for Computer-Based Lip-Reading. RMIT University, Melbourne (2008)

Ayaz A. Shaikh obtained his Bachelor of Engineering (Electronics) in 1999 and M.Phil (Telecommunication) in 2008 from the Mehran University of Engineering and Technology (MUET), Jamshoro, Pakistan. He received his PhD degree from the School of Electrical and Computer Engineering, RMIT University, Melbourne, Australia in 2012. He has been a faculty member of MUET since 2003 (currently on leave). Dr. Shaikh has published a couple of conference and journal papers. He has been working as a sessional academic at RMIT University from 2009 to date. His current research interests include computer vision, image and video processing, pattern recognition and digital signal processing. He is a member of IEEE and the IEEE Computer Society.

Dinesh K. Kumar is a Professor of Biosignals at RMIT University and has extensive work in developing signal classification techniques for biosignals, audio and video data. He has developed bio-inspired artificial intelligence based systems that have been patented: (i) 'audio identification' (A2004901712), (ii) 'video analysis tools' (A2004901711), and (iii) 'information in sound' (USPO HH 2005). Dinesh's expertise is in identifying events from noisy signal recordings and in separating artefacts from the targeted signals. He has published over 175 refereed papers, with 96 papers and seven patent applications in the last 5 years. Dinesh is an award winning research student supervisor, having been selected as the Best Supervisor for 2005 and 2007. He has also been awarded an Expert Professorial Scholarship by the Federal government of Brazil with a Conselho Nacional de Pesquisa e Desenvolvimento Grant towards providing his expertise for Biosignals (2008). His work in Biosignals has been recognised by the European Union, where he has been awarded the Erasmus Mundus Professorial Scholarship (2009).

Jayavardhana Gubbi received the Bachelor of Engineering degree from Bangalore University, Bengaluru, India, in 2000, and the Ph.D. degree from the University of Melbourne, Melbourne, Vic., Australia, in 2007. For three years, he was a Research Assistant at the Indian Institute of Science, where he was engaged in speech technology for Indian languages. Dr. Gubbi is a Research Fellow in the Department of Electrical and Electronic Engineering at The University of Melbourne. Currently, from 2010 to 2014, he is an ARC Australian Postdoctoral Fellow Industry (APDI) working on an industry linkage grant in video processing. His current research interests include Video Processing, Internet of Things and Ubiquitous healthcare devices. He has coauthored more than 40 papers in peer reviewed journals, conferences, and book chapters over the last ten years. Dr. Gubbi has served as conference secretary and publications chair in several international conferences in the area of wireless sensor networks, signal processing and pattern recognition.