
Article

Automatic Lip-Reading System Based on Deep Convolutional Neural Network and Attention-Based Long Short-Term Memory

Yuanyao Lu * and Hongbo Li

School of Information Science and Technology, North China University of Technology, Beijing 100144, China; [email protected] * Correspondence: [email protected]

 Received: 27 March 2019; Accepted: 11 April 2019; Published: 17 April 2019 

Abstract: With the improvement of computer performance, virtual reality (VR), as a new means of visual operation and interaction, gives automatic lip-reading technology based on visual features broad development prospects. In an immersive VR environment, the user's state can be captured through lip movements, allowing the user's real-time thinking to be analyzed. Due to complex image processing, hard-to-train classifiers and long recognition processes, traditional lip-reading recognition systems have difficulty meeting the requirements of practical applications. In this paper, a convolutional neural network (CNN) used for image feature extraction is combined with a recurrent neural network (RNN) based on an attention mechanism for automatic lip-reading recognition. Our proposed method can be divided into three steps. Firstly, we extract keyframes from our independently established database (English pronunciations of the numbers from zero to nine by three males and three females). Then, we use the Visual Geometry Group (VGG) network to extract the lip image features; the extracted image features prove fault-tolerant and effective. Finally, we compare two lip-reading models: (1) a fusion model with an attention mechanism and (2) a fusion model of the two networks without attention. The results show that the accuracy of the proposed model is 88.2% on the test dataset, versus 84.9% for the contrastive model. Therefore, our proposed method is superior to traditional lip-reading recognition methods and general neural networks.

Keywords: virtual reality (VR); self-attention; automatic lip-reading; sensory input; deep learning

1. Introduction

Human–computer interaction methods have had a great impact on social progress in recent years, promoting the rapid development of artificial intelligence technology and solving many practical problems [1]. Automatic lip-reading technology is one of the important components of human–computer interaction and virtual reality (VR) technology, and it plays a vital role in human language communication and visual perception. Especially in noisy environments or VR environments, visual signals can remove redundant information, complement speech information, increase the multi-modal input dimension of immersive interaction, reduce the time and effort humans spend learning lip language and lip movement, and improve automatic speech recognition; it also enhances the sense of presence in immersive VR. Meanwhile, automatic lip-reading technology can be widely used in VR systems [2], information security [3], speech recognition [4] and assisted driving systems [5]. The research of automatic lip-reading involves many fields, such as pattern recognition, computer vision, natural language comprehension and image processing, and draws on the latest research progress in these fields. Conversely, the study of lip movement also tests and further develops these theories, and it will have a profound impact on content-based image compression technology.

Traditional lip-reading systems usually consist of two stages: feature extraction and classification. In the first stage, most previous feature extraction methods use pixel values extracted from the mouth region of interest (ROI) as visual information, and abstract image features are then extracted by the discrete cosine transform (DCT) [6,7], the discrete wavelet transform (DWT) [7] and principal component analysis (PCA) [7,8]. Alternatively, model-based methods such as the active appearance model (AAM) [9] and the active shape model (ASM) [10] form non-rigid models and obtain a set of high-level geometric features with lower dimensionality and stronger robustness. In the second stage, the extracted features are fed into classifiers such as the support vector machine (SVM) [11] and the hidden Markov model (HMM) [12].

At present, deep learning has made significant progress in the field of computer vision (image representation, target detection, human behavior recognition and video recognition). Therefore, it is a natural trend of scientific research to shift automatic lip-reading technology from the traditional manual feature extraction and classification pipeline to end-to-end deep learning architectures. In recent years, researchers have introduced attention mechanisms into convolutional neural networks (CNN) to focus on areas of interest, and the classification and target detection of images have also achieved great success; for example, Vinyals et al. [13] proposed a CNN feature extraction method based on an attention mechanism. Furthermore, the attention mechanism can be successfully applied in recurrent neural networks (RNN) to capture contextual relationships. Since the changes between video frames in automatic lip-reading are continuous and occur in time series, researchers use the long short-term memory (LSTM) network [14], which can find hidden association information in time-series data such as video, audio and text. A multi-layered neural network structure of cascaded feed-forward layers and LSTM layers has been proposed for word-level classification in speaker-dependent lip-reading.

Considering that lip motion is a continuous process with temporal information, visual content can be represented by consecutive frames. Therefore, we propose a hybrid neural network architecture combining a CNN and attention-based LSTM to learn the hidden correlations in spatio-temporal information [15], using the attention weights to express the importance of keyframes. Our proposed method for automatic lip-reading recognition can be divided into four parts. Firstly, we extract keyframes from each sample video and use the key points of the mouth to locate the mouth area, reducing the redundant information and computational cost of processing successive frames. Secondly, features are extracted from the original mouth images using the VGG19 network [16], which consists of 16 convolution layers and three fully connected layers. Thirdly, we use the attention-based LSTM network to learn sequential information and attention weights among the video keyframe features. Finally, the recognition result is predicted by two fully connected layers and a SoftMax layer; the SoftMax function converts the predicted results into probabilities.
The main advantages of our method are as follows: (1) VGG19 is equipped to overcome image deformation, including translation, rotation and distortion; therefore, the extracted features have strong robustness and fault tolerance. (2) Attention-based LSTM is good at finding and exploiting long-term dependencies in sequential data, and the introduction of the attention mechanism makes the network selectively focus on active video information and reduces the interference of invalid information; therefore, the relationships among frame features are connected and strengthened. The rest of the paper is organized as follows: in Section 2, we introduce the preparation work and the architecture of the lip-reading model. Experimental results and analysis of our proposed method are presented in Section 3. Section 4 offers conclusions and suggestions for future research directions.

2. Proposed Lip-Reading Model

In this section, the proposed framework and its main steps are discussed in detail according to the following four parts. Firstly, we preprocess the dynamic lip videos, including separating the audio and video signals, extracting keyframes and locating the mouth. Secondly, features are extracted from the preprocessed image dataset by using a CNN. Then, we use LSTM with an attention mechanism to learn sequence information and attention weights. Finally, the ten-dimensional features are mapped through two fully connected layers, and the result of automatic lip-reading recognition is predicted by a SoftMax layer. SoftMax normalizes the output of the fully connected layers and classifies it according to probability; the probabilities sum to one. Our proposed CNN with attention-based LSTM architecture is shown in Figure 1.

Figure 1. The architecture of our proposed convolutional neural network (CNN) with attention-based long short-term memory (LSTM).

2.1. Video Preprocessing

We preprocess the sequential lip images in order to balance training speed and recognition results, including keyframe extraction and lip location segmentation [17]. Generally, video is captured at about 25 frames per second. Since the length of each utterance differs and any word actually pronounced carries redundant information about the movement of the lips, it is difficult for the model to extract image features and discover the hidden relationships of the sequences. Therefore, we remove redundant information from the original images, extract keyframes and segment the lips in the following four steps to build the experimental dataset (a code sketch of these steps is given after Figure 2):

• The duration of the utterance is divided into 10 equal intervals, and a random frame of each interval is selected as a keyframe; thus each word yields an image sequence of equal length. We use the OpenCV library to load the images and convert them into a three-dimensional matrix [18].
• Then, we use the facial landmark detection of the Dlib toolkit [19]. It takes face images as input, and the returned face structure consists of different landmarks for each specific face attribute. We choose to locate the seven key points of the mouth, labeled 49, 51, 53, 55, 57, 58 and 59.
• We segment the mouth images and remove the redundant information. We calculate the center position of the mouth from the coordinate points of the image boundary, denoted as (x_0, y_0). The width and height of the lip image are represented by w and h, respectively, and L_1 and L_2 represent the left/right and upper/lower dividing lines surrounding the mouth, respectively. The bounding box of the mouth is calculated according to the following formulas:

$L_1 = x_0 \pm \dfrac{w}{2}$, (1)

$L_2 = y_0 \pm \dfrac{h}{2}$, (2)

• After the mouth segmentation step, the original dataset is processed into 224 × 224 pixel images that take the lips as a standard. This method has the characteristics of strong robustness, high computational efficiency and consistency of the eigenvectors. The preprocessing steps are shown in Figure 2.


Figure 2. Preprocessing steps with example frames.
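As referenced above, the following Python sketch illustrates the four preprocessing steps under stated assumptions: it relies on OpenCV and Dlib with the standard 68-point shape predictor file (shape_predictor_68_face_landmarks.dat), and the example video filename, the use of the landmark centroid as the mouth center and the landmark extents as w and h are illustrative choices rather than the authors' exact implementation.

```python
import random

import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# Assumes the standard 68-point landmark model file is available locally.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")


def extract_keyframes(video_path, n_frames=10):
    """Split the utterance into n equal intervals and pick one random frame per interval."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = total / n_frames
    frames = []
    for i in range(n_frames):
        lo, hi = int(i * step), max(int((i + 1) * step), int(i * step) + 1)
        cap.set(cv2.CAP_PROP_POS_FRAMES, random.randrange(lo, hi))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames


def crop_mouth(frame, size=224):
    """Locate the mouth landmarks, build the bounding box of Equations (1)-(2) and resize."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    # Points 49, 51, 53, 55, 57, 58, 59 in the paper's 1-based numbering correspond
    # to indices 48, 50, 52, 54, 56, 57, 58 in Dlib's 0-based landmark scheme.
    pts = np.array([(shape.part(i).x, shape.part(i).y) for i in (48, 50, 52, 54, 56, 57, 58)])
    x0, y0 = pts.mean(axis=0)                      # mouth center (x0, y0)
    w = pts[:, 0].max() - pts[:, 0].min()          # mouth width  w
    h = pts[:, 1].max() - pts[:, 1].min()          # mouth height h
    x1, x2 = int(x0 - w / 2), int(x0 + w / 2)      # L1 = x0 +/- w/2, Eq. (1)
    y1, y2 = int(y0 - h / 2), int(y0 + h / 2)      # L2 = y0 +/- h/2, Eq. (2)
    roi = frame[max(y1, 0):y2, max(x1, 0):x2]
    return cv2.resize(roi, (size, size))


# Example: build the 10-frame, 224 x 224 mouth sequence for one utterance video.
rois = [crop_mouth(f) for f in extract_keyframes("zero_speaker1_take1.avi")]
```

In practice, a frame whose face or mouth detection fails can simply be replaced by a neighboring keyframe before the sequence is fed to the CNN.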

2.2. Attention-Based LSTM

In general, an RNN can obtain sequence output with time information from sequence input [20]. For example, the LSTM network, a special type of RNN, can learn long-term dependency information. LSTM was proposed by Hochreiter and Schmidhuber [21] and was improved and promoted by Alex Graves in 2012 [22]. In many practical applications, LSTM has achieved considerable success and has been widely used.

The first step in LSTM is deciding which information to discard from the cell state, which is accomplished by the "forget gate". This gate reads h_{t-1} and x_t, then outputs a value in the [0, 1] interval for each number in the cell state C_{t-1}. The calculation process is as follows:

$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$, (3)

where σ is the hidden activation function, h_{t-1} is the hidden state at time t-1, x_t is the input at time t, and b_f is the bias.

Then, it is determined which new useful information is stored in the cell state. This consists of two parts. First, a sigmoid layer called the "input gate layer" determines which values will be updated, and a new candidate value vector \tilde{C}_t is created by tanh, the activation function that processes the state and output data in LSTM. Second, the cell updates the cell state: the old cell state C_{t-1} is multiplied by the output of the "forget gate" f_t, and this is summed with the product of the "input gate" output i_t and the candidate information \tilde{C}_t. The result of the calculation is the updated C_t; how much each state changes depends on how much we decide to update it. The calculation processes are as follows:

$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$, (4)

$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$, (5)

$C_t = f_t \ast C_{t-1} + i_t \ast \tilde{C}_t$, (6)

Finally, the output value is determined based on the filtered cell state. Firstly, the sigmoid layer determines the output portion of the cell state. Then, the cell state is passed through tanh and multiplied by the output of the sigmoid layer to obtain the result of the cell. The value range of the sigmoid function is [0, 1], which makes it most suitable for controlling the opening and closing of the various gates. This part of the calculation process is as follows:

$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$, (7)

$h_t = o_t \ast \tanh(C_t)$, (8)

The key to the LSTM network is the cell state. As shown in Figure 3, the calculation process runs along the horizontal line. It runs directly across the chain with only a small amount of linear interaction, and it is easy for information to keep flowing along it.


Figure 3. LSTM unit structure.
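As a compact illustration of Equations (3)-(8), the following NumPy sketch performs a single LSTM cell step; the hidden and input sizes and the random weights are illustrative assumptions, not the dimensions used in the proposed network.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step following Equations (3)-(8).

    W and b are dicts keyed by 'f', 'i', 'C', 'o'; each W[k] acts on the
    concatenation [h_{t-1}, x_t] and each b[k] is the corresponding bias.
    """
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W['f'] @ z + b['f'])       # forget gate,  Eq. (3)
    i_t = sigmoid(W['i'] @ z + b['i'])       # input gate,   Eq. (4)
    c_tilde = np.tanh(W['C'] @ z + b['C'])   # candidate,    Eq. (5)
    c_t = f_t * c_prev + i_t * c_tilde       # cell state,   Eq. (6)
    o_t = sigmoid(W['o'] @ z + b['o'])       # output gate,  Eq. (7)
    h_t = o_t * np.tanh(c_t)                 # hidden state, Eq. (8)
    return h_t, c_t


# Toy example with hidden size 4 and input size 3 (illustrative only).
rng = np.random.default_rng(0)
hidden, inp = 4, 3
W = {k: rng.standard_normal((hidden, hidden + inp)) for k in 'fiCo'}
b = {k: np.zeros(hidden) for k in 'fiCo'}
h, c = np.zeros(hidden), np.zeros(hidden)
h, c = lstm_step(rng.standard_normal(inp), h, c, W, b)
```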

Therefore, the LSTM network can successfully discover sequence relationships, and we add an attention mechanism on top of it. The CNN extracts the spatial features to obtain fixed-length feature vectors, and the LSTM network identifies the video contents based on the input feature vectors. We use the framework described in Figure 4, which combines attention-based LSTM with a deep CNN to train spatio-temporal features on video sequences.

Figure 4. Attention-based LSTM model.


The CNN is used as the encoder and the LSTM network is used as the decoder. In the decoding process, we introduce the attention mechanism and learn the attention weights (α), thus the model pays more attention to the effective areas of the whole video [23]. The feature vector v_i of each frame is weighted, and the weighted combination φ(V) of all video frames is used as input to the LSTM network. The input to the attention-based LSTM model is as follows:

$\phi(V) = \sum_{i=1}^{n} \alpha_{ti} v_i$, (9)

The learning of the weight α is related to the state of the hidden layer unit of the LSTM network at the previous time step and to the feature vector at the current time. The correlation score e_{ti} used to compute α_{ti} is obtained as follows:

$e_{ti} = \tanh(W \cdot h_{t-1} + U \cdot v_i + b)$, (10)

where h_{t-1} is the output of the hidden unit state at time t-1, v_i is the eigenvector of video frame i, W and U represent the weight matrices to be learned, b is the offset parameter, and the activation function is tanh. Normalization is obtained as follows:

$\alpha_{ti} = \dfrac{\exp(e_{ti})}{\sum_{k=1}^{n} \exp(e_{tk})}$, (11)

where α_{ti} represents the conditional probability (P(a|e)) relating the feature vector of video frame i at time t to the entire video feature vector; furthermore, $\sum_{i=1}^{n} \alpha_{ti} = 1$. The closer the relationship between the frame and the whole video feature vector, the larger the attention weight becomes. The input of the attention-based LSTM network at time t is then as follows:

$h_t = f_{rnn}(h_{t-1}, \phi(V))$, (12)

where f_{rnn} is a unit of LSTM, h_{t-1} is the state of the hidden layer unit at time t-1, and φ(V) is the input at time t after applying the attention weights. Although the introduction of the attention mechanism increases the amount of computation, it can selectively focus on the effective information in the video and reduce the interference of invalid information, thus the performance of the network model can be significantly improved.
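A minimal PyTorch sketch of Equations (9)-(12) is given below: at every step the previous hidden state is scored against each frame feature, the scores are normalized into the attention weights α_{ti}, and the weighted sum φ(V) is fed to an LSTM cell. Reading W and U as projections to a scalar score per frame, the feature and hidden dimensions (4096 and 512) and the module name FrameAttentionLSTM are assumptions of this sketch, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FrameAttentionLSTM(nn.Module):
    """Attention-based LSTM decoder over a sequence of frame features (Eqs. (9)-(12))."""

    def __init__(self, feat_dim=4096, hidden_dim=512):
        super().__init__()
        self.W = nn.Linear(hidden_dim, 1, bias=False)  # acts on h_{t-1}
        self.U = nn.Linear(feat_dim, 1, bias=True)     # acts on v_i and carries the bias b
        self.cell = nn.LSTMCell(feat_dim, hidden_dim)

    def forward(self, v, steps=10):
        # v: (batch, n_frames, feat_dim) frame feature vectors from the CNN.
        batch = v.size(0)
        h = v.new_zeros(batch, self.cell.hidden_size)
        c = v.new_zeros(batch, self.cell.hidden_size)
        outputs, alphas = [], []
        for _ in range(steps):
            # e_ti = tanh(W h_{t-1} + U v_i + b): one scalar score per frame, Eq. (10)
            e = torch.tanh(self.W(h).unsqueeze(1) + self.U(v)).squeeze(-1)  # (batch, n_frames)
            alpha = F.softmax(e, dim=1)                                     # Eq. (11)
            phi = torch.bmm(alpha.unsqueeze(1), v).squeeze(1)               # phi(V), Eq. (9)
            h, c = self.cell(phi, (h, c))                                   # Eq. (12)
            outputs.append(h)
            alphas.append(alpha)
        return torch.stack(outputs, dim=1), torch.stack(alphas, dim=1)


# Example: a batch of 2 clips, each with 10 frames of 4096-dimensional VGG19 features.
feats = torch.randn(2, 10, 4096)
out, att = FrameAttentionLSTM()(feats)  # out: (2, 10, 512), att: (2, 10, 10)
```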

2.3. CNN-LSTM with Attention Networks

Previous studies have shown that both CNN and RNN models can achieve good lip-reading recognition performance alone [24]. We have found that the hybrid network of attention-based CNN-LSTM can further improve performance. The sequence-based attention mechanism can be applied to time-series computer vision tasks and assists the model in focusing on the relevant sequence information of the video. Considering the influence of light, angle and clarity of the input images, we used a high-quality camera, and the proposed model is trained with RGB images.

The CNN part is based on the VGG19 model without its last two fully connected layers (gray parts of Figure 5), and we continue training the model from parameters pre-trained on ImageNet. The structure diagram of VGG19 is shown in Figure 5, and the input of VGG19 is a 224 × 224 pixel RGB image. Thus, the output of the CNN is 4096 × 10, and the attention mechanism is introduced into the LSTM network to weight the keyframes [25]. Thereafter, the network adds two fully connected layers and a SoftMax layer for classification. The calculation process of SoftMax is as follows:

$f(z_j) = \dfrac{e^{z_j}}{\sum_{i=1}^{n} e^{z_i}}$, (13)


where z_j is the j-th output value, z_i is the i-th output value, and f(z_j) is the corresponding value of the SoftMax mapping.

Figure 5. VGG19 network structure.

Obviously, SoftMax converts the output into the probability of each result. The probability values fall within the interval [0, 1], and the probabilities sum to one. This form of result facilitates subsequent calculations. Finally, the model outputs a 10-dimensional vector, and the class with the highest SoftMax prediction score is taken as the recognition result.
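Under stated assumptions, the overall architecture of this section can be sketched as follows: a VGG19 backbone truncated after its first 4096-dimensional fully connected layer encodes each keyframe, the FrameAttentionLSTM sketched in Section 2.2 aggregates the 10-frame sequence, and two fully connected layers with SoftMax produce the 10-class prediction. The exact truncation point, the 256-unit intermediate layer and the use of the last LSTM output for classification are our reading of the text, not a confirmed reference implementation.

```python
import torch
import torch.nn as nn
import torchvision


class LipReadingNet(nn.Module):
    """VGG19 frame encoder + attention-based LSTM + two fully connected layers + SoftMax."""

    def __init__(self, n_classes=10, hidden_dim=512):
        super().__init__()
        vgg = torchvision.models.vgg19(weights=torchvision.models.VGG19_Weights.IMAGENET1K_V1)
        # Keep the convolutional blocks and only the first 4096-d fully connected layer,
        # i.e. drop the last two fully connected layers of the original VGG19 classifier.
        self.cnn = nn.Sequential(
            vgg.features, vgg.avgpool, nn.Flatten(),
            *list(vgg.classifier.children())[:2],   # Linear(25088, 4096) + ReLU
        )
        # FrameAttentionLSTM is the attention-based LSTM sketched in Section 2.2.
        self.temporal = FrameAttentionLSTM(feat_dim=4096, hidden_dim=hidden_dim)
        self.fc = nn.Sequential(nn.Linear(hidden_dim, 256), nn.ReLU(), nn.Linear(256, n_classes))

    def forward(self, clips):
        # clips: (batch, 10, 3, 224, 224) preprocessed keyframe sequences.
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1)).view(b, t, -1)  # (batch, 10, 4096)
        seq, _ = self.temporal(feats, steps=t)
        logits = self.fc(seq[:, -1])                          # last LSTM output -> 10 classes
        return torch.softmax(logits, dim=1)                   # probabilities, Eq. (13)


# Example forward pass on a random batch of two clips.
probs = LipReadingNet()(torch.randn(2, 10, 3, 224, 224))      # (2, 10) class probabilities
```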

3. Experimental Dataset and Results

3.1. Dataset

The experiments were performed on the audio-visual database we created: 10 independent digital English utterances (the numbers from zero to nine) were gathered from six different speakers (three males and three females). Each speaker pronounced each word up to 100 times. The dataset was based on American English pronunciation, and the pronunciation of each number was divided into separate video clips. The speakers were not trained in professional pronunciation, and their first language was not always English; thus, there might be some differences in the lip movements of individual utterances. We collected numerous non-standard samples for the dataset in order to facilitate a more extensive study of the lip-reading recognition system. We collected videos from the frontal perspective of individual speakers, who were sitting naturally without any actions. The size of each frame of the original video was 1920 × 1080 pixels, at approximately 25 frames per second. In order to accurately locate the beginning and end of each utterance unit, we used audio as an aid to separate each uttered word; each word lasted about 1 s. Then, each isolated word video was further extracted into a fixed length of 10 frames. After processing each video frame, we obtained a fixed-dimension ROI of 224 × 224 pixels as the standard input of the CNN model.

3.2. Results and Discussions

In this section, we evaluated our proposed neural network model, and the results were analyzed and compared on our dataset. The shuffled dataset was randomly divided into an 80% training dataset and a 20% test dataset. We used the PyTorch toolkit to implement the CNN and the attention-based LSTM network. The network was trained by stochastic gradient descent in mini-batches of 50 samples with a learning rate of 0.001. The weights of the CNN were initialized from the parameters of the pre-trained VGG19 model, and the weights of the attention-based LSTM and the fully connected layers were randomly initialized. The visualization of the CNN model is shown in Figure 6.

In order to evaluate the performance improvement contributed by the attention mechanism, we tested and compared the general CNN-RNN architecture. As shown in Figure 7, the proposed network only adds one calculation step (gray part), the attention layer; the other parts of the two architectures and their initial parameters are identical.



Figure 6. The visualization processes of feature extraction.


Figure 7. Our proposed architecture and the compared network architecture.
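For concreteness, the training configuration described above (a random 80/20 split, mini-batches of 50 samples, a learning rate of 0.001, stochastic gradient descent and roughly 15 epochs) could be wired up as in the hedged sketch below; the placeholder tensors, the tiny dataset size and the negative log-likelihood formulation over the SoftMax outputs are assumptions of this illustration, with LipReadingNet taken from the sketch in Section 2.3.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset, random_split

# Tiny placeholder tensors standing in for the preprocessed keyframe clips and digit labels;
# the real dataset (six speakers, up to 100 repetitions per word) is far larger.
clips = torch.randn(20, 10, 3, 224, 224)
labels = torch.randint(0, 10, (20,))

dataset = TensorDataset(clips, labels)
n_train = int(0.8 * len(dataset))
train_set, test_set = random_split(dataset, [n_train, len(dataset) - n_train])
train_loader = DataLoader(train_set, batch_size=50, shuffle=True)

model = LipReadingNet()                          # sketch from Section 2.3
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

for epoch in range(15):
    for x, y in train_loader:
        probs = model(x)                         # SoftMax probabilities, Eq. (13)
        loss = F.nll_loss(torch.log(probs + 1e-8), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```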

The training dataset and the test dataset were respectively input into the two CNN-RNN networks; we used the same CNN (VGG19) to extract the sequence features of 4096 × 10 and input them into the two different RNN networks. The losses, accuracies and visualizations of the attention mechanism for each period are shown in Figures 8 and 9.


Figure 8. Comparison of our proposed attention-based model (blue) with the CNN-LSTM model (red). (a) Comparison of the losses between the two networks and (b) comparison of the accuracies between the two networks.

Figure 8 shows that the recognition accuracy of the model with the attention mechanism was higher. When the number of epochs reached about 15, the loss tended to be stable, indicating that the optimal solution had been reached. The accuracy of the proposed network model was 88.2% on the test dataset and 84.9% for the contrastive network (CNN-LSTM).

Figure 9. Visualization of the attention mechanism.

Figure 9 shows the visualization of the attention mechanism: each line shows the weight of attention at each moment for the utterances from zero to nine, and a deeper color represents a greater attention weight. It can be inferred that the attention area of each pronunciation was concentrated near the third and seventh frames, because the third, fourth, seventh and eighth frames contained the main information of the lip motion. These frames were related to the video theme and carried a certain chronological order, so they were considered to be important video frames and the model assigned them a large amount of attention weight. This attention distribution shows that the attention mechanism optimized the proportion of keyframes and met the requirement of allocating more information processing resources to the important parts. Therefore, our research used an architecture which is a fusion of CNN and attention-based LSTM. Obviously, the performance of our proposed architecture was more powerful.

We tested the ten independent English word utterances in the two models. The experimental results are shown in Figure 10. The results show that the proposed model had higher accuracy for all independent digital word utterances and was stronger than the general CNN-LSTM model. Furthermore, according to the analysis of the recognition results for each individual utterance, "two" was the easiest to identify and "zero" was the most difficult. Complex pronunciations were difficult to recognize because of errors in individual pronunciation. The best-recognized utterance was "two" because the speakers' lip movements were consistent without regional differences, and the worst-identified utterance was "zero" because the lip movements were complicated and part of the pronunciation requires the tongue to control the syllables. Since English was not the first language of all speakers, the experimental results were adversely affected. It can be inferred that the accuracy of lip-reading recognition would be higher if videos of standard announcers' pronunciation were used; however, such a system would be difficult to put into practical applications. Therefore, the dataset in this paper is closer to the actual application and has high academic research value. As a whole, the proposed model effectively improved recognition accuracy. It can be concluded that the proposed model is stronger than the general CNN-LSTM structure in the performance of these ten independent digital pronunciations.

Figure 10. Comparisons of recognition accuracy between the two networks.

4. Conclusions

In this paper, a hybrid neural network architecture of CNN and attention-based LSTM is proposed for lip-reading recognition systems. Firstly, the CNN (VGG19) extracts visual features from the mouth ROI. Then, we use the attention-based LSTM to learn the sequence weights and sequence information between the frame-level features. Finally, the classification is achieved by using two fully connected layers and a SoftMax layer. The experimental dataset was built by us independently and consists of recordings of three males and three females pronouncing the American English numbers from zero to nine; each digital utterance was divided into independent video clips, and no speaker was trained in professional pronunciation. The experimental results show that, compared with the general CNN-RNN model, the proposed architecture can effectively predict words from sequences of lip region images on our own dataset, and the accuracy of the proposed model is 88.2% on the test dataset, which is 3.3 percentage points higher than the general CNN-RNN. In future research, we will train the lip-reading recognition model on datasets of real-time broadcast videos, including video samples from news broadcasts and real-world environments, to explore our proposed approach for a speaker-independent video speech recognition system.

Author Contributions: Data curation, Y.L. and H.L.; Formal analysis, Y.L. and H.L.; Methodology, Y.L. and H.L.; Project administration, Y.L.; Resources, Y.L.; Supervision, Y.L.; Validation, H.L.; Visualization, H.L.; Writing—original draft, H.L.; Writing—review & editing, Y.L. and H.L.

Funding: This research was supported by the National Natural Science Foundation of China (61571013), by the Beijing Natural Science Foundation of China (4143061), by the Science and Technology Development Program of Beijing Municipal Education Commission (KM201710009003) and by the Great Wall Scholar Reserved Talent Program of North China University of Technology (NCUT2017XN018013).

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Jaimes, A.; Sebe, N. Multimodal human–computer interaction: A survey. Comput. Vis. Image Underst. 2007, 108, 116–134. [CrossRef]
2. Loomis, J.M.; Blascovich, J.J.; Beall, A.C. Immersive virtual environment technology as a basic research tool in psychology. Behav. Res. Methods Instrum. Comput. 1999, 31, 557–564. [CrossRef] [PubMed]

3. Hassanat, A.B. Visual passwords using automatic lip reading. arXiv, 2014; arXiv:1409.0924.
4. Thanda, A.; Venkatesan, S.M. Multi-task learning of deep neural networks for audio visual automatic speech recognition. arXiv, 2017; arXiv:1701.02477.
5. Biswas, A.; Sahu, P.K.; Chandra, M. Multiple cameras audio visual speech recognition using active appearance model visual features in car environment. Int. J. Speech Technol. 2016, 19, 159–171. [CrossRef]
6. Scanlon, P.; Reilly, R. Feature analysis for automatic speechreading. In Proceedings of the 2001 IEEE Fourth Workshop on Multimedia Signal Processing (Cat. No. 01TH8564), Cannes, France, 3–5 October 2001; pp. 625–630.
7. Matthews, I.; Potamianos, G.; Neti, C.; Luettin, J. A comparison of model and transform-based visual features for audio-visual LVCSR. In Proceedings of the IEEE International Conference on Multimedia and Expo, ICME 2001, Tokyo, Japan, 22–25 August 2001; pp. 825–828.
8. Aleksic, P.S.; Katsaggelos, A.K. Comparison of low- and high-level visual features for audio-visual continuous automatic speech recognition. In Proceedings of the 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, Montreal, QC, Canada, 17–21 May 2004; p. V-917.
9. Chitu, A.G.; Driel, K.; Rothkrantz, L.J. Automatic lip reading in the Dutch language using active appearance models on high speed recordings. In Proceedings of the International Conference on Text, Speech and Dialogue, Brno, Czech Republic, 6–10 September 2010; pp. 259–266.
10. Luettin, J.; Thacker, N.A.; Beet, S.W. Visual speech recognition using active shape models and hidden Markov models. In Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, Atlanta, GA, USA, 9 May 1996; pp. 817–820.
11. Shaikh, A.A.; Kumar, D.K.; Yau, W.C.; Azemin, M.C.; Gubbi, J. Lip reading using optical flow and support vector machines. In Proceedings of the 2010 3rd International Congress on Image and Signal Processing, Yantai, China, 16–18 October 2010; pp. 327–330.
12. Puviarasan, N.; Palanivel, S. Lip reading of hearing impaired persons using HMM. Expert Syst. Appl. 2011, 38, 4477–4481. [CrossRef]
13. Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3156–3164.
14. Gers, F.A.; Schmidhuber, J.; Cummins, F. Learning to forget: Continual prediction with LSTM. In Proceedings of the 9th International Conference on Artificial Neural Networks: ICANN'99, Edinburgh, UK, 7–10 September 1999.
15. Wang, Y.; Huang, M.; Zhao, L. Attention-based LSTM for aspect-level sentiment classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–5 November 2016; pp. 606–615.
16. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv, 2014; arXiv:1409.1556.
17. Matthews, I.; Cootes, T.F.; Bangham, J.A.; Cox, S.; Harvey, R. Extraction of visual features for lipreading. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 198–213. [CrossRef]
18. Fan, X.; Zhang, F.; Wang, H.; Lu, X. The system of face detection based on OpenCV. In Proceedings of the 2012 24th Chinese Control and Decision Conference (CCDC), Taiyuan, China, 23–25 May 2012; pp. 648–651.
19. King, D.E. Dlib-ml: A machine learning toolkit. J. Mach. Learn. Res. 2009, 10, 1755–1758.
20. Martins, A.; Astudillo, R. From softmax to sparsemax: A sparse model of attention and multi-label classification. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 1614–1623.
21. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [CrossRef]
22. Graves, A. Long short-term memory. In Supervised Sequence Labelling with Recurrent Neural Networks; Springer: Berlin/Heidelberg, Germany, 2012; pp. 37–45.
23. Cho, K.; Courville, A.; Bengio, Y. Describing multimedia content using attention-based encoder-decoder networks. IEEE Trans. Multimed. 2015, 17, 1875–1886. [CrossRef]

24. Zhang, Y.; Pezeshki, M.; Brakel, P.; Zhang, S.; Bengio, C.L.Y.; Courville, A. Towards end-to-end speech recognition with deep convolutional neural networks. arXiv, 2017; arXiv:1701.02720.
25. Graves, A. Generating sequences with recurrent neural networks. arXiv, 2013; arXiv:1308.0850.

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).