Deep Learning for Lip Reading Using Audio-Visual Information for Urdu Language
Muhammad Faisal, Information Technology University, Lahore, [email protected]
Sanaullah Manzoor, Information Technology University, Lahore, [email protected]

Abstract
Human lip-reading is a challenging task. It requires not only knowledge of the underlying language but also visual cues to predict spoken words. Experts need a certain level of experience and understanding of visual expressions to learn to decode spoken words. Nowadays, with the help of deep learning, it is possible to translate lip sequences into meaningful words. Speech recognition in noisy environments can be improved with visual information [1]. To demonstrate this, in this project we trained two different deep-learning models for lip-reading: the first for video sequences, using a spatio-temporal convolutional neural network, a bidirectional gated recurrent neural network, and the connectionist temporal classification (CTC) loss; and the second for audio, which feeds MFCC features into a layer of LSTM cells and outputs the sequence. We also collected a small audio-visual dataset to train and test our models. Our target is to integrate both models to improve speech recognition in noisy environments.

1. Introduction
The ability to recognize what is being said from visual information alone is an impressive skill but a difficult task for a novice. Lipreading is inherently ambiguous at the level of words and short sentences, first because of the absence of context, and second because some characters that produce the same lip sequences, called homophemes, are difficult to distinguish. This difficulty can be reduced by using the context of neighboring information: the neighboring words help to minimize the ambiguity of homophemes.

The performance of an automatic speech recognition (ASR) system can be increased with the help of visual information in noisy environments. Human lipreading performance, however, is not precise; consequently, there is an enormous need for automatic lip-reading systems. Such systems have many practical applications, such as dictating instructions or messages to a phone in a noisy environment, transcribing and re-dubbing archival silent films, security, biometric identification, resolving multi-talker simultaneous speech, and improving the performance of automated speech recognition in general [1, 2].

In noisy environments, speech recognition systems usually fail or perform poorly because of the extra noise signals. To overcome this problem and improve the performance of speech recognition in noisy environments, we can add visual information: the lip movements of the speaker can assist the speech recognition system.

There are many challenges that make the lipreading task very difficult; some of the major ones are:
• Absence of context
• Extraction of spatio-temporal features
• Some people are more expressive with their lips while others are not (visually-speechless persons)
• Generalization across speakers
• Guttural sounds (like the consonants K and G)
• Mumbling sounds

In this project, we trained two separate deep-learning-based models. The first model is LipNet [2], which was originally trained on the GRID dataset; we retrained it on our own dataset of the Urdu language. It is an end-to-end sentence-level lipreading model that operates at the character level using spatio-temporal convolutions (STCNNs), recurrent neural networks (RNNs), and finally the connectionist temporal classification (CTC) loss [3]. The second network processes the audio features using LSTMs and a categorical cross-entropy loss.

The next section summarizes the literature review. In Section 3 we briefly describe the network architecture, and our dataset is explained in Section 4. The implementation details are explained in Section 5. Finally, the experiments and results are discussed in Section 6.
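As a concrete illustration of the audio front end mentioned above, the following is a minimal sketch of MFCC feature extraction, assuming the librosa library. The file path, sampling rate, and number of coefficients are illustrative placeholders, not values taken from the paper.

```python
import librosa

def extract_mfcc(wav_path, sr=16000, n_mfcc=13):
    """Load one utterance and return its MFCC matrix, shaped (num_frames, n_mfcc)."""
    audio, sr = librosa.load(wav_path, sr=sr)              # resample to a fixed rate
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T                                          # time-major for an LSTM

# Example usage (hypothetical corpus path):
features = extract_mfcc("urdu_corpus/speaker01_word01.wav")
print(features.shape)
```

Frame-level features of this kind are what the audio model consumes as its input sequence.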
2. Literature Review
In this section, we outline several approaches to automatic lipreading and audio-visual speech recognition.

2.1 Lip Reading
A large body of work on lip reading was done with pre-deep-learning methods. Many approaches using convolutional neural networks have been proposed to recognize phonemes [4] and visemes [5] from still images of lip movement, instead of recognizing full words and sentences. A phoneme is the smallest distinguishable unit of sound that collectively makes up a spoken word; a viseme is its visual equivalent (a lip movement). Recently a deep-learning-based lip-reading model called LipNet [2] was proposed; it consists of three spatio-temporal convolutional layers, followed by bidirectional gated recurrent units (Bi-GRUs), and finally a connectionist temporal classification (CTC) loss. The authors reported 96% accuracy on the GRID dataset.

2.2 Audio-visual Speech Recognition
Audio-visual speech recognition is very similar to lip-reading, and their problems are closely linked. [6] used feed-forward deep neural networks (DNNs) to perform phoneme classification using a large non-public audio-visual dataset. Recently, [1] proposed a large, complex, end-to-end trainable network for audio-visual speech recognition; their model consists of two encoders and one decoder. They named their model the Watch, Listen, Attend and Spell (WLAS) network. The first encoder encodes the video frames by passing them through convolutional layers followed by stacked LSTMs, and a fixed-size encoded vector is stored. The second encoder encodes the audio MFCC features by passing them through stacked LSTMs. These two encoded vectors are then concatenated and fed into the decoder network, which also consists of stacked LSTMs; the decoder outputs the character sequence.

3. Network Architecture
The lip-reading architecture has two networks: one network is used for lip-reading context prediction, and the second network is deployed for speech recognition. In the following sections, details of both networks are reported. Figure 1 shows the lip-reading network and Figure 2 illustrates the configuration of our speech recognition network.

Figure 1: LipNet architecture. A sequence of T frames is used as input and is processed by 3 layers of STCNN, each followed by a spatial max-pooling layer. The extracted features are processed by 2 Bi-GRUs; each time step of the GRU output is processed by a linear layer and a softmax. This end-to-end model is trained with CTC [2].
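To make the Figure 1 pipeline concrete, the following is a minimal Keras sketch of an STCNN + Bi-GRU model with a per-time-step softmax, assuming TensorFlow 2.x. The filter counts, kernel sizes, GRU width, and character-set size are illustrative placeholders, not the exact LipNet hyper-parameters.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

T, H, W, C = 75, 50, 100, 3          # frames, height, width, channels (illustrative)
NUM_CHARS = 40                       # character set size + CTC blank (placeholder)

frames = layers.Input(shape=(T, H, W, C), name="lip_frames")

# Three spatio-temporal convolution blocks, each followed by spatial max-pooling.
x = frames
for filters in (32, 64, 96):
    x = layers.Conv3D(filters, kernel_size=(3, 5, 5), padding="same", activation="relu")(x)
    x = layers.MaxPool3D(pool_size=(1, 2, 2))(x)

# Flatten the spatial dimensions so each time step becomes one feature vector.
x = layers.TimeDistributed(layers.Flatten())(x)

# Two bidirectional GRU layers model temporal context in both directions.
x = layers.Bidirectional(layers.GRU(256, return_sequences=True))(x)
x = layers.Bidirectional(layers.GRU(256, return_sequences=True))(x)

# Per-time-step linear layer and softmax over the character set (including the CTC blank).
char_probs = layers.Dense(NUM_CHARS, activation="softmax")(x)

model = Model(frames, char_probs)
model.summary()
```

Training such a model would pair these per-frame character distributions with the target transcription through a CTC loss, for example via tf.nn.ctc_loss, so that no frame-level alignment of the labels is required.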
Figure 2: Architecture of the Speech Recognition Network

3.1 Spatiotemporal Convolution based Lip-Reading Context
Convolutional neural networks (CNNs) with stacked convolutional layers are instrumental to performance in computer-vision tasks such as object recognition [10]. A basic 2-dimensional convolutional layer is given by

[\mathrm{conv}(x, w)]_{c'ij} = \sum_{c=1}^{C} \sum_{i'=1}^{k_w} \sum_{j'=1}^{k_h} w_{c'ci'j'} \, x_{c,\, i+i',\, j+j'}

for input x and weights w \in \mathbb{R}^{C' \times C \times k_w \times k_h}. To process video data across time, spatiotemporal convolutional neural networks (STCNNs) are used in practice. The corresponding expression for an STCNN is:

[\mathrm{stcnn}(x, w)]_{c'tij} = \sum_{c=1}^{C} \sum_{t'=1}^{k_t} \sum_{i'=1}^{k_w} \sum_{j'=1}^{k_h} w_{c'ct'i'j'} \, x_{c,\, t+t',\, i+i',\, j+j'}

Recurrent neural networks (RNNs) improve the propagation and learning of information over time steps. We used a type of RNN known as the bidirectional GRU (Bi-GRU) [8]. The output of the STCNN, denoted by z, is fed into the Bi-GRU, and h_t is the hidden state of the Bi-GRU. One GRU maps \{z_1, z_2, \ldots, z_T\} \mapsto \{\overrightarrow{h}_1, \ldots, \overrightarrow{h}_T\} while the second GRU maps \{z_T, \ldots, z_2, z_1\} \mapsto \{\overleftarrow{h}_1, \ldots, \overleftarrow{h}_T\}; then h_t := [\overrightarrow{h}_t, \overleftarrow{h}_t]. The input to the GRUs is a sequence of T time steps.

We use the connectionist temporal classification (CTC) loss to overcome the problem of aligning the training data with the target outputs, as described in [3].

4. Dataset
To evaluate our model for lip-reading of Urdu speech words and phrases, we constructed a video-speech corpus. The corpus has video-audio recordings of 10 participants, including both males and females, and contains ten words and ten phrases of the Urdu language, as shown in Table 1. Detailed information about our dataset is available at [11]. Each participant was requested to repeat each word and phrase ten times, so the corpus contains 1000 videos of words and 1000 videos of phrases.

Table 1: Words and Phrases of the Urdu-Language Audio-Video Corpus

Words                         | Phrases
شروع (start)                  | تلاش ختم کریں (end search)
منتخب (select)                | معاف کیجئے گا (excuse me)
رابطے (contacts)              | میں معافی چاہتا ہوں (I apologize)
سمت شناسی (navigation)        | آپکا شکریہ (thank you)
آگے بڑھیں (go forward)        | خدا حافظ (goodbye)
پیچھے جائیں (go back)         | مجھے یہ کھیل پسند ہے (I like this game)
آغاز (begin)                  | آپ سے مل کر خوشی ہوئی (nice to meet you)
السلام علیکم (greeting)       | آپکا خیر مقدم ہے (you are welcome)
ویب (web)                     | اپنا خیال رکھئیے گا (take care of yourself)

The recorded videos are cropped to a standard size of 100x50. We applied the Viola-Jones [] face detector to each video to detect the face, then applied a mouth detector to locate the lips and cropped each video so that it contains only the lip movements. This process is illustrated in Figure 3.

Figure 3: Pre-processing of videos.

5. Implementation Details
For the video-based lip-reading task we re-implemented the network proposed by [2], whose architecture is shown in Figure 1. Mainly, the network consists of spatio-temporal convolutional layers (STCNNs), bidirectional gated recurrent units (Bi-GRUs), and the connectionist temporal classification (CTC) loss [3]. The first STCNN takes 75 frames as input at once and processes them together; each STCNN is followed by a max-pooling layer with a 2x2 mask.

Figure 5: Structure overview (left: unidirectional RNN, right: bidirectional RNN) [9]

The speech recognition network contains only 256 LSTM cells followed by a fully connected layer and a softmax classification layer. The LSTM takes MFCC features as input and the classification layer assigns each word to a class. The network architecture is illustrated in Figure 2; a minimal code sketch is given below. We implemented both networks on the TensorFlow platform [12].

6. Experiments and Results
We performed several experiments; the first was testing LipNet [2] on our dataset.
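Returning to the speech-recognition network of Section 5, the following is a minimal Keras sketch of an MFCC-to-class model with a 256-unit LSTM layer, a fully connected layer, and a softmax output, assuming TensorFlow 2.x. The padded utterance length, the number of MFCC coefficients, and the 20-class output (10 words plus 10 phrases) are illustrative assumptions rather than values reported in the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

MAX_FRAMES, N_MFCC = 200, 13   # padded utterance length and MFCC size (illustrative)
NUM_CLASSES = 20               # assumed: 10 words + 10 phrases, one class each

mfcc_input = layers.Input(shape=(MAX_FRAMES, N_MFCC), name="mfcc_features")

# A single layer of 256 LSTM cells summarizes the MFCC sequence into one vector.
x = layers.LSTM(256)(mfcc_input)

# Fully connected classification layer with a softmax over the word/phrase classes.
class_probs = layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = Model(mfcc_input, class_probs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```

The categorical cross-entropy loss used here matches the loss named for the audio model in the Introduction; each recorded word or phrase is treated as a single class.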