Deep Learning for Lip Reading using Audio-Visual Information for Urdu Language Muhammad Faisal Sanaullah Manzoor Information Technology University Information Technology University Lahore Lahore
[email protected] [email protected] Abstract But human lipreading performance is not Human lip-reading is a challenging task. It requires not precise consequently there is enormous need of only knowledge of underlying language but also visual automatic lip-reading system. It has many clues to predict spoken words. Experts need certain level practical applications such as dictating of experience and understanding of visual expressions learning to decode spoken words. Now-a-days, with the instructions or messages to a phone in a noisy help of deep learning it is possible to translate lip environment, transcribing and re-dubbing sequences into meaningful words. The speech recognition archival silent films, security, biometric in the noisy environments can be increased with the visual identification, resolving multi-talker information [1]. To demonstrate this, in this project, we simultaneous speech, and improving the have tried to train two different deep-learning models for lip-reading: first one for video sequences using spatio- performance of automated speech recognition in temporal convolution neural network, Bi-gated recurrent general [1, 2]. neural network and Connectionist Temporal Classification Usually in noisy environments speech Loss, and second for audio that inputs the MFCC features recognition systems fails or performs poorly, to a layer of LSTM cells and output the sequence. We have because of the extra noise signals. To overcome also collected a small audio-visual dataset to train and test our model.