Lip Reading Using Neural Network and Deep Learning
Karan Shrestha
Department of Computer Science, Earlham College
Richmond, Indiana 47374
[email protected]

ABSTRACT
Lip reading is a technique for understanding words or speech through visual interpretation of the face, mouth, and lip movement, without the involvement of audio. The task is difficult because people use different diction and articulate speech in various ways. This project explores the use of machine learning, applying deep learning and neural networks to devise an automated lip-reading system. Two separate CNN architectures were trained on a subset of the dataset, and the trained lip-reading models were evaluated on their accuracy in predicting words. The best-performing model was implemented in a web application for real-time word prediction.

KEYWORDS
lip reading, computer vision, deep learning, convolutional neural network, web application, object detection

1 INTRODUCTION
Lip reading is a challenging problem, even for expert lip readers, and there is scope for it to be addressed using various methods of machine learning. Lip reading is a skill with salient benefits. Improvements in lip-reading technology increase the possibility of better speech recognition in noisy or loud environments. A prominent benefit would be advances in hearing aid systems for people with hearing disabilities. Similarly, for security purposes, a lip-reading system can be applied to speech analysis to extract information from a speaker when the audio in a video is corrupted or absent.

With the variety of languages spoken around the world and the differences in diction and in how words and phrases are articulated, it becomes substantially challenging to create a computer program that automatically and accurately recognizes spoken words solely from the visual lip movement of the speaker. Even expert lip readers are able to estimate only about every second word [7]. Thus, utilizing the capabilities of neural networks and deep learning algorithms, two architectures were trained and evaluated. Based on the evaluation, the better-performing model was further customized to enhance accuracy, and the architecture with the better overall accuracy was implemented in a web application to provide a real-time lip-reading system.

2 BACKGROUND
This section provides the background needed to understand this research project. It introduces the Haar feature-based cascade classifier and neural networks from prominent prior research and explains why they are used in my project for feature classification and for training the lip-reading model.

2.1 Haar Feature-Based Cascade Classifier
Paul Viola and Michael Jones introduced an efficient and robust system for object detection using Haar cascade classifiers [23]. The Haar feature-based classifier is a machine learning approach in which a cascade function is trained on a set of positive and negative images; the trained classifier is then used to detect objects in images. In my project, I use a Haar feature-based classifier to detect and track the region of interest (ROI), the mouth region, in the input speaker videos in order to preprocess the data before training the model with a neural network architecture.

Initially, the algorithm requires a set of positive images (frontal faces with the mouth visible) and negative images (non-frontal faces) to train the classifier. Specific features for the ROI are then extracted from the image. A feature is obtained as a single value by subtracting the sum of pixels under the white rectangle from the sum of pixels under the black rectangle. The Haar classifier detects particular features (the ROI) in an image using three kinds of rectangle features: edge features (two rectangles), line features (three rectangles), and four-rectangle features. The types of Haar cascade rectangle features are shown in Figure 1 [25].

Figure 1: Haar cascade rectangle features

The Haar feature-based classifier has three main characteristics: the integral image, an AdaBoost-inspired learning algorithm, and the attentional cascade method. The integral image is obtained by the cumulative addition of pixels: its value at a location (x, y) contains the sum of the pixels to the left of and above (x, y) [23]. When detecting an object (the ROI) in an image, the attentional cascade method is applied to each part of the image to provide an order for evaluating and checking features, so that regions rejected early do not have their remaining features computed. The AdaBoost learning algorithm improves the classification performance of a simple learning algorithm by selecting a small set of features to train the classifier. The attentional cascade combined with AdaBoost keeps redundant computation to a minimum. Together, these characteristics allow the Haar feature-based classifier to perform object detection with high precision and speed.
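To make the integral image and rectangle features concrete, the short Python sketch below (my own illustration, not code from the paper) builds an integral image with NumPy and evaluates a two-rectangle edge feature as the black-half sum minus the white-half sum; the window coordinates and patch size are arbitrary examples.

import numpy as np

def integral_image(img):
    # ii[y, x] holds the sum of all pixels above and to the left of (y, x), inclusive.
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, h, w):
    # Sum of pixels inside a rectangle using only four lookups on the integral image.
    a = ii[top + h - 1, left + w - 1]
    b = ii[top - 1, left + w - 1] if top > 0 else 0
    c = ii[top + h - 1, left - 1] if left > 0 else 0
    d = ii[top - 1, left - 1] if top > 0 and left > 0 else 0
    return a - b - c + d

def edge_feature(ii, top, left, h, w):
    # Two-rectangle (edge) feature: the window is split vertically into a white
    # half and a black half; the value is the black-half sum minus the white-half sum.
    white = rect_sum(ii, top, left, h, w // 2)
    black = rect_sum(ii, top, left + w // 2, h, w // 2)
    return black - white

# Example on a random 8-bit grayscale patch standing in for a mouth region.
patch = np.random.randint(0, 256, size=(24, 24)).astype(np.int64)
ii = integral_image(patch)
print(edge_feature(ii, top=4, left=4, h=8, w=12))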
Wang et al. [24] proposed an approach for real-time lip detection and tracking using a Haar-like feature classifier. Their method combined the primitive Haar-like feature with a variance value to construct a variance-based Haar-like feature, which allowed the human face and lips to be represented with a small number of features. A Support Vector Machine (SVM) was used for training and classification, combined with a Kalman filter for real-time detection and tracking of the lip movement. As reported in the paper, the results of using the Haar-like feature classifier were significantly better than those of other machine learning approaches for detecting lip movement.

To accurately detect and track lip movement when preprocessing the video data, I used a Haar feature-based cascade classifier. Its high computation speed and precision benefited my project in detecting and tracking the mouth region (ROI) of speakers in video inputs from the Lip Reading in the Wild (LRW) dataset.
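As a rough illustration of this preprocessing step, the Python sketch below uses OpenCV's bundled Haar cascades to locate the largest face in each video frame and crop a mouth ROI from its lower half. The cascade choices (the stock frontal-face and smile cascades), the lower-half heuristic, the detection parameters, and the 112x112 crop size are my assumptions for demonstration, not the exact pipeline used in this project.

import cv2

# Pre-trained cascades that ship with OpenCV; the smile cascade is used here as a
# stand-in for a dedicated mouth detector (an assumption, not the paper's choice).
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
mouth_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_smile.xml")

def extract_mouth_rois(video_path, size=(112, 112)):
    # Yield a cropped, resized grayscale mouth ROI for each frame of a video.
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(faces) == 0:
            continue
        x, y, w, h = max(faces, key=lambda f: f[2] * f[3])   # keep the largest face
        lower_face = gray[y + h // 2 : y + h, x : x + w]     # search the lower half only
        mouths = mouth_cascade.detectMultiScale(lower_face, scaleFactor=1.7,
                                                minNeighbors=11)
        if len(mouths) == 0:
            continue
        mx, my, mw, mh = max(mouths, key=lambda m: m[2] * m[3])
        roi = lower_face[my : my + mh, mx : mx + mw]
        yield cv2.resize(roi, size)
    cap.release()

# Example usage on one LRW-style clip (hypothetical file name):
# frames = list(extract_mouth_rois("ABOUT_00001.mp4"))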
2.2 Neural Network
A neural network (NN) is a computing system with properties loosely similar to those of the nervous system in the human brain. A neural network is a framework of algorithms working together to identify the underlying relations in a dataset and provide results in the best way possible. Between the input and output layers are so-called hidden layers, each with its own functionality, which combine to perform specific tasks [8]. As noted by Hammerstrom, neural networks have been used in various computing tasks such as image recognition, computer vision, character recognition, stock market prediction, medical applications, and image compression [6]. For my lip-reading system, I use a specific type of deep learning neural network called a convolutional neural network.

Convolutional Neural Network
A convolutional neural network (CNN) is a class of neural network organized as a standard multi-layered network, with one or more layers connected in series. A CNN is capable of exploiting the local connectivity in high-dimensional data such as datasets composed of images and videos, which makes it applicable to computer vision and speech recognition [13].

Prior work has applied CNNs directly to sequences of lip movements. One such paper discusses different data input techniques and the use of temporal fusion architectures to analyze and compare the efficiency of the CNN implementation, and it introduces the 3D Convolution with Early Fusion (EF-3) architecture, which I incorporate in my project. The paper also analyzes handling temporal sequences with models such as hidden Markov models and recurrent neural networks (RNNs), which are less capable of adapting to motion in the images and therefore lower the prediction accuracy of the trained models; the argument for CNNs, in contrast, is that they handle moving subjects with higher precision. The CNN model in their implementation achieved a top-1 accuracy of 65.4%, meaning it predicted the correct word label 65.4% of the time.

Similarly, Li et al. [15] conducted significant research on dynamic feature extraction that also integrates a deep learning model. The primary aim of their research was to use a CNN to limit the negative effects caused by unstable subjects in the videos and images and by face alignment blurred during feature extraction. Their findings modified the perspective of my project, since they account for shaking or unstable subjects in the dataset I chose for training the model. A major constituent of their work was a novel dynamic feature computed from a collection of lip images, together with an effective CNN architecture for word recognition. Their results showed 71.76% accuracy in recognizing words using the dynamic feature, 12.82% higher than the result obtained with conventional static features (computed from static images).

Noda et al. [17] applied a CNN as the visual feature extraction mechanism in their visual speech recognition system. The CNN was trained on images of the speaker's mouth area paired with phoneme labels; a phoneme is a distinct unit of sound that distinguishes one word from another in a language. The model learned multiple convolutional filters that extract the essential visual features for recognizing phonemes. Temporal dependencies among the phonemes generated as label sequences were then handled by a hidden Markov model that recognized multiple isolated words. Their proposed combination of a CNN and a hidden Markov model for visual feature extraction significantly outperformed models that relied on "conventional dimensionality compression approaches," and the results demonstrated 58% phoneme recognition accuracy from image sequences of the raw mouth area [17].

A basic CNN has four significant parts: a convolutional layer, an activation function, a pooling layer, and a fully connected layer. The convolutional layer uses a set of learnable filters to learn parameters from the input data. The activation function is a non-linear