Lip Reading by Alternating Between Spatiotemporal and Spatial Convolutions
Dimitrios Tsourounis *, Dimitris Kastaniotis and Spiros Fotopoulos

Department of Physics, University of Patras, 26504 Rio Patra, Greece; [email protected] (D.K.); [email protected] (S.F.)
* Correspondence: [email protected]

J. Imaging 2021, 7, 91. https://doi.org/10.3390/jimaging7050091

Abstract: Lip reading (LR) is the task of predicting the speech utilizing only the visual information of the speaker. In this work, for the first time, the benefits of alternating between spatiotemporal and spatial convolutions for learning effective features from LR sequences are studied. In this context, a new learnable module named ALSOS (Alternating Spatiotemporal and Spatial Convolutions) is introduced in the proposed LR system. The ALSOS module consists of spatiotemporal (3D) and spatial (2D) convolutions along with two conversion components (3D-to-2D and 2D-to-3D), providing a sequence-to-sequence mapping. The designed LR system utilizes the ALSOS module in-between ResNet blocks, as well as Temporal Convolutional Networks (TCNs) in the backend for classification. The whole framework is composed of feedforward convolutional along with residual layers and can be trained end-to-end directly from the image sequences in the word-level LR problem. The ALSOS module can capture spatiotemporal dynamics and can be advantageous in the task of LR when combined with the ResNet topology. Experiments with different combinations of ALSOS with ResNet are performed on a dataset in the Greek language, simulating a medical support application scenario, and on the popular large-scale LRW-500 dataset of English words. Results indicate that the proposed ALSOS module can improve the performance of an LR system. Overall, the insertion of the ALSOS module into the ResNet architecture obtained higher classification accuracy, since it incorporates the contribution of the temporal information captured at different spatial scales of the framework.

Keywords: lip reading; temporal convolutional networks; spatiotemporal processing

1. Introduction

Convolutional Neural Networks (CNNs) have been used successfully in single-image recognition tasks, like classification [1], detection [2], and segmentation [3], while lately they have started to become a promising choice for visual sequence modeling tasks too [4-6]. Accordingly, CNNs have recently been evaluated in sequence modeling tasks with various data types including, amongst others, audio synthesis [7], language modeling [8], and machine translation [9], as well as image sequence modeling [10]. In this manner, the extraction of image-based features happens independently for each frame of the sequence using a 2D CNN, and then a feature-level sequence modeling stage is utilized, such as either simple pooling of the features or a recurrent architecture, to predict the video action. These image sequence modeling tasks are related to video understanding, like activity recognition [11] or lip reading [12-14].
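To make this frame-wise pipeline concrete, the following is a minimal PyTorch sketch (not the authors' code) of per-frame 2D-CNN feature extraction followed by simple temporal pooling; the tiny backbone, the 29-frame clips, and the 88x88 mouth crops are illustrative assumptions rather than settings taken from this paper.

import torch
import torch.nn as nn

class FrameWisePooling(nn.Module):
    def __init__(self, num_classes=500):
        super().__init__()
        # Any 2D CNN works here; a tiny stand-in backbone is used for brevity.
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),      # -> (B*T, 32)
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x):                    # x: (B, T, C, H, W) image sequence
        b, t, c, h, w = x.shape
        feats = self.backbone(x.reshape(b * t, c, h, w))  # per-frame features
        feats = feats.reshape(b, t, -1).mean(dim=1)       # simple temporal pooling
        return self.classifier(feats)

logits = FrameWisePooling()(torch.randn(2, 29, 1, 88, 88))  # -> (2, 500)

Note that the temporal pooling discards frame order entirely; this is exactly the limitation that recurrent architectures, and later spatiotemporal convolutions, are meant to address.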
A natural extension of the spatial (2-Dimensional) convolutions for video is the spatiotemporal (3-Dimensional) convolution, which incorporates both the spatial relations within each image frame and the temporal relations between consecutive frames [10]. Despite the progress in representation learning from image sequences, significant progress has also been made by adopting feedforward architectures for the task of sequence encoding. Architectures based on Temporal Convolutional Networks (TCNs) tend to replace the recurrent architectures, since TCNs exhibit several interesting properties: they can hold a long effective memory, introduce more layers as well as explicit depth, and learn in a feedforward manner with stable gradients [15]. Additionally, TCNs offer significant improvements implementation-wise, as they parallelize the processing load better [4]. Thus, TCNs gain ground over conventional approaches for visual sequence modeling, such as Long Short-Term Memory networks (LSTMs) and Gated Recurrent Units (GRUs), mainly due to their applicability advantages [4].

In the problem of automatic Lip Reading (LR), also known as Visual Speech Recognition (VSR), the objective is to recognize a spoken word using visual-only information from the speaker. This information is mainly focused on the mouth region [12], which is the optical output of the human voice system. However, several approaches investigate the contribution of other parts of the face, as for example the upper face and the cheeks [16]. The LR problem can be formulated as an image sequence modeling task in which frames provide information about different phonemes that, in a particular sequence, compose a word.
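As an illustration of how TCNs obtain a long effective memory with explicit depth, the snippet below sketches a residual temporal convolution block with exponentially growing dilation. It is a simplified sketch in PyTorch: real TCNs such as [4] typically use causal padding, weight normalization, and dropout, and all layer sizes here are assumptions.

import torch
import torch.nn as nn

class TemporalBlock(nn.Module):
    """One dilated 1D-convolution block with a residual connection."""
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        pad = (kernel_size - 1) // 2 * dilation   # 'same' padding (non-causal)
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=pad, dilation=dilation)
        self.relu = nn.ReLU()

    def forward(self, x):                   # x: (B, C, T) feature sequence
        return self.relu(x + self.conv(x))  # residual path keeps gradients stable

# Stacking blocks with doubling dilation grows the temporal receptive
# field exponentially with depth.
tcn = nn.Sequential(*[TemporalBlock(256, dilation=2 ** i) for i in range(4)])
out = tcn(torch.randn(2, 256, 29))          # -> (2, 256, 29)

With kernel size 3 and dilations 1, 2, 4, and 8, the four-block stack above already covers a 31-frame window, enough to span a typical word-level clip, while every layer processes all time steps in parallel.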
A typical LR system based on deep CNNs is composed of a processing pipeline utilizing two basic components, the frontend and the backend, as illustrated in Figure 1. The frontend deals with mapping the input image sequence into a vectorial representation, while the backend refers to the sequence modeling [17]. Thus, firstly, the frames from a spoken speech time interval are processed in order to detect and isolate the region of interest, i.e., the mouth area. In some cases, the face sequence is aligned beforehand, ensuring that the mouth area is seen normalized by the network. Next, the mouth image sequence is fed to a spatiotemporal module with 3-Dimensional (3D) convolutions to extract both spatial and temporal information. Then, a CNN is utilized as a feature extractor for each frame independently. The obtained feature vectors are the inputs of a sequence-to-sequence mapping with a sequence modeling encoding method, such as recurrent architectures [16,18-20] or, more recently, Temporal Convolutional Networks (TCNs) [21,22]. The output of the sequence modeling usually feeds a fully connected network that provides the class probabilities and thus the predicted spoken word. Thus, the sequential data is encoded via the spatiotemporal stage to extract low-level features that are learned in the first layers, followed by the frame-wise vectorial representation
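The alternation idea behind ALSOS can be pictured with a short, hypothetical PyTorch sketch: a 3D convolution is followed by a 2D convolution, with the 3D-to-2D conversion realized here by folding time into the batch axis and the 2D-to-3D conversion by unfolding it back. This is only one plausible reading of the conversion components described in the abstract, not the module as implemented by the authors.

import torch
import torch.nn as nn

class AlternatingBlock(nn.Module):
    """Alternates a spatiotemporal (3D) and a spatial (2D) convolution.

    The 3D-to-2D step folds time into the batch axis and the 2D-to-3D
    step unfolds it again; the actual ALSOS conversions may differ.
    """
    def __init__(self, channels):
        super().__init__()
        self.conv3d = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.conv2d = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):                        # x: (B, C, T, H, W)
        x = self.relu(self.conv3d(x))            # spatiotemporal features
        b, c, t, h, w = x.shape
        x = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)   # 3D -> 2D
        x = self.relu(self.conv2d(x))            # per-frame spatial features
        x = x.reshape(b, t, c, h, w).permute(0, 2, 1, 3, 4)    # 2D -> 3D
        return x

seq = torch.randn(2, 64, 29, 22, 22)             # (B, C, T, H, W)
print(AlternatingBlock(64)(seq).shape)           # torch.Size([2, 64, 29, 22, 22])

Folding the time axis into the batch is a common way to run a 2D convolution frame-wise while keeping the sequence length intact, which is what makes such a block insertable in-between ResNet stages as a sequence-to-sequence mapping.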