
Towards Robust Visual Speech Recognition

Automatic Systems for Lip Reading of Dutch


PROEFSCHRIFT

to obtain the degree of doctor at Delft University of Technology, by the authority of the Rector Magnificus, Prof. ir. K.C.A.M. Luyben, chairman of the Board for Doctorates, to be defended in public on Tuesday 2 November 2010 at 12:30 by

Alin Gavril CHIȚU

Master of Science in Computer Science, Universitatea București
Master of Science (ir) in Applied Mathematics, Technische Universiteit Delft
born in Bușteni, Romania.

This dissertation has been approved by the supervisors:
Prof. dr. C.M. Jonker
Prof. dr. drs. L.J.M. Rothkrantz

Composition of the doctoral committee:
Rector Magnificus, chairman
Prof. dr. C. M. Jonker, Delft University of Technology, promotor
Prof. dr. drs. L. J. M. Rothkrantz, Netherlands Defence Academy, promotor
Prof. dr. ir. F. W. Jansen, Delft University of Technology
Prof. dr. I. Heynderickx, Delft University of Technology
Prof. dr. I. Văduva, University of Bucharest
Prof. dr. Ing. Elmar Nöth, University of Erlangen-Nürnberg
Dr. K. C. H. M. Nieuwenhuis, DECIS Lab & TRT-NL

The work was carried out in the Man-Machine Interaction Group in the Mediamatics Department at Delft University of Technology and was part of the Interactive Collaborative Information Systems (ICIS) project financed by the Dutch Ministry of Economic Affairs, grant nr: BSIK03024.

Copyright © 2010 by Alin Gavril Chițu.

ISBN: 978-90-8570-697-7

All rights reserved. No part of the material protected by this copyright notice may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording or by any information storage and retrieval system, without prior written permission from the author. For more information please contact the author at chitu [email protected]

Printed in The Netherlands by Wöhrmann Print Service.

To my dearest wife Dana and our sweetest daughters Ana and Mina,

Contents

1 Introduction 1
  1.1 Lip Reading by Humans ...... 1
  1.2 Automatic Lip Reading ...... 2
  1.3 Societal Relevance - Applications ...... 3
    1.3.1 Crisis Management - The ICIS Project ...... 3
    1.3.2 Lip Reading Applications ...... 3
  1.4 Problem Definition and Objective ...... 4
  1.5 Outline of the Dissertation ...... 7

2 Methodology, Definitions and the State of the Art 11
  2.1 Building an Automatic Lip Reader: Overview ...... 11
  2.2 Recognition Task ...... 13
  2.3 Data Corpus ...... 15
    2.3.1 On Building a Data Corpus for Lip Reading: A Comparison ...... 16
    2.3.2 High Speed Recordings ...... 20
    2.3.3 Data Annotation Versus Data Parametrization ...... 22
    2.3.4 Conclusions ...... 23
  2.4 Data Parametrization ...... 24
    2.4.1 Feature Vectors Definition ...... 24
    2.4.2 Lip Reading Relevant Feature Space ...... 26
    2.4.3 Higher Order Features ...... 27
    2.4.4 Image Segmentation Fundamentals ...... 28
  2.5 Lip Reading Primitives ...... 33
    2.5.1 Phonemes ...... 33
    2.5.2 From Phonemes to Visemes ...... 34
    2.5.3 Separability of Utterances as a Result of Viseme Definition ...... 39
  2.6 Hidden Markov Models Methodology ...... 40
    2.6.1 Modelling the Visemes Using HMM ...... 42


    2.6.2 Silence and Pause Models ...... 43
    2.6.3 Modelling the Low Level Context Using Tri-visemes ...... 44
    2.6.4 Gaussian Mixtures ...... 45
  2.7 Language Models ...... 45
    2.7.1 Grammar Based Language Models ...... 46
    2.7.2 n-Grams ...... 46
  2.8 Measures for System Performance ...... 47
  2.9 State of the Art in Lip Reading ...... 48

3 Computational Models 51
  3.1 Hidden Markov Models ...... 51
    3.1.1 The Forward-Backward Algorithm ...... 55
    3.1.2 The Viterbi Algorithm ...... 57
    3.1.3 The Baum-Welch Algorithm ...... 58
  3.2 Principal Component Analysis ...... 59
  3.3 Optical Flow Analysis ...... 64
    3.3.1 Differential Approach Overview ...... 67
    3.3.2 Lucas-Kanade Algorithm ...... 67
    3.3.3 Horn-Schunck Algorithm ...... 68
    3.3.4 Performance Measures ...... 69
  3.4 Active Appearance Models ...... 70
    3.4.1 Shapes and Landmarks ...... 71
    3.4.2 The Statistical Model of the Variance ...... 72
    3.4.3 AAM Fitting ...... 74

4 Data Acquisition 77
  4.1 Data Corpus Requirements ...... 77
  4.2 Recording Setup ...... 78
    4.2.1 Video Devices ...... 79
    4.2.2 Audio Devices ...... 79
    4.2.3 Side View Recordings ...... 80
    4.2.4 Controlling the Recording Environment ...... 80
    4.2.5 Prompter Software ...... 81
    4.2.6 Consistency Check During Post Processing ...... 81
  4.3 Data Corpus Statistics ...... 83
    4.3.1 Utterance Types ...... 83
    4.3.2 Respondents ...... 83
    4.3.3 Language Data ...... 84
    4.3.4 Speech Rate ...... 87
    4.3.5 Viseme Coverage ...... 88
  4.4 Discussion ...... 91

5 Statistical Lip Geometry Estimation for Lip Reading 93
  5.1 Description of the Algorithm ...... 93
    5.1.1 Defining the Region Of Interest ...... 94
    5.1.2 Lips Segmentation ...... 94
    5.1.3 Defining the Feature Vectors ...... 95
    5.1.4 Visual Validation of the Feature Vectors ...... 97
    5.1.5 Refinement of the Filter Results: Outlier Removal ...... 99
    5.1.6 Cavity, Tongue and Teeth Description ...... 100
  5.2 Lip Reading Results ...... 101
  5.3 Conclusions ...... 106

6 Active Appearance Models for Lip Reading 107
  6.1 Description of the Algorithm ...... 107
    6.1.1 Facial Model for Lip Reading ...... 107
    6.1.2 AAM Results on the Training Data ...... 110
    6.1.3 Defining the Feature Vectors ...... 111
    6.1.4 Visual Validation of the Feature Vectors ...... 112
  6.2 AAM as ROI Detection Algorithm ...... 115
  6.3 Lip Reading Results ...... 115
  6.4 Conclusions ...... 118

7 Optical Flow Analysis for Lip Reading 121
  7.1 Description of the Algorithm ...... 121
    7.1.1 Defining the Region Of Interest ...... 122
    7.1.2 Defining the Feature Vectors ...... 123
    7.1.3 Visual Validation of the Feature Vectors ...... 124
  7.2 Lip Reading Results ...... 125
  7.3 Conclusions ...... 128

8 Further Analysis of the Results and Other Experiments 131
  8.1 Comparison Based on the Feature Extraction Method ...... 131
  8.2 Comparison Based on the Type of Speech ...... 132
  8.3 Influence of High Speed Recordings ...... 133
  8.4 Influence of the Size of the Training Corpus ...... 135
  8.5 Conclusions ...... 139

9 Conclusions: Summing Up, General Thoughts and Future Directions 141

A Data Corpora for Lip Reading 147

B Grammars for Recognition Tasks 149
  B.1 Digit String Recognition Task (CD) ...... 149
  B.2 Letter String Recognition Task (CL) ...... 149
  B.3 Complete Grammar Recognition Task (GU) ...... 149

C Utterance Types 151

D Utterances Sample 153

Bibliography 157

Summary 169

Samenvatting 171

Acknowledgements 173

Curriculum Vitae 177

Colophon 179

Chapter 1

Introduction

1.1 Lip Reading by Humans

Lip reading was for many years considered to be specific to hearing impaired persons. Therefore, lip reading was regarded as one possible solution to an abnormal situation. Even the name of the domain suggests that lip reading was considered a rather artificial way of communication, because it associates lip reading with reading written language, which is a relatively recent cultural phenomenon and not an evolutionarily inherited ability. Extensive lip reading research was primarily done in order to improve the teaching methodology for hearing impaired persons and thus increase their chances for integration in society. Later on, research on human perception, and more exactly on speech perception, proved that lip reading is actively employed to different degrees by all humans irrespective of their hearing capacity. The most well-known study in this respect was performed by Harry McGurk and John MacDonald in 1976. In their experiment the two researchers were trying to understand the perception of speech by children. Their finding, now called the McGurk effect, published in Nature [Mcg76], was that if a person is presented a video sequence of a certain utterance (in their experiments the utterance 'ga') while at the same time the audio presents a different utterance (in their experiments the sound 'ba'), in a large majority of cases the person will perceive a third utterance (in this case 'da'). Subsequent experiments showed that this also holds for longer utterances and that it is not a particularity of the visual and aural senses, but is also true for other perception functions. Therefore, lip reading is part of our multi-sensory speech perception process and could be better named visual speech recognition. Being an evolutionarily acquired capacity, same as speech perception, some scientists consider lip reading's neural mechanism to be the one that enables humans to achieve high literacy skills with relative ease [Att06]. Another source of confusion is the word "lip", because it implies that the lips are the only part of the speaker's face that transmits information about what is being said. The teeth, the tongue and the cavity were shown to be of great importance for lip reading by humans ([Wil98a]). Also other face elements were shown to be

important during face-to-face communication; however, their exact influence is not completely elucidated. During experiments in which a gaze tracker was used to track the areas of the speaker's face that attract attention during communication, it was found that human lip readers focus on four major areas: the mouth, the eyes and the centre of the face, depending on the task and the noise level ([Buc07]). In normal situations the listener scans the mouth and the other areas for relatively equal periods of time. However, when the background noise increases, the centre of the face becomes the central point of attention. Most probably the visual modality becomes extremely active in these situations. When the task was set to inferring the emotional load of the interlocutor, the listener's gaze shifted towards the eyes, since they convey more emotion related information. It is well accepted that human lip readers make great use of the context in which the interaction takes place. This can be one of the reasons the human listener scans the entire face during the interaction. In [Hil09] the authors found that when a human lip reader was presented with appearance information, compared with only mouth shapes, his performance increased considerably, from 42.9% to 71.6%. We should realise that during face-to-face interaction a human engages in a complex process which involves various channels of information corresponding to our senses. In this way the speaker builds up the context using both verbal and non-verbal cues such as body gestures, facial expressions and other physiological manifestations. Other information about the setting in which the communication takes place is used as well, as is the knowledge accumulated in time through experience. A human is a multi-modality, multi-sensory, multi-media fusing machine.

1.2 Automatic Lip Reading

Automatic lip reading emerged as a research domain relatively late, in the 1980s. After a slow start due to the lack of computational power at the time, the interest in this domain has increased considerably starting with the late 1990s. In the beginning it was seen as a natural way to enhance the more mature speech recognition systems; later on, however, its stand-alone value was also considered. In the literature automatic lip reading is referred to in many ways, with or without the word "automatic", some of which are: lipreading, lip-reading, speech-reading, optically based speech recognition, visually based speech recognition, audio-visual speech recognition, etc. In this dissertation "lip reading" is in general used instead of "automatic lip reading". With the increase in computational power and the generalized access to cheaper and more powerful information systems (in the less developed parts of the world mobile phones play an important role; the number of mobile cellular subscriptions worldwide is approaching 5 billion at the time of this writing), there is an increased need, but also increased affordability, to improve the interface between human users and their artificial assistants. The goal is to make the interaction easier for a larger range of users, therefore coping with the personal needs of more categories of users, and closer to the natural ways in which humans communicate. Therefore, the next generation of user interfaces should permit communication through speech and lip reading, through body gesture, but also other means. Another very active, connected research domain is emotional state recognition, which would enable the system to adapt to the current state of the user. With the emergence of the "ubiquitous computing" paradigm, human-computer interaction goes to a different level, where the information processing units are thoroughly embedded into everyday objects. The next generation of user interfaces will need to sense and understand the environment and its users with the use of natural ways of communication. The artificial systems of tomorrow will be omnipresent but at the same time less visible to the user, and therefore it will be necessary that interaction be possible with a user that is unaware of their presence.

1.3 Societal Relevance - Applications

1.3.1 Crisis Management - The ICIS Project

After the events of September 11, 2001, the need for developing more sophisticated techniques for improving the response to and management of crisis situations has become evident. The present research was done as part of the Interactive Collaborative Information Systems (ICIS) [D-C04] project supported by the Dutch Ministry of Economic Affairs. The goal of ICIS was to develop better techniques for making complex information systems more intelligent and supportive in critical decision making situations. Although the research results generated in the project are generally applicable, ICIS focused on two key application domains: traffic management and crisis response. In the case of the crisis management application, one key element that could be sustained through intelligent systems is the communication among the actors in the field and the communication between the decision units and the acting units [Fit07]. During a crisis event the performance of acoustically based speech recognition drastically decreases due to the large background noise. Therefore, lip reading is necessary in order to make this important means of communication usable in these situations [Fit10].

1.3.2 Lip Reading Applications

As we mentioned before, the first application of automatic lip reading was to improve automatic speech recognition. This means fusing acoustic information with visual information in order to stabilise the performance in case of increased background noise. Fusing multi-modal data is a research topic in itself, and is beyond the scope of this thesis. Later on, lip reading was also seen as a stand-alone application. As a stand-alone technology, automatic lip reading's first choice application is improving the interaction of hearing impaired persons. A phone enhanced with lip reading and talking face capabilities would introduce this communication technology, which is so common for the rest of us, to speaking impaired persons. For instance the SynFace [syn01] project aimed to develop a real-time system able to deliver phonetically informative lip reading cues, derived from parameters extracted from the acoustic speech signal, to assist people during phone conversations. Recent mobile phone models add lip reading technology in order to make possible the use of the phone in places where the etiquette does not permit loud speech (e.g. in a theatre). This is part of the so called silent speech interface endeavour [Sch10]. Another application which recently received great interest is the recovery of speech from deteriorated or completely mute video footage. Lip reading technology was applied to archived home videos of Adolf Hitler. Filmed by Eva Braun during the war, the films were recorded with the technology of their time and provided no sound. The recovery of the speech in these films can be important from a historical point of view, since it could provide new insights on the dictator's life. In the previous section we discussed lip reading technology in the context of crisis management. Similarly, a newer idea is to use lip reading for security applications. Lip reading can empower surveillance systems with the means of recovering speech information from a distance even when no audio devices are available. In general, lip reading will be an integral part of the next generation of user interfaces.

1.4 Problem Definition and Objective

Even though more than 20 years have passed since its birth, the lip reading research community is still searching for its giant leap. Starting as a complementary process for speech recognition, it has largely remained, until now, in its shadow. With the increase in the available computational power and the appearance of new applications for lip reading this situation is going to change rapidly. One of the main problems with lip reading is the fact that there is still no consensus in the scientific community on the parametrization that yields the highest performance. To this moment it is still not well understood how people lip read and, therefore, where the information on what is being said can be found. The fact that there is no underlying model that defines the basic structures on which recognition systems are to be built is an equally important problem. While in the acoustic domain the theory of phones and phonemes is mature and has a profound structure, in the visual domain the definition of the basic elements, the visemes, is relatively vague. A viseme is a set of phonemes which are easily confused with each other by a human lip reader. In some sense we have a pseudo-definition, since the viseme sets are not uniquely defined and are interpretable at best. This approach not only makes it difficult to interpret and compare the results of different experiments but also introduces a theoretical upper limit to the performance of the lip reader, which is always a fraction of the performance of the speech recognizer. Other problems, which are still proving to be restrictive in spite of the advances in processing power, bandwidth and storage space, are data acquisition, data annotation and data parametrization. The amount of man hours required is still too high for these problems to be easily surmounted. This is because not all the processes involved in data acquisition can be automated. Therefore, the data corpora for lip reading are still not large enough to train continuous speech lip readers. Also, with respect to data parametrization, the available computational power does not always suffice to obtain real time systems.

As in other computer vision related domains, there are big problems with the processing of the data. These problems come from the instability of the pattern recognition algorithms, which are still not developed enough to deal with the visual artefacts induced by illumination variation and occlusion. In real life situations the problems of robust detection and tracking of objects are not yet solved. These problems are beyond the scope of this thesis, but they greatly influence the performance of automatic lip reading and, therefore, need to be acknowledged. Given all these problems, lip reading might still remain for some time in the shadow of acoustic speech recognition. However, we should remember that the stated goal of lip reading at its beginnings was exactly this: to act as an enhancer for speech recognition. Therefore, it is not necessary to have perfect lip reading if we can achieve the goal of improving the speech recognition process in general. In this sense lip reading is only one of many directions that can bring valuable insights for speech recognition (e.g. emotion recognition, gesture and body posture recognition, topic recognition and, in general terms, any other method that can provide context to the system). Fusing information coming from multiple sources is another research topic in itself, and is beyond the scope of this thesis. Lip reading is one more important step towards a multi-modal speech recognizer and, extrapolating, towards a multi-sensorial interaction with our artificial partners. This is probably our highest goal after all, to achieve the same level of communication as with human interlocutors. This would enable entirely new applications. However, there are situations when the importance of the lip reader increases considerably. Namely, in noisy environments, where the acoustic modality becomes unusable, the lip reader may entirely replace the speech recognizer. Therefore, it is necessary to investigate the limits of lip reading and also the possibilities to boost its performance. The endeavour of system developers to ease the communication with human users, but also to accommodate a larger group of human users, is consistent with our desire to continuously improve the quality of life. We witness this improvement through a continuously broadening set of tools. The main research question that this thesis addresses is:

What are the possibilities to boost the performance of a lip reader such that a robust recognition system is obtained?

In order to answer this question we investigated the limitations of the current approaches and the state of the art, and contributed various solutions that improved the process of building a lip reader (e.g. we standardized the data acquisition process) and boosted the lip reading performance (e.g. we investigated various data parametrization techniques and we analyzed the lip reading performance under various settings). More detailed research questions that this thesis addresses are:

1. Are we able to build a lip reader for the Dutch language? This lip reader should be able to achieve good performance (i.e. comparable with or higher than the performance obtained by other researchers for other languages) on tasks of various complexity.

2. For speech recognition the most successful approach, to date, is the Bayesian probabilistic approach of Hidden Markov Models. Is this approach suitable for lip reading as well?

3. What are the problems with the existing data corpora for lip reading?

4. Can we write a set of guidelines for building a data corpus for lip reading?

5. Can we create a corpus for the Dutch language large enough to support our endeavour? Due to the circumstances, this is a great step towards our goal to build a robust lip reader in Dutch. The corpus should be a good basis for future research in lip reading and connected domains.

6. What is the influence of the data parametrization method on the performance of the lip reader? We want to test different methods and compare their performance on similar recognition tasks. What are the strengths of the various feature extraction methods?

7. Should we concentrate on the contour of the mouth, or is it better to have a broader appearance based approach?

8. How important is the motion? Do we need to compute the motion flow on the speaker's face, or is it sufficient to generate a static model of the speaker's face and only compute the derivatives based on this model?

9. What is the influence of the speaking style on the performance of the lip reader? The speech rate was never considered as an issue in lip reading even though common sense tells us that it has great influence on the way people speak.

10. Is there any correlation between the speaking style and the choice of the data parametrization used, namely is one parametrization more suitable for some speaking style than the others?

11. Do we need to use a higher recording rate on the visual side? For a long time, the video sampling rate (e.g. 25Hz) was seen as a performance impediment because it did not match the audio sampling rate (i.e. usually 100Hz). However, there was no rigorous study of the influence of the video sampling rate on the performance of automatic lip reading.

12. What is the influence of the size of the corpus on the performance of the resulting recognition system?

13. What is the best definition of the visemes? Visemes are the phonemes' counterparts in the lip reading domain. Usually, visemes are defined by elicitation. However, this is beyond the scope of this thesis. What we want is to have a thorough literature survey of the papers that cover this topic for the Dutch language.

Table 1.1 summarizes the main trajectory of our approach in order to answer the research questions tackled in this thesis. A more detailed description of the methodology we used, and justifications for the actions we have taken, are found in Chapter 2. In general, we approached the problems in a problem-experiments-analysis-fine-tuning fashion. Each time we started by designing suitable experiments which we exhaustively fine-tuned until the best results were obtained for the given task.

Table 1.1: The main trajectory of the thesis.

      Action                    Result
  1   Literature survey         State of the Art of lip reading
  2   Corpus development        Guidelines for building a corpus, and a Dutch corpus for lip reading
  3   Data parametrization      Suitable methods for lip reading and processing the data
  4   Experiments               Training and testing various lip readers
  5   Analysis of results       Statistics on the performance of the different lip readers
  6   Design new experiments    Analysis of the results from other points of view
  7   Conclusions               Final conclusions and future work

1.5 Outline of the Dissertation

The dissertation is structured in nine chapters, including this introductory chapter, and four appendices. The knowledge is presented following a methodology-theory-methods-results-conclusions approach. The appendices give extra information which could not be given throughout the dissertation but which exemplifies or completes the discussions.

Chapter 2 (Methodology, Definitions and the State of the Art) starts by introducing the relevant methodologies we used in the current research. The chapter continues by introducing all the concepts needed to understand the technologies, experiments, results and discussions presented in this dissertation. This chapter is, therefore, the base chapter of the dissertation and in our opinion is the starting point for reading this book. The chapter follows the entire process of building a lip reader, from the preparation stage to the interpretation of the results. In the last section of this chapter we present a succinct overview of the state of the art in lip reading at the time of writing this doctoral dissertation.

Chapter 3 (Computational Models) gives the theoretical foundation of the techniques used during our research. Firstly, we introduce the theoretical aspects of the inference engine used, namely the Hidden Markov Models. Secondly, each in its own section, we present the theoretical aspects and some practical usage information on the three techniques used for data parametrization. The strengths and weaknesses of each technique are analysed and real examples are given.

Chapter 4 (Data Acquisition) introduces our work on building a data corpus for the Dutch language to be used for lip reading and other connected applications. We start by analysing the problems of the existing data corpora and enumerate a set of requirements a data corpus needs to satisfy. This resulted in an incipient set of guidelines for building a data corpus for lip reading. The resulting corpus is the largest data corpus for lip reading in Dutch at the time of this writing. In this chapter we present in detail all the steps we took to compile the corpus, analyse the difficulties we faced and present the solutions we have chosen. The chapter also includes statistics on the resulting corpus and ideas for future development. The analyses presented in this chapter were published in:

[Chi07a] Alin G. Chițu and Leon J. M. Rothkrantz. Building a Data Corpus for Audio-Visual Speech Recognition. In Proceedings of Euromedia2007, pages 88–92, 2007;

[Chi08a] Alin G. Chițu and Leon J. M. Rothkrantz. Dutch Multimodal Corpus for Speech Recognition. In Proceedings of Workshop on Multimodal Corpora LREC 2008, pages 56–59, 2008;

[Chi09a] Alin G. Chițu and Leon J. M. Rothkrantz. The New Delft University of Technology Data Corpus for Audio-Visual Speech Recognition. In Proceedings of Euromedia2009, pages 63–69, Eurosis, April 2009.

The next three chapters introduce the three approaches we used for data parametrization. Each chapter is organized into three parts: introduction of the method (algorithm and implementation), empirical validation of the resulting feature vectors and analyses of the performance of the lip reader built on the corresponding feature type. While interpreting the results of each lip reader, across these chapters we also introduce various techniques we used in order to boost the performance of the resulting systems.

Chapter 5 (Statistical Lip Geometry Estimation for Lip Reading) introduces the first data parametrization method we used in the current research. Based on a statistical interpretation of the result of a colour based image filter, this method is special because it does not use any a priori model knowledge. The performance of the lip readers built for various tasks is given and analysed. We introduce the results gradually, presenting the steps we took in order to improve the results. In the next chapters we skip this gradual approach and only present actions and analyses specific to the given methods.

Chapter 6 (Active Appearance Models for Lip Reading) introduces a top down approach to the parametrization of the input data, suitable for lip reading, based on Active Appearance Models (AAM). In contrast with the previous approach, AAM makes use of an a priori designated model. Therefore, its applicability is limited to physical systems whose geometric deformation is confined to a non-degenerate range. The advantage of this method is its accuracy and speed; with a well trained model and a good initialisation this method obtains high accuracy in real time.

Chapter 7 (Optical Flow Analysis for Lip Reading) introduces the use of optical flow analysis as an information source for automatic lip reading. The features computed based on optical flow capture real motion information. This contrasts with the other approaches, which recover the motion information by computing the first derivatives of the static features. We used this approach in order to answer the question related to the importance of directly recovering the motion on the speaker's face.

Chapter 8 (Further Analysis of the Results and Other Experiments) analyses the results of the lip readers built during our experiments from different points of view. We analyse, therefore, the influence of the data parametrization approach, the influence of the sampling rate of the video data, the performance of the lip readers as a function of the recognition task, but also the performance as a function of the size of the corpus used. Many of the research questions are dealt with in this chapter. The analyses presented in this chapter are based both on experiments introduced in previous chapters and on new experiments. This is the reason the analyses are presented in a separate chapter. The methods, the results and the analyses presented in the last four chapters were published in various international journals and conferences listed below:

[Chi07c] Alin G. Chițu, Leon J. M. Rothkrantz, Pascal Wiggers, and Jacek C. Wojdel. Comparison between different feature extraction techniques for audio-visual speech recognition. In Journal on Multimodal User Interfaces, vol. 1, no. 1, pages 7–20, Springer, March 2007;

[Chi07b] Alin G. Chițu and Leon J. M. Rothkrantz. The Influence of Video Sampling Rate on Lipreading Performance. In The 12-th International Conference on Speech and Computer (SPECOM'2007), pages 678–684, Moscow State Linguistic University, Moscow, October 2007;

[Chi09b] Alin G. Chițu and Leon J. M. Rothkrantz. Visual Speech Recognition - Automatic System for Lip Reading of Dutch. In Journal on Information Technologies and Control, vol. year vii, no. 3, pages 2–9, Simolini-94, Sofia, Bulgaria, 2009;

[accepted for publication in September 2010] Alin G. Chițu, Karin Driel, and Leon J. M. Rothkrantz. Automatic Lip Reading in the Dutch Language Using Active Appearance Models on High Speed Recordings. In Proceedings of Text, Speech and Dialogue (TSD2010), Springer LNCS, Lecture Notes in Computer Science, LNAI subseries, September 2010.

Chapter 9 (Conclusions: Summing Up, General Thoughts and Future Directions) summarizes the most important findings presented across this dissertation. It also presents our thoughts on the future development of lip reading and concludes with our hope for a great leap in the performance of lip reading.

To aid the reader in achieving a complete picture of the process of building a lip reader, we included extra information in the appendices at the end. Appendix A gives a compact description of the major existing data corpora. We include in this table information on the purpose of each corpus and on its availability. Appendix B gives an exact description of the grammars used for the different recognition tasks. Appendix C introduces the utterance types recorded in the NDUTAVSC corpus. Appendix D shows an example of the utterances recorded for the NDUTAVSC corpus.

Chapter 2

Methodology, Definitions and the State of the Art

This chapter briefly describes the process of building an automatic lip reader and defines and elaborates on the important concepts that constitute it. The role of this chapter is to make the rest of the dissertation easy to read. Therefore, the reader should find here a short introduction to all the concepts needed to read and understand the rest of the book. The computational models and techniques are introduced in this chapter as well. However, only their usage is covered here; the theoretical aspects and the strengths and limitations of each of the different techniques used in our research will be covered in Chapter 3. In the last section of this chapter we will present the state of the art in lip reading at the time of this research.

2.1 Building an Automatic Lip Reader: Overview

Building a lip reader is in many ways similar to building any automatic system which performs an autonomous role in its environment. The first decision that needs to be made before starting the construction of the system concerns the role of the system and the environment where the system will be deployed. After establishing, in pattern recognition jargon, the recognition task, building the system consists of four separate stages: data acquisition, data parametrization, model training and model testing. Figure 2.1 describes the general process of building a lip reader. These activities are performed in cycles; the larger the cycle, the less frequently its corresponding process is performed. The data acquisition process should ensure that the resulting corpus correctly describes the distribution of the possible states of the modelled process. The importance of the data parametrization is twofold: it should extract only the relevant information from the data and it should reduce the dimensionality of the feature space, therefore increasing the tractability of the problem.


Figure 2.1: The activity sequence for building a lip reader.

Training and testing are dependent on the mathematical models chosen for inference. These range from plain heuristics to complex probabilistic graphical models. The training process should solve two problems: identify the structure of the models, such as the number of parameters and their relation, and compute the values of the models' parameters. Training and testing is usually performed in a cycle which fine-tunes the structure of the models and the values of the weights in the models. However, the data parametrization step is the one that is most often investigated, since there are many ways to extract suitable information for the process under study. Choosing the right parametrization is not straightforward and usually a trial and error sequence of experiments is started. A lip reader, and in general a speech recogniser, is built for a particular target language. The recognition task, namely the size of the vocabulary and the type of utterances accepted, is paramount for the entire design of the system. For instance, while for a small vocabulary (i.e. a few tens of words) one model can be used to recognise one entire word, for larger vocabularies it is more suitable to build sub-word models, i.e., to directly recognise sub-words and build the words and sentences using dictionaries and grammar networks. So far, the most successful approach for speech recognition, and therefore also the one applied to lip reading, is the Bayesian approach. In the Bayesian approach, the recognition problem can be formulated as follows: given a set of possible words and an observation sequence $O = (O_1, O_2, \ldots, O_n)$, the solution of the recognition problem is the word that maximizes the probability $P(W|O)$. Based on Bayes' rule we can write:

$$P(W|O) = \frac{P(O|W)\,P(W)}{P(O)}, \qquad (2.1)$$

where $P(O|W)$ is the likelihood of the observation given the word $W$ and $P(W)$, usually called the language model, represents the probability of the word $W$. The problem can thus be rewritten as:

$$\widehat{W} = \operatorname*{argmax}_{W}\, P(O|W)\,P(W), \qquad (2.2)$$

where $\widehat{W}$ is the recognized word. In the above equation the denominator $P(O)$ has been omitted since it does not influence the solution. Therefore, the recognition problem is reduced to building a language model $P(W)$ and a word model $P(O|W)$ for each legal word.
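To make the decision rule in Equation (2.2) concrete, the minimal sketch below scores a toy isolated-word task in Python. The function names, the Gaussian scoring of the observations and the example priors are illustrative assumptions, not part of the system described in this dissertation; in practice $P(O|W)$ would come from a trained (visual) HMM, as discussed in Chapter 3.

```python
import numpy as np

def recognize(observations, word_loglikelihoods, language_model):
    """Return the word maximizing P(O|W) * P(W), i.e. Equation (2.2),
    computed in the log domain for numerical stability."""
    best_word, best_score = None, -np.inf
    for word, loglik in word_loglikelihoods.items():
        score = loglik(observations) + np.log(language_model[word])
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# Toy stand-in for P(O|W): each "word model" is a single Gaussian over
# 1-D features; a real system would use one HMM per word or per viseme.
def gaussian_loglik(mean, var):
    return lambda obs: float(np.sum(
        -0.5 * (np.log(2 * np.pi * var) + (np.asarray(obs) - mean) ** 2 / var)))

word_models = {"ja": gaussian_loglik(0.0, 1.0), "nee": gaussian_loglik(2.0, 1.0)}
priors = {"ja": 0.5, "nee": 0.5}                       # language model P(W)
print(recognize([1.8, 2.2, 1.9], word_models, priors))  # -> "nee"
```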

2.2 Recognition Task

Every system has a desired applicability. The recognition task defines, for a speech recognizer¹ in general and a lip reader in particular, the required settings in which the system is expected to function. The recognition task defines the environment in which the system will be deployed, the characteristics of the users that the system is expected to handle and, very important for speech recognition, the types of utterances that should be recognized. The recognition task has a strong impact on the decisions made during the development stages of the system. As we already mentioned, the data corpus is tailored to capture the most relevant aspects for the recognition task. Also the basic recognition units are chosen to optimize the recognition process. The environment can be either a controlled, noise free environment (e.g. an office like environment where both the acoustic and the visual environment are controlled) or an "uncontrolled" environment where the level of noise (i.e. acoustic and visual) has no a priori set limit and is usually expected to be large (e.g. public spaces, the interior of a vehicle, crowded rooms). It is paramount to take the deployment environment into account at the training stage of the system. The corpus used for training should match as much as possible the conditions at the recognition place. In the case of lip reading the acoustic noise has no influence, since only the visual information is used. Therefore, in this case what is important is the illumination (e.g. diffuse versus directional light, combination of natural light with artificial light, the temperature of the light, the position of the light, the shadows, etc.), the background (e.g. the colours in the background, the amount of movement in the background, the number of people in the background, etc.) and occlusions (e.g. the probability to have occlusions; for instance the position of the speaker with respect to the camera can give rise to occlusion by other entities but also to self occlusion). The system can be multi-user, and therefore speaker independent, or single user and thus speaker dependent. Usually, a multi-user system is adapted to a speaker dependent condition. The adaptation only requires a short corpus, while the deep training is performed on a large data corpus. This option is used by commercial ASR systems, since it is not acceptable that the user is required to record a large corpus prior to using the system. The most important aspect, the utterance types, is defined by the vocabulary used, the accepted sentences and the speaking style expected to be used. The legal sentences and the vocabulary are usually strongly correlated. For instance a system for recognition of telephone numbers, or in general of digit strings, has as legal sentences any fixed or variable length string of digits. The vocabulary in this situation consists of the digits "0" to "9". A special case, and somewhat artificial because it has limited application, is the recognition of single digits. The same situation is found for the similar case of letter string recognition.

¹ In this section, a speech recognizer (and speech recognition) is thought of as a multi-modal system that can recognize what is being said by fusing any number of sources of information, including audio, video, EEG, etc. In general, in the literature associated with our domain and also in this thesis, a speech recognizer (and speech recognition) means an aural speech recognizer (aural speech recognition), if not otherwise stated. More exactly, the name Automatic Speech Recognition (ASR) is used.

The tasks introduced until now are the easiest tasks in the speech recognition domain. Of similar complexity is the recognition of a set of words in isolation. The complexity in this case increases with the number of words in the vocabulary. The next step, with respect to complexity, is the case when the utterances recognized consist of a set of sentences that obey a strict grammar (e.g. a fixed set of commands with a very concise area of applicability). The last step on the complexity scale is so-called "continuous speech". The definition of continuous speech is somewhat under discussion. Some studies define continuous speech as any utterance that contains a string of valid words. By this definition the digit, letter and random word strings qualify as continuous speech. However, in this thesis we define continuous speech as only those utterances that consist of valid sentences in the target language. This means that the grammar based recognition task can also be considered as continuous speech. The grammar based recognition task is, therefore, the easiest case of the continuous speech task, and for relatively small (i.e. less than 100 words) to medium vocabularies (i.e. a few thousand words) the aural speech recognition problem in controlled environments is solved. The current development in aural speech recognition deals with very large vocabularies (i.e. hundreds of thousands of words) in controlled environments, for the so called "dictation" tasks. At the moment of the writing of this thesis there are commercial audio based speech recognition systems available which claim to have 99% accuracy for dictation tasks in a controlled environment. However, the "Holy Grail" of automatic speech recognition is still far from being achieved. The Holy Grail of speech recognition is the recognition of continuous and natural speech. By natural speech is meant utterances that may not be valid from the point of view of the grammatical rules of the target language, but that are perceived as natural by a human interlocutor. For instance hesitations, repeated words or interjections are common in regular conversations. These make the spoken utterance less clear and increase the complexity of the recognition task. Another aspect of the speech recognition domain's "Holy Grail" is building recognition systems that maintain the level of performance irrespective of the environment in which the system is deployed. This is actually the place where lip reading comes into action, by contributing to the enhancement of audio based speech recognition in noisy environments. Only after achieving these last two goals could we say that automatic speech recognition has equalled human performance. One remark, related to this competition, is that even though automatic systems in general outperform humans in signal processing tasks and, therefore, achieve better scores for short utterances, for real life applications the gap is still enormous in favour of humans.
One of the explanations seems to be that humans acquire with great ease the context of the conversation [Wig08]. As mentioned before, based on the recognition task we can decide what the basic recognition units should be. In the case of a reduced vocabulary it might be more suitable to use the words in the vocabulary as basic recognition units. Therefore, each word has an associated recognition module. This approach can be used in the case of the digit string and letter string tasks. However, in the case of a larger vocabulary the number of models would be too large, so a sub-word approach is preferred. In our work we used the sub-word approach for all recognition tasks. From the point of view of the vocabulary and legal sentences, we defined five different recognition tasks, which were reflected in the types of utterances we recorded in our corpus and on which we directed the analysis in our experiments. In the analysis and results sections we use the following codes to refer to them: digit strings (CD), letter strings (CL), grammar based utterances (GU), random continuous speech sentences (RS) and random utterances (all). Appendix B lists the grammars we used to generate the utterances in our corpus.

2.3 Data Corpus

A good data corpus should be well designed and should capture both the general and the particular aspects of a certain process. As in the case of speech recognition, training a lip reader requires much data and the quality of the data greatly influences the performance of the result. Whenever video data is considered, the amount of space necessary to save the data also increases considerably. For instance in [Mes98] the authors estimate that for training a multi-modality system for person identification the amount of data required is in the order of terabytes (1000 GBytes). The data acquisition process should ensure that the resulting corpus thoroughly covers the variability in the process being modelled. In general the data corpus should tightly cover the possible answers to the following questions:

1. What is the task of the system? Whatever task the system is built for the data corpus should contain sufficient data to cover that task.

2. What are the particularities of the environment where the system will be deployed? The corpus should contain support for the particular environment of the system. This could mean either to give support for adaptation or to sufficiently cover the given environment.

3. Who are the possible users of the system? The corpus should contain all the particularities of the possible users of the system.

The basic requirement for the corpus is that it is large enough to successfully train the system. The performance of the system is, however, proportional to the degree of overlap between the corpus and the answers to the previous three questions. For lip reading the data corpus consists of video recordings of speakers uttering a series of task related utterances. The field of view focuses on the lower half of the speaker's face, since the lips are the information conveying elements. The speaker is instructed on the speech style required, such as: speech rate, spelling versus normal speech, whispering versus normal pitch, etc. However, partly because the field is still young, and partly because the time and resources necessary to record a visual data corpus can be overwhelming, the number of existing lip reading data corpora is still small compared with the number of audio datasets. The next section evaluates the limitations of the existing corpora and gives a set of ground requirements for a data corpus.

2.3.1 On Building a Data Corpus for Lip Reading: A Comparison

In order to evaluate the results of different solutions to a certain problem, the data corpora used should be shared between researchers, or otherwise there should exist a set of guidelines for building a corpus that all datasets should comply with. In the case when a data corpus is built with the intention to be made public, a greater level of reusability is required. In all cases, the first and probably the most important step in building a data corpus is to carefully state the targeted applications of the system that will be trained using the dataset. Some of the most cited data corpora for lip reading are: TULIPS1 [Mov95], AVletters [Mat96], AVOZES [G04], CUAVE [Pat02], DAVID [Chi96], ViaVoice [Net00], DUTAVSC [Woj02], AVICAR [Lee04], AT&T [Pot97], CMU [Zha02], XM2VTSDB [Mes99], M2VTS [Pig97] and LIUM-AVS [Dau03]. With the exception of M2VTS, which is in French, XM2VTSDB, which is in four languages, and DUTAVSC, which is in Dutch, the rest are only in English. Since the target language for our research was Dutch, we had only one option, namely the DUTAVSC (Delft University of Technology Audio-Visual Speech Corpus). For reasons that will be explained in the next paragraphs, we decided to build our own data corpus. This corpus was built as an extension of the DUTAVSC and is called NDUTAVSC [Chi09a], which stands for "New Delft University of Technology Audio-Visual Speech Corpus". This corpus will be introduced in great detail in Chapter 4. From the point of view of speech recognition, the common limitations of a data corpus are:

1. The recordings contain only a small number of respondents. This greatly reduces the power of generalisation of the results, since it generally generates highly biased systems. The bias comes from the fact that not all the possible sources of variance are captured. The corpus is, therefore, unbalanced with respect to particularities of the user, such as facial particularities (e.g. moustache, beard and glasses), gender, age, race, dialect, education level, etc. This information could be used for building a context of the speaker, which can be extremely valuable for building context aware speech recognisers [Wig08]. There are very few exceptions where the number of respondents is larger than 100. Even in these situations a good practice is to carefully record the speakers' data, such as age, gender, race, dialect, etc.

2. The pool of utterances is usually very limited. The datasets usually contain only the digits or the letters of the Latin alphabet. Exceptions are ViaVoice and DUTAVSC, which also contain continuous speech. This induces a poor coverage of the set of phonemes and visemes in the language. If continuous speech is targeted, then the prompts used should always contain phonetically rich words and sentences. Since the utterances contain only isolated words (letters or digits), the co-articulation effects are very poorly covered. As a good practice, the language corpus should efficiently cover the possible combinations of phonemes in the language (a small illustrative coverage check is sketched after this list). This will help keep the respondents' effort within reasonable limits. Moreover, phonetically rich speech will also better represent the co-articulatory effect in the language.

3. The quality of the recordings is often very poor. This usually holds for the video data. It can be argued that for specific applications, such as speech recognition while driving a car, using dedicated databases (for instance the AVICAR database) might better represent the specificity of the speech in the given situation. However, the use of such a dataset will be entirely restricted to the application for which it was created.

4. The datasets are not publicly available. In general, many datasets that are described in scientific papers are not open to the scientific community. This makes it impossible to validate the results of the different methods, and the comparison of the different methods is restricted. This is the case, for instance, for the ViaVoice data corpus, which is a proprietary corpus.

One of the first datasets used for lip reading was TULIPS1. This database was assembled in 1995 and consists of very short recordings of 12 subjects uttering the first 4 digits in English. Another similar, very small dataset is AVletters. As the name suggests, only the letters of the Latin alphabet are recorded. In this section we will underline the main issues of the existing datasets for speech recognition with respect to the audio and video quality and with respect to the language coverage. The dataset DUTAVSC will also be included here for comparison. Appendix A lists the main characteristics of the existing data corpora for lip reading.
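A simple, automated way to check the phonetic coverage requirement raised in item 2 above is to count, given a pronunciation lexicon, which within-word phoneme pairs a candidate prompt set covers. The sketch below is only an illustration: the lexicon entries, the function name and the bigram criterion are hypothetical and not the procedure used to compile NDUTAVSC.

```python
def diphone_coverage(prompts, lexicon):
    """Fraction of within-word phoneme bigrams from the lexicon that are
    covered by the prompt set. `lexicon` maps a word to its phoneme list,
    e.g. {"acht": ["a", "x", "t"]} (hypothetical entries)."""
    def bigrams(phones):
        return set(zip(phones, phones[1:]))
    all_pairs = set().union(*(bigrams(p) for p in lexicon.values())) if lexicon else set()
    seen = set()
    for prompt in prompts:
        for word in prompt.lower().split():
            if word in lexicon:
                seen |= bigrams(lexicon[word])
    return len(seen & all_pairs) / len(all_pairs) if all_pairs else 0.0

# Example with a two-word toy lexicon and one prompt:
lex = {"acht": ["a", "x", "t"], "nul": ["n", "Y", "l"]}
print(diphone_coverage(["nul acht nul"], lex))  # -> 1.0
```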

Audio Quality

The complexity of audio data recording is much smaller than that of video recording. Therefore, all datasets store the audio signal with sufficiently high accuracy, namely using a sample rate of 22 kHz to 48 kHz and a sample size of 16 bits. Hence, the quality of the audio data is determined not by the storage accuracy but by the recording environment. There are two approaches to the recording environment: specific and neutral. In the first case the database is built with a very narrow application domain in mind, such as speech recognition in the car. In this case the recording environment matches the conditions of the environment where the final system will be deployed. This approach can guarantee that the particularities of the target environment are closely matched. The downside of this approach is that the resulting corpus is too much dedicated to the problem domain, suffers from overtraining, and offers little generalization. In the second approach the dataset is recorded in a controlled, noise free environment. The advantage of this approach is the possibility to adapt the corpus to a specific environment in a post process. Therefore, a data corpus of this kind can be used for virtually any number of applications. The specific noise can be simulated or recorded in the required conditions and later superimposed on the clean audio data. An example of such a database is NOISEX-92 [Var93]. This dataset contains white noise, pink noise, speech babble, factory noise, car interior noise, etc.
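As an illustration of this post-process adaptation, the short sketch below superimposes a pre-recorded noise signal on clean speech at a chosen signal-to-noise ratio. It is a minimal sketch under the assumption that both signals are available as arrays at the same sample rate; the function name and the dB-based scaling are illustrative and not taken from NOISEX-92 or any particular toolkit.

```python
import numpy as np

def add_noise_at_snr(clean, noise, snr_db):
    """Mix a noise recording into clean speech at a target SNR (in dB).
    The noise is tiled or truncated to the length of the clean signal,
    then scaled so that 10*log10(P_clean / P_noise) equals snr_db."""
    clean = np.asarray(clean, dtype=float)
    noise = np.resize(np.asarray(noise, dtype=float), clean.shape)
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

# Example: degrade a synthetic "clean" signal with white noise at 10 dB SNR.
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 16000))
noisy = add_noise_at_snr(clean, rng.standard_normal(16000), snr_db=10.0)
```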

Video Quality

In the case of video data recording there is a larger number of important factors that control the success of the resulting data corpus. Hence, not only the environment, but also the equipment used for recording and other settings actively influence the final result. In the case of the environment, the classification made for audio holds for video as well. The environment where the recordings are made is important since it determines the illumination of the scene and the background of the speakers. In the case of a controlled environment the speaker's background is usually monochrome, so that by using a "colour keying" technique the speaker can be placed in different locations, inducing in this way some degree of visual noise. However, the illumination conditions of different environments are not as easily applied to the clean recordings, since the 3D information is not available anymore. In controlled environments the light is reflected by special panels which cast the light uniformly, reducing the artefacts on the speakers' faces. Contrary to the audio case, the equipment used for recording plays a major role, because the resolution and the sample rate are still a heavy burden. Hence, the resolution of the recordings ranges from 100x75 pixels in the Tulips1 and 80x60 pixels in the AVletters datasets to 720x576 pixels in the AVOZES and CUAVE datasets. The same improvement in quality is also observed in colour fidelity. The days of greyscale, 8 bits per pixel images are long over; all new datasets are recorded in RGB at 24 bits per pixel. This is very important because the discriminatory information is highly degraded by conversion to greyscale. As shown in Section 2.3.2, another important factor is the frame rate at which the video recording is saved. The frame rate of the existing data corpora conforms to one of the colour systems used in broadcast television. Therefore, the video is recorded at 24Hz, 25Hz, 29.97Hz or 30Hz, depending on the place in the world where the recordings are made. The data corpus used for the current research was recorded at 100Hz. The reasons for this choice will be discussed in Section 2.3.2. The Region Of Interest (ROI) is important as well. For lip reading only the lower half of the face is important. However, in case context information is required, a larger area might be needed. Most of the datasets show, however, a passport-like image of the speaker. In our opinion, at least for increasing the performance of the parametrization process, a smaller ROI is more advantageous. Of course a ROI that is too narrow places high demands on the video camera used, and it might be argued that this is not the case in real life where the resulting system will be used. Recording only the mouth area, as is done in the Tulips1 data set, is a tough goal to achieve in an uncontrolled environment. However, by using a face detection algorithm combined with a face tracking algorithm we could automatically focus and zoom in on the face of the speaker. A small ROI facilitates acquiring much greater detail of the area of interest, in our case the mouth area, while keeping the resolution and, therefore, the bandwidth needs within manageable limits. Figure 2.2 shows some examples from six available data corpora. The differences among the examples in this figure are clearly visible: with the exception of the DUTAVSC corpus, all other corpora reserve only a small number of pixels for the mouth area.
Table 2.1 gives the sizes of the mouth bounding box in all six samples. This low level of detail makes the detection and tracking of the lips much more difficult. Any parametrization that relies on a description of the shape of the mouth will be heavily influenced by such image degradation.

Figure 2.2: The resolution of the ROI in some data corpora available for lip reading.

Table 2.1: Resolution of the mouth area in six known corpora for lip reading.

Data corpus   Width   Height
AVOZES         122       24
CUAVE           75       34
M2VTS           46       28
TULIPS1         76       37
VidTIMIT        53       25
DUTAVSC        225      133

In [Pot98] the authors report that degrading the video signal by image compression or by the addition of white noise does not influence the lip reading performance unless the degradation passes a certain threshold: 50% and 15%, respectively. These findings were obtained with features that are a linear transformation of the image intensities, namely the discrete wavelet transform.
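As an aside to the ROI discussion above, the automatic focusing on the speaker's face mentioned earlier can be prototyped with a standard face detector. The sketch below uses OpenCV's Haar cascade detector (the Viola-Jones approach referred to again in Section 2.4.1) to crop the lower half of the largest detected face; the helper name, the cascade file and the cropping rule are illustrative assumptions, not the procedure used to record our corpus.

import cv2

# The cascade file ships with OpenCV; its location may differ per installation.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def lower_face_roi(frame):
    """Return a crop of the lower half of the largest detected face,
    which is the region most relevant for lip reading."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    # Keep the largest detection, assumed to be the speaker.
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])
    return frame[y + h // 2 : y + h, x : x + w]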

Language Quality

By its nature lip reading requires, irrespective of the other qualities, that the data corpus has a good coverage of the language and of the task vocabulary. Therefore, in the case of a word based recognizer all the words in the vocabulary need to be present in the corpus. In the case of a sub-word recognizer every sub-word item needs to be present in the corpus in all existing contexts, so that the co-articulatory effects appear with a reasonable frequency. However, due to the amount of work necessary and the storage and bandwidth required, most of the data corpora only consider small recognition tasks and a small language corpus. Most frequently the data corpora focus on the digits and letters of the language considered. These are recorded either isolated, or in short sequences, or, as in DUTAVSC, in spellings of words. Some corpora even consider only nonsense combinations of vowels (V) and consonants (C) (e.g. DAVID considers VCVCVC sequences, AVOZES repetitions of "ba" and "eo" constructions, AT&T CVC sequences). The continuous speech case is only considered in AVOZES, which contains only 3 phonetically balanced sentences, in AVICAR, which contains ten sentences from the TIMIT [Gar88] speech data corpus, in XM2VTSDB and M2VTS, which contain one random sentence each, and in DUTAVSC, which contains 80 phonetically rich sentences. DUTAVSC is by far the richest data corpus. The NDUTAVSC corpus, which was built as an extension of DUTAVSC, contains more than 2000 unique rich sentences. However, none of the existing corpora can match the language coverage offered by the data corpora used for speech recognition, which can easily have a vocabulary of 100k words (e.g. the Polyphone corpus [Boo94] contains more than one million recorded words and a vocabulary of 150k words). The exact composition of the NDUTAVSC corpus is included in Chapter 4.

2.3.2 High Speed Recordings

As mentioned in Section 1.3, the first application of lip reading is to enhance the performance of a speech recognizer. One option to fuse the information from the two modalities into a common audio-visual speech recognizer is to concatenate the extracted features and train a single common model. This approach is called early fusion or feature fusion and is a popular approach in the lip reading community [Wig02]. The speech is usually recorded at a rate in the range 22 kHz to 48 kHz and the speech features are computed using a Hamming window of 30ms with 10ms of overlap on each side. Therefore, the audio modality is sampled at a 100Hz rate. On the other hand, the video modality can only provide 24 to 30 samples per second in a standard setting. This implies that in the case of the early fusion paradigm an interpolation of the visual feature stream is needed in order to synchronize the two modalities. The question that arises is, therefore, what is the influence of the recording frame rate on the performance of the recognition system?

In their paper [Wil98a] from 1998, Williams and his colleagues analysed this question from the human point of view. They devised experiments to analyse the influence of the frame rate on the lip reading ability of human subjects. During the analysis they studied to what degree the confusion matrices of visemes are affected by changing the frame rate. The frame rates considered were 30Hz, 15Hz, 10Hz, 5Hz and 2Hz, respectively. The main finding was that a minimum frame rate of 10Hz is necessary to maintain the viseme grouping (i.e. the groups that define the visemes). Potamianos et al. analysed this question from the point of view of a lip reader [Pot98]. The authors trained a lip reader using the same features but extracted at different frame rates, namely 60, 30, 20, 15, 12 and 10Hz, together with some smaller steps. This experiment concluded that the lower limit for acceptable lip reading performance is 15Hz.

The above results seem to answer the question about the influence of the frame rate on the lip reading performance. The conclusion seems to be that the lower limit for the recording frame rate that keeps the performance of lip reading within acceptable limits is 15Hz. Moreover, the conclusion of the authors of [Pot98] was that any larger frame rate has marginal benefits for the performance of the lip reader. However, in both papers discussed above the data corpora used for analysis were, in our opinion, not completely adequate to answer the question of the influence of frame rate on lip reading performance from the point of view of an automatic lip reader. In the first paper the subjects were presented with small CV combinations, while in the second paper the corpus used for analysis consisted of strings of four digits. Not only did the corpora consist of utterances that were too short, but there was neither indication of nor control over the speech rate used to produce the corpus items.

The measure for speech rate is Words Per Minute (WPM). WPM is in fact a measure of input and output speed in human communication. It is used, therefore, to characterize keyboard typing, handwriting, reading and comprehension, but also listening and speaking. The maximum speech rate recorded to date is 595WPM. In general it is considered that 150-160WPM is the range that people feel comfortable with ([Wil98b]).
As the speech rate passes the 200WPM threshold the average human listener already has difficulties following the message. The paper [Omo99] shows that an adult listener can maintain full comprehension at 210WPM only when the speech is compressed. In order to account for the length of the words when computing the speech rate (i.e. longer words should have more weight than shorter words), a word is usually defined with respect to multiples of 5 letters (e.g. "five" accounts for one word, while "multiples" accounts for two words). This has a large impact when analysing the speech rate for the Dutch language, since this language uses compound word constructs for new concepts and, therefore, the words tend to be very long (e.g. "privébankrekening" (private bank account) is one word in Dutch).

To check the influence of the speech rate on the recorded data we considered the data corpus DUTAVSC, which is recorded at the standard 25Hz. We first checked the distribution of the WPM in this database. Subsequently, we analysed the distribution of the amount of data per unit of visual speech: visemes. Eventually, we analysed on a number of datasets the amount of information lost when the frame rate is lowered. The results of this analysis were published in [Chi07b]. The speech rate found in the DUTAVSC corpus ranges from 23WPM to 231WPM with a mean of 108WPM and a standard deviation of 36WPM. When 5-letter words are considered, the speech rate ranges from 24WPM to 277WPM with a mean of 137WPM and a standard deviation of 57WPM.

The next step was to analyse the amount of data per viseme in the DUTAVSC corpus. Figure 2.3 shows the histogram of the number of frames per viseme in the DUTAVSC corpus. A strong pattern is visible in this image: there are utterances that are clustered in the range of 2 to 4 frames per viseme and utterances that spread almost uniformly. Therefore, the conclusion is that there is a large difference with respect to data coverage of the visemes among the utterances in the corpus. As a general measure, the mean number of frames per viseme found on the entire data corpus was 6 with a standard error of 5. Figure 2.5 shows the case of faster speech and Figure 2.4 shows the situation for lower speech rate. The mean number of frames per viseme was 3 for the high speech rate and 11 for the low speech rate; therefore, there is a significant difference between the two cases. Checking the utterances we found that the low speech rate set contains only spellings, phone number and account number utterances, while the high speech rate set contains continuous speech utterances. It is important to mention here that the creators of the DUTAVSC corpus did not intend to analyse high and low speech rates and, therefore, did not instruct the speakers to vary their speech rate. All speakers spoke with their natural speech rhythm. However, as shown by the WPM values, the speech rate varies naturally with the task the speakers have. For instance, the mean WPM obtained in the set of connected digits, letters and spelling utterances was 70WPM, while in the continuous speech set it was 128WPM. The same remark should be made when considering the results in the paper [Chi07b]. The conclusion in this paper was that a 15Hz rate is too low to reliably capture natural speech in general, which is consistent with the findings in the studies discussed above.
However, when a high speech rate is considered, even a 25Hz frame rate still does not provide enough data to assure a good coverage of the visemes. A frame rate higher than 250Hz is definitely overkill. The conclusion we arrive at is that in our endeavour towards a robust lip reader we should investigate more carefully the impact of speech rate on the performance of an automatic lip reader. Therefore, we decided to use a high speed camera for our recordings and record at a 100Hz rate.
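The two quantities central to this analysis, the speech rate under the 5-letter word convention and the number of video frames available per viseme, can be sketched as follows. The example value of 8 visemes per second is an assumption that is roughly consistent with the 3 frames per viseme observed at 25Hz for fast speech; it is not a measured constant.

def words_per_minute(transcript, duration_s, letters_per_word=5):
    """Speech rate using the convention that one 'word' equals five letters."""
    n_letters = sum(len(w) for w in transcript.split())
    return (n_letters / letters_per_word) / (duration_s / 60.0)

def frames_per_viseme(visemes_per_second, frame_rate_hz):
    """Average number of video frames available to model one viseme."""
    return frame_rate_hz / visemes_per_second

# Fast continuous speech at roughly 8 visemes per second (an assumed value):
for fps in (15, 25, 100):
    print(fps, "Hz ->", round(frames_per_viseme(8, fps), 1), "frames per viseme")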

Figure 2.3: The viseme coverage by visual data in the DUTAVSC corpus. All utterances are included in the histogram. The video stream is recorded at 25Hz rate.

2.3.3 Data Annotation Versus Data Parametrization

Sometimes there is confusion between the concepts of data annotation and data parametrization. In the present research we use data annotation during data corpus development. For each utterance in the corpus, we automatically recorded the corresponding recognition task, the speech style, the recording session and the take number, and we manually recorded any issues such as mispronunciations or other particularities found during analysis. During the annotation process we also made sure that only the corrected transcription is saved in the corpus. During this process all the recordings were auditioned and checked for consistency from both the acoustic and the visual perspective.

Figure 2.4: The viseme coverage by visual data in the DUTAVSC corpus. Only the low speech rate utterances are included in the histogram, namely the spellings and connected digit utterances. The video stream is recorded at 25Hz rate.

Figure 2.5: The viseme coverage by visual data in the DUTAVSC corpus. Only the utterances that contain continuous speech are included in the histogram. The video stream is recorded at 25Hz rate.

Data parametrization is the process of extracting the information suitable for lip reading, on which the recognition system is built. The results of the data parametrization are called features or feature vectors. Since a lip reading system is a classic example of a system that requires the use of supervised learning techniques, both processes are equally important for the success of the system. Carefully pre-processing and organizing the data is, therefore, important if we want to avoid the well known phrase: Garbage In, Garbage Out!

2.3.4 Conclusions

The quality of the data corpus used has a great impact on the results of a research initiative. For this reason, a good data corpus should be built following strict rules that guarantee the success of the final product. Also, due to various limiting factors, a data corpus should state exactly the recognition tasks for which it was created. Since our research focused on lip reading in the Dutch language we had only one data corpus available. Even though the DUTAVSC corpus is larger and the continuous speech task was better defined than in other corpora, it was still not sufficient for the goal of our research. Therefore, we decided to build a new corpus starting from the existing one. We used this approach because we wanted to be able to compare the previous results with the results obtained on the new corpus. The new corpus is not only larger, but was also recorded at a high speed video rate and contains a synchronized dual view, frontal and 90° profile. The existing corpora fail in many aspects; however, they were a good starting point for our work on building a data corpus. The protocol used during the recording sessions was written to avoid the issues found in the corpora investigated.

2.4 Data Parametrization

The raw recorded video data as such is not suitable for lip reading. The first problem with such an approach is the resulting high dimensionality of the problem. With the current computational power, solving problems in such a high dimensional space is not viable. The second problem is that the raw data inherits all sources of correlation, many of which are not linked with speech. The raw data is thus correlated with the settings of the camera, with the recording environment (such as the illumination conditions), and also with the affective state and the identity of the speaker. To exemplify, we can compare the lip reading task with the person identification and emotion recognition tasks: what represents valuable information for person identification or emotion recognition is, in the case of lip reading, a source of noise. Therefore, the purpose of data parametrization is to extract the features most relevant to lip reading from the raw data while keeping the dimensionality of the problem as low as possible.

2.4.1 Feature Vectors Definition

There are many approaches to data parametrization, but with respect to the feature vector definition they all fit into three broad classes: texture based features, geometric based features, and combinations of texture and geometric features. A good overview of most of the feature extraction methods can be found in [Pot04]. In the first class the feature vectors are composed of pixel intensity values or a transformation of them into some smaller feature space. The main function of the projection is to reduce the dimensionality of the feature space while preserving as much of the relevant speech related information as possible. Principal Component Analysis (PCA) is one of the first choices, and therefore very popular, and was used in many studies (e.g. [Bre93; Bre94; Duc94; Li95; Tom96; Chi97; Gra97; Li97; Lue97; Pot98; Dup00; Hon06]). The feature definition is based on the notion of eigenfaces or eigenlips, which represent the eigenvectors of the training sets. An alternative to PCA, very common as well, is the Discrete Cosine Transform (DCT), used for example in [Duc95; Pr05; Hon06; Luc06]. Linear Discriminant Analysis (LDA), Maximum Likelihood Data Rotation (MLLT), the Discrete Wavelet Transform and the Discrete Walsh Transform

([Pot98]) are other methods that fit in this class and were used for lip reading. Virtually any other method, usually borrowed from the data compression domain, which results in a lower dimensionality of the feature vectors can be applied for data parametrization in the lip reading domain. Local Binary Patterns (LBP) is another technique, borrowed from the texture segmentation domain, which shows promising results for lip reading as well ([Mor07; Zha07; Kri08]). LBP was developed by Timo Ojala and Matti Pietikainen and presented in [Oja97].

A special place in this class is taken by the feature vectors that are based on Optical Flow Analysis (OFA) [Mas91; Mar95; Gra97; Fle00; Iwa01; Tam02; Fur03; Yos03; Yos04; Tam04; Chi07c; Chi09b]. The optical flow is defined as "the apparent velocity field in an image". This definition closely matches the statement of Bregler and Konig in their 1994 paper [Bre94]: "The real information in lipreading lies in the temporal change of lip positions, rather than the absolute lip shape". OFA can also be used as a measure of the overall movement and be employed for onset/offset detection. The main advantage of this approach is that it can be easily automated, since it requires only the definition of the Region Of Interest (ROI). The ROI can be the bounding box of the face or the bounding box of the mouth, thus requiring some object detection and tracking algorithm. A good example is the face detection algorithm developed by Viola and Jones in [Vio01]. The main disadvantage of this type of features is that the a priori knowledge about lip reading is not inherently used in the process of feature extraction. Therefore, there is minimal control over the information contained in the resulting feature vectors, i.e. over whether this information is relevant for lip reading or not. The exceptions can be OFA and LBP, where the analysis is usually performed in carefully chosen regions around the mouth. Section 3.3 introduces the mathematical foundations of optical flow and presents some of the most successful solutions to the optical flow problem. Chapter 7 presents the set of features we defined based on OFA and analyses the performance of the lip reading system trained on our data corpus using the OFA based features.

The features from the second class share the belief that in order to accurately capture the most relevant features with respect to lip reading, a careful description of the contour of the speaker's mouth is needed. The feature extraction proceeds in two steps: first, a number of key points are detected and the mouth contour is recovered from them; second, the feature vectors are defined based on the shape of the mouth. The detection of the key points is performed using colour segmentation techniques that identify the pixels that lie on the lips. Thereafter, the contour of the lips is usually extracted by imposing a lip model on the detected points. These methods use the so called "smart snakes" ([Lie99; Lue97; Sal07]), or as called in [Eve04] "jumping snakes", or later on Active Shape Models (ASM) ([Lue96; Pr05; Mor07]) or Active Contour Models (ACM). Any other parametric model can be used here. The lips' contour is usually detected as the result of an iterative process which seeks to minimise the error between the real contour and the approximation allowed by the parametric model.
The actual feature vectors are defined in the second step. The feature vectors fall into two categories here: model based features and mouth high level features. In the first category the feature vectors directly contain the parameters of the models used for describing the mouth contour. In the second category the feature vectors contain measurable quantities which are meaningful to humans. The most used high level features are mouth height, mouth width, contour perimeter, aperture height, aperture width, aperture area, mouth area, aperture angle and other relations among these (e.g. the ratio between the width and the height) ([Chi09b; Gö00b; Gö00a; Kum07; Mat02; Yos04]).

In Chapter 5 we introduce Statistical Lip Geometry Estimation (LGE), a feature extraction method developed by Wojdel and Rothkrantz ([Woj00]) that we used in our research. The results obtained with feature vectors computed by this method are also given in that chapter. This method is special because it is a model free approach for describing the shape of the lips. It strongly depends, however, on the performance of the image segmentation technique used to detect the pixels which belong to the lips.

The third class consists of feature vectors that contain both geometric and texture features. The features from each category are usually concatenated into a larger feature vector. For instance, [Dup00] and [Lue96] combine ASM with PCA features and [Chi97] combines snake features with PCA. It was shown that the tongue, the teeth and the oral cavity have a great influence on lip reading ([Wil98a]); therefore, the addition of these appearance related elements significantly influences the performance of lip reading ([Chi07c]). A special example is the so called Active Appearance Models (AAM) ([Coo98]), which combine the ASM method with texture based information to accurately detect the shape of the mouth or the face. The search algorithm iteratively adjusts the shape so as to minimise the error between the generated shape and the real shape. The core of AAM is PCA, which is applied three times: on the shape space, on the texture space and on the combined space of shape and texture. The AAM based features can either consist of the AAM model parameters, in which case we have a combined geometric and texture feature vector, or of high level features computed from the generated shape, in which case we have a geometric feature vector. The AAM technique is discussed in detail in Section 3.4. The lip reading results based on high level feature vectors computed from the lip shapes generated by AAM are given in Chapter 6.
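As an illustration of the mouth high level features mentioned above, the sketch below derives a few of them from a detected lip contour. The function name and the exact set of features are our own and only serve as an example of this feature category.

import numpy as np

def mouth_features(contour):
    """High level geometric features from an (N, 2) array of lip contour points.

    Returns width, height, perimeter, area (shoelace formula) and the
    width/height ratio, in pixel units.
    """
    xs, ys = contour[:, 0], contour[:, 1]
    width = xs.max() - xs.min()
    height = ys.max() - ys.min()
    closed = np.vstack([contour, contour[:1]])                 # close the polygon
    perimeter = np.sum(np.linalg.norm(np.diff(closed, axis=0), axis=1))
    area = 0.5 * abs(np.dot(xs, np.roll(ys, -1)) - np.dot(ys, np.roll(xs, -1)))
    return {"width": width, "height": height, "perimeter": perimeter,
            "area": area, "ratio": width / max(height, 1e-6)}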

2.4.2 Lip Reading Relevant Feature Space

It is important to ensure that the features obtained from the data parametrization process contain only speech information and that the sources of information that constitute noise from the speech point of view are removed. However, this is not easily done. From the point of view of lip reading, the sources of noise are: the person's appearance dependent information (i.e. the particularities of the speaker such as skin colour, lip appearance, the presence of a moustache and/or beard, the speaking style), the speaker's emotional state (e.g. speaking while laughing can be very disturbing from the point of view of a lip reader) and the recording environment (e.g. the illumination conditions can influence the feature values). The texture based features are more prone to include person and environment dependent information, and are therefore more suitable for person identification tasks than for lip reading. However, as we have seen in the previous section, many researchers used them for lip reading because they are easy to implement and automate. On the other hand, the geometric features can also carry speaker dependent information related to the lip appearance (for instance, the dynamic range of the mouth height is expected to vary with the speaker). In Chapter 4 of his thesis ([Woj03]), Wojdel showed that the geometric feature vectors extracted from different speakers overlap to a larger degree than the raw data features. He proposed a Person Adaptive PCA approach, used for reducing the dimensionality of the feature space, to try to alleviate this problem.

Feature normalization is a process meant to calibrate the dynamic ranges of the feature vectors. A direct result of normalization can also be the partial removal of the influence of the speaker and recording particularities, especially in the case of geometric features. For instance, a common practice when computing high level features based on special points on the speaker's face is to search for a distance that does not change during the entire recording. This distance is used as a unit distance for further computations. In the case of a full face description the distance between the speaker's eyes is used as the "base" distance ([Kob92]). Since in the case of lip reading only the lower half of the face is processed, another base distance should be found. During our research we found that the size of the philtrum at its upper end (i.e. next to the base of the nose) remains virtually unchanged during speech. Therefore, it is a good candidate for the base distance. This is the distance we used to normalize the high level features based on AAM, which are introduced in Chapter 6.

2.4.3 Higher Order Features

Research in speech recognition has shown that the performance of a speech recognizer can be greatly improved by adding the time derivatives to the static features. Usually, only the first and the second order regression coefficients are used. These are the so called Delta and Acceleration features. Less frequently the third order derivatives are used as well. The first derivatives of the static features are computed using the centred finite difference approach given in [You05]. Hence the delta feature at time t is computed by the following regression formula:

d_t = \frac{\sum_{\theta=1}^{\Theta} \theta \, (c_{t+\theta} - c_{t-\theta})}{2 \sum_{\theta=1}^{\Theta} \theta^2}, \qquad (2.3)

where \Theta is the window size and c_{t \pm \theta} are the neighbouring static features. To cope with the difficulties encountered at the boundaries, the first and the last feature vectors are replicated, respectively. To obtain the second time derivatives the above formula is applied again to the delta features. In the case of lip reading, and visual data processing in general, the features computed based on optical flow analysis can be considered a type of higher order feature. The question is whether the motion information is expressed in these higher order features and to what extent.
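A minimal sketch of Equation (2.3), including the boundary replication, is given below; it assumes the static features are stored as a NumPy array of shape (T, D), and the function name and default window size are illustrative.

import numpy as np

def delta(features, theta=2):
    """First order regression (delta) coefficients of Equation (2.3).

    The first and the last feature vectors are replicated to handle the
    utterance boundaries.
    """
    T = len(features)
    padded = np.concatenate([features[:1].repeat(theta, axis=0),
                             features,
                             features[-1:].repeat(theta, axis=0)])
    denom = 2.0 * sum(t * t for t in range(1, theta + 1))
    deltas = np.zeros_like(features, dtype=float)
    for t in range(T):
        num = sum(th * (padded[t + theta + th] - padded[t + theta - th])
                  for th in range(1, theta + 1))
        deltas[t] = num / denom
    return deltas

# Acceleration features are obtained by applying the same regression again:
# accel = delta(delta(static_features))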

2.4.4 Image Segmentation Fundamentals

Lip reading deals with visual data representing recordings of the speaker's face. In other words, a lip reader's working data consists of strings of digital images. A digital image is a binary representation of a two-dimensional image which is saved in raster or vector form. Lip reading deals with raster images, which consist of a finite collection of image elements called pixels organized as a two-dimensional array. The value of each pixel is an integer number corresponding to the colour of that pixel. This value can either be an index into a look-up table, or the mathematical coordinates in a certain colour space. There are several colour spaces defined for different applications. As Charles Poynton nicely observes by drawing a parallel with the cartography domain [Poy99], there is no single best colour space that satisfies the needs of all applications. All have their advantages and disadvantages with respect to a particular application. The data parametrization algorithms make use of various colour representations. Therefore, we consider that no discussion about data parametrization for lip reading is complete without an introduction to colour specification and colour spaces. This section defines the most important concepts used in colour technology and presents the most important colour spaces from the point of view of lip reading. A number of colour transformations and lip segmentation techniques are also presented.

Colour and Colour Related Concepts

Colour is the perceptual result of the interaction of light in the visible region of the spectrum, wavelengths in the range 380nm to 750nm, with the eye's photoreceptor cells, as processed by the brain. Figure 2.6 shows a linear representation of the visible spectrum. Since this image shows pure spectral colours, only a subset of the colours distinguished by the human brain is shown. Colours like pink and magenta are obtained by mixing light of various wavelengths.

Figure 2.6: A linear representation of the visible light spectrum.*

A human retina has three types of photoreceptive cells: cones, rods and photosensitive ganglion cells. However, only the cones are responsible for the creation of the colour sensation in the brain.

* This image is courtesy of www.wikipedia.org. The author has released it into the public domain.

There are three types of photoreceptive cone cells, which have different spectral response curves. The cones are labelled according to the locations of the peaks of the corresponding curves: short (S), medium (M) and long (L). They are also classified as blue, green and red, respectively, even though their peaks are not exactly at the corresponding wavelengths. The most visible mismatch is in the case of the red cones, which have a spectral response curve that peaks in the yellow-red area. The response of the human cones to monochromatic stimuli is shown in Figure 2.7. Figure 2.8 shows an aggregated image of the sensitivity of the human eye to pure spectral colour stimuli. As can be seen in this image, the sensitivity of the human eye peaks around 550 nanometres, which corresponds to the colour green.

Figure 2.7: Normalized response spectra of human cones, S, M, and L types, to monochro- matic spectral stimuli, with wavelength given in nanometres.*

Figure 2.8: Sensitivity diagram of the human eye to pure spectral colours.*

The conclusion here is that since there are exactly three types of colour photoreceptive cones, a three dimensional space is necessary and sufficient to describe a colour. Sir Isaac Newton said (cited in [Poy99]): "Indeed rays, properly expressed, are not coloured." Therefore, colour exists only in our brain, facilitated by the eyes. Moreover, it should be noted that the peak response of the human colour receptors varies among individuals and that colour perception is therefore not identical across individuals. Also, the visible region is defined with respect to human vision, since other species can have different visual affinities. For instance, bees and other species of insects, but also some birds and reptiles, can see in the ultraviolet range, well beyond human perception.

The Commission Internationale de l'Éclairage (CIE), established in 1913 and based in Vienna, Austria, is the international authority on light, illumination, colour and colour spaces.

Brightness is defined by the CIE as the attribute of visual sensation according to which an area appears to emit more or less light. Because brightness is a property that is not easily measurable, the CIE introduced the notion of luminance, which is defined as the luminous intensity per unit area of light travelling in a given direction. It describes the amount of light that passes through or is emitted from a particular area and falls within a given solid angle, and it depends on the spectral sensitivity function which is characteristic of human vision. The SI unit for luminance is candela per square metre (cd/m²). Usually, in the video domain, it is normalized with respect to a reference white.

Lightness is the perceptual response of human vision to luminance. Human vision has a nonlinear perceptual response to brightness ([Poy99]).

Hue is defined by the CIE as the attribute of a visual sensation according to which an area appears to be similar to one of the perceived colours red, yellow, green and blue, or to a combination of two of them. In other words, hue is the notion we normally use in everyday life when talking about colours.

Saturation is defined by the CIE as the colourfulness of an area judged in proportion to its brightness. Therefore, we can think of saturation as the level of purity of the perceived colour.

As concluded above, three components are necessary and sufficient to specify a colour. Therefore, the standards defined by the CIE consist of a series of methods to map a perceived colour to a vector with three components, which are the mathematical coordinates in the colour space.

Colour Representation and Colour Spaces

There are many systems defined for colour specification. The most used today include CIE XYZ, CIE xyY, CIE L*u*v*, CIE L*a*b*, CIE L*ch, HSV/HSB and HSL/HLS/HSI. For performance reasons, the systems used for digital image coding are somewhat different from the colour specification systems. These include linear RGB, nonlinear gamma-corrected R'G'B', nonlinear CMY, Y'CbCr, HSV/HSB, Y'UV and Y'IQ. The colour space used in digital computer imaging is R'G'B' (see Figure 2.9), which is the RGB colour space after gamma correction is applied.

Gamma correction is the process through which, in a video system, the light intensity is transformed into a nonlinear video signal. This correction is necessary to compensate for the non-proportionality between the voltage and the intensity of the light. The gamma correction is applied at capture time, in the camera. It is important to note that the RGB colour space is merely a convenient means of representing colour, and is not necessarily directly based on the types of cones in the human eye.

Figure 2.9: R’G’B’ Cube.

One disadvantage of the RGB colour space is that it does not show perceptual uniformity, which means that an equal translation in any of the space directions has a different perceptual result depending on the initial location. The R'G'B' space has somewhat better performance with respect to uniformity. The same disadvantage is also found in the CIE XYZ colour representation. The CIE needed more than a decade of research to find a new colour representation that did not show this problem. The research resulted in two colour specifications, CIE L*u*v* and CIE L*a*b*, without a clear preference for one or the other. Since the two colour spaces were still not perfect, there were successive refinements of their definitions. Another desired property of these colour spaces was to be similar to human colour perception. This means that two colours that are perceived as similar by a human observer should lie in the same region of the space; therefore, the Euclidean distance between them should be minimal.

Lip Pixel Segmentation

In order to describe the geometry of the lips, the first step is to detect which pixels belong to the lips. This process is called lip segmentation and is a type of image segmentation. The simplest way of doing this segmentation is to use a thresholding technique based on an appropriate colour encoding system. The process consists of filtering out the pixels whose colour is too far away from the desired colour, which in our case is the colour of the lips. While a binary segmentation can be used as well, in general it is more useful to generate a result in probabilistic terms. The result of the segmentation process should describe the degree of belongingness of a given pixel to the object, the lips in our case. Therefore, we use for our research a parabolic thresholding approach. This method for image segmentation was first proposed in [Coi96]. The result of the segmentation is computed according to the following parabolic shaped function:

F_X(x) = \begin{cases} 1 - \dfrac{(x - x_0)^2}{w^2}, & |x - x_0| \le w \\ 0, & |x - x_0| > w \end{cases} \qquad (2.4)

Attention should be paid to the dynamic range of the variable x, for instance in the case of Hue-based filtering, since Hue is a circular variable. The filter is defined by the range centre x_0 and by its half width w. Both values should be calibrated in advance in order to obtain sufficient accuracy. Figure 2.10 shows an example of such a filter.

Figure 2.10: A realisation of the parabolic filter for x0 = 0.6 and w = 0.25.

We may also combine a series of such parabolic shaped filters for more robust lip detection. Using the product of a Hue-based filter and a Value-based filter can, for example, remove some of the noise in the dark or bright areas of the image, where the hue values behave rather randomly. It is important, therefore, to use a uniform component of the colour representation. Another approach is to use other clustering techniques for pixel classification (e.g. neural networks as in [Woj03]). We found that the black-box approach based on a neural network outperformed the parabolic thresholding approach. However, the neural network approach is computationally more intensive.
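A possible implementation of the parabolic filter of Equation (2.4), including the circular treatment of hue and the combination of a hue based and a value based filter, is sketched below. The centre and width values are placeholders: the hue values follow the example of Figure 2.10, while the value filter parameters are pure assumptions, and all of them would have to be calibrated on the corpus.

import numpy as np

def parabolic_filter(x, x0, w):
    """Parabolic membership function of Equation (2.4); values in [0, 1]."""
    d = np.abs(np.asarray(x, dtype=float) - x0)
    return np.where(d <= w, 1.0 - (d / w) ** 2, 0.0)

def hue_filter(hue, x0, w):
    """Hue is circular, so distances are measured around the unit circle [0, 1)."""
    d = np.abs(np.asarray(hue, dtype=float) - x0)
    d = np.minimum(d, 1.0 - d)
    return np.where(d <= w, 1.0 - (d / w) ** 2, 0.0)

def lip_membership(hue, value, hue_centre=0.6, hue_width=0.25,
                   value_centre=0.5, value_width=0.4):
    """Combine a hue and a value filter; the parameter values are placeholders."""
    return hue_filter(hue, hue_centre, hue_width) * \
           parabolic_filter(value, value_centre, value_width)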

Most of the work in image segmentation is performed in the so called greyscale domain. The greyscale is a measure of brightness and is therefore related to the luma (Y') channel as defined in the Y'UV colour spaces and to L* as defined in L*a*b*. Properly, the greyscale is obtained from R'G'B' by first applying the gamma expansion to obtain the RGB values, then computing the luminance and finally applying the gamma compression to the result. Sometimes, however, the greyscale is computed from R'G'B' by simply averaging the three values. The luminance and the luma are also frequently used. In [Wan07] the authors use the combined spaces L*a*b* and L*u*v* for colour representation in order to segment the lips. For lip reading the hue as defined in the HSV colour space is very often used ([Zha00; Woj03]), since the lips are very distinctive in this domain. Hulbert and Poggio ([Hul98]) observed that the difference between the red and the green channel in the R'G'B' space is greater for lips than for skin. Based on this fact they proposed the use of a transformation which maximizes this difference. The resulting quantity is called pseudo hue and is defined as follows:

H = \frac{R'}{G' + R'} \qquad (2.5)

In [Ozg08] the authors use R'G'B', pseudo hue, hue and their combinations for lip segmentation. A comparison of the performance of different approaches to human face segmentation in difficult backgrounds is presented in [Ter99]. Based on the normalized pseudo hue, Eveno et al. ([Eve01]) define a chromatic curve that makes the difference between the pixels on the lips and the pixels on the face even larger. They also introduce a correction of the R'G'B' components that is meant to reduce the dependence on the luminance and, therefore, make the models less vulnerable to illumination variability. The (R_c, G_c, B_c) values are computed as follows:

Ch_c(x, y) = \frac{Ch(x, y)}{Ch(x, y) + (b - a) L(x, y) + a}, \qquad (2.6)

where Ch is one of the R', G' or B' channels, L(x, y) is the luminance, and the pair (a, b) = (0.4, 0.5) is chosen so as to set the relative importance of the luminance.
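The two transformations above translate directly into code. The short sketch below assumes the colour channels are given as floating point arrays normalized to [0, 1]; the small epsilon that avoids division by zero is our own addition.

import numpy as np

def pseudo_hue(r, g):
    """Pseudo hue of Equation (2.5): larger on the lips than on the skin."""
    return r / (g + r + 1e-6)

def corrected_channel(ch, luminance, a=0.4, b=0.5):
    """Luminance-corrected channel value of Equation (2.6)."""
    return ch / (ch + (b - a) * luminance + a)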

2.5 Lip Reading Primitives

This section introduces the visemes which are the lip reading counterparts of the phonemes.

2.5.1 Phonemes

In any spoken language a phoneme is the smallest segmental unit of sound which generates a meaningful contrast between utterances. Thus a phoneme is a group of slightly different sounds which are all perceived to have the same function by speakers of the language or dialect in question. An example of a phoneme is the group of /p/ sounds in the words pit, spin and tip. Even though these /p/ sounds are formed differently and are slightly different sounds, they belong to the same phoneme in English, because for an English speaker interchanging the sounds will not change the meaning of the word, however strange the word may sound. The phones, or sounds, that make up a phoneme are called allophones.

A speech recognizer can be built at word level or at sub-word level. While for a small vocabulary recognition task a word level system might be preferred, for large vocabulary, continuous speech tasks the phonemes are used as building blocks. Therefore, each phoneme in the target language corresponds to a recognition model in the speech recognizer. In the Dutch language, approximately 40 distinguishable phonemes are defined. However, there can be slight differences among phoneme sets as a consequence of the target dialect and of the definition of accepted words. In the present research we used the phoneme set defined in [Dam94]. One problem is generated, for instance, by neologisms. These words are divided into two classes: the ones that are already established in the language (e.g. the words of French origin) and have a stable pronunciation, but contain phonemes that are still under-represented in the language, and a second class of very new words (e.g. international English words from various technical and economic backgrounds) which bring a set of new phonemes that have no correspondence in Dutch. Tables 2.2 and 2.3 show the phonemes of the Dutch language as used in the Polyphone corpus. The phonemes are given in International Phonetic Alphabet (IPA), Speech Assessment Methods Phonetic Alphabet (SAMPA)², and HTK³ notations, respectively.

² The SAMPA notation is a computer readable phonetic notation based on IPA.
³ HTK is a toolkit for building Hidden Markov Models (HMMs), developed at the Machine Intelligence Laboratory of the Cambridge University Engineering Department. For more information please visit http://htk.eng.cam.ac.uk/ or read "The HTK Book" [You05]. The HTK column should be used to understand the results shown in the later chapters.

2.5.2 From Phonemes to Visemes

Even though the definition of the concept of phoneme crosses the boundary of the auditory realm, and is therefore not bound to any sensory modality, the term "viseme" is used as the counterpart of the phoneme in the visual modality. The term was introduced by Fisher in [Fis68]. The visemes have a definition similar to that of the phonemes, namely, a viseme is a set of indistinguishable phonemes; indistinguishable from the point of view of the visual information available and not, as in the phoneme case, from the point of view of their meaning. There are two direct consequences of this definition. Firstly, there is no exact method of deciding the number and composition of the viseme classes; this is actually done either by a theoretical discussion of auditory-visual lip reading of phonemes or by modelling the human ability to recognize the phonemes in the absence of the auditory stimulus, i.e. by modelling the degree of confusion of phonemes in the visual modality. Secondly, since there is no one-to-one mapping between the phonetic transcription of an utterance and the corresponding visual transcription, the separability of utterances in the visual modality decreases, which decreases the theoretical performance of a lip reader. The dependence of the visemes on the phonemes can be thought of as one reason why a new term was needed.

Table 2.2: Polyphone's Dutch phoneme set: consonants.

     Symbol               Example Word
     IPA  SAMPA  HTK      Orthography   Transcription    Translation
  1  p    p      p        pak           p a k            package
  2  b    b      b        bak           b a k            container
  3  t    t      t        tak           t a k            branch
  4  d    d      d        dak           d a k            roof
  5  k    k      k        kat           k a t            cat
  6  g    g      gg       goal          gg oo l          goal (sports)
  7  f    f      f        fel           f e l            fierce
  8  v    v      v        vel           v e l            sheet
  9  s    s      s        sein          s ei n           signal
 10  z    z      z        zijn          z ei n           to be
 11  x    x      x        acht          a x t            eight
 12  ɣ    G      g        negen         n ee g at        nine
 13  ɦ    h      h        hand          h a n t          hand
 14  ʒ    Z      zj       bagage        b a g aa zj at   luggage
 15  ʃ    S      sh       sjaal         sh aa l          scarf
 16  m    m      m        met           m e t            with
 17  n    n      n        nek           n e k            neck
 18  ŋ    N      nn       bang          b a nn           scared
 19  l    l      l        land          l a n t          country
 20  r    R      r        rand          r a n t          edge
 21  ʋ    w      w        wit           w i t            white
 22  j    j      j        ja            j a              yes

Unlike for English, to date there is only a limited number of publications which deal with the definition of visemes for Dutch; this is an almost complete list of them: [Bre85], [Cor84], [Egg64], [Son94], [Vis99] and [Beu96]. The papers [Son94] and [Beu96] (cited in [Woj03]) are the only examples, at least to the author's knowledge, where the classification into viseme sets is done by elicitation of the human confusion matrices of phonemes. The authors of [Son94] found in their experiments that Dutch lip readers are only able to recognize four consonantal and four vocalic visemes, which are shown in Table 2.4. In Table 2.4 the separation between the a) sets and the b) sets was made only by the keener respondents.

Table 2.3: Polyphone's Dutch phoneme set: vowels.

     Symbol               Example Word
     IPA  SAMPA  HTK      Orthography   Transcription    Translation
  1  ɪ    I      i        kip           k i p            chicken
  2  ɛ    E      e        pet           p e t            cap
  3  ɑ    A      a        pad           p a t            path
  4  ɔ    O      o        pot           p o t            pot
  5  ʏ    Y      y        put           p y t            put
  6  ə    @      at       de            d at             the
  7  i    i      ie       vier          v ie r           four
  8  y    y      yy       vuur          v yy r           fire
  9  u    u      u        voer          v u r            to feed
 10  a:   a:     aa       vaar          v aa r           sailing
 11  e:   e:     ee       veer          v ee r           spring
 12  ø:   2:     eu       deur          d eu r           door
 13  o:   o:     oo       door          d oo r           through
 14  ɛi   Ei     ei       fijn          f ei n           fine
 15  œy   9y     ui       huis          h ui s           house
 16  ʌu   Au     ou       goud          x ou t           gold
 17  ɛ:   E:     eh       crème         k r eh m         cream
 18  œ:   9:     euh      freule        f r euh l at     madame
 19  ɔ:   O:     oh       roze          r oh z at        pink

Table 2.4: Viseme sets according to Son et al. in SAMPA notation.

Group   Viseme class                  Description
a)      1   p, b, m                   bi-labial consonants
        2   f, v, w                   labiodental consonants
        3   s, z, S                   non-labial front
        4a  t, d, n, j, l             non-labial front consonants
        4b  k, R, x, N, h             non-labial back consonants
b)      5a  i, I, e:, E               close and half-close front vowels (unrounded)
        5b  Ei, a:, a                 half-open and open vowels (unrounded)
        6   u, y, 9:, O               short back vowels (rounded)
        7   2:, o:                    long back vowels (rounded)
        8   Au, 9y                    closing and rounding diphthongs

Table 2.5 shows the viseme set as reported in [Vis99]. The differences between the viseme sets obtained by different researchers can be attributed to various factors such as the set of utterances used, the respondents or the clustering parameters used. For instance, the authors of [Wil97] use a 75% minimum agreement to form a new cluster. Son et al. report in [Son94] that the level of proficiency in lip reading of the respondents does not influence the Dutch viseme classification.

Table 2.5: Viseme sets according to Visser et al. in SAMPA notation.

     Viseme class                          Viseme class
 1   p, b, m                           8   I, e:
 2   f, v, w                           9   E, E:
 3   s, z                             10   A
 4   S, Z                             11   @
 5   G, k, x, n, N, R, j              12   i
 6   t, d                             13   O, Y, y, u, 2:, o:, 9y, 9:, O:
 7   l                                14   a:

In the present research we use the viseme set shown in Table 2.6, which is a combination of the viseme sets found in the literature. The combination was made such that it agrees with the phoneme set used. It should be noted that while the Polyphone corpus uses 41 phonemes, only 34 phonemes appear in Table 2.4 and only 37 phonemes in Table 2.5. For instance, the phonemes h and ei do not appear in any of the above classifications; therefore, they were added as separate viseme classes. The last column in the table gives the working name for each class. This name will be used later in the results sections.

Table 2.6: The viseme sets used in the current research in working notation.

     Viseme class                              Working notation
 1   at                                        at
 2   ie                                        ie
 3   l                                         l
 4   a                                         a
 5   aa                                        aa
 6   f, v, w                                   fvw
 7   s, z                                      sz
 8   sh, zj                                    shzj
 9   p, b, m                                   pbm
10   g, k, x, n, nn, r, j                      gkx
11   t, d                                      td
12   i, ee                                     iee
13   e, eh, uh                                 eeh
14   o, y, yy, u, eu, oo, euh, oh, ou          oyu
15   h                                         h
16   ei                                        ei
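For clarity, the mapping of Table 2.6 can be expressed as a simple look-up; the dictionary and helper function below are illustrative only and assume the HTK phoneme notation of Tables 2.2 and 2.3.

# Phoneme (HTK notation) to viseme mapping following Table 2.6.
PHONEME_TO_VISEME = {
    "at": "at", "ie": "ie", "l": "l", "a": "a", "aa": "aa", "h": "h", "ei": "ei",
    "f": "fvw", "v": "fvw", "w": "fvw",
    "s": "sz", "z": "sz",
    "sh": "shzj", "zj": "shzj",
    "p": "pbm", "b": "pbm", "m": "pbm",
    "g": "gkx", "k": "gkx", "x": "gkx", "n": "gkx", "nn": "gkx", "r": "gkx", "j": "gkx",
    "t": "td", "d": "td",
    "i": "iee", "ee": "iee",
    "e": "eeh", "eh": "eeh", "uh": "eeh",
    "o": "oyu", "y": "oyu", "yy": "oyu", "u": "oyu", "eu": "oyu",
    "oo": "oyu", "euh": "oyu", "oh": "oyu", "ou": "oyu",
}

def to_visemes(phonetic_transcription):
    """Map a space separated HTK phonetic transcription to its viseme string."""
    return [PHONEME_TO_VISEME[p] for p in phonetic_transcription.split()]

# to_visemes("a x t")  ->  ['a', 'gkx', 'td']   (the word <acht>, cf. Figure 2.11)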

Figure 2.11 shows instances of each of the 16 visemes listed in Table 2.6, taken from the data corpus built for the current research. The images show only the apex of each viseme. Of course, the complete impression can only be acquired by watching the entire sequence of images. Also, since we did not record isolated visemes, with the exception of the letters that are transcribed with only one viseme (e.g. < E > = [iee]), the co-articulation effect can be very large in the resulting mouth movement. Nevertheless, we can conclude that there is a large difference between the visemes and that a native Dutch speaker could easily recognize each viseme. It should be noted that the appearance of the face also depends on the way the speaker talks and articulates the words, and therefore it is not always easy to recognize all the visemes. Figure 2.12 shows four instances of the viseme [gkx], which proved extremely difficult to classify during our experiments. As we can see in Table 2.6, this viseme gathers a large set of phonemes that all correspond to non-labial back consonants, which means that they are produced inside the mouth and probably show little correlation with the position of the lips.

Figure 2.11: Instances with all the visemes used in our research taken from NDUTAVSC corpus. From top-left to bottom-right we have: [a] as in word < acht > = [a gkx td], [at] as in word < zeven > = [sz iee fvw at], [eeh] as in letter < F > = [eeh fvw], [ei] as in letter < IJ > = [ei], [fvw] as in letter < F > = [eeh fvw], [h] as in letter < H > = [h aa], [ie] as in letter < I > = [ie], [iee] as in letter < E > = [iee], [l] as in letter < L > = [eeh l], [oyu] as in letter U = [oyu], [gkx] as in letter < Q > = [gkx oyu], [sz] as in letter < Z > = [sz eeh td], [td] as in letter < D > =[td iee], [aa] as in letter < A > = [aa], [pbm] as in letter < P > = [pbm iee], and [shzj] as in word < plaatsje > = [pbm l aa td shzj at].

2.5.3 Separability of Utterances as a Result of Viseme Definition

As mentioned in the previous section, the definition of visemes brings along a lower theoretical performance for a lip reader in comparison with its counterpart in the aural modality.

Figure 2.12: Four instances of viseme [gkx] taken from the NDUTAVSC corpus. We have [gkx]: a) as in letter < Q > = [gkx oyu], b) as in letter < G > = [gkx iee], c) as in letter < J > = [gkx iee], and d) as in letter < K > = [gkx aa].

In other words, the separability of the words in the vocabulary decreases because some words end up with the same transcription after being mapped to the visual modality. For instance, when mapping the 26 letters of the Dutch alphabet to their viseme transcriptions, we obtain only 20 distinct entities, because the pairs (B, P), (D, T), (G, J), (N, R), (O, U) and (V, W) have the same viseme representation. A similar problem appears for the English alphabet, where the pairs (A, K), (B, P), (C, Z), (D, T), (S, X) and (Q, U) are also found to be visually indistinguishable ([Pet88]). In his work, J. Wojdel [Woj03] introduces the notion of a Viseme Set, a more informative viseme classification which also differentiates the visemes based on their syllabic context. However, as the author remarks, the number of classes grows exponentially with the task vocabulary, which makes them unsuitable for implementation. Based on the amount of information the viseme sets carry, it was computed that for the Polyphone corpus the separability of the words can decrease in the range from 3%, when the full phonetic representation is used, to 13%, when only the viseme representation is used ([Woj03]). In the corpus we used there was a 7% decrease in separability. Therefore, the 13% decrease can be seen as a theoretical upper bound on the performance of the lip reader. An additional decrease is caused by the propagation of misclassifications, which results from the more difficult task of correcting a misclassified word due to a larger set of close neighbours.
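The decrease in separability can be estimated directly from a viseme-transcribed lexicon. The sketch below counts the fraction of vocabulary entries whose viseme transcription is no longer unique; the function name and the data layout are our own.

from collections import defaultdict

def separability_loss(lexicon):
    """Fraction of vocabulary entries that lose a unique transcription after
    the mapping to visemes.

    `lexicon` maps every word to its viseme transcription (a sequence of visemes).
    """
    groups = defaultdict(list)
    for word, visemes in lexicon.items():
        groups[tuple(visemes)].append(word)
    ambiguous = sum(len(g) for g in groups.values() if len(g) > 1)
    return ambiguous / len(lexicon)

# For the Dutch alphabet, pairs such as (B, P) collapse onto the same viseme
# string, e.g. B = ('pbm', 'iee') and P = ('pbm', 'iee').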

2.6 Hidden Markov Models Methodology

The theoretical considerations of the Hidden Markov Models (HMMs) are given in Chapter 3. In the current section we briefly introduce the reader to the concepts behind HMMs and to their usage as statistical inference engines. Thereafter, we define the basic elements needed to build a lip reader based on the HMM approach and discuss some of the major decisions made at design time.

HMMs were introduced in the late 1960s in a series of statistical papers by Leonard E. Baum and others ([Bau66; Bau67; Bau68; Bau70; Bau72]). After a period of partial obscurity, by the middle of the 1970s the use of HMMs exploded in many areas where time series were used. These models are based on the well known Markov process used in probability theory. A Markov process is a one-step stochastic process: it assumes that the future state of the physical system it models depends only on the current state of that system. This property is called the Markov property. A process is thus seen as a finite state process with a probabilistic transition function. In a Markov model each state corresponds to one observable event; hence we can say that in the case of a Markov model we observe the state of the system directly. This is not appropriate for real systems for two reasons: firstly, it is seldom possible to actually observe the state of a system (in most cases we measure a quantity that has a functional relationship with the system we want to model), and secondly, the large number of possible observations (i.e. states) would make the model grow exponentially. To overcome these problems the HMM introduces the notion of hidden states. Hence the states of the system are now hidden and the observations are made on variables that are induced by the current state of the system.

To date, HMMs still provide the best performance in the speech reading domain. Other domains related to temporal pattern recognition also use the HMM approach, such as gesture and body motion recognition, optical character recognition, machine translation, bioinformatics and genomics. Their use in bioinformatics started in the mid-1980s for problems such as the prediction of protein-coding regions in genome sequences, the modelling of families of related DNA sequences, the prediction of secondary structure elements from protein primary sequences, and so on. Since lip reading is still speech recognition, only in a different modality, HMMs are a logical choice. The use of HMMs is also justified by the fact that they easily support the feature fusion of the aural and visual modalities for multi-modal speech recognition. Other derivations of the HMM, such as the Cartesian Product HMM, Factorial HMM, Linked HMM, Hierarchical HMM, Coupled HMM, Multi-stream HMM and Product Multi-stream HMM, can cope with other fusion approaches. However, even though these models aim at overcoming known problems with the classical HMMs, they exhibit an increased complexity (for instance, the number of parameters to be estimated grows exponentially, requiring an enlarged dataset for training) which greatly diminishes their use.

Since the application of HMMs became widespread, many implementations have appeared, such as "The Dragon system" ([Bak75]), "The HARPY system" ([Low76]) and the HTK Toolkit ([You05]), among many others. In our research we used regular HMMs and the implementation offered by the HTK Toolkit.
The toolkit contains all the tools necessary for building a speech recognizer: from data acquisition to data preparation, from creating the HMM models to training them and tuning them in a highly parameterised manner, and finally to testing the recognizer. It also provides software for language modelling and for building other needed components, such as dictionaries and word networks. The main advantage of working with the HTK toolkit is its high modularity. Even though it was created specifically for speech recognition, thanks to its modular architecture it can easily be used for different problems, in our case lip reading. We only needed to implement our own feature extraction modules and simply replace the speech data with the visual data. The training of the models is done using the Baum-Welch algorithm. Finally, the testing of the system is done with the Viterbi algorithm.

2.6.1 Modelling the Visemes Using HMM

As in a sub-word based speech recogniser, the building blocks of our lip reader are the visemes of the Dutch language. Therefore, one HMM corresponds to one viseme. To the set of visemes two special models are added, namely sp for "short pause" and sil for "silence". The sp model is used for the recognition of the short pauses between words, while sil is used for the silence moments before and after the utterance. Depending on the recognition task, some visemes do not appear at all in the expected utterances and are, therefore, excluded from the study. This is the case for the digit and letter tasks. The set of visemes which appear in the digit recognition task is listed in Table 2.7 and the set of visemes which appear in the letter recognition task is listed in Table 2.8. The visemes "at" and "a" are only present in the digit set, while the visemes "aa" and "pbm" are only present in the letter set.

Table 2.7: The viseme set in HTK working notation for the digit recognition task.

     Viseme         Viseme
 1   gkx        8   ei
 2   oyu        9   sz
 3   l         10   eeh
 4   iee       11   at
 5   td        12   a
 6   fvw       13   sil
 7   ie        14   sp

Table 2.8: The viseme set in HTK working notation for the letter recognition task.

         Viseme            Viseme
     1   aa            9   h
     2   pbm          10   ie
     3   iee          11   ei
     4   sz           12   l
     5   td           13   oyu
     6   eeh          14   sil
     7   fvw          15   sp
     8   gkx

The topology of the models used for modelling the visemes, also commonly used for phoneme-based speech recognition, is a 3-state left-right topology with no skips, as shown in Figure 2.13. For implementation reasons, HTK requires that the models start and end with a non-emitting node, which facilitates the generation of recognition networks. A recognition network consists of a string of linked models which are matched against the input utterance during recognition. In Figure 2.13 the numbers on the arcs represent the initial transition probabilities, set before training. Under the emitting states there is a generic drawing of the distribution of the feature vectors, which is approximated by a mixture of Gaussian distributions.

Figure 2.13: The models used for modelling the visemes. The topology is 5-state left-right with three emitting states. The arcs are annotated with transition probabilities.

The modelling of the two silence models is introduced in the next section.

2.6.2 Silence and Pause Models

It is not possible to build a continuous speech recognizer without including a model for silence. However, there are two types of silence: the short pauses between words and the silence at the beginning and at the end of the utterance. The silence model that covers the entering and exiting time of the utterances can be modelled using the same topology as for the viseme models (i.e. the 3-state left-right topology). However, in order to make the model more robust by allowing the states to absorb more non-verbal mouth movement, the silence model is modified so that a backward transition from state 4 to state 2 is accepted. The model for short pause is built starting from the model for [sil]. The short pause model is a so-called tee-model and has a single emitting state which is tied to the central state of the [sil] model. This means that the central state of the [sil] model and the emitting state of the [sp] model share the same Gaussian mixture and are therefore trained using the same data. Parameter tying is very often used in speech recognition when there is not sufficient data to train models for similar entities. The topology used for the two silence models is shown in Figure 2.14. The silence models defined above are the same as the ones used for speech recognition. However, there is a big difference between the concept of silence in speech recognition and the concept of silence in lip reading; consequently, noise has a broader definition in the visual domain. For instance, in the case of visual speech the speaker can move his mouth for non-verbal reasons (e.g. to moisten his lips, or to exteriorise his emotional state by showing a facial expression).

Figure 2.14: The models used for modelling the silence.

The noise sources are thus more diverse for lip reading. Even though the silence model has an extra backward arc which should, in principle, also accommodate noise in the training data, we found in our experiments that the silence model defined in this way did not perform at the same level as in the case of speech. As we will see later in the results sections, the insertion rate was sometimes unexpectedly large. This can also be due to poorly trained silence models.

2.6.3 Modelling the Low Level Context Using Tri-visemes

In order to model the context at the level of the visemes, each viseme is considered in all its possible contexts. Only a one-step context is considered, namely for each viseme only the possible left and right neighbouring visemes are taken into account; therefore, the new entity is called a tri-viseme. The notation for tri-visemes is lf-vis+rt, where "vis" is the viseme in question, "lf" is the left context and "rt" is the right context. For instance the word nul (the Dutch word for the number zero), with the viseme transcription gkx oyu l, generates the following tri-visemes: gkx+oyu, gkx-oyu+l and oyu-l. The context of each viseme can be built at word level, also called word internal, or at the level of the utterance, called word external. In the first case, for finding all possible contexts of a viseme, only the words in the vocabulary are considered, while in the second case combinations of words can also build the context. It should be noted that sometimes bi-visemes (i.e. viseme contexts containing only the left or the right viseme) are also

generated. For each tri-viseme a new model is built, which makes the number of models explode and the data requirements for training a tri-viseme based recognizer many times larger. The major problem with the tri-visemes is that some contexts appear only once (or a very small number of times) in the training data, or can even be absent from the training data, as in the case of trans-word boundary contexts. To solve this problem the parameter tying technique is used. The clustering of similar contexts can be made either by a data-driven approach, or by the use of decision trees. Even after the parameter tying, there can still be tri-viseme models which are undertrained.
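To make the construction of these context-dependent units concrete, the listing below sketches, in Python, how word-internal tri-visemes could be generated from a viseme transcription. It only illustrates the notation introduced above; the function name and the handling of word boundaries are our own choices and not part of the HTK tool chain.

def tri_visemes(visemes):
    # visemes: list of viseme symbols for one word, e.g. ["gkx", "oyu", "l"]
    units = []
    for i, v in enumerate(visemes):
        left = visemes[i - 1] if i > 0 else None
        right = visemes[i + 1] if i < len(visemes) - 1 else None
        if left and right:
            units.append("%s-%s+%s" % (left, v, right))   # full tri-viseme
        elif right:
            units.append("%s+%s" % (v, right))            # bi-viseme, no left context
        elif left:
            units.append("%s-%s" % (left, v))             # bi-viseme, no right context
        else:
            units.append(v)                               # single-viseme word
    return units

# The word "nul" (gkx oyu l) yields: ['gkx+oyu', 'gkx-oyu+l', 'oyu-l']
print(tri_visemes(["gkx", "oyu", "l"]))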

2.6.4 Gaussian Mixtures

The HMM approach considers that each of the emitting states in the model is described by a continuous density distribution. This distribution is approximated in HTK by a mixture of Gaussian distributions. Building the models in HTK starts with a single Gaussian distribution. In the refining step the number of Gaussian components is increased iteratively by 1 or 2 units until the optimum number of components is obtained. By monitoring the change in performance, the optimum number of mixtures can be found. During our experiments we iteratively increased the number of mixtures by one up to a maximum of 32 mixtures. The "magic" number 32 was found sufficiently large to cover the optimum number of mixtures in all the experiments.
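The refinement loop can be summarised by the sketch below. The evaluate function is a placeholder for the actual HTK steps (splitting the mixture components, re-estimating the models and scoring a development set); the fake score curve is used here only so the sketch runs on its own.

def evaluate(n_components):
    # Placeholder for "re-train the HMMs with n_components mixtures and score a
    # development set"; here we just fake a curve that peaks, for illustration.
    return -(n_components - 12) ** 2

best_n, best_score = 1, evaluate(1)
for n in range(2, 33):              # up to the 32 components used in our experiments
    score = evaluate(n)
    if score > best_score:
        best_n, best_score = n, score
print("optimum number of mixture components:", best_n)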

2.7 Language Models

A speech recognizer in general, and therefore a lip reader as well, needs a method to choose among the possible hypotheses built from the acoustic (visual) evidence. This method should not only be able to judge whether a hypothesis is a legal sentence according to the rules of the given language, but also to decide which of the hypotheses is syntactically and semantically more suitable. This task is fulfilled by a language model, which defines the set of legal sentences in a language and assigns a probability to them. In mathematical terms this means that a language model represents a probability distribution over the space of the words and the legal sequences built with them, which in equation 2.2 is denoted by P(W). The main problem of language modelling is how to assign a probability at the sentence level. The most used approach is the chain rule from probability theory which, based on the notion of conditional probabilities, permits splitting the sentence into smaller units for which assigning a probability is much easier:

P(W_1 \ldots W_n) = P(W_1) P(W_2 \mid W_1) P(W_3 \mid W_1 W_2) \cdots P(W_n \mid W_1 W_2 \ldots W_{n-1}).   (2.7)

In a way, the above formula suggests that the main task of a language model is to predict the likelihood of the next word based on the previous words in the sentence. Of course, it is also possible to directly consider the whole sentence as an indivisible entity. There are many ways to develop a language model, but the most successful approaches are n-grams and grammar based language models. Other improvements and additions to these models try to make them more robust, for instance by splitting the sentence space into smaller classes and building language models for each class. A context based approach can also improve the performance of the language model. This means building language models for classes based on topic.

2.7.1 Grammar Based Language Models

The grammar based language models use a number of constraints on the set of sentences that are legal in the language. Therefore, the syntactic structure of the sentences controls almost entirely the distribution over the language. The grammar rules dictate, based on a given history, the most probable next word. There is a strong correlation between the recognition task and the set of grammatical rules used to build the language model. For instance, in the case of the digit string recognition task, the length of the strings is enforced through the grammar. The grammar can also link multiple recognition tasks (e.g. the single digit recognition task with the digit string recognition task). The grammar used for digit recognition in our case is given in the following listing. In our experiments the digit recognition task consists of either single digit utterances or strings of exactly eight digits.

$digit = <0> | <1> | <2> | <3> | <4> | <5> | <6> | <7> | <8> | <9>;
$eightdigits = $digit $digit $digit $digit $digit $digit $digit $digit;
$connectedDigits = $digit | $eightdigits;

( SENT-START $connectedDigits SENT-END )

The $ sign announces the declaration of a variable. The "|" represents a disjunctive gate, meaning that any of the operands can appear. SENT-START and SENT-END are two special words which delimit an utterance and translate to silence. The grammars we used in our experiments to define the recognition tasks are included in Appendix B. The later recognition tasks include digit strings, letter strings, topic based utterances (where the chosen topic is banking applications) and word strings. For the continuous speech task, the grammar based language model approach is not feasible. To handle this task we used an n-gram approach, which is covered in the next section.

2.7.2 n-Grams

n-Grams are the most popular type of language models. An n-gram considers all sequences of (n−1) words as belonging to the same equivalence class. The probability of the n-th word depends only on the distribution of the words conditioned on the history's equivalence class. Therefore, the solution of the n-gram is a counting problem, which makes such a language model extremely easy to implement. However, as the length of the history increases, the number of n-grams increases exponentially, requiring a very large amount of data in order to accurately describe the conditional distributions. Therefore, the length of the history (i.e. the number "n") is determined by the amount of data available. In most cases n is equal to 2, 3 or 4, but the tri-grams are by far the most preferred n-grams.
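Because an n-gram model reduces to counting, the maximum-likelihood estimate of a trigram probability can be sketched in a few lines of Python, as below. This is only a bare illustration of the counting idea; a practical model would additionally need smoothing and discounting for unseen histories, which are not shown here.

from collections import defaultdict

trigram_counts = defaultdict(int)
bigram_counts = defaultdict(int)

def count_sentence(words):
    # pad the sentence so the first words also have a history
    padded = ["<s>", "<s>"] + words + ["</s>"]
    for w1, w2, w3 in zip(padded, padded[1:], padded[2:]):
        trigram_counts[(w1, w2, w3)] += 1
        bigram_counts[(w1, w2)] += 1

def trigram_prob(w1, w2, w3):
    # maximum-likelihood estimate P(w3 | w1 w2) = C(w1 w2 w3) / C(w1 w2)
    history = bigram_counts[(w1, w2)]
    return trigram_counts[(w1, w2, w3)] / history if history else 0.0

count_sentence("een twee drie".split())
print(trigram_prob("<s>", "<s>", "een"))   # 1.0 for this tiny example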

2.8 Measures for System Performance

For checking the performance of a lip reader we used the concepts of Word Correct Rate (WCR) and Accuracy (Acc) as they are defined by the HTK Toolkit, the tool we used for building and testing our systems. The counterpart of the Acc measure, namely the Word Error Rate (WER), is the most popular metric used for assessing the performance of speech recognition or machine translation systems. Both measures are usually computed and reported at both item and sequence levels. Even though a lip reader can be built starting from both word and sub-word levels, in general the measurement of performance is done at word and word sequence levels, respectively. The difficulty of evaluating the performance comes from the fact that the recognized sequence can have a different length from the reference sequence. The WER measure is derived from the Levenshtein distance, also called the edit distance. The edit distance between two strings is the minimum number of insertions, deletions and substitutions required to transform one string into the other [Lev66]. The WER is the edit distance between the reference word sequence and the result of the automatic lip reader, normalised by the length of the reference word sequence. This normalisation removes the dependence on the length of the reference sequence and makes possible the comparison between different systems and also between different recognition tasks. Therefore, WER is defined as follows:

WER = \frac{S + D + I}{N} \times 100\%,   (2.8)

where N is the total number of words in the reference sequence, D is the number of deleted words from the reference sequence, S is the number of substituted words in the resulting sequence and I is the number of inserted words in the resulting sequence (i.e. the words that do not exist in the reference sequence). The Word Recognition Rate (WRR) or Acc is thus defined as:

WRR = 100\% - WER = \left(1 - \frac{S + D + I}{N}\right) \times 100\% = \frac{H - I}{N} \times 100\%,   (2.9)

where H = N − D − S represents the number of correctly recognized words. Based on H we define the WCR as follows:

WCR = \frac{H}{N} \times 100\%.   (2.10)
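The listing below gives a small Python sketch of how these measures can be computed from a reference and a recognized word sequence via the standard edit-distance alignment. It mirrors the definitions in equations 2.8-2.10 but is our own illustration, not the HTK implementation, so tie-breaking details of the alignment may differ.

def wer_wcr(reference, hypothesis):
    # classic edit-distance table between the two word sequences
    n, m = len(reference), len(hypothesis)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    # backtrack to count hits (H), substitutions (S), deletions (D), insertions (I)
    H = S = D = I = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] and reference[i - 1] == hypothesis[j - 1]:
            H += 1; i -= 1; j -= 1
        elif i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + 1:
            S += 1; i -= 1; j -= 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            D += 1; i -= 1
        else:
            I += 1; j -= 1
    N = float(n)
    wer = (S + D + I) / N * 100.0          # eq. 2.8
    acc = (H - I) / N * 100.0              # eq. 2.9
    wcr = H / N * 100.0                    # eq. 2.10
    return wer, wcr, acc

print(wer_wcr("nul een twee drie".split(), "nul twee twee drie vier".split()))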

As noted in [McC05] the WRR suffers from a number of limitations which sometimes make it difficult to interpret. Due to the use of the number of insertions in its definition, the WRR can take negative values, so it is not bounded to the [0, 100%] interval. On the other hand, the WCR does not have this problem and therefore has a clearer interpretation. Also, the WRR does not consider the varying importance of the words, further limiting its analytical power. In order to correct some of these problems, other performance measures were developed. For instance, the paper [McC05] introduces a measure for evaluating the system results based on an information retrieval approach, where each word is one unit of information. The measure is based on the concepts of recall and precision. The paper [Mor04] introduces a new measure, Word Information Preserved (WIP). This is derived as an approximate measure of the mutual information between the reference and the target sequence and is defined by:

WIP = \frac{H^2}{(H + S + D)(H + S + I)}.   (2.11)

In order to take into account the effect that different types of errors have on the result, [Hun89] introduced a weighted WER defined as:

WER_w = \frac{S + 0.5D + 0.5I}{N},

thereby giving the substitution errors a larger weight than the deletion or insertion errors. Another measure for the goodness of a speech recognizer in general was introduced by Roger K. Moore [Moo77], HENR. The performance of the system is compared to human performance scores. While this measure is relatively independent of the test vocabulary and gives a comparison to human performance, it has the disadvantage that it is very laborious to compute.

2.9 State of the Art in Lip Reading

It is about three decades since the automatic lip reading domain emerged in the scientific community. However, only starting from the 1990s, and more sustained in the second half of the 1990s, did the subject become viable. Even today it still lags behind speech recognition by some decades. Until a few years ago the most impeding factor was the computational power of the available computers. Nowadays it is the difficulty of finding the most suitable visual features that capture the information related to what is being spoken, together with the hard problem of accurately detecting and tracking the facial elements that convey speech related information. The automatic and robust detection and tracking of the face elements is still not entirely achieved by the current technology. As in other similar visual pattern recognition applications, the two monsters "illumination variations" and "occlusions" are still alive and menacing. A special case of occlusion is in this case generated by the posture of the speaker. Therefore, any study concerning lip reading deals with the overwhelming task of manually, or in the best case semi-automatically, processing the data corpus. The data corpora for lip reading are still very small, partially due to storage, bandwidth and other recording related limitations, but much more due to the overwhelming task of processing and preparing the data for experiments. Because of these issues, as presented in Section 2.3, each data corpus is created for a stated recognition task. The lip reading experiments to date are limited to isolated or connected random words, isolated or connected digits, and isolated or connected letters. Some of the reported performances are listed below. However, it is very important to keep in mind that, because the data corpus used influences to a great extent the performance of the lip reader, a comparison among the experiments is not always possible. When the corpora are about the same, the comparison of the different feature types and feature extraction techniques becomes feasible. The listing can still give an impression of the state of the art in lip reading. The task of isolated letters was among the first analysed, by Petajan et al. ([Pet88]) back in 1988. The authors report correct recognition close to 90%. However, based on the AVletters data corpus, Matthews et al. ([Mat96]) report only a 50% recognition rate. Li et al. ([Li95]) report a perfect recognition of 100% on the same task, but two years later in [Li97] only 90% recognition. The second most popular task is digit recognition, either in isolation or as connected strings. Based on the TULIPS1 data corpus, which only contains the first four digits, Luettin et al. ([Lue96]) and Luettin and Thacker ([Lue97]) reported 83.3% and 88.5% recognition rates, respectively. Arsic and Thiran ([Ars06]) report on the same data corpus 81.25% and 89.6%, depending on the feature extraction method. Other experiments with the digit recognition task are: Potamianos et al. ([Pot98]) reported 95.7%, Dupont and Luettin ([Dup00]) reported 59.7%, Wojdel reported in his thesis ([Woj03]) 91.1% correct recognition and 81.1% accuracy, Potamianos et al. ([Pot04]) reported 63% and Pérez et al. ([Pr05]) 47%. Lucey and Potamianos ([Luc06]) reported 74.6% recognition rate for the isolated digits task. Potamianos et al. (Potamianos1998a) report 64.5% recognition rate for the connected letter task. For the isolated word task Nefian et al.
([Nef02]) report 66.9%, Zhang et al. ([Zha02]) report 42% and Kumar et al. ([Kum07]) report 42.3%. We can conclude that there is still a large variation in the performances obtained, and there is still no convergence visible, since the newer studies do not necessarily show an increase in accuracy. This is, in our opinion, clearly a sign of the immaturity of the lip reading domain. Also, as can be observed in the listing above, there are as yet no results of experiments with continuous speech as defined in Section 2.2. Potamianos et al. ([Pot04]) report an extremely low result on the continuous speech task, namely 12%. The lip reading domain is still young and there are many limiting factors that need to be conquered. Therefore, the experiments in lip reading still deal with relatively easy tasks. However, the promising results in these tasks give us hope that larger experiments are possible. As the domain becomes more popular, the number of data corpora will increase and, with better cooperation among scientists, it will become possible to compare the achievements. However, as shown in Section 2.5.3, there are objective factors which limit the performance of lip readers. Nevertheless, as shown in many studies, lip reading can be successfully used in conjunction with speech for an enhanced speech recognition system.

Chapter 3 Computational Models

This chapter introduces the theoretical aspects of the models, methods and algorithms used in the experiments presented in this thesis. Most of the concepts presented in this chapter were already introduced in Chapter 2. The chapter is intended for readers who want a detailed view of the mathematical foundations of the methodology used. It is, therefore, not required for understanding the arguments, the experiments and the results presented in the later chapters. The chapter is divided into two parts: inference techniques and data parametrization techniques.

3.1 Hidden Markov Models

This section introduces the hidden Markov models paradigm and gives the formal mathematical foundations for using this technique. The basic questions around HMMs and the solutions to them are briefly repeated here for the sake of completeness. There are many books and tutorials written on this topic, especially from the point of view of the use of HMMs in speech recognition, but maybe the most complete and easy to follow tutorial is the one written by Lawrence R. Rabiner in 1989 ([Rab89]). At the time of writing of this thesis the paper [Rab89] was already cited in more than 10500 scientific papers. The HMMs were introduced in the late 1960s in a series of statistical papers by Leonard E. Baum and others ([Bau66; Bau67; Bau68; Bau70; Bau72]). After a period of partial obscurity, by the middle of the 1970s the use of HMMs exploded in many areas where time-series were used. Their use in speech recognition applications was a big step forward, and to date they still provide the best results in some applications, speech recognition included. The HMMs are an extension of the well known Markov processes used in probability theory and were introduced as "probabilistic functions of Markov chains". Therefore, the HMMs represent a stochastic approach to modelling the world. The

main property of a Markov process, called "the Markov property", is the notion of memory. A Markov chain has a finite memory, which means that the future state of the physical system that it models depends only on the current state and a finite number of consecutive past states of that system. The HMMs are actually based on a one-step memory Markov process. This process considers that the future state of the physical system that it models depends only on the current state of that system. In mathematical terms we have the following formalism. Consider a system that at any time can be in one of N distinct states S_1, S_2, ..., S_N, as shown in Figure 3.1. At any time step t = 1, 2, ... the system can be in only one state, which is denoted by q_t. The Markov property, in the case of a first order Markov chain, as is the case here, is expressed as follows:

P(q_t = S_j \mid q_{t-1} = S_i, q_{t-2} = S_k, \ldots) = P(q_t = S_j \mid q_{t-1} = S_i).   (3.1)

Making the assumption that the Markov property is independent of the current time, namely

P(q_t = S_j \mid q_{t-1} = S_i) = P(q_2 = S_j \mid q_1 = S_i), \quad t \geq 2,   (3.2)

the above probability is called the state transition probability and is denoted by a_{ij}:

a_{ij} = P(q_t = S_j \mid q_{t-1} = S_i), \quad 1 \leq i, j \leq N.   (3.3)

The state transition probabilities have the standard stochastic properties, namely:

a_{ij} \geq 0, \quad 1 \leq i, j \leq N, \quad \text{and}   (3.4)

\sum_{j=1}^{N} a_{ij} = 1.   (3.5)

The assumption so far is that we can directly observe in which state the system is at any given moment in time. Such a model is called an observable process and each state corresponds to one observable event. This is not appropriate for real systems for two reasons: firstly, it is seldom possible to actually observe the state of a system (in most cases we measure a quantity that has a probabilistic relationship with the state of the physical system), and secondly, the large number of possible observations (i.e. states) would make the model grow exponentially. Therefore, the actual state of the system is not observable, thus hidden, and can only be assessed through a set of observations produced by another stochastic process. In Figure 3.1 the round nodes represent the hidden states, and the rectangular nodes represent the visible stochastic processes that produce the observations. Usually, these observable nodes are omitted when representing an HMM, hence its graphical description coincides with the graphical description of a Markov process. The missing arcs in the Markov process correspond to a null probability. To conclude, an HMM is characterized by:

1. The number of states N.

Figure 3.1: A HMM with 3 hidden states (S1, S2, S3) and 4 observation nodes (O1, O2, O3, O4). The transition probabilities are denoted by aij and the observation probabilities are denoted by bi(j) = bij.

2. The set of possible states S = {S1,S2, ..., SN }.

3. The number of distinct observation symbols M, which form the output alphabet.

4. The individual symbols in the output alphabet O = {O1,O2, ..., OM }.

5. The state transition probability coefficient matrix A = (a_{ij})_{i,j}, where

a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i), \quad 1 \leq i, j \leq N,   (3.6)

and

a_{ij} \geq 0, \quad 1 \leq i, j \leq N, \quad \text{and} \quad \sum_{j=1}^{N} a_{ij} = 1.   (3.7)

6. The probability distribution on the observation symbols in state j, denoted by B = {bj(k)}, where

b_j(k) = P(O_t = O_k \mid q_t = S_j), \quad 1 \leq j \leq N, \ 1 \leq k \leq M.   (3.8)

7. The initial state distribution π = {πi}, where

\pi_j = P(q_1 = S_j), \quad 1 \leq j \leq N.   (3.9)

Written in a concise form, an HMM is entirely defined by the seven entities λ = (N, S, M, O, A, B, π). For convenience, the HMM is usually denoted only by its probability distributions, since they are of greater importance and differentiate models with the same structure that model different physical processes: λ = (A, B, π). The immediate application of an HMM, starting from the above definition, is to use it as a generator of observation strings. This only requires sampling from the distributions that appear in the definition and, based on the result, traversing the states of the model and generating observations. However, the question that is of interest from the point of view of real applications is:

Given the observation sequence O = O1O2...OT , and a model λ = (A, B, π) what is the probability that the observation sequence O was produced by the model λ, namely P (O|λ)?

The answer to this question is what allows the system to do recognition. Suppose we have a number of HMMs, each corresponding to a word in the recognition task. To recognize a given observation sequence we have to answer the previous question for each model. We choose as the recognition result the model which produces the highest probability. A related question is:

Given the observation sequence O = O1O2...OT , and a model λ = (A, B, π), what is the state sequence Q = q1q2...qT which optimally explains the observation sequence?

The answer to this question provides insight into the underlying physical process that is modelled. The analyst can decide to change the number of states and the set of possible observations of the model. The last meaningful question from the point of view of real applications is:

How do we build the model λ = (A, B, π)?

In other words, the last question can be reformulated as: after we have decided on the number of states and the number of possible observations, and we have an initial guess about the probability distributions π, A and B, how do we adjust these probability distributions so as to maximize the probability P(O|λ)? The process that answers this question is called "training": the best model parameters are learned based on a set of data for which we already know the classification. This is called supervised training. In the case of lip reading and speech recognition the most important questions are therefore question 3 and question 1. We first train the models based on the available data corpus, thereby using question 3, and afterwards we test the resulting system on a number of test utterances, thereby using question 1. The rest of this section gives the most popular formal mathematical solutions to the above questions.

3.1.1 The Forward-Backward Algorithm

From the theoretical point of view, computing the probability that a given observation sequence O = O_1 O_2 ... O_T was produced by a given model λ = (A, B, π) is relatively straightforward because it only requires the definition. However, this solution is not feasible from the implementation point of view. The Forward-Backward algorithm is a procedure that overcomes the computational complexity. We start by realising that the observation sequence O can be produced by taking any route Q of length T in the model, of course with a different probability. The probability that the sequence O is produced by the model λ while traversing the state sequence Q = q_1 q_2 ... q_T is denoted by P(O, Q|λ). From this quantity we can compute the probability of appearance of the observation sequence O by integrating out Q (i.e. summation over all possible state sequences of length T). We have, therefore:

P(O \mid \lambda) = \sum_{\text{all } Q} P(O, Q \mid \lambda),   (3.10)

which can be rewritten using the chain rule as:

P(O \mid \lambda) = \sum_{\text{all } Q} P(O \mid Q, \lambda) P(Q \mid \lambda).   (3.11)

All we need to do now is to compute the two probabilities in the summation. First let us fix a state sequence Q = q_1 q_2 ... q_T. The probability that the given observation sequence was produced by the model λ while following the state sequence Q is given by:

P(O \mid Q, \lambda) = P(O_1 O_2 \ldots O_T \mid q_1 q_2 \ldots q_T, \lambda) = \prod_{t=1}^{T} P(O_t \mid q_t, \lambda) = \prod_{t=1}^{T} b_{q_t}(O_t).   (3.12)

In the above derivation, statistical independence of the observations was assumed. This assumption is very important from the computational point of view, and even though it does not always hold in real applications, the power of the HMM paradigm is not impeded. The probability that such a state sequence is generated is computed by:

P(Q \mid \lambda) = P(q_1 q_2 \ldots q_T \mid \lambda) = P(q_1 \mid \lambda) \prod_{t=1}^{T-1} P(q_{t+1} \mid q_t, \lambda) = \pi_{q_1} \prod_{t=1}^{T-1} a_{q_t q_{t+1}}.   (3.13)

Therefore, based on the equations 3.11, 3.12 and 3.13, the probability of appearance of the observation O given the model λ is computed by:

P(O \mid \lambda) = \sum_{q_1, q_2, \ldots, q_T} \left( \prod_{t=1}^{T} b_{q_t}(O_t) \right) \left( \pi_{q_1} \prod_{t=1}^{T-1} a_{q_t q_{t+1}} \right) = \sum_{q_1, q_2, \ldots, q_T} \pi_{q_1} b_{q_1}(O_1) \prod_{t=2}^{T} a_{q_{t-1} q_t} b_{q_t}(O_t).   (3.14)

As we already mentioned in the beginning, this brute-force solution, even though very easy to understand and follow, is extremely computationally intensive: it needs on the order of 2T N^T operations. The Forward-Backward algorithm starts by building, for each partial sequence of observations O_1 O_2 \ldots O_t, the variable \alpha_t(j), which represents the probability of seeing the length-t partial observation sequence and arriving in state S_j at time t:

\alpha_t(j) = P(O_1 O_2 \ldots O_t, q_t = S_j \mid \lambda).   (3.15)

The observation here is that by summation over all j of \alpha_T(j) we arrive at the solution of the first question:

P(O \mid \lambda) = \sum_{j=1}^{N} \alpha_T(j).   (3.16)

The variable \alpha_t(j) can be computed iteratively for all t and all states j, starting with

\alpha_1(j) = P(O_1, q_1 = S_j \mid \lambda) = P(O_1 \mid q_1 = S_j, \lambda) P(q_1 = S_j \mid \lambda) = \pi_j b_j(O_1).   (3.17)

The induction step is

\alpha_{t+1}(j) = P(O_1 O_2 \ldots O_{t+1}, q_{t+1} = S_j \mid \lambda)
            = \sum_{q_1, \ldots, q_{t+1} = j} \pi_{q_1} b_{q_1}(O_1) \prod_{\tau=2}^{t+1} a_{q_{\tau-1} q_\tau} b_{q_\tau}(O_\tau)
            = \sum_{q_1, \ldots, q_{t+1} = j} \pi_{q_1} b_{q_1}(O_1) a_{q_1 q_2} b_{q_2}(O_2) \cdots a_{q_{t-1} q_t} b_{q_t}(O_t) \, a_{q_t j} b_j(O_{t+1})
            = \left[ \sum_{i=1}^{N} \left( \sum_{q_1, \ldots, q_t = i} \pi_{q_1} b_{q_1}(O_1) \cdots a_{q_{t-1} i} b_i(O_t) \right) a_{ij} \right] b_j(O_{t+1})
            = \left( \sum_{i=1}^{N} \alpha_t(i) a_{ij} \right) b_j(O_{t+1}).   (3.18)

The equation 3.18 holds for any 1 ≤ t ≤ T − 1 and any 1 ≤ j ≤ N. The computation of P(O|λ) by means of the variable \alpha_t(j) is known as the forward pass and needs only on the order of N^2 T operations. As the name suggests, there is a second part of the algorithm, called the backward pass. This step is actually useful for answering question 3. We include the backward pass here to complete the forward-backward algorithm, but the completion of the answer to question 3 will be given in its corresponding section. In a similar way the backward pass computes a backward variable \beta_t(i), defined as the probability of the partial observation sequence starting at time t + 1, given that the system is in state S_i at time t and given the model λ:

\beta_t(i) = P(O_{t+1} O_{t+2} \ldots O_T \mid q_t = S_i, \lambda).   (3.19)

The computation of βt(i) is done again by induction, namely:

\beta_T(i) = 1, \quad 1 \leq i \leq N, \quad \text{and}   (3.20)

\beta_t(i) = \sum_{j=1}^{N} a_{ij} b_j(O_{t+1}) \beta_{t+1}(j),   (3.21)

for t = T − 1, T − 2, ..., 1 and 1 ≤ i ≤ N. The computation of \beta_t(i) also requires N^2 T operations. The continuation of the solution for question 3, starting from the backward step, is given later.
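As an illustration of equations 3.15-3.21, the listing below sketches the forward and backward passes in Python with NumPy for a discrete-observation HMM. The variable names (A, B, pi, obs) follow the notation of this section; the code is a didactic sketch and omits the scaling of α and β that a practical implementation needs to avoid numerical underflow on long sequences.

import numpy as np

def forward(A, B, pi, obs):
    # A: (N, N) transition matrix, B: (N, M) observation matrix,
    # pi: (N,) initial distribution, obs: list of observation indices.
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                      # eq. 3.17
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]  # eq. 3.18
    return alpha, alpha[-1].sum()                     # eq. 3.16: P(O | lambda)

def backward(A, B, obs):
    N, T = A.shape[0], len(obs)
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                    # eq. 3.20
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])  # eq. 3.21
    return beta

# Recognition then amounts to evaluating P(O | lambda) for every model
# (one HMM per viseme or word) and picking the model with the highest score.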

3.1.2 The Viterbi Algorithm

The first question has an exact solution. This is not the case for the second question, because there is no established definition of the goal: the state sequence that optimally explains the observation sequence. The most widely used definition of optimality is the single best state sequence that maximizes P(Q|O, λ). The algorithm that solves this optimization problem is called the Viterbi algorithm. The Viterbi algorithm was conceived by Andrew Viterbi in 1967 ([Vit67]) as an error-correction scheme for noisy digital communication links. A very good tutorial on the algorithm is given by Forney in [For73]. The Viterbi algorithm is very similar to the forward step, the difference being the use of the maximum operator instead of the summation in the equations 3.18 and 3.16. Let Q = {q_1 q_2 ... q_T} be the single best state sequence for the given observation sequence O = {O_1 O_2 ... O_T}. Similarly to the forward pass we define

\delta_t(i) = \max_{q_1, q_2, \ldots, q_{t-1}} P(O_1 O_2 \ldots O_t, q_1 q_2 \ldots q_{t-1}, q_t = i \mid \lambda).   (3.22)

The δ_t(i) is the maximum probability of producing the partial observation sequence by following a single state path that ends up in state i. The difference with the definition of α_t(i) is that the single best state path is taken instead of the summation over all paths. At the first time step we have

\delta_1(i) = P(q_1 = i, O_1 \mid \lambda) = P(O_1 \mid q_1 = i, \lambda) P(q_1 = i \mid \lambda) = \pi_i b_i(O_1).   (3.23)

By induction we have

\delta_{t+1}(j) = \max_{1 \leq i \leq N} \left( \delta_t(i) a_{ij} \right) b_j(O_{t+1}),   (3.24)

and the probability of the single most probable path is found as

P(Q^* \mid O, \lambda) = \max_{1 \leq i \leq N} \delta_T(i).   (3.25)

The actual path Q^* is found by recording the argument of equation 3.24. The algorithm does this by using the array \phi_t(j), defined in the induction step as

\phi_t(j) = \operatorname{argmax}_{1 \leq i \leq N} \left( \delta_{t-1}(i) a_{ij} \right).   (3.26)

In the end, the path is recovered as follows:

q_T^* = \operatorname{argmax}_{1 \leq i \leq N} \left( \delta_T(i) \right),   (3.27)
q_t^* = \phi_{t+1}(q_{t+1}^*), \quad t = T − 1, T − 2, \ldots, 1.   (3.28)
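A compact Python sketch of equations 3.22-3.28 is given below, using the same A, B, pi, obs conventions as the forward pass above. As before, a practical implementation would work in the log domain to avoid underflow; this is left out for clarity.

import numpy as np

def viterbi(A, B, pi, obs):
    N, T = A.shape[0], len(obs)
    delta = np.zeros((T, N))
    phi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]                       # eq. 3.23
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A             # scores[i, j] = delta_t-1(i) * a_ij
        phi[t] = scores.argmax(axis=0)                 # eq. 3.26
        delta[t] = scores.max(axis=0) * B[:, obs[t]]   # eq. 3.24
    # backtracking, eqs. 3.27 and 3.28
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = phi[t + 1][path[t + 1]]
    return path, delta[-1].max()                       # best state sequence and its probability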

3.1.3 The Baum-Welch Algorithm

Training the model parameters A, B and π is the most important problem, since this is what makes the HMM paradigm usable for recognition. Starting from an initial guess, we need to adapt the model parameters such that they accurately match the real process we want to model. If this were not possible then the whole problem would be practically unsolvable. There is no known analytical solution to this problem. The Baum-Welch algorithm is an iterative procedure that adjusts λ such that P(O|λ) is locally maximized. The algorithm starts from the forward-backward variables, namely α_t(i) and β_t(i). Next, it defines the probability ξ_t(i, j) of being in state S_i at time t and in state S_j at time t + 1, given the observation O and the model λ, and the probability γ_t(i) of being in state S_i at time t, given the observation O and the model λ. The two probabilities are defined as follows:

\xi_t(i, j) = P(q_t = S_i, q_{t+1} = S_j \mid O, \lambda), \quad \text{and}   (3.29)

\gamma_t(i) = P(q_t = S_i \mid O, \lambda).   (3.30)

Based on the forward and backward probabilities we have

\gamma_t(i) = \frac{\alpha_t(i) \beta_t(i)}{P(O \mid \lambda)}, \quad \text{and}   (3.31)

\xi_t(i, j) = \frac{\alpha_t(i) a_{ij} b_j(O_{t+1}) \beta_{t+1}(j)}{P(O \mid \lambda)}.   (3.32)

The summations over time of the two quantities above have the following interpretations, which are essential for understanding the solution of the Baum-Welch algorithm:

\sum_{t=1}^{T} \gamma_t(i) = \text{the expected number of times in state } S_i,   (3.33)

\sum_{t=1}^{T-1} \gamma_t(i) = \text{the expected number of transitions from } S_i,   (3.34)

\sum_{t=1}^{T-1} \xi_t(i, j) = \text{the expected number of transitions from } S_i \text{ to } S_j.   (3.35)

Finally, using the above interpretations and the probabilities defined in 3.31 and 3.32, the re-estimated parameters of the model λ, as defined by the Baum-Welch approach, are:

\pi_i = \gamma_1(i),   (3.36)

a_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}, \quad \text{and}   (3.37)

b_j(k) = \frac{\sum_{t=1, O_t = V_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}.   (3.38)

The re-estimated model parameters based on the Baum-Welch approach are proved to converge to a local optimum; therefore, the initial guess is very important.
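The listing below sketches one Baum-Welch re-estimation step in Python for a single observation sequence and a discrete output alphabet, implementing equations 3.29-3.38. In practice (and in HTK) the statistics are accumulated over many training utterances and over Gaussian mixture outputs rather than a discrete alphabet, so this is only a minimal illustration of the update formulas.

import numpy as np

def baum_welch_step(A, B, pi, obs):
    N, M, T = A.shape[0], B.shape[1], len(obs)
    # forward and backward passes (eqs. 3.17-3.21)
    alpha = np.zeros((T, N)); beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    p_obs = alpha[-1].sum()

    gamma = alpha * beta / p_obs                      # eq. 3.31, shape (T, N)
    xi = np.zeros((T - 1, N, N))                      # eq. 3.32
    for t in range(T - 1):
        xi[t] = (alpha[t][:, None] * A
                 * B[:, obs[t + 1]][None, :] * beta[t + 1][None, :]) / p_obs

    new_pi = gamma[0]                                 # eq. 3.36
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]   # eq. 3.37
    new_B = np.zeros((N, M))
    for k in range(M):                                # eq. 3.38
        mask = np.array(obs) == k
        new_B[:, k] = gamma[mask].sum(axis=0) / gamma.sum(axis=0)
    return new_A, new_B, new_pi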

3.2 Principal Component Analysis

Principal Component Analysis (PCA) is one of the most used tools in data mining and pattern recognition. It is a statistical technique for finding patterns in high dimensional data. It finds the directions (i.e. principal components) that explain most of the variance in the data. In the lip reading domain PCA has been extensively used for data parametrization built around the notion of eigenfaces, for dimension reduction of the feature space, but also for tracking key points, as in the AAM technique. The PCA projection also produces a de-correlation of the feature dimensions. PCA was introduced by Karl Pearson in 1901 in [Pea01]. In mathematical terms it is defined as an orthogonal linear transformation that projects the data to a coordinate system such that the greatest variance lies on the first coordinate, called the first principal component, the second greatest variance on the second coordinate, and so on. PCA is theoretically the optimum transform for given data in least square terms. PCA involves the calculation of the eigenvalue decomposition of the covariance matrix of the mean-centred data. The main tools used for PCA are therefore: the sample mean, the sample variance, the sample covariance matrix, and the eigenvectors and eigenvalues of the covariance matrix. PCA is very simple to implement but a very powerful tool, which we used in many places in our research. Let X_1, X_2, ..., X_N be the data to be processed. Each vector X_n has M components and is considered a sample from a multi-variate distribution. In our case M could represent the number of pixels in the face image, or the dimension of the feature vector. N is the number of images that are being processed. Create the (M × N) matrix X, which has as columns the vectors X_n. The first necessary step is to centre the data by subtracting the sample mean from each dimension of the data:

\bar{X}_m = \frac{1}{N} \sum_{n=1}^{N} X_{mn}, \quad \text{for all } m = 1, \ldots, M,   (3.39)

Y = (Y_n), \quad 1 \leq n \leq N,   (3.40)

where the columns are computed as Y_n = X_n − \bar{X}. The covariance matrix of the centred data is computed as follows,

C = \frac{1}{N} Y Y^{*}, \quad \text{where "*" represents the conjugate transpose of } Y,   (3.41)

based on the sample (co)variance estimation given by

\operatorname{Var}(X) = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})',   (3.42)

\operatorname{Cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})'.   (3.43)

The next step is to compute the (M × M) matrix V of eigenvectors such that

V^{-1} C V = D,   (3.44)

where D is a diagonal matrix which has on the diagonal the corresponding eigenvalues λ. Therefore, each column V_m in the matrix V is an eigenvector of the covariance matrix and corresponds to the eigenvalue λ_m = D_{mm} in the matrix D. The value of λ_m gives the contribution to the variance of the data captured by the m-th eigenvector (i.e. the m-th principal component). By sorting the eigenvectors according to their contribution and computing the cumulative contribution of the principal components it is possible to decide on a sufficient number of components. Usually, only the components that cover at least 90% to 95% of the variance are considered.

If g(m) = \sum_{i=1}^{m} D_{ii} is the cumulative contribution up to the m-th component, then only the first L directions such that g(L) = 95% g(M) are chosen. The data matrix is transformed into the new space through the (M × L) matrix W, having as columns the first L eigenvectors:

Z = W^{*} Y.   (3.45)

Since L ≤ M, the result is also a compression of the data. However, it is a lossy compression, since it is not possible to recover the original data exactly. The results on a small 2D artificial data set are shown below. Let N = 20, M = 2 and the data be as given in Table 3.1.

Table 3.1: Sample data for PCA. On the left the original data and on the right the centred data.

      Data (X1, X2)        Centred data (Y1, Y2)
      9.8162  25.1921      -0.1617   0.1149
     10.2384  24.7629       0.2605  -0.3143
     10.2903  24.8370       0.3124  -0.2402
      9.9003  24.9688      -0.0776  -0.1085
      9.8999  25.3217      -0.0780   0.2444
      9.7517  25.0426      -0.2262  -0.0346
      9.5919  25.4296      -0.3860   0.3523
     10.1022  25.0995       0.1243   0.0223
     10.3324  24.8482       0.3545  -0.2290
      9.7935  25.2723      -0.1844   0.1950
      9.9655  24.9156      -0.0124  -0.1616
      9.5455  25.3940      -0.4324   0.3167
      9.5857  25.3329      -0.3922   0.2557
     10.0131  24.8942       0.0352  -0.1831
     10.2154  25.0081       0.2375  -0.0691
      9.5354  25.4821      -0.4425   0.4049
     10.4796  24.6214       0.5017  -0.4559
      9.9974  25.1901       0.0195   0.1128
     10.2007  25.0424       0.2228  -0.0349
     10.3031  24.8895       0.3252  -0.1877

The covariance matrix computed based on the centred data is

\mathrm{cov} = \begin{pmatrix} 0.0832 & -0.0611 \\ -0.0611 & 0.0583 \end{pmatrix},

and the eigenvalues and eigenvectors of this covariance matrix are:

\text{eigenvalues} = \begin{pmatrix} 0.1331 \\ 0.0084 \end{pmatrix},

and

\text{eigenvectors} = \begin{pmatrix} 0.7744 & 0.6327 \\ -0.6327 & 0.7744 \end{pmatrix}.

In Figure 3.2 the principal components superimposed on the centred data points are shown. In Figure 3.3 the projected data on the space determined by the two principal components is shown. The two figures show, in both spaces, the data points modified by one unit in the direction of the less important principal component. The adjustment is done in the projected space. As expected, this adjustment shows up in the centred data space as a translation orthogonal to the first principal component. This idea is used in the AAM method during the iterative searching scheme. The step in the projected space is chosen as a factor of the variance covered by the corresponding principal component.
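The worked example above can be reproduced with a few lines of NumPy, as sketched below. The data matrix is assumed to hold one sample per column, as in the text; the exact eigenvalues differ slightly depending on whether the covariance is normalised by N or N−1.

import numpy as np

def pca(X, variance_kept=0.95):
    # X: (M, N) data matrix with one sample per column.
    X_mean = X.mean(axis=1, keepdims=True)
    Y = X - X_mean                                   # centre the data (eqs. 3.39-3.40)
    C = (Y @ Y.T) / X.shape[1]                       # covariance matrix (eq. 3.41)
    eigvals, eigvecs = np.linalg.eigh(C)             # eigendecomposition (eq. 3.44)
    order = np.argsort(eigvals)[::-1]                # sort by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    L = int(np.searchsorted(cumulative, variance_kept)) + 1
    W = eigvecs[:, :L]                               # first L principal directions
    Z = W.T @ Y                                      # projection (eq. 3.45)
    return W, Z, X_mean, eigvals

# A projected point can be mapped back (approximately) with: X_rec = W @ Z + X_mean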


Figure 3.2: PCA example data: data is centred around the mean. The computed PCs are shown superimposed on the centred data. The figure shows a second set of points obtained from the original points by translation with one unit in the PCA space.

The use of eigenfaces for the characterization of human faces for pattern recognition applications was first proposed by Kirby and Sirovich in [Sir87; Kir90]. Turk and Pentland obtained in 1991 the first promising result for the problem of face recognition from static images, using the PCA approach ([Tur91]). In their paper Turk and Pentland derived an alternative computation which makes PCA suitable for processing images. The problem comes from the high dimensionality of the feature space when images are considered. For instance, for an image of size (100 × 100) the PCA problem lives in a 10000-dimensional space and the data matrix will be (10000 × N). From the practical point of view computing the eigenvectors of the (10000 × 10000) covariance matrix is not feasible. The covariance matrix as computed in 3.41 can be written as

C = Y Y^{T},   (3.46)


Figure 3.3: PCA example data: data is projected in the principal components space. The figure shows a second set of points obtained from the original points by translation with one unit in the positive direction of the y axis.

where Y is the matrix Y = (Y_1 Y_2 \ldots Y_N). The eigenvectors of the matrix C are defined as V_m \neq 0 such that

C V_m = Y Y^{T} V_m = \lambda_m V_m.   (3.47)

Turk and Pentland observed that if we instead compute the eigenvectors U_m of Y^{T} Y, which is an (N × N) matrix, we have

Y^{T} Y U_m = \lambda_m U_m,   (3.48)

and if we multiply the above equation with Y to the left we obtain

Y Y^{T} (Y U_m) = \lambda_m (Y U_m).   (3.49)

This means that if U_m is an eigenvector of the matrix Y^{T} Y corresponding to the eigenvalue λ_m, then the vector Y U_m is an eigenvector of the matrix Y Y^{T} corresponding to the same eigenvalue. Figure 3.4 shows the first eigenfaces computed for a small subset (1156 image files) of the data corpus used in our work. It is interesting to find that only a small number of components is needed to capture a large portion of the variance in the data. The first four eigenfaces shown in the image already account for more than 70% of the variance. In Figure 3.5 the results obtained in this case for an experiment similar to the one shown in Figures 3.2 and 3.3 are presented.
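A sketch of this trick in NumPy is shown below: the eigenvectors of the small (N × N) matrix Y^T Y are computed and then mapped back through Y, which avoids ever forming the huge pixel-by-pixel covariance matrix. The normalisation of the resulting eigenfaces is an implementation choice added here.

import numpy as np

def eigenfaces(Y, n_components=4):
    # Y: (M, N) matrix of mean-centred images, one image per column (M pixels, N images).
    small = Y.T @ Y                                  # (N, N) instead of (M, M)
    eigvals, U = np.linalg.eigh(small)               # eigenvectors U_m of Y^T Y (eq. 3.48)
    order = np.argsort(eigvals)[::-1][:n_components]
    faces = Y @ U[:, order]                          # Y U_m are eigenvectors of Y Y^T (eq. 3.49)
    faces /= np.linalg.norm(faces, axis=0)           # normalise each eigenface to unit length
    return faces, eigvals[order]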


Figure 3.4: PCA example for image processing. The images represent the first four principal components that account for 70% of the variance in the data.

3.3 Optical Flow Analysis

The Optical Flow concerns the apparent motion of objects within a visual representation. A common definition of the Optical Flow is:

The velocity field which warps one video frame into a subsequent one.

In [Hor81] the optical flow is defined as:


Figure 3.5: The images obtained by perturbing the mean image, in the PCA space, along the second principal direction by −4σ, −3σ, −2σ, −σ, σ, 2σ, 3σ and 4σ, respectively. σ represents the standard deviation computed for the data corpus for the second principal direction.

The distribution of apparent velocities of movement of brightness patterns in an image.

The word apparent is used in the definition because the optical flow does not consider the movement of the objects in the real 3D space but the motion in the image space. Therefore, sometimes the optical flow does not have the same orientation as the true motion field. A well known example is the rotating barber's pole. The problem of finding the optical flow in an image falls in a broader class of problems called image registration problems. Data registration in general deals with the spatial and temporal alignment of objects within imagery or spatial data sets. Image registration can occur at pixel level (i.e. any pixel in an image can be matched with known accuracy with a pixel or pixels in another image) or at object level (i.e. it relates objects rather than pixels). A domain where the image registration problem is one of the key challenges is medical imaging. In medical imaging the problem of registration arises whenever images acquired from different subjects, at different times, or from different scanners need to be combined for analysis or visualization. In [Luc81] the problem of finding the optical flow, seen as an image registration problem, is defined as follows:

We consider that the pixel values in the two images are given by the functions F(X) and G(X) (in 2D space X = (x, y)). Our goal is to determine the dissimilarity vector h which minimizes some measure of the difference between F(X + h) and G(X), for X in some region of interest R.

There are many different methods for determining the optical flow, depending on the mathematical interpretation of the problem: block based differences, the inverse of the normalized cross-power spectrum, discrete optimization methods, differential methods, etc. The most successful approaches were obtained by Lucas and Kanade ([Luc81]) and Horn and Schunck ([Hor81]), both methods being based on partial derivatives of the image signal. The first approach defines a set of image windows and an affine model of the flow field. This algorithm assumes that the images are roughly aligned and that the optical flow is constant in a small neighbourhood. Then it uses a type of Newton-Raphson iteration, taking the gradient of the error, assuming that the analyzed function is almost linear, and moving in the direction of this gradient. The latter algorithm assumes that the apparent velocity of the brightness pattern varies smoothly almost everywhere in the image. The algorithm minimizes the error in the brightness constancy equation together with the square of the magnitude of the gradient of the optical flow velocity, which measures the non-smoothness of the flow. In [Bru05] the authors explore the possibility of combining the two approaches used in the Lucas-Kanade and Horn-Schunck methods, namely local constraint methods and global constraint methods, in order to build a hybrid method that can provide the corroborated strengths of both paradigms. Other known algorithms were developed by Uras et al. [Ura88], Nagel [Nag87], Anandan [Ana89], Singh [Sin91], Heeger [Hee87], Waxman et al. [Wax88], Brox et al. [Bro04] and Fleet and Jepson [Fle90]. In [Bar94] and [Gal98] nine, respectively eight, different techniques for the detection of optical flow were investigated. The performance of these methods was compared on synthetic scenes. The difficulty of comparing different optical flow techniques comes from the fact that it is hard to produce ground-truth motion fields. In [Gal98] this problem was overcome with a modified ray tracer that allowed the generation of ground-truth flow maps. The latter study reports that a modified version of the Lucas-Kanade algorithm produced the best results; however, the algorithm does not produce dense flow maps. In second place was ranked Proesmans' algorithm, which produced very dense flow maps but of lower quality. Baker et al. ([Bak07]) built a corpus for the evaluation of the various approaches and implementations of the solutions to the optical flow problem. In their experiments the authors concluded that the pyramidal Lucas-Kanade algorithm performs better than the original algorithm. For our research, we had two measures for the performance of the optical flow detection method, namely the accuracy with which the flow is detected, including the resolution of the flow field, and the computational complexity of the algorithm, expressed as the duration for obtaining the flow. For our research we used two variants: a pyramidal implementation of the Lucas-Kanade algorithm and an implementation of the Horn-Schunck algorithm.

3.3.1 Differential Approach Overview

The problem of determining the optical flow as a differential problem is introduced below. Consider the function I(x, y, t) which gives the image intensity at location (x, y) at time t. Every optical flow detection method has as its goal to compute the motion of every pixel in the image from time t to time t + δt. If we denote the new position at time t + δt of the pixel located at (x, y) at time t by (x + δx, y + δy), we get the following constraint equation:

I(x, y, t) = I(x + \delta x, y + \delta y, t + \delta t).   (3.50)

Assuming that the movement inside the image is small enough, the image constraint equation can be rewritten in terms of a Taylor series as follows:

I(x, y, t) = I(x + \delta x, y + \delta y, t + \delta t) = I(x, y, t) + \frac{\partial I}{\partial x}\delta x + \frac{\partial I}{\partial y}\delta y + \frac{\partial I}{\partial t}\delta t + R,   (3.51)

where R denotes the higher order terms, which are small enough to be ignored. Using the initial image constraint and ignoring R we get the following equation:

\frac{\partial I}{\partial x}\delta x + \frac{\partial I}{\partial y}\delta y + \frac{\partial I}{\partial t}\delta t = 0,   (3.52)

which can be rewritten as:

\frac{\partial I}{\partial x} V_x + \frac{\partial I}{\partial y} V_y = -\frac{\partial I}{\partial t},   (3.53)

where V_x and V_y are the components of the optical flow. Denoting the partial derivatives of I(x, y, t) with respect to the spatial coordinates x and y and time t by I_x, I_y and I_t respectively, the constraint equation reads:

I_x V_x + I_y V_y = -I_t.   (3.54)

Hence the problem of detecting optical flow is equivalent to solving the system (3.54). However, this system has only one equation but two unknowns, making it under-determined. To be able to solve this system some assumptions need to be made. Based on these assumptions new equations can be introduced. The resulting flow will carry the marks of these assumptions.

3.3.2 Lucas-Kanade Algorithm

Lucas and Kanade's algorithm starts with the assumption that the optical flow is constant in a small neighbourhood of the point (x, y). Assuming that the flow (V_x, V_y) is constant in a small rectangular region of size (n, n) with n > 1 (usually n = 5 gives sufficiently good results), centred at the point (x, y), and numbering the pixels, we get the following system:

V = [AT WA]−1(−AT W b), (3.57)     Ix1 Iy1 · ¸ It1  . .  Vx  .  where A =  . . , V = , b =  .  and W is a weighting function Vy Ixn Iyn Itn that gives more importance to the centre pixel of the window. This means that the optical flow vector can be found only by calculating the derivatives of the image in all dimensions. The Lucas-Kanade optical flow detection algorithm is very fast because it examines only a limited number of possible matches, however, with the disadvantage that it does not yield a high density of flow vectors, (i.e. the velocity is only determined close to the boundaries of objects and inside large areas with almost constant brightness the information fades quickly). The main advantage is the algorithm robustness in the presence of noise.

3.3.3 Horn-Schunck Algorithm The Horn-Schunck method uses a global constraint of smoothness in order to solve the aperture problem. Starting from the equation (3.54) the smoothness constraint is added as the necessity to minimize the square of the magnitude of the gradient of the optical flow, namely the following two quantities: µ ¶ µ ¶ µ ¶ µ ¶ ∂V 2 ∂V 2 ∂V 2 ∂V 2 x + x and y + y . (3.58) ∂x ∂y ∂x ∂y In order to solve the problem Horn and Schunck consider the following minimiza- tion problems:   argmin(IxVx + IyVy + It) (Vx,Vy ) ³ ´ 2 2 ∂V 2 ∂V 2 , (3.59)  ∂Vx ∂Vx y y  argmin ( ∂x ) + ( ∂y ) + ( ∂x ) + ( ∂y ) (Vx,Vy ) with the first equation also seen as a minimization problem. The first problem asks for the minimization of the sum of errors in the equation of the rate of change 3.3 OPTICAL FLOW ANALYSIS 69 of image brightness, while the second asks for the minimization of the measure of the departure from the smoothness in the velocity flow. Hence the function to be minimized reads: Z → 2 2 2 f = ((∇I • V + It) + α(|∇Vx| + |∇Vy| ))dxdy, (3.60)   · ¸ Ix → Vx where ∇I =  Iy , and V = is a regularization constant; larger values of Vy It α lead to a smoother flow. Minimizing the above function comes down to solving the corresponding Euler-Lagrange equations. An iterative solution of the minimization problem will have the following updates:  I [I V¯ k+I V¯ k+I ]  k+1 ¯ k x x x y y t  V = V − 2 2 2  x x α +Ix +Iy , (3.61)  I [I V¯ k+I V¯ k+I ]  k+1 ¯ k y x x y y t V = V − 2 2 2 y y α +Ix +Iy where V¯ denotes region average and k denotes the iteration number. Compared to the Lucas-Kanade method, the Horn-Schunck algorithm yields a high density of flow vectors. However, it is more sensitive to noise and has problems with spatial discontinuities in image brightness (e.g. when there is an object that is occluded by other objects.)

3.3.4 Performance Measures

The most used measure of accuracy for optical flow is the Angular Error (AE). The AE between two flows V = (u_0, v_0) and U = (u_1, v_1) is computed as the inverse cosine of the dot product of the normalised vectors in the space-time domain:

\Psi_e = \arccos \left( \frac{\langle (V, 1), (U, 1) \rangle}{\sqrt{1 + \|V\|^2} \, \sqrt{1 + \|U\|^2}} \right).   (3.62)

The error in the flow endpoint, defined by \sqrt{(u_0 - u_1)^2 + (v_0 - v_1)^2}, is also used. In all cases the mean and the standard deviation are reported in order to describe the performance on the entire image. In the paper [Bak07] the authors compute the square root of the normalized sum of squared differences between an interpolated image I(x, y) and a ground truth image I_{GT}(x, y) as a measure of performance. This is defined by:

\left( \sum_{(x, y)} \frac{(I(x, y) - I_{GT}(x, y))^2}{\|\nabla I_{GT}(x, y)\|^2 + 1} \right)^{\frac{1}{2}}.   (3.63)

Another important aspect in choosing a suitable algorithm for optical flow detection is the performance in terms of computational complexity. For a real system there is a strong requirement to have real time performance. Usually, a system is characterized in terms of a multiple of real time. A real time system is a system that has a score of 1, which means that at most one time unit is necessary to process one time unit of corresponding data. In our case however, due to the large amount of data to be processed, we have chosen the implementation that needed the least amount of time to compute the optical flow while keeping the accuracy within reasonable limits. In our work we considered both a pyramidal implementation of the Lucas-Kanade algorithm and an implementation of the Horn-Schunck algorithm. However, the implementation of the latter was considerably faster (approximately 5-10 seconds were needed for the Horn-Schunck algorithm while around 170-200 seconds were needed for the Lucas-Kanade implementation) with no visible loss of accuracy. We opted, therefore, for the Horn-Schunck method. Figure 3.6 shows an example of the performance of the Horn-Schunck algorithm on a succession of two synthetic images containing a rotating sphere. The ground truth computed using a ray tracer is also shown. The mean and standard error of the angular error computed as in [Bar94; Fle90], by equation 3.62, for the given example are 16.98 and 26.42, respectively. The component wise measures for the angular error as defined in [Bar94] are 14.93 and 24.08, respectively. The mean of the component wise square root differences, namely the mean of the error in the endpoint, is 0.5 pixels.
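As a small companion to equation 3.62, the function below computes the mean and standard deviation of the angular error between an estimated flow field and a ground-truth field using NumPy; the conversion to degrees is our choice, matching the way the values are reported above.

import numpy as np

def angular_error(u_est, v_est, u_gt, v_gt):
    # Extend the 2D flow vectors with a unit temporal component, as in eq. 3.62.
    num = u_est * u_gt + v_est * v_gt + 1.0
    den = np.sqrt(u_est**2 + v_est**2 + 1.0) * np.sqrt(u_gt**2 + v_gt**2 + 1.0)
    ae = np.degrees(np.arccos(np.clip(num / den, -1.0, 1.0)))
    return ae.mean(), ae.std()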

3.4 Active Appearance Models

Active Appearance Models (AAM) are a model-based approach to image segmentation, first introduced by Edwards, Taylor and Cootes in [Edw98] for interpreting face images. It is a top-down approach because it is based on a-priori knowledge about the shape and appearance of the targeted object; therefore, it can only be applied in situations where the test images have already been validated to contain the targeted object. During the training phase a statistical model of the shape and appearance of the object is built, starting from the variation induced by the physical phenomena that govern the object's image. This approach is closely related to Active Blob Models, Active Contour Models, also called "Smart Snakes" ([Kas88]), and Active Shape Models ([Coo92]), all being in general a type of Morphable Model ([Bla99]). The main difference with these approaches is that AAM makes use of stronger global constraints, enforced through the a-priori decision about the structure of the model. The powerful generalization to unseen instances makes them extremely useful in various applications, the most important being medical applications. A good overview of AAM and similar models applied to medical image analysis is given in [Coo01]. The method assumes that a set of landmark points can always be marked on the image to describe the shape of the object. The transformation of the object should, therefore, not degenerate such that the points overlap or become occluded. From a set of images on which the landmarks are already marked, the method generates a statistical model of the shape variation, a model of the texture variation and a model of the correlations between the shape and the texture.

Figure 3.6: The optical flow as computed by the Horn-Schunck algorithm. (a) Frame 1; (b) Frame 2; (c) the ground truth; (d) the optical flow computed by the Horn-Schunck algorithm.

Based on these models, the method should be able to synthesize any new instance of the target object, even though it was not "seen" in the training set, as long as the variation from the mean shape of the object is within the learned dynamic range. The searching scheme consists of an iterative model refinement that adjusts the model parameters such that the error between the synthesized model image and the real image is minimized.

3.4.1 Shapes and Landmarks

As already mentioned, the two main concepts on which the AAM is based are the shape, defined by a set of landmarks, and the texture. However, the concept of shape is the defining one. The definitions of shape and landmarks used here were adopted from [Dry98]. A landmark is a point of correspondence on each object that matches between and within populations. A good choice of the landmarks is very important in order to optimize the location of the object. Cootes suggested in [Coo00] that the landmarks should be chosen as the clear corners of the object, 'T'-shaped junctions between boundaries or other easily located points. In order to make the boundary of the object robust, he suggested choosing additional landmarks equally spread between the main landmarks on the boundaries of the object. For a given image the shape of the object is, therefore, defined by a set of N points in 2D space, (x1, y1), (x2, y2), ..., (xN, yN). From this, the shape vector can be formed, for instance, as:

\[ x = (x_1, x_2, \ldots, x_N, y_1, y_2, \ldots, y_N)^T. \tag{3.64} \]

The shape is defined as all the geometrical information that remains when location, scale and rotational effects are filtered out from an object. This definition is very important since it implies that all shapes should be aligned into a common co-ordinate system. This is usually achieved by "Procrustes analysis" ([Goo91], [Boo96], [Dry98]). The shapes are, therefore, projected into a shape space: they are translated, rotated and scaled such that the sum of squared distances \( \sum_i |x_i - \bar{x}|^2 \) of each shape to the mean shape is minimized.
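As an illustration, a similarity-only alignment of one shape to a reference can be sketched as below; a full generalised Procrustes analysis would iterate this step against the evolving mean shape, and the possibility of a reflection is ignored for brevity.

```python
import numpy as np

def align_shape(shape, reference):
    """Align one (N, 2) landmark set to a reference by translation, rotation
    and scaling (orthogonal Procrustes for the rotation)."""
    s = shape - shape.mean(axis=0)
    r = reference - reference.mean(axis=0)
    # Optimal rotation from the SVD of the cross-covariance matrix.
    u, sigma, vt = np.linalg.svd(s.T @ r)
    rotation = u @ vt
    scale = sigma.sum() / (s ** 2).sum()
    return scale * (s @ rotation) + reference.mean(axis=0)
```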

3.4.2 Learning the Statistical Model of the Variance

The AAM is linear in both shape and appearance, but nonlinear in terms of pixel intensities. Therefore, the AAM allows only for linear shape and appearance variation. By the nature of the problem there are three models that are considered: a model for shape, a model for texture and a combined model for both shape and texture. The first two models are also called independent models, because they assume that the two spaces are independent. The combined model is introduced to take care of the correlations between the shape and the appearance parameters.

Shape Model

As above, let the shape of the model be specified by the set of points that make up the mesh:

\[ x = (x_1, x_2, \ldots, x_N, y_1, y_2, \ldots, y_N)^T. \tag{3.65} \]

Since only a linear variation of the shape is allowed, the shape x can be written starting from a base shape \( x^0 \) as follows:

\[ x = x^0 + \sum_{i=1}^{N} b_i x^i. \tag{3.66} \]

In order to write the above equation in a canonical form, the vectors \( x^i \) are taken to be orthonormal (i.e. orthogonal and of length 1). The coefficients \( b = (b_i) \) are the parameters of the shape model. A standard approach to obtain the above parametrization is to use Principal Component Analysis. Therefore, in the above equation \( x^0 \) is taken as the mean shape \( \bar{x} \), and the \( x^i \) are the first n eigenvectors, \( P = [x^1, x^2, \ldots, x^n] \), that account for a given percentage of the data variance. The shape model can be re-written as

\[ x = \bar{x} + P_s b_s, \tag{3.67} \]

where the subscript s is added to distinguish the shape case from the appearance case. The parameters \( b_s \) can be obtained as follows:

\[ b_s = P_s^T (x - \bar{x}). \tag{3.68} \]

The same approach will be used in the other two cases as well; hence PCA ends up being applied three times.
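A minimal sketch of the PCA shape model of equations (3.66)-(3.68) is given below, assuming the training shapes have already been aligned; the 98% variance threshold is an illustrative choice, not a value prescribed by this work.

```python
import numpy as np

def build_shape_model(X, variance_kept=0.98):
    """X: (n_samples, 2N) matrix of aligned shape vectors, one shape per row."""
    x_mean = X.mean(axis=0)
    Xc = X - x_mean
    cov = Xc.T @ Xc / (X.shape[0] - 1)
    eigval, eigvec = np.linalg.eigh(cov)
    order = np.argsort(eigval)[::-1]              # sort by decreasing variance
    eigval, eigvec = eigval[order], eigvec[:, order]
    n = np.searchsorted(np.cumsum(eigval) / eigval.sum(), variance_kept) + 1
    Ps = eigvec[:, :n]                            # columns are the modes x^i
    return x_mean, Ps

def shape_parameters(x, x_mean, Ps):
    return Ps.T @ (x - x_mean)                    # equation (3.68)

def reconstruct_shape(bs, x_mean, Ps):
    return x_mean + Ps @ bs                       # equation (3.67)
```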

Appearance Model

The appearance model deals with the pixels \( p_x = (p_{x_x}, p_{x_y})^T \) that lie inside the mean shape. The appearance \( A(p_x) \) can be written in general as

\[ A(p_x) = A_0(p_x) + \sum_{i=1}^{m} \lambda_i A_i(p_x), \quad \text{for any } p_x \text{ inside the mean shape } \bar{x}. \tag{3.69} \]

In terms of PCA, the above equation can be written as

\[ g = \bar{g} + P_g b_g. \tag{3.70} \]

Combined Model

The combined model aggregates both the shape and the texture spaces. Therefore, the combined AAM uses a single set of model parameters \( c = (c_1, c_2, \ldots, c_l)^T \) to characterize both the shape

\[ x = x^0 + \sum_{i=1}^{l} c_i x^i, \tag{3.71} \]

and the appearance

\[ A(p_x) = A_0(p_x) + \sum_{i=1}^{l} c_i A_i(p_x). \tag{3.72} \]

This approach, therefore, is more general and in its general definition does not make a distinction between the shape and appearance parameters. However, rearranging the coefficients \( c_i \) such that

\[ c = (p_1, p_2, \ldots, p_n, \lambda_1, \lambda_2, \ldots, \lambda_m)^T, \]

and choosing \( x^i \) and \( A_i \) appropriately, the independent models are obtained. In [Coo98] the authors directly address the problem of the correlation that might exist between the shape model parameters and the appearance model parameters.

Therefore, a third PCA is performed on the parameter vectors that concatenate the shape parameters and the appearance parameters, employing the de-correlation property of PCA. The shape parameters are appropriately weighted to account for the difference in scale between the two spaces. The combined parameter vector reads:

\[ b = \begin{pmatrix} W_s b_s \\ b_g \end{pmatrix} = \begin{pmatrix} W_s P_s^T (x - \bar{x}) \\ P_g^T (g - \bar{g}) \end{pmatrix}. \tag{3.73} \]

Applying PCA on the parameters b, a new set of de-correlated parameters c is produced such that

\[ b = P_c c = \begin{pmatrix} P_{cs} \\ P_{cg} \end{pmatrix} c, \tag{3.74} \]

where \( P_c \) is the PCA projection matrix (no mean needs to be subtracted since b has zero mean). The shape and appearance parameters are then linearly re-parameterised in terms of the new eigenvectors as follows:

\[ b_s = W_s^{-1} P_{cs}\, c, \qquad b_g = P_{cg}\, c. \tag{3.75} \]
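The combined-model construction of equations (3.73)-(3.75) can be sketched as follows; \( W_s \) is assumed to be a given diagonal weighting matrix, and the variance threshold is again an illustrative choice.

```python
import numpy as np

def build_combined_model(Bs, Bg, Ws, variance_kept=0.98):
    """Bs: (n_samples, ns) shape parameters, Bg: (n_samples, ng) appearance
    parameters, Ws: (ns, ns) weighting matrix for the shape parameters."""
    B = np.hstack([Bs @ Ws.T, Bg])        # rows are the vectors b of (3.73)
    # b has zero mean, so PCA reduces to an eigen-decomposition of B^T B.
    cov = B.T @ B / (B.shape[0] - 1)
    eigval, eigvec = np.linalg.eigh(cov)
    order = np.argsort(eigval)[::-1]
    eigval, eigvec = eigval[order], eigvec[:, order]
    n = np.searchsorted(np.cumsum(eigval) / eigval.sum(), variance_kept) + 1
    Pc = eigvec[:, :n]
    ns = Bs.shape[1]
    Pcs, Pcg = Pc[:ns, :], Pc[ns:, :]     # the split used in (3.74)-(3.75)
    return Pcs, Pcg

def split_parameters(c, Pcs, Pcg, Ws):
    bs = np.linalg.inv(Ws) @ (Pcs @ c)    # equation (3.75)
    bg = Pcg @ c
    return bs, bg
```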

3.4.3 AAM Fitting

Fitting an AAM to an image is a nonlinear optimisation problem. Usually, an iterative method is used to solve for incremental additive updates to the parameters. The AAM assumes that there is a linear relation between the updates of the model parameters and the error between the synthesised image and the real image. Given the current estimates of the parameters, a new model image is generated and an error measure can be computed between the synthesized image and the real image. The goal of the searching scheme is to iteratively update the parameters so as to minimize this error. The distance between the two images is given by:

\[ r(b) = g_{im} - g_m, \tag{3.76} \]

where b are the current estimated parameters, \( g_{im} \) is the real image, and \( g_m \) is the generated image. The error is usually defined as the sum of squared elements of r, \( E(b) = r^T r \). Fitting the AAM to a given image is, therefore, performed in two steps: the first step consists of generating a new model instance based on the current parameter estimate, and the second step consists of updating the model parameters.

Generating a New Model Instance

Starting from the current model parameters c, the shape and the appearance can be computed using equations 3.67 and 3.70, where the shape and appearance parameters are expressed using the combined parameter vector as shown in 3.75:

\[ x = \bar{x} + P_s W_s^{-1} P_{cs}\, c, \qquad g = \bar{g} + P_g P_{cg}\, c. \tag{3.77} \]

Given the new shape x, the model image instance is obtained by applying a transform S that adjusts the location of each model point. The transform S consists of a scaling by a factor s, an in-plane rotation θ and a translation by \( (t_x, t_y) \):

\[ S\begin{pmatrix} x_i \\ y_i \end{pmatrix} = \begin{pmatrix} t_x \\ t_y \end{pmatrix} + \begin{pmatrix} s\cos\theta & -s\sin\theta \\ s\sin\theta & s\cos\theta \end{pmatrix} \begin{pmatrix} x_i \\ y_i \end{pmatrix}. \tag{3.78} \]

The final texture is obtained from g by applying the transform \( T(g) = \alpha g + \beta (1, 1, \ldots, 1)^T \).

Iterative Model Refinement

The searching algorithm used here was introduced in [Coo01]. Starting from the difference measure defined in equation 3.76, the method searches for the step δp that minimizes \( |r(p + \delta p)|^2 \). An approximation of the last quantity can be obtained through a Taylor expansion:

\[ r(p + \delta p) \approx r(p) + \frac{\partial r}{\partial p}\, \delta p, \tag{3.79} \]

where \( \left(\frac{\partial r}{\partial p}\right)_{ij} = \frac{\partial r_i}{\partial p_j} \). The least-squares (RMS) solution starting from equation 3.79 is given by:

\[ \delta p = -R\, r(p), \quad \text{where } R = \left( \frac{\partial r}{\partial p}^T \frac{\partial r}{\partial p} \right)^{-1} \frac{\partial r}{\partial p}^T. \tag{3.80} \]

The authors of [Coo01] suggest that the matrix \( \frac{\partial r}{\partial p} \) is computed only once, from the training set. Starting from equation 3.80 and given the current estimate of the model parameters c, the pose t, the texture transformation u, and the current image sample \( g_{im} \), Cootes and Taylor suggest the following iterative process for finding the optimal update of the model parameters:

1. Warp the texture sample into the texture model frame using \( g_s = T_u^{-1}(g_{im}) \).

2. Compute the error vector, \( r = g_s - g_m \), and the current error, \( E = |r|^2 \).

3. Compute the predicted additive updates, \( \delta p = -R\, r(p) \).

4. Update the model parameters p → p + kδp, where initially k = 1.

5. Generate a new model instance, \( X' \), and a new model frame texture, \( g_m' \).

6. Sample the image at the new points to obtain \( g_{im}' \).

7. Compute the new error vector, \( r' = T_{u'}^{-1}(g_{im}') - g_m' \).

8. Repeat steps 4-7 with k = 0.5, k = 0.25, etc. until \( |r'|^2 < E \).

The above procedure is repeated until no further improvement is made to the error r. To speed up convergence, a multi-resolution implementation is usually used, which means that the iterative process starts at the coarsest resolution, iterates for a number of steps (usually 5), and then projects to the next higher resolution and continues.
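For illustration, the control flow of one refinement iteration (steps 1-8) can be sketched as below; the update matrix R, the model synthesis and the texture warping are assumed to be given as callables, and the sketch is not the implementation referred to in [Coo01].

```python
import numpy as np

def aam_search_step(p, R, image_sample, synthesize, warp_to_model_frame):
    """One refinement iteration; p holds the current model, pose and texture
    parameters, R is the precomputed update matrix of equation (3.80)."""
    g_m = synthesize(p)                          # current model frame texture
    g_s = warp_to_model_frame(image_sample, p)   # step 1: warp the sample
    r = g_s - g_m                                # step 2: error vector
    E = float(np.sum(r * r))
    dp = -R @ r                                  # step 3: predicted update
    for k in (1.0, 0.5, 0.25, 0.125):            # steps 4-8: damped updates
        p_new = p + k * dp
        g_m_new = synthesize(p_new)
        g_s_new = warp_to_model_frame(image_sample, p_new)
        r_new = g_s_new - g_m_new
        if float(np.sum(r_new * r_new)) < E:     # accept the first improvement
            return p_new, float(np.sum(r_new * r_new))
    return p, E                                  # no improvement found
```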

Chapter 4

Data Acquisition

The data corpus used has a great influence on the success of the scientific endeavour. However, the effort and resources necessary to build a good data corpus in the audio-visual speech recognition domain are a major inhibiting factor. As we showed in Section 2.3, the number of available data corpora is very small and the existing corpora fall short in a number of aspects: language, respondents, speaking style, sampling rate. Therefore, we argue that researchers should follow some general guidelines when building a corpus, guaranteeing that the resulting datasets have common properties. This gives the opportunity to compare the results of different approaches from different research groups even without sharing the same data corpus. As concluded in Section 2.3, for our research we decided to build our own corpus that satisfies the requirements we considered important for lip reading research. This chapter describes the data corpus we developed for our work, the New Delft University of Technology Audio-Visual Speech Corpus (NDUTAVSC) [Chi08a; Chi09a]. We first describe the design of the recording setup and the recording sessions and, thereafter, include statistics on the corpus in all its important aspects.

4.1 Data Corpus Requirements

Before starting any recordings one should decide on the characteristics of the resulting corpus. The following list contains all the requirements we had for our corpus. The argumentation for the decisions we made is found in Chapter 2, most thoroughly in Section 2.3. We decided that the resulting corpus would have the following characteristics:

1. The corpus is intended for the Dutch language. This means that only Dutch native speakers will be asked to participate in the recording sessions.

2. The corpus is made for user independent lip reading, which means that multiple users will be recorded and large user variability will be required.


3. The Region Of Interest (ROI) will be fixed to the lower half of the speaker’s face.

4. The recordings will contain synchronized frontal and side view.

5. The resolution of the images will be fixed at 1/2 PAL, which is 384 × 288 pixels.

6. The video recordings will be made at 100 Hz.

7. The audio recordings will be made using a 48 kHz sample rate and 16 bit sample size.

8. The recordings will be made in a controlled environment, which means con- trolled illumination and controlled acoustic environment.

9. The language data will deal with several recognition tasks, namely, connected digits, connected letter strings, random words, context aware fixed grammar sentences and continuous speech random sentences.

10. The corpus will contain different speaking styles, namely, whispering, high speech rate and low speech rate.

4.2 Recording Setup

In the case of video data recording there is a large number of important factors that determine the success of the resulting data corpus. Hence, not only the environment but also the equipment used for recording and other settings actively influence the final result. Figure 4.1 shows the physical setup during the recordings. This section provides details on every component of this setup: video devices, audio devices, side view setup, illumination and acoustics, and the prompter software. The entire system was integrated on a Windows XP SP3 machine with 4GB of RAM and a dual core Intel CPU. The most important feature of the computing system was a 596GB software-based RAID0 configuration with 4 disks. This was extremely important for the recordings, since it was the only way we could achieve the writing rate necessary to cope with the large-bandwidth data streams. The first solution we considered was to use a RAM drive, but this configuration constrained us to a maximum of 30-40 seconds of continuous recording. We had to stop the recording every time the memory was full and wait for the data to be transferred to disk. This operation sometimes turned out to be very lengthy and made the recording sessions very difficult. Therefore, we decided to use the RAID0 configuration. It is worth mentioning here, hoping it stands as a lesson for other researchers, that in the post-processing work we found out that this four-disk configuration was only just at the limit of the required performance.

Figure 4.1: The physical settings for the recording sessions.

4.2.1 Video Devices

When one goes outside the range of consumer devices, things become considerably more complicated and definitely more expensive. The quality of the sensors and the huge bandwidth necessary to stream high speed video to the PC make high speed video recording very restrictive. Fortunately, owing to recent advances in image sensors (i.e. CCD and CMOS technology), it has become possible to build medium speed computer vision cameras at acceptable prices. We used two Pike F032C cameras built by AVT. These cameras are capable of recording at 200Hz in black and white, 139Hz when using the 4:1:1 chroma sub-sampling ratio and 105Hz when using the 4:2:2 chroma sub-sampling ratio, while capturing at the maximum resolution of 640 × 480. By setting a smaller ROI, we were able to increase the frame rate even further. Therefore, these cameras perfectly met our requirements, namely the 1/2 PAL resolution and the 100Hz rate we decided upon.

4.2.2 Audio Devices

For recording the audio signal we used two high quality condenser microphones (NT2A Studio Condensers). One microphone was used for recording the background noise. We recorded a stereo signal using a sample rate of 48kHz and a sample size of 16 bits. The data was stored in PCM audio format. Special laboratory conditions were maintained, such that the Signal to Noise Ratio (SNR) was kept at a controlled level. We considered it more advantageous to have very good quality recordings and degrade them in a post-processing step as needed. The specific noise can be simulated or recorded in the required conditions and later superimposed on the clean audio data. An example of such a database is NOISEX-92 ([Var93]). This dataset contains white noise, pink noise, speech babble, factory noise, car interior noise, etc. As said before, special attention was paid to the synchronization of the two modalities.

4.2.3 Side View Recordings

For recording the frontal and side views of the speaker's face, we found two solutions. The first solution we tried was to use a mirror placed at 45° to the side of the speaker. In this way, a parallel side view of the speaker could be captured perfectly synchronized with the frontal view. The mirror covered the speaker's face entirely. In order to cope with different users, we built a wooden holder for the mirror that could be adjusted in height. The physical settings are shown in Figure 4.2. After experimenting with the mirror, we found two disadvantages of this setup. The first problem is that the side view image was distorted by the projection in the mirror, so in order to be used, this image would have had to be corrected in advance. The second problem we discovered was that in order to include both the frontal and the side view, the field of view had to be enlarged. This would have been a step back from the point of view of our requirement to have as much detail of the lower half of the face as possible. The second solution for the dual view, and the one we actually used, is to record with two cameras, each recording its own view. The problem with this setting is that the synchronization of the two video streams is no longer guaranteed. Fortunately, the cameras we used permitted synchronization to the clock of the FireWire bus they shared. Therefore, the resulting synchronization was within 125µs, which is more than sufficient for our settings with 10ms per frame.

Figure 4.2: The physical settings for the case when the side view of the speaker's face was obtained by means of a mirror placed at 45°.

4.2.4 Controlling the Recording Environment

The environment where the recordings are made is very important, since it determines the illumination of the scene, the background of the speakers and the acoustics. As seen in Figures 4.1 and 4.2, we used a monochrome background so that, by using a "chroma keying" technique, the speaker can later be placed in different environments, making possible in this way some degree of visual noise. The windows of the room were completely blinded, the recording scene being illuminated only with artificial light. In order to remove illumination artefacts we used two panels to diffuse the light. The acoustic background noise was recorded with a separate microphone.

4.2.5 Prompter Software

In order to control the recording devices and to instruct the speaker about the current utterance and the required speaking style, we implemented a prompter-like piece of software. The program displayed the next utterance to be spoken and the speaking style required. The user was able to start and stop the recording whenever he considered appropriate. In cases where errors appeared, such as mispronunciations, word repetitions, hesitations, external noise, etc., the user had the possibility to record a subsequent take of the same utterance. The use of the prompter software greatly reduced the post-processing work. Figure 4.3 shows a screenshot of the prompter. The fonts and colours were easily customisable to make the speaker's experience as enjoyable as possible. Each session lasted at least 30 minutes; therefore, there was considerable stress on the speaker.

Figure 4.3: The prompter tool used during recordings.

4.2.6 Consistency Check During Post Processing

As mentioned in Section 2.3.3, even when the recording is automated there is still a need for post-processing. In our case we recorded the utterances in different folders and automatically saved information about the speaker, the session, the utterance and the take. However, we still needed to check the resulting corpus for consistency. Each recording was auditioned and all remarks were recorded. Sometimes the transcription of the recording needed to be adjusted because the speaker spoke a slightly different utterance than what was required. This was usually due to misreading, mispronunciation, personal speaking problems or personal reflexes, etc. Other times the recording had to be completely removed from the corpus, or, conversely, some utterances were recorded multiple times for various reasons. For each session each annotator produced a spreadsheet as shown in Figure 4.4.

Figure 4.4: Recordings consistency check sample.

Because of the high speed recording and the multiple audio and video devices, the bandwidth demand during recordings was extremely high. Sometimes the system was not able to keep up with the recordings and this resulted in frame losses. During the consistency check we also checked the number of frames lost. Some 10% of the recordings were not included in the corpus we used for experiments due to a large number of missing frames. However, we still kept these recordings in the corpus, because they can still be used for other audio-visual applications.

4.3 Data Corpus Statistics

This section gives a quantitative analysis of the NDUTAVSC data corpus. The resulting corpus requires approximately 1.5TB of storage space. There were 90 recording sessions in which 70 speakers uttered a total of 7163 utterances. The total continuous duration of the recordings was approximately 12 hours. However, due to some recording problems four speakers were removed from the final corpus. The recordings of the remaining 66 speakers are divided into 87 sessions and make up 6907 utterances, which represent exactly 10 hours and 38 minutes of continuous recordings. One speaker, a female aged 52, recorded 18 sessions. She recorded 2599 utterances that account for 3 hours and 48 minutes of continuous recordings. This speaker was selected because of her clear speaking style, especially her good articulation of words. The data collected from this respondent was used to build speaker dependent lip readers. Due to time limitations we only focused on speaker dependent systems.

4.3.1 Utterance Types

We had three settings with respect to the number of utterances and their types during the recordings. We started with 64 utterances per speaker, then we recorded 125 utterances per speaker and in the end we recorded 155 utterances per speaker. The utterance types followed the recognition tasks introduced in Chapter 2. A complete list of all three settings is given in Appendix C. The definition of the utterance types and the number of repetitions for each type were chosen so as to optimize the amount and variability of the data with respect to the level of effort on the speaker's side. In total, 1330 digit string recordings, 1374 letter string recordings, 607 connected word recordings, 1026 banking application recordings, 333 open questions and 2570 phonetically rich random sentences were recorded. There were 211 spelling utterances, 1104 whispered utterances and 207 whispered spellings. In 1080 utterances the speaker was instructed to speak at a faster than normal speech rate. In the case of our long-run respondent we collected 712 digit string recordings, 862 letter string recordings, 9 random word sequences, 4 open questions and 901 phonetically rich random sentences. She recorded only 3 spellings, 16 whispered utterances and 3 whispered spellings. In 16 cases she was requested to speak at a faster than normal speech rate.

4.3.2 Respondents

It is very difficult to guarantee complete coverage of the speakers' variability. The directions that need to be taken into account are: gender, race, speaker dialect, age distribution and education level. We recorded 70 respondents. However, as mentioned above, the data recorded for four of them had to be removed: three due to hardware problems during the recordings and one due to damage to the archive during storage. Of the remaining 66 respondents, 62 were native Dutch speakers and four had various other nationalities but spoke Dutch with high proficiency. Two other speakers were bilingual. Due to the location of the experiments, the first option was to ask our colleagues to take part in the recording sessions. This resulted in a very biased distribution with respect to age, gender and education level. In order to correct this bias and balance the pool of respondents we also asked the administrative staff, the majority of whom are female. We obtained in this way a ratio of 46 to 21, male speakers still being more numerous. The subjects were aged 19 to 64 but, as seen in Figure 4.5, the age distribution is still biased towards the range of 19 to 28 years. Figure 4.5 shows the distribution of speakers with respect to age and gender. Even though we did not completely cover the speakers' variance, we managed to record a relatively large number of speakers and we consider that we made an important step towards building a strong data corpus. We envision that, based on the recording settings we developed, the current database will be upgraded in the future so as to completely correct the current issues.

Figure 4.5: Age and gender distribution of the speakers from the NDUTAVSC corpus.

4.3.3 Language Data

As mentioned in Chapter 2, the language data was built starting from the language data used in the DUTAVSC corpus, which contained, among others, a selection of utterances from the Polyphone corpus [Boo94]. The language data was divided into 7 sets as follows: the digits 0 to 9, the 26 letters of the Dutch alphabet, 1966 random words, 1108 phonetically rich sentences, 72 conversation starters and endings, 91 context aware sentences (i.e. banking application sentences) and 41 simple open questions (for these questions the user was asked to answer with the first thing that came to mind; in this way we expected to collect spontaneous speech utterances). The utterances presented to the speakers were randomly selected from the utterance sets in accordance with the recording protocol shown in Appendix C. The final data corpus contains 60399 recorded words made up of 2605 unique words. The long-run respondent recorded 17997 words made up of 1864 unique words. Since we had a relatively large number of words recorded, the distribution of the number of samples per word is highly skewed. The most recorded words were the digits (e.g. the most recorded digit was "9", which was recorded 1805 times), followed by short words like "de" (the second most recorded word, with 1748 occurrences), "ik", "van", "een", "mijn", "op", "het", "in", and the letters of the alphabet. The words related to the banking applications, like "bankrekening", "storten", "rekening", "overmaken", "wilde", "wil", and others, also occur very frequently. A similar situation was found for the recordings made for the speaker dependent purposes. In this case the most recorded word was "de", which was recorded 828 times. The word "de" is the most used of the two definite articles in Dutch ("het" being the other one), which explains why it is the most recorded word. The direct consequence of recording real sentences is that the distribution of the number of samples per word approximates the real distribution of word usage in the language. Figure 4.6 shows the number of recordings of the 200 most recorded words. The words with a rank larger than 200 appear less than 20 times in the corpus. Figure 4.7 shows the distribution for the speaker dependent case. However, since digit and letter strings are among the most used recognition tasks in lip reading, their appearance is more prominent in our data, which results in an artificial distribution of the words. Figures 4.8 and 4.10 show the counts for the digits, while Figures 4.9 and 4.11 show the counts for the letters. The representation of the classes in the training data influences the performance of the classifier. If any class is significantly more present in the data corpus, then the model statistics can be biased towards that specific class and could yield faulty results. On the other hand, we think it is important that each word be present in the data corpus in proportion to its probability of appearance in the recognition task. One more aspect to observe here is that the different recognition tasks are represented differently in the corpus. This can also explain some of the differences in performance of the corresponding recognisers.


Figure 4.6: The number of takes of the 200 most recorded words in the NDUTAVSC corpus.


Figure 4.7: The number of takes of the 200 most recorded words in the speaker dependent NDUTAVSC sub-corpus.


Figure 4.8: The digit distribution in the NDUTAVSC corpus.


Figure 4.9: The alphabet distribution in the NDUTAVSC corpus.


Figure 4.10: The digit distribution in the speaker dependent NDUTAVSC sub-corpus.


Figure 4.11: The alphabet distribution in the speaker dependent NDUTAVSC sub-corpus.

4.3.4 Speech Rate

As shown in Appendix C, we asked the respondents to change their speaking style during the recordings. According to the recording protocol, 1080 utterances should have been spoken at a high speech rate. However, we found that while some of the speakers were not able to speak fast, others spoke very fast even in normal situations. This is not outside common-sense expectations, but it forced us to carefully reclassify the utterances. The speech rate computed over the entire data corpus has a minimum of 14WPM and a maximum of 318WPM, with a mean around 93WPM and a standard deviation of 45WPM. For the recordings that were marked as slow in the recording protocol we computed a maximum speech rate of 252WPM and a mean of 84WPM with a standard deviation of 39WPM. On the other side, the minimum speech rate for the recordings marked as fast in the protocol was 49WPM, which is relatively low. The mean was, however, 142WPM, which is considerably larger than the 84WPM found in the low speech rate case. The standard deviation is relatively large (i.e. 44WPM), since some speakers were not able to sustain a fast speech rate.

We decided that 160WPM is a better threshold for classifying the utterances with respect to speech rate. In this case, we found that the mean speech rate is around 188WPM with a standard deviation of 27WPM. We also considered a word definition based on 5-letter multiples, which takes into account the length of the words. Table 4.1 gives the computed values for the 5-letter-multiple word definition and for the regular word definition, for the entire corpus and also for the sub-corpus of the speaker we used for the speaker dependent analysis. When classifying the utterances into high speech rate versus normal and low speech rate, we found that instead of 1080 utterances we can actually consider only 519 utterances as high speech rate. However, when the 5-letter word definition was used, this number became 2372. The same situation was found for the speaker dependent recordings: instead of 16 utterances with high speech rate we found 25 and 504, respectively. Table 4.2 shows the actual coverage of the declared high speech rate utterances by the identified high speech rate utterances.

Table 4.1: Speech rates (WPM) in the NDUTAVSC corpus.

                                           Minimum  Maximum  Mean  Std
Entire corpus: normal word definition
  global                                        14      318    94   46
  slow                                          14      252    85   40
  fast                                          49      318   143   45
  > 160WPM                                     161      318   188   28
Entire corpus: 5-letter word definition
  global                                        14      398   128   71
  slow                                          14      377   114   62
  fast                                          58      398   206   62
  > 160WPM                                     161      398   208   40
Sub-corpus: normal word definition
  global                                        14      241    70   38
  slow                                          14      216    70   37
  fast                                          88      241   164   43
  > 160WPM                                     161      241   182   21
Sub-corpus: 5-letter word definition
  global                                        14      343    90   61
  slow                                          14      277    89   60
  fast                                         131      343   242   66
  > 160WPM                                     161      343   183   23

4.3.5 Viseme Coverage

The high speed recordings were justified by the need to ensure coverage of each viseme by a sufficiently large number of frames. It was shown in Section 2.3.2 that this is especially important when the speech rate is high.

Table 4.2: The declared high speech rate recordings versus the actual high speech rate recordings in the NDUTAVSC corpus.

                                          Declared  Actual  Covered  Percent covered
Entire corpus: normal word definition         1080     519      354           32.78%
Entire corpus: 5-letter word definition       1080    2372      798           73.89%
Sub-corpus: normal word definition              16      25       10           62.50%
Sub-corpus: 5-letter word definition            16     504       13           81.25%

In the case of the old data corpus the speakers only used their normal speech rate and no requirements were made to increase the speech rate. Even in this case the range of speech rates was very wide. As we have seen in the previous section, in the new data corpus the speech rate range has increased even more, due to the fact that we asked the speakers to change their speech rate during the recordings. It is interesting to note that the range has widened in both directions, towards 0 and towards infinity. This is probably because the speakers sometimes tended to speak slower than their normal speech rate, as they intended to make a clear distinction between their fast and their normal speech rate. When we select the fast speech rate utterances, the increase is clearly visible in all cases; the question is whether the 100Hz frame rate was necessary or not. Table 4.3 shows the statistics for all considered cases. It is clearly visible that in the case of normal speech rate the mean number of frames per viseme is relatively high and definitely higher than in the case of the old data corpus. However, while the frame rate has increased 4 times, the mean FPV has only increased 2 times in the global analysis and 3 to 4 times in the slow speech case. In the case of high speech rate we found that the mean FPV values are concentrated around the value 8 in all cases. The variance of the coverage is extremely low compared with the normal speech case. This is because the normal speech rate goes from very slow up to the upper limit, while the fast speech rate is confined to the extreme values of a normal speaker. The increase in coverage is again consistent with the increase in frame rate, namely from a mean of 3FPV in the DUTAVSC corpus to 8FPV in the new corpus; however, as in the general case, the increase is slightly smaller. Figure 4.12 shows the histogram of the coverage for the entire corpus when all utterances are considered. Figures 4.13 and 4.14 show the histograms for the high speech rate, for the case of the declared high speech rate utterances and the case of the real high speech rate utterances (i.e. the ones with more than 160WPM). The shift to the right of the distribution is clearly visible. The situation is similar in the case of the sub-corpus. It is worth noting that we found 8 utterances with an FPV value equal to 4 and that approximately 900 utterances had an FPV value below 8. These utterances would have had less than 1FPV and less than 2FPV, respectively, if the standard 25Hz frame rate had been used.

The conclusion here is that in many cases the high speed recordings are a neces- sity and the distinction between high speech rate and low speech rate needs to be addressed when building a robust lip reader.

Table 4.3: The viseme coverage by data, computed as the number of Frames Per Viseme (FPV) in the NDUTAVSC corpus.

                                        Minimum  Maximum  Mean  Std
Entire corpus: global statistics              4      167    22   23
Entire corpus: slow speech                    4      167    25   24
Entire corpus: declared fast speech           4       26     8    3
Entire corpus: slower than 160WPM             8      167    30   25
Entire corpus: faster than 160WPM             4       14     8    1
Sub-corpus: global statistics                 4      167    36   31
Sub-corpus: slow speech                       6      167    36   31
Sub-corpus: declared fast speech              4       10     7    2
Sub-corpus: slower than 160WPM                9      167    42   31
Sub-corpus: faster than 160WPM                4       11     9    1


Figure 4.12: The viseme coverage (histogram of frames per viseme) in the NDUTAVSC corpus. The frame rate is 100Hz and all the utterances are included.


Figure 4.13: The viseme coverage in the NDUTAVSC corpus when only the utterances declared as fast speech are analysed. The frame rate is 100Hz.


Figure 4.14: The viseme coverage in the NDUTAVSC corpus when only the utterances that are spoken at more than 160WPM speech rate are analysed. The frame rate is 100Hz.

4.4 Discussion

In this chapter we presented the data corpus built for this thesis. Building this corpus took an important slice of the time dedicated to the research on which this dissertation is based. However, as we argued in Chapter 2 and as emerged from the analysis made in this chapter, this work was necessary. In general, the data acquisition process is very important for the success of the research endeavour, but many times it is done improperly, because it is extremely time consuming and it is not always possible to valorise it from the scientific point of view (i.e. it is not as strong in terms of scientific content and it is therefore not easy to publish scientific papers about it). We presented in this chapter the recording settings used during the experiments and analysed the data that resulted. We included exact statistics on the resulting corpus and compared the new corpus with the old one. Even though we did not completely cover all the requirements, we made decisions so as to obtain a corpus that is balanced with respect to the speaker characteristics and that has a language coverage comparable to that of speech recognition corpora. Nevertheless, we built a very large corpus, to our knowledge the largest corpus for lip reading at the time of this writing, and we are confident that this corpus can be a very good starting point. In any case, as it was for our research, this corpus can be used for testing different approaches. Appendix D lists an example of the utterances recorded as they were presented to the speaker through the prompter software.

Chapter 5

Statistical Lip Geometry Estimation for Lip Reading

In this chapter we present a method for describing the shape of the mouth based solely on a statistical interpretation of the distribution of the pixels that lie on the lips, together with the performance obtained by lip reading systems trained on the feature vectors defined using this method. This method is special because it does not rely on an a-priori defined model of the lips, but it is still not a bottom-up approach since it does not try to detect any special features of the mouth (i.e. it does not search for key points or other low level features). The method was first described by Wojdel and Rothkrantz in [Woj00]. The chapter starts by introducing the details of the algorithm, giving as well some examples of image filters that could be used in the preprocessing of the input image. Thereafter, it presents some methods for making the algorithm more robust, for instance by restricting the ROI and by refining the output of the filter through an outlier detection method. After giving some examples validating the suitability of the method, the end of the chapter presents the performance of the lip reader built on this basis. This chapter also presents a method for the detection and description of the teeth, tongue and cavity. These elements are very important for lip reading ([Wil98a]) and can be used as extra features in any setting, irrespective of the main feature extraction technique. They are introduced in this chapter because the detection algorithm is similar to the main topic of the chapter.

5.1 Description of the Algorithm

This method uses an image filter to compute the probability of each pixel being part of the lips. The result of the filter is then statistically interpreted in order to describe the shape of the mouth. The analysis of the shape of the mouth is independent of the choice of the image filter; however, the performance of the algorithm is strongly

linked to the capacity of the filter to exactly identify the pixels that lie on the lips. An overview of the algorithm is depicted in Figure 5.1. The algorithm evolves as a processing pipeline and contains the following steps: extraction of the next frame, ROI detection or tracking, application of an image filter to describe the pixels, analysis of the filter result and refinement, and finally analysis of the refined filter result and computation of the output features. The following sections present in detail all the steps necessary to obtain the depicted images.

Figure 5.1: The process of estimating the lip geometry shown on a real example. From left to right: the input image, the result of the ROI detection and image filter procedure, the outlier removal and the lips, cavity and teeth detection, and finally the geometric features shown in polar co-ordinates.

5.1.1 Defining the Region Of Interest

As the first step of the processing pipeline we have to locate the face and then the mouth of the speaker. The reduction of the search area (i.e. the ROI) removes unnecessary parts from the image, which is very important for at least two reasons: first, the processing time is greatly reduced, and secondly, many possible unwanted artefacts can be avoided. For this we used the Viola-Jones algorithm for object detection [Vio01]. This classifier uses a method for selecting the most representative Haar-like features using a learning algorithm based on AdaBoost. It combines a set of weak classifiers using a "cascading" approach which, corroborated with a fast method for computing the Haar-like features, allows for high speed and very low false-negative rates. In order to increase the reliability of the ROI extraction process we used a combination of detection and tracking steps. Hence, in a first step the mouth of the speaker is detected using the Viola-Jones detector, and in the subsequent steps the mouth is tracked using an object tracking algorithm which is adapted using the last detected ROI. The object tracking algorithm uses a Gaussian Mixture Model to model the colour distribution of the object and of the background, and a deformable template to optimally fit the tracked object.
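As an illustration, the detection step could be sketched with OpenCV's Viola-Jones implementation as below. The cascade file is a stand-in shipped with OpenCV and is used here only for illustration; the combination with the tracking step described above is not shown.

```python
import cv2

# Stand-in cascade shipped with OpenCV (illustrative assumption).
mouth_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_smile.xml")

def detect_mouth_roi(frame_bgr):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    candidates = mouth_cascade.detectMultiScale(gray, scaleFactor=1.1,
                                                minNeighbors=10)
    if len(candidates) == 0:
        return None
    # Keep the largest detection as the mouth region of interest.
    x, y, w, h = max(candidates, key=lambda r: r[2] * r[3])
    return frame_bgr[y:y + h, x:x + w]
```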

5.1.2 Lips Segmentation

The next step in the process is to compute the probability of each pixel belonging to the lips. Fortunately, because the input image contains only the mouth area and since the lips have a distinct colouring, we can extract the lip pixels without the need for complicated additional object recognition techniques. The lip-selective filter is not fixed to any pre-chosen image filter and therefore any available method can be used. The only requirement is that the filter returns its result in probabilistic terms, namely, each pixel should be given a value between 0 and 1, with 1 meaning maximum confidence that the given pixel belongs to the lips. As discussed in Section 2.4.4, it is very important to choose the most relevant colour encoding system. We tested several image filters based on different colour systems. As with all image segmentation techniques, the illumination conditions and the quality of the recorded video sequences greatly influence the end result. A parabolic shaped filter is very simple and, from the point of view of computing power requirements, very attractive. Unfortunately, during our experiments we found that in many cases, if the illumination of the face is not perfect, the hue component by itself is not sufficient for proper lip selection. Even combining in a cascade the results from parabolic thresholding in different colour spaces did not yield sufficiently robust filters. On the other hand, we found that a simple trained feed-forward neural network performed much better on our data corpus. The network that was used has only 5 neurons in a hidden layer and one output neuron. The network was trained directly on the RGB values of the pixels in the image. This filter achieved extremely accurate results. The results obtained with several filters are shown in Figure 5.2, where it is visible that the neural network outperforms all the other approaches. In all situations the same area was used for training, namely the area covering the lower lip. One problem of the hue based filter is caused by the fact that the hue is not well defined for very bright areas. This can be seen in the lower left of the corresponding images, where a very bright area is present: the hue filter generates an erroneous edge there. The pseudo-hue filter has, on the other hand, problems with dark areas; therefore shadows, or in our case the mouth cavity, tend to appear in the filter's results. The filter based on luminance fails to 'see' the lips, although the bright edges seem to be detected very well. Comparing the thresholding approach with the parabolic approach, we can conclude that the latter does a better job of assigning the probability.
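For illustration, such a filter could be sketched as below using scikit-learn; the thesis does not prescribe a particular library, and the training parameters shown are assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_lip_filter(rgb_pixels, lip_labels):
    """rgb_pixels: (n, 3) array scaled to [0, 1]; lip_labels: 0/1 per pixel."""
    # One hidden layer of 5 neurons, as described above; other settings are
    # illustrative assumptions.
    net = MLPClassifier(hidden_layer_sizes=(5,), activation="logistic",
                        max_iter=2000)
    net.fit(rgb_pixels, lip_labels)
    return net

def apply_lip_filter(net, roi_rgb):
    """Return a per-pixel lip probability map in [0, 1] for an RGB ROI image."""
    h, w, _ = roi_rgb.shape
    probs = net.predict_proba(roi_rgb.reshape(-1, 3) / 255.0)[:, 1]
    return probs.reshape(h, w)
```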

5.1.3 Defining the Feature Vectors

Here is where the innovation of this approach comes into play. The result of the filter is considered to be the distribution function I(X, Y) of a spatial bivariate distribution. The first observation to be made is that the expected location of this distribution, [EX, EY], accurately approximates the centre of the mouth. Using the mean location, we transform the filter's result into polar coordinates using the following formula:

J(a, r) = I(EX + r cos(a),EY + r sin(a)). (5.1)

The algorithm estimates the shape of the mouth by describing the conditional distribution, conditioned on the direction, obtained from the distribution function J. For each angle α we, therefore, define the mean and the variance of the conditional distribution using the following formulas:

(a) Input image (b) Ground truth (c) Neural Network

(d) Hue parabolic (e) PseudoHue linear

(f) PseudoHue parabolic (g) Hue linear (h) Luminance parabolic

Figure 5.2: The results of several image filters used to segment the lips pixels.

\[ M(\alpha) = \frac{\int_r J(a, r)\, r\, dr}{\int_r J(a, r)\, dr}, \quad\text{and} \tag{5.2} \]

\[ \sigma^2(\alpha) = \frac{\int_r J(a, r)\,\left(r - M(\alpha)\right)^2\, dr}{\int_r J(a, r)\, dr}. \tag{5.3} \]

As an image is discrete rather than continuous, all of the values are obtained by summation rather than integration, so we only operate on estimates of those values, namely \( \widehat{M}(\alpha) \) and \( \widehat{\sigma}^2(\alpha) \). For each given angle, the conditioning is defined by the circle sector whose centre contains the vector from the mouth centre making the angle α with the horizontal. The rightmost image in Figure 5.1 shows an example of the estimates of the two values for 18 directions. In this image the values of \( \widehat{M}(\alpha) \) are shown for each angle and are linked together by a line. It is clearly visible that this polygon passes through the centre of the lips and accurately describes the shape of the mouth. On the other hand, for each angle the perpendicular lines depicted in Figure 5.1 represent the 95% confidence interval of the conditional distribution, namely the range \( [\widehat{M}(\alpha) \pm 1.96\,\widehat{\sigma}(\alpha)] \). So the two sets of values accurately describe the shape of the mouth. The 95% confidence interval clearly describes how thick the lip in that specific direction is, while the means describe the shape of the mouth. We should note that the lips of a wide-stretched mouth appear thinner than those of a closed mouth when related to the overall size of the mouth. The accuracy of the procedure depends on the performance of the image filter used. The difficulties on the filter's side come from the artefacts that can appear on the speaker's face: shadows, areas of the face that have the same colour as the lips, the use of coloured lipstick, etc.; some of these problems have been diminished by using an outlier removal technique on the filter's result. This is presented in Section 5.1.5. The feature vectors are defined as the vectors containing the combined sets of values, computed for a number of angles, after appropriate normalization. Choosing the number of directions is a compromise between accuracy and processing efficiency. The longer the vectors, the more information on the original distribution they contain, but the longer it takes to extract and process them. Also, higher dimensionality generally makes it more difficult to train the recognition modules. Wojdel and Rothkrantz indicate that a division of the space into 18 sectors is optimal for obtaining good results ([Woj00]). Therefore, we used 18 sectors, shown in Figure 5.3, to estimate the shape of the mouth, resulting in a feature vector of dimension 36.
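A compact sketch of the feature extraction of equations (5.1)-(5.3) is given below; the discrete sector handling and the small constant guarding against empty sectors are simplifying assumptions of the sketch.

```python
import numpy as np

def lip_geometry_features(prob_map, n_sectors=18):
    """prob_map: 2D array of lip probabilities; returns 2 * n_sectors features."""
    h, w = prob_map.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    total = prob_map.sum()
    ex = (xs * prob_map).sum() / total            # E[X]: mouth centre, x
    ey = (ys * prob_map).sum() / total            # E[Y]: mouth centre, y
    r = np.hypot(xs - ex, ys - ey)
    a = np.arctan2(ys - ey, xs - ex) % (2 * np.pi)
    sector = (a // (2 * np.pi / n_sectors)).astype(int) % n_sectors
    means = np.zeros(n_sectors)
    variances = np.zeros(n_sectors)
    for s in range(n_sectors):
        wgt = prob_map[sector == s]
        rad = r[sector == s]
        denom = wgt.sum() + 1e-12                 # guard against empty sectors
        m = (wgt * rad).sum() / denom             # M(alpha), equation (5.2)
        v = (wgt * (rad - m) ** 2).sum() / denom  # sigma^2(alpha), equation (5.3)
        means[s], variances[s] = m, v
    return np.concatenate([means, variances])     # 2 x 18 = 36 features
```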

Figure 5.3: The 18 feature sectors centred at the centre of the mouth.

5.1.4 Visual Validation of the Feature Vectors

Figure 5.4 shows a sequence of frames with the final annotations and the corresponding features extracted. The feature vectors extracted from the entire recording using the above approach are depicted in Figure 5.5. The beginning and the end of the utterance are clearly visible, as indicated on top of the image. The longer pauses between the words are also slightly visible, as are the round areas located in the variance part, which represent the time when the mouth was wide open. From this image alone we can conclude that there is significant speech-related information in the resulting feature vectors.

Figure 5.4: Example of processed frames. The annotation and the extracted features are shown as well.

Figure 5.5: Pairs of \( \widehat{M}(\alpha) \) and \( \widehat{\sigma}^2(\alpha) \) vectors extracted from a video sequence. The beginning and ending of the utterance, as well as the pauses between consecutive words, are marked by arrows.

Figure 5.6 shows the sample mean of \( \widehat{M}(\alpha) \) computed over all utterances that contain the letters A, H and K, respectively. The viseme transcriptions of the three letters are as follows: A = [aa], H = [h aa] and K = [gkx aa]. We can see in this figure that the shapes corresponding to the letters A and K overlap most of the time, making the classification somewhat difficult. However, as seen in Figure 5.7, they become more distinguishable when considering the values of \( \widehat{\sigma}^2(\alpha) \). On the other hand, the situation is almost reversed when analysing in the two images the shapes corresponding to the letters H and K. Therefore, this is an example of a situation where the values of one feature type are not sufficient for discriminating among the three letters. We should remark here that, because the graph in Figure 5.6 contains the mean of each feature for each angle, we see in fact the mean shape of the mouth during the uttering of the corresponding letter. Therefore, this image is not completely reliable, since the information about the dynamics of the mouth during speaking is lost due to averaging. Figure 5.8 shows the standard deviation of \( \widehat{M}(\alpha) \), which brings back some information about the mouth dynamics.


Figure 5.6: The average of \( \widehat{M}(\alpha) \) computed over all utterances containing the letters A, H and K, respectively.

5.1.5 Refinement of the Filter Results: Outlier Removal

As can be seen in the second and third images in Figure 5.1, even after reducing the area of interest, and even with optimal filtering of the mouth, in some cases the filtered image still contains unwanted artefacts. In these images large artefacts are visible, for instance, just below the lower lip. This can be the case when there are areas on the speaker's face whose colour exactly matches the colour of the lips, such as wounds, acne vulgaris or other marks on the face. In order to reduce the impact of such occurrences, a process of outlier detection and deletion can be used before the actual feature extraction. The blue area shown in the third image of Figure 5.1, and also in the first row of images in Figure 5.4, superimposed on the input image, represents the area which is affected by the outlier deletion process. The algorithm uses the same interpretation of the filter's result, namely as a spatial distribution function, and defines an outlier as an outlier of this distribution. Therefore, an outlier is defined as any location that is further away from the mean by more than a number of standard deviations.


Figure 5.7: The average of \( \widehat{\sigma}^2(\alpha) \) computed over all utterances containing the letters A, H and K, respectively.

The outlier detection process assumes that the image filter, even though it is not perfectly accurate, still assigns the correct probability to a significant number of lip pixels. In Figure 5.1 we used a rectangular delimitation area, which makes use of a-priori knowledge about the shape of the mouth. Therefore, we applied a different definition of an outlier on the horizontal axis than on the vertical axis. More accurate outlier detection is obtained by defining an elliptical delimitation area, as shown in Figure 5.4. The rectangular/elliptical shape refers to the interior edge of the blue area in these images. In Figure 5.9 we show an example where the necessity of an outlier detection scheme becomes clearer. In this figure, the images on the left show the results when no outlier detection is used, while those on the right show the results when elliptical outlier detection is used. We can see that the impact on the resulting features can be very large.
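An illustrative sketch of the elliptical variant of this outlier removal is given below; the cut-off factor k is an assumed parameter, not a value prescribed by this work.

```python
import numpy as np

def remove_outliers(prob_map, k=2.5):
    """Suppress filter responses outside an elliptical region of k standard
    deviations around the mean of the spatial distribution."""
    h, w = prob_map.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    total = prob_map.sum()
    ex = (xs * prob_map).sum() / total
    ey = (ys * prob_map).sum() / total
    sx = np.sqrt(((xs - ex) ** 2 * prob_map).sum() / total)
    sy = np.sqrt(((ys - ey) ** 2 * prob_map).sum() / total)
    # Elliptical delimitation: keep only pixels within k standard deviations.
    inside = ((xs - ex) / (k * sx)) ** 2 + ((ys - ey) / (k * sy)) ** 2 <= 1.0
    return prob_map * inside
```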

5.1.6 Cavity, Tongue and Teeth Description

The shape of the lips is not the only determinant of a spoken utterance. It has been shown that the position of the tongue and the appearance of the teeth are important sources of information for the human lip reader. However, these elements are not always visible and sometimes their visibility is correlated not only with the spoken utterance but also with the speaking style of the interlocutor. It is essential in the case of lip reading to extract from the visual channel as much information as possible about the utterance being spoken, especially given the fact that, compared with the audio modality, it is agreed that visual speech inherently provides less information.


Figure 5.8: The standard deviation of M(α) computed over all utterances containing the letters A, H and K, respectively.

We proposed that this type of information should be used to enrich the feature vectors in any case. Tracking the teeth is easier than tracking the tongue. The teeth are much brighter than the rest of the face and can therefore be located using a simple filtering of the image intensity. The visibility and the position of the tongue cannot be determined as easily as in the case of the teeth, because the colour of the tongue is almost indistinguishable from the colour of the lips. We can, however, easily determine the amount of mouth cavity that is not obscured by the tongue. While the teeth are distinctly bright, the whole area of the mouth behind the tongue is usually darker than the rest of the face, so we can apply an intensity based thresholding filter in both cases. The teeth and cavity areas are both highlighted in Figures 5.1 and 5.4, the cavity in green and the teeth in dark blue. In order to turn the appearance of these two elements into quantitative data we used the total area of the highlighted region and the position of its centre of gravity relative to the centre of the mouth. Therefore, the feature vectors were enlarged with 6 more entries for every frame.
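The conversion into the six additional features can be illustrated with the simplified sketch below; the grey-scale input, the fixed thresholds and the function names are assumptions for illustration only and do not reproduce the exact filter settings used in the experiments.

import numpy as np

def teeth_cavity_features(gray, mouth_centre, bright_thr=200, dark_thr=50):
    # Returns 6 values: area and centre-of-gravity offset (relative to the
    # mouth centre) for the teeth (bright) region and the cavity (dark) region.
    cx, cy = mouth_centre
    features = []
    for mask in (gray >= bright_thr,   # teeth: much brighter than the face
                 gray <= dark_thr):    # cavity: darker than the face
        ys, xs = np.nonzero(mask)
        area = float(xs.size)
        if area > 0:
            gx, gy = xs.mean() - cx, ys.mean() - cy
        else:
            gx, gy = 0.0, 0.0
        features.extend([area, gx, gy])
    return np.array(features)          # appended to the 36 shape features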

5.2 Lip Reading Results

We used the algorithms described in this chapter to process our data corpus. For each recording a file was produced containing one feature vector for each video frame in the recording. Each vector had 42 components: 2 × 18 shape features, namely the mean and the variance of the conditional distribution for each of the 18 angles, and 6 intensity features describing the presence of the teeth and the cavity.

Figure 5.9: Outlier detection for improving the estimation of the lip geometry.

We trained a lip reader based on the HMM approach for each recognition task described in Section 2.2, namely for connected digits, connected letters, grammar based utterances, random sentences and the complete corpus. The best result was achieved for the digit recognition task, for which we obtained a 43.24% word recognition rate and a 33.59% accuracy rate.

As discussed in Chapter 2, there are several ways to improve the basic performance of the system. We therefore considered the inclusion of the first and second derivatives of the feature vectors. This meant that the new feature vectors would have 126 entries. In the first group of settings each state of the HMMs used for inference contains a 42-dimensional Gaussian, while in the second group of settings this changes to a 126-dimensional Gaussian. Therefore, the number of parameters that need to be computed during the training stage increases many-fold with the new settings. If there is enough data in the training set to train all the new parameters, the performance of the new recognisers increases considerably. In our case we found that for the best case, namely digit recognition, the word recognition rate was 54.05% and the word accuracy was 43.24%. In the case of the letter string recognition task the increase was smaller, from a 29.48% word recognition rate to 34.33%. We think that this can be explained by the fact that the corpus used for the letter recognition task was smaller than that used for the digit recognition task.

In each state the distribution over the observed features is approximated using a Gaussian distribution. However, the real distribution of the observations is rarely Gaussian. Therefore, in order to make the approximation better, a mixture of Gaussian distributions is used. Since the number of Gaussian distributions suitable to approximate the real distribution is not known in advance, a trial and error approach is usually taken. In our case this approach proved to be very useful, since the performance of the recognizer was substantially increased. For instance, in the case of the digit recognition task, when only the static features were used we obtained a word recognition rate of 58.69% and a word accuracy of 49.81%. This result was obtained when 32 Gaussian mixtures were used. In the same experiments, but with the dynamic features added, the results increased as well, to a maximum of 74.52% word recognition rate and 59.46% word accuracy with 25 Gaussian mixtures.

All the results above were obtained from systems that considered isolated visemes as building blocks. This means that no influence is considered from neighbouring visemes. However, in speech recognition it is good practice to use the left and the right viseme to build a local context, as introduced in Section 2.6.3. This is only possible when there is sufficient data in the training corpus so that the number of unseen context combinations is very low. In the case of continuous speech recognition tasks this is very difficult to obtain. However, we were able to use this approach for the digit and letter recognition tasks with great success.
For instance, when the full feature vectors combine the static and the dynamic features, the word recognition rate increases from 54.05% to 69.88%. When combining all the above approaches we obtained the best results for the digit recognition task when using 28 Gaussian mixtures, namely a word recognition rate of 89.96% and a word accuracy of 76.83%. Figure 5.10 shows the word recognition rate and word accuracy as a function of the number of Gaussian mixtures for the CD task with all the above tuning applied. It is clearly visible in this case that, given sufficient training data, using a Gaussian mixture to approximate the state distribution has a great impact.
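As a rough illustration of how the 42-dimensional vectors grow to the 126 dimensions mentioned above, the sketch below appends first and second time differences to each frame. The actual system computes its dynamic coefficients inside the HMM toolkit, typically with a regression window rather than a plain frame difference, so this is only an approximation of that step.

import numpy as np

def add_dynamic_features(static):
    # static: array of shape (n_frames, 42) with the per-frame features.
    # Returns an array of shape (n_frames, 126): static + delta + acceleration.
    delta = np.gradient(static, axis=0)   # first derivative over time
    accel = np.gradient(delta, axis=0)    # second derivative over time
    return np.hstack([static, delta, accel])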


Figure 5.10: The WRR and Acc results for the CD recognition task as a function of the number of mixtures. The x axis gives the number of mixtures and the y axis shows the results obtained. The recognition system consisted of context aware HMMs trained on 126-dimensional feature vectors.

The results for the other, more complex recognition tasks, such as random sentences, grammar based utterances and continuous speech, were, as expected, less impressive. However, they are very promising because they represent great improvements over the previous results. For instance, in the case of the GU recognition task the best result obtained was WRR 48.36% and Acc 10.33%, while for the ALL recognition task the best result obtained was as low as WRR 15.94% and Acc -8.96%. Due to the difference in the complexity of the recognition tasks, a decrease in performance was expected. However, especially in the latter case, we investigated more in depth to find the exact reasons. This was also because the results show a great difference between the word recognition rate values and the accuracy values. This difference appears because of a very high insertion rate. This is a big problem in the lip reading domain, because any time the mouth is moving the system may interpret it as speaking. This is not the case for speech recognition, where silence is very well defined. Therefore, a better definition of the silence models is needed.

The insertion error is, however, not the only issue here. When we looked at the detailed results, we found that the substitution error level was even higher. For instance, for the GU task reported above the substitution rate was 39.93% while the insertion rate was 38.04%, and for the ALL task the substitution rate was 74.02% while the insertion rate was 24.90%. This picture shows that the systems are not very well trained in general, suggesting that the amount of data samples in the training set is well under the optimum. To investigate this further, we built recognition systems from which we removed the language modelling layer, which means that the results are obtained in terms of viseme strings rather than word strings. For instance, in the case of the GU task the results were WRR 56.89% and Acc 15.30%, with a substitution rate of 25.07% and an insertion rate of 41.59%. There are two things to remark here. Firstly, the recognition performance has increased. Secondly, and more importantly, while the substitution rate has decreased, the insertion rate has increased considerably. These results explain the low recognition rates and do so in agreement with our expectation that most errors are made in the non-speaking parts of the utterances. The increase in performance was observed in the case of the ALL recognition task as well: WRR 39.23% and Acc 20.56%.

In order to get a better idea about the recognition results we investigated the confusion matrices. It is possible to visualize the viseme confusion matrix in all cases when the results are given in terms of viseme strings. However, when the results are given in terms of word strings, visualising the confusion matrix is feasible only in the case of digit strings and letter strings. In Figure 5.11 the images a) and b) show the confusion matrices obtained by the most successful recognisers, respectively. In the case of digit recognition the confusion matrix is almost perfectly diagonal. For the alphabet letters this is no longer the case. We can notice that the most confused digit is the digit < 1 > ([iee gkx]), which is often confused with the digit < 9 > ([gkx iee gkx at]).
On the other hand, the letter < A > ([aa]) is confused with < H > ([h aa]), the letter < C > ([sz iee]) with < D > ([td iee]), the letter < G > ([gkx iee]) with < D > ([td iee]), the letter < N > ([eeh gkx]) with < L > ([eeh l]), etc. Exactly the same pattern appears irrespective of the number of mixtures used. The images c) and d) from the same figure show the mean confusion matrices for each case. It can be seen that when a large part of the word transcription is similar, the confusion increases. It should be noted that the confusion matrix only shows the substitutions and, to a small extent, gives some insight into the deletions. The confusion matrix has some empty rows and columns. These elements correspond to the letters that, due to the viseme definition, have similar transcriptions in the viseme space. These pairs were listed in Section 2.5.3.



Figure 5.11: The confusion matrices obtained by the best systems in the CD and CL tasks, respectively. a) the confusion matrix for CD task in the best case. b) the confusion matrix for CL task in the best case. c) the mean, over the mixture number, confusion matrix for the CD task. d) the mean, over the mixture number, confusion matrix for the CL task.

We also investigated the results in an N-Best approach where the first 5 possible outcomes were considered. This approach makes it possible to post-process the results in order to choose the most probable outcome. However, the increase in the best result was rather marginal. Even though it is not common practice in speech recognition, because our corpus is still relatively small compared with a regular speech recognition corpus, we analysed the results in a 10-fold validation experiment. In a 10-fold validation experiment the data is divided into 10 folds and each time 9 folds are used for training and 1 for testing. This means that we trained on 90% of the data and tested on 10% of the data in the corpus. This approach is meant to increase the certainty in the results, namely, one expects a low variance in the results obtained on different data. This is exactly what we found in our experiments. For instance, in the case of the CD task we found that the mean word recognition rate over the 10 folds was 88.82% with a standard deviation of 1.48%.

The maximum word recognition rate was 91.39%. Similarly, the mean accuracy rate found was close to the one obtained in the first experiment, namely 74.50% with a standard deviation of 3.73%. The maximum accuracy obtained was 79.43%. In all the experiments we noticed a very large insertion rate, which agrees with the previous findings. The best results are summarized in Table 5.1.

Table 5.1: The best results obtained based on the SLGE feature extraction method.

Task   WRR      Acc
CD     91.39%   79.43%
CL     56.34%   17.16%
GU     56.89%   15.30%
All    39.23%   20.56%

5.3 Conclusions

In this chapter we investigated the use of a statistical approach for estimating the shape of the lips for lip reading. We presented two methods for improving the process of lip geometry estimation, namely region of interest detection and outlier detection. These improvements make the method more robust to changes in the performance of the colour filter used. By using an object detection algorithm to find an accurate bounding rectangle around the mouth we remove much of the face area on which the colour filter is prone to make errors. Furthermore, the outlier detection approach produces smoother and more accurate mouth shapes. From the point of view of processing complexity, we found that this approach is very fast and could be successfully deployed in real time applications. However, we should stress here that the development of a suitable colour space and accompanying filter would make this method more universally applicable.

The results obtained based on this approach show great improvements over the previous results. In the case of the simpler tasks, like digit string and letter string recognition, we achieved results comparable with the previous results. However, in the case of the more complex systems we found great improvements. This result is a clear indication that this method can be successful for more complex systems. However, in order to achieve good performance we need considerably more data of better quality (even though, as shown in the previous chapters, the corpus we used was specially built for the current research and is considerably larger than the previous data corpus).

Chapter 6 Active Appearance Models for Lip Reading

Active Appearance Models (AAM) are a model based approach to image segmentation. The definition and the theoretical aspects of this approach were introduced in Section 3.4 of Chapter 3. In the current chapter we describe the use of AAM for lip reading. We first define the set of visual features used and, thereafter, we present and analyse the obtained results.

6.1 Description of the Algorithm

The algorithm starts by applying the AAM searching scheme in order to detect the shape of the mouth. Based on the locations of the points that define the mouth shape, we compute a set of geometric features that are used to train the recognition systems.

6.1.1 Facial Model for Lip Reading

As shown in Section 3.4, the AAM algorithm iteratively searches for the best fit between a model defined by a set of landmarks and the image being processed. Based on a-priori knowledge about the shape of the object, the set of landmarks is defined such that it optimally describes the object. In our case we required that the selected points describe the shape of the mouth in detail, especially capturing the speech related aspects. Therefore, the final model should exactly segment the lips at all moments during speech. After experimenting with different models and analyzing the results, followed by long discussions, we decided to use a model composed of 29 points, distributed around the mouth, chin and nose. This model is shown in Figure 6.1. For training a model, between two and four hundred images were manually processed. In order to obtain reliable results the images were selected such that they cover all the variance in the data.


Figure 6.1: The AAM model.

This was achieved in an iterative process. We first started with a random selection of a few tens of images, which were used to build a first model. This model was used for processing until its performance decreased below some visually assessed threshold. The images that were badly processed were added to the training set and a new model was obtained. This process continued until the performance of the model stabilized. In the end we trained a number of models for each speaker in the dataset. For speakers that recorded multiple sessions we trained one model per session. Even though the process is fairly automated, this was extremely laborious work, since the corpus contains more than 4.3 million frames, and it was split among various people. Each assistant was asked to train a model and supervise the processing of the rest of the frames. Splitting the data among different people makes it more difficult to guarantee the uniformity of the end result over the entire corpus. Therefore, to assure uniformity of the processing we used a strict definition of the landmarks. We also defined constraints that act on pairs of landmarks. The rest of this section gives the definitions used for the landmarks. Before going to the next paragraphs, we should introduce some anatomical elements on which the definition of the landmarks depends. Figure 6.2 shows the anatomy of the mouth area.
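The iterative selection of the training images described above can be summarised by the loop below. It is only a schematic sketch: the AAM training, the AAM search and the visual quality check are passed in as callables because they stand for steps performed with an AAM package and by the human assistants, and the initial size, threshold and batch size are illustrative assumptions.

import random

def bootstrap_training_set(frames, train_aam, fit_aam, fit_quality,
                           initial_size=30, threshold=0.8, batch=20):
    # frames: the available frames for one speaker (or recording session).
    training_set = random.sample(frames, initial_size)
    previous_bad = None
    while True:
        model = train_aam(training_set)
        # Frames whose fit falls below the visually assessed quality threshold.
        bad = [f for f in frames
               if fit_quality(fit_aam(model, f), f) < threshold]
        if previous_bad is not None and len(bad) >= previous_bad:
            return model               # performance has stabilised
        training_set.extend(bad[:batch])
        previous_bad = len(bad)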

Outer Mouth Contour The points on the outer mouth contour are defined as follows (see Figure 6.1):

1. Point 0 is the leftmost point of the lips (i.e. left mouth corner).

2. Point 6 is the rightmost point of the lips (i.e. right mouth corner).

3. Points 2, 3 and 4 are placed in accordance with the philtrum (infra nasal depression), namely, 2 and 4 at the foot of the philtral columns, respectively,

Figure 6.2: The mouth area anatomy.

and 3 in the place where the philtrum meets the upper lip, also called the philtral dimple.

4. Points 8, 9 and 10 correspond to points 4, 3 and 2, respectively.

5. Points 1, 5, 7 and 11 are placed such that the lip area is covered as closely as possible, and should lie on the skin-vermilion border. However, their positions are preferred to be at equal distances from their neighbouring points.

It should be noted that the outer mouth contour contains much person-dependent information.

Inner Mouth Contour In the case of the inner mouth contour we decided that the emphasis should be placed on accurately describing the aperture of the mouth. A closed mouth is a special case. The points on the inner contour are closely related to the ones on the outer contour and have similar definitions.

1. Point 12 is the leftmost point of the cavity of the mouth. However, in the case of a closed mouth this point cannot be observed; it should then be placed such that it best describes the mouth line, but always to the left of points 13 and 23.

2. Point 18 is the rightmost point of the cavity of the mouth. In a similar way as for point 12, in the case of a closed mouth this point should be placed such that it best describes the mouth line, but always to the right of points 17 and 19.

3. Points 15 and 21 correspond to points 3 and 9 and follow the philtrum. An additional constraint was that points 3, 9, 15 and 21 should always belong to the same line.

4. The last 8 points form pairs as follows: 13 and 23, 14 and 22, 16 and 20, and 17 and 19. They have similar definitions as points 1, 5, 7 and 11, namely they should divide the distance between the adjacent points into three equal segments (for instance, points 16 and 17 divide the lip contour between points 15 and 18 into three equal segments).

Nose The nose points are only a delimitation of the nose and are used as a reference to compute distances to and between other points (e.g. the distance from point 27 on the chin to the line formed by points 24 and 25 is used as a feature in our settings). The points are placed at the base of the nose. The distance between the nose points was also the base distance used to normalize all other distances on the speaker's face. We used this distance because we found that, with careful annotation, its standard deviation during an entire recording was always less than 2 pixels.

Chin Here we are interested in tracking the tip of the chin, marked by point 27. Points 26 and 28 only support the detection of point 27; they should be placed symmetrically with respect to point 27 and describe the chin as closely as possible. Point 27 should be aligned with points 3, 9, 15 and 21.

6.1.2 AAM Results on the Training Data

The AAM process is very fast and very accurate, provided that a good training set is selected. We combined the AAM searching scheme with the Viola/Jones mouth detection algorithm, which made it possible to select a very good location for the initial guess. This sped up the search process to real time performance. The mouth detection was used only in the first few frames of the recording. In the subsequent frames the initial guess used was the result of the processing of the previous frame. This approach was very successful both in speeding up the search scheme and in improving the accuracy of the detection.

Figure 6.3 shows the first six most important components in PCA terminology. The mean shape and texture model is shown in the centre row. The top row shows for each mode the resulting object after an adjustment by two standard deviations is applied to the corresponding mode. The bottom row shows the result when the adjustment is negative. The first two modes seem to have more control over the vertical and horizontal movement of the mouth, while mode four seems to control the presence of the tongue. However, there is no strict separation between the information controlled by each mode, at least not one easily discernible by visual inspection. This model was trained on a set of 440 images, selected in an iterative process. All three models (i.e. the appearance, shape and combined models) were truncated at the 95% level. Based on the 95% level truncation, as can be seen in Tables 6.1, 6.2 and 6.3, the final combined model had 38 parameters, while the shape model had 11 parameters and the texture model had 120 parameters. These tables also show the way each new parameter covers the variance in the training data. The first six modes in the combined model cover 78.65%.

In the case of the shape model, however, the first two modes already cover 82.53% of the total variation, while the first six cover 91.83%.
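The 95% truncation simply keeps the smallest number of PCA modes whose eigenvalues account for 95% of the total variance. A minimal sketch (not the AAM package actually used) is:

import numpy as np

def modes_for_variance(eigenvalues, level=0.95):
    # Number of leading PCA modes needed to cover `level` of the total variance.
    ev = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    cumulative = np.cumsum(ev) / ev.sum()
    return int(np.searchsorted(cumulative, level) + 1)

Applied to the full eigenvalue spectrum of the shape model this yields the 11 modes listed in Table 6.2, and likewise 38 and 120 modes for the combined and texture models.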

Table 6.1: Combined AAM model, mode variation.

mode  per mode  cumulative     mode  per mode  cumulative
 1    35.07%    35.07%          20    0.42%    90.75%
 2    25.69%    60.76%          21    0.39%    91.14%
 3     7.39%    68.15%          22    0.36%    91.50%
 4     4.85%    73.00%          23    0.33%    91.83%
 5     3.28%    76.28%          24    0.30%    92.13%
 6     2.36%    78.65%          25    0.29%    92.42%
 7     1.80%    80.45%          26    0.27%    92.69%
 8     1.34%    81.79%          27    0.25%    92.94%
 9     1.29%    83.08%          28    0.24%    93.18%
10     1.28%    84.35%          29    0.23%    93.40%
11     1.04%    85.39%          30    0.22%    93.62%
12     0.82%    86.22%          31    0.21%    93.83%
13     0.73%    86.95%          32    0.20%    94.02%
14     0.72%    87.67%          33    0.19%    94.21%
15     0.67%    88.34%          34    0.18%    94.39%
16     0.57%    88.91%          35    0.17%    94.56%
17     0.50%    89.41%          36    0.16%    94.72%
18     0.49%    89.90%          37    0.16%    94.88%
19     0.43%    90.33%          38    0.15%    95.03%

6.1.3 Defining the Feature Vectors

The first approach used in lip reading and similar problems was to use the AAM parameters, as defined by Equation 3.75, directly as visual features. The other approach is to use the final result of the method, namely the co-ordinates of the landmarks as assigned by the algorithm for the current image. In our research we adopted the latter approach. Based on the positions of the landmarks we defined seven high level geometric features. The features are computed as Euclidean distances and areas between certain key points that describe the shape of the mouth, namely mouth height and width, aperture width and height, mouth area, aperture area and the nose to chin distance. In terms of the landmark locations the features are defined based on the points shown in Figure 6.1 as follows (a short sketch of these computations is given after the list):

1. Mouth height is defined as the Euclidean distance between points 3 and 9.

2. Mouth width is defined as the Euclidean distance between points 0 and 6.

3. Mouth area is defined as the area inside the outer lip contour.

Table 6.2: Shape based model, mode variation.

mode  per mode  cumulative
 1    51.96%    51.96%
 2    30.58%    82.53%
 3     3.32%    85.85%
 4     2.89%    88.75%
 5     1.82%    90.57%
 6     1.26%    91.83%
 7     1.01%    92.84%
 8     0.83%    93.67%
 9     0.80%    94.46%
10     0.52%    94.98%
11     0.39%    95.37%

4. Aperture height is defined as the largest Euclidean distance between the pairs of points (13, 23), (14, 22), (16, 20), (17, 19) and (15, 21).

5. Aperture width is defined as the Euclidean distance between the leftmost point (or coinciding pair of points) and the rightmost point (or coinciding pair of points) of the inner lip contour.

6. Aperture area is the area covered by the mouth aperture, namely the inner lip contour.

7. Nose to chin distance is the minimum distance between chin point 27 and the line defined by the nose points. This denotes the openness of the jaw.

The features are graphically described in Figure 6.4.
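The sketch below spells out these computations; it assumes the fitted landmarks are available as a (29, 2) array of (x, y) co-ordinates indexed exactly as in Figure 6.1 and uses the shoelace formula for the two areas. It is an illustration of the feature definitions above, not the code used in the experiments.

import numpy as np

def polygon_area(points):
    # Shoelace formula for the area of a closed polygon.
    x, y = points[:, 0], points[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

def geometric_features(lm):
    # lm: (29, 2) array of landmark co-ordinates, indexed as in Figure 6.1.
    dist = lambda a, b: float(np.linalg.norm(lm[a] - lm[b]))
    mouth_height = dist(3, 9)
    mouth_width = dist(0, 6)
    mouth_area = polygon_area(lm[0:12])        # outer lip contour: points 0-11
    aperture_height = max(dist(a, b) for a, b in
                          [(13, 23), (14, 22), (16, 20), (17, 19), (15, 21)])
    aperture_width = dist(12, 18)              # leftmost and rightmost inner points
    aperture_area = polygon_area(lm[12:24])    # inner lip contour: points 12-23
    # Distance from chin point 27 to the line through nose points 24 and 25.
    p, a, b = lm[27], lm[24], lm[25]
    nose_chin = abs((b[0] - a[0]) * (p[1] - a[1]) -
                    (b[1] - a[1]) * (p[0] - a[0])) / np.linalg.norm(b - a)
    return np.array([mouth_width, mouth_height, aperture_width, aperture_height,
                     mouth_area, aperture_area, nose_chin])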

6.1.4 Visual Validation of the Feature Vectors

Figure 6.5 shows the plots of the feature vectors computed for a random recording of the letter F, which has the viseme transcription [eeh fvw]. In this case the onset and offset moments of the utterance are clearly visible around frame 75 and frame 200 of the video recording, respectively. The onset of the viseme [eeh] is around frame 80, while the onset of the viseme [fvw] is seen around frame 160. The actual shape of the mouth can be seen in the images shown below the graphs, which are extracted from the video sequence.

Table 6.3: Appearance based model, mode variation.

mode  per mode  cumulative     mode  per mode  cumulative
 1    29.16%    29.16%          20    0.55%    80.72%
 2    13.30%    42.47%          21    0.53%    81.26%
 3     8.39%    50.86%          22    0.50%    81.76%
 4     6.97%    57.82%          23    0.46%    82.22%
 5     4.25%    62.08%          24    0.44%    82.66%
 6     3.12%    65.20%          25    0.41%    83.07%
 7     2.43%    67.62%          26    0.39%    83.45%
 8     1.90%    69.53%          27    0.39%    83.84%
 9     1.62%    71.14%          28    0.36%    84.20%
10     1.49%    72.64%          29    0.33%    84.53%
11     1.30%    73.94%          30    0.32%    84.86%
12     1.04%    74.98%          31    0.31%    85.17%
13     0.94%    75.92%          32    0.29%    85.46%
14     0.89%    76.81%          33    0.28%    85.74%
15     0.81%    77.62%          ...   ...      ...
16     0.72%    78.34%         117    0.05%    94.88%
17     0.65%    78.99%         118    0.05%    94.93%
18     0.60%    79.59%         119    0.05%    94.97%
19     0.58%    80.17%         120    0.05%    95.02%

Figure 6.5: The seven features plotted for one recording of the letter F, transcribed using the visemes [eeh] and [fvw].

Figure 6.6 shows the plots of the feature vectors for seven letters of the alphabet and the digit < 8 > ([a gkx td]).

Figure 6.3: Combined shape and appearance statistical model. The images show from left to right the first six most important components in PCA terminology. These modes account for 78.65% of the total variation. Centre row: Mean shape and appearance. Top row: Mean shape and appearance +2σ. Bottom row: Mean shape and appearance −2σ.

We see that the variability of the features is very high, which makes them suitable for the recognition task at hand. We can also remark that, even though the viseme [aa] is present in the transcriptions of the letters A ([aa]), H ([h aa]) and K ([gkx aa]), there is a clear, if slight, difference in its duration in each instance. This is best visible in the curve showing the height of the mouth, which shows that the duration of the viseme is shorter in the utterances of the letters K and H than in the case of the letter A. An interesting result was obtained when visually inspecting the curves described by the feature vectors for all the visemes. By simple visual inspection we found that we could easily distinguish between some of the visemes, which proves that the feature set captures much of the speech related information. Table 6.4 summarises our findings in this respect. For a simple recognition task, such as the recognition of isolated visemes or even isolated digits, this table suggests that a static classifier such as Support Vector Machines (SVM) [Gan02] could be used. However, for these types of classifiers the features need to be global features, because such classifiers cannot handle time series; therefore, generalisation to longer utterances of variable length is not possible.

Figure 6.4: The high level geometric features: 1) Outer lip width, 2) Outer lip height, 3) Inner lip width, 4) Inner lip height; 5) Chin to nose distance, 6) Outer lip area, 7) Inner lip area.

Table 6.4: Feature patterns per viseme: +) peak, -) valley, -+) increase, +-) decrease.

Visemes: aa h gkx a oyu ie ei iee td sz eeh l pbm fvw at
Outer width:     - + - - + -+ + + + +- +- +-
Inner width:     + + + + + + + -
Nose/chin dist:  - + + - - - - +- + +- +-
Height/area:     + + + + - + + + +

6.2 AAM as ROI Detection Algorithm

It is worth mentioning that AAM can also be used as a preprocessing step for defining a more accurate ROI. In this case, the ROI defined using a mouth detection algorithm is further refined using the AAM. A more accurate ROI makes the data parametrization process more robust, because the background is better removed and, therefore, there is less noise in the input data. This can be used in both of the other data parametrization methods introduced in this thesis, and we actually used it for the optical flow approach introduced in Chapter 7. We can also say that this method would have a similar result to the shape estimation introduced in Chapter 5: the colour filtering would be trivial if the ROI were defined using the AAM approach, and the pixel distributions would exactly reflect the real shape of the mouth.
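As an illustration of this use of the AAM result, the refined ROI can be taken as the bounding box of the fitted landmarks, enlarged by a small margin; the margin value and the array layout below are assumptions, not the settings used in Chapter 7.

import numpy as np

def roi_from_landmarks(lm, image_shape, margin=0.15):
    # Bounding box around the fitted AAM landmarks, padded by `margin`
    # (as a fraction of the box size) and clipped to the image borders.
    h, w = image_shape[:2]
    x0, y0 = lm.min(axis=0)
    x1, y1 = lm.max(axis=0)
    mx, my = margin * (x1 - x0), margin * (y1 - y0)
    x0, x1 = max(0, int(x0 - mx)), min(w, int(x1 + mx))
    y0, y1 = max(0, int(y0 - my)), min(h, int(y1 + my))
    return x0, y0, x1, y1              # crop the mouth area as image[y0:y1, x0:x1]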

6.3 Lip Reading Results

The method presented in this chapter produces for each frame in the corpus a vector with seven entries: mouth width, mouth height, aperture width, aperture height, mouth area, aperture area and the distance between the nose and the chin. We trained and tuned a lip reader based on the HMM approach for each recognition task. In a similar way to the approach described in Chapter 5, we considered both the case with simple static features (i.e. the seven geometric features) and the case in which the feature space was enriched with dynamic information consisting of deltas and accelerations (i.e. making 21-dimensional vectors). We trained systems based on mono-visemes as well as context aware tri-viseme systems. We used a Gaussian mixture arrangement to better describe the feature space and we performed a 10-fold validation in order to increase the confidence in the observed results.

The best results obtained were WRR 90.32% with word accuracy 84.27% for the CD recognition task. In this case, 75% of the sequences were recognized correctly. Figure 6.7 shows the plot of the performance of the best recognizer as a function of the number of Gaussian mixtures used. For the GU recognition task we observed a 56% WRR. Using an N-Best approach with the five most probable outcomes did not improve the result, which suggests the system is fairly robust. The 10-fold validation showed an 80.27% mean WRR with a 6% standard deviation, the minimum performance being 74.80% WRR. This shows some instability; however, the minimum is still a very good result. We also tested the results of the recognition at viseme level (i.e. before using the language model to build the corresponding words). This is useful for analysing the degree of confusion between different visemes. Figure 6.8 shows the confusion matrix for the best case. The mean confusion matrices computed over the mixture number are also displayed. We can remark in these figures that the degree of confusion is relatively small. However, the confusion is greater for visemes defined by larger phoneme sets. This is the case especially for the visemes [oyu] and [gkx], which are very often a source of confusion.

Figure 6.6: Feature values plotted for the letters A ([aa]), H ([h aa]), K ([gkx aa]) and Q ([gkx oyu]), I ([ie]), O ([oyu]), IJ ([ei]) and 8 ([a gkx td]). The vectors are scaled using the time variance and centred around their mean.


Figure 6.7: The WRR and Acc results for the CD recognition task as a function of the number of mixtures. The x axis gives the number of mixtures and the y axis shows the results obtained. The feature vectors consisted of geometric features computed from the AAM shape, together with their corresponding deltas and accelerations. The HMM models consisted of intra-word tri-visemes.

6.4 Conclusions

We introduced in this chapter an AAM based approach for lip reading. The AAM method is in our opinion a valuable tool for lip reading, both as a data parametrization method and as a ROI detection technique. The method can be very robust and generalizes well to unseen faces; however, the training process needed to obtain satisfactory results can be very long. Nevertheless, the shape obtained from the search scheme can be used as a starting point for testing other feature types, since it can always function as a background elimination stencil.

Based on the shape computed using the AAM searching scheme, we defined a set of high level geometric features. Based on these features we built different lip readers with very good results. These results validate the findings reported in the literature, which showed that the width and the height of the mouth largely capture the content of the spoken utterance [Woj03]. This also explains why a simple mouth model for lip synchronization, based only on varying the mouth opening synchronously with the sound output, is so convincing. We did not include in the feature vectors used in this chapter any information that describes the presence of the teeth, tongue or other elements of the mouth. This information was shown, in the literature but also in our other experiments, to be very important for lip reading. We expect that this is the case in the current settings as well. However, we did not include this information here because we wanted to have a clear understanding of the factors that influence the observed results.



Figure 6.8: The confusion matrices obtained by the best systems in the CD and CL tasks at the viseme level, respectively. a) the confusion matrix for CD task in the best case. b) the confusion matrix for CL task in the best case. c) the mean, over the mixture number, confusion matrix for the CD task. d) the mean, over the mixture number, confusion matrix for the CL task.

Chapter 7 Optical Flow Analysis for Lip Reading

Optical flow represents the apparent movement in a sequence of images and, therefore, links together consecutive video frames. Computing the optical flow around the speaker's mouth represents an alternative to the static analysis of the input video, because it directly describes the actual movement of the mouth. The use of optical flow analysis for lip reading goes back to the beginning of this domain. However, the computational complexity of computing the optical flow made its use very restrictive. Even with the development of better algorithms and more powerful CPUs, it is still one of the slowest approaches to data parametrization for lip reading. Recently, successful experiments were made to implement optical flow algorithms directly in hardware, which provides real time capabilities [Cor02; Mar05; Día06]. Optical flow is an extremely valuable technique for motion detection and obstacle avoidance, with applicability in automatic navigation. This is a defining necessity for Unmanned Vehicles (UV) but also a source of increased safety for regular vehicles, and therefore a very attractive feature for the automotive industry. In previous approaches the optical flow was used either as raw data for the lip reader or as a measure of the overall motion on the speaker's face and, therefore, as a method for detecting the onset/offset moments. In this chapter we present a set of features defined starting from the optical flow field computed in a carefully chosen area around the speaker's mouth and analyse the performance of the lip readers built based on these features.

7.1 Description of the Algorithm

The first step is to carefully define the region of interest where the optical flow will be computed. This is necessary both for increasing the accuracy of the algorithm and for speeding up the detection of the optical flow. The optical flow is then computed inside this area of interest. We used both the algorithm developed by Lucas and Kanade and the algorithm developed by Horn and Schunck, but we obtained more reliable results with the latter. However, any fast and accurate optical flow algorithm can be used.

From the computed optical flow field we define a set of features that describe the amount of movement around the mouth. These features are then used to train the recognition models. Figures 7.1 and 7.2 show two examples of the optical flow computed for sequences of frames during closing and opening of the mouth, respectively.
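Purely as an illustration of this step (and not of the Lucas-Kanade or Horn-Schunck implementations described in Chapter 3 that were actually used), a dense flow field between two grey-scale frames can be obtained with, for example, OpenCV's Farneback routine:

import cv2

def dense_flow(prev_gray, next_gray):
    # Returns an (H, W, 2) array with the per-pixel (dx, dy) displacement.
    # Positional arguments: flow, pyr_scale, levels, winsize, iterations,
    # poly_n, poly_sigma, flags.
    return cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)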

Figure 7.1: Example of the optical flow computed during closing of the mouth.

7.1.1 Defining the Region Of Interest

In this case it is very important to define a strict boundary of the area where the optical flow will be computed. As in the other cases, we used an object detection algorithm to compute a bounding box around the mouth. Subsequently, this bounding box was further refined using the AAM searching scheme. The result of the AAM process was also used for normalizing the input frames, such that the centre of the mouth in all frames was translated to the same virtual location and the images were scaled so as to have the same dynamic range. This process removed the influence of the camera location on the resulting images, making uniform processing of the input images from different recording sessions possible. In the case of the optical flow analysis this is necessary because we need to be sure that the observed flow corresponds to the same real locations in all the images.

Figure 7.2: Example of the optical flow computed during opening of the mouth.

7.1.2 Defining the Feature Vectors

Instead of using the optical flow directly as raw data, we want a smaller set of features that accurately describes the movement around the contour of the mouth. Therefore, the method defines a set of regions around the mouth from which some global measures of the optical flow are used as visual features. The first step of the method is to define the set of regions. We used a radial arrangement of regions centred at the centre of the mouth, even though other arrangements are also possible. For instance, for facial expression recognition the authors of [Sun09] define 15 rectangular areas located at specific places on the face, in order to measure the activation of the different facial muscles. As in the case of the algorithm presented in Chapter 5 and shown in Figure 5.3, the face was divided into 18 equally wide sectors centred at the centre of the mouth. The centre of the mouth can be computed either from the AAM solution or from the estimation technique introduced in Chapter 5; however, due to its accuracy, we believe that the AAM solution gives more reliable results.

The visual features are extracted from each of the identified sectors and should describe the amount and the direction of the motion. Therefore, each feature is obtained by computing statistics of the optical flow in each of the 18 sectors. Even though the global variance of the optical flow can be valuable information for onset/offset discrimination, the information about movement in certain parts of the mouth is cancelled out by averaging, hence it is not suitable for lip reading. This is because, for a small enough angle, the contour of the lip will deform in the same way over the entire distance considered. Because of this, the variance of the flow computed in one sector is expected to be always zero and to hold no information about the spoken utterance. The mean displacement, on the other hand, shows both the magnitude and the direction of the movement in any given radial sector. We therefore computed as visual features only the mean horizontal and vertical displacements, respectively. Figure 7.3 shows the features extracted based on the optical flow seen in Figure 7.2.
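A sketch of the sector based feature computation follows. It assumes a dense flow field of shape (H, W, 2) and a known mouth centre, and simply averages the horizontal and the vertical displacement inside each of the 18 radial sectors; the names and the exact sector assignment are illustrative.

import numpy as np

def sector_mean_flow(flow, centre, n_sectors=18):
    # flow: (H, W, 2) per-pixel (dx, dy); centre: (cx, cy) of the mouth.
    # Returns 2 * n_sectors values: the mean horizontal and vertical
    # displacement per radial sector, i.e. the 36 features per frame pair.
    h, w, _ = flow.shape
    cx, cy = centre
    ys, xs = np.mgrid[0:h, 0:w]
    angles = np.arctan2(ys - cy, xs - cx)                       # in (-pi, pi]
    sector = ((angles + np.pi) / (2 * np.pi) * n_sectors).astype(int) % n_sectors
    vx = [flow[..., 0][sector == s].mean() for s in range(n_sectors)]
    vy = [flow[..., 1][sector == s].mean() for s in range(n_sectors)]
    return np.array(vx + vy)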

Figure 7.3: Optical flow based extracted features.

7.1.3 Visual Validation of the Feature Vectors

Figures 7.4 and 7.5 show the plots in time of the features computed for a random utterance in the corpus. In these figures the second image shows the same features, but accumulated over time. The values for all 18 sectors are depicted. The utterance spreads over 119 video frames. While it is clear from the images of the actual features that the mean displacements in the optical flow carry a significant amount of information, there is no high level information that we could immediately spot in these images. However, when looking at the graphs showing the cumulative quantities, we immediately see a pattern which carries meaningful information about the processed utterance. This is the case especially in graph b) of Figure 7.5, where the mouth movement can be easily followed.

(a) Horizontal Mean

(b) Cumulative Horizontal Mean

Figure 7.4: The distribution over time of the horizontal mean of the displacement in the optical flow for a random utterance. Each line corresponds to one of the 18 distinct directions: a) the actual values; b) the cumulative sum.

7.2 Lip Reading Results

In this case, for each video recording a 36-dimensional feature vector, (Vx, Vy), was computed for every pair of consecutive frames. As in the other cases, we trained many different recognisers depending on the particular settings. However, when the time came to test the trained systems, the results we observed were less impressive than we had hoped for.

(a) Vertical Mean

(b) Cumulative Vertical Mean

Figure 7.5: The distribution over time of the vertical displacement mean in the optical flow for a random utterance. Each line corresponds to one of the 18 distinct directions: a) the actual values; b) the cumulative sum.

The worst part was that we could not understand why the performance was suboptimal, since we were using high speed, good quality recordings and we carefully processed the data in the same way as in the other cases. The best result we obtained was WRR +60.29%, but the accuracy was as low as +0.00%. To understand the reasons for these figures we split the results into their constituent parts. We found that 29.66% of the words were substituted, 10.05% were deleted and an impressive 60.29% were inserted. The latter quantity is responsible for the very low accuracy. To be sure that these results are reliable, we ran a 10-fold validation experiment. We found that the results are consistent over the different folds. The insertion rate was even found to reach a maximum of 77.02%. Table 7.1 gives the results for the same recognition task over the 10 folds.

Table 7.1: 10-fold validation experiment with optical flow based features.

WRR       Acc       Insertion
+50.19%   +18.92%   +31.27%
+53.63%   -23.39%   +77.02%
+55.19%   +13.11%   +42.08%
+52.25%    +8.30%   +43.94%
+58.28%   +28.22%   +30.06%
+52.83%   +22.17%   +30.66%
+58.06%   +31.45%   +26.61%
+60.29%    +0.00%   +60.29%
+51.54%   -28.19%   +79.74%
+51.60%    +4.40%   +47.20%
Mean values   +54.39%    +7.50%   +46.89%
Std values     +3.42%   +20.21%   +19.44%
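The large gap between the WRR and the accuracy in Table 7.1 follows directly from the way the two measures treat insertions. A minimal sketch, assuming the usual HTK-style definitions of the two measures (consistent with the negative accuracy values that large insertion rates produce here), is:

def wrr_and_acc(n_ref, substitutions, deletions, insertions):
    # WRR = (N - D - S) / N ignores insertions; Acc = (N - D - S - I) / N
    # also penalises them and can therefore become negative.
    correct = n_ref - deletions - substitutions
    return 100.0 * correct / n_ref, 100.0 * (correct - insertions) / n_ref

# Example: wrr_and_acc(1000, 250, 100, 700) -> (65.0, -5.0)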

In order to better understand the causes behind this situation, we investigated the results at the level of visemes. Table 7.2 gives the results at viseme level. As seen in this table, the recognition in terms of WRR is at least 20% higher than the previous results (this is usually expected, since these are the recognition results before using the language model to refine them; the language model is supposed to bring the extra information needed). However, due to an enormous insertion rate, the accuracy reaches extremely low values. The obvious conclusion was that something was definitely sub-optimal.

Before presenting the results of our investigation, it is worth commenting on another interesting finding. It turned out during the analysis of the results that in this case the addition of the deltas and accelerations to the feature vectors had only a marginal effect (a maximum 5% increase) on the performance of the system. This was actually expected, because the optical flow features already contain dynamic information and, therefore, the first derivatives of the features do not bring much new information.

The only explanation for this state of affairs was that the computed optical flow was not accurate enough. We found out that this was actually the case. Because of the high speed recording, the difference between two consecutive frames was so small that it would usually fall below the accuracy of the optical flow detection procedure. During the periods with little facial muscle activity, the optical flow algorithm gave rather random results, which registered as noise for the recognition engine.

Table 7.2: 10-fold validation experiment with optical flow based features. The results are given at viseme level.

WRR       Acc        Insertion
+74.66%   -158.93%   +233.58%
+74.28%   -153.79%   +228.07%
+75.66%   -164.02%   +239.68%
+73.92%   -195.46%   +269.39%
+74.29%   -141.09%   +215.38%
+73.44%   -139.39%   +212.82%
+74.20%   -140.56%   +214.76%
+73.89%   -180.65%   +254.53%
+76.24%   -166.29%   +242.53%
+74.68%   -157.62%   +232.30%
Mean values   +74.53%   -159.78%   +234.30%
Std values     +0.85%    +18.09%    +18.20%

However, the highly dynamic muscle activity was always captured, which made the correctly recognized viseme rate very high. In order to test this theory, we down-sampled the video stream to a 25Hz rate by removing frames from the recording. We re-computed the optical flow for the new visual data stream, computed the new visual features based on the newly obtained optical flow field and trained the lip reader system again. The results were entirely as expected. The insertion rate greatly decreased and both the word correct rate and the accuracy increased considerably, especially the accuracy. The WRR found now was 70.66% and the accuracy was 65.64%. However, the performance at the level of visemes decreased slightly to +66.42% WRR; the accuracy found in this case was +40.07%. If there is still a lag in performance, this is most probably due to the imperfections of the optical flow detection algorithm. In order to test this possibility we would need the ground truth of the optical flow; however, in our settings the ground truth is very difficult to obtain.
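The down-sampling itself is a plain frame decimation; a sketch, assuming the frames of one recording are available as a list captured at 100Hz, is:

def downsample(frames, src_rate=100, dst_rate=25):
    # Keep every (src_rate // dst_rate)-th frame, e.g. every 4th frame when
    # going from 100Hz to 25Hz; the optical flow is then recomputed on the
    # reduced sequence, exactly as in the experiment described above.
    step = src_rate // dst_rate
    return frames[::step]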

7.3 Conclusions

In this chapter we introduced a set of features based on the optical flow computed between consecutive frames of the input video. We presented and analysed the resulting performance of the lip reading systems trained on these new features. We found that, due to the performance of the optical flow algorithms, the use of high speed recordings was actually detrimental rather than beneficial to the recognition results. This is because, when there is little facial motion, the optical flow algorithms behave rather randomly, increasing the noise in the system. However, in the case of standard speed recordings the results show a healthy recovery to normal levels. There is, therefore, a conflict between this approach and the need for high speed recordings required by the speech rate variability.

Chapter 8 Further Analysis of the Results and Other Experiments

We analysed in the previous chapters the results of the lip readers trained based on different feature extraction approaches. We concluded that all methods made obtaining high recognition rates possible. Each method was presented separately and we emphasised the strengths and weaknesses for each approach while presenting the actions that were taken in order to improve the recognition. In this chapter we compare the results obtained with respect to different aspects. Firstly, we compare the results obtained with respect to the data parametrization approach. Secondly, we analyse the results taking into account the video recording rate. Thirdly, we analyse the results from the point of view of the speech rate.

8.1 Comparison Based on the Feature Extraction Method

This section brings together the results presented in the previous chapters, making it easier to compare the performance of the systems trained on different data parametrization approaches. We should make clear here, once more, that the systems were trained using 100Hz video recordings. However, as shown in Chapter 7, for the optical flow approach the 100Hz video data actually performed worse than the 25Hz data. Even in this case the optical flow approach produced slightly higher error rates. On the other hand, the AAM and SLGE based lip readers achieved good results. The best performances obtained were in both cases around the 90% recognition level. The 10-fold validation shows a slightly better performance for the SLGE system, with a mean of 88.82% compared to 80.27% in the AAM case. However, in the AAM case the extra information describing the appearance of the cavity and teeth was not used, which can explain this slight difference. One interesting finding was that the two approaches did not perform entirely synchronously with respect to the folds. For instance, the SLGE system obtained its highest result in fold 8, namely WRR 91.39% and Acc 79.43%, while the AAM system obtained

the highest score in fold 7, namely WRR 90.32% and Acc 84.27%. SLGE obtained WRR 88.71% and Acc 70.56% in fold 7, while AAM obtained only WRR 78.47% and Acc 67.94% in fold 8. More parallels between the different methods can be found in Section 8.3.

8.2 Comparison Based on the Type of Speech

The variance of the speech rate was the reason we decided to record at a high frame rate. This section investigates the influence of the speech style on the performance of the systems. All results shown here were obtained using systems without any fine tuning. We took this approach in order to eliminate influences on the results other than the speech style.

The results were, as expected, highly dependent on the speech style: the results for normal speech were in some cases a few times better than the results for high speech rate. For instance, in one experiment on GU utterances we observed a decrease from 35.48% in the case of normal speech rate to an extremely low 5.88% in the case of high speech rate. When considering the 5-letter word speech rate definition, the results were slightly better, but this is understandable since it generates a larger utterance set. An interesting finding was that in many instances the utterances consisting of whispering and spelling at the same time were classified as high speech rate utterances and were therefore poorly recognised. For instance, we observed a decrease from 29% to 12.50%. When the whispering was done at normal speech rates, the recognition performance approached, although from below, the recognition performance for normal speech (for instance, we found a decrease from 28.02% to 20.69%, but also more or less similar values such as 25.41% and 25.00%). One remark is that in general the speech style distribution is not uniform over the different recognition tasks. For instance, the digit sets were usually spoken using normal speech, while the whispering was classified as having high speech rate. The letter strings were in general classified as having normal speech rate; however, the letter strings which were produced as a result of spelling were in general classified as having high speech rate. The conclusion is that the letter string recordings contain a higher speech rate than the digit string recordings.

Interestingly, the systems based on the AAM features performed better when the general settings were considered, namely when the systems were trained without any fine tuning, such as Gaussian mixtures or intra-word context. For instance, the AAM based system obtained 42.13% WRR on the normal speech subset, while the SLGE system obtained only 30.27% on the same utterance set. This result remains valid for the high speech rate as well. However, as seen in the first section of this chapter, when fine tuning was used both approaches achieved the same level.

Speech rate can also be analysed in terms of its correlation with lip articulation. When the speech rate increases, speakers tend to shorten the vowels, omit the endings of words and link them together and, most evidently, reduce the movement of their lips. Therefore, the variation in the lip movement decreases. Some speakers have little lip movement even in normal situations. In this situation the gain obtained by increasing the recording rate is also diminished.

8.3 Influence of High Speed Recordings

In this section we analyse the results as a function of the frame rate of the recorded video. For this we sub-sampled the data to the following levels: 5, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80 and 90Hz. In the case of the static features the sub-sampling was done at the level of the final feature vectors. However, in the case of the optical flow features this is not possible, and the sub-sampling was performed at the level of the input video stream; the new optical flow fields were computed based on the new video sequences.

In all the experiments analysed we found that below the 10Hz limit the recognition performance is highly degraded and sometimes the systems are not trainable. Above this threshold there is a steep increase in performance that continues until the 25-30Hz level. Above the 30Hz level, as we expected, we found that the high speed recordings have a larger influence in the case of high speech rate. As we mentioned in Section 8.2, the distribution of the speech rate is not uniform over all recognition tasks: for instance, the digit string recordings were classified entirely as normal speech, while the letter string recordings had a somewhat higher speech rate. It is important to note this because we found that for normal speech the peak performance is obtained when the recording frame rate is around 30Hz. Moreover, we found that the performance is likely to decrease if a higher recording frame rate is used.

Figure 8.1 shows the performance of the recognisers for four different experiments. Each of the four images in this figure shows a different situation. Image (b) shows a peak at around 25-30Hz, then a slight decrease when the frame rate increases. This peak is more visible for the SLGE based features. In image (c) we see that after 30Hz the performance remains almost constant, sometimes slightly increasing in the interval 50-60Hz. Image (d) shows a strange increase of around 10% at 100Hz. In image (a) we see a large difference between the performances of the two methods. This difference was found in other experiments as well. It should be noticed that in this image the results are given for the case when only the static features were used. It seems that the SLGE based features are less robust when only the static features are used and the recording frame rate increases. The fact that the performance of the recognisers decreases slightly when the recording frame rate increases is due to the increase of noise in the data. This situation was found to be more accentuated in the case of the OF based features, as shown in Chapter 7. Therefore, we can also conclude that the decrease depends on the robustness of the method used for the parametrization of the input data.

With respect to the accuracy of the recognition (Acc), increasing the recording frame rate was found to have a greater effect. We found that the graph of the Acc values has a more prominent reversed J-shape. Figure 8.2 shows some examples. While in the case of the digit string recognition task we found that a frame rate of 30Hz is optimal, for letter recognition, depending on the proportion of high speech rate, the optimal recording frame rate was found to be around the 50Hz level. However, this sometimes depends on the feature extraction method used, namely the results showed that the AAM based systems needed in general a higher frame rate in order to achieve their full potential.

[Figure 8.1 consists of four WRR-versus-frame-rate plots (5-100Hz, SLGE and AAM features): (a) CD NONE 1B visemes, (b) CD DA 1B visemes, (c) CD DA 1B words, (d) CD DA 3B words.]

Figure 8.1: Lip reading performance as a function of the recording frame rate. The graphs show the results for both AAM based and SLGE based features. The first row shows the results for digit strings, without and with dynamic features included, for the free grammar case. The second row shows the results for digit strings at digit level, without and with intra-word context, respectively. The x coordinates give the recording frame rate, while the y coordinates give the WRR value of the recognizer.

Figure 8.3 shows examples for the letter string recognition task. The exception here seems to be the bottom-right image, where both systems obtained their maximum at the 100Hz level. One further observation is the apparent instability of the SLGE method at higher frame rates; the AAM based approach produces smoother graphs. The situation is similar in the case of the GU recognition task, as we can see in Figure 8.4. In order to have the complete picture we also include the results for the optical flow based features as a function of the frame rate. We show first, in Figure 8.5, the results for a grammar free recogniser for digits. This example confirms the conclusions from Chapter 7: in the case of the optical flow based features, the insertion error grows rapidly at high frame rates. The WRR figures grow with the frame rate; however, the accuracy of the recogniser decreases to extremely low levels.

[Figure 8.2 consists of four Acc-versus-frame-rate plots (5-100Hz, SLGE and AAM features): (a) CD DA 1B visemes, (b) CD DA 1B words, (c) CD NONE 1B visemes, (d) CD NONE 1B words.]

Figure 8.2: Lip reading performance as a function of the recording frame rate. The graphs show the results for both AAM based and SLGE based features. The top-left image shows the results for digit strings, with dynamic features included, for the free grammar case. The top-right image shows the results for digit strings, with dynamic features included, at digit level without intra-word context. The bottom-left image shows the Acc values for digit strings, without dynamic features included, for the grammar free case, and the bottom-right image shows the results for the digit level case. The x coordinate gives the recording frame rate, while the y coordinate gives the Acc value of the recognizer.

When the system is constrained by using a word network the results improve, thereby hiding these problems. These results are shown in Figure 8.6, where we find that this system peaks at the 30Hz level.

8.4 Influence of the Size of the Training Corpus

A basic question regarding the performance of a system is: what is the necessary size of the corpus in order to obtain optimal performance? The answer to this question is not straightforward, because it largely depends on the variability of the analysed system. Based on the model used to approximate the real process we can get some insight into this number by counting the number of parameters in the model.

[Figure 8.3 consists of six plots (5-100Hz, SLGE and AAM features): (a) CL NONE 1B words: Acc, (b) CL DA 1B words: Acc, (c) CL NONE 3B words: WRR, (d) CL DA 1B visemes: WRR, (e) CL DA 1B words: WRR, (f) CL NONE 1B visemes: WRR.]

Figure 8.3: Lip reading performance as a function of the recording frame rate. The graphs show the results for both AAM based and SLGE based features. The images show results for letter strings in different settings. The x coordinate gives the recording frame rate, while the y coordinate gives the WRR or Acc value of the recognizer, as indicated for each image.

[Figure 8.4 consists of two WRR-versus-frame-rate plots (5-100Hz, SLGE and AAM features): (a) GU DA 1B words, (b) GU DA 1B visemes.]

Figure 8.4: Lip reading performance as a function of the recording frame rate. The graphs show the results for both AAM based and SLGE based features. The images show results for all grammar generated utterances. The x coordinate gives the recording frame rate, while the y coordinate gives the WRR value.

[Figure 8.5 consists of two plots for the OF based features (5-100Hz): (a) CD NONE 1B visemes: WRR, (b) CD NONE 1B visemes: Acc.]

Figure 8.5: Lip reading performance as a function of the recording frame rate in the case of OF based features. The images show results for a grammar free connected digit recognizer. The x coordinates give the recording frame rate, while the y coordinates give the WRR or ACC value.

It is always necessary to preserve a balance between the level of detail of the model and the number of parameters to be estimated. In our case, depending on the number of HMMs (e.g. 16 in our case), the structure of the HMMs (e.g. three producing states in our case), the number of Gaussian mixtures used, on whether we build context dependent models or not, and on whether parameter tying was performed and to what degree, the number of parameters to be estimated can range from a few hundred thousand to millions. However, the number of degrees of freedom in the model gives only an indication of the required size of the corpus.

The empirical approach is perhaps the most reliable one, but it generalises very little. This approach consists of building systems on subsets of the same corpus and comparing the results. It is also possible, but unreliable, to compare the results between experiments performed on different corpora.
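To make the parameter-counting argument concrete, the sketch below tallies the free parameters of a set of left-to-right HMMs with diagonal-covariance Gaussian mixture output densities. The 16 models and three producing states come from the text; the feature dimension, the mixture count and the tri-viseme count used in the example call are illustrative assumptions only.

    def hmm_parameter_count(n_models: int, n_states: int, n_mixtures: int, feat_dim: int) -> int:
        """Rough count of free parameters for n_models HMMs, each with n_states
        producing states and a GMM output density with diagonal covariances."""
        per_gaussian = 2 * feat_dim                             # mean vector + diagonal covariance
        per_state = n_mixtures * (per_gaussian + 1)             # + 1 mixture weight per component
        per_model_transitions = n_states * (n_states + 1) // 2  # left-to-right transition probabilities (approx.)
        return n_models * (n_states * per_state + per_model_transitions)

    # Illustrative numbers (feature dimension and mixture count are assumptions):
    mono = hmm_parameter_count(n_models=16, n_states=3, n_mixtures=8, feat_dim=40)
    tri = hmm_parameter_count(n_models=2000, n_states=3, n_mixtures=8, feat_dim=40)  # untied tri-visemes
    print(mono, tri)   # roughly 31k for mono-visemes, roughly 3.9M for untied context-dependent models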

[Figure 8.6 consists of two plots for the OF based features (5-100Hz): (a) CD DA 3B word: WRR, (b) CD DA 3B word: Acc.]

Figure 8.6: Lip reading performance as a function of the recording frame rate in the case of OF based features. The images show results for a connected digit recognizer constrained by a word network. The x coordinates give the recording frame rate, while the y coordinates give the WRR or Acc value.

We can compare, for instance, the best result reported in [Woj03], because the DUTAVSC corpus contains a sub-corpus that is comparable to our corpus with respect to the language pool. The author reports a maximum performance of 91.1% WRR and 60.0% Acc for a free grammar CD task. Compared with our results (i.e. 91.39% and 79.43%), these figures suggest a small difference at WRR level but a large one at Acc level. For the constrained grammar task the author reports 86.7% WRR and 81.1% Acc. The comparison is, however, misleading, because we do not know to what degree the settings used in all these experiments coincide, nor what the influence of those settings is. Therefore, we performed our own experiment using the AAM based approach for the CD and CL tasks. The results on the smaller corpus were published in [Chi09b]. The data corpus used for digits was increased approximately three times, while the corpus for letters was increased two times. The results are summarised in Table 8.1. In both situations we observed a considerable increase in performance. Due to time limitations we did not repeat the experiments with larger corpora in order to find an upper bound; this will be the subject of future work.

Table 8.1: Comparison of the performance with respect to the size of the corpus used (i.e. 90% for training and 10% for testing).

                        Number of words    WRR       Acc
CD Recognition Task           883         78.08%    68.49%
                             2484         91.39%    79.43%
CL Recognition Task          1564         49.46%   -12.90%
                             3304         56.34%    17.16%
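The word recognition rate (WRR) and accuracy (Acc) reported throughout this chapter behave differently because only the latter penalises insertions; assuming the usual HTK-style connected-word scoring (which the negative Acc values in Table 8.1 suggest), the two measures can be computed as in the sketch below. The error counts in the example call are invented purely for illustration.

    def wrr_and_acc(n_ref: int, substitutions: int, deletions: int, insertions: int) -> tuple[float, float]:
        """WRR ignores insertions; Acc subtracts them as well, so it can drop below zero."""
        correct = n_ref - substitutions - deletions
        wrr = 100.0 * correct / n_ref
        acc = 100.0 * (correct - insertions) / n_ref
        return wrr, acc

    # Illustrative counts only: many insertions drive Acc negative while WRR stays moderate.
    print(wrr_and_acc(n_ref=1000, substitutions=300, deletions=200, insertions=600))  # (50.0, -10.0)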

8.5 Conclusions

In this chapter we investigated the influence of the feature extraction approach used, the recognition results as a function of the speaking style, and the influence of high speed recordings on the recognition performance. As we expected, the speaking style has a large influence on the recognition systems. We found that for higher speech rates we need to use higher sampling rates. However, as was shown in this chapter, the 100Hz level is usually too high, and 60Hz covers all possible situations. What was surprising is that a higher frame rate not only involves larger processing times with no visible increase in performance, but, when used in inappropriate situations and with an inappropriate parametrization, it can greatly decrease the performance of the system, sometimes to the point where the systems can no longer be trained. This is due to the fact that at higher frame rates the probability of introducing noise into the system is larger. As we could see in this chapter, this is also correlated with the feature extraction method used.

Another result of the analysis in Section 8.3 is that the performance figures shown in the previous chapters were consistently underestimated, since the results shown there were based on 100Hz recordings and are therefore considerably lower than the peak results. This is especially true for the accuracy figures.

The most important conclusion is that high speed recordings are necessary, but only when we deal with a high speech rate. In normal situations, however, the amount of high speech rate cannot be known a priori; at best a distribution over the speech rate might be computed. The problem is that using high speed recordings for normal speech might be detrimental to the performance and, therefore, their use should be carefully evaluated. Nevertheless, in situations where the incidence of high speech rate is large, high speed recordings bring a performance gain. Such situations are not necessarily improbable; we can for instance think of live sports commentators and auctioneers, who most of the time use a very high speech rate.

Chapter 9

Conclusions: Summing Up, General Thoughts and Future Directions

The main goal of this thesis was to investigate the limitations of the current approaches and the state of the art, and moreover to investigate the possibilities to boost lip reading performance towards a robust system. We approached this goal by splitting it into more specific research questions that, taken together, answer the main research question.

Are we able to build a lip reader for the Dutch language? Is the Hidden Markov Model suitable for this task?

We presented in this thesis the performance of various lip readers for the Dutch language, built using three different data parametrization methods. We obtained good performance for recognition tasks such as connected digits (WRR 91.39%) and connected letters (WRR 56.34%), similar to the performances obtained by other researchers for English and other languages. What is more rewarding is that we obtained good results for more complex recognition tasks such as grammar based utterances (WRR 56.89%) and random sentences (WRR 39.23%). The systems built for these tasks obtained greater performance than previous work.

With respect to the inference method used, we knew from the beginning that the Hidden Markov Models approach is the most successful approach for speech recognition. As seen from the results presented in this thesis, we can conclude that the HMM approach gives similar performance in the case of lip reading. However, more time should be invested in choosing the structure of the silence models.

What are the problems with the existing data corpora for lip reading? Can we write a set of guidelines for building a data corpus for lip reading? Can we create a corpus for Dutch language large enough to support our endeavour?

These questions found their answer in Chapters 2 and 4. We started our research with a complete investigation of the existing data corpora. We compiled a set of rules and directions for building a data corpus and built a very large data corpus for lip reading in the Dutch language. A short summary of the guiding rules can be found below:

1. Carefully state the recognition tasks and make sure that all are equally repre- sented in the resulting corpus.

2. Choose the respondents carefully and record their data for later use.

3. Choose the language data carefully and state the tasks considered.

4. Consider different speaking styles.

5. The Region Of Interest (ROI) should be as close to the speaker’s face as pos- sible. Make sure the images show good detail around the speaker’s mouth.

6. Use a sufficiently high video recording speed. Our analyses showed that 60Hz is a good number.

7. Decide upon the recording environment and make sure it is consistent during recordings.

8. Automate as much as possible the data acquisition process.

9. Make sure the post processing of the corpus is performed in a consistent man- ner.

The value of this corpus was shown extensively during our experiments. The strengths of the new corpus are the high speed recordings (i.e. 100Hz), the dual synchronized view (i.e. we recorded side and frontal views, so that the resulting corpus can also be used in future research experiments), the large spectrum of recognition tasks targeted, the explicit coverage of different speaking styles and the great coverage of the language and speaker variations. We do not see this corpus as finished, though; we actually see it as a first step towards an ever growing source of valuable data for lip reading. It is worth mentioning one more time that a good data corpus is of paramount importance in a data hungry problem such as lip reading.

What is the influence of the data parametrization method on the performance of lip reader? What are the strengths of the various feature extraction methods?

Each parametrization method was presented in a separate chapter, and its strengths and weaknesses were discussed in the corresponding chapter and summarised in the local conclusion sections. The parametrization methods were chosen to cover all three types of feature extraction techniques, namely geometric, appearance based and combined. Other aspects were also taken into account: for instance, we used a model based approach (AAM) and a model free approach (SLGE). We also investigated optical flow analysis for lip reading, which is a method that tries to recover the actual movement on the speaker's face, as opposed to approximating the motion by computing the first and second derivatives. A summary of the strengths and weaknesses is given below:

1. The optical flow directly describes the motion on the speaker’s face.

2. The optical flow analysis does not produce the same performance as the other methods due to the poor performance of the optical flow detection algorithm.

3. The optical flow analysis was the most computationally demanding of the three approaches.

4. The other two approaches performed somewhat similarly. However, the AAM approach performed better at higher frame rates.

5. Both SLGE and AAM methods were able to perform in real time.

6. The SLGE approach included some appearance based features which increased its performance.

Where possible, we investigated a number of improvements that made the application of these methods more robust, such as outlier removal and ROI detection. The performance of the methods was investigated both from the point of view of their computer vision (CV) abilities and from the point of view of the final recognition performance of the system. We gradually showed the possible steps towards a more robust lip reader.

The main issue with the data parametrization is the poor generalization of the CV algorithms. In most cases the applicability of the detection and tracking methods is reduced to controlled, laboratory conditions. The algorithms are adapted to the conditions of the given data corpus and generalise poorly to different environments. Even with the introduction of newer and more powerful algorithms, the state of the art in CV is still a long way from the capacity of humans. For instance, the two great problems for CV, variance in illumination and occlusions, are still plaguing the process of detection and tracking. In this domain there is a great need for a revolutionary technique which can bring artificial systems closer to human capacities. One direction which we think is necessary is the understanding of the human visual perception process and the building of an automated replica of this process. A colour space which preserves perceptual uniformity, combined with an adequate model of human perception, is in our opinion a necessity towards robust computer vision. An approach based on, or similar to, the Gestalt theory [Ste03] of the brain and human visual perception seems to us a very strong starting point, even though the available computing power might still not be sufficient for this type of approach.

In Chapter 8 we compared the three methods from the point of view of the induced performance of the resulting lip readers. We can clearly see there that, even though all three methods have good results, comparable to and sometimes greater than previous results, there are still great differences between these systems. Their performances differ under different conditions, with none appearing as a clear winner. This observation justifies the current debate about the best data parametrization.
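As an illustration of the perceptual-uniformity point above, one commonly used, approximately perceptually uniform colour space is CIELAB; the sketch below converts an sRGB pixel to CIELAB using the standard D65 white point. This is only an example of the kind of colour representation meant here, not a component of the systems described in this thesis.

    import numpy as np

    def srgb_to_lab(rgb):
        """Convert an sRGB triple (values in [0, 1]) to CIELAB (D65 white point)."""
        rgb = np.asarray(rgb, dtype=float)
        # Undo the sRGB gamma to obtain linear RGB.
        linear = np.where(rgb <= 0.04045, rgb / 12.92, ((rgb + 0.055) / 1.055) ** 2.4)
        # Linear RGB -> CIE XYZ (sRGB primaries, D65).
        m = np.array([[0.4124, 0.3576, 0.1805],
                      [0.2126, 0.7152, 0.0722],
                      [0.0193, 0.1192, 0.9505]])
        xyz = m @ linear
        # Normalise by the D65 reference white and apply the CIELAB non-linearity.
        xyz_n = xyz / np.array([0.95047, 1.0, 1.08883])
        delta = 6.0 / 29.0
        f = np.where(xyz_n > delta ** 3, np.cbrt(xyz_n), xyz_n / (3 * delta ** 2) + 4.0 / 29.0)
        L = 116.0 * f[1] - 16.0
        a = 500.0 * (f[0] - f[1])
        b = 200.0 * (f[1] - f[2])
        return L, a, b

    print(srgb_to_lab([1.0, 0.0, 0.0]))   # a typical red: roughly (53, 80, 67)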

Should we concentrate on the contour of the mouth, or is it better to have a broader appearance based approach?

The answer to this question is not 100% clear from the results. The pure appearance based approach, optical flow, performed less impressively than the other two approaches. However, the accuracy of the optical flow features depends heavily on the performance of the optical flow detection algorithm. Moreover, the SLGE approach also incorporated some appearance based features, and the geometric features computed based on the AAM results contained a more complete description of the lips. What we can say is that the geometric AAM based features were more stable at higher frame rates, but also needed higher frame rates to achieve their maximum potential. Our conclusion is that we should use a combination of the two approaches. However, we should choose the appearance based features carefully, because they are prone to introduce a large amount of noise into the data.

How important is the motion? Do we need to compute the motion flow on the speaker's face, or is it sufficient to generate a static model of the speaker's face and only compute the derivatives based on this model?

With some disappointment we concluded that the optical flow approach did not perform according to our expectations. However, some of the loss in performance can be attributed to the inaccuracies of the optical flow detection algorithm and not to the limited information content of the features. Since it is very difficult to compute the ground truth of the optical flow on the speaker's face, this argument is not easily verified. On the other hand, we found that the addition of the first and second derivatives to the static features greatly improves the performance, whereas in the case of the optical flow features it does not bring much improvement.
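For reference, first and second order derivatives ("delta" and "acceleration" coefficients) appended to static features are usually computed with a regression formula over a few neighbouring frames; the sketch below shows one common form of that computation. The window size is an assumption, and this is meant as a generic illustration rather than the exact configuration used in our experiments.

    import numpy as np

    def delta(features: np.ndarray, window: int = 2) -> np.ndarray:
        """Regression-based derivative of a (T, D) feature sequence:
        d_t = sum_{k=1..window} k * (c_{t+k} - c_{t-k}) / (2 * sum k^2)."""
        T = features.shape[0]
        denom = 2 * sum(k * k for k in range(1, window + 1))
        padded = np.pad(features, ((window, window), (0, 0)), mode="edge")
        out = np.zeros_like(features, dtype=float)
        for k in range(1, window + 1):
            out += k * (padded[window + k : window + k + T] - padded[window - k : window - k + T])
        return out / denom

    static = np.random.randn(100, 20)                 # 100 frames of 20-dimensional static features
    d1 = delta(static)                                # first derivative ("delta")
    d2 = delta(d1)                                    # second derivative ("acceleration")
    augmented = np.hstack([static, d1, d2])           # 60-dimensional dynamic feature vectors
    print(augmented.shape)                            # (100, 60)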

What is the influence of the speaking style on the performance of the lip reader?

We investigated the influence of the speaking rate on the lip reading performance. In order to do this we included in the data corpus utterances for which the speakers were instructed to alter their speaking style. We found that not all speakers were able to increase their speaking rate, but also that the normal speech rate of some speakers was faster than the high speech rate of others. We used the number of words per minute as a measure of the speaking rate. We concluded that the speaking style has a great impact on the recognition performance, but that the problem should be approached carefully. As expected, for faster speech rates we needed to use a higher recording rate in order to obtain the maximum performance. To address this, the system can actively adapt to the speaking style of the user, for instance by increasing the video recording frame rate. The system can also adapt the language model to reflect the characteristics of high speech rate; for this, however, it is necessary to understand the transformations the language undergoes when speaking fast (e.g. shorter vowels, concatenated words, etc.).
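A system that adapts its capture rate to the observed speaking style, as suggested above, could be sketched as follows. The frame-rate values are taken loosely from the findings in Chapter 8 (around 30Hz for normal speech, roughly 40-60Hz for fast speech), but the function names and the words-per-minute threshold are hypothetical illustrations, not part of the implemented systems.

    def words_per_minute(word_count: int, duration_seconds: float) -> float:
        """Speaking rate measured, as in this thesis, in words per minute."""
        return 60.0 * word_count / duration_seconds

    def select_frame_rate(wpm: float, fast_threshold_wpm: float = 160.0) -> int:
        """Pick a camera frame rate: about 30 Hz is sufficient for normal speech,
        while high speech rates benefit from roughly 40-60 Hz (threshold is hypothetical)."""
        return 60 if wpm >= fast_threshold_wpm else 30

    # Example: 45 recognised words in a 12-second window -> 225 wpm -> switch to 60 Hz capture.
    rate = words_per_minute(45, 12.0)
    print(rate, select_frame_rate(rate))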

Is there any correlation between the speaking style and the choice of the data parametrization used, namely is one parametrization more suitable for some speaking environments than the others?

The conclusion we have drawn in this respect is that for high speech rates we need a higher recording rate in order to achieve the maximum performance. We also observed that the geometric features computed based on the AAM approach were somewhat more stable at higher frame rates. From this we can conclude that these features should be preferred over the SLGE features.

Do we need to use a higher recording rate on the visual side?

We concluded that it is not necessary to use a 100Hz frame rate. At this high rate the performance of the lip reader actually decreases, due to the increased noise; this was an unexpected finding of our analysis. The use of high speed recordings should be done selectively, because the different speaking styles have different peaks and an inappropriate usage can do more damage than good. For a normal speech rate we found that the optimum recording sampling rate is around 30Hz. For a high speech rate, however, the recognition performance peaks in the interval 40-60Hz, although this sometimes depends on the data parametrization technique.

What is the influence of the size of the corpus on the performance of the resulting recognition system?

It is difficult to answer this question by comparing results on different corpora. Therefore, in order to get some quantitative insight into the influence of the corpus size on the performance, we compared the results of several systems trained on the complete corpus with the results of the same systems trained on a subset of the corpus. Since we work in a Bayesian framework, we expected that having more data would bring more information about the distributions that govern the modelled process, and this was indeed the case: we found improved performance in all tests we performed when the larger corpus was used. However, it is still difficult to estimate the amount of data necessary to achieve the maximum performance, since the results do not show a clear unifying pattern. In our opinion an exact answer can only be found by trial and error. Nevertheless, the conclusion "more data, better results" still stands.

What is the best definition of the visemes?

To answer this question we investigated the scientific papers that cover this topic for the Dutch language. We think that lip reading would gain substantially if an independent definition of the visemes were found. On top of the fact that visual speech carries less information than aural speech comes the poor definition of the visemes, which further reduces the performance of automatic lip reading.

For future work we see two immediate improvements to the current work, namely improved language models and improved silence models.

As we have seen in the analyses of the results, in some experiments we recorded high insertion errors. This error can be explained by poorly trained silence models; therefore, we consider that the silence models should be further developed after a careful analysis of their exact influence on the performance. Another solution would be to use a separate model for detecting the onset and offset of speech, in addition to better trained silence models. With respect to the language model, we noticed that the bi-gram language models we used were not sufficiently strong for the case of continuous speech, because they were trained on a very small corpus. We have already started to gather more language data for training more advanced tri-gram and four-gram language models, but building new language models was not within the scope of this thesis.

Other future developments we envision are side view lip reading and 3D model lip reading, both supported by our corpus. Neither of these two approaches constitutes a new idea (e.g. see [Yos04; Chi08b] for silhouette lip reading and [Cıs04] for 3D lip reading); however, we consider that a careful analysis based on our corpus should be done.

To conclude, at this moment lip reading is making progress both in accuracy and robustness, but unlike speech recognition it still has some big hurdles to overcome. However, these problems are common to a large number of applications, some of which, like emotion recognition or computer vision in general, currently have a larger momentum. This coupling will ensure that in the near future lip reading can make use of newer, more advanced technologies. It is our belief that lip reading will eventually become a major addition to future user interfaces, bringing additional capabilities and, more importantly, a large relief to the human users. Eventually, a holistic approach that includes all other connected domains will shape the user interfaces of the future.

Appendix A

Data Corpora for Lip Reading

This appendix contains the principal characteristics of the existing data corpora for lip reading at the time when the current dissertation was written.


Figure A.1: Comparison of the major existing data corpora which can be used for lip reading.

Appendix B

Grammars for Recognition Tasks

This appendix contains the listings of the grammar rules used for defining the digit strings, the letter strings and the complete grammar set recognition tasks used during the experiments.

B.1 Digit String Recognition Task (CD)

$digit = <0> | <1> | <2> | <3> | <4> | <5> | <6> | <7> | <8> | <9>;
$eightdigits = $digit $digit $digit $digit $digit $digit $digit $digit;
$connectedDigits = ( $digit | $eightdigits );

( SENT-START $connectedDigits SENT-END )
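To illustrate what this task grammar accepts, the sketch below randomly expands a tiny hand-coded version of the digit rules into example utterances (either a single digit or an eight-digit string). It is only a toy re-implementation of the expansion logic for illustration; in the actual experiments such grammars were compiled into word networks by the recognition toolkit.

    import random

    DIGITS = [f"<{d}>" for d in range(10)]   # the word labels used for digits in the corpus

    def expand_connected_digits() -> str:
        """Mimic: $connectedDigits = ( $digit | $eightdigits );"""
        if random.random() < 0.5:
            body = [random.choice(DIGITS)]                      # a single isolated digit
        else:
            body = [random.choice(DIGITS) for _ in range(8)]    # an eight-digit string
        return " ".join(["SENT-START"] + body + ["SENT-END"])

    random.seed(0)
    for _ in range(3):
        print(expand_connected_digits())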

B.2 Letter String Recognition Task (CL)

$letter = <a> | <b> | <c> | <d> | <e> | <f> | <g> | <h> | <i> | <j> | <k> | <l> | <m> | <n> | <o> | <p> | <q> | <r> | <s> | <t> | <u> | <v> | <w> | <x> | <y> | <z>;
$eightletters = $letter $letter $letter $letter $letter $letter $letter $letter;
$connectedLetters = ( $letter | $eightletters );

( SENT-START $connectedLetters SENT-END )

B.3 Complete Grammar Set Recognition Task (GU)

$digit = <0> | <1> | <2> | <3> | <4> | <5> | <6> | <7> | <8> | <9>;
$eightdigits = $digit $digit $digit $digit $digit $digit $digit $digit;


$connectedDigits = $digit | $eightdigits;
$zeventig = zeventig | zeuventig;
$numberdigits_01 = vier;
$numbertens = twintig | dertig | veertig | vijftig | zestig | $zeventig | tachtig | negentig;
$number20_99 = achtendertig | achtenveertig | vierenvijftig | zeuvenenzestig;
$number100s = driehonderd | zeuvenhonderd | achthonderd | negenhonderd;
$number = $numberdigits_01 | $number20_99 | $number100s | $number100s ( $number20_99 | $numberdigits_01 );

$amount = $number (euro | euro's);
$greeting = goedemorgen | goedemiddag | goedenavond;
$please = alstublieft | alsjeblieft;
$want = wil | wilde;
$accounttype = (rekening | privebankrekening);
$accountnumber = [nummer] $digit $digit $digit $digit $digit $digit $digit $digit;
$my = [mijn];
$account = $accounttype [$accountnumber];

$action = $amount van [$my] $account [naar [$my] $account] overmaken |
          $amount op [$my] $account storten |
          $amount storten op [$my] $account |
          $amount opnemen van [$my] $account |
          $amount van [$my] $account opnemen;

$bankingSentences = ( [$greeting] ik $want [graag] $action [$please] |
                      [$greeting] ik ($want | zou) $action graag |
                      [$greeting] ik zou graag $action [$please] );

$word = Antarctica | Britse | Poolse | Tilburg | aard | achtergronden | afdruk | afsluiten | avontuur | beurs | cheques | commentaar | deelnamenummer | donderdag | gebeurd | geheim | gooitje | jurk | kroon | langer | leider | lekker | mens | ontboden | ontwikkeld | opstanding | overgebracht | paar | registratie | rijk | schuin | stampvol | standaard | stem | tevergeefs | vanuit | verband | veroorzaakten | vier | vijfdaagse | vliegtuigongeluk | voorbarige | wenden | wetenschapswinkel | woordvoerder | zeventigjarige | zure;
$eightwords = $word $word $word $word $word $word $word $word;
$connectedWords = ( $eightwords );

$letter = <a> | <b> | <c> | <d> | <e> | <f> | <g> | <h> | <i> | <j> | <k> | <l> | <m> | <n> | <o> | <p> | <q> | <r> | <s> | <t> | <u> | <v> | <w> | <x> | <y> | <z>;
$eightletters = $letter $letter $letter $letter $letter $letter $letter $letter;
$connectedLetters = ( $letter | $eightletters );

( SENT-START ( $connectedLetters | $connectedDigits | $bankingSentences | $connectedWords ) SENT-END )

Appendix C

Utterance Types

This appendix contains the lists with the utterance types and their descriptions as provided to the respondents during the recording sessions for the NDUTAVSC corpus.

Table C.1: Utterance Types for the First Recording Sessions: 64 utterances.

Times   Speaking style       Description
3       Normal speech rate   Random digit sequences of length 8
3       Fast speech rate     —//—
3       Whispering           —//—
3       Normal speech rate   Spelling a random word of variable length
3       Whispering           —//—
3       Normal speech rate   Lists of random words of length 8
3       Fast speech rate     —//—
3       Whispering           —//—
5       Normal speech rate   Fixed grammar bank application sentences
5       Fast speech rate     —//—
5       Whispering           —//—
5       Normal speech rate   Random sentences taken from Polyphone
5       Fast speech rate     —//—
5       Whispering           —//—
5       Normal speech rate   Every day used expressions
5       Normal speech rate   Short answers to random open questions


Table C.2: Utterance Types for the Second Recording Sessions: 125 utterances.

Times   Speaking style       Description
45      Normal speech rate   Sentences taken from Polyphone
10      —//—                 Random digit strings of length 8
10      —//—                 Random letter strings of length 8
30      —//—                 Isolated digits
30      —//—                 Isolated letters

Table C.3: Utterance Types for the Third Recording Sessions: 155 utterances.

Times   Speaking style       Description
45      Normal speech rate   Sentences taken from Polyphone
10      —//—                 Random digit strings of length 8
30      —//—                 Random letter strings of length 8
20      —//—                 Isolated digits
50      —//—                 Isolated letters

Table C.4: Utterance Types for the Fourth Recording Sessions: 160 utterances.

Times   Speaking style       Description
60      Normal speech rate   Sentences taken from Polyphone
40      —//—                 Random digit strings of length 8
40      —//—                 Random letter strings of length 8
10      —//—                 Isolated digits
10      —//—                 Isolated letters

Appendix D

Utterances Sample

This appendix gives an example of the utterances file as presented to the speaker during a recording session for the NDUTAVSC corpus by the prompter tool.

Utt. type: 15 Description: Beantwoord de volgende vragen zo kort mogelijk.
1 :-> Wat is uw studentnummer (indien van toepassing)?
3 :-> Kunt u uw naam spellen?
4 :-> Wat is uw studentnummer (indien van toepassing)?
5 :-> Hoe lang verwacht u nog te moeten studeren (indien van toepassing)?
Utt. type: 2 Description: Fluister de volgende cijfercombinaties op normaal tempo.
6 :-> <1> <0> <3> <1> <6> <5> <0> <5>
7 :-> <8> <8> <1> <3> <4> <4> <0> <9>
8 :-> <9> <2> <3> <5> <8> <7> <7> <4>
Utt. type: 1 Description: Lees de volgende cijfercombinaties op, maar sneller dan gewoonlijk.
9 :-> <0> <9> <3> <5> <8> <2> <0> <5>
10 :-> <5> <9> <1> <2> <7> <4> <7> <7>
11 :-> <0> <9> <5> <7> <1> <5> <1> <1>
Utt. type: 4 Description: Fluister en spel de volgende woorden op normaal tempo.
12 :->
13 :->
14 :->
Utt. type: 3 Description: Spel de volgende woorden op normaal tempo.
15 :->
16 :->
17 :->

Utt. type: 6 Description: Lees de volgende woordenlijsten voor, maar sneller dan gewoonlijk.
18 :->
19 :->
20 :->
Utt. type: 7 Description: Fluister de volgende woordenlijsten op normaal tempo.
21 :->
22 :->


23 :->
Utt. type: 5 Description: Lees de volgende woordenlijsten voor op normaal tempo.
24 :->
25 :->
26 :->
Utt. type: 8 Description: Spreek de volgende zinnen uit op normaal tempo.
27 :-> Goedenavond ik wil 770 euro's van mijn privebankrekening opnemen graag.
28 :-> Goedenavond ik wilde 760 euro van mijn bankrekening <6> <1> <7> <8> <0> <8> <0> <4> overmaken alsjeblieft.
29 :-> Goedenavond ik wilde 760 euro van mijn bankrekening <6> <1> <7> <8> <0> <8> <0> <4> overmaken alsjeblieft.
30 :-> Ik wilde 354 euro op mijn priverekening <3> <9> <4> <7> <2> <6> <4> <7> storten.
31 :-> Goedemorgen ik wilde 348 euro storten op mijn bankrekening alstublieft.
Utt. type: 10 Description: Fluister de volgende zinnen op normaal tempo.
32 :-> Goedenavond ik wilde 890 euro van bankrekening nummer <9> <2> <8> <0> <5> <4> <2> <4> naar mijn priverekening <1> <8> <2> <1> <0> <9> <1> <0> overmaken.
33 :-> Ik wilde 38 euro storten op bankrekening <6> <7> <1> <7> <5> <0> <3> <2> alsjeblieft.
34 :-> Goedemiddag ik wilde 67 euro op mijn privebank rekening <8> <6> <3> <4> <5> <7> <1> <5> storten.
35 :-> Ik wilde 50 euro op mijn rekening <9> <5> <3> <0> <2> <4> <9> <5> storten alsjeblieft.
36 :-> Goedenavond ik wil 50 euro storten op mijn bankrekening graag.
Utt. type: 9 Description: Spreek de volgende zinnen uit, maar sneller dan gewoonlijk.
37 :-> Goedenavond ik wil 970 euro van bankrekening <6> <5> <7> <8> <0> <4> <2> <1> naar priverekening <4> <5> <6> <5> <4> <9> <1> <9> overmaken graag.
38 :-> Goedenavond ik wilde 980 euro's van bankrekening <7> <9> <8> <1> <6> <4> <8> <7> opnemen alsjeblieft.
39 :-> Ik zou graag 20 euro van mijn rekening overmaken.
40 :-> Goedemiddag ik wilde 60 euro van mijn bankrekening <2> <4> <3> <0> <5> <7> <6> <2> naar mijn bankrekening <9> <5> <9> <3> <9> <2> <4> <6> overmaken alstublieft.
41 :-> Ik zou 150 euro van mijn privebankrekening <6> <3> <8> <7> <7> <7> <7> <9> opnemen graag.
Utt. type: 12 Description: Spreek de volgende zinnen uit, maar sneller dan gewoonlijk.
42 :-> Toch dient zich nu een andere oplossing voor duikvereniging aan.
43 :-> Alle spelers dragen hetzelfde blauwe uniform.
44 :-> Het fleurige boeket bloemen hing er na drie dagen al helemaal verlept bij.
45 :-> De verzekeringsmaatschappij keerde een half miljoen euro uit.
46 :-> De burgemeester van Cuijk legt op een maart 1988 zijn functie neer.
Utt. type: 11 Description: Spreek de volgende zinnen uit op normaal tempo.
47 :-> Bij de sluiting van de drugspanden worden veel aanhoudingen verricht.
48 :-> Je zou kunnen zeggen dat ze meer een met de rolstoel worden.
49 :-> Naar verluidt komen zij over drie maanden terug.
50 :-> Bij de vereniging zijn kaarten in omloop met een waarde van 6.000 euro.
51 :-> Het maken van winst was volgens de bestuurder niet de eerste prioriteit.
Utt. type: 13 Description: Fluister de volgende zinnen op normaal tempo.
52 :-> Jouw broer heeft mij een brief gestuurd om mij te bedanken voor de chocolaatjes.

53 :-> Meer dan duizend beroepsmilitairen werden in een enquete om hun mening gevraagd.
54 :-> De ruiter kon zijn paard voor de hoge hindernis niet meer in bedwang houden.
55 :-> Na tien jaar is de grootse operationele melkreclamecampagne gestopt.
56 :-> Ik heb er moeite mee om zijn excuses zonder meer te aanvaarden.
Utt. type: 14 Description: Spreek de volgende alledaagse uitdrukkingen uit op normaal tempo.
57 :-> Goedemorgen!
58 :-> Pardon.
59 :-> Goeiedag!
60 :-> Dat had ik niet zo bedoeld.
61 :-> Dankuwel.
Utt. type: 0 Description: Lees de volgende cijfercombinaties voor op normaal tempo.
62 :-> <9> <1> <3> <8> <4> <2> <8> <3>
63 :-> <3> <9> <1> <3> <7> <8> <3> <8>
64 :-> <4> <7> <7> <5> <5> <8> <4> <6>

Bibliography

[Ana89] P. Anandan. A Computational Framework and an Algorithm for the Measurement of Visual Motion. International Journal of Computer Vision, vol. 2:pp. 283–310, 1989.

[Ars06] I. Arsic and J.-P. Thiran. Mutual information eigenlips for audiovisual speech recognition. In 14th European Signal Processing Conference (EUSIPCO). 2006.

[Att06] N. van Atteveldt. Speech meets script: fMRI studies on the integration of letters and speech sounds. Ph.D. thesis, Universiteit Maastricht, 2006.

[Bak75] J. Baker. The DRAGON system–An overview. IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 23(1):pp. 24–29, 1975.

[Bak07] S. Baker, D. Scharstein, and J. Lewis. A database and evaluation methodology for optical flow. In Computer Vision (ICCV 2007), Eleventh IEEE International Conference on. October 2007.

[Bar94] J. L. Barron, D. J. Fleet, and S. S. Beauchemin. Performance of optical flow techniques. International Journal of Computer Vision, vol. 12(1):pp. 43–77, February 1994.

[Bau66] L. E. Baum and T. Petrie. Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics, vol. 37:pp. 1554–1563, 1966.

[Bau67] L. E. Baum and J. A. Eagon. An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology. Bull. Amer. Math. Soc., vol. 73(3):pp. 360–363, 1967.

[Bau68] L. E. Baum and G. R. Sell. Growth functions for transformations on manifolds. Pac. J. Math., vol. 27(2):pp. 211–227, 1968.


[Bau70] L. E. Baum, T. Petrie, G. Soules, and N. Weiss. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics, vol. 41(1):pp. 164–171, 1970.

[Bau72] L. Baum. An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities, vol. 3(1):pp. 1–8, 1972.

[Beu96] D. Beun. Viseme syllable sets. Master's thesis, Institute of Phonetic Sciences, University of Amsterdam, 1996.

[Bla99] V. Blanz and T. Vetter. A morphable model for the synthesis of 3D faces. In Proceedings of the 26th annual conference on Computer graphics and interactive techniques, pp. 187–194. ACM Press/Addison-Wesley Publishing Co., New York, NY, USA, 1999.

[Boo94] T. Boogaart, L. Bos, and L. Bouer. Use of the Dutch Polyphone corpus for application development. In 2nd IEEE Workshop on Iterative Technology for Telecommunication Applications. September 1994.

[Boo96] F. Bookstein. Landmark methods for forms without landmarks: localizing group differences in outline shape. San Francisco, CA, USA, pp. 279–289, 1996.

[Bre85] M. Breeuwer. Speechreading Supplemented With Auditory Information. Ph.D. thesis, Free University of Amsterdam, 1985.

[Bre93] C. Bregler, H. Hild, S. Manke, and A. Waibel. Improving connected letter recognition by lipreading. In IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 1. Institute of Electrical Engineers Inc (IEE), 1993.

[Bre94] C. Bregler and Y. Konig. "Eigenlips" for robust speech recognition. In Acoustics, Speech, and Signal Processing, 1994. ICASSP-94, IEEE International Conference on. 1994.

[Bro04] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert. High accuracy optical flow estimation based on a theory for warping. Lecture Notes in Computer Science, pp. 25–36, 2004.

[Bru05] A. Bruhn and J. Weickert. Lucas/Kanade Meets Horn/Schunck: Combining Local and Global Optic Flow Methods. International Journal of Computer Vision, vol. 61(3):pp. 211–231, 2005.

[Buc07] J. N. Buchan, M. Pare, and K. G. Munhall. Spatial statistics of gaze fixations during dynamic face processing. Social Neuroscience, vol. 2(1):pp. 1–13, 2007.

[Chi96] C. Chibelushi, S. Gandon, J. Mason, F. Deravi, and R. Johnston. Design issues for a digital audio-visual integrated database. In Integrated Audio-Visual Processing for Recognition, Synthesis and Communication (Digest No: 1996/213), IEE Colloquium on, pp. 7/1–7/7. 1996.

[Chi97] G. I. Chiou and J. N. Hwang. Lipreading from color video. IEEE Transactions on Image Processing, vol. 6(8):pp. 1192–1195, 1997.

[Chi07a] A. G. Chiţu and L. J. M. Rothkrantz. Building a Data Corpus for Audio-Visual Speech Recognition. In Proceedings of Euromedia2007, pp. 88–92. 2007.

[Chi07b] A. G. Chiţu and L. J. M. Rothkrantz. The Influence of Video Sampling Rate on Lipreading Performance. In 12-th International Conference on Speech and Computer (SPECOM'2007), pp. 678–684. Moscow, October 2007.

[Chi07c] A. G. Chiţu, L. J. M. Rothkrantz, P. Wiggers, and J. C. Wojdel. Comparison between different feature extraction techniques for audio-visual speech recognition. Journal on Multimodal User Interfaces, vol. 1(1):pp. 7–20, March 2007.

[Chi08a] A. G. Chiţu and L. J. M. Rothkrantz. Dutch Multimodal Corpus for Speech Recognition. In LREC 2008 Workshop on Multimodal Corpora, pp. 56–59. ELRA, May 2008.

[Chi08b] A. G. Chiţu and L. J. M. Rothkrantz. On Dual View Lipreading Using High Speed Camera. In Euromedia'2008, pp. 43–51. Eurosis, 2008.

[Chi09a] A. G. Chiţu and L. J. M. Rothkrantz. The New Delft University of Technology Data Corpus for Audio-Visual Speech Recognition. In Euromedia'2009, pp. 63–69. April 2009.

[Chi09b] A. G. Chiţu and L. J. M. Rothkrantz. Visual Speech Recognition - Automatic System for Lip Reading of Dutch. Information Technologies and Control, vol. 7(3):pp. 2–9, 2009.

[Cıs04] P. Císař, M. Železný, and Z. Krňoul. 3D lip-tracking for audio-visual speech recognition in real applications. In Proceedings of the Interspeech, pp. 2521–2524. 2004.

[Coi96] T. Coianiz, L. Torresani, and B. Caprile. 2d deformable models for visual speech analysis. In D. G. Stork and M. E. Hennecke, editors, Speechreading by humans and machines : models, systems, and applications., vol. 150 of NATO ASI Series F: Computer and Systems Sciences. Springer, Berlin and New York, 1996.

[Coo92] T. Cootes and C. Taylor. Active shape models–smart snakes. In Proc. British Machine Vision Conference, pp. 266–275. Citeseer, 1992.

[Coo98] T. Cootes, G. Edwards, and C. Taylor. Active appearance models. In H. Burkhardt and B. Neumann, editors, Proc. European Conference on Computer Vision 1998, vol. 2, pp. 484–498. Springer, 1998.

[Coo00] T. Cootes. Image Processing and Analysis, chap. 7 Model-Based Methods in Analysis of Biomedical Images, pp. 223–248. Oxford University Press, 2000.

[Coo01] T. Cootes and C. Taylor. Statistical models of appearance for medical image analysis and computer vision. In Proc. SPIE Medical Imaging, vol. 4322, pp. 236–248. Citeseer, 2001.

[Cor84] P. Corthals. Een eenvoudige visementaxonomie voor spraakafzien [a simple viseme taxonomy for lipreading]. In Tijdscrijf Log en Audio, vol. 14, pp. 126–134. 1984.

[Cor02] M. Correia and A. Campilho. Real-time implementation of an optical flow algorithm. Pattern Recognition, vol. 4:p. 40247, 2002.

[D-C04] D-CIS LAB. Interactive Collaborative Information Systems, http://www.icis.decis.nl/, 2004.

[Dam94] M. Damhuis, T. Boogaart, C. Veld, M. Versteijlen, W. Schelvis, L. Bos, and L. Boves. Creation and analysis of the dutch polyphone corpus. In Third International Conference on Spoken Language Processing. ISCA, 1994.

[Dau03] P. Daubias and P. Deleglise. The lium-avs database: a corpus to test lip segmentation and speechreading systems in natural conditions. In Eighth European Conference on Speech Communication and Technology. Citeseer, 2003.

[Día06] J. Díaz, E. Ros, F. Pelayo, E. Ortigosa, and S. Mota. FPGA-based real-time optical-flow system. IEEE Transactions on Circuits and Systems for Video Technology, vol. 16(2):pp. 274–279, 2006.

[Dry98] I. L. Dryden and K. V. Mardia. Statistical shape analysis. Wiley New York, 1998.

[Duc94] P. Duchnowski, U. Meier, and A. Waibel. See me, hear me: Integrating automatic speech recognition and lip-reading. Reading, vol. 1(1):pp. 1–2, 1994.

[Duc95] P. Duchnowski, M. Hunke, D. Büsching, U. Meier, and A. Waibel. Toward movement-invariant automatic lip-reading and speech recognition. In International Conference on Acoustics, Speech, and Signal Processing, 1995 (ICASSP-95), vol. 1, pp. 109–112. 1995.

[Dup00] S. Dupont and J. Luettin. Audio-visual speech modeling for continuous speech recognition. In IEEE Transactions On Multimedia, vol. 2. September 2000.

[Edw98] G. Edwards, C. Taylor, and T. Cootes. Interpreting face images using active appearance models. In 3rd International Conference on Automatic Face and Gesture Recognition, pp. 300–305. 1998.

[Egg64] J. P. M. Eggermont. Taalverwerving bij een Groep Dove Kinderen [Language Acquisition in a Group of Deaf Children]. 1964.

[Eve01] N. Eveno, A. Caplier, and P.-Y. Coulon. A new color transformation for lips segmentation. In Multimedia Signal Processing, IEEE 4th Workshop on, pp. 3–8. 2001.

[Eve04] N. Eveno, A. Caplier, and P.-Y. Coulon. Automatic and accurate lip tracking. In IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, pp. 706–715. May 2004.

[Fis68] C. G. Fisher. Confusions among visually perceived consonants. Journal of Speech, Language and Hearing Research, vol. 11(4):p. 796, 1968.

[Fit07] S. Fitrianie, R. Poppe, T. Bui, A. Chiţu, D. Datcu, R. Dor, D. Hofs, P. Wiggers, D. Willems, M. Poel, L. Rothkrantz, L. Vuurpijl, and J. Zwiers. A multimodal human-computer interaction framework for research into crisis management. Proc. of the Intelligent Human Computer Systems for Crisis Response and Management, vol. 1:pp. 149–158, 2007.

[Fit10] S. Fitrianie, Z. Yang, D. Datcu, A. G. Chiţu, and L. J. M. Rothkrantz. Interactive Collaborative Information Systems, vol. Studies in Computational Intelligence, chap. Context-Aware Multimodal Human-Computer Interaction, pp. 237–272. Springer, May 2010. ISBN 978-3-642-11687-2.

[Fle90] D. J. Fleet and A. D. Jepson. Computation of Component Image Velocity from Local Phase Information. International Journal of Computer Vision, vol. 5(1):pp. 77–104, August 1990.

[Fle00] D. J. Fleet, M. J. Black, Y. Yacoob, and A. D. Jepson. Design and Use of Linear Models for Image Motion Analysis. International Journal of Computer Vision, vol. 36(3):pp. 171–193, 2000.

[For73] G. D. Forney. The Viterbi algorithm. Proc. IEEE, vol. 61(3):pp. 268–278, 1973.

[Fur03] S. Furui. Robust Methods in Automatic Speech Recognition and Understanding. In EUROSPEECH 2003 - GENEVA. 2003.

[G00a] R. Göecke, J. B. Millar, A. Zelinsky, and J. Robert-Ribes. Automatic extraction of lip feature points. In Proc. of the Australian Conference on Robotics and Automation ACRA2000, pp. 31–36. Citeseer, 2000.

[G00b] R. Göecke, Q. N. Tran, J. B. Millar, A. Zelinsky, and J. Robert-Ribes. Validation of an automatic lip-tracking algorithm and design of a database for audio-video speech processing. In Proc. 8th Australian Int. Conf. on Speech Science and Technology SST2000, pp. 92–97. Citeseer, 2000.

[G04] R. Göecke and J. Millar. The audio-video Australian English speech data corpus AVOZES. In Proceedings of the 8th International Conference on Spoken Language Processing ICSLP2004, vol. III, pp. 2525–2528. Jeju, Korea, October 2004.

[Gal98] B. Galvin, B. McCane, K. Novins, D. Mason, and S. Mills. Recovering Motion Fields: An Evaluation of Eight Optical Flow Algorithms. In Proceedings of the British Machine Vision Conference (BMVC) '98. September 1998.

[Gan02] A. Ganapathiraju. Support vector machines for speech recognition. Ph.D. thesis, Mississippi State University, Mississippi State, MS, USA, 2002. Major Professor: Joseph Picone.

[Gar88] J. Garofolo et al. Getting started with the DARPA TIMIT CD-ROM: An acoustic phonetic continuous speech database, 1988.

[Goo91] C. Goodall. Procrustes methods in the statistical analysis of shape. Journal of the Royal Statistical Society, Series B (Methodological), pp. 285–339, 1991.

[Gra97] M. S. Gray, J. R. Movellan, and T. J. Sejnowski. Dynamic features for visual speechreading: A systematic comparison. Advances in Neural Information Processing Systems, vol. 9:pp. 751–757, 1997.

[Hee87] D. J. Heeger. Model for the extraction of image flow. Journal Opt. Soc. Amer., vol. 4(8):pp. 1455–1471, August 1987.

[Hil09] S. Hilder, R. Harvey, and B. J. Theobald. Comparison of human and machine-based lip-reading. In B. J. Theobald and R. W. Harvey, editors, AVSP 2009, pp. 86–89. Norwich, September 2009.

[Hon06] X. Hong, H. Yao, Y. Wan, and R. Chen. A PCA based visual DCT feature extraction method for lip-reading. IIH-MSP, vol. 0:pp. 321–326, 2006.

[Hor81] B. K. Horn and B. G. Schunck. Determining optical flow. Artificial Intelligence, vol. 17:pp. 185–203, 1981.

[Hul98] A. Hulbert and T. Poggio. Synthesizing a color algorithm from examples. Science, vol. 239:pp. 482–485, 1998.

[Hun89] M. Hunt. Figures of merit for assessing connected-word recognisers. In Speech Input/Output Assessment and Speech Databases. ISCA, 1989.

[Iwa01] K. Iwano, S. Tamura, and S. Furui. Bimodal Speech Recognition Using Lip Movement Measured By Optical-Flow Analysis. In HSC2001. 2001.

[Kas88] M. Kass, A. Witkin, and D. Terzopoulos. Snakes: Active contour models. International Journal of Computer Vision, vol. 1(4):pp. 321–331, 1988.

[Kir90] M. Kirby and L. Sirovich. Application of the Karhunen-Loeve procedure for the characterization of human faces. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12(1):pp. 103–108, 1990.

[Kob92] H. Kobayashi and F. Hara. Recognition of mixed facial expressions by neural network. In IEEE International Workshop on Robot and Human Communication, 1992. Proceedings., pp. 387–391. 1992.

[Kri08] R. Kricke, T. Gernoth, and R.-R. Grigat. Local binary patterns for lip motion analysis. In Image Processing 2008, 15th IEEE International Conference on, pp. 1472–1475. 2008.

[Kum07] K. Kumar, T. Chen, and R. M. Stern. Profile view lip reading. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 4, pp. 429–432. Citeseer, 2007.

[Lee04] B. Lee, M. Hasegawa-Johnson, C. Goudeseune, S. Kamdar, S. Borys, M. Liu, and T. Huang. AVICAR: Audio-visual speech corpus in a car environment. In INTERSPEECH2004-ICSLP. Jeju Island, Korea, October 2004.

[Lev66] V. I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics-Doklady, vol. 10. 1966.

[Li95] N. Li, S. Dettmer, and M. Shah. Lipreading using eigen sequences. In Proc. International Workshop on Automatic Face- and Gesture-Recognition, pp. 30–34. Zurich, Switzerland, 1995.

[Li97] N. Li, S. Dettmer, and M. Shah. Visually recognizing speech using eigensequences. Motion-based Recognition, vol. 1:pp. 345–371, 1997.

[Lie99] M. Lievin, P. Delmas, P. Y. Coulon, F. Luthon, and V. Fristot. Automatic lip tracking: Bayesian segmentation and active contours in a cooperative scheme. In IEEE Conference on Multimedia, Computing and Systems, ICMCS99, vol. 1, pp. 691–696. Fiorenza, Italy, June 1999.

[Low76] B. T. Lowerre. The Harpy speech recognition system. Tech. rep., Carnegie Mellon University, 1976.

[Luc81] B. D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In Proc. Seventh International Joint Conference on Artificial Intelligence, pp. 674–679. 1981.

[Luc06] P. Lucey and G. Potamianos. Lipreading using profile versus frontal views. In IEEE Multimedia Signal Processing Workshop, pp. 24–28. Citeseer, 2006.

[Lue96] J. Luettin, N. A. Thacker, and S. W. Beet. Statistical lip modelling for visual speech recognition. In Proceedings of the 8th European Signal Processing Conference (EUSIPCO96). 1996.

[Lue97] J. Luettin and N. A. Thacker. Speechreading using probabilistic models. Computer Vision and Image Understanding, vol. 65(2):pp. 163–178, 1997.

[Mar95] A. Martin. Lipreading by optical flow correlation. Tech. rep., Computer Science Department, University of Central Florida, 1995.

[Mar05] J. L. Martín, A. Zuloaga, C. Cuadrado, J. Lázaro, and U. Bidarte. Hardware implementation of optical flow constraint equation using FPGAs. Computer Vision and Image Understanding, vol. 98(3):pp. 462–490, 2005.

[Mas91] K. Mase and A. Pentland. Automatic lipreading by optical-flow analysis. In Systems and Computers in Japan, vol. 22, pp. 67–76. 1991.

[Mat96] I. A. Matthews, J. Bangham, and S. J. Cox. Audiovisual speech recognition using multiscale nonlinear image decomposition. In Fourth International Conference on Spoken Language Processing. Citeseer, 1996.

[Mat02] I. Matthews, T. F. Cootes, J. A. Bangham, S. Cox, and R. Harvey. Extraction of visual features for lipreading. In IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, pp. 198–213. 2002.

[McC05] I. McCowan, D. Moore, J. Dines, D. Gatica-Perez, M. Flynn, P. Wellner, and H. Bourlard. On the use of information retrieval measures for speech recognition evaluation. IDIAP (Institut Dalle Molle d'Intelligence Artificielle Perceptive), IDIAP Research Report IDIAP-RR, vol. 04:pp. 1–13, 2005.

[Mcg76] H. Mcgurk and J. Macdonald. Hearing lips and seeing voices. Nature, vol. 264:pp. 746 – 748, December 1976.

[Mes98] K. Messer, J. Matas, and J. Kittler. Acquisition of a large database for biometric identity verification. In J. Jan, J. Kozumplík, and Z. Szabó, editors, BIOSIGNAL 98, pp. 70–72. Vutium Press, Technical University Brno, Purkynova 188, 612 00, Brno, Czech Republic, June 1998.

[Mes99] K. Messer, J. Matas, J. Kittler, J. Lüttin, and G. Maitre. XM2VTSDB: The Extended M2VTS Database. In Audio- and Video-based Biometric Person Authentication, AVBPA'99, pp. 72–77. Washington, D.C., March 1999. IDIAP-RR 99-02.

[Moo77] R. Moore. Evaluating speech recognizers. IEEE Transactions on Acoustics Speech and Signal Processing, vol. 25(2):pp. 178–183, 1977.

[Mor04] A. Morris, V. Maier, and P. Green. From wer and ril to mer and wil: improved evaluation measures for connected speech recognition. In Eighth International Conference on Spoken Language Processing. 2004.

[Mor07] L. E. L. Morn and R. Pinto-Elas. Lips shape extraction via active shape model and local binary pattern. MICAI 2007: Advances in Artificial Intelligence, vol. 4827:pp. 779–788, 2007.

[Mov95] J. R. Movellan. Visual Speech Recognition with Stochastic Networks. In Advances in Neural Information Processing Systems, vol. 7. MIT Pess, Cambridge, 1995.

[Nag87] H. H. Nagel. On the estimation of optical flow: Relations between different approaches and some new results. Artificial Intelligence, vol. 33(3):pp. 298–324, 1987.

[Nef02] A. V. Nefian, L. Liang, X. Pi, X. Liu, and K. Murphy. Dynamic Bayesian networks for audio-visual speech recognition. EURASIP Journal on Applied Signal Processing, vol. 11:pp. 1274–1288, 2002.

[Net00] C. Neti, G. Potamianos, J. Luettin, I. Matthews, H. Glotin, D. Vergyri, J. Sison, A. Mashari, and J. Zhou. Audio-visual speech recognition. In Final Workshop 2000 Report, vol. 764. Citeseer, 2000.

[Oja97] T. Ojala and M. Pietikainen. Unsupervised texture segmentation using feature distributions. Image Analysis and Processing, vol. 1310:pp. 311–318, 1997.

[Omo99] N. Omoigui, L. He, A. Gupta, J. Grudin, and E. Sanocki. Time-compression: systems concerns, usage, and benefits. In Proceedings of the SIGCHI conference on Human factors in computing systems: the CHI is the limit, p. 143. ACM, 1999.

[Ozg08] E. Ozgur, B. Yilmaz, H. Karabalkan, H. Erdogan, and M. Unel. Lip segmentation using adaptive color space training. In International Conference on Auditory and Visual Speech Processing. Citeseer, 2008.

[Pat02] E. Patterson, S. Gurbuz, Z. Tufekci, and J. Gowdy. CUAVE: A New Audio- Visual Database for Multimodal Human-Computer Interface Research. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing. 2002.

[Pea01] K. Pearson. On lines and planes of closest fit to systems of points in space. Philosophical Magazine Series 6, vol. 2(11):pp. 559–572, 1901.

[Pet88] E. Petajan, B. Bischoff, and D. Bodoff. An improved automatic lipreading system to enhance speech recognition. In CHI ’88: Proceedings of the SIGCHI conference on Human factors in computing systems, pp. 19–25. ACM Press, New York, NY, USA, 1988.

[Pig97] S. Pigeon and L. Vandendorpe. The M2VTS multimodal face database (release 1.00). Lecture Notes in Computer Science, vol. 1206:pp. 403–410, 1997.

[Pot97] G. Potamianos, E. Cosatto, H. Graf, and D. Roe. Speaker independent audio-visual database for bimodal ASR. In Proc. Europ. Tut. Work. Audio- Visual Speech Proc., Rhodes. 1997.

[Pot98] G. Potamianos, H. P. Graf, and E. Cosatto. An image transform approach for hmm based automatic lipreading. In Proc. IEEE International Confer- ence on Image Processing, vol. 1, p. 173. Citeseer, 1998.

[Pot04] G. Potamianos, C. Neti, J. Luettin, and I. Matthews. Audio-visual auto- matic speech recognition: An overview. Issues in Visual and Audio-Visual Speech Processing, 2004. 166 BIBLIOGRAPHY

[Poy99] C. Poynton. Frequently asked questions about color. http://wwwpoyntoncom/PDFs/ColorFAQpdf, vol. -:pp. 1–24, 1999.

[Pr05] J. F. G. Prez, A. F. Frangi, E. L. Solano, and K. Lukas. Lip reading for robust speech recognition on embedded devices. In Int. Conf. Acoustics, Speech and Signal Processing, vol. I, p. 473476. 2005.

[Rab89] L. R. Rabiner. A tutorial on hidden markov models and selected appli- cations in speech recognition. Proceedings of the IEEE, vol. 77(2):p. 257, 1989.

[Sal07] A. Salazar, J. Hernandez, and F. Prieto. Automatic quantitative mouth shape analysis. Lecture Notes in Computer Science, vol. 4673:pp. 416–421, 2007.

[Sch10] T. Schultz. Weak and silent speech: Technologies to support people with speech impairment. In 12th International Conference on Computers Helping People with Special Needs. Vienna University of Technology, Austria, July 12-13 2010.

[Sin91] A. Singh. Optic Flow Computation. A Unified Perspective. IEEE Computer Society Press, 1991.

[Sir87] L. Sirovich and M. Kirby. Low-dimensional procedure for the character- ization of human faces. Journal of the Optical Society of America A, vol. 4(3):pp. 519–524, 1987.

[Son94] N. van Son, T. M. I. Huiskamp, A. J. Bosman, and G. F. Smoorenburg. Viseme classifications of dutch consonants and vowels. The Journal of the Acoustical Society of America, vol. 96:p. 1341, 1994.

[Ste03] R. Sternberg. Third Edition. 2003.

[Sun09] X. Sun, L. Rothkrantz, D. Datcu, and P. Wiggers. A bayesian approach to recognise facial expressions using vector flows. In CompSysTech ’09: Pro- ceedings of the International Conference on Computer Systems and Tech- nologies and Workshop for PhD Students in Computing, pp. 1–6. ACM, New York, NY, USA, 2009.

[syn01] The SYNFACE project, http://www.phon.ucl.ac.uk/home/andyf/synfaceUCL.htm, 2001.

[Tam02] S. Tamura, K. Iwano, and S. Furui. A robust multi-modal speech recog- nition method using optical-flow analysis. In Extended summary of IDS02, pp. 2–4. Kloster Irsee, Germany, June 2002.

[Tam04] S. Tamura, K. Iwano, and S. Furui. Multi-modal speech recognition using optical-flow analysis for lip images. J VLSI Signal Process Syst, vol. 36(2- 3):pp. 117–124, 2004. BIBLIOGRAPHY 167

[Ter99] J.-C. Terrillon and S. Akamatsu. Comparative performance of different chrominance spaces for color segmentation and detection of human faces in complex scene images. In Vision Interface ’99. May 1999.

[Tom96] M. J. Tomlinson, M. J. Russell, and N. M. Brooke. Integrating audio and visual information to provide highly robust speech recognition. In IEEE International Conference on Acoustics Speech and Signal Processing, vol. 2. 1996.

[Tur91] M. Turk and A. Pentland. Eigenfaces for recognition. Journal of cognitive neuroscience, vol. 3(1):pp. 71–86, 1991.

[Ura88] S. Uras, F. Girosi, A. Verri, and V. Torre. A computational approach to motion perception. In Biological Cybernetics, vol. 60, pp. 79–87. December 1988.

[Var93] A. Varga and H. J. M. Steeneken. Assessment for automatic speech recogni- tion II: NOISEX-92: a database and an experiment to study the effect of ad- ditive noise on speech recognition systems. Speech Commun, vol. 12(3):pp. 247–251, 1993.

[Vio01] P. Viola and M. Jones. Robust Real-time Object Detection. In Second Inter- national Workshop On Statistical And Computational Theories Of Vision Modeling, Learning, Computing, And Sampling. Vancouver, Canada, July 2001.

[Vis99] M. Visser, M. Poel, and A. Nijholt. Classifying visemes for automatic lipreading. Lecture notes in computer science, vol. 0:pp. 349–352, 1999.

[Vit67] A. Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE transactions on Information Theory, vol. 13(2):pp. 260–269, 1967.

[Wan07] S. L. Wang, W. H. Lau, A. W. C. Liew, and S. H. Leung. Robust lip region segmentation for lip images with complex background. Pattern Recognition, vol. 40(12):pp. 3481–3491, 2007.

[Wax88] A. Waxman, J. Wu, and F. Bergholm. Convected activation profiles and receptive fields for real time measurement of short range visual motion. In Proceedings of Conference Computational Visual Pattern Recognition, pp. 771–723. 1988.

[Wig02] P. Wiggers, J. Wojdel, and L. Rothkrantz. Medium vocabulary continu- ous audio-visual speech recognition. In ICSLP, Conference Proceedings of. September 2002.

[Wig08] P. Wiggers. Modelling context in automatic speech recognition. Ph.D. thesis, Delft University of Technology, 2008. 168 BIBLIOGRAPHY

[Wil97] J. J. Williams, J. C. Rutledge, D. C. Garsteckiy, and A. K. Katsaggelos. Frame rate and viseme analysis for multimedia applications. In Proc. IEEE Works. Multimedia Signal Process., Princeton, pp. 13–18. 1997.

[Wil98a] J. J. Williams, J. C. Rutledge, and A. K. Katsaggelos. Frame rate and viseme analysis for multimedia applications to assist speechreading. Journal of VLSI Signal Processing, vol. 20:pp. 7–23, 1998. [Wil98b] J. R. Williams. Guidelines for the use of multimedia in instruction. In Hu- man Factors and Ergonomics Society Annual Meeting Proceedings, vol. 42, pp. 1447–1451. Human Factors and Ergonomics Society, 1998. [Woj00] J. C. Wojdel and L. J. M. Rothkrantz. Visually based speech onset/offset detection. In Proceedings of 5th Annual Scientific Conference on Web Technology, New Media, Communications and Telematics Theory, Methods, Tools and Application (Euromedia 2000), pp. 156–160. Antwerp, Belgium, 2000.

[Woj02] J. Wojdel, P. Wiggers, and L. Rothkrantz. An audio-visual corpus for multimodal speech recognition in dutch language. In ICSLP, Conference Proceedings of. 2002.

[Woj03] J. C. Wojdel. Automatic Lipreading in the Dutch Language. Ph.D. thesis, Delft University of Technology, November 2003. [Yos03] T. Yoshinaga, S. Tamura, K. Iwano, and S. Furui. Audio-Visual Speech Recognition Using Lip Movement Extracted from Side-Face Images. In AVSP2003, pp. 117–120. September 2003.

[Yos04] T. Yoshinaga, S. Tamura, K. iwano, and S. Furui. Audio-visual speech recognition using new lip features extracted from side-face images. In Robust 2004. August 2004. [You05] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. A. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland. The HTK Book (for HTK Version 3.4). Citeseer, 2005.

[Zha00] X. Zhang and R. Mersereau. Lip feature extraction towards an automatic speechreading system. In ICIP00. Vancouver, Canada, September 2000. [Zha02] X. Zhang, C. C. Broun, R. M. Mersereau, and M. A. Clements. Automatic speechreading with applications to human-computer interfaces. EURASIP J Appl Signal Process, vol. 2002(1):pp. 1228–1247, 2002.

[Zha07] G. Zhao, M. Pietik¨ainen,and A. Hadid. Local spatiotemporal descriptors for visual recognition of spoken phrases. In Proceedings of the international workshop on Human-centered multimedia, pp. 66–75. ACM, 2007. Summary

“Towards Robust Visual Speech Recognition”

In the last two decades we have witnessed a rapid increase in computational power, governed by Moore’s Law. As a side effect, cheap and fast CPUs became widely affordable, many new “smart” devices flooded the market and information systems became widespread. The number of users of information systems has grown many fold, and the user base has changed from a small number of initiates to a majority of non-technical people. To make this transition possible, system developers had to redesign computer user interfaces to be simpler and more intuitive. The interaction, however, was still based on rather artificial devices such as the mouse and keyboard. Since Moore’s Law continues to hold, we have reached the point where current systems can easily cope with additional input streams, most importantly video and audio. We can therefore envision systems with which we communicate through speech and body movements, and which automatically and transparently adapt to the environment and the user, for instance by recognizing the user’s affective state, understanding the user’s task and recognizing the context of the interaction. Automatic speech recognition based on capturing and processing the audio signal is one development in this direction. Even though automatic speech recognition has achieved spectacular results in controlled settings, its performance still depends on the context (for instance on the level of background noise). Automatic lip reading appeared in this context as a way to enhance automatic speech recognition in noisy environments. Although it is still a relatively young research domain, other applications have been found that employ lip reading on its own: interfaces for hearing impaired persons, security applications, speech recovery from mute or deteriorated films, and silent interfaces. With the advances in visual signal processing, research in lip reading has been revitalized. However, at the moment of writing this thesis, lip reading was still waiting for its great leap. This thesis investigates several techniques for directing lip reading towards more robust performance.

The thesis starts by introducing the relevant methodologies that govern automatic lip reading. Next it introduces all the concepts needed to understand the technologies, experiments, results and discussions presented later on; it is therefore one of the most important parts of the thesis. The presentation of the state of the art in lip reading sets the starting point of the research. Before continuing to follow the lip reading process, the thesis introduces the mathematical foundations that give theoretical support to the analysis.

All our systems are based on the hidden Markov model approach. This paradigm has proved very useful in similar problems and we employed it successfully for lip reading. The main idea behind it is Bayes’ rule, which says that, starting from some a priori knowledge, we can always improve our understanding of a system through observation. Observation translates into processing the video stream into a set of features that describe what is being said by the speaker. However, in order to train lip reading systems appropriately, a large amount of data is needed. The first important contribution of our research is a large data corpus for the Dutch language. This corpus, named “New Delft University of Technology Audio Visual Speech Corpus”, was at the time of writing one of the largest corpora for lip reading in Dutch. The corpus contains dual-view high speed recordings (i.e. 100Hz) of continuous speech in Dutch. While building the corpus, we also produced an initial set of guidelines for building a data corpus for lip reading, which we hope will be useful to other researchers.

The core of this thesis, however, is the data parametrization. Data parametrization is the process that transforms the input video data into a set of features that are later used for training and testing the resulting recognizer. The parametrization should reduce the size of the input data while preserving the most important information related to what the speaker says. We investigated three data parametrization techniques, each coming from a different category of algorithms. We used Active Appearance Models, which generate a combined geometric and appearance based set of features; we used optical flow analysis, an appearance based approach that directly accounts for the apparent movement on the speaker’s face; and we used a statistical approach that generates the geometry of the lips without starting from an a priori fixed model. Throughout the research presented in this thesis we investigated the performance of these data parametrization techniques and pointed out their strengths and weaknesses. We also analysed the performance of lip reading from other points of view: the influence of the sampling rate of the video data, the performance of the lip readers as a function of the recognition task, and the performance as a function of the size of the corpus used. Answering all these questions improved our understanding of the limitations and possible improvements of lip reading.
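To make the recognition principle sketched above concrete, the HMM-based recognizer can be read as an application of the Bayesian decision rule: given the sequence of visual feature vectors O = o_1, ..., o_T extracted from the video stream, it selects the viseme or word sequence W with the highest posterior probability. The formulation below is the standard one, added here as an illustration rather than quoted from the thesis chapters:

\[
\hat{W} \;=\; \arg\max_{W} \, P(W \mid O)
        \;=\; \arg\max_{W} \, \frac{P(O \mid W)\, P(W)}{P(O)}
        \;=\; \arg\max_{W} \, P(O \mid W)\, P(W),
\]

where the likelihood P(O | W) is supplied by the trained (viseme) HMMs, the prior P(W) by the language model, and P(O) can be dropped because it does not depend on the candidate sequence W.

As a rough sketch of how an optical-flow based parametrization can turn consecutive video frames into such observation vectors, the fragment below uses OpenCV’s dense Farnebäck flow over a mouth region of interest; the function choice, parameter values and summary statistics are illustrative assumptions, not the exact pipeline implemented in this thesis:

    import cv2
    import numpy as np

    def mouth_flow_features(prev_gray, next_gray, roi):
        # roi = (x, y, w, h): mouth bounding box, assumed to be provided by a
        # separate face/mouth detector (hypothetical in this sketch).
        x, y, w, h = roi
        prev_roi = prev_gray[y:y + h, x:x + w]
        next_roi = next_gray[y:y + h, x:x + w]

        # Dense optical flow between the two greyscale frames; 'flow' holds
        # one (dx, dy) displacement vector per pixel of the region.
        flow = cv2.calcOpticalFlowFarneback(prev_roi, next_roi, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        dx, dy = flow[..., 0], flow[..., 1]

        # Summarize the flow field with a few global statistics; per frame,
        # such a vector (possibly extended with temporal derivatives) plays
        # the role of one observation o_t for the HMMs.
        return np.array([dx.mean(), dy.mean(), dx.std(), dy.std(),
                         np.abs(dx).mean(), np.abs(dy).mean()])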

Alin G. Chiţu

Samenvatting

“Op weg naar robuuste visuele spraakherkenning”

In de laatste twee decennia zagen we een snelle toename van de rekencapaciteit volgens de wet van Moore. Als gevolg hiervan kwamen goedkope en snelle CPU’s beschikbaar. Daardoor kwamen vele nieuwe “slimme” apparaten op de markt en raakten informatiesystemen wijdverbreid. Het aantal gebruikers van informatiesystemen is vele malen toegenomen, waarbij de gebruikers zijn veranderd van specialisten tot een grote hoeveelheid niet-technisch geschoolde mensen. Om deze overgang mogelijk te maken moesten systeemontwikkelaars de gebruikersinterface veranderen om de interactie simpeler en meer intuïtief te laten verlopen. Echter, de interactie was nog steeds gebaseerd op nogal onnatuurlijke randapparatuur zoals een muis en een toetsenbord. Op dit moment is de rekencapaciteit van gangbare systemen dusdanig dat ze gemakkelijk kunnen omgaan met andere datastromen, zoals beeld en geluid als meest belangrijke. In de nabije toekomst kunnen we systemen verwachten waar we mee kunnen communiceren via spraak en lichaamsbewegingen en die zich automatisch en transparant kunnen aanpassen aan de omgeving van de gebruiker. Dit kan bijvoorbeeld plaatsvinden door herkenning van de emotionele toestand van de gebruiker, door herkenning van de taken en herkenning van de context waarbinnen de interactie plaatsvindt. Automatische spraakherkenning door opname en verwerking van het geluidsignaal is een van de ontwikkelingen in deze richting. Alhoewel automatische spraakherkenning in gecontroleerde omgevingen spectaculaire resultaten heeft bereikt, is het resultaat nog steeds sterk afhankelijk van de context (bijvoorbeeld het niveau van achtergrondruis). Automatisch liplezen biedt de mogelijkheid om spraakherkenning te verbeteren in omgevingen met veel achtergrondruis. Andere toepassingsgebieden, waarbij liplezen als apart systeem wordt gebruikt, zijn: interfaces voor auditief gehandicapte personen, applicaties in beveiligde omgevingen, spraak reconstrueren vanuit stomme of beschadigde films, of geluidloze interfaces. Door de vooruitgang in visuele signaalverwerking is het onderzoek binnen liplezen ook gerevitaliseerd. Echter, op het moment van het schrijven van deze thesis was liplezen in afwachting van de grote sprong vooruit. In dit proefschrift worden verschillende technieken onderzocht waardoor liplezen meer robuust wordt.

Dit proefschrift start met de introductie van relevante methodologieën die toegepast worden binnen liplezen. Vervolgens worden alle begrippen geïntroduceerd die nodig zijn om de technologieën, experimenten, resultaten en discussies op het eind van dit proefschrift te begrijpen. Het is daarom een van de belangrijkste onderdelen van dit proefschrift. De presentatie van de huidige stand van zaken binnen liplezen is de start van het onderzoek. Eerst wordt de wiskundige basis geïntroduceerd die de theoretische onderbouwing verschaft van de analyse.

Al onze systemen zijn gebaseerd op de hidden Markov model benadering. Dit paradigma is zeer nuttig gebleken in vergelijkbare problemen en we hebben het met succes toegepast voor liplezen. De voornaamste achterliggende gedachte is de regel van Bayes, die zegt dat, als gestart wordt vanuit a priori kennis, we onze kennis van het systeem altijd kunnen verbeteren met behulp van observaties. Observatie wordt vertaald in het extraheren van kenmerken uit de video-opnames die beschrijven wat er gezegd is door de spreker. Echter, om een lipleessysteem te trainen zijn grote hoeveelheden data nodig. De eerste belangrijke bijdrage van ons onderzoek is een groot databestand van de Nederlandse taal. Dit databestand wordt “New Delft University of Technology Audio Visual Speech Corpus” genoemd, en is op het moment van het schrijven van dit proefschrift een van de grootste databestanden voor liplezen voor de Nederlandse taal. Het databestand bevat duale (frontale/silhouet gezichtsopnames) hogesnelheidsvideo-opnames (d.w.z. 100Hz) van continue spraak in het Nederlands. Tijdens het bouwen van het databestand hebben we ook een verzameling richtlijnen opgesteld hoe een databestand voor liplezen opgebouwd moet worden, waarvan we hopen dat het waardevol is voor andere onderzoekers.

De kern van het proefschrift wordt echter gevormd door de dataparametrisatie. Dataparametrisatie is het proces dat videodata transformeert naar een verzameling kenmerken die later gebruikt worden bij het trainen en testen van de uiteindelijke herkenner. De parametrisatie wordt geacht de omvang van de inputdata te reduceren, met behoud van de meest belangrijke informatie over hetgeen de spreker zegt. We onderzochten drie dataparametrisatietechnieken die elk afkomstig zijn uit verschillende categorieën van algoritmen. We gebruikten Active Appearance modellen, die een set kenmerken genereren die een combinatie vormen van geometrische en visuele kenmerken; vervolgens gebruikten we optische-stromingsanalyse (optical flow), een visueel gebaseerde benadering die direct de zichtbare bewegingen op het gezicht van de spreker beschrijft. Daarnaast gebruikten we een statistische benadering, die de geometrie van de lippen genereert zonder te starten vanuit een a priori model. Tijdens het onderzoek gepresenteerd in dit proefschrift onderzochten we hoe goed de verschillende dataparametrisatietechnieken waren en bepaalden we de sterke en zwakke punten. We analyseerden de prestaties van het lipleessysteem ook vanuit andere gezichtspunten: de invloed van de opnamesnelheid van de video-opnamen, de werking van de lipleessystemen als functie van de herkenningstaak, maar ook als functie van de omvang van het databestand. Bij het beantwoorden van al deze vragen kregen we een beter inzicht in de beperkingen en mogelijke verbeteringen van lipleessystemen.

Alin G. Chiţu

Acknowledgements

I am finally here! It was fascinating, challenging, difficult, overwhelming and tiring; sometimes I felt hopeless, but then again I felt I had got new wings. Even though it was only four years (oh, OK, almost five), I feel I could write an entire book just about life as a PhD candidate. The most important lesson I have learned is that while you are pursuing your doctoral degree your own life goes on as well, and there is no way you could survive alone. You need people to be close to you and to help you during your tough moments, but also to be happy with you during your peaks; you need to live your life! Therefore, the time has come to acknowledge and thank all the people who were close to me during these years of my life, who in many ways contributed to my becoming the person I am today, and who contributed directly or indirectly to this thesis.

First of all, I would like to thank my promotor, Leon Rothkrantz, for giving me the opportunity to work on this fascinating subject and for his supervision during all these years. I thank him for his understanding and support during those periods that were so difficult for me and my family. I also thank him for the weekly discussions that kept me on the right track and for being on my side during the last months before coming to this happy ending. I would also like to thank my promotor, Catholijn Jonker, for the time she invested in completing this thesis. Thanks to her valuable comments I was able to improve the thesis both in writing style and in readability. Thank you again for your understanding during the last year of my endeavour. This is also a good moment to thank all the committee members for their valuable comments, which improved this thesis in many ways. Thank you all.

None of the work presented in this thesis would have been possible without the data corpus, compiled with so much effort. I would like to thank all the participants in our long and tiring recording sessions. I would especially like to thank Ank Voets, who was so kind as to record no less than 18 sessions with us. Thank you very much! Your effort was very valuable to me and I am sure it will be for future researchers as well. I also want to thank Karin Driel, Pegah Takapoui and Mathijs van Vulpen for their valuable help in building the language pool, setting up the recording environment and supervising the recordings.

Since this dissertation is such an important milestone of my (academic) career, but also of my life in general and especially of my life here in The Netherlands, I think this is a good moment to thank everybody who helped us (me and my wife) from the first moment of our arrival in The Netherlands. Now, while writing this thesis, I have a confusing feeling: it seems like ages ago that we first came here, but at the same time I remember everything as if it were yesterday. I remember exactly the day we took the bus and embarked on the 48-hour journey from Bucharest to Rotterdam. I remember even more vividly the strong feelings of excitement we had in our hearts. It feels like it was yesterday when we arrived in Rotterdam at three in the night and Remus was waiting for us. His optimism and thirst for life were always inspiring and, more importantly, always improved our morale. I always felt better after visiting and talking with him and with his wife Dana. After every discussion with Remus, you always feel like life is easier, things are better or they are going in the right direction. I want to thank you both for being close to us and for all the good times we spent together.

It is impossible for me not to mention here Roger Cooke. I would like to thank him again (I did so in my master’s thesis as well) for trusting me and giving me the opportunity to enrol in the International Master Program at Delft University. This has changed my life completely. He was the first real scientist I met who was exactly as I had imagined and admired ever since I was a little boy watching Jacques-Yves Cousteau’s films at “Teleenciclopedia”, the 45-year-old TV documentary show broadcast on the national Romanian TV channel. It was during this master program that I felt for the first time that I was so close to science. I want to thank again all the other people who planted the science seed in me and for their influence on my development as a scientist: Arnold Heemink, Tom Mazzuchi, Jan van Noortwijk, Peter Wilders, Dorota Kurowicka, etc., just to name the most important ones.

I would also like to thank all my friends and colleagues from TU Delft for the relaxing time we spent together, for the talks, lunches, picnics, trips, “cola-quium” and parties we organised together, and for helping me become the person I am today. They have touched my life in irreversible ways. They are many, and trying to list them all would always leave some out. Therefore, so as not to offend any of them, I want to thank them all; they will know who they are. I particularly want to thank Dragoş Datcu for being my friend and for the long talks we had during these years, scientific or not, at the swimming pool and sauna or elsewhere. I would also like to thank Siska Fitrianie. When I first came to the Man-Machine Interaction Group she was the person who made my integration into the group a lot smoother. She was the person to ask about everything I needed to know. I have always enjoyed talking with her and I have always felt that she understood me even though we came from such different backgrounds. At this important milestone in my life, I also want to thank my good friend Jonathan-Jean Stern.
Even though we have seen each other so rarely in the last seven years, my first meeting with him came at a particularly important moment of my life and left such a strong impression on me that it changed me completely. I thank you for the long talks we had and for teaching me that people need to have “a gram of humility” in order to be good people.

Finally, I want to thank my family. I would first like to thank my parents for all their continuous effort in raising me and for making it possible for me to follow my dreams. I want to thank them for supporting me all these years away from home. I now realize that I left home after graduating from high school and since then have come home only for short holidays. I want to especially thank my mother, who in her heart was never content with the idea that I was so far away and that we visited each other so rarely. (Sărut-mâna, dragii mei părinţi!) I want to thank my brother for being a model for me ever since I was a little boy. I was so glad when he and my sister-in-law moved to Brussels and we ended up living even closer to each other than we did in Romania.

I want to thank my wife Dana with all my heart. Without her nothing would have been possible. I thank her for being close to me at all times, during my ups and downs, which were so many, I know. She has always shown me the bright side of life and made me believe that things would get better, and so gave me the strength to continue. In the last years we went through many difficult times, but together we succeeded in overcoming them. There were also wonderful moments which I was so fortunate to share with her. During this time our beautiful daughters were born. There were many days when she was alone with them because I had to go to the office and work, even late after office hours or in the weekend. I thank you for those days and I am indebted to you for that. For everything and for every moment we spent together, I want to thank you, Dana! (Te iubesc, puişorul meu!)

I want to thank our sweetest daughters Mina and Ana. Not yet being able to understand why daddy was not playing with them all the time and why he always went into the small bedroom, which we called the “home office”, instead of staying with them in the living room to play, they suffered the most during these times. I want to ask their forgiveness for all the times I was too tired and too stressed and did not have enough energy to answer their questions, and maybe even sent them away, nervously. I want to assure them that I will always believe that they are my highest achievement in this world, and that I would not trade the time I spent with them for anything in this world. They have complicated and enriched our lives in ways we had never thought of. (Tati vă iubeşte din suflet!)

Alin G. Chiţu
Delft, August 2010

Curriculum Vitae

Alin Gavril Chiţu was born in Buşteni, Romania, on November 8, 1978. He finished secondary school in 1993 and high school in 1997 in his home town. From October 1, 1997 he studied Mathematics and Computer Science at the University of Bucharest. He graduated in 2001 and continued with his first master’s program, in Applied Computer Science, at the same university, from which he graduated in February 2003. From February to July 2002 and from March to July 2003 he was enrolled in an exchange program in the Graphics Lab of the Computer Science Faculty of the National University of Singapore (NUS). In November 2003 he got the opportunity to join the International Master Program in Risk Management and Environmental Modelling at Delft University of Technology. He graduated with honours in August 2005 with the thesis “Probability of ship collision with offshore wind farms in the Southern North Sea”, which was ranked among the three best master’s theses at the “Risk Management Study Award 2006”. On September 1, 2005 he became a PhD candidate (in Dutch: Assistent in Opleiding, or simply AIO) at the Faculty of Electrical Engineering, Mathematics and Computer Science of Delft University of Technology, as a member of the Man-Machine Interaction Group under the daily supervision of Prof. dr. drs. L.J.M. Rothkrantz. Since February 2010 he has been working as a scientist in the Oil and Gas department at TNO (Netherlands Organisation for Applied Scientific Research), in the Built Environment and Geosciences core area. Alin is married and has two beautiful twin daughters.


Colophon

This manuscript was typeset by the author with the LaTeX 2ε documentation system on a PC running Windows XP SP3 and Cygwin. Text editing was done using WinEdt and compilation using the MiKTeX toolkit. The graphs and the diagrams were created in MATLAB and Microsoft Visio, respectively. The conversion to Encapsulated PostScript (.eps) was done via “Print to PDFCreator”. Some illustrations were created through direct screen copies using the PrntScrn function and processed using Adobe Photoshop. The cover was produced by the author using Adobe Illustrator.

The body type is 10 point Computer Modern Roman. Chapter and section titles are in various sizes of Adobe Helvetica-Narrow Bold. The mono-space typeface used for verbatim text is Adobe Courier. The LaTeX 2ε page formatting is a modified version of the “It Took Me Years To Write” template by Leo Breebaart [1]. The author would like to extend his gratitude to Mr. Leo Breebaart for this valuable resource.

[1] Leo Breebaart, “It Took Me Years To Write” template for Delft University PhD Thesis, 2003. Downloadable from: http://www.kronto.org/thesis/ (the link was accessible in May 2010).
