
Robust Visual Speech Recognition Using Optical Flow Analysis and Rotation Invariant Features

A thesis submitted in fulfilment

of the requirements for the degree of

Doctor of Philosophy

Ayaz Ahmed Shaikh

B.Eng. (Electronic), M.Phil (Communication Systems & Networks)

Mehran University of Engineering and Technology

School of Electrical and Computer Engineering
Science, Engineering and Technology Portfolio
RMIT University
August 2011

Declaration

I certify that except where due acknowledgement has been made, the work is that of the author alone; the work has not been submitted previously, in whole or in part, to qualify for any other academic award; the content of the thesis is the result of work which has been carried out since the official commencement date of the approved research program; and, any editorial work, paid or unpaid, carried out by a third party is acknowledged.

Ayaz Ahmed Shaikh Dated:


Acknowledgement

I particularly wish to thank the following people:

• Most especially Professor Mike Austin and Dr. Paul Beckett. Professor Austin provided mentoring and encouragement to submit this thesis; his generous support and availability were invaluable to me, as were the support and encouragement from Dr. Beckett throughout my PhD candidacy.

• Dr. Jayavardhana Gubbi, of the University of Melbourne, who generously devoted considerable time for guidance, and was always available for ongoing discussions and moral support.

• Associate Prof. Dr. Dinesh K Kumar for providing the ideas, inspiration, motivation and the necessary support to commence the research in the field of visual speech recognition.

• A special thank you to Dr. Wai Chee Yau for providing the dataset.

• My colleagues Tayab, Arun, Fawaz, Khalifa, Premith, Noura and Kashfia for advice and fruitful exchange of ideas.

• Additionally, I thank Bryan Glynn for carefully reading the manuscript and for giving helpful hints and suggestions in writing this thesis.

I acknowledge Mehran University of Engineering and Technology, Pakistan, for their financial support, making it possible for me to commence this research and to proceed to its eventual completion.

Finally, I want to thank my family. I am deeply grateful to my wife Shazia, whose patience and support encouraged me daily. In addition, I would like to thank my mother and my parents-in-law for their continued prayers and encouragement.

Ayaz Shaikh


Keywords

Lip-reading, visual speech recognition, speech reading, visual feature extraction, temporal speech segmentation, classification of visual features, support vector machines


Table of Contents

Declaration
Acknowledgement
Keywords
Table of Contents
Abstract
List of Tables
List of Figures
Abbreviations and Acronyms
Chapter 1: Introduction
1.1 Pattern Recognition
1.1.1 Feature Extraction
1.1.2 Classification
1.2 Motivation
1.3 Scope of Thesis
1.4 Aims and Objectives of the Research
1.5 Outline of Thesis
1.6 Publications Resulting from this Research
Chapter 2: An Overview of AVSR Systems
2.1 Introduction
2.2 Non-Audio Speech Modalities
2.3 The Review of AVSR and VSR
2.4 Anatomy of the Human Speech Production System
2.5 Linguistics of Visual Speech
2.6 Visual Speech Perception by Humans
2.7 Speech Reading Proficiency
2.8 Significance of Facial Parts in Lip-reading
2.9 Basic Components of Visual Speech Recognition System
2.10 Brief Description of Proposed VSR System
2.11 Summary
Chapter 3: Image and Video Preprocessing
3.1 Visual Front End
3.1.1 Face Detection
3.1.2 Spatial Segmentation
3.2 Choice of Utterances
3.2.1 Dataset Exploited for this Study
3.3 Temporal Segmentation
3.4 Image Noise Reduction
3.5 Issues that Need to be Considered for the Development of Visual Speech Recognition
3.6 Summary
Chapter 4: Mouth Movement Representation Using Optical Flow Based Motion Template
4.1 Introduction
4.2 Development of Optical Flow Based Motion Templates
4.2.1 Optical Flow Motion Estimation
4.2.2 Gradient Based Approach
4.2.3 Global Methods
4.2.4 Local Methods
4.2.5 Optical Flow Computation Used in this Research
4.2.6 Non Overlapping Block Based Approach
4.2.6.1 Block Optimization
4.2.7 Normalization of Speed of Speech
4.3 Summary
Chapter 5: Mouth Movement Representation Using Directional Motion History Images
5.1 Motion Representation Theory
5.1.1 Basics of Motion History Image (MHI)
5.1.2 Development of Directional Motion History Images (DMHIs)
5.2 Advantages and Disadvantages of Directional Motion History Images
5.3 Summary
Chapter 6: Visual Speech Feature Extraction and Classification
6.1 Classification of Visual Feature Extraction
6.1.1 Shape Based Feature Extraction
6.1.2 Appearance Based Feature Extraction
6.1.3 Hybrid Features
6.1.4 Motion Based Feature Representation
6.2 Visual Speech Feature Extraction
6.2.1 Zernike Moments
6.2.1.1 Square-to-Circular Image Coordinate Transformation
6.2.1.2 Computation of ZM
6.2.1.3 Rotation Invariance of ZM
6.2.2 Hu Moments
6.3 Classifier for Lip-reading
6.4 Support Vector Machine
6.4.1 Optimal Separating Hyper-plane
6.4.2 The Non-Separable Data: Soft Margin Hyper-plane
6.4.3 Kernel Trick
6.4.4 Multiclass Kernel Machines
6.5 Summary
Chapter 7: Experimental Results
7.1 Experimental Setup 1: Performance Evaluation of an Optical Flow Based Motion Template
7.1.1 Methodology for Classification
7.1.2 Viseme Classification Results
7.2 Experimental Setup 2: Performance Evaluation of DMHIs
7.2.1 Comparing the Performance of DMHI vs MHI
7.3 Summary
Chapter 8: Conclusions and Future Directions
8.1 Conclusions
8.2 Future Work
References


Abstract

The focus of this thesis is to develop computer vision algorithms for visual speech recognition. Potential applications of such a system include lip-reading mobile phones, human computer interfaces (HCI) for mobility-impaired users, in-vehicle systems, robotics, surveillance, improved speech based computer control in noisy environments, and rehabilitation aids for people who have undergone laryngectomy surgery or for older citizens who require extensive effort to speak but can move the mouth easily without actually pronouncing the words.

In the literature, several models and algorithms are available for visual feature extraction to enhance the performance of existing acoustic speech recognition systems. These features are extracted from static mouth images and characterized as appearance and shape based features. However, such methods rarely incorporate the time dependent information of mouth motion and dynamics. Moreover, there are still no commercially available visual speech recognition systems, which reflects the need to further address the research challenges.

This dissertation presents two optical flow based approaches to visual feature extraction, which capture the dynamics of mouth motion in a video during speaking. These dynamics of mouth motion are used to classify and identify the visemes in a video. The motivation for using motion features is that human lip-reading relies on the temporal dynamics of mouth motion. The majority of existing speech recognition systems are based on audio-visual signals, have been developed for speech enhancement, and remain susceptible to acoustic noise. Considering this problem, the main focus of this research is to investigate and develop a visual only speech recognition system suitable for noisy environments.

The first approach is based on the extraction of features from the vertical component of optical flow. Preliminary experiments revealed that the salient speech features are available in the vertical component of optical flow, whereas the horizontal component makes a smaller contribution in normal speech. The optical flow vertical component is decomposed into multiple non-overlapping fixed scale blocks and statistical features of each block are computed for successive video frames of an utterance. A common problem with existing systems is their high error rates due to variation in the speed of speech and in speaking style, especially across national and cultural boundaries, making them very subject dependent. To overcome this issue, the proposed system is made robust to the speed of speech by normalizing each utterance to a fixed number of features.

In the second approach, four directional motion templates based on optical flow are developed, each representing the consolidated motion information of an utterance in one of four directions (up, down, left and right). This approach is an evolution of a view based approach known as the motion history image (MHI), which implicitly encodes the spatio-temporal components of an image sequence into a grey scale, scalar valued MHI. One of the main issues with the MHI method is its motion overwriting problem caused by self-occlusion, which occurs when motion is repeated at the same location at different times within the utterance. The DMHI technique addresses this overwriting issue by using four directional motion history images rather than the single image used in the MHI technique. Two types of image descriptors, Zernike moments and Hu moments, are used to represent each image of the DMHIs. A support vector machine (SVM) was used to classify the features obtained from the optical flow vertical component, Zernike moments and Hu moments separately. For identification of visemes, a multiclass SVM approach was employed.

Video recordings of seven subjects were used to evaluate the efficiency of the proposed methods for lip-reading. The experimental results demonstrate the promising performance of the optical flow based mouth movement representations. The advantages and limitations of both techniques for visual speech recognition were identified and validated through experiments. A performance comparison between DMHI and MHI based on Zernike moments shows that the DMHI technique outperforms the MHI technique.

A video based ad hoc temporal segmentation method for isolated utterances is also proposed in the thesis. It is used to detect the start and end frames of an utterance from an image sequence. The technique is based on a pair-wise pixel comparison method. The efficiency of the proposed technique was tested on the available dataset, which contains short pauses between utterances.


List of Tables

Table 3.1: Fourteen visemes defined in MPEG-4 and the average number of frames for each viseme of the used dataset.
Table 3.2: Results of temporal segmentation for 14 visemes (three epochs).
Table 3.3: Average frame error rate of 10 epochs at the start of an utterance.
Table 3.4: Average frame error rate of 10 epochs at the end of an utterance.
Table 4.1: Average vertical and horizontal movement for an utterance.
Table 4.2: Average classification results of seven different block sizes.
Table 6.1: Average classification results of three different numbers of ZMs.
Table 6.2: Zernike moments from 0th to 14th order.
Table 7.1: Average classification results of 14 visemes in terms of specificity, sensitivity and accuracy, block size 24040 (all values in %).
Table 7.2: Average classification results of individual one-vs-rest binary class SVM for 14 visemes (all values in %).
Table 7.3: Confusion matrix for 14 visemes using hierarchical multi-class SVM.
Table 7.4: Classification results of individual one-vs-rest class SVM using ZM (all values in %).
Table 7.5: Classification results of individual one-vs-rest class SVM using Hu moments (all values in %).
Table 7.6: Comparison of DMHI vs MHI (all values in %).


List of Figures

Figure 1.1: A general pattern recognition procedure.
Figure 2.1: Ultrasound and video camera based speech recognition system.
Figure 2.2: Placement of NAM microphone for silent communication. Vibration of the vocal-tract resonance is captured from the tissues, generated by airflow noise in the constricted laryngeal airflow.
Figure 2.3: Human speech production system.
Figure 2.4: Different visemes but acoustically similar phonemes: image (a) shows viseme /m/ and image (b) shows /n/.
Figure 2.5: Basic components of a VSR system.
Figure 2.6: A block diagram of the proposed visual speech recognition system.
Figure 3.1: Example images of seven subjects with different skin tones.
Figure 3.2: Results of temporal segmentation: (a) squared mean difference of accumulated frame intensities, (b) result of smoothing the data with a moving average window, (c) result of further smoothing by Gaussian filtering, (d) segmented data (blue blocks indicate starting and ending points).
Figure 3.3: Results of temporal segmentation of all 14 visemes for a single user using the proposed method.
Figure 4.1: Example of optical flow computation using two consecutive images.
Figure 4.2: System flow diagram of the non-overlapping block based approach.
Figure 4.3: Vertical block based approach.
Figure 4.4: Direction of movement of each muscle and the block arrangement used for optical flow feature extraction.
Figure 5.1: Development of MHI images for two utterances /a/ and /m/.
Figure 5.2: Conceptual framework of optical flow vector separation into four directions.
Figure 5.3: (a) Four directional motion history images (DMHIs) of the first three frames of an utterance /a/, (b) complete DMHIs of an image sequence of an utterance /a/.
Figure 6.1: Shape based features represent the physical shape of the mouth such as height and width, given in (a). Appearance based features consider the complete ROI, shown in (b).
Figure 6.2: The square-to-circular image coordinate transformation of a motion template before Zernike moment computation.
Figure 6.3: Representation of linearly separable data of two classes; class 1 is represented by grey dots and class 2 by pink dots. An optimal hyper-plane separates the two classes. The four support vectors are shown as black and dark pink dots.
Figure 7.1: Development of the optical flow vertical component based motion template (block size is 24040).
Figure 7.2: Representation of vertical component division in rectangular blocks.
Figure 7.3: Development of the optical flow vertical component based motion template (block size is 4840).
Figure 7.4: Cross validation results: (a) accuracy, (b) specificity and (c) sensitivity.


Abbreviations and Acronyms

AAM Active Appearance Model

ANN Artificial Neural Networks

ASM Active Shape Model

ASR Automatic Speech Recognition

AVSR Audio Visual Speech Recognition

DCT Discrete Cosine Transform

DMHI Directional Motion History Image

DTW Dynamic Time Warping

DWT Discrete Wavelet Transform

EEG Electroencephalography

EGG Electroglottograph

EMA Electromagnetic Articulography

EMG Electromyography

FFT Fast Fourier Transform

GPU Graphics Processing Units

HM Hu Moment

HMHH Hierarchical Motion History Histogram

HMM Hidden Markov Model

HSV Hue Saturation Value

ICA Independent Component Analysis

KNN K Nearest Neighbour

LDA Linear Discriminant Analysis

LGE Lip Geometry Estimation

MHI Motion History Image


MLP Multilayer Perceptron

MMHI Multilevel Motion History Image

MRF Markov Random Field

MSA Multi-scale Spatial Analysis

NAM Non-Audible Murmur

PCA Principal Component Analysis

RBF Radial Basis Function

RGB Red Green Blue

ROI Region of Interest

SVM Support Vector Machine

VSR Visual Speech Recognition

WER Word Error Rate

ZM Zernike Moment


Chapter 1

Introduction

Researchers have always aimed to develop sophisticated machines that can perform tasks like human beings. The motivation for this effort comes from the practical need to accomplish intellectual tasks in an efficient way. These intellectual tasks are based on the realization, evaluation and interpretation of information from sensors, and can be likened to perception. Perception is the process by which human beings attain knowledge about the environment, process it and react to it accordingly [1]. It depends on complex functions of the nervous system, but is performed effortlessly. It is nearly impossible to know the exact intrinsic mechanism of perception. However, scientists have been attempting to develop computer algorithms that replicate human intelligence, and the field is commonly known as artificial intelligence. Pattern recognition is the core area of artificial intelligence that enables machines to accept external inputs and react accordingly. Because the mechanisms behind such understanding are unknown, the pattern recognition field provides the means to exploit abstract mathematical models of perception.

A human's ability to lip-read by perceiving the lips, teeth and tongue is one link in the chain of perception. It provides useful visual information about speech. Around four decades ago, studies by Hazard [2] and Jeffers and Barley [3, 4] demonstrated that a human listener always gains an advantage from visual cues, such as lip and tongue movements, and also from facial expressions and hand gestures, to increase the level of speech intelligibility in noisy conditions. The process of using the visual modality alone is often referred to as lip-reading or speech reading, that is, perceiving what someone is saying by watching their lip movements. Motivated by this human ability, researchers have aimed to develop audio-visual speech recognition (AVSR) systems.

The most natural and easy way of communication between humans is speech. Due to its naturalness, much research has been conducted to develop speech based human computer interfaces (HCIs) [5-7], where speech is the basic mode of interaction with machines. While these systems have shown promising success in well defined applications such as call centres, dictation or call routing in reasonably noiseless environments, they are yet to reach the point where they can be deployed anywhere and everywhere. The major reason for this is susceptibility to noise, which degrades the performance of audio speech recognition systems. Noise robust algorithms such as feature compensation [8], implementations of microphone arrays [9], non-local means denoising [10], variable frame rate analysis [11] and noise adaptation algorithms [9, 12, 13] have shown significant improvements in speech recognition under noisy environments; however, such algorithms are still not fully robust to noise due to the difficulty of modelling the random characteristics of non-stationary noise.

In recent years, an alternative method to overcome the limitations of audio speech recognition has gained more interest: the use of a multimodal conversational system. Besides speech input, multimodal conversational systems also support inputs from other modalities such as vision [14] and facial muscle activity [15] to identify the utterances. Compared to conventional speech-only interfaces in spoken dialog systems, multimodal conversational interfaces provide better interpretation of user input due to mutual disambiguation among complementary modalities [16]. However, sensor based systems have an obvious disadvantage in that they require the user to place sensors on the face, and one of the limitations of muscle monitoring approaches is their low reliability. Visual modality based systems are non-intrusive, users are not required to place sensors on the face, and they have therefore emerged as the more desirable option.

Visual speech information is being used to increase the robustness of speech recognition. Much research has been conducted in which visual signals are fused with existing audio-only speech recognition (ASR) systems to augment the audio recognition [14, 17-22]. The McGurk effect [23] demonstrates that inconsistency between audio and visual information can result in perceptual confusion. Visual information plays an important role especially in noisy environments. AVSR systems are useful for many applications; however, such systems are not suitable for people with hearing and speech impairments.


The contribution of audio features in speech recognition systems still plays a more important role than that of visual features. However, in some cases it is hard to extract pertinent information from the acoustic signal. There are many applications in which it is essential to identify the desired speech signal under extremely adverse acoustic conditions, such as detecting a person's speech through a glass window, from a distance, or when a person is speaking in a very noisy crowd or in a factory. In addition, if there is no assisting sign language for a TV broadcast or speech, visual information is the only source of information for hearing impaired people. In such applications, the performance of traditional speech recognition is very limited.

There are a few works focusing on lip movement representations for speech recognition solely from visual information [24-27]. This is known as visual speech recognition (VSR). This research is an incremental effort in the field of VSR. Generally, VSR systems comprise several pattern recognition stages; among them, the most important step deals with the computation of the visual features, which are extracted in order to produce a compact representation describing either the visual appearance or the shape of the lips in each image. The result of the feature extraction is used to generate feasible visual speech models that represent the lip motions during the speech process. Hence the main focus of this research is visual feature extraction.

1.1 Pattern Recognition

Pattern recognition deals with the assignment of a label to a given input based on its observable information, such as the intensity values of an image or the frequency components of an EMG signal. Generally, a pattern recognition system is divided into three main components, as shown in Figure 1.1. A sensor (a camera in this research) transforms the observable information, such as images or sounds, into signals that can be analysed by a computer system. A feature extraction component computes the signal properties (features) suitable for classification. Finally, the extracted features are passed to a classifier which determines the class of the signal based on certain types of measures, such as distance, likelihood or Bayesian probability.


1.1.1 Feature Extraction

The main objective of the feature extractor is to transform the input signals into a set of properties that are very similar for signals of the same category and very distinctive for signals of different categories. This leads to the idea of obtaining robust features that are invariant to irrelevant transformations such as rotation and scaling of the input signal. In this research, rotation and scale invariant features are obtained to compensate for the varying view angle, orientation and distance of the camera from the subject.
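To make the notion of rotation and scale invariant descriptors concrete, the short sketch below computes Hu's seven moment invariants (one of the descriptor families used later in this thesis) for a synthetic shape and for a rotated copy of it, using OpenCV. It is an illustrative example only, not the implementation used in this work; the test image, the ellipse and the rotation angle are arbitrary choices.

```python
# Illustrative only: Hu moment invariants of a shape and of a rotated copy.
# The shape, image size and rotation angle are arbitrary choices for this demo.
import cv2
import numpy as np

img = np.zeros((128, 128), dtype=np.uint8)
cv2.ellipse(img, (64, 64), (40, 20), 0, 0, 360, 255, -1)   # a filled ellipse

M = cv2.getRotationMatrix2D((64, 64), 30, 1.0)             # rotate by 30 degrees
rotated = cv2.warpAffine(img, M, (128, 128))

hu_original = cv2.HuMoments(cv2.moments(img)).flatten()
hu_rotated = cv2.HuMoments(cv2.moments(rotated)).flatten()
print(hu_original)
print(hu_rotated)   # nearly identical: the descriptor is rotation invariant
```

The two printed vectors agree to within small numerical differences caused by resampling, which is exactly the property that makes such moment descriptors attractive when the camera angle or distance varies.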

Figure 1.1: A general pattern recognition procedure (input → sensor → feature extraction → classification → recognition)

1.1.2 Classification

The task of a classifier is to assign an input feature vector to one of the existing classes, based on specific classification measures. Conventional classification measures include distance (Mahalanobis or Euclidean distance), likelihood and Bayesian a posteriori probability. The decision boundaries generated by these measures are linear and the resulting classifiers are hence known as linear classifiers. However, such classifiers are limited to linearly separable data, are unable to handle complex non-linear decision boundaries, and have little computational flexibility. The support vector machine (SVM) is a classifier with a non-linear formulation, developed from the observation that dot products can be computed efficiently in a higher dimensional feature space [28]. Classes that are non-linear and inseparable in the original space can be linearly separated by mapping them to a new, higher dimensional feature space. Because the SVM can handle classes with complex non-linear decision boundaries, it was chosen as the visual speech classifier in this research.
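As an illustration of this idea, the minimal sketch below trains an RBF-kernel SVM on synthetic two-dimensional data that is not linearly separable in the input space. It uses scikit-learn rather than the tools of this thesis, and the ring-shaped classes and kernel parameters are assumptions made purely for demonstration.

```python
# Illustrative sketch: an RBF-kernel SVM separating two ring-shaped classes
# that no linear boundary in the original 2-D space can separate.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def ring(radius, n):
    # points scattered around a circle of the given radius
    angles = rng.uniform(0.0, 2.0 * np.pi, n)
    radii = radius + 0.1 * rng.standard_normal(n)
    return np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])

X = np.vstack([ring(1.0, 200), ring(2.0, 200)])
y = np.array([0] * 200 + [1] * 200)

# The RBF kernel implicitly maps the samples into a higher dimensional space
# where a separating hyper-plane exists (the "kernel trick").
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma="scale"))
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```

In the experiments reported later in the thesis, the same principle is applied to viseme feature vectors, with binary SVMs combined in a one-vs-rest multiclass scheme.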


1.2 Motivation

Studies of human speech production and perception have shown that the visible movements of the speaker's lips and face are an important factor in human communication. Hearing impaired persons make extensive use of visual cues, and some individuals perform lip-reading to a level which enables almost perfect speech perception [29]. Normal hearing subjects also benefit from visual information to improve their speech perception, especially in noisy environments [30]. This is the motivation behind visual speech recognition. Reisberg et al. [31] have shown that normal-hearing subjects who see the talker's face perceive speech more accurately, even in noise free environments. Motivated by these psychological studies, several researchers [14, 18, 22, 32-34] have developed speech reading systems, mainly to demonstrate the potential of visual information to improve the robustness of acoustic speech recognition systems in noise. While these systems have validated the benefit of visual speech information, much discussion is still required about which visual features are important for visual only speech reading, how to extract them, and how to represent them automatically in a robust manner.

1.3 Scope of Thesis

Obtaining robust visual speech features is a difficult task due to the varying appearance of different persons and due to appearance variability during speech production. Varying illumination and orientation of the face cause further difficulties in feature extraction. For real world applications, whether in an office, a factory or a railway station, environmental noise creates additional problems. Whereas the extraction of acoustic speech features is fairly well established, the important visual speech features for lip reading are relatively unknown and the investigation of different feature extraction methods is still subject to ongoing research. In an attempt to solve this problem, the work in this thesis has concentrated solely on the visual signal; to avoid environmental noise, the acoustic signal is not used. The work has focused on researching and developing methods of robust visual feature extraction.


Considering the above facts, the scope of this thesis was to investigate the important visual features of speech as they relate to human perception. The motion based optical flow estimation method was adopted, which gives the motion estimate between consecutive images, analogous to human perception. The emphasis of the method is to provide robust features which perform well for different subjects without the use of artificial markers or sensors on the subject's face.

1.4 Aims and Objectives of the Research

Inspired by the human perception of lip-reading, the general objective of this research can be defined as: "To design a visual only speech recognition system based on robust motion features that can recognize a limited vocabulary dataset at viseme level".

One common difficulty with AVSR systems is their high error rates due to the large variation in the way people speak, especially across national and cultural boundaries [35], making them very subject dependent and therefore unsuitable for multiple subject applications. There is also a difference in the speed of speaking of a subject when repeating an utterance. To overcome this issue, the proposed system is made robust to the speed of speech by normalizing each utterance to a fixed number of features. The other concern with these systems is their lack of robustness against variations in lighting conditions and in the angle and distance of the camera, along with their sensitivity to variations in the colour and texture of the speaker's skin. Both of these problems prevent the wide deployment of such systems, and techniques previously reported in the literature exhibit both of them [26]. Further, the majority of video based techniques reported to date are based on a multimodal configuration and have been developed for speech enhancement, not as stand-alone visual speech recognition systems [34, 36, 37].

In this research, two novel methods of feature extraction are proposed that overcome the above shortcomings. These methods depend solely on visual features so that acoustic noise will not affect the system. The proposed techniques use optical flow estimation, which measures the movement of the mouth between consecutive images and is insensitive to background and lighting conditions. In the first approach it is concluded that the optical flow vertical component retains most of the speech information compared to the horizontal component. Hence, the optical flow vertical component has been divided into blocks that represent different sections of the mouth, and the average directionality of each block of optical flow has been used as the feature to represent the movement.
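The fragment below is a minimal sketch of this block based idea, assuming grayscale mouth-region frames: a dense optical flow field is estimated between two consecutive frames, its vertical component is divided into a grid of non-overlapping blocks, and the mean vertical motion of each block is taken as a feature. OpenCV's Farnebäck estimator and the 4x4 grid are stand-ins chosen for illustration; the optical flow computation and block layout actually used in this work are described in Chapter 4.

```python
# Sketch of block-averaged vertical optical flow features. The Farneback
# estimator and the 4x4 grid are illustrative stand-ins, not the thesis setup.
import cv2
import numpy as np

def vertical_flow_block_features(prev_gray, curr_gray, blocks=(4, 4)):
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, curr_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    v = flow[..., 1]                       # vertical component of the flow field
    h, w = v.shape
    bh, bw = h // blocks[0], w // blocks[1]
    feats = []
    for r in range(blocks[0]):
        for c in range(blocks[1]):
            block = v[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw]
            feats.append(block.mean())     # average vertical motion in this block
    return np.array(feats)
```

Concatenating these block features over the frames of an utterance, and resampling them to a fixed length, gives a feature vector whose size does not depend on the speed of speech.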

In the second approach, four optical flow based directional templates are developed, each representing the consolidated motion information of an utterance in a particular direction (up, down, left and right). This approach is an enhancement of a view based approach known as the motion history image (MHI), which implicitly encodes the temporal component of an image sequence into a scalar valued motion template. One of the key constraints of the MHI method is its motion overwriting problem due to self-occlusion, which happens when motion is repeated at the same location at different times within the utterance. The DMHI technique addresses this overwriting issue by using four directional motion history images rather than the single image used in the MHI technique. Two types of image descriptors, Zernike moments and Hu moments, are used to represent each image of the DMHIs. These descriptors were selected for their rotation and scale invariance, which helps to cope with varying view angles and distances of the camera from the subject.
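The sketch below shows one plausible way to accumulate such directional templates: the optical flow between consecutive frames is split by sign into up, down, left and right components, and each direction maintains its own decaying history image. The motion threshold and the simple linear decay are illustrative assumptions, not the exact formulation developed in Chapter 5.

```python
# Sketch of directional motion history image (DMHI) accumulation. The flow is
# split by sign into up/down/left/right masks; each direction keeps its own
# decaying history image. Threshold and decay are illustrative choices.
import numpy as np

def update_dmhi(dmhi, flow, tau, thresh=0.5):
    """dmhi: dict of four H x W float arrays; flow: H x W x 2 array (x, y)."""
    fx, fy = flow[..., 0], flow[..., 1]
    masks = {
        "right": fx > thresh, "left": fx < -thresh,
        "down": fy > thresh, "up": fy < -thresh,   # image y axis points downwards
    }
    for direction, mask in masks.items():
        hist = dmhi[direction]
        hist -= 1.0                  # age the existing history
        hist[mask] = tau             # stamp current motion with the maximum value
        np.clip(hist, 0.0, tau, out=hist)
    return dmhi
```

For each utterance the four images would start from zero and be updated frame by frame, with tau set to the number of frames; rotation and scale invariant descriptors such as Zernike or Hu moments can then be computed from each of the four resulting images.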

A support vector machine (SVM), a state-of-the-art classifier, has been used to classify the viseme features obtained from the optical flow vertical component and from the Zernike and Hu moments separately, and for identification a multi-class SVM approach has been employed.

Another shortcoming of automatic analysis of video data is the need for manual intervention to segment the video. While audio assisted video speech analysis uses audio cues for segmentation, this is not possible in video only speech recognition. One achievement of this work is that automatic segmentation of the visual speech data has been achieved solely from the video signal. The automatic segmentation of the video data is based on a pair-wise pixel comparison method [38] to identify the start and end frame of each utterance.
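A minimal sketch of the pair-wise pixel comparison idea is given below: a mean-squared frame-difference signal is computed for consecutive frames, smoothed, and thresholded to locate the start and end frames of an utterance. The moving-average window and the relative threshold are assumptions for illustration; the actual method, including the additional Gaussian smoothing shown in Figure 3.2, is detailed in Chapter 3.

```python
# Sketch of video-only temporal segmentation by pair-wise pixel comparison.
# Window length and threshold are illustrative assumptions.
import numpy as np

def segment_utterance(frames, window=5, rel_thresh=0.2):
    """frames: sequence of grayscale frames (H x W) covering one utterance."""
    frames = np.asarray(frames, dtype=np.float32)
    diffs = np.mean((frames[1:] - frames[:-1]) ** 2, axis=(1, 2))   # frame-to-frame change
    kernel = np.ones(window) / window
    smooth = np.convolve(diffs, kernel, mode="same")                # moving-average smoothing
    active = smooth > rel_thresh * smooth.max()                     # frames with visible motion
    idx = np.flatnonzero(active)
    if idx.size == 0:
        return None                                                 # no motion detected
    return int(idx[0]), int(idx[-1] + 1)                            # approximate start/end frames
```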

1.5 Outline of Thesis

This thesis has mainly focused on feature extraction and temporal segmentation algorithms, after which classification of the extracted features has been performed. The outline of the thesis is as follows:

Chapter 2 reviews the literature of speech recognition, including non-audio speech modalities and audio-visual speech recognition, detailing its history and progression. This chapter also contains the physiological and linguistic aspects of speech production and visual speech perception of humans. This chapter describes the basic components of a visual speech recognition system and finally a brief description of the proposed VSR system is presented.

Chapter 3 is concerned with image and video pre-processing to make the videos suitable for visual feature extraction. It gives a brief introduction to visual front end processing, including a literature review of face and region of interest (ROI) extraction, and provides a description of the dataset used in this study. It also describes a new ad hoc method for detecting the start and end frame of an utterance in the used dataset.

Chapter 4 presents one of the main contributions of this research. Generally, the visual features for lip-reading are extracted from static mouth images and characterized as appearance and shape based features. As opposed to shape and appearance based features, which describe the underlying static shape of the mouth, time-based motion features directly represent the dynamics of mouth movement across the video frames, which is analogous to human perception. Optical flow based motion features are extracted and, based on these features, a novel block based approach to represent the mouth motion is presented. Inter- and intra-subject variations in the speed of speech can give rise to inexact viseme recognition; this variation is normalized by a two phase approach.


Chapter 5 presents a further contribution of this research and describes the basic theory of the motion history image. This leads to the development of optical flow based directional motion history images (DMHIs), a novel approach that represents mouth movement by four directional motion templates. Advantages and disadvantages of the proposed technique are then discussed.

Chapter 6 provides a review of the visual feature extraction techniques for VSR. It describes two important rotation and scale invariant image descriptors, Zernike and Hu moments, which were used to extract the important features of the DMHIs. Finally, the classification of visual speech is broached, with a detailed description and justification of the support vector machine (SVM) classifier as the chosen approach.

Chapter 7 reports on the experimental setup and methodology for classification. This chapter also evaluates the performance of the proposed motion templates computed from the optical flow vertical component and of the directional motion history images. In addition, the results of the proposed DMHI technique are compared with the underlying MHI technique; the results indicate that the performance of DMHI was better than that of MHI.

Chapter 8 concludes the thesis and provides the future directions in this research topic.

1.6 Publications Resulting from this Research

Journals

1. A. A. Shaikh, D. K. Kumar, & J. Gubbi, "Visual Speech Recognition Using Optical Flow and Support Vector Machines", International Journal of Computational Intelligence and Applications, 10, pp. 167-187, 2011 (ERA-A).
2. Ayaz A. Shaikh, D. K. Kumar, & J. Gubbi, "Automatic Visual Speech Segmentation and Recognition Using Directional Motion History Images and Zernike Moments", submitted to The Visual Computer (ERA-B).


International Conferences

3. Ayaz A. Shaikh, Dinesh K. Kumar, Premith Unnikrishnan, Jayavardhana Gubbi, "Visual Speech Recognition using Directional Motion History Images", published in Proceedings of the International Conference on Intelligence and Information Technology (ICIIT'10), Lahore, Pakistan, 28-30 October 2010. ISBN: 978-1-4244-8138-5.
4. Ayaz A. Shaikh, Dinesh K. Kumar, Wai C. Yau, M. Z. Che Azemin, Jayavardhana Gubbi, "Lip Reading using Optical Flow and Support Vector Machines", published in Proceedings of the 3rd International Conference on Image and Signal Processing (CISP'10), Yantai, China, 16-18 October 2010. ISBN: 978-1-4244-6514-9.

Workshop Posters

5. Ayaz A. Shaikh, Jayavardhana Gubbi, Dinesh K. Kumar, Marimuthu Palaniswami, "Robust motion features for visual speech recognition", poster presentation at the 5th International Conference on Intelligent Sensors, Sensor Networks and Information Processing, Melbourne, 7-10 December 2009.
6. Ayaz A. Shaikh, Dinesh K. Kumar, Jayavardhana Gubbi, "Visual Speech Recognition Using Optical Flow", HCSNet Workshop on Movement and Motion Capture, Macquarie University, Sydney, Australia, 25-26 September 2009.


Chapter 2

An Overview of AVSR Systems

Audio visual speech recognition (AVSR) is a combination of diverse research fields such as linguistics, the physiology of human speech production, and the psychology of human perception. In addition, the pattern recognition, video processing and computer vision fields are also incorporated in the development of AVSR systems. It is important for researchers in the field of AVSR to have some basic knowledge of linguistics, human speech production and human perception. A sound knowledge of the key elements of the other fields mentioned is also necessary, so that computer based visual speech recognition can be optimized towards the level of human visual speech perception.

This chapter provides an overview of AVSR and VSR systems. Before discussing AVSR, a brief overview of speech recognition using non-audio speech modalities is presented. The chapter then focuses on human speech production and the linguistics of visual speech, followed by human visual speech perception. Pertinent regions of the face that contain the most important visual cues are also discussed. Finally, the chapter highlights the basic components of a general visual speech recognition system, followed by a brief review of the proposed visual speech recognition system.

2.1 Introduction

Human speech perception is greatly improved by observing a speaker's lip movements, as distinct from listening to the voice alone [39]. Mainstream automatic speech recognition (ASR) systems have focused exclusively on the latter: the acoustic signal. Recent advances have led to purely acoustic-based ASR systems yielding excellent results in quiet or noiseless environments [40]. As a result, these systems have been used for a variety of applications, such as call centres, car navigation systems, audio based phone dialling, dictation and translation assistance, and applications ranging from speech and language technologies to intelligence and military services. In the real world, however, their performance drops dramatically due to the presence of noise such as:

• in a typical office environment: human conversations;
• in industry: the noise of machinery;
• on the road: the noise of traffic;
• at a railway station: the noise of trains as well as the announcements of train arrivals and departures.

Noise robust algorithms such as feature compensation [8], feature-normalization algorithms [13], variable frame rate analysis [11], microphone arrays [9] and other approaches [12] have demonstrated limited success in this regard.

To overcome this limitation, non-audio speech modalities have been considered, such as visual information [41], facial plethysmograms measuring intra-oral pressure, and surface electromyography (sEMG) signals of a speaker's facial muscles [42, 43], to augment acoustic information [14]. The sensor based systems require sensors to be placed on the face of the speaker and are thus intrusive and impractical in most situations, whereas audio-visual systems are not suitable for people who are unable to produce sound due to speech impairments. In such situations, visual only speech recognition systems are the solution. Visual speech recognition is the core focus of this dissertation. In past decades considerable research effort has been applied to the development of visual based speech recognition systems. Systems that recognize speech from the shape and movement of the speech articulators, such as the lips, tongue and teeth of the speaker, have been considered to overcome the shortcomings of other speech recognition modalities [26].

2.2 Non-Audio Speech Modalities

A number of options based on non-audio speech modalities have been proposed to overcome the shortcomings of speech recognition systems, and are collectively referred to as silent speech interfaces (SSIs) [44]. A brief review of these SSIs is given below:


• Electromagnetic Articulography (EMA) is a sensor based movement capturing technique which captures the movement of fixed points on the articulators. The shape of the vocal tract is an important part of speech production. Various researchers have considered monitoring the movement of a set of fixed points within the vocal tract by implanting coil sensors. Fagan et al. [45] investigated a system in which magnetic sensors were attached inside the mouth of a subject on the tongue, lips and teeth, and a set of six dual axis magnetic sensors were fixed on a pair of spectacles. In order to measure the performance of the system, the subject was asked to repeat a set of 9 and 13 phonemes, each ten times. A 90% recognition rate was achieved under laboratory conditions.
• The most important articulator of the speech production system is the tongue, which is almost invisible while speaking. In the Ouisper project [46], an ultrasound imaging technique was used to capture the movement of the tongue during speaking. Ultrasound imagery is a non-invasive and clinically safe procedure. An ultrasonic probe placed under the chin can provide a partial view of the tongue surface in the mid-sagittal plane. In their work the ultrasound imaging system is combined with a standard video camera focused on the speaker's lips. Visual features from these two modalities are used to drive a speech synthesizer, known as a "silent vocoder", as illustrated in Figure 2.1.


Figure 2.1: Ultrasound and video camera based speech recognition system

• NAM (Non-Audible Murmur) systems are based on a special acoustic sensor (microphone) attached to the speaker's skin, just behind and below the ear (see Figure 2.2), over the soft tissue in the orofacial region. It senses the low amplitude sounds produced by laryngeal airflow noise and its resonance in the vocal tract [47, 48]. The sensor is capable of picking up low amplitude sounds that can hardly be perceived by nearby listeners, and is insensitive to environmental noise. This insensitivity to noise has motivated researchers in the speech recognition area to use NAM sensors as one solution for robust speech recognition in noisy environments [49-52].

Figure 2.2: Placement of NAM microphone for silent communication. Vibration of the vocal-tract resonance is captured from the tissues, generated by airflow noise in the constricted laryngeal airflow.

• Use of a glottal activity sensor is another approach to the problem of de-noising speech signals corrupted by background noise. Glottal waveforms, usually obtained from the throat, forehead, crown of the head or ear, can be used in conjunction with the acoustic signal received from a standard close talk microphone to augment the acoustic speech signal and improve speech quality [53]. A variety of vibration and electromagnetic sensors have been introduced, such as:
  o Physiological microphone
  o Bone microphone
  o Throat microphone
  o In-ear microphone


The electroglottograph (EGG) [54-56] is a non-invasive measurement device designed to detect changes in electrical impedance during voice production. It is based on two gold-plated electrodes applied to the surface of the neck on either side of the larynx. Typically, changes in electrical impedance are produced by the vibrating vocal folds: when the vocal folds are closed, the electrical impedance decreases, whereas a larger impedance is produced when they are open. Glottal vibration in this way induces a signal of some 1 V RMS on a 2–3 MHz carrier, which is quite readily detectable. A drawback of the technique is its sensitivity to the exact positioning of the electrodes.

• Electromyography (EMG) is the process of recording electrical muscle activity using surface electrodes. When a muscle fibre is activated by the central nervous system, small electrical currents in the form of ion flows are generated. These electrical currents move through the body tissue, encountering a resistance which creates an electrical field. The resulting potential difference can be measured between certain regions on the skin surface. The amplified electrical signal obtained from measuring these voltages over time can be fed directly into electronic devices for further processing. EMG signals have been used for many clinical applications, including identifying neuromuscular diseases, diagnosing low back pain and measuring motor control disorders, as well as for the control of prosthetic devices such as hands and arms. Whereas a lot of research has been conducted since 1985 in the area of speech recognition, surface EMG based methods have received attention more recently. Surface EMG signals associated with the speech muscles record the activity of human articulation and thus allow one to trace back a speech signal even if it is unspoken [42, 43]. Recently, researchers have focused on overcoming the limitations of sEMG based speech recognition systems. Schultz and Wand [15] used alternative articulatory phonetic features to improve the classification results, whilst Walliczek et al. [57] and Schultz and Wand [15] used acoustic units smaller than words, enabling large vocabulary recognition by concatenating these smaller units.


• Electroencephalography (EEG) is the measurement of the potential differences (voltages) corresponding to different neuronal activities in the brain, recorded by attaching multiple electrodes to the scalp. Besides its many clinical applications, researchers have shown that EEG signals are useful for voiceless communication. Brain Computer Interfaces (BCIs) based on EEG signals are a current area of research in the biomedical and pattern recognition fields. Suppes et al. [58] developed the first EEG and MEG (magnetoencephalography) based isolated word recognition system. Wester and Schultz [59] investigated a new approach which directly recognizes "unspoken speech" in brain activity measured by EEG signals. Unspoken speech refers to the process in which a user imagines speaking a given word without actually producing any sound, indeed without performing any movement of the articulatory muscles.

All of the technologies presented above have shown encouraging results as non-acoustic modalities and have, to a certain extent, resolved the issue of background noise. However, the sensor-based approaches are intrusive, impractical in most scenarios and share common challenges: users need to fix electrodes around the face, neck and even inside the mouth, and the sensors must be very precisely positioned to obtain optimal results.

Speech recognition based on a visual speech signal is the least intrusive [14], does not require sensors to be attached to the head or scalp, and is non-constraining and noise robust. Other benefits of a VSR system arise in cases where persons have had laryngectomy surgery, and for weak or aged persons who can speak only with much effort; these people can move their lips easily rather than speaking. At the same time, being able to attend to an important call at any location can be a very useful service. A non-invasive visual speech recognition system built into a mobile phone could address these needs by enabling speechless communication. The hardware for such a system can be as simple as a webcam or a camera built into a mobile phone.

How exactly do humans lip-read and perceive audio-visual speech? This is still a question to be answered. However, it is essential to look at the physiological and psychological aspects of human speech production and perception in order to acquire the necessary information for the replication of speech recognition by machines. In the following sections a review of AVSR is given. This is followed by a brief review of human speech production and perception in the context of physiological processes.

2.3 The Review of AVSR and VSR

This section presents a review of related research on AVSR and VSR systems. AVSR involves the process of interpreting the audio-visual information contained in a video in order to extract the information necessary to establish communication at a perceptual level between humans and computers, whereas VSR systems are based solely on visual speech processing. In a noisy environment, although the voice signal may not be intelligible, the associated mouth movements create sufficient visual cues that can be exploited to develop VSR systems. Engineers have continued research into recognizing speech in noisy environments since the 1890s [60]. This interest increased during the war years of the 1940s and 1950s, when engineers working in this field were attempting to develop a system that could establish communication between pilots and air traffic controllers [61]. The first known work on audio-visual speech recognition was published by Sumby and Pollack in 1954 [30]. Three decades after Sumby and Pollack's publication, in 1984, Petajan [41] presented the first AVSR system. In this system Petajan extracted shape based features such as mouth height, width, perimeter and area from black and white images of the speaker's mouth. The next major advance in AVSR research came a decade later when Bregler and Konig [62] published their work using eigenlips. In the same year Duchnowski et al. [63] extended the eigenlips technique using linear discriminant analysis (LDA) for the visual speech features. In 1997 Adjoudani and Benoit [64] focused on the problem of audio-visual feature fusion.

AVSR is a technology generally based on pre-recorded audio-visual datasets. IBM recorded a high quality audio-visual dataset, and because of this much of the major progress in the field is based on work conducted at IBM, led by Gerasimos Potamianos and his colleagues; the dataset, however, was not publicly available. In addition to large vocabulary experiments, Potamianos et al. in 2003 conducted AVSR experiments in challenging environments, where data was captured in office and in-car scenarios [65]. They demonstrated that performance degrades considerably in both modalities in the challenging environments. In 2004, IBM developed an AVSR system which used a special wearable headset containing an infra-red based audio-visual system [66] that constantly focuses on the speaker's mouth. The motivation behind that work was to develop a real time system which reduces the computational burden of pre-processing, such as face and lip localization, and also reduces the effects of visual variability such as face orientation, lighting effects and background. They found that their approach achieved results comparable to normal AVSR systems. This motivated the present work: the proposed system is based on a dataset recorded by a fixed camera focusing on the user's mouth in an office environment.

Potamianos et al. [14, 67] have presented a detailed analysis of audio-visual speech recognition approaches, their progress and their challenges. Generally, the systems reported in the literature are concerned with advancing theoretical solutions to various subtasks associated with the development of AVSR systems; there are very few papers that have considered the complete system. Generally, the development of AVSR can be divided into the following categories: audio feature extraction, visual feature extraction, temporal segmentation, audio-visual feature fusion and classification of the features to identify the utterance. The proposed visual only system does not have any audio data, and hence audio feature extraction and its fusion with visual features is not related to this work.

The visual feature extraction techniques that have been applied in the development of VSR systems can be categorized into the following: shape-based (geometric), intensity/image-based and a combination of both. These categories are described in detail in Chapter 6. Sagheer et al. [68] compared their appearance based Hyper Column Model (HCM) with the image transform based Fast Discrete Cosine Transform (FDCT) on two different datasets comprising Arabic and elements. They demonstrated that the appearance based HCM technique outperformed the image transform technique in visual speech recognition by 6.25% overall on each of the datasets.


Recently Zhao et al. (2009) [69] introduced local spatio-temporal descriptors, instead of global parameters, to recognize isolated spoken phrases based solely on visual input, obtaining a speaker independent recognition rate of approximately 62% and a speaker dependent result of around 70%.

The appearance based approaches have proved to be inappropriate when the dataset contains non-constant illumination conditions. To cope with this problem, statistical models of both shape and appearance have been considered in the literature. The Active Shape Model (ASM), Active Appearance Model (AAM) and Multiscale Spatial Analysis (MSA) have found wide application in visual speech recognition; these approaches were all proposed by Matthews et al. [24] to extract visual features. Continuing this work on visual feature extraction, Papandreou et al. [17] focused on multimodal fusion scenarios, using audio-visual speech recognition as an example. They demonstrated that their visemic AAM (based on digits 0-9) with six texture coefficients outperforms their PCA-based technique with 18 texture coefficients, achieving word accuracy rates of 83% and 71% respectively. However, these techniques are computationally expensive and require manual intervention for complex sets of landmarks on the face to define the shape of an object or the face. In general, the performance of the intensity based methods is better than that achieved by the shape-based VSR techniques [35]. In addition, intensity-based approaches do not require a priori statistical lip models, which allows the development of computationally efficient VSR systems [24].

Feature extraction methods have also utilized motion analysis of image sequences representing the dynamics of the lips while uttering speech. In comparison to shape based features, global intensity and motion based features are independent of the speaker's mouth shape. Yau et al. [26] have described a lip reading system based on dynamic visual speech features using an approach called the motion history image (MHI) or spatio-temporal template (STT). The MHI is a grey scale image which shows the temporal and spatial location of the movements of the speech articulators occurring in the image sequence. The advantage of spatio-temporal template based methods is that the continuous video frames are compressed into a single grey scale image such that the dominant motion information is retained. The MHI method is inexpensive to compute, as it keeps a history of temporal changes at each pixel location [70]. However, the MHI has a limited memory, due to which the older motion history is soon over-written by new data [71], resulting in an inaccurate lip motion description and therefore leading to low accuracies in viseme recognition. In order to cope with this problem, this research proposes an enhanced version of the MHI in which, rather than a single grey scale image, four directional motion history images (DMHIs) are used to overcome the issue of motion overwriting; this approach is shown to outperform the MHI. A detailed description of the proposed technique is presented in Chapter 5. Iwano et al. [34] and Mase et al. [72] reported lip-reading systems for recognizing connected English digits using optical flow (OF) analysis. Iwano et al. [34] adopted a hybrid approach in which shape and dynamic visual speech features are considered together: lip contour geometric features (LCGFs) and lip motion velocity features (LMVFs) of side-face images are calculated. Two optical flow based LMVFs (the horizontal and the vertical variances of the flow-vector components) are calculated for each frame and normalized to the maximum values in each utterance. The technique achieved digit recognition errors of 24% using a visual-only method with LCGF, 21.9% with LMVF and 26% with LCGF and LMVF combined.
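For reference, the sketch below renders the standard per-pixel MHI update used in temporal template methods: pixels where motion is currently detected are stamped with the maximum value tau, while all other pixels decay. The frame-difference motion mask and its threshold are illustrative assumptions; the sketch also makes the overwriting behaviour visible, since new motion at a location replaces whatever history was stored there.

```python
# Minimal sketch of a single (non-directional) MHI update: detected motion is
# stamped with tau, older history decays by one. Threshold is illustrative.
import numpy as np

def update_mhi(mhi, prev_gray, curr_gray, tau, thresh=15):
    diff = np.abs(curr_gray.astype(np.int16) - prev_gray.astype(np.int16))
    motion = diff > thresh
    mhi = np.maximum(mhi - 1.0, 0.0)   # decay the existing motion history
    mhi[motion] = tau                  # newer motion overwrites older values here
    return mhi
```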

2.4 Anatomy of the Human Speech Production System

In human speech production, the mouth is analogous to an audio filter where a variation in the shape of the mouth cavity and lip movement produces different sounds. The shape of the vocal tract is an important physiological aspect of the human speech production system. The main speech producing organs are the lungs, laryngeal pharynx (beneath the epiglottis), oral pharynx (behind the tongue, between the epiglottis and velum), oral cavity (forward of the velum and bounded by the lips, tongue, and palate), nasal pharynx (above the velum, rear end of nasal cavity), and the nasal cavity (above the palate and extending from the pharynx to the nostrils).

An acoustic wave is produced when inhaled air is exhaled from the lungs and passes via the bronchi and trachea through the vocal folds. This source of excitation can be characterized as voiced or unvoiced, based on the tightened or relaxed condition of the vocal folds. If the vocal fold muscles are tightened, the airflow is modulated by the vocal folds and causes vibration, producing phonated excitation. Correspondingly, if the vocal folds are relaxed, the air pressure does not produce vibration and whispered excitation is produced. Speech produced by phonated excitation is called voiced, and that produced by whispered excitation is called unvoiced. The main articulators of speech production are the vocal cords, tongue, teeth, velum, jaw and lips. Figure 2.3 shows the human speech production system. The lips are the only articulator that is fully visible, whereas the tongue and teeth are partially visible from the frontal view of the face. The movement of other speech articulators, such as the velum and glottis, is invisible. The visual information that can be extracted from a sequence of images encompasses the lips, tongue and teeth. Important information about the invisible articulators, such as the complete movement of the tongue and the vibration of the vocal cords, can be extracted by the sensor based methods briefly explained in Section 2.2.

(Figure labels: nasal cavity, soft palate (velum), oral pharynx, hard palate, uvula, alveolar ridge, tongue back, blade, tip and root, lips, epiglottis, pharynx, larynx.)

Figure 2.3: Human speech production system

The information contained in visual features is not sufficient for complete classification, because phonemes such as dental, palatal and fricative consonants involve the tongue, palate and teeth, which are not fully visible, and this causes classification errors. One way to improve recognition rates is to incorporate contextual information as a post-processing step. However, as humans have clearly adapted to make use of this visual information, as will be shown later in this chapter, the study of how visible speech is related to acoustic speech is important.

2.5 Linguistics of Visual Speech

A phoneme is the smallest basic unit of audible speech that distinguishes one word sound from another. For example, in the words 'cat' and 'rat', the phones /c/ and /r/ have distinct sounds. Just like a phoneme, a viseme is the basic visual unit of speech. Not all phonemes are completely represented by the visible articulators. Commonly, each viseme represents several linguistically distinct phonemes because visual speech is only partially visible [73]. For example, the phonemes /p/, /b/ and /m/ are phonetically different but are represented by the same viseme class. Correspondingly, there are some phonemes that are acoustically indistinct but distinct in the visual domain [74], such as /n/ and /m/, as depicted in Figure 2.4. The typical number of phonemes defined in audio speech recognition is from 40 to 50 [75]. The set of English visemes can be determined by applying statistical methods to cluster the phonemes that each viseme represents [14, 76, 77].


Figure 2.4: Different visemes of acoustically similar phonemes. Image (a) shows the viseme for /m/ and image (b) the viseme for /n/; these phonemes are acoustically similar but visually distinct.

Generally, the number of visemes used in visual speech recognition is in the range of 12 to 20 [36, 75], as compared to about 50 phonemes in English. The number of English visemes varies mainly with the accent, which is influenced by the geographical location, education and age of the speaker. It is difficult to define a standard, universal set of visemes suitable for all speakers [78] from different geographical locations, although a standard, optimized set of visemes for the English language is at least desirable.

2.6 Visual Speech Perception by Humans

The first widely recognised work on the visual contribution to speech recognition in noisy environments was published by Sumby and Pollack [30] in 1954. In their experiments they observed the effects of noise on human speech perception, taking into account the visual information available to the listener. Their experiments confirmed that the audio signal is the basic modality for speech communication; in noisy environments, however, listeners also exploit visual information in the recognition and comprehension of speech. In 1987, Reisberg et al. [31] showed that listeners with normal hearing also take advantage of visual cues in speech recognition even when clear audible speech is spoken. This makes it obvious that visual speech signals contain a significant amount of information. The significance of visual cues is also evident in the lip-reading ability of hearing impaired people. Human speech perception is bimodal, as demonstrated by McGurk and Macdonald in 1976 [23]. They showed that when a person is presented with a video over which a different audio track has been recorded, a third sound can be perceived rather than either of the two actually presented in each modality. This is known as the McGurk effect [23]. The most commonly presented example in the literature is of a person watching a video of a speaker's face pronouncing /ga/ while simultaneously hearing /ba/. The person perceives the sound /da/, different from either of the two actually presented. The reason is that the visual /ga/ is more similar to the visual /da/ than to the visual /ba/, while the acoustic /ba/ is more similar to the acoustic /da/ than to the acoustic /ga/. Thus the human audio-visual sensory information is naturally fused.

Summerfield [60] demonstrated the advantages of lip-reading in human speech perception. Once the listener receives the acoustic signal, which helps in localizing the speaker while the speaker continues to talk, the listener can further localize the articulators of speech and obtain complementary visual information to augment the acoustic signal. Summerfield [60] also demonstrated that many acoustically indistinct phonemes have distinct visemes that can be easily recognized by viewing the speaker's lips. Conversely, he also showed that the phonemes corresponding to a particular viseme can be acoustically quite different but visually the same; for example, the phones /t/ and /d/ have completely different sounds but share the same viseme.

Moreover, the interpretation of a speech signal can be enhanced by the perception of facial gestures. Beyond speech itself, paralinguistic information about the speaker's emotional state can be obtained from facial gestures [79]. Especially where the acoustic signal is unclear because of background noise, watching the facial gestures can enhance speech intelligibility.

2.7 Speech Reading Proficiency

Research has also examined how accurately humans can read speech visually. The most proficient lip-readers observed to date have been hearing impaired persons who must rely solely on speech reading (not sign language) for communication. For these people, the accuracy level for word perception on a set of isolated pre-recorded sentences was approximately 65 to 85% [80]. Although adults with normal hearing are generally less accurate lip-readers, their accuracy levels can also be moderately high [80-82].

2.8 Significance of Facial Parts in Lip-reading

Various researchers have focused on identifying the regions of the face that contain the most important features for speech perception. It is generally accepted that the most informative region of the face is around the lips [39]. Benoit et al [83] demonstrated that the lips carry, on average, two thirds of the speech information, with the rest spread over the speaker's face. They also concluded that adding the region of the jaw to the lips increases human perception of visual speech information, and that the frontal view of a speaker provides more information than the side view. McGrath et al. [84] found that the lips contain more than half the visual information revealed on the face of an English speaker. This validates the focus on the area of the lips as the Region of Interest (ROI) in audiovisual speech recognition systems. Brooke and Summerfield [85] demonstrated that visible articulators such as the tongue and teeth play an important role in the efficient recognition of vowels. Finn [86] found that the size and shape of the lips are the pertinent features for the recognition of consonants. Montgomery and Jackson [87] showed that lip rounding and the areas around the lips hold significant features for vowel recognition; they also verified that there is significant variability between speakers in speaking style and in the way they move their lips and tongue while speaking.

2.9 Basic Components of Visual Speech Recognition System

This section describes the basic components of a visual speech recognition system. In general, a VSR system consists of the following parts:

• Image capturing devices
• Pre-processing
  o Noise filtering
  o Face identification
  o Lip localization
  o Temporal segmentation
• Feature extraction
• Dimensionality reduction of features
• Classification and recognition of speech

The general block diagram of a VSR system is shown in Figure 2.5. The image capturing devices are video cameras. Most of the research in the area of AVSR and VSR is based on pre-recorded datasets, with the intention of eventually developing real-time systems. After obtaining the desired dataset, the next step is noise filtering, which makes the dataset suitable for further processing. In the pre-processing stage, face detection is performed, followed by lip localization. In lip localization, the area around the lips, commonly known as the region of interest (ROI), is extracted from each frame of a video sequence. In this research a pre-segmented dataset, which contains only the ROI, is used. An important step in a VSR system is automatic temporal segmentation, which locates the start and end frames of an utterance from a sequence of images containing multiple utterances.

(Block diagram: Video Camera → Preprocessing → Feature Extraction → Dimensionality Reduction → Classification and Recognition.)

Figure 2.5: Basic Components of a VSR system.

Once the temporal window of an utterance is identified, the next step is to find significant visual features that efficiently represent the dynamics of the mouth movements, either through the visual appearance or through the shape of the lips in each frame of a video. To produce a compact representation of the extracted features suitable for statistical classifiers, feature reduction techniques are generally applied. The last step in the VSR system is to train the classifier in order to develop a model, and finally to test the model with unseen data.

2.10 Brief Description of Proposed VSR System

An overview of the proposed visual speech recognition system is presented here, with the flow diagram depicted in Figure 2.6. The proposed system consists of pre-processing, motion tracking, feature extraction and, finally, classification.

• Pre-processing
Illumination and colour variation induced by the video recording system can create problems in motion estimation by optical flow. A global illumination normalization technique, Gaussian smoothing [88], is adopted to normalize the sequence of images. Following noise filtering, temporal segmentation of the isolated uttered visemes in an image sequence is performed. The main purpose of temporal segmentation is to segment the individual utterances from video data containing multiple utterances, in order to locate the start and end frames. A pair-wise pixel comparison method [38] is used to find the start and end frames of an utterance from a series of isolated utterances. A detailed description of pre-processing is given in Chapter 3. A variety of face detection [89] and lip localization algorithms [90, 91] are available in the literature and hence these are not the focus of this work.

• Motion Tracking

In order to represent the lip movements efficiently, robust visual features are desired. In this work lip movements are represented by features derived from optical flow estimates. Optical flow is defined as the distribution of apparent velocities of movement of brightness patterns in an image [92]. The features computed by optical flow analysis give real motion information. A complete probabilistic model of optical flow, based on statistical learning of both the brightness constancy error and the spatial properties of the flow, was adopted. An elaborate discussion of the optical flow technique is presented in Chapter 4.
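As an illustration only, the snippet below computes a dense flow field between two consecutive ROI frames using OpenCV's Farneback algorithm; this is an assumed stand-in for the probabilistic optical flow model actually adopted in this work, and the file names and parameter values are placeholders.

import cv2

# Two consecutive grey-scale mouth-region frames (file names are placeholders).
prev_frame = cv2.imread("frame_010.png", cv2.IMREAD_GRAYSCALE)
next_frame = cv2.imread("frame_011.png", cv2.IMREAD_GRAYSCALE)

# Dense optical flow: result has shape (H, W, 2) holding per-pixel displacement.
# Arguments: pyramid scale, levels, window size, iterations, poly_n, poly_sigma, flags.
flow = cv2.calcOpticalFlowFarneback(prev_frame, next_frame, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)

u = flow[..., 0]   # horizontal velocity component
v = flow[..., 1]   # vertical velocity component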

27

(Block diagram: Video Data → Preprocessing (Noise Filtering, Temporal Segmentation) → Motion Tracking (optical flow estimation on the segmented video signal) → Feature Extraction, either as a non-overlapping block based approach on the normalized vertical component of optical flow (block means per frame, concatenated over the n frames of an utterance) or as motion templates (DMHIs) in the Left, Right, Up and Down directions followed by Zernike moments → Classification (an SVM model trained on viseme training samples and tested on unseen samples, producing identification of the 14 visemes).)

Figure 2.6: A block diagram of the proposed Visual Speech Recognition system.


• Feature Extraction
The image resolution of the acquired data set is 240 × 320 pixels per frame. The optical flow computation provides horizontal and vertical velocity components of the same size as the image, so that the feature vector size would be 240 × 320 × 2 = 153,600 for every pair of image frames. The feature vector size impacts the classification in terms of both computation and accuracy. In this research, in order to produce a compact representation of the computed optical flow, two novel approaches are proposed. These compact representations of the lip motions during speech are used to generate a feasible visual speech template for each utterance, so that the obvious inter- and intra-speaker variation in the speed of speaking is compensated. The first approach is block based. In this approach only the vertical component of optical flow is used; the horizontal component is ignored due to its negligible contribution to viseme classification. The vertical component of optical flow is segmented into non-overlapping blocks, and each block's mean value is computed and used as a feature for the classifier. In the second approach, four directional motion history images are computed from the optical flow (vertical and horizontal components). Each image represents the consolidated lip movements of an utterance in a particular direction, i.e., up, down, left or right, over the complete time window of the utterance. Furthermore, two rotation and scale invariant feature sets, Zernike moments and Hu moments, are computed for each image representing an utterance (see the sketch after this list). A detailed discussion of each approach is presented in Chapters 4 and 5.
• Visemes Classification
In the final stage, training and testing of the features extracted in the previous phase is carried out using binary-class and multi-class SVMs. Another novelty of the work is the use of multiclass classification, applied to visual speech recognition for the first time and resulting in an improved performance of the proposed VSR system. In addition, results are reported not only in terms of the traditional accuracy or word error rate (WER) measures; sensitivity and specificity are also used as performance measures, which increases the credibility of the proposed classification technique. Details of the classification methodology are given in Chapter 6. Experimental results are presented in Chapter 7.
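As a hedged illustration of the invariant moment features mentioned in the Feature Extraction step above, the snippet below computes the seven Hu moments of a single motion template image with OpenCV; the file name is a placeholder, and the Zernike moments used alongside them would be computed analogously with a dedicated library. This is a simplified sketch, not the exact computation of Chapter 5.

import cv2
import numpy as np

# One directional motion history image (file name is a placeholder).
template = cv2.imread("dmhi_up.png", cv2.IMREAD_GRAYSCALE)

# Seven Hu moments: invariant to translation, scale and rotation of the template.
hu = cv2.HuMoments(cv2.moments(template)).flatten()

# Log-scaling is commonly applied so the seven values have comparable magnitudes.
hu_features = -np.sign(hu) * np.log10(np.abs(hu) + 1e-30)
print(hu_features)   # 7 values per template; four directional templates give 28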

2.11 Summary

In this chapter a detailed review of AVSR and VSR systems is presented. In addition, a brief review of non speech modalities such as EMA, combined features from ultrasound imagery and standard camera imagery, NAM, EGG, EMG and EEG are also described. These techniques have performed well in many clinical and biomedical applications as well as in voiceless speech recognition. From this review it is concluded that speech recognition systems based on visual cues are more suitable for real time systems when compared to sensor based systems, referred to as silent speech interfaces [44]. Sensor based systems are intrusive and impractical in most scenarios, the visual feature based systems which are non intrusive are shown to be preferred methods. Both persons of normal hearing as well as speech impaired persons use facial movements to augment speech intelligibility [23, 30]. The hardware required for a visual speech recognition system can be a simple hand held mobile phone with built in video camera which are commonly available. One of the interesting phenomena described is the McGurk effect, which proves the bimodality of speech perception i.e., audio-visual. The basic mechanism of human speech production, perception as well as the linguistics associated with the audio and visual modalities is then described. It is also accepted that the main pertinent area for visual speech perception and recognition is the lower part of the face (the mouth and jaw) [60, 83, 86, 87]. At the end of the chapter a brief review of general and proposed visual speech recognition systems is presented.


Chapter 3

Image and Video Preprocessing

This chapter presents a basic review of the visual front end of a visual speech recognition (VSR) system and provides a description of the dataset used in this study. The main result of this chapter is a new temporal segmentation method that detects the start and end frames of an utterance, presented in Section 3.3. The building blocks of a VSR system were presented in Chapter 2. The first block of a VSR system, following the video capturing device, is pre-processing. The key component of the pre-processing block is the visual front end. For an automatic visual speech recognition system, the visual front end has to be able to localize the visible speech articulators; this process of mouth localization is known as spatial segmentation. It is widely accepted that the pertinent area of visible speech is the region around a speaker's mouth, known as the region of interest (ROI). The visual front end of a VSR system is responsible for localizing the speaker's face, and then locating and keeping track of the ROI. If the spatial segmentation of the ROI is not performed accurately, erroneous features will be extracted and the overall performance of the system will be degraded. There are some other factors which can affect the performance of a VSR system, such as pose, occlusion and illumination [61]. These are discussed in the following sections.

Much research has been conducted on face detection and ROI localization; hence these aspects are not considered in this work. Furthermore, the dataset used in this research contains only the ROI, so there is no need for ROI localization.

3.1 Visual Front End

Generally, the visual front end is based on three hierarchical processes. Starting with the detection of the speaker's face, essential features such as the mouth corners, nose and eyes are then located. In the third step, based on these features, the location of the ROI is estimated. After successful estimation of the ROI, further image pre-processing can be employed on the ROI to minimize the effects of variable lighting conditions on the optical flow computation, using techniques such as histogram flattening, Gaussian smoothing and balancing of the left-to-right brightness distribution.

Face orientation is an important factor in VSR systems. Typically, image sequences are recorded in frontal or side profile views; however, frontal profiles have shown better performance than side profiles [83]. According to the survey paper of Ming-Hsuan et al. [93], the challenges associated with face detection can be attributed to the following factors:

• Pose. The images of a face vary due to the relative camera-face pose (frontal, 45 degree, profile, upside down), and some facial features such as an eye or the nose may become partially or wholly occluded.
• Presence or absence of structural components. Facial features such as beards, moustaches, or glasses may or may not be present and there is a great deal of variability among these components, including shape, colour, and size.
• Facial expression. The appearance of faces is directly affected by a person's facial expression.
• Occlusion. Faces may be partially occluded by other objects. In an image with a group of people, some faces may partially occlude other faces.
• Image orientation. Face images vary directly for different rotations about the camera's optical axis.
• Imaging conditions. When the image is formed, factors such as lighting (spectra, source distribution and intensity) and camera characteristics (sensor response, lenses) affect the appearance of a face.

Although the factors above show that the task of face detection is quite complex, some efficient algorithms which provide sufficient accuracy in a controlled environment are available and are discussed below.


3.1.1 Face Detection

In recent years, the demand for face detection algorithms has increased greatly because of their use in several automated systems that take images of the human face as an input. Examples include visual speech recognition, video-based surveillance systems, human face/body tracking systems, fully automatic face recognition systems and perceptual human-computer interfaces. Typically, these face detection algorithms are based on classifiers that estimate the presence of a face within a particular window of the image.

Most face detection algorithms are either appearance based or feature based. In recent years, appearance based face detection algorithms that exploit statistical estimation and machine learning methods have shown excellent results among all existing face detection methods. Ming-Hsuan et al. [93] have classified the single image face detection methods into the following four categories.

• Knowledge based methods. These rule-based methods encode human knowledge of what constitutes a typical face. Usually, the rules capture the relationships between facial features. These methods are designed mainly for face localization.
• Feature invariant approaches. These algorithms aim to find structural features that exist even when the pose, viewpoint, or lighting conditions vary, and then use these to locate faces. These methods are designed mainly for face localization.
• Template matching methods. Several standard patterns of a face are stored to describe the face as a whole or the facial features separately. The correlations between an input image and the stored patterns are computed for detection. These methods have been used for both face localization and detection.
• Appearance based methods. In contrast to template matching, the models (or templates) are learned from a set of training images which should capture the representative variability of facial appearance. These learned models are then used for detection. These methods are designed mainly for face detection.

Appearance based face detection techniques include the AdaBoost algorithm [94, 95], the S-AdaBoost algorithm [96], the FloatBoost algorithm [97], Hidden Markov Models (HMM) [98], neural networks [99, 100], the Bayes classifier [101] and Support Vector Machines (SVM) [102, 103]. Viola and Jones [94, 95] developed a robust AdaBoost face detection algorithm which is computationally efficient and detects faces with high accuracy. The FloatBoost algorithm proposed by Li et al. [97] is an improved version of the AdaBoost algorithm for learning a boosted classifier with minimum error rate; however, this method is computationally more demanding than AdaBoost. Readers are directed to the survey paper [89] on recent advancements in face detection.
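For reference only (face detection is not performed in this thesis because the dataset already contains the ROI), the lines below show how a pre-trained Viola-Jones style Haar cascade can be applied with OpenCV; the image file name is a placeholder and the detector parameters are typical defaults, not tuned values.

import cv2

# OpenCV ships a trained frontal-face Haar cascade (Viola-Jones style detector).
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
detector = cv2.CascadeClassifier(cascade_path)

frame = cv2.imread("speaker_frame.png")                 # placeholder file name
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Returns (x, y, width, height) rectangles, one per detected face.
faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)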

3.1.2 Spatial Segmentation

The term spatial segmentation refers to finding the ROI in 2D space. Lip segmentation is an important part of the visual front end of an automatic visual speech recognition system. In such a system, the ROI, that is, the area around the lips, must be detected in each frame of a video sequence to be processed for speech recognition. This procedure is normally carried out by fitting a range of colour models to the image and is followed by face detection and extraction of the ROI surrounding the lips.

Early VSR systems performed lip segmentation in conjunction with the application of artificial markers (lipstick) on the lips [104]. The application of lipstick enables the system to precisely detect the lips in the image data, but this procedure is inappropriate since it is uncomfortable for users, and such VSR systems can be operated only in a constrained environment. Thus, the main research efforts have been concentrated on the development of vision-based lip segmentation algorithms. Many studies have shown that colour information can be successfully applied to identify the skin or face in digital images [105]. The main idea behind this approach is to transform the RGB signal into a new representation where the mouth is clearly visible, so that it can be easily segmented.


To this end, a large number of colour representations have been proposed. Coinaize et al [106] used the hue component of the HSV representation to highlight the red colours which are assumed to be associated with the lips in the image. Later, the HSV colour space was further used by Zhang and Measereau [107] for lip detection. They used prominent peaks in the hue signal as an indicator to locate the position of the lips, then based on the identified lip area, the interior and exterior lip boundaries were extracted using both colour and spatial edge information using a Markov Random Field (MRF) framework. Other approaches carried out the lip detection task in the YCrCb colour space since the facial skin covers a small area of the CrCb subspace [108, 109].
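As a rough illustration of the colour-space idea reviewed above, the sketch below marks reddish (lip-like) pixels by thresholding the hue channel of an HSV image; the threshold values are arbitrary assumptions that would need tuning per dataset, and this is not the specific method of any of the cited works.

import cv2
import numpy as np

bgr = cv2.imread("mouth_roi.png")              # placeholder colour ROI image
hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)     # OpenCV hue range is [0, 179]

# Red hues wrap around zero, so two bands near the ends of the hue axis are combined.
lower_band = cv2.inRange(hsv, (0, 60, 40), (10, 255, 255))
upper_band = cv2.inRange(hsv, (170, 60, 40), (179, 255, 255))
lip_mask = cv2.bitwise_or(lower_band, upper_band)

# A small morphological opening removes isolated skin pixels before contour fitting.
kernel = np.ones((3, 3), np.uint8)
lip_mask = cv2.morphologyEx(lip_mask, cv2.MORPH_OPEN, kernel)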

In 2001, Eveno et al [105] proposed a new colour mixture and chromatic transformation for lip segmentation. In their approach, a new transformation of the RGB colour space and a chromatic map was applied to increase the discrimination between the lips and facial skin. They demonstrated that the proposed approach is able to achieve robust lip detection under non-uniform lighting conditions. Later, Eveno et al [110] introduced a different method where a pseudo-hue [111] was applied for accurate lip segmentation that has been embedded in an active contour framework. They applied the proposed algorithm for visual speech recognition and the results show significant improvement in terms of accuracy in lip modelling.

Another method for mouth segmentation was proposed by Liew [112] in 2003. In this approach, the colour image is transformed into the CIE-Lab and CIE-Luv colour spaces, and then a lip membership map is computed using a spatial fuzzy clustering algorithm. After morphological filtering, the ROI around the mouth can be identified from the face area.

In 2006, Guan [113] improved the contrast between the lip and other face regions using the Discrete Hartley Transform (DHT). In this paper, lips are extracted by applying wavelet multi-scale edge detection across the C3 component of the DHT which takes both the colour information and the geometric characteristics into account.


Most recently, in 2011, Akdemir and Ciloglu [32] proposed a lip localization method in which they used 12 blue markers on the face of a subject. Eight of these markers were placed on the lips and used to extract the shape and position of the lips. Three were located on the nose and cheeks to compensate for head movements by aligning them in straight lines. The last marker was placed on the chin and used to capture the syllabic oscillation of the jaw. A chroma key approach and the Auction algorithm were used to locate the positions of the blue markers and to track each marker through consecutive images.

Because of the availability of such face and lip detection algorithms and the nature of the dataset used, this research has not considered ROI detection. The following section describes the choice of data set used in this research.

3.2 Choice of Utterances

Visemes are the smallest visually distinguishable facial movements when articulating a phoneme, and they can be concatenated to form words and sentences, thus providing the flexibility to extend the vocabulary. This is the basic motivation for choosing visemes as the recognition unit. The total number of visemes is smaller than the number of English phonemes because different phonemes may share a common visible movement [73]. The video of a speaker's face while uttering a phoneme shows the movement of the lips and jaw, whereas the movements of other articulators such as the vocal cords and tongue are often not visible. Hence, each viseme can correspond to more than one phoneme, resulting in a many-to-one mapping of phonemes to visemes.

3.2.1 Dataset Exploited for this Study

The dataset used in this study was recorded by Yau et al. [75] in a typical office environment. Their aim was to collect a dataset that can be used to evaluate the performance of a visual speech recognition system in a real world environment. Publicly available audio-visual speech data sets such as M2VTS [114], XM2VTS [115], the Tulips 1 database [116], the CUAVE database [117] and the GRID database [118] were collected in ideal studio environments with controlled lighting. Image sequences are more affected by illumination noise in an office environment than in an ideal studio environment [75]. The dataset used in this study consists of simultaneous audio and visual recordings of 14 discrete English phonemes/visemes by 7 subjects. Table 3.1 shows the 14 visemes and corresponding phonemes. The visemes of the phonemes used in this research are highlighted in bold font. Table 3.1 also shows example words that can be produced with the visemes used in this research. These visemes were originally defined in the MPEG-4 standard and consist of five vowels and nine consonants. A simple webcam was used to capture the videos, in order to analyse low resolution videos for visual speech recognition. Audio signals were recorded by the inbuilt microphone in the webcam; however, the audio signals are not included in this study. The subjects were asked to speak in front of a fixed camera. The distance between the camera and the face was kept constant at 10 cm. The camera was focused on the mouth region of the speaker and was kept stationary throughout the recordings, and the frontal profile of the ROI was recorded.

Table 3.1: Fourteen visemes defined in MPEG-4 and the average number of frames for each viseme of the used dataset.

No.   Visemes and corresponding phonemes   Vowel/Consonant   Example words                  Average number of frames
1     /a/                                  Vowel             j/a/r                          33
2     /ch/, /j/, /sh/                      Consonant         /ch/ain, /j/oin, /sh/iraz      30
3     /e/                                  Vowel             /e/gg                          35
4     /g/, /k/                             Consonant         /g/reat, /k/ing                28
5     /th/, /D/                            Consonant         /th/row, /th/an                33
6     /i/                                  Vowel             /i/nk                          35
7     /p/, /b/, /m/                        Consonant         /p/late, /b/ed, /m/an          33
8     /n/, /l/                             Consonant         /n/est, /l/ight                38
9     /o/                                  Vowel             t/o/ne                         35
10    /r/                                  Consonant         /r/ain                         37
11    /s/, /z/                             Consonant         /s/un, /z/oo                   34
12    /t/, /d/                             Consonant         /t/icket, /d/oor               24
13    /u/                                  Vowel             p/u/t                          35
14    /f/, /v/                             Consonant         /f/an, /v/an                   40

General Description of the Dataset:
• Number of subjects: 7 (4 male, 3 female)
• Speech data: 14 isolated visemes given in Table 3.1
• Repetitions: 10 epochs of each viseme in a single video
• Total uttered visemes: 980 visemes
• Image resolution: 240 × 320 pixels
• Colour: RGB
• Video sampling rate: 30 frames/second
• Average number of frames per viseme: 33.6 frames

Factors such as window size (240 × 320 pixels), viewing angle of the camera, background and illumination were kept constant throughout the experiments. Figure 3.1 shows example images of all the subjects used in this study with different skin colours.

Figure 3.1: Example images of seven subjects with different skin tones.


3.3 Temporal Segmentation

It is important for any end-to-end visual speech recognition system, whether based on continuous or discrete speech, to find the start and end frames of the basic unit of visual speech from a consecutive image sequence, so that the features corresponding to that unit can be extracted separately. It is difficult to visually segment continuous speech into basic visemes. However, a recent experimental study by Sell and Kaschak [119] at Florida State University demonstrated that the presence of visual speech information alone is sufficient to allow learners to segment words from a fluent speech stream. The dataset used in this research consists of discrete utterances, i.e., deliberate short pauses were inserted between the utterances during recording so that individual utterances can be segmented from the video data. The aim of temporal segmentation is to find the start and end frames of an utterance automatically from an image sequence. The importance of temporal segmentation is evident from data sets such as TULIPS1 and CUAVE, which have been manually segmented and form the basis of much of the current research. However, for real-time implementation of these systems, it is important to perform automatic segmentation without human intervention. Earlier works in which automatic segmentation was performed have typically considered the combination of audio and visual data, and thus the temporal speech segmentation in AVSR systems is based on audio signals [120]. The amplitude of the audio signal provides sufficient cues about speech and silence.

Temporal segmentation based on the visual signal only is necessary for VSR, where the audio signal is not available or is highly affected by environmental noise. To segment the sequential utterances, this research proposes an ad hoc but effective mechanism to detect the start and end frames of non-overlapping utterances. A pair-wise pixel comparison [38] is used for temporal segmentation. It evaluates the differences in intensity of corresponding pixels in two successive frames throughout the video sequence. A short pause is present between consecutive utterances; this pause period provides important information for visual segmentation from the mouth images.


As a first step, the colour images are transformed to grey scale images. Then the average difference square (ADS) of consecutive image frames, which represents the magnitude of mouth movement, is computed using Equation (3.1):

2 1 mn ADS I12(,)(,). x y I x y  (3.1) mn xy11

The pause periods contain minimal mouth movement and are represented by low magnitude values, whereas the pronunciation of utterances is represented by comparatively large magnitude values. Within a consecutive image sequence there are similar subsequent images which produce a near-zero energy difference even while utterances are being pronounced. The basic reason for similar subsequent images in the video data is the frame rate (30 frames/second) at which the videos are recorded: the higher the frame rate, the greater the number of similar subsequent images. To avoid these zero magnitudes within an utterance, the resulting energy signal is smoothed using a moving average window and further smoothed using Gaussian filtering. Selecting an appropriate threshold then leads to the required temporal segmentation.
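A minimal sketch of this segmentation pipeline is given below, assuming the frames are already grey-scale float arrays; the window length, Gaussian sigma and threshold are illustrative assumptions rather than the values used in the experiments.

import numpy as np
from scipy.ndimage import gaussian_filter1d

def temporal_segments(frames, window=5, sigma=2.0, rel_threshold=0.1):
    """frames: list of grey-scale images (float arrays scaled to [0, 1])."""
    # Equation (3.1): average squared difference of consecutive frames.
    ads = np.array([np.mean((frames[i + 1] - frames[i]) ** 2)
                    for i in range(len(frames) - 1)])

    # Smooth with a moving average window, then with a 1-D Gaussian filter.
    smooth = np.convolve(ads, np.ones(window) / window, mode="same")
    smooth = gaussian_filter1d(smooth, sigma)

    # Frames whose motion exceeds the threshold are treated as part of an utterance.
    active = smooth > rel_threshold * smooth.max()
    edges = np.flatnonzero(np.diff(active.astype(int)))
    # Assuming the sequence starts and ends in a pause, edges alternate start/end.
    return list(zip(edges[::2], edges[1::2]))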

The steps of the ad hoc automatic temporal segmentation are shown in Figure 3.2. Figure 3.2(a) shows the squared mean difference of intensities of corresponding pixels in accumulative frames (for clarity, only three visemes are shown). The results of moving average window smoothing and further Gaussian smoothing are shown in Figures 3.2(b) and 3.2(c). Finally, the result of segmentation (as unit step-pulse-shaped representations) is shown in Figure 3.2(d). Each utterance is represented by two unit step-pulses, representing the opening and closing motions of the mouth while uttering a viseme. It is clear from the figure that the ad hoc scheme is highly effective in viseme segmentation. The results of successful segmentation for the 14 visemes of a single user are shown in Figure 3.3. Table 3.2 presents comparative results between manual segmentation and the proposed automatic temporal segmentation. From Table 3.2, it is observed that the overall results are similar for manual and automatic temporal segmentation and there is no significant loss of information, resulting in no difference in the classification.

Table 3.2: Results of temporal segmentation for 14 visemes (three epochs)

Viseme            Epoch 1              Epoch 2              Epoch 3
                  Start     End        Start     End        Start     End
/a/    Manual     24        56         76        107        128       158
       Auto       27        56         75        107        128       158
/ch/   Manual     19        53         70        101        123       157
       Auto       19        53         70        102        122       155
/e/    Manual     49        78         102       133        171       211
       Auto       48        76         102       134        170       209
/g/    Manual     43        86         100       144        165       199
       Auto       43        83         100       142        163       198
/th/   Manual     1         33         58        95         120       155
       Auto       1         34         60        95         119       155
/i/    Manual     22        55         74        107        130       161
       Auto       24        54         73        107        129       161
/m/    Manual     21        55         86        120        154       188
       Auto       22        56         86        119        153       186
/n/    Manual     80        116        136       173        203       241
       Auto       79        116        136       175        204       240
/o/    Manual     26        58         76        114        134       169
       Auto       26        57         75        115        133       167
/r/    Manual     16        58         79        119        142       182
       Auto       19        57         81        118        147       182
/s/    Manual     22        59         80        120        143       181
       Auto       22        58         78        119        146       184
/t/    Manual     16        35         64        85         107       131
       Auto       15        34         63        83         108       132
/u/    Manual     64        101        122       160        180       214
       Auto       63        101        121       159        179       215
/v/    Manual     11        49         61        105        113       153
       Auto       11        48         62        101        113       152


Figure 3.2: Results of Temporal Segmentation. (a) Squared mean difference of accumulative frame intensities, (b) result of smoothing the data with a moving average window, (c) result of further smoothing by Gaussian filtering, (d) segmented data (blue blocks indicate starting and ending points).


(Panels: three epochs each of /a/, /ch/, /e/, /g/, /th/, /i/, /m/ and /n/, plotted against frame numbers.)

Figure 3.3: Results of Temporal Segmentation of all 14 visemes for a single user using the proposed method.

The results of the other subjects were very similar. The results from all subjects and all visemes, in terms of the starting frame error rate (SFER) and end frame error rate (EFER) calculated by Equations (3.2) and (3.3), are tabulated in Table 3.3 and Table 3.4. From these tables, the average error between automatic and manual segmentation for all subjects and all utterances is 2.98 frames per utterance, i.e., around 1.5 frames on either side of an utterance. It can also be seen from Tables 3.3 and 3.4 that subjects 6 and 7 have comparatively higher frame error rates. By manual observation it was found that both subjects had comparatively larger head movements in the videos during an utterance and had darker skin tones.


(Panels: three epochs each of /o/, /r/, /s/, /t/, /u/ and /v/, plotted against frame numbers.)

Figure 3.3: Continued

Table 3.3: Average Frame Error Rate of 10 Epochs at Start of an Utterance

Start frame

Viseme    Subject 1   Subject 2   Subject 3   Subject 4   Subject 5   Subject 6   Subject 7   Average
/a/       1.3         0.3         0.7         1           1.3         1.3         2           1.13
/ch/      0.3         0.7         0.3         1.7         1           7.7         4.7         2.34
/e/       0.7         2           0.7         0.3         1           1.7         1           1.06
/g/       0.7         1.7         1           0.7         1           2           3.3         1.49
/th/      1           0.7         1           1           1.7         5.7         3.3         2.06
/i/       1           0.7         1.3         1.7         0           3.3         3.3         1.61
/m/       0.7         0.3         1           2.7         0.7         3           1.3         1.39
/n/       0.7         1           1           0.7         1           3           0.7         1.16
/o/       0.7         0.3         0.7         1           0.7         2           5.7         1.59
/r/       3.3         1.7         0.3         1.3         1           5.7         3.7         2.43
/s/       1.7         0.3         0.3         2           1.7         0.3         2           1.19
/t/       1           1.3         1           0.7         7           1           3.3         2.19
/u/       1           1.7         0.7         1.3         4           1           1           1.53
/v/       0.3         0.3         0.3         1           1           1           3.3         1.03
Average   1.03        0.93        0.74        1.22        1.65        2.77        2.76

Average error of start frame for all subjects, all visemes: 1.48


1 10 SFER Start__ Framemanual__ i Start Frame auto i (3.2) 10 i1

1 10 EFER End__ Framemanual__ i End Frame auto i (3.3) 10 i1

Table 3.4: Average Frame Error Rate of 10 Epochs at End of an Utterance.

End frame

Viseme    Subject 1   Subject 2   Subject 3   Subject 4   Subject 5   Subject 6   Subject 7   Average
/a/       1           2           0.3         0.7         2.3         1           2.3         1.37
/ch/      1           0.7         0.3         1.3         2           0.7         4.3         1.47
/e/       1.7         1.7         2.7         1           1.3         0.3         1           1.39
/g/       2           0.3         1.3         1.3         0.3         0.7         2.3         1.17
/th/      0.3         0.7         1           0.7         2           1.7         3           1.34
/i/       0.3         2.3         1           0.7         0.7         1.7         2.7         1.34
/m/       1.3         1.7         1.3         1           1.3         1           1           1.23
/n/       1           2           0.3         1           1.3         2.7         2           1.47
/o/       1.3         0.3         0.7         1.3         0.7         3.3         3           1.51
/r/       0.7         2.3         1           2.3         0.3         0.7         3           1.47
/s/       1.7         1           1           1.3         1           2.3         2.3         1.51
/t/       1.3         0.3         0           0.7         0.3         1           2.3         0.84
/u/       0.7         1.3         1           1.7         0.7         3.7         1           1.44
/v/       2           1           0.7         2           1.3         1.7         4.3         1.86
Average   1.16        1.26        0.9         1.21        1.11        1.61        2.46

Average error of end frame for all subjects, all visemes: 1.29

One of the major limitations of the proposed temporal segmentation technique is that it depends on the mouth motion signal, in which each utterance is represented as two peaks: the first peak of an utterance represents the opening movement of the mouth and the second peak corresponds to the closing movement, as can be seen in Figure 3.2(d). It is therefore difficult to differentiate between the start and the end of an utterance. If the start of an utterance is missed due to the presence of noise in the signal, the segmentation error will propagate to all the remaining frames in the image sequence. Another limitation of the proposed segmentation technique is that it requires a short pause period between two consecutive utterances to identify the desired frames; hence the term ad hoc temporal segmentation is used. Such a visual-only temporal segmentation technique is well suited to isolated utterances, where image sequences consist of multiple utterances separated by short pauses. However, there is a need for robust automatic temporal segmentation that is able to separate words, and then the basic visual speech units within words, so that robust visual speech recognition can be performed on unknown words.

3.4 Image Noise Reduction

Once the temporal segmentation procedure has captured the start and end frames of an utterance, the corresponding image sequence is further processed for image de-noising. The degradation of an image can be caused by many factors, such as movement during image capture, atmospheric changes and varying illumination conditions, and this degradation can affect the optical flow computation. To reduce the effects of noise, the input image sequences are smoothed using Gaussian smoothing [88]. This technique is used in the pre-processing stage of a variety of computer vision applications in order to reduce image noise. Gaussian smoothing is performed by convolving the input image with a Gaussian kernel: the Gaussian-weighted neighbourhood of each pixel is summed to produce the corresponding pixel of the output smoothed image.
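A minimal example of this de-noising step, assuming OpenCV is used and with an arbitrary 5 × 5 kernel, is:

import cv2

frame = cv2.imread("roi_frame.png", cv2.IMREAD_GRAYSCALE)   # placeholder file name

# Convolve with a 5x5 Gaussian kernel; sigma is derived from the kernel size
# when it is passed as 0. Kernel size and sigma here are assumptions.
smoothed = cv2.GaussianBlur(frame, (5, 5), 0)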

3.5 Issues that Need to be Considered for the Development of Visual Speech Recognition

Visual speech recognition basically depends on the efficient representation of lip movement. There are several issues that need to be considered for the development of a visual speech recognition system.

• Variation in illumination and colour induced by the video acquisition system (camera) and the environment causes image noise, and these distortions can affect the optical flow computation results. This unwanted distortion can be reduced by applying a global illumination normalization technique before feature extraction. In this work a simple Gaussian smoothing technique [88] is applied to reduce the illumination variations in the ROI before optical flow computation.
• It has been commonly observed, and demonstrated by [4], that there is always inter- and intra-subject variation in speaking. Repeated utterances of a number, letter or word by the same subject may vary even when all factors are kept constant. These inter-speaker variations affect the performance of lip-reading systems in speaker independent scenarios [121]. To a very great extent, this problem is addressed within the purview of the thesis by proposing robust feature and classification methods. Moreover, the proposed methods are compared using blind testing.
• In the development of a real-time, visual-only speech recognition system, automatic recognition of the start and end frames of an utterance is a challenging task. In continuous speech it is even more difficult to find the visual cues for the start and end frames of a single word or of a unit (viseme). In this work, an ad hoc scheme for temporal segmentation is proposed and shown to work on the dataset under consideration.
• Similar utterances can be of different temporal durations, and different utterances may have significantly different temporal durations. Hence, temporal normalization is required.
• Occlusion and self-occlusion of the lips during an utterance affects the motion template based method (MHI). The use of DMHIs in this work considerably overcomes this issue.
• The projection of the movement trajectories of the lips depends on the observation view point (frontal or profile). In this work, only frontal views are considered.
• The distance between the camera and the subject and the viewing angle of the camera affect image-based measurements such as orientation, size and position, due to the projection of the utterance onto a 2D plane. In this research, these issues are fairly addressed by using the Zernike and Hu moments, which are rotation and scale invariant features.


3.6 Summary

In this chapter the components of the pre-processing block have been reviewed, with the main focus on temporal segmentation. A general review of face localization and then lip localization was presented. This chapter has also described the data set used in this research. The dataset is based on 14 visemes which can be concatenated to form words and sentences, and hence these were selected for this research.

An ad hoc temporal segmentation technique for isolated utterances, based on pair-wise pixel comparison, has been proposed and demonstrated. This method computes the mouth motion across the entire image sequence. The mouth motion is described using average energy features computed from the pair-wise difference of images, and is represented by a one-dimensional motion signal in which high amplitude values correspond to speaking and low amplitude values to non-speaking periods. The experimental results demonstrate the validity of the proposed approach. Finally, important issues relevant to the development of visual speech recognition have been discussed.

In the next chapter, a novel feature extraction technique based on non overlapping blocks of the vertical component of optical flow is described.


Chapter 4

Mouth Movement Representation Using Optical Flow Based Motion Template

Motion of the object in a video (sequence of images) provides sufficient information to separate the object from the static background [122]. Motion capturing from the sequence of images and its processing for a variety of applications is a newly emerging technology. Using cutting edge technologies along with motion capturing, cross disciplinary applications can be developed. Some pertinent image processing techniques for motion representation are image differencing, optical flow analysis and background subtraction. As discussed in Chapter 2, the pertinent area for visual speech recognition is the area around the lips. In this region the main component of interest is the motion or movements of the lips. An efficient motion capturing technique is required for robust mouth movement representation. There are three different scenarios causing motion in a given scene:

• the camera is fixed and objects in the scene are moving
• the objects are more or less fixed whereas the camera is moving
• both camera and objects are moving

The selection of a motion capturing method is based on whether the background is static or dynamic, and its performance varies accordingly. If the background is static, i.e., there is no movement in the background, it is easy to eliminate the background and find the region of interest (ROI). In the three scenarios mentioned above, the parameters related to video recording, such as the distance between the camera and the subject, the viewing angle of the camera and the lighting conditions, are important. If these parameters are kept constant, only the object (mouth) motion needs to be recovered. The dataset used in this research has fixed recording parameters.

Motion segmentation techniques have been applied to image sequences in a variety of applications to separate the ROI from the rest of the scene. The ROI for a visual speech recognition system is the area around the lips, which contains a significant amount of motion during speech. In a real scenario the 3-D mouth movement is represented as 2-D motion in the image sequence recorded by a video system. This apparent motion on a 2D plane has to be recovered from the pixel values of the accumulative image sequence for efficient mouth representation. By identifying this motion of the lips, the stationary parts of the video data, which are redundant, can be detected. The static elements of the video data can thus be excluded from feature estimation in the subsequent classification stage. This allows for a compact and computationally efficient representation of mouth movement.

Visual speech features can be classified into appearance based and shape based. Shape-based features are concerned with the representation of the mouth in terms of the geometrical shape and dimensions of the lips, such as width and height. To ease the extraction of lip contours, artificial markers are applied on the face of a speaker [64]; in other approaches a 2D or 3D model of the lip contour is used. However, the use of artificial markers is not practical for a real-time system, and model based methods are often computationally demanding and require exact tracking of the lip movements. Another drawback of such approaches is that they require manual annotation of the training samples to develop the lip models. The annotation of the lip contours is sensitive to the facial skin colour and to hair on the face [78].

Appearance-based features are concerned with the low level features. These techniques use the information from the complete pixel values available in ROI. In this approach, the transform coefficients of the image pixels or the direct pixel values are used as appearance based features [14].

In contrast to appearance and shape-based features, which describe the static shape of the mouth, motion-based features directly represent the lip motions in an image sequence [72]. Motion tracking techniques [123-126] describe facial expressions and human motion more efficiently than techniques based on the underlying static poses. Goldschen et al. [127] demonstrated that motion features for lip-reading are more discriminative than static features.

This chapter describes a novel set of features for lip-reading that identifies visemes from visual data. The technique is based on a robust optical flow analysis that measures lip movement while speaking. Optical flow is a technique for representing the apparent movement in a sequence of images. Only the vertical component of the optical flow is used to extract the features. The vertical component of optical flow is decomposed into multiple non-overlapping, fixed scale vertical rectangular blocks, and the statistical features of each block are computed for the successive video frames of an utterance. A fixed size temporal motion template of each utterance is then developed for classification.

4.1 Introduction

A visual speech recognition technique that identifies visemes from image sequences is presented. The technique is based on lip movements measured by an optical flow analysis. Twenty years ago Mase and Pentland [72] used an optical flow analysis in their automatic lip-reading system to recognize connected English digits; however, due to the computational complexity of optical flow and the limited efficiency of the algorithms then available, it was not a popular method. More recently, the development of powerful CPUs and graphics processing units (GPUs) has made its implementation in real-time systems practical [128]. Optical flow is defined as the distribution of apparent velocities of brightness pattern movements in an image [92]. The optical flow technique is insensitive to background noise and lighting conditions and can be evaluated without prior knowledge of the shape of the object. Therefore, optical flow analysis can produce robust visual features without extracting exact lip locations and contours. Other advantages of using optical flow for visual speech recognition are that lip motion features are more logical than lip shape features, and that the visual features are independent of the speaker's mouth shape and other facial attributes [129].


This study has explored the contribution of the vertical and horizontal components of optical flow to lip-reading. Preliminary experiments revealed that the salient motion features are contained in the vertical component, whereas the horizontal component contributes much less to viseme utterances. Hence, the horizontal component of the optical flow is not included in this method. This reduces the number of features and the overall computational burden of the system. The vertical component of the optical flow is decomposed into multiple non-overlapping, fixed-scale blocks, and statistical features of each block are computed for successive video frames of an utterance. The extracted features are classified using a support vector machine (SVM) classifier.

The experiments were conducted on a database (described in Chapter 3) of 14 visemes taken from seven subjects, and the accuracy was tested using 5-fold and 10-fold cross validation for binary and multiclass SVM respectively to determine the impact of subject variations. Unlike other systems in the literature, the proposed method is robust to inter-subject variations, with high sensitivity and specificity for 12 out of 14 visemes, as indicated by the results. An overall viseme classification accuracy of 98.5%, with specificity of 99.6% and sensitivity of 84.2%, was achieved with a one-vs-rest SVM classifier. Detailed experimental results and the definitions of the terms sensitivity, specificity and accuracy are given in Chapter 7.

4.2 Development of Optical Flow Based Motion Templates

Motion estimation is the process of finding a displacement (motion) vector for each pixel between two video frames. The description of motion between two consecutive images at each pixel is known as an optical flow field. The Middlebury optical flow benchmark [130] shows that optical flow estimation methods have made steady progress in accuracy. These methods have achieved a considerable level of reliability and accuracy through three decades of continuous research [131-133] and through the ever-increasing available computational power. The computation of optical flow motion estimation is presented in the following section. Reviews and evaluations of existing optical flow methods can be found in [134-137] and on the Middlebury optical flow benchmark website [130].


4.2.1 Optical Flow Motion Estimation

Optical flow is a measure of the visually apparent motion of objects between two images and captures the spatio-temporal variations of video data. The word apparent implies that optical flow does not measure the movement of objects in real 3D space, but their motion in the image plane. Most techniques use two constraints to solve for the optical flow: brightness constancy and spatial smoothness. The brightness or intensity constancy constraint (data term) implies that a displacement does not affect the intensity value of a point; the intensity remains the same although the position changes [138]. The spatial smoothness constraint (spatial term) comes from the basic idea of Lucas and Kanade [139]. It assumes that neighbouring pixels generally belong to the same surface, so that neighbouring pixels have a spatially constant image motion. Assuming brightness constancy for optical flow estimation between two image frames taken at times t and t+δt, a pixel at location P(x,y,t) with intensity I(x,y,t) will have moved by δx, δy in time δt between the two frames, so that:

Ixyt ,,,,  Ix  xy   yt   t (4.1) where (,)xyis the spatial displacement during the time period t and shows that a pixel maintains its intensity value during motion, and corresponding pixels in consecutive frames have the same brightness.

4.2.2 Gradient Based Approach

A differential approximation of the brightness constancy constraint, Equation (4.1), gives the gradient constraint equation. Expanding the right-hand side of Equation (4.1) using a Taylor series gives:

I x  I  y  I  t IxxyyttIxyt ,,(,,)..           HOT (4.2) x t  y  t  t  t where Higher Order Terms (H.O.T) are negligible and can be ignored, so that the above equation will become:


I x  I  y  I  t    0 (4.3) x t  y  t  t  t

If we let:

δx/δt = u   and   δy/δt = v    (4.4)

where u and v are the x (horizontal) and y (vertical) components of the velocity corresponding to the optical flow of the image intensity I(x, y, t).

I / x , I / y and I / t are the partial derivatives of the image at P(x, y,t) in the corresponding directions. And these partial derivatives can be represented as II II, and I . xyxyt

Ix u I y v   I t (4.5) IVI.   t

The above equation is commonly referred to as the gradient constraint equation, where V = (u, v) and ∇I = (I_x, I_y) represents the spatial gradient of the intensity I. This is a single linear equation with two unknown variables, so it cannot be solved on its own; another constraint is required to compute the unknowns. This is known as the aperture problem. To solve it, researchers have exploited the rigidity of surfaces in the scene in various ways. The strategies employed generally fall into two major categories: global and local methods.

4.2.3 Global Methods

Horn and Schunck [92] introduced the brightness constancy and spatial smoothness constraints as a solution to optical flow estimation. They proposed an iterative gradient-based method combining Equation (4.5) with a global smoothness constraint. The optical flow can be estimated by minimising:


∫_D [ (∇I · V + I_t)² + λ² ( ‖∇u‖² + ‖∇v‖² ) ] dx    (4.6)

where D is the domain of interest and λ controls the influence of the smoothness constraint, here expressed as the sum of the squared magnitudes of the gradients of u and v.

Global techniques, however, are not robust to outliers caused by motion boundaries, reflections and occlusion. Black and Anandan [138] attempted to resolve the issue of such outliers, but could not capture the true statistics of the brightness constancy errors and flow derivatives.

These techniques have the further problem that they implicitly assume a single motion in the scene. Several researchers, such as Weickert and Schnörr [140] and Brox et al. [141], have extended the Horn and Schunck [92] algorithm in an attempt to overcome these issues. All remain computationally intensive for real-time applications on existing computer hardware.

4.2.4 Local Methods

To solve for the two unknowns in the gradient constraint equation, Lucas and Kanade [139] proposed a spatially local method using a least squares estimator:

E(V) = Σ_x W(x) [ ∇I · V + I_t ]²    (4.7)

where W(x) denotes a weighted spatial window function. This weighting is generally used to increase the influence of the central part of the neighbourhood.

V can be computed by minimizing E(V) with respect to the components of V = (u, v):

∂E(V)/∂u = 0    (4.8)


∂E(V)/∂v = 0    (4.9)

In matrix form this linear system may be defined as

MV b (4.10) where

M = | Σ W(x) I_x²      Σ W(x) I_x I_y |
    | Σ W(x) I_x I_y   Σ W(x) I_y²    |    (4.11)

and

b = − | Σ W(x) I_x I_t |
      | Σ W(x) I_y I_t |    (4.12)

From this system, the local optical flow V can be computed.
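As an illustration of Equations (4.7)-(4.12), the sketch below solves the 2×2 system M V = b over a square window around each pixel. It is a minimal NumPy implementation with a uniform window weighting (W(x) = 1); the function name, window size and conditioning check are illustrative choices, not details taken from the thesis.

```python
import numpy as np

def lucas_kanade_flow(I1, I2, win=7):
    """Minimal Lucas-Kanade sketch: solve M V = b (Eqs. 4.10-4.12) over a
    square window around each pixel. I1, I2: float grayscale frames."""
    Iy, Ix = np.gradient(I1)           # spatial derivatives I_y, I_x
    It = I2 - I1                       # temporal derivative I_t
    half = win // 2
    u = np.zeros_like(I1)
    v = np.zeros_like(I1)
    for r in range(half, I1.shape[0] - half):
        for c in range(half, I1.shape[1] - half):
            ix = Ix[r-half:r+half+1, c-half:c+half+1].ravel()
            iy = Iy[r-half:r+half+1, c-half:c+half+1].ravel()
            it = It[r-half:r+half+1, c-half:c+half+1].ravel()
            M = np.array([[np.sum(ix*ix), np.sum(ix*iy)],
                          [np.sum(ix*iy), np.sum(iy*iy)]])
            b = -np.array([np.sum(ix*it), np.sum(iy*it)])
            # Skip ill-conditioned windows (the aperture problem)
            if np.linalg.cond(M) < 1e6:
                u[r, c], v[r, c] = np.linalg.solve(M, b)
    return u, v
```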

Local methods generally achieve better accuracy than global methods [134]. The local approach can break down at motion boundaries, where the local flow model no longer holds; however, these effects are confined to local regions and do not influence the surrounding optical flow estimates. Removing the global regularization of the flow field also improves the efficiency of local methods. In addition, local methods compute the flow over small windows rather than over an entire surface, which gives them additional flexibility. In 2005, Bruhn et al. [142] showed that integrating local velocity constraints with global regularization improves the accuracy of dense optical flow; however, this hybrid approach increases the computational burden significantly.


4.2.5 Optical Flow Computation Used in this Research

A robust optical flow method combined with a robust feature representation provides better performance, as the results show. Based on the trade-offs of the optical flow techniques discussed above, this research adopts the regularization method of optical flow proposed by Sun et al. [143]. It is based on statistical learning of both the brightness constancy error and the spatial properties of optical flow, and provides a complete probabilistic model of optical flow. Sun et al. estimate the optical flow between two input images I1 and I2 by decomposing the posterior probability density of the flow field (u, v) under probabilistic assumptions. Equivalently, its negative logarithm is minimized:

p(u, v | I1, I2; Ω) ∝ p(I2 | u, v, I1; Ω_D) · p(u, v | I1; Ω_S)    (4.13)

where Ω_D and Ω_S are the parameters of the model:

E(u, v) = E_D(u, v) + λ E_S(u, v)    (4.14)

In Equation (4.14), E_D is the negative logarithm (i.e., the energy) of the data term, E_S is the negative logarithm of the spatial term (the normalization constant is omitted in each case) and λ is a regularization parameter.

Generally, the optimization of such energies is difficult due to non-convexity and many local optima. The non-convexity in this approach stems from the fact that the learned potentials are non-convex, and also from the warping-based data term used here and in other competitive methods [141]. To limit the influence of false local optima, a series of energy functions is constructed:

E_C(u, v, α) = α E_Q(u, v) + (1 − α) E(u, v)    (4.15)

where E_Q is a quadratic, convex formulation of E that replaces the potential functions of E by a quadratic form and uses a different λ. Note that E_Q amounts to a Gaussian Markov Random Field (MRF) formulation. The control parameter α ∈ [0, 1] varies the convexity of the objective and allows a smooth transition from 1 to 0, so that the combined energy function in Equation (4.15) changes from the quadratic formulation to the proposed non-convex one [144]. As soon as the solution at a previous convexification stage is computed, the system uses it as the initialization for the current stage. In practice, the use of three stages produces reasonable results. A simple local minimization of the energy is performed at each stage. At a local minimum, it holds that:

∇_u E_C(u, v, α) = 0   and   ∇_v E_C(u, v, α) = 0    (4.16)

Since the energy induced by the proposed MRF formulation is spatially discrete, the gradient expressions can be derived. Setting these to zero and linearizing them, the results are rearranged into a system of linear equations, which can be solved by a standard technique. The main difficulty in deriving the linearized gradient expressions is the linearization of the warping step. For this the approach of Brox et al. [141] has been followed, using the derivative filters [142].

Large displacements may be caused by sudden movements during fast speech. Standard optical flow techniques are unable to capture such large displacements due to their limited temporal resolution. To overcome this limitation, an image warping technique based on incremental multi-resolution analysis was incorporated [142, 143]. In this approach the optical flow estimated at a coarser level is used to warp the second image toward the first at the next finer level, and the flow increment is calculated between the first image and the warped second image. The final result combines all flow increments. At the first stage, where α = 1, a 4-level pyramid with a down-sampling factor of 0.5 is used. At the other stages, only a 2-level pyramid with a down-sampling factor of 0.8 is used, to fully exploit the solution from the previous convexification stage. As identified by Horn and Schunck [92], classical optical flow methods suffer from drawbacks such as requiring brightness constancy and spatial smoothness. Other limitations include outliers in the image due to the edges of the frame. Sun et al. [143] developed a probabilistic model integrated with the optical flow technique of Black and Anandan [138]. This approach relaxes the requirements of spatial smoothness and constant

brightness and hence has been used in the proposed method. Barron et al. [134] demonstrated that optical flow estimation is more accurate when colour information is used, so the optical flow was obtained for each of the three colour components of the image to reduce the optical flow estimation errors caused by illumination variations [145]. An example of optical flow estimation is shown in Figure 4.1, where the arrows show the direction of motion between two images.


Figure 4.1: Example of optical flow computation using two consecutive images.

4.2.6 Non Overlapping Block Based Approach

In this section, the data processing performed to develop the feature vector template for training and testing the SVM classifier is discussed. Once temporal segmentation of the isolated utterances has been performed, the cropped video containing a viseme is fed to the system for optical flow computation. As discussed earlier, the dataset used in this study contains only the mouth region, so no localization of the mouth (ROI) is required. From the computed optical flow field a set of robust features is extracted that represents the amount of vertical movement in the ROI while uttering a viseme. A pilot study was conducted on the collected data to determine the appropriate features. The results shown in Table 4.1 give the average vertical and horizontal movement of an utterance. It can be observed that the horizontal movement of the lips during a viseme utterance was insignificant in comparison with the vertical movement and did not contribute significantly to the

classification performance of the visemes. Therefore, the optical flow corresponding to the horizontal motion was ignored, and only the vertical component of the optical flow was considered in this technique. This results in a significant reduction in computational complexity for real-time implementation.

Table 4.1: Average vertical and horizontal movement for an utterance.

                      Horizontal movement    Vertical movement
 Average                      0.97                  3.12
 Standard deviation           0.43                  1.33

Unlike the approaches of [72, 129], which extracted the features of each frame globally, in this research the vertical component of the optical flow field is divided into non-overlapping blocks (vertical and rectangular blocks) to retain the salient features. This method therefore defines a set of regions within the ROI, and from each region global statistical measures of the optical flow are computed. This provides a better representation of the motion than the global approach. The reason for the better performance of the block-based approach is that visual speech is bi-directional: the two lips always move in opposite directions, so the vertical component in certain parts of the mouth is cancelled out by averaging. Hence the global features of the complete ROI are not suitable for lip-reading, even though global optical flow features can be valuable in other applications. Figure 4.2 shows the system flow diagram of the non-overlapping block-based approach. Once temporal segmentation of an isolated utterance has been performed, the corresponding segmented video of 30 frames/sec with a resolution of 240×320 pixels that contains a viseme is given as the input to the visual speech recognition system. Similar subsequent frames, which result in zero energy difference between frames, are filtered out using the mean square error (MSE) given in Equation (4.17) to reduce the inter- and intra-subject variation in the speed of speaking.

MSE = (1/mn) Σ_{x=1}^{m} Σ_{y=1}^{n} [ I_1(x, y, t) − I_2(x, y, t) ]²    (4.17)
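The frame filtering step can be sketched as follows. This is a minimal illustration of the MSE criterion in Equation (4.17); the function name and the eps parameter (zero by default, matching the zero-energy-difference criterion) are illustrative rather than taken from the thesis.

```python
import numpy as np

def filter_static_frames(frames, eps=0.0):
    """Drop consecutive frames whose MSE (Eq. 4.17) relative to the
    previously kept frame is at or below eps."""
    kept = [frames[0]]
    for frame in frames[1:]:
        mse = np.mean((frame.astype(float) - kept[-1].astype(float)) ** 2)
        if mse > eps:                 # keep only frames that introduce motion
            kept.append(frame)
    return kept
```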


Removing similar subsequent images has the additional advantage of reducing the computational burden of calculating the optical flow. The system computes the optical flow between consecutive images, which provides the vertical and horizontal components separately. Each vertical component frame is divided into 8 non-overlapping fixed-size blocks of 240×40 pixels, as shown in Figure 4.3. A statistical property (the average intensity) of each block is computed, so that each frame is represented by 8 values in a row. To develop the motion template of the corresponding utterance, these rows of 8 values are stacked to form an n×8 matrix. There are large inter- and intra-subject variations in the speed of an utterance, which result in a different number of frames (i.e., rows) for each utterance. The number of frames for each utterance was therefore normalized so that the template size was the same for every utterance. This normalization was achieved using linear interpolation to obtain a constant 10 frames per utterance. Finally this 10×8 template, i.e., an 80-dimensional feature vector, represented a viseme and was used for training and testing the classifier. The results are presented in Chapter 7, Section 7.1.1.

[Figure 4.2 schematic: segmented video data → optical flow computation → vertical component of optical flow → non-overlapping block averaging (features per frame) → temporal normalization → fixed-length feature vector per utterance, where n is the number of frames in an utterance.]

Figure 4.2: System flow diagram of non-overlapping block based approach.
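A hedged sketch of the block averaging and temporal normalization just described is given below. With rows=1, cols=8 it produces the 10×8 = 80-dimensional template of the vertical-strip approach; rows=5, cols=8 gives the 10×40 = 400-dimensional rectangular-block variant introduced in the next subsection. The function names and the NumPy-based interface are illustrative, not the exact implementation used in the thesis.

```python
import numpy as np

def block_features(v_flow, rows=1, cols=8):
    """Average the vertical optical-flow component over a rows x cols grid of
    non-overlapping blocks (e.g. 1x8 vertical strips of 240x40 pixels)."""
    h, w = v_flow.shape
    bh, bw = h // rows, w // cols
    return np.array([v_flow[r*bh:(r+1)*bh, c*bw:(c+1)*bw].mean()
                     for r in range(rows) for c in range(cols)])

def utterance_template(v_flow_frames, rows=1, cols=8, n_out=10):
    """Stack per-frame block features (n x rows*cols) and linearly
    interpolate along time to a fixed n_out frames."""
    feats = np.vstack([block_features(f, rows, cols) for f in v_flow_frames])
    n_in = feats.shape[0]
    t_in = np.linspace(0.0, 1.0, n_in)
    t_out = np.linspace(0.0, 1.0, n_out)
    resampled = np.column_stack([np.interp(t_out, t_in, feats[:, k])
                                 for k in range(feats.shape[1])])
    return resampled.ravel()          # e.g. 10 x 8 = 80-dimensional vector
```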

However, it was found that this division into vertical columns does not provide the best results. The ROI was therefore further divided into rectangular blocks, with the block size optimized iteratively.


Figure 4.3: Vertical block based approach

4.2.6.1 Block Optimization

The ultimate intention of any simulation-based work is to lead towards a real-time system, which requires the data to be reduced to a minimum. For this purpose, experiments were conducted to optimize the size of the blocks (Figure 4.4). After experimenting with 7 different block sizes (40×32, 32×32, 48×40, 24×20, 30×40, 30×20 and 240×40 pixels; see Table 4.2), a block size of 48×40 pixels was chosen for all of the experiments, as it represents a good compromise between sensitivity, accuracy and the number of features. As a result, each image is divided into blocks of 48×40 pixels, resulting in 40 blocks (5 rows × 8 columns) per optical flow frame, as shown in Figure 4.4.

Table 4.2: Average classification results of seven different block sizes.

     Block size (pixels)   Specificity %   Sensitivity %   Accuracy %   No. of features per viseme
 1   48×40                 99.6            84.2            98.5         400
 2   30×40                 99.7            83.7            98.6         640
 3   40×32                 99.7            84.2            98.6         600
 4   30×20                 99.8            79.5            98.4         1280
 5   32×32                 99.8            81.1            98.4         1000
 6   24×20                 99.8            81.1            98.4         600
 7   240×40                98.1            66.4            95.9         80


Figure 4.4: Direction of movement of each muscle and the block arrangement used for optical flow feature extraction.

To develop the template for each viseme, each optical flow frame was represented by the average of each block, resulting in an array of 5 rows and 8 columns, i.e., 40 features per frame (one feature per block). Each matrix of 40 elements was converted to a row vector, and the rows from subsequent optical flow frames were stacked to form the template matrix. To overcome differences in the speed of speaking, each utterance was temporally normalized to 10 frames using linear interpolation. This resulted in a final feature vector of size 400 (10×40).

4.2.7 Normalization of Speed of Speech

The speed of speech varies between subjects and between repetitions of a phoneme by the same subject. This variation in speed results in a variation in the overall duration of an utterance, which in video terms means a varying number of frames and hence a varying number of visual features. It is difficult to model a classifier with large variations in the number of features per utterance, so a simple two-phase approach to normalizing the overall duration of an utterance is adopted. In the first phase, at the time of the optical flow computation, similar subsequent frames with zero energy difference are eliminated from the optical flow computation. In the second phase, each optical flow template with a varying number of frames is normalized to 10 frames using linear interpolation, so that no precision is lost.

4.3 Summary

This chapter has presented a novel set of features computed from the vertical component of the optical flow. The novel block-based approach shows significant improvement over the DMHI approach proposed in the next chapter; a detailed discussion of the results is presented in Chapter 7. A detailed description of the optical flow estimation has been given, and the optimal size and number of blocks for feature extraction has been discussed. It is concluded that a suitable block size is required to optimize the system. Vertical blocks of the vertical component showed lower accuracy than small rectangular blocks, because the values in a vertical block represent both the upper and lower lip motions at the same time, so the corresponding positive and negative values cancel each other when the mean value of the block is computed. The rectangular block size that gives the best results approximately segments the mouth at its centre, which separates the upper and lower lip motions.


Chapter 5

Mouth Movement Representation Using Directional Motion History Images

In any video sequence, the changes between consecutive images can be detected by subtracting their pixel values, which reveals the regions containing movement. Jain [146] proposed the Accumulative Difference Picture (ADP) for change detection in dynamic scene analysis. The main goal of motion segmentation in this study is to capture the mouth movement that occurs in a temporally segmented image sequence. This chapter discusses another novel and robust mouth motion representation technique. It consists of four mouth motion patterns obtained from the vertical and horizontal components of the optical flow, rather than from the structural information of the movement or from an image differencing technique. The method represents the entire space-time extent of the mouth motion of an utterance by four 2D grey scale images, each retaining the essence and temporal structure of the movement in one direction: up, down, left or right. This technique is known as directional motion history images (DMHIs). It is an evolution of the traditional motion history image (MHI) [147]. The MHI is an appearance-based technique that compresses the entire space-time extent of motion in an image sequence into a single 2D grey scale temporal template that retains the pertinent motion information available in the video frames.

The grey levels of each directional motion history image describe the amount of motion with respect to time. The intensity value of each pixel of each directional motion history image is therefore a function representing the temporal history of motion at that particular pixel location in the respective direction. This approach recognizes the directional motion by computing the apparent motion velocities with the robust optical flow method.

The following section describes the general MHI technique, followed by the development of the DMHIs based on optical flow analysis. The experiments compared the performance of the proposed DMHI method with that of the MHI, with Zernike moments computed from each image.

5.1 Motion Representation Theory

Computer vision based on motion analysis has many applications, including region of interest (ROI) segmentation, object tracking (e.g. of vehicles or humans) and human action recognition, as well as lip-reading from an image sequence. For motion segmentation, frame-to-frame differencing methods [148-152] have been commonly used. These temporal differencing methods, which employ two [149, 151, 152] or three consecutive frames [148, 150], are suitable for dynamic environments, although in general they yield features of limited relevance. Temporal differencing is also employed to generate the motion history image (MHI) and the motion energy image (MEI) [153]. In this work, however, the optical flow computed around the ROI is used as an alternative to temporal differencing for generating the motion history images. Because it directly describes the actual movement in the ROI (the mouth), the proposed optical flow based DMHI method outperforms the traditional MHI method.

5.1.1 Basics of Motion History Image (MHI)

The Motion History Image (MHI) is an appearance-based method used to describe the direction of motion in an image sequence. The intensity of each pixel in the template is a function of the motion density at that location, so that the temporal accumulation of these pixel values results in the MHI, a temporal template. One advantage of the MHI representation is that a range of image frames spanning several seconds may be encoded in a single grey scale image, so the MHI can span the time scale of human visual speech. The resulting single scalar-valued image contains brighter pixels where there is recent movement and darker pixels where the movement is older [154, 155]. Bobick and Davis [156] first proposed a spatiotemporal model for human action representation and recognition using motion history and energy images (MHI/MEI). The MEI is a binary motion image that describes where motion occurs in an image sequence. The MHI H_τ(x, y, t) can be obtained from an update function Ψ(x, y, t), with brighter pixels where there is recent movement and darker pixels where the movement is older:

H_τ(x, y, t) = τ                                  if Ψ(x, y, t) = 1
             = max( 0, H_τ(x, y, t−1) − δ )       otherwise        (5.1)

where the update function Ψ(x, y, t) signals the presence of an object (or motion) in the current video frame, x, y and t give the position and time, τ determines the temporal duration of the MHI, and δ is the decay parameter. Image differencing, optical flow or background subtraction techniques can be used to define Ψ(x, y, t). Figure 5.1 shows the development of two MHI images for the utterances /a/ and /m/. Usually, the MHI is developed from a binarized image computed from frame subtraction [26], using a predefined threshold value to obtain a motion/no-motion classification:

B(x, y, t) = 1    if Diff(x, y, t) ≥ the predefined threshold
           = 0    otherwise        (5.2)

where Diff(x, y, t), with difference distance Δ, is defined as follows:

Diff xyt,,,,(,,)  Ixyt   Ixyt   (5.3) where, I(x,y,t) is the intensity value of pixel located at coordinate (x,y) in the tth frame of the image sequence.

The motion energy image (MEI) is a binary representation of the motion in an image sequence that shows where motion has occurred in a given video. In comparison to the MHI, in the MEI the moving object sweeps out a particular region of the image, and this form can be useful for determining where motion has occurred [124].


Figure 5.1: Development of MHI images for two utterances /a/ and /m/.

The MEI E ( x , y , t) can be represented as:

 1

E ( x , y , t ) D ( x , y , t i ) (5.4) i0


The MEI can be extracted from the MHI by thresholding the MHI above zero [154].

E_τ(x, y, t) = 1    if H_τ(x, y, t) ≥ 1
             = 0    otherwise        (5.5)

As can be seen from Figure 5.1, the grey-scale MHIs are sensitive to direction, whereas MEIs provide no information about the direction of motion; the MHI is therefore more suitable for discriminating motion in opposite directions [157]. However, the MHI and MEI together may provide better discrimination than either alone [154], depending on the application. The following section describes the procedure for developing the DMHIs. This method employs the probabilistic model of optical flow described in Chapter 4.

5.1.2 Development of Directional Motion History Images (DMHIs)

Optical flow methods [92, 135, 139, 158-161] can be used for motion segmentation as well as for MHI development in various applications. Instead of the traditional frame subtraction or background subtraction methods used to calculate the update function in Equation (5.1), this method employs the probabilistic model of optical flow developed by Sun et al. [143] to compute the DMHIs. Computing good quality optical flow from an image sequence is a challenging task; Gray et al. [162] concluded that non-optimized thresholding produces very sparse optical flow fields. To obtain better estimates of the presence of motion and its direction, the probabilistic model of optical flow [143] is employed to reduce outliers; although optical flow is computationally expensive and sensitive to noise, it performs well in the presence of camera motion [163]. The DMHIs constructed from these refined optical flow vectors provide a clearer picture of the presence of motion and its direction (i.e. up, down, left and right). Moreover, from a real-time perspective, the implementation of optical flow algorithms on FPGAs [128] and the availability of powerful CPUs make its deployment practical.


To produce the four DMHIs (up, down, left and right), the optical flow is computed between two consecutive frames and separated into four channels, as shown in Figure 5.2. A detailed description of the optical flow computation is given in Section 4.2.5. With reference to the proposed visual speech recognition system presented in Section 2.10 and depicted in Figure 2.6, the pre-processing and temporal segmentation blocks, which perform the illumination effect reduction and the temporal segmentation of discrete utterances (i.e. determining the start and end frames of the utterances), are described in Chapter 3, Sections 3.3 and 3.4.

It has been observed that there is inter- and intra-subject variation in the speed of speech. This variation in speed can give rise to different perceptual impressions and can cause inexact viseme recognition. To compensate for it, the mean square error (MSE) given in Equation (5.6) between two subsequent frames was computed, and only those frames with non-zero differences were used to compute the optical flow. Removing similar sequential images has the additional advantage of reducing the computational load.

MSE = (1/mn) Σ_{x=1}^{m} Σ_{y=1}^{n} [ I_1(x, y) − I_2(x, y) ]²    (5.6)

The optical flow vectors obtained from an image sequence, denoted F(x, y, t), are first divided into two scalar fields corresponding to the horizontal and vertical components of the flow, F_x and F_y. These components are then half-wave rectified into four non-negative separate channels F_x⁺, F_x⁻, F_y⁺ and F_y⁻, constrained such that:

F_x = F_x⁺ − F_x⁻    (5.7)

F_y = F_y⁺ − F_y⁻    (5.8)

Figure 5.2 depicts the flow diagram of the four flow vector computations. For each of the four directions, each optical flow image is normalized according to a threshold computed using Otsu's [164] global thresholding method. From these normalized image sequences, four separate optical flow based motion history templates are developed after deriving the four optical flow components.

[Figure 5.2 schematic: Frame_x and Frame_x+1 → optical flow computation → horizontal (x) and vertical (y) components → half-wave rectification → left, right, down and up channels.]

Figure 5.2: Conceptual framework of optical flow vector separation into four directions.

H_x⁺(x, y, t) = τ                                  if F_x⁺(x, y, t) ≥ threshold
              = max( 0, H_x⁺(x, y, t−1) − δ )      otherwise

H_x⁻(x, y, t) = τ                                  if F_x⁻(x, y, t) ≥ threshold
              = max( 0, H_x⁻(x, y, t−1) − δ )      otherwise
                                                                          (5.9)
H_y⁺(x, y, t) = τ                                  if F_y⁺(x, y, t) ≥ threshold
              = max( 0, H_y⁺(x, y, t−1) − δ )      otherwise

H_y⁻(x, y, t) = τ                                  if F_y⁻(x, y, t) ≥ threshold
              = max( 0, H_y⁻(x, y, t−1) − δ )      otherwise

x x For positive and negative horizontal directions, H  x, y , t and H  x, y , t are set

 y  y up as motion history templates, whereas H  x, y , tand H  x, y , t represent the

71 positive and negative vertical directions (up and down). In this motion separation method, four motion history templates that approximate the directions of the motion vectors are developed.

The DMHI computation for various utterances by multiple subjects results in a different number of frames for different utterances; even the same viseme repeated by the same subject shows some variation in the number of frames, as can be seen in Table 3.1. The effect of the parameter τ on the DMHI computation is therefore crucial: if an utterance takes 30 frames (τ = 30) for one subject, the maximum value in the produced DMHIs will be 30, whereas if the same utterance is spoken a little more slowly, either by the same or by a different subject, and takes 38 frames (τ = 38), the maximum value of the resulting DMHIs will be 38. This intensity variation could artificially separate instances of the same utterance. The unwanted variation is mitigated by incorporating intensity normalization into the computation of the DMHI templates: a simple normalization maps the range [τmin, τmax] onto [0, 1], so that the produced DMHI images lie in the range [0, 1] for each utterance.
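The DMHI update of Equation (5.9), together with the [0, 1] normalization just described, can be sketched as follows. The per-frame decay of 1 and the interface (a sequence of rectified four-channel flow fields) are illustrative assumptions, not the exact implementation used in the thesis.

```python
import numpy as np

def update_dmhis(dmhis, channels, tau, zeta, delta=1):
    """One step of Eq. (5.9): for each of the four rectified flow channels,
    pixels whose channel value reaches the threshold zeta are stamped with
    tau; all other pixels decay by delta (floored at zero)."""
    return [np.where(ch >= zeta, float(tau), np.maximum(0.0, H - delta))
            for H, ch in zip(dmhis, channels)]

def build_dmhis(flow_channel_seq, zeta, delta=1):
    """flow_channel_seq: per frame pair, a 4-tuple (right, left, up, down) of
    rectified flow fields. Returns four templates normalised to [0, 1]."""
    tau = len(flow_channel_seq)                      # temporal duration
    dmhis = [np.zeros_like(flow_channel_seq[0][0]) for _ in range(4)]
    for channels in flow_channel_seq:
        dmhis = update_dmhis(dmhis, channels, tau, zeta, delta)
    return [H / tau for H in dmhis]                  # simple [0, 1] normalisation
```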



Figure 5.3: (a) Four directional motion history images (DMHIs) of first three frames of an utterance /a/, (b) Complete DMHIs of an image sequence of an utterance /a/.


Figure 5.3 (a) depicts the four directional motion history images for the first three frames of an image sequence of the utterance /a/, where more recent pixel movement is brighter and older pixel movement is darker. It can be seen from the vertical-down DMHI that in the utterance /a/ the lower lip moves down, so that most of the lower portion of that image is bright (white), whereas the other grey level pixels show older movement in the downward direction. Similarly, in the vertical-up image the upper part is brighter, which shows that the upper lip moves up in uttering /a/ in the initial frames. However, optical flow computation is a global pixel-based technique, and the dataset used in this research contains only the ROI; there is no static background separating the static and dynamic portions of the image sequence. In this type of dataset, where all pixels are moving to some extent, such motion history images are harder to interpret. Pixel movement is most apparent where there is a static background and objects such as humans or vehicles move against it. Figure 5.3 (b) shows the four directional motion history images of a complete video of the utterance /a/, where brighter pixels again show more recent movement in the corresponding direction. Feature vectors are computed from these four history templates using Zernike and Hu moments for classification and recognition. A detailed description of the feature extraction and classification techniques is given in Chapter 6.

5.2 Advantages and Disadvantages of Directional Motion History Images

One of the major issues with the generic MHI method is motion occlusion caused by other objects, or self-occlusion, which results in motion overwriting. Examples are human motions such as sitting and standing, or, in visual speech, long motion sequences such as the repeated opening and closing of the mouth. Human actions such as sitting down and standing up at a fixed location have opposite directions. As described earlier, the MHI is based on scalar values, where more recent pixel movement is brighter and older motion is darker. If a person sits down, the MHI of that action shows brighter pixels in the lower part of the image. If, in the same image sequence, the person then stands up, the final MHI contains brighter pixels in the upper part of the image and overwrites the previous sit-down action, so that only the stand-up motion is represented in the final MHI. Similarly, in the opening and closing of the mouth, the upper and lower lips perform opposite actions about their reference points, and this causes overwriting in the MHI.

Several methods have been developed to solve the issue of overwriting, in order to represent multi-directional activities efficiently in the form of an MHI. The Multilevel MHI (MMHI) [165-167] and the Hierarchical Motion History Histogram (HMHH) [71] have been proposed to overcome it. The aim of the MMHI is to overcome the problem of motion self-occlusion in a video sequence by obtaining multiple MHIs. In effect, all the MHIs should have a fixed number of history levels n, for which each video is sampled to (n+1) frames.

The MMHI is represented as follows:

MMHI_t(x, y, t) = s · t                  if Ψ(x, y, t) = 1
                = MMHI_t(x, y, t−1)      otherwise        (5.10)

where the intensity step between two history levels is s = 255/n.

MMHI_t(x, y, t) = 0 for t ≤ 0.

From each of these motion templates the final multilevel MHI is computed by iteratively combining all the short-term motion templates, t = 1, …, n+1. This method encodes motion that takes place at different time instances at the same location such that it can be uniquely decoded. A simple bit-wise coding scheme is used: if motion occurs at time t at pixel location (x, y), 2^(t−1) is added to the old motion value of the MMHI as follows:

MMHI x, y , t  MMHI x , y , t  1   x , y , t .2t1 (5.11)

The MMHI system was demonstrated for the automatic detection of facial actions that convey expressions. Due to the bit-wise coding scheme, multiple actions occurring at the same position can be separated [166]. However, this system requires a sensitive registration stage: all the image sequences must have the same scale or size, and the faces in the video frames must be at the same position. Consequently, the MMHI has not clearly demonstrated any superiority over the basic MHI. Ahad et al. [168] showed, by implementing the MMHI on two different datasets, that the MMHI and HMHH are not as effective as the DMHI. It has also been demonstrated by Ahad et al. [168] that the difficulty of self-occlusion or motion overwriting in the MHI method can be considerably reduced by using the DMHI method. Another possible solution to motion occlusion, proposed by Yau [75], is to move from a single camera view to multi-camera views. However, this is not suitable for a real-time system, and the fusing and processing of video inputs from multiple cameras is a complex task.

One of the important features of a motion template based visual speech recognition system is the ability of the motion template to preserve the short-duration dynamic elements (mouth movements) of the image sequence while discarding the static elements [153, 169].

In visual speech recognition the movements of the lips carry important information. The benefit of the proposed directional motion template approach is that it represents the entire space-time extent of the mouth motion by four 2D grey scale images, each retaining the essence and temporal structure of the movement in one direction (up, down, left and right).

The major drawback of all motion template based systems is motion overwriting or self-occlusion, especially in visual speech when the speech is continuous or when long motion sequences are considered. The movements of the lips are centred on one point and repeat over time, and this repetition of lip movement while uttering words can cause overwriting. The proposed DMHI technique largely solves this issue, but for long motion sequences it is only partially solved. To address this limitation for continuous speech, it is proposed to segment the continuous speech into basic visual units (visemes), so that continuous speech can be recognized by concatenating the respective visual units.


5.3 Summary

This chapter has given insights into the segmentation of mouth movement from video data using the general motion history and motion energy image techniques. In addition to the MHI and MEI, the main focus has been the DMHI, a robust method for overcoming the self-occlusion problem of the MHI. The MHI and MEI methods reduce the dimension of the input video from 3D to a 2D template consisting of a grey scale and a binary image respectively; these images show how and where motion occurs in the video data. The development of the optical flow based DMHI has also been presented. Rather than a single motion template, the DMHIs comprise four directional motion images, each containing information about the lip motion in a particular direction, i.e. up, down, left or right. The chapter has also discussed the advantages and disadvantages of the proposed DMHI technique, noting that while the DMHI greatly reduces overwriting and self-occlusion, it remains partially affected for long motion sequences.

Generally, the ROI image is fairly large, so the raw pixel values of the generated templates are not suitable for use as a feature vector to represent the mouth movement. To represent the MHI or DMHIs with a reduced feature set, suitable image descriptors are needed. Chapter 6 describes the feature extraction and classification techniques used in the visual speech recognition systems developed in this work, and discusses issues related to the development of such systems.


Chapter 6

Visual Speech Feature Extraction and Classification

In pattern recognition, features are representations of the given data that provide sufficient variability between two different patterns. Feature extraction is the process of finding the right representation to aid correct classification. The visual features for visual speech recognition are representations of the video signal that discriminate between the various visemes (visual speech units) whilst providing invariance within each viseme class. The purpose of the feature extraction step in a visual speech recognition system is to yield robust features that make the task of the classifier trivial (linear if possible). However, due to variations in the ROI such as illumination changes, the viewing angle of the camera, variations in the dimensions of the ROI and differences in speaking style, it is extremely difficult to find a robust set of features that accurately discriminates between visemes. Visual speech features should have the following characteristics [170]:

• Robust to environmental variations such as lighting conditions, scale and rotation, reducing intra-class variation
• Contain the maximum information about the patterns of interest, increasing inter-class variation
• Low dimensionality and a compact representation, allowing real-time system implementation

A variety of feature extraction methods for the video analysis of the mouth during speech have been proposed in the literature. These features can be broadly categorized into three types: shape based (contour based), appearance based (global features) and a combination of both. This chapter presents a brief review of these features. Appearance-based features have been preferred by many researchers because of their simplicity and analogy to human perception, and because they do not require exact localization and tracking of the mouth. Section 6.2 explores the proposed feature extraction methods, and the classifier employed is discussed in Section 6.4.

6.1 Classification of Visual Feature Extraction

Feature extraction techniques applied to visual speech recognition systems can be classified into the following three categories [14]:

i) Shape based or contour based;
ii) Appearance based or intensity based; and
iii) A combination of appearance and shape based features.

The following section describes these techniques, their advantages and disadvantages.

6.1.1 Shape Based Feature Extraction

The shape based feature extraction techniques represent the mouth in terms of the geometrical shape and dimensions of the lips, such as height and width, as seen in Figure 6.1(a). This information is encoded with respect to a standard set of mouth shapes already stored in the system. In 1984, Petajan [41] proposed the first visual lip-reading system based on shape features, extracting the speaker's mouth height, width, perimeter and area from binary images as the features for the classifier. In shape based feature extraction, it is desirable to locate the exact location and position of the visual speech articulators. This approach has the advantage of a low feature dimensionality; however, the requirement for further localization and tracking may have an undesirable effect on the visual speech recognition system [61].



Figure 6.1: Shape based features represent the physical shape of the mouth such as height and width, given in (a). Appearance based features consider the complete ROI, shown in (b).

Generally, researchers [36, 64, 127, 171, 172] have considered physical measurements such as the mouth width, height and area of the visual articulators to represent the speaker's mouth. Kaynak et al. [173] provided a detailed description and a comparative analysis of lip geometric features. To extract geometric features such as the height, width and contour of the lips, artificial markers have been applied to the speaker's lips [64, 173]. In today's systems, such artificial markers are unsuitable in most scenarios.

In recent studies, model based techniques have been considered for AVSR. In these approaches, a geometric model of the lip contour is used. Typical examples are active shape models [174], in which the inner and outer contours of the lips are extracted using a labelled set of points on the mouth. Features extracted from model based approaches can be divided into two categories. In the first, the parametric values used to define the lip contours are used directly as the feature vector. In the second, parameters such as the mouth height, width, perimeter and area and the ratio between width and height form the feature vector [24, 175, 176]. Other examples of model based techniques are deformable templates [177], the active appearance model (AAM) [17], multi-scale spatial analysis (MSA) [24] and smart snakes [178, 179]. The AAM is an extension of the ASM that combines a shape model with a statistical model of the grey level surface of the ROI, and has been demonstrated to outperform the ASM [24]. Wojdel and Rothkrantz [180] introduced a new, model-free feature extraction method to describe the shape of the lips. It is based on an image segmentation technique that detects the pixels belonging to the lips and is known as lip geometry estimation (LGE).

The major drawback of the AAM and other model based approaches is that they require manual annotation of the training samples, and the performance of the AAM decreases if the speaker is not included in the training set. Model based techniques are also sensitive to facial skin colour and facial hair, although they are insensitive to illumination variations and image noise [75].

6.1.2 Appearance Based Feature Extraction

The major shortcoming of the shape based feature extraction techniques [17, 174] is that they consider only the geometrical information of the lips to represent mouth shapes. They analyse only the lip contour and ignore the other speech articulators, whereas appearance based representations are concerned with low level features. Appearance based techniques use the information from all the pixel values available in consecutive images of the ROI, as shown in Figure 6.1(b). In addition to the lips, other speech articulators such as the teeth, tongue, jaw and some of the muscles surrounding the mouth, which are informative about visual speech [60], are implicitly included. The raw pixel values in the ROI contain the salient speech information and can be used directly as a feature vector [181]. However, a high resolution ROI contains a large number of pixels, which can overburden a statistical classifier. Image processing or transform techniques are therefore applied to the ROI image to extract a compact and meaningful feature vector.

Principal component analysis (PCA) is a well-known dimensionality reduction technique that has been used for lip-reading by various researchers [61, 63, 181-186] with good results. A superficially related technique, independent component analysis (ICA), has also been used for lip-reading [162]. However, results achieved with traditional PCA have outperformed the ICA representation for lip-reading [162].

In other approaches, researchers have used linear image transforms such as the discrete wavelet transform (DWT) [185], vector quantization [187] and the discrete cosine transform (DCT) [33, 63, 184, 188, 189] to compute features from the ROI. In these data compression techniques, statistical redundancies in the image are removed via the transform coefficients. The computation of such transforms, for example the Fast Fourier Transform (FFT), becomes faster when the image size is a power of 2 or the image is square, so they can be used in a real-time implementation of a VSR system [190]. Potamianos et al. [35] applied the first and second order derivatives of a DCT to the ROI to capture visual speech features. The advantage of intensity based feature extraction over shape based approaches is that no a priori statistical lip model is needed, which leads to a computationally efficient VSR system [169].

6.1.3 Hybrid Features

Hybrid feature extraction combines both appearance and shape based representations. This approach is based on the hypothesis that high level features describing the geometric shape of the mouth and low level pixel-based features are complementary, so that combining them can improve performance. Luttin et al. [191] presented a speech reading system that combined shape based ASM features with intensity based PCA features. A similar combined feature extraction method was used by Chiou and Hwang [182], who used a snake contour model with PCA. Chan [192] formed his feature vector from geometric features together with PCA features. These approaches concatenate both sets of features into a single feature vector. In addition, the active appearance model (AAM) implicitly extracts both appearance and shape features [17].

6.1.4 Motion Based Feature Representation

In 1994, Goldschen et al. [127] introduced dynamic features to represent the actions of the oral cavity during speech, and demonstrated that motion features are more discriminative than static features. Besides seven static features of the oral cavity, such as width, height, area and perimeter, they also extracted the dynamics of each feature by computing its change between consecutive frames and then calculating the second derivative. Their results indicated that the dynamic features are more discriminative than the static features.

Rosenblum and Saldaña (1998) [193] distinguished static and dynamic visual speech information: features extracted from static mouth images are static (pictorial) features, a category that includes the appearance based and shape based features, while dynamic features directly represent the motion or dynamics of the mouth movement in the time-varying visual information. More recently, Yau et al. [26] computed dynamic features to represent visual speech. In their approach, the mouth motion is represented by a motion history image (MHI), computed by subtraction of consecutive images.

As described in Chapter 4, the motion of an object in an image sequence provides rich information. An optical flow motion tracking approach is adopted in this research as an alternative to static analysis of the input ROI video. The motivation for using motion features is that they capture the temporal characteristics of the visual speech information or lip positions [194] and describe the actual movement of the mouth. Another argument in favour of temporal image data is the finding of Rosenblum and Saldaña [193] that time-varying information is more important for visual speech perception than time-independent information. Optical flow has been used for lip reading since the beginning of this research domain; however, its computational complexity and the low performance of early optical flow algorithms restricted its use. The availability of powerful CPUs/GPUs and more accurate algorithms now allows optical flow to be used in real-time applications.

The following sections describe the feature extraction and classification techniques applied to the concise images computed in Chapters 4 and 5, i.e. the DMHIs and the templates obtained with the non-overlapping block based approach.


6.2 Visual Speech Feature Extraction

The mouth movement is segmented from the video data using the optical flow analysis presented in Chapters 4 and 5; the resulting DMHIs and the templates developed from the vertical component of the optical flow can then be used for matching the movement patterns of visemes. This section presents the feature extraction technique applied to the DMHIs to identify utterances. The templates computed from the vertical component of the optical flow are fed directly to the classifier for training and testing, because their size has already been reduced by computing the average value of each non-overlapping block.

The pixel values of each directional motion history image, and of the templates developed from the vertical component, are a representation of the motion history. In the DMHIs, each image represents the lip movement in a particular direction (up, down, left, right), with more recent movements represented by brighter pixels. One of the simplest options for recognizing the directional motion history images of an utterance is to use the pixel values of each image directly as the input feature vector to the classifier. Nevertheless, analysing visual speech information directly from intensity values is very difficult due to the large data size and the sensitivity to local variations of image intensity [195]. For example, an image of size 100×100 contains 10,000 pixels, which is too high a dimension for robust classification; in this research the raw representation would amount to 57,600 pixels across the four DMHIs of an utterance. In image classification tasks it is highly desirable to have a small feature set that contains most of the relevant cues related to the objects of interest. The raw pixel values can be transformed to a different space to reduce the feature dimensionality whilst retaining the relevant and meaningful motion information. Regions in an image can be described in terms of external characteristics such as the boundaries of the regions [196, 197], or internal properties such as the pixel values within the regions [122, 198].

The external representation is used only when the shape or outline of the regions is important for representing the image. Since the pixel intensities of DMHIs contain spatial and temporal information of mouth movement, global internal features are suitable to efficiently represent the intensity distribution of DMHIs. The gray level of each pixel of a DMHI indicates the temporal characteristics of the mouth movement at that particular pixel location and direction. A number of global feature descriptors have been proposed for image representation in the literature, such as wavelet transform representation [199], DCT representation [65] and statistical moments such as geometric moments, Hu moments (HM) [200] and Zernike moments (ZM). In this study two global internal descriptors, Zernike moments and Hu moments, are examined to uniquely represent the set of four directional motion history images.

ZM and HM are extracted individually from each DMHI of an utterance to represent the corresponding utterance in a compact form while redundant information is removed. Sixty-four Zernike moments are computed from each DMHI, so that in total a 256-dimensional feature vector represents an utterance. In the second feature set, seven Hu invariant features are computed from each image, so that each utterance is represented by a 28-dimensional feature vector.

Statistical moments capture the global information of an image and do not require closed boundaries, unlike boundary-based features such as Fourier descriptors. Moments computed from images with different patterns are unique, and hence such features are useful for pattern recognition of multiple visemes. The most commonly used moment-based features are the geometric moments, which are computed by projecting the image function onto monomial functions. Hu [200] proposed a set of seven nonlinear functions, named moment invariants (MI) or Hu moments (HM), derived from geometric moments of order zero up to the third order. MI features are translation, scale and rotation invariant. It is difficult to derive MIs of order greater than three, thereby limiting the maximum feature size of MI to only seven moments.

ZMs are a type of orthogonal statistical moment proposed by Teague [201] in 1980, and have been recognized as one of the robust region-based shape descriptors in the MPEG-7 standard [202]. The advantages of ZM are that such features are mathematically concise and capable of reflecting not only the intensity distribution of an image, but also its shape. As opposed to HM, which can only be computed up to the 3rd order, ZM can easily be constructed to an arbitrarily high order. Another key strength of ZM is the invariance of these features to rotational changes.

A detailed description of ZM and HM is given in Sections 6.2.1 and 6.2.2 respectively. Classification is the procedure of assigning each input to one of a set of pre-defined classes; for example, when the features extracted by ZM are fed as input to the classifier, the classifier attempts to assign them to the corresponding viseme class. Section 6.4 describes the support vector machine (SVM) classifier used for viseme classification.

6.2.1 Zernike Moments

Zernike moments (ZM) are a type of orthogonal image moment commonly used in the recognition of image patterns [203]. ZM are independent features due to the orthogonality of the Zernike polynomials $V_{nl}$ [201]. They have many desirable properties, one of the most important being their simple rotational invariance [203]; others are robustness to noise, expression efficiency and multilevel representation for describing the shapes of patterns. In terms of information redundancy, sensitivity to image noise and image representation capability, ZM have been demonstrated to outperform other image moments such as Legendre moments, geometric moments and complex moments [204]. The orthogonality of the ZM features enables redundancy reduction and enhances computational efficiency [205]. Zernike moments are computed by projecting the image function f(x, y) onto the orthogonal Zernike polynomial $V_{nl}$ of order $n$ with repetition $l$ defined within the unit circle (i.e., $x^2 + y^2 \le 1$).

6.2.1.1 Square-to-Circular Image Coordinate Transformation

All the directional motion history images are scaled to a square image of $N \times N$ pixels so that they can be mapped to the unit circle centred at the origin of each individual image. Each image of size $N \times N$ pixels is bounded by the unit circle: the centre of the image is taken as the origin and the pixel coordinates are mapped to the range of the unit circle, i.e., $x^2 + y^2 \le 1$.


Figure 6.2 shows the square-to-circular transformation that maps the square image function f(i, j) to a circular image function $f(\rho, \theta)$ in terms of the x–y axes. Each of the images is enclosed within the circular x–y coordinates to ensure that no information is lost in the square-to-circular transformation.

Figure 6.2: The square-to-circular image coordinates transformation of a motion template before Zernike moments computation.

The transformed coordinates are given by:

x_i = \frac{\sqrt{2}\, i}{N-1} - \frac{1}{\sqrt{2}} \qquad (6.1)

y_j = \frac{\sqrt{2}\, j}{N-1} - \frac{1}{\sqrt{2}} \qquad (6.2)

The radius $\rho$ and angle $\theta$ after the transformation are given by:

\rho_{ij} = \sqrt{x_i^2 + y_j^2} \qquad (6.3)

\theta_{ij} = \tan^{-1}\!\left(\frac{y_j}{x_i}\right) \qquad (6.4)


6.2.1.2 Computation of ZM

ZM are computed by projecting an image function f(x, y) onto the orthogonal Zernike polynomial $V_{nl}$. The kernel of ZM is a set of orthogonal Zernike polynomials defined over the polar coordinate space within the unit circle (i.e., $x^2 + y^2 \le 1$). The Zernike moment $Z_{nl}$ of order $n$ and repetition $l$ is given by

Z_{nl} = \lambda \int_0^{2\pi}\!\!\int_0^{1} V_{nl}^{*}(\rho, \theta)\, f(\rho, \theta)\, \rho \, d\rho \, d\theta \qquad (6.5)

where $|l| \le n$, $n - |l|$ is even, and $f(\rho, \theta)$ is the intensity distribution of a DMHI mapped to the unit circle, with radius $\rho$ and angle $\theta$ defined in Equations (6.3) and (6.4). The term $\lambda$ is a normalizing constant defined as

\lambda = \frac{n + 1}{\pi} \qquad (6.6)

The Zernike polynomial $V_{nl}$ is given by

V_{nl}(\rho, \theta) = R_{nl}(\rho)\, e^{\hat{j} l \theta}, \qquad \hat{j} = \sqrt{-1} \qquad (6.7)

where $R_{nl}$ is the real-valued radial polynomial, given by

R_{nl}(\rho) = \sum_{k=0}^{(n-|l|)/2} (-1)^k \, \frac{(n-k)!}{k!\left(\frac{n+|l|}{2}-k\right)!\left(\frac{n-|l|}{2}-k\right)!} \, \rho^{\,n-2k} \qquad (6.8)

The integrals in Equation (6.5) are replaced by summations for discrete digital images, giving

Z_{nl} = \lambda \sum_{x} \sum_{y} V_{nl}^{*}(\rho, \theta)\, f(\rho, \theta), \qquad x^2 + y^2 \le 1 \qquad (6.9)
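A minimal NumPy sketch, under the coordinate mapping of Equations (6.1)–(6.2), of how the magnitude of a single Zernike moment can be computed from a DMHI following Equations (6.3)–(6.9); the function names are illustrative. The full 256-dimensional feature vector is obtained by evaluating this for the 64 (n, l) pairs of Table 6.2 on each of the four DMHIs.

import numpy as np
from math import factorial

def radial_poly(rho, n, l):
    # Real-valued radial polynomial R_nl(rho) of Equation (6.8).
    l = abs(l)
    R = np.zeros_like(rho)
    for k in range((n - l) // 2 + 1):
        c = ((-1) ** k) * factorial(n - k) / (
            factorial(k)
            * factorial((n + l) // 2 - k)
            * factorial((n - l) // 2 - k))
        R += c * rho ** (n - 2 * k)
    return R

def zernike_moment(img, n, l):
    # |Z_nl| of a square grey-scale motion template (Equation (6.9)).
    img = np.asarray(img, dtype=float)
    N = img.shape[0]
    i, j = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
    # Square-to-circular mapping of Equations (6.1)-(6.2): the image is
    # enclosed by the unit circle so that no pixels are discarded.
    x = np.sqrt(2) * i / (N - 1) - 1 / np.sqrt(2)
    y = np.sqrt(2) * j / (N - 1) - 1 / np.sqrt(2)
    rho = np.sqrt(x ** 2 + y ** 2)               # Equation (6.3)
    theta = np.arctan2(y, x)                     # Equation (6.4)
    V_conj = radial_poly(rho, n, l) * np.exp(-1j * l * theta)
    Z = (n + 1) / np.pi * np.sum(img * V_conj)   # lambda = (n+1)/pi, Eq (6.6)
    return np.abs(Z)                             # rotation-invariant magnitude, Eq (6.15)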

6.2.1.3 Rotation Invariance of ZM

To illustrate the rotational characteristics of ZM, consider $\phi$ as the rotation angle of an image (DMHI). The rotated image $f^{r}$ is given by

f^{r}(\rho, \theta) = f(\rho, \theta - \phi) \qquad (6.10)

The mapping of the ZM expression from the x–y plane into polar coordinates can be obtained by changing the double integral of Equation (6.5) according to the general relation

\iint g(x, y)\, dx\, dy = \iint g\big(p(\rho,\theta),\, q(\rho,\theta)\big)\, \left|\frac{\partial(x, y)}{\partial(\rho, \theta)}\right| d\rho\, d\theta \qquad (6.11)

where $x = p(\rho,\theta)$, $y = q(\rho,\theta)$, and $\partial(x, y)/\partial(\rho, \theta)$ defines the Jacobian of the transformation, i.e., the determinant of the matrix of partial derivatives. Since $x = \rho\cos\theta$ and $y = \rho\sin\theta$, the Jacobian becomes $\rho$. The ZM of the original image (before rotation) is

2 1 ˆ Z  f ,()  R  e jl  d  d  (6.12) nl   nl 0 0

The ZM of the rotated image is given by

2 1 ˆ Z',() f    R  e jl  d  d  (6.13) nl   nl 00

Let $\theta' = \theta - \phi$. Hence the ZM of the rotated image becomes

Z'_{nl} = \lambda \int_0^{2\pi}\!\!\int_0^{1} f(\rho, \theta')\, R_{nl}(\rho)\, e^{-\hat{j} l (\theta' + \phi)}\, \rho\, d\rho\, d\theta'
        = \left[ \lambda \int_0^{2\pi}\!\!\int_0^{1} f(\rho, \theta')\, R_{nl}(\rho)\, e^{-\hat{j} l \theta'}\, \rho\, d\rho\, d\theta' \right] e^{-\hat{j} l \phi}
        = Z_{nl}\, e^{-\hat{j} l \phi} \qquad (6.14)

Equation (6.14) demonstrates that rotation of images results in a phase shift of ZM. This simple rotational property indicates that the magnitudes of ZM of a rotated image function remain identical to ZM before rotation [203]. The absolute value of ZM is invariant to rotational changes as given by

|Z'_{nl}| = |Z_{nl}| \qquad (6.15)

DMHIs are represented using the absolute values of ZM as visual speech features. An optimum number of ZM needs to be selected to ensure a suitable trade-off between feature dimensionality and image representation ability: including higher order moments represents more image information but increases the feature size, and higher order moments are also more prone to noise [204]. The number of moments required was therefore determined empirically. Performance measures for three different numbers of ZM features were computed to obtain a suitable number for viseme classification: classification results for 49, 64 and 81 ZMs were evaluated using the SVM classifier. Table 6.1 shows the average results for the three different numbers of ZMs in terms of sensitivity, specificity and accuracy, which are defined in Section 7.1.2.

Table 6.1: Average classification results of three different numbers of ZMs.

Number of Zernike Moments    Sensitivity %    Specificity %    Accuracy %
49                           71.7             99.6             97.6
64                           75.7             99.7             98
81                           72.7             99.7             97.8


Based on the accuracy and sensitivity values, the 64 ZMs comprising the 0th up to the 14th order moments (listed in Table 6.2) are adopted as visual speech features to represent each DMHI.

Table 6.2: Zernike Moments from 0th to 14th order.

Order    Moments                                                       No. of moments
0        Z0,0                                                          1
1        Z1,1                                                          1
2        Z2,0, Z2,2                                                    2
3        Z3,1, Z3,3                                                    2
4        Z4,0, Z4,2, Z4,4                                              3
5        Z5,1, Z5,3, Z5,5                                              3
6        Z6,0, Z6,2, Z6,4, Z6,6                                        4
7        Z7,1, Z7,3, Z7,5, Z7,7                                        4
8        Z8,0, Z8,2, Z8,4, Z8,6, Z8,8                                  5
9        Z9,1, Z9,3, Z9,5, Z9,7, Z9,9                                  5
10       Z10,0, Z10,2, Z10,4, Z10,6, Z10,8, Z10,10                     6
11       Z11,1, Z11,3, Z11,5, Z11,7, Z11,9, Z11,11                     6
12       Z12,0, Z12,2, Z12,4, Z12,6, Z12,8, Z12,10, Z12,12             7
13       Z13,1, Z13,3, Z13,5, Z13,7, Z13,9, Z13,11, Z13,13             7
14       Z14,0, Z14,2, Z14,4, Z14,6, Z14,8, Z14,10, Z14,12, Z14,14     8
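As a quick consistency check of Table 6.2, the valid (n, l) pairs with 0 ≤ l ≤ n and n − l even can be enumerated programmatically; for orders 0 to 14 this yields exactly the 64 moments used as features.

pairs = [(n, l) for n in range(15) for l in range(n + 1) if (n - l) % 2 == 0]
print(len(pairs))   # 64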

6.2.2 Hu Moments

The use of moments for image analysis and pattern recognition was inspired by Hu [36] and Alt [5]. Hu derived a set of invariant moments (IM) which have the desirable property of being invariant under image rotation, translation and scaling. In pattern recognition, the invariants of an object are a set of measurable quantities that describe the object; they are insensitive to particular deformations and provide sufficient discrimination power to distinguish objects belonging to different classes [206].

The two-dimensional (p+q)th order moment is defined as follows:

m_{pq} = \int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} x^{p} y^{q} f(x, y)\, dx\, dy, \qquad p, q = 0, 1, 2, 3, \ldots \qquad (6.16)


If the image is greyscale with pixel intensities I(x,y), moments are calculated by

M_{ij} = \sum_{x} \sum_{y} x^{i} y^{j} I(x, y) \qquad (6.17)

Hu stated that if f(x,y) is a piecewise continuous bounded function and has nonzero values only in a finite region of the (x,y) plane, then the moment sequence (m_pq) is uniquely determined by f(x,y), and conversely, f(x,y) is uniquely determined by (m_pq).

Considering the fact that an image segment has finite area, or in the worst case is piecewise continuous, moments of all orders exist and a complete moment set can be computed and used uniquely to describe the information contained in the image. However, to obtain all of the information contained in an image requires an infinite number of moment values. Therefore, selecting a meaningful subset of the moment values that contains sufficient information to characterize the image uniquely for a specific application becomes very important.

It can be noted that the moments in Equation (6.16) may not be invariant when f(x,y) changes by translation, rotation or scaling. Invariant features can be obtained using the central moments, which are defined as follows:

\mu_{pq} = \int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} (x - \bar{x})^{p} (y - \bar{y})^{q} f(x, y)\, dx\, dy, \qquad p, q = 0, 1, 2, 3, \ldots \qquad (6.18)

where $\bar{x} = m_{10}/m_{00}$ and $\bar{y} = m_{01}/m_{00}$.

The point $(\bar{x}, \bar{y})$ is the centroid of the image f(x,y). The central moment $\mu_{pq}$ computed about the centroid of the image f(x,y) is equivalent to the moment $m_{pq}$ of an image whose centre has been shifted to the centroid. Therefore, the central moments are invariant to image translations. Scale invariance can be obtained by normalization. The normalized central moments are defined as follows:

\eta_{pq} = \frac{\mu_{pq}}{\mu_{00}^{\gamma}}, \qquad \gamma = \frac{p + q}{2} + 1 \qquad (6.19)

where $\gamma$ is the normalization factor. The set of absolute moment invariants consists of nonlinear combinations of central moments that remain invariant under rotation. Based on the normalized central moments, Hu defined the following seven functions, computed from central moments through order three, that are invariant with respect to object scale, translation and rotation:

\phi_1 = \eta_{20} + \eta_{02} \qquad (6.20)

\phi_2 = (\eta_{20} - \eta_{02})^2 + 4\eta_{11}^2 \qquad (6.21)

\phi_3 = (\eta_{30} - 3\eta_{12})^2 + (3\eta_{21} - \eta_{03})^2 \qquad (6.22)

\phi_4 = (\eta_{30} + \eta_{12})^2 + (\eta_{21} + \eta_{03})^2 \qquad (6.23)

\phi_5 = (\eta_{30} - 3\eta_{12})(\eta_{30} + \eta_{12})\left[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\right] + (3\eta_{21} - \eta_{03})(\eta_{21} + \eta_{03})\left[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right] \qquad (6.24)

\phi_6 = (\eta_{20} - \eta_{02})\left[(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right] + 4\eta_{11}(\eta_{30} + \eta_{12})(\eta_{21} + \eta_{03}) \qquad (6.25)

\phi_7 = (3\eta_{21} - \eta_{03})(\eta_{30} + \eta_{12})\left[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\right] - (\eta_{30} - 3\eta_{12})(\eta_{21} + \eta_{03})\left[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right] \qquad (6.26)


The functions 1 through6 are invariant with respect to rotation and reflection while

7 changes sign under reflection.
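A small NumPy sketch of Equations (6.17)–(6.26): the normalized central moments are computed from the pixel grid and combined into Hu's seven invariants. The function names are illustrative; an equivalent result could also be obtained from a library implementation such as OpenCV's HuMoments.

import numpy as np

def normalized_central_moment(img, p, q):
    # Central moment about the centroid (Eq (6.18)), normalized as in Eq (6.19).
    img = np.asarray(img, dtype=float)
    y, x = np.mgrid[:img.shape[0], :img.shape[1]].astype(float)
    m00 = img.sum()
    xc, yc = (x * img).sum() / m00, (y * img).sum() / m00   # centroid
    mu = ((x - xc) ** p * (y - yc) ** q * img).sum()
    gamma = (p + q) / 2 + 1
    return mu / m00 ** gamma

def hu_moments(img):
    e = lambda p, q: normalized_central_moment(img, p, q)
    n20, n02, n11 = e(2, 0), e(0, 2), e(1, 1)
    n30, n03, n21, n12 = e(3, 0), e(0, 3), e(2, 1), e(1, 2)
    phi1 = n20 + n02                                            # Eq (6.20)
    phi2 = (n20 - n02) ** 2 + 4 * n11 ** 2                      # Eq (6.21)
    phi3 = (n30 - 3 * n12) ** 2 + (3 * n21 - n03) ** 2          # Eq (6.22)
    phi4 = (n30 + n12) ** 2 + (n21 + n03) ** 2                  # Eq (6.23)
    phi5 = ((n30 - 3 * n12) * (n30 + n12)
            * ((n30 + n12) ** 2 - 3 * (n21 + n03) ** 2)
            + (3 * n21 - n03) * (n21 + n03)
            * (3 * (n30 + n12) ** 2 - (n21 + n03) ** 2))        # Eq (6.24)
    phi6 = ((n20 - n02) * ((n30 + n12) ** 2 - (n21 + n03) ** 2)
            + 4 * n11 * (n30 + n12) * (n21 + n03))              # Eq (6.25)
    phi7 = ((3 * n21 - n03) * (n30 + n12)
            * ((n30 + n12) ** 2 - 3 * (n21 + n03) ** 2)
            - (n30 - 3 * n12) * (n21 + n03)
            * (3 * (n30 + n12) ** 2 - (n21 + n03) ** 2))        # Eq (6.26)
    return np.array([phi1, phi2, phi3, phi4, phi5, phi6, phi7])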

6.3 Classifier for Lip-reading

As discussed, three novel feature sets are extracted from the 14 visemes for classification; these are:

i) features from the vertical component of optical flow, based on a statistical property of non-overlapping blocks;
ii) Zernike moments from the four DMHIs; and
iii) Hu moments from the four DMHIs.

Classification can be defined as the process of assigning new inputs to one of a set of predefined discrete classes (utterances). Achieving accurate classification performance is quite difficult. Generally, the task of a classifier in visual speech recognition is to determine how well the input features of an utterance match each of the possible utterances in the system [207]. This can be achieved through supervised classification, which creates a function or model from training examples consisting of pairs of inputs (feature vectors) and outputs (class labels of utterances). The task of the trained classifier is then to predict the labels of new feature vectors.

In the literature, a variety of classifiers has been used for visual speech features. Classification of visual speech features into multiple visemes can be accomplished using generative or discriminative models. Statistical models of features generated through random processes are known as generative models; these models are represented by parameters derived from the statistical properties of the input features. Markov, Gaussian and hidden Markov models are examples of generative models. The most widely used classifier for modelling and recognizing audio and visual speech data has been the hidden Markov model (HMM). HMM provides a mathematical framework that is suitable for modelling time-varying signals. HMM is useful in finding patterns that evolve over time and has been successfully implemented as a classifier in applications such as gesture recognition [208] and bioinformatics [209].

Discriminant models such as support vector machines (SVM) [210] and artificial neural networks (ANN) [211] are non-parametric models which classify features without assuming a priori knowledge of the data. These classifiers create decision functions that classify input data into one of the predefined classes based on the training samples. Heckmann et al. [212] developed a combination of both the ANN and HMM to form a hybrid ANN-HMM classifier to improve the performance of AVSR system in varying noise conditions. Similarly, Bregler et al. [181] and Duchnowski et al. [63] devised a hybrid ANN-DTW (dynamic time warping) classifier. Gowdy et al. [33] and Saenko et al. [22] have proposed the use of Dynamic Bayesian Networks (DBNs) for AVSR.

Though all of the above-mentioned classifiers have shown success in their respective AVSR systems, the SVM is the choice of this dissertation because the computed features are not time-varying signals: in all three feature extraction approaches the number of features is fixed. Moreover, in earlier work, feed-forward multilayer perceptron (MLP) and artificial neural network (ANN) classifiers with back propagation [213], as well as hidden Markov models (HMM) [75], have already been investigated on the same dataset with different feature extraction techniques. The advantages of SVM are that it:

• is able to find a globally optimal solution;
• can produce good generalization; and
• performs well with a relatively small amount of training data.

SVM can find a globally optimal solution, as opposed to neural network training, which is susceptible to becoming trapped in local minima. The following section describes the SVM classification technique used for classifying the three types of feature sets discussed earlier.

6.4 Support Vector Machine

A support vector machine (SVM) is a supervised binary classifier that separates the input data into two possible classes. The SVM, developed by Vapnik [28], is a state-of-the-art classifier that has been successfully exploited for various pattern recognition applications. SVM works by projecting the data onto a high-dimensional space and then finding the optimal separating hyper-plane between the classes. One of the key strengths of SVM is the generalization obtained by tuning the trade-off between the structural complexity of the classifier and the empirical error. It also performs well even with a small amount of training data.

6.4.1 Optimal Separating Hyper-plane

Consider the problem of separating a set of training vectors belonging to two different classes, labelled $+1$ and $-1$. The training sample is $\{x^t, r^t\}$, where $r^t = +1$ if $x^t \in C_1$ and $r^t = -1$ if $x^t \in C_2$. The linear decision function of the SVM is given by

f(x^t) = w^T x^t + b \qquad (6.27)

The interest is to find w and b such that

wT x t b  1 for  t   1 (6.28)

wT x t b  1 for  t   1 (6.29)

Equations (6.28) and (6.29) can be rewritten as

r^t (w^T x^t + b) \ge +1 \qquad (6.30)

The real-valued output f(x) is converted to a positive or negative label using the signum function. The linear decision function f(x) partitions the input space into two parts by creating a hyper-plane given by

wTt x b 0 (6.31)

This hyper-plane lies between two bounding planes given by

w^T x + b = +1 \quad \text{and} \quad w^T x + b = -1 \qquad (6.32)

The distance from the hyper-plane to the bounding planes on either side is called the margin; by maximizing this margin, the generalization error can be minimized. Using the hypothesis class of lines, the optimal separating hyper-plane is therefore the one that maximizes the margin. Figure 6.3 shows two linearly separable classes separated by the optimal hyper-plane. The data points closest to the separating hyper-plane are known as support vectors.

Figure 6.3: Representation of linearly separable data of two classes, class 1 is represented by grey dots and class 2 is represented by pink dots. An optimal hyper-plane separates the two classes. The four support vectors are shown as black and dark pink dots.

The distance of a point xt to the discriminant is given by

\frac{|w^T x^t + b|}{\lVert w \rVert} \qquad (6.33)

When $r^t \in \{-1, +1\}$, Equation (6.33) can be rewritten as

\frac{r^t (w^T x^t + b)}{\lVert w \rVert} \ge \rho, \qquad \forall t \qquad (6.34)

Simple vector geometry shows that the margin is equal to $1/\lVert w \rVert$, so that $\rho\,\lVert w \rVert = 1$ is fixed for a unique solution.

Minimizing $\lVert w \rVert$ is equivalent to minimizing $\frac{1}{2}\lVert w \rVert^2$, and the use of this term makes it possible to perform quadratic programming (QP) optimization later on, so the problem can be defined as

\min \; \frac{1}{2}\lVert w \rVert^2 \quad \text{subject to} \quad r^t (w^T x^t + b) \ge +1, \; \forall t \qquad (6.35)

This is a standard quadratic optimization problem and can be solved to find w and b. In finding the optimal hyper-plane, the optimization problem can be converted to a form whose complexity depends on N, the number of training instances, and not on d, the input dimensionality. To obtain the new formulation, Equation (6.35) is written as an unconstrained problem using Lagrange multipliers $\alpha^t$:

L_p = \frac{1}{2}\lVert w \rVert^2 - \sum_{t=1}^{N} \alpha^t \left[ r^t (w^T x^t + b) - 1 \right] \qquad (6.36)

L_p = \frac{1}{2}\lVert w \rVert^2 - \sum_{t} \alpha^t r^t (w^T x^t + b) + \sum_{t} \alpha^t \qquad (6.37)

$L_p$ must be minimized with respect to w and b, which requires the gradient of $L_p$ to vanish with respect to w and b. Hence the conditions:

L p 0  wx  t t t (6.38) w t

L p 0 tt 0 (6.39) b t

Substituting Equations (6.38) and (6.39) into Equation (6.37) gives a new formulation which depends only on $\alpha^t$ and is known as the dual:

L_d = \frac{1}{2}(w^T w) - w^T \sum_{t} \alpha^t r^t x^t - b \sum_{t} \alpha^t r^t + \sum_{t} \alpha^t
    = -\frac{1}{2}(w^T w) + \sum_{t} \alpha^t
    = -\frac{1}{2} \sum_{t}\sum_{s} \alpha^t \alpha^s r^t r^s (x^t)^T x^s + \sum_{t} \alpha^t \qquad (6.40)

This is maximized with respect to $\alpha^t$ only, subject to the constraints

\sum_{t} \alpha^t r^t = 0 \quad \text{and} \quad \alpha^t \ge 0, \; \forall t \qquad (6.41)

This can be solved using quadratic optimization methods. The size of the dual depends on N, the sample size, and not on d, the input dimensionality. Solving the dual returns $\alpha^t$, from which w can be calculated using Equation (6.38). The offset b is then obtained from the support vectors, i.e., the instances $x^t$ that lie on the margin and satisfy

r^t (w^T x^t + b) = 1 \qquad (6.42)

Using this fact, b can be calculated from any support vector as

b t w T x t (6.43)

For numerical stability, it is recommended to average the values of b obtained from all of the support vectors. The discriminant thus found is called the support vector machine (SVM).

6.4.2 The Non-Separable Data: Soft Margin Hyper-plane

If the given data are not linearly separable, the above-mentioned algorithm is not suitable. For such data, where no hyper-plane can separate the two classes, a soft margin method is introduced into the SVM training criterion so that the smallest possible number of errors occurs. The soft margin method defines slack variables $\xi^t \ge 0$, which store the deviation from the margin and measure the degree of misclassification. Relaxing Equation (6.30) with the slack variables gives

tw T x t b 1  t (6.44)

If $\xi^t = 0$, there is no problem with $x^t$. If $0 < \xi^t < 1$, $x^t$ is correctly classified but lies inside the margin. If $\xi^t \ge 1$, $x^t$ is misclassified. The soft error is therefore defined as $\sum_t \xi^t$ and added as a penalty term:

L_p = \frac{1}{2}\lVert w \rVert^2 + C \sum_{t} \xi^t \qquad (6.45)

where C is the penalty factor which, as in any regularization scheme, trades off complexity against error: it penalizes the misclassified points and also those lying in the margin, for better generalization. Adding the constraints, the Lagrangian of Equation (6.37) then becomes

L_p = \frac{1}{2}\lVert w \rVert^2 + C \sum_{t} \xi^t - \sum_{t} \alpha^t \left[ r^t (w^T x^t + b) - 1 + \xi^t \right] - \sum_{t} \mu^t \xi^t \qquad (6.46)

where $\mu^t$ are the new Lagrange parameters, with $\alpha^t, \mu^t \ge 0$. In order to find w, b and $\xi^t$, $L_p$ is differentiated with respect to w, b and $\xi^t$ and the derivatives are set to zero.

6.4.3 Kernel Trick

The previously discussed techniques of the optimal separating hyper-plane and the soft margin hyper-plane use decision surfaces that are linear, and both are suitable for linearly separable (or nearly separable) data. However, in most cases the dataset is non-linear and inseparable, and a non-linear decision function is required. Instead of using a non-linear decision function directly, the non-linear data can be mapped to a new space by a non-linear transformation, choosing suitable basis functions and then using a linear model in this new space (the feature space). The linear model in the new space corresponds to a non-linear model in the original space $\mathcal{X}$. The transformation is $z = \varphi(x)$, where $z_j = \varphi_j(x)$, $j = 1, \ldots, k$, mapping the data points from the input space $\mathcal{X}$ to the feature space $Z$, where the discriminant is written as

g() z wT z g()() x wT x (6.47) k   wxjj ( ) j1

Here b is not used separately; it is assumed that $z_1 = \varphi_1(x) = 1$. The dual form of the optimization problem in the feature space is given in Equation (6.48):

L_d = \sum_{t} \alpha^t - \frac{1}{2} \sum_{t}\sum_{s} \alpha^t \alpha^s r^t r^s\, \varphi(x^t)^T \varphi(x^s) \qquad (6.48)

subject to

tr t0 and 0  t  C ,  t (6.49) t

In kernel machines, the inner product of basis functions, $\varphi(x^t)^T \varphi(x^s)$, is replaced by a kernel function, $K(x^t, x^s)$, evaluated between instances in the original input space:

L_d = \sum_{t} \alpha^t - \frac{1}{2} \sum_{t}\sum_{s} \alpha^t \alpha^s r^t r^s\, K(x^t, x^s) \qquad (6.50)

The kernel function can also be substituted into the discriminant:

g(x) = w^T \varphi(x) = \sum_{t} \alpha^t r^t\, \varphi(x^t)^T \varphi(x) = \sum_{t} \alpha^t r^t\, K(x^t, x) \qquad (6.51)

Once the kernel function is known, the projection of the data is done implicitly. For any valid kernel there is a corresponding mapping function, but it may be much simpler to use $K(x^t, x)$ rather than calculating $\varphi(x^t)$ and $\varphi(x)$ and taking their dot product. Thus linear separation of the data can be performed in the high-dimensional feature space $Z$, which is equivalent to non-linear classification in the original space $\mathcal{X}$.

Many algorithms have been kernelized. These can be derived by selecting functions that satisfy certain mathematical properties. Selection of a suitable kernel function for the data is an important process for further training of SVM.

The most popular, general purpose kernel functions used in SVM classifiers are the following (a short code sketch of these kernels is given after the list):

• Polynomial of degree q:

  K(x^t, x) = (x^T x^t + 1)^q

  where q is defined by the user.

• Radial basis function (RBF):

  K(x^t, x) = \exp\left( -\frac{\lVert x - x^t \rVert^2}{2 s^2} \right)

  This defines a spherical kernel, where $x^t$ is the centre and s, the radius, is given by the user. The RBF is the most widely used kernel type in support vector machines, owing to its performance.

• Sigmoid function:

  K(x^t, x) = \tanh(2\, x^T x^t + 1)

• Linear kernel:

  K(x^t, x) = x^T x^t
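Written out as plain NumPy functions, the kernels listed above take the following form (q and s are the user-chosen parameters mentioned in the text); this is an illustrative sketch rather than the implementation used in the experiments.

import numpy as np

def poly_kernel(xt, x, q=3):
    # Polynomial kernel of degree q.
    return (x @ xt + 1) ** q

def rbf_kernel(xt, x, s=1.0):
    # Radial basis function kernel with spread s.
    return np.exp(-np.linalg.norm(x - xt) ** 2 / (2 * s ** 2))

def sigmoid_kernel(xt, x):
    # Sigmoid kernel as given in the text.
    return np.tanh(2 * (x @ xt) + 1)

def linear_kernel(xt, x):
    # Plain dot product.
    return x @ xt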


6.4.4 Multiclass Kernel Machines

Inherently, SVM is a binary classifier. When there are K > 2 classes, the common one-vs-rest method is generally employed so that binary classifiers can still be used. In one-vs-rest, each class is trained against all other classes combined and K support vector machines are learned. In training, examples of class $C_i$ are labelled as +1 and examples of all other classes $C_k$, $k \ne i$, are labelled as −1, whilst in testing all $g_i(x)$, $i = 1, \ldots, K$, are calculated.

In another approach, instead of building K two-class SVM classifiers to separate one from all the rest, a one against one (pair-wise separation) multiclass SVM is proposed.

For K > 2 classes, K(K − 1)/2 pair-wise classifiers are built, with each $g_{ij}(x)$ taking examples of $C_i$ with the label +1 and examples of $C_j$ with the label −1, and not using examples of the other classes. Separating classes in pairs is a simpler problem and has the additional advantage of faster optimization because each classifier uses less data.

In general, both one-vs-rest and pair-wise separation are special cases of error-correcting output codes [214], which decompose a multiclass problem into a set of two-class problems.

In yet another approach, Weston and Watkins [215] proposed to write a single multiclass optimization problem involving all classes

\min \; \frac{1}{2} \sum_{i=1}^{K} \lVert w_i \rVert^2 + C \sum_{i} \sum_{t} \xi_i^t \qquad (6.52)

subject to

w_{z^t}^T x^t + w_{z^t 0} \ge w_i^T x^t + w_{i0} + 2 - \xi_i^t, \quad \forall i \ne z^t, \quad \text{and} \quad \xi_i^t \ge 0

where $z^t$ contains the class index of $x^t$ and C is the usual regularization parameter.

The SVM multiclass implementation used in this dissertation is publicly available and is based on the multiclass formulation described by Crammer and Singer [216], an enhanced version of that of Weston and Watkins [215]. To solve the optimization problem, SVMmulticlass1 uses an algorithm based on structural SVMs [217]. As far as the author is aware, this is the first time a fully implemented multiclass classification has been attempted in visual speech recognition.

6.5 Summary

In this chapter, visual features used for visual speech recognition are reviewed. It is hypothesized that appearance based features provide a better representation of visual speech compared to shape based and model based approaches. Appearance based features also do not require further localization of lip features throughout the image sequence as in contour and combination based techniques. Advantages and limitations of both the appearance based and shape based features are discussed.

Computer-based lip-reading studies have indicated that the important visual information lies in the temporal change of the mouth [194] and that motion features are more discriminative than static features for this task [127]. Based on these studies, this research uses appearance based motion features, computed by optical flow estimation.

Optical flow based DMHIs are developed. To represent these spatio-temporal templates, global internal region based descriptors are selected. In this research, two region based feature descriptors, ZM and HM, are evaluated. ZM are orthogonal moments capable of reflecting both the shape and the intensity distribution of DMHIs. ZM and HM have good rotational properties and are invariant to changes of mouth orientation in the images. The number of ZMs was determined empirically, while all seven HMs are computed from each DMHI.

Finally, the chapter concludes with a thorough discussion of the SVM classifier. SVM is a discriminative classifier that classifies features without requiring a priori information about the data distribution, and it is able to find a globally optimal solution. The discussion includes the theory behind the SVM binary and multiclass classifiers, with details of the optimal separating hyper-plane and of soft margin separation of linearly non-separable data using slack variables. SVM kernels are also described, namely the linear, polynomial, radial basis function and sigmoid kernels.

1 http://svmlight.joachims.org/svm_multiclass.html


Chapter 7

Experimental Results

This chapter reports on the experiments conducted to evaluate the performance of the motion templates computed from the vertical component of optical flow and of the directional motion history image (DMHI) technique, which is also based on optical flow.

The experimental work consists of two parts. Section 7.1 reports solely on the technique based on the vertical component of optical flow, investigating viseme classification in terms of accuracy, sensitivity and specificity. The vertical component of optical flow contains most of the information of a visual speech (viseme) utterance; however, to capture the mouth motion while a subject smiles or laughs, the horizontal component cannot be ignored. The optical flow vertical component is divided into multiple non-overlapping blocks and the statistical features of each block are used as the features of an utterance. These features were classified using a support vector machine classifier, and for recognition and further performance evaluation of the proposed features, SVM multi-class classification is performed. The performance of multiple block sizes was evaluated empirically. The detailed theoretical framework of the feature extraction and classification techniques has been explained in Chapters 4 and 6 respectively.

Section 7.2 describes the DMHI based viseme classification. Two types of image features were examined for DMHI: Zernike moments (ZM) and Hu moments (HM). These features were classified using an SVM classifier; the detailed theoretical description of ZM and HM has been given in Chapter 6. In addition, the proposed DMHI technique is compared with the traditional motion history image. For better presentation of the results, performance is evaluated in terms of accuracy, specificity and sensitivity. All experiments in this dissertation were conducted using a leave-one-out mechanism.


7.1 Experimental Setup 1: Performance Evaluation of an Optical Flow Based Motion Template

The proposed lip-reading technique is tested on a viseme based speech model. Recognition units such as digits, phonemes, words and phrases in various languages have been used as vocabulary in lip-reading applications.

7.1.1 Methodology for Classification

The vertical component of optical flow, of size 240 × 320 pixels, was initially divided into eight vertical columns as shown in Figure 7.1. The average intensity of each vertical column was computed, so that each vertical-component image of the optical flow was represented by eight values. As described in Chapter 4, similar subsequent images in the video sequence were ignored for the optical flow computation. Ignoring these frames has dual benefits: firstly, it reduces the computational burden on the system, and secondly it helps to compensate for inter-subject variation in the speed of speech.

Figure 7.1: Development of optical flow vertical component based motion template (block size is 240 × 40).

Consecutive values from each vertical component field are stacked so that a matrix of size n × 8 is developed, where n is the number of frames in an utterance. To compensate for the variation in speaking speed due to the way people speak, this n × 8 feature matrix is normalized to a 10 × 8 matrix by linear interpolation for each utterance. Finally, each utterance is represented by 10 × 8 = 80 values. After the feature vectors are obtained, the state-of-the-art SVM classifier is used for classification.
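The template construction described above can be sketched as follows, assuming the per-frame vertical optical-flow components are stacked in an array of shape (n, 240, 320); a grid of (1, 8) reproduces the eight-column template of Figure 7.1, while (5, 8) reproduces the block template adopted later in Section 7.1.2. Names and array shapes are illustrative.

import numpy as np

def motion_template(flow_v, grid=(1, 8), n_out=10):
    # flow_v: array of shape (n_frames, 240, 320) of vertical flow values.
    n, H, W = flow_v.shape
    rows, cols = grid
    bh, bw = H // rows, W // cols
    # Average the flow inside each non-overlapping block of every frame.
    blocks = flow_v.reshape(n, rows, bh, cols, bw).mean(axis=(2, 4))
    feats = blocks.reshape(n, rows * cols)
    # Linear interpolation along time to a fixed number of frames,
    # compensating for differences in speaking speed.
    t_old = np.linspace(0, 1, n)
    t_new = np.linspace(0, 1, n_out)
    out = np.stack([np.interp(t_new, t_old, feats[:, k])
                    for k in range(feats.shape[1])], axis=1)
    return out.flatten()   # 10 x 8 = 80 (or 10 x 40 = 400) features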

For classification the following schemes can be employed [163]:

• the re-substitution method (training and test sets are the same);
• the holdout method (half the data is used for training and the rest for testing);
• the leave-one-out method;
• the rotation method or N-fold cross validation (a compromise between the leave-one-out method and the holdout method, which divides the samples into P disjoint subsets, 1 ≤ P ≤ N; P − 1 subsets are used for training and the remaining subset for testing); and
• the bootstrap partitioning method [218].

In most cases, the leave-one-out cross validation scheme is used for partitioning. This means that out of N samples from each of the classes in the database, N − 1 are used to train (design) the classifier and the remaining one is used to test it [203]. This process is repeated N times, each time leaving a different sample out, so that all of the samples are ultimately used for testing. The resultant recognition rate is averaged over the repetitions, and this estimate is usually unbiased.

In the first experiment, 5-fold cross validation is adopted, including all subjects together. The one-vs-rest SVM classification technique was adopted to separate the visual speech features into 14 visemes. The LIBSVM toolbox [219] was used in the experiments to design the SVM classifier model. Four kernel functions, i.e., linear, sigmoid, polynomial of order three and radial basis function (RBF), were tested on the extracted features. Based on these experiments the RBF kernel was found to produce the best results and was selected for the classification. The gamma parameter and the error-term penalty parameter C of the RBF kernel function were optimized using preliminary experiments (grid search).
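As a hedged illustration of this setup, the following scikit-learn sketch performs one-vs-rest RBF-kernel SVM classification with a grid search over C and gamma under 5-fold cross validation; scikit-learn's SVC is itself built on LIBSVM. The feature file names, labels and parameter ranges are hypothetical, not those used in the thesis.

import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical files: one 80-dimensional feature row per utterance and
# the corresponding viseme labels (14 classes).
X = np.load("utterance_features.npy")
y = np.load("viseme_labels.npy")

grid = GridSearchCV(
    OneVsRestClassifier(SVC(kernel="rbf")),
    param_grid={"estimator__C": [1, 10, 100],
                "estimator__gamma": [1e-3, 1e-2, 1e-1]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)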


The classification performance of the SVM was tested using the leave-one-out method. Each repetition of the experiments used 784 training samples and 196 test samples (2 samples from each class of each subject). This was repeated five times for each viseme, each time with different training and test data. The performance measures (sensitivity, specificity and accuracy) were then averaged over the five repetitions of the experiments.

7.1.2 Viseme Classification Results

In the first part of the experiments, the performance of the vertical component of optical flow, providing 80 features per utterance, was investigated. Sensitivity and specificity are statistical measures of the performance of a two-class problem, where one class is treated as positive (+1) and the other as negative (−1). These statistical measures are defined as follows:

TP TN Accuracy 100% (7.1) TP TN FP FN

TP Sensitivity 100% (7.2) TP FN

TN Specificity 100% (7.3) FP TN where TP=True Positive, TN=True Negative, FP=False Positive, and FN=False Negative and the term accuracy is the measure of actual (true) positives and negatives which are correctly recognized as positives and negatives. Sensitivity measures the proportion of actual positives which are correctly identified as positives. Specificity measures the proportion of negatives which are correctly identified as negatives. The results are summarized in Table 7.1
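The measures defined in Equations (7.1)–(7.3) can be computed directly from the TP/TN/FP/FN counts of each one-vs-rest test; the following small Python helper is a sketch, and the example counts are made up purely for illustration.

def binary_measures(tp, tn, fp, fn):
    accuracy = 100.0 * (tp + tn) / (tp + tn + fp + fn)   # Eq (7.1)
    sensitivity = 100.0 * tp / (tp + fn)                  # Eq (7.2)
    specificity = 100.0 * tn / (fp + tn)                  # Eq (7.3)
    return sensitivity, specificity, accuracy

# e.g. a viseme with 60 true positives, 10 false negatives, 5 false
# positives and 905 true negatives out of 980 test samples:
print(binary_measures(tp=60, tn=905, fp=5, fn=10))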


Table 7.1: Average Classification Results of 14 Visemes in Terms of Specificity, Sensitivity and Accuracy, block size 240 × 40. (All values in %)

Visemes    Specificity    Sensitivity    Accuracy
1. /a/     98             65.7           95.7
2. /ch/    99.2           85.7           98.3
3. /e/     95.7           54.3           92.8
4. /g/     97.7           61.4           95.1
5. /th/    98.4           65.7           96
6. /i/     96.9           60             94.3
7. /m/     99.9           80             98.5
8. /n/     97.5           52.9           94.3
9. /o/     97.5           65.7           95.2
10. /r/    97.9           61.4           95
11. /s/    99             52.9           95.7
12. /t/    99.1           74.3           97.3
13. /u/    97.8           61.4           95.2
14. /v/    99.7           88.6           98.9
Average    98.2           66.4           95.9

Though the sensitivity is not as high as desired, the classification results are acceptable. Analysis of the adopted methodology showed that the division of the optical flow vertical component into full-height vertical columns, as shown in Figure 7.1, is problematic: it cancels out important features during the average computation of each column. The reason is that, during speech, the two lips always move in opposite directions, and this motion is represented by positive and negative values in the vertical component of optical flow, the sign indicating the direction of motion. Hence, when computing the average value of each column, values of opposite sign cancel each other and the important motion features are lost.
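The cancellation effect described above can be illustrated with a toy example: when a full-height column contains equal amounts of upward (+) and downward (−) vertical flow, the column average is close to zero, whereas averaging the upper and lower halves separately preserves both motions. The values below are invented purely for illustration.

import numpy as np

column = np.concatenate([np.full(120, +0.8),    # upper lip moving up
                         np.full(120, -0.8)])   # lower lip moving down
print(column.mean())                             # ~0.0 -> motion information is lost
print(column[:120].mean(), column[120:].mean())  # +0.8 and -0.8 are retained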

The solution to this problem is the use of non-overlapping rectangular blocks, as shown in Figure 7.2. As can be seen from the figure, the upper and lower lip motions are separated so that the cancelling effect is eliminated. To optimize the block size, experiments were conducted with several candidate sizes (including 40 × 32, 32 × 32, 48 × 40, 24 × 20 and 30 × 20 pixels), and a block size of 48 × 40 pixels was chosen as the optimum, representing a good compromise between sensitivity, accuracy and the number of features. As a result, each image of the optical flow vertical component is divided into blocks of size 48 × 40 pixels, resulting in 40 blocks (5 rows × 8 columns) per optical flow field, as shown in Figure 7.2.

Figure 7.2: Representation of vertical component division in rectangular blocks.

To develop the template for each viseme, each optical flow frame was represented by the average of each block, which resulted in a matrix of 5 rows × 8 columns = 40 values; each such matrix was then converted into a row vector. The row vectors of subsequent optical flow frames were stacked to develop the template matrix. To overcome differences in speaking speed, each utterance was normalized (temporally) to 10 frames using linear interpolation. This resulted in a final feature vector of size 400 (10 × 40). The procedure for developing this motion template is shown in Figure 7.3.

Figure 7.3: Development of optical flow vertical component based motion template (block size is 48 × 40).


Again the one-vs-rest binary class SVM technique was adopted for the classification. Results are shown in Table 7.2. A wide variety of feature extraction and classification algorithms have been suggested to date. It is quite difficult to compare the results as they are rarely tested on a common audio visual dataset. However it can be observed from Table 7.2 that the accuracy figure of 98.5% compares favourably to the techniques presented in [14, 34, 43, 67, 68, 72], in a visual-only scenario. The average specificity and sensitivity values of 99.6% and 84.2% respectively indicate that the proposed method is very efficient.

Table 7.2: Average classification results of individual one-vs-rest binary class SVM for 14 visemes (All values in %)

Visemes    Specificity    Sensitivity    Accuracy
1. /a/     99.9           82.9           98.6
2. /ch/    99.6           92.9           99.1
3. /e/     99.5           84.3           98.4
4. /g/     99.1           77.1           97.6
5. /th/    99.6           82.9           98.4
6. /i/     99.2           80             97.9
7. /m/     100            100            100
8. /n/     99.8           78.6           98.3
9. /o/     100            87.1           99.1
10. /r/    99.2           77.1           97.7
11. /s/    99.1           72.9           97.2
12. /t/    99.5           75.7           97.8
13. /u/    100            94.3           99.6
14. /v/    100            92.9           99.5
Average    99.6           84.2           98.5

The work by Mase et al. [72] is the closest reported to date to the proposed method. While that work uses a time warping based classification, this work uses support vector machines (SVM), and normalization is achieved by intensity quantization and the use of a fixed number of frames per utterance to overcome differences in speaking speed. A further difference is that Mase et al. experimented on digit recognition while this work reports viseme recognition (visemes being the fundamental units of visual speech). This difference is important because the optical flow computed from digits contains more information than that computed from visemes, which are shorter, and because there are fewer differences between different visemes. The proposed method will eventually lead to the development of continuous visual speech recognition with a limited vocabulary, within which digit recognition would also be achieved.

Figure 7.4 shows the cross-validation results for the accuracy, specificity and sensitivity values. As can be seen, the accuracy and specificity values do not have outliers and the results are highly consistent. The plot for sensitivity shows an increase in the standard deviation for visemes /g/, /r/, /s/ and /t/ (compared with individual data), even though the overall results are impressive. This can be attributed to fricatives and stop consonants such as /g/, /t/ and /r/ being the most difficult to identify in visual speech recognition, because these sounds depend not only on lip movement but also on the movement of the tongue, which is not observed in the video data. It is proposed that these types of errors can be corrected using contextual information; they cannot be resolved using only visual speech data.

In addition to the above experiments, a hierarchical structure for classification is used to realize the multiclass classification. The SVMmulticlass (V2.13)2 implementation is used for this purpose. It uses the multiclass formulation described by Crammer and Singer [216] and, to solve the resulting optimization problem, an algorithm based on structural SVMs [217].

In this hierarchical classification the first classifier is binary and classifies /a/ vs ~/a/ (not /a/). If the result of the classifier is +1, the utterance is declared as /a/; otherwise the next classifier, /ch/ vs ~/ch/, is checked. Again, if the result is +1 the utterance is declared as /ch/, otherwise the next classifier is checked, and so on. As the proposed system is designed to work in real time, computationally expensive schemes such as the Directed Acyclic Graph SVM (DAGSVM), which requires many more one-vs-one or one-vs-rest classifiers, are avoided. Although this comes at some cost, the results indicate that the proposed hierarchical process meets the criterion set out for a real-time system. The SVM kernel function used was the radial basis function (RBF). The gamma parameter and the trade-off parameter C of the chosen kernel were optimized using iterative experiments (grid search).
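A minimal sketch of this hierarchical decision process is given below; classifiers is assumed to be an ordered list of (viseme label, trained binary SVM) pairs, and the fall-back behaviour when no classifier fires is an assumption rather than something specified in the thesis.

def classify_hierarchically(feature_vector, classifiers):
    # Query the binary one-vs-rest SVMs in a fixed order and return the
    # first class whose classifier answers +1.
    for label, clf in classifiers:
        if clf.predict([feature_vector])[0] == +1:
            return label
    # Assumed fall-back: if no classifier fires, return the last class.
    return classifiers[-1][0]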

2 http://svmlight.joachims.org/svm_multiclass.html


Figure 7.4: Cross validation Results (a) Accuracy, (b) Specificity and (c) Sensitivity.


The SVMs were trained with 882 training samples and tested using the 98 remaining samples (one sample of each viseme from each of the seven speakers). This process was repeated ten times for each viseme using the standard 10-fold cross validation method.

To improve understanding of the errors generated and of possible solutions, a confusion matrix was generated using a simple hierarchical structure realizing the multi-class SVM. The confusion matrix of the multi-class classification is given in Table 7.3; each row corresponds to a correct class and each column to a predicted class. For instance, four examples of viseme /e/ are misclassified as /i/ and vice versa. The lowest accuracies in Table 7.3 are observed for visemes /th/ and /n/, which are interchangeably misclassified. It has to be noted that both /th/ and /n/ are dental consonants and their visual appearance is near identical, while the sound differs owing to other articulators such as the tongue, teeth and palate. These types of errors cannot be solved by the choice of alternative features or classifiers, but can be corrected by adding contextual information, as is done in state-of-the-art speech recognizers [220]. The multiclass classification results demonstrate that these features, when classified using SVM, can identify the visemes with 85% accuracy, and that the misclassifications are not localized but spread across all visemes. As future work, it is intended to use these data for implementing a context-based classifier to further improve the results.

Table 7.3: Confusion Matrix for 14 visemes using hierarchical multi class SVM

          /a/  /ch/  /e/  /g/  /th/  /i/  /m/  /n/  /o/  /r/  /s/  /t/  /u/  /v/   Accuracy %
/a/        59    1    1    2    1     3    0    1    0    1    0    1    0    0    84.3
/ch/        1   61    0    2    0     0    0    1    1    0    1    2    1    0    87.1
/e/         3    0   56    1    1     4    0    1    0    1    1    2    0    0    80
/g/         2    2    2   58    0     1    0    1    1    0    0    2    1    0    82.9
/th/        1    0    4    0   53     1    0    6    0    2    2    0    0    1    75.7
/i/         0    0    4    0    0    58    0    2    0    0    5    1    0    0    82.9
/m/         0    0    1    0    1     0   64    0    1    0    0    0    2    1    91.4
/n/         0    0    2    3    4     2    0   53    0    2    1    2    1    0    75.7
/o/         1    2    0    0    0     1    0    1   62    2    0    0    1    0    88.6
/r/         3    0    1    1    1     0    0    0    3   61    0    0    0    0    87.1
/s/         0    0    0    0    4     3    0    2    0    2   57    0    0    2    81.4
/t/         2    0    3    3    1     1    0    0    0    2    0   58    0    0    82.9
/u/         0    1    0    0    0     0    0    0    1    1    0    0   67    0    95.7
/v/         0    0    0    0    1     0    0    2    0    0    0    1    0   66    94.3


7.2 Experimental Setup 2: Performance Evaluation of DMHIs

In Chapters 5 and 6, the development of the proposed DMHIs and the feature extraction techniques are discussed. This section presents the performance evaluation of the proposed approach using two different feature extraction techniques, and compares it with the traditional MHI.

Four directional motion history images (DMHIs) for each viseme are developed from the horizontal and vertical components of optical flow; the procedure to develop DMHIs has been described in Chapter 5. Each image of size 240 × 240 pixels represents the integrated motion of the mouth during an utterance in one of four directions (up, down, left and right). The varying facial movements during articulation of the 14 visemes result in four motion templates with different patterns for each utterance. In order to classify these images, image features are required that represent each image with an optimized number of values; the features should have sufficient discriminating power and noise immunity for retrieval from a large image dataset. Two types of image features were evaluated to investigate the proposed technique: Zernike moments (ZM) and Hu moments (HM). ZM are image moments with desirable properties such as expression efficiency, robustness to noise, rotation invariance and multilevel representation for describing the shapes of patterns [203]. The optimum number of ZM features required for classification of the fourteen visemes was determined empirically. Hu [200] derived seven moment functions from the regular moments which are also rotation, scaling and translation invariant; these seven Hu moments were computed from each DMHI. The experiments compared the performance of both feature types using the SVM classifier and verified the robustness of ZM compared with HM.

The SVM classification results for 980 utterances using ZM features and HM features are tabulated in Tables 7.4 and 7.5. The results demonstrate the promising performance of motion templates for the recognition of visemes and indicate that both ZM and HM features are efficient descriptors for representing DMHIs. However, the smaller set of Hu moments shows lower accuracy, which is to be expected from the small number of features.


Table 7.4: Classification results of individual one-vs-rest class SVM using ZM (All values in %)

Zernike Moments of DMHIs

Visemes    Specificity    Sensitivity    Accuracy
1. /a/     99.9           74.3           98.1
2. /ch/    99.7           71.4           97.7
3. /e/     99.5           77.1           97.9
4. /g/     99.8           70             97.7
5. /th/    100            74.3           98.2
6. /i/     99.5           75.7           97.8
7. /m/     99.8           90             99.1
8. /n/     99.5           71.4           97.4
9. /o/     99.9           80             98.5
10. /r/    99.6           72.9           97.7
11. /s/    99.6           70             97.4
12. /t/    99.6           72.9           97.7
13. /u/    99.7           81.4           98.4
14. /v/    99.9           78.6           98.4
Average    99.7           75.7           98

Table 7.5: Classification results of individual one-vs-rest class SVM using Hu moments (All values in %)

Hu Moments of DMHIs

Visemes    Specificity    Sensitivity    Accuracy
1. /a/     99.7           71.4           97.7
2. /ch/    100            65.7           97.6
3. /e/     99.8           67.1           97.4
4. /g/     99.6           58.6           96.6
5. /th/    100            74.3           98.2
6. /i/     100            65.7           97.6
7. /m/     99.9           90             99.2
8. /n/     98.9           64.3           96.4
9. /o/     99.8           80             98.4
10. /r/    99.6           68.6           97.3
11. /s/    99.6           58.6           96.6
12. /t/    99.5           70             97.3
13. /u/    99.9           80             98.5
14. /v/    99.8           74.3           97.9
Average    99.7           70.6           97.6

The seven Hu moment features capture only the coarse shape of the image pattern and are insufficient for complicated pattern matching applications. Another shortcoming of geometric moments is the non-orthogonality of the features, which results in redundancy. The high success rates attained are also attributed to the ability of the RBF-kernel SVM to correctly classify non-linearly separable data.

7.2.1 Comparing the Performance of DMHI vs MHI

The experiments also compared the performance of DMHIs with the traditional MHI. MHI was compared with the proposed DMHI technique using the same number of features per template, namely 64 Zernike moments; the total number of Zernike moments for DMHIs is therefore four times that of MHI because there are four motion templates. Comparative results are presented in Table 7.6, which shows the average performance in identifying the 14 visemes over all 7 subjects using DMHIs and MHI. It can be observed from the results that DMHIs outperformed MHI with ZM features in identifying the utterances on all accounts, as the DMHI representation significantly addresses the overwriting problem. While the average accuracy (98% and 93.66%) and specificity (99.7% and 99%) of the two techniques were comparable, the average sensitivity of DMHIs, at 75.7%, was much better than that of MHI, at 24.4%. Thus, from the results, it is evident that the proposed DMHIs outperform the MHI in recognizing lip movements for different phonemes.

The results indicate that the proposed method using DMHIs is more sensitive in recognizing the correct viseme and leads to fewer false negatives. The proposed method is based on an advanced optical flow analysis [143] in which a standard incremental multi-resolution technique is used to estimate flow fields with large displacements. The optical flow estimated at a coarse level is used to warp the second image toward the first at the next finer level, and a flow increment is calculated between the first image and the warped second image. In building the pyramid, each level is recursively down-sampled from its nearest lower level. The method is robust against lighting changes. The direction of lip motion is an important feature, which is provided by the optical flow based DMHIs. Compared to DMHI, the standard MHI is a gray-scale representation of the difference of successive binary images of a video; in representing the MHI by ZM features, the important information about motion direction is lost, which is critical in visual speech recognition. Hence, comparing the sensitivity values in Table 7.6 suggests that the use of ZMs of DMHIs is successful in representing lip movement, vindicating the hypothesis of this work. The sensitivity, together with the rotational and scale invariance of ZMs, ensures that the feature representation is independent of the subject and of the style with which they speak.

The use of motion templates generated by the optical flow vertical component and optical flow based DMHIs has eliminated the need for temporal modelling of visemes, and hence a static classifier such as SVM is able to classify the visemes reliably.

Table 7.6: Comparison of DMHI vs MHI (All values in %).

                          DMHI                                         MHI
Visemes    Specificity  Sensitivity  Accuracy      Specificity  Sensitivity  Accuracy
1. /a/     99.9         74.3         98.1          99.7         15.7         93.7
2. /ch/    99.7         71.4         97.7          96.7         28.6         91.8
3. /e/     99.5         77.1         97.9          99.8         15.7         93.8
4. /g/     99.8         70           97.7          99.7         11.4         93.3
5. /th/    100          74.3         98.2          98.7         21.4         93.2
6. /i/     99.5         75.7         97.8          98.7         35.7         94.2
7. /m/     99.8         90           99.1          98.4         52.9         95.1
8. /n/     99.5         71.4         97.4          99.3         18.6         93.6
9. /o/     99.9         80           98.5          98.6         25           93.6
10. /r/    99.6         72.9         97.7          100          17.1         94.1
11. /s/    99.6         70           97.4          99           24.2         93.6
12. /t/    99.6         72.9         97.7          98.9         25.1         93.6
13. /u/    99.7         81.4         98.4          99.1         24.7         93.8
14. /v/    99.9         78.6         98.4          99.1         25.6         93.8
Average    99.7         75.7         98            99           24.4         93.7

7.3 Summary

In this chapter the performance of the proposed techniques based on optical flow vertical component motion templates and DMHIs was evaluated, and a high viseme recognition rate was obtained by each method. The vertical component of optical flow for each image of a video was divided into non-overlapping blocks to compute the feature vector. The optimal block size was obtained by experimenting with several different block sizes, and a block size of 48 × 40 pixels was chosen as the optimum. The average intensity value of each block was computed over the complete image sequence of an utterance and used as the feature vector. A mean recognition rate of 98.5%, with specificity of 99.6% and sensitivity of 84.2%, was achieved with the one-vs-rest SVM classifier. To improve understanding of the errors generated, multiclass SVM was also adopted; however, the average recognition accuracy then reduced to 85%. The proposed system is computationally inexpensive, with only 40 features required to represent each frame and 400 features to represent each utterance, and it is independent of the subjects' speed of speech.

In the other experimental setup, four optical flow based DMHIs were developed to represent each viseme and their performance for viseme recognition was evaluated. ZM and HM based features computed from each DMHI produced average recognition rates of 98% and 97.6% respectively using the SVM classifier. The classification results of the proposed DMHI technique were compared with the traditional MHI technique using ZMs as features and SVM as the classifier. The results indicated that DMHIs outperformed MHI in identifying the utterances on all accounts, as the DMHI representation significantly addresses the overwriting problem in MHI. While the average accuracy (98% and 93.66%) and specificity (99.7% and 99%) of the two techniques were comparable, the average sensitivity achieved by DMHIs was 75.7% whereas that of MHI was 24.4%.


Chapter 8

Conclusions and Future Directions

8.1 Conclusions

The main focus of this thesis was to develop a robust visual speech recognition system. Robust features are the desired characteristic for improving existing systems: once robust visual features are extracted, their modelling with any classifier is straightforward.

This thesis has described two novel methods for the extraction of visual speech features and their modelling by support vector machines for visual speech recognition. The proposed feature extraction techniques are based on optical flow analysis, which is defined as the distribution of apparent velocities of brightness pattern movements in an image [92]. This gives the actual motion of the speaker's mouth, as opposed to approximating the motion by computing first and second order derivatives or differences of consecutive images in the sequence. Although optical flow analysis is computationally demanding, recent advances in high-speed processors have resolved this issue.

Using visual information only, the system obtained performance levels on the recognition of 14 visemes defined in MPEG4. The described visual feature vectors consisting of motion and intensity information present two novel approaches to the representation of visual speech information. In the first approach, pure motion features obtained from the optical flow vertical component are used and have outperformed the other approaches. A SVM classifier was applied to classify the optical flow based feature vector and an overall 98.5% accuracy was obtained, and for viseme identification a hierarchical structure for classification is used to realize the multiclass classification, the overall 85% accuracy was obtained. It is concluded that the optical flow approach performed

The optical flow approach thus performed according to expectations. However, some of the loss in performance is due to inaccuracies of the optical flow detection algorithm rather than to limited information content of the features. In addition, it is very difficult to compute accurate optical flow of the speaker's face when occlusions by the tongue and teeth suddenly appear. A further finding is that the vertical component of the optical flow contains more important features than the horizontal component. In addition, the varying speeds of speech between and within speakers were compensated for, making the system suitable for subject-independent scenarios.
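A minimal sketch of this speed-of-speech compensation, assuming it is performed by linearly resampling the per-frame feature vectors of each utterance to a fixed length. The target length of 10 frames is an inference from the 40-per-frame and 400-per-utterance figures above, not a value stated here.

```python
import numpy as np

def normalize_length(frame_features, target_len=10):
    """Linear-interpolation speed normalization: resample an utterance's
    per-frame feature vectors (shape (T, D)) to target_len frames so that
    fast and slow speakers produce vectors of equal dimension."""
    frame_features = np.asarray(frame_features, dtype=float)
    T, D = frame_features.shape
    src = np.linspace(0.0, 1.0, T)
    dst = np.linspace(0.0, 1.0, target_len)
    return np.stack([np.interp(dst, src, frame_features[:, d])
                     for d in range(D)], axis=1)
```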

In the second approach, motion features computed by optical flow analysis were used to develop four directional motion history images (DMHIs), in which motion is mapped to integer values in each image. Because this is not an exact representation of the motion, it results in a slight reduction in overall accuracy. The advantages of optical-flow-based motion computation are that it does not require artificial markers and that it provides the pure motion of the mouth, analogous to human perception.
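The following sketch illustrates the general DMHI idea described above: the flow field is split into left, right, up and down components, and each component drives its own motion history image. The thresholds, decay step and function names are illustrative assumptions rather than the thesis settings.

```python
import numpy as np

def update_dmhis(dmhis, u, v, tau=255, delta=15, thresh=0.5):
    """One update step for the four directional motion history images.
    Pixels moving in a given direction are stamped with the value tau,
    while the remaining pixels of that image decay by delta."""
    masks = {
        'left':  u < -thresh,
        'right': u >  thresh,
        'up':    v < -thresh,   # image y-axis points downwards
        'down':  v >  thresh,
    }
    for name, mask in masks.items():
        hist = dmhis[name]
        hist[mask] = tau
        hist[~mask] = np.maximum(hist[~mask] - delta, 0)
    return dmhis

# Usage sketch: dmhis = {k: np.zeros_like(u) for k in ('left', 'right', 'up', 'down')}
# and update_dmhis(dmhis, u, v) is called for every consecutive frame pair.
```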

Mouth movements represented by DMHIs were classified using features extracted from each DMHI image. Two different image descriptors, ZM and HM, were investigated. A support vector machine classifier was used to classify the ZM and HM features, achieving average accuracies of 98% with Zernike moments and 97.6% with Hu moments. The results demonstrate that DMHIs are an efficient representation of the spatial and temporal information of an utterance and are reliable for phoneme recognition in subject-independent scenarios. To evaluate the performance of the proposed technique, ZM features of the DMHIs were compared with ZM features of the traditional MHI technique. The results indicated that the DMHI outperformed the MHI technique, as it significantly mitigates the motion overwriting problem of the MHI.
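A small sketch of this descriptor stage, assuming OpenCV for the seven Hu moments and the mahotas library for Zernike moments; the moment degree and radius are illustrative choices, not the values used in the thesis.

```python
import cv2
import numpy as np
import mahotas  # assumed available; provides zernike_moments

def dmhi_descriptors(dmhi, zm_degree=12):
    """Rotation-invariant descriptors of a single DMHI: the seven Hu
    moments and Zernike moments up to zm_degree, computed over a disc
    covering the image."""
    hu = cv2.HuMoments(cv2.moments(dmhi.astype(np.float32))).ravel()
    radius = min(dmhi.shape) // 2
    zm = mahotas.features.zernike_moments(dmhi, radius, degree=zm_degree)
    return hu, np.asarray(zm)
```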

Using the DMHI technique, the average sensitivity was a promising 75.7%, compared with 24.4% for the basic motion history image. The advantage of the ZM- and HM-based techniques is that the feature vector is reasonably low in dimension. They also have important properties such as invariance to translation, scale and rotation, which provide robustness to variations in the view angle and distance of the camera.

However, the dimension of the DMHI feature vector is four times that of the MHI. Compared with the ZM-based features, the dimension of the HM feature vector is considerably lower, but HM computation has the drawback of rapidly increasing complexity with increasing moment order.
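For reference, the sensitivity and specificity figures quoted above can be obtained per class from a multiclass confusion matrix, as in the generic sketch below (rows are true visemes, columns are predictions); this is standard bookkeeping rather than anything specific to the thesis.

```python
import numpy as np

def per_class_sensitivity_specificity(confusion):
    """Per-class sensitivity and specificity from a multiclass confusion
    matrix; averaging the two arrays gives summary figures of the kind
    reported in this chapter."""
    confusion = np.asarray(confusion, dtype=float)
    tp = np.diag(confusion)
    fn = confusion.sum(axis=1) - tp
    fp = confusion.sum(axis=0) - tp
    tn = confusion.sum() - tp - fn - fp
    return tp / (tp + fn), tn / (tn + fp)
```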

This thesis also describes a video-based ad hoc temporal segmentation technique for isolated utterances. It was used to detect the start and end frames of an utterance in an image sequence, and is based on a pair-wise pixel comparison method. The efficiency of the proposed technique was tested on the available dataset, which has short pauses between utterances. The average error between automatic and manual segmentation, over all subjects and all utterances, is 2.98 frames per utterance, that is, around 1.5 frames on either side of an utterance, which is negligible. The limitation of the proposed technique is that it is suitable only for data with short pauses between utterances.
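A minimal sketch of a pair-wise pixel comparison segmenter of this kind is given below; the pixel and change-ratio thresholds are placeholders, not the values tuned in the thesis.

```python
import numpy as np

def segment_utterance(frames, pixel_thresh=10, change_thresh=0.01):
    """Count the fraction of pixels whose grey level changes by more than
    pixel_thresh between consecutive frames, and take the first and last
    frames whose change ratio exceeds change_thresh as the utterance
    boundaries. Returns (start, end) frame indices or (None, None)."""
    ratios = []
    for prev, curr in zip(frames[:-1], frames[1:]):
        changed = np.abs(curr.astype(int) - prev.astype(int)) > pixel_thresh
        ratios.append(changed.mean())
    active = [i for i, r in enumerate(ratios) if r > change_thresh]
    if not active:
        return None, None
    return active[0], active[-1] + 1
```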

Because of the nature of the computed features in all three scenarios, the SVM was the classifier of choice. It performed well because it is suited to a relatively small number of features. The success of the SVM in these experiments is attributed to the use of a fixed number of features, which eliminates the need for a temporal classifier. This is one of the first studies in this domain to use multiclass SVM classification and achieve good results.
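As an illustration of this classification stage, the sketch below trains a one-versus-rest SVM on fixed-length utterance vectors using scikit-learn (whose SVC is built on LIBSVM [219]); the kernel and parameter values are assumptions, not the tuned settings used in the thesis.

```python
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import cross_val_score

def evaluate_svm(X, y):
    """Cross-validated accuracy of a one-versus-rest RBF SVM, assuming X
    holds one fixed-length feature vector per utterance (e.g. 400-D) and
    y holds the corresponding viseme labels."""
    clf = OneVsRestClassifier(SVC(kernel='rbf', C=10.0, gamma='scale'))
    return cross_val_score(clf, X, y, cv=5).mean()
```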

Finally, it is concluded that lip reading is progressing steadily, but more research is required to reach the level of human perception. Advances in processing speed and in computer vision algorithms provide additional momentum, and this combination suggests that lip reading will become a major addition to human-machine interaction.

8.2 Future Work

Optical-flow-based visual speech recognition techniques have demonstrated the ability to produce promising performance. A more robust visual speech recognition system could be obtained if face and mouth detection algorithms were integrated with the proposed methods. The Viola-Jones face detection algorithm [95] is one of the fastest and most efficient face detectors available in the literature and could be implemented for face detection.

Even better results could be obtained if the rectangular division of the vertical component of the optical flow were automated, so that the division is performed exactly at the centre line of the lips using mouth corner detection algorithms.
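A possible realization of this suggestion, using the Viola-Jones cascade shipped with OpenCV, is sketched below; the cascade file and the lower-half-of-face heuristic for the mouth region are assumptions for illustration, not part of the thesis.

```python
import cv2

def detect_mouth_roi(gray_frame):
    """Locate the face with the Viola-Jones detector bundled with OpenCV
    and return the lower half of the face box as a rough mouth region,
    or None if no face is found."""
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
    faces = cascade.detectMultiScale(gray_frame, scaleFactor=1.1,
                                     minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    return gray_frame[y + h // 2: y + h, x: x + w]
```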

So far, current VSR systems are research-level setups whose datasets are recorded under carefully controlled conditions. Parameters such as head pose, the distance and view angle between the camera and the subject, and lighting are fixed, yet variations in these parameters can adversely affect visual speech recognition. Video processing in these setups is performed offline. For future work, it is desirable to build a limited-vocabulary prototype that is practically deployable on existing computers or mobile devices, so that real-time limitations can be observed.

Another direction is to examine the feasibility of the proposed optical-flow-based motion templates for recognizing emotion from a subject's facial expressions. Facial expressions can indicate whether a subject is angry or happy, and studies in psychology have reported that facial expressions contribute 55% of the effect of a communicated message, while language and voice contribute 7% and 38% respectively [221]. Human perception of emotion is based on the movements of multiple facial features, so the proposed method, which is based solely on motion, may be suitable for human emotion recognition.

In Chapter 4, a simple speed-of-speech normalization technique based on linear interpolation was adopted. Dynamic time warping (DTW) is a well-known alternative for finding an optimal alignment between variable-length (time-dependent) sequences: the overall distortion between two signals is a sum of local distances between sample points, and the optimal alignment is the one that minimizes this overall distortion. DTW was initially used to compare speech patterns in audio speech recognition, and was later applied to handle differences in speed and time deformation in time-dependent data in information retrieval and data mining. More work is required to compare the performance of the proposed linear interpolation method with DTW, to determine which approach is more suitable for visual speech recognition, specifically when dealing with word recognition rather than viseme recognition.

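For comparison, a textbook DTW distance between two variable-length sequences of per-frame feature vectors can be written as in the sketch below; it uses a Euclidean local distance and no band constraint, and is not tuned for visual speech.

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Plain dynamic time warping between two sequences of per-frame
    feature vectors with shapes (Ta, D) and (Tb, D)."""
    seq_a = np.asarray(seq_a, dtype=float)
    seq_b = np.asarray(seq_b, dtype=float)
    Ta, Tb = len(seq_a), len(seq_b)
    D = np.full((Ta + 1, Tb + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            cost = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[Ta, Tb]
```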

Furthermore, almost all work to date in the visual domain has focused on limited-vocabulary datasets, and these datasets are pre-annotated with word boundaries. Automatic annotation of visual speech recordings at the viseme level is a fundamental requirement in visual speech recognition, and identification of viseme boundaries in continuous speech is the starting point of continuous visual speech recognition. Typically, manual or audio-signal-based approaches have been adopted, but these are not suitable for continuous speech recognition in the visual domain. A hidden Markov model (HMM) is a finite-state model that represents a signal as transitions between a number of states, each associated with a probability distribution. The HMM assumes that the speech signal consists of short, quasi-stationary segments; it models these steady periods and describes how each segment changes into the next, with the changes represented as state transitions and the temporal variation within each segment modelled statistically. This ability of the HMM to represent state transitions could be utilized to identify word boundaries in continuous speech and then further used to segment the visemes, as sketched below.
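As a rough illustration of this future direction (not something implemented in the thesis), the sketch below fits a Gaussian HMM to a sequence of per-frame visual features using the hmmlearn library and reads candidate boundaries from the decoded state changes; the number of states and the model choice are assumptions.

```python
import numpy as np
from hmmlearn import hmm  # assumed available; not used in the thesis itself

def segment_with_hmm(feature_sequence, n_states=3):
    """Fit a Gaussian HMM to per-frame visual features of one recording
    (shape (T, D)) and return the most likely state sequence together with
    the frame indices where the state changes, as candidate boundaries."""
    model = hmm.GaussianHMM(n_components=n_states, covariance_type='diag',
                            n_iter=50)
    model.fit(feature_sequence)
    states = model.predict(feature_sequence)
    boundaries = np.flatnonzero(np.diff(states)) + 1
    return states, boundaries
```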

Automatic temporal segmentation of continuous visual speech into basic visual units, together with the definition of new standard visemes, is the biggest knowledge gap to be addressed. This is a challenging problem in the visual speech recognition domain, and solving it would remove a major obstacle to continuous visual speech recognition.


References

[1] J. R. Pomerantz, "Perception: Overview," in Encyclopedia of Cognitive Science, ed: John Wiley & Sons, Ltd, 2006.

[2] E. Hazard, Lipreading for the Oral Deaf and Hard-of-hearing Person: Thomas, 1971.

[3] J. Jeffers and M. Barley, "Lipreading (speechreading)," Charles C. Thomas, Springfield, IL, 1971.

[4] J. Jeffers and M. Barley, Speechreading (lipreading): Charles C. Thomas Publisher, 1980.

[5] A. Potamianos, et al., "Adaptive categorical understanding for spoken dialogue systems," Speech and Audio Processing, IEEE Transactions on, vol. 13, pp. 321- 329, 2005.

[6] M. A. Walker, "An Application of Reinforcement Learning to Dialogue Strategy Selection in a Spoken Dialogue System for Email," Journal of Artificial Intelligence Research, vol. 12, pp. 387-416, 2000.

[7] D. J. Litman and K. Forbes-Riley, "Predicting student emotions in computer- human tutoring dialogues," 2004, pp. 351-es.

[8] C. Xiaodong and A. Alwan, "Noise robust speech recognition using feature compensation based on polynomial regression of utterance SNR," Speech and Audio Processing, IEEE Transactions on, vol. 13, pp. 1161-1172, 2005.

[9] T. M. Sullivan and R. M. Stern, "Multi-microphone correlation-based processing for robust speech recognition," in Acoustics, Speech, and Signal Processing, 1993. ICASSP-93., 1993 IEEE International Conference on, 1993, pp. 91-94.

[10] X. Haitian, et al., "Robust Speech Recognition by Nonlocal Means Denoising Processing," Signal Processing Letters, IEEE, vol. 15, pp. 701-704, 2008.

[11] T. Zheng-Hua and B. Lindberg, "Low-Complexity Variable Frame Rate Analysis for Speech Recognition and Voice Activity Detection," Selected Topics in Signal Processing, IEEE Journal of, vol. 4, pp. 798-807, 2010.

[12] O. Ghitza, "Auditory nerve representation as a front-end for speech recognition in a noisy environment," Computer Speech & Language, vol. 1, pp. 109-130, 1986.

[13] R. Stern, et al., Signal processing for robust speech recognition: Norwell, MA: Kluwer Academic Pub, 1997.


[14] G. Potamianos, et al., "Recent advances in the automatic recognition of audiovisual speech," Proceedings of the IEEE, vol. 91, pp. 1306-1326, 2003.

[15] T. Schultz and M. Wand, "Modeling coarticulation in EMG-based continuous speech recognition," Speech Communication, vol. 52, pp. 341-353, 2010.

[16] S. Oviatt, "Mutual disambiguation of recognition errors in a multimodel architecture," 1999, pp. 576-583.

[17] G. Papandreou, et al., "Adaptive multimodal fusion by uncertainty compensation with application to audiovisual speech recognition," Audio, Speech, and Language Processing, IEEE Transactions on, vol. 17, pp. 423-435, 2009.

[18] P. S. Aleksic, et al., "Audio-visual speech recognition using MPEG-4 compliant visual features," EURASIP Journal on Applied Signal Processing, vol. 2002, pp. 1213-1227, 2002.

[19] I. Arsic and J. Thiran, "Mutual information eigenlips for audio-visual speech recognition," in 14th European Signal Processing Conference, Italy, 2006.

[20] M. Gurban and J. P. Thiran, "Audio-visual speech recognition with a hybrid SVM- HMM system," in 13th European Signal Processing Conference, 2005.

[21] A. V. Nefian, et al., "Dynamic Bayesian networks for audio-visual speech recognition," EURASIP Journal on Applied Signal Processing, vol. 2002, pp. 1274- 1288, 2002.

[22] K. Saenko and K. Livescu, "AN ASYNCHRONOUS DBN FOR AUDIO-VISUAL SPEECH RECOGNITION," in Spoken Language Technology Workshop, 2006. IEEE, 2006, pp. 154-157.

[23] H. McGurk and J. MacDonald, "Hearing lips and seeing voices," Nature, pp. 746- 748, 1976.

[24] I. Matthews, et al., "Extraction of visual features for lipreading," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 24, pp. 198-213, 2002.

[25] K. Saenko, et al., "Visual Speech Recognition with Loosely Synchronized Feature Streams," 2005, pp. 1424-1431.

[26] W. C. Yau, et al., "Visual speech recognition using dynamic features and support vector machines," International Journal of Image & Graphics, vol. 8, pp. 419-437, 2008.


[27] D. Yu, et al., "A Novel Visual Speech Representation and HMM Classification for Visual Speech Recognition," presented at the Proceedings of the 3rd Pacific Rim Symposium on Advances in Image and Video Technology, Tokyo, Japan, 2008.

[28] V. N. Vapnik, The nature of statistical learning theory: Springer Verlag, 2000.

[29] Q. Summerfield, "Lipreading and audio-visual speech perception," Philosophical Transactions: Biological Sciences, pp. 71-78, 1992.

[30] W. H. Sumby and I. Pollack, "Visual contribution to speech intelligibility in noise," Journal of the Acoustical Society of America, pp. 212-215, 1954.

[31] D. Reisberg, et al., "Easy to hear but hard to understand: A lip-reading advantage with intact auditory stimuli," in Hearing by eye: The psychology of lip-reading., B. Dodd and R. Campbell, Eds., ed: Hillsdale, NJ, England: Lawrence Erlbaum Associates, Inc, 1987, pp. 97-113.

[32] E. Akdemir and T. Ciloglu, "Bimodal automatic speech segmentation based on audio and visual information fusion," Speech Communication, vol. 53, pp. 889- 902, 2011.

[33] J. N. Gowdy, et al., "DBN based multi-stream models for audio-visual speech recognition," in Acoustics, Speech, and Signal Processing, 2004. Proceedings. (ICASSP '04). IEEE International Conference on, 2004, pp. I-993-6 vol.1.

[34] K. Iwano, et al., "Audio-visual speech recognition using lip information extracted from side-face images," EURASIP Journal on Audio, Speech, and Music Processing, vol. 2007, p. 4, 2007.

[35] G. Potamianos, et al., "Towards practical deployment of audio-visual speech recognition," in Acoustics, Speech, and Signal Processing, 2004. Proceedings. (ICASSP '04). IEEE International Conference on, 2004, pp. iii-777-80 vol.3.

[36] T. Chen, "Audiovisual speech processing," Signal Processing Magazine, IEEE, vol. 18, pp. 9-21, 2001.

[37] X. Zhang, et al., "Automatic speechreading with applications to human-computer interfaces," EURASIP Journal on Applied Signal Processing, vol. 2002, pp. 1228- 1247, 2002.

[38] I. Koprinska and S. Carrato, "Temporal video segmentation: A survey," Signal processing: Image communication, vol. 16, pp. 477-500, 2001.

[39] B. E. Dodd and R. E. Campbell, Hearing by eye: The psychology of lip-reading: Lawrence Erlbaum Associates, Inc, 1987.


[40] D. Huggins-Daines, et al., "Pocketsphinx: A Free, Real-Time Continuous Speech Recognition System for Hand-Held Devices," in Acoustics, Speech and Signal Processing, 2006. ICASSP 2006 Proceedings. 2006 IEEE International Conference on, 2006, pp. I-I.

[41] E. Petajan, "Automatic lipreading to enhance speech recognition," in IEEE Global Telecommunications Conference,, Atlanta, GA, USA, 1984, pp. 265-272.

[42] S. Kumar, et al., "EMG based voice recognition," in Intelligent Sensors, Sensor Networks and Information Processing Conference, 2004. Proceedings of the 2004, 2004, pp. 593-597.

[43] S. P. Arjunan, et al., "Unspoken Vowel Recognition Using Facial Electromyogram," in Engineering in Medicine and Biology Society, 2006. EMBS '06. 28th Annual International Conference of the IEEE, 2006, pp. 2191-2194.

[44] B. Denby, et al., "Silent speech interfaces," Speech Communication, vol. 52, pp. 270-287, 2010.

[45] M. Fagan, et al., "Development of a (silent) speech recognition system for patients following laryngectomy," Medical engineering & physics, vol. 30, pp. 419-425, 2008.

[46] T. Hueber, et al., "Ouisper: corpus based synthesis driven by articulatory data," in 16th International Congress of Phonetic Sciences 2007, pp. 2193-2196.

[47] Y. Nakajima, et al., "Non-audible murmur recognition input interface using stethoscopic microphone attached to the skin," in Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03). 2003 IEEE International Conference on, 2003, pp. V-708-11 vol.5.

[48] M. Otani, et al., "Vocal tract shapes of non-audible murmur production," Acoustical science and technology, vol. 29, p. 195, 2008.

[49] V. A. Tran, et al., "Improvement to a NAM-captured whisper-to-speech system," Speech Communication, vol. 52, pp. 314-326, 2010.

[50] V. A. Tran, et al., "Predicting F0 and voicing from NAM-captured whispered speech," in Proc. Speech Prosody, Campinas, Brazil, 2008.

[51] Y. Nakajima, et al., "Non-audible murmur (NAM) recognition," IEICE TRANSACTIONS on Information and Systems, vol. 89, pp. 1-8, 2006.

[52] P. Heracleous, et al., "Analysis and Recognition of NAM Speech Using HMM Distances and Visual Information," Audio, Speech, and Language Processing, IEEE Transactions on, vol. 18, pp. 1528-1538, 2010.


[53] L. C. Ng, et al., "Denoising of human speech using combined acoustic and EM sensor signal processing," in Acoustics, Speech, and Signal Processing, 2000. ICASSP '00. Proceedings. 2000 IEEE International Conference on, 2000, pp. 229- 232 vol.1.

[54] M. Rothenberg, "A multichannel electroglottograph," Journal of Voice, vol. 6, pp. 36-43, 1992.

[55] I. R. Titze, et al., "Comparison between electroglottography and electromagnetic glottography," The Journal of the Acoustical Society of America, vol. 107, p. 581, 2000.

[56] T. F. Quatieri, et al., "Exploiting nonacoustic sensors for speech encoding," Audio, Speech, and Language Processing, IEEE Transactions on, vol. 14, pp. 533-544, 2006.

[57] M. Walliczek, et al., "Sub-word unit based non-audible speech recognition using surface electromyography," Proc. Interspeech, Pittsburgh, PA, pp. 1487-1490, 2006.

[58] P. Suppes, et al., "Brain wave recognition of words," Proceedings of the National Academy of Sciences, vol. 94, p. 14965, 1997.

[59] M. Wester and T. Schultz, "Unspoken speech-speech recognition based on electroencephalography," Master's thesis, Universität Karlsruhe (TH), Karlsruhe, Germany, 2006.

[60] Q. Summerfield, "Some preliminaries to a comprehensive account of audio-visual speech perception," in Hearing by eye: The psychology of lip-reading., B. Dodd and R. Campbell, Eds., ed: Hillsdale, NJ, England: Lawrence Erlbaum Associates, Inc, 1987, pp. 3-51.

[61] P. J. Lucey, "Lipreading across multiple views," PhD Thesis, Speech, Audio, Image and Video Technology Laboratory, School of Engineering Systems, Queensland University of Technology, Brisbane, 2007.

[62] C. Bregler and Y. Konig, "Eigenlips for robust speech recognition," in IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP-94., 1994, pp. 669-672

[63] P. Duchnowski, et al., "See me, hear me: Integrating automatic speech recognition and lip-reading," in Proceedings of the International Conference on Spoken Language and Processing, Yokohama, Japan, 1994, pp. 547–550.


[64] A. Adjoudani and C. Benoit, "On the integration of auditory and visual parameters in an HMM-based ASR," NATO ASI SERIES F COMPUTER AND SYSTEMS SCIENCES, vol. 150, pp. 461-472, 1996.

[65] G. Potamianos and C. Neti, "Audio-visual speech recognition in challenging environments," in Proceedings of the European Conference on Speech Communication and Technology (EUROSPEECH), Geneva, Switzerland, 2003, pp. 1293-1296.

[66] J. Huang, et al., "Audio-visual speech recognition using an infrared headset," Speech Communication, vol. 44, pp. 83-96, 2004.

[67] G. Potamianos, et al., "Audio-visual automatic speech recognition: An overview," Issues in Visual and Audio-Visual Speech Processing, 2004.

[68] A. Sagheer, et al., "Appearance feature extraction versus image transform-based approach for visual speech recognition," International Journal of Computational Intelligence and Applications, vol. 6, pp. 101-122, 2006.

[69] G. Zhao, et al., "Lipreading with local spatiotemporal descriptors," Multimedia, IEEE Transactions on, vol. 11, pp. 1254-1265, 2009.

[70] T. Xiang and S. Gong, "Beyond tracking: Modelling activity and understanding behaviour," International Journal of Computer Vision, vol. 67, pp. 21-51, 2006.

[71] H. Meng, et al., "Motion information combination for fast human action recognition," in Proc: Computer Vision Theory and Applications, Spain, 2007, pp. 21-28.

[72] K. Mase and A. Pentland, "Automatic lipreading by optical-flow analysis," Systems and Computers in Japan, vol. 22, pp. 67-76, 1991.

[73] T. Hazen, "Visual model structures and synchrony constraints for audio-visual speech recognition," Audio, Speech, and Language Processing, IEEE Transactions on, vol. 14, pp. 1082-1089, 2006.

[74] P. Lucey, et al., "Confusability of phonemes grouped according to their viseme classes in noisy environments," in Proc. 10th Australian Int. Conf. Speech Science and Technology., 2004.

[75] W. C. Yau, "Video Analysis of Mouth Movement Using Motion Templates for Computer-based Lip-Reading," PhD Thesis, School of Electrical and Computer Engineering Science, Engineering and Technology, RMIT University, Mebourne, 2008.


[76] A. J. Goldschen, et al., "Rationale for phoneme-viseme mapping and feature selection in visual speech recognition," NATO ASI SERIES F COMPUTER AND SYSTEMS SCIENCES, vol. 150, pp. 505-518, 1996.

[77] A. Rogozan, "Discriminative learning of visual data for audiovisual speech recognition," International journal on artificial intelligence tools, vol. 8, pp. 43- 52, 1999.

[78] J. Luettin, Visual speech and speaker recognition: Citeseer, 1997.

[79] E. Ju and J. Lee, "Expressive Facial Gestures From Motion Capture Data," Computer Graphics Forum, vol. 27, pp. 381-388, 2008.

[80] L. Bernstein, et al., "Speech perception without hearing," Attention, Perception, & Psychophysics, vol. 62, pp. 233-252, 2000.

[81] L. E. Bernstein, et al., "Enhanced speechreading in deaf adults: Can short-term training/practice close the gap for hearing adults?," Journal of speech, language, and hearing research, vol. 44, pp. 5-18, 2001.

[82] E. T. Auer Jr and L. E. Bernstein, "Enhanced visual speech perception in individuals with early-onset hearing impairment," Journal of speech, language, and hearing research, vol. 50, p. 1157, 2007.

[83] C. Benoit, et al., "Which components of the face do humans and machines best speechread?," NATO ASI SERIES F COMPUTER AND SYSTEMS SCIENCES, vol. 150, pp. 315-330, 1996.

[84] M. McGrath and Q. Summerfield, "Intermodal timing relations and audio visual speech recognition by normal hearing adults," The Journal of the Acoustical Society of America, vol. 77, pp. 678-685, 1985.

[85] N. M. Brooke and Q. Summerfield, "Analysis, synthesis, and perception of visible articulatory movements," Journal of Phonetics, vol. 11, pp. 63-76, 1983.

[86] K. E. Finn, "An investigation of visible lip information to be used in automated speech recognition," PhD Thesis, Georgetown University, 1986.

[87] A. A. Montgomery and P. L. Jackson, "Physical characteristics of the lips underlying vowel lipreading performance," The Journal of the Acoustical Society of America, vol. 73, p. 2134, 1983.

[88] C. Yu, et al., "Illumination normalization based on 2D Gaussian illumination model," in Advanced Computer Theory and Engineering (ICACTE), 2010 3rd International Conference on, 2010, pp. V3-451-V3-455.


[89] C. Zhang and Z. Zhang, "A survey of recent advances in face detection," Citeseer2010.

[90] X. J. Zhang, et al., "Finding lips in unconstrained imagery for improved automatic speech recognition," presented at the Proceedings of the 9th international conference on Advances in visual information systems, Shanghai, China, 2007.

[91] G. Fanelli, et al., "Hough transform-based mouth localization for audio-visual speech recognition," 2009.

[92] B. K. P. Horn and B. G. Schunck, "Determining optical flow," Artificial intelligence, vol. 17, pp. 185-203, 1981.

[93] Y. Ming-Hsuan, et al., "Detecting faces in images: a survey," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 24, pp. 34-58, 2002.

[94] P. Viola and M. Jones, "Robust real-time object detection," International Journal of Computer Vision, vol. 57, pp. 137–154, 2002.

[95] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, 2001, pp. I-511-I- 518 vol.1.

[96] J. L. Jiang and L. Kia-Fock, "S-AdaBoost and pattern detection in complex environment," in Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, 2003, pp. I-413-I-418 vol.1.

[97] S. Z. Li and Z. Zhenqiu, "FloatBoost learning and statistical face detection," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 26, pp. 1112-1123, 2004.

[98] L. Rabiner and B. H. Juang, Fundamentals of speech recognition: Prentice Hall, 1993.

[99] H. A. Rowley, et al., "Neural network-based face detection," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 20, pp. 23-38, 1998.

[100] K. Curran, et al., "Neural network face detection," Imaging Science Journal, The, vol. 53, pp. 105-115, 2005.

[101] H. Schneiderman and T. Kanade, "Probabilistic modeling of local appearance and spatial relationships for object recognition," in Computer Vision and Pattern Recognition, 1998. Proceedings. 1998 IEEE Computer Society Conference on, 1998, pp. 45-51.


[102] E. Osuna, et al., "Training support vector machines: an application to face detection," in Computer Vision and Pattern Recognition, 1997. Proceedings., 1997 IEEE Computer Society Conference on, 1997, pp. 130-136.

[103] S. Peichung and L. Chengjun, "Face detection using discriminating feature analysis and support vector machine in video," in Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, 2004, pp. 407- 410 Vol.2.

[104] M. Heckmann, et al., "Noise adaptive stream weighting in audio-visual speech recognition," EURASIP J. Appl. Signal Process., vol. 2002, pp. 1260-1273, 2002.

[105] N. Eveno, et al., "New color transformation for lips segmentation," in Multimedia Signal Processing, 2001 IEEE Fourth Workshop on, 2001, pp. 3-8.

[106] T. Coianiz, et al., "2D deformable models for visual speech analysis," 2002.

[107] X. Zhang and R. M. Mersereau, "Lip feature extraction towards an automatic speechreading system," in Image Processing, 2000. Proceedings. 2000 International Conference on, 2000, pp. 226-229 vol.3.

[108] J. Chaloupka, "Automatic lips reading for audio-visual speech processing and recognition," skin, vol. 1, p. 1.

[109] N. Tsapatsoulis, et al., "Efficient face detection for multimedia applications," 2000, pp. 247-250 vol. 2.

[110] N. Eveno, et al., "Accurate and quasi-automatic lip tracking," Circuits and Systems for Video Technology, IEEE Transactions on, vol. 14, pp. 706-715, 2004.

[111] A. C. Hurlbert and T. A. Poggio, "Synthesizing a color algorithm from examples," Science, vol. 239, p. 482, 1988.

[112] A. W. C. Liew, et al., "Segmentation of color lip images by spatial fuzzy clustering," Fuzzy Systems, IEEE Transactions on, vol. 11, pp. 542-549, 2003.

[113] G. Ye-Peng, "Automatic Extraction of Lip Based on Wavelet Edge Detection," in Symbolic and Numeric Algorithms for Scientific Computing, 2006. SYNASC '06. Eighth International Symposium on, 2006, pp. 125-132.

[114] S. Pigeon and L. Vandendorpe, "The M2VTS multimodal face database (release 1.00)," 1997, pp. 403-409.

[115] K. Messer, et al., "XM2VTSDB: The extended M2VTS database," 1999, pp. 965- 966.


[116] J. Movellan, "Visual speech recognition with stochastic networks," Advances in Neural Information Processing Systems, pp. 851-858, 1995.

[117] E. K. Patterson, et al., "CUAVE: A new audio-visual database for multimodal human-computer interface research," in Acoustics, Speech, and Signal Processing (ICASSP), 2002 IEEE International Conference on, 2002, pp. II-2017-II-2020.

[118] M. Cooke, et al., "An audio-visual corpus for speech perception and automatic speech recognition," The Journal of the Acoustical Society of America, vol. 120, p. 2421, 2006.

[119] A. J. Sell and M. P. Kaschak, "Does visual speech information affect word segmentation?," Memory & cognition, vol. 37, pp. 889-894, 2009.

[120] J. Ma, et al., "Accurate automatic visible speech synthesis of arbitrary 3D models based on concatenation of diviseme motion capture data," Computer Animation and Virtual Worlds, vol. 15, pp. 485-500, 2004.

[121] X. Zhang, "Automatic Speechreading for Improved Speech Recognition and Speaker Verification," PhD Thesis, Georgia Institute of Technology, 2002.

[122] R. C. Gonzalez and R. E. Woods, Digital Image Processing: Addison Wesley Publishing Company, 1992.

[123] R. Polana and R. Nelson, "Low level recognition of human motion (or how to get your man without finding his body parts)," in Motion of Non-Rigid and Articulated Objects, 1994., Proceedings of the 1994 IEEE Workshop on, 1994, pp. 77-82.

[124] I. A. Essa and A. P. Pentland, "Facial expression recognition using a dynamic model and motion energy," in Computer Vision, 1995. Proceedings., Fifth International Conference on, 1995, pp. 360-367.

[125] M. J. Black and Y. Yacoob, "Tracking and recognizing rigid and non-rigid facial motions using local parametric models of image motion," in Computer Vision, 1995. Proceedings., Fifth International Conference on, 1995, pp. 374-381.

[126] J. M. Siskind, "Grounding language in perception," Artificial Intelligence Review, vol. 8, pp. 371-391, 1994.

[127] A. J. Goldschen, et al., "Continuous optical automatic speech recognition by lipreading," in Conference Record of the Twenty-Eighth Asilomar Conference on Signals, Systems and Computers, 1994, pp. 572-577.

[128] J. Diaz, et al., "FPGA-based real-time optical-flow system," Circuits and Systems for Video Technology, IEEE Transactions on, vol. 16, pp. 274-279, 2006.


[129] S. Tamura, et al., "A robust multimodal speech recognition method using optical flow analysis," Spoken multimodal human-computer dialogue in mobile environments, pp. 37-53, 2005.

[130] S. Baker, et al., "A database and evaluation methodology for optical flow," 2007, pp. 1-8.

[131] H. Zimmer, et al., "Complementary optic flow," 2009, pp. 207-220.

[132] A. Wedel, et al., "Structure- and motion-adaptive regularization for high accuracy optic flow," in Computer Vision, 2009 IEEE 12th International Conference on, 2009, pp. 1663-1668.

[133] A. Wedel, et al., "An improved algorithm for TV-L 1 optical flow," Statistical and Geometrical Approaches to Visual Motion Analysis, pp. 23-45, 2009.

[134] J. L. Barron, et al., "Performance of optical flow techniques," International Journal of Computer Vision, vol. 12, pp. 43-77, 1994.

[135] S. S. Beauchemin and J. L. Barron, "The computation of optical flow," ACM Computing Surveys (CSUR), vol. 27, pp. 433-466, 1995.

[136] H. Liu, et al., "Accuracy vs Efficiency Trade-offs in Optical Flow Algorithms," Computer Vision and Image Understanding, vol. 72, pp. 271-286, 1998.

[137] B. McCane, et al., "On Benchmarking Optical Flow," Computer Vision and Image Understanding, vol. 84, pp. 126-143, 2001.

[138] M. Black and P. Anandan, "The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields," Computer Vision and Image Understanding, vol. 63, pp. 75-104, 1996.

[139] B. D. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," in In: Proc. Int. Joint Conf. on Artificial Intelligence, 1981, pp. 674–679.

[140] J. Weickert and C. Schnörr, "Variational Optic Flow Computation with a Spatio- Temporal Smoothness Constraint," Journal of Mathematical Imaging and Vision, vol. 14, pp. 245-255, 2001.

[141] T. Brox, et al., "High accuracy optical flow estimation based on a theory for warping," Computer Vision-ECCV 2004, pp. 25-36, 2004.

[142] A. Bruhn, et al., "Lucas/Kanade meets Horn/Schunck: Combining local and global optic flow methods," International Journal of Computer Vision, vol. 61, pp. 211- 231, 2005.


[143] D. Sun, et al., "Learning Optical Flow," in Computer Vision – ECCV 2008. vol. 5304, D. Forsyth, et al., Eds., ed: Springer Berlin / Heidelberg, 2008, pp. 83-97.

[144] A. Blake and A. Zisserman, "Visual reconstruction," 1987.

[145] B. Liao, et al., "Color optical flow estimation based on gradient fields with extended constraints," in Networking and Information Technology (ICNIT), 2010 International Conference on, 2010, pp. 279-283.

[146] R. Jain, "Difference and accumulative difference pictures in dynamic scene analysis," Image and vision computing, vol. 2, pp. 99-108, 1984.

[147] J. W. Davis and A. F. Bobick, "The representation and recognition of human movement using temporal templates," in Computer Vision and Pattern Recognition, 1997. Proceedings., 1997 IEEE Computer Society Conference on, 1997, pp. 928-934.

[148] J. R. Bergen, et al., "A three-frame algorithm for estimating two-component image motion," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 14, pp. 886-896, 1992.

[149] R. T. Collins, et al., A system for video surveillance and monitoring: Citeseer, 2000.

[150] Y. Kameda and M. Minoh, "A human motion estimation method using 3- successive video frames," 1996, pp. 135–140.

[151] A. J. Lipton, et al., "Moving target classification and tracking from real-time video," in Applications of Computer Vision, 1998. WACV '98. Proceedings., Fourth IEEE Workshop on, 1998, pp. 8-14.

[152] C. Wang and M. S. Brandstein, "A hybrid real-time face tracking system," in Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference on, 1998, pp. 3737-3740 vol.6.

[153] W. Yau, et al., "Lip-Reading Technique Using Spatio-Temporal Templates and Support Vector Machines," Progress in Pattern Recognition, Image Analysis and Applications, pp. 610-617, 2008.

[154] A. F. Bobick and J. W. Davis, "The recognition of human movement using temporal templates," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 23, pp. 257-267, 2001.

[155] O. Masoud and N. Papanikolopoulos, "A method for human action recognition," Image and vision computing, vol. 21, pp. 729-743, 2003.


[156] A. Bobick and J. Davis, "An appearance-based representation of action," in Pattern Recognition, 1996., Proceedings of the 13th International Conference on, 1996, pp. 307-312 vol.1.

[157] J. W. Davis, "Appearance-based motion recognition of human actions," Master's Thesis, Massachusetts Institute of Technology, 1996.

[158] A. Del Bimbo and P. Nesi, "Real-time optical flow estimation," in Systems, Man and Cybernetics, 1993. 'Systems Engineering in the Service of Humans', Conference Proceedings., International Conference on, 1993, pp. 13-19 vol.3.

[159] N. Papenberg, et al., "Highly accurate optic flow computation with theoretically justified warping," International Journal of Computer Vision, vol. 67, pp. 141-158, 2006.

[160] A. Talukder, et al., "Real-time detection of moving objects in a dynamic scene from moving robotic vehicles," in Intelligent Robots and Systems, 2003. (IROS 2003). Proceedings. 2003 IEEE/RSJ International Conference on, 2003, pp. 1308- 1313 vol.2.

[161] J. Wei and N. Harle, "Use of temporal redundancy of motion vectors for the increase of optical flow calculation speed as a contribution to real-time robot vision," pp. 677-680 vol. 2.

[162] M. S. Gray, et al., "Dynamic features for visual speechreading: A systematic comparison," Advances in Neural Information Processing Systems, pp. 751-757, 1997.

[163] M. A. R. Ahad, et al., "Motion history image: its variants and applications," Machine Vision and Applications, pp. 1-27, 2010.

[164] N. Otsu, "A threshold selection method from gray-level histograms," Systems, Man and Cybernetics, IEEE Transactions on, vol. 9, pp. 62-66, 1979.

[165] M. Pantic, et al., "Learning spatio-temporal models of facial expressions," 2005.

[166] M. Valstar, et al., "Motion history for facial action detection in video," in Systems, Man and Cybernetics, 2004 IEEE International Conference on, 2004, pp. 635-640 vol.1.

[167] M. Valstar, et al., "Facial action unit recognition using temporal templates," in 13th IEEE International Workshop on Robot and Human Interactive Communication, ROMAN. , 2004, pp. 253-258.

[168] M. Ahad, "Analysis of motion self-occlusion problem due to motion overwriting for human activity recognition," Journal of Multimedia, vol. 5, pp. 36-46, 2010.


[169] W. C. Yau, et al., "Visual recognition of speech consonants using facial movement features," Integrated computer-aided engineering, vol. 14, pp. 49-61, 2007.

[170] P. Scanlon, "Audio and Visual Feature Analysis for Speech Recognition," PhD Thesis, University College Dublin, 2005.

[171] M. Heckmann, et al., "A hybrid ANN/HMM audio-visual speech recognition system," 2001.

[172] P. Teissier, et al., "Comparing models for audiovisual fusion in a noisy-vowel recognition task," Speech and Audio Processing, IEEE Transactions on, vol. 7, pp. 629-642, 1999.

[173] M. N. Kaynak, et al., "Lip geometric features for human-computer interaction using bimodal speech recognition: comparison and analysis," Speech Communication, vol. 43, pp. 1-16, 2004.

[174] J. Luettin, et al., "Active shape models for visual speech feature extraction," NATO ASI SERIES F COMPUTER AND SYSTEMS SCIENCES, vol. 150, pp. 383-390, 1996.

[175] A. Chitu, et al., "Automatic Lip Reading in the Dutch Language Using Active Appearance Models on High Speed Recordings," in Text, Speech and Dialogue. vol. 6231, P. Sojka, et al., Eds., ed: Springer Berlin / Heidelberg, 2010, pp. 259- 266.

[176] K. Kumar, et al., "Profile View Lip Reading," in Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, 2007, pp. IV- 429-IV-432.

[177] M. E. Hennecke, et al., "Using deformable templates to infer visual speech dynamics," in Signals, Systems and Computers, 1994. 1994 Conference Record of the Twenty-Eighth Asilomar Conference on, 1994, pp. 578-582 vol.1.

[178] M. Lievin, et al., "Automatic lip tracking: Bayesian segmentation and active contours in a cooperative scheme," in Multimedia Computing and Systems, 1999. IEEE International Conference on, 1999, pp. 691-696 vol.1.

[179] A. Salazar, et al., "Automatic Quantitative Mouth Shape Analysis," in Computer Analysis of Images and Patterns. vol. 4673, W. Kropatsch, et al., Eds., ed: Springer Berlin / Heidelberg, 2007, pp. 416-423.

[180] J. C. Wojdel and L. J. M. Rothkrantz, "Visually based speech onset/offset detection " in Proceedings of 5th Annual Scientific Conference on Web Technology, New Media, Communications and Telematics Theory, Methods, Tools and Application (Euromedia), Belgium 2000, pp. 156-160


[181] C. Bregler, et al., "Improving connected letter recognition by lipreading," in Acoustics, Speech, and Signal Processing, 1993. ICASSP-93., 1993 IEEE International Conference on, 1993, pp. 557-560 vol.1.

[182] G. I. Chiou and H. Jenq-Neng, "Lipreading from color video," Image Processing, IEEE Transactions on, vol. 6, pp. 1192-1195, 1997.

[183] S. Dupont and J. Luettin, "Audio-visual speech modeling for continuous speech recognition," Multimedia, IEEE Transactions on, vol. 2, pp. 141-151, 2000.

[184] L. Luhong, et al., "Speaker independent audio-visual continuous speech recognition," in Multimedia and Expo, 2002. ICME '02. Proceedings. 2002 IEEE International Conference on, 2002, pp. 25-28 vol.2.

[185] G. Potamianos, et al., "An image transform approach for HMM based automatic lipreading," in Image Processing, 1998. ICIP 98. Proceedings. 1998 International Conference on, 1998, pp. 173-177 vol.3.

[186] M. J. Tomlinson, et al., "Integrating audio and visual information to provide highly robust speech recognition," in Acoustics, Speech, and Signal Processing, 1996. ICASSP-96. Conference Proceedings., 1996 IEEE International Conference on, 1996, pp. 821-824 vol. 2.

[187] P. L. Silsbee and A. C. Bovik, "Computer lipreading for improved accuracy in automatic speech recognition," Speech and Audio Processing, IEEE Transactions on, vol. 4, pp. 337-351, 1996.

[188] M. Heckmann, et al., "DCT-based video features for audio-visual speech recognition," in Proceedings of International Conference on Spoken Language and Processing, Denver, CO, USA, 2002, pp. 1925-1928.

[189] P. Lucey and S. Sridharan, "Patch-based representation of visual speech," in Proceedings of the HCSNet workshop on Use of vision in human-computer interaction - Volume 56, Canberra, Australia, 2006, pp. 79-85.

[190] J. H. Connell, et al., "A real-time prototype for small-vocabulary audio-visual ASR," in Multimedia and Expo, 2003. ICME '03. Proceedings. 2003 International Conference on, 2003, pp. II-469-72 vol.2.

[191] J. Luettin, et al., "Speechreading using shape and intensity information," in Spoken Language, 1996. ICSLP 96. Proceedings., Fourth International Conference on, 1996, pp. 58-61 vol.1.

[192] M. T. Chan, "HMM-based audio-visual speech recognition integrating geometric and appearance-based visual features," in Multimedia Signal Processing, 2001 IEEE Fourth Workshop on, 2001, pp. 9-14.


[193] L. D. Rosenblum and H. M. Saldaña, "Time-varying information for visual speech perception," Hearing by eye II: Advances in the psychology of speechreading and auditory–visual speech, pp. 61-81, 1998.

[194] C. Bregler, et al., "A hybrid approach to bimodal speech recognition," 1994, pp. 556-560 vol. 1.

[195] R. Goecke, "Audio-video automatic speech recognition: an example of improved performance through multimodal sensor input," 2006, pp. 25-32.

[196] S. Gurbuz, et al., "Application of affine-invariant Fourier descriptors to lipreading for audio-visual speech recognition," 2001, pp. 177-180 vol. 1.

[197] S. Giannarou and T. Stathaki, "Shape signature matching for object identification invariant to image transformations and occlusion," 2007, pp. 710-717.

[198] L. da Fontoura Costa and R. M. Cesar, Shape analysis and classification: theory and practice: CRC, 2001.

[199] M. Kubo, et al., "Content-based image retrieval technique using wavelet-based shift and brightness invariant edge feature," International Journal of Wavelets Multiresolution and Information Processing, vol. 1, pp. 163-178, 2003.

[200] M. K. Hu, "Visual pattern recognition by moment invariants," Information Theory, IRE Transactions on, vol. 8, pp. 179-187, 1962.

[201] M. R. Teague, "Image analysis via the general theory of moments*," JOSA, vol. 70, pp. 920-930, 1980.

[202] D. Zhang and G. Lu, "Review of shape representation and description techniques," Pattern recognition, vol. 37, pp. 1-19, 2004.

[203] A. Khotanzad and Y. H. Hong, "Invariant image recognition by Zernike moments," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 12, pp. 489- 497, 1990.

[204] C. H. Teh and R. T. Chin, "On image analysis by the methods of moments," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 10, pp. 496- 513, 1988.

[205] Y. H. Pang, et al., "Palmprint verification with moments," Journal of WSCG, vol. 12, pp. 325–332, 2004.

[206] J. Flusser, et al., Moments and moment invariants in pattern recognition: Wiley Online Library, 2009.


[207] R. O. Duda, et al., "Pattern Classification, New York," NY: Wiley InterScience, 2000.

[208] A. D. Wilson and A. F. Bobick, "Parametric hidden Markov models for gesture recognition," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 21, pp. 884-900, 1999.

[209] Z. Lingyun and W. Zhengzhi, "Predicting Transmembrane Topology of β-barrel Membrane Proteins with A Hidden Markov Model," in Bioinformatics and Biomedical Engineering, 2007. ICBBE 2007. The 1st International Conference on, 2007, pp. 145-148.

[210] S. Abe, Support vector machines for pattern classification: Springer-Verlag New York Inc, 2010.

[211] A. D. Kulkarni and P. Byars, "Artificial neural network models for image understanding," 1991, p. 512.

[212] M. Heckmann, et al., "A hybrid ANN/HMM audio-visual speech recognition system," in International Conference on Auditory-Visual Speech Processing,, Aalborg, Denmark,, 2001, pp. 189–194.

[213] W. Yau, et al., "Voiceless speech recognition using dynamic visual speech features," in Proc. of HCSNet Workshop on the use of Vision in HCI, Canberra, Australia,, 2006, pp. 93-101.

[214] T. G. Dietterich and G. Bakiri, "Solving multiclass learning problems via error- correcting output codes," Arxiv preprint cs/9501101, 1995.

[215] J. Weston and C. Watkins, "Multi-class support vector machines," 1998.

[216] K. Crammer, et al., "On the algorithmic implementation of multi-class SVMs," Journal of Machine Learning Research, vol. 2, pp. 265–292, 2001.

[217] I. Tsochantaridis, et al., "Support vector machine learning for interdependent and structured output spaces," 2004, p. 104.

[218] A. K. Jain, et al., "Statistical pattern recognition: A review," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 22, pp. 4-37, 2000.

[219] C. C. Chang and C. J. Lin, "LIBSVM: a library for support vector machines," 2001.

[220] M. W. Kim, et al., "Speech Recognition by Integrating Audio, Visual and Contextual Features Based on Neural Networks," in Advances in Natural Computation. vol. 3611, L. Wang, et al., Eds., ed: Springer Berlin / Heidelberg, 2005, pp. 155-164.


[221] A. Mehrabian, Nonverbal communication: Aldine, 2007.
