Research Collection

Doctoral Thesis

Towards Robust Audio-Visual Speech Recognition

Author(s): Naghibi, Tofigh

Publication Date: 2015

Permanent Link: https://doi.org/10.3929/ethz-a-010492674

Rights / License: In Copyright - Non-Commercial Use Permitted


DISS. ETH NO. 22867

Towards Robust Audio-Visual Speech Recognition

A thesis submitted to attain the degree of

DOCTOR OF SCIENCES of ETH ZURICH (Dr. sc. ETH Zurich)

presented by

TOFIGH NAGHIBI
MSc, Sharif University of Technology, Iran
born June 8, 1983
citizen of Iran

accepted on the recommendation of
Prof. Dr. Lothar Thiele, examiner
Prof. Dr. Gerhard Rigoll, co-examiner
Dr. Beat Pfister, co-examiner

2015

Acknowledgements

I would like to thank Prof. Dr. Lothar Thiele and Dr. Beat Pfister for supervising my research. Beat has always allowed me complete freedom to define and explore my own directions in research. While this might prove difficult and sometimes even frustrating to begin with, I have come to appreciate the wisdom of his way, since it not only encouraged me to think for myself, but also gave me enough self-confidence to tackle open problems that I otherwise would not have dared to touch.

The former and current members of the ETH speech group helped me a lot in various ways. In particular, I would like to thank Sarah Hoffmann, who was my computer science and cultural mentor. Her vast knowledge in programming and algorithm design and her willingness to share all of it with me made my electrical-engineer-to-computer-scientist transformation much easier. Further, following her cultural advice, particularly on food and beverage, created the best memories of my PhD life. I want to thank Hui Liang for his valuable feedback and for reading parts of this thesis, usually on very short notice. I also owe him for always being available to listen to my ideas and even to study newly emerging machine learning techniques together, which would not have happened without him. I am also indebted to Lukas Knauer for his German translation of my dissertation abstract and to Naoya Takahashi for lending his PowerPoint expertise.

On the personal side, I wish to thank my parents for all they have done for me. It takes many years to realize how much of what you are depends on your parents. I wish to thank my sister Nasim and my close friend Damoun for the fun we have had growing up together. Finally, special thanks must go to my amazing wife Haleh for her patience. She waited for me many nights while I was playing with my machine learning toys at the office or even at home. Her invisible management of everything that might distract me and her favorite quote "everything is under the control" were literally my anchor throughout my PhD life.

Contents

List of Abbreviations

Abstract

Kurzfassung

1 Introduction
  1.1 Problem Statement
  1.2 Scientific Contributions
  1.3 Structure of this Thesis

2 Visual Features and AV Datasets
  2.1 Visual Feature Descriptions
    2.1.1 Scale-invariant Feature Transform (SIFT)
    2.1.2 Invariant Scattering Convolution Networks
    2.1.3 3D-SIFT
    2.1.4 Bag-of-Words
  2.2 Datasets
    2.2.1 CUAVE
    2.2.2 GRID
    2.2.3 Oulu
    2.2.4 AVletter
    2.2.5 ETHDigits

3 Feature Selection
  3.1 Introduction to Feature Selection Problem
    3.1.1 Various Feature Selection Methods
    3.1.2 Feature Selection Search Strategies
  3.2 Mutual Information Pros and Cons
    3.2.1 First Expansion: Multi-way Mutual Information
    3.2.2 Second Expansion: Chain Rule of Information
    3.2.3 Truncation of the Expansions
    3.2.4 The Superiority of the D2 Approximation
  3.3 Search Strategies
    3.3.1 Convex Based Search
    3.3.2 Approximation Analysis
  3.4 Experiments for COBRA Evaluation
  3.5 COBRA-selected ISCN Features for Lipreading
  3.6 Conclusion and Discussion

4 Binary and Multiclass Classification
  4.1 Introduction to Boosting Problem
  4.2 Our Results
  4.3 Fundamentals
  4.4 Boosting Framework
    4.4.1 Sparse Boosting
    4.4.2 Smooth Boosting
    4.4.3 MABoost for Combining Datasets (CD-MABoost)
    4.4.4 Lazy Update Boosting
  4.5 Multiclass Generalization
    4.5.1 Preliminaries for Multiclass Setting
    4.5.2 Multiclass MABoost (Mu-MABoost)
  4.6 Classification Experiments with Boosting
    4.6.1 Binary MABoost Experiments
    4.6.2 Experiment with SparseBoost
    4.6.3 Experiment with CD-MABoost
    4.6.4 Multiclass Classification Experiments
  4.7 Conclusion and Discussion

5 Visual Voice Activity Detection
  5.1 Introduction to Utterance Detection
  5.2 Supervised Learning: VAD by Using SparseBoost
  5.3 Semi-supervised Learning: Audio-visual VAD
    5.3.1 Ro-MABoost
    5.3.2 Experiments on Supervised & Semi-supervised VAD
  5.4 Conclusion

6 Lipreading with High-Dimensional Feature Vectors
  6.1 Introduction to Visual Speech Recognition
  6.2 Color Spaces
  6.3 Visual Features
  6.4 Multiclass Classifier
  6.5 Lipreading Experiments with ISMA Features
    6.5.1 Oulu: Phrase recognition
    6.5.2 GRID: Digit recognition
    6.5.3 AVletter: Letter recognition
  6.6 Conclusion and Discussion

7 Audio Visual Information Fusion
  7.1 The Problem of Audio Visual Fusion
  7.2 Preliminaries & Model Description
  7.3 AUC: The Area Under an ROC Curve
    7.3.1 AUC and Mismatch
    7.3.2 AUC Generalization to Multiclass Problem
    7.3.3 Transforming HMM Outputs
    7.3.4 Online AUC Maximization
  7.4 Multi-vector Model
  7.5 Experiments on AUC-based Fusion Strategy
  7.6 Conclusion

8 Multichannel Audio-video Speech Recognition System
  8.1 Introduction to Beamforming Problem
  8.2 Signal Cancellation
  8.3 Transfer Function Estimation
  8.4 Simulation Results
  8.5 Experiments with ETHDigits Dataset
  8.6 Conclusion

9 Conclusion
  9.1 Achievements
    9.1.1 Perspectives
    9.1.2 Open Problems

A Complementary Fundamentals
  A.1 Definitions and Preliminaries
  A.2 Proof of Theorem 4.7
  A.3 Proof of Lemma 4.8
  A.4 Proof of the Boosting Algorithm for Combined Datasets
  A.5 Proof of Theorem 4.9
  A.6 Proof of Entropy Projection onto Hypercube
  A.7 Proof of Theorem 4.11
  A.8 Multiclass Weak-learning Condition

List of Abbreviations

AAM active appearance model
A-ASR audio-based automatic speech recognition
ADABoost adaptive boosting
ANN artificial neural network
ASM active shape model
ASR automatic speech recognition
AUC area under the curve
A-VAD audio-based voice activity detection
AV-ASR audio-visual automatic speech recognition
BE backward elimination
BoW bag of words
CART classification and regression tree
CD-MABoost combining datasets with MABoost
COBRA convex based relaxation approximation
DCT discrete cosine transform
fps number of frames per second (frame rate)
FS forward selection
GMM Gaussian mixture model
HMM hidden Markov model
ISCN invariant scattering convolution network
ISMA combined feature vector consisting of ISCN features and MABoost posterior probabilities

JMI joint mutual information
KL Kullback-Leibler
LBP local binary pattern
LDA linear discriminant analysis
MABoost mirror ascent boosting
MADABoost modified adaptive boosting
MAPP Mu-MABoost posterior probabilities
MFCC mel-frequency cepstral coefficient
MIFS mutual information feature selection
ML machine learning
MRI magnetic resonance imaging
mRMR minimal redundancy maximal relevance
Mu-MABoost multiclass MABoost
MVDR minimum variance distortionless response
PAC probably approximately correct
PCA principal component analysis
RF random forest
ROC receiver operating characteristic
SAMME stagewise additive modeling using a multi-class exponential loss function
SDP semidefinite programming
SIFT scale-invariant feature transform
SSP subset selection problem
SVM support vector machine
TF transfer function
VAD voice activity detection
V-ASR visual-based automatic speech recognition
V-VAD visual-based voice activity detection

Abstract

Human speech production is by nature a multimodal process. While the acoustic speech signals usually constitute the primary cue in human speech perception, visual signals can also contribute substantially, particularly in noisy environments. Research in automatic audio-visual speech recognition is motivated by the expectation that automatic speech recognition systems can also exploit the multimodal nature of speech. This thesis aims to explore the main challenges faced in the realization and development of robust audio-visual automatic speech recognition (AV-ASR), i.e., developing (I) a high-accuracy speaker-independent lipreading system and (II) an optimal audio-visual fusion strategy.

Extracting a set of informative visual features is the first step in building an accurate visual speech recognizer. In this thesis, various visual feature extraction methods are employed. Some of these methods, however, encode visual information into very high-dimensional feature vectors. In order to reduce the computational complexity and the risk of overfitting, a new feature selection algorithm is introduced that selects a subset of informative visual features from the high-dimensional feature vector space. This feature selection algorithm considers the mutual information between features and class labels (phonemes) to be the criterion and employs a semi-definite programming based search strategy for subset selection. The performance of the feature selection algorithm is analyzed and it is shown that the difference between the score of the selected feature subset and that of the optimal feature subset is bounded. That is, it guarantees that the score of the selected features is close to that of the optimal solution.

To achieve a speaker-independent visual speech recognizer, this thesis proposes to employ a pool of scale-invariant feature transform (SIFT) coefficients extracted from multiple color spaces. The ensemble of decision tree classifiers trained with these features yields a high level of robustness against inter-speaker and illumination variations. While for voice activity detection two-dimensional SIFT features are used to build a statistical model, for lipreading we use three-dimensional SIFT features that can capture the time dynamics of video data. It is shown that using three-dimensional SIFT features gives a substantial recognition accuracy improvement in comparison with conventional visual features. The proposed AV-ASR achieves 70.5% utterance classification accuracy, which is the highest accuracy reported for the Oulu dataset in the speaker-independent setting.

While many boosting algorithms have been developed to train an ensemble of classifiers, most of them cannot be naturally generalized to the multiclass classification setting, which is a requirement in our application. In this thesis, we propose a framework to design boosting algorithms. This framework has the advantage that it can be naturally generalized to the multiclass setting. Several properties such as convergence rates and the generalization error of this framework are analyzed. Moreover, multiple practically and theoretically interesting algorithms such as SparseBoost are derived. We show that the SparseBoost algorithm only uses a percentage of the training samples at each training round (about half of the samples) while still converging to the optimal hypothesis in the sense of probably approximately correct (PAC) learning.
A common practice to fuse audio and visual information is to assign a reliability weight to each modality. It is shown in this thesis that a more suitable criterion for estimating the reliability weights is to maximize the area under a receiver operating characteristic curve (AUC) rather than frequently used criteria such as the recognition accuracy. Moreover, here we estimate a reliability weight for each feature. This generalizes the (conventional) two-dimensional stream weight estimation problem to a fairly high-dimensional problem. In order to efficiently estimate the reliability weights, we use a smoothed AUC function and adopt a variant of the projected gradient descent algorithm to maximize the AUC criterion in an online manner.

Audio-visual voice activity detection (AV-VAD) is an important prerequisite in many audio-visual applications. We propose a robust audio-visual voice activity detector which can be trained in a semi-supervised manner. This interesting property can be achieved by noting the fact that both audio and visual signals represent the same underlying event, namely speech production. In this approach, the training data is labeled by iteratively training audio- and visual-based speech detectors and re-labeling the data in order to use it in the next round. The labeled data from the last iteration is then used to train the final audio-visual voice activity detector. The proposed AV-VAD algorithm results in almost 96% frame-based detection rate (visual-based VAD yields a 78% detection rate) on the GRID dataset in the speaker-independent setting.

Kurzfassung

Die Erzeugung menschlicher Sprache ist naturgemäss ein multimodaler Prozess, der akustische und visuelle Information erzeugt. Während sich die menschliche Sprachwahrnehmung üblicherweise primär auf die akustische Information stützt, können visuelle Signale ebenfalls substanziell beitragen, insbesondere in einer lauten Umgebung. Forschung in automatischer audio-visueller Spracherkennung wird motiviert durch die Erwartung, dass Spracherkennungssysteme die multimodale Natur der Sprache auch ausnutzen können. Ziel dieser Arbeit ist, die Hauptherausforderungen zu erkunden, welchen bei der Realisation robuster audio-visueller Spracherkennungssysteme (audio-visual automatic speech recognition, AV-ASR) begegnet wird, d.h. die Entwicklung (I) eines hochpräzisen, sprecherunabhängigen Lippenlesesystems und (II) einer optimalen audio-visuellen Fusionsstrategie.

Die Extraktion informativer visueller Merkmale ist der erste Schritt beim Design einer exakten visuellen Spracherkennung. In dieser Arbeit werden diverse Methoden zur Extraktion visueller Merkmale angewendet. Für einige dieser Methoden ist jedoch die Menge der resultierenden visuellen Merkmale sehr gross. Zur Reduktion des Rechenaufwandes und zur Vermeidung von Überanpassung wird ein neuer Algorithmus zur Auswahl von Merkmalen eingeführt, welcher eine Untermenge aus der sehr grossen Menge visueller Merkmale auswählt. Dieser Merkmalsauswahl-Algorithmus berücksichtigt die Transinformation (mutual information) zwischen Merkmalen und Klassennamen (Phoneme) als Kriterium und wendet zur Auswahl der Untermengen eine Suchstrategie an, die auf semidefiniter Programmierung basiert. Die Leistung des Merkmalsauswahl-Algorithmus wird analysiert und es wird gezeigt, dass die Differenz zwischen der Bewertung der ausgewählten Untermenge und der optimalen Merkmalsmenge begrenzt ist. Das heisst, es ist garantiert, dass sich die Bewertung der gewählten Merkmale nahe bei derjenigen der optimalen Lösung befindet.

Zur Verwirklichung einer sprecherunabhängigen visuellen Spracherkennung schlägt diese Arbeit die Verwendung einer Vielzahl skalierungsinvarianter Merkmalstransformations-Koeffizienten (scale-invariant feature transform, SIFT) vor, welche aus mehreren Farbräumen extrahiert werden. Klassifikatoren aus Ensembles von Entscheidungsbäumen, die mit diesen Merkmalen trainiert werden, haben einen hohen Grad an Robustheit gegen Sprecher- und Beleuchtungsvariationen. Während zur Detektion von Sprechaktivität zweidimensionale SIFT-Merkmale genutzt werden, um statistische Modelle zu erstellen, setzen wir für das Lippenlesen dreidimensionale SIFT-Merkmale ein, welche die zeitliche Dynamik von Videodaten festhalten können. Es wird gezeigt, dass die Verwendung dreidimensionaler SIFT-Merkmale im Vergleich zu herkömmlichen visuellen Merkmalen eine wesentliche Verbesserung der Erkennungsgenauigkeit bewirkt. Das vorgeschlagene AV-ASR erreicht eine Erkennungsrate von 70.5%, welches der höchste bisher publizierte Wert für die Oulu-Datenbank bei sprecherunabhängiger Erkennung ist.

Es wurden bisher viele Boosting-Algorithmen entwickelt, um Ensembles von Klassifikatoren zu trainieren. Die meisten lassen sich jedoch nicht auf Mehrklassen-Klassifikatoren verallgemeinern, was bei unserer Anwendung notwendig ist. In dieser Arbeit schlagen wir eine Systematik für das Design von Boosting-Algorithmen vor. Diese Systematik hat den Vorteil, dass sie sich natürlich auf den Mehrklassenfall verallgemeinern lässt.
Diverse Eigenschaften wie Konvergenzraten und Generalisierungsfehler sind analysiert worden. Ausserdem sind mehrere praktisch und theoretisch interessante Algorithmen wie beispielsweise SparseBoost abgeleitet worden. Es wird gezeigt, dass der SparseBoost-Algorithmus in jeder Trainingsrunde nur einen Teil (ungefähr die Hälfte) der Trainingsmuster nutzt und dennoch auf die optimale Hypothese konvergiert, im Sinne des PAC-Lernens (probably approximately correct).

Ein übliches Vorgehen zur Vereinigung von Audio und visueller Information ist die Gewichtung jeder Modalität entsprechend ihrer Zuverlässigkeit. Es wird in dieser Arbeit gezeigt, dass die Maximierung von AUC (area under the receiver operating characteristic curve) ein passenderes Kriterium zur Schätzung der Zuverlässigkeit ist als häufig genutzte Kriterien wie die Erkennungsrate. Ausserdem wird hier die Zuverlässigkeit jedes Merkmals geschätzt. Dadurch wird die (konventionelle) zweidimensionale Schätzung der Gewichte der Modalitäten zu einem ziemlich hochdimensionalen Problem generalisiert. Zwecks effizienter Schätzung der Zuverlässigkeit wird eine geglättete AUC-Funktion verwendet und eine Variante des projizierten Gradientenverfahrens adaptiert für die Maximierung des AUC-Kriteriums im Online-Modus.

Die audio-visuelle Detektion der Stimmaktivität (audio-visual voice activity detection, AV-VAD) ist eine wichtige Voraussetzung für viele audio-visuelle Anwendungen. Es wird eine robuste audio-visuelle Stimmaktivitätsdetektion vorgeschlagen, welche in halbüberwachter Weise trainiert werden kann. Diese interessante Eigenschaft ergibt sich aus dem Umstand, dass sowohl Audio als auch visuelle Signale dasselbe zugrundeliegende Ereignis repräsentieren, nämlich die Sprachproduktion. In diesem Ansatz erfolgt die Etikettierung der Daten durch iteratives Trainieren von audio-basiertem und visuellem Detektor und anschliessender Neuetikettierung, welche in der nächsten Trainingsrunde benutzt wird. Die Etiketten aus der letzten Iteration werden benutzt, um die endgültige audio-visuelle Stimmaktivitätsdetektion zu trainieren. Der vorgeschlagene AV-VAD-Algorithmus erreicht eine Detektionsrate von nahezu 96% pro Analyseabschnitt (der visuelle Stimmaktivitätsdetektor erreicht 78%) für den GRID-Datensatz bei sprecherunabhängigem Einsatz.

Chapter 1

Introduction

1.1 Problem Statement

Speech, as the main communication medium among humans, is multimodal in nature. We all know from everyday life experience that observing the face of the person who talks to us usually improves our speech perception. This improvement is even more distinct when the audio signal is degraded. For example, in a cocktail party situation where many people are talking at once, if we track the voice of a person whose face is not observable, our auditory system can only benefit from binaural processing, where the desired sound source is localized by using the time and level differences between the signals received by the two ears. However, when we are able to observe the talker's face and gestures, a more complex phenomenon occurs. Recent studies have shown that focusing on the face of a talker in a crowded space enhances cortical selectivity in the auditory cortex, which in turn diminishes the neural responses stimulated by the competing auditory input streams [GCSP13]. From a neurological standpoint, the mechanism by which visual inputs elicit larger neural responses in the auditory cortex is still largely unknown; the effect, however, is clear: when visual cues are available, noisy speech becomes more intelligible.

Motivated by this observation, and considering that there has been a significant improvement in audio-based automatic speech recognition (A-ASR) [HDY+12, MHL+07, SS06, JHL97] over the last two decades, many researchers have attempted to reduce the gap between human speech recognition performance and the performance of A-ASR in real-world applications by integrating visual speech information and auditory speech information. This, however, turned out to be non-trivial. The first attempts in this direction [PNLM04, and references therein] clearly showed that appearance based visual features (pixel values and their affine transformations) provide very limited learnable speech information. Moreover, such features are highly speaker dependent and strongly affected by changing illumination conditions. That is, the commonly used visual features are not reliable. This raised several theoretical and practical questions.

The first and foremost question is: which are the appropriate visual features? It is clear that, due to the data processing theorem, features obtained by applying non-linear transformations to an image do not convey more information than the raw pixel values themselves. With some of these non-linearly transformed features, however, a given learning algorithm can search through the hypothesis space more efficiently, which in turn may result in a more accurate model of visual speech. Thus, it is only meaningful to explore the goodness of a feature set with respect to a given learning algorithm. The follow-up question is then: which is the appropriate learning algorithm for visual speech recognition?

The next challenging question in audio-visual automatic speech recognition (AV-ASR) is how to optimally combine audio and visual speech information. From a purely theoretical standpoint, if the audio and visual feature streams were class-conditionally independent, multiplying their likelihood functions and the a priori probability would result in optimal fusion (Bayes fusion). In reality, however, our estimates of the likelihood functions (particularly the likelihood function of the video stream) may severely deviate from the true likelihood functions, mainly due to the mismatch between the distributions of training and test data and due to model inaccuracy. Therefore, a more complex fusion strategy is needed that weights the likelihood functions with respect to the reliability of the models.
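To make the contrast concrete, the following is a minimal sketch of the two fusion rules just discussed; the stream weights $\lambda_a$, $\lambda_v$ and the class score $g_c$ are notation introduced here for illustration only and are not taken from later chapters. Under class-conditional independence, Bayes fusion scores a class $c$ for audio and visual observations $x_a$, $x_v$ as

$$ g_c(x_a, x_v) = \log \Pr(c) + \log p(x_a \mid c) + \log p(x_v \mid c), $$

whereas a reliability-weighted fusion introduces non-negative stream weights,

$$ g_c(x_a, x_v) = \log \Pr(c) + \lambda_a \log p(x_a \mid c) + \lambda_v \log p(x_v \mid c), \qquad \lambda_a, \lambda_v \ge 0, $$

so that a poorly estimated likelihood (typically the visual one) can be down-weighted when its model is unreliable.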

This thesis will concentrate on both theoretical and practical aspects of feature selection and learning algorithms and their applications in AV-ASR, in order to give answers to the above questions that differ from conventional solutions.

1.2 Scientific Contributions

The following contributions result from this thesis:

1. We propose a mutual information based feature selection algorithm and derive its approximation ratio, which can be interpreted as a lower bound on the goodness of selected feature sets. To the best of our knowledge, this is the first (mutual information based) feature selection method with a performance guarantee.¹

2. Employing the duality between linear online learning (hedging) and boosting, we derive a framework to design and develop strong classifiers. We show that the algorithms derived from this framework are boosting algorithms in the probably approximately correct (PAC) sense, i.e., they drive the classification error to zero. Using this framework, we show that the MADABoost algorithm proposed in [DW00] is in fact a boosting algorithm. This is the first proof of its boosting property. Moreover, a sparse boosting algorithm is proposed which uses only a percentage of the training data in each training round, resulting in a substantial reduction of the memory and computational complexity of the learning process.

3. The most commonly used visual features for speech recognition are very sensitive to illumination variations and, moreover, are highly speaker dependent. To cope with the speaker dependence issue, we propose to use the three-dimensional scale-invariant feature transform (3D-SIFT). To improve the illumination invariance property, we suggest using multiple 3D-SIFT feature sets, where the feature sets are extracted from different color spaces (obtained by nonlinear transformations of RGB) with certain invariance properties. The final classifier is an ensemble of classifiers, each trained with one of these feature sets. The resulting classifier yields high accuracy in the speaker-independent mode and is robust against changing lighting conditions.

4. Using an algorithm derived from our proposed boosting framework (MABoost) in the co-training setting discussed in [BM98], we develop a speaker-independent audio-visual voice activity detection system that can be trained in a semi-supervised manner. It is shown to be very reliable in different lighting conditions and for various speakers. Further, since it requires (almost) no labeled data to train, its intelligence (accuracy) improves as it is used, meaning that it can adapt to new users in an unsupervised way.

5. Many audio-visual fusion techniques proposed in the literature optimize the fusion parameters with respect to the recognition accuracy on training data, which is based on the implicit assumption that training conditions are sufficiently similar to test conditions. This, however, is a paradoxical assumption, since if training and test conditions were similar enough, we could simply use Bayes fusion. To address this problem, we propose a multi-stream fusion scheme that adopts the area under the receiver operating characteristic curve (AUC) as the design criterion. This algorithm can be trained in an online manner when the training set is large or parameter adaptation is required.

¹This performance guarantee is a non-zero (thus, non-trivial) lower bound of the ratio between the attained solution and the optimal solution of the underlying NP-hard maximization problem.

1.3 Structure of this Thesis

This thesis includes two parts: a more theoretical part, where a feature selection algorithm and a boosting framework are described, and a more practical part, where these algorithms are employed to construct a robust audio-visual speech recognition system.

Chapter 2 describes the different visual features employed in this thesis, including SIFT, 3D-SIFT and ISCN. Moreover, the audio-visual datasets used for evaluations throughout this thesis are also presented in this chapter.

Chapter 3 introduces mutual information based feature selection algorithms. Particularly for COBRA, the semi-definite programming based feature selection, the approximation ratio is explored. Finally, the practical usability of this feature selection method is demonstrated by means of relatively comprehensive experiments with 10 different datasets.

Chapter 4 introduces a boosting framework called MABoost. Several boosting algorithms derived from this framework, including SparseBoost, MADABoost and Mu-MABoost, are presented and explored in this chapter.

Chapter 5 discusses in detail the supervised and semi-supervised algorithms employed for audio-visual voice activity detection.

Chapter 6 gives the description of a decision tree based lipreading system. It explains how 3D-SIFT features can be used to train a multiclass classifier in order to achieve a robust lipreading system.

Chapter 7 is devoted to the information fusion problem in audio-visual speech recognition systems. It describes a fusion scheme for AV-ASR systems.

Chapter 8 gives some practical evidence for the advantages of using multi-channel audio and visual data for improving recognition accuracy in noisy, reverberant environments.

Chapter 9 provides some final remarks and insights for future work.

Chapter 2

Visual Features and AV Datasets

According to Castleman's definition [Cas79], a feature is a function of one or more measurements, computed so that it quantifies some significant characteristics of an object. Despite the fact that over the last years various types of visual features have been proposed for V-ASR systems (see [ZZHP14, PNLM04]), none of them has gained wide acceptance (such as MFCCs in A-ASR systems) in real-world deployment. Different visual feature extraction methods may roughly be categorized into four classes.

1. Appearance or texture based features, including PCA, LDA, H-LDA, etc. The main underlying concept among texture based features is that all pixels encode visual speech information. After extracting a region of interest (ROI), which usually contains the mouth, lips and jaw, these methods apply traditional dimensionality reduction techniques (such as DCT, PCA or LDA) directly to the ROI to calculate the final feature sets. However, in these features the information relevant to visual speech is strikingly dominated by irrelevant information (noise for our goal) such as the facial characteristics and skin color of the speaker. Furthermore, these features are very sensitive to illumination variations.

2. Shape based algorithms [Che01, MCB+02, CET01] such as AAM and ASM, which consist of models for the visible speech articulatory parts. These models, however, need tedious training (facial points have to be labeled) and may not capture all relevant information due to the limitations of the models (shapes). More importantly, they are highly sensitive to image quality (low resolution).

3. Motion based methods, which mainly use optic flow features. Although using optic flow features alone results in poor performance, it was shown that they may be useful as complementary features in AV-ASR systems [Gur09].

4. Invariant features, which show different invariance properties with respect to illumination conditions, scaling, skin color, etc. Invariant features are obtained by applying non-linear transformations to an ROI and can be extracted either by computing features for each frame independently, i.e., spatial features such as SIFT and ISCN, or by considering the video data as a three-dimensional time series and computing informative spatio-temporal features for a video sequence. Our proposed feature sets may also be considered to go along this line, with an additional bag-of-words (BoW) step resulting in a sparse feature vector (suitable for the decision tree classifiers used as weak classifiers in our boosting algorithm). These features turn out to be highly speaker- and illumination-invariant.

While the last class of features is widely used in various machine vision and video classification applications [SAS07], such features have rarely been applied to the lipreading problem. In the first part of this chapter, the various invariant features that are later used to develop speaker-independent voice activity detection and speech recognition systems are introduced. As shown in Chapter 6, using these features significantly improves the robustness of V-ASR systems. In the second part of this chapter, the datasets used in this thesis are explained and some samples from them are illustrated.

2.1 Visual Feature Descriptions

Two kinds of features can be extracted from an image: (I) Global features, which can be seen as a function of the entire image. PCA and LDA features are examples of global features. In this type of representation, each feature is a function of both the foreground (interpreted as the information we are interested in) and the background (interpreted as noise). (II) Local features, which describe the characteristics of small regions in the neighborhood of key points in an image. The key points, however, may belong to the foreground or the background. Thus, unlike global features, local features are functions of either the foreground or the background, but not both at the same time. Since in lipreading, among the many details and pieces of information in a face image, we are only interested in the lip shape and location, it is advantageous to use local features. It is then the classifier's task to intelligently extract information from the foreground features and ignore the background features. In this thesis, only local features are employed, due to the fact that these features provide the robustness required by lipreading systems.

2.1.1 Scale-invariant Feature Transform (SIFT)

Scale-invariant feature transform (SIFT) descriptors are local features introduced by Lowe [Low04]. SIFT descriptors are among the most widely used local features in machine-vision applications due to their invariance to scale, rotation, illumination changes, and viewing direction. The success of SIFT features can be attributed to two factors: First, the method does not blindly use all the pixels. It finds local structures such as corners or blobs that are present in different views and different scales of an image and uses them as key points. Second, once the key points are detected, it provides descriptions for these key points which are partially invariant to translation, rotation, scaling, illumination and affine transformations. These descriptors form the SIFT feature vectors representing an image.

SIFT descriptors are computed as follows: First, the image gradients at the sample points in a 2D neighborhood around a key point location are computed. Lowe argued that it is better to use the distribution of the gradient orientations rather than their raw values, since this distribution is highly invariant to partial variations (such as scaling, rotation and translation). Thus, at the second stage, the histograms of orientations are computed. When computing an orientation histogram, the increments are weighted by the gradient magnitude and also by a Gaussian window function centered at the key point. These orientation histograms measure how strong the gradient is in each direction. By forming multiple such histograms (according to Lowe's method, 4 such histograms for 4 subregions around a key point) and concatenating them, the SIFT descriptor for a key point is obtained. This vector is usually normalized by its ℓ1 or ℓ2 norm in order to achieve invariance to illumination changes.
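To make the descriptor pipeline concrete, the following is a minimal sketch (not the implementation used in this thesis) of computing SIFT descriptors on a dense grid over a mouth ROI with OpenCV; the grid step, the patch size and the function name dense_sift are illustrative assumptions.

```python
# Hedged sketch: densely sampled SIFT descriptors for a gray-scale mouth ROI.
import cv2
import numpy as np

def dense_sift(roi_gray, step=8, patch_size=16):
    """Return an (n_keypoints, 128) array of SIFT descriptors on a dense grid."""
    sift = cv2.SIFT_create()
    h, w = roi_gray.shape
    keypoints = [cv2.KeyPoint(float(x), float(y), float(patch_size))
                 for y in range(step, h - step, step)
                 for x in range(step, w - step, step)]
    _, descriptors = sift.compute(roi_gray, keypoints)
    # L2-normalize each 128-dimensional descriptor (partial illumination invariance)
    return descriptors / (np.linalg.norm(descriptors, axis=1, keepdims=True) + 1e-8)

# Example usage:
# roi = cv2.imread("mouth_roi.png", cv2.IMREAD_GRAYSCALE)   # e.g. a 128x128 ROI
# descs = dense_sift(roi)
```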

In this thesis, SIFT features are used to train a visual voice activity detector (VAD). As shown in Chapter 5, the invariance properties of SIFT features can significantly contribute to the robustness of the system.

2.1.2 Invariant Scattering Convolution Networks

Invariant scattering convolution network (ISCN) coefficients are wavelet based features computed by a network of wavelets applied to an image. Following the explosive success of convolutional neural networks, Bruna and Mallat introduced ISCN features [BM12] in order to extract higher-order statistics of an image via a deep network of wavelets. In this representation, an image is first segmented into many overlapping subregions. The first layer of the network then outputs local features by (I) applying wavelet transforms to the subregions and (II) removing their phases to obtain the translation invariance property. Averaging the features of each subregion yields a feature set which is both deformation-invariant (due to the averaging) and translation-invariant. As shown in [BM12], these features turn out to be SIFT-type descriptors. The next layers, however, provide higher-order statistics by applying finer wavelets in various directions¹ to the outputs of the previous layer.

These higher-order features show strong discriminative power, particularly when the underlying distribution of the data is highly non-Gaussian. This, however, comes at the expense of exponentially expanding the feature space, due to the fact that the number of ISCN features increases exponentially with the number of layers and the number of directions along which the wavelet transform is computed. In our experiments, for instance, a 3-layer scattering convolution network with an 8-directional Morlet wavelet and a maximum spatial resolution of 8 yields 55552 ISCN features for an ROI of size 128×128. This representation can be compressed by applying a DCT transform to the ISCN features and only selecting the 20% of the DCT features with the highest energy, which in our experiments amounted to about 10000 features. Since, given the limited computational, memory and data resources, it is still difficult to devise a reliable statistical model with this number of features, a feature selection algorithm is developed to select a small subset of informative ISCN features used in training the HMM-GMM models. This feature selection method is the topic of the next chapter.

¹In 2D frequency space, directional wavelets can be described by rotating and scaling the base wavelet function.
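As a concrete illustration of this pipeline, the sketch below computes scattering coefficients for a 128×128 ROI with the kymatio library (an assumption; the thesis does not name an implementation). The parameters are illustrative and are not tuned to reproduce the exact feature count quoted above.

```python
# Hedged sketch: ISCN-style scattering coefficients for a 128x128 gray-scale ROI.
import numpy as np
from kymatio.numpy import Scattering2D

def iscn_features(roi_gray):
    """roi_gray: (128, 128) float array; returns a flattened coefficient vector."""
    scattering = Scattering2D(J=4, shape=(128, 128), L=8, max_order=2)
    coeffs = scattering(roi_gray.astype(np.float32))   # (channels, 8, 8)
    return coeffs.reshape(-1)

# The DCT compression step described above could then keep only the 20% of DCT
# coefficients with the highest energy (e.g. via scipy.fft.dctn).
```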

2.1.3 3D-SIFT

There is a growing body of research attempting to generalize the successful 2D descriptors to 3D descriptors in order to take the third dimension of video data (time in video and Z in MRI images) into account. Several 3D feature sets proposed in the literature include 3D-SIFT [SAS07], HoG3D [KMS08], spatio-temporal BoW [NWF08] and LBP-TOP [ZBP09]. HoG3D and 3D-SIFT, however, are based on histograms of gradient orientations and can be seen as the direct generalization of the popular SIFT descriptors. Due to their compliance with the 3D nature of video data, they have achieved a great deal of success in applications such as action recognition.

Local descriptors are used to represent a local fragment of a video. Usually the informative points (key points) of a video are selected either by a key point detector (e.g., the Harris-Laplace detector [MS02]) or by dense sampling. Since we only extract features from a small ROI (lip and mouth region), dense sampling is employed in this thesis. As in SIFT, after selecting the key points, the next step is to compute the gradient magnitudes and orientations at the sample points in a 3D neighborhood around the key point location. Given a key point s = (x_s, y_s, t_s), a 3D neighborhood is considered to be a cuboid volume with its center of gravity lying at the key point location. This 3D cuboid is described by C = (s, w, l, h), where w, h and l are its width (along the X axis), height (along the Y axis) and length (along the time axis), respectively. The gradient magnitudes are weighted by a Gaussian window, which in our case is a sphere centered at the key point location. The 3D cuboid is first divided into 2x2x2 sub-volumes. The gradient orientations in each sub-volume are then accumulated into a sub-histogram with 80 bins, where 80 is the number of polyhedron faces used to tessellate the sphere (in order to quantize the gradient orientations). A sub-histogram represents the visual information of its corresponding sub-volume. All the sub-histograms are then concatenated to construct the final feature vector of length 640.
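The following numpy sketch mirrors the construction just described: a cuboid around a key point is split into 2x2x2 sub-volumes and each sub-volume contributes an 80-bin weighted histogram of gradient orientations. The tessellation directions (face_dirs) and the Gaussian weighting are simplified assumptions, not the exact implementation of [SAS07] or of this thesis.

```python
# Hedged sketch of a 3D-SIFT style descriptor (length 8 * 80 = 640).
import numpy as np

def descriptor_3dsift(cuboid, face_dirs):
    """cuboid: (T, H, W) gray values around a key point.
    face_dirs: (80, 3) unit vectors, one per face of the tessellated sphere."""
    gt, gy, gx = np.gradient(cuboid.astype(np.float32))        # gradients along t, y, x
    grads = np.stack([gx, gy, gt], axis=-1)                     # (T, H, W, 3)
    mag = np.linalg.norm(grads, axis=-1)

    # Gaussian weighting centred at the key point (spherical window)
    T, H, W = cuboid.shape
    t, y, x = np.meshgrid(np.arange(T), np.arange(H), np.arange(W), indexing="ij")
    r2 = (t - T / 2) ** 2 + (y - H / 2) ** 2 + (x - W / 2) ** 2
    weight = mag * np.exp(-r2 / (2.0 * (min(T, H, W) / 2.0) ** 2))

    hists = []
    for ts in np.array_split(np.arange(T), 2):
        for ys in np.array_split(np.arange(H), 2):
            for xs in np.array_split(np.arange(W), 2):
                sub = grads[np.ix_(ts, ys, xs)].reshape(-1, 3)
                w = weight[np.ix_(ts, ys, xs)].reshape(-1)
                # assign each gradient to the closest of the 80 face directions
                unit = sub / (np.linalg.norm(sub, axis=1, keepdims=True) + 1e-8)
                bins = np.argmax(unit @ face_dirs.T, axis=1)
                hists.append(np.bincount(bins, weights=w, minlength=80))
    return np.concatenate(hists)
```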

2.1.4 Bag-of-Words

The bag-of-words (BoW) model is one of the most popular feature representations in machine-vision and visual classification tasks. The key idea in BoW is in fact borrowed from text mining. The standard steps to extract features from a document in a text retrieval system are: (I) parsing the document into words, (II) assigning a unique identifier (code) to each word, and (III) representing the document by a vector whose components are given by the frequency of occurrence of the words. Following this line of thought, Sivic and Zisserman introduced the BoW model for object retrieval and visual search in videos [SZ03]. The BoW method can be outlined as follows:

1. Extracting raw features from an image or a video. In this thesis, we use SIFT and 3D-SIFT as the raw features for representing the ROIs.

2. Constructing a codebook with K codes by employing a clustering approach such as K-means, and assigning an index i ∈ {1, ..., K} to each feature vector according to its cluster membership.

3. Representing each ROI by a K-dimensional vector where the i-th element is the frequency of occurrence of the index i in the given ROI.

Two important problems can be handled by using BoW in classification. First, in lipreading, ROIs may have different sizes (because of the variations of speakers' lip shapes and sizes). Due to these changes in image size, we may extract different numbers of feature vectors from different ROIs. It is then not clear how to represent the ROIs with equal-size feature vectors. One remedy is to artificially resize all images to have an equal size. This idea is used in Chapter 3 to train a visual speech recognizer with ISCN features. Another solution is to use the BoW model, which sidesteps this problem by assigning equal-size feature vectors to the ROIs.

Second, features generated by the BoW method are highly robust against small variations and largely unaffected by changes in camera viewpoint, the object's scale and scene illumination. This robustness is the direct result of two factors: first, employing vector quantization, and second, using the distribution (frequency of occurrence) of the raw features as the final representation of an image. Both of these factors reduce the amount of noise in the raw features and, consequently, increase the robustness of the system.
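A minimal sketch of this pipeline with scikit-learn is given below; the codebook size K and the function names are illustrative assumptions rather than the settings used in the experiments of this thesis.

```python
# Hedged sketch of the bag-of-words representation built from raw descriptors.
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(pooled_descriptors, K=500):
    """pooled_descriptors: (n, d) SIFT / 3D-SIFT descriptors from all training ROIs."""
    return KMeans(n_clusters=K, n_init=10, random_state=0).fit(pooled_descriptors)

def bow_vector(codebook, roi_descriptors):
    """Represent one ROI by the frequency of occurrence of each code."""
    K = codebook.n_clusters
    idx = codebook.predict(roi_descriptors)              # cluster index per descriptor
    hist = np.bincount(idx, minlength=K).astype(float)
    return hist / max(hist.sum(), 1.0)                   # ROIs of any size map to equal-size vectors
```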

2.2 Datasets

Compared with audio-only datasets, the number of audio-visual datasets suitable for training an AV-ASR is much smaller, and even fewer of them are freely available. In this thesis, we used four datasets which were commonly used in the relevant literature: CUAVE [PGTG02], GRID [CBCS06], Oulu [ZBP09] and AVletter [MCB+02]. Moreover, we also recorded a multi-channel audio-visual dataset which is used in the last chapter to conclude this thesis. Table 2.1 summarizes some of the important properties of the datasets used in this thesis. Note that the GRID dataset originally contains 32 speakers, while we only used 16 of them in our evaluations.

Dataset     Resolution  #Aud. channels  Light cond.   Speakers
CUAVE       720x480     1               controlled    36
Oulu        720x576     1               controlled    20
GRID        720x576     1               controlled    16
AVletter    unknown     0               controlled    10
ETHDigits   640x480     8               time-varying  15

Table 2.1: Summary of the datasets utilized in this thesis.

2.2.1 CUAVE

The first dataset used in this thesis is CUAVE [PGTG02], which contains the digits from zero to nine repeated five times by 36 speakers. This dataset offers reasonably large speaker variability, which is necessary to train a robust recognizer. The dataset has been recorded at a resolution of 720x480 with the NTSC standard of 29.97 fps [PGTG02]. We, however, do not work directly with the raw data.

Figure 2.1: Manually centered, rotated and scaled ROIs extracted from the CUAVE dataset.

Due to the effort of Mihai Gurban in his PhD work [Gur09], preprocessed data is available for this dataset, in which the video data is de-interlaced and the mouth regions are manually cropped, centered and rotated to be horizontal. Four regions of interest (ROIs) from this preprocessed dataset are depicted in Figure 2.1. Each ROI is a square image with 128x128 pixels. All ROIs are of equal size, and the feature vectors representing these ROIs are also of equal size. Naturally, working with manually detected ROIs leads to overly optimistic results. However, since our focus in Chapters 3 and 7, where the CUAVE dataset is employed, is on feature set evaluation and audio-visual information fusion, this issue can be ignored. Later in Chapter 6, when we opt for the GRID dataset, which is much larger than CUAVE and for which no manually detected ROIs are available, the effect of the automatic ROI detection is explored.

2.2.2 GRID

The GRID corpus has been introduced in [CBCS06]. It consists of high-quality audio and video recordings of 1000 utterances spoken by each of 34 talkers. The sentences are simple, syntactically identical phrases such as "place at B 4 now". The sentence structure of the GRID corpus is indicated in Table 2.2.

Each sentence consists of six words. The audio-visual recordings were made in a single-walled acoustically isolated booth. Each recording lasts exactly 3 seconds and consists of 75 frames, i.e., 25 frames per second. The image resolution is 720x576 pixels. The sampling frequency of the audio data is 44.1 kHz, which in our experiments was downsampled to 8 kHz. Some sample frames of this corpus are shown in Figure 2.2.

Command  Color*  Preposition  Letter*          Digit*  Adverb
bin      blue    at           A–Z              0–9     again
lay      green   by           (excluding W)            now
place    red     in                                    please
set      white   with                                   soon

Table 2.2: Sentence structure of the GRID corpus. The keyword columns are marked with asterisks. Each sentence consists of six words, three of which are keywords.

Figure 2.2: Top four images are sample frames from the GRID corpus and the bottom four images are automatically extracted mouth regions. Different ROIs may have different sizes.

Unlike the CUAVE dataset, the mouth regions are not manually labeled. In order to extract the ROIs, in this thesis we used the fully automatic facial point detection algorithm proposed by Dantone et al. [DGFVG12]. It is a real-time facial point detection approach based on conditional regression forests and yields high accuracy in most cases. The detected ROIs may have different sizes due to the various mouth shapes, the speakers' facial characteristics, their distance to the camera and detection errors. Four of these ROIs are illustrated in Figure 2.2. The GRID dataset is more than 40 GB. Hence, we only used a subset of the corpus in our experiments, by randomly selecting 400 sentences from each of the first 16 speakers in the corpus. This reduces the memory required for processing and makes it possible to run our learning algorithms in batch learning mode.

2.2.3 Oulu

The Oulu dataset, collected by Zhao et al. [ZBP09], is a small corpus containing the 10 phrases listed in Table 2.3. The video data is recorded by a SONY DSR-200AP 3CCD camera with a frame rate of 25 fps. The image resolution is 720x576 pixels. This dataset includes 20 persons, each uttering the ten phrases listed in Table 2.3 five times. The audio files are single-channel, 16 bit per sample, uncompressed wave files with a sampling frequency of 48 kHz. With this dataset, semi-automatically cropped ROIs are also provided. These mouth regions were determined by manually marking the eye positions in each frame and passing them to an automatic mouth detection system. Figure 2.3 depicts four frames and four extracted ROIs of this dataset. While the video data is colored, the ROIs provided with this dataset are gray-scale images. Therefore, as for the GRID dataset, we automatically extracted our own ROIs since, as shown in Chapter 6, colored images are more suitable for lipreading due to the valuable information in colors, particularly the red color of the lips. By transforming the RGB images to multiple color spaces, we improve the robustness of the system and increase the convergence speed of the training phase.

C1: Excuse me    C2: Goodbye      C3: Hello       C4: How are you        C5: Nice to meet you
C6: See you      C7: I am sorry   C8: Thank you   C9: Have a good time   C10: You are welcome

Table 2.3: The ten phrases used in the Oulu dataset.

Figure 2.3: Top four images are sample frames from the Oulu corpus and the bottom four images are automatically extracted mouth regions.

2.2.4 AVletter

The AVletters dataset [MCB+02] consists of 10 speakers saying the letters A to Z, three times each. The dataset only provides pre-extracted lip regions of 60x80 pixels in gray scale. As there is no raw audio data available for this dataset, we only used it for the evaluation of lipreading systems. Figure 2.4 shows four example ROIs of this dataset.

Figure 2.4: Four ROI examples from the AVletter dataset, which consists of repetitions of the letters A to Z.

2.2.5 ETHDigits

Since all the datasets described above were collected under controlled conditions, their high-quality audio-visual data are not representative of real-world examples. Therefore, we set up a recording system to collect our own audio-visual dataset.

Figure 2.5: Eight channel microphone array used for audio acquisition in the ETHDigits dataset.

ETHDigits is an audio-visual dataset recorded in a highly reverberant office room with a reverberation time T60 of more than 1 second. The audio data is captured by the 8 microphones shown in Figure 2.5 and the video signals are recorded by a Kinect for Xbox 360. In this work, however, we only use the RGB camera of the Kinect, which has a resolution of 640x480 pixels and operates at a frame rate of 30 Hz. Unlike for the previous datasets, we did not record the visual data under controlled conditions. Depending on how many lights were on during a recording session, what time of day it was and whether the sun was shining, the amount of light in the room may vary greatly, which in turn results in large variations of the visual data quality. The ETHDigits dataset consists of 15 speakers, and each speaker repeats a sequence of the numbers from one to ten 5 times. The digits were presented on a computer screen located about 45 centimeters away from the talkers, and both the Kinect and the microphone array were mounted on top of this screen. Unlike in the CUAVE database, here the five digit sequences appear with different speeds on the screen (the later the faster). Namely, there are longer silence durations between the numbers of the first three sequences than between those of the last two sequences. Some samples of this dataset can be seen in Figure 2.6.

Figure 2.6: Top four images are sample frames from the ETHDigits dataset and the bottom four images are their corresponding automatically extracted mouth regions.

Chapter 3

Feature Selection

For various reasons, feature selection is a necessary preprocessing block in many machine learning applications. The main reason, however, is simply the lack of enough data. In this chapter, various aspects of the feature selection problem are investigated and a feature selection algorithm is developed, which is later employed to select a small subset of informative features from the ISCN coefficients in order to train a GMM-HMM based lipreading system.

The two main issues in feature subset selection are finding an appropriate measure function that can be computed fairly fast and robustly for high-dimensional data, and a search strategy to optimize the measure over the subset space in a reasonable amount of time. In this chapter, the mutual information between features and class labels is considered to be the measure function. Two series expansions for mutual information are proposed, and it is shown that most heuristic criteria suggested in the literature are truncated approximations of these expansions.

As a search strategy, we suggest a parallel search strategy based on semi-definite programming (SDP) that can search through the subset space in polynomial time. By exploiting the similarities between the proposed algorithm and an instance of the maximum-cut problem in graph theory, the approximation ratio of this algorithm is derived and compared with the approximation ratio of the backward elimination method.

3.1 Introduction to Feature Selection Problem

From a purely theoretical point of view, given the underlying conditional probability distribution of a dependent variable C and a set of features X, the Bayes decision rule can be applied to construct the optimum induction algorithm. However, in practice, learning machines are not given access to the distribution Pr(C|X). Therefore, given a feature vector of variables X ∈ R^N, the aim of most machine learning algorithms is to approximate this underlying distribution or to estimate some of its characteristics. Unfortunately, in most practically relevant data mining applications, the dimensionality of the feature vector is quite high, making it prohibitive to learn the underlying distribution. For instance, gene expression data or images may easily have more than tens of thousands of features. While, at least in theory, having more features should result in a more discriminative classifier, this is not the case in practice because of the computational burden and the overfitting effect.

High-dimensional data poses various challenges for induction and prediction algorithms. Essentially, the amount of data needed to sustain the spatial density of the underlying distribution increases exponentially with the dimensionality of the feature vector, or alternatively, the sparsity increases exponentially for a constant amount of data. Normally, in real-world applications, a limited amount of data is available and obtaining a sufficiently good estimate of the underlying high-dimensional probability distribution is almost impossible, except for some special data structures or under some assumptions (independent features, etc.). Thus, dimensionality reduction techniques, particularly feature extraction and feature selection methods, have to be employed to reconcile idealistic learning algorithms with real-world applications.

3.1.1 Various Feature Selection Methods

In the context of feature selection, two main issues can be distinguished. The first one is to define an appropriate measure function to assign a score to a set of features. The second issue is to develop a search strategy that can find the optimal (in the sense of optimizing the value of the measure function) subset of features among all feasible subsets in a reasonable amount of time. The different approaches to address these two problems can roughly be categorized into three groups: wrapper methods, embedded methods and filter methods.

Wrapper methods [Koh96] use the performance of an induction algorithm (for instance a classifier) as the measure function. Given an inducer I, wrapper approaches search through the space of all possible feature subsets and select the one that maximizes the induction accuracy. Most methods of this type require checking all the possible 2^N subsets of features and may thus rapidly become prohibitive due to the so-called combinatorial explosion. Since the measure function is a machine learning (ML) algorithm, the selected feature subset is only optimal with respect to that particular algorithm, and may show poor generalization performance with other inducers.

The second group of feature selection methods are called embedded methods [NSS04] and are based on some internal parameters of the ML algorithm. Embedded approaches rank features during the training process and thus simultaneously determine both the optimal features and the parameters of the ML algorithm. Since using (accessing) the internal parameters may not be possible in all ML algorithms, this approach cannot be seen as a general solution to the feature selection problem. In contrast to wrapper methods, embedded strategies do not require an exhaustive search over all subsets, since they mostly evaluate each feature individually based on a score calculated from the internal parameters. However, similar to wrapper methods, embedded methods depend on the induction model and thus the selected subset is somehow tuned to a particular induction algorithm.

Filter methods, as the third group of selection algorithms, focus on filtering out irrelevant and redundant features, where irrelevancy is defined according to a predetermined measure function. Unlike the first two groups, filter methods do not incorporate the learning part and thus show better generalization power over a wider range of induction algorithms. They rely on finding an optimal feature subset through the optimization of a suitable measure function. Since the measure function is selected independently of the induction algorithm, this approach decouples the feature selection problem from the subsequent ML algorithm.

The first contribution of this work is to analyze the popular mutual information measure in the context of the feature selection problem. The mutual information function is expanded in two different series, and it is shown that most of the previously suggested information-theoretic criteria are first or second order truncation-approximations of these expansions. The first expansion is based on a generalization of mutual information and has already appeared in the literature, while the second one is, to the best of our knowledge, new. The well-known minimal Redundancy Maximal Relevance (mRMR) score function can be immediately concluded from the second expansion.

3.1.2 Feature Selection Search Strategies

Alternatively, feature selection methods can be categorized based on the search strategies they employ. Popular search approaches can be divided into four categories: exhaustive search, greedy search, projection and heuristic search. A trivial approach is to exhaustively search the subset space, as is done in wrapper methods. However, as the number of features increases, this rapidly becomes infeasible. Hence, many popular search approaches use greedy hill climbing as an approximation to this NP-hard combinatorial problem. Greedy algorithms iteratively evaluate a candidate subset of features, then modify the subset and evaluate whether the new subset is an improvement over the old one. This can be done in a forward selection setup, which starts with an empty set and adds one feature at a time, or with a backward elimination process, which starts with the full set of features and removes one feature at each step. The third group of search algorithms is based on targeted projection pursuit, a linear mapping algorithm that pursues an optimum projection of the data onto a low-dimensional manifold scoring highly with respect to a measure function [FT74]. In heuristic methods, for instance genetic algorithms, the search starts with an initial subset of features which gradually evolves toward better solutions. Recently, two convex quadratic programming based methods, QPFS in [RHEC10] and SOSS in [NHP13], have been suggested to address the search problem. QPFS is a deterministic algorithm and utilizes the Nyström method to approximate large matrices for efficiency purposes. SOSS, on the other hand, has a randomized rounding step which injects a degree of randomness into the algorithm in order to generate more diverse feature sets.

Developing a new search strategy is another contribution of this chapter. Here, a new class of search algorithms is introduced which is based on semi-definite programming (SDP) relaxation. The feature selection problem is reformulated as an instance of (0-1)-quadratic integer programming. This integer programming optimization is then relaxed to an SDP problem, which is convex and hence can be solved with efficient algorithms [BV04]. It is shown that this usually gives better solutions than greedy algorithms, in the sense that its approximate solution is more likely to be closer to the optimal point of the criterion.

3.2 Mutual Information Pros and Cons

Let us consider an N-dimensional feature vector X = [X_1, X_2, ..., X_N] and a dependent variable C, which can be either a class label in case of classification or a target variable in case of regression. The mutual information function is defined as a distance from independence between X and C measured by the Kullback-Leibler divergence [CT91]. Basically, mutual information measures the amount of information shared between X and C by measuring their dependency level. Denote the joint pdf of X and C and its marginal distributions by Pr(X, C), Pr(X) and Pr(C), respectively. The mutual information between the feature vector and the class label can be defined as follows:

I(X_1, X_2, \ldots, X_N; C) = I(X; C) = \int \Pr(X, C) \log \frac{\Pr(X, C)}{\Pr(X)\,\Pr(C)} \, dX \, dC \qquad (3.1)

It reaches its maximum value when the dependent variable is perfectly described by the feature set. In this case mutual information is equal to H(C), where H(C) is the Shannon entropy of C. Mutual information can also be considered a measure of set intersection [Rez61]. Namely, let 𝒜 and ℬ be the event sets corresponding to the random variables A and B, respectively. It is not difficult to verify that a function µ defined as:

\mu(\mathcal{A} \cap \mathcal{B}) = I(A; B) \qquad (3.2)

satisfies all three properties of a formal measure over sets [Yeu91], [Bog07], i.e., non-negativity, assigning zero to the empty set, and countable additivity. However, as will be seen later, the generalization of the mutual information measure to more than two sets no longer satisfies the non-negativity property and thus has to be seen as a signed measure, which generalizes the concept of a measure by allowing negative values.

There are at least three reasons for the popularity of mutual information in feature selection algorithms.

1. Most of the suggested non information-theoretic score functions are not formal set measures (for instance the correlation function). Therefore, they cannot assign a score to a set of features but rather to individual features. Mutual information, however, as a formal set measure is able to evaluate all possible informative interactions and complex functional relations between features and, as a result, represent the complete information contained in a set of features.

2. The relevance of the mutual information measure to misclassification error is supported by the existence of bounds relating the probability of misclassification of the Bayes classifier, P_e, to the mutual information. More specifically, Fano's weak lower bound [Fan61] on P_e,

1 + P_e \log_2(n_y - 1) \geq H(C) - I(X; C) \qquad (3.3)

where n_y is the number of classes, and the Hellman-Raviv upper bound [HR70] on P_e,

P_e \leq \frac{1}{2}\big(H(C) - I(X; C)\big) \qquad (3.4)

provide somewhat of a performance guarantee. As can be seen in (3.3) and (3.4), maximizing the mutual information between X and C decreases both the upper and lower bounds on the misclassification error and thus guarantees the goodness of the selected feature set.

However, there is somewhat of a misunderstanding of this fact in the literature. It is sometimes wrongly claimed that maximizing the mutual information results in minimizing the P_e of the optimal Bayes classifier. This is an unfounded claim since P_e is not a monotonic function of the mutual information. Namely, it is possible that a feature vector A with less relevant information content about the class label C than a feature vector B yields a lower classification error rate than B. The following example may clarify this point.

Example 3.1. Consider a binary classification problem with an equal number of positive and negative training samples and two binary features X_1 and X_2. The goal is to select the optimum feature for the classification task. Suppose the first feature X_1 is positive whenever the outcome is positive; however, when the outcome is negative, X_1 takes positive and negative values with equal probability. Namely, Pr(X_1=1 | C=1) = 1 and Pr(X_1=-1 | C=-1) = 0.5. In the same manner, the likelihood of X_2 is defined as Pr(X_2=1 | C=1) = 0.9 and Pr(X_2=-1 | C=-1) = 0.7. Then, the Bayes classifier with feature X_1 yields the classification error:

P_{e1} = \Pr(C=-1)\Pr(X_1=1 \mid C=-1) + \Pr(C=1)\Pr(X_1=-1 \mid C=1) = 0.25 \qquad (3.5)

Similarly, the Bayes classifier with X_2 yields P_{e2} = 0.2, meaning that X_2 is a better feature than X_1 in the sense of minimizing the probability of misclassification. However, unlike their error probabilities, I(X_1; C) = 0.31 is

greater than I(X_2; C) = 0.29. That is, X_1 conveys more information about the class label in the sense of Shannon mutual information than X_2.
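The numbers in this example are easy to verify numerically. The following minimal sketch (the helper names are ours, not from the thesis) builds the joint distribution of each binary feature with the class label and computes both the Bayes error and the mutual information:

import numpy as np

def entropy(p):
    """Shannon entropy (in bits) of a discrete distribution given as probabilities."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def bayes_error_and_mi(p_c1, p_x1_given_c1, p_xm1_given_cm1):
    """Bayes error and I(X;C) for a binary feature X and binary class C."""
    p_cm1 = 1.0 - p_c1
    # Joint table over (C, X): rows C = +1 / -1, columns X = +1 / -1.
    joint = np.array([
        [p_c1 * p_x1_given_c1,            p_c1 * (1.0 - p_x1_given_c1)],
        [p_cm1 * (1.0 - p_xm1_given_cm1), p_cm1 * p_xm1_given_cm1],
    ])
    p_x, p_c = joint.sum(axis=0), joint.sum(axis=1)
    bayes_error = 1.0 - joint.max(axis=0).sum()   # 1 - sum_x max_c Pr(c, x)
    mi = entropy(p_x) + entropy(p_c) - entropy(joint.ravel())
    return bayes_error, mi

print(bayes_error_and_mi(0.5, 1.0, 0.5))   # X1: error 0.25, I(X1;C) ~ 0.31
print(bayes_error_and_mi(0.5, 0.9, 0.7))   # X2: error 0.20, I(X2;C) ~ 0.29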

A more detailed discussion can be found in [FDV12]. However, it is worthwhile to mention that although using mutual information may not necessarily result in the highest classification accuracy, it is guaranteed to reveal a salient feature subset by reducing the upper and lower bounds on P_e.

3. By adopting classification error as a criterion, most standard classification algorithms fail to correctly classify the instances from minority classes in imbalanced datasets. Common approaches to address this issue are to either assign higher misclassification costs to minority classes or replace the classification accuracy criterion with the area under the ROC curve, which is a more relevant criterion when dealing with imbalanced datasets. Either way, the features should also be selected by an algorithm which is insensitive (robust) with respect to class distributions (otherwise the selected features may not be informative about the minority classes in the first place). Interestingly, by internally applying unequal class-dependent costs, mutual information provides some robustness with respect to class distributions. Thus, even in an imbalanced case, a mutual information based feature selection algorithm is likely (though not guaranteed) not to overlook the features that represent the minority classes. In [Hu11], the concept of the mutual information classifier is investigated. Specifically, the internal cost matrix of the mutual information classifier is derived to show that it applies unequal misclassification costs when dealing with imbalanced data, and it is shown that the mutual information classifier is optimal in the sense of maximizing a weighted classification accuracy rate. The following example illustrates this robustness.

Example 3.2. Assume an imbalanced binary classification task where Pr(C=1) = 0.9. As in Example 3.1, there are two binary features X_1 and X_2 and the goal is to select the optimum feature. Suppose Pr(X_1=1 | C=1) = 1 and Pr(X_1=-1 | C=-1) = 0.5. Unlike the first feature, X_2 can much better classify the minority class: Pr(X_2=-1 | C=-1) = 1 and Pr(X_2=1 | C=1) = 0.8. It can be seen that the Bayes classifier with X_1 results in a 100% classification rate for the majority class but only 50% correct classification for the minority. On the other hand, using X_2 leads to 100% correct classification for the minority class and 80% for the majority. Based on the probability of error, X_1 should be preferred since its probability of error is P_{e1} = 0.05 while P_{e2} = 0.18. However, by using X_1 the classifier cannot learn the rare event (50% classification rate) and thus essentially randomly classifies the minority class, which is the class of interest in many applications. Interestingly, unlike the error probabilities, mutual information prefers X_2 over X_1, since I(X_2; C) = 0.20 is greater than I(X_1; C) = 0.18. That is, mutual information is to some extent robust against imbalanced data.
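The mutual information values of this example can be checked in the same way; only the prior and likelihoods change (the code below is again an illustrative sketch with names of our choosing):

import numpy as np

def mutual_information(joint):
    """I(X;C) in bits from a joint probability table over (C, X)."""
    joint = np.asarray(joint, dtype=float)
    h = lambda p: -np.sum(p[p > 0] * np.log2(p[p > 0]))
    return h(joint.sum(axis=0)) + h(joint.sum(axis=1)) - h(joint.ravel())

# Priors: Pr(C=1) = 0.9, Pr(C=-1) = 0.1; columns are X = +1 / -1.
joint_x1 = np.array([[0.9 * 1.0, 0.9 * 0.0],    # X1 given C = +1
                     [0.1 * 0.5, 0.1 * 0.5]])   # X1 given C = -1
joint_x2 = np.array([[0.9 * 0.8, 0.9 * 0.2],    # X2 given C = +1
                     [0.1 * 0.0, 0.1 * 1.0]])   # X2 given C = -1
print(mutual_information(joint_x1))  # ~0.18
print(mutual_information(joint_x2))  # ~0.20, so mutual information prefers X2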

Unfortunately, despite the theoretical appeal of the mutual information measure, an accurate estimate of the mutual information is impossible given a limited amount of data, because calculating mutual information requires estimating the high-dimensional joint probability Pr(X, C), which is in turn known to be an NP-hard problem [KS01]. As mutual information is hard to evaluate, several alternatives have been suggested [Bat94], [PLD05], [KC02]. For instance, the Max-Relevance criterion approximates (3.1) with the sum of the mutual information values between the individual features X_i and C:

\text{Max-Relevance} = \sum_{i=1}^{N} I(X_i; C) \qquad (3.6)

Since it implicitly assumes that the features are independent, it is likely that the selected features are highly redundant. To overcome this problem, several heuristic corrective terms have been introduced to remove the redundant information and select mutually exclusive features. Here, it is shown that most of these heuristics can be derived from the following expansions of mutual information with respect to X_i.

3.2.1 First Expansion: Multi-way Mutual Information

The first expansion of mutual information used here relies on the natural extension of mutual information to more than two random variables, proposed by McGill [McG54] and Abramson [Abr63]. According to their proposal, the three-way mutual information between random variables Y_i is defined by:

I(Y_1; Y_2; Y_3) = I(Y_1; Y_3) + I(Y_2; Y_3) - I(Y_1, Y_2; Y_3) = I(Y_1; Y_2) - I(Y_1; Y_2 \mid Y_3) \qquad (3.7)

where a comma between variables denotes joint variables. Note that, similar to two-way mutual information, it is symmetric with respect to the Y_i variables, i.e.,

I(Y_1; Y_2; Y_3) = I(Y_2; Y_3; Y_1). Generalizing to N variables:

I(Y_1; Y_2; \ldots; Y_N) = I(Y_1; \ldots; Y_{N-1}) - I(Y_1; \ldots; Y_{N-1} \mid Y_N) \qquad (3.8)

Unlike two-way mutual information, the generalized mutual information is not necessarily nonnegative and hence can be interpreted as a signed measure of set intersection [Han80]. Consider (3.7) and assume Y_3 is the class label C. Then a positive I(Y_1; Y_2; C) implies that Y_1 and Y_2 are redundant with respect to C, since I(Y_1, Y_2; C) ≤ I(Y_1; C) + I(Y_2; C). The more interesting case, however, is when I(Y_1; Y_2; C) is negative, i.e., I(Y_1, Y_2; C) ≥ I(Y_1; C) + I(Y_2; C). This means that the information contained in the interactions of the variables is greater than the sum of the information of the individual variables [Gur09].
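A toy construction (ours, not from the thesis) makes the negative case concrete: if C = X_1 XOR X_2 with independent, uniform binary X_1 and X_2, each feature alone carries no information about C, the pair carries one full bit, and the three-way term of (3.7) is -1:

import numpy as np
from itertools import product

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Joint pmf over (x1, x2, c) for c = x1 XOR x2 and uniform independent x1, x2.
pmf = {(x1, x2, x1 ^ x2): 0.25 for x1, x2 in product([0, 1], repeat=2)}

def H(keep):
    """Entropy of the marginal over the variable positions listed in `keep`."""
    marg = {}
    for outcome, p in pmf.items():
        key = tuple(outcome[i] for i in keep)
        marg[key] = marg.get(key, 0.0) + p
    return entropy(list(marg.values()))

I_x1_c = H([0]) + H([2]) - H([0, 2])          # I(X1; C) = 0
I_x2_c = H([1]) + H([2]) - H([1, 2])          # I(X2; C) = 0
I_x1x2_c = H([0, 1]) + H([2]) - H([0, 1, 2])  # I(X1, X2; C) = 1
print(I_x1_c + I_x2_c - I_x1x2_c)             # I(X1; X2; C) = -1 (synergy)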

An artificial example of this situation is the binary classification problem depicted in Figure 3.1, where the classification task is to discriminate between the ellipse class (class samples depicted by circles) and the line class (star samples) using two features: the values of the x axis and the values of the y axis. As can be seen, since I(x; C) ≈ 0 and I(y; C) ≈ 0, there is no way to distinguish between these two classes by using just one of the features. However, it is obvious that employing both features results in almost perfect classification, i.e., I(x, y; C) ≈ H(C).

The mutual information in (3.1) can be expanded in terms of generalized mutual information between the features and the class label as:

I(X; C) = \sum_{i_1=1}^{N} I(X_{i_1}; C) - \sum_{i_1=1}^{N-1} \sum_{i_2=i_1+1}^{N} I(X_{i_1}; X_{i_2}; C) + \cdots + (-1)^{N-1} I(X_1; \ldots; X_N; C) \qquad (3.9)

From the definition in (3.8) it is straightforward to infer this expansion. However, a more intuitive proof is to use the fact that mutual information is a measure of set intersection, i.e., I(Y_1; Y_2; Y_3) = \mu(\mathcal{Y}_1 \cap \mathcal{Y}_2 \cap \mathcal{Y}_3), where \mathcal{Y}_i is the event set corresponding to the variable Y_i.


Figure 3.1: Synergy between x and y features. While information of each individual feature about the class label (ellipse or line) is almost zero, their joint information can almost completely remove the class label ambiguity.

Now, expanding the N-variable measure function results in

I(X; C) = \mu\Big(\big(\bigcup_{i=1}^{N} \mathcal{X}_i\big) \cap \mathcal{C}\Big) = \mu\Big(\bigcup_{i=1}^{N} (\mathcal{X}_i \cap \mathcal{C})\Big) \qquad (3.10)
= \sum_{i=1}^{N} \mu(\mathcal{X}_i \cap \mathcal{C}) - \sum_{i_1=1}^{N-1} \sum_{i_2=i_1+1}^{N} \mu(\mathcal{X}_{i_1} \cap \mathcal{X}_{i_2} \cap \mathcal{C}) + \cdots + (-1)^{N-1} \mu(\mathcal{X}_1 \cap \mathcal{X}_2 \cap \cdots \cap \mathcal{X}_N \cap \mathcal{C})

where the last equality follows directly from the addition law (inclusion-exclusion) of set theory. The proof is completed by recalling that all measure functions with set intersection arguments in the last equation can be replaced by mutual information functions according to the definition of mutual information in (3.2).

3.2.2 Second Expansion: Chain Rule of Information

The second expansion of mutual information is based on the chain rule of information [CT91]:

I(X; C) = \sum_{i=1}^{N} I(X_i; C \mid X_{i-1}, \ldots, X_1) \qquad (3.11)

The chain rule of information leaves the choice of ordering quite flexible. For example, the right-hand side can be written in the order (X_1, X_2, \ldots, X_N) or (X_N, X_{N-1}, \ldots, X_1). In general, it can be expanded over the N! different permutations of the feature set \{X_1, \ldots, X_N\}. Taking the sum over all possible expansions yields

N!\, I(X; C) = (N-1)! \sum_{i=1}^{N} I(X_i; C) + (N-2)! \sum_{i_1=1}^{N} \sum_{i_2 \in \{1,\ldots,N\} \setminus i_1} I(X_{i_2}; C \mid X_{i_1}) + \cdots + (N-1)! \sum_{i=1}^{N} I(X_i; C \mid \{X_1, \ldots, X_N\} \setminus X_i) \qquad (3.12)

Dividing both sides by 2(N-1)! and using the identity I(X_{i_1}; C \mid X_{i_2}) = I(X_{i_1}; C) - I(X_{i_1}; X_{i_2}; C) to replace the I(X_{i_1}; C \mid X_{i_2}) terms, our second expansion can be expressed as

\frac{N}{2}\, I(X; C) = \sum_{i=1}^{N} I(X_i; C) - \frac{1}{N-1} \sum_{i_1=1}^{N-1} \sum_{i_2=i_1+1}^{N} I(X_{i_1}; X_{i_2}; C) + \cdots + \frac{1}{2} \sum_{i=1}^{N} I(X_i; C \mid \{X_1, \ldots, X_N\} \setminus X_i) \qquad (3.13)

Ignoring the unimportant multiplicative constant N/2 on the left-hand side of equation (3.13), the right-hand side can be seen as a series expansion of mutual information (up to a known constant factor).

3.2.3 Truncation of the Expansions

In the proposed expansions (3.9) and (3.13), mutual information terms with more than two features represent higher-order interaction properties. Neglecting the higher-order terms yields the so-called truncated approximation of the mutual information function. If the constant coefficient in (3.13) is ignored, the truncated forms of the suggested expansions can be written as:

D_1 = \sum_{i=1}^{N} I(X_i; C) - \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} I(X_i; X_j; C) \qquad (3.14)

D_2 = \sum_{i=1}^{N} I(X_i; C) - \frac{1}{N-1} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} I(X_i; X_j; C) \qquad (3.15)

where D_1 is the truncated approximation of (3.9) and D_2 that of (3.13). Interestingly, despite the very similar structure of the expressions in (3.14) and (3.15), they have intrinsically different behaviors. This difference seems to be rooted in the different functional forms they employ to approximate the underlying high-order pdf with lower-order distributions (i.e., how they combine these lower-order terms). For instance, the functional form that MIFS employs to approximate Pr(X) is shown in (3.19). While D_1 is not necessarily positive, D_2 is guaranteed to be a positive approximation since all terms in (3.12) are positive. However, D_2 may strongly underestimate the mutual information value since it may violate the fact that (3.1) is always greater than or equal to max_i I(X_i; C).
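For discrete features, both truncated criteria can be estimated with plug-in (histogram) entropies. The following sketch (function names and the toy data are ours) forms the three-way terms via I(X_i; X_j; C) = I(X_i; X_j) - I(X_i; X_j | C) and evaluates D_1 and D_2 for a full feature set:

import numpy as np
from itertools import combinations

def joint_entropy(*columns):
    """Plug-in joint entropy (bits) of one or more discrete sample columns."""
    joint = np.stack(columns, axis=1)
    _, counts = np.unique(joint, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def mi(a, b):
    return joint_entropy(a) + joint_entropy(b) - joint_entropy(a, b)

def cond_mi(a, b, c):
    """I(A; B | C) = H(A,C) + H(B,C) - H(A,B,C) - H(C)."""
    return (joint_entropy(a, c) + joint_entropy(b, c)
            - joint_entropy(a, b, c) - joint_entropy(c))

def truncated_scores(X, y):
    """Return (D1, D2) of the full feature set, following (3.14) and (3.15)."""
    n = X.shape[1]
    relevance = sum(mi(X[:, i], y) for i in range(n))
    interaction = sum(mi(X[:, i], X[:, j]) - cond_mi(X[:, i], X[:, j], y)
                      for i, j in combinations(range(n), 2))
    return relevance - interaction, relevance - interaction / (n - 1)

# Toy usage on random discrete data (illustration only).
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(500, 5))
y = (X[:, 0] + X[:, 1] > 2).astype(int)
print(truncated_scores(X, y))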

JMI, mRMR & MIFS Criteria

Several known criteria, including joint mutual information (JMI) [MB06], minimal Redundancy Maximal Relevance (mRMR) [PLD05] and Mutual Information Feature Selection (MIFS) [Bat94], can immediately be derived from D_1 and D_2. Using the identity

I(X_i; X_j; C) = I(X_i; C) + I(X_j; C) - I(X_i, X_j; C)

in D_2 reveals that D_2 is equivalent, up to a constant factor, to the JMI criterion:

\text{JMI} = \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} I(X_i, X_j; C) \qquad (3.16)

Using I(X_i; X_j; C) = I(X_i; X_j) - I(X_i; X_j \mid C) and ignoring the terms containing more than two variables, i.e., I(X_i; X_j \mid C), in the second approximation D_2, one immediately recognizes the popular score function

\text{mRMR} = \sum_{i=1}^{N} I(X_i; C) - \frac{1}{N-1} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} I(X_i; X_j) \qquad (3.17)

introduced by Peng et al. in [PLD05]. That is, mRMR is a truncated approximation of mutual information and not a heuristic approximation as suggested in [BPZL12].

The same line of reasoning as for mRMR can be applied to D1 to achieve MIFS with β equal to 1.

\text{MIFS} = \sum_{i=1}^{N} I(X_i; C) - \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} I(X_i; X_j) \qquad (3.18)

Observation: A constant feature is a potential danger for the above measures. While adding an informative but correlated feature may reduce the score value (since I(X_i; X_j \mid C) - I(X_i; X_j) can be negative), adding a non-informative constant feature Z to a feature set does not reduce its score value, since both the I(Z; C) and I(Z; X_i; C) terms are zero. That is, constant features may be preferred over informative but correlated features. Therefore, it is essential to remove constant features by some pre-processing before using the above criteria for feature selection.
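For completeness, the criteria above are conventionally optimized with a greedy, one-feature-at-a-time search; this is the sequential baseline that the search strategy developed later in this chapter aims to improve upon. Below is a minimal sketch of greedy forward mRMR selection in the usual incremental form of Peng et al., where the redundancy term is averaged over the already selected features (helper names are illustrative):

import numpy as np

def joint_entropy(*columns):
    joint = np.stack(columns, axis=1)
    _, counts = np.unique(joint, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def mi(a, b):
    return joint_entropy(a) + joint_entropy(b) - joint_entropy(a, b)

def mrmr_forward(X, y, P):
    """Greedily pick P feature indices maximizing relevance minus average redundancy."""
    n = X.shape[1]
    relevance = [mi(X[:, i], y) for i in range(n)]
    selected, remaining = [], list(range(n))
    while len(selected) < P:
        def score(i):
            redundancy = np.mean([mi(X[:, i], X[:, j]) for j in selected]) if selected else 0.0
            return relevance[i] - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy usage (illustration only).
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(300, 8))
y = ((X[:, 0] + X[:, 3]) > 0).astype(int)
print(mrmr_forward(X, y, 3))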

Implicitly Assumed Distribution

An obvious question arising in this context with respect to the proposed truncated approximations is: under what probabilistic assumptions do the proposed approximations become valid mutual information functions? That is, which structure should a joint pdf admit in order to yield mutual information in the form of D_1 or D_2?

For instance, if we assume that the features are mutually and class-conditionally independent, i.e., \Pr(X) = \prod_{i=1}^{N} \Pr(X_i) and \Pr(X, C) = \Pr(C) \prod_{i=1}^{N} \Pr(X_i \mid C), then it is easy to verify that the mutual information takes the form of the Max-Relevance criterion introduced in (3.6). These two assumptions define the adopted independence-map of Pr(X, C), where the independence-map of a joint probability distribution is defined as follows.

Definition 3.3. An independence-map (i-map) is a look-up table or a set of rules that denotes all the conditional and unconditional independences between random variables. Moreover, an i-map is consistent if it leads to a valid factorized probability distribution.

That is, given a consistent i-map, a high-order joint probability distribution is approximated with a product of low-order pdfs and the obtained approximation is a valid pdf itself (e.g., \prod_{i=1}^{N} \Pr(X_i) is an approximation of the high-order pdf Pr(X) and is also a valid probability distribution).

The question regarding the implicit consistent i-map that MIFS adopts has been investigated in [BP10]. However, the assumption set (i-map) suggested in their work is inconsistent and leads to the incorrect conclusion that MIFS upper bounds the Bayesian classification error via the inequality (3.4). As shown in the following theorem, unlike the Max-Relevance case, there is no i-map that can produce mutual information in the forms of mRMR or MIFS (ignoring the trivial solution that reduces mRMR or MIFS to Max-Relevance).

Theorem 3.4. There is no consistent i-map, other than the trivial solution, i.e., the i-map indicating that the random variables are mutually and class-conditionally independent, that can produce mutual information functions in the forms of mRMR (3.17) or MIFS (3.18) for an arbitrary number of features.

Proof: The proof is by contradiction. Suppose there is a consistent i-map whose corresponding joint pdf \hat{\Pr}(X, C) (the approximation of Pr(X, C)) can generate mutual information in the forms of (3.17) or (3.18). That is, if this i-map is adopted, replacing \hat{\Pr}(X, C) in (3.1) yields mRMR or MIFS. This implies that mRMR and MIFS are always valid set measures for all datasets, regardless of their true underlying joint probability distributions. Now, if it is shown (by any example) that they are not valid mutual information measures, i.e., they are not always positive and monotonic, then the assumption that \hat{\Pr}(X, C) exists and is a valid pdf has been contradicted. It is not difficult to construct an example in which mRMR or MIFS takes negative values: consider the case where the features are independent of the class label, I(X_i; C) = 0, while they have nonzero dependencies among themselves, I(X_i; X_j) ≠ 0. In this case, both mRMR and MIFS generate negative values, which is not allowed for a valid set measure. This contradicts our assumption that they are generated by a valid distribution, so we are forced to conclude that there is no consistent i-map that results in mutual information in the mRMR or MIFS forms.

The same line of reasoning can be used to show that D_1 and D_2 are not valid measures either. However, despite the fact that no valid pdf can produce mutual information of those forms, it is still valid to ask for which low-order approximations of the underlying high-order pdfs mutual information reduces to a truncated approximation form. That is, we no longer restrict an approximation to be a valid distribution. Any functional form of low-order pdfs may be seen as an approximation of the high-order pdfs and may give rise to MIFS or mRMR. In the following, these assumptions are revealed for the MIFS criterion.

MIFS Derivation from Kirkwood Approximation

It is shown in [KKG07] that truncating the joint entropy H(X) at the r-th order is equivalent to approximating the full-dimensional pdf Pr(X) using joint pdfs of dimensionality r or smaller. This approximation is called the r-th order Kirkwood approximation. The chosen truncation order partially determines our belief about the structure of the function that approximates Pr(X). The 2nd order Kirkwood approximation of Pr(X) can be written as follows [KKG07]:

\hat{\Pr}(X) = \frac{\prod_{i=1}^{N-1} \prod_{j=i+1}^{N} \Pr(X_i, X_j)}{\left(\prod_{i=1}^{N} \Pr(X_i)\right)^{N-2}} \qquad (3.19)

Now, assume the following two assumptions hold:

Assumption 3.5. The features are class-conditionally independent, that is, \Pr(X \mid C) = \prod_{i=1}^{N} \Pr(X_i \mid C).

Assumption 3.6. \Pr(X) is well approximated by the 2nd order Kirkwood superposition approximation in (3.19).

Then, writing the definition of mutual information and applying the above assumptions yields the MIFS criterion

I(X; C) = H(X) - H(X \mid C) \qquad (3.20)
\overset{(a)}{\approx} \sum_{i=1}^{N} H(X_i) - \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} I(X_i; X_j) - H(X \mid C)
\overset{(b)}{=} \sum_{i=1}^{N} I(X_i; C) - \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} I(X_i; X_j)

In the above equation, (a) follows from the second assumption by substituting the 2nd order Kirkwood approximation (3.19) inside the logarithm of the entropy integral, and (b) is an immediate consequence of the first assumption. The first assumption has already appeared in previous works [BPZL12], [BP10]. The second assumption, however, is novel and, to the best of our knowledge, the connection between the Kirkwood approximation and the MIFS criterion has not been explored before. It is worth mentioning that, in reality, both assumptions can be violated. Specifically, the Kirkwood approximation may not precisely reproduce dependencies observed in real-world datasets. Moreover, it is important to remember that the Kirkwood approximation is not a valid probability distribution.
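This last point is easy to verify numerically: applying the second-order Kirkwood superposition formula (3.19) to a small, randomly generated three-variable pmf (a toy setup of ours) generally produces a function that does not sum to one:

import numpy as np

rng = np.random.default_rng(0)
p = rng.random((2, 2, 2))
p /= p.sum()                                   # a valid joint pmf Pr(X1, X2, X3)

p12, p13, p23 = p.sum(axis=2), p.sum(axis=1), p.sum(axis=0)              # pairwise marginals
p1, p2, p3 = p.sum(axis=(1, 2)), p.sum(axis=(0, 2)), p.sum(axis=(0, 1))  # single marginals

kirkwood = np.zeros_like(p)
for i in range(2):
    for j in range(2):
        for k in range(2):
            # Numerator: product of pairwise marginals; denominator: (product of singles)^(N-2), N = 3.
            kirkwood[i, j, k] = (p12[i, j] * p13[i, k] * p23[j, k]) / (p1[i] * p2[j] * p3[k])

print(kirkwood.sum())   # typically != 1, i.e. not a valid probability distribution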

3.2.4 The Superiority of the D2 Approximation

Measure D_2 always yields a positive score and generally tends to underestimate the mutual information, while D_1 shows a large overestimation for independent features and a large underestimation (even becoming negative) in the presence of dependent features. In general, D_2 is more robust than D_1. The same behavior can be observed for mRMR, which is derived from D_2, and MIFS, which is derived from D_1. Previous works also arrived at the same conclusion and reported that mRMR performs better and more robustly than MIFS, especially when the feature set is large. Therefore, in the following sections D_2 is used as the truncated approximation. For simplicity, its subscript is dropped and it is rewritten as follows:

D(\{X_1, \ldots, X_N\}) = \sum_{i=1}^{N} I(X_i; C) - \frac{1}{N-1} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} I(X_i; X_j; C) \qquad (3.21)

Note that although D in (3.21) is no longer a formal set measure, it can still be seen as a score function for sets. However, it is noteworthy that, unlike formal measures, the suggested approximations are no longer monotonic, where monotonicity merely means that a subset of features should not score better than any larger set containing that very same subset. Therefore, as explained in [NF77], branch and bound based search strategies cannot be applied to them.

A very similar approach has been applied in [Bro09] (using the D_1 approximation) to derive several known criteria such as MIFS [Bat94] and mRMR [PLD05]. However, in [Bro09] and most other previous works, the set score function in (3.21) is immediately reduced to an individual-feature score function by fixing N-1 features of the feature set. This allows running a greedy selection search over the feature set, which essentially is a one-feature-at-a-time selection strategy. It is clearly a naive approximation of the optimal NP-hard search algorithm and may perform poorly under some conditions. In the following, a convex approximation of the binary objective function appearing in feature selection is investigated. This approximation is inspired by the Goemans-Williamson maximum-cut approximation approach [GW95].

3.3 Search Strategies

Given a measure function¹ D, the subset selection problem (SSP) can be defined as follows:

Definition 3.7. Given N features X_i and a dependent variable C, select a subset of P ≪ N features that maximizes the measure function. Here it is assumed that the cardinality P of the optimal feature subset is known.

¹By some abuse of terminology, any set function in this section is referred to as a measure, regardless of whether it satisfies the formal measure properties.

In practice, the exact value of P can be obtained by evaluating subsets of different cardinalities P with the final induction algorithm. Note that this is intrinsically different from wrapper methods: while in wrapper methods 2^N subsets have to be tested, here at most N runs of the learning algorithm are required to evaluate all possible values of P. A search strategy is an algorithm that tries to find a feature subset in the feature subset space with 2^N members² that optimizes the measure function. The wide range of search strategies proposed in the literature can be divided into three categories:

• Exponential complexity methods, including exhaustive search [Koh96] and branch and bound based algorithms [NF77].
• Sequential selection strategies, with two very popular members: forward selection and backward elimination.
• Stochastic methods like simulated annealing and genetic algorithms [VD93], [Doa92].

Here, a fourth class of search strategies is introduced, based on the convex relaxation of (0-1)-integer programming; its approximation ratio is explored by establishing a link between the SSP and an instance of the maximum-cut problem in graph theory. In the following, the two popular sequential search methods are briefly discussed and the proposed solution is presented: a close-to-optimal polynomial-time search algorithm and its evaluation on different datasets.

3.3.1 Convex Based Search

The forward selection (FS) algorithm selects a set S of size P iteratively as follows:

1. Initialize S_0 = ∅.
2. In each iteration i, select the feature X_m maximizing D(S_{i-1} ∪ X_m), and set S_i = S_{i-1} ∪ X_m.
3. Output S_P.

²Given P, the size of the feature subset space reduces to \binom{N}{P}.

Similarly, backward elimination (BE) can be described as:

1. Start with the full set of features S_N.
2. Iteratively remove the feature X_m maximizing D(S_i \setminus X_m), and set S_{i-1} = S_i \setminus X_m, where removing X from S is denoted by S \setminus X.
3. Output S_P.

An experimental comparative evaluation of several variants of these two algorithms has been conducted in [AB96]. From an information-theoretic standpoint, the main disadvantage of the forward selection method is that it can only evaluate the utility of a single feature in the limited context of the previously selected features. The artificial binary classifier in Figure 3.1 illustrates this issue: since the information content of each feature (x and y) is almost zero, it is highly probable that forward selection fails to select them in the presence of some other, more informative features. Contrary to forward selection, backward elimination can evaluate the contribution of a given feature in the context of all other features. Perhaps this is why it has frequently been reported to perform better than forward selection. However, its overemphasis on feature interactions is a double-edged sword and may lead to a sub-optimal solution, as the following example shows.

Example 3.8. Imagine a four-dimensional feature selection problem where X_1 and X_2 are class-conditionally and mutually independent of X_3 and X_4, i.e., Pr(X_1, X_2, X_3, X_4) = Pr(X_1, X_2) Pr(X_3, X_4) and Pr(X_1, X_2, X_3, X_4 | C) = Pr(X_1, X_2 | C) Pr(X_3, X_4 | C). Assume I(X_1; C) and I(X_2; C) are equal to zero, while their interaction is informative, that is, I(X_1, X_2; C) = 0.4. Moreover, assume I(X_3; C) = 0.2, I(X_4; C) = 0.25 and I(X_3, X_4; C) = 0.45. The goal is to select only two features out of four. Here, backward elimination will select {X_1, X_2} rather than the optimal subset {X_3, X_4} because removing either X_1 or X_2 results in a 0.4 reduction of the mutual information value I(X_1, \ldots, X_4; C), while eliminating X_3 or X_4 deducts at most 0.25.
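The example can be reproduced with a small subset-score oracle built from the stated mutual information values; by the independence assumptions, the score of a subset is the sum of the contributions of its two blocks. The sketch below (illustrative code, not from the thesis) runs backward elimination on this oracle and compares the result with the best pair:

from itertools import combinations

# I(S; C) for subsets within each independent block, taken from Example 3.8.
block_a = {(): 0.0, (1,): 0.0, (2,): 0.0, (1, 2): 0.4}
block_b = {(): 0.0, (3,): 0.2, (4,): 0.25, (3, 4): 0.45}

def score(subset):
    a = tuple(sorted(i for i in subset if i in (1, 2)))
    b = tuple(sorted(i for i in subset if i in (3, 4)))
    return block_a[a] + block_b[b]

def backward_elimination(features, target_size):
    s = set(features)
    while len(s) > target_size:
        # Drop the feature whose removal keeps the remaining score highest.
        s.remove(max(s, key=lambda f: score(s - {f})))
    return s

selected = backward_elimination({1, 2, 3, 4}, 2)
print(selected, score(selected))                           # {1, 2} with score 0.4
best_pair = max(combinations([1, 2, 3, 4], 2), key=score)
print(best_pair, score(best_pair))                         # (3, 4) with score 0.45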

One may draw the conclusion that backward elimination tends to sacrifice the individually-informative features in favor of the merely cooperatively-informative features. As a remedy, several hybrid forward-backward sequential search methods have been proposed. However, they all fail in one way or another and, more importantly, cannot guarantee the goodness of the solution.

Alternatively, a sequential search method can be seen as an approximation of the combinatorial subset selection problem. To propose a new approximation method, the underlying combinatorial problem has to be studied. To this end, the SSP defined at the beginning of this section is reformulated as:

\max_{x}\; x^T Q x
\text{s.t.} \quad \sum_{i=1}^{N} x_i = P, \qquad x_i \in \{0, 1\} \;\text{for}\; i = 1, \ldots, N \qquad (3.22)

where Q is a symmetric mutual information matrix constructed from the mutual information terms in (3.21):

Q = \begin{bmatrix} I(X_1; C) & \cdots & -\frac{\lambda}{2} I(X_1; X_N; C) \\ -\frac{\lambda}{2} I(X_1; X_2; C) & \cdots & -\frac{\lambda}{2} I(X_2; X_N; C) \\ \vdots & \ddots & \vdots \\ -\frac{\lambda}{2} I(X_1; X_N; C) & \cdots & I(X_N; C) \end{bmatrix} \qquad (3.23)

where \lambda = \frac{1}{P-1} and x = [x_1, \ldots, x_N] is a binary vector whose entries x_i are set-membership indicators for the presence of the corresponding features X_i in the feature subset. It is straightforward to verify that for any binary vector x, the objective function in (3.22) is equal to the score function D(X_{nz}) where X_{nz} = \{X_i \mid x_i = 1;\; i = 1, \ldots, N\}. Note that for mRMR the I(X_i; X_j; C) terms have to be replaced with I(X_i; X_j).

The (0,1)-quadratic programming problem (3.22) has attracted a great deal of theoretical study because of its importance in combinatorial problems (see [PRW95] and references therein). This problem can simply be transformed into a (-1,1)-quadratic programming problem,

\max_{y}\; \frac{1}{4} y^T Q y + \frac{1}{2} y^T Q e + c
\text{s.t.} \quad \sum_{i=1}^{N} y_i = 2P - N, \qquad y_i \in \{-1, 1\} \;\text{for}\; i = 1, \ldots, N \qquad (3.24)

via the transformation y = 2x - e, where e is the all-ones vector. The constant c = \frac{1}{4} e^T Q e in the above formulation can be ignored because it is independent of y. In order to homogenize the objective function in (3.24), an (N+1) \times (N+1) matrix Q^u is defined by adding a 0-th row and column to Q so that:

Q^u = \begin{bmatrix} 0 & e^T Q \\ Q^T e & Q \end{bmatrix} \qquad (3.25)

Ignoring the constant factor \frac{1}{4} in (3.24), the equivalent homogeneous form of (3.22) can be written as:

S_{SSP} = \max_{y}\; y^T Q^u y
\text{s.t.} \quad \sum_{i=1}^{N} y_i y_0 = 2P - N, \qquad y_i \in \{-1, 1\} \;\text{for}\; i = 0, \ldots, N \qquad \langle SSP \rangle \quad (3.26)

Note that y is now an (N+1)-dimensional vector with the first element y_0 = ±1 as a reference variable. Given the solution y of the problem above, the optimal feature subset is obtained by X_{op} = \{X_i \mid y_i = y_0\}. The optimization problem in (3.26) can be seen as an instance of the maximum-cut problem [GW95] with an additional cardinality constraint, also known as the k-heaviest subgraph or maximum partitioning graph problem. The two main approaches to solve this combinatorial problem are either to use the linear programming relaxation obtained by linearizing the product of two binary variables [FY83], or the semi-definite programming (SDP) relaxation suggested in [GW95]. The SDP relaxation has been shown to perform exceptionally well and achieves an approximation ratio of 0.878 for the original maximum-cut problem. The SDP relaxation of (3.26) is:

S_{SDP} = \max_{Y}\; \mathrm{tr}\{Q^u Y\}
\text{s.t.} \quad \sum_{i,j=1}^{N} Y_{ij} = (2P - N)^2, \quad \sum_{i=1}^{N} Y_{i0} = 2P - N, \quad \mathrm{diag}(Y) = e, \quad Y \succeq 0 \qquad \langle SDP \rangle \quad (3.27)

where Y is an unknown (N+1) \times (N+1) positive semi-definite matrix and tr{Y} denotes its trace. Obviously, any feasible solution y of ⟨SSP⟩ is also feasible for its SDP relaxation via Y = y y^T. Furthermore, it is not difficult to see that any rank-one solution, rank(Y) = 1, of ⟨SDP⟩ is a solution of ⟨SSP⟩. The ⟨SDP⟩ problem can be solved to within an additive error γ of the optimum, for example by interior point methods [BV04], whose computational complexity is polynomial in the size of the input and log(1/γ). However, since its solution is not necessarily a rank-one matrix, some further steps are required to obtain a feasible solution for ⟨SSP⟩. Algorithm 3.1 summarizes the approximation algorithm for ⟨SSP⟩, which in the following will be referred to as the convex based relaxation approximation (COBRA) algorithm.

Algorithm 3.1: COBRA Feature Selection
Input: number of repetitions T and SDP parameters in (3.27).
For t = 1, ..., T do
  (a) Randomized rounding: Using the multivariate normal distribution with zero mean and covariance matrix R = Y_{SDP}, sample u from N(0, R) and construct x̂ = sign(u). Select X_t = {X_i | x̂_i = x̂_0}.
  (b) Size adjustment: Using the greedy forward or backward algorithm, resize the cardinality of X_t to P.
End
Output: the X_t with the maximum SDP score.

The randomized rounding step in COBRA is a standard procedure to produce a binary solution from the real-valued solution of ⟨SDP⟩ and is widely used for designing and analyzing approximation algorithms [Rag88]. The third step constructs a feasible solution that satisfies the cardinality constraint. Generally, it can be skipped, since in feature selection problems exact satisfaction of the cardinality constraint is not required.
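As a rough illustration of the SDP relaxation and the randomized rounding, the two steps can be prototyped with an off-the-shelf modelling package. The sketch below uses cvxpy with the SCS solver, which is our choice for illustration only (the thesis experiments used the SDP-NAL solver through the Yalmip interface in Matlab); it takes a precomputed mutual information matrix Q as in (3.23) and omits the size-adjustment step:

import numpy as np
import cvxpy as cp

def cobra_sketch(Q, P, n_rounds=20, rng=None):
    """Approximate feature selection by the SDP relaxation (3.27) plus randomized rounding."""
    if rng is None:
        rng = np.random.default_rng(0)
    N = Q.shape[0]

    # Homogenized matrix Q^u of (3.25).
    Qu = np.zeros((N + 1, N + 1))
    Qu[0, 1:] = Q.sum(axis=0)
    Qu[1:, 0] = Q.sum(axis=1)
    Qu[1:, 1:] = Q

    # SDP relaxation (3.27).
    Y = cp.Variable((N + 1, N + 1), symmetric=True)
    constraints = [Y >> 0,
                   cp.diag(Y) == 1,
                   cp.sum(Y[1:, 1:]) == (2 * P - N) ** 2,
                   cp.sum(Y[1:, 0]) == 2 * P - N]
    cp.Problem(cp.Maximize(cp.trace(Qu @ Y)), constraints).solve(solver=cp.SCS)

    # Randomized rounding: sample u ~ N(0, Y) and keep features whose sign matches y0.
    best_subset, best_score = None, -np.inf
    for _ in range(n_rounds):
        u = rng.multivariate_normal(np.zeros(N + 1), Y.value, check_valid="ignore")
        x = (np.sign(u[1:]) == np.sign(u[0])).astype(float)
        s = x @ Q @ x                     # set score of (3.22); cardinality not enforced here
        if s > best_score:
            best_subset, best_score = np.flatnonzero(x), s
    return best_subset, best_score

In this form the returned subset may not have exactly P members; step (b) of Algorithm 3.1 would adjust the cardinality greedily.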

The SDP-NAL solver [ZST10] was used with the Yalmip interface [Lof04] to implement this algorithm in Matlab. SDP-NAL uses the Newton augmented Lagrangian method to efficiently solve SDP problems. It can solve large-scale problems (N up to a few thousand) in about an hour on a PC with an Intel Core i7 CPU. Even more efficient algorithms for low-rank SDP have been suggested, claiming to solve problems of size up to N = 30000 in a reasonable amount of time [GPP+12]. Here, only the SDP-NAL solver was used in the experiments.

3.3.2 Approximation Analysis

In order to gain insight into the quality of a measure function, it is essential to be able to examine it directly. However, since estimating the exact mutual information value on real data is not feasible, it is not possible to evaluate the measure function directly. Its quality can only be examined indirectly through the final classification performance (or other measurable criteria). However, the quality of the measure function is not the only contributor to the classification rate. Since the SSP is an NP-hard problem, the search strategy can only find a locally optimal solution. That is, besides the quality of the measure function, the inaccuracy of the search strategy also contributes to the final classification error. Thus, in order to draw a conclusion concerning the quality of a measure function, it is essential to have an insight into the accuracy of the search strategy in use. In this section, the accuracy of the proposed method is compared with that of the traditional backward elimination approach.

A standard approach to investigate the accuracy of an optimization algorithm is to analyze how close it gets to the optimal solution. Unfortunately, feature selection is an NP-hard problem, and thus obtaining the optimal solution to use as a reference is only feasible for small-sized problems. In such cases, one instead wants provable guarantees on the solution quality, such as the algorithm's approximation ratio. Given a maximization problem, an algorithm is called a ρ-approximation algorithm if its approximate solution is at least ρ times the optimal value; in our case, ρ S_{SSP} ≤ S_{COBRA}, where S_{COBRA} = D(X_{COBRA}). The factor ρ is usually referred to as the approximation ratio in the literature.

The approximation ratios of BE and COBRA can be found by linking the SSP to the k-heaviest subgraph problem (k-HSP) in graph theory. k-HSP is an instance of the max-cut problem with a cardinality constraint on the selected subset, that is, determining a subset S of k vertices such that the weight of the subgraph induced by S is maximized [SW98]. From the definition of k-HSP, it is clear that the SSP with the criterion (3.21) is equivalent to the P-heaviest subgraph problem, since it selects the heaviest subset of features with cardinality P, where the heaviness of a set is the score assigned to it by D. An SDP based algorithm for k-HSP has been suggested in [SW98] and its approximation ratio has been analyzed. Their results are directly applicable to COBRA, since both algorithms use the same randomization method (step 2 of COBRA) and the randomization is the main ingredient of their approximation analysis.

Values of P   N/2    N/3    N/4    N/6    N/8     N/10    N/20
BE            0.4    0.25   0.16   0.10   0.071   0.055   0.026
COBRA         0.48   0.33   0.24   0.13   0.082   0.056   0.015

Table 3.1: Approximation ratios of backward elimination (BE) and COBRA for different values of P, expressed as fractions of N.

The approximation ratio of BE for k-HSP has been investigated in [AITT00]. Their analysis is deterministic and the results are also valid for our case, i.e., using BE to maximize D. The approximation ratios of both algorithms for different values of P, as a function of N (the total number of features), are listed in Table 3.1 (the values are calculated from the formulas in [AITT00]). As can be seen, as P becomes smaller the approximation ratio approaches zero, yielding the trivial lower bound 0 on the approximate solution. However, for larger values of P the approximation ratio is nontrivial since it is bounded away from zero. For all cases shown in the table except the last one, COBRA gives a better guarantee bound than BE. Thus, we may conclude that COBRA is more likely than BE to achieve a better approximate solution. The focus of the following section is on comparing the proposed search algorithm with sequential search methods in conjunction with different measure functions and over different classifiers and datasets.

3.4 Experiments for COBRA Evaluation

The evaluation of a feature selection algorithm is an intrinsically difficult task since, in general, there is no direct way to evaluate the goodness of a selection process. Thus, a selection algorithm is usually scored based on the performance of its output, i.e., the selected feature subset, in some specific classification (or regression) system. This kind of evaluation can be referred to as goal-dependent evaluation. However, this method obviously cannot evaluate the generalization power of the selection process over different induction algorithms. To evaluate the generalization strength of a feature selection algorithm, a goal-independent evaluation is required.

Dataset Name      Mnemonic   # Features   # Samples   # Classes
Arrhythmia        ARR        278          370         2
NCI               NCI        9703         60          9
DBWorld e-mails   DBW        4702         64          2
CNAE-9            CNA        856          1080        9
Internet Adv.     IAD        1558         3279        2
Madelon           MAD        500          2000        2
Lung Cancer       LNG        56           32          3
Dexter            DEX        20000        300         2

Table 3.2: Dataset descriptions.

Thus, for the evaluation of the feature selection algorithms, it is proposed to compare the algorithms over different datasets with multiple classifiers. This method leads to a more classifier-independent evaluation process.

Algorithm 3.3: Estimating P by searching over an admissible set that minimizes the classification error rate.
Input: P = {P_1, ..., P_L}.
For P in P do
  (a) Run the COBRA algorithm and output the solution X.
  (b) Derive the classifier error rate by applying K-fold cross-validation.
  (c) Save the classification accuracy CL(P).
End
Output: P_opt = argmax_P CL(P)
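In code, Algorithm 3.3 is a plain cross-validation loop over the admissible cardinalities. The following is a hedged sketch using scikit-learn for the cross-validation and an SVM as the classifier; select_features stands for any selector (for instance COBRA) and is assumed to be supplied:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def choose_cardinality(X, y, candidates, select_features, cv=10):
    """Return the cardinality P (and subset) with the best cross-validated accuracy."""
    best = (None, None, -np.inf)          # (P, subset, accuracy)
    for P in candidates:
        subset = select_features(X, y, P)                      # e.g. COBRA with cardinality P
        acc = cross_val_score(SVC(), X[:, subset], y, cv=cv).mean()
        if acc > best[2]:
            best = (P, subset, acc)
    return best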

Some properties of the eight datasets used in the experiments are listed in Table 3.2. All datasets are available from the UCI machine learning archive [FA10], except the NCI data, which can be found on the website of Peng et al. [PLD05]. These datasets have been widely used in previous feature selection studies [PLD05], [CSO10]. The goodness of each feature set was evaluated with five classifiers: support vector machine (SVM), random forest (RF), classification and regression tree (CART), neural network (NN) and linear discriminant analysis (LDA). To derive the classification accuracies, 10-fold cross-validation is performed, except for the NCI, DBW and LNG datasets, where leave-one-out cross-validation is used.

Datasets   NCI             DBW             IAD             CNA
S-ratio    0.814 ± 0.031   0.898 ± 0.024   0.889 ± 0.027   0.972 ± 0.010

Datasets   MAD             LNG             ARR             DEX
S-ratio    0.892 ± 0.028   0.713 ± 0.033   0.677 ± 0.056   0.951 ± 0.019

Table 3.3: Mean and standard deviation of the similarity ratio of feature subsets from the COBRA algorithm. For each dataset (except LNG) 100 similarity ratios were evaluated for P = 10, ..., 109. For the LNG dataset, which only contains 56 features, the average was taken over P = 5, ..., 49.

As explained before, filter-based methods consist of two components: a measure function and a search strategy. The measure functions utilized in the experiments are mRMR and JMI, defined in (3.17) and (3.16), respectively. To unambiguously refer to an algorithm, it is denoted by the measure function plus the search method used in that algorithm, e.g., mRMR+FS. The simple procedure listed in Algorithm 3.3 is employed to search for the optimal value of the subset cardinality P, where P ranges over a set P of admissible values. In the worst case, P = {1, ..., N}.

Table 3.4 and Table 3.5 show the results obtained for the 8 datasets and 5 classifiers. The Friedman test with the corresponding Wilcoxon-Nemenyi post-hoc analysis was used to compare the different algorithms. However, looking at the classification rates even before running the Friedman tests reveals a few interesting points, which are marked in bold font. First, on the small datasets (NCI, DBW and LNG), mRMR+COBRA consistently shows higher performance than the other algorithms. The reason lies in the fact that the similarity ratio of the feature sets selected by COBRA is lower than that of the BE or FS feature sets. The similarity ratio of two consecutive sets X_i and X_j, with j = i + 1, is defined as

S_i = \frac{|X_i \cap X_j|}{|X_i|} \qquad (3.28)

Table 3.3 reports the average of the similarity ratios of 100 subsequent feature sets (\frac{1}{100} \sum_{i=10}^{109} S_i) for the datasets. From the definition of the similarity ratio it is clear that for BE and FS this ratio is always equal to 1. However, because of the randomization step, this ratio may vary widely for COBRA. That is, COBRA generates quite diverse feature sets.

Classifiers     SVM          LDA          CART         RF           NN           Average

NCI Dataset
mRMR+COBRA      (54) 81.7    (95) 78.3    (20) 45.0    (71) 88.3    (60) 75.0    73.67
mRMR+FS         (32) 78.3    (11) 68.3    (2) 45.0     (12) 83.3    (99) 70.0    69.00
mRMR+BE         (26) 76.6    (11) 68.3    (2) 45.0     (13) 85.0    (31) 71.7    69.33
JMI+COBRA       (72) 85.0    (70) 75.0    (28) 45.0    (45) 90.0    (93) 75.0    74.00
JMI+FS          (27) 75.0    (17) 68.3    (82) 45.0    (17) 86.6    (78) 70.0    69.00
JMI+BE          (23) 76.6    (20) 76.6    (7) 33.3     (19) 86.6    (89) 76.6    70.00

DBW Dataset
mRMR+COBRA      (38) 96.9    (152) 92.2   (38) 86.0    (33) 92.2    (33) 98.4    93.12
mRMR+FS         (31) 93.7    (4) 89.0     (4) 86.0     (7) 90.6     (9) 92.2     90.31
mRMR+BE         (110) 93.7   (6) 89.0     (4) 82.8     (29) 92.2    (9) 92.2     90.00
JMI+COBRA       (35) 93.7    (14) 89.0    (8) 82.8     (24) 92.2    (108) 93.7   90.31
JMI+FS          (23) 93.7    (6) 89.0     (5) 82.8     (34) 92.2    (96) 92.2    90.00
JMI+BE          (24) 93.7    (6) 89.0     (5) 82.8     (23) 92.2    (149) 92.2   90.00

CNA Dataset
mRMR+COBRA      (200) 94.0   (183) 92.7   (63) 75.0    (183) 90.8   (187) 92.0   88.91
mRMR+FS         (149) 90.6   (142) 90.4   (7) 70.2     (138) 87.7   (78) 85.5    84.88
mRMR+BE         (199) 94.0   (165) 92.5   (47) 75.0    (176) 90.8   (84) 92.2    88.90
JMI+COBRA       (140) 92.6   (146) 92.2   (47) 75.0    (148) 90.4   (148) 91.4   88.30
JMI+FS          (150) 92.7   (142) 92.1   (48) 75.3    (148) 90.7   (145) 91.3   88.40
JMI+BE          (150) 92.7   (142) 92.1   (48) 75.0    (144) 90.4   (134) 91.2   88.30

IAD Dataset
mRMR+COBRA      (165) 96.5   (140) 96.1   (28) 96.4    (160) 97.2   (68) 97.1    96.64
mRMR+FS         (109) 96.2   (127) 95.8   (127) 96.7   (25) 97.0    (52) 97.2    96.58
mRMR+BE         (22) 96.3    (163) 95.9   (121) 96.1   (109) 97.2   (148) 97.4   96.58
JMI+COBRA       (112) 96.3   (4) 96.3     (9) 96.3     (57) 97.3    (140) 100    97.24
JMI+FS          (9) 96.2     (4) 96.2     (52) 96.4    (7) 96.8     (7) 97.8     96.68
JMI+BE          (4) 96.6     (17) 95.8    (79) 96.3    (13) 96.5    (10) 97.2    96.48

Table 3.4: Comparison of COBRA with the greedy search methods over different datasets. For each classifier and combination of search method and measure function, the value in parentheses is the number of selected features and the second value is the classification accuracy. The last column reports the average of the classification accuracies for each algorithm. Bold numbers represent statistically significant test results where COBRA outperforms the sequential feature selection methods.

Classifiers     SVM          LDA          CART         RF           NN           Average

MAD Dataset
mRMR+COBRA      (12) 83.2    (13) 60.4    (26) 80.5    (12) 88.0    (11) 62.2    74.81
mRMR+FS         (32) 55.3    (5) 55.5     (12) 58.2    (49) 57.3    (5) 52.7     55.82
mRMR+BE         (14) 55.3    (11) 54.8    (31) 57.3    (26) 56.4    (115) 48.6   54.50
JMI+COBRA       (13) 82.5    (12) 60.7    (40) 80.7    (13) 87.6    (4) 61.1     74.54
JMI+FS          (13) 82.5    (12) 60.7    (58) 80.5    (13) 87.9    (19) 59.2    74.20
JMI+BE          (13) 82.5    (12) 60.7    (58) 80.5    (13) 87.3    (20) 60.1    74.25

LNG Dataset
mRMR+COBRA      (23) 75.0    (28) 96.9    (13) 71.8    (28) 68.7    (27) 71.8    76.87
mRMR+FS         (7) 81.2     (5) 68.7     (5) 71.8     (5) 75.0     (6) 71.8     73.75
mRMR+BE         (7) 81.2     (4) 68.7     (4) 71.8     (4) 75.0     (4) 75.0     74.37
JMI+COBRA       (7) 78.1     (6) 71.8     (5) 71.8     (5) 75.0     (5) 68.7     73.12
JMI+FS          (7) 78.1     (4) 71.8     (4) 71.8     (8) 78.1     (5) 68.7     73.75
JMI+BE          (7) 78.1     (6) 71.8     (5) 71.8     (6) 78.1     (6) 71.8     74.37

ARR Dataset
mRMR+COBRA      (45) 81.9    (48) 76.3    (30) 75.4    (43) 82.2    (57) 72.9    77.75
mRMR+FS         (34) 81.3    (43) 76.1    (7) 78.3     (34) 81.3    (5) 75.7     78.56
mRMR+BE         (36) 81.6    (43) 76.3    (22) 78.0    (25) 82.9    (8) 76.1     79.02
JMI+COBRA       (26) 80.6    (51) 74.7    (15) 78.3    (51) 81.5    (13) 71.9    77.41
JMI+FS          (47) 74.3    (38) 73.5    (26) 76.9    (37) 79.2    (54) 70.0    74.80
JMI+BE          (47) 74.3    (38) 73.5    (26) 76.9    (25) 80.0    (29) 68.6    74.66

DEX Dataset
mRMR+COBRA      (3) 92.0     (131) 86.3   (24) 80.7    (3) 93.0     (3) 81.3     86.66
mRMR+FS         (3) 90.3     (56) 87.0    (94) 80.3    (3) 92.0     (3) 80.0     86.00
mRMR+BE         (3) 90.0     (131) 87.3   (18) 80.3    (3) 91.6     (99) 78.6    85.53
JMI+COBRA       (88) 91.6    (13) 83.0    (12) 80.3    (3) 94.0     (3) 81.0     86.00
JMI+FS          (149) 91.0   (129) 87.6   (95) 80.3    (119) 92.3   (94) 80.6    86.40
JMI+BE          (149) 90.0   (128) 87.3   (22) 81.0    (146) 92.0   (138) 78.0   85.60

Table 3.5: Comparison of COBRA with the greedy search methods over different datasets. For each classifier and combination of search method and measure function, the value in parentheses is the number of selected features and the second value is the classification accuracy. The last column reports the average of the classification accuracies for each algorithm. Bold numbers represent statistically significant test results where COBRA outperforms the sequential feature selection methods.

[Figure 3.2 box plots: Friedman test on mean accuracies for mRMR. Post-hoc p-values: CO-BE 0.078, FS-BE 0.965, FS-CO 0.042.]

Figure 3.2: Comparing the search strategies for the mRMR measure with the Friedman test and its corresponding post-hoc analysis. The y-axis is the classification accuracy difference and the x-axis indicates the names of the compared algorithms.

Some of these feature sets have relatively low scores compared with the BE or FS sets. However, since for small datasets the estimated mutual information terms are highly inaccurate, features that rank low under our noisy measure function may in fact be better for classification. As seen in Table 3.3, for NCI the averaged similarity ratio is significantly smaller than 1, while for CNA, a relatively larger dataset, it is almost constant and equal to 1.

The second interesting point concerns the MAD dataset. As can be seen, mRMR with greedy search algorithms performs poorly on this dataset. Several authors have already used this dataset to compare their proposed criteria with mRMR and arrived at the conclusion that mRMR cannot handle highly correlated features, as in the MAD dataset. Surprisingly, however, the performance of mRMR+COBRA is as good as JMI on this dataset, meaning that it is not the criterion but the search method that has difficulty dealing with highly correlated features. Thus, any conclusion with respect to the quality of a measure has to be drawn carefully since, as in this case, the effect of a non-optimal search method can be decisive.

To discover the statistically significant differences between the algorithms, we applied the Friedman test followed by the Wilcoxon-Nemenyi post-hoc analysis, as suggested in [HW99], to the average accuracies (the last column of Tables 3.4 and 3.5). Note that since 8 datasets were used in the experiments, there are 8 independent measurements available for each algorithm. The results of this test for the mRMR based algorithms are depicted in Figure 3.2. In all box plots, CO stands for the COBRA algorithm. Each box plot compares a pair of algorithms. The green box plots represent the existence of a significant difference between the corresponding algorithms. The adjusted p-values for each pair of algorithms are also reported in Figure 3.2. The smaller the p-value, the stronger the evidence against the null hypothesis. As can be seen, COBRA shows a meaningful superiority over both greedy algorithms. However, if the significance level is set at p = 0.05, only FS rejects the null hypothesis and shows a meaningful difference with COBRA. The same test was run for each classifier and its results can be found in Figure 3.3. While three of the classifiers show some differences between FS and COBRA, none of them reveals any meaningful difference between BE and COBRA. At this point, the least one can conclude is that, independent of the classification algorithm at hand, there is a good chance that FS performs worse than COBRA. For JMI, however, the performances of all algorithms are comparable and with only 8 datasets it is difficult to draw any conclusion (see Figure 3.4).

In the next experiment, COBRA is compared with two other convex programming based feature selection algorithms, SOSS [NHP13] and QPFS [RHEC10]. Both SOSS and QPFS employ quadratic programming techniques to maximize a score function. SOSS, however, uses an instance of randomized rounding to generate the set-membership binary values, while QPFS ranks the features based on their scores (obtained from solving the convex problem) and therefore sidesteps the difficulties of generating binary values.


Figure 3.3: Comparing the search strategies for mRMR. Results of the post-hoc tests for each classifier.


Figure 3.4: Comparing the search strategies for JMI. Results of the post-hoc tests for each classifier.

Datasets   mRMR+COBRA     mRMR+QPFS      mRMR+SOSS
MAD        74.81 ± 0.65   71.44 ± 0.57   71.36 ± 0.53
NCI        73.67 ± 2.41   71.00 ± 1.84   72.65 ± 2.13
IAD        96.64 ± 0.16   95.02 ± 0.21   96.64 ± 0.28
ARR        77.75 ± 1.03   78.73 ± 0.84   79.86 ± 1.18
CNA        88.91 ± 0.31   86.93 ± 0.45   85.43 ± 0.49

Table 3.6: Comparison of COBRA with QPFS and SOSS over 5 datasets. Average classification rates and their standard deviations are reported.

Datasets   Time COBRA   Time QPFS   Time SOSS
MAD        175 + 24     11          175 + 5
NCI        368 + 341    180         368 + 27
IAD        540 + 121    202         540 + 12
ARR        6 + 14       1           6 + 4
CNA        120 + 50     25          120 + 7

Table 3.7: Comparison of COBRA with QPFS and SOSS over 5 datasets. The computational times in seconds are reported; the first value for COBRA and SOSS is the time for calculating the mutual information matrix and the second value is the time required to solve the optimization problem.

Note that both COBRA and SOSS first need to calculate the mutual information matrix Q. Once it is calculated, they can solve their corresponding convex optimization problems for different values of P. Table 3.6 reports the average (over the 5 classifiers) classification accuracies of these three algorithms and the standard deviation of these mean accuracies (calculated over the cross-validation folds). In Table 3.7, the computational times of each algorithm for a single run (in seconds) are shown, i.e., the amount of time required to select a feature set with a given number P of features. The reported times for COBRA and SOSS consist of two values: the first is the time required to calculate the mutual information matrix Q and the second is the time required to solve the corresponding convex optimization problem. All values were measured on a PC with an Intel Core i7 CPU. As can be seen from Table 3.7, QPFS is significantly faster than COBRA and SOSS.

                 SVM    CART   RF     NN
ARR  LDA feat.   78.4   73.7   77.1   68.0
     Optimum     81.9   75.4   82.2   72.9
CNA  LDA feat.   92.6   75.0   90.5   91.1
     Optimum     94.0   75.0   90.8   92.0
IAD  LDA feat.   95.8   96.0   97.2   96.3
     Optimum     96.5   96.4   97.2   97.1

Table 3.8: The performance of the classification algorithms when trained with COBRA features optimized for the LDA classifier. This table shows the generalization power of the COBRA features across classifiers.

This computational superiority, however, seems to come at the expense of lower classification accuracy. For large datasets such as IAD, CNA and MAD, the Nyström approximation used in QPFS to cast the problem into a lower-dimensional subspace does not yield a sufficiently precise approximation and results in lower classification accuracies. An important remark for interpreting these results is that, for the NCI dataset (in all experiments), features with low mutual information values with respect to the class label were first filtered out and only 2000 informative features were kept, simply because computing a mutual information matrix of size 9703 × 9703 was computationally too demanding. Thus, the dimension is 2000 and not 9703 as mentioned in Table 3.2.

The generalization power of the COBRA algorithm over different classifiers is another important issue to test. As can be observed in Table 3.4, the number of selected features varies quite markedly from one classifier to another. However, based on our experiments, the optimum feature set of any of the classifiers usually (for large enough datasets) achieves near-optimal accuracy in conjunction with the other classifiers as well. This is shown in Table 3.8 for 4 classifiers and 3 datasets. The COBRA features selected for the LDA classifier in Table 3.4 are used here to train the other classifiers. Table 3.8 lists the accuracies obtained using the LDA features and the optimal features, repeated from Table 3.4. Unlike for the CNA and IAD datasets, a significant accuracy reduction can be observed in the case of the ARR data, which has substantially less training data than CNA and IAD. This suggests that for small datasets a feature selection scheme should take the induction algorithm into account, since the learning algorithm is sensitive to small changes of the feature set.

3.5 COBRA-selected ISCN Features for Lipreading

Having the COBRA algorithm at hand, we can train our first visual speech recognizer with the ISCN features described in Chapter 2. This recognizer is based on GMM-HMM technology. First, the COBRA feature selection algorithm was used to select 74 features from the large set of ISCN features extracted from each frame. The first-order derivatives of these features were then included in this feature set. Compared with conventional visual features such as PCA, LDA and DCT features, the COBRA-selected ISCN features resulted in a visual speech recognizer with 6% absolute recognition rate improvement and 12% lower variance of accuracy over different speakers.

In this experiment the CUAVE dataset [PGTG02] is used, which contains the digits from zero to nine repeated five times by 36 speakers. A brief description of this dataset is given in Section 2.2.1. The size of the ROIs in the CUAVE dataset is 128×128 pixels. In these experiments, 4 sets of features were extracted from these ROIs and compared.

1. ISCN: First- and second-order ISCN features were extracted from the ROIs. By using a 3-layer network and setting the hyper-parameter J corresponding to the spatial resolution to three, about 50000 ISCN features were computed for the ROI of each video frame. Due to the high dimensionality of the ISCN features, they cannot be used directly in the statistical modeling. Hence, the COBRA algorithm was employed to reduce the dimensionality by selecting a small subset of ISCN features.

2. PCA: The commonly used principal component analysis was applied to ROIs. The derived principal component scores corresponding to the largest eigenvalues were then considered the PCA features.

3. LDA: Linear discriminant analysis was performed on the ROIs. The projection of the ROIs onto a low-dimensional linear subspace maximizing the ratio of between-class scatter to within-class scatter yielded the LDA features.

4. DCT: The two-dimensional DCT transform was applied to the ROIs and the features with the highest energy were selected in zig-zag order, so that the higher-energy features appear first in the feature vector (see the sketch below).
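The following is a minimal sketch, under assumptions, of the zig-zag ordering of 2-D DCT coefficients described in item 4: the anti-diagonals of the coefficient grid are traversed so that low-frequency (typically high-energy) coefficients come first. The function names, the stand-in ROI and the number of kept coefficients are illustrative and not taken from the thesis code.

```python
import numpy as np
from scipy.fftpack import dct

def dct2(block):
    # 2-D type-II DCT with orthonormal scaling, applied along both axes.
    return dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')

def zigzag_indices(n):
    # Traverse the n x n coefficient grid along anti-diagonals, alternating
    # direction, so low-frequency coefficients appear first in the ordering.
    order = []
    for s in range(2 * n - 1):
        diag = [(i, s - i) for i in range(n) if 0 <= s - i < n]
        order.extend(diag if s % 2 else diag[::-1])
    return order

roi = np.random.rand(128, 128)           # stand-in for a 128x128 mouth ROI
coeffs = dct2(roi)
order = zigzag_indices(128)
k = 62                                    # illustrative number of kept coefficients
features = np.array([coeffs[i, j] for (i, j) in order[:k]])
```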

The first-order temporal derivatives of the features were also included in all the feature sets. By applying a two-dimensional DCT to the ISCN features and dropping 80% of the low-energy coefficients, the feature space dimensionality was dramatically reduced. However, even keeping only 20% of the features resulted in 10000 features, which is still prohibitively large. Hence, COBRA was used to reduce the feature set cardinality to a practically workable number.

Two sets of experiments, one with word models and one with viseme models, were performed. Visemes are the visual equivalent of phonemes and are used to describe articulatory gestures in lipreading. A correspondence between visemes and phonemes is shown in Table 3.10. This map was suggested by Jeffers and Barley [JB71] to group 43 phonemes into 11 visemes. It was shown in [CH23] that, compared with other viseme-phoneme maps, this correspondence achieves higher recognition accuracy in visual speech recognition. When visemes were used as speech units, 10 three-state HMMs were trained to statistically describe the visemes (10 visemes were sufficient to describe the digits). Each Markov state was modeled with a GMM containing two mixture components with diagonal covariance matrices. When words were directly modeled, an HMM with 9 emitting states (two Gaussian mixture components with diagonal covariance matrix per state) was trained for each digit.

In the first set of experiments the performance of the ISCN features was evaluated in the case where the speech units are words. The visual-only speech recognizer was trained with various numbers of ISCN features selected by COBRA. In order to have a speaker-independent evaluation, a leave-one-speaker-out cross-validation strategy was utilized. Figure 3.5 shows the recognition rates of different feature sets for this test with word HMMs. As can be seen, the graph corresponding to the COBRA-selected ISCN features is not smooth due to the randomization step in the COBRA algorithm, which results in selecting very diverse feature sets. However, the general trend is clear. By increasing the number of features the recognition rate improves and reaches 75% for a feature set with 148 features including the first-order derivatives. After passing that optimal point, it starts to decrease because of the curse of dimensionality and ends up at around 72.6% recognition rate for 220 features. The performances of the V-ASR for LDA, PCA and DCT with respect to the number of features are also depicted in Figure 3.5.

COBRA-selected ISCN features considerably outperform the conventional visual features. Note that both ISCN and LDA features are selected


Figure 3.5: Comparing ISCN features selected by COBRA with conventionally se- lected features in a visual digit recognition task with word models.

in a supervised way to maximize the viseme recognition rate. However, in the first experiment, the selected features were used to model the whole words rather than their constituent visemes. Table 3.9 reports the accuracy rates for various feature sets when the optimal number of features was used to train the V-ASR. The standard deviations of the reported mean accuracies were computed as the square root of the variance of the accuracy rates over the 36 speakers divided by √36 and are an indication of the sensitivity of the recognizer to speaker variation. Two other sensitivity indicators reported for each feature set are the mean accuracies of the top and bottom 10 percent of the recognition rates of the 36 speakers. The standard deviations reported in Table 3.9 show that using ISCN features leads to about 12% relative variance reduction compared with LDA features. However, the gap between the performances of the best and worst groups (i.e., 10% Max and 10% Min) reveals that ISCN was not so successful in reducing this gap.

Feature types   # Features   Accuracy rate     10% Max   10% Min
ISCN            148          75% ± 2.99%       98.00     36.50
LDA             90           69% ± 3.38%       96.00     29.00
PCA             34           67% ± 2.91%       89.77     35.74
DCT             62           70% ± 2.58%       89.30     40.37

Table 3.9: Digit recognition rates and their corresponding standard deviations for multiple feature sets.

This large gap between the maximum and minimum performances can be interpreted as the reliability of the V-ASR. As will be seen later, to have a successful audio-visual information fusion the V-ASR reliability should be improved. As mentioned, LDA selects a set of transformed features that has the largest ratio of between-class scatter to within-class scatter. Basically, LDA tends to maximize the separability by increasing the difference between the class-conditional mean values while keeping the within-class variances small. LDA is the optimal feature transformation method when the class-conditional distributions are Gaussian with similar covariance matrices and different mean values. As can be seen, the highest recognition rate that the LDA features can achieve is about 69% with 90 features.

The fact that LDA features cannot perform as well as ISCN features is an indication of the non-Gaussianity of the underlying distribution. Since only two Gaussian components were used in the GMMs to model the in-state feature distributions, it is not surprising that non-Gaussian distributions may not be accurately represented. Note that, considering the limited available data, increasing the number of Gaussian components leads to performance degradation due to excessive overfitting.

In the second set of experiments the performance of the COBRA-selected ISCN features was evaluated when a sequence of viseme models was used to describe a digit. A correspondence between visemes and phonemes and the visemic pronunciations of the digits are listed in Tables 3.10 and 3.11, respectively. As in the previous experiment, various numbers of features were used to train the visual speech recognizer. Figure 3.6 shows the recognition rates of different feature sets for the isolated digit recognition test. The ISCN features again outperform the other feature sets; however, the recognition rate is about 20% lower than that of the word-based V-ASR reported in Table 3.9. This can in fact be explained if the video data are reviewed.

Viseme   Visibility rank   Occurrence %   TIMIT phonemes
/A       1                 3.15           /f/ /v/
/B       2                 15.49          /er/ /ow/ /r/ /q/ /w/ /uh/ /uw/ /axr/ /ux/
/C       3                 5.88           /b/ /p/ /m/ /em/
/D       4                 0.70           /aw/
/E       5                 2.90           /dh/ /th/
/F       6                 1.20           /ch/ /jh/ /sh/ /zh/
/G       7                 1.81           /oy/ /ao/
/H       8                 4.36           /s/ /z/
/I       9                 31.46          /aa/ /ae/ /ah/ /ay/ /eh/ /ey/ /ih/ /iy/ /y/ /ao/ /ax-h/ /ax/ /ix/
/J       10                21.10          /d/ /l/ /n/ /t/ /el/ /nx/ /en/ /dx/
/K       11                4.84           /g/ /k/ /ng/ /eng/
/S       -                 -              /sil/

Table 3.10: The visemes and their corresponding phonemes suggested in [JB71] and evaluated in [CH23]. The TIMIT phonetic alphabet notation is used to describe the phonemes (see [Hie93] for mapping TIMIT to IPA). The visibility rank is an indication of the difficulty of recognition. The lower this number is, the more difficult it is to recognize a viseme.

Digit   Pronunciation     Digit   Pronunciation
zero    [H+I+B+B]         one     [B+I+J]
two     [J+B]             three   [E+B+I]
four    [A+G+B]           five    [A+I+A]
six     [H+I+K+H]         seven   [H+I+A+I+J]
eight   [I+J]             nine    [J+I+J]

Table 3.11: The viseme transcriptions of the digits.


Figure 3.6: Comparing ISCN features selected by COBRA with conventionally selected features in a visual digit recognition task with viseme models.

On many occasions, the visemes are not even pronounced (for instance, /t/ at the end of the digit eight is very frequently omitted) or, when pronounced, they appear in only one to three frames, which is too short for modeling their time dynamics. Figure 3.7 reports the confusion matrix of the digit classification when visemes were used to describe the digits.

The ground truth can be found in the rows of the confusion matrix while the columns are devoted to the predicted labels. Most confusions in digit recognition are caused by the phonemes of two words being mapped to the same viseme. However, it is also possible that different visemes have the same visual appearance depending on the context. This type of confusion is caused by a phenomenon commonly called co-articulation [CM93], where the mouth shape of a particular phoneme causes the following phonemes to have similar mouth shapes.

         zero     one      two      three    four     five     six      seven    eight    nine
zero     56.03%   0        13.79%   0        0.86%    0.86%    3.45%    25.00%   0        0
one      1.42%    58.87%   0.71%    8.51%    0.71%    14.18%   4.96%    2.84%    0        7.80%
two      23.08%   0        35.90%   0        2.56%    0        0        0        28.21%   10.26%
three    9.27%    12.64%   20.22%   39.33%   6.18%    3.37%    2.25%    1.69%    1.97%    3.09%
four     9.87%    10.53%   19.08%   5.59%    49.01%   2.63%    0.66%    0.99%    0.33%    1.32%
five     3.30%    3.85%    0.55%    0        0.55%    62.09%   2.75%    13.19%   0        13.74%
six      4.05%    2.89%    2.89%    0.58%    0        0        74.57%   2.31%    2.89%    9.83%
seven    4.71%    0        0        1.18%    0        2.35%    5.88%    85.88%   0        0
eight    4.88%    1.74%    3.14%    1.74%    1.39%    6.27%    2.79%    2.79%    50.87%   24.39%
nine     8.41%    1.87%    2.80%    2.80%    0        4.67%    10.28%   26.17%   8.41%    34.58%

Recognition rate = 53.0168%

Figure 3.7: Confusion matrix of digit recognition with visemes. Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class.

For instance, in the words eight and nine, the viseme /I is the dominant viseme governing the mouth shape. That is, in the presence of /I, the viseme /J in eight ([I + J]) and in nine ([J + I + J]) does not have a visible effect. Even though there is a drastic performance reduction compared with when word models are employed, the classification results with visemes are far from being random. Given better models for the visemes, it is plausible to achieve higher performance.

3.6 Conclusion and Discussion

In this chapter, a mutual information based feature selection algorithm was developed that can select a close-to-optimal subset of informative features. It was shown that, by using a semi-definite programming based search strategy, this problem can be seen as a variant of the maximum-cut problem in graph theory. This observation was the basis of our approximation ratio analysis. The approximation ratio of COBRA was derived and compared with the approximation ratio of the backward elimination method. It was experimentally shown that COBRA outperforms sequential search methods, especially in the case of sparse data.

Two series expansions for mutual information were presented and it was shown that most mutual information based score functions in the literature, including mRMR and MIFS, are truncated approximations of these expansions. Furthermore, the underlying connection between MIFS and the Kirkwood approximation was explored, and it was shown that by adopting the class-conditional independence assumption and the Kirkwood approximation for Pr(X), mutual information reduces to the MIFS criterion.

This algorithm was employed to select a subset of ISCN features. The resulting subset was used to train a GMM-HMM based speech recognizer. As shown in the experiments, using these features leads to a 5-6% absolute accuracy improvement over the commonly used features derived with PCA, LDA and DCT transformations. Using ISCN, however, does not solve the reliability problem associated with the V-ASR. If the V-ASR performance had an acceptable variance over different speakers, it could be directly fused with the audio recognizer by multiplying their posterior probabilities. However, due to its large inter-speaker variance, a more complex fusion scheme is required to properly weight the modalities according to their reliabilities. This issue is discussed in Chapter 7.

Chapter 4

Binary and Multiclass Classification

A learning machine or an inducer is an algorithm that induces a hypothesis describing the relation between features and labels from a training set. The main focus of this chapter is on classification problems; therefore, the terms learning machine and classifier are used interchangeably. In many applications simple classifiers such as linear SVM or GMM-based naive Bayes classifiers are sufficient to obtain a high level of accuracy. In lipreading, however, this is not the case due to the complex nature of the underlying random process. Most learning algorithms applied to this problem underfit the complex reality and result in lower-than-expected accuracy. To address this issue, a powerful class of learners based on boosting methods, which are known to be efficient1 learning algorithms, is introduced. In this approach, a large number of weak classifiers are generated and the final hypothesis is taken to be a convex combination of these weak classifiers. First, a framework for the binary classification setting is provided and it is shown that it has a natural generalization to the multiclass setting. The robustness of the proposed classifiers against overfitting and mislabeling noise is of great help in constructing robust voice activity detection and isolated word lipreading systems.

1 An algorithm is called efficient if (I) it can find the optimal hypothesis in time polynomial in the length of the input and the number of samples (this relates to the training phase) and (II) it can compute the value of the optimal hypothesis in time polynomial in the input dimensionality at the test phase. For instance, the K-nearest neighbors algorithm is not efficient, since its computational complexity at the test phase goes to infinity as the number of samples approaches infinity.

4.1 Introduction to Boosting Problem

A boosting algorithm can be seen as a meta-algorithm that maintains a distribution over the sample space. At each iteration a weak hypothesis is learned and the distribution is updated accordingly. The output (strong hypothesis) is a convex combination of the weak hypotheses. Two dominant views to describe and design boosting algorithms are "weak to strong learner" (WTSL), which is the original viewpoint presented in [Sch90, FS97], and boosting by "coordinate-wise gradient descent in the functional space" (CWGD) appearing in later works [Bre99, MBBF99, FHT98]. A boosting algorithm adhering to the first view guarantees that it only requires a finite number of iterations (equivalently, a finite number of weak hypotheses) to learn a $(1-\epsilon)$-accurate hypothesis. In contrast, an algorithm resulting from the CWGD viewpoint (usually called a potential booster) may not necessarily be a boosting algorithm in the probably approximately correct (PAC) learning sense. However, while it is rather difficult to construct a boosting algorithm based on the first view, the algorithmic frameworks, e.g., AnyBoost [MBBF99], resulting from the second viewpoint have proven to be particularly prolific for developing new boosting algorithms. Under the CWGD view, the choice of the convex loss function to be minimized is the cornerstone of designing a boosting algorithm. This, however, is a severe disadvantage in some applications.

In CWGD, the weights are not directly controllable (designable) and are only viewed as the values of the gradient of the loss function. In many applications, some characteristics of the desired distribution are known or given as problem requirements, while finding a loss function that generates such a distribution is likely to be difficult. For instance, what loss functions can generate sparse distributions?2 What family of loss functions results in a smooth distribution?3

We can even go further and imagine scenarios in which a loss function needs to put more weight on a given subset of examples than on others, either because that subset has more reliable labels or because it is a problem requirement to have a more accurate hypothesis for that part of the sample space. Then, what loss function can generate such a customized distribution? Moreover, does it result in a provable boosting algorithm? In general, how can we characterize the accuracy of the final hypothesis? Although, to be fair, the so-called loss function hunting approach has given rise to useful boosting algorithms such as LogitBoost, FilterBoost, GiniBoost and MADABoost [FHT98, BS08, Hat06, DW00] which (to some extent) answer some of the above questions, it is an inflexible and relatively unsuccessful approach to addressing boosting problems with distribution constraints.

Another approach to designing a boosting algorithm is to directly follow the WTSL viewpoint [Fre95, Ser03, BGL02]. The immediate advantages of such an approach are, first, that the resultant algorithms are provable boosting algorithms, i.e., they output a hypothesis of arbitrary accuracy. Second, the booster has direct control over the weights, making it more suitable for boosting problems subject to distribution constraints. However, since the WTSL view does not offer any algorithmic framework (as opposed to the CWGD view), it is rather difficult to come up with a distribution update mechanism resulting in a provable boosting algorithm. There are, however, a few useful, albeit fairly limited, algorithmic frameworks such as TotalBoost [WLR06] that can be used to derive other provable boosting algorithms. The TotalBoost algorithm can maximize the margin by iteratively solving a convex problem with the totally corrective constraint. A more general family of boosting algorithms was later proposed by Shalev-Shwartz et al. [SS08], where it was shown that weak learnability and linear separability are equivalent, a result following from von Neumann's minmax theorem. Using this theorem, they constructed a family of algorithms that maintain smooth distributions over the sample space, and are consequently noise tolerant. Their proposed algorithms find a $(1-\epsilon)$-accurate solution after performing at most $O(\log(N)/\epsilon^2)$ iterations, where $N$ is the number of training examples.

2 In the boosting terminology, sparsity usually refers to the greedy hypothesis-selection strategy of boosting methods in the functional space. However, sparsity in this chapter refers to the sparsity of the distribution (weights) over the sample space.
3 A smooth distribution is a distribution that does not put too much weight on any single sample; in other words, a distribution emulated by the booster does not dramatically diverge from the target distribution [Ser03, Gav03].

4.2 Our Results

A family of boosting algorithms is presented that can be derived from well-known online learning algorithms, including projected gradient descent

[Zin03] and its generalization, mirror descent (both active and lazy updates, see [Haz09]), and composite objective mirror descent (COMID) [DSST10]. The PAC learnability of the algorithms derived from this framework is proven and it is shown that this framework in fact generates maximum margin algorithms. That is, given a desired accuracy level $\nu$, it outputs a hypothesis of margin $\gamma_{\min} - \nu$, with $\gamma_{\min}$ being the minimum edge that the weak classifier guarantees to return.

The duality between (linear) online learning and boosting is by no means new. In online learning, at round $t$ the learner receives a new unlabeled sample point and is required to predict its label. After predicting a label, the correct label is revealed and the learner suffers some amount of loss (dependent on the loss function). In boosting, however, at each time instance $t$, all samples are available to the booster. The booster selects a weak learner from the hypothesis space and suffers some amount of loss due to the performance of the selected weak learner. Both of these methods can be seen as zero-sum games with very similar formulations. The duality between these two methods was first pointed out in [FS97] and was later elaborated and formalized using von Neumann's minmax theorem [FS96b]. Following this line, several proof techniques required to show the PAC learnability of the derived boosting algorithms are provided. These techniques are fairly versatile and can be used to translate many other online learning methods into our boosting framework. To motivate our boosting framework, two practically and theoretically interesting algorithms are derived:

• The SparseBoost algorithm, which tries to reduce the space and computation complexity by maintaining a sparse distribution over the sample space. In fact this problem, i.e., applying batch boosting on successive subsets of data when there is not sufficient memory to store an entire dataset, was first discussed by Breiman in [Bre97], though no algorithm with a theoretical guarantee was suggested. SparseBoost is the first provable batch booster that can (partially) address this problem. By analyzing this algorithm, it is shown that the tuning parameter of the $\ell_1$ regularization term at each round $t$ should not exceed $\frac{\gamma_t}{2}\eta_t$ in order to still have a boosting algorithm, where $\eta_t$ is the coefficient of the $t$th weak hypothesis and $\gamma_t$ is its edge. Exploiting sparsity in the example domain has also been investigated in [HT09] by Hatano and Takimoto.

• A smooth boosting algorithm that requires only $O(\log 1/\epsilon)$ rounds to learn a $(1-\epsilon)$-accurate hypothesis. This algorithm can

also be seen as an agnostic boosting algorithm4 due to the fact that smooth distributions provide a theoretical guarantee for noise tolerance in various noisy learning settings, such as agnostic boosting [KK09, BLM01].

Furthermore, an interesting theoretical result about MADABoost [DW00] is provided. A proof (to the best of our knowledge the only available unconditional proof) of the boosting property of (a variant of) MADABoost is given here, and it is shown that, unlike the common presumption, its convergence rate is of $O(1/\epsilon^2)$ rather than $O(1/\epsilon)$.

The MABoost framework is generalized to the multiclass setting and it is shown that it adopts the minimal weak-learning condition introduced in [MS13]. That is, it imposes minimal conditions on the weak learner space to drive the training error to zero. The ADABoost.MM algorithm presented in [MS13] can also be derived from our framework by using the Kullback-Leibler (KL) divergence for the generic Bregman divergence in the MABoost framework.

4.3 Fundamentals

First of all, the notation used throughout the chapter is established. Vectors are lower-case bold letters and their entries are non-bold letters with subscripts, such as $x_i$ of $\mathbf x$, or non-bold letters with superscripts if the vector already has a subscript, such as $x_t^i$ of $\mathbf x_t$. Matrices are upper-case bold letters and their entries are shown by upper-case non-bold letters with subscripts. Moreover, the $N$-dimensional probability simplex is denoted by $\mathcal S = \{\mathbf w \mid \sum_{i=1}^N w_i = 1,\ w_i \ge 0\}$.

Since a central notion throughout this chapter is that of Bregman divergences, some of their properties are briefly revisited. A Bregman divergence is defined with respect to a convex function $\mathcal R$ as

$$B_{\mathcal R}(\mathbf x, \mathbf y) = \mathcal R(\mathbf x) - \mathcal R(\mathbf y) - \nabla\mathcal R(\mathbf y)^\top(\mathbf x - \mathbf y) \qquad (4.1)$$
and can be interpreted as a distance measure between $\mathbf x$ and $\mathbf y$. Due to the convexity of $\mathcal R$, a Bregman divergence is always non-negative, i.e., $B_{\mathcal R}(\mathbf x, \mathbf y) \ge 0$.

4 Unlike the PAC model, the agnostic learning model allows an arbitrary target function (labeling function) that may not belong to the class studied, and hence can be viewed as a noise-tolerant learning model [KSS92].

In this work, $\mathcal R$ is considered to be a $\beta$-strongly convex function5 with respect to a norm $\|\cdot\|$. With this choice of $\mathcal R$, the Bregman divergence satisfies $B_{\mathcal R}(\mathbf x, \mathbf y) \ge \frac{\beta}{2}\|\mathbf x - \mathbf y\|^2$. As an example, if $\mathcal R(\mathbf x) = \frac12\mathbf x^\top\mathbf x$ (which is 1-strongly convex with respect to $\|\cdot\|_2$), then $B_{\mathcal R}(\mathbf x, \mathbf y) = \frac12\|\mathbf x - \mathbf y\|_2^2$ is the square of the Euclidean distance. Another example is the negative entropy function $\mathcal R(\mathbf x) = \sum_{i=1}^N x_i\log x_i$ (resulting in the KL-divergence), which is known to be 1-strongly convex over the probability simplex with respect to the $\ell_1$ norm.

The Bregman projection is another fundamental concept of our framework.

Definition 4.1 (Bregman projection). The Bregman projection of a vector $\mathbf y$ onto a convex set $\mathcal S$ with respect to a Bregman divergence $B_{\mathcal R}$ is
$$\Pi_{\mathcal S}(\mathbf y) = \arg\min_{\mathbf x \in \mathcal S} B_{\mathcal R}(\mathbf x, \mathbf y) \qquad (4.2)$$

Moreover, the following generalized Pythagorean theorem holds for Bregman projections.

Lemma 4.2 (Generalized Pythagorean; see also [CBL06, Lemma 11.3]). Given a point $\mathbf y \in \mathbb R^N$, a convex set $\mathcal S$ and $\hat{\mathbf y} = \Pi_{\mathcal S}(\mathbf y)$ the Bregman projection of $\mathbf y$ onto $\mathcal S$, for all $\mathbf x \in \mathcal S$ we have
$$\text{Exact:}\quad B_{\mathcal R}(\mathbf x, \mathbf y) \ge B_{\mathcal R}(\mathbf x, \hat{\mathbf y}) + B_{\mathcal R}(\hat{\mathbf y}, \mathbf y) \qquad (4.3)$$
$$\text{Relaxed:}\quad B_{\mathcal R}(\mathbf x, \mathbf y) \ge B_{\mathcal R}(\mathbf x, \hat{\mathbf y}) \qquad (4.4)$$
The relaxed version follows from the fact that $B_{\mathcal R}(\hat{\mathbf y}, \mathbf y) \ge 0$ and can thus be ignored.
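To make the two running examples concrete, the following small sketch (added here for illustration, not part of the thesis) evaluates the Bregman divergences induced by the quadratic potential and by negative entropy; the generalized KL form is used so that the expression is valid for arbitrary positive vectors.

```python
import numpy as np

def bregman_quadratic(x, y):
    # B_R(x, y) for R(x) = 0.5 * x^T x: half the squared Euclidean distance.
    return 0.5 * np.sum((x - y) ** 2)

def bregman_negentropy(x, y):
    # B_R(x, y) for R(x) = sum_i x_i log x_i: the (generalized) KL divergence,
    # which reduces to the usual KL divergence on the probability simplex.
    return np.sum(x * np.log(x / y) - x + y)

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.4, 0.4, 0.2])
print(bregman_quadratic(x, y), bregman_negentropy(x, y))  # both are non-negative
```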

Lemma 4.3. For any vectors x, y, z, we have

$$(\mathbf x - \mathbf y)^\top\big(\nabla\mathcal R(\mathbf z) - \nabla\mathcal R(\mathbf y)\big) = B_{\mathcal R}(\mathbf x, \mathbf y) - B_{\mathcal R}(\mathbf x, \mathbf z) + B_{\mathcal R}(\mathbf y, \mathbf z) \qquad (4.5)$$
The above lemma follows directly from the Bregman divergence definition in (4.1). Additionally, the following definitions from convex analysis are useful throughout the chapter.

5 That is, its second derivative (Hessian in higher dimensions) is bounded away from zero by at least $\beta$.

Definition 4.4 (Norm & dual norm). Let $\|\cdot\|_A$ be a norm. Then its dual norm is defined as

$$\|\mathbf y\|_A^* = \sup\{\mathbf y^\top\mathbf x \mid \|\mathbf x\|_A \le 1\} \qquad (4.6)$$

For instance, the dual norm of $\|\cdot\|_2 = \ell_2$ is $\|\cdot\|_2^* = \ell_2$, and the dual norm of $\ell_1$ is the $\ell_\infty$ norm. Further,

Lemma 4.5. For any vectors $\mathbf x$, $\mathbf y$ and any norm $\|\cdot\|_A$, the following inequality holds:

$$\mathbf x^\top\mathbf y \le \|\mathbf x\|_A\,\|\mathbf y\|_A^* \le \frac12\|\mathbf x\|_A^2 + \frac12\|\mathbf y\|_A^{*\,2} \qquad (4.7)$$

The proof directly follows from Hölder's inequality. Throughout this chapter, the shorthands $\|\cdot\|_A = \|\cdot\|$ and $\|\cdot\|_A^* = \|\cdot\|_*$ are used for the norm and its dual, respectively.
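As a quick numerical sanity check of inequality (4.7), added here as an illustration and not part of the thesis, the $\ell_1$ norm and its dual $\ell_\infty$ norm can be paired as follows; the chosen vectors are arbitrary.

```python
import numpy as np

x = np.array([1.0, -2.0, 0.5])
y = np.array([0.3, 0.1, -0.7])
lhs = x.dot(y)
# For the l1 norm, the dual norm is l-infinity; verify (4.7) for this pair.
mid = np.linalg.norm(x, 1) * np.linalg.norm(y, np.inf)
rhs = 0.5 * np.linalg.norm(x, 1) ** 2 + 0.5 * np.linalg.norm(y, np.inf) ** 2
assert lhs <= mid <= rhs
```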

4.4 Boosting Framework

Let $\{(\mathbf x_i, a_i)\}$, $1 \le i \le N$, be $N$ training samples, where $\mathbf x_i \in \mathcal X$ and $a_i \in \{-1, +1\}$. Assume $h \in \mathcal H$ is a real-valued function mapping $\mathcal X$ into $[-1, 1]$. Let us denote a distribution over the training data by $\mathbf w = [w_1, \ldots, w_N]^\top$ and define a loss vector $\mathbf d = [-a_1 h(\mathbf x_1), \ldots, -a_N h(\mathbf x_N)]^\top$. We define $\gamma = -\mathbf w^\top\mathbf d$ as the edge of the hypothesis $h$ under the distribution $\mathbf w$, and it is assumed to be positive when $h$ is returned by a weak learner. In this work, the branching-program-based boosters introduced in [MM02] are not considered and we adhere to the typical boosting protocol (described in Section 4.1).

Let $\mathcal R(\mathbf x)$ be a 1-strongly convex function with respect to a norm $\|\cdot\|$ and denote its associated Bregman divergence by $B_{\mathcal R}$. Moreover, let the dual norm of a loss vector $\mathbf d_t$ be upper bounded, i.e., $\|\mathbf d_t\|_* \le L$. The following mirror ascent boosting (MABoost) algorithm is our boosting framework. It is easy to verify that for $\mathbf d_t$ as defined in MABoost, $L = 1$ when $\|\cdot\|_* = \ell_\infty$ and $L = N$ when $\|\cdot\|_* = \ell_2$.

Algorithm 4.1: Mirror Ascent Boosting (MABoost)
Input: $\mathcal R(\mathbf x)$ a 1-strongly convex function, $\mathbf w_1 = [\frac1N, \ldots, \frac1N]^\top$ and $\mathbf z_1 = [\frac1N, \ldots, \frac1N]^\top$
For $t = 1, \ldots, T$ do
 (a) Train classifier with $\mathbf w_t$ and get $h_t$; let $\mathbf d_t = [-a_1 h_t(\mathbf x_1), \ldots, -a_N h_t(\mathbf x_N)]^\top$ and $\gamma_t = -\mathbf w_t^\top\mathbf d_t$.
 (b) Set $\eta_t = \frac{\gamma_t}{L}$.
 (c) Update weights: $\nabla\mathcal R(\mathbf z_{t+1}) = \nabla\mathcal R(\mathbf z_t) + \eta_t\mathbf d_t$ (lazy update)
  $\nabla\mathcal R(\mathbf z_{t+1}) = \nabla\mathcal R(\mathbf w_t) + \eta_t\mathbf d_t$ (active update)
 (d) Project onto $\mathcal S$: $\mathbf w_{t+1} = \arg\min_{\mathbf w \in \mathcal S} B_{\mathcal R}(\mathbf w, \mathbf z_{t+1})$
End
Output: The final hypothesis $f(\mathbf x) = \mathrm{sign}\big(\sum_{t=1}^T \eta_t h_t(\mathbf x)\big)$.

This algorithm is a variant of the mirror descent algorithm [Haz09], modified to work as a boosting algorithm. The basic principle in this algorithm is quite clear. As in ADABoost, the weight of a wrongly (correctly) classified sample increases (decreases). The weight vector is then projected onto the probability simplex in order to keep the weight sum equal to 1. The distinction between the active and lazy update versions, and the fact that the algorithm may behave quite differently under different update strategies, should be emphasized. In the lazy update version, the norm of the auxiliary variable $\mathbf z_t$ is unbounded. In the active update version, on the other hand, $\mathbf z$ is bounded and does not drift far away from $\mathbf w$. However, unlike the lazy version, in the active update the algorithm always needs to access (compute) the previous projected weight $\mathbf w_t$ to update the weight at round $t$, and this may not be possible in some applications (such as boosting-by-filtering [DW00]).
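To make the update concrete, the following is a minimal NumPy/scikit-learn sketch (an illustration under assumptions, not the thesis implementation) of the active-update MABoost with the quadratic potential $\mathcal R(\mathbf w) = \frac12\|\mathbf w\|_2^2$: the gradient map is then the identity and the Bregman projection reduces to the Euclidean projection onto the simplex. Decision stumps stand in for the weak learner, $L = N$ follows the dual-norm bound stated in the text for $\ell_2$, and all function names are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def project_to_simplex(z):
    # Euclidean projection onto the probability simplex; this realizes the
    # Bregman projection for the quadratic potential R(w) = 0.5 * ||w||^2.
    u = np.sort(z)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(z) + 1) > (css - 1))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(z - theta, 0.0)

def maboost_quadratic(X, a, T=100, max_depth=1):
    # Active-update MABoost with the quadratic potential; labels a in {-1,+1}.
    N = X.shape[0]
    L = N                                  # dual-norm bound used in the text for l2
    w = np.full(N, 1.0 / N)
    hypotheses, etas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=max_depth).fit(X, a, sample_weight=w)
        pred = h.predict(X)
        d = -a * pred                      # loss vector d_i = -a_i * h(x_i)
        gamma = -w.dot(d)                  # edge of the weak hypothesis
        if gamma <= 0:
            break                          # weak-learning assumption violated
        eta = gamma / L
        z = w + eta * d                    # gradient step (identity map for quadratic R)
        w = project_to_simplex(z)          # Bregman projection onto the simplex
        hypotheses.append(h)
        etas.append(eta)
    def f(Xq):
        score = sum(eta * h.predict(Xq) for eta, h in zip(etas, hypotheses))
        return np.sign(score)
    return f
```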

Theorem 4.6. Suppose that MABoost generates weak hypotheses $h_1, \ldots, h_T$

whose edges are $\gamma_1, \ldots, \gamma_T$. Then the error $\epsilon$ of the combined hypothesis $f$ on the training set is bounded as:
$$\mathcal R(\mathbf w) = \tfrac12\|\mathbf w\|_2^2: \qquad \epsilon \le \frac{1}{1 + \sum_{t=1}^T \gamma_t^2} \qquad (4.8)$$
$$\mathcal R(\mathbf w) = \sum_{i=1}^N w_i\log w_i: \qquad \epsilon \le e^{-\sum_{t=1}^T \frac12\gamma_t^2} \qquad (4.9)$$

In fact, the first bound (4.8) holds for any 1-strongly convex $\mathcal R$, though for some $\mathcal R$ (e.g., negative entropy) the much tighter bound in (4.9) can be achieved.

Proof: Assume $\mathbf w^* = [w_1^*, \ldots, w_N^*]^\top$ is a distribution vector where $w_i^* = \frac{1}{N\epsilon}$ if $f(\mathbf x_i) \ne a_i$, and 0 otherwise. $\mathbf w^*$ can be seen as a uniform distribution over the samples wrongly classified by the ensemble hypothesis $f$. Using this vector and following the approach in [Haz09], an upper bound on $\sum_{t=1}^T \eta_t(\mathbf w^{*\top}\mathbf d_t - \mathbf w_t^\top\mathbf d_t)$ is derived, where $\mathbf d_t = [d_t^1, \ldots, d_t^N]$ is a loss vector as defined in Algorithm 4.1.

$$(\mathbf w^* - \mathbf w_t)^\top\eta_t\mathbf d_t = (\mathbf w^* - \mathbf w_t)^\top\big(\nabla\mathcal R(\mathbf z_{t+1}) - \nabla\mathcal R(\mathbf w_t)\big) \qquad (4.10a)$$
$$= B_{\mathcal R}(\mathbf w^*, \mathbf w_t) - B_{\mathcal R}(\mathbf w^*, \mathbf z_{t+1}) + B_{\mathcal R}(\mathbf w_t, \mathbf z_{t+1}) \qquad (4.10b)$$

$$\le B_{\mathcal R}(\mathbf w^*, \mathbf w_t) - B_{\mathcal R}(\mathbf w^*, \mathbf w_{t+1}) + B_{\mathcal R}(\mathbf w_t, \mathbf z_{t+1}) \qquad (4.10c)$$
where the first equation follows from Lemma 4.3 and inequality (4.10c) results from the relaxed version of Lemma 4.2. Note that Lemma 4.2 can be applied here because $\mathbf w^* \in \mathcal S$. Further, the $B_{\mathcal R}(\mathbf w_t, \mathbf z_{t+1})$ term is bounded. By applying Lemma 4.5,

$$B_{\mathcal R}(\mathbf w_t, \mathbf z_{t+1}) + B_{\mathcal R}(\mathbf z_{t+1}, \mathbf w_t) = (\mathbf z_{t+1} - \mathbf w_t)^\top\eta_t\mathbf d_t \le \frac12\|\mathbf z_{t+1} - \mathbf w_t\|^2 + \frac12\eta_t^2\|\mathbf d_t\|_*^2 \qquad (4.11)$$
and since $B_{\mathcal R}(\mathbf z_{t+1}, \mathbf w_t) \ge \frac12\|\mathbf z_{t+1} - \mathbf w_t\|^2$ due to the 1-strong convexity of $\mathcal R$, we have
$$B_{\mathcal R}(\mathbf w_t, \mathbf z_{t+1}) \le \frac12\eta_t^2\|\mathbf d_t\|_*^2 \qquad (4.12)$$

Now, substituting (4.12) into (4.10c) and summing it up from t = 1 to T yields

$$\sum_{t=1}^T \mathbf w^{*\top}\eta_t\mathbf d_t - \sum_{t=1}^T \mathbf w_t^\top\eta_t\mathbf d_t \le \sum_{t=1}^T \frac12\eta_t^2\|\mathbf d_t\|_*^2 + B_{\mathcal R}(\mathbf w^*, \mathbf w_1) - B_{\mathcal R}(\mathbf w^*, \mathbf w_{T+1}) \qquad (4.13)$$

Moreover, it is evident from the algorithm description that for wrongly classified samples

$$-a_i f(\mathbf x_i) = -a_i\,\mathrm{sign}\Big(\sum_{t=1}^T \eta_t h_t(\mathbf x_i)\Big) = \mathrm{sign}\Big(\sum_{t=1}^T \eta_t d_t^i\Big) \ge 0 \qquad \forall\,\mathbf x_i \in \{\mathbf x \mid f(\mathbf x_i) \ne a_i\} \qquad (4.14)$$
Following (4.14), the first term in (4.13) is $\mathbf w^{*\top}\sum_{t=1}^T \eta_t\mathbf d_t \ge 0$ and can thus be ignored. Moreover, by the definition of $\gamma_t$, the second term is $-\sum_{t=1}^T \mathbf w_t^\top\eta_t\mathbf d_t = \sum_{t=1}^T \eta_t\gamma_t$. Putting all this together, ignoring the last term in (4.13) and replacing $\|\mathbf d_t\|_*^2$ with its upper bound $L$, yields
$$-B_{\mathcal R}(\mathbf w^*, \mathbf w_1) \le \sum_{t=1}^T \tfrac12 L\,\eta_t^2 - \sum_{t=1}^T \eta_t\gamma_t \qquad (4.15)$$

Replacing the left side with $-B_{\mathcal R} = -\frac12\|\mathbf w^* - \mathbf w_1\|^2 = \frac{\epsilon - 1}{2N\epsilon}$ for the case of quadratic $\mathcal R$, and with $-B_{\mathcal R} = \log(\epsilon)$ when $\mathcal R$ is the negative entropy function, then taking the derivative with respect to $\eta_t$ and equating it to zero (which yields $\eta_t = \frac{\gamma_t}{L}$), the error bounds in (4.8) and (4.9) are achieved. Note that in the case of $\mathcal R$ being the negative entropy function, Algorithm 4.1 degenerates into ADABoost with a different choice of $\eta_t$.

Before continuing our discussion, it is important to mention that the cornerstone of the proof is the choice of $\mathbf w^*$. For instance, a different choice of $\mathbf w^*$ results in the following maximum margin theorem.

Theorem 4.7 (Maximum margin property of MABoost). Setting $\eta_t = \frac{\gamma_t}{L\sqrt t}$, MABoost outputs a hypothesis of margin at least $\gamma_{\min} - \nu$, where $\nu$ is a desired accuracy level and tends to zero in $O(\frac{\log T}{\sqrt T})$ rounds of boosting. See Appendix A.2 for the proof.

Observations: Two observations follow immediately from the proof of Theorem 4.6. First, the requirement for using Lemma 4.2 is $\mathbf w^* \in \mathcal S$, so in the case of projecting onto a smaller convex set $\mathcal S_k \subseteq \mathcal S$, the proof remains intact as long as $\mathbf w^* \in \mathcal S_k$ holds. Second, only the relaxed version of Lemma 4.2 is required in the proof (to obtain inequality (4.10c)). Hence, if there is an approximate projection operator $\hat\Pi_{\mathcal S}$ that satisfies the inequality $B_{\mathcal R}(\mathbf w^*, \mathbf z_{t+1}) \ge B_{\mathcal R}\big(\mathbf w^*, \hat\Pi_{\mathcal S}(\mathbf z_{t+1})\big)$, it can be substituted for the exact projection operator $\Pi_{\mathcal S}$ and the active update version of the algorithm still works. A practical approximate projection operator of this type can be obtained through a double-projection strategy (see Appendix A.3 for the proof).

Lemma 4.8. Consider the convex sets $\mathcal K$ and $\mathcal S$, where $\mathcal S \subseteq \mathcal K$. Then for any $\mathbf x \in \mathcal S$ and $\mathbf y \in \mathbb R^N$, $\hat\Pi_{\mathcal S}(\mathbf y) = \Pi_{\mathcal S}\big(\Pi_{\mathcal K}(\mathbf y)\big)$ is an approximate projection that satisfies $B_{\mathcal R}(\mathbf x, \mathbf y) \ge B_{\mathcal R}\big(\mathbf x, \hat\Pi_{\mathcal S}(\mathbf y)\big)$.

The above observations are employed to generalize Algorithm 4.1. However, it is important to emphasize that the approximate Bregman projection is only valid for the active update version of MABoost, due to the fact that the relaxed version of Lemma 4.2 used in the proof of Theorem 4.6 is only valid for the active update version.

4.4.1 Sparse Boosting

Let $\mathcal R(\mathbf w) = \frac12\|\mathbf w\|_2^2$. Since in this case the projection onto the probability simplex is in fact an $\ell_1$-constrained optimization problem, it is plausible that some of the weights are zero (a sparse distribution), which is already a useful observation. To promote the sparsity of the weight vector, it is desirable to directly regularize the projection with the $\ell_1$ norm, i.e., to add $\|\mathbf w\|_1$ to the objective function in the projection step. This is, however, not possible in MABoost, since $\|\mathbf w\|_1$ is trivially constant on the probability simplex. Therefore, the projection step is split into two consecutive steps. The first projection is onto $\mathbb R_+^N = \{\mathbf y \mid 0 \le y_i\}$.

Surprisingly, the projection onto $\mathbb R_+^N$ implicitly regularizes the weights of the correctly classified samples with a weighted $\ell_1$ norm term. This point is clearly shown in the proof of Theorem 4.9 presented in Appendix A.5. To further enhance sparsity, we may introduce an explicit $\ell_1$ norm regularization term into the projection step with a regularization factor denoted by $\alpha_t\eta_t$. The solution of the projection step is then normalized to get a feasible point on the

probability simplex. This algorithm is listed in Algorithm 4.2. $\alpha_t\eta_t$ is the regularization factor of the explicit $\ell_1$ norm at round $t$. Note that the dominant regularization factor is $\eta_t d_t^i$, which only pushes the weights of the correctly classified samples to zero, i.e., when $d_t^i < 0$. This becomes evident by substituting the update step into the projection step for $\mathbf z_{t+1}$.

Algorithm 4.2: SparseBoost
Input: $\mathcal R(\mathbf w) = \frac12\|\mathbf w\|_2^2$, $\mathbb R_+^N = \{\mathbf y \mid 0 \le y_i\}$, $\mathbf y_1 = [\frac1N, \ldots, \frac1N]^\top$
For $t = 1, \ldots, T$ do
 (a) Train classifier with $\mathbf w_t$ and get $h_t$.
 (b) Set $\big(\eta_t = \frac{\gamma_t\|\mathbf y_t\|_1}{N},\ \alpha_t = 0\big)$ or $\big(\eta_t = \frac{\gamma_t\|\mathbf y_t\|_1}{2N},\ \alpha_t = \frac12\gamma_t\|\mathbf y_t\|_1\big)$.
 (c) $\mathbf z_{t+1} = \mathbf y_t + \eta_t\mathbf d_t$.
 (d) $\mathbf y_{t+1} = \arg\min_{\mathbf y \in \mathbb R_+^N} \frac12\|\mathbf y - \mathbf z_{t+1}\|^2 + \alpha_t\eta_t\|\mathbf y\|_1$.
 (e) $\mathbf w_{t+1} = \frac{\mathbf y_{t+1}}{\sum_{i=1}^N y_{t+1}^i}$.
End
Output: The final hypothesis $f(\mathbf x) = \mathrm{sign}\big(\sum_{t=1}^T \eta_t h_t(\mathbf x)\big)$.
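The projection in steps (d) and (e) has a closed form: the minimizer over the positive orthant with an $\ell_1$ penalty is a componentwise soft-thresholding, followed by an $\ell_1$ normalization. Below is a minimal NumPy sketch of one SparseBoost weight update under that reading of Algorithm 4.2; the function name and the use_l1 switch are illustrative only, not the thesis code.

```python
import numpy as np

def sparseboost_weight_update(y_t, d_t, gamma_t, use_l1=True):
    # One weight update of Algorithm 4.2 (steps (b)-(e)), as a sketch.
    N = len(y_t)
    l1 = np.sum(y_t)                        # ||y_t||_1, since y_t is non-negative
    if use_l1:
        eta = gamma_t * l1 / (2 * N)
        alpha = 0.5 * gamma_t * l1
    else:
        eta = gamma_t * l1 / N
        alpha = 0.0
    z = y_t + eta * d_t
    # argmin_{y >= 0} 0.5*||y - z||^2 + alpha*eta*||y||_1  ==  soft-thresholding
    y_next = np.maximum(z - alpha * eta, 0.0)
    w_next = y_next / np.sum(y_next)        # normalize back onto the simplex
    return y_next, w_next
```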

For simplicity, two cases are considered: $\alpha_t = \min(1, \frac12\gamma_t\|\mathbf y_t\|_1)$ and $\alpha_t = 0$. The training error bound is given by the following theorem (proof in Appendix A.5):

Theorem 4.9. Suppose that SparseBoost generates weak hypotheses $h_1, \ldots, h_T$ whose edges are $\gamma_1, \ldots, \gamma_T$. Then the error $\epsilon$ of the combined hypothesis $f$ on the training set is bounded as follows:
$$\epsilon \le \frac{1}{1 + c\sum_{t=1}^T \gamma_t^2\|\mathbf y_t\|_1^2} \qquad (4.16)$$
Note that this bound holds for any choice of $\alpha_t \in [0, \min(1, \gamma_t\|\mathbf y_t\|_1)]$. In particular, in our two cases the constant $c$ is 1 for $\alpha_t = 0$, and $\frac14$ when $\alpha_t = \min(1, \frac12\gamma_t\|\mathbf y_t\|_1)$.

For $\alpha_t = 0$, the $\ell_1$ norm of the weights $\|\mathbf y_t\|_1$ can be bounded away from zero by $\frac1N$ (see Appendix A.5). Thus, the error $\epsilon$ tends to zero as $O(\frac{N^2}{\gamma^2 T})$. That is, in this case SparseBoost is a provable boosting algorithm. However, for $\alpha_t \ne 0$, the $\ell_1$ norm $\|\mathbf y_t\|_1$ may rapidly go to zero, and consequently the upper bound of the training error in (4.16) does not vanish as $T$ increases.

In this case, it is not possible to conclude that the algorithm is in fact a boosting algorithm6. It is noteworthy that SparseBoost can be seen as a variant of the COMID algorithm in [DSST10].

6 Nevertheless, for some choices of $\alpha_t \ne 0$, such as $\alpha_t \propto \frac{1}{t^2}$, the boosting property of the algorithm is still provable.

4.4.2 Smooth Boosting

Let $k > 0$ be a smoothness parameter. A distribution $\mathbf w$ is smooth with respect to a given distribution $\mathbf D$ if $w_i \le k D_i$ for all $1 \le i \le N$. Here, smoothness with respect to the uniform distribution, i.e., $D_i = \frac1N$, is considered. Then, given a desired smoothness parameter $k$, a boosting algorithm is required that only constructs distributions $\mathbf w$ such that $w_i \le \frac kN$, while guaranteeing to output a $(1-\frac1k)$-accurate hypothesis. To this end, it is only required to replace the probability simplex $\mathcal S$ with $\mathcal S_k = \{\mathbf w \mid \sum_{i=1}^N w_i = 1,\ 0 \le w_i \le \frac kN\}$ in MABoost to obtain a smooth-distribution boosting algorithm, called smooth-MABoost. That is, the update rule is: $\mathbf w_{t+1} = \arg\min_{\mathbf w \in \mathcal S_k} B_{\mathcal R}(\mathbf w, \mathbf z_{t+1})$.

Note that the proof of Theorem 4.6 holds for smooth-MABoost as well. As long as $\epsilon \ge \frac1k$, the error distribution $\mathbf w^*$ ($w_i^* = \frac{1}{N\epsilon}$ if $f(\mathbf x_i) \ne a_i$, and 0 otherwise) is in $\mathcal S_k$ because $\frac{1}{N\epsilon} \le \frac kN$. Thus, based on the first observation, the error bounds achieved in Theorem 4.6 hold for $\epsilon \ge \frac1k$. In particular, $\epsilon = \frac1k$ is reached after a finite number of iterations. This projection problem has already appeared in the literature. An entropic projection algorithm ($\mathcal R$ is the negative entropy), for instance, was proposed in [SS08]. Using the negative entropy and the projection algorithm suggested in [SS08] results in a fast smooth boosting algorithm with the following convergence rate.

Theorem 4.10. Given $\mathcal R(\mathbf w) = \sum_{i=1}^N w_i\log w_i$ and a desired $\epsilon$, smooth-MABoost finds a $(1-\epsilon)$-accurate hypothesis in $O(\log(\frac1\epsilon)/\gamma^2)$ iterations.
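The thesis uses the entropic projection of [SS08] for the step onto the capped simplex $\mathcal S_k$; purely as an illustration (my own sketch, not the thesis method), the following computes the analogous Euclidean projection onto $\{\mathbf w : \sum_i w_i = 1,\ 0 \le w_i \le \text{cap}\}$ by bisection on the shift $\theta$, which is what a quadratic-potential variant of smooth-MABoost would need.

```python
import numpy as np

def project_capped_simplex(z, cap, iters=100):
    # Euclidean projection onto {w : sum(w) = 1, 0 <= w_i <= cap}, assuming
    # cap * len(z) >= 1 so the set is non-empty. Bisection on theta such that
    # sum(clip(z - theta, 0, cap)) = 1; the sum is monotone in theta.
    lo, hi = z.min() - 1.0, z.max()
    theta = 0.5 * (lo + hi)
    for _ in range(iters):
        theta = 0.5 * (lo + hi)
        s = np.clip(z - theta, 0.0, cap).sum()
        if s > 1.0:
            lo = theta
        else:
            hi = theta
    return np.clip(z - theta, 0.0, cap)
```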

4.4.3 MABoost for Combining Datasets (CD-MABoost)

Let us assume we have two sets of data: a primary dataset $\mathcal A$ and a secondary dataset $\mathcal B$. The goal is to train a classifier that achieves $(1-\epsilon_{\mathcal A})$ accuracy on $\mathcal A$ while limiting the error on dataset $\mathcal B$ to $\epsilon_{\mathcal B} \le \frac1k$. This scenario has many potential applications, including (I) transfer learning [DYXY07] and (II) a weighted combination of datasets based on their noise level, or emphasizing a particular region of the sample space as a problem requirement (e.g., a medical diagnostic test that should make as few wrong diagnoses as possible when the sample is a pregnant woman). To address this problem, it is only required to replace $\mathcal S$ in MABoost with $\mathcal S_c = \{\mathbf w \mid \sum_{i=1}^N w_i = 1,\ 0 \le w_i\ \forall i \in \mathcal A\ \wedge\ 0 \le w_i \le \frac kN\ \forall i \in \mathcal B\}$, where $i \in \mathcal A$ and $i \in \mathcal B$ are shorthands for the indices of samples in $\mathcal A$ and in $\mathcal B$, respectively. By generating smooth distributions on $\mathcal B$, this algorithm limits the weight of the secondary dataset, which intuitively results in limiting its effect on the final hypothesis. The proof of its boosting property is quite similar to Theorem 4.6 (see Appendix A.4).

4.4.4 Lazy Update Boosting

In this section, the proof for the lazy update version of MABoost in Theorem 4.6 is presented. Moreover, it is shown that MADABoost [DW00] can be presented as a variant of the lazy update MABoost. This gives a simple proof for MADABoost without making the assumption that the edge sequence is monotonically decreasing (as in [DW00]).

Proof: Assume $\mathbf w^* = [w_1^*, \ldots, w_N^*]^\top$ is a distribution vector where $w_i^* = \frac{1}{N\epsilon}$ if $f(\mathbf x_i) \ne a_i$, and 0 otherwise. Then,

$$\begin{aligned}
(\mathbf w^* - \mathbf w_t)^\top\eta_t\mathbf d_t &= (\mathbf w_{t+1} - \mathbf w_t)^\top\big(\nabla\mathcal R(\mathbf z_{t+1}) - \nabla\mathcal R(\mathbf z_t)\big) \\
&\quad + (\mathbf z_{t+1} - \mathbf w_{t+1})^\top\big(\nabla\mathcal R(\mathbf z_{t+1}) - \nabla\mathcal R(\mathbf z_t)\big) \\
&\quad + (\mathbf w^* - \mathbf z_{t+1})^\top\big(\nabla\mathcal R(\mathbf z_{t+1}) - \nabla\mathcal R(\mathbf z_t)\big) \\
&\le \tfrac12\|\mathbf w_{t+1} - \mathbf w_t\|^2 + \tfrac12\eta_t^2\|\mathbf d_t\|_*^2 + B_{\mathcal R}(\mathbf w_{t+1}, \mathbf z_{t+1}) - B_{\mathcal R}(\mathbf w_{t+1}, \mathbf z_t) + B_{\mathcal R}(\mathbf z_{t+1}, \mathbf z_t) \\
&\quad - B_{\mathcal R}(\mathbf w^*, \mathbf z_{t+1}) + B_{\mathcal R}(\mathbf w^*, \mathbf z_t) - B_{\mathcal R}(\mathbf z_{t+1}, \mathbf z_t) \\
&\le \tfrac12\|\mathbf w_{t+1} - \mathbf w_t\|^2 + \tfrac12\eta_t^2\|\mathbf d_t\|_*^2 - B_{\mathcal R}(\mathbf w_{t+1}, \mathbf w_t) + B_{\mathcal R}(\mathbf w_{t+1}, \mathbf z_{t+1}) - B_{\mathcal R}(\mathbf w_t, \mathbf z_t) \\
&\quad - B_{\mathcal R}(\mathbf w^*, \mathbf z_{t+1}) + B_{\mathcal R}(\mathbf w^*, \mathbf z_t)
\end{aligned} \qquad (4.17)$$
where the first inequality follows from applying Lemma 4.5 to the first term and Lemma 4.3 to the rest of the terms, and the second inequality results from applying the exact version of Lemma 4.2 to $B_{\mathcal R}(\mathbf w_{t+1}, \mathbf z_t)$. Moreover, since $B_{\mathcal R}(\mathbf w_{t+1}, \mathbf w_t) - \frac12\|\mathbf w_{t+1} - \mathbf w_t\|^2 \ge 0$, these terms can be ignored in (4.17).

Summing up the inequality (4.17) from t = 1 to T yields

$$-B_{\mathcal R}(\mathbf w^*, \mathbf z_1) \le \sum_{t=1}^T \tfrac12 L\,\eta_t^2 - \sum_{t=1}^T \eta_t\gamma_t \qquad (4.18)$$

where the facts that $\mathbf w^{*\top}\sum_{t=1}^T \eta_t\mathbf d_t \ge 0$ and $-\sum_{t=1}^T \mathbf w_t^\top\eta_t\mathbf d_t = \sum_{t=1}^T \eta_t\gamma_t$ are used. Inequality (4.18) is exactly the same as (4.15), and replacing $-B_{\mathcal R}$ with $\frac{\epsilon-1}{2N\epsilon}$ or $\log(\epsilon)$ yields the same error bounds as in Theorem 4.6. Note that, since the exact version of Lemma 4.2 is required to obtain (4.17), this proof does not reveal whether MABoost can be generalized to employ the double-projection strategy. In some particular cases, however, it may be possible to show that a double-projection variant of MABoost is still a provable boosting algorithm. In the following, we briefly show that MADABoost can be seen as a double-projection lazy MABoost.

Algorithm 4.3: Variant of MADABoost
Let $\mathcal R(\mathbf w)$ be the negative entropy and $\mathcal K$ a unit hypercube; set $\mathbf z_1 = [1, \ldots, 1]^\top$.
At $t = 1, \ldots, T$: train $h_t$ with $\mathbf w_t$, set $f_t(\mathbf x) = \mathrm{sign}\big(\sum_{t'=1}^t \eta_{t'} h_{t'}(\mathbf x)\big)$, calculate $\epsilon_t = \sum_{i=1}^N \frac{|f_t(\mathbf x_i) - a_i|}{2N}$, set $\eta_t = \epsilon_t\gamma_t$ and update
 (a) $\nabla\mathcal R(\mathbf z_{t+1}) = \nabla\mathcal R(\mathbf z_t) + \eta_t\mathbf d_t \;\rightarrow\; z_{t+1}^i = z_t^i\, e^{\eta_t d_t^i}$
 (b) $\mathbf y_{t+1} = \arg\min_{\mathbf y \in \mathcal K} B_{\mathcal R}(\mathbf y, \mathbf z_{t+1}) \;\rightarrow\; y_{t+1}^i = \min(1, z_{t+1}^i)$
 (c) $\mathbf w_{t+1} = \arg\min_{\mathbf w \in \mathcal S} B_{\mathcal R}(\mathbf w, \mathbf y_{t+1}) \;\rightarrow\; \mathbf w_{t+1} = \frac{\mathbf y_{t+1}}{\|\mathbf y_{t+1}\|_1}$

Output: the final hypothesis $f(\mathbf x) = \mathrm{sign}\big(\sum_{t=1}^T \eta_t h_t(\mathbf x)\big)$.

Algorithm 4.3 is essentially MADABoost, only with a different choice of $\eta_t$. It is well known that projecting a weight vector onto the probability simplex via the entropy function results in the $\ell_1$-normalized version of the original vector. This explains the normalization in step (c). The entropy projection onto the unit hypercube (i.e., step (b)), however, may be less known and thus its proof is given in Appendix A.6. The following theorem gives the convergence rate of the MADABoost algorithm (see Appendix A.7 for the proof).

Theorem 4.11. Algorithm 4.3 yields a $(1-\epsilon)$-accurate hypothesis after at most $T = O(\frac{1}{\epsilon^2\gamma^2})$ iterations.

This is an important result since it shows that MADABoost seems, at least in theory, to be slower than what we hoped, namely $O(\frac{1}{\epsilon\gamma^2})$.
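As an illustration (not the thesis code) of one round of the weight update in Algorithm 4.3, the following minimal NumPy sketch performs the multiplicative step followed by the two entropy projections, first onto the unit hypercube and then onto the simplex; the function name is illustrative.

```python
import numpy as np

def madaboost_weight_update(z, d, eta):
    # One weight update of Algorithm 4.3 (a MADABoost variant), as a sketch.
    z_next = z * np.exp(eta * d)          # step (a): multiplicative (entropic) update
    y_next = np.minimum(1.0, z_next)      # step (b): entropy projection onto [0,1]^N
    w_next = y_next / y_next.sum()        # step (c): entropy projection onto the simplex
    return z_next, y_next, w_next
```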

4.5 Multiclass Generalization

The convergence proof of the MABoost framework (and of the algorithms derived from it) is based on the well-understood weak-learning assumption stating that the weak classifiers predict better than random on any distribution over the training set. Logically, in order to generalize the MABoost framework to the multiclass setting, the weak-learning assumption needs to be generalized. However, as has already been shown (see [SS98], [ZZRH09] and [MS13]), this is not trivial. While there is a unique definition of weak learnability in the binary setting, an infinite number of weak-learning assumptions can be adopted in multiclass classification. For instance, the straightforward extension of the binary weak-learning condition requires that each classifier only predicts better than random guessing, that is, given $K$ classes the performance of a weak classifier should only be better than $1/K$. The SAMME algorithm in [ZZRH09], for instance, adopts this assumption. However, as the number of classes increases, the random guessing baseline approaches zero and consequently the weak-learning requirement vanishes. It is hence not surprising that this condition turned out to be too weak to be boostable, i.e., no boosting algorithm with this assumption can drive the training error to zero [MS13]. Another possible weak-learning assumption is to require the classifiers to achieve more than 50% accuracy on any distribution over the training data. This assumption becomes increasingly difficult to satisfy as the number of classes increases. Since most simple classifiers (such as decision stumps) fail to meet this requirement, by adopting this assumption we need to employ more complex weak learners, which may increase the overfitting tendency. In their groundbreaking paper [MS13], Mukherjee and Schapire argued that the optimal condition is the weakest condition that still guarantees boostability. They derived the necessary and sufficient weak-learning condition for boostability,

called the minimal weak-learning condition, and more importantly they presented an inequality constraint which can be used to check whether a given weak-learning assumption is equivalent to the minimal weak-learning condition. Throughout the rest of Section 4.5, the MABoost framework is generalized to the multiclass setting and, by using the results derived in [MS13], it is shown that it adopts the minimal weak-learning condition.

4.5.1 Preliminaries for Multiclass Setting

Let $\{(\mathbf x_i, a_i)\}$, $1 \le i \le N$, be $N$ training samples, where $\mathbf x_i \in \mathcal X$ are feature vectors and $a_i \in \{0, \ldots, K-1\}$ are class labels. Assume $h \in \mathcal H$ is a discrete-valued function mapping $\mathcal X$ into $\{0, \ldots, K-1\}$. For the sake of simplicity, it is assumed that all samples belong to class 0. Nevertheless, we may refer to the label of sample $i$ by $a_i$ or by 0, depending on the context. This assumption highly simplifies our presentation.

Assume an $N \times K$ weight matrix whose $(i,j)$th element represents the non-negative weight of wrongly classifying sample $i$ to be in class $j+1$, and the sum of each row is zero, i.e., the $(i,1)$ element (since all labels are 0) is a negative weight equal to minus the sum of the non-negative weights. Due to this zero-sum constraint, the first column does not contain any additional information and can be discarded. Our weight matrix $\mathbf W$ is then an $N \times (K-1)$ matrix whose $(i,j)$th elements $W_{i,j}$ are the non-negative costs of wrongly predicting the label of the $i$th sample to be $j$.

Analogously to the binary setting, a definition of the loss matrix (a vector in the binary setting) is required. In the binary setting, the $i$th element of the loss vector $\mathbf d$ was $-1$ if the $i$th sample was classified correctly, and 1 otherwise. In the multiclass case, a loss matrix $\mathbf D$ is defined as an $N \times (K-1)$ matrix where all the elements of the $i$th row are $-1$ if the hypothesis $h$ classifies the $i$th sample correctly. However, if $h$ makes a mistake and assigns the $i$th sample to the $j$th class, then $D_{i,j}$ is 1 and the rest of the elements of the $i$th row are zero. That is, given a weak classifier $h$, the elements of the loss matrix $\mathbf D$ are:

$$D_{i,j} = \begin{cases} -1, & \text{if } h(\mathbf x_i) = 0 \\ 0, & \text{if } h(\mathbf x_i) = l \text{ and } j \ne l \\ 1, & \text{if } h(\mathbf x_i) = j \end{cases} \qquad (4.19)$$

With these definitions at hand, the weak-learning assumption can be defined as:

Definition 4.12. A weak hypothesis space $\mathcal H$ satisfies the weak-learning condition if for any element-wise non-negative $\mathbf W$, there is a hypothesis $h \in \mathcal H$ with loss matrix $\mathbf D$ as defined in (4.19) such that

$$-\mathbf D \bullet \mathbf W \ge \gamma, \qquad \exists\,\gamma \ge 0. \qquad (4.20)$$

where the inner product of two matrices $\mathbf A$ and $\mathbf B$ is defined as the trace of their product, $\mathrm{Tr}\{\mathbf A\mathbf B^\top\}$, and denoted by $\mathbf A \bullet \mathbf B$. Without loss of generality, in this definition it is assumed that $\mathbf W$ is normalized so that the sum of its elements is 1. It is shown in Appendix A.8 that this definition of the weak-learning condition is equivalent to that of ADABoost.MR [SS98]. As shown in [MS13], ADABoost.MR satisfies the minimal weak-learning condition. Thus, any hypothesis space $\mathcal H$ that satisfies the weak-learning condition in Definition 4.12 is in fact a boostable hypothesis space.
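For concreteness, the following small sketch (illustrative, not from the thesis) builds the loss matrix of (4.19) under the all-labels-are-class-0 convention and evaluates the left-hand side of the weak-learning condition (4.20); the function names are mine.

```python
import numpy as np

def loss_matrix(pred, K):
    # Loss matrix D of (4.19) under the simplifying assumption that every
    # sample's true label is class 0; pred[i] is the weak hypothesis output.
    N = len(pred)
    D = np.zeros((N, K - 1))
    for i, p in enumerate(pred):
        if p == 0:
            D[i, :] = -1.0                # correct: all entries of row i are -1
        else:
            D[i, p - 1] = 1.0             # wrong: +1 in the predicted-class column
    return D

def edge(W, D):
    # gamma = -(W . D), where the matrix inner product is Tr(W D^T).
    return -np.sum(W * D)
```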

4.5.2 Multiclass MABoost (Mu-MABoost)

Assume $\mathbf X$ is a matrix. A matrix function $\mathcal R(\mathbf X)$ is defined to be equal to $\mathcal R(\mathbf x)$, where $\mathbf x$ is a super-vector constructed by concatenating all the columns of $\mathbf X$. Let $\mathcal R(\mathbf X)$ be a 1-strongly convex function with respect to a norm $\|\cdot\|$ and denote its associated Bregman divergence by $B_{\mathcal R}$. Moreover, let the dual norm of a loss matrix $\mathbf D_t$ be defined as the dual norm of its super-vector $\mathbf d_t$ (constructed by concatenating its columns) and assume it is bounded from above, i.e., $\|\mathbf D_t\|_* = \|\mathbf d_t\|_* \le L$. It is easy to verify that for $\mathbf D_t$ in (4.19), $L = 1$ when $\|\cdot\|_* = \ell_\infty$ and $L = N(K-1)$ when $\|\cdot\|_* = \ell_2$. Algorithm 4.4 is then the generalization of the MABoost framework to the multiclass setting. In the output of Algorithm 4.4, $\mathbf 1(\cdot)$ is an indicator function which returns 1 if its argument holds, and zero otherwise.

As in the binary MABoost framework, it can be shown that if the weak classifiers are selected from a boostable space, the training error of the Mu-MABoost algorithm approaches zero in a finite number of rounds.

Theorem 4.13. Suppose that Mu-MABoost generates weak hypotheses $h_1, \ldots, h_T$ whose edges are $\gamma_1, \ldots, \gamma_T$. Then the error $\epsilon$ of the combined hypothesis $f$ on the training set is bounded as:

$$\mathcal R(\mathbf W) = \tfrac12\|\mathbf W\|_2^2: \qquad \epsilon \le \frac{1}{1 + \sum_{t=1}^T \gamma_t^2}$$
$$\mathcal R(\mathbf W) = \sum_{i=1}^{N(K-1)} w_i\log w_i: \qquad \epsilon \le e^{-\sum_{t=1}^T \frac12\gamma_t^2}$$

$\mathcal R(\mathbf W)$ is equal to $\mathcal R(\mathbf w)$ with $\mathbf w$ being the super-vector representing $\mathbf W$. The proof of Theorem 4.13 is in the spirit of the proof of Theorem 4.6. The only difference is that instead of an error vector, an $N \times (K-1)$ error matrix $\mathbf W^*$ is used, where the elements of the $i$th row are $\frac{1}{N(K-1)}$ if $f$ classifies $\mathbf x_i$ wrongly. Replacing this error matrix in the proof of Theorem 4.6 and recalling the fact that the matrix function $\mathcal R(\mathbf X)$ is defined to be $\mathcal R(\mathbf x)$ (with $\mathbf x$ being a super-vector constructed by concatenating the columns) yields the proof of Theorem 4.13.

Algorithm 4.4: Multiclass MABoost (Mu-MABoost)
Input: $\mathcal R(\mathbf X)$ a 1-strongly convex function, $\mathbf W_1$ and $\mathbf Z_1$ with elements $W_{i,j} = Z_{i,j} = \frac{1}{N(K-1)}$
For $t = 1, \ldots, T$ do
 (a) Train classifier with $\mathbf W_t$ and get $h_t$; set the elements of $\mathbf D_t$ as in (4.19) and $\gamma_t = -\mathbf W_t \bullet \mathbf D_t$.
 (b) Set $\eta_t = \frac{\gamma_t}{L}$.
 (c) Update weights: $\nabla\mathcal R(\mathbf Z_{t+1}) = \nabla\mathcal R(\mathbf Z_t) + \eta_t\mathbf D_t$ (lazy update)
  $\nabla\mathcal R(\mathbf Z_{t+1}) = \nabla\mathcal R(\mathbf W_t) + \eta_t\mathbf D_t$ (active update)
 (d) Project onto $\mathcal S$: $\mathbf W_{t+1} = \arg\min_{\mathbf W \in \mathcal S} B_{\mathcal R}(\mathbf W, \mathbf Z_{t+1})$
End
Output: Final hypothesis $f(\mathbf x)$: $H(\mathbf x, l) = \sum_{t=1}^T \eta_t\,\mathbf 1\big(h_t(\mathbf x) == l\big)$, $\quad f(\mathbf x) = \arg\max_l H(\mathbf x, l)$

It is noteworthy that using the KL-divergence as the Bregman divergence in Algorithm 4.4 gives a version of the ADABoost.MM algorithm introduced in [MS13], with slightly different values of $\eta_t$.
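The final decision rule of Algorithm 4.4 is a weighted vote over the weak hypotheses' predicted labels. A small sketch of that output step (assuming scikit-learn-style classifiers with a predict method; the names are illustrative, not the thesis code):

```python
import numpy as np

def mu_maboost_vote(hypotheses, etas, Xq, K):
    # Final Mu-MABoost decision: H(x, l) = sum_t eta_t * 1(h_t(x) == l),
    # then predict the class with the largest weighted vote.
    H = np.zeros((Xq.shape[0], K))
    for eta, h in zip(etas, hypotheses):
        preds = h.predict(Xq)              # integer class labels in {0, ..., K-1}
        H[np.arange(Xq.shape[0]), preds] += eta
    return H.argmax(axis=1)
```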

4.6 Classification Experiments with Boosting

In this section, the experiments that were run to evaluate the boosting algorithms presented in this work are described. For the evaluation of the binary and multiclass algorithms, 13 datasets from the UCI repository were used. They contain all combinations of real and discrete features, are drawn from various real-world problems, and have been regularly used in previous works as learning benchmark problems. Tables 4.1 and 4.2 list the binary and multiclass datasets and their descriptions, respectively, and provide some properties of these datasets.

Dataset Name      Features   Training samples   Test samples
Breastcancer      9          549                150
German-credit     20         700                300
Sonar             60         158                50
Gisette*          100        40 × 100           2000
Pima-diabetes     8          500                168
House-votes-84    16         335                100
Thyroid-disease   25         1500               500

Table 4.1: Binary datasets description

In all experiments with the binary datasets, the experiments were run 20 times over randomly selected training and test sets with the sample numbers specified in Table 4.1, and the results were averaged. The only exception to this test procedure is the Gisette dataset, whose original feature vector dimension is 5000. For this dataset, we first selected the 100 features with the highest mutual information values with the class labels and then set aside 2000 samples as the test set.

Dataset Name   Features   Training samples   Test samples   Classes
Abalone        8          3177               1000           28
Connect-4      42         57577              10000          3
Car            6          1228               500            4
Forest         54         20000              10000          7
Letter         16         15000              5000           26
Poker          10         15010              10000          10

Table 4.2: Multiclass datasets description.

The remaining 4000 samples were then divided into 40 training sets, each containing 100 samples. Each algorithm was then run on these training sets and its test errors were averaged. For the multiclass problems, however, most sets have more than 10000 training samples and 5000 test samples, giving sufficient confidence in the test results. Thus, the experiments were run only once. Moreover, the Forest dataset is too large and contains more than 500000 samples. Thus, only 20000 of its samples were used for training and 10000 for testing.

4.6.1 Binary MABoost Experiments

In the first set of experiments, the performance of the algorithms in the presence of mislabeling noise was evaluated. Moreover, two different weak learners were utilized in MABoost: conventional decision trees and a special implementation of decision trees. In this implementation, at each round of training a small set of features (precisely speaking, $\sqrt M$ features, where $M$ is the total number of features) was selected. A random cost was then assigned to each of the chosen features. These costs were uniformly drawn from $\mathcal U(1,4)$ and are in fact scaling coefficients to be applied when considering splits, so the improvement of splitting on a feature is divided by its cost when deciding which split to choose [TA+97]. Given these costs, a decision tree was then grown with the selected features. This particular implementation of the weak learner can be found in the MABoost R package available on CRAN [Nag14].

Data set          MABoost   MA-Forest   Real-ADA   RF
Breastcancer      3.96      3.21        3.87       3.08
German-credit     23.24     24.82       23.36      23.89
Votes-84          3.99      3.90        3.96       3.71
Pima-diabetes     23.55     24.46       23.59      23.53
Thyroid-disease   2.53      3.12        2.60       2.83
Sonar             16.23     16.54       13.88      17.11
Gisette           11.95     11.04       11.73      11.78

Table 4.3: The test errors (in percent) of binary classification with no mislabeling noise for four algorithms. Each algorithm grows 500 decision trees with a maximum size of 63. MA-Forest, Real-ADA and RF stand for MABoost-Forest, real ADABoost and Random Forest, respectively.

We empirically found that this random-forest-type weak learner usually reduces the generalization error. Each of the algorithms grew and combined 500 trees, and the size of the trees was limited to 63.

Data set           MABoost   MA-Forest   Real-ADA     RF
Breast cancer        4.48       3.89        3.94      4.60
German-credit       25.80      25.81       25.40     25.83
Votes-84             5.41       4.83        5.62      5.76
Pima-diabetes       25.16      25.22       25.36     25.78
Thyroid-disease      5.07       4.80        5.11      5.01
Sonar               21.27      19.87       21.67     20.00
Gisette             15.85      14.20       16.98     14.15

Table 4.4: The test errors in percentage of binary classification with 15% mislabeling noise in the training data. Each algorithm grows 500 decision trees with size limited to 63. A bold value in each row represents the lowest test error for that dataset.

Tables 4.3 and 4.4 report the performance of MABoost, the random-forest-type MABoost (called MABoost-Forest), real ADABoost and random forest. In the first scenario (reported in Table 4.3), the mislabeling noise was zero, while in the second scenario 15% of the training samples had wrong labels. As can be seen, in the absence of noise, random forest works better than the boosting methods for three out of seven datasets. Due to the inherent robustness of random-subspace-selection based ensemble methods such as random forest against mislabeling noise, it would be expected that random forest wins in the noisy scenario as well. Surprisingly, MABoost-Forest yields significantly better performance than the other methods in the presence of 15% mislabeling noise. MABoost-Forest gives higher accuracy for four out of seven datasets, and for the three that it does not win, its accuracy is close to that of the winner. This is perhaps due to the fact that MABoost-Forest inherits the robustness of random forest while taking advantage of the bias-correction power of boosting methods.

4.6.2 Experiment with SparseBoost

Reducing the memory complexity is the main advantage of the SparseBoost algorithm. SparseBoost reduces the memory complexity by utilizing only a fraction of the training samples at each round of boosting. In our experiments, the efficiency of the SparseBoost algorithm in terms of memory complexity reduction is evaluated. The second column of Table 4.5 reports the average sparsity ratio of the weight vectors. The sparsity ratio is defined as the ratio of the number of zero weights to the total number of weights, and the average was taken over the sparsity ratios of the 200 weight vectors generated during the boosting process. This number is reported in percentage.
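As a small illustration of how the reported numbers are obtained, the average sparsity ratio can be computed from the per-round weight vectors as follows (the variable names are ours):

    import numpy as np

    def average_sparsity_ratio(weight_vectors):
        # weight_vectors: the per-round weight vectors produced by SparseBoost.
        ratios = [np.mean(w == 0.0) for w in weight_vectors]   # fraction of zero weights
        return 100.0 * float(np.mean(ratios))                  # in percent, as in Table 4.5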

Data set           Sparsity ratio %   Training err %   Test err %
Breast cancer           67.03              0               4.13
German-credit           25.62              0              25.00
Votes-84                49.35              0.59            4.00
Pima-diabetes           25.22             25.36           25.78
Thyroid-disease         40.30              0.03            4.73
Sonar                   42.57              0              24.30
Gisette                 47.84              0              13.70

Table 4.5: The table shows the results of the experiments with the SparseBoost algorithm. All values are in percentage.

As can be seen in Table 4.5, for most of the datasets, at each round of boosting over 40% of the samples were not used. For the Breast cancer dataset, the sparsity ratio is particularly interesting: only 33% of the training samples per round were necessary to construct a perfect classifier in the sense of having 100% accuracy on the training set. However, this complexity reduction comes at the expense of a higher generalization error. Comparing the accuracies reported in Table 4.3 and Table 4.5 reveals that the SparseBoost algorithm consistently achieves lower classification accuracy than MABoost, mostly due to the fact that at each round of boosting, SparseBoost constructs an easier dataset by excluding some of the samples from the training data. Even though the easy samples might have small weights under other weighting mechanisms (such as ADABoost), they still affect the training procedure: they act as a regularizer by preventing the algorithm from perfectly fitting the hard samples. Hence, completely excluding them from the dataset increases the chance of overfitting, which consequently reduces the generalization power.

4.6.3 Experiment with CD-MABoost

Assume that for a particular classification task, several datasets with different qualities are given. The quality of a dataset depends on several factors such as the level of feature noise and the frequency of labeling errors. When the quality of the datasets is approximately known, we may incorporate this information into our learning procedure by restricting the weights of the low-quality samples. As discussed in Section 4.4.3, CD-MABoost takes this approach to combine datasets. In CD-MABoost, the distribution over the good-quality samples is unconstrained, while the weights of the low-quality samples are limited to be less than a given smoothness threshold 1/k.

An experiment was run to show the effectiveness of this method for combining datasets of different quality. To this end, an artificially generated dataset suggested by Long and Servedio in [LS10] was utilized. This dataset (called the L-S dataset in this chapter) contains 4000 training samples (no test set is needed in this experiment). Each sample (x, y) in this dataset is generated as follows. First the label y is chosen randomly from {−1, 1}. There are 21 features x_1, ..., x_21 that take values in {−1, 1}. The features of 1000 of the samples (called large margin samples) are set as x_1 = ... = x_21 = y. Another 1000 samples (called puller samples) are set to x_1 = ... = x_11 = y and x_12 = ... = x_21 = −y. The rest of the samples (i.e., 2000 samples), which are called penalizers, are chosen in three stages: (I) the values of a random subset of five of the first eleven features x_1, ..., x_11 are set equal to y, (II) the values of a random subset of six of the last ten features x_12, ..., x_21 are set equal to y, and (III) the remaining ten features are set to −y.

This data, as it is, can be fully learned by an ensemble of decision stumps. However, as has been discussed in [LS10], by adding only 10% mislabeling noise to this dataset (by randomly flipping the labels), no boosting algorithm with a convex loss function can learn the underlying hypothesis. In fact, with only 10% mislabeling noise in the data, none of the commonly used boosting methods such as MADABoost, ADABoost and LogitBoost can drive the training error below 27%. By means of this dataset it will be shown below that our CD-MABoost algorithm (see Section 4.4.3) does not have this limitation.

Assume two L-S datasets are given. One of these datasets is corrupted by 20% mislabeling noise while the other is clean, i.e., if these two datasets are combined, we have 10% mislabeling noise in the final dataset.
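A minimal sketch of the Long-Servedio construction described above is given below; the split into 1000 large-margin, 1000 puller and 2000 penalizer samples follows the description, and the optional label flipping implements the mislabeling noise.

    import numpy as np

    def long_servedio(n=4000, noise=0.0, seed=0):
        rng = np.random.default_rng(seed)
        X, y = np.empty((n, 21)), np.empty(n)
        for i in range(n):
            label = rng.choice([-1.0, 1.0])
            if i < n // 4:                                   # large margin samples
                X[i, :] = label
            elif i < n // 2:                                 # puller samples
                X[i, :11], X[i, 11:] = label, -label
            else:                                            # penalizer samples
                X[i, :] = -label
                X[i, rng.choice(11, 5, replace=False)] = label        # 5 of x_1..x_11
                X[i, 11 + rng.choice(10, 6, replace=False)] = label   # 6 of x_12..x_21
            y[i] = label
        y[rng.random(n) < noise] *= -1                       # random label flips
        return X, y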

Figure 4.1: The training error of CD-MABoost over the Long-Servedio dataset with 10% labeling error for different values of k (the weights of the mislabeled samples are restricted to be smaller than 1/k). Curves are shown for k = 5800, 6300, 6800 and 8000 over 2000 boosting rounds.

While, as extensively discussed in [LS10], it is impossible to learn this dataset with a convex-loss-based boosting method, it is shown here that by restricting the weights of the corrupted dataset (to be less than 1/k), this dataset can be effectively learned by CD-MABoost. In Figure 4.1 the training error of this boosting method is shown as a function of the number of boosting rounds for different values of k. It is not surprising that as the value of k increases (i.e., the constraint gets more restrictive), the number of training rounds needed for convergence decreases. However, if the value of 1/k is set to a rather large value (i.e., the weight constraint is weakened), the algorithm may never converge. In this example CD-MABoost does not converge for k < 5800, which translates to the weight constraint w_i ≤ 1/5800 ≈ 0.000172.
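The weight restriction itself can be illustrated with the following crude stand-in for CD-MABoost's constrained update: the weights of the low-quality samples are clipped at 1/k and the remaining mass is redistributed over the unconstrained samples. The exact Bregman projection used by CD-MABoost is more involved; this is only meant to convey the mechanism.

    import numpy as np

    def cap_low_quality_weights(w, low_quality, k):
        """w: distribution over samples (sums to 1); low_quality: boolean mask."""
        w = np.asarray(w, dtype=float).copy()
        w[low_quality] = np.minimum(w[low_quality], 1.0 / k)    # enforce w_i <= 1/k
        free = ~low_quality
        if w[free].sum() > 0:
            w[free] *= (1.0 - w[low_quality].sum()) / w[free].sum()  # renormalize
        return w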

4.6.4 Multiclass Classification Experiments

In this section we report several experiments that we ran to evaluate the performance of the proposed multiclass boosting methods. Table 4.2 lists the multiclass datasets and their descriptions. The weak learners used in the first set of experiments were decision trees of size 5 and decision trees whose depth was restricted to be less than or equal to 8, that is, trees with a maximum size of 2^9 − 1. The performance of the algorithms Mu-MABoost, SAMME and ADABoost.M1 was compared and the obtained error rates are shown in Table 4.6. As can be seen, when the tree size is small, Mu-MABoost achieves significantly better results (apart from the Car dataset) than ADABoost.M1 and SAMME due to its optimal weak-learning condition. Compared with ADABoost.M1, the Mu-MABoost algorithm does not expect high accuracy from the weak classifiers and thus can further drive the test error down. SAMME, on the other hand, uses an even weaker condition than Mu-MABoost. However, since this condition is not boostable, some of the weak classifiers in the SAMME ensemble in fact deteriorate the final classification accuracy and prevent the boosting method from driving the training error down. As the complexity of the trees increases, the gap between the performance of Mu-MABoost and ADABoost.M1 decreases, since it is now more likely that a weak learner can satisfy the overly difficult demands of the ADABoost.M1 booster.

Data set     Mu-MABoost   ADABoost-M1   SAMME
depth of decision trees ≤ 5
Abalone          76.60         95.10     87.70
Connect-4        29.07         43.51     48.00
Car               6.20          2.00      3.00
Forest           29.07         38.26     40.18
Letter           31.62         61.72     64.52
Poker            40.40         43.47     51.93
depth of decision trees ≤ 8
Abalone          73.90         75.10     78.70
Connect-4        27.81         30.56     39.54
Car               3.40          2.40      1.40
Forest           25.96         25.36     25.78
Letter            3.90          7.32      3.15
Poker            37.88         38.65     45.38

Table 4.6: The reported values are the test error rates of the multiclass classifiers in percentage after 500 rounds of boosting.


Figure 4.2: Plots of the rates at which Mu-MABoost drives down the training and test errors on the Abalone, Connect-4, Car, Forest, Letter and Poker datasets over 100 rounds of boosting. Simple decision trees of size 5 are used as weak learners in Mu-MABoost. Since it expects an arguably low accuracy from the weak classifiers, boosting can continue even with such a simple weak learner. The second and fourth rows illustrate the coefficients of the weak classifiers in the final ensembles.

Figure 4.2 illustrates the training and test errors of the Mu-MABoost algorithm for the different datasets. Moreover, the coefficients of the weak classifiers are shown in the second and fourth rows of Figure 4.2. As boosting continues, the algorithm creates harder distributions, and very soon the small-tree learning algorithm (CART in our case) may no longer be able to meet the excessive requirements of unnecessarily strong weak-learning conditions such as that of ADABoost.M1. However, the Mu-MABoost algorithm makes more reasonable demands that are easily met by CART. As can be observed, the coefficients of the weak classifiers are mostly greater than zero, meaning that the weak-learning condition has been satisfied.

4.7 Conclusion and Discussion

In this chapter, a framework that can produce provable boosting algorithms was presented. Given a boostable classifier space H, a provable boosting algorithm is a meta-algorithm that can provably drive the training error to zero by repeatedly calling a base learning algorithm, each time feeding it with a different distribution or set of weights over the training samples. The base learning algorithm returns a weak classifier selected from H, and the boosting algorithm assigns to it a coefficient that is proportional to the weak classifier's performance. The final strong classifier is then a linear combination of these weak classifiers.

The proposed framework is suitable for designing boosting algorithms with distribution constraints, i.e., deriving boosting algorithms for applications where it is required to put more weight on some specific subset of examples than on others, either because these examples have more reliable labels or because it is a problem requirement to have a more accurate hypothesis for that part of the sample space.

Several new boosting algorithms were derived from our proposed framework. In particular, a sparse boosting algorithm (SparseBoost), which selects only a fraction of the examples at each boosting round and hence efficiently reduces the required memory during the training process, was introduced. In fact this problem, i.e., applying batch boosting on successive subsets of data when there is not sufficient memory to store the entire dataset, was first discussed by Breiman in [Bre97], though no algorithm with a theoretical guarantee was suggested. SparseBoost is the first provable batch booster that can (partially) address this problem. However, since our proposed algorithm cannot control the exact number of zeros in the weight vector, a natural extension of this algorithm is to develop a boosting algorithm that receives the sparsity level as an input. This immediately raises the question: what is the maximum number of examples that can be removed from the dataset at each round, while still achieving a (1 − ǫ)-accurate hypothesis?

We introduced the CD-MABoost algorithm as the first boosting method that can take the quality of the datasets into account when several datasets of different quality are utilized to learn a particular task. The quality of a dataset depends on many factors including the level of feature noise and mislabeling noise. When the quality of the datasets is (qualitatively) given, this information can be incorporated into the learning process by restricting the importance or weights of the low-quality samples. As shown in the experiments, while no other boosting method (with a convex loss function) can learn the binary classification task introduced in [LS10], CD-MABoost can easily learn the optimal hypothesis by restricting the weights of the noisy samples.

The boosting framework derived in this work is essentially the dual of the online mirror descent algorithm. This framework can be generalized in different ways. Here, it was shown that replacing the Bregman projection step with the double-projection strategy, or, as it was called, the approximate Bregman projection, still results in a boosting algorithm in the active version of MABoost, though this may not hold for the lazy version. In some special cases (MADABoost for instance), however, it can be shown that this double-projection strategy works for the lazy version as well.
Our conjecture is that under some conditions on the first convex set (i.e., the convex set in Lemma 4.8, which is assumed to be larger than the probability simplex S_K), the lazy version can also be generalized to work with the approximate projection operator.

A new error bound was provided for the MADABoost algorithm that does not depend on any assumption. Contrary to the common conjecture, the convergence rate of MADABoost (at least with our choice of η) is of O(1/ǫ²). It is, however, still an open question whether this bound is tight or can be improved.

Finally, the MABoost framework was generalized to the multiclass setting and it was shown that this framework adopts the minimal weak-learning condition needed to boost the weak-learning space. That is, it was proven that it can employ very simple weak classifiers while still perfectly learning the underlying hypothesis, or in other words, driving the training error down to zero. By using very weak classifiers in the ensemble, we restrict the complexity of the model and thus reduce the overfitting effect. This property, i.e., robustness against overfitting, is an essential requirement in our lipreading application, where feature vectors are drawn from a high-dimensional space.

Chapter 5

Visual Voice Activity Detection

Voice activity detection (VAD) is a necessary stage in most speech-related applications such as speech recognizers (to identify which audio frames need to be processed) and human-machine interaction systems (to detect human activities). While using a reliable VAD may significantly improve their performance, it is not easy to guarantee the VAD accuracy, especially in adverse conditions such as highly reverberant rooms, non-stationary speech-type background noise, etc.

As technology advances, it is becoming very cheap, both computationally and economically, to capture video streams. Thus, a natural way of improving audio-based VAD (A-VAD) systems is to utilize visual information as complementary information. However, most of the visual VAD (V-VAD) systems reported in the literature suffer from speaker-dependency issues and inaccurate detection rates when used without audio features.

In this Chapter, by using the SIFT features and the bag-of-words (BoW) model explained in Chapter 2 and the binary classifier developed in Chapter 4, we construct a V-VAD system that, first, is highly speaker-independent and, second, can achieve high accuracy (78% frame-based detection rate, on average). Moreover, we show that this system can be trained in a semi-supervised manner. Since manually labeling speech boundaries in audio-visual data is not only a labor-intensive and time-consuming task but also subject to human errors and interpretations, having a system that can be trained in a semi-supervised manner is highly desirable. In order to obtain a semi-supervised AV-VAD, we develop a learning algorithm that can detect noisy samples whose labels are randomly flipped. This algorithm obtains up to a 95% detection rate on some datasets, which is very promising and can perhaps be used in a much wider range of applications.

5.1 Introduction to Utterance Detection

To efficiently use visual information, two challenging issues have to be addressed. The first issue in V-VAD systems is the relatively high computational complexity of the feature extraction part. In general, all visual speech processing systems require a region of interest (ROI) from which visual features can be extracted. Visual feature extraction algorithms roughly fall into two categories: those that use a crude estimate of the ROI to extract visual features and those that utilize more advanced methods such as active appearance models [CET01] or active shape models [LTB96] to match an exact ROI location or extract an ROI contour [QWP11]. Clearly, the second category may yield more precise or informative visual features at the expense of higher computational complexity. In addition, both groups can benefit from an ROI tracking system to improve robustness [AMM+07], which again increases the computational complexity. Here we use the real-time algorithm presented in [DGFVG12] for mouth detection. Due to the computational efficiency of this algorithm, the amount of time spent on ROI extraction in our V-VAD algorithm is favorably small.

The second and even more challenging issue of V-VADs is speaker variability. Speaker independence is a fundamental requirement in typical real-world V-VAD applications. In audio-based VADs, and more generally in audio-only speech recognition, speaker variability has been well studied. It has been shown that audio features (mainly MFCCs for speech recognition, and relative energy level and zero crossing rate for VADs) are highly speaker-independent. However, it is still a big challenge to develop a speaker-independent V-VAD and, more generally, a lipreading algorithm that can cope with speaker variability (see Chapter 6 and [CHLN08]).

It is known that most commonly applied visual features such as the DCT or PCA of the mouth region mainly represent speaker characteristics rather than speech events. Thus, most of the currently available V-VADs cannot reliably work in the speaker-independent mode [MLS+13], [AM08], [YNO09]. For instance, the performance of the V-VAD proposed in [AM08] drops from 97% to 72%, which is only 7% above chance considering that 65% of the samples in their dataset are speech.

Here, we propose a V-VAD which is highly robust against speaker variability. Experiments show that the proposed V-VAD algorithm can obtain a 78% frame-based detection rate on the GRID dataset, which is much higher than that of the commonly used approaches that employ Gaussian mixture models trained with DCT, PCA or LDA features extracted from video frames. Moreover, it is shown that this algorithm can be trained in a semi-supervised manner. This useful property can be exploited to automatically adapt the algorithm to new test conditions.

5.2 Supervised Learning: VAD by Using SparseBoost

Throughout this section, we develop a V-VAD by means of the SparseBoost algorithm listed in Algorithm 4.2. A SparseBoost classifier is trained in a supervised manner to classify speech and non-speech video frames.

From each ROI, a set of SIFT feature vectors was extracted. The SIFT feature vectors of all frames were then clustered into 300 groups. The set of cluster centers was considered to be a codebook with 300 codes. This codebook was used in a BoW model to construct 300-dimensional feature vectors. To reduce the computational complexity, the SIFT features were only computed for a subregion whose area was roughly 1/5 of the detected mouth region and which was a rectangle around the middle of the mouth. Since this limited region is sufficient to determine whether the mouth is open or closed, it provided sufficient statistics to learn the desired hypothesis. This reduction in ROI size improved the computational efficiency of the algorithm and helped the classifier to find a more accurate hypothesis by filtering out redundant information. Figure 5.1 depicts the ROI and the visual features used in our V-VAD algorithm.

To evaluate the proposed V-VAD algorithm, the GRID dataset described in [CBCS06] was used in our experiments. Due to the large size of this dataset, we only employed the audio-visual data of the first 16 speakers.
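The codebook construction and the per-frame BoW mapping described above can be sketched as follows; SIFT extraction itself (e.g., with OpenCV) is omitted, the 300-cluster k-means mirrors the procedure in the text, and the helper names are ours.

    import numpy as np
    from sklearn.cluster import KMeans

    def build_codebook(descriptor_sets, n_codes=300, seed=0):
        """descriptor_sets: list of (n_i, 128) SIFT descriptor arrays, one per frame."""
        all_descriptors = np.vstack(descriptor_sets)
        return KMeans(n_clusters=n_codes, random_state=seed, n_init=10).fit(all_descriptors)

    def bow_vector(descriptors, codebook):
        """Map a variable-size set of SIFT descriptors to a fixed 300-dim histogram."""
        codes = codebook.predict(descriptors)
        hist = np.bincount(codes, minlength=codebook.n_clusters).astype(float)
        return hist / max(hist.sum(), 1.0)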

Figure 5.1: (a) The mouth region is automatically detected by means of the regression-based random forest algorithm proposed in [DGFVG12]. However, only a small region from the middle of the mouth is effectively used in the V-VAD. (b) SIFT features are extracted from the middle of the mouth. The set of extracted SIFT feature vectors is mapped to a 300-dimensional BoW feature vector representing the frame. First- and second-order derivatives of the BoW features are also included in the feature set to construct the final visual feature vector. SparseBoost then takes this vector to decide whether the frame is a speech or non-speech frame.

The recordings of the first 9 speakers, i.e., s1 to s9, are always included in the training data. We then applied the one-speaker-out cross-validation technique over the second set of speakers (s10 to s16) to evaluate the proposed methods. At each round of cross-validation, the data from 6 of the second 7 speakers (s10 to s16) was added to the training data and one speaker was left out for testing. Each video sample of the GRID data is 3 seconds long. To have an almost balanced dataset (i.e., almost the same number of speech and non-speech frames), only the last two seconds of each recording were used. This gave a dataset with 57% speech and 43% non-speech frames.


Figure 5.2: Comparison of the frame-based speech detection rate of the proposed approach (SparseBoost+SIFT) with the GMM+DCT method for speakers s10 to s16.

In the first experiment, the performance of the GMM+DCT based V-VAD and our proposed approach (called SparseBoost+SIFT) were compared. In the GMM+DCT based V-VAD approach, the forty highest-energy DCT features and their first- and second-order derivatives were extracted from the ROIs. Two Gaussian mixture models, each with 4 Gaussian components, were then trained to model the speech and non-speech frames. This approach is commonly taken in the literature for both audio- and visual-based VADs [SKS99], [AM08]. We used simple decision trees with depth limited to 5 as the weak learners in the SparseBoost algorithm. Figure 5.2 shows the performance of these approaches for every speaker. As shown, the proposed method significantly outperforms the commonly used GMM+DCT based V-VAD approach. As seen, the GMM-based approach performs even worse than random guessing on speaker 12. This is due to the fact that DCT features are highly speaker-dependent and a generative method such as a GMM cannot extract the relevant information from the dominant irrelevant information.
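For reference, the GMM+DCT baseline amounts to fitting one mixture per class and comparing frame log-likelihoods. A minimal sketch is given below; the DCT feature extraction is omitted and scikit-learn mixtures stand in for whatever GMM implementation was actually used.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_gmm_vad(X_speech, X_nonspeech, n_components=4, seed=0):
        gm_speech = GaussianMixture(n_components=n_components, random_state=seed).fit(X_speech)
        gm_silence = GaussianMixture(n_components=n_components, random_state=seed).fit(X_nonspeech)
        return gm_speech, gm_silence

    def gmm_vad_decide(X, gm_speech, gm_silence):
        # Label a frame as speech if its likelihood under the speech model is higher.
        return (gm_speech.score_samples(X) > gm_silence.score_samples(X)).astype(int)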


Figure 5.3: The accuracy of audio-visual VAD, A-VAD and V-VAD in percentage. A-VAD is trained with high-SNR audio data, containing only clean and 10 dB audio samples.

The same observation regarding the poor performance of DCT features in speaker-independent settings has also been reported in [AM08] and [Gur09]. Particularly, in [Gur09] it was shown that the GMM+DCT method used to construct the silence model yielded very poor results and led to many deletions in continuous speech recognition. The mean accuracy of our method is 78% ± 2.24, where 2.24 is the standard deviation calculated over the speakers. To the best of our knowledge, this method yields one of the best V-VAD systems reported in the literature in terms of both accuracy and speaker independence.

Given this robust V-VAD, the next natural step is to optimally combine it with an audio-based VAD to obtain an AV-VAD that can accurately perform in adverse environments. As in the V-VAD, we utilized the SparseBoost algorithm to train an A-VAD system. The 13-dimensional MFCC features and their first- and second-order derivatives were extracted from frames of 65 ms duration with 25 ms overlap. That is, similar to the video data, the audio frame rate is 25 frames per second.
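The audio feature extraction can be reproduced roughly as follows (a sketch using librosa; the exact MFCC configuration and the 16 kHz sampling rate are assumptions on our part):

    import numpy as np
    import librosa

    def audio_features(wav_path, sr=16000):
        """13 MFCCs plus first- and second-order derivatives, 65 ms windows with a
        40 ms hop (i.e., 25 ms overlap), giving 25 feature vectors per second."""
        y, sr = librosa.load(wav_path, sr=sr)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                    n_fft=int(0.065 * sr), hop_length=int(0.040 * sr))
        d1 = librosa.feature.delta(mfcc)
        d2 = librosa.feature.delta(mfcc, order=2)
        return np.vstack([mfcc, d1, d2]).T   # shape: (n_frames, 39)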


Figure 5.4: Audio signals at various signal-to-noise ratios (clean, 10, 5, 0, −5, −10, −15 and −20 dB). At −20 dB, the signal is completely dominated by noise.

In the first set of experiments, we trained the A-VAD with a training set in which half of the audio signals were clean and the other half were 10 dB audio signals (constructed by adding white noise to them). The audio and visual VADs were independently trained and their decisions were then fed to an ensemble of decision trees, with 100 trees, to make the final decision. This fusing strategy is known as the late decision fusion method. As seen in Figure 5.3, while the AV-VAD slightly improves over the A-VAD system in the low-SNR region, at most SNR values it performs worse than or on a par with the A-VAD, which, considering its higher memory and computational cost, is not acceptable. More importantly, even at the low SNRs where the AV-VAD works slightly better than the A-VAD, its performance is still far worse than that of the simple V-VAD system. This is an indication that the visual information is largely disregarded in the AV-VAD. This happens because in our training data, where the audio signals are either clean or have 10 dB SNR, the A-VAD system is much more accurate than the V-VAD. Thus, the final classifier, which takes the decisions of the audio and visual VADs as input, learns to ignore the V-VAD decision.

To overcome this problem, in the next experiment we trained the A-VAD with a training set containing audio signals at various SNRs (by adding a random amount of white noise to each utterance). Figure 5.4 shows audio signals at various SNRs. It is clear that including −20 dB audio samples in the training set will not result in an A-VAD that can accurately work at −20 dB, simply because at −20 dB it is almost impossible even for a human to detect speech frames. By using this mixed training data, however, the A-VAD system learned to assign very low scores¹ to the samples with low SNRs, or generally to the samples that it cannot classify. In other words, it learned to detect when it fails. The final classifier then learned to use the V-VAD decision for samples whose A-VAD scores were too small.

Figure 5.5 depicts the accuracy of the AV-VAD when the training data contains audio samples at various SNRs. As seen, this AV-VAD shows a significant improvement over the AV-VAD trained with high-SNR audio samples. This amount of improvement, however, may still not justify its higher computational complexity compared with a simple A-VAD system. To further improve the proposed AV-VAD, one may note that in the late fusion strategy, all the information regarding the audio (video) modality is compressed into one single value, i.e., the output of the A-VAD (V-VAD), and is passed to the final classifier. This gives the final classifier very limited degrees of freedom to fuse the audio and visual information. To circumvent this problem, instead of training individual audio and visual VADs, the SparseBoost classifier was trained with super-vectors constructed by concatenating the audio and visual features. By observing both audio and visual features simultaneously, the classifier could exploit their complementary information and find a hypothesis that takes full advantage of both modalities.
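The difference between the two fusion strategies is easiest to see in code: late fusion passes only the two per-frame VAD scores to the final classifier, while early fusion feeds it the concatenated audio-visual super-vectors. The feature dimensions below follow the description in the text and are otherwise illustrative.

    import numpy as np

    def late_fusion_inputs(a_vad_scores, v_vad_scores):
        # Each modality is compressed into a single score per frame.
        return np.column_stack([a_vad_scores, v_vad_scores])        # (T, 2)

    def early_fusion_inputs(audio_feats, visual_feats):
        # Frame-synchronous concatenation into super-vectors (both streams at 25 fps),
        # e.g. 39-dim audio + 900-dim visual -> 939-dim per frame.
        T = min(len(audio_feats), len(visual_feats))
        return np.hstack([audio_feats[:T], visual_feats[:T]])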
The blue curve in Figure 5.5 shows the performance of the AV-VAD with the early fusion strategy, trained with mixed audio data. As shown, it outperforms the other methods at almost all SNRs, and its accuracy is higher than that of the visual-only or audio-only VAD algorithms.

¹ The output of the A-VAD is assumed to be a continuous value in [−1, 1]. A value close to zero indicates low confidence in the decision.


Figure 5.5: The accuracy of the audio-visual VAD for both late and early fusion, the A-VAD and the V-VAD, in percentage. The A-VAD is trained with mixed audio data, containing audio samples at various SNR values. The AV-VAD with early fusion outperforms the other methods.

5.3 Semi-supervised Learning: Audio-visual VAD

Labeling data is an expensive, time-consuming task, while in many applications unlabeled data is cheap and abundant thanks to the internet. Moreover, in some applications, such as applying VAD to YouTube clips, many environmental variables such as microphone type, camera resolution and background noise change from one video to another. Usually in these situations only unlabeled data is available to adapt a VAD to the varying environment.

Recently, there has been growing interest in developing unsupervised VAD systems [GSM13, KR13, YYDS11]. In this work, we further explore this line of research by taking advantage of the fact that the audio and visual streams give two very different (and conditionally independent) representations of the same underlying phenomenon, i.e., speech. With this observation, our problem can be cast into the now classical two-view unsupervised training framework presented in [BM98]. In this framework, two learning algorithms are applied to the two views of the data (audio and visual) separately, and each algorithm's predictions over the unlabeled data are then used to enlarge the training set of the other algorithm.

It was shown in [BM98] that when a data description can be partitioned into two distinct views, any initial weak predictor (trained with a small set of labeled data) can be boosted to arbitrarily high accuracy by co-training, using only unlabeled examples. The two important assumptions on which this celebrated result rests are: first, the underlying hypothesis is learnable in the presence of labeling noise and, second, the two views of the data are conditionally independent given the label. While the latter assumption is satisfied in our audio-visual application, the first assumption largely depends on the employed learning algorithm. Some learning algorithms are more sensitive to labeling errors than others. For instance, it is known that for each learning algorithm with a convex loss function, there is labeling noise with a particular distribution such that, in the presence of this noise, the algorithm fails to converge to a hypothesis better than random guessing [LS10].

In this work, we develop a learning algorithm, called Ro-MABoost, which can detect and remove mislabeled samples from the training data with high accuracy. We show that this algorithm can detect up to 95% of the labeling errors on some datasets, as long as the errors are randomly induced or appear random from the learning algorithm's point of view, that is, when there is no systematic labeling error or, in other words, when the labeling error does not follow any pattern.

Since Ro-MABoost can be used to learn the underlying hypothesis from a noisy dataset, it is a suitable choice for our semi-supervised application, in which the audio and visual training sets are recursively labeled by imperfect learning algorithms. The proposed semi-supervised AV-VAD training algorithm is outlined in Algorithm 5.1. In the following, we explain in detail the Ro-MABoost learning algorithm used in Algorithm 5.1.

Algorithm 5.1: Semi-supervised audio-visual VAD training

Input: Labeled training data D_l = (D_l^A, D_l^V), unlabeled training data D_u = (D_u^A, D_u^V), convergence threshold ǫ; set ΔE = ∞.

Supervised stage: Train the A-VAD with D_l^A and label D_u with it; set E_1 = error rate of the A-VAD.

While ΔE > ǫ do
  (a) V-VAD: Train Ro-MABoost with D_l^V ∪ D_u^V and output the indices I of the mislabeled samples.
  (b) Omitting mislabeled data: Remove {x_i | i ∈ I} from the training data, D = D_u \ {x_i | i ∈ I}.
  (c) A-VAD: Train Ro-MABoost with D_l^A ∪ D^A and output h_A.
  (d) Re-labeling: Re-label D_u with h_A.
  (e) Stopping criterion: E_2 = error rate of h_A; set ΔE = E_1 − E_2 and E_1 = E_2.
End

Output: h_V and h_A.

5.3.1 Ro-MABoost

As mentioned in Chapter 4, the concept of boosting emerged as an answer to the question of whether a weak learner can be converted into a strong learner with arbitrary accuracy. A boosting algorithm is a meta-algorithm that generates many weak learners with slightly better-than-random performance. The final strong classifier is the ensemble of these weak classifiers. At each round of training, the algorithm concentrates on the samples that were wrongly classified in the previous steps and aims to find a hypothesis that accurately describes them. However, when some training samples have wrong labels, this learning scheme may fail badly.

Algorithm 5.2: Robust Mirror Ascent Boosting (Ro-MABoost)

Input: w_1 = [1/N, ..., 1/N]^T, z_1 = [1/N, ..., 1/N]^T and I = {}

For t = 1, ..., T do
  (a) Train the classifier with w_t and get h_t; let d_t = [−a_1 h_t(x_1), ..., −a_N h_t(x_N)] and γ_t = −w_t^T d_t.
  (b) Set η_t = γ_t / N and I^c = {1, ..., N} \ I.
  (c) If |I| < εN, add to I the indices of the samples whose margins are smaller than Θ.
  (d) Set S = { w | w_i ≥ 0 ∀ i ∈ I^c, w_i = 0 ∀ i ∈ I, Σ_{i=1}^N w_i = 1 }
  (e) Project onto S: w_{t+1} = argmin_{w ∈ S} ||w − w_t − η_t d_t||_2^2
End

Output: The final hypothesis f(x) = sign( Σ_{t=1}^T η_t h_t(x) ).

The effect of random label noise on boosting can be intuitively explained as follows. Assume the data is separable. As the number of training rounds increases, the algorithm assigns higher and higher weights to hard-to-classify examples, which in this case are the wrongly labeled samples. In ADABoost, for instance, the weights of noisy samples increase exponentially fast with respect to the number of training rounds. That is, after a few rounds, the weights of correctly labeled samples are very small compared with those of mislabeled ones, resulting in weak learners that fit the noisy samples.

To overcome this deficiency, we propose a give-up strategy to omit noisy samples during the training process. The algorithm takes two hyperparameters as inputs: first, the noise rate, which is the amount of noise in the data, and second, the margin threshold, which is the minimum margin of the samples that are still considered to be correct. Given these two parameters, at each round of training the algorithm generates a weak classifier and removes the samples whose margins are smaller than the margin threshold. The removal process continues until the total number of removed samples reaches a threshold (computed from the noise rate). From this stage on, the algorithm continues to generate a weak classifier at each round, however without removing any more samples from the training data. It finally returns the ensemble of the weak classifiers as the final classifier when the predefined number of classifiers T is reached. This gives a robust version of the MABoost algorithm, called Ro-MABoost, which can handle noisy datasets.

Let {(x_i, a_i)}, 1 ≤ i ≤ N, be N training samples, where the x_i ∈ X are feature vectors and the a_i ∈ {−1, +1} are labels. Assume h ∈ H is a real-valued function mapping X into [−1, 1]. Denote a distribution over the training data by w = [w_1, ..., w_N]^T and define a loss vector d = [−a_1 h(x_1), ..., −a_N h(x_N)]^T. Assume the rate of labeling noise is ε and the margin threshold below which a sample is considered to be noisy is Θ. Using this notation, Ro-MABoost is outlined in Algorithm 5.2. The set I in that algorithm keeps the indices of the noisy samples, and S is a subspace of the probability simplex in which only some of the dimensions can be nonzero (those that correspond to correct samples). Choosing S as defined in Algorithm 5.2 guarantees that the weights assigned to detected noisy samples are zero, resulting in their exclusion from the training process.

Ro-MABoost can be described as follows. Assume there are 200 weak classifiers in the final ensemble, i.e., T = 200. These weak classifiers can be seen as very simple rules of thumb that roughly classify samples with accuracy just slightly better than random. At each round, Ro-MABoost removes the samples that are inconsistent with most of the already generated rules of thumb. For instance, by properly setting Θ, after generating 100 rules of thumb, Ro-MABoost starts to remove the samples that are wrongly classified by 90% of these classifiers. However, an important requirement for obtaining reasonable results is to ensure that the weak classifiers used in Ro-MABoost are in fact very weak; otherwise, they may even fit the noisy samples. In our implementation, we used decision stumps (decision trees with only one node) as weak classifiers.
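The core of the give-up strategy, flagging the samples whose ensemble margins fall below the adaptive threshold, can be sketched as follows. This is a simplified, one-shot version of what Ro-MABoost does incrementally over the rounds, and the helper name is ours; the default theta_scale of 0.2 mirrors the adaptive threshold used in the experiments of Section 5.3.2.

    import numpy as np

    def flag_noisy_samples(weak_preds, etas, y, noise_rate, theta_scale=0.2):
        """weak_preds: (t, N) outputs of the weak classifiers generated so far,
        etas: their coefficients, y: labels in {-1, +1}."""
        scores = (np.asarray(etas)[:, None] * np.asarray(weak_preds)).sum(axis=0)
        margins = y * scores
        theta = theta_scale * margins.mean()          # adaptive margin threshold
        budget = int(noise_rate * len(y))             # remove at most eps * N samples
        order = np.argsort(margins)                   # smallest margins first
        flagged = [i for i in order if margins[i] < theta][:budget]
        return np.asarray(flagged, dtype=int)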

5.3.2 Experiments on Supervised & Semi-supervised VAD

In the first set of experiments, the accuracy of Ro-MABoost in detecting noisy samples was evaluated. Six of the binary datasets described in Table 4.1 were employed in these experiments. The algorithm was tested in the presence of 3 different noise rates (ε): 10%, 20% and 30%. In each experiment, we randomly flipped a fraction of the labels according to the given noise rate ε and used this noisy dataset to train a classifier with Ro-MABoost. Throughout the training process, Ro-MABoost flags a fraction ε of the samples as noisy. Table 5.1 reports the percentage of the samples identified as noisy by Ro-MABoost that were in fact mislabeled, i.e., the accuracy of Ro-MABoost in detecting mislabeled samples.

Dataset            10% noise   20% noise   30% noise
Breast cancer        95.71       93.57       84.76
German-credit        79.00       69.50       57.00
Votes-84             95.45       94.25       86.92
Pima-diabetes        57.14       58.44       46.08
Thyroid-disease      96.50       94.25       87.50
Sonar                52.38       66.66       40.32

Table 5.1: The accuracy of Ro-MABoost in detecting the mislabeled samples in the presence of various percentages of labeling noise, reported in percentage. For instance, on the Breast cancer dataset, Ro-MABoost correctly detects 95.71% of the mislabeled samples.

The margin threshold used in these experiments was adaptive and was set to 0.2 times the mean margin (the average of the margins of all samples) at each round. As seen in Table 5.1, Ro-MABoost achieves impressively high detection accuracy on most datasets. On 3 out of the 6 datasets, Ro-MABoost can detect more than 95% of the mislabeled samples in the presence of 10% labeling noise and obtains more than 84% detection accuracy when the noise rate is as large as 30%. The success of Ro-MABoost in detecting mislabeled samples suggests further applications of this algorithm, including improving search results in search engines by detecting irrelevant results and improving the quality of training data.

In the next set of experiments we evaluated the performance of the semi-supervised AV-VAD algorithm listed in Algorithm 5.1. In this test, the utterances of speaker 12 were considered test data and the rest of the first 16 speakers were used as training data. 1.25% of the samples were randomly selected to construct the labeled training data and the remaining samples (i.e., 98.75%) were taken as unlabeled training data D_u. The audio training set was a mix of audio signals at various SNRs (a random amount of white noise was added to each utterance) and the audio test set consisted of the audio recordings from speaker 12 with additive noise such that their SNR values were around 0 dB.


Figure 5.6: The accuracy of the semi-supervised A-VAD and V-VAD in percentage. After 7 iterations, the A-VAD reaches an 88.25% detection rate, which is only 1% lower than the A-VAD accuracy obtained by supervised training.

Figure 5.6 shows the test performance of the A-VAD and V-VAD over 13 iterations of training. The value at round 0 corresponds to the accuracy of the initial A-VAD (which is trained on the given small set of labeled data). The horizontal lines in Figure 5.6 show the accuracy of the A-VAD and V-VAD over the test set when 100% of the training samples were labeled and used in training. They serve as reference points for our semi-supervised training; the aim is to get as close as possible to these optimal upper bounds. As seen, the optimal A-VAD accuracy is 89.30%. By using semi-supervised training we obtained an 88.25% detection rate, which is only slightly lower than that of the fully labeled case. The V-VAD classifier also reaches up to 87.4%, which is only 0.1% lower than the 87.5% accuracy of the fully labeled case. Moreover, Algorithm 5.1 seems to converge after a relatively small number of iterations, since the performance did not improve significantly after the 8th iteration.

The only requirement of our semi-supervised AV-VAD training is to have an initial A-VAD. This A-VAD can either be obtained by training a classifier with a small set of labeled data (in which case the procedure is better categorized as semi-supervised training) or by simply using a threshold-based A-VAD (using energy as the main feature) as the initial A-VAD. Both interpretations can be of interest to practitioners. Particularly, in applications where adapting the AV-VAD to time-varying environments is a requirement, the already available A-VAD can be considered the initial A-VAD and Algorithm 5.1 can be used for adaptation.

5.4 Conclusion

Throughout this Chapter, we developed a V-VAD system that is highly speaker-independent and can achieve high accuracy. The proposed V-VAD is obtained by training a MABoost classifier (described in Chapter 4). The final V-VAD is an ensemble of 4000 very simple decision trees (with depth less than 5). Even though the dimensionality of the feature vector space is fairly large (900), the proposed V-VAD achieves a low generalization error due to the robustness of MABoost to overfitting.

We developed a robust version of the MABoost algorithm which can detect mislabeled samples in a training set. We showed that on some datasets it can detect up to 95% of the mislabeled samples when the labels are independently flipped with some small probability, i.e., when the label noise is not systematic (and thus not learnable).

Furthermore, we derived a semi-supervised AV-VAD framework that can be used either to adapt a pretrained AV-VAD to a new environment or to train an AV-VAD system on an unlabeled dataset by using an initial threshold-based A-VAD. This framework is based on the co-training algorithm proposed in [BM98] and our mislabeled-sample detection algorithm, Ro-MABoost. In the experiments we showed that by using this framework, both the A-VAD and the V-VAD obtain detection accuracies close to those of the fully labeled case where all training samples are labeled.

Chapter 6

Lipreading with High-Dimensional Feature Vectors

This Chapter is devoted to developing a lipreading system based on the feature sets and the learning methods introduced in the previous Chapters. In this system, 3D-SIFT features are extracted from each video. These features are then used in bag-of-words (BoW) models to generate a 4000-dimensional BoW descriptor per frame. This high-dimensional data is directly utilized to train a multiclass classifier with Mu-MABoost. This system is shown to be reliable in the sense that it is highly speaker-independent and outperforms conventional algorithms.

Similar to [HES00], where the posterior probabilities generated by an ANN were used as audio features (known as Tandem features) in a GMM-HMM based speech recognizer, a new set of visual features is proposed wherein the output posterior probabilities of the Mu-MABoost classifier are added to the COBRA-selected ISCN features introduced in Chapter 3. The new feature set is then used to train GMM-HMMs. Using this new feature set further reduces the variance of the V-ASR over different speakers and obtains 70.5% and 63.5% accuracy on the Oulu and GRID datasets, respectively. To the best of our knowledge, these recognition rates are the highest reported in the literature for these datasets when the V-ASR is evaluated in the speaker-independent setting.

6.1 Introduction to Visual Speech Recognition

It is a known fact that speech perception is a multi-modal phenomenon. Audio speech, visual speech (lipreading), facial gestures, body pose, etc., all convey relevant speech information. In most scenarios, however, the audio and visual signals are sufficient input for a human listener to perceive and understand speech.

Following this observation, many researchers aimed to fill the gap between human-level performance and the performance of A-ASR in real-world applications by integrating additional visual speech information with the audio speech. This, however, turned out to be too optimistic. The first attempts in this direction [PNLM04, and references therein] clearly showed that visual cues are highly speaker-dependent and show large variations under different illumination conditions. That is, commonly used visual features (appearance features such as DCT, PCA and LDA, and model-based features such as the active shape model (ASM) [DL00] and the active appearance model (AAM) [MCB+02, CET01]) are not reliable enough to work in a speaker-independent mode. This poses two questions: (I) a more theoretical question regarding the formal definition of reliability¹, which has been investigated in [PSSD06, RCB97], and (II) a practical question of how an unreliable recognizer (V-ASR) can be optimally combined with a reliable recognizer (A-ASR) in order to obtain an overall performance improvement.

A commonly used approach to tackling the second problem is to assign a reliability weight (an exponent weight) to the probability distribution of each information stream. This approach has been empirically shown to be effective [WSC99], [CR98], [CDM02], [DL00] in audio-visual speech and speaker recognition, especially in the presence of varying additive acoustic noise. From the theoretical standpoint, it was shown in [PSSD06] that, under certain simplifying assumptions, a proper choice of exponent weights can minimize the variance of the mismatch error between the true and estimated distributions, though as we discuss in Chapter 7, this may not necessarily result in a lower error rate.

¹ Note that reliability is a property relating to the features and the amount of mismatch error between the distributions of the training and test data. For instance, an A-ASR system trained on a noisy dataset is a weak recognizer, while it is still considered reliable due to its consistent performance over different speakers.

However, due to the strong speaker- and illumination-dependence of visual features, these weighting schemes should be able to adapt themselves automatically in order to cope with quality variations of the visual modality. This, however, is an inherently challenging problem because the quality of the visual features must be assessed in an unsupervised manner. To sidestep this problem, in Chapter 7 we derive a weight estimation algorithm by maximizing the AUC criterion. Instead of optimizing the performance of the recognizer for a particular test condition, the proposed approach yields a set of weights minimizing the expectation of the error over different test conditions. The proposed fusion algorithm is a general approach that can be applied to the model presented in this chapter as well. However, due to its pessimistic nature, this robustness comes at the expense of some performance degradation of the AV-ASR in the high-SNR regimes.

Another approach to improving the reliability of V-ASR systems is to use speaker- and illumination-invariant features. A reliable V-ASR is a classifier whose performance shows little variation over different speakers. A big step in this direction was taken by Zhao et al. in [ZBP09], where they considered the whole video sequence of an utterance as a volume and calculated spatio-temporal local binary pattern (LBP-TOP) features in order to classify utterances. LBP features are binary features known to be gray-scale and rotation invariant [OPM02]. When calculated across the XY, YT and XT planes, they result in LBP-TOP features which, in addition to the invariance properties inherited from LBP features, encode the time-dynamics of visual speech. As they demonstrated, the temporal patterns have a high discriminative power since they are less sensitive to speakers and more related to the underlying speech. However, they applied two further steps to extract low-dimensional, equal-size feature vectors from the videos. First, even though different utterances have different lengths (i.e., various numbers of frames), they are divided into the same number of block volumes to get the same number of features. Second, ADABoost is employed to select a small subset of informative features. From our point of view, these two steps are unnecessary and perhaps reduce the invariance properties of the LBP-TOP features.

Conventional visual features are highly speaker-dependent and more suited to encoding the identity of a speaker than visual speech. Normalizing visual features is thus a reasonable idea in V-ASR systems in order to improve the speaker-invariance property of the recognizer. For instance, in [ZZP11] it is proposed to first learn a deterministic function that maps a curve (which in their work is a low-dimensional manifold) into the image space and then use this function to normalize the raw visual data.
They showed that using this method in conjunction with LBP-TOP features resulted in an overall 10% performance improvement in a 10-phrase recognition task. Other normalization methods such as Hi-LDA [PNLM04] and z-score normalization [NC09] are also commonly used in the literature. An extensive comparison between different normalization methods for AAM and DCT feature sets has been reported in [LTH+10]. However, these methods do not completely solve the inter-speaker variation problem, and the variation of the V-ASR performance over different speakers remains substantially large.

Perhaps the main reason for the sensitivity of conventional visual features to the speaker's identity is that they are low-level global features. By definition, global features are features that depend on the values of all pixels. For instance, DCT, LDA or PCA features are global features since they are directly extracted from pixel values by applying a linear transformation to an ROI. The main drawback of global features such as LDA is that, since ROIs usually convey a significant amount of irrelevant information (i.e., facial characteristics of the speaker), these features are highly noisy and thus difficult to learn from. Similar to LDA, the AAM and ASM methods also yield global features because, even though they further process the ROI to extract lip active appearances (active shapes in the case of ASM), a new feature vector is still extracted by applying a linear transformation to the ROI.

In Chapter 3, we showed that using ISCN features improves the V-ASR performance. However, even though ISCN features are the result of applying several non-linear local wavelet transforms to an ROI, they still cannot provide the level of robustness necessary for V-ASR applications. To address this problem, we use a discriminative approach and train a MABoost classifier with visual features. The posterior probabilities of this classifier are then added to the ISCN features to construct the final set of visual features, which are used to train GMM-HMMs. The key contributions of this work are fourfold:

1. Multiple color spaces: In order to achieve illumination invariance and increase the discriminative power, the video frames are first transformed into 5 different color spaces: the opponent color space, which is invariant to changes in light intensity; the color-invariant color space, which is scale-invariant with respect to light intensity; the r and g chromatic components of the normalized RGB color space, which are known to be scale-invariant; and

finally, the normal RGB color space and the gray color space. The invariance properties of each of these color spaces and the discriminative power of their corresponding scale-invariant feature transform (SIFT) features [Low04] have been explored in [vdSGS10]. It was shown there that using a combined set of features extracted from these color spaces results in a significant performance improvement and enhances the robustness of object classification against illumination variations.

2. 3D-SIFT + Bag-of-Words: Following the idea of Zhao et al. [ZBP09] to use spatio-temporal features, we also employ 3-dimensional (X, Y and time, T) features to represent visual speech. However, instead of LBP-TOP, we employ the 3D-SIFT features introduced in [SAS07] in order to benefit from the invariance properties of the popular SIFT features. From a problem-oriented standpoint, the most important invariance property of SIFT in a lipreading task, where speakers have different lip and mouth characteristics and sizes, is its scale invariance. 3D-SIFT features are extracted from each frame and for each of the 5 color spaces mentioned above, i.e., 5 sets of feature vectors per frame. Since each frame may have a varying number of key points, the number of 3D-SIFT feature vectors representing a frame differs from frame to frame. To obtain fixed-length feature vectors per frame, the bag-of-words model introduced in [SZ03] is employed. Using BoW is perhaps the most important step in this work towards achieving speaker-invariant features.

3. Multiclass Boosting: Finally, for classification, a learning algorithm is needed that has the following properties: (I) due to the high dimensionality of the visual feature vectors, the learning algorithm should neither have many free parameters nor be too complex, in order to be resistant to overfitting; (II) it should be powerful enough to learn the underlying complex hypothesis; (III) it should be able to directly address the multiclass classification problem. The common approach to dealing with multiclass settings is to reduce them to multiple binary classification problems. This, however, significantly increases the computational complexity in our problem, where five high-dimensional feature sets are used to train five classifiers for digit classification and five 26-class classifiers for letter recognition. We use the multiclass setting of the MABoost algorithm (called Mu-MABoost) presented in Chapter 4 and demonstrate that it satisfies the above requirements. Mu-MABoost has several appealing properties.

First, it was shown in Chapter 4 that Mu-MABoost requires minimal conditions on weak classifiers to achieve high accuracy on training data. That is, the weak classifiers constituting the final strong classifier are chosen from a very simple hypothesis space, in our case from the class of decision trees with a depth of less than 5. Using such simple classifiers minimizes the risk of over-fitting. Second, by directly formulating the multiclass classification problem, it largely reduces the computational complexity, making it a method of choice for high-dimensional data.

4. ISCN + Posterior Probabilities: Due to the success of HMMs in A-ASR systems [RJ93], HMMs and, more generally, deep Bayesian networks [CH97, LT97, SLS+05] have been widely applied in V-ASR to capture the time dynamics (temporal information) of visual speech. We also adopt HMMs in our final recognizer framework in order to better take the time-series aspect of speech into account. The visual features used to train the final GMM-HMM based lipreading system are the posterior probabilities output by Mu-MABoost together with the COBRA-selected ISCN features introduced in Chapter 3. This hybrid feature set obtains very promising performance and outperforms other feature extraction methods on the GRID and Oulu datasets.

6.2 Color Spaces

It is shown in [vK70] that changes in illumination can be modeled by a linear transformation (diagonal mapping or von Kries model). Thus, different color spaces for representing an image may exhibit different invariance properties. As extensively explained in [vdSGS10], based on the von Kries model and the von Kries offset model, four types of changes in an image can be described. These four image transformations due to a light change are light intensity change, light color change, light shift change and, finally, light color and shift change. Appropriate color spaces show invariance properties against some of these changes. The color spaces used in this work are as follows (a small conversion sketch is given after the list):

1. Opponent color space: It is defined as

$$\begin{pmatrix} O_1 \\ O_2 \\ O_3 \end{pmatrix} = \begin{pmatrix} \frac{R-G}{\sqrt{2}} \\ \frac{R+G-2B}{\sqrt{6}} \\ \frac{R+G+B}{\sqrt{3}} \end{pmatrix} \qquad (6.1)$$

where R, G and B are the three channels of the RGB color space. In this space, $O_1$ and $O_2$ are shift-invariant with respect to light intensity and $O_3$ does not have any invariance property.

2. Color-invariant color space: It is defined as

$$\begin{pmatrix} H \\ C \end{pmatrix} = \begin{pmatrix} \frac{0.3R + 0.04G - 0.35B}{0.34R - 0.6G + 0.17B} \\ \frac{0.3R + 0.04G - 0.35B}{0.06R + 0.63G + 0.27B} \end{pmatrix} \qquad (6.2)$$

H and C are the two color-invariant channels used to extract C-SIFT [AF06] (to be precise, in [AF06] only the H channel was eventually used to obtain C-SIFT features). As explained in [vdSGS10], H and C are scale-invariant with respect to light intensity but not shift-invariant.

3. rg color space: It is the normalized RGB

$$\begin{pmatrix} r \\ g \end{pmatrix} = \begin{pmatrix} \frac{R}{R+G+B} \\ \frac{G}{R+G+B} \end{pmatrix} \qquad (6.3)$$

Due to the intensity normalization, r and g are scale-invariant with respect to light intensity changes, shadows and shading.

4. Gray scale: It is the most commonly used color space in V-ASR systems due to its simplicity (mono-channel) and can be written as

$$g = 0.2989\,R + 0.5870\,G + 0.1140\,B \qquad (6.4)$$

Gray scale has no invariance properties.

5. RGB color space: The last color space used in this work is the RGB color space, which obviously has no invariance properties.
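As a concrete illustration of Eqs. (6.1)-(6.4), the sketch below converts an RGB frame into the five representations used in this work. It is only a minimal example: the function and variable names are not from the thesis code, the frame is assumed to be a floating-point NumPy array, and a small epsilon is added to the ratio-based channels to avoid division by zero.

```python
import numpy as np

def color_representations(rgb):
    """rgb: float array of shape (H, W, 3) in [0, 1]."""
    R, G, B = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    eps = 1e-8  # guards the ratio-based channels against division by zero

    # Opponent color space, Eq. (6.1)
    opponent = np.stack([(R - G) / np.sqrt(2),
                         (R + G - 2 * B) / np.sqrt(6),
                         (R + G + B) / np.sqrt(3)], axis=-1)

    # Color-invariant channels H and C, Eq. (6.2)
    num = 0.3 * R + 0.04 * G - 0.35 * B
    Hc = num / (0.34 * R - 0.6 * G + 0.17 * B + eps)
    Cc = num / (0.06 * R + 0.63 * G + 0.27 * B + eps)

    # Normalized rg chromaticity, Eq. (6.3)
    s = R + G + B + eps
    rg = np.stack([R / s, G / s], axis=-1)

    # Gray scale, Eq. (6.4)
    gray = 0.2989 * R + 0.5870 * G + 0.1140 * B

    return {"rgb": rgb,
            "opponent": opponent,
            "color_invariant": np.stack([Hc, Cc], axis=-1),
            "rg": rg,
            "gray": gray}
```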

Figure 6.1: Visual speech-unit classifier. In the first step, video data are transformed to 5 color spaces: RGB, opponent, rg, gray and color-invariant. 3D-SIFT features are extracted from each representation and clustered into 4000 classes. BoW features, which are the frequency of occurrence of each of these classes, are then computed for each video sample and employed to train 5 classifiers, one for each color space. The final classifier is the ensemble of these 5 classifiers.

6.3 Visual Features

Despite the fact that many visual feature sets have been proposed for V-ASR systems (see [ZZHP14] and [PNLM04]), none of them has gained wide acceptance (as MFCCs have in A-ASR systems) in real-world applications. Recently, however, several spatio-temporal features have appeared which have been shown to be promising in both video classification [SAS07] and lipreading [PGC08], [ZBP09]. Our proposed method may also be considered to go along this line, with an additional bag-of-words step which results in a sparse feature vector (suitable for the decision tree classifiers used as weak classifiers in Mu-MABoost). In this work, video samples are first represented in the 5 different color spaces introduced in the previous section. For each representation, a set of 3D-SIFT features² is extracted from the video data and used in a BoW model with 4000 words to construct the final 4000-dimensional feature vectors. This approach yields 5 feature sets, each containing 4000-dimensional feature vectors. Each of these feature sets is then used to train a multiclass classifier for speech recognition. This procedure is shown in Figure 6.1.
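The bag-of-words step can be sketched as follows. The snippet assumes that 3D-SIFT descriptors have already been extracted (one array of descriptors per video and color representation); the codebook size of 4000 matches the text, while the clustering routine, function names and normalization are illustrative choices rather than the thesis implementation.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_codebook(train_descriptors, n_words=4000, seed=0):
    """train_descriptors: list of (n_i, d) arrays, one per training video."""
    data = np.vstack(train_descriptors)
    codebook = MiniBatchKMeans(n_clusters=n_words, random_state=seed, n_init=3)
    codebook.fit(data)
    return codebook

def bow_histogram(descriptors, codebook):
    """Fixed-length (n_words,) histogram, regardless of the key-point count."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```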

6.4 Multiclass Classifier

In order to fully benefit from the invariance properties and sparsity of the extracted features, we need a learning algorithm that (I) can exploit the sparsity of the feature vectors to reduce the computational complexity of the learning phase and (II) aggressively learns the relevant information in the training data with minimal overfitting. The Mu-MABoost algorithm presented in Chapter 4 in fact satisfies both of these requirements. First, by using shallow decision trees as weak learners, it can benefit from the sparse structure present in the feature vectors and, second, since it is a provable boosting algorithm, it is guaranteed to drive the training error down to zero in a finite number of iterations. In this work, the Mu-MABoost algorithm with the Euclidean distance as the Bregman divergence is used to train visual digit and letter classifiers. This algorithm is listed in Algorithm 6.1. In Algorithm 6.1, vec(W) stands for the vectorized version of matrix W (by column-wise concatenation), Tr(·) denotes the trace operator, N is the number of training examples and K is the number of classes.

²3D-SIFT descriptors are the direct generalization of the popular SIFT features. A brief explanation of 3D-SIFT is given in Chapter 2.

Algorithm 6.1: K-class MABoost (Mu-MABoost)

Input: $\mathcal{R}(X) = \mathrm{Tr}(X^\top X)$; $W_1$ and $Z_1$ with elements $W_{i,j} = Z_{i,j} = \frac{1}{N(K-1)}$

For $t = 1, \ldots, T$ do

(a) Train a classifier with $W_t$ and get $h_t$; set the elements of $D_t$ as in (4.18) and $\gamma_t = -\mathrm{Tr}(W_t^\top D_t)$.

(b) Set $\eta_t = \frac{\gamma_t}{(K-1)N}$

(c) Update: $Z_{t+1} = W_t + \eta_t D_t$

(d) Project onto $\mathcal{S}$: $W_{t+1} = \operatorname*{argmin}_{\mathrm{vec}(W) \in \mathcal{S}} \mathrm{Tr}\big((W - Z_{t+1})^\top (W - Z_{t+1})\big)$

End

Output: Final hypothesis $f(x)$: $H(x,l) = \sum_{t=1}^{T} \eta_t \mathbb{1}\big[h_t(x) = l\big]$, $\quad f(x) = \operatorname*{argmax}_{l} H(x,l)$
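A rough Python sketch of the Mu-MABoost loop in Algorithm 6.1 is given below. The construction of the cost matrix D_t follows Eq. (4.18) of the thesis, which is not reproduced in this chapter, so it is passed in as a callback; the constraint set S is assumed here to be the probability simplex over vec(W), matching the uniform initialization. All names are illustrative, not the thesis implementation.

```python
import numpy as np

def project_simplex(z):
    """Euclidean projection of a vector onto the probability simplex."""
    v = np.sort(z)[::-1]
    css = np.cumsum(v) - 1.0
    rho = np.nonzero(v - css / (np.arange(len(v)) + 1) > 0)[0][-1]
    return np.maximum(z - css[rho] / (rho + 1.0), 0.0)

def mu_maboost(X, y, K, weak_learner, compute_D, T=100):
    """weak_learner(X, y, W) -> h, a callable mapping one sample to a class label;
    compute_D(h, X, y, K) -> (N, K) cost matrix, assumed to follow Eq. (4.18)."""
    N = X.shape[0]
    W = np.full((N, K), 1.0 / (N * (K - 1)))        # W_1 = Z_1, uniform start
    learners, etas = [], []
    for _ in range(T):
        h = weak_learner(X, y, W)                   # step (a): train on weighted data
        D = compute_D(h, X, y, K)
        gamma = -np.trace(W.T @ D)                  # gamma_t = -Tr(W_t^T D_t)
        eta = gamma / ((K - 1) * N)                 # step (b)
        Z = W + eta * D                             # step (c)
        W = project_simplex(Z.ravel()).reshape(N, K)   # step (d), assumed simplex S
        learners.append(h)
        etas.append(eta)

    def predict(x):
        scores = np.zeros(K)
        for eta, h in zip(etas, learners):
            scores[h(x)] += eta                     # H(x, l) = sum_t eta_t 1[h_t(x) = l]
        return int(np.argmax(scores))
    return predict
```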

6.5 Lipreading Experiments with ISMA Features

For a comprehensive evaluation of the proposed method, several experiments were designed. Since our objective is to construct a speaker-independent lipreading system, all experiments were carried out in speaker-independent mode. The accuracies were estimated by a leave-one-speaker-out cross-validation strategy. In all experiments, Mu-MABoost was first trained to classify speech units. The output of Mu-MABoost is a K-dimensional vector (K being the number of speech units) whose i-th element is the probability of belonging to class i, given an input. This K-dimensional vector was computed for each frame and used as visual features together with the COBRA-selected ISCN features. The delta and delta-delta features were also included to construct the final visual feature set. These features were then used to train GMM-HMMs. For AV-ASR, we additionally include the 39 MFCC features extracted from the audio signals in the feature set. Thirteen MFCC features and their first- and second-order derivatives were extracted from frames of 20 ms duration with 10 ms overlap. That is, the audio frame rate was 100 frames per second. Visual features were linearly interpolated to increase their frame rate from 30 to 100 in order to have an equal number of audio and visual feature vectors per utterance.
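The rate-matching step can be sketched as below: per-frame visual feature vectors sampled at 30 fps are linearly interpolated onto the 100 fps audio frame times so that every MFCC frame has a visual counterpart. The function name and shapes are illustrative assumptions.

```python
import numpy as np

def upsample_visual(features_30fps, n_audio_frames, video_fps=30.0, audio_fps=100.0):
    """features_30fps: (T_v, D) array; returns an (n_audio_frames, D) array."""
    T_v, D = features_30fps.shape
    t_video = np.arange(T_v) / video_fps          # time stamps of the video frames
    t_audio = np.arange(n_audio_frames) / audio_fps  # time stamps of the MFCC frames
    return np.stack([np.interp(t_audio, t_video, features_30fps[:, d])
                     for d in range(D)], axis=1)
```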

6.5.1 Oulu: Phrase recognition

The Oulu dataset consists of 10 daily used phrases such as nice to meet you. Each phrase is repeated 5 times by each of the 20 speakers.


Figure 6.2: Histogram of length (number of frames) of phrases for the Oulu dataset. Most recordings are longer than 20 frames.

A brief description of this dataset can be found in Chapter 2.


Figure 6.3: Phrase recognition accuracy for ISMA, MAPP and ISCN features. C1 to C10 are the ten phrases of the Oulu dataset listed in Table 2.3.

In each run, 19 speakers were used to train the HMMs. The main difference from the GRID dataset is that the phrases in the Oulu dataset are much longer than the digits in the GRID dataset. The histogram in Figure 6.2 shows the length distribution of the Oulu phrases. Each phrase was modeled by an HMM with 9 emitting states and GMMs with two Gaussian components and diagonal covariance matrices. The first set of experiments compares the performance of the Mu-MABoost features, i.e. when posterior probabilities are used to train GMM-HMMs, with the ISCN features, and reports the recognition rate when both feature sets are combined. In the following, the output posterior probabilities of Mu-MABoost are referred to as MAPP features, and the feature set containing both the Mu-MABoost posterior probabilities and the ISCN features is called ISMA.

        C1      C2      C3      C4      C5      C6      C7      C8      C9      C10
C1      79.25%  6.60%   2.83%   0       6.60%   1.89%   0.94%   0.94%   0       0.94%
C2      5.81%   80.23%  0       2.33%   2.33%   0       0       0       0       9.30%
C3      1.27%   0       77.22%  1.27%   0       11.39%  0       8.86%   0       0
C4      0       3.00%   3.00%   74.00%  6.00%   1.00%   1.00%   2.00%   3.00%   7.00%
C5      3.57%   0.89%   4.46%   8.04%   67.86%  3.57%   0       6.25%   0.89%   4.46%
C6      1.85%   0       12.04%  0       1.85%   60.19%  0       24.07%  0       0
C7      0       1.28%   0       1.28%   0       0       87.18%  0       8.97%   1.28%
C8      1.14%   0       7.95%   1.14%   4.55%   19.32%  0       64.77%  1.14%   0
C9      0.79%   2.36%   5.51%   1.57%   0.79%   0.79%   18.90%  0       62.99%  6.30%
C10     1.75%   14.04%  0.88%   8.77%   0.88%   0       5.26%   0       7.02%   61.40%

Recognition rate = 70.5411%

Figure 6.4: Confusion matrix of the visual recognizer with ISMA features. C8 (thank you) is the most frequently misclassified phrase due to its confusion with C6 (see you). Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class.

As seen in Figure 6.3, ISMA obtains 70.5% accuracy, which is the highest recognition rate reported for this dataset.


Figure 6.5: Recognition accuracy per person for speakers in the Oulu dataset. The ISMA accuracy varies from 45 to 94 percent over 20 speakers.

MAPP and ISCN features obtain 63% and 55% accuracy, respectively. The worst phrase recognition rate is for Thank you (C8) due to its confusion with the similar phrase see you (C6). This confusion can be seen in the ISMA confusion matrix depicted in Figure 6.4. From the visual speech viewpoint, the only difference between these two phrases is in the tongue position, which is usually not captured in videos. Grouping these two phrases into one class improves the final classification rate by 3.6%. Figure 6.5 reports the recognition accuracy for each speaker. Although the inter-speaker variation of the accuracies obtained with ISMA and MAPP features is smaller than with ISCN features (the standard deviation of the ISMA accuracy over 20 speakers is 2.98 compared to 3.89 for ISCN features), it is still not sufficiently small for the recognizer to be called reliable.

SNR        ISMA+MFCC   MAPP+MFCC   ISCN+MFCC   MFCC
∞ dB       88.27       88.28       79.36       94.99
10 dB      80.05       73.13       65.93       42.77
5 dB       71.84       61.74       58.52       20.93
0 dB       65.14       56.62       49.52       14.93
−5 dB      57.54       45.1        41.17       12.12
−10 dB     60.74       40.61       35.59       12.03
−15 dB     55.03       43.57       34.38       11.72
−20 dB     50.72       38.97       31.87       10.22

Table 6.1: Recognition rates in percent at different noise levels for the audio-visual recognizer trained with 3 different feature sets and the audio-only recognizer trained with MFCC features. Second column: ISMA features combined with MFCC features extracted from the audio data. Third column: audio-visual recognizer trained with a feature set containing MAPP and MFCC features. Fourth column: ISCN with MFCC features. The fifth column reports the accuracy of the audio-only recognizer at various SNRs.

Table 6.1 reports the accuracy of the audio-visual speech recognizer for ISMA with MFCC, MAPP with MFCC and ISCN with MFCC features when the audio signals are corrupted by additive white noise at the SNR levels {−20, −15, −10, −5, 0, 5, 10, ∞} dB. The accuracy of the audio-only recognizer is listed in the last column. Two interesting points can be observed in this table. First, the performance of the audio-visual recognizer is worse than that of the audio-only recognizer in the noise-free scenario. This is an important indication of the unreliability of the visual features. If the visual features were reliable (that is, if there were no mismatch between the distributions of the visual features in the training and test sets), adding visual features to the audio feature set could only improve the performance, by the Bayes theorem. However, as we observe, in reality adding a visual feature set to the audio features may in fact reduce the performance due to the speaker-dependent nature of visual cues. Second, using visual cues at low SNRs largely improves the recognition accuracy. As seen, at 0 dB for instance, the audio-only recognizer performs barely better than random guessing, while the audio-visual recognizer correctly classifies 65% of the phrases.

6.5.2 GRID: Digit recognition

The second set of experiments was conducted on the GRID dataset [CBCS06]. GRID is a relatively large dataset with 32 speakers and 1000 recordings per speaker (for more details see Chapter 2). In the following tests, we only used the recordings of the first sixteen speakers of this dataset. For each speaker, we extracted the digits 0-9 from the first 400 utterances.


Figure 6.6: Histogram of digit lengths in the GRID dataset. Most digits were between 6 and 10 frames long.


Figure 6.7: Digit recognition accuracy for ISMA, MAPP and ISCN features.

Since in the GRID dataset the talkers had 3 seconds to produce each sentence, the spoken digits extracted from these sentences are much shorter than the phrases in the Oulu dataset. Moreover, to reduce the sample size, the video data were down-sampled to 15 frames per second, which further reduced the number of frames per digit. The histogram in Figure 6.6 shows the length distribution of the GRID digits. As seen in Figure 6.6, most digits are shorter than 10 frames and many of them are as short as 6 frames. Hence, unlike for the Oulu dataset, we use HMMs with 3 emitting states to model the GRID digits. As in the previous experiments, we first compare the performance of the MAPP features, i.e. when posterior probabilities are used to train GMM-HMMs, the ISCN features, and the ISMA features, which are the combination of MAPP and ISCN features. Figure 6.7 compares the classification power of the ISMA, MAPP and ISCN features.

        zero    one     two     three   four    five    six     seven   eight   nine
zero    62.93%  0.86%   6.03%   0       0       0.86%   4.31%   22.41%  0       2.59%
one     0       70.50%  2.16%   10.07%  5.04%   4.32%   0       0       3.60%   4.32%
two     17.71%  2.86%   60.57%  1.14%   2.86%   1.71%   1.71%   4.57%   1.14%   5.71%
three   0       5.74%   1.64%   78.69%  5.74%   0       0.82%   4.10%   0       3.28%
four    0.96%   11.96%  9.09%   8.61%   63.16%  0.48%   0.48%   1.91%   0.48%   2.87%
five    1.78%   0       0.59%   5.92%   0.59%   79.29%  2.37%   3.55%   1.78%   4.14%
six     5.65%   0       0.56%   0       0       2.26%   54.24%  3.39%   13.56%  20.34%
seven   18.97%  0.86%   3.45%   1.72%   0       0.86%   5.17%   65.52%  1.72%   1.72%
eight   0.70%   1.40%   2.10%   0.70%   0       2.10%   9.09%   2.10%   64.34%  17.48%
nine    2.42%   3.23%   7.26%   4.84%   3.23%   10.48%  18.55%  1.61%   13.71%  34.68%

Recognition rate = 63.4899%

Figure 6.8: GRID dataset: confusion matrix of the visual recognizer with ISMA features. Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class.

Similar to the Oulu dataset, the ISMA features obtain the highest accuracy, i.e., 63%, followed by the MAPP features with a 59.5% digit classification rate. Moreover, at least 50% of the occurrences of every digit except nine are correctly classified by the ISMA features.


Figure 6.9: Recognition accuracy per person for the first 16 speakers of the GRID dataset. The ISMA accuracy varies from 37 to 81 percent over the 16 speakers. Removing the ninth speaker from the dataset increases the mean accuracy by 2%.

The confusion matrix depicted in Figure 6.8 reveals that nine is commonly confused with eight or six. This confusion can be attributed to the fact that their vowels, i.e., /ay/, /ey/ and /i/, are all mapped to the same viseme /I (see Table 3.10). While ISMA obtains a promising accuracy of 63% on such a difficult dataset (due to the short length of the digits), its variation over the speakers is still not satisfactory. As seen in Figure 6.9, the ISMA accuracy is as low as 37% for one speaker. This is due to the fact that lipreading is by nature speaker-dependent. It is a common observation that some people hardly move their lips while speaking. This translates into different degrees of viseme visibility for different speakers, owing to their lip shapes, facial characteristics and speaking styles.

SNR        ISMA+MFCC   MAPP+MFCC   ISCN+MFCC   MFCC
∞ dB       81.29       87.66       72.02       95.48
10 dB      73.13       74.6        58.97       45.88
5 dB       70.32       69.51       54.39       29.19
0 dB       67.04       63.28       49.64       17.69
−5 dB      64.41       56.64       44.4        11.3
−10 dB     62.01       51.41       39.63       11.24
−15 dB     60.53       48.06       37.01       11.17
−20 dB     58.93       46.66       35.6        11.24

Table 6.2: Recognition rates in percent at different noise levels for the audio-visual recognizer trained with 3 different feature sets and the audio-only recognizer trained with MFCC features. Second column: ISMA features combined with MFCC features extracted from the audio data. Third column: audio-visual recognizer trained with a feature set containing MAPP and MFCC features. Fourth column: ISCN with MFCC features. The last column reports the accuracy of the audio-only recognizer at various signal-to-noise ratios.

Table 6.2 reports the accuracy of the audio-visual digit recognizer when (I) ISMA is used as visual features and MFCC as audio features, (II) MAPP and MFCC features are employed and (III) ISCN and MFCC features are used to train the recognizer. The last column reports the accuracy of the audio-only recognizer at various signal-to-noise ratios. A surprising result is that the ISMA+MFCC features obtain significantly worse results than the MAPP+MFCC features on clean speech. However, as the SNR decreases, ISMA+MFCC tends to outperform the MAPP+MFCC features, as expected. This is due to the fact that the number of ISMA features (100) is 5 times larger than the number of MAPP features, and when used together with the 39 MFCC features they dominate the value of the log-likelihood. With a proper weighting scheme this effect can be substantially alleviated. As for the Oulu dataset, we again see that the performance of the audio-only recognizer can be largely improved at low SNRs. At high SNR values, however, the audio-only recognizer still works better than the audio-visual recognizer, meaning that a more sophisticated fusion scheme is needed to properly weight the audio and visual modalities according to their contributions.

6.5.3 AVletter: Letter recognition

In this section a letter recognizer was trained to classify the 26 English letters A-Z. Each English letter, except W, can be transcribed by a sequence of 1 to 3 phonemes. Since visual speech has a much lower time resolution than audio speech, a speech unit as small as a phoneme may only be captured in one frame, which is hardly enough to learn its corresponding statistics. Modeling longer speech units (but not as long as words), such as the short phoneme sequences in this dataset, allows us to model visual speech more precisely while it is still feasible to generalize the approach to a large-vocabulary automatic speech recognition task. Figure 6.10 compares the recognition accuracy of the ISMA, MAPP and ISCN features. Unlike on the previous datasets, the ISCN and MAPP features achieve similar accuracy rates (32% and 33%, respectively) on AVletter. ISMA, however, reaches a 40.3% accuracy rate.


Figure 6.10: Recognition accuracy of the letters for ISMA, MAPP and ISCN features.

While for some letters such as Y, W, M and F the accuracy rate is over 70%, for some other letters the visual recognizer could hardly achieve a performance better than random guessing. In the AVletter dataset, each letter is repeated 30 times by 10 different speakers. The confusion matrix of the letter recognition task is quite informative. As shown in Figure 6.11, some letters such as C, D and T, which have exactly the same visual transcription, i.e. they consist of exactly the same sequence of visemes, are commonly confused. For some other letters such as U and Q, which have different viseme transcriptions, the source of the confusion is the fact that they share the same dominant viseme. For instance, the dominant viseme of U and Q is /B (which corresponds to the phoneme /uh/; see Table 3.10). By governing the lip shape and preventing other visemes from having a visible effect on the visual speech, /B hides the other visemes and results in the confusion of Q and U. The same argument holds for the confusion of S and X (in this case the dominant viseme is /I, which corresponds to the phoneme /s/). In lipreading, visemes are considered the smallest visibly distinguishable speech units. However, since there is no unique mapping from phonemes to visemes, different mappings result in different viseme sets. From the classification viewpoint, the best viseme group is a set of visemes that are highly separable given the visual features. With this definition in mind, it is interesting to investigate whether the commonly accepted viseme set listed in Table 3.10 is optimal. In order to find a data-driven optimal viseme set with k visemes that minimizes the classification error, we grouped the 26 English letters into 13 classes by partitioning the confusion matrix into 13 almost disjoint principal submatrices, i.e., the rows and columns are re-arranged in order to obtain a new matrix in which 13 non-zero principal submatrices are observable. While the off-diagonal elements within these principal submatrices are non-zero, the off-diagonal elements between different principal submatrices should be small.

Visual groups: {I, R}  {S, X}  {M, N}  {J, Z}  {B, P}  {O}  {F}  {W}  {A, E, K, L}  {H}  {G, Y}  {C, D, T, V}  {Q, U}

Table 6.3: Regrouping the 26 English letters into a smaller set containing 13 visibly distinguishable elements.

    A    B    C    D    E    F    G    H    I    J    K    L    M    N    O    P    Q    R    S    T    U    V    W    X    Y    Z
A  37.5    0    0    0  9.4    0    0    0 12.5    0  9.4 18.8    0  9.4    0    0    0  3.1    0    0    0    0    0    0    0    0
B     0 56.2    0    0    0    0    0    0    0    0    0    0    0    0    0 37.5    0    0  6.2    0    0    0    0    0    0    0
C     0    0 11.8 29.4    0    0    0    0 11.8    0    0    0    0    0    0    0    0 11.8    0 17.6 11.8    0    0    0    0  5.9
D   2.9    0 20.0 17.1  5.7    0    0    0  2.9    0  5.7    0    0  2.9    0    0  5.7  5.7    0 22.9    0  2.9    0    0    0  5.7
E  11.1    0  2.8  5.6 38.9    0    0  5.6    0    0  5.6  8.3    0  8.3    0  2.8  2.8    0    0  5.6  2.8    0    0    0    0    0
F     0    0  3.2    0    0 61.3  6.5    0    0  3.2    0  6.5  9.7    0  3.2    0    0    0  3.2    0    0  3.2    0    0    0    0
G     0  5.3  5.3    0  5.3 10.5 21.1    0    0 10.5  5.3    0    0    0    0  5.3  5.3    0    0    0 10.5    0    0    0 15.8    0
H   3.8    0    0    0    0    0    0 57.7    0    0    0  3.8    0    0  3.8    0  3.8    0 11.5    0  3.8  3.8  3.8  3.8    0    0
I   5.1    0  2.6  2.6 12.8    0  2.6    0 33.3    0 10.3    0    0  2.6  2.6    0  7.7 12.8    0  2.6  2.6    0    0    0    0    0
J     0    0  5.3  5.3    0    0 15.8    0    0 52.6    0    0    0    0    0  5.3    0    0    0    0    0    0    0    0    0 15.8
K   3.0    0    0  6.1  3.0    0    0  3.0  6.1  6.1 39.4  3.0    0  9.1    0    0    0    0    0 12.1  3.0    0    0  6.1    0    0
L     0 25.0    0    0 12.5    0    0 25.0    0    0    0 12.5    0    0 12.5    0    0    0 12.5    0    0    0    0    0    0    0
M     0  3.2    0    0  3.2  6.5    0  3.2    0  3.2    0  6.5 58.1  3.2    0  6.5    0    0    0    0    0    0    0  6.5    0    0
N  16.7    0    0    0  3.7  3.7    0  1.9  5.6    0  1.9 13.0  7.4 18.5  1.9    0    0  1.9  7.4  1.9    0    0    0 13.0    0  1.9
O     0    0    0    0    0    0    0    0  4.2    0    0    0    0    0 70.8    0  8.3  8.3    0    0  8.3    0    0    0    0    0
P     0 36.4    0    0    0    0    0    0    0    0    0    0    0    0    0 48.5    0    0    0    0    0  6.1  3.0    0  6.1    0
Q     0    0    0    0    0    0  2.4  2.4  2.4    0    0  4.9  2.4    0  7.3    0 36.6  7.3    0    0 31.7  2.4    0    0    0    0
R     0    0    0  5.3    0    0    0    0 10.5  5.3    0    0    0  5.3  5.3    0    0 63.2    0    0  5.3    0    0    0    0    0
S     0    0    0    0    0  9.1    0 11.4    0    0    0  4.5    0 11.4    0    0    0    0 31.8    0    0    0    0 27.3    0  4.5
T     0    0 25.6 15.4    0    0 10.3    0    0  7.7  2.6    0    0    0    0    0    0    0    0 23.1    0  5.1    0    0    0 10.3
U     0    0    0  8.3    0    0    0    0  8.3    0    0    0    0    0 25.0    0 33.3    0    0    0 25.0    0    0    0    0    0
V     0  2.3 11.4  9.1    0    0  9.1    0    0  6.8  2.3    0    0    0    0  2.3  2.3  2.3    0  4.5  2.3 43.2    0    0  2.3    0
W     0    0    0    0    0    0  3.7    0    0    0    0    0  3.7    0    0    0    0    0    0    0    0    0 88.9    0    0  3.7
X     0    0    0    0    0    0    0  8.7    0    0  4.3 13.0 13.0  8.7  4.3    0    0  4.3 26.1    0    0    0    0 17.4    0    0
Y     0  4.2    0    0    0    0 18.8    0    0 12.5    0    0    0    0    0  4.2    0    0    0    0  4.2  2.1  4.2    0 50.0    0
Z     0  6.7  3.3  3.3    0  3.3  3.3    0    0  3.3  3.3    0    0    0    0    0    0    0    0    0    0  6.7  6.7  6.7    0 53.3

Recognition rate = 40.8974%

Figure 6.11: AVletter dataset: confusion matrix of the visual recognizer with ISMA features.

This partitioning problem, however, is NP-hard. A commonly used approximation to find a feasible solution is the spectral clustering method [vL07]. Applying this method to the confusion matrix given in Figure 6.11 yields the mapping described in Table 6.3. Using these 13 classes instead of the 26 English letters increases the classification rate to 61.3%. As can be seen in Table 6.3, most of the letters with similar visemic transcriptions, such as C, D and T, are grouped together. S and X are also put in the same class, and Q and U are likewise considered indistinguishable.
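A minimal sketch of this grouping step is shown below: the symmetrized confusion matrix is used as an affinity matrix and spectral clustering partitions the 26 letters into 13 groups, which approximates the NP-hard principal-submatrix search described above. The clustering parameters are illustrative choices, not the exact settings used in the thesis.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def group_letters(confusion, labels, n_groups=13):
    """confusion: (26, 26) row-stochastic confusion matrix (in percent);
    labels: list of the 26 letters in the same order as the matrix rows."""
    affinity = (confusion + confusion.T) / 2.0   # symmetrize before clustering
    sc = SpectralClustering(n_clusters=n_groups, affinity="precomputed",
                            assign_labels="discretize", random_state=0)
    assignment = sc.fit_predict(affinity)
    groups = {}
    for letter, g in zip(labels, assignment):
        groups.setdefault(g, []).append(letter)
    return list(groups.values())
```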

6.6 Conclusion and Discussion

In this chapter, we developed a lipreading system based on the multiclass MABoost algorithm presented in Chapter 4 and the COBRA-selected ISCN features introduced in Chapter 3. We showed that training a multiclass classifier with 3D-SIFT features extracted from multiple color spaces and using the output posterior probabilities as visual features together with the ISCN features yields a promising lipreading system. This method obtains a 70.5% accuracy rate on the Oulu dataset, which is 6% higher than the spatio-temporal local binary pattern (LBP) based method reported in [ZBP09], and yields 63.5% digit recognition accuracy on the GRID dataset, which is 4% higher than the AAM based algorithm suggested in [LHT+09]. Moreover, both methods described in [ZBP09] and [LHT+09] employ some semi-automatic lip-region detection method, which clearly improves the final accuracy, while in our work we only used the fully automatic facial point detection method described in [Dantone:12]. ISMA features, which are the combination of MAPP and ISCN features, consistently outperform the other methods over the 3 audio-visual datasets employed in this chapter. However, the inter-speaker variation of the accuracy of the ISMA features is still not sufficiently small. While the accuracy of the lipreading system was in an acceptable range for most speakers, each dataset contained some speakers for which the accuracy was significantly lower than the average due to their particular speaking styles, lip shapes and other facial characteristics. As shown in our experiments, the mismatch between training and test conditions leads to an audio-visual recognizer with worse performance than that of the audio-only recognizer when the speech signals are clean. However, as the SNR of the audio signals decreases, the audio-visual recognizer outperforms the audio-only recognizer by a large margin. In order to obtain an audio-visual recognizer that compares favorably with the audio-only recognizer in noise-free scenarios and outperforms it in noisy environments, we need a more sophisticated information fusion algorithm that weights the audio and visual modalities according to their information values. Such a method is developed in the next chapter.

Chapter 7

Audio Visual Information Fusion

Information fusion in multi-sensory systems is a challenging task due to the fact that different sensors capture the underlying phenomena differently. This raises many problems, such as having data streams with different temporal rates, various data representations, different dynamic ranges, different sensitivity levels and different noise types. This chapter looks into the multi-modal information fusion problem in the context of audio-visual speech recognition. By investigating the sources of uncertainty in the system, i.e., the estimation and model errors, it is shown that a more suitable criterion for estimating the reliability weights assigned to the modalities is to maximize the area under a receiver operating characteristic curve (AUC), rather than existing criteria such as recognition accuracy or mutual information. Moreover, here we estimate a reliability weight for each feature. This generalizes the (conventional) two-dimensional stream weight estimation problem to a fairly high-dimensional problem. In order to efficiently estimate the reliability weights, we use a smoothed AUC function and adopt a variant of the projected gradient descent algorithm to maximize the AUC criterion in an online manner. The proposed algorithm results in a robust information fusion scheme that can cope with audio-modality failure. Moreover, it is shown that by using the one-weight-per-feature strategy the audio recognition rate is greatly improved in noisy environments.

7.1 The Problem of Audio Visual Fusion

Having multiple observation streams in a pattern recognition task, the central question is how to optimally combine their information. The various information fusion algorithms suggested in the context of audio-visual automatic speech recognition (AV-ASR) can be categorized into two groups. The first group consists of feature fusion methods, where audio and video features are concatenated to constitute a super-feature vector. Linear discriminant analysis (LDA) or other dimension-reduction methods are usually applied afterwards to reduce the dimensionality of the final feature vector. Though straightforward from a mathematical viewpoint, these approaches give the impression of mixing the unmixable and have often been reported to be inferior to decision fusion techniques [PNLM04]. The second group of approaches, i.e., decision fusion methods, analyze the different streams separately and combine the decisions of the single-modality classifiers at the state level, or even at the phoneme or word level in HMM-based recognizers. This strategy allows each classifier to be optimized for the characteristics of its input features and thus achieves higher accuracy. A common decision fusion method for combining information sources, or as we call them here, feature streams, in a statistical pattern recognition framework is to use the product rule on their posterior probabilities, i.e., Bayes fusion [KHDM98]. This approach results in a naive Bayes classifier, which coincides with the Bayes classifier when the feature streams are class-conditionally independent. Despite the theoretical appeal of Bayes fusion, in reality naive Bayes classification may severely deviate from Bayes classification for the following reasons.

1. Model error: Any assumption of a parametric distribution may cause a mismatch between the true and the estimated distribution [RCB97].

2. Estimation error: The parameter estimation error, which mainly originates from insufficient training data, may shift the decision boundary and thus impose additional uncertainty. Moreover, the mismatch between the distributions of training and test data (because of the time-varying environment) can also be seen as a result of insufficient training data.

3. Independence Assumption: The dependency among features violates the central assumption in naive Bayes classification.

Different approaches have been suggested to alleviate these problems. For instance, in [PLHZ04] the audio and visual streams are coupled through a low-dimensional random variable (in fact two-dimensional in their work). That is, the class conditional distribution consists of the product of three terms: the likelihood of the audio stream, the likelihood of the video stream and a factor that models their dependency. This interesting idea, however, only deals with the violation of the independence assumption. A more commonly used approach to tackle the above problems is to assign a reliability weight (an exponent weight) to each stream pdf. This approach has been empirically shown to be effective [WSC99], [CR98], [CDM02], [DL00] in audio-visual speech and speaker recognition, especially in the presence of varying additive acoustic noise. Its effectiveness was later theoretically explored in [PSSD06] to derive an unsupervised weight estimation scheme. The authors assumed that the difference between the weighted likelihood and the Bayes likelihood (the mismatch error) is a normally distributed random variable and showed that a proper choice of weights can minimize the variance of the mismatch error. In fact, based on their approach it can be seen that any (linear or non-linear) function of the stream pdfs may be used to minimize the variance of the mismatch error. For instance, in [LCSC05] it was proposed to use a threshold-based combination of the sum rule and the product rule to fuse the class conditional likelihoods. It is, however, noteworthy that minimizing the variance of the mismatch error does not necessarily minimize the classification error. In fact, as shown by Friedman et al. [FF97], under some conditions the exact opposite may be true, i.e., larger variances may result in lower classification error rates. No matter what functional form is assumed to combine the stream pdfs (exponent weights, linear combinations, etc.), we need a criterion to estimate the fusion parameters. Minimum word error rate is the most common criterion employed to estimate the parameters. In [PG98], for instance, a discriminative training method was proposed to minimize a smoothed error function (instead of the 0-1 loss function). Other existing criteria are the maximum SNR value and maximum mutual information [ONS99], where the reliability of each stream is directly measured by the mutual information between the stream and the HMM states. The main shortcoming of these criteria is that they are only optimal for one realization of the training data, i.e., the training data at hand. Paradoxically, however, these approaches are supposed to perform well in mismatching training and test conditions, where the test data is corrupted with different background noises and thus clearly has a different distribution than the training data. Even more importantly, it is a well-known fact that visual features are highly speaker-dependent [CHLN08]. That is, the distribution of visual features in the training data may differ from that in the test data. A justification for this rather paradoxical parameter estimation procedure is that, in the absence of information about the test condition, the obvious strategy is to minimize the error rate over the training data in the hope that it will generalize well enough to achieve satisfactory results on the test data. Adaptive weight estimation is a commonly suggested approach to tackle this problem [GVN+01], [RS12].
This approach, however, may suffer from several issues, including poor weight estimation accuracy due to its unsupervised nature, heavy computational complexity (since most of these approaches need to estimate the audio and video quality in real time) and system complexity. The main question answered in this work is whether it is possible to estimate the fusion weights in a static, supervised manner and yet achieve a robust recognition system even under mismatch conditions. The contribution of this work is threefold:

1. We suggest using the area under a receiver operating characteristic curve (AUC) to estimate the exponent weights. AUC is linearly related to the expectation of the classification accuracy and thus the weights maximizing AUC achieve satisfactory results over a wider range of mismatch conditions.

2. We assign a reliability weight to each feature rather than to each modality. This approach will in particular allow us to model the inter-dependency of visual features more effectively by increasing the number of hyper-parameters that describe the final combined pdfs.

3. In order to maximize the AUC value, a non-convex approximation of AUC is adopted and optimized by a gradient descent type algorithm.

AUC is a commonly used measure for evaluating classification rules and ranking algorithms [HT01]. Interestingly, AUC can be seen as the average of the classification rates over different classification thresholds under a uniform distribution of classification costs (equivalently, in our work, a uniform distribution over the a priori probabilities of the classes). Based on this interpretation of AUC, maximizing the AUC value is equivalent to minimizing the expectation of the classification error. We relate the notion of classification threshold to the mismatch error and show that the weights maximizing the AUC value (minimizing the average error rate) result in a more robust AV-ASR system than weights that are estimated to be optimal for only one realization of the mismatch error. Unlike the conventional methods, where only two stream weights¹, one for audio and one for video, are used to integrate the audio-visual information sources, in this work a reliability weight is assigned to each feature. This generalization is usually referred to as the weighted naive Bayes classifier in the literature. There is a rich body of theoretical and empirical work on the weighted naive Bayes classifier [ZCCW13, and references therein]. It is also a fairly common approach in audio-only ASR, though, to the best of our knowledge, it has not been utilized for AV-ASR. Moreover, we further generalize this model by assigning an individual weight vector to each class. These generalizations are beneficial for at least two reasons: First, these weights are effective tools to model the relatively high correlation among visual features. Second, different visual (audio) features may have different Bayes errors (informativeness) and different levels of robustness against the mismatch error. However, due to the increase in the number of parameters, a computationally efficient method is required to estimate the weights. To this end, we use a variant of the online projected gradient descent algorithm [Zin03] which, as shown in [ZJYH11], can efficiently maximize the AUC value over a large dataset.

7.2 Preliminaries & Model Description

Throughout this work, we consider an instance of the pattern classification problem. Given a training dataset $\mathcal{D} = \{(O_i, y_i)\}$, $1 \le i \le N$, containing audio-visual recordings (samples) $O_i$ of $K$ distinct words (classes), i.e., $y_i \in \{1, \ldots, K\}$, the goal is to train a classifier that optimally assigns a class label $y$ to any new test sample. Following the common approach in AV-ASR,

¹In fact, in most of the previous works the summation of the weights is constrained to be one and, consequently, only one weight needs to be estimated.

a multi-stream HMM is trained for each class to capture its statistical properties and represent the class conditional distribution. The models and class conditional distributions are denoted by $M_k$, $1 \le k \le K$, and $P(O|M_k)$, respectively. The Bayes classification rule can then be used to assign a class label to any given sample $O$:

$$\hat{y} = \operatorname*{argmax}_{1 \le k \le K} P(O|M_k)\,\pi_k = \operatorname*{argmax}_{1 \le k \le K} \big[\log P(O|M_k) + \log \pi_k\big] \qquad (7.1)$$

where $\pi_k = P(y = k)$ is the a priori probability of class $k$. A common practice in speech recognition, including this work, is to approximate $P(O|M)$ with the probability of the most probable state sequence of $M$ computed by the Viterbi algorithm.² Given a sample $O = \{o^1_{AV}, \ldots, o^T_{AV}\}$ (where $o^t_{AV} = [o^t_A, o^t_V]$ is the bimodal observation vector for this sample at time $t$), the state emission probability (observation probability) of an HMM in state $S$ at time $t$ can be written as:

$$P(o^t_{AV} \mid q_t = S) = P(o^t_A \mid q_t = S)^{w_A} \; P(o^t_V \mid q_t = S)^{w_V} \qquad (7.2)$$

where $w_A$ and $w_V$ are appropriate stream weights that account for the difference in reliability of each modality, and $o^t_A = [o^t_{A_1}, \ldots, o^t_{A_{|A|}}]$ and $o^t_V = [o^t_{V_1}, \ldots, o^t_{V_{|V|}}]$ are the audio and visual observation vectors, respectively. Moreover, $|A|$ and $|V|$ are the dimensionalities of the audio and visual feature vectors. This is a fairly standard multi-stream likelihood model. Assuming independence among the audio features and among the visual features, we further generalize this model by adopting a reliability weight for each feature. That is,

$$P(o^t_{AV} \mid q_t = S) = \prod_{i=1}^{|A|} P(o^t_{A_i} \mid q_t = S)^{w_{A_i}} \; \prod_{j=1}^{|V|} P(o^t_{V_j} \mid q_t = S)^{w_{V_j}} \qquad (7.3)$$

Suppose $\mathcal{Q} = (q_1, \ldots, q_T)$ is a sequence of random variables taking values in some finite set $\mathcal{S} = \{S_1, \ldots, S_n\}$, the state space. Given a model $M$ and an input sample $O = \{o^1_{AV}, \ldots, o^T_{AV}\}$, the log-likelihood can be approximated

²Since we use the weighted pdfs, this approximation does not give rise to a valid probability distribution function anymore; for simplicity, however, we still refer to it as a likelihood.

as follows:

$$\log P(O|M) \approx \log \hat{P}(O|M) = \max_{\mathcal{Q}} \log P(O, \mathcal{Q}|M) = \log P(O, \mathcal{Q}^*|M) = \sum_{t=1}^{T} \log P(o^t_{AV}|q_t) + \sum_{t=1}^{T} \log a_{q_{t-1},q_t} \qquad (7.4)$$

where $\mathcal{Q}^* = (q_1, \ldots, q_T)$ is the most probable sequence of states in $M$ computed by the Viterbi algorithm, $\hat{P}(O|M)$ is the approximation of $P(O|M)$ and $a_{q_{t-1},q_t}$ are the transition probabilities, with $a_{q_0,q_1}$ equal to 1. By substituting $P(o^t_{AV}|q_t)$ from (7.3) into (7.4), the (approximation of the) log-likelihood can be written as a linear function of the reliability weights:

$$\log \hat{P}(O|M) = \sum_{t=1}^{T} \sum_{s \in \{A,V\}} \sum_{j=1}^{|s|} w_{s_j} \log P(o^t_{s_j}|q_t) + \sum_{t=1}^{T} \log a_{q_{t-1},q_t} = \mathbf{w}^\top \mathbf{p} \qquad (7.5)$$

where $\top$ stands for the vector transpose and

$$\mathbf{w} = [w_0, w_{A_1}, \ldots, w_{A_{|A|}}, w_{V_1}, \ldots, w_{V_{|V|}}]^\top \qquad (7.6)$$
$$\mathbf{p} = \sum_{t=1}^{T} \big[\log a_{q_{t-1},q_t},\; \log P(o^t_{A_1}|q_t),\; \ldots,\; \log P(o^t_{V_{|V|}}|q_t)\big]^\top$$

Each entry $p[l]$, $l \in \{1, \ldots, |A|+|V|\}$, of $\mathbf{p}$ indicates the contribution of the corresponding feature to the total log-likelihood value. The first entry $p[0]$ is the contribution of the transition probabilities to the log-likelihood, and $w_0$ is a weight to adjust this term. $w_0$ can either be fixed to 1 or be treated as a variable to be optimized. The next step in estimating the reliability weights is to select an appropriate criterion. Denoting the criterion by $f(\log P(O|M))$, which is a function of the log-likelihood, the goal is to optimize $f$ with respect to $\mathbf{w}$. It is important to remember that, given an optimal state sequence $\mathcal{Q}^*$, the log-likelihood is a linear function of $\mathbf{w}$. However, in order to find the optimal state sequence $\mathcal{Q}^*$ itself, the Viterbi algorithm needs $\mathbf{w}$. Thus, the optimization process $\max_{\mathbf{w}} f(\log P(O|M))$ can be approximately solved by the following iterative process, which jointly optimizes $\mathbf{w}$ and $\mathcal{Q}^*$.

Algorithm 7.1: Iterative joint optimization

Input: set $\mathbf{w}(1) = [1, \ldots, 1]^\top$, $\epsilon = \infty$ and $\theta$ = stopping threshold
while $\epsilon > \theta$ do

$$\mathcal{Q}_j = \operatorname*{argmax}_{\mathcal{Q}} \log P(O, \mathcal{Q} \mid M, \mathbf{w}(j)) \qquad (7.7)$$
$$\text{Calculate } \mathbf{p}_j \text{ from (7.6)} \qquad (7.8)$$
$$\mathbf{w}(j+1) = \operatorname*{argmax}_{\mathbf{w}} f(\mathbf{w}^\top \mathbf{p}_j) \qquad (7.9)$$

$\epsilon = \|\mathbf{w}(j+1) - \mathbf{w}(j)\|_2$
$j \mathrel{+}= 1$

end
Output: The final weight vector $\mathbf{w}$.

The argument $\mathbf{w}(j)$ in (7.7) is a reminder that the Viterbi algorithm needs the values of the weights in order to compute the optimal path, and $\theta$ is a stopping threshold usually set to a small value. In our work it was set to $10^{-6}$, and at most three iterations were sufficient for convergence. Throughout the rest of the chapter we mainly focus on the criterion function $f$ and how to optimize it, i.e., equation (7.9) of Algorithm 7.1. Classification accuracy is a commonly used criterion for estimating the free parameters of a model in a classification task. However, as discussed before, this criterion results in a sub-optimal solution in mismatched training-testing scenarios. Instead, we use a more robust criterion that takes the mismatch error into account, namely maximizing the AUC value.
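To make the linear form of Eqs. (7.5)-(7.6) concrete, the following sketch accumulates the vector p from per-frame log-probabilities along a given Viterbi path and evaluates the weighted log-likelihood w^T p, which is the quantity recomputed in each iteration of Algorithm 7.1. Shapes and names are illustrative assumptions.

```python
import numpy as np

def accumulate_p(frame_logprobs, transition_logprobs):
    """frame_logprobs: (T, |A|+|V|) array of log P(o_t^{s_j} | q_t) along the Viterbi path;
    transition_logprobs: (T,) array of log a_{q_{t-1}, q_t}, with the first entry 0."""
    p0 = transition_logprobs.sum()                 # contribution of the transitions, p[0]
    return np.concatenate([[p0], frame_logprobs.sum(axis=0)])

def weighted_loglik(w, p):
    """Eq. (7.5): log P_hat(O | M) = w^T p."""
    return float(w @ p)
```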

7.3 AUC: The Area Under an ROC Curve

Receiver operating characteristic (ROC) curves are commonly used two-dimensional graphs which were originally developed in classical detection theory and later found applications in machine learning due to their invariance to monotonic score transformations and class distributions. An ROC curve is defined as a plot of the true positive (tp) rate on the Y axis against the false positive (fp) rate on the X axis, and is thus intrinsically defined for binary classification (detection) problems:

$$tp = \frac{\text{positives correctly classified}}{\text{total positives}}, \qquad fp = \frac{\text{negatives incorrectly classified}}{\text{total negatives}}$$


Figure 7.1: An example of a receiver operating characteristic curve.

From this definition, it is clear that the points lying on the Y = X line correspond to the random guessing strategy and the point (0, 1) represents optimal classification (see Figure 7.1). Given a confidence-rated classifier³, a threshold (or, in general, a decision boundary) can be used to produce the final binary decision.

³Confidence-rated classifiers yield an instance probability or score, a numeric value that represents the degree to which an instance is a member of a class.

Each threshold value produces a different point in ROC space. Clearly, when the threshold is $\infty$, both tp and fp are 0, and as the threshold approaches $-\infty$, both of these values are 1. For any threshold value in between, we get a point on a concave curve above the Y = X line, as in Figure 7.1. The AUC is then simply the area under the ROC curve. Following the AUC interpretation suggested in [CM03], consider a binary classification task with $m$ positive samples and $n$ negative samples. A confidence-rated classifier assigns a score to each sample. We assume the classifier outputs for the positive samples $x_1^+, \ldots, x_m^+$ and the negative samples $x_1^-, \ldots, x_n^-$ are ordered. The AUC is then calculated as:

$$A = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} I_{(x_i^+ > x_j^-)} \qquad (7.10)$$

where $I_{(\cdot)}$ is one if $(\cdot)$ holds, and zero otherwise. We remark that the AUC value is exactly the probability $P(X^+ > X^-)$, where $X^+$ is the random variable corresponding to the distribution of the outputs for the positive samples and $X^-$ the one corresponding to the negative samples. This interpretation was in fact the inspiration for using the AUC in this work. Under this interpretation, the AUC can be seen as a measure of the ranking quality of the classifier. As the AUC value increases, it becomes more probable that the score of a positive sample is larger than the score of any negative sample in the dataset. That is, in an audio-visual dataset, the scores of positive samples are likely to be higher than those of negative samples, independent of their speakers and the mismatch error corresponding to the samples. The AUC, however, can also be seen from another point of view, namely as the expectation of the classification accuracy. Consider a binary classification problem with two HMMs $M_1$ and $M_2$ trained with samples from the first and second class, respectively. We distinguish the "error free" class conditional probabilities (free from estimation error, but not from modeling error) from the estimated likelihoods by the subscript B, i.e., $P_B(O|M_k)$. Since we need a confidence-rated classifier, we have to define a discriminant function as follows to assign a score to each sample.

$$L(O) = \log P_B(O|M_1) - \log P_B(O|M_2) + \log\left(\frac{\pi_1}{\pi_2}\right) \qquad (7.11)$$

From (7.1) it is clear that, based on the optimal classification rule, $O$ belongs to class 1 if $L(O) \ge t$, and to class 2 otherwise, where $t$ is the classification

threshold, which is equal to zero in the Bayes rule. Let $g_1(L) = P(L(O) \mid y = 1)$ be the probability distribution function of the scores of the samples belonging to class 1 and $g_2(L)$ be the pdf of the scores of class 2. The classification error rate can then be written as:

$$e = c\,\pi_1 \int_{-\infty}^{t} g_1(L)\,dL + (2-c)\,\pi_2 \int_{t}^{+\infty} g_2(L)\,dL \qquad (7.12)$$

where $0 \le c \le 2$ is the classification cost. Consider both the classification cost and the threshold as random variables. Based on the following lemma of Flach et al. [FHF11], the AUC can be seen as a linear function of the expectation of the classification error $e$, where the expectation is taken with respect to $c$ and $t$.

Lemma. The expected error for a uniform cost distribution and uniform instance selection decreases linearly with the AUC as follows:

$$E_{c,t}\{e\} = 2\pi_1\pi_2(1 - \mathrm{AUC}) + \frac{\pi_1^2 + \pi_2^2}{2} \qquad (7.13)$$

where $E_x$ denotes the expectation with respect to the random variable $x$. Thus, the AUC is in fact the average of the classification rates over different classification thresholds and different classification costs. In the next subsection we connect the notions of classification threshold and classification cost to the mismatch error and show that, according to the above lemma, the AUC can be interpreted as the expectation of the classification error over different mismatch conditions.
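The pairwise form of Eq. (7.10) translates directly into a few lines of code; the sketch below estimates the AUC as the fraction of correctly ordered (positive, negative) score pairs. Names are illustrative.

```python
import numpy as np

def pairwise_auc(pos_scores, neg_scores):
    """Eq. (7.10): (1/mn) * sum_i sum_j I(x_i^+ > x_j^-)."""
    pos = np.asarray(pos_scores)[:, None]   # shape (m, 1)
    neg = np.asarray(neg_scores)[None, :]   # shape (1, n)
    return float(np.mean(pos > neg))
```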

7.3.1 AUC and Mismatch

Let us denote the difference between the true discriminant function and the estimated discriminant function by $Z(O)$, i.e.,

$$L(O) - \hat{L}(O) = Z(O) \qquad (7.14)$$

where $\hat{L}$ is the estimated discriminant function. In the absence of a priori knowledge about the mismatch error, the classification decision can be made with $\hat{L}$, that is, a sample $O$ belongs to class 1 if $\hat{L}(O) \ge 0$, and to class 2 otherwise. However, this classification rule is equivalent to $L(O) \ge Z(O)$. If we further assume that the mismatch error $Z(O) = Z$ is constant over all the samples in a given dataset, then $Z$ can be seen as the classification threshold. In other words, the mismatch error results in shifting the classification threshold from zero to $Z$. Note that $Z$ is a random variable varying from one dataset to another. Denote the mismatch errors of the training data and the test data by $Z_{tr}$ and $Z_{te}$, respectively. The reliability weights are usually selected in order to minimize the classification error over the training data, which is given by:

$$e_{tr} = \pi_1 \int_{-\infty}^{0} g_1(\hat{L}(\mathbf{w}))\,d\hat{L} + \pi_2 \int_{0}^{+\infty} g_2(\hat{L}(\mathbf{w}))\,d\hat{L} = \pi_1 \int_{-\infty}^{Z_{tr}} g_1(L)\,dL + \pi_2 \int_{Z_{tr}}^{+\infty} g_2(L)\,dL \qquad (7.15)$$

where the argument $\mathbf{w}$ emphasizes that the estimated score $\hat{L}(\mathbf{w})$ depends on the weight vector $\mathbf{w}$. However, since $Z_{tr}$ is likely to be different from $Z_{te}$, the estimated weights are not optimal for the test condition. Moreover, the a priori probabilities of the test data may differ significantly from those of the training data. This difference can be modeled by the classification cost $c$ in (7.13). Thus, a more reasonable strategy for selecting the reliability weights is to minimize $E_{c,Z}\{e\}$, where the expectation over $c$ averages over different class priors and the expectation over $Z$ accounts for different levels of mismatch error. Minimizing this expectation is equivalent to maximizing the AUC. It is important to note that this justification is only valid under some strong assumptions, i.e., (I) the mismatch error $Z(O) = Z$ is constant over all samples and (II) the multiclass AUC has similar properties to the binary AUC. Both of these assumptions may be violated in reality. Nevertheless, as we see in the experiments, the AUC maximization results in a robust weight estimation strategy. To adapt the AUC to our problem, we have to address three issues:

1. Generalize it to multiclass classification.
2. Transform the HMM outputs to comparable scores.
3. Reduce the computational complexity of the AUC computation.

These issues are discussed in the following sections.

7.3.2 AUC Generalization to the Multiclass Problem

For more than two classes, the AUC is too complex to be calculated. With only 3 classes, the ROC curve would have to be plotted in a 6-dimensional space [Lan00] and, in general, the dimension of an ROC curve grows with the square of the number of classes, $O(n^2 - n)$. To handle the multiclass situation, different methods have been suggested to approximate the multiclass AUC, including the one-versus-all method [PD00], pairwise AUC averaging [HT01], etc. Here, we used the one-versus-all approach due to its computational efficiency. For a k-class problem, it only needs to average over k AUC values:

$$A_{total} = \sum_{i=1}^{k} \pi_i A_i \qquad (7.16)$$

where $A_i$ is the AUC value of a binary classification task in which the $i$-th class is considered the positive class and the rest are considered negative. Equation (7.16) is an average of AUCs weighted by the class priors $\pi_i$. The disadvantage of this approach is that the AUC is now sensitive to the class distributions.
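Equation (7.16) can be sketched as a prior-weighted average of K one-versus-all AUC values computed from the class scores; the snippet below is a minimal illustration with assumed array shapes and names.

```python
import numpy as np

def multiclass_auc(scores, labels, priors):
    """scores: (n_samples, K) class scores L_k(O); labels: (n_samples,) in {0..K-1};
    priors: (K,) class priors pi_k."""
    K = scores.shape[1]
    total = 0.0
    for k in range(K):
        pos = scores[labels == k, k]
        neg = scores[labels != k, k]
        total += priors[k] * np.mean(pos[:, None] > neg[None, :])  # one-vs-all AUC A_k
    return float(total)
```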

7.3.3 Transforming HMM Outputs

Throughout the rest of this work, we assume $M_1$ to $M_K$ are $K$ given HMMs and the task is to assign a class label $k \in \{1, \ldots, K\}$ to each sample (utterance) of the dataset. Furthermore, we assume the dataset is balanced, that is, the class frequencies are equal. Restating the result of Section 7.2, the log-likelihood of each sample can be seen as a linear function of the reliability weights:

$$\log \hat{P}(O|M_k) = \log P(O, \mathcal{Q}^*|M_k) = \mathbf{w}^\top \mathbf{p}_k \qquad (7.17)$$

where the subscript $k$ in $\mathbf{p}_k$ emphasizes that $\mathbf{p}_k$ is the sum of the observation log-probabilities of model $M_k$. In the following, the hat of $\log \hat{P}$, which indicates that it is the Viterbi approximation of the log-likelihood, is dropped for ease of notation. Since $\log P(O|M_k)$ is a linear function of $\mathbf{w}$, the reliability weight vector $\mathbf{w}$ can be interpreted as a linear classifier or a one-layer perceptron. This interesting fact allows one to efficiently estimate the weight vector with respect to different criteria and, more importantly, to generalize the single weight set to $K$ weight sets, that is, one weight set for each pattern. This generalization is further explored in the following sections. The score in (7.17) cannot be utilized as it is for the AUC calculation. From (7.5) it is clear that for the AUC computation the scores should be comparable, while $\log P(O|M_k)$ depends on the observation probability $P(O)$ and is thus

not comparable from one utterance to another. In the previous section we used the logarithm of the likelihood ratio for the binary case (7.11). In the multiclass case, however, there are various ways to define the likelihood ratio. Here we use the following definition for the multiclass likelihood ratio (MLR):

$$MLR_k = \frac{P(O|M_k)^K}{\prod_{i=1}^{K} P(O|M_i)} \qquad (7.18)$$

This quantity is a generalization of the likelihood ratio in a binary classification task. Since it is independent of the sample probability $P(O)$, it is comparable among different samples and thus suitable for our framework. To make the classification decision, we need to compute $K$ likelihood ratios for each sample. The final decision rule is that the sample belongs to the class with the maximum likelihood ratio value⁴. Now, by taking the logarithm of the MLR and replacing the $\log P(O|M_k)$ terms by their approximations $\mathbf{w}^\top \mathbf{p}_k$, we get:

$$L_k(O) = \log MLR_k = \mathbf{w}^\top \Big(K\,\mathbf{p}_k - \sum_{i=1}^{K} \mathbf{p}_i\Big) = \mathbf{w}^\top \mathbf{x}_O^k \qquad (7.19)$$

where $\mathbf{x}_O^k = K\,\mathbf{p}_k - \sum_{i=1}^{K} \mathbf{p}_i$ is simply the distance of the $k$-th log-likelihood of $O$ from the average of the log-likelihoods of the $K$ HMMs. The larger it is, the more likely it is that $O$ belongs to class $k$. The obtained score $L_k(O)$ is independent of the sample probability, meaning that given two samples $O_1$ and $O_2$ one may compare their scores $L_k(O_1)$ and $L_k(O_2)$ to conclude which one is represented better by model $M_k$. As we will see in the next section, this score will be used to compute the AUC.
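A small sketch of Eqs. (7.18)-(7.19): given the per-model vectors p_k of an utterance, the K scores L_k(O) are obtained as linear functions of the weight vector. Names and shapes are illustrative assumptions.

```python
import numpy as np

def mlr_scores(w, p_list):
    """p_list: list of K vectors p_k (one per HMM); returns the K scores L_k(O)."""
    P = np.stack(p_list)                 # shape (K, N)
    K = P.shape[0]
    X = K * P - P.sum(axis=0)            # rows are x_O^k = K p_k - sum_i p_i
    return X @ w                         # L_k(O) = w^T x_O^k, Eq. (7.19)
```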

7.3.4 Online AUC Maximization

The AUC metric is written as a sum of pairwise losses between samples from different classes. That is, the computational complexity of the AUC is quadratic in the number of samples.

⁴This can also be seen as a one-versus-all approach to convert a multiclass problem into K binary classification tasks. Other methods could be used as well, including pairwise comparisons or the min-margin ratio P(O \mid M_k) / \max_i P(O \mid M_i).

Consider a dataset \mathcal{D} = \{(O_i, y_i) \in \mathbb{R}^N \times \{1, ..., K\} \mid i \in \{1, ..., BK\}\} with B samples from each of the K classes and N = |A| + |V| + 1. We split \mathcal{D} into K subsets, \mathcal{D}_k = \{(O_i, y_i) \mid i \in \mathcal{D} \;\&\; y_i = k\}. Combining (7.10) and (7.16), the AUC value for a multiclass classifier can be computed as:

A = \frac{1}{Z} \sum_{k=1}^{K} \sum_{j \in \mathcal{D}_k} \sum_{n \notin \mathcal{D}_k} \mathbb{I}\big(L_k(O_j) \ge L_k(O_n)\big)
  = \frac{1}{Z} \sum_{k=1}^{K} \sum_{j \in \mathcal{D}_k} \sum_{n \notin \mathcal{D}_k} \mathbb{I}\big(w^\top x_{O_j}^k \ge w^\top x_{O_n}^k\big)
  = 1 - \frac{1}{Z} \sum_{k=1}^{K} \sum_{j \in \mathcal{D}_k} \sum_{n \notin \mathcal{D}_k} \mathbb{I}\big(w^\top x_{O_j}^k \le w^\top x_{O_n}^k\big) \qquad (7.20)

where Z is a normalization factor. Maximizing the AUC is equivalent to minimizing the summation of the \mathbb{I}(w^\top x_{O_j}^k \le w^\top x_{O_n}^k) terms in (7.20). Unfortunately, this is a combinatorial optimization problem due to the non-convexity of the indicator function. In order to obtain a convex optimization problem, a common approach is to approximate the indicator function with a convex surrogate. Following the suggestion in [ZJYH11], we may replace the indicator function with the hinge loss function,

\ell(w, x_{O_j}^k - x_{O_n}^k) = \max\big(0,\; 1 - w^\top (x_{O_j}^k - x_{O_n}^k)\big) \qquad (7.21)

as shown in Figure 7.2. However, a drawback of this convex function, and in general of any other convex loss, is that it assigns large loss values to outliers, making the algorithm focus only on the outliers. This phenomenon has been detected and extensively studied in the context of boosting classifiers [LS10], which in some sense is a dual form of the online learning [FS97] that we use here. As a remedy, we suggest using a hinge loss with a level threshold,

\ell(w, x_{O_j}^k - x_{O_n}^k) = \min\Big(T_0,\; \max\big(0,\; 1 - w^\top (x_{O_j}^k - x_{O_n}^k)\big)\Big) \qquad (7.22)

where T_0 is the cut-off threshold as depicted in Figure 7.2. The loss function is now non-convex and the employed online minimization algorithm cannot guarantee to reach the global optimum.


Figure 7.2: Illustrating hinge function and its non-convex thresholded version.
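A minimal sketch of the two surrogate losses of (7.21) and (7.22), with hypothetical helper names:

```python
import numpy as np

def hinge(w, delta_x):
    # standard hinge surrogate of (7.21); delta_x = x_Oj^k - x_On^k
    return max(0.0, 1.0 - float(w @ delta_x))

def thresholded_hinge(w, delta_x, T0):
    # level-thresholded hinge of (7.22): the loss of badly misranked
    # (outlier) pairs is capped at T0
    return min(T0, hinge(w, delta_x))
```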

Having (7.20) and (7.22) in hand, the AUC maximization problem can be formulated as:

\min_{w}\; \frac{1}{2}\|w\|_2^2 + C \sum_{k=1}^{K} \sum_{j \in \mathcal{D}_k} \sum_{n \notin \mathcal{D}_k} \ell(w, x_{O_j}^k - x_{O_n}^k) \qquad (7.23)

\text{s.t.} \quad \sum_i w_i = 1, \quad 0 \le w_i \quad \forall i \in \{0, \ldots, |A| + |V|\}

where \frac{1}{2}\|w\|_2^2 is a quadratic regularization term added to control the solution complexity. Other regularization terms, such as the entropy function or L1-norm regularization, are also commonly used in the literature. Moreover, C is a penalty parameter of the error term. The constraints in (7.23) guarantee that the solution lies on the probability simplex.

The constraint set, which may be unnecessary for information fusion, is useful for modeling the inter-feature correlations. On a probability simplex, if one weight increases, the sum of the rest of the weights decreases. So projecting onto a probability simplex tries to alleviate the violation of the independence assumption by correlating the reliability weights with each other. It is thus conceivable that projecting onto different convex sets results in different levels of correlation modeling. For instance, projecting onto a unit hypercube (0 \le w_i \le 1) gives less credit to the features' correlations, and projecting onto \mathbb{R}^+ (0 \le w_i) assumes almost no correlation among the features.

Irrespective of the constraints, the minimization in (7.23) has two important properties: it is a batch minimization and it is quadratic in the number of samples. These properties raise two issues. First, as the size of the training data increases, the computational cost increases quadratically, making it prohibitive for large datasets. Second, when new training material arrives, the minimization problem has to be solved again and the weights cannot simply be updated. In other words, this framework is not suitable for adaptive weight adjustment. We require an online optimization framework with low computational and memory complexity that can be updated as new samples become available. To this end, we utilize the online AUC maximization framework suggested in [ZJYH11] with some small modifications to fit our multiclass problem. Since the standard online optimization framework is the sum of the losses of individual samples, we rewrite (7.23) to fit this framework, i.e.,

\frac{1}{2}\|w\|_2^2 + C \sum_{t=1}^{T} \mathcal{L}_t(w) \qquad (7.24)

where,

\mathcal{L}_t(w) = \mathbb{I}_{(y_t = 1)}\, h_1^t(w) + \cdots + \mathbb{I}_{(y_t = K)}\, h_K^t(w) \qquad (7.25)

h_i^t(w) = \sum_{t'=1}^{t-1} \mathbb{I}_{(y_{t'} \ne i)}\, \ell(w, x_{O_t}^i - x_{O_{t'}}^i), \qquad i \in \{1, \ldots, K\}

As T tends towards infinity, the objective in (7.24) approaches the objective function in (7.23), since it is an unbiased estimate of it. The projected gradient descent algorithm introduced in [Zin03] can at this step be utilized to solve this optimization problem. The weight update strategy based on the projected gradient descent algorithm is listed in Algorithm 7.2.

Algorithm 7.2: Single Vector: Projected Gradient Descent
    Input: set w^1 = [1/N, ..., 1/N]^\top.
    for t = 1, ..., T do
        receive (O_t, y_t)
        w^{t+1} = \mathrm{proj}_{\triangle}\big(w^t - \eta\, \frac{\partial}{\partial w} \mathcal{L}_t(w)\big)
    end
    Output: the final weight vector w^T.

In Algorithm 7.2, \mathrm{proj}_{\triangle}(\cdot) is a projection function, \triangle is the probability simplex and \eta is a learning rate. Projection onto the probability simplex is defined as \mathrm{proj}_{\triangle}(z) = \arg\min_{x \in \triangle} \|z - x\|_2^2. As can be seen, the projected gradient descent algorithm needs to compute the (sub)gradient of \mathcal{L}_t(w) at each update round⁵. For the thresholded hinge loss function, for instance, the subgradient with respect to w when y_t = i is:

\frac{\partial}{\partial w} \mathcal{L}_t(w) = \sum_{t'=1}^{t-1} \big(x_{O_{t'}}^i - x_{O_t}^i\big) \qquad (7.26)

when 0 \le 1 - w^\top (x_{O_t}^i - x_{O_{t'}}^i) \le T_0, and zero otherwise. Since all the previous samples are required to compute this subgradient, Zhao et al. [ZJYH11] introduced a buffer-based algorithm that only stores the last few samples to be used for the subgradient computation. Furthermore, they showed that the difference between the optimal solution and the solution obtained by the buffer-based method is bounded. Although in this work we set the buffer size to the number of samples in the training data, using this trick can be quite practical for large datasets or when the weights have to be updated in real time.
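Putting the pieces together, the sketch below is one possible NumPy rendering of Algorithm 7.2 with the buffer trick of [ZJYH11]. The stream interface, buffer handling and simplex-projection routine are our own choices; they illustrate the update rule under these assumptions and are not the exact implementation used in this work.

```python
import numpy as np
from collections import deque

def project_simplex(v):
    # Euclidean projection onto the probability simplex {x >= 0, sum(x) = 1}
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def online_auc_max(stream, N, K, eta=0.01, T0=2.0, buffer_size=100):
    # stream yields (X, y): X is a (K, N) array whose k-th row is x_O^k of (7.19),
    # y in {0, ..., K-1}; returns the reliability weight vector w
    w = np.full(N, 1.0 / N)
    buffers = [deque(maxlen=buffer_size) for _ in range(K)]  # per-class history
    for X, y in stream:
        grad = np.zeros(N)
        # pair the new sample (positive for its class y) with buffered samples
        # of the other classes; thresholded-hinge subgradient as in (7.26)
        for k in range(K):
            if k == y:
                continue
            for X_old in buffers[k]:
                margin = 1.0 - w @ (X[y] - X_old[y])
                if 0.0 <= margin <= T0:
                    grad += X_old[y] - X[y]
        w = project_simplex(w - eta * grad)
        buffers[y].append(X)
    return w
```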

7.4 Multi-vector Model

As mentioned before, interpreting the weight vector as a linear classifier enables us to further generalize this notion. Here we use the multi-vector generalization of the linear classifier to address the multiclass problem. In this setting, an individual weight vector is dedicated to each class, i.e., there are K different weight vectors in our problem. Let us denote the weight matrix W = [w_1, w_2, ..., w_K],

⁵In our case it is in fact a subgradient, since the hinge loss is not differentiable.

where w_i is the weight vector of HMM M_i. While the main body of the proposed algorithm for the single-vector case remains intact, some modifications are still required. As discussed in Section 7.3, when the AUC is used as a criterion, the scores of the utterances should be comparable, that is, they should be independent of the observation probability P(O). In the single-vector case, it is sufficient to use the MLR in (7.18) to satisfy this requirement. However, in the multi-vector setting, where each \log P(O \mid M_k) = w_k^\top p_k has its own weight vector, P(O) is not removed from equation (7.18) (since it is weighted with different values) and thus the MLR is no longer P(O)-independent. Therefore, in the multi-vector case, instead of the MLR we use a heuristic to achieve approximately P(O)-independent scores. Given an observation sequence O_r = \{o_1, ..., o_{T_r}\} of length T_r, we normalize its corresponding likelihood vector p by the number of observations T_r to make the score of a given sequence independent of its length. Similar to the single-vector case in (7.19), the score is

L_k(O_r) = \frac{1}{T_r}\, w_k^\top \Big( K p_k - \sum_{i=1}^{K} p_i \Big) = w_k^\top x_{O_r}^k \qquad (7.27)

where x_{O_r}^k = \frac{1}{T_r}\big( K p_k - \sum_{i=1}^{K} p_i \big). With this modification in hand, it is straightforward to generalize Algorithm 7.2 to the multi-vector case. To this end, let us rewrite (7.27) in the matrix-trace form:

L_k(O_r) = \mathrm{Tr}\big(D_k W^\top X_{O_r}\big) \qquad (7.28)

where \mathrm{Tr}(\cdot) stands for the trace of a matrix, D_k is a diagonal matrix whose only non-zero element, in its k-th diagonal position, is set to 1, and X_{O_r} = [x_{O_r}^1, ..., x_{O_r}^K]. Then, the AUC optimization (7.23) in the multi-vector case is:

\min_{W}\; \frac{1}{2}\|W\|_2^2 + C \sum_{k=1}^{K} \sum_{j \in \mathcal{D}_k} \sum_{n \notin \mathcal{D}_k} \ell(W, x_{O_j}^k - x_{O_n}^k) \qquad (7.29)

\text{s.t.} \quad W^\top \mathbf{1}_{(|A|+|V|+1) \times 1} = \mathbf{1}_{K \times 1}, \quad \{W_{i,j}\} \ge 0

where \mathbf{1} is an all-one vector, \{W_{i,j}\} stands for all entries of W, \|W\|_2^2 is defined as \mathrm{Tr}(W^\top W) and

\ell(W, x_{O_j}^k - x_{O_n}^k) = \min\Big(T_0,\; \max\big(0,\; 1 - \mathrm{Tr}\big(D_k W^\top (X_{O_j} - X_{O_n})\big)\big)\Big) \qquad (7.30)

Similar to the single-vector case, this optimization can be solved by using the projected gradient descent algorithm. For the t-th sample, belonging to the i-th class, the subgradient is equal to:

\frac{\partial}{\partial W} \mathcal{L}_t(W) = \sum_{t'=1}^{t-1} \frac{\partial}{\partial W}\, \mathrm{Tr}\big(D_i W^\top (X_{O_{t'}} - X_{O_t})\big) = \sum_{t'=1}^{t-1} \big[\, 0, \ldots, 0,\; x_{O_{t'}}^i - x_{O_t}^i,\; 0, \ldots, 0 \,\big] \qquad (7.31)

when 0 \le 1 - \mathrm{Tr}\big(D_i W^\top (X_{O_t} - X_{O_{t'}})\big) \le T_0, and zero otherwise. In (7.31), the columns 0 denote all-zero vectors. As can be seen, all columns of the subgradient matrix except one, the i-th column, are zero. Thus, at each update step t only one column of W is updated.

Algorithm 7.3: Multi Vector: Projected Gradient Descent
    Input: set W^1 = [\frac{1}{N}\mathbf{1}, ..., \frac{1}{N}\mathbf{1}].
    for t = 1, ..., T do
        receive (O_t, y_t), set i = y_t
        W_i^{t+1} = \mathrm{proj}_{\triangle}\big(W_i^t - \eta\, \frac{\partial}{\partial W} \mathcal{L}_t(W)\big|_{i\text{-th column}}\big)
    end
    Output: the final weight matrix W^T.

In Algorithm 7.3, W_i^t denotes the i-th column of W at round t and the subgradients are computed as in (7.31). Comparing Algorithms 7.2 and 7.3 reveals that the single- and multi-vector cases have equal computational complexity. However, it is important to note that in the multi-vector case, W_i is on average updated only T/K times after T iterations and thus its convergence is K times slower than in the single-vector case.
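A short numerical check, using the dimensions of our setting (159 features, 10 digit classes): picking out the k-th diagonal selector D_k in the trace form (7.28) reproduces exactly the per-class score w_k^T x_O^k of (7.27).

```python
import numpy as np

def multivector_scores(W, X):
    # column k of W (weights of class k) against column k of X_O = [x_O^1, ..., x_O^K]
    return np.einsum('nk,nk->k', W, X)

rng = np.random.default_rng(0)
N, K = 159, 10
W, X = rng.random((N, K)), rng.random((N, K))
for k in range(K):
    D_k = np.zeros((K, K))
    D_k[k, k] = 1.0
    assert np.isclose(np.trace(D_k @ W.T @ X), multivector_scores(W, X)[k])
```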

7.5 Experiments on AUC-based Fusion Strategy

In the following experiments, the proposed weighting estimation scheme is evaluated in different scenarios. The sound units are words and the task at hand is to develop a robust audio-visual digit recognition system. Here, the CUAVE dataset [PGTG02] was used, which contains the digits from zero to nine repeated five times by 36 speakers (see Section 2.2.1). In order to have a speaker-independent evaluation, the one-speaker-out cross-validation strategy was utilized. The audio signals were corrupted by additive white noise at the SNR levels {-20, -15, -10, -5, 0, 5, 10, 15, ∞} dB, while the video signals were clean in all experiments.

For digit recognition, we used a GMM-HMM based recognizer similar to the V-ASR introduced in Chapter 6. As in that chapter, the ISMA features, which consist of COBRA-selected ISCN features and 10 posterior probabilities of Mu-MABoost, were considered to be the visual features. Compared with Chapter 3, here a slightly smaller subset of ISCN features, i.e., 30 features, was selected. By including the first- and second-order derivatives of the visual features, the final 120 visual features used to train the GMM-HMMs were obtained.

As audio features, the mel-frequency cepstral coefficients (MFCC) were used. Thirteen MFCC features and their first- and second-order derivatives were extracted from frames with 20 ms duration and 10 ms overlap. That is, the audio frame rate was 100 frames per second. Visual features were linearly interpolated to increase their frame rate from 60 to 100 in order to have an equal number of audio and visual features per utterance. Finally, both audio and visual features were normalized per speaker by subtracting their mean values for each speaker.

The audio-visual feature vectors were employed to train 10 HMMs, one HMM per digit. Each Markov state in the HMMs was modeled with two Gaussian components with diagonal covariance matrices. The number of emitting states in all HMMs was empirically chosen to be nine. Moreover, the popular HTK toolkit [YEG+06] was used to train the HMMs. It is important to remark that during training the audio and visual features had equal weights.

In the first experiment, the accuracy of digit classification was evaluated by using three weighting strategies: (I) the optimal strategy, (II) minimizing the classification error and (III) the proposed strategy. The optimal strategy needs a priori knowledge about the test conditions. In this setting, the weights are selected in order to minimize the classification error rate on each particular test set, i.e., an individual weight set for each SNR value. The second strategy is to estimate the weights by minimizing the error rate over the clean training data. This strategy is called min-error in Figure 7.3 and its performance over different SNR values is depicted with a solid red line.


Figure 7.3: Comparison between optimal (no mismatch), maximum AUC and mini- mum classification error based strategies for estimating the feature weights.

The proposed strategy (in the single-vector case), which is called max-AUC in Figure 7.3, chooses the reliability weights such that the approximation of the AUC over the clean training data is maximized. The first two methods serve as references to compare with our approach.

As seen in Figure 7.3, the proposed strategy is very robust against mismatch conditions and works uniformly well over the different SNR regions, while the min-error approach shows a significant performance degradation as the SNR value decreases. This figure clearly shows that employing an appropriate criterion to estimate the weights is essential to achieve robustness against varying conditions. Figure 7.3 summarizes the main contribution of this work: employing a set of fixed reliability weights, chosen in accordance with a robust criterion, can achieve close-to-optimal accuracy over a wide range of SNR values.

SNR      Weighting     Audio-only   Video-only   Audio-visual
∞ dB     159W          98.00        81.69        96.35
         2W            97.72        81.35        95.38
         159W-match    98.00        81.69        96.35
         2W-match      97.72        81.35        95.38
10 dB    159W          93.08        81.69        93.04
         2W            92.81        81.35        91.64
         159W-match    93.53        81.69        94.11
         2W-match      92.81        81.35        91.92
5 dB     159W          83.94        81.69        91.14
         2W            84.28        81.35        89.61
         159W-match    84.26        81.69        91.54
         2W-match      84.28        81.35        90.00
0 dB     159W          65.17        81.69        87.74
         2W            64.26        81.35        86.51
         159W-match    67.15        81.69        89.14
         2W-match      64.26        81.35        86.51

Table 7.1: Classification rates in different noise levels for single-vector setting. 159W rows report the accuracies for one weight per feature (159) case and 2W rows are for one weight per modality. 159W-match and 2W-match represent the scenario where the reliability weights are estimated from match-to-test datasets.

In the second set of experiments, reported in Tables 7.1 and 7.2, the ten HMMs were trained with clean audio-visual data and tested under different noisy conditions. The weights were estimated by maximizing the AUC with the non-convex loss for two cases: one weight for each feature (159 in total), which we refer to as 159W in our tables, and one weight for each stream (only 2 weights), called 2W. Moreover, we also report the results of a scenario where the weights were estimated from match-to-test data, called 159W-match and 2W-match in Tables 7.1 and 7.2. The match-to-test data was constructed by randomly selecting 10% of the training data and adding appropriate additive noise to it to match the test scenario.

SNR      Weighting     Audio-only   Video-only   Audio-visual
-5 dB    159W          39.68        81.69        85.26
         2W            36.85        81.35        84.18
         159W-match    44.08        81.69        85.53
         2W-match      36.85        81.35        84.18
-10 dB   159W          20.22        81.69        82.24
         2W            17.46        81.35        82.00
         159W-match    24.21        81.69        83.57
         2W-match      17.46        81.35        82.06
-15 dB   159W          12.47        81.69        80.58
         2W            11.08        81.35        80.67
         159W-match    14.32        81.69        82.14
         2W-match      11.08        81.35        81
-20 dB   159W          8.889        81.69        79.05
         2W            8.333        81.35        79.78
         159W-match    8.667        81.69        81.56
         2W-match      8.333        81.35        80.06

Table 7.2: Classification rates in different noise levels for single-vector setting. 159W rows report the accuracies for one weight per feature (159) case and 2W rows are for one weight per modality. 159W-match and 2W-match represent the scenario where the reliability weights are estimated from match-to-test datasets.

The motivation for running this experiment was to estimate how much additional classification accuracy could be obtained given a hypothetical adaptive weight estimation algorithm (one that can adaptively update the reliability weights to match the test conditions). In reality, however, an adaptive algorithm can never precisely estimate the test conditions. Thereby, the improvements reported in Tables 7.1 and 7.2 for the match case should be interpreted as being overly optimistic. The classification accuracies for the audio-only, video-only and audio-visual classification systems are reported in the second to fourth columns of Tables 7.1 and 7.2, respectively.

SNR      Weighting     Audio-only   Video-only   Audio-visual
∞ dB     159W          96.96        82.92        96.69
         2W            97.06        78.44        94.57
         159W-match    96.96        82.92        96.69
         2W-match      97.06        78.44        94.57
10 dB    159W          91.49        82.92        96.53
         2W            92.06        78.44        93.00
         159W-match    92.28        82.92        96.58
         2W-match      92.06        78.44        93
5 dB     159W          84.97        82.92        95.29
         2W            85.58        78.44        91.21
         159W-match    88.03        82.92        95.61
         2W-match      85.58        78.44        91.26
0 dB     159W          73.04        82.92        93.4
         2W            74.01        78.44        88.78
         159W-match    79.37        82.92        93.85
         2W-match      73.96        78.44        88.78

Table 7.3: Classification rates (in percentage) at different noise levels for multi-vector case. 159W rows report the accuracies for one weight per feature (159) case and 2W rows are for one weight per stream case. 159W-match and 2W-match represent the scenario where the reliability weights are estimated from match-to-test datasets.

Two interesting trends can be detected in Tables 7.1 and 7.2. First, the one-weight-per-feature case leads to a 1% to 2% classification rate improvement in the high-SNR regime, i.e., above -5 dB. However, as the SNR value declines, the advantage of using a large number of weights (here 159) decreases, and at -15 dB it is observed that one weight per modality results in a slightly better performance than the 159W case. That is, the mismatch between training and test conditions may more severely affect a model with 159 hyper-parameters than a model with only 2 hyper-parameters.

SNR      Weighting     Audio-only   Video-only   Audio-visual
-5 dB    159W          52.18        82.92        89.14
         2W            54.46        78.44        84.69
         159W-match    62.68        82.92        90.08
         2W-match      54.35        78.44        84.75
-10 dB   159W          31.10        82.92        82.99
         2W            34.29        78.44        80.13
         159W-match    44.85        82.92        86.42
         2W-match      34.53        78.44        80.18
-15 dB   159W          18.65        82.92        77.10
         2W            21.67        78.44        75.47
         159W-match    29.40        82.92        81.71
         2W-match      21.72        78.44        75.36
-20 dB   159W          11.94        82.92        70.53
         2W            13.58        78.44        72.07
         159W-match    20.31        82.92        78.10
         2W-match      13.58        78.44        72.18

Table 7.4: Classification rates (in percentage) at different noise levels. 159W rows report the accuracies for the one-weight-per-feature (159) case and 2W rows are for the one-weight-per-stream case. 159W-match and 2W-match represent the scenario where the reliability weights are estimated from match-to-test datasets.

The second trend detected in Tables 7.1 and 7.2 is that in the match-case scenario, the one weight per stream (2W-match) does not show a meaningful improvement over the 2W case, which may seem surprising. In fact, in most cases, the accuracies obtained by 2W-match are very close to the 2W case. The ineffectiveness of the 2W setting in adapting to a particular condition is due to the fact that it does not introduce sufficient degrees of freedom into the model to properly represent the complex reality. However, in the 159W-match cases, where the weights were estimated from a match-to-test training set, significant improvements over the mismatch cases can be observed at all SNRs. That is, given an adaptive algorithm that can update the weights in order to match the test conditions, it is more rewarding to employ a larger set of weights than only two stream weights.

In the third experiment, reported in Tables 7.3 and 7.4, we evaluated the performance of the multi-vector setting where 10 different weight vectors are estimated, one for each digit. As in the second experiment, 159W represents the one-weight-per-feature case and 2W stands for one weight per modality. An immediate observation is that the multi-vector setting greatly outperforms the single-vector setting in a wide range of SNRs from 10 dB to -10 dB. It seems this improvement is due to the fact that the multi-vector setting significantly improves the quality of the audio-only recognizer. For instance, at -5 dB the audio-only performance for 159W in the single-vector setting is 39% while in the multi-vector setting it reaches 52%. As in the single-vector setting, it is observed that 159W is more sensitive to mismatch error and achieves a 2% lower accuracy rate at -20 dB than the 2W case. However, it consistently shows better performance at higher SNRs.


Figure 7.4: Comparison between the 4 proposed weight sets.

Figure 7.4 compares all four cases, i.e., 159W single-vector, 2W single-vector, 159W multi-vector and 2W multi-vector. From this figure, it is clearer which setting is more suitable for which SNR region(s). The underlying principle is that as the number of parameters increases, both the accuracy of a model and its sensitivity to mismatch error increase. The different methods in Figure 7.4 offer different compromises between these two factors.

         zero     one      two      three    four     five     six      seven    eight    nine
zero     58.37%   6.87%    1.29%    1.29%    24.46%   1.29%    0        0.43%    0        6.01%
one      0        87.59%   2.07%    0        2.76%    0        0        2.07%    0.69%    4.83%
two      8.05%    1.15%    78.16%   5.75%    5.75%    0        0        0        0        1.15%
three    2.21%    0.74%    3.68%    74.26%   0.74%    0        9.56%    3.68%    2.94%    2.21%
four     2.80%    1.87%    0        1.87%    85.98%   1.87%    4.67%    0.93%    0        0
five     3.81%    1.90%    0        0        2.86%    78.57%   0.48%    0.48%    0        11.90%
six      0.79%    0.79%    1.59%    0.79%    0        0.79%    84.92%   8.73%    0.79%    0.79%
seven    2.25%    6.11%    9.00%    14.79%   2.57%    0.32%    15.11%   48.55%   0        1.29%
eight    1.01%    1.51%    0.50%    7.04%    0.50%    0        2.51%    0        85.93%   1.01%
nine     3.36%    2.68%    0.67%    1.34%    0        4.70%    0.67%    4.03%    1.34%    81.21%

Recognition rate = 73.0168%

Figure 7.5: Confusion matrix of the audio-based classifier for the multi-vector 159W setting at 0 dB.

An important question raised here is whether the audio- and visual-based classifiers often make mistakes on similar test samples or whether their errors are roughly uncorrelated. In other words, are the hard-to-classify samples similar for both classifiers? Figures 7.5, 7.6 and 7.7 depict the confusion matrices of the audio-, visual- and audio-visual-based classifiers for the multi-vector 159W setting at 0 dB, respectively.

         zero     one      two      three    four     five     six      seven    eight    nine
zero     86.92%   0        7.48%    1.87%    0.93%    0.93%    0        1.87%    0        0
one      0.70%    95.77%   0        3.52%    0        0        0        0        0        0
two      22.27%   1.75%    65.94%   2.18%    1.31%    0        4.80%    0.87%    0.44%    0.44%
three    0.61%    7.32%    0        89.02%   0.61%    0        1.22%    0        0.61%    0.61%
four     5.29%    8.81%    4.85%    4.41%    76.65%   0        0        0        0        0
five     0        1.60%    0        1.60%    0        90.43%   0        2.66%    0        3.72%
six      3.72%    1.60%    2.66%    1.06%    0        0        79.26%   0.53%    4.79%    6.38%
seven    1.67%    0.56%    0        1.67%    0        0.56%    2.78%    90.56%   0.56%    1.67%
eight    3.23%    0        0.92%    1.38%    0        1.38%    3.69%    2.76%    71.89%   14.75%
nine     2.70%    0        1.35%    0        0        2.70%    2.70%    0        7.43%    83.11%

Recognition rate = 81.6201%

Figure 7.6: Confusion matrix of the visual-based classifier for the multi-vector 159W setting at 0 dB.

         zero     one      two      three    four     five     six      seven    eight    nine
zero     93.67%   1.27%    1.90%    0.63%    1.90%    0        0        0        0        0.63%
one      0        95.86%   0.59%    1.78%    0.59%    0        0        0.59%    0        0.59%
two      9.69%    1.02%    85.71%   2.04%    0.51%    0        0.51%    0.51%    0        0
three    0.60%    0        1.20%    97.60%   0        0        0        0.60%    0        0
four     1.08%    3.76%    0        1.61%    93.55%   0        0        0        0        0
five     1.56%    1.56%    0        0        0        92.19%   0        1.04%    0        3.65%
six      2.20%    0.55%    1.65%    0        0        0        93.96%   0        0.55%    1.10%
seven    0        0.56%    1.12%    1.12%    0        0        0.56%    96.65%   0        0
eight    0.53%    0.53%    0        1.58%    0        0        2.11%    0.53%    91.58%   3.16%
nine     0.58%    0        0        0        0        1.17%    1.17%    0        2.34%    94.74%

Recognition rate = 93.4078%

Figure 7.7: Confusion matrix of the audio-visual classifier for the multi-vector 159W setting at 0 dB.

As seen, digit four was the most difficult class for the audio-based recognizer: only 51.4% of the samples belonging to this class were classified correctly. On the other hand, samples from four were the easiest to classify for the visual-based classifier, which yielded 97% classification accuracy on that class. Another interesting result concerns digit nine. Both classifiers give almost the same accuracy for this class, i.e., 67-68%. However, while the audio classifier assigns most of the wrongly classified samples of this class to five, the visual-based classifier assigns most of them to class seven. The accuracy improvement of the combined classifier over the single-modality classifiers depends on the level of correlation between the classifiers (see the random forest analysis in [Bre01]). The weaker the correlation between the outputs of the classifiers, the higher the accuracy of the combined classifier. This explains the classification improvement of the audio-visual classifier over the audio-only and visual-only classifiers.

7.6 Conclusion

In order to achieve robustness against a mismatch between training and test scenarios, we proposed to employ the AUC as a design criterion to estimate the reliability weights. We showed that the reliability weight vector can be interpreted as a linear classifier taking the likelihood values computed by the HMMs as an input feature vector. We evaluated different weight sets such as one weight per stream, one weight per feature and one weight vector per class. As shown in the experiments, except for the severe mismatch condition (-20 dB), the one-weight-per-feature approach outperforms the one-weight-per-modality approach in both the multi-vector and single-vector settings. The multi-vector setting, which dedicates an individual weight vector to each class, showed a considerable improvement over the single-vector case in a wide range of SNR values. It, however, seems to be more sensitive to mismatch error, as its accuracy rate at -20 dB is about 10% lower than in the single-vector case. With an adaptive model selection algorithm, a multi-mode fusion algorithm may employ all these four models to achieve an optimal accuracy over all SNR regions.

Chapter 8

Multichannel Audio-video Speech Recognition System

This chapter is devoted to multichannel audio-visual speech recognition systems. The AV-ASR system proposed in this chapter is similar to the system introduced in Chapter 6 with one additional audio-processing block, beamforming. Since the audio signals are captured by a microphone array with eight channels, we can utilize beamforming techniques to focus on the desired sound source and alleviate background noise. The output of the beamforming block is then supplied to the AV-ASR discussed in Chapters 6 and 7.

The multichannel audio-visual dataset used to evaluate the proposed algorithms in this chapter was collected in a highly reverberant office room, under highly varying light conditions and with a relatively low-resolution RGB camera. Using this dataset to evaluate the proposed algorithms yields a fair estimate of the performance of the algorithms in adverse conditions which commonly occur in real-world applications.

8.1 Introduction to Beamforming Problem

While visual information is of high importance for increasing the robustness of AV-ASR in several respects, it can hardly help (at least directly) to improve the speech signals in reverberant environments. Reverberation corrupts the harmonic structure in voiced speech and consequently decreases the speech recognition rate. One simple approach to overcome this difficulty is to exploit beamforming techniques to focus on the desired direction and attenuate the echoes of the signal.

However, although this solution may work for fixed (non-adaptive) beamformers, it leads to signal cancellation in adaptive beamformers, due to a phenomenon known as signal leakage. Adaptive beamformers tend to adapt to the real characteristics of the background noise rather than to a predetermined model like white or diffuse noise. To this end, they estimate the statistics of the background noise from the non-speech segments of the audio signals. However, in a reverberant room, the echoes of the speech signals are still sensed by the microphones long after the end of the speech signals. These echoes bear the same statistical signature as the desired signal and hence result in signal cancellation when they are considered as noise by the beamformer.

The commonly utilized minimum variance distortionless response (MVDR) beamformer minimizes the output power while maintaining a specified response to the desired signal. However, in the presence of coherent interferers, which occur due to reverberation, microphone mismatch or steering vector errors, MVDR fails due to the signal leakage problem. Signal processing methods that have been proposed to address this problem are usually known as robust adaptive beamforming methods. However, this robustness is achieved at the expense of less interference reduction or an increased number of microphones. For example, in [SK85] the signal is averaged over space to decorrelate the signal and the interference. This technique can only be applied to uniform microphone arrays¹ and needs many microphones to achieve satisfactory performance. Using norm-constrained adaptive filters was proposed in [QV95] to constrain the power of the signal leakage, which consequently improves the adaptive beamformers. It, however, requires knowledge of the interference covariance matrix, which may not be available in speech recognition applications. In [HSH99], both quadratic and non-linear (truncation) constraints in a three-block structure have been used to improve the interference reduction. Another approach is to estimate the transfer functions (TFs) with blind source separation techniques. TFs may also be estimated based on the assumption that the signal is non-stationary while the noise is stationary [GBW01].

¹A uniform microphone array is a linear microphone array with equal distances between the microphones.

In this work, we extract the TFs from the covariance matrix of the array data given the source locations. We show that this problem can be formulated as an instance of the weighted Procrustes problem [Vik06]. Typically, a Procrustes problem is used to rotate and scale a set of data to fit another set. Here this method is used to estimate TFs that are close enough to the ideal TFs (steering vectors) and can still reconstruct the data covariance matrix. Using the proposed beamforming method yields a 2% performance improvement of the audio-only speech recognizer when the microphone array consists of 8 microphones, and 1.11% when a 4-channel microphone array is utilized.

8.2 Signal Cancellation

Let s_m(n) be the sound waves emitted by M wide-band sources which are received by an array of N microphones. The room impulse response h_{\mathrm{room}}^{k,i}(t) characterizes both the direct and echo paths from the k-th source to the i-th microphone. Since the microphones may not be calibrated and may introduce different transfer functions h_{\mathrm{mic}}^{i}(t), one can merge the room impulse response and the microphone transfer function into a total transfer function h^{k,i}(t) containing both the room acoustics and the microphone characteristics. Therefore, the i-th microphone output in the frequency domain can be written as:

x_i(f) = \sum_{m=1}^{M} h^{m,i}(f)\, s_m(f) + n_i(f) \qquad (8.1)

where n_i(f) is the additive noise at the i-th microphone. The microphone outputs can be aggregated into a column vector x:

x = H s + n \qquad (8.2)

where n is the additive white noise vector, H = [h_1, ..., h_M] is the channel matrix, h_i = [h^{i,1}, ..., h^{i,N}]^\top is the i-th column of H and s = [s_1(f), ..., s_M(f)]^\top is the source vector. The number of sources M is assumed to be less than the number of microphones N. Without loss of generality, s_1 can be considered as the desired signal. The goal of the beamformer is to obtain an estimate of the desired signal by filtering and summing the microphone outputs:

y(f) = w(f)^H x(f) \qquad (8.3)

where (\cdot)^H is the Hermitian transpose operator, y is the beamformer output and w the beamformer weight vector. The conventional MVDR beamformer chooses its weight vector w to minimize the output power while maintaining the signal from a specified direction of arrival:

\arg\min_{w}\; w^H R w \quad \text{subject to} \quad w^H d = 1 \qquad (8.4)

R = E\{x x^H\} is the covariance matrix of the received signals x. The target steering vector is defined by d = [e^{-j\omega\tau_1}, ..., e^{-j\omega\tau_N}], where \tau_1, ..., \tau_N are delays matched to the desired speaker direction. The conventional MVDR presented in (8.4), however, can only perform well in anechoic environments where noise and signal are independent. In reality, severe signal cancellation occurs because of microphone mismatch, location estimate errors, signal-correlated noise and reverberant environments. To clarify this problem, we have to look more closely into the covariance matrix R:

R = E\{x x^H\} = H R_s H^H + \sigma^2 I \qquad (8.5)

where R_s = E\{s s^H\} is the source covariance matrix and \sigma^2 the power of the additive white noise. For uncorrelated sound sources, R_s is diagonal and (8.5) can be decomposed into three additive terms:

R = h_1 h_1^H S_1(f) + H_{2:M} R_{s,2:M} H_{2:M}^H + \sigma^2 I \qquad (8.6)

where S_1(f) is the power spectrum of the desired signal and H_{2:M} = [h_2, ..., h_M]. R_{s,2:M} is obtained by removing the first row and the first column of R_s. Substituting w^H h_1 = 1 into the MVDR objective function (8.4) yields

H H H 2 H w Rw = S1(f)+ w H21:M Rs2:M H2:M w + σ w w (8.7) By ignoring the first term, the MVDR optimization problem can be simplified to

\arg\min_{w}\; w^H H_{2:M} R_{s,2:M} H_{2:M}^H w + \sigma^2 w^H w \quad \text{s.t.} \quad w^H h_1 = 1 \qquad (8.8)

Note that (8.8) is independent of S_1. However, in reverberant rooms no knowledge about h_1 is usually available in advance, and it has to be estimated from the array data or approximated with the steering vector d calculated from the main-speaker location estimate. Since h_1 includes both direct and indirect paths, it can be decomposed into a sum of the steering vector and the echoes transfer function:

h_1 = d + h_{\mathrm{echoes}} \qquad (8.9)

The attenuation factor has been intentionally dropped to avoid notational distraction. R can be rewritten as

R = (h_{\mathrm{echoes}} + d)(h_{\mathrm{echoes}} + d)^H S_1(f) + H_{2:M} R_{s,2:M} H_{2:M}^H + \sigma^2 I \qquad (8.10)

By using w^H d = 1 as an approximation of the target transfer function h_1, the objective function becomes

|w^H h_{\mathrm{echoes}} + 1|^2 S_1(f) + w^H H_{2:M} R_{s,2:M} H_{2:M}^H w + \sigma^2 w^H w \qquad (8.11)

Looking closely at this function reveals that the first term vanishes if w^H h_{\mathrm{echoes}} = -1 holds. Therefore, the MVDR optimization problem tends to satisfy the constraint w^H h_{\mathrm{echoes}} = -1. However, satisfying this constraint results in removing the signal from the beamformer output. To clarify this, note that the signal component in the beamformer output can be written as y_s = w^H h_1 s_1(f). Using (8.9) and the fact that the weight vector satisfies the constraints w^H d = 1 and w^H h_{\mathrm{echoes}} = -1, the beamformer response to the signal transfer function h_1 is w^H h_1 = 0 and thus y_s = 0. That is, by approximating the real transfer function (TF) with the steering vector d, the MVDR beamformer tends to remove the desired signal. Actually, if we ignore the white noise term for the sake of simplicity, by choosing w from the null space of H^H the objective function reaches its minimum value, namely zero. This means w^H H = 0 and therefore w^H h_1 = 0.

A natural approach to address the signal cancellation problem is to estimate the acoustic transfer function and the microphone mismatches. This estimate is then used to achieve the signal-independent optimization problem in (8.8).
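The cancellation effect can be reproduced with a small synthetic experiment. The sketch below uses the standard closed-form MVDR solution w = R^{-1} d / (d^H R^{-1} d) of (8.4) (not spelled out in the text) on a toy covariance matrix built from an artificial reverberant transfer function h_1 = d + h_echoes; all numbers are made up for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 8
d = np.exp(-1j * 2 * np.pi * rng.random(N))              # ideal steering vector
h_echoes = 0.8 * (rng.standard_normal(N) + 1j * rng.standard_normal(N))
h1 = d + h_echoes                                        # true (reverberant) TF
S1, sigma2 = 1.0, 1e-3
R = S1 * np.outer(h1, h1.conj()) + sigma2 * np.eye(N)    # covariance as in (8.5)

w = np.linalg.solve(R, d)
w /= d.conj() @ w                                        # MVDR constrained to w^H d = 1

print(abs(w.conj() @ d))    # ~1: the distortionless constraint on d is satisfied
print(abs(w.conj() @ h1))   # << 1: the response to the true TF h1 is almost zero,
                            #       i.e. the desired signal is cancelled
```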

8.3 Transfer Function Estimation

In order to avoid signal cancellation, we aim to estimate h_1 from the noisy signals x = Hs + n received by the microphones. This information can be extracted from the covariance matrix R. Using the spectral decomposition, R can be written as R = U(\Lambda + \sigma^2 I)U^H, where the first M columns of U are the orthonormal basis vectors of the signal and interference subspaces and \sigma^2 is the white noise power, which can be estimated as the average of the N - M smallest eigenvalues of R. Given the source position, the ideal steering vector can be simply calculated. However, because of the acoustic characteristics of a room, the steering vector d may not lie in the subspace spanned by the columns of U_{1:M}. One approach to estimate h_1 is to find the closest vector to d which lies in the subspace spanned by the first M columns of U:

\arg\min_{x}\; \| U_{1:M}\, x - d \|_F^2 \qquad (8.12)

Consequently, h_1 = U_{1:M}\, x and it belongs to the subspace U_{1:M}. This is a projection problem and the solution can be written as h_1 = U_{1:M} U_{1:M}^H d, where P = U_{1:M} U_{1:M}^H is the projection operator onto the subspace U_{1:M}. Although this can be an accurate guess when the mismatch between the steering vector and the real TF is fairly small, it may not work in more severely reverberant environments. To go one step further, we assume the availability of some additional information about the interferer locations. That is, the directions of arrival of k interferers (k \le M) can be derived from the array data (which is reasonably simple, at least for an imprecise estimate)². Unlike in some other methods, they do not necessarily need to be the k strongest sources. Defining H' = U_{1:M} \Lambda_{M \times M}^{1/2} V and thus R = H' H'^H, one can infer that H' can be estimated from the decomposition of R up to an unknown multiplicative unitary matrix V. Our aim is to estimate this unitary matrix V and reconstruct h_1 by \hat{h}_1 = U_{1:M} \Lambda_{M \times M}^{1/2} v_1. Taking the location information into account, (8.12) can be extended as follows:

\arg\min_{V, \Gamma}\; \| U_{1:M} \Lambda^{1/2} V_{1:k} \Gamma - D \|_F^2 \quad \text{s.t.} \quad V_{1:k}^H V_{1:k} = I \qquad (8.13)

where \Gamma is a k \times k diagonal weighting matrix, which is necessary to model the unknown attenuation factors, and D = [d_1, ..., d_k] is the steering matrix. It is worth noting that in the case k = 1, (8.13) reduces to the simple least squares problem in (8.12). However, unlike for (8.12), there is no straightforward solution for it. The optimization problem in (8.13) is known as the weighted orthogonal Procrustes problem (WOPP) and can be seen as a linear least squares problem defined on a Stiefel manifold. A Stiefel manifold, commonly denoted as \mathcal{V}_{m,n}, is the set of all matrices V_{M \times N} having orthonormal columns.

²Since many systems nowadays use both audio and video channels to ease the human-machine interaction, like Microsoft's Kinect-Xbox, the location information could also come from the vision channel.

Usually, the solutions suggested for the optimization problem in (8.13) have two steps: given \Gamma, they find the optimal V and use it in the second step to find the optimal \Gamma. This forms an iterative solution that converges to a local minimum. Here, we employ the algorithm suggested in [Dos10] with small modifications to work with complex matrices. The complete iterative channel matrix estimation algorithm is listed in Algorithm 8.1.

Algorithm 8.1: Iterative channel matrix estimation

    Input: M eigenvectors U_{1:M}, steering matrix D and eigenvalues \Lambda.
    Set A = [a_1, ..., a_k] = \Lambda^{1/2} U_{1:M}^H D and V_{1:k}^0 = I.
    For t = 1, ... do
        (a) Compute \gamma_i = \frac{a_i^H v_i}{v_i^H \Lambda v_i}, where v_i is the i-th column of V^t,
            and set \Gamma = \mathrm{diag}([\gamma_1, ..., \gamma_k]).
        (b) Choose \rho such that \rho I - \Lambda is positive definite.
            Compute c_i = \gamma_i a_i + |\gamma_i|^2 (\rho I - \Lambda) v_i for i = 1, ..., k
            and set C = [c_1, ..., c_k].
            From the orthogonal decomposition of C, C = U_c \Lambda_c V_c^H,
            set V^{t+1} = U_c V_c^H.
        (c) \Phi^{(t+1)} = \sum_{i=1}^{k} 2\,\mathrm{real}(\gamma_i v_i^H a_i) - \sum_{i=1}^{k} |\gamma_i|^2 v_i^H \Lambda v_i
            Terminate if \Phi^{(t+1)} - \Phi^{(t)} \approx 0.
    End
    Output: \Gamma and V^{t+1}.

Simulations have shown that this algorithm converges in less than 20 iterations. The algorithm may be computationally expensive; however, as long as the transfer function of the room does not change rapidly, it only needs to be run infrequently. The output of this algorithm is an estimate of the channel matrix H, which is used in (8.8) to achieve a signal-independent MVDR beamformer.
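For concreteness, the following Python sketch transcribes Algorithm 8.1 under the reconstruction given above, working per frequency bin with complex-valued quantities. It is only a sketch: the eigendecomposition helper, the variable names and the choice of \rho slightly above the largest eigenvalue are our assumptions, not part of the original implementation.

```python
import numpy as np

def signal_subspace(R, M):
    # spectral decomposition of R: keep the M principal eigenvectors/eigenvalues
    # and estimate the noise power from the remaining N - M eigenvalues
    vals, vecs = np.linalg.eigh(R)               # ascending order
    U, Lam = vecs[:, ::-1][:, :M], vals[::-1][:M]
    sigma2 = vals[:-M].mean() if R.shape[0] > M else 0.0
    return U, Lam, sigma2

def estimate_channel(U, Lam, D, max_iter=50, tol=1e-8):
    # iterative WOPP solution of (8.13); D is the (N, k) steering matrix
    M, k = U.shape[1], D.shape[1]
    A = np.diag(np.sqrt(Lam)) @ U.conj().T @ D   # A = Lam^{1/2} U^H D
    V = np.eye(M, k, dtype=complex)
    rho = Lam.max() * 1.01                       # makes rho*I - Lam positive definite
    phi_prev = -np.inf
    for _ in range(max_iter):
        # (a) optimal diagonal scaling Gamma for the current V
        gamma = np.array([(A[:, i].conj() @ V[:, i]) /
                          (V[:, i].conj() @ (Lam * V[:, i])) for i in range(k)])
        # (b) majorization step: update V on the Stiefel manifold via an SVD
        C = A * gamma + (np.abs(gamma) ** 2) * ((rho - Lam)[:, None] * V)
        Uc, _, Vch = np.linalg.svd(C, full_matrices=False)
        V = Uc @ Vch
        # (c) monitor the objective Phi and stop when it no longer increases
        phi = sum(2 * np.real(gamma[i] * (V[:, i].conj() @ A[:, i]))
                  - np.abs(gamma[i]) ** 2
                  * np.real(V[:, i].conj() @ (Lam * V[:, i])) for i in range(k))
        if abs(phi - phi_prev) < tol:
            break
        phi_prev = phi
    # estimated transfer functions h_i = U Lam^{1/2} v_i (columns) and scales Gamma
    return U @ (np.sqrt(Lam)[:, None] * V), gamma
```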

8.4 Simulation Results

The adaptive beamformer is implemented in the frequency domain with an overlap-add 2048-point FFT filterbank and a sampling frequency of f_s = 44100 Hz. The non-uniform linear array consists of 8 microphones, as depicted in Figure 8.3. For the evaluation of the proposed algorithm, two different simulation cases have been chosen.

Case I: In the first scenario, two speech signals in a reverberant room with T_{60} = 100 ms (T_{60} is the reverberation time³) have been assumed. The origin of the coordinate system is at the center of the microphone array. The desired speaker and the interferer are placed at the coordinates [0, 2.5 m, 0] and [3 m, 2.5 m, 0], respectively. A mismatch of less than 2 dB between the microphone transfer functions is assumed.


Figure 8.1: Proposed MVDR (top), conventional MVDR (middle) and super-directive beamformer responses to the signal and the interference.

³T_{60} is the time it takes for the sound intensity to drop 60 dB below the level of the original sound after the sound source ceases.

The beampattern can be defined as BP(h(f)) = |w^H h(f)|^2, and the beampattern values at h = h_1 (the desired signal transfer function) and h = h_2 (the interferer) can be interpreted as the beamformer responses to the desired signal and the interferer signal, respectively. These responses are shown in Figure 8.1 for the proposed, the conventional MVDR and the super-directive beamformer. Super-directive beamformers assume diffuse noise, which is a very common model for reverberant environments. Note that the beampattern at h = h_1 can be interpreted as SNR_out/SNR_in, where SNR is the signal-to-noise ratio, and at h = h_2, BP(h_2(f)) = INR_out/INR_in, with INR being the interferer-to-noise ratio. According to these two equalities, the beampattern can be seen as a performance evaluation measure: the higher its value at h_1, the larger the SNR improvement, and the lower its value at h_2(f), the stronger the interference attenuation. The average of BP(h_2(f))/BP(h_1(f)) over all frequencies is called the relative mean attenuation and is shown with a horizontal line for all beamformers in Figure 8.1. As can be seen, the fixed super-directive beamformer can attenuate the interference by up to 15 dB on average, while the MVDR with the constraint updated by our algorithm achieves 7 dB more interference reduction. Also, the beamformer response to h_1 shows fewer fluctuations and thus less signal distortion for the updated MVDR than for the super-directive beamformer. However, as expected, the conventional MVDR response to the target TF shows that it severely distorts the desired signal.
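The beampattern-based measures used in this comparison are straightforward to evaluate numerically; a possible sketch (with our own helper names, assuming per-frequency weight and transfer-function arrays):

```python
import numpy as np

def beampattern(w, h):
    # BP(h(f)) = |w^H h(f)|^2 for a single frequency bin
    return np.abs(w.conj() @ h) ** 2

def relative_mean_attenuation_db(W, H1, H2):
    # average of BP(h2(f)) / BP(h1(f)) over frequency, expressed in dB;
    # W, H1, H2 are (n_freq, N) arrays of weights, target TFs and interferer TFs
    bp1 = np.abs(np.einsum('fn,fn->f', W.conj(), H1)) ** 2
    bp2 = np.abs(np.einsum('fn,fn->f', W.conj(), H2)) ** 2
    return 10.0 * np.log10(np.mean(bp2 / (bp1 + 1e-12)) + 1e-12)
```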

Case II: In the second scenario, the performance of the traditional and the proposed MVDR beamformer in the presence of a large target steering vector error has been studied. Figure 8.2 shows the beampattern of the MVDR beamformer with the updated constraint (the proposed method) versus the traditional MVDR beamformer at 1 kHz. An error of -15° in both the interference and the signal steering vectors has been assumed; therefore, both columns of the steering matrix in (8.13) are imprecise. Nevertheless, as can be seen in Figure 8.2, the updated MVDR achieves both robustness against the -15° steering vector error and a high interference reduction of around 30 dB at the interference direction. The solid line also reveals a slight shift of the null position (from 40° to 33°), which leads to about 10 dB degradation in interference reduction (from 40 dB to 30 dB). Further experiments have shown that the proposed algorithm is robust against even larger target direction errors at the expense of this shift in the null position, which can be seen as a trade-off between noise reduction and robustness.


Figure 8.2: Beampattern of the updated-constraint MVDR and the traditional MVDR at f = 1000 Hz. Vertical lines mark the source and the interferer positions. The signal vector as well as the interferer vector are assumed to have an error of -15 degrees.

The proposed algorithm prevents signal attenuation by shifting the main lobe to the correct azimuth, while the performance of the traditional MVDR degrades dramatically. The main advantage of this method over other similar methods of robust constraint-set design such as [ZXG06] is that it does not widen the main lobe (which would result in a reduction of the spatial resolution), but rather shifts it to the correct angle by extracting information from the covariance matrix.

8.5 Experiments with ETHDigits Dataset

Throughout this section, we used the ETHDigits dataset for the evaluation of the proposed algorithms. ETHDigits is an audio-visual dataset recorded in a highly reverberant office room with a T_{60} (reverberation time) longer than 1 second. The audio data was captured by 8 microphones arranged as in Figure 8.3, and RGB video signals with a resolution of 640x480 pixels per frame were recorded by a Kinect for Xbox 360. Unlike the datasets used in the previous chapters, we did not record the visual data under controlled conditions. Depending on how many lights were on during a recording session, what time of day it was and whether the sun was shining, the amount of light in the room may vary largely, which in turn results in a large variation of the visual data quality. The ETHDigits dataset was recorded from 15 speakers and each speaker repeated a sequence of numbers from one to ten 5 times (with no randomization). The numbers were presented on a computer screen located about 45 centimeters away from the talkers, and both the Kinect and the microphone array were mounted on top of this screen.

[Figure 8.3: microphones 1-8 with inter-microphone spacings 9d, 3d, 1d, 1d, 1d, 3d, 9d (total length 27d).]

Figure 8.3: Linear microphone array with non-uniform spacing between microphones. d is 3 cm and the total length of the microphone array is 81 cm. The microphones 1, 2, 7 and 8 constitute a uniformly spaced four-channel microphone array used in some experiments.
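For reference, a small sketch of the array geometry of Figure 8.3 and of an ideal far-field steering vector; the broadside-referenced angle convention and the speed-of-sound value are our assumptions, not taken from the thesis.

```python
import numpy as np

def mic_positions(d=0.03):
    # x-coordinates (in metres) of the 8 microphones of Figure 8.3:
    # spacings 9d, 3d, 1d, 1d, 1d, 3d, 9d with d = 3 cm (total length 27d = 81 cm)
    spacings = np.array([9, 3, 1, 1, 1, 3, 9]) * d
    return np.concatenate(([0.0], np.cumsum(spacings)))

def steering_vector(freq, theta, positions, c=343.0):
    # far-field steering vector d = [exp(-j w tau_1), ..., exp(-j w tau_N)]
    # for a plane wave from angle theta (radians, 0 = broadside)
    tau = positions * np.sin(theta) / c        # relative delays per microphone
    return np.exp(-2j * np.pi * freq * tau)
```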

In the first set of experiments, we evaluated the performance of the proposed beamforming algorithm (updated MVDR). Figure 8.4 shows the performance of the audio-only recognizer for each speaker when: (I) the updated MVDR algorithm was applied to the output of our 8-channel microphone array to enhance the SNR of the audio signal fed to the A-ASR, (II) only the output of one microphone was passed to the A-ASR, and (III) the updated MVDR was applied to the output of a 4-channel microphone array and its output was then fed to the A-ASR. The 4-channel microphone array was simply constructed by considering only the outputs of 4 of the 8 microphones, as shown in Figure 8.3.

As expected, the updated MVDR beamforming method with the 8-channel microphone array outperforms both the single-channel and the 4-channel system over all speakers, except speaker 15. Using the 8-channel microphone array yields 97.21% ± 1.41 average recognition accuracy over the 15 speakers, compared with 95.38% ± 1.66 and 96.55% ± 1.60 for the single-channel and the 4-channel system, respectively. Figure 8.5 shows the performance of the visual speech recognizer when ISCN, MAPP and ISCN+MAPP (as discussed in Chapter 6) features are used to represent the visual data. The average recognition accuracies of the ISCN-, MAPP- and ISMA-based (ISCN+MAPP) V-ASR over the 15 speakers are 35.25% ± 4.24, 44.40% ± 2.98 and 46.88% ± 3.95, respectively. Similar to the Oulu, AVletter and GRID datasets in Chapter 6, the combination of ISCN and MAPP features outperforms each of them individually, suggesting that ISCN and MAPP are complementary feature sets.


Figure 8.4: Recognition accuracy per speaker on ETHDigits for the 8-channel, 4-channel and single-channel systems. In the 8-channel and 4-channel systems the updated MVDR is applied to enhance the audio signal quality.

In fact, it can be observed in Figure 8.5 that when the ISCN and MAPP features obtain almost the same accuracy, their combination achieves a significantly higher recognition rate than each of them alone. For instance, the V-ASR with ISMA features achieves about 65% accuracy on speaker 14, which is 15% higher than the performance of ISCN and MAPP individually. The same trend can be seen for speakers 7, 9 and 11.

The recognition rate of the audio-visual recognizer for each speaker is shown in Figure 8.6. The AV-ASR performance has been reported for two fusion strategies: (I) audio and visual features (ISMA features were taken as visual features) had equal weights, and (II) an individual weight was assigned to each feature of the audio-visual feature set. In the latter case the weights were estimated by means of the AUC-based approach developed in Chapter 7.


Figure 8.5: Recognition accuracy per speaker on ETHDigits for visual-only speech recognizer with ISCN, MAPP and ISCN+MAPP (ISMA) features.

It is clear that the weighted audio-visual fusion largely outperforms the simple uniform weight strategy. The weighted AV-ASR reaches 95.54% ± 1.5 accuracy, compared with only 70.78% ± 4.51 for the uniform strategy. As already discussed in Chapter 7, using the one-weight-per-feature strategy increases the robustness of the system under mismatching training and test conditions by assigning higher weights to more robust features, i.e., features with less variation over different speakers. By reducing the effect of unreliable features, the AV-ASR can work reasonably accurately even when one modality completely fails. This can, for instance, be observed in the case of speaker 4, where the visual-only recognizer hardly performs better than random guessing while the AV-ASR reaches 89% accuracy for this speaker.


Figure 8.6: Recognition accuracy per speaker on ETHDigits for AV-ASR when all features have equal weights (uniform weight strategy) and when one individual weight is assigned to each feature. These weights are estimated with the AUC based weight strategy presented in Algorithm 7.2.

8.6 Conclusion

In this chapter, we developed an algorithm to approximate the channel matrix of the room from the covariance matrix of the multichannel audio data. It was shown that this problem can be formulated as an instance of the weighted Procrustes problem. The estimated transfer function is then used to develop a beamforming method that is more robust against the signal cancellation problem. Using the proposed beamforming method yields a 2% performance improvement of the audio-only speech recognizer when the microphone array consists of 8 microphones, and 1.11% when a 4-channel microphone array is utilized. The ETHDigits dataset, which includes eight-channel audio data and video data from an RGB camera, is then used to evaluate the AV-ASR system with a beamforming block. The room was a highly reverberant office room and the visual data had a fairly low quality compared with the Oulu and GRID datasets used in Chapter 6. The experiments showed that even though the visual speech recognizer is unreliable in this adverse condition, with a proper audio-visual fusion strategy the AV-ASR can obtain 95.54% accuracy.

Chapter 9

Conclusion

9.1 Achievements

In this thesis we have developed several feature selection and boosting methods which can be used to construct robust audio-visual speech recognition and voice activity detection systems.

The main advantage of our proposed feature selection method, COBRA, is that it guarantees a non-zero lower bound on the normalized score¹ of the selected feature set, which can be interpreted as a goodness measure of the selected feature set.

Since appearance-based visual features are highly speaker dependent, we used ISCN feature extraction, which generates translation- and scale-invariant features. Its drawback, however, is that it extracts tens of thousands of features from a relatively small image. Thus, we employed COBRA in order to reduce the dimensionality of the ISCN feature vector. Statistical models built upon the selected features showed superior robustness against speaker variations and lighting conditions.

Boosting methods are based on the idea of creating a highly accurate classifier by combining many weak and inaccurate classifiers. Boosting can be seen as a meta-algorithm that maintains a distribution over the sample space. At each iteration a weak hypothesis minimizing the weighted loss is learned and the weights (over the samples) are updated accordingly.

¹normalized with respect to the score of the optimal feature set

The output (strong hypothesis) is a convex combination of the weak hypotheses. Unlike most of the previously suggested boosting methods, in our boosting framework, MABoost, the booster has direct control over the weights, making it more suitable for boosting problems subject to distribution constraints. We derived several theoretically and practically appealing algorithms (including SparseBoost, smooth boosting, Mu-MABoost, etc.) and, more importantly, provided proof techniques that can be used to translate many other online learning algorithms into boosting methods.

By means of MABoost, we developed two practically important applications in speech processing: a reliable voice activity detector and a robust lipreading system. Our proposed voice activity detector can be trained in a semi-supervised manner, which is an important requirement in some applications. Our lipreading system is based on the generalization of the MABoost framework to the multiclass setting. It is a decision-tree-based lipreading system which is highly suitable for the sparse features used in this system.

Finally, we devised an information fusion strategy based on AUC maximization. We showed that in this method the weights assigned to the likelihood values should be optimized with respect to a more robust criterion than a simple accuracy rate. We utilized the AUC to estimate the likelihood weights and showed that under some simplifying assumptions this criterion can be seen as the expectation of the accuracy rate with respect to uniformly distributed mismatch error. Therefore, maximizing it may result in a system that is more robust against mismatch conditions. Our experiments showed that this fusion strategy leads to a robust digit classification system with an accuracy rate favorably comparable with adaptive systems.

9.1.1 Perspectives

This thesis should be seen as a rather broad investigation of various robust visual features and machine learning techniques aiming to improve audio-visual speech recognition systems. Even though our current implementation of AV-ASR needs more refinement in order to be used in a commercial product, my inner Jules Verne believes that in the near future there will be continuous AV-ASR based applications with reasonable accuracy available on smart phones and in intelligent vehicles. A realization of a reasonable lipreading system should at least satisfy the following requirements:

1. It should work reasonably well in the speaker-independent mode.
2. It should be robust against illumination variations.
3. It should be able to adapt itself to particular users.

and a successful AV-ASR system should at least:

1. have a robust audio-visual utterance detector;
2. enjoy a robust audio-visual information fusion scheme, in order to cope with possible audio or visual modality failures.

The first two requirements were directly investigated and addressed in this thesis. We have shown that employing a pool of scale-invariant features extracted from multiple color spaces yields a high level of robustness against inter-speaker and illumination variations. Two important aspects of this approach are (I) the sparse representation of visual information and (II) the construction of a strong classifier based on decision trees which can take advantage of the feature vector sparsity. While the first property provides robustness against undesired variations, the second property guarantees that the underlying hypothesis can be learned efficiently. Due to these two properties, the proposed method can have further applications including visual emotion recognition, video classification, video search and scene detection.

While the third requirement has not been directly discussed in this thesis, most of the proposed algorithms can easily be modified to take speaker adaptation into account. However, some of the theoretical results and performance guarantees could only be proven for the batch learning setting and not for online learning, which is a requirement in speaker adaptation. Conducting further research on online boosting algorithms with the PAC learning property may generalize our theoretical results to online learning settings.

Audio-visual voice activity detection is a prerequisite in many audio-visual applications. We proposed a robust audio-visual voice activity detector which can be trained in a semi-supervised manner. This interesting property can be achieved by noting the fact that both the audio and the visual signals represent the same underlying event: speech. Following the same line of reasoning, it is plausible to apply this semi-supervised training procedure to other audio-visual applications, including audio-visual speech recognition and audio-visual content search, which are more complex than a simple AV-VAD. Each of these applications, however, may pose extra technical challenges which need to be addressed first. For instance, to apply this technique to audio-visual phoneme recognition, we have to deal with the fact that an audio-based phoneme recognizer achieves a much higher classification accuracy than its video-based counterpart. Thus, labeling the data by iterating over audio- and video-based classifiers may not converge to a meaningful result. One remedy for this problem is to reduce the number of classes by clustering the phonemes into visemes (or even broader speech units than visemes) so that audio- and video-based classifiers yield almost similar classification accuracies over them. Further work in this direction may result in highly desirable (or highly scary!) autonomous artificial intelligence systems (perhaps mono-task systems) which can learn and improve on their own.

Finally, by means of the boosting framework and proof techniques suggested in Chapter 4, we solved two open problems.
First, we showed that it is possible to select only a fraction of the samples (in many datasets half of the samples or even less) at each round of training and still achieve 100% classification accuracy; second, we derived the first proof for the MadaBoost algorithm presented in [DW00].
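The first result can be made concrete with a small sketch: at every round, only the samples that currently carry non-zero weight are handed to the weak learner, so each round touches only a fraction of the training set. The weak-learner interface and the names below are hypothetical and only illustrate the mechanics, not the exact algorithm of this thesis.

    import numpy as np

    def boosting_round(train_weak_learner, X, a, y):
        # X: features, a: labels in {-1, +1}, y: current (possibly sparse) sample weights.
        active = np.flatnonzero(y > 0)               # only a subset of samples is used
        h = train_weak_learner(X[active], a[active], y[active])
        d = -a * h(X)                                # positive entries mark misclassified samples
        gamma = -np.dot(y, d) / max(y.sum(), 1e-12)  # edge of h under the normalized weights
        return h, d, gamma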

9.1.2 Open Problems

Several new problems have also emerged from our work. Over the last two decades it was shown that non-convex loss functions may have some desirable properties such as robustness against labeling noise [LS10] and better scalability [CSWB06]. This raises the question of whether using a non-convex divergence function in MABoost can still result in a provable boosting algorithm.

The second question is related to the weights of the samples in a boosting algorithm. In common boosting methods, the weights constitute a probability distribution over the sample space. However, as shown in this work, this does not seem to be a necessary condition. For instance, in SparseBoost the weights are projected onto a hypercube rather than onto a probability simplex. The main questions are: under which conditions can the boosting weights be projected onto convex sets other than the probability simplex while the boosting algorithm still converges? And what are the characteristics of these convex sets?

Finding answers to these two questions may have big practical and theoretical impacts. It is known that using a non-convex loss function may lead to a boosting method which is robust against labeling noise. Moreover, projecting the weights onto a convex set larger than the probability simplex may enable us to develop an algorithm that can learn the underlying hypothesis from an infinite amount of data in a very efficient manner (in fact, exponentially fast) without using all the samples in the training data, which would take an infinite amount of time. Given an algorithm satisfying these two properties, it is plausible to develop an unsupervised audio-visual algorithm that can be trained on unlabeled data gathered from the Internet.
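For intuition about the two feasible sets discussed above, the sketch below contrasts the Euclidean projection onto the probability simplex with the projection onto the unit hypercube, which is a simple clipping operation. MABoost actually uses Bregman projections, so this is only an illustrative Euclidean analogue.

    import numpy as np

    def project_simplex(v):
        # Euclidean projection onto {w : w >= 0, sum(w) = 1} (standard sort-based rule).
        u = np.sort(v)[::-1]
        css = np.cumsum(u)
        rho = np.nonzero(u - (css - 1.0) / np.arange(1, v.size + 1) > 0)[0][-1]
        theta = (css[rho] - 1.0) / (rho + 1.0)
        return np.maximum(v - theta, 0.0)

    def project_hypercube(v):
        # Euclidean projection onto [0, 1]^N: clip each coordinate independently.
        return np.clip(v, 0.0, 1.0)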

Appendix A

Complementary Fundamentals

This appendix contains proofs of the lemmas and theorems presented in Chapter 4. Before proceeding with the proofs, some definitions and facts need to be recalled.

A.1 Definitions and Preliminaries

Definition A.1 (Margin). Given a final hypothesis $f(x) = \sum_{t=1}^{T} \eta_t h_t(x)$, the margin of a sample $(x_j, a_j)$ is defined as $m(x_j) = a_j f(x_j) / \sum_{t=1}^{T} \eta_t$. Moreover, the margin of a set of examples $\mathcal{D}$, denoted by $m_{\mathcal{D}}$, is the minimum of the margins over the examples, i.e., $m_{\mathcal{D}} = \min_{x_j} m(x_j)$.

Lemma A.2 (Duality between max-margin and min-edge). The minimum edge $\gamma_{\min}$ that can be achieved over all possible distributions of the training set is equal to the maximum margin ($m^* = \max_{\eta} m_{\mathcal{D}}$) of any linear combination of hypotheses from the hypothesis space.

This lemma is discussed in detail in [FS96a] and [RW05]. It is a direct result of von Neumann's minimax theorem and simply means that the maximum achievable margin is $\gamma_{\min}$.

A.2 Proof of Theorem 4.7

The proof of the maximum margin property of MABoost is almost the same as the proof of Theorem 4.6. Let us assume the $i$-th sample has the worst margin, i.e., $m_{\mathcal{D}} = m(x_i)$. Let all entries of the error vector $w^*$ be zero except its $i$-th entry, which is set to 1. Following the same approach as in Theorem 4.6 (see equation (4.13)), we get

$\sum_{t=1}^{T} w^{*\top}\eta_t d_t - \sum_{t=1}^{T} w_t^{\top}\eta_t d_t \le \sum_{t=1}^{T} \frac{1}{2}\eta_t^2\|d_t\|_*^2 + B_{\mathcal{R}}(w^*, w_1) - B_{\mathcal{R}}(w^*, w_{T+1}) \qquad (A.1)$

With our choice of $w^*$, it is easy to verify that the first term on the left side of the inequality is $-m_{\mathcal{D}}\sum_{t=1}^{T}\eta_t = \sum_{t=1}^{T} w^{*\top}\eta_t d_t$. By setting $C = B_{\mathcal{R}}(w^*, w_1)$, ignoring the last term $B_{\mathcal{R}}(w^*, w_{T+1})$, replacing $\|d_t\|_*^2$ with its upper bound $L$ and using the identity $\sum_{t=1}^{T} w_t^{\top}\eta_t d_t = -\sum_{t=1}^{T}\eta_t\gamma_t$, the above inequality simplifies to

$-m_{\mathcal{D}}\sum_{t=1}^{T}\eta_t \le \frac{1}{2}L\sum_{t=1}^{T}\eta_t^2 - \sum_{t=1}^{T}\eta_t\gamma_t + C \qquad (A.2)$

Replacing $\eta_t$ with the value suggested in Theorem 4.7, i.e., $\eta_t = \frac{\gamma_t}{L\sqrt{t}}$, and dividing both sides by $\sum_{t=1}^{T}\eta_t$ gives

$\frac{\sum_{t=1}^{T}\left(\frac{1}{\sqrt{t}} - \frac{1}{t}\right)\gamma_t^2}{\sum_{t=1}^{T}\frac{1}{\sqrt{t}}\gamma_t} - \frac{LC}{\sum_{t=1}^{T}\frac{1}{\sqrt{t}}\gamma_t} \le m_{\mathcal{D}} \qquad (A.3)$

The first term is minimized when $\gamma_t = \gamma_{\min}$. Similarly to the first term, the second term is maximized when $\gamma_t$ is replaced by its minimum value. This gives the following lower bound for $m_{\mathcal{D}}$:

$\gamma_{\min}\frac{\sum_{t=1}^{T}\left(\frac{1}{\sqrt{t}} - \frac{1}{t}\right)}{\sum_{t=1}^{T}\frac{1}{\sqrt{t}}} - \frac{LC}{\gamma_{\min}\sum_{t=1}^{T}\frac{1}{\sqrt{t}}} \le m_{\mathcal{D}} \qquad (A.4)$

Considering the facts that $\int_{1}^{T+1}\frac{dx}{\sqrt{x}} \le \sum_{t=1}^{T}\frac{1}{\sqrt{t}}$ and $1 + \int_{1}^{T}\frac{dx}{x} \ge \sum_{t=1}^{T}\frac{1}{t}$, we get

$\gamma_{\min} - \gamma_{\min}\frac{1+\log T}{2\sqrt{T+1}-2} - \frac{LC}{\gamma_{\min}(\sqrt{T+1}-1)} \le m_{\mathcal{D}} \qquad (A.5)$

Now, by taking $\nu = \gamma_{\min}\frac{1+\log T}{2\sqrt{T+1}-2} + \frac{LC}{\gamma_{\min}(\sqrt{T+1}-1)}$, we have $\gamma_{\min} - \nu \le m_{\mathcal{D}} \le \gamma_{\min}$. It is clear from (A.5) that $\nu$ approaches zero as $T$ tends to infinity, with a convergence rate proportional to $\frac{\log T}{\sqrt{T}}$. It is noteworthy that this convergence rate is slightly worse than that of TotalBoost, which is $O(\frac{1}{\sqrt{T}})$.

A.3 Proof of Lemma 4.8

Remember that $\hat{\Pi}_{\mathcal{S}}(y) = \Pi_{\mathcal{S}}\big(\Pi_{\mathcal{K}}(y)\big)$. Our goal is to show that $B_{\mathcal{R}}(x, y) \ge B_{\mathcal{R}}\big(x, \hat{\Pi}_{\mathcal{S}}(y)\big)$. To this end, we only need to repeatedly apply Lemma 4.2, as follows:

$B_{\mathcal{R}}(x, y) \ge B_{\mathcal{R}}\big(x, \Pi_{\mathcal{K}}(y)\big) \qquad (A.6)$
$B_{\mathcal{R}}\big(x, \Pi_{\mathcal{K}}(y)\big) \ge B_{\mathcal{R}}\big(x, \hat{\Pi}_{\mathcal{S}}(y)\big) \qquad (A.7)$

which completes the proof.

A.4 Proof of the Boosting Algorithm for Combined Datasets

We have to show that when the convex set is defined as

$\mathcal{S}_c = \Big\{ w \;\Big|\; \sum_{i=1}^{N} w_i = 1,\; 0 \le w_i \;\forall i \in \mathcal{A} \;\wedge\; 0 \le w_i \le \tfrac{k}{N} \;\forall i \in \mathcal{B} \Big\} \qquad (A.8)$

the error of the final hypothesis on $\mathcal{A}$, i.e., $\epsilon_{\mathcal{A}}$, converges to zero, while the error on $\mathcal{B}$ is guaranteed to be $\epsilon_{\mathcal{B}} \le \frac{1}{k}$.

First, we show the convergence of $\epsilon_{\mathcal{A}}$ to zero. This is easily obtained by setting $w^*$ to be an error vector with zero weights over the training samples from $\mathcal{B}$ and $\frac{1}{\epsilon_{\mathcal{A}} N_{\mathcal{A}}}$ weights over the training set $\mathcal{A}$. One can verify that $w^* \in \mathcal{S}_c$.

Thus, the proof of Theorem 4.6 holds and, subsequently, the error bound in (4.8) states that $\epsilon_{\mathcal{A}} \to 0$ as the number of iterations increases.

To show the second part of the theorem, that is $\epsilon_{\mathcal{B}} \le \frac{1}{k}$, the vector $w^*$ is selected to be an error vector with zero weights over the training samples from $\mathcal{A}$ and $\frac{1}{\epsilon_{\mathcal{B}} N_{\mathcal{B}}}$ weights over the training set $\mathcal{B}$. Note that, as long as $\epsilon_{\mathcal{B}}$ is greater than $\frac{1}{k}$, this $w^* \in \mathcal{S}_c$. Thus, for all $\frac{1}{k} \le \epsilon_{\mathcal{B}}$ the proof of Theorem 4.6 holds and, as the bounds in (4.8) show, the error decreases as the number of iterations increases. In particular, in a finite number of rounds, the classification error on $\mathcal{B}$ reduces to $\frac{1}{k}$, which completes the proof.

A.5 Proof of Theorem 4.9

We use proof techniques similar to those given in [DSST10], with a slight change to take the normalization step into account.

By substituting $z_{t+1}$ from the update step into the projection step, the projection step can be rewritten as

$y_{t+1} = \arg\min_{y \in \mathbb{R}_+^N}\; \frac{1}{2}\|y - y_t\|^2 - \eta_t y^{\top} d_t + \alpha_t\eta_t\|y\|_1 \qquad (A.9)$

This optimization problem can be highly simplified by noting that the variables are not coupled. Thus, each coordinate can be optimized independently. In other words, it can be decoupled into $N$ independent 1-dimensional optimization problems.

$y_{t+1}^i = \arg\min_{0 \le y_i}\; \frac{1}{2}|y_i - y_t^i|^2 - \eta_t y_i d_t^i + \alpha_t\eta_t y_i \qquad (A.10)$

The solution of (A.10) can be written as

$y_{t+1}^i = \max\big(0,\; y_t^i + \eta_t d_t^i - \alpha_t\eta_t\big) \qquad (A.11)$

This simple solution gives a very efficient and simple implementation of SparseBoost. From (A.10) it is clear that for $d_t^i < 0$ (i.e., when the $i$-th sample is classified correctly), $-\eta_t y_i d_t^i$ acts as an $\ell_1$ norm regularizer and pushes $y_{t+1}^i$ towards zero, while $\alpha_t\eta_t$ enhances sparsity by pushing all weights towards zero. Let $w^*$ be the same error vector as defined in Theorem 4.6. We start this proof by again deriving the progress bounds on each step of the algorithm.
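Assuming the step size $\eta_t$ and the sparsity parameter $\alpha_t$ are given, the rule (A.11) is a one-line, coordinate-wise update; the Python sketch below is only an illustration of this rule, not the full SparseBoost algorithm.

    import numpy as np

    def sparseboost_update(y, d, eta, alpha):
        # y: current weights, d: per-sample costs of round t (negative for correctly
        # classified samples), eta: step size, alpha: sparsity parameter.
        # Gradient-like step followed by clipping at zero, as in (A.11).
        return np.maximum(0.0, y + eta * d - alpha * eta)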

The optimality of yt+1 for (A.9) implies that

$(w^* - y_{t+1})^{\top}\big(-\eta_t d_t + \alpha_t\eta_t r'(y_{t+1}) + y_{t+1} - y_t\big) \ge 0 \qquad (A.12)$

where $r'(y)$ is a sub-gradient vector of the $\ell_1$ norm function $r(y) = \sum_{i=1}^{N} y_i$. Moreover, due to the convexity of $r(y)$, we have

$\alpha_t\eta_t\, r'(y_{t+1})^{\top}(w^* - y_{t+1}) \le \alpha_t\eta_t\big(r(w^*) - r(y_{t+1})\big) \qquad (A.13)$

We thus have

$(w^* - y_t)^{\top}\eta_t d_t + \alpha_t\eta_t\big(r(y_{t+1}) - r(w^*)\big)$
$\le (w^* - y_t)^{\top}\eta_t d_t + \alpha_t\eta_t (y_{t+1} - w^*)^{\top} r'(y_{t+1})$
$= (w^* - y_{t+1})^{\top}\eta_t d_t + \alpha_t\eta_t (y_{t+1} - w^*)^{\top} r'(y_{t+1}) + (y_{t+1} - y_t)^{\top}\eta_t d_t$
$= (w^* - y_{t+1})^{\top}\big(\eta_t d_t - \alpha_t\eta_t r'(y_{t+1}) - y_{t+1} + y_t\big) + (w^* - y_{t+1})^{\top}(y_{t+1} - y_t) + (y_{t+1} - y_t)^{\top}\eta_t d_t \qquad (A.14)$

where the first inequality follows from (A.13). Now, from the optimality condition in (A.12), the first term in the last equation is non-positive and thus can be ignored.

$(w^* - y_t)^{\top}\eta_t d_t + \alpha_t\eta_t\big(r(y_{t+1}) - r(w^*)\big)$
$\le (w^* - y_{t+1})^{\top}(y_{t+1} - y_t) + (y_{t+1} - y_t)^{\top}\eta_t d_t$
$= \frac{1}{2}\|w^* - y_t\|_2^2 - \frac{1}{2}\|y_{t+1} - y_t\|_2^2 - \frac{1}{2}\|w^* - y_{t+1}\|_2^2 + (y_{t+1} - y_t)^{\top}\eta_t d_t$
$\le \frac{1}{2}\|w^* - y_t\|_2^2 - \frac{1}{2}\|y_{t+1} - y_t\|_2^2 - \frac{1}{2}\|w^* - y_{t+1}\|_2^2 + \frac{1}{2}\|y_{t+1} - y_t\|_2^2 + \frac{1}{2}\eta_t^2\|d_t\|_*^2 \qquad (A.15)$

where the first equation follows from Lemma 4.3 (or direct algebraic expansion in this case) and the second inequality from Lemma 4.5. By summing the left and right sides of the inequality from $1$ to $T$, replacing $\|d_t\|_*^2$ with its upper bound $N$ and substituting $1$ for $r(w^*)$, we get

$\sum_{t=1}^{T} w^{*\top}\eta_t d_t \le \sum_{t=1}^{T} y_t^{\top}\eta_t d_t + \sum_{t=1}^{T}\frac{N}{2}\eta_t^2 + \frac{1}{2}\|w^* - y_1\|_2^2 + \sum_{t=1}^{T}\alpha_t\eta_t\big(1 - r(y_{t+1})\big) \qquad (A.16)$

Now, replacing $r(y_{t+1})$ with its lower bound, i.e., $0$, and using the facts that $\sum_{t=1}^{T} w^{*\top}\eta_t d_t \ge 0$ (as shown in (4.14)) and $\sum_{t=1}^{T} y_t^{\top}\eta_t d_t = -\sum_{t=1}^{T}\eta_t\gamma_t\|y_t\|_1$, yields

$0 \le -\sum_{t=1}^{T}\eta_t\gamma_t\|y_t\|_1 + \sum_{t=1}^{T}\frac{N}{2}\eta_t^2 + \frac{1}{2}\|w^* - y_1\|_2^2 + \sum_{t=1}^{T}\alpha_t\eta_t \qquad (A.17)$

Taking the derivative w.r.t. $\eta_t$ and setting it to zero gives the optimal $\eta_t$ as follows:

$\eta_t = \frac{\gamma_t\|y_t\|_1 - \alpha_t}{N} \qquad (A.18)$

This equation implies that $\alpha_t$ should be smaller than $\gamma_t\|y_t\|_1$, since otherwise $\eta_t$ becomes smaller than zero. Setting $\alpha_t = (1-k)\gamma_t\|y_t\|_1$, where $k$ is a constant smaller than or equal to $1$, results in $\eta_t = \frac{k}{N}\gamma_t\|y_t\|_1$. Replacing this value for $\eta_t$ in (A.17) and noting that $\frac{1}{2}\|w^* - y_1\|_2^2 = \frac{1-\epsilon}{2N\epsilon}$ gives the following bound on the training error:

$\epsilon \le \frac{1}{1 + c\sum_{t=1}^{T}\gamma_t^2\|y_t\|_1^2} \qquad (A.19)$

where $c = k^2$ is a constant factor depending on the choice of $\alpha_t$. To prove that $\epsilon$ approaches zero as $T$ increases, we still have to provide evidence that $\sum_{t=1}^{T}\gamma_t^2\|y_t\|_1^2$ is a divergent series. There are different possibilities to approach this problem. Here, we show that in the case of $\alpha_t = 0$, the $\ell_1$ norm of the weights, $\|y_t\|_1$, can be bounded away from zero (i.e., $\|y_t\|_1 \ge C > 0$) and thus $\sum_{t=1}^{T}\gamma_t^2\|y_t\|_1^2 \ge T\gamma_{\min}^2 C^2$. To this end, we rewrite $y_t^i$ from (A.11) as

$y_t^i = \max\big(0,\; y_{t-1}^i + \eta_{t-1} d_{t-1}^i - \alpha_{t-1}\eta_{t-1}\big)$
$\ge y_{t-1}^i + \eta_{t-1} d_{t-1}^i - \alpha_{t-1}\eta_{t-1}$
$\ge \frac{1}{N} + \sum_{t'=1}^{t-1}\eta_{t'} d_{t'}^i - \sum_{t'=1}^{t-1}\alpha_{t'}\eta_{t'} \qquad (A.20)$

where the last inequality is achieved by recursively applying the first inequality to $y_{t-1}^i$. At any arbitrary round $t$, either the algorithm has already converged and $\epsilon = 0$, or there is at least one sample that is classified wrongly by the ensemble classifier $H_t(x) = \sum_{l=1}^{t}\eta_l h_l(x)$. Now, without loss of generality, assume the $i$-th sample is wrongly classified at round $t$. That is, $\sum_{t'=1}^{t-1}\eta_{t'} d_{t'}^i > 0$ (see (4.14)). Now, for $\alpha_t = 0$, the weight of the wrongly classified sample $i$ is

$y_t^i \ge \frac{1}{N} + \sum_{t'=1}^{t-1}\eta_{t'} d_{t'}^i \ge \frac{1}{N} \qquad (A.21)$

That is, $\|y_t\|_1 \ge \frac{1}{N}$. This gives a loose (but sufficient for our purpose) lower bound on $\|y_t\|_1$. Replacing $\|y_t\|_1$ with its lower bound $\frac{1}{N}$ in (A.19) yields

$\epsilon \le \frac{N^2}{1 + T\gamma^2} \qquad (A.22)$

where $\gamma$ is the minimum edge over all $\gamma_t$.

A.6 Proof of Entropy Projection onto Hypercube

Lemma A.3. Let $\mathcal{R}(w) = \sum_{i=1}^{N} w_i\log w_i - w_i$. Then the Bregman projection of a positive vector $z \in \mathbb{R}_+^N$ onto the unit hypercube $\mathcal{K} = [0, 1]^N$ is $y_i = \min(1, z_i),\; i = 1, \ldots, N$.

To show the correctness of the above lemma, i.e., that the solution of the Bregman projection

$y = \arg\min_{y \in \mathcal{K}} B_{\mathcal{R}}(y, z) \qquad (A.23)$

is $y_i = \min(1, z_i)$, we only need to show that $y$ satisfies the optimality condition

$(v - y)^{\top}\nabla B_{\mathcal{R}}(y, z) \ge 0 \quad \forall v \in \mathcal{K} \qquad (A.24)$

Given $\mathcal{R}(w) = \sum_{i=1}^{N} w_i\log w_i - w_i$, the gradient of $B_{\mathcal{R}}$ is

$\nabla B_{\mathcal{R}}(y, z) = \Big[\log\frac{y_i}{z_i}\Big]_{i=1}^{N} \qquad (A.25)$

Hence,

$(v - y)^{\top}\nabla B_{\mathcal{R}}(y, z) = \sum_{i \in \{i:\, z_i \ge 1\}} (v_i - y_i)\log\frac{y_i}{z_i} + \sum_{i \in \{i:\, z_i < 1\}} (v_i - y_i)\log\frac{y_i}{z_i} \qquad (A.26)$

For $z_i \ge 1$, $y_i$ is equal to $1$. That is, $\log\frac{y_i}{z_i} = \log\frac{1}{z_i} \le 0$. On the other hand, since $v_i \le 1$, $(v_i - y_i) = (v_i - 1) \le 0$. Thus, the first sum in (A.26) is always non-negative. The second sum is always zero since $\log\frac{y_i}{z_i} = \log 1 = 0$. That is, the optimality condition (A.26) is non-negative for all $v$, which completes the proof.
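The closed form of Lemma A.3 can also be checked numerically. The sketch below (assuming numpy and scipy are available) compares $\min(1, z_i)$ against a box-constrained numerical minimizer of the Bregman divergence; it is only a sanity check, not part of the proof.

    import numpy as np
    from scipy.optimize import minimize

    def bregman_div(y, z):
        # B_R(y, z) for R(w) = sum_i (w_i log w_i - w_i), with the convention 0 log 0 = 0.
        y = np.asarray(y, dtype=float)
        return np.sum(np.where(y > 0, y * np.log(y / z), 0.0) - y + z)

    z = np.array([0.3, 1.7, 0.9, 4.0])
    closed_form = np.minimum(1.0, z)          # Lemma A.3
    numeric = minimize(lambda y: bregman_div(y, z), x0=np.full(z.size, 0.5),
                       bounds=[(1e-9, 1.0)] * z.size, method="L-BFGS-B").x
    # closed_form and numeric agree up to the solver tolerance.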

A.7 Proof of Theorem 4.11

Its proof is essentially the same as the proof of the lazy version of MABoost, with a few differences. Before proceeding further, some definitions and facts should be re-emphasized. First of all, since $\mathcal{R}(w) = \sum_{i=1}^{N} w_i\log w_i - w_i$ is $\frac{1}{N}$-strongly convex (see [Sha12, p. 136]) with respect to the $\ell_1$ norm (and not $1$-strongly convex as in Theorem 4.6), the following inequality holds for the Bregman divergence:

$B_{\mathcal{R}}(x, y) \ge \frac{1}{2N}\|x - y\|_1^2 \qquad (A.27)$

Moreover, the following lemma, which bounds $\|y_t\|_1$, is essential for our proof.

Lemma A.4. For all $t$, $\|y_t\|_1 \ge N\epsilon_t$, where $\epsilon_t$ is the error of the ensemble hypothesis $H_t(x) = \sum_{l=1}^{t}\eta_l h_l(x)$ at round $t$.

This lemma holds due to the fact that

$y_t^i = \min(1, z_t^i) = \min\big(1, e^{\sum_{l=1}^{t}\eta_l d_l^i}\big) = \min\big(1, e^{-a_i H_t(x_i)}\big) \qquad (A.28)$

where $H_t(x) = \sum_{l=1}^{t}\eta_l h_l(x)$ is the output of the algorithm at round $t$. If $H_t(x_i)$ makes a mistake on classifying $x_i$, then $-a_i H_t(x_i)$ will be greater than zero and thus $y_t^i = 1$. For the samples that are classified correctly, $-a_i H_t(x_i) \le 0$ and thus $0 \le y_t^i \le 1$. That is, $N\epsilon_t = $ number of wrongly classified samples at round $t$ $\le \sum_{i=1}^{N} y_t^i = \|y_t\|_1$.

We are now ready to proceed with the proof of Theorem 4.11. Let $w^* = [w_1^*, \ldots, w_N^*]^{\top}$ be a vector where $w_i^* = 1$ if $f(x_i) \ne a_i$, and $0$ otherwise. Similar to the proof of the lazy update, we are going to bound $\sum_{t=1}^{T}(w^* - y_t)^{\top}\eta_t d_t$.

$(w^* - y_t)^{\top}\eta_t d_t = (y_{t+1} - y_t)^{\top}\big(\nabla\mathcal{R}(z_{t+1}) - \nabla\mathcal{R}(z_t)\big) + (z_{t+1} - y_{t+1})^{\top}\big(\nabla\mathcal{R}(z_{t+1}) - \nabla\mathcal{R}(z_t)\big) + (w^* - z_{t+1})^{\top}\big(\nabla\mathcal{R}(z_{t+1}) - \nabla\mathcal{R}(z_t)\big)$
$\le \frac{1}{2N}\|y_{t+1} - y_t\|_1^2 + \frac{N}{2}\eta_t^2\|d_t\|_*^2 + B_{\mathcal{R}}(y_{t+1}, z_{t+1}) - B_{\mathcal{R}}(y_{t+1}, z_t) + B_{\mathcal{R}}(z_{t+1}, z_t) - B_{\mathcal{R}}(w^*, z_{t+1}) + B_{\mathcal{R}}(w^*, z_t) - B_{\mathcal{R}}(z_{t+1}, z_t)$
$\le \frac{1}{2N}\|y_{t+1} - y_t\|_1^2 + \frac{N}{2}\eta_t^2\|d_t\|_*^2 - B_{\mathcal{R}}(y_{t+1}, y_t) + B_{\mathcal{R}}(y_{t+1}, z_{t+1}) - B_{\mathcal{R}}(y_t, z_t) - B_{\mathcal{R}}(w^*, z_{t+1}) + B_{\mathcal{R}}(w^*, z_t) \qquad (A.29)$

where the first inequality follows from applying Lemma 4.5 to the first term and Lemma 4.3 to the rest of the terms, and the second inequality is the result of applying the exact version of Lemma 4.2 to $B_{\mathcal{R}}(y_{t+1}, z_t)$. Moreover, according to inequality (A.27), $B_{\mathcal{R}}(y_{t+1}, y_t) - \frac{1}{2N}\|y_{t+1} - y_t\|_1^2 \ge 0$ and hence these terms can be ignored in (A.29). Summing up inequality (A.29) from $t = 1$ to $T$ yields:

$-B_{\mathcal{R}}(w^*, z_1) \le \sum_{t=1}^{T}\frac{N}{2}\eta_t^2 - \sum_{t=1}^{T}\eta_t\gamma_t\|y_t\|_1 \qquad (A.30)$

It is important to remark that the $\|y_t\|_1$ appearing in the last term is due to the fact that $w_t = \frac{y_t}{\|y_t\|_1}$ and thus $y_t^{\top}\eta_t d_t = w_t^{\top}\eta_t d_t\,\|y_t\|_1 = -\eta_t\gamma_t\|y_t\|_1$. Now, by replacing $\eta_t = \epsilon_t\gamma_t$ in the above equation and noting that $B_{\mathcal{R}}(w^*, z_1) = N - N\epsilon$, we get:

$-N(1-\epsilon) \le \sum_{t=1}^{T}\frac{N}{2}\epsilon_t^2\gamma_t^2 - \sum_{t=1}^{T}\epsilon_t\gamma_t^2\|y_t\|_1 \qquad (A.31)$

From Lemma A.4, it is evident that $\|y_t\|_1 \ge N\epsilon_t$. Moreover, since $\epsilon \le \epsilon_t$, it can be replaced by $\epsilon$ as well (though very pessimistically). As usual, $\gamma_t$ is also replaced by the minimum edge, denoted by $\gamma$. Applying these lower bounds to equation (A.31) yields

$\epsilon^2 \le \frac{2(1-\epsilon)}{T\gamma^2} \le \frac{2}{T\gamma^2} \qquad (A.32)$

which indicates that the proposed version of MadaBoost needs at most $O(\frac{1}{\epsilon^2\gamma^2})$ iterations to converge.
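The weight rule (A.28) behind this proof amounts to a single line of code; the sketch below illustrates the capped weights and the reason Lemma A.4 holds: every sample misclassified by the current ensemble contributes a full weight of one.

    import numpy as np

    def capped_weights(H_scores, a):
        # H_scores: ensemble outputs H_t(x_i); a: labels in {-1, +1}.
        # Lazy weights z_t^i = exp(-a_i H_t(x_i)), capped at 1 by the entropy
        # projection onto the unit hypercube (Lemma A.3), as in (A.28).
        y = np.minimum(1.0, np.exp(-a * H_scores))
        # Misclassified samples have -a_i H_t(x_i) > 0, hence y_i = 1, so
        # ||y||_1 >= (number of misclassified samples) = N * eps_t  (Lemma A.4).
        return y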

A.8 Multiclass Weak-learning Condition

In this appendix, we show that the weak-learning condition adopted for Mu-MABoost is the optimal weak-learning condition in the sense that it is the weakest condition that can be assumed while still guaranteeing the boosting property. To this end, we first show that the weak-learning condition in Definition 3 is equivalent to that of AdaBoost.MR. As before, assume all the samples belong to class 0. Define $\mathcal{W}$ to be the collection of $N \times K$ weight matrices $W$ satisfying the following conditions:

1. $W_{i,j} \ge 0$ for $j \ne 0$
2. $W_{i,0} = -\sum_{j \ne 0} W_{i,j}$

Further, consider the matrix $\mathbf{1}_h$ to be an $N \times K$ matrix whose $(i, j)$-th entry is $1$ if $h(x_i) = j$, and $0$ otherwise. The following equation then describes the AdaBoost.MR weak-learning condition presented in [MS13].

$\forall W \in \mathcal{W},\; \exists h \in \mathcal{H}: \quad W \bullet \mathbf{1}_h \le 0 \qquad (A.33)$

It is straightforward to show that the weak-learning condition in (A.33) is equivalent to that in Definition 3. Let $\mathrm{Corr}$ be the set of indices of the samples correctly classified by hypothesis $h$. By expanding the left side of inequality (A.33), we get

$W \bullet \mathbf{1}_h = \sum_{i=1}^{N} W_{i, h(x_i)} = \sum_{i \in \mathrm{Corr}} W_{i,0} + \sum_{i \notin \mathrm{Corr}} W_{i, h(x_i)}$
$= \sum_{i \notin \mathrm{Corr}} W_{i, h(x_i)} - \sum_{i \in \mathrm{Corr}}\sum_{j \ne 0} W_{i,j} \qquad (A.34)$

The right side of the last equation in (A.34) is equal to $W \bullet D$ in Definition 3. Thus, both AdaBoost.MR and Mu-MABoost adopt the same weak-learning condition. Moreover, according to Theorem 6 in [MS13], this condition is the minimal weak-learning condition that can be adopted such that the boosting algorithm drives the training error to zero.
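The identity (A.34) is easy to verify numerically; the hypothetical sketch below computes the left side of (A.33) directly and via the right side of (A.34) for a weight matrix satisfying the two conditions above, assuming all samples belong to class 0.

    import numpy as np

    def frobenius_form(W, pred):
        # Left side of (A.33): W . 1_h = sum_i W[i, h(x_i)].  W: numpy array of shape (N, K).
        return sum(W[i, p] for i, p in enumerate(pred))

    def edge_form(W, pred):
        # Right side of (A.34); a sample is "correct" here iff it is predicted as class 0.
        corr = np.asarray(pred) == 0
        wrong = sum(W[i, p] for i, p in enumerate(pred) if not corr[i])
        return wrong - W[corr][:, 1:].sum()   # minus the sum over Corr and j != 0

    # For any W with W[i, 0] = -sum_{j != 0} W[i, j], both forms return the same value.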

Bibliography

[AB96] D. W. Aha and R. L. Bankert. A comparative evaluation of sequential feature selection algorithms. In Learning from Data: Artificial Intelligence and Statistics V. Springer-Verlag, 1996.

[Abr63] N. Abramson. Information Theory and Coding. McGraw-Hill, New York, 1963.

[AF06] A. E. Abdel-Hakim and A. A. Farag. CSIFT: A SIFT descriptor with color invariant characteristics. In Proceedings of CVPR, 2006.

[AITT00] Y. Asahiro, K. Iwama, H. Tamaki, and T. Tokuyama. Greedily finding a dense subgraph. Journal of Algorithms, 2000.

[AM08] I. Almajai and B. Milner. Using audio-visual features for robust voice activity detection in clean and noisy speech. In Proceedings of EUSPC, 2008.

[AMM+07] M. Aoki, K. Masuda, H. Matsuda, T. Takiguchi, and Y. Ariki. Voice activity detection by lip shape tracking using EBGM. In Proceedings of the 15th International Conference on Multimedia. ACM, 2007.

[Bat94] R. Battiti. Using mutual information for selecting features in supervised neural net learning. IEEE Trans. on Neural Networks, 5:537–550, 1994.

[BGL02] N. H. Bshouty, D. Gavinsky, and M. Long. On boosting with polynomially bounded distributions. Journal of Machine Learning Research, 2002.

[BLM01] S. Ben-David, P. Long, and Y. Mansour. Agnostic boosting. In COLT. 2001.

[BM98] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of COLT, 1998.

[BM12] J. Bruna and S. Mallat. Invariant scattering convolution networks. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2012.

[Bog07] V. I. Bogachev. Measure Theory. Springer, 2007.

[BP10] K. S. Balagani and V. V. Phoha. On the feature selection criterion based on an approximation of multidimensional mutual information. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2010.

[BPZL12] G. Brown, A. Pocock, M. Zhao, and M. Luján. Conditional likelihood maximisation: A unifying framework for information theoretic feature selection. Journal of Machine Learning, 2012.

[Bre97] L. Breiman. Pasting bites together for prediction in large data sets and on-line. Technical report, Dept. Statistics, Univ. California, Berkeley, 1997.

[Bre99] L. Breiman. Prediction games and arcing algorithms. Neural Computation, 1999.

[Bre01] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

[Bro09] G. Brown. A new perspective for information theoretic feature selection. In Proceedings of Artificial Intelligence and Statistics, 2009.

[BS08] J. K. Bradley and R. E. Schapire. Filterboost: Regression and classification on large datasets. In NIPS. 2008.

[BV04] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004.

[Cas79] K. R. Castleman. Digital Image Processing. Prentice Hall Professional Technical Reference, 1st edition, 1979.

[CBCS06] M. Cooke, J. Barker, S. Cunningham, and X. Shao. An audio-visual corpus for speech perception and automatic speech recognition. The Journal of the Acoustical Society of America, 2006.

[CBL06] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[CDM02] C. C. Chibelushi, F. Deravi, and J. S. Mason. A review of speech-based bimodal recognition. IEEE Trans. on Multimedia, 2002.

[CET01] T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6):681–685, 2001.

[CH97] G. I. Chiou and J. Hwang. Lipreading from color video. IEEE Trans. on Image Processing, 1997.

[CH23] L. Cappelletta and H. Harte. Phoneme-to-viseme mapping for visual speech recognition. In Proceedings of Pattern Recognition Applications and Methods, 2023.

[Che01] T. Chen. Audiovisual speech processing. IEEE Signal Processing Magazine, 2001.

[CHLN08] S. Cox, R. Harvey, Y. Lan, and J. Newman. The challenge of multispeaker lip-reading. In International Conference on Auditory-Visual Speech Processing, 2008.

[CM93] M. Cohen and D. Massaro. Modeling coarticulation in synthetic visual speech. In Models and Techniques in Computer Animation. 1993.

[CM03] C. Cortes and M. Mohri. AUC optimization vs. error rate minimization. In NIPS. MIT Press, 2003.

[CR98] T. Chen and R. R. Rao. Audio-visual integration in multimodal communication. Proceedings of the IEEE, 1998.

[CSO10] P. M. Ciarelli, E. O. T. Salles, and E. Oliveira. An evolving system based on probabilistic neural network. In Proceedings of BSNN, 2010.

[CSWB06] R. Collobert, F. Sinz, J. Weston, and L. Bottou. Trading convexity for scalability. In Proceedings of ICML, 2006.

[CT91] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley-Interscience, New York, NY, USA, 1991.

[DGFVG12] M. Dantone, J. Gall, G. Fanelli, and L. Van Gool. Real-time facial feature detection using conditional regression forests. In CVPR, 2012.

[DL00] S. Dupont and J. Luettin. Audio-visual speech modeling for continuous speech recognition. IEEE Trans. on Multimedia, 2000.

[Doa92] J. Doak. An evaluation of feature selection methods and their application to computer security. Technical Report CSE-92-18, University of California at Davis, 1992.

[Dos10] M. B. Dosse. Anisotropic orthogonal Procrustes analysis. Journal of Classification, 27, 2010.

[DSST10] J. C. Duchi, S. Shalev-Shwartz, Y. Singer, and A. Tewari. Composite objective mirror descent. In COLT, 2010.

[DW00] C. Domingo and O. Watanabe. MadaBoost: A modification of AdaBoost. In COLT, 2000.

[DYXY07] W. Dai, Q. Yang, G. Xue, and Y. Yong. Boosting for transfer learning. In ICML, 2007.

[FA10] A. Frank and A. Asuncion. UCI machine learning repository, 2010.

[Fan61] R. Fano. Transmission of Information: A Statistical Theory of Communications. The MIT Press, Cambridge, MA, 1961.

[FDV12] B. Frénay, G. Doquire, and M. Verleysen. On the potential inadequacy of mutual information for feature selection. In Proceedings of ESANN, 2012.

[FF97] J. H. Friedman and U. Fayyad. On bias, variance, 0/1-loss, and the curse-of-dimensionality. Data Mining and Knowledge Discovery, 1997.

[FHF11] P. Flach, J. Hernández-Orallo, and C. Ferri. A coherent interpretation of AUC as a measure of aggregated classification performance. In Proceedings of ICML, 2011.

[FHT98] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 1998.

[Fre95] Y. Freund. Boosting a weak learning algorithm by majority. Journal of Information and Computation, 1995.

[FS96a] Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In ICML, 1996.

[FS96b] Y. Freund and R. E. Schapire. Game theory, on-line prediction and boosting. In COLT, 1996.

[FS97] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 1997.

[FT74] J. H. Friedman and J. W. Tukey. A projection pursuit algorithm for exploratory data analysis. IEEE Trans. on Computers, 1974.

[FY83] A. M. Frieze and J. Yadegar. On the quadratic assignment problem. Discrete Applied Mathematics, 1983.

[Gav03] D. Gavinsky. Optimally-smooth adaptive boosting and application to agnostic learning. Journal of Machine Learning Research, 2003.

[GBW01] S. Gannot, D. Burshtein, and E. Weinstein. Signal enhancement using beamforming and nonstationarity with applications to speech. IEEE Trans. on Signal Processing, 49:1614–1626, 2001.

[GCSP13] Z. Golumbic, G. B. Cogan, C. E. Schroeder, and D. Poeppel. Visual input enhances selective speech envelope tracking in auditory cortex at a "cocktail party". 2013.

[GPP+12] L. Grippo, L. Palagi, M. Piacentini, V. Piccialli, and G. Rinaldi. SpeeDP: an algorithm to compute SDP bounds for very large max-cut instances. Mathematical Programming, 2012.

[GSM13] F. G. German, D. L. Sun, and G. J. Mysore. Towards a practical lipreading system. In Proceedings of CVPRInterspeech, 2013.

[Gur09] M. Gurban. Multimodal feature extraction and fusion for audio-visual speech recognition. PhD thesis, 4292, STI, EPF Lausanne, 2009.

[GVN+01] H. Glotin, D. Vergyri, C. Neti, G. Potamianos, and J. Luettin. Weighting schemes for audio-visual fusion in speech recognition. In Proceedings of ICASSP, 2001.

[GW95] M. X. Goemans and D. P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM, 1995.

[Han80] T. S. Han. Multiple mutual informations and multiple interactions in frequency data. Information and Control, 1980.

[Hat06] K. Hatano. Smooth boosting using an information-based criterion. In Algorithmic Learning Theory. 2006.

[Haz09] E. Hazan. A survey: The convex optimization approach to regret minimization. Working draft, 2009.

[HDY+12] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 2012.

[HES00] H. Hermansky, D. P. W. Ellis, and S. Sharma. Tandem connectionist feature extraction for conventional HMM systems. In Proceedings of ICASSP, 2000.

[Hie93] J. L. Hieronymus. ASCII phonetic symbols for the world's languages: Worldbet. International Phonetic Association, 1993.

[HR70] M. Hellman and J. Raviv. Probability of error, equivocation and the Chernoff bound. IEEE Trans. on Information Theory, 1970.

[HSH99] O. Hoshuyama, A. Sugiyama, and A. Hirano. A robust adaptive beamformer for microphone arrays with a blocking matrix using constrained adaptive filters. IEEE Trans. on Signal Processing, 47:2677–2684, 1999.

[HT01] D. J. Hand and R. J. Till. A simple generalisation of the area under the ROC curve for multiple class classification problems. Journal of Machine Learning Research, 2001.

[HT09] K. Hatano and E. Takimoto. Linear programming boosting by column and row generation. In Proceedings of Discovery Science, 2009.

[Hu11] B. G. Hu. What are the differences between Bayesian classifiers and mutual-information classifiers? IEEE Trans. on Neural Networks and Learning Systems, 2011.

[HW99] M. Hollander and D. A. Wolfe. Nonparametric Statistical Methods, 2nd Edition. Wiley-Interscience, 1999.

[JB71] J. Jeffers and M. Barley. Speechreading (Lipreading). Charles C Thomas Pub Ltd., 1971.

[JHL97] B. Juang, W. Hou, and C. Lee. Minimum classification error rate methods for speech recognition. IEEE Trans. on Speech and Audio Processing, 1997.

[KC02] N. Kwak and C. Choi. Input feature selection for classification problems. IEEE Trans. on Neural Networks, 2002.

[KHDM98] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas. On combining classifiers. IEEE Trans. on Pattern Analysis and Machine Intelligence, 1998.

[KK09] A. Kalai and V. Kanade. Potential-based agnostic boosting. In NIPS. 2009.

[KKG07] B. J. Killian, J. Y. Kravitz, and M. K. Gilson. Extraction of configurational entropy from molecular simulations via an expansion approximation. Journal of Chemical Physics, 2007.

[KMS08] A. Klaser, M. Marszalek, and C. Schmid. A spatio-temporal descriptor based on 3d-gradients. In proceedings of BMVC, 2008.

[Koh96] R. Kohavi. Wrappers for Performance Enhancement and Oblivious Decision Graphs. PhD thesis, Stanford, CA, USA, 1996.

[KR13] T. Kinnunen and P. Rajan. A practical, self-adaptive voice activity detector for speaker verification with noisy telephone and microphone data. In ICASSP, 2013.

[KS01] D. Karger and N. Srebro. Learning Markov networks: Maximum bounded tree-width graphs. Proceedings of the 12th Annual Symposium on Discrete Algorithms, pages 392–401, 2001.

[KSS92] M. J. Kearns, R. E. Schapire, and L. M. Sellie. Toward efficient agnostic learning. In COLT, 1992.

[Lan00] T. Lane. Extensions of ROC analysis to multi-class domains. In ICML-2000 Workshop on Cost-Sensitive Learning, 2000.

[LCSC05] S. Lucey, T. Chen, S. Sridharan, and V. Chandran. Integration strategies for audio-visual speech processing: applied to text-dependent speaker recognition. IEEE Trans. on Multimedia, 2005.

[LHT+09] Y. Lan, R. Harvey, B. J. Theobald, E. Ong, and R. Bowden. Comparing visual features for lipreading. In Auditory-Visual Speech Processing, 2009.

[Lof04] J. Lofberg. YALMIP: a toolbox for modeling and optimization in MATLAB. In Proceedings of Computer Aided Control Systems Design, 2004.

[Low04] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 2004.

[LS10] P. M. Long and R. A. Servedio. Random classification noise defeats all convex potential boosters. Journal of Machine Learning Research, 2010.

[LT97] J. Luettin and N. Thacker. Speechreading using probabilistic models. Computer Vision and Image Understanding, 1997.

[LTB96] J. Luettin, N. A. Thacker, and S. Beet. Speechreading using shape and intensity information. In International Conf. Spoken Language Proc., 1996.

[LTH+10] Y. Lan, B. J. Theobald, R. Harvey, E. J. Ong, and R. Bowden. Improving visual features for lipreading. In Proceedings of AVSP, 2010.

[MB06] P. Meyer and G. Bontempi. On the Use of Variable Complementarity for Feature Selection in Cancer Classification. Springer, 2006.

[MBBF99] L. Mason, J. Baxter, P. Bartlett, and M. Frean. Boosting algorithms as gradient descent. In NIPS, 1999.

[MCB+02] I. Matthews, T. F. Cootes, J. A. Bangham, S. Cox, and R. Harvey. Extraction of visual features for lipreading. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2002.

[McG54] W. McGill. Multivariate information transmission. Trans. of the IRE Professional Group on Information Theory, 1954.

[MHL+07] E. McDermott, T. J. Hazen, J. Le-Roux, A. Nakamura, and S. Katagiri. Discriminative training for large-vocabulary speech recognition using minimum classification error. IEEE Trans. on Audio, Speech, and Language Processing, 2007.

[MLS+13] V. P. Minotto, C. B. O. Lopes, J. Scharcanski, C. R. Jung, and B. Lee. Audiovisual voice activity detection based on microphone arrays and color information. IEEE Journal of Selected Topics in Signal Processing, 2013.

[MM02] Y. Mansour and D. McAllester. Boosting using branching programs. Journal of Computer and System Sciences, 2002.

[MS02] K. Mikolajczyk and C. Schmid. An affine invariant interest point detector. In Proceedings of Computer Vision, 2002.

[MS13] I. Mukherjee and R. E. Schapire. A theory of multiclass boosting. Journal of Machine Learning Research, 2013.

[Nag14] T. Naghibi. MABoost: An R package for classification, 2014. URL http://cran.r-project.org/web/packages/maboost.

[NC09] J. L. Newman and S. J. Cox. Automatic visual-only language identification: A preliminary study. In Proceedings of ICASSP, 2009.

[NF77] P. M. Narendra and K. Fukunaga. A branch and bound algorithm for feature subset selection. IEEE Trans. on Computers, 1977.

[NHP13] T. Naghibi, S. Hoffmann, and B. Pfister. Convex approximation of the NP-hard search problem in feature subset selection. In Proceedings of ICASSP, 2013.

[NSS04] J. Neumann, C. Schnörr, and G. Steidl. SVM-based feature selection by direct objective minimisation. In Proceedings of DAGM, 2004.

[NWF08] J. C. Niebles, H. Wang, and L. Fei-Fei. Unsupervised learning of human action categories using spatial-temporal words. International Journal of Computer Vision, 2008.

[ONS99] S. Okawa, T. Nakajima, and K. Shirai. A recombination strategy for multi-band speech recognition based on mutual information criterion. In Proceedings of Eurospeech, 1999.

[OPM02] T. Ojala, M. Pietikainen, and T. Maenpaa. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2002.

[PD00] F. Provost and P. Domingos. Well-trained PETs: Improving probability estimation trees, 2000. CDER Working Paper, Stern School of Business, New York University, NY.

[PG98] G. Potamianos and H. P. Graf. Discriminative training of HMM stream exponents for audio-visual speech recognition. In Proceedings of ICASSP, 1998.

[PGC08] S. Pachoud, S. Gong, and A. Cavallaro. Macro-cuboid based probabilistic matching for lip-reading digits. Proceedings of CVPR, 2008.

[PGTG02] E. K. Patterson, S. Gurbuz, Z. Tufekci, and J. N. Gowdy. CUAVE: A new audio-visual database for multimodal human-computer interface research. In Proceedings of ICASSP, 2002.

[PLD05] H. Peng, F. Long, and C. Ding. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2005.

[PLHZ04] H. Pan, S. E. Levinson, T. S. Huang, and L. Zhi-pei. A fused hidden Markov model with application to bimodal speech processing. IEEE Trans. on Signal Processing, 2004.

[PNLM04] G. Potamianos, C. Neti, J. Luettin, and I. Matthews. Audio-visual automatic speech recognition: an overview. Visual and Audio-Visual Speech Processing, 2004.

[PRW95] S. Poljak, F. Rendl, and H. Wolkowicz. A recipe for semidefinite relaxation for (0,1)-quadratic programming. Journal of Global Optimization, 1995.

[PSSD06] A. Potamianos, E. Sanchez-Soto, and K. Daoudi. Stream weight computation for multi-stream classifiers. In Proceedings of ICASSP, 2006.

[QV95] F. Qian and B. Van Veen. Quadratically constrained adaptive beamforming for coherent signals and interference. IEEE Trans. on Signal Processing, 43:1890–1900, 1995.

[QWP11] L. Qingju, W. Wenwu, and J. Philip. A visual voice activity detection method with adaboosting. In Sensor Signal Processing for Defence, SSPD, 2011.

[Rag88] P. Raghavan. Probabilistic construction of deterministic algorithms: approximating packing integer programs. Journal of Computer and System Sciences, 1988.

[RCB97] M. G. Rahim, L. Chin-Hui, and J. Biing-Hwang. Discriminative utterance verification for connected digits recognition. IEEE Trans. on Speech and Audio Processing, 1997.

[Rez61] F. M. Reza. An Introduction to Information Theory. Dover Publications, Inc., New York, 1961.

[RHEC10] I. Rodriguez-Lujan, R. Huerta, Ch. Elkan, and C. S. Cruz. Quadratic programming feature selection. Journal of Machine Learning Research, 2010.

[RJ93] L. Rabiner and B. Juang. Fundamentals of Speech Recognition. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1993.

[RS12] R. Rajavel and P. S. Sathidevi. Adaptive reliability measure and optimum integration weight for decision fusion audio-visual speech recognition. Journal of Signal Processing Systems, 2012.

[RW05] G. Rätsch and M. Warmuth. Efficient margin maximization with boosting. Journal of Machine Learning Research, 2005.

[SAS07] P. Scovanner, S. Ali, and M. Shah. A 3-dimensional SIFT descriptor and its application to action recognition. In Proceedings of Multimedia, 2007.

[Sch90] R. E. Schapire. The strength of weak learnability. Journal of Machine Learning Research, 1990.

[Ser03] R. A. Servedio. Smooth boosting and learning with malicious noise. Journal of Machine Learning Research, 2003.

[Sha12] S. Shalev-Shwartz. Online learning and online convex optimization. Found. Trends Machine Learning, 2012.

[SK85] T. J. Shan and T. Kailath. Adaptive beamforming for coherent signals and interference. IEEE Trans. on ASSP, vol. 33:527–536, 1985.

[SKS99] J. Sohn, N. S Kim, and W. Sung. A statistical model-based voice activity detection. IEEE Signal Processing Letters, 6(1):1–3, 1999.

[SLS+05] K. Saenko, K. Livescu, M. Siracusa, K. Wilson, J. Glass, and T. Darrell. Visual speech recognition with loosely synchronized feature streams. In Proceedings of ICCV, 2005.

[SS98] R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. In COLT, 1998.

[SS06] F. Sha and L. K. Saul. Large margin Gaussian mixture modeling for phonetic classification and recognition. In Proceedings of ICASSP 2006, 2006.

[SS08] S. Shalev-Shwartz and Y. Singer. On the equivalence of weak learnability and linear separability: new relaxations and efficient boosting algorithms. In COLT, 2008.

[SW98] A. Srivastav and K. Wolf. Finding dense subgraphs with semidefinite programming. In Approximation Algorithms for Combinatorial Optimization. Springer, 1998.

[SZ03] J. Sivic and A. Zisserman. Video Google: a text retrieval approach to object matching in videos. In Proceedings of Computer Vision, 2003.

[TA+97] T. M. Therneau, E. J. Atkinson, et al. An introduction to recursive partitioning using the RPART routines. Technical report, URL http://www.mayo.edu/hsr/techrpt/61.pdf, 1997.

[VD93] H. Vafaie and K. De-Jong. Robust feature selection algorithms. In Proceedings of TAI, 1993.

[vdSGS10] K. E. A. van de Sande, T. Gevers, and C. G. M. Snoek. Evaluating color descriptors for object and scene recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2010.

[Vik06] T. Viklands. Algorithms for the Weighted Orthogonal Procrustes Problem and other Least Squares Problems. PhD thesis, Umeå University, Umeå, Sweden, 2006.

[vK70] J. von Kries. Influence of adaptation on the effects produced by luminous stimuli. In MacAdam, D.L. (Ed.), Sources of Color Vision, 1970.

[vL07] U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 2007.

[WLR06] M. K. Warmuth, J. Liao, and G. Rätsch. Totally corrective boosting algorithms that maximize the margin. In ICML, 2006.

[WSC99] T. Wark, S. Sridharan, and V. Chandran. Robust speaker verification via fusion of speech and lip modalities. In Proceedings of ICASSP, 1999.

[YEG+06] S. J. Young, G. Evermann, M. J. F. Gales, T. Hain, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. C. Woodland. The HTK Book, version 3.4. Cambridge University Engineering Department, Cambridge, UK, 2006.

[Yeu91] R. W. Yeung. A new outlook on Shannon's information measures. IEEE Trans. on Information Theory, 1991.

[YNO09] T. Yoshida, K. Nakadai, and H. G. Okuno. Automatic speech recognition improved by two-layered audio-visual integration for robot audition. In 9th IEEE-RAS International Conference on Humanoid Robots, 2009.

[YYDS11] D. Ying, Y. Yan, J. Dang, and F. K. Soong. Voice activity detection based on an unsupervised learning framework. IEEE Transactions on ASLP, 2011.

[ZBP09] G. Zhao, M. Barnard, and M. Pietikainen. Lipreading with local spatiotemporal descriptors. IEEE Trans. on Multimedia, 2009.

[ZCCW13] N. A. Zaidi, J. Cerquides, M. J. Carman, and G. I. Webb. Alleviating naive Bayes attribute independence assumption by attribute weighting. Journal of Machine Learning Research, 2013.

[Zin03] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, 2003.

[ZJYH11] P. Zhao, R. Jin, T. Yang, and S. C. Hoi. Online AUC maximization. In Proceedings of ICML, New York, NY, USA, 2011. ACM.

[ZST10] X. Zhao, D. Sun, and K. Toh. A Newton-CG augmented Lagrangian method for semidefinite programming. SIAM J. on Optimization, 2010.

[ZXG06] Y. Zheng, P. Xie, and S. Grant. Robustness and distance discrimination of adaptive near field beamformers. Proceedings of ICASSP, 12:478–488, 2006.

[ZZHP14] Z. Zhou, G. Zhao, X. Hong, and M. Pietikäinen. A review of recent advances in visual speech decoding. Image and Vision Computing, 2014.

[ZZP11] Z. Zhou, G. Zhao, and M. Pietikainen. Towards a practical lipreading system. In Proceedings of CVPR, 2011.

[ZZRH09] J. Zhu, H. Zou, S. Rosset, and T. Hastie. Multi-class AdaBoost. Statistics and Its Interface, 2009.

Curriculum Vitae

1983 Born in Tehran, Iran

1990-2001 Primary and high school in Tehran

2001-2006 Studies in electrical engineering at University of Tehran

2006 BSc in electrical engineering

2006-2009 Master studies in telecommunication at Sharif University of Technology, Tehran

2009-2015 Research assistant and PhD student at the Speech Processing Group, Computer Engineering and Networks Laboratory, ETH Zürich