Research Collection

Doctoral Thesis

Towards Robust Audio-Visual Speech Recognition

Author(s): Naghibi, Tofigh

Publication Date: 2015

Permanent Link: https://doi.org/10.3929/ethz-a-010492674

Rights / License: In Copyright - Non-Commercial Use Permitted


DISS. ETH NO. 22867

Towards Robust Audio-Visual Speech Recognition

A thesis submitted to attain the degree of

DOCTOR OF SCIENCES of ETH ZURICH (Dr. sc. ETH Zurich)

presented by

TOFIGH NAGHIBI
MSc, Sharif University of Technology, Iran
born June 8, 1983
citizen of Iran

accepted on the recommendation of
Prof. Dr. Lothar Thiele, examiner
Prof. Dr. Gerhard Rigoll, co-examiner
Dr. Beat Pfister, co-examiner

2015

Acknowledgements

I would like to thank Prof. Dr. Lothar Thiele and Dr. Beat Pfister for supervising my research. Beat has always allowed me complete freedom to define and explore my own directions in research. While this might prove difficult and sometimes even frustrating to begin with, I have come to appreciate the wisdom of his way, since it not only encouraged me to think for myself, but also gave me enough self-confidence to tackle open problems that I otherwise would not have dared to touch.

The former and current members of the ETH speech group helped me a lot in various ways. In particular, I would like to thank Sarah Hoffmann, who was my computer science and cultural mentor. Her vast knowledge in programming and algorithm design and her willingness to share all of it with me made my electrical-engineer-to-computer-scientist transformation much easier. Further, following her cultural advice, particularly on food and beverage, created the best memories of my PhD life. I want to thank Hui Liang for his valuable feedback and for reading parts of this thesis, usually on very short notice. I also owe him for always being available to listen to my ideas and even to study newly emerging machine learning techniques together, which would not have happened without him. I am also indebted to Lukas Knauer for his German translation of my dissertation abstract and to Naoya Takahashi for lending his PowerPoint expertise.

On the personal side, I wish to thank my parents for all they have done for me. It takes many years to realize how much of what you are depends on your parents. I wish to thank my sister Nasim and my close friend Damoun for the fun we have had growing up together. Finally, special thanks must go to my amazing wife Haleh for her patience. She waited for me many nights while I was playing with my machine learning toys at the office or even at home. Her invisible management of everything that might distract me and her favorite quote "everything is under the control" were literally my anchor throughout my PhD life.

Contents

List of Abbreviations

Abstract

Kurzfassung

1 Introduction
  1.1 Problem Statement
  1.2 Scientific Contributions
  1.3 Structure of this Thesis

2 Visual Features and AV Datasets
  2.1 Visual Feature Descriptions
    2.1.1 Scale-invariant Feature Transform (SIFT)
    2.1.2 Invariant Scattering Convolution Networks
    2.1.3 3D-SIFT
    2.1.4 Bag-of-Words
  2.2 Datasets
    2.2.1 CUAVE
    2.2.2 GRID
    2.2.3 Oulu
    2.2.4 AVletter
    2.2.5 ETHDigits

3 Feature Selection
  3.1 Introduction to Feature Selection Problem
    3.1.1 Various Feature Selection Methods
    3.1.2 Feature Selection Search Strategies
  3.2 Mutual Information Pros and Cons
    3.2.1 First Expansion: Multi-way Mutual Information
    3.2.2 Second Expansion: Chain Rule of Information
    3.2.3 Truncation of the Expansions
    3.2.4 The Superiority of the D2 Approximation
  3.3 Search Strategies
    3.3.1 Convex Based Search
    3.3.2 Approximation Analysis
  3.4 Experiments for COBRA Evaluation
  3.5 COBRA-selected ISCN Features for Lipreading
  3.6 Conclusion and Discussion

4 Binary and Multiclass Classification
  4.1 Introduction to Boosting Problem
  4.2 Our Results
  4.3 Fundamentals
  4.4 Boosting Framework
    4.4.1 Sparse Boosting
    4.4.2 Smooth Boosting
    4.4.3 MABoost for Combining Datasets (CD-MABoost)
    4.4.4 Lazy Update Boosting
  4.5 Multiclass Generalization
    4.5.1 Preliminaries for Multiclass Setting
    4.5.2 Multiclass MABoost (Mu-MABoost)
  4.6 Classification Experiments with Boosting
    4.6.1 Binary MABoost Experiments
    4.6.2 Experiment with SparseBoost
    4.6.3 Experiment with CD-MABoost
    4.6.4 Multiclass Classification Experiments
  4.7 Conclusion and Discussion

5 Visual Voice Activity Detection
  5.1 Introduction to Utterance Detection
  5.2 Supervised Learning: VAD by Using SparseBoost
  5.3 Semi-supervised Learning: Audio-visual VAD
    5.3.1 Ro-MABoost
    5.3.2 Experiments on Supervised & Semi-supervised VAD
  5.4 Conclusion

6 Lipreading with High-Dimensional Feature Vectors
  6.1 Introduction to Visual Speech Recognition
  6.2 Color Spaces
  6.3 Visual Features
  6.4 Multiclass Classifier
  6.5 Lipreading Experiments with ISMA Features
    6.5.1 Oulu: Phrase recognition
    6.5.2 GRID: Digit recognition
    6.5.3 AVletter: Letter recognition
  6.6 Conclusion and Discussion

7 Audio Visual Information Fusion
  7.1 The Problem of Audio Visual Fusion
  7.2 Preliminaries & Model Description
  7.3 AUC: The Area Under an ROC Curve
    7.3.1 AUC and Mismatch
    7.3.2 AUC Generalization to Multiclass Problem
    7.3.3 Transforming HMM Outputs
    7.3.4 Online AUC Maximization
  7.4 Multi-vector Model
  7.5 Experiments on AUC-based Fusion Strategy
  7.6 Conclusion

8 Multichannel Audio-video Speech Recognition System
  8.1 Introduction to Beamforming Problem
  8.2 Signal Cancellation
  8.3 Transfer Function Estimation
  8.4 Simulation Results
  8.5 Experiments with ETHDigits Dataset
  8.6 Conclusion

9 Conclusion
  9.1 Achievements
    9.1.1 Perspectives
    9.1.2 Open Problems

A Complementary Fundamentals
  A.1 Definitions and Preliminaries
  A.2 Proof of Theorem 4.7
  A.3 Proof of Lemma 4.8
  A.4 Proof of the Boosting Algorithm for Combined Datasets
  A.5 Proof of Theorem 4.9
  A.6 Proof of Entropy Projection onto Hypercube
  A.7 Proof of Theorem 4.11
  A.8 Multiclass Weak-learning Condition

List of Abbreviations

AAM active appearance model
A-ASR audio-based automatic speech recognition
ADABoost adaptive boosting
ANN artificial neural network
ASM active shape model
ASR automatic speech recognition
AUC area under the curve
A-VAD audio-based voice activity detection
AV-ASR audio-visual automatic speech recognition
BE backward elimination
BoW bag of words
CART classification and regression tree
CD-MABoost combining datasets with MABoost
COBRA convex based relaxation approximation
DCT discrete cosine transform
fps number of frames per second (frame rate)
FS forward selection
GMM Gaussian mixture model
HMM hidden Markov model
ISCN invariant scattering convolution network
ISMA combined feature vector consisting of ISCN features and MABoost posterior probabilities

JMI joint mutual information
KL Kullback-Leibler
LBP local binary pattern
LDA linear discriminant analysis
MABoost mirror ascent boosting
MADABoost modified adaptive boosting
MAPP Mu-MABoost posterior probabilities
MFCC mel-frequency cepstral coefficient
MIFS mutual information feature selection
ML machine learning
MRI magnetic resonance imaging
mRMR minimal redundancy maximal relevance
Mu-MABoost multiclass MABoost
MVDR minimum variance distortionless response
PAC probably approximately correct
PCA principal component analysis
RF random forest
ROC receiver operating characteristic
SAMME stagewise additive modeling using a multi-class exponential loss function
SDP semidefinite programming
SIFT scale-invariant feature transform
SSP subset selection problem
SVM support vector machine
TF transfer function
VAD voice activity detection
V-ASR visual-based automatic speech recognition
V-VAD visual-based voice activity detection

Abstract

Human speech production is by nature a multimodal process. While the acoustic speech signals usually constitute the primary cue in human speech perception, visual signals can also contribute substantially, particularly in noisy environments. Research in automatic audio-visual speech recognition is motivated by the expectation that automatic speech recognition systems can also exploit the multimodal nature of speech. This thesis aims to explore the main challenges faced in the realization and development of robust audio-visual automatic speech recognition (AV-ASR), i.e., developing (I) a high-accuracy speaker-independent lipreading system and (II) an optimal audio-visual fusion strategy.

Extracting a set of informative visual features is the first step in building an accurate visual speech recognizer. In this thesis, various visual feature extraction methods are employed. Some of these methods, however, encode visual information into very high-dimensional feature vectors. In order to reduce the computational complexity and the risk of overfitting, a new feature selection algorithm is introduced that selects a subset of informative visual features from the high-dimensional feature vector space. This feature selection algorithm considers the mutual information between features and class labels (phonemes) to be the criterion and employs a semi-definite programming based search strategy for subset selection. The performance of the feature selection algorithm is analyzed and it is shown that the difference between the score of the selected feature subset and that of the optimal feature subset is bounded. That is, it guarantees that the score of the selected features is close to that of the optimal solution.

To achieve a speaker-independent visual speech recognizer, this thesis proposes to employ a pool of scale-invariant feature transform (SIFT) coefficients extracted from multiple color spaces. The ensemble of decision tree classifiers trained with these features yields a high level of robustness against inter-speaker and illumination variations. While for voice activity detection two-dimensional SIFT features are used to build a statistical model, for lipreading we use three-dimensional SIFT features that can capture the time dynamics of video data. It is shown that using three-dimensional SIFT features gives a substantial recognition accuracy improvement in comparison with conventional visual features. The proposed AV-ASR achieves 70.5% utterance classification accuracy, which is the highest accuracy reported for the Oulu dataset in the speaker-independent setting.

While many boosting algorithms have been developed to train an ensemble of classifiers, most of them cannot be naturally generalized to the multiclass classification setting, which is a requirement in our application. In this thesis, we propose a framework to design boosting algorithms. This framework has the advantage that it can be naturally generalized to the multiclass setting. Several properties such as convergence rates and the generalization error of this framework are analyzed. Moreover, multiple practically and theoretically interesting algorithms such as SparseBoost are derived. We show that the SparseBoost algorithm only uses a percentage of the training samples at each training round (about half of the samples) while still converging to the optimal hypothesis in the sense of probably approximately correct (PAC) learning.
A common practice to fuse audio and visual information is to assign a reliability weight to each modality. It is shown in this thesis that a more suitable criterion for estimating the reliability weights is to maximize the area under a receiver operating characteristic curve (AUC) rather than frequently used criteria such as the recognition accuracy. Moreover, here we estimate a reliability weight for each feature. This generalizes the (conventional) two-dimensional stream weight estimation problem to a fairly high-dimensional problem. In order to efficiently estimate the reliability weights, we use a smoothed AUC function and adopt a variant of the projected gradient descent algorithm to maximize the AUC criterion in an online manner.

Audio-visual voice activity detection (AV-VAD) is an important prerequisite in many audio-visual applications. We propose a robust audio-visual voice activity detector which can be trained in a semi-supervised manner. This interesting property can be achieved by noting the fact that both audio and visual signals represent the same underlying event, namely speech production. In this approach, the training data is labeled by iteratively training audio- and visual-based speech detectors and re-labeling the data in order to use it in the next round. The labeled data from the last iteration is then used to train the final audio-visual voice activity detector. The proposed AV-VAD algorithm results in almost 96% frame-based detection rate (visual-based VAD yields a 78% detection rate) on the GRID dataset in the speaker-independent setting.

Kurzfassung

Die Erzeugung menschlicher Sprache ist naturgemäss ein multimodaler Prozess, der akustische und visuelle Information erzeugt. Während sich die menschliche Sprachwahrnehmung üblicherweise primär auf die akustische Information stützt, können visuelle Signale ebenfalls substanziell beitragen, insbesondere in einer lauten Umgebung. Forschung in automatischer audio-visueller Spracherkennung wird motiviert durch die Erwartung, dass Spracherkennungssysteme die multimodale Natur der Sprache auch ausnutzen können. Ziel dieser Arbeit ist, die Hauptherausforderungen zu erkunden, welchen bei der Realisation robuster audio-visueller Spracherkennungssysteme (audio-visual automatic speech recognition, AV-ASR) begegnet wird, d.h. die Entwicklung (I) eines hochpräzisen, sprecherunabhängigen Lippenlesesystems und (II) einer optimalen audio-visuellen Fusionsstrategie.

Die Extraktion informativer visueller Merkmale ist der erste Schritt beim Design einer exakten visuellen Spracherkennung. In dieser Arbeit werden diverse Methoden zur Extraktion visueller Merkmale angewendet. Für einige dieser Methoden ist jedoch die Menge der resultierenden visuellen Merkmale sehr gross. Zur Reduktion des Rechenaufwandes und zur Vermeidung von Überanpassung wird ein neuer Algorithmus zur Auswahl von Merkmalen eingeführt, welcher eine Untermenge aus der sehr grossen Menge visueller Merkmale auswählt. Dieser Merkmalsauswahl-Algorithmus berücksichtigt die Transinformation (mutual information) zwischen Merkmalen und Klassennamen (Phoneme) als Kriterium und wendet zur Auswahl der Untermengen eine Suchstrategie an, die auf semidefiniter Programmierung basiert. Die Leistung des Merkmalsauswahl-Algorithmus wird analysiert und es wird gezeigt, dass die Differenz zwischen der Bewertung der ausgewählten Untermenge und der optimalen Merkmalsmenge begrenzt ist. Das heisst, es ist garantiert, dass sich die Bewertung der gewählten Merkmale nahe bei derjenigen der optimalen Lösung befindet.

Zur Verwirklichung einer sprecherunabhängigen visuellen Spracherkennung schlägt diese Arbeit die Verwendung einer Vielzahl skalierungsinvarianter Merkmalstransformations-Koeffizienten (scale-invariant feature transform, SIFT) vor, welche aus mehreren Farbräumen extrahiert werden. Klassifikatoren aus Ensembles von Entscheidungsbäumen, die mit diesen Merkmalen trainiert werden, haben einen hohen Grad an Robustheit gegen Sprecher- und Beleuchtungsvariationen. Während zur Detektion von Sprechaktivität zweidimensionale SIFT-Merkmale genutzt werden, um statistische Modelle zu erstellen, setzen wir für das Lippenlesen dreidimensionale SIFT-Merkmale ein, welche die zeitliche Dynamik von Videodaten festhalten können. Es wird gezeigt, dass die Verwendung dreidimensionaler SIFT-Merkmale im Vergleich zu herkömmlichen visuellen Merkmalen eine wesentliche Verbesserung der Erkennungsgenauigkeit bewirkt. Das vorgeschlagene AV-ASR erreicht eine Erkennungsrate von 70.5%, welches der höchste bisher publizierte Wert für die Oulu-Datenbank bei sprecherunabhängiger Erkennung ist.

Es wurden bisher viele Boosting-Algorithmen entwickelt, um Ensembles von Klassifikatoren zu trainieren. Die meisten lassen sich jedoch nicht auf Mehrklassen-Klassifikatoren verallgemeinern, was bei unserer Anwendung notwendig ist. In dieser Arbeit schlagen wir eine Systematik für das Design von Boosting-Algorithmen vor. Diese Systematik hat den Vorteil, dass sie sich natürlich auf den Mehrklassenfall verallgemeinern lässt.
Diverse Eigenschaften wie Konvergenzraten und Generalisierungsfehler sind analysiert worden. Ausserdem sind mehrere praktisch und theoretisch interessante Algorithmen wie beispielsweise SparseBoost abgeleitet worden. Es wird gezeigt, dass der SparseBoost-Algorithmus in jeder Trainingsrunde nur einen Teil (ungefähr die Hälfte) der Trainingsmuster nutzt und dennoch auf die optimale Hypothese konvergiert, im Sinne des PAC-Lernens (probably approximately correct).

Ein übliches Vorgehen zur Vereinigung von Audio und visueller Information ist die Gewichtung jeder Modalität entsprechend ihrer Zuverlässigkeit. Es wird in dieser Arbeit gezeigt, dass die Maximierung von AUC (area under the receiver operating characteristic curve) ein passenderes Kriterium zur Schätzung der Zuverlässigkeit ist als häufig genutzte Kriterien wie die Erkennungsrate. Ausserdem wird hier die Zuverlässigkeit jedes Merkmals geschätzt. Dadurch wird die (konventionelle) zweidimensionale Schätzung der Gewichte der Modalitäten zu einem ziemlich hochdimensionalen Problem generalisiert. Zwecks effizienter Schätzung der Zuverlässigkeit wird eine geglättete AUC-Funktion verwendet und eine Variante des projizierten Gradientenverfahrens adaptiert für die Maximierung des AUC-Kriteriums im Online-Modus.

Die audio-visuelle Detektion der Stimmaktivität (audio-visual voice activity detection, AV-VAD) ist eine wichtige Voraussetzung für viele audio-visuelle Anwendungen. Es wird eine robuste audio-visuelle Stimmaktivitätsdetektion vorgeschlagen, welche in halbüberwachter Weise trainiert werden kann. Diese interessante Eigenschaft ergibt sich aus dem Umstand, dass sowohl Audio als auch visuelle Signale dasselbe zugrundeliegende Ereignis repräsentieren, nämlich die Sprachproduktion. In diesem Ansatz erfolgt die Etikettierung der Daten durch iteratives Trainieren von audio-basiertem und visuellem Detektor und anschliessender Neuetikettierung, welche in der nächsten Trainingsrunde benutzt wird. Die Etiketten aus der letzten Iteration werden benutzt, um die endgültige audio-visuelle Stimmaktivitätsdetektion zu trainieren. Der vorgeschlagene AV-VAD-Algorithmus erreicht eine Detektionsrate von nahezu 96% pro Analyseabschnitt (der visuelle Stimmaktivitätsdetektor erreicht 78%) für den GRID-Datensatz bei sprecherunabhängigem Einsatz.

Chapter 1

Introduction

1.1 Problem Statement

Speech, as the main communication medium among humans, is multimodal in nature. We all know from everyday life experience that observing the face of the person who talks to us usually improves our speech perception. This improvement is even more distinct when the audio signal is degraded. For example, in a cocktail party situation where many people are talking at once, if we track the voice of a person whose face is not observable, our auditory system can only benefit from binaural processing, where the desired sound source is localized by using the time and level differences between the signals received by the two ears. However, when we are able to observe the talker's face and gestures, a more complex phenomenon occurs. Recent studies have shown that focusing on the face of a talker in a crowded space enhances cortical selectivity in the auditory cortex, which in turn diminishes the neural responses stimulated by the competing auditory input streams [GCSP13]. From a neurological standpoint, the mechanism by which visual inputs elicit larger neural responses in the auditory cortex is still largely unknown; the effect, however, is clear: when visual cues are available, noisy speech becomes more intelligible.

Motivated by this observation, and considering that there has been a significant improvement in audio-based automatic speech recognition (A-ASR) [HDY+12, MHL+07, SS06, JHL97] over the last two decades, many researchers have attempted to reduce the gap between human speech recognition performance and the performance of A-ASR in real-world applications by integrating visual speech information and auditory speech information. This, however, turned out to be non-trivial. The first attempts in this direction [PNLM04, and references therein] clearly showed that appearance based visual features (pixel values and their affine transformations) provide very limited learnable speech information. Moreover, such features are highly speaker dependent and strongly affected by changing illumination conditions. That is, the commonly used visual features are not reliable. This raised several theoretical and practical questions.

The first and foremost question is: which are the appropriate visual features? It is clear that, due to the data processing theorem, features obtained by applying non-linear transformations to an image do not convey more information than the raw pixel values themselves. With some of these non-linearly transformed features, however, a given learning algorithm can search through the hypothesis space more efficiently, which in turn may result in a more accurate model of visual speech. Thus, it is only meaningful to explore the goodness of a feature set with respect to a given learning algorithm. The follow-up question is then: which is the appropriate learning algorithm for visual speech recognition?

The next challenging question in audio-visual automatic speech recognition (AV-ASR) is how to optimally combine audio and visual speech information. From a purely theoretical standpoint, if the audio and visual feature streams were class-conditionally independent, multiplying their likelihood functions and the a priori probability would result in optimal fusion (Bayes fusion). In reality, however, our estimates of the likelihood functions (particularly the likelihood function of the video stream) may severely deviate from the true likelihood functions, mainly due to the mismatch between the distributions of training and test data and due to model inaccuracy. Therefore, a more complex fusion strategy is needed that weights the likelihood functions with respect to the reliability of the models.
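To make the contrast concrete, the following is a minimal sketch of the two fusion rules just discussed; the stream weights $\lambda_a$, $\lambda_v$ and the class score $g_c$ are notation introduced here for illustration only and are not taken from later chapters. Under class-conditional independence, Bayes fusion scores a class $c$ for audio and visual observations $x_a$, $x_v$ as

$$ g_c(x_a, x_v) = \log \Pr(c) + \log p(x_a \mid c) + \log p(x_v \mid c), $$

whereas a reliability-weighted fusion introduces non-negative stream weights,

$$ g_c(x_a, x_v) = \log \Pr(c) + \lambda_a \log p(x_a \mid c) + \lambda_v \log p(x_v \mid c), \qquad \lambda_a, \lambda_v \ge 0, $$

so that a poorly estimated likelihood (typically the visual one) can be down-weighted when its model is unreliable.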

This thesis will concentrate on both theoretical and practical aspects of feature selection and learning algorithms and their applications in AV-ASR, in order to give answers to the above questions that differ from conventional solutions.

1.2 Scientific Contributions

The following contributions result from this thesis:

1. We propose a mutual information based feature selection algorithm and derive its approximation ratio, which can be interpreted as a lower bound on the goodness of selected feature sets. To the best of our knowledge, this is the first (mutual information based) feature selection method with a performance guarantee.¹

2. Employing the duality between linear online learning (hedging) and boosting, we derive a framework to design and develop strong classifiers. We show that the algorithms derived from this framework are boosting algorithms in the probably approximately correct (PAC) sense, i.e., they drive the classification error to zero. Using this framework, we show that the MADABoost algorithm proposed in [DW00] is in fact a boosting algorithm. This is the first proof of its boosting property. Moreover, a sparse boosting algorithm is proposed which uses only a percentage of the training data in each training round, resulting in a substantial reduction of the memory and computational complexity of the learning process.

3. The most commonly used visual features for speech recognition are very sensitive to illumination variations and, moreover, are highly speaker dependent. To cope with the speaker dependence issue, we propose to use the three-dimensional scale-invariant feature transform (3D-SIFT). To improve the illumination invariance property, we suggest using multiple 3D-SIFT feature sets, where the feature sets are extracted from different color spaces (obtained by nonlinear transformations of RGB) with certain invariance properties. The final classifier is an ensemble of classifiers, each trained with one of these feature sets. The resulting classifier yields high accuracy in the speaker-independent mode and is robust against changing lighting conditions.

4. Using an algorithm derived from our proposed boosting framework (MABoost) in the co-training setting discussed in [BM98], we develop a speaker-independent audio-visual voice activity detection system that can be trained in a semi-supervised manner. It is shown to be very reliable in different lighting conditions and for various speakers. Further, since it requires (almost) no labeled data to train, its intelligence (accuracy) improves as it is used, meaning that it can adapt to new users in an unsupervised way.

5. Many audio-visual fusion techniques proposed in the literature optimize the fusion parameters with respect to the recognition accuracy on training data, which is based on the implicit assumption that training conditions are sufficiently similar to test conditions. This, however, is a paradoxical assumption, since if training and test conditions were similar enough, we could simply use Bayes fusion. To address this problem, we propose a multi-stream fusion scheme that adopts the area under the receiver operating characteristic curve (AUC) as the design criterion. This algorithm can be trained in an online manner when the training set is large or parameter adaptation is required.

¹This performance guarantee is a non-zero (thus, non-trivial) lower bound of the ratio between the attained solution and the optimal solution of the underlying NP-hard maximization problem.

1.3 Structure of this Thesis

This thesis includes two parts: a more theoretical part, where a feature selection algorithm and a boosting framework are described, and a more practical part, where these algorithms are employed to construct a robust audio-visual speech recognition system.

Chapter 2 describes the different visual features employed in this thesis, including SIFT, 3D-SIFT and ISCN. Moreover, the audio-visual datasets used for evaluations throughout this thesis are also presented in this chapter.

Chapter 3 introduces mutual information based feature selection algorithms. Particularly for COBRA, the semi-definite programming based feature selection, the approximation ratio is explored. Finally, the practical usability of this feature selection method is demonstrated by means of relatively comprehensive experiments with 10 different datasets.

Chapter 4 introduces a boosting framework called MABoost. Several boosting algorithms derived from this framework, including SparseBoost, MADABoost and Mu-MABoost, are presented and explored in this chapter.

Chapter 5 discusses in detail the supervised and semi-supervised algorithms employed for audio-visual voice activity detection.

Chapter 6 gives the description of a decision tree based lipreading system. It explains how 3D-SIFT features can be used to train a multiclass classifier in order to achieve a robust lipreading system.

Chapter 7 is devoted to the information fusion problem in audio-visual speech recognition systems. It describes a fusion scheme for AV-ASR systems.

Chapter 8 gives some practical evidence for the advantages of using multi-channel audio and visual data for improving recognition accuracy in noisy, reverberant environments.

Chapter 9 provides some final remarks and insights for future work.

Chapter 2

Visual Features and AV Datasets

According to Castleman's definition [Cas79], a feature is a function of one or more measurements, computed so that it quantifies some significant characteristics of an object. Despite the fact that over the last years various types of visual features have been proposed for V-ASR systems (see [ZZHP14, PNLM04]), none of them has gained wide acceptance (such as MFCCs in A-ASR systems) in real-world deployment. Different visual feature extraction methods may roughly be categorized into four classes.

1. Appearance or texture based features, including PCA, LDA, H-LDA, etc. The main underlying concept among texture based features is that all pixels encode visual speech information. After extracting a region of interest (ROI), which usually contains the mouth, lips and jaw, these methods apply traditional dimensionality reduction techniques (such as DCT, PCA or LDA) directly to the ROI to calculate the final feature sets. However, in these features the information relevant to visual speech is strikingly dominated by irrelevant information (noise for our goal) such as the facial characteristics and skin color of the speaker. Furthermore, these features are very sensitive to illumination variations.

2. Shape based algorithms [Che01, MCB+02, CET01] such as AAM and ASM, which consist of models for the visible speech articulatory parts. These models, however, need tedious training (facial points have to be labeled) and may not capture all relevant information due to the limitations of the models (shapes). More importantly, they are highly sensitive to image quality (low resolution).

3. Motion based methods, which mainly use optic flow features. Although using optic flow features alone results in poor performance, it was shown that they may be useful as complementary features in AV-ASR systems [Gur09].

4. Invariant features, which show different invariance properties with respect to illumination conditions, scaling, skin color, etc. Invariant features are obtained by applying non-linear transformations to an ROI and can be extracted either by computing features for each frame independently, i.e., spatial features such as SIFT and ISCN, or by considering the video data as a three-dimensional time series and computing informative spatio-temporal features for a video sequence. Our proposed feature sets may also be considered to go along this line, with an additional bag-of-words (BoW) step resulting in a sparse feature vector (suitable for the decision tree classifiers used as weak classifiers in our boosting algorithm). These features turn out to be highly speaker- and illumination-invariant.

While the last class of features is widely used in various machine vision and video classification applications [SAS07], such features have rarely been applied to the lipreading problem. In the first part of this chapter, the various invariant features that are later used to develop speaker-independent voice activity detection and speech recognition systems are introduced. As shown in Chapter 6, using these features significantly improves the robustness of V-ASR systems. In the second part of this chapter, the datasets used in this thesis are explained and some samples from them are illustrated.

2.1 Visual Feature Descriptions

Two kinds of features can be extracted from an image: (I) Global features, which can be seen as a function of the entire image. PCA and LDA features are examples of global features. In this type of representation, each feature is a function of both the foreground (interpreted as the information we are interested in) and the background (interpreted as noise). (II) Local features, which describe the characteristics of small regions in the neighborhood of key points in an image. The key points, however, may belong to the foreground or the background. Thus, unlike global features, local features are functions of either the foreground or the background, but not both at the same time. Since in lipreading, among the many details and pieces of information in a face image, we are only interested in the lip shape and location, it is advantageous to use local features. It is then the classifier's task to intelligently extract information from the foreground features and ignore the background features. In this thesis, only local features are employed, due to the fact that these features provide the robustness required by lipreading systems.

2.1.1 Scale-invariant Feature Transform (SIFT)

Scale-invariant feature transform (SIFT) descriptors are local features introduced by Lowe [Low04]. SIFT descriptors are among the most widely used local features in machine-vision applications due to their invariance to scale, rotation, illumination changes, and viewing direction. The success of SIFT features can be attributed to two factors: First, the method does not blindly use all the pixels. It finds local structures such as corners or blobs that are present in different views and different scales of an image and uses them as key points. Second, once the key points are detected, it provides descriptions for these key points which are partially invariant to translation, rotation, scaling, illumination and affine transformations. These descriptors form the SIFT feature vectors representing an image.

SIFT descriptors are computed as follows: First, the image gradients at the sample points in a 2D neighborhood around a key point location are computed. Lowe argued that it is better to use the distribution of the gradient orientations rather than their raw values, since this distribution is highly invariant to partial variations (such as scaling, rotation and translation). Thus, at the second stage, the histograms of orientations are computed. When computing an orientation histogram, the increments are weighted by the gradient magnitude and also by a Gaussian window function centered at the key point. These orientation histograms measure how strong the gradient is in each direction. By forming multiple such histograms (according to Lowe's method, 4 such histograms for 4 subregions around a key point) and concatenating them, the SIFT descriptor for a key point is obtained. This vector is usually normalized by its ℓ1 or ℓ2 norm in order to achieve invariance to illumination changes.
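To make the descriptor pipeline concrete, the following is a minimal sketch (not the implementation used in this thesis) of computing SIFT descriptors on a dense grid over a mouth ROI with OpenCV; the grid step, the patch size and the function name dense_sift are illustrative assumptions.

```python
# Hedged sketch: densely sampled SIFT descriptors for a gray-scale mouth ROI.
import cv2
import numpy as np

def dense_sift(roi_gray, step=8, patch_size=16):
    """Return an (n_keypoints, 128) array of SIFT descriptors on a dense grid."""
    sift = cv2.SIFT_create()
    h, w = roi_gray.shape
    keypoints = [cv2.KeyPoint(float(x), float(y), float(patch_size))
                 for y in range(step, h - step, step)
                 for x in range(step, w - step, step)]
    _, descriptors = sift.compute(roi_gray, keypoints)
    # L2-normalize each 128-dimensional descriptor (partial illumination invariance)
    return descriptors / (np.linalg.norm(descriptors, axis=1, keepdims=True) + 1e-8)

# Example usage:
# roi = cv2.imread("mouth_roi.png", cv2.IMREAD_GRAYSCALE)   # e.g. a 128x128 ROI
# descs = dense_sift(roi)
```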

In this thesis, SIFT features are used to train a visual voice activity detector (VAD). As shown in Chapter 5, the invariance properties of SIFT features can significantly contribute to the robustness of the system.

2.1.2 Invariant Scattering Convolution Networks

Invariant scattering convolution network (ISCN) coefficients are wavelet based features computed by a network of wavelets applied to an image. Following the explosive success of convolutional neural networks, Bruna and Mallat introduced ISCN features [BM12] in order to extract higher-order statistics of an image via a deep network of wavelets. In this representation, an image is first segmented into many overlapping subregions. The first layer of the network then outputs local features by (I) applying wavelet transforms to the subregions and (II) removing their phases to obtain the translation invariance property. Averaging the features of each subregion yields a feature set which is both deformation-invariant (due to the averaging) and translation-invariant. As shown in [BM12], these features turn out to be SIFT-type descriptors. The next layers, however, provide higher-order statistics by applying finer wavelets in various directions¹ to the outputs of the previous layer.

These higher-order features show strong discriminative power, particularly when the underlying distribution of the data is highly non-Gaussian. This, however, comes at the expense of exponentially expanding the feature space, due to the fact that the number of ISCN features increases exponentially with the number of layers and the number of directions along which the wavelet transform is computed. In our experiments, for instance, a 3-layer scattering convolution network with an 8-directional Morlet wavelet and a maximum spatial resolution of 8 yields 55552 ISCN features for an ROI of size 128×128. This representation can be compressed by applying a DCT transform to the ISCN features and only selecting the 20% of the DCT features with the highest energy, which in our experiments amounted to about 10000 features. Since, given the limited computational, memory and data resources, it is still difficult to devise a reliable statistical model with this number of features, a feature selection algorithm is developed to select a small subset of informative ISCN features used in training the HMM-GMM models. This feature selection method is the topic of the next chapter.

¹In 2D frequency space, directional wavelets can be described by rotating and scaling the base wavelet function.
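As a concrete illustration of this pipeline, the sketch below computes scattering coefficients for a 128×128 ROI with the kymatio library (an assumption; the thesis does not name an implementation). The parameters are illustrative and are not tuned to reproduce the exact feature count quoted above.

```python
# Hedged sketch: ISCN-style scattering coefficients for a 128x128 gray-scale ROI.
import numpy as np
from kymatio.numpy import Scattering2D

def iscn_features(roi_gray):
    """roi_gray: (128, 128) float array; returns a flattened coefficient vector."""
    scattering = Scattering2D(J=4, shape=(128, 128), L=8, max_order=2)
    coeffs = scattering(roi_gray.astype(np.float32))   # (channels, 8, 8)
    return coeffs.reshape(-1)

# The DCT compression step described above could then keep only the 20% of DCT
# coefficients with the highest energy (e.g. via scipy.fft.dctn).
```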

2.1.3 3D-SIFT

There is a growing body of research attempting to generalize the successful 2D descriptors to 3D descriptors in order to take the third dimension of video data (time in video and Z in MRI images) into account. Several 3D feature sets proposed in the literature include 3D-SIFT [SAS07], HoG3D [KMS08], spatio-temporal BoW [NWF08] and LBP-TOP [ZBP09]. HoG3D and 3D-SIFT, however, are based on histograms of gradient orientations and can be seen as the direct generalization of the popular SIFT descriptors. Due to their compliance with the 3D nature of video data, they have achieved a great deal of success in applications such as action recognition.

Local descriptors are used to represent a local fragment of a video. Usually the informative points (key points) of a video are selected either by a key point detector (e.g., the Harris-Laplace detector [MS02]) or by dense sampling. Since we only extract features from a small ROI (lip and mouth region), dense sampling is employed in this thesis. As in SIFT, after selecting the key points, the next step is to compute the gradient magnitudes and orientations at the sample points in a 3D neighborhood around the key point location. Given a key point s = (x_s, y_s, t_s), a 3D neighborhood is considered to be a cuboid volume with its center of gravity lying at the key point location. This 3D cuboid is described by C = (s, w, l, h), where w, h and l are its width (along the X axis), height (along the Y axis) and length (along the time axis), respectively. The gradient magnitudes are weighted by a Gaussian window, which in our case is a sphere centered at the key point location. The 3D cuboid is first divided into 2x2x2 sub-volumes. The gradient orientations in each sub-volume are then accumulated into a sub-histogram with 80 bins, where 80 is the number of polyhedron faces used to tessellate the sphere (in order to quantize the gradient orientations). A sub-histogram represents the visual information of its corresponding sub-volume. All the sub-histograms are then concatenated to construct the final feature vector of length 640.
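The following numpy sketch mirrors the construction just described: a cuboid around a key point is split into 2x2x2 sub-volumes and each sub-volume contributes an 80-bin weighted histogram of gradient orientations. The tessellation directions (face_dirs) and the Gaussian weighting are simplified assumptions, not the exact implementation of [SAS07] or of this thesis.

```python
# Hedged sketch of a 3D-SIFT style descriptor (length 8 * 80 = 640).
import numpy as np

def descriptor_3dsift(cuboid, face_dirs):
    """cuboid: (T, H, W) gray values around a key point.
    face_dirs: (80, 3) unit vectors, one per face of the tessellated sphere."""
    gt, gy, gx = np.gradient(cuboid.astype(np.float32))        # gradients along t, y, x
    grads = np.stack([gx, gy, gt], axis=-1)                     # (T, H, W, 3)
    mag = np.linalg.norm(grads, axis=-1)

    # Gaussian weighting centred at the key point (spherical window)
    T, H, W = cuboid.shape
    t, y, x = np.meshgrid(np.arange(T), np.arange(H), np.arange(W), indexing="ij")
    r2 = (t - T / 2) ** 2 + (y - H / 2) ** 2 + (x - W / 2) ** 2
    weight = mag * np.exp(-r2 / (2.0 * (min(T, H, W) / 2.0) ** 2))

    hists = []
    for ts in np.array_split(np.arange(T), 2):
        for ys in np.array_split(np.arange(H), 2):
            for xs in np.array_split(np.arange(W), 2):
                sub = grads[np.ix_(ts, ys, xs)].reshape(-1, 3)
                w = weight[np.ix_(ts, ys, xs)].reshape(-1)
                # assign each gradient to the closest of the 80 face directions
                unit = sub / (np.linalg.norm(sub, axis=1, keepdims=True) + 1e-8)
                bins = np.argmax(unit @ face_dirs.T, axis=1)
                hists.append(np.bincount(bins, weights=w, minlength=80))
    return np.concatenate(hists)
```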

2.1.4 Bag-of-Words

The bag-of-words (BoW) model is one of the most popular feature representations in machine-vision and visual classification tasks. The key idea in BoW is in fact borrowed from text mining. The standard steps to extract features from a document in a text retrieval system are: (I) parsing the document into words, (II) assigning a unique identifier (code) to each word, and (III) representing the document by a vector whose components are given by the frequency of occurrence of the words. Following this line of thought, Sivic and Zisserman introduced the BoW model for object retrieval and visual search in videos [SZ03]. The BoW method can be outlined as follows:

1. Extracting raw features from an image or a video. In this thesis, we use SIFT and 3D-SIFT as the raw features for representing the ROIs.

2. Constructing a codebook with K codes by employing a clustering approach such as K-means, and assigning an index i ∈ {1, ..., K} to each feature vector according to its cluster membership.

3. Representing each ROI by a K-dimensional vector where the i-th element is the frequency of occurrence of the index i in the given ROI.

Two important problems can be handled by using BoW in classification. First, in lipreading, ROIs may have different sizes (because of the variations of speakers' lip shapes and sizes). Due to these changes in image size, we may extract different numbers of feature vectors from different ROIs. It is then not clear how to represent the ROIs with equal-size feature vectors. One remedy is to artificially resize all images to have an equal size. This idea is used in Chapter 3 to train a visual speech recognizer with ISCN features. Another solution is to use the BoW model, which sidesteps this problem by assigning equal-size feature vectors to the ROIs.

Second, features generated by the BoW method are highly robust against small variations and largely unaffected by changes in camera viewpoint, the object's scale and scene illumination. This robustness is the direct result of two factors: first, employing vector quantization, and second, using the distribution (frequency of occurrence) of the raw features as the final representation of an image. Both of these factors reduce the amount of noise in the raw features and, consequently, increase the robustness of the system.
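A minimal sketch of this pipeline with scikit-learn is given below; the codebook size K and the function names are illustrative assumptions rather than the settings used in the experiments of this thesis.

```python
# Hedged sketch of the bag-of-words representation built from raw descriptors.
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(pooled_descriptors, K=500):
    """pooled_descriptors: (n, d) SIFT / 3D-SIFT descriptors from all training ROIs."""
    return KMeans(n_clusters=K, n_init=10, random_state=0).fit(pooled_descriptors)

def bow_vector(codebook, roi_descriptors):
    """Represent one ROI by the frequency of occurrence of each code."""
    K = codebook.n_clusters
    idx = codebook.predict(roi_descriptors)              # cluster index per descriptor
    hist = np.bincount(idx, minlength=K).astype(float)
    return hist / max(hist.sum(), 1.0)                   # ROIs of any size map to equal-size vectors
```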

2.2 Datasets

Compared with audio-only datasets, the number of audio-visual datasets suitable for training an AV-ASR is much smaller, and even fewer of them are freely available. In this thesis, we used four datasets which were commonly used in the relevant literature: CUAVE [PGTG02], GRID [CBCS06], Oulu [ZBP09] and AVletter [MCB+02]. Moreover, we also recorded a multi-channel audio-visual dataset which is used in the last chapter to conclude this thesis. Table 2.1 summarizes some of the important properties of the datasets used in this thesis. Note that the GRID dataset originally contains 32 speakers, while we only used 16 of them in our evaluations.

Dataset     Resolution  #Aud. channels  Light cond.   Speakers
CUAVE       720x480     1               controlled    36
Oulu        720x576     1               controlled    20
GRID        720x576     1               controlled    16
AVletter    unknown     0               controlled    10
ETHDigits   640x480     8               time-varying  15

Table 2.1: Summary of the datasets utilized in this thesis.

2.2.1 CUAVE

The first dataset used in this thesis is CUAVE [PGTG02], which contains the digits from zero to nine repeated five times by 36 speakers. This dataset offers reasonably large speaker variability, which is necessary to train a robust recognizer. The dataset has been recorded at a resolution of 720x480 with the NTSC standard of 29.97 fps [PGTG02]. We, however, do not work directly with the raw data.

Figure 2.1: Manually centered, rotated and scaled ROIs extracted from the CUAVE dataset.

Due to the effort of Mihai Gurban in his PhD work [Gur09], preprocessed data is available for this dataset, in which the video data is de-interlaced and the mouth regions are manually cropped, centered and rotated to be horizontal. Four regions of interest (ROIs) from this preprocessed dataset are depicted in Figure 2.1. Each ROI is a square image with 128x128 pixels. All ROIs are of equal size, and the feature vectors representing these ROIs are also of equal size. Naturally, working with manually detected ROIs leads to overly optimistic results. However, since our focus in Chapters 3 and 7, where the CUAVE dataset is employed, is on feature set evaluation and audio-visual information fusion, this issue can be ignored. Later in Chapter 6, when we opt for the GRID dataset, which is much larger than CUAVE and for which no manually detected ROIs are available, the effect of the automatic ROI detection is explored.

2.2.2 GRID

The GRID corpus has been introduced in [CBCS06]. It consists of high-quality audio and video recordings of 1000 utterances spoken by each of 34 talkers. The sentences are simple, syntactically identical phrases such as "place at B 4 now". The sentence structure of the GRID corpus is indicated in Table 2.2.

Each sentence consists of six words. The audio-visual recordings were made in a single-walled acoustically isolated booth. Each recording lasts exactly 3 seconds and consists of 75 frames, i.e., 25 frames per second. The image resolution is 720x576 pixels. The sampling frequency of the audio data is 44.1 kHz, which in our experiments was downsampled to 8 kHz. Some sample frames of this corpus are shown in Figure 2.2.

Command  Color*  Preposition  Letter*          Digit*  Adverb
bin      blue    at           A–Z              0–9     again
lay      green   by           (excluding W)            now
place    red     in                                    please
set      white   with                                   soon

Table 2.2: Sentence structure of the GRID corpus. The keyword columns are marked with asterisks. Each sentence consists of six words, three of which are keywords.

Figure 2.2: Top four images are sample frames from the GRID corpus and the bottom four images are automatically extracted mouth regions. Different ROIs may have different sizes.

Unlike the CUAVE dataset, the mouth regions are not manually labeled. In order to extract the ROIs, in this thesis we used the fully automatic facial point detection algorithm proposed by Dantone et al. [DGFVG12]. It is a real-time facial point detection approach based on conditional regression forests and yields high accuracy in most cases. The detected ROIs may have different sizes due to the various mouth shapes, the speakers' facial characteristics, their distance to the camera and detection errors. Four of these ROIs are illustrated in Figure 2.2. The GRID dataset is more than 40 GB. Hence, we only used a subset of the corpus in our experiments, by randomly selecting 400 sentences from each of the first 16 speakers in the corpus. This reduces the memory required for processing and makes it possible to run our learning algorithms in batch learning mode.

2.2.3 Oulu

The Oulu dataset, collected by Zhao et al. [ZBP09], is a small corpus containing the 10 phrases listed in Table 2.3. The video data is recorded by a SONY DSR-200AP 3CCD camera with a frame rate of 25 fps. The image resolution is 720x576 pixels. This dataset includes 20 persons, each uttering the ten phrases listed in Table 2.3 five times. The audio files are single-channel, 16 bit per sample, uncompressed wave files with a sampling frequency of 48 kHz. With this dataset, semi-automatically cropped ROIs are also provided. These mouth regions were determined by manually marking the eye positions in each frame and passing them to an automatic mouth detection system. Figure 2.3 depicts four frames and four extracted ROIs of this dataset. While the video data is colored, the ROIs provided with this dataset are gray-scale images. Therefore, as for the GRID dataset, we automatically extracted our own ROIs since, as shown in Chapter 6, colored images are more suitable for lipreading due to the valuable information in colors, particularly the red color of the lips. By transforming the RGB images to multiple color spaces, we improve the robustness of the system and increase the convergence speed of the training phase.

C1: Excuse me    C2: Goodbye      C3: Hello       C4: How are you        C5: Nice to meet you
C6: See you      C7: I am sorry   C8: Thank you   C9: Have a good time   C10: You are welcome

Table 2.3: The ten phrases used in the Oulu dataset.

Figure 2.3: Top four images are sample frames from the Oulu corpus and the bottom four images are automatically extracted mouth regions.

2.2.4 AVletter

The AVletters dataset [MCB+02] consists of 10 speakers saying the letters A to Z, three times each. The dataset only provides pre-extracted lip regions of 60x80 pixels in gray scale. As there is no raw audio data available for this dataset, we only used it for the evaluation of lipreading systems. Figure 2.4 shows four example ROIs of this dataset.

Figure 2.4: Four ROI examples from the AVletter dataset, which consists of repetitions of the letters A to Z.

2.2.5 ETHDigits

Since all the datasets described above were collected under controlled conditions, their high-quality audio-visual data are not representative of real-world examples. Therefore, we set up a recording system to collect our own audio-visual dataset.

Figure 2.5: Eight channel microphone array used for audio acquisition in the ETHDigits dataset.

ETHDigits is an audio-visual dataset recorded in a highly reverberant office room with a reverberation time T60 of more than 1 second. The audio data is captured by the 8 microphones shown in Figure 2.5 and the video signals are recorded by a Kinect for Xbox 360. In this work, however, we only use the RGB camera of the Kinect, which has a resolution of 640x480 pixels and operates at a frame rate of 30 Hz. Unlike for the previous datasets, we did not record the visual data under controlled conditions. Depending on how many lights were on during a recording session, what time of day it was and whether the sun was shining, the amount of light in the room may vary greatly, which in turn results in large variations of the visual data quality. The ETHDigits dataset consists of 15 speakers, and each speaker repeats a sequence of the numbers from one to ten 5 times. The digits were presented on a computer screen located about 45 centimeters away from the talkers, and both the Kinect and the microphone array were mounted on top of this screen. Unlike in the CUAVE database, here the five digit sequences appear with different speeds on the screen (the later the faster). Namely, there are longer silence durations between the numbers of the first three sequences than between those of the last two sequences. Some samples of this dataset can be seen in Figure 2.6.

Figure 2.6: Top four images are sample frames from the ETHDigits dataset and the bottom four images are their corresponding automatically extracted mouth regions.

Chapter 3

Feature Selection

For various reasons, feature selection is a necessary preprocessing block in many machine learning applications. The main reason, however, is simply the lack of enough data. In this chapter, various aspects of the feature selection problem are investigated and a feature selection algorithm is developed, which is later employed to select a small subset of informative features from the ISCN coefficients in order to train a GMM-HMM based lipreading system.

The two main issues in feature subset selection are finding an appropriate measure function that can be computed fairly fast and robustly for high-dimensional data, and a search strategy to optimize the measure over the subset space in a reasonable amount of time. In this chapter, the mutual information between features and class labels is considered to be the measure function. Two series expansions for mutual information are proposed, and it is shown that most heuristic criteria suggested in the literature are truncated approximations of these expansions.

As a search strategy, we suggest a parallel search strategy based on semi-definite programming (SDP) that can search through the subset space in polynomial time. By exploiting the similarities between the proposed algorithm and an instance of the maximum-cut problem in graph theory, the approximation ratio of this algorithm is derived and compared with the approximation ratio of the backward elimination method.

3.1 Introduction to Feature Selection Problem

From a purely theoretical point of view, given the underlying conditional probability distribution of a dependent variable C and a set of features X, the Bayes decision rule can be applied to construct the optimum induction algorithm. However, in practice, learning machines are not given access to the distribution Pr(C|X). Therefore, given a feature vector of variables X ∈ R^N, the aim of most machine learning algorithms is to approximate this underlying distribution or to estimate some of its characteristics. Unfortunately, in most practically relevant data mining applications, the dimensionality of the feature vector is quite high, making it prohibitive to learn the underlying distribution. For instance, gene expression data or images may easily have more than tens of thousands of features. While, at least in theory, having more features should result in a more discriminative classifier, this is not the case in practice because of the computational burden and the overfitting effect.

High-dimensional data poses various challenges for induction and prediction algorithms. Essentially, the amount of data needed to sustain the spatial density of the underlying distribution increases exponentially with the dimensionality of the feature vector, or alternatively, the sparsity increases exponentially for a constant amount of data. Normally, in real-world applications, a limited amount of data is available and obtaining a sufficiently good estimate of the underlying high-dimensional probability distribution is almost impossible, except for some special data structures or under some assumptions (independent features, etc.). Thus, dimensionality reduction techniques, particularly feature extraction and feature selection methods, have to be employed to reconcile idealistic learning algorithms with real-world applications.

3.1.1 Various Feature Selection Methods

In the context of feature selection, two main issues can be distinguished. The first one is to define an appropriate measure function to assign a score to a set of features. The second issue is to develop a search strategy that can find the optimal (in the sense of optimizing the value of the measure function) subset of features among all feasible subsets in a reasonable amount of time. The different approaches to address these two problems can roughly be categorized into three groups: wrapper methods, embedded methods and filter methods.

Wrapper methods [Koh96] use the performance of an induction algorithm (for instance a classifier) as the measure function. Given an inducer I, wrapper approaches search through the space of all possible feature subsets and select the one that maximizes the induction accuracy. Most methods of this type require checking all the possible 2^N subsets of features and may thus rapidly become prohibitive due to the so-called combinatorial explosion. Since the measure function is a machine learning (ML) algorithm, the selected feature subset is only optimal with respect to that particular algorithm, and may show poor generalization performance with other inducers.

The second group of feature selection methods are called embedded methods [NSS04] and are based on some internal parameters of the ML algorithm. Embedded approaches rank features during the training process and thus simultaneously determine both the optimal features and the parameters of the ML algorithm. Since using (accessing) the internal parameters may not be possible in all ML algorithms, this approach cannot be seen as a general solution to the feature selection problem. In contrast to wrapper methods, embedded strategies do not require an exhaustive search over all subsets, since they mostly evaluate each feature individually based on a score calculated from the internal parameters. However, similar to wrapper methods, embedded methods depend on the induction model and thus the selected subset is somehow tuned to a particular induction algorithm.

Filter methods, as the third group of selection algorithms, focus on filtering out irrelevant and redundant features, where irrelevancy is defined according to a predetermined measure function. Unlike the first two groups, filter methods do not incorporate the learning part and thus show better generalization power over a wider range of induction algorithms. They rely on finding an optimal feature subset through the optimization of a suitable measure function. Since the measure function is selected independently of the induction algorithm, this approach decouples the feature selection problem from the subsequent ML algorithm.

The first contribution of this work is to analyze the popular mutual information measure in the context of the feature selection problem. The mutual information function is expanded in two different series, and it is shown that most of the previously suggested information-theoretic criteria are first or second order truncation-approximations of these expansions. The first expansion is based on a generalization of mutual information and has already appeared in the literature, while the second one is, to the best of our knowledge, new. The well-known minimal Redundancy Maximal Relevance (mRMR) score function can be immediately concluded from the second expansion.

3.1.2 Feature Selection Search Strategies

Alternatively, feature selection methods can be categorized based on the search strategies they employ. Popular search approaches can be divided into four categories: exhaustive search, greedy search, projection and heuristic search. A trivial approach is to exhaustively search the subset space, as is done in wrapper methods. However, as the number of features increases, this rapidly becomes infeasible. Hence, many popular search approaches use greedy hill climbing as an approximation to this NP-hard combinatorial problem. Greedy algorithms iteratively evaluate a candidate subset of features, then modify the subset and evaluate whether the new subset is an improvement over the old one. This can be done in a forward selection setup, which starts with an empty set and adds one feature at a time, or with a backward elimination process, which starts with the full set of features and removes one feature at each step. The third group of search algorithms is based on targeted projection pursuit, a linear mapping algorithm that pursues an optimum projection of the data onto a low-dimensional manifold scoring highly with respect to a measure function [FT74]. In heuristic methods, for instance genetic algorithms, the search starts with an initial subset of features which gradually evolves toward better solutions. Recently, two convex quadratic programming based methods, QPFS in [RHEC10] and SOSS in [NHP13], have been suggested to address the search problem. QPFS is a deterministic algorithm and utilizes the Nyström method to approximate large matrices for efficiency purposes. SOSS, on the other hand, has a randomized rounding step which injects a degree of randomness into the algorithm in order to generate more diverse feature sets.

Developing a new search strategy is another contribution of this chapter. Here, a new class of search algorithms is introduced which is based on semi-definite programming (SDP) relaxation. The feature selection problem is reformulated as an instance of (0-1)-quadratic integer programming. This integer programming optimization is then relaxed to an SDP problem, which is convex and hence can be solved with efficient algorithms [BV04]. It is shown that this usually gives better solutions than greedy algorithms, in the sense that its approximate solution is more likely to be closer to the optimal point of the criterion.

3.2 Mutual Information Pros and Cons

Let us consider an N-dimensional feature vector X = [X_1, X_2, ..., X_N] and a dependent variable C, which can be either a class label in case of classification or a target variable in case of regression. The mutual information function is defined as a distance from independence between X and C measured by the Kullback-Leibler divergence [CT91]. Basically, mutual information measures the amount of information shared between X and C by measuring their dependency level. Denote the joint pdf of X and C and its marginal distributions by Pr(X, C), Pr(X) and Pr(C), respectively. The mutual information between the feature vector and the class label can be defined as follows:

I(X_1, X_2, \ldots, X_N; C) = I(X; C) = \int \Pr(X, C) \log \frac{\Pr(X, C)}{\Pr(X)\,\Pr(C)} \, dX \, dC \qquad (3.1)

It reaches its maximum value when the dependent variable is perfectly described by the feature set. In this case mutual information is equal to H(C), where H(C) is the Shannon entropy of C. Mutual information can also be considered a measure of set intersection [Rez61]. Namely, let 𝒜 and ℬ be the event sets corresponding to the random variables A and B, respectively. It is not difficult to verify that a function µ defined as:

\mu(\mathcal{A} \cap \mathcal{B}) = I(A; B) \qquad (3.2)

satisfies all three properties of a formal measure over sets [Yeu91], [Bog07], i.e., non-negativity, assigning zero to the empty set, and countable additivity. However, as will be seen later, the generalization of the mutual information measure to more than two sets no longer satisfies the non-negativity property and thus has to be seen as a signed measure, which generalizes the concept of a measure by allowing negative values.

There are at least three reasons for the popularity of mutual information in feature selection algorithms.

1. Most of the suggested non information-theoretic score functions are not formal set measures (for instance the correlation function). Therefore, they cannot assign a score to a set of features but rather to individual features. Mutual information, however, as a formal set measure is able to evaluate all possible informative interactions and complex functional relations between features and, as a result, represent the complete information contained in a set of features.

2. The relevance of the mutual information measure to misclassification error is supported by the existence of bounds relating the probability of misclassification of the Bayes classifier, P_e, to the mutual information. More specifically, Fano's weak lower bound [Fan61] on P_e,

1 + P_e \log_2(n_y - 1) \geq H(C) - I(X; C) \qquad (3.3)

where n_y is the number of classes, and the Hellman-Raviv upper bound [HR70] on P_e,

P_e \leq \frac{1}{2}\big(H(C) - I(X; C)\big) \qquad (3.4)

provide somewhat of a performance guarantee. As can be seen in (3.3) and (3.4), maximizing the mutual information between X and C decreases both the upper and lower bounds on the misclassification error and thus guarantees the goodness of the selected feature set.

However, there is somewhat of a misunderstanding of this fact in the literature. It is sometimes wrongly claimed that maximizing the mutual information results in minimizing the P_e of the optimal Bayes classifier. This is an unfounded claim since P_e is not a monotonic function of the mutual information. Namely, it is possible that a feature vector A with less relevant information content about the class label C than a feature vector B yields a lower classification error rate than B. The following example may clarify this point.

Example 3.1. Consider a binary classification problem with an equal number of positive and negative training samples and two binary features X_1 and X_2. The goal is to select the optimum feature for the classification task. Suppose the first feature X_1 is positive whenever the outcome is positive; however, when the outcome is negative, X_1 takes positive and negative values with equal probability. Namely, Pr(X_1=1 | C=1) = 1 and Pr(X_1=-1 | C=-1) = 0.5. In the same manner, the likelihood of X_2 is defined as Pr(X_2=1 | C=1) = 0.9 and Pr(X_2=-1 | C=-1) = 0.7. Then, the Bayes classifier with feature X_1 yields the classification error:

P_{e1} = \Pr(C=-1)\Pr(X_1=1 \mid C=-1) + \Pr(C=1)\Pr(X_1=-1 \mid C=1) = 0.25 \qquad (3.5)

Similarly, the Bayes classifier with X_2 yields P_{e2} = 0.2, meaning that X_2 is a better feature than X_1 in the sense of minimizing the probability of misclassification. However, unlike their error probabilities, I(X_1; C) = 0.31 is

greater than I(X_2; C) = 0.29. That is, X_1 conveys more information about the class label in the sense of Shannon mutual information than X_2.
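The numbers in this example are easy to verify numerically. The following minimal sketch (the helper names are ours, not from the thesis) builds the joint distribution of each binary feature with the class label and computes both the Bayes error and the mutual information:

import numpy as np

def entropy(p):
    """Shannon entropy (in bits) of a discrete distribution given as probabilities."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def bayes_error_and_mi(p_c1, p_x1_given_c1, p_xm1_given_cm1):
    """Bayes error and I(X;C) for a binary feature X and binary class C."""
    p_cm1 = 1.0 - p_c1
    # Joint table over (C, X): rows C = +1 / -1, columns X = +1 / -1.
    joint = np.array([
        [p_c1 * p_x1_given_c1,            p_c1 * (1.0 - p_x1_given_c1)],
        [p_cm1 * (1.0 - p_xm1_given_cm1), p_cm1 * p_xm1_given_cm1],
    ])
    p_x, p_c = joint.sum(axis=0), joint.sum(axis=1)
    bayes_error = 1.0 - joint.max(axis=0).sum()   # 1 - sum_x max_c Pr(c, x)
    mi = entropy(p_x) + entropy(p_c) - entropy(joint.ravel())
    return bayes_error, mi

print(bayes_error_and_mi(0.5, 1.0, 0.5))   # X1: error 0.25, I(X1;C) ~ 0.31
print(bayes_error_and_mi(0.5, 0.9, 0.7))   # X2: error 0.20, I(X2;C) ~ 0.29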

A more detailed discussion can be found in [FDV12]. However, it is worthwhile to mention that although using mutual information may not necessarily result in the highest classification accuracy, it is guaranteed to reveal a salient feature subset by reducing the upper and lower bounds on P_e.

3. By adopting classification error as a criterion, most standard classification algorithms fail to correctly classify the instances from minority classes in imbalanced datasets. Common approaches to address this issue are to either assign higher misclassification costs to minority classes or replace the classification accuracy criterion with the area under the ROC curve, which is a more relevant criterion when dealing with imbalanced datasets. Either way, the features should also be selected by an algorithm which is insensitive (robust) with respect to class distributions (otherwise the selected features may not be informative about the minority classes in the first place). Interestingly, by internally applying unequal class-dependent costs, mutual information provides some robustness with respect to class distributions. Thus, even in an imbalanced case, a mutual information based feature selection algorithm is likely (though not guaranteed) not to overlook the features that represent the minority classes. In [Hu11], the concept of the mutual information classifier is investigated. Specifically, the internal cost matrix of the mutual information classifier is derived to show that it applies unequal misclassification costs when dealing with imbalanced data, and it is shown that the mutual information classifier is optimal in the sense of maximizing a weighted classification accuracy rate. The following example illustrates this robustness.

Example 3.2. Assume an imbalanced binary classification task where Pr(C=1) = 0.9. As in Example 3.1, there are two binary features X_1 and X_2 and the goal is to select the optimum feature. Suppose Pr(X_1=1 | C=1) = 1 and Pr(X_1=-1 | C=-1) = 0.5. Unlike the first feature, X_2 can much better classify the minority class: Pr(X_2=-1 | C=-1) = 1 and Pr(X_2=1 | C=1) = 0.8. It can be seen that the Bayes classifier with X_1 results in a 100% classification rate for the majority class but only 50% correct classification for the minority. On the other hand, using X_2 leads to 100% correct classification for the minority class and 80% for the majority. Based on the probability of error, X_1 should be preferred since its probability of error is P_{e1} = 0.05 while P_{e2} = 0.18. However, by using X_1 the classifier cannot learn the rare event (50% classification rate) and thus essentially randomly classifies the minority class, which is the class of interest in many applications. Interestingly, unlike the error probabilities, mutual information prefers X_2 over X_1, since I(X_2; C) = 0.20 is greater than I(X_1; C) = 0.18. That is, mutual information is to some extent robust against imbalanced data.
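The mutual information values of this example can be checked in the same way; only the prior and likelihoods change (the code below is again an illustrative sketch with names of our choosing):

import numpy as np

def mutual_information(joint):
    """I(X;C) in bits from a joint probability table over (C, X)."""
    joint = np.asarray(joint, dtype=float)
    h = lambda p: -np.sum(p[p > 0] * np.log2(p[p > 0]))
    return h(joint.sum(axis=0)) + h(joint.sum(axis=1)) - h(joint.ravel())

# Priors: Pr(C=1) = 0.9, Pr(C=-1) = 0.1; columns are X = +1 / -1.
joint_x1 = np.array([[0.9 * 1.0, 0.9 * 0.0],    # X1 given C = +1
                     [0.1 * 0.5, 0.1 * 0.5]])   # X1 given C = -1
joint_x2 = np.array([[0.9 * 0.8, 0.9 * 0.2],    # X2 given C = +1
                     [0.1 * 0.0, 0.1 * 1.0]])   # X2 given C = -1
print(mutual_information(joint_x1))  # ~0.18
print(mutual_information(joint_x2))  # ~0.20, so mutual information prefers X2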

Unfortunately, despite the theoretical appeal of the mutual information measure, an accurate estimate of the mutual information is impossible given a limited amount of data, because calculating mutual information requires estimating the high-dimensional joint probability Pr(X, C), which is in turn known to be an NP-hard problem [KS01]. As mutual information is hard to evaluate, several alternatives have been suggested [Bat94], [PLD05], [KC02]. For instance, the Max-Relevance criterion approximates (3.1) with the sum of the mutual information values between the individual features X_i and C:

\text{Max-Relevance} = \sum_{i=1}^{N} I(X_i; C) \qquad (3.6)

Since it implicitly assumes that the features are independent, it is likely that the selected features are highly redundant. To overcome this problem, several heuristic corrective terms have been introduced to remove the redundant information and select mutually exclusive features. Here, it is shown that most of these heuristics can be derived from the following expansions of mutual information with respect to X_i.

3.2.1 First Expansion: Multi-way Mutual Information

The first expansion of mutual information used here relies on the natural extension of mutual information to more than two random variables, proposed by McGill [McG54] and Abramson [Abr63]. According to their proposal, the three-way mutual information between random variables Y_i is defined by:

I(Y_1; Y_2; Y_3) = I(Y_1; Y_3) + I(Y_2; Y_3) - I(Y_1, Y_2; Y_3) = I(Y_1; Y_2) - I(Y_1; Y_2 \mid Y_3) \qquad (3.7)

where a comma between variables denotes joint variables. Note that, similar to two-way mutual information, it is symmetric with respect to the Y_i variables, i.e.,

I(Y_1; Y_2; Y_3) = I(Y_2; Y_3; Y_1). Generalizing to N variables:

I(Y_1; Y_2; \ldots; Y_N) = I(Y_1; \ldots; Y_{N-1}) - I(Y_1; \ldots; Y_{N-1} \mid Y_N) \qquad (3.8)

Unlike two-way mutual information, the generalized mutual information is not necessarily nonnegative and hence can be interpreted as a signed measure of set intersection [Han80]. Consider (3.7) and assume Y_3 is the class label C. Then a positive I(Y_1; Y_2; C) implies that Y_1 and Y_2 are redundant with respect to C, since I(Y_1, Y_2; C) ≤ I(Y_1; C) + I(Y_2; C). The more interesting case, however, is when I(Y_1; Y_2; C) is negative, i.e., I(Y_1, Y_2; C) ≥ I(Y_1; C) + I(Y_2; C). This means that the information contained in the interactions of the variables is greater than the sum of the information of the individual variables [Gur09].
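A toy construction (ours, not from the thesis) makes the negative case concrete: if C = X_1 XOR X_2 with independent, uniform binary X_1 and X_2, each feature alone carries no information about C, the pair carries one full bit, and the three-way term of (3.7) is -1:

import numpy as np
from itertools import product

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Joint pmf over (x1, x2, c) for c = x1 XOR x2 and uniform independent x1, x2.
pmf = {(x1, x2, x1 ^ x2): 0.25 for x1, x2 in product([0, 1], repeat=2)}

def H(keep):
    """Entropy of the marginal over the variable positions listed in `keep`."""
    marg = {}
    for outcome, p in pmf.items():
        key = tuple(outcome[i] for i in keep)
        marg[key] = marg.get(key, 0.0) + p
    return entropy(list(marg.values()))

I_x1_c = H([0]) + H([2]) - H([0, 2])          # I(X1; C) = 0
I_x2_c = H([1]) + H([2]) - H([1, 2])          # I(X2; C) = 0
I_x1x2_c = H([0, 1]) + H([2]) - H([0, 1, 2])  # I(X1, X2; C) = 1
print(I_x1_c + I_x2_c - I_x1x2_c)             # I(X1; X2; C) = -1 (synergy)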

An artificial example of this situation is the binary classification problem depicted in Figure 3.1, where the classification task is to discriminate between the ellipse class (class samples depicted by circles) and the line class (star samples) using two features: the values of the x axis and the values of the y axis. As can be seen, since I(x; C) ≈ 0 and I(y; C) ≈ 0, there is no way to distinguish between these two classes by using just one of the features. However, it is obvious that employing both features results in almost perfect classification, i.e., I(x, y; C) ≈ H(C).

The mutual information in (3.1) can be expanded in terms of generalized mutual information between the features and the class label as:

I(X; C) = \sum_{i_1=1}^{N} I(X_{i_1}; C) - \sum_{i_1=1}^{N-1} \sum_{i_2=i_1+1}^{N} I(X_{i_1}; X_{i_2}; C) + \cdots + (-1)^{N-1} I(X_1; \ldots; X_N; C) \qquad (3.9)

From the definition in (3.8) it is straightforward to infer this expansion. However, a more intuitive proof is to use the fact that mutual information is a measure of set intersection, i.e., I(Y_1; Y_2; Y_3) = \mu(\mathcal{Y}_1 \cap \mathcal{Y}_2 \cap \mathcal{Y}_3), where \mathcal{Y}_i is the event set corresponding to the variable Y_i.


Figure 3.1: Synergy between x and y features. While information of each individual feature about the class label (ellipse or line) is almost zero, their joint information can almost completely remove the class label ambiguity.

Now, expanding the N-variable measure function results in

I(X; C) = \mu\Big(\big(\bigcup_{i=1}^{N} \mathcal{X}_i\big) \cap \mathcal{C}\Big) = \mu\Big(\bigcup_{i=1}^{N} (\mathcal{X}_i \cap \mathcal{C})\Big) \qquad (3.10)
= \sum_{i=1}^{N} \mu(\mathcal{X}_i \cap \mathcal{C}) - \sum_{i_1=1}^{N-1} \sum_{i_2=i_1+1}^{N} \mu(\mathcal{X}_{i_1} \cap \mathcal{X}_{i_2} \cap \mathcal{C}) + \cdots + (-1)^{N-1} \mu(\mathcal{X}_1 \cap \mathcal{X}_2 \cap \cdots \cap \mathcal{X}_N \cap \mathcal{C})

where the last equality follows directly from the addition law (inclusion-exclusion) of set theory. The proof is completed by recalling that all measure functions with set intersection arguments in the last equation can be replaced by mutual information functions according to the definition of mutual information in (3.2).

3.2.2 Second Expansion: Chain Rule of Information

The second expansion of mutual information is based on the chain rule of information [CT91]:

I(X; C) = \sum_{i=1}^{N} I(X_i; C \mid X_{i-1}, \ldots, X_1) \qquad (3.11)

The chain rule of information leaves the choice of ordering quite flexible. For example, the right-hand side can be written in the order (X_1, X_2, \ldots, X_N) or (X_N, X_{N-1}, \ldots, X_1). In general, it can be expanded over the N! different permutations of the feature set \{X_1, \ldots, X_N\}. Taking the sum over all possible expansions yields

N!\, I(X; C) = (N-1)! \sum_{i=1}^{N} I(X_i; C) + (N-2)! \sum_{i_1=1}^{N} \sum_{i_2 \in \{1,\ldots,N\} \setminus i_1} I(X_{i_2}; C \mid X_{i_1}) + \cdots + (N-1)! \sum_{i=1}^{N} I(X_i; C \mid \{X_1, \ldots, X_N\} \setminus X_i) \qquad (3.12)

Dividing both sides by 2(N-1)! and using the identity I(X_{i_1}; C \mid X_{i_2}) = I(X_{i_1}; C) - I(X_{i_1}; X_{i_2}; C) to replace the I(X_{i_1}; C \mid X_{i_2}) terms, our second expansion can be expressed as

\frac{N}{2}\, I(X; C) = \sum_{i=1}^{N} I(X_i; C) - \frac{1}{N-1} \sum_{i_1=1}^{N-1} \sum_{i_2=i_1+1}^{N} I(X_{i_1}; X_{i_2}; C) + \cdots + \frac{1}{2} \sum_{i=1}^{N} I(X_i; C \mid \{X_1, \ldots, X_N\} \setminus X_i) \qquad (3.13)

Ignoring the unimportant multiplicative constant N/2 on the left-hand side of equation (3.13), the right-hand side can be seen as a series expansion of mutual information (up to a known constant factor).

3.2.3 Truncation of the Expansions

In the proposed expansions (3.9) and (3.13), mutual information terms with more than two features represent higher-order interaction properties. Neglecting the higher-order terms yields the so-called truncated approximation of the mutual information function. If the constant coefficient in (3.13) is ignored, the truncated forms of the suggested expansions can be written as:

D_1 = \sum_{i=1}^{N} I(X_i; C) - \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} I(X_i; X_j; C) \qquad (3.14)

D_2 = \sum_{i=1}^{N} I(X_i; C) - \frac{1}{N-1} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} I(X_i; X_j; C) \qquad (3.15)

where D_1 is the truncated approximation of (3.9) and D_2 that of (3.13). Interestingly, despite the very similar structure of the expressions in (3.14) and (3.15), they have intrinsically different behaviors. This difference seems to be rooted in the different functional forms they employ to approximate the underlying high-order pdf with lower-order distributions (i.e., how they combine these lower-order terms). For instance, the functional form that MIFS employs to approximate Pr(X) is shown in (3.19). While D_1 is not necessarily positive, D_2 is guaranteed to be a positive approximation since all terms in (3.12) are positive. However, D_2 may strongly underestimate the mutual information value since it may violate the fact that (3.1) is always greater than or equal to max_i I(X_i; C).
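For discrete features, both truncated criteria can be estimated with plug-in (histogram) entropies. The following sketch (function names and the toy data are ours) forms the three-way terms via I(X_i; X_j; C) = I(X_i; X_j) - I(X_i; X_j | C) and evaluates D_1 and D_2 for a full feature set:

import numpy as np
from itertools import combinations

def joint_entropy(*columns):
    """Plug-in joint entropy (bits) of one or more discrete sample columns."""
    joint = np.stack(columns, axis=1)
    _, counts = np.unique(joint, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def mi(a, b):
    return joint_entropy(a) + joint_entropy(b) - joint_entropy(a, b)

def cond_mi(a, b, c):
    """I(A; B | C) = H(A,C) + H(B,C) - H(A,B,C) - H(C)."""
    return (joint_entropy(a, c) + joint_entropy(b, c)
            - joint_entropy(a, b, c) - joint_entropy(c))

def truncated_scores(X, y):
    """Return (D1, D2) of the full feature set, following (3.14) and (3.15)."""
    n = X.shape[1]
    relevance = sum(mi(X[:, i], y) for i in range(n))
    interaction = sum(mi(X[:, i], X[:, j]) - cond_mi(X[:, i], X[:, j], y)
                      for i, j in combinations(range(n), 2))
    return relevance - interaction, relevance - interaction / (n - 1)

# Toy usage on random discrete data (illustration only).
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(500, 5))
y = (X[:, 0] + X[:, 1] > 2).astype(int)
print(truncated_scores(X, y))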

JMI, mRMR & MIFS Criteria

Several known criteria, including joint mutual information (JMI) [MB06], minimal Redundancy Maximal Relevance (mRMR) [PLD05] and Mutual Information Feature Selection (MIFS) [Bat94], can immediately be derived from D_1 and D_2. Using the identity

I(X_i; X_j; C) = I(X_i; C) + I(X_j; C) - I(X_i, X_j; C)

in D_2 reveals that D_2 is equivalent, up to a constant factor, to the JMI criterion:

\text{JMI} = \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} I(X_i, X_j; C) \qquad (3.16)

Using I(X_i; X_j; C) = I(X_i; X_j) - I(X_i; X_j \mid C) and ignoring the terms containing more than two variables, i.e., I(X_i; X_j \mid C), in the second approximation D_2, one immediately recognizes the popular score function

\text{mRMR} = \sum_{i=1}^{N} I(X_i; C) - \frac{1}{N-1} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} I(X_i; X_j) \qquad (3.17)

introduced by Peng et al. in [PLD05]. That is, mRMR is a truncated approximation of mutual information and not a heuristic approximation as suggested in [BPZL12].

The same line of reasoning as for mRMR can be applied to D1 to achieve MIFS with β equal to 1.

\text{MIFS} = \sum_{i=1}^{N} I(X_i; C) - \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} I(X_i; X_j) \qquad (3.18)

Observation: A constant feature is a potential danger for the above measures. While adding an informative but correlated feature may reduce the score value (since I(X_i; X_j \mid C) - I(X_i; X_j) can be negative), adding a non-informative constant feature Z to a feature set does not reduce its score value, since both the I(Z; C) and I(Z; X_i; C) terms are zero. That is, constant features may be preferred over informative but correlated features. Therefore, it is essential to remove constant features by some pre-processing before using the above criteria for feature selection.
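For completeness, the criteria above are conventionally optimized with a greedy, one-feature-at-a-time search; this is the sequential baseline that the search strategy developed later in this chapter aims to improve upon. Below is a minimal sketch of greedy forward mRMR selection in the usual incremental form of Peng et al., where the redundancy term is averaged over the already selected features (helper names are illustrative):

import numpy as np

def joint_entropy(*columns):
    joint = np.stack(columns, axis=1)
    _, counts = np.unique(joint, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def mi(a, b):
    return joint_entropy(a) + joint_entropy(b) - joint_entropy(a, b)

def mrmr_forward(X, y, P):
    """Greedily pick P feature indices maximizing relevance minus average redundancy."""
    n = X.shape[1]
    relevance = [mi(X[:, i], y) for i in range(n)]
    selected, remaining = [], list(range(n))
    while len(selected) < P:
        def score(i):
            redundancy = np.mean([mi(X[:, i], X[:, j]) for j in selected]) if selected else 0.0
            return relevance[i] - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy usage (illustration only).
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(300, 8))
y = ((X[:, 0] + X[:, 3]) > 0).astype(int)
print(mrmr_forward(X, y, 3))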

Implicitly Assumed Distribution

An obvious question arising in this context with respect to the proposed truncated approximations is: under what probabilistic assumptions do the proposed approximations become valid mutual information functions? That is, which structure should a joint pdf admit in order to yield mutual information in the form of D_1 or D_2?

For instance, if we assume that the features are mutually and class-conditionally independent, i.e., \Pr(X) = \prod_{i=1}^{N} \Pr(X_i) and \Pr(X, C) = \Pr(C) \prod_{i=1}^{N} \Pr(X_i \mid C), then it is easy to verify that the mutual information takes the form of the Max-Relevance criterion introduced in (3.6). These two assumptions define the adopted independence-map of Pr(X, C), where the independence-map of a joint probability distribution is defined as follows.

Definition 3.3. An independence-map (i-map) is a look-up table or a set of rules that denotes all the conditional and unconditional independences between random variables. Moreover, an i-map is consistent if it leads to a valid factorized probability distribution.

That is, given a consistent i-map, a high-order joint probability distribution is approximated with a product of low-order pdfs and the obtained approximation is a valid pdf itself (e.g., \prod_{i=1}^{N} \Pr(X_i) is an approximation of the high-order pdf Pr(X) and is also a valid probability distribution).

The question regarding the implicit consistent i-map that MIFS adopts has been investigated in [BP10]. However, the assumption set (i-map) suggested in their work is inconsistent and leads to the incorrect conclusion that MIFS upper bounds the Bayesian classification error via the inequality (3.4). As shown in the following theorem, unlike the Max-Relevance case, there is no i-map that can produce mutual information in the forms of mRMR or MIFS (ignoring the trivial solution that reduces mRMR or MIFS to Max-Relevance).

Theorem 3.4. There is no consistent i-map, other than the trivial solution, i.e., the i-map indicating that the random variables are mutually and class-conditionally independent, that can produce mutual information functions in the forms of mRMR (3.17) or MIFS (3.18) for an arbitrary number of features.

Proof: The proof is by contradiction. Suppose there is a consistent i-map whose corresponding joint pdf \hat{\Pr}(X, C) (the approximation of Pr(X, C)) can generate mutual information in the forms of (3.17) or (3.18). That is, if this i-map is adopted, replacing \hat{\Pr}(X, C) in (3.1) yields mRMR or MIFS. This implies that mRMR and MIFS are always valid set measures for all datasets, regardless of their true underlying joint probability distributions. Now, if it is shown (by any example) that they are not valid mutual information measures, i.e., they are not always positive and monotonic, then the assumption that \hat{\Pr}(X, C) exists and is a valid pdf has been contradicted. It is not difficult to construct an example in which mRMR or MIFS takes negative values: consider the case where the features are independent of the class label, I(X_i; C) = 0, while they have nonzero dependencies among themselves, I(X_i; X_j) ≠ 0. In this case, both mRMR and MIFS generate negative values, which is not allowed for a valid set measure. This contradicts our assumption that they are generated by a valid distribution, so we are forced to conclude that there is no consistent i-map that results in mutual information in the mRMR or MIFS forms.

The same line of reasoning can be used to show that D_1 and D_2 are not valid measures either. However, despite the fact that no valid pdf can produce mutual information of those forms, it is still valid to ask for which low-order approximations of the underlying high-order pdfs mutual information reduces to a truncated approximation form. That is, we no longer restrict an approximation to be a valid distribution. Any functional form of low-order pdfs may be seen as an approximation of the high-order pdfs and may give rise to MIFS or mRMR. In the following, these assumptions are revealed for the MIFS criterion.

MIFS Derivation from Kirkwood Approximation

It is shown in [KKG07] that truncating the joint entropy H(X) at the r-th order is equivalent to approximating the full-dimensional pdf Pr(X) using joint pdfs of dimensionality r or smaller. This approximation is called the r-th order Kirkwood approximation. The chosen truncation order partially determines our belief about the structure of the function that approximates Pr(X). The 2nd order Kirkwood approximation of Pr(X) can be written as follows [KKG07]:

\hat{\Pr}(X) = \frac{\prod_{i=1}^{N-1} \prod_{j=i+1}^{N} \Pr(X_i, X_j)}{\left(\prod_{i=1}^{N} \Pr(X_i)\right)^{N-2}} \qquad (3.19)

Now, assume the following two assumptions hold:

Assumption 3.5. The features are class-conditionally independent, that is, \Pr(X \mid C) = \prod_{i=1}^{N} \Pr(X_i \mid C).

Assumption 3.6. \Pr(X) is well approximated by the 2nd order Kirkwood superposition approximation in (3.19).

Then, writing the definition of mutual information and applying the above assumptions yields the MIFS criterion

I(X; C) = H(X) - H(X \mid C) \qquad (3.20)
\overset{(a)}{\approx} \sum_{i=1}^{N} H(X_i) - \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} I(X_i; X_j) - H(X \mid C)
\overset{(b)}{=} \sum_{i=1}^{N} I(X_i; C) - \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} I(X_i; X_j)

In the above equation, (a) follows from the second assumption by substituting the 2nd order Kirkwood approximation (3.19) inside the logarithm of the entropy integral, and (b) is an immediate consequence of the first assumption. The first assumption has already appeared in previous works [BPZL12], [BP10]. The second assumption, however, is novel and, to the best of our knowledge, the connection between the Kirkwood approximation and the MIFS criterion has not been explored before. It is worth mentioning that, in reality, both assumptions can be violated. Specifically, the Kirkwood approximation may not precisely reproduce dependencies observed in real-world datasets. Moreover, it is important to remember that the Kirkwood approximation is not a valid probability distribution.
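This last point is easy to verify numerically: applying the second-order Kirkwood superposition formula (3.19) to a small, randomly generated three-variable pmf (a toy setup of ours) generally produces a function that does not sum to one:

import numpy as np

rng = np.random.default_rng(0)
p = rng.random((2, 2, 2))
p /= p.sum()                                   # a valid joint pmf Pr(X1, X2, X3)

p12, p13, p23 = p.sum(axis=2), p.sum(axis=1), p.sum(axis=0)              # pairwise marginals
p1, p2, p3 = p.sum(axis=(1, 2)), p.sum(axis=(0, 2)), p.sum(axis=(0, 1))  # single marginals

kirkwood = np.zeros_like(p)
for i in range(2):
    for j in range(2):
        for k in range(2):
            # Numerator: product of pairwise marginals; denominator: (product of singles)^(N-2), N = 3.
            kirkwood[i, j, k] = (p12[i, j] * p13[i, k] * p23[j, k]) / (p1[i] * p2[j] * p3[k])

print(kirkwood.sum())   # typically != 1, i.e. not a valid probability distribution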

3.2.4 The Superiority of the D2 Approximation

Measure D_2 always yields a positive score and generally tends to underestimate the mutual information, while D_1 shows a large overestimation for independent features and a large underestimation (even becoming negative) in the presence of dependent features. In general, D_2 is more robust than D_1. The same behavior can be observed for mRMR, which is derived from D_2, and MIFS, which is derived from D_1. Previous works also arrived at the same conclusion and reported that mRMR performs better and more robustly than MIFS, especially when the feature set is large. Therefore, in the following sections D_2 is used as the truncated approximation. For simplicity, its subscript is dropped and it is rewritten as follows:

D(\{X_1, \ldots, X_N\}) = \sum_{i=1}^{N} I(X_i; C) - \frac{1}{N-1} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} I(X_i; X_j; C) \qquad (3.21)

Note that although D in (3.21) is no longer a formal set measure, it can still be seen as a score function for sets. However, it is noteworthy that, unlike formal measures, the suggested approximations are no longer monotonic, where monotonicity merely means that a subset of features should not score better than any larger set containing that very same subset. Therefore, as explained in [NF77], branch and bound based search strategies cannot be applied to them.

A very similar approach has been applied in [Bro09] (using the D_1 approximation) to derive several known criteria such as MIFS [Bat94] and mRMR [PLD05]. However, in [Bro09] and most other previous works, the set score function in (3.21) is immediately reduced to an individual-feature score function by fixing N-1 features of the feature set. This allows running a greedy selection search over the feature set, which essentially is a one-feature-at-a-time selection strategy. It is clearly a naive approximation of the optimal NP-hard search algorithm and may perform poorly under some conditions. In the following, a convex approximation of the binary objective function appearing in feature selection is investigated. This approximation is inspired by the Goemans-Williamson maximum-cut approximation approach [GW95].

3.3 Search Strategies

Given a measure function¹ D, the subset selection problem (SSP) can be defined as follows:

Definition 3.7. Given N features X_i and a dependent variable C, select a subset of P ≪ N features that maximizes the measure function. Here it is assumed that the cardinality P of the optimal feature subset is known.

¹By some abuse of terminology, any set function in this section is referred to as a measure, regardless of whether it satisfies the formal measure properties.

In practice, the exact value of P can be obtained by evaluating subsets of different cardinalities P with the final induction algorithm. Note that this is intrinsically different from wrapper methods: while in wrapper methods 2^N subsets have to be tested, here at most N runs of the learning algorithm are required to evaluate all possible values of P. A search strategy is an algorithm that tries to find a feature subset in the feature subset space with 2^N members² that optimizes the measure function. The wide range of search strategies proposed in the literature can be divided into three categories:

• Exponential complexity methods, including exhaustive search [Koh96] and branch and bound based algorithms [NF77].
• Sequential selection strategies, with two very popular members: forward selection and backward elimination.
• Stochastic methods like simulated annealing and genetic algorithms [VD93], [Doa92].

Here, a fourth class of search strategies is introduced, based on the convex relaxation of (0-1)-integer programming; its approximation ratio is explored by establishing a link between the SSP and an instance of the maximum-cut problem in graph theory. In the following, the two popular sequential search methods are briefly discussed and the proposed solution is presented: a close-to-optimal polynomial-time search algorithm and its evaluation on different datasets.

3.3.1 Convex Based Search

The forward selection (FS) algorithm selects a set S of size P iteratively as follows:

1. Initialize S_0 = ∅.
2. In each iteration i, select the feature X_m maximizing D(S_{i-1} ∪ X_m), and set S_i = S_{i-1} ∪ X_m.
3. Output S_P.

²Given P, the size of the feature subset space reduces to \binom{N}{P}.

Similarly, backward elimination (BE) can be described as:

1. Start with the full set of features S_N.
2. Iteratively remove the feature X_m maximizing D(S_i \setminus X_m), and set S_{i-1} = S_i \setminus X_m, where removing X from S is denoted by S \setminus X.
3. Output S_P.

An experimental comparative evaluation of several variants of these two algorithms has been conducted in [AB96]. From an information-theoretic standpoint, the main disadvantage of the forward selection method is that it can only evaluate the utility of a single feature in the limited context of the previously selected features. The artificial binary classifier in Figure 3.1 illustrates this issue: since the information content of each feature (x and y) is almost zero, it is highly probable that forward selection fails to select them in the presence of some other, more informative features. Contrary to forward selection, backward elimination can evaluate the contribution of a given feature in the context of all other features. Perhaps this is why it has frequently been reported to perform better than forward selection. However, its overemphasis on feature interactions is a double-edged sword and may lead to a sub-optimal solution, as the following example shows.

Example 3.8. Imagine a four-dimensional feature selection problem where X_1 and X_2 are class-conditionally and mutually independent of X_3 and X_4, i.e., Pr(X_1, X_2, X_3, X_4) = Pr(X_1, X_2) Pr(X_3, X_4) and Pr(X_1, X_2, X_3, X_4 | C) = Pr(X_1, X_2 | C) Pr(X_3, X_4 | C). Assume I(X_1; C) and I(X_2; C) are equal to zero, while their interaction is informative, that is, I(X_1, X_2; C) = 0.4. Moreover, assume I(X_3; C) = 0.2, I(X_4; C) = 0.25 and I(X_3, X_4; C) = 0.45. The goal is to select only two features out of four. Here, backward elimination will select {X_1, X_2} rather than the optimal subset {X_3, X_4} because removing either X_1 or X_2 results in a 0.4 reduction of the mutual information value I(X_1, \ldots, X_4; C), while eliminating X_3 or X_4 deducts at most 0.25.
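The example can be reproduced with a small subset-score oracle built from the stated mutual information values; by the independence assumptions, the score of a subset is the sum of the contributions of its two blocks. The sketch below (illustrative code, not from the thesis) runs backward elimination on this oracle and compares the result with the best pair:

from itertools import combinations

# I(S; C) for subsets within each independent block, taken from Example 3.8.
block_a = {(): 0.0, (1,): 0.0, (2,): 0.0, (1, 2): 0.4}
block_b = {(): 0.0, (3,): 0.2, (4,): 0.25, (3, 4): 0.45}

def score(subset):
    a = tuple(sorted(i for i in subset if i in (1, 2)))
    b = tuple(sorted(i for i in subset if i in (3, 4)))
    return block_a[a] + block_b[b]

def backward_elimination(features, target_size):
    s = set(features)
    while len(s) > target_size:
        # Drop the feature whose removal keeps the remaining score highest.
        s.remove(max(s, key=lambda f: score(s - {f})))
    return s

selected = backward_elimination({1, 2, 3, 4}, 2)
print(selected, score(selected))                           # {1, 2} with score 0.4
best_pair = max(combinations([1, 2, 3, 4], 2), key=score)
print(best_pair, score(best_pair))                         # (3, 4) with score 0.45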

One may draw the conclusion that backward elimination tends to sacrifice the individually-informative features in favor of the merely cooperatively-informative features. As a remedy, several hybrid forward-backward sequential search methods have been proposed. However, they all fail in one way or another and, more importantly, cannot guarantee the goodness of the solution.

Alternatively, a sequential search method can be seen as an approximation of the combinatorial subset selection problem. To propose a new approximation method, the underlying combinatorial problem has to be studied. To this end, the SSP defined at the beginning of this section is reformulated as:

\max_{x}\; x^T Q x
\text{s.t.} \quad \sum_{i=1}^{N} x_i = P, \qquad x_i \in \{0, 1\} \;\text{for}\; i = 1, \ldots, N \qquad (3.22)

where Q is a symmetric mutual information matrix constructed from the mutual information terms in (3.21):

Q = \begin{bmatrix} I(X_1; C) & \cdots & -\frac{\lambda}{2} I(X_1; X_N; C) \\ -\frac{\lambda}{2} I(X_1; X_2; C) & \cdots & -\frac{\lambda}{2} I(X_2; X_N; C) \\ \vdots & \ddots & \vdots \\ -\frac{\lambda}{2} I(X_1; X_N; C) & \cdots & I(X_N; C) \end{bmatrix} \qquad (3.23)

where \lambda = \frac{1}{P-1} and x = [x_1, \ldots, x_N] is a binary vector whose entries x_i are set-membership indicators for the presence of the corresponding features X_i in the feature subset. It is straightforward to verify that for any binary vector x, the objective function in (3.22) is equal to the score function D(X_{nz}) where X_{nz} = \{X_i \mid x_i = 1;\; i = 1, \ldots, N\}. Note that for mRMR the I(X_i; X_j; C) terms have to be replaced with I(X_i; X_j).

The (0,1)-quadratic programming problem (3.22) has attracted a great deal of theoretical study because of its importance in combinatorial problems (see [PRW95] and references therein). This problem can simply be transformed into a (-1,1)-quadratic programming problem,

\max_{y}\; \frac{1}{4} y^T Q y + \frac{1}{2} y^T Q e + c
\text{s.t.} \quad \sum_{i=1}^{N} y_i = 2P - N, \qquad y_i \in \{-1, 1\} \;\text{for}\; i = 1, \ldots, N \qquad (3.24)

via the transformation y = 2x - e, where e is the all-ones vector. The constant c = \frac{1}{4} e^T Q e in the above formulation can be ignored because it is independent of y. In order to homogenize the objective function in (3.24), an (N+1) \times (N+1) matrix Q^u is defined by adding a 0-th row and column to Q so that:

Q^u = \begin{bmatrix} 0 & e^T Q \\ Q^T e & Q \end{bmatrix} \qquad (3.25)

Ignoring the constant factor \frac{1}{4} in (3.24), the equivalent homogeneous form of (3.22) can be written as:

S_{SSP} = \max_{y}\; y^T Q^u y
\text{s.t.} \quad \sum_{i=1}^{N} y_i y_0 = 2P - N, \qquad y_i \in \{-1, 1\} \;\text{for}\; i = 0, \ldots, N \qquad \langle SSP \rangle \quad (3.26)

Note that y is now an (N+1)-dimensional vector with the first element y_0 = ±1 as a reference variable. Given the solution y of the problem above, the optimal feature subset is obtained by X_{op} = \{X_i \mid y_i = y_0\}. The optimization problem in (3.26) can be seen as an instance of the maximum-cut problem [GW95] with an additional cardinality constraint, also known as the k-heaviest subgraph or maximum partitioning graph problem. The two main approaches to solve this combinatorial problem are either to use the linear programming relaxation obtained by linearizing the product of two binary variables [FY83], or the semi-definite programming (SDP) relaxation suggested in [GW95]. The SDP relaxation has been shown to perform exceptionally well and achieves an approximation ratio of 0.878 for the original maximum-cut problem. The SDP relaxation of (3.26) is:

S_{SDP} = \max_{Y}\; \mathrm{tr}\{Q^u Y\}
\text{s.t.} \quad \sum_{i,j=1}^{N} Y_{ij} = (2P - N)^2, \quad \sum_{i=1}^{N} Y_{i0} = 2P - N, \quad \mathrm{diag}(Y) = e, \quad Y \succeq 0 \qquad \langle SDP \rangle \quad (3.27)

where Y is an unknown (N+1) \times (N+1) positive semi-definite matrix and tr{Y} denotes its trace. Obviously, any feasible solution y of ⟨SSP⟩ is also feasible for its SDP relaxation via Y = y y^T. Furthermore, it is not difficult to see that any rank-one solution, rank(Y) = 1, of ⟨SDP⟩ is a solution of ⟨SSP⟩. The ⟨SDP⟩ problem can be solved to within an additive error γ of the optimum, for example by interior point methods [BV04], whose computational complexity is polynomial in the size of the input and log(1/γ). However, since its solution is not necessarily a rank-one matrix, some further steps are required to obtain a feasible solution for ⟨SSP⟩. Algorithm 3.1 summarizes the approximation algorithm for ⟨SSP⟩, which in the following will be referred to as the convex based relaxation approximation (COBRA) algorithm.

Algorithm 3.1: COBRA Feature Selection
Input: number of repetitions T and SDP parameters in (3.27).
For t = 1, ..., T do
  (a) Randomized rounding: Using the multivariate normal distribution with zero mean and covariance matrix R = Y_{SDP}, sample u from N(0, R) and construct x̂ = sign(u). Select X_t = {X_i | x̂_i = x̂_0}.
  (b) Size adjustment: Using the greedy forward or backward algorithm, resize the cardinality of X_t to P.
End
Output: the X_t with the maximum SDP score.

The randomized rounding step in COBRA is a standard procedure to produce a binary solution from the real-valued solution of ⟨SDP⟩ and is widely used for designing and analyzing approximation algorithms [Rag88]. The third step constructs a feasible solution that satisfies the cardinality constraint. Generally, it can be skipped, since in feature selection problems exact satisfaction of the cardinality constraint is not required.
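As a rough illustration of the SDP relaxation and the randomized rounding, the two steps can be prototyped with an off-the-shelf modelling package. The sketch below uses cvxpy with the SCS solver, which is our choice for illustration only (the thesis experiments used the SDP-NAL solver through the Yalmip interface in Matlab); it takes a precomputed mutual information matrix Q as in (3.23) and omits the size-adjustment step:

import numpy as np
import cvxpy as cp

def cobra_sketch(Q, P, n_rounds=20, rng=None):
    """Approximate feature selection by the SDP relaxation (3.27) plus randomized rounding."""
    if rng is None:
        rng = np.random.default_rng(0)
    N = Q.shape[0]

    # Homogenized matrix Q^u of (3.25).
    Qu = np.zeros((N + 1, N + 1))
    Qu[0, 1:] = Q.sum(axis=0)
    Qu[1:, 0] = Q.sum(axis=1)
    Qu[1:, 1:] = Q

    # SDP relaxation (3.27).
    Y = cp.Variable((N + 1, N + 1), symmetric=True)
    constraints = [Y >> 0,
                   cp.diag(Y) == 1,
                   cp.sum(Y[1:, 1:]) == (2 * P - N) ** 2,
                   cp.sum(Y[1:, 0]) == 2 * P - N]
    cp.Problem(cp.Maximize(cp.trace(Qu @ Y)), constraints).solve(solver=cp.SCS)

    # Randomized rounding: sample u ~ N(0, Y) and keep features whose sign matches y0.
    best_subset, best_score = None, -np.inf
    for _ in range(n_rounds):
        u = rng.multivariate_normal(np.zeros(N + 1), Y.value, check_valid="ignore")
        x = (np.sign(u[1:]) == np.sign(u[0])).astype(float)
        s = x @ Q @ x                     # set score of (3.22); cardinality not enforced here
        if s > best_score:
            best_subset, best_score = np.flatnonzero(x), s
    return best_subset, best_score

In this form the returned subset may not have exactly P members; step (b) of Algorithm 3.1 would adjust the cardinality greedily.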

The SDP-NAL solver [ZST10] was used with the Yalmip interface [Lof04] to implement this algorithm in Matlab. SDP-NAL uses the Newton augmented Lagrangian method to efficiently solve SDP problems. It can solve large-scale problems (N up to a few thousand) in about an hour on a PC with an Intel Core i7 CPU. Even more efficient algorithms for low-rank SDP have been suggested, claiming to solve problems of size up to N = 30000 in a reasonable amount of time [GPP+12]. Here, only the SDP-NAL solver was used in the experiments.

3.3.2 Approximation Analysis

In order to gain insight into the quality of a measure function, it is essential to be able to examine it directly. However, since estimating the exact mutual information value on real data is not feasible, it is not possible to evaluate the measure function directly. Its quality can only be examined indirectly through the final classification performance (or other measurable criteria). However, the quality of the measure function is not the only contributor to the classification rate. Since the SSP is an NP-hard problem, the search strategy can only find a locally optimal solution. That is, besides the quality of the measure function, the inaccuracy of the search strategy also contributes to the final classification error. Thus, in order to draw a conclusion concerning the quality of a measure function, it is essential to have an insight into the accuracy of the search strategy in use. In this section, the accuracy of the proposed method is compared with that of the traditional backward elimination approach.

A standard approach to investigate the accuracy of an optimization algorithm is to analyze how close it gets to the optimal solution. Unfortunately, feature selection is an NP-hard problem, and thus obtaining the optimal solution to use as a reference is only feasible for small-sized problems. In such cases, one instead wants provable guarantees on the solution quality, such as the algorithm's approximation ratio. Given a maximization problem, an algorithm is called a ρ-approximation algorithm if its approximate solution is at least ρ times the optimal value; in our case, ρ S_{SSP} ≤ S_{COBRA}, where S_{COBRA} = D(X_{COBRA}). The factor ρ is usually referred to as the approximation ratio in the literature.

The approximation ratios of BE and COBRA can be found by linking the SSP to the k-heaviest subgraph problem (k-HSP) in graph theory. k-HSP is an instance of the max-cut problem with a cardinality constraint on the selected subset, that is, determining a subset S of k vertices such that the weight of the subgraph induced by S is maximized [SW98]. From the definition of k-HSP, it is clear that the SSP with the criterion (3.21) is equivalent to the P-heaviest subgraph problem, since it selects the heaviest subset of features with cardinality P, where the heaviness of a set is the score assigned to it by D. An SDP based algorithm for k-HSP has been suggested in [SW98] and its approximation ratio has been analyzed. Their results are directly applicable to COBRA, since both algorithms use the same randomization method (step 2 of COBRA) and the randomization is the main ingredient of their approximation analysis.

Values of P   N/2    N/3    N/4    N/6    N/8     N/10    N/20
BE            0.4    0.25   0.16   0.10   0.071   0.055   0.026
COBRA         0.48   0.33   0.24   0.13   0.082   0.056   0.015

Table 3.1: Approximation ratios of backward elimination (BE) and COBRA for different values of P, expressed as fractions of N.

The approximation ratio of BE for k-HSP has been investigated in [AITT00]. Their analysis is deterministic and the results are also valid for our case, i.e., using BE to maximize D. The approximation ratios of both algorithms for different values of P, as a function of N (the total number of features), are listed in Table 3.1 (the values are calculated from the formulas in [AITT00]). As can be seen, as P becomes smaller the approximation ratio approaches zero, yielding the trivial lower bound 0 on the approximate solution. However, for larger values of P the approximation ratio is nontrivial since it is bounded away from zero. For all cases shown in the table except the last one, COBRA gives a better guarantee bound than BE. Thus, we may conclude that COBRA is more likely than BE to achieve a better approximate solution. The focus of the following section is on comparing the proposed search algorithm with sequential search methods in conjunction with different measure functions and over different classifiers and datasets.

3.4 Experiments for COBRA Evaluation

The evaluation of a feature selection algorithm is an intrinsically difficult task since, in general, there is no direct way to evaluate the goodness of a selection process. Thus, a selection algorithm is usually scored based on the performance of its output, i.e., the selected feature subset, in some specific classification (or regression) system. This kind of evaluation can be referred to as goal-dependent evaluation. However, this method obviously cannot evaluate the generalization power of the selection process over different induction algorithms. To evaluate the generalization strength of a feature selection algorithm, a goal-independent evaluation is required.

Dataset Name      Mnemonic   # Features   # Samples   # Classes
Arrhythmia        ARR        278          370         2
NCI               NCI        9703         60          9
DBWorld e-mails   DBW        4702         64          2
CNAE-9            CNA        856          1080        9
Internet Adv.     IAD        1558         3279        2
Madelon           MAD        500          2000        2
Lung Cancer       LNG        56           32          3
Dexter            DEX        20000        300         2

Table 3.2: Dataset descriptions.

Thus, for the evaluation of the feature selection algorithms, it is proposed to compare the algorithms over different datasets with multiple classifiers. This method leads to a more classifier-independent evaluation process.

Algorithm 3.3: Estimating P by searching over an admissible set that minimizes the classification error rate.
Input: P = {P_1, ..., P_L}.
For P in P do
  (a) Run the COBRA algorithm and output the solution X.
  (b) Derive the classifier error rate by applying K-fold cross-validation.
  (c) Save the classification accuracy CL(P).
End
Output: P_opt = argmax_P CL(P)
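In code, Algorithm 3.3 is a plain cross-validation loop over the admissible cardinalities. The following is a hedged sketch using scikit-learn for the cross-validation and an SVM as the classifier; select_features stands for any selector (for instance COBRA) and is assumed to be supplied:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def choose_cardinality(X, y, candidates, select_features, cv=10):
    """Return the cardinality P (and subset) with the best cross-validated accuracy."""
    best = (None, None, -np.inf)          # (P, subset, accuracy)
    for P in candidates:
        subset = select_features(X, y, P)                      # e.g. COBRA with cardinality P
        acc = cross_val_score(SVC(), X[:, subset], y, cv=cv).mean()
        if acc > best[2]:
            best = (P, subset, acc)
    return best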

Some properties of the eight datasets used in the experiments are listed in Table 3.2. All datasets are available from the UCI machine learning archive [FA10], except the NCI data, which can be found on the website of Peng et al. [PLD05]. These datasets have been widely used in previous feature selection studies [PLD05], [CSO10]. The goodness of each feature set was evaluated with five classifiers: support vector machine (SVM), random forest (RF), classification and regression tree (CART), neural network (NN) and linear discriminant analysis (LDA). To derive the classification accuracies, 10-fold cross-validation is performed, except for the NCI, DBW and LNG datasets, where leave-one-out cross-validation is used.

Datasets   NCI             DBW             IAD             CNA
S-ratio    0.814 ± 0.031   0.898 ± 0.024   0.889 ± 0.027   0.972 ± 0.010

Datasets   MAD             LNG             ARR             DEX
S-ratio    0.892 ± 0.028   0.713 ± 0.033   0.677 ± 0.056   0.951 ± 0.019

Table 3.3: Mean and standard deviation of the similarity ratio of feature subsets from the COBRA algorithm. For each dataset (except LNG) 100 similarity ratios were evaluated for P = 10, ..., 109. For the LNG dataset, which only contains 56 features, the average was taken over P = 5, ..., 49.

As explained before, filter-based methods consist of two components: a measure function and a search strategy. The measure functions utilized in the experiments are mRMR and JMI, defined in (3.17) and (3.16), respectively. To unambiguously refer to an algorithm, it is denoted by the measure function plus the search method used in that algorithm, e.g., mRMR+FS. The simple procedure listed in Algorithm 3.3 is employed to search for the optimal value of the subset cardinality P, where P ranges over a set P of admissible values. In the worst case, P = {1, ..., N}.

Table 3.4 and Table 3.5 show the results obtained for the 8 datasets and 5 classifiers. The Friedman test with the corresponding Wilcoxon-Nemenyi post-hoc analysis was used to compare the different algorithms. However, looking at the classification rates even before running the Friedman tests reveals a few interesting points, which are marked in bold font. First, on the small datasets (NCI, DBW and LNG), mRMR+COBRA consistently shows higher performance than the other algorithms. The reason lies in the fact that the similarity ratio of the feature sets selected by COBRA is lower than that of the BE or FS feature sets. The similarity ratio of two consecutive sets X_i and X_j, with j = i + 1, is defined as

S_i = \frac{|X_i \cap X_j|}{|X_i|} \qquad (3.28)

Table 3.3 reports the average of the similarity ratios of 100 subsequent feature sets (\frac{1}{100} \sum_{i=10}^{109} S_i) for the datasets. From the definition of the similarity ratio it is clear that for BE and FS this ratio is always equal to 1. However, because of the randomization step, this ratio may vary widely for COBRA. That is, COBRA generates quite diverse feature sets.

Classifiers     SVM          LDA          CART         RF           NN           Average

NCI Dataset
mRMR+COBRA      (54) 81.7    (95) 78.3    (20) 45.0    (71) 88.3    (60) 75.0    73.67
mRMR+FS         (32) 78.3    (11) 68.3    (2) 45.0     (12) 83.3    (99) 70.0    69.00
mRMR+BE         (26) 76.6    (11) 68.3    (2) 45.0     (13) 85.0    (31) 71.7    69.33
JMI+COBRA       (72) 85.0    (70) 75.0    (28) 45.0    (45) 90.0    (93) 75.0    74.00
JMI+FS          (27) 75.0    (17) 68.3    (82) 45.0    (17) 86.6    (78) 70.0    69.00
JMI+BE          (23) 76.6    (20) 76.6    (7) 33.3     (19) 86.6    (89) 76.6    70.00

DBW Dataset
mRMR+COBRA      (38) 96.9    (152) 92.2   (38) 86.0    (33) 92.2    (33) 98.4    93.12
mRMR+FS         (31) 93.7    (4) 89.0     (4) 86.0     (7) 90.6     (9) 92.2     90.31
mRMR+BE         (110) 93.7   (6) 89.0     (4) 82.8     (29) 92.2    (9) 92.2     90.00
JMI+COBRA       (35) 93.7    (14) 89.0    (8) 82.8     (24) 92.2    (108) 93.7   90.31
JMI+FS          (23) 93.7    (6) 89.0     (5) 82.8     (34) 92.2    (96) 92.2    90.00
JMI+BE          (24) 93.7    (6) 89.0     (5) 82.8     (23) 92.2    (149) 92.2   90.00

CNA Dataset
mRMR+COBRA      (200) 94.0   (183) 92.7   (63) 75.0    (183) 90.8   (187) 92.0   88.91
mRMR+FS         (149) 90.6   (142) 90.4   (7) 70.2     (138) 87.7   (78) 85.5    84.88
mRMR+BE         (199) 94.0   (165) 92.5   (47) 75.0    (176) 90.8   (84) 92.2    88.90
JMI+COBRA       (140) 92.6   (146) 92.2   (47) 75.0    (148) 90.4   (148) 91.4   88.30
JMI+FS          (150) 92.7   (142) 92.1   (48) 75.3    (148) 90.7   (145) 91.3   88.40
JMI+BE          (150) 92.7   (142) 92.1   (48) 75.0    (144) 90.4   (134) 91.2   88.30

IAD Dataset
mRMR+COBRA      (165) 96.5   (140) 96.1   (28) 96.4    (160) 97.2   (68) 97.1    96.64
mRMR+FS         (109) 96.2   (127) 95.8   (127) 96.7   (25) 97.0    (52) 97.2    96.58
mRMR+BE         (22) 96.3    (163) 95.9   (121) 96.1   (109) 97.2   (148) 97.4   96.58
JMI+COBRA       (112) 96.3   (4) 96.3     (9) 96.3     (57) 97.3    (140) 100    97.24
JMI+FS          (9) 96.2     (4) 96.2     (52) 96.4    (7) 96.8     (7) 97.8     96.68
JMI+BE          (4) 96.6     (17) 95.8    (79) 96.3    (13) 96.5    (10) 97.2    96.48

Table 3.4: Comparison of COBRA with the greedy search methods over different datasets. For each classifier and combination of search method and measure function, the value in parentheses is the number of selected features and the second value is the classification accuracy. The last column reports the average of the classification accuracies for each algorithm. Bold numbers represent statistically significant test results where COBRA outperforms the sequential feature selection methods.

Classifiers     SVM          LDA          CART         RF           NN           Average

MAD Dataset
mRMR+COBRA      (12) 83.2    (13) 60.4    (26) 80.5    (12) 88.0    (11) 62.2    74.81
mRMR+FS         (32) 55.3    (5) 55.5     (12) 58.2    (49) 57.3    (5) 52.7     55.82
mRMR+BE         (14) 55.3    (11) 54.8    (31) 57.3    (26) 56.4    (115) 48.6   54.50
JMI+COBRA       (13) 82.5    (12) 60.7    (40) 80.7    (13) 87.6    (4) 61.1     74.54
JMI+FS          (13) 82.5    (12) 60.7    (58) 80.5    (13) 87.9    (19) 59.2    74.20
JMI+BE          (13) 82.5    (12) 60.7    (58) 80.5    (13) 87.3    (20) 60.1    74.25

LNG Dataset
mRMR+COBRA      (23) 75.0    (28) 96.9    (13) 71.8    (28) 68.7    (27) 71.8    76.87
mRMR+FS         (7) 81.2     (5) 68.7     (5) 71.8     (5) 75.0     (6) 71.8     73.75
mRMR+BE         (7) 81.2     (4) 68.7     (4) 71.8     (4) 75.0     (4) 75.0     74.37
JMI+COBRA       (7) 78.1     (6) 71.8     (5) 71.8     (5) 75.0     (5) 68.7     73.12
JMI+FS          (7) 78.1     (4) 71.8     (4) 71.8     (8) 78.1     (5) 68.7     73.75
JMI+BE          (7) 78.1     (6) 71.8     (5) 71.8     (6) 78.1     (6) 71.8     74.37

ARR Dataset
mRMR+COBRA      (45) 81.9    (48) 76.3    (30) 75.4    (43) 82.2    (57) 72.9    77.75
mRMR+FS         (34) 81.3    (43) 76.1    (7) 78.3     (34) 81.3    (5) 75.7     78.56
mRMR+BE         (36) 81.6    (43) 76.3    (22) 78.0    (25) 82.9    (8) 76.1     79.02
JMI+COBRA       (26) 80.6    (51) 74.7    (15) 78.3    (51) 81.5    (13) 71.9    77.41
JMI+FS          (47) 74.3    (38) 73.5    (26) 76.9    (37) 79.2    (54) 70.0    74.80
JMI+BE          (47) 74.3    (38) 73.5    (26) 76.9    (25) 80.0    (29) 68.6    74.66

DEX Dataset
mRMR+COBRA      (3) 92.0     (131) 86.3   (24) 80.7    (3) 93.0     (3) 81.3     86.66
mRMR+FS         (3) 90.3     (56) 87.0    (94) 80.3    (3) 92.0     (3) 80.0     86.00
mRMR+BE         (3) 90.0     (131) 87.3   (18) 80.3    (3) 91.6     (99) 78.6    85.53
JMI+COBRA       (88) 91.6    (13) 83.0    (12) 80.3    (3) 94.0     (3) 81.0     86.00
JMI+FS          (149) 91.0   (129) 87.6   (95) 80.3    (119) 92.3   (94) 80.6    86.40
JMI+BE          (149) 90.0   (128) 87.3   (22) 81.0    (146) 92.0   (138) 78.0   85.60

Table 3.5: Comparison of COBRA with the greedy search methods over different datasets. For each classifier and combination of search method and measure function, the value in parentheses is the number of selected features and the second value is the classification accuracy. The last column reports the average of the classification accuracies for each algorithm. Bold numbers represent statistically significant test results where COBRA outperforms the sequential feature selection methods.

[Figure 3.2 box plots: Friedman test on mean accuracies for mRMR. Post-hoc p-values: CO-BE 0.078, FS-BE 0.965, FS-CO 0.042.]

Figure 3.2: Comparing the search strategies for the mRMR measure with the Friedman test and its corresponding post-hoc analysis. The y-axis is the classification accuracy difference and the x-axis indicates the names of the compared algorithms.

Some of these feature sets have relatively low scores compared with the BE or FS sets. However, since for small datasets the estimated mutual information terms are highly inaccurate, features that rank low under our noisy measure function may in fact be better for classification. As seen in Table 3.3, for NCI the averaged similarity ratio is significantly smaller than 1, while for CNA, a relatively larger dataset, it is almost constant and equal to 1.

The second interesting point concerns the MAD dataset. As can be seen, mRMR with greedy search algorithms performs poorly on this dataset. Several authors have already used this dataset to compare their proposed criteria with mRMR and arrived at the conclusion that mRMR cannot handle highly correlated features, as in the MAD dataset. Surprisingly, however, the performance of mRMR+COBRA is as good as JMI on this dataset, meaning that it is not the criterion but the search method that has difficulty dealing with highly correlated features. Thus, any conclusion with respect to the quality of a measure has to be drawn carefully since, as in this case, the effect of a non-optimal search method can be decisive.

To discover the statistically significant differences between the algorithms, we applied the Friedman test followed by the Wilcoxon-Nemenyi post-hoc analysis, as suggested in [HW99], to the average accuracies (the last column of Tables 3.4 and 3.5). Note that since 8 datasets were used in the experiments, there are 8 independent measurements available for each algorithm. The results of this test for the mRMR based algorithms are depicted in Figure 3.2. In all box plots, CO stands for the COBRA algorithm. Each box plot compares a pair of algorithms. The green box plots represent the existence of a significant difference between the corresponding algorithms. The adjusted p-values for each pair of algorithms are also reported in Figure 3.2. The smaller the p-value, the stronger the evidence against the null hypothesis. As can be seen, COBRA shows a meaningful superiority over both greedy algorithms. However, if the significance level is set at p = 0.05, only FS rejects the null hypothesis and shows a meaningful difference with COBRA. The same test was run for each classifier and its results can be found in Figure 3.3. While three of the classifiers show some differences between FS and COBRA, none of them reveals any meaningful difference between BE and COBRA. At this point, the least one can conclude is that, independent of the classification algorithm at hand, there is a good chance that FS performs worse than COBRA. For JMI, however, the performances of all algorithms are comparable and with only 8 datasets it is difficult to draw any conclusion (see Figure 3.4).

In the next experiment, COBRA is compared with two other convex programming based feature selection algorithms, SOSS [NHP13] and QPFS [RHEC10]. Both SOSS and QPFS employ quadratic programming techniques to maximize a score function. SOSS, however, uses an instance of randomized rounding to generate the set-membership binary values, while QPFS ranks the features based on their scores (obtained from solving the convex problem) and therefore sidesteps the difficulties of generating binary values.


Figure 3.3: Comparing the search strategies for mRMR. Results of the post-hoc tests for each classifier.


Figure 3.4: Comparing the search strategies for JMI. Results of the post-hoc tests for each classifier.

Datasets   mRMR+COBRA     mRMR+QPFS      mRMR+SOSS
MAD        74.81 ± 0.65   71.44 ± 0.57   71.36 ± 0.53
NCI        73.67 ± 2.41   71.00 ± 1.84   72.65 ± 2.13
IAD        96.64 ± 0.16   95.02 ± 0.21   96.64 ± 0.28
ARR        77.75 ± 1.03   78.73 ± 0.84   79.86 ± 1.18
CNA        88.91 ± 0.31   86.93 ± 0.45   85.43 ± 0.49

Table 3.6: Comparison of COBRA with QPFS and SOSS over 5 datasets. Average classification rates and their standard deviations are reported.

Datasets   Time COBRA   Time QPFS   Time SOSS
MAD        175 + 24     11          175 + 5
NCI        368 + 341    180         368 + 27
IAD        540 + 121    202         540 + 12
ARR        6 + 14       1           6 + 4
CNA        120 + 50     25          120 + 7

Table 3.7: Comparison of COBRA with QPFS and SOSS over 5 datasets. The computational times in seconds are reported; the first value for COBRA and SOSS is the time for calculating the mutual information matrix and the second value is the time required to solve the optimization problem.

Note that both COBRA and SOSS first need to calculate the mutual information matrix Q. Once it is calculated, they can solve their corresponding convex optimization problems for different values of P. Table 3.6 reports the average (over the 5 classifiers) classification accuracies of these three algorithms and the standard deviation of these mean accuracies (calculated over the cross-validation folds). In Table 3.7, the computational times of each algorithm for a single run (in seconds) are shown, i.e., the amount of time required to select a feature set with a given number P of features. The reported times for COBRA and SOSS consist of two values: the first is the time required to calculate the mutual information matrix Q and the second is the time required to solve the corresponding convex optimization problem. All values were measured on a PC with an Intel Core i7 CPU. As can be seen from Table 3.7, QPFS is significantly faster than COBRA and SOSS.

                 SVM    CART   RF     NN
ARR  LDA feat.   78.4   73.7   77.1   68.0
     Optimum     81.9   75.4   82.2   72.9
CNA  LDA feat.   92.6   75.0   90.5   91.1
     Optimum     94.0   75.0   90.8   92.0
IAD  LDA feat.   95.8   96.0   97.2   96.3
     Optimum     96.5   96.4   97.2   97.1

Table 3.8: The performance of the classification algorithms when trained with COBRA features optimized for the LDA classifier. This table shows the generalization power of the COBRA features across classifiers.

This computational superiority, however, seems to come at the expense of lower classification accuracy. For large datasets such as IAD, CNA and MAD, the Nyström approximation used in QPFS to cast the problem into a lower-dimensional subspace does not yield a sufficiently precise approximation and results in lower classification accuracies. An important remark for interpreting these results is that, for the NCI dataset (in all experiments), features with low mutual information values with respect to the class label were first filtered out and only 2000 informative features were kept, simply because computing a mutual information matrix of size 9703 × 9703 was computationally too demanding. Thus, the dimension is 2000 and not 9703 as mentioned in Table 3.2.

The generalization power of the COBRA algorithm over different classifiers is another important issue to test. As can be observed in Table 3.4, the number of selected features varies quite markedly from one classifier to another. However, based on our experiments, the optimum feature set of any of the classifiers usually (for large enough datasets) achieves near-optimal accuracy in conjunction with the other classifiers as well. This is shown in Table 3.8 for 4 classifiers and 3 datasets. The COBRA features selected for the LDA classifier in Table 3.4 are used here to train the other classifiers. Table 3.8 lists the accuracies obtained using the LDA features and the optimal features, repeated from Table 3.4. Unlike for the CNA and IAD datasets, a significant accuracy reduction can be observed in the case of the ARR data, which has substantially less training data than CNA and IAD. This suggests that for small datasets a feature selection scheme should take the induction algorithm into account, since the learning algorithm is sensitive to small changes of the feature set.

3.5 COBRA-selected ISCN Features for Lipreading

Having the COBRA algorithm at hand, we can train our first visual speech recognizer with the ISCN features described in Chapter 2. This recognizer is based on GMM-HMM technology. First, the COBRA feature selection algorithm was used to select 74 features from the large set of ISCN features extracted from each frame. The first-order derivatives of these features were then included in this feature set. Compared with conventional visual features such as PCA, LDA and DCT features, the COBRA-selected ISCN features resulted in a visual speech recognizer with 6% absolute recognition rate improvement and 12% lower variance of accuracy over different speakers.

In this experiment the CUAVE dataset [PGTG02] is used, which contains the digits from zero to nine repeated five times by 36 speakers. A brief description of this dataset is given in Section 2.2.1. The size of the ROIs in the CUAVE dataset is 128×128 pixels. In these experiments, 4 sets of features were extracted from these ROIs and compared.

1. ISCN: First- and second-order ISCN features were extracted from the ROIs. By using a 3-layer network and setting the hyper-parameter J corresponding to the spatial resolution to three, about 50000 ISCN features were computed for the ROI of each video frame. Due to the high dimensionality of the ISCN features, they cannot be used directly in the statistical modeling. Hence, the COBRA algorithm was employed to reduce the dimensionality by selecting a small subset of ISCN features.

2. PCA: The commonly used principal component analysis was applied to ROIs. The derived principal component scores corresponding to the largest eigenvalues were then considered the PCA features.

3. LDA: Linear discriminant analysis was performed on the ROIs. The projection of the ROIs onto a low-dimensional linear subspace maximizing the ratio of between-class scatter to within-class scatter yielded the LDA features.

4. DCT: The two-dimensional DCT transform was applied to the ROIs and the features with the highest energy were selected in zig-zag order, so that the higher-energy features appear first in the feature vector (see the sketch below).
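The following is a minimal sketch, under assumptions, of the zig-zag ordering of 2-D DCT coefficients described in item 4: the anti-diagonals of the coefficient grid are traversed so that low-frequency (typically high-energy) coefficients come first. The function names, the stand-in ROI and the number of kept coefficients are illustrative and not taken from the thesis code.

```python
import numpy as np
from scipy.fftpack import dct

def dct2(block):
    # 2-D type-II DCT with orthonormal scaling, applied along both axes.
    return dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')

def zigzag_indices(n):
    # Traverse the n x n coefficient grid along anti-diagonals, alternating
    # direction, so low-frequency coefficients appear first in the ordering.
    order = []
    for s in range(2 * n - 1):
        diag = [(i, s - i) for i in range(n) if 0 <= s - i < n]
        order.extend(diag if s % 2 else diag[::-1])
    return order

roi = np.random.rand(128, 128)           # stand-in for a 128x128 mouth ROI
coeffs = dct2(roi)
order = zigzag_indices(128)
k = 62                                    # illustrative number of kept coefficients
features = np.array([coeffs[i, j] for (i, j) in order[:k]])
```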

The first-order temporal derivatives of the features were also included in all the feature sets. By applying a two-dimensional DCT to the ISCN features and dropping 80% of the low-energy coefficients, the feature space dimensionality was dramatically reduced. However, even keeping only 20% of the features resulted in 10000 features, which is still prohibitively large. Hence, COBRA was used to reduce the feature set cardinality to a practically workable number.

Two sets of experiments, one with word models and one with viseme models, were performed. Visemes are the visual equivalent of phonemes and are used to describe articulatory gestures in lipreading. A correspondence between visemes and phonemes is shown in Table 3.10. This map was suggested by Jeffers and Barley [JB71] to group 43 phonemes into 11 visemes. It was shown in [CH23] that, compared with other viseme-phoneme maps, this correspondence achieves higher recognition accuracy in visual speech recognition. When visemes were used as speech units, 10 three-state HMMs were trained to statistically describe the visemes (10 visemes were sufficient to describe the digits). Each Markov state was modeled with a GMM containing two mixture components with diagonal covariance matrices. When words were directly modeled, an HMM with 9 emitting states (two Gaussian mixture components with diagonal covariance matrix per state) was trained for each digit.

In the first set of experiments the performance of the ISCN features was evaluated in the case where the speech units are words. The visual-only speech recognizer was trained with various numbers of ISCN features selected by COBRA. In order to have a speaker-independent evaluation, a leave-one-speaker-out cross-validation strategy was utilized. Figure 3.5 shows the recognition rates of different feature sets for this test with word HMMs. As can be seen, the graph corresponding to the COBRA-selected ISCN features is not smooth due to the randomization step in the COBRA algorithm, which results in selecting very diverse feature sets. However, the general trend is clear. By increasing the number of features the recognition rate improves and reaches 75% for a feature set with 148 features including the first-order derivatives. After passing that optimal point, it starts to decrease because of the curse of dimensionality and ends up at around 72.6% recognition rate for 220 features. The performances of the V-ASR for LDA, PCA and DCT with respect to the number of features are also depicted in Figure 3.5.

COBRA-selected ISCN features considerably outperform the conventional visual features. Note that both ISCN and LDA features are selected


Figure 3.5: Comparing ISCN features selected by COBRA with conventionally se- lected features in a visual digit recognition task with word models.

in a supervised way to maximize the viseme recognition rate. However, in the first experiment, the selected features were used to model the whole words rather than their constituent visemes. Table 3.9 reports the accuracy rates for various feature sets when the optimal number of features was used to train the V-ASR. The standard deviations of the reported mean accuracies were computed as the square root of the variance of the accuracy rates over the 36 speakers divided by √36 and are an indication of the sensitivity of the recognizer to speaker variation. Two other sensitivity indicators reported for each feature set are the mean accuracies of the top and bottom 10 percent of the recognition rates of the 36 speakers. The standard deviations reported in Table 3.9 show that using ISCN features leads to about 12% relative variance reduction compared with LDA features. However, the gap between the performances of the best and worst groups (i.e., 10% Max and 10% Min) reveals that ISCN was not so successful in reducing this gap.

Feature types   # Features   Accuracy rate     10% Max   10% Min
ISCN            148          75% ± 2.99%       98.00     36.50
LDA             90           69% ± 3.38%       96.00     29.00
PCA             34           67% ± 2.91%       89.77     35.74
DCT             62           70% ± 2.58%       89.30     40.37

Table 3.9: Digit recognition rates and their corresponding standard deviations for multiple feature sets.

This large gap between the maximum and minimum performances can be interpreted as the reliability of the V-ASR. As will be seen later, to have a successful audio-visual information fusion the V-ASR reliability should be improved. As mentioned, LDA selects a set of transformed features that has the largest ratio of between-class scatter to within-class scatter. Basically, LDA tends to maximize the separability by increasing the difference between the class-conditional mean values while keeping the within-class variances small. LDA is the optimal feature transformation method when the class-conditional distributions are Gaussian with similar covariance matrices and different mean values. As can be seen, the highest recognition rate that the LDA features can achieve is about 69% with 90 features.

The fact that LDA features cannot perform as well as ISCN features is an indication of the non-Gaussianity of the underlying distribution. Since only two Gaussian components were used in the GMMs to model the in-state feature distributions, it is not surprising that non-Gaussian distributions may not be accurately represented. Note that, considering the limited available data, increasing the number of Gaussian components leads to performance degradation due to excessive overfitting.

In the second set of experiments the performance of the COBRA-selected ISCN features was evaluated when a sequence of viseme models was used to describe a digit. A correspondence between visemes and phonemes and the visemic pronunciations of the digits are listed in Tables 3.10 and 3.11, respectively. As in the previous experiment, various numbers of features were used to train the visual speech recognizer. Figure 3.6 shows the recognition rates of different feature sets for the isolated digit recognition test. The ISCN features again outperform the other feature sets; however, the recognition rate is about 20% lower than that of the word-based V-ASR reported in Table 3.9. This can in fact be explained if the video data are reviewed.

Viseme   Visibility rank   Occurrence %   TIMIT phonemes
/A       1                 3.15           /f/ /v/
/B       2                 15.49          /er/ /ow/ /r/ /q/ /w/ /uh/ /uw/ /axr/ /ux/
/C       3                 5.88           /b/ /p/ /m/ /em/
/D       4                 0.70           /aw/
/E       5                 2.90           /dh/ /th/
/F       6                 1.20           /ch/ /jh/ /sh/ /zh/
/G       7                 1.81           /oy/ /ao/
/H       8                 4.36           /s/ /z/
/I       9                 31.46          /aa/ /ae/ /ah/ /ay/ /eh/ /ey/ /ih/ /iy/ /y/ /ao/ /ax-h/ /ax/ /ix/
/J       10                21.10          /d/ /l/ /n/ /t/ /el/ /nx/ /en/ /dx/
/K       11                4.84           /g/ /k/ /ng/ /eng/
/S       -                 -              /sil/

Table 3.10: The visemes and their corresponding phonemes suggested in [JB71] and evaluated in [CH23]. The TIMIT phonetic alphabet notation is used to describe the phonemes (see [Hie93] for mapping TIMIT to IPA). The visibility rank is an indication of the difficulty of recognition. The lower this number is, the more difficult it is to recognize a viseme.

Digit   Pronunciation     Digit   Pronunciation
zero    [H+I+B+B]         one     [B+I+J]
two     [J+B]             three   [E+B+I]
four    [A+G+B]           five    [A+I+A]
six     [H+I+K+H]         seven   [H+I+A+I+J]
eight   [I+J]             nine    [J+I+J]

Table 3.11: The viseme transcriptions of the digits.


Figure 3.6: Comparing ISCN features selected by COBRA with conventionally selected features in a visual digit recognition task with viseme models.

On many occasions, the visemes are not even pronounced (for instance, /t/ at the end of the digit eight is very frequently omitted) or, when pronounced, they appear in only one to three frames, which is too short for modeling their time dynamics. Figure 3.7 reports the confusion matrix of the digit classification when visemes were used to describe the digits.

The ground truth can be found in the rows of the confusion matrix while the columns are devoted to the predicted labels. Most confusions in digit recognition are caused by the phonemes of two words being mapped to the same viseme. However, it is also possible that different visemes have the same visual appearance depending on the context. This type of confusion is caused by a phenomenon commonly called co-articulation [CM93], where the mouth shape of a particular phoneme causes the following phonemes to have similar mouth shapes.

         zero     one      two      three    four     five     six      seven    eight    nine
zero     56.03%   0        13.79%   0        0.86%    0.86%    3.45%    25.00%   0        0
one      1.42%    58.87%   0.71%    8.51%    0.71%    14.18%   4.96%    2.84%    0        7.80%
two      23.08%   0        35.90%   0        2.56%    0        0        0        28.21%   10.26%
three    9.27%    12.64%   20.22%   39.33%   6.18%    3.37%    2.25%    1.69%    1.97%    3.09%
four     9.87%    10.53%   19.08%   5.59%    49.01%   2.63%    0.66%    0.99%    0.33%    1.32%
five     3.30%    3.85%    0.55%    0        0.55%    62.09%   2.75%    13.19%   0        13.74%
six      4.05%    2.89%    2.89%    0.58%    0        0        74.57%   2.31%    2.89%    9.83%
seven    4.71%    0        0        1.18%    0        2.35%    5.88%    85.88%   0        0
eight    4.88%    1.74%    3.14%    1.74%    1.39%    6.27%    2.79%    2.79%    50.87%   24.39%
nine     8.41%    1.87%    2.80%    2.80%    0        4.67%    10.28%   26.17%   8.41%    34.58%

Recognition rate = 53.0168%

Figure 3.7: Confusion matrix of digit recognition with visemes. Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class.

For instance, in the words eight and nine, the viseme /I is the dominant viseme governing the mouth shape. That is, in the presence of /I, the viseme /J in eight ([I + J]) and in nine ([J + I + J]) does not have a visible effect. Even though there is a drastic performance reduction compared with when word models are employed, the classification results with visemes are far from being random. Given better models for the visemes, it is plausible to achieve higher performance.

3.6 Conclusion and Discussion

In this chapter, a mutual information based feature selection algorithm was developed that can select a close-to-optimal subset of informative features. It was shown that, by using a semi-definite programming based search strategy, this problem can be seen as a variant of the maximum-cut problem in graph theory. This observation was the basis of our approximation ratio analysis. The approximation ratio of COBRA was derived and compared with the approximation ratio of the backward elimination method. It was experimentally shown that COBRA outperforms sequential search methods, especially in the case of sparse data.

Two series expansions for mutual information were presented and it was shown that most mutual information based score functions in the literature, including mRMR and MIFS, are truncated approximations of these expansions. Furthermore, the underlying connection between MIFS and the Kirkwood approximation was explored, and it was shown that by adopting the class-conditional independence assumption and the Kirkwood approximation for Pr(X), mutual information reduces to the MIFS criterion.

This algorithm was employed to select a subset of ISCN features. The resulting subset was used to train a GMM-HMM based speech recognizer. As shown in the experiments, using these features leads to a 5-6% absolute accuracy improvement over the commonly used features derived with PCA, LDA and DCT transformations. Using ISCN, however, does not solve the reliability problem associated with the V-ASR. If the V-ASR performance had an acceptable variance over different speakers, it could be directly fused with the audio recognizer by multiplying their posterior probabilities. However, due to its large inter-speaker variance, a more complex fusion scheme is required to properly weight the modalities according to their reliabilities. This issue is discussed in Chapter 7.

Chapter 4

Binary and Multiclass Classification

A learning machine or an inducer is an algorithm that induces a hypothesis describing the relation between features and labels from a training set. The main focus of this chapter is on classification problems; therefore, the terms learning machine and classifier are used interchangeably. In many applications simple classifiers such as linear SVM or GMM-based naive Bayes classifiers are sufficient to obtain a high level of accuracy. In lipreading, however, this is not the case due to the complex nature of the underlying random process. Most learning algorithms applied to this problem underfit the complex reality and result in lower-than-expected accuracy. To address this issue, a powerful class of learners based on boosting methods, which are known to be efficient1 learning algorithms, is introduced. In this approach, a large number of weak classifiers are generated and the final hypothesis is taken to be a convex combination of these weak classifiers. First, a framework for the binary classification setting is provided and it is shown that it has a natural generalization to the multiclass setting. The robustness of the proposed classifiers against overfitting and mislabeling noise is of great help in constructing robust voice activity detection and isolated word lipreading systems.

1 An algorithm is called efficient if (I) it can find the optimal hypothesis in time polynomial in the length of the input and the number of samples (this relates to the training phase) and (II) it can compute the value of the optimal hypothesis in time polynomial in the input dimensionality at the test phase. For instance, the K-nearest neighbors algorithm is not efficient, since its computational complexity at the test phase goes to infinity as the number of samples approaches infinity.

4.1 Introduction to Boosting Problem

A boosting algorithm can be seen as a meta-algorithm that maintains a distribution over the sample space. At each iteration a weak hypothesis is learned and the distribution is updated accordingly. The output (strong hypothesis) is a convex combination of the weak hypotheses. Two dominant views to describe and design boosting algorithms are "weak to strong learner" (WTSL), which is the original viewpoint presented in [Sch90, FS97], and boosting by "coordinate-wise gradient descent in the functional space" (CWGD) appearing in later works [Bre99, MBBF99, FHT98]. A boosting algorithm adhering to the first view guarantees that it only requires a finite number of iterations (equivalently, a finite number of weak hypotheses) to learn a $(1-\epsilon)$-accurate hypothesis. In contrast, an algorithm resulting from the CWGD viewpoint (usually called a potential booster) may not necessarily be a boosting algorithm in the probably approximately correct (PAC) learning sense. However, while it is rather difficult to construct a boosting algorithm based on the first view, the algorithmic frameworks, e.g., AnyBoost [MBBF99], resulting from the second viewpoint have proven to be particularly prolific for developing new boosting algorithms. Under the CWGD view, the choice of the convex loss function to be minimized is the cornerstone of designing a boosting algorithm. This, however, is a severe disadvantage in some applications.

In CWGD, the weights are not directly controllable (designable) and are only viewed as the values of the gradient of the loss function. In many applications, some characteristics of the desired distribution are known or given as problem requirements, while finding a loss function that generates such a distribution is likely to be difficult. For instance, what loss functions can generate sparse distributions?2 What family of loss functions results in a smooth distribution?3

We can even go further and imagine scenarios in which a loss function needs to put more weight on a given subset of examples than on others, either because that subset has more reliable labels or because it is a problem requirement to have a more accurate hypothesis for that part of the sample space. Then, what loss function can generate such a customized distribution? Moreover, does it result in a provable boosting algorithm? In general, how can we characterize the accuracy of the final hypothesis? Although, to be fair, the so-called loss function hunting approach has given rise to useful boosting algorithms such as LogitBoost, FilterBoost, GiniBoost and MADABoost [FHT98, BS08, Hat06, DW00] which (to some extent) answer some of the above questions, it is an inflexible and relatively unsuccessful approach to addressing boosting problems with distribution constraints.

Another approach to designing a boosting algorithm is to directly follow the WTSL viewpoint [Fre95, Ser03, BGL02]. The immediate advantages of such an approach are, first, that the resultant algorithms are provable boosting algorithms, i.e., they output a hypothesis of arbitrary accuracy. Second, the booster has direct control over the weights, making it more suitable for boosting problems subject to distribution constraints. However, since the WTSL view does not offer any algorithmic framework (as opposed to the CWGD view), it is rather difficult to come up with a distribution update mechanism resulting in a provable boosting algorithm. There are, however, a few useful, albeit fairly limited, algorithmic frameworks such as TotalBoost [WLR06] that can be used to derive other provable boosting algorithms. The TotalBoost algorithm can maximize the margin by iteratively solving a convex problem with the totally corrective constraint. A more general family of boosting algorithms was later proposed by Shalev-Shwartz et al. [SS08], where it was shown that weak learnability and linear separability are equivalent, a result following from von Neumann's minmax theorem. Using this theorem, they constructed a family of algorithms that maintain smooth distributions over the sample space, and are consequently noise tolerant. Their proposed algorithms find a $(1-\epsilon)$-accurate solution after performing at most $O(\log(N)/\epsilon^2)$ iterations, where $N$ is the number of training examples.

2 In the boosting terminology, sparsity usually refers to the greedy hypothesis-selection strategy of boosting methods in the functional space. However, sparsity in this chapter refers to the sparsity of the distribution (weights) over the sample space.
3 A smooth distribution is a distribution that does not put too much weight on any single sample; in other words, a distribution emulated by the booster does not dramatically diverge from the target distribution [Ser03, Gav03].

4.2 Our Results

A family of boosting algorithms is presented that can be derived from well-known online learning algorithms, including projected gradient descent

[Zin03] and its generalization, mirror descent (both active and lazy updates, see [Haz09]), and composite objective mirror descent (COMID) [DSST10]. The PAC learnability of the algorithms derived from this framework is proven and it is shown that this framework in fact generates maximum margin algorithms. That is, given a desired accuracy level $\nu$, it outputs a hypothesis of margin $\gamma_{\min} - \nu$, with $\gamma_{\min}$ being the minimum edge that the weak classifier guarantees to return.

The duality between (linear) online learning and boosting is by no means new. In online learning, at round $t$ the learner receives a new unlabeled sample point and is required to predict its label. After predicting a label, the correct label is revealed and the learner suffers some amount of loss (dependent on the loss function). In boosting, however, at each time instance $t$, all samples are available to the booster. The booster selects a weak learner from the hypothesis space and suffers some amount of loss due to the performance of the selected weak learner. Both of these methods can be seen as zero-sum games with very similar formulations. The duality between these two methods was first pointed out in [FS97] and was later elaborated and formalized using von Neumann's minmax theorem [FS96b]. Following this line, several proof techniques required to show the PAC learnability of the derived boosting algorithms are provided. These techniques are fairly versatile and can be used to translate many other online learning methods into our boosting framework. To motivate our boosting framework, two practically and theoretically interesting algorithms are derived:

• The SparseBoost algorithm, which tries to reduce the space and computation complexity by maintaining a sparse distribution over the sample space. In fact this problem, i.e., applying batch boosting on successive subsets of data when there is not sufficient memory to store an entire dataset, was first discussed by Breiman in [Bre97], though no algorithm with a theoretical guarantee was suggested. SparseBoost is the first provable batch booster that can (partially) address this problem. By analyzing this algorithm, it is shown that the tuning parameter of the $\ell_1$ regularization term at each round $t$ should not exceed $\frac{\gamma_t}{2}\eta_t$ in order to still have a boosting algorithm, where $\eta_t$ is the coefficient of the $t$th weak hypothesis and $\gamma_t$ is its edge. Exploiting sparsity in the example domain has also been investigated in [HT09] by Hatano and Takimoto.

• A smooth boosting algorithm that requires only $O(\log 1/\epsilon)$ rounds to learn a $(1-\epsilon)$-accurate hypothesis. This algorithm can

also be seen as an agnostic boosting algorithm4 due to the fact that smooth distributions provide a theoretical guarantee for noise tolerance in various noisy learning settings, such as agnostic boosting [KK09, BLM01].

Furthermore, an interesting theoretical result about MADABoost [DW00] is provided. A proof (to the best of our knowledge the only available unconditional proof) of the boosting property of (a variant of) MADABoost is given here, and it is shown that, unlike the common presumption, its convergence rate is of $O(1/\epsilon^2)$ rather than $O(1/\epsilon)$.

The MABoost framework is generalized to the multiclass setting and it is shown that it adopts the minimal weak-learning condition introduced in [MS13]. That is, it imposes minimal conditions on the weak learner space to drive the training error to zero. The ADABoost.MM algorithm presented in [MS13] can also be derived from our framework by using the Kullback-Leibler (KL) divergence for the generic Bregman divergence in the MABoost framework.

4.3 Fundamentals

First of all, the notation used throughout the chapter is established. Vectors are lower-case bold letters and their entries are non-bold letters with subscripts, such as $x_i$ of $\mathbf x$, or non-bold letters with superscripts if the vector already has a subscript, such as $x_t^i$ of $\mathbf x_t$. Matrices are upper-case bold letters and their entries are shown by upper-case non-bold letters with subscripts. Moreover, the $N$-dimensional probability simplex is denoted by $\mathcal S = \{\mathbf w \mid \sum_{i=1}^N w_i = 1,\ w_i \ge 0\}$.

Since a central notion throughout this chapter is that of Bregman divergences, some of their properties are briefly revisited. A Bregman divergence is defined with respect to a convex function $\mathcal R$ as

$$B_{\mathcal R}(\mathbf x, \mathbf y) = \mathcal R(\mathbf x) - \mathcal R(\mathbf y) - \nabla\mathcal R(\mathbf y)^\top(\mathbf x - \mathbf y) \qquad (4.1)$$
and can be interpreted as a distance measure between $\mathbf x$ and $\mathbf y$. Due to the convexity of $\mathcal R$, a Bregman divergence is always non-negative, i.e., $B_{\mathcal R}(\mathbf x, \mathbf y) \ge 0$.

4 Unlike the PAC model, the agnostic learning model allows an arbitrary target function (labeling function) that may not belong to the class studied, and hence can be viewed as a noise-tolerant learning model [KSS92].

In this work, $\mathcal R$ is considered to be a $\beta$-strongly convex function5 with respect to a norm $\|\cdot\|$. With this choice of $\mathcal R$, the Bregman divergence satisfies $B_{\mathcal R}(\mathbf x, \mathbf y) \ge \frac{\beta}{2}\|\mathbf x - \mathbf y\|^2$. As an example, if $\mathcal R(\mathbf x) = \frac12\mathbf x^\top\mathbf x$ (which is 1-strongly convex with respect to $\|\cdot\|_2$), then $B_{\mathcal R}(\mathbf x, \mathbf y) = \frac12\|\mathbf x - \mathbf y\|_2^2$ is the square of the Euclidean distance. Another example is the negative entropy function $\mathcal R(\mathbf x) = \sum_{i=1}^N x_i\log x_i$ (resulting in the KL-divergence), which is known to be 1-strongly convex over the probability simplex with respect to the $\ell_1$ norm.

The Bregman projection is another fundamental concept of our framework.

Definition 4.1 (Bregman projection). The Bregman projection of a vector $\mathbf y$ onto a convex set $\mathcal S$ with respect to a Bregman divergence $B_{\mathcal R}$ is
$$\Pi_{\mathcal S}(\mathbf y) = \arg\min_{\mathbf x \in \mathcal S} B_{\mathcal R}(\mathbf x, \mathbf y) \qquad (4.2)$$

Moreover, the following generalized Pythagorean theorem holds for Bregman projections.

Lemma 4.2 (Generalized Pythagorean; see also [CBL06, Lemma 11.3]). Given a point $\mathbf y \in \mathbb R^N$, a convex set $\mathcal S$ and $\hat{\mathbf y} = \Pi_{\mathcal S}(\mathbf y)$ the Bregman projection of $\mathbf y$ onto $\mathcal S$, for all $\mathbf x \in \mathcal S$ we have
$$\text{Exact:}\quad B_{\mathcal R}(\mathbf x, \mathbf y) \ge B_{\mathcal R}(\mathbf x, \hat{\mathbf y}) + B_{\mathcal R}(\hat{\mathbf y}, \mathbf y) \qquad (4.3)$$
$$\text{Relaxed:}\quad B_{\mathcal R}(\mathbf x, \mathbf y) \ge B_{\mathcal R}(\mathbf x, \hat{\mathbf y}) \qquad (4.4)$$
The relaxed version follows from the fact that $B_{\mathcal R}(\hat{\mathbf y}, \mathbf y) \ge 0$ and can thus be ignored.
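To make the two running examples concrete, the following small sketch (added here for illustration, not part of the thesis) evaluates the Bregman divergences induced by the quadratic potential and by negative entropy; the generalized KL form is used so that the expression is valid for arbitrary positive vectors.

```python
import numpy as np

def bregman_quadratic(x, y):
    # B_R(x, y) for R(x) = 0.5 * x^T x: half the squared Euclidean distance.
    return 0.5 * np.sum((x - y) ** 2)

def bregman_negentropy(x, y):
    # B_R(x, y) for R(x) = sum_i x_i log x_i: the (generalized) KL divergence,
    # which reduces to the usual KL divergence on the probability simplex.
    return np.sum(x * np.log(x / y) - x + y)

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.4, 0.4, 0.2])
print(bregman_quadratic(x, y), bregman_negentropy(x, y))  # both are non-negative
```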

Lemma 4.3. For any vectors x, y, z, we have

$$(\mathbf x - \mathbf y)^\top\big(\nabla\mathcal R(\mathbf z) - \nabla\mathcal R(\mathbf y)\big) = B_{\mathcal R}(\mathbf x, \mathbf y) - B_{\mathcal R}(\mathbf x, \mathbf z) + B_{\mathcal R}(\mathbf y, \mathbf z) \qquad (4.5)$$
The above lemma follows directly from the Bregman divergence definition in (4.1). Additionally, the following definitions from convex analysis are useful throughout the chapter.

5 That is, its second derivative (Hessian in higher dimensions) is bounded away from zero by at least $\beta$.

Definition 4.4 (Norm & dual norm). Let $\|\cdot\|_A$ be a norm. Then its dual norm is defined as

$$\|\mathbf y\|_A^* = \sup\{\mathbf y^\top\mathbf x \mid \|\mathbf x\|_A \le 1\} \qquad (4.6)$$

For instance, the dual norm of $\|\cdot\|_2 = \ell_2$ is $\|\cdot\|_2^* = \ell_2$, and the dual norm of $\ell_1$ is the $\ell_\infty$ norm. Further,

Lemma 4.5. For any vectors $\mathbf x$, $\mathbf y$ and any norm $\|\cdot\|_A$, the following inequality holds:

$$\mathbf x^\top\mathbf y \le \|\mathbf x\|_A\,\|\mathbf y\|_A^* \le \frac12\|\mathbf x\|_A^2 + \frac12\|\mathbf y\|_A^{*\,2} \qquad (4.7)$$

The proof directly follows from Hölder's inequality. Throughout this chapter, the shorthands $\|\cdot\|_A = \|\cdot\|$ and $\|\cdot\|_A^* = \|\cdot\|_*$ are used for the norm and its dual, respectively.
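As a quick numerical sanity check of inequality (4.7), added here as an illustration and not part of the thesis, the $\ell_1$ norm and its dual $\ell_\infty$ norm can be paired as follows; the chosen vectors are arbitrary.

```python
import numpy as np

x = np.array([1.0, -2.0, 0.5])
y = np.array([0.3, 0.1, -0.7])
lhs = x.dot(y)
# For the l1 norm, the dual norm is l-infinity; verify (4.7) for this pair.
mid = np.linalg.norm(x, 1) * np.linalg.norm(y, np.inf)
rhs = 0.5 * np.linalg.norm(x, 1) ** 2 + 0.5 * np.linalg.norm(y, np.inf) ** 2
assert lhs <= mid <= rhs
```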

4.4 Boosting Framework

Let $\{(\mathbf x_i, a_i)\}$, $1 \le i \le N$, be $N$ training samples, where $\mathbf x_i \in \mathcal X$ and $a_i \in \{-1, +1\}$. Assume $h \in \mathcal H$ is a real-valued function mapping $\mathcal X$ into $[-1, 1]$. Let us denote a distribution over the training data by $\mathbf w = [w_1, \ldots, w_N]^\top$ and define a loss vector $\mathbf d = [-a_1 h(\mathbf x_1), \ldots, -a_N h(\mathbf x_N)]^\top$. We define $\gamma = -\mathbf w^\top\mathbf d$ as the edge of the hypothesis $h$ under the distribution $\mathbf w$, and it is assumed to be positive when $h$ is returned by a weak learner. In this work, the branching-program-based boosters introduced in [MM02] are not considered and we adhere to the typical boosting protocol (described in Section 4.1).

Let $\mathcal R(\mathbf x)$ be a 1-strongly convex function with respect to a norm $\|\cdot\|$ and denote its associated Bregman divergence by $B_{\mathcal R}$. Moreover, let the dual norm of a loss vector $\mathbf d_t$ be upper bounded, i.e., $\|\mathbf d_t\|_* \le L$. The following mirror ascent boosting (MABoost) algorithm is our boosting framework. It is easy to verify that for $\mathbf d_t$ as defined in MABoost, $L = 1$ when $\|\cdot\|_* = \ell_\infty$ and $L = N$ when $\|\cdot\|_* = \ell_2$.

Algorithm 4.1: Mirror Ascent Boosting (MABoost)
Input: $\mathcal R(\mathbf x)$ a 1-strongly convex function, $\mathbf w_1 = [\frac1N, \ldots, \frac1N]^\top$ and $\mathbf z_1 = [\frac1N, \ldots, \frac1N]^\top$
For $t = 1, \ldots, T$ do
 (a) Train classifier with $\mathbf w_t$ and get $h_t$; let $\mathbf d_t = [-a_1 h_t(\mathbf x_1), \ldots, -a_N h_t(\mathbf x_N)]^\top$ and $\gamma_t = -\mathbf w_t^\top\mathbf d_t$.
 (b) Set $\eta_t = \frac{\gamma_t}{L}$.
 (c) Update weights: $\nabla\mathcal R(\mathbf z_{t+1}) = \nabla\mathcal R(\mathbf z_t) + \eta_t\mathbf d_t$ (lazy update)
  $\nabla\mathcal R(\mathbf z_{t+1}) = \nabla\mathcal R(\mathbf w_t) + \eta_t\mathbf d_t$ (active update)
 (d) Project onto $\mathcal S$: $\mathbf w_{t+1} = \arg\min_{\mathbf w \in \mathcal S} B_{\mathcal R}(\mathbf w, \mathbf z_{t+1})$
End
Output: The final hypothesis $f(\mathbf x) = \mathrm{sign}\big(\sum_{t=1}^T \eta_t h_t(\mathbf x)\big)$.

This algorithm is a variant of the mirror descent algorithm [Haz09], modified to work as a boosting algorithm. The basic principle in this algorithm is quite clear. As in ADABoost, the weight of a wrongly (correctly) classified sample increases (decreases). The weight vector is then projected onto the probability simplex in order to keep the weight sum equal to 1. The distinction between the active and lazy update versions, and the fact that the algorithm may behave quite differently under different update strategies, should be emphasized. In the lazy update version, the norm of the auxiliary variable $\mathbf z_t$ is unbounded. In the active update version, on the other hand, $\mathbf z$ is bounded and does not drift far away from $\mathbf w$. However, unlike the lazy version, in the active update the algorithm always needs to access (compute) the previous projected weight $\mathbf w_t$ to update the weight at round $t$, and this may not be possible in some applications (such as boosting-by-filtering [DW00]).
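To make the update concrete, the following is a minimal NumPy/scikit-learn sketch (an illustration under assumptions, not the thesis implementation) of the active-update MABoost with the quadratic potential $\mathcal R(\mathbf w) = \frac12\|\mathbf w\|_2^2$: the gradient map is then the identity and the Bregman projection reduces to the Euclidean projection onto the simplex. Decision stumps stand in for the weak learner, $L = N$ follows the dual-norm bound stated in the text for $\ell_2$, and all function names are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def project_to_simplex(z):
    # Euclidean projection onto the probability simplex; this realizes the
    # Bregman projection for the quadratic potential R(w) = 0.5 * ||w||^2.
    u = np.sort(z)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(z) + 1) > (css - 1))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(z - theta, 0.0)

def maboost_quadratic(X, a, T=100, max_depth=1):
    # Active-update MABoost with the quadratic potential; labels a in {-1,+1}.
    N = X.shape[0]
    L = N                                  # dual-norm bound used in the text for l2
    w = np.full(N, 1.0 / N)
    hypotheses, etas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=max_depth).fit(X, a, sample_weight=w)
        pred = h.predict(X)
        d = -a * pred                      # loss vector d_i = -a_i * h(x_i)
        gamma = -w.dot(d)                  # edge of the weak hypothesis
        if gamma <= 0:
            break                          # weak-learning assumption violated
        eta = gamma / L
        z = w + eta * d                    # gradient step (identity map for quadratic R)
        w = project_to_simplex(z)          # Bregman projection onto the simplex
        hypotheses.append(h)
        etas.append(eta)
    def f(Xq):
        score = sum(eta * h.predict(Xq) for eta, h in zip(etas, hypotheses))
        return np.sign(score)
    return f
```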

Theorem 4.6. Suppose that MABoost generates weak hypotheses $h_1, \ldots, h_T$

whose edges are $\gamma_1, \ldots, \gamma_T$. Then the error $\epsilon$ of the combined hypothesis $f$ on the training set is bounded as:
$$\mathcal R(\mathbf w) = \tfrac12\|\mathbf w\|_2^2: \qquad \epsilon \le \frac{1}{1 + \sum_{t=1}^T \gamma_t^2} \qquad (4.8)$$
$$\mathcal R(\mathbf w) = \sum_{i=1}^N w_i\log w_i: \qquad \epsilon \le e^{-\sum_{t=1}^T \frac12\gamma_t^2} \qquad (4.9)$$

In fact, the first bound (4.8) holds for any 1-strongly convex $\mathcal R$, though for some $\mathcal R$ (e.g., negative entropy) the much tighter bound in (4.9) can be achieved.

Proof: Assume $\mathbf w^* = [w_1^*, \ldots, w_N^*]^\top$ is a distribution vector where $w_i^* = \frac{1}{N\epsilon}$ if $f(\mathbf x_i) \ne a_i$, and 0 otherwise. $\mathbf w^*$ can be seen as a uniform distribution over the samples wrongly classified by the ensemble hypothesis $f$. Using this vector and following the approach in [Haz09], an upper bound on $\sum_{t=1}^T \eta_t(\mathbf w^{*\top}\mathbf d_t - \mathbf w_t^\top\mathbf d_t)$ is derived, where $\mathbf d_t = [d_t^1, \ldots, d_t^N]$ is a loss vector as defined in Algorithm 4.1.

$$(\mathbf w^* - \mathbf w_t)^\top\eta_t\mathbf d_t = (\mathbf w^* - \mathbf w_t)^\top\big(\nabla\mathcal R(\mathbf z_{t+1}) - \nabla\mathcal R(\mathbf w_t)\big) \qquad (4.10a)$$
$$= B_{\mathcal R}(\mathbf w^*, \mathbf w_t) - B_{\mathcal R}(\mathbf w^*, \mathbf z_{t+1}) + B_{\mathcal R}(\mathbf w_t, \mathbf z_{t+1}) \qquad (4.10b)$$

$$\le B_{\mathcal R}(\mathbf w^*, \mathbf w_t) - B_{\mathcal R}(\mathbf w^*, \mathbf w_{t+1}) + B_{\mathcal R}(\mathbf w_t, \mathbf z_{t+1}) \qquad (4.10c)$$
where the first equation follows from Lemma 4.3 and inequality (4.10c) results from the relaxed version of Lemma 4.2. Note that Lemma 4.2 can be applied here because $\mathbf w^* \in \mathcal S$. Further, the $B_{\mathcal R}(\mathbf w_t, \mathbf z_{t+1})$ term is bounded. By applying Lemma 4.5,

$$B_{\mathcal R}(\mathbf w_t, \mathbf z_{t+1}) + B_{\mathcal R}(\mathbf z_{t+1}, \mathbf w_t) = (\mathbf z_{t+1} - \mathbf w_t)^\top\eta_t\mathbf d_t \le \frac12\|\mathbf z_{t+1} - \mathbf w_t\|^2 + \frac12\eta_t^2\|\mathbf d_t\|_*^2 \qquad (4.11)$$
and since $B_{\mathcal R}(\mathbf z_{t+1}, \mathbf w_t) \ge \frac12\|\mathbf z_{t+1} - \mathbf w_t\|^2$ due to the 1-strong convexity of $\mathcal R$, we have
$$B_{\mathcal R}(\mathbf w_t, \mathbf z_{t+1}) \le \frac12\eta_t^2\|\mathbf d_t\|_*^2 \qquad (4.12)$$

Now, substituting (4.12) into (4.10c) and summing it up from t = 1 to T yields

$$\sum_{t=1}^T \mathbf w^{*\top}\eta_t\mathbf d_t - \sum_{t=1}^T \mathbf w_t^\top\eta_t\mathbf d_t \le \sum_{t=1}^T \frac12\eta_t^2\|\mathbf d_t\|_*^2 + B_{\mathcal R}(\mathbf w^*, \mathbf w_1) - B_{\mathcal R}(\mathbf w^*, \mathbf w_{T+1}) \qquad (4.13)$$

Moreover, it is evident from the algorithm description that for wrongly classified samples

$$-a_i f(\mathbf x_i) = -a_i\,\mathrm{sign}\Big(\sum_{t=1}^T \eta_t h_t(\mathbf x_i)\Big) = \mathrm{sign}\Big(\sum_{t=1}^T \eta_t d_t^i\Big) \ge 0 \qquad \forall\,\mathbf x_i \in \{\mathbf x \mid f(\mathbf x_i) \ne a_i\} \qquad (4.14)$$
Following (4.14), the first term in (4.13) is $\mathbf w^{*\top}\sum_{t=1}^T \eta_t\mathbf d_t \ge 0$ and can thus be ignored. Moreover, by the definition of $\gamma_t$, the second term is $-\sum_{t=1}^T \mathbf w_t^\top\eta_t\mathbf d_t = \sum_{t=1}^T \eta_t\gamma_t$. Putting all this together, ignoring the last term in (4.13) and replacing $\|\mathbf d_t\|_*^2$ with its upper bound $L$, yields
$$-B_{\mathcal R}(\mathbf w^*, \mathbf w_1) \le \sum_{t=1}^T \tfrac12 L\,\eta_t^2 - \sum_{t=1}^T \eta_t\gamma_t \qquad (4.15)$$

Replacing the left side with $-B_{\mathcal R} = -\frac12\|\mathbf w^* - \mathbf w_1\|^2 = \frac{\epsilon - 1}{2N\epsilon}$ for the case of quadratic $\mathcal R$, and with $-B_{\mathcal R} = \log(\epsilon)$ when $\mathcal R$ is the negative entropy function, then taking the derivative with respect to $\eta_t$ and equating it to zero (which yields $\eta_t = \frac{\gamma_t}{L}$), the error bounds in (4.8) and (4.9) are achieved. Note that in the case of $\mathcal R$ being the negative entropy function, Algorithm 4.1 degenerates into ADABoost with a different choice of $\eta_t$.

Before continuing our discussion, it is important to mention that the cornerstone of the proof is the choice of $\mathbf w^*$. For instance, a different choice of $\mathbf w^*$ results in the following maximum margin theorem.

Theorem 4.7 (Maximum margin property of MABoost). Setting $\eta_t = \frac{\gamma_t}{L\sqrt t}$, MABoost outputs a hypothesis of margin at least $\gamma_{\min} - \nu$, where $\nu$ is a desired accuracy level and tends to zero in $O(\frac{\log T}{\sqrt T})$ rounds of boosting. See Appendix A.2 for the proof.

Observations: Two observations follow immediately from the proof of Theorem 4.6. First, the requirement for using Lemma 4.2 is $\mathbf w^* \in \mathcal S$, so in the case of projecting onto a smaller convex set $\mathcal S_k \subseteq \mathcal S$, the proof remains intact as long as $\mathbf w^* \in \mathcal S_k$ holds. Second, only the relaxed version of Lemma 4.2 is required in the proof (to obtain inequality (4.10c)). Hence, if there is an approximate projection operator $\hat\Pi_{\mathcal S}$ that satisfies the inequality $B_{\mathcal R}(\mathbf w^*, \mathbf z_{t+1}) \ge B_{\mathcal R}\big(\mathbf w^*, \hat\Pi_{\mathcal S}(\mathbf z_{t+1})\big)$, it can be substituted for the exact projection operator $\Pi_{\mathcal S}$ and the active update version of the algorithm still works. A practical approximate projection operator of this type can be obtained through a double-projection strategy (see Appendix A.3 for the proof).

Lemma 4.8. Consider the convex sets $\mathcal K$ and $\mathcal S$, where $\mathcal S \subseteq \mathcal K$. Then for any $\mathbf x \in \mathcal S$ and $\mathbf y \in \mathbb R^N$, $\hat\Pi_{\mathcal S}(\mathbf y) = \Pi_{\mathcal S}\big(\Pi_{\mathcal K}(\mathbf y)\big)$ is an approximate projection that satisfies $B_{\mathcal R}(\mathbf x, \mathbf y) \ge B_{\mathcal R}\big(\mathbf x, \hat\Pi_{\mathcal S}(\mathbf y)\big)$.

The above observations are employed to generalize Algorithm 4.1. However, it is important to emphasize that the approximate Bregman projection is only valid for the active update version of MABoost, due to the fact that the relaxed version of Lemma 4.2 used in the proof of Theorem 4.6 is only valid for the active update version.

4.4.1 Sparse Boosting

Let $\mathcal R(\mathbf w) = \frac12\|\mathbf w\|_2^2$. Since in this case the projection onto the probability simplex is in fact an $\ell_1$-constrained optimization problem, it is plausible that some of the weights are zero (a sparse distribution), which is already a useful observation. To promote the sparsity of the weight vector, it is desirable to directly regularize the projection with the $\ell_1$ norm, i.e., to add $\|\mathbf w\|_1$ to the objective function in the projection step. This is, however, not possible in MABoost, since $\|\mathbf w\|_1$ is trivially constant on the probability simplex. Therefore, the projection step is split into two consecutive steps. The first projection is onto $\mathbb R_+^N = \{\mathbf y \mid 0 \le y_i\}$.

Surprisingly, the projection onto $\mathbb R_+^N$ implicitly regularizes the weights of the correctly classified samples with a weighted $\ell_1$ norm term. This point is clearly shown in the proof of Theorem 4.9 presented in Appendix A.5. To further enhance sparsity, we may introduce an explicit $\ell_1$ norm regularization term into the projection step with a regularization factor denoted by $\alpha_t\eta_t$. The solution of the projection step is then normalized to get a feasible point on the

probability simplex. This algorithm is listed in Algorithm 4.2. $\alpha_t\eta_t$ is the regularization factor of the explicit $\ell_1$ norm at round $t$. Note that the dominant regularization factor is $\eta_t d_t^i$, which only pushes the weights of the correctly classified samples to zero, i.e., when $d_t^i < 0$. This becomes evident by substituting the update step into the projection step for $\mathbf z_{t+1}$.

Algorithm 4.2: SparseBoost
Input: $\mathcal R(\mathbf w) = \frac12\|\mathbf w\|_2^2$, $\mathbb R_+^N = \{\mathbf y \mid 0 \le y_i\}$, $\mathbf y_1 = [\frac1N, \ldots, \frac1N]^\top$
For $t = 1, \ldots, T$ do
 (a) Train classifier with $\mathbf w_t$ and get $h_t$.
 (b) Set $\big(\eta_t = \frac{\gamma_t\|\mathbf y_t\|_1}{N},\ \alpha_t = 0\big)$ or $\big(\eta_t = \frac{\gamma_t\|\mathbf y_t\|_1}{2N},\ \alpha_t = \frac12\gamma_t\|\mathbf y_t\|_1\big)$.
 (c) $\mathbf z_{t+1} = \mathbf y_t + \eta_t\mathbf d_t$.
 (d) $\mathbf y_{t+1} = \arg\min_{\mathbf y \in \mathbb R_+^N} \frac12\|\mathbf y - \mathbf z_{t+1}\|^2 + \alpha_t\eta_t\|\mathbf y\|_1$.
 (e) $\mathbf w_{t+1} = \frac{\mathbf y_{t+1}}{\sum_{i=1}^N y_{t+1}^i}$.
End
Output: The final hypothesis $f(\mathbf x) = \mathrm{sign}\big(\sum_{t=1}^T \eta_t h_t(\mathbf x)\big)$.
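The projection in steps (d) and (e) has a closed form: the minimizer over the positive orthant with an $\ell_1$ penalty is a componentwise soft-thresholding, followed by an $\ell_1$ normalization. Below is a minimal NumPy sketch of one SparseBoost weight update under that reading of Algorithm 4.2; the function name and the use_l1 switch are illustrative only, not the thesis code.

```python
import numpy as np

def sparseboost_weight_update(y_t, d_t, gamma_t, use_l1=True):
    # One weight update of Algorithm 4.2 (steps (b)-(e)), as a sketch.
    N = len(y_t)
    l1 = np.sum(y_t)                        # ||y_t||_1, since y_t is non-negative
    if use_l1:
        eta = gamma_t * l1 / (2 * N)
        alpha = 0.5 * gamma_t * l1
    else:
        eta = gamma_t * l1 / N
        alpha = 0.0
    z = y_t + eta * d_t
    # argmin_{y >= 0} 0.5*||y - z||^2 + alpha*eta*||y||_1  ==  soft-thresholding
    y_next = np.maximum(z - alpha * eta, 0.0)
    w_next = y_next / np.sum(y_next)        # normalize back onto the simplex
    return y_next, w_next
```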

For simplicity, two cases are considered: $\alpha_t = \min(1, \frac12\gamma_t\|\mathbf y_t\|_1)$ and $\alpha_t = 0$. The training error bound is given by the following theorem (proof in Appendix A.5):

Theorem 4.9. Suppose that SparseBoost generates weak hypotheses $h_1, \ldots, h_T$ whose edges are $\gamma_1, \ldots, \gamma_T$. Then the error $\epsilon$ of the combined hypothesis $f$ on the training set is bounded as follows:
$$\epsilon \le \frac{1}{1 + c\sum_{t=1}^T \gamma_t^2\|\mathbf y_t\|_1^2} \qquad (4.16)$$
Note that this bound holds for any choice of $\alpha_t \in [0, \min(1, \gamma_t\|\mathbf y_t\|_1)]$. In particular, in our two cases the constant $c$ is 1 for $\alpha_t = 0$, and $\frac14$ when $\alpha_t = \min(1, \frac12\gamma_t\|\mathbf y_t\|_1)$.

For $\alpha_t = 0$, the $\ell_1$ norm of the weights $\|\mathbf y_t\|_1$ can be bounded away from zero by $\frac1N$ (see Appendix A.5). Thus, the error $\epsilon$ tends to zero as $O(\frac{N^2}{\gamma^2 T})$. That is, in this case SparseBoost is a provable boosting algorithm. However, for $\alpha_t \ne 0$, the $\ell_1$ norm $\|\mathbf y_t\|_1$ may rapidly go to zero, and consequently the upper bound of the training error in (4.16) does not vanish as $T$ increases.

In this case, it is not possible to conclude that the algorithm is in fact a boosting algorithm6. It is noteworthy that SparseBoost can be seen as a variant of the COMID algorithm in [DSST10].

6 Nevertheless, for some choices of $\alpha_t \ne 0$, such as $\alpha_t \propto \frac{1}{t^2}$, the boosting property of the algorithm is still provable.

4.4.2 Smooth Boosting

Let $k > 0$ be a smoothness parameter. A distribution $\mathbf w$ is smooth with respect to a given distribution $\mathbf D$ if $w_i \le k D_i$ for all $1 \le i \le N$. Here, smoothness with respect to the uniform distribution, i.e., $D_i = \frac1N$, is considered. Then, given a desired smoothness parameter $k$, a boosting algorithm is required that only constructs distributions $\mathbf w$ such that $w_i \le \frac kN$, while guaranteeing to output a $(1-\frac1k)$-accurate hypothesis. To this end, it is only required to replace the probability simplex $\mathcal S$ with $\mathcal S_k = \{\mathbf w \mid \sum_{i=1}^N w_i = 1,\ 0 \le w_i \le \frac kN\}$ in MABoost to obtain a smooth-distribution boosting algorithm, called smooth-MABoost. That is, the update rule is: $\mathbf w_{t+1} = \arg\min_{\mathbf w \in \mathcal S_k} B_{\mathcal R}(\mathbf w, \mathbf z_{t+1})$.

Note that the proof of Theorem 4.6 holds for smooth-MABoost as well. As long as $\epsilon \ge \frac1k$, the error distribution $\mathbf w^*$ ($w_i^* = \frac{1}{N\epsilon}$ if $f(\mathbf x_i) \ne a_i$, and 0 otherwise) is in $\mathcal S_k$ because $\frac{1}{N\epsilon} \le \frac kN$. Thus, based on the first observation, the error bounds achieved in Theorem 4.6 hold for $\epsilon \ge \frac1k$. In particular, $\epsilon = \frac1k$ is reached after a finite number of iterations. This projection problem has already appeared in the literature. An entropic projection algorithm ($\mathcal R$ is the negative entropy), for instance, was proposed in [SS08]. Using the negative entropy and the projection algorithm suggested in [SS08] results in a fast smooth boosting algorithm with the following convergence rate.

Theorem 4.10. Given $\mathcal R(\mathbf w) = \sum_{i=1}^N w_i\log w_i$ and a desired $\epsilon$, smooth-MABoost finds a $(1-\epsilon)$-accurate hypothesis in $O(\log(\frac1\epsilon)/\gamma^2)$ iterations.
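The thesis uses the entropic projection of [SS08] for the step onto the capped simplex $\mathcal S_k$; purely as an illustration (my own sketch, not the thesis method), the following computes the analogous Euclidean projection onto $\{\mathbf w : \sum_i w_i = 1,\ 0 \le w_i \le \text{cap}\}$ by bisection on the shift $\theta$, which is what a quadratic-potential variant of smooth-MABoost would need.

```python
import numpy as np

def project_capped_simplex(z, cap, iters=100):
    # Euclidean projection onto {w : sum(w) = 1, 0 <= w_i <= cap}, assuming
    # cap * len(z) >= 1 so the set is non-empty. Bisection on theta such that
    # sum(clip(z - theta, 0, cap)) = 1; the sum is monotone in theta.
    lo, hi = z.min() - 1.0, z.max()
    theta = 0.5 * (lo + hi)
    for _ in range(iters):
        theta = 0.5 * (lo + hi)
        s = np.clip(z - theta, 0.0, cap).sum()
        if s > 1.0:
            lo = theta
        else:
            hi = theta
    return np.clip(z - theta, 0.0, cap)
```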

4.4.3 MABoost for Combining Datasets (CD-MABoost)

Let us assume we have two sets of data: a primary dataset $\mathcal A$ and a secondary dataset $\mathcal B$. The goal is to train a classifier that achieves $(1-\epsilon_{\mathcal A})$ accuracy on $\mathcal A$ while limiting the error on dataset $\mathcal B$ to $\epsilon_{\mathcal B} \le \frac1k$. This scenario has many potential applications, including (I) transfer learning [DYXY07] and (II) a weighted combination of datasets based on their noise level, or emphasizing a particular region of the sample space as a problem requirement (e.g., a medical diagnostic test that should make as few wrong diagnoses as possible when the sample is a pregnant woman). To address this problem, it is only required to replace $\mathcal S$ in MABoost with $\mathcal S_c = \{\mathbf w \mid \sum_{i=1}^N w_i = 1,\ 0 \le w_i\ \forall i \in \mathcal A\ \wedge\ 0 \le w_i \le \frac kN\ \forall i \in \mathcal B\}$, where $i \in \mathcal A$ and $i \in \mathcal B$ are shorthands for the indices of samples in $\mathcal A$ and in $\mathcal B$, respectively. By generating smooth distributions on $\mathcal B$, this algorithm limits the weight of the secondary dataset, which intuitively results in limiting its effect on the final hypothesis. The proof of its boosting property is quite similar to Theorem 4.6 (see Appendix A.4).

4.4.4 Lazy Update Boosting

In this section, the proof for the lazy update version of MABoost in Theorem 4.6 is presented. Moreover, it is shown that MADABoost [DW00] can be presented as a variant of the lazy update MABoost. This gives a simple proof for MADABoost without making the assumption that the edge sequence is monotonically decreasing (as in [DW00]).

Proof: Assume $\mathbf w^* = [w_1^*, \ldots, w_N^*]^\top$ is a distribution vector where $w_i^* = \frac{1}{N\epsilon}$ if $f(\mathbf x_i) \ne a_i$, and 0 otherwise. Then,

$$\begin{aligned}
(\mathbf w^* - \mathbf w_t)^\top\eta_t\mathbf d_t &= (\mathbf w_{t+1} - \mathbf w_t)^\top\big(\nabla\mathcal R(\mathbf z_{t+1}) - \nabla\mathcal R(\mathbf z_t)\big) \\
&\quad + (\mathbf z_{t+1} - \mathbf w_{t+1})^\top\big(\nabla\mathcal R(\mathbf z_{t+1}) - \nabla\mathcal R(\mathbf z_t)\big) \\
&\quad + (\mathbf w^* - \mathbf z_{t+1})^\top\big(\nabla\mathcal R(\mathbf z_{t+1}) - \nabla\mathcal R(\mathbf z_t)\big) \\
&\le \tfrac12\|\mathbf w_{t+1} - \mathbf w_t\|^2 + \tfrac12\eta_t^2\|\mathbf d_t\|_*^2 + B_{\mathcal R}(\mathbf w_{t+1}, \mathbf z_{t+1}) - B_{\mathcal R}(\mathbf w_{t+1}, \mathbf z_t) + B_{\mathcal R}(\mathbf z_{t+1}, \mathbf z_t) \\
&\quad - B_{\mathcal R}(\mathbf w^*, \mathbf z_{t+1}) + B_{\mathcal R}(\mathbf w^*, \mathbf z_t) - B_{\mathcal R}(\mathbf z_{t+1}, \mathbf z_t) \\
&\le \tfrac12\|\mathbf w_{t+1} - \mathbf w_t\|^2 + \tfrac12\eta_t^2\|\mathbf d_t\|_*^2 - B_{\mathcal R}(\mathbf w_{t+1}, \mathbf w_t) + B_{\mathcal R}(\mathbf w_{t+1}, \mathbf z_{t+1}) - B_{\mathcal R}(\mathbf w_t, \mathbf z_t) \\
&\quad - B_{\mathcal R}(\mathbf w^*, \mathbf z_{t+1}) + B_{\mathcal R}(\mathbf w^*, \mathbf z_t)
\end{aligned} \qquad (4.17)$$
where the first inequality follows from applying Lemma 4.5 to the first term and Lemma 4.3 to the rest of the terms, and the second inequality results from applying the exact version of Lemma 4.2 to $B_{\mathcal R}(\mathbf w_{t+1}, \mathbf z_t)$. Moreover, since $B_{\mathcal R}(\mathbf w_{t+1}, \mathbf w_t) - \frac12\|\mathbf w_{t+1} - \mathbf w_t\|^2 \ge 0$, these terms can be ignored in (4.17).

Summing up the inequality (4.17) from t = 1 to T yields

$$-B_{\mathcal R}(\mathbf w^*, \mathbf z_1) \le \sum_{t=1}^T \tfrac12 L\,\eta_t^2 - \sum_{t=1}^T \eta_t\gamma_t \qquad (4.18)$$

where the facts that $\mathbf w^{*\top}\sum_{t=1}^T \eta_t\mathbf d_t \ge 0$ and $-\sum_{t=1}^T \mathbf w_t^\top\eta_t\mathbf d_t = \sum_{t=1}^T \eta_t\gamma_t$ are used. Inequality (4.18) is exactly the same as (4.15), and replacing $-B_{\mathcal R}$ with $\frac{\epsilon-1}{2N\epsilon}$ or $\log(\epsilon)$ yields the same error bounds as in Theorem 4.6. Note that, since the exact version of Lemma 4.2 is required to obtain (4.17), this proof does not reveal whether MABoost can be generalized to employ the double-projection strategy. In some particular cases, however, it may be possible to show that a double-projection variant of MABoost is still a provable boosting algorithm. In the following, we briefly show that MADABoost can be seen as a double-projection lazy MABoost.

Algorithm 4.3: Variant of MADABoost
Let $\mathcal R(\mathbf w)$ be the negative entropy and $\mathcal K$ a unit hypercube; set $\mathbf z_1 = [1, \ldots, 1]^\top$.
At $t = 1, \ldots, T$: train $h_t$ with $\mathbf w_t$, set $f_t(\mathbf x) = \mathrm{sign}\big(\sum_{t'=1}^t \eta_{t'} h_{t'}(\mathbf x)\big)$, calculate $\epsilon_t = \sum_{i=1}^N \frac{|f_t(\mathbf x_i) - a_i|}{2N}$, set $\eta_t = \epsilon_t\gamma_t$ and update
 (a) $\nabla\mathcal R(\mathbf z_{t+1}) = \nabla\mathcal R(\mathbf z_t) + \eta_t\mathbf d_t \;\rightarrow\; z_{t+1}^i = z_t^i\, e^{\eta_t d_t^i}$
 (b) $\mathbf y_{t+1} = \arg\min_{\mathbf y \in \mathcal K} B_{\mathcal R}(\mathbf y, \mathbf z_{t+1}) \;\rightarrow\; y_{t+1}^i = \min(1, z_{t+1}^i)$
 (c) $\mathbf w_{t+1} = \arg\min_{\mathbf w \in \mathcal S} B_{\mathcal R}(\mathbf w, \mathbf y_{t+1}) \;\rightarrow\; \mathbf w_{t+1} = \frac{\mathbf y_{t+1}}{\|\mathbf y_{t+1}\|_1}$

Output: the final hypothesis $f(\mathbf x) = \mathrm{sign}\big(\sum_{t=1}^T \eta_t h_t(\mathbf x)\big)$.

Algorithm 4.3 is essentially MADABoost, only with a different choice of $\eta_t$. It is well known that projecting a weight vector onto the probability simplex via the entropy function results in the $\ell_1$-normalized version of the original vector. This explains the normalization in step (c). The entropy projection onto the unit hypercube (i.e., step (b)), however, may be less known and thus its proof is given in Appendix A.6. The following theorem gives the convergence rate of the MADABoost algorithm (see Appendix A.7 for the proof).

Theorem 4.11. Algorithm 4.3 yields a $(1-\epsilon)$-accurate hypothesis after at most $T = O(\frac{1}{\epsilon^2\gamma^2})$ iterations.

This is an important result since it shows that MADABoost seems, at least in theory, to be slower than what we hoped, namely $O(\frac{1}{\epsilon\gamma^2})$.
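As an illustration (not the thesis code) of one round of the weight update in Algorithm 4.3, the following minimal NumPy sketch performs the multiplicative step followed by the two entropy projections, first onto the unit hypercube and then onto the simplex; the function name is illustrative.

```python
import numpy as np

def madaboost_weight_update(z, d, eta):
    # One weight update of Algorithm 4.3 (a MADABoost variant), as a sketch.
    z_next = z * np.exp(eta * d)          # step (a): multiplicative (entropic) update
    y_next = np.minimum(1.0, z_next)      # step (b): entropy projection onto [0,1]^N
    w_next = y_next / y_next.sum()        # step (c): entropy projection onto the simplex
    return z_next, y_next, w_next
```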

4.5 Multiclass Generalization

The convergence proof of the MABoost framework (and of the algorithms derived from it) is based on the well-understood weak-learning assumption stating that the weak classifiers predict better than random on any distribution over the training set. Logically, in order to generalize the MABoost framework to the multiclass setting, the weak-learning assumption needs to be generalized. However, as has already been shown (see [SS98], [ZZRH09] and [MS13]), this is not trivial. While there is a unique definition of weak learnability in the binary setting, an infinite number of weak-learning assumptions can be adopted in multiclass classification. For instance, the straightforward extension of the binary weak-learning condition requires that each classifier only predicts better than random guessing, that is, given $K$ classes the performance of a weak classifier should only be better than $1/K$. The SAMME algorithm in [ZZRH09], for instance, adopts this assumption. However, as the number of classes increases, the random guessing baseline approaches zero and consequently the weak-learning requirement vanishes. It is hence not surprising that this condition turned out to be too weak to be boostable, i.e., no boosting algorithm with this assumption can drive the training error to zero [MS13]. Another possible weak-learning assumption is to require the classifiers to achieve more than 50% accuracy on any distribution over the training data. This assumption becomes increasingly difficult to satisfy as the number of classes increases. Since most simple classifiers (such as decision stumps) fail to meet this requirement, by adopting this assumption we need to employ more complex weak learners, which may increase the overfitting tendency. In their groundbreaking paper [MS13], Mukherjee and Schapire argued that the optimal condition is the weakest condition that still guarantees boostability. They derived the necessary and sufficient weak-learning condition for boostability,

called the minimal weak-learning condition, and more importantly they presented an inequality constraint which can be used to check whether a given weak-learning assumption is equivalent to the minimal weak-learning condition. Throughout the rest of Section 4.5, the MABoost framework is generalized to the multiclass setting and, by using the results derived in [MS13], it is shown that it adopts the minimal weak-learning condition.

4.5.1 Preliminaries for Multiclass Setting

Let $\{(\mathbf x_i, a_i)\}$, $1 \le i \le N$, be $N$ training samples, where $\mathbf x_i \in \mathcal X$ are feature vectors and $a_i \in \{0, \ldots, K-1\}$ are class labels. Assume $h \in \mathcal H$ is a discrete-valued function mapping $\mathcal X$ into $\{0, \ldots, K-1\}$. For the sake of simplicity, it is assumed that all samples belong to class 0. Nevertheless, we may refer to the label of sample $i$ by $a_i$ or by 0, depending on the context. This assumption highly simplifies our presentation.

Assume an $N \times K$ weight matrix whose $(i,j)$th element represents the non-negative weight of wrongly classifying sample $i$ to be in class $j+1$, and the sum of each row is zero, i.e., the $(i,1)$ element (since all labels are 0) is a negative weight equal to minus the sum of the non-negative weights. Due to this zero-sum constraint, the first column does not contain any additional information and can be discarded. Our weight matrix $\mathbf W$ is then an $N \times (K-1)$ matrix whose $(i,j)$th elements $W_{i,j}$ are the non-negative costs of wrongly predicting the label of the $i$th sample to be $j$.

Analogously to the binary setting, a definition of the loss matrix (a vector in the binary setting) is required. In the binary setting, the $i$th element of the loss vector $\mathbf d$ was $-1$ if the $i$th sample was classified correctly, and 1 otherwise. In the multiclass case, a loss matrix $\mathbf D$ is defined as an $N \times (K-1)$ matrix where all the elements of the $i$th row are $-1$ if the hypothesis $h$ classifies the $i$th sample correctly. However, if $h$ makes a mistake and assigns the $i$th sample to the $j$th class, then $D_{i,j}$ is 1 and the rest of the elements of the $i$th row are zero. That is, given a weak classifier $h$, the elements of the loss matrix $\mathbf D$ are:

$$D_{i,j} = \begin{cases} -1, & \text{if } h(\mathbf x_i) = 0 \\ 0, & \text{if } h(\mathbf x_i) = l \text{ and } j \ne l \\ 1, & \text{if } h(\mathbf x_i) = j \end{cases} \qquad (4.19)$$

With these definitions at hand, the weak-learning assumption can be defined as:

Definition 4.12. A weak hypothesis space $\mathcal H$ satisfies the weak-learning condition if for any element-wise non-negative $\mathbf W$, there is a hypothesis $h \in \mathcal H$ with loss matrix $\mathbf D$ as defined in (4.19) such that

$$-\mathbf D \bullet \mathbf W \ge \gamma, \qquad \exists\,\gamma \ge 0. \qquad (4.20)$$

where the inner product of two matrices $\mathbf A$ and $\mathbf B$ is defined as the trace of their product, $\mathrm{Tr}\{\mathbf A\mathbf B^\top\}$, and denoted by $\mathbf A \bullet \mathbf B$. Without loss of generality, in this definition it is assumed that $\mathbf W$ is normalized so that the sum of its elements is 1. It is shown in Appendix A.8 that this definition of the weak-learning condition is equivalent to that of ADABoost.MR [SS98]. As shown in [MS13], ADABoost.MR satisfies the minimal weak-learning condition. Thus, any hypothesis space $\mathcal H$ that satisfies the weak-learning condition in Definition 4.12 is in fact a boostable hypothesis space.
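For concreteness, the following small sketch (illustrative, not from the thesis) builds the loss matrix of (4.19) under the all-labels-are-class-0 convention and evaluates the left-hand side of the weak-learning condition (4.20); the function names are mine.

```python
import numpy as np

def loss_matrix(pred, K):
    # Loss matrix D of (4.19) under the simplifying assumption that every
    # sample's true label is class 0; pred[i] is the weak hypothesis output.
    N = len(pred)
    D = np.zeros((N, K - 1))
    for i, p in enumerate(pred):
        if p == 0:
            D[i, :] = -1.0                # correct: all entries of row i are -1
        else:
            D[i, p - 1] = 1.0             # wrong: +1 in the predicted-class column
    return D

def edge(W, D):
    # gamma = -(W . D), where the matrix inner product is Tr(W D^T).
    return -np.sum(W * D)
```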

4.5.2 Multiclass MABoost (Mu-MABoost)

Assume $\mathbf X$ is a matrix. A matrix function $\mathcal R(\mathbf X)$ is defined to be equal to $\mathcal R(\mathbf x)$, where $\mathbf x$ is a super-vector constructed by concatenating all the columns of $\mathbf X$. Let $\mathcal R(\mathbf X)$ be a 1-strongly convex function with respect to a norm $\|\cdot\|$ and denote its associated Bregman divergence by $B_{\mathcal R}$. Moreover, let the dual norm of a loss matrix $\mathbf D_t$ be defined as the dual norm of its super-vector $\mathbf d_t$ (constructed by concatenating its columns) and assume it is bounded from above, i.e., $\|\mathbf D_t\|_* = \|\mathbf d_t\|_* \le L$. It is easy to verify that for $\mathbf D_t$ in (4.19), $L = 1$ when $\|\cdot\|_* = \ell_\infty$ and $L = N(K-1)$ when $\|\cdot\|_* = \ell_2$. Algorithm 4.4 is then the generalization of the MABoost framework to the multiclass setting. In the output of Algorithm 4.4, $\mathbf 1(\cdot)$ is an indicator function which returns 1 if its argument holds, and zero otherwise.

As in the binary MABoost framework, it can be shown that if the weak classifiers are selected from a boostable space, the training error of the Mu-MABoost algorithm approaches zero in a finite number of rounds.

Theorem 4.13. Suppose that Mu-MABoost generates weak hypotheses $h_1, \ldots, h_T$ whose edges are $\gamma_1, \ldots, \gamma_T$. Then the error $\epsilon$ of the combined hypothesis $f$ on the training set is bounded as:

$$\mathcal R(\mathbf W) = \tfrac12\|\mathbf W\|_2^2: \qquad \epsilon \le \frac{1}{1 + \sum_{t=1}^T \gamma_t^2}$$
$$\mathcal R(\mathbf W) = \sum_{i=1}^{N(K-1)} w_i\log w_i: \qquad \epsilon \le e^{-\sum_{t=1}^T \frac12\gamma_t^2}$$

$\mathcal R(\mathbf W)$ is equal to $\mathcal R(\mathbf w)$ with $\mathbf w$ being the super-vector representing $\mathbf W$. The proof of Theorem 4.13 is in the spirit of the proof of Theorem 4.6. The only difference is that instead of an error vector, an $N \times (K-1)$ error matrix $\mathbf W^*$ is used, where the elements of the $i$th row are $\frac{1}{N(K-1)}$ if $f$ classifies $\mathbf x_i$ wrongly. Replacing this error matrix in the proof of Theorem 4.6 and recalling the fact that the matrix function $\mathcal R(\mathbf X)$ is defined to be $\mathcal R(\mathbf x)$ (with $\mathbf x$ being a super-vector constructed by concatenating the columns) yields the proof of Theorem 4.13.

Algorithm 4.4: Multiclass MABoost (Mu-MABoost)
Input: $\mathcal R(\mathbf X)$ a 1-strongly convex function, $\mathbf W_1$ and $\mathbf Z_1$ with elements $W_{i,j} = Z_{i,j} = \frac{1}{N(K-1)}$
For $t = 1, \ldots, T$ do
 (a) Train classifier with $\mathbf W_t$ and get $h_t$; set the elements of $\mathbf D_t$ as in (4.19) and $\gamma_t = -\mathbf W_t \bullet \mathbf D_t$.
 (b) Set $\eta_t = \frac{\gamma_t}{L}$.
 (c) Update weights: $\nabla\mathcal R(\mathbf Z_{t+1}) = \nabla\mathcal R(\mathbf Z_t) + \eta_t\mathbf D_t$ (lazy update)
  $\nabla\mathcal R(\mathbf Z_{t+1}) = \nabla\mathcal R(\mathbf W_t) + \eta_t\mathbf D_t$ (active update)
 (d) Project onto $\mathcal S$: $\mathbf W_{t+1} = \arg\min_{\mathbf W \in \mathcal S} B_{\mathcal R}(\mathbf W, \mathbf Z_{t+1})$
End
Output: Final hypothesis $f(\mathbf x)$: $H(\mathbf x, l) = \sum_{t=1}^T \eta_t\,\mathbf 1\big(h_t(\mathbf x) == l\big)$, $\quad f(\mathbf x) = \arg\max_l H(\mathbf x, l)$

It is noteworthy that using the KL-divergence as the Bregman divergence in Algorithm 4.4 gives a version of the ADABoost.MM algorithm introduced in [MS13], with slightly different values of $\eta_t$.
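The final decision rule of Algorithm 4.4 is a weighted vote over the weak hypotheses' predicted labels. A small sketch of that output step (assuming scikit-learn-style classifiers with a predict method; the names are illustrative, not the thesis code):

```python
import numpy as np

def mu_maboost_vote(hypotheses, etas, Xq, K):
    # Final Mu-MABoost decision: H(x, l) = sum_t eta_t * 1(h_t(x) == l),
    # then predict the class with the largest weighted vote.
    H = np.zeros((Xq.shape[0], K))
    for eta, h in zip(etas, hypotheses):
        preds = h.predict(Xq)              # integer class labels in {0, ..., K-1}
        H[np.arange(Xq.shape[0]), preds] += eta
    return H.argmax(axis=1)
```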

4.6 Classification Experiments with Boosting

In this section, the experiments that were run to evaluate the boosting algorithms presented in this work are described. For the evaluation of the binary and multiclass algorithms, 13 datasets from the UCI repository were used. They contain all combinations of real and discrete features, are drawn from various real-world problems, and have been regularly used in previous works as learning benchmark problems. Tables 4.1 and 4.2 list the binary and multiclass datasets and their descriptions, respectively, and provide some properties of these datasets.

Dataset Name      Features   Training samples   Test samples
Breastcancer      9          549                150
German-credit     20         700                300
Sonar             60         158                50
Gisette*          100        40 × 100           2000
Pima-diabetes     8          500                168
House-votes-84    16         335                100
Thyroid-disease   25         1500               500

Table 4.1: Binary datasets description

In all experiments with the binary datasets, the experiments were run 20 times over randomly selected training and test sets with the sample numbers specified in Table 4.1, and the results were averaged. The only exception to this test procedure is the Gisette dataset, whose original feature vector dimension is 5000. For this dataset, we first selected the 100 features with the highest mutual information values with the class labels and then set aside 2000 samples as the test set.

Dataset Name   Features   Training samples   Test samples   Classes
Abalone        8          3177               1000           28
Connect-4      42         57577              10000          3
Car            6          1228               500            4
Forest         54         20000              10000          7
Letter         16         15000              5000           26
Poker          10         15010              10000          10

Table 4.2: Multiclass datasets description.

The remaining 4000 samples were then divided into 40 training sets, each containing 100 samples. Each algorithm was then run on these training sets and its test errors were averaged. For the multiclass problems, however, most sets have more than 10000 training samples and 5000 test samples, giving sufficient confidence in the test results. Thus, the experiments were run only once. Moreover, the Forest dataset is too large and contains more than 500000 samples. Thus, only 20000 of its samples were used for training and 10000 for testing.

4.6.1 Binary MABoost Experiments

In the first set of experiments, the performance of the algorithms in the presence of mislabeling noise was evaluated. Moreover, two different weak learners were utilized in MABoost: conventional decision trees and a special implementation of decision trees. In this implementation, at each round of training a small set of features (precisely speaking, $\sqrt M$ features, where $M$ is the total number of features) was selected. A random cost was then assigned to each of the chosen features. These costs were uniformly drawn from $\mathcal U(1,4)$ and are in fact scaling coefficients to be applied when considering splits, so the improvement of splitting on a feature is divided by its cost when deciding which split to choose [TA+97]. Given these costs, a decision tree was then grown with the selected features. This particular implementation of the weak learner can be found in the MABoost R package available on CRAN [Nag14].

Data set          MABoost   MA-Forest   Real-ADA   RF
Breastcancer      3.96      3.21        3.87       3.08
German-credit     23.24     24.82       23.36      23.89
Votes-84          3.99      3.90        3.96       3.71
Pima-diabetes     23.55     24.46       23.59      23.53
Thyroid-disease   2.53      3.12        2.60       2.83
Sonar             16.23     16.54       13.88      17.11
Gisette           11.95     11.04       11.73      11.78

Table 4.3: The test errors (in percent) of binary classification with no mislabeling noise for four algorithms. Each algorithm grows 500 decision trees with a maximum size of 63. MA-Forest, Real-ADA and RF stand for MABoost-Forest, real ADABoost and Random Forest, respectively.

We empirically found that this random-forest-type weak learner usually reduces the generalization error. Each of the algorithms grew and combined 500 trees, and the size of the trees was limited to 63.

Data set           MABoost   MA-Forest   Real-ADA     RF
Breast cancer        4.48       3.89        3.94      4.60
German-credit       25.80      25.81       25.40     25.83
Votes-84             5.41       4.83        5.62      5.76
Pima-diabetes       25.16      25.22       25.36     25.78
Thyroid-disease      5.07       4.80        5.11      5.01
Sonar               21.27      19.87       21.67     20.00
Gisette             15.85      14.20       16.98     14.15

Table 4.4: The test errors in percentage of binary classification with 15% mislabeling noise in the training data. Each algorithm grows 500 decision trees with size limited to 63. A bold value in each row represents the lowest test error for that dataset.

Tables 4.3 and 4.4 report the performance of MABoost, the random-forest-type MABoost (called MABoost-Forest), real ADABoost and random forest. In the first scenario (reported in Table 4.3), the mislabeling noise was zero, while in the second scenario 15% of the training samples had wrong labels. As can be seen, in the absence of noise, random forest works better than the boosting methods for three out of seven datasets. Due to the inherent robustness of random-subspace-selection based ensemble methods such as random forest against mislabeling noise, it would be expected that random forest wins in the noisy scenario as well. Surprisingly, MABoost-Forest yields significantly better performance than the other methods in the presence of 15% mislabeling noise. MABoost-Forest gives higher accuracy for four out of seven datasets, and for the three that it does not win, its accuracy is close to that of the winner. This is perhaps due to the fact that MABoost-Forest inherits the robustness of random forest while taking advantage of the bias-correction power of boosting methods.

4.6.2 Experiment with SparseBoost

Reducing the memory complexity is the main advantage of the SparseBoost algorithm. SparseBoost reduces the memory complexity by utilizing only a fraction of the training samples at each round of boosting. In our experiments, the efficiency of the SparseBoost algorithm in terms of memory complexity reduction is evaluated. The second column of Table 4.5 reports the average sparsity ratio of the weight vectors. The sparsity ratio is defined as the ratio of the number of zero weights to the total number of weights, and the average was taken over the sparsity ratios of the 200 weight vectors generated during the boosting process. This number is reported in percentage.
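As a small illustration of how the reported numbers are obtained, the average sparsity ratio can be computed from the per-round weight vectors as follows (the variable names are ours):

    import numpy as np

    def average_sparsity_ratio(weight_vectors):
        # weight_vectors: the per-round weight vectors produced by SparseBoost.
        ratios = [np.mean(w == 0.0) for w in weight_vectors]   # fraction of zero weights
        return 100.0 * float(np.mean(ratios))                  # in percent, as in Table 4.5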

Data set           Sparsity ratio %   Training err %   Test err %
Breast cancer           67.03              0               4.13
German-credit           25.62              0              25.00
Votes-84                49.35              0.59            4.00
Pima-diabetes           25.22             25.36           25.78
Thyroid-disease         40.30              0.03            4.73
Sonar                   42.57              0              24.30
Gisette                 47.84              0              13.70

Table 4.5: The table shows the results of the experiments with the SparseBoost algorithm. All values are in percentage.

As can be seen in Table 4.5, for most of the datasets, at each round of boosting over 40% of the samples were not used. For the Breast cancer dataset, the sparsity ratio is particularly interesting: only 33% of the training samples per round were necessary to construct a perfect classifier in the sense of having 100% accuracy on the training set. However, this complexity reduction comes at the expense of a higher generalization error. Comparing the accuracies reported in Table 4.3 and Table 4.5 reveals that the SparseBoost algorithm consistently achieves lower classification accuracy than MABoost, mostly due to the fact that at each round of boosting, SparseBoost constructs an easier dataset by excluding some of the samples from the training data. Even though the easy samples might have small weights under other weighting mechanisms (such as ADABoost), they still affect the training procedure: they act as a regularizer by preventing the algorithm from perfectly fitting the hard samples. Hence, completely excluding them from the dataset increases the chance of overfitting, which consequently reduces the generalization power.

4.6.3 Experiment with CD-MABoost

Assume that for a particular classification task, several datasets with different qualities are given. The quality of a dataset depends on several factors such as the level of feature noise and the frequency of labeling errors. When the quality of the datasets is approximately known, we may incorporate this information into our learning procedure by restricting the weights of the low-quality samples. As discussed in Section 4.4.3, CD-MABoost takes this approach to combine datasets. In CD-MABoost, the distribution over the good-quality samples is unconstrained, while the weights of the low-quality samples are limited to be less than a given smoothness threshold 1/k.

An experiment was run to show the effectiveness of this method for combining datasets of different quality. To this end, an artificially generated dataset suggested by Long and Servedio in [LS10] was utilized. This dataset (called the L-S dataset in this chapter) contains 4000 training samples (no test set is needed in this experiment). Each sample (x, y) in this dataset is generated as follows. First the label y is chosen randomly from {−1, 1}. There are 21 features x_1, ..., x_21 that take values in {−1, 1}. The features of 1000 of the samples (called large margin samples) are set as x_1 = ... = x_21 = y. Another 1000 samples (called puller samples) are set to x_1 = ... = x_11 = y and x_12 = ... = x_21 = −y. The rest of the samples (i.e., 2000 samples), which are called penalizers, are chosen in three stages: (I) the values of a random subset of five of the first eleven features x_1, ..., x_11 are set equal to y, (II) the values of a random subset of six of the last ten features x_12, ..., x_21 are set equal to y, and (III) the remaining ten features are set to −y.

This data, as it is, can be fully learned by an ensemble of decision stumps. However, as has been discussed in [LS10], by adding only 10% mislabeling noise to this dataset (by randomly flipping the labels), no boosting algorithm with a convex loss function can learn the underlying hypothesis. In fact, with only 10% mislabeling noise in the data, none of the commonly used boosting methods such as MADABoost, ADABoost and LogitBoost can drive the training error below 27%. By means of this dataset it will be shown below that our CD-MABoost algorithm (see Section 4.4.3) does not have this limitation.

Assume two L-S datasets are given. One of these datasets is corrupted by 20% mislabeling noise while the other is clean, i.e., if these two datasets are combined, we have 10% mislabeling noise in the final dataset.
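A minimal sketch of the Long-Servedio construction described above is given below; the split into 1000 large-margin, 1000 puller and 2000 penalizer samples follows the description, and the optional label flipping implements the mislabeling noise.

    import numpy as np

    def long_servedio(n=4000, noise=0.0, seed=0):
        rng = np.random.default_rng(seed)
        X, y = np.empty((n, 21)), np.empty(n)
        for i in range(n):
            label = rng.choice([-1.0, 1.0])
            if i < n // 4:                                   # large margin samples
                X[i, :] = label
            elif i < n // 2:                                 # puller samples
                X[i, :11], X[i, 11:] = label, -label
            else:                                            # penalizer samples
                X[i, :] = -label
                X[i, rng.choice(11, 5, replace=False)] = label        # 5 of x_1..x_11
                X[i, 11 + rng.choice(10, 6, replace=False)] = label   # 6 of x_12..x_21
            y[i] = label
        y[rng.random(n) < noise] *= -1                       # random label flips
        return X, y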

Figure 4.1: The training error of CD-MABoost over the Long-Servedio dataset with 10% labeling error for different values of k (the weights of the mislabeled samples are restricted to be smaller than 1/k). Curves are shown for k = 5800, 6300, 6800 and 8000 over 2000 boosting rounds.

While, as extensively discussed in [LS10], it is impossible to learn this dataset with a convex-loss-based boosting method, it is shown here that by restricting the weights of the corrupted dataset (to be less than 1/k), this dataset can be effectively learned by CD-MABoost. In Figure 4.1 the training error of this boosting method is shown as a function of the number of boosting rounds for different values of k. It is not surprising that as the value of k increases (i.e., the constraint gets more restrictive), the number of training rounds needed for convergence decreases. However, if the value of 1/k is set to a rather large value (i.e., the weight constraint is weakened), the algorithm may never converge. In this example CD-MABoost does not converge for k < 5800, which translates to the weight constraint w_i ≤ 1/5800 ≈ 0.000172.
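The weight restriction itself can be illustrated with the following crude stand-in for CD-MABoost's constrained update: the weights of the low-quality samples are clipped at 1/k and the remaining mass is redistributed over the unconstrained samples. The exact Bregman projection used by CD-MABoost is more involved; this is only meant to convey the mechanism.

    import numpy as np

    def cap_low_quality_weights(w, low_quality, k):
        """w: distribution over samples (sums to 1); low_quality: boolean mask."""
        w = np.asarray(w, dtype=float).copy()
        w[low_quality] = np.minimum(w[low_quality], 1.0 / k)    # enforce w_i <= 1/k
        free = ~low_quality
        if w[free].sum() > 0:
            w[free] *= (1.0 - w[low_quality].sum()) / w[free].sum()  # renormalize
        return w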

4.6.4 Multiclass Classification Experiments

In this section we report several experiments that we ran to evaluate the performance of the proposed multiclass boosting methods. Table 4.2 lists the multiclass datasets and their descriptions. The weak learners used in the first set of experiments were decision trees of size 5 and decision trees whose depth was restricted to be less than or equal to 8, that is, trees with a maximum size of 2^9 − 1. The performance of the algorithms Mu-MABoost, SAMME and ADABoost.M1 was compared and the obtained error rates are shown in Table 4.6. As can be seen, when the tree size is small, Mu-MABoost achieves significantly better results (apart from the Car dataset) than ADABoost.M1 and SAMME due to its optimal weak-learning condition. Compared with ADABoost.M1, the Mu-MABoost algorithm does not expect high accuracy from the weak classifiers and thus can further drive the test error down. SAMME, on the other hand, uses an even weaker condition than Mu-MABoost. However, since this condition is not boostable, some of the weak classifiers in the SAMME ensemble in fact deteriorate the final classification accuracy and prevent the boosting method from driving the training error down. As the complexity of the trees increases, the gap between the performance of Mu-MABoost and ADABoost.M1 decreases, since it is now more likely that a weak learner can satisfy the overly difficult demands of the ADABoost.M1 booster.

Data set     Mu-MABoost   ADABoost-M1   SAMME
depth of decision trees ≤ 5
Abalone          76.60         95.10     87.70
Connect-4        29.07         43.51     48.00
Car               6.20          2.00      3.00
Forest           29.07         38.26     40.18
Letter           31.62         61.72     64.52
Poker            40.40         43.47     51.93
depth of decision trees ≤ 8
Abalone          73.90         75.10     78.70
Connect-4        27.81         30.56     39.54
Car               3.40          2.40      1.40
Forest           25.96         25.36     25.78
Letter            3.90          7.32      3.15
Poker            37.88         38.65     45.38

Table 4.6: The reported values are the test error rates of the multiclass classifiers in percentage after 500 rounds of boosting.


Figure 4.2: Plots of the rates at which Mu-MABoost drives down the training and test errors on the Abalone, Connect-4, Car, Forest, Letter and Poker datasets over 100 rounds of boosting. Simple decision trees of size 5 are used as weak learners in Mu-MABoost. Since it expects an arguably low accuracy from the weak classifiers, boosting can continue even with such a simple weak learner. The second and fourth rows illustrate the coefficients of the weak classifiers in the final ensembles.

Figure 4.2 illustrates the training and test errors of the Mu-MABoost algorithm for the different datasets. Moreover, the coefficients of the weak classifiers are shown in the second and fourth rows of Figure 4.2. As boosting continues, the algorithm creates harder distributions, and very soon the small-tree learning algorithm (CART in our case) may no longer be able to meet the excessive requirements of unnecessarily strong weak-learning conditions such as that of ADABoost.M1. However, the Mu-MABoost algorithm makes more reasonable demands that are easily met by CART. As can be observed, the coefficients of the weak classifiers are mostly greater than zero, meaning that the weak-learning condition has been satisfied.

4.7 Conclusion and Discussion

In this chapter, a framework that can produce provable boosting algorithms was presented. Given a boostable classifier space H, a provable boosting algorithm is a meta-algorithm that can provably drive the training error to zero by repeatedly calling a base learning algorithm, each time feeding it with a different distribution or set of weights over the training samples. The base learning algorithm returns a weak classifier selected from H, and the boosting algorithm assigns to it a coefficient that is proportional to the weak classifier's performance. The final strong classifier is then a linear combination of these weak classifiers.

The proposed framework is suitable for designing boosting algorithms with distribution constraints, i.e., deriving boosting algorithms for applications where it is required to put more weight on some specific subset of examples than on others, either because these examples have more reliable labels or because it is a problem requirement to have a more accurate hypothesis for that part of the sample space.

Several new boosting algorithms were derived from our proposed framework. In particular, a sparse boosting algorithm (SparseBoost), which selects only a fraction of the examples at each boosting round and hence efficiently reduces the required memory during the training process, was introduced. In fact this problem, i.e., applying batch boosting on successive subsets of data when there is not sufficient memory to store the entire dataset, was first discussed by Breiman in [Bre97], though no algorithm with a theoretical guarantee was suggested. SparseBoost is the first provable batch booster that can (partially) address this problem. However, since our proposed algorithm cannot control the exact number of zeros in the weight vector, a natural extension of this algorithm is to develop a boosting algorithm that receives the sparsity level as an input. This immediately raises the question: what is the maximum number of examples that can be removed from the dataset at each round, while still achieving a (1 − ǫ)-accurate hypothesis?

We introduced the CD-MABoost algorithm as the first boosting method that can take the quality of the datasets into account when several datasets of different quality are utilized to learn a particular task. The quality of a dataset depends on many factors including the level of feature noise and mislabeling noise. When the quality of the datasets is (qualitatively) given, this information can be incorporated into the learning process by restricting the importance or weights of the low-quality samples. As shown in the experiments, while no other boosting method (with a convex loss function) can learn the binary classification task introduced in [LS10], CD-MABoost can easily learn the optimal hypothesis by restricting the weights of the noisy samples.

The boosting framework derived in this work is essentially the dual of the online mirror descent algorithm. This framework can be generalized in different ways. Here, it was shown that replacing the Bregman projection step with the double-projection strategy, or, as it was called, the approximate Bregman projection, still results in a boosting algorithm in the active version of MABoost, though this may not hold for the lazy version. In some special cases (MADABoost for instance), however, it can be shown that this double-projection strategy works for the lazy version as well.
Our conjecture is that under some conditions on the first convex set (i.e., the convex set in Lemma 4.8, which is assumed to be larger than the probability simplex S_K), the lazy version can also be generalized to work with the approximate projection operator.

A new error bound was provided for the MADABoost algorithm that does not depend on any assumption. Contrary to the common conjecture, the convergence rate of MADABoost (at least with our choice of η) is of O(1/ǫ²). It is, however, still an open question whether this bound is tight or can be improved.

Finally, the MABoost framework was generalized to the multiclass setting and it was shown that this framework adopts the minimal weak-learning condition needed to boost the weak-learning space. That is, it was proven that it can employ very simple weak classifiers while still perfectly learning the underlying hypothesis, or in other words, driving the training error down to zero. By using very weak classifiers in the ensemble, we restrict the complexity of the model and thus reduce the overfitting effect. This property, i.e., robustness against overfitting, is an essential requirement in our lipreading application, where feature vectors are drawn from a high-dimensional space.

Chapter 5

Visual Voice Activity Detection

Voice activity detection (VAD) is a necessary stage in most speech-related applications such as speech recognizers (to identify which audio frames need to be processed) and human-machine interaction systems (to detect human activities). While using a reliable VAD may significantly improve their performance, it is not easy to guarantee the VAD accuracy, especially in adverse conditions such as highly reverberant rooms, non-stationary speech-type background noise, etc.

As technology advances, it is becoming very cheap, both computationally and economically, to capture video streams. Thus, a natural way of improving audio-based VAD (A-VAD) systems is to utilize visual information as complementary information. However, most of the visual VAD (V-VAD) systems reported in the literature suffer from speaker-dependency issues and inaccurate detection rates when used without audio features.

In this Chapter, by using the SIFT features and the bag-of-words (BoW) model explained in Chapter 2 and the binary classifier developed in Chapter 4, we construct a V-VAD system that, first, is highly speaker-independent and, second, can achieve high accuracy (78% frame-based detection rate, on average). Moreover, we show that this system can be trained in a semi-supervised manner. Since manually labeling speech boundaries in audio-visual data is not only a labor-intensive and time-consuming task but also subject to human errors and interpretations, having a system that can be trained in a semi-supervised manner is highly desirable. In order to obtain a semi-supervised AV-VAD, we develop a learning algorithm that can detect noisy samples whose labels are randomly flipped. This algorithm obtains up to a 95% detection rate on some datasets, which is very promising and can perhaps be used in a much wider range of applications.

5.1 Introduction to Utterance Detection

To efficiently use visual information, two challenging issues have to be addressed. The first issue in V-VAD systems is the relatively high computational complexity of the feature extraction part. In general, all visual speech processing systems require a region of interest (ROI) from which visual features can be extracted. Visual feature extraction algorithms roughly fall into two categories: those that use a crude estimate of the ROI to extract visual features and those that utilize more advanced methods such as active appearance models [CET01] or active shape models [LTB96] to match an exact ROI location or extract an ROI contour [QWP11]. Clearly, the second category may yield more precise or informative visual features at the expense of higher computational complexity. In addition, both groups can benefit from an ROI tracking system to improve robustness [AMM+07], which again increases the computational complexity. Here we use the real-time algorithm presented in [DGFVG12] for mouth detection. Due to the computational efficiency of this algorithm, the amount of time spent on ROI extraction in our V-VAD algorithm is favorably small.

The second and even more challenging issue of V-VADs is speaker variability. Speaker independence is a fundamental requirement in typical real-world V-VAD applications. In audio-based VADs, and more generally in audio-only speech recognition, speaker variability has been well studied. It has been shown that audio features (mainly MFCCs for speech recognition, and relative energy level and zero crossing rate for VADs) are highly speaker-independent. However, it is still a big challenge to develop a speaker-independent V-VAD and, more generally, a lipreading algorithm that can cope with speaker variability (see Chapter 6 and [CHLN08]).

It is known that most commonly applied visual features such as the DCT or PCA of the mouth region mainly represent speaker characteristics rather than speech events. Thus, most of the currently available V-VADs cannot reliably work in the speaker-independent mode [MLS+13], [AM08], [YNO09]. For instance, the performance of the V-VAD proposed in [AM08] drops from 97% to 72%, which is only 7% above chance considering that 65% of the samples in their dataset are speech.

Here, we propose a V-VAD which is highly robust against speaker variability. Experiments show that the proposed V-VAD algorithm can obtain a 78% frame-based detection rate on the GRID dataset, which is much higher than that of the commonly used approaches that employ Gaussian mixture models trained with DCT, PCA or LDA features extracted from video frames. Moreover, it is shown that this algorithm can be trained in a semi-supervised manner. This useful property can be exploited to automatically adapt the algorithm to new test conditions.

5.2 Supervised Learning: VAD by Using SparseBoost

Throughout this section, we develop a V-VAD by means of the SparseBoost algorithm listed in Algorithm 4.2. A SparseBoost classifier is trained in a supervised manner to classify speech and non-speech video frames.

From each ROI, a set of SIFT feature vectors was extracted. The SIFT feature vectors of all frames were then clustered into 300 groups. The set of cluster centers was considered to be a codebook with 300 codes. This codebook was used in a BoW model to construct 300-dimensional feature vectors. To reduce the computational complexity, the SIFT features were only computed for a subregion whose area was roughly 1/5 of the detected mouth region and which was a rectangle around the middle of the mouth. Since this limited region is sufficient to determine whether the mouth is open or closed, it provided sufficient statistics to learn the desired hypothesis. This reduction in ROI size improved the computational efficiency of the algorithm and helped the classifier to find a more accurate hypothesis by filtering out redundant information. Figure 5.1 depicts the ROI and the visual features used in our V-VAD algorithm.

To evaluate the proposed V-VAD algorithm, the GRID dataset described in [CBCS06] was used in our experiments. Due to the large size of this dataset, we only employed the audio-visual data of the first 16 speakers.
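The codebook construction and the per-frame BoW mapping described above can be sketched as follows; SIFT extraction itself (e.g., with OpenCV) is omitted, the 300-cluster k-means mirrors the procedure in the text, and the helper names are ours.

    import numpy as np
    from sklearn.cluster import KMeans

    def build_codebook(descriptor_sets, n_codes=300, seed=0):
        """descriptor_sets: list of (n_i, 128) SIFT descriptor arrays, one per frame."""
        all_descriptors = np.vstack(descriptor_sets)
        return KMeans(n_clusters=n_codes, random_state=seed, n_init=10).fit(all_descriptors)

    def bow_vector(descriptors, codebook):
        """Map a variable-size set of SIFT descriptors to a fixed 300-dim histogram."""
        codes = codebook.predict(descriptors)
        hist = np.bincount(codes, minlength=codebook.n_clusters).astype(float)
        return hist / max(hist.sum(), 1.0)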

Figure 5.1: (a) The mouth region is automatically detected by means of the regression-based random forest algorithm proposed in [DGFVG12]. However, only a small region from the middle of the mouth is effectively used in the V-VAD. (b) SIFT features are extracted from the middle of the mouth. The set of extracted SIFT feature vectors is mapped to a 300-dimensional BoW feature vector representing the frame. First- and second-order derivatives of the BoW features are also included in the feature set to construct the final visual feature vector. SparseBoost then takes this vector to decide whether the frame is a speech or non-speech frame.

The recordings of the first 9 speakers, i.e., s1 to s9, are always included in the training data. We then applied the one-speaker-out cross-validation technique over the second set of speakers (s10 to s16) to evaluate the proposed methods. At each round of cross-validation, the data from 6 of the second 7 speakers (s10 to s16) was added to the training data and one speaker was left out for testing. Each video sample of the GRID data is 3 seconds long. To have an almost balanced dataset (i.e., almost the same number of speech and non-speech frames), only the last two seconds of each recording were used. This gave a dataset with 57% speech and 43% non-speech frames.


Figure 5.2: Comparison of the frame-based speech detection rate of the proposed approach (SparseBoost+SIFT) with the GMM+DCT method for speakers s10 to s16.

In the first experiment, the performance of the GMM+DCT based V-VAD and our proposed approach (called SparseBoost+SIFT) were compared. In the GMM+DCT based V-VAD approach, the forty highest-energy DCT features and their first- and second-order derivatives were extracted from the ROIs. Two Gaussian mixture models, each with 4 Gaussian components, were then trained to model the speech and non-speech frames. This approach is commonly taken in the literature for both audio- and visual-based VADs [SKS99], [AM08]. We used simple decision trees with depth limited to 5 as the weak learners in the SparseBoost algorithm. Figure 5.2 shows the performance of these approaches for every speaker. As shown, the proposed method significantly outperforms the commonly used GMM+DCT based V-VAD approach. As seen, the GMM-based approach performs even worse than random guessing on speaker 12. This is due to the fact that DCT features are highly speaker-dependent and a generative method such as a GMM cannot extract the relevant information from the dominant irrelevant information.
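For reference, the GMM+DCT baseline amounts to fitting one mixture per class and comparing frame log-likelihoods. A minimal sketch is given below; the DCT feature extraction is omitted and scikit-learn mixtures stand in for whatever GMM implementation was actually used.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_gmm_vad(X_speech, X_nonspeech, n_components=4, seed=0):
        gm_speech = GaussianMixture(n_components=n_components, random_state=seed).fit(X_speech)
        gm_silence = GaussianMixture(n_components=n_components, random_state=seed).fit(X_nonspeech)
        return gm_speech, gm_silence

    def gmm_vad_decide(X, gm_speech, gm_silence):
        # Label a frame as speech if its likelihood under the speech model is higher.
        return (gm_speech.score_samples(X) > gm_silence.score_samples(X)).astype(int)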


Figure 5.3: The accuracy of audio-visual VAD, A-VAD and V-VAD in percentage. A-VAD is trained with high-SNR audio data, containing only clean and 10 dB audio samples.

The same observation regarding the poor performance of DCT features in speaker-independent settings has also been reported in [AM08] and [Gur09]. Particularly, in [Gur09] it was shown that the GMM+DCT method used to construct the silence model yielded very poor results and led to many deletions in continuous speech recognition. The mean accuracy of our method is 78% ± 2.24, where 2.24 is the standard deviation calculated over the speakers. To the best of our knowledge, this method yields one of the best V-VAD systems reported in the literature in terms of both accuracy and speaker independence.

Given this robust V-VAD, the next natural step is to optimally combine it with an audio-based VAD to obtain an AV-VAD that can accurately perform in adverse environments. As in the V-VAD, we utilized the SparseBoost algorithm to train an A-VAD system. The 13-dimensional MFCC features and their first- and second-order derivatives were extracted from frames of 65 ms duration with 25 ms overlap. That is, similar to the video data, the audio frame rate is 25 frames per second.
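The audio feature extraction can be reproduced roughly as follows (a sketch using librosa; the exact MFCC configuration and the 16 kHz sampling rate are assumptions on our part):

    import numpy as np
    import librosa

    def audio_features(wav_path, sr=16000):
        """13 MFCCs plus first- and second-order derivatives, 65 ms windows with a
        40 ms hop (i.e., 25 ms overlap), giving 25 feature vectors per second."""
        y, sr = librosa.load(wav_path, sr=sr)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                    n_fft=int(0.065 * sr), hop_length=int(0.040 * sr))
        d1 = librosa.feature.delta(mfcc)
        d2 = librosa.feature.delta(mfcc, order=2)
        return np.vstack([mfcc, d1, d2]).T   # shape: (n_frames, 39)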


Figure 5.4: Audio signals at various signal-to-noise ratios (clean, 10, 5, 0, −5, −10, −15 and −20 dB). At −20 dB, the signal is completely dominated by noise.

In the first set of experiments, we trained the A-VAD with a training set in which half of the audio signals were clean and the other half were 10 dB audio signals (constructed by adding white noise to them). The audio and visual VADs were independently trained and their decisions were then fed to an ensemble of decision trees, with 100 trees, to make the final decision. This fusing strategy is known as the late decision fusion method. As seen in Figure 5.3, while the AV-VAD slightly improves over the A-VAD system in the low-SNR region, at most SNR values it performs worse than or on a par with the A-VAD, which, considering its higher memory and computational cost, is not acceptable. More importantly, even at the low SNRs where the AV-VAD works slightly better than the A-VAD, its performance is still far worse than that of the simple V-VAD system. This is an indication that the visual information is largely disregarded in the AV-VAD. This happens because in our training data, where the audio signals are either clean or have 10 dB SNR, the A-VAD system is much more accurate than the V-VAD. Thus, the final classifier, which takes the decisions of the audio and visual VADs as input, learns to ignore the V-VAD decision.

To overcome this problem, in the next experiment we trained the A-VAD with a training set containing audio signals at various SNRs (by adding a random amount of white noise to each utterance). Figure 5.4 shows audio signals at various SNRs. It is clear that including −20 dB audio samples in the training set will not result in an A-VAD that can accurately work at −20 dB, simply because at −20 dB it is almost impossible even for a human to detect speech frames. By using this mixed training data, however, the A-VAD system learned to assign very low scores¹ to the samples with low SNRs, or generally to the samples that it cannot classify. In other words, it learned to detect when it fails. The final classifier then learned to use the V-VAD decision for samples whose A-VAD scores were too small.

Figure 5.5 depicts the accuracy of the AV-VAD when the training data contains audio samples at various SNRs. As seen, this AV-VAD shows a significant improvement over the AV-VAD trained with high-SNR audio samples. This amount of improvement, however, may still not justify its higher computational complexity compared with a simple A-VAD system. To further improve the proposed AV-VAD, one may note that in the late fusion strategy, all the information regarding the audio (video) modality is compressed into one single value, i.e., the output of the A-VAD (V-VAD), and is passed to the final classifier. This gives the final classifier very limited degrees of freedom to fuse the audio and visual information. To circumvent this problem, instead of training individual audio and visual VADs, the SparseBoost classifier was trained with super-vectors constructed by concatenating the audio and visual features. By observing both audio and visual features simultaneously, the classifier could exploit their complementary information and find a hypothesis that takes full advantage of both modalities.
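The difference between the two fusion strategies is easiest to see in code: late fusion passes only the two per-frame VAD scores to the final classifier, while early fusion feeds it the concatenated audio-visual super-vectors. The feature dimensions below follow the description in the text and are otherwise illustrative.

    import numpy as np

    def late_fusion_inputs(a_vad_scores, v_vad_scores):
        # Each modality is compressed into a single score per frame.
        return np.column_stack([a_vad_scores, v_vad_scores])        # (T, 2)

    def early_fusion_inputs(audio_feats, visual_feats):
        # Frame-synchronous concatenation into super-vectors (both streams at 25 fps),
        # e.g. 39-dim audio + 900-dim visual -> 939-dim per frame.
        T = min(len(audio_feats), len(visual_feats))
        return np.hstack([audio_feats[:T], visual_feats[:T]])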
The blue curve in Figure 5.5 shows the performance of the AV-VAD with the early fusion strategy, trained with mixed audio data. As shown, it outperforms the other methods at almost all SNRs, and its accuracy is higher than that of the visual-only or audio-only VAD algorithms.

¹ The output of the A-VAD is assumed to be a continuous value in [−1, 1]. A value close to zero indicates low confidence in the decision.


Figure 5.5: The accuracy of the audio-visual VAD for both late and early fusion, the A-VAD and the V-VAD, in percentage. The A-VAD is trained with mixed audio data, containing audio samples at various SNR values. The AV-VAD with early fusion outperforms the other methods.

5.3 Semi-supervised Learning: Audio-visual VAD

Labeling data is an expensive, time-consuming task, while in many applications unlabeled data is cheap and abundant thanks to the internet. Moreover, in some applications, such as applying VAD to YouTube clips, many environmental variables such as microphone type, camera resolution and background noise change from one video to another. Usually in these situations only unlabeled data is available to adapt a VAD to the varying environment.

Recently, there has been growing interest in developing unsupervised VAD systems [GSM13, KR13, YYDS11]. In this work, we further explore this line of research by taking advantage of the fact that the audio and visual streams give two very different (and conditionally independent) representations of the same underlying phenomenon, i.e., speech. With this observation, our problem can be cast into the now classical two-view unsupervised training framework presented in [BM98]. In this framework, two learning algorithms are applied to the two views of the data (audio and visual) separately, and each algorithm's predictions over the unlabeled data are then used to enlarge the training set of the other algorithm.

It was shown in [BM98] that when a data description can be partitioned into two distinct views, any initial weak predictor (trained with a small set of labeled data) can be boosted to arbitrarily high accuracy by co-training, using only unlabeled examples. The two important assumptions on which this celebrated result rests are: first, the underlying hypothesis is learnable in the presence of labeling noise and, second, the two views of the data are conditionally independent given the label. While the latter assumption is satisfied in our audio-visual application, the first assumption largely depends on the employed learning algorithm. Some learning algorithms are more sensitive to labeling errors than others. For instance, it is known that for each learning algorithm with a convex loss function, there is labeling noise with a particular distribution such that, in the presence of this noise, the algorithm fails to converge to a hypothesis better than random guessing [LS10].

In this work, we develop a learning algorithm, called Ro-MABoost, which can detect and remove mislabeled samples from the training data with high accuracy. We show that this algorithm can detect up to 95% of the labeling errors on some datasets, as long as the errors are randomly induced or appear random from the learning algorithm's point of view, that is, when there is no systematic labeling error or, in other words, when the labeling error does not follow any pattern.

Since Ro-MABoost can be used to learn the underlying hypothesis from a noisy dataset, it is a suitable choice for our semi-supervised application, in which the audio and visual training sets are recursively labeled by imperfect learning algorithms. The proposed semi-supervised AV-VAD training algorithm is outlined in Algorithm 5.1. In the following, we explain in detail the Ro-MABoost learning algorithm used in Algorithm 5.1.

Algorithm 5.1: Semi-supervised audio-visual VAD training

Input: Labeled training data D_l = (D_l^A, D_l^V), unlabeled training data D_u = (D_u^A, D_u^V), convergence threshold ǫ; set ΔE = ∞.

Supervised stage: Train the A-VAD with D_l^A and label D_u with it; set E_1 = error rate of the A-VAD.

While ΔE > ǫ do
  (a) V-VAD: Train Ro-MABoost with D_l^V ∪ D_u^V and output the indices I of the mislabeled samples.
  (b) Omitting mislabeled data: Remove {x_i | i ∈ I} from the training data, D = D_u \ {x_i | i ∈ I}.
  (c) A-VAD: Train Ro-MABoost with D_l^A ∪ D^A and output h_A.
  (d) Re-labeling: Re-label D_u with h_A.
  (e) Stopping criterion: E_2 = error rate of h_A; set ΔE = E_1 − E_2 and E_1 = E_2.
End

Output: h_V and h_A.

5.3.1 Ro-MABoost

As mentioned in Chapter 4, the concept of boosting emerged as an answer to the question of whether a weak learner can be converted into a strong learner with arbitrary accuracy. A boosting algorithm is a meta-algorithm that generates many weak learners with slightly better-than-random performance. The final strong classifier is the ensemble of these weak classifiers. At each round of training, the algorithm concentrates on the samples that were wrongly classified in the previous steps and aims to find a hypothesis that accurately describes them. However, when some training samples have wrong labels, this learning scheme may fail badly.

Algorithm 5.2: Robust Mirror Ascent Boosting (Ro-MABoost)

Input: w_1 = [1/N, ..., 1/N]^T, z_1 = [1/N, ..., 1/N]^T and I = {}

For t = 1, ..., T do
  (a) Train the classifier with w_t and get h_t; let d_t = [−a_1 h_t(x_1), ..., −a_N h_t(x_N)] and γ_t = −w_t^T d_t.
  (b) Set η_t = γ_t / N and I^c = {1, ..., N} \ I.
  (c) If |I| < εN, add to I the indices of the samples whose margins are smaller than Θ.
  (d) Set S = { w | w_i ≥ 0 ∀ i ∈ I^c, w_i = 0 ∀ i ∈ I, Σ_{i=1}^N w_i = 1 }
  (e) Project onto S: w_{t+1} = argmin_{w ∈ S} ||w − w_t − η_t d_t||_2^2
End

Output: The final hypothesis f(x) = sign( Σ_{t=1}^T η_t h_t(x) ).

The effect of random label noise on boosting can be intuitively explained as follows. Assume the data is separable. As the number of training rounds increases, the algorithm assigns higher and higher weights to hard-to-classify examples, which in this case are the wrongly labeled samples. In ADABoost, for instance, the weights of noisy samples increase exponentially fast with respect to the number of training rounds. That is, after a few rounds, the weights of correctly labeled samples are very small compared with those of mislabeled ones, resulting in weak learners that fit the noisy samples.

To overcome this deficiency, we propose a give-up strategy to omit noisy samples during the training process. The algorithm takes two hyperparameters as inputs: first, the noise rate, which is the amount of noise in the data, and second, the margin threshold, which is the minimum margin of the samples that are still considered to be correct. Given these two parameters, at each round of training the algorithm generates a weak classifier and removes the samples whose margins are smaller than the margin threshold. The removal process continues until the total number of removed samples reaches a threshold (computed from the noise rate). From this stage on, the algorithm continues to generate a weak classifier at each round, however without removing any more samples from the training data. It finally returns the ensemble of the weak classifiers as the final classifier when the predefined number of classifiers T is reached. This gives a robust version of the MABoost algorithm, called Ro-MABoost, which can handle noisy datasets.

Let {(x_i, a_i)}, 1 ≤ i ≤ N, be N training samples, where the x_i ∈ X are feature vectors and the a_i ∈ {−1, +1} are labels. Assume h ∈ H is a real-valued function mapping X into [−1, 1]. Denote a distribution over the training data by w = [w_1, ..., w_N]^T and define a loss vector d = [−a_1 h(x_1), ..., −a_N h(x_N)]^T. Assume the rate of labeling noise is ε and the margin threshold below which a sample is considered to be noisy is Θ. Using this notation, Ro-MABoost is outlined in Algorithm 5.2. The set I in that algorithm keeps the indices of the noisy samples, and S is a subspace of the probability simplex in which only some of the dimensions can be nonzero (those that correspond to correct samples). Choosing S as defined in Algorithm 5.2 guarantees that the weights assigned to detected noisy samples are zero, resulting in their exclusion from the training process.

Ro-MABoost can be described as follows. Assume there are 200 weak classifiers in the final ensemble, i.e., T = 200. These weak classifiers can be seen as very simple rules of thumb that roughly classify samples with accuracy just slightly better than random. At each round, Ro-MABoost removes the samples that are inconsistent with most of the already generated rules of thumb. For instance, by properly setting Θ, after generating 100 rules of thumb, Ro-MABoost starts to remove the samples that are wrongly classified by 90% of these classifiers. However, an important requirement for obtaining reasonable results is to ensure that the weak classifiers used in Ro-MABoost are in fact very weak; otherwise, they may even fit the noisy samples. In our implementation, we used decision stumps (decision trees with only one node) as weak classifiers.
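The core of the give-up strategy, flagging the samples whose ensemble margins fall below the adaptive threshold, can be sketched as follows. This is a simplified, one-shot version of what Ro-MABoost does incrementally over the rounds, and the helper name is ours; the default theta_scale of 0.2 mirrors the adaptive threshold used in the experiments of Section 5.3.2.

    import numpy as np

    def flag_noisy_samples(weak_preds, etas, y, noise_rate, theta_scale=0.2):
        """weak_preds: (t, N) outputs of the weak classifiers generated so far,
        etas: their coefficients, y: labels in {-1, +1}."""
        scores = (np.asarray(etas)[:, None] * np.asarray(weak_preds)).sum(axis=0)
        margins = y * scores
        theta = theta_scale * margins.mean()          # adaptive margin threshold
        budget = int(noise_rate * len(y))             # remove at most eps * N samples
        order = np.argsort(margins)                   # smallest margins first
        flagged = [i for i in order if margins[i] < theta][:budget]
        return np.asarray(flagged, dtype=int)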

5.3.2 Experiments on Supervised & Semi-supervised VAD

In the first set of experiments, the accuracy of Ro-MABoost in detecting noisy samples was evaluated. Six of the binary datasets described in Table 4.1 were employed in these experiments. The algorithm was tested in the presence of 3 different noise rates (ε): 10%, 20% and 30%. In each experiment, we randomly flipped a fraction of the labels according to the given noise rate ε and used this noisy dataset to train a classifier with Ro-MABoost. Throughout the training process, Ro-MABoost flags a fraction ε of the samples as noisy. Table 5.1 reports the percentage of the samples identified as noisy by Ro-MABoost that were in fact mislabeled, i.e., the accuracy of Ro-MABoost in detecting mislabeled samples.

Dataset            10% noise   20% noise   30% noise
Breast cancer        95.71       93.57       84.76
German-credit        79.00       69.50       57.00
Votes-84             95.45       94.25       86.92
Pima-diabetes        57.14       58.44       46.08
Thyroid-disease      96.50       94.25       87.50
Sonar                52.38       66.66       40.32

Table 5.1: The accuracy of Ro-MABoost in detecting the mislabeled samples in the presence of various percentages of labeling noise, reported in percentage. For instance, on the Breast cancer dataset, Ro-MABoost correctly detects 95.71% of the mislabeled samples.

The margin threshold used in these experiments was adaptive and was set to 0.2 times the mean margin (the average of the margins of all samples) at each round. As seen in Table 5.1, Ro-MABoost achieves impressively high detection accuracy on most datasets. On 3 out of the 6 datasets, Ro-MABoost can detect more than 95% of the mislabeled samples in the presence of 10% labeling noise and obtains more than 84% detection accuracy when the noise rate is as large as 30%. The success of Ro-MABoost in detecting mislabeled samples suggests further applications of this algorithm, including improving search results in search engines by detecting irrelevant results and improving the quality of training data.

In the next set of experiments we evaluated the performance of the semi-supervised AV-VAD algorithm listed in Algorithm 5.1. In this test, the utterances of speaker 12 were considered test data and the rest of the first 16 speakers were used as training data. 1.25% of the samples were randomly selected to construct the labeled training data and the remaining samples (i.e., 98.75%) were taken as unlabeled training data D_u. The audio training set was a mix of audio signals at various SNRs (a random amount of white noise was added to each utterance) and the audio test set consisted of the audio recordings from speaker 12 with additive noise such that their SNR values were around 0 dB.


Figure 5.6: The accuracy of the semi-supervised A-VAD and V-VAD in percentage. After 7 iterations, the A-VAD reaches an 88.25% detection rate, which is only 1% lower than the A-VAD accuracy obtained by supervised training.

Figure 5.6 shows the test performance of the A-VAD and V-VAD over 13 iterations of training. The value at round 0 corresponds to the accuracy of the initial A-VAD (which is trained on the given small set of labeled data). The horizontal lines in Figure 5.6 show the accuracy of the A-VAD and V-VAD over the test set when 100% of the training samples were labeled and used in training. They serve as reference points for our semi-supervised training; the aim is to get as close as possible to these optimal upper bounds. As seen, the optimal A-VAD accuracy is 89.30%. By using semi-supervised training we obtained an 88.25% detection rate, which is only slightly lower than that of the fully labeled case. The V-VAD classifier also reaches up to 87.4%, which is only 0.1% lower than the 87.5% accuracy of the fully labeled case. Moreover, Algorithm 5.1 seems to converge after a relatively small number of iterations, since the performance did not improve significantly after the 8th iteration.

The only requirement of our semi-supervised AV-VAD training is to have an initial A-VAD. This A-VAD can either be obtained by training a classifier with a small set of labeled data (in which case the procedure is better categorized as semi-supervised training) or by simply using a threshold-based A-VAD (using energy as the main feature) as the initial A-VAD. Both interpretations can be of interest to practitioners. Particularly, in applications where adapting the AV-VAD to time-varying environments is a requirement, the already available A-VAD can be considered the initial A-VAD and Algorithm 5.1 can be used for adaptation.

5.4 Conclusion

Throughout this Chapter, we developed a V-VAD system that is highly speaker-independent and can achieve high accuracy. The proposed V-VAD is obtained by training a MABoost classifier (described in Chapter 4). The final V-VAD is an ensemble of 4000 very simple decision trees (with depth less than 5). Even though the dimensionality of the feature vector space is fairly large (900), the proposed V-VAD achieves a low generalization error due to the robustness of MABoost to overfitting.

We developed a robust version of the MABoost algorithm which can detect mislabeled samples in a training set. We showed that on some datasets it can detect up to 95% of the mislabeled samples when the labels are independently flipped with some small probability, i.e., when the label noise is not systematic (and thus not learnable).

Furthermore, we derived a semi-supervised AV-VAD framework that can be used either to adapt a pretrained AV-VAD to a new environment or to train an AV-VAD system on an unlabeled dataset by using an initial threshold-based A-VAD. This framework is based on the co-training algorithm proposed in [BM98] and our mislabeled-sample detection algorithm, Ro-MABoost. In the experiments we showed that by using this framework, both the A-VAD and the V-VAD obtain detection accuracies close to those of the fully labeled case where all training samples are labeled.

Chapter 6

Lipreading with High-Dimensional Feature Vectors

This Chapter is devoted to developing a lipreading system based on the feature sets and the learning methods introduced in the previous Chapters. In this system, 3D-SIFT features are extracted from each video. These features are then used in bag-of-words (BoW) models to generate a 4000-dimensional BoW descriptor per frame. This high-dimensional data is directly utilized to train a multiclass classifier with Mu-MABoost. This system is shown to be reliable in the sense that it is highly speaker-independent and outperforms conventional algorithms.

Similar to [HES00], where the posterior probabilities generated by an ANN were used as audio features (known as Tandem features) in a GMM-HMM based speech recognizer, a new set of visual features is proposed wherein the output posterior probabilities of the Mu-MABoost classifier are added to the COBRA-selected ISCN features introduced in Chapter 3. The new feature set is then used to train GMM-HMMs. Using this new feature set further reduces the variance of the V-ASR over different speakers and obtains 70.5% and 63.5% accuracy on the Oulu and GRID datasets, respectively. To the best of our knowledge, these recognition rates are the highest reported in the literature for these datasets when the V-ASR is evaluated in the speaker-independent setting.

6.1 Introduction to Visual Speech Recognition

It is a known fact that speech perception is a multi-modal phenomenon. Audio speech, visual speech (lipreading), facial gestures, body pose, etc., all convey relevant speech information. In most scenarios, however, the audio and visual signals are sufficient input for a human listener to perceive and understand speech.

Following this observation, many researchers aimed to fill the gap between human-level performance and the performance of A-ASR in real-world applications by integrating additional visual speech information with the audio speech. This, however, turned out to be too optimistic. The first attempts in this direction [PNLM04, and references therein] clearly showed that visual cues are highly speaker-dependent and show large variations under different illumination conditions. That is, commonly used visual features (appearance features such as DCT, PCA and LDA, and model-based features such as the active shape model (ASM) [DL00] and the active appearance model (AAM) [MCB+02, CET01]) are not reliable enough to work in a speaker-independent mode. This poses two questions: (I) a more theoretical question regarding the formal definition of reliability¹, which has been investigated in [PSSD06, RCB97], and (II) a practical question of how an unreliable recognizer (V-ASR) can be optimally combined with a reliable recognizer (A-ASR) in order to obtain an overall performance improvement.

A commonly used approach to tackling the second problem is to assign a reliability weight (an exponent weight) to the probability distribution of each information stream. This approach has been empirically shown to be effective [WSC99], [CR98], [CDM02], [DL00] in audio-visual speech and speaker recognition, especially in the presence of varying additive acoustic noise. From the theoretical standpoint, it was shown in [PSSD06] that, under certain simplifying assumptions, a proper choice of exponent weights can minimize the variance of the mismatch error between the true and estimated distributions, though as we discuss in Chapter 7, this may not necessarily result in a lower error rate.

¹ Note that reliability is a property relating to the features and the amount of mismatch error between the distributions of the training and test data. For instance, an A-ASR system trained on a noisy dataset is a weak recognizer, while it is still considered reliable due to its consistent performance over different speakers.

However, due to the strong speaker- and illumination-dependence of visual features, these weighting schemes should be able to adapt themselves automatically in order to cope with quality variations of the visual modality. This, however, is an inherently challenging problem because the quality of the visual features must be assessed in an unsupervised manner. To sidestep this problem, in Chapter 7 we derive a weight estimation algorithm by maximizing the AUC criterion. Instead of optimizing the performance of the recognizer for a particular test condition, the proposed approach yields a set of weights minimizing the expectation of the error over different test conditions. The proposed fusion algorithm is a general approach that can be applied to the model presented in this chapter as well. However, due to its pessimistic nature, this robustness comes at the expense of some performance degradation of the AV-ASR in the high-SNR regimes.

Another approach to improving the reliability of V-ASR systems is to use speaker- and illumination-invariant features. A reliable V-ASR is a classifier whose performance shows little variation over different speakers. A big step in this direction was taken by Zhao et al. in [ZBP09], where they considered the whole video sequence of an utterance as a volume and calculated spatio-temporal local binary pattern (LBP-TOP) features in order to classify utterances. LBP features are binary features known to be gray-scale and rotation invariant [OPM02]. When calculated across the XY, YT and XT planes, they result in LBP-TOP features which, in addition to the invariance properties inherited from LBP features, encode the time-dynamics of visual speech. As they demonstrated, the temporal patterns have a high discriminative power since they are less sensitive to speakers and more related to the underlying speech. However, they applied two further steps to extract low-dimensional, equal-size feature vectors from the videos. First, even though different utterances have different lengths (i.e., various numbers of frames), they are divided into the same number of block volumes to get the same number of features. Second, ADABoost is employed to select a small subset of informative features. From our point of view, these two steps are unnecessary and perhaps reduce the invariance properties of the LBP-TOP features.

Conventional visual features are highly speaker-dependent and more suited to encoding the identity of a speaker than visual speech. Normalizing visual features is thus a reasonable idea in V-ASR systems in order to improve the speaker-invariance property of the recognizer. For instance, in [ZZP11] it is proposed to first learn a deterministic function that maps a curve (which in their work is a low-dimensional manifold) into the image space and then use this function to normalize the raw visual data.
They showed that using this method in conjunction with LBP-TOP features resulted in an overall 10% performance improvement in a 10-phrase recognition task. Other normalization methods such as Hi-LDA [PNLM04] and z-score normalization [NC09] are also commonly used in the literature. An extensive comparison between different normalization methods for AAM and DCT feature sets has been reported in [LTH+10]. However, these methods do not completely solve the inter-speaker variation problem, and the variation of the V-ASR performance over different speakers remains substantially large.

Perhaps the main reason for the sensitivity of conventional visual features to the speaker's identity is that they are low-level global features. By definition, global features are features that depend on the values of all pixels. For instance, DCT, LDA or PCA features are global features since they are directly extracted from pixel values by applying a linear transformation to an ROI. The main drawback of global features such as LDA is that, since ROIs usually convey a significant amount of irrelevant information (i.e., facial characteristics of the speaker), these features are highly noisy and thus difficult to learn from. Similar to LDA, the AAM and ASM methods also yield global features because, even though they further process the ROI to extract lip active appearances (active shapes in the case of ASM), a new feature vector is still extracted by applying a linear transformation to the ROI.

In Chapter 3, we showed that using ISCN features improves the V-ASR performance. However, even though ISCN features are the result of applying several non-linear local wavelet transforms to an ROI, they still cannot provide the level of robustness necessary for V-ASR applications. To address this problem, we use a discriminative approach and train a MABoost classifier with visual features. The posterior probabilities of this classifier are then added to the ISCN features to construct the final set of visual features, which are used to train GMM-HMMs. The key contributions of this work are fourfold:

1. Multiple color spaces: In order to achieve illumination invariance and increase the discriminative power, the video frames are first transformed into 5 different color spaces: the opponent color space, which is invariant to changes in light intensity; the color-invariant color space, which is scale-invariant with respect to light intensity; the r and g chromatic components of the normalized RGB color space, which are known to be scale-invariant; and

finally, the normal RGB color space and the gray color space. The invariance properties of each of these color spaces and the discriminative power of their corresponding scale-invariant feature transform (SIFT) features [Low04] have been explored in [vdSGS10]. It was shown there that using a combined set of features extracted from these color spaces results in a significant performance improvement and enhances the robustness of object classification against illumination variations.

2. 3D-SIFT + Bag-of-Words: Following the idea of Zhao et al. [ZBP09] to use spatio-temporal features, we also employ 3-dimensional (X, Y and time, T) features to represent visual speech. However, instead of LBP-TOP, we employ the 3D-SIFT features introduced in [SAS07] in order to benefit from the invariance properties of the popular SIFT features. From a problem-oriented standpoint, the most important invariance property of SIFT in a lipreading task, where speakers have different lip and mouth characteristics and sizes, is its scale invariance. 3D-SIFT features are extracted from each frame and for each of the 5 color spaces mentioned above, i.e., 5 sets of feature vectors per frame. Since each frame may have a varying number of key points, the number of 3D-SIFT feature vectors representing a frame differs from frame to frame. To obtain fixed-length feature vectors per frame, the bag-of-words model introduced in [SZ03] is employed. Using BoW is perhaps the most important step in this work towards achieving speaker-invariant features.

3. Multiclass Boosting: Finally, for classification, a learning algorithm is needed that has the following properties: (I) due to the high dimensionality of the visual feature vectors, the learning algorithm should neither have many free parameters nor be too complex, in order to be resistant to overfitting; (II) it should be powerful enough to learn the underlying complex hypothesis; (III) it should be able to directly address the multiclass classification problem. The common approach to dealing with multiclass settings is to reduce them to multiple binary classification problems. This, however, significantly increases the computational complexity in our problem, where five high-dimensional feature sets are used to train five classifiers for digit classification and five 26-class classifiers for letter recognition. We use the multiclass setting of the MABoost algorithm (called Mu-MABoost) presented in Chapter 4 and demonstrate that it satisfies the above requirements. Mu-MABoost has several appealing properties.

First, it was shown in Chapter 4 that Mu-MABoost requires minimal conditions on weak classifiers to achieve high accuracy on training data. That is, the weak classifiers constituting the final strong classifier are chosen from a very simple hypothesis space, in our case from the class of decision trees with a depth of less than 5. Using such simple classifiers minimizes the risk of over-fitting. Second, by directly formulating the multiclass classification problem, it largely reduces the computational complexity, making it a method of choice for high-dimensional data.

4. ISCN + Posterior Probabilities: Due to the success of HMMs in A-ASR systems [RJ93], HMMs and, more generally, deep Bayesian networks [CH97, LT97, SLS+05] have been widely applied in V-ASR to capture the time dynamics (temporal information) of visual speech. We also adopt HMMs in our final recognizer framework in order to better take the time-series aspect of speech into account. The visual features used to train the final GMM-HMM based lipreading system are the posterior probabilities output by Mu-MABoost together with the COBRA-selected ISCN features introduced in Chapter 3. This hybrid feature set obtains very promising performance and outperforms other feature extraction methods on the GRID and Oulu datasets.

6.2 Color Spaces

It is shown in [vK70] that changes in illumination can be modeled by a linear transformation (diagonal mapping or von Kries model). Thus, different color spaces for representing an image may exhibit different invariance properties. As extensively explained in [vdSGS10], based on the von Kries model and the von Kries offset model, four types of changes in an image can be described. These four image transformations due to a light change are light intensity change, light color change, light shift change and, finally, light color and shift change. Appropriate color spaces show invariance properties against some of these changes. The color spaces used in this work are as follows (a small conversion sketch is given after the list):

1. Opponent color space: It is defined as

$$\begin{pmatrix} O_1 \\ O_2 \\ O_3 \end{pmatrix} = \begin{pmatrix} \frac{R-G}{\sqrt{2}} \\ \frac{R+G-2B}{\sqrt{6}} \\ \frac{R+G+B}{\sqrt{3}} \end{pmatrix} \qquad (6.1)$$

where R, G and B are the three channels of the RGB color space. In this space, $O_1$ and $O_2$ are shift-invariant with respect to light intensity and $O_3$ does not have any invariance property.

2. Color-invariant color space: It is defined as

$$\begin{pmatrix} H \\ C \end{pmatrix} = \begin{pmatrix} \frac{0.3R + 0.04G - 0.35B}{0.34R - 0.6G + 0.17B} \\ \frac{0.3R + 0.04G - 0.35B}{0.06R + 0.63G + 0.27B} \end{pmatrix} \qquad (6.2)$$

H and C are the two color-invariant channels used to extract C-SIFT [AF06] (to be precise, in [AF06] only the H channel was eventually used to obtain C-SIFT features). As explained in [vdSGS10], H and C are scale-invariant with respect to light intensity but not shift-invariant.

3. rg color space: It is the normalized RGB

$$\begin{pmatrix} r \\ g \end{pmatrix} = \begin{pmatrix} \frac{R}{R+G+B} \\ \frac{G}{R+G+B} \end{pmatrix} \qquad (6.3)$$

Due to the intensity normalization, r and g are scale-invariant with respect to light intensity changes, shadows and shading.

4. Gray scale: It is the most commonly used color space in V-ASR systems due to its simplicity (mono-channel) and can be written as

$$g = 0.2989\,R + 0.5870\,G + 0.1140\,B \qquad (6.4)$$

Gray scale has no invariance properties.

5. RGB color space: The last color space used in this work is the RGB color space, which obviously has no invariance properties.
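As a concrete illustration of Eqs. (6.1)-(6.4), the sketch below converts an RGB frame into the five representations used in this work. It is only a minimal example: the function and variable names are not from the thesis code, the frame is assumed to be a floating-point NumPy array, and a small epsilon is added to the ratio-based channels to avoid division by zero.

```python
import numpy as np

def color_representations(rgb):
    """rgb: float array of shape (H, W, 3) in [0, 1]."""
    R, G, B = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    eps = 1e-8  # guards the ratio-based channels against division by zero

    # Opponent color space, Eq. (6.1)
    opponent = np.stack([(R - G) / np.sqrt(2),
                         (R + G - 2 * B) / np.sqrt(6),
                         (R + G + B) / np.sqrt(3)], axis=-1)

    # Color-invariant channels H and C, Eq. (6.2)
    num = 0.3 * R + 0.04 * G - 0.35 * B
    Hc = num / (0.34 * R - 0.6 * G + 0.17 * B + eps)
    Cc = num / (0.06 * R + 0.63 * G + 0.27 * B + eps)

    # Normalized rg chromaticity, Eq. (6.3)
    s = R + G + B + eps
    rg = np.stack([R / s, G / s], axis=-1)

    # Gray scale, Eq. (6.4)
    gray = 0.2989 * R + 0.5870 * G + 0.1140 * B

    return {"rgb": rgb,
            "opponent": opponent,
            "color_invariant": np.stack([Hc, Cc], axis=-1),
            "rg": rg,
            "gray": gray}
```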

Figure 6.1: Visual speech-unit classifier. In the first step, video data are transformed to 5 color spaces: RGB, opponent, rg, gray and color-invariant. 3D-SIFT features are extracted from each representation and clustered into 4000 classes. BoW features, which are the frequency of occurrence of each of these classes, are then computed for each video sample and employed to train 5 classifiers, one for each color space. The final classifier is the ensemble of these 5 classifiers.

6.3 Visual Features

Despite the fact that many visual feature sets have been proposed for V-ASR systems (see [ZZHP14] and [PNLM04]), none of them has gained wide acceptance (as MFCCs have in A-ASR systems) in real-world applications. Recently, however, several spatio-temporal features have appeared which have been shown to be promising in both video classification [SAS07] and lipreading [PGC08], [ZBP09]. Our proposed method may also be considered to go along this line, with an additional bag-of-words step which results in a sparse feature vector (suitable for the decision tree classifiers used as weak classifiers in Mu-MABoost). In this work, video samples are first represented in the 5 different color spaces introduced in the previous section. For each representation, a set of 3D-SIFT features² is extracted from the video data and used in a BoW model with 4000 words to construct the final 4000-dimensional feature vectors. This approach yields 5 feature sets, each containing 4000-dimensional feature vectors. Each of these feature sets is then used to train a multiclass classifier for speech recognition. This procedure is shown in Figure 6.1.
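The bag-of-words step can be sketched as follows. The snippet assumes that 3D-SIFT descriptors have already been extracted (one array of descriptors per video and color representation); the codebook size of 4000 matches the text, while the clustering routine, function names and normalization are illustrative choices rather than the thesis implementation.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_codebook(train_descriptors, n_words=4000, seed=0):
    """train_descriptors: list of (n_i, d) arrays, one per training video."""
    data = np.vstack(train_descriptors)
    codebook = MiniBatchKMeans(n_clusters=n_words, random_state=seed, n_init=3)
    codebook.fit(data)
    return codebook

def bow_histogram(descriptors, codebook):
    """Fixed-length (n_words,) histogram, regardless of the key-point count."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```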

6.4 Multiclass Classifier

In order to fully benefit from the invariance properties and sparsity of the extracted features, we need a learning algorithm that (I) can exploit the sparsity of the feature vectors to reduce the computational complexity of the learning phase and (II) aggressively learns the relevant information in the training data with minimal overfitting. The Mu-MABoost algorithm presented in Chapter 4 in fact satisfies both of these requirements. First, by using shallow decision trees as weak learners, it can benefit from the sparse structure present in the feature vectors and, second, since it is a provable boosting algorithm, it is guaranteed to drive the training error down to zero in a finite number of iterations. In this work, the Mu-MABoost algorithm with the Euclidean distance as the Bregman divergence is used to train visual digit and letter classifiers. This algorithm is listed in Algorithm 6.1. In Algorithm 6.1, vec(W) stands for the vectorized version of matrix W (by column-wise concatenation), Tr(·) denotes the trace operator, N is the number of training examples and K is the number of classes.

²3D-SIFT descriptors are the direct generalization of the popular SIFT features. A brief explanation of 3D-SIFT is given in Chapter 2.

Algorithm 6.1: K-class MABoost (Mu-MABoost)

Input: $\mathcal{R}(X) = \mathrm{Tr}(X^\top X)$; $W_1$ and $Z_1$ with elements $W_{i,j} = Z_{i,j} = \frac{1}{N(K-1)}$

For $t = 1, \ldots, T$ do

(a) Train a classifier with $W_t$ and get $h_t$; set the elements of $D_t$ as in (4.18) and $\gamma_t = -\mathrm{Tr}(W_t^\top D_t)$.

(b) Set $\eta_t = \frac{\gamma_t}{(K-1)N}$

(c) Update: $Z_{t+1} = W_t + \eta_t D_t$

(d) Project onto $\mathcal{S}$: $W_{t+1} = \operatorname*{argmin}_{\mathrm{vec}(W) \in \mathcal{S}} \mathrm{Tr}\big((W - Z_{t+1})^\top (W - Z_{t+1})\big)$

End

Output: Final hypothesis $f(x)$: $H(x,l) = \sum_{t=1}^{T} \eta_t \mathbb{1}\big[h_t(x) = l\big]$, $\quad f(x) = \operatorname*{argmax}_{l} H(x,l)$
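A rough Python sketch of the Mu-MABoost loop in Algorithm 6.1 is given below. The construction of the cost matrix D_t follows Eq. (4.18) of the thesis, which is not reproduced in this chapter, so it is passed in as a callback; the constraint set S is assumed here to be the probability simplex over vec(W), matching the uniform initialization. All names are illustrative, not the thesis implementation.

```python
import numpy as np

def project_simplex(z):
    """Euclidean projection of a vector onto the probability simplex."""
    v = np.sort(z)[::-1]
    css = np.cumsum(v) - 1.0
    rho = np.nonzero(v - css / (np.arange(len(v)) + 1) > 0)[0][-1]
    return np.maximum(z - css[rho] / (rho + 1.0), 0.0)

def mu_maboost(X, y, K, weak_learner, compute_D, T=100):
    """weak_learner(X, y, W) -> h, a callable mapping one sample to a class label;
    compute_D(h, X, y, K) -> (N, K) cost matrix, assumed to follow Eq. (4.18)."""
    N = X.shape[0]
    W = np.full((N, K), 1.0 / (N * (K - 1)))        # W_1 = Z_1, uniform start
    learners, etas = [], []
    for _ in range(T):
        h = weak_learner(X, y, W)                   # step (a): train on weighted data
        D = compute_D(h, X, y, K)
        gamma = -np.trace(W.T @ D)                  # gamma_t = -Tr(W_t^T D_t)
        eta = gamma / ((K - 1) * N)                 # step (b)
        Z = W + eta * D                             # step (c)
        W = project_simplex(Z.ravel()).reshape(N, K)   # step (d), assumed simplex S
        learners.append(h)
        etas.append(eta)

    def predict(x):
        scores = np.zeros(K)
        for eta, h in zip(etas, learners):
            scores[h(x)] += eta                     # H(x, l) = sum_t eta_t 1[h_t(x) = l]
        return int(np.argmax(scores))
    return predict
```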

6.5 Lipreading Experiments with ISMA Features

For a comprehensive evaluation of the proposed method, several experiments were designed. Since our objective is to construct a speaker-independent lipreading system, all experiments were carried out in speaker-independent mode. The accuracies were estimated by a leave-one-speaker-out cross-validation strategy. In all experiments, Mu-MABoost was first trained to classify speech units. The output of Mu-MABoost is a K-dimensional vector (K being the number of speech units) whose i-th element is the probability of belonging to class i, given an input. This K-dimensional vector was computed for each frame and used as visual features together with the COBRA-selected ISCN features. The delta and delta-delta features were also included to construct the final visual feature set. These features were then used to train GMM-HMMs. For AV-ASR, we additionally include the 39 MFCC features extracted from the audio signals in the feature set. Thirteen MFCC features and their first- and second-order derivatives were extracted from frames of 20 ms duration with 10 ms overlap. That is, the audio frame rate was 100 frames per second. Visual features were linearly interpolated to increase their frame rate from 30 to 100 in order to have an equal number of audio and visual feature vectors per utterance.
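The rate-matching step can be sketched as below: per-frame visual feature vectors sampled at 30 fps are linearly interpolated onto the 100 fps audio frame times so that every MFCC frame has a visual counterpart. The function name and shapes are illustrative assumptions.

```python
import numpy as np

def upsample_visual(features_30fps, n_audio_frames, video_fps=30.0, audio_fps=100.0):
    """features_30fps: (T_v, D) array; returns an (n_audio_frames, D) array."""
    T_v, D = features_30fps.shape
    t_video = np.arange(T_v) / video_fps          # time stamps of the video frames
    t_audio = np.arange(n_audio_frames) / audio_fps  # time stamps of the MFCC frames
    return np.stack([np.interp(t_audio, t_video, features_30fps[:, d])
                     for d in range(D)], axis=1)
```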

6.5.1 Oulu: Phrase recognition

The Oulu dataset consists of 10 daily used phrases such as nice to meet you. Each phrase is repeated 5 times by each of the 20 speakers.


Figure 6.2: Histogram of length (number of frames) of phrases for the Oulu dataset. Most recordings are longer than 20 frames.

A brief description of this dataset can be found in Chapter 2.


Figure 6.3: Phrase recognition accuracy for ISMA, MAPP and ISCN features. C1 to C10 are the ten phrases of the Oulu dataset listed in Table 2.3.

In each run, 19 speakers were used to train the HMMs. The main difference from the GRID dataset is that the phrases in the Oulu dataset are much longer than the digits in the GRID dataset. The histogram in Figure 6.2 shows the length distribution of the Oulu phrases. Each phrase was modeled by an HMM with 9 emitting states and GMMs with two Gaussian components and diagonal covariance matrices. The first set of experiments compares the performance of the Mu-MABoost features, i.e. when posterior probabilities are used to train GMM-HMMs, with the ISCN features, and reports the recognition rate when both feature sets are combined. In the following, the output posterior probabilities of Mu-MABoost are referred to as MAPP features, and the feature set containing both the Mu-MABoost posterior probabilities and the ISCN features is called ISMA.

        C1      C2      C3      C4      C5      C6      C7      C8      C9      C10
C1      79.25%  6.60%   2.83%   0       6.60%   1.89%   0.94%   0.94%   0       0.94%
C2      5.81%   80.23%  0       2.33%   2.33%   0       0       0       0       9.30%
C3      1.27%   0       77.22%  1.27%   0       11.39%  0       8.86%   0       0
C4      0       3.00%   3.00%   74.00%  6.00%   1.00%   1.00%   2.00%   3.00%   7.00%
C5      3.57%   0.89%   4.46%   8.04%   67.86%  3.57%   0       6.25%   0.89%   4.46%
C6      1.85%   0       12.04%  0       1.85%   60.19%  0       24.07%  0       0
C7      0       1.28%   0       1.28%   0       0       87.18%  0       8.97%   1.28%
C8      1.14%   0       7.95%   1.14%   4.55%   19.32%  0       64.77%  1.14%   0
C9      0.79%   2.36%   5.51%   1.57%   0.79%   0.79%   18.90%  0       62.99%  6.30%
C10     1.75%   14.04%  0.88%   8.77%   0.88%   0       5.26%   0       7.02%   61.40%

Recognition rate = 70.5411%

Figure 6.4: Confusion matrix of the visual recognizer with ISMA features. C8 (thank you) is the most frequently misclassified phrase due to its confusion with C6 (see you). Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class.

As seen in Figure 6.3, ISMA obtains 70.5% accuracy, which is the highest recognition rate reported for this dataset.


Figure 6.5: Recognition accuracy per person for speakers in the Oulu dataset. The ISMA accuracy varies from 45 to 94 percent over 20 speakers.

MAPP and ISCN features obtain 63% and 55% accuracy, respectively. The worst phrase recognition rate is for Thank you (C8) due to its confusion with the similar phrase see you (C6). This confusion can be seen in the ISMA confusion matrix depicted in Figure 6.4. From the visual speech viewpoint, the only difference between these two phrases is in the tongue position, which is usually not captured in videos. Grouping these two phrases into one class improves the final classification rate by 3.6%. Figure 6.5 reports the recognition accuracy for each speaker. Although the inter-speaker variation of the accuracies obtained with ISMA and MAPP features is smaller than with ISCN features (the standard deviation of the ISMA accuracy over 20 speakers is 2.98 compared to 3.89 for ISCN features), it is still not sufficiently small for the recognizer to be called reliable.

SNR        ISMA+MFCC   MAPP+MFCC   ISCN+MFCC   MFCC
∞ dB       88.27       88.28       79.36       94.99
10 dB      80.05       73.13       65.93       42.77
5 dB       71.84       61.74       58.52       20.93
0 dB       65.14       56.62       49.52       14.93
−5 dB      57.54       45.1        41.17       12.12
−10 dB     60.74       40.61       35.59       12.03
−15 dB     55.03       43.57       34.38       11.72
−20 dB     50.72       38.97       31.87       10.22

Table 6.1: Recognition rates in percent at different noise levels for the audio-visual recognizer trained with 3 different feature sets and the audio-only recognizer trained with MFCC features. Second column: ISMA features combined with MFCC features extracted from the audio data. Third column: audio-visual recognizer trained with a feature set containing MAPP and MFCC features. Fourth column: ISCN with MFCC features. The fifth column reports the accuracy of the audio-only recognizer at various SNRs.

Table 6.1 reports the accuracy of the audio-visual speech recognizer for ISMA with MFCC, MAPP with MFCC and ISCN with MFCC features when the audio signals are corrupted by additive white noise at the SNR levels {−20, −15, −10, −5, 0, 5, 10, ∞} dB. The accuracy of the audio-only recognizer is listed in the last column. Two interesting points can be observed in this table. First, the performance of the audio-visual recognizer is worse than that of the audio-only recognizer in the noise-free scenario. This is an important indication of the unreliability of the visual features. If the visual features were reliable (that is, if there were no mismatch between the distributions of the visual features in the training and test sets), adding visual features to the audio feature set could only improve the performance, by the Bayes theorem. However, as we observe, in reality adding a visual feature set to the audio features may in fact reduce the performance due to the speaker-dependent nature of visual cues. Second, using visual cues at low SNRs largely improves the recognition accuracy. As seen, at 0 dB for instance, the audio-only recognizer performs barely better than random guessing, while the audio-visual recognizer correctly classifies 65% of the phrases.

6.5.2 GRID: Digit recognition

The second set of experiments was conducted on the GRID dataset [CBCS06]. GRID is a relatively large dataset with 32 speakers and 1000 recordings per speaker (for more details see Chapter 2). In the following tests, we only used the recordings of the first sixteen speakers of this dataset. For each speaker, we extracted the digits 0-9 from the first 400 utterances.


Figure 6.6: Histogram of digit lengths in the GRID dataset. Most digits were between 6 and 10 frames long.


Figure 6.7: Digit recognition accuracy for ISMA, MAPP and ISCN features.

Since in the GRID dataset the talkers had 3 seconds to produce each sentence, the spoken digits extracted from these sentences are much shorter than the phrases in the Oulu dataset. Moreover, to reduce the sample size, the video data were down-sampled to 15 frames per second, which further reduced the number of frames per digit. The histogram in Figure 6.6 shows the length distribution of the GRID digits. As seen in Figure 6.6, most digits are shorter than 10 frames and many of them are as short as 6 frames. Hence, unlike for the Oulu dataset, we use HMMs with 3 emitting states to model the GRID digits. As in the previous experiments, we first compare the performance of the MAPP features, i.e. when posterior probabilities are used to train GMM-HMMs, the ISCN features, and the ISMA features, which are the combination of MAPP and ISCN features. Figure 6.7 compares the classification power of the ISMA, MAPP and ISCN features.

        zero    one     two     three   four    five    six     seven   eight   nine
zero    62.93%  0.86%   6.03%   0       0       0.86%   4.31%   22.41%  0       2.59%
one     0       70.50%  2.16%   10.07%  5.04%   4.32%   0       0       3.60%   4.32%
two     17.71%  2.86%   60.57%  1.14%   2.86%   1.71%   1.71%   4.57%   1.14%   5.71%
three   0       5.74%   1.64%   78.69%  5.74%   0       0.82%   4.10%   0       3.28%
four    0.96%   11.96%  9.09%   8.61%   63.16%  0.48%   0.48%   1.91%   0.48%   2.87%
five    1.78%   0       0.59%   5.92%   0.59%   79.29%  2.37%   3.55%   1.78%   4.14%
six     5.65%   0       0.56%   0       0       2.26%   54.24%  3.39%   13.56%  20.34%
seven   18.97%  0.86%   3.45%   1.72%   0       0.86%   5.17%   65.52%  1.72%   1.72%
eight   0.70%   1.40%   2.10%   0.70%   0       2.10%   9.09%   2.10%   64.34%  17.48%
nine    2.42%   3.23%   7.26%   4.84%   3.23%   10.48%  18.55%  1.61%   13.71%  34.68%

Recognition rate = 63.4899%

Figure 6.8: GRID dataset: confusion matrix of the visual recognizer with ISMA features. Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class.

Similar to the Oulu dataset, the ISMA features obtain the highest accuracy, i.e., 63%, followed by the MAPP features with a 59.5% digit classification rate. Moreover, at least 50% of the occurrences of every digit except nine are correctly classified by the ISMA features.


Figure 6.9: Recognition accuracy per person for the first 16 speakers of the GRID dataset. The ISMA accuracy varies from 37 to 81 percent over the 16 speakers. Removing the ninth speaker from the dataset increases the mean accuracy by 2%.

The confusion matrix depicted in Figure 6.8 reveals that nine is commonly confused with eight or six. This confusion can be attributed to the fact that their vowels, i.e., /ay/, /ey/ and /i/, are all mapped to the same viseme /I (see Table 3.10). While ISMA obtains a promising accuracy of 63% on such a difficult dataset (due to the short length of the digits), its variation over the speakers is still not satisfactory. As seen in Figure 6.9, the ISMA accuracy is as low as 37% for one speaker. This is due to the fact that lipreading is by nature speaker-dependent. It is a common observation that some people hardly move their lips while speaking. This translates into different degrees of viseme visibility for different speakers, owing to their lip shapes, facial characteristics and speaking styles.

SNR        ISMA+MFCC   MAPP+MFCC   ISCN+MFCC   MFCC
∞ dB       81.29       87.66       72.02       95.48
10 dB      73.13       74.6        58.97       45.88
5 dB       70.32       69.51       54.39       29.19
0 dB       67.04       63.28       49.64       17.69
−5 dB      64.41       56.64       44.4        11.3
−10 dB     62.01       51.41       39.63       11.24
−15 dB     60.53       48.06       37.01       11.17
−20 dB     58.93       46.66       35.6        11.24

Table 6.2: Recognition rates in percent at different noise levels for the audio-visual recognizer trained with 3 different feature sets and the audio-only recognizer trained with MFCC features. Second column: ISMA features combined with MFCC features extracted from the audio data. Third column: audio-visual recognizer trained with a feature set containing MAPP and MFCC features. Fourth column: ISCN with MFCC features. The last column reports the accuracy of the audio-only recognizer at various signal-to-noise ratios.

Table 6.2 reports the accuracy of the audio-visual digit recognizer when (I) ISMA is used as visual features and MFCC as audio features, (II) MAPP and MFCC features are employed and (III) ISCN and MFCC features are used to train the recognizer. The last column reports the accuracy of the audio-only recognizer at various signal-to-noise ratios. A surprising result is that the ISMA+MFCC features obtain significantly worse results than the MAPP+MFCC features on clean speech. However, as the SNR decreases, ISMA+MFCC tends to outperform the MAPP+MFCC features, as expected. This is due to the fact that the number of ISMA features (100) is 5 times larger than the number of MAPP features, and when used together with the 39 MFCC features they dominate the value of the log-likelihood. With a proper weighting scheme this effect can be substantially alleviated. As for the Oulu dataset, we again see that the performance of the audio-only recognizer can be largely improved at low SNRs. At high SNR values, however, the audio-only recognizer still works better than the audio-visual recognizer, meaning that a more sophisticated fusion scheme is needed to properly weight the audio and visual modalities according to their contributions.

6.5.3 AVletter: Letter recognition

In this section a letter recognizer was trained to classify the 26 English letters A-Z. Each English letter, except W, can be transcribed by a sequence of 1 to 3 phonemes. Since visual speech has a much lower time resolution than audio speech, a speech unit as small as a phoneme may only be captured in one frame, which is hardly enough to learn its corresponding statistics. Modeling longer speech units (but not as long as words), such as the short phoneme sequences in this dataset, allows us to model visual speech more precisely while it is still feasible to generalize the approach to a large-vocabulary automatic speech recognition task. Figure 6.10 compares the recognition accuracy of the ISMA, MAPP and ISCN features. Unlike on the previous datasets, the ISCN and MAPP features achieve similar accuracy rates (32% and 33%, respectively) on AVletter. ISMA, however, reaches a 40.3% accuracy rate.


Figure 6.10: Recognition accuracy of the letters for ISMA, MAPP and ISCN features.

While for some letters such as Y, W, M and F the accuracy rate is over 70%, for some other letters the visual recognizer could hardly achieve a performance better than random guessing. In the AVletter dataset, each letter is repeated 30 times by 10 different speakers. The confusion matrix of the letter recognition task is quite informative. As shown in Figure 6.11, some letters such as C, D and T, which have exactly the same visual transcription, i.e. they consist of exactly the same sequence of visemes, are commonly confused. For some other letters such as U and Q, which have different viseme transcriptions, the source of the confusion is the fact that they share the same dominant viseme. For instance, the dominant viseme of U and Q is /B (which corresponds to the phoneme /uh/; see Table 3.10). By governing the lip shape and preventing other visemes from having a visible effect on the visual speech, /B hides the other visemes and results in the confusion of Q and U. The same argument holds for the confusion of S and X (in this case the dominant viseme is /I, which corresponds to the phoneme /s/). In lipreading, visemes are considered the smallest visibly distinguishable speech units. However, since there is no unique mapping from phonemes to visemes, different mappings result in different viseme sets. From the classification viewpoint, the best viseme group is a set of visemes that are highly separable given the visual features. With this definition in mind, it is interesting to investigate whether the commonly accepted viseme set listed in Table 3.10 is optimal. In order to find a data-driven optimal viseme set with k visemes that minimizes the classification error, we grouped the 26 English letters into 13 classes by partitioning the confusion matrix into 13 almost disjoint principal submatrices, i.e., the rows and columns are re-arranged in order to obtain a new matrix in which 13 non-zero principal submatrices are observable. While the off-diagonal elements within these principal submatrices are non-zero, the off-diagonal elements between different principal submatrices should be small.

Visual groups: {I, R}  {S, X}  {M, N}  {J, Z}  {B, P}  {O}  {F}  {W}  {A, E, K, L}  {H}  {G, Y}  {C, D, T, V}  {Q, U}

Table 6.3: Regrouping the 26 English letters into a smaller set containing 13 visibly distinguishable elements.

    A    B    C    D    E    F    G    H    I    J    K    L    M    N    O    P    Q    R    S    T    U    V    W    X    Y    Z
A  37.5    0    0    0  9.4    0    0    0 12.5    0  9.4 18.8    0  9.4    0    0    0  3.1    0    0    0    0    0    0    0    0
B     0 56.2    0    0    0    0    0    0    0    0    0    0    0    0    0 37.5    0    0  6.2    0    0    0    0    0    0    0
C     0    0 11.8 29.4    0    0    0    0 11.8    0    0    0    0    0    0    0    0 11.8    0 17.6 11.8    0    0    0    0  5.9
D   2.9    0 20.0 17.1  5.7    0    0    0  2.9    0  5.7    0    0  2.9    0    0  5.7  5.7    0 22.9    0  2.9    0    0    0  5.7
E  11.1    0  2.8  5.6 38.9    0    0  5.6    0    0  5.6  8.3    0  8.3    0  2.8  2.8    0    0  5.6  2.8    0    0    0    0    0
F     0    0  3.2    0    0 61.3  6.5    0    0  3.2    0  6.5  9.7    0  3.2    0    0    0  3.2    0    0  3.2    0    0    0    0
G     0  5.3  5.3    0  5.3 10.5 21.1    0    0 10.5  5.3    0    0    0    0  5.3  5.3    0    0    0 10.5    0    0    0 15.8    0
H   3.8    0    0    0    0    0    0 57.7    0    0    0  3.8    0    0  3.8    0  3.8    0 11.5    0  3.8  3.8  3.8  3.8    0    0
I   5.1    0  2.6  2.6 12.8    0  2.6    0 33.3    0 10.3    0    0  2.6  2.6    0  7.7 12.8    0  2.6  2.6    0    0    0    0    0
J     0    0  5.3  5.3    0    0 15.8    0    0 52.6    0    0    0    0    0  5.3    0    0    0    0    0    0    0    0    0 15.8
K   3.0    0    0  6.1  3.0    0    0  3.0  6.1  6.1 39.4  3.0    0  9.1    0    0    0    0    0 12.1  3.0    0    0  6.1    0    0
L     0 25.0    0    0 12.5    0    0 25.0    0    0    0 12.5    0    0 12.5    0    0    0 12.5    0    0    0    0    0    0    0
M     0  3.2    0    0  3.2  6.5    0  3.2    0  3.2    0  6.5 58.1  3.2    0  6.5    0    0    0    0    0    0    0  6.5    0    0
N  16.7    0    0    0  3.7  3.7    0  1.9  5.6    0  1.9 13.0  7.4 18.5  1.9    0    0  1.9  7.4  1.9    0    0    0 13.0    0  1.9
O     0    0    0    0    0    0    0    0  4.2    0    0    0    0    0 70.8    0  8.3  8.3    0    0  8.3    0    0    0    0    0
P     0 36.4    0    0    0    0    0    0    0    0    0    0    0    0    0 48.5    0    0    0    0    0  6.1  3.0    0  6.1    0
Q     0    0    0    0    0    0  2.4  2.4  2.4    0    0  4.9  2.4    0  7.3    0 36.6  7.3    0    0 31.7  2.4    0    0    0    0
R     0    0    0  5.3    0    0    0    0 10.5  5.3    0    0    0  5.3  5.3    0    0 63.2    0    0  5.3    0    0    0    0    0
S     0    0    0    0    0  9.1    0 11.4    0    0    0  4.5    0 11.4    0    0    0    0 31.8    0    0    0    0 27.3    0  4.5
T     0    0 25.6 15.4    0    0 10.3    0    0  7.7  2.6    0    0    0    0    0    0    0    0 23.1    0  5.1    0    0    0 10.3
U     0    0    0  8.3    0    0    0    0  8.3    0    0    0    0    0 25.0    0 33.3    0    0    0 25.0    0    0    0    0    0
V     0  2.3 11.4  9.1    0    0  9.1    0    0  6.8  2.3    0    0    0    0  2.3  2.3  2.3    0  4.5  2.3 43.2    0    0  2.3    0
W     0    0    0    0    0    0  3.7    0    0    0    0    0  3.7    0    0    0    0    0    0    0    0    0 88.9    0    0  3.7
X     0    0    0    0    0    0    0  8.7    0    0  4.3 13.0 13.0  8.7  4.3    0    0  4.3 26.1    0    0    0    0 17.4    0    0
Y     0  4.2    0    0    0    0 18.8    0    0 12.5    0    0    0    0    0  4.2    0    0    0    0  4.2  2.1  4.2    0 50.0    0
Z     0  6.7  3.3  3.3    0  3.3  3.3    0    0  3.3  3.3    0    0    0    0    0    0    0    0    0    0  6.7  6.7  6.7    0 53.3

Recognition rate = 40.8974%

Figure 6.11: AVletter dataset: confusion matrix of the visual recognizer with ISMA features.

This partitioning problem, however, is NP-hard. A commonly used approximation to find a feasible solution is the spectral clustering method [vL07]. Applying this method to the confusion matrix given in Figure 6.11 yields the mapping described in Table 6.3. Using these 13 classes instead of the 26 English letters increases the classification rate to 61.3%. As can be seen in Table 6.3, most of the letters with similar visemic transcriptions, such as C, D and T, are grouped together. S and X are also put in the same class, and Q and U are likewise considered indistinguishable.
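A minimal sketch of this grouping step is shown below: the symmetrized confusion matrix is used as an affinity matrix and spectral clustering partitions the 26 letters into 13 groups, which approximates the NP-hard principal-submatrix search described above. The clustering parameters are illustrative choices, not the exact settings used in the thesis.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def group_letters(confusion, labels, n_groups=13):
    """confusion: (26, 26) row-stochastic confusion matrix (in percent);
    labels: list of the 26 letters in the same order as the matrix rows."""
    affinity = (confusion + confusion.T) / 2.0   # symmetrize before clustering
    sc = SpectralClustering(n_clusters=n_groups, affinity="precomputed",
                            assign_labels="discretize", random_state=0)
    assignment = sc.fit_predict(affinity)
    groups = {}
    for letter, g in zip(labels, assignment):
        groups.setdefault(g, []).append(letter)
    return list(groups.values())
```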

6.6 Conclusion and Discussion

In this chapter, we developed a lipreading system based on the multiclass MABoost algorithm presented in Chapter 4 and the COBRA-selected ISCN features introduced in Chapter 3. We showed that training a multiclass classifier with 3D-SIFT features extracted from multiple color spaces and using the output posterior probabilities as visual features together with the ISCN features yields a promising lipreading system. This method obtains a 70.5% accuracy rate on the Oulu dataset, which is 6% higher than the spatio-temporal local binary pattern (LBP) based method reported in [ZBP09], and yields 63.5% digit recognition accuracy on the GRID dataset, which is 4% higher than the AAM based algorithm suggested in [LHT+09]. Moreover, both methods described in [ZBP09] and [LHT+09] employ some semi-automatic lip-region detection method, which clearly improves the final accuracy, while in our work we only used the fully automatic facial point detection method described in [Dantone:12]. ISMA features, which are the combination of MAPP and ISCN features, consistently outperform the other methods over the 3 audio-visual datasets employed in this chapter. However, the inter-speaker variation of the accuracy of the ISMA features is still not sufficiently small. While the accuracy of the lipreading system was in an acceptable range for most speakers, each dataset contained some speakers for which the accuracy was significantly lower than the average due to their particular speaking styles, lip shapes and other facial characteristics. As shown in our experiments, the mismatch between training and test conditions leads to an audio-visual recognizer with worse performance than that of the audio-only recognizer when the speech signals are clean. However, as the SNR of the audio signals decreases, the audio-visual recognizer outperforms the audio-only recognizer by a large margin. In order to obtain an audio-visual recognizer that compares favorably with the audio-only recognizer in noise-free scenarios and outperforms it in noisy environments, we need a more sophisticated information fusion algorithm that weights the audio and visual modalities according to their information values. Such a method is developed in the next chapter.

Chapter 7

Audio Visual Information Fusion

Information fusion in multi-sensory systems is a challenging task due to the fact that different sensors capture the underlying phenomena differently. This raises many problems, such as having data streams with different temporal rates, various data representations, different dynamic ranges, different sensitivity levels and different noise types. This chapter looks into the multi-modal information fusion problem in the context of audio-visual speech recognition. By investigating the sources of uncertainty in the system, i.e., the estimation and model errors, it is shown that a more suitable criterion for estimating the reliability weights assigned to the modalities is to maximize the area under a receiver operating characteristic curve (AUC), rather than existing criteria such as recognition accuracy or mutual information. Moreover, here we estimate a reliability weight for each feature. This generalizes the (conventional) two-dimensional stream weight estimation problem to a fairly high-dimensional problem. In order to efficiently estimate the reliability weights, we use a smoothed AUC function and adopt a variant of the projected gradient descent algorithm to maximize the AUC criterion in an online manner. The proposed algorithm results in a robust information fusion scheme that can cope with audio-modality failure. Moreover, it is shown that by using the one-weight-per-feature strategy the audio recognition rate is greatly improved in noisy environments.

7.1 The Problem of Audio Visual Fusion

Having multiple observation streams in a pattern recognition task, the central question is how to optimally combine their information. The various information fusion algorithms suggested in the context of audio-visual automatic speech recognition (AV-ASR) can be categorized into two groups. The first group consists of feature fusion methods, where audio and video features are concatenated to constitute a super-feature vector. Linear discriminant analysis (LDA) or other dimension-reduction methods are usually applied afterwards to reduce the dimensionality of the final feature vector. Though straightforward from a mathematical viewpoint, these approaches give the impression of mixing the unmixable and have often been reported to be inferior to decision fusion techniques [PNLM04]. The second group of approaches, i.e., decision fusion methods, analyze the different streams separately and combine the decisions of the single-modality classifiers at the state level, or even at the phoneme or word level in HMM-based recognizers. This strategy allows each classifier to be optimized for the characteristics of its input features and thus achieves higher accuracy. A common decision fusion method for combining information sources, or as we call them here, feature streams, in a statistical pattern recognition framework is to use the product rule on their posterior probabilities, i.e., Bayes fusion [KHDM98]. This approach results in a naive Bayes classifier, which coincides with the Bayes classifier when the feature streams are class-conditionally independent. Despite the theoretical appeal of Bayes fusion, in reality naive Bayes classification may severely deviate from Bayes classification for the following reasons.

1. Model error: Any assumption of a parametric distribution may cause a mismatch between the true and the estimated distribution [RCB97].

2. Estimation error: The parameter estimation error, which mainly originates from insufficient training data, may shift the decision boundary and thus impose additional uncertainty. Moreover, the mismatch between the distributions of training and test data (because of the time-varying environment) can also be seen as a result of insufficient training data.

3. Independence Assumption: The dependency among features violates the central assumption in naive Bayes classification.

Different approaches have been suggested to alleviate these problems. For instance, in [PLHZ04] the audio and visual streams are coupled through a low-dimensional random variable (in fact two-dimensional in their work). That is, the class conditional distribution consists of the product of three terms: the likelihood of the audio stream, the likelihood of the video stream and a factor that models their dependency. This interesting idea, however, only deals with the violation of the independence assumption. A more commonly used approach to tackle the above problems is to assign a reliability weight (an exponent weight) to each stream pdf. This approach has been empirically shown to be effective [WSC99], [CR98], [CDM02], [DL00] in audio-visual speech and speaker recognition, especially in the presence of varying additive acoustic noise. Its effectiveness was later theoretically explored in [PSSD06] to derive an unsupervised weight estimation scheme. The authors assumed that the difference between the weighted likelihood and the Bayes likelihood (the mismatch error) is a normally distributed random variable and showed that a proper choice of weights can minimize the variance of the mismatch error. In fact, based on their approach it can be seen that any (linear or non-linear) function of the stream pdfs may be used to minimize the variance of the mismatch error. For instance, in [LCSC05] it was proposed to use a threshold-based combination of the sum rule and the product rule to fuse the class conditional likelihoods. It is, however, noteworthy that minimizing the variance of the mismatch error does not necessarily minimize the classification error. In fact, as shown by Friedman et al. [FF97], under some conditions the exact opposite may be true, i.e., larger variances may result in lower classification error rates. No matter what functional form is assumed to combine the stream pdfs (exponent weights, linear combinations, etc.), we need a criterion to estimate the fusion parameters. Minimum word error rate is the most common criterion employed to estimate the parameters. In [PG98], for instance, a discriminative training method was proposed to minimize a smoothed error function (instead of the 0-1 loss function). Other existing criteria are the maximum SNR value and maximum mutual information [ONS99], where the reliability of each stream is directly measured by the mutual information between the stream and the HMM states. The main shortcoming of these criteria is that they are only optimal for one realization of the training data, i.e., the training data at hand. Paradoxically, however, these approaches are supposed to perform well in mismatching training and test conditions, where the test data is corrupted with different background noises and thus clearly has a different distribution than the training data. Even more importantly, it is a well-known fact that visual features are highly speaker-dependent [CHLN08]. That is, the distribution of visual features in the training data may differ from that in the test data. A justification for this rather paradoxical parameter estimation procedure is that, in the absence of information about the test condition, the obvious strategy is to minimize the error rate over the training data in the hope that it will generalize well enough to achieve satisfactory results on the test data. Adaptive weight estimation is a commonly suggested approach to tackle this problem [GVN+01], [RS12].
This approach, however, may suffer from several issues, including poor weight estimation accuracy due to its unsupervised nature, heavy computational complexity (since most of these approaches need to estimate the audio and video quality in real time) and system complexity. The main question answered in this work is whether it is possible to estimate the fusion weights in a static, supervised manner and yet achieve a robust recognition system even under mismatch conditions. The contribution of this work is threefold:

1. We suggest using the area under a receiver operating characteristic curve (AUC) to estimate the exponent weights. AUC is linearly related to the expectation of the classification accuracy and thus the weights maximizing AUC achieve satisfactory results over a wider range of mismatch conditions.

2. We assign a reliability weight to each feature rather than to each modality. This approach will in particular allow us to model the inter-dependency of visual features more effectively by increasing the number of hyper-parameters that describe the final combined pdfs.

3. In order to maximize the AUC value, a non-convex approximation of AUC is adopted and optimized by a gradient descent type algorithm.

AUC is a commonly used measure for evaluating classification rules and ranking algorithms [HT01]. Interestingly, AUC can be seen as the average of the classification rates over different classification thresholds under a uniform distribution of classification costs (equivalently, in our work, a uniform distribution over the a priori probabilities of the classes). Based on this interpretation of AUC, maximizing the AUC value is equivalent to minimizing the expectation of the classification error. We relate the notion of classification threshold to the mismatch error and show that the weights maximizing the AUC value (minimizing the average error rate) result in a more robust AV-ASR system than weights that are estimated to be optimal for only one realization of the mismatch error. Unlike the conventional methods, where only two stream weights¹, one for audio and one for video, are used to integrate the audio-visual information sources, in this work a reliability weight is assigned to each feature. This generalization is usually referred to as the weighted naive Bayes classifier in the literature. There is a rich body of theoretical and empirical work on the weighted naive Bayes classifier [ZCCW13, and references therein]. It is also a fairly common approach in audio-only ASR, though, to the best of our knowledge, it has not been utilized for AV-ASR. Moreover, we further generalize this model by assigning an individual weight vector to each class. These generalizations are beneficial for at least two reasons: First, these weights are effective tools to model the relatively high correlation among visual features. Second, different visual (audio) features may have different Bayes errors (informativeness) and different levels of robustness against the mismatch error. However, due to the increase in the number of parameters, a computationally efficient method is required to estimate the weights. To this end, we use a variant of the online projected gradient descent algorithm [Zin03] which, as shown in [ZJYH11], can efficiently maximize the AUC value over a large dataset.

7.2 Preliminaries & Model Description

Throughout this work, we consider an instance of the pattern classification problem. Given a training dataset $\mathcal{D} = \{(O_i, y_i)\}$, $1 \le i \le N$, containing audio-visual recordings (samples) $O_i$ of $K$ distinct words (classes), i.e., $y_i \in \{1, \ldots, K\}$, the goal is to train a classifier that optimally assigns a class label $y$ to any new test sample. Following the common approach in AV-ASR,

¹In fact, in most of the previous works the summation of the weights is constrained to be one and, consequently, only one weight needs to be estimated.

a multi-stream HMM is trained for each class to capture its statistical properties and represent the class conditional distribution. The models and class conditional distributions are denoted by $M_k$, $1 \le k \le K$, and $P(O|M_k)$, respectively. The Bayes classification rule can then be used to assign a class label to any given sample $O$:

$$\hat{y} = \operatorname*{argmax}_{1 \le k \le K} P(O|M_k)\,\pi_k = \operatorname*{argmax}_{1 \le k \le K} \big[\log P(O|M_k) + \log \pi_k\big] \qquad (7.1)$$

where $\pi_k = P(y = k)$ is the a priori probability of class $k$. A common practice in speech recognition, including this work, is to approximate $P(O|M)$ with the probability of the most probable state sequence of $M$ computed by the Viterbi algorithm.² Given a sample $O = \{o^1_{AV}, \ldots, o^T_{AV}\}$ (where $o^t_{AV} = [o^t_A, o^t_V]$ is the bimodal observation vector for this sample at time $t$), the state emission probability (observation probability) of an HMM in state $S$ at time $t$ can be written as:

$$P(o^t_{AV} \mid q_t = S) = P(o^t_A \mid q_t = S)^{w_A} \; P(o^t_V \mid q_t = S)^{w_V} \qquad (7.2)$$

where $w_A$ and $w_V$ are appropriate stream weights that account for the difference in reliability of each modality, and $o^t_A = [o^t_{A_1}, \ldots, o^t_{A_{|A|}}]$ and $o^t_V = [o^t_{V_1}, \ldots, o^t_{V_{|V|}}]$ are the audio and visual observation vectors, respectively. Moreover, $|A|$ and $|V|$ are the dimensionalities of the audio and visual feature vectors. This is a fairly standard multi-stream likelihood model. Assuming independence among the audio features and among the visual features, we further generalize this model by adopting a reliability weight for each feature. That is,

$$P(o^t_{AV} \mid q_t = S) = \prod_{i=1}^{|A|} P(o^t_{A_i} \mid q_t = S)^{w_{A_i}} \; \prod_{j=1}^{|V|} P(o^t_{V_j} \mid q_t = S)^{w_{V_j}} \qquad (7.3)$$

Suppose $\mathcal{Q} = (q_1, \ldots, q_T)$ is a sequence of random variables taking values in some finite set $\mathcal{S} = \{S_1, \ldots, S_n\}$, the state space. Given a model $M$ and an input sample $O = \{o^1_{AV}, \ldots, o^T_{AV}\}$, the log-likelihood can be approximated

²Since we use the weighted pdfs, this approximation does not give rise to a valid probability distribution function anymore; for simplicity, however, we still refer to it as a likelihood.

as follows:

$$\log P(O|M) \approx \log \hat{P}(O|M) = \max_{\mathcal{Q}} \log P(O, \mathcal{Q}|M) = \log P(O, \mathcal{Q}^*|M) = \sum_{t=1}^{T} \log P(o^t_{AV}|q_t) + \sum_{t=1}^{T} \log a_{q_{t-1},q_t} \qquad (7.4)$$

where $\mathcal{Q}^* = (q_1, \ldots, q_T)$ is the most probable sequence of states in $M$ computed by the Viterbi algorithm, $\hat{P}(O|M)$ is the approximation of $P(O|M)$ and $a_{q_{t-1},q_t}$ are the transition probabilities, with $a_{q_0,q_1}$ equal to 1. By substituting $P(o^t_{AV}|q_t)$ from (7.3) into (7.4), the (approximation of the) log-likelihood can be written as a linear function of the reliability weights:

$$\log \hat{P}(O|M) = \sum_{t=1}^{T} \sum_{s \in \{A,V\}} \sum_{j=1}^{|s|} w_{s_j} \log P(o^t_{s_j}|q_t) + \sum_{t=1}^{T} \log a_{q_{t-1},q_t} = \mathbf{w}^\top \mathbf{p} \qquad (7.5)$$

where $\top$ stands for the vector transpose and

$$\mathbf{w} = [w_0, w_{A_1}, \ldots, w_{A_{|A|}}, w_{V_1}, \ldots, w_{V_{|V|}}]^\top \qquad (7.6)$$
$$\mathbf{p} = \sum_{t=1}^{T} \big[\log a_{q_{t-1},q_t},\; \log P(o^t_{A_1}|q_t),\; \ldots,\; \log P(o^t_{V_{|V|}}|q_t)\big]^\top$$

Each entry $p[l]$, $l \in \{1, \ldots, |A|+|V|\}$, of $\mathbf{p}$ indicates the contribution of the corresponding feature to the total log-likelihood value. The first entry $p[0]$ is the contribution of the transition probabilities to the log-likelihood, and $w_0$ is a weight to adjust this term. $w_0$ can either be fixed to 1 or be treated as a variable to be optimized. The next step in estimating the reliability weights is to select an appropriate criterion. Denoting the criterion by $f(\log P(O|M))$, which is a function of the log-likelihood, the goal is to optimize $f$ with respect to $\mathbf{w}$. It is important to remember that, given an optimal state sequence $\mathcal{Q}^*$, the log-likelihood is a linear function of $\mathbf{w}$. However, in order to find the optimal state sequence $\mathcal{Q}^*$ itself, the Viterbi algorithm needs $\mathbf{w}$. Thus, the optimization process $\max_{\mathbf{w}} f(\log P(O|M))$ can be approximately solved by the following iterative process, which jointly optimizes $\mathbf{w}$ and $\mathcal{Q}^*$.

Algorithm 7.1: Iterative joint optimization

Input: set $\mathbf{w}(1) = [1, \ldots, 1]^\top$, $\epsilon = \infty$ and $\theta$ = stopping threshold
while $\epsilon > \theta$ do

$$\mathcal{Q}_j = \operatorname*{argmax}_{\mathcal{Q}} \log P(O, \mathcal{Q} \mid M, \mathbf{w}(j)) \qquad (7.7)$$
$$\text{Calculate } \mathbf{p}_j \text{ from (7.6)} \qquad (7.8)$$
$$\mathbf{w}(j+1) = \operatorname*{argmax}_{\mathbf{w}} f(\mathbf{w}^\top \mathbf{p}_j) \qquad (7.9)$$

$\epsilon = \|\mathbf{w}(j+1) - \mathbf{w}(j)\|_2$
$j \mathrel{+}= 1$

end
Output: The final weight vector $\mathbf{w}$.

The argument $\mathbf{w}(j)$ in (7.7) is a reminder that the Viterbi algorithm needs the values of the weights in order to compute the optimal path, and $\theta$ is a stopping threshold usually set to a small value. In our work it was set to $10^{-6}$, and at most three iterations were sufficient for convergence. Throughout the rest of the chapter we mainly focus on the criterion function $f$ and how to optimize it, i.e., equation (7.9) of Algorithm 7.1. Classification accuracy is a commonly used criterion for estimating the free parameters of a model in a classification task. However, as discussed before, this criterion results in a sub-optimal solution in mismatched training-testing scenarios. Instead, we use a more robust criterion that takes the mismatch error into account, namely maximizing the AUC value.
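To make the linear form of Eqs. (7.5)-(7.6) concrete, the following sketch accumulates the vector p from per-frame log-probabilities along a given Viterbi path and evaluates the weighted log-likelihood w^T p, which is the quantity recomputed in each iteration of Algorithm 7.1. Shapes and names are illustrative assumptions.

```python
import numpy as np

def accumulate_p(frame_logprobs, transition_logprobs):
    """frame_logprobs: (T, |A|+|V|) array of log P(o_t^{s_j} | q_t) along the Viterbi path;
    transition_logprobs: (T,) array of log a_{q_{t-1}, q_t}, with the first entry 0."""
    p0 = transition_logprobs.sum()                 # contribution of the transitions, p[0]
    return np.concatenate([[p0], frame_logprobs.sum(axis=0)])

def weighted_loglik(w, p):
    """Eq. (7.5): log P_hat(O | M) = w^T p."""
    return float(w @ p)
```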

7.3 AUC: The Area Under an ROC Curve

Receiver operating characteristic (ROC) curves are commonly used two-dimensional graphs which were originally developed in classical detection theory and later found applications in machine learning due to their invariance to monotonic score transformations and class distributions. An ROC curve is defined as a plot of the true positive (tp) rate on the Y axis against the false positive (fp) rate on the X axis, and is thus intrinsically defined for binary classification (detection) problems:

$$tp = \frac{\text{positives correctly classified}}{\text{total positives}}, \qquad fp = \frac{\text{negatives incorrectly classified}}{\text{total negatives}}$$


Figure 7.1: An example of a receiver operating characteristic curve.

From this definition, it is clear that the points lying on the Y = X line correspond to the random guessing strategy and the point (0, 1) represents optimal classification (see Figure 7.1). Given a confidence-rated classifier³, a threshold (or, in general, a decision boundary) can be used to produce the final binary decision.

³Confidence-rated classifiers yield an instance probability or score, a numeric value that represents the degree to which an instance is a member of a class.

Each threshold value produces a different point in ROC space. Clearly, when the threshold is $\infty$, both tp and fp are 0, and as the threshold approaches $-\infty$, both of these values are 1. For any threshold value in between, we get a point on a concave curve above the Y = X line, as in Figure 7.1. The AUC is then simply the area under the ROC curve. Following the AUC interpretation suggested in [CM03], consider a binary classification task with $m$ positive samples and $n$ negative samples. A confidence-rated classifier assigns a score to each sample. We assume the classifier outputs for the positive samples $x_1^+, \ldots, x_m^+$ and the negative samples $x_1^-, \ldots, x_n^-$ are ordered. The AUC is then calculated as:

$$A = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} I_{(x_i^+ > x_j^-)} \qquad (7.10)$$

where $I_{(\cdot)}$ is one if $(\cdot)$ holds, and zero otherwise. We remark that the AUC value is exactly the probability $P(X^+ > X^-)$, where $X^+$ is the random variable corresponding to the distribution of the outputs for the positive samples and $X^-$ the one corresponding to the negative samples. This interpretation was in fact the inspiration for using the AUC in this work. Under this interpretation, the AUC can be seen as a measure of the ranking quality of the classifier. As the AUC value increases, it becomes more probable that the score of a positive sample is larger than the score of any negative sample in the dataset. That is, in an audio-visual dataset, the scores of positive samples are likely to be higher than those of negative samples, independent of their speakers and the mismatch error corresponding to the samples. The AUC, however, can also be seen from another point of view, namely as the expectation of the classification accuracy. Consider a binary classification problem with two HMMs $M_1$ and $M_2$ trained with samples from the first and second class, respectively. We distinguish the "error free" class conditional probabilities (free from estimation error, but not from modeling error) from the estimated likelihoods by the subscript B, i.e., $P_B(O|M_k)$. Since we need a confidence-rated classifier, we have to define a discriminant function as follows to assign a score to each sample.

$$L(O) = \log P_B(O|M_1) - \log P_B(O|M_2) + \log\left(\frac{\pi_1}{\pi_2}\right) \qquad (7.11)$$

From (7.1) it is clear that, based on the optimal classification rule, $O$ belongs to class 1 if $L(O) \ge t$, and to class 2 otherwise, where $t$ is the classification

threshold, which is equal to zero in the Bayes rule. Let $g_1(L) = P(L(O) \mid y = 1)$ be the probability distribution function of the scores of the samples belonging to class 1 and $g_2(L)$ be the pdf of the scores of class 2. The classification error rate can then be written as:

$$e = c\,\pi_1 \int_{-\infty}^{t} g_1(L)\,dL + (2-c)\,\pi_2 \int_{t}^{+\infty} g_2(L)\,dL \qquad (7.12)$$

where $0 \le c \le 2$ is the classification cost. Consider both the classification cost and the threshold as random variables. Based on the following lemma of Flach et al. [FHF11], the AUC can be seen as a linear function of the expectation of the classification error $e$, where the expectation is taken with respect to $c$ and $t$.

Lemma. The expected error for a uniform cost distribution and uniform instance selection decreases linearly with the AUC as follows:

$$E_{c,t}\{e\} = 2\pi_1\pi_2(1 - \mathrm{AUC}) + \frac{\pi_1^2 + \pi_2^2}{2} \qquad (7.13)$$

where $E_x$ denotes the expectation with respect to the random variable $x$. Thus, the AUC is in fact the average of the classification rates over different classification thresholds and different classification costs. In the next subsection we connect the notions of classification threshold and classification cost to the mismatch error and show that, according to the above lemma, the AUC can be interpreted as the expectation of the classification error over different mismatch conditions.
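The pairwise form of Eq. (7.10) translates directly into a few lines of code; the sketch below estimates the AUC as the fraction of correctly ordered (positive, negative) score pairs. Names are illustrative.

```python
import numpy as np

def pairwise_auc(pos_scores, neg_scores):
    """Eq. (7.10): (1/mn) * sum_i sum_j I(x_i^+ > x_j^-)."""
    pos = np.asarray(pos_scores)[:, None]   # shape (m, 1)
    neg = np.asarray(neg_scores)[None, :]   # shape (1, n)
    return float(np.mean(pos > neg))
```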

7.3.1 AUC and Mismatch

Let us denote the difference between the true discriminant function and the estimated discriminant function by $Z(O)$, i.e.,

$$L(O) - \hat{L}(O) = Z(O) \qquad (7.14)$$

where $\hat{L}$ is the estimated discriminant function. In the absence of a priori knowledge about the mismatch error, the classification decision can be made with $\hat{L}$, that is, a sample $O$ belongs to class 1 if $\hat{L}(O) \ge 0$, and to class 2 otherwise. However, this classification rule is equivalent to $L(O) \ge Z(O)$. If we further assume that the mismatch error $Z(O) = Z$ is constant over all the samples in a given dataset, then $Z$ can be seen as the classification threshold. In other words, the mismatch error results in shifting the classification threshold from zero to $Z$. Note that $Z$ is a random variable varying from one dataset to another. Denote the mismatch errors of the training data and the test data by $Z_{tr}$ and $Z_{te}$, respectively. The reliability weights are usually selected in order to minimize the classification error over the training data, which is given by:

$$e_{tr} = \pi_1 \int_{-\infty}^{0} g_1(\hat{L}(\mathbf{w}))\,d\hat{L} + \pi_2 \int_{0}^{+\infty} g_2(\hat{L}(\mathbf{w}))\,d\hat{L} = \pi_1 \int_{-\infty}^{Z_{tr}} g_1(L)\,dL + \pi_2 \int_{Z_{tr}}^{+\infty} g_2(L)\,dL \qquad (7.15)$$

where the argument $\mathbf{w}$ emphasizes that the estimated score $\hat{L}(\mathbf{w})$ depends on the weight vector $\mathbf{w}$. However, since $Z_{tr}$ is likely to be different from $Z_{te}$, the estimated weights are not optimal for the test condition. Moreover, the a priori probabilities of the test data may differ significantly from those of the training data. This difference can be modeled by the classification cost $c$ in (7.13). Thus, a more reasonable strategy for selecting the reliability weights is to minimize $E_{c,Z}\{e\}$, where the expectation over $c$ averages over different class priors and the expectation over $Z$ accounts for different levels of mismatch error. Minimizing this expectation is equivalent to maximizing the AUC. It is important to note that this justification is only valid under some strong assumptions, i.e., (I) the mismatch error $Z(O) = Z$ is constant over all samples and (II) the multiclass AUC has similar properties to the binary AUC. Both of these assumptions may be violated in reality. Nevertheless, as we see in the experiments, the AUC maximization results in a robust weight estimation strategy. To adapt the AUC to our problem, we have to address three issues:

1. Generalize it to multiclass classification.
2. Transform the HMM outputs to comparable scores.
3. Reduce the computational complexity of the AUC computation.

These issues are discussed in the following sections.

7.3.2 AUC Generalization to the Multiclass Problem

For more than two classes, the AUC is too complex to be calculated. With only 3 classes, the ROC curve would have to be plotted in a 6-dimensional space [Lan00] and, in general, the dimension of an ROC curve grows with the square of the number of classes, $O(n^2 - n)$. To handle the multiclass situation, different methods have been suggested to approximate the multiclass AUC, including the one-versus-all method [PD00], pairwise AUC averaging [HT01], etc. Here, we used the one-versus-all approach due to its computational efficiency. For a k-class problem, it only needs to average over k AUC values:

$$A_{total} = \sum_{i=1}^{k} \pi_i A_i \qquad (7.16)$$

where $A_i$ is the AUC value of a binary classification task in which the $i$-th class is considered the positive class and the rest are considered negative. Equation (7.16) is an average of AUCs weighted by the class priors $\pi_i$. The disadvantage of this approach is that the AUC is now sensitive to the class distributions.
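Equation (7.16) can be sketched as a prior-weighted average of K one-versus-all AUC values computed from the class scores; the snippet below is a minimal illustration with assumed array shapes and names.

```python
import numpy as np

def multiclass_auc(scores, labels, priors):
    """scores: (n_samples, K) class scores L_k(O); labels: (n_samples,) in {0..K-1};
    priors: (K,) class priors pi_k."""
    K = scores.shape[1]
    total = 0.0
    for k in range(K):
        pos = scores[labels == k, k]
        neg = scores[labels != k, k]
        total += priors[k] * np.mean(pos[:, None] > neg[None, :])  # one-vs-all AUC A_k
    return float(total)
```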

7.3.3 Transforming HMM Outputs

Throughout the rest of this work, we assume $M_1$ to $M_K$ are $K$ given HMMs and the task is to assign a class label $k \in \{1, \ldots, K\}$ to each sample (utterance) of the dataset. Furthermore, we assume the dataset is balanced, that is, the class frequencies are equal. Restating the result of Section 7.2, the log-likelihood of each sample can be seen as a linear function of the reliability weights:

$$\log \hat{P}(O|M_k) = \log P(O, \mathcal{Q}^*|M_k) = \mathbf{w}^\top \mathbf{p}_k \qquad (7.17)$$

where the subscript $k$ in $\mathbf{p}_k$ emphasizes that $\mathbf{p}_k$ is the sum of the observation log-probabilities of model $M_k$. In the following, the hat of $\log \hat{P}$, which indicates that it is the Viterbi approximation of the log-likelihood, is dropped for ease of notation. Since $\log P(O|M_k)$ is a linear function of $\mathbf{w}$, the reliability weight vector $\mathbf{w}$ can be interpreted as a linear classifier or a one-layer perceptron. This interesting fact allows one to efficiently estimate the weight vector with respect to different criteria and, more importantly, to generalize the single weight set to $K$ weight sets, that is, one weight set for each pattern. This generalization is further explored in the following sections. The score in (7.17) cannot be utilized as it is for the AUC calculation. From (7.5) it is clear that for the AUC computation the scores should be comparable, while $\log P(O|M_k)$ depends on the observation probability $P(O)$ and is thus

not comparable from one utterance to another. In the previous section we used the logarithm of the likelihood ratio for the binary case (7.11). In the multiclass case, however, there are various ways to define the likelihood ratio. Here we use the following definition for the multiclass likelihood ratio (MLR):

$$MLR_k = \frac{P(O|M_k)^K}{\prod_{i=1}^{K} P(O|M_i)} \qquad (7.18)$$

This quantity is a generalization of the likelihood ratio in a binary classification task. Since it is independent of the sample probability $P(O)$, it is comparable among different samples and thus suitable for our framework. To make the classification decision, we need to compute $K$ likelihood ratios for each sample. The final decision rule is that the sample belongs to the class with the maximum likelihood ratio value⁴. Now, by taking the logarithm of the MLR and replacing the $\log P(O|M_k)$ terms by their approximations $\mathbf{w}^\top \mathbf{p}_k$, we get:

$$L_k(O) = \log MLR_k = \mathbf{w}^\top \Big(K\,\mathbf{p}_k - \sum_{i=1}^{K} \mathbf{p}_i\Big) = \mathbf{w}^\top \mathbf{x}_O^k \qquad (7.19)$$

where $\mathbf{x}_O^k = K\,\mathbf{p}_k - \sum_{i=1}^{K} \mathbf{p}_i$ is simply the distance of the $k$-th log-likelihood of $O$ from the average of the log-likelihoods of the $K$ HMMs. The larger it is, the more likely it is that $O$ belongs to class $k$. The obtained score $L_k(O)$ is independent of the sample probability, meaning that given two samples $O_1$ and $O_2$ one may compare their scores $L_k(O_1)$ and $L_k(O_2)$ to conclude which one is represented better by model $M_k$. As we will see in the next section, this score will be used to compute the AUC.
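A small sketch of Eqs. (7.18)-(7.19): given the per-model vectors p_k of an utterance, the K scores L_k(O) are obtained as linear functions of the weight vector. Names and shapes are illustrative assumptions.

```python
import numpy as np

def mlr_scores(w, p_list):
    """p_list: list of K vectors p_k (one per HMM); returns the K scores L_k(O)."""
    P = np.stack(p_list)                 # shape (K, N)
    K = P.shape[0]
    X = K * P - P.sum(axis=0)            # rows are x_O^k = K p_k - sum_i p_i
    return X @ w                         # L_k(O) = w^T x_O^k, Eq. (7.19)
```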

7.3.4 Online AUC Maximization

The AUC metric is written as a sum of pairwise losses between samples from different classes. That is, the computational complexity of the AUC is quadratic in the number of samples.

⁴This can also be seen as a one-versus-all approach to convert a multiclass problem into K binary classification tasks. Other methods could be used as well, including pairwise comparisons or the min-margin ratio P(O \mid M_k) / \max_i P(O \mid M_i).

Consider a dataset \mathcal{D} = \{(O_i, y_i) \in \mathbb{R}^N \times \{1, ..., K\} \mid i \in \{1, ..., BK\}\} with B samples from each of the K classes and N = |A| + |V| + 1. We split \mathcal{D} into K subsets, \mathcal{D}_k = \{(O_i, y_i) \mid i \in \mathcal{D} \;\&\; y_i = k\}. Combining (7.10) and (7.16), the AUC value for a multiclass classifier can be computed as:

A = \frac{1}{Z} \sum_{k=1}^{K} \sum_{j \in \mathcal{D}_k} \sum_{n \notin \mathcal{D}_k} \mathbb{I}\big(L_k(O_j) \ge L_k(O_n)\big)
  = \frac{1}{Z} \sum_{k=1}^{K} \sum_{j \in \mathcal{D}_k} \sum_{n \notin \mathcal{D}_k} \mathbb{I}\big(w^\top x_{O_j}^k \ge w^\top x_{O_n}^k\big)
  = 1 - \frac{1}{Z} \sum_{k=1}^{K} \sum_{j \in \mathcal{D}_k} \sum_{n \notin \mathcal{D}_k} \mathbb{I}\big(w^\top x_{O_j}^k \le w^\top x_{O_n}^k\big) \qquad (7.20)

where Z is a normalization factor. Maximizing the AUC is equivalent to minimizing the summation of the \mathbb{I}(w^\top x_{O_j}^k \le w^\top x_{O_n}^k) terms in (7.20). Unfortunately, this is a combinatorial optimization problem due to the non-convexity of the indicator function. In order to obtain a convex optimization problem, a common approach is to approximate the indicator function with a convex surrogate. Following the suggestion in [ZJYH11], we may replace the indicator function with the hinge loss function,

\ell(w, x_{O_j}^k - x_{O_n}^k) = \max\big(0,\; 1 - w^\top (x_{O_j}^k - x_{O_n}^k)\big) \qquad (7.21)

as shown in Figure 7.2. However, a drawback of this convex function, and in general of any other convex loss, is that it assigns large loss values to outliers, making the algorithm focus only on the outliers. This phenomenon has been detected and extensively studied in the context of boosting classifiers [LS10], which in some sense is a dual form of the online learning [FS97] that we use here. As a remedy, we suggest using a hinge loss with a level threshold,

\ell(w, x_{O_j}^k - x_{O_n}^k) = \min\Big(T_0,\; \max\big(0,\; 1 - w^\top (x_{O_j}^k - x_{O_n}^k)\big)\Big) \qquad (7.22)

where T_0 is the cut-off threshold as depicted in Figure 7.2. The loss function is now non-convex and the employed online minimization algorithm cannot guarantee to reach the global optimum.


Figure 7.2: Illustrating hinge function and its non-convex thresholded version.
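A minimal sketch of the two surrogate losses of (7.21) and (7.22), with hypothetical helper names:

```python
import numpy as np

def hinge(w, delta_x):
    # standard hinge surrogate of (7.21); delta_x = x_Oj^k - x_On^k
    return max(0.0, 1.0 - float(w @ delta_x))

def thresholded_hinge(w, delta_x, T0):
    # level-thresholded hinge of (7.22): the loss of badly misranked
    # (outlier) pairs is capped at T0
    return min(T0, hinge(w, delta_x))
```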

Having (7.20) and (7.22) in hand, the AUC maximization problem can be formulated as:

\min_{w}\; \frac{1}{2}\|w\|_2^2 + C \sum_{k=1}^{K} \sum_{j \in \mathcal{D}_k} \sum_{n \notin \mathcal{D}_k} \ell(w, x_{O_j}^k - x_{O_n}^k) \qquad (7.23)

\text{s.t.} \quad \sum_i w_i = 1, \quad 0 \le w_i \quad \forall i \in \{0, \ldots, |A| + |V|\}

where \frac{1}{2}\|w\|_2^2 is a quadratic regularization term added to control the solution complexity. Other regularization terms, such as the entropy function or L1-norm regularization, are also commonly used in the literature. Moreover, C is a penalty parameter of the error term. The constraints in (7.23) guarantee that the solution lies on the probability simplex.

The constraint set, which may be unnecessary for information fusion, is useful for modeling the inter-feature correlations. On a probability simplex, if one weight increases, the sum of the rest of the weights decreases. So projecting onto a probability simplex tries to alleviate the violation of the independence assumption by correlating the reliability weights with each other. It is thus conceivable that projecting onto different convex sets results in different levels of correlation modeling. For instance, projecting onto a unit hypercube (0 \le w_i \le 1) gives less credit to the features' correlations, and projecting onto \mathbb{R}^+ (0 \le w_i) assumes almost no correlation among the features.

Irrespective of the constraints, the minimization in (7.23) has two important properties: it is a batch minimization and it is quadratic in the number of samples. These properties raise two issues. First, as the size of the training data increases, the computational cost increases quadratically, making it prohibitive for large datasets. Second, when new training material arrives, the minimization problem has to be solved again and the weights cannot simply be updated. In other words, this framework is not suitable for adaptive weight adjustment. We require an online optimization framework with low computational and memory complexity that can be updated as new samples become available. To this end, we utilize the online AUC maximization framework suggested in [ZJYH11] with some small modifications to fit our multiclass problem. Since the standard online optimization framework is the sum of the losses of individual samples, we rewrite (7.23) to fit this framework, i.e.,

\frac{1}{2}\|w\|_2^2 + C \sum_{t=1}^{T} \mathcal{L}_t(w) \qquad (7.24)

where,

\mathcal{L}_t(w) = \mathbb{I}_{(y_t = 1)}\, h_1^t(w) + \cdots + \mathbb{I}_{(y_t = K)}\, h_K^t(w) \qquad (7.25)

h_i^t(w) = \sum_{t'=1}^{t-1} \mathbb{I}_{(y_{t'} \ne i)}\, \ell(w, x_{O_t}^i - x_{O_{t'}}^i), \qquad i \in \{1, \ldots, K\}

As T tends towards infinity, the objective in (7.24) approaches the objective function in (7.23), since it is an unbiased estimate of it. The projected gradient descent algorithm introduced in [Zin03] can at this step be utilized to solve this optimization problem. The weight update strategy based on the projected gradient descent algorithm is listed in Algorithm 7.2.

Algorithm 7.2: Single Vector: Projected Gradient Descent
    Input: set w^1 = [1/N, ..., 1/N]^\top.
    for t = 1, ..., T do
        receive (O_t, y_t)
        w^{t+1} = \mathrm{proj}_{\triangle}\big(w^t - \eta\, \frac{\partial}{\partial w} \mathcal{L}_t(w)\big)
    end
    Output: the final weight vector w^T.

In Algorithm 7.2, \mathrm{proj}_{\triangle}(\cdot) is a projection function, \triangle is the probability simplex and \eta is a learning rate. Projection onto the probability simplex is defined as \mathrm{proj}_{\triangle}(z) = \arg\min_{x \in \triangle} \|z - x\|_2^2. As can be seen, the projected gradient descent algorithm needs to compute the (sub)gradient of \mathcal{L}_t(w) at each update round⁵. For the thresholded hinge loss function, for instance, the subgradient with respect to w when y_t = i is:

\frac{\partial}{\partial w} \mathcal{L}_t(w) = \sum_{t'=1}^{t-1} \big(x_{O_{t'}}^i - x_{O_t}^i\big) \qquad (7.26)

when 0 \le 1 - w^\top (x_{O_t}^i - x_{O_{t'}}^i) \le T_0, and zero otherwise. Since all the previous samples are required to compute this subgradient, Zhao et al. [ZJYH11] introduced a buffer-based algorithm that only stores the last few samples to be used for the subgradient computation. Furthermore, they showed that the difference between the optimal solution and the solution obtained by the buffer-based method is bounded. Although in this work we set the buffer size to the number of samples in the training data, using this trick can be quite practical for large datasets or when the weights have to be updated in real time.
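Putting the pieces together, the sketch below is one possible NumPy rendering of Algorithm 7.2 with the buffer trick of [ZJYH11]. The stream interface, buffer handling and simplex-projection routine are our own choices; they illustrate the update rule under these assumptions and are not the exact implementation used in this work.

```python
import numpy as np
from collections import deque

def project_simplex(v):
    # Euclidean projection onto the probability simplex {x >= 0, sum(x) = 1}
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def online_auc_max(stream, N, K, eta=0.01, T0=2.0, buffer_size=100):
    # stream yields (X, y): X is a (K, N) array whose k-th row is x_O^k of (7.19),
    # y in {0, ..., K-1}; returns the reliability weight vector w
    w = np.full(N, 1.0 / N)
    buffers = [deque(maxlen=buffer_size) for _ in range(K)]  # per-class history
    for X, y in stream:
        grad = np.zeros(N)
        # pair the new sample (positive for its class y) with buffered samples
        # of the other classes; thresholded-hinge subgradient as in (7.26)
        for k in range(K):
            if k == y:
                continue
            for X_old in buffers[k]:
                margin = 1.0 - w @ (X[y] - X_old[y])
                if 0.0 <= margin <= T0:
                    grad += X_old[y] - X[y]
        w = project_simplex(w - eta * grad)
        buffers[y].append(X)
    return w
```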

7.4 Multi-vector Model

As mentioned before, interpreting the weight vector as a linear classifier enables us to further generalize this notion. Here we use the multi-vector generalization of the linear classifier to address the multiclass problem. In this setting, an individual weight vector is dedicated to each class, i.e., there are K different weight vectors in our problem. Let us denote the weight matrix W = [w_1, w_2, ..., w_K],

⁵In our case it is in fact a subgradient, since the hinge loss is not differentiable.

where w_i is the weight vector of HMM M_i. While the main body of the proposed algorithm for the single-vector case remains intact, some modifications are still required. As discussed in Section 7.3, when the AUC is used as a criterion, the scores of the utterances should be comparable, that is, they should be independent of the observation probability P(O). In the single-vector case, it is sufficient to use the MLR in (7.18) to satisfy this requirement. However, in the multi-vector setting, where each \log P(O \mid M_k) = w_k^\top p_k has its own weight vector, P(O) is not removed from equation (7.18) (since it is weighted with different values) and thus the MLR is no longer P(O)-independent. Therefore, in the multi-vector case, instead of the MLR we use a heuristic to achieve approximately P(O)-independent scores. Given an observation sequence O_r = \{o_1, ..., o_{T_r}\} of length T_r, we normalize its corresponding likelihood vector p by the number of observations T_r to make the score of a given sequence independent of its length. Similar to the single-vector case in (7.19), the score is

L_k(O_r) = \frac{1}{T_r}\, w_k^\top \Big( K p_k - \sum_{i=1}^{K} p_i \Big) = w_k^\top x_{O_r}^k \qquad (7.27)

where x_{O_r}^k = \frac{1}{T_r}\big( K p_k - \sum_{i=1}^{K} p_i \big). With this modification in hand, it is straightforward to generalize Algorithm 7.2 to the multi-vector case. To this end, let us rewrite (7.27) in the matrix-trace form:

L_k(O_r) = \mathrm{Tr}\big(D_k W^\top X_{O_r}\big) \qquad (7.28)

where \mathrm{Tr}(\cdot) stands for the trace of a matrix, D_k is a diagonal matrix whose only non-zero element, in its k-th diagonal position, is set to 1, and X_{O_r} = [x_{O_r}^1, ..., x_{O_r}^K]. Then, the AUC optimization (7.23) in the multi-vector case is:

\min_{W}\; \frac{1}{2}\|W\|_2^2 + C \sum_{k=1}^{K} \sum_{j \in \mathcal{D}_k} \sum_{n \notin \mathcal{D}_k} \ell(W, x_{O_j}^k - x_{O_n}^k) \qquad (7.29)

\text{s.t.} \quad W^\top \mathbf{1}_{(|A|+|V|+1) \times 1} = \mathbf{1}_{K \times 1}, \quad \{W_{i,j}\} \ge 0

where \mathbf{1} is an all-one vector, \{W_{i,j}\} stands for all entries of W, \|W\|_2^2 is defined as \mathrm{Tr}(W^\top W) and

\ell(W, x_{O_j}^k - x_{O_n}^k) = \min\Big(T_0,\; \max\big(0,\; 1 - \mathrm{Tr}\big(D_k W^\top (X_{O_j} - X_{O_n})\big)\big)\Big) \qquad (7.30)

Similar to the single-vector case, this optimization can be solved by using the projected gradient descent algorithm. For the t-th sample, belonging to the i-th class, the subgradient is equal to:

\frac{\partial}{\partial W} \mathcal{L}_t(W) = \sum_{t'=1}^{t-1} \frac{\partial}{\partial W}\, \mathrm{Tr}\big(D_i W^\top (X_{O_{t'}} - X_{O_t})\big) = \sum_{t'=1}^{t-1} \big[\, 0, \ldots, 0,\; x_{O_{t'}}^i - x_{O_t}^i,\; 0, \ldots, 0 \,\big] \qquad (7.31)

when 0 \le 1 - \mathrm{Tr}\big(D_i W^\top (X_{O_t} - X_{O_{t'}})\big) \le T_0, and zero otherwise. In (7.31), the columns 0 denote all-zero vectors. As can be seen, all columns of the subgradient matrix except one, the i-th column, are zero. Thus, at each update step t only one column of W is updated.

Algorithm 7.3: Multi Vector: Projected Gradient Descent
    Input: set W^1 = [\frac{1}{N}\mathbf{1}, ..., \frac{1}{N}\mathbf{1}].
    for t = 1, ..., T do
        receive (O_t, y_t), set i = y_t
        W_i^{t+1} = \mathrm{proj}_{\triangle}\big(W_i^t - \eta\, \frac{\partial}{\partial W} \mathcal{L}_t(W)\big|_{i\text{-th column}}\big)
    end
    Output: the final weight matrix W^T.

In Algorithm 7.3, W_i^t denotes the i-th column of W at round t and the subgradients are computed as in (7.31). Comparing Algorithms 7.2 and 7.3 reveals that the single- and multi-vector cases have equal computational complexity. However, it is important to note that in the multi-vector case, W_i is on average updated only T/K times after T iterations and thus its convergence is K times slower than in the single-vector case.
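A short numerical check, using the dimensions of our setting (159 features, 10 digit classes): picking out the k-th diagonal selector D_k in the trace form (7.28) reproduces exactly the per-class score w_k^T x_O^k of (7.27).

```python
import numpy as np

def multivector_scores(W, X):
    # column k of W (weights of class k) against column k of X_O = [x_O^1, ..., x_O^K]
    return np.einsum('nk,nk->k', W, X)

rng = np.random.default_rng(0)
N, K = 159, 10
W, X = rng.random((N, K)), rng.random((N, K))
for k in range(K):
    D_k = np.zeros((K, K))
    D_k[k, k] = 1.0
    assert np.isclose(np.trace(D_k @ W.T @ X), multivector_scores(W, X)[k])
```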

7.5 Experiments on AUC-based Fusion Strategy

In the following experiments, the proposed weighting estimation scheme is evaluated in different scenarios. The sound units are words and the task at hand is to develop a robust audio-visual digit recognition system. Here, the CUAVE dataset [PGTG02] was used, which contains the digits from zero to nine repeated five times by 36 speakers (see Section 2.2.1). In order to have a speaker-independent evaluation, the one-speaker-out cross-validation strategy was utilized. The audio signals were corrupted by additive white noise at the SNR levels {-20, -15, -10, -5, 0, 5, 10, 15, ∞} dB, while the video signals were clean in all experiments.

For digit recognition, we used a GMM-HMM based recognizer similar to the V-ASR introduced in Chapter 6. As in that chapter, the ISMA features, which consist of COBRA-selected ISCN features and 10 posterior probabilities of Mu-MABoost, were considered to be the visual features. Compared with Chapter 3, here a slightly smaller subset of ISCN features, i.e., 30 features, was selected. By including the first- and second-order derivatives of the visual features, the final 120 visual features used to train the GMM-HMMs were obtained.

As audio features, the mel-frequency cepstral coefficients (MFCC) were used. Thirteen MFCC features and their first- and second-order derivatives were extracted from frames with 20 ms duration and 10 ms overlap. That is, the audio frame rate was 100 frames per second. Visual features were linearly interpolated to increase their frame rate from 60 to 100 in order to have an equal number of audio and visual features per utterance. Finally, both audio and visual features were normalized per speaker by subtracting their mean values for each speaker.

The audio-visual feature vectors were employed to train 10 HMMs, one HMM per digit. Each Markov state in the HMMs was modeled with two Gaussian components with diagonal covariance matrices. The number of emitting states in all HMMs was empirically chosen to be nine. Moreover, the popular HTK toolkit [YEG+06] was used to train the HMMs. It is important to remark that during training the audio and visual features had equal weights.

In the first experiment, the accuracy of digit classification was evaluated by using three weighting strategies: (I) the optimal strategy, (II) minimizing the classification error and (III) the proposed strategy. The optimal strategy needs a priori knowledge about the test conditions. In this setting, the weights are selected in order to minimize the classification error rate on each particular test set, i.e., an individual weight set for each SNR value. The second strategy is to estimate the weights by minimizing the error rate over the clean training data. This strategy is called min-error in Figure 7.3 and its performance over different SNR values is depicted with a solid red line.


Figure 7.3: Comparison between optimal (no mismatch), maximum AUC and mini- mum classification error based strategies for estimating the feature weights.

The proposed strategy (in the single-vector case), which is called max-AUC in Figure 7.3, chooses the reliability weights such that the approximation of the AUC over the clean training data is maximized. The first two methods serve as references to compare with our approach.

As seen in Figure 7.3, the proposed strategy is very robust against mismatch conditions and works uniformly well over the different SNR regions, while the min-error approach shows a significant performance degradation as the SNR value decreases. This figure clearly shows that employing an appropriate criterion to estimate the weights is essential to achieve robustness against varying conditions. Figure 7.3 summarizes the main contribution of this work: employing a set of fixed reliability weights, chosen in accordance with a robust criterion, can achieve close-to-optimal accuracy over a wide range of SNR values.

SNR      Weighting     Audio-only   Video-only   Audio-visual
∞ dB     159W          98.00        81.69        96.35
         2W            97.72        81.35        95.38
         159W-match    98.00        81.69        96.35
         2W-match      97.72        81.35        95.38
10 dB    159W          93.08        81.69        93.04
         2W            92.81        81.35        91.64
         159W-match    93.53        81.69        94.11
         2W-match      92.81        81.35        91.92
5 dB     159W          83.94        81.69        91.14
         2W            84.28        81.35        89.61
         159W-match    84.26        81.69        91.54
         2W-match      84.28        81.35        90.00
0 dB     159W          65.17        81.69        87.74
         2W            64.26        81.35        86.51
         159W-match    67.15        81.69        89.14
         2W-match      64.26        81.35        86.51

Table 7.1: Classification rates in different noise levels for single-vector setting. 159W rows report the accuracies for one weight per feature (159) case and 2W rows are for one weight per modality. 159W-match and 2W-match represent the scenario where the reliability weights are estimated from match-to-test datasets.

In the second set of experiments, reported in Tables 7.1 and 7.2, the ten HMMs were trained with clean audio-visual data and tested under different noisy conditions. The weights were estimated by maximizing the AUC with the non-convex loss for two cases: one weight for each feature (159 in total), which we refer to as 159W in our tables, and one weight for each stream (only 2 weights), called 2W. Moreover, we also report the results of a scenario where the weights were estimated from match-to-test data, called 159W-match and 2W-match in Tables 7.1 and 7.2. The match-to-test data was constructed by randomly selecting 10% of the training data and adding appropriate additive noise to it to match the test scenario.

SNR      Weighting     Audio-only   Video-only   Audio-visual
-5 dB    159W          39.68        81.69        85.26
         2W            36.85        81.35        84.18
         159W-match    44.08        81.69        85.53
         2W-match      36.85        81.35        84.18
-10 dB   159W          20.22        81.69        82.24
         2W            17.46        81.35        82.00
         159W-match    24.21        81.69        83.57
         2W-match      17.46        81.35        82.06
-15 dB   159W          12.47        81.69        80.58
         2W            11.08        81.35        80.67
         159W-match    14.32        81.69        82.14
         2W-match      11.08        81.35        81
-20 dB   159W          8.889        81.69        79.05
         2W            8.333        81.35        79.78
         159W-match    8.667        81.69        81.56
         2W-match      8.333        81.35        80.06

Table 7.2: Classification rates in different noise levels for single-vector setting. 159W rows report the accuracies for one weight per feature (159) case and 2W rows are for one weight per modality. 159W-match and 2W-match represent the scenario where the reliability weights are estimated from match-to-test datasets.

The motivation for running this experiment was to estimate how much additional classification accuracy could be obtained given a hypothetical adaptive weight estimation algorithm (one that can adaptively update the reliability weights to match the test conditions). In reality, however, an adaptive algorithm can never precisely estimate the test conditions. Thereby, the improvements reported in Tables 7.1 and 7.2 for the match case should be interpreted as being overly optimistic. The classification accuracies for the audio-only, video-only and audio-visual classification systems are reported in the second to fourth columns of Tables 7.1 and 7.2, respectively.

SNR      Weighting     Audio-only   Video-only   Audio-visual
∞ dB     159W          96.96        82.92        96.69
         2W            97.06        78.44        94.57
         159W-match    96.96        82.92        96.69
         2W-match      97.06        78.44        94.57
10 dB    159W          91.49        82.92        96.53
         2W            92.06        78.44        93.00
         159W-match    92.28        82.92        96.58
         2W-match      92.06        78.44        93
5 dB     159W          84.97        82.92        95.29
         2W            85.58        78.44        91.21
         159W-match    88.03        82.92        95.61
         2W-match      85.58        78.44        91.26
0 dB     159W          73.04        82.92        93.4
         2W            74.01        78.44        88.78
         159W-match    79.37        82.92        93.85
         2W-match      73.96        78.44        88.78

Table 7.3: Classification rates (in percentage) at different noise levels for multi-vector case. 159W rows report the accuracies for one weight per feature (159) case and 2W rows are for one weight per stream case. 159W-match and 2W-match represent the scenario where the reliability weights are estimated from match-to-test datasets.

Two interesting trends can be detected in Tables 7.1 and 7.2. First, the one-weight-per-feature case leads to a 1% to 2% classification rate improvement in the high-SNR regime, i.e., above -5 dB. However, as the SNR value declines, the advantage of using a large number of weights (here 159) decreases, and at -15 dB it is observed that one weight per modality results in a slightly better performance than the 159W case. That is, the mismatch between training and test conditions may more severely affect a model with 159 hyper-parameters than a model with only 2 hyper-parameters.

SNR      Weighting     Audio-only   Video-only   Audio-visual
-5 dB    159W          52.18        82.92        89.14
         2W            54.46        78.44        84.69
         159W-match    62.68        82.92        90.08
         2W-match      54.35        78.44        84.75
-10 dB   159W          31.10        82.92        82.99
         2W            34.29        78.44        80.13
         159W-match    44.85        82.92        86.42
         2W-match      34.53        78.44        80.18
-15 dB   159W          18.65        82.92        77.10
         2W            21.67        78.44        75.47
         159W-match    29.40        82.92        81.71
         2W-match      21.72        78.44        75.36
-20 dB   159W          11.94        82.92        70.53
         2W            13.58        78.44        72.07
         159W-match    20.31        82.92        78.10
         2W-match      13.58        78.44        72.18

Table 7.4: Classification rates (in percentage) at different noise levels. 159W rows report the accuracies for the one-weight-per-feature (159) case and 2W rows are for the one-weight-per-stream case. 159W-match and 2W-match represent the scenario where the reliability weights are estimated from match-to-test datasets.

The second trend detected in Tables 7.1 and 7.2 is that in the match-case scenario, the one weight per stream (2W-match) does not show a meaningful improvement over the 2W case, which may seem surprising. In fact, in most cases, the accuracies obtained by 2W-match are very close to the 2W case. The ineffectiveness of the 2W setting in adapting to a particular condition is due to the fact that it does not introduce sufficient degrees of freedom into the model to properly represent the complex reality. However, in the 159W-match cases, where the weights were estimated from a match-to-test training set, significant improvements over the mismatch cases can be observed at all SNRs. That is, given an adaptive algorithm that can update the weights in order to match the test conditions, it is more rewarding to employ a larger set of weights than only two stream weights.

In the third experiment, reported in Tables 7.3 and 7.4, we evaluated the performance of the multi-vector setting where 10 different weight vectors are estimated, one for each digit. As in the second experiment, 159W represents the one-weight-per-feature case and 2W stands for one weight per modality. An immediate observation is that the multi-vector setting greatly outperforms the single-vector setting in a wide range of SNRs from 10 dB to -10 dB. It seems this improvement is due to the fact that the multi-vector setting significantly improves the quality of the audio-only recognizer. For instance, at -5 dB the audio-only performance for 159W in the single-vector setting is 39% while in the multi-vector setting it reaches 52%. As in the single-vector setting, it is observed that 159W is more sensitive to mismatch error and achieves a 2% lower accuracy rate at -20 dB than the 2W case. However, it consistently shows better performance at higher SNRs.


Figure 7.4: Comparison between the 4 proposed weight sets.

Figure 7.4 compares all four cases, i.e., 159W single-vector, 2W single-vector, 159W multi-vector and 2W multi-vector. From this figure, it is clearer which setting is more suitable for which SNR region(s). The underlying principle is that as the number of parameters increases, both the accuracy of a model and its sensitivity to mismatch error increase. The different methods in Figure 7.4 offer different compromises between these two factors.

         zero     one      two      three    four     five     six      seven    eight    nine
zero     58.37%   6.87%    1.29%    1.29%    24.46%   1.29%    0        0.43%    0        6.01%
one      0        87.59%   2.07%    0        2.76%    0        0        2.07%    0.69%    4.83%
two      8.05%    1.15%    78.16%   5.75%    5.75%    0        0        0        0        1.15%
three    2.21%    0.74%    3.68%    74.26%   0.74%    0        9.56%    3.68%    2.94%    2.21%
four     2.80%    1.87%    0        1.87%    85.98%   1.87%    4.67%    0.93%    0        0
five     3.81%    1.90%    0        0        2.86%    78.57%   0.48%    0.48%    0        11.90%
six      0.79%    0.79%    1.59%    0.79%    0        0.79%    84.92%   8.73%    0.79%    0.79%
seven    2.25%    6.11%    9.00%    14.79%   2.57%    0.32%    15.11%   48.55%   0        1.29%
eight    1.01%    1.51%    0.50%    7.04%    0.50%    0        2.51%    0        85.93%   1.01%
nine     3.36%    2.68%    0.67%    1.34%    0        4.70%    0.67%    4.03%    1.34%    81.21%

Recognition rate = 73.0168%

Figure 7.5: Confusion matrix of the audio-based classifier for the multi-vector 159W setting at 0 dB.

An important question raised here is whether the audio- and visual-based classifiers often make mistakes on similar test samples or whether their errors are roughly uncorrelated. In other words, are the hard-to-classify samples similar for both classifiers? Figures 7.5, 7.6 and 7.7 depict the confusion matrices of the audio-, visual- and audio-visual-based classifiers for the multi-vector 159W setting at 0 dB, respectively.

         zero     one      two      three    four     five     six      seven    eight    nine
zero     86.92%   0        7.48%    1.87%    0.93%    0.93%    0        1.87%    0        0
one      0.70%    95.77%   0        3.52%    0        0        0        0        0        0
two      22.27%   1.75%    65.94%   2.18%    1.31%    0        4.80%    0.87%    0.44%    0.44%
three    0.61%    7.32%    0        89.02%   0.61%    0        1.22%    0        0.61%    0.61%
four     5.29%    8.81%    4.85%    4.41%    76.65%   0        0        0        0        0
five     0        1.60%    0        1.60%    0        90.43%   0        2.66%    0        3.72%
six      3.72%    1.60%    2.66%    1.06%    0        0        79.26%   0.53%    4.79%    6.38%
seven    1.67%    0.56%    0        1.67%    0        0.56%    2.78%    90.56%   0.56%    1.67%
eight    3.23%    0        0.92%    1.38%    0        1.38%    3.69%    2.76%    71.89%   14.75%
nine     2.70%    0        1.35%    0        0        2.70%    2.70%    0        7.43%    83.11%

Recognition rate = 81.6201%

Figure 7.6: Confusion matrix of the visual-based classifier for the multi-vector 159W setting at 0 dB.

         zero     one      two      three    four     five     six      seven    eight    nine
zero     93.67%   1.27%    1.90%    0.63%    1.90%    0        0        0        0        0.63%
one      0        95.86%   0.59%    1.78%    0.59%    0        0        0.59%    0        0.59%
two      9.69%    1.02%    85.71%   2.04%    0.51%    0        0.51%    0.51%    0        0
three    0.60%    0        1.20%    97.60%   0        0        0        0.60%    0        0
four     1.08%    3.76%    0        1.61%    93.55%   0        0        0        0        0
five     1.56%    1.56%    0        0        0        92.19%   0        1.04%    0        3.65%
six      2.20%    0.55%    1.65%    0        0        0        93.96%   0        0.55%    1.10%
seven    0        0.56%    1.12%    1.12%    0        0        0.56%    96.65%   0        0
eight    0.53%    0.53%    0        1.58%    0        0        2.11%    0.53%    91.58%   3.16%
nine     0.58%    0        0        0        0        1.17%    1.17%    0        2.34%    94.74%

Recognition rate = 93.4078%

Figure 7.7: Confusion matrix of the audio-visual classifier for the multi-vector 159W setting at 0 dB.

As seen, digit four was the most difficult class for the audio-based recognizer: only 51.4% of the samples belonging to this class were classified correctly. On the other hand, samples from four were the easiest to classify for the visual-based classifier, which yielded 97% classification accuracy on that class. Another interesting result concerns digit nine. Both classifiers give almost the same accuracy for this class, i.e., 67-68%. However, while the audio classifier assigns most of the wrongly classified samples of this class to five, the visual-based classifier assigns most of them to class seven. The accuracy improvement of the combined classifier over the single-modality classifiers depends on the level of correlation between the classifiers (see the random forest analysis in [Bre01]). The weaker the correlation between the outputs of the classifiers, the higher the accuracy of the combined classifier. This explains the classification improvement of the audio-visual classifier over the audio-only and visual-only classifiers.

7.6 Conclusion

In order to achieve robustness against a mismatch between training and test scenarios, we proposed to employ the AUC as a design criterion to estimate the reliability weights. We showed that the reliability weight vector can be interpreted as a linear classifier taking the likelihood values computed by the HMMs as an input feature vector. We evaluated different weight sets such as one weight per stream, one weight per feature and one weight vector per class. As shown in the experiments, except for the severe mismatch condition (-20 dB), the one-weight-per-feature approach outperforms the one-weight-per-modality approach in both the multi-vector and single-vector settings. The multi-vector setting, which dedicates an individual weight vector to each class, showed a considerable improvement over the single-vector case in a wide range of SNR values. It, however, seems to be more sensitive to mismatch error, as its accuracy rate at -20 dB is about 10% lower than in the single-vector case. With an adaptive model selection algorithm, a multi-mode fusion algorithm may employ all these four models to achieve an optimal accuracy over all SNR regions.

Chapter 8

Multichannel Audio-video Speech Recognition System

This chapter is devoted to multichannel audio-visual speech recognition systems. The AV-ASR system proposed in this chapter is similar to the system introduced in Chapter 6 with one additional audio-processing block, beamforming. Since the audio signals are captured by a microphone array with eight channels, we can utilize beamforming techniques to focus on the desired sound source and alleviate background noise. The output of the beamforming block is then supplied to the AV-ASR discussed in Chapters 6 and 7.

The multichannel audio-visual dataset used to evaluate the proposed algorithms in this chapter was collected in a highly reverberant office room, under highly varying light conditions and with a relatively low-resolution RGB camera. Using this dataset to evaluate the proposed algorithms yields a fair estimate of the performance of the algorithms in adverse conditions which commonly occur in real-world applications.

8.1 Introduction to Beamforming Problem

While visual information is of high importance for increasing the robustness of AV-ASR in several respects, it can hardly help (at least directly) to improve the speech signals in reverberant environments. Reverberation corrupts the harmonic structure in voiced speech and consequently decreases the speech recognition rate. One simple approach to overcome this difficulty is to exploit beamforming techniques to focus on the desired direction and attenuate the echoes of the signal.

However, although this solution may work for fixed (non-adaptive) beamformers, it leads to signal cancellation in adaptive beamformers, due to a phenomenon known as signal leakage. Adaptive beamformers tend to adapt to the real characteristics of the background noise rather than to a predetermined model like white or diffuse noise. To this end, they estimate the statistics of the background noise from the non-speech segments of the audio signals. However, in a reverberant room, the echoes of the speech signals are still sensed by the microphones long after the end of the speech signals. These echoes bear the same statistical signature as the desired signal and hence result in signal cancellation when they are considered as noise by the beamformer.

The commonly utilized minimum variance distortionless response (MVDR) beamformer minimizes the output power while maintaining a specified response to the desired signal. However, in the presence of coherent interferers, which occur due to reverberation, microphone mismatch or steering vector errors, MVDR fails due to the signal leakage problem. Signal processing methods that have been proposed to address this problem are usually known as robust adaptive beamforming methods. However, this robustness is achieved at the expense of less interference reduction or an increased number of microphones. For example, in [SK85] the signal is averaged over space to decorrelate the signal and the interference. This technique can only be applied to uniform microphone arrays¹ and needs many microphones to achieve satisfactory performance. Using norm-constrained adaptive filters was proposed in [QV95] to constrain the power of the signal leakage, which consequently improves the adaptive beamformers. It, however, requires knowledge of the interference covariance matrix, which may not be available in speech recognition applications. In [HSH99], both quadratic and non-linear (truncation) constraints in a three-block structure have been used to improve the interference reduction. Another approach is to estimate the transfer functions (TFs) with blind source separation techniques. TFs may also be estimated based on the assumption that the signal is non-stationary while the noise is stationary [GBW01].

¹A uniform microphone array is a linear microphone array with equal distances between the microphones.

In this work, we extract the TFs from the covariance matrix of the array data given the source locations. We show that this problem can be formulated as an instance of the weighted Procrustes problem [Vik06]. Typically, a Procrustes problem is used to rotate and scale a set of data to fit another set. Here this method is used to estimate TFs that are close enough to the ideal TFs (steering vectors) and can still reconstruct the data covariance matrix. Using the proposed beamforming method yields a 2% performance improvement of the audio-only speech recognizer when the microphone array consists of 8 microphones, and 1.11% when a 4-channel microphone array is utilized.

8.2 Signal Cancellation

Let s_m(n) be the sound waves emitted by M wide-band sources which are received by an array of N microphones. The room impulse response h_{\mathrm{room}}^{k,i}(t) characterizes both the direct and echo paths from the k-th source to the i-th microphone. Since the microphones may not be calibrated and may introduce different transfer functions h_{\mathrm{mic}}^{i}(t), one can merge the room impulse response and the microphone transfer function into a total transfer function h^{k,i}(t) containing both the room acoustics and the microphone characteristics. Therefore, the i-th microphone output in the frequency domain can be written as:

x_i(f) = \sum_{m=1}^{M} h^{m,i}(f)\, s_m(f) + n_i(f) \qquad (8.1)

where n_i(f) is the additive noise at the i-th microphone. The microphone outputs can be aggregated into a column vector x:

x = H s + n \qquad (8.2)

where n is the additive white noise vector, H = [h_1, ..., h_M] is the channel matrix, h_i = [h^{i,1}, ..., h^{i,N}]^\top is the i-th column of H and s = [s_1(f), ..., s_M(f)]^\top is the source vector. The number of sources M is assumed to be less than the number of microphones N. Without loss of generality, s_1 can be considered as the desired signal. The goal of the beamformer is to obtain an estimate of the desired signal by filtering and summing the microphone outputs:

y(f) = w(f)^H x(f) \qquad (8.3)

where (\cdot)^H is the Hermitian transpose operator, y is the beamformer output and w the beamformer weight vector. The conventional MVDR beamformer chooses its weight vector w to minimize the output power while maintaining the signal from a specified direction of arrival:

\arg\min_{w}\; w^H R w \quad \text{subject to} \quad w^H d = 1 \qquad (8.4)

R = E\{x x^H\} is the covariance matrix of the received signals x. The target steering vector is defined by d = [e^{-j\omega\tau_1}, ..., e^{-j\omega\tau_N}], where \tau_1, ..., \tau_N are delays matched to the desired speaker direction. The conventional MVDR presented in (8.4), however, can only perform well in anechoic environments where noise and signal are independent. In reality, severe signal cancellation occurs because of microphone mismatch, location estimate errors, signal-correlated noise and reverberant environments. To clarify this problem, we have to look more closely into the covariance matrix R:

R = E\{x x^H\} = H R_s H^H + \sigma^2 I \qquad (8.5)

where R_s = E\{s s^H\} is the source covariance matrix and \sigma^2 the power of the additive white noise. For uncorrelated sound sources, R_s is diagonal and (8.5) can be decomposed into three additive terms:

R = h_1 h_1^H S_1(f) + H_{2:M} R_{s,2:M} H_{2:M}^H + \sigma^2 I \qquad (8.6)

where S_1(f) is the power spectrum of the desired signal and H_{2:M} = [h_2, ..., h_M]. R_{s,2:M} is obtained by removing the first row and the first column of R_s. Substituting w^H h_1 = 1 into the MVDR objective function (8.4) yields

H H H 2 H w Rw = S1(f)+ w H21:M Rs2:M H2:M w + σ w w (8.7) By ignoring the first term, the MVDR optimization problem can be simplified to

\arg\min_{w}\; w^H H_{2:M} R_{s,2:M} H_{2:M}^H w + \sigma^2 w^H w \quad \text{s.t.} \quad w^H h_1 = 1 \qquad (8.8)

Note that (8.8) is independent of S_1. However, in reverberant rooms no knowledge about h_1 is usually available in advance, and it has to be estimated from the array data or approximated with the steering vector d calculated from the main-speaker location estimate. Since h_1 includes both direct and indirect paths, it can be decomposed into a sum of the steering vector and the echoes transfer function:

h_1 = d + h_{\mathrm{echoes}} \qquad (8.9)

The attenuation factor has been intentionally dropped to avoid notational distraction. R can be rewritten as

R = (h_{\mathrm{echoes}} + d)(h_{\mathrm{echoes}} + d)^H S_1(f) + H_{2:M} R_{s,2:M} H_{2:M}^H + \sigma^2 I \qquad (8.10)

By using w^H d = 1 as an approximation of the target transfer function h_1, the objective function becomes

|w^H h_{\mathrm{echoes}} + 1|^2 S_1(f) + w^H H_{2:M} R_{s,2:M} H_{2:M}^H w + \sigma^2 w^H w \qquad (8.11)

Looking closely at this function reveals that the first term vanishes if w^H h_{\mathrm{echoes}} = -1 holds. Therefore, the MVDR optimization problem tends to satisfy the constraint w^H h_{\mathrm{echoes}} = -1. However, satisfying this constraint results in removing the signal from the beamformer output. To clarify this, note that the signal component in the beamformer output can be written as y_s = w^H h_1 s_1(f). Using (8.9) and the fact that the weight vector satisfies the constraints w^H d = 1 and w^H h_{\mathrm{echoes}} = -1, the beamformer response to the signal transfer function h_1 is w^H h_1 = 0 and thus y_s = 0. That is, by approximating the real transfer function (TF) with the steering vector d, the MVDR beamformer tends to remove the desired signal. Actually, if we ignore the white noise term for the sake of simplicity, by choosing w from the null space of H^H the objective function reaches its minimum value, namely zero. This means w^H H = 0 and therefore w^H h_1 = 0.

A natural approach to address the signal cancellation problem is to estimate the acoustic transfer function and the microphone mismatches. This estimate is then used to achieve the signal-independent optimization problem in (8.8).
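The cancellation effect can be reproduced with a small synthetic experiment. The sketch below uses the standard closed-form MVDR solution w = R^{-1} d / (d^H R^{-1} d) of (8.4) (not spelled out in the text) on a toy covariance matrix built from an artificial reverberant transfer function h_1 = d + h_echoes; all numbers are made up for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 8
d = np.exp(-1j * 2 * np.pi * rng.random(N))              # ideal steering vector
h_echoes = 0.8 * (rng.standard_normal(N) + 1j * rng.standard_normal(N))
h1 = d + h_echoes                                        # true (reverberant) TF
S1, sigma2 = 1.0, 1e-3
R = S1 * np.outer(h1, h1.conj()) + sigma2 * np.eye(N)    # covariance as in (8.5)

w = np.linalg.solve(R, d)
w /= d.conj() @ w                                        # MVDR constrained to w^H d = 1

print(abs(w.conj() @ d))    # ~1: the distortionless constraint on d is satisfied
print(abs(w.conj() @ h1))   # << 1: the response to the true TF h1 is almost zero,
                            #       i.e. the desired signal is cancelled
```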

8.3 Transfer Function Estimation

In order to avoid signal cancellation, we aim to estimate h_1 from the noisy signals x = Hs + n received by the microphones. This information can be extracted from the covariance matrix R. Using the spectral decomposition, R can be written as R = U(\Lambda + \sigma^2 I)U^H, where the first M columns of U are the orthonormal basis vectors of the signal and interference subspaces and \sigma^2 is the white noise power, which can be estimated as the average of the N - M smallest eigenvalues of R. Given the source position, the ideal steering vector can be simply calculated. However, because of the acoustic characteristics of a room, the steering vector d may not lie in the subspace spanned by the columns of U_{1:M}. One approach to estimate h_1 is to find the closest vector to d which lies in the subspace spanned by the first M columns of U:

\arg\min_{x}\; \| U_{1:M}\, x - d \|_F^2 \qquad (8.12)

Consequently, h_1 = U_{1:M}\, x and it belongs to the subspace U_{1:M}. This is a projection problem and the solution can be written as h_1 = U_{1:M} U_{1:M}^H d, where P = U_{1:M} U_{1:M}^H is the projection operator onto the subspace U_{1:M}. Although this can be an accurate guess when the mismatch between the steering vector and the real TF is fairly small, it may not work in more severely reverberant environments. To go one step further, we assume the availability of some additional information about the interferer locations. That is, the directions of arrival of k interferers (k \le M) can be derived from the array data (which is reasonably simple, at least for an imprecise estimate)². Unlike in some other methods, they do not necessarily need to be the k strongest sources. Defining H' = U_{1:M} \Lambda_{M \times M}^{1/2} V and thus R = H' H'^H, one can infer that H' can be estimated from the decomposition of R up to an unknown multiplicative unitary matrix V. Our aim is to estimate this unitary matrix V and reconstruct h_1 by \hat{h}_1 = U_{1:M} \Lambda_{M \times M}^{1/2} v_1. Taking the location information into account, (8.12) can be extended as follows:

\arg\min_{V, \Gamma}\; \| U_{1:M} \Lambda^{1/2} V_{1:k} \Gamma - D \|_F^2 \quad \text{s.t.} \quad V_{1:k}^H V_{1:k} = I \qquad (8.13)

where \Gamma is a k \times k diagonal weighting matrix, which is necessary to model the unknown attenuation factors, and D = [d_1, ..., d_k] is the steering matrix. It is worth noting that in the case k = 1, (8.13) reduces to the simple least squares problem in (8.12). However, unlike for (8.12), there is no straightforward solution for it. The optimization problem in (8.13) is known as the weighted orthogonal Procrustes problem (WOPP) and can be seen as a linear least squares problem defined on a Stiefel manifold. A Stiefel manifold, commonly denoted as \mathcal{V}_{m,n}, is the set of all matrices V_{M \times N} having orthonormal columns.

²Since many systems nowadays use both audio and video channels to ease the human-machine interaction, like Microsoft's Kinect-Xbox, the location information could also come from the vision channel.

Usually, the solutions suggested for the optimization problem in (8.13) have two steps: given \Gamma, they find the optimal V and use it in the second step to find the optimal \Gamma. This forms an iterative solution that converges to a local minimum. Here, we employ the algorithm suggested in [Dos10] with small modifications to work with complex matrices. The complete iterative channel matrix estimation algorithm is listed in Algorithm 8.1.

Algorithm 8.1: Iterative channel matrix estimation

    Input: M eigenvectors U_{1:M}, steering matrix D and eigenvalues \Lambda.
    Set A = [a_1, ..., a_k] = \Lambda^{1/2} U_{1:M}^H D and V_{1:k}^0 = I.
    For t = 1, ... do
        (a) Compute \gamma_i = \frac{a_i^H v_i}{v_i^H \Lambda v_i}, where v_i is the i-th column of V^t,
            and set \Gamma = \mathrm{diag}([\gamma_1, ..., \gamma_k]).
        (b) Choose \rho such that \rho I - \Lambda is positive definite.
            Compute c_i = \gamma_i a_i + |\gamma_i|^2 (\rho I - \Lambda) v_i for i = 1, ..., k
            and set C = [c_1, ..., c_k].
            From the orthogonal decomposition of C, C = U_c \Lambda_c V_c^H,
            set V^{t+1} = U_c V_c^H.
        (c) \Phi^{(t+1)} = \sum_{i=1}^{k} 2\,\mathrm{real}(\gamma_i v_i^H a_i) - \sum_{i=1}^{k} |\gamma_i|^2 v_i^H \Lambda v_i
            Terminate if \Phi^{(t+1)} - \Phi^{(t)} \approx 0.
    End
    Output: \Gamma and V^{t+1}.

Simulations have shown that this algorithm converges in less than 20 iterations. The algorithm may be computationally expensive; however, as long as the transfer function of the room does not change rapidly, it only needs to be run infrequently. The output of this algorithm is an estimate of the channel matrix H, which is used in (8.8) to achieve a signal-independent MVDR beamformer.
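For concreteness, the following Python sketch transcribes Algorithm 8.1 under the reconstruction given above, working per frequency bin with complex-valued quantities. It is only a sketch: the eigendecomposition helper, the variable names and the choice of \rho slightly above the largest eigenvalue are our assumptions, not part of the original implementation.

```python
import numpy as np

def signal_subspace(R, M):
    # spectral decomposition of R: keep the M principal eigenvectors/eigenvalues
    # and estimate the noise power from the remaining N - M eigenvalues
    vals, vecs = np.linalg.eigh(R)               # ascending order
    U, Lam = vecs[:, ::-1][:, :M], vals[::-1][:M]
    sigma2 = vals[:-M].mean() if R.shape[0] > M else 0.0
    return U, Lam, sigma2

def estimate_channel(U, Lam, D, max_iter=50, tol=1e-8):
    # iterative WOPP solution of (8.13); D is the (N, k) steering matrix
    M, k = U.shape[1], D.shape[1]
    A = np.diag(np.sqrt(Lam)) @ U.conj().T @ D   # A = Lam^{1/2} U^H D
    V = np.eye(M, k, dtype=complex)
    rho = Lam.max() * 1.01                       # makes rho*I - Lam positive definite
    phi_prev = -np.inf
    for _ in range(max_iter):
        # (a) optimal diagonal scaling Gamma for the current V
        gamma = np.array([(A[:, i].conj() @ V[:, i]) /
                          (V[:, i].conj() @ (Lam * V[:, i])) for i in range(k)])
        # (b) majorization step: update V on the Stiefel manifold via an SVD
        C = A * gamma + (np.abs(gamma) ** 2) * ((rho - Lam)[:, None] * V)
        Uc, _, Vch = np.linalg.svd(C, full_matrices=False)
        V = Uc @ Vch
        # (c) monitor the objective Phi and stop when it no longer increases
        phi = sum(2 * np.real(gamma[i] * (V[:, i].conj() @ A[:, i]))
                  - np.abs(gamma[i]) ** 2
                  * np.real(V[:, i].conj() @ (Lam * V[:, i])) for i in range(k))
        if abs(phi - phi_prev) < tol:
            break
        phi_prev = phi
    # estimated transfer functions h_i = U Lam^{1/2} v_i (columns) and scales Gamma
    return U @ (np.sqrt(Lam)[:, None] * V), gamma
```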

8.4 Simulation Results

The adaptive beamformer is implemented in the frequency domain with an overlap-add 2048-point FFT filterbank and a sampling frequency of f_s = 44100 Hz. The non-uniform linear array consists of 8 microphones, as depicted in Figure 8.3. For the evaluation of the proposed algorithm, two different simulation cases have been chosen.

Case I: In the first scenario, two speech signals in a reverberant room with T_{60} = 100 ms (T_{60} is the reverberation time³) have been assumed. The origin of the coordinate system is at the center of the microphone array. The desired speaker and the interferer are placed at the coordinates [0, 2.5 m, 0] and [3 m, 2.5 m, 0], respectively. A mismatch of less than 2 dB between the microphone transfer functions is assumed.


Figure 8.1: Proposed MVDR (top), conventional MVDR (middle) and super-directive beamformer responses to the signal and the interference.

³T_{60} is the time it takes for the sound intensity to drop 60 dB below the level of the original sound after the sound source ceases.

The beampattern can be defined as BP(h(f)) = |w^H h(f)|^2, and the beampattern values at h = h_1 (the desired signal transfer function) and h = h_2 (the interferer) can be interpreted as the beamformer responses to the desired signal and the interferer signal, respectively. These responses are shown in Figure 8.1 for the proposed, the conventional MVDR and the super-directive beamformer. Super-directive beamformers assume diffuse noise, which is a very common model for reverberant environments. Note that the beampattern at h = h_1 can be interpreted as SNR_out/SNR_in, where SNR is the signal-to-noise ratio, and at h = h_2, BP(h_2(f)) = INR_out/INR_in, with INR being the interferer-to-noise ratio. According to these two equalities, the beampattern can be seen as a performance evaluation measure: the higher its value at h_1, the larger the SNR improvement, and the lower its value at h_2(f), the stronger the interference attenuation. The average of BP(h_2(f))/BP(h_1(f)) over all frequencies is called the relative mean attenuation and is shown with a horizontal line for all beamformers in Figure 8.1. As can be seen, the fixed super-directive beamformer can attenuate the interference by up to 15 dB on average, while the MVDR with the constraint updated by our algorithm achieves 7 dB more interference reduction. Also, the beamformer response to h_1 shows fewer fluctuations and thus less signal distortion for the updated MVDR than for the super-directive beamformer. However, as expected, the conventional MVDR response to the target TF shows that it severely distorts the desired signal.
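The beampattern-based measures used in this comparison are straightforward to evaluate numerically; a possible sketch (with our own helper names, assuming per-frequency weight and transfer-function arrays):

```python
import numpy as np

def beampattern(w, h):
    # BP(h(f)) = |w^H h(f)|^2 for a single frequency bin
    return np.abs(w.conj() @ h) ** 2

def relative_mean_attenuation_db(W, H1, H2):
    # average of BP(h2(f)) / BP(h1(f)) over frequency, expressed in dB;
    # W, H1, H2 are (n_freq, N) arrays of weights, target TFs and interferer TFs
    bp1 = np.abs(np.einsum('fn,fn->f', W.conj(), H1)) ** 2
    bp2 = np.abs(np.einsum('fn,fn->f', W.conj(), H2)) ** 2
    return 10.0 * np.log10(np.mean(bp2 / (bp1 + 1e-12)) + 1e-12)
```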

Case II: In the second scenario, the performance of the traditional and the proposed MVDR beamformer in the presence of a large target steering vector error has been studied. Figure 8.2 shows the beampattern of the MVDR beamformer with the updated constraint (the proposed method) versus the traditional MVDR beamformer at 1 kHz. An error of -15° in both the interference and the signal steering vectors has been assumed; therefore, both columns of the steering matrix in (8.13) are imprecise. Nevertheless, as can be seen in Figure 8.2, the updated MVDR achieves both robustness against the -15° steering vector error and a high interference reduction of around 30 dB at the interference direction. The solid line also reveals a slight shift of the null position (from 40° to 33°), which leads to about 10 dB degradation in interference reduction (from 40 dB to 30 dB). Further experiments have shown that the proposed algorithm is robust against even larger target direction errors at the expense of this shift in the null position, which can be seen as a trade-off between noise reduction and robustness.


Figure 8.2: Beampattern of the updated-constraint MVDR and the traditional MVDR at f = 1000 Hz. Vertical lines mark the source and the interferer positions. The signal vector as well as the interferer vector are assumed to have an error of -15 degrees.

The proposed algorithm prevents signal attenuation by shifting the main lobe to the correct azimuth, while the performance of the traditional MVDR degrades dramatically. The main advantage of this method over other similar methods of robust constraint-set design such as [ZXG06] is that it does not widen the main lobe (which would result in a reduction of the spatial resolution), but rather shifts it to the correct angle by extracting information from the covariance matrix.

8.5 Experiments with ETHDigits Dataset

Throughout this section, we used the ETHDigits dataset for the evaluation of the proposed algorithms. ETHDigits is an audio-visual dataset recorded in a highly reverberant office room with a T_{60} (reverberation time) longer than 1 second. The audio data was captured by 8 microphones arranged as in Figure 8.3, and RGB video signals with a resolution of 640x480 pixels per frame were recorded by a Kinect for Xbox 360. Unlike the datasets used in the previous chapters, we did not record the visual data under controlled conditions. Depending on how many lights were on during a recording session, what time of day it was and whether the sun was shining, the amount of light in the room may vary largely, which in turn results in a large variation of the visual data quality. The ETHDigits dataset was recorded from 15 speakers and each speaker repeated a sequence of numbers from one to ten 5 times (with no randomization). The numbers were presented on a computer screen located about 45 centimeters away from the talkers, and both the Kinect and the microphone array were mounted on top of this screen.

[Figure 8.3: microphones 1-8 with inter-microphone spacings 9d, 3d, 1d, 1d, 1d, 3d, 9d (total length 27d).]

Figure 8.3: Linear microphone array with non-uniform spacing between microphones. d is 3 cm and the total length of the microphone array is 81 cm. The microphones 1, 2, 7 and 8 constitute a uniformly spaced four-channel microphone array used in some experiments.
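For reference, a small sketch of the array geometry of Figure 8.3 and of an ideal far-field steering vector; the broadside-referenced angle convention and the speed-of-sound value are our assumptions, not taken from the thesis.

```python
import numpy as np

def mic_positions(d=0.03):
    # x-coordinates (in metres) of the 8 microphones of Figure 8.3:
    # spacings 9d, 3d, 1d, 1d, 1d, 3d, 9d with d = 3 cm (total length 27d = 81 cm)
    spacings = np.array([9, 3, 1, 1, 1, 3, 9]) * d
    return np.concatenate(([0.0], np.cumsum(spacings)))

def steering_vector(freq, theta, positions, c=343.0):
    # far-field steering vector d = [exp(-j w tau_1), ..., exp(-j w tau_N)]
    # for a plane wave from angle theta (radians, 0 = broadside)
    tau = positions * np.sin(theta) / c        # relative delays per microphone
    return np.exp(-2j * np.pi * freq * tau)
```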

In the first set of experiments, we evaluated the performance of the proposed beamforming algorithm (updated MVDR). Figure 8.4 shows the performance of the audio-only recognizer for each speaker when: (I) the updated MVDR algorithm was applied to the output of our 8-channel microphone array to enhance the SNR of the audio signal fed to the A-ASR, (II) only the output of one microphone was passed to the A-ASR, and (III) the updated MVDR was applied to the output of a 4-channel microphone array and its output was then fed to the A-ASR. The 4-channel microphone array was simply constructed by considering only the outputs of 4 of the 8 microphones, as shown in Figure 8.3.

As expected, the updated MVDR beamforming method with the 8-channel microphone array outperforms both the single-channel and the 4-channel system over all speakers, except speaker 15. Using the 8-channel microphone array yields 97.21% ± 1.41 average recognition accuracy over the 15 speakers, compared with 95.38% ± 1.66 and 96.55% ± 1.60 for the single-channel and the 4-channel system, respectively. Figure 8.5 shows the performance of the visual speech recognizer when ISCN, MAPP and ISCN+MAPP (as discussed in Chapter 6) features are used to represent the visual data. The average recognition accuracies of the ISCN-, MAPP- and ISMA-based (ISCN+MAPP) V-ASR over the 15 speakers are 35.25% ± 4.24, 44.40% ± 2.98 and 46.88% ± 3.95, respectively. Similar to the Oulu, AVletter and GRID datasets in Chapter 6, the combination of ISCN and MAPP features outperforms each of them individually, suggesting that ISCN and MAPP are complementary feature sets.


Figure 8.4: Recognition accuracy per speaker on ETHDigits for the 8-channel, 4-channel and single-channel systems. In the 8-channel and 4-channel systems the updated MVDR is applied to enhance the audio signal quality.

In fact, it can be observed in Figure 8.5 that when the ISCN and MAPP features obtain almost the same accuracy, their combination achieves a significantly higher recognition rate than each of them alone. For instance, the V-ASR with ISMA features achieves about 65% accuracy on speaker 14, which is 15% higher than the performance of ISCN and MAPP individually. The same trend can be seen for speakers 7, 9 and 11.

The recognition rate of the audio-visual recognizer for each speaker is shown in Figure 8.6. The AV-ASR performance has been reported for two fusion strategies: (I) audio and visual features (ISMA features were taken as visual features) had equal weights, and (II) an individual weight was assigned to each feature of the audio-visual feature set. In the latter case the weights were estimated by means of the AUC-based approach developed in Chapter 7.


Figure 8.5: Recognition accuracy per speaker on ETHDigits for visual-only speech recognizer with ISCN, MAPP and ISCN+MAPP (ISMA) features.

It is clear that the weighted audio-visual fusion largely outperforms the simple uniform weight strategy. The weighted AV-ASR reaches 95.54% ± 1.5 accuracy, compared with only 70.78% ± 4.51 for the uniform strategy. As already discussed in Chapter 7, using the one-weight-per-feature strategy increases the robustness of the system under mismatching training and test conditions by assigning higher weights to more robust features, i.e., features with less variation over different speakers. By reducing the effect of unreliable features, the AV-ASR can work reasonably accurately even when one modality completely fails. This can, for instance, be observed in the case of speaker 4, where the visual-only recognizer hardly performs better than random guessing while the AV-ASR reaches 89% accuracy for this speaker.


Figure 8.6: Recognition accuracy per speaker on ETHDigits for AV-ASR when all features have equal weights (uniform weight strategy) and when one individual weight is assigned to each feature. These weights are estimated with the AUC based weight strategy presented in Algorithm 7.2.

8.6 Conclusion

In this chapter, we developed an algorithm to approximate the channel matrix of the room from the covariance matrix of the multichannel audio data. It was shown that this problem can be formulated as an instance of the weighted Procrustes problem. The estimated transfer function is then used to develop a beamforming method that is more robust against the signal cancellation problem. Using the proposed beamforming method yields a 2% performance improvement of the audio-only speech recognizer when the microphone array consists of 8 microphones, and 1.11% when a 4-channel microphone array is utilized. The ETHDigits dataset, which includes eight-channel audio data and video data from an RGB camera, is then used to evaluate the AV-ASR system with a beamforming block. The room was a highly reverberant office room and the visual data had a fairly low quality compared with the Oulu and GRID datasets used in Chapter 6. The experiments showed that even though the visual speech recognizer is unreliable in this adverse condition, with a proper audio-visual fusion strategy the AV-ASR can obtain 95.54% accuracy.

Chapter 9

Conclusion

9.1 Achievements

In this thesis we have developed several feature selection and boosting methods which can be used to construct robust audio-visual speech recognition and voice activity detection systems.

The main advantage of our proposed feature selection method, COBRA, is that it guarantees a non-zero lower bound on the normalized score¹ of the selected feature set, which can be interpreted as a goodness measure of the selected feature set.

Since appearance-based visual features are highly speaker dependent, we used ISCN feature extraction, which generates translation- and scale-invariant features. Its drawback, however, is that it extracts tens of thousands of features from a relatively small image. Thus, we employed COBRA in order to reduce the dimensionality of the ISCN feature vector. Statistical models built upon the selected features showed superior robustness against speaker variations and lighting conditions.

Boosting methods are based on the idea of creating a highly accurate classifier by combining many weak and inaccurate classifiers. Boosting can be seen as a meta-algorithm that maintains a distribution over the sample space. At each iteration a weak hypothesis minimizing the weighted loss is learned and the weights (over the samples) are updated accordingly.

¹normalized with respect to the score of the optimal feature set

The output (strong hypothesis) is a convex combination of the weak hypotheses. Unlike most of the previously suggested boosting methods, in our boosting framework, MABoost, the booster has direct control over the weights, making it more suitable for boosting problems subject to distribution constraints. We derived several theoretically and practically appealing algorithms (including SparseBoost, smooth boosting, Mu-MABoost, etc.) and, more importantly, provided proof techniques that can be used to translate many other online learning algorithms into boosting methods.

By means of MABoost, we developed two practically important applications in speech processing: a reliable voice activity detector and a robust lipreading system. Our proposed voice activity detector can be trained in a semi-supervised manner, which is an important requirement in some applications. Our lipreading system is based on the generalization of the MABoost framework to the multiclass setting. It is a decision-tree-based lipreading system which is highly suitable for the sparse features used in this system.

Finally, we devised an information fusion strategy based on AUC maximization. We showed that in this method the weights assigned to the likelihood values should be optimized with respect to a more robust criterion than a simple accuracy rate. We utilized the AUC to estimate the likelihood weights and showed that under some simplifying assumptions this criterion can be seen as the expectation of the accuracy rate with respect to uniformly distributed mismatch error. Therefore, maximizing it may result in a system that is more robust against mismatch conditions. Our experiments showed that this fusion strategy leads to a robust digit classification system with an accuracy rate favorably comparable with adaptive systems.

9.1.1 Perspectives

This thesis should be seen as a rather broad investigation of various robust visual features and machine learning techniques aiming to improve audio-visual speech recognition systems. Even though our current implementation of AV-ASR needs more refinement in order to be used in a commercial product, my inner Jules Verne believes that in the near future there will be continuous AV-ASR based applications with reasonable accuracy available on smart phones and in intelligent vehicles. A realization of a reasonable lipreading system should at least satisfy the following requirements:

1. It should work reasonably well in the speaker-independent mode.
2. It should be robust against illumination variations.
3. It should be able to adapt itself to particular users.

and a successful AV-ASR system should at least:

1. have a robust audio-visual utterance detector;
2. enjoy a robust audio-visual information fusion scheme, in order to cope with possible audio or visual modality failures.

The first two requirements were directly investigated and addressed in this thesis. We have shown that employing a pool of scale-invariant features extracted from multiple color spaces yields a high level of robustness against inter-speaker and illumination variations. Two important aspects of this approach are (I) the sparse representation of visual information and (II) the construction of a strong classifier based on decision trees which can take advantage of the feature vector sparsity. While the first property provides robustness against undesired variations, the second property guarantees that the underlying hypothesis can be learned efficiently. Due to these two properties, the proposed method can have further applications including visual emotion recognition, video classification, video search and scene detection.

While the third requirement has not been directly discussed in this thesis, most of the proposed algorithms can easily be modified to take speaker adaptation into account. However, some of the theoretical results and performance guarantees could only be proven for the batch learning setting and not for online learning, which is a requirement in speaker adaptation. Conducting further research on online boosting algorithms with the PAC learning property may generalize our theoretical results to online learning settings.

Audio-visual voice activity detection is a prerequisite in many audio-visual applications. We proposed a robust audio-visual voice activity detector which can be trained in a semi-supervised manner. This interesting property can be achieved by noting the fact that both the audio and the visual signals represent the same underlying event: speech. Following the same line of reasoning, it is plausible to apply this semi-supervised training procedure to other audio-visual applications, including audio-visual speech recognition and audio-visual content search, which are more complex than a simple AV-VAD. Each of these applications, however, may pose extra technical challenges which need to be addressed first. For instance, to apply this technique to audio-visual phoneme recognition, we have to deal with the fact that an audio-based phoneme recognizer achieves a much higher classification accuracy than its video-based counterpart. Thus, labeling the data by iterating over audio- and video-based classifiers may not converge to a meaningful result. One remedy for this problem is to reduce the number of classes by clustering the phonemes into visemes (or even broader speech units than visemes) so that audio- and video-based classifiers yield almost similar classification accuracies over them. Further work in this direction may result in highly desirable (or highly scary!) autonomous artificial intelligence systems (perhaps mono-task systems) which can learn and improve on their own.

Finally, by means of the boosting framework and proof techniques suggested in Chapter 4, we solved two open problems.
First, we showed that it is possible to select only a fraction of the samples (in many datasets half of the samples or even less) at each round of training and still achieve 100% classification accuracy; second, we derived the first proof for the MadaBoost algorithm presented in [DW00].
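The first result can be made concrete with a small sketch: at every round, only the samples that currently carry non-zero weight are handed to the weak learner, so each round touches only a fraction of the training set. The weak-learner interface and the names below are hypothetical and only illustrate the mechanics, not the exact algorithm of this thesis.

    import numpy as np

    def boosting_round(train_weak_learner, X, a, y):
        # X: features, a: labels in {-1, +1}, y: current (possibly sparse) sample weights.
        active = np.flatnonzero(y > 0)               # only a subset of samples is used
        h = train_weak_learner(X[active], a[active], y[active])
        d = -a * h(X)                                # positive entries mark misclassified samples
        gamma = -np.dot(y, d) / max(y.sum(), 1e-12)  # edge of h under the normalized weights
        return h, d, gamma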

9.1.2 Open Problems

Several new problems have also emerged from our work. Over the last two decades it was shown that non-convex loss functions may have some desirable properties such as robustness against labeling noise [LS10] and better scalability [CSWB06]. This raises the question of whether using a non-convex divergence function in MABoost can still result in a provable boosting algorithm.

The second question is related to the weights of the samples in a boosting algorithm. In common boosting methods, the weights constitute a probability distribution over the sample space. However, as shown in this work, this does not seem to be a necessary condition. For instance, in SparseBoost the weights are projected onto a hypercube rather than onto a probability simplex. The main questions are: under which conditions can the boosting weights be projected onto convex sets other than the probability simplex while the boosting algorithm still converges? And what are the characteristics of these convex sets?

Finding answers to these two questions may have big practical and theoretical impacts. It is known that using a non-convex loss function may lead to a boosting method which is robust against labeling noise. Moreover, projecting the weights onto a convex set larger than the probability simplex may enable us to develop an algorithm that can learn the underlying hypothesis from an infinite amount of data in a very efficient manner (in fact, exponentially fast) without using all the samples in the training data, which would take an infinite amount of time. Given an algorithm satisfying these two properties, it is plausible to develop an unsupervised audio-visual algorithm that can be trained on unlabeled data gathered from the Internet.
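For intuition about the two feasible sets discussed above, the sketch below contrasts the Euclidean projection onto the probability simplex with the projection onto the unit hypercube, which is a simple clipping operation. MABoost actually uses Bregman projections, so this is only an illustrative Euclidean analogue.

    import numpy as np

    def project_simplex(v):
        # Euclidean projection onto {w : w >= 0, sum(w) = 1} (standard sort-based rule).
        u = np.sort(v)[::-1]
        css = np.cumsum(u)
        rho = np.nonzero(u - (css - 1.0) / np.arange(1, v.size + 1) > 0)[0][-1]
        theta = (css[rho] - 1.0) / (rho + 1.0)
        return np.maximum(v - theta, 0.0)

    def project_hypercube(v):
        # Euclidean projection onto [0, 1]^N: clip each coordinate independently.
        return np.clip(v, 0.0, 1.0)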

Appendix A

Complementary Fundamentals

This appendix contains proofs of the lemmas and theorems presented in Chapter 4. Before proceeding with the proofs, some definitions and facts need to be recalled.

A.1 Definitions and Preliminaries

Definition A.1 (Margin). Given a final hypothesis $f(x) = \sum_{t=1}^{T} \eta_t h_t(x)$, the margin of a sample $(x_j, a_j)$ is defined as $m(x_j) = a_j f(x_j) / \sum_{t=1}^{T} \eta_t$. Moreover, the margin of a set of examples $\mathcal{D}$, denoted by $m_{\mathcal{D}}$, is the minimum of the margins over the examples, i.e., $m_{\mathcal{D}} = \min_{x_j} m(x_j)$.

Lemma A.2 (Duality between max-margin and min-edge). The minimum edge $\gamma_{\min}$ that can be achieved over all possible distributions of the training set is equal to the maximum margin ($m^* = \max_{\eta} m_{\mathcal{D}}$) of any linear combination of hypotheses from the hypothesis space.

This lemma is discussed in detail in [FS96a] and [RW05]. It is a direct result of von Neumann's minimax theorem and simply means that the maximum achievable margin is $\gamma_{\min}$.

A.2 Proof of Theorem 4.7

The proof of the maximum margin property of MABoost is almost the same as the proof of Theorem 4.6. Let us assume the $i$-th sample has the worst margin, i.e., $m_{\mathcal{D}} = m(x_i)$. Let all entries of the error vector $w^*$ be zero except its $i$-th entry, which is set to 1. Following the same approach as in Theorem 4.6 (see equation (4.13)), we get

$\sum_{t=1}^{T} w^{*\top}\eta_t d_t - \sum_{t=1}^{T} w_t^{\top}\eta_t d_t \le \sum_{t=1}^{T} \frac{1}{2}\eta_t^2\|d_t\|_*^2 + B_{\mathcal{R}}(w^*, w_1) - B_{\mathcal{R}}(w^*, w_{T+1}) \qquad (A.1)$

With our choice of $w^*$, it is easy to verify that the first term on the left side of the inequality is $-m_{\mathcal{D}}\sum_{t=1}^{T}\eta_t = \sum_{t=1}^{T} w^{*\top}\eta_t d_t$. By setting $C = B_{\mathcal{R}}(w^*, w_1)$, ignoring the last term $B_{\mathcal{R}}(w^*, w_{T+1})$, replacing $\|d_t\|_*^2$ with its upper bound $L$ and using the identity $\sum_{t=1}^{T} w_t^{\top}\eta_t d_t = -\sum_{t=1}^{T}\eta_t\gamma_t$, the above inequality simplifies to

$-m_{\mathcal{D}}\sum_{t=1}^{T}\eta_t \le \frac{1}{2}L\sum_{t=1}^{T}\eta_t^2 - \sum_{t=1}^{T}\eta_t\gamma_t + C \qquad (A.2)$

Replacing $\eta_t$ with the value suggested in Theorem 4.7, i.e., $\eta_t = \frac{\gamma_t}{L\sqrt{t}}$, and dividing both sides by $\sum_{t=1}^{T}\eta_t$ gives

$\frac{\sum_{t=1}^{T}\left(\frac{1}{\sqrt{t}} - \frac{1}{t}\right)\gamma_t^2}{\sum_{t=1}^{T}\frac{1}{\sqrt{t}}\gamma_t} - \frac{LC}{\sum_{t=1}^{T}\frac{1}{\sqrt{t}}\gamma_t} \le m_{\mathcal{D}} \qquad (A.3)$

The first term is minimized when $\gamma_t = \gamma_{\min}$. Similarly to the first term, the second term is maximized when $\gamma_t$ is replaced by its minimum value. This gives the following lower bound for $m_{\mathcal{D}}$:

$\gamma_{\min}\frac{\sum_{t=1}^{T}\left(\frac{1}{\sqrt{t}} - \frac{1}{t}\right)}{\sum_{t=1}^{T}\frac{1}{\sqrt{t}}} - \frac{LC}{\gamma_{\min}\sum_{t=1}^{T}\frac{1}{\sqrt{t}}} \le m_{\mathcal{D}} \qquad (A.4)$

Considering the facts that $\int_{1}^{T+1}\frac{dx}{\sqrt{x}} \le \sum_{t=1}^{T}\frac{1}{\sqrt{t}}$ and $1 + \int_{1}^{T}\frac{dx}{x} \ge \sum_{t=1}^{T}\frac{1}{t}$, we get

$\gamma_{\min} - \gamma_{\min}\frac{1+\log T}{2\sqrt{T+1}-2} - \frac{LC}{\gamma_{\min}(\sqrt{T+1}-1)} \le m_{\mathcal{D}} \qquad (A.5)$

Now, by taking $\nu = \gamma_{\min}\frac{1+\log T}{2\sqrt{T+1}-2} + \frac{LC}{\gamma_{\min}(\sqrt{T+1}-1)}$, we have $\gamma_{\min} - \nu \le m_{\mathcal{D}} \le \gamma_{\min}$. It is clear from (A.5) that $\nu$ approaches zero as $T$ tends to infinity, with a convergence rate proportional to $\frac{\log T}{\sqrt{T}}$. It is noteworthy that this convergence rate is slightly worse than that of TotalBoost, which is $O(\frac{1}{\sqrt{T}})$.

A.3 Proof of Lemma 4.8

Remember that $\hat{\Pi}_{\mathcal{S}}(y) = \Pi_{\mathcal{S}}\big(\Pi_{\mathcal{K}}(y)\big)$. Our goal is to show that $B_{\mathcal{R}}(x, y) \ge B_{\mathcal{R}}\big(x, \hat{\Pi}_{\mathcal{S}}(y)\big)$. To this end, we only need to repeatedly apply Lemma 4.2, as follows:

$B_{\mathcal{R}}(x, y) \ge B_{\mathcal{R}}\big(x, \Pi_{\mathcal{K}}(y)\big) \qquad (A.6)$
$B_{\mathcal{R}}\big(x, \Pi_{\mathcal{K}}(y)\big) \ge B_{\mathcal{R}}\big(x, \hat{\Pi}_{\mathcal{S}}(y)\big) \qquad (A.7)$

which completes the proof.

A.4 Proof of the Boosting Algorithm for Combined Datasets

We have to show that when the convex set is defined as

$\mathcal{S}_c = \Big\{ w \;\Big|\; \sum_{i=1}^{N} w_i = 1,\; 0 \le w_i \;\forall i \in \mathcal{A} \;\wedge\; 0 \le w_i \le \tfrac{k}{N} \;\forall i \in \mathcal{B} \Big\} \qquad (A.8)$

the error of the final hypothesis on $\mathcal{A}$, i.e., $\epsilon_{\mathcal{A}}$, converges to zero, while the error on $\mathcal{B}$ is guaranteed to be $\epsilon_{\mathcal{B}} \le \frac{1}{k}$.

First, we show the convergence of $\epsilon_{\mathcal{A}}$ to zero. This is easily obtained by setting $w^*$ to be an error vector with zero weights over the training samples from $\mathcal{B}$ and $\frac{1}{\epsilon_{\mathcal{A}} N_{\mathcal{A}}}$ weights over the training set $\mathcal{A}$. One can verify that $w^* \in \mathcal{S}_c$.

Thus, the proof of Theorem 4.6 holds and, subsequently, the error bound in (4.8) states that $\epsilon_{\mathcal{A}} \to 0$ as the number of iterations increases.

To show the second part of the theorem, that is $\epsilon_{\mathcal{B}} \le \frac{1}{k}$, the vector $w^*$ is selected to be an error vector with zero weights over the training samples from $\mathcal{A}$ and $\frac{1}{\epsilon_{\mathcal{B}} N_{\mathcal{B}}}$ weights over the training set $\mathcal{B}$. Note that, as long as $\epsilon_{\mathcal{B}}$ is greater than $\frac{1}{k}$, this $w^* \in \mathcal{S}_c$. Thus, for all $\frac{1}{k} \le \epsilon_{\mathcal{B}}$ the proof of Theorem 4.6 holds and, as the bounds in (4.8) show, the error decreases as the number of iterations increases. In particular, in a finite number of rounds, the classification error on $\mathcal{B}$ reduces to $\frac{1}{k}$, which completes the proof.

A.5 Proof of Theorem 4.9

We use proof techniques similar to those given in [DSST10], with a slight change to take the normalization step into account.

By substituting $z_{t+1}$ from the update step into the projection step, the projection step can be rewritten as

$y_{t+1} = \arg\min_{y \in \mathbb{R}_+^N}\; \frac{1}{2}\|y - y_t\|^2 - \eta_t y^{\top} d_t + \alpha_t\eta_t\|y\|_1 \qquad (A.9)$

This optimization problem can be highly simplified by noting that the variables are not coupled. Thus, each coordinate can be optimized independently. In other words, it can be decoupled into $N$ independent 1-dimensional optimization problems.

$y_{t+1}^i = \arg\min_{0 \le y_i}\; \frac{1}{2}|y_i - y_t^i|^2 - \eta_t y_i d_t^i + \alpha_t\eta_t y_i \qquad (A.10)$

The solution of (A.10) can be written as

$y_{t+1}^i = \max\big(0,\; y_t^i + \eta_t d_t^i - \alpha_t\eta_t\big) \qquad (A.11)$

This simple solution gives a very efficient and simple implementation of SparseBoost. From (A.10) it is clear that for $d_t^i < 0$ (i.e., when the $i$-th sample is classified correctly), $-\eta_t y_i d_t^i$ acts as an $\ell_1$ norm regularizer and pushes $y_{t+1}^i$ towards zero, while $\alpha_t\eta_t$ enhances sparsity by pushing all weights towards zero. Let $w^*$ be the same error vector as defined in Theorem 4.6. We start this proof by again deriving the progress bounds on each step of the algorithm.
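Assuming the step size $\eta_t$ and the sparsity parameter $\alpha_t$ are given, the rule (A.11) is a one-line, coordinate-wise update; the Python sketch below is only an illustration of this rule, not the full SparseBoost algorithm.

    import numpy as np

    def sparseboost_update(y, d, eta, alpha):
        # y: current weights, d: per-sample costs of round t (negative for correctly
        # classified samples), eta: step size, alpha: sparsity parameter.
        # Gradient-like step followed by clipping at zero, as in (A.11).
        return np.maximum(0.0, y + eta * d - alpha * eta)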

The optimality of yt+1 for (A.9) implies that

$(w^* - y_{t+1})^{\top}\big(-\eta_t d_t + \alpha_t\eta_t r'(y_{t+1}) + y_{t+1} - y_t\big) \ge 0 \qquad (A.12)$

where $r'(y)$ is a sub-gradient vector of the $\ell_1$ norm function $r(y) = \sum_{i=1}^{N} y_i$. Moreover, due to the convexity of $r(y)$, we have

$\alpha_t\eta_t\, r'(y_{t+1})^{\top}(w^* - y_{t+1}) \le \alpha_t\eta_t\big(r(w^*) - r(y_{t+1})\big) \qquad (A.13)$

We thus have

$(w^* - y_t)^{\top}\eta_t d_t + \alpha_t\eta_t\big(r(y_{t+1}) - r(w^*)\big)$
$\le (w^* - y_t)^{\top}\eta_t d_t + \alpha_t\eta_t (y_{t+1} - w^*)^{\top} r'(y_{t+1})$
$= (w^* - y_{t+1})^{\top}\eta_t d_t + \alpha_t\eta_t (y_{t+1} - w^*)^{\top} r'(y_{t+1}) + (y_{t+1} - y_t)^{\top}\eta_t d_t$
$= (w^* - y_{t+1})^{\top}\big(\eta_t d_t - \alpha_t\eta_t r'(y_{t+1}) - y_{t+1} + y_t\big) + (w^* - y_{t+1})^{\top}(y_{t+1} - y_t) + (y_{t+1} - y_t)^{\top}\eta_t d_t \qquad (A.14)$

where the first inequality follows from (A.13). Now, from the optimality condition in (A.12), the first term in the last equation is non-positive and thus can be ignored.

$(w^* - y_t)^{\top}\eta_t d_t + \alpha_t\eta_t\big(r(y_{t+1}) - r(w^*)\big)$
$\le (w^* - y_{t+1})^{\top}(y_{t+1} - y_t) + (y_{t+1} - y_t)^{\top}\eta_t d_t$
$= \frac{1}{2}\|w^* - y_t\|_2^2 - \frac{1}{2}\|y_{t+1} - y_t\|_2^2 - \frac{1}{2}\|w^* - y_{t+1}\|_2^2 + (y_{t+1} - y_t)^{\top}\eta_t d_t$
$\le \frac{1}{2}\|w^* - y_t\|_2^2 - \frac{1}{2}\|y_{t+1} - y_t\|_2^2 - \frac{1}{2}\|w^* - y_{t+1}\|_2^2 + \frac{1}{2}\|y_{t+1} - y_t\|_2^2 + \frac{1}{2}\eta_t^2\|d_t\|_*^2 \qquad (A.15)$

where the first equation follows from Lemma 4.3 (or direct algebraic expansion in this case) and the second inequality from Lemma 4.5. By summing the left and right sides of the inequality from $1$ to $T$, replacing $\|d_t\|_*^2$ with its upper bound $N$ and substituting $1$ for $r(w^*)$, we get

$\sum_{t=1}^{T} w^{*\top}\eta_t d_t \le \sum_{t=1}^{T} y_t^{\top}\eta_t d_t + \sum_{t=1}^{T}\frac{N}{2}\eta_t^2 + \frac{1}{2}\|w^* - y_1\|_2^2 + \sum_{t=1}^{T}\alpha_t\eta_t\big(1 - r(y_{t+1})\big) \qquad (A.16)$

Now, replacing $r(y_{t+1})$ with its lower bound, i.e., $0$, and using the facts that $\sum_{t=1}^{T} w^{*\top}\eta_t d_t \ge 0$ (as shown in (4.14)) and $\sum_{t=1}^{T} y_t^{\top}\eta_t d_t = -\sum_{t=1}^{T}\eta_t\gamma_t\|y_t\|_1$, yields

$0 \le -\sum_{t=1}^{T}\eta_t\gamma_t\|y_t\|_1 + \sum_{t=1}^{T}\frac{N}{2}\eta_t^2 + \frac{1}{2}\|w^* - y_1\|_2^2 + \sum_{t=1}^{T}\alpha_t\eta_t \qquad (A.17)$

Taking the derivative w.r.t. $\eta_t$ and setting it to zero gives the optimal $\eta_t$ as follows:

$\eta_t = \frac{\gamma_t\|y_t\|_1 - \alpha_t}{N} \qquad (A.18)$

This equation implies that $\alpha_t$ should be smaller than $\gamma_t\|y_t\|_1$, since otherwise $\eta_t$ becomes smaller than zero. Setting $\alpha_t = (1-k)\gamma_t\|y_t\|_1$, where $k$ is a constant smaller than or equal to $1$, results in $\eta_t = \frac{k}{N}\gamma_t\|y_t\|_1$. Replacing this value for $\eta_t$ in (A.17) and noting that $\frac{1}{2}\|w^* - y_1\|_2^2 = \frac{1-\epsilon}{2N\epsilon}$ gives the following bound on the training error:

$\epsilon \le \frac{1}{1 + c\sum_{t=1}^{T}\gamma_t^2\|y_t\|_1^2} \qquad (A.19)$

where $c = k^2$ is a constant factor depending on the choice of $\alpha_t$. To prove that $\epsilon$ approaches zero as $T$ increases, we still have to provide evidence that $\sum_{t=1}^{T}\gamma_t^2\|y_t\|_1^2$ is a divergent series. There are different possibilities to approach this problem. Here, we show that in the case of $\alpha_t = 0$, the $\ell_1$ norm of the weights, $\|y_t\|_1$, can be bounded away from zero (i.e., $\|y_t\|_1 \ge C > 0$) and thus $\sum_{t=1}^{T}\gamma_t^2\|y_t\|_1^2 \ge T\gamma_{\min}^2 C^2$. To this end, we rewrite $y_t^i$ from (A.11) as

$y_t^i = \max\big(0,\; y_{t-1}^i + \eta_{t-1} d_{t-1}^i - \alpha_{t-1}\eta_{t-1}\big)$
$\ge y_{t-1}^i + \eta_{t-1} d_{t-1}^i - \alpha_{t-1}\eta_{t-1}$
$\ge \frac{1}{N} + \sum_{t'=1}^{t-1}\eta_{t'} d_{t'}^i - \sum_{t'=1}^{t-1}\alpha_{t'}\eta_{t'} \qquad (A.20)$

where the last inequality is achieved by recursively applying the first inequality to $y_{t-1}^i$. At any arbitrary round $t$, either the algorithm has already converged and $\epsilon = 0$, or there is at least one sample that is classified wrongly by the ensemble classifier $H_t(x) = \sum_{l=1}^{t}\eta_l h_l(x)$. Now, without loss of generality, assume the $i$-th sample is wrongly classified at round $t$. That is, $\sum_{t'=1}^{t-1}\eta_{t'} d_{t'}^i > 0$ (see (4.14)). Now, for $\alpha_t = 0$, the weight of the wrongly classified sample $i$ is

$y_t^i \ge \frac{1}{N} + \sum_{t'=1}^{t-1}\eta_{t'} d_{t'}^i \ge \frac{1}{N} \qquad (A.21)$

That is, $\|y_t\|_1 \ge \frac{1}{N}$. This gives a loose (but sufficient for our purpose) lower bound on $\|y_t\|_1$. Replacing $\|y_t\|_1$ with its lower bound $\frac{1}{N}$ in (A.19) yields

$\epsilon \le \frac{N^2}{1 + T\gamma^2} \qquad (A.22)$

where $\gamma$ is the minimum edge over all $\gamma_t$.

A.6 Proof of Entropy Projection onto Hypercube

Lemma A.3. Let $\mathcal{R}(w) = \sum_{i=1}^{N} w_i\log w_i - w_i$. Then the Bregman projection of a positive vector $z \in \mathbb{R}_+^N$ onto the unit hypercube $\mathcal{K} = [0, 1]^N$ is $y_i = \min(1, z_i),\; i = 1, \ldots, N$.

To show the correctness of the above lemma, i.e., that the solution of the Bregman projection

$y = \arg\min_{y \in \mathcal{K}} B_{\mathcal{R}}(y, z) \qquad (A.23)$

is $y_i = \min(1, z_i)$, we only need to show that $y$ satisfies the optimality condition

$(v - y)^{\top}\nabla B_{\mathcal{R}}(y, z) \ge 0 \quad \forall v \in \mathcal{K} \qquad (A.24)$

Given $\mathcal{R}(w) = \sum_{i=1}^{N} w_i\log w_i - w_i$, the gradient of $B_{\mathcal{R}}$ is

$\nabla B_{\mathcal{R}}(y, z) = \Big[\log\frac{y_i}{z_i}\Big]_{i=1}^{N} \qquad (A.25)$

Hence,

$(v - y)^{\top}\nabla B_{\mathcal{R}}(y, z) = \sum_{i \in \{i:\, z_i \ge 1\}} (v_i - y_i)\log\frac{y_i}{z_i} + \sum_{i \in \{i:\, z_i < 1\}} (v_i - y_i)\log\frac{y_i}{z_i} \qquad (A.26)$

For $z_i \ge 1$, $y_i$ is equal to $1$. That is, $\log\frac{y_i}{z_i} = \log\frac{1}{z_i} \le 0$. On the other hand, since $v_i \le 1$, $(v_i - y_i) = (v_i - 1) \le 0$. Thus, the first sum in (A.26) is always non-negative. The second sum is always zero since $\log\frac{y_i}{z_i} = \log 1 = 0$. That is, the optimality condition (A.26) is non-negative for all $v$, which completes the proof.
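The closed form of Lemma A.3 can also be checked numerically. The sketch below (assuming numpy and scipy are available) compares $\min(1, z_i)$ against a box-constrained numerical minimizer of the Bregman divergence; it is only a sanity check, not part of the proof.

    import numpy as np
    from scipy.optimize import minimize

    def bregman_div(y, z):
        # B_R(y, z) for R(w) = sum_i (w_i log w_i - w_i), with the convention 0 log 0 = 0.
        y = np.asarray(y, dtype=float)
        return np.sum(np.where(y > 0, y * np.log(y / z), 0.0) - y + z)

    z = np.array([0.3, 1.7, 0.9, 4.0])
    closed_form = np.minimum(1.0, z)          # Lemma A.3
    numeric = minimize(lambda y: bregman_div(y, z), x0=np.full(z.size, 0.5),
                       bounds=[(1e-9, 1.0)] * z.size, method="L-BFGS-B").x
    # closed_form and numeric agree up to the solver tolerance.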

A.7 Proof of Theorem 4.11

Its proof is essentially the same as the proof of the lazy version of MABoost, with a few differences. Before proceeding further, some definitions and facts should be re-emphasized. First of all, since $\mathcal{R}(w) = \sum_{i=1}^{N} w_i\log w_i - w_i$ is $\frac{1}{N}$-strongly convex (see [Sha12, p. 136]) with respect to the $\ell_1$ norm (and not $1$-strongly convex as in Theorem 4.6), the following inequality holds for the Bregman divergence:

$B_{\mathcal{R}}(x, y) \ge \frac{1}{2N}\|x - y\|_1^2 \qquad (A.27)$

Moreover, the following lemma, which bounds $\|y_t\|_1$, is essential for our proof.

Lemma A.4. For all $t$, $\|y_t\|_1 \ge N\epsilon_t$, where $\epsilon_t$ is the error of the ensemble hypothesis $H_t(x) = \sum_{l=1}^{t}\eta_l h_l(x)$ at round $t$.

This lemma holds due to the fact that

$y_t^i = \min(1, z_t^i) = \min\big(1, e^{\sum_{l=1}^{t}\eta_l d_l^i}\big) = \min\big(1, e^{-a_i H_t(x_i)}\big) \qquad (A.28)$

where $H_t(x) = \sum_{l=1}^{t}\eta_l h_l(x)$ is the output of the algorithm at round $t$. If $H_t(x_i)$ makes a mistake on classifying $x_i$, then $-a_i H_t(x_i)$ will be greater than zero and thus $y_t^i = 1$. For the samples that are classified correctly, $-a_i H_t(x_i) \le 0$ and thus $0 \le y_t^i \le 1$. That is, $N\epsilon_t = $ number of wrongly classified samples at round $t$ $\le \sum_{i=1}^{N} y_t^i = \|y_t\|_1$.

We are now ready to proceed with the proof of Theorem 4.11. Let $w^* = [w_1^*, \ldots, w_N^*]^{\top}$ be a vector where $w_i^* = 1$ if $f(x_i) \ne a_i$, and $0$ otherwise. Similar to the proof of the lazy update, we are going to bound $\sum_{t=1}^{T}(w^* - y_t)^{\top}\eta_t d_t$.

$(w^* - y_t)^{\top}\eta_t d_t = (y_{t+1} - y_t)^{\top}\big(\nabla\mathcal{R}(z_{t+1}) - \nabla\mathcal{R}(z_t)\big) + (z_{t+1} - y_{t+1})^{\top}\big(\nabla\mathcal{R}(z_{t+1}) - \nabla\mathcal{R}(z_t)\big) + (w^* - z_{t+1})^{\top}\big(\nabla\mathcal{R}(z_{t+1}) - \nabla\mathcal{R}(z_t)\big)$
$\le \frac{1}{2N}\|y_{t+1} - y_t\|_1^2 + \frac{N}{2}\eta_t^2\|d_t\|_*^2 + B_{\mathcal{R}}(y_{t+1}, z_{t+1}) - B_{\mathcal{R}}(y_{t+1}, z_t) + B_{\mathcal{R}}(z_{t+1}, z_t) - B_{\mathcal{R}}(w^*, z_{t+1}) + B_{\mathcal{R}}(w^*, z_t) - B_{\mathcal{R}}(z_{t+1}, z_t)$
$\le \frac{1}{2N}\|y_{t+1} - y_t\|_1^2 + \frac{N}{2}\eta_t^2\|d_t\|_*^2 - B_{\mathcal{R}}(y_{t+1}, y_t) + B_{\mathcal{R}}(y_{t+1}, z_{t+1}) - B_{\mathcal{R}}(y_t, z_t) - B_{\mathcal{R}}(w^*, z_{t+1}) + B_{\mathcal{R}}(w^*, z_t) \qquad (A.29)$

where the first inequality follows from applying Lemma 4.5 to the first term and Lemma 4.3 to the rest of the terms, and the second inequality is the result of applying the exact version of Lemma 4.2 to $B_{\mathcal{R}}(y_{t+1}, z_t)$. Moreover, according to inequality (A.27), $B_{\mathcal{R}}(y_{t+1}, y_t) - \frac{1}{2N}\|y_{t+1} - y_t\|_1^2 \ge 0$ and hence these terms can be ignored in (A.29). Summing up inequality (A.29) from $t = 1$ to $T$ yields:

$-B_{\mathcal{R}}(w^*, z_1) \le \sum_{t=1}^{T}\frac{N}{2}\eta_t^2 - \sum_{t=1}^{T}\eta_t\gamma_t\|y_t\|_1 \qquad (A.30)$

It is important to remark that the $\|y_t\|_1$ appearing in the last term is due to the fact that $w_t = \frac{y_t}{\|y_t\|_1}$ and thus $y_t^{\top}\eta_t d_t = w_t^{\top}\eta_t d_t\,\|y_t\|_1 = -\eta_t\gamma_t\|y_t\|_1$. Now, by replacing $\eta_t = \epsilon_t\gamma_t$ in the above equation and noting that $B_{\mathcal{R}}(w^*, z_1) = N - N\epsilon$, we get:

$-N(1-\epsilon) \le \sum_{t=1}^{T}\frac{N}{2}\epsilon_t^2\gamma_t^2 - \sum_{t=1}^{T}\epsilon_t\gamma_t^2\|y_t\|_1 \qquad (A.31)$

From Lemma A.4, it is evident that $\|y_t\|_1 \ge N\epsilon_t$. Moreover, since $\epsilon \le \epsilon_t$, it can be replaced by $\epsilon$ as well (though very pessimistically). As usual, $\gamma_t$ is also replaced by the minimum edge, denoted by $\gamma$. Applying these lower bounds to equation (A.31) yields

$\epsilon^2 \le \frac{2(1-\epsilon)}{T\gamma^2} \le \frac{2}{T\gamma^2} \qquad (A.32)$

which indicates that the proposed version of MadaBoost needs at most $O(\frac{1}{\epsilon^2\gamma^2})$ iterations to converge.
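The weight rule (A.28) behind this proof amounts to a single line of code; the sketch below illustrates the capped weights and the reason Lemma A.4 holds: every sample misclassified by the current ensemble contributes a full weight of one.

    import numpy as np

    def capped_weights(H_scores, a):
        # H_scores: ensemble outputs H_t(x_i); a: labels in {-1, +1}.
        # Lazy weights z_t^i = exp(-a_i H_t(x_i)), capped at 1 by the entropy
        # projection onto the unit hypercube (Lemma A.3), as in (A.28).
        y = np.minimum(1.0, np.exp(-a * H_scores))
        # Misclassified samples have -a_i H_t(x_i) > 0, hence y_i = 1, so
        # ||y||_1 >= (number of misclassified samples) = N * eps_t  (Lemma A.4).
        return y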

A.8 Multiclass Weak-learning Condition

In this appendix, we show that the weak-learning condition adopted for Mu-MABoost is the optimal weak-learning condition in the sense that it is the weakest condition that can be assumed while still guaranteeing the boosting property. To this end, we first show that the weak-learning condition in Definition 3 is equivalent to that of AdaBoost.MR. As before, assume all the samples belong to class 0. Define $\mathcal{W}$ to be the collection of $N \times K$ weight matrices $W$ satisfying the following conditions:

1. $W_{i,j} \ge 0$ for $j \ne 0$
2. $W_{i,0} = -\sum_{j \ne 0} W_{i,j}$

Further, consider the matrix $\mathbf{1}_h$ to be an $N \times K$ matrix whose $(i, j)$-th entry is $1$ if $h(x_i) = j$, and $0$ otherwise. The following equation then describes the AdaBoost.MR weak-learning condition presented in [MS13].

$\forall W \in \mathcal{W},\; \exists h \in \mathcal{H}: \quad W \bullet \mathbf{1}_h \le 0 \qquad (A.33)$

It is straightforward to show that the weak-learning condition in (A.33) is equivalent to that in Definition 3. Let $\mathrm{Corr}$ be the set of indices of the samples correctly classified by hypothesis $h$. By expanding the left side of inequality (A.33), we get

$W \bullet \mathbf{1}_h = \sum_{i=1}^{N} W_{i, h(x_i)} = \sum_{i \in \mathrm{Corr}} W_{i,0} + \sum_{i \notin \mathrm{Corr}} W_{i, h(x_i)}$
$= \sum_{i \notin \mathrm{Corr}} W_{i, h(x_i)} - \sum_{i \in \mathrm{Corr}}\sum_{j \ne 0} W_{i,j} \qquad (A.34)$

The right side of the last equation in (A.34) is equal to $W \bullet D$ in Definition 3. Thus, both AdaBoost.MR and Mu-MABoost adopt the same weak-learning condition. Moreover, according to Theorem 6 in [MS13], this condition is the minimal weak-learning condition that can be adopted such that the boosting algorithm drives the training error to zero.
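The identity (A.34) is easy to verify numerically; the hypothetical sketch below computes the left side of (A.33) directly and via the right side of (A.34) for a weight matrix satisfying the two conditions above, assuming all samples belong to class 0.

    import numpy as np

    def frobenius_form(W, pred):
        # Left side of (A.33): W . 1_h = sum_i W[i, h(x_i)].  W: numpy array of shape (N, K).
        return sum(W[i, p] for i, p in enumerate(pred))

    def edge_form(W, pred):
        # Right side of (A.34); a sample is "correct" here iff it is predicted as class 0.
        corr = np.asarray(pred) == 0
        wrong = sum(W[i, p] for i, p in enumerate(pred) if not corr[i])
        return wrong - W[corr][:, 1:].sum()   # minus the sum over Corr and j != 0

    # For any W with W[i, 0] = -sum_{j != 0} W[i, j], both forms return the same value.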

Bibliography

[AB96] D. W. Aha and R. L. Bankert. A comparative evaluation of sequential feature selection algorithms. In Learning from Data: Artificial Intelligence and Statistics V. Springer-Verlag, 1996.

[Abr63] N. Abramson. Information Theory and Coding. McGraw-Hill, New York, 1963.

[AF06] A. E. Abdel-Hakim and A. A. Farag. CSIFT: A SIFT descriptor with color invariant characteristics. In Proceedings of CVPR, 2006.

[AITT00] Y. Asahiro, K. Iwama, H. Tamaki, and T. Tokuyama. Greedily finding a dense subgraph. Journal of Algorithms, 2000.

[AM08] I. Almajai and B. Milner. Using audio-visual features for robust voice activity detection in clean and noisy speech. In Proceedings of EUSPC, 2008.

[AMM+07] M. Aoki, K. Masuda, H. Matsuda, T. Takiguchi, and Y. Ariki. Voice activity detection by lip shape tracking using EBGM. In Proceedings of the 15th International Conference on Multimedia. ACM, 2007.

[Bat94] R. Battiti. Using mutual information for selecting features in supervised neural net learning. IEEE Trans. on Neural Networks, 5:537–550, 1994.

[BGL02] N. H. Bshouty, D. Gavinsky, and M. Long. On boosting with polynomially bounded distributions. Journal of Machine Learning Research, 2002.

[BLM01] S. Ben-David, P. Long, and Y. Mansour. Agnostic boosting. In COLT. 2001.

[BM98] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of COLT, 1998.

[BM12] J. Bruna and S. Mallat. Invariant scattering convolution networks. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2012.

[Bog07] V. I. Bogachev. Measure Theory. Springer, 2007.

[BP10] K. S. Balagani and V. V. Phoha. On the feature selection criterion based on an approximation of multidimensional mutual information. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2010.

[BPZL12] G. Brown, A. Pocock, M. Zhao, and M. Luján. Conditional likelihood maximisation: A unifying framework for information theoretic feature selection. Journal of Machine Learning, 2012.

[Bre97] L. Breiman. Pasting bites together for prediction in large data sets and on-line. Technical report, Dept. Statistics, Univ. California, Berkeley, 1997.

[Bre99] L. Breiman. Prediction games and arcing algorithms. Neural Computation, 1999.

[Bre01] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

[Bro09] G. Brown. A new perspective for information theoretic feature selection. In Proceedings of Artificial Intelligence and Statistics, 2009.

[BS08] J. K. Bradley and R. E. Schapire. Filterboost: Regression and classification on large datasets. In NIPS. 2008.

[BV04] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004.

[Cas79] K. R. Castleman. Digital Image Processing. Prentice Hall Professional Technical Reference, 1st edition, 1979.

[CBCS06] M. Cooke, J. Barker, S. Cunningham, and X. Shao. An audio-visual corpus for speech perception and automatic speech recognition. The Journal of the Acoustical Society of America, 2006.

[CBL06] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[CDM02] C. C. Chibelushi, F. Deravi, and J. S. Mason. A review of speech-based bimodal recognition. IEEE Trans. on Multimedia, 2002.

[CET01] T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6):681–685, 2001.

[CH97] G. I. Chiou and J. Hwang. Lipreading from color video. IEEE Trans. on Image Processing, 1997.

[CH23] L. Cappelletta and H. Harte. Phoneme-to-viseme mapping for visual speech recognition. In Proceedings of Pattern Recognition Applications and Methods, 2023.

[Che01] T. Chen. Audiovisual speech processing. IEEE Signal Processing Magazine, 2001.

[CHLN08] S. Cox, R. Harvey, Y. Lan, and J. Newman. The challenge of multispeaker lip-reading. In International Conference on Auditory-Visual Speech Processing, 2008.

[CM93] M. Cohen and D. Massaro. Modeling coarticulation in synthetic visual speech. In Models and Techniques in Computer Animation. 1993.

[CM03] C. Cortes and M. Mohri. AUC optimization vs. error rate minimization. In NIPS. MIT Press, 2003.

[CR98] T. Chen and R. R. Rao. Audio-visual integration in multimodal communication. Proceedings of the IEEE, 1998.

[CSO10] P. M. Ciarelli, E. O. T. Salles, and E. Oliveira. An evolving system based on probabilistic neural network. In Proceedings of BSNN, 2010.

[CSWB06] R. Collobert, F. Sinz, J. Weston, and L. Bottou. Trading convexity for scalability. In Proceedings of ICML, 2006.

[CT91] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley-Interscience, New York, NY, USA, 1991.

[DGFVG12] M. Dantone, J. Gall, G. Fanelli, and L. Van Gool. Real-time facial feature detection using conditional regression forests. In CVPR, 2012.

[DL00] S. Dupont and J. Luettin. Audio-visual speech modeling for continuous speech recognition. IEEE Trans. on Multimedia, 2000.

[Doa92] J. Doak. An evaluation of feature selection methods and their application to computer security. Technical Report CSE-92-18, University of California at Davis, 1992.

[Dos10] M. B. Dosse. Anisotropic orthogonal Procrustes analysis. Journal of Classification, 27, 2010.

[DSST10] J. C. Duchi, S. Shalev-Shwartz, Y. Singer, and A. Tewari. Composite objective mirror descent. In COLT, 2010.

[DW00] C. Domingo and O. Watanabe. MadaBoost: A modification of AdaBoost. In COLT, 2000.

[DYXY07] W. Dai, Q. Yang, G. Xue, and Y. Yong. Boosting for transfer learning. In ICML, 2007.

[FA10] A. Frank and A. Asuncion. UCI machine learning repository, 2010.

[Fan61] R. Fano. Transmission of Information: A Statistical Theory of Communications. The MIT Press, Cambridge, MA, 1961.

[FDV12] B. Frénay, G. Doquire, and M. Verleysen. On the potential inadequacy of mutual information for feature selection. In Proceedings of ESANN, 2012.

[FF97] J. H. Friedman and U. Fayyad. On bias, variance, 0/1-loss, and the curse-of-dimensionality. Data Mining and Knowledge Discovery, 1997.

[FHF11] P. Flach, J. Hernández-Orallo, and C. Ferri. A coherent interpretation of AUC as a measure of aggregated classification performance. In Proceedings of ICML, 2011.

[FHT98] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 1998.

[Fre95] Y. Freund. Boosting a weak learning algorithm by majority. Journal of Information and Computation, 1995.

[FS96a] Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In ICML, 1996.

[FS96b] Y. Freund and R. E. Schapire. Game theory, on-line prediction and boosting. In COLT, 1996.

[FS97] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 1997.

[FT74] J. H. Friedman and J. W. Tukey. A projection pursuit algorithm for exploratory data analysis. IEEE Trans. on Computers, 1974.

[FY83] A. M. Frieze and J. Yadegar. On the quadratic assignment problem. Discrete Applied Mathematics, 1983.

[Gav03] D. Gavinsky. Optimally-smooth adaptive boosting and application to agnostic learning. Journal of Machine Learning Research, 2003.

[GBW01] S. Gannot, D. Burshtein, and E. Weinstein. Signal enhancement using beamforming and nonstationarity with applications to speech. IEEE Trans. on Signal Processing, 49:1614–1626, 2001.

[GCSP13] Z. Golumbic, G. B. Cogan, C. E. Schroeder, and D. Poeppel. Visual input enhances selective speech envelope tracking in auditory cortex at a "cocktail party". 2013.

[GPP+12] L. Grippo, L. Palagi, M. Piacentini, V. Piccialli, and G. Rinaldi. SpeeDP: an algorithm to compute SDP bounds for very large max-cut instances. Mathematical Programming, 2012.

[GSM13] F. G. German, D. L. Sun, and G. J. Mysore. Towards a practical lipreading system. In Proceedings of CVPRInterspeech, 2013.

[Gur09] M. Gurban. Multimodal feature extraction and fusion for audio-visual speech recognition. PhD thesis, 4292, STI, EPF Lausanne, 2009.

[GVN+01] H. Glotin, D. Vergyri, C. Neti, G. Potamianos, and J. Luettin. Weighting schemes for audio-visual fusion in speech recognition. In Proceedings of ICASSP, 2001.

[GW95] M. X. Goemans and D. P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM, 1995.

[Han80] T. S. Han. Multiple mutual informations and multiple interactions in frequency data. Information and Control, 1980.

[Hat06] K. Hatano. Smooth boosting using an information-based criterion. In Algorithmic Learning Theory. 2006.

[Haz09] E. Hazan. A survey: The convex optimization approach to regret minimization. Working draft, 2009.

[HDY+12] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 2012.

[HES00] H. Hermansky, D. P. W. Ellis, and S. Sharma. Tandem connectionist feature extraction for conventional HMM systems. In Proceedings of ICASSP, 2000.

[Hie93] J. L. Hieronymus. ASCII phonetic symbols for the world's languages: Worldbet. International Phonetic Association, 1993.

[HR70] M. Hellman and J. Raviv. Probability of error, equivocation and the Chernoff bound. IEEE Trans. on Information Theory, 1970.

[HSH99] O. Hoshuyama, A. Sugiyama, and A. Hirano. A robust adaptive beamformer for microphone arrays with a blocking matrix using constrained adaptive filters. IEEE Trans. on Signal Processing, 47:2677–2684, 1999.

[HT01] D. J. Hand and R. J. Till. A simple generalisation of the area under the ROC curve for multiple class classification problems. Journal of Machine Learning Research, 2001.

[HT09] K. Hatano and E. Takimoto. Linear programming boosting by column and row generation. In Proceedings of Discovery Science, 2009.

[Hu11] B. G. Hu. What are the differences between Bayesian classifiers and mutual-information classifiers? IEEE Trans. on Neural Networks and Learning Systems, 2011.

[HW99] M. Hollander and D. A. Wolfe. Nonparametric Statistical Methods, 2nd Edition. Wiley-Interscience, 1999.

[JB71] J. Jeffers and M. Barley. Speechreading (Lipreading). Charles C Thomas Pub Ltd., 1971.

[JHL97] B. Juang, W. Hou, and C. Lee. Minimum classification error rate methods for speech recognition. IEEE Trans. on Speech and Audio Processing, 1997.

[KC02] N. Kwak and C. Choi. Input feature selection for classification problems. IEEE Trans. on Neural Networks, 2002.

[KHDM98] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas. On combining classifiers. IEEE Trans. on Pattern Analysis and Machine Intelligence, 1998.

[KK09] A. Kalai and V. Kanade. Potential-based agnostic boosting. In NIPS. 2009.

[KKG07] B. J. Killian, J. Y. Kravitz, and M. K. Gilson. Extraction of configurational entropy from molecular simulations via an expansion approximation. Journal of Chemical Physics, 2007.

[KMS08] A. Klaser, M. Marszalek, and C. Schmid. A spatio-temporal descriptor based on 3d-gradients. In proceedings of BMVC, 2008.

[Koh96] R. Kohavi. Wrappers for Performance Enhancement and Oblivious Decision Graphs. PhD thesis, Stanford, CA, USA, 1996.

[KR13] T. Kinnunen and P. Rajan. A practical, self-adaptive voice activity detector for speaker verification with noisy telephone and microphone data. In ICASSP, 2013.

[KS01] D. Karger and N. Srebro. Learning Markov networks: Maximum bounded tree-width graphs. Proceedings of the 12th Annual Symposium on Discrete Algorithms, pages 392–401, 2001.

[KSS92] M. J. Kearns, R. E. Schapire, and L. M. Sellie. Toward efficient agnostic learning. In COLT, 1992.

[Lan00] T. Lane. Extensions of ROC analysis to multi-class domains. In ICML-2000 Workshop on Cost-Sensitive Learning, 2000.

[LCSC05] S. Lucey, T. Chen, S. Sridharan, and V. Chandran. Integration strategies for audio-visual speech processing: applied to text-dependent speaker recognition. IEEE Trans. on Multimedia, 2005.

[LHT+09] Y. Lan, R. Harvey, B. J. Theobald, E. Ong, and R. Bowden. Comparing visual features for lipreading. In Auditory-Visual Speech Processing, 2009.

[Lof04] J. Lofberg. YALMIP: a toolbox for modeling and optimization in MATLAB. In Proceedings of Computer Aided Control Systems Design, 2004.

[Low04] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 2004.

[LS10] P. M. Long and R. A. Servedio. Random classification noise defeats all convex potential boosters. Journal of Machine Learning Research, 2010.

[LT97] J. Luettin and N. Thacker. Speechreading using probabilistic models. Computer Vision and Image Understanding, 1997.

[LTB96] J. Luettin, N. A. Thacker, and S. Beet. Speechreading using shape and intensity information. In International Conf. Spoken Language Proc., 1996.

[LTH+10] Y. Lan, B. J. Theobald, R. Harvey, E. J. Ong, and R. Bowden. Improving visual features for lipreading. In Proceedings of AVSP, 2010.

[MB06] P. Meyer and G. Bontempi. On the Use of Variable Complementarity for Feature Selection in Cancer Classification. Springer, 2006.

[MBBF99] L. Mason, J. Baxter, P. Bartlett, and M. Frean. Boosting algorithms as gradient descent. In NIPS, 1999.

[MCB+02] I. Matthews, T. F. Cootes, J. A. Bangham, S. Cox, and R. Harvey. Extraction of visual features for lipreading. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2002.

[McG54] W. McGill. Multivariate information transmission. Trans. of the IRE Professional Group on Information Theory, 1954.

[MHL+07] E. McDermott, T. J. Hazen, J. Le-Roux, A. Nakamura, and S. Katagiri. Discriminative training for large-vocabulary speech recognition using minimum classification error. IEEE Trans. on Audio, Speech, and Language Processing, 2007.

[MLS+13] V. P. Minotto, C. B. O. Lopes, J. Scharcanski, C. R. Jung, and B. Lee. Audiovisual voice activity detection based on microphone arrays and color information. IEEE Journal of Selected Topics in Signal Processing, 2013.

[MM02] Y. Mansour and D. McAllester. Boosting using branching programs. Journal of Computer and System Sciences, 2002.

[MS02] K. Mikolajczyk and C. Schmid. An affine invariant interest point detector. In Proceedings of Computer Vision, 2002.

[MS13] I. Mukherjee and R. E. Schapire. A theory of multiclass boosting. Journal of Machine Learning Research, 2013.

[Nag14] T. Naghibi. MABoost: An R package for classification, 2014. URL http://cran.r-project.org/web/packages/maboost.

[NC09] J. L. Newman and S. J. Cox. Automatic visual-only language identification: A preliminary study. In Proceedings of ICASSP, 2009.

[NF77] P. M. Narendra and K. Fukunaga. A branch and bound algorithm for feature subset selection. IEEE Trans. on Computers, 1977.

[NHP13] T. Naghibi, S. Hoffmann, and B. Pfister. Convex approximation of the NP-hard search problem in feature subset selection. In Proceedings of ICASSP, 2013.

[NSS04] J. Neumann, C. Schnörr, and G. Steidl. SVM-based feature selection by direct objective minimisation. In Proceedings of DAGM, 2004.

[NWF08] J. C. Niebles, H. Wang, and L. Fei-Fei. Unsupervised learning of human action categories using spatial-temporal words. International Journal of Computer Vision, 2008.

[ONS99] S. Okawa, T. Nakajima, and K. Shirai. A recombination strategy for multi-band speech recognition based on mutual information criterion. In Proceedings of Eurospeech, 1999.

[OPM02] T. Ojala, M. Pietikainen, and T. Maenpaa. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2002.

[PD00] F. Provost and P. Domingos. Well-trained PETs: Improving probability estimation trees, 2000. CDER Working Paper, Stern School of Business, New York University, NY.

[PG98] G. Potamianos and H. P. Graf. Discriminative training of HMM stream exponents for audio-visual speech recognition. In Proceedings of ICASSP, 1998.

[PGC08] S. Pachoud, S. Gong, and A. Cavallaro. Macro-cuboid based probabilistic matching for lip-reading digits. Proceedings of CVPR, 2008.

[PGTG02] E. K. Patterson, S. Gurbuz, Z. Tufekci, and J. N. Gowdy. CUAVE: A new audio-visual database for multimodal human-computer interface research. In Proceedings of ICASSP, 2002.

[PLD05] H. Peng, F. Long, and C. Ding. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2005.

[PLHZ04] H. Pan, S. E. Levinson, T. S. Huang, and L. Zhi-pei. A fused hidden Markov model with application to bimodal speech processing. IEEE Trans. on Signal Processing, 2004.

[PNLM04] G. Potamianos, C. Neti, J. Luettin, and I. Matthews. Audio-visual automatic speech recognition: an overview. Visual and Audio-Visual Speech Processing, 2004.

[PRW95] S. Poljak, F. Rendl, and H. Wolkowicz. A recipe for semidefinite relaxation for (0,1)-quadratic programming. Journal of Global Optimization, 1995.

[PSSD06] A. Potamianos, E. Sanchez-Soto, and K. Daoudi. Stream weight computation for multi-stream classifiers. In Proceedings of ICASSP, 2006.

[QV95] F. Qian and B. Van Veen. Quadratically constrained adaptive beamforming for coherent signals and interference. IEEE Trans. on Signal Processing, 43:1890–1900, 1995.

[QWP11] L. Qingju, W. Wenwu, and J. Philip. A visual voice activity detection method with adaboosting. In Sensor Signal Processing for Defence, SSPD, 2011.

[Rag88] P. Raghavan. Probabilistic construction of deterministic algorithms: approximating packing integer programs. Journal of Computer and System Sciences, 1988.

[RCB97] M. G. Rahim, L. Chin-Hui, and J. Biing-Hwang. Discriminative utterance verification for connected digits recognition. IEEE Trans. on Speech and Audio Processing, 1997.

[Rez61] F. M. Reza. An Introduction to Information Theory. Dover Publications, Inc., New York, 1961.

[RHEC10] I. Rodriguez-Lujan, R. Huerta, Ch. Elkan, and C. S. Cruz. Quadratic programming feature selection. Journal of Machine Learning Research, 2010.

[RJ93] L. Rabiner and B. Juang. Fundamentals of Speech Recognition. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1993.

[RS12] R. Rajavel and P. S. Sathidevi. Adaptive reliability measure and optimum integration weight for decision fusion audio-visual speech recognition. Journal of Signal Processing Systems, 2012.

[RW05] G. Rätsch and M. Warmuth. Efficient margin maximization with boosting. Journal of Machine Learning Research, 2005.

[SAS07] P. Scovanner, S. Ali, and M. Shah. A 3-dimensional SIFT descriptor and its application to action recognition. In Proceedings of Multimedia, 2007.

[Sch90] R. E. Schapire. The strength of weak learnability. Journal of Machine Learning Research, 1990.

[Ser03] R. A. Servedio. Smooth boosting and learning with malicious noise. Journal of Machine Learning Research, 2003.

[Sha12] S. Shalev-Shwartz. Online learning and online convex optimization. Found. Trends Machine Learning, 2012.

[SK85] T. J. Shan and T. Kailath. Adaptive beamforming for coherent signals and interference. IEEE Trans. on ASSP, vol. 33:527–536, 1985.

[SKS99] J. Sohn, N. S Kim, and W. Sung. A statistical model-based voice activity detection. IEEE Signal Processing Letters, 6(1):1–3, 1999.

[SLS+05] K. Saenko, K. Livescu, M. Siracusa, K. Wilson, J. Glass, and T. Darrell. Visual speech recognition with loosely synchronized feature streams. In Proceedings of ICCV, 2005.

[SS98] R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. In COLT, 1998.

[SS06] F. Sha and L. K. Saul. Large margin Gaussian mixture modeling for phonetic classification and recognition. In Proceedings of ICASSP 2006, 2006.

[SS08] S. Shalev-Shwartz and Y. Singer. On the equivalence of weak learnability and linear separability: new relaxations and efficient boosting algorithms. In COLT, 2008.

[SW98] A. Srivastav and K. Wolf. Finding dense subgraphs with semidefinite programming. In Approximation Algorithms for Combinatorial Optimization. Springer, 1998.

[SZ03] J. Sivic and A. Zisserman. Video Google: a text retrieval approach to object matching in videos. In Proceedings of Computer Vision, 2003.

[TA+97] T. M. Therneau, E. J. Atkinson, et al. An introduction to recursive partitioning using the RPART routines. Technical report, URL http://www.mayo.edu/hsr/techrpt/61.pdf, 1997.

[VD93] H. Vafaie and K. De-Jong. Robust feature selection algorithms. In Proceedings of TAI, 1993.

[vdSGS10] K. E. A. van de Sande, T. Gevers, and C. G. M. Snoek. Evaluating color descriptors for object and scene recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2010.

[Vik06] T. Viklands. Algorithms for the Weighted Orthogonal Procrustes Problem and other Least Squares Problems. PhD thesis, Umeå University, Umeå, Sweden, 2006.

[vK70] J. von Kries. Influence of adaptation on the effects produced by luminous stimuli. In MacAdam, D.L. (Ed.), Sources of Color Vision, 1970.

[vL07] U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 2007.

[WLR06] M. K. Warmuth, J. Liao, and G. Rätsch. Totally corrective boosting algorithms that maximize the margin. In ICML, 2006.

[WSC99] T. Wark, S. Sridharan, and V. Chandran. Robust speaker verification via fusion of speech and lip modalities. In Proceedings of ICASSP, 1999.

[YEG+06] S. J. Young, G. Evermann, M. J. F. Gales, T. Hain, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. C. Woodland. The HTK Book, version 3.4. Cambridge University Engineering Department, Cambridge, UK, 2006.

[Yeu91] R. W. Yeung. A new outlook on Shannon's information measures. IEEE Trans. on Information Theory, 1991.

[YNO09] T. Yoshida, K. Nakadai, and H. G. Okuno. Automatic speech recognition improved by two-layered audio-visual integration for robot audition. In 9th IEEE-RAS International Conference on Humanoid Robots, 2009.

[YYDS11] D. Ying, Y. Yan, J. Dang, and F. K. Soong. Voice activity detection based on an unsupervised learning framework. IEEE Transactions on ASLP, 2011.

[ZBP09] G. Zhao, M. Barnard, and M. Pietikainen. Lipreading with local spatiotemporal descriptors. IEEE Trans. on Multimedia, 2009.

[ZCCW13] N. A. Zaidi, J. Cerquides, M. J. Carman, and G. I. Webb. Alleviating naive Bayes attribute independence assumption by attribute weighting. Journal of Machine Learning Research, 2013.

[Zin03] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, 2003.

[ZJYH11] P. Zhao, R. Jin, T. Yang, and S. C. Hoi. Online AUC maximization. In Proceedings of ICML, New York, NY, USA, 2011. ACM.

[ZST10] X. Zhao, D. Sun, and K. Toh. A Newton-CG augmented Lagrangian method for semidefinite programming. SIAM J. on Optimization, 2010.

[ZXG06] Y. Zheng, P. Xie, and S. Grant. Robustness and distance discrimination of adaptive near field beamformers. Proceedings of ICASSP, 12:478–488, 2006.

[ZZHP14] Z. Zhou, G. Zhao, X. Hong, and M. Pietikäinen. A review of recent advances in visual speech decoding. Image and Vision Computing, 2014.

[ZZP11] Z. Zhou, G. Zhao, and M. Pietikainen. Towards a practical lipreading system. In Proceedings of CVPR, 2011.

[ZZRH09] J. Zhu, H. Zou, S. Rosset, and T. Hastie. Multi-class AdaBoost. Statistics and Its Interface, 2009.

Curriculum Vitae

1983 Born in Tehran, Iran

1990-2001 Primary and high school in Tehran

2001-2006 Studies in electrical engineering at University of Tehran

2006 BSc in electrical engineering

2006-2009 Master studies in telecommunication at Sharif University of Technology, Tehran

2009-2015 Research assistant and PhD student at the Speech Processing Group, Computer Engineering and Networks Laboratory, ETH Zürich