Discriminatively Structured Graphical Models for Speech Recognition

The Graphical Models Team JHU 2001 Summer Workshop

Jeff A. Bilmes — University of Washington, Seattle Geoff Zweig — IBM Thomas Richardson — University of Washington, Seattle Karim Filali — University of Washington, Seattle Karen Livescu — MIT Peng Xu — Johns Hopkins University Kirk Jackson — DOD Yigal Brandman — Phonetact Inc. Eric Sandness — Speechworks Eva Holtz — Harvard University Jerry Torres — Stanford University Bill Byrne — Johns Hopkins University

UWEE Technical Report Number UWEETR-2001-0006 November 2001

Department of Electrical Engineering
University of Washington
Box 352500
Seattle, Washington 98195-2500
PHN: (206) 543-2150
FAX: (206) 543-3842
URL: http://www.ee.washington.edu

Abstract

In recent years there has been growing interest in discriminative parameter training techniques, resulting from notable improvements in speech recognition performance on tasks ranging in size from digit recognition to Switchboard. Typified by Maximum Mutual Information (MMI) or Minimum Classification Error (MCE) training, these methods assume a fixed statistical modeling structure, and then optimize only the associated numerical parameters (such as means, variances, and transition matrices). Such is also the state of typical structure learning and model selection procedures in statistics, where the goal is to determine the structure (edges and nodes) of a graphical model (and thereby the set of conditional independence statements) that best describes the data. This report describes the process and results from the 2001 Johns Hopkins summer workshop on graphical models. Specifically, in this report we explore the novel and significantly different methodology of discriminative structure learning. Here, the fundamental dependency relationships between random variables in a probabilistic model are learned in a discriminative fashion, and are learned separately and in isolation from the numerical parameters. The resulting independence properties of the model might in fact be wrong with respect to the true model, but are made only for the sake of optimizing classification performance. In order to apply the principles of structural discriminability, we adopt the framework of graphical models, which allows an arbitrary set of random variables and their conditional independence relationships to be modeled at each time frame. We also, in this document, describe and present results using a new graphical modeling toolkit (GMTK). Using GMTK and discriminative structure learning heuristics, the results presented herein indicate that significant gains result from discriminative structural analysis of both conventional MFCC and novel AM-FM features on the Aurora continuous digits task. Lastly, we also present results using GMTK on several other tasks: an IBM audio-video corpus, preliminary results on the SPINE-1 data set using hidden noise variables, hidden articulatory modeling, and interpolated language models represented by graphs within GMTK.

Contents

1 Introduction

2 Overview of Graphical Models (GMs)
  2.0.1 Semantics
  2.0.2 Structure
  2.0.3 Implementation
  2.0.4 Parameterization
  2.1 Efficient Probabilistic Inference

3 Graphical Models for Automatic Speech Recognition

4 Structural Discriminability: Introduction and Motivation

5 Explicit vs. Implicit GM-structures for Speech Recognition
  5.1 HMMs and Graphical Models
  5.2 A more explicit structure for Decoding
  5.3 A more explicit structure for training
  5.4 Rescoring
  5.5 Graphical Models and Stochastic Finite State Automata

6 GMTK: The graphical models toolkit
  6.1 Toolkit Features
    6.1.1 Explicit vs. Implicit Modeling
    6.1.2 The GMTKL Specification Language
    6.1.3 Inference
    6.1.4 Logarithmic Space Computation
    6.1.5 Generalized EM
    6.1.6 Sampling
    6.1.7 Switching Parents
    6.1.8 Discrete Conditional Probability Distributions
    6.1.9 Graphical Continuous Conditional Distributions

7 The EAR Measure and Discriminative Structure Learning Heuristics
  7.1 Basic Set-Up
  7.2 Selecting the optimal number of parents for an acoustic feature X
  7.3 The EAR criterion
  7.4 Class-specific EAR criterion
  7.5 Optimizing the EAR criterion: heuristic search
  7.6 Approximations to the EAR measure
    7.6.1 Scalar approximation 1
    7.6.2 Scalar approximation 2
    7.6.3 Scalar approximation 3
  7.7 Conclusion

8 Visualization of Mutual Information and the EAR measure
  8.1 MI/EAR: Aurora 2.0 MFCCs
  8.2 MI/EAR: IBM A/V Corpus, LDA+MLLT Features
  8.3 MI/EAR: Aurora 2.0 AM/FM Features
  8.4 MI/EAR: SPINE 1.0 Neural Network Features

9 Visualization of Dependency Selection

10 Corpora description and word error rate (WER) results
  10.1 Experimental Results on Aurora 2.0
    10.1.1 Baseline GMTK vs. HTK result
    10.1.2 A simple GMTK Aurora 2.0 noise clustering experiment
    10.1.3 Aurora 2.0 different features
    10.1.4 Mutual Information Measures
    10.1.5 Induced Structures
    10.1.6 Improved Word Error Rate Results
  10.2 Experimental Results on IBM Audio-Visual (AV) Database
    10.2.1 Experiment Framework
    10.2.2 Baseline
    10.2.3 IBM AV Experiments in WS’01
    10.2.4 Experiment Framework
    10.2.5 Matching Baseline
    10.2.6 GMTK Simulating an HMM
    10.2.7 EAR Measure of Audio Features
  10.3 Experimental Results on SPINE-1: Hidden Noise Variables

11 Articulatory Modeling with GMTK
  11.1 Articulatory Models of Speech
  11.2 The Articulatory Feature Set
  11.3 Representing Speech with Articulatory Features
  11.4 Articulatory Graphical Models for Automatic Speech Recognition: Workshop Progress

12 Other miscellaneous workshop accomplishments
  12.1 GMTK Parallel Training Facilities
    12.1.1 Parallel Training: emtrain parallel
    12.1.2 Parallel Viterbi Alignment: Viterbi align parallel
    12.1.3 Example emtrain parallel header file
    12.1.4 Example Viterbi align parallel header file
  12.2 The Mutual Information Toolkit
    12.2.1 Mutual Information and Entropy
    12.2.2 Toolkit Description
    12.2.3 EM for MI estimation
    12.2.4 Conditional entropy
  12.3 Graphical Model Representations of Language Model Mixtures
    12.3.1 Graphical Models for Language Model Mixtures
    12.3.2 Perplexity Experiments
    12.3.3 Perplexity Results
    12.3.4 Conclusions

13 Future Work and Conclusions
  13.1 Articulatory Models
    13.1.1 Additional Structures
    13.1.2 Computational Issues
  13.2 Structural Discriminability
  13.3 GMTK

14 The WS01 GM-ASR Team

1 Introduction

In this report, we describe the results from the Johns Hopkins workshop that took place during a 6-week period in the summer of 2001. During this time, novel research was performed in several different areas. These areas include: 1) graphical models and their application to speech recognition; 2) the design, development, and testing of a new graphical-model based toolkit for this purpose, allowing the rapid exploration of a wide variety of different models for ASR; 3) the exploration of a new method to discriminatively construct the graphical model structure (the nodes and edges of the graph); 4) the application of graphical models and structure learning to novel speech features, in particular standard MFCCs and novel amplitude and frequency modulation features; 5) the initial evaluation of graphical models on three data sets: the Aurora 2.0 noisy speech corpus, an IBM audio-visual corpus, and SPINE-1, the DARPA speech in noisy environments corpus; 6) the beginnings of the use of graphical models to represent anatomically correct articulatory-based speech recognition; 7) the beginnings of the development of a software toolkit for computing mutual information and related quantities on large data sets, which is used both to visualize dependencies in these data sets and to help determine graphical model structures; and 8) the application of graphical models and GMTK to the problem of simple smoothed language models. As a main goal of the workshop, the graphical model methodology developed attempted to optimize the structure of the network so as to improve classification performance (e.g., speech recognition error) rather than simply to better describe or improve the likelihood of the data. This document outlines the theory and describes the methodology and results, both positive and negative, from the 6 weeks of the workshop. Broadly, this report is organized as follows. In Section 2, we first provide a general introduction to graphical models, and briefly introduce the notation we will use throughout this document. Section 3 provides a broad overview and introduction on how graphical models are a promising approach to the speech recognition task. Section 4 provides an introduction and motivation for one of the main goals of the workshop, that of structural discriminability. This section provides a number of intuitive examples of why such an approach should yield improved performance, even when discriminative parameter training methods are not used. Section 5 describes in more detail the various ways in which graphical models can be used for speech recognition systems, namely the explicit vs. the implicit approach, and the various trade-offs between the two. Section 6 provides an overview of GMTK, the graphical models toolkit, software developed for use at the workshop that allows the rapid use of graphical models for language, speech, and other time-series processes. Section 7 develops a specific method to form structurally discriminative networks, and describes and provides a new derivation for the EAR measure, a quantity that is useful for this purpose. Section 8 provides a number of examples of the visualization both of conditional mutual information and of the EAR measure on a number of different corpora and speech feature sets. Section 9 contains the visualization of the result of using the EAR measure to induce discriminative structure on the three corpora that were used in this study.
Section 10 describes in more detail the three corpora that were used, baseline and other results, and improved results using structure determination. Section 11 describes articulatory-based speech recognition (another of the workshop goals) and how GMTK can be used to represent hidden articulatory models for speech recognition. Section 12 describes a number of other workshop accomplishments, including: 1) the GMTK parallel training/testing scripts developed at the workshop (Section 12.1); 2) the beginnings of the development of the mutual-information toolkit (Section 12.2), which can be used to compute general mutual information quantities between discrete and continuous random variables; and 3) the beginnings of the application of graphical models to representing mixtures of different-order language models (Section 12.3). Section 13 concludes and describes future work, and lastly Section 14 describes the WS01 GM team.

2 Overview of Graphical Models (GMs)

Broadly speaking, graphical models (GMs) offer two primary features to those interested in working with statistical systems. On the one hand, a GM may be viewed as an abstract, formal, and visual language that can depict important properties (conditional independence) of natural systems and signals when described by multi-variate random processes. There are mathematically precise rules that describe what a given graph means, rules which associate with a graph a family of probability distributions. Natural signals (those which are not purely random) have significant statistical structure, and this can occur at multiple levels of granularity. Graphs can show anything from causal relations between high-level concepts [84] down to the fine-grained dependencies existing within the neural code [3]. On the other hand, along with GMs come a set of algorithms for efficiently performing probabilistic inference and decision

making. Although probabilistic inference is typically intractable, GM inference procedures and their approximations exploit the inherent structure in a graph in a way that can significantly reduce computational and memory demands, thereby making probabilistic inference as fast as possible. Simply put, graphical models in one way or another describe conditional independence properties amongst collections of random variables. A given GM is identical to a list of conditional independence statements, and a graph represents all distributions for which all these independence statements are true. A random variable X is conditionally independent of a different random variable Y given a third random variable Z under a given probability distribution p(·) if the following relation holds:

p(X = x, Y = y|Z = z) = p(X = x|Z = z)p(Y = y|Z = z)

for all x, y, and z. This is written X⊥⊥Y |Z and it is said that “X is independent of Y given Z under p(·)”. This has the following intuitive interpretation: if one has knowledge of Z, then knowledge of Y does not change one’s knowledge of X and vice versa. Conditional independence is different from unconditional (or marginal) independence. That is, X⊥⊥Y neither implies nor is implied by X⊥⊥Y |Z. Conditional independence is a powerful concept — using conditional independence, a statistical model can undergo enormous simplifications. Moreover, even though conditional independence might not hold for certain signals, making such assumptions might yield vast improvements because of computational, data-sparsity, or task-specific reasons (e.g., consider the hidden Markov model with assumptions which obviously do not hold for speech [6], but which nonetheless empirically appear to be somewhat benign, and at times even helpful as described in Section 4 and [7]). Formal properties of conditional independence, and many other equivalent mathematical formulations, are described in [69, 84]. A GM [69, 25, 105, 84, 60] is a graph G = (V,E) where V is a set of vertices (also called nodes or random variables) and the set of edges E is a subset of the set V × V . The graph describes an entire family of probability distributions over the variables V . A variable can either be scalar- or vector-valued, where in the latter case the vector variable implicitly corresponds to a sub-graphical model over the elements of the vector. The edges E, depending on the graph semantics (see below), specify a set of conditional independence properties over the random variables. The properties specified by the GM are true for all members of its associated family. Four items must be specified when using a graph to describe a particular probability distribution [11]: the GM semantics, structure, implementation, and parameterization. The semantics and the structure of a GM are inherent to the graph itself, while the implementation and parameterization are implicit within the underlying model. Each of these is now described in turn.

2.0.1 Semantics

There are many types of GMs, each one with differing semantics. The set of conditional independence assumptions specified by a particular GM, and therefore the family of probability distributions it represents, will be different depending on the type of GM currently being considered. The semantics specifies a set of rules about what is or is not a valid graph and what set of distributions correspond to a given graph. Various types of GMs include directed models (or Bayesian networks) [84, 60]¹, undirected networks (or Markov random fields) [19], factor graphs [40, 68], chain graphs [69, 90] which are combinations of directed and undirected GMs, causal models [85], dependency networks [52], and many others. When the semantics of a graph change, the family of distributions it represents also changes, but overlap can exist between certain families (i.e., there might be a probability distribution that has a representation by two different types of graphical model). This also means that the same exact graph (i.e., the actual graphical picture) might represent very different families of probabilities depending on the current semantics. Therefore, when using a GM, it is critical to first agree upon the semantics that is currently being used. A Bayesian network (BN) [84, 60, 51] is one type of GM where the graph edges are directed and acyclic. In a BN, edges point from parent to child nodes, and the graph implicitly spells out a factorization that is a simplification of the chain rule of probability, namely:

    p(X_{1:N}) = \prod_i p(X_i | X_{1:i-1}) = \prod_i p(X_i | X_{\pi_i}).

¹ Note that the name “Bayesian network” does not imply Bayesian statistical inference. In fact, both Bayesian and non-Bayesian Bayesian networks may exist.

The first equality is the probabilistic chain rule, and the second equality holds under a particular BN, where π_i designates node i’s parents according to the BN. A probability distribution that is represented by a given BN will factorize with respect to that BN, and this is called the directed factorization property [69]. A Dynamic Bayesian Network (DBN) [29, 46, 43, 109] has exactly the same semantics as a BN, but is structured to have a sequence of clusters of connected vertices, where edges between clusters point only in the direction of increasing time. DBNs are particularly useful to describe time signals such as speech. GMTK, in fact, is a general tool that allows users to experiment with DBNs. Several equivalent schemata exist that formally define a BN’s conditional independence relationships [69, 84, 60]. The idea of d-separation (or directed separation) is perhaps the most widely known: a set of variables A is conditionally independent of a set B given a set C if A is d-separated from B by C. D-separation holds if and only if all paths that connect any node in A and any other node in B are blocked. A path is blocked if it has a node v along the path such that either: 1) the arrows along the path do not converge at v (i.e., they are serial or diverging at v) and v ∈ C, or 2) the arrows along the path do converge at v, and neither v nor any descendant of v is in C. From d-separation, one may “read off” a list of conditional independence statements from a graph. The set of probability distributions for which this list of statements is true is precisely the set of distributions represented by the graph. Graph properties equivalent to d-separation include the directed local Markov property [69] (a variable is conditionally independent of its non-descendants given its parents), factorization according to the graph, and the Bayes-ball procedure [96] (shown in Figure 1).

Figure 1: The Bayes-ball procedure makes it easy to answer questions about a given BN such as “is XA⊥⊥XB|XC ?”, where XA, XB, and XC are disjoint sets of nodes in a graph. The answer is true if and only if an imaginary ball, bouncing from node to node along the edges in the graph and starting at any node in XA, cannot reach any node in XB. The ball must bounce according to the rules depicted in the figure. Only the nodes in XC are shaded. A ball may bounce through a node to another node depending both on its shading and the direction of its edges. The dashed arrows depict whether a ball, when attempting to bounce through a given node, may bounce through that node or if it is blocked and must bounce back to the beginning.
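The d-separation test above can also be carried out mechanically. The following is a minimal Python sketch (not part of the report’s software; the example graph and queries are invented for illustration) that uses a standard equivalent formulation of d-separation: restrict the DAG to the ancestors of the queried sets, moralize it, remove the conditioning nodes, and check whether any undirected path remains between the two sets.

    from itertools import combinations

    def ancestors(dag, nodes):
        # All ancestors of `nodes` (including the nodes themselves) in a DAG
        # given as a dict {child: set_of_parents}.
        result, stack = set(nodes), list(nodes)
        while stack:
            for parent in dag.get(stack.pop(), set()):
                if parent not in result:
                    result.add(parent)
                    stack.append(parent)
        return result

    def d_separated(dag, A, B, C):
        # True iff A is d-separated from B given C.  Classical equivalent test:
        # keep only the ancestral set of A, B, and C, moralize it (marry
        # co-parents, drop directions), delete C, and check that no undirected
        # path connects A to B.
        keep = ancestors(dag, set(A) | set(B) | set(C))
        edges = set()
        for child in keep:
            parents = dag.get(child, set()) & keep
            edges |= {frozenset((child, p)) for p in parents}
            edges |= {frozenset(pair) for pair in combinations(parents, 2)}
        seen, frontier = set(A), [n for n in A if n not in C]
        while frontier:
            n = frontier.pop()
            if n in B:
                return False                  # an active path was found
            for e in edges:
                if n in e:
                    (m,) = e - {n}
                    if m not in seen and m not in C:
                        seen.add(m)
                        frontier.append(m)
        return True

    # Collider example:  A -> X <- Z,  and  X -> Y.
    dag = {'X': {'A', 'Z'}, 'Y': {'X'}}
    print(d_separated(dag, {'A'}, {'Z'}, set()))   # True:  A and Z are d-separated
    print(d_separated(dag, {'A'}, {'Z'}, {'Y'}))   # False: conditioning on a
                                                   # descendant of the collider X
                                                   # opens the path

For the collider A → X ← Z, the sketch reports that A and Z are d-separated marginally but not once the descendant Y is conditioned on, in agreement with rule 2) above.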

Conditional independence properties in undirected graphical models (UGMs) are much simpler than for BNs, and are specified using graph separation. For example, assuming that XA, XB, and XC are disjoint sets of nodes in an UGM, XA⊥⊥XB|XC is true when all paths from any node in XA to any node in XB intersect some node in XC . In a UGM, a distribution may be described as a factorization of potential functions over the cliques in the graph. BNs and UGMs are not the same, meaning that they correspond to different families of probability distributions. Despite the fact that BNs have more complicated semantics, they are useful for a variety of reasons. One is that BNs can have a causal interpretation, where if node A is a parent of B, A can be thought of as a cause of B. A second reason is that the family of distributions associated with BNs is not the same as the family associated with UGMs — there are some useful probability models, for example, that are concisely representable with BNs but which are not representable at all with UGMs (and vice versa). UGMs and BNs do have an overlap, however, and the family of distributions corresponding to this intersection is known as the decomposable models [69]. These models have important properties relating to efficient probabilistic inference and graph type (namely, triangulated graphs and the existence of a junction tree). In general, a lack of an edge between two nodes does not imply that the nodes are independent. The nodes might be able to influence each other indirectly via an indirect path. Moreover, the existence of an edge between two nodes does not imply that the two nodes are necessarily dependent — the two nodes could still be independent for certain parameter values or under certain conditions (e.g., zeros in the parameters, see later sections). A GM guarantees

only that the lack of an edge implies some conditional independence property, determined according to the graph’s semantics. It is therefore best, when discussing a given GM, to refer only to its (conditional) independence rather than its dependence properties. If one must refer to a directed dependence between A and B, it is perhaps better to say simply that there is an edge (directed or otherwise) between A and B. Originally, BNs were designed to represent causation, but more recently, models with semantics [85] that more precisely represent causality have been defined. Other directed graphical models have been designed as well [52], and can be thought of as belonging to the general family of directed graphical models (DGMs).


Figure 2: This figure shows three BNs with different arrow directions over the same random variables, A, B, and C. On the left side, the variables form a three-variable first-order Markov chain A → B → C. In the middle graph, the same conditional independence statement is realized even though one of the arrow directions has been reversed. Both these networks state that A⊥⊥C|B. These two networks do not, however, insist that A and B are dependent. The right network corresponds to the property A⊥⊥C, but it does not imply that A⊥⊥C|B.
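As a quick numerical check of the statement encoded by the left graph of Figure 2, the following sketch (CPT values invented for illustration) constructs a distribution that factorizes according to the chain A → B → C and verifies that A⊥⊥C|B holds while marginal independence of A and C does not follow.

    import numpy as np

    # Markov chain A -> B -> C with invented CPTs:
    # p(a, b, c) = p(a) p(b | a) p(c | b).
    p_a = np.array([0.6, 0.4])
    p_b_given_a = np.array([[0.7, 0.3],   # row a=0: p(B | A=0)
                            [0.2, 0.8]])  # row a=1: p(B | A=1)
    p_c_given_b = np.array([[0.9, 0.1],
                            [0.4, 0.6]])

    joint = np.einsum('a,ab,bc->abc', p_a, p_b_given_a, p_c_given_b)

    # A indep C given B:  p(a, c | b) should equal p(a | b) p(c | b) for every b.
    p_ab = joint.sum(axis=2)                      # p(a, b)
    p_b = p_ab.sum(axis=0)                        # p(b)
    p_ac_given_b = joint / p_b[None, :, None]     # indexed [a, b, c]
    p_a_given_b = p_ab / p_b[None, :]
    factored = p_a_given_b[:, :, None] * p_c_given_b[None, :, :]
    print(np.allclose(p_ac_given_b, factored))    # True: A indep C given B

    # Marginal independence of A and C does NOT follow from the graph.
    p_ac = joint.sum(axis=1)
    print(np.allclose(p_ac, np.outer(p_ac.sum(axis=1), p_ac.sum(axis=0))))  # False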

2.0.2 Structure

A graph’s structure, the set of nodes and edges, determines the set of conditional independence properties for the graph under a given semantics. Note that more than one GM might correspond to exactly the same conditional independence properties even though their structure is entirely different. In this case, multiple very different looking graphs correspond to the same family of probability distributions. In such cases, the various GMs are said to be Markov equivalent [102, 103, 53]. In general, it is not immediately obvious with large complicated graphs how to quickly visually determine if Markov equivalence holds, but algorithms are available which can determine the members of an equivalence class [102, 103, 78, 21]. Nodes in a graphical model can be either observed or hidden. If a variable is observed, it means that its value is known, or that data (or “evidence”) is available for that variable. If a variable is hidden, it currently does not have a known value, and all that is available is the conditional distribution of the hidden variables given the observed variables (if any). Hidden nodes are also called confounding, latent, or unobserved variables. Hidden Markov models are so named because they possess a Markov chain that, in many applications, contains only hidden variables. A node may switch roles, and may sometimes be hidden and at other times be observed. With an HMM, for example, the “hidden” chain might be observed during training (because a phonetic or state-level alignment has been provided) and hidden during recognition (because the hidden variable values are not known for test speech data). When making the query “is A⊥⊥B|C?”, it is implicitly assumed that C is observed. A and B are the nodes being queried, and any other nodes in the network not listed in the query are considered hidden. Also, when a collection of sampled data exists (say as a training set), some of the data samples might have missing values, each of which would correspond to a hidden variable. The EM algorithm [30], for example, can be used to train the parameters of hidden variables. Hidden variables and their edges reflect a belief about the underlying generative process lying behind the phenomenon that is being statistically represented. This is because the data for these hidden variables is either unavailable, is too costly or impossible to obtain, or might even not exist since the hidden variables might only be hypothetical (e.g., specified by hand based on human-acquired knowledge or hypotheses about the underlying domain). Hidden variables can be used to indicate the underlying causes behind an information source. In speech, for example, hidden variables can be used to represent the phonetic or articulatory gestures, or more ambitiously, the originating semantic thought behind a speech waveform.

Certain GMs allow for what are called switching dependencies [45, 79, 11]. In this case, edges in a GM can change as a function of other variables in the network. An important advantage of switching dependencies is a reduction in the number of parameters needed by the model. Switching dependencies are also used in a new graphical model-based toolkit for ASR [9] (see Section 6). A related construct allows GMs to have optimized local probability implementations [42]. It is sometimes the case that certain observed variables are only used as conditional variables. For example, consider the graph B → A which implies a factorization of the joint distribution P(A, B) = P(A|B)P(B). In many cases, it is not necessary to represent the marginal distribution over B. In such cases B is a “conditional-only” variable, meaning it appears always and only to the right of the conditioning bar. In this case, the graph represents P(A|B). This can be useful in a number of cases including classification (or discriminative modeling), where we might only be interested in posterior distributions over the class random variable, or in situations where additional observations, say Z, exist which might be marginally independent of a class variable, say C, but which, conditioned on other observations, say X, are dependent. This can be depicted by the graph C → X ← Z, where it is assumed that the distribution over Z is not represented. Often, the true (or the best) structure for a given task is unknown. This can mean that either some of the edges or nodes (which can be hidden) or both can be unknown. This has motivated research on learning the structure of the model from the data, with the general goal of producing a structure that accurately reflects the important statistical properties that exist in the data set. These approaches can take a Bayesian [51, 53] or frequentist point of view [17, 67, 51]. Structure learning is akin to both statistical model selection [71, 18] and data mining [28]. Several good reviews of structure learning are presented in [17, 67, 51]. Structure learning from a discriminative perspective, thereby producing what could be called discriminative generative models, was proposed in [6]. In this report, in fact, a method termed structural discriminability is given an initial evaluation. In contrast to typical structure learning in graphical models, structural discriminability is an attempt to, within the space of graph structures, find one that performs best at the classification task. This implies that certain dependency statements might be made by the model which are in general not true in the data, and are made only for the sake of classification accuracy. More on this is described in Sections 4 and 7. Figure 3 depicts a topological hierarchy of both the semantics and structure of GMs, and shows where different models fit into place.

[Figure 3 here depicts a tree rooted at “Graphical Models,” branching by semantics (causal models, chain graphs, DGMs, UGMs, dependency networks, and other semantics) and then by structure into specific models such as Bayesian networks, DBNs, HMMs, mixture models, decision trees, LDA, PCA, Kalman filters, factorial and mixed-memory Markov models, segment models, FSTs, simple MRFs, Gibbs/Boltzmann distributions, and BMMs.]
Figure 3: A topology of graphical model semantics and structure
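As a small illustration of the switching-dependency idea mentioned above, the hypothetical sketch below (not GMTK code; the variables and tables are invented) lets a switching variable S select which of two candidate parents, A or B, the child X actually depends on.

    import numpy as np

    rng = np.random.default_rng(0)

    # Candidate conditional tables for a binary child X; which one is used is
    # selected by the switching variable S.  All numbers are invented.
    p_x_given_a = np.array([[0.9, 0.1],     # p(X | A=0)
                            [0.3, 0.7]])    # p(X | A=1)
    p_x_given_b = np.array([[0.5, 0.5],     # p(X | B=0)
                            [0.05, 0.95]])  # p(X | B=1)

    def sample_x(s, a, b):
        # If S=0, X depends only on A; if S=1, X depends only on B.
        table = p_x_given_a[a] if s == 0 else p_x_given_b[b]
        return rng.choice(2, p=table)

    # The switching implementation needs |A|*|X| + |B|*|X| parameters instead
    # of a full table over all parents (S, A, B) jointly.
    print(sample_x(s=0, a=1, b=0))   # drawn from p(X | A=1)
    print(sample_x(s=1, a=1, b=0))   # drawn from p(X | B=0)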

2.0.3 Implementation

When two nodes are connected by a dependency edge, the local conditional probability representation of that dependency may be called its implementation. A dependence of a variable X on Y can occur in a number of ways depending on whether the variables are discrete or continuous. For example, one might use discrete conditional probability

tables (CPTs), compressed tables [42], decision trees, or even a deterministic function (in which case GMs may represent data-flow [1] graphs, or may represent channel coding algorithms [40]). GMTK, described in Section 6, makes heavy use of deterministic dependencies. A node in a GM can also depict a constant input parameter since random variables can themselves be constants. Alternatively, the dependence might be linear regression models, mixtures thereof, or non-linear regression (such as a multi-layered perceptron [14], or a STAR [100] or MARS [41] model). In general, different edges in a graph will have different implementations. In UGMs, conditional distributions are not represented explicitly. Rather, a joint distribution over all the nodes in the graph is specified with a product of what are called “potential” functions over cliques in the graph. In general the clique potentials could be anything, although particular types are commonly used (such as Gibbs or Boltzmann distributions [54]). Many such models fall under what are known as exponential models [34]. The implementation of a dependency in an UGM is implicitly specified via these functions in that they specify the way in which one variable can influence the resulting probabilities for other random variable values.
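To make the notion of an implementation concrete, the sketch below (hypothetical; not drawn from GMTK) gives three different local implementations of a single dependency of X on Y: a discrete CPT, a deterministic function, and a linear-Gaussian regression.

    import random

    random.seed(0)

    # (1) Discrete CPT: p(X | Y) stored as an explicit table (each row sums to one).
    cpt = {0: [0.8, 0.2],    # p(X | Y=0)
           1: [0.1, 0.9]}    # p(X | Y=1)

    def sample_x_cpt(y):
        return 0 if random.random() < cpt[y][0] else 1

    # (2) Deterministic implementation: X is a fixed function of Y
    #     (equivalently, a CPT containing only zeros and ones).
    def sample_x_deterministic(y):
        return 1 - y

    # (3) Linear-Gaussian regression: continuous X with mean a*y + b and
    #     standard deviation sigma.
    def sample_x_linear_gaussian(y, a=2.0, b=-1.0, sigma=0.5):
        return a * y + b + random.gauss(0.0, sigma)

    for y in (0, 1):
        print(y, sample_x_cpt(y), sample_x_deterministic(y),
              round(sample_x_linear_gaussian(y), 3))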

2.0.4 Parameterization

The parameterization of a model corresponds to the parameter values of a particular implementation in a particular structure. For example, with linear regression, the parameters are simply the regression coefficients; for a discrete probability table the parameters are the table entries. Since parameters of distributions, being themselves potentially random, can be seen as nodes, Bayesian approaches may easily be represented with GMs [51]. Many algorithms exist for training the parameters of a graphical model. These include maximum likelihood [34] such as the EM algorithm [30], discriminative or risk minimization approaches [101], gradient descent [14], sampling approaches [73], and general non-linear optimization [38]. The choice of algorithm depends both on the structure and implementation of the GM. For example, if there are no hidden variables, an EM approach is not required. Certain structural properties of the GM might render certain training procedures less crucial to the performance of the model [11].
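For a discrete-table implementation with fully observed data, maximum likelihood estimation of the parameters (the table entries) reduces to normalized counting, as in the toy sketch below (the data are invented for illustration).

    from collections import Counter

    # Fully observed (y, x) samples.  With no hidden variables, maximum
    # likelihood estimation of the table p(X | Y) is normalized counting.
    data = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 0), (0, 0), (1, 1)]

    pair_counts = Counter(data)
    y_counts = Counter(y for y, _ in data)

    cpt = {(y, x): pair_counts[(y, x)] / y_counts[y] for (y, x) in pair_counts}

    for (y, x), p in sorted(cpt.items()):
        print(f"p(X={x} | Y={y}) = {p:.3f}")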

2.1 Efficient Probabilistic Inference

A key application of any statistical model is to compute the probability of one subset of random variables given values for some other subset, a procedure known as probabilistic inference. Inference is essential both to make predictions based on the model and to learn the model parameters using, for example, the EM algorithm [30, 77]. One of the critical advantages of GMs is that they offer procedures for making exact inference as efficient as possible, much more so than if conditional independence is ignored or is used unwisely. And if the resulting savings are not enough, there are GM-inspired approximate inference algorithms that are still more efficient.


Figure 4: The graph’s independence properties are used to move sums inside of factors.

Exact inference can in general be quite computationally costly. For example, suppose there is a joint distribution over 6 variables p(a, b, c, d, e, f) and the goal is to compute p(a|f). This requires both p(a, f) and p(f), so the variables b, c, d, e must be “marginalized”, or integrated away, to form p(a, f). The naive way of performing this computation would entail the following sum:

    p(a, f) = \sum_{b,c,d,e} p(a, b, c, d, e, f)

Supposing that each variable has K possible values, this computation requires O(K^6) operations, a quantity which is exponential in the number of variables in the joint distribution. If, on the other hand, it was possible to factor the joint distribution into factors containing fewer variables, it would be possible to reduce computation significantly. For example, under the graph in Figure 4, the above distribution may be factored as follows:

p(a, b, c, d, e, f) = p(a|b)p(b|c)p(c|d, e)p(d|e, f)p(e|f)p(f)

so that the sum

    p(a, f) = p(f) \sum_b p(a|b) \sum_c p(b|c) \sum_d \sum_e p(c|d, e)\, p(d|e, f)\, p(e|f)

requires only O(K^3) computation. Inference in GMs involves formally defined manipulations of graph data structures and then operations on those data structures. These operations provably correspond to valid operations on probability equations, and they reduce computation essentially by moving sums, as in the above, as far to the right as possible in these equations. The graph operations and data structures needed for inference are typically described in their own light, without needing to refer back to the original probability equations. One well-known inference procedure, for example, is the junction tree (JT) algorithm [84, 60]. In fact, the commonly used forward-backward algorithm [87] for hidden Markov models is just a special case of the junction tree algorithm [98], which is, in turn, a special case of the generalized distributive law [2]. The JT algorithm requires that the original graph be converted into a junction tree, a tree of cliques with each clique containing nodes from the original graph. A junction tree possesses the running intersection property, where the intersection between any two cliques in the tree is contained in all cliques on the (necessarily) unique path between those two cliques. The junction tree algorithm itself can be viewed as a series of messages passed between the connected cliques of the junction tree. These messages ensure that neighboring cliques are locally consistent (i.e., that the neighboring cliques have identical marginal distributions on those variables that they have in common). If the messages are passed in an order that obeys a particular protocol, called the message passing protocol, then because of the properties of the junction tree, local consistency guarantees global consistency, meaning that the marginal distributions on all common variables in all cliques in the graph are identical, and this in turn guarantees that inference is correct. Because only local operations are required in the procedure, inference can thus be much faster than if the equations were manipulated naively. For the junction tree algorithm to be valid, however, a decomposable model must first be formed from the original graph. Junction trees exist only for decomposable models, and a message passing algorithm can provably be shown to yield correct probabilistic inference only in that case. It is often the case, however, that a given DGM or UGM is not decomposable. In such a case it is necessary to form a decomposable model from the general GM (directed or otherwise), and in doing so make fewer conditional independence assumptions. Inference is then solved for this larger family of models. Solving inference for a larger family of course means that inference has also been solved for the smaller family corresponding to the original (possibly) non-decomposable model. Two operations are needed to transform a general DGM (Bayesian network) into a decomposable model: moralization and triangulation. Moralization joins the unconnected parents of all nodes and then drops all edge directions. This procedure is valid because more edges mean fewer conditional independence assumptions, or a larger family of probability distributions. Moralization is required to ensure that the resulting UGM does not violate any of the conditional independence assumptions made by the original DGM.
In other words, after moralizing, it is assured that the UGM will make no independence assumption that is not made by the original DGM. If such an invalid independence assumption were made, then the inference algorithm could easily be incorrect. After moralization, or if starting from a UGM to begin with, triangulation is necessary to produce a decomposable model. The set of all triangulated graphs corresponds exactly to the set of decomposable models. The triangulation operation [84, 69] adds edges until all cycles in the graph with non-consecutive nodes (along the cycle) have a connected pair. Triangulation is valid because more edges enlarge the set of distributions represented by the graph. Triangulation is necessary because only for triangulated (or decomposable) graphs do junction trees exist. A good survey of triangulation techniques is given in [66]. Finally, a junction tree is formed from the triangulated graph by first forming all maximum cliques in the graph, next connecting all of the cliques together into a “super” or “hyper” graph, and finally finding a maximum spanning tree [24] amongst that graph of maximum cliques. In this case, the weight of an edge between two cliques is set to the number of variables in the intersection of the two cliques. Note that there are several ways of forming a junction tree from a graph; the method described above is only one of them. For a discrete-node-only network, probabilistic inference using the junction tree algorithm has complexity O(\sum_{c \in C} \prod_{v \in c} |v|), where C is the set of cliques in the junction tree, c is the set of variables contained within a clique, and |v| is the number of possible values of variable v. The algorithm is exponential in the clique sizes, a quantity important to minimize during triangulation. There are many ways to triangulate [66], and unfortunately the operation of finding the optimal triangulation (the one with the smallest cliques) is itself NP-hard. For an HMM, the clique sizes are fixed at two, so the complexity is N^2 where N is the number of HMM states, and there are

T cliques, leading to the well-known O(TN^2) complexity for HMMs. Further information on the junction tree and related algorithms can be found in [60, 84, 25, 61]. Exact inference, such as the above, is useful only for moderately complex networks since inference is NP-hard in general [23]. Approximate inference procedures can, however, be used when exact inference is not feasible. There are several approximation methods including variational techniques [94, 58, 62], Monte Carlo sampling methods [73], and loopy belief propagation [104]. Even approximate inference can be NP-hard, however [27]. Therefore, it is always important to use a minimal model, one with the least possible complexity that still accurately represents the important aspects of a task. The complete study of graphical models takes much time and effort, and this brief survey is nowhere close to complete. For further and more complete information, see the references mentioned above.
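To make the savings in the six-variable example above concrete, the following sketch (with random, invented CPTs) computes p(a, f) both by naive summation over b, c, d, e and by pushing the sums inside the factorization of Figure 4, and verifies that the two computations agree.

    import numpy as np

    rng = np.random.default_rng(0)
    K = 4

    def cpt(*shape):
        # Random conditional table, normalized over its first axis.
        t = rng.random(shape)
        return t / t.sum(axis=0, keepdims=True)

    # Factors of p(a,b,c,d,e,f) = p(a|b) p(b|c) p(c|d,e) p(d|e,f) p(e|f) p(f).
    p_a_b  = cpt(K, K)       # indexed [a, b]
    p_b_c  = cpt(K, K)       # [b, c]
    p_c_de = cpt(K, K, K)    # [c, d, e]
    p_d_ef = cpt(K, K, K)    # [d, e, f]
    p_e_f  = cpt(K, K)       # [e, f]
    p_f    = cpt(K)          # [f]

    # Naive marginalization: build the full K^6 joint, then sum out b, c, d, e.
    joint = np.einsum('ab,bc,cde,def,ef,f->abcdef',
                      p_a_b, p_b_c, p_c_de, p_d_ef, p_e_f, p_f)
    p_af_naive = joint.sum(axis=(1, 2, 3, 4))

    # Pushing the sums inside: only small intermediate factors are ever formed.
    m_e = np.einsum('cde,def,ef->cdf', p_c_de, p_d_ef, p_e_f)  # sum over e
    m_d = m_e.sum(axis=1)                                      # sum over d -> [c, f]
    m_c = np.einsum('bc,cf->bf', p_b_c, m_d)                   # sum over c
    m_b = np.einsum('ab,bf->af', p_a_b, m_c)                   # sum over b
    p_af_fast = m_b * p_f[None, :]

    print(np.allclose(p_af_naive, p_af_fast))                  # True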

3 Graphical Models for Automatic Speech Recognition

The underlying statistical model most commonly used for speech recognition is the hidden Markov model (HMM). The HMM, however, is only one example in the vast space of statistical models encompassed by graphical models. In fact, a wide variety of algorithms often used in state-of-the-art ASR systems can easily be described using GMs, and these include algorithms in each of the three categories: acoustic, pronunciation, and language modeling. While many of these ASR approaches were developed without GMs in mind, each turns out to have a surprisingly simple and elucidating network structure. Given an understanding of GMs, it is in many cases easier to understand the technique by looking first at the network than at the original algorithmic description. While it is beyond the scope of this document to describe all of the models that are commonly used for speech recognition and language processing, many of them are described in detail in [7]. Additional graphical models that explicitly account for many of the aspects of a speech recognition system are described in Sections 5 and 4.

4 Structural Discriminability: Introduction and Motivation

Discriminative parameter learning techniques are becoming an important part of automatic speech recognition technology, as indicated by recent advances in large vocabulary tasks such as Switchboard [107], which now complement well known improvements in small vocabulary tasks like digit recognition [82]. These techniques are exemplified by the maximum mutual information learning technique [4] or the minimum classification error (MCE) method [63], which specify procedures for discriminatively optimizing HMM transition and observation probabilities. These methodologies adopt a fixed pre-specified model structure and optimize only the numeric parameters. From a graphical-model point of view, the model structure is fixed while the parameters of the model may vary. In statistics, the methods of discriminant analysis generalize the discriminative methods used in ASR, as the goal in this case is to design the parameters of a model that is able to distinguish as well as possible between a set of objects (visual, auditory, etc.) as represented by a set of numerical feature values [76]. Moreover, methods of statistical model selection are also commonly used [71, 18] in an attempt to discover the structure and/or the parameters of a model that allow it to best describe a given set of training data. Typically, such approaches include an inverse cost function (such as the model’s likelihood) which is maximized. When the cost function is minimized (likelihood is maximized), it is assumed that the model is at the point where it best represents the given training data (and indirectly the true distribution). This cost function is typically further offset by a complexity penalty term so as to ensure that the model that is ultimately selected is not one that merely describes the training data excessively well without having an ability to generalize. Such approaches are seen in the guise of regularization theory [86], minimum description length [92], the Bayesian information criterion [95], and/or structural risk minimization [101]. The technique of structural discriminability [10, 6, 11] stands in significant contrast to the methods above. In this case, the goal is to learn discriminatively the actual edge structure between random variables in graphical models that represent class-conditional probabilistic models. The structure is selected to optimize not likelihood, but rather a cost function that indicates how well the class-conditional models do in a classification task. Also, a goal is, within the set of models that do equally well, to choose the one that is as simple as possible. Therefore, the edges in the structurally discriminative graph will almost certainly encode conditional independence statements that are wrong with respect to the data. In other words, the resulting conditional independence statements made by the model might not be true, and there is nothing that attempts to ensure that the conditional independence statements are indeed true. Rather, the conditional independence statements will be made only for the sake of optimizing a cost function representing

discriminability (such as the number of classification errors that are made, or the KL-divergence between the true and the model posterior probability). Structural discriminability is orthogonal to and complementary with the methods used for fixed-structure parameter optimization. This means that once a discriminative structure is determined, it can be possible to further optimize the model using discriminative parameter training methods to yield further gains. On the other hand, it might be possible that structural discriminability obviates discriminative parameter training – we will explore this idea further below. At the basis of all pattern classification problems is a set of K classes C_1, ..., C_K, and a representation of each of these classes in terms of a set of T random variables X_1, ..., X_T (denoted X_{1:T} for now). For each class, one is interested in obtaining discriminant functions, functions the maximization of which should yield the correct class. For example, if g_k(X_{1:T}) is the discriminant function for class k, then one would perform the operation:

    k^* = \argmax_k g_k(X_{1:T})

If X_{1:T} indeed represented an object of class k^*, then an error would not occur. If X_{1:T} instead represented an object of a different class, say k', then an error condition occurs. The ultimate goal therefore is to find functions g_k such that the errors are minimal over a training data set [33, 76, 101]. It is possible (but not necessary) to place the above goal and procedure into a purely probabilistic setting, where the discriminant functions are either probabilities or functions of probabilistic quantities. Given the true posterior probability of the class given the features, P(C_k|X_{1:T}), one can clearly see that the error of choosing class k given a feature set X_{1:T} will be 1 − P(C_k|X_{1:T}). Therefore, to minimize this error, the class should be chosen so as to maximize the posterior probability, leading to the following well-known Bayes decision rule [33]:

    k^* = \argmax_k P(C_k|X_{1:T})

This is a form of discrimination using the posterior probability, and if a model of P(C_k|X_{1:T}) is formed, say \hat{P}(C_k|X_{1:T}), then the following decision function

    k^* = \argmax_k \hat{P}(C_k|X_{1:T})

is said to use a discriminative model \hat{P}(C_k|X_{1:T}). This model can take many forms, such as logistic regression, a neural network [14, 76], a support vector machine [101], and so on. In each case, the functional form of the probabilistic discriminative model is chosen, and is then optimized (trained) in some way. Typically, the structure and form of such a function is fixed, and the only way that the structure can change is (potentially) by certain parameter coefficients taking a zero value, thereby rendering the structure controlled by these coefficients essentially non-existent. During training, however, there is typically no guarantee that such coefficients can or will be zero. Nor is there a guarantee that zero coefficients in the model would correspond to conditional independence statements that can be expressed by a graphical model of a particular semantics. It is often useful to have as many non-harmful conditional independence statements made as possible because the resulting model is much simpler. For completeness, we note that the above probabilistic decision function can be generalized further to take into account a loss function which measures the potential difference in severity between making different kinds of mistakes. For example, if the true class is k_1 and class k_4 is chosen, the severity of such a mistake might be much less than if k_2 were chosen. This is often encoded by producing a loss function L(k'|k), which is the loss (penalty) of choosing class k' when the true class is k. This is used to produce a risk function R(k'|X_{1:T}):

    R(k'|X_{1:T}) = \sum_k L(k'|k) P(C_k|X_{1:T})

which is the expected loss of choosing class k'. The goal is to choose the class that minimizes the overall risk, as in:

    k^* = \argmin_{k'} R(k'|X_{1:T})

This decision rule is provably optimal (minimizing the expected loss) for any given loss function [33, 34]. Moreover, for the 0/1 loss (a loss function that is 1 for all k ≠ k' and zero for the correct class k = k'), it is easy to see that this decision rule degenerates into the posterior maximization decision procedure above. It is this 0/1-loss case that we examine further below.
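The following small numerical sketch (the posterior and loss values are invented for the example) computes the risk R(k'|X_{1:T}) for a general loss matrix and confirms that, under the 0/1 loss, the minimum-risk decision coincides with the maximum-posterior decision, whereas an asymmetric loss can change the decision.

    import numpy as np

    # Posterior P(C_k | X_{1:T}) for K = 3 classes at one fixed observation
    # (values invented for illustration).
    posterior = np.array([0.2, 0.5, 0.3])
    K = len(posterior)

    def decide(loss):
        # risk[k'] = sum_k loss[k', k] * posterior[k];  pick argmin over k'.
        risk = loss @ posterior
        return int(np.argmin(risk))

    # 0/1 loss: L(k'|k) = 0 if k' == k, else 1.
    zero_one = 1.0 - np.eye(K)
    print(decide(zero_one), int(np.argmax(posterior)))   # same decision (class 1)

    # An asymmetric loss can change the decision even though the posterior is
    # unchanged: here, choosing class 1 when the true class is 0 is made costly.
    asymmetric = np.array([[0.0, 1.0, 1.0],
                           [5.0, 0.0, 1.0],
                           [1.0, 1.0, 0.0]])
    print(decide(asymmetric))                            # class 2 is now chosen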

It is often useful in many applications (such as speech recognition) to use Bayes rule to decompose the posterior probability P(C_k|X_{1:T}) within the decision rule, thus:

    k^* = \argmax_k P(C_k|X_{1:T}) = \argmax_k P(X_{1:T}|C_k) P(C_k) / P(X_{1:T})

On the rightmost side, it can be seen that the maximization over k is not affected by P(X_{1:T}), so an equivalent decision rule is therefore:

    k^* = \argmax_k P(X_{1:T}|C_k) P(C_k)        (1)

This decision rule involves two factors: 1) the prior probability of the class, P(C_k), and 2) the likelihood of the data given the class, P(X_{1:T}|C_k). This latter factor, when estimated from data and denoted \hat{P}(X_{1:T}|C_k), is often called a generative model. It is a generative model because it is said to be able to generate likely instances of a given class, as represented by the features. For example, if the generative model \hat{P}(X_{1:T}|C_k) were an accurate approximation of the true likelihood function P(X_{1:T}|C_k), and if a sample of the distribution were formed as x_{1:T} ∼ \hat{P}(X_{1:T}|C_k), then that sample would (with high probability) be a valid instance of the class C_k, at least as well as can be represented by the feature vector X_{1:T}. Moreover, if \hat{P}(X_{1:T}|C_k) is an accurate representation of the true likelihood function, and \hat{P}(C_k) is an accurate representation of the class prior, then the decision rule:

    k^* = \argmax_k \hat{P}(X_{1:T}|C_k) \hat{P}(C_k)

would lead to an accurate decision rule. Therefore, a goal that is often pursued is to find likelihood and prior approximations that are as accurate as possible. This leads naturally to standard maximum likelihood training procedures: given a data set consisting of N independent samples, D = {(x^1_{1:T}, k^1), ..., (x^N_{1:T}, k^N)}, the goal is to find the approximation of the likelihood function that best explains the data, or:

    \hat{P}^* = \argmax_{\hat{P}} \prod_{i=1}^{N} \hat{P}(x^i_{1:T}|k^i)

where the optimization is done over some set (possibly infinite) of likelihood function approximations that are being considered. Note that because the samples are assumed to be independent of each other, the optimization can be broken into K separate optimization procedures, one for each class k = 1, ..., K:

    \hat{P}^*(\cdot|k) = \argmax_{\hat{P}(\cdot|k)} \prod_{i : k^i = k} \hat{P}(x^i_{1:T}|k)

where \hat{P}(\cdot|k) indicates that the optimization is done only over those class-conditional likelihood approximation functions that could be considered as a generative model for class k. It can be proven that, as the size of the (training) data set D grows to infinity, and if the true class-conditional likelihood function P(X_{1:T}|k) lies within the set of models \hat{P}(\cdot|k) being optimized over, then the maximum likelihood procedure will converge to the true answer. This is the notion of asymptotic consistency and is given formal treatment in many texts such as [26]. It is moreover the case that the maximum likelihood procedure minimizes the KL-divergence between the true likelihood function and the model, as in:

    \argmin_{\hat{P}(\cdot|k)} D(P(X_{1:T}|k) \,\|\, \hat{P}(X_{1:T}|k))

The maximum likelihood training procedure is therefore a well-founded technique to obtain an approximation of the likelihood function. Returning to Equation 1, we can see that the maximum likelihood procedure, even in the case of asymptotic consistency and the like, is only a sufficient condition for finding an optimal discriminant function, but not a necessary condition. In fact, we would be happy with any of a number of functions f living in the family F defined as follows:

    \mathcal{F} = \{ f : \argmax_k f(X_{1:T}, C_k) P(C_k) = \argmax_k P(X_{1:T}|C_k) P(C_k), \;\forall X_{1:T} \}
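As a toy illustration of the family F (a sketch with invented densities, not an example from the report): multiplying every class-conditional likelihood by the same positive function of the features yields a different, generally unnormalized, member of F that produces exactly the same decisions as the true likelihoods.

    import numpy as np

    def gauss(x, mu, sigma):
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    # "True" class-conditional likelihoods p(x | C_k) for two classes, plus priors
    # (all values invented for illustration).
    prior = np.array([0.4, 0.6])

    def p_x_given_c(x):
        return np.stack([gauss(x, -1.0, 1.0), gauss(x, 2.0, 1.5)])

    # Another member of F: scale every class-conditional likelihood by the same
    # positive function g(x).  The scaled functions need not be densities at all,
    # yet the resulting decisions are identical for every x.
    def g(x):
        return 1.0 + x ** 2

    x_grid = np.linspace(-6.0, 8.0, 1001)
    scores_true = p_x_given_c(x_grid) * prior[:, None]
    scores_alt = (p_x_given_c(x_grid) * g(x_grid)) * prior[:, None]

    print(np.array_equal(scores_true.argmax(axis=0), scores_alt.argmax(axis=0)))  # True

This is a one-dimensional analogue of the point made by Figure 5 below: members of F may look very different from the true class-conditional densities away from the decision boundaries.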


Figure 5: A 2-dimensional spatial example of discriminability. The left figure depicts the true class-conditional distributions, in this case an example of a four-class problem. The region where each class has the highest probability is indicated by a different color (or shade). Also, contour plots are given of the different class-conditional densities. For example, region and class 4 shows what could be a mixture of four non-convex component density functions. The true distributions lead to the decision boundaries that separate each of the regions. On the right, class-conditional densities are shown that might lead to the exact same decision boundaries as shown on the left. The densities in this case, however, are much simpler than on the left – the right densities are formed without regard to the complexities of the left densities at points other than the decision boundaries. A goal of forming a discriminant function should be not to model any complexity of the likelihood functions that does not facilitate discrimination.

This means that any f ∈ F, when multiplied by the prior probability and then used as a discriminant function, will be just as effective for classification as the true likelihood function. Clearly, P(X_{1:T}|C_k) lives within F, but the crucial point is that there might be many others, some of which are much simpler – simple in this case could mean computationally easy to evaluate, having few parameters, having a particularly easy to understand functional form, etc. A goal, then, should be to find the f ∈ F that is as simple as possible. Note that some of the f ∈ F will be valid distributions themselves (i.e., are non-negative and integrate to unity), and others will be general functions. In this work, we are interested primarily in those f ∈ F which indeed are valid densities. A simple 2-D argument will further exemplify the above. The left of Figure 5 shows contour plots for four different class-conditional likelihood functions in a 2-D space. Each of the regions at which one of the likelihood functions is maximum is indicated by a color, as well as by the decision regions drawn in the plot. As can be seen, each class-conditional density consists of a complex multi-modal distribution that (from looking at the figure) possibly results from a mixture of non-convex component functions. These distributions result in the decision boundaries and regions as shown. Any x-y point falling directly on one of the boundaries could be in either one of the two abutting classes. When the goal is classification, it is not necessary to represent the complexity of the class-conditional distributions at regions of the space other than at the decision boundaries. If different generative class-conditional density functions were discovered having the same exact boundaries, but a much simpler form away from the boundaries, the classification error would be the same but the resulting class-conditional likelihood functions could be much simpler. This is indicated on the right of Figure 5, where uni-modal distributions have replaced the multi-modal ones on the left, but where the decision boundaries have not changed. Note that such class-conditional functions, while being “generative” in that they would generate something (i.e., are valid densities), would not necessarily generate accurate samples of the true class. For example, a large X is indicated on the left of Figure 5, indicating values of a feature vector that are likely to have been generated from the 4th class-conditional likelihood function. It is likely that such an X is an accurate representation of an object of type 4. The same relative position is marked on the right of the figure, also with an X. In this case, it is not likely to have been generated by the generative model for class 4 since it is not located at a point of high probability. Moreover, the large Ys indicate a point that is likely to be generated by the models on the right, but not by the true generative models on the left. These generative models on the right, therefore, do not necessarily generate objects typical of the classes


Figure 6: Pictorial example of distinctive features. This figure shows two types of key-like objects: on the left are objects consisting of an annulus and a protruding horizontal bar, and on the right are objects consisting of a diagonal bar and a protruding horizontal bar. When designing an algorithm that makes a distinction between these two types of objects, the horizontal-bar feature would not be beneficial since it is common to both object types. It is sufficient to represent the objects only by their features that are unique relative to each other, as shown on the bottom. Models of the unique attributes of objects could be simpler since they contain only the minimal set of features necessary to discriminate.

Instead, the generative models are minimal and represent only what is needed for discrimination. Therefore, we call these densities discriminative generative models [11]. Another visual geometric example can further begin to motivate discriminative generative models, and ultimately structural discriminability. Consider Figure 6, which shows (also in 2-D) instances of two different types of key-like objects. The top of the figure shows instances of objects of class A, which are annuli with a protruding horizontal bar to the right. Objects of class B are diagonal bars with protruding horizontal bars to the right. As can be seen, objects of class A and class B have the horizontal bars in common, and their distinctive features are either the annuli or the diagonal bars. Therefore, in discriminating between objects of different types, one would expect that the horizontal bars of the two objects would not be very useful. For discrimination, any model of the two object types might not need to expend resources representing the horizontal bars, and should instead concentrate on the annuli and diagonal bars respectively. This latter case is shown in the bottom of the figure, where only those discriminative aspects of the objects are represented. In this case, less about the objects needs to be “remembered,” thereby leading to a simpler criterion for deciding between the two classes. In general, when the task is pattern classification, it should only be necessary for a model to represent those features of the objects which are crucial for discrimination. Features which are common to both objects could potentially be ignored entirely without any penalty in classification performance.

(Figure 7 panels: top, class-conditional graphical models P(V1,V2,V3,V4|C = 1) and P(V1,V2,V3,V4|C = 2) over the variables V1 through V4; bottom, the corresponding reduced models in which the common edges have been removed.)

Figure 7: Structural Discriminability.

We finally come to the idea of structural discriminability. The essential idea is to find generative class-conditional likelihood functions that are optimal for classification performance and simplicity. We optimize these likelihood models over the space of conditional independence properties as encoded by a graphical model. This means that the goal is to find minimal edge sets such that discrimination is preserved. Minimal edge sets are desirable since the fewer edges in a graphical model, the more conditional independence statements are made, which can lead to fewer parameters, cheaper probabilistic inference, greater generality for limited amounts of training data, and to concentrating modeling power only on what is important. In other words, the aim of structural discriminability is to identify a minimal set of dependencies (i.e., edges in a graph) in class-conditional distributions P (X1:T |Ck) such that there is little or no degradation in classification accuracy relative to the decision rule given in Equation 1. A simple motivating example is given in Figure 7. The top of the figure shows the undirected graphical model for two generative 4-dimensional class-conditional likelihood functions: on the left P (V1,V2,V3,V4|C = 1) for class 1, and on the right P (V1,V2,V3,V4|C = 2) for class 2. The edges shown are those of the generative models, meaning that these models depict the truth. Note that many of the edges are common to the two models. For example, the edge between V1 and V4 appears both in the model for C = 1 and the model for C = 2. It might be the case that these common edges could be removed from both models since they are a common trait of both classes. If all common edges are removed from both models, the result is as shown on the bottom of the figure. Here only the unique edges for the two models are kept: the edge between V1 and V3 on the left, and the edge between V3 and V2 on the right. These edges represent unique properties of the objects, at least as far as the conditional independence statements the graphs encode are concerned. It is crucial to realize that the example in Figure 7 is only an illustration. It does not imply that all common edges in class-conditional graphical models should be removed: there might be common edges which turn out to be quite helpful for discrimination. Moreover, there might be information, irrespective of the edges, which is useful for discrimination. Take, for example, the means of two class-conditional Gaussian densities with equal spherical covariance matrices. The only thing producing a distinction between the two classes is the means. Therefore, no edge-structure adjustment will help discrimination. On the other hand, there are cases where structural discriminability obviates discriminative parameter training.

(Figure 8 rows, each showing a model for C = 1 on the left and C = 2 on the right over the variables V1, V2, V3: object generation P; common dependencies Pc; discriminative dependencies Pd.)

Figure 8: It is possible for structural discriminability to render maximum-likelihood training ineffectual. Moreover, structural “confusability” (a network with anti-discriminative edges) can render discriminative parameter training ineffectual.

Specifically, there are cases where an inherently discriminative structure can render discriminative parameter training no more beneficial than regular maximum-likelihood training. Moreover, the wrong “anti-discriminative” model structure can render even discriminative training ineffectual at producing appropriate discriminative models. For a graphical example, consider Figure 8, and let us assume that there is no discrimination available in the individual variables (e.g., for Gaussians, the means of the random variables are all the same, so it is only the covariance structure which can help to produce more discriminative models). The top box shows truth, meaning the graphs that correspond to the true generative models (tri-variate distributions in this case) for class 1 and class 2. The middle box shows the edges that are common to the two true models. The bottom box shows the edges that are distinct between the two true models. If one insists on using the structures given in the middle box, important discriminative information about the two classes might be impossible to represent. Therefore, even discriminative parameter training will be incapable of producing good results. On the other hand, the two bottom graphs show the distinct edges of the two models. Using these class-conditional models, even simple maximum-likelihood training would be able to produce models that are capable of discriminating between objects of the two classes. Discriminative training, in this case, therefore might not have any benefit over maximum-likelihood training. Further expanding on this example, consider Figure 9, which shows six 3-dimensional zero-mean Gaussian densities corresponding to the graphs in Figure 8. Each graph shows 1500 samples from the corresponding Gaussian along with the marginal planar distributions for each of the variable sets V1V2, V2V3, and V1V3 (the margins, rather than being shown at the actual zero locations for the corresponding axes, are shown projected onto the axes planes of the 3-dimensional plots). The covariance matrices for the six Gaussians are, respectively (moving left to right across each row of the figure and then down the rows), as follows:

\begin{pmatrix} 9 & 4 & 2 \\ 4 & 2 & 1 \\ 2 & 1 & 1 \end{pmatrix} \qquad \begin{pmatrix} 11 & 3 & 1 \\ 3 & 1 & 0 \\ 1 & 0 & 1 \end{pmatrix}

\begin{pmatrix} 5 & 2 & 0 \\ 2 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \qquad \begin{pmatrix} 10 & 3 & 0 \\ 3 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}

\begin{pmatrix} 1 & 0 & 0 \\ 0 & 2 & 1 \\ 0 & 1 & 1 \end{pmatrix} \qquad \begin{pmatrix} 2 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \end{pmatrix}

On the upper left of Figure 9, it can be seen clearly that V1 depends on both V2 and V3 (i.e., V1 depends on V3 only indirectly through V2, since V1 ⊥⊥ V3 | V2). Note that this conditional independence can easily be seen to hold in the top-left covariance matrix because the determinant of the off-diagonal submatrix is zero, i.e.,

\begin{vmatrix} 4 & 2 \\ 2 & 1 \end{vmatrix} = 0,

reflecting a zero in the (1,3) position of the inverse covariance (a small numerical sketch of this check is given at the end of this section). On the upper right, it can be seen that V2 ⊥⊥ V3, as indicated by the upper right of Figure 8. From these “truth” models, it can be seen that the ability to discriminate between the two classes lies within the V2V3 plane. The middle-row models in Figure 8 (and an example of their Gaussian correlates given in Figure 9) will not be able to discriminate well between the two classes regardless of the training method, since they contain only an edge (and therefore a possible dependency) between V1 and V2. The bottom-row models in Figure 8 (correspondingly, Figure 9) will once again be able to make a distinction between the two classes, and mere maximum-likelihood training would yield solutions such as the ones indicated. Note that these bottom-row models are not the only possible structurally discriminative conditional independence properties; in the example, the bottom-right model could just as well assume everything is independent without harm. Note also that in this case a linear discriminant analysis [44] projection would enable simple lower-dimensional Gaussians to discriminate well. In the general case, however (e.g., where the dependencies are non-linear and non-Gaussian), placing certain restrictions on a generative model can make it more amenable to use in a discriminative context. Structural discriminability tries to do this in the space of conditional independence statements as encoded by a graphical model. We must emphasize here that it is not the case that structural discriminability will necessarily make discriminative parameter training ineffective. In natural settings, it is most likely that a combination of structural discriminability and discriminative parameter training will yield the best of both worlds: a simple model structure that is capable of representing the distinct properties of objects relative to competing objects, and a parameter training method to ensure that the models make use of that ability. Incidentally, it is often asked why delta [36] (first temporal derivative) and double-delta [70, 106] (second temporal derivative) speech feature vectors produce a significant gain in HMM-based speech recognition systems. It turns out that the concept of structural discriminability can be used to shed some light on this situation [7]. The delta feature generation process can indeed be precisely modeled by a graphical model, so why might such precise modeling of delta features not be desirable? The reason is that the edges so added to a model (the correct generative model) will render the delta features independent of the hidden variables. This will have the effect of making the delta features non-informative. Therefore, by making (wrong) independence statements about the generative process of the delta features, discrimination is improved (see [7] for details). In summary, the structure of the model family can have significant effects on the parameter training method used. If the wrong model family is used, even discriminative parameter training might not be helpful. Our goal in this work is to identify a criterion function that enables us to best tell whether a given edge is discriminative or not, and whether it should be removed from or added to a class-conditional generative graphical model.
Figure 9: Structural Discriminability of Gaussian Structures.

Ideally, there would be a measure that could be computed independently for each edge, and those edges for which the measure is good enough (e.g., above threshold) would be retained, all others being dropped. Such a measure that attempts to achieve this goal, the EAR measure, is described in detail in Section 7. In this report, we focus on class-conditional probabilistic models that can be expressed as Bayesian networks. We focus further on, and use, a new graphical model toolkit (GMTK) for representing both standard and discriminative structures for speech recognition. The benefits of this framework include the ability to rapidly and easily express a wide variety of models, and to use them in as efficient a way as possible for a given model structure.
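As a small illustrative check of the conditional-independence claim made above (this is our own sketch in Python with numpy, not part of GMTK or the workshop experiments), one can invert the upper-left covariance matrix and confirm the zero in the (1,3) position of the precision matrix:

import numpy as np

# Upper-left covariance matrix from the list above (a class-1 "truth" model).
sigma = np.array([[9.0, 4.0, 2.0],
                  [4.0, 2.0, 1.0],
                  [2.0, 1.0, 1.0]])

# Determinant of the off-diagonal submatrix (rows 1-2, columns 2-3).
off_diag = sigma[0:2, 1:3]            # [[4, 2], [2, 1]]
print(np.linalg.det(off_diag))        # ~0, so V1 and V3 are conditionally independent given V2

# Equivalently, the (1,3) entry of the precision (inverse covariance) matrix is zero.
precision = np.linalg.inv(sigma)
print(np.round(precision, 6))         # the [0, 2] entry is ~0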

5 Explicit vs. Implicit GM-structures for Speech Recognition

5.1 HMMs and Graphical Models

Undoubtedly the most commonly used model for speech recognition is the hidden Markov model or HMM [5, 59, 87], and so we begin by relating the HMM to graphical models. It has long been realized that the HMM is a special case of the more general class of dynamic graphical models [98], and Figure 10 illustrates the graphical representation of an HMM. Recall that in the classical definition [59, 87], an HMM consists of:

• The specification of a number of states

• An initial state distribution π

• A state transition matrix A, where Aij is the probability of transitioning from state i to state j between successive observations

• An observation function b(i, t) that specifies the probability of seeing the observed acoustics at time t given that the system is in state i.

In this formulation, the joint probability of a state sequence s1, s2, . . . , sT and observation sequence o1, o2, . . . , oT is given by

\pi_{s_1} \prod_{i=1}^{T-1} A_{s_i s_{i+1}} \prod_{i=1}^{T} b(s_i, i) \qquad (2)
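As a minimal sketch of Equation 2 (this is our own illustration, not code from the report or from GMTK; all numbers below are made up), the joint probability of a fixed state and observation alignment can be computed as follows:

import numpy as np

def hmm_joint_prob(pi, A, b, states):
    """Joint probability of Equation 2 for a fixed state sequence.

    pi[i]   : initial probability of state i
    A[i, j] : probability of transitioning from state i to state j
    b[i, t] : probability of the observation at time t given state i
    states  : the state index occupied at each frame
    """
    prob = pi[states[0]]
    for t in range(len(states) - 1):
        prob *= A[states[t], states[t + 1]]
    for t, s in enumerate(states):
        prob *= b[s, t]
    return prob

# Toy 2-state model observed for 3 frames.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.2, 0.8]])
b = np.array([[0.5, 0.1, 0.2],    # b[i, t]: observation score for state i at frame t
              [0.3, 0.6, 0.4]])
print(hmm_joint_prob(pi, A, b, states=[0, 0, 1]))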


Figure 10: Graphical model (specifically, a dynamic Bayesian network) representation of a hidden Markov model (HMM). The graph represents the set of random variables (two per time frame) of an HMM, and edges encode the set of conditional independence statements made by that HMM.

In the case that the state sequence or alignment is not known, the marginal probability of the observations can still be computed, either by enumerating all possible state sequences and summing the corresponding joint probabilities, or via dynamic programming recursions. Similarly, the single likeliest state sequence can be computed. Figure 10 shows the graphical model representation of an HMM. It is a model in which each time frame has two variables: one whose value represents the value of the state at that time, and one that represents the value of the observation. The conditioning arrows indicate that the probability of seeing a particular state at time t is conditioned on the value of the state at time t−1, and the actual numerical value of this probability reflects the transition probability between the associated states in the HMM. The observation variable at each frame is conditioned on the state variable in the same frame, and the value of P(o_t|s_t) reflects the output probabilities of the HMM. Therefore, the directed factorization property of directed graphical models (the joint probability can be factored into a form where each factor is the probability of a random variable given its parents) immediately yields Equation 2. One important thing to note about the graphical model representation is that it is explicit about absolute time: each time frame gets its own separate set of random variables in the model. In Figure 10, there are exactly four time-frames represented, and to represent a longer time series would require a graph with more random variables. This is in significant contrast to the classic representation of an HMM, which has no inherent mechanism for representing absolute time. Instead, in the classic HMM representation, only relative time is represented. Absolute time is, of course, eventually represented, but this is typically done in the auxiliary data structures used for specific computations. Figure 11 makes this more explicit. At the top of this figure is an HMM that represents the word “digit.” There are five states (unshaded circles) representing the different sounds in the word, and a dummy initial and final state. The arcs in the HMM represent possible transitions, and not conditional independence relationships as in the graphical model. Note that this graph shows only the transition matrix of the Markov chain (only one part of the HMM), and in particular edges are given only when there are non-zeros in the transition matrix. The self-loops in the Markov chain graph depict that it is possible to be in a state at one time (with a given probability), and then stay in the state at the next time frame. In particular, the picture shows the Markov chain for a “left-to-right” HMM in which it is only possible to stay in the same state, or move forward in the state sequence. This kind of representation is not explicit about absolute time. It can represent 100-frame occurrences of the word “digit” as well as 10- or 1000-frame occurrences. Of course, actual computations must be specific about absolute time, and the absolute temporal aspect is introduced into an HMM via the notion of a computational grid or its equivalent. This is shown at the upper right for a seven-frame occurrence of the word “digit.” The horizontal axis represents time; the vertical axis represents the HMM state set, and a path from the lower-left corner of the grid to the upper-right corner represents an explicit path through the states of the HMM over time.
In this case, the first frame is aligned to the /d/ state, the second and third frames are aligned to the /ih/ state, and so forth. Although the structure (zeros/non-zeros) of the underlying Markov chain of the HMM is defined by the graph at the left, computations on the HMM are defined with reference to the temporally-explicit grid. The graphical model representation of the same utterance is shown at the bottom of Figure 11. This is a temporally-explicit structure (it shows absolute time) with seven repeated chunks, one for each of the time-slices. The assignment to the state variables of this model corresponds to the HMM path represented in the computational grid. Note that different paths through the grid will correspond to different assignments of values to the state variables in the graphical model. Whereas computation in the HMM typically involves summing or maximizing over all paths, computation in the graphical model typically involves summing or maximizing over all possible assignments to the hidden variables. The analogy between a path in an HMM and an assignment of values to the hidden variables in a graphical model is quite important and should be kept in mind at all times.
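To make the path/assignment analogy concrete, the following sketch (our own illustration with made-up toy numbers, not code from the report) computes the likeliest assignment to the hidden state variables of an unrolled HMM, i.e., the classic Viterbi path:

import numpy as np

def viterbi(pi, A, b):
    """Likeliest assignment to the hidden state variables (the classic Viterbi path)."""
    num_states, num_frames = b.shape
    delta = np.zeros((num_frames, num_states))          # best log-score ending in each state
    back = np.zeros((num_frames, num_states), dtype=int)
    delta[0] = np.log(pi) + np.log(b[:, 0])
    for t in range(1, num_frames):
        scores = delta[t - 1][:, None] + np.log(A)      # scores[i, j]: arrive in j from i
        back[t] = np.argmax(scores, axis=0)
        delta[t] = np.max(scores, axis=0) + np.log(b[:, t])
    # Trace back the best assignment of values to the state variables.
    states = [int(np.argmax(delta[-1]))]
    for t in range(num_frames - 1, 0, -1):
        states.append(int(back[t, states[-1]]))
    return states[::-1], float(np.max(delta[-1]))

# Toy 2-state model observed for 3 frames (same conventions as the earlier sketch).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
b = np.array([[0.5, 0.1, 0.2], [0.3, 0.6, 0.4]])
print(viterbi(pi, A, b))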


Figure 11: Comparison of two views of an HMM: the more classic views, where on the top left an HMM is seen only as the non-zero values of the underlying Markov chain, and on the top right as a grid where time is on the horizontal axis and the possible states are shown on the vertical axis. The bottom plot shows the graphical-model view of an HMM, where the HMM’s underlying set of random variables (two per time frame) is shown, and the edges specifically encode the set of conditional independence statements made by the model.

The graphical model in Figure 11 represents the information in an HMM. In particular, it encodes the conditional independence assumptions made by the HMM when it is seen as a collection of random variables, two per time-slice. Note that the graphical-model view of the HMM given above, however, does not explicitly account for certain aspects of HMMs as they are typically used in a speech recognition system. One example of this is that, in practice, the utterances that are associated with an HMM represent a complete example of a word or words. A specific recording of the word “digit” will start with the /d/ sound and end with the /t/ sound; i.e., the whole word will be spoken. This extra piece of information is not explicitly captured in the previous graphical model. To see this, suppose the value /d/ is assigned to every occurrence of the state variable. (This will actually happen in the course of inference, either explicitly or implicitly depending on the algorithm used.) This concrete assignment will have some probability (obtained by multiplying together many instances of the self-loop probability for /d/ along with the associated observation probabilities), and unless specific provisions within the set of hidden random variables of the graphical model are made, this probability might not be zero. In particular, if the hidden variable set does not have state values that specifically and jointly encode both position and phoneme, it would not be possible to ensure that this assignment of /d/ to all variables would have zero probability. If such were the case, this fact would violate our prior knowledge that a complete occurrence of “digit” must be spoken. Such zero-probability assignments in the graphical model would correspond in the classic HMM framework to a path that does not end in the upper-right corner of the computational grid. It is often the case that this issue is resolved in the program code, which performs inference by treating the upper-right corner of the grid in a special way. This essentially corresponds to a particular query in the graphical model representation, namely P(O_{1:7}, Q_{1:6}, Q_7 = q), meaning that only assignments to the variables are considered that have a specific assignment to the last hidden variable Q_7. Another issue concerns the use of time-invariant conditional probabilities. Consider the “digit” example. Because the first occurrence of the phone /ih/ is followed by /jh/, it must be the case that P(Q_t = /jh/ | Q_{t-1} = /ih/) > 0 (so that the transition is possible) and P(Q_t = /t/ | Q_{t-1} = /ih/) = 0 (so that the transition is not skipped). However, because the second occurrence of /ih/ is followed by /t/, it must be the case that P(Q_t = /t/ | Q_{t-1} = /ih/) > 0 and P(Q_t = /jh/ | Q_{t-1} = /ih/) = 0, which is in direct contradiction to the previous requirements.

(Figure 12 layers, top to bottom: end-of-word observation; position (1 2 2 3 4 5); transition (1 0 1 1 1 1); phone (D IH IH JH IH T); acoustic observation.)

Figure 12: A more explicit graphical model representation of parameter tying in a speech-recognition-based HMM.

Therefore, if the cardinality of the hidden random variables in an HMM is only equal to the number of phonemes, then the HMM would not be able to represent the different meanings of the states at different times, even though observation distributions for these different states should be shared for the same phoneme. One way of solving this problem is therefore to distinguish between the first and second occurrences of /ih/: to define /ih1/ and /ih2/. In general, one may increase the number of states in an HMM to encode both phoneme identity and position in word, and also sentence, sentence category, and/or any other hidden event required. One must be careful, however, because we typically desire that multiple state values “share” the same output probabilities (i.e., the observation distributions of the same phoneme in different words are typically shared, or tied, in a speech recognition system, meaning that P(x|Q = q1) = P(x|Q = q2) for all x). This issue is compounded when we consider instances of multiple words. For example, if the word “fit” - /f/ /ih/ /t/ - occurs in the database, should its /ih/ be the same as /ih1/ or /ih2/, or something else entirely? The graphical-model view of an HMM (Figure 11, bottom), therefore, while able to encode these sorts of constraints, does not explicitly indicate how the constraints come to be or how they should be implemented. The graphical model formalism, however, is also able to represent the parameter tying issue via an explicit structural representation using the graph itself, rather than relegating this detail either to a particular implementation or to an expanded hidden state space. This has been called the explicit graphical representation approach [9], whereas the simple HMM graphical model is called the implicit approach (expanded upon when discussing GMTK in Section 6.1.1). For example, a graphical model that incorporates the word-end and parameter tying constraints that are appropriate to a typical practical application is shown in Figure 12 (see [109] for further information). The main aspect of this representation is that there is now an explicit variable representing the position in the underlying HMM. The position within the HMM is now distinct from the phone labeling the position, which is explicitly represented by a second set of variables. The tying issue is implemented by mapping from position to phone via a conditional probability distribution. In the example of Figure 12, positions 2 and 4 both map to /ih/. Different transition probabilities are obtained at each time point via an explicit transition variable, which is conditioned on the phone. Either there is a transition or not, and the probability depends on the phone (which means that different phones can have their own length distributions, and a given length distribution is active depending not on the position within the word, but only on the phone that is currently being considered). The position is a function of the previous position and the previous transition value: if the transition value was 0, then the position remains unchanged; otherwise it increments by one. These relationships can be straightforwardly and explicitly encoded as conditional probabilities, as shown in the figure. In this representation, there is an “end-of-word” observation, which is assigned an arbitrary value of 1.
This variable ensures that all variable assignments that have non-zero probability must end on a transition out of the final position. This is done by conditioning this observation on the position and transition variables, and setting its conditional probability distribution so that the probability of its observed value is 0 unless the position variable has the value of the last position. Similarly, its probability is 0 unless the transition value is 1. This ensures assignments in conformance with the classic HMM convention that all paths end in a transition out of the final emitting state. In this model, the conditional probabilities need only be specified for the first occurrence of each kind of variable, and can then be shared by all subsequent occurrences of analogous variables. Thus, we have achieved the goal of making the conditional probabilities essentially “time-invariant.”

However, it is important to note that the probabilities do depend on the specific utterance being processed. In this model, the mapping from position to phone will change from utterance to utterance, as does the final position. Thus an implementation must support some mechanism for representing and reading in conditional probabilities on an utterance-by-utterance basis. This is analogous to reading in word-graphs, lattices, or scripts on an utterance-by-utterance basis in a classic HMM system. A final nuance of this model is that some of the conditional probability relationships are in fact deterministic, and not subject to parameter estimation. In fact, Figure 12 shows that some of the edges indicate true random implementations (the smoothed zig-zag or wriggled edges), and the other edges indicate deterministic implementations (the straight edges). Specifically, the distribution controlling the position variable encodes a simple logical relationship: if the transition parent is 0, then Position_t = Position_{t-1}; otherwise, Position_t = Position_{t-1} + 1. Efficiently representing deterministic relationships, and exploiting them in inference, is important for an efficient implementation of a graphical modeling system [109]. To summarize, the model of Figure 12 uses the following conditional probabilities (a small sketch of the deterministic items is given after the list):

1. position at frame 1: a deterministic function that is 1 with probability 1

2. position in all other frames: a deterministic function such that

P(position_t = position_{t-1} | position_{t-1}, transition_{t-1} = 0) = 1

P(position_t = position_{t-1} + 1 | position_{t-1}, transition_{t-1} = 1) = 1

3. phone: an utterance-specific deterministic mapping

4. transition: a dense table specifying P(transition | phone)

5. observation: a function such as a Gaussian mixture that specifies P(observation | phone)

6. end-of-utterance observation: an utterance-specific deterministic function such that

P(end-of-utterance = 1 | position ≠ final, transition ≠ 1) = 0
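The deterministic items above can be sketched as ordinary functions (this is our own illustration with hypothetical names; GMTK itself expresses such relationships as decision trees):

def position_cpd(prev_position: int, prev_transition: int) -> int:
    """Deterministic position update: stay put if no transition, otherwise advance by one."""
    return prev_position if prev_transition == 0 else prev_position + 1

def end_of_word_prob(observed_value: int, position: int, transition: int,
                     final_position: int) -> float:
    """P(end-of-word observation = observed_value | position, transition).

    The observation is fixed to the value 1, which receives probability 1 only when
    the model is leaving the final position, and probability 0 otherwise.
    """
    at_word_end = (position == final_position and transition == 1)
    if observed_value == 1:
        return 1.0 if at_word_end else 0.0
    return 0.0 if at_word_end else 1.0

# For the word "digit" there are five positions, so final_position = 5.
print(position_cpd(prev_position=2, prev_transition=0))                   # stays at 2
print(position_cpd(prev_position=2, prev_transition=1))                   # advances to 3
print(end_of_word_prob(1, position=5, transition=1, final_position=5))    # 1.0: the word may end here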

The graphical model in Figure 12 is useful in a variety of tasks:

• Parameter estimation when the exact sequence of words, and therefore phones, including silences, is known.

• Finding the single best alignment of a sequence of frames to a known sequence of words and phones.

• Computing the probability of acoustic observations, given a known word sequence - which is useful for rescoring the n-best hypotheses of another recognition system.

The main issue with this model is that, unless it is implemented using an expanded state space implicitly within the graph, a fully specified state sequence must be used. This is available only if one knows where silence resides between words. For example, the state sequence corresponding to “hi there” will in general be different from “hi sil there,” though in practice one does not know where the silences occur. Of course, one has two choices: 1) resort to the implicit approach and use an expanded state space, or 2) use a more intricate graph which explicitly represents the possibility of silence between words and multiple lexical variants. In the next section, we pursue the latter of the two choices.

5.2 A more explicit structure for Decoding

A more explicit (and complicated) model structure can be used for decoding, and is described in this section. The structure given here was developed anew specifically for the JHU 2001 workshop, and became the basis for many of the recognition experiments that were undertaken during this time. In general, the goal of the decoding process is to determine the likeliest sequence of words given an observed acoustic stream, and this can be done with the model structure shown in Figure 13. This model is similar in spirit to that of Figure 12, but there is an extra “layer” of both deterministic and random variables added on top. Again, the wriggled edges correspond to true random implementations, and all straight edges depict determinism. Since the goal is to discover a word sequence, there must be some representation of words, and this is obtained by associating an explicit “word” variable with each frame.

(Figure 13 variables, top to bottom: end-of-utterance observation, word, word-transition, word-position, phone-transition, phone, acoustics.)

Figure 13: A graphical model for decoding.

The other addition is a “word transition” variable, which indicates when one word ends and another begins. The “word position” variable is analogous to the previous “position” variable, and indicates which phone of a word is under consideration: the first, second, etc. The logic of this network is fairly straightforward. The combination of word and word-position determines a phone. As before, both the observation and transition variables are conditioned on this. There is a word transition when there is a phone transition in the last phone of a word. Finally, the word variable retains its value across time if there is no word transition, and otherwise takes a new value according to a probability distribution that is conditioned on the previous word value. In other words, when the word transition variable at time t − 1 is zero, the word variable at time t is a deterministic function of the word variable at time t − 1. When the word transition variable at time t − 1 is one, the word variable at time t is random, according to a bigram language model. For this reason, the edge between word variables is shown as both a straight edge and a wriggled edge: it is straight only when the previous word transition is zero. Such a construct can be encoded using switching dependencies within GMTK (see Section 6.1.7). The only major constraint on decoding is that the interpretation should end at the end of a word, and this is enforced by adding an “end-of-utterance” observed variable and conditioning it on the word transition variable. The probability distribution is set so that the observed end-of-utterance value is only possible (obtains non-zero probability) if the final word-transition value is 1. As in the simpler model of the previous section, all the logical relationships that were used to describe the network can be encoded as conditional probabilities. This network is used for decoding by finding the likeliest values of the hidden variables. The decoded word sequence can be read off from the word variable in frames for which the word transition has the value 1. (Because a word may be repeated, as in “fun fun”, one cannot simply look at the sequence of assignments to the word variable alone, since that would not indicate whether a duplicate word occurred or whether it was just a longer instance of a single word.) This decoding model uses a bigram language model of word sequences. That is, the probability contributed by the word variables w is factored as P(w) = Π_t P(w_t | w_{t-1}). A trigram language model that conditions on the two previous words, P(w) = Π_t P(w_t | w_{t-1}, w_{t-2}), is more common, and is encoded in the somewhat more complex model of Figure 14. In this model, there is another layer of variables, “last-word,” that represents the word w_{t-2}. When there is a word transition, the next word is chosen from a distribution that is conditioned on both the current and last word, and appropriate copy operations are done so that the last-word variable will always have the correct value. When there is no word transition, the word is, as before, just a duplicate of the word at the previous frame. The trigram probability P(w_t | w_{t-1}, w_{t-2}) is shown in the figure by having two wriggled edges converging at the word variable, where the edge from the previous word is wriggled only in the case when the word transition is one (again, the edge is shown both as a straight edge and as a wriggled edge).
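The switching behavior of the word variable in the bigram case can be sketched as follows (this is our own illustration with hypothetical names and made-up bigram values, not GMTK code):

def word_cpd(word_t: int, prev_word: int, prev_word_transition: int,
             bigram: dict) -> float:
    """P(word_t | prev_word, prev_word_transition).

    When the previous word-transition is 0, the word deterministically copies its
    previous value; when it is 1, the new word is drawn from a bigram distribution.
    """
    if prev_word_transition == 0:
        return 1.0 if word_t == prev_word else 0.0
    return bigram.get((prev_word, word_t), 0.0)

# Toy bigram table over a 3-word vocabulary: bigram[(w_prev, w)] = P(w | w_prev).
bigram = {(0, 0): 0.2, (0, 1): 0.5, (0, 2): 0.3,
          (1, 0): 0.1, (1, 1): 0.1, (1, 2): 0.8,
          (2, 0): 0.4, (2, 1): 0.4, (2, 2): 0.2}

print(word_cpd(word_t=1, prev_word=0, prev_word_transition=0, bigram=bigram))   # 0.0 (deterministic copy)
print(word_cpd(word_t=1, prev_word=0, prev_word_transition=1, bigram=bigram))   # 0.5 (bigram draw)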

5.3 A more explicit structure for training

Figure 12 illustrates a model that is suitable for training. The adjustable parameters are simply the output and transition probabilities, and when training is completed these are available for use in the same or other network structures.


Figure 14: A model for decoding with a trigram language model.

Due to its simplicity, this structure is relatively efficient for training (it has only size-three cliques in the graph) whenever the exact word and phone sequence is known. Unfortunately, this is not always the case. For example, occurrences of silence may not be indicated in a transcription of training data, but it is known that silence does exist and therefore should be probabilistically modeled during training as well as during recognition. As another example, there might be multiple possible pronunciations for the words that are known, e.g. /T AH M EY T OW/ or /T AH M AA T OW/, and it is desirable not to choose one of those pronunciations at training time because the phone transcriptions are not available. Again, such a feature can be added either implicitly or explicitly, and we pursue the latter. Figure 15 illustrates a model structure that is able to consider all possible insertions of silence between words. This structure is essentially the same as the structure of Figure 13 used for bigram decoding, with a small modification. There are two new variables, one denoting the position within the utterance, and the other denoting whether silence should be inserted or skipped at a word transition. The “position within utterance” variable denotes the position (first, second, third, etc.) in the training script, where inter-word silences are explicitly numbered. For example, in the phrase “the cat in the hat,” the numbering is as: sil(1) the(2) sil(3) cat(4) sil(5) in(6) sil(7) the(8) sil(9) hat(10) sil(11). The probability distribution governing position-within-utterance can now be described. The basic idea is that if there is a transition at the end of a position denoting silence (e.g. position 3), then the position variable advances by one to the next word. On the other hand, if there is a transition at the end of a normal word position (e.g. position 4), then depending on the value of the skip-silence variable, the position either advances by one (and the silence is inserted) or by two (and the silence is skipped). The end-of-utterance variable is set up so that its assigned value is only possible if the position is either the last word or silence, and there is a word transition. Note that in this case the word variable is a deterministic function only of the position variable. During training, it is known what the first, second, etc. word is, so this is certainly possible to do (and is how GMTK implements it). Training with multiple pronunciations or lexical variants for each word is more complicated. Figure 16 illustrates one possible graphical model that can accommodate multiple lexical variants, as well as optional silence. The main addition here is a variable that explicitly represents the lexical variant of the word being uttered. For example, “tomato” has two: a lexical-variant value of 0 represents /T AH M EY T OW/ and a value of 1 represents /T AH M AA T OW/. The phone value is now determined by a combination of word, word-position, and lexical variant. The lexical variant is selected according to an appropriate distribution when the word-transition value is 1. In the case that there is no transition, it is simply copied from the previous frame, which is why there is an edge between consecutive occurrences of the variable. When there is a transition, the lexical variant is chosen based on the new word (since there will typically be a different number of variants with differing probabilities for each word).
This again depicts a switching dependency: when the word transition at time t − 1 is zero, the lexical variant at time t is a deterministic copy of its value at the previous time frame. When the word transition at time t − 1 is one, the lexical variant at time t is determined randomly based on the value of the word at time t (i.e., the new word).
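A small sketch of the optional-silence position update described above (our own illustration with hypothetical function names, using the example numbering of “the cat in the hat”):

def utterance_position_cpd(prev_pos: int, transition: int, skip_silence: int,
                           is_silence_position) -> int:
    """Deterministic next position in the numbered training script.

    The script interleaves silences with words, e.g.
    sil(1) the(2) sil(3) cat(4) sil(5) in(6) sil(7) the(8) sil(9) hat(10) sil(11).
    """
    if transition == 0:
        return prev_pos                       # still inside the current word or silence
    if is_silence_position(prev_pos):
        return prev_pos + 1                   # leaving a silence: move to the next word
    return prev_pos + 2 if skip_silence else prev_pos + 1

# In the numbering above, odd positions are silences.
is_sil = lambda p: p % 2 == 1
print(utterance_position_cpd(4, transition=1, skip_silence=1, is_silence_position=is_sil))   # 6 (silence skipped)
print(utterance_position_cpd(4, transition=1, skip_silence=0, is_silence_position=is_sil))   # 5 (silence inserted)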


Figure 15: Training with optional silence. The structure below the dotted line is substantially the same as the bigram decoding structure given in Figure 13.

Although we do not present the decoding graph, a similar modification of the previous decoding networks allows for decoding with multiple lexical variants.

5.4 Rescoring

It is frequently the case that one has a reasonably good system that produces a set of word hypotheses that one then wants to choose between on the basis of a more sophisticated model. This process is referred to as rescoring, and GMTK is ideally suited to rescoring existing hypotheses with a more sophisticated model. The simplest kind of rescoring, and the one we will consider in this section, is n-best rescoring, where there are n possible word sequences to choose between, and the sequences are simply enumerated; for example,

The cat in the hat.
The cat is the hat.
The cat in a hat.
The cat is a hat.

When the hypotheses are enumerated like this, rescoring can be done simply by computing the data probability according to each hypothesis with the basic model of Figure 12. This is a nice situation, because then exactly the same model structure can be used for both training and testing. There are two disadvantages, however, the first being that an outside system is required to produce the hypotheses, and the second that a much more compact representation is often available in the form of a word lattice, as illustrated for “the cat in the hat” in Figure 17. While the first disadvantage is intrinsic to the rescoring paradigm, the second can be alleviated with a somewhat more sophisticated model structure, as discussed in the following section.

5.5 Graphical Models and Stochastic Finite State Automata

Since lattices are a compact and useful representation in many applications, it is fortunate that a straightforward procedure allows them to be represented and manipulated in the graphical model framework [109]. To understand the exact analogy, it is necessary to define exactly what we mean by a lattice. This can be done with the following and relatively standard definition. A lattice consists of:

1. a set of states;

2. a set of directed arcs connecting the states;

3. a subset of states identified as “initial states”;

4. a subset of states identified as “final states”;

5. a probability distribution over the arcs leaving each state.


Figure 16: Training with lexical variants.


Figure 17: A word lattice. Each path from the leftmost point to the rightmost point represents a possible word sequence. The number of complete distinct paths can grow exponentially in the number of edges in the lattice, making it a far more compact representation than an enumeration of all possible word sequences.


Figure 18: A stochastic finite state automaton that represents several words and their pronunciation variants. The initial state is on the left and shaded; the final state is shaded on the right. Each path through the graph represents a valid pronunciation of a word.


Figure 19: A graphical model structure for representing generic SFSAs with path-length four. An assignment to the state and transition variables corresponds to a path through the automaton.

The semantics of a lattice are that it represents a set of paths, each of which starts in an initial state, ends in a final state, and has a probability equal to the product of the transition probabilities encountered on the way. Figure 18 illustrates a stochastic finite state automaton. As detailed in [109], there is a straightforward construction process by which a set of paths can be represented in a graphical model. The one restriction is that the paths must be of a given length; in real applications with concrete observation streams, this is always the case. As in previous examples, the key is to explicitly represent the state that is occupied at each time frame. Also in common with previous examples, there is a transition variable at each frame that in this case specifies which arc to follow out of the lattice state. Figure 19 illustrates a graphical model that can represent a generic SFSA. In the model of Figure 19, the cardinality of the state variables is equal to the number of states in the underlying SFSA. The cardinality of the transition variable is equal to the maximum out-degree of any state. The conditional probability of the transition variable taking a particular value k, given that the state variable has value j, is equal to the probability of taking the kth arc out of state j. The end-of-path variable has an artificially assigned value of 1, which is only possible if the state variable has a value equal to a predecessor of a final state, and the transition variable has a value that denotes an arc leading to a final state. It can be easily shown [109] that conditional probabilities defined in this way lead to a model in which each assignment of values to the variables either corresponds to a valid path through the automaton (and gets the same probability), or else corresponds to an illegal path and gets zero probability. Note that each state (except initial and final ones) in the SFSA of Figure 18 is labeled with an output symbol. In some cases, it is useful to have “null” or unlabeled states in the interior of the graph. In this case, multiple transitions may be taken within a single time-frame, and the representation of Figure 19 is no longer adequate. Instead, multiple transition and state variables are required in each frame. This type of model is discussed in more detail in [109]. Note also that in the previous section, we saw a lattice in which the arcs - not the nodes - were labeled. In fact, a trivial conversion in which each labeled arc is broken into two unlabeled arcs with a labeled state sandwiched in between shows that the two representations are equivalent.
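The construction above can be sketched as follows (our own illustration with a made-up toy SFSA; the data structures and function names are hypothetical and are not those of [109] or GMTK):

# A toy SFSA: each state maps to a list of (next_state, probability) arcs.
arcs = {
    0: [(1, 0.6), (2, 0.4)],
    1: [(3, 1.0)],
    2: [(3, 1.0)],
    3: [],                       # final state
}
final_states = {3}
max_out_degree = max(len(a) for a in arcs.values())   # cardinality of the transition variable

def transition_cpd(k: int, state: int) -> float:
    """P(transition = k | state): probability of taking the k-th arc out of the state."""
    return arcs[state][k][1] if k < len(arcs[state]) else 0.0

def next_state(state: int, k: int) -> int:
    """Deterministic successor implied by taking the k-th arc out of the state."""
    return arcs[state][k][0]

def end_of_path_prob(state: int, k: int) -> float:
    """P(end-of-path = 1 | state, transition = k): 1 only if the chosen arc enters a final state."""
    if k < len(arcs[state]) and arcs[state][k][0] in final_states:
        return 1.0
    return 0.0

# Path 0 -> 2 -> 3: arc 1 out of state 0, then arc 0 out of state 2 (which enters a final state).
path_prob = transition_cpd(1, 0) * transition_cpd(0, 2) * end_of_path_prob(2, 0)
print(path_prob)    # 0.4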

6 GMTK: The graphical models toolkit

As mentioned in earlier sections, with GMs one uses a graph to describe a statistical process, and thereby defines one of its most important attributes, namely conditional independence. Because GMs describe these properties visually, it is possible to rapidly specify a variety of models without much effort. Again, GMs subsume much of the statistical underpinnings of existing ASR techniques; no other known statistical abstraction appears to have this property. More importantly, the space of statistical algorithms representable with a GM is enormous, much larger than what has so far been explored for ASR. The time therefore seems ripe to start seriously examining such models. Of course, this task is not possible without a (preferably freely-available and open-source) toolkit with which one may maneuver through the model space easily and efficiently, and this section describes the first version of GMTK. GMTK can represent all of the models that have been described in previous sections, and was used throughout the 2001 JHU workshop. GMTK is meant to complement rather than replace other publicly available packages: it has unique features, ones that are different from both standard ASR-HMM [108, 99, 57] and standard Bayesian network [16, 81] packages. This section contains a detailed description of GMTK’s features, including a language for specifying structures and probability distributions, logarithmic-space exact training and decoding procedures, the concept of switching parents, and a generalized EM training method which allows arbitrary sub-Gaussian parameter tying. Taken together, these features endow GMTK with a degree of expressiveness and functionality that significantly complements other publicly available packages. Full documentation on the features and use of GMTK can be found at the Web location given in citation [8].

6.1 Toolkit Features

GMTK has a number of features that support a wide array of statistical models suitable for speech recognition and other time-series data. GMTK may be used to produce a complete ASR system for both small- and large-vocabulary domains. The graphs themselves may represent everything from N-gram language models down to Gaussian components, and the probabilistic inference mechanism supports first-pass decoding in these cases.

6.1.1 Explicit vs. Implicit Modeling

In general, and as discussed in Section 5, there are two representational extremes one may employ when using graphical models, and in particular GMTK, for an ASR system. On the one hand, a graph may explicitly represent all the underlying variables and control mechanisms (such as sequencing) that are required in an ASR system [109]. We call this approach an “explicit representation,” where variables can exist for such purposes as word identification, numerical word position, phone or phoneme identity, the occurrence of a phoneme transition, and so on. In this case, the structure of the graph explicitly represents the interesting hidden structure underlying an ASR system. On the other hand, one can instead place most or all of this control information into a single hidden Markov chain, and use a single integer state to encode all contextual information and control the allowable sequencing (Figure 11, top left). We call this approach an “implicit” representation. As an additional example of these two extremes, consider the word “yamaha” with pronunciation /y aa m aa hh aa/. The phoneme /aa/ occurs three times, each in different contexts, first preceding an /m/, then preceding an /hh/, and finally preceding a word boundary. In an ASR system, it must somewhere be specified that the same phoneme /aa/ may be followed only by one of /m/, /hh/, or a word boundary depending on the context; /aa/, for example, may not be followed by a word boundary if it is the first /aa/ of the word. In the explicit GM approach, the graph and associated conditional probabilities unambiguously represent these constraints. In an implicit approach, all of the contextual information is encoded into an expanded single-variable hidden state space, where multiple HMM states correspond to the same phoneme /aa/ but in different contexts. The explicit approach is useful when modeling the detailed and intricate structures of ASR. It is our belief, moreover, that such an approach will yield improved results when combined with a discriminative structure (see above and [6, 11]), because it directly exposes events such as word-endings and phone-transitions for use as switching parents (see Section 6.1.7). The implicit approach is further useful in tempering computational and/or memory requirements. In any case, GMTK supports both extremes and everything in between; a user of GMTK is therefore free to experiment with quite a diverse and intricate set of graphs. It is the task of the toolkit to derive an efficient inference procedure for each such system.

frame: 0 {
  variable : state {
    type : discrete hidden cardinality 4000;
    switchingparents : nil;
    conditionalparents : nil using MDCPT("pi");
  }
  variable : observation {
    type : continuous observed 0:38;
    switchingparents : nil;
    conditionalparents : state(0) using mixGaussian mapping("state2obs");
  }
}

frame: 1 {
  variable : state {
    type : discrete hidden cardinality 4000;
    switchingparents : nil;
    conditionalparents : state(-1) using MDCPT("transitions");
  }
  variable : observation {
    type : continuous observed 0:38;
    switchingparents : nil;
    conditionalparents : state(0) using mixGaussian mapping("state2obs");
  }
}

chunk 1:1;

Figure 20: GMTKL specification of a basic HMM structure. The feature vector in this case is 39-dimensional, and there are 4000 hidden states. Frame 1 can be duplicated or “unrolled” to create an arbitrarily long network.

6.1.2 The GMTKL Specification Language

A standard DBN [29] is typically specified by listing a collection of variables along with a set of intra- and inter-dependencies which are used to unroll the network over time. GMTK generalizes this ability via dynamic GM templates. The template defines a collection of (speech) frames and a chunk specifier. Each frame declares an arbitrary set of random variables and includes attributes such as parents, type (discrete, continuous), parameters to use (e.g. discrete probability tables or Gaussian mixtures), and parameter sharing. At the end of a template is a chunk specifier (two integers, N : M) which divides the template into a prologue (the first N − 1 frames), a repeating chunk, and an epilogue (the last T − M frames, where T is the frame-length of the template). The middle chunk of frames is “unrolled” until the dynamic network is long enough for a specific utterance. GMTK uses a simple textual language (GMTKL) to define GM templates. Figure 20 shows the template of a basic HMM in GMTKL. It consists of two frames, each with a hidden and an observed variable, and dependencies between successive hidden variables and between observed and hidden variables. For a given template, unrolling is valid only if all parent variables in the unrolled network are compatible with those in the template. A compatible variable has the same name, type, and cardinality. It is therefore possible to specify a template that cannot be unrolled and which would lead to GMTK reporting an error. A template chunk may consist of several frames, where each frame contains a different set of variables. Using this feature, one can easily specify multi-rate GM networks where variables occur over time at rates which are fractionally but otherwise arbitrarily related to each other.
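The unrolling of a template can be sketched as follows (our own illustration; the frame indexing and names here are ours and may differ from GMTK's internal conventions, and the sketch assumes the lengths divide evenly):

def unroll(template_frames, n, m, target_length):
    """Unroll a dynamic GM template to cover target_length frames.

    template_frames : per-frame declarations (here just frame names)
    n, m            : chunk specifier; frames n..m (inclusive, 0-indexed) repeat
    """
    prologue = template_frames[:n]
    chunk = template_frames[n:m + 1]
    epilogue = template_frames[m + 1:]
    repeats = (target_length - len(prologue) - len(epilogue)) // len(chunk)
    return prologue + chunk * repeats + epilogue

# The HMM template of Figure 20 has two frames and chunk specifier 1:1,
# so frame 1 is duplicated until the network is long enough for the utterance.
print(unroll(["frame0", "frame1"], n=1, m=1, target_length=6))
# ['frame0', 'frame1', 'frame1', 'frame1', 'frame1', 'frame1']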

6.1.3 Inference

The current version of GMTK supports a number of operations for computing with arbitrary graph structures, the four main ones being:

1. Integrating over hidden variables to compute the observation probability: P(o) = Σ_h P(o, h)

2. Finding the likeliest hidden variable values: argmax_h P(o, h)

3. Sampling from the joint distribution P(o, h)


Figure 21: When S = 1, A is C’s parent; when S = 2, B is C’s parent. S is called a switching parent, and A and B conditional parents.

4. Parameter estimation given training data {o^k} via EM/GEM: argmax_θ Π_k P(o^k | θ)

A critical advantage of the graphical modeling framework derives from the fact that these algorithms work with any graph structure, and a wide variety of conditional probability representations. GMTK uses the Frontier Algorithm, detailed in [109, 111], which converts arbitrary graphs into equivalent chain-structured ones, and then executes a forwards-backwards recursion. The frontier algorithm is standard junction-tree inference [60, 84, 25, 61] where the junction tree is formed via a constrained triangulation algorithm. The triangulation algorithm is equivalent to the variable elimination algorithm, but where the variable order is constrained such that the variables occur in topological order relative to the original directed model. The chain structure is useful because it makes it easier to do beam-pruning, to work with deterministic relationships between variables, and to implement logarithmic-space inference.

6.1.4 Logarithmic Space Computation

In many speech applications, observation sequences can be thousands of frames long. When there are a dozen or so variables per frame (as in an articulatory network, see below), the resulting unrolled network might have tens of thousands of nodes, and cliques may have millions of possible values. A naive implementation of exact inference, which stores all clique values for all time, would result in (an obviously prohibitive) gigabytes of required storage. To avoid this problem, GMTK implements a recently developed procedure [13, 110] that reduces memory requirements exponentially, from O(T) to O(log T) (this has also been called the Island algorithm). This reduction has a truly dramatic effect on memory usage, and can additionally be combined with GMTK’s beam-pruning procedure for further memory savings. The key to this method is recursive divide-and-conquer. With k-way splits, the total memory usage is O(k log_k T), and the runtime is O(T log_k T). The constant of proportionality is related to the number of entries in each clique, and becomes smaller with pruning. For algorithmic details, the reader is referred to [110].

6.1.5 Generalized EM

GMTK supports both EM and generalized EM (GEM) training, and automatically determines which to use based on the parameter sharing currently in use. When there is no parameter tying, normal EM is employed. GMTK, however, has a flexible notion of parameter tying, down to the sub-Gaussian level; in such a case, the typical EM training algorithm does not lead to analytic parameter update equations. GMTK’s GEM training is distinctive because it provides a provably convergent method for parameter estimation, even when there is an arbitrary degree of tying, even down to the level of Gaussian means, covariances, or factored covariance matrices (see Section 6.1.9).

6.1.6 Sampling

Drawing variable assignments according to the joint probability distribution is useful in a variety of areas ranging from approximate inference to speech synthesis, and GMTK supports sampling from arbitrary structures. The sampling procedure is computationally inexpensive, and can thus be run many times to get a good distribution over hidden (discrete or continuous) variable values.

6.1.7 Switching Parents
GMTK supports another novel feature rarely found in GM toolkits, namely switching parent functionality (also called Bayesian multi-nets [11]). This was already used in Section 6.1.1. Normally, a variable has only one set of parents. GMTK, however, allows a variable's parents to change (or switch) conditioned on the current values of other parents. The parents that may change are called conditional parents, and the parents which control the switching are called

switching parents. Figure 21 shows the case where variable S switches the parents of C between A and B, corresponding to the probability distribution: P(C | A, B) = P(C | A, S = 1) P(S = 1) + P(C | B, S = 2) P(S = 2). This can significantly reduce the number of parameters required to represent a probability distribution; for example, P(C | A, S = 1) needs only a two-dimensional table whereas P(C | A, B) requires a three-dimensional table. Switching functionality has found particular utility in representing certain language models, as experiments during the JHU2001 workshop demonstrated.
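As a toy numerical illustration of this decomposition (our own sketch with made-up tables; GMTK itself specifies switching parents in its structure-description language), two small two-dimensional tables plus P(S) stand in for the full three-dimensional table P(C | A, B):

import numpy as np

# Hypothetical switching-parent CPD: S selects whether C depends on A or on B.
p_S = np.array([0.4, 0.6])              # P(S=1), P(S=2)
p_C_given_A = np.array([[0.9, 0.1],     # P(C | A, S=1); rows indexed by A
                        [0.2, 0.8]])
p_C_given_B = np.array([[0.7, 0.3],     # P(C | B, S=2); rows indexed by B
                        [0.1, 0.9]])

def p_C_given_AB(a, b):
    """P(C | A=a, B=b) = P(C | A=a, S=1) P(S=1) + P(C | B=b, S=2) P(S=2)."""
    return p_S[0] * p_C_given_A[a] + p_S[1] * p_C_given_B[b]

print(p_C_given_AB(0, 1))   # [0.42, 0.58] = 0.4*[0.9, 0.1] + 0.6*[0.1, 0.9]

Here the switching representation stores two 2x2 tables and P(S) instead of the full 2x2x2 table, which is exactly the parameter saving noted above.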

6.1.8 Discrete Conditional Probability Distributions
GMTK allows the dependency between discrete variables to be specified in one of three ways. First, they may be deterministically related using flexible n-ary decision trees. This provides a sparse and memory-efficient representation of such dependencies. Alternatively, fully random relationships may be specified using dense conditional probability tables (CPTs). In this case, if a variable of cardinality N has M parents of the same cardinality, the table has size N^{M+1}. Since this can get large, GMTK supports a third, sparse method to specify random dependencies. This method combines sparse decision trees with sparse CPTs so that zero entries in a CPT are simply not stored. The method also allows flexible tying of discrete distributions from different portions of a CPT.
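The memory argument can be made concrete with a small back-of-the-envelope sketch (ours, purely illustrative; it is not GMTK's decision-tree or CPT file format):

# A dense CPT over a child of cardinality N with M parents of the same
# cardinality stores N**(M+1) probabilities; a deterministic or sparse
# relationship needs only one stored entry per parent configuration.
N, M = 50, 2
dense_entries = N ** (M + 1)                 # 125,000 probabilities
# Deterministic example: child = (p1 + p2) mod N, stored sparsely.
sparse_cpt = {(p1, p2): {(p1 + p2) % N: 1.0} for p1 in range(N) for p2 in range(N)}
sparse_entries = sum(len(row) for row in sparse_cpt.values())   # 2,500 entries
print(dense_entries, sparse_entries)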

6.1.9 Graphical Continuous Conditional Distributions
GMTK supports a variety of continuous observation densities for use as acoustic models. Continuous observation variables for each frame are declared as vectors in GMTKL, and each observation vector variable can have an arbitrary number of conditional and switching parents. The current values of the parents jointly determine the distribution used for the observation vector. The mapping from parent values to child distribution is specified using a decision tree, allowing a sparse representation of this mapping. A vector observation variable spans a region of the feature vector at the current time. GMTK thereby supports multi-stream speech recognition, where each stream may have its own set of observation distributions and sets of discrete parents. The observation distributions themselves are mixture models. GMTK uses a splitting and vanishing algorithm during training to learn the number of mixture components. Two thresholds are defined, a mixture-coefficient vanishing ratio (mcvr) and a mixture-coefficient splitting ratio (mcsr). Under a K-component mixture with component probabilities p_k, if p_k < 1/(K × mcvr), then the kth component will vanish. If p_k > mcsr/K, that component will split. GMTK also supports forced splitting (or vanishing) of the N most (or least) probable components at each training iteration. Sharing portions of a Gaussian such as means and covariances can be specified either by hand via parameter files, or via a split (e.g., the split components may share an original covariance). Each component of a mixture is a general conditional Gaussian. In particular, the c-component probability is p(x | z_c, c) = N(x | B_c z_c + f_c(z_c) + µ_c, D_c), where x is the current observation vector, z_c is a c-conditioned vector of continuous observation variables from any observation stream and from the past, present, or future, B_c is an arbitrary sparse matrix, f_c(z_c) is a multi-logistic non-linear regressor, µ_c is a constant mean residual, and D_c is a diagonal covariance matrix. Any of the above components may be tied across multiple distributions, and trained using the GEM algorithm. GMTK treats Gaussians as directed graphical models [7], and can thereby represent all possible Gaussian factorization orderings, and all subsets of parents in any of these factorizations. Under this framework, GMTK supports diagonal, full, banded, and semi-tied factored sparse inverse covariance matrices [12]. GMTK can also represent arbitrary switching dependencies between individual elements of successive observation vectors. GMTK thus supports both linear and non-linear buried Markov models [10]. All in all, GMTK supports an extremely rich set of observation distributions.
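The splitting and vanishing rule itself is simple enough to sketch in a few lines (hypothetical code with arbitrary illustrative thresholds, not GMTK's trainer, which applies the rule internally):

import numpy as np

def split_and_vanish(weights, mcvr=50.0, mcsr=1.5):
    """Apply the rule described above: under a K-component mixture with
    component probabilities p_k, component k vanishes if p_k < 1/(K*mcvr)
    and splits if p_k > mcsr/K."""
    K = len(weights)
    vanish = [k for k, p in enumerate(weights) if p < 1.0 / (K * mcvr)]
    split = [k for k, p in enumerate(weights) if p > mcsr / K]
    return split, vanish

w = np.array([0.55, 0.30, 0.148, 0.002])
print(split_and_vanish(w))   # ([0], [3]): component 0 splits, component 3 vanishes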

7 The EAR Measure and Discriminative Structure Learning Heuristics

This section summarizes different approaches to modifying the structure of an HMM in order to improve classification performance. The underlying goal in this endeavor is to augment the structure of an HMM in a structurally discriminative fashion. The space of possible models that is spanned by the optimization procedure is also described. In particular, the graph structure for an HMM is treated such that the vector observation variables are expanded into their individual variables, as in a BMM [10]. These observation vectors, and the element variables therein, are augmented with discriminative cross-observation edges, leading to fewer independence statements made by the model in an attempt to

improve structural discriminability. This is shown in Figure 22. This section also introduces the EAR measure [6], but provides a new and more precise derivation of the assumptions needed for it to be obtained. In doing so, a new optimal criterion function for structural discriminability (fast-forward to Equation 4) is derived. This derivation could lead to additional novel heuristics to achieve structural discriminability.

7.1 Basic Set-Up
The basic problem that we consider may be summarized thus: Let Q denote a hidden class variable, taking values q; let X denote a (vector-valued) variable comprising a set of acoustic features observed at a specific frame; finally, let W denote a set of prior observations that may also be useful for determining which class q is associated with the given frame. Typically the dimension of W is too large for all of these features to be incorporated into the predictive model for the current time frame. Thus we face a model selection problem: find a model M_Z, which incorporates a subset of features Z (Z ⊆ W), but which gives good predictive performance. We may formalize this as follows: our goal is to choose a model M_Z from a class of models M, such that the resulting fitted distribution p̂_Z(Q | X, W) maximizes:

E_{p(Q,X,W)} log p̂_Z(Q | X, W).    (3)

Since

KL(p(Q | X, W) || p̂_Z(Q | X, W)) = E_{p(Q,X,W)} log [ p(Q | X, W) / p̂_Z(Q | X, W) ],

maximizing (3) is equivalent to minimizing the KL-divergence between the true distribution p(Q | X, W) and the fitted distribution p̂_Z(Q | X, W).


Figure 22: Basic set-up: we wish to find a fixed-size set Z ⊂ W of parents for X which will lead to an optimal classification model for Q given X and W. Q is typically a hidden variable at one time step of an HMM, and X is the feature vector at the current time point. W is the set of all possible additional parents of X that could be chosen; W might consist either of collections of X vectors from earlier times, or could consist of entirely different features that are not ordinarily used for an X at any time.

7.2 Selecting the optimal number of parents for an acoustic feature X
Here we consider the simplest version of the model selection problem. For a given Z ⊂ W, we define the following model:

M_Z = {p* | p*(x, q, w) = p*(q, w) p*(x | q, z)}.

This corresponds to a graphical model in which Q, W form a clique, and the parents of X are Q and Z. M_Z is simply the set of model distributions in which X ⊥⊥ W \ Z | Q, Z holds. Note that while X ⊥⊥ W \ Z | Q, Z would hold for a particular model that has been selected, it is not necessarily the case that X ⊥⊥ W \ Z | Q, Z is correct according to the true generative model p(x, q, w). As mentioned in Section 4, this is not a concern, as the goal here is only to obtain generative models that discriminate well. We now define the model class: M_c = {M_Z where |Z| = c}. This is simply the set of graphical models in which X has exactly c parents in addition to Q, and Q, W forms a clique. For a given model M_Z we will let p̂_Z denote the fitted distribution, given data, under the model M_Z. Since Q, W forms a clique (see footnote 2), the fitted distribution and the true distribution are the same: p̂_Z(Q, W) = p(Q, W). Similarly, p̂_Z(X | Q, Z) = p(X | Q, Z), since we allow the variables X to form a clique. (If we are fitting a parametric rather than a non-parametric sub-model of M_Z then these last two equations will not hold; we return to this point below.) Note that for models within M_Z we have that

p̂_Z(X, W) = Σ_q p̂_Z(X, q, W) = Σ_q p̂(q, W) p̂(X | q, Z).

Now,

E_p log p̂_Z(Q | X, W) = E_p log p̂_Z(Q, X, W) − E_p log p̂_Z(X, W)

 = E_p log p̂_Z(X | Q, W) + E_p log p̂_Z(Q, W) − E_p log p̂_Z(X, W)

 (∗) = E_p log p̂_Z(X | Q, Z) + E_p log p̂_Z(Q, W) − E_p log p̂_Z(X, W)

 = E_p log p(X | Q, Z) + E_p log p(Q, W) − E_p log p̂_Z(X, W)

 = I(X; Z | Q) + E_p log p(X | Q) + E_p log p(Q, W) + KL(p(X, W) || p̂_Z(X, W)) − E_p log p(X, W)

where the step marked (∗) follows from the conditional independence assumption made in the model M_Z. Disregarding terms which do not depend on Z, we then see that the optimal Z is that which maximizes:

I(X; Z | Q) + KL(p(X, W) || p̂_Z(X, W)).    (4)

Thus, in words, we wish to find the set Z which maximizes the conditional mutual information between X and Z given Q, but which at the same time maximizes the KL divergence between p(X, W) and p̂_Z(X, W). Note that the only assumption that we have made so far is that

p̂_Z(X | Q, Z) = p(X | Q, Z)  and  p̂_Z(Q, W) = p(Q, W).

These assumptions will hold true if we are fitting a non-parametric model, e.g., as would be the case if we were fitting discrete Bayes nets. However, note that if p̂_Z(Q, W) ≠ p(Q, W), but all models M_Z agree on the distribution of p(Q, W) (i.e., they all use the same distribution over the variables Q and W), then the expression obtained in equation (4)

2 When integrating with respect to the true distribution p, if the model is non-parametric, it is only conditional independence statements which distinguish the model from the true distribution. For a clique, there are no independence statements.

would still select the optimal model under criterion (3), since in this case the term E_p log p̂_Z(Q, W) does not depend on Z. More formally, criterion (4) is correct if we are considering selecting among models:

M_Z = {p* | p*(x, q, w) = p_0(q, w) p*(x | q, z)},

where p_0(q, w) is a fixed distribution. This would be the case when learning structure in an HMM in which the model for p(Q, W) was already fixed.

7.3 The EAR criterion
Expanding the KL-divergence term in (4) which depends on Z, we obtain the following:

−E_p log p̂_Z(X, W) = −E_p log Σ_q p̂_Z(q, X, W)
 = −E_p log Σ_q p̂_Z(X | q, W) p̂_Z(q | W) − E_p log p̂_Z(W)
 = −E_p log Σ_q p̂_Z(X | q, Z) p̂_Z(q | W) − E_p log p̂_Z(W)
 = −E_p log Σ_q p(X | q, Z) p(q | W) − E_p log p(W),

where in the last line we use the fact that p̂_Z(X | Q, Z) = p(X | Q, Z), p̂_Z(Q | W) = p(Q | W), and p̂_Z(W) = p(W). If Q ⊥⊥ (W \ Z) | Z in the true distribution p, so that p(q | W) = p(q | Z), then the sum in the last expression becomes:

−E_p log Σ_q p(X | q, Z) p(q | Z) = −E_p log p(X | Z)
 = −I(X; Z) − E_p log p(X)

Thus if Q⊥⊥(W \ Z) | Z in the true distribution then maximizing (4) is equivalent to maximizing:

EAR(Z) = I(X; Z | Q) − I(X; Z) (5)

This is the Explaining Away Residual (EAR) criterion proposed by Bilmes (1998). Since

I(X; Z | Q) − I(X; Z) = I(X; Q | Z) − I(X; Q)

and the last term on the RHS does not depend on Z, a criterion which is equivalent to the EAR criterion results from selecting the set Z which maximizes: EAR*(Z) = I(X; Q | Z). If Q ⊥⊥ (W \ Z) | Z does not hold in the true distribution p, then we are no longer guaranteed that the set Z which optimizes the EAR criterion (5) will optimize our objective (3). Though this independence is unlikely to hold exactly in the true distribution, it may hold approximately in contexts where the features W relate to noise, which is independent of the state Q. One obvious approach to assessing whether or not Q ⊥⊥ (W \ Z) | Z would be to calculate I(Q; W \ Z | Z) for different choices of Z. In particular, it would seem to be of concern if I(Q; W \ Z_EAR | Z_EAR) >> 0, where Z_EAR is the set which optimizes the EAR criterion. Note that the EAR criterion (5) will also be equivalent to (3) in the case where X ⊥⊥ Q | Z in the true distribution p, since in that case

E_p log Σ_q p(X | q, Z) p(q | W) = E_p log p(X | Z).

However, this independence seems rather unlikely to hold in practice.
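For scalar, discretized features, both terms of the EAR measure (5) can be estimated directly from empirical counts. The following is a small sketch of such an estimator (our own; the workshop used its own mutual-information tools, and the quantile-binning choices here are assumptions):

import numpy as np

def mutual_info(joint):
    """I(A;B) in nats from a joint count table of shape (|A|, |B|)."""
    p = joint / joint.sum()
    pa = p.sum(axis=1, keepdims=True)
    pb = p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / (pa @ pb)[nz])).sum())

def ear(x, z, q, bins=16):
    """Estimate EAR(Z) = I(X;Z|Q) - I(X;Z) for scalar samples x, z and labels q."""
    edges = lambda v: np.quantile(v, np.linspace(0, 1, bins + 1)[1:-1])
    xq, zq = np.digitize(x, edges(x)), np.digitize(z, edges(z))
    hist = lambda a, b: np.histogram2d(a, b, bins=(bins, bins),
                                       range=((0, bins), (0, bins)))[0]
    i_xz = mutual_info(hist(xq, zq))                  # unconditional term
    i_xz_q = 0.0                                      # conditional term
    for c in np.unique(q):
        m = (q == c)
        i_xz_q += m.mean() * mutual_info(hist(xq[m], zq[m]))
    return i_xz_q - i_xz

# X and Z below share mostly class information, so the estimated EAR is negative:
rng = np.random.default_rng(1)
q = rng.integers(0, 4, 5000)
x = q + 0.5 * rng.standard_normal(5000)
z = q + 0.5 * rng.standard_normal(5000)
print(ear(x, z, q))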

7.4 Class-specific EAR criterion
The EAR criterion may be adapted to select class-specific sets of parents, 'switching parents', as follows:

EARq(Z) = I(X; Z | Q = q) − I(X; Z)

Class-specific sets of parents allow the corresponding statistical model to encode ‘context-specific independence’ (CSI) constraints, of the form: X⊥⊥Y | Z = z

7.5 Optimizing the EAR criterion: heuristic search
In principle, optimizing the EAR criterion 'simply' requires us to calculate I(X; Z|Q) and I(X; Z). In practice, when Z represents a set of covariates, the calculation of these quantities for a large speech corpus is computationally intensive. Since each feature vector typically contains between 20 and 40 components, and there are thought to be long-range dependencies at time lags of up to 150 ms (with 10 ms per time frame), the set W of potential parents for a given X variable may contain thousands of candidate covariates. Under these circumstances it is infeasible to optimize the EAR criterion directly. This motivates the use of heuristic search procedures. A greedy algorithm provides a simple heuristic search procedure for finding a set of size k:

(a) Set Z = ∅.
(b) Find the variable U ∈ W \ Z for which EAR(Z ∪ {U}) is maximized. Add U to Z.
(c) Repeat step (b) until Z has dimension k.
(A sketch of this greedy procedure appears below.)
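The sketch assumes a callable ear_score(Z) that returns the (vector) EAR value of a candidate parent set; the function names are our own:

def greedy_ear_selection(candidates, ear_score, k):
    """Greedily grow a parent set Z of size k by maximizing EAR(Z ∪ {U})."""
    Z, remaining = [], set(candidates)
    while len(Z) < k and remaining:
        best = max(remaining, key=lambda u: ear_score(Z + [u]))
        Z.append(best)
        remaining.remove(best)
    return Z

Each step scores every remaining candidate with a vector mutual-information estimate, which is exactly the cost that motivates the scalar approximations of Section 7.6.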

7.6 Approximations to the EAR measure
The approach just described still suffers from the disadvantage that calculation of the EAR measure requires calculation of joint mutual information between vectors of variables, which in turn requires multivariate joint densities to be computed. This was infeasible for the speech recognition tasks that were considered during the Johns Hopkins workshop, given the time and resources that were available. Consequently we investigated simple approximations to the EAR measure that only required the calculation of information between scalars.

7.6.1 Scalar approximation 1

The first variable Z1 was selected by evaluating the EAR criterion, which for a single variable does not require evaluating information between vectors of variables. The second variable Z2 was found by taking the variable with the highest EAR value, among the remaining variables (W \ {Z1}), which at the same time did not appear 'redundant', in that it satisfied at least one of the following inequalities:

I(X; Z2 | Q) > I(X; Z1 | Q)   or   I(X; Z2 | Q) > I(Z1; Z2 | Q)    (6)

This is depicted in Figure 23, where edge thickness roughly corresponds to mutual-information value.


Figure 23: Heuristic: conditions under which a second variable Z2 is not considered redundant with the first variable Z1 added. Edge thickness corresponds roughly with mutual-information magnitude.

The motivation behind verifying these inequalities was as follows: if

Z2⊥⊥X | Z1,Q

then adding Z2, in addition to Q and Z1, as a parent of X will not change the resulting model. This conditional independence implies the reverse inequalities via the information processing inequality:

I(X; Z2 | Q) ≤ I(X; Z1 | Q) and I(X; Z2 | Q) ≤ I(Z1; Z2 | Q)

Consequently, if at least one of the inequalities (6) holds, then the conditional independence cannot hold. Similarly, if a third variable is required, then we selected the variable with the highest EAR measure which satisfies at least one of the inequalities

I(X; Z3 | Q) > I(X; Zi | Q)   or   I(X; Z3 | Q) > I(Zi; Z3 | Q)   for each of i = 1 and i = 2.

However, in practice we found that this rule was not helpful in selecting additional parents. We also considered schemes in which these redundancy tests were traded off. However, these schemes did not lead to notable improvement in recognition. Note, however, that these redundancy criteria are not discriminative, in that they ask only for redundancy with respect to conditional mutual information, and not the EAR measure (and certainly not Equation 4). Therefore, it is not entirely surprising that these conditions showed little effect.

7.6.2 Scalar approximation 2
The second approximation method that we used was to rank variables based on the EAR measure, but simply to reject those for which the marginal information was too large; i.e., we eliminated from consideration those covariates Z* for which I(X; Z*) was greater than a threshold. Typically, this threshold was selected by calculating this quantity for all covariates Z* under consideration and then choosing an appropriate percentile.

7.6.3 Scalar approximation 3
The third approximation was the simplest of them all: choose the top n variables ranked by the EAR measure. This approach proved to perform about as well as the heuristics above, and was used for all word error experiments described in Sections 9 and 10.
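Given precomputed scalar tables ear[z] = I(X; z|Q) − I(X; z) and mi[z] = I(X; z) over the candidates, approximations 2 and 3 reduce to simple filtering and ranking, as in the following sketch (our own notation and an arbitrary percentile choice):

import numpy as np

def top_n_by_ear(ear, n):
    """Scalar approximation 3: the n candidates with the largest EAR values."""
    return np.argsort(ear)[::-1][:n]

def top_n_with_mi_threshold(ear, mi, n, percentile=90):
    """Scalar approximation 2: drop candidates whose marginal information
    I(X;Z) exceeds a percentile threshold, then rank the rest by EAR."""
    keep = np.flatnonzero(mi <= np.percentile(mi, percentile))
    return keep[np.argsort(ear[keep])[::-1][:n]]

ear = np.array([0.02, -0.01, 0.05, 0.03])
mi = np.array([0.40, 0.10, 0.90, 0.20])
print(top_n_by_ear(ear, 2))                  # [2 3]
print(top_n_with_mi_threshold(ear, mi, 2))   # [3 0]: candidate 2 rejected for its large I(X;Z)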

7.7 Conclusion
The main findings from the exploration of methods for discriminative learning of structure were as follows:
• The EAR method performed well in selecting a single additional parent for each feature;
• Adding class-specific parents via the class-specific EAR measure did not lead to significant improvements in performance. This stands in contrast, however, to previous work [6, 11] where a benefit was obtained with class-specific parents, using the EAR measure along with a randomized edge selection algorithm;
• To select additional parents it is necessary to evaluate mutual information between vectors of variables – it is not possible to judge the relevance of a variable simply by making comparisons between scalars.

8 Visualization of Mutual Information and the EAR measure

As described in Section 7, even the simplest scalar version of the EAR measure requires the computation of mutual information and conditional mutual information on a wide collection of pairs of scalar elements of speech feature vectors. Before we present new word error results using this measure, it is instructive in its own right to visualize such mutual information and the EAR measure on a set of quite different types of speech feature vectors. As will be seen, the degree to which these visualizations show large-magnitude EAR-measure values should correspond roughly to the degree to which discriminative HMM structure augmentation should improve word error performance in a speech recognition system. First, it must be noted that even the simple pairwise mutual information I(X; Z|Q), where X and Z are scalars, is obtained from a three-dimensional grid. X really means X_{t,i}, where t is the current time position and i is the ith element of the feature vector at time t, and Z really means Z_{t+τ,j}, where τ is the time lag between the current time position

and the candidate Z-variable, and j is the position in the vector at time t + τ. This is shown in Figure 24. In general, we consider the variables at time t the child variables (i.e., the child at time t and position i), and the variables at time t + τ the possible parent variables at position j. We call τ the lag.


Figure 24: EAR measure computation. The set of possible pairs of variables on which pair-wise mutual information (both conditional and unconditional) needs to be computed. This can often be thousands of pairs of variables.

It can therefore be seen that to visualize mutual information, conditional mutual information, or the EAR measure, it must be possible to visualize volumetric data, plotting the quantities as functions of i, j, and τ. One way to do this would be to represent slices through this volume at various fixed i, j, or τ. Another way might be to average across some dimension and project down onto a diagonal plane. This was used in [6]. For the purposes of the workshop, we found it simplest to average across a dimension and then project down onto one of the three planar axes, as depicted in Figure 25. This leads to three 2-dimensional color planes as described in the next three paragraphs. First, the j : τ plot is the average MI as a function of parent and time lag. This plot therefore shows the relative value of a parent overall at a given time lag τ and parent position j. If the value for a particular j, τ is large, then that parent at that time will be overall beneficial. Next, the i : τ plot is the average MI as a function of child and time lag. This plot shows the degree to which a child variable at position i at time t is "fed" useful information by all variables on average at time t + τ. This plot therefore shows the relative benefit of each time lag for each child. Finally, the i : j plot corresponds to the average MI as a function of parent and child. This plot therefore provides information about the most useful parent for each child overall, irrespective of time. If the MI value of a particular parent is low for a given child, then this parent will never be useful for that child. On the other hand, if the MI value is large, there will be many instances in the set of candidate variables where useful information about the child may be found. This view, of course, does not indicate the degree to which multiple parents might be redundant with one another, as only pair-wise scalar mutual information is calculated. Below, these three plots will be shown in a row, where the first j : τ plot is on the left, the second i : τ plot is shown in the middle, and the third i : j plot is shown on the right. Note that the above descriptions were in terms of mutual information (MI) (i.e., I(X; Z)). The three 2-dimensional plots could equally well describe conditional mutual information I(X; Z|Q), or the EAR measure I(X; Z|Q) − I(X; Z). In general, we have termed these plots "jet plots" for obvious reasons. We produced jet plots for differing feature sets (MFCCs, IBM LDA-MLLT features, and novel AM/FM-based features), differing corpora (Aurora 2.0, an IBM Audio/Video corpus, and the DARPA Speech in Noisy Environments One (SPINE-1) corpus), and differing hidden conditions (Q corresponding to phones, sub-phones, whole-words, and general HMM states). These comparisons in a single document allow the examination of the relative differences between different corpora, conditions, etc. The various corpora and features are described below.
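Once the chosen measure has been computed and stored as a three-dimensional array over (child position i, parent position j, lag index τ), the three projections are just averages over one axis. A minimal sketch (ours; the array M below is filled with placeholder random values):

import numpy as np

def jet_projections(M):
    """M[i, j, tau] holds a pairwise MI, conditional-MI, or EAR value for child
    position i, parent position j, and lag index tau.  Returns the j:tau,
    i:tau, and i:j projections shown in the jet plots."""
    parent_lag = M.mean(axis=0)    # j : tau  -- averaged over children
    child_lag = M.mean(axis=1)     # i : tau  -- averaged over parent positions
    parent_child = M.mean(axis=2)  # i : j    -- averaged over lags
    return parent_lag, child_lag, parent_child

M = np.random.rand(42, 42, 15)     # e.g., 42 features and 15 lags of 10 ms each
pj, il, ij = jet_projections(M)
print(pj.shape, il.shape, ij.shape)   # (42, 15) (42, 15) (42, 42)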

8.1 MI/EAR: Aurora 2.0 MFCCs
The first row of plots is given in Figure 26 (see footnote 3). This plot shows the conditional mutual information I(X; Z|Q) for the Aurora 2.0 corpus [56], where Q corresponds to different phones, as defined in the standard Aurora distribution. Aurora 2.0 is further described in Section 10.1. The plots are for MFCCs, and their first and second derivatives. Specifically,

3 If these plots are fuzzy and you are reading them in paper form, we suggest that you obtain an electronic PDF version of the document, with which it is possible to zoom in quite closely, as the plots are included within the document in reasonably high resolution.

Figure 25: EAR measure projections: the parent-lag (j : τ), child-lag (i : τ), and parent-child (i : j) plots.

the first 12 features correspond to cepstral coefficients c1 through c12, the next feature is c0, which is followed by log-energy. The deltas for these 14 features (in the same order) come next, followed by the double-deltas.

Figure 26: Average I(X; Z|Q) for Aurora 2.0 MFCCs, with Q a phone variable. Panels (left to right): parent position vs. time lag, child position vs. time lag, and parent position vs. child position.

The plots show a number of things. First, the left-most plot shows that the parents with the most mutual information in general come from c0 and log-energy, and their deltas (and to some extent double-deltas). Similarly, from the middle plot, the children which receive the most information from any parent at a given time are also c0 and log-energy (and deltas). This is confirmed by the third plot, which therefore states that, in absolute terms, most of the information is between c0 and log-energy and their time-lagged variants. There is also information between c0 (or log-energy) and its delta and double-delta versions, more at least than between the other features. Furthermore, most of the information is closer rather than farther away from the base time position t. It can also be seen from the diagonal lines in the right-most plot that features in general tend to share MI with lagged versions of themselves and with their derivatives. There is in fact information between different features, but it is at a lower magnitude than that involving c0 and log-energy, and is therefore difficult to see from this plot alone using this colormap. The EAR-measure (discriminative) version of these plots is shown in Figure 27. The first obvious thing to note is that discriminative MI is quite different from non-discriminative MI, suggesting that the choice of discriminative MI might have a significant impact on structure learning. In the left two EAR plots, for example, the same features that were valuable in the non-discriminative plots remain valuable, but only at much greater time lags. In fact, at the closer time lags these features seem to be some of the least discriminative edges to use. This trait extends to features other than c0 and log-energy: specifically, edges between neighboring features do not

tend to have an advantage over their distant counterparts. Moreover, looking at the right-most EAR plot, the lack of a clear diagonal indicates that edges between different feature positions appear to have more discriminative benefit than those between corresponding features.

Figure 27: Average I(X; Z|Q) − I(X; Z) (the EAR measure) for Aurora 2.0 MFCCs, with Q a phone variable.

Figure 28: Average I(X; Z|Q) for Aurora 2.0 MFCCs, with Q a phone-state variable.

Figure 29: Average I(X; Z|Q) − I(X; Z) (the EAR measure) for Aurora 2.0 MFCCs, with Q a phone-state variable.

The next set of plots is shown in Figure 28 and Figure 29. The plots appear to be similar to the ones above. This indicates that the change in hidden conditioning (i.e., Q moving from a phone random variable to a sub-phone HMM-state random variable) would not have a large impact on the structures that would be most beneficial in augmenting an HMM. Note also that the overall range of the EAR measure appears to be lower in the phone-state plots than in the phone plots. This might indicate that, conditioned on the phone-state, there would be less utility in an augmented structure. The next set of plots is shown in Figure 30 and Figure 31. These plots show the case when Q corresponds to an actual word (one of the words in the Aurora 2.0 vocabulary of size 11) rather than a phone or phone-state. Again, the

Figure 30: Average I(X; Z|Q) for Aurora 2.0 MFCCs, with Q a whole-word variable.

Figure 31: Average I(X; Z|Q) − I(X; Z) (the EAR measure) for Aurora 2.0 MFCCs, with Q a whole-word variable.

plots appear to be similar to the ones above. This indicates that the change in hidden conditioning (i.e., Q moving from a phone or sub-phone state random variable to a whole-word random variable) might not have a large impact on the structures that would be most beneficial in augmenting an HMM. Note, however, that the overall range of EAR values is in this case larger than in the two previous cases. This would appear to be encouraging.

Figure 32: Average I(X; Z|Q) for Aurora 2.0 MFCCs, with Q a whole-word state variable.

Lastly, for the Aurora 2.0 MFCC plots, Figure 32 and Figure 33 show the case where Q corresponds to whole-word states. That is, the HMM models in this case use entire words, but the conditioning set of the random variable Q corresponds to all of the possible states within each of the words. Once again, the patterns are the same, and in this case the EAR range seems to be the lowest of the set so far.

Figure 33: Average I(X; Z|Q) − I(X; Z) (the EAR measure) for Aurora 2.0 MFCCs, with Q a whole-word state variable.

Figure 34: Average I(X; Z|Q) for the IBM Audio-Visual corpus (audio only), LDA+MLLT features.

8.2 MI/EAR: IBM A/V Corpus, LDA+MLLT Features
The next set of plots corresponds to a heavily pre-processed set of feature vectors that were created at IBM Research. These features were computed on the IBM parallel Audio/Video corpus [49], although the plots shown here only include the audio portion of the corpus. While we had originally wanted to compute cross-stream dependencies between the audio and visual portions of the corpus, the 6-week time limitations prevented us from doing that. The audio stream features were obtained by training a linear discriminant analysis (LDA) transform on 9 frames of the cepstral coefficients, followed by a maximum likelihood linear transform (MLLT) [48]. The video stream features in this feature set were obtained by an LDA-MLLT transform of the pixels in a region of interest around the mouth, as described in [72]. The LDA-MLLT jet plots also demonstrated a striking difference between the EAR and non-discriminative MI plots. The first thing to note is that the magnitude of the MI plots is on average less than any seen so far with the

Figure 35: Average I(X; Z|Q) − I(X; Z) (the EAR measure) for the IBM Audio-Visual corpus (audio only), LDA+MLLT features.

Aurora MFCC plots. While this could be an issue with the data, it could also indicate that in general these features have less overall information available. The EAR plots are also informative in their range. It appears that most of the EAR values in these plots are negative, indicating that these features have been pre-processed to the point that they are likely to produce little if any gain when adding cross-observation edges to the model. Unfortunately, we did not generate the cross audio-video MI or EAR plots, which could show potentially useful and non-redundant cross-stream information.

8.3 MI/EAR: Aurora 2.0 AM/FM Features

Figure 36: Average I(X; Z|Q) for Aurora 2.0 Phonetact AM/FM features, with Q a whole-word state variable.

Figure 37: Average I(X; Z|Q) − I(X; Z) (the EAR measure) for Aurora 2.0 Phonetact AM/FM features, with Q a whole-word state variable.

We also applied novel AM/FM features, provided by Phonetact Inc., to this analysis. We applied these features to the Aurora whole-word state case. We computed AM (Amplitude Modulation) and FM (Frequency Modulation) features (see footnote 4). These are computed by dividing the spectrum into 20 equally spaced bands using multiple complex quadrature band-pass filters. For each neighboring pair of filters, the higher-band filter output is multiplied by the conjugate of the lower-band output. The result is low-pass filtered and sampled every 10 ms. The FM features are the sine of the angle of the sampled output, and the AM features are the log of the real component. Although we expect that these features could be improved by further processing (e.g. cosine transform, mean subtraction, derivative-concatenation), we used the raw features to provide the maximum contrast with MFCCs. The plots are shown in Figure 36 and Figure 37. In these plots, the FM features fill the lower half of the plot, while the AM features fill the upper half. The difference between discriminative and non-discriminative jet plots is perhaps most visible in this case. In the left two non-discriminative MI plots, the most MI is found in the AM Phonetact features–the top half of the features–at a time lag of up to -50. The FM features have relatively less MI. In the left two EAR plots, however, the AM features offer significant MI only at time lags earlier than around 50 ms, and the FM features go from offering little help in the MI case to potentially providing useful discriminative information in this case.

4 We thank Y. Brandman of Phonetact, Inc. for providing this technology.

The right-most MI plot shows a great deal of energy between parent and child AM features, especially when parents and children share the same feature number; the rest of the plot shows less. There appears to be little cross-information between the AM and FM features. In the third, discriminative plot, however, the entire plot shows little energy, and the regions have become fairly homogeneous. As will be seen in the experimental section, we use these features as conditional-only features (i.e., they are used as additional features W, as described in Section 7).

8.4 MI/EAR: SPINE 1.0 Neural Network Features
We also computed MI and EAR quantities on the SPINE 1.0 corpus using neural network-based features [97] (these are features where a three-layer multi-layer perceptron is trained as a phone classifier on a 9-frame window of speech features, and then the output of the network, before the non-linearity, is used as features). The two plots are shown in Figure 38 and Figure 39. In the MI case (Figure 38), we can see that most of the information about the features lies at 10 ms into the past (i.e., the previous frame). This is probably a result of the fact that the features are meant to predict phonetic category, and there is significant correlation between successive categorical phonetic classes (i.e., if a phone occurs at a frame, it is more likely than not to occur at the previous and next frame). Other than that, we can see that some of the features seemed to have more temporal correlation than others, possibly a result of the phonetic category of the feature (i.e., we would expect vowels to extend over a wider time region). Unfortunately, the original phonetic labels of the features were not available to us at the workshop, so we could not verify this hypothesis. Lastly, on the right-most plot, we see that the features seem to be most correlated with themselves. In the EAR plot case (Figure 39), the biggest difference seems to be that it is no longer the case that the previous frame (10 ms into the past) provides the most useful information, which is not surprising. The magnitudes of EAR for these plots also seemed encouraging. Unfortunately, time limitations prevented us from learning discriminative structure on these features.

Figure 38: Average I(X; Z|Q) for SPINE-1 neural network features.

Figure 39: Average I(X; Z|Q) − I(X; Z) (the EAR measure) for SPINE-1 neural network features.

9 Visualization of Dependency Selection


Figure 40: Dlink plots for the Aurora 2.0 corpus, MFCCs, where Q corresponds to whole-word HMM states. Only one parent per child is shown. 10ms time frames.

For many of the experiments that were performed during the workshop, we used the EAR measure (see Section 7, and visualized in Section 8) to form structurally discriminative models (Section 4). Some of the structures that were formed were ultimately used in WER experiments (to be reported in Section 10), but because of time constraints some of the structures were not used. In this section, we present some of the structures that were formed, and in doing so describe the format used to visualize these structures. Some of these figures will be described further in later sections, when the word errors of the resulting models are discussed. In general, the edge augmentation heuristics in Section 7.6 meant that the goal was to select the most discriminative relationships (edges) between past "parent" variables and the present "child" variables. A goal was to visualize which relationships were strongest, so we created so-called "d-link" (dependency link) plots. All of the d-link plots can be interpreted as follows: The horizontal axis indicates time, where the right-most position is time t, and moving to the left moves to position t + τ, further into the past. The axis is labeled with time frames, and each frame uses a 10 ms step; therefore, the axis is in units of 10 ms time chunks. The vertical axis indicates feature position; the meaning of each position depends on the features. Since most of the plots we show here are for MFCCs, the meanings (i.e., the relative MFCC feature order) are the same as those described in Section 8. For each child feature in the current time slice (time t, the right-most column of features), either the one or two most informative parents are shown. This means that either the top one or the top two parents are chosen according to the EAR measure, using the heuristic described in Section 7.6.3. The parents are shown by a colored rectangle at location t+τ and position j in the plot. In general, a darker parent rectangle indicates that a parent has a strong relationship with more than one child (yellow=1 child, green=2, blue=3, and black=4). Also, the thickness of the line between parent and child indicates the magnitude of the EAR measure for that parent-child pair. We now describe the main features of these plots. Figure 40 shows the Aurora 2.0 whole-word state MFCC d-link plot, where only the top parent according to the EAR measure for each child is displayed. There are a number of interesting features of this plot. First, the small number of purely horizontal edges suggests that child variables infrequently asked for lagged, past versions of themselves as parents. This is in contrast to the case if pure MI was used, where it would often be the case that a child would use as a parent the corresponding feature at, say, one time frame past. The cepstral feature c0 and


Figure 41: Dlink plots for the Aurora 2.0 corpus, MFCCs, where Q corresponds to whole-word HMM states. Two parents per child shown. 10ms time frames.

log-energy feature were two of the exceptions. Second, features c1 through c12 (features 0-11) very often took the second derivative of c1 through c12 (features 14-25) as parents, and vice versa. There was not a similar relationship to the first derivative of c1 through c12, nor was there a similar relationship between log-energy and either the first or second derivative of log-energy. This seems to indicate that 2nd-derivative features from the past are discriminatively informative about features at time t. Note also that the delta features rarely if ever used the non-delta features. As argued in Section 4, such edges would be among the least discriminative, and could potentially hurt performance. Third, the time lags between parents and children were somewhat surprising. There were more parents at a lag of τ = 3 than at shorter time lags, and there were more parents around a lag of τ = 9 to 12 than at lag times between 3 and 9. Finally, at long time lags (say over 100 ms), only the parent features c1 through c12 and log-energy were informative; derivative features were not. It appears therefore that overall long-term (+100 ms) contours usefully contribute to the discriminability of the class. Perhaps interestingly, these are typical lengths of syllables rather than phones. In general, we also created both one- and two-parent d-link plots, but we found that the number of parents did not have a strong effect on the underlying word-error results. Figure 41 shows the d-link plot in the case of Aurora 2.0 whole-word state MFCCs, where the top two parents according to the EAR measure for each child are displayed. Perhaps the main feature of this plot, relative to Figure 40, is that c0 and log-energy children continue to desire parents at long time scales. In this case, the parent goes back 150 ms! Figure 42 shows the d-link plots in the case where Q corresponds to a phoneme on Aurora 2.0, and again two parents per child are shown. For Aurora phone state MFCCs, the d-link plots were largely similar to the whole-word state plots. One of the few noticeable differences is that the smattering of features c1-c12 (features 0-11) at time lags from -9 to -13 found with whole-word state MFCCs was absent with phone state MFCCs. Figure 43 shows the same, but where Q is the phone-state (portions of a phoneme). Figure 44 shows the d-link plot when Q corresponds to entire Aurora 2.0 vocabulary words at one parent per child, and Figure 45 is the same with two parents per child. All of these plots show similar phenomena, namely those described above, and a continued desire for cepstral c0 and log-energy to have long-range parents. Figure 46 shows the d-link plots for the combined feature set of the Aurora 2.0 MFCCs (features 0-41) and the Phonetact AM and then FM features (range 42-90), in that order. The MFCC features have colored rectangles at time t indicating that they have child variables. The Phonetact features were used as purely conditional random variables


Figure 42: Dlink plots for the Aurora 2.0 corpus, MFCCs, where Q corresponds to individual phones. Two parents per child shown. 10ms time frames.

(i.e., they were only added as members of W, as described in Section 7). It is interesting to note that the Phonetact features appear to be more discriminative when used in this context than the MFCC features themselves were. Features 60 and 61 were particularly informative, with parents at a range of time lags. More details are provided in the next section.
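Constructing a d-link plot from the per-pair EAR values amounts to picking, for each child position, the one or two (parent position, lag) cells with the largest EAR value; the rest is drawing. A sketch under that assumption (our own code, operating on placeholder data):

import numpy as np

def select_dlinks(ear_volume, parents_per_child=1):
    """ear_volume[i, j, tau]: EAR value for child i and candidate parent (j, tau).
    Returns, for each child i, its top (parent position, lag index) cells."""
    links = []
    for i in range(ear_volume.shape[0]):
        flat = ear_volume[i].ravel()
        top = np.argsort(flat)[::-1][:parents_per_child]
        links.append([np.unravel_index(t, ear_volume[i].shape) for t in top])
    return links

ear_volume = np.random.rand(42, 42, 15)
print(select_dlinks(ear_volume, 2)[0])   # top-2 (j, tau) parents for child feature 0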

10 Corpora description and word error rate (WER) results
10.1 Experimental Results on Aurora 2.0
The experimental results described in this section focus on the Aurora 2.0 continuous digit recognition task [56]. The Aurora database consists of TIDigits data which has been additionally passed through telephone channel filters and subjected to a variety of additive noises. There are eight different noise types ranging from restaurant to train-station noise, and SNRs from -5dB to 20dB. For training, we used the "multi-condition" set of 8440 utterances that reflect the variety of noise conditions. We present aggregate results for test sets A, B, and C, which total about 70,000 test sentences [56]. We processed the Aurora data in two significantly different ways. In the first, we used the standard front-end provided with the database to produce MFCCs, including log-energy and c0. We then appended delta and double-delta features and performed cepstral mean subtraction, to form a 42-dimensional feature vector. In the second approach, we computed AM (Amplitude Modulation) and FM (Frequency Modulation) features as described in Section 8.3.
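As a minimal sketch of how such 42-dimensional vectors are assembled (our own illustration, not the Aurora front-end; the delta-regression width, the ordering of mean subtraction relative to the deltas, and the inclusion of energy in the mean subtraction are assumptions here):

import numpy as np

def deltas(feats, width=2):
    """Standard regression-based delta features for a (T, D) feature matrix."""
    T, _ = feats.shape
    padded = np.pad(feats, ((width, width), (0, 0)), mode="edge")
    num = sum(n * (padded[width + n:width + n + T] - padded[width - n:width - n + T])
              for n in range(1, width + 1))
    return num / (2 * sum(n * n for n in range(1, width + 1)))

def assemble_42dim(mfcc14):
    """mfcc14: (T, 14) static features (c1-c12, c0, log-energy).  Returns a
    (T, 42) matrix with deltas and double-deltas appended and utterance-level
    mean subtraction applied."""
    d = deltas(mfcc14)
    dd = deltas(d)
    full = np.hstack([mfcc14, d, dd])
    return full - full.mean(axis=0, keepdims=True)

print(assemble_42dim(np.random.randn(300, 14)).shape)   # (300, 42)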

10.1.1 Baseline GMTK vs. HTK result
To validate our structure-learning methods, we built baseline systems (with GMTK emulating an HMM), and then enhanced them with the discriminative structures shown above. The first set of experiments therefore consisted only of baseline numbers. In particular, since GMTK was entirely new software, we wanted to ensure that the baselines we obtained with GMTK matched the standard HTK Aurora 2.0 baseline provided with the Aurora distribution. This is shown in Figure 48. The figure shows word accuracies for several signal-to-noise ratios, and several different baseline systems: HTK with whole-word models as specified in [56] (each of the 11 vocabulary words uses 16 HMM states with no parameter tying between states, as in the Aurora 2.0 release). Additionally, silence and short-pause


Figure 43: Dlink plots for the Aurora 2.0 corpus, MFCCs, where Q corresponds to individual phone HMM states. Two parents per child shown. 10ms time frames.

models were used, with three silence states and the middle state tied to short-pause. All models were strictly left-to-right, and used 4 Gaussians per state for a total of 715 Gaussians. The GMTK baseline numbers simulated an HMM and either 1) used whole-word models, thereby mimicking the Aurora 2.0 HTK baseline, or 2) used tied mono-phone state models. We examine and compare these results in Figure 49 (numbers also shown in Table 1), which shows the improvements relative to the HTK baseline. These results show the specific absolute recognition rates for our GMTK baseline systems as a function of SNR, averaged across all test conditions. Also presented is the published baseline result [56] with a system that had somewhat fewer (546) Gaussians (the GMTK system used 4 Gaussians per state rather than 3 because we in this case set up the Gaussian splitting procedure to double the number of Gaussians after each split; in newer experiments, it was found that GMTK was slightly better than HTK with the exact same model structure [20], possibly because of GMTK's new Gaussian handling code). As can be seen, overall the results are comparable with each other. We also see that, in general, the word-state model seems to outperform the phone-state model (which is not unexpected since in small-vocabulary tasks, each word having its own entire model is often useful).

SNR (dB):   clean   20     15     10     5      0      -5
GMTK        99.2    98.5   97.8   96.0   89.2   66.4   21.5
PH          99.1    98.3   97.2   94.9   86.4   54.9   2.80
HP          98.5    97.3   96.2   93.6   85.0   57.6   24.0

Table 1: Word recognition rates of our baseline GMTK system as a function of SNR. PH are the GMTK phone models. HP is reproduced from [56].

10.1.2 A simple GMTK Aurora 2.0 noise clustering experiment
Another set of experiments that were run regarded simple noise clustering in the Aurora 2.0 database. Essentially, the structure in Figure 11 is augmented to include a single hidden noise variable per frame, as shown in Figure 50. The noise variable is observed during training (and indicates the noise type of the training data), and is hidden during testing, hopefully allowing the underlying noise-specific model to be best used in the right context. Figure 51 shows the results as "improvements" relative to the GMTK phone-based baseline and for different SNR


levels. Unfortunately, we do not see any systematic improvement in the results, and if anything the results get worse as the SNR increases. There are several possible reasons for this: first, the number of Gaussians in the noise-clustered model had increased relative to the baseline, and therefore each had less training data. Second, it was still the case that all noise conditions were integrated against during decoding. Lastly, this structure was obtained by hand and not discriminatively. Perhaps the data was indeed modeled better, but the distinguishing features of the words were not, as argued in Section 4.

Figure 44: Dlink plots for the Aurora 2.0 corpus, MFCCs, where Q corresponds to entire words. One parent per child shown. 10ms time frames.

10.1.3 Aurora 2.0 different features
Yet another set of GMTK baseline experiments we performed regarded the relative performance of different feature sets on Aurora 2.0, and the results are shown in Figure 52. The figure compares a system that uses MFCCs with a system that uses MFCCs augmented with other features (various combinations of raw AM and raw FM features). The AM/FM features were raw in that there was no normalization, smoothing (subtraction), discrete cosine transform, or any other post-processing that often shows improved results with Gaussian-mixture HMM-based ASR systems. Our goal was to keep the AM/FM features as unprocessed as possible, in the hope that the discriminative structure learning would find an appropriate structure over those features, rather than having the features (via post-processing) conform to that which a standard HMM finds most useful.

10.1.4 Mutual Information Measures
The next step of our analysis was a computation of the discriminative mutual information (i.e., the EAR measure) between all possible pairs of conditioning variables, as described in Section 8. Although we could compute this for hidden variables as well as observations (see also Section 12.2 on the beginnings of a new mutual-information toolkit which would solve this problem), for expediency and simplicity we focused on conditioning between observation components alone. Thus, the structures we present later are essentially expanded views of conditioning relationships among the individual entries of the acoustic feature vectors.


Figure 45: Dlink plots for the Aurora 2.0 corpus, MFCCs, where Q corresponds to entire words. Two parents per child shown. 10ms time frames.

10.1.5 Induced Structures
Using the method of Section 7, we induced conditioning relationships using both MFCCs and AM-FM features. In Figure 44, we show the induced structure for an MFCC system based on whole-word models, and using Q-values corresponding to words in the EAR measure. As expected, there is conditioning between C0 and its value more than 100 ms previously. In a second set of experiments, we used the AM-FM features as possible conditioning parents for the MFCCs; the induced conditioning relationships are shown in Figure 46. The first 42 features are the MFCCs; these are followed by AM features, and finally the FM features. This graph indicates that FM features provide significant discriminative information about the MFCCs.

10.1.6 Improved Word Error Rate Results

Table 2 presents the relative improvement in word error rate for several structure-induced systems. There are several things to note. The first is that significant improvements were obtained in all cases. The second is that structure induction successfully identified the synergistic information present in the AM-FM features, resulting in a significant improvement over raw MFCCs. The final point is that when we increased the size of a conventional system to the same number of parameters, its performance in high-noise conditions was much worse than that of the improved models. Thus, structure induction appears to improve performance in a robust way. These results are further summarized in Figure 53.

10.2 Experimental Results on IBM Audio-Visual (AV) Database

The IBM Audio-Visual database was collected at the IBM Thomas J. Watson Research Center before the CLSP summer workshop in 2000. The database consists of full-face frontal video and audio of 290 subjects uttering ViaVoice(TM) training scripts, i.e., continuous read speech with mostly verbalized punctuation and a vocabulary size of approximately 10,500 words. Transcriptions of all 24,325 database utterances, as well as a pronunciation dictionary, are provided. Details about this database can be found in [37].

[Figure 46 plot: "Aurora Whole-Word State Phonetact/MFCCs, Two Parents per Child"; feature index (0-80) versus time lag (-15 to 0 frames).]

Figure 46: Dlink plots for the Aurora 2.0 corpus, Phonetact features, where Q corresponds to whole-word states. Two parents per child shown. 10ms time frames.

        clean    20      15      10      5       0       -5
WWS     16.3     19.3    14.2    10.5    9.85    19.0    12.6
AMFM    10.4     9.73    6.91    4.29    7.05    17.4    15.5
WW      7.16     7.02    5.51    5.93    5.05    16.0    15.0
EP      18.9     6.56    14.7    10.7    7.16    5.09    1.20

Table 2: Percent word-error-rate improvement for structure-induced systems. WWS is a system where Q ranges over states; AMFM conditions MFCCs on AM-FM features; in WW, Q ranges over words; and EP is a straight Gaussian system with twice as many Gaussians as the baseline. For the WW and WWS systems, one parent per feature was used; in the AMFM case, two parents. EP has the same number of parameters as WW and WWS.

10.2.1 Experiment Framework

The audio-visual database has been partitioned into a number of disjoint sets in order to train and evaluate models for audio-visual ASR. The training set contains 35 hours of data from 239 subjects and is used to train all HMMs. A held-out set of close to 5 hours of data from 25 subjects is used to train HMM parameters relevant to audio-visual decision fusion. A test set of 2.5 hours of data from 26 subjects is used to test the trained models. There are also disjoint sets for speaker adaptation and multi-speaker HMM refinement experiments, but since we did not use those sets in our experiments due to time constraints, we refer readers to [37] for details.

Sixty-dimensional acoustic feature vectors are extracted from the audio data at a rate of 100 Hz. These features are obtained by a linear discriminant analysis (LDA) projection applied to a concatenation of nine consecutive feature frames, each consisting of a 24-dimensional discrete cosine transform (DCT) of mel-scale filter bank energies. LDA is followed by a maximum likelihood linear transform (MLLT) based data rotation. Cepstral mean subtraction (CMS) and energy normalization are applied to the DCT features at the utterance level, prior to the LDA/MLLT feature projection. In addition to the audio features, visual features are also extracted from the visual data. The visual features consist of a discrete cosine image transform of the subject's mouth region, followed by an LDA projection and MLLT feature rotation. They have been provided by the IBM participants for the entire database, are of dimension 41, and are synchronous with the audio features at a rate of 100 Hz.

[Figure 47 plot: "Spine, Linear MI, Hynek, One Parent per Child"; feature index (0-55) versus time lag (-10 to 0 frames).]

Figure 47: Dlink plots for the SPINE-1 corpus, neural-network features, where Q corresponds to HMM states. One parent per child shown. 10ms time frames.

10.2.2 Baseline

In the baseline audio ASR system, only the audio features are used for training. Context-dependent phoneme models are used as speech units, and they are modeled with HMMs having Gaussian-mixture class-conditional observation probabilities. These are trained by maximum likelihood estimation using embedded training by means of the EM algorithm. The baseline system was developed using the HTK toolkit, version 2.2, and the training procedure is similar to the one described in the HTK reference manual. All HMMs had 3 states except the short pause model, which had only one state. A set of 41 phonemes was used.

The first training step initializes the monophone models with single Gaussian densities. All means and variances are set to the global means and variances of the training data. Monophones are trained by embedded re-estimation using the first pronunciation variant in the pronunciation dictionary. A short pause model is subsequently added and tied to the center state of the silence model, followed by another 2 iterations of embedded re-estimation. Forced alignment is then performed to find the optimal pronunciation in the case of multiple pronunciation variants in the dictionary. The resulting transcriptions are used for a further 2 iterations of embedded re-estimation, which ends the training of the monophone models. Context-dependent phone models are obtained by first cloning the monophone models into context-dependent phone models, followed by 2 training iterations using tri-phone based transcriptions. Decision-tree based clustering is then performed to cluster phonemes with similar context and to obtain a smaller set of context-dependent phonemes. This is followed by 2 training iterations. Finally, Gaussian mixture models are obtained by iteratively splitting the mixtures to 2, 4, 8, and 12 components, and by performing two training iterations after each splitting. The resulting baseline audio-only system performance, obtained by rescoring lattices (i.e., trigram lattices provided by IBM with the log-likelihood value of the trigram language model on each lattice arc), is a word error rate (WER) of 14.44%.

10.2.3 IBM AV Experiments in WS'01

In our work during the 2001 summer workshop, we used the Graphical Models Toolkit (GMTK) ([9] and see Section 6). As described above, GMTK has the ability to rapidly and easily express a wide variety of models and to use them in as efficient a way as possible for a given model structure. Because of the rich features in the IBM AV database (both audio and video), we think GMTK is an ideal toolkit with which to study new model structures (rather than just an HMM) for this task.

[Figure 48 chart: word accuracy versus SNR (15, 10, 5, 0, -5 dB); series: GMTK WORD-STATE (715 Gaussians), GMTK PHONE-STATE (710 Gaussians), HTK (546 Gaussians).]

Figure 48: Baseline accuracy results on the Aurora 2.0 corpus with MFCC features. This plot compares the word accuracies of HTK (green) with those of GMTK (blue and red) at various signal-to-noise ratios (SNRs). GMTK results are provided both for whole-word models, shown in blue (the same models that were used for HTK), and for tied-phone models (shown in red).

10.2.4 Experiment Framework

During the summer workshop in 2000 [37], the use of an orthogonal source, namely visual features, was investigated under different conditions. However, that study remained within the commonly used HMM framework, despite the fact that visual features are very different from audio features. In order to study alternative ways of using visual features, we decided to use the same training/heldout/test data split as the previous workshop (described in the previous section). The same features (60-dimensional audio features, 41-dimensional visual features) were also used in our experiments.5 Therefore, the HMM audio-only system described in the previous section also serves as our baseline.

The decoder in GMTK requires a decoding structure in order to perform Viterbi decoding. While the lattices from the 2000 workshop could be converted to the required decoding structure, because of the 6-week time constraints we followed an n-best rescoring decoding strategy. Namely, using the lattices generated from a first-pass decoding, we first generate n-best lists off line. Subsequently, we rescore the n-best lists using various decoding structures with GMTK. We generated the top 100 hypotheses for each heldout and test utterance from the corresponding lattice.
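The rescoring step itself reduces to reranking hypotheses by a combined score. The sketch below is purely illustrative (it is not the workshop scripts, and the language-model scale factor lm_weight is an assumed tuning parameter): it combines a GMTK acoustic log-likelihood with the language-model score carried over from the first pass.

def rescore_nbest(hypotheses, acoustic_score, lm_weight=10.0):
    """Rerank an n-best list.

    hypotheses: list of (word_sequence, lm_logprob) pairs taken from the
                first-pass lattice / n-best generation.
    acoustic_score: function mapping a word sequence to the GMTK
                    log-likelihood of the observations given that sequence.
    Returns the word sequences sorted from best to worst combined score.
    """
    scored = []
    for words, lm_logprob in hypotheses:
        total = acoustic_score(words) + lm_weight * lm_logprob
        scored.append((total, words))
    scored.sort(reverse=True)
    return [words for _, words in scored]

# Toy usage with a fake acoustic scorer that prefers shorter hypotheses.
nbest = [(("the", "cat"), -12.0), (("the", "cat", "sat"), -10.5)]
print(rescore_nbest(nbest, acoustic_score=lambda w: -5.0 * len(w)))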

10.2.5 Matching Baseline

In order to test GMTK, we first tried to match the HMM baseline. We built a graphical-model-based system simulating an HMM as our training structure. All parameters, including the number of states, the number of Gaussian densities per state, the mixture weights, the means and variances, and the transition probabilities, were taken from the HTK (baseline) model. This model had 10,738 states with 12 mixture components per state. This model was then used to perform n-best rescoring on the test data. Because the short pause model in the HTK system has only one state, which is tied to the middle state of the silence model (the transition probabilities are not tied), we used state sequences as input for the n-best rescoring.

5 However, we ran out of time at the end of the workshop before we could integrate the visual features.

[Figure 49 chart: relative WER improvement versus SNR (clean, 20, 15, 10, 5, 0, -5 dB); series: GMTK WORD-STATE (715 Gaussians), GMTK PHONE-STATE (710 Gaussians), HTK (546 Gaussians).]

Figure 49: Baseline accuracy results on the Aurora 2.0 corpus with MFCC features. This is essentially the same plot as Figure 48, but shows relative improvements rather than absolute word accuracy.

The transition probabilities of the short pause state were replaced by the transition probabilities of the middle state of the silence model. The resulting WER (14.5%) showed that we can simulate an HMM with GMTK and achieve similar results.

10.2.6 GMTK Simulating an HMM

Along with the audio and visual features, we also have a pronunciation lexicon and a word-to-HMM-state-sequence mapping, both from IBM. The HMM states are tied states from a decision-tree clustering procedure. This decision tree is built for context-dependent phoneme models, and 4 neighboring phones are used as context. There are in total 2,808 different states for the 13,302 words in the vocabulary. We trained HMM models based on the word-to-state-sequence mapping and the training data. At first, 5 iterations of EM training were carried out assuming a single Gaussian density for each state. Then each Gaussian was split into two, followed by another 5 iterations of EM training. The splitting continued until we had 16 Gaussian densities in each state. Finally, Gaussian densities with mixture weight close to zero were merged with the closest densities during another 5 iterations of EM training. The resulting HMM models had 2,808 states, each with about 16 Gaussian densities. The total number of Gaussian densities was 45,016, which is much smaller than in the HTK baseline (128,856 Gaussian densities). On the test data, this system gave a WER of 17.9%. We believe the difference between this result and the baseline was due to the smaller number of Gaussian densities. However, due to the time constraints of the workshop, we did not carry out experiments to match the baseline.
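The split-and-merge schedule described above can be sketched as follows. This is an illustrative re-implementation in Python, not the actual training code: each Gaussian is cloned by perturbing its mean by a fraction of its standard deviation (in the spirit of the meanCloneSTDfrac parameter that appears in the training header files of Section 12.1.3), and components whose weights fall near zero are folded into their nearest surviving neighbor.

import numpy as np

def split_mixture(weights, means, variances, frac=0.25):
    """Double the number of components: clone each diagonal Gaussian,
    perturbing its mean by +/- frac * std and halving its weight."""
    std = np.sqrt(variances)
    new_means = np.concatenate([means - frac * std, means + frac * std])
    new_vars = np.concatenate([variances, variances])
    new_weights = np.concatenate([weights / 2.0, weights / 2.0])
    return new_weights, new_means, new_vars

def merge_small(weights, means, variances, min_weight=1e-3):
    """Fold components with negligible weight into the closest surviving
    component (closest in Euclidean distance between means)."""
    keep = weights >= min_weight
    for i in np.where(~keep)[0]:
        dists = np.linalg.norm(means[keep] - means[i], axis=-1)
        j = np.where(keep)[0][np.argmin(dists)]
        weights[j] += weights[i]
    return weights[keep], means[keep], variances[keep]

# Toy usage: a two-component, two-dimensional diagonal-covariance mixture.
w = np.array([0.6, 0.4])
m = np.array([[0.0, 0.0], [1.0, 1.0]])
v = np.array([[1.0, 1.0], [2.0, 2.0]])
print(split_mixture(w, m, v)[0])    # -> [0.3 0.2 0.3 0.2]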

10.2.7 EAR Measure of Audio Features

The HMM models with 2,808 states and 16 Gaussian mixture components per state were used to calculate the discriminative mutual information, or the EAR (explaining-away residual) measure [6], between all possible pairs of conditioning variables. For expediency and simplicity, only dependencies between observation components (feature components) were computed in our experiments.

[Figure 50 diagram: per-frame variables Position, Transition, Phone, Noise Condition, and Observations.]

Figure 50: A simple noise model with a single noise-clustering hidden variable at each time step.

As described in Section 8, there are three dimensions in visualizing the scalar version of the EAR measure. From Figure 35, we can see that there is hardly any information between any two of the audio feature components we studied. Recalling how the audio features were extracted, a possible reason arises: LDA, as a linear discriminative projection, might be removing the discriminative information between feature components. As a result, if we look at the audio features alone, it is almost impossible to find any useful correlation between two feature components, even though the MI calculations should include a non-linear component of the discriminative information. It appears, however, that in this case the linear discriminant transformation was sufficient to remove this. This hypothesis, of course, should be more thoroughly verified, i.e., whether an LDA transformation on multiple windows (9 in this case) of feature vectors removes the potential for discriminative cross-observation linear dependencies.

The visual features, on the other hand, were extracted and processed independently of the audio features. There should be more information between a visual feature component and an audio feature component. Therefore, a plot of the EAR measure including the visual features is very desirable. Unfortunately, we could not compute such an EAR measure before the end of the workshop, due to the high computational demands of our Aurora experiments. However, because of the flexibility of GMTK, we believe that once the computation is done we can quickly induce interesting structures and experiment with the new discriminative models.

10.3 Experimental Results on SPINE-1: Hidden Noise Variables

We ran several additional experiments on the SPINE-1 speech-in-noisy-environments database using a noise-clustering model as described in Figure 50. Again, the noise variable was observed during training and hidden during recognition, and the various types of SPINE noises were clustered together at different degrees of granularity. We generated an n-best list with a separate HTK-based ASR system and rescored it using the trained GMTK-based models. The results are shown in Figure 54. As can be seen, the WER gets worse as the number of noise clusters increases. Given this result, and the noise-variable results above for Aurora 2.0, it appears that simply adding a noise variable to a graphical model, without concentrating on discriminability, is not guaranteed to help (and is likely to hurt) performance. Due to time constraints, in the 6 weeks of the workshop we did not attempt any experiments that tried to optimize a discriminative structure on SPINE-1, as would be suggested by Figure 47.

11 Articulatory Modeling with GMTK

Another goal of the graphical modeling team was to apply GMTK to articulatory modeling, i.e., the modeling of the motion of the speech articulators, either in addition to or instead of phones. The ultimate goal was to use the articulatory structure as a starting point from which to produce a more discriminative structure. Due to time limitations, however, during the workshop we focused only on the base articulatory structure. We use the term articulatory modeling to mean either the explicit modeling of particular articulators (tongue tip, lips, jaw, etc.) or the indirect modeling of articulators through variables such as manner or place of articulation. The motivations for using such a model come from both linguistic considerations and experiments in speech technology.

[Figure 51 chart: relative WER improvement versus SNR (clean, 20, 15, 10, 5, 0, -5 dB); series: Phone Noise-Clustered, Baseline Phone.]

Figure 51: Baseline phone-based GMTK numbers compared to a simple model that uses noise clustering (i.e., includes a hidden variable for different types of noise, as seen in Figure 50).

On the linguistics side, current theories of phonology, referred to as autosegmental phonology [47], hold that speech is produced not from a single stream of phones but from multiple streams, or tiers, of linguistic features. These features (not to be confused with the same term in pattern recognition, or with acoustic features in an ASR system) can evolve asynchronously and do not necessarily line up to form phonetic segments. Autosegmental phonology does not provide a specific set of tiers, but various authors have posited that the set includes features of tone, duration, and articulation.

On the speech technology side, there is mounting evidence that a phone-based model of speech is inadequate for recognition, especially for spontaneous, conversational speech [83]. Researchers have noted, for example, the difficulty of phonetically transcribing conversational speech, and especially of locating the boundaries between phones [39]. Furthermore, while it has been observed that pronunciation variability accounts for a large part of the performance degradation on conversational speech [50, 74, 75], efforts to model this variability with phone-based rules or additional pronunciations have had very limited success [74, 91]. One possible reason for this is that phonemes affected by pronunciation rules often assume a surface form intermediate between the underlying phonemes and the surface phones predicted by the rules [93]. Such intermediate forms may be better represented as changes in one or more of the articulatory features.

There have been several previous efforts to use articulatory models for speech recognition. One such effort has been by Deng and his colleagues (e.g., [31, 32]). In their experiments, Deng et al. use HMMs in which each state corresponds to a different vector of articulatory feature values. Their experiments explore different ways of constructing this state space using linguistic and physical constraints on articulator evolution, as well as different ways of modeling the HMM observation distributions. A similar model was used by Richardson et al. [88, 89]. Kirchhoff [65] used neural networks to extract articulatory features and then mapped these values to words, and in at least one case [64] allowed the articulators to remain unsynchronized except at syllable boundaries. There have been other attempts to use such features at the lower levels of recognition, typically by first extracting articulatory feature values using neural networks or statistical models, and then using these values instead of or in addition to the acoustic observations for phonetic classification [35, 80].

[Figure 52 chart: word accuracy versus SNR (clean, 20, 15, 10, 5, 0, -5 dB); series: MFCC, MFCC_F, MFCC_A-F, RAW_FM, RAW_AM, RAW_AM_FM, RAW_PT.]

Figure 52: Absolute results on Aurora 2.0 at different SNRs using GMTK with different feature sets consisting of either just MFCCs (the blue plot) or feature vectors consisting of MFCCs augmented with other features (so larger feature vectors). MFCC = just MFCCs, MFCC F = MFCCs + FM features, MFCC A-F = MFCCs + AM/FM features, RAW FM = just raw un-preprocessed FM features (i.e., no deltas, double deltas, cosine transform, smoothing, mean normalization, etc.), RAW AM = just raw un-preprocessed AM features, RAW AM FM = raw AM and FM features together.

One difficulty in using articulatory models for speech recognition is that the most commonly used computational structures (hidden Markov models, finite-state transducers) allow for only one hidden state variable at a time, whereas articulatory models involve several variables, one for each articulatory feature. While it is possible to implement articulatory models with single-state architectures by encoding every combination of values as a separate state (as in [31, 88, 89]), the mapping from the inherent multi-variable structure to the single-variable encoding can be cumbersome, and the resulting model can be difficult to interpret and manipulate.

Graphical models (GMs) are therefore a natural tool for investigating articulatory modeling. Since they allow for an arbitrary number of variables and arbitrary dependencies between them, the specification of articulatory models in terms of GMs is fairly direct. The resulting models are easy to interpret and modify, allowing for much more rapid prototyping of different model variants. The next section gives some background on articulatory modeling and our articulatory feature set. We then describe the progress made during the workshop, including the construction of a simple articulatory graphical model.

11.1 Articulatory Models of Speech

Figure 55 shows a diagram of the vocal tract with the major articulators labeled. The ones we are most concerned with are the glottis, velum, tongue, and lips. The glottis is the opening between the vocal folds (or vocal cords), which may vibrate to create a voiced (or pitched) sound or remain spread apart to create a voiceless sound. The position of the velum determines how much of the airflow goes to the oral cavity and how much to the nasal cavity; if it is lowered so that air flows into the nasal cavity, a nasal sound (such as [m] or [n]) is produced. The positions of the tongue and lips affect the shape of the oral cavity and therefore the spectral envelope of the output sound. Articulatory features keep track of the positions of the articulators over time.

[Figure 53 chart: relative WER improvement versus SNR (clean, 20, 15, 10, 5, 0, -5 dB); series: WWS_DQ_1MAX, WW_Phonetact_DQ_2MAX, WW_DQ_1MAX, WWS_DQ_VAN_1MAX, GMTK Word Baseline (2x), GMTK Word Baseline.]

Figure 53: Summary of the relative improvements on the Aurora 2.0 corpus. WWS DQ 1MAX = Aurora 2.0 with whole-word states (i.e., Q = whole-word states), using the edges selected by taking the single best according to the EAR measure. WW Phonetact DQ 2MAX = results where the AM/FM features are conditional random variables and the top two edges according to the EAR measure are chosen. WW DQ 1MAX = whole-word models (Q in the EAR measure ranges over words), choosing the single best edge. WWS DQ VAN 1MAX = results similar to the WWS DQ 1MAX case, except that the GMTK vanishing ratio was set so that the total number of final parameters was the same as in the HMM baseline. GMTK Word Baseline (2x) is the GMTK HMM baseline with twice the number of Gaussians (this model had 4/3 times the number of parameters of the augmented-structure models). Finally, GMTK Word Baseline is the standard GMTK-based HMM baseline.

11.2 The Articulatory Feature Set

We can imagine a large variety of feature sets that could be used to represent the evolution of the vocal tract during speech. For example, in [22], Chomsky and Halle advocate a system of binary features. Some of these features are more physically based, such as voiced, and some are more abstract, such as tense. Other speech scientists, such as Goldstein and Browman [15], advocate more physically based, continuous-valued features such as lip constriction location/degree and velum constriction degree. Unfortunately, we are not aware of any well-established, complete feature set in the speech science literature that is well suited to our task. As has been done in previous work on articulatory-based recognition [65, 31], we have drawn on existing feature sets to construct one that is appropriate for our purposes. The set we have used to date is shown in Table 3. It was based on considerations such as state space size and coverage of the phone set used in our Aurora experiments. It does have some drawbacks, however, such as the lack of a representation of the relationship between certain vowel features and certain consonant features; for example, in order to model the fact that an alveolar consonant can cause the fronting of adjacent vowels, we would need to include a dependency between place and tongueBodyLowHigh.

[Figure 54 chart: WER (roughly 34% to 38.5%) versus number of noise clusters (0, 3, 6).]

Figure 54: Results on SPINE 1.0 with various degrees of noise-type clusters. As the number of clusters increases, the error gets worse.

Feature name          Allowed values                              Comments
voicing               off, on                                     "off" refers to voiceless sounds, "on" to voiced sounds
velum                 closed, open                                "closed" refers to nasal sounds, "open" to non-nasal sounds
manner                closure, sonorant, fricative, burst         "closure" refers to a complete closure of the vocal tract,
                                                                  e.g. the beginning part of a stop; "burst" refers to the
                                                                  turbulent region at the end of a stop.
place                 labial, labio-dental, dental, alveolar,     The location of the oral constriction for consonant sounds;
                      post-alveolar, velar, nil                   "nil" is used for vowels
retroflex             off, on                                     "on" refers to retroflexed sounds
tongueBodyLowHigh     low, mid-low, mid-high, high, nil           Height of the tongue body for vowels; "nil" is used for consonants.
tongueBodyBackFront   back, mid, front, nil                       Horizontal location of the tongue body for vowels; "nil" is used for consonants.
rounding              off, on                                     "on" refers to rounded sounds

Table 3: The feature set used in our experiments.
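As a rough indication of scale (a back-of-the-envelope count of our own, not a number from the workshop), encoding every combination of the values in Table 3 as a single hidden state, as a conventional single-variable HMM encoding would require, gives

2 (voicing) x 2 (velum) x 4 (manner) x 7 (place) x 2 (retroflex) x 5 (tongueBodyLowHigh) x 4 (tongueBodyBackFront) x 2 (rounding) = 8960

possible value combinations per frame before any linguistic constraints are applied, which illustrates why the factored, multi-variable representation discussed above is more convenient to specify and manipulate than a single-variable encoding.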

11.3 Representing Speech with Articulatory Features

The distinctions between describing speech in terms of phones and describing it in terms of articulatory feature streams can be seen through some examples.

First consider the case of the word warmth. The two ways of representing the canonical pronunciation of this word, as a string of phones and as parallel strings of features, are shown in Figure 56. In the articulatory representation, if a speaker goes through the parallel streams synchronously, he will produce the same phone string as in the phone-based representation. If, however, the features are not perfectly synchronized, this may produce some other phones, or some sounds that are not in the standard phone inventory of English at all. The same can occur if the features remain synchronized but do not reach their target values at some points.

To see the possible advantage of the feature-based approach, consider the part of the word where the speaker is transitioning from the [m] to the [th]. The articulators must perform the following tasks: the velum must rise from its position for the nasal [m] to that for the non-nasal [th]; the vocal folds must stop vibrating; and the lips must part and the tongue tip move into an interdental position for the [th]. If all of the articulators move synchronously, then the phones [m] and [th] are produced as expected. However, the articulators may reach their target positions for the [th] at different times. One common occurrence in the production of this word is that the velum may be raised and voicing turned off before the lips part.

Figure 55: A midsagittal section showing the major articulators of the vocal tract. Reproduced from http://www.ling.upenn.edu/courses/Spring 2001/ling001/phonetics.html.

In that case, there is an intermediate period during which a [p]-like sound is produced, and the uttered word then sounds like warmpth. One way to describe this within a phone-based representation is to say that the phone [p] has been inserted. However, this does not express our knowledge of the underlying process that resulted in this sound sequence. Furthermore, a [p] produced in this way is likely to be different from an intentional [p], for example by having a shorter duration.

Another example is the phenomenon of vowel nasalization, which can occur when a nasal sound follows a vowel (as in hand). If the velum is prematurely lowered, then the end of the vowel takes on a nasal quality. A phone-based description of this requires that we posit the existence of a new phone, namely the nasalized vowel. If we wish to express the fact that only the latter part of the vowel is nasalized, we need to represent the vowel segment as a sequence of two phones, a non-nasalized vowel and a nasalized one. A similar example is the early devoicing of a phrase-final voiced consonant, which would again require description in terms of two phones, one for the voiced part and one for the voiceless part.

11.4 Articulatory Graphical Models for Automatic Speech Recognition: Workshop Progress

Figure 57 shows the structure of an articulator-based graphical model developed during the workshop. Each of the articulatory variables can depend on its own value in the previous frame, as well as on the current phone state. The dependency on the previous frame is intended to model continuity and inertia constraints on feature values. For example, a feature is likely to retain the same value over multiple frames rather than jump around, and most features cannot go from one value to a very different one without going through the intermediate value(s) (e.g., tongue height cannot go from "low" to "high" without going through "mid").

UWEETR-2001-0006 60 phones w ao r m th voicing on off velum cl op cl manner son fric place lab nil post-alv lab dent tongueBodyLowHigh high low mid-low nil tongueBodyBackFront back mid nil rounding on off

Figure 56: Two ways of representing the pronunciation of the word warmth. The top line shows the phone-based representation; the remaining lines show the different streams in the feature-based representation, using the features and (abbreviated) feature values defined in Table 3.

We also constructed an alternate version of the model, in which the articulatory variables also depend on each other's values in the current frame; these dependencies are not shown in the figure for clarity of presentation. The switching dependency from the phone state to the observation is used only for special handling of silence: if the current frame is a silence frame (i.e., the phone state variable is in one of the silence states), the observation depends only on the phone; otherwise, the observation depends only on the articulatory variables. This was done in order to avoid assigning specific articulatory values to silence.
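A minimal sketch of this switching behavior, under our own naming assumptions and with a single diagonal Gaussian standing in for each Gaussian mixture (this is not GMTK structure-file syntax), is:

import numpy as np

class DiagGaussian:
    """Stand-in for a Gaussian mixture: a single diagonal Gaussian."""
    def __init__(self, mean, var):
        self.mean, self.var = np.asarray(mean), np.asarray(var)
    def logpdf(self, x):
        x = np.asarray(x)
        return float(-0.5 * np.sum((x - self.mean) ** 2 / self.var
                                   + np.log(2 * np.pi * self.var)))

def observation_logprob(frame, phone_state, artic_values,
                        silence_states, phone_models, artic_models):
    """Switching dependency: silence frames are scored against a per-phone
    model; all other frames are scored against the model indexed by the
    current combination of articulatory values."""
    if phone_state in silence_states:
        return phone_models[phone_state].logpdf(frame)
    return artic_models[tuple(artic_values)].logpdf(frame)

# Toy usage with two-dimensional observations.
phone_models = {"sil": DiagGaussian([0.0, 0.0], [1.0, 1.0])}
artic_models = {("on", "closed"): DiagGaussian([1.0, -1.0], [0.5, 0.5])}
print(observation_logprob([0.2, -0.1], "sil", ("on", "closed"),
                          {"sil"}, phone_models, artic_models))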

[Figure 57 diagram: for frames i-1, i, and i+1, each frame contains word, word-transition, word-position, phone-state, phone-transition, the articulatory variables a1 ... aN, and the observation O.]

Figure 57: An articulatory graphical model. The structure from the phone state and above is identical to the phone- based recognizer. The articulatory variables are denoted a1, . . . , aN , and the observation variable is denoted o. The special dependencies for the last frame are not shown but are identical to those in the phone-based model.

The observation variable's conditional probability is implemented via a Gaussian mixture for each allowed combination of values, plus additional Gaussian mixture models for the silence states. All of the other variables are discrete with discrete parents, so their conditional probabilities are given by (multidimensional) probability tables. The probability tables for the phone variable are constructed identically to those of the 3-state phone-based model we used in our phone-based Aurora experiments. The probability tables of the articulatory variables determine the extent to which they can stray from their canonical values for each phone state and the extent to which they depend on their past values. For example, we can make the articulatory variables depend deterministically on the phone state by constructing their probability tables such that zero probability is given to all values except the canonical value for the current phone state; the model then becomes equivalent to our phone-based recognizer. We in fact ran this experiment as a sanity check.

For the general case of nondeterministic articulatory values, we needed to construct reasonable initial settings for the probability tables of the articulatory variables, while avoiding entering each initial probability value by hand. The following procedure was therefore used. We first defined a table of allowed values for each articulator in each phone state, with a probability for each allowed value, as shown in Table 4. We then defined a table of transition probabilities for each articulator given its previous value, as shown in Table 5. The final probability table for each articulatory variable, representing the probability of each of its possible values given its last value and the current phone, was constructed by multiplying the appropriate entries in the two tables and normalizing to ensure that the probabilities sum to one; a short sketch of this combination step follows Table 5 below.

Phone state   voicing                 velum                        ...
ah0           on (0.9), off (0.1)     closed (0.2), open (0.8)     ...
n0            on (1.0)                closed (1.0)                 ...
...

Table 4: Part of the table mapping phone states to articulatory values. According to this table, the first state of [ah] has a probability of 0.9 of being voiced and a probability of 0.1 of being voiceless, and it has a probability of 0.2 of being nasalized; an [n] must be voiced and nasal; and so on.

Value in previous frame   Pr(value = 0)   Pr(value = 1)
0                         0.8             0.2
1                         0.2             0.8

Table 5: The transition probabilities for the voicing variable. This particular setting says that the voicing variable has a probability of 0.8 of remaining in the same value as in the previous frame and a probability of 0.2 of changing values (regardless of the value in the previous frame).
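The combination step can be sketched in a few lines (our own illustration, using the voicing numbers from Tables 4 and 5 and mapping Table 5's values 0/1 onto the labels "off"/"on"): for each phone state and previous value, the allowed-value probabilities are multiplied entry-wise by the transition probabilities and renormalized.

def build_cpt(allowed, transition):
    """Build P(value | previous value, phone state).

    allowed[phone_state]:   dict value -> probability (as in Table 4)
    transition[prev_value]: dict value -> probability (as in Table 5)
    Returns cpt[(phone_state, prev_value)] as a dict value -> probability.
    """
    cpt = {}
    for phone_state, value_probs in allowed.items():
        for prev in transition:
            unnorm = {v: value_probs.get(v, 0.0) * transition[prev].get(v, 0.0)
                      for v in value_probs}
            total = sum(unnorm.values())
            cpt[(phone_state, prev)] = {v: p / total for v, p in unnorm.items()}
    return cpt

# The voicing variable, using the example numbers from Tables 4 and 5.
allowed = {"ah0": {"on": 0.9, "off": 0.1}, "n0": {"on": 1.0}}
transition = {"on": {"on": 0.8, "off": 0.2}, "off": {"on": 0.2, "off": 0.8}}
cpt = build_cpt(allowed, transition)
print(cpt[("ah0", "off")])   # after an unvoiced frame, "off" is more likely
                             # than its Table 4 prior of 0.1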

Because of the large memory requirements of the model, we were unable to run experiments with this model for cases where the articulators are not constrained to their canonical values. Toward the end of the workshop, log-space inference became available in GMTK, making it possible to trade off memory for running time. We therefore leave experiments with this model to future investigation.

Our main achievements during the workshop were in creating the infrastructure to construct various versions of this articulatory model. This includes:

• Scripts to construct structures of the type in Figure 57 for arbitrary definitions of the articulatory feature set, phone set, and mappings from phones to articulatory values, for structures with and without inter-articulator dependencies in the current frame.

• Scripts to generate initial conditional probability tables from phone-to-articulator mapping tables and articulatory value transition tables, as described above.

• Scripts to generate an initial Gaussian mixture for each combination of articulatory values. We used a simple heuristic: if the combination corresponds to the canonical production of some phone state, use an existing Gaussian mixture for that phone state (from a prior training run of the phone-based recognizer); otherwise, use a silence model.

• The tables necessary to construct several variants of the model, including one in which the articulators must always take on their canonical values; one in which the articulators must reach the canonical values in the middle state of each phone but can stray from those values in the first and third states; and one in which the articulators need never take on the canonical values but are constrained to fall within a given range of those values.

12 Other miscellaneous workshop accomplishments

12.1 GMTK Parallel Training Facilities

This section documents the scripts that we developed to run GMTK tasks on multiple machines in parallel. The parallel scripts are written in the bash shell language and use the pmake (parallel make) utility to run distributed jobs. The scripts work by reading in a user-created "header" file defining various parameters, creating the appropriate makefiles, and then running pmake (possibly multiple times if training for multiple iterations). We developed parallel scripts to (1) run EM training for a Gaussian-mixture-based model with a given training "schedule" (as described in Section 12.1.1 below), and (2) create Viterbi alignments for a given set of utterances. We also used pmake to decode multiple test sets in parallel; for this purpose we did not write a separate script, but rather created the makefiles by hand.

12.1.1 Parallel Training: emtrain parallel

In order to train GMTK models in parallel, we divide the training set into a number of chunks. During each EM iteration, the statistics of each chunk are first computed using the current model parameters. At the end of the iteration, the statistics from all of the chunks are collected in order to update the model parameters. Below we describe the training procedure in greater detail.

The parallel training script emtrain parallel is invoked via

emtrain_parallel [header_filename]

where the header file is itself a bash script containing parameter definitions of the form

PARAMETER_NAME=value

The main parameters defined in the header file are:

• The files in which the training data, initial model parameters, and training structure are found, the file to which the final learned parameters are to be saved, and a directory for temporary files.

• A template masterfile from which multiple masterfiles will be created, one for each chunk of utterances.

• A training schedule defined by two arrays of equal length N, one for the -mcvr parameter and one for the -mcsr parameter of gmtkEMtrain. These define the vanishing and splitting ratios, respectively, for the first N iterations of training. After the Nth iteration, additional iterations are done until a convergence threshold (defined below) is reached, using the last -mcsr and -mcvr values in the arrays.

• The location of a script, which must be provided by the user, to create all the necessary utterance-dependent decision tree files for a given chunk of utterances.

• The iteration number I from which to start, which can be anywhere from 1 to the last iteration in the mcvr/mcsr arrays. If I = 1, the initial trainable parameters file will be used. If I > 1, then training will start from the Ith iteration using the learned parameters from the (I − 1)th iteration. The latter case is meant to be used to restart a training run that has been halted before completion for some reason.

• The maximum number M of iterations to run, and the log-likelihood ratio threshold for convergence. Training will proceed until convergence or for M iterations, whichever comes first.

• The number of chunks to break the training data into, and the maximum number of chunks to be run in parallel at one time.

These and all other parameters are described in greater detail in the example header file in Section 12.1.3. The basic procedure that emtrain parallel follows is:

0. Define N = number of chunks to break the training data into, I = initial EM iteration, M = maximum number of EM iterations, r_t = log-likelihood ratio threshold for convergence, and INITIAL GMP = the initial trainable parameters file.

1. Divide the training set into the N chunks, and create a separate masterfile and decision tree (DT) files for each chunk of utterances. This step is done in parallel using pmake, as there may be a large number of masterfiles and DT files to create.

2. For iter = I to the length of the mcsr/mcvr arrays, do:

   (a) If iter = 1, set the current parameters file to INITIAL GMP. Otherwise, set it to the learned parameters file from iteration iter − 1.

   (b) Create a pmake makefile with N targets, each of which stores the statistics of one chunk of the training data (using gmtkEMtrain with -storeAccFile). Run pmake using this makefile, storing all output to a log file.

   (c) Collect the statistics from all of the accumulator files for this iteration (using gmtkEMtrain with -loadAccFile) and update the model parameters, using the current -mcsr and -mcvr to split or vanish Gaussians as appropriate.

3. Repeat until convergence or until the Mth iteration, whichever comes first:

   (a) Increment iter.

   (b) Follow (a)-(c) from (2) above.

   (c) Test for convergence: letting L_i be the log likelihood of the training data in iteration i, compute the current log-likelihood ratio r as

       r = 100 \cdot \frac{L_{iter} - L_{iter-1}}{|L_{iter-1}|}        (7)

Convergence has been reached if r < r_t (a small code sketch of this test is given below, after the following notes).

We also include here several notes that we have found helpful to keep in mind when running this script:

• The last -mcsr and -mcvr values in the training schedule should be such that no splitting or vanishing is allowed. This is so that, during the "convergence phase" of the training run, successive iterations of the model have the same number of Gaussians and the log likelihoods can be compared (since, in EM training, the log likelihood is guaranteed to increase with each iteration only if the number of Gaussians is kept constant).

• During an iteration of gmtkEMtrain, some utterances may be skipped if their probability is too small according to the current model (and with the current beam width). If different utterances are skipped in successive iterations, then the log likelihoods of those two iterations are again not strictly comparable, since they are computed on different data. This can become a serious problem if a significant number of utterances is skipped. The script does not warn the user about skipped utterances, but gmtkEMtrain does output warnings, which can be found in the pmake log files.
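A minimal sketch of the convergence test of equation (7), purely for illustration (the real check lives in the bash script):

def converged(loglik_curr, loglik_prev, threshold_pct=0.2):
    """Equation (7): percentage change in training-data log likelihood
    between successive EM iterations, compared against the threshold."""
    r = 100.0 * (loglik_curr - loglik_prev) / abs(loglik_prev)
    return r < threshold_pct

print(converged(-1.0002e6, -1.0010e6))   # a 0.08% increase -> True (converged)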

12.1.2 Parallel Viterbi Alignment: Viterbi align parallel

To create Viterbi alignments with GMTK, gmtkViterbi is run using the training structure instead of the decoding structure, and the values of the desired variables in each frame are written out into files, one per utterance (using the -dumpNames option). The parallel Viterbi alignment script Viterbi align parallel is invoked via

Viterbi_align_parallel [header_filename]

where the header file is again a listing of parameter definitions. The main parameters defined in the header file are:

• Filenames for the observations of the utterances to be aligned, the model parameters, and the training structure, directories in which to put temporary files and output alignments, and a filestem for the alignment filenames.

• A template masterfile and a script to create chunk DT files, as in emtrain parallel.

• The number of chunks to break the observations into, and the maximum number of chunks to be run at once in parallel.

• A file containing the names of the variables to be stored in the alignments.

The procedure for parallel Viterbi alignment is much simpler than that for parallel training. The script simply divides the data into the specified number of chunks, creates lists of output alignment filenames for each chunk, creates a makefile that runs gmtkViterbi with the appropriate parameters for each chunk, and runs pmake. A similar procedure could be used to do parallel Viterbi decoding of a test set by dividing the set into chunks and then collecting the chunk outputs, although we did not do this during the workshop.

12.1.3 Example emtrain parallel header file

############################################################
## Example header file for use with emtrain_parallel
############################################################

############################################################
## files & directories
############################################################

## File in which training data is stored (this can be either
## a pfile or a file containing a list of feature files)
TRAIN_FILE=/export/ws01grmo/aurora/training_pfiles/mfcc/MultiTrain.pfile

## Initial trainable parameters file
INITIAL_GMP=/your/parameters/dir/initial_params.gmp

## Output file in which to put _final_ learned parameters
EMOUT_FILE=/your/parameters/dir/learned_params.gmp

## Structure file for training
STRFILE=/your/parameters/dir/aurora_training.str

## Directory in which to put temporary files (makefiles,
## pmake output, chunk decision tree (DT) files, chunk
## masterfiles, and learned parameters from intermediate
## iterations)
MISC_DIR=/your/temporary/dir/

## Template masterfile: like a regular masterfile, except
## that wherever an utterance-dependent DT file is specified
## in the template masterfile, the file name must end in the
## string "*RANGE*.dts". Also note that, in the template
## masterfile, the directory of the chunk DT files must be
## $MISC_DIR.
MASTER_FILE=/your/parameters/dir/masterFile.template.params

############################################################
## other params for gmtkEMtrain
############################################################

## The training schedule:

##
## Arrays of values for the -mcvr and -mcsr parameters, one
## per iteration up to the last iteration before
## log-likelihood-based training takes over. After the
## last iteration specified in the arrays, training will
## continue until convergence (or until iteration
## $MAX_EM_ITER) using the last -mcvr and -mcsr values in
## the arrays.
##
## The schedule being used here is:
##  -Run 1 iteration with no splitting or vanishing
##  -Run 2 iterations in which all Gaussians are split but
##   none are vanished
##  -Run 1 iteration with no splitting or vanishing
##  -Continue training until convergence

MCVR_ARRAY="1e20 1e20 1e20 1e20"

MCSR_ARRAY="1e10 1e-15 1e-15 1e10"

## Value for -meanCloneSTDfrac -- assumed to be the same
## for all iterations
MEANCLONEFRAC=0.25

## Value for -covarCloneSTDfrac -- assumed to be same for
## all iterations
VARCLONEFRAC=0.0

## Variables specifying all parameters relevant to each
## of the 3 feature streams (-of, -nf, -ni, -fmt, -iswp).
## --If using fewer than 3 streams, use null string for
##   the remaining stream(s).
## --Must have non-null value for at least one of the
##   streams.
STREAM1_PARAMS="-of1 $TRAIN_FILE -nf1 42 -ni1 0 -fmt1 pfile -iswp1 true"
STREAM2_PARAMS=""
STREAM3_PARAMS=""

############################################################
## other parameters for emtrain_parallel
############################################################

## User provides a script that generates the DT files for a
## given chunk of utterances.
## --The script must write the chunk DTs to files ending
##   in .dts, where range is of the form min-utt:max-utt.
## --The script can take any number of arguments, but the
##   last two must be an utterance range (in the form
##   min-utt:max-utt) and a directory in which to put DT files.
##   Any other arguments must be included here with the script
##   name. In the example below, "generate_chunk_dts" takes
##   as arguments a label file and then the utterance range
##   and directory.
LABEL_FILE=/export/ws01grmo/aurora/LABELFILES/AllMultiTr.mlf
GENERATE_CHUNK_DTS="/home/ws01/klivescu/GM/aurora/phone/generate_chunk_dts ${LABEL_FILE}"

## Number of training sentences; must be <= the number of
## utterances in $TRAIN_FILE (If <, then the first
## $NUM_TRN_SENTS utterances will be used for training.)
NUM_TRN_SENTS=8440

## Number of EM iterations;
## -- if <= the number of elements in $MCVR_ARRAY and
##    $MCSR_ARRAY, then training will follow the schedule
##    in the arrays up to $MAX_EM_ITER
## -- if > number of elements, then training will follow the
##    schedule through the last element of the arrays, and
##    then will continue until iter $MAX_EM_ITER or until log
##    likelihood threshold reached, whichever comes first
MAX_EM_ITER=100   # this means the schedule above will be completed,
                  # then training will continue for at most another
                  # 100-4 iterations or until convergence

## EM iteration number to start from (between 1 and the last
## iteration in the MCVR and MCSR arrays). If > 1, then
## emtrain_parallel will look for the learned params file from
## the previous iteration, $MISC_DIR/learned_params[k].gmp,
## where k = $INIT_EM_ITER-1. An error will be generated if
## this file doesn't exist.
INIT_EM_ITER=1

## Log-likelihood (LL) difference ratio, in percent, at which
## convergence is assumed to have occurred
LOG_LIKE_THRESH=0.2   # i.e. train until the LL difference
                      # between the current iteration and the
                      # previous one is 0.2% or less.

## Binary parameter indicating whether or not to keep the
## accumulator files from each iteration. The default value
## is "true". If "false", then each iteration will
## overwrite the accumulator files from the last iteration.
KEEP_ACC="true"

## Number of chunks to divide training data into--should be a
## multiple of EMTRAIN_PARALLELISM for maximum time-efficiency
EMTRAIN_CHUNKS=20

## Maximum number of processes to run in parallel at any time
EMTRAIN_PARALLELISM=20

## Number of processes to run on the local machine
NUM_LOCAL=0

## Set of nodes on which to run, in pmake syntax
NODES="delta grmo OR alta grmo"

## Specify the binary for gmtkEMtrain
GMTKEMTRAIN=/export/ws01grmo/gmtk/linux/bin/gmtkEMtrain.WedAug01_19_2001

12.1.4 Example Viterbi align parallel header file

############################################################
## Example header file for use with Viterbi_align_parallel
############################################################

############################################################
## directories
############################################################

## Directory in which to put temporary files (makefiles,
## pmake output, chunk decision tree (DT) files, and chunk
## masterfiles)
MISC_DIR=/your/temporary/dir/

## Directory in which to put alignments
ALIGN_DIR=/your/alignments/dir

## Filestem for alignment files. Output alignment files
## will be of the form $ALIGN_FILESTEM.utt_[num].out
ALIGN_FILESTEM=align

############################################################
## training & parameter files
############################################################

## File in which training data is stored (this can be either
## a pfile or a file containing a list of feature files)
TRAIN_FILE=/export/ws01grmo/aurora/training_pfiles/mfcc/MultiTrain.pfile

## Template masterfile: like a regular masterfile, except
## that wherever an utterance-dependent DT file is specified
## in the template masterfile, the file name must end in the
## string "*RANGE*.dts". Also note that, in the template
## masterfile, the directory of the chunk DT files must be
## $MISC_DIR.
MASTER_FILE=/your/parameters/dir/masterFile.template.params

## Structure file for training
STRFILE=$PARAMS_DIR/aurora_training.str

## Trainable parameters file
TRAINABLE_PARAMS_FILE=/your/parameters/dir/params.gmp

############################################################
## params for gmtkViterbi
############################################################

## Specify the binary to use for gmtkViterbi
GMTKVITERBI=/export/ws01grmo/gmtk/linux/bin/gmtkViterbi.ThuJul26_23_2001

## Variables specifying all parameters relevant to each
## of the 3 feature streams (-of, -nf, -ni, -fmt, -iswp).
## --If using fewer than 3 streams, use null string for
##   the remaining stream(s).
## --Must have non-null value for at least one of the
##   streams.
STREAM1_PARAMS="-of1 $TRAIN_FILE -nf1 42 -ni1 0 -fmt1 pfile -iswp1 true"
STREAM2_PARAMS=""
STREAM3_PARAMS=""

## Any other params you want to pass to gmtkViterbi
MISC_PARAMS=""

############################################################
## other params
############################################################

## User provides a script that generates the DT files for a
## given chunk of utterances.
## --The script must write the chunk DTs to files ending
##   in .dts, where range is of the form min-utt:max-utt.
## --The script can take any number of arguments, but the
##   last two must be an utterance range (in the form
##   min-utt:max-utt) and a directory in which to put DT files.
##   Any other arguments must be included here with the script
##   name. In the example below, "generate_chunk_dts" takes
##   as arguments a label file and then the utterance range
##   and directory.
LABEL_FILE=/export/ws01grmo/aurora/LABELFILES/AllMultiTr.mlf
GENERATE_CHUNK_DTS="/home/ws01/klivescu/GM/aurora/phone/generate_chunk_dts ${LABEL_FILE}"

## File listing names of variables to dump out into alignments
DUMP_NAMES_FILE=/your/work/dir/dump_names

## Number of training sentences; must be <= the number of
## utterances in $TRAIN_FILE (If <, then the first
## $NUM_TRN_SENTS utterances will be used for training.)
NUM_TRN_SENTS=8440

## Max number of processes to run at once
PARALLELISM=20

## Number of chunks to divide training data into--should be a
## multiple of PARALLELISM
CHUNKS=20

## Number of processes to run locally
NUM_LOCAL=0

## Set of nodes on which to run
NODES="delta grmo OR alta grmo"

12.2 The Mutual Information Toolkit

Our approach to structure learning is to start from a base model (the HMM) and decide how we can improve the structure (by the addition or removal of edges) to make it more discriminative and better at classification. For that we need a way to evaluate the effect of adding (or removing) edges on the quality of the structure, and this is where the MI Toolkit comes into play. The toolkit provides a set of tools designed to calculate mutual information between nodes in the graphical model. This measure is used to decide where changes to the structure will have the most effect. We present the rest of this toolkit overview in the context of speech recognition, even though the tools are general enough to be applied to any time-varying series. Specifically, we assume the data is presented in the form of sentences, which are collections of frames, or vectors.

One problem that this toolkit solves is that of processing very large amounts of data, which cannot fit into memory. At any given time, only one sentence has to be loaded and processed. Moreover, the tools are designed to be run in parallel. Besides the data, an input to the MI tools is a specification of the relative positions of the features in the speech frames between which we want to compute the MI. At any given time/frame, such a specification defines two vectors X and Y. By going over the data, we collect instances of the vectors X and Y. Section 12.2.2 discusses how the joint probability distributions are estimated from those instances. Section 12.2.1 starts by introducing background about mutual information and entropy that is useful for the rest of the discussion.

12.2.1 Mutual Information and Entropy

Mutual information is the amount of information a given random variable X has about another random variable Y. Formally,

I(X;Y) = E\left[\log \frac{p(X,Y)}{p(X)p(Y)}\right]

Mutual information is 0 when X and Y are independent (i.e., p(X,Y) = p(X)p(Y)) and is maximal when X and Y are completely dependent, i.e., when there is a deterministic relationship between them. The value of I(X;Y) in that case is min{H(X), H(Y)}, where H(X) is the entropy of X and is defined as

H(X) = E\left[\log \frac{1}{p(X)}\right]

It measures the amount of uncertainty associated with X.

I(X;Y) = E\left[\log \frac{p(X,Y)}{p(X)p(Y)}\right] = E\left[\log \frac{p(X|Y)}{p(X)}\right] = H(X) - H(X|Y)

By symmetry I(X; Y ) is also equal to H(Y ) − H(Y |X), hence we get the upper bound since entropy is positive. The definition of the mutual information applies for any random vectors X and Y , but we make the distinction between the bivariate mutual information, when X and Y are scalars and multivariate mutual information, when X and Y are vectors.

12.2.2 Toolkit Description

The MI Toolkit consists of four programs:

1. Discrete-mi
2. Bivariate-mi
3. Multivariate-mi
4. Conditional-entropy

Discrete-mi calculates the MI when the vectors are discrete, i.e., each component can take a finite number of values. There are no restrictions on the size of the vectors other than memory limitations (a hash table version of this tool has also been written to alleviate the memory problem when the vectors have sparse values). Bivariate-mi calculates the MI between two continuous scalar elements. The restriction to scalars allows the use of several optimizations that considerably speed up the MI calculation. Multivariate-mi generalizes the Bivariate-mi tool to arbitrarily sized vectors. Conditional-entropy calculates the conditional (on frame labels) entropy of arbitrarily sized vectors.

There are two main parts to calculating the mutual information (or entropy). First, the joint probability distribution must be estimated. Then, after obtaining the marginals, the MI is estimated. Calculating the mutual information when the joint and marginal probability distributions are available is done as shown in Section 12.2.1 (the definition of MI). For Discrete-mi, obtaining the probability distribution is straightforward: the probability of each n-tuple is just the frequency at which the n-dimensional configuration appears. For the remaining programs, computing the probability distribution is more involved and relies on an Expectation Maximization (EM) procedure. Following is a description of how the mutual information between two random vectors X and Y is calculated using EM.
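Before turning to the EM-based procedure, note that the Discrete-mi computation really is plug-in counting; a minimal sketch (our own illustration, not the toolkit code) for pairs of discrete values is:

from collections import Counter
from math import log

def discrete_mi(pairs):
    """Plug-in MI estimate from (x, y) observations: frequencies give the
    joint distribution, and the marginals are obtained by summing it out."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum((c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())

print(discrete_mi([(0, 0), (1, 1), (0, 0), (1, 1)]))   # = log 2 (nats): the two
                                                       # values are fully dependent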

12.2.3 EM for MI estimation

Training data is partitioned into "sentences," each of which contains, depending on the length of the utterance, several hundred frames (vectors of observations). Besides the data, an input to the MI procedure is a specification of the relative positions, within the speech frames, of the features between which we want to compute the MI.

1. While EM has not converged (i.e., the increase in log likelihood is above a given threshold), repeat steps 2-6:
2. Read in a new sentence.
3. For each position/lag specification, populate arrays of vectors X and Y by collecting the specified features over the sentence.
4. Accumulate sufficient statistics.
5. Go to step 2 until all sentences have been read.
6. Update the parameters of the joint probability distribution p_XY.
7. Partition the parameters of the p_XY distribution to get p_X and p_Y.
8. Draw samples from the above three distributions and estimate

    (1/N) Σ_{i=1}^{N} log [ p_XY(x_i, y_i) / (p_X(x_i) p_Y(y_i)) ]

The quantity estimated in the last step approximates the mutual information, by the law of large numbers: the larger N is, the better the approximation. The sampling from the distributions can either be done directly from the data or by generating new samples according to the learned distributions; both methods yield similar results.
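The final estimation step can be illustrated with a small sketch (hypothetical Python, not toolkit code). For simplicity the joint density here is a single bivariate Gaussian, so the "EM" fit reduces to the sample mean and covariance, and the Monte Carlo estimate can be checked against the closed-form Gaussian MI, -(1/2) log(1 - rho^2):

    import numpy as np
    from numpy.random import default_rng

    rng = default_rng(0)

    # Invented correlated data standing in for the (X, Y) feature pairs collected over sentences.
    rho = 0.8
    data = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=20000)
    x, y = data[:, 0], data[:, 1]

    # "Training": for a single Gaussian the maximum-likelihood fit is the sample mean and covariance.
    mean = data.mean(axis=0)
    cov = np.cov(data.T)

    def log_gauss(v, m, c):
        """Log density of a (possibly multivariate) Gaussian."""
        v = np.atleast_2d(v) - m
        c = np.atleast_2d(c)
        inv = np.linalg.inv(c)
        _, logdet = np.linalg.slogdet(c)
        quad = np.einsum('ni,ij,nj->n', v, inv, v)
        return -0.5 * (v.shape[1] * np.log(2 * np.pi) + logdet + quad)

    # Monte Carlo estimate: (1/N) sum_i log [ p_XY(x_i, y_i) / (p_X(x_i) p_Y(y_i)) ],
    # sampling directly from the data (one of the two options mentioned above).
    log_joint = log_gauss(data, mean, cov)
    log_px = log_gauss(x[:, None], mean[:1], cov[:1, :1])
    log_py = log_gauss(y[:, None], mean[1:], cov[1:, 1:])
    mi_mc = np.mean(log_joint - log_px - log_py)

    rho_hat = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])
    mi_closed = -0.5 * np.log(1.0 - rho_hat ** 2)
    print(mi_mc, mi_closed)   # the two estimates should agree closely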

12.2.4 Conditional entropy

The previous MI procedure assumes both random variables are continuous, but often we are interested in the mutual information between a continuous and a discrete variable, i.e., we want to calculate I(X;A) where X is continuous and A is discrete. Such a calculation is needed, for example, when we want to augment our graphical model with conditioning features Y: observations that, unlike the normal observations (which we also call scoring observations), do not depend on the state Q, but that can potentially help discrimination. A simple criterion for selecting conditioning observations is, thus, to compute the unconditional mutual information between the conditioning features and the state; if that value is small, we deduce that Y ⊥⊥ Q. We also compute I(Y;Q|X) to verify that X depends on Y.

We can write this mutual information as a function of entropies:

    I(X;A) = H(X) - H(X|A)
           = E_p[ log 1/p(X) ] - E_{p(x,a)}[ log 1/p(X|A) ]
           = -E_p[ log p(X) ] + Σ_{ai} p(ai) E_{p(x|ai)}[ log p(X|A = ai) ]

Therefore, we can use a procedure similar to that described above to estimate the probability distributions p_X and p_{X|A}, and, by sampling from these distributions and using the law of large numbers, estimate the two terms -E_p[log p(X)] and E_{p(x|ai)}[log p(X|A = ai)]. The probability distribution p_A can be computed by counting the frequencies of the values of A.
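A minimal sketch of this estimate (hypothetical Python, using a single Gaussian per class in place of the toolkit's mixture models, with invented parameters) is:

    import numpy as np
    from numpy.random import default_rng

    rng = default_rng(1)

    # Invented setup: a binary variable A and a scalar X whose distribution depends on A.
    # p_a could equally be estimated by counting the observed values of A, as described above.
    p_a = np.array([0.3, 0.7])
    means, sigmas = np.array([-1.0, 2.0]), np.array([1.0, 0.5])

    a = rng.choice(2, size=50000, p=p_a)
    x = rng.normal(means[a], sigmas[a])

    def log_gauss(v, m, s):
        return -0.5 * np.log(2 * np.pi * s ** 2) - (v - m) ** 2 / (2 * s ** 2)

    # p(X) is a two-component mixture; p(X|A = ai) are the class-conditional Gaussians.
    def log_px(v):
        comps = [np.log(p_a[i]) + log_gauss(v, means[i], sigmas[i]) for i in range(2)]
        return np.logaddexp(comps[0], comps[1])

    # H(X) = -E_p[log p(X)], estimated over all samples.
    h_x = -np.mean(log_px(x))

    # H(X|A) = -sum_i p(ai) E_{p(x|ai)}[log p(X|A = ai)], estimated per class.
    h_x_given_a = -sum(p_a[i] * np.mean(log_gauss(x[a == i], means[i], sigmas[i]))
                       for i in range(2))

    print(h_x - h_x_given_a)   # estimate of I(X;A)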

12.3 Graphical Model Representations of Language Model Mixtures

Another project that was undertaken during the workshop was the use of GMTK for some basic language modeling experiments, namely using graphs to represent mixtures of component language models of various orders, and using the sparse conditional probability and switching hidden variable features of GMTK to implement them. In general, decoding in speech recognition can be decomposed into two separate probability calculations according to the noisy channel model, argmax_w P(A|w)P(w). Language models approximate the joint probability of a sequence of words, P(w), which by the chain rule becomes Π_t P(wt | h = w_{1...t-1}). For n-gram language models, the word history h at each time point is limited to the previous n - 1 words. This repeating structure of the word history lends itself nicely to the language of dynamic graphical models. Our experiments demonstrate how modular and easily trainable graphical models can be used for language modeling, from simple to more advanced models.

12.3.1 Graphical Models for Language Model Mixtures


Figure 58: Simple graph for a trigram language model.

For our experiments, we chose to model the standard trigram language model

    Ptrigram(wt|h) = P(wt | wt-1, wt-2) = N(wt-2, wt-1, wt) / N(wt-2, wt-1)    (8)

where N(wt-2, wt-1, wt) is the number of times the word triple (wt-2, wt-1, wt) occurs in the training data. This model can be described using the directed graph shown in Figure 58. The graph shows the set of word variables Wt for each t, and shows how Wt depends on the two previous words in the history.
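As an illustration, the count-based estimate of Equation 8 can be sketched as follows (hypothetical Python; in the experiments below these counts were in fact computed offline):

    from collections import Counter

    def train_trigram(sentences):
        """Collect the counts needed for Equation 8 from a list of word lists."""
        tri, bi = Counter(), Counter()
        for words in sentences:
            for i in range(2, len(words)):
                tri[(words[i - 2], words[i - 1], words[i])] += 1
                bi[(words[i - 2], words[i - 1])] += 1
        return tri, bi

    def p_trigram(w, w1, w2, tri, bi):
        """P(wt = w | wt-1 = w1, wt-2 = w2) = N(w2, w1, w) / N(w2, w1)."""
        denom = bi[(w2, w1)]
        return tri[(w2, w1, w)] / denom if denom > 0 else 0.0

    # Toy usage with an invented corpus.
    corpus = [["the", "cat", "sat", "on", "the", "mat"],
              ["the", "cat", "ate", "the", "mat"]]
    tri, bi = train_trigram(corpus)
    print(p_trigram("sat", "cat", "the", tri, bi))   # N(the,cat,sat)/N(the,cat) = 1/2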


Figure 59: Mixture of trigram, bigram, and unigram using a hidden variable α and switching parents.

In general, not all possible three-word sequences are seen in a given set of training data, and this was of course the case for the IBM AV data that we worked with during the workshop. We therefore implemented a smoothed probability distribution by linearly interpolating the trigram distribution with bigram and unigram distributions, a method well known as Jelinek-Mercer smoothing [59]:

    P(wi|h) = α1 Ptrigram + α2 Pbigram + (1 - α1 - α2) Punigram    (9)

where Pbigram and Punigram are defined similarly to the trigram case, as ratios of counts. Another way of viewing this equation is that there exists a hidden discrete tri-valued random variable, say named α, which is used to mix between the various component language models. The above equation can therefore be written as:

    P(wi|h) = P(α = 3) Ptrigram + P(α = 2) Pbigram + P(α = 1) Punigram    (10)

Viewed in this way, we see that the α variable is really a switching parent which, depending on its value, determines the set of parents that are active. This can be seen in the graphical model shown in Figure 59. In that figure, the αt variable at each time step is a switching parent (indicated by the dashed edges; see also Section 6.1.7, which describes the idea of switching parents and how it is implemented in GMTK), and is used only to determine whether some of the other edges in the graph are active. The values for which the different edges are active are shown by the call-out boxes, indicating the αt values required to activate each edge.
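Numerically, the switching-parent mixture of Equation 10 is just a weighted sum over the values of α. A minimal sketch (hypothetical Python with invented component tables, not the GMTK sparse-CPT implementation) is:

    # P(alpha = 1), P(alpha = 2), P(alpha = 3): weights on the unigram, bigram, trigram parents.
    p_alpha = {1: 0.1, 2: 0.3, 3: 0.6}

    def p_word(w, history, p_uni, p_bi, p_tri):
        """Equation 10: sum over the switching parent alpha of P(alpha) * P_alpha(w | active parents)."""
        w1, w2 = history[-1], history[-2]          # previous and previous-previous words
        return (p_alpha[3] * p_tri.get((w2, w1, w), 0.0)
                + p_alpha[2] * p_bi.get((w1, w), 0.0)
                + p_alpha[1] * p_uni.get(w, 0.0))

    # Invented toy distributions for illustration.
    p_uni = {"mat": 0.05}
    p_bi = {("the", "mat"): 0.2}
    p_tri = {("on", "the", "mat"): 0.5}
    print(p_word("mat", ["on", "the"], p_uni, p_bi, p_tri))   # 0.6*0.5 + 0.3*0.2 + 0.1*0.05

In GMTK itself, the same effect is obtained by making α a switching parent of Wt, so that only one of the three conditional probability tables is active for each value of α; the weighted sum arises when the hidden α is summed over during inference.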


Figure 60: Mixture of trigram, bigram, and unigram using a hidden variable α and switching parents. Here, α is dependent on the history.

In the most simple case, the random variable α has fixed probability values over time, and these values are typically learned from a held-out set not used to produce the base count distributions (i.e., deleted interpolation [59]). It is possible, however, for these weights to be a function of, and vary according to, the word history h, leading to the equation:

    P(wi|h) = P(α = 3|h) Ptrigram + P(α = 2|h) Pbigram + P(α = 1|h) Punigram    (11)

for a hidden discrete tri-valued random variable α. This structure is shown in Figure 60. Because h can have quite a large state space, P(α|h) itself could be a difficult quantity to estimate. Therefore, in order to reduce this data-sparsity problem and estimate the quantity more robustly, we can form equivalence classes of word histories h based on the frequency of their occurrence, which are called buckets. In other words, those h values which occurred within a given range of counts in the training data are grouped together into one bucket B(h), and the resulting probability becomes P(α|B(h)), which is necessarily a discrete distribution with lower overall cardinality. As will be seen in the experiments below, we vary the number of buckets, thereby evaluating the trade-off between the model's robustness and its predictive accuracy. GMTK scaled quite easily to these changes, automatically learning the varying number of weights given the appropriate graphical model structures. Perplexity results with various numbers of buckets are given in the following sections; a small illustrative sketch of the bucketing itself is given below.

Another aspect of language modeling that is sometimes desirable is the ability to represent the notion of an optional lexical silence token that might occur between words (this might be a pause, or some other non-lexical entity). We will call this entity lexical silence, denoted by sil. This is particularly important when language modeling is used along with acoustic models, as the lexical silence "word" might have quite different acoustic properties than any of the real words in the lexicon. Therefore, a goal is to allow lexical silence to occur between any pair of words. A problem that arises when this is done, however, is that the probability model for the current word now depends on the previous word wt-1, which might be lexical silence. It could be more beneficial to condition only on the previous "true" words (not lexical silences) when the context contains lexical silence. In language modeling, this is a form of what is called a skip language model, where some word in the context is skipped in certain situations. It is possible to represent such a construct with a graph and with GMTK. We first develop the model in the bigram case for simplicity, and then provide the trigram case (which also includes perplexity results below).
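As a brief illustration of the history-bucketing scheme described above (hypothetical Python; the bucket boundaries and counts are invented for the example, and during the workshop the number of buckets was specified in the GMTK structure files):

    import bisect

    # Invented bucket boundaries on history counts N(h): [0], [1..4], [5..49], [50..).
    boundaries = [1, 5, 50]

    def bucket(history, history_counts):
        """Map a word history h to its equivalence class B(h) based on how often h was seen."""
        n = history_counts.get(history, 0)
        return bisect.bisect_right(boundaries, n)   # bucket index 0..len(boundaries)

    # P(alpha | B(h)) is then a small table, one distribution per bucket,
    # instead of one distribution per distinct history h.
    history_counts = {("on", "the"): 73, ("purple", "mat"): 2}
    print(bucket(("on", "the"), history_counts),      # 3: frequently seen history
          bucket(("purple", "mat"), history_counts),  # 1: rarely seen history
          bucket(("never", "seen"), history_counts))  # 0: unseen history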

The essential problem is that the random variable Wt in the graph is conditioned on the previous word Wt-1. When the previous word is sil, however, the information about the previous true word is lost (since it is not in the conditioning set of the model P(Wt|Wt-1, Wt-2)). Therefore, there must be some mechanism (graphical in this case) to keep track of what the previous true word is and to use it in the conditioning set. This can be done by having an explicit extra variable, which we call Rt, for the previous Real word in the history. Rt then becomes part of the conditioning set and is used to produce the bigram score rather than Wt-1, which might be sil. Also, Rt itself needs to be maintained over time when the current word is sil; when Wt is not sil, Rt should be updated to be whatever the previous word truly is. We first describe this in equations, giving the distribution of Wt given both Wt-1 and Rt-1, and then provide a graph.

    P(wt | wt-1, rt-1) = Σ_{rt} P(wt, rt | wt-1, rt-1)                          (12)
                       = Σ_{rt} P(rt | wt, wt-1, rt-1) P(wt | wt-1, rt-1)       (13)
                       = Σ_{rt} P(rt | wt, rt-1) P(wt | rt-1)                   (14)

The quantity P(rt | wt, rt-1) is set as follows:

    P(rt | wt, rt-1) = δ(rt = rt-1)   if wt = sil
    P(rt | wt, rt-1) = δ(rt = wt)     if wt ≠ sil

where δ(i = j) is the delta (indicator) function, which is one only when i = j and is otherwise zero. The implementation of this distribution does the following: if the current word wt is sil, then Rt is a copy of whatever Rt-1 is, so the real word is retained from time slice to time slice. If, on the other hand, wt is a real word, then Rt is a copy of that word. Therefore, Wt is both a normal and a switching parent of Rt. The implementation of P(Wt|Rt-1) is as follows:

    P(wt | rt-1) = Σ_{st} P(wt | st, rt-1) P(st)

where St is a hidden binary variable at time t which indicates whether the current word is lexical silence. The implementation of P(wt | st, rt-1) is as follows:

    P(wt | st, rt-1) = δ(wt = sil)            if st = 1
    P(wt | st, rt-1) = Pbigram(wt | rt-1)     if st = 0

This means that whenever St is "on", it forces Wt to be sil, and sil is then the only token that gets any (and all of the) probability. P(st) is simply set to the probability of lexical silence (the relative frequency can be obtained from training data). The graph for this model is shown in Figure 61. In the graph, Wt has a dependency only on Rt-1 and St. Rt uses Wt both as a switching and a normal parent: the value of Wt switches the parent of Rt to be either Wt itself, to obtain a new real word value, or Rt-1, to retain the previous real word value.
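These deterministic and switching dependencies amount to a small computation. The following hypothetical Python sketch of the two delta-function tables and of the sum in Equation 14 is illustrative only (SIL, p_sil, and the bigram table are invented placeholders; this is not GMTK code). A short continuation of this sketch, after the worked example below, scores a whole token string.

    SIL = "<sil>"   # invented token standing in for lexical silence
    p_sil = 0.1     # P(st = 1): the prior probability of lexical silence (relative frequency)

    def p_r(r_t, w_t, r_prev):
        """P(rt | wt, rt-1): copy rt-1 when wt is silence, otherwise copy wt."""
        return float(r_t == (r_prev if w_t == SIL else w_t))

    def p_w_given_s_r(w_t, s_t, r_prev, p_bigram):
        """P(wt | st, rt-1): forced to sil when st = 1, otherwise a bigram on the real history."""
        if s_t == 1:
            return float(w_t == SIL)
        return p_bigram.get((r_prev, w_t), 0.0)

    def p_w_given_r(w_t, r_prev, p_bigram):
        """P(wt | rt-1) = sum over st of P(wt | st, rt-1) P(st), as used in Equation 14."""
        return (p_sil * p_w_given_s_r(w_t, 1, r_prev, p_bigram)
                + (1.0 - p_sil) * p_w_given_s_r(w_t, 0, r_prev, p_bigram))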


Figure 61: An implementation of a skip-like bigram, where the previous real word Rt is used to look up the bigram probability whenever the previous word is lexical silence.

Given the above model, the string of lexical items "Fool me once sil, shame on sil, shame on you" will get probability Pbg(me|Fool) Pbg(once|me) Pbg(shame|once) Pbg(on|shame) Pbg(shame|on) Pbg(on|shame) Pbg(you|on) P(sil)^2, where the probability of lexical silence is applied twice.
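Continuing the hypothetical sketch above (and, as in the product just given, omitting the P(st = 0) factors at the real-word positions for readability), the same product can be reproduced by walking the token string while maintaining the previous-real-word value:

    import math  # SIL and p_sil are reused from the sketch above

    def skip_bigram_score(tokens, p_bigram):
        """Log product of skip-bigram terms, with P(sil) applied at each silence token."""
        logp, r_prev = 0.0, None
        for w in tokens:
            if w == SIL:
                logp += math.log(p_sil)                      # each silence contributes P(sil)
            else:
                if r_prev is not None:
                    logp += math.log(p_bigram[(r_prev, w)])  # Pbg(w | previous real word R)
                r_prev = w                                   # R is updated only by real words
        return logp

    tokens = ["Fool", "me", "once", SIL, "shame", "on", SIL, "shame", "on", "you"]
    # Invented bigram values, just to make the sketch runnable.
    p_bigram = {pair: 0.1 for pair in [("Fool", "me"), ("me", "once"), ("once", "shame"),
                                       ("shame", "on"), ("on", "shame"), ("on", "you")]}
    print(skip_bigram_score(tokens, p_bigram))   # 7 bigram factors plus P(sil)^2, in the log domain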


Figure 62: An implementation of a skip-like bigram and a mixture of bigram and unigram models, essentially a combination of a bigram version of Figure 59 and of Figure 61

The model that skips lexical silence and the model that mixes between bigram and unigram probabilities may be combined into a single model, as shown in Figure 62. Note that it is also possible to use two binary auxiliary hidden variables (rather than just one α variable) to produce the mixture given in Figure 60 and described in Equation 11. The trigram decomposition can be performed as follows:

    P(wt | wt-1, wt-2) = P(αt = 1 | wt-1, wt-2) Ptri(wt | wt-1, wt-2)      (15)
                       + P(αt = 0 | wt-1, wt-2) P(wt | wt-1)               (16)

where

    P(wt | wt-1) = P(βt = 1 | wt-1) Pbi(wt | wt-1)      (17)
                 + P(βt = 0 | wt-1) P(wt)               (18)

and where we now have two hidden switching parent variables at each time slice: αt (deciding between a trigram and a bigram) and βt (deciding between a bigram and a unigram). This model is shown graphically in Figure 63.
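Substituting Equations 17-18 into Equations 15-16 makes the correspondence with the three-way mixture of Equation 11 explicit: the effective interpolation weights are

    λ_tri = P(αt = 1 | wt-1, wt-2)
    λ_bi  = P(αt = 0 | wt-1, wt-2) P(βt = 1 | wt-1)
    λ_uni = P(αt = 0 | wt-1, wt-2) P(βt = 0 | wt-1)

These weights are non-negative and sum to one, so the cascade of two binary switching variables is simply an alternative parameterization of the same three-way interpolation.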


Figure 63: A mixture of a trigram, bigram, and unigram language model using two binary hidden variables α (to control trigram vs. bigram) and β (to control bigram vs. unigram).


Figure 64: An implementation of a skip-like trigram, where the previous real word Rt and the real word before it, Vt, are both used to look up the trigram probability whenever either of the two preceding words is lexical silence. The structure also includes the provisions necessary to update the true words within the history.

Moreover, it is possible to implement a trigram model that skips over contexts that contain sil, similar to the bigram case. This trigram model is given in Figure 64. Combining this model with the mixture model, we at last arrive at the model that was used for the perplexity experiments carried out during the 2001 JHU workshop. This model is given in Figure 65, and it can be used for language model training and language scoring. We can also add the remaining structure given in the lower portion of Figure 14 to obtain a general speech recognition decoder that uses this mixture and skip language model.


Figure 65: A model that combines the two-variable mixture of trigram, bigram, and unigram models, and that also implements the skipping of lexical silence at the trigram level.

12.3.2 Perplexity Experiments

We tested the language model given in Figure 65 in a set of perplexity experiments. We used the IBM Audio-Visual corpus, comprising ≈ 13,000 training utterances and a ≈ 13,000-word vocabulary. Test data was a subset of the training utterances, and an additional subset of held-out data was used to train the weights (i.e., the distributions of the variables α and β). The trigram, bigram, and unigram probabilities were calculated offline by taking frequency counts over the training data. Once the number of buckets was specified in the structure files, the weights were learned with GMTK. To test the language model, we calculated the probability it assigned to the test data. This was then converted into perplexity, a common measure of how well the language model predicts language. As is well known, perplexity can be thought of as the average branching factor of the model, i.e., the number of words to which it effectively assigns equal probability, and it is correlated with WER in speech recognition. While perplexity reduction is often not a good predictor of overall word error reduction, in many cases it can be quite useful, especially when the perplexity reductions are large.
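Concretely, the conversion from the per-word log probabilities assigned by the model to perplexity can be sketched as follows (hypothetical Python; logprobs is a placeholder for values produced elsewhere, e.g., by GMTK scoring):

    import math

    def perplexity(logprobs):
        """Perplexity = exp of the negative average per-word log probability (natural log assumed)."""
        return math.exp(-sum(logprobs) / len(logprobs))

    # Toy check: a model that assigns probability 1/28 to every word has perplexity 28,
    # i.e., an average branching factor of 28 equally likely words.
    print(perplexity([math.log(1.0 / 28)] * 100))   # 28.0 (approximately)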

12.3.3 Perplexity Results

Experiments were run with linearly interpolated bigram and trigram language models, varying the number of buckets from 1 to 10. The results are summarized below in terms of language model perplexity over the test data:

                  bigram    trigram
    1 bucket       89.54      38.71
    2 buckets      81.63      28.59
    10 buckets     80.79      28.09

Adding one more word of history through the trigram model significantly reduced perplexity, as expected given enough training data. The single-bucket models performed reasonably well. The largest gains came from using two buckets instead of one, which effectively lowered the weight of n-grams whose history was never seen. More buckets did not appear to help, as the learned weights converged to similar values for the buckets with nonzero history counts.

12.3.4 Conclusions

The language model experiments performed with GMTK yield results comparable to those of standard language modeling toolkits. However, the similarity between the training, testing, and decoding graphical model structures allows for an easier transition between the different phases. Given the rich language of graphical models, extensions could be made to the trigram language model, such as higher-order n-grams or caching and trigger words. These would require the addition of new variables and dependency arcs. Once the structures are specified, the modular, trainable framework of GMTK would allow for seamless training, testing, and decoding.

13 Future Work and Conclusions

There were many goals of the JHU 2001 workshop (see Section 1), only some of which were realized. In this section, we briefly outline some of what could be the next steps for the research that began at this workshop.

13.1 Articulatory Models

The work we began at the workshop has produced some simple articulatory models and much of the infrastructure needed to build similar models. However, there is a large space of articulatory models that can be explored with graphical models. There are many additional dependencies that are likely to be present in speech but are not represented in the structures that we have constructed, as well as entirely different structures that may better model the asynchronous nature of the articulators. One of the main ideas that we had hoped to investigate, but were not able to during the workshop, is structure learning over the hidden articulatory variables. For a model with a large number of variables, such as an articulatory model, it is infeasible to include all of the possible dependencies and labor-intensive to predetermine them using linguistic considerations. This would therefore be a natural application for structure learning, and in particular for the ideas of discriminative structure learning using the EAR measure.

13.1.1 Additional Structures

There is also a wealth of other structures to be explored. The type of structure we have built (see Figure 57) is limited in several respects. While such a model can allow articulators to stray from their target values, it cannot truly represent asynchronous articulatory streams since all of the articulatory variables depend on the current phone state. In order to allow the articulators to evolve asynchronously, new structures are needed.

One possibility we have considered, but not implemented during the workshop, is to treat each articulatory stream in the same way that the phone is treated in the phone-based model, with its own position and transition variables. In such a model, each articulatory stream could go through its prescribed sequence of values at its own pace. One example of such a structure, in which there is no phone variable at all, is shown in Figure 66.

It is important, however, to constrain the asynchrony between the articulatory variables, for both computational and linguistic reasons. This can be done by forcing them to synchronize (i.e. reach the same positions) at certain points, such as word or syllable boundaries, or by requiring that some minimal subset be synchronized at any point in time. As an example, the structure of Figure 66 includes the dependencies that could be used for synchronization at word boundaries. The degree of synchronization is an interesting issue that, to our knowledge, has not been previously investigated in articulatory modeling research and would be fairly straightforward to explore using graphical models.

13.1.2 Computational Issues

Computational considerations are also likely to be an important issue for future work. As mentioned in Section 11, we were unable to train or decode with these models during the time span of the workshop. With the addition of log-space inference in GMTK, training and decoding can now be done within reasonable memory constraints, but at the expense of increased running time. Therefore, work is still needed to enable articulatory models to run more efficiently. One way to control the computational requirements of the model is through careful constraints on the state space. Constraints on the overall articulatory state space can be applied through the choice of articulatory variables and inter-articulator dependencies.


Figure 66: A phone-free articulatory model allowing for asynchrony. In this model, each feature i has a variable a^i representing its current value, a variable a^i_pos representing its position within the current word, and a variable a^i_tr indicating whether the feature is transitioning to its next position.

The "instantaneous" state space can also be controlled by imposing various levels of sparsity on the articulatory probability tables. In addition, measures can be taken to limit the size of the acoustic models (the conditional probability densities of the observation variable). For example, instead of having a separate model for each allowed combination of articulatory values, some of the models could be tied, or product-of-experts models [55] could be used to combine observation distributions corresponding to different articulators. We have begun to explore the product-of-experts approach in work pursued since the workshop. Finally, the distribution dimensionality could be reduced by choosing only a certain subset of the acoustic observations to depend on each articulator (which could differ from articulator to articulator).

13.2 Structural Discriminability

One of the main goals of the workshop was to investigate the use of discriminative structure learning. In this work, we have described a methodology that can learn discriminative structure between collections of observation variables. One of the key goals of the workshop that time constraints prevented us from pursuing was the induction of discriminative structure at the hidden level. For example, given a baseline articulatory network, it would be desirable to augment that network (i.e., either add or remove edges) so as to improve its overall discriminative power. Work is planned in the future to pursue this goal.

13.3 GMTK

There are many plans over the next several years for additions and improvements to GMTK. Some of these include: 1) a new, faster inference algorithm that utilizes an off-line triangulation algorithm, 2) approximate inference schemes such as a variational approach and a loopy propagation procedure, 3) non-linear dependencies between observations, 4) better integration with language modeling systems, such as the SRI language-modeling toolkit, 5) the use of hidden continuous variables, 6) adaptation techniques, and 7) general performance enhancements. Many other enhancements are planned as well. GMTK was conceived for use at the JHU 2001 workshop, but it is believed that it will become a useful tool for a variety of speech recognition, language modeling, and time-series processing tasks over the next several years.

14 The WS01 GM-ASR Team

Lastly, we would like to once again mention and acknowledge the WS01 JHU team, which consisted of the following people:

Jeff A. Bilmes — University of Washington, Seattle
Geoff Zweig — IBM
Thomas Richardson — University of Washington, Seattle
Karim Filali — University of Washington, Seattle
Karen Livescu — MIT
Peng Xu — Johns Hopkins University
Kirk Jackson — DOD
Yigal Brandman — Phonetact Inc.
Eric Sandness — Speechworks
Eva Holtz — Harvard University
Jerry Torres — Stanford University
Bill Byrne — Johns Hopkins University

The team is also shown in Figure 67 (along with several friends who happened by at the time of the photo shoot). Speaking as a team leader (J.B.), I would like to acknowledge and give many thanks to all the team members for doing an absolutely wonderful job. I would also like to thank Sanjeev Khudanpur, Bill Byrne, and Fred Jelinek and all the other members of CLSP for creating a fantastically fertile environment in which to pursue and be creative in performing novel research in speech and language processing. Lastly, we would like to thank the sponsoring organizations (DARPA, NSF, DOD), without which none of this research would have occurred.


Figure 67: The JHU WS01 GM Team. To see the contents of the T-shirts we are wearing, see Figure 68.

References

[1] A. V. Aho, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques and Tools. Addison-Wesley, Inc., Reading, Mass., 1986.

UWEETR-2001-0006 80 [2] S. M. Aji and R. J. McEliece. The generalized distributive law. IEEE Transactions in Information Thoery, 46:325–343, March 2000. [3] J.J. Atick. Could information theory provide an ecological theory of sensory processing? Network, 3, 1992. [4] L.R. Bahl, P.F. Brown, P.V. de Souza, and R.L. Mercer. Maximum mutual information estimation of HMM parameters for speech recognition. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, pages 49–52, Tokyo, Japan, December 1986. [5] J. Baker. The Dragon system—an overview. IEEE Transactions on Acoustics, Speech, and Signal Processing, 23:24–29, 1975. [6] J. Bilmes. Natural Statistical Models for Automatic Speech Recognition. PhD thesis, U.C. Berkeley, Dept. of EECS, CS Division, 1999. [7] J. Bilmes. Graphical models and automatic speech recognition. Technical Report UWEETR-2001-005, Uni- versity of Washington, Dept. of EE, 2001. [8] J. Bilmes. The gmtk documentation, 2002. http://ssli.ee.washington.edu˜bilmes/gmtk. [9] J. Bilmes and G. Zweig. The Graphical Models Toolkit: An open source software system for speech and time-series processing. Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 2002. [10] J.A. Bilmes. Buried Markov models for speech recognition. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, Phoenix, AZ, March 1999. [11] J.A. Bilmes. Dynamic Bayesian Multinets. In Proceedings of the 16th conf. on Uncertainty in Artificial Intelli- gence. Morgan Kaufmann, 2000. [12] J.A. Bilmes. Factored sparse inverse covariance matrices. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, Istanbul, Turkey, 2000. [13] J. Binder, K. Murphy, and S. Russell. Space-efficient inference in dynamic probabilistic networks. Int’l, Joint Conf. on Artificial Intelligence, 1997. [14] C. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995. [15] C. P. Browman and L. Goldstein. Articulatory phonology: An overview. Phonetica, 49:155–180, 1992. [16] The BUGS project. http://www.mrc-bsu.cam.ac.uk/bugs/Welcome.html. [17] W. Buntine. A guide to the literature on learning probabilistic networks from data. IEEE Trans. on Knowledge and Data Engineering, 8:195–210, 1994. [18] K.P. Burnham and D.R. Anderson. Model Selection and Inference : A Practical Information-Theoretic Ap- proach. Springer-Verlag, 1998. [19] R. Chellappa and A. Jain, editors. Markov Random Fields: Theory and Application. Academic Press, 1993. [20] C.-P. Chen, K. Kirchhoff, and J. Bilmes. Towards simple methods of noise-robustness. Technical Report UWEETR-2002-002, University of Washington, Dept. of EE, 2001. [21] D.M. Chickering. Learning from Data: Artificial Intelligence and Statistics, chapter Learning Bayesian net- works is NP-complete, pages 121–130. Springer-Verlag, 1996. [22] N. Chomsky and M. Halle. The Sound Pattern of English. New York: Harper and Row, 1968. [23] G. Cooper and E. Herskovits. Computational complexity of probabilistic inference using Bayesian belief net- works. Artificial Intelligence, 42:393–405, 1990. [24] T.H. Cormen, C.E. Leiserson, and R.L. Rivest. Introduction to Algorithms. McGraw Hill, 1990.

UWEETR-2001-0006 81 [25] R. G. Cowell, A. P. Dawid, S. L. Lauritzen, and D. J. Spiegelhalter. Probabilistic Networks and Expert Systems. Springer, 1999. [26] D.R. Cox and D.V. Hinkley. Theoretical Statistics. Chapman and Hall/CRC, 1974. [27] P. Dagum and M. Luby. Approximating probabilistic inference in Bayesian belief networks is NP-hard. Artifi- cial Intelligence, 60(141-153), 1993. [28] Journal: Data mining and knowledge discovery. Kluwer Academic Publishers. Maritime Institute of Technol- ogy, Maryland. [29] T. Dean and K. Kanazawa. Probabilistic temporal reasoning. AAAI, pages 524–528, 1988. [30] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum-likelihood from incomplete data via the EM algorithm. J. Royal Statist. Soc. Ser. B., 39, 1977. [31] L. Deng and K. Erler. Structural design of hidden markov model speech recognizer using multivalued phonetic features: Comparison with segmental speech units. Journal of the Acoustical Society of America, 92(6):3058– 3067, Dec 1992. [32] L. Deng, G. Ramsay, and D. Sun. Production models as a structural basis for automatic speech recognition. Speech Communication, 33(2-3):93–111, Aug 1997. [33] R.O. Duda and P.E. Hart. Pattern Classification and Scene Analysis. John Wiley and Sons, Inc., 1973. [34] R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classification. John Wiley and Sons, Inc., 2000. [35] E. Eide. Distinctive features for use in an automatic speech recognition system. In Eurospeech-99, 2001. [36] K. Elenius and M. Blomberg. Effects of emphasizing transitional or stationary parts of the speech signal in a discrete utterance recognition system. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, pages 535–538. IEEE, 1982. [37] C. Neti et. al. Audo-visual speech recognition: Ws 2000 final report, 2000. http://www.clsp.jhu.edu/ws2000/final reports/avsr/ws00avsr.pdf. [38] R. Fletcher. Practical Methods of Optimization. John Wiley & Sons, New York, NY, 1980. [39] E. Fosler-Lussier, S. Greenberg, and N. Morgan. Incorporating contextual phonetics into automatic speech recognition. In Proceedings 14th International Congress of Phonetic Sciences, San Francisco, CA, 1999. [40] B. Frey. Graphical Models for Machine Learning and Digital Communication. MIT Press, 1998. [41] J. H. Friedman. Multivariate adaptive regression splines. The Annals of Statistics, 19(1):1–141, 1991. [42] N. Friedman and M. Goldszmidt. Learning in Graphical Models, chapter Learning Bayesian Networks with Local Structure. Kluwer Academic Publishers, 1998. [43] N. Friedman, K. Murphy, and S. Russell. Learning the structure of dynamic probabilistic networks. 14th Conf. on Uncertainty in Artificial Intelligence, 1998. [44] K. Fukunaga. Introduction to Statistical Pattern Recognition, 2nd Ed. Academic Press, 1990. [45] D. Geiger and D. Heckerman. Knowledge representation and inference in similarity networks and Bayesian multinets. Artificial Intelligence, 82:45–74, 1996. [46] Z. Ghahramani. Lecture Notes in Artificial Intelligence, chapter Learning Dynamic Bayesian Networks. Springer-Verlag, 1998. [47] J. A. Goldsmith. Autosegmental and Metrical Phonology. B. Blackwell, Cambridge, MA, 1990.

UWEETR-2001-0006 82 [48] R. A. Gopinath. Maximum likelihood modeling with gaussian distributions for classification. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 1998. [49] G. Gravier, S. Axelrod, G. Potamianos, and C. Neti. Maximum entropy and MCE based HMM stream weight estimation for audio-visual asr. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 2002. [50] S. Greenberg, S. Chang, and J. Hollenback. An introduction to the diagnostic evaluation of the switchboard- corpus automatic speech recognition systems. In Proc. NIST Speech Transcription Workshop, College Park, MD, 2000. [51] D. Heckerman. A tutorial on learning with Bayesian networks. Technical Report MSR-TR-95-06, Microsoft, 1995. [52] D. Heckerman, Max Chickering, Chris Meek, Robert Rounthwaite, and Carl Kadie. Dependency networks for density estimation, collaborative filtering, and data visualization. In Proceedings of the 16th conf. on Uncer- tainty in Artificial Intelligence. Morgan Kaufmann, 2000. [53] D. Heckerman, D. Geiger, and D.M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Technical Report MSR-TR-94-09, Microsoft, 1994. [54] J. Hertz, A. Krogh, and R.G. Palmer. Introduction to the Theory of Neural Computation. Allan M. Wylde, 1991. [55] G. Hinton. Products of experts. In Proc. Ninth Int. Conf. on Artificial Neural Networks, 1999. [56] H. G. Hirsch and D. Pearce. The aurora experimental framework for the performance evaluations of speech recognition systems under noisy conditions. ICSA ITRW ASR2000, September 2000. [57] The ISIP public domain speech to text system. http://www.isip.msstate.edu/projects/speech/software/index.html. [58] T.S. Jaakkola and M.I. Jordan. Learning in Graphical Models, chapter Improving the Mean Field Approxima- tions via the use of Mixture Distributions. Kluwer Academic Publishers, 1998. [59] F. Jelinek. Statistical Methods for Speech Recognition. MIT Press, 1997. [60] F.V. Jensen. An Introduction to Bayesian Networks. Springer, 1996. [61] M.I. Jordan and C. M. Bishop, editors. An Introduction to Graphical Models. to be published, 2001. [62] M.I. Jordan, Z. Ghahramani, T.S. Jaakkola, and L.K. Saul. Learning in Graphical Models, chapter An Intro- duction to Variational Methods for Graphical Models. Kluwer Academic Publishers, 1998. [63] B.-H. Juang, W. Chou, and C.-H. Lee. Minimum classification error rate methods for speech recognition. IEEE Trans. on Speech and Audio Signal Processing, 5(3):257–265, May 1997. [64] K. Kirchhoff. Syllable-level desynchronisation of phonetic features for speech recognition. In Proceedings ICSLP 1996, 1996. [65] K. Kirchhoff. Robust Speech Recognition Using Articulatory Information. PhD thesis, University of Bielefeld, Germany, 1999. [66] K. Kjaerulff. Triangulation of graphs - algorithms giving small total space. Technical Report R90-09, Depart- ment of Mathematics and Computer Science. Aalborg University., 1990. [67] P. Krause. Learning probabilistic networks. Philips Research Labs Tech. Report., 1998. [68] F. R. Kschischang, B. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Trans. Inform. Theory, 47(2):498–519, 2001. [69] S.L. Lauritzen. Graphical Models. Oxford Science Publications, 1996. [70] C.-H. Lee, E. Giachin, L.R. Rabiner, R. Pieraccini, and A.E. Rosenberg. Improved acoustic modeling for speaker independent large vocabulary continuous speech recognition. Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 1991.

UWEETR-2001-0006 83 [71] H. Linhart and W. Zucchini. Model Selection. Wiley, 1986. [72] J. Luettin, G. Potamianos, and C. Neti. Asynchronous stream modeling for large vocabulary audio-visual speech recognition. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 2001. [73] D.J.C. MacKay. Learning in Graphical Models, chapter Introduction to Monte Carlo Methods. Kluwer Aca- demic Publishers, 1998. [74] D. McAllaster, L. Gillick, F. Scattone, and M. Newman. Fabricating conversational speech data with acoustic models: A program to examine model-data mismatch. In ICSLP, 1998. [75] D. McAllaster, L. Gillick, F. Scattone, and M. Newman. Studies with fabricated switchboard data: Exploring sources of model-data mismatch. In Proc. DARPA Workshop Conversational Speech Recognition, Lansdowne, VA, 1998. [76] G.J. McLachlan. Discriminant Analysis and Statistical Pattern Recognition. Wiley Series in Probability and Statistics, 1992. [77] G.J. McLachlan and T. Krishnan. The EM Algorithm and Extensions. Wiley Series in Probability and Statistics, 1997. [78] C. Meek. Causal inference and causal explanation with background knowledge. In Besnard, Philippe and Steve Hanks, editors, Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence (UAI’95), pages 403–410, San Francisco, CA, USA, August 1995. Morgan Kaufmann Publishers. [79] M. Meila.˘ Learning with Mixtures of Trees. PhD thesis, MIT, 1999. [80] H. Meng. The use of distinctive features for automatic speech recognition. Master’s thesis, Massachusetts Institute of Technology, 1991. [81] K. Murphy. The Matlab bayesian network toolbox. http://www.cs.berkeley.edu/˜murphyk/Bayes/bnsoft.html. [82] Y. Normandin. An improved mmie training algorithm for speaker indepedendent, small vocabulary, continuous speech recognition. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 1991. [83] M. Ostendorf. Moving beyond the ‘beads-on-a-string’ model of speech. In Proc. IEEE Automatic Speech Recognition and Understanding Workshop, Keystone, CO, 1999. [84] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 2nd printing edition, 1988. [85] J. Pearl. Causality. Cambridge, 2000. [86] T. Poggio and F. Girosi. Networks for approximation and learning. Proc. IEEE, 78:1481–1497, September 1990. [87] L.R. Rabiner and B.-H. Juang. Fundamentals of Speech Recognition. Prentice Hall Signal Processing Series, 1993. [88] M. Richardson, J. Bilmes, and C. Diorio. Hidden-articulator markov models for speech recognition. In Proc. of the ISCA ITRW ASR2000 Workshop, Paris, France, 2000. LIMSI-CNRS. [89] M. Richardson, J. Bilmes, and C. Diorio. Hidden-articulator markov models: Performance improvements and robustness to noise. In Proc. Int. Conf. on Spoken Language Processing, Beijing, China, 2000. [90] T. S. Richardson. Learning in Graphical Models, chapter Chain Graphs and Symmetric Associations. Kluwer Academic Publishers, 1998. [91] M. Riley and A. Ljolje. Automatic Speech and Speaker Recognition, chapter Automatic generation of detailed pronunciation lexicons. Kluwer Academic Publishers, Boston, 1996.

UWEETR-2001-0006 84 [92] J. Rissanen. Stochastic complexity (with discussions). Journal of the Royal Statistical Society, 49:223–239,252– 265, 1987. [93] M. Saraclar and S. Khudanpur. Properties of pronunciation change in conversational speech recognition. In Proc. NIST Speech Transcription Workshop, College Park, MD, 2000. [94] L.K. Saul, T. Jaakkola, and M.I. Jordan. Mean field theory for sigmoid belief networks. JAIR, 4:61–76, 1996. [95] G. Schwartz. Estimating the dimension of a model. Annals of Statistics, 1978. [96] R.D. Shachter. Bayes-ball: The rational pastime for determining irrelevance and requisite information in belief networks and influence diagrams. In Uncertainty in Artificial Intelligence, 1998. [97] S. Sivadas and H. Hermansky. Hierarchical tandem feature extraction. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 2002. [98] P. Smyth, D. Heckerman, and M.I. Jordan. Probabilistic independence networks for hidden Markov probability models. Technical Report A.I. Memo No. 1565, C.B.C.L. Memo No. 132, MIT AI Lab and CBCL, 1996. [99] CMU : Open source speech recognition. http://www.speech.cs.cmu.edu/sphinx/Sphinx.html. [100] H. Tong. Non-linear Time Series: A Dynamical System Approach. Oxford Statistical Science Series 6. Oxford University Press, 1990. [101] V. Vapnik. Statistical Learning Theory. Wiley, 1998. [102] T. Verma and J. Pearl. Equivalence and synthesis of causal models. In Uncertainty in Artificial Intelligence. Morgan Kaufmann, 1990. [103] T. Verma and J. Pearl. An algorithm for deciding if a set of observed independencies has a causal explanation. In Uncertainty in Artificial Intelligence. Morgan Kaufmann, 1992. [104] Y. Weiss. Correctness of local probability propagation in graphical models with loops. Neural Computation, Submitted. [105] J. Whittaker. Graphical Models in Applied Multivariate Statistics. John Wiley and Son Ltd., 1990. [106] J.G. Wilpon, C.-H. Lee, and L.R. Rabiner. Improvements in connected digit recognition using higher order spectral and energy features. Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 1991. [107] P.C. Woodland and D. Povey. Large scale discriminative training for speech recognition. In ICSA ITRW ASR2000, 2000. [108] S. Young, J. Jansen, J. Odell, D. Ollason, and P. Woodland. The HTK Book. Entropic Labs and Cambridge University, 2.1 edition, 1990’s. [109] G. Zweig. Speech Recognition with Dynamic Bayesian Networks. PhD thesis, U.C. Berkeley, 1998. [110] G. Zweig and M. Padmanabhan. Exact alpha-beta computation in logarithmic space with application to map word graph construction. Int. Conf. on Spoken Lanugage Processing, 2000. [111] G. Zweig and S. Russell. Probabilistic modeling with bayesian networks for automatic speech recognition. Australian Journal of Intelligent Information Processing, 5(4):253–260, 1999.


Figure 68: The JHU WS01 GM Team Graph.
