Discriminatively Structured Graphical Models for Speech Recognition

The Graphical Models Team JHU 2001 Summer Workshop

Jeff A. Bilmes — University of Washington, Seattle Geoff Zweig — IBM Thomas Richardson — University of Washington, Seattle Karim Filali — University of Washington, Seattle Karen Livescu — MIT Peng Xu — Johns Hopkins University Kirk Jackson — DOD Yigal Brandman — Phonetact Inc. Eric Sandness — Speechworks Eva Holtz — Harvard University Jerry Torres — Stanford University Bill Byrne — Johns Hopkins University

UWEE Technical Report Number UWEETR-2001-0006 November 2001

Department of Electrical Engineering
University of Washington
Box 352500
Seattle, Washington 98195-2500
PHN: (206) 543-2150
FAX: (206) 543-3842
URL: http://www.ee.washington.edu

Abstract

In recent years there has been growing interest in discriminative parameter training techniques, resulting from notable improvements in speech recognition performance on tasks ranging in size from digit recognition to Switchboard. Typified by Maximum Mutual Information (MMI) or Minimum Classification Error (MCE) training, these methods assume a fixed statistical modeling structure, and then optimize only the associated numerical parameters (such as means, variances, and transition matrices). Such is also the state of typical structure learning and model selection procedures in statistics, where the goal is to determine the structure (edges and nodes) of a graphical model (and thereby the set of conditional independence statements) that best describes the data. This report describes the process and results from the 2001 Johns Hopkins summer workshop on graphical models. Specifically, in this report we explore the novel and significantly different methodology of discriminative structure learning. Here, the fundamental dependency relationships between random variables in a probabilistic model are learned in a discriminative fashion, and are learned separately and in isolation from the numerical parameters. The resulting independence properties of the model might in fact be wrong with respect to the true model, but are made only for the sake of optimizing classification performance. In order to apply the principles of structural discriminability, we adopt the framework of graphical models, which allows an arbitrary set of random variables and their conditional independence relationships to be modeled at each time frame. We also, in this document, describe and present results using a new graphical modeling toolkit (GMTK). Using GMTK and discriminative structure learning heuristics, the results presented herein indicate that significant gains result from discriminative structural analysis of both conventional MFCC and novel AM-FM features on the Aurora continuous digits task. Lastly, we also present results using GMTK on several other tasks: an IBM audio-video corpus, preliminary results on the SPINE-1 data set using hidden noise variables, hidden articulatory modeling, and interpolated language models represented by graphs within GMTK.

Contents

1 Introduction

2 Overview of Graphical Models (GMs)
  2.0.1 Semantics
  2.0.2 Structure
  2.0.3 Implementation
  2.0.4 Parameterization
  2.1 Efficient Probabilistic Inference

3 Graphical Models for Automatic Speech Recognition

4 Structural Discriminability: Introduction and Motivation

5 Explicit vs. Implicit GM-structures for Speech Recognition
  5.1 HMMs and Graphical Models
  5.2 A more explicit structure for Decoding
  5.3 A more explicit structure for training
  5.4 Rescoring
  5.5 Graphical Models and Stochastic Finite State Automata

6 GMTK: The graphical models toolkit
  6.1 Toolkit Features
    6.1.1 Explicit vs. Implicit Modeling
    6.1.2 The GMTKL Specification Language
    6.1.3 Inference
    6.1.4 Logarithmic Space Computation
    6.1.5 Generalized EM
    6.1.6 Sampling
    6.1.7 Switching Parents
    6.1.8 Discrete Conditional Probability Distributions
    6.1.9 Graphical Continuous Conditional Distributions

7 The EAR Measure and Discriminative Structure Learning Heuristics
  7.1 Basic Set-Up
  7.2 Selecting the optimal number of parents for an acoustic feature X
  7.3 The EAR criterion
  7.4 Class-specific EAR criterion
  7.5 Optimizing the EAR criterion: heuristic search
  7.6 Approximations to the EAR measure
    7.6.1 Scalar approximation 1
    7.6.2 Scalar approximation 2
    7.6.3 Scalar approximation 3
  7.7 Conclusion

8 Visualization of Mutual Information and the EAR measure
  8.1 MI/EAR: Aurora 2.0 MFCCs
  8.2 MI/EAR: IBM A/V Corpus, LDA+MLLT Features
  8.3 MI/EAR: Aurora 2.0 AM/FM Features
  8.4 MI/EAR: SPINE 1.0 Neural Network Features

9 Visualization of Dependency Selection

10 Corpora description and word error rate (WER) results
  10.1 Experimental Results on Aurora 2.0
    10.1.1 Baseline GMTK vs. HTK result
    10.1.2 A simple GMTK Aurora 2.0 noise clustering experiment
    10.1.3 Aurora 2.0 different features
    10.1.4 Mutual Information Measures
    10.1.5 Induced Structures
    10.1.6 Improved Word Error Rate Results
  10.2 Experimental Results on IBM Audio-Visual (AV) Database
    10.2.1 Experiment Framework
    10.2.2 Baseline
    10.2.3 IBM AV Experiments in WS’01
    10.2.4 Experiment Framework
    10.2.5 Matching Baseline
    10.2.6 GMTK Simulating an HMM
    10.2.7 EAR Measure of Audio Features
  10.3 Experimental Results on SPINE-1: Hidden Noise Variables

11 Articulatory Modeling with GMTK
  11.1 Articulatory Models of Speech
  11.2 The Articulatory Feature Set
  11.3 Representing Speech with Articulatory Features
  11.4 Articulatory Graphical Models for Automatic Speech Recognition: Workshop Progress

12 Other miscellaneous workshop accomplishments
  12.1 GMTK Parallel Training Facilities
    12.1.1 Parallel Training: emtrain parallel
    12.1.2 Parallel Viterbi Alignment: Viterbi align parallel
    12.1.3 Example emtrain parallel header file
    12.1.4 Example Viterbi align parallel header file
  12.2 The Mutual Information Toolkit
    12.2.1 Mutual Information and Entropy
    12.2.2 Toolkit Description
    12.2.3 EM for MI estimation
    12.2.4 Conditional entropy
  12.3 Graphical Model Representations of Language Model Mixtures
    12.3.1 Graphical Models for Language Model Mixtures
    12.3.2 Perplexity Experiments
    12.3.3 Perplexity Results
    12.3.4 Conclusions

13 Future Work and Conclusions
  13.1 Articulatory Models
    13.1.1 Additional Structures
    13.1.2 Computational Issues
  13.2 Structural Discriminability
  13.3 GMTK

14 The WS01 GM-ASR Team

1 Introduction

In this report, we describe the results from the Johns Hopkins workshop that took place during a 6-week period in the summer of 2001. During this time, novel research was performed in several different areas. These areas include: 1) graphical models and their application to speech recognition; 2) the design, development, and testing of a new graphical-model based toolkit for this purpose, allowing the rapid exploration of a wide variety of different models for ASR; 3) the exploration of a new method to discriminatively construct the graphical model structure (the nodes and edges of the graph); 4) the application of graphical models and structure learning to novel speech features, in particular standard MFCCs and novel amplitude and frequency modulation features; 5) the initial evaluation of graphical models on three data sets: the Aurora 2.0 noisy speech corpus, an IBM audio-visual corpus, and SPINE-1, the DARPA speech in noisy environments corpus; 6) the beginnings of the use of graphical models to represent anatomically correct articulatory-based speech recognition; 7) the beginnings of the development of a software toolkit for computing mutual information and related quantities on large data sets, which is used both to visualize dependencies in these data sets and to help determine graphical model structures; and 8) the application of graphical models and GMTK to the problem of simple smoothed language models. As a main goal of the workshop, the graphical model methodology developed attempted to optimize the structure of the network so as to improve classification performance (e.g., speech recognition error) rather than simply to better describe or improve the likelihood of the data. This document outlines the theory and describes the methodology and results, both positive and negative, from the 6 weeks of the workshop. Broadly, this report is organized as follows. In Section 2, we first provide a general introduction to graphical models, and briefly introduce the notation we will use throughout this document. Section 3 provides a broad overview and introduction on how graphical models are a promising approach to the speech recognition task. Section 4 provides an introduction and motivation for one of the main goals of the workshop, that of structural discriminability. This section provides a number of intuitive examples of why such an approach should yield improved performance, even when discriminative parameter training methods are not used. Section 5 describes in more detail the various ways in which graphical models can be used for speech recognition systems, namely the explicit vs. the implicit approach, and the various trade-offs between the two. Section 6 provides an overview of GMTK, the graphical models toolkit, software developed for use at the workshop that allows the rapid use of graphical models for language, speech, and other time-series processes. Section 7 develops a specific method to form structurally discriminative networks, and describes and provides a new derivation for the EAR measure, a quantity that is useful for this purpose. Section 8 provides a number of examples of the visualization both of conditional mutual information and of the EAR measure on a number of different corpora and speech feature sets. Section 9 contains the visualization of the result of using the EAR measure to induce discriminative structure on the three corpora that were used in this study.
Section 10 describes in more detail the three corpora that were used, baseline and other results, and improved results using structure determination. Section 11 describes articulatory-based speech recognition (another of the workshop goals) and how GMTK can be used to represent hidden articulatory models for speech recognition. Section 12 describes a number of other workshop accomplishments, including: 1) the GMTK parallel training/testing scripts developed at the workshop (Section 12.1); 2) the beginnings of the development of the mutual-information toolkit (Section 12.2), which can be used to compute general mutual information quantities between discrete and continuous random variables; and 3) the beginnings of the application of graphical models to representing mixtures of different-order language models (Section 12.3). Section 13 concludes and describes future work, and lastly Section 14 describes the WS01 GM team.

2 Overview of Graphical Models (GMs)

Broadly speaking, graphical models (GMs) offer two primary features to those interested in working with statistical systems. On the one hand, a GM may be viewed as an abstract, formal, and visual language that can depict important properties (conditional independence) of natural systems and signals when described by multi-variate random processes. There are mathematically precise rules that describe what a given graph means, rules which associate with a graph a family of probability distributions. Natural signals (those which are not purely random) have significant statistical structure, and this can occur at multiple levels of granularity. Graphs can show anything from causal relations between high-level concepts [84] down to the fine-grained dependencies existing within the neural code [3]. On the other hand, along with GMs come a set of algorithms for efficiently performing probabilistic inference and decision

making. Although probabilistic inference is typically intractable, GM inference procedures and their approximations exploit the inherent structure in a graph in a way that can significantly reduce computational and memory demands, thereby making probabilistic inference as fast as possible. Simply put, graphical models in one way or another describe conditional independence properties amongst collections of random variables. A given GM is identical to a list of conditional independence statements, and a graph represents all distributions for which all these independence statements are true. A random variable X is conditionally independent of a different random variable Y given a third random variable Z under a given probability distribution p(·) if the following relation holds:

p(X = x, Y = y|Z = z) = p(X = x|Z = z)p(Y = y|Z = z)

for all x, y, and z. This is written X⊥⊥Y |Z and it is said that “X is independent of Y given Z under p(·)”. This has the following intuitive interpretation: if one has knowledge of Z, then knowledge of Y does not change one’s knowledge of X and vice versa. Conditional independence is different from unconditional (or marginal) independence. That is, X⊥⊥Y neither implies nor is implied by X⊥⊥Y |Z. Conditional independence is a powerful concept — using conditional independence, a statistical model can undergo enormous simplifications. Moreover, even though conditional independence might not hold for certain signals, making such assumptions might yield vast improvements because of computational, data-sparsity, or task-specific reasons (e.g., consider the hidden Markov model with assumptions which obviously do not hold for speech [6], but which nonetheless empirically appear to be somewhat benign, and at times even helpful as described in Section 4 and [7]). Formal properties of conditional independence, and many other equivalent mathematical formulations, are described in [69, 84]. A GM [69, 25, 105, 84, 60] is a graph G = (V,E) where V is a set of vertices (also called nodes or random variables) and the set of edges E is a subset of the set V × V . The graph describes an entire family of probability distributions over the variables V . A variable can either be scalar- or vector-valued, where in the latter case the vector variable implicitly corresponds to a sub-graphical model over the elements of the vector. The edges E, depending on the graph semantics (see below), specify a set of conditional independence properties over the random variables. The properties specified by the GM are true for all members of its associated family. Four items must be specified when using a graph to describe a particular probability distribution [11]: the GM semantics, structure, implementation, and parameterization. The semantics and the structure of a GM are inherent to the graph itself, while the implementation and parameterization are implicit within the underlying model. Each of these is now described in turn.

2.0.1 Semantics

There are many types of GMs, each one with differing semantics. The set of conditional independence assumptions specified by a particular GM, and therefore the family of probability distributions it represents, will be different depending on the type of GM currently being considered. The semantics specifies a set of rules about what is or is not a valid graph and what set of distributions correspond to a given graph. Various types of GMs include directed models (or Bayesian networks) [84, 60]¹, undirected networks (or Markov random fields) [19], factor graphs [40, 68], chain graphs [69, 90] which are combinations of directed and undirected GMs, causal models [85], dependency networks [52], and many others. When the semantics of a graph change, the family of distributions it represents also changes, but overlap can exist between certain families (i.e., there might be a probability distribution that has a representation by two different types of graphical model). This also means that the same exact graph (i.e., the actual graphical picture) might represent very different families of probabilities depending on the current semantics. Therefore, when using a GM, it is critical to first agree upon the semantics that is currently being used. A Bayesian network (BN) [84, 60, 51] is one type of GM where the graph edges are directed and acyclic. In a BN, edges point from parent to child nodes, and the graph implicitly spells out a factorization that is a simplification of the chain rule of probability, namely:

    p(X_{1:N}) = \prod_i p(X_i | X_{1:i-1}) = \prod_i p(X_i | X_{\pi_i}).

¹ Note that the name “Bayesian network” does not imply Bayesian statistical inference. In fact, both Bayesian and non-Bayesian Bayesian networks may exist.

The first equality is the probabilistic chain rule, and the second equality holds under a particular BN, where π_i designates node i’s parents according to the BN. A probability distribution that is represented by a given BN will factorize with respect to that BN, and this is called the directed factorization property [69]. A Dynamic Bayesian Network (DBN) [29, 46, 43, 109] has exactly the same semantics as a BN, but is structured to have a sequence of clusters of connected vertices, where edges between clusters point only in the direction of increasing time. DBNs are particularly useful to describe time signals such as speech. GMTK, in fact, is a general tool that allows users to experiment with DBNs. Several equivalent schemata exist that formally define a BN’s conditional independence relationships [69, 84, 60]. The idea of d-separation (or directed separation) is perhaps the most widely known: a set of variables A is conditionally independent of a set B given a set C if A is d-separated from B by C. D-separation holds if and only if all paths that connect any node in A and any other node in B are blocked. A path is blocked if it has a node v along the path such that either: 1) the arrows along the path do not converge at v (i.e., they are serial or diverging at v) and v ∈ C, or 2) the arrows along the path do converge at v, and neither v nor any descendant of v is in C. From d-separation, one may “read off” a list of conditional independence statements from a graph. The set of probability distributions for which this list of statements is true is precisely the set of distributions represented by the graph. Graph properties equivalent to d-separation include the directed local Markov property [69] (a variable is conditionally independent of its non-descendants given its parents), factorization according to the graph, and the Bayes-ball procedure [96] (shown in Figure 1).

Figure 1: The Bayes-ball procedure makes it easy to answer questions about a given BN such as “is XA⊥⊥XB|XC ?”, where XA, XB, and XC are disjoint sets of nodes in a graph. The answer is true if and only if an imaginary ball, bouncing from node to node along the edges in the graph and starting at any node in XA, cannot reach any node in XB. The ball must bounce according to the rules depicted in the figure. Only the nodes in XC are shaded. A ball may bounce through a node to another node depending both on its shading and the direction of its edges. The dashed arrows depict whether a ball, when attempting to bounce through a given node, may bounce through that node or if it is blocked and must bounce back to the beginning.
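The d-separation test above can also be carried out mechanically. The following is a minimal Python sketch (not part of the report’s software; the example graph and queries are invented for illustration) that uses a standard equivalent formulation of d-separation: restrict the DAG to the ancestors of the queried sets, moralize it, remove the conditioning nodes, and check whether any undirected path remains between the two sets.

    from itertools import combinations

    def ancestors(dag, nodes):
        # All ancestors of `nodes` (including the nodes themselves) in a DAG
        # given as a dict {child: set_of_parents}.
        result, stack = set(nodes), list(nodes)
        while stack:
            for parent in dag.get(stack.pop(), set()):
                if parent not in result:
                    result.add(parent)
                    stack.append(parent)
        return result

    def d_separated(dag, A, B, C):
        # True iff A is d-separated from B given C.  Classical equivalent test:
        # keep only the ancestral set of A, B, and C, moralize it (marry
        # co-parents, drop directions), delete C, and check that no undirected
        # path connects A to B.
        keep = ancestors(dag, set(A) | set(B) | set(C))
        edges = set()
        for child in keep:
            parents = dag.get(child, set()) & keep
            edges |= {frozenset((child, p)) for p in parents}
            edges |= {frozenset(pair) for pair in combinations(parents, 2)}
        seen, frontier = set(A), [n for n in A if n not in C]
        while frontier:
            n = frontier.pop()
            if n in B:
                return False                  # an active path was found
            for e in edges:
                if n in e:
                    (m,) = e - {n}
                    if m not in seen and m not in C:
                        seen.add(m)
                        frontier.append(m)
        return True

    # Collider example:  A -> X <- Z,  and  X -> Y.
    dag = {'X': {'A', 'Z'}, 'Y': {'X'}}
    print(d_separated(dag, {'A'}, {'Z'}, set()))   # True:  A and Z are d-separated
    print(d_separated(dag, {'A'}, {'Z'}, {'Y'}))   # False: conditioning on a
                                                   # descendant of the collider X
                                                   # opens the path

For the collider A → X ← Z, the sketch reports that A and Z are d-separated marginally but not once the descendant Y is conditioned on, in agreement with rule 2) above.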

Conditional independence properties in undirected graphical models (UGMs) are much simpler than for BNs, and are specified using graph separation. For example, assuming that XA, XB, and XC are disjoint sets of nodes in an UGM, XA⊥⊥XB|XC is true when all paths from any node in XA to any node in XB intersect some node in XC . In a UGM, a distribution may be described as a factorization of potential functions over the cliques in the graph. BNs and UGMs are not the same, meaning that they correspond to different families of probability distributions. Despite the fact that BNs have more complicated semantics, they are useful for a variety of reasons. One is that BNs can have a causal interpretation, where if node A is a parent of B, A can be thought of as a cause of B. A second reason is that the family of distributions associated with BNs is not the same as the family associated with UGMs — there are some useful probability models, for example, that are concisely representable with BNs but which are not representable at all with UGMs (and vice versa). UGMs and BNs do have an overlap, however, and the family of distributions corresponding to this intersection is known as the decomposable models [69]. These models have important properties relating to efficient probabilistic inference and graph type (namely, triangulated graphs and the existence of a junction tree). In general, a lack of an edge between two nodes does not imply that the nodes are independent. The nodes might be able to influence each other indirectly via an indirect path. Moreover, the existence of an edge between two nodes does not imply that the two nodes are necessarily dependent — the two nodes could still be independent for certain parameter values or under certain conditions (e.g., zeros in the parameters, see later sections). A GM guarantees

only that the lack of an edge implies some conditional independence property, determined according to the graph’s semantics. It is therefore best, when discussing a given GM, to refer only to its (conditional) independence rather than its dependence properties. If one must refer to a directed dependence between A and B, it is perhaps better to say simply that there is an edge (directed or otherwise) between A and B. Originally, BNs were designed to represent causation, but more recently, models with semantics [85] that more precisely represent causality have been defined. Other directed graphical models have been designed as well [52], and can be thought of as belonging to the general family of directed graphical models (DGMs).


Figure 2: This figure shows three BNs with different arrow directions over the same random variables, A, B, and C. On the left side, the variables form a three-variable first-order Markov chain A → B → C. In the middle graph, the same conditional independence statement is realized even though one of the arrow directions has been reversed. Both these networks state that A⊥⊥C|B. These two networks do not, however, insist that A and B are dependent. The right network corresponds to the property A⊥⊥C, but it does not imply that A⊥⊥C|B.
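As a quick numerical check of the statement encoded by the left graph of Figure 2, the following sketch (CPT values invented for illustration) constructs a distribution that factorizes according to the chain A → B → C and verifies that A⊥⊥C|B holds while marginal independence of A and C does not follow.

    import numpy as np

    # Markov chain A -> B -> C with invented CPTs:
    # p(a, b, c) = p(a) p(b | a) p(c | b).
    p_a = np.array([0.6, 0.4])
    p_b_given_a = np.array([[0.7, 0.3],   # row a=0: p(B | A=0)
                            [0.2, 0.8]])  # row a=1: p(B | A=1)
    p_c_given_b = np.array([[0.9, 0.1],
                            [0.4, 0.6]])

    joint = np.einsum('a,ab,bc->abc', p_a, p_b_given_a, p_c_given_b)

    # A indep C given B:  p(a, c | b) should equal p(a | b) p(c | b) for every b.
    p_ab = joint.sum(axis=2)                      # p(a, b)
    p_b = p_ab.sum(axis=0)                        # p(b)
    p_ac_given_b = joint / p_b[None, :, None]     # indexed [a, b, c]
    p_a_given_b = p_ab / p_b[None, :]
    factored = p_a_given_b[:, :, None] * p_c_given_b[None, :, :]
    print(np.allclose(p_ac_given_b, factored))    # True: A indep C given B

    # Marginal independence of A and C does NOT follow from the graph.
    p_ac = joint.sum(axis=1)
    print(np.allclose(p_ac, np.outer(p_ac.sum(axis=1), p_ac.sum(axis=0))))  # False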

2.0.2 Structure

A graph’s structure, the set of nodes and edges, determines the set of conditional independence properties for the graph under a given semantics. Note that more than one GM might correspond to exactly the same conditional independence properties even though their structure is entirely different. In this case, multiple very different looking graphs correspond to the same family of probability distributions. In such cases, the various GMs are said to be Markov equivalent [102, 103, 53]. In general, it is not immediately obvious with large complicated graphs how to quickly visually determine if Markov equivalence holds, but algorithms are available which can determine the members of an equivalence class [102, 103, 78, 21]. Nodes in a graphical model can be either observed or hidden. If a variable is observed, it means that its value is known, or that data (or “evidence”) is available for that variable. If a variable is hidden, it currently does not have a known value, and all that is available is the conditional distribution of the hidden variables given the observed variables (if any). Hidden nodes are also called confounding, latent, or unobserved variables. Hidden Markov models are so named because they possess a Markov chain that, in many applications, contains only hidden variables. A node may switch roles, and may sometimes be hidden and at other times be observed. With an HMM, for example, the “hidden” chain might be observed during training (because a phonetic or state-level alignment has been provided) and hidden during recognition (because the hidden variable values are not known for test speech data). When making the query “is A⊥⊥B|C?”, it is implicitly assumed that C is observed. A and B are the nodes being queried, and any other nodes in the network not listed in the query are considered hidden. Also, when a collection of sampled data exists (say as a training set), some of the data samples might have missing values, each of which would correspond to a hidden variable. The EM algorithm [30], for example, can be used to train the parameters of hidden variables. Hidden variables and their edges reflect a belief about the underlying generative process lying behind the phenomenon that is being statistically represented. This is because the data for these hidden variables is either unavailable, is too costly or impossible to obtain, or might even not exist since the hidden variables might only be hypothetical (e.g., specified by hand based on human-acquired knowledge or hypotheses about the underlying domain). Hidden variables can be used to indicate the underlying causes behind an information source. In speech, for example, hidden variables can be used to represent the phonetic or articulatory gestures, or more ambitiously, the originating semantic thought behind a speech waveform.

Certain GMs allow for what are called switching dependencies [45, 79, 11]. In this case, edges in a GM can change as a function of other variables in the network. An important advantage of switching dependencies is a reduction in the number of parameters needed by the model. Switching dependencies are also used in a new graphical model-based toolkit for ASR [9] (see Section 6). A related construct allows GMs to have optimized local probability implementations [42]. It is sometimes the case that certain observed variables are only used as conditional variables. For example, consider the graph B → A which implies a factorization of the joint distribution P(A, B) = P(A|B)P(B). In many cases, it is not necessary to represent the marginal distribution over B. In such cases B is a “conditional-only” variable, meaning it appears always and only to the right of the conditioning bar. In this case, the graph represents P(A|B). This can be useful in a number of cases including classification (or discriminative modeling), where we might only be interested in posterior distributions over the class random variable, or in situations where additional observations, say Z, exist which might be marginally independent of a class variable, say C, but which, conditioned on other observations, say X, are dependent. This can be depicted by the graph C → X ← Z, where it is assumed that the distribution over Z is not represented. Often, the true (or the best) structure for a given task is unknown. This can mean that either some of the edges or nodes (which can be hidden) or both can be unknown. This has motivated research on learning the structure of the model from the data, with the general goal of producing a structure that accurately reflects the important statistical properties that exist in the data set. These approaches can take a Bayesian [51, 53] or frequentist point of view [17, 67, 51]. Structure learning is akin to both statistical model selection [71, 18] and data mining [28]. Several good reviews of structure learning are presented in [17, 67, 51]. Structure learning from a discriminative perspective, thereby producing what could be called discriminative generative models, was proposed in [6]. In this report, in fact, a method termed structural discriminability is given an initial evaluation. In contrast to typical structure learning in graphical models, structural discriminability is an attempt to, within the space of graph structures, find one that performs best at the classification task. This implies that certain dependency statements might be made by the model which are in general not true in the data, and are made only for the sake of classification accuracy. More on this is described in Sections 4 and 7. Figure 3 depicts a topological hierarchy of both the semantics and structure of GMs, and shows where different models fit into place.

[Figure 3 here depicts a tree rooted at “Graphical Models,” branching by semantics (causal models, chain graphs, DGMs, UGMs, dependency networks, and other semantics) and then by structure into specific models such as Bayesian networks, DBNs, HMMs, mixture models, decision trees, LDA, PCA, Kalman filters, factorial and mixed-memory Markov models, segment models, FSTs, simple MRFs, Gibbs/Boltzmann distributions, and BMMs.]
Figure 3: A topology of graphical model semantics and structure
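As a small illustration of the switching-dependency idea mentioned above, the hypothetical sketch below (not GMTK code; the variables and tables are invented) lets a switching variable S select which of two candidate parents, A or B, the child X actually depends on.

    import numpy as np

    rng = np.random.default_rng(0)

    # Candidate conditional tables for a binary child X; which one is used is
    # selected by the switching variable S.  All numbers are invented.
    p_x_given_a = np.array([[0.9, 0.1],     # p(X | A=0)
                            [0.3, 0.7]])    # p(X | A=1)
    p_x_given_b = np.array([[0.5, 0.5],     # p(X | B=0)
                            [0.05, 0.95]])  # p(X | B=1)

    def sample_x(s, a, b):
        # If S=0, X depends only on A; if S=1, X depends only on B.
        table = p_x_given_a[a] if s == 0 else p_x_given_b[b]
        return rng.choice(2, p=table)

    # The switching implementation needs |A|*|X| + |B|*|X| parameters instead
    # of a full table over all parents (S, A, B) jointly.
    print(sample_x(s=0, a=1, b=0))   # drawn from p(X | A=1)
    print(sample_x(s=1, a=1, b=0))   # drawn from p(X | B=0)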

2.0.3 Implementation

When two nodes are connected by a dependency edge, the local conditional probability representation of that dependency may be called its implementation. A dependence of a variable X on Y can occur in a number of ways depending on whether the variables are discrete or continuous. For example, one might use discrete conditional probability

tables (CPTs), compressed tables [42], decision trees, or even a deterministic function (in which case GMs may represent data-flow [1] graphs, or may represent channel coding algorithms [40]). GMTK, described in Section 6, makes heavy use of deterministic dependencies. A node in a GM can also depict a constant input parameter since random variables can themselves be constants. Alternatively, the dependence might be linear regression models, mixtures thereof, or non-linear regression (such as a multi-layered perceptron [14], or a STAR [100] or MARS [41] model). In general, different edges in a graph will have different implementations. In UGMs, conditional distributions are not represented explicitly. Rather, a joint distribution over all the nodes in the graph is specified with a product of what are called “potential” functions over cliques in the graph. In general the clique potentials could be anything, although particular types are commonly used (such as Gibbs or Boltzmann distributions [54]). Many such models fall under what are known as exponential models [34]. The implementation of a dependency in an UGM is implicitly specified via these functions in that they specify the way in which one variable can influence the resulting probabilities for other random variable values.
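To make the notion of an implementation concrete, the sketch below (hypothetical; not drawn from GMTK) gives three different local implementations of a single dependency of X on Y: a discrete CPT, a deterministic function, and a linear-Gaussian regression.

    import random

    random.seed(0)

    # (1) Discrete CPT: p(X | Y) stored as an explicit table (each row sums to one).
    cpt = {0: [0.8, 0.2],    # p(X | Y=0)
           1: [0.1, 0.9]}    # p(X | Y=1)

    def sample_x_cpt(y):
        return 0 if random.random() < cpt[y][0] else 1

    # (2) Deterministic implementation: X is a fixed function of Y
    #     (equivalently, a CPT containing only zeros and ones).
    def sample_x_deterministic(y):
        return 1 - y

    # (3) Linear-Gaussian regression: continuous X with mean a*y + b and
    #     standard deviation sigma.
    def sample_x_linear_gaussian(y, a=2.0, b=-1.0, sigma=0.5):
        return a * y + b + random.gauss(0.0, sigma)

    for y in (0, 1):
        print(y, sample_x_cpt(y), sample_x_deterministic(y),
              round(sample_x_linear_gaussian(y), 3))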

2.0.4 Parameterization

The parameterization of a model corresponds to the parameter values of a particular implementation in a particular structure. For example, with linear regression, the parameters are simply the regression coefficients; for a discrete probability table the parameters are the table entries. Since parameters of distributions, being themselves potentially random, can be seen as nodes, Bayesian approaches may easily be represented with GMs [51]. Many algorithms exist for training the parameters of a graphical model. These include maximum likelihood [34] such as the EM algorithm [30], discriminative or risk minimization approaches [101], gradient descent [14], sampling approaches [73], and general non-linear optimization [38]. The choice of algorithm depends both on the structure and implementation of the GM. For example, if there are no hidden variables, an EM approach is not required. Certain structural properties of the GM might render certain training procedures less crucial to the performance of the model [11].
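For a discrete-table implementation with fully observed data, maximum likelihood estimation of the parameters (the table entries) reduces to normalized counting, as in the toy sketch below (the data are invented for illustration).

    from collections import Counter

    # Fully observed (y, x) samples.  With no hidden variables, maximum
    # likelihood estimation of the table p(X | Y) is normalized counting.
    data = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 0), (0, 0), (1, 1)]

    pair_counts = Counter(data)
    y_counts = Counter(y for y, _ in data)

    cpt = {(y, x): pair_counts[(y, x)] / y_counts[y] for (y, x) in pair_counts}

    for (y, x), p in sorted(cpt.items()):
        print(f"p(X={x} | Y={y}) = {p:.3f}")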

2.1 Efficient Probabilistic Inference

A key application of any statistical model is to compute the probability of one subset of random variables given values for some other subset, a procedure known as probabilistic inference. Inference is essential both to make predictions based on the model and to learn the model parameters using, for example, the EM algorithm [30, 77]. One of the critical advantages of GMs is that they offer procedures for making exact inference as efficient as possible, much more so than if conditional independence is ignored or is used unwisely. And if the resulting savings are not enough, there are GM-inspired approximate inference algorithms that are still more efficient.


Figure 4: The graph’s independence properties are used to move sums inside of factors.

Exact inference can in general be quite computationally costly. For example, suppose there is a joint distribution over 6 variables p(a, b, c, d, e, f) and the goal is to compute p(a|f). This requires both p(a, f) and p(f), so the variables b, c, d, e must be “marginalized”, or integrated away, to form p(a, f). The naive way of performing this computation would entail the following sum:

    p(a, f) = \sum_{b,c,d,e} p(a, b, c, d, e, f)

Supposing that each variable has K possible values, this computation requires O(K^6) operations, a quantity which is exponential in the number of variables in the joint distribution. If, on the other hand, it was possible to factor the joint distribution into factors containing fewer variables, it would be possible to reduce computation significantly. For example, under the graph in Figure 4, the above distribution may be factored as follows:

p(a, b, c, d, e, f) = p(a|b)p(b|c)p(c|d, e)p(d|e, f)p(e|f)p(f)

so that the sum

    p(a, f) = p(f) \sum_b p(a|b) \sum_c p(b|c) \sum_d \sum_e p(c|d, e)\, p(d|e, f)\, p(e|f)

requires only O(K^3) computation. Inference in GMs involves formally defined manipulations of graph data structures and then operations on those data structures. These operations provably correspond to valid operations on probability equations, and they reduce computation essentially by moving sums, as in the above, as far to the right as possible in these equations. The graph operations and data structures needed for inference are typically described in their own light, without needing to refer back to the original probability equations. One well-known inference procedure, for example, is the junction tree (JT) algorithm [84, 60]. In fact, the commonly used forward-backward algorithm [87] for hidden Markov models is just a special case of the junction tree algorithm [98], which is, in turn, a special case of the generalized distributive law [2]. The JT algorithm requires that the original graph be converted into a junction tree, a tree of cliques with each clique containing nodes from the original graph. A junction tree possesses the running intersection property, where the intersection between any two cliques in the tree is contained in all cliques on the (necessarily) unique path between those two cliques. The junction tree algorithm itself can be viewed as a series of messages passed between the connected cliques of the junction tree. These messages ensure that neighboring cliques are locally consistent (i.e., that the neighboring cliques have identical marginal distributions on those variables that they have in common). If the messages are passed in an order that obeys a particular protocol, called the message passing protocol, then because of the properties of the junction tree, local consistency guarantees global consistency, meaning that the marginal distributions on all common variables in all cliques in the graph are identical, and this in turn guarantees that inference is correct. Because only local operations are required in the procedure, inference can thus be much faster than if the equations were manipulated naively. For the junction tree algorithm to be valid, however, a decomposable model must first be formed from the original graph. Junction trees exist only for decomposable models, and a message passing algorithm can provably be shown to yield correct probabilistic inference only in that case. It is often the case, however, that a given DGM or UGM is not decomposable. In such a case it is necessary to form a decomposable model from the general GM (directed or otherwise), and in doing so make fewer conditional independence assumptions. Inference is then solved for this larger family of models. Solving inference for a larger family of course means that inference has also been solved for the smaller family corresponding to the original (possibly) non-decomposable model. Two operations are needed to transform a general DGM (Bayesian network) into a decomposable model: moralization and triangulation. Moralization joins the unconnected parents of all nodes and then drops all edge directions. This procedure is valid because more edges mean fewer conditional independence assumptions, or a larger family of probability distributions. Moralization is required to ensure that the resulting UGM does not violate any of the conditional independence assumptions made by the original DGM.
In other words, after moralizing, it is assured that the UGM will make no independence assumption that is not made by the original DGM. If such an invalid independence assumption were made, then the inference algorithm could easily be incorrect. After moralization, or if starting from a UGM to begin with, triangulation is necessary to produce a decomposable model. The set of all triangulated graphs corresponds exactly to the set of decomposable models. The triangulation operation [84, 69] adds edges until all cycles in the graph with non-consecutive nodes (along the cycle) have a connected pair. Triangulation is valid because more edges enlarge the set of distributions represented by the graph. Triangulation is necessary because only for triangulated (or decomposable) graphs do junction trees exist. A good survey of triangulation techniques is given in [66]. Finally, a junction tree is formed from the triangulated graph by first forming all maximum cliques in the graph, next connecting all of the cliques together into a “super” or “hyper” graph, and finally finding a maximum spanning tree [24] amongst that graph of maximum cliques. In this case, the weight of an edge between two cliques is set to the number of variables in the intersection of the two cliques. Note that there are several ways of forming a junction tree from a graph; the method described above is only one of them. For a discrete-node-only network, probabilistic inference using the junction tree algorithm has complexity O(\sum_{c \in C} \prod_{v \in c} |v|), where C is the set of cliques in the junction tree, c is the set of variables contained within a clique, and |v| is the number of possible values of variable v. The algorithm is exponential in the clique sizes, a quantity important to minimize during triangulation. There are many ways to triangulate [66], and unfortunately the operation of finding the optimal triangulation (the one with the smallest cliques) is itself NP-hard. For an HMM, the clique sizes are fixed at two, so the complexity is N^2 where N is the number of HMM states, and there are

T cliques, leading to the well-known O(TN^2) complexity for HMMs. Further information on the junction tree and related algorithms can be found in [60, 84, 25, 61]. Exact inference, such as the above, is useful only for moderately complex networks since inference is NP-hard in general [23]. Approximate inference procedures can, however, be used when exact inference is not feasible. There are several approximation methods including variational techniques [94, 58, 62], Monte Carlo sampling methods [73], and loopy belief propagation [104]. Even approximate inference can be NP-hard, however [27]. Therefore, it is always important to use a minimal model, one with the least possible complexity that still accurately represents the important aspects of a task. The complete study of graphical models takes much time and effort, and this brief survey is nowhere close to complete. For further and more complete information, see the references mentioned above.
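To make the savings in the six-variable example above concrete, the following sketch (with random, invented CPTs) computes p(a, f) both by naive summation over b, c, d, e and by pushing the sums inside the factorization of Figure 4, and verifies that the two computations agree.

    import numpy as np

    rng = np.random.default_rng(0)
    K = 4

    def cpt(*shape):
        # Random conditional table, normalized over its first axis.
        t = rng.random(shape)
        return t / t.sum(axis=0, keepdims=True)

    # Factors of p(a,b,c,d,e,f) = p(a|b) p(b|c) p(c|d,e) p(d|e,f) p(e|f) p(f).
    p_a_b  = cpt(K, K)       # indexed [a, b]
    p_b_c  = cpt(K, K)       # [b, c]
    p_c_de = cpt(K, K, K)    # [c, d, e]
    p_d_ef = cpt(K, K, K)    # [d, e, f]
    p_e_f  = cpt(K, K)       # [e, f]
    p_f    = cpt(K)          # [f]

    # Naive marginalization: build the full K^6 joint, then sum out b, c, d, e.
    joint = np.einsum('ab,bc,cde,def,ef,f->abcdef',
                      p_a_b, p_b_c, p_c_de, p_d_ef, p_e_f, p_f)
    p_af_naive = joint.sum(axis=(1, 2, 3, 4))

    # Pushing the sums inside: only small intermediate factors are ever formed.
    m_e = np.einsum('cde,def,ef->cdf', p_c_de, p_d_ef, p_e_f)  # sum over e
    m_d = m_e.sum(axis=1)                                      # sum over d -> [c, f]
    m_c = np.einsum('bc,cf->bf', p_b_c, m_d)                   # sum over c
    m_b = np.einsum('ab,bf->af', p_a_b, m_c)                   # sum over b
    p_af_fast = m_b * p_f[None, :]

    print(np.allclose(p_af_naive, p_af_fast))                  # True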

3 Graphical Models for Automatic Speech Recognition

The underlying statistical model most commonly used for speech recognition is the hidden Markov model (HMM). The HMM, however, is only one example in the vast space of statistical models encompassed by graphical models. In fact, a wide variety of algorithms often used in state-of-the-art ASR systems can easily be described using GMs, and these include algorithms in each of the three categories: acoustic, pronunciation, and language modeling. While many of these ASR approaches were developed without GMs in mind, each turns out to have a surprisingly simple and elucidating network structure. Given an understanding of GMs, it is in many cases easier to understand the technique by looking first at the network than at the original algorithmic description. While it is beyond the scope of this document to describe all of the models that are commonly used for speech recognition and language processing, many of them are described in detail in [7]. Additional graphical models that explicitly account for many of the aspects of a speech recognition system are described in Sections 5 and 4.

4 Structural Discriminability: Introduction and Motivation

Discriminative parameter learning techniques are becoming an important part of automatic speech recognition technology, as indicated by recent advances in large vocabulary tasks such as Switchboard [107], which now complement well known improvements in small vocabulary tasks like digit recognition [82]. These techniques are exemplified by the maximum mutual information learning technique [4] or the minimum classification error (MCE) method [63], which specify procedures for discriminatively optimizing HMM transition and observation probabilities. These methodologies adopt a fixed pre-specified model structure and optimize only the numeric parameters. From a graphical-model point of view, the model structure is fixed while the parameters of the model may vary. In statistics, the methods of discriminant analysis generalize the discriminative methods used in ASR, as the goal in this case is to design the parameters of a model that is able to distinguish as well as possible between a set of objects (visual, auditory, etc.) as represented by a set of numerical feature values [76]. Moreover, methods of statistical model selection are also commonly used [71, 18] in an attempt to discover the structure and/or the parameters of a model that allow it to best describe a given set of training data. Typically, such approaches include an inverse cost function (such as the model’s likelihood) which is maximized. When the cost function is minimized (likelihood is maximized), it is assumed that the model is at the point where it best represents the given training data (and indirectly the true distribution). This cost function is typically further offset by a complexity penalty term so as to ensure that the model that is ultimately selected is not one that merely describes the training data excessively well without having an ability to generalize. Such approaches are seen in the guise of regularization theory [86], minimum description length [92], the Bayesian information criterion [95], and/or structural risk minimization [101]. The technique of structural discriminability [10, 6, 11] stands in significant contrast to the methods above. In this case, the goal is to learn discriminatively the actual edge structure between random variables in graphical models that represent class-conditional probabilistic models. The structure is selected to optimize not likelihood, but rather a cost function that indicates how well the class-conditional models do in a classification task. Also, a goal is, within the set of models that do equally well, to choose the one that is as simple as possible. Therefore, the edges in the structurally discriminative graph will almost certainly encode conditional independence statements that are wrong with respect to the data. In other words, the resulting conditional independence statements made by the model might not be true, and there is nothing that attempts to ensure that the conditional independence statements are indeed true. Rather, the conditional independence statements will be made only for the sake of optimizing a cost function representing

discriminability (such as the number of classification errors that are made, or the KL-divergence between the true and the model posterior probability). Structural discriminability is orthogonal to and complementary with the methods used for fixed-structure parameter optimization. This means that once a discriminative structure is determined, it can be possible to further optimize the model using discriminative parameter training methods to yield further gains. On the other hand, it might be possible that structural discriminability obviates discriminative parameter training – we will explore this idea further below. At the basis of all pattern classification problems is a set of K classes C_1, ..., C_K, and a representation of each of these classes in terms of a set of T random variables X_1, ..., X_T (denoted X_{1:T} for now). For each class, one is interested in obtaining discriminant functions, functions the maximization of which should yield the correct class. For example, if g_k(X_{1:T}) is the discriminant function for class k, then one would perform the operation:

    k^* = \argmax_k g_k(X_{1:T})

If X_{1:T} indeed represented an object of class k^*, then an error would not occur. If X_{1:T} instead represented an object of a different class, say k', then an error condition occurs. The ultimate goal therefore is to find functions g_k such that the errors are minimal over a training data set [33, 76, 101]. It is possible (but not necessary) to place the above goal and procedure into a purely probabilistic setting, where the discriminant functions are either probabilities or functions of probabilistic quantities. Given the true posterior probability of the class given the features, P(C_k|X_{1:T}), one can clearly see that the error of choosing class k given a feature set X_{1:T} will be 1 − P(C_k|X_{1:T}). Therefore, to minimize this error, the class should be chosen so as to maximize the posterior probability, leading to the following well-known Bayes decision rule [33]:

    k^* = \argmax_k P(C_k|X_{1:T})

This is a form of discrimination using the posterior probability, and if a model of P(C_k|X_{1:T}) is formed, say \hat{P}(C_k|X_{1:T}), then the following decision function

    k^* = \argmax_k \hat{P}(C_k|X_{1:T})

is said to use a discriminative model \hat{P}(C_k|X_{1:T}). This model can take many forms, such as logistic regression, a neural network [14, 76], a support vector machine [101], and so on. In each case, the functional form of the probabilistic discriminative model is chosen, and is then optimized (trained) in some way. Typically, the structure and form of such a function is fixed, and the only way that the structure can change is (potentially) by certain parameter coefficients taking a zero value, thereby rendering the structure controlled by these coefficients essentially non-existent. During training, however, there is typically no guarantee that such coefficients can or will be zero. Nor is there a guarantee that zero coefficients in the model would correspond to conditional independence statements that can be expressed by a graphical model of a particular semantics. It is often useful to have as many non-harmful conditional independence statements made as possible because the resulting model is much simpler. For completeness, we note that the above probabilistic decision function can be generalized further to take into account a loss function which measures the potential difference in severity between making different kinds of mistakes. For example, if the true class is k_1 and class k_4 is chosen, the severity of such a mistake might be much less than if k_2 were chosen. This is often encoded by producing a loss function L(k'|k), which is the loss (penalty) of choosing class k' when the true class is k. This is used to produce a risk function R(k'|X_{1:T}):

    R(k'|X_{1:T}) = \sum_k L(k'|k) P(C_k|X_{1:T})

which is the expected loss of choosing class k'. The goal is to choose the class that minimizes the overall risk, as in:

    k^* = \argmin_{k'} R(k'|X_{1:T})

This decision rule is provably optimal (minimizing the expected loss) for any given loss function [33, 34]. Moreover, for the 0/1 loss (a loss function that is 1 for all k ≠ k' and zero for the correct class k = k'), it is easy to see that this decision rule degenerates into the posterior maximization decision procedure above. It is this 0/1-loss case that we examine further below.
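The following small numerical sketch (the posterior and loss values are invented for the example) computes the risk R(k'|X_{1:T}) for a general loss matrix and confirms that, under the 0/1 loss, the minimum-risk decision coincides with the maximum-posterior decision, whereas an asymmetric loss can change the decision.

    import numpy as np

    # Posterior P(C_k | X_{1:T}) for K = 3 classes at one fixed observation
    # (values invented for illustration).
    posterior = np.array([0.2, 0.5, 0.3])
    K = len(posterior)

    def decide(loss):
        # risk[k'] = sum_k loss[k', k] * posterior[k];  pick argmin over k'.
        risk = loss @ posterior
        return int(np.argmin(risk))

    # 0/1 loss: L(k'|k) = 0 if k' == k, else 1.
    zero_one = 1.0 - np.eye(K)
    print(decide(zero_one), int(np.argmax(posterior)))   # same decision (class 1)

    # An asymmetric loss can change the decision even though the posterior is
    # unchanged: here, choosing class 1 when the true class is 0 is made costly.
    asymmetric = np.array([[0.0, 1.0, 1.0],
                           [5.0, 0.0, 1.0],
                           [1.0, 1.0, 0.0]])
    print(decide(asymmetric))                            # class 2 is now chosen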

It is often useful in many applications (such as speech recognition) to use Bayes rule to decompose the posterior probability P(C_k|X_{1:T}) within the decision rule, thus:

    k^* = \argmax_k P(C_k|X_{1:T}) = \argmax_k P(X_{1:T}|C_k) P(C_k) / P(X_{1:T})

On the rightmost side, it can be seen that the maximization over k is not affected by P(X_{1:T}), so an equivalent decision rule is therefore:

    k^* = \argmax_k P(X_{1:T}|C_k) P(C_k)        (1)

This decision rule involves two factors: 1) the prior probability of the class, P(C_k), and 2) the likelihood of the data given the class, P(X_{1:T}|C_k). This latter factor, when estimated from data and denoted \hat{P}(X_{1:T}|C_k), is often called a generative model. It is a generative model because it is said to be able to generate likely instances of a given class, as represented by the features. For example, if the generative model \hat{P}(X_{1:T}|C_k) were an accurate approximation of the true likelihood function P(X_{1:T}|C_k), and if a sample of the distribution were formed as x_{1:T} ∼ \hat{P}(X_{1:T}|C_k), then that sample would (with high probability) be a valid instance of the class C_k, at least as well as can be represented by the feature vector X_{1:T}. Moreover, if \hat{P}(X_{1:T}|C_k) is an accurate representation of the true likelihood function, and \hat{P}(C_k) is an accurate representation of the class prior, then the decision rule:

    k^* = \argmax_k \hat{P}(X_{1:T}|C_k) \hat{P}(C_k)

would lead to an accurate decision rule. Therefore, a goal that is often pursued is to find likelihood and prior approximations that are as accurate as possible. This leads naturally to standard maximum likelihood training procedures: given a data set consisting of N independent samples, D = {(x^1_{1:T}, k^1), ..., (x^N_{1:T}, k^N)}, the goal is to find the approximation of the likelihood function that best explains the data, or:

    \hat{P}^* = \argmax_{\hat{P}} \prod_{i=1}^{N} \hat{P}(x^i_{1:T}|k^i)

where the optimization is done over some set (possibly infinite) of likelihood function approximations that are being considered. Note that because the samples are assumed to be independent of each other, the optimization can be broken into K separate optimization procedures, one for each class k = 1, ..., K:

    \hat{P}^*(\cdot|k) = \argmax_{\hat{P}(\cdot|k)} \prod_{i : k^i = k} \hat{P}(x^i_{1:T}|k)

where \hat{P}(\cdot|k) indicates that the optimization is done only over those class-conditional likelihood approximation functions that could be considered as a generative model for class k. It can be proven that, as the size of the (training) data set D grows to infinity, and if the true class-conditional likelihood function P(X_{1:T}|k) lies within the set of models \hat{P}(\cdot|k) being optimized over, then the maximum likelihood procedure will converge to the true answer. This is the notion of asymptotic consistency and is given formal treatment in many texts such as [26]. It is moreover the case that the maximum likelihood procedure minimizes the KL-divergence between the true likelihood function and the model, as in:

    \argmin_{\hat{P}(\cdot|k)} D(P(X_{1:T}|k) \,\|\, \hat{P}(X_{1:T}|k))

The maximum likelihood training procedure is therefore a well-founded technique to obtain an approximation of the likelihood function. Returning to Equation 1, we can see that the maximum likelihood procedure, even in the case of asymptotic consistency and the like, is only a sufficient condition for finding an optimal discriminant function, but not a necessary condition. In fact, we would be happy with any of a number of functions f living in the family F defined as follows:

    \mathcal{F} = \{ f : \argmax_k f(X_{1:T}, C_k) P(C_k) = \argmax_k P(X_{1:T}|C_k) P(C_k), \;\forall X_{1:T} \}
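As a toy illustration of the family F (a sketch with invented densities, not an example from the report): multiplying every class-conditional likelihood by the same positive function of the features yields a different, generally unnormalized, member of F that produces exactly the same decisions as the true likelihoods.

    import numpy as np

    def gauss(x, mu, sigma):
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    # "True" class-conditional likelihoods p(x | C_k) for two classes, plus priors
    # (all values invented for illustration).
    prior = np.array([0.4, 0.6])

    def p_x_given_c(x):
        return np.stack([gauss(x, -1.0, 1.0), gauss(x, 2.0, 1.5)])

    # Another member of F: scale every class-conditional likelihood by the same
    # positive function g(x).  The scaled functions need not be densities at all,
    # yet the resulting decisions are identical for every x.
    def g(x):
        return 1.0 + x ** 2

    x_grid = np.linspace(-6.0, 8.0, 1001)
    scores_true = p_x_given_c(x_grid) * prior[:, None]
    scores_alt = (p_x_given_c(x_grid) * g(x_grid)) * prior[:, None]

    print(np.array_equal(scores_true.argmax(axis=0), scores_alt.argmax(axis=0)))  # True

This is a one-dimensional analogue of the point made by Figure 5 below: members of F may look very different from the true class-conditional densities away from the decision boundaries.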


Figure 5: A 2-dimensional spatial example of discriminability. The left figure depicts the true class-conditional distributions, in this case an example of a four-class problem. The region where each class has the highest probability is indicated by a different color (or shade). Also, contour plots are given of the different class-conditional densities. For example, region and class 4 shows what could be a mixture of four non-convex component density functions. The true distributions lead to the decision boundaries that separate each of the regions. On the right, class-conditional densities are shown that might lead to the exact same decision boundaries as shown on the left. The densities in this case, however, are much simpler than on the left – the right densities are formed without regard to the complexities of the left densities at points other than the decision boundaries. A goal of forming a discriminant function should be not to model any complexity of the likelihood functions that does not facilitate discrimination.

This means that any f ∈ F, when multiplied by the prior probability and then used as a discriminant function, will be just as effective for classification as the true likelihood function. Clearly, P(X_{1:T}|C_k) lives within F, but the crucial point is that there might be many others, some of which are much simpler – simple in this case could mean computationally easy to evaluate, having few parameters, having a particularly easy to understand functional form, etc. A goal, then, should be to find the f ∈ F that is as simple as possible. Note that some of the f ∈ F will be valid distributions themselves (i.e., are non-negative and integrate to unity), and others will be general functions. In this work, we are interested primarily in those f ∈ F which indeed are valid densities. A simple 2-D argument will further exemplify the above. The left of Figure 5 shows contour plots for four different class-conditional likelihood functions in a 2-D space. Each of the regions at which one of the likelihood functions is maximum is indicated by a color, as well as by the decision regions drawn in the plot. As can be seen, each class-conditional density consists of a complex multi-modal distribution that (from looking at the figure) possibly results from a mixture of non-convex component functions. These distributions result in the decision boundaries and regions as shown. Any x-y point falling directly on one of the boundaries could be in either one of the two abutting classes. When the goal is classification, it is not necessary to represent the complexity of the class-conditional distributions at regions of the space other than at the decision boundaries. If different generative class-conditional density functions were discovered having the same exact boundaries, but a much simpler form away from the boundaries, the classification error would be the same but the resulting class-conditional likelihood functions could be much simpler. This is indicated on the right of Figure 5, where uni-modal distributions have replaced the multi-modal ones on the left, but where the decision boundaries have not changed. Note that such class-conditional functions, while being “generative” in that they would generate something (i.e., are valid densities), would not necessarily generate accurate samples of the true class. For example, a large X is indicated on the left of Figure 5, indicating values of a feature vector that are likely to have been generated from the 4th class-conditional likelihood function. It is likely that such an X is an accurate representation of an object of type 4. The same relative position is marked on the right of the figure, also with an X. In this case, it is not likely to have been generated by the generative model for class 4 since it is not located at a point of high probability. Moreover, the large Ys indicate a point that is likely to be generated by the models on the right, but not by the true generative models on the left. These generative models on the right, therefore, do not necessarily generate objects typical of the classes


Figure 6: Pictorial example of distinctive features. This figure shows two types of key-like objects: on the left are objects consisting of an annulus and a protruding horizontal bar, and on the right are objects consisting of a diagonal bar and a protruding horizontal bar. When designing an algorithm that makes a distinction between these two types of objects, the horizontal-bar feature would not be beneficial since it is common to both object types. It is sufficient to represent the objects only by their features that are unique relative to each other, as shown on the bottom. Models of the unique attributes of objects could be simpler since they contain only the minimal set of features necessary to discriminate.

Instead, the generative models are minimal and represent only what is needed for discrimination. Therefore, we call these densities discriminative generative models [11]. Another visual geometric example can further begin to motivate discriminative generative models, and ultimately structural discriminability. Consider Figure 6, which shows (also in 2-D) instances of two different types of key-like objects. The top of the figure shows instances of objects of class A, which are annuli with a protruding horizontal bar to the right. Objects of class B are diagonal bars with protruding horizontal bars to the right. As can be seen, objects of class A and class B have the horizontal bars in common, and their distinctive features are either the annuli or the diagonal bars. Therefore, in discriminating between objects of different types, one would expect that the horizontal bars of the two objects would not be very useful. For discrimination, any model of the two object types might not need to expend resources representing the horizontal bars, and should instead concentrate on the annuli and diagonal bars respectively. This latter case is shown in the bottom of the figure, where only those discriminative aspects of the objects are represented. In this case, less about the objects needs to be “remembered,” thereby leading to a simpler criterion for deciding between the two classes. In general, when the task is pattern classification, it should only be necessary for a model to represent those features of the objects which are crucial for discrimination. Features which are common to both objects could potentially be ignored entirely without any penalty in classification performance.

(Figure 7 panels: top, class-conditional graphical models P(V1,V2,V3,V4|C = 1) and P(V1,V2,V3,V4|C = 2) over the variables V1 through V4; bottom, the corresponding reduced models in which the common edges have been removed.)

Figure 7: Structural Discriminability.

We finally come to the idea of structural discriminability. The essential idea is to find generative class-conditional likelihood functions that are optimal for classification performance and simplicity. We optimize these likelihood models over the space of conditional independence properties as encoded by a graphical model. This means that the goal is to find minimal edge sets such that discrimination is preserved. Minimal edge sets are desirable since the fewer edges in a graphical model, the more conditional independence statements are made, which can lead to fewer parameters, cheaper probabilistic inference, greater generality for limited amounts of training data, and to concentrating modeling power only on what is important. In other words, the aim of structural discriminability is to identify a minimal set of dependencies (i.e., edges in a graph) in class-conditional distributions P (X1:T |Ck) such that there is little or no degradation in classification accuracy relative to the decision rule given in Equation 1. A simple motivating example is given in Figure 7. The top of the figure shows the undirected graphical model for two generative 4-dimensional class-conditional likelihood functions: on the left P (V1,V2,V3,V4|C = 1) for class 1, and on the right P (V1,V2,V3,V4|C = 2) for class 2. The edges shown are those of the generative models, meaning that these models depict the truth. Note that many of the edges are common to the two models. For example, the edge between V1 and V4 appears both in the model for C = 1 and the model for C = 2. It might be the case that these common edges could be removed from both models since they are a common trait of both classes. If all common edges are removed from both models, the result is as shown on the bottom of the figure. Here only the unique edges for the two models are kept: the edge between V1 and V3 on the left, and the edge between V3 and V2 on the right. These edges represent unique properties of the objects, at least as far as the conditional independence statements the graphs encode are concerned. It is crucial to realize that the example in Figure 7 is only an illustration. It does not imply that all common edges in class-conditional graphical models should be removed: there might be common edges which turn out to be quite helpful for discrimination. Moreover, there might be information, irrespective of the edges, which is useful for discrimination. Take, for example, the means of two class-conditional Gaussian densities with equal spherical covariance matrices. The only thing producing a distinction between the two classes is the means. Therefore, no edge-structure adjustment will help discrimination. On the other hand, there are cases where structural discriminability obviates discriminative parameter training.

(Figure 8 rows, each showing a model for C = 1 on the left and C = 2 on the right over the variables V1, V2, V3: object generation P; common dependencies Pc; discriminative dependencies Pd.)

Figure 8: It is possible for structural discriminability to render maximum-likelihood training ineffectual. Moreover, structural “confusability” (a network with anti-discriminative edges) can render discriminative parameter training ineffectual.

Specifically, there are cases where an inherently discriminative structure can render discriminative parameter training no more beneficial than regular maximum-likelihood training. Moreover, the wrong “anti-discriminative” model structure can render even discriminative training ineffectual at producing appropriate discriminative models. For a graphical example, consider Figure 8, and let us assume that there is no discrimination available in the individual variables (e.g., for Gaussians, the means of the random variables are all the same, so it is only the covariance structure which can help to produce more discriminative models). The top box shows truth, meaning the graphs that correspond to the true generative models (tri-variate distributions in this case) for class 1 and class 2. The middle box shows the edges that are common to the two true models. The bottom box shows the edges that are distinct between the two true models. If one insists on using the structures given in the middle box, important discriminative information about the two classes might be impossible to represent. Therefore, even discriminative parameter training will be incapable of producing good results. On the other hand, the two bottom graphs show the distinct edges of the two models. Using these class-conditional models, even simple maximum-likelihood training would be able to produce models that are capable of discriminating between objects of the two classes. Discriminative training, in this case, therefore might not have any benefit over maximum-likelihood training. Further expanding on this example, consider Figure 9, which shows six 3-dimensional zero-mean Gaussian densities corresponding to the graphs in Figure 8. Each graph shows 1500 samples from the corresponding Gaussian along with the marginal planar distributions for each of the variable sets V1V2, V2V3, and V1V3 (the margins, rather than being shown at the actual zero locations for the corresponding axes, are shown projected onto the axes planes of the 3-dimensional plots). The covariance matrices for the six Gaussians are, respectively (moving left to right across each row of the figure and then down the rows), as follows:

\begin{pmatrix} 9 & 4 & 2 \\ 4 & 2 & 1 \\ 2 & 1 & 1 \end{pmatrix} \qquad \begin{pmatrix} 11 & 3 & 1 \\ 3 & 1 & 0 \\ 1 & 0 & 1 \end{pmatrix}

\begin{pmatrix} 5 & 2 & 0 \\ 2 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \qquad \begin{pmatrix} 10 & 3 & 0 \\ 3 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}

\begin{pmatrix} 1 & 0 & 0 \\ 0 & 2 & 1 \\ 0 & 1 & 1 \end{pmatrix} \qquad \begin{pmatrix} 2 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \end{pmatrix}

On the upper left of Figure 9, it can be seen clearly that V1 depends on both V2 and V3 (i.e., V1 depends on V3 only indirectly through V2, since V1 ⊥⊥ V3 | V2). Note that this conditional independence can easily be seen to hold in the top-left covariance matrix because the determinant of the off-diagonal submatrix is zero, i.e.,

\begin{vmatrix} 4 & 2 \\ 2 & 1 \end{vmatrix} = 0,

reflecting a zero in the (1,3) position of the inverse covariance (a small numerical sketch of this check is given at the end of this section). On the upper right, it can be seen that V2 ⊥⊥ V3, as indicated by the upper right of Figure 8. From these “truth” models, it can be seen that the ability to discriminate between the two classes lies within the V2V3 plane. The middle-row models in Figure 8 (and an example of their Gaussian correlates given in Figure 9) will not be able to discriminate well between the two classes regardless of the training method, since they contain only an edge (and therefore a possible dependency) between V1 and V2. The bottom-row models in Figure 8 (correspondingly, Figure 9) will once again be able to make a distinction between the two classes, and mere maximum-likelihood training would yield solutions such as the ones indicated. Note that these bottom-row models are not the only possible structurally discriminative conditional independence properties; in the example, the bottom-right model could just as well assume everything is independent without harm. Note also that in this case a linear discriminant analysis [44] projection would enable simple lower-dimensional Gaussians to discriminate well. In the general case, however (e.g., where the dependencies are non-linear and non-Gaussian), placing certain restrictions on a generative model can make it more amenable to use in a discriminative context. Structural discriminability tries to do this in the space of conditional independence statements as encoded by a graphical model. We must emphasize here that it is not the case that structural discriminability will necessarily make discriminative parameter training ineffective. In natural settings, it is most likely that a combination of structural discriminability and discriminative parameter training will yield the best of both worlds: a simple model structure that is capable of representing the distinct properties of objects relative to competing objects, and a parameter training method to ensure that the models make use of that ability. Incidentally, it is often asked why delta [36] (first temporal derivative) and double-delta [70, 106] (second temporal derivative) speech feature vectors produce a significant gain in HMM-based speech recognition systems. It turns out that the concept of structural discriminability can be used to shed some light on this situation [7]. The delta feature generation process can indeed be precisely modeled by a graphical model, so why might such precise modeling of delta features not be desirable? The reason is that the edges so added to a model (the correct generative model) will render the delta features independent of the hidden variables. This will have the effect of making the delta features non-informative. Therefore, by making (wrong) independence statements about the generative process of the delta features, discrimination is improved (see [7] for details). In summary, the structure of the model family can have significant effects on the parameter training method used. If the wrong model family is used, even discriminative parameter training might not be helpful. Our goal in this work is to identify a criterion function that enables us to best tell whether a given edge is discriminative or not, and whether it should be removed from or added to a class-conditional generative graphical model.
Figure 9: Structural Discriminability of Gaussian Structures.

Ideally, there would be a measure that could be computed independently for each edge, and those edges for which the measure is good enough (e.g., above threshold) would be retained, all others being dropped. Such a measure that attempts to achieve this goal, the EAR measure, is described in detail in Section 7. In this report, we focus on class-conditional probabilistic models that can be expressed as Bayesian networks. We focus further on, and use, a new graphical model toolkit (GMTK) for representing both standard and discriminative structures for speech recognition. The benefits of this framework include the ability to rapidly and easily express a wide variety of models, and to use them in as efficient a way as possible for a given model structure.
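As a small illustrative check of the conditional-independence claim made above (this is our own sketch in Python with numpy, not part of GMTK or the workshop experiments), one can invert the upper-left covariance matrix and confirm the zero in the (1,3) position of the precision matrix:

import numpy as np

# Upper-left covariance matrix from the list above (a class-1 "truth" model).
sigma = np.array([[9.0, 4.0, 2.0],
                  [4.0, 2.0, 1.0],
                  [2.0, 1.0, 1.0]])

# Determinant of the off-diagonal submatrix (rows 1-2, columns 2-3).
off_diag = sigma[0:2, 1:3]            # [[4, 2], [2, 1]]
print(np.linalg.det(off_diag))        # ~0, so V1 and V3 are conditionally independent given V2

# Equivalently, the (1,3) entry of the precision (inverse covariance) matrix is zero.
precision = np.linalg.inv(sigma)
print(np.round(precision, 6))         # the [0, 2] entry is ~0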

5 Explicit vs. Implicit GM-structures for Speech Recognition

5.1 HMMs and Graphical Models

Undoubtedly the most commonly used model for speech recognition is the hidden Markov model or HMM [5, 59, 87], and so we begin by relating the HMM to graphical models. It has long been realized that the HMM is a special case of the more general class of dynamic graphical models [98], and Figure 10 illustrates the graphical representation of an HMM. Recall that in the classical definition [59, 87], an HMM consists of:

• The specification of a number of states

• An initial state distribution π

• A state transition matrix A, where Aij is the probability of transitioning from state i to state j between successive observations

• An observation function b(i, t) that specifies the probability of seeing the observed acoustics at time t given that the system is in state i.

In this formulation, the joint probability of a state sequence s1, s2, . . . , sT and observation sequence o1, o2, . . . , oT is given by

\pi_{s_1} \prod_{i=1}^{T-1} A_{s_i s_{i+1}} \prod_{i=1}^{T} b(s_i, i) \qquad (2)
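As a minimal sketch of Equation 2 (this is our own illustration, not code from the report or from GMTK; all numbers below are made up), the joint probability of a fixed state and observation alignment can be computed as follows:

import numpy as np

def hmm_joint_prob(pi, A, b, states):
    """Joint probability of Equation 2 for a fixed state sequence.

    pi[i]   : initial probability of state i
    A[i, j] : probability of transitioning from state i to state j
    b[i, t] : probability of the observation at time t given state i
    states  : the state index occupied at each frame
    """
    prob = pi[states[0]]
    for t in range(len(states) - 1):
        prob *= A[states[t], states[t + 1]]
    for t, s in enumerate(states):
        prob *= b[s, t]
    return prob

# Toy 2-state model observed for 3 frames.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.2, 0.8]])
b = np.array([[0.5, 0.1, 0.2],    # b[i, t]: observation score for state i at frame t
              [0.3, 0.6, 0.4]])
print(hmm_joint_prob(pi, A, b, states=[0, 0, 1]))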


Figure 10: Graphical model (specifically, a dynamic Bayesian network) representation of a hidden Markov model (HMM). The graph represents the set of random variables (two per time frame) of an HMM, and edges encode the set of conditional independence statements made by that HMM.

In the case that the state sequence or alignment is not known, the marginal probability of the observations can still be computed, either by enumerating all possible state sequences and summing the corresponding joint probabilities, or via dynamic programming recursions. Similarly, the single likeliest state sequence can be computed. Figure 10 shows the graphical model representation of an HMM. It is a model in which each time frame has two variables: one whose value represents the value of the state at that time, and one that represents the value of the observation. The conditioning arrows indicate that the probability of seeing a particular state at time t is conditioned on the value of the state at time t−1, and the actual numerical value of this probability reflects the transition probability between the associated states in the HMM. The observation variable at each frame is conditioned on the state variable in the same frame, and the value of P(o_t|s_t) reflects the output probabilities of the HMM. Therefore, the directed factorization property of directed graphical models (the joint probability can be factored into a form where each factor is the probability of a random variable given its parents) immediately yields Equation 2. One important thing to note about the graphical model representation is that it is explicit about absolute time: each time frame gets its own separate set of random variables in the model. In Figure 10, there are exactly four time-frames represented, and to represent a longer time series would require a graph with more random variables. This is in significant contrast to the classic representation of an HMM, which has no inherent mechanism for representing absolute time. Instead, in the classic HMM representation, only relative time is represented. Absolute time is, of course, eventually represented, but this is typically done in the auxiliary data structures used for specific computations. Figure 11 makes this more explicit. At the top of this figure is an HMM that represents the word “digit.” There are five states (unshaded circles) representing the different sounds in the word, and a dummy initial and final state. The arcs in the HMM represent possible transitions, and not conditional independence relationships as in the graphical model. Note that this graph shows only the transition matrix of the Markov chain (only one part of the HMM), and in particular edges are given only when there are non-zeros in the transition matrix. The self-loops in the Markov chain graph depict that it is possible to be in a state at one time (with a given probability), and then stay in the state at the next time frame. In particular, the picture shows the Markov chain for a “left-to-right” HMM in which it is only possible to stay in the same state, or move forward in the state sequence. This kind of representation is not explicit about absolute time. It can represent 100-frame occurrences of the word “digit” as well as 10- or 1000-frame occurrences. Of course, actual computations must be specific about absolute time, and the absolute temporal aspect is introduced into an HMM via the notion of a computational grid or its equivalent. This is shown at the upper right for a seven-frame occurrence of the word “digit.” The horizontal axis represents time; the vertical axis represents the HMM state set, and a path from the lower-left corner of the grid to the upper-right corner represents an explicit path through the states of the HMM over time.
In this case, the first frame is aligned to the /d/ state, the second and third frames are aligned to the /ih/ state, and so forth. Although the structure (zeros/non-zeros) of the underlying Markov chain of the HMM is defined by the graph at the left, computations on the HMM are defined with reference to the temporally-explicit grid. The graphical model representation of the same utterance is shown at the bottom of Figure 11. This is a temporally-explicit structure (it shows absolute time) with seven repeated chunks, one for each of the time-slices. The assignment to the state variables of this model corresponds to the HMM path represented in the computational grid. Note that different paths through the grid will correspond to different assignments of values to the state variables in the graphical model. Whereas computation in the HMM typically involves summing or maximizing over all paths, computation in the graphical model typically involves summing or maximizing over all possible assignments to the hidden variables. The analogy between a path in an HMM and an assignment of values to the hidden variables in a graphical model is quite important and should be kept in mind at all times.
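To make the path/assignment analogy concrete, the following sketch (our own illustration with made-up toy numbers, not code from the report) computes the likeliest assignment to the hidden state variables of an unrolled HMM, i.e., the classic Viterbi path:

import numpy as np

def viterbi(pi, A, b):
    """Likeliest assignment to the hidden state variables (the classic Viterbi path)."""
    num_states, num_frames = b.shape
    delta = np.zeros((num_frames, num_states))          # best log-score ending in each state
    back = np.zeros((num_frames, num_states), dtype=int)
    delta[0] = np.log(pi) + np.log(b[:, 0])
    for t in range(1, num_frames):
        scores = delta[t - 1][:, None] + np.log(A)      # scores[i, j]: arrive in j from i
        back[t] = np.argmax(scores, axis=0)
        delta[t] = np.max(scores, axis=0) + np.log(b[:, t])
    # Trace back the best assignment of values to the state variables.
    states = [int(np.argmax(delta[-1]))]
    for t in range(num_frames - 1, 0, -1):
        states.append(int(back[t, states[-1]]))
    return states[::-1], float(np.max(delta[-1]))

# Toy 2-state model observed for 3 frames (same conventions as the earlier sketch).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
b = np.array([[0.5, 0.1, 0.2], [0.3, 0.6, 0.4]])
print(viterbi(pi, A, b))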


Figure 11: Comparison of two views of an HMM: the more classic views, where on the top left an HMM is seen only as the non-zero values of the underlying Markov chain, and on the top right as a grid where time is on the horizontal axis and the possible states are shown on the vertical axis. The bottom plot shows the graphical-model view of an HMM, where the HMM’s underlying set of random variables (two per time frame) is shown, and the edges specifically encode the set of conditional independence statements made by the model.

The graphical model in Figure 11 represents the information in an HMM. In particular, it encodes the conditional independence assumptions made by the HMM when it is seen as a collection of random variables, two per time-slice. Note that the graphical-model view of the HMM given above, however, does not explicitly account for certain aspects of HMMs as they are typically used in a speech recognition system. One example of this is that, in practice, the utterances that are associated with an HMM represent a complete example of a word or words. A specific recording of the word “digit” will start with the /d/ sound and end with the /t/ sound; i.e., the whole word will be spoken. This extra piece of information is not explicitly captured in the previous graphical model. To see this, suppose the value /d/ is assigned to every occurrence of the state variable. (This will actually happen in the course of inference, either explicitly or implicitly depending on the algorithm used.) This concrete assignment will have some probability (obtained by multiplying together many instances of the self-loop probability for /d/ along with the associated observation probabilities), and unless specific provisions within the set of hidden random variables of the graphical model are made, this probability might not be zero. In particular, if the hidden variable set does not have state values that specifically and jointly encode both position and phoneme, it would not be possible to ensure that this assignment of /d/ to all variables would have zero probability. If such were the case, this fact would violate our prior knowledge that a complete occurrence of “digit” must be spoken. Such zero-probability assignments in the graphical model would correspond in the classic HMM framework to a path that does not end in the upper-right corner of the computational grid. It is often the case that this issue is resolved in the program code, which performs inference by treating the upper-right corner of the grid in a special way. This essentially corresponds to a particular query in the graphical model representation, namely P(O_{1:7}, Q_{1:6}, Q_7 = q), meaning that only assignments to the variables are considered that have a specific assignment to the last hidden variable Q_7. Another issue concerns the use of time-invariant conditional probabilities. Consider the “digit” example. Because the first occurrence of the phone /ih/ is followed by /jh/, it must be the case that P(Q_t = /jh/ | Q_{t-1} = /ih/) > 0 (so that the transition is possible) and P(Q_t = /t/ | Q_{t-1} = /ih/) = 0 (so that the transition is not skipped). However, because the second occurrence of /ih/ is followed by /t/, it must be the case that P(Q_t = /t/ | Q_{t-1} = /ih/) > 0 and P(Q_t = /jh/ | Q_{t-1} = /ih/) = 0, which is in direct contradiction to the previous requirements.

(Figure 12 layers, top to bottom: end-of-word observation; position (1 2 2 3 4 5); transition (1 0 1 1 1 1); phone (D IH IH JH IH T); acoustic observation.)

Figure 12: A more explicit graphical model representation of parameter tying in a speech-recognition-based HMM.

Therefore, if the cardinality of the hidden random variables in an HMM is only equal to the number of phonemes, then the HMM would not be able to represent the different meanings of the states at different times, even though observation distributions for these different states should be shared for the same phoneme. One way of solving this problem is therefore to distinguish between the first and second occurrences of /ih/: to define /ih1/ and /ih2/. In general, one may increase the number of states in an HMM to encode both phoneme identity and position in word, and also sentence, sentence category, and/or any other hidden event required. One must be careful, however, because we typically desire that multiple state values “share” the same output probabilities (i.e., the observation distributions of the same phoneme in different words are typically shared, or tied, in a speech recognition system, meaning that P(x|Q = q1) = P(x|Q = q2) for all x). This issue is compounded when we consider instances of multiple words. For example, if the word “fit” - /f/ /ih/ /t/ - occurs in the database, should its /ih/ be the same as /ih1/ or /ih2/, or something else entirely? The graphical-model view of an HMM (Figure 11, bottom), therefore, while able to encode these sorts of constraints, does not explicitly indicate how the constraints come to be or how they should be implemented. The graphical model formalism, however, is also able to represent the parameter tying issue via an explicit structural representation using the graph itself, rather than relegating this detail either to a particular implementation or to an expanded hidden state space. This has been called the explicit graphical representation approach [9], whereas the simple HMM graphical model is called the implicit approach (expanded upon when discussing GMTK in Section 6.1.1). For example, a graphical model that incorporates the word-end and parameter tying constraints that are appropriate to a typical practical application is shown in Figure 12 (see [109] for further information). The main aspect of this representation is that there is now an explicit variable representing the position in the underlying HMM. The position within the HMM is now distinct from the phone labeling the position, which is explicitly represented by a second set of variables. The tying issue is implemented by mapping from position to phone via a conditional probability distribution. In the example of Figure 12, positions 2 and 4 both map to /ih/. Different transition probabilities are obtained at each time point via an explicit transition variable, which is conditioned on the phone. Either there is a transition or not, and the probability depends on the phone (which means that different phones can have their own length distributions, and a given length distribution is active depending not on the position within the word, but only on the phone that is currently being considered). The position is a function of the previous position and the previous transition value: if the transition value was 0, then the position remains unchanged; otherwise it increments by one. These relationships can be straightforwardly and explicitly encoded as conditional probabilities, as shown in the figure. In this representation, there is an “end-of-word” observation, which is assigned an arbitrary value of 1.
This variable ensures that all variable assignments that have non-zero probability must end on a transition out of the final position. This is done by conditioning this observation on the position and transition variables, and setting its conditional probability distribution so that the probability of its observed value is 0 unless the position variable has the value of the last position. Similarly, its probability is 0 unless the transition value is 1. This ensures assignments in conformance with the classic HMM convention that all paths end in a transition out of the final emitting state. In this model, the conditional probabilities need only be specified for the first occurrence of each kind of variable, and can then be shared by all subsequent occurrences of analogous variables. Thus, we have achieved the goal of making the conditional probabilities essentially “time-invariant.”

However, it is important to note that the probabilities do depend on the specific utterance being processed. In this model, the mapping from position to phone will change from utterance to utterance, as does the final position. Thus an implementation must support some mechanism for representing and reading in conditional probabilities on an utterance-by-utterance basis. This is analogous to reading in word-graphs, lattices, or scripts on an utterance-by-utterance basis in a classic HMM system. A final nuance of this model is that some of the conditional probability relationships are in fact deterministic, and not subject to parameter estimation. In fact, Figure 12 shows that some of the edges indicate true random implementations (the smoothed zig-zag or wriggled edges), and the other edges indicate deterministic implementations (the straight edges). Specifically, the distribution controlling the position variable encodes a simple logical relationship: if the transition parent is 0, then Position_t = Position_{t-1}; otherwise, Position_t = Position_{t-1} + 1. Efficiently representing deterministic relationships, and exploiting them in inference, is important for an efficient implementation of a graphical modeling system [109]. To summarize, the model of Figure 12 uses the following conditional probabilities (a small sketch of the deterministic items is given after the list):

1. position at frame 1: a deterministic function that is 1 with probability 1

2. position in all other frames: a deterministic function such that

P(position_t = position_{t-1} | position_{t-1}, transition_{t-1} = 0) = 1

P(position_t = position_{t-1} + 1 | position_{t-1}, transition_{t-1} = 1) = 1

3. phone: an utterance-specific deterministic mapping

4. transition: a dense table specifying P(transition | phone)

5. observation: a function such as a Gaussian mixture that specifies P(observation | phone)

6. end-of-utterance observation: an utterance-specific deterministic function such that

P(end-of-utterance = 1 | position ≠ final, transition ≠ 1) = 0
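The deterministic items above can be sketched as ordinary functions (this is our own illustration with hypothetical names; GMTK itself expresses such relationships as decision trees):

def position_cpd(prev_position: int, prev_transition: int) -> int:
    """Deterministic position update: stay put if no transition, otherwise advance by one."""
    return prev_position if prev_transition == 0 else prev_position + 1

def end_of_word_prob(observed_value: int, position: int, transition: int,
                     final_position: int) -> float:
    """P(end-of-word observation = observed_value | position, transition).

    The observation is fixed to the value 1, which receives probability 1 only when
    the model is leaving the final position, and probability 0 otherwise.
    """
    at_word_end = (position == final_position and transition == 1)
    if observed_value == 1:
        return 1.0 if at_word_end else 0.0
    return 0.0 if at_word_end else 1.0

# For the word "digit" there are five positions, so final_position = 5.
print(position_cpd(prev_position=2, prev_transition=0))                   # stays at 2
print(position_cpd(prev_position=2, prev_transition=1))                   # advances to 3
print(end_of_word_prob(1, position=5, transition=1, final_position=5))    # 1.0: the word may end here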

The graphical model in Figure 12 is useful in a variety of tasks:

• Parameter estimation when the exact sequence of words, and therefore phones, including silences, is known.

• Finding the single best alignment of a sequence of frames to a known sequence of words and phones.

• Computing the probability of acoustic observations, given a known word sequence - which is useful for rescoring the n-best hypotheses of another recognition system.

The main issue with this model is that, unless it is implemented using an expanded state space implicitly within the graph, a fully specified state sequence must be used. This is available only if one knows where silence resides between words. For example, the state sequence corresponding to “hi there” will in general be different from “hi sil there,” though in practice one does not know where the silences occur. Of course, one has two choices: 1) resort to the implicit approach and use an expanded state space, or 2) use a more intricate graph which explicitly represents the possibility of silence between words and multiple lexical variants. In the next section, we pursue the latter of the two choices.

5.2 A more explicit structure for Decoding

A more explicit (and complicated) model structure can be used for decoding, and is described in this section. The structure given here was developed anew specifically for the JHU 2001 workshop, and became the basis for many of the recognition experiments that were undertaken during this time. In general, the goal of the decoding process is to determine the likeliest sequence of words given an observed acoustic stream, and this can be done with the model structure shown in Figure 13. This model is similar in spirit to that of Figure 12, but there is an extra “layer” of both deterministic and random variables added on top. Again, the wriggled edges correspond to true random implementations, and all straight edges depict determinism. Since the goal is to discover a word sequence, there must be some representation of words, and this is obtained by associating an explicit “word” variable with each frame.

(Figure 13 variables, top to bottom: end-of-utterance observation, word, word-transition, word-position, phone-transition, phone, acoustics.)

Figure 13: A graphical model for decoding.

The other addition is a “word transition” variable, which indicates when one word ends and another begins. The “word position” variable is analogous to the previous “position” variable, and indicates which phone of a word is under consideration: the first, second, etc. The logic of this network is fairly straightforward. The combination of word and word-position determines a phone. As before, both the observation and transition variables are conditioned on this. There is a word transition when there is a phone transition in the last phone of a word. Finally, the word variable retains its value across time if there is no word transition, and otherwise takes a new value according to a probability distribution that is conditioned on the previous word value. In other words, when the word transition variable at time t − 1 is zero, the word variable at time t is a deterministic function of the word variable at time t − 1. When the word transition variable at time t − 1 is one, the word variable at time t is random, according to a bigram language model. For this reason, the edge between word variables is shown as both a straight edge and a wriggled edge: it is straight only when the previous word transition is zero. Such a construct can be encoded using switching dependencies within GMTK (see Section 6.1.7). The only major constraint on decoding is that the interpretation should end at the end of a word, and this is enforced by adding an “end-of-utterance” observed variable and conditioning it on the word transition variable. The probability distribution is set so that the observed end-of-utterance value is only possible (obtains non-zero probability) if the final word-transition value is 1. As in the simpler model of the previous section, all the logical relationships that were used to describe the network can be encoded as conditional probabilities. This network is used for decoding by finding the likeliest values of the hidden variables. The decoded word sequence can be read off from the word variable in frames for which the word transition has the value 1. (Because a word may be repeated, as in “fun fun”, one cannot simply look at the sequence of assignments to the word variable alone, since that would not indicate whether a duplicate word occurred or whether it was just a longer instance of a single word.) This decoding model uses a bigram language model of word sequences. That is, the probability contributed by the word variables w is factored as P(w) = Π_t P(w_t | w_{t-1}). A trigram language model that conditions on the two previous words, P(w) = Π_t P(w_t | w_{t-1}, w_{t-2}), is more common, and is encoded in the somewhat more complex model of Figure 14. In this model, there is another layer of variables, “last-word,” that represents the word w_{t-2}. When there is a word transition, the next word is chosen from a distribution that is conditioned on both the current and last word, and appropriate copy operations are done so that the last-word variable will always have the correct value. When there is no word transition, the word is, as before, just a duplicate of the word at the previous frame. The trigram probability P(w_t | w_{t-1}, w_{t-2}) is shown in the figure by having two wriggled edges converging at the word variable, where the edge from the previous word is wriggled only in the case when the word transition is one (again, the edge is shown both as a straight edge and as a wriggled edge).
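The switching behavior of the word variable in the bigram case can be sketched as follows (this is our own illustration with hypothetical names and made-up bigram values, not GMTK code):

def word_cpd(word_t: int, prev_word: int, prev_word_transition: int,
             bigram: dict) -> float:
    """P(word_t | prev_word, prev_word_transition).

    When the previous word-transition is 0, the word deterministically copies its
    previous value; when it is 1, the new word is drawn from a bigram distribution.
    """
    if prev_word_transition == 0:
        return 1.0 if word_t == prev_word else 0.0
    return bigram.get((prev_word, word_t), 0.0)

# Toy bigram table over a 3-word vocabulary: bigram[(w_prev, w)] = P(w | w_prev).
bigram = {(0, 0): 0.2, (0, 1): 0.5, (0, 2): 0.3,
          (1, 0): 0.1, (1, 1): 0.1, (1, 2): 0.8,
          (2, 0): 0.4, (2, 1): 0.4, (2, 2): 0.2}

print(word_cpd(word_t=1, prev_word=0, prev_word_transition=0, bigram=bigram))   # 0.0 (deterministic copy)
print(word_cpd(word_t=1, prev_word=0, prev_word_transition=1, bigram=bigram))   # 0.5 (bigram draw)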

5.3 A more explicit structure for training

Figure 12 illustrates a model that is suitable for training. The adjustable parameters are simply the output and transition probabilities, and when training is completed these are available for use in the same or other network structures.


Figure 14: A model for decoding with a trigram language model.

Due to its simplicity, this structure is relatively efficient for training (it has only size-three cliques in the graph) whenever the exact word and phone sequence is known. Unfortunately, this is not always the case. For example, occurrences of silence may not be indicated in a transcription of training data, but it is known that silence does exist and therefore should be probabilistically modeled during training as well as during recognition. As another example, there might be multiple possible pronunciations for the words that are known, e.g. /T AH M EY T OW/ or /T AH M AA T OW/, and it is desirable not to choose one of those pronunciations at training time because the phone transcriptions are not available. Again, such a feature can be added either implicitly or explicitly, and we pursue the latter. Figure 15 illustrates a model structure that is able to consider all possible insertions of silence between words. This structure is essentially the same as the structure of Figure 13 used for bigram decoding, with a small modification. There are two new variables, one denoting the position within the utterance, and the other denoting whether silence should be inserted or skipped at a word transition. The “position within utterance” variable denotes the position (first, second, third, etc.) in the training script, where inter-word silences are explicitly numbered. For example, in the phrase “the cat in the hat,” the numbering is as: sil(1) the(2) sil(3) cat(4) sil(5) in(6) sil(7) the(8) sil(9) hat(10) sil(11). The probability distribution governing position-within-utterance can now be described. The basic idea is that if there is a transition at the end of a position denoting silence (e.g. position 3), then the position variable advances by one to the next word. On the other hand, if there is a transition at the end of a normal word position (e.g. position 4), then depending on the value of the skip-silence variable, the position either advances by one (and the silence is inserted) or by two (and the silence is skipped). The end-of-utterance variable is set up so that its assigned value is only possible if the position is either the last word or silence, and there is a word transition. Note that in this case the word variable is a deterministic function only of the position variable. During training, it is known what the first, second, etc. word is, so this is certainly possible to do (and is how GMTK implements it). Training with multiple pronunciations or lexical variants for each word is more complicated. Figure 16 illustrates one possible graphical model that can accommodate multiple lexical variants, as well as optional silence. The main addition here is a variable that explicitly represents the lexical variant of the word being uttered. For example, “tomato” has two: a lexical-variant value of 0 represents /T AH M EY T OW/ and a value of 1 represents /T AH M AA T OW/. The phone value is now determined by a combination of word, word-position, and lexical variant. The lexical variant is selected according to an appropriate distribution when the word-transition value is 1. In the case that there is no transition, it is simply copied from the previous frame, which is why there is an edge between consecutive occurrences of the variable. When there is a transition, the lexical variant is chosen based on the new word (since there will typically be a different number of variants with differing probabilities for each word).
This again depicts a switching dependency: when the word transition at time t − 1 is zero, the lexical variant at time t is a deterministic copy of its value at the previous time frame. When the word transition at time t − 1 is one, the lexical variant at time t is determined randomly based on the value of the word at time t (i.e., the new word).
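A small sketch of the optional-silence position update described above (our own illustration with hypothetical function names, using the example numbering of “the cat in the hat”):

def utterance_position_cpd(prev_pos: int, transition: int, skip_silence: int,
                           is_silence_position) -> int:
    """Deterministic next position in the numbered training script.

    The script interleaves silences with words, e.g.
    sil(1) the(2) sil(3) cat(4) sil(5) in(6) sil(7) the(8) sil(9) hat(10) sil(11).
    """
    if transition == 0:
        return prev_pos                       # still inside the current word or silence
    if is_silence_position(prev_pos):
        return prev_pos + 1                   # leaving a silence: move to the next word
    return prev_pos + 2 if skip_silence else prev_pos + 1

# In the numbering above, odd positions are silences.
is_sil = lambda p: p % 2 == 1
print(utterance_position_cpd(4, transition=1, skip_silence=1, is_silence_position=is_sil))   # 6 (silence skipped)
print(utterance_position_cpd(4, transition=1, skip_silence=0, is_silence_position=is_sil))   # 5 (silence inserted)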


Figure 15: Training with optional silence. The structure below the dotted line is substantially the same as the bigram decoding structure given in Figure 13.

Although we do not present the decoding graph, a similar modification of the previous decoding networks allows for decoding with multiple lexical variants.

5.4 Rescoring

It is frequently the case that one has a reasonably good system that produces a set of word hypotheses that one then wants to choose between on the basis of a more sophisticated model. This process is referred to as rescoring, and GMTK is ideally suited to rescoring existing hypotheses with a more sophisticated model. The simplest kind of rescoring, and the one we will consider in this section, is n-best rescoring, where there are n possible word sequences to choose between, and the sequences are simply enumerated; for example,

The cat in the hat.
The cat is the hat.
The cat in a hat.
The cat is a hat.

When the hypotheses are enumerated like this, rescoring can be done simply by computing the data probability according to each hypothesis with the basic model of Figure 12. This is a nice situation, because then exactly the same model structure can be used for both training and testing. There are two disadvantages, however, the first being that an outside system is required to produce the hypotheses, and the second that a much more compact representation is often available in the form of a word lattice, as illustrated for “the cat in the hat” in Figure 17. While the first disadvantage is intrinsic to the rescoring paradigm, the second can be alleviated with a somewhat more sophisticated model structure, as discussed in the following section.

5.5 Graphical Models and Stochastic Finite State Automata

Since lattices are a compact and useful representation in many applications, it is fortunate that a straightforward procedure allows them to be represented and manipulated in the graphical model framework [109]. To understand the exact analogy, it is necessary to define exactly what we mean by a lattice. This can be done with the following and relatively standard definition. A lattice consists of:

1. a set of states;

2. a set of directed arcs connecting the states;

3. a subset of states identified as “initial states”;

4. a subset of states identified as “final states”;

5. a probability distribution over the arcs leaving each state.


Figure 16: Training with lexical variants.


Figure 17: A word lattice. Each path from the leftmost point to the rightmost point represents a possible word sequence. The number of complete distinct paths can grow exponentially in the number of edges in the lattice, making it a far more compact representation than an enumeration of all possible word sequences.


Figure 18: A stochastic finite state automaton that represents several words and their pronunciation variants. The initial state is on the left and shaded; the final state is shaded on the right. Each path through the graph represents a valid pronunciation of a word.


Figure 19: A graphical model structure for representing generic SFSAs with path-length four. An assignment to the state and transition variables corresponds to a path through the automaton.

The semantics of a lattice are that it represents a set of paths, each of which starts in an initial state, ends in a final state, and has a probability equal to the product of the transition probabilities encountered on the way. Figure 18 illustrates a stochastic finite state automaton. As detailed in [109], there is a straightforward construction process by which a set of paths can be represented in a graphical model. The one restriction is that the paths must be of a given length; in real applications with concrete observation streams, this is always the case. As in previous examples, the key is to explicitly represent the state that is occupied at each time frame. Also in common with previous examples, there is a transition variable at each frame that in this case specifies which arc to follow out of the lattice state. Figure 19 illustrates a graphical model that can represent a generic SFSA. In the model of Figure 19, the cardinality of the state variables is equal to the number of states in the underlying SFSA. The cardinality of the transition variable is equal to the maximum out-degree of any state. The conditional probability of the transition variable taking a particular value k, given that the state variable has value j, is equal to the probability of taking the kth arc out of state j. The end-of-path variable has an artificially assigned value of 1, which is only possible if the state variable has a value equal to a predecessor of a final state, and the transition variable has a value that denotes an arc leading to a final state. It can be easily shown [109] that conditional probabilities defined in this way lead to a model in which each assignment of values to the variables either corresponds to a valid path through the automaton (and gets the same probability), or else corresponds to an illegal path and gets zero probability. Note that each state (except initial and final ones) in the SFSA of Figure 18 is labeled with an output symbol. In some cases, it is useful to have “null” or unlabeled states in the interior of the graph. In this case, multiple transitions may be taken within a single time-frame, and the representation of Figure 19 is no longer adequate. Instead, multiple transition and state variables are required in each frame. This type of model is discussed in more detail in [109]. Note also that in the previous section, we saw a lattice in which the arcs - not the nodes - were labeled. In fact, a trivial conversion in which each labeled arc is broken into two unlabeled arcs with a labeled state sandwiched in between shows that the two representations are equivalent.
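The construction above can be sketched as follows (our own illustration with a made-up toy SFSA; the data structures and function names are hypothetical and are not those of [109] or GMTK):

# A toy SFSA: each state maps to a list of (next_state, probability) arcs.
arcs = {
    0: [(1, 0.6), (2, 0.4)],
    1: [(3, 1.0)],
    2: [(3, 1.0)],
    3: [],                       # final state
}
final_states = {3}
max_out_degree = max(len(a) for a in arcs.values())   # cardinality of the transition variable

def transition_cpd(k: int, state: int) -> float:
    """P(transition = k | state): probability of taking the k-th arc out of the state."""
    return arcs[state][k][1] if k < len(arcs[state]) else 0.0

def next_state(state: int, k: int) -> int:
    """Deterministic successor implied by taking the k-th arc out of the state."""
    return arcs[state][k][0]

def end_of_path_prob(state: int, k: int) -> float:
    """P(end-of-path = 1 | state, transition = k): 1 only if the chosen arc enters a final state."""
    if k < len(arcs[state]) and arcs[state][k][0] in final_states:
        return 1.0
    return 0.0

# Path 0 -> 2 -> 3: arc 1 out of state 0, then arc 0 out of state 2 (which enters a final state).
path_prob = transition_cpd(1, 0) * transition_cpd(0, 2) * end_of_path_prob(2, 0)
print(path_prob)    # 0.4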

6 GMTK: The graphical models toolkit

As mentioned in earlier sections, with GMs one uses a graph to describe a statistical process, and thereby defines one of its most important attributes, namely conditional independence. Because GMs describe these properties visually, it is possible to rapidly specify a variety of models without much effort. Again, GMs subsume much of the statistical underpinnings of existing ASR techniques; no other known statistical abstraction appears to have this property. More importantly, the space of statistical algorithms representable with a GM is enormous, much larger than what has so far been explored for ASR. The time therefore seems ripe to start seriously examining such models. Of course, this task is not possible without a (preferably freely-available and open-source) toolkit with which one may maneuver through the model space easily and efficiently, and this section describes the first version of GMTK. GMTK can represent all of the models that have been described in previous sections, and was used throughout the 2001 JHU workshop. GMTK is meant to complement rather than replace other publicly available packages: it has unique features, ones that are different from both standard ASR-HMM [108, 99, 57] and standard Bayesian network [16, 81] packages. This section contains a detailed description of GMTK’s features, including a language for specifying structures and probability distributions, logarithmic-space exact training and decoding procedures, the concept of switching parents, and a generalized EM training method which allows arbitrary sub-Gaussian parameter tying. Taken together, these features endow GMTK with a degree of expressiveness and functionality that significantly complements other publicly available packages. Full documentation on the features and use of GMTK can be found at the Web location given in citation [8].

6.1 Toolkit Features

GMTK has a number of features that support a wide array of statistical models suitable for speech recognition and other time-series data. GMTK may be used to produce a complete ASR system for both small- and large-vocabulary domains. The graphs themselves may represent everything from N-gram language models down to Gaussian components, and the probabilistic inference mechanism supports first-pass decoding in these cases.

6.1.1 Explicit vs. Implicit Modeling

In general, and as discussed in Section 5, there are two representational extremes one may employ when using graphical models, and in particular GMTK, for an ASR system. On the one hand, a graph may explicitly represent all the underlying variables and control mechanisms (such as sequencing) that are required in an ASR system [109]. We call this approach an “explicit representation,” where variables can exist for such purposes as word identification, numerical word position, phone or phoneme identity, the occurrence of a phoneme transition, and so on. In this case, the structure of the graph explicitly represents the interesting hidden structure underlying an ASR system. On the other hand, one can instead place most or all of this control information into a single hidden Markov chain, and use a single integer state to encode all contextual information and control the allowable sequencing (Figure 11, top left). We call this approach an “implicit” representation. As an additional example of these two extremes, consider the word “yamaha” with pronunciation /y aa m aa hh aa/. The phoneme /aa/ occurs three times, each in different contexts, first preceding an /m/, then preceding an /hh/, and finally preceding a word boundary. In an ASR system, it must somewhere be specified that the same phoneme /aa/ may be followed only by one of /m/, /hh/, or a word boundary depending on the context; /aa/, for example, may not be followed by a word boundary if it is the first /aa/ of the word. In the explicit GM approach, the graph and associated conditional probabilities unambiguously represent these constraints. In an implicit approach, all of the contextual information is encoded into an expanded single-variable hidden state space, where multiple HMM states correspond to the same phoneme /aa/ but in different contexts. The explicit approach is useful when modeling the detailed and intricate structures of ASR. It is our belief, moreover, that such an approach will yield improved results when combined with a discriminative structure (see above and [6, 11]), because it directly exposes events such as word-endings and phone-transitions for use as switching parents (see Section 6.1.7). The implicit approach is further useful in tempering computational and/or memory requirements. In any case, GMTK supports both extremes and everything in between; a user of GMTK is therefore free to experiment with quite a diverse and intricate set of graphs. It is the task of the toolkit to derive an efficient inference procedure for each such system.

frame: 0 {
  variable : state {
    type : discrete hidden cardinality 4000;
    switchingparents : nil;
    conditionalparents : nil using MDCPT("pi");
  }
  variable : observation {
    type : continuous observed 0:38;
    switchingparents : nil;
    conditionalparents : state(0) using mixGaussian mapping("state2obs");
  }
}

frame: 1 {
  variable : state {
    type : discrete hidden cardinality 4000;
    switchingparents : nil;
    conditionalparents : state(-1) using MDCPT("transitions");
  }
  variable : observation {
    type : continuous observed 0:38;
    switchingparents : nil;
    conditionalparents : state(0) using mixGaussian mapping("state2obs");
  }
}

chunk 1:1;

Figure 20: GMTKL specification of a basic HMM structure. The feature vector in this case is 39-dimensional, and there are 4000 hidden states. Frame 1 can be duplicated or “unrolled” to create an arbitrarily long network.

6.1.2 The GMTKL Specification Language

A standard DBN [29] is typically specified by listing a collection of variables along with a set of intra- and inter-dependencies which are used to unroll the network over time. GMTK generalizes this ability via dynamic GM templates. The template defines a collection of (speech) frames and a chunk specifier. Each frame declares an arbitrary set of random variables and includes attributes such as parents, type (discrete, continuous), parameters to use (e.g. discrete probability tables or Gaussian mixtures), and parameter sharing. At the end of a template is a chunk specifier (two integers, N : M) which divides the template into a prologue (the first N − 1 frames), a repeating chunk, and an epilogue (the last T − M frames, where T is the frame-length of the template). The middle chunk of frames is “unrolled” until the dynamic network is long enough for a specific utterance. GMTK uses a simple textual language (GMTKL) to define GM templates. Figure 20 shows the template of a basic HMM in GMTKL. It consists of two frames, each with a hidden and an observed variable, and dependencies between successive hidden variables and between observed and hidden variables. For a given template, unrolling is valid only if all parent variables in the unrolled network are compatible with those in the template. A compatible variable has the same name, type, and cardinality. It is therefore possible to specify a template that cannot be unrolled and which would lead to GMTK reporting an error. A template chunk may consist of several frames, where each frame contains a different set of variables. Using this feature, one can easily specify multi-rate GM networks where variables occur over time at rates which are fractionally but otherwise arbitrarily related to each other.
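The unrolling of a template can be sketched as follows (our own illustration; the frame indexing and names here are ours and may differ from GMTK's internal conventions, and the sketch assumes the lengths divide evenly):

def unroll(template_frames, n, m, target_length):
    """Unroll a dynamic GM template to cover target_length frames.

    template_frames : per-frame declarations (here just frame names)
    n, m            : chunk specifier; frames n..m (inclusive, 0-indexed) repeat
    """
    prologue = template_frames[:n]
    chunk = template_frames[n:m + 1]
    epilogue = template_frames[m + 1:]
    repeats = (target_length - len(prologue) - len(epilogue)) // len(chunk)
    return prologue + chunk * repeats + epilogue

# The HMM template of Figure 20 has two frames and chunk specifier 1:1,
# so frame 1 is duplicated until the network is long enough for the utterance.
print(unroll(["frame0", "frame1"], n=1, m=1, target_length=6))
# ['frame0', 'frame1', 'frame1', 'frame1', 'frame1', 'frame1']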

6.1.3 Inference

The current version of GMTK supports a number of operations for computing with arbitrary graph structures, the four main ones being:

1. Integrating over hidden variables to compute the observation probability: P(o) = Σ_h P(o, h)

2. Finding the likeliest hidden variable values: argmax_h P(o, h)

3. Sampling from the joint distribution P(o, h)


Figure 21: When S = 1, A is C’s parent; when S = 2, B is C’s parent. S is called a switching parent, and A and B conditional parents.

4. Parameter estimation given training data {o^k} via EM/GEM: argmax_θ Π_k P(o^k | θ)

A critical advantage of the graphical modeling framework derives from the fact that these algorithms work with any graph structure, and a wide variety of conditional probability representations. GMTK uses the Frontier Algorithm, detailed in [109, 111], which converts arbitrary graphs into equivalent chain-structured ones, and then executes a forwards-backwards recursion. The frontier algorithm is standard junction-tree inference [60, 84, 25, 61] where the junction tree is formed via a constrained triangulation algorithm. The triangulation algorithm is equivalent to the variable elimination algorithm, but where the variable order is constrained such that the variables occur in topological order relative to the original directed model. The chain structure is useful because it makes it easier to do beam-pruning, to work with deterministic relationships between variables, and to implement logarithmic-space inference.

6.1.4 Logarithmic Space Computation

In many speech applications, observation sequences can be thousands of frames long. When there are a dozen or so variables per frame (as in an articulatory network, see below), the resulting unrolled network might have tens of thousands of nodes, and cliques may have millions of possible values. A naive implementation of exact inference, which stores all clique values for all time, would result in (an obviously prohibitive) gigabytes of required storage. To avoid this problem, GMTK implements a recently developed procedure [13, 110] that reduces memory requirements exponentially, from O(T) to O(log T) (this has also been called the Island algorithm). This reduction has a truly dramatic effect on memory usage, and can additionally be combined with GMTK’s beam-pruning procedure for further memory savings. The key to this method is recursive divide-and-conquer. With k-way splits, the total memory usage is O(k log_k T), and the runtime is O(T log_k T). The constant of proportionality is related to the number of entries in each clique, and becomes smaller with pruning. For algorithmic details, the reader is referred to [110].

6.1.5 Generalized EM

GMTK supports both EM and generalized EM (GEM) training, and automatically determines which to use based on the parameter sharing currently in use. When there is no parameter tying, normal EM is employed. GMTK, however, has a flexible notion of parameter tying, down to the sub-Gaussian level; in such a case, the typical EM training algorithm does not lead to analytic parameter update equations. GMTK’s GEM training is distinctive because it provides a provably convergent method for parameter estimation, even when there is an arbitrary degree of tying, even down to the level of Gaussian means, covariances, or factored covariance matrices (see Section 6.1.9).

6.1.6 Sampling

Drawing variable assignments according to the joint probability distribution is useful in a variety of areas ranging from approximate inference to speech synthesis, and GMTK supports sampling from arbitrary structures. The sampling procedure is computationally inexpensive, and can thus be run many times to get a good distribution over hidden (discrete or continuous) variable values.

6.1.7 Switching Parents
GMTK supports another novel feature rarely found in GM toolkits, namely switching parent functionality (also called Bayesian multi-nets [11]). This was already used in Section 6.1.1. Normally, a variable has only one set of parents. GMTK, however, allows a variable's parents to change (or switch) conditioned on the current values of other parents. The parents that may change are called conditional parents, and the parents which control the switching are called

switching parents. Figure 21 shows the case where variable S switches the parents of C between A and B, corresponding to the probability distribution: P(C | A, B) = P(C | A, S = 1) P(S = 1) + P(C | B, S = 2) P(S = 2). This can significantly reduce the number of parameters required to represent a probability distribution; for example, P(C | A, S = 1) needs only a two-dimensional table whereas P(C | A, B) requires a three-dimensional table. Switching functionality has found particular utility in representing certain language models, as experiments during the JHU2001 workshop demonstrated.
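As a toy numerical illustration of this decomposition (our own sketch with made-up tables; GMTK itself specifies switching parents in its structure-description language), two small two-dimensional tables plus P(S) stand in for the full three-dimensional table P(C | A, B):

import numpy as np

# Hypothetical switching-parent CPD: S selects whether C depends on A or on B.
p_S = np.array([0.4, 0.6])              # P(S=1), P(S=2)
p_C_given_A = np.array([[0.9, 0.1],     # P(C | A, S=1); rows indexed by A
                        [0.2, 0.8]])
p_C_given_B = np.array([[0.7, 0.3],     # P(C | B, S=2); rows indexed by B
                        [0.1, 0.9]])

def p_C_given_AB(a, b):
    """P(C | A=a, B=b) = P(C | A=a, S=1) P(S=1) + P(C | B=b, S=2) P(S=2)."""
    return p_S[0] * p_C_given_A[a] + p_S[1] * p_C_given_B[b]

print(p_C_given_AB(0, 1))   # [0.42, 0.58] = 0.4*[0.9, 0.1] + 0.6*[0.1, 0.9]

Here the switching representation stores two 2x2 tables and P(S) instead of the full 2x2x2 table, which is exactly the parameter saving noted above.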

6.1.8 Discrete Conditional Probability Distributions
GMTK allows the dependency between discrete variables to be specified in one of three ways. First, they may be deterministically related using flexible n-ary decision trees. This provides a sparse and memory-efficient representation of such dependencies. Alternatively, fully random relationships may be specified using dense conditional probability tables (CPTs). In this case, if a variable of cardinality N has M parents of the same cardinality, the table has size N^{M+1}. Since this can get large, GMTK supports a third, sparse method to specify random dependencies. This method combines sparse decision trees with sparse CPTs so that zero entries in a CPT are simply not stored. The method also allows flexible tying of discrete distributions from different portions of a CPT.
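The memory argument can be made concrete with a small back-of-the-envelope sketch (ours, purely illustrative; it is not GMTK's decision-tree or CPT file format):

# A dense CPT over a child of cardinality N with M parents of the same
# cardinality stores N**(M+1) probabilities; a deterministic or sparse
# relationship needs only one stored entry per parent configuration.
N, M = 50, 2
dense_entries = N ** (M + 1)                 # 125,000 probabilities
# Deterministic example: child = (p1 + p2) mod N, stored sparsely.
sparse_cpt = {(p1, p2): {(p1 + p2) % N: 1.0} for p1 in range(N) for p2 in range(N)}
sparse_entries = sum(len(row) for row in sparse_cpt.values())   # 2,500 entries
print(dense_entries, sparse_entries)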

6.1.9 Graphical Continuous Conditional Distributions
GMTK supports a variety of continuous observation densities for use as acoustic models. Continuous observation variables for each frame are declared as vectors in GMTKL, and each observation vector variable can have an arbitrary number of conditional and switching parents. The current values of the parents jointly determine the distribution used for the observation vector. The mapping from parent values to child distribution is specified using a decision tree, allowing a sparse representation of this mapping. A vector observation variable spans a region of the feature vector at the current time. GMTK thereby supports multi-stream speech recognition, where each stream may have its own set of observation distributions and sets of discrete parents. The observation distributions themselves are mixture models. GMTK uses a splitting and vanishing algorithm during training to learn the number of mixture components. Two thresholds are defined, a mixture-coefficient vanishing ratio (mcvr) and a mixture-coefficient splitting ratio (mcsr). Under a K-component mixture with component probabilities p_k, if p_k < 1/(K × mcvr), then the kth component will vanish. If p_k > mcsr/K, that component will split. GMTK also supports forced splitting (or vanishing) of the N most (or least) probable components at each training iteration. Sharing portions of a Gaussian such as means and covariances can be specified either by hand via parameter files, or via a split (e.g., the split components may share an original covariance). Each component of a mixture is a general conditional Gaussian. In particular, the c-component probability is p(x | z_c, c) = N(x | B_c z_c + f_c(z_c) + µ_c, D_c), where x is the current observation vector, z_c is a c-conditioned vector of continuous observation variables from any observation stream and from the past, present, or future, B_c is an arbitrary sparse matrix, f_c(z_c) is a multi-logistic non-linear regressor, µ_c is a constant mean residual, and D_c is a diagonal covariance matrix. Any of the above components may be tied across multiple distributions, and trained using the GEM algorithm. GMTK treats Gaussians as directed graphical models [7], and can thereby represent all possible Gaussian factorization orderings, and all subsets of parents in any of these factorizations. Under this framework, GMTK supports diagonal, full, banded, and semi-tied factored sparse inverse covariance matrices [12]. GMTK can also represent arbitrary switching dependencies between individual elements of successive observation vectors. GMTK thus supports both linear and non-linear buried Markov models [10]. All in all, GMTK supports an extremely rich set of observation distributions.
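The splitting and vanishing rule itself is simple enough to sketch in a few lines (hypothetical code with arbitrary illustrative thresholds, not GMTK's trainer, which applies the rule internally):

import numpy as np

def split_and_vanish(weights, mcvr=50.0, mcsr=1.5):
    """Apply the rule described above: under a K-component mixture with
    component probabilities p_k, component k vanishes if p_k < 1/(K*mcvr)
    and splits if p_k > mcsr/K."""
    K = len(weights)
    vanish = [k for k, p in enumerate(weights) if p < 1.0 / (K * mcvr)]
    split = [k for k, p in enumerate(weights) if p > mcsr / K]
    return split, vanish

w = np.array([0.55, 0.30, 0.148, 0.002])
print(split_and_vanish(w))   # ([0], [3]): component 0 splits, component 3 vanishes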

7 The EAR Measure and Discriminative Structure Learning Heuristics

This section summarizes different approaches to modifying the structure of an HMM in order to improve classification performance. The underlying goal in this endeavor is to augment the structure of an HMM in a structurally discriminative fashion. The space of possible models that is spanned by the optimization procedure is also described. In particular, the graph structure for an HMM is treated such that the vector observation variables are expanded into their individual variables, as in a BMM [10]. These observation vectors, and the element variables therein, are augmented with discriminative cross-observation edges, leading to fewer independence statements made by the model in an attempt to

improve structural discriminability. This is shown in Figure 22. This section also introduces the EAR measure [6], but provides a new and more precise derivation of the assumptions needed for it to be obtained. In doing so, a new optimal criterion function for structural discriminability (fast-forward to Equation 4) is derived. This derivation could lead to additional novel heuristics to achieve structural discriminability.

7.1 Basic Set-Up
The basic problem that we consider may be summarized thus: Let Q denote a hidden class variable, taking values q; let X denote a (vector-valued) variable comprising a set of acoustic features observed at a specific frame; finally, let W denote a set of prior observations that may also be useful for determining which class q is associated with the given frame. Typically the dimension of W is too large for all of these features to be incorporated into the predictive model for the current time frame. Thus we face a model selection problem: find a model M_Z, which incorporates a subset of features Z (Z ⊆ W), but which gives good predictive performance. We may formalize this as follows: our goal is to choose a model M_Z from a class of models M, such that the resulting fitted distribution p̂_Z(Q | X, W) maximizes:

E_{p(Q,X,W)} log p̂_Z(Q | X, W).    (3)

Since

KL(p(Q | X, W) || p̂_Z(Q | X, W)) = E_{p(Q,X,W)} log [ p(Q | X, W) / p̂_Z(Q | X, W) ],

maximizing (3) is equivalent to minimizing the KL-divergence between the true distribution p(Q | X, W) and the fitted distribution p̂_Z(Q | X, W).


Figure 22: Basic set-up: we wish to find a fixed-size set Z ⊂ W of parents for X which will lead to an optimal classification model for Q given X and W. Q is typically a hidden variable at one time step of an HMM, and X is the feature vector at the current time point. W is the set of all possible additional parents of X that could be chosen; W might consist either of collections of X vectors from earlier times, or could consist of entirely different features that are not ordinarily used for an X at any time.

7.2 Selecting the optimal number of parents for an acoustic feature X
Here we consider the simplest version of the model selection problem. For a given Z ⊂ W, we define the following model:

M_Z = {p* | p*(x, q, w) = p*(q, w) p*(x | q, z)}.

This corresponds to a graphical model in which Q, W form a clique, and the parents of X are Q and Z. M_Z is simply the set of model distributions in which X ⊥⊥ W \ Z | Q, Z holds. Note that while X ⊥⊥ W \ Z | Q, Z would hold for a particular model that has been selected, it is not necessarily the case that X ⊥⊥ W \ Z | Q, Z is correct according to the true generative model p(x, q, w). As mentioned in Section 4, this is not a concern, as the goal here is only to obtain generative models that discriminate well. We now define the model class: M_c = {M_Z where |Z| = c}. This is simply the set of graphical models in which X has exactly c parents in addition to Q, and Q, W forms a clique. For a given model M_Z we will let p̂_Z denote the fitted distribution, given data, under the model M_Z. Since Q, W forms a clique (see footnote 2), the fitted distribution and the true distribution are the same: p̂_Z(Q, W) = p(Q, W). Similarly, p̂_Z(X | Q, Z) = p(X | Q, Z), since we allow the variables X to form a clique. (If we are fitting a parametric rather than a non-parametric sub-model of M_Z then these last two equations will not hold; we return to this point below.) Note that for models within M_Z we have that

p̂_Z(X, W) = Σ_q p̂_Z(X, q, W) = Σ_q p̂(q, W) p̂(X | q, Z).

Now,

E_p log p̂_Z(Q | X, W) = E_p log p̂_Z(Q, X, W) − E_p log p̂_Z(X, W)

 = E_p log p̂_Z(X | Q, W) + E_p log p̂_Z(Q, W) − E_p log p̂_Z(X, W)

 (∗) = E_p log p̂_Z(X | Q, Z) + E_p log p̂_Z(Q, W) − E_p log p̂_Z(X, W)

 = E_p log p(X | Q, Z) + E_p log p(Q, W) − E_p log p̂_Z(X, W)

 = I(X; Z | Q) + E_p log p(X | Q) + E_p log p(Q, W) + KL(p(X, W) || p̂_Z(X, W)) − E_p log p(X, W)

where the step marked (∗) follows from the conditional independence assumption made in the model M_Z. Disregarding terms which do not depend on Z, we then see that the optimal Z is that which maximizes:

I(X; Z | Q) + KL(p(X, W) || p̂_Z(X, W)).    (4)

Thus, in words, we wish to find the set Z which maximizes the conditional mutual information between X and Z given Q, but which at the same time maximizes the KL divergence between p(X, W) and p̂_Z(X, W). Note that the only assumption that we have made so far is that

p̂_Z(X | Q, Z) = p(X | Q, Z)  and  p̂_Z(Q, W) = p(Q, W).

These assumptions will hold true if we are fitting a non-parametric model, e.g., as would be the case if we were fitting discrete Bayes nets. However, note that if p̂_Z(Q, W) ≠ p(Q, W), but all models M_Z agree on the distribution of p(Q, W) (i.e., they all use the same distribution over the variables Q and W), then the expression obtained in equation (4)

2 When integrating with respect to the true distribution p, if the model is non-parametric, it is only conditional independence statements which distinguish the model from the true distribution. For a clique, there are no independence statements.

would still select the optimal model under criterion (3), since in this case the term E_p log p̂_Z(Q, W) does not depend on Z. More formally, criterion (4) is correct if we are considering selecting among models:

M_Z = {p* | p*(x, q, w) = p_0(q, w) p*(x | q, z)},

where p_0(q, w) is a fixed distribution. This would be the case when learning structure in an HMM in which the model for p(Q, W) was already fixed.

7.3 The EAR criterion
Expanding the KL-divergence term in (4) which depends on Z, we obtain the following:

−E_p log p̂_Z(X, W) = −E_p log Σ_q p̂_Z(q, X, W)
 = −E_p log Σ_q p̂_Z(X | q, W) p̂_Z(q | W) − E_p log p̂_Z(W)
 = −E_p log Σ_q p̂_Z(X | q, Z) p̂_Z(q | W) − E_p log p̂_Z(W)
 = −E_p log Σ_q p(X | q, Z) p(q | W) − E_p log p(W),

where in the last line we use the fact that p̂_Z(X | Q, Z) = p(X | Q, Z), p̂_Z(Q | W) = p(Q | W), and p̂_Z(W) = p(W). If Q ⊥⊥ (W \ Z) | Z in the true distribution p, so that p(q | W) = p(q | Z), then the sum in the last expression becomes:

−E_p log Σ_q p(X | q, Z) p(q | Z) = −E_p log p(X | Z)
 = −I(X; Z) − E_p log p(X)

Thus if Q⊥⊥(W \ Z) | Z in the true distribution then maximizing (4) is equivalent to maximizing:

EAR(Z) = I(X; Z | Q) − I(X; Z) (5)

This is the Explaining Away Residual (EAR) criterion proposed by Bilmes (1998). Since

I(X; Z | Q) − I(X; Z) = I(X; Q | Z) − I(X; Q)

and the last term on the RHS does not depend on Z, a criterion which is equivalent to the EAR criterion results from selecting the set Z which maximizes: EAR*(Z) = I(X; Q | Z). If Q ⊥⊥ (W \ Z) | Z does not hold in the true distribution p, then we are no longer guaranteed that the set Z which optimizes the EAR criterion (5) will optimize our objective (3). Though this independence is unlikely to hold exactly in the true distribution, it may hold approximately in contexts where the features W relate to noise, which is independent of the state Q. One obvious approach to assessing whether or not Q ⊥⊥ (W \ Z) | Z would be to calculate I(Q; W \ Z | Z) for different choices of Z. In particular, it would seem to be of concern if I(Q; W \ Z_EAR | Z_EAR) >> 0, where Z_EAR is the set which optimizes the EAR criterion. Note that the EAR criterion (5) will also be equivalent to (3) in the case where X ⊥⊥ Q | Z in the true distribution p, since in that case

E_p log Σ_q p(X | q, Z) p(q | W) = E_p log p(X | Z).

However, this independence seems rather unlikely to hold in practice.
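For scalar, discretized features, both terms of the EAR measure (5) can be estimated directly from empirical counts. The following is a small sketch of such an estimator (our own; the workshop used its own mutual-information tools, and the quantile-binning choices here are assumptions):

import numpy as np

def mutual_info(joint):
    """I(A;B) in nats from a joint count table of shape (|A|, |B|)."""
    p = joint / joint.sum()
    pa = p.sum(axis=1, keepdims=True)
    pb = p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / (pa @ pb)[nz])).sum())

def ear(x, z, q, bins=16):
    """Estimate EAR(Z) = I(X;Z|Q) - I(X;Z) for scalar samples x, z and labels q."""
    edges = lambda v: np.quantile(v, np.linspace(0, 1, bins + 1)[1:-1])
    xq, zq = np.digitize(x, edges(x)), np.digitize(z, edges(z))
    hist = lambda a, b: np.histogram2d(a, b, bins=(bins, bins),
                                       range=((0, bins), (0, bins)))[0]
    i_xz = mutual_info(hist(xq, zq))                  # unconditional term
    i_xz_q = 0.0                                      # conditional term
    for c in np.unique(q):
        m = (q == c)
        i_xz_q += m.mean() * mutual_info(hist(xq[m], zq[m]))
    return i_xz_q - i_xz

# X and Z below share mostly class information, so the estimated EAR is negative:
rng = np.random.default_rng(1)
q = rng.integers(0, 4, 5000)
x = q + 0.5 * rng.standard_normal(5000)
z = q + 0.5 * rng.standard_normal(5000)
print(ear(x, z, q))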

7.4 Class-specific EAR criterion
The EAR criterion may be adapted to select class-specific sets of parents, 'switching parents', as follows:

EARq(Z) = I(X; Z | Q = q) − I(X; Z)

Class-specific sets of parents allow the corresponding statistical model to encode ‘context-specific independence’ (CSI) constraints, of the form: X⊥⊥Y | Z = z

7.5 Optimizing the EAR criterion: heuristic search
In principle, optimizing the EAR criterion 'simply' requires us to calculate I(X; Z|Q) and I(X; Z). In practice, when Z represents a set of covariates, the calculation of these quantities for a large speech corpus is computationally intensive. Since each feature vector typically contains between 20 and 40 components, and there are thought to be long-range dependencies at time lags of up to 150 ms (with 10 ms per time frame), the set W of potential parents for a given X variable may contain thousands of candidate covariates. Under these circumstances it is infeasible to optimize the EAR criterion directly. This motivates the use of heuristic search procedures. A greedy algorithm provides a simple heuristic search procedure for finding a set of size k:

(a) Set Z = ∅.
(b) Find the variable U ∈ W \ Z for which EAR(Z ∪ {U}) is maximized. Add U to Z.
(c) Repeat step (b) until Z has dimension k.
(A sketch of this greedy procedure appears below.)
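The sketch assumes a callable ear_score(Z) that returns the (vector) EAR value of a candidate parent set; the function names are our own:

def greedy_ear_selection(candidates, ear_score, k):
    """Greedily grow a parent set Z of size k by maximizing EAR(Z ∪ {U})."""
    Z, remaining = [], set(candidates)
    while len(Z) < k and remaining:
        best = max(remaining, key=lambda u: ear_score(Z + [u]))
        Z.append(best)
        remaining.remove(best)
    return Z

Each step scores every remaining candidate with a vector mutual-information estimate, which is exactly the cost that motivates the scalar approximations of Section 7.6.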

7.6 Approximations to the EAR measure
The approach just described still suffers from the disadvantage that calculation of the EAR measure requires calculation of joint mutual information between vectors of variables, which in turn requires multivariate joint densities to be computed. This was infeasible for the speech recognition tasks that were considered during the Johns Hopkins workshop, given the time and resources that were available. Consequently we investigated simple approximations to the EAR measure that only required the calculation of information between scalars.

7.6.1 Scalar approximation 1

The first variable Z1 was selected by evaluating the EAR criterion, which for a single variable does not require evaluating information between vectors of variables. The second variable Z2 was found by taking the variable with the highest EAR value, among the remaining variables (W \ {Z1}), which at the same time did not appear 'redundant', in that it satisfied at least one of the following inequalities:

I(X; Z2 | Q) > I(X; Z1 | Q)   or   I(X; Z2 | Q) > I(Z1; Z2 | Q)    (6)

This is depicted in Figure 23, where edge thickness roughly corresponds to mutual-information value.


Figure 23: Heuristic: conditions under which a second variable Z2 is not considered redundant with the first variable Z1 added. Edge thickness corresponds roughly with mutual-information magnitude.

The motivation behind verifying these inequalities was as follows: if

Z2⊥⊥X | Z1,Q

then adding Z2, in addition to Q and Z1, as a parent of X will not change the resulting model. This conditional independence implies the reverse inequalities via the information processing inequality:

I(X; Z2 | Q) ≤ I(X; Z1 | Q) and I(X; Z2 | Q) ≤ I(Z1; Z2 | Q)

Consequently, if at least one of the inequalities (6) holds, then the conditional independence cannot hold. Similarly, if a third variable is required, then we selected the variable with the highest EAR measure which satisfies at least one of the inequalities

I(X; Z3 | Q) > I(X; Zi | Q)   or   I(X; Z3 | Q) > I(Zi; Z3 | Q)   for each of i = 1 and i = 2.

However, in practice we found that this rule was not helpful in selecting additional parents. We also considered schemes in which these redundancy tests were traded off. However, these schemes did not lead to notable improvement in recognition. Note, however, that these redundancy criteria are not discriminative, in that they ask only for redundancy with respect to conditional mutual information, and not the EAR measure (and certainly not Equation 4). Therefore, it is not entirely surprising that these conditions showed little effect.

7.6.2 Scalar approximation 2
The second approximation method that we used was to rank variables based on the EAR measure, but simply to reject those for which the marginal information was too large; i.e., we eliminated from consideration those covariates Z* for which I(X; Z*) was greater than a threshold. Typically, this threshold was selected by calculating this quantity for all covariates Z* under consideration and then choosing an appropriate percentile.

7.6.3 Scalar approximation 3
The third approximation was the simplest of them all: choose the top n variables ranked by the EAR measure. This approach proved to perform about as well as the heuristics above, and was used for all word error experiments described in Sections 9 and 10.
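Given precomputed scalar tables ear[z] = I(X; z|Q) − I(X; z) and mi[z] = I(X; z) over the candidates, approximations 2 and 3 reduce to simple filtering and ranking, as in the following sketch (our own notation and an arbitrary percentile choice):

import numpy as np

def top_n_by_ear(ear, n):
    """Scalar approximation 3: the n candidates with the largest EAR values."""
    return np.argsort(ear)[::-1][:n]

def top_n_with_mi_threshold(ear, mi, n, percentile=90):
    """Scalar approximation 2: drop candidates whose marginal information
    I(X;Z) exceeds a percentile threshold, then rank the rest by EAR."""
    keep = np.flatnonzero(mi <= np.percentile(mi, percentile))
    return keep[np.argsort(ear[keep])[::-1][:n]]

ear = np.array([0.02, -0.01, 0.05, 0.03])
mi = np.array([0.40, 0.10, 0.90, 0.20])
print(top_n_by_ear(ear, 2))                  # [2 3]
print(top_n_with_mi_threshold(ear, mi, 2))   # [3 0]: candidate 2 rejected for its large I(X;Z)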

7.7 Conclusion
The main findings from the exploration of methods for discriminative learning of structure were as follows:
• The EAR method performed well in selecting a single additional parent for each feature;
• Adding class-specific parents via the class-specific EAR measure did not lead to significant improvements in performance. This stands in contrast, however, to previous work [6, 11] where a benefit was obtained with class-specific parents, using the EAR measure along with a randomized edge selection algorithm;
• To select additional parents it is necessary to evaluate mutual information between vectors of variables – it is not possible to judge the relevance of a variable simply by making comparisons between scalars.

8 Visualization of Mutual Information and the EAR measure

As described in Section 7, even the simplest scalar version of the EAR measure requires the computation of mutual information and conditional mutual information on a wide collection of pairs of scalar elements of speech feature vectors. Before we present new word error results using this measure, it is instructive in its own right to visualize such mutual information and the EAR measure on a set of quite different types of speech feature vectors. As will be seen, the degree to which these visualizations show large-magnitude EAR-measure values should correspond roughly to the degree to which discriminative HMM structure augmentation should improve word error performance in a speech recognition system. First, it must be noted that even the simple pairwise mutual information I(X; Z|Q), where X and Z are scalars, is obtained from a three-dimensional grid. X really means X_{t,i}, where t is the current time position and i is the ith element of the feature vector at time t, and Z really means Z_{t+τ,j}, where τ is the time lag between the current time position

and the candidate Z-variable, and j is the position in the vector at time t + τ. This is shown in Figure 24. In general, we consider the variables at time t the child variables (i.e., the child at time t and position i), and the variables at time t + τ the possible parent variables at position j. We call τ the lag.


Figure 24: EAR measure computation. The set of possible pairs of variables on which pair-wise mutual information (both conditional and unconditional) needs to be computed. This can often be thousands of pairs of variables.

It can therefore be seen that to visualize mutual information, conditional mutual information, or the EAR measure, it must be possible to visualize volumetric data, plotting the quantities as functions of i, j, and τ. One way to do this would be to represent slices through this volume at various fixed i, j, or τ. Another way might be to average across some dimension and project down onto a diagonal plane. This was used in [6]. For the purposes of the workshop, we found it simplest to average across a dimension and then project down onto one of the three planar axes, as depicted in Figure 25. This leads to three 2-dimensional color planes as described in the next three paragraphs. First, the j : τ plot is the average MI as a function of parent and time lag. This plot therefore shows the relative value of a parent overall at a given time lag τ and parent position j. If the value for a particular j, τ is large, then that parent at that time will be overall beneficial. Next, the i : τ plot is the average MI as a function of child and time lag. This plot shows the degree to which a child variable at position i at time t is "fed" useful information by all variables on average at time t + τ. This plot therefore shows the relative benefit of each time lag for each child. Finally, the i : j plot corresponds to the average MI as a function of parent and child. This plot therefore provides information about the most useful parent for each child overall, irrespective of time. If the MI value of a particular parent is low for a given child, then this parent will never be useful for that child. On the other hand, if the MI value is large, there will be many instances in the set of candidate variables where useful information about the child may be found. This view, of course, does not indicate the degree to which multiple parents might be redundant with one another, as only pair-wise scalar mutual information is calculated. Below, these three plots will be shown in a row, where the first j : τ plot is on the left, the second i : τ plot is shown in the middle, and the third i : j plot is shown on the right. Note that the above descriptions were in terms of mutual information (MI) (i.e., I(X; Z)). The three 2-dimensional plots could equally well describe conditional mutual information I(X; Z|Q), or the EAR measure I(X; Z|Q) − I(X; Z). In general, we have termed these plots "jet plots" for obvious reasons. We produced jet plots for differing feature sets (MFCCs, IBM LDA-MLLT features, and novel AM/FM-based features), differing corpora (Aurora 2.0, an IBM Audio/Video corpus, and the DARPA Speech in Noisy Environments One (SPINE-1) corpus), and differing hidden conditions (Q corresponding to phones, sub-phones, whole-words, and general HMM states). These comparisons in a single document allow the examination of the relative differences between different corpora, conditions, etc. The various corpora and features are described below.
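Once the chosen measure has been computed and stored as a three-dimensional array over (child position i, parent position j, lag index τ), the three projections are just averages over one axis. A minimal sketch (ours; the array M below is filled with placeholder random values):

import numpy as np

def jet_projections(M):
    """M[i, j, tau] holds a pairwise MI, conditional-MI, or EAR value for child
    position i, parent position j, and lag index tau.  Returns the j:tau,
    i:tau, and i:j projections shown in the jet plots."""
    parent_lag = M.mean(axis=0)    # j : tau  -- averaged over children
    child_lag = M.mean(axis=1)     # i : tau  -- averaged over parent positions
    parent_child = M.mean(axis=2)  # i : j    -- averaged over lags
    return parent_lag, child_lag, parent_child

M = np.random.rand(42, 42, 15)     # e.g., 42 features and 15 lags of 10 ms each
pj, il, ij = jet_projections(M)
print(pj.shape, il.shape, ij.shape)   # (42, 15) (42, 15) (42, 42)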

8.1 MI/EAR: Aurora 2.0 MFCCs
The first row of plots is given in Figure 26 (see footnote 3). This plot shows the conditional mutual information I(X; Z|Q) for the Aurora 2.0 corpus [56], where Q corresponds to different phones, as defined in the standard Aurora distribution. Aurora 2.0 is further described in Section 10.1. The plots are for MFCCs, and their first and second derivatives. Specifically,

3 If these plots are fuzzy and you are reading them in paper form, we suggest that you obtain an electronic PDF version of the document, with which it is possible to zoom in quite closely, as the plots are included within the document in reasonably high resolution.

Figure 25: EAR measure projections: the parent-lag (j : τ), child-lag (i : τ), and parent-child (i : j) plots.

the first 12 features correspond to cepstral coefficients c1 through c12, the next feature is c0, which is followed by log-energy. The deltas for these 14 features (in the same order) come next, followed by the double-deltas.

Figure 26: Average I(X; Z|Q) for Aurora 2.0 MFCCs, with Q a phone variable. Panels (left to right): parent position vs. time lag, child position vs. time lag, and parent position vs. child position.

The plots show a number of things. First, the left-most plot shows that the parents with the most mutual information in general come from c0 and log-energy, and their deltas (and to some extent double-deltas). Similarly, from the middle plot, the children which receive the most information from any parent at a given time are also c0 and log-energy (and deltas). This is confirmed by the third plot, which therefore states that, in absolute terms, most of the information is between c0 and log-energy and their time-lagged variants. There is also information between c0 (or log-energy) and its delta and double-delta versions, more at least than between the other features. Furthermore, most of the information is closer rather than farther away from the base time position t. It can also be seen from the diagonal lines in the right-most plot that features in general tend to share MI with lagged versions of themselves and with their derivatives. There is in fact information between different features, but it is at a lower magnitude than that involving c0 and log-energy, and is therefore difficult to see from this plot alone using this colormap. The EAR-measure (discriminative) version of these plots is shown in Figure 27. The first obvious thing to note is that discriminative MI is quite different from non-discriminative MI, suggesting that the choice of discriminative MI might have a significant impact on structure learning. In the left two EAR plots, for example, the same features that were valuable in the non-discriminative plots remain valuable, but only at much greater time lags. In fact, at the closer time lags these features seem to be some of the least discriminative edges to use. This trait extends to features other than c0 and log-energy: specifically, edges between neighboring features do not

tend to have an advantage over their distant counterparts. Moreover, looking at the right-most EAR plot, the lack of a clear diagonal indicates that edges between different feature positions appear to have more discriminative benefit than those between corresponding features.

Figure 27: Average I(X; Z|Q) − I(X; Z) (the EAR measure) for Aurora 2.0 MFCCs, with Q a phone variable.

Figure 28: Average I(X; Z|Q) for Aurora 2.0 MFCCs, with Q a phone-state variable.

Figure 29: Average I(X; Z|Q) − I(X; Z) (the EAR measure) for Aurora 2.0 MFCCs, with Q a phone-state variable.

The next set of plots is shown in Figure 28 and Figure 29. The plots appear to be similar to the ones above. This indicates that the change in hidden conditioning (i.e., Q moving from a phone random variable to a sub-phone HMM-state random variable) would not have a large impact on the structures that would be most beneficial in augmenting an HMM. Note also that the overall range of the EAR measure appears to be lower in the phone-state plots than in the phone plots. This might indicate that, conditioned on the phone-state, there would be less utility in an augmented structure. The next set of plots is shown in Figure 30 and Figure 31. These plots show the case when Q corresponds to an actual word (one of the words in the Aurora 2.0 vocabulary of size 11) rather than a phone or phone-state. Again, the

Figure 30: Average I(X; Z|Q) for Aurora 2.0 MFCCs, with Q a whole-word variable.

Figure 31: Average I(X; Z|Q) − I(X; Z) (the EAR measure) for Aurora 2.0 MFCCs, with Q a whole-word variable.

plots appear to be similar to the ones above. This indicates that the change in hidden conditioning (i.e., Q moving from a phone or sub-phone state random variable to a whole-word random variable) might not have a large impact on the structures that would be most beneficial in augmenting an HMM. Note, however, that the overall range of EAR values is in this case larger than in the two previous cases. This would appear to be encouraging.

Figure 32: Average I(X; Z|Q) for Aurora 2.0 MFCCs, with Q a whole-word state variable.

Lastly, for the Aurora 2.0 MFCC plots, Figure 32 and Figure 33 show the case where Q corresponds to whole-word states. That is, the HMM models in this case use entire words, but the conditioning set of the random variable Q corresponds to all of the possible states within each of the words. Once again, the patterns are the same, and in this case the EAR range seems to be the lowest of the set so far.

Figure 33: Average I(X; Z|Q) − I(X; Z) (the EAR measure) for Aurora 2.0 MFCCs, with Q a whole-word state variable.

Figure 34: Average I(X; Z|Q) for the IBM Audio-Visual corpus (audio only), LDA+MLLT features.

8.2 MI/EAR: IBM A/V Corpus, LDA+MLLT Features
The next set of plots corresponds to a heavily pre-processed set of feature vectors that were created at IBM Research. These features were computed on the IBM parallel Audio/Video corpus [49], although the plots shown here only include the audio portion of the corpus. While we had originally wanted to compute cross-stream dependencies between the audio and visual portions of the corpus, the 6-week time limitations prevented us from doing that. The audio stream features were obtained by training a linear discriminant analysis (LDA) transform on 9 frames of the cepstral coefficients, followed by a maximum likelihood linear transform (MLLT) [48]. The video stream features in this feature set were obtained by an LDA-MLLT transform of the pixels in a region of interest around the mouth, as described in [72]. The LDA-MLLT jet plots also demonstrated a striking difference between the EAR and non-discriminative MI plots. The first thing to note is that the magnitude of the MI plots is on average less than any seen so far with the

Figure 35: Average I(X; Z|Q) − I(X; Z) (the EAR measure) for the IBM Audio-Visual corpus (audio only), LDA+MLLT features.

Aurora MFCC plots. While this could be an issue with the data, it could also indicate that in general these features have less overall information available. The EAR plots are also informative in their range. It appears that most of the EAR values in these plots are negative, indicating that these features have been pre-processed to the point that they are likely to produce little if any gain when adding cross-observation edges to the model. Unfortunately, we did not generate the cross audio-video MI or EAR plots, which could show potentially useful and non-redundant cross-stream information.

8.3 MI/EAR: Aurora 2.0 AM/FM Features

Figure 36: Average I(X; Z|Q) for Aurora 2.0 Phonetact AM/FM features, with Q a whole-word state variable.

Figure 37: Average I(X; Z|Q) − I(X; Z) (the EAR measure) for Aurora 2.0 Phonetact AM/FM features, with Q a whole-word state variable.

We also applied novel AM/FM features, provided by Phonetact Inc., to this analysis. We applied these features to the Aurora whole-word state case. We computed AM (Amplitude Modulation) and FM (Frequency Modulation) features (see footnote 4). These are computed by dividing the spectrum into 20 equally spaced bands using multiple complex quadrature band-pass filters. For each neighboring pair of filters, the higher-band filter output is multiplied by the conjugate of the lower-band output. The result is low-pass filtered and sampled every 10 ms. The FM features are the sine of the angle of the sampled output, and the AM features are the log of the real component. Although we expect that these features could be improved by further processing (e.g. cosine transform, mean subtraction, derivative-concatenation), we used the raw features to provide the maximum contrast with MFCCs. The plots are shown in Figure 36 and Figure 37. In these plots, the FM features fill the lower half of the plot, while the AM features fill the upper half. The difference between discriminative and non-discriminative jet plots is perhaps most visible in this case. In the left two non-discriminative MI plots, the most MI is found in the AM Phonetact features–the top half of the features–at a time lag of up to -50. The FM features have relatively less MI. In the left two EAR plots, however, the AM features offer significant MI only at time lags earlier than around 50 ms, and the FM features go from offering little help in the MI case to potentially providing useful discriminative information in this case.

4 We thank Y. Brandman of Phonetact, Inc. for providing this technology.

The right-most MI plot shows a great deal of energy between parent and child AM features, especially when parents and children share the same feature number; the rest of the plot shows less. There appears to be little cross-information between the AM and FM features. In the third, discriminative plot, however, the entire plot shows little energy, and the regions have become fairly homogeneous. As will be seen in the experimental section, we use these features as conditional-only features (i.e., they are used as additional features W, as described in Section 7).

8.4 MI/EAR: SPINE 1.0 Neural Network Features
We also computed MI and EAR quantities on the SPINE 1.0 corpus using neural network-based features [97] (these are features where a three-layer multi-layer perceptron is trained as a phone classifier on a 9-frame window of speech features, and then the output of the network, before the non-linearity, is used as features). The two plots are shown in Figure 38 and Figure 39. In the MI case (Figure 38), we can see that most of the information about the features lies at 10 ms into the past (i.e., the previous frame). This is probably a result of the fact that the features are meant to predict phonetic category, and there is significant correlation between successive categorical phonetic classes (i.e., if a phone occurs at a frame, it is more likely than not to occur at the previous and next frame). Other than that, we can see that some of the features seemed to have more temporal correlation than others, possibly a result of the phonetic category of the feature (i.e., we would expect vowels to extend over a wider time region). Unfortunately, the original phonetic labels of the features were not available to us at the workshop, so we could not verify this hypothesis. Lastly, on the right-most plot, we see that the features seem to be most correlated with themselves. In the EAR plot case (Figure 39), the biggest difference seems to be that it is no longer the case that the previous frame (10 ms into the past) provides the most useful information, which is not surprising. The magnitudes of EAR for these plots also seemed encouraging. Unfortunately, time limitations prevented us from learning discriminative structure on these features.

Figure 38: Average I(X; Z|Q) for SPINE-1 neural network features.

Figure 39: Average I(X; Z|Q) − I(X; Z) (the EAR measure) for SPINE-1 neural network features.

9 Visualization of Dependency Selection


Figure 40: Dlink plots for the Aurora 2.0 corpus, MFCCs, where Q corresponds to whole-word HMM states. Only one parent per child is shown. 10ms time frames.

For many of the experiments that were performed during the workshop, we used the EAR measure (see Section 7, and visualized in Section 8) to form structurally discriminative models (Section 4). Some of the structures that were formed were ultimately used in WER experiments (to be reported in Section 10), but because of time constraints some of the structures were not used. In this section, we present some of the structures that were formed, and in doing so describe the format used to visualize these structures. Some of these figures will be described further in later sections, when the word errors of the resulting models are discussed. In general, the edge augmentation heuristics in Section 7.6 meant that the goal was to select the most discriminative relationships (edges) between past "parent" variables and the present "child" variables. A goal was to visualize which relationships were strongest, so we created so-called "d-link" (dependency link) plots. All of the d-link plots can be interpreted as follows: The horizontal axis indicates time, where the right-most position is time t, and moving to the left moves to position t + τ, further into the past. The axis is labeled with time frames, and each frame uses a 10 ms step; therefore, the axis is in units of 10 ms time chunks. The vertical axis indicates feature position; the meaning of each position depends on the features. Since most of the plots we show here are for MFCCs, the meanings (i.e., the relative MFCC feature order) are the same as those described in Section 8. For each child feature in the current time slice (time t, the right-most column of features), either the one or two most informative parents are shown. This means that either the top one or the top two parents are chosen according to the EAR measure, using the heuristic described in Section 7.6.3. The parents are shown by a colored rectangle at location t+τ and position j in the plot. In general, a darker parent rectangle indicates that a parent has a strong relationship with more than one child (yellow=1 child, green=2, blue=3, and black=4). Also, the thickness of the line between parent and child indicates the magnitude of the EAR measure for that parent-child pair. We now describe the main features of these plots. Figure 40 shows the Aurora 2.0 whole-word state MFCC d-link plot, where only the top parent according to the EAR measure for each child is displayed. There are a number of interesting features of this plot. First, the small number of purely horizontal edges suggests that child variables infrequently asked for lagged, past versions of themselves as parents. This is in contrast to the case if pure MI was used, where it would often be the case that a child would use as a parent the corresponding feature at, say, one time frame past. The cepstral feature c0 and


Figure 41: Dlink plots for the Aurora 2.0 corpus, MFCCs, where Q corresponds to whole-word HMM states. Two parents per child shown. 10ms time frames.

log-energy feature were two of the exceptions. Second, features c1 through c12 (features 0-11) very often took the second derivative of c1 through c12 (features 14-25) as parents, and vice versa. There was not a similar relationship to the first derivative of c1 through c12, nor was there a similar relationship between log-energy and either the first or second derivative of log-energy. This seems to indicate that 2nd-derivative features from the past are discriminatively informative about features at time t. Note also that the delta features rarely if ever used the non-delta features. As argued in Section 4, such edges would be among the least discriminative, and could potentially hurt performance. Third, the time lags between parents and children were somewhat surprising. There were more parents at a lag of τ = 3 than at shorter time lags, and there were more parents around a lag of τ = 9 to 12 than at lag times between 3 and 9. Finally, at long time lags (say over 100 ms), only the parent features c1 through c12 and log-energy were informative; derivative features were not. It appears therefore that overall long-term (+100 ms) contours usefully contribute to the discriminability of the class. Perhaps interestingly, these are typical lengths of syllables rather than phones. In general, we also created both one- and two-parent d-link plots, but we found that the number of parents did not have a strong effect on the underlying word-error results. Figure 41 shows the d-link plot in the case of Aurora 2.0 whole-word state MFCCs, where the top two parents according to the EAR measure for each child are displayed. Perhaps the main feature of this plot, relative to Figure 40, is that c0 and log-energy children continue to desire parents at long time scales. In this case, the parent goes back 150 ms! Figure 42 shows the d-link plots in the case where Q corresponds to a phoneme on Aurora 2.0, and again two parents per child are shown. For Aurora phone state MFCCs, the d-link plots were largely similar to the whole-word state plots. One of the few noticeable differences is that the smattering of features c1-c12 (features 0-11) at time lags from -9 to -13 found with whole-word state MFCCs was absent with phone state MFCCs. Figure 43 shows the same, but where Q is the phone-state (portions of a phoneme). Figure 44 shows the d-link plot when Q corresponds to entire Aurora 2.0 vocabulary words at one parent per child, and Figure 45 is the same with two parents per child. All of these plots show similar phenomena, namely those described above, and a continued desire for cepstral c0 and log-energy to have long-range parents. Figure 46 shows the d-link plots for the combined feature set of the Aurora 2.0 MFCCs (features 0-41) and the Phonetact AM and then FM features (range 42-90), in that order. The MFCC features have colored rectangles at time t indicating that they have child variables. The Phonetact features were used as purely conditional random variables


Figure 42: Dlink plots for the Aurora 2.0 corpus, MFCCs, where Q corresponds to individual phones. Two parents per child shown. 10ms time frames.

(i.e., they were only added as members of W, as described in Section 7). It is interesting to note that the Phonetact features appear to be more discriminative when used in this context than the MFCC features themselves were. Features 60 and 61 were particularly informative, with parents at a range of time lags. More details are provided in the next section.
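Constructing a d-link plot from the per-pair EAR values amounts to picking, for each child position, the one or two (parent position, lag) cells with the largest EAR value; the rest is drawing. A sketch under that assumption (our own code, operating on placeholder data):

import numpy as np

def select_dlinks(ear_volume, parents_per_child=1):
    """ear_volume[i, j, tau]: EAR value for child i and candidate parent (j, tau).
    Returns, for each child i, its top (parent position, lag index) cells."""
    links = []
    for i in range(ear_volume.shape[0]):
        flat = ear_volume[i].ravel()
        top = np.argsort(flat)[::-1][:parents_per_child]
        links.append([np.unravel_index(t, ear_volume[i].shape) for t in top])
    return links

ear_volume = np.random.rand(42, 42, 15)
print(select_dlinks(ear_volume, 2)[0])   # top-2 (j, tau) parents for child feature 0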

10 Corpora description and word error rate (WER) results
10.1 Experimental Results on Aurora 2.0
The experimental results described in this section focus on the Aurora 2.0 continuous digit recognition task [56]. The Aurora database consists of TIDigits data which has been additionally passed through telephone channel filters and subjected to a variety of additive noises. There are eight different noise types ranging from restaurant to train-station noise, and SNRs from -5dB to 20dB. For training, we used the "multi-condition" set of 8440 utterances that reflect the variety of noise conditions. We present aggregate results for test sets A, B, and C, which total about 70,000 test sentences [56]. We processed the Aurora data in two significantly different ways. In the first, we used the standard front-end provided with the database to produce MFCCs, including log-energy and c0. We then appended delta and double-delta features and performed cepstral mean subtraction, to form a 42-dimensional feature vector. In the second approach, we computed AM (Amplitude Modulation) and FM (Frequency Modulation) features as described in Section 8.3.
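As a minimal sketch of how such 42-dimensional vectors are assembled (our own illustration, not the Aurora front-end; the delta-regression width, the ordering of mean subtraction relative to the deltas, and the inclusion of energy in the mean subtraction are assumptions here):

import numpy as np

def deltas(feats, width=2):
    """Standard regression-based delta features for a (T, D) feature matrix."""
    T, _ = feats.shape
    padded = np.pad(feats, ((width, width), (0, 0)), mode="edge")
    num = sum(n * (padded[width + n:width + n + T] - padded[width - n:width - n + T])
              for n in range(1, width + 1))
    return num / (2 * sum(n * n for n in range(1, width + 1)))

def assemble_42dim(mfcc14):
    """mfcc14: (T, 14) static features (c1-c12, c0, log-energy).  Returns a
    (T, 42) matrix with deltas and double-deltas appended and utterance-level
    mean subtraction applied."""
    d = deltas(mfcc14)
    dd = deltas(d)
    full = np.hstack([mfcc14, d, dd])
    return full - full.mean(axis=0, keepdims=True)

print(assemble_42dim(np.random.randn(300, 14)).shape)   # (300, 42)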

10.1.1 Baseline GMTK vs. HTK result
To validate our structure-learning methods, we built baseline systems (with GMTK emulating an HMM), and then enhanced them with the discriminative structures shown above. The first set of experiments therefore consisted only of baseline numbers. In particular, since GMTK was entirely new software, we wanted to ensure that the baselines we obtained with GMTK matched the standard HTK Aurora 2.0 baseline provided with the Aurora distribution. This is shown in Figure 48. The figure shows word accuracies for several signal-to-noise ratios, and several different baseline systems: HTK with whole-word models as specified in [56] (each of the 11 vocabulary words uses 16 HMM states with no parameter tying between states, as in the Aurora 2.0 release). Additionally, silence and short-pause


Figure 43: Dlink plots for the Aurora 2.0 corpus, MFCCs, where Q corresponds to individual phone HMM states. Two parents per child shown. 10ms time frames.

models were used, with three silence states and the middle state tied to short-pause. All models were strictly left-to-right, and used 4 Gaussians per state for a total of 715 Gaussians. The GMTK baseline numbers simulated an HMM and either 1) used whole-word models, thereby mimicking the Aurora 2.0 HTK baseline, or 2) used tied mono-phone state models. We examine and compare these results in Figure 49 (numbers also shown in Table 1), which shows the improvements relative to the HTK baseline. These results show the specific absolute recognition rates for our GMTK baseline systems as a function of SNR, averaged across all test conditions. Also presented is the published baseline result [56] with a system that had somewhat fewer (546) Gaussians (the GMTK system used 4 Gaussians per state rather than 3 because we in this case set up the Gaussian splitting procedure to double the number of Gaussians after each split; in newer experiments, it was found that GMTK was slightly better than HTK with the exact same model structure [20], possibly because of GMTK's new Gaussian handling code). As can be seen, overall the results are comparable with each other. We also see that, in general, the word-state model seems to outperform the phone-state model (which is not unexpected since in small-vocabulary tasks, each word having its own entire model is often useful).

SNR (dB):   clean   20     15     10     5      0      -5
GMTK        99.2    98.5   97.8   96.0   89.2   66.4   21.5
PH          99.1    98.3   97.2   94.9   86.4   54.9   2.80
HP          98.5    97.3   96.2   93.6   85.0   57.6   24.0

Table 1: Word recognition rates of our baseline GMTK system as a function of SNR. PH are the GMTK phone models. HP is reproduced from [56].

10.1.2 A simple GMTK Aurora 2.0 noise clustering experiment
Another set of experiments that were run regarded simple noise clustering in the Aurora 2.0 database. Essentially, the structure in Figure 11 is augmented to include a single hidden noise variable per frame, as shown in Figure 50. The noise variable is observed during training (and indicates the noise type of the training data), and is hidden during testing, hopefully allowing the underlying noise-specific model to be best used in the right context. Figure 51 shows the results as "improvements" relative to the GMTK phone-based baseline and for different SNR


levels. Unfortunately, we do not see any systematic improvement in the results, and if anything the results get worse as the SNR increases. There are several possible reasons for this: first, the number of Gaussians in the noise-clustered model had increased relative to the baseline, and therefore each had less training data. Second, it was still the case that all noise conditions were integrated against during decoding. Lastly, this structure was obtained by hand and not discriminatively. Perhaps the data was indeed modeled better, but the distinguishing features of the words were not, as argued in Section 4.

Figure 44: Dlink plots for the Aurora 2.0 corpus, MFCCs, where Q corresponds to entire words. One parent per child shown. 10ms time frames.

10.1.3 Aurora 2.0 different features
Yet another set of GMTK baseline experiments we performed regarded the relative performance of different feature sets on Aurora 2.0, and the results are shown in Figure 52. The figure compares a system that uses MFCCs with a system that uses MFCCs augmented with other features (various combinations of raw AM and raw FM features). The AM/FM features were raw in that there was no normalization, smoothing (subtraction), discrete cosine transform, or any other post-processing that often shows improved results with Gaussian-mixture HMM-based ASR systems. Our goal was to keep the AM/FM features as unprocessed as possible, in the hope that the discriminative structure learning would find an appropriate structure over those features, rather than having the features (via post-processing) conform to that which a standard HMM finds most useful.

10.1.4 Mutual Information Measures
The next step of our analysis was a computation of the discriminative mutual information (i.e., the EAR measure) between all possible pairs of conditioning variables, as described in Section 8. Although we could compute this for hidden variables as well as observations (see also Section 12.2 on the beginnings of a new mutual-information toolkit which would solve this problem), for expediency and simplicity we focused on conditioning between observation components alone. Thus, the structures we present later are essentially expanded views of conditioning relationships among the individual entries of the acoustic feature vectors.


Figure 45: Dlink plots for the Aurora 2.0 corpus, MFCCs, where Q corresponds to entire words. Two parents per child shown. 10ms time frames.

10.1.5 Induced Structures
Using the method of Section 7, we induced conditioning relationships using both MFCCs and AM-FM features. In Figure 44, we show the induced structure for an MFCC system based on whole-word models, and using Q-values corresponding to words in the EAR measure. As expected, there is conditioning between C0 and its value more than 100 ms previously. In a second set of experiments, we used the AM-FM features as possible conditioning parents for the MFCCs; the induced conditioning relationships are shown in Figure 46. The first 42 features are the MFCCs; these are followed by AM features, and finally the FM features. This graph indicates that FM features provide significant discriminative information about the MFCCs.

10.1.6 Improved Word Error Rate Results

Table 2 presents the relative improvement in word error rate for several structure-induced systems. There are several things to note. The first is that significant improvements were obtained in all cases. The second is that structure induction successfully identified the synergistic information present in the AM-FM features, resulting in a significant improvement over raw MFCCs. The final point is that when we increased the size of a conventional system to the same number of parameters, its performance in high-noise conditions was much worse than that of the improved models. Thus, structure induction appears to improve performance in a robust way. These results are further summarized in Figure 53.

10.2 Experimental Results on IBM Audio-Visual (AV) Database

The IBM Audio-Visual database was collected at the IBM Thomas J. Watson Research Center before the CLSP summer workshop in 2000. The database consists of full-face frontal video and audio of 290 subjects uttering ViaVoice(TM) training scripts, i.e., continuous read speech with mostly verbalized punctuation and a vocabulary size of approximately 10,500 words. Transcriptions of all 24,325 database utterances, as well as a pronunciation dictionary, are provided. Details about this database can be found in [37].

[Figure 46 plot: "Aurora Whole-Word State Phonetact/MFCCs, Two Parents per Child"; feature index (0-80) versus time lag (-15 to 0 frames).]

Figure 46: Dlink plots for the Aurora 2.0 corpus, Phonetact features, where Q corresponds to whole-word states. Two parents per child shown. 10ms time frames.

        clean    20      15      10      5       0       -5
WWS     16.3     19.3    14.2    10.5    9.85    19.0    12.6
AMFM    10.4     9.73    6.91    4.29    7.05    17.4    15.5
WW      7.16     7.02    5.51    5.93    5.05    16.0    15.0
EP      18.9     6.56    14.7    10.7    7.16    5.09    1.20

Table 2: Percent word-error-rate improvement for structure-induced systems. WWS is a system where Q ranges over states; AMFM conditions MFCCs on AM-FM features; in WW, Q ranges over words; and EP is a straight Gaussian system with twice as many Gaussians as the baseline. For the WW and WWS systems, one parent per feature was used; in the AMFM case, two parents. EP has the same number of parameters as WW and WWS.

10.2.1 Experiment Framework

The audio-visual database has been partitioned into a number of disjoint sets in order to train and evaluate models for audio-visual ASR. The training set contains 35 hours of data from 239 subjects and is used to train all HMMs. A held-out set of close to 5 hours of data from 25 subjects is used to train HMM parameters relevant to audio-visual decision fusion. A test set of 2.5 hours of data from 26 subjects is used to test the trained models. There are also disjoint sets for speaker adaptation and multi-speaker HMM refinement experiments, but since we did not use those sets in our experiments due to time constraints, we refer readers to [37] for details.

Sixty-dimensional acoustic feature vectors are extracted from the audio data at a rate of 100 Hz. These features are obtained by a linear discriminant analysis (LDA) projection applied to a concatenation of nine consecutive feature frames, each consisting of a 24-dimensional discrete cosine transform (DCT) of mel-scale filter bank energies. LDA is followed by a maximum likelihood linear transform (MLLT) based data rotation. Cepstral mean subtraction (CMS) and energy normalization are applied to the DCT features at the utterance level, prior to the LDA/MLLT feature projection. In addition to the audio features, visual features are also extracted from the visual data. The visual features consist of a discrete cosine image transform of the subject's mouth region, followed by an LDA projection and MLLT feature rotation. They have been provided by the IBM participants for the entire database, are of dimension 41, and are synchronous with the audio features at a rate of 100 Hz.

[Figure 47 plot: "Spine, Linear MI, Hynek, One Parent per Child"; feature index (0-55) versus time lag (-10 to 0 frames).]

Figure 47: Dlink plots for the SPINE-1 corpus, neural-network features, where Q corresponds to HMM states. One parent per child shown. 10ms time frames.

10.2.2 Baseline

In the baseline audio ASR system, only the audio features are used for training. Context-dependent phoneme models are used as speech units, and they are modeled with HMMs having Gaussian-mixture class-conditional observation probabilities. These are trained by maximum likelihood estimation using embedded training by means of the EM algorithm. The baseline system was developed using the HTK toolkit, version 2.2, and the training procedure is similar to the one described in the HTK reference manual. All HMMs had 3 states except the short pause model, which had only one state. A set of 41 phonemes was used.

The first training step initializes the monophone models with single Gaussian densities. All means and variances are set to the global means and variances of the training data. Monophones are trained by embedded re-estimation using the first pronunciation variant in the pronunciation dictionary. A short pause model is subsequently added and tied to the center state of the silence model, followed by another 2 iterations of embedded re-estimation. Forced alignment is then performed to find the optimal pronunciation in the case of multiple pronunciation variants in the dictionary. The resulting transcriptions are used for a further 2 iterations of embedded re-estimation, which ends the training of the monophone models. Context-dependent phone models are obtained by first cloning the monophone models into context-dependent phone models, followed by 2 training iterations using tri-phone based transcriptions. Decision-tree based clustering is then performed to cluster phonemes with similar context and to obtain a smaller set of context-dependent phonemes. This is followed by 2 training iterations. Finally, Gaussian mixture models are obtained by iteratively splitting the mixtures to 2, 4, 8, and 12 components, and by performing two training iterations after each splitting. The resulting baseline audio-only system performance, obtained by rescoring lattices (i.e., trigram lattices provided by IBM with the log-likelihood value of the trigram language model on each lattice arc), is a word error rate (WER) of 14.44%.

10.2.3 IBM AV Experiments in WS'01

In our work during the 2001 summer workshop, we used the Graphical Models Toolkit (GMTK) ([9] and see Section 6). As described above, GMTK has the ability to rapidly and easily express a wide variety of models and to use them in as efficient a way as possible for a given model structure. Because of the rich features in the IBM AV database (both audio and video), we think GMTK is an ideal toolkit with which to study new model structures (rather than just an HMM) for this task.

[Figure 48 chart: word accuracy versus SNR (15, 10, 5, 0, -5 dB); series: GMTK WORD-STATE (715 Gaussians), GMTK PHONE-STATE (710 Gaussians), HTK (546 Gaussians).]

Figure 48: Baseline accuracy results on the Aurora 2.0 corpus with MFCC features. This plot compares the word accuracies of HTK (green) with those of GMTK (blue and red) at various signal-to-noise ratios (SNRs). GMTK results are provided both for whole-word models, shown in blue (the same models that were used for HTK), and for tied-phone models (shown in red).

10.2.4 Experiment Framework

During the summer workshop in 2000 [37], the use of an orthogonal source, namely visual features, was investigated under different conditions. However, that study remained within the commonly used HMM framework, despite the fact that visual features are very different from audio features. In order to study alternative ways of using visual features, we decided to use the same training/heldout/test data split as the previous workshop (described in the previous section). The same features (60-dimensional audio features, 41-dimensional visual features) were also used in our experiments.5 Therefore, the HMM audio-only system described in the previous section also serves as our baseline.

The decoder in GMTK requires a decoding structure in order to perform Viterbi decoding. While the lattices from the 2000 workshop could be converted to the required decoding structure, because of the 6-week time constraints we followed an n-best rescoring decoding strategy. Namely, using the lattices generated from a first-pass decoding, we first generate n-best lists off line. Subsequently, we rescore the n-best lists using various decoding structures with GMTK. We generated the top 100 hypotheses for each heldout and test utterance from the corresponding lattice.
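The rescoring step itself reduces to reranking hypotheses by a combined score. The sketch below is purely illustrative (it is not the workshop scripts, and the language-model scale factor lm_weight is an assumed tuning parameter): it combines a GMTK acoustic log-likelihood with the language-model score carried over from the first pass.

def rescore_nbest(hypotheses, acoustic_score, lm_weight=10.0):
    """Rerank an n-best list.

    hypotheses: list of (word_sequence, lm_logprob) pairs taken from the
                first-pass lattice / n-best generation.
    acoustic_score: function mapping a word sequence to the GMTK
                    log-likelihood of the observations given that sequence.
    Returns the word sequences sorted from best to worst combined score.
    """
    scored = []
    for words, lm_logprob in hypotheses:
        total = acoustic_score(words) + lm_weight * lm_logprob
        scored.append((total, words))
    scored.sort(reverse=True)
    return [words for _, words in scored]

# Toy usage with a fake acoustic scorer that prefers shorter hypotheses.
nbest = [(("the", "cat"), -12.0), (("the", "cat", "sat"), -10.5)]
print(rescore_nbest(nbest, acoustic_score=lambda w: -5.0 * len(w)))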

10.2.5 Matching Baseline

In order to test GMTK, we first tried to match the HMM baseline. We built a graphical-model-based system simulating an HMM as our training structure. All parameters, including the number of states, the number of Gaussian densities per state, the mixture weights, the means and variances, and the transition probabilities, were taken from the HTK (baseline) model. This model had 10,738 states with 12 mixture components per state. This model was then used to perform n-best rescoring on the test data. Because the short pause model in the HTK system has only one state, which is tied to the middle state of the silence model (the transition probabilities are not tied), we used state sequences as input for the n-best rescoring.

5 However, we ran out of time at the end of the workshop before we could integrate the visual features.

[Figure 49 chart: relative WER improvement versus SNR (clean, 20, 15, 10, 5, 0, -5 dB); series: GMTK WORD-STATE (715 Gaussians), GMTK PHONE-STATE (710 Gaussians), HTK (546 Gaussians).]

Figure 49: Baseline accuracy results on the Aurora 2.0 corpus with MFCC features. This is essentially the same plot as Figure 48, but shows relative improvements rather than absolute word accuracy.

The transition probabilities of the short pause state were replaced by the transition probabilities of the middle state of the silence model. The resulting WER (14.5%) showed that we can simulate an HMM with GMTK and achieve similar results.

10.2.6 GMTK Simulating an HMM

Along with the audio and visual features, we also have a pronunciation lexicon and a word-to-HMM-state-sequence mapping, both from IBM. The HMM states are tied states from a decision-tree clustering procedure. This decision tree is built for context-dependent phoneme models, and 4 neighboring phones are used as context. There are in total 2,808 different states for the 13,302 words in the vocabulary. We trained HMM models based on the word-to-state-sequence mapping and the training data. At first, 5 iterations of EM training were carried out assuming a single Gaussian density for each state. Then each Gaussian was split into two, followed by another 5 iterations of EM training. The splitting continued until we had 16 Gaussian densities in each state. Finally, Gaussian densities with mixture weight close to zero were merged with the closest densities during another 5 iterations of EM training. The resulting HMM models had 2,808 states, each with about 16 Gaussian densities. The total number of Gaussian densities was 45,016, which is much smaller than in the HTK baseline (128,856 Gaussian densities). On the test data, this system gave a WER of 17.9%. We believe the difference between this result and the baseline was due to the smaller number of Gaussian densities. However, due to the time constraints of the workshop, we did not carry out experiments to match the baseline.
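The split-and-merge schedule described above can be sketched as follows. This is an illustrative re-implementation in Python, not the actual training code: each Gaussian is cloned by perturbing its mean by a fraction of its standard deviation (in the spirit of the meanCloneSTDfrac parameter that appears in the training header files of Section 12.1.3), and components whose weights fall near zero are folded into their nearest surviving neighbor.

import numpy as np

def split_mixture(weights, means, variances, frac=0.25):
    """Double the number of components: clone each diagonal Gaussian,
    perturbing its mean by +/- frac * std and halving its weight."""
    std = np.sqrt(variances)
    new_means = np.concatenate([means - frac * std, means + frac * std])
    new_vars = np.concatenate([variances, variances])
    new_weights = np.concatenate([weights / 2.0, weights / 2.0])
    return new_weights, new_means, new_vars

def merge_small(weights, means, variances, min_weight=1e-3):
    """Fold components with negligible weight into the closest surviving
    component (closest in Euclidean distance between means)."""
    keep = weights >= min_weight
    for i in np.where(~keep)[0]:
        dists = np.linalg.norm(means[keep] - means[i], axis=-1)
        j = np.where(keep)[0][np.argmin(dists)]
        weights[j] += weights[i]
    return weights[keep], means[keep], variances[keep]

# Toy usage: a two-component, two-dimensional diagonal-covariance mixture.
w = np.array([0.6, 0.4])
m = np.array([[0.0, 0.0], [1.0, 1.0]])
v = np.array([[1.0, 1.0], [2.0, 2.0]])
print(split_mixture(w, m, v)[0])    # -> [0.3 0.2 0.3 0.2]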

10.2.7 EAR Measure of Audio Features

The HMM models with 2,808 states and 16 Gaussian mixture components per state were used to calculate the discriminative mutual information, or the EAR (explaining-away residual) measure [6], between all possible pairs of conditioning variables. For expediency and simplicity, only dependencies between observation components (feature components) were computed in our experiments.

[Figure 50 diagram: per-frame variables Position, Transition, Phone, Noise Condition, and Observations.]

Figure 50: A simple noise model with a single noise-clustering hidden variable at each time step.

As described in Section 8, there are three dimensions in visualizing the scalar version of the EAR measure. From Figure 35, we can see that there is hardly any information between any two of the audio feature components we studied. Recalling how the audio features were extracted, a possible reason arises: LDA, as a linear discriminative projection, might be removing the discriminative information between feature components. As a result, if we look at the audio features alone, it is almost impossible to find any useful correlation between two feature components, even though the MI calculations should include a non-linear component of the discriminative information. It appears, however, that in this case the linear discriminant transformation was sufficient to remove this. This hypothesis, of course, should be more thoroughly verified, i.e., whether an LDA transformation on multiple windows (9 in this case) of feature vectors removes the potential for discriminative cross-observation linear dependencies.

The visual features, on the other hand, were extracted and processed independently of the audio features. There should be more information between a visual feature component and an audio feature component. Therefore, a plot of the EAR measure including the visual features is very desirable. Unfortunately, we could not compute such an EAR measure before the end of the workshop, due to the high computational demands of our Aurora experiments. However, because of the flexibility of GMTK, we believe that once the computation is done we can quickly induce interesting structures and experiment with the new discriminative models.

10.3 Experimental Results on SPINE-1: Hidden Noise Variables

We ran several additional experiments on the SPINE-1 speech-in-noisy-environments database using a noise-clustering model as described in Figure 50. Again, the noise variable was observed during training and hidden during recognition, and the various types of SPINE noises were clustered together at different degrees of granularity. We generated an n-best list with a separate HTK-based ASR system and rescored it using the trained GMTK-based models. The results are shown in Figure 54. As can be seen, the WER gets worse as the number of noise clusters increases. Given this result, and the noise-variable results above for Aurora 2.0, it appears that simply adding a noise variable to a graphical model, without concentrating on discriminability, is not guaranteed to help (and is likely to hurt) performance. Due to time constraints, in the 6 weeks of the workshop we did not attempt any experiments that tried to optimize a discriminative structure on SPINE-1, as would be suggested by Figure 47.

11 Articulatory Modeling with GMTK

Another goal of the graphical modeling team was to apply GMTK to articulatory modeling, i.e., the modeling of the motion of the speech articulators, either in addition to or instead of phones. The ultimate goal was to use the articulatory structure as a starting point from which to produce a more discriminative structure. Due to time limitations, however, during the workshop we focused only on the base articulatory structure. We use the term articulatory modeling to mean either the explicit modeling of particular articulators (tongue tip, lips, jaw, etc.) or the indirect modeling of articulators through variables such as manner or place of articulation. The motivations for using such a model come from both linguistic considerations and experiments in speech technology.

[Figure 51 chart: relative WER improvement versus SNR (clean, 20, 15, 10, 5, 0, -5 dB); series: Phone Noise-Clustered, Baseline Phone.]

Figure 51: Baseline phone-based GMTK numbers compared to a simple model that uses noise clustering (i.e., includes a hidden variable for different types of noise, as seen in Figure 50).

On the linguistics side, current theories of phonology, referred to as autosegmental phonology [47], hold that speech is produced not from a single stream of phones but from multiple streams, or tiers, of linguistic features. These features (not to be confused with the same term in pattern recognition, or with acoustic features in an ASR system) can evolve asynchronously and do not necessarily line up to form phonetic segments. Autosegmental phonology does not provide a specific set of tiers, but various authors have posited that the set includes features of tone, duration, and articulation.

On the speech technology side, there is mounting evidence that a phone-based model of speech is inadequate for recognition, especially for spontaneous, conversational speech [83]. Researchers have noted, for example, the difficulty of phonetically transcribing conversational speech, and especially of locating the boundaries between phones [39]. Furthermore, while it has been observed that pronunciation variability accounts for a large part of the performance degradation on conversational speech [50, 74, 75], efforts to model this variability with phone-based rules or additional pronunciations have had very limited success [74, 91]. One possible reason for this is that phonemes affected by pronunciation rules often assume a surface form intermediate between the underlying phonemes and the surface phones predicted by the rules [93]. Such intermediate forms may be better represented as changes in one or more of the articulatory features.

There have been several previous efforts to use articulatory models for speech recognition. One such effort has been by Deng and his colleagues (e.g., [31, 32]). In their experiments, Deng et al. use HMMs in which each state corresponds to a different vector of articulatory feature values. Their experiments explore different ways of constructing this state space using linguistic and physical constraints on articulator evolution, as well as different ways of modeling the HMM observation distributions. A similar model was used by Richardson et al. [88, 89]. Kirchhoff [65] used neural networks to extract articulatory features and then mapped these values to words, and in at least one case [64] allowed the articulators to remain unsynchronized except at syllable boundaries. There have been other attempts to use such features at the lower levels of recognition, typically by first extracting articulatory feature values using neural networks or statistical models, and then using these values instead of or in addition to the acoustic observations for phonetic classification [35, 80].

[Figure 52 chart: word accuracy versus SNR (clean, 20, 15, 10, 5, 0, -5 dB); series: MFCC, MFCC_F, MFCC_A-F, RAW_FM, RAW_AM, RAW_AM_FM, RAW_PT.]

Figure 52: Absolute results on Aurora 2.0 at different SNRs using GMTK with different feature sets consisting of either just MFCCs (the blue plot) or feature vectors consisting of MFCCs augmented with other features (so larger feature vectors). MFCC = just MFCCs, MFCC F = MFCCs + FM features, MFCC A-F = MFCCs + AM/FM features, RAW FM = just raw un-preprocessed FM features (i.e., no deltas, double deltas, cosine transform, smoothing, mean normalization, etc.), RAW AM = just raw un-preprocessed AM features, RAW AM FM = raw AM and FM features together.

One difficulty in using articulatory models for speech recognition is that the most commonly used computational structures (hidden Markov models, finite-state transducers) allow for only one hidden state variable at a time, whereas articulatory models involve several variables, one for each articulatory feature. While it is possible to implement articulatory models with single-state architectures by encoding every combination of values as a separate state (as in [31, 88, 89]), the mapping from the inherent multi-variable structure to the single-variable encoding can be cumbersome, and the resulting model can be difficult to interpret and manipulate.

Graphical models (GMs) are therefore a natural tool for investigating articulatory modeling. Since they allow for an arbitrary number of variables and arbitrary dependencies between them, the specification of articulatory models in terms of GMs is fairly direct. The resulting models are easy to interpret and modify, allowing for much more rapid prototyping of different model variants. The next section gives some background on articulatory modeling and our articulatory feature set. We then describe the progress made during the workshop, including the construction of a simple articulatory graphical model.

11.1 Articulatory Models of Speech

Figure 55 shows a diagram of the vocal tract with the major articulators labeled. The ones we are most concerned with are the glottis, velum, tongue, and lips. The glottis is the opening between the vocal folds (or vocal cords), which may vibrate to create a voiced (or pitched) sound or remain spread apart to create a voiceless sound. The position of the velum determines how much of the airflow goes to the oral cavity and how much to the nasal cavity; if it is lowered so that air flows into the nasal cavity, a nasal sound (such as [m] or [n]) is produced. The positions of the tongue and lips affect the shape of the oral cavity and therefore the spectral envelope of the output sound. Articulatory features keep track of the positions of the articulators over time.

[Figure 53 chart: relative WER improvement versus SNR (clean, 20, 15, 10, 5, 0, -5 dB); series: WWS_DQ_1MAX, WW_Phonetact_DQ_2MAX, WW_DQ_1MAX, WWS_DQ_VAN_1MAX, GMTK Word Baseline (2x), GMTK Word Baseline.]

Figure 53: Summary of the relative improvements on the Aurora 2.0 corpus. WWS DQ 1MAX = Aurora 2.0 with whole-word states (i.e., Q = whole-word states), using the edges selected by taking the single best according to the EAR measure. WW Phonetact DQ 2MAX = results where the AM/FM features are conditional random variables and the top two edges according to the EAR measure are chosen. WW DQ 1MAX = whole-word models (Q in the EAR measure ranges over words), choosing the single best edge. WWS DQ VAN 1MAX = results similar to the WWS DQ 1MAX case, except that the GMTK vanishing ratio was set so that the total number of final parameters was the same as in the HMM baseline. GMTK Word Baseline (2x) is the GMTK HMM baseline with twice the number of Gaussians (this model had 4/3 times the number of parameters of the augmented-structure models). Finally, GMTK Word Baseline is the standard GMTK-based HMM baseline.

11.2 The Articulatory Feature Set

We can imagine a large variety of feature sets that could be used to represent the evolution of the vocal tract during speech. For example, in [22], Chomsky and Halle advocate a system of binary features. Some of these features are more physically based, such as voiced, and some are more abstract, such as tense. Other speech scientists, such as Goldstein and Browman [15], advocate more physically based, continuous-valued features such as lip constriction location/degree and velum constriction degree. Unfortunately, we are not aware of any well-established, complete feature set in the speech science literature that is well suited to our task. As has been done in previous work on articulatory-based recognition [65, 31], we have drawn on existing feature sets to construct one that is appropriate for our purposes. The set we have used to date is shown in Table 3. It was based on considerations such as state space size and coverage of the phone set used in our Aurora experiments. It does have some drawbacks, however, such as the lack of a representation of the relationship between certain vowel features and certain consonant features; for example, in order to model the fact that an alveolar consonant can cause the fronting of adjacent vowels, we would need to include a dependency between place and tongueBodyLowHigh.

[Figure 54 chart: WER (roughly 34% to 38.5%) versus number of noise clusters (0, 3, 6).]

Figure 54: Results on SPINE 1.0 with various degrees of noise-type clusters. As the number of clusters increases, the error gets worse.

Feature name          Allowed values                              Comments
voicing               off, on                                     "off" refers to voiceless sounds, "on" to voiced sounds
velum                 closed, open                                "closed" refers to nasal sounds, "open" to non-nasal sounds
manner                closure, sonorant, fricative, burst         "closure" refers to a complete closure of the vocal tract,
                                                                  e.g. the beginning part of a stop; "burst" refers to the
                                                                  turbulent region at the end of a stop.
place                 labial, labio-dental, dental, alveolar,     The location of the oral constriction for consonant sounds;
                      post-alveolar, velar, nil                   "nil" is used for vowels
retroflex             off, on                                     "on" refers to retroflexed sounds
tongueBodyLowHigh     low, mid-low, mid-high, high, nil           Height of the tongue body for vowels; "nil" is used for consonants.
tongueBodyBackFront   back, mid, front, nil                       Horizontal location of the tongue body for vowels; "nil" is used for consonants.
rounding              off, on                                     "on" refers to rounded sounds

Table 3: The feature set used in our experiments.
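As a rough indication of scale (a back-of-the-envelope count of our own, not a number from the workshop), encoding every combination of the values in Table 3 as a single hidden state, as a conventional single-variable HMM encoding would require, gives

2 (voicing) x 2 (velum) x 4 (manner) x 7 (place) x 2 (retroflex) x 5 (tongueBodyLowHigh) x 4 (tongueBodyBackFront) x 2 (rounding) = 8960

possible value combinations per frame before any linguistic constraints are applied, which illustrates why the factored, multi-variable representation discussed above is more convenient to specify and manipulate than a single-variable encoding.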

11.3 Representing Speech with Articulatory Features

The distinctions between describing speech in terms of phones and describing it in terms of articulatory feature streams can be seen through some examples.

First consider the case of the word warmth. The two ways of representing the canonical pronunciation of this word, as a string of phones and as parallel strings of features, are shown in Figure 56. In the articulatory representation, if a speaker goes through the parallel streams synchronously, he will produce the same phone string as in the phone-based representation. If, however, the features are not perfectly synchronized, this may produce some other phones, or some sounds that are not in the standard phone inventory of English at all. The same can occur if the features remain synchronized but do not reach their target values at some points.

To see the possible advantage of the feature-based approach, consider the part of the word where the speaker is transitioning from the [m] to the [th]. The articulators must perform the following tasks: the velum must rise from its position for the nasal [m] to that for the non-nasal [th]; the vocal folds must stop vibrating; and the lips must part and the tongue tip move into an interdental position for the [th]. If all of the articulators move synchronously, then the phones [m] and [th] are produced as expected. However, the articulators may reach their target positions for the [th] at different times. One common occurrence in the production of this word is that the velum may be raised and voicing turned off before the lips part.

Figure 55: A midsagittal section showing the major articulators of the vocal tract. Reproduced from http://www.ling.upenn.edu/courses/Spring 2001/ling001/phonetics.html.

In that case, there is an intermediate period during which a [p]-like sound is produced, and the uttered word then sounds like warmpth. One way to describe this within a phone-based representation is to say that the phone [p] has been inserted. However, this does not express our knowledge of the underlying process that resulted in this sound sequence. Furthermore, a [p] produced in this way is likely to be different from an intentional [p], for example by having a shorter duration.

Another example is the phenomenon of vowel nasalization, which can occur when a nasal sound follows a vowel (as in hand). If the velum is prematurely lowered, then the end of the vowel takes on a nasal quality. A phone-based description of this requires that we posit the existence of a new phone, namely the nasalized vowel. If we wish to express the fact that only the latter part of the vowel is nasalized, we need to represent the vowel segment as a sequence of two phones, a non-nasalized vowel and a nasalized one. A similar example is the early devoicing of a phrase-final voiced consonant, which would again require description in terms of two phones, one for the voiced part and one for the voiceless part.

11.4 Articulatory Graphical Models for Automatic Speech Recognition: Workshop Progress

Figure 57 shows the structure of an articulator-based graphical model developed during the workshop. Each of the articulatory variables can depend on its own value in the previous frame, as well as on the current phone state. The dependency on the previous frame is intended to model continuity and inertia constraints on feature values. For example, a feature is likely to retain the same value over multiple frames rather than jump around, and most features cannot go from one value to a very different one without going through the intermediate value(s) (e.g., tongue height cannot go from "low" to "high" without going through "mid").

UWEETR-2001-0006 60 phones w ao r m th voicing on off velum cl op cl manner son fric place lab nil post-alv lab dent tongueBodyLowHigh high low mid-low nil tongueBodyBackFront back mid nil rounding on off

Figure 56: Two ways of representing the pronunciation of the word warmth. The top line shows the phone-based representation; the remaining lines show the different streams in the feature-based representation, using the features and (abbreviated) feature values defined in Table 3.

We also constructed an alternate version of the model, in which the articulatory variables also depend on each other's values in the current frame; these dependencies are not shown in the figure for clarity of presentation. The switching dependency from the phone state to the observation is used only for special handling of silence: if the current frame is a silence frame (i.e., the phone state variable is in one of the silence states), the observation depends only on the phone; otherwise, the observation depends only on the articulatory variables. This was done in order to avoid assigning specific articulatory values to silence.
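A minimal sketch of this switching behavior, under our own naming assumptions and with a single diagonal Gaussian standing in for each Gaussian mixture (this is not GMTK structure-file syntax), is:

import numpy as np

class DiagGaussian:
    """Stand-in for a Gaussian mixture: a single diagonal Gaussian."""
    def __init__(self, mean, var):
        self.mean, self.var = np.asarray(mean), np.asarray(var)
    def logpdf(self, x):
        x = np.asarray(x)
        return float(-0.5 * np.sum((x - self.mean) ** 2 / self.var
                                   + np.log(2 * np.pi * self.var)))

def observation_logprob(frame, phone_state, artic_values,
                        silence_states, phone_models, artic_models):
    """Switching dependency: silence frames are scored against a per-phone
    model; all other frames are scored against the model indexed by the
    current combination of articulatory values."""
    if phone_state in silence_states:
        return phone_models[phone_state].logpdf(frame)
    return artic_models[tuple(artic_values)].logpdf(frame)

# Toy usage with two-dimensional observations.
phone_models = {"sil": DiagGaussian([0.0, 0.0], [1.0, 1.0])}
artic_models = {("on", "closed"): DiagGaussian([1.0, -1.0], [0.5, 0.5])}
print(observation_logprob([0.2, -0.1], "sil", ("on", "closed"),
                          {"sil"}, phone_models, artic_models))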

[Figure 57 diagram: for frames i-1, i, and i+1, each frame contains word, word-transition, word-position, phone-state, phone-transition, the articulatory variables a1 ... aN, and the observation O.]

Figure 57: An articulatory graphical model. The structure from the phone state and above is identical to the phone- based recognizer. The articulatory variables are denoted a1, . . . , aN , and the observation variable is denoted o. The special dependencies for the last frame are not shown but are identical to those in the phone-based model.

The observation variable's conditional probability is implemented via a Gaussian mixture for each allowed combination of values, plus additional Gaussian mixture models for the silence states. All of the other variables are discrete with discrete parents, so their conditional probabilities are given by (multidimensional) probability tables. The probability tables for the phone variable are constructed identically to those of the 3-state phone-based model we used in our phone-based Aurora experiments. The probability tables of the articulatory variables determine the extent to which they can stray from their canonical values for each phone state and the extent to which they depend on their past values. For example, we can make the articulatory variables depend deterministically on the phone state by constructing their probability tables such that zero probability is given to all values except the canonical value for the current phone state; the model then becomes equivalent to our phone-based recognizer. We in fact ran this experiment as a sanity check.

For the general case of nondeterministic articulatory values, we needed to construct reasonable initial settings for the probability tables of the articulatory variables, while avoiding entering each initial probability value by hand. The following procedure was therefore used. We first defined a table of allowed values for each articulator in each phone state, with a probability for each allowed value, as shown in Table 4. We then defined a table of transition probabilities for each articulator given its previous value, as shown in Table 5. The final probability table for each articulatory variable, representing the probability of each of its possible values given its last value and the current phone, was constructed by multiplying the appropriate entries in the two tables and normalizing to ensure that the probabilities sum to one; a short sketch of this combination step follows Table 5 below.

Phone state   voicing                 velum                        ...
ah0           on (0.9), off (0.1)     closed (0.2), open (0.8)     ...
n0            on (1.0)                closed (1.0)                 ...
...

Table 4: Part of the table mapping phone states to articulatory values. According to this table, the first state of [ah] has a probability of 0.9 of being voiced and a probability of 0.1 of being voiceless, and it has a probability of 0.2 of being nasalized; an [n] must be voiced and nasal; and so on.

Value in previous frame   Pr(value = 0)   Pr(value = 1)
0                         0.8             0.2
1                         0.2             0.8

Table 5: The transition probabilities for the voicing variable. This particular setting says that the voicing variable has a probability of 0.8 of remaining in the same value as in the previous frame and a probability of 0.2 of changing values (regardless of the value in the previous frame).
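The combination step can be sketched in a few lines (our own illustration, using the voicing numbers from Tables 4 and 5 and mapping Table 5's values 0/1 onto the labels "off"/"on"): for each phone state and previous value, the allowed-value probabilities are multiplied entry-wise by the transition probabilities and renormalized.

def build_cpt(allowed, transition):
    """Build P(value | previous value, phone state).

    allowed[phone_state]:   dict value -> probability (as in Table 4)
    transition[prev_value]: dict value -> probability (as in Table 5)
    Returns cpt[(phone_state, prev_value)] as a dict value -> probability.
    """
    cpt = {}
    for phone_state, value_probs in allowed.items():
        for prev in transition:
            unnorm = {v: value_probs.get(v, 0.0) * transition[prev].get(v, 0.0)
                      for v in value_probs}
            total = sum(unnorm.values())
            cpt[(phone_state, prev)] = {v: p / total for v, p in unnorm.items()}
    return cpt

# The voicing variable, using the example numbers from Tables 4 and 5.
allowed = {"ah0": {"on": 0.9, "off": 0.1}, "n0": {"on": 1.0}}
transition = {"on": {"on": 0.8, "off": 0.2}, "off": {"on": 0.2, "off": 0.8}}
cpt = build_cpt(allowed, transition)
print(cpt[("ah0", "off")])   # after an unvoiced frame, "off" is more likely
                             # than its Table 4 prior of 0.1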

Because of the large memory requirements of the model, we were unable to run experiments with this model for cases where the articulators are not constrained to their canonical values. Toward the end of the workshop, log-space inference became available in GMTK, making it possible to trade off memory for running time. We therefore leave experiments with this model to future investigation.

Our main achievements during the workshop were in creating the infrastructure to construct various versions of this articulatory model. This includes:

• Scripts to construct structures of the type in Figure 57 for arbitrary definitions of the articulatory feature set, phone set, and mappings from phones to articulatory values, for structures with and without inter-articulator dependencies in the current frame.

• Scripts to generate initial conditional probability tables from phone-to-articulator mapping tables and articulatory value transition tables, as described above.

• Scripts to generate an initial Gaussian mixture for each combination of articulatory values. We used a simple heuristic: if the combination corresponds to the canonical production of some phone state, use an existing Gaussian mixture for that phone state (from a prior training run of the phone-based recognizer); otherwise, use a silence model.

• The tables necessary to construct several variants of the model, including one in which the articulators must always take on their canonical values; one in which the articulators must reach the canonical values in the middle state of each phone but can stray from those values in the first and third states; and one in which the articulators need never take on the canonical values but are constrained to fall within a given range of those values.

12 Other miscellaneous workshop accomplishments

12.1 GMTK Parallel Training Facilities

This section documents the scripts that we developed to run GMTK tasks on multiple machines in parallel. The parallel scripts are written in the bash shell language and use the pmake (parallel make) utility to run distributed jobs. The scripts work by reading in a user-created "header" file defining various parameters, creating the appropriate makefiles, and then running pmake (possibly multiple times if training for multiple iterations). We developed parallel scripts to (1) run EM training for a Gaussian-mixture-based model with a given training "schedule" (as described in Section 12.1.1 below), and (2) create Viterbi alignments for a given set of utterances. We also used pmake to decode multiple test sets in parallel; for this purpose we did not write a separate script, but rather created the makefiles by hand.

12.1.1 Parallel Training: emtrain parallel

In order to train GMTK models in parallel, we divide the training set into a number of chunks. During each EM iteration, the statistics of each chunk are first computed using the current model parameters. At the end of the iteration, the statistics from all of the chunks are collected in order to update the model parameters. Below we describe the training procedure in greater detail.

The parallel training script emtrain parallel is invoked via

emtrain_parallel [header_filename]

where the header file is itself a bash script containing parameter definitions of the form

PARAMETER_NAME=value

The main parameters defined in the header file are:

• The files in which the training data, initial model parameters, and training structure are found, the file to which the final learned parameters are to be saved, and a directory for temporary files.

• A template masterfile from which multiple masterfiles will be created, one for each chunk of utterances.

• A training schedule defined by two arrays of equal length N, one for the -mcvr parameter and one for the -mcsr parameter of gmtkEMtrain. These define the vanishing and splitting ratios, respectively, for the first N iterations of training. After the Nth iteration, additional iterations are done until a convergence threshold (defined below) is reached, using the last -mcsr and -mcvr values in the arrays.

• The location of a script, which must be provided by the user, to create all the necessary utterance-dependent decision tree files for a given chunk of utterances.

• The iteration number I from which to start, which can be anywhere from 1 to the last iteration in the mcvr/mcsr arrays. If I = 1, the initial trainable parameters file will be used. If I > 1, then training will start from the Ith iteration using the learned parameters from the (I − 1)th iteration. The latter case is meant to be used to restart a training run that has been halted before completion for some reason.

• The maximum number M of iterations to run, and the log-likelihood ratio threshold for convergence. Training will proceed until convergence or for M iterations, whichever comes first.

• The number of chunks to break the training data into, and the maximum number of chunks to be run in parallel at one time.

These and all other parameters are described in greater detail in the example header file in Section 12.1.3. The basic procedure that emtrain parallel follows is:

0. Define N = number of chunks to break the training data into, I = initial EM iteration, M = maximum number of EM iterations, r_t = log-likelihood ratio threshold for convergence, and INITIAL GMP = the initial trainable parameters file.

1. Divide the training set into the N chunks, and create a separate masterfile and decision tree (DT) files for each chunk of utterances. This step is done in parallel using pmake, as there may be a large number of masterfiles and DT files to create.

2. For iter = I to the length of the mcsr/mcvr arrays, do:

   (a) If iter = 1, set the current parameters file to INITIAL GMP. Otherwise, set it to the learned parameters file from iteration iter − 1.

   (b) Create a pmake makefile with N targets, each of which stores the statistics of one chunk of the training data (using gmtkEMtrain with -storeAccFile). Run pmake using this makefile, storing all output to a log file.

   (c) Collect the statistics from all of the accumulator files for this iteration (using gmtkEMtrain with -loadAccFile) and update the model parameters, using the current -mcsr and -mcvr to split or vanish Gaussians as appropriate.

3. Repeat until convergence or until the Mth iteration, whichever comes first:

   (a) Increment iter.

   (b) Follow (a)-(c) from (2) above.

   (c) Test for convergence: letting L_i be the log likelihood of the training data in iteration i, compute the current log-likelihood ratio r as

       r = 100 \cdot \frac{L_{iter} - L_{iter-1}}{|L_{iter-1}|}        (7)

Convergence has been reached if r < r_t (a small code sketch of this test is given below, after the following notes).

We also include here several notes that we have found helpful to keep in mind when running this script:

• The last -mcsr and -mcvr values in the training schedule should be such that no splitting or vanishing is allowed. This is so that, during the "convergence phase" of the training run, successive iterations of the model have the same number of Gaussians and the log likelihoods can be compared (since, in EM training, the log likelihood is guaranteed to increase with each iteration only if the number of Gaussians is kept constant).

• During an iteration of gmtkEMtrain, some utterances may be skipped if their probability is too small according to the current model (and with the current beam width). If different utterances are skipped in successive iterations, then the log likelihoods of those two iterations are again not strictly comparable, since they are computed on different data. This can become a serious problem if a significant number of utterances is skipped. The script does not warn the user about skipped utterances, but gmtkEMtrain does output warnings, which can be found in the pmake log files.
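A minimal sketch of the convergence test of equation (7), purely for illustration (the real check lives in the bash script):

def converged(loglik_curr, loglik_prev, threshold_pct=0.2):
    """Equation (7): percentage change in training-data log likelihood
    between successive EM iterations, compared against the threshold."""
    r = 100.0 * (loglik_curr - loglik_prev) / abs(loglik_prev)
    return r < threshold_pct

print(converged(-1.0002e6, -1.0010e6))   # a 0.08% increase -> True (converged)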

12.1.2 Parallel Viterbi Alignment: Viterbi align parallel

To create Viterbi alignments with GMTK, gmtkViterbi is run using the training structure instead of the decoding structure, and the values of the desired variables in each frame are written out into files, one per utterance (using the -dumpNames option). The parallel Viterbi alignment script Viterbi align parallel is invoked via

Viterbi_align_parallel [header_filename]

where the header file is again a listing of parameter definitions. The main parameters defined in the header file are:

• Filenames for the observations of the utterances to be aligned, the model parameters, and the training structure, directories in which to put temporary files and output alignments, and a filestem for the alignment filenames.

• A template masterfile and a script to create chunk DT files, as in emtrain parallel.

• The number of chunks to break the observations into, and the maximum number of chunks to be run at once in parallel.

• A file containing the names of the variables to be stored in the alignments.

The procedure for parallel Viterbi alignment is much simpler than that for parallel training. The script simply divides the data into the specified number of chunks, creates lists of output alignment filenames for each chunk, creates a makefile that runs gmtkViterbi with the appropriate parameters for each chunk, and runs pmake. A similar procedure could be used to do parallel Viterbi decoding of a test set by dividing the set into chunks and then collecting the chunk outputs, although we did not do this during the workshop.

12.1.3 Example emtrain parallel header file

############################################################
## Example header file for use with emtrain_parallel
############################################################

############################################################
## files & directories
############################################################

## File in which training data is stored (this can be either
## a pfile or a file containing a list of feature files)
TRAIN_FILE=/export/ws01grmo/aurora/training_pfiles/mfcc/MultiTrain.pfile

## Initial trainable parameters file
INITIAL_GMP=/your/parameters/dir/initial_params.gmp

## Output file in which to put _final_ learned parameters
EMOUT_FILE=/your/parameters/dir/learned_params.gmp

## Structure file for training
STRFILE=/your/parameters/dir/aurora_training.str

## Directory in which to put temporary files (makefiles,
## pmake output, chunk decision tree (DT) files, chunk
## masterfiles, and learned parameters from intermediate
## iterations)
MISC_DIR=/your/temporary/dir/

## Template masterfile: like a regular masterfile, except
## that wherever an utterance-dependent DT file is specified
## in the template masterfile, the file name must end in the
## string "*RANGE*.dts". Also note that, in the template
## masterfile, the directory of the chunk DT files must be
## $MISC_DIR.
MASTER_FILE=/your/parameters/dir/masterFile.template.params

############################################################
## other params for gmtkEMtrain
############################################################

## The training schedule:

##
## Arrays of values for the -mcvr and -mcsr parameters, one
## per iteration up to the last iteration before
## log-likelihood-based training takes over. After the
## last iteration specified in the arrays, training will
## continue until convergence (or until iteration
## $MAX_EM_ITER) using the last -mcvr and -mcsr values in
## the arrays.
##
## The schedule being used here is:
##  -Run 1 iteration with no splitting or vanishing
##  -Run 2 iterations in which all Gaussians are split but
##   none are vanished
##  -Run 1 iteration with no splitting or vanishing
##  -Continue training until convergence

MCVR_ARRAY="1e20 1e20 1e20 1e20"

MCSR_ARRAY="1e10 1e-15 1e-15 1e10"

## Value for -meanCloneSTDfrac -- assumed to be the same
## for all iterations
MEANCLONEFRAC=0.25

## Value for -covarCloneSTDfrac -- assumed to be same for
## all iterations
VARCLONEFRAC=0.0

## Variables specifying all parameters relevant to each
## of the 3 feature streams (-of, -nf, -ni, -fmt, -iswp).
## --If using fewer than 3 streams, use null string for
##   the remaining stream(s).
## --Must have non-null value for at least one of the
##   streams.
STREAM1_PARAMS="-of1 $TRAIN_FILE -nf1 42 -ni1 0 -fmt1 pfile -iswp1 true"
STREAM2_PARAMS=""
STREAM3_PARAMS=""

############################################################
## other parameters for emtrain_parallel
############################################################

## User provides a script that generates the DT files for a
## given chunk of utterances.
## --The script must write the chunk DTs to files ending
##   in .dts, where range is of the form min-utt:max-utt.
## --The script can take any number of arguments, but the
##   last two must be an utterance range (in the form
##   min-utt:max-utt) and a directory in which to put DT files.
##   Any other arguments must be included here with the script
##   name. In the example below, "generate_chunk_dts" takes
##   as arguments a label file and then the utterance range
##   and directory.
LABEL_FILE=/export/ws01grmo/aurora/LABELFILES/AllMultiTr.mlf
GENERATE_CHUNK_DTS="/home/ws01/klivescu/GM/aurora/phone/generate_chunk_dts ${LABEL_FILE}"

## Number of training sentences; must be <= the number of
## utterances in $TRAIN_FILE (If <, then the first
## $NUM_TRN_SENTS utterances will be used for training.)
NUM_TRN_SENTS=8440

## Number of EM iterations;
## -- if <= the number of elements in $MCVR_ARRAY and
##    $MCSR_ARRAY, then training will follow the schedule
##    in the arrays up to $MAX_EM_ITER
## -- if > number of elements, then training will follow the
##    schedule through the last element of the arrays, and
##    then will continue until iter $MAX_EM_ITER or until log
##    likelihood threshold reached, whichever comes first
MAX_EM_ITER=100   # this means the schedule above will be completed,
                  # then training will continue for at most another
                  # 100-4 iterations or until convergence

## EM iteration number to start from (between 1 and the last
## iteration in the MCVR and MCSR arrays). If > 1, then
## emtrain_parallel will look for the learned params file from
## the previous iteration, $MISC_DIR/learned_params[k].gmp,
## where k = $INIT_EM_ITER-1. An error will be generated if
## this file doesn't exist.
INIT_EM_ITER=1

## Log-likelihood (LL) difference ratio, in percent, at which
## convergence is assumed to have occurred
LOG_LIKE_THRESH=0.2   # i.e. train until the LL difference
                      # between the current iteration and the
                      # previous one is 0.2% or less.

## Binary parameter indicating whether or not to keep the
## accumulator files from each iteration. The default value
## is "true". If "false", then each iteration will
## overwrite the accumulator files from the last iteration.
KEEP_ACC="true"

## Number of chunks to divide training data into--should be a
## multiple of EMTRAIN_PARALLELISM for maximum time-efficiency
EMTRAIN_CHUNKS=20

## Maximum number of processes to run in parallel at any time
EMTRAIN_PARALLELISM=20

## Number of processes to run on the local machine
NUM_LOCAL=0

## Set of nodes on which to run, in pmake syntax
NODES="delta grmo OR alta grmo"

## Specify the binary for gmtkEMtrain
GMTKEMTRAIN=/export/ws01grmo/gmtk/linux/bin/gmtkEMtrain.WedAug01_19_2001

12.1.4 Example Viterbi align parallel header file

############################################################
## Example header file for use with Viterbi_align_parallel
############################################################

############################################################
## directories
############################################################

## Directory in which to put temporary files (makefiles,
## pmake output, chunk decision tree (DT) files, and chunk
## masterfiles)
MISC_DIR=/your/temporary/dir/

## Directory in which to put alignments
ALIGN_DIR=/your/alignments/dir

## Filestem for alignment files. Output alignment files
## will be of the form $ALIGN_FILESTEM.utt_[num].out
ALIGN_FILESTEM=align

############################################################
## training & parameter files
############################################################

## File in which training data is stored (this can be either
## a pfile or a file containing a list of feature files)
TRAIN_FILE=/export/ws01grmo/aurora/training_pfiles/mfcc/MultiTrain.pfile

## Template masterfile: like a regular masterfile, except
## that wherever an utterance-dependent DT file is specified
## in the template masterfile, the file name must end in the
## string "*RANGE*.dts". Also note that, in the template
## masterfile, the directory of the chunk DT files must be
## $MISC_DIR.
MASTER_FILE=/your/parameters/dir/masterFile.template.params

## Structure file for training
STRFILE=$PARAMS_DIR/aurora_training.str

## Trainable parameters file
TRAINABLE_PARAMS_FILE=/your/parameters/dir/params.gmp

############################################################
## params for gmtkViterbi
############################################################

## Specify the binary to use for gmtkViterbi
GMTKVITERBI=/export/ws01grmo/gmtk/linux/bin/gmtkViterbi.ThuJul26_23_2001

## Variables specifying all parameters relevant to each
## of the 3 feature streams (-of, -nf, -ni, -fmt, -iswp).
## --If using fewer than 3 streams, use null string for
##   the remaining stream(s).
## --Must have non-null value for at least one of the
##   streams.
STREAM1_PARAMS="-of1 $TRAIN_FILE -nf1 42 -ni1 0 -fmt1 pfile -iswp1 true"
STREAM2_PARAMS=""
STREAM3_PARAMS=""

## Any other params you want to pass to gmtkViterbi
MISC_PARAMS=""

############################################################
## other params
############################################################

## User provides a script that generates the DT files for a
## given chunk of utterances.
## --The script must write the chunk DTs to files ending
##   in .dts, where range is of the form min-utt:max-utt.
## --The script can take any number of arguments, but the
##   last two must be an utterance range (in the form
##   min-utt:max-utt) and a directory in which to put DT files.
##   Any other arguments must be included here with the script
##   name. In the example below, "generate_chunk_dts" takes
##   as arguments a label file and then the utterance range
##   and directory.
LABEL_FILE=/export/ws01grmo/aurora/LABELFILES/AllMultiTr.mlf
GENERATE_CHUNK_DTS="/home/ws01/klivescu/GM/aurora/phone/generate_chunk_dts ${LABEL_FILE}"

## File listing names of variables to dump out into alignments
DUMP_NAMES_FILE=/your/work/dir/dump_names

## Number of training sentences; must be <= the number of
## utterances in $TRAIN_FILE (If <, then the first
## $NUM_TRN_SENTS utterances will be used for training.)
NUM_TRN_SENTS=8440

## Max number of processes to run at once
PARALLELISM=20

## Number of chunks to divide training data into--should be a
## multiple of PARALLELISM
CHUNKS=20

## Number of processes to run locally
NUM_LOCAL=0

## Set of nodes on which to run
NODES="delta grmo OR alta grmo"

12.2 The Mutual Information Toolkit

Our approach to structure learning is to start from a base model (the HMM) and decide how we can improve the structure (by the addition or removal of edges) to make it more discriminative and better at classification. For that we need a way to evaluate the effect of adding (or removing) edges on the quality of the structure, and this is where the MI Toolkit comes into play. The toolkit provides a set of tools designed to calculate mutual information between nodes in the graphical model. This measure is used to decide where changes to the structure will have the most effect. We present the rest of this toolkit overview in the context of speech recognition, even though the tools are general enough to be applied to any time-varying series. Specifically, we assume the data is presented in the form of sentences, which are collections of frames, or vectors.

One problem that this toolkit solves is that of processing very large amounts of data, which cannot fit into memory. At any given time, only one sentence has to be loaded and processed. Moreover, the tools are designed to be run in parallel. Besides the data, an input to the MI tools is a specification of the relative positions of the features in the speech frames between which we want to compute the MI. At any given time/frame, such a specification defines two vectors X and Y. By going over the data, we collect instances of the vectors X and Y. Section 12.2.2 discusses how the joint probability distributions are estimated from those instances. Section 12.2.1 starts by introducing background about mutual information and entropy that is useful for the rest of the discussion.

12.2.1 Mutual Information and Entropy

Mutual information is the amount of information a given random variable X has about another random variable Y. Formally,

I(X;Y) = E\left[\log \frac{p(X,Y)}{p(X)p(Y)}\right]

Mutual information is 0 when X and Y are independent (i.e., p(X,Y) = p(X)p(Y)) and is maximal when X and Y are completely dependent, i.e., when there is a deterministic relationship between them. The value of I(X;Y) in that case is min{H(X), H(Y)}, where H(X) is the entropy of X and is defined as

H(X) = E\left[\log \frac{1}{p(X)}\right]

It measures the amount of uncertainty associated with X.

I(X;Y) = E\left[\log \frac{p(X,Y)}{p(X)p(Y)}\right] = E\left[\log \frac{p(X|Y)}{p(X)}\right] = H(X) - H(X|Y)

By symmetry I(X; Y ) is also equal to H(Y ) − H(Y |X), hence we get the upper bound since entropy is positive. The definition of the mutual information applies for any random vectors X and Y , but we make the distinction between the bivariate mutual information, when X and Y are scalars and multivariate mutual information, when X and Y are vectors.

12.2.2 Toolkit Description

The MI Toolkit consists of four programs:

1. Discrete-mi
2. Bivariate-mi
3. Multivariate-mi
4. Conditional-entropy

Discrete-mi calculates the MI when the vectors are discrete, i.e., each component can take a finite number of values. There are no restrictions on the size of the vectors other than memory limitations (a hash table version of this tool has also been written to alleviate the memory problem when the vectors have sparse values). Bivariate-mi calculates the MI between two continuous scalar elements. The restriction to scalars allows the use of several optimizations that considerably speed up the MI calculation. Multivariate-mi generalizes the Bivariate-mi tool to arbitrarily sized vectors. Conditional-entropy calculates the conditional (on frame labels) entropy of arbitrarily sized vectors.

There are two main parts to calculating the mutual information (or entropy). First, the joint probability distribution must be estimated. Then, after obtaining the marginals, the MI is estimated. Calculating the mutual information when the joint and marginal probability distributions are available is done as shown in Section 12.2.1 (the definition of MI). For Discrete-mi, obtaining the probability distribution is straightforward: the probability of each n-tuple is just the frequency at which the n-dimensional configuration appears. For the remaining programs, computing the probability distribution is more involved and relies on an Expectation Maximization (EM) procedure. Following is a description of how the mutual information between two random vectors X and Y is calculated using EM.
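Before turning to the EM-based procedure, note that the Discrete-mi computation really is plug-in counting; a minimal sketch (our own illustration, not the toolkit code) for pairs of discrete values is:

from collections import Counter
from math import log

def discrete_mi(pairs):
    """Plug-in MI estimate from (x, y) observations: frequencies give the
    joint distribution, and the marginals are obtained by summing it out."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum((c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())

print(discrete_mi([(0, 0), (1, 1), (0, 0), (1, 1)]))   # = log 2 (nats): the two
                                                       # values are fully dependent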

12.2.3 EM for MI estimation

Training data is partitioned into "sentences," each of which contains, depending on the length of the utterance, several hundred frames (vectors of observations). Besides the data, an input to the MI procedure is a specification of the relative positions, within the speech frames, of the features between which we want to compute the MI.

1. While EM has not converged (i.e., the increase in log likelihood is above a given threshold), repeat steps 2-6:
2. Read in a new sentence.
3. For each position/lag specification, populate arrays of vectors X and Y by collecting the specified features over the sentence.
4. Accumulate sufficient statistics.
5. Go to step 2 until all sentences have been read.
6. Update the parameters of the joint probability distribution p_XY.
7. Partition the parameters of the p_XY distribution to get p_X and p_Y.
8. Draw samples from the above three distributions and estimate

    (1/N) Σ_{i=1}^{N} log [ p_XY(x_i, y_i) / (p_X(x_i) p_Y(y_i)) ]

The quantity estimated in the last step approximates the mutual information, by the law of large numbers: the larger N is, the better the approximation. The sampling from the distributions can either be done directly from the data or by generating new samples according to the learned distributions; both methods yield similar results.
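The final estimation step can be illustrated with a small sketch (hypothetical Python, not toolkit code). For simplicity the joint density here is a single bivariate Gaussian, so the "EM" fit reduces to the sample mean and covariance, and the Monte Carlo estimate can be checked against the closed-form Gaussian MI, -(1/2) log(1 - rho^2):

    import numpy as np
    from numpy.random import default_rng

    rng = default_rng(0)

    # Invented correlated data standing in for the (X, Y) feature pairs collected over sentences.
    rho = 0.8
    data = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=20000)
    x, y = data[:, 0], data[:, 1]

    # "Training": for a single Gaussian the maximum-likelihood fit is the sample mean and covariance.
    mean = data.mean(axis=0)
    cov = np.cov(data.T)

    def log_gauss(v, m, c):
        """Log density of a (possibly multivariate) Gaussian."""
        v = np.atleast_2d(v) - m
        c = np.atleast_2d(c)
        inv = np.linalg.inv(c)
        _, logdet = np.linalg.slogdet(c)
        quad = np.einsum('ni,ij,nj->n', v, inv, v)
        return -0.5 * (v.shape[1] * np.log(2 * np.pi) + logdet + quad)

    # Monte Carlo estimate: (1/N) sum_i log [ p_XY(x_i, y_i) / (p_X(x_i) p_Y(y_i)) ],
    # sampling directly from the data (one of the two options mentioned above).
    log_joint = log_gauss(data, mean, cov)
    log_px = log_gauss(x[:, None], mean[:1], cov[:1, :1])
    log_py = log_gauss(y[:, None], mean[1:], cov[1:, 1:])
    mi_mc = np.mean(log_joint - log_px - log_py)

    rho_hat = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])
    mi_closed = -0.5 * np.log(1.0 - rho_hat ** 2)
    print(mi_mc, mi_closed)   # the two estimates should agree closely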

12.2.4 Conditional entropy

The previous MI procedure assumes both random variables are continuous, but often we are interested in the mutual information between a continuous and a discrete variable, i.e., we want to calculate I(X;A) where X is continuous and A is discrete. Such a calculation is needed, for example, when we want to augment our graphical model with conditioning features Y: observations that, unlike the normal observations (which we also call scoring observations), do not depend on the state Q, but that can potentially help discrimination. A simple criterion for selecting conditioning observations is, thus, to compute the unconditional mutual information between the conditioning features and the state; if that value is small, we deduce that Y ⊥⊥ Q. We also compute I(Y;Q|X) to verify that X depends on Y.

We can write this mutual information as a function of entropies:

    I(X;A) = H(X) - H(X|A)
           = E_p[ log 1/p(X) ] - E_{p(x,a)}[ log 1/p(X|A) ]
           = -E_p[ log p(X) ] + Σ_{ai} p(ai) E_{p(x|ai)}[ log p(X|A = ai) ]

Therefore, we can use a procedure similar to that described above to estimate the probability distributions p_X and p_{X|A}, and, by sampling from these distributions and using the law of large numbers, estimate the two terms -E_p[log p(X)] and E_{p(x|ai)}[log p(X|A = ai)]. The probability distribution p_A can be computed by counting the frequencies of the values of A.
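A minimal sketch of this estimate (hypothetical Python, using a single Gaussian per class in place of the toolkit's mixture models, with invented parameters) is:

    import numpy as np
    from numpy.random import default_rng

    rng = default_rng(1)

    # Invented setup: a binary variable A and a scalar X whose distribution depends on A.
    # p_a could equally be estimated by counting the observed values of A, as described above.
    p_a = np.array([0.3, 0.7])
    means, sigmas = np.array([-1.0, 2.0]), np.array([1.0, 0.5])

    a = rng.choice(2, size=50000, p=p_a)
    x = rng.normal(means[a], sigmas[a])

    def log_gauss(v, m, s):
        return -0.5 * np.log(2 * np.pi * s ** 2) - (v - m) ** 2 / (2 * s ** 2)

    # p(X) is a two-component mixture; p(X|A = ai) are the class-conditional Gaussians.
    def log_px(v):
        comps = [np.log(p_a[i]) + log_gauss(v, means[i], sigmas[i]) for i in range(2)]
        return np.logaddexp(comps[0], comps[1])

    # H(X) = -E_p[log p(X)], estimated over all samples.
    h_x = -np.mean(log_px(x))

    # H(X|A) = -sum_i p(ai) E_{p(x|ai)}[log p(X|A = ai)], estimated per class.
    h_x_given_a = -sum(p_a[i] * np.mean(log_gauss(x[a == i], means[i], sigmas[i]))
                       for i in range(2))

    print(h_x - h_x_given_a)   # estimate of I(X;A)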

12.3 Graphical Model Representations of Language Model Mixtures

Another project that was undertaken during the workshop was the use of GMTK for some basic language modeling experiments, namely using graphs to represent mixtures of component language models of various orders, and using the sparse conditional probability and switching hidden variable features of GMTK to implement them. In general, decoding in speech recognition can be decomposed into two separate probability calculations according to the noisy channel model, argmax_w P(A|w)P(w). Language models approximate the joint probability of a sequence of words, P(w), which by the chain rule becomes Π_t P(wt | h = w_{1...t-1}). For n-gram language models, the word history h at each time point is limited to the previous n - 1 words. This repeating structure of the word history lends itself nicely to the language of dynamic graphical models. Our experiments demonstrate how modular and easily trainable graphical models can be used for language modeling, from simple to more advanced models.

12.3.1 Graphical Models for Language Model Mixtures


Figure 58: Simple graph for a trigram language model.

For our experiments, we chose to model the standard trigram language model

    Ptrigram(wt|h) = P(wt | wt-1, wt-2) = N(wt-2, wt-1, wt) / N(wt-2, wt-1)    (8)

where N(wt-2, wt-1, wt) is the number of times the word triple (wt-2, wt-1, wt) occurs in the training data. This model can be described using the directed graph shown in Figure 58. The graph shows the set of word variables Wt for each t, and shows how Wt depends on the two previous words in the history.
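As an illustration, the count-based estimate of Equation 8 can be sketched as follows (hypothetical Python; in the experiments below these counts were in fact computed offline):

    from collections import Counter

    def train_trigram(sentences):
        """Collect the counts needed for Equation 8 from a list of word lists."""
        tri, bi = Counter(), Counter()
        for words in sentences:
            for i in range(2, len(words)):
                tri[(words[i - 2], words[i - 1], words[i])] += 1
                bi[(words[i - 2], words[i - 1])] += 1
        return tri, bi

    def p_trigram(w, w1, w2, tri, bi):
        """P(wt = w | wt-1 = w1, wt-2 = w2) = N(w2, w1, w) / N(w2, w1)."""
        denom = bi[(w2, w1)]
        return tri[(w2, w1, w)] / denom if denom > 0 else 0.0

    # Toy usage with an invented corpus.
    corpus = [["the", "cat", "sat", "on", "the", "mat"],
              ["the", "cat", "ate", "the", "mat"]]
    tri, bi = train_trigram(corpus)
    print(p_trigram("sat", "cat", "the", tri, bi))   # N(the,cat,sat)/N(the,cat) = 1/2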


Figure 59: Mixture of trigram, bigram, and unigram using a hidden variable α and switching parents.

In general, not all possible three-word sequences are seen in a given set of training data, and this was of course the case for the IBM AV data that we worked with during the workshop. We therefore implemented a smoothed probability distribution by linearly interpolating the trigram distribution with bigram and unigram distributions, a method well known as Jelinek-Mercer smoothing [59]:

    P(wi|h) = α1 Ptrigram + α2 Pbigram + (1 - α1 - α2) Punigram    (9)

where Pbigram and Punigram are defined similarly to the trigram case, as ratios of counts. Another way of viewing this equation is that there exists a hidden discrete tri-valued random variable, say named α, which is used to mix between the various component language models. The above equation can therefore be written as:

    P(wi|h) = P(α = 3) Ptrigram + P(α = 2) Pbigram + P(α = 1) Punigram    (10)

Viewed in this way, we see that the α variable is really a switching parent which, depending on its value, determines the set of parents that are active. This can be seen in the graphical model shown in Figure 59. In that figure, the αt variable at each time step is a switching parent (indicated by the dashed edges; see also Section 6.1.7, which describes the idea of switching parents and how it is implemented in GMTK), and is used only to determine whether some of the other edges in the graph are active. The values for which the different edges are active are shown by the call-out boxes, indicating the αt values required to activate each edge.
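Numerically, the switching-parent mixture of Equation 10 is just a weighted sum over the values of α. A minimal sketch (hypothetical Python with invented component tables, not the GMTK sparse-CPT implementation) is:

    # P(alpha = 1), P(alpha = 2), P(alpha = 3): weights on the unigram, bigram, trigram parents.
    p_alpha = {1: 0.1, 2: 0.3, 3: 0.6}

    def p_word(w, history, p_uni, p_bi, p_tri):
        """Equation 10: sum over the switching parent alpha of P(alpha) * P_alpha(w | active parents)."""
        w1, w2 = history[-1], history[-2]          # previous and previous-previous words
        return (p_alpha[3] * p_tri.get((w2, w1, w), 0.0)
                + p_alpha[2] * p_bi.get((w1, w), 0.0)
                + p_alpha[1] * p_uni.get(w, 0.0))

    # Invented toy distributions for illustration.
    p_uni = {"mat": 0.05}
    p_bi = {("the", "mat"): 0.2}
    p_tri = {("on", "the", "mat"): 0.5}
    print(p_word("mat", ["on", "the"], p_uni, p_bi, p_tri))   # 0.6*0.5 + 0.3*0.2 + 0.1*0.05

In GMTK itself, the same effect is obtained by making α a switching parent of Wt, so that only one of the three conditional probability tables is active for each value of α; the weighted sum arises when the hidden α is summed over during inference.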


Figure 60: Mixture of trigram, bigram, and unigram using a hidden variable α and switching parents. Here, α is dependent on the history.

In the most simple case, the random variable α has fixed probability values over time, and these values are typically learned from a held-out set not used to produce the base count distributions (i.e., deleted interpolation [59]). It is possible, however, for these weights to be a function of, and vary according to, the word history h, leading to the equation:

    P(wi|h) = P(α = 3|h) Ptrigram + P(α = 2|h) Pbigram + P(α = 1|h) Punigram    (11)

for a hidden discrete tri-valued random variable α. This structure is shown in Figure 60. Because h can have quite a large state space, P(α|h) itself could be a difficult quantity to estimate. Therefore, in order to reduce this data-sparsity problem and estimate the quantity more robustly, we can form equivalence classes of word histories h based on the frequency of their occurrence, which are called buckets. In other words, those h values which occurred within a given range of counts in the training data are grouped together into one bucket B(h), and the resulting probability becomes P(α|B(h)), which is necessarily a discrete distribution with lower overall cardinality. As will be seen in the experiments below, we vary the number of buckets, thereby evaluating the trade-off between the model's robustness and its predictive accuracy. GMTK scaled quite easily to these changes, automatically learning the varying number of weights given the appropriate graphical model structures. Perplexity results with various numbers of buckets are given in the following sections; a small illustrative sketch of the bucketing itself is given below.

Another aspect of language modeling that is sometimes desirable is the ability to represent the notion of an optional lexical silence token that might occur between words (this might be a pause, or some other non-lexical entity). We will call this entity lexical silence, denoted by sil. This is particularly important when language modeling is used along with acoustic models, as the lexical silence "word" might have quite different acoustic properties than any of the real words in the lexicon. Therefore, a goal is to allow lexical silence to occur between any pair of words. A problem that arises when this is done, however, is that the probability model for the current word now depends on the previous word wt-1, which might be lexical silence. It could be more beneficial to condition only on the previous "true" words (not lexical silences) when the context contains lexical silence. In language modeling, this is a form of what is called a skip language model, where some word in the context is skipped in certain situations. It is possible to represent such a construct with a graph and with GMTK. We first develop the model in the bigram case for simplicity, and then provide the trigram case (which also includes perplexity results below).
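As a brief illustration of the history-bucketing scheme described above (hypothetical Python; the bucket boundaries and counts are invented for the example, and during the workshop the number of buckets was specified in the GMTK structure files):

    import bisect

    # Invented bucket boundaries on history counts N(h): [0], [1..4], [5..49], [50..).
    boundaries = [1, 5, 50]

    def bucket(history, history_counts):
        """Map a word history h to its equivalence class B(h) based on how often h was seen."""
        n = history_counts.get(history, 0)
        return bisect.bisect_right(boundaries, n)   # bucket index 0..len(boundaries)

    # P(alpha | B(h)) is then a small table, one distribution per bucket,
    # instead of one distribution per distinct history h.
    history_counts = {("on", "the"): 73, ("purple", "mat"): 2}
    print(bucket(("on", "the"), history_counts),      # 3: frequently seen history
          bucket(("purple", "mat"), history_counts),  # 1: rarely seen history
          bucket(("never", "seen"), history_counts))  # 0: unseen history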

The essential problem is that the random variable Wt in the graph is conditioned on the previous word Wt-1. When the previous word is sil, however, the information about the previous true word is lost (since it is not in the conditioning set of the model P(Wt|Wt-1, Wt-2)). Therefore, there must be some mechanism (graphical in this case) to keep track of what the previous true word is and to use it in the conditioning set. This can be done by having an explicit extra variable, which we call Rt, for the previous Real word in the history. Rt then becomes part of the conditioning set and is used to produce the bigram score rather than Wt-1, which might be sil. Also, Rt itself needs to be maintained over time when the current word is sil; when Wt is not sil, Rt should be updated to be whatever the previous word truly is. We first describe this in equations, giving the distribution of Wt given both Wt-1 and Rt-1, and then provide a graph.

    P(wt | wt-1, rt-1) = Σ_{rt} P(wt, rt | wt-1, rt-1)                          (12)
                       = Σ_{rt} P(rt | wt, wt-1, rt-1) P(wt | wt-1, rt-1)       (13)
                       = Σ_{rt} P(rt | wt, rt-1) P(wt | rt-1)                   (14)

The quantity P(rt | wt, rt-1) is set as follows:

    P(rt | wt, rt-1) = δ(rt = rt-1)   if wt = sil
    P(rt | wt, rt-1) = δ(rt = wt)     if wt ≠ sil

where δ(i = j) is the delta (indicator) function, which is one only when i = j and is otherwise zero. The implementation of this distribution does the following: if the current word wt is sil, then Rt is a copy of whatever Rt-1 is, so the real word is retained from time slice to time slice. If, on the other hand, wt is a real word, then Rt is a copy of that word. Therefore, Wt is both a normal and a switching parent of Rt. The implementation of P(Wt|Rt-1) is as follows:

    P(wt | rt-1) = Σ_{st} P(wt | st, rt-1) P(st)

where St is a hidden binary variable at time t which indicates whether the current word is lexical silence. The implementation of P(wt | st, rt-1) is as follows:

    P(wt | st, rt-1) = δ(wt = sil)            if st = 1
    P(wt | st, rt-1) = Pbigram(wt | rt-1)     if st = 0

This means that whenever St is "on", it forces Wt to be sil, and sil is then the only token that gets any (and all of the) probability. P(st) is simply set to the probability of lexical silence (the relative frequency can be obtained from training data). The graph for this model is shown in Figure 61. In the graph, Wt has a dependency only on Rt-1 and St. Rt uses Wt both as a switching and a normal parent: the value of Wt switches the parent of Rt to be either Wt itself, to obtain a new real word value, or Rt-1, to retain the previous real word value.
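These deterministic and switching dependencies amount to a small computation. The following hypothetical Python sketch of the two delta-function tables and of the sum in Equation 14 is illustrative only (SIL, p_sil, and the bigram table are invented placeholders; this is not GMTK code). A short continuation of this sketch, after the worked example below, scores a whole token string.

    SIL = "<sil>"   # invented token standing in for lexical silence
    p_sil = 0.1     # P(st = 1): the prior probability of lexical silence (relative frequency)

    def p_r(r_t, w_t, r_prev):
        """P(rt | wt, rt-1): copy rt-1 when wt is silence, otherwise copy wt."""
        return float(r_t == (r_prev if w_t == SIL else w_t))

    def p_w_given_s_r(w_t, s_t, r_prev, p_bigram):
        """P(wt | st, rt-1): forced to sil when st = 1, otherwise a bigram on the real history."""
        if s_t == 1:
            return float(w_t == SIL)
        return p_bigram.get((r_prev, w_t), 0.0)

    def p_w_given_r(w_t, r_prev, p_bigram):
        """P(wt | rt-1) = sum over st of P(wt | st, rt-1) P(st), as used in Equation 14."""
        return (p_sil * p_w_given_s_r(w_t, 1, r_prev, p_bigram)
                + (1.0 - p_sil) * p_w_given_s_r(w_t, 0, r_prev, p_bigram))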


Figure 61: An implementation of a skip-like bigram, where the previous real word Rt is used to look up the bigram probability whenever the previous word is lexical silence.

Given the above model, the string of lexical items "Fool me once sil, shame on sil, shame on you" will get probability Pbg(me|Fool) Pbg(once|me) Pbg(shame|once) Pbg(on|shame) Pbg(shame|on) Pbg(on|shame) Pbg(you|on) P(sil)^2, where the probability of lexical silence is applied twice.
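Continuing the hypothetical sketch above (and, as in the product just given, omitting the P(st = 0) factors at the real-word positions for readability), the same product can be reproduced by walking the token string while maintaining the previous-real-word value:

    import math  # SIL and p_sil are reused from the sketch above

    def skip_bigram_score(tokens, p_bigram):
        """Log product of skip-bigram terms, with P(sil) applied at each silence token."""
        logp, r_prev = 0.0, None
        for w in tokens:
            if w == SIL:
                logp += math.log(p_sil)                      # each silence contributes P(sil)
            else:
                if r_prev is not None:
                    logp += math.log(p_bigram[(r_prev, w)])  # Pbg(w | previous real word R)
                r_prev = w                                   # R is updated only by real words
        return logp

    tokens = ["Fool", "me", "once", SIL, "shame", "on", SIL, "shame", "on", "you"]
    # Invented bigram values, just to make the sketch runnable.
    p_bigram = {pair: 0.1 for pair in [("Fool", "me"), ("me", "once"), ("once", "shame"),
                                       ("shame", "on"), ("on", "shame"), ("on", "you")]}
    print(skip_bigram_score(tokens, p_bigram))   # 7 bigram factors plus P(sil)^2, in the log domain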


Figure 62: An implementation of a skip-like bigram and a mixture of bigram and unigram models, essentially a combination of a bigram version of Figure 59 and of Figure 61

The model that skips lexical silence and the model that mixes between bigram and unigram probabilities may be combined into a single model, as shown in Figure 62. Note that it is also possible to use two binary auxiliary hidden variables (rather than just one α variable) to produce the mixture given in Figure 60 and described in Equation 11. The trigram decomposition can be performed as follows:

    P(wt | wt-1, wt-2) = P(αt = 1 | wt-1, wt-2) Ptri(wt | wt-1, wt-2)      (15)
                       + P(αt = 0 | wt-1, wt-2) P(wt | wt-1)               (16)

where

    P(wt | wt-1) = P(βt = 1 | wt-1) Pbi(wt | wt-1)      (17)
                 + P(βt = 0 | wt-1) P(wt)               (18)

and where we now have two hidden switching parent variables at each time slice: αt (deciding between a trigram and a bigram) and βt (deciding between a bigram and a unigram). This model is shown graphically in Figure 63.
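Substituting Equations 17-18 into Equations 15-16 makes the correspondence with the three-way mixture of Equation 11 explicit: the effective interpolation weights are

    λ_tri = P(αt = 1 | wt-1, wt-2)
    λ_bi  = P(αt = 0 | wt-1, wt-2) P(βt = 1 | wt-1)
    λ_uni = P(αt = 0 | wt-1, wt-2) P(βt = 0 | wt-1)

These weights are non-negative and sum to one, so the cascade of two binary switching variables is simply an alternative parameterization of the same three-way interpolation.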


Figure 63: A mixture of a trigram, bigram, and unigram language model using two binary hidden variables α (to control trigram vs. bigram) and β (to control bigram vs. unigram).


Figure 64: An implementation of a skip-like trigram, where the previous real word Rt and the real word before it, Vt, are both used to look up the trigram probability whenever either of the two preceding words is lexical silence. The structure also includes the provisions necessary to update the true words within the history.

Moreover, it is possible to implement a trigram model that skips over contexts that contain sil, similar to the bigram case. This trigram model is given in Figure 64. Combining this model with the mixture model, we at last arrive at the model that was used for the perplexity experiments carried out during the 2001 JHU workshop. This model is given in Figure 65, and it can be used for language model training and language scoring. We can also add the remaining structure given in the lower portion of Figure 14 to obtain a general speech recognition decoder that uses this mixture and skip language model.


Figure 65: A model that combines the two-variable mixture of trigram, bigram, and unigram models, and that also implements the skipping of lexical silence at the trigram level.

12.3.2 Perplexity Experiments

We tested the language model given in Figure 65 in a set of perplexity experiments. We used the IBM Audio-Visual corpus, comprising ≈ 13,000 training utterances and a ≈ 13,000-word vocabulary. Test data was a subset of the training utterances, and an additional subset of held-out data was used to train the weights (i.e., the distributions of the variables α and β). The trigram, bigram, and unigram probabilities were calculated offline by taking frequency counts over the training data. Once the number of buckets was specified in the structure files, the weights were learned with GMTK. To test the language model, we calculated the probability it assigned to the test data. This was then converted into perplexity, a common measure of how well the language model predicts language. As is well known, perplexity can be thought of as the average branching factor of the model, i.e., the number of words to which it effectively assigns equal probability, and it is correlated with WER in speech recognition. While perplexity reduction is often not a good predictor of overall word error reduction, in many cases it can be quite useful, especially when the perplexity reductions are large.
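Concretely, the conversion from the per-word log probabilities assigned by the model to perplexity can be sketched as follows (hypothetical Python; logprobs is a placeholder for values produced elsewhere, e.g., by GMTK scoring):

    import math

    def perplexity(logprobs):
        """Perplexity = exp of the negative average per-word log probability (natural log assumed)."""
        return math.exp(-sum(logprobs) / len(logprobs))

    # Toy check: a model that assigns probability 1/28 to every word has perplexity 28,
    # i.e., an average branching factor of 28 equally likely words.
    print(perplexity([math.log(1.0 / 28)] * 100))   # 28.0 (approximately)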

12.3.3 Perplexity Results

Experiments were run with linearly interpolated bigram and trigram language models, varying the number of buckets from 1 to 10. The results are summarized below in terms of language model perplexity over the test data:

                  bigram    trigram
    1 bucket       89.54      38.71
    2 buckets      81.63      28.59
    10 buckets     80.79      28.09

Adding one more word of history through the trigram model significantly reduced perplexity, as expected given enough training data. The single-bucket models performed reasonably well. The largest gains came from using two buckets instead of one, which effectively lowered the weight of n-grams whose history was never seen. More buckets did not appear to help, as the learned weights converged to similar values for the buckets with nonzero history counts.

12.3.4 Conclusions

The language model experiments performed with GMTK yield results comparable to those of standard language modeling toolkits. However, the similarity between the training, testing, and decoding graphical model structures allows for an easier transition between the different phases. Given the rich language of graphical models, extensions could be made to the trigram language model, such as higher-order n-grams or caching and trigger words. These would require the addition of new variables and dependency arcs. Once the structures are specified, the modular, trainable framework of GMTK would allow for seamless training, testing, and decoding.

13 Future Work and Conclusions

There were many goals of the JHU 2001 workshop (see Section 1), only some of which were realized. In this section, we briefly outline some of what could be the next steps for the research that began at this workshop.

13.1 Articulatory Models

The work we began at the workshop has produced some simple articulatory models and much of the infrastructure needed to build similar models. However, there is a large space of articulatory models that can be explored with graphical models. There are many additional dependencies that are likely to be present in speech but are not represented in the structures that we have constructed, as well as entirely different structures that may better model the asynchronous nature of the articulators. One of the main ideas that we had hoped to investigate, but were not able to during the workshop, is structure learning over the hidden articulatory variables. For a model with a large number of variables, such as an articulatory model, it is infeasible to include all of the possible dependencies and labor-intensive to predetermine them using linguistic considerations. This would therefore be a natural application for structure learning, and in particular for the ideas of discriminative structure learning using the EAR measure.

13.1.1 Additional Structures

There is also a wealth of other structures to be explored. The type of structure we have built (see Figure 57) is limited in several respects. While such a model can allow articulators to stray from their target values, it cannot truly represent asynchronous articulatory streams since all of the articulatory variables depend on the current phone state. In order to allow the articulators to evolve asynchronously, new structures are needed.

One possibility we have considered, but not implemented during the workshop, is to treat each articulatory stream in the same way that the phone is treated in the phone-based model, with its own position and transition variables. In such a model, each articulatory stream could go through its prescribed sequence of values at its own pace. One example of such a structure, in which there is no phone variable at all, is shown in Figure 66.

It is important, however, to constrain the asynchrony between the articulatory variables, for both computational and linguistic reasons. This can be done by forcing them to synchronize (i.e. reach the same positions) at certain points, such as word or syllable boundaries, or by requiring that some minimal subset be synchronized at any point in time. As an example, the structure of Figure 66 includes the dependencies that could be used for synchronization at word boundaries. The degree of synchronization is an interesting issue that, to our knowledge, has not been previously investigated in articulatory modeling research and would be fairly straightforward to explore using graphical models.

13.1.2 Computational Issues

Computational considerations are also likely to be an important issue for future work. As mentioned in Section 11, we were unable to train or decode with these models during the time span of the workshop. With the addition of log-space inference in GMTK, training and decoding can now be done within reasonable memory constraints, but at the expense of increased running time. Therefore, work is still needed to enable articulatory models to run more efficiently. One way to control the computational requirements of the model is through careful constraints on the state space. Constraints on the overall articulatory state space can be applied through the choice of articulatory variables and inter-articulator dependencies.


Figure 66: A phone-free articulatory model allowing for asynchrony. In this model, each feature i has a variable a^i representing its current value, a variable a^i_pos representing its position within the current word, and a variable a^i_tr indicating whether the feature is transitioning to its next position.

The "instantaneous" state space can also be controlled by imposing various levels of sparsity on the articulatory probability tables. In addition, measures can be taken to limit the size of the acoustic models (the conditional probability densities of the observation variable). For example, instead of having a separate model for each allowed combination of articulatory values, some of the models could be tied, or product-of-experts models [55] could be used to combine observation distributions corresponding to different articulators. We have begun to explore the product-of-experts approach in work pursued since the workshop. Finally, the distribution dimensionality could be reduced by choosing only a certain subset of the acoustic observations to depend on each articulator (which could differ from articulator to articulator).

13.2 Structural Discriminability

One of the main goals of the workshop was to investigate the use of discriminative structure learning. In this work, we have described a methodology that can learn discriminative structure between collections of observation variables. One of the key goals of the workshop that time constraints prevented us from pursuing was the induction of discriminative structure at the hidden level. For example, given a baseline articulatory network, it would be desirable to augment that network (i.e., either add or remove edges) so as to improve its overall discriminative power. Work is planned in the future to pursue this goal.

13.3 GMTK

There are many plans over the next several years for additions and improvements to GMTK. Some of these include: 1) a new, faster inference algorithm that utilizes an off-line triangulation algorithm, 2) approximate inference schemes such as a variational approach and a loopy propagation procedure, 3) non-linear dependencies between observations, 4) better integration with language modeling systems, such as the SRI language-modeling toolkit, 5) the use of hidden continuous variables, 6) adaptation techniques, and 7) general performance enhancements. Many other enhancements are planned as well. GMTK was conceived for use at the JHU 2001 workshop, but it is believed that it will become a useful tool for a variety of speech recognition, language modeling, and time-series processing tasks over the next several years.

14 The WS01 GM-ASR Team

Lastly, we would like to once again mention and acknowledge the WS01 JHU team, which consisted of the following people:

Jeff A. Bilmes — University of Washington, Seattle
Geoff Zweig — IBM
Thomas Richardson — University of Washington, Seattle
Karim Filali — University of Washington, Seattle
Karen Livescu — MIT
Peng Xu — Johns Hopkins University
Kirk Jackson — DOD
Yigal Brandman — Phonetact Inc.
Eric Sandness — Speechworks
Eva Holtz — Harvard University
Jerry Torres — Stanford University
Bill Byrne — Johns Hopkins University

The team is also shown in Figure 67 (along with several friends who happened by at the time of the photo shoot). Speaking as a team leader (J.B.), I would like to acknowledge and give many thanks to all the team members for doing an absolutely wonderful job. I would also like to thank Sanjeev Khudanpur, Bill Byrne, and Fred Jelinek and all the other members of CLSP for creating a fantastically fertile environment in which to pursue and be creative in performing novel research in speech and language processing. Lastly, we would like to thank the sponsoring organizations (DARPA, NSF, DOD), without which none of this research would have occurred.


Figure 67: The JHU WS01 GM Team. To see the contents of the T-shirts we are wearing, see Figure 68.

References

[1] A. V. Aho, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques and Tools. Addison-Wesley, Inc., Reading, Mass., 1986.

UWEETR-2001-0006 80 [2] S. M. Aji and R. J. McEliece. The generalized distributive law. IEEE Transactions in Information Thoery, 46:325–343, March 2000. [3] J.J. Atick. Could information theory provide an ecological theory of sensory processing? Network, 3, 1992. [4] L.R. Bahl, P.F. Brown, P.V. de Souza, and R.L. Mercer. Maximum mutual information estimation of HMM parameters for speech recognition. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, pages 49–52, Tokyo, Japan, December 1986. [5] J. Baker. The Dragon system—an overview. IEEE Transactions on Acoustics, Speech, and Signal Processing, 23:24–29, 1975. [6] J. Bilmes. Natural Statistical Models for Automatic Speech Recognition. PhD thesis, U.C. Berkeley, Dept. of EECS, CS Division, 1999. [7] J. Bilmes. Graphical models and automatic speech recognition. Technical Report UWEETR-2001-005, Uni- versity of Washington, Dept. of EE, 2001. [8] J. Bilmes. The gmtk documentation, 2002. http://ssli.ee.washington.edu˜bilmes/gmtk. [9] J. Bilmes and G. Zweig. The Graphical Models Toolkit: An open source software system for speech and time-series processing. Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 2002. [10] J.A. Bilmes. Buried Markov models for speech recognition. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, Phoenix, AZ, March 1999. [11] J.A. Bilmes. Dynamic Bayesian Multinets. In Proceedings of the 16th conf. on Uncertainty in Artificial Intelli- gence. Morgan Kaufmann, 2000. [12] J.A. Bilmes. Factored sparse inverse covariance matrices. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, Istanbul, Turkey, 2000. [13] J. Binder, K. Murphy, and S. Russell. Space-efficient inference in dynamic probabilistic networks. Int’l, Joint Conf. on Artificial Intelligence, 1997. [14] C. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995. [15] C. P. Browman and L. Goldstein. Articulatory phonology: An overview. Phonetica, 49:155–180, 1992. [16] The BUGS project. http://www.mrc-bsu.cam.ac.uk/bugs/Welcome.html. [17] W. Buntine. A guide to the literature on learning probabilistic networks from data. IEEE Trans. on Knowledge and Data Engineering, 8:195–210, 1994. [18] K.P. Burnham and D.R. Anderson. Model Selection and Inference : A Practical Information-Theoretic Ap- proach. Springer-Verlag, 1998. [19] R. Chellappa and A. Jain, editors. Markov Random Fields: Theory and Application. Academic Press, 1993. [20] C.-P. Chen, K. Kirchhoff, and J. Bilmes. Towards simple methods of noise-robustness. Technical Report UWEETR-2002-002, University of Washington, Dept. of EE, 2001. [21] D.M. Chickering. Learning from Data: Artificial Intelligence and Statistics, chapter Learning Bayesian net- works is NP-complete, pages 121–130. Springer-Verlag, 1996. [22] N. Chomsky and M. Halle. The Sound Pattern of English. New York: Harper and Row, 1968. [23] G. Cooper and E. Herskovits. Computational complexity of probabilistic inference using Bayesian belief net- works. Artificial Intelligence, 42:393–405, 1990. [24] T.H. Cormen, C.E. Leiserson, and R.L. Rivest. Introduction to Algorithms. McGraw Hill, 1990.

UWEETR-2001-0006 81 [25] R. G. Cowell, A. P. Dawid, S. L. Lauritzen, and D. J. Spiegelhalter. Probabilistic Networks and Expert Systems. Springer, 1999. [26] D.R. Cox and D.V. Hinkley. Theoretical Statistics. Chapman and Hall/CRC, 1974. [27] P. Dagum and M. Luby. Approximating probabilistic inference in Bayesian belief networks is NP-hard. Artifi- cial Intelligence, 60(141-153), 1993. [28] Journal: Data mining and knowledge discovery. Kluwer Academic Publishers. Maritime Institute of Technol- ogy, Maryland. [29] T. Dean and K. Kanazawa. Probabilistic temporal reasoning. AAAI, pages 524–528, 1988. [30] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum-likelihood from incomplete data via the EM algorithm. J. Royal Statist. Soc. Ser. B., 39, 1977. [31] L. Deng and K. Erler. Structural design of hidden markov model speech recognizer using multivalued phonetic features: Comparison with segmental speech units. Journal of the Acoustical Society of America, 92(6):3058– 3067, Dec 1992. [32] L. Deng, G. Ramsay, and D. Sun. Production models as a structural basis for automatic speech recognition. Speech Communication, 33(2-3):93–111, Aug 1997. [33] R.O. Duda and P.E. Hart. Pattern Classification and Scene Analysis. John Wiley and Sons, Inc., 1973. [34] R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classification. John Wiley and Sons, Inc., 2000. [35] E. Eide. Distinctive features for use in an automatic speech recognition system. In Eurospeech-99, 2001. [36] K. Elenius and M. Blomberg. Effects of emphasizing transitional or stationary parts of the speech signal in a discrete utterance recognition system. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, pages 535–538. IEEE, 1982. [37] C. Neti et. al. Audo-visual speech recognition: Ws 2000 final report, 2000. http://www.clsp.jhu.edu/ws2000/final reports/avsr/ws00avsr.pdf. [38] R. Fletcher. Practical Methods of Optimization. John Wiley & Sons, New York, NY, 1980. [39] E. Fosler-Lussier, S. Greenberg, and N. Morgan. Incorporating contextual phonetics into automatic speech recognition. In Proceedings 14th International Congress of Phonetic Sciences, San Francisco, CA, 1999. [40] B. Frey. Graphical Models for Machine Learning and Digital Communication. MIT Press, 1998. [41] J. H. Friedman. Multivariate adaptive regression splines. The Annals of Statistics, 19(1):1–141, 1991. [42] N. Friedman and M. Goldszmidt. Learning in Graphical Models, chapter Learning Bayesian Networks with Local Structure. Kluwer Academic Publishers, 1998. [43] N. Friedman, K. Murphy, and S. Russell. Learning the structure of dynamic probabilistic networks. 14th Conf. on Uncertainty in Artificial Intelligence, 1998. [44] K. Fukunaga. Introduction to Statistical Pattern Recognition, 2nd Ed. Academic Press, 1990. [45] D. Geiger and D. Heckerman. Knowledge representation and inference in similarity networks and Bayesian multinets. Artificial Intelligence, 82:45–74, 1996. [46] Z. Ghahramani. Lecture Notes in Artificial Intelligence, chapter Learning Dynamic Bayesian Networks. Springer-Verlag, 1998. [47] J. A. Goldsmith. Autosegmental and Metrical Phonology. B. Blackwell, Cambridge, MA, 1990.

UWEETR-2001-0006 82 [48] R. A. Gopinath. Maximum likelihood modeling with gaussian distributions for classification. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 1998. [49] G. Gravier, S. Axelrod, G. Potamianos, and C. Neti. Maximum entropy and MCE based HMM stream weight estimation for audio-visual asr. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 2002. [50] S. Greenberg, S. Chang, and J. Hollenback. An introduction to the diagnostic evaluation of the switchboard- corpus automatic speech recognition systems. In Proc. NIST Speech Transcription Workshop, College Park, MD, 2000. [51] D. Heckerman. A tutorial on learning with Bayesian networks. Technical Report MSR-TR-95-06, Microsoft, 1995. [52] D. Heckerman, Max Chickering, Chris Meek, Robert Rounthwaite, and Carl Kadie. Dependency networks for density estimation, collaborative filtering, and data visualization. In Proceedings of the 16th conf. on Uncer- tainty in Artificial Intelligence. Morgan Kaufmann, 2000. [53] D. Heckerman, D. Geiger, and D.M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Technical Report MSR-TR-94-09, Microsoft, 1994. [54] J. Hertz, A. Krogh, and R.G. Palmer. Introduction to the Theory of Neural Computation. Allan M. Wylde, 1991. [55] G. Hinton. Products of experts. In Proc. Ninth Int. Conf. on Artificial Neural Networks, 1999. [56] H. G. Hirsch and D. Pearce. The aurora experimental framework for the performance evaluations of speech recognition systems under noisy conditions. ICSA ITRW ASR2000, September 2000. [57] The ISIP public domain speech to text system. http://www.isip.msstate.edu/projects/speech/software/index.html. [58] T.S. Jaakkola and M.I. Jordan. Learning in Graphical Models, chapter Improving the Mean Field Approxima- tions via the use of Mixture Distributions. Kluwer Academic Publishers, 1998. [59] F. Jelinek. Statistical Methods for Speech Recognition. MIT Press, 1997. [60] F.V. Jensen. An Introduction to Bayesian Networks. Springer, 1996. [61] M.I. Jordan and C. M. Bishop, editors. An Introduction to Graphical Models. to be published, 2001. [62] M.I. Jordan, Z. Ghahramani, T.S. Jaakkola, and L.K. Saul. Learning in Graphical Models, chapter An Intro- duction to Variational Methods for Graphical Models. Kluwer Academic Publishers, 1998. [63] B.-H. Juang, W. Chou, and C.-H. Lee. Minimum classification error rate methods for speech recognition. IEEE Trans. on Speech and Audio Signal Processing, 5(3):257–265, May 1997. [64] K. Kirchhoff. Syllable-level desynchronisation of phonetic features for speech recognition. In Proceedings ICSLP 1996, 1996. [65] K. Kirchhoff. Robust Speech Recognition Using Articulatory Information. PhD thesis, University of Bielefeld, Germany, 1999. [66] K. Kjaerulff. Triangulation of graphs - algorithms giving small total space. Technical Report R90-09, Depart- ment of Mathematics and Computer Science. Aalborg University., 1990. [67] P. Krause. Learning probabilistic networks. Philips Research Labs Tech. Report., 1998. [68] F. R. Kschischang, B. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Trans. Inform. Theory, 47(2):498–519, 2001. [69] S.L. Lauritzen. Graphical Models. Oxford Science Publications, 1996. [70] C.-H. Lee, E. Giachin, L.R. Rabiner, R. Pieraccini, and A.E. Rosenberg. Improved acoustic modeling for speaker independent large vocabulary continuous speech recognition. Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 1991.

UWEETR-2001-0006 83 [71] H. Linhart and W. Zucchini. Model Selection. Wiley, 1986. [72] J. Luettin, G. Potamianos, and C. Neti. Asynchronous stream modeling for large vocabulary audio-visual speech recognition. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 2001. [73] D.J.C. MacKay. Learning in Graphical Models, chapter Introduction to Monte Carlo Methods. Kluwer Aca- demic Publishers, 1998. [74] D. McAllaster, L. Gillick, F. Scattone, and M. Newman. Fabricating conversational speech data with acoustic models: A program to examine model-data mismatch. In ICSLP, 1998. [75] D. McAllaster, L. Gillick, F. Scattone, and M. Newman. Studies with fabricated switchboard data: Exploring sources of model-data mismatch. In Proc. DARPA Workshop Conversational Speech Recognition, Lansdowne, VA, 1998. [76] G.J. McLachlan. Discriminant Analysis and Statistical Pattern Recognition. Wiley Series in Probability and Statistics, 1992. [77] G.J. McLachlan and T. Krishnan. The EM Algorithm and Extensions. Wiley Series in Probability and Statistics, 1997. [78] C. Meek. Causal inference and causal explanation with background knowledge. In Besnard, Philippe and Steve Hanks, editors, Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence (UAI’95), pages 403–410, San Francisco, CA, USA, August 1995. Morgan Kaufmann Publishers. [79] M. Meila.˘ Learning with Mixtures of Trees. PhD thesis, MIT, 1999. [80] H. Meng. The use of distinctive features for automatic speech recognition. Master’s thesis, Massachusetts Institute of Technology, 1991. [81] K. Murphy. The Matlab bayesian network toolbox. http://www.cs.berkeley.edu/˜murphyk/Bayes/bnsoft.html. [82] Y. Normandin. An improved mmie training algorithm for speaker indepedendent, small vocabulary, continuous speech recognition. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 1991. [83] M. Ostendorf. Moving beyond the ‘beads-on-a-string’ model of speech. In Proc. IEEE Automatic Speech Recognition and Understanding Workshop, Keystone, CO, 1999. [84] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 2nd printing edition, 1988. [85] J. Pearl. Causality. Cambridge, 2000. [86] T. Poggio and F. Girosi. Networks for approximation and learning. Proc. IEEE, 78:1481–1497, September 1990. [87] L.R. Rabiner and B.-H. Juang. Fundamentals of Speech Recognition. Prentice Hall Signal Processing Series, 1993. [88] M. Richardson, J. Bilmes, and C. Diorio. Hidden-articulator markov models for speech recognition. In Proc. of the ISCA ITRW ASR2000 Workshop, Paris, France, 2000. LIMSI-CNRS. [89] M. Richardson, J. Bilmes, and C. Diorio. Hidden-articulator markov models: Performance improvements and robustness to noise. In Proc. Int. Conf. on Spoken Language Processing, Beijing, China, 2000. [90] T. S. Richardson. Learning in Graphical Models, chapter Chain Graphs and Symmetric Associations. Kluwer Academic Publishers, 1998. [91] M. Riley and A. Ljolje. Automatic Speech and Speaker Recognition, chapter Automatic generation of detailed pronunciation lexicons. Kluwer Academic Publishers, Boston, 1996.

UWEETR-2001-0006 84 [92] J. Rissanen. Stochastic complexity (with discussions). Journal of the Royal Statistical Society, 49:223–239,252– 265, 1987. [93] M. Saraclar and S. Khudanpur. Properties of pronunciation change in conversational speech recognition. In Proc. NIST Speech Transcription Workshop, College Park, MD, 2000. [94] L.K. Saul, T. Jaakkola, and M.I. Jordan. Mean field theory for sigmoid belief networks. JAIR, 4:61–76, 1996. [95] G. Schwartz. Estimating the dimension of a model. Annals of Statistics, 1978. [96] R.D. Shachter. Bayes-ball: The rational pastime for determining irrelevance and requisite information in belief networks and influence diagrams. In Uncertainty in Artificial Intelligence, 1998. [97] S. Sivadas and H. Hermansky. Hierarchical tandem feature extraction. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 2002. [98] P. Smyth, D. Heckerman, and M.I. Jordan. Probabilistic independence networks for hidden Markov probability models. Technical Report A.I. Memo No. 1565, C.B.C.L. Memo No. 132, MIT AI Lab and CBCL, 1996. [99] CMU : Open source speech recognition. http://www.speech.cs.cmu.edu/sphinx/Sphinx.html. [100] H. Tong. Non-linear Time Series: A Dynamical System Approach. Oxford Statistical Science Series 6. Oxford University Press, 1990. [101] V. Vapnik. Statistical Learning Theory. Wiley, 1998. [102] T. Verma and J. Pearl. Equivalence and synthesis of causal models. In Uncertainty in Artificial Intelligence. Morgan Kaufmann, 1990. [103] T. Verma and J. Pearl. An algorithm for deciding if a set of observed independencies has a causal explanation. In Uncertainty in Artificial Intelligence. Morgan Kaufmann, 1992. [104] Y. Weiss. Correctness of local probability propagation in graphical models with loops. Neural Computation, Submitted. [105] J. Whittaker. Graphical Models in Applied Multivariate Statistics. John Wiley and Son Ltd., 1990. [106] J.G. Wilpon, C.-H. Lee, and L.R. Rabiner. Improvements in connected digit recognition using higher order spectral and energy features. Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 1991. [107] P.C. Woodland and D. Povey. Large scale discriminative training for speech recognition. In ICSA ITRW ASR2000, 2000. [108] S. Young, J. Jansen, J. Odell, D. Ollason, and P. Woodland. The HTK Book. Entropic Labs and Cambridge University, 2.1 edition, 1990’s. [109] G. Zweig. Speech Recognition with Dynamic Bayesian Networks. PhD thesis, U.C. Berkeley, 1998. [110] G. Zweig and M. Padmanabhan. Exact alpha-beta computation in logarithmic space with application to map word graph construction. Int. Conf. on Spoken Lanugage Processing, 2000. [111] G. Zweig and S. Russell. Probabilistic modeling with bayesian networks for automatic speech recognition. Australian Journal of Intelligent Information Processing, 5(4):253–260, 1999.


Figure 68: The JHU WS01 GM Team Graph.
