Discriminative Syntactic Language Modeling for Speech Recognition

Michael Collins (MIT CSAIL), Brian Roark (OGI/OHSU), Murat Saraclar (Bogazici University)
[email protected]  [email protected]  [email protected]

Abstract

We describe a method for discriminative training of a language model that makes use of syntactic features. We follow a reranking approach, where a baseline recognizer is used to produce 1000-best output for each acoustic input, and a second "reranking" model is then used to choose an utterance from these 1000-best lists. The reranking model makes use of syntactic features together with a parameter estimation method that is based on the perceptron algorithm. We describe experiments on the Switchboard speech recognition task. The syntactic features provide an additional 0.3% reduction in test-set error rate beyond the model of (Roark et al., 2004a; Roark et al., 2004b) (significant at p < 0.001), which makes use of a discriminatively trained n-gram model, giving a total reduction of 1.2% over the baseline Switchboard system.

1 Introduction

The predominant approach within language modeling for speech recognition has been to use an n-gram language model, within the "source-channel" or "noisy-channel" paradigm. The language model assigns a probability Pl(w) to each string w in the language; the acoustic model assigns a conditional probability Pa(a|w) to each pair (a, w), where a is a sequence of acoustic vectors and w is a string. For a given acoustic input a, the highest scoring string under the model is

    w* = argmax_w ( β log Pl(w) + log Pa(a|w) )    (1)

where β > 0 is some value that reflects the relative importance of the language model; β is typically chosen by optimization on held-out data. In an n-gram language model, a Markov assumption is made, namely that each word depends only on the previous (n − 1) words. The parameters of the language model are usually estimated from a large quantity of text data. See (Chen and Goodman, 1998) for an overview of estimation techniques for n-gram models.

This paper describes a method for incorporating syntactic features into the language model, using discriminative parameter estimation techniques. We build on the work in Roark et al. (2004a; 2004b), which was summarized and extended in Roark et al. (2005). These papers used discriminative methods for n-gram language models. Our approach reranks the 1000-best output from the Switchboard recognizer of Ljolje et al. (2003).[1] Each candidate string w is parsed using the statistical parser of Collins (1999) to give a parse tree T(w). Information from the parse tree is incorporated in the model using a feature-vector approach: we define Φ(a, w) to be a d-dimensional feature vector which in principle could track arbitrary features of the string w together with the acoustic input a. In this paper we restrict Φ(a, w) to only consider the string w and/or the parse tree T(w) for w. For example, Φ(a, w) might track counts of context-free rule productions in T(w), or bigram lexical dependencies within T(w). The optimal string under our new model is defined as

    w* = argmax_w ( β log Pl(w) + ⟨ᾱ, Φ(a, w)⟩ + log Pa(a|w) )    (2)

where the argmax is taken over all strings in the 1000-best list, and where ᾱ ∈ R^d is a parameter vector specifying the "weight" for each feature in Φ (note that we define ⟨x, y⟩ to be the inner, or dot, product between vectors x and y).

[1] Note that (Roark et al., 2004a; Roark et al., 2004b) give results for an n-gram approach on this data which makes use of both lattices and 1000-best lists. The results on 1000-best lists were very close to results on lattices for this domain, suggesting that the 1000-best approximation is a reasonable one.
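To make the decision rule in Eq. 2 concrete, the following is a minimal sketch in Python of reranking an n-best list. The Candidate record, its field names, and the rerank function are illustrative assumptions for this sketch, not part of the system described in the paper.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Candidate:
    words: str                  # hypothesis string w from the 1000-best list
    log_pl: float               # language model score, log Pl(w)
    log_pa: float               # acoustic model score, log Pa(a|w)
    features: Dict[str, float]  # sparse Phi(a, w), e.g. syntactic feature counts

def rerank(nbest: List[Candidate], alpha: Dict[str, float], beta: float) -> Candidate:
    """Return the candidate maximizing Eq. 2:
    beta * log Pl(w) + <alpha, Phi(a, w)> + log Pa(a|w)."""
    def score(c: Candidate) -> float:
        dot = sum(alpha.get(name, 0.0) * value for name, value in c.features.items())
        return beta * c.log_pl + dot + c.log_pa
    return max(nbest, key=score)
```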
For this paper, we train the parameter vector ᾱ using the perceptron algorithm (Collins, 2004; Collins, 2002). The perceptron algorithm is a very fast training method, in practice requiring only a few passes over the training set, allowing for a detailed comparison of a wide variety of feature sets.

A number of researchers have described work that incorporates syntactic language models into a speech recognizer. These methods have almost exclusively worked within the noisy-channel paradigm, where the syntactic language model has the task of modeling a distribution over strings in the language, in a very similar way to traditional n-gram language models. The Structured Language Model (Chelba and Jelinek, 1998; Chelba and Jelinek, 2000; Chelba, 2000; Xu et al., 2002; Xu et al., 2003) makes use of an incremental shift-reduce parser to enable the probability of words to be conditioned on k previous c-commanding lexical heads, rather than simply on the previous k words. Incremental top-down and left-corner parsing (Roark, 2001a; Roark, 2001b) and head-driven parsing (Charniak, 2001) approaches have directly used generative PCFG models as language models. In the work of Wen Wang and Mary Harper (Wang and Harper, 2002; Wang, 2003; Wang et al., 2004), a constraint dependency grammar and a finite-state tagging model derived from that grammar were used to exploit syntactic dependencies.

Our approach differs from previous work in a couple of important respects. First, through the feature-vector representations Φ(a, w) we can essentially incorporate arbitrary sources of information from the string or parse tree into the model. We would argue that our method allows considerably more flexibility in terms of the choice of features in the model; in previous work features were incorporated in the model through modification of the underlying generative parsing or tagging model, and modifying a generative model is a rather indirect way of changing the features used by a model. In this respect, our approach is similar to that advocated in Rosenfeld et al. (2001), which used Maximum Entropy modeling to allow for the use of shallow syntactic features for language modeling.

A second contrast between our work and previous work, including that of Rosenfeld et al. (2001), is in the use of discriminative parameter estimation techniques. The criterion we use to optimize the parameter vector ᾱ is closely related to the end goal in speech recognition, i.e., word error rate. Previous work (Roark et al., 2004a; Roark et al., 2004b) has shown that discriminative methods within an n-gram approach can lead to significant reductions in WER, in spite of the features being of the same type as those in the original language model. In this paper we extend this approach by including syntactic features that were not in the baseline speech recognizer.

This paper describes experiments using a variety of syntactic features within this approach. We tested the model on the Switchboard (SWB) domain, using the recognizer of Ljolje et al. (2003). The discriminative approach for n-gram modeling gave a 0.9% reduction in WER on this domain; the syntactic features we describe give a further 0.3% reduction.

In the remainder of this paper, section 2 describes previous work, including the parameter estimation methods we use, and section 3 describes the feature-vector representations of parse trees that we used in our experiments. Section 4 describes experiments using the approach.

2 Background

2.1 Previous Work

Techniques for exploiting stochastic context-free grammars for language modeling have been explored for more than a decade. Early approaches included algorithms for efficiently calculating string prefix probabilities (Jelinek and Lafferty, 1991; Stolcke, 1995) and approaches to exploit such algorithms to produce n-gram models (Stolcke and Segal, 1994; Jurafsky et al., 1995).
The work of Chelba and Jelinek (Chelba and Jelinek, 1998; Chelba and Jelinek, 2000; Chelba, 2000) involved the use of a shift-reduce parser trained on Penn treebank style annotations, which maintains a weighted set of parses as it traverses the string from left to right. Each word is predicted by each candidate parse in this set at the point when the word is shifted, and the conditional probability of the word given the previous words is taken as the weighted sum of the conditional probabilities provided by each parse. In this approach, the probability of a word is conditioned on the top two lexical heads on the stack of the particular parse. Enhancements in the feature set and improved parameter estimation techniques have extended this approach in recent years (Xu et al., 2002; Xu et al., 2003).

Roark (2001a; 2001b) pursued a different derivation strategy from Chelba and Jelinek, and used the parse probabilities directly to calculate the string probabilities. This work made use of a left-to-right, top-down, beam-search parser, which exploits rich lexico-syntactic features from the left context of each derivation to condition derivation move probabilities, leading to a very peaked distribution. Rather than normalizing a prediction of the next word over the beam of candidates, as in Chelba and Jelinek, in this approach the string probability is derived by simply summing the probabilities of all derivations for that string in the beam.

Other work on syntactic language modeling includes that of Charniak (2001), which made use of a non-incremental, head-driven statistical parser to produce string probabilities. In the work of Wen Wang and Mary Harper (Wang and Harper, 2002; Wang, 2003; Wang et al., 2004), a constraint dependency grammar and a finite-state tagging model derived from that grammar were used to exploit syntactic dependencies. The processing advantages of the finite-state encoding of the model have allowed probabilities calculated off-line from this model to be used in the first pass of decoding, which has provided additional benefits. Finally, Och et al. (2004) use a reranking approach with syntactic information within a machine translation system.

Rosenfeld et al. (2001) investigated the use of syntactic features in a Maximum Entropy approach. In their paper, they used a shallow parser to annotate base constituents, and derived features from sequences of base constituents. The features were indicator features that were either (1) exact matches between a set or sequence of base constituents and those annotated on the hypothesis transcription; or (2) tri-tag features from the constituent sequence. The generative model that resulted from their feature set gave only a very small improvement in either perplexity or word-error rate.
2.2 Global Linear Models

We follow the framework of Collins (2002; 2004), recently applied to language modeling in Roark et al. (2004a; 2004b). The model we propose consists of the following components:

• GEN(a) is a set of candidate strings for an acoustic input a. In our case, GEN(a) is a set of 1000-best strings from a first-pass recognizer.
• T(w) is the parse tree for string w.
• Φ(a, w) ∈ R^d is a feature-vector representation of an acoustic input a together with a string w.
• ᾱ ∈ R^d is a parameter vector.
• The output of the recognizer for an input a is defined as

    F(a) = argmax_{w ∈ GEN(a)} ⟨Φ(a, w), ᾱ⟩    (3)

In principle, the feature vector Φ(a, w) could take into account any features of the acoustic input a together with the utterance w. In this paper we make a couple of restrictions. First, we define the first feature to be

    Φ1(a, w) = β log Pl(w) + log Pa(a|w)

where Pl(w) and Pa(a|w) are language and acoustic model scores from the baseline speech recognizer. In our experiments we kept β fixed at the value used in the baseline recognizer. It can then be seen that our model is equivalent to the model in Eq. 2. Second, we restrict the remaining features Φ2(a, w) ... Φd(a, w) to be sensitive to the string w alone.[2] In this sense, the scope of this paper is limited to the language modeling problem. As one example, the language modeling features might take into account n-grams, for example through definitions such as

    Φ2(a, w) = count of "the the" in w

Previous work (Roark et al., 2004a; Roark et al., 2004b) considered features of this type. In this paper, we introduce syntactic features, which may be sensitive to the parse tree for w, for example

    Φ3(a, w) = count of S → NP VP in T(w)

where S → NP VP is a context-free rule production. Section 3 describes the full set of features used in the empirical results presented in this paper.

[2] Future work may consider features of the acoustic sequence a together with the string w, allowing the approach to be applied to acoustic modeling.
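As a rough illustration of how such a sparse feature vector might be represented, the sketch below builds Φ(a, w) as a Python dictionary, with the fixed-weight baseline score as the first feature and simple count features over the string and its parse. The function name, feature-name strings, and the assumption that context-free rules arrive as preformatted strings are all inventions of this sketch, not the paper's implementation.

```python
from typing import Dict, List

def build_feature_vector(log_pl: float, log_pa: float, beta: float,
                         words: List[str], cf_rules: List[str]) -> Dict[str, float]:
    """Sparse Phi(a, w): Phi_1 is the baseline recognizer score with beta held fixed;
    the remaining features are counts of events in w and its parse T(w)."""
    phi: Dict[str, float] = {"Phi1:baseline": beta * log_pl + log_pa}
    # n-gram features over the string w, e.g. Phi_2 = count of "the the"
    for w1, w2 in zip(words, words[1:]):
        key = "bigram:%s %s" % (w1, w2)
        phi[key] = phi.get(key, 0.0) + 1.0
    # syntactic features, e.g. Phi_3 = count of the rule "S -> NP VP" in T(w)
    for rule in cf_rules:
        key = "rule:" + rule
        phi[key] = phi.get(key, 0.0) + 1.0
    return phi

def linear_score(phi: Dict[str, float], alpha: Dict[str, float]) -> float:
    """Inner product <Phi(a, w), alpha> used in Eq. 3; the weight stored under
    "Phi1:baseline" plays the role of alpha_1, the weight on the baseline score."""
    return sum(alpha.get(k, 0.0) * v for k, v in phi.items())
```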
2.2.1 Parameter Estimation

We now describe how the parameter vector ᾱ is estimated from a set of training utterances. The training set consists of examples (a_i, w_i) for i = 1...m, where a_i is the i'th acoustic input and w_i is the transcription of this input. We briefly review the two training algorithms described in Roark et al. (2004b), the perceptron algorithm and global conditional log-linear models (GCLMs).

Figure 1 shows the perceptron algorithm. It is an online algorithm, which makes several passes over the training set, updating the parameter vector after each training example. For a full description of the algorithm, see Collins (2004; 2002).

Figure 1: The perceptron training algorithm.
    Input: A parameter specifying the number of iterations over the training set, T. A value for the first parameter, α. A feature-vector representation Φ(a, w) ∈ R^d. Training examples (a_i, w_i) for i = 1...m. An n-best list GEN(a_i) for each training utterance. We take s_i to be the member of GEN(a_i) which has the lowest WER when compared to w_i.
    Initialization: Set ᾱ_1 = α, and ᾱ_j = 0 for j = 2...d.
    Algorithm: For t = 1...T, i = 1...m:
      • Calculate y_i = argmax_{w ∈ GEN(a_i)} ⟨Φ(a_i, w), ᾱ⟩
      • For j = 2...d, set ᾱ_j = ᾱ_j + Φ_j(a_i, s_i) − Φ_j(a_i, y_i)
    Output: Either the final parameters ᾱ, or the averaged parameters ᾱ_avg defined as ᾱ_avg = Σ_{t,i} ᾱ^{t,i} / (mT), where ᾱ^{t,i} is the parameter vector after training on the i'th training example on the t'th pass through the training data.
Following Roark et al. (2004a), the parameter α_1 is set to be some constant α that is typically chosen through optimization over the development set. Recall that α_1 dictates the weight given to the baseline recognizer score.
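A minimal sketch of the training loop in Figure 1 is given below, assuming each example has been preprocessed into a list of sparse feature dictionaries (one per candidate in GEN(a_i)) together with the index of the lowest-WER candidate s_i. The data layout and the "Phi1:baseline" feature name are assumptions of this sketch, not the paper's implementation.

```python
from typing import Dict, List, Tuple

BASELINE = "Phi1:baseline"   # the fixed first feature/weight (alpha_1 in Figure 1)

def perceptron_train(train: List[Tuple[List[Dict[str, float]], int]],
                     alpha0: float, T: int) -> Dict[str, float]:
    """Each training example is (candidates, oracle): `candidates` holds the sparse
    vectors Phi(a_i, w) for w in GEN(a_i), and `oracle` indexes s_i, the candidate
    with the lowest WER against the reference transcription w_i."""
    alpha: Dict[str, float] = {BASELINE: alpha0}
    total: Dict[str, float] = {}      # running sum for the averaged parameters
    examples_seen = 0
    for t in range(T):
        for candidates, oracle in train:
            # y_i = argmax_{w in GEN(a_i)} <Phi(a_i, w), alpha>
            yi = max(range(len(candidates)),
                     key=lambda k: sum(alpha.get(f, 0.0) * v
                                       for f, v in candidates[k].items()))
            if yi != oracle:
                # alpha_j += Phi_j(a_i, s_i) - Phi_j(a_i, y_i), for j >= 2 only
                for f, v in candidates[oracle].items():
                    if f != BASELINE:
                        alpha[f] = alpha.get(f, 0.0) + v
                for f, v in candidates[yi].items():
                    if f != BASELINE:
                        alpha[f] = alpha.get(f, 0.0) - v
            examples_seen += 1
            for f, v in alpha.items():     # accumulate alpha^{t,i} for averaging
                total[f] = total.get(f, 0.0) + v
    # averaged parameters: alpha_avg = sum_{t,i} alpha^{t,i} / (m * T)
    return {f: v / examples_seen for f, v in total.items()}
```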

A second parameter estimation method, which was used in (Roark et al., 2004b), is to optimize the log-likelihood under a log-linear model. Similar approaches have been described in Johnson et al. (1999) and Lafferty et al. (2001). The objective function used in optimizing the parameters is

    L(ᾱ) = Σ_i log P(s_i | a_i, ᾱ) − C Σ_j α_j²    (4)

where

    P(s_i | a_i, ᾱ) = exp(⟨Φ(a_i, s_i), ᾱ⟩) / Σ_{w ∈ GEN(a_i)} exp(⟨Φ(a_i, w), ᾱ⟩).

Here, each s_i is the member of GEN(a_i) which has lowest WER with respect to the target transcription w_i. The first term in L(ᾱ) is the log-likelihood of the training data under a conditional log-linear model. The second term is a regularization term which penalizes large parameter values. C is a constant that dictates the relative weighting given to the two terms. The optimal parameters are defined as

    ᾱ* = argmax_ᾱ L(ᾱ)

We refer to these models as global conditional log-linear models (GCLMs).

Each of these algorithms has advantages. A number of results, e.g., in Sha and Pereira (2003) and Roark et al. (2004b), suggest that the GCLM approach leads to slightly higher accuracy than the perceptron training method. However, the perceptron converges very quickly, often in just a few passes over the training set; in comparison, GCLMs can take tens or hundreds of gradient calculations before convergence. In addition, the perceptron can be used as an effective feature selection technique, in that at each training example it only increments features seen on s_i or y_i, effectively ignoring all other features seen on members of GEN(a_i). For example, in the experiments in Roark et al. (2004a), the perceptron converged in around 3 passes over the training set, while picking non-zero values for around 1.4 million n-gram features out of a possible 41 million n-gram features seen in the training set.

For the present paper, to get a sense of the relative effectiveness of various kinds of syntactic features that can be derived from the output of a parser, we report results using just the perceptron algorithm. This has allowed us to explore more of the potential feature space than we would have been able to using the more costly GCLM estimation techniques. In future we plan to apply GCLM parameter estimation methods to the task.

3 Parse Tree Features

We tagged each candidate transcription with (1) part-of-speech tags, using the tagger documented in Collins (2002); and (2) a full parse tree, using the parser documented in Collins (1999). The models for both of these were trained on the Switchboard treebank, and applied to candidate transcriptions in both the training and test sets. Each transcription received one POS-tag annotation and one parse tree annotation, from which features were extracted.

Figure 2 shows a Penn Treebank style parse tree of the sort produced by the parser.

Figure 2: An example parse tree, for the string "we helped her paint the house":
    (S (NP (PRP we))
       (VP (VBD helped)
           (NP (PRP her))
           (VP (VB paint)
               (NP (DT the) (NN house)))))

Given such a structure, there is a tremendous amount of flexibility in selecting features. The first approach that we follow is to map each parse tree to sequences encoding part-of-speech (POS) decisions, and "shallow" parsing decisions. Similar representations have been used by (Rosenfeld et al., 2001; Wang and Harper, 2002). Figure 3 shows the sequential representations that we used. The first simply makes use of the POS tags for each word. The latter representations make use of sequences of non-terminals associated with lexical items. In 3(b), each word in the string is associated with the beginning or continuation of a shallow phrase or "chunk" in the tree. We include any non-terminals above the level of POS tags as potential chunks: a new "chunk" (VP, NP, PP, etc.) begins whenever we see the initial word of the phrase dominated by the non-terminal. In 3(c), we show how POS tags can be added to these sequences. The final type of sequence mapping, shown in 3(d), makes a similar use of chunks, but preserves only the head-word seen with each chunk.[3]

Figure 3: Sequences derived from the parse tree in figure 2:
    (a) POS-tag sequence: we/PRP helped/VBD her/PRP paint/VB the/DT house/NN
    (b) Shallow parse tag sequence (the superscripts b and c refer to the beginning and continuation of a phrase, respectively): we/NPb helped/VPb her/NPb paint/VPb the/NPb house/NPc
    (c) Shallow parse tag plus POS tag sequence: we/PRP-NPb helped/VBD-VPb her/PRP-NPb paint/VB-VPb the/DT-NPb house/NN-NPc
    (d) Shallow category with lexical head sequence: we/NP helped/VP her/NP paint/VP house/NP

[3] It should be noted that for a very small percentage of hypotheses, the parser failed to return a full parse tree. At the end of every shallow tag or category sequence, a special end-of-sequence tag/word pair was emitted; when a parse failed, the sequence consisted of a single special tag/word pair.

From these sequences of categories, various features can be extracted, to go along with the n-gram features used in the baseline. These include n-tag features, e.g. t_{i−2}t_{i−1}t_i (where t_i represents the tag in position i); and composite tag/word features, e.g. t_i w_i (where w_i represents the word in position i) or more complicated configurations, such as t_{i−2}t_{i−1}w_{i−1}t_i w_i. These features can be extracted from whatever sort of tag/word sequence we provide for feature extraction, e.g. POS-tag sequences or shallow parse tag sequences.
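As an illustration of the n-tag and tag/word templates just described, the sketch below counts such features over any of the tag/word sequences of Figure 3. The feature-name strings and padding convention are invented for this example.

```python
from collections import Counter
from typing import List, Tuple

def sequence_features(tagged: List[Tuple[str, str]]) -> Counter:
    """Counts of n-tag and composite tag/word features over a (word, tag)
    sequence such as the POS-tag or shallow-parse sequences of Figure 3."""
    feats: Counter = Counter()
    padded = [("<s>", "<s>"), ("<s>", "<s>")] + tagged   # pad the left context
    for i in range(2, len(padded)):
        (w2, t2), (w1, t1), (w0, t0) = padded[i - 2], padded[i - 1], padded[i]
        feats["t:" + t0] += 1                          # (t_i)
        feats["tt:" + t1 + "_" + t0] += 1              # (t_{i-1} t_i)
        feats["ttt:" + t2 + "_" + t1 + "_" + t0] += 1  # (t_{i-2} t_{i-1} t_i)
        feats["tw:" + t0 + "_" + w0] += 1              # (t_i w_i)
        # composite template (t_{i-2} t_{i-1} w_{i-1} t_i w_i)
        feats["ttwtw:" + "_".join([t2, t1, w1, t0, w0])] += 1
    return feats

# e.g. sequence_features([("we", "PRP"), ("helped", "VBD"), ("her", "PRP"),
#                         ("paint", "VB"), ("the", "DT"), ("house", "NN")])
```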
One variant that we performed in feature extraction had to do with how speech repairs (identified as EDITED constituents in the Switchboard style parse trees) and filled pauses or interjections (labeled with the INTJ label) were dealt with. In the simplest version, these are simply treated like other constituents in the parse tree. However, they can disrupt what may be termed the intended sequence of syntactic categories in the utterance, so we also tried skipping these constituents when mapping from the parse tree to shallow parse sequences.

The second set of features we employed made use of the full parse tree when extracting features. For this paper, we examined several feature templates of this type. First, we considered context-free rule instances, extracted from each local node in the tree. Second, we considered features based on lexical heads within the tree. Let us first distinguish between POS-tags and non-POS non-terminal categories by calling these latter constituents NTs. For each constituent NT in the tree, there is an associated lexical head (H_NT) and the POS-tag of that lexical head (HP_NT). Two simple features are NT/H_NT and NT/HP_NT for every NT constituent in the tree. Using the heads as identified in the parser, example features from the tree in figure 2 would be S/VBD, S/helped, NP/NN, and NP/house.

Beyond these constituent/head features, we can look at the head-to-head dependencies of the sort used by the parser. Consider each local tree, consisting of a parent node (P), a head child (HC_P), and k non-head children (C_1 ... C_k). Each non-head child C_i is either to the left or right of HC_P, and is either adjacent or non-adjacent to HC_P. We denote these positional relations with an integer, positive if to the right, negative if to the left, 1 if adjacent, and 2 if non-adjacent. Table 1 shows four head-to-head features that can be extracted for each non-head child C_i. These features include dependencies between pairs of lexical items, between a single lexical item and the part-of-speech of another item, and between pairs of part-of-speech tags in the parse.

Table 1: Examples of head-to-head features. The examples are derived from the tree in figure 2.
    Feature                                     Examples from figure 2
    (P, HC_P, C_i, {+,-}{1,2}, H_P, H_Ci)       (VP, VB, NP, 1, paint, house); (S, VP, NP, -1, helped, we)
    (P, HC_P, C_i, {+,-}{1,2}, H_P, HP_Ci)      (VP, VB, NP, 1, paint, NN); (S, VP, NP, -1, helped, PRP)
    (P, HC_P, C_i, {+,-}{1,2}, HP_P, H_Ci)      (VP, VB, NP, 1, VB, house); (S, VP, NP, -1, VBD, we)
    (P, HC_P, C_i, {+,-}{1,2}, HP_P, HP_Ci)     (VP, VB, NP, 1, VB, NN); (S, VP, NP, -1, VBD, PRP)
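To make the full-parse templates concrete, the sketch below walks a head-annotated tree and counts context-free rule features, the constituent/head features NT/H_NT and NT/HP_NT, and two of the four head-to-head templates from Table 1. The Node class and its head_index convention are assumptions of this sketch, not the parser's actual data structures.

```python
from collections import Counter
from typing import List, Optional

class Node:
    """A parse tree node: POS-tag nodes carry a word; NT nodes carry children
    and the index of their head child (as assigned by the parser)."""
    def __init__(self, label: str, children: Optional[List["Node"]] = None,
                 word: Optional[str] = None, head_index: int = 0):
        self.label, self.word = label, word
        self.children, self.head_index = children or [], head_index

    def head_word(self) -> str:
        return self.word if self.word is not None else self.children[self.head_index].head_word()

    def head_tag(self) -> str:
        return self.label if self.word is not None else self.children[self.head_index].head_tag()

def tree_features(node: Node, feats: Optional[Counter] = None) -> Counter:
    """Context-free rule, constituent/head, and head-to-head feature counts."""
    if feats is None:
        feats = Counter()
    if node.word is not None:          # POS-tag node: no NT features here
        return feats
    # context-free rule instance, e.g. "S -> NP VP"
    feats["rule:%s -> %s" % (node.label, " ".join(c.label for c in node.children))] += 1
    # constituent/head features NT/H_NT and NT/HP_NT, e.g. S/helped and S/VBD
    feats["nt_hw:%s/%s" % (node.label, node.head_word())] += 1
    feats["nt_ht:%s/%s" % (node.label, node.head_tag())] += 1
    # head-to-head features: head child HC_P versus each non-head child C_i
    hc = node.children[node.head_index]
    for i, child in enumerate(node.children):
        if i == node.head_index:
            continue
        sign = 1 if i > node.head_index else -1
        pos = sign * (1 if abs(i - node.head_index) == 1 else 2)
        feats["h2h:(%s,%s,%s,%+d,%s,%s)" % (node.label, hc.label, child.label,
                                            pos, node.head_word(), child.head_word())] += 1
        feats["h2h:(%s,%s,%s,%+d,%s,%s)" % (node.label, hc.label, child.label,
                                            pos, node.head_word(), child.head_tag())] += 1
    for child in node.children:
        tree_features(child, feats)
    return feats

# e.g. the NP "the house" with NN as head child:
# Node("NP", [Node("DT", word="the"), Node("NN", word="house")], head_index=1)
```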

4 Experiments

The experimental set-up we use is very similar to that of Roark et al. (2004a; 2004b), and the extensions to that work in Roark et al. (2005). We make use of the Rich Transcription 2002 evaluation test set (rt02) as our development set, and use the Rich Transcription 2003 Spring evaluation CTS test set (rt03) as test set.
The rt02 set consists of 6081 sentences (63804 words) and has three subsets: Switchboard 1, Switchboard 2, and Switchboard Cellular. The rt03 set consists of 9050 sentences (76083 words) and has two subsets: Switchboard and Fisher.

The training set consists of 297580 transcribed utterances (3297579 words).[4] For each utterance, a weighted word-lattice was produced, representing alternative transcriptions, from the ASR system. The baseline ASR system that we are comparing against then performed a rescoring pass on these first-pass lattices, allowing for better silence modeling and replacing the trigram language model score with a 6-gram model. 1000-best lists were then extracted from these lattices. For each candidate in the 1000-best lists, we identified the number of edits (insertions, deletions or substitutions) for that candidate, relative to the "target" transcribed utterance. The oracle score for the 1000-best lists was 16.7%.

[4] Note that Roark et al. (2004a; 2004b; 2005) used 20854 of these utterances (249774 words) as held out data. In this work we simply use the rt02 test set as held out and development data.

To produce the word-lattices, each training utterance was processed by the baseline ASR system. In a naive approach, we would simply train the baseline system (i.e., an acoustic model and language model) on the entire training set, and then decode the training utterances with this system to produce lattices. We would then use these lattices with the perceptron algorithm. Unfortunately, this approach is likely to produce a set of training lattices that are very different from test lattices, in that they will have very low word-error rates, given that the lattice for each utterance was produced by a model that was trained on that utterance. To somewhat control for this, the training set was partitioned into 28 sets, and baseline Katz backoff trigram models were built for each set by including only transcripts from the other 27 sets. Lattices for each utterance were produced with an acoustic model that had been trained on the entire training set, but with a language model that was trained on the 27 data portions that did not include the current utterance. Since language models are generally far more prone to overtraining than standard acoustic models, this goes a long way toward making the training conditions similar to testing conditions. Similar procedures were used to train the parsing and tagging models for the training set, since the Switchboard treebank overlaps extensively with the ASR training utterances.

Table 2 presents the word-error rates on rt02 and rt03 of the baseline ASR system, the 1000-best perceptron and GCLM results from Roark et al. (2005) under this condition, and our 1000-best perceptron results. Note that our n-best result, using just n-gram features, improves upon the perceptron result of (Roark et al., 2005) by 0.2 percent, putting us within 0.1 percent of their GCLM result for that condition. (Note that the perceptron-trained n-gram features were trigrams, i.e., n = 3.) This is due to a larger training set being used in our experiments; we have added data that was used as held-out data in (Roark et al., 2005) to the training set that we use.

Table 2: Baseline word-error rates versus Roark et al. (2005).
    Trial                             WER rt02   WER rt03
    ASR system output                   37.1       36.4
    Roark et al. (2005) perceptron      36.6       35.7
    Roark et al. (2005) GCLM            36.3       35.4
    n-gram perceptron                   36.4       35.5

The first additional features that we experimented with were POS-tag sequence derived features. Let t_i and w_i be the POS tag and word at position i, respectively. We experimented with the following three feature definitions:

    1. (t_{i−2}t_{i−1}t_i), (t_{i−1}t_i), (t_i), (t_i w_i)
    2. (t_{i−2}t_{i−1}w_i)
    3. (t_{i−2}w_{i−2}t_{i−1}w_{i−1}t_i w_i), (t_{i−2}t_{i−1}w_{i−1}t_i w_i), (t_{i−1}w_{i−1}t_i w_i), (t_{i−1}t_i w_i)

Table 3 summarizes the results of these trials on the held out set. Using the simple features (number 1 above) yielded an improvement beyond just n-grams, but additional, more complicated features failed to yield further improvements.

Table 3: Use of POS-tag sequence derived features.
    Trial                            WER rt02
    ASR system output                  37.1
    n-gram perceptron                  36.4
    n-gram + POS (1) perceptron        36.1
    n-gram + POS (1,2) perceptron      36.1
    n-gram + POS (1,3) perceptron      36.1

Next, we considered features derived from shallow parsing sequences. Given the results from the POS-tag sequence derived features, for any given sequence we simply use n-tag and tag/word features (number 1 above). The first sequence type from which we extracted features was the shallow parse tag sequence (S1), as shown in figure 3(b). Next, we tried the composite shallow/POS tag sequence (S2), as in figure 3(c). Finally, we tried extracting features from the shallow constituent sequence (S3), as shown in figure 3(d). When EDITED and INTJ nodes are ignored, we refer to this condition as S3-E. For full-parse feature extraction, we tried context-free rule features (CF) and head-to-head features (H2H), of the kind shown in table 1. Table 4 shows the results of these trials on rt02.

Table 4: Use of shallow parse sequence and full parse derived features.
    Trial                              WER rt02
    ASR system output                    37.1
    n-gram perceptron                    36.4
    n-gram + POS perceptron              36.1
    n-gram + POS + S1 perceptron         36.1
    n-gram + POS + S2 perceptron         36.0
    n-gram + POS + S3 perceptron         36.0
    n-gram + POS + S3-E perceptron       36.0
    n-gram + POS + CF perceptron         36.1
    n-gram + POS + H2H perceptron        36.0

Although the single-digit precision in the table does not show it, the H2H trial, using features extracted from the full parses along with n-grams and POS-tag sequence features, was the best performing model on the held-out data, so we selected it for application to the rt03 test data. This yielded 35.2% WER, a reduction of 0.3% absolute over what was achieved with just n-grams, which is significant at p < 0.001,[5] reaching a total reduction of 1.2% over the baseline recognizer.

[5] We use the Matched Pair Sentence Segment test for WER, a standard measure of significance, to calculate this p-value.
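The edit counts and oracle hypotheses referred to above, like the choice of s_i in Figure 1, come down to word-level Levenshtein alignment against the reference transcription. The sketch below is illustrative only; it is not the scoring tool used to produce the WER figures reported in this section.

```python
from typing import List

def edit_distance(hyp: List[str], ref: List[str]) -> int:
    """Word-level Levenshtein distance: insertions, deletions, substitutions."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(prev[j] + 1,               # delete h
                           cur[j - 1] + 1,            # insert r
                           prev[j - 1] + (h != r)))   # substitute / match
        prev = cur
    return prev[-1]

def oracle(nbest: List[List[str]], ref: List[str]) -> int:
    """Index of the candidate with the fewest edits against the reference;
    this is the role played by s_i in the training algorithm of Figure 1."""
    return min(range(len(nbest)), key=lambda k: edit_distance(nbest[k], ref))
```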

5 Conclusion

The results presented in this paper are a first step in examining the potential utility of syntactic features for discriminative language modeling for speech recognition. We tried two possible sets of features derived from the full annotation, as well as a variety of possible feature sets derived from shallow parse and POS tag sequences, the best of which gave a small but significant improvement beyond what was provided by the n-gram features. Future work will include a further investigation of parser-derived features. In addition, we plan to explore the alternative parameter estimation methods described in (Roark et al., 2004a; Roark et al., 2004b), which were shown in this previous work to give further improvements over the perceptron.

References

Eugene Charniak. 2001. Immediate-head parsing for language models. In Proc. ACL.

Ciprian Chelba and Frederick Jelinek. 1998. Exploiting syntactic structure for language modeling. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, pages 225-231.

Ciprian Chelba and Frederick Jelinek. 2000. Structured language modeling. Computer Speech and Language, 14(4):283-332.

Ciprian Chelba. 2000. Exploiting Syntactic Structure for Natural Language Modeling. Ph.D. thesis, The Johns Hopkins University.

Stanley Chen and Joshua Goodman. 1998. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Harvard University.

Michael J. Collins. 1999. Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania.

Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proc. EMNLP, pages 1-8.

Michael Collins. 2004. Parameter estimation for statistical parsing models: Theory and practice of distribution-free methods. In Harry Bunt, John Carroll, and Giorgio Satta, editors, New Developments in Parsing Technology. Kluwer Academic Publishers, Dordrecht.

Frederick Jelinek and John Lafferty. 1991. Computation of the probability of initial substring generation by stochastic context-free grammars. Computational Linguistics, 17(3):315-323.

Mark Johnson, Stuart Geman, Steven Canon, Zhiyi Chi, and Stefan Riezler. 1999. Estimators for stochastic "unification-based" grammars. In Proc. ACL, pages 535-541.

Daniel Jurafsky, Chuck Wooters, Jonathan Segal, Andreas Stolcke, Eric Fosler, Gary Tajchman, and Nelson Morgan. 1995. Using a stochastic context-free grammar as a language model for speech recognition. In Proceedings of the IEEE Conference on Acoustics, Speech, and Signal Processing, pages 189-192.

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. ICML, pages 282-289, Williams College, Williamstown, MA, USA.

Andrej Ljolje, Enrico Bocchieri, Michael Riley, Brian Roark, Murat Saraclar, and Izhak Shafran. 2003. The AT&T 1xRT CTS system. In Rich Transcription Workshop.

Franz Josef Och, Daniel Gildea, Sanjeev Khudanpur, Anoop Sarkar, Kenji Yamada, Alex Fraser, Shankar Kumar, Libin Shen, David Smith, Katherine Eng, Viren Jain, Zhen Jin, and Dragomir Radev. 2004. A smorgasbord of features for statistical machine translation. In Proceedings of HLT-NAACL 2004.

Brian Roark, Murat Saraclar, and Michael Collins. 2004a. Corrective language modeling for large vocabulary ASR with the perceptron algorithm. In Proc. ICASSP, pages 749-752.

Brian Roark, Murat Saraclar, Michael Collins, and Mark Johnson. 2004b. Discriminative language modeling with conditional random fields and the perceptron algorithm. In Proc. ACL.

Brian Roark, Murat Saraclar, and Michael Collins. 2005. Discriminative n-gram language modeling. Computer Speech and Language, submitted.

Brian Roark. 2001a. Probabilistic top-down parsing and language modeling. Computational Linguistics, 27(2):249-276.

Brian Roark. 2001b. Robust Probabilistic Predictive Syntactic Processing. Ph.D. thesis, Brown University. http://arXiv.org/abs/cs/0105019.

Ronald Rosenfeld, Stanley Chen, and Xiaojin Zhu. 2001. Whole-sentence exponential language models: a vehicle for linguistic-statistical integration. In Computer Speech and Language.

Fei Sha and Fernando Pereira. 2003. Shallow parsing with conditional random fields. In Proceedings of the Human Language Technology Conference and Meeting of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), Edmonton, Canada.

Andreas Stolcke and Jonathan Segal. 1994. Precise n-gram probabilities from stochastic context-free grammars. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 74-79.

Andreas Stolcke. 1995. An efficient probabilistic context-free parsing algorithm that computes prefix probabilities. Computational Linguistics, 21(2):165-202.

Wen Wang and Mary P. Harper. 2002. The superARV language model: Investigating the effectiveness of tightly integrating multiple knowledge sources. In Proc. EMNLP, pages 238-247.

Wen Wang, Andreas Stolcke, and Mary P. Harper. 2004. The use of a linguistically motivated language model in conversational speech recognition. In Proc. ICASSP.

Wen Wang. 2003. Statistical parsing and language modeling based on constraint dependency grammar. Ph.D. thesis, Purdue University.

Peng Xu, Ciprian Chelba, and Frederick Jelinek. 2002. A study on richer syntactic dependencies for structured language modeling. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 191-198.

Peng Xu, Ahmad Emami, and Frederick Jelinek. 2003. Training connectionist models for the structured language model. In Proc. EMNLP, pages 160-167.