Discriminative Syntactic Language Modeling for Speech Recognition

Michael Collins (MIT CSAIL), Brian Roark (OGI/OHSU), Murat Saraclar (Bogazici University)
[email protected]  [email protected]  [email protected]

Abstract

We describe a method for discriminative training of a language model that makes use of syntactic features. We follow a reranking approach, where a baseline recognizer is used to produce 1000-best output for each acoustic input, and a second "reranking" model is then used to choose an utterance from these 1000-best lists. The reranking model makes use of syntactic features together with a parameter estimation method that is based on the perceptron algorithm. We describe experiments on the Switchboard speech recognition task. The syntactic features provide an additional 0.3% reduction in test-set error rate beyond the model of (Roark et al., 2004a; Roark et al., 2004b) (significant at p < 0.001), which makes use of a discriminatively trained n-gram model, giving a total reduction of 1.2% over the baseline Switchboard system.

1 Introduction

The predominant approach within language modeling for speech recognition has been to use an n-gram language model, within the "source-channel" or "noisy-channel" paradigm. The language model assigns a probability Pl(w) to each string w in the language; the acoustic model assigns a conditional probability Pa(a|w) to each pair (a, w), where a is a sequence of acoustic vectors and w is a string. For a given acoustic input a, the highest scoring string under the model is

    w* = argmax_w ( β log Pl(w) + log Pa(a|w) )    (1)

where β > 0 is some value that reflects the relative importance of the language model; β is typically chosen by optimization on held-out data. In an n-gram language model, a Markov assumption is made, namely that each word depends only on the previous (n − 1) words. The parameters of the language model are usually estimated from a large quantity of text data. See (Chen and Goodman, 1998) for an overview of estimation techniques for n-gram models.

This paper describes a method for incorporating syntactic features into the language model, using discriminative parameter estimation techniques. We build on the work in Roark et al. (2004a; 2004b), which was summarized and extended in Roark et al. (2005). These papers used discriminative methods for n-gram language models. Our approach reranks the 1000-best output from the Switchboard recognizer of Ljolje et al. (2003).[1] Each candidate string w is parsed using the statistical parser of Collins (1999) to give a parse tree T(w). Information from the parse tree is incorporated in the model using a feature-vector approach: we define Φ(a, w) to be a d-dimensional feature vector which in principle could track arbitrary features of the string w together with the acoustic input a. In this paper we restrict Φ(a, w) to only consider the string w and/or the parse tree T(w) for w. For example, Φ(a, w) might track counts of context-free rule productions in T(w), or bigram lexical dependencies within T(w). The optimal string under our new model is defined as

    w* = argmax_w ( β log Pl(w) + ⟨ᾱ, Φ(a, w)⟩ + log Pa(a|w) )    (2)

where the argmax is taken over all strings in the 1000-best list, and where ᾱ ∈ R^d is a parameter vector specifying the "weight" for each feature in Φ (note that we define ⟨x, y⟩ to be the inner, or dot, product between vectors x and y).

[1] Note that (Roark et al., 2004a; Roark et al., 2004b) give results for an n-gram approach on this data which makes use of both lattices and 1000-best lists. The results on 1000-best lists were very close to results on lattices for this domain, suggesting that the 1000-best approximation is a reasonable one.
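To make the decision rule in Eq. 2 concrete, the following is a minimal sketch in Python of reranking an n-best list. The Candidate record, its field names, and the rerank function are illustrative assumptions for this sketch, not part of the system described in the paper.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Candidate:
    words: str                  # hypothesis string w from the 1000-best list
    log_pl: float               # language model score, log Pl(w)
    log_pa: float               # acoustic model score, log Pa(a|w)
    features: Dict[str, float]  # sparse Phi(a, w), e.g. syntactic feature counts

def rerank(nbest: List[Candidate], alpha: Dict[str, float], beta: float) -> Candidate:
    """Return the candidate maximizing Eq. 2:
    beta * log Pl(w) + <alpha, Phi(a, w)> + log Pa(a|w)."""
    def score(c: Candidate) -> float:
        dot = sum(alpha.get(name, 0.0) * value for name, value in c.features.items())
        return beta * c.log_pl + dot + c.log_pa
    return max(nbest, key=score)
```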
For this paper, we train the parameter vector ᾱ using the perceptron algorithm (Collins, 2004; Collins, 2002). The perceptron algorithm is a very fast training method, in practice requiring only a few passes over the training set, allowing for a detailed comparison of a wide variety of feature sets.

A number of researchers have described work that incorporates syntactic language models into a speech recognizer. These methods have almost exclusively worked within the noisy-channel paradigm, where the syntactic language model has the task of modeling a distribution over strings in the language, in a very similar way to traditional n-gram language models. The Structured Language Model (Chelba and Jelinek, 1998; Chelba and Jelinek, 2000; Chelba, 2000; Xu et al., 2002; Xu et al., 2003) makes use of an incremental shift-reduce parser to enable the probability of words to be conditioned on k previous c-commanding lexical heads, rather than simply on the previous k words. Incremental top-down and left-corner parsing (Roark, 2001a; Roark, 2001b) and head-driven parsing (Charniak, 2001) approaches have directly used generative PCFG models as language models. In the work of Wen Wang and Mary Harper (Wang and Harper, 2002; Wang, 2003; Wang et al., 2004), a constraint dependency grammar and a finite-state tagging model derived from that grammar were used to exploit syntactic dependencies.

Our approach differs from previous work in a couple of important respects. First, through the feature-vector representations Φ(a, w) we can essentially incorporate arbitrary sources of information from the string or parse tree into the model. We would argue that our method allows considerably more flexibility in terms of the choice of features in the model; in previous work features were incorporated in the model through modification of the underlying generative parsing or tagging model, and modifying a generative model is a rather indirect way of changing the features used by a model. In this respect, our approach is similar to that advocated in Rosenfeld et al. (2001), which used Maximum Entropy modeling to allow for the use of shallow syntactic features for language modeling.

A second contrast between our work and previous work, including that of Rosenfeld et al. (2001), is in the use of discriminative parameter estimation techniques. The criterion we use to optimize the parameter vector ᾱ is closely related to the end goal in speech recognition, i.e., word error rate. Previous work (Roark et al., 2004a; Roark et al., 2004b) has shown that discriminative methods within an n-gram approach can lead to significant reductions in WER, in spite of the features being of the same type as those in the original language model. In this paper we extend this approach by including syntactic features that were not in the baseline speech recognizer.

This paper describes experiments using a variety of syntactic features within this approach. We tested the model on the Switchboard (SWB) domain, using the recognizer of Ljolje et al. (2003). The discriminative approach for n-gram modeling gave a 0.9% reduction in WER on this domain; the syntactic features we describe give a further 0.3% reduction.

In the remainder of this paper, section 2 describes previous work, including the parameter estimation methods we use, and section 3 describes the feature-vector representations of parse trees that we used in our experiments. Section 4 describes experiments using the approach.

2 Background

2.1 Previous Work

Techniques for exploiting stochastic context-free grammars for language modeling have been explored for more than a decade. Early approaches included algorithms for efficiently calculating string prefix probabilities (Jelinek and Lafferty, 1991; Stolcke, 1995) and approaches to exploit such algorithms to produce n-gram models (Stolcke and Segal, 1994; Jurafsky et al., 1995).
The work of Chelba and Jelinek (Chelba and Jelinek, 1998; Chelba and Jelinek, 2000; Chelba, 2000) involved the use of a shift-reduce parser trained on Penn treebank style annotations, which maintains a weighted set of parses as it traverses the string from left to right. Each word is predicted by each candidate parse in this set at the point when the word is shifted, and the conditional probability of the word given the previous words is taken as the weighted sum of the conditional probabilities provided by each parse. In this approach, the probability of a word is conditioned on the top two lexical heads on the stack of the particular parse. Enhancements in the feature set and improved parameter estimation techniques have extended this approach in recent years (Xu et al., 2002; Xu et al., 2003).

Roark (2001a; 2001b) pursued a different derivation strategy from Chelba and Jelinek, and used the parse probabilities directly to calculate the string probabilities. This work made use of a left-to-right, top-down, beam-search parser, which exploits rich lexico-syntactic features from the left context of each derivation to condition derivation move probabilities, leading to a very peaked distribution. Rather than normalizing a prediction of the next word over the beam of candidates, as in Chelba and Jelinek, in this approach the string probability is derived by simply summing the probabilities of all derivations for that string in the beam.

Other work on syntactic language modeling includes that of Charniak (2001), which made use of a non-incremental, head-driven statistical parser to produce string probabilities. In the work of Wen Wang and Mary Harper (Wang and Harper, 2002; Wang, 2003; Wang et al., 2004), a constraint dependency grammar and a finite-state tagging model derived from that grammar were used to exploit syntactic dependencies. The processing advantages of the finite-state encoding of the model have allowed probabilities calculated off-line from this model to be used in the first pass of decoding, which has provided additional benefits. Finally, Och et al. (2004) use a reranking approach with syntactic information within a machine translation system.

Rosenfeld et al. (2001) investigated the use of syntactic features in a Maximum Entropy approach. In their paper, they used a shallow parser to annotate base constituents, and derived features from sequences of base constituents. The features were indicator features that were either (1) exact matches between a set or sequence of base constituents and those annotated on the hypothesis transcription; or (2) tri-tag features from the constituent sequence. The generative model that resulted from their feature set gave only a very small improvement in either perplexity or word-error rate.
2.2 Global Linear Models

We follow the framework of Collins (2002; 2004), recently applied to language modeling in Roark et al. (2004a; 2004b). The model we propose consists of the following components:

• GEN(a) is a set of candidate strings for an acoustic input a. In our case, GEN(a) is a set of 1000-best strings from a first-pass recognizer.
• T(w) is the parse tree for string w.
• Φ(a, w) ∈ R^d is a feature-vector representation of an acoustic input a together with a string w.
• ᾱ ∈ R^d is a parameter vector.
• The output of the recognizer for an input a is defined as

    F(a) = argmax_{w ∈ GEN(a)} ⟨Φ(a, w), ᾱ⟩    (3)

In principle, the feature vector Φ(a, w) could take into account any features of the acoustic input a together with the utterance w. In this paper we make a couple of restrictions. First, we define the first feature to be

    Φ1(a, w) = β log Pl(w) + log Pa(a|w)

where Pl(w) and Pa(a|w) are language and acoustic model scores from the baseline speech recognizer. In our experiments we kept β fixed at the value used in the baseline recognizer. It can then be seen that our model is equivalent to the model in Eq. 2. Second, we restrict the remaining features Φ2(a, w) ... Φd(a, w) to be sensitive to the string w alone.[2] In this sense, the scope of this paper is limited to the language modeling problem. As one example, the language modeling features might take into account n-grams, for example through definitions such as

    Φ2(a, w) = count of "the the" in w

Previous work (Roark et al., 2004a; Roark et al., 2004b) considered features of this type. In this paper, we introduce syntactic features, which may be sensitive to the parse tree for w, for example

    Φ3(a, w) = count of S → NP VP in T(w)

where S → NP VP is a context-free rule production. Section 3 describes the full set of features used in the empirical results presented in this paper.

[2] Future work may consider features of the acoustic sequence a together with the string w, allowing the approach to be applied to acoustic modeling.
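As a rough illustration of how such a sparse feature vector might be represented, the sketch below builds Φ(a, w) as a Python dictionary, with the fixed-weight baseline score as the first feature and simple count features over the string and its parse. The function name, feature-name strings, and the assumption that context-free rules arrive as preformatted strings are all inventions of this sketch, not the paper's implementation.

```python
from typing import Dict, List

def build_feature_vector(log_pl: float, log_pa: float, beta: float,
                         words: List[str], cf_rules: List[str]) -> Dict[str, float]:
    """Sparse Phi(a, w): Phi_1 is the baseline recognizer score with beta held fixed;
    the remaining features are counts of events in w and its parse T(w)."""
    phi: Dict[str, float] = {"Phi1:baseline": beta * log_pl + log_pa}
    # n-gram features over the string w, e.g. Phi_2 = count of "the the"
    for w1, w2 in zip(words, words[1:]):
        key = "bigram:%s %s" % (w1, w2)
        phi[key] = phi.get(key, 0.0) + 1.0
    # syntactic features, e.g. Phi_3 = count of the rule "S -> NP VP" in T(w)
    for rule in cf_rules:
        key = "rule:" + rule
        phi[key] = phi.get(key, 0.0) + 1.0
    return phi

def linear_score(phi: Dict[str, float], alpha: Dict[str, float]) -> float:
    """Inner product <Phi(a, w), alpha> used in Eq. 3; the weight stored under
    "Phi1:baseline" plays the role of alpha_1, the weight on the baseline score."""
    return sum(alpha.get(k, 0.0) * v for k, v in phi.items())
```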
2.2.1 Parameter Estimation

We now describe how the parameter vector ᾱ is estimated from a set of training utterances. The training set consists of examples (a_i, w_i) for i = 1...m, where a_i is the i'th acoustic input and w_i is the transcription of this input. We briefly review the two training algorithms described in Roark et al. (2004b), the perceptron algorithm and global conditional log-linear models (GCLMs).

Figure 1 shows the perceptron algorithm. It is an online algorithm, which makes several passes over the training set, updating the parameter vector after each training example. For a full description of the algorithm, see Collins (2004; 2002).

Figure 1: The perceptron training algorithm.
    Input: A parameter specifying the number of iterations over the training set, T. A value for the first parameter, α. A feature-vector representation Φ(a, w) ∈ R^d. Training examples (a_i, w_i) for i = 1...m. An n-best list GEN(a_i) for each training utterance. We take s_i to be the member of GEN(a_i) which has the lowest WER when compared to w_i.
    Initialization: Set ᾱ_1 = α, and ᾱ_j = 0 for j = 2...d.
    Algorithm: For t = 1...T, i = 1...m:
      • Calculate y_i = argmax_{w ∈ GEN(a_i)} ⟨Φ(a_i, w), ᾱ⟩
      • For j = 2...d, set ᾱ_j = ᾱ_j + Φ_j(a_i, s_i) − Φ_j(a_i, y_i)
    Output: Either the final parameters ᾱ, or the averaged parameters ᾱ_avg defined as ᾱ_avg = Σ_{t,i} ᾱ^{t,i} / (mT), where ᾱ^{t,i} is the parameter vector after training on the i'th training example on the t'th pass through the training data.
Following Roark et al. (2004a), the parameter α_1 is set to be some constant α that is typically chosen through optimization over the development set. Recall that α_1 dictates the weight given to the baseline recognizer score.
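A minimal sketch of the training loop in Figure 1 is given below, assuming each example has been preprocessed into a list of sparse feature dictionaries (one per candidate in GEN(a_i)) together with the index of the lowest-WER candidate s_i. The data layout and the "Phi1:baseline" feature name are assumptions of this sketch, not the paper's implementation.

```python
from typing import Dict, List, Tuple

BASELINE = "Phi1:baseline"   # the fixed first feature/weight (alpha_1 in Figure 1)

def perceptron_train(train: List[Tuple[List[Dict[str, float]], int]],
                     alpha0: float, T: int) -> Dict[str, float]:
    """Each training example is (candidates, oracle): `candidates` holds the sparse
    vectors Phi(a_i, w) for w in GEN(a_i), and `oracle` indexes s_i, the candidate
    with the lowest WER against the reference transcription w_i."""
    alpha: Dict[str, float] = {BASELINE: alpha0}
    total: Dict[str, float] = {}      # running sum for the averaged parameters
    examples_seen = 0
    for t in range(T):
        for candidates, oracle in train:
            # y_i = argmax_{w in GEN(a_i)} <Phi(a_i, w), alpha>
            yi = max(range(len(candidates)),
                     key=lambda k: sum(alpha.get(f, 0.0) * v
                                       for f, v in candidates[k].items()))
            if yi != oracle:
                # alpha_j += Phi_j(a_i, s_i) - Phi_j(a_i, y_i), for j >= 2 only
                for f, v in candidates[oracle].items():
                    if f != BASELINE:
                        alpha[f] = alpha.get(f, 0.0) + v
                for f, v in candidates[yi].items():
                    if f != BASELINE:
                        alpha[f] = alpha.get(f, 0.0) - v
            examples_seen += 1
            for f, v in alpha.items():     # accumulate alpha^{t,i} for averaging
                total[f] = total.get(f, 0.0) + v
    # averaged parameters: alpha_avg = sum_{t,i} alpha^{t,i} / (m * T)
    return {f: v / examples_seen for f, v in total.items()}
```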

A second parameter estimation method, which was used in (Roark et al., 2004b), is to optimize the log-likelihood under a log-linear model. Similar approaches have been described in Johnson et al. (1999) and Lafferty et al. (2001). The objective function used in optimizing the parameters is

    L(ᾱ) = Σ_i log P(s_i | a_i, ᾱ) − C Σ_j α_j²    (4)

where

    P(s_i | a_i, ᾱ) = exp(⟨Φ(a_i, s_i), ᾱ⟩) / Σ_{w ∈ GEN(a_i)} exp(⟨Φ(a_i, w), ᾱ⟩).

Here, each s_i is the member of GEN(a_i) which has lowest WER with respect to the target transcription w_i. The first term in L(ᾱ) is the log-likelihood of the training data under a conditional log-linear model. The second term is a regularization term which penalizes large parameter values. C is a constant that dictates the relative weighting given to the two terms. The optimal parameters are defined as

    ᾱ* = argmax_ᾱ L(ᾱ)

We refer to these models as global conditional log-linear models (GCLMs).

Each of these algorithms has advantages. A number of results, e.g., in Sha and Pereira (2003) and Roark et al. (2004b), suggest that the GCLM approach leads to slightly higher accuracy than the perceptron training method. However, the perceptron converges very quickly, often in just a few passes over the training set; in comparison, GCLMs can take tens or hundreds of gradient calculations before convergence. In addition, the perceptron can be used as an effective feature selection technique, in that at each training example it only increments features seen on s_i or y_i, effectively ignoring all other features seen on members of GEN(a_i). For example, in the experiments in Roark et al. (2004a), the perceptron converged in around 3 passes over the training set, while picking non-zero values for around 1.4 million n-gram features out of a possible 41 million n-gram features seen in the training set.

For the present paper, to get a sense of the relative effectiveness of various kinds of syntactic features that can be derived from the output of a parser, we report results using just the perceptron algorithm. This has allowed us to explore more of the potential feature space than we would have been able to using the more costly GCLM estimation techniques. In future we plan to apply GCLM parameter estimation methods to the task.

3 Parse Tree Features

We tagged each candidate transcription with (1) part-of-speech tags, using the tagger documented in Collins (2002); and (2) a full parse tree, using the parser documented in Collins (1999). The models for both of these were trained on the Switchboard treebank, and applied to candidate transcriptions in both the training and test sets. Each transcription received one POS-tag annotation and one parse tree annotation, from which features were extracted.

Figure 2 shows a Penn Treebank style parse tree of the sort produced by the parser.

Figure 2: An example parse tree, for the string "we helped her paint the house":
    (S (NP (PRP we))
       (VP (VBD helped)
           (NP (PRP her))
           (VP (VB paint)
               (NP (DT the) (NN house)))))

Given such a structure, there is a tremendous amount of flexibility in selecting features. The first approach that we follow is to map each parse tree to sequences encoding part-of-speech (POS) decisions, and "shallow" parsing decisions. Similar representations have been used by (Rosenfeld et al., 2001; Wang and Harper, 2002). Figure 3 shows the sequential representations that we used. The first simply makes use of the POS tags for each word. The latter representations make use of sequences of non-terminals associated with lexical items. In 3(b), each word in the string is associated with the beginning or continuation of a shallow phrase or "chunk" in the tree. We include any non-terminals above the level of POS tags as potential chunks: a new "chunk" (VP, NP, PP, etc.) begins whenever we see the initial word of the phrase dominated by the non-terminal. In 3(c), we show how POS tags can be added to these sequences. The final type of sequence mapping, shown in 3(d), makes a similar use of chunks, but preserves only the head-word seen with each chunk.[3]

Figure 3: Sequences derived from the parse tree in figure 2:
    (a) POS-tag sequence: we/PRP helped/VBD her/PRP paint/VB the/DT house/NN
    (b) Shallow parse tag sequence (the superscripts b and c refer to the beginning and continuation of a phrase, respectively): we/NPb helped/VPb her/NPb paint/VPb the/NPb house/NPc
    (c) Shallow parse tag plus POS tag sequence: we/PRP-NPb helped/VBD-VPb her/PRP-NPb paint/VB-VPb the/DT-NPb house/NN-NPc
    (d) Shallow category with lexical head sequence: we/NP helped/VP her/NP paint/VP house/NP

[3] It should be noted that for a very small percentage of hypotheses, the parser failed to return a full parse tree. At the end of every shallow tag or category sequence, a special end-of-sequence tag/word pair was emitted; when a parse failed, the sequence consisted of a single special tag/word pair.

From these sequences of categories, various features can be extracted, to go along with the n-gram features used in the baseline. These include n-tag features, e.g. t_{i−2}t_{i−1}t_i (where t_i represents the tag in position i); and composite tag/word features, e.g. t_i w_i (where w_i represents the word in position i) or more complicated configurations, such as t_{i−2}t_{i−1}w_{i−1}t_i w_i. These features can be extracted from whatever sort of tag/word sequence we provide for feature extraction, e.g. POS-tag sequences or shallow parse tag sequences.
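As an illustration of the n-tag and tag/word templates just described, the sketch below counts such features over any of the tag/word sequences of Figure 3. The feature-name strings and padding convention are invented for this example.

```python
from collections import Counter
from typing import List, Tuple

def sequence_features(tagged: List[Tuple[str, str]]) -> Counter:
    """Counts of n-tag and composite tag/word features over a (word, tag)
    sequence such as the POS-tag or shallow-parse sequences of Figure 3."""
    feats: Counter = Counter()
    padded = [("<s>", "<s>"), ("<s>", "<s>")] + tagged   # pad the left context
    for i in range(2, len(padded)):
        (w2, t2), (w1, t1), (w0, t0) = padded[i - 2], padded[i - 1], padded[i]
        feats["t:" + t0] += 1                          # (t_i)
        feats["tt:" + t1 + "_" + t0] += 1              # (t_{i-1} t_i)
        feats["ttt:" + t2 + "_" + t1 + "_" + t0] += 1  # (t_{i-2} t_{i-1} t_i)
        feats["tw:" + t0 + "_" + w0] += 1              # (t_i w_i)
        # composite template (t_{i-2} t_{i-1} w_{i-1} t_i w_i)
        feats["ttwtw:" + "_".join([t2, t1, w1, t0, w0])] += 1
    return feats

# e.g. sequence_features([("we", "PRP"), ("helped", "VBD"), ("her", "PRP"),
#                         ("paint", "VB"), ("the", "DT"), ("house", "NN")])
```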
One variant that we performed in feature extraction had to do with how speech repairs (identified as EDITED constituents in the Switchboard style parse trees) and filled pauses or interjections (labeled with the INTJ label) were dealt with. In the simplest version, these are simply treated like other constituents in the parse tree. However, they can disrupt what may be termed the intended sequence of syntactic categories in the utterance, so we also tried skipping these constituents when mapping from the parse tree to shallow parse sequences.

The second set of features we employed made use of the full parse tree when extracting features. For this paper, we examined several feature templates of this type. First, we considered context-free rule instances, extracted from each local node in the tree. Second, we considered features based on lexical heads within the tree. Let us first distinguish between POS-tags and non-POS non-terminal categories by calling these latter constituents NTs. For each constituent NT in the tree, there is an associated lexical head (H_NT) and the POS-tag of that lexical head (HP_NT). Two simple features are NT/H_NT and NT/HP_NT for every NT constituent in the tree. Using the heads as identified in the parser, example features from the tree in figure 2 would be S/VBD, S/helped, NP/NN, and NP/house.

Beyond these constituent/head features, we can look at the head-to-head dependencies of the sort used by the parser. Consider each local tree, consisting of a parent node (P), a head child (HC_P), and k non-head children (C_1 ... C_k). Each non-head child C_i is either to the left or right of HC_P, and is either adjacent or non-adjacent to HC_P. We denote these positional relations with an integer, positive if to the right, negative if to the left, 1 if adjacent, and 2 if non-adjacent. Table 1 shows four head-to-head features that can be extracted for each non-head child C_i. These features include dependencies between pairs of lexical items, between a single lexical item and the part-of-speech of another item, and between pairs of part-of-speech tags in the parse.

Table 1: Examples of head-to-head features. The examples are derived from the tree in figure 2.
    Feature                                     Examples from figure 2
    (P, HC_P, C_i, {+,-}{1,2}, H_P, H_Ci)       (VP, VB, NP, 1, paint, house); (S, VP, NP, -1, helped, we)
    (P, HC_P, C_i, {+,-}{1,2}, H_P, HP_Ci)      (VP, VB, NP, 1, paint, NN); (S, VP, NP, -1, helped, PRP)
    (P, HC_P, C_i, {+,-}{1,2}, HP_P, H_Ci)      (VP, VB, NP, 1, VB, house); (S, VP, NP, -1, VBD, we)
    (P, HC_P, C_i, {+,-}{1,2}, HP_P, HP_Ci)     (VP, VB, NP, 1, VB, NN); (S, VP, NP, -1, VBD, PRP)
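To make the full-parse templates concrete, the sketch below walks a head-annotated tree and counts context-free rule features, the constituent/head features NT/H_NT and NT/HP_NT, and two of the four head-to-head templates from Table 1. The Node class and its head_index convention are assumptions of this sketch, not the parser's actual data structures.

```python
from collections import Counter
from typing import List, Optional

class Node:
    """A parse tree node: POS-tag nodes carry a word; NT nodes carry children
    and the index of their head child (as assigned by the parser)."""
    def __init__(self, label: str, children: Optional[List["Node"]] = None,
                 word: Optional[str] = None, head_index: int = 0):
        self.label, self.word = label, word
        self.children, self.head_index = children or [], head_index

    def head_word(self) -> str:
        return self.word if self.word is not None else self.children[self.head_index].head_word()

    def head_tag(self) -> str:
        return self.label if self.word is not None else self.children[self.head_index].head_tag()

def tree_features(node: Node, feats: Optional[Counter] = None) -> Counter:
    """Context-free rule, constituent/head, and head-to-head feature counts."""
    if feats is None:
        feats = Counter()
    if node.word is not None:          # POS-tag node: no NT features here
        return feats
    # context-free rule instance, e.g. "S -> NP VP"
    feats["rule:%s -> %s" % (node.label, " ".join(c.label for c in node.children))] += 1
    # constituent/head features NT/H_NT and NT/HP_NT, e.g. S/helped and S/VBD
    feats["nt_hw:%s/%s" % (node.label, node.head_word())] += 1
    feats["nt_ht:%s/%s" % (node.label, node.head_tag())] += 1
    # head-to-head features: head child HC_P versus each non-head child C_i
    hc = node.children[node.head_index]
    for i, child in enumerate(node.children):
        if i == node.head_index:
            continue
        sign = 1 if i > node.head_index else -1
        pos = sign * (1 if abs(i - node.head_index) == 1 else 2)
        feats["h2h:(%s,%s,%s,%+d,%s,%s)" % (node.label, hc.label, child.label,
                                            pos, node.head_word(), child.head_word())] += 1
        feats["h2h:(%s,%s,%s,%+d,%s,%s)" % (node.label, hc.label, child.label,
                                            pos, node.head_word(), child.head_tag())] += 1
    for child in node.children:
        tree_features(child, feats)
    return feats

# e.g. the NP "the house" with NN as head child:
# Node("NP", [Node("DT", word="the"), Node("NN", word="house")], head_index=1)
```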

4 Experiments

The experimental set-up we use is very similar to that of Roark et al. (2004a; 2004b), and the extensions to that work in Roark et al. (2005). We make use of the Rich Transcription 2002 evaluation test set (rt02) as our development set, and use the Rich Transcription 2003 Spring evaluation CTS test set (rt03) as test set.
The rt02 set consists of 6081 sentences (63804 words) and has three subsets: Switchboard 1, Switchboard 2, and Switchboard Cellular. The rt03 set consists of 9050 sentences (76083 words) and has two subsets: Switchboard and Fisher.

The training set consists of 297580 transcribed utterances (3297579 words).[4] For each utterance, a weighted word-lattice was produced, representing alternative transcriptions, from the ASR system. The baseline ASR system that we are comparing against then performed a rescoring pass on these first-pass lattices, allowing for better silence modeling and replacing the trigram language model score with a 6-gram model. 1000-best lists were then extracted from these lattices. For each candidate in the 1000-best lists, we identified the number of edits (insertions, deletions or substitutions) for that candidate, relative to the "target" transcribed utterance. The oracle score for the 1000-best lists was 16.7%.

[4] Note that Roark et al. (2004a; 2004b; 2005) used 20854 of these utterances (249774 words) as held out data. In this work we simply use the rt02 test set as held out and development data.

To produce the word-lattices, each training utterance was processed by the baseline ASR system. In a naive approach, we would simply train the baseline system (i.e., an acoustic model and language model) on the entire training set, and then decode the training utterances with this system to produce lattices. We would then use these lattices with the perceptron algorithm. Unfortunately, this approach is likely to produce a set of training lattices that are very different from test lattices, in that they will have very low word-error rates, given that the lattice for each utterance was produced by a model that was trained on that utterance. To somewhat control for this, the training set was partitioned into 28 sets, and baseline Katz backoff trigram models were built for each set by including only transcripts from the other 27 sets. Lattices for each utterance were produced with an acoustic model that had been trained on the entire training set, but with a language model that was trained on the 27 data portions that did not include the current utterance. Since language models are generally far more prone to overtraining than standard acoustic models, this goes a long way toward making the training conditions similar to testing conditions. Similar procedures were used to train the parsing and tagging models for the training set, since the Switchboard treebank overlaps extensively with the ASR training utterances.

Table 2 presents the word-error rates on rt02 and rt03 of the baseline ASR system, the 1000-best perceptron and GCLM results from Roark et al. (2005) under this condition, and our 1000-best perceptron results. Note that our n-best result, using just n-gram features, improves upon the perceptron result of (Roark et al., 2005) by 0.2 percent, putting us within 0.1 percent of their GCLM result for that condition. (Note that the perceptron-trained n-gram features were trigrams, i.e., n = 3.) This is due to a larger training set being used in our experiments; we have added data that was used as held-out data in (Roark et al., 2005) to the training set that we use.

Table 2: Baseline word-error rates versus Roark et al. (2005).
    Trial                             WER rt02   WER rt03
    ASR system output                   37.1       36.4
    Roark et al. (2005) perceptron      36.6       35.7
    Roark et al. (2005) GCLM            36.3       35.4
    n-gram perceptron                   36.4       35.5

The first additional features that we experimented with were POS-tag sequence derived features. Let t_i and w_i be the POS tag and word at position i, respectively. We experimented with the following three feature definitions:

    1. (t_{i−2}t_{i−1}t_i), (t_{i−1}t_i), (t_i), (t_i w_i)
    2. (t_{i−2}t_{i−1}w_i)
    3. (t_{i−2}w_{i−2}t_{i−1}w_{i−1}t_i w_i), (t_{i−2}t_{i−1}w_{i−1}t_i w_i), (t_{i−1}w_{i−1}t_i w_i), (t_{i−1}t_i w_i)

Table 3 summarizes the results of these trials on the held out set. Using the simple features (number 1 above) yielded an improvement beyond just n-grams, but additional, more complicated features failed to yield further improvements.

Table 3: Use of POS-tag sequence derived features.
    Trial                            WER rt02
    ASR system output                  37.1
    n-gram perceptron                  36.4
    n-gram + POS (1) perceptron        36.1
    n-gram + POS (1,2) perceptron      36.1
    n-gram + POS (1,3) perceptron      36.1

Next, we considered features derived from shallow parsing sequences. Given the results from the POS-tag sequence derived features, for any given sequence we simply use n-tag and tag/word features (number 1 above). The first sequence type from which we extracted features was the shallow parse tag sequence (S1), as shown in figure 3(b). Next, we tried the composite shallow/POS tag sequence (S2), as in figure 3(c). Finally, we tried extracting features from the shallow constituent sequence (S3), as shown in figure 3(d). When EDITED and INTJ nodes are ignored, we refer to this condition as S3-E. For full-parse feature extraction, we tried context-free rule features (CF) and head-to-head features (H2H), of the kind shown in table 1. Table 4 shows the results of these trials on rt02.

Table 4: Use of shallow parse sequence and full parse derived features.
    Trial                              WER rt02
    ASR system output                    37.1
    n-gram perceptron                    36.4
    n-gram + POS perceptron              36.1
    n-gram + POS + S1 perceptron         36.1
    n-gram + POS + S2 perceptron         36.0
    n-gram + POS + S3 perceptron         36.0
    n-gram + POS + S3-E perceptron       36.0
    n-gram + POS + CF perceptron         36.1
    n-gram + POS + H2H perceptron        36.0

Although the single-digit precision in the table does not show it, the H2H trial, using features extracted from the full parses along with n-grams and POS-tag sequence features, was the best performing model on the held-out data, so we selected it for application to the rt03 test data. This yielded 35.2% WER, a reduction of 0.3% absolute over what was achieved with just n-grams, which is significant at p < 0.001,[5] reaching a total reduction of 1.2% over the baseline recognizer.

[5] We use the Matched Pair Sentence Segment test for WER, a standard measure of significance, to calculate this p-value.
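The edit counts and oracle hypotheses referred to above, like the choice of s_i in Figure 1, come down to word-level Levenshtein alignment against the reference transcription. The sketch below is illustrative only; it is not the scoring tool used to produce the WER figures reported in this section.

```python
from typing import List

def edit_distance(hyp: List[str], ref: List[str]) -> int:
    """Word-level Levenshtein distance: insertions, deletions, substitutions."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(prev[j] + 1,               # delete h
                           cur[j - 1] + 1,            # insert r
                           prev[j - 1] + (h != r)))   # substitute / match
        prev = cur
    return prev[-1]

def oracle(nbest: List[List[str]], ref: List[str]) -> int:
    """Index of the candidate with the fewest edits against the reference;
    this is the role played by s_i in the training algorithm of Figure 1."""
    return min(range(len(nbest)), key=lambda k: edit_distance(nbest[k], ref))
```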

5 Conclusion

The results presented in this paper are a first step in examining the potential utility of syntactic features for discriminative language modeling for speech recognition. We tried two possible sets of features derived from the full annotation, as well as a variety of possible feature sets derived from shallow parse and POS tag sequences, the best of which gave a small but significant improvement beyond what was provided by the n-gram features. Future work will include a further investigation of parser-derived features. In addition, we plan to explore the alternative parameter estimation methods described in (Roark et al., 2004a; Roark et al., 2004b), which were shown in this previous work to give further improvements over the perceptron.

References

Eugene Charniak. 2001. Immediate-head parsing for language models. In Proc. ACL.

Ciprian Chelba and Frederick Jelinek. 1998. Exploiting syntactic structure for language modeling. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, pages 225-231.

Ciprian Chelba and Frederick Jelinek. 2000. Structured language modeling. Computer Speech and Language, 14(4):283-332.

Ciprian Chelba. 2000. Exploiting Syntactic Structure for Natural Language Modeling. Ph.D. thesis, The Johns Hopkins University.

Stanley Chen and Joshua Goodman. 1998. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Harvard University.

Michael J. Collins. 1999. Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania.

Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proc. EMNLP, pages 1-8.

Michael Collins. 2004. Parameter estimation for statistical parsing models: Theory and practice of distribution-free methods. In Harry Bunt, John Carroll, and Giorgio Satta, editors, New Developments in Parsing Technology. Kluwer Academic Publishers, Dordrecht.

Frederick Jelinek and John Lafferty. 1991. Computation of the probability of initial substring generation by stochastic context-free grammars. Computational Linguistics, 17(3):315-323.

Mark Johnson, Stuart Geman, Steven Canon, Zhiyi Chi, and Stefan Riezler. 1999. Estimators for stochastic "unification-based" grammars. In Proc. ACL, pages 535-541.

Daniel Jurafsky, Chuck Wooters, Jonathan Segal, Andreas Stolcke, Eric Fosler, Gary Tajchman, and Nelson Morgan. 1995. Using a stochastic context-free grammar as a language model for speech recognition. In Proceedings of the IEEE Conference on Acoustics, Speech, and Signal Processing, pages 189-192.

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. ICML, pages 282-289, Williams College, Williamstown, MA, USA.

Andrej Ljolje, Enrico Bocchieri, Michael Riley, Brian Roark, Murat Saraclar, and Izhak Shafran. 2003. The AT&T 1xRT CTS system. In Rich Transcription Workshop.

Franz Josef Och, Daniel Gildea, Sanjeev Khudanpur, Anoop Sarkar, Kenji Yamada, Alex Fraser, Shankar Kumar, Libin Shen, David Smith, Katherine Eng, Viren Jain, Zhen Jin, and Dragomir Radev. 2004. A smorgasbord of features for statistical machine translation. In Proceedings of HLT-NAACL 2004.

Brian Roark, Murat Saraclar, and Michael Collins. 2004a. Corrective language modeling for large vocabulary ASR with the perceptron algorithm. In Proc. ICASSP, pages 749-752.

Brian Roark, Murat Saraclar, Michael Collins, and Mark Johnson. 2004b. Discriminative language modeling with conditional random fields and the perceptron algorithm. In Proc. ACL.

Brian Roark, Murat Saraclar, and Michael Collins. 2005. Discriminative n-gram language modeling. Computer Speech and Language, submitted.

Brian Roark. 2001a. Probabilistic top-down parsing and language modeling. Computational Linguistics, 27(2):249-276.

Brian Roark. 2001b. Robust Probabilistic Predictive Syntactic Processing. Ph.D. thesis, Brown University. http://arXiv.org/abs/cs/0105019.

Ronald Rosenfeld, Stanley Chen, and Xiaojin Zhu. 2001. Whole-sentence exponential language models: a vehicle for linguistic-statistical integration. In Computer Speech and Language.

Fei Sha and Fernando Pereira. 2003. Shallow parsing with conditional random fields. In Proceedings of the Human Language Technology Conference and Meeting of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), Edmonton, Canada.

Andreas Stolcke and Jonathan Segal. 1994. Precise n-gram probabilities from stochastic context-free grammars. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 74-79.

Andreas Stolcke. 1995. An efficient probabilistic context-free parsing algorithm that computes prefix probabilities. Computational Linguistics, 21(2):165-202.

Wen Wang and Mary P. Harper. 2002. The superARV language model: Investigating the effectiveness of tightly integrating multiple knowledge sources. In Proc. EMNLP, pages 238-247.

Wen Wang, Andreas Stolcke, and Mary P. Harper. 2004. The use of a linguistically motivated language model in conversational speech recognition. In Proc. ICASSP.

Wen Wang. 2003. Statistical parsing and language modeling based on constraint dependency grammar. Ph.D. thesis, Purdue University.

Peng Xu, Ciprian Chelba, and Frederick Jelinek. 2002. A study on richer syntactic dependencies for structured language modeling. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 191-198.

Peng Xu, Ahmad Emami, and Frederick Jelinek. 2003. Training connectionist models for the structured language model. In Proc. EMNLP, pages 160-167.