Measuring Tagging Performance of a Joint Language Model

Denis Filimonov1, Mary Harper1,2

1 Laboratory for Computational Linguistics and Information Processing, Institute for Advanced Computer Studies, University of Maryland, College Park
2 Human Language Technology Center of Excellence
{den,mharper}@umiacs.umd.edu

Abstract

Predicting syntactic information in a joint language model (LM) has been shown not only to improve the model at its main task of predicting words, but also to allow this information to be passed to other applications, such as spoken language processing. This raises the question of just how accurate the syntactic information predicted by the LM is. In this paper, we present a joint LM designed not only to scale to large quantities of training data, but also to utilize fine-grain syntactic information as well as other features, such as morphology and prosody. We evaluate the accuracy of our model at predicting syntactic information on the POS tagging task against state-of-the-art POS taggers, and on perplexity against the ngram model.

Index Terms: language modeling, joint language model, part-of-speech tagging

1. Introduction

Part-of-speech (POS) tags are helpful in a number of NLP applications: named entity recognition, sentence boundary detection, and word sense disambiguation, to name a few. The HMM approach, historically the first to be applied to the task of POS tagging, has been largely superseded by sophisticated algorithms [1, 2] that make better use of scarce human-annotated data. Recently, some systems have added large bodies of unannotated or automatically annotated text to assist in the tagging task, reporting significant improvements in accuracy [3, 4].

It has also been shown that augmenting the output of a language model with predicted POS tags, i.e., predicting joint word-tag events, can improve the performance of ASR applications, such as detecting speech repairs and intonational phrase boundaries [5], thus giving an additional incentive to develop joint language models. In this paper, we present a joint language model that can predict syntactic tags as a by-product, and we compare our model against state-of-the-art POS taggers representing four different approaches: HMM [6], MaxEnt [1], CRF [3], and perceptron [4]. Note that all of these high-performing taggers are bi-directional to some extent, ranging from simple word look-ahead to varying the order of tag prediction, which renders them inappropriate for certain tasks requiring a strict left-to-right approach.

Designing a scalable joint model is notoriously hard. For example, the maximum amount of data that the model in [5] (arguably the closest model to ours) was trained on was only 1M words. In this paper, we present our model trained on 170M words of data, with fine-grain syntactic tags.

While there are other joint language models that use syntactic information in various forms that could potentially be used to predict POS tags [5, 7, 8, 9], none of them has been evaluated on the POS tagging task, which is an important measure if the tags are to be used in a task (while [5] reports a raw POS error rate, those results were obtained on the tiny 58k-word Trains corpus and were not compared to any other system). In Section 2 of this paper, we outline the design of our language model, and we evaluate it on the POS tagging task and on perplexity in Section 3.

2. Language Model

A typical application of a language model in Automatic Speech Recognition (ASR) is to provide the prior probability in the noisy channel framework:

    w_1^n = \arg\max_{w_1^n} p(w_1^n | A) = \arg\max_{w_1^n} p(A | w_1^n) \cdot p(w_1^n)    (1)

where p(A | w_1^n) is the acoustic model and p(w_1^n) is the language model. The latter is typically factored as

    p(w_1^n) = \prod_{i=1}^{n} p(w_i | w_1^{i-1})

(models that require the entire sentence up front do exist, e.g., [10], but their applicability to ASR tasks is limited). A joint language model predicts joint events p(w_i t_i | w_1^{i-1} t_1^{i-1}), where t_i is a tag, such as a part-of-speech tag, assigned to the word w_i. If the joint model is used, the distribution is marginalized:

    p(w_1^n) = \prod_{i=1}^{n} p(w_i | w_1^{i-1}) = \sum_{t_1 \ldots t_n} \prod_{i=1}^{n} p(w_i t_i | w_1^{i-1} t_1^{i-1})

However, it is also possible (and beneficial [5]) to augment Equation 1 to allow the acoustic model to use the tags predicted by the LM:

    w_1^n t_1^n = \arg\max_{w_1^n t_1^n} p(A | w_1^n t_1^n) \cdot p(w_1^n t_1^n)    (2)
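As a concrete illustration of the marginalization above, the following sketch enumerates tag sequences for a toy joint bigram model; the function and dictionary names are hypothetical, and a real implementation would use a clustered history (Section 2.2) and dynamic programming over a lattice rather than brute-force enumeration.

    from itertools import product

    # Toy joint bigram model p(w_i, t_i | w_{i-1}, t_{i-1}), stored as a plain
    # dictionary keyed by (prev_word, prev_tag, word, tag).  In the paper this
    # distribution comes from the decision-tree model; here it is hypothetical.
    def joint_prob(word, tag, prev_word, prev_tag, model):
        return model.get((prev_word, prev_tag, word, tag), 0.0)

    def sentence_prob(words, tagset, model):
        """p(w_1..w_n) = sum over tag sequences of prod_i p(w_i t_i | history)."""
        total = 0.0
        for tags in product(tagset, repeat=len(words)):   # enumerate t_1 .. t_n
            p, prev = 1.0, ("<s>", "<s>")                  # sentence-start context
            for w, t in zip(words, tags):
                p *= joint_prob(w, t, prev[0], prev[1], model)
                prev = (w, t)
            total += p
        return total

    # One-word example with two tags (hypothetical probabilities):
    toy = {("<s>", "<s>", "cat", "NN"): 0.5, ("<s>", "<s>", "cat", "VB"): 0.25}
    print(sentence_prob(["cat"], tagset=["NN", "VB"], model=toy))   # 0.75

In practice the sum is, of course, computed over a pruned decoding lattice with dynamic programming rather than by enumeration.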

The size and sparsity of the parameter space of the joint model necessitate the use of dimensionality reduction measures in order to make the model computationally tractable and to allow for accurate estimation of the model's parameters. We also want the model to be able to easily accommodate additional sources of information, such as morphological features and prosody. In the rest of this section, we discuss the avenues we have taken to address these problems.

2.1. Fine-grain Syntactic Tags

The model presented in this section was designed to accommodate large tagsets. In this paper, we evaluate the impact of extending POS tags with additional syntactic information derived from parse trees, namely:

• A POS tag with its immediate parent in the parse tree, along with the tag's relative position among its siblings. We refer to this tagset as parent. For example, the tree in Figure 1 is tagged: the/DT-NP-start black/JJ-NP-mid cat/NN-NP-end sat/VB-VP-single.

• A POS tag of a word together with the POS tag of the word's lexical head, later referred to as head. The example in Figure 1 is tagged: the/DT-NN black/JJ-NN cat/NN-VBD sat/VBD-root.

Figure 1: A parse tree example

The English POS tag set has 45 tags, while the fine-grain tagsets have approximately 1,500 tags.

2.2. Decision Tree Clustering

Binary decision tree clustering has been shown to be effective for reducing the parameter space in language modeling [5, 11] and in other language processing applications, e.g., [12]. Like any clustering algorithm, it can be represented by a function H that maps the space of histories to a set of equivalence classes:

    p(w_i t_i | w_{i-n+1}^{i-1} t_{i-n+1}^{i-1}) \approx p(w_i t_i | H(w_{i-n+1}^{i-1} t_{i-n+1}^{i-1}))    (3)

Since optimal tree construction is intractable, greedy algorithms are universally used. While the procedure is standard (recursively select binary questions about the history, optimizing some function), the greedy nature of the approach makes the details, such as which questions to ask and which function to optimize, crucial for achieving good results. In the remainder of this section, we discuss the decisions we made regarding these issues.

2.3. Factors

The Factored Language Model (FLM) [13] offers a convenient view of the input data: it represents every word in a sentence as a tuple of factors. This allows us to extend the language model with additional parameters to build a more effective model. In an FLM, however, all factors have to be deterministically computed, whereas we need to distinguish between the factors that are given or computed and the factors that the model must predict stochastically. We call these types of factors overt and hidden, respectively. Examples of overt factors include surface words, morphological features such as suffixes, case information when available, etc.; the hidden factors are POS tags or other syntactic variables.

In this work, we used suffixes obtained with the Snowball stemmer (http://snowball.tartarus.org/) and a simple regular-expression-based feature identifying whether the word is a number, a punctuation mark, an abbreviation, a foreign word, etc. Henceforth, we will use word to represent the set of overt factors and tag to represent the set of hidden factors.
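As an illustration of the overt factors just described, the sketch below derives a suffix-like ending and a few regular-expression shape features for a single word. It is only an approximation: the paper uses the Snowball stemmer, for which a fixed suffix list stands in here, and all feature names are ours.

    import re

    def overt_factors(word):
        """Crude approximation of the overt factors: a suffix-like ending
        plus regular-expression shape features (number, punctuation,
        abbreviation).  Illustrative only; not the paper's feature set."""
        factors = {"word": word.lower()}
        for suf in ("ing", "ed", "ly", "tion", "s"):
            if word.lower().endswith(suf) and len(word) > len(suf) + 2:
                factors["suffix"] = suf
                break
        else:
            factors["suffix"] = None
        if re.fullmatch(r"[\d.,]+", word):
            factors["shape"] = "number"
        elif re.fullmatch(r"\W+", word):
            factors["shape"] = "punct"
        elif re.fullmatch(r"(?:[A-Za-z]\.)+", word):
            factors["shape"] = "abbrev"
        else:
            factors["shape"] = "other"
        return factors

    print(overt_factors("walked"))   # {'word': 'walked', 'suffix': 'ed', 'shape': 'other'}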

2.4. Hidden Factors Tree

Similarly to [5], we construct a binary tree where each tag is a leaf; we refer to this tree as the Hidden Factors Tree (HFT). We use the Minimum Discriminative Information (MDI) algorithm [14] to build the tree. The HFT represents a hierarchical clustering of the tag space. One of the reasons for doing this is to allow questions about subsets of tags rather than about individual tags alone (trying all possible subsets of tags is not feasible, since there are 2^|T| of them; the tree reduces the number to O(|T|) of the most meaningful subsets, as determined by the clustering algorithm).

Unlike [5], where the tree of tags was only used to create questions, this representation of the tag space is, in addition, a key feature of the decoding optimizations that support model scalability. We refer the reader to [15] for the details of the decoding algorithm.

2.5. Questions

We use different types of questions for hidden and overt factors.

• Questions about surface words are constructed using the Exchange algorithm [16]. This algorithm takes the set of words that appear at a certain position in the training data associated with the current node in the history tree and divides it into two complementary subsets, greedily optimizing some target function (we use the average entropy of the marginalized word distribution, the same criterion as for question selection). Note that since the algorithm only operates on the words that appear in the training data, we need to do something more to account for unseen words. Thus, to represent this type of question, we create the history tree structure depicted in Figure 2. For other overt factors with smaller vocabularies, such as suffixes, we use equality questions (e.g., suffix(w_{i-1}) = 'ed').

• As we mentioned in Section 2.4, we use the Hidden Factors Tree to create questions about hidden factors. Note that every node in a binary tree can be represented by a binary path from the root, with all nodes under an inner node sharing the same prefix. Thus, a question about whether a tag belongs to the subset of tags dominated by a node can be expressed as whether the tag's path matches that node's binary prefix (see the short sketch following this list).
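The following sketch, under the assumption that each tag's root-to-leaf HFT path is encoded as a bit string, shows how a question about a subset of tags reduces to a prefix test; the particular tree and paths are invented for illustration.

    # Hypothetical Hidden Factors Tree paths: each tag is a leaf, encoded by
    # its root-to-leaf path ('0' = left child, '1' = right child).
    HFT_PATH = {
        "NN": "000", "NNS": "001", "JJ": "01",
        "VB": "100", "VBD": "101", "DT": "11",
    }

    def tag_in_subtree(tag, node_prefix):
        """A question about the subset of tags dominated by an inner node
        is a prefix test on the tag's binary path."""
        return HFT_PATH[tag].startswith(node_prefix)

    # The inner node '0' dominates the noun/adjective leaves in this toy tree:
    assert tag_in_subtree("NN", "0") and tag_in_subtree("JJ", "0")
    assert not tag_in_subtree("VBD", "0")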

Table 1: Tagging accuracy

    System       Gold data   Additional data   Accuracy
    Parser       1M          5M                97.24
    MaxEnt       1M          0                 97.24
    Perceptron   1M          200M              97.44
    CRF          1M          1000M             97.40
    HMM          0           170M              97.15
    LM           0           170M              97.21

Figure 2: Backoff Nodes

2.6. Optimization Criterion and Stopping Rule

To select questions, we use the average entropy of the marginalized word distribution. We found that this criterion significantly outperforms the entropy of the distribution of joint events, which is probably due to the increased sparsity of the joint distribution.

2.7. Smoothing

In order to estimate the probability distributions at the leaves of the history tree, we use the following recursive formula:

    \tilde{p}_n(w_i t_i) = \lambda_n p_n(w_i t_i) + (1 - \lambda_n)\, \tilde{p}_{n'}(w_i t_i)    (4)

where n' is the parent of node n and p_n(w_i t_i) is the ML distribution at node n. The root of the tree is interpolated with the distribution p_unif(w_i t_i) = (1/|V|) p_ML(t_i | w_i); we use this distribution rather than the uniform joint distribution 1/(|V||T|) because we do not want to allow word-tag pairs that have never been observed (the idea is similar to [6]). To estimate the interpolation parameters \lambda_n, we use the EM algorithm described in [12]; however, rather than setting aside a separate development set for optimizing \lambda_n, we use 4-fold cross validation and take the geometric mean of the resulting coefficients (to avoid a large number of zeros due to the product, we set a minimum for \lambda of 10^{-7}). We chose this approach because a small development set often does not overlap with the training set for low-count nodes, leading the EM algorithm to set \lambda_n = 0 for those nodes.

For backoff nodes (see Figure 2), we use a lower order model interpolated with the distribution at the backoff node's grandparent (node A in Figure 2):

    \tilde{p}_B(w_i t_i | w_{i-n+1}^{i-1} t_{i-n+1}^{i-1}) = \alpha_A\, \tilde{p}_{bo}(w_i t_i | w_{i-n+2}^{i-1} t_{i-n+2}^{i-1}) + (1 - \alpha_A)\, \tilde{p}_A(w_i t_i)

The lower order model is constructed by the same algorithm, although with a smaller context; it can back off on words or tags, or both. In the experiments in this paper, we back off on both words and tags, i.e., p(w_i t_i | w_{i-2}^{i-1} t_{i-2}^{i-1}) backs off to p(w_i t_i | w_{i-1} t_{i-1}), which in turn backs off to the unigram p(w_i t_i). How to most effectively compute \alpha_A is an open question. For this study, we use a simple heuristic based on the observation that the further node A is from the root, the more reliable the distribution \tilde{p}_A(w_i t_i) is, and hence the lower \alpha_A should be.
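A minimal sketch of the recursion in Equation (4) follows, assuming each node already stores its ML distribution and an EM-estimated interpolation weight; the class and variable names are ours, not the authors' implementation.

    from dataclasses import dataclass
    from typing import Dict, Optional, Tuple

    Event = Tuple[str, str]   # a (word, tag) pair

    @dataclass
    class Node:
        ml: Dict[Event, float]           # maximum-likelihood distribution at this node
        lam: float                       # interpolation weight lambda_n (assumed pre-estimated by EM)
        parent: Optional["Node"] = None

    def smoothed(node: Node, event: Event, root_backstop: float) -> float:
        """Equation (4): p~_n = lambda_n * p_n + (1 - lambda_n) * p~_{parent(n)}.
        At the root, the recursion is closed with a backstop distribution
        (p_unif in the paper; a constant stands in for it here)."""
        own = node.lam * node.ml.get(event, 0.0)
        if node.parent is None:
            return own + (1 - node.lam) * root_backstop
        return own + (1 - node.lam) * smoothed(node.parent, event, root_backstop)

    # Toy two-node tree: one leaf directly under the root.
    root = Node(ml={("cat", "NN"): 0.2}, lam=0.5)
    leaf = Node(ml={("cat", "NN"): 0.6}, lam=0.8, parent=root)
    print(smoothed(leaf, ("cat", "NN"), root_backstop=0.01))
    # = 0.8*0.6 + 0.2*(0.5*0.2 + 0.5*0.01)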

2.8. Tagging

When the model uses tags finer-grained than POS, the best POS sequence cannot be tractably inferred from the decoding lattice. We therefore use the following variational approximation to estimate the probability of POS tag sequences. Here, T_i denotes a POS tag, while the t are the fine-grained tags used by the model; note that, by construction, each t belongs to exactly one POS tag, so each T can be viewed as a set of fine-grain tags:

    p(T_1^n w_1^n) = \prod_{i=1}^{n} p(T_i w_i | T_1^{i-1} w_1^{i-1}) \approx \prod_{i=1}^{n} \sum_{t_i \in T_i,\; t_{i-1} \in T_{i-1},\; t_{i-2} \in T_{i-2}} p(t_i w_i | t_{i-2}^{i-1} w_{i-2}^{i-1})

We refer the reader to [17] for the derivation of this approximation.
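The grouping of fine-grain tags under their POS tags that underlies this approximation can be illustrated as follows. This is a simplified sketch (it collapses a per-position posterior rather than performing the full lattice computation), and the tag names and probabilities are toy values.

    from collections import defaultdict

    # Each fine-grain tag (e.g., POS plus parent constituent) maps to exactly
    # one coarse POS tag, as noted above.  Mapping and posteriors are toy values.
    FINE_TO_POS = {"NN-NP-end": "NN", "NN-NP-mid": "NN", "VB-VP-single": "VB"}

    def pos_posterior(fine_posterior):
        """Collapse a posterior over fine-grain tags into a posterior over POS
        tags by summing the mass of the fine-grain tags inside each POS class."""
        pos = defaultdict(float)
        for fine_tag, p in fine_posterior.items():
            pos[FINE_TO_POS[fine_tag]] += p
        return dict(pos)

    print(pos_posterior({"NN-NP-end": 0.5, "NN-NP-mid": 0.25, "VB-VP-single": 0.25}))
    # {'NN': 0.75, 'VB': 0.25}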

3. Experiments

To evaluate the tagging accuracy of our model, we chose the Wall Street Journal Penn Treebank, since it is the most widely used corpus for this task. While this is a newswire corpus, we believe it is reasonable to assume that if our model performs well relative to the state-of-the-art taggers on this task, it will also port to other genres, including broadcast news and conversational speech (with an appropriately trained parser to generate tags). We used WSJ sections 22-24 for testing, both for the tagging and the perplexity experiments.

To train our model and the HMM tagger [6], we parsed 40M words from BLLIP WSJ and 130M words from the NYT section of the English Gigaword corpus. We used an extension of the Berkeley parser [18] trained on WSJ sections 2-21, with 210,000 sentences from BLLIP WSJ for self-training. The F-score of the parser on section 23 is 91.65.

The tagging accuracies of the joint LM and the other systems are presented in Table 1. In the first column, Parser refers to the Berkeley parser used to generate the training data for LM and HMM; MaxEnt is described in [1], Perceptron in [4] (note that the Perceptron was trained on higher quality text, labeled by a combination of taggers with a resulting accuracy of 97.44), CRF in [3], and HMM in [6]; LM refers to this work (head and parent both have the same score). The second and third columns show the amounts of human-labeled and automatically labeled (or unlabeled, in the case of CRF) data the models used. While LM does not outperform the state-of-the-art bi-directional algorithms, it does do significantly better than the left-to-right HMM (p < 0.0001 in a sign test), and it does not differ significantly from the parser used to label the training data.

In Figure 3, we present a detailed breakdown of the performance of the joint LM using various tags and trained on different amounts of data. When trained on 40M words, sparsity prevents the systems from exhibiting significantly different performances, including the HMM tagger. With more data, models with fine-grain tags outperform the joint POS model significantly, although they begin to saturate as they approach the performance of the parser used to generate their training data. Figure 4 shows the importance of the fine-grain tags for perplexity: the fine-grain models have 6-7% lower perplexity than the POS model and 12-13% lower perplexity than the standard trigram model (we used an interpolated, modified KN-smoothed trigram).
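For reference, the sign test used above for the per-token comparison against the HMM can be computed with an exact binomial tail, as sketched below. This is a generic sketch with hypothetical counts, not the authors' evaluation code.

    from math import comb

    def sign_test_p(only_a_correct, only_b_correct):
        """Exact two-sided sign test over the tokens where the two taggers
        disagree and exactly one of them is correct."""
        n = only_a_correct + only_b_correct
        k = min(only_a_correct, only_b_correct)
        tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
        return min(1.0, 2 * tail)

    # Hypothetical disagreement counts, not the actual experiment:
    print(sign_test_p(only_a_correct=62, only_b_correct=38))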

Figure 3: Tagging accuracy: POS vs. fine-grain

Figure 4: Perplexity: POS vs. fine-grain

4. Conclusions and Future Work

In this paper, we presented a joint language model designed for large tagsets and large amounts of training data; while 170M words is smaller than the volumes used for training ngram models, it is much larger than what other joint models typically use. While there are other joint LMs with syntactic tags, this work is the first to evaluate the accuracy of the tag prediction, which we consider an important contribution of this paper. Although our LM falls short of the best tagging algorithms (not surprisingly, since the bi-directional taggers have the advantage of accessing words in the future), it does outperform the HMM tagger, and, unlike the bi-directional taggers, it can be used in ASR applications where strict left-to-right decoding is required. For the evaluation of this model in an ASR rescoring task, we refer the reader to [15].

We also established that models with fine-grain tags outperform the POS model, both in tagging accuracy and in perplexity. This is a novel contribution, since none of the previously reported joint models compared the performance of their models on different tagsets. We would like to underscore that it is the flexibility of our model that allowed us to employ fine-grain tagsets and other features, such as suffixes, which proved important for predicting tags accurately.

The tagging accuracy of the parser used to generate the training data limits the performance of our model; we believe this issue can be alleviated by mixing the output of different parsers, which we leave for future work. We also intend to continue improving the scalability of the model to accommodate Gigaword-scale data and to perform similar experiments in Chinese.

5. Acknowledgments

This work was supported by NSF IIS-0703859 and DARPA HR0011-06-C-0023. Any opinions, findings, and/or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the funding agencies.

6. References

[1] Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of HLT-NAACL, 2003.
[2] Libin Shen, Giorgio Satta, and Aravind K. Joshi. Guided learning for bidirectional sequence classification. In Proceedings of the ACL, 2007.
[3] Jun Suzuki and Hideki Isozaki. Semi-supervised sequential labeling and segmentation using giga-word scale unlabeled data. In Proceedings of the ACL-HLT, June 2008.
[4] Drahomíra Spoustová, Jan Hajič, Jan Raab, and Miroslav Spousta. Semi-supervised training for the averaged perceptron POS tagger. In Proceedings of the EACL, March 2009.
[5] P. Heeman. POS tagging versus classes in language modeling. 1998.
[6] Scott M. Thede and Mary P. Harper. A second-order hidden Markov model for part-of-speech tagging. In Proceedings of the ACL, 1999.
[7] Aravind K. Joshi and B. Srinivas. Disambiguation of super parts of speech (or Supertags): Almost parsing. In COLING, 1994.
[8] Ciprian Chelba and Frederick Jelinek. Structured language modeling for speech recognition. CoRR, 2000.
[9] Wen Wang, Mary P. Harper, and Andreas Stolcke. The robustness of an almost-parsing language model given errorful training data. In Proceedings of the IEEE ICASSP, 2003.
[10] Eugene Charniak. Immediate-head parsing for language models. In Proceedings of the ACL, 2001.
[11] L. R. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer. A tree-based statistical language model for natural language speech recognition. IEEE Transactions on ASSP, 1989.
[12] David M. Magerman. Natural language parsing as statistical pattern recognition. PhD thesis, Stanford, CA, USA, 1994.
[13] Jeff Bilmes and Katrin Kirchhoff. Factored language models and generalized parallel backoff. In Proceedings of HLT/NAACL, 2003.
[14] Imed Zitouni. Backoff hierarchical class n-gram language models: Effectiveness to model unseen events in speech recognition. Computer Speech & Language, 21(1):88-104, 2007.
[15] Denis Filimonov and Mary Harper. A joint language model with fine-grain syntactic tags. In Proceedings of the EMNLP, 2009.
[16] Sven Martin, Jorg Liermann, and Hermann Ney. Algorithms for bigram and trigram word clustering. Speech Communication, pages 1253-1256, 1998.
[17] Slav Petrov, Adam Pauls, and Dan Klein. Learning structured models for phone recognition. In Proceedings of the EMNLP-CoNLL, 2007.
[18] Zhongqiang Huang and Mary Harper. Self-training PCFG grammars with latent annotations across languages. In Proceedings of the EMNLP, 2009.