
Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Philadelphia, July 2002, pp. 248-256. Association for Computational Linguistics.

Exploiting Headword Dependency and Predictive Clustering for Language Modeling

Jianfeng Gao, Microsoft Research, Asia, Beijing 100080, China, [email protected]
Hisami Suzuki, Microsoft Research, Redmond WA 98052, USA, [email protected]
Yang Wen*, Department of Computer & Information Sciences, Tsinghua University, China

* This work was done while the author was visiting Microsoft Research Asia.

Abstract

This paper presents several practical ways of incorporating linguistic structure into language models. A headword detector is first applied to detect the headword of each phrase in a sentence. A permuted headword trigram model (PHTM) is then generated from the annotated corpus. Finally, PHTM is extended to a cluster PHTM (C-PHTM) by defining clusters for similar words in the corpus. We evaluated the proposed models on the realistic application of Japanese Kana-Kanji conversion. Experiments show that C-PHTM achieves a 15% error rate reduction over the word trigram model. This demonstrates that the use of simple methods such as the headword trigram and predictive clustering can effectively capture long-distance word dependency and substantially outperform a word trigram model.

1 Introduction

In spite of its deficiencies, trigram-based language modeling still dominates the statistical language modeling community, and is widely applied to tasks such as speech recognition and Asian language text input (Jelinek, 1990; Gao et al., 2002).

Word trigram models are deficient because they can only capture local dependency relations, taking no advantage of richer linguistic structure. Many proposals have been made to incorporate linguistic structure into language models (LMs), but little improvement has been achieved so far in realistic applications, because (1) capturing longer-distance word dependency leads to higher-order n-gram models, whose number of parameters is usually too large to estimate; and (2) capturing deeper linguistic relations in an LM requires a large annotated training corpus and a decoder that assigns linguistic structure, which are not always available.

This paper presents several practical ways of incorporating long-distance word dependency and linguistic structure into LMs. A headword detector is first applied to detect the headword of each phrase in a sentence. A permuted headword trigram model (PHTM) is then generated from the annotated corpus. Finally, PHTM is extended to a cluster model (C-PHTM), which clusters similar words in the corpus.

Our models are motivated by three assumptions about language: (1) headwords depend on previous headwords, as well as on immediately preceding words; (2) the order of headwords in a sentence can change freely in some cases; and (3) word clusters help us make a more accurate estimate of the probability of word strings. We evaluated the proposed models on the realistic application of Japanese Kana-Kanji conversion, which converts phonetic Kana strings into proper Japanese orthography. Results show that C-PHTM achieves a 15% error rate reduction over the word trigram model. This demonstrates that the use of simple methods can effectively capture long-distance word dependency and substantially outperform the word trigram model. Although the techniques in this paper are described in the context of Japanese Kana-Kanji conversion, we believe that they can be extended to other languages and applications.

This paper is organized as follows.
Sections 2 and 3 describe the techniques of using headword dependency and clustering for language modeling. Section 4 reviews related work. Section 5 introduces the evaluation methodology, and Section 6 presents the results of our main experiments. Section 7 concludes our discussion.

2 Using Headwords

2.1 Motivation

Japanese linguists have traditionally distinguished two types of words¹, content words (jiritsugo) and function words (fuzokugo), along with the notion of the bunsetsu (phrase). Each bunsetsu typically consists of one content word, called a headword in this paper, and several function words. Figure 1 shows a Japanese example sentence and its English translation².

[治療+に] [専念+して] [全快+まで] [十分+な] [療養+に] [努め+る]
[chiryou+ni] [sennen+shite] [zenkai+made] [juubun+na] [ryouyou+ni] [tsutome+ru]
[treatment+to] [concentration+do] [full-recovery+until] [enough+ADN] [rest+for] [try+PRES]
'(One) concentrates on the treatment and tries to rest enough until full recovery'

Figure 1. A Japanese example sentence with bunsetsu and headword tags

¹ Or more correctly, morphemes. Strictly speaking, the LMs discussed in this paper are morpheme-based models rather than word-based, but we will not make this distinction in this paper.
² Square brackets demarcate the bunsetsu boundary, and + the morpheme boundary; the underlined words are the headwords. ADN indicates an adnominal marker, and PRES indicates a present tense marker.

In Figure 1, we find that some headwords in the sentence are expected to have a stronger dependency relation with their preceding headwords than with their immediately preceding function words. For example, the three headwords 治療~専念~全快 (chiryou 'treatment' ~ sennen 'concentrate' ~ zenkai 'full recovery') form a trigram with very strong semantic dependency. Therefore, we can hypothesize (in the trigram context) that headwords may be conditioned not only on the two immediately preceding words, but also on the two previous headwords. This is our first assumption.

We also note that the order of headwords in a sentence is flexible to some extent. From the example in Figure 1, we find that if 治療~専念~全快 (chiryou 'treatment' ~ sennen 'concentrate' ~ zenkai 'full recovery') is a meaningful trigram, then its permutations (such as 全快~治療~専念 (zenkai 'full recovery' ~ chiryou 'treatment' ~ sennen 'concentrate')) should also be meaningful, because headword trigrams tend to capture an order-neutral semantic dependency. This reflects a characteristic of Japanese, in which the arguments and modifiers of a predicate can freely change their word order, a phenomenon known as "scrambling" in the linguistics literature. We can then introduce our second assumption: headwords in a trigram are permutable. Note that the permutation of headwords should be useful more generally beyond Japanese: for example, in English, the book Mary bought and Mary bought a book can be captured by the same headword trigram (Mary ~ bought ~ book) if we allow such permutations.

In this subsection, we have stated two assumptions about the structure of Japanese that can be exploited for language modeling. We now turn to discuss how to incorporate these assumptions into language modeling.
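To make the two assumptions concrete before formalizing them, the following minimal Python sketch reduces the bunsetsu-annotated sentence of Figure 1 to headword trigram contexts and their permuted variants. The data layout and the names bunsetsu and headwords are assumptions made for this sketch, not part of the authors' system.

# The Figure 1 sentence, one (headword, function words) pair per bunsetsu.
bunsetsu = [
    ("治療", ["に"]),      # chiryou 'treatment'
    ("専念", ["して"]),    # sennen 'concentrate'
    ("全快", ["まで"]),    # zenkai 'full recovery'
    ("十分", ["な"]),      # juubun 'enough'
    ("療養", ["に"]),      # ryouyou 'rest'
    ("努め", ["る"]),      # tsutome 'try'
]

# Headwords only (first assumption: they condition later headwords).
headwords = [head for head, _functions in bunsetsu]

# For each headword, its trigram context is the two preceding headwords;
# under the second assumption, both orderings of that context are usable.
for i in range(2, len(headwords)):
    h2, h1, w = headwords[i - 2], headwords[i - 1], headwords[i]
    print(f"predict {w} from ({h2}, {h1}) and its permutation ({h1}, {h2})")

For the Figure 1 sentence this predicts, for example, 全快 'full recovery' from both (治療, 専念) and (専念, 治療); the model introduced next interpolates exactly these two orderings of the headword context.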
2.2 Permuted headword trigram model (PHTM)

A trigram model predicts the next word w_i by estimating the conditional probability P(w_i|w_{i-2}w_{i-1}), assuming that the next word depends only on the two preceding words, w_{i-2} and w_{i-1}. The PHTM is a simple extension of the trigram model that incorporates the dependencies between headwords. If we assume that each word token can uniquely be classified as a headword or a function word, the PHTM can be considered a cluster-based language model with two clusters, headword H and function word F. We can then define the conditional probability of w_i based on its history as the product of two factors: the probability of the category (H or F), and the probability of w_i given its category. Let h_i or f_i be the actual headword or function word in a sentence, and let H_i or F_i be the category of the word w_i. The PHTM can then be formulated as:

P(w_i | Φ(w_1...w_{i-1})) = P(H_i | Φ(w_1...w_{i-1})) × P(w_i | Φ(w_1...w_{i-1}) H_i) + P(F_i | Φ(w_1...w_{i-1})) × P(w_i | Φ(w_1...w_{i-1}) F_i)    (1)

where Φ is a function that maps the word history (w_1...w_{i-1}) onto equivalence classes. P(H_i|Φ(w_1...w_{i-1})) and P(F_i|Φ(w_1...w_{i-1})) are category probabilities, and P(w_i|Φ(w_1...w_{i-1})F_i) is the word probability given that the category of w_i is function word. For these three probabilities, we used the standard trigram estimate (i.e., Φ(w_1...w_{i-1}) = (w_{i-2}w_{i-1})). The estimate of the headword probability is slightly more elaborate, reflecting the two assumptions described in Section 2.1:

P(w_i | Φ(w_1...w_{i-1}) H_i) = λ_1 (λ_2 P(w_i | h_{i-2} h_{i-1} H_i) + (1 − λ_2) P(w_i | h_{i-1} h_{i-2} H_i)) + (1 − λ_1) P(w_i | w_{i-2} w_{i-1} H_i)    (2)

This estimate interpolates three probabilities: P(w_i|h_{i-2}h_{i-1}H_i) and P(w_i|h_{i-1}h_{i-2}H_i), which are the headword trigram probabilities without and with permutation, and P(w_i|w_{i-2}w_{i-1}H_i), which is the probability of w_i given that it is a headword, where h_{i-1} and h_{i-2} denote the two preceding headwords, and λ_1, λ_2 ∈ [0,1] are interpolation weights optimized on held-out data.

Whereas the trigram model predicts the next word from a strictly ordered word sequence, co-occurrence models go to the other extreme of predicting the next word from a bag of previous words, without taking word order into account at all. We prefer models that lie somewhere between the two extremes and consider word order in a more flexible way. In the PHTM of Equation (2), λ_2 represents the impact of word order on headword prediction. When λ_2 = 1 (i.e., the resulting model is a non-permuted headword trigram model, referred to as HTM), it indicates that the second assumption does not hold in real data. When λ_2 is around 0.5, it indicates that a headword bag model is sufficient.

2.3 Model parameter estimation

Assume that all conditional probabilities in Equation (1) are estimated using maximum likelihood estimation (MLE). Then

P(w_i | w_{i-2} w_{i-1}) = C(w_{i-2} w_{i-1} w_i) / C(w_{i-2} w_{i-1})

where C(·) denotes the number of times the word string occurs in the training data.
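As a concrete illustration, the following Python sketch computes the PHTM probability of Equations (1) and (2) from unsmoothed MLE count ratios, mirroring the MLE assumption above. It assumes training sentences in which every word is already tagged as a headword ('H') or function word ('F'); the class name, count-table layout, and default interpolation weights are assumptions made for this sketch, not the authors' implementation.

from collections import defaultdict

class PHTM:
    """Minimal permuted headword trigram model with unsmoothed MLE components."""

    def __init__(self, lambda1=0.5, lambda2=0.7):
        self.l1, self.l2 = lambda1, lambda2   # interpolation weights of Eq. (2)
        self.ctx = defaultdict(int)           # C(w_{i-2}, w_{i-1})
        self.cat = defaultdict(int)           # C(w_{i-2}, w_{i-1}, category)
        self.word = defaultdict(int)          # C(w_{i-2}, w_{i-1}, category, w_i)
        self.hctx = defaultdict(int)          # C(h_{i-2}, h_{i-1})
        self.htri = defaultdict(int)          # C(h_{i-2}, h_{i-1}, w_i)

    def train(self, sentences):
        """sentences: lists of (word, 'H' or 'F') pairs with headword tags."""
        for sent in sentences:
            heads = []
            for i, (w, c) in enumerate(sent):
                if i >= 2:
                    ctx = (sent[i - 2][0], sent[i - 1][0])
                    self.ctx[ctx] += 1
                    self.cat[ctx + (c,)] += 1
                    self.word[ctx + (c, w)] += 1
                if c == 'H':
                    if len(heads) >= 2:
                        hctx = (heads[-2], heads[-1])
                        self.hctx[hctx] += 1
                        self.htri[hctx + (w,)] += 1
                    heads.append(w)

    @staticmethod
    def _mle(num, den):
        return num / den if den else 0.0

    def prob(self, w, w2, w1, h2, h1):
        """Eq. (1): P(w | history). (w2, w1) are the two preceding words,
        (h2, h1) the two preceding headwords."""
        ctx = (w2, w1)
        p_cat_h = self._mle(self.cat[ctx + ('H',)], self.ctx[ctx])
        p_cat_f = self._mle(self.cat[ctx + ('F',)], self.ctx[ctx])
        # Eq. (2): headword trigram without and with permuted context,
        # interpolated with the word trigram restricted to headwords.
        p_head = (self.l1 * (self.l2 * self._mle(self.htri[(h2, h1, w)], self.hctx[(h2, h1)])
                             + (1 - self.l2) * self._mle(self.htri[(h1, h2, w)], self.hctx[(h1, h2)]))
                  + (1 - self.l1) * self._mle(self.word[ctx + ('H', w)], self.cat[ctx + ('H',)]))
        p_func = self._mle(self.word[ctx + ('F', w)], self.cat[ctx + ('F',)])
        return p_cat_h * p_head + p_cat_f * p_func

Because prob() takes both the two preceding words and the two preceding headwords, whatever decoder uses the model must track the headword history alongside the word history.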