
An Empirical Study of Smoothing Techniques for Language Modeling

Stanley F. Chen and Joshua Goodman
Harvard University, Aiken Computation Laboratory
33 Oxford St., Cambridge, MA 02138
sfc@eecs.harvard.edu, goodman@eecs.harvard.edu

Abstract

We present an extensive empirical comparison of several smoothing techniques in the domain of language modeling, including those described by Jelinek and Mercer (1980), Katz (1987), and Church and Gale (1991). We investigate for the first time how factors such as training data size, corpus (e.g., Brown versus Wall Street Journal), and n-gram order (bigram versus trigram) affect the relative performance of these methods, which we measure through the cross-entropy of test data. In addition, we introduce two novel smoothing techniques, one a variation of Jelinek-Mercer smoothing and one a very simple linear interpolation technique, both of which outperform existing methods.

1 Introduction

Smoothing is a technique essential in the construction of n-gram language models, a staple in speech recognition (Bahl, Jelinek, and Mercer, 1983) as well as many other domains (Church, 1988; Brown et al., 1990; Kernighan, Church, and Gale, 1990). A language model is a probability distribution over strings P(s) that attempts to reflect the frequency with which each string s occurs as a sentence in natural text. Language models are used in speech recognition to resolve acoustically ambiguous utterances. For example, if we have that P(it takes two) >> P(it takes too), then we know ceteris paribus to prefer the former transcription over the latter.

While smoothing is a central issue in language modeling, the literature lacks a definitive comparison between the many existing techniques. Previous studies (Nadas, 1984; Katz, 1987; Church and Gale, 1991; MacKay and Peto, 1995) only compare a small number of methods (typically two) on a single corpus and using a single training data size. As a result, it is currently difficult for a researcher to intelligently choose between smoothing schemes.

In this work, we carry out an extensive empirical comparison of the most widely used smoothing techniques, including those described by Jelinek and Mercer (1980), Katz (1987), and Church and Gale (1991). We carry out experiments over many training data sizes on varied corpora using both bigram and trigram models. We demonstrate that the relative performance of techniques depends greatly on training data size and n-gram order. For example, for bigram models produced from large training sets Church-Gale smoothing has superior performance, while Katz smoothing performs best on bigram models produced from smaller data. For the methods with parameters that can be tuned to improve performance, we perform an automated search for optimal values and show that sub-optimal parameter selection can significantly decrease performance. To our knowledge, this is the first smoothing work that systematically investigates any of these issues.

In addition, we introduce two novel smoothing techniques: the first belonging to the class of smoothing models described by Jelinek and Mercer, the second a very simple linear interpolation method. While being relatively simple to implement, we show that these methods yield good performance in bigram models and superior performance in trigram models.

We take the performance of a method m to be its cross-entropy on test data

    \frac{1}{N_T} \sum_{i=1}^{l_T} -\log_2 P_m(t_i)

where P_m(t_i) denotes the probability that the language model produced with method m assigns to sentence t_i, and where the test data T is composed of sentences (t_1, ..., t_{l_T}) and contains a total of N_T words. The entropy is inversely related to the average probability a model assigns to sentences in the test data, and it is generally assumed that lower entropy correlates with better performance in applications.
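As a concrete illustration of this evaluation measure (a minimal Python sketch for the reader, not code from our experiments; the function name sentence_prob and the toy data are placeholders):

    import math

    def cross_entropy(test_sentences, sentence_prob):
        """Per-word cross-entropy (bits/word) of a language model on test data.

        test_sentences: list of sentences, each a list of word tokens.
        sentence_prob:  callable mapping a sentence to the probability P_m(t_i)
                        assigned by the smoothing method m under evaluation.
        """
        total_neg_log_prob = 0.0   # running sum of -log2 P_m(t_i)
        total_words = 0            # N_T, total number of words in the test data
        for sentence in test_sentences:
            p = sentence_prob(sentence)
            total_neg_log_prob += -math.log2(p)
            total_words += len(sentence)
        return total_neg_log_prob / total_words

    # Toy usage with a hypothetical uniform model over a 10,000-word vocabulary.
    if __name__ == "__main__":
        uniform = lambda sentence: (1.0 / 10000) ** len(sentence)
        test = [["it", "takes", "two"], ["it", "takes", "too"]]
        print(cross_entropy(test, uniform))   # log2(10000), about 13.3 bits/word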

1.1 Smoothing n-gram Models

In n-gram language modeling, the probability of a string P(s) is expressed as the product of the probabilities of the words that compose the string, with each word probability conditional on the identity of the last n-1 words, i.e., if s = w_1 ... w_l we have

    P(s) = \prod_{i=1}^{l} P(w_i \mid w_1^{i-1}) \approx \prod_{i=1}^{l} P(w_i \mid w_{i-n+1}^{i-1})    (1)

where w_i^j denotes the words w_i ... w_j. Typically, n is taken to be two or three, corresponding to a bigram or trigram model, respectively. (To make the term P(w_i | w_{i-n+1}^{i-1}) meaningful for i < n, one can pad the beginning of the string with a distinguished token. In this work, we assume there are n-1 such distinguished tokens preceding each sentence.)

Consider the case n = 2. To estimate the probabilities P(w_i | w_{i-1}) in equation (1), one can acquire a large corpus of text, which we refer to as training data, and take

    P_{ML}(w_i \mid w_{i-1}) = \frac{P(w_{i-1} w_i)}{P(w_{i-1})} = \frac{c(w_{i-1} w_i)/N_s}{c(w_{i-1})/N_s} = \frac{c(w_{i-1} w_i)}{c(w_{i-1})}

where c(α) denotes the number of times the string α occurs in the text and N_s denotes the total number of words. This is called the maximum likelihood (ML) estimate for P(w_i | w_{i-1}).

While intuitive, the maximum likelihood estimate is a poor one when the amount of training data is small compared to the size of the model being built, as is generally the case in language modeling. For example, consider the situation where a pair of words, or bigram, say burnish the, does not occur in the training data. Then we have P_ML(the | burnish) = 0, which is clearly inaccurate as this probability should be larger than zero. A zero bigram probability can lead to errors in speech recognition, as it disallows the bigram regardless of how informative the acoustic signal is. The term smoothing describes techniques for adjusting the maximum likelihood estimate to hopefully produce more accurate probabilities.

As an example, one simple smoothing technique is to pretend each bigram occurs once more than it actually did (Lidstone, 1920; Johnson, 1932; Jeffreys, 1948), yielding

    P(w_i \mid w_{i-1}) = \frac{c(w_{i-1} w_i) + 1}{c(w_{i-1}) + |V|}

where V is the vocabulary, the set of all words being considered. This has the desirable quality of preventing zero bigram probabilities. However, this scheme has the flaw of assigning the same probability to, say, burnish the and burnish thou (assuming neither occurred in the training data), even though intuitively the former seems more likely because the word the is much more common than thou.

To address this, another smoothing technique is to interpolate the bigram model with a unigram model P_ML(w_i) = c(w_i)/N_s, a model that reflects how often each word occurs in the training data. For example, we can take

    P_{interp}(w_i \mid w_{i-1}) = \lambda P_{ML}(w_i \mid w_{i-1}) + (1 - \lambda) P_{ML}(w_i)

getting the behavior that bigrams involving common words are assigned higher probabilities (Jelinek and Mercer, 1980).

2 Previous Work

The simplest type of smoothing used in practice is additive smoothing (Lidstone, 1920; Johnson, 1932; Jeffreys, 1948), where we take

    P(w_i \mid w_{i-n+1}^{i-1}) = \frac{c(w_{i-n+1}^{i}) + \delta}{c(w_{i-n+1}^{i-1}) + \delta |V|}    (2)

and where Lidstone and Jeffreys advocate δ = 1. Gale and Church (1990; 1994) have argued that this method generally performs poorly.

The Good-Turing estimate (Good, 1953) is central to many smoothing techniques. It is not used directly for n-gram smoothing because, like additive smoothing, it does not perform the interpolation of lower- and higher-order models essential for good performance. Good-Turing states that an n-gram that occurs r times should be treated as if it had occurred r* times, where

    r^* = (r + 1) \frac{n_{r+1}}{n_r}

and where n_r is the number of n-grams that occur exactly r times in the training data.

Katz smoothing (1987) extends the intuitions of Good-Turing by adding the interpolation of higher-order models with lower-order models. It is perhaps the most widely used smoothing technique in speech recognition.

Church and Gale (1991) describe a smoothing method that combines the Good-Turing estimate with bucketing, the technique of partitioning a set of n-grams into disjoint groups, where each group is characterized independently through a set of parameters. Like Katz, models are defined recursively in terms of lower-order models. Each n-gram is assigned to one of several buckets based on its frequency predicted from lower-order models. Each bucket is treated as a separate distribution and Good-Turing estimation is performed within each, giving corrected counts that are normalized to yield probabilities.
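To make equation (2) and the Good-Turing estimate concrete, here is a small Python sketch (our own illustration, not code from the experiments reported below; the names and toy data are placeholders) that computes additive-smoothed bigram probabilities and Good-Turing adjusted counts from raw count tables:

    from collections import Counter

    def additive_bigram_prob(w_prev, w, bigram_counts, unigram_counts,
                             vocab_size, delta=1.0):
        """Additive (add-delta) smoothing, equation (2), for the bigram case n = 2."""
        num = bigram_counts[(w_prev, w)] + delta
        den = unigram_counts[w_prev] + delta * vocab_size
        return num / den

    def good_turing_adjusted_counts(ngram_counts):
        """Good-Turing adjusted counts r* = (r + 1) n_{r+1} / n_r.

        Returns a dict mapping each observed count r to r*.  In practice the n_r
        must themselves be smoothed for large r (e.g., Gale and Sampson, 1995);
        this sketch uses the raw n_r and leaves r unchanged when n_{r+1} = 0.
        """
        n_r = Counter(ngram_counts.values())   # n_r: number of n-grams seen exactly r times
        adjusted = {}
        for r in sorted(n_r):
            if n_r.get(r + 1, 0) > 0:
                adjusted[r] = (r + 1) * n_r[r + 1] / n_r[r]
            else:
                adjusted[r] = float(r)         # fall back to the raw count
        return adjusted

    # Toy usage on a three-sentence corpus.
    if __name__ == "__main__":
        corpus = [["it", "takes", "two"], ["it", "takes", "time"], ["time", "flies"]]
        unigrams, bigrams = Counter(), Counter()
        for sent in corpus:
            unigrams.update(sent)
            bigrams.update(zip(sent, sent[1:]))
        V = len(unigrams)
        print(additive_bigram_prob("it", "takes", bigrams, unigrams, V, delta=1.0))
        print(good_turing_adjusted_counts(bigrams))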

Figure 1: λ values for old and new bucketing schemes for Jelinek-Mercer smoothing; each point represents a single bucket. (Two scatter plots; the x-axis is the number of counts in the distribution for the old scheme and the average non-zero count in the distribution minus one for the new scheme, both on a logarithmic scale.)

The other smoothing technique besides Katz smoothing widely used in speech recognition is due to Jelinek and Mercer (1980). They present a class of smoothing models that involve linear interpolation, e.g., Brown et al. (1992) take

    P_{interp}(w_i \mid w_{i-n+1}^{i-1}) = \lambda_{w_{i-n+1}^{i-1}} P_{ML}(w_i \mid w_{i-n+1}^{i-1}) + (1 - \lambda_{w_{i-n+1}^{i-1}}) P_{interp}(w_i \mid w_{i-n+2}^{i-1})    (3)

That is, the maximum likelihood estimate is interpolated with the smoothed lower-order distribution, which is defined analogously. Training a distinct λ_{w_{i-n+1}^{i-1}} for each history w_{i-n+1}^{i-1} is not generally felicitous; Bahl, Jelinek, and Mercer (1983) suggest partitioning the λ_{w_{i-n+1}^{i-1}} into buckets according to c(w_{i-n+1}^{i-1}), where all λ_{w_{i-n+1}^{i-1}} in the same bucket are constrained to have the same value.

To yield meaningful results, the data used to estimate the λ_{w_{i-n+1}^{i-1}} need to be disjoint from the data used to calculate P_ML. (When the same data is used to estimate both, setting all λ_{w_{i-n+1}^{i-1}} to one yields the optimal result.) In held-out interpolation, one reserves a section of the training data for this purpose. Alternatively, Jelinek and Mercer describe a technique called deleted interpolation where different parts of the training data rotate in training either P_ML or the λ_{w_{i-n+1}^{i-1}}; the results are then averaged.

Several smoothing techniques are motivated within a Bayesian framework, including work by Nadas (1984) and MacKay and Peto (1995).

3 Novel Smoothing Techniques

Of the great many novel methods that we have tried, two techniques have performed especially well.

3.1 Method average-count

This scheme is an instance of Jelinek-Mercer smoothing. Referring to equation (3), recall that Bahl et al. suggest bucketing the λ_{w_{i-n+1}^{i-1}} according to c(w_{i-n+1}^{i-1}). We have found that partitioning the λ_{w_{i-n+1}^{i-1}} according to the average number of counts per non-zero element,

    \frac{c(w_{i-n+1}^{i-1})}{|\{ w_i : c(w_{i-n+1}^{i}) > 0 \}|}

yields better results.

Intuitively, the less sparse the data for estimating P_ML(w_i | w_{i-n+1}^{i-1}), the larger λ_{w_{i-n+1}^{i-1}} should be. While larger c(w_{i-n+1}^{i-1}) generally correspond to less sparse distributions, this quantity ignores the allocation of counts between words. For example, we would consider a distribution with ten counts distributed evenly among ten words to be much more sparse than a distribution with ten counts all on a single word. The average number of counts per word seems to more directly express the concept of sparseness.

In Figure 1, we graph the value of λ assigned to each bucket under the original and new bucketing schemes on identical data. Notice that the new bucketing scheme results in a much tighter plot, indicating that it is better at grouping together distributions with similar behavior.
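The following Python sketch (our own simplified illustration, not the C++ implementation evaluated below; the helper names, bucket edges, and λ values are placeholders) shows one way to realize equation (3) for a bigram model with a single λ per bucket, where buckets are assigned by the average count per non-zero element described above. The lower-order model is left as the raw unigram estimate for brevity, and the λ's would in practice be trained on held-out data (section 4.2.5).

    from collections import Counter

    def build_counts(corpus):
        """Unigram and bigram counts from tokenized sentences, with a start token."""
        unigrams, bigrams = Counter(), Counter()
        for sent in corpus:
            padded = ["<s>"] + sent
            unigrams.update(padded)
            bigrams.update(zip(padded, padded[1:]))
        return unigrams, bigrams

    def average_counts(unigrams, bigrams):
        """For each history, the average number of counts per non-zero successor,
        the bucketing statistic of method average-count."""
        nonzero = Counter(h for (h, _w) in bigrams)      # |{w : c(h w) > 0}|
        return {h: unigrams[h] / nonzero[h] for h in nonzero}

    def interp_prob(w, h, lambdas, bucket_edges, avg_count, unigrams, bigrams, n_words):
        """Equation (3) for n = 2.  lambdas[b] is the single lambda shared by bucket b."""
        stat = avg_count.get(h, 0.0)
        b = sum(1 for edge in bucket_edges if stat > edge)   # bucket index of history h
        lam = lambdas[b]
        p_ml = bigrams[(h, w)] / unigrams[h] if unigrams[h] else 0.0
        p_uni = unigrams[w] / n_words                        # lower-order model (unsmoothed here)
        return lam * p_ml + (1.0 - lam) * p_uni

    # Toy usage with two hand-set buckets and lambdas.
    if __name__ == "__main__":
        corpus = [["it", "takes", "two"], ["it", "takes", "time"], ["time", "flies"]]
        uni, bi = build_counts(corpus)
        avg = average_counts(uni, bi)
        n_words = sum(uni.values())
        print(interp_prob("takes", "it", lambdas=[0.3, 0.7], bucket_edges=[1.5],
                          avg_count=avg, unigrams=uni, bigrams=bi, n_words=n_words))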

3.2 Method one-count

This technique combines two intuitions. First, MacKay and Peto (1995) argue that a reasonable form for a smoothed distribution is

    P_{one}(w_i \mid w_{i-n+1}^{i-1}) = \frac{c(w_{i-n+1}^{i}) + \alpha P_{one}(w_i \mid w_{i-n+2}^{i-1})}{c(w_{i-n+1}^{i-1}) + \alpha}

The parameter α can be thought of as the number of counts being added to the given distribution, where the new counts are distributed as in the lower-order distribution. Secondly, the Good-Turing estimate can be interpreted as stating that the number of these extra counts should be proportional to the number of words with exactly one count in the given distribution. We have found that taking

    \alpha = \gamma \left[ n_1(w_{i-n+1}^{i-1}) + \beta \right]    (4)

works well, where

    n_1(w_{i-n+1}^{i-1}) = |\{ w_i : c(w_{i-n+1}^{i}) = 1 \}|

is the number of words with one count, and where β and γ are constants.
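As an illustration only (a minimal Python sketch, not the implementation used in our experiments), the one-count recursion for a bigram model might be written as follows; gamma and beta are the tunable constants of equation (4), the helper names are ours, and the unigram level is smoothed toward the uniform distribution as in the recursion of section 4.2.

    def one_count_bigram_prob(w, h, bigrams, unigrams, n_words, vocab_size,
                              gamma=2.0, beta=1.0):
        """Method one-count for n = 2.

        The bigram level interpolates with the unigram level, which in turn is
        smoothed against the uniform distribution 1/|V| (the n = 0 base case).
        gamma and beta are placeholders to be tuned on held-out data.
        """
        # n_1(h): number of words following h exactly once in the training data.
        n1 = sum(1 for (hist, _w), c in bigrams.items() if hist == h and c == 1)
        alpha = gamma * (n1 + beta)                       # equation (4)

        # Lower-order (unigram) distribution, itself smoothed toward uniform.
        n1_uni = sum(1 for c in unigrams.values() if c == 1)
        alpha_uni = gamma * (n1_uni + beta)
        p_lower = (unigrams[w] + alpha_uni / vocab_size) / (n_words + alpha_uni)

        return (bigrams[(h, w)] + alpha * p_lower) / (unigrams[h] + alpha)

    # Toy usage, reusing the count tables built in the previous sketch:
    # p = one_count_bigram_prob("takes", "it", bi, uni, sum(uni.values()), len(uni))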
4 Experimental Methodology

4.1 Data

We used the Penn treebank and TIPSTER corpora distributed by the Linguistic Data Consortium. From the treebank, we extracted text from the tagged Brown corpus, yielding about one million words. From TIPSTER, we used the Associated Press (AP), Wall Street Journal (WSJ), and San Jose Mercury News (SJM) data, yielding 123, 84, and 43 million words respectively. We created two distinct vocabularies, one for the Brown corpus and one for the TIPSTER data. The former vocabulary contains all 53,850 words occurring in Brown; the latter vocabulary consists of the 65,173 words occurring at least 70 times in TIPSTER.

For each experiment, we selected three segments of held-out data along with the segment of training data. One held-out segment was used as the test data for performance evaluation, and the other two were used as development test data for optimizing the parameters of each smoothing method. Each piece of held-out data was chosen to be roughly 50,000 words. This decision does not reflect practice very well, as when the training data size is less than 50,000 words it is not realistic to have so much development test data available. However, we made this decision to prevent us from having to optimize the training versus held-out data tradeoff for each data size. In addition, the development test data is used to optimize typically very few parameters, so in practice small held-out sets are generally adequate, and perhaps can be avoided altogether with techniques such as deleted estimation.

4.2 Smoothing Implementations

In this section, we discuss the details of our implementations of various smoothing techniques. Due to space limitations, these descriptions are not comprehensive; a more complete discussion is presented in Chen (1996). The titles of the following sections include the mnemonic we use to refer to the implementations in later sections. Unless otherwise specified, for those smoothing models defined recursively in terms of lower-order models, we end the recursion by taking the n = 0 distribution to be the uniform distribution P_unif(w_i) = 1/|V|. For each method, we highlight the parameters (e.g., the λ_n and δ below) that can be tuned to optimize performance. Parameter values are determined through training on held-out data.

4.2.1 Baseline Smoothing (interp-baseline)

For our baseline smoothing method, we use an instance of Jelinek-Mercer smoothing where we constrain all λ_{w_{i-n+1}^{i-1}} to be equal to a single value λ_n for each n, i.e.,

    P_{base}(w_i \mid w_{i-n+1}^{i-1}) = \lambda_n P_{ML}(w_i \mid w_{i-n+1}^{i-1}) + (1 - \lambda_n) P_{base}(w_i \mid w_{i-n+2}^{i-1})

4.2.2 Additive Smoothing (plus-one and plus-delta)

We consider two versions of additive smoothing. Referring to equation (2), we fix δ = 1 in plus-one smoothing. In plus-delta, we consider any δ.

4.2.3 Katz Smoothing (katz)

While the original paper (Katz, 1987) uses a single parameter k, we instead use a different k_n for each n > 1. We smooth the unigram distribution using additive smoothing with parameter δ.
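The katz implementation itself is not reproduced here, but the following Python sketch (our own simplification, with our own names and parameters, not the code evaluated below) illustrates the basic recipe that Katz smoothing builds on: Good-Turing discounting of observed bigram counts combined with back-off to a lower-order distribution for unseen bigrams.

    from collections import Counter

    def katz_style_bigram(bigrams, unigrams, n_words, vocab, k=5):
        """A simplified back-off model in the general style of Katz (1987).

        Bigram counts 1 <= r <= k are discounted with the Good-Turing estimate
        r* = (r + 1) n_{r+1} / n_r; the probability mass freed in each context is
        redistributed over unseen successors in proportion to the unigram model.
        """
        n_r = Counter(bigrams.values())

        def discounted(r):
            if 1 <= r <= k and n_r.get(r + 1, 0) > 0 and n_r.get(r, 0) > 0:
                return (r + 1) * n_r[r + 1] / n_r[r]
            return float(r)                    # leave large (reliable) counts alone

        def p_uni(w):
            return unigrams[w] / n_words       # additive smoothing of unigrams omitted here

        def prob(w, h):
            if unigrams[h] == 0:               # unseen history: fall back to unigrams
                return p_uni(w)
            if bigrams[(h, w)] > 0:
                return discounted(bigrams[(h, w)]) / unigrams[h]
            # Back-off weight: mass left over in context h, spread over unseen words.
            seen = [w2 for w2 in vocab if bigrams[(h, w2)] > 0]
            left_over = 1.0 - sum(discounted(bigrams[(h, w2)]) for w2 in seen) / unigrams[h]
            unseen_mass = 1.0 - sum(p_uni(w2) for w2 in seen)
            return left_over * p_uni(w) / unseen_mass if unseen_mass > 0 else 0.0

        return prob

    # Usage, reusing the earlier toy counts:
    # p = katz_style_bigram(bi, uni, sum(uni.values()), list(uni))("takes", "it")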

4.2.4 Church-Gale Smoothing (church-gale)

To smooth the counts n_r needed for the Good-Turing estimate, we use the technique described by Gale and Sampson (1995). We smooth the unigram distribution using Good-Turing without any bucketing.

Instead of the bucketing scheme described in the original paper, we use a scheme analogous to the one described by Bahl, Jelinek, and Mercer (1983). We make the assumption that whether a bucket is large enough for accurate Good-Turing estimation depends on how many n-grams with non-zero counts occur in it. Thus, instead of partitioning the space of P(w_{i-1})P(w_i) values in some uniform way as was done by Church and Gale, we partition the space so that at least c_min non-zero n-grams fall in each bucket.

Finally, the original paper describes only bigram smoothing in detail; extending this method to trigram smoothing is ambiguous. In particular, it is unclear whether to bucket trigrams according to P(w_{i-2}^{i-1})P(w_i) or P(w_{i-2}^{i-1})P(w_i | w_{i-1}). We chose the former; while the latter may yield better performance, our belief is that it is much more difficult to implement and that it requires a great deal more computation.

4.2.5 Jelinek-Mercer Smoothing (interp-held-out and interp-del-int)

We implemented two versions of Jelinek-Mercer smoothing differing only in what data is used to train the λ's. We bucket the λ_{w_{i-n+1}^{i-1}} according to c(w_{i-n+1}^{i-1}) as suggested by Bahl et al. Similar to our Church-Gale implementation, we choose buckets to ensure that at least c_min words in the data used to train the λ's fall in each bucket.

In interp-held-out, the λ's are trained using held-out interpolation on one of the development test sets. In interp-del-int, the λ's are trained using the relaxed deleted interpolation technique described by Jelinek and Mercer, where one word is deleted at a time. In interp-del-int, we bucket an n-gram according to its count before deletion, as this turned out to significantly improve performance.

4.2.6 Novel Smoothing Methods (new-avg-count and new-one-count)

The implementation new-avg-count, corresponding to smoothing method average-count, is identical to interp-held-out except that we use the novel bucketing scheme described in section 3.1. In the implementation new-one-count, we have different parameters β_n and γ_n in equation (4) for each n.

5 Results

In Figure 2, we display the performance of the interp-baseline method for bigram and trigram models on TIPSTER, Brown, and the WSJ subset of TIPSTER. In Figures 3-6, we display the relative performance of various smoothing techniques with respect to the baseline method on these corpora, as measured by difference in entropy. In the graphs on the left of Figures 2-4, each point represents an average over ten runs; the error bars represent the empirical standard deviation over these runs. Due to resource limitations, we only performed multiple runs for data sets of 50,000 sentences or less. Each point on the graphs on the right represents a single run, but we consider sizes up to the amount of data available. The graphs on the bottom of Figures 3-4 are close-ups of the graphs above, focusing on those algorithms that perform better than the baseline.

To give an idea of how these cross-entropy differences translate to perplexity, each 0.014 bits corresponds roughly to a 1% change in perplexity.
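As a quick check of that rule of thumb (our own arithmetic, not a result from the experiments): perplexity is two raised to the cross-entropy, so a change of ΔH bits scales the perplexity by

    2^{\Delta H} = 2^{0.014} \approx 1.0097,

that is, just under a 1% change.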
In each run except as noted below, optimal values for the parameters of the given technique were searched for using Powell's search algorithm as realized in Numerical Recipes in C (Press et al., 1988, pp. 309-317). Parameters were chosen to optimize the cross-entropy of one of the development test sets associated with the given training set. To constrain the search, we searched only those parameters that were found to affect performance significantly, as verified through preliminary experiments over several data sizes. For katz and church-gale, we did not perform the parameter search for training sets over 50,000 sentences due to resource constraints, and instead manually extrapolated parameter values from optimal values found on smaller data sizes. We ran interp-del-int only on sizes up to 50,000 sentences due to time constraints.

From these graphs, we see that additive smoothing performs poorly and that methods katz and interp-held-out consistently perform well. Our implementation church-gale performs poorly except on large bigram training sets, where it performs the best. The novel methods new-avg-count and new-one-count perform well uniformly across training data sizes, and are superior for trigram models. Notice that while performance is relatively consistent across corpora, it varies widely with respect to training set size and n-gram order.

The method interp-del-int performs significantly worse than interp-held-out, though they differ only in the data used to train the λ's. However, we delete one word at a time in interp-del-int; we hypothesize that deleting larger chunks would lead to more similar performance.

In Figure 7, we show how the values of the parameters δ and c_min affect the performance of methods katz and new-avg-count, respectively, over several training data sizes. Notice that poor parameter setting can lead to very significant losses in performance, and that optimal parameter settings depend on training set size.

To give an informal estimate of the difficulty of implementation of each method, in Table 1 we display the number of lines of C++ code in each implementation, excluding the core code common across techniques.

Table 1: Implementation difficulty of various methods in terms of lines of C++ code

    Method             Lines
    interp-baseline    400 (see note)
    plus-one           40
    plus-delta         40
    katz               300
    church-gale        1000
    interp-held-out    400
    interp-del-int     400
    new-avg-count      400
    new-one-count      50

(Note: to implement the baseline method, we just used the interp-held-out code, as it is a special case; written anew, it probably would have been about 50 lines.)
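For readers who want to reproduce the flavor of the parameter tuning described above without Powell's method, a crude grid search over δ for plus-delta against development-set cross-entropy suffices; the sketch below is our own illustration and reuses the cross_entropy and additive_bigram_prob helpers defined in the earlier sketches (it assumes counts built with the "<s>" start token).

    def tune_delta(dev_sentences, bigrams, unigrams, vocab_size,
                   grid=(0.0001, 0.001, 0.01, 0.1, 1.0, 10.0)):
        """Pick the additive-smoothing delta that minimizes development-set
        cross-entropy.  A crude stand-in for the Powell search used in the paper."""
        def sentence_prob(sentence, delta):
            p = 1.0
            for prev, w in zip(["<s>"] + sentence, sentence):
                p *= additive_bigram_prob(prev, w, bigrams, unigrams, vocab_size, delta)
            return p

        best_delta, best_h = None, float("inf")
        for delta in grid:
            h = cross_entropy(dev_sentences,
                              lambda s, d=delta: sentence_prob(s, d))
            if h < best_h:
                best_delta, best_h = delta, h
        return best_delta, best_h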


Figure 2: Baseline cross-entropy on test data; graph on left displays averages over ten runs for training sets up to 50,000 sentences, graph on right displays single runs for training sets up to 10,000,000 sentences

Figure 3: Trigram model on TIPSTER data; relative performance of various methods with respect to baseline; graphs on left display averages over ten runs for training sets up to 50,000 sentences, graphs on right display single runs for training sets up to 10,000,000 sentences; top graphs show all algorithms, bottom graphs zoom in on those methods that perform better than the baseline method


Figure 4: Bigram model on TIPSTER data; relative performance of various methods with respect to baseline; graphs on left display averages over ten runs for training sets up to 50,000 sentences, graphs on right display single runs for training sets up to 10,000,000 sentences; top graphs show all algorithms, bottom graphs zoom in on those methods that perform better than the baseline method


Figure 5: Bigram and trigram models on Brown corpus; relative performance of various methods with respect to baseline


Figure 6: Bigram and trigram models on Wall Street Journal corpus; relative performance of various methods with respect to baseline


Figure 7: Performance of katz and new-avg-count with respect to parameters δ and c_min, respectively

6 Discussion

To our knowledge, this is the first empirical comparison of smoothing techniques in language modeling of such scope: no other study has used multiple training data sizes, multiple corpora, or has performed parameter optimization. We show that in order to completely characterize the relative performance of two techniques, it is necessary to consider multiple training set sizes and to try both bigram and trigram models. Multiple runs should be performed whenever possible to discover whether any calculated differences are statistically significant. Furthermore, we show that sub-optimal parameter selection can also significantly affect relative performance.

We find that the two most widely used techniques, Katz smoothing and Jelinek-Mercer smoothing, perform consistently well across training set sizes for both bigram and trigram models, with Katz smoothing performing better on trigram models produced from large training sets and on bigram models in general. These results question the generality of the previous reference result concerning Katz smoothing: Katz (1987) reported that his method slightly outperforms an unspecified version of Jelinek-Mercer smoothing on a single training set of 750,000 words.

Furthermore, we show that Church-Gale smoothing, which previously had not been compared with common smoothing techniques, outperforms all existing methods on bigram models produced from large training sets. Finally, we find that our novel methods average-count and one-count are superior to existing methods for trigram models and perform well on bigram models; method one-count yields marginally worse performance but is extremely easy to implement.

In this study, we measure performance solely through the cross-entropy of test data; it would be interesting to see how these cross-entropy differences correlate with performance in end applications such as speech recognition. In addition, it would be interesting to see whether these results extend to fields other than language modeling where smoothing is used, such as prepositional phrase attachment (Collins and Brooks, 1995), part-of-speech tagging (Church, 1988), and stochastic parsing (Magerman, 1994).

Acknowledgements

The authors would like to thank Stuart Shieber and the anonymous reviewers for their comments on previous versions of this paper. We would also like to thank William Gale and Geoffrey Sampson for supplying us with code for "Good-Turing frequency estimation without tears." This research was supported by the National Science Foundation under Grant No. IRI-93-50192 and Grant No. CDA-94-01024. The second author was also supported by a National Science Foundation Graduate Student Fellowship.

References

Bahl, Lalit R., Frederick Jelinek, and Robert L. Mercer. 1983. A maximum likelihood approach to continuous speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-5(2):179-190, March.

Brown, Peter F., John Cocke, Stephen A. DellaPietra, Vincent J. DellaPietra, Frederick Jelinek, John D. Lafferty, Robert L. Mercer, and Paul S. Roossin. 1990. A statistical approach to machine translation. Computational Linguistics, 16(2):79-85, June.

Brown, Peter F., Stephen A. DellaPietra, Vincent J. DellaPietra, Jennifer C. Lai, and Robert L. Mercer. 1992. An estimate of an upper bound for the entropy of English. Computational Linguistics, 18(1):31-40, March.

Chen, Stanley F. 1996. Building Probabilistic Models for Natural Language. Ph.D. thesis, Harvard University. In preparation.

Church, Kenneth. 1988. A stochastic parts program and noun phrase parser for unrestricted text. In Proceedings of the Second Conference on Applied Natural Language Processing, pages 136-143.

Church, Kenneth W. and William A. Gale. 1991. A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams. Computer Speech and Language, 5:19-54.

Collins, Michael and James Brooks. 1995. Prepositional phrase attachment through a backed-off model. In David Yarowsky and Kenneth Church, editors, Proceedings of the Third Workshop on Very Large Corpora, pages 27-38, Cambridge, MA, June.

Gale, William A. and Kenneth W. Church. 1990. Estimation procedures for language context: poor estimates are worse than none. In COMPSTAT, Proceedings in Computational Statistics, 9th Symposium, pages 69-74, Dubrovnik, Yugoslavia, September.

Gale, William A. and Kenneth W. Church. 1994. What's wrong with adding one? In N. Oostdijk and P. de Haan, editors, Corpus-Based Research into Language. Rodopi, Amsterdam.

Gale, William A. and Geoffrey Sampson. 1995. Good-Turing frequency estimation without tears. Journal of Quantitative Linguistics, 2(3). To appear.

Good, I.J. 1953. The population frequencies of species and the estimation of population parameters. Biometrika, 40(3 and 4):237-264.

Jeffreys, H. 1948. Theory of Probability. Clarendon Press, Oxford, second edition.

Jelinek, Frederick and Robert L. Mercer. 1980. Interpolated estimation of Markov source parameters from sparse data. In Proceedings of the Workshop on Pattern Recognition in Practice, Amsterdam, The Netherlands: North-Holland, May.

Johnson, W.E. 1932. Probability: deductive and inductive problems. Mind, 41:421-423.

Katz, Slava M. 1987. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-35(3):400-401, March.

Kernighan, M.D., K.W. Church, and W.A. Gale. 1990. A spelling correction program based on a noisy channel model. In Proceedings of the Thirteenth International Conference on Computational Linguistics, pages 205-210.

Lidstone, G.J. 1920. Note on the general case of the Bayes-Laplace formula for inductive or a posteriori probabilities. Transactions of the Faculty of Actuaries, 8:182-192.

MacKay, David J. C. and Linda C. Peto. 1995. A hierarchical Dirichlet language model. Natural Language Engineering, 1(3):1-19.

Magerman, David M. 1994. Natural Language Parsing as Statistical Pattern Recognition. Ph.D. thesis, Stanford University, February.

Nadas, Arthur. 1984. Estimation of probabilities in the language model of the IBM speech recognition system. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-32(4):859-861, August.

Press, W.H., B.P. Flannery, S.A. Teukolsky, and W.T. Vetterling. 1988. Numerical Recipes in C. Cambridge University Press, Cambridge.
