
An Empirical Study of Smoothing Techniques for Language Modeling

Stanley F. Chen and Joshua Goodman
Harvard University, Aiken Computation Laboratory
33 Oxford St., Cambridge, MA 02138
sfc@eecs.harvard.edu, goodman@eecs.harvard.edu

Abstract

We present an extensive empirical comparison of several smoothing techniques in the domain of language modeling, including those described by Jelinek and Mercer (1980), Katz (1987), and Church and Gale (1991). We investigate for the first time how factors such as training data size, corpus (e.g., Brown versus Wall Street Journal), and n-gram order (bigram versus trigram) affect the relative performance of these methods, which we measure through the cross-entropy of test data. In addition, we introduce two novel smoothing techniques, one a variation of Jelinek-Mercer smoothing and one a very simple linear interpolation technique, both of which outperform existing methods.

1 Introduction

Smoothing is a technique essential in the construction of n-gram language models, a staple in speech recognition (Bahl, Jelinek, and Mercer, 1983) as well as many other domains (Church, 1988; Brown et al., 1990; Kernighan, Church, and Gale, 1990). A language model is a probability distribution over strings P(s) that attempts to reflect the frequency with which each string s occurs as a sentence in natural text. Language models are used in speech recognition to resolve acoustically ambiguous utterances. For example, if we have that P(it takes two) >> P(it takes too), then we know ceteris paribus to prefer the former transcription over the latter.

While smoothing is a central issue in language modeling, the literature lacks a definitive comparison between the many existing techniques. Previous studies (Nadas, 1984; Katz, 1987; Church and Gale, 1991; MacKay and Peto, 1995) only compare a small number of methods (typically two) on a single corpus and using a single training data size. As a result, it is currently difficult for a researcher to intelligently choose between smoothing schemes.

In this work, we carry out an extensive empirical comparison of the most widely used smoothing techniques, including those described by Jelinek and Mercer (1980), Katz (1987), and Church and Gale (1991). We carry out experiments over many training data sizes on varied corpora using both bigram and trigram models. We demonstrate that the relative performance of techniques depends greatly on training data size and n-gram order. For example, for bigram models produced from large training sets Church-Gale smoothing has superior performance, while Katz smoothing performs best on bigram models produced from smaller data. For the methods with parameters that can be tuned to improve performance, we perform an automated search for optimal values and show that sub-optimal parameter selection can significantly decrease performance. To our knowledge, this is the first smoothing work that systematically investigates any of these issues.

In addition, we introduce two novel smoothing techniques: the first belonging to the class of smoothing models described by Jelinek and Mercer, the second a very simple linear interpolation method. While being relatively simple to implement, we show that these methods yield good performance in bigram models and superior performance in trigram models.

We take the performance of a method m to be its cross-entropy on test data

    \frac{1}{N_T} \sum_{i=1}^{l_T} -\log_2 P_m(t_i)

where P_m(t_i) denotes the probability that the language model produced with method m assigns to sentence t_i, and where the test data T is composed of sentences (t_1, ..., t_{l_T}) and contains a total of N_T words. The entropy is inversely related to the average probability a model assigns to sentences in the test data, and it is generally assumed that lower entropy correlates with better performance in applications.
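As a concrete illustration of this evaluation measure (a minimal Python sketch for the reader, not code from our experiments; the function name sentence_prob and the toy data are placeholders):

    import math

    def cross_entropy(test_sentences, sentence_prob):
        """Per-word cross-entropy (bits/word) of a language model on test data.

        test_sentences: list of sentences, each a list of word tokens.
        sentence_prob:  callable mapping a sentence to the probability P_m(t_i)
                        assigned by the smoothing method m under evaluation.
        """
        total_neg_log_prob = 0.0   # running sum of -log2 P_m(t_i)
        total_words = 0            # N_T, total number of words in the test data
        for sentence in test_sentences:
            p = sentence_prob(sentence)
            total_neg_log_prob += -math.log2(p)
            total_words += len(sentence)
        return total_neg_log_prob / total_words

    # Toy usage with a hypothetical uniform model over a 10,000-word vocabulary.
    if __name__ == "__main__":
        uniform = lambda sentence: (1.0 / 10000) ** len(sentence)
        test = [["it", "takes", "two"], ["it", "takes", "too"]]
        print(cross_entropy(test, uniform))   # log2(10000), about 13.3 bits/word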

1.1 Smoothing n-gram Models

In n-gram language modeling, the probability of a string P(s) is expressed as the product of the probabilities of the words that compose the string, with each word probability conditional on the identity of the last n-1 words, i.e., if s = w_1 ... w_l we have

    P(s) = \prod_{i=1}^{l} P(w_i \mid w_1^{i-1}) \approx \prod_{i=1}^{l} P(w_i \mid w_{i-n+1}^{i-1})    (1)

where w_i^j denotes the words w_i ... w_j. Typically, n is taken to be two or three, corresponding to a bigram or trigram model, respectively. (To make the term P(w_i | w_{i-n+1}^{i-1}) meaningful for i < n, one can pad the beginning of the string with a distinguished token. In this work, we assume there are n-1 such distinguished tokens preceding each sentence.)

Consider the case n = 2. To estimate the probabilities P(w_i | w_{i-1}) in equation (1), one can acquire a large corpus of text, which we refer to as training data, and take

    P_{ML}(w_i \mid w_{i-1}) = \frac{P(w_{i-1} w_i)}{P(w_{i-1})} = \frac{c(w_{i-1} w_i)/N_s}{c(w_{i-1})/N_s} = \frac{c(w_{i-1} w_i)}{c(w_{i-1})}

where c(α) denotes the number of times the string α occurs in the text and N_s denotes the total number of words. This is called the maximum likelihood (ML) estimate for P(w_i | w_{i-1}).

While intuitive, the maximum likelihood estimate is a poor one when the amount of training data is small compared to the size of the model being built, as is generally the case in language modeling. For example, consider the situation where a pair of words, or bigram, say burnish the, does not occur in the training data. Then we have P_ML(the | burnish) = 0, which is clearly inaccurate as this probability should be larger than zero. A zero bigram probability can lead to errors in speech recognition, as it disallows the bigram regardless of how informative the acoustic signal is. The term smoothing describes techniques for adjusting the maximum likelihood estimate to hopefully produce more accurate probabilities.

As an example, one simple smoothing technique is to pretend each bigram occurs once more than it actually did (Lidstone, 1920; Johnson, 1932; Jeffreys, 1948), yielding

    P(w_i \mid w_{i-1}) = \frac{c(w_{i-1} w_i) + 1}{c(w_{i-1}) + |V|}

where V is the vocabulary, the set of all words being considered. This has the desirable quality of preventing zero bigram probabilities. However, this scheme has the flaw of assigning the same probability to, say, burnish the and burnish thou (assuming neither occurred in the training data), even though intuitively the former seems more likely because the word the is much more common than thou.

To address this, another smoothing technique is to interpolate the bigram model with a unigram model P_ML(w_i) = c(w_i)/N_s, a model that reflects how often each word occurs in the training data. For example, we can take

    P_{interp}(w_i \mid w_{i-1}) = \lambda P_{ML}(w_i \mid w_{i-1}) + (1 - \lambda) P_{ML}(w_i)

getting the behavior that bigrams involving common words are assigned higher probabilities (Jelinek and Mercer, 1980).

2 Previous Work

The simplest type of smoothing used in practice is additive smoothing (Lidstone, 1920; Johnson, 1932; Jeffreys, 1948), where we take

    P(w_i \mid w_{i-n+1}^{i-1}) = \frac{c(w_{i-n+1}^{i}) + \delta}{c(w_{i-n+1}^{i-1}) + \delta |V|}    (2)

and where Lidstone and Jeffreys advocate δ = 1. Gale and Church (1990; 1994) have argued that this method generally performs poorly.

The Good-Turing estimate (Good, 1953) is central to many smoothing techniques. It is not used directly for n-gram smoothing because, like additive smoothing, it does not perform the interpolation of lower- and higher-order models essential for good performance. Good-Turing states that an n-gram that occurs r times should be treated as if it had occurred r* times, where

    r^* = (r + 1) \frac{n_{r+1}}{n_r}

and where n_r is the number of n-grams that occur exactly r times in the training data.

Katz smoothing (1987) extends the intuitions of Good-Turing by adding the interpolation of higher-order models with lower-order models. It is perhaps the most widely used smoothing technique in speech recognition.

Church and Gale (1991) describe a smoothing method that combines the Good-Turing estimate with bucketing, the technique of partitioning a set of n-grams into disjoint groups, where each group is characterized independently through a set of parameters. Like Katz, models are defined recursively in terms of lower-order models. Each n-gram is assigned to one of several buckets based on its frequency predicted from lower-order models. Each bucket is treated as a separate distribution and Good-Turing estimation is performed within each, giving corrected counts that are normalized to yield probabilities.
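To make equation (2) and the Good-Turing estimate concrete, here is a small Python sketch (our own illustration, not code from the experiments reported below; the names and toy data are placeholders) that computes additive-smoothed bigram probabilities and Good-Turing adjusted counts from raw count tables:

    from collections import Counter

    def additive_bigram_prob(w_prev, w, bigram_counts, unigram_counts,
                             vocab_size, delta=1.0):
        """Additive (add-delta) smoothing, equation (2), for the bigram case n = 2."""
        num = bigram_counts[(w_prev, w)] + delta
        den = unigram_counts[w_prev] + delta * vocab_size
        return num / den

    def good_turing_adjusted_counts(ngram_counts):
        """Good-Turing adjusted counts r* = (r + 1) n_{r+1} / n_r.

        Returns a dict mapping each observed count r to r*.  In practice the n_r
        must themselves be smoothed for large r (e.g., Gale and Sampson, 1995);
        this sketch uses the raw n_r and leaves r unchanged when n_{r+1} = 0.
        """
        n_r = Counter(ngram_counts.values())   # n_r: number of n-grams seen exactly r times
        adjusted = {}
        for r in sorted(n_r):
            if n_r.get(r + 1, 0) > 0:
                adjusted[r] = (r + 1) * n_r[r + 1] / n_r[r]
            else:
                adjusted[r] = float(r)         # fall back to the raw count
        return adjusted

    # Toy usage on a three-sentence corpus.
    if __name__ == "__main__":
        corpus = [["it", "takes", "two"], ["it", "takes", "time"], ["time", "flies"]]
        unigrams, bigrams = Counter(), Counter()
        for sent in corpus:
            unigrams.update(sent)
            bigrams.update(zip(sent, sent[1:]))
        V = len(unigrams)
        print(additive_bigram_prob("it", "takes", bigrams, unigrams, V, delta=1.0))
        print(good_turing_adjusted_counts(bigrams))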

Figure 1: λ values for old and new bucketing schemes for Jelinek-Mercer smoothing; each point represents a single bucket. (Two scatter plots; the x-axis is the number of counts in the distribution for the old scheme and the average non-zero count in the distribution minus one for the new scheme, both on a logarithmic scale.)

The other smoothing technique besides Katz smoothing widely used in speech recognition is due to Jelinek and Mercer (1980). They present a class of smoothing models that involve linear interpolation, e.g., Brown et al. (1992) take

    P_{interp}(w_i \mid w_{i-n+1}^{i-1}) = \lambda_{w_{i-n+1}^{i-1}} P_{ML}(w_i \mid w_{i-n+1}^{i-1}) + (1 - \lambda_{w_{i-n+1}^{i-1}}) P_{interp}(w_i \mid w_{i-n+2}^{i-1})    (3)

That is, the maximum likelihood estimate is interpolated with the smoothed lower-order distribution, which is defined analogously. Training a distinct λ_{w_{i-n+1}^{i-1}} for each history w_{i-n+1}^{i-1} is not generally felicitous; Bahl, Jelinek, and Mercer (1983) suggest partitioning the λ_{w_{i-n+1}^{i-1}} into buckets according to c(w_{i-n+1}^{i-1}), where all λ_{w_{i-n+1}^{i-1}} in the same bucket are constrained to have the same value.

To yield meaningful results, the data used to estimate the λ_{w_{i-n+1}^{i-1}} need to be disjoint from the data used to calculate P_ML. (When the same data is used to estimate both, setting all λ_{w_{i-n+1}^{i-1}} to one yields the optimal result.) In held-out interpolation, one reserves a section of the training data for this purpose. Alternatively, Jelinek and Mercer describe a technique called deleted interpolation where different parts of the training data rotate in training either P_ML or the λ_{w_{i-n+1}^{i-1}}; the results are then averaged.

Several smoothing techniques are motivated within a Bayesian framework, including work by Nadas (1984) and MacKay and Peto (1995).

3 Novel Smoothing Techniques

Of the great many novel methods that we have tried, two techniques have performed especially well.

3.1 Method average-count

This scheme is an instance of Jelinek-Mercer smoothing. Referring to equation (3), recall that Bahl et al. suggest bucketing the λ_{w_{i-n+1}^{i-1}} according to c(w_{i-n+1}^{i-1}). We have found that partitioning the λ_{w_{i-n+1}^{i-1}} according to the average number of counts per non-zero element,

    \frac{c(w_{i-n+1}^{i-1})}{|\{ w_i : c(w_{i-n+1}^{i}) > 0 \}|}

yields better results.

Intuitively, the less sparse the data for estimating P_ML(w_i | w_{i-n+1}^{i-1}), the larger λ_{w_{i-n+1}^{i-1}} should be. While larger c(w_{i-n+1}^{i-1}) generally correspond to less sparse distributions, this quantity ignores the allocation of counts between words. For example, we would consider a distribution with ten counts distributed evenly among ten words to be much more sparse than a distribution with ten counts all on a single word. The average number of counts per word seems to more directly express the concept of sparseness.

In Figure 1, we graph the value of λ assigned to each bucket under the original and new bucketing schemes on identical data. Notice that the new bucketing scheme results in a much tighter plot, indicating that it is better at grouping together distributions with similar behavior.
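The following Python sketch (our own simplified illustration, not the C++ implementation evaluated below; the helper names, bucket edges, and λ values are placeholders) shows one way to realize equation (3) for a bigram model with a single λ per bucket, where buckets are assigned by the average count per non-zero element described above. The lower-order model is left as the raw unigram estimate for brevity, and the λ's would in practice be trained on held-out data (section 4.2.5).

    from collections import Counter

    def build_counts(corpus):
        """Unigram and bigram counts from tokenized sentences, with a start token."""
        unigrams, bigrams = Counter(), Counter()
        for sent in corpus:
            padded = ["<s>"] + sent
            unigrams.update(padded)
            bigrams.update(zip(padded, padded[1:]))
        return unigrams, bigrams

    def average_counts(unigrams, bigrams):
        """For each history, the average number of counts per non-zero successor,
        the bucketing statistic of method average-count."""
        nonzero = Counter(h for (h, _w) in bigrams)      # |{w : c(h w) > 0}|
        return {h: unigrams[h] / nonzero[h] for h in nonzero}

    def interp_prob(w, h, lambdas, bucket_edges, avg_count, unigrams, bigrams, n_words):
        """Equation (3) for n = 2.  lambdas[b] is the single lambda shared by bucket b."""
        stat = avg_count.get(h, 0.0)
        b = sum(1 for edge in bucket_edges if stat > edge)   # bucket index of history h
        lam = lambdas[b]
        p_ml = bigrams[(h, w)] / unigrams[h] if unigrams[h] else 0.0
        p_uni = unigrams[w] / n_words                        # lower-order model (unsmoothed here)
        return lam * p_ml + (1.0 - lam) * p_uni

    # Toy usage with two hand-set buckets and lambdas.
    if __name__ == "__main__":
        corpus = [["it", "takes", "two"], ["it", "takes", "time"], ["time", "flies"]]
        uni, bi = build_counts(corpus)
        avg = average_counts(uni, bi)
        n_words = sum(uni.values())
        print(interp_prob("takes", "it", lambdas=[0.3, 0.7], bucket_edges=[1.5],
                          avg_count=avg, unigrams=uni, bigrams=bi, n_words=n_words))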

3.2 Method one-count

This technique combines two intuitions. First, MacKay and Peto (1995) argue that a reasonable form for a smoothed distribution is

    P_{one}(w_i \mid w_{i-n+1}^{i-1}) = \frac{c(w_{i-n+1}^{i}) + \alpha P_{one}(w_i \mid w_{i-n+2}^{i-1})}{c(w_{i-n+1}^{i-1}) + \alpha}

The parameter α can be thought of as the number of counts being added to the given distribution, where the new counts are distributed as in the lower-order distribution. Secondly, the Good-Turing estimate can be interpreted as stating that the number of these extra counts should be proportional to the number of words with exactly one count in the given distribution. We have found that taking

    \alpha = \gamma \left[ n_1(w_{i-n+1}^{i-1}) + \beta \right]    (4)

works well, where

    n_1(w_{i-n+1}^{i-1}) = |\{ w_i : c(w_{i-n+1}^{i}) = 1 \}|

is the number of words with one count, and where β and γ are constants.
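As an illustration only (a minimal Python sketch, not the implementation used in our experiments), the one-count recursion for a bigram model might be written as follows; gamma and beta are the tunable constants of equation (4), the helper names are ours, and the unigram level is smoothed toward the uniform distribution as in the recursion of section 4.2.

    def one_count_bigram_prob(w, h, bigrams, unigrams, n_words, vocab_size,
                              gamma=2.0, beta=1.0):
        """Method one-count for n = 2.

        The bigram level interpolates with the unigram level, which in turn is
        smoothed against the uniform distribution 1/|V| (the n = 0 base case).
        gamma and beta are placeholders to be tuned on held-out data.
        """
        # n_1(h): number of words following h exactly once in the training data.
        n1 = sum(1 for (hist, _w), c in bigrams.items() if hist == h and c == 1)
        alpha = gamma * (n1 + beta)                       # equation (4)

        # Lower-order (unigram) distribution, itself smoothed toward uniform.
        n1_uni = sum(1 for c in unigrams.values() if c == 1)
        alpha_uni = gamma * (n1_uni + beta)
        p_lower = (unigrams[w] + alpha_uni / vocab_size) / (n_words + alpha_uni)

        return (bigrams[(h, w)] + alpha * p_lower) / (unigrams[h] + alpha)

    # Toy usage, reusing the count tables built in the previous sketch:
    # p = one_count_bigram_prob("takes", "it", bi, uni, sum(uni.values()), len(uni))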
4 Experimental Methodology

4.1 Data

We used the Penn treebank and TIPSTER corpora distributed by the Linguistic Data Consortium. From the treebank, we extracted text from the tagged Brown corpus, yielding about one million words. From TIPSTER, we used the Associated Press (AP), Wall Street Journal (WSJ), and San Jose Mercury News (SJM) data, yielding 123, 84, and 43 million words respectively. We created two distinct vocabularies, one for the Brown corpus and one for the TIPSTER data. The former vocabulary contains all 53,850 words occurring in Brown; the latter vocabulary consists of the 65,173 words occurring at least 70 times in TIPSTER.

For each experiment, we selected three segments of held-out data along with the segment of training data. One held-out segment was used as the test data for performance evaluation, and the other two were used as development test data for optimizing the parameters of each smoothing method. Each piece of held-out data was chosen to be roughly 50,000 words. This decision does not reflect practice very well, as when the training data size is less than 50,000 words it is not realistic to have so much development test data available. However, we made this decision to prevent us from having to optimize the training versus held-out data tradeoff for each data size. In addition, the development test data is used to optimize typically very few parameters, so in practice small held-out sets are generally adequate, and perhaps can be avoided altogether with techniques such as deleted estimation.

4.2 Smoothing Implementations

In this section, we discuss the details of our implementations of various smoothing techniques. Due to space limitations, these descriptions are not comprehensive; a more complete discussion is presented in Chen (1996). The titles of the following sections include the mnemonic we use to refer to the implementations in later sections. Unless otherwise specified, for those smoothing models defined recursively in terms of lower-order models, we end the recursion by taking the n = 0 distribution to be the uniform distribution P_unif(w_i) = 1/|V|. For each method, we highlight the parameters (e.g., the λ_n and δ below) that can be tuned to optimize performance. Parameter values are determined through training on held-out data.

4.2.1 Baseline Smoothing (interp-baseline)

For our baseline smoothing method, we use an instance of Jelinek-Mercer smoothing where we constrain all λ_{w_{i-n+1}^{i-1}} to be equal to a single value λ_n for each n, i.e.,

    P_{base}(w_i \mid w_{i-n+1}^{i-1}) = \lambda_n P_{ML}(w_i \mid w_{i-n+1}^{i-1}) + (1 - \lambda_n) P_{base}(w_i \mid w_{i-n+2}^{i-1})

4.2.2 Additive Smoothing (plus-one and plus-delta)

We consider two versions of additive smoothing. Referring to equation (2), we fix δ = 1 in plus-one smoothing. In plus-delta, we consider any δ.

4.2.3 Katz Smoothing (katz)

While the original paper (Katz, 1987) uses a single parameter k, we instead use a different k_n for each n > 1. We smooth the unigram distribution using additive smoothing with parameter δ.
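The katz implementation itself is not reproduced here, but the following Python sketch (our own simplification, with our own names and parameters, not the code evaluated below) illustrates the basic recipe that Katz smoothing builds on: Good-Turing discounting of observed bigram counts combined with back-off to a lower-order distribution for unseen bigrams.

    from collections import Counter

    def katz_style_bigram(bigrams, unigrams, n_words, vocab, k=5):
        """A simplified back-off model in the general style of Katz (1987).

        Bigram counts 1 <= r <= k are discounted with the Good-Turing estimate
        r* = (r + 1) n_{r+1} / n_r; the probability mass freed in each context is
        redistributed over unseen successors in proportion to the unigram model.
        """
        n_r = Counter(bigrams.values())

        def discounted(r):
            if 1 <= r <= k and n_r.get(r + 1, 0) > 0 and n_r.get(r, 0) > 0:
                return (r + 1) * n_r[r + 1] / n_r[r]
            return float(r)                    # leave large (reliable) counts alone

        def p_uni(w):
            return unigrams[w] / n_words       # additive smoothing of unigrams omitted here

        def prob(w, h):
            if unigrams[h] == 0:               # unseen history: fall back to unigrams
                return p_uni(w)
            if bigrams[(h, w)] > 0:
                return discounted(bigrams[(h, w)]) / unigrams[h]
            # Back-off weight: mass left over in context h, spread over unseen words.
            seen = [w2 for w2 in vocab if bigrams[(h, w2)] > 0]
            left_over = 1.0 - sum(discounted(bigrams[(h, w2)]) for w2 in seen) / unigrams[h]
            unseen_mass = 1.0 - sum(p_uni(w2) for w2 in seen)
            return left_over * p_uni(w) / unseen_mass if unseen_mass > 0 else 0.0

        return prob

    # Usage, reusing the earlier toy counts:
    # p = katz_style_bigram(bi, uni, sum(uni.values()), list(uni))("takes", "it")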

4.2.4 Church-Gale Smoothing (church-gale)

To smooth the counts n_r needed for the Good-Turing estimate, we use the technique described by Gale and Sampson (1995). We smooth the unigram distribution using Good-Turing without any bucketing.

Instead of the bucketing scheme described in the original paper, we use a scheme analogous to the one described by Bahl, Jelinek, and Mercer (1983). We make the assumption that whether a bucket is large enough for accurate Good-Turing estimation depends on how many n-grams with non-zero counts occur in it. Thus, instead of partitioning the space of P(w_{i-1})P(w_i) values in some uniform way as was done by Church and Gale, we partition the space so that at least c_min non-zero n-grams fall in each bucket.

Finally, the original paper describes only bigram smoothing in detail; extending this method to trigram smoothing is ambiguous. In particular, it is unclear whether to bucket trigrams according to P(w_{i-2}^{i-1})P(w_i) or P(w_{i-2}^{i-1})P(w_i | w_{i-1}). We chose the former; while the latter may yield better performance, our belief is that it is much more difficult to implement and that it requires a great deal more computation.

4.2.5 Jelinek-Mercer Smoothing (interp-held-out and interp-del-int)

We implemented two versions of Jelinek-Mercer smoothing differing only in what data is used to train the λ's. We bucket the λ_{w_{i-n+1}^{i-1}} according to c(w_{i-n+1}^{i-1}) as suggested by Bahl et al. Similar to our Church-Gale implementation, we choose buckets to ensure that at least c_min words in the data used to train the λ's fall in each bucket.

In interp-held-out, the λ's are trained using held-out interpolation on one of the development test sets. In interp-del-int, the λ's are trained using the relaxed deleted interpolation technique described by Jelinek and Mercer, where one word is deleted at a time. In interp-del-int, we bucket an n-gram according to its count before deletion, as this turned out to significantly improve performance.

4.2.6 Novel Smoothing Methods (new-avg-count and new-one-count)

The implementation new-avg-count, corresponding to smoothing method average-count, is identical to interp-held-out except that we use the novel bucketing scheme described in section 3.1. In the implementation new-one-count, we have different parameters β_n and γ_n in equation (4) for each n.

5 Results

In Figure 2, we display the performance of the interp-baseline method for bigram and trigram models on TIPSTER, Brown, and the WSJ subset of TIPSTER. In Figures 3-6, we display the relative performance of various smoothing techniques with respect to the baseline method on these corpora, as measured by difference in entropy. In the graphs on the left of Figures 2-4, each point represents an average over ten runs; the error bars represent the empirical standard deviation over these runs. Due to resource limitations, we only performed multiple runs for data sets of 50,000 sentences or less. Each point on the graphs on the right represents a single run, but we consider sizes up to the amount of data available. The graphs on the bottom of Figures 3-4 are close-ups of the graphs above, focusing on those algorithms that perform better than the baseline.

To give an idea of how these cross-entropy differences translate to perplexity, each 0.014 bits corresponds roughly to a 1% change in perplexity.
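As a quick check of that rule of thumb (our own arithmetic, not a result from the experiments): perplexity is two raised to the cross-entropy, so a change of ΔH bits scales the perplexity by

    2^{\Delta H} = 2^{0.014} \approx 1.0097,

that is, just under a 1% change.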
In each run except as noted below, optimal values for the parameters of the given technique were searched for using Powell's search algorithm as realized in Numerical Recipes in C (Press et al., 1988, pp. 309-317). Parameters were chosen to optimize the cross-entropy of one of the development test sets associated with the given training set. To constrain the search, we searched only those parameters that were found to affect performance significantly, as verified through preliminary experiments over several data sizes. For katz and church-gale, we did not perform the parameter search for training sets over 50,000 sentences due to resource constraints, and instead manually extrapolated parameter values from optimal values found on smaller data sizes. We ran interp-del-int only on sizes up to 50,000 sentences due to time constraints.

From these graphs, we see that additive smoothing performs poorly and that methods katz and interp-held-out consistently perform well. Our implementation church-gale performs poorly except on large bigram training sets, where it performs the best. The novel methods new-avg-count and new-one-count perform well uniformly across training data sizes, and are superior for trigram models. Notice that while performance is relatively consistent across corpora, it varies widely with respect to training set size and n-gram order.

The method interp-del-int performs significantly worse than interp-held-out, though they differ only in the data used to train the λ's. However, we delete one word at a time in interp-del-int; we hypothesize that deleting larger chunks would lead to more similar performance.

In Figure 7, we show how the values of the parameters δ and c_min affect the performance of methods katz and new-avg-count, respectively, over several training data sizes. Notice that poor parameter setting can lead to very significant losses in performance, and that optimal parameter settings depend on training set size.

To give an informal estimate of the difficulty of implementation of each method, in Table 1 we display the number of lines of C++ code in each implementation, excluding the core code common across techniques.

Table 1: Implementation difficulty of various methods in terms of lines of C++ code

    Method             Lines
    interp-baseline    400 (see note)
    plus-one           40
    plus-delta         40
    katz               300
    church-gale        1000
    interp-held-out    400
    interp-del-int     400
    new-avg-count      400
    new-one-count      50

(Note: to implement the baseline method, we just used the interp-held-out code, as it is a special case; written anew, it probably would have been about 50 lines.)
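For readers who want to reproduce the flavor of the parameter tuning described above without Powell's method, a crude grid search over δ for plus-delta against development-set cross-entropy suffices; the sketch below is our own illustration and reuses the cross_entropy and additive_bigram_prob helpers defined in the earlier sketches (it assumes counts built with the "<s>" start token).

    def tune_delta(dev_sentences, bigrams, unigrams, vocab_size,
                   grid=(0.0001, 0.001, 0.01, 0.1, 1.0, 10.0)):
        """Pick the additive-smoothing delta that minimizes development-set
        cross-entropy.  A crude stand-in for the Powell search used in the paper."""
        def sentence_prob(sentence, delta):
            p = 1.0
            for prev, w in zip(["<s>"] + sentence, sentence):
                p *= additive_bigram_prob(prev, w, bigrams, unigrams, vocab_size, delta)
            return p

        best_delta, best_h = None, float("inf")
        for delta in grid:
            h = cross_entropy(dev_sentences,
                              lambda s, d=delta: sentence_prob(s, d))
            if h < best_h:
                best_delta, best_h = delta, h
        return best_delta, best_h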


Figure 2: Baseline cross-entropy on test data; graph on left displays averages over ten runs for training sets up to 50,000 sentences, graph on right displays single runs for training sets up to 10,000,000 sentences

Figure 3: Trigram model on TIPSTER data; relative performance of various methods with respect to baseline; graphs on left display averages over ten runs for training sets up to 50,000 sentences, graphs on right display single runs for training sets up to 10,000,000 sentences; top graphs show all algorithms, bottom graphs zoom in on those methods that perform better than the baseline method


Figure 4: Bigram model on TIPSTER data; relative performance of various methods with respect to baseline; graphs on left display averages over ten runs for training sets up to 50,000 sentences, graphs on right display single runs for training sets up to 10,000,000 sentences; top graphs show all algorithms, bottom graphs zoom in on those methods that perform better than the baseline method


Figure 5: Bigram and trigram models on Brown corpus; relative performance of various methods with respect to baseline


Figure 6: Bigram and trigram models on Wall Street Journal corpus; relative performance of various methods with respect to baseline


Figure 7: Performance of katz and new-avg-count with respect to parameters δ and c_min, respectively

6 Discussion

To our knowledge, this is the first empirical comparison of smoothing techniques in language modeling of such scope: no other study has used multiple training data sizes, multiple corpora, or has performed parameter optimization. We show that in order to completely characterize the relative performance of two techniques, it is necessary to consider multiple training set sizes and to try both bigram and trigram models. Multiple runs should be performed whenever possible to discover whether any calculated differences are statistically significant. Furthermore, we show that sub-optimal parameter selection can also significantly affect relative performance.

We find that the two most widely used techniques, Katz smoothing and Jelinek-Mercer smoothing, perform consistently well across training set sizes for both bigram and trigram models, with Katz smoothing performing better on trigram models produced from large training sets and on bigram models in general. These results question the generality of the previous reference result concerning Katz smoothing: Katz (1987) reported that his method slightly outperforms an unspecified version of Jelinek-Mercer smoothing on a single training set of 750,000 words.

Furthermore, we show that Church-Gale smoothing, which previously had not been compared with common smoothing techniques, outperforms all existing methods on bigram models produced from large training sets. Finally, we find that our novel methods average-count and one-count are superior to existing methods for trigram models and perform well on bigram models; method one-count yields marginally worse performance but is extremely easy to implement.

In this study, we measure performance solely through the cross-entropy of test data; it would be interesting to see how these cross-entropy differences correlate with performance in end applications such as speech recognition. In addition, it would be interesting to see whether these results extend to fields other than language modeling where smoothing is used, such as prepositional phrase attachment (Collins and Brooks, 1995), part-of-speech tagging (Church, 1988), and stochastic parsing (Magerman, 1994).

Acknowledgements

The authors would like to thank Stuart Shieber and the anonymous reviewers for their comments on previous versions of this paper. We would also like to thank William Gale and Geoffrey Sampson for supplying us with code for "Good-Turing frequency estimation without tears." This research was supported by the National Science Foundation under Grant No. IRI-93-50192 and Grant No. CDA-94-01024. The second author was also supported by a National Science Foundation Graduate Student Fellowship.

References

Bahl, Lalit R., Frederick Jelinek, and Robert L. Mercer. 1983. A maximum likelihood approach to continuous speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-5(2):179-190, March.

Brown, Peter F., John Cocke, Stephen A. DellaPietra, Vincent J. DellaPietra, Frederick Jelinek, John D. Lafferty, Robert L. Mercer, and Paul S. Roossin. 1990. A statistical approach to machine translation. Computational Linguistics, 16(2):79-85, June.

Brown, Peter F., Stephen A. DellaPietra, Vincent J. DellaPietra, Jennifer C. Lai, and Robert L. Mercer. 1992. An estimate of an upper bound for the entropy of English. Computational Linguistics, 18(1):31-40, March.

Chen, Stanley F. 1996. Building Probabilistic Models for Natural Language. Ph.D. thesis, Harvard University. In preparation.

Church, Kenneth. 1988. A stochastic parts program and noun phrase parser for unrestricted text. In Proceedings of the Second Conference on Applied Natural Language Processing, pages 136-143.

Church, Kenneth W. and William A. Gale. 1991. A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams. Computer Speech and Language, 5:19-54.

Collins, Michael and James Brooks. 1995. Prepositional phrase attachment through a backed-off model. In David Yarowsky and Kenneth Church, editors, Proceedings of the Third Workshop on Very Large Corpora, pages 27-38, Cambridge, MA, June.

Gale, William A. and Kenneth W. Church. 1990. Estimation procedures for language context: poor estimates are worse than none. In COMPSTAT, Proceedings in Computational Statistics, 9th Symposium, pages 69-74, Dubrovnik, Yugoslavia, September.

Gale, William A. and Kenneth W. Church. 1994. What's wrong with adding one? In N. Oostdijk and P. de Haan, editors, Corpus-Based Research into Language. Rodopi, Amsterdam.

Gale, William A. and Geoffrey Sampson. 1995. Good-Turing frequency estimation without tears. Journal of Quantitative Linguistics, 2(3). To appear.

Good, I.J. 1953. The population frequencies of species and the estimation of population parameters. Biometrika, 40(3 and 4):237-264.

Jeffreys, H. 1948. Theory of Probability. Clarendon Press, Oxford, second edition.

Jelinek, Frederick and Robert L. Mercer. 1980. Interpolated estimation of Markov source parameters from sparse data. In Proceedings of the Workshop on Pattern Recognition in Practice, Amsterdam, The Netherlands: North-Holland, May.

Johnson, W.E. 1932. Probability: deductive and inductive problems. Mind, 41:421-423.

Katz, Slava M. 1987. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-35(3):400-401, March.

Kernighan, M.D., K.W. Church, and W.A. Gale. 1990. A spelling correction program based on a noisy channel model. In Proceedings of the Thirteenth International Conference on Computational Linguistics, pages 205-210.

Lidstone, G.J. 1920. Note on the general case of the Bayes-Laplace formula for inductive or a posteriori probabilities. Transactions of the Faculty of Actuaries, 8:182-192.

MacKay, David J. C. and Linda C. Peto. 1995. A hierarchical Dirichlet language model. Natural Language Engineering, 1(3):1-19.

Magerman, David M. 1994. Natural Language Parsing as Statistical Pattern Recognition. Ph.D. thesis, Stanford University, February.

Nadas, Arthur. 1984. Estimation of probabilities in the language model of the IBM speech recognition system. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-32(4):859-861, August.

Press, W.H., B.P. Flannery, S.A. Teukolsky, and W.T. Vetterling. 1988. Numerical Recipes in C. Cambridge University Press, Cambridge.
