
Speech Recognition, Lecture 5: N-gram Language Models

Mehryar Mohri, Courant Institute and Research, [email protected]

Language Models

Definition: probability distribution Pr[w] over sequences of words w = w1 ... wk.
• Critical component of a speech recognition system.
Problems:
• Learning: use a large corpus (e.g., several million words) to estimate Pr[w]. Models in this course: n-gram models, maximum entropy models.
• Efficiency: computational representation and use.

This Lecture

• n-gram models: definition and problems
• Good-Turing estimate
• Smoothing techniques
• Evaluation
• Representation of n-gram models
• Shrinking
• LMs based on probabilistic automata

N-Gram Models

Definition: an n-gram model is a probability distribution based on a Markov assumption of order n − 1:

$\forall i, \quad \Pr[w_i \mid w_1 \ldots w_{i-1}] = \Pr[w_i \mid h_i], \qquad |h_i| \le n - 1.$
• Most widely used language models.
Consequence: by the chain rule,
$\Pr[w] = \prod_{i=1}^{k} \Pr[w_i \mid w_1 \ldots w_{i-1}] = \prod_{i=1}^{k} \Pr[w_i \mid h_i].$
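To make the factorization concrete, here is a minimal sketch (not from the lecture) that scores a sentence under a bigram model (n = 2), where the history h_i is just the previous word; the dictionary bigram_prob and the start symbol <s> are illustrative assumptions.

```python
import math
from typing import Dict, Tuple

def sentence_log_prob(words, bigram_prob: Dict[Tuple[str, str], float]) -> float:
    """Return log Pr[w] = sum_i log Pr[w_i | h_i] under a bigram model."""
    logp = 0.0
    history = "<s>"                       # assumed start-of-sentence symbol
    for w in words:
        p = bigram_prob.get((history, w), 0.0)
        if p == 0.0:
            return float("-inf")          # unseen bigram: zero probability before smoothing
        logp += math.log(p)
        history = w                       # first-order Markov assumption: keep only the last word
    return logp
```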

Maximum Likelihood

Likelihood: probability of observing the sample under distribution p ∈ P, which, given the independence assumption, is
$\Pr[x_1, \ldots, x_m] = \prod_{i=1}^{m} p(x_i).$
Principle: select the distribution maximizing the sample probability,
$\hat{p} = \underset{p \in P}{\operatorname{argmax}} \prod_{i=1}^{m} p(x_i), \quad \text{or} \quad \hat{p} = \underset{p \in P}{\operatorname{argmax}} \sum_{i=1}^{m} \log p(x_i).$

Example: Bernoulli Trials

Problem: find the most likely Bernoulli distribution, given a sequence of coin flips H, T, T, H, T, H, T, H, H, H, T, T, . . . , H.

Bernoulli distribution: $p(H) = \theta$, $p(T) = 1 - \theta$.

Likelihood:
$l(p) = \log \big[\theta^{N(H)} (1 - \theta)^{N(T)}\big] = N(H) \log \theta + N(T) \log(1 - \theta).$
Solution: $l$ is differentiable and concave;
$\frac{dl(p)}{d\theta} = \frac{N(H)}{\theta} - \frac{N(T)}{1 - \theta} = 0 \;\Leftrightarrow\; \theta = \frac{N(H)}{N(H) + N(T)}.$
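As a quick sanity check of the closed form, the following snippet (illustrative only; it uses just the flips listed above and ignores the elided ones) recovers θ as the empirical fraction of heads.

```python
flips = ["H", "T", "T", "H", "T", "H", "T", "H", "H", "H", "T", "T", "H"]
n_h, n_t = flips.count("H"), flips.count("T")
theta_mle = n_h / (n_h + n_t)   # = N(H) / (N(H) + N(T))
print(theta_mle)                # 7/13 ≈ 0.538 for this partial sample
```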

Maximum Likelihood Estimation

Definitions:
• n-gram: sequence of n consecutive words.
• S: sample or corpus of size m.
• c(w_1 ... w_k): count of the sequence w_1 ... w_k.
ML estimates: for $c(w_1 \ldots w_{n-1}) \neq 0$,
$\Pr[w_n \mid w_1 \ldots w_{n-1}] = \frac{c(w_1 \ldots w_n)}{c(w_1 \ldots w_{n-1})}.$
• But $c(w_1 \ldots w_n) = 0 \implies \Pr[w_n \mid w_1 \ldots w_{n-1}] = 0$!
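A small sketch of these ML estimates (an assumed helper, not part of the lecture): it pads each sentence with hypothetical <s> and </s> symbols, accumulates counts, and divides; any n-gram absent from the corpus gets probability zero, which is exactly the problem addressed next.

```python
from collections import Counter

def ml_ngram_probs(corpus, n):
    """ML estimates Pr[w_n | w_1..w_{n-1}] = c(w_1..w_n) / c(w_1..w_{n-1})."""
    ngram_counts, history_counts = Counter(), Counter()
    for sent in corpus:
        words = ["<s>"] * (n - 1) + sent + ["</s>"]      # assumed padding convention
        for i in range(n - 1, len(words)):
            h = tuple(words[i - n + 1 : i])
            ngram_counts[h + (words[i],)] += 1
            history_counts[h] += 1
    return {ng: c / history_counts[ng[:-1]] for ng, c in ngram_counts.items()}

probs = ml_ngram_probs([["the", "cat", "sat"], ["the", "dog", "sat"]], n=2)
```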

N-Gram Model Problems

Sparsity: assigning probability zero to sequences not found in the sample ⟹ speech recognition errors.
• Smoothing: adjusting ML estimates to reserve probability mass for unseen events. Central techniques in language modeling.
• Class-based models: create models based on classes (e.g., DAY) or phrases.
Representation: for |Σ| = 100,000, the number of bigrams is 10^10 and the number of trigrams 10^15!
• Weighted automata: exploiting sparsity.

Smoothing Techniques

Typical form: interpolation of n-gram models, e.g., of trigram, bigram, and unigram frequencies.

$\Pr[w_3 \mid w_1 w_2] = \alpha_1 c(w_3 \mid w_1 w_2) + \alpha_2 c(w_3 \mid w_2) + \alpha_3 c(w_3).$
Some widely used techniques:
• Katz back-off models (Katz, 1987).
• Interpolated models (Jelinek and Mercer, 1980).
• Kneser-Ney models (Kneser and Ney, 1995).

Good-Turing Estimate (Good, 1953)

Definitions:
• Sample S of m words drawn from Σ.
• c(x): count of x in S.
• S_k: set of words appearing exactly k times.
• M_k: probability of drawing a point in S_k.
Good-Turing estimate of M_k:
$G_k = \frac{k+1}{m}\, |S_{k+1}|.$

Properties

Theorem: the Good-Turing estimate is an estimate of M_k with small bias, for small values of k/m:
$E_S[G_k] = E_S[M_k] + O\Big(\frac{k+1}{m}\Big).$
Proof:
$E_S[M_k] = \sum_{x \in \Sigma} \Pr[x]\, \Pr[x \in S_k]$
$= \sum_{x \in \Sigma} \Pr[x] \binom{m}{k} \Pr[x]^k (1 - \Pr[x])^{m-k}$
$= \sum_{x \in \Sigma} \binom{m}{k+1} \frac{k+1}{m-k}\, \Pr[x]^{k+1} (1 - \Pr[x])^{m-(k+1)} (1 - \Pr[x])$
$= \frac{k+1}{m-k} E\big[|S_{k+1}|\big] - \frac{k+1}{m-k} E[M_{k+1}] = \frac{m}{m-k} E[G_k] - \frac{k+1}{m-k} E[M_{k+1}].$

Proof (cont.): thus,
$\big|E_S[M_k] - E_S[G_k]\big| = \Big|\frac{k}{m-k} E[G_k] - \frac{k+1}{m-k} E[M_{k+1}]\Big| \le \frac{k+1}{m-k}, \qquad (0 \le E[G_k], E[M_{k+1}] \le 1)$
In particular, for k = 0,
$\big|E_S[M_0] - E_S[G_0]\big| \le \frac{1}{m}.$
It can be proved using McDiarmid's inequality that, with probability at least 1 − δ (McAllester and Schapire, 2000),

$M_0 \le G_0 + O\Big(\sqrt{\tfrac{\log(1/\delta)}{m}}\Big).$

Good-Turing Count Estimate

Definition: let $r = c(w_1 \ldots w_k)$ and $n_r = |S_r|$; then

$c^*(w_1 \ldots w_k) = \frac{G_r}{|S_r|} \times m = (r+1)\, \frac{n_{r+1}}{n_r}.$
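A small sketch of the adjusted counts (assumptions: the counts-of-counts n_r are taken from the model's own n-gram counts, and the raw count is kept as a fallback when n_{r+1} = 0, a common practical workaround not discussed above).

```python
from collections import Counter

def good_turing_counts(ngram_counts):
    """Adjusted counts c*(w_1..w_k) = (r + 1) * n_{r+1} / n_r for observed n-grams."""
    n = Counter(ngram_counts.values())        # n_r = number of n-grams seen exactly r times
    adjusted = {}
    for ng, r in ngram_counts.items():
        if n[r + 1] > 0:
            adjusted[ng] = (r + 1) * n[r + 1] / n[r]
        else:
            adjusted[ng] = float(r)           # no data for n_{r+1}: keep the raw count (assumed fallback)
    return adjusted
```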

Simple Method (Jeffreys, 1948)

Additive smoothing: add δ ∈ [0, 1] to the count of each n-gram:
$\Pr[w_n \mid w_1 \ldots w_{n-1}] = \frac{c(w_1 \ldots w_n) + \delta}{c(w_1 \ldots w_{n-1}) + \delta |\Sigma|}.$
• Poor performance (Gale and Church, 1994).
• Not a principled attempt to estimate or make use of an estimate of the missing mass.
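The additive estimate is a one-liner; the sketch below assumes (for illustration) that the counts are stored in plain dictionaries keyed by word tuples.

```python
def additive_prob(ngram, ngram_counts, history_counts, vocab_size, delta=0.5):
    """Add-delta estimate of Pr[w_n | w_1..w_{n-1}], with delta in [0, 1]."""
    history = ngram[:-1]
    return (ngram_counts.get(ngram, 0) + delta) / (history_counts.get(history, 0) + delta * vocab_size)
```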

Katz Back-off Model (Katz, 1987)

Idea: back off to a lower order model for zero counts.

• If $c(w_1^{n-1}) = 0$, then $\Pr[w_n \mid w_1^{n-1}] = \Pr[w_n \mid w_2^{n-1}]$.
• Otherwise,
$\Pr[w_n \mid w_1^{n-1}] = \begin{cases} d_{c(w_1^n)} \dfrac{c(w_1^n)}{c(w_1^{n-1})} & \text{if } c(w_1^n) > 0, \\ \beta\, \Pr[w_n \mid w_2^{n-1}] & \text{otherwise,} \end{cases}$
where $d_k$ is a discount coefficient such that
$d_k = \begin{cases} 1 & \text{if } k > K, \\ \approx \dfrac{(k+1)\, n_{k+1}}{k\, n_k} & \text{otherwise.} \end{cases}$
Katz suggests K = 5.
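The following is a rough sketch of the back-off computation for the bigram case (n = 2), not an exact transcription of Katz's recipe: p_unigram is an assumed smoothed unigram distribution, d maps counts to discount coefficients, and the back-off weight beta is recomputed on the fly from the discounted mass, which is only practical for toy data.

```python
def katz_bigram_prob(w_prev, w, unigram_counts, bigram_counts, d, p_unigram):
    """Katz back-off estimate of Pr[w | w_prev] for a bigram model (sketch)."""
    c_hist = unigram_counts.get(w_prev, 0)
    if c_hist == 0:                              # unseen history: back off directly
        return p_unigram[w]
    c_bi = bigram_counts.get((w_prev, w), 0)
    if c_bi > 0:
        return d.get(c_bi, 1.0) * c_bi / c_hist  # d_k = 1 for k > K
    # beta redistributes the mass left over after discounting the seen successors
    seen = [v for (u, v) in bigram_counts if u == w_prev]
    kept = sum(d.get(bigram_counts[(w_prev, v)], 1.0) * bigram_counts[(w_prev, v)]
               for v in seen) / c_hist
    unseen_mass = 1.0 - sum(p_unigram[v] for v in set(seen))
    beta = (1.0 - kept) / unseen_mass
    return beta * p_unigram[w]
```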

Discount Coefficient

With the Good-Turing estimate, the total probability mass for unseen n-grams is $G_0 = n_1/m$.

$\sum_{c(w_1^n) > 0} d_{c(w_1^n)}\, c(w_1^n)/m = 1 - n_1/m$
$\Leftrightarrow \sum_{k>0} d_k\, k\, n_k = m - n_1$
$\Leftrightarrow \sum_{k>0} d_k\, k\, n_k = \sum_{k>0} k\, n_k - n_1$
$\Leftrightarrow \sum_{k>0} (1 - d_k)\, k\, n_k = n_1$
$\Leftrightarrow \sum_{k=1}^{K} (1 - d_k)\, k\, n_k = n_1.$

Solution: a search with $1 - d_k = \mu \big(1 - \frac{k^*}{k}\big)$ leads to
$\mu \sum_{k=1}^{K} \Big(1 - \frac{k^*}{k}\Big)\, k\, n_k = n_1$
$\Leftrightarrow \mu \sum_{k=1}^{K} \Big(1 - \frac{(k+1)\, n_{k+1}}{k\, n_k}\Big)\, k\, n_k = n_1$
$\Leftrightarrow \mu \sum_{k=1}^{K} \big[k\, n_k - (k+1)\, n_{k+1}\big] = n_1.$

$\Leftrightarrow \mu\, \big[n_1 - (K+1)\, n_{K+1}\big] = n_1$
$\Leftrightarrow \mu = \frac{1}{1 - \frac{(K+1)\, n_{K+1}}{n_1}}$
$\Leftrightarrow d_k = \frac{\frac{k^*}{k} - \frac{(K+1)\, n_{K+1}}{n_1}}{1 - \frac{(K+1)\, n_{K+1}}{n_1}}.$
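In code, the closed-form discounts follow directly from the counts-of-counts; the sketch below assumes n_counts maps k to n_k with n_k > 0 for k ≤ K and n_1 > (K + 1) n_{K+1}, so the denominator is positive.

```python
def katz_discounts(n_counts, K=5):
    """Discount coefficients d_k for k = 1..K (d_k = 1 for k > K)."""
    ratio = (K + 1) * n_counts.get(K + 1, 0) / n_counts[1]
    d = {}
    for k in range(1, K + 1):
        k_star = (k + 1) * n_counts.get(k + 1, 0) / n_counts[k]   # Good-Turing adjusted count
        d[k] = (k_star / k - ratio) / (1.0 - ratio)
    return d
```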

Interpolated Models (Jelinek and Mercer, 1980)

Idea: interpolation of different order models.

$\Pr[w_3 \mid w_1 w_2] = \alpha\, c(w_3 \mid w_1 w_2) + \beta\, c(w_3 \mid w_2) + (1 - \alpha - \beta)\, c(w_3)$, with $0 \le \alpha, \beta \le 1$.
• α and β are estimated by using held-out samples.
• The sample is split into two parts for training the higher-order and lower-order models.
• Optimization using the expectation-maximization (EM) algorithm.
• Deleted interpolation: k-fold cross-validation.
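A sketch of the interpolated estimate, assuming the relative frequencies have been precomputed into dictionaries f_tri, f_bi, and f_uni (these names are illustrative only).

```python
def interpolated_prob(w3, w2, w1, f_tri, f_bi, f_uni, alpha, beta):
    """Jelinek-Mercer style interpolation of trigram, bigram, and unigram frequencies."""
    return (alpha * f_tri.get((w1, w2, w3), 0.0)
            + beta * f_bi.get((w2, w3), 0.0)
            + (1.0 - alpha - beta) * f_uni.get(w3, 0.0))
```

In practice, α and β would be tuned on the held-out part of the data, e.g. by EM or a simple grid search maximizing the held-out log-likelihood.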

Kneser-Ney Model

Idea: combination of back-off and interpolation, but backing off to a lower order model based on counts of contexts. Extension of absolute discounting.

$\Pr[w_3 \mid w_1 w_2] = \frac{\max\{0,\; c(w_1 w_2 w_3) - D\}}{c(w_1 w_2)} + \alpha\, \frac{c(\cdot\, w_3)}{\sum_{w} c(\cdot\, w)},$

where D is a constant.

Modified version (Chen and Goodman, 1998): D is a function of $c(w_1 w_2 w_3)$.
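For concreteness, here is a sketch of an interpolated Kneser-Ney bigram model with a single absolute discount D, following the standard bigram formulation rather than the exact trigram equation above; the back-off weight is chosen so that the distribution normalizes.

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(bigram_counts, D=0.75):
    """Interpolated KN bigram sketch: Pr[w | h] = max(c(hw) - D, 0)/c(h) + lambda(h) * P_cont(w)."""
    c_hist = Counter()
    followers = defaultdict(set)   # distinct successors of each history h
    contexts = defaultdict(set)    # distinct contexts (predecessors) of each word w
    for (h, w), c in bigram_counts.items():
        c_hist[h] += c
        followers[h].add(w)
        contexts[w].add(h)
    total_context_types = sum(len(s) for s in contexts.values())

    def prob(h, w):
        p_cont = len(contexts[w]) / total_context_types      # continuation probability of w
        if c_hist[h] == 0:
            return p_cont                                     # unseen history: pure continuation term
        discounted = max(bigram_counts.get((h, w), 0) - D, 0.0) / c_hist[h]
        lam = D * len(followers[h]) / c_hist[h]               # mass reserved for the lower order model
        return discounted + lam * p_cont

    return prob
```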

Evaluation

Average log-likelihood of a test sample of size N:
$\hat{L}(q) = \frac{1}{N} \sum_{k=1}^{N} \log_2 q[w_k \mid h_k], \qquad |h_k| \le n - 1.$
Perplexity: the perplexity PP(q) of the model is defined as ('average branching factor')
$PP(q) = 2^{-\hat{L}(q)}.$
• For English texts, typically $PP(q) \in [50, 1000]$, i.e., $6 \le -\hat{L}(q) \le 10$ bits.
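Computing perplexity from a model is mechanical; the sketch below assumes model_prob(h, w) returns q[w | h] and that the test data has already been segmented into (history, word) pairs.

```python
import math

def perplexity(model_prob, test_pairs):
    """PP(q) = 2^{-L(q)}, with L(q) the average log2 probability over the test pairs."""
    avg_log2 = sum(math.log2(model_prob(h, w)) for (h, w) in test_pairs) / len(test_pairs)
    return 2.0 ** (-avg_log2)
```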

In Practice

• Evaluation: an empirical observation based on the word error rate of a speech recognizer is often a better evaluation than perplexity.
• n-gram order: typically n = 3, 4, or 5; higher-order n-grams typically do not yield any benefit.
• Smoothing: small differences between back-off, interpolated, and other models (Chen and Goodman, 1998).
• Special symbols: beginning and end of sentences, start and stop symbols.

Example: Bigram Model

The set of states of the WFA representing a backoff or interpolated model is defined by associating a state $q_h$ with each sequence of length less than $n$ found in the corpus:

$Q = \{q_h : |h| < n \text{ and } c(h) > 0\}.$

Its transition set $E$ is defined as the union of the following set of failure transitions:

$\{(q_{wh'}, \phi, -\log(\alpha_h), q_{h'}) : q_{wh'} \in Q\},$

and the following set of regular transitions:

$\{(q_h, w, -\log(\Pr(w \mid h)), n_{hw}) : q_{wh} \in Q\},$

where $n_{hw}$ is defined by

$n_{hw} = \begin{cases} q_{hw} & \text{if } 0 < |hw| < n, \\ q_{h'w} & \text{if } |hw| = n, \text{ where } h = u\,h' \text{ for some word } u. \end{cases}$

(Figure 2 of (Allauzen, Roark, and MM, 2003) illustrates this construction for a trigram model; Figure 3 shows a complete Katz backoff bigram model, with failure transitions, built from counts taken from a small toy corpus.)

Approximate offline representation: the common offline representation is obtained by simply replacing each φ-transition by an ε-transition. Treating ε-transitions as regular symbols yields a deterministic automaton, but one that may contain invalid paths: paths that are not part of the probabilistic model and whose weight does not correspond to the probability the model assigns to the string labeling them. For instance, taking the ε-transition to the history-less (back-off) state and then a regular transition out of it can cost less than the correct path, so the WFA with ε-transitions gives a lower cost (higher probability) to some strings.

Exact offline representation: the topology of the WFA is modified to remove any path containing ε-transitions whose cost is lower than the correct cost assigned by the model to the string labeling that path; since the lowest-cost path for each string then carries the correct cost, the representation is exact in the tropical semiring. The construction has two parts: detecting the invalid paths and splitting states to remove them. For an n-gram language model, the number of candidate paths leading to any state is less than the n-gram order (Lemma 2 of the paper), which keeps the size of the exact representation practical for large-vocabulary tasks.

Failure Transitions (Aho and Corasick, 1975; MM, 1997)

Definition: a failure transition Φ at a state q of an automaton is a transition taken at that state without consuming any symbol, when no regular transition at q has the desired input label.
• Thus, a failure transition corresponds to the semantics of "otherwise".
Advantages:
• Originally used in string matching.
• More compact representation.
• Dealing with unknown alphabet symbols.


Weighted Automata Representation

(Figure: representation of a trigram model using failure transitions, a de Bruijn-like graph over the histories $w_{i-2}w_{i-1}$, $w_{i-1}w_i$, $w_{i-1}$, and $w_i$; panels (a) and (b) are described in the caption below.)

Figure 4.7. Katz back-off n-gram model. (a) Representation of a trigram model with failure transitions labeled with Φ. (b) Bigram model derived from the input text hello bye bye. The automaton is defined over the log semiring (the transition weights are negative log-probabilities). State 0 is the initial state. State 1 corresponds to the word bye and state 3 to the word hello. State 2 is the back-off state.

Let $c(w)$ denote the number of occurrences of a sequence $w$ in the corpus. The counts $c(h_i)$ and $c(h_i w_i)$ can be used to estimate the conditional probability $P(w_i \mid h_i)$. When $c(h_i) \neq 0$, the maximum likelihood estimate of $P(w_i \mid h_i)$ is:

$\hat{P}(w_i \mid h_i) = \frac{c(h_i w_i)}{c(h_i)} \qquad (4.3.8)$

But a classical data sparsity problem arises in the design of all n-gram models: the corpus, no matter how large, may contain no occurrence of $h_i$ ($c(h_i) = 0$). A solution to this problem is based on smoothing techniques, which consist of adjusting $\hat{P}$ to reserve some probability mass for unseen n-gram sequences. Let $\tilde{P}(w_i \mid h_i)$ denote the adjusted conditional probability. A smoothing technique widely used in language modeling is the Katz back-off technique. The idea is to "back off" to lower order n-gram sequences when $c(h_i w_i) = 0$. Define the backoff sequence of $h_i$ as the lower order n-gram sequence suffix of $h_i$ and denote it by $h_i'$; thus $h_i = u\, h_i'$ for some word $u$. Then, in a Katz back-off model, $P(w_i \mid h_i)$ is defined as follows:

$P(w_i \mid h_i) = \begin{cases} \tilde{P}(w_i \mid h_i) & \text{if } c(h_i w_i) > 0 \\ \alpha_{h_i}\, P(w_i \mid h_i') & \text{otherwise} \end{cases} \qquad (4.3.9)$

where $\alpha_{h_i}$ is a factor ensuring normalization. The Katz back-off model admits a natural representation by a weighted automaton in which each state encodes a conditioning history of length less than $n$.

Approximate Representation

The cost of a representation without failure transitions and ε-transitions is prohibitive.
• For a trigram model, $|\Sigma|^{n-1} = |\Sigma|^2$ states and $|\Sigma|^n = |\Sigma|^3$ transitions are needed.
• An exact on-the-fly representation is possible, but with a drawback: no offline optimization.
Approximation: empirically, limited accuracy loss.
• Φ-transitions replaced by ε-transitions.
• Log semiring replaced by the tropical semiring.
Alternative: exact representation with ε-transitions (Allauzen, Roark, and MM, 2003).
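A structural sketch of the approximate representation for the bigram case (illustrative only; a real system would build the automaton in a weighted-automaton library such as OpenFst): one state per seen word plus a back-off state, regular arcs weighted by −log P, and the φ-transitions replaced by ε-arcs weighted by −log α_h. The state name "<backoff>" and the None label for ε are assumptions of this sketch.

```python
import math

def bigram_backoff_arcs(bigram_probs, unigram_probs, alphas):
    """Return arcs (source, label, weight, destination) of an approximate bigram WFA."""
    arcs = []
    for (h, w), p in bigram_probs.items():
        arcs.append((h, w, -math.log(p), w))             # regular arc q_h --w / -log P(w|h)--> q_w
    for w, p in unigram_probs.items():
        arcs.append(("<backoff>", w, -math.log(p), w))   # arcs leaving the back-off (unigram) state
    for h, alpha in alphas.items():
        arcs.append((h, None, -math.log(alpha), "<backoff>"))  # epsilon arc (label None) replacing phi
    return arcs
```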

Shrinking

Idea: remove some n-grams from the model while minimally affecting its quality.
Main motivation: real-time speech recognition (speed and memory).
Method of (Seymore and Rosenfeld, 1996): rank n-grams (w, h) according to the difference of log probabilities before and after shrinking:

∗ " c (wh) log p[w|h] − log p [w|h] . ! "


Method of (Stolcke, 1998): greedy removal of n-grams based on the relative entropy $D(p \,\|\, p')$ of the models before and after removal, computed independently for each n-gram.
• Slightly lower perplexity.
• But the ranking is close to that of (Seymore and Rosenfeld, 1996), both in its definition and in empirical results.

LMs Based on Probabilistic Automata (Allauzen, MM, and Roark, 2003)

Definition: expected count of the sequence x in a probabilistic automaton A:

$c(x) = \sum_{u \in \Sigma^*} |u|_x\, A(u),$

where $|u|_x$ is the number of occurrences of $x$ in $u$.
Computation:
• use counting weighted transducers.
• can be generalized to other moments of the counts.

Example: Bigram Transducer

(Figure: weighted transducer T, with transitions a:a/1 and b:b/1 from state 0 to state 1 and from state 1 to the final state 2/1, and self-loops a:ε/1 and b:ε/1 at states 0 and 2.)

Weighted transducer T.

X ◦ T computes the (expected) count of each bigram in {aa, ab, ba, bb} in X.

Counting Transducers

(Figure: counting transducer for x = ab, with a single X:X/1 transition from state 0 to the final state 1/1 and self-loops a:ε/1 and b:ε/1; for the input string bbabaabba, the two occurrences of ab correspond to the alignments εεabεεεεε and εεεεεabεε.)

X is an automaton representing a string or any other regular expression. Alphabet Σ = {a, b}.

References

• Cyril Allauzen, Mehryar Mohri, and Brian Roark. Generalized algorithms for constructing statistical language models. In Proceedings of the 41st Meeting of the Association for Computational Linguistics (ACL 2003), Sapporo, Japan, July 2003.
• Peter F. Brown, Vincent J. Della Pietra, Peter V. deSouza, Jennifer C. Lai, and Robert L. Mercer. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467-479, 1992.
• Stanley Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Harvard University, 1998.
• William Gale and Kenneth W. Church. What's wrong with adding one? In N. Oostdijk and P. de Haan, editors, Corpus-Based Research into Language. Rodopi, Amsterdam, 1994.
• Irving J. Good. The population frequencies of species and the estimation of population parameters. Biometrika, 40:237-264, 1953.
• Frederick Jelinek and Robert L. Mercer. Interpolated estimation of Markov source parameters from sparse data. In Proceedings of the Workshop on Pattern Recognition in Practice, pages 381-397, 1980.


• Slava M. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, 35:400-401, 1987.
• Reinhard Kneser and Hermann Ney. Improved backing-off for m-gram language modeling. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume 1, pages 181-184, 1995.
• David A. McAllester and Robert E. Schapire. On the convergence rate of Good-Turing estimators. In Proceedings of the Conference on Learning Theory (COLT), pages 1-6, 2000.
• Hermann Ney, Ute Essen, and Reinhard Kneser. On structuring probabilistic dependences in stochastic language modeling. Computer Speech and Language, 8:1-38, 1994.
• Kristie Seymore and Ronald Rosenfeld. Scalable backoff language models. In Proceedings of the International Conference on Spoken Language Processing (ICSLP), 1996.
• Andreas Stolcke. Entropy-based pruning of back-off language models. In Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, pages 270-274, 1998.


• Ian H. Witten and Timothy C. Bell. The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory, 37(4):1085-1094, 1991.
