Lecture 5: N-Gram Language Models

Mehryar Mohri, Courant Institute and Google Research ([email protected])

Language Models

Definition: a probability distribution Pr[w] over sequences of words w = w_1 ... w_k.

• Critical component of a speech recognition system.

Problems:

• Learning: use a large text corpus (e.g., several million words) to estimate Pr[w]. Models in this course: n-gram models, maximum entropy models.
• Efficiency: computational representation and use.

This Lecture

• n-gram models: definition and problems
• Good-Turing estimate
• Smoothing techniques
• Evaluation
• Representation of n-gram models
• Shrinking LMs based on probabilistic automata

N-Gram Models

Definition: an n-gram model is a probability distribution based on the nth-order Markov assumption

$$\forall i,\quad \Pr[w_i \mid w_1 \cdots w_{i-1}] = \Pr[w_i \mid h_i], \qquad |h_i| \le n - 1.$$

• Most widely used language models.

Consequence: by the chain rule,

$$\Pr[w] = \prod_{i=1}^{k} \Pr[w_i \mid w_1 \cdots w_{i-1}] = \prod_{i=1}^{k} \Pr[w_i \mid h_i].$$

Maximum Likelihood

Likelihood: the probability of observing the sample under a distribution p ∈ P, which, given the independence assumption, is

$$\Pr[x_1, \ldots, x_m] = \prod_{i=1}^{m} p(x_i).$$

Principle: select the distribution maximizing the sample probability,

$$\hat{p} = \operatorname*{argmax}_{p \in \mathcal{P}} \prod_{i=1}^{m} p(x_i), \quad\text{or}\quad \hat{p} = \operatorname*{argmax}_{p \in \mathcal{P}} \sum_{i=1}^{m} \log p(x_i).$$

Example: Bernoulli Trials

Problem: find the most likely Bernoulli distribution, given a sequence of coin flips H, T, T, H, T, H, T, H, H, H, T, T, ..., H.

Bernoulli distribution: p(H) = θ, p(T) = 1 − θ.

Likelihood:

$$l(p) = \log\big(\theta^{N(H)}(1-\theta)^{N(T)}\big) = N(H)\log\theta + N(T)\log(1-\theta).$$

Solution: l is differentiable and concave;

$$\frac{dl(p)}{d\theta} = \frac{N(H)}{\theta} - \frac{N(T)}{1-\theta} = 0 \iff \theta = \frac{N(H)}{N(H) + N(T)}.$$

Maximum Likelihood Estimation

Definitions:

• n-gram: sequence of n consecutive words.
• S: sample or corpus of size m.
• c(w_1 ... w_k): count of the sequence w_1 ... w_k.

ML estimates: for c(w_1 ... w_{n-1}) ≠ 0,

$$\Pr[w_n \mid w_1 \cdots w_{n-1}] = \frac{c(w_1 \cdots w_n)}{c(w_1 \cdots w_{n-1})}.$$

• But, c(w_1 ... w_n) = 0 ⟹ Pr[w_n | w_1 ... w_{n-1}] = 0!

N-Gram Model Problems

Sparsity: assigning probability zero to sequences not found in the sample ⟹ speech recognition errors.

• Smoothing: adjusting ML estimates to reserve probability mass for unseen events. Central techniques in language modeling.
• Class-based models: create models based on classes (e.g., DAY) or phrases.

Representation: for |Σ| = 100,000, the number of bigrams is 10^10 and the number of trigrams 10^15!

• Weighted automata: exploiting sparsity.

Smoothing Techniques

Typical form: interpolation of n-gram models, e.g., trigram, bigram, and unigram frequencies:

$$\Pr[w_3 \mid w_1 w_2] = \alpha_1\, c(w_3 \mid w_1 w_2) + \alpha_2\, c(w_3 \mid w_2) + \alpha_3\, c(w_3).$$

Some widely used techniques:

• Katz back-off models (Katz, 1987).
• Interpolated models (Jelinek and Mercer, 1980).
• Kneser-Ney models (Kneser and Ney, 1995).

Good-Turing Estimate (Good, 1953)

Definitions:

• Sample S of m words drawn from vocabulary Σ.
• c(x): count of word x in S.
• S_k: set of words appearing exactly k times.
• M_k: probability of drawing a point in S_k.

Good-Turing estimate of M_k:

$$G_k = \frac{(k+1)\,|S_{k+1}|}{m}.$$
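To make the estimate concrete, here is a minimal Python sketch (not from the lecture) that computes G_k from the count-of-count statistics n_k = |S_k|; the function name and toy corpus are made up for illustration.

```python
from collections import Counter

def good_turing_mass(sample):
    """Estimate the Good-Turing mass G_k = (k+1) * |S_{k+1}| / m for each k.

    `sample` is a list of words of size m. Words seen exactly k times form
    the set S_k, and n_k = |S_k| is the count of counts.
    """
    m = len(sample)
    counts = Counter(sample)               # c(x) for each word x
    n = Counter(counts.values())           # n_k = |S_k|
    # G_k for every observed k, plus k = 0 (the missing mass).
    ks = [0] + sorted(n)
    return {k: (k + 1) * n.get(k + 1, 0) / m for k in ks}

# Toy example (made-up data): G_0 = n_1 / m is the estimated missing mass,
# i.e., the probability of drawing a previously unseen word.
sample = "the cat sat on the mat the cat ran".split()
print(good_turing_mass(sample))
```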
Properties

Theorem: the Good-Turing estimate is an estimate of M_k with small bias, for small values of k/m:

$$\operatorname*{E}_S[G_k] = \operatorname*{E}_S[M_k] + O\!\left(\frac{k+1}{m}\right).$$

Proof:

$$\operatorname*{E}_S[M_k] = \sum_{x \in \Sigma} \Pr[x]\,\Pr[x \in S_k]
= \sum_{x \in \Sigma} \Pr[x] \binom{m}{k} \Pr[x]^k (1 - \Pr[x])^{m-k}$$
$$= \sum_{x \in \Sigma} \frac{k+1}{m-k} \binom{m}{k+1} \Pr[x]^{k+1} (1 - \Pr[x])^{m-(k+1)} (1 - \Pr[x])$$
$$= \frac{k+1}{m-k} \left( \operatorname*{E}_S\big[|S_{k+1}|\big] - \operatorname*{E}_S[M_{k+1}] \right)
= \frac{m}{m-k} \operatorname*{E}_S[G_k] - \frac{k+1}{m-k} \operatorname*{E}_S[M_{k+1}].$$

Proof (cont.): thus,

$$\left|\operatorname*{E}_S[M_k] - \operatorname*{E}_S[G_k]\right|
= \left|\frac{k}{m-k} \operatorname*{E}_S[G_k] - \frac{k+1}{m-k} \operatorname*{E}_S[M_{k+1}]\right|
\le \frac{k+1}{m-k}. \qquad (0 \le \operatorname*{E}_S[G_k], \operatorname*{E}_S[M_{k+1}] \le 1)$$

In particular, for k = 0,

$$\left|\operatorname*{E}_S[M_0] - \operatorname*{E}_S[G_0]\right| \le \frac{1}{m}.$$

It can be proved using McDiarmid's inequality that, with probability at least 1 − δ (McAllester and Schapire, 2000),

$$M_0 \le G_0 + O\!\left(\sqrt{\frac{\log\frac{1}{\delta}}{m}}\right).$$

Good-Turing Count Estimate

Definition: let r = c(w_1 ... w_k) and n_r = |S_r|; then

$$c^*(w_1 \cdots w_k) = \frac{G_r \times m}{|S_r|} = (r+1)\,\frac{n_{r+1}}{n_r}.$$

Simple Method (Jeffreys, 1948)

Additive smoothing: add δ ∈ [0, 1] to the count of each n-gram:

$$\Pr[w_n \mid w_1 \cdots w_{n-1}] = \frac{c(w_1 \cdots w_n) + \delta}{c(w_1 \cdots w_{n-1}) + \delta|\Sigma|}.$$

• Poor performance (Gale and Church, 1994).
• Not a principled attempt to estimate, or make use of an estimate of, the missing mass.

Katz Back-off Model (Katz, 1987)

Idea: back off to a lower-order model for zero counts.

• If c(w_1^{n-1}) = 0, then Pr[w_n | w_1^{n-1}] = Pr[w_n | w_2^{n-1}].
• Otherwise,

$$\Pr[w_n \mid w_1^{n-1}] =
\begin{cases}
d_{c(w_1^n)}\, \dfrac{c(w_1^n)}{c(w_1^{n-1})} & \text{if } c(w_1^n) > 0,\\[1ex]
\beta\, \Pr[w_n \mid w_2^{n-1}] & \text{otherwise,}
\end{cases}$$

where w_i^j denotes w_i ... w_j and d_k is a discount coefficient such that

$$d_k =
\begin{cases}
1 & \text{if } k > K;\\[0.5ex]
\approx \dfrac{(k+1)\,n_{k+1}}{k\,n_k} & \text{otherwise.}
\end{cases}$$

Katz suggests K = 5.

Discount Coefficient

With the Good-Turing estimate, the total probability mass for unseen n-grams is G_0 = n_1/m. Thus,

$$\sum_{c(w_1^n) > 0} d_{c(w_1^n)}\, \frac{c(w_1^n)}{m} = 1 - \frac{n_1}{m}$$
$$\iff \sum_{k > 0} d_k\, k\, n_k = m - n_1$$
$$\iff \sum_{k > 0} d_k\, k\, n_k = \sum_{k > 0} k\, n_k - n_1$$
$$\iff \sum_{k > 0} (1 - d_k)\, k\, n_k = n_1$$
$$\iff \sum_{k=1}^{K} (1 - d_k)\, k\, n_k = n_1.$$

Solution: a search with $1 - d_k = \mu\left(1 - \frac{k^*}{k}\right)$, where $k^* = \frac{(k+1)\,n_{k+1}}{n_k}$ is the Good-Turing count, leads to

$$\mu \sum_{k=1}^{K} \left(1 - \frac{k^*}{k}\right) k\, n_k = n_1$$
$$\iff \mu \sum_{k=1}^{K} \left(1 - \frac{(k+1)\,n_{k+1}}{k\,n_k}\right) k\, n_k = n_1$$
$$\iff \mu \sum_{k=1}^{K} \big[k\, n_k - (k+1)\,n_{k+1}\big] = n_1$$
$$\iff \mu\big[n_1 - (K+1)\,n_{K+1}\big] = n_1$$
$$\iff \mu = \frac{1}{1 - \frac{(K+1)\,n_{K+1}}{n_1}}$$
$$\iff d_k = \frac{\frac{k^*}{k} - \frac{(K+1)\,n_{K+1}}{n_1}}{1 - \frac{(K+1)\,n_{K+1}}{n_1}}.$$

Interpolated Models (Jelinek and Mercer, 1980)

Idea: interpolation of models of different orders:

$$\Pr[w_3 \mid w_1 w_2] = \alpha\, c(w_3 \mid w_1 w_2) + \beta\, c(w_3 \mid w_2) + (1 - \alpha - \beta)\, c(w_3), \quad \text{with } 0 \le \alpha, \beta \le 1.$$

• α and β are estimated by using held-out samples.
• The sample is split into two parts, for training the higher-order and the lower-order models.
• Optimization using the expectation-maximization (EM) algorithm.
• Deleted interpolation: k-fold cross-validation.

Kneser-Ney Model

Idea: a combination of back-off and interpolation, but backing off to the lower-order model based on counts of contexts. An extension of absolute discounting:

$$\Pr[w_3 \mid w_1 w_2] = \frac{\max\{0,\ c(w_1 w_2 w_3) - D\}}{c(w_1 w_2)} + \alpha\, \frac{c(\cdot\, w_3)}{\sum_{w} c(\cdot\, w)},$$

where D is a constant and c(· w_3) denotes the number of distinct contexts preceding w_3.

Modified version (Chen and Goodman, 1998): D is a function of c(w_1 w_2 w_3).
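The following is a minimal bigram-level sketch of the Kneser-Ney idea above: absolute discounting of observed counts combined with a continuation distribution based on counts of distinct left contexts. It is illustrative only; the fixed discount D = 0.75, the back-off weight alpha(w1) = D · |{w : c(w1 w) > 0}| / c(w1) (a standard choice that makes each conditional distribution sum to one, not specified on the slide), and the toy corpus are all assumptions.

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(tokens, D=0.75):
    """Bigram Kneser-Ney sketch: absolute discounting plus a continuation
    distribution proportional to the number of distinct left contexts."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    unigrams = Counter(tokens[:-1])          # context counts c(w1)
    followers = defaultdict(set)             # distinct words following w1
    contexts = defaultdict(set)              # distinct contexts preceding w2
    for (w1, w2) in bigrams:
        followers[w1].add(w2)
        contexts[w2].add(w1)
    total_types = sum(len(s) for s in contexts.values())  # number of bigram types

    def prob(w2, w1):
        c12, c1 = bigrams[(w1, w2)], unigrams[w1]
        p_cont = len(contexts[w2]) / total_types   # continuation probability
        if c1 == 0:
            return p_cont                           # unseen context: back off entirely
        alpha = D * len(followers[w1]) / c1         # assumed normalizing back-off weight
        return max(c12 - D, 0.0) / c1 + alpha * p_cont

    return prob

# Toy usage (made-up data)
p = kneser_ney_bigram("the cat sat on the mat and the cat ran".split())
print(p("cat", "the"), p("dog", "the"))
```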
Evaluation

Average log-likelihood of a test sample of size N:

$$\hat L(q) = \frac{1}{N} \sum_{k=1}^{N} \log_2 q[w_k \mid h_k], \qquad |h_k| \le n - 1.$$

Perplexity: the perplexity PP(q) of the model ('average branching factor') is defined as

$$PP(q) = 2^{-\hat L(q)}.$$

• For English texts, typically PP(q) ∈ [50, 1000], i.e., between about 6 and 10 bits per word.

In Practice

• Evaluation: an empirical evaluation based on the word error rate of a speech recognizer is often more informative than perplexity.
• n-gram order: typically n = 3, 4, or 5; higher-order n-grams typically do not yield any benefit.
• Smoothing: small differences between back-off, interpolated, and other models (Chen and Goodman, 1998).
• Special symbols: beginning and end of sentences, start and stop symbols.

Representation of n-gram Models

The set of states of the WFA representing a back-off or interpolated model is defined by associating a state q_h to each sequence of length less than n found in the corpus:

$$Q = \{q_h : |h| < n \text{ and } c(h) > 0\}.$$

Its transition set E is defined as the union of the following set of failure transitions:

$$\{(q_{wh'},\ \phi,\ -\log(\alpha_h),\ q_{h'}) : q_{wh'} \in Q\}$$

and the following set of regular transitions:

$$\{(q_h,\ w,\ -\log(\Pr[w \mid h]),\ n_{hw}) : q_{wh} \in Q\}.$$

[Figure 3: Example of representation of a bigram model with failure transitions, over the alphabet {a, b} with start symbol <s> and end symbol </s>.]

If the failure transitions are approximated by ε-transitions, invalid paths can be preferred: in the example of Figure 3, the cost of the invalid path to the state (the sum of the two transition costs, 0.672) is lower than the cost of the true path.
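As an illustration of the construction above, here is a minimal Python sketch for the bigram case (n = 2). The input dictionaries (bigram_prob, unigram_prob, backoff), the state naming, and the handling of destination states are assumptions for the example, not part of the lecture; weights are negative log probabilities (tropical semiring) and 'phi' marks a failure transition.

```python
import math

def backoff_bigram_wfa(bigram_prob, unigram_prob, backoff):
    """Sketch of the WFA encoding of a back-off bigram model.

    Assumed inputs: bigram_prob[(w1, w2)] = Pr[w2 | w1],
    unigram_prob[w] = Pr[w], backoff[w1] = back-off weight alpha for history w1.
    States: q_<eps> (empty history) and q_w for each seen history w.
    Each transition is (source, label, weight, destination).
    """
    states = {"<eps>"} | {w1 for (w1, _) in bigram_prob}
    transitions = []
    # Regular transitions from the unigram (empty-history) state.
    for w, p in unigram_prob.items():
        dest = w if w in states else "<eps>"
        transitions.append(("<eps>", w, -math.log(p), dest))
    # Regular bigram transitions, then failure transitions to the unigram state.
    for (w1, w2), p in bigram_prob.items():
        dest = w2 if w2 in states else "<eps>"
        transitions.append((w1, w2, -math.log(p), dest))
    for w1 in states - {"<eps>"}:
        transitions.append((w1, "phi", -math.log(backoff[w1]), "<eps>"))
    return states, transitions

# Toy usage with made-up probabilities and back-off weights.
states, E = backoff_bigram_wfa(
    bigram_prob={("a", "b"): 0.5, ("b", "a"): 0.25},
    unigram_prob={"a": 0.6, "b": 0.4},
    backoff={"a": 0.5, "b": 0.75},
)
for t in E:
    print(t)
```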

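Returning to the Evaluation section above: the average log-likelihood L̂(q) and the perplexity PP(q) = 2^(−L̂(q)) are easy to compute for any conditional model. The sketch below is illustrative only; the function names and the uniform placeholder model are assumptions used just to check the computation.

```python
import math

def perplexity(model, test_words, history_len=2):
    """Compute the average log2-likelihood of a test sample and the
    perplexity 2 ** (-avg_ll).

    `model(w, h)` is assumed to return q[w | h] for a history tuple h of
    length at most history_len (n - 1 for an n-gram model).
    """
    total, N = 0.0, len(test_words)
    for k, w in enumerate(test_words):
        h = tuple(test_words[max(0, k - history_len):k])
        total += math.log2(model(w, h))
    avg_ll = total / N                      # average log-likelihood, in bits per word
    return avg_ll, 2 ** (-avg_ll)

# Placeholder model: uniform over a made-up 10,000-word vocabulary,
# so the perplexity is exactly 10,000 regardless of the test sample.
uniform = lambda w, h: 1.0 / 10_000
print(perplexity(uniform, "this is a test sentence".split()))
```

With a uniform distribution over a V-word vocabulary the perplexity is exactly V, which matches the 'average branching factor' interpretation given above.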