arXiv:1908.09738v1 [eess.AS] 26 Aug 2019
Connecting and Comparing Language Model Interpolation Techniques

Ernest Pusateri, Christophe Van Gysel, Rami Botros,∗ Sameer Badaskar, Mirko Hannemann, Youssef Oualil, Ilya Oparin

Apple Inc., USA
{epusateri,cvangysel,badaskar,mirko_hannemann,youalil,ioparin}@apple.com

∗ No longer at Apple Inc.

Abstract

In this work, we uncover a theoretical connection between two language model interpolation techniques, count merging and Bayesian interpolation. We compare these techniques as well as linear interpolation in three scenarios with abundant training data per component model. Consistent with prior work, we show that both count merging and Bayesian interpolation outperform linear interpolation. We include the first (to our knowledge) published comparison of count merging and Bayesian interpolation, showing that the two techniques perform similarly. Finally, we argue that other considerations will make Bayesian interpolation the preferred approach in most circumstances.

Index Terms: speech recognition, language modeling, language model interpolation, count merging, Bayesian interpolation

1. Introduction

Virtual assistants such as Apple’s Siri continue to gain in popularity. Their appeal comes in part from their versatility. The most advanced virtual assistants can respond to requests in a broad range of domains, from simple commands like “set my alarm” to complex questions about rare named entities. This requirement to handle requests from a wide variety of domains poses a particular challenge when building the language model (LM) for the speech recognition component of such a system.

Neural network LMs have been shown in recent years to outperform n-gram models on a variety of tasks (e.g. [1]), and so one might consider their application here. However, n-gram models maintain advantages in decoding efficiency and in their ability to be quickly trained with very large amounts of data. These advantages often make an n-gram model the preferred choice for the first recognition pass of modern real-world systems. Further, even when a neural network LM is used, it is often interpolated with an n-gram model. For these reasons, this work will assume the use of n-gram models.

To build an n-gram model that covers a broad range of domains, a common approach is to build separate domain-specific LMs and then apply an interpolation technique to combine them in a way that’s optimized for the target use case. Linear interpolation is one of the simplest and most commonly used of these techniques [2]. Count merging [3, 4] and Bayesian interpolation [5] are more sophisticated techniques that use word history statistics to achieve better performance, especially in cases where component domains vary widely in content or quantity of training data. Despite their high-level similarity, work on count merging and Bayesian interpolation has proceeded in separate threads in the research literature. Accordingly, to our knowledge, there is no published comparison of these two techniques.

In this work we uncover a theoretical connection between count merging and Bayesian interpolation. We then compare the performance of linear interpolation, count merging and Bayesian interpolation on three large data sets.

2. Interpolation Techniques

We start by formulating LM combination as history-dependent linear interpolation. Assume the target use case covers multiple domains. Then the probability for a word, $w$, given a word history, $h$, can be expressed as the following, where $i$ represents a domain:

$$p(w|h) = \sum_i p(w, i|h), \tag{1}$$

$$\phantom{p(w|h)} = \sum_i p(i|h)\, p(w|i, h). \tag{2}$$

In theory, we could learn a parameter corresponding to $p(i|h)$ for every $i$ and $h$. But this is rarely feasible in practice. Instead, we will derive an estimate of $p(i|h)$ that requires a small number of learned parameters. To do that, we first apply Bayes’ rule to $p(i|h)$:

$$p(i|h) = \frac{p(i)\, p(h|i)}{\sum_j p(j)\, p(h|j)}. \tag{3}$$

Let $\lambda_i$ be a learned parameter corresponding to $p(i)$ and $p^{\mathrm{Interp}}(h|i)$ be an estimate of $p(h|i)$:

$$p^{\mathrm{Interp}}(i|h) = \frac{\lambda_i\, p^{\mathrm{Interp}}(h|i)}{\sum_j \lambda_j\, p^{\mathrm{Interp}}(h|j)}. \tag{4}$$

Now, assume we build a component LM for each domain. Let $p_i(w|h)$ be the probability given to word $w$ in the context of $h$ by the $i$th component LM. Plugging this and our estimate for $p(i|h)$ into Equation 2:

$$p^{\mathrm{Interp}}(w|h) = \sum_i \frac{\lambda_i\, p^{\mathrm{Interp}}(h|i)}{\sum_j \lambda_j\, p^{\mathrm{Interp}}(h|j)}\, p_i(w|h). \tag{5}$$

We now have a formulation that assumes the existence of estimates for per-component history probabilities, $p^{\mathrm{Interp}}(h|i)$, and has a small number of learned parameters, $\lambda_i$. Next, we show that linear interpolation, count merging and Bayesian interpolation differ only in how they compute $p^{\mathrm{Interp}}(h|i)$.

2.1. Linear Interpolation

Linear interpolation makes the strong assumption that $p(h|i)$ is independent of $i$.¹ The history terms in Equation 5 then cancel and, because the $\lambda_i$ sum to 1, Equation 5 reduces to:

$$p^{\mathrm{LI}}(w|h) = \sum_i \lambda_i\, p_i(w|h). \tag{6}$$

¹ This is more commonly but equivalently stated as assuming $p(i|h)$ is independent of $h$.

2.2. Count Merging

Count merging uses the maximum likelihood estimate for $p^{\mathrm{Interp}}(h|i)$. It is computed as follows, where $N_i$ is the total number of words in corpus $i$ and $c_i(h)$ is the count of history $h$ in corpus $i$:

$$p^{\mathrm{CM}}(h|i) = \frac{c_i(h)}{N_i}. \tag{7}$$

Plugging this into Equation 5:

$$p^{\mathrm{CM}}(w|h) = \frac{\sum_i \lambda_i \frac{c_i(h)}{N_i}\, p_i(w|h)}{\sum_j \lambda_j \frac{c_j(h)}{N_j}}. \tag{8}$$

This is not the conventional formulation of count merging, as presented in [3] and [4]:

$$p^{\mathrm{CM}'}(w|h) = \frac{\sum_i \beta_i\, c_i(h)\, p_i(w|h)}{\sum_j \beta_j\, c_j(h)}. \tag{9}$$

However, the two formulations are equivalent in an important way. Define $\beta_i$ as follows, where $K$ is a constant whose value does not affect $p^{\mathrm{CM}'}(w|h)$:

$$\beta_i = K \frac{\lambda_i}{N_i}. \tag{10}$$

We see that for any set of $\lambda_i$ in Equation 8, there is a set of $\beta_i$ in Equation 9 that will make $p^{\mathrm{CM}}(w|h)$ equal to $p^{\mathrm{CM}'}(w|h)$, and vice versa.

Note that the formulation in Equation 8 has a few advantages over the one in Equation 9. First, it makes clear that it is the history counts relative to the size of their respective corpora that matter rather than the absolute counts. Second, the $\lambda_i$ are more constrained than the $\beta_i$, which may make optimization easier. Finally, unlike the $\beta_i$ in Equation 9, we can interpret the $\lambda_i$ in Equation 8 as component priors. In addition to making interpretation of learned $\lambda_i$ easier, this allows for the $\lambda_i$ to be set using prior knowledge (e.g. an educated guess about domain usage frequencies), something that would be difficult to do with the $\beta_i$ parameters.

Note also that given a learned set of $\beta_i$, we can easily compute the implied values for $\lambda_i$. Solving Equation 10 for $\lambda_i$ and taking into account that the $\lambda_i$ sum to 1:

$$\lambda_i = \frac{\beta_i N_i}{\sum_j \beta_j N_j}. \tag{11}$$

2.3. Bayesian Interpolation

While count merging uses a maximum-likelihood estimate for $p(h|i)$ (Equation 7), Bayesian interpolation uses something more robust.

Note that [3] applied count merging using “expected counts” for word histories. Expected counts were computed by multiplying the total number of words in the training corpora by an estimate of $p(h, i)$ where $p(h, i)$ was estimated “in a manner which may reserve probability mass for unobserved events.” Below we show one way of computing expected counts that is consistent with that description, where the first term is the total number of words in the training corpora and the second is an estimate of $p(h, i)$:

$$c_i^{\mathrm{expected}}(h) = \left(\sum_j N_j\right) \left(p_i(h)\, \frac{N_i}{\sum_j N_j}\right). \tag{14}$$

Under this definition, $c_i^{\mathrm{expected}}(h) / N_i = p_i(h)$, so substituting Equation 14 into Equation 8 replaces each relative history count with the history probability $p_i(h)$, which results in Equation 13. Thus, count merging with this definition of expected counts is equivalent to Bayesian interpolation.

3. Static Interpolation

Whatever interpolation technique we use, we’d like to end up with a single n-gram LM. However, the backoff structure of n-gram LMs presents a complication. Exact computation of the interpolated model probabilities for contexts unseen in any component LM requires that we maintain separate component LMs and perform interpolation online. So, we apply a commonly used approximation to create a statically interpolated model. First, we fix the set of n-grams in the statically interpolated model to be the union of all the n-grams seen in the component models. Then, we interpolate separately at each n-gram level. Finally, we recompute the backoff weights so that the word probabilities for each history sum to 1.

4. Related Work

We are largely concerned with connecting two threads of research, one on count merging and one on Bayesian interpolation. Count merging appears to be first formally described in [6] (and elaborated on in [3]), although it was widely known (in some form) before that. [6] and [3] show that count merging using two data sources is a special case of maximum a posteriori (MAP) adaptation. [4] starts from the formal definition of count merging in [3] and describes a more general framework where history count is one of a handful of features used to compute history-dependent weights.

Bayesian interpolation is first described in [7] for two data sources and then extended in [5]. The presentation in [5] differs from ours in a couple of ways. First, their formulation contains an extra layer of interpolation parameters. Second, they assume estimates for the parameters analogous to $\lambda_i$ are available as inputs to the interpolation process, while we learn those