
arXiv:1908.09738v1 [eess.AS] 26 Aug 2019

Connecting and Comparing Language Model Interpolation Techniques

Ernest Pusateri, Christophe Van Gysel, Rami Botros∗, Sameer Badaskar, Mirko Hannemann, Youssef Oualil, Ilya Oparin

Apple Inc., USA

{epusateri,cvangysel,badaskar,mirko hannemann,youalil,ioparin}@apple.com

∗ No longer at Apple Inc.

Abstract

In this work, we uncover a theoretical connection between two language model interpolation techniques, count merging and Bayesian interpolation. We compare these techniques as well as linear interpolation in three scenarios with abundant training data per component model. Consistent with prior work, we show that both count merging and Bayesian interpolation outperform linear interpolation. We include the first (to our knowledge) published comparison of count merging and Bayesian interpolation, showing that the two techniques perform similarly. Finally, we argue that other considerations make Bayesian interpolation the preferred approach in most circumstances.

Index Terms: speech recognition, language modeling, language model interpolation, count merging, Bayesian interpolation

1. Introduction

Virtual assistants such as Apple's continue to gain in popularity. Their appeal comes in part from their versatility. The most advanced virtual assistants can respond to requests from a broad range of domains, from simple commands like “set my alarm” to complex questions about rare named entities. This requirement to handle requests from a wide variety of domains poses a particular challenge when building the language model (LM) for the speech recognition component of such a system.

Neural network LMs have been shown in recent years to outperform n-gram models on a variety of tasks (e.g. [1]), and so one might consider their application here. However, n-gram models maintain advantages in decoding efficiency and in the ability to be trained quickly with very large amounts of data. These advantages often make an n-gram model the preferred choice for the first pass of recognition in modern real-world systems. Further, even when a neural network LM is used, it is often interpolated with an n-gram model. For these reasons, this work will assume the use of n-gram models.

To build an n-gram model that covers a broad range of domains, a common approach is to build separate domain-specific LMs and then apply an interpolation technique to combine the LMs in a way that's optimized for the target use case. Linear interpolation is one of the simplest and most commonly used of these techniques [2]. Count merging [3, 4] and Bayesian interpolation [5] are more sophisticated techniques that use history statistics to achieve better performance, especially in cases where component domains vary widely in content or quantity of training data. Despite their high-level similarity, work on count merging and Bayesian interpolation has proceeded in separate threads in the research literature. Accordingly, to our knowledge, there is no published comparison of these two techniques.

In this work we uncover a theoretical connection between count merging and Bayesian interpolation. We then compare the performance of linear interpolation, count merging and Bayesian interpolation on three large data sets.

2. Interpolation Techniques

We start by formulating LM combination as history-dependent linear interpolation. Assume the target use case covers multiple domains. Then the probability for a word, $w$, given word history, $h$, can be expressed as the following, where $i$ represents a domain:

$$p(w|h) = \sum_i p(w, i|h) \tag{1}$$

$$\phantom{p(w|h)} = \sum_i p(i|h)\, p(w|i, h). \tag{2}$$

In theory, we could learn a parameter corresponding to $p(i|h)$ for every $i$ and $h$. But this is rarely feasible in practice. Instead, we will derive an estimate of $p(i|h)$ that requires a small number of learned parameters. To do that, we first apply Bayes' rule:

$$p(i|h) = \frac{p(h|i)\, p(i)}{\sum_j p(h|j)\, p(j)}. \tag{3}$$

We now have a formulation that assumes the existence of estimates for the per-component history probabilities, $p(h|i)$, and for $p(i)$. Because $p(i)$ is independent of the word history, it has a small number of learned parameters. Let $\lambda_i$ be a learned parameter corresponding to $p(i)$ and let $p^{\mathrm{Interp}}(h|i)$ be an estimate of $p(h|i)$ in the context of a given interpolation technique. Then:

$$p^{\mathrm{Interp}}(i|h) = \frac{\lambda_i\, p^{\mathrm{Interp}}(h|i)}{\sum_j \lambda_j\, p^{\mathrm{Interp}}(h|j)}. \tag{4}$$

Now, assume we build a component LM for each domain. Let $p_i(w|h)$ be the probability given to word $w$ by the $i$th component LM. Plugging this and our estimate for $p(i|h)$ into Equation 2:

$$p^{\mathrm{Interp}}(w|h) = \sum_i \frac{\lambda_i\, p^{\mathrm{Interp}}(h|i)}{\sum_j \lambda_j\, p^{\mathrm{Interp}}(h|j)}\, p_i(w|h). \tag{5}$$

Next, we show that linear interpolation, count merging and Bayesian interpolation differ only in how they compute $p^{\mathrm{Interp}}(h|i)$.

2.1. Linear Interpolation

Linear interpolation makes the strong assumption that $p(h|i)$ is independent of $i$.¹ Thus, Equation 5 reduces to:

$$p^{\mathrm{LI}}(w|h) = \sum_i \lambda_i\, p_i(w|h). \tag{6}$$

¹ This is more commonly but equivalently stated as assuming $p(i|h)$ is independent of $h$.

2.2. Count Merging

Count merging uses the maximum likelihood estimate for $p^{\mathrm{Interp}}(h|i)$. It is computed as follows, where $N_i$ is the total number of words in corpus $i$ and $c_i(h)$ is the count of history $h$ in corpus $i$.

$$p^{\mathrm{CM}}(h|i) = \frac{c_i(h)}{N_i}. \tag{7}$$

Plugging this into Equation 5:

$$p^{\mathrm{CM}}(w|h) = \sum_i \frac{\lambda_i \frac{c_i(h)}{N_i}\, p_i(w|h)}{\sum_j \lambda_j \frac{c_j(h)}{N_j}}. \tag{8}$$

This is not the conventional formulation of count merging, as presented in [3] and [4]:

$$p^{\mathrm{CM}'}(w|h) = \sum_i \frac{\beta_i\, c_i(h)\, p_i(w|h)}{\sum_j \beta_j\, c_j(h)}. \tag{9}$$

However, the two formulations are equivalent in an important way. Define $\beta_i$ as follows, where $K$ is a constant whose value does not affect $p^{\mathrm{CM}'}(w|h)$:

$$\beta_i = K \frac{\lambda_i}{N_i}. \tag{10}$$

We see that for any set of $\lambda_i$ in Equation 8, there is a set of $\beta_i$ in Equation 9 that will make $p^{\mathrm{CM}}(w|h)$ equal to $p^{\mathrm{CM}'}(w|h)$, and vice versa.

Note that the formulation in Equation 8 has a few advantages over the one in Equation 9. First, it makes clear that it is the history counts relative to the size of their respective corpora that matter rather than the absolute counts. Second, the $\lambda_i$ are more constrained than the $\beta_i$, which may make optimization easier. Finally, unlike the $\beta_i$ in Equation 9, we can interpret the $\lambda_i$ in Equation 8 as component priors. In addition to making interpretation of learned $\lambda_i$ easier, this allows for the $\lambda_i$ to be set using prior knowledge (e.g. an educated guess about domain usage frequencies), something that would be difficult to do with the $\beta_i$ parameters.

Note also that given a learned set of $\beta_i$, we can easily compute the implied values for $\lambda_i$. Solving Equation 10 for $\lambda_i$ and taking into account that the $\lambda_i$ sum to 1:

$$\lambda_i = \frac{\beta_i N_i}{\sum_j \beta_j N_j}. \tag{11}$$

2.3. Bayesian Interpolation

While count merging uses a maximum-likelihood estimate for $p(h|i)$ (Equation 7), Bayesian interpolation uses something more robust. Specifically, $p(h|i)$ is computed as the probability of word sequence $h$ given by the $i$th component LM, which we denote $p_i(h)$.²

$$p^{\mathrm{BI}}(h|i) = p_i(h). \tag{12}$$

Plugging this into Equation 5, we see that $p(w|h)$ is computed as follows:

$$p^{\mathrm{BI}}(w|h) = \sum_i \frac{\lambda_i\, p_i(h)\, p_i(w|h)}{\sum_j \lambda_j\, p_j(h)}. \tag{13}$$

² For example, in the case where component model $i$ is a 4-gram model, and $h_{-1}$, $h_{-2}$, and $h_{-3}$ represent the words comprising the history $h$, $p_i(h)$ would be the product of the unigram probability of $h_{-3}$, the bigram probability of $h_{-3}h_{-2}$ and the trigram probability of $h_{-3}h_{-2}h_{-1}$.

Note that [3] applied count merging using “expected counts” for word histories. Expected counts were computed by multiplying the total number of words in the training corpora by an estimate of $p(h, i)$ where $p(h, i)$ was estimated “in a manner which may reserve probability mass for unobserved events.” Below we show one way of computing expected counts that is consistent with that description, where the first term is the total number of words in the training corpora and the second is an estimate of $p(h, i)$.

$$c_i^{\mathrm{expected}}(h) = \left(\sum_j N_j\right) \left(\frac{N_i}{\sum_j N_j}\, p_i(h)\right). \tag{14}$$

Substituting Equation 14 into Equation 8 results in Equation 13. Thus, count merging with this definition of expected counts is equivalent to Bayesian interpolation.

3. Static Interpolation

Whatever interpolation technique we use, we'd like to end up with a single n-gram LM. However, the backoff structure of n-gram LMs presents a complication. Exact computation of the interpolated model probabilities for contexts unseen in any component LM requires that we maintain separate component LMs and perform interpolation online. So, we apply a commonly used approximation to create a statically interpolated model. First we fix the set of n-grams in the statically interpolated model to be the union of all the n-grams seen in the component models. Then, we interpolate separately at each n-gram level. Finally, we recompute the backoff weights so that word probabilities for all histories add up to 1.

4. Related Work

We are largely concerned with connecting two threads of research, one on count merging and one on Bayesian interpolation. Count merging appears to be first formally described in [6] (and elaborated on in [3]), while widely known (in some form) before that. [6] and [3] show that count merging using two data sources is a special case of maximum a posteriori (MAP) adaptation. [4] starts from the formal definition of count merging in [3] and describes a more general framework where history count is one of a handful of features used to compute history-dependent weights.

Bayesian interpolation is first described in [7] for two data sources and then extended in [5]. The presentation in [5] differs from ours in a couple of ways. First, their formulation contains an extra layer of interpolation parameters. Second, they assume estimates for the parameters analogous to $\lambda_i$ are available as inputs to the interpolation process, while we learn those parameters on a validation set.

[8] represents another vein of related work. We note in Section 2 that while it is theoretically possible to learn parameters corresponding to $p(i|h)$ in Equation 2, it is rarely practical. [8] starts from that same observation but does not make the moves in Equations 3, 4 and 5. Instead they explore robust ways of more directly estimating parameters for $p(i|h)$, including various forms of MAP adaptation.

5. Experiments and Results

5.1. Data

We evaluated in three different scenarios. We constructed the first scenario from the Billion Word dataset [1] and the second from the WikiText-103 dataset [9]. For the third scenario, we collected training data from a virtual assistant handheld device use case and targeted a virtual assistant smart speaker use case.

5.1.1. Billion Word

The Billion Word dataset consists of a single training set and a matched heldout set. However, in order to perform informative experiments, we wanted a scenario with clusters of training data, each with unknown relevance to matched validation and test sets. We did the following to create this kind of scenario from the Billion Word data.

First, we designated the first 10 partitions of the heldout set as test data and the remaining partitions as validation data. Next, we jointly clustered the training, validation and test data. To do this, we computed TF-IDF vectors on the first 20 partitions of the training data, then used latent semantic analysis (LSA) to reduce the dimensionality of those vectors to 20 [10]. We then ran K-means to learn 10 cluster centroids [11]. After that, we assigned the remaining training, validation and test data to clusters using the IDF statistics, dimensionality reduction and cluster centroids learned on the 20 training data partitions.

To create a synthetic target use case, we then chose a random set of weights for the clusters from a Dirichlet distribution with $\alpha_0 = \alpha_1 = \dots = \alpha_9 = 1$. We created the test and validation sets by sampling respectively from the clustered test and validation data according to the randomly chosen weights.

This process resulted in about 800M words of training data, 840K words of validation data and 215K words of test data. We chose the vocabulary in the conventional way for this corpus, using all words that appeared in the training data three or more times. This resulted in a vocabulary of 793K words.

5.1.2. WikiText

The WikiText-103 dataset [9] consists of the normalized text of 28350 Wikipedia articles containing 101M words.³ As in the Billion Word dataset, the WikiText-103 training set is monolithic. For this corpus, we were able to use available metadata to create clusters. Using the full vocabulary version of WikiText (called “raw character level data”), we matched the page titles with the page titles from an English Wikipedia dump. We allowed for a fuzzy character match, ignoring certain issues of spacing, casing and punctuation, always choosing the most exact possible match.

Given the matched Wikipedia articles, we performed a depth-first search following the links to Wikipedia category pages, until we hit one of the Wikipedia main categories⁴ (28 broad top-down categories like “Nature” or “History”). We limited the maximum depth of the search and in case several main categories were found in the same depth, we chose the category with the fewest articles, as determined in a preliminary run. The resulting clustering did not split the data evenly: the biggest clusters were “People” and “Entertainment”, the smallest “Concepts” and “Humanities”.

We limited the vocabulary to all words occurring three or more times in the training data, resulting in 266K words. We randomly selected from the “People” cluster from the training data to create our test and validation sets and excluded this cluster from the training set we used in our experiments. This process resulted in about 85M words of training data, 90K words of validation data and 90K words of test data.

³ Since we identified and removed exact duplicates of articles in the training set, our statistics are slightly different from those presented on the official webpage https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/

⁴ https://en.wikipedia.org/wiki/Category:Main_topic_classifications

5.1.3. Smart Speaker

In this scenario, the training data consisted of speech recognized transcripts from a virtual assistant handheld device use case that had been automatically clustered by a machine-learned classifier into 32 domains. Examples of these domains are “media player”, “home ” and “arithmetic.” The validation and test sets were randomly sampled from a virtual assistant smart speaker use case and hand-transcribed. The training data contained 57B words in total. The validation and test data were comprised of 502K words and 434K words, respectively. The vocabulary size was 570K.

5.2. Setup

In all experiments, we built 4-gram component LMs using Good-Turing smoothing [12]. In the Smart Speaker experiments, minimum count thresholds of 2, 3 and 5 were applied for bigrams, trigrams and 4-grams respectively. No count thresholds were applied in the other scenarios. We optimized interpolation parameters using L-BFGS-B [13] to minimize validation set perplexity.

For speech recognition experiments, we used statically interpolated LMs, pruned using entropy pruning [14]. The acoustic model was a convolutional neural network with about 28M parameters, trained from millions of manually transcribed, anonymized virtual assistant requests. The input to the model was composed of 40 mel-spaced filter bank outputs; each frame was concatenated with the preceding and succeeding ten frames, giving an input of 840 dimensions.

In all experiments, we report validation set perplexities on the dynamically interpolated models and validation and test set perplexities on the unpruned statically interpolated models. In speech recognition experiments, we also report test set perplexities on the pruned statically interpolated models, the number of n-grams in the unpruned and pruned models, and test set word error rates (WERs) on the pruned models. As a point of reference, we include results for a uniform linear interpolation.

5.3. Results

Table 1: Results on the Billion Word scenario. The Uniform method denotes linear interpolation with fixed uniform weights.

| Interpolation Method   | Dynamic Val. PPL | Static Val. PPL | Static Test PPL |
|------------------------|------------------|-----------------|-----------------|
| Uniform                | -                | 99.1            | 97.5            |
| Linear Interpolation   | 94.2             | 93.7            | 91.9            |
| Count Merging          | 90.8             | 85.9            | 83.9            |
| Bayesian Interpolation | 87.4             | 85.1            | 83.2            |

Table 2: Results on the WikiText scenario. The Uniform method denotes linear interpolation with fixed uniform weights.

| Interpolation Method   | Dynamic Val. PPL | Static Val. PPL | Static Test PPL |
|------------------------|------------------|-----------------|-----------------|
| Uniform                | -                | 253.8           | 264.0           |
| Linear Interpolation   | 228.1            | 226.1           | 235.6           |
| Count Merging          | 254.8            | 221.6           | 232.0           |
| Bayesian Interpolation | 230.3            | 219.3           | 229.5           |

Table 3: Results in the Smart Speaker scenario. The Uniform method denotes linear interpolation with fixed uniform weights.

| Interpolation Method   | Dynamic Val. PPL | Static #n-grams | Static Val. PPL | Static Test PPL | Pruned #n-grams | Pruned Test PPL | Pruned Test WER |
|------------------------|------------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|
| Uniform                | -                | 634M            | 19.0            | 19.0            | 5.4M            | 21.5            | 5.3             |
| Linear Interpolation   | 10.7             | 634M            | 10.7            | 10.7            | 5.9M            | 11.2            | 4.2             |
| Count Merging          | 8.8              | 634M            | 8.8             | 8.8             | 7.7M            | 9.2             | 3.8             |
| Bayesian Interpolation | 8.8              | 634M            | 8.8             | 8.8             | 7.6M            | 9.2             | 3.8             |

Tables 1 and 2 show perplexity results for dynamically and statically interpolated models for the Billion Word and WikiText scenarios. Focusing on the statically interpolated models, we see that the differences between interpolation method perplexities on the validation and test sets are roughly similar. We observe that count merging and Bayesian interpolation outperform linear interpolation, as expected. The difference in performance is larger in the Billion Word scenario, perhaps because the validation and test data are better matched to the training sources. Count merging and Bayesian interpolation perform comparably.

We notice that in the WikiText scenario the test set perplexities are higher than validation set perplexities. This is likely a consequence of characteristics of the WikiText data and how the validation and test sets were selected. The validation and test set data were created by randomly sampling lines from a single data cluster. A single line often contains many words (e.g. an entire paragraph). Thus, even with 90K words in each, the test and validation sets are not perfectly matched.

Looking at the perplexities of the dynamically interpolated models, we notice higher perplexities for count merging and Bayesian interpolation when comparing to static interpolation. The difference is especially striking for count merging in the WikiText scenario. In the case of count merging, we believe this is explained by the following phenomenon. Consider an n-gram with these properties:

1. The component models corresponding to the data sets with non-zero history counts assign the n-gram a very low probability (e.g. because the predicted word is unseen).
2. The n-gram is not in any of the component models.

The dynamically interpolated model will assign the n-gram a very low probability, because of 1. However, this n-gram will not be included in the statically interpolated model, because of 2. Thus, the statically interpolated model will obtain the probability for this n-gram from a lower-order n-gram (combined with the backoff probability for the history). The lower-order n-gram will have been present in at least one of the component models, and so the probability assigned by the statically interpolated model is likely to be larger than the one assigned by the dynamically interpolated model.

An analogous situation arises for Bayesian interpolation, but the effect is much less severe. This is a positive consequence of the smooth estimate of $p(h|i)$ used by Bayesian interpolation. While $p^{\mathrm{BI}}(h|i)$ may be small for a given $i$, it will never be zero.

While we do not address the issue here, this analysis suggests modifying count merging and Bayesian interpolation to take into account the n-gram structure of the final interpolated model. More specifically, the optimization could compute the probability for a particular validation set n-gram based on the highest order n-gram present in any of the component models.

Table 3 shows perplexity and WER results for the Smart Speaker scenario. We see that perplexity results on the unpruned static models are consistent with the results observed in the Billion Word and WikiText scenarios. Further, the perplexity improvements on the statically interpolated models carry over after pruning. Lastly, test set WER results are consistent with the perplexity results. We see a 9.5% relative WER reduction between linear interpolation and count merging or Bayesian interpolation and no significant difference between count merging and Bayesian interpolation.

5.4. Other Considerations

There are considerations beyond performance when comparing count merging and Bayesian interpolation. Count merging requires storing history counts for each component LM, while Bayesian interpolation does not. On the other hand, computing the history statistic is more computationally expensive for Bayesian interpolation than count merging. In Bayesian interpolation, multiple n-gram probabilities must be retrieved, while in the case of count merging, only a single count must be retrieved. This may introduce significant additional computation for Bayesian interpolation when creating a statically interpolated LM from very large component LMs. We suspect that in most circumstances the added complexity of storing history counts in count merging will be the bigger concern.

It is also useful to consider how the two interpolation techniques might perform in less abundant data scenarios or cases where the validation set is poorly matched to the training data clusters. In those conditions, $c_i(h)$ will be 0 for many of the word histories in the validation set, making $p^{\mathrm{CM}}(h|i)$ a very poor estimate of $p(h|i)$. One can think of heuristics to mitigate this problem, but Bayesian interpolation will handle these conditions gracefully without modification.

6. Conclusions

We showed a theoretical connection between count merging and Bayesian interpolation. We evaluated these techniques as well as linear interpolation in three scenarios with abundant training data. Consistent with prior work, our results indicate that both count merging and Bayesian interpolation outperform linear interpolation. Count merging and Bayesian interpolation perform comparably, but Bayesian interpolation has other advantages that will make it the preferred approach in most circumstances.

7. References

[1] C. Chelba, T. Mikolov, M. Schuster, Q. Ge, T. Brants, P. Koehn, and T. Robinson, “One billion word benchmark for measuring progress in statistical language modeling,” in Proc. INTERSPEECH, 2014.

[2] F. Jelinek and R. Mercer, “Interpolated estimation of Markov source parameters from sparse data,” Proc. Workshop Pattern Recognition in Practice, pp. 381–397, May 1980.

[3] M. Bacchiani, M. Riley, B. Roark, and R. Sproat, “MAP adaptation of stochastic grammars,” Computer Speech & Language, vol. 20, no. 1, pp. 41–68, 2006.

[4] B.-J. Hsu, “Generalized linear interpolation of language models,” in IEEE Workshop on Automatic Speech Recognition & Understanding, 2007.

[5] C. Allauzen and M. Riley, “Bayesian language model interpolation for mobile speech input,” in Proc. INTERSPEECH, 2011.

[6] M. Bacchiani and B. Roark, “Unsupervised language model adaptation,” in Proc. of ICASSP, 2003.

[7] M. Weintraub, Y. Aksu, S. Dharanipragada, S. Khudanpur, H. Ney, J. Prange, A. Stolcke, F. Jelinek, and E. Shriberg, “LM95 project report: Fast training and portability,” Research Note 1, Center for Language and Speech Processing, Johns Hopkins University, Tech. Rep., 1996.

[8] X. Liu, M. Gales, J. Francis, and P. C. Woodland, “Use of contexts in language model interpolation and adaptation,” Computer Speech & Language, vol. 27, no. 1, pp. 301–321, 2013.

[9] S. Merity, C. Xiong, J. Bradbury, and R. Socher, “Pointer sentinel mixture models,” in Proc. ICLR, 2017.

[10] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, “Indexing by latent semantic analysis,” Journal of the American Society for Information Science, vol. 41, no. 6, pp. 391–407, 1990.

[11] J. MacQueen et al., “Some methods for classification and analysis of multivariate observations,” in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, no. 14. Oakland, CA, USA, 1967, pp. 281–297.

[12] K. W. Church and W. A. Gale, “A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams,” Computer Speech & Language, vol. 5, no. 1, pp. 19–54, 1991.

[13] R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu, “A limited memory algorithm for bound constrained optimization,” SIAM Journal on Scientific Computing, vol. 16, no. 5, pp. 1190–1208, 1995.

[14] A. Stolcke, “Entropy-based pruning of backoff language models,” in Proc. of DARPA Broadcast News Transcription and Understanding Workshop, 1998.
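As a small illustration of the claim in Section 2 that linear interpolation, count merging and Bayesian interpolation differ only in the per-component history statistic plugged into Equation 5, the following Python sketch evaluates all three for a single word and history. It is a toy under stated assumptions, not the authors' implementation: the function names and all numeric values are hypothetical, and the component probabilities are passed in directly rather than looked up in real n-gram models.

```python
from typing import Sequence


def interpolate(lambdas: Sequence[float],
                history_scores: Sequence[float],
                word_probs: Sequence[float]) -> float:
    """Equation 5: weight each p_i(w|h) by lambda_i * score_i, normalized."""
    weights = [lam * s for lam, s in zip(lambdas, history_scores)]
    total = sum(weights)
    if total <= 0.0:
        # Can happen for count merging when every c_i(h) is zero.
        raise ValueError("no component assigns mass to this history")
    return sum(w / total * p for w, p in zip(weights, word_probs))


def linear_interpolation(lambdas, word_probs):
    # Equation 6: the history score is the same constant for every component.
    return interpolate(lambdas, [1.0] * len(lambdas), word_probs)


def count_merging(lambdas, history_counts, corpus_sizes, word_probs):
    # Equations 7 and 8: history score is c_i(h) / N_i.
    scores = [c / n for c, n in zip(history_counts, corpus_sizes)]
    return interpolate(lambdas, scores, word_probs)


def bayesian_interpolation(lambdas, history_probs, word_probs):
    # Equations 12 and 13: history score is p_i(h) from the component LM.
    return interpolate(lambdas, history_probs, word_probs)


# Hypothetical toy values for two components.
LAMBDAS = [0.5, 0.5]
WORD_PROBS = [0.2, 0.4]        # p_i(w|h)
CORPUS_SIZES = [1000, 1000]    # N_i
HISTORY_PROBS = [0.03, 0.01]   # p_i(h)

li = linear_interpolation(LAMBDAS, WORD_PROBS)
cm = count_merging(LAMBDAS, [30, 10], CORPUS_SIZES, WORD_PROBS)
bi = bayesian_interpolation(LAMBDAS, HISTORY_PROBS, WORD_PROBS)

# Equation 14: expected counts N_i * p_i(h) make count merging match BI.
expected_counts = [n * p for n, p in zip(CORPUS_SIZES, HISTORY_PROBS)]
cm_expected = count_merging(LAMBDAS, expected_counts, CORPUS_SIZES, WORD_PROBS)
```

With these toy numbers, linear interpolation averages the two word probabilities, while count merging and Bayesian interpolation both shift weight toward the first component because its history statistic is three times larger. The last two lines mirror the substitution of Equation 14 into Equation 8.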
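The equivalence between the two count merging formulations (Equations 8 and 9) and the conversions in Equations 10 and 11 can be checked numerically. The sketch below is illustrative only: the helper names, the constant K and the toy counts are assumptions, not values from the paper.

```python
def betas_from_lambdas(lambdas, corpus_sizes, k=1.0):
    """Equation 10: beta_i = K * lambda_i / N_i (K is arbitrary)."""
    return [k * lam / n for lam, n in zip(lambdas, corpus_sizes)]


def lambdas_from_betas(betas, corpus_sizes):
    """Equation 11: lambda_i = beta_i * N_i / sum_j beta_j * N_j."""
    unnormalized = [b * n for b, n in zip(betas, corpus_sizes)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def count_merging_lambda(lambdas, counts, corpus_sizes, word_probs):
    """Equation 8: weights are lambda_i * c_i(h) / N_i."""
    weights = [lam * c / n for lam, c, n in zip(lambdas, counts, corpus_sizes)]
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, word_probs)) / total


def count_merging_beta(betas, counts, word_probs):
    """Equation 9: weights are beta_i * c_i(h)."""
    weights = [b * c for b, c in zip(betas, counts)]
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, word_probs)) / total


# Hypothetical toy setup: two corpora of very different sizes.
lambdas = [0.7, 0.3]
corpus_sizes = [1000, 4000]
counts = [5, 40]              # c_i(h)
word_probs = [0.1, 0.02]      # p_i(w|h)

betas = betas_from_lambdas(lambdas, corpus_sizes, k=123.0)
p_lambda = count_merging_lambda(lambdas, counts, corpus_sizes, word_probs)
p_beta = count_merging_beta(betas, counts, word_probs)
recovered = lambdas_from_betas(betas, corpus_sizes)
```

Here `p_lambda` and `p_beta` agree regardless of the value chosen for `k`, and `recovered` returns the original priors, which is the sense in which the two parameterizations are interchangeable.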
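Section 5.1.1 builds a synthetic target use case by drawing cluster weights from a Dirichlet distribution with all concentration parameters equal to 1 and then sampling data according to those weights. One standard way to draw such weights with only the Python standard library is to normalize independent Gamma draws. The sketch below assumes sampling with replacement and uses hypothetical function names, a hypothetical seed and placeholder data; the paper does not describe its sampling tooling.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw Dirichlet weights by normalizing independent Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def sample_target_set(clusters, weights, n_samples, rng):
    """Pick a cluster per sample according to weights, then a line from it.

    Assumption: sampling is with replacement, which the paper does not state.
    """
    chosen = rng.choices(range(len(clusters)), weights=weights, k=n_samples)
    return [rng.choice(clusters[i]) for i in chosen]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
cluster_weights = sample_dirichlet([1.0] * 10, rng)

# Placeholder clusters: each is a list of text lines.
toy_clusters = [[f"cluster {i} line {j}" for j in range(3)] for i in range(10)]
toy_target = sample_target_set(toy_clusters, cluster_weights, 20, rng)
```

With all concentration parameters equal to 1, every weight vector on the simplex is equally likely, so some clusters may receive very little weight in the target set.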