Efficient Language Modeling Algorithms with Applications to Statistical Machine Translation

Kenneth Heafield

CMU-LTI-13-017

Language Technologies Institute
School of Computer Science
Carnegie Mellon University
5000 Forbes Ave., Pittsburgh, PA 15213
www.lti.cs.cmu.edu

Thesis Committee:
Alon Lavie, Chair
Chris Dyer
Bhiksha Raj
Philipp Koehn

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy In Language and Information Technologies

Copyright © 2013 Kenneth Heafield

Abstract

N-gram language models are an essential component in statistical natural language processing systems for tasks such as machine translation, speech recognition, and optical character recognition. They are also responsible for much of the computational costs. This thesis contributes efficient algorithms for three language modeling problems: estimating probabilities from corpora, representing a model in memory, and searching for high-scoring output when log language model probability is part of the score.

Most existing language modeling toolkits operate in RAM, effectively limiting model size. This work contributes disk-based streaming algorithms that use a configurable amount of RAM to estimate Kneser-Ney language models 7.13 times as fast as the popular SRILM toolkit. Scaling to 126 billion tokens led to first-place performance in the 2013 Workshop on Machine Translation for all three language pairs where submissions were made.

Query speed is critical because a machine translation system makes millions of queries to translate one sentence. Thus, language models are typically queried in RAM, where size is a concern. This work contributes two near-lossless data structures for efficient storage and querying. The first, based on linear probing hash tables, responds to queries 2.42 times as fast as the SRILM toolkit while using 57% of the memory. The second, based on sorted arrays, is faster than all baselines and uses less memory than all lossless baselines.

Searching for high-scoring output is difficult because log language model probabilities do not sum when strings are concatenated. This thesis contributes a series of optimizations that culminate in a new approximate search algorithm. The algorithm applies to search spaces expressed as lattices and, more generally, hypergraphs that arise in many natural language tasks. Experiments with syntactic machine translation show that the new algorithm attains various levels of accuracy 3.25 to 10.01 times as fast as the popular cube pruning algorithm with SRILM.

Acknowledgments

First, I would like to thank my adviser, Alon Lavie, who is primarily responsible for introducing me to machine translation. He especially deserves thanks for supporting a side project on querying language models; that project eventually grew in scope to become this thesis. Philipp Koehn hosted me at the University of Edinburgh for more than a year and never turned down a travel request. It is because of him that I was able to focus for the final year of this work. He is also one of the organizers of MT Marathon, where some of the chapters started. It is rare to have a committee that follows—and uses—the work as it is being performed. Chris Dyer provided detailed feedback, encouragement, and help with cdec. He has also been an early adopter and unwitting tester. Bhiksha Raj provided insight into language modeling in general and for speech recognition, including and beyond his own work on the subject.

My colleagues at Carnegie Mellon contributed knowledge and discussions. Jonathan Clark was especially helpful with estimating language models and came up with the name lmplz. Greg Hanneman deserves special mention for contributing his French–English system for use in experiments and answering questions about it. Michael Denkowski has been helpful both with METEOR and evaluations. Victor Chahuneau contributed ideas and code. Thanks to Nathan Schneider, Narges Razavian, Kevin Gimpel, Yubin Kim and the tea group for making it entertaining, even though I rarely drink tea. Kami Vaniea has been especially supportive.

The University of Edinburgh became my adoptive research family. Hieu Hoang motivated me to improve my work and release it to a broader audience. In addition to helping with Moses, he is to credit for the name KenLM. Barry Haddow has been a great resource on the inner workings of Moses. Hieu Hoang, Barry Haddow, Hervé Saint-Amand, and Ulrich Germann were all supportive in research and as officemates. Nadir Durrani's collaboration was essential to the submissions to the Workshop on Machine Translation. Miles Osborne provided his preprocessed copy of the ClueWeb09 corpus and encouraged me to think big with language modeling. Alexandra Birch, Christian Buck, Liane Guillou, Eva Hasler, and Phil Williams have all been helpful in discussions.

MT Marathon is a great venue to work with other researchers. Hieu Hoang, Tetsuo Kiso and Marcello Federico joined my project on state minimization that eventually grew into a chapter. The next year, Ivan Pouzyrevsky and Mohammed Mediani joined my project on estimating language models. Ivan has been particularly helpful with optimizing disk-based algorithms.

System administrators from Carnegie Mellon (especially Michael Stroucken), the Texas Advanced Computing Center (TACC) at The University of Texas at Austin, the San Diego Supercomputer Center, and the University of Edinburgh (especially Hervé Saint-Amand and Nicholas Moir) all deserve thanks for handling my requests, such as increasing the maximum size of RAM disks to 700 GB on the machines with 1 TB RAM.

Finally, my users are a source of motivation, ideas, and code. Matt Post, Juri Ganitkevitch, Stephan Peitz, Jörn Wübker, Jonathan Graehl, and Baskaran Sankaran have all been helpful in discussions and in increasing the impact of this work.

This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number OCI-1053575, specifically Stampede and Trestles under allocation TG-CCR110017. The author acknowledges the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing HPC resources that have contributed to the research results reported within this thesis. This work made use of the resources provided by the Edinburgh Compute and Data Facility (http://www.ecdf.ed.ac.uk/). The ECDF is partially supported by the eDIKT initiative (http://www.edikt.org.uk/). The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreements 287658 (EU BRIDGE), 287576 (CASMACAT), 287688 (MateCat), and 288769 (ACCEPT). This material is based upon work supported in part by the National Science Foundation Graduate Research Fellowship under Grant No. 0750271, by the National Science Foundation under grant IIS-0713402, by a NPRP grant (NPRP 09-1140-1-177) from the Qatar National Research Fund (a member of the Qatar Foundation), and by the DARPA GALE program.

Contents

1 Introduction
  1.1 Research Contributions
  1.2 Preview of Translation Results

2 Background on Language Models and Decoding
  2.1 N-gram Models
  2.2 Smoothing
  2.3 Querying
  2.4 Decoding
    2.4.1 Decoding Problems
    2.4.2 Searching with Language Models
    2.4.3 Implementations
  2.5 Summary

3 Estimating Kneser-Ney Language Models
  3.1 Introduction
  3.2 Related Work
    3.2.1 Google
    3.2.2 Microsoft
    3.2.3 SRI
    3.2.4 IRST
    3.2.5 MIT
    3.2.6 Berkeley
  3.3 Estimation Pipeline
    3.3.1 Counting
    3.3.2 Adjusting Counts
    3.3.3 Normalization
    3.3.4 Interpolation
    3.3.5 Joining
  3.4 Numerical Precision
  3.5 Streaming and Sorting Framework
  3.6 Experiments
    3.6.1 Methodology
    3.6.2 Memory Setting
    3.6.3 Toolkit Comparison
    3.6.4 Scaling
  3.7 Summary

4 Querying Language Models
  4.1 Introduction
  4.2 Data Structures
    4.2.1 Hash Tables and Probing
    4.2.2 Sorted Arrays and Trie
  4.3 Related Work
  4.4 Threading and Memory Mapping
  4.5 Experiments
    4.5.1 Sparse Lookup
    4.5.2 Perplexity
    4.5.3 Translation
    4.5.4 Comparison with RandLM
  4.6 Summary

5 Left and Right State
  5.1 Introduction
  5.2 The State Abstraction
  5.3 Minimizing State
    5.3.1 Related Work
    5.3.2 Efficient Implementation
    5.3.3 Correctness and Scoring
  5.4 State Encoding
    5.4.1 Related Work
    5.4.2 Left State
    5.4.3 Right State
  5.5 Formal Definition of State
  5.6 Experiments
    5.6.1 Systems
    5.6.2 Recombination
    5.6.3 Behavior Under Concatenation
    5.6.4 Language Model Lookups
    5.6.5 Accuracy
    5.6.6 Time-Accuracy
    5.6.7 Memory
  5.7 Summary

6 Rest Costs
  6.1 Introduction
    6.1.1 Better Rest Costs
    6.1.2 Saving Memory
  6.2 Related Work
  6.3 Contributions
    6.3.1 Better Rest Costs
    6.3.2 Saving Memory
    6.3.3 Combined Scheme
  6.4 Experiments
    6.4.1 Rest Costs as Prediction
    6.4.2 Time-Accuracy Trade-Offs
    6.4.3 Memory Usage
  6.5 Summary

7 Searching Hypergraphs
  7.1 Introduction
  7.2 Related Work
    7.2.1 Alternatives to Bottom-Up Search
    7.2.2 Baseline: Cube Pruning
    7.2.3 Exact Algorithms
    7.2.4 Coarse-to-Fine
  7.3 The New Algorithm
    7.3.1 Partial State
    7.3.2 State Tree
    7.3.3 Bread Crumbs
    7.3.4 Partial Edge
    7.3.5 Scoring
    7.3.6 Priority Queue
    7.3.7 Overall Algorithm
  7.4 Experiments
    7.4.1 Scenarios
    7.4.2 Measurement
    7.4.3 Comparison with Cube Pruning
    7.4.4 Comparison with Coarse-to-Fine
    7.4.5 Revealing Only Left or Right Words
    7.4.6 Memory
  7.5 Summary

8 Conclusions and Future Work
  8.1 Future Work
    8.1.1 Estimating Language Models
    8.1.2 Language Model Data Structures
    8.1.3 Rest Costs
    8.1.4 Search Algorithms

A Search Experiments

Chapter 1

Introduction

The difference between the right word and the almost right word is the difference between lightning and the lightning bug. –Mark Twain

Practitioners and researchers routinely compromise the quality of their machine translation systems in order to make them tractable. For example, the web has made vast amounts of natural language text available in electronic format and companies such as Google and Microsoft have millions of machines, but their machine translation services use only part of the data and rely on extensive approximations (Brants et al., 2007; Lewis and Eetemadi, 2013). Lewis and Eetemadi (2013) from Microsoft Research summarized the problem: "we now find ourselves the victims of our own success, in that it has become increasingly difficult to train on such large sets of data, due to limitations in memory, processing power, and ultimately, speed". Similarly, researchers routinely mention computational costs as the reason they were unable to make full use of available data, avoid an approximation, or perform an experiment (Bojar et al., 2013b). While computational power is expected to increase over time, researchers have historically applied this power to improve quality by developing more accurate, but often more expensive, models (Chiang, 2007; Schwenk et al., 2012). Efficiency is also critical to practical applications such as speech-to-speech translation on a phone when a traveler does not have Internet access (FEAT Limited, 2012).

The primary goal of this thesis is to make statistical machine translation more efficient. The algorithms developed in this work are sufficiently general that they have also been applied to speech recognition (Kim et al., 2012; Si et al., 2013) and optical character recognition (Numen, 2012). At a high level, statistical machine translation systems have two parts: a translation model that estimates how well the original meaning is preserved and a language model that estimates how fluent the output is (Koehn, 2010). Both models contribute to the computational costs of building systems from data and applying them to translate. This thesis focuses on optimizing the language model. Language models are widely applied in natural language processing, have far more training data than translation models have (because translation models require translated text), and are responsible for most of the time cost when models are applied to translate. Chapter 2 explains these points in more detail.

There are many formalisms for language modeling (Jelinek, 1997). This work deals specifically with the successful and widely-used N-gram formalism (Shannon, 1951). While some newer formalisms based on neural networks (Mikolov et al., 2010; Schwenk et al., 2012) have had better results, their practical use is still dependent on N-gram language models. In the N-gram formalism, sentences are conceptually generated one word at a time in natural order (left to right in English). The probability of each word depends on the preceding N–1 words, or a special word for the beginning of sentence. The constant N is known as the order of the model; typical values range from 3 to 6. The probability of a sentence is simply the product of word-level probabilities. While these models are successful, computational challenges arise in estimating these models from text and in applying them to translation, where their role is to favor fluent translations.

N-gram language models are estimated from text. Using more text generally leads to improved translation quality, but also increases computational cost. The 2013 Workshop on Machine Translation (Bojar et al., 2013a) provided participants with 7 billion tokens of English text, of which 1 billion tokens had translations in French. Many of the participants did not use all of this data due to computational costs, despite reports from the preceding year that using more data leads to substantial improvements in translation quality (Koehn and Haddow, 2012). Google has shown further translation quality improvements with a language model trained on 1.8 trillion tokens of text but, in order to do so, they used 1500 machines for a day and invented a simpler form of language modeling dubbed stupid backoff (Brants et al., 2007). Part of this thesis, described in Chapter 3, is a more efficient way to estimate state-of-the-art Kneser-Ney language models (Kneser and Ney, 1995; Chen and Goodman, 1998) from large amounts of text. For example, estimating a model on 126 billion tokens of text (Callan et al., 2009) took 2.8 days on one machine. Using this model for translation led to better performance than all other systems, including Google, in the 2013 Workshop on Machine Translation (Bojar et al., 2013a), which bases rankings on human judgments of translation quality.

Statistical machine translation, speech recognition, and optical character recognition systems all build sentences piecemeal from smaller strings. Much of the language model's role is to measure how well these strings flow together (Mohit et al., 2010). Therefore, systems frequently consult the language model during their search for fluent and correct output. For example, a machine translation system queries the language model millions of times per sentence translated. Chapter 4 shows how to make these queries faster while using less RAM than many other methods (Stolcke, 2002), thereby speeding translation and enabling the use of larger, more detailed, language models. The implementation, dubbed KenLM, is being adopted by the translation community as shown in Table 1.1.

Algorithms drive the search for fluent translations. Search can be performed exactly (Bar-Hillel et al., 1964; Iglesias et al., 2011; Rush and Collins, 2011; Aziz et al., 2013), but such algorithms are currently intractable at the scale of modern machine translation systems (Bojar et al., 2013a). For example, Aziz et al. (2013) reported that it took an average of seven hours to translate a seven-word sentence; they did not attempt to translate longer sentences. Approximate search algorithms (Lowerre, 1976; Chiang, 2005; Chiang, 2007; Huang and Chiang, 2007) are therefore in common use; a primary objective of this work is to search more accurately than existing approximate algorithms while consuming less CPU time.

Because the language model examines at most N–1 words of history, it will treat partial sentences the same if they all end with the same N–1 words. Search algorithms exploit this behavior by packing such strings together and efficiently reasoning over all of them at once (Bar-Hillel et al., 1964). Li and Khudanpur (2008) have identified cases where the language model examines even fewer words, permitting the search algorithm to handle even more strings with the same language model queries. This increases the performance of the search algorithm. Chapter 5 contributes an efficient implementation of their work and extends the mechanism to make every query faster.
Approximate search algorithms make pruning decisions based on the relative scores of strings. These algorithms perform better when the scores are more representative of how the string will perform in a sentence. Chapter 6 shows that the common-practice score of a string (Chiang, 2007) is based on an incorrect assumption and shows how to fix the problem, though at the expense of additional memory. Doing so improves search accuracy. Conversely, Chapter 6 also shows how to save memory, though string-level scores become less representative and search accuracy decreases. In either case, one can attain the same level of search accuracy by spending less CPU time with improved scores or more CPU time with degraded scores. Therefore, both contributions can be interpreted as time-memory trade-offs.

Strings that begin and end with the same words tend to be treated similarly by the language model when they are extended into full sentences. Chapter 7 contributes a new approximate search algorithm based on this observation. Strings are grouped by common prefixes and suffixes, as shown in Figure 1.1. The algorithm iteratively refines and expands groups that perform well.

Participant                                                    This Work  SRI  IRST  Berkeley  Other  Unknown
BALAGUR (Borisov et al., 2013)                                 X
CMU, CMU-TREE-TO-TREE (Ammar et al., 2013)                     X X
CU-BOJAR, CU-DEPFIX, CU-TAMCHYNA (Bojar et al., 2013c)         X X
CU-KAREL, CU-ZEMAN (Bílek and Zeman, 2013)                     X
CU-PHRASEFIX, CU-TECTOMT (Galuščáková et al., 2013)            X X
DCU (Rubino et al., 2013)                                      X X X
DCU-FDA (Biçici, 2013)                                         X X
DCU-OKITA (Okita et al., 2013)                                 X X
DESRT (Miceli Barone and Attardi, 2013; Miceli Barone, 2013)   X X
ITS-LATL                                                       X
JHU (Post et al., 2013)                                        X X X
KIT (Cho et al., 2013; Niehues, 2013)                          X X
LIA (Huet et al., 2013)                                        X
LIMSI (Allauzen et al., 2013; Allauzen, 2013)                  X X
MES-* (Durrani et al., 2013b; Weller et al., 2013)             X X
OMNIFLUENT (Matusov and Leusch, 2013)                          X X
PROMT                                                          X
QCRI-MES (Sajjad et al., 2013; Durrani, 2013)                  X X
QUAERO (Peitz et al., 2013a)                                   X
RWTH (Peitz et al., 2013b)                                     X
SHEF                                                           X
STANFORD (Green et al., 2013)                                  X
TALP-UPC (Formiga et al., 2012; Formiga et al., 2013)          X
TUBITAK (Durgar El-Kahlout and Mermer, 2013)                   X X
UCAM (Pino et al., 2013)                                       X X X
UEDIN, UEDIN-HEAFIELD (Durrani et al., 2013c)                  X X
UEDIN-SYNTAX (Nadejde et al., 2013)                            X X
UMD (Eidelman et al., 2013)                                    X X
UU (Stymne et al., 2013)                                       X X
Total                                                          17 22 4 1 4 3

Table 1.1: Adoption of language model toolkits including this work, SRILM (Stolcke, 2002), IRSTLM (Federico et al., 2008), and BerkeleyLM (Pauls and Klein, 2011) by participants in the translation task of the 2013 Workshop on Machine Translation (Bojar et al., 2013b). Participant names and groupings are reproduced from Bojar et al. (2013a). Some participants did not describe their submissions.

[Figure: the strings "at North Korea", "in North Korea", "with North Korea", and "with the DPRK", with braces grouping their shared prefixes and suffixes.]

Figure 1.1: Chapter 7 (Search) groups strings by common prefixes and suffixes.

1.1 Research Contributions

This work makes several contributions to applied language modeling and search algorithms:

Disk-based Estimation of Kneser-Ney Language Models Chapter 3 develops a series of streaming algorithms that estimate Kneser-Ney (Kneser and Ney, 1995; Chen and Goodman, 1998) language models from text. Because these algorithms are streaming, they make effective use of disk and allow the user to specify how much RAM to use for buffers. The algorithms use orders of magnitude less hardware than previous approaches (Brants et al., 2007), enabling very large language models to be estimated on a single machine. Applying one of these models to machine translation led to first-place performance in a competitive evaluation (Bojar et al., 2013a).

Data Structures for Language Model Queries Chapter 4 contributes two efficient data structures to store and query language models. Both data structures have been designed to reduce memory accesses, employing custom hash tables or interpolation search (Perl et al., 1978) through sorted arrays. Interpolation search formalizes the notion that one opens a dictionary near the end when asked to look up "zebra." The first data structure is designed for speed, but still uses less RAM than the popular SRILM toolkit (Stolcke, 2002). The second data structure saves memory by using sorted arrays and bit-level packing, making it smaller than other lossless options.
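
As a rough illustration of interpolation search (a generic sketch over plain integers, not the KenLM implementation, which probes n-gram records), the key's position is guessed from the key distribution much as a dictionary user opens near the back for "zebra":

```python
def interpolation_search(keys, target):
    """Locate target in a sorted list of integers by estimating its position."""
    lo, hi = 0, len(keys) - 1
    while lo <= hi and keys[lo] <= target <= keys[hi]:
        if keys[hi] == keys[lo]:
            mid = lo  # remaining keys are all equal; avoid dividing by zero
        else:
            # Guess where target falls between keys[lo] and keys[hi].
            mid = lo + (target - keys[lo]) * (hi - lo) // (keys[hi] - keys[lo])
        if keys[mid] == target:
            return mid
        if keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return None  # not present

# Example: hashed word identifiers stored in sorted order.
print(interpolation_search([2, 3, 5, 8, 13, 21, 34, 55], 21))  # -> 5
```

On roughly uniformly distributed keys this tends to probe fewer positions than binary search, which is why it reduces memory accesses.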

Computing Independence Assumptions Search algorithms can efficiently reason over multiple strings at once when they know that the language model will treat the strings the same going forward. Li and Khudanpur (2008) put forward several cases where the language model is guaranteed to treat strings the same. Chapter 5 extends their work by generalizing the criterion and by efficiently computing the criterion. Computation is efficient because the necessary information is precomputed and stored in the sign bit of log probability and log backoff. Finally, Chapter 5 shows how to carry information from one query to the next, making each language model query faster.

Better Scores for Kneser-Ney Models Search algorithms are guided by the scores of strings. Chapter 6 explains why the common-practice score (Chiang, 2007) makes incorrect assumptions and contributes a solution. The solution is to build several language models, each with a different order N, and efficiently store their probabilities alongside the main model, though this costs additional memory. Experiments show that the improved scores lead to improved search accuracy.

Collapsing Probability and Backoff Many language models (Katz, 1987), including Kneser-Ney, store a probability and a backoff with n-grams. Chapter 6 proves that the two values can be collapsed into a single value without changing sentence-level probabilities. This enables language models to be stored in less RAM, though search algorithms are negatively impacted, so more CPU time is required for some tasks.

Faster Search Through Hypergraphs and Lattices Chapter 7 contributes a new approximate search algorithm designed to improve upon cube pruning (Chiang, 2007). Cube pruning makes no effort to exploit the fact that many of the strings it generates share some, but not all, words. The new algorithm groups strings by common prefixes and suffixes. The highest-scoring groups are iteratively refined in a best-first manner. Experiments show a substantial speed improvement compared with cube pruning.

1.2 Preview of Translation Results

Chapters 4, 5, 6, and 7 can be stacked to improve translation performance. Results for a German–English machine translation system (Koehn, 2011), designed for participation in the 2011 Workshop on Machine Translation (Callison-Burch et al., 2011), are shown in Figure 1.2. The baseline is common practice prior to this work: the SRI language modeling toolkit (Stolcke, 2002) to query the language model and the popular cube pruning algorithm (Chiang, 2007) for approximate search. Both cube pruning and the new search algorithm in Chapter 7 have a parameter, the beam size, that trades between CPU time and search accuracy. The baseline was run with beam sizes ranging from 5 to 1000, resulting in the bottom curve seen in the figure. The improved options were run with beam sizes starting at 5 and extending until they consumed more time than the baseline at 1000.

The goal of approximate search is to find translations that score highly. Search accuracy can be measured by how well this objective is optimized, namely the score of the translations that it does find. This is known as the model score. Both CPU time and model score are averaged over 3003 sentences. Measurements were made using the popular Moses (Hoang et al., 2009) software; this work has also been implemented in the cdec (Dyer et al., 2010) package. Section 5.6.1 describes the system and methodology in detail.

There is no agreed-upon level of accuracy; a high-frequency trading firm might want speed at the expense of accuracy while researchers generally prefer high accuracy. One way to summarize the results is to ask how much CPU time is required to attain the same level of accuracy. For every baseline data point in Figure 1.2, one can obtain a better model score 3.26 to 10.01 times as fast by applying the work in this thesis. The speed ratios generally improve at higher levels of accuracy. In this experiment, parameters were optimized for time. Nonetheless, total memory consumption was reduced by 15.3% to 15.8%.

[Figure 1.2 plots: the upper panel shows average model score (roughly −101.7 to −101.4) and the lower panel shows uncased BLEU (roughly 21.7 to 22.3), each against CPU seconds/sentence (0 to 4), with curves for the baseline (cube pruning with SRILM), Chapters 4+5 (Queries and State), Chapters 4+5+6 (Rest Costs), and Chapters 4+5+6+7 (Search).]

Figure 1.2: Impact of various chapters of this thesis on performance of the Moses (Hoang et al., 2009) software for a German–English translation task (Koehn, 2011). A variety of beam sizes are shown; for the baseline these range from 5 to 1000. Curves near the upper left corner are better. Chapters 4 and 5 are not shown separately because current versions of Moses always use both together; comparisons using older versions appear in the respective chapters.

Chapter 2

Background on Language Models and Decoding

What I tell you three times is true. –Lewis Carroll

When in doubt, tell the truth. –Mark Twain

Language models (Shannon, 1951) estimate how probable text is. This notion is useful to select more fluent translations, likely utterances from a speaker, better spelling suggestions, etc. Log language model probability is therefore often used as a feature in statistical systems, along with other features that model fidelity to the input. Language model probability can also be used as a similarity metric between corpora (Ponte and Croft, 1998; Chong and Specia, 2012).

Given text $w_1 w_2 \ldots w_{|w|}$ (or $w_1^{|w|}$ for short), the language model probability $p(w_1^{|w|})$ losslessly expands by the chain rule

$$p(w_1^{|w|}) = p(w_1) \cdot p(w_2 \mid w_1) \cdot p(w_3 \mid w_1 w_2) \cdot p(w_4 \mid w_1 w_2 w_3) \cdot p(w_5 \mid w_1 w_2 w_3 w_4) \cdot p(w_6 \mid w_1 w_2 w_3 w_4 w_5) \cdots p(w_{|w|} \mid w_1^{|w|-1}) \quad (2.1)$$

This expansion is in natural left-to-right order, though text can also be reversed to accomplish right-to-left order (Xiong et al., 2011). The canonical measure of language model performance is perplexity (Jelinek et al., 1977). Perplexity is the geometric average probability over some corpus, inverted:

$$\text{Perplexity} = p\left(w_1^{|w|}\right)^{-\frac{1}{|w|}}$$

This is closely related to empirical cross-entropy

$$\log_2 \text{Perplexity} = H\left(w_1^{|w|}, p\right) = \frac{1}{|w|} \sum_{i=1}^{|w|} \log_2 \frac{1}{p\left(w_i \mid w_1^{i-1}\right)}$$

Lower perplexity means that the language model is better, on average, at predicting each word.
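
As a concrete illustration of these two definitions (my own sketch, not code from the thesis):

```python
import math

def perplexity(word_probs):
    """Perplexity of a corpus given the conditional probability of each word,
    i.e. p(w_i | w_1^{i-1}) for every word position."""
    log2_sum = sum(math.log2(p) for p in word_probs)
    cross_entropy = -log2_sum / len(word_probs)  # bits per word
    return 2.0 ** cross_entropy                  # inverted geometric average probability

# Toy example: four words with the given conditional probabilities.
print(perplexity([0.1, 0.25, 0.5, 0.05]))  # about 6.32
```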

There are many approaches to language models (Jelinek, 1997; Rosenfeld, 2000), e.g. decision trees (Breiman et al., 1984; Bahl et al., 1989), context-free grammars (Moore et al., 1995; Post, 2010; Schwartz, 2012), recurrent neural networks (Mikolov et al., 2010), and N-grams (Shannon, 1951). Decision tree models have largely been abandoned due to disappointing results compared to N-gram models (Rosenfeld, 2000). Using context-free grammars alongside N-gram models results in significantly lower perplexity, but this did not lead to significant improvements in phrase-based machine translation quality (Schwartz, 2012). Recurrent neural networks have shown promising improvements but, due to the high cost of computing probabilities and dependence on full sentence history, have thus far been limited to reranking system outputs or lattices. The candidates for reranking are chosen by using N-gram models (Mikolov et al., 2010). This work is primarily concerned with the dominant N-gram formalism.

2.1 N-gram Models

N-gram models make the Markov independence assumption: only the last N − 1 words of context matter

$$p(w_i \mid w_1 w_2 \ldots w_{i-1}) = p(w_i \mid w_{i-N+1} w_{i-N+2} \ldots w_{i-1})$$

If N=5, as it often is, then Equation 2.1 simplifies as

$$p(w_1^{|w|}) = p(w_1) \cdot p(w_2 \mid w_1) \cdot p(w_3 \mid w_1 w_2) \cdot p(w_4 \mid w_1 w_2 w_3) \cdot p(w_5 \mid w_1 w_2 w_3 w_4) \cdot p(w_6 \mid w_2 w_3 w_4 w_5) \cdots p(w_{|w|} \mid w_{|w|-4} w_{|w|-3} w_{|w|-2} w_{|w|-1})$$

The Markov assumption misses phenomena that span arbitrary lengths such as number agreement. Nonetheless, it performs relatively well in terms of perplexity (Rosenfeld, 2000; Chen and Goodman, 1998). In addition to the Markov assumption, it is commonly assumed that sentences are independent. At the beginning of each sentence, context is replaced with the symbol <s>. The end of sentence is explicitly modeled by predicting the special token </s>. Therefore, the probability of sentence $w_1^{|w|}$ is evaluated as

$$p\left(w_1^{|w|} \texttt{</s>} \mid \texttt{<s>}\right)$$

The <s> token is never predicted; it only appears as context. When systems explicitly generate <s>, the language model interprets it to mean that text should be conditioned on the beginning of sentence.

2.2 Smoothing

Statistical language models may be queried with any combination of words, only some of which appear in their training data. Smoothing addresses this problem by reserving some probability for events not seen in the training data. Many smoothing methods employ the backoff algorithm (Katz, 1987). The backoff algorithm assigns probability to word $w_n$ in context $w_1^{n-1}$ according to the recursive equation

$$p(w_n \mid w_1^{n-1}) = \begin{cases} p(w_n \mid w_1^{n-1}) & \text{if } w_1^n \text{ was seen in the training data} \\ b(w_1^{n-1}) \, p(w_n \mid w_2^{n-1}) & \text{otherwise} \end{cases} \quad (2.2)$$

where every seen n-gram (up to length N) has a probability p and a backoff b. If an entry is not present or has length exactly N, then its backoff b is implicitly one. Recursion terminates with the special unknown word token <unk>, which has an explicit probability in the model. An example model is shown in Figure 2.1 and an example probability computation is shown in Figure 2.2.

Unigrams                  Bigrams                    Trigrams
Words   log p   log b     Words      log p  log b   Words          log p
<s>     -∞      -2.0      <s> iran   -3.3   -1.2    <s> iran is    -1.1
iran    -4.1    -0.8      iran is    -1.7   -0.4    iran is one    -2.0
is      -2.5    -1.4      is one     -2.0   -0.9    is one of      -0.3
one     -3.3    -0.9      one of     -1.4   -0.6
of      -2.5    -1.1
<unk>   -6.2     0.0

Figure 2.1: Entries in an example trigram language model with backoff smoothing. For exposition, only selected words are shown.

  log b(iran is)           -0.4
  log b(is)                -1.4
+ log p(of)                -2.5
= log p(of | iran is)      -4.3

Figure 2.2: Derivation of log p (of | iran is) according to Equation 2.2 for the model shown in Figure 2.1.
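
To make Equation 2.2 and Figure 2.2 concrete, here is a minimal sketch (my own illustration, not the thesis implementation) that performs the backoff recursion over the toy model of Figure 2.1:

```python
# Toy model from Figure 2.1: log probability and log backoff for each seen n-gram.
LOGP = {
    ("<s>",): float("-inf"), ("iran",): -4.1, ("is",): -2.5, ("one",): -3.3,
    ("of",): -2.5, ("<unk>",): -6.2,
    ("<s>", "iran"): -3.3, ("iran", "is"): -1.7, ("is", "one"): -2.0, ("one", "of"): -1.4,
    ("<s>", "iran", "is"): -1.1, ("iran", "is", "one"): -2.0, ("is", "one", "of"): -0.3,
}
LOGB = {
    ("<s>",): -2.0, ("iran",): -0.8, ("is",): -1.4, ("one",): -0.9, ("of",): -1.1,
    ("<unk>",): 0.0, ("<s>", "iran"): -1.2, ("iran", "is"): -0.4, ("is", "one"): -0.9,
    ("one", "of"): -0.6,
}

def log_prob(word, context):
    """Equation 2.2: if the n-gram was seen, return its probability; otherwise
    charge the backoff of the context and retry with a shortened context."""
    ngram = tuple(context) + (word,)
    if ngram in LOGP:
        return LOGP[ngram]
    if not context:
        return LOGP[("<unk>",)]  # recursion terminates at the unknown word token
    backoff = LOGB.get(tuple(context), 0.0)  # backoff is implicitly one (0.0 in log space)
    return backoff + log_prob(word, context[1:])

print(log_prob("of", ["iran", "is"]))  # -0.4 + -1.4 + -2.5 = -4.3, as in Figure 2.2
```

Querying p(of | iran is) backs off twice, adding log b(iran is) and log b(is) before reading the unigram probability of "of", exactly the derivation shown in Figure 2.2.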

The values of probability p and backoff b are set by the specific smoothing method, which is also responsible for ensuring that p is a probability in the sense that it sums to one. There are many smoothing methods in the backoff family, e.g. Witten-Bell (Witten and Bell, 1991), Good-Turing (Good, 1953), Jelinek-Mercer (Jelinek and Mercer, 1980), and Kneser-Ney (Kneser and Ney, 1995). Chapter 3 (Estimation) presents an efficient algorithm to estimate probability p and backoff b using interpolated modified Kneser-Ney smoothing (Chen and Goodman, 1998). This is the most popular smoothing technique due to its low perplexity. Huang and Renals (2010) characterized interpolated modified Kneser-Ney as an approximation to a Pitman-Yor process (Pitman and Yor, 1997) and suggested a different modification to Kneser-Ney. Integrating their different modification is proposed as future work.

Some smoothing methods take a different approach to backoff. Stupid Backoff (Brants et al., 2007) is designed to be computed at runtime from counts of n-grams in the training corpus. This makes it easier to estimate than Kneser-Ney because counts are relatively simple to collect. A completely different approach (Schwenk et al., 2012) uses neural networks instead of basing probabilities directly on seen n-grams. Unlike Mikolov et al. (2010), the Markov assumption is still made, albeit with potentially larger N. These continuous space language models are expensive to query and currently only used to rerank system outputs; search is still performed with N-gram models.

The level of assumption made about smoothing differs by chapter. Chapter 3 (Estimation) and the improved rest costs in Chapter 6 (Rest Costs) are specifically designed for Kneser-Ney language models. The remaining work, including the compression technique in Chapter 6 (Rest Costs), is written with backoff models in mind. However, the data structures in Chapter 4 (Queries) could store counts for Stupid Backoff and Chapter 7 (Search) requires only that the model be able to estimate scores given incomplete context.

2.3 Querying

Systems use the language model by querying for conditional probabilities. In order to efficiently compute p according to Equation 2.2, the probability p and backoff b are typically precomputed and stored. When the model backs off, it may have to retrieve up to N–1 backoff values in addition to finding the relevant probability. Chapter 5 (State) shows how to avoid retrieving backoff b by carrying information from one query to the next. This optimization applies to applications that make queries left-to-right, such as perplexity computation and often machine translation. Chapter 6 (Rest Costs) presents an alternative that collapses p and b into a single value for each seen n-gram. In all of these cases, there is a need to retrieve values associated with n-grams. Speed is a concern because a machine translation system makes millions of queries per sentence translated, as shown in Section 5.6.4. By far the fastest and most common way to store language models is lookup tables in local RAM. However, the model must fit into memory alongside the remainder of the application (e.g. a phrase table in machine translation). Chapter 4 (Queries) presents a time- and memory-efficient approach to storing language models in local RAM. There has been a wide variety of work on this problem, e.g. SRILM (Stolcke, 2002) and IRSTLM (Federico et al., 2008); baselines are presented in Section 4.3.

Lossy methods use less memory at the expense of accuracy. In some cases, scores returned by the model only approximately sum to one, so perplexity is not well defined. Accuracy can, however, be measured indirectly through end-to-end system performance, such as BLEU (Papineni et al., 2002) in machine translation. Quantization (Whittaker and Raj, 2001) clusters the probability and backoff values then replaces them with their cluster identifiers, thereby saving memory. With 256 clusters, floating-point values take only 8 bits instead of their normal 32 bits and degradation is minimal for speech (Whittaker and Raj, 2001) and machine translation (Federico and Bertoldi, 2006). Optional quantization is included in Chapter 4 (Queries). The compression strategy presented in Chapter 6 (Rest Costs) is lossy in the sense that word-level probabilities are lost but sentence-level scores are unchanged. Randomization with Bloom filters (Bloom, 1970; Talbot and Osborne, 2007) or minimal perfect hashing (Belazzougui et al., 2008; Talbot and Brants, 2008; Guthrie and Hepple, 2010) substantially reduces memory usage but introduces the risk of false positives. False positives mean that the model thinks it has seen an n-gram and has either created false values or associated it with values from an actual entry. In the event that the value is taken from an actual entry, the impact of the false positive is often minimal because most entries have low probability and the correct probability is usually low as well. Prior work (Talbot and Osborne, 2007) has shown that machine translation is tolerant to some false positives. The present work contributes near-lossless (≈ 2^{-32} false-positive rate due to hash collisions) data structures; extension to lossy data structures is planned as future work.

Pruning, namely removing n-grams, is the most common lossy way to reduce model size. In some cases, careful pruning can even increase performance (Moore and Quirk, 2009). Chelba et al. (2010) studied several ways to prune Kneser-Ney models. Chapter 3 (Estimation) currently does not support pruning, but this is a limitation of the implementation, not the algorithm. The remaining work is compatible with pruning. Indeed, Chapter 5 (State) shows how to exploit pruning to improve search.

If the application is tolerant to query latency, then the model can be stored outside local RAM. For example, MITLM (Hsu and Glass, 2008) computes perplexity on a corpus by counting n-grams in the corpus and scanning through the entire model once to retrieve the relevant values. However, many applications decide what queries to make on the basis of results from previous queries. If these queries can be made in batches or work can continue without results for some short time, then the model can be distributed over a network of servers, each holding a subset of n-grams (Brants et al., 2007; Federmann, 2007). Brants et al. (2007) showed how to restructure a machine translation system to batch queries. To maintain throughput, however, each server still needs an efficient local representation in RAM, so much of the work on efficient representation in Chapter 4 (Queries) remains relevant.

2.4 Decoding

In statistical systems, the decoder hypothesizes a large set of possible outputs (i.e. translations or transcriptions) and searches for output that scores highly. The score is typically a linear[1] model (Papineni et al., 1997; Och and Ney, 2002) over features including log language model probability. The following sections describe decoding problems that arise in statistical systems for various tasks. These systems vary substantially in terms of input, search space, and features. However, in terms of search algorithms, they have much in common due to their use of a language model. Section 2.4.2 discusses algorithms for the common search problem. With the exception of Chapter 3 (Estimation), which concerns estimation of language models prior to decoding, the primary goal in this thesis is to improve search in terms of speed, memory, and accuracy.

2.4.1 Decoding Problems

Phrase-Based Machine Translation

In phrase-based machine translation (Koehn et al., 2003), the decoder translates phrases (strings) of input at a time then concatenates them together to form a sentence; an example is shown in Figure 2.3. In general, there are many ways to segment the source sentence into phrases and many ways to translate each phrase into a phrase in the target language. Moreover, the target phrases can be permuted in order to account for differences in word order between languages. In theory, these permutations make the space factorial in the length of the input. In practice, decoders impose a hard limit on the distance that target phrases can move, making the space exponential because there are still multiple ways to translate each source phrase.

Le garçon a vu l'homme avec un télescope
The boy seen man with the telescope
The boy saw the man to an telescope
Some boy viewed a men with a telescope

Figure 2.3: Fictional phrase-based translations of a French sentence into English. Reordered translations, where the English word order differs from the French word order, are also possible.

Many features make an independence assumption at the boundaries between phrases. In Moses (Koehn et al., 2007), six of the standard features decompose in this way:

1. Log probability of the target phrase given the source phrase.

2. Log probability of the source phrase given the target phrase.

3. Word-level log probability of the target phrase given words in the source phrase.

4. Word-level log probability of the source phrase given words in the target phrase.

5. Number of phrases used (this is always 1 for an individual phrase pair).

6. Output length (the number of words on the target side of a phrase).

[1] Historically (Papineni et al., 1997), the model was described as log-linear. However, non-probability features like length and event counts make it more natural to think of the model as linear and some of the features as log probabilities. The two descriptions are equivalent.

All of the above features can be expressed as weights on phrase pairs. The feature values sum as phrases are concatenated to form a translation.

Some features cross phrase boundaries. Log probability from a language model biases translations towards fluency and plays an important role in ensuring smooth transitions at phrase boundaries (Mohit et al., 2010). Distortion measures how far a phrase moved when it was reordered. Reordering models (Tillmann, 2004; Galley and Manning, 2008) capture the notion that some phrase pairs, e.g. those that translate French adjectives, move more often than others. An operation sequence model (Durrani et al., 2013a) looks at sequences of word-level changes and movements without regard to phrase boundaries. The operation sequence model borrows much of its infrastructure from language modeling (because both are Markov models) and has recently been optimized by applying Chapter 4 (Queries) (Durrani, 2013).

Experiments with phrase-based machine translation appear in Chapter 3 (Estimation) and Chapter 4 (Queries). Schwartz (2013) has already modified Moses (Koehn et al., 2007) to produce pruned lattices encoding part of the phrase-based search space then applied the search algorithm described in Chapter 7 (Search) to these lattices.
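
For instance (a schematic sketch, not Moses's actual feature representation), additive features can be carried as a vector on each phrase pair and summed as a derivation is assembled; only non-additive features such as the language model need to look across phrase boundaries:

```python
from collections import Counter

# Hypothetical phrase pairs with additive feature values; the numbers are illustrative only.
phrase_pairs = {
    ("Le garçon", "The boy"): Counter({"log_p_t_given_s": -0.4, "phrase_count": 1, "target_length": 2}),
    ("a vu", "saw"):          Counter({"log_p_t_given_s": -0.9, "phrase_count": 1, "target_length": 1}),
    ("l'homme", "the man"):   Counter({"log_p_t_given_s": -0.6, "phrase_count": 1, "target_length": 2}),
}

def additive_features(derivation):
    """Sum the per-phrase feature vectors of a derivation."""
    total = Counter()
    for pair in derivation:
        total.update(phrase_pairs[pair])  # Counter.update adds values feature by feature
    return total

print(additive_features([("Le garçon", "The boy"), ("a vu", "saw"), ("l'homme", "the man")]))
```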

Parsing-Based Machine Translation

Machine translation can be cast, in part, as a parsing problem with a synchronous context-free grammar (Aho and Ullman, 1969; Yamada and Knight, 2001). A synchronous rule consists of one context-free grammar rule for the source language, one context-free grammar rule for the target language, and a bijective correspondence between the non-terminals in these rules. Non-terminals do not need to correspond in order, so entire constituents can be reordered at a time. This explains why parsing-based machine translation performs better than phrase-based machine translation at mid-range reordering (Birch et al., 2009). Example grammar rules are shown in Tables 2.1 and 2.2.

Arabic                        English
X → [Arabic] X1 [Arabic]      X → the X1 said in a statement that .
X → X1 [Arabic] X2            X → hundreds of X2 X1

Table 2.1: Example hierarchical rule pairs from an Arabic-English system (Denkowski, 2013). The grammar uses only X and S, allowing mostly-unconstrained substitution. Subscripts indicate the correspondence between non-terminals.

German                        English
X → X1 Europäische Union      NPB → DT1 European Union
X → X1 X2 geschieht ,         SBAR → IN1 DT2 is going to happen

Table 2.2: Example target-syntax rule pairs from a German-English translation system (Koehn, 2011) described in Section 5.6.1. The target side has meaningful non-terminal labels.

Sparsity can be a major problem with parsing-based machine translation. Many of the phrases useful for phrase-based machine translation do not correspond to constituents (Koehn et al., 2003), so a grammar rule could not normally be extracted for them. Syntax-augmented machine translation (Zollmann and Venugopal, 2006) addresses this problem by extending the set of grammar labels and parsing algorithm so that phrases can be partial or multiple constituents. Another issue is that, in order to evaluate a non-terminal, both the source and target grammar labels must correspond. Hanneman and Lavie (2013) allow more grammar rules to apply by coarsening the set of labels used. An extreme version of this approach simplifies the set of labels to X and S in one or both languages (Chiang, 2007). Such systems are dubbed hierarchical to distinguish them from syntactic systems that have several labels for one or both languages. Table 2.1 shows example rules for a hierarchical system where labels on both the source and target side have been simplified.

In principle, the idea is to parse the entire sentence using source rule pairs extracted from training data. Due to sparsity, a full parse may be pathological or impossible (i.e. an unknown word). Glue rules are a common workaround. These rules allow normal grammar rules to translate pieces of the sentence then glue the pieces together to form a complete translation. The original formulation of glue rules (Chiang, 2007) for hierarchical systems is shown in Table 2.3. The first problem with these rules is that they allow spurious S and GOAL constituents wherever another constituent appears, so decoders typically constrain S to start with the first word of the sentence and GOAL to cover an entire sentence. Doing so enforces that glued hypotheses are built from left to right. The second problem is that intermediate glued hypotheses (those with label S) do not know that they will be preceded by <s>. This makes scores less accurate and prevents Chapter 5 (State) from fully optimizing the hypotheses. Moses (Hoang et al., 2009) developed an improved version of the glue rules, shown in Table 2.4. The improved version naturally enforces left-to-right order (since S can only be built from <s> or another S). Moreover, glued hypotheses contain <s>, so their scores are more accurate and Chapter 5 (State) can optimize them further. Tables 2.3 and 2.4 show rules for hierarchical systems; syntactic systems typically extend the glue rules to accept constituents with labels other than X.

Source          Target
X → X1          S → X1
X → X1 X2       S → S1 X2
GOAL → X1       GOAL → <s> S1 </s>

Table 2.3: The original glue rules (Chiang, 2007) implemented by default in the cdec decoder (Dyer et al., 2010). To efficiently use these rules, the decoder enforces that S can only cover spans that include the first word of the sentence while GOAL must cover the entire sentence.

Source          Target
X → <s>         S → <s>
X → X1 X2       S → S1 X2
X → X1 </s>     S → S1 </s>

Table 2.4: Improved glue rules originated by the Moses decoder (Hoang et al., 2009) and subsequently adopted by the Joshua (Li et al., 2009) and Jane (Vilar et al., 2010) decoders. Users of cdec (Dyer et al., 2010) can attain better results by overriding the default (Table 2.3) and using these rules, as has been done with all experimental results in this thesis. These rules assume that the input sentence is implicitly surrounded by <s> and </s>.

Parsing with a synchronous context-free grammar is very similar to monolingual parsing (Chappelier et al., 1999), except that non-terminal labels must also match on the target side. Just as with monolingual parsing, the parse can be ambiguous. These ambiguities are encoded in a packed representation, an example of which is shown in Figure 2.4. Formally, this data structure is a directed acyclic hypergraph (Klein and Manning, 2001). A hypergraph is like a graph except that edges can go from one vertex to any number of vertices. In this case, a hyperedge represents a grammar rule application and each vertex is a constituent. Grammar rules can have any number of non-terminals and the hyperedge goes from the constituent it derives (the left hand side of the grammar rule) to all the non-terminals it used (the right hand side of the grammar rule). If a vertex has multiple outgoing edges, that indicates that there are multiple ways to derive a constituent and these have been packed together. This work uses hypergraph terminology consistent with cdec (Dyer et al., 2010), though the parse forest can also be represented as a recursive transition network (Iglesias et al., 2011).

[Figure: a hypergraph with root vertex S:S above vertices X:NP, X:VP, and X:VP, which are in turn above vertices X:V, X:NP, and X:PP.]

Figure 2.4: An example hypergraph encoding an ambiguous prepositional phrase attachment. The source and target grammar labels are shown separated by a colon. In this work, edges point downward, the same direction as the arrow in context free rules and pointers in the implementation. More phenomena are allowed, including words on edges and permuting constituents during translation.
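
As a rough sketch of this packed representation (illustrative Python types only; cdec's actual C++ classes differ):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class HyperEdge:
    words: List[Optional[str]]   # target-side words on the rule; None marks a substitution site
    tails: List["Vertex"]        # the non-terminals the rule used (its right-hand side)
    features: dict = field(default_factory=dict)  # additive feature values attached to the rule

@dataclass
class Vertex:
    label: str                   # e.g. "X:NP"; labels are not needed for search itself
    edges: List[HyperEdge] = field(default_factory=list)  # several edges = packed ambiguity

# A vertex with two outgoing edges records two ways to build the same constituent,
# an ambiguity like the prepositional phrase attachment in Figure 2.4.
np, pp, v = Vertex("X:NP"), Vertex("X:PP"), Vertex("X:V")
vp = Vertex("X:VP", edges=[
    HyperEdge([None, None], [v, np]),            # derivation where the PP attaches elsewhere
    HyperEdge([None, None, None], [v, np, pp]),  # derivation where the PP attaches here
])
print(len(vp.edges), "packed derivations at", vp.label)
```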

As described in Chiang (2007), most features are attached to synchronous grammar rules and are analogous to the phrase pair features described in the previous section. Just as in phrase-based machine translation, these feature values sum over the rules used in a derivation. Usually, the language model is the only feature unattached to grammar rules.

It is worth noting that grammar labels and source-language words are not needed for search. All pertinent information has already been encoded in the structure of the hypergraph. Only the structure of the hypergraph, target-language words, and features attached to rules are needed to search for high-scoring translations.

Applications to parsing-based machine translation are the focus of experiments in Chapter 5 (State), Chapter 6 (Rest Costs), and Chapter 7 (Search).

Speech Recognition

Many speech recognition systems perform decoding in two passes: lattice generation and lattice rescoring (Lowerre, 1976). Lattice generation uses a relatively simple language model (typically a trigram model) to predict likely words and score them with an acoustic model that measures agreement with the input. The higher-scoring candidates are encoded in a lattice, an example of which is shown in Figure 2.5. The task in lattice rescoring is to combine the acoustic model with a more detailed language model (often with order N=5) and extract high-scoring hypotheses. However, the language model generates one word at a time, so it systematically assigns lower probability to longer sentences. Adding sentence length as a feature compensates for this effect. The complete scoring model is

$$\text{Score}(d) = \lambda_{\text{Acoustic}} \log \text{Acoustic}(d) + \lambda_{\text{LM}} \log p(\text{Surface}(d)) + \lambda_{\text{Length}} \, |\text{Surface}(d)|$$

where d is a path through the lattice, Surface(d) is the surface form of the path, and each λ is a feature weight. Log probability from the acoustic model can be encoded as weights on the edges; any dependence between words was already encoded in the structure of the lattice.

[Figure: a speech recognition word lattice over the words THAT'S, BATS, GOD, OR, NO, YOU, and KNOW.]

Figure 2.5: An actual speech recognition lattice from the system described in Hasler et al. (2012). Horizontal positions correspond to time but are not to scale; vertical positions are simply for display. The lattice contains many examples of spurious ambiguity. For example, there are two derivations of the output “BATS GOD YOU KNOW”. A small lattice was chosen for illustration; lattices can be much larger.

Chapter 4 (Queries) is already being used in speech recognition (Kim et al., 2012; Si et al., 2013). An adaptation of Chapter 7 (Search) is planned as future work.

Optical Character Recognition

Statistical approaches to optical character recognition (Tong and Evans, 1996; Kolak et al., 2003; Chen et al., 2010) combine character recognition models with language models. Optical features usually examine characters or pairs of characters while a character-level language model biases recognition towards common sequences. When characters are grouped into words, these features simply become weights on the word. However, word segmentation can be ambiguous, a problem Kolak et al. (2003) address jointly by encoding segmentation and recognition options in a weighted finite state transducer. Their representation is equivalent to a lattice (Iglesias et al., 2011).

Optical character recognition literature frequently cites spell checking literature because the two tasks are similar albeit with an optical model instead of a typing model. While unigram language models have been used for spell checking (Kernighan et al., 1990), N-gram language models can flag words that exist but are nonetheless ungrammatical in context (Mays et al., 1991; Church et al., 2007). The same argument explains the use of word-level language models for recognition (Chen et al., 2010). Numen (2012) has applied the work in Chapter 3 (Estimation), Chapter 4 (Queries), Chapter 5 (State), and Chapter 7 (Search) to optical character recognition lattices.

2.4.2 Searching with Language Models

The search problems described in the preceding sections have much in common. Parsing-based machine translation uses a hypergraph. The other decoding problems can be performed in two passes, where the second pass is searching a lattice. Lattices are simply a special case of hypergraphs where every hyperedge happens to be a normal edge. Thus, the hypergraph search algorithm presented in Chapter 7 (Search) applies to all four problems. However, phrase-based machine translation is typically done in one pass (Koehn et al., 2003; Koehn et al., 2007) without an intervening lattice. Extending Chapter 7 (Search) to directly handle the phrase-based search space is left as future work.

Paths through the search space are called derivations. In phrase-based machine translation, a derivation encodes the way the source sentence was segmented into source phrases, the corresponding target-side phrases chosen for each source phrase, and any permutation made. In parsing-based machine translation, derivations are specific paths through a hypergraph: one way to parse the input and one target-side rule for each source-side rule used in the parse. For lattices, it is a path from beginning of sentence to end of sentence. Derivations have a surface form (from the strings associated with pieces used by the derivation) and a score (a linear combination of feature values). Multiple derivations can have the same surface form, a phenomenon known as spurious ambiguity.

All of the systems described have the same objective: find a highest-scoring surface form (or a list of K top-scoring options) regardless of the ways in which it can be derived. Formally, the goal is to find an output $w_1^{|w|*}$ with highest score

$$w_1^{|w|*} = \arg\max_{w_1^{|w|}} \sum_{d : \text{Surface}(d) = w_1^{|w|}} \text{Score}(d \mid \text{Input})$$

where each d is a derivation. This objective corresponds to a loss function with value zero for perfect output (ignoring the real possibility of multiple perfect outputs) and value one for any other output (Duda and Hart, 1973). Optimizing for other loss functions has been shown to improve quality (Kumar and Byrne, 2004) but, for tractability reasons, such decoding is implemented approximately by reranking a list of K high-scoring hypotheses.

Exact phrase-based decoding with a language model is NP-hard (Sima'an, 1996). Part of the problem is that there are many ways to produce the same translation (i.e. by translating "Le garçon" one word at a time or as a two-word phrase in Figure 2.3) and summing over all these ways is expensive. The common approach is to apply the Viterbi (1967) approximation: search for the highest-scoring derivation and hope that it produces a good translation.

$$w_1^{|w|*} \approx \text{Surface}\left(\arg\max_{d} \text{Score}(d \mid \text{Input})\right)$$

This too is an NP-hard problem (Knight, 1999; Zaslavskiy et al., 2009) but easier in practice to approximately solve. For lattices and hypergraphs, both problems are polynomial (Bar-Hillel et al., 1964) but nonetheless expensive. While there has been some work on avoiding the Viterbi approximation (Demuynck et al., 2002) in speech recognition with trigram language models, most lattice and hypergraph decoders use the Viterbi approximation for tractability.

Searching for high-scoring derivations is difficult due to the language model. Most features are additive: when pieces of the search space are assembled, their feature values sum. Were all features additive, search would be relatively simple since pruning decisions could be made exactly and locally. The difficulty arises because log language model probability is not additive over concatenation. Non-additivity is the formal algebraic way of saying that e.g. "saw" and "the man" each have their own log language model probability, but these do not sum to form the log probability of "saw the man"

\log p(\text{saw}) + \log p(\text{the man}) \neq \log p(\text{saw the man})

The reason is that N-gram language models condition on context. In this case, more context in the form of "saw" has been revealed for both "the" and "man". In an example language model,

\text{CORRECT}(\text{saw} \bullet \text{the man}) = \frac{p(\text{saw the man})}{p(\text{saw})\, p(\text{the man})} = \frac{p(\text{the} \mid \text{saw})\, p(\text{man} \mid \text{saw the})}{p(\text{the})\, p(\text{man} \mid \text{the})} = 121.294 \neq 1

This equation introduces the correction factor CORRECT, the change in probability when two strings are concatenated. The two strings are delimited by •.² Non-multiplicative behavior in the above equation corresponds to non-additive behavior in log space. Chapter 6 (Rest Costs) presents a way to smooth out some

² Arguments to a function are traditionally delimited by a comma, which could be mistaken for part of the text.

of this behavior by changing word-level scores without changing sentence-level probabilities. Complete additivity is impossible for general N-gram language models because such a model would be insensitive to word order.

State

When two strings are concatenated, the language model examines at most N−1 trailing words of the first string and at most N−1 leading words of the second string. This is due to the Markov assumption. For example, if the model has order N = 4, then

\text{CORRECT}(\text{the} \vdash \underbrace{\text{boy saw the}}_{\text{Right State}} \bullet \underbrace{\text{man with a}}_{\text{Left State}} \dashv \text{telescope .}) = \text{CORRECT}(\vdash \underbrace{\text{boy saw the}}_{\text{Right State}} \bullet \underbrace{\text{man with a}}_{\text{Left State}} \dashv)

where the turnstiles ⊢ and ⊣ explicitly indicate that no further words will impact the language model probability going forward. In order to facilitate extension on the left or right, a string has both left state and right state that encode leading and trailing words, respectively. Collectively, left and right state are known as state.

In some cases, it can be shown that the language model will always back off and fewer words are necessary (Li and Khudanpur, 2008), known as state minimization. Chapter 5 (State) shows how to efficiently implement state minimization. Other non-additive features, such as the reordering feature in phrase-based machine translation, maintain their own separate state. In many cases, though, the language model is the only non-additive feature.

If two partial derivations have the same state, then the feature functions will treat them the same going forward. Dynamic programming (Bar-Hillel et al., 1964) exploits this property by recombining partial derivations with the same state into a hypothesis. Recombination can happen multiple times as derivations are built one edge at a time. If a K-best list of derivations is requested, a hypothesis remembers at most K partial derivations. State minimization increases recombination opportunities, resulting in improved performance as seen in Chapter 5 (State).
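To make the recombination step concrete, the following is a minimal C++ sketch (not code from any decoder discussed in this thesis) of a hypothesis table keyed on language model state; the State, StateHash, and Hypothesis types are hypothetical stand-ins, and only the single-best derivation per state is kept.

#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical language model state: the words retained on the left and right.
struct State {
  std::vector<uint32_t> left, right;
  bool operator==(const State &other) const {
    return left == other.left && right == other.right;
  }
};

struct StateHash {
  std::size_t operator()(const State &s) const {
    std::size_t h = 0;
    for (uint32_t w : s.left) h = h * 1000003 + w;
    for (uint32_t w : s.right) h = h * 1000003 + w + 1;
    return h;
  }
};

struct Hypothesis {
  State state;
  float score;  // Linear combination of feature values so far.
};

// Hypotheses at one vertex, recombined by state: only the best-scoring
// hypothesis per state is kept (the single-best case of Algorithm 1 below).
class VertexHypotheses {
 public:
  void Add(const Hypothesis &h) {
    auto result = table_.emplace(h.state, h);
    if (!result.second && h.score > result.first->second.score) {
      result.first->second = h;  // Recombine: keep the higher-scoring derivation.
    }
  }
 private:
  std::unordered_map<State, Hypothesis, StateHash> table_;
};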

Algorithms

Recombination leads to a straightforward search algorithm (Bar-Hillel et al., 1964) shown in Algorithm 1. Search proceeds by visiting each vertex in topological order (bottom-up in Figure 2.4; left-to-right or right-to-left in Figure 2.5), so that every vertex is visited after the vertices its outgoing hyperedges point to. Within a vertex, the algorithm considers every possible way to apply the outgoing edges to form hypotheses. Hypotheses with identical state are recombined. The algorithm proceeds until it has reached the final vertex, where it can unpack the top-scoring hypotheses to yield a K-best list.

The problem with this algorithm is that there are O(|vocabulary|^{2N−2}) possible language model states to consider at each vertex. This term is quite large for typical vocabulary sizes (100 thousand to 393 million) and orders N (3 to 6), making the algorithm intractable for large search problems.

Some search algorithms (Koehn et al., 2007; Watanabe et al., 2006; Huang and Mi, 2010) are designed to build derivations from left to right. In that case, only right state is necessary.³ This reduces the number of possible states to |vocabulary|^{N−1}, still intractable with the straightforward dynamic programming approach.

Beam search (Lowerre, 1976), shown in Algorithm 2, approximates the straightforward algorithm by pruning to the top k hypotheses after processing each vertex.⁴ The parameter k is known as the beam size. For

³ This also happens naturally as a result of state minimization. All partial derivations start with the beginning-of-sentence symbol <s>. No N-gram predicts <s>, so state minimization determines that all words can be omitted from left state.
⁴ Throughout this thesis, K refers to the number of complete derivations requested by the user and k refers to the beam size used during search.

for each vertex v in topological order do
    for each outgoing hyperedge e from v do
        /* Consider all ways to apply the rule associated with e */
        e_1^{|e|} ← the ordered tuple of vertices that e points to
        for each combination of hypotheses (h_1, ..., h_{|e|}) ∈ hyps[e_1] × ··· × hyps[e_{|e|}] do
            /* Create hypothesis y by performing rule application. */
            y ← APPLYRULE(e, h_1^{|e|})
            /* Add hypothesis y to the set at vertex v, recombining if an existing
               hypothesis has the same state. */
            if another hypothesis x ∈ hyps[v] has state s(x) = s(y) then
                x ← arg max_{z ∈ {x, y}} Score(z)
            else
                hyps[v] ← hyps[v] ∪ {y}
            end
        end
    end
end

Algorithm 1: The straightforward bottom-up dynamic programming algorithm (Bar-Hillel et al., 1964) to find the single-best derivation. Not shown is the subroutine APPLYRULE that generates a hypothesis by applying the grammar rule, phrase, or lattice transition associated with a hyperedge.

for each vertex v in topological order do
    for each outgoing hyperedge e from v do
        e_1^{|e|} ← the ordered tuple of vertices that e points to
        for each combination of hypotheses (h_1, ..., h_{|e|}) ∈ hyps[e_1] × ··· × hyps[e_{|e|}] do
            y ← APPLYRULE(e, h_1^{|e|})
            if another hypothesis x ∈ hyps[v] has state s(x) = s(y) then
                x ← arg max_{z ∈ {x, y}} Score(z)
            else
                hyps[v] ← hyps[v] ∪ {y}
            end
        end
    end
    Prune hyps[v] to the top k hypotheses by score.
end

Algorithm 2: Beam search (Lowerre, 1976) adds a pruning step to the straightforward algorithm, making it approximate.

k < |vocabulary|^{2N−2}, there is no guarantee that an optimal derivation will be found because the language model is non-additive. Smaller values of k make pruning more aggressive, increasing the risk that some part of an optimal derivation will be pruned. Larger values of k cost more CPU time (because there are more rule applications) but generally lead to higher-scoring results (because each pruning pass preserves more hypotheses). Thus k trades between time and accuracy.

The approach taken by beam search is somewhat wasteful: it generates many hypotheses and then discards all but the top k. Cube pruning (Chiang, 2007) takes a different approach: only generate k hypotheses to begin with. More formally, it pops k hypotheses off the top of a priority queue. Initially, the queue is populated with the best (highest-scoring) hyperedges paired with the best hypotheses from each vertex referenced by the hyperedges. When a hypothesis is popped, several next-best alternatives are pushed. These alternatives substitute the next-best hyperedge or a next-best hypothesis from one of the vertices. The beam size k is also known as the pop limit because it is the number of hypotheses popped off the priority queue for each vertex.

Cube pruning performs better than beam search in terms of the time-accuracy trade-off. Moreover, there are several variations on the theme (Huang and Chiang, 2007; Petrov et al., 2008; Gesmundo and Henderson, 2010) discussed further in Section 7.2. Nonetheless, computational issues continue to prevent practitioners from making full use of data or experimenting with new ideas (Chung and Galley, 2012; Zeman, 2012). Chapter 4 (Queries), Chapter 5 (State), and Chapter 6 (Rest Costs) all present ways to reduce the language model's contributions to the computational costs of search, with experiments in cube pruning. Chapter 7 (Search) presents an alternative to cube pruning with improved time-accuracy performance that also builds on the preceding chapters.
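To illustrate the pop-limit loop described above, here is a minimal, self-contained C++ sketch; it is not the implementation used in any of the decoders discussed here, and it simplifies to a two-dimensional cube (one sorted list of hyperedge scores, one sorted list of antecedent hypothesis scores) with language model corrections omitted.

#include <cstddef>
#include <queue>
#include <set>
#include <utility>
#include <vector>

// Simplified cube pruning over one vertex: pop at most pop_limit candidates from
// a priority queue; each pop pushes its two "next-best" neighbors.
std::vector<float> CubePrune(const std::vector<float> &edges,  // sorted descending
                             const std::vector<float> &hyps,   // sorted descending
                             std::size_t pop_limit) {
  typedef std::pair<std::size_t, std::size_t> Index;  // (edge index, hypothesis index)
  auto score = [&](const Index &c) { return edges[c.first] + hyps[c.second]; };
  auto cmp = [&](const Index &a, const Index &b) { return score(a) < score(b); };
  std::priority_queue<Index, std::vector<Index>, decltype(cmp)> queue(cmp);
  std::set<Index> pushed;  // Avoid pushing the same candidate twice.

  if (!edges.empty() && !hyps.empty()) {
    queue.push(Index(0, 0));
    pushed.insert(Index(0, 0));
  }
  std::vector<float> popped;  // Scores of the hypotheses generated at this vertex.
  while (!queue.empty() && popped.size() < pop_limit) {
    Index best = queue.top();
    queue.pop();
    popped.push_back(score(best));
    // Substitute the next-best hyperedge or the next-best antecedent hypothesis.
    Index neighbors[2] = {Index(best.first + 1, best.second),
                          Index(best.first, best.second + 1)};
    for (const Index &n : neighbors) {
      if (n.first < edges.size() && n.second < hyps.size() && pushed.insert(n).second) {
        queue.push(n);
      }
    }
  }
  return popped;
}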

2.4.3 Implementations

Moses (Koehn et al., 2007; Hoang et al., 2009), Jane (Vilar et al., 2010), Joshua (Li et al., 2009), and cdec (Dyer et al., 2010) all implement cube pruning for machine translation. These implementations can be compared by decoding the same input with the same model, taking care to ensure that scores are comparable.⁵ Decoder settings were optimized in consultation with their developers (Hoang, 2013; Peitz et al., 2013c; Post, 2013; Dyer, 2013). Moses was used with the in-memory phrase table. Chapter 4 (Queries) and Chapter 5 (State) are used in all results; for some decoders there is no other option. Only default cube pruning implementations are shown; variations on cube pruning are discussed in Chapter 7 (Search). All decoders were run with beam sizes ranging from 5 to at least 2000. At beam size 2000, Joshua is the slowest, so other decoders were run with larger beam sizes to fill in the plots.

Figures 2.6 and 2.7 show results using the hierarchical German–English system and methodology described in Section 5.6.1. Moses is the primary baseline because it is the fastest.⁶ Decoders perform similarly in terms of accuracy with the same beam size. This is not an accident, but rather the product of several rounds of experiments in which developers adopted practices from other decoders, such as the improved glue rule (Table 2.4) from Moses and Chapter 5 (State). In terms of speed, Moses, Jane, and Joshua all have similar constant costs, but cdec's constant cost is much higher. The difference may be that cdec applies every target-side grammar rule at least once. Other decoders score target-side grammar rules in isolation, sort by score at loading time,⁷ and allow cube pruning to determine which rules to apply. With regard to RAM,

⁵ Costs in Jane were negated to form scores. All decoders used the improved glue rule from Table 2.4, so their feature definitions are compatible up to feature weighting (and weights were scaled appropriately). Score equivalence was confirmed by checking that short sentences had exactly the same derivation and the same score.
⁶ Results presented in this chapter use a recent version of Moses, specifically revision 78cdf8. The remainder of the thesis uses older versions from the time the work was performed. Older versions are slower at parsing, so they have a higher constant cost. Since much of the thesis reports speed ratios and constants bias the ratios towards one, the ratios are even better with newer versions.
⁷ Joshua can delay sorting until rules are actually needed. This option was disabled in order to minimize decoding time. Decoding time excludes loading time.

Jane has the most memory-efficient rule table, so its initial memory consumption is the lowest, followed by cdec. However, Moses has the lowest slope, using only 0.14 GB more RAM with beam size 2000 than with beam size 5. The difference is 1.07 GB for Jane and 1.60 GB for cdec. This suggests that Moses represents hypotheses more compactly.

2.5 Summary

Language models are a crucial component of many natural language processing systems. They are also responsible for many of the computational costs associated with training and running these systems. The algorithms presented in this thesis reduce computational costs at every stage in the estimation and application of language models in a statistical system.

[Figure 2.6 plots: average model score and uncased BLEU versus CPU seconds/sentence for Moses, Jane, Joshua, and cdec.]

Figure 2.6: Time-accuracy trade-off presented by cube pruning implementations for the hierarchical German–English machine translation system.

[Figure 2.7 plots: average model score, CPU seconds/sentence, and peak RAM (GB) as functions of beam size for Moses, Jane, Joshua, and cdec.]

Figure 2.7: Performance of decoder implementations by beam size for the hierarchical German–English machine translation system. Joshua’s memory consumption is not shown because it is written in Java. Java requires the user to guess how much memory a process will use, so a fair comparison would require running binary search over this parameter for all beam sizes.

Chapter 3

Estimating Kneser-Ney Language Models¹

If you laid all of our laws end to end, there would be no end. –Mark Twain

Delay not, Caesar. Read it instantly. –William Shakespeare, Julius Caesar

Estimating a large language model with Kneser-Ney smoothing previously required large amounts of RAM (Stolcke, 2002), approximation (Federico et al., 2008), or many machines (Brants et al., 2007). As a result, practitioners used less data (Callison-Burch et al., 2012; Biçici, 2013) or simpler smoothing methods (Brants et al., 2007). This chapter presents a disk-based streaming approach that uses a user-configurable amount of RAM, makes no approximation other than limited floating-point rounding, and runs on one machine.² On a corpus with 126 billion tokens, estimation took 2.8 days wall time on one machine with 140 GB RAM. Using this large language model (via Chapter 4) led to first-place performance in the 2013 Workshop on Machine Translation (Bojar et al., 2013a). Moreover, estimation is faster than the memory-based options offered by SRILM (Stolcke, 2002) and IRSTLM (Federico et al., 2008), taking 14.0% and 9.0% of their wall time, respectively, on a corpus with 302 million tokens.

3.1 Introduction

As explained in Section 2.2, interpolated modified Kneser-Ney smoothing (Kneser and Ney, 1995; Chen and Goodman, 1998) is popular due to its relatively low perplexity. This chapter shows how to estimate these models efficiently. No effort is made to improve or approximate the smoothing method; the result is exactly the same as in Chen and Goodman (1998), but the equations have been rewritten to clearly explain each step. SRILM's (Stolcke, 2002) implementation differs slightly from the literature in whether it interpolates unigrams; both options are explained and implemented.

More formally, given some corpus and an order N, the task in this chapter is to estimate probability p for all n-grams up to length N and backoff b for all n-grams up to length N − 1 in the corpus. The smoothing equations define these values as a function of n-gram counts. Typical orders N range from 3 to 6.

The challenge is that there are many n-grams (121 billion in one experiment) and keeping track of them, along with intermediate values, is expensive. Many toolkits (Stolcke, 2002; Federico et al., 2008) use in-memory data structures similar to those used for querying; a discussion of these data structures appears in Section 4.3. At first, this seems acceptable because there is little value to estimating a language model that cannot be queried. However, query data structures can make optimizations that do not easily apply to estimation:

¹ A prior version of this work was published as Heafield et al. (2013b).
² An optional cluster version is planned as future work.

• The data structures presented in Chapter 4 assume that the number and set of unique n-grams are known from the start. Neither is known while the corpus is being read.

• If system inputs are known in advance, the language model can be filtered to remove n-grams that will not be queried (Heafield and Lavie, 2010).

• Low-count n-grams could be pruned to save space. Calculating Kneser-Ney smoothing statistics requires counts of all singleton n-grams.

• Quantization (Whittaker and Raj, 2001) can save memory by approximating floating-point values. Quantizing intermediate values during estimation would lead to compound errors.

This chapter approaches the problem by storing n-grams on disk. The performance properties of disk, such as the cost of seeking, play a substantial role in algorithm design. Section 3.3 shows how computation is broken into four streaming passes and three sorts. Section 3.5 describes how streaming and sorting are optimized for disk.

3.2 Related Work

3.2.1 Google

Brants et al. (2007) showed how to estimate Kneser-Ney models with a series of five MapReduces (Dean and Ghemawat, 2004). On 31 billion words, estimation took 400 machines for two days. Recently, Google estimated a pruned Kneser-Ney model on 230 billion words (Chelba and Schalkwyk, 2013), though no cost was provided.

[Figure 3.1 diagram: two pipelines labeled "MapReduce Steps" and "Optimized", composed of Filesystem, Map or Identity Map, and Reduce stages.]

Figure 3.1: Each MapReduce performs three copies over the network when only one is required. Arrows denote copies over the network (i.e. to and from a distributed filesystem). Both options use local disk within each reducer for merge sort.

Each MapReduce consists of one layer of mappers and an optional layer of reducers. Mappers read from a network filesystem, perform optional processing, and route data to reducers. Reducers process input and write to a network filesystem. Ideally, reducers would send data directly to another layer of reducers, but this is not supported. Their workaround, a series of MapReduces, performs unnecessary copies over the network (Figure 3.1). In both cases, reducers use local disk. Writing and reading from the distributed filesystem improves fault tolerance. However, the same level of fault tolerance could be achieved by checkpointing to the network filesystem then only reading in the

case of failures. Doing so would enable the next stage of processing to start without waiting for the network filesystem to write all the data.

The present work runs on a single machine using local disk, avoiding many of the engineering issues that a generalized MapReduce implementation would entail. These are left as future work. However, Appuswamy et al. (2013) have noted that single machines can be more cost-effective than clusters in terms of both raw performance and performance per dollar.

Brants et al. (2007) contributed Stupid Backoff, a simpler form of smoothing calculated at runtime from counts. With Stupid Backoff, they scaled to 1.8 trillion tokens. Stupid Backoff is cheaper to estimate, but this work makes Kneser-Ney practical. Another advantage of Stupid Backoff has been that it stores one value, a count, per n-gram instead of probability and backoff. Chapter 6 shows how to collapse probability and backoff into a single value without changing sentence-level probabilities.

3.2.2 Microsoft³

MSRLM (Nguyen et al., 2007; Gao and Clark, 2013) aims to scalably estimate language models on a single machine. Counting is performed with streaming algorithms similar to this work. Their parallel merge sort also has the potential to be faster. The biggest difference is that their pipeline delays some computation until query time. In particular, the numerator and denominator of normalization (Section 3.3.3) are stored in RAM and divided at query time, costing more memory and CPU time. Interpolation (Section 3.3.4) is also performed at query time. Due to this query-time processing, MSRLM is incompatible with other language model toolkits.

In MSRLM, files are accessed via memory mapping. Entire files are mapped at a time, and these files may be larger than physical RAM. Without explicit guidance, the kernel uses heuristics to decide what portions of the files to leave on disk. Computation will inevitably read parts of the file that are not in memory, causing the kernel to block computation until the read operation is complete. Section 3.5 explains how this work uses dedicated threads for reading and writing, explicitly informing the kernel what parts to keep in RAM and performing disk operations in parallel with computation.

3.2.3 SRI

SRILM (Stolcke, 2002) is a popular language modeling toolkit that implements interpolated modified Kneser-Ney among other smoothing methods. Estimation is typically performed by storing n-grams in RAM. Users are given the choice between two memory formats: the default based on hash tables or the slightly more compact format based on sorted arrays. Counting, adjusting counts, and computing summary statistics (the equivalent of Sections 3.3.1 and 3.3.2) can also be done with a text-based on-disk pipeline. However, subsequent smoothing steps are done in RAM. If pruning is enabled, some pruned n-grams need not be loaded in RAM because the summary statistics have already been computed. Without pruning, both options have the same peak RAM usage and the on-disk pipeline takes longer.

3.2.4 IRST

IRSTLM (Federico et al., 2008) is a free language modeling toolkit. It does not implement interpolated modified Kneser-Ney but rather an approximation dubbed "modified shift-beta".⁴ Among other approximations, it skips the adjusted count step (Section 3.3.2) and uses normal counts, reducing estimation time (Bertoldi, 2013).

³ Unfortunately, MSRLM does not compile and run on recent versions of Linux, so empirical comparisons are omitted.
⁴ Earlier versions refer to their approximation as "improved Kneser-Ney", where "improved" means easier to estimate.

By default, IRSTLM processes the entire corpus in RAM. It can also process pieces of the corpus at a time. Steps communicate using plain text files as opposed to the binary formats in this work. When pieces are processed in serial, peak memory usage is below that of normal IRSTLM. The disadvantage is that approximation increases with the number of pieces.

3.2.5 MIT

MITLM (Hsu and Glass, 2008) is specifically designed for language model estimation. Like this work, it implements interpolated modified Kneser-Ney smoothing using closed-form estimates for discounts (Chen and Goodman, 1998). Optionally, it can explicitly optimize the discounts on training data, yielding models with lower perplexity than this work can. Since the benchmarks are concerned solely with performance, the faster closed-form estimate is used in experiments.

Entries are stored in large vectors, one for each order. The data structure is similar to a trie except that the pointers are reversed: each n-gram record w_1^n contains a pointer to the record for its history w_1^{n-1}. This is designed to make it efficient to retrieve statistics about the history of an n-gram during estimation. All data structures are stored in RAM.

3.2.6 Berkeley

BerkeleyLM (Pauls and Klein, 2011) is more recent but uses the most memory, is slower than comparable memory-based offerings from SRILM and MITLM, and does not implement Kneser-Ney smoothing. As described in Section 3.3.2, Kneser-Ney smoothing requires that discounts be estimated from data (Kneser and Ney, 1995). The latest revision (Pauls, 2013) of BerkeleyLM uses constant discounts of 0.75, so the smoothing method might be more accurately described as absolute discounting. Moreover, it does not implement interpolation or modified discounting (Chen and Goodman, 1998). Estimation is performed in RAM.

3.3 Estimation Pipeline

Estimation has four streaming passes: counting, adjusting counts, normalization, and interpolation. Data is sorted between passes, three times in total. Figure 3.2 shows the flow of data.

3.3.1 Counting

In order to estimate a language model of order N, this step counts all N-grams (with length exactly N) by streaming through the corpus. Words near the beginning of sentence also form N-grams padded by the begin-of-sentence marker (possibly repeated multiple times). The end-of-sentence marker is appended to each sentence and acts like a normal token. Unpruned N-gram counts are sufficient for subsequent steps, so lower-order n-grams (n < N) are not counted. Conversely, unpruned N-gram counts are necessary to compute smoothing statistics.

Strings are mapped to vocabulary indices with a linear probing hash table (Knuth, 1963).⁵ Keys in the hash table are 64-bit hashes (Appleby, 2012) of the surface strings.⁶ The strings are not retained in RAM. However, unique strings are written to disk for later use in the final model file.
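As an illustration of this vocabulary lookup, here is a minimal C++ sketch of a linear probing table keyed on 64-bit string hashes; it uses std::hash as a stand-in for the MurmurHash function cited above and is not the actual lmplz implementation.

#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Minimal linear probing table mapping 64-bit string hashes to vocabulary ids.
// Surface strings are not stored; a 64-bit hash collision (the birthday paradox
// case discussed in the footnote) would silently merge two words. The table has
// fixed capacity here; the real implementation can grow.
class VocabMap {
 public:
  explicit VocabMap(std::size_t buckets)
      : keys_(buckets, 0), ids_(buckets, 0), next_id_(0) {}

  uint32_t FindOrInsert(const std::string &word) {
    uint64_t key = std::hash<std::string>()(word);  // Stand-in for a 64-bit MurmurHash.
    if (key == 0) key = 1;                          // 0 marks an empty bucket.
    std::size_t i = key % keys_.size();
    while (keys_[i] != 0) {                         // Linear probing.
      if (keys_[i] == key) return ids_[i];          // Word (or colliding hash) seen before.
      i = (i + 1) % keys_.size();
    }
    keys_[i] = key;
    ids_[i] = next_id_++;
    return ids_[i];
  }

 private:
  std::vector<uint64_t> keys_;
  std::vector<uint32_t> ids_;
  uint32_t next_id_;
};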

⁵ This hash table is the only part of the pipeline that can grow. Users can specify an estimated vocabulary size for purposes of memory budgeting. If vocabulary size becomes an issue, future work could use separate vocabularies for each block of input and then merge them later.
⁶ If two strings have the same hash, then they will be assigned the same vocabulary identifier, an instance of the birthday paradox. However, for a vocabulary of size 393 million and a high-quality 64-bit hash function (Appleby, 2012), there is approximately a 99.5% chance that no collision will occur.
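For the curious, the collision probability quoted in the footnote follows from the standard birthday-paradox approximation (the 393 million figure is the vocabulary size reported later in Table 3.1); this worked calculation is illustrative and not from the thesis:

\[
P(\text{no collision}) \approx \exp\left(-\frac{k^2}{2 \cdot 2^{64}}\right)
= \exp\left(-\frac{(3.93 \times 10^8)^2}{2^{65}}\right)
\approx \exp(-0.0042) \approx 0.996,
\]

roughly the 99.5% chance stated above.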

[Figure 3.2 diagram: Corpus → Counting (§3.3.1) → suffix sort of (w_1^N, c(w_1^N)) → Adjusting Counts (§3.3.2) → context sort of (w_1^n, a(w_1^n)) → Summing and Division (§3.3.3), which emit backoffs b(w_1^n) in suffix order and records (w_1^n, u(w_n | w_1^{n-1}), b(w_1^{n-1})) → suffix sort → Interpolation (§3.3.4) producing (w_1^n, p(w_n | w_1^{n-1})) → Joining (§3.3.5) producing (w_1^n, log p(w_n | w_1^{n-1}), log b(w_1^n)) → Model.]

Figure 3.2: Data flow in the estimation pipeline. Normalization has two threads per order: summing and division. Interpolation and joining happen in one thread but are conceptually different steps. Thick arrows indicate sorting. Vertical arrows indicate data is stored on disk. Horizontal arrows indicate data is passed in memory between simultaneous steps.

This step outputs blocks of counted N-grams at a time. An exact hash table is used to identify duplicate N-grams within a block so that they can be combined. Merge sort also combines identical N-grams (Bitton and DeWitt, 1983).

3.3.2 Adjusting Counts

The counts c are replaced with adjusted counts a:

a(w_1^n) = \begin{cases} c(w_1^n) & \text{if } n = N \text{ or } w_1 = \text{<s>} \\ |\{v : c(v w_1^n) > 0\}| & \text{otherwise} \end{cases}

The two cases arise from the backoff algorithm (Section 2.2). When a language model is queried, it will attempt to match a full-length N-gram. As a special case, n-grams bound to the beginning of sentence may be short simply because there is nothing beyond the beginning of sentence token. The counts of these entries are preserved. In all other cases, the language model will only consult an n-gram because it was unable to find a longer match. Kneser-Ney smoothing therefore conditions the probability of lower-order entries on

Suffix   Context
3 2 1    2 1 3
ZBA      ZAB
ZAB      BBB
BBB      ZBA

Figure 3.3: In suffix order, the last word is primary. In context order, the penultimate word is primary.

backing off. This is accomplished by adjusting the count. Chapter 6 points out that traditional methods for scoring sentence fragments (Chiang, 2007) ignore this aspect of Kneser-Ney smoothing and proposes a solution.

Adjusted counts are computed by streaming through N-grams sorted in suffix order (Figure 3.3). The algorithm keeps a running total of a(w_i^N) for each i and compares consecutive N-grams to decide which adjusted counts to increment or output. Formally, if the previous and current N-grams are respectively v_1^n w_{n+1}^N and w_1^N with v_n ≠ w_n, then a(w_{n+1}^N) is incremented and a(v_1^n w_{n+1}^N), a(v_2^n w_{n+1}^N), ..., a(v_n w_{n+1}^N) are output before being reset to 1. The beginning of sentence token is a special case.

Smoothing statistics are also collected in preparation for the next step. For each length n, the step collects the number t_{n,i} of n-grams with adjusted count i ∈ [1, 4]:

t_{n,i} = |\{w_1^n : a(w_1^n) = i\}|

These are used to compute closed-form estimates (Chen and Goodman, 1998) of the discounts D_n(i):

D_n(i) = \begin{cases}
  0 & \text{if } i = 0 \\
  i - \dfrac{(i+1)\, t_{n,1}\, t_{n,i+1}}{(t_{n,1} + 2 t_{n,2})\, t_{n,i}} & \text{if } 1 \le i \le 3 \\
  D_n(3) & \text{if } i > 3
\end{cases}

Less formally, adjusted counts 0 (the unknown word) through 2 have special discounts. The discounts must fall in a valid range in order for smoothing to continue.

0 ≤ Dn(i) ≤ i ∀i ∈ [0, 3], n ∈ [1,N]

This is not a theorem but rather an assumption. It may fail on short corpora or corpora that contain duplicated text. Discounts can also be out of range when Kneser-Ney smoothing is not appropriate, such as part-of-speech language models where there is no singleton unigram. In these cases, a different smoothing strategy should be used.

Specialized discounts are what distinguish modified Kneser-Ney smoothing from ordinary Kneser-Ney smoothing, where a single discount is used. Huang and Renals (2010) propose a third set of discounts. MITLM (Hsu and Glass, 2008) determines discounts empirically instead of using a closed-form estimate.
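As a small illustration of the closed-form estimate above, here is a hedged C++ sketch that computes D_n(0) through D_n(3) for one order from the counts-of-counts t_{n,1}, ..., t_{n,4}; the range check mirrors the assumption just stated, and the function name is illustrative rather than taken from lmplz.

#include <array>
#include <stdexcept>

// Compute modified Kneser-Ney discounts for one order n from t[1..4], the number
// of n-grams with adjusted count 1, 2, 3, and 4 (t[0] is unused). Index 0 of the
// result is the discount for adjusted count 0 (the unknown word); adjusted counts
// above 3 reuse the returned value at index 3.
std::array<double, 4> ComputeDiscounts(const std::array<double, 5> &t) {
  std::array<double, 4> d;
  d[0] = 0.0;
  for (int i = 1; i <= 3; ++i) {
    d[i] = i - (i + 1) * t[1] * t[i + 1] / ((t[1] + 2 * t[2]) * t[i]);
    // The assumption 0 <= D_n(i) <= i may fail on short or duplicated corpora.
    if (d[i] < 0.0 || d[i] > i) {
      throw std::runtime_error("Discount out of range; use a different smoothing method.");
    }
  }
  return d;
}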

3.3.3 Normalization

Normalization computes pseudo probability u

u(w_n \mid w_1^{n-1}) = \frac{a(w_1^n) - D_n(a(w_1^n))}{\sum_x a(w_1^{n-1} x)}

and backoff b

b(w_1^{n-1}) = \begin{cases}
  \dfrac{D_n(1) C_n(1) + D_n(2) C_n(2) + D_n(3) C_n(3)}{\sum_x a(w_1^{n-1} x)} & \text{if } w_1^{n-1} \text{ is the context of some } n\text{-gram} \\
  1 & \text{otherwise}
\end{cases}    (3.1)

where the backoff statistics are specialized for low counts

C_n(i) = \begin{cases}
  |\{x : a(w_1^{n-1} x) = i\}| & \text{if } i \le 2 \\
  |\{x : a(w_1^{n-1} x) \ge 3\}| & \text{if } i = 3
\end{cases}    (3.2)

The difficulty lies in computing the denominator \sum_x a(w_1^{n-1} x) for all w_1^{n-1}. To make this computation efficient, the n-grams are sorted in context order (Figure 3.3) so that, for every w_1^{n-1}, the entries w_1^{n-1} x are consecutive. One pass collects both the denominator and the backoff statistics C_n(i) for i ∈ [1, 3].

A problem arises in that the denominator \sum_x a(w_1^{n-1} x) is known only after streaming through all w_1^{n-1} x, but is needed immediately to compute each u(w_n | w_1^{n-1}). One option is to buffer in memory, taking O(N |vocabulary|) space since each order is run independently in parallel. Instead, there are two threads for each order. The sum thread reads ahead to compute \sum_x a(w_1^{n-1} x) and b(w_1^{n-1}) and then places these in a secondary stream. The divide thread reads the input and the secondary stream and then writes records of the form

(w_1^n, u(w_n | w_1^{n-1}), b(w_1^{n-1}))    (3.3)

The secondary stream is short, so data read by the sum thread will likely still be cached when read by the divide thread. This sort of optimization is not possible with most MapReduce implementations.

The next step, interpolation, needs backoff b(w_1^{n-1}) to compute final probability p(w_n | w_1^{n-1}). However, the final model (Section 3.3.5) places backoff b(w_1^n) next to p(w_n | w_1^{n-1}). One way to make both available with w_1^n would be to write records of the form

(w_1^n, u(w_n | w_1^{n-1}), b(w_1^{n-1}), b(w_1^n))

These records are bigger, which would cost more to sort. Moreover, one thread processes n-grams w_1^{n-1} x to compute b(w_1^{n-1}) while another thread processes (n+1)-grams w_1^n x to compute b(w_1^n). These threads would have to be synchronized to merge their output. Instead, the backoffs b(w_1^n) are written to N files, one for each 0 ≤ n < N, as bare values without keys. Because the entries w_1^n x are sorted in context order, their contexts w_1^n are sorted in suffix order. Therefore, these files contain backoffs b(w_1^n) in suffix order, which will be useful in Section 3.3.5.
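To make the normalization arithmetic concrete, here is a minimal single-threaded C++ sketch that processes one context group (all entries w_1^{n-1} x, which are consecutive in context order) and produces u and b; the two-thread sum/divide pipelining described above is omitted, and the Entry type and discounts parameter are illustrative stand-ins.

#include <cstdint>
#include <vector>

struct Entry {
  uint64_t adjusted_count;  // a(w_1^{n-1} x)
  double pseudo_prob;       // u(x | w_1^{n-1}), filled in below
};

// Normalize one group of entries sharing the context w_1^{n-1}.
// discounts[i] is D_n(i) for i in [0,3]; adjusted counts above 3 use discounts[3].
// Returns the backoff b(w_1^{n-1}) for the group (Equation 3.1, context case).
double NormalizeGroup(std::vector<Entry> &group, const double discounts[4]) {
  uint64_t denominator = 0;
  uint64_t c[4] = {0, 0, 0, 0};  // C_n(1..3) stored in c[1..3] (Equation 3.2).
  for (const Entry &e : group) {
    denominator += e.adjusted_count;
    if (e.adjusted_count >= 3) {
      ++c[3];
    } else if (e.adjusted_count >= 1) {
      ++c[e.adjusted_count];
    }
  }
  // Pseudo probability u = (a - D_n(a)) / denominator.
  for (Entry &e : group) {
    double d = discounts[e.adjusted_count > 3 ? 3 : e.adjusted_count];
    e.pseudo_prob = (e.adjusted_count - d) / static_cast<double>(denominator);
  }
  // Discounted mass left over for lower orders.
  return (discounts[1] * c[1] + discounts[2] * c[2] + discounts[3] * c[3]) /
         static_cast<double>(denominator);
}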

3.3.4 Interpolation

Chen and Goodman (1998) found that perplexity improves when the various orders within the same model are interpolated. The interpolation step computes final probability p according to the recursive equation

p(w_n \mid w_1^{n-1}) = u(w_n \mid w_1^{n-1}) + b(w_1^{n-1})\, p(w_n \mid w_2^{n-1})    (3.4)

Recursion terminates when unigrams are interpolated with the uniform distribution

p(w_n) = u(w_n) + b(\epsilon) \frac{1}{|\text{vocabulary}|}

where ε denotes the empty string. The unknown word counts as part of the vocabulary and has count zero,⁷ so its probability is

p(\text{<unk>}) = \frac{b(\epsilon)}{|\text{vocabulary}|}

Probabilities are computed by streaming in suffix lexicographic order: w_n appears before w_{n-1}^n, which in turn appears before w_{n-2}^n. In this way, p(w_n) is computed before it is needed to compute p(w_n | w_{n-1}), and so on. This is implemented by jointly iterating through N streams, one for each length of n-gram. The relevant pseudo probability u(w_n | w_1^{n-1}) and backoff b(w_1^{n-1}) appear in the input records (Equation 3.3).
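For illustration, here is a minimal C++ sketch of the recursion in Equation 3.4 for a single n-gram, assuming the pseudo probabilities u and backoffs b of its suffixes have already been looked up; the stream-based joint iteration described above is not shown, and the function signature is hypothetical.

#include <cstddef>
#include <vector>

// Interpolate a single n-gram bottom-up.
//   u[k]        : u(w_n | context of length k), for k = 0 .. n-1.
//   backoff[k]  : b(context of length k+1), so backoff[k-1] pairs with u[k].
//   backoff_eps : b(epsilon), the backoff of the empty context.
double Interpolate(const std::vector<double> &u, const std::vector<double> &backoff,
                   double backoff_eps, double vocab_size) {
  // Unigram case: interpolate with the uniform distribution.
  double p = u[0] + backoff_eps / vocab_size;
  // Each longer context interpolates with the probability under the next-shorter one.
  for (std::size_t k = 1; k < u.size(); ++k) {
    p = u[k] + backoff[k - 1] * p;  // Equation 3.4.
  }
  return p;
}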

3.3.5 Joining

This step unites probability p(w_n | w_1^{n-1}) and backoff b(w_1^n) so that they can be stored together in the final language model. Interpolation (Section 3.3.4) computes p(w_n | w_1^{n-1}) in suffix order by using b(w_1^{n-1}), so it does not know b(w_1^n). Normalization (Section 3.3.3) stored the backoffs b(w_1^n) in suffix order. Joining simply merges the probability stream from interpolation with the backoff files written by normalization.⁸

Chapter 4 describes query-time data structures that store language models. One of these, the trie data structure, is built by first sorting n-grams in suffix order. Since records are already sorted in this order,⁹ the trie data structure can be built in the same pass as interpolation. However, if quantization (Whittaker and Raj, 2001) is enabled, one pass is required to train the quantizer while a second pass applies the quantizer.

3.4 Numerical Precision

Exact integer arithmetic is used where possible. In particular, the smoothing statistics t_{n,i} in Section 3.3.2, the normalization denominator \sum_x a(w_1^{n-1} x) in Section 3.3.3, and the backoff terms C_n(i) in Section 3.3.3 are all exact integers. Thus, no precision is lost when performing O(|vocabulary|) operations to arrive at these statistics. Non-integer values such as the discounts D, pseudo probabilities u, backoffs b, probabilities p, and particularly p(<unk>) are all computed from exact integers in O(N) floating-point operations.

In contrast, SRILM (Stolcke, 2002) computes backoff b from probabilities¹⁰ according to the equivalent equation

b(w_1^n) = \frac{1 - \sum_{x : w_1^n x \text{ appears}} p(x \mid w_1^n)}{1 - \sum_{x : w_1^n x \text{ appears}} p(x \mid w_2^n)}

and computes the unknown word probability as

p(\text{<unk>}) = 1 - \sum_{w \neq \text{<unk>}} p(w)

⁷ SRILM implements "another hack" that computes p_SRILM(w_n) = u(w_n) and p_SRILM(<unk>) = b(ε) whenever p(<unk>) < 3 × 10⁻⁶, as it usually is. Both are implemented in this work. Their motivation may have been numerical precision.
⁸ Backoffs will only appear in the file produced by Section 3.3.3 if the n-gram is the context of some (n+1)-gram. If an n-gram is not a context (the unknown word or the end of sentence w_n = </s>), then the backoff is zero and the backoff file position is not advanced.
⁹ The trie data structure requires that vocabulary identifiers be assigned in a particular way and that sorting order be based on its vocabulary identifiers. Moreover, this assignment requires that all words be known at the time. After the vocabulary is known (namely, once counting is completed), a vocabulary identifier mapping is created. This mapping is applied during the adjusted counts step, so no extra pass or sort is required.
¹⁰ SRILM computes backoffs according to Equation 3.1, uses them to compute probabilities in Equation 3.4, forgets the backoffs, and then reconstructs backoffs from probabilities. This may have been done to generically support multiple smoothing methods or pruning.

These calculations are numerically imprecise because O(|vocabulary|) floating-point operations are performed to arrive at values that are small compared to the rounding error. In some cases, SRILM can generate backoffs b(w_1^{n-1}) > 1, but this is impossible when the discounts are in range and the model is not pruned. Nonetheless, such backoffs may be valid for pruned models or those smoothed with other methods, so the data structures in Chapter 4 preserve the ability to represent backoffs greater than one (or zero in log space).

Proposition 1. If the discounts Dn(i) are in range for all orders n and counts i

0 ≤ Dn(i) ≤ i ∀i ∈ [0, 3], n ∈ [1,N]

then backoff b(w_1^{n-1}) \le 1 for all w_1^{n-1} in an unpruned language model with interpolated modified Kneser-Ney smoothing.

Proof. Equation 3.2 defined counts-of-adjusted-counts C_n(i) for each order n and count i ≤ 3. These counts are non-negative, so both sides of the assumption can be multiplied by C_n(i) to yield

D_n(i)\, C_n(i) \le i\, C_n(i) \quad \forall i \in [1, 3],\ n \in [1, N]    (3.5)

Equation 3.2 has two cases for Cn(i): adjusted counts i ≤ 2 and adjusted counts 3 or above. In the first case where i ≤ 2:

i\, C_n(i) = i\, |\{x : a(w_1^{n-1} x) = i\}| = i \sum_{x : a(w_1^{n-1} x) = i} 1 = \sum_{x : a(w_1^{n-1} x) = i} i = \sum_{x : a(w_1^{n-1} x) = i} a(w_1^{n-1} x)

where a(w_1^n) is the adjusted count of w_1^n. By transitivity with Equation 3.5,

D_n(i)\, C_n(i) \le \sum_{x : a(w_1^{n-1} x) = i} a(w_1^{n-1} x) \quad \forall i \in [1, 2],\ n \in [1, N]

where D_n are the discounts for order n. The second case, where i = 3, is similar:

3\, C_n(3) = 3\, |\{x : a(w_1^{n-1} x) \ge 3\}| = 3 \sum_{x : a(w_1^{n-1} x) \ge 3} 1 = \sum_{x : a(w_1^{n-1} x) \ge 3} 3 \le \sum_{x : a(w_1^{n-1} x) \ge 3} a(w_1^{n-1} x)

where a is the adjusted count. By transitivity with Equation 3.5,

D_n(i)\, C_n(i) \le \sum_{x : a(w_1^{n-1} x) \ge i} a(w_1^{n-1} x) \quad \text{for } i = 3,\ \forall n \in [1, N]

where again D_n are the discounts for order n. Summing the cases where the inequality holds,

\sum_{i=1}^{3} D_n(i)\, C_n(i) \le \left(\sum_{x : a(w_1^{n-1} x) = 1} a(w_1^{n-1} x)\right) + \left(\sum_{x : a(w_1^{n-1} x) = 2} a(w_1^{n-1} x)\right) + \left(\sum_{x : a(w_1^{n-1} x) \ge 3} a(w_1^{n-1} x)\right) = \sum_x a(w_1^{n-1} x)    (3.6)

P n−1 It remains to show that x a(w1 x) > 0 by cases. n−1 n−1 1. w1 is not a context of any n-gram. Then backoff b(w1 ) = 1 by definition (Equation 3.1) and the claim holds. n−1 n−1 2. w1 is a context but adjusted count a(w1 x) = 0∀x. The only entry with adjusted count zero is the n−1 unknown word. Thus w1 x = ∀x. This implies that the model consists solely of the unknown word and the training data is empty. Kneser-Ney smoothing is not defined in this case. n−1 n−1 3. w1 is a context and adjusted count a(w1 x) > 0 for some x. The other terms are non-negative, so the sum is positive. Dividing both sides of Equation 3.6 by a positive number,

\frac{\sum_{i=1}^{3} D_n(i)\, C_n(i)}{\sum_x a(w_1^{n-1} x)} \le 1

Recognizing the definition of backoff from Equation 3.1,

b(w_1^{n-1}) \le 1

3.5 Streaming and Sorting Framework

This section describes a general streaming and sorting framework along with how it is applied to support language model estimation. It is designed to handle fixed-length records, support simultaneous streams, use user-specified memory, and optimize sorting.

For language model estimation, records encode n-grams. An n-gram record is an array of n vocabulary identifiers (4 bytes each, as the largest vocabulary to date is 393 million words) and an 8-byte value. The value can be a count or two floating-point numbers. The various orders 1 ≤ n ≤ N are handled as separate streams, so records within a stream have fixed length, eliminating the need to store length information. However, record length is determined at runtime, which is not supported by two existing frameworks (Vengroff, 1994; Dementiev et al., 2008).¹¹

Processing steps are threads. Records are passed between threads using producer-consumer queues. To reduce overhead, records are grouped into blocks and these blocks are passed by reference. Processing steps can choose whether to process blocks (i.e. to combine counts of n-grams within a block) or simply perceive a stream of records. When the last step is finished with a block, it is recycled.

Records are updated in place, improving upon MapReduce implementations that typically copy. The current pipeline does not need to change the size of records (an 8-byte count conveniently takes the same space as two floating-point values) but, if it did, this could be accomplished by reading input from one stream and writing to another.

¹¹ These frameworks require that records have a type that does not use memory allocation. Technically, it is possible to create a type for each order n up to some fixed limit with C++ templates, but doing so makes the pipeline steps unnecessarily complicated.
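The producer-consumer queue of recycled blocks described above might look roughly like the following hedged C++ sketch; the Block type and queue are simplified stand-ins, not the framework's actual classes.

#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>
#include <vector>

// A block of fixed-length records, passed between pipeline threads by reference
// (here via pointer) and recycled once the last step is done with it.
struct Block {
  std::vector<uint8_t> memory;  // Raw record storage; record length is set at runtime.
  std::size_t valid_bytes = 0;  // How much of the block the producer filled.
};

// Minimal thread-safe producer-consumer queue of block pointers.
class BlockQueue {
 public:
  void Produce(Block *block) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      queue_.push(block);
    }
    ready_.notify_one();
  }

  Block *Consume() {
    std::unique_lock<std::mutex> lock(mutex_);
    ready_.wait(lock, [this] { return !queue_.empty(); });
    Block *block = queue_.front();
    queue_.pop();
    return block;
  }

 private:
  std::mutex mutex_;
  std::condition_variable ready_;
  std::queue<Block *> queue_;
};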

[Figure 3.4 diagram: for each of the three orders, blocks flow in a circular chain through lazy merging of raw counts in suffix order, the adjusted counts step, sorting of blocks in context order, and writing to disk.]

Figure 3.4: The adjusted counts step in detail for a trigram language model (N = 3). There are N circular chains of processing steps, one for each order. Input N-gram counts are lazily merged in suffix order for consumption by the adjusted counts algorithm. A single thread computes adjusted counts by manipulating N streams. In preparation for the next step, normalization, blocks are sorted in context order before they are written to disk. Blocks are then recycled. There are 2 + 2N threads in total: one lazy merge sort, one adjusted counts algorithm, N block sorters, and N disk writers.

Much work has been done on efficient disk-based merge sort. Particularly important is arity, the number of blocks that are merged at once. Low arity leads to more passes while high arity incurs more disk seeks. Abello and Vitter (1999) modeled these costs and derived an optimal strategy: use fixed-size read buffers (one for each block being merged) and set the arity to the number of buffers that fit in RAM, leaving enough space for a write buffer. The optimal buffer size is hardware-dependent; 64 MB is the default. To overcome the operating system limit on file handles, multiple blocks are stored in the same file.

Merge sort is further optimized by using the streaming framework as shown in Figure 3.4. If sufficient memory is available for read buffers, the last merge is performed lazily in memory. Output blocks are sorted in the next step's desired order before they are written to disk. These optimizations eliminate up to two copies to disk if enough RAM is available. Moreover, disk operations happen in the lazy merge sort and disk writing threads, which execute in parallel with computation.

Because lazy merge sort is not easily amenable to overwriting its input, peak disk usage is twice the amount required to store all n-gram records. Additional costs are the null-delimited vocabulary strings and the backoff files written in Section 3.3.3. Due to the use of merge sort, complexity is O(|n-grams| log |n-grams|). In theory, an algorithm based on hash tables would have linear complexity O(|n-grams|). However, on-disk hash tables have a much higher constant cost due to seeking.
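As an illustrative back-of-the-envelope calculation (not a figure from the thesis), the arity rule above with the default 64 MB buffers and, say, a 3.9 GB sorting budget gives roughly

\[
\text{arity} \approx \left\lfloor \frac{3.9 \times 1024\ \text{MB}}{64\ \text{MB}} \right\rfloor - 1 = 62 - 1 = 61,
\]

so about 61 sorted blocks can be merged in one pass while reserving one buffer for writing.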

3.6 Experiments

3.6.1 Methodology

Experiments use the ClueWeb09 (Callan et al., 2009) corpus collected from websites. After spam filtering (Cormack et al., 2011), removing markup, selecting English, splitting sentences (Koehn, 2005), deduplicating, tokenizing (Koehn et al., 2007), and truecasing, 126 billion tokens remained.

Benchmarks were run on an otherwise-quiet machine using local disk. Peak virtual memory usage is measured by the kernel and collected immediately before termination. CPU time is the sum of system and user time reported by the kernel. Toolkits were configured to write their own binary files, except for IRSTLM's disk-based pipeline and BerkeleyLM, where it is more efficient to write ARPA files.

3.6.2 Memory Setting

With this work, the user determines how much memory to use. Figure 3.5 shows how this setting impacts CPU and wall time costs. The task is to estimate a 5-gram language model on sentences randomly sampled, without replacement, from ClueWeb09. The test machine has 64 GB RAM and 32 cores. The memory setting impacts both block sizes and the arity of merge sort as explained in Section 3.5. In turn, these determine the number of merge sort passes and whether there is enough RAM to do the last pass lazily. With 1.2 GB RAM, the sorting framework falls back to doing a full merge, writing to disk, and later reading the merged output. Wall time increases because more data is written to disk. It also increases because merge sort and streaming are performed serially rather than in parallel.

3.6.3 Toolkit Comparison

Toolkit performance was measured by building an unpruned 5-gram language model. Results from the previous section are comparable because the same machine and subsets of ClueWeb09 were used; for simplicity, only the 3.9 GB setting is shown in comparisons. While most toolkits run in a single process, SRILM and IRSTLM also offer pipelines that entail a series of processes. In these cases, the memory usage reported is the maximum of any one process (as if they were run serially). The CPU time is summed and actual wall time is reported (allowing for any parallelism). Thus, the combination of memory usage and wall time may not be feasible, but it represents a lower bound on resource usage. IRSTLM's disk pipeline has a parameter, the number of pieces, that trades between approximation and peak memory (were steps run in serial). Three pieces were used in the experiments.

With BerkeleyLM, one is required to guess, in advance, how much memory it will use. Guessing too low leads to a crash (in contrast, this work performs more merge passes). Guessing too high would inflate the statistics reported. Binary search was used to find the lowest setting, in integer gigabytes, with which it would run. The process's peak virtual memory consumption is reported as usual. To confirm that this setting did not negatively impact CPU time, the experiment was also run with plentiful memory.

Results are in Figure 3.6. Compared with SRILM on 302 million tokens, this work used 7.7% of the RAM, 25.4% of the CPU time, and 14.0% of the wall time. Compared with IRSTLM, this work used 16.6% of the RAM, 16.4% of the CPU time, and 9.0% of the wall time. MITLM is admirably fast; in fact, it is slightly faster in terms of CPU time. However, this work is faster in terms of wall time due to threading.

3.6.4 Scaling

Estimating an unpruned 5-gram model (Table 3.1) on 126 billion tokens took 123 GB RAM, 5.4 CPU days, and 2.8 days wall time on a machine with 140 GB RAM and six hard drives in a RAID5 configuration (sustained read: 405 MB/s). A summary of Google's results appears in Table 3.2. Using Chapter 4, the large model was quantized (Whittaker and Raj, 2001) to 10 bits and compressed to 643 GB, then copied to a machine with 1 TB RAM.

[Figure 3.5 plots: CPU time (hours) and wall time (hours) against tokens (millions), with one curve per memory setting: 1.2 GB, 3.9 GB, 7.8 GB, 15.8 GB, 31.8 GB, and 61.4 GB.]

Figure 3.5: Performance of this work with various memory settings. Curves are labeled by their empirical peak memory usage. CPU time is largely unaffected by the memory setting on small corpora. When memory is particularly tight (1.2 GB), wall time increases because lazy merge sort is no longer feasible.

[Figure 3.6 plots: peak virtual memory (GB), CPU time (hours), and wall time (hours) against tokens (millions) for BerkeleyLM, MITLM, SRILM (default, compact, and disk pipelines), IRSTLM (default and disk pipelines), and this work at the 3.9 GB setting.]

Figure 3.6: Performance of various toolkits. Experiments were stopped before they ran out of memory. SRILM's default and disk-based pipelines use the same RAM for unpruned models.

36 1 2 3 4 5 393M 3,775M 17,629M 39,919M 59,794M

Table 3.1: Counts of unique n-grams (in millions) for the 5 orders in the large language model.

             Tokens         Smoothing        Machines   Days   Year
This Work    126 billion    Kneser-Ney       1          2.8    2013
Google       31 billion     Stupid Backoff   400        0.3    2007
             31 billion     Kneser-Ney       400        2      2007
             230 billion    Kneser-Ney       ?          ?      2013
             1800 billion   Stupid Backoff   1500       1      2007

Table 3.2: Large-scale results from this work and Google (Brants et al., 2007; Chelba and Schalkwyk, 2013). Chelba and Schalkwyk (2013) did not report the cost of estimating the language model with 230 billion tokens. Comparisons are somewhat unfair due to differences in hardware and data.

Lossy compression methods (Guthrie and Hepple, 2010; Talbot and Osborne, 2007) and distributed language models (Brants et al., 2007) could reduce hardware requirements.

The large model was added to an existing machine translation system. Built and described by Durrani et al. (2013d), the baseline phrase-based Moses (Koehn et al., 2007) system was trained on all the data provided by the evaluation (Bojar et al., 2013a), totaling 7 billion tokens of English. Using SRILM (Stolcke, 2002), 5-gram interpolated modified Kneser-Ney language models were separately estimated on each year of the news crawl data, the news commentary data, each source in the English Gigaword corpus (Parker et al., 2011), and the English sides of each parallel corpus provided by the evaluation (Bojar et al., 2013a). Default SRILM pruning was used, so singletons were removed for trigrams and above. These models were then linearly interpolated with weights tuned on the 2011 test set, yielding the baseline language model. Feature weights were tuned with PRO (Hopkins and May, 2011) for Czech–English and batch MIRA (Cherry and Foster, 2012) for French–English and Spanish–English because these worked best on the 2012 test set.

The large language model was added to the baseline as a separate feature.¹² Feature weights were retuned using the same tuning algorithms. Due to time limits on access to machines with 1 TB RAM, the large language model was only used in the Czech–English, French–English, and Spanish–English submissions to the 2013 Workshop on Machine Translation. The additional language model feature and retuning are the only differences between this work and the baseline. In particular, only parallel data provided by the evaluation was used; Google presumably used more data.

Results from the 2013 Workshop on Machine Translation are shown in Table 3.3. On the official metric (Bojar et al., 2013a), which is based on human comparisons against all participants, using a large language model significantly improves over both the baseline and Google. In direct pairwise comparisons (not shown), improvement over the baseline is significant (> 99% confidence) in all three language pairs. Pairwise improvement over Google is significant in French–English (> 99% confidence) and insignificant for Czech–English and Spanish–English (Bojar et al., 2013a).

That large language models improve translation is not a new result (Brants et al., 2007). However, Brants et al. (2007) recommended stupid backoff. Osborne (2013) estimated a stupid backoff language model on the same ClueWeb09 data, lossily compressed it with RandLM (Talbot and Osborne, 2007), and reported no improvement in a different machine translation system.

¹² Experiments with Czech–English found that a separate feature works better, in terms of BLEU (Papineni et al., 2002) on the 2012 test set (Callison-Burch et al., 2012), than interpolation.

Czech–English (11 total participants)
                     Human            Automatic
System       Rank    Score     BLEU    TER     METEOR
This Work    1       0.607     28.16   54.32   63.32
Google       2–3     0.582     27.11   53.60   62.71
Baseline     3–5     0.562     27.38   54.99   62.64

French–English (13 total participants)
                     Human            Automatic
System       Rank    Score     BLEU    TER     METEOR
This Work    1       0.638     33.37   49.81   66.67
Google       2–3     0.591     32.62   49.44   66.23
Baseline     2–3     0.604     32.57   50.42   66.28

Spanish–English (12 total participants)
                     Human            Automatic
System       Rank    Score     BLEU    TER     METEOR
This Work    1       0.624     32.55   50.34   66.83
Google       2       0.595     33.65   48.17   66.98
Baseline     3–5     0.570     31.76   50.90   66.29

Table 3.3: Results from the 2013 Workshop on Machine Translation (Bojar et al., 2013a). Also shown are uncased BLEU (Papineni et al., 2002), TER (Snover et al., 2006), and METEOR 1.4 (Denkowski and Lavie, 2011).

Output Case Studies

To illustrate the impact of the large language model, two sentences were cherry-picked from the 3000-sentence official test set. In the first example, human judges preferred the large language model over the baseline. In the second example, human judges preferred the baseline (Bojar et al., 2013a).

The large system reasonably translated a Czech word that the baseline had negated.

Source: Za tím se skrývá označení, které je v Číně zcela oficiálně registrováno, avšak s původní firmou nemá nic společného.
Reference: Behind this name hides a fully officially registered brand in China, however, one that has nothing whatsoever to do with the original companies.
This Work: It is a label that is completely officially registered in China, but has nothing to do with the original company.
Baseline: It is the designation, which is not officially registered in China, but has nothing to do with the original company.

The Czech word "zcela" in this case translates as "fully". It is frequently used to qualify negative sentences, but does not itself indicate negation.¹³ This led to an erroneous phrase pair that translates "zcela" as "not". Nonetheless, the translation model correctly favors "completely" (Table 3.4).

The baseline language model favored the negated translation while the large language model slightly favors the better translation (Table 3.5). This happened, in part, because the baseline language model training

¹³ Thanks to Czech speaker Bojar (2013) for this explanation.

data did not contain "completely officially", so the model backed off to a unigram in order to compute p(officially | label that is completely). In contrast, the large language model had seen the trigram "is completely officially" and gave it a higher probability as a result.

                      completely   not
Score in This Work    −0.26        −1.08
Score in Baseline     −0.28        −1.09

Table 3.4: Translation model scores for "zcela" after computing the dot product with feature weights. Only the relevant two (of 3,833) ways to translate "zcela" are shown. Scores are slightly different because the system was retuned to accommodate the large language model.

                      This Work's Translation   Baseline's Translation
Large Model           −38.05                    −38.70
Baseline Model        −37.74                    −34.26
Score in This Work    −4.79                     −4.52
Score in Baseline     −4.60                     −4.17

Table 3.5: Sentence-level language model scores (log10 probability) for the first example. Scores are also shown after weights are applied. Both models were used as features in the large system.

Language models run the risk of favoring common words over those that reflect the input:

Source: Voici l'histoire d'un lien qui franchit bien plus qu'une distance de 8,733 kilomètres.
Reference: The history of a link that overcomes far more than a distance of 8,733 kilometres.
This Work: This is the story of a relationship that is much more than a distance of 8,733 kilometers.
Baseline: This is the story of a link that passes much more than a distance of 8,733 kilometers.

The French verb "franchir" (of which "franchit" is a morphological variant) means to cross or to overcome. In this case, it was erroneously translated as "is". Such errors happen easily when phrases are extracted from parallel text (Koehn et al., 2007; Koehn et al., 2003). For example, the French sentence "De plus, la circulation routière, dans l'Europe d'aujourd'hui, franchit largement les frontières." translates as "In addition, traffic in Europe today is, to a very large extent, transboundary." in the Europarl corpus (Koehn, 2005). In some sense, aligning "is" to "franchit" is correct because these are the main verbs in the sentence, though "transboundary" carries its meaning.

As with the previous example, the phrase scores favor the better translation (Table 3.6). As shown in Table 3.7, both language models prefer the erroneous translation. Their preference is reasonable given that "is" is more frequent than "passes". The difference is that the large language model makes a stronger distinction and overpowers the translation model.

3.7 Summary

This chapter has shown how to accurately and relatively inexpensively estimate language models with interpolated modified Kneser-Ney smoothing. Disk-based streaming and sorting algorithms make estimation fast while using user-specified RAM. Somewhat surprisingly, it is faster than existing memory-based toolkits. The difference is that memory-based toolkits typically perform data-structure lookups every time they access an n-gram while, in this work, relevant records are immediately found in the streams. It is hoped that the community will find the open-source implementation, dubbed lmplz, useful and make fuller use of the large amounts of data available on the web.

Future directions include an analogous algorithm to interpolate separately trained language models, pruning, different smoothing methods, partitioning the set of n-grams for parallelization, optional compression, and other applications of the general streaming framework.

                      is       passes
Score in This Work    −1.32    −0.66
Score in Baseline     −1.30    −0.67

Table 3.6: Translation model scores for "franchit" after computing the dot product with feature weights. Only the relevant two (of 356) ways are shown. Scores are slightly different because the system was retuned to accommodate the large language model.

                      This Work's Translation   Baseline's Translation
Large Model           −36.22                    −45.14
Baseline Model        −32.06                    −37.46
Score in This Work    −4.23                     −5.07
Score in Baseline     −3.91                     −4.57

Table 3.7: Sentence-level language model scores (log10 probability) for the second example. Scores are also shown after weights are applied.

Chapter 4

Querying Language Models¹

A place for everything and everything in its place. –Isabella Mary Beeton, The Book of Household Management

Have a place for everything and keep the thing somewhere else; this is not advice, it is merely custom. –Mark Twain

This chapter is about efficiently querying backoff-smoothed N-gram language models (Katz, 1987), including interpolated modified Kneser-Ney models from Chapter 3. Applications frequently query the language model in their search for high-scoring derivations (Section 2.4); for example, a machine translation system makes millions of queries to translate an average sentence (Section 5.6). Query speed is therefore important to overall system performance, a relationship borne out in the experiments. Moreover, while queries are not random (the basis for Chapter 5), they are also not sequential, so the disk-based techniques used in Chapter 3 do not efficiently apply to queries. Thus, language models are typically stored in RAM (Section 2.3), where space is a constant concern. The goals in this chapter are to increase query speed and decrease memory usage while accurately reporting the language model's probabilities.

Both Chapter 3 and this chapter support the standard ARPA format for compatibility with many other language model toolkits. The two can also be combined to build a query-time representation directly from a corpus, as was done with the experiments in Chapter 3. While Chapters 5, 6, and 7 could be applied independently of language model representation, they are made more efficient by integrating with the language model representation. For example, Chapter 6 stores an additional value alongside probability and backoff.

4.1 Introduction

Language model probability queries take the form p(w_n | w_1^{n-1}), where w_1^n is an n-gram. Backoff-smoothed models (Katz, 1987) estimate this probability according to the recursive equation

p(w_n | w_1^{n-1}) = \begin{cases} p(w_n | w_1^{n-1}) & \text{if } w_1^n \text{ appears in the model} \\ b(w_1^{n-1})\, p(w_n | w_2^{n-1}) & \text{otherwise} \end{cases}

where probability p and backoffs b were estimated from text by some smoothing method. Unraveling the recursive equation yields an equivalent procedure: match as much context as possible, then charge backoffs b for contexts that were not matched.

f = \min \{ i : w_i^n \text{ appears in the model} \}

p(w_n | w_1^{n-1}) = p(w_n | w_f^{n-1}) \prod_{i=1}^{f-1} b(w_i^{n-1})

The goal is to store a large and sparse set of n-grams, along with their probabilities and backoffs, so that executing this procedure is efficient. The most common data structure is a reverse trie, shown in Figure 4.1 (Clarkson and Rosenfeld, 1997). Nodes in the trie correspond to n-grams in the language model. When queried for p(w_n | w_1^{n-1}), the language model walks down the trie, visiting nodes w_n^n, w_{n-1}^n, ... until it stops at w_f^n because all words have been matched (f = 1) or because the model does not contain w_{f-1}^n. The model then reads p(w_n | w_f^{n-1}) and, if necessary, performs a second walk through the trie to retrieve backoff values {b(w_i^{n-1})}_{i=1}^{f-1}, which are all found on the path w_{n-1}^{n-1}, w_{n-2}^{n-1}, .... This path may also reach a dead end; backoff is implicitly 1 if there is no entry in the model.

1An earlier version of this work was published as Heafield (2011).
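As a concrete illustration of this procedure, the following is a minimal sketch that finds the longest matching suffix and then charges backoffs for the unmatched contexts. The map-based model and the function names here are assumptions for illustration only; the actual data structures are described in Section 4.2.

#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct Entry { float log_prob; float log_backoff; };
// Hypothetical in-memory model: n-gram -> (log probability, log backoff).
using ToyTable = std::map<std::vector<std::string>, Entry>;

// words = w_1 ... w_n; returns log p(w_n | w_1^{n-1}).
float LogProb(const ToyTable &model, const std::vector<std::string> &words) {
  std::size_t f = words.size() - 1;  // start index of the longest matched suffix
  float prob = 0.0f;                 // assumes the unigram (possibly <unk>) exists
  for (std::size_t start = words.size(); start-- > 0; ) {
    auto it = model.find(std::vector<std::string>(words.begin() + start, words.end()));
    if (it == model.end()) break;    // cannot match any longer context
    prob = it->second.log_prob;
    f = start;
  }
  // Charge backoff b(w_i^{n-1}) for every context longer than the one matched.
  for (std::size_t i = 0; i < f; ++i) {
    auto it = model.find(std::vector<std::string>(words.begin() + i, words.end() - 1));
    if (it != model.end()) prob += it->second.log_backoff;  // absent contexts have implicit backoff 1
  }
  return prob;
}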


Figure 4.1: Lookup of “is one of” in a reverse trie. Given query w_1^n, the last word w_n is looked up first, so that w_f^n is found efficiently.

Many have implemented language model querying with various data structures:

SRILM 1.5.12 (Stolcke, 2002) is a popular toolkit based on tries.

IRSTLM 5.60.02 (Federico et al., 2008) is a trie implementation based on sorted arrays. It is designed to reduce memory consumption relative to SRILM.

MITLM 0.4 (Hsu and Glass, 2008) is mostly designed for accurate model estimation, but can also compute probability using a trie.

TPT Germann et al. (2009) describe tries with better locality properties, but no implementation is available.

RandLM 0.2 (Talbot and Osborne, 2007) uses randomized data structures based on Bloom filters (Bloom, 1970) to conserve memory.

Google Talbot and Brants (2008) use minimal perfect hashing and fingerprints to save space.2

2Google is excluded from comparisons because no implementation is available and Talbot and Brants (2008) only reported memory consumption, which is higher than with ShefLM.

ShefLM revision 13974bc (Guthrie and Hepple, 2010) implements improved minimal perfect hashing techniques (Belazzougui et al., 2008) that use less memory than Google.

BerkeleyLM revision 152 (Pauls and Klein, 2011) implements tries based on hash tables and sorted arrays.

Packages are further described in Section 4.3. This work presents two data structures, collectively dubbed KenLM, that substantially outperform all baselines on query speed. In addition to being faster than all baselines, the trie data structure uses less RAM than lossless baselines.

4.2 Data Structures

Sparse mapping is a key subproblem of language model storage. One option is to map n-grams directly to their probability and backoff. The first data structure, dubbed probing, applies linear probing hash tables and is optimized for speed. Another option is to arrange n-grams into a trie. In a trie, each node contains a sparse map from vocabulary identifiers to child nodes. The second data structure performs this mapping using a variation on binary search through a sorted array of records.

4.2.1 Hash Tables and Probing

Hash tables are a common sparse mapping technique used by SRILM's default variant. Keys to the table are hashed, using for example MurmurHash (Appleby, 2012), to integers evenly distributed over a large range. This range is collapsed to a number of buckets, typically by taking the hash modulo the number of buckets. Entries landing in the same bucket are said to collide. Chaining is a popular technique for collision resolution shown in Figure 4.2. Colliding entries are stored in a linked list associated with their ideal bucket. The problem with chaining is that a successful lookup requires at least two random memory reads: one to index the array of buckets and another to follow the linked list pointer. Moreover, pointers consume at least 8 bytes of memory per entry on a 64-bit machine.


Figure 4.2: A chaining hash table with six buckets (squares) and four entries (circles). Each bucket contains a possibly-empty linked list of entries.

Linear probing hash tables (Knuth, 1963) have lower memory overhead when entries are small and, in the vast majority of cases, require only one random memory access per lookup. The data structure is an array of buckets. Each bucket contains exactly one entry or is empty. Adding an element consists of finding the ideal bucket (hash modulo the number of buckets), scanning forward in the array until an empty bucket is found, and placing the new entry there. Therefore, non-empty buckets contain an entry belonging to them or to a preceding bucket where a collision occurred. Searching a probing hash table consists of hashing the key, indexing the corresponding bucket, and scanning buckets until a matching key is found or an empty

bucket is encountered, in which case the key does not exist in the table. The vast majority of forward scans do not wrap around to the beginning of the array, so almost all queries entail one random memory access followed by a sequential scan.

Linear probing hash tables must have more buckets than entries, or else an empty bucket will never be found. The ratio of buckets to entries is controlled by a space multiplier m > 1. As the name implies, space is O(m) and linear in the number of entries. The fraction of buckets that are empty is (m − 1)/m, so average lookup time is O(m/(m − 1)) and, theoretically, constant in the number of entries. In practice, lookup is faster for smaller hash tables due to caching.

When keys are longer than 64 bits, space is conserved by replacing the keys with their 64-bit hashes. With a good hash function, collisions of the full 64-bit hash are exceedingly rare: for the model tested, one in 266 billion queries will falsely find a key not present. Collisions between two keys in the table can be identified at model building time. Further, the special hash 0 suffices to flag empty buckets.

The probing data structure is a rather straightforward application of linear probing hash tables to store N-gram language models. Unigram lookup is dense, so an array of probability and backoff values is used. For 2 ≤ n ≤ N, a hash table maps from n-grams to their respective probability and backoff.3 An example appears in Table 4.1. Vocabulary lookup is a hash table mapping from word to vocabulary index. In all cases, the key is collapsed to its 64-bit hash. Given counts c_1^N, where e.g. c_1 is the vocabulary size, total memory consumption, in bits, is

(96m + 64)c_1 + 128m \sum_{n=2}^{N-1} c_n + 96m c_N.

The probing data structure places all n-grams of the same order into a single giant hash table. This differs from other implementations (Stolcke, 2002; Pauls and Klein, 2011) that use hash tables as nodes in a trie, as explained in the next section. As implemented here, queries can jump directly to any n-gram of any length with a single lookup, a property shared with ShefLM. Backoff retrieval can therefore be slightly optimized by directly looking up b(w_1^{n-1}) without retrieving b(w_{n-1}^{n-1}), as would be required with a trie.

Bigrams
Words       Ideal   Hash               log p   log b
iran is       0     959e48455f4a2e90   −1.7    −0.4
(empty)             0
is one        2     186a7caef34acf16   −2.0    −0.9
one of        2     ac66610314db8dac   −1.4    −0.6
<s> iran      4     f0ae9c2442c6920e   −3.3    −1.2
(empty)             0

Table 4.1: Memory layout of the linear probing hash table with six buckets and four bigram entries. The column “ideal” is the hash modulo the number of buckets, where each entry would ideally reside before collision resolution. The boxed area is actually stored in memory as an array. Hash values are 64-bit integers, shown here in base 16.
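To make the probing lookup concrete, here is a minimal sketch of a linear probing table keyed by 64-bit hashes. The class and member names are illustrative assumptions, not the toolkit's actual implementation, which packs entries more tightly.

#include <cstddef>
#include <cstdint>
#include <vector>

struct Bucket { uint64_t key; float log_prob; float log_backoff; };

class ProbingTable {
 public:
  // Space multiplier m > 1 controls the ratio of buckets to entries,
  // so an empty bucket always exists and lookups terminate.
  ProbingTable(std::size_t entries, double m = 1.5)
      : buckets_(std::size_t(entries * m) + 1, Bucket{0, 0.0f, 0.0f}) {}

  void Insert(uint64_t hash, float log_prob, float log_backoff) {
    std::size_t i = hash % buckets_.size();                      // ideal bucket
    while (buckets_[i].key != 0) i = (i + 1) % buckets_.size();  // scan forward
    buckets_[i] = Bucket{hash, log_prob, log_backoff};
  }

  const Bucket *Find(uint64_t hash) const {
    std::size_t i = hash % buckets_.size();
    while (buckets_[i].key != 0) {         // stop at an empty bucket
      if (buckets_[i].key == hash) return &buckets_[i];
      i = (i + 1) % buckets_.size();
    }
    return nullptr;                        // key not present
  }

 private:
  std::vector<Bucket> buckets_;            // hash 0 flags an empty bucket
};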

3N-grams do not have backoff so none is stored.

4.2.2 Sorted Arrays and Trie

lower ← 0
upper ← |A| − 1
while lower ≤ upper do
    pivot ← ⌊((key − A[lower]) / (A[upper] − A[lower])) · (upper − lower)⌋ + lower
    if A[pivot] > key then
        upper ← pivot − 1
    else if A[pivot] < key then
        lower ← pivot + 1
    else
        return pivot
    end
end
return Not found

Algorithm 3: Interpolation search for a key in an array A. If the array contains repeated entries, the algorithm is slightly modified to avoid division by zero.
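For illustration, a C++ rendering of Algorithm 3 might look as follows. This is a sketch under the assumption that keys are roughly uniformly distributed 64-bit integers; the function name and the rounding guard are not from the original.

#include <cstddef>
#include <cstdint>

// Returns the index of key in the sorted array a[0..size), or SIZE_MAX if absent.
std::size_t InterpolationSearch(const uint64_t *a, std::size_t size, uint64_t key) {
  if (size == 0) return SIZE_MAX;
  std::size_t lower = 0, upper = size - 1;
  while (lower <= upper && key >= a[lower] && key <= a[upper]) {
    uint64_t denom = a[upper] - a[lower];
    // Repeated entries make the denominator zero; handle that case separately.
    if (denom == 0) return (a[lower] == key) ? lower : SIZE_MAX;
    // Estimate the key's position assuming keys are uniformly distributed.
    double frac = double(key - a[lower]) / double(denom);
    std::size_t pivot = lower + std::size_t(frac * double(upper - lower));
    if (pivot > upper) pivot = upper;  // guard against floating-point rounding
    if (a[pivot] > key) {
      if (pivot == 0) return SIZE_MAX;
      upper = pivot - 1;
    } else if (a[pivot] < key) {
      lower = pivot + 1;
    } else {
      return pivot;
    }
  }
  return SIZE_MAX;
}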

Sorted arrays store key-value pairs in an array sorted by key, incurring no space overhead. SRILM's compact variant, IRSTLM, MITLM, and BerkeleyLM's sorted variant are all based on this technique. Given a sorted array A, these other packages use binary search to find keys in O(log |A|) time. This can be reduced to O(log log |A|) time by evenly distributing vocabulary identifiers (e.g. in order of 64-bit hashes of words) then applying interpolation search4 (Perl et al., 1978). Shown in Algorithm 3, interpolation search formalizes the notion that one opens a dictionary near the end to find the word “zebra.” Interpolation search is a form of binary search with better estimates informed by the uniform key distribution. If the key distribution's range is also known (i.e. vocabulary identifiers range from zero to the number of words minus one), then interpolation search can use this information instead of reading A[0] and A[|A| − 1]; this optimization alone led to a 24% speed improvement. The improvement is due to the cost of bit-level reads and avoiding reads that may fall in different virtual memory pages. Virtual memory pages are typically 4096 bytes in size.

Vocabulary lookup is a sorted array of 64-bit word hashes. The index in this array is the vocabulary identifier. This has the effect of randomly permuting vocabulary identifiers, meeting the requirements of interpolation search when vocabulary identifiers are used as keys.

While sorted arrays could be used to implement the same data structure as with the probing model, effectively making m = 1, this implementation was abandoned because it was empirically slower and larger than a trie implementation. The trie implementation is based on the popular reverse trie, in which the last word of an n-gram is looked up first. Reverse tries are also implemented by SRILM, IRSTLM's inverted variant, and BerkeleyLM except for the scrolling variant. Figure 4.1 shows a reverse trie at a high level. Nodes in the trie are based on arrays sorted by vocabulary identifier.

When the language model arrives at some node w_i^n, it needs to know where to find the leftward extensions vw_i^n in order to continue looking up more context. SRILM stores 64-bit pointers, but this is costly in terms of memory. Instead, this work adopts a technique introduced by Clarkson and Rosenfeld (1997) and subsequently implemented by IRSTLM and BerkeleyLM's compressed option. Pointers are actually array indices. All n-grams of the same length are stored in a large array (Figure 4.3).

4Not to be confused with interpolating probabilities, this search interpolates vocabulary indices.

These arrays are sorted in suffix order: w_n is the primary sorting key, w_{n−1} is secondary, etc. Suffix order was also used to estimate language models (Figure 3.3). This sorting order ensures that, for any entry w_i^n, its leftward extensions {vw_i^n} are stored consecutively. The entry w_i^n stores the index at which {vw_i^n} begins. The entry that follows w_i^n indicates where the next block begins, also indicating where the block of {vw_i^n} ends.

[Figure 4.3 depicts three arrays side by side—Unigrams (Words, log p, log b, Pointer), Bigrams (Words, log p, log b, Pointer), and Trigrams (Words, log p)—populated with entries from the running example and terminated by dummy pointer records.]

Figure 4.3: A trigram language model stored in a trie. Pointers stored in the table of n-grams are indices into the array of (n + 1)-grams. The italic words are implicitly encoded by the pointers in the trie. Block length is determined by reading the following record. To support this operation for the last valid record, a dummy record with a pointer has been added.
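The pointer scheme can be sketched as follows. The record layout shown is a simplified assumption for illustration; the actual trie packs these fields at the bit level and omits backoff and the pointer at the highest order.

#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Simplified record for orders 2 <= n < N.
struct MiddleRecord {
  uint32_t word;        // vocabulary identifier of the word added at this order
  float log_prob;
  float log_backoff;
  uint64_t next_begin;  // index of the first leftward extension at order n+1
};

// Return the [begin, end) block of leftward extensions for the matched record.
// The dummy final record guarantees that matched + 1 is always a valid index.
std::pair<uint64_t, uint64_t> ChildRange(const std::vector<MiddleRecord> &order_n,
                                         std::size_t matched) {
  return {order_n[matched].next_begin, order_n[matched + 1].next_begin};
}

Within the returned block, the next context word is then located by interpolation search over the vocabulary identifiers, as described above.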

When SRILM (Stolcke, 2002) prunes a model, it sometimes removes n-grams but not (n + 1)-grams that extend it to the left. This is due to count thresholding after counts have been adjusted for Kneser-Ney smoothing (Kneser and Ney, 1995) as described in Section 3.3.2. In a model built with default settings, 1.2% of (n + 1)-grams were missing their n-gram suffix. Such omissions cause problems for reverse trie implementations, including SRILM itself, because there is no n-gram node that points to their records in the array of (n + 1)-grams. In this work, the issue is resolved by hallucinating the necessary n-grams. Hallucinated n-grams have probability calculated according to the normal backoff algorithm and themselves have backoff equal to one (the implicit default value for missing entries), so the probabilities reported by the model remain unchanged. However, the language model also reports the value of f when it uses p(w_n | w_f^{n−1}) in response to a query for p(w_n | w_1^{n−1}) as explained in the introduction (Section 4.1). Decoders may use f as a feature and the value of f will decrease when a hallucinated entry is used.

Optional Quantization

Floating point values may be stored in the trie exactly, using 31 bits for non-positive log probability and 32 bits for log backoff.5 To conserve memory at the expense of accuracy, values may be quantized using Q_p bits per probability and Q_b bits per backoff.6 This work allows any number of bits from 2 to 25, unlike IRSTLM (8 bits). Quantized values are selected with the binning method (Federico and Bertoldi, 2006) that sorts values, divides into equally sized bins, and averages within each bin. The cost of storing these averages, in bits, is

32(N − 1)2^{Q_p} + 32(N − 2)2^{Q_b}

5Section 3.4 proved that backoff is at most one in unpruned language models with interpolated modified Kneser-Ney smoothing. In that case, log backoff is always non-positive, so the sign bit could similarly be omitted. However, the intent is to support general smoothing algorithms and pruning.
6Two values are allocated to zero backoff to support right state minimization in Section 5.3.2. That leaves 2^{Q_p} probabilities and 2^{Q_b} − 2 non-zero backoffs.
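A minimal sketch of the binning method described above, shown only to illustrate the idea (the function name and types are assumptions, not the toolkit's code): sort the values, split them into 2^bits equally sized bins, and store the mean of each bin as its codebook entry.

#include <algorithm>
#include <cstddef>
#include <vector>

std::vector<float> TrainBins(std::vector<float> values, unsigned bits) {
  std::sort(values.begin(), values.end());
  std::size_t bins = std::size_t(1) << bits;
  std::vector<float> centers(bins, 0.0f);
  for (std::size_t b = 0; b < bins; ++b) {
    std::size_t begin = b * values.size() / bins;      // equally sized bins
    std::size_t end = (b + 1) * values.size() / bins;
    double sum = 0.0;
    for (std::size_t i = begin; i < end; ++i) sum += values[i];
    centers[b] = end > begin ? float(sum / (end - begin)) : 0.0f;
  }
  return centers;  // encoding stores, per value, the index of the nearest center
}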

Chapter 6 shows how to add another floating-point value to improve search accuracy or collapse probability and backoff into a single value, decreasing search accuracy. These floating-point values can be quantized in the same way.

Optional Offset Compression

Pointers stored by the trie are in increasing order, comprising a sorted array. Sorted arrays can be compressed by removing the first bit and storing the offset at which the first bit changed to one. More bits can be chopped in this way, using an array of 2^R offsets to store where the leading R bits roll over. For example, if R = 4, then pointers are stored without their 4 most significant bits. An array of size 16 stores offsets where those bits roll over. Thus, the 7th entry in the array contains the offset where the leading bits start to be 0111, which is 7 in binary. This is a simplified version of the compression technique introduced by Raj and Whittaker (2003). Use of this compression is optional. When enabled, the language model chooses R_n separately for each order n so as to minimize the total size of the offset and trie tables. Lookup time is linear in R_n due to binary search through the array of 2^{R_n} offsets, so the user can set a lower value if desired.
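A sketch of decoding under this scheme, with assumed structure and field names (the real implementation stores the low bits in a bit-packed array rather than a vector):

#include <algorithm>
#include <cstdint>
#include <vector>

struct CompressedPointers {
  unsigned total_bits;              // bits in an uncompressed pointer
  unsigned dropped_bits;            // R
  std::vector<uint64_t> offsets;    // size 2^R; offsets[r] = first index whose top bits equal r (offsets[0] = 0)
  std::vector<uint64_t> low_bits;   // pointer values without their top R bits
};

uint64_t Decode(const CompressedPointers &c, uint64_t index) {
  // Binary search for the prefix whose block contains this index.
  uint64_t prefix = std::upper_bound(c.offsets.begin(), c.offsets.end(), index)
                    - c.offsets.begin() - 1;
  return (prefix << (c.total_bits - c.dropped_bits)) | c.low_bits[index];
}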

Total Storage Cost

Unigram records store probability, backoff, and an index in the bigram table. Entries for 2 ≤ n < N store a vocabulary identifier, probability, backoff, and an index into the (n + 1)-gram table. The highest-order n-gram array omits backoff and the index, since these are not applicable. Values in the trie are minimally sized at the bit level, improving memory consumption over trie implementations in SRILM, IRSTLM, and BerkeleyLM. Given n-gram counts {c_n}_{n=1}^N, each vocabulary identifier costs ⌈log_2 c_1⌉ bits while indices into the table of n-grams cost ⌈log_2 c_n⌉ bits each, minus any offset compression. Because there are comparatively few unigrams, they are stored byte-aligned and unquantized, making every query faster. Unigrams also have 64-bit overhead for vocabulary lookup. Using c_n to denote the number of n-grams, total memory consumption of the trie, in bits, is

(32 + 32 + 64 + 64)c_1 + \sum_{n=2}^{N-1} (⌈log_2 c_1⌉ + Q_p + Q_b + ⌈log_2 c_{n+1}⌉ − R_n) c_n + (⌈log_2 c_1⌉ + Q_p) c_N

plus quantization and offset tables, if used. The size of the trie is particularly sensitive to ⌈log_2 c_1⌉, so vocabulary filtering is quite effective at reducing model size.
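A small helper evaluating the formula above (quantization and offset tables excluded) can make the cost tangible; it is a sketch assuming N ≥ 2 and counts[n − 1] holding the number of n-grams.

#include <cmath>
#include <cstdint>
#include <vector>

uint64_t Bits(uint64_t values) {            // ⌈log2 values⌉, assuming values >= 1
  return uint64_t(std::ceil(std::log2(double(values))));
}

uint64_t TrieBits(const std::vector<uint64_t> &counts,
                  unsigned qp, unsigned qb, const std::vector<unsigned> &r) {
  unsigned N = counts.size();
  uint64_t total = (32 + 32 + 64 + 64) * counts[0];
  for (unsigned n = 2; n <= N - 1; ++n)
    total += (Bits(counts[0]) + qp + qb + Bits(counts[n]) - r[n - 1]) * counts[n - 1];
  total += (Bits(counts[0]) + qp) * counts[N - 1];
  return total;
}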

4.3 Related Work

One straightforward way to fit a language model in memory is to store fewer n-grams. Reducing N (the order of the model) and pruning n-grams with low counts or probabilities (Stolcke, 2002; Moore and Quirk, 2009; Chelba et al., 2010) are two common, but lossy, methods. Careful data selection (Moore and Lewis, 2010) removes low-quality training data and actually leads to smaller and better models. Filtering (Stolcke, 2002; Heafield and Lavie, 2010) removes n-grams that a decoder will not query because no sentence in the corpus can produce all words in the n-gram. Sentence-filtered language models incur substantially more disk operations and are rarely used. For a competitive translation system, filtering to the evaluation's test set yielded a 36% reduction in the number of n-grams. Finally, some work (Federmann, 2007; Talbot and Osborne, 2007) stores language models on a network, splitting the memory requirement over multiple machines but incurring the cost of network latency.

Once the language model has been optionally pruned, filtered, and distributed, it still remains to efficiently represent the model in memory.

SRILM (Stolcke, 2002) is widely used within academia. It is generally considered to be fast (Pauls and Klein, 2011), with a default implementation based on hash tables within each trie node. Each trie node is individually allocated and full 64-bit pointers are used to find them, wasting memory. The compact variant uses sorted arrays instead of hash tables within each node, saving some memory, but still stores full 64-bit pointers. With some minor API changes, namely returning the length of the n-gram matched, it could also be faster—though this would be at the expense of an optimization explained in Section 5.3.2. The probing model was designed to improve upon SRILM by using linear probing hash tables (though not arranged in a trie), allocating memory all at once (eliminating the need for full pointers), and being easy to compile.

IRSTLM (Federico et al., 2008) is an open-source toolkit for building and querying language models. The developers aimed to reduce memory consumption at the expense of time. Their default variant implements a forward trie, in which words are looked up in their natural left-to-right order. However, their inverted variant implements a reverse trie using less CPU and the same amount of memory.7 Each trie node contains a sorted array of entries and they use binary search. Compared with SRILM, IRSTLM adds several features: lower memory consumption, a binary file format with memory mapping, caching to increase speed, and quantization. The trie implementation is designed to improve upon IRSTLM using a reverse trie with improved search, bit level packing, and stateful queries. IRSTLM's quantized variant is the inspiration for quantization in this work.

Lossily compressed models RandLM (Talbot and Osborne, 2007) and ShefLM (Guthrie and Hepple, 2010) offer better memory consumption at the expense of CPU and accuracy. These enable much larger models in memory, compensating for lost accuracy. Typical data structures are generalized Bloom filters that guarantee a customizable probability of returning the correct answer. Minimal perfect hashing is used to find the index at which a quantized probability and possibly backoff are stored. These models generally outperform KenLM's memory consumption but are much slower, even when cached.

BerkeleyLM (Pauls and Klein, 2011) is included for completeness despite its release after the release of this work. Like SRILM, it implements tries using hash tables and sorted arrays. It also offers the compressed variant that uses standard block compression (gzip), which is rather slow. It does contribute an optional direct-mapped cache designed to quickly remember probabilities of repeatedly queried n-grams. This work does not use caching and is still faster on cache-friendly queries.

4.4 Threading and Memory Mapping

Support for threading is trivial because the data structures are read-only and uncached. Only IRSTLM does not support threading. Memory mapping also allows multiple processes on the same machine to share the language model.

Memory mapping is an efficient operating system mechanism where a file and part of RAM have the same content. Along with IRSTLM and TPT, the binary format is memory mapped. This is especially effective at reducing load time, since raw bytes are read directly to memory—or, as happens with repeatedly used models, are already in the disk cache. On Linux, memory mapping can be configured to aggressively load the file from disk (with the flag MAP_POPULATE) or lazily load sections from disk as they are accessed. Lazy memory mapping has the advantage that it uses less physical memory (because some pieces of the file are never accessed), but is generally slow because queries against unloaded portions must wait for the disk. This is especially bad with the probing data structure because it is based on hashing and performs random lookups.

7Forward tries are faster to build with IRSTLM and can efficiently return a list of rightward extensions. This functionality can be useful in speech recognition applications, but is not used by the decoding algorithms under consideration.

However, the probing data structure is not intended to be used in low-memory scenarios. The trie data structure uses less memory and has better locality. It partitions storage by n-gram length, so walking the trie reads N disjoint pieces of the file. TPT has theoretically better locality because it stores n-grams near their suffixes, thereby placing reads for a single query in the same or adjacent areas of the file.

Aggressive loading is the default in this work, though lazy loading is an option. Experiments do not cover models larger than physical memory because TPT is unreleased, factors such as disk speed are hard to replicate, and in such situations the recommendation is to switch to a more compact representation, such as RandLM or ShefLM. In all of the experiments, the binary file (whether mapped or, in the case of most other packages, interpreted) is loaded into the disk cache in advance, thereby putting the disk cache in a consistent state. This also ensures that even lazy memory mapping will not wait for the disk.
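A Linux-specific sketch of this kind of loading (not the toolkit's actual code; the function name and error handling are assumptions): map the binary file read-only and, when aggressive loading is requested, pass MAP_POPULATE so the kernel prefaults the pages.

#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Returns a read-only mapping of the file, or nullptr on failure.
void *MapModel(const char *path, bool prefault, std::size_t *size_out) {
  int fd = open(path, O_RDONLY);
  if (fd < 0) return nullptr;
  struct stat st;
  if (fstat(fd, &st) != 0) { close(fd); return nullptr; }
  int flags = MAP_SHARED;
  if (prefault) flags |= MAP_POPULATE;  // aggressively read the file from disk
  void *base = mmap(nullptr, std::size_t(st.st_size), PROT_READ, flags, fd, 0);
  close(fd);  // the mapping remains valid after the descriptor is closed
  if (base == MAP_FAILED) return nullptr;
  *size_out = std::size_t(st.st_size);
  return base;
}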

4.5 Experiments

This section measures performance on shared tasks in order of increasing complexity: sparse lookups, evaluating perplexity of a large file, and translation with Moses (Koehn et al., 2007). The test machine has two Intel Xeon E5410 processors totaling eight cores, 32 GB RAM, and four Seagate Barracuda disks in software RAID 0 running Linux 2.6.18. Experimental results also incorporate two optimizations described in Chapter 5, right state minimization (Section 5.3.3) and backoff representation (Section 5.4), because these were implemented at the time but logically belong with left state optimization.

4.5.1 Sparse Lookup

Figure 4.4 compares the custom linear probing hash table with chaining implementations hash_set provided by GCC and unordered_set from Boost (James, 2005). It also compares interpolation search with ordinary binary_search and the standard C++ set based on red-black trees. Each data structure was populated with 64-bit integers sampled uniformly without replacement. They were then queried 20 million times: 10 million times with uniformly sampled valid keys and 10 million times with uniformly sampled invalid keys, both with replacement. The same entries and queries were used for each data structure. Time includes all queries but excludes random number generation and data structure population.

Performance dips as each data structure outgrows the processor's 12 MB second-level cache. The number of n-grams ranges from a few million to hundreds of billions, so when n-grams are stored in a hash table (as is the case with the probing data structure), the most relevant values are on the right side of the graph, where linear probing and interpolation search win.

Among hash tables, indicated by shapes, linear probing is initially slower but converges to 43% faster than both unordered_set and hash_set. The linear probing hash table is faster on large data because it does only one random memory access per query while chaining operations generally perform two random memory accesses. It also uses less memory, with 8 bytes of overhead per entry (16-byte entries with m = 1.5); chaining implementations hash_set and unordered_set require at least 8 bytes of overhead per entry to store pointers.

Among sorted lookup techniques, interpolation search is initially the slowest because it has a more expensive pivot function. With larger tables (above 4096 entries), interpolation search is faster because it converges in fewer iterations. This suggests a strategy: run interpolation search until the range narrows to 4096 or fewer entries, then switch to binary_search. However, reads in the trie data structure are more expensive due to bit-level packing, so it is faster to use interpolation search the entire time. Memory usage of interpolation search is the same as with binary_search and lower than with set.


Figure 4.4: Speed in lookups per microsecond by data structure and number of 64-bit entries.

4.5.2 Perplexity

The perplexity and translation tasks use a 5-gram English language model estimated on 834 million tokens from Europarl version 6 (Koehn, 2005) and the 2011 Workshop on Machine Translation News Crawl corpus with duplicate lines removed (Callison-Burch et al., 2011). SRILM (Stolcke, 2002) was used to estimate the model8 with interpolated modified Kneser-Ney smoothing (Kneser and Ney, 1995; Chen and Goodman, 1998) and default pruning settings that remove singletons of order 3 and higher. Unlike Germann et al. (2009), the model was chosen to be small enough that all benchmarks would fit comfortably in physical memory. Benchmarks use the package's binary format; this work is also the fastest at building its binary format from the standard ARPA format. As noted in Section 4.4, disk cache state is controlled by reading the entire binary file before each test begins.

Time and memory consumption of each data structure is evaluated by computing perplexity on 4 billion tokens9 from the Fourth English Gigaword corpus (Parker et al., 2009). Time is the sum of user and system CPU time consumed by the process minus time spent loading, converting strings to vocabulary identifiers, reading the input, or cleaning up. Time was then averaged over 4 billion queries. Resident memory usage was collected immediately after loading while peak virtual memory usage was collected immediately before termination. Table 4.2 shows results of the benchmark. Compared to decoding, this task is cache-unfriendly in that repeated queries happen only as they naturally occur in text. Therefore, performance is more closely tied to the underlying data structure than to the cache.

8This work was completed before Chapter 3.
9This figure includes the end of sentence marker because it is queried after each sentence. It does not include the begin of sentence marker because it is only used as context.

IRSTLM's cache made it slightly slower on this task, so results in Table 4.2 show IRSTLM without caching. The cache has a size parameter. Moses (Koehn et al., 2007) sets this parameter to 50, which resulted in a cache size of 2.82 GB. This is much larger than the cache size seen during translation (Section 4.5.3), perhaps because many of the queries were unique.

As suggested in the RandLM documentation, values were quantized to 8 bits and the false-positive probability was set to 1/256.

Section 3.6.3 noted that BerkeleyLM requires one to guess how much memory it will use. The same protocol is followed here: binary search to find the lowest memory setting with which it would run, followed by empirical measurement of actual memory usage. Curiously, BerkeleyLM's memory usage peaked during loading and fell once loading completed; Pauls (2011) explained that loading entails additional data structures that are freed before querying.

The results in Table 4.2 show that the probing data structure is 81% faster than the trie, which is in turn 31% faster than the fastest baseline. Memory usage in the probing data structure is high, though SRILM is even larger, so where memory is of concern, the trie data structure should be used, possibly with the quantization and array compression options. Chapter 6 contributes another form of compression by collapsing probability and backoff. For even larger models, RandLM and ShefLM provide lossy compression options.

Package     Variant           Queries/ms   RAM (GB)
This Work   Probing           1818         5.28
            Trie              1139         2.72
            Trie 8 bits^a     1127         1.59
            Array             436          2.33
            Array 8 bits^a    461          1.19
SRI         Default           750          9.19
            Compact           238          7.27
IRST        Invert            426          2.91
            Default           368          2.91
            Invert 8 bits^a   402          1.80
MIT         Default           410          7.72+1.34^b
Rand        Backoff 8 bits^a  56           1.30+2.82^b
Berkeley    Hash+Scroll^a     913          5.28+2.32^b
            Hash^a            767          3.71+1.72^b
            Compressed^a      126          1.73+0.71^b
Estimates based on proportions
Shef        C-MPHR^a          607^c
TPT         Default           357^d

a Uses lossy compression.
b The first value reports resident memory use immediately after loading while the total is peak virtual memory usage. The difference is due to batch processing (MIT), caching (Rand), or extra data structures for loading (Berkeley).
c Based on the ratio to SRI's speed reported in Guthrie and Hepple (2010) under different conditions.
d Based on a comparison with SRI reported by Germann (2011) under different conditions.

Table 4.2: Single-threaded speed and memory usage to compute perplexity on 4 billion words. In most cases, resident memory after loading was close to peak virtual memory usage, so only one value is noted for these systems. Array refers to the trie data structure with array compression.

4.5.3 Translation

This task measures impact on performance of the Moses (Koehn et al., 2007) decoder. Only toolkits supported by Moses are included.10 Table 4.3 shows both single-threaded results, mostly for comparison to IRSTLM, and multi-threaded results.

The phrase-based translation system was built according to the baseline instructions for the French–English track of the 2011 Workshop on Machine Translation (Callison-Burch et al., 2011). The language model is the same used in the previous section. The phrase table was extracted from Europarl (Koehn, 2005). The test set has 3003 sentences. Feature weights were originally tuned for exact representations using minimum error rate training (Och, 2003). Because some representations are lossy, feature weights were also retuned. The retuned systems generally used slightly fewer resources.

10ShefLM is excluded because its Moses integration was never optimized. Chow (2012) reported that it took twice as long as with RandLM.

Unlike the previous section, the entire cost of the process is reported, including load time. However, the decoder was modified to terminate immediately instead of cleaning up, because SRILM's compact format would otherwise take hours to clean up. Binary language models and the text phrase table were forced into the disk cache before each run. These conditions make the values appropriate for estimating repeated run times, such as in parameter tuning.

Resident and peak virtual memory usage were collected immediately before termination. With memory mapping, resident usage reports only the parts of the language model that were used by the process (in atomic 4096-byte pages). Virtual memory usage includes the entire language model, even if parts were not loaded. The gap between resident and virtual memory is also due, in part, to Moses allocating memory that it did not completely use or allocating memory then freeing it later.

Overall, language modeling significantly impacts decoder performance. In line with performance on the perplexity task (Table 4.2), the probing model is the fastest followed by the trie, and subsequently other packages. The trie model continues to use the least memory of the non-lossy options. For RandLM and IRSTLM, the effect of caching can be seen on speed and memory usage. This is most severe with RandLM in the multi-threaded case, where each thread keeps a separate cache, exceeding the original model size. As noted for the perplexity task, the cache is not expected to grow substantially with model size, so RandLM remains a low-memory option. Caching for IRSTLM is smaller at 0.09 GB resident memory, though it supports only a single thread. The BerkeleyLM direct-mapped cache is in principle faster than caches implemented by RandLM and by IRSTLM.

One Thread
                                 Time (m)        RAM (GB)
Package     Variant              CPU     Wall    Res    Virt
This Work   Probing-L            72.3    72.4    7.83   7.92
            Probing-P            73.6    74.7    7.83   7.92
            Trie-L               80.4    80.6    4.95   5.24
            Trie-P               80.1    80.1    4.95   5.24
            Trie-L 8^a           79.5    79.5    3.97   4.10
            Trie-P 8^a           79.9    79.9    3.97   4.10
SRI         Default              85.9    86.1    11.90  11.94
            Compact              155.5   155.7   9.98   10.02
IRST        Cache-Invert-L       106.4   106.5   5.36   5.84
            Cache-Invert-R       106.7   106.9   5.73   5.84
            Invert-L             117.2   117.3   5.27   5.67
            Invert-R             117.7   118.0   5.64   5.67
            Default-L            126.3   126.4   5.26   5.67
            Default-R            127.1   127.3   5.64   5.67
Rand        Backoff^a            277.9   278.0   4.05   4.18
            Backoff^b            247.6   247.8   4.06   4.18

Eight Threads
                                 Time (m)        RAM (GB)
Package     Variant              CPU     Wall    Res    Virt
This Work   Probing-L            130.4   20.2    7.91   8.53
            Probing-P            132.6   21.7    7.91   8.41
            Trie-L               132.1   20.6    5.03   5.85
            Trie-P               132.2   20.5    5.02   5.84
            Trie-L 8^a           137.1   21.2    4.03   4.60
            Trie-P 8^a           134.6   20.8    4.03   4.72
SRI         Default              153.2   26.0    11.97  12.56
            Compact              243.3   36.9    10.05  10.55
Rand        Backoff^a            346.8   49.4    5.41   6.78
            Backoff^b            308.7   44.4    5.26   6.81

a Lossy compression with the same weights.
b Lossy compression with retuned weights.

Table 4.3: Cost to translate 3003 sentences with Moses. Where applicable, models were loaded with lazy memory mapping (-L), prefaulting (-P), and normal reading (-R). IRST is not threadsafe. Loading is single-threaded and some operations involve locking.

4.5.4 Comparison with RandLM

In addition to storing backoff-smoothed models (Katz, 1987), RandLM supports stupid backoff (Brants et al., 2007). With stupid backoff, only counts are stored, saving memory over probability and backoff (though Chapter 5 shows how to compress backoff-smoothed models). RandLM was used to build an unpruned stupid backoff model on the same data as in Section 4.5.2. Because the previous model was pruned, IRSTLM was used to estimate an unpruned approximate Kneser-Ney language model (Section 3.2.4) on the same data in three pieces. Table 4.4 shows the results. Moses was run single-threaded to minimize the impact of RandLM's cache on memory use.

                                        RAM (GB)
Package     Variant          Time (m)   Res     Virt    BLEU
This Work   Trie             82.9       12.16   14.39   27.24
            Trie 8 bits      82.7       8.41    9.41    27.22
            Trie 4 bits      83.2       7.74    8.55    27.09
Rand        Stupid 8 bits    218.7      5.07    5.18    25.54
            Backoff 8 bits   337.4      7.17    7.28    25.45

Table 4.4: CPU time, memory usage, and uncased BLEU (Papineni et al., 2002) score for single-threaded Moses translating the same test set. Each lossy model was run twice: once with specially-tuned weights and once with weights tuned using an exact model. The difference in BLEU was minor and the better result is reported.

RandLM is the clear winner in RAM utilization, but is also slower and lower quality. However, the point of RandLM is to scale to even larger data, compensating for this loss in quality.

4.6 Summary

Efficient querying is crucial to almost all applications of language modeling. This chapter described two data structures for language modeling that achieve substantial reductions in time and memory cost. The probing data structure is 2.4 times as fast as SRILM and uses less memory too. The trie data structure uses less memory than the smallest lossless alternative and is still faster than all baseline toolkits. These performance gains transfer to improved system runtime performance; while experiments focused on phrase-based translation with Moses (Koehn et al., 2007), the implementation, dubbed KenLM, has been widely applied in machine translation (Dyer et al., 2010; Li et al., 2009; Vilar et al., 2010) and other areas (Kim et al., 2012; Si et al., 2013). Several optimizations led to these results: hashing, custom lookup tables, and bit-level packing. Contributions in this chapter make subsequent work in Chapters 5, 6, and 7 more efficient.

Chapter 5

Left and Right State1

It is a wise father that knows his own child. –William Shakespeare, The Merchant of Venice

The last thing one knows in constructing a work is what to put first. –Blaise Pascal

From the language model's perspective, perplexity computation and decoding boil down to string concatenation. When strings are concatenated, the language model probability changes due to cross-boundary n-grams. State formalizes what the language model needs to know about each string in order to compute this change in probability. Decoders can recombine hypotheses if their states are equal. State also carries information from one query to the next, reducing the cost of each query. Chapter 6 changes how probabilities are estimated for words in state. Chapter 7 uses state as the basis for a new search algorithm.

5.1 Introduction

Language model probability changes when a decoder concatenates e.g. “San” and “Francisco” to form “San Francisco”.2 Concatenation can increase or decrease language model probability; in this example it increases.

p(San Francisco) = p(San) p(Francisco | San) ≫ p(San) p(Francisco)

Section 2.4.2 defined the correction factor CORRECT that measures the effect.3

CORRECT(San • Francisco) = p(San Francisco) / (p(San) p(Francisco)) ≫ 1

This chapter is about defining a state function s that distills and encodes everything the language model needs to know about a string in order to compute CORRECT. State is designed so that it can be used in lieu of a string.

CORRECT(s(San) • s(Francisco)) = ( p(San Francisco) / (p(San) p(Francisco)), s(San Francisco) )

Because state carries additional information, such as backoff weights, computing the correction factor is faster when the decoder passes state instead of strings (Section 5.4). Moreover, because CORRECT is a function, hypotheses with equal state can be recombined by a decoder.

1Earlier versions of this work appear in Heafield (2011) and Heafield et al. (2011). 2A translation system would likely handle “San Francisco” as a phrasal unit. The language model is also responsible for scoring words in phrases. 3This is also known as mutual information, though the term correction factor clarifies that its role is solely to assist in computing language model probability.

With an N-gram language model, CORRECT examines at most the N–1 rightmost words of the first string (its right state) and at most the N–1 leftmost words of the second string (its left state). In a 4-gram language model,

CORRECT( the ` boy saw the • man with a a telescope . ) = CORRECT( ` boy saw the • man with a a )

(In both expressions, “boy saw the” is the right state of the first string and “man with a” is the left state of the second.)

The turnstiles a and ` indicate that left and right state are complete, respectively. This distinction matters when a string is too short to have a complete state. For example,

CORRECT( £ in San • Francisco ¡ ) ≠ CORRECT( £ San • Francisco ¡ )

because the probability of “Francisco” depends on “in”. Short state is denoted with symbols ¡ for left state and £ for right state. In general strings can be concatenated on either side, so they have both left and right state. Following Li and Khudanpur (2008), left and right state are written as a delimited pair i.e. (San ¡  £ San). Table 5.1 summarizes the notation introduced in this chapter. A longer example is shown in Figure 5.1.

Notation                            Definition
w_1^{|w|}                           A string.
ε                                   The empty string.
s(w_1^{|w|})                        The state of string w_1^{|w|}.
CORRECT(v_1^{|v|} • w_1^{|w|})      Change in probability p(v_1^{|v|} w_1^{|w|}) / (p(v_1^{|v|}) p(w_1^{|w|})) when strings v_1^{|v|} and w_1^{|w|} are concatenated.
                                    Delimits a pair of left state and right state.
                                    Stands for elided words.
a                                   Completes left state. Nothing beyond this marker matters when extending left.
`                                   Completes right state. Nothing beyond this marker matters when extending right.
(w_1^{left} a  ` w_{right}^{|w|})   The state of string w_1^{|w|} when it is long enough to have complete state.
¡                                   Denotes a short left state.
£                                   Denotes a short right state.
(w_1^{|w|} ¡  £ w_1^{|w|})          The state of string w_1^{|w|} when it is too short to have complete state.

Table 5.1: Notation for states.

countries that maintain diplomatic relations with North Korea .    (left state: “countries that maintain”; right state: “North Korea .”)

Figure 5.1: A string with state (countries that maintain a  ` North Korea .). Section 5.3 shows how to encode fewer words in some cases.

When strings have equal state, the language model will treat them exactly the same going forward. Decoders use this to determine when it is safe to recombine hypotheses. More recombination means each hypothesis reasons over more derivations at once. Decoders can then search more derivations (increasing accuracy) or pack derivations into fewer hypotheses (decreasing CPU time). Section 5.3 is about the first goal in this chapter: encode as few words as possible in state, increasing the chance that states will be equal and ultimately improving the time-accuracy trade-off presented by approximate search.

The second goal (Section 5.4) is to increase query speed by carrying useful information in state from one query to the next. Many tasks, such as perplexity computation, query the language model in a left-to-right

scrolling pattern p(w_N | w_1^{N−1}), p(w_{N+1} | w_2^N), p(w_{N+2} | w_3^{N+1}), etc. Each query can be made faster by carrying information, in this case backoff weights, from one query to the next. Another optimization applies when strings are never extended to their left. Some decoders build hypotheses from left to right i.e. phrase-based Moses (Koehn et al., 2007) or left-to-right syntactic decoders (Watanabe et al., 2006; Huang and Mi, 2010). In these cases, hypotheses have no need for left state because they will never be extended to their left. The right state optimizations in this chapter still apply.

5.2 The State Abstraction

This section describes how state is intended to be used by a decoder. The abstraction can be used instead of—or in addition to—direct queries. The benefit of this abstraction is that it hides optimizations presented in the later sections and enables Chapter 6. Generally, the decoder is expected to remember the state of each string it builds. This is not a requirement, but simply faster. For example, Moses (Koehn et al., 2007) does not store state with each target phrase because doing so would cost too much memory. Every time a string grows by appending a word or by concatenation, the language model provides the new state. Strings are built inductively from left to right. The base cases are s(ε) for empty state or s(<s>) for the beginning of sentence context. The inductive case is the APPEND function.

APPEND(s(San) • Francisco) = (p(Francisco | San), s(San Francisco))

This functionality is sufficient to build any string,4 though the CORRECT function described in Section 5.1 is useful for bottom-up decoding. No matter what sequence of CORRECT and APPEND operations are used to build a string, the probabilities multiply to form the string's probability. For example, Algorithm 4 shows how perplexity is computed by building an entire sentence from left to right.

total ← 1
state ← s(<s>)
for i ← 1 to |w| do
    (p(w_i | <s> w_1^{i−1}), state) ← APPEND(state, w_i)
    total ← total · p(w_i | <s> w_1^{i−1})
    // state = s(<s> w_1^i)
    // total = p(w_1^i | <s>)
end
return total

Algorithm 4: Computing p(w_1^{|w|} | <s>) using state.
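A sketch of the same driver loop in C++, kept deliberately generic: the Scorer callback and the explicit word-list state are assumptions for illustration (an actual implementation passes an opaque state object rather than words, as described in Section 5.4).

#include <functional>
#include <string>
#include <vector>

// Score callback: log p(word | context), e.g. backed by the query code of Section 4.1.
using Scorer = std::function<float(const std::vector<std::string> &context,
                                   const std::string &word)>;

// Algorithm 4 with an explicit (non-minimized) right state of at most N-1 words.
float SentenceLogProb(const Scorer &score, unsigned order,
                      const std::vector<std::string> &words) {
  std::vector<std::string> state = {"<s>"};   // s(<s>)
  float total = 0.0f;                         // log of the running product
  for (const std::string &w : words) {
    total += score(state, w);                 // log p(w_i | <s> w_1^{i-1})
    state.push_back(w);                       // state = s(<s> w_1^i)
    if (state.size() > order - 1)             // keep at most N-1 words of context
      state.erase(state.begin(), state.end() - (order - 1));
  }
  return total;                               // log p(w_1^{|w|} | <s>)
}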

Using the functions CORRECT and APPEND has several advantages for decoders over normal queries:

• State encodes as few words as possible (Section 5.3), increasing the chances that states will be equal (and therefore recombinable).

• Queries are faster because state carries information from one query to the next (Section 5.4).

• Contributions in Chapter 6 are hidden behind CORRECT and APPEND, so decoders also gain support for those optimizations.

There are also disadvantages: the need to modify the decoder to handle state and extra memory consumption.

4The end of sentence symbol simply acts as a normal word.

However, when implemented in Moses (Koehn et al., 2007), memory consumption actually decreased, albeit not substantially, because the preexisting representation was less efficient. The Moses (Koehn et al., 2007; Hoang et al., 2009), cdec (Dyer et al., 2010), Joshua (Li et al., 2009), and Jane (Vilar et al., 2010) decoders have all been modified by their developers to use the state abstraction. Consequently, all of them support the optimizations in this chapter and in Chapter 6.

5.3 Minimizing State

As explained in Section 5.1, decoders perform better when state encodes fewer words. This section is about determining what words can safely be omitted from state and doing so efficiently.

State minimization uses a principle put forward by Li and Khudanpur (2008): if it can be proven that the language model will always back off (Section 2.2) before examining a word, then this word can be elided from the state. Specifically, if left state v_1^m has the property that xv_1^m does not appear in the language model for all words x, then v_m may be elided from the state. Similarly, if right state w_1^n has the property that w_1^n x does not appear in the language model for all words x, then w_1 may be elided from the state. These statements apply recursively, so additional words such as v_{m−1} and w_2 may be similarly elided. Once all qualifying words have been elided, the state is said to be minimized. The criterion here is minimal in the sense that there exists some surrounding for which every surviving word matters.

Section 5.3.1 reviews existing work on state minimization. Section 5.3.2 shows how the criterion is computed efficiently. Finally, Section 5.3.3 shows that the criterion in the previous paragraph preserves correctness.

5.3.1 Related Work

Language model implementations vary in the extent to which they minimize state. As right state minimization was already commonly and efficiently implemented, it is the baseline for most experiments.

In addition to providing probability, IRSTLM (Federico et al., 2008) and BerkeleyLM (Pauls and Klein, 2011) indicate the length of the n-gram matched. This can be used as a state minimization heuristic. For example, if p(v_n | v_1^{n−1}) backed off to p(v_n | v_2^{n−1}), then v_n can be elided from left state. However, this heuristic misses some opportunities for state minimization. For example, “relations with North Korea” might appear in the language model while, due to pruning or filtering, “relations with North Korea x” does not appear in the language model for all words x. In this case, IRSTLM and BerkeleyLM will miss the opportunity to elide “relations” from right state.

SRILM (Stolcke, 2002) performs full right state minimization. However, probability and the length of right state are exposed by separate calls to the language model, each of which performs a full search through the data structure. There is no method to retrieve the length of the n-gram matched by a query, so the limited form of left state minimization possible with IRSTLM and BerkeleyLM is not possible with SRILM.

RandLM (Talbot and Osborne, 2007) supports full right state minimization. For left state, it is similar to IRSTLM and BerkeleyLM in that both indicate the n-gram length matched.

Prior to this work, Moses (Hoang et al., 2009) implemented right state minimization to the extent supported by toolkits. Left state was generally N–1 words, with the exception that hypotheses beginning with <s> had empty left state, a special case of state minimization.

Li and Khudanpur (2008) put forward the rules for state minimization used here. They fully minimized both left and right state, with the exception of strings shorter than N–1 words. This work goes further by also minimizing the state of strings shorter than N–1 words. For example, the unknown word usually has empty left state and empty right state.5

5Some language models replace low-count words with <unk>. In these models, it is possible to have n-grams containing <unk>. Normal state minimization rules apply in either case.

While Li and Khudanpur (2008) described an “inefficient implementation of the prefix- and suffix-lookup”, here the necessary information is stored with each n-gram entry, incurring minimal overhead as described in Section 5.3.2. Indeed, their state minimization work was originally performed to improve performance of the Joshua decoder (Li et al., 2009). Joshua has since adopted this work (Post et al., 2013).

5.3.2 Efficient Implementation

In order to minimize state, the language model needs to determine whether a given n-gram might extend to the left or right. This information is precomputed for every n-gram and stored with its entry.

Left state extension information is stored in a data-structure dependent way. In the byte-aligned probing data structure from Chapter 4, the sign bit of the log probability would normally always be negative. This bit is repurposed to indicate whether the n-gram extends left. With the trie data structure described in Chapter 4, the record w_1^{|w|} stores a pointer to the beginning of the block of xw_1^{|w|} for all words x. The record that follows w_1^{|w|} similarly has a pointer to the beginning of its leftward extensions. These pointers are equal if and only if there is no leftward extension.

Right extension information is stored as part of the backoff weight. If an n-gram does not extend to the right, then it generally has log backoff zero.6 Conveniently, there are two ways to represent zero as an IEEE floating-point value on a machine: +0.0 and −0.0. The value +0.0 indicates n-grams that do extend to the right while −0.0 indicates n-grams that do not extend to the right.7 When quantized (Whittaker and Raj, 2001), the model reserves two values for zero backoff to achieve the same effect.

As noted in Chapter 4, the language model retrieves p(w_{|w|} | w_1^{|w|−1}) by visiting w_{|w|}^{|w|}, w_{|w|−1}^{|w|}, w_{|w|−2}^{|w|}, etc. These are the same records whose values are needed to make a full determination of state minimization.
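The signed-zero trick can be illustrated with a few lines of C++; the helper names are assumptions, shown only to make the encoding concrete.

#include <cmath>

// Encode: an n-gram with log backoff 0 that never extends right stores -0.0.
inline float EncodeBackoff(float log_backoff, bool extends_right) {
  if (log_backoff == 0.0f && !extends_right) return -0.0f;
  return log_backoff;
}

// Decode: -0.0 (zero with the sign bit set) marks "does not extend right".
inline bool ExtendsRight(float stored) {
  return !(stored == 0.0f && std::signbit(stored));
}

inline float LogBackoff(float stored) {
  return stored == 0.0f ? 0.0f : stored;  // both zeros decode to backoff 1
}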

5.3.3 Correctness and Scoring

It remains to show that probability can be computed given only the information in state. The proof relies on the substring property of language models:

Property 1. If the n-gram w_1^{|w|} appears in a language model, then so do substrings w_i^j for all 1 ≤ i ≤ j ≤ |w|.

For unpruned models, the substring property follows naturally from estimation: when w_1^{|w|} occurs in the corpus, all of its substrings are also extracted. However, as mentioned in Section 4.2.2, models pruned by SRILM (Stolcke, 2002) violate this property. The issue is addressed by hallucinating missing n-grams until the substring property holds. The impact on size is minimal: a 1.2% increase in the number of n-grams.

In addition to the substring property, state minimization exploits the backoff procedure described in Section 4.1. When queried with p(w_{|w|} | w_1^{|w|−1}), the model finds the longest match w_f^{|w|} then evaluates

p(w_{|w|} | w_1^{|w|−1}) = p(w_{|w|} | w_f^{|w|−1}) \prod_{i=1}^{f−1} b(w_i^{|w|−1})     (5.1)

where conditional probability p and backoffs b were estimated when the model was built.

6This is true of models generated directly by Chapter 3, SRILM, or IRSTLM. Filtering (Heafield and Lavie, 2010) may have removed the rightward extension; in this case, the current scheme will miss a recombination opportunity. An alternative would be to fold the backoff into the probability since it will always be charged, a special case of the optimization described in Chapter 6, though this has other effects such as no longer encoding a probability distribution.
7The trie representation in Chapter 4 was designed to be general-purpose, so it still retains the sign bit for log backoff, even though this value is always negative for models estimated in Chapter 3.

Left State

Proposition 2. The CORRECT function does not depend on words elided from left state. It does, however, depend on whether a word was elided, hence the distinction between ¡ and a in Table 5.1.

|v| |w| n |w| Proof. Let v1 and w1 be strings, where w1 is the left state of w1 .

|v| |w| |v| |w| p(v w ) ORRECT(v • w ) = 1 1 C 1 1 |v| |w| p(v1 )p(w1 ) Q|v| i−1  Q|w| |v| i−1  i=1 p(vi | v1 ) i=1 p(wi | v1 w1 ) = |v| |w| p(v1 )p(w1 ) |v| Q|w| |v| i−1  p(v1 ) i=1 p(wi | v1 w1 ) = |v| |w| p(v1 )p(w1 ) Q|w| p(w | v|v|wi−1) = i=1 i 1 1 |w| p(w1 ) Q|w| p(w | v|v|wi−1) = i=1 i 1 1 Q|w| i−1 i=1 p(wi | w1 ) |w| |v| i−1 Y p(wi | v1 w1 ) = i−1 (5.2) i=1 p(wi | w1 )

n For i ≤ n, the terms depend only on v and left state w1 . If no word has been elided (n = |w|), then CORRECT can be computed immediately. Otherwise, it remains to show how to compute the terms of Equation 5.2 for i > n. n+1 By the left state minimization criterion, v|v|w1 does not appear in the model. It follows from the |v| i substring property that vj w1 does not appear for all j and for all i > n. Therefore, the backoff procedure evaluates |v| |v| i−1 i−1 Y |v| i−1 p(wi | v1 w1 ) = p(wi | w1 ) b(vj w1 ) ∀i > n j=1 so the terms of Equation 5.2 are given by

|v| i−1 |v| p(wi | v1 w1 ) Y |v| i−1 i−1 = b(vj w1 ) ∀i > n p(wi | w1 ) j=1

There are two cases: n+1 i > n + 1 The string v|v|w1 does not appear as noted previously. Because i > n + 1 and i is an integer, n+1 |v| i−1 |v| i−1 i − 1 ≥ n + 1. Thus, v|v|w1 is a substring of vj w1 and it follows that vj w1 does not appear by the |v| i−1 substring property. By the definition of backoff, b(vj w1 ) = 1 for all j and all i > n + 1. Therefore,

|v| i−1 p(wi | v1 w1 ) i−1 = 1 p(wi | w1 )

59 |v| i−1 |v| n i = n + 1 Substituting the value of i into b(vj w1 ) yields b(vj w1 ). These do not depend on wn+1.

|v| n |v| p(wn+1 | v w ) Y |v| 1 1 = b(v wn) p(w | wn) j 1 n+1 1 j=1 Taking the above into account, Equation 5.2 expands as

$$\text{CORRECT}(v_1^{|v|} \bullet w_1^{|w|}) = \left( \prod_{i=1}^{n} \frac{p(w_i \mid v_1^{|v|} w_1^{i-1})}{p(w_i \mid w_1^{i-1})} \right) \cdot \begin{cases} 1 & \text{if no word was elided} \\ \prod_{j=1}^{|v|} b(v_j^{|v|} w_1^n) & \text{otherwise} \end{cases} \qquad (5.3)$$

In order to evaluate Equation 5.3, it is necessary to distinguish cases where left state minimization applies from those where the string simply has too few words; these are denoted a and ¡, respectively, and stored as a bit in the implementation. The bit introduces some risk that states will not be equal. Such cases can be reduced by redefining the bit to indicate whether left state is complete (namely, growing the string to the right will never change its left state). If a string has length $N-1$, then left state is always complete and Equation 5.3 will not change because the backoff is always 1 for strings $v_j^{|v|} w_1^n$ longer than $N-1$ words. A similar situation occurs when right state minimization was able to elide a word. If right state minimization elided $w_1$ from $w_1^{|w|}$ then, by the right state criterion, $w_1^{|w|} x$ does not appear for any word $x$. By the substring property, $v w_1^{|w|} x$ does not appear for all words $v$ and $x$. Thus, it can be predicted that, were $x$ to be appended, the left state minimization criterion would elide $x$ from $w_1^{|w|} x$. The effect on Equation 5.3 is that backoff is charged before $x$ is known.

Right State

Proposition 3. The backoff procedure can be evaluated correctly after right state minimization.

Proof. Let $w_1^{|w|}$ be a string whose right state minimizes to $w_n^{|w|}$. Due to the minimization criterion, $w_{n-1}^{|w|} x_1$ does not appear in the model for any word $x_1$. Applying the substring property, $w_i^{|w|} x_1^{|x|}$ does not appear for any $i < n$ and any string $x_1^{|x|}$. Therefore, when queried for $p(x_j \mid w_i^{|w|} x_1^{j-1})$ for any $i$ and $j$, the backoff procedure will base probability on $p(x_j \mid w_f^{|w|} x_1^{j-1})$ where $f \ge n$ (or possibly even depend solely on $x$). Since $p$ is a function and $w_{n-1}$ is not an argument, the term $p(x_j \mid w_f^{|w|} x_1^{j-1})$ does not depend on $w_{n-1}$.

The backoff terms all have the form $b(w_i^{|w|} x_1^{j-1})$. The claim that $b(w_i^{|w|} x_1^{j-1}) = 1$ whenever $i < n$ is shown by cases.

$j > 1$: The minimization criterion requires that $w_{n-1}^{|w|} x_1$ does not appear, so $w_i^{|w|} x_1^{j-1}$ does not appear for $i < n$ by the substring property. Thus $b(w_i^{|w|} x_1^{j-1}) = 1$ for $i < n$ by definition.

$j = 1$: Right state minimization also requires that $b(w_{n-1}^{|w|}) = 1$ as noted in Section 5.3.2. Since the state minimization criterion is tested for each word before proceeding to the next, it must be the case that $b(w_i^{|w|}) = 1$ for all $i < n$.

Combining all these cases, the terms in Equation 5.1 either do not depend on wn−1 or are known to be 1.

5.4 State Encoding

State could be encoded simply as an array of words along with the bit that flags short hypotheses. This section shows how to encode state in a way that makes queries faster. Left state and right state are encoded independently.

5.4.1 Related Work

When used with other language model toolkits, Moses (Koehn et al., 2007) encodes right state as an array of vocabulary identifiers and a pointer to the record found by the backoff procedure. The pointer is not used to speed language model queries, but rather as a unique identifier of an n-gram. It is faster to compare this single integer than it is to compare an array of integers, so it is used as a sorting key.

SRILM (Stolcke, 2002) separately allocates every node in its trie data structure. The relative ordering of data structure pointers is therefore inconsistent across runs. This leads to slightly non-deterministic output when pointers are used to break ties. While this work also stores pointers in state, they are actually offsets into a large array, making their comparison deterministic.

After this work, BerkeleyLM (Pauls and Klein, 2011) encoded right state using a pointer into their trie. Unlike SRILM, their pointer can be used to continue left-to-right queries without knowing the corresponding vocabulary indices. Their state is therefore smaller in terms of memory usage. However, making use of this pointer requires that the trie have forward and backward pointers, increasing data structure memory consumption by 40%. While this work uses a larger state encoding, there is no need to store backward pointers in the trie. Because there are generally fewer hypotheses than n-grams, the memory cost of state is much lower in this work. BerkeleyLM uses states to optimistically search for longer n-gram matches first and must perform twice as many random accesses to retrieve backoff information. The probing model described in Section 4.2 can perform optimistic searches by jumping to any n-gram without needing state and without any additional memory. However, this optimistic search would not visit the entries necessary for the optimization in Section 5.4.3. Though experiments do not directly compare state implementations, experiments in Section 4.5 indicate that the overall approach in this work is faster.

5.4.2 Left State

Normally, every language model query entails walking through a data structure to find the longest matching entry. Equation 5.3 contains several such queries.

$$\prod_{i=1}^{n} \frac{p(w_i \mid v_1^{|v|} w_1^{i-1})}{p(w_i \mid w_1^{i-1})}$$
The goal is to efficiently compute each probability in this equation.

As described in Chapter 7, the query $p(w_i \mid v_1^{|v|} w_1^{i-1})$ is normally executed by visiting nodes in increasing order: $w_i$, $w_{i-1}^{i}$, ..., $w_1^{i}$, $v_{|v|} w_1^{i}$, ..., $v_1^{|v|} w_1^{i}$. Many of these nodes were already visited when the query $p(w_i \mid w_1^{i-1})$ was executed earlier as part of scoring $w_1^{|w|}$. To avoid repeated visits to the same nodes, left state stores pointer($w_1^i$), a pointer to entry $w_1^i$ in the data structure, for each $i$. Remembering pointer($w_1^i$) enables the language model to resume lookup at the node $w_1^i$ and immediately retrieve $p(w_i \mid w_1^{i-1})$. From there, it can continue following the data structure to retrieve $p(w_i \mid v_1^{|v|} w_1^{i-1})$. Thus, the term

$$\frac{p(w_i \mid v_1^{|v|} w_1^{i-1})}{p(w_i \mid w_1^{i-1})}$$
is computed with far fewer lookups. An example is shown in Figure 5.2.

Left state $w_1^n$ is encoded as
$$\left( \left\{ \text{pointer}(w_1^i) \right\}_{i=1}^{n},\ \text{bit} \right)$$
where the bit distinguishes when state is short because the string has too few words. Only pointer($w_1^n$) is necessary because the other terms pointer($w_1$), pointer($w_1^2$), ..., pointer($w_1^{n-1}$) can be derived from it.


Figure 5.2: Extending left state "is one of" to incorporate context "Australia" in a trie. By following the data structure pointer pointer(is one), lookup jumps to the node for "is one". The language model reads denominator p(one | is) then continues to "Australia is one" where it reads the numerator p(one | Australia is).

However, it is very inefficient to retrieve the other pointers, so they too are stored in state. For purposes of determining state equality, it suffices to compare pointer($w_1^n$).
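A minimal sketch of this left state encoding, using hypothetical names rather than the actual implementation:

```cpp
#include <cstdint>
#include <vector>

struct LeftStateSketch {
  // pointer(w_1^i) for i = 1..n, stored as offsets into the trie array so
  // that comparisons are deterministic across runs.
  std::vector<uint64_t> pointers;
  // The bit described above: true when the state is short only because the
  // string has too few words.
  bool short_because_few_words;
};

// Equality only needs the last pointer (plus the bit and length); the earlier
// pointers are kept purely to speed up later lookups.
inline bool SameLeftState(const LeftStateSketch &a, const LeftStateSketch &b) {
  return a.short_because_few_words == b.short_because_few_words &&
         a.pointers.size() == b.pointers.size() &&
         (a.pointers.empty() || a.pointers.back() == b.pointers.back());
}
```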

5.4.3 Right State

Normal query execution consists of retrieving the probability and any backoff penalties

$$p(x \mid w_{n+1}^{|w|}) = p(x \mid w_f^{|w|}) \prod_{i=n+1}^{f-1} b(w_i^{|w|})$$

where $w_f^{|w|} x$ is the longest matching entry. The right state $w_{n+1}^{|w|}$ is encoded so that backoffs are immediately available instead of being retrieved from the data structure. Specifically, right state $w_{n+1}^{|w|}$ is encoded as
$$\left( w_{n+1}^{|w|},\ \left\{ b(w_i^{|w|}) \right\}_{i=n+1}^{|w|} \right)$$
in anticipation that any of these backoff terms may be used.

When the language model was queried for $p(w_{|w|} \mid w_{n+1}^{|w|-1})$, it visited the nodes $w_{|w|}$, $w_{|w|-1}^{|w|}$, ..., $w_{n+1}^{|w|}$ in order to retrieve $p(w_{|w|} \mid w_{n+1}^{|w|-1})$. While visiting these nodes, the values $b(w_{|w|})$, $b(w_{|w|-1}^{|w|})$, ..., $b(w_{n+1}^{|w|})$ were recorded to right state. The state carries this information to the next query $p(x \mid w_{n+1}^{|w|})$.

At first it seems that the two state encodings are incompatible. Specifically, left state enables a shortcut that skips nodes while right state uses those visits to retrieve backoff values. The issue arises when $v_1^{|v|}$ is concatenated with $w_1^{|w|}$ and there is a need to encode the right state of $v_1^{|v|} w_1^{|w|}$. The pointer in left state enables a shortcut to $w_1^m$ where $m$ was determined by left state minimization. Lookups continue with $v_{|v|} w_1^m$, $v_{|v|-1}^{|v|} w_1^m$, etc. These lookups retrieve $b(v_{|v|} w_1^m)$, $b(v_{|v|-1}^{|v|} w_1^m)$, etc. The remaining backoffs $\{b(w_i^m)\}_{i=1}^{m}$ were already encoded in the right state. If right state minimization elided a word from $w_1^{|w|}$, then the right state of $v_1^{|v|} w_1^{|w|}$ is simply copied.
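A corresponding sketch of the right state encoding, again with hypothetical names; the stored backoffs let a later query charge its penalties without touching the data structure:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct RightStateSketch {
  std::vector<uint32_t> words;   // w_{n+1}^{|w|} as vocabulary identifiers
  std::vector<float> backoffs;   // log10 b(w_i^{|w|}) for i = n+1 .. |w|
};

// For a later query whose longest match used only the last
// matched_context_length stored words as context, charge the backoffs of
// every longer stored context; all of them were recorded when the state was
// built, so no extra lookups are needed.
inline float ChargeBackoff(const RightStateSketch &state,
                           std::size_t matched_context_length) {
  float total = 0.0f;
  // backoffs[k] corresponds to the context beginning at words[k].
  for (std::size_t k = 0; k + matched_context_length < state.words.size(); ++k)
    total += state.backoffs[k];
  return total;
}
```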

5.5 Formal Definition of State

To summarize Sections 5.3 and 5.4, the formal definition of state is

$$s(w_1^{|w|}) = \left( \left\{ \text{pointer}(w_1^i) \right\}_{i=1}^{\text{left}(w_1^{|w|})},\ \text{bit}(w_1^{|w|}),\ w_{\text{right}(w_1^{|w|})}^{|w|},\ \left\{ b(w_i^{|w|}) \right\}_{i=\text{right}(w_1^{|w|})}^{|w|} \right)$$
where the left state minimization criterion is

$$\text{left}(w_1^{|w|}) = \begin{cases} \text{left}(w_1^{|w|-1}) & \text{if } |w| > 0 \text{ and } x w_1^{|w|} \text{ does not appear for any word } x \\ |w| & \text{otherwise} \end{cases}$$
the right state minimization criterion is
$$\text{right}(w_1^{|w|}) = \begin{cases} 1 & \text{if } |w| = 0 \\ 1 + \text{right}(w_2^{|w|}) & \text{if } w_1^{|w|} x \text{ does not appear for any word } x \text{ and } b(w_1^{|w|}) = 1 \\ 1 & \text{otherwise} \end{cases}$$
and the bit indicates if state was short due to minimization (a) or because the string had few words (¡)
$$\text{bit}(w_1^{|w|}) = \begin{cases} \text{a} & \text{if } |w| \ge N-1,\ \text{left}(w_1^{|w|}) \ne |w|, \text{ or } \text{right}(w_1^{|w|}) \ne 1 \\ \text{¡} & \text{otherwise} \end{cases}$$
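The criteria above can be read off directly as recursions. The following self-contained sketch implements them against a deliberately naive toy model; the names and the brute-force membership tests are assumptions for illustration only, not the thesis implementation (which also caps states at N−1 words and works over the trie).

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Toy stand-in for the language model: the n-grams it contains, the
// vocabulary, and which contained n-grams have backoff exactly 1.
struct ToyModel {
  std::set<std::vector<int>> ngrams;
  std::vector<int> vocab;
  std::set<std::vector<int>> backoff_is_one;

  bool ExtendsLeft(const std::vector<int> &w) const {   // some x with x.w in model?
    for (int x : vocab) {
      std::vector<int> ext;
      ext.push_back(x);
      ext.insert(ext.end(), w.begin(), w.end());
      if (ngrams.count(ext)) return true;
    }
    return false;
  }
  bool ExtendsRight(const std::vector<int> &w) const {  // some x with w.x in model?
    for (int x : vocab) {
      std::vector<int> ext(w);
      ext.push_back(x);
      if (ngrams.count(ext)) return true;
    }
    return false;
  }
  bool BackoffOne(const std::vector<int> &w) const {
    return !ngrams.count(w) || backoff_is_one.count(w);
  }
};

// left(w_1^{|w|}): number of leading words kept in left state.
std::size_t LeftLength(const ToyModel &m, std::vector<int> w) {
  while (!w.empty() && !m.ExtendsLeft(w)) w.pop_back();
  return w.size();
}

// right(w_1^{|w|}): 1-based index at which the right state begins.
std::size_t RightStart(const ToyModel &m, std::vector<int> w) {
  std::size_t start = 1;
  while (!w.empty() && !m.ExtendsRight(w) && m.BackoffOne(w)) {
    w.erase(w.begin());
    ++start;
  }
  return start;
}
```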

5.6 Experiments

Experiments measure the impact of state minimization and encoding on the Moses chart (Hoang et al., 2009) decoder. As right state minimization was already implemented in several other toolkits, the focus of experiments is on left state.

5.6.1 Systems

The experiments refer to five systems covering three language pairs built by Koehn (2011):

Chinese–English A hierarchical system trained on 2.1 million parallel news sentences. The language model is an interpolation of models built on the Xinhua and AFP portions of English Gigaword version 4 (Parker et al., 2009) plus the English side of the parallel data. The test set is 1357 sentences from the NIST 2008 evaluation. The uncased BLEU (Papineni et al., 2002) score is 29.63%.

German–English Two systems, one hierarchical and one with target-side syntax. Rules were extracted from Europarl (Koehn, 2005). The language model interpolates English Europarl, news commentary, and monolingual news released for the Workshop on Machine Translation (Callison-Burch et al., 2011). The official 3003-sentence test set is used. On this test set, the hierarchical and target syntax systems score 21.10% and 20.14% BLEU, respectively.

English–German A hierarchical system and a target-syntax system, both trained on Europarl. The language model consists of the German monolingual data released for the 2011 Workshop on Machine Translation: Europarl, news commentary, and monolingual news data from each year. As in the evaluation, the test set is the same 3003 sentences used for German–English but translated in the opposite direction. The uncased BLEU scores on this test set are 14.95% for hierarchical and 14.69% for target syntax.

These systems were designed for the University of Edinburgh's participation in the NIST Open MT Evaluation (Chinese–English) and Workshop on Machine Translation (German–English and English–German) using constrained data and the standard Moses pipeline. For target syntax models, the Collins parser (Collins, 1999) was used to parse the target side. Language models were built with order N=5 using SRILM8 (Stolcke, 2002), interpolated modified Kneser-Ney smoothing (Kneser and Ney, 1995; Chen and Goodman, 1998), and the default pruning9 that drops singletons of order three or above.

5.6.2 Recombination

The effectiveness of state minimization is measured in two ways: the extent to which it minimizes state length and the number of recombinations that happen as a result. Table 5.2 shows the distribution of hypotheses by state length in the Chinese–English system. Left state minimization is generally more aggressive: 19.2% of left states are full length (4) compared to 29.5% of right states. The trend continues for shorter lengths as well. One reason for this difference is that the decoder applies the glue rule from left to right, generating hypotheses bound to the beginning of sentence. Hypotheses bound to the beginning of sentence cannot extend further left, and so their left state length is zero. The Moses baseline (Hoang et al., 2009) already implemented special-case handling for the beginning of sentence that allows hypotheses to recombine even if their first N–1 words differ, effectively dropping left state to zero length as well.

                       Right Length
Left Length      0      1      2      3      4     Sum
0              0.3    1.1    3.0    3.2    3.2    10.7
1              0.3    1.7    3.8    3.5    3.5    12.9
2              0.6    2.9    9.5    8.8    8.5    30.3
3              0.4    2.1    6.9    7.7    7.5    24.7
4              0.3    1.6    5.1    5.6    6.7    19.2
Sum            1.9    9.4   28.3   28.8   29.5    97.8

Table 5.2: Percentage of hypotheses by length of left state and length of right state in the Chinese–English task with a cube pruning pop limit of 1000. On hypotheses considered by the decoder, left state minimization is more aggressive than right state minimization. Excluded from the table, 2.2% of hypotheses were shorter than 4 words, consisting of 0.1% unigrams, 1.1% bigrams, and 1.1% trigrams. Sums were computed prior to rounding.

To measure recombination, the decoder was run with left and right state minimization enabled. Each hypothesis keeps track of its highest-scoring surface form. When two hypotheses recombined, their highest-scoring surface forms were compared to determine if recombination would have been permitted without state minimization. In the Chinese–English system with a cube pruning pop limit of 1000, there were an average of 176,191 recombinations per sentence. Of these, 101,409 (57%) happened without any state minimization or were covered by the decoder's special handling of the beginning and end of sentence tokens. Left state minimization alone permitted 33,537 (19%) recombinations. Right state minimization alone permitted 26,111 (15%). Finally, 15,133 (9%) recombinations required both left and right state minimization. Left state minimization is more aggressive and leads to more recombinations than does right state minimization.

8 Chapter 3 was not implemented in 2011.
9 Without pruning, the primary result that left state minimization changes the model score less than right state minimization still held.

5.6.3 Behavior Under Concatenation

The CORRECT function behaves differently, on average, for different state lengths. To measure this effect, the decoder was modified to record CORRECT($v_1^{|v|}$, $w_1^{|w|}$) on every call, along with the length of right state in $s(v_1^{|v|})$, the length of left state in $s(w_1^{|w|})$, and the bit in the left state of $w_1^{|w|}$. The geometric average, or arithmetic average in log space, was then computed for each combination of lengths and the bit. Results are shown in Tables 5.3 and 5.4.

For long hypotheses (denoted by a):

                       Right Length
Left Length       1        2        3        4
0            −0.741   −1.062   −1.357   −1.701
1            −0.269   −0.429   −0.588   −0.836
2            −0.129   −0.236   −0.362   −0.567
3             0.007   −0.061   −0.128   −0.314
4             0.220    0.202    0.169    0.037

For short hypotheses (denoted by ¡):

                       Right Length
Left Length       1        2        3        4
1             0.017   −0.068   −0.174   −0.359
2             0.096    0.046   −0.045   −0.239
3             0.159    0.130    0.061   −0.117

Table 5.3: Mean log10 CORRECT on the Chinese-English system with pop limit 1000, reported separately by state lengths and by the left state flag. Zero length right state is not shown in the table because estimates were already based on empty context so the score adjustment is zero. Small hypotheses can only have lengths 1 through 3. Left state minimization predicts that scores will decline during concatenation while right state minimization predicts that scores will increase.

For large hypotheses (denoted by a):

                       Right Length
Left Length       1        2        3        4
0            −0.780   −1.074   −1.347   −1.573
1            −0.181   −0.306   −0.455   −0.569
2             0.023   −0.028   −0.100   −0.213
3             0.184    0.199    0.190    0.111
4             0.390    0.481    0.509    0.409

Table 5.4: Mean log10 score adjustments on the German-English hierarchical system with pop limit 1000. Entries show the same general trend as in Table 5.3, but the magnitude is larger and sometimes long right states fare better than do short right states.

Left state minimization predicts that scores will decline while right state minimization predicts that scores will increase. The difference lies in accounting for backoff weights. If left state is short, then it is known that the language model will back off, but it is not known what the backoff penalties will be, as they are context-dependent. Thus, charging the backoff penalties is delayed until CORRECT is called. If right state is short, the language model has backed off and backoff penalties have already been assessed because context was known. Moreover, there is less potential for future backoff, since there is less context to match.

These results suggest a simple scheme to improve search accuracy: use Table 5.3 as rest costs. Specifically, the CORRECT function is modified to look up the appropriate value in Table 5.3 for every string it encounters. The value is added to the score of a string when it is generated and subtracted when the string

is consumed. More formally,

$$\text{CORRECTREST}(s(v_1^{|v|}), s(w_1^{|w|})) = \text{CORRECT}(s(v_1^{|v|}), s(w_1^{|w|})) + \text{REST}(v_1^{|v|} w_1^{|w|}) - \text{REST}(v_1^{|v|}) - \text{REST}(w_1^{|w|})$$
where the REST function looks up the appropriate value in Table 5.3. These rest costs do improve search to some extent, but were abandoned because Chapter 6 works better.

5.6.4 Language Model Lookups

Speed improvements from left state encoding were measured by counting lookups and by timing the decoder. To translate 1357 sentences with a pop limit of 1000, the baseline Chinese–English system with right state minimization made a total of 760,788,076 pops. Turning on left state minimization led to a slight increase in pops at 761,124,349. Each pop entails evaluating the language model probability of the target side of a rule application, consisting of a string of terminals and non-terminals. To do so, it executes queries: one query for each terminal and one query for each word in the left state of a non-terminal. Left state minimization led to a 30% reduction in queries. The remaining queries perform data structure lookups, typically starting with unigrams and proceeding to larger n-grams. Storing pointers in left state allows the model to skip over lower-order n-grams. Table 5.5 shows the combined effect of reduced queries and data structure pointers.

n     Baseline         Pointer          Reduction
1     3,709,029,243    1,988,280,862    46%
2     3,305,400,924    2,128,356,570    35%
3     2,425,086,263    1,692,042,948    30%
4       998,098,720      752,425,908    24%
5       229,849,561      213,076,869     7%

Table 5.5: Number of language model data structure lookups for each n-gram length made by the Chinese- English system to translate 1357 sentences. Storing a data structure pointer in left state means queries can skip over shorter orders.

There is a substantial reduction in lookups. However, the left state pointers only skip over n-grams that have recently been retrieved and are likely to be cached by hardware. All five systems were run with pop limits 50, 100, 200, 500, and 1000. CPU time was reduced by only 3-6%. The disparity between reduction in lookups and overall CPU time reduction suggests that raw language model lookups are not a credible proxy for the computational cost of decoding. This is why experiments focus on CPU time as the measure of cost, despite its dependence on implementation.

5.6.5 Accuracy

Generally,10 accuracy should improve because state minimization permits more recombination. Accuracy is measured by the model score of the single-best hypothesis found by search. Model scores have arbitrary scale and include constants, so only relative comparisons are meaningful. The model scores are averaged over all sentences in the test set. Table 5.6 shows the results for each system. Both left state minimization and right state minimization improve model score. However, left state minimization alone is consistently worse than right state minimization alone. This seems to contradict the observation in Section 5.6.2 that left state minimization is more aggressive and responsible for more

10An increase in model score is not strictly guaranteed: a hypothesis that would have been pruned without recombination might instead be expanded, grow into additional hypotheses, and cause a component of a better hypothesis to be pruned.

               Hierarchical                             Syntax
          zh-en       de-en        en-de        de-en        en-de
None    −82.5435    −101.392     −79.2035     −104.244     −13.1425
Left    −82.5417    −101.390     −79.2028     −104.244     −13.1415
Right   −82.5368    −101.388     −79.1982     −104.242     −13.1389
Both    −82.5340    −101.387     −79.1972     −104.241     −13.1370

Table 5.6: Impact of state minimization on average model score. A higher model score is better. Model scores are scale-invariant and include a constant that indicates translation difficulty. All experiments used a cube pruning pop limit of 1000.

recombination. However, Section 5.6.3 provides a plausible explanation: when left state minimization applies, it predicts that hypotheses will fare poorly, so the recombined hypotheses are likely to be pruned.

5.6.6 Time-Accuracy

The model scores in Section 5.6.5 are difficult to interpret in isolation. By reducing the pop limit until average model score equals that of the baseline, search improvements can be interpreted as speed improvements. Figure 5.3 shows the trade-off between CPU time and average model score as moderated by the pop limit on the Chinese–English task. Timing measurements were performed on a 32-core machine with 64 GB of RAM. The language model and grammar were converted into binary format in advance then faulted into the operating system disk cache before each run. With a pop limit of 1000, the baseline decoder yields average model score −82.5368 using 19.6 CPU seconds per sentence. Reducing the pop limit to 910 and applying left state minimization, the decoder yields average model score −82.5367 (better than the baseline) using 17.3 CPU seconds per sentence. This is a net 11% reduction in CPU time. Since accuracy has been traded for speed, there is no reason to expect a translation quality improvement. Using uncased BLEU (Papineni et al., 2002) as a proxy for translation quality, the baseline with pop limit 1000 scored 29.63% while left state minimization with pop limit 910 scored 29.67%, insignificantly better.

5.6.7 Memory

In terms of memory consumption, Moses's preexisting method for tracking state is less efficient than this work's encoding, so there was a small (about 1%) reduction in memory usage across the test systems using the same pop limit. There was no increase in language model storage size as explained in Section 5.3.2.

5.7 Summary

This chapter formalized language model state. Chapters 6 and 7 apply this formalism to save memory or improve search. Two types of optimizations were presented: storing fewer words in state and carrying useful information from one query to the next. Compared with existing right state minimization, the improvement is relatively small (11% CPU time reduction). Most surprising is that left state minimization led to more recombination, but the improvement in quality is smaller than with right state. This was explained by the tendency of hypotheses with short left state to fare poorly in future concatenations, an observation that inspired Chapter 6.

67 −82.50

−82.55

−82.60

−82.65 Average model score

−82.70

This work: Left + Right Baseline: Right only −82.75 0 5 10 15 20 25 30 35 40 45 50 55 CPU seconds/sentence 29.8

29.7

29.6

29.5

29.4 Uncased BLEU

29.3

29.2 This work: Left + Right Baseline: Right only 29.1 0 5 10 15 20 25 30 35 40 45 50 55 CPU seconds/sentence

Figure 5.3: CPU seconds per sentence (user plus system), average model score, and BLEU on the Chinese- English task. Two curves are shown: full left and right state optimization and the commonly-implemented right state minimization.

Chapter 6

Rest Costs1

So so is good, very good, very excellent good: and yet it is not; it is but so so. –William Shakespeare, As You Like It

There is more to life than simply increasing its speed. –Mahatma Gandhi

This chapter is about changing the scores of strings while preserving sentence-level probabilities. Such changes are called rest costs because they apply when strings are held by the decoder (and are, in some sense, at rest) but cancel out as strings are concatenated. Approximate search algorithms such as beam search, cube pruning (Chiang, 2007), and Chapter 7 base their pruning decisions on scores that include rest costs, so these changes impact search accuracy. The first contribution improves search accuracy when Kneser-Ney language models (Kneser and Ney, 1995) are used, though at the expense of additional memory consumption. The second contribution collapses probability and backoff into a single value for any backoff-smoothed model (Katz, 1987), including Kneser-Ney, thereby saving memory but degrading search. The two contributions can also be stacked, using the same memory and slightly improving search.

6.1 Introduction

When the language model is called upon to score a string in isolation, the first few words have incomplete context and the last few words have not been completely used as context. Chapter 5 formalized these words as left and right state, respectively. The question in this chapter is how to score the words in state. The baseline is common practice (Chiang, 2007; Koehn et al., 2007; Dyer et al., 2010; Li et al., 2009; Vilar et al., 2010): simply provide the language model less context when it is indeterminate. Formally, the baseline score of string $w_1^{|w|}$ is given by
$$\prod_{i=1}^{|w|} p_N(w_i \mid w_1^{i-1})$$
where each $w_i$ is a word and $p_N$ is an N-gram language model. Subscripts are used to emphasize the order of the language model. When a subscript is not present ($p$), it refers to the full model ($p_N$).

1A prior version of this work was published as Heafield et al. (2012).

6.1.1 Better Rest Costs

The problem with the baseline estimate is that Kneser-Ney smoothing2 (Kneser and Ney, 1995) assumes that the model has backed off when provided with fewer than N–1 words of context. Specifically, adjusted counts are specialized for lower-order entries as explained in Section 3.3.2. Conditional probabilities based on these adjusted counts assume that the language model has backed off. In this case, however, context is indeterminate, so it is inappropriate to use the lower-order probabilities to score the words in left state of a string. As an example, 5-gram and unigram language models were built with Kneser-Ney smoothing on the same data. Strings frequently begin with "the". Using a lower-order entry from the 5-gram model,

log10 p5(the) = −2.49417

The unigram model does not condition on backing off, assigning

log10 p1(the) = −1.28504

Intuitively, the 5-gram model is surprised, by more than an order of magnitude, to see "the" without matching words that precede it. This work remedies the situation by training N language models on the same data, one for each order. Each model $p_n$ is an n-gram model (it has order n). Then $p_n$ is used to score the nth word of a string. Thus, a unigram model scores the first word of a string, a bigram model scores the second word, and so on until left state is completely scored. Storing probabilities from these models requires one additional value per n-gram in the model, except for N-grams where this probability is already stored.
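A small sketch of this scoring rule under an assumed layout, where each lower-order model is represented as a simple map from an n-gram to its log10 conditional probability:

```cpp
#include <cstddef>
#include <map>
#include <vector>

// models[n-1] maps an n-gram w_1^n to log10 p_n(w_n | w_1^{n-1}), i.e. the
// conditional probability from the order-n model (hypothetical layout).
double ScoreLeftState(const std::vector<std::map<std::vector<int>, double> > &models,
                      const std::vector<int> &left_state_words) {
  double total = 0.0;
  for (std::size_t i = 0; i < left_state_words.size() && i < models.size(); ++i) {
    // The (i+1)-th word of the left state is scored by the (i+1)-gram model.
    std::vector<int> ngram(left_state_words.begin(), left_state_words.begin() + i + 1);
    std::map<std::vector<int>, double>::const_iterator it = models[i].find(ngram);
    if (it != models[i].end()) total += it->second;
  }
  return total;
}
```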

6.1.2 Saving Memory

Normally (Chapter 4), a backoff-smoothed model (Katz, 1987) is represented in memory by storing a probability and backoff with every n-gram (except full-length N-grams always have backoff one, so there is no need to store the backoff with each N-gram). This work proves that probability and backoff can be collapsed into a single value per n-gram while still retaining sentence-level probabilities. Once probability and backoff are collapsed, non-sentence strings no longer have proper probabilities, but rather scores that approximate their probability. Experiments with cube pruning (Chiang, 2007) show this change has a negative impact on search accuracy, though the same level of accuracy can be obtained by raising the pop limit and spending more CPU time.

6.2 Related Work

Zens and Ney (2008) present rest costs for phrase-based translation. These rest costs are based on factors external to the hypothesis, namely output that the decoder may generate in the future. The rest costs in this work examine words internal to the sentence string, namely those in state. Theoretically, both could be used simultaneously. This work also differs by focusing on syntactic translation.

Carter et al. (2012) simultaneously published upper bounds on language model probability then applied them to exactly decode hidden Markov models. More recent work (Aziz et al., 2013) extended their approach to exact, albeit quite slow, decoding for hierarchical machine translation. Their upper bound implementation stores two additional values with each lower-order n-gram, an upper bound on probability and an upper bound on backoff.

2 Other smoothing techniques, including Witten-Bell (Witten and Bell, 1991) and Good-Turing (Good, 1953), do not make this assumption.

Storing an upper bound on backoff is usually unnecessary. The bound must be at least one (in probability space) because the model might not back off and would therefore not charge any penalty. Section 3.4 proved that backoff is at most one in unpruned interpolated modified Kneser-Ney language models. Thus, the bound will always be one for such models and there is no need to store a constant with every entry in the language model. However, general models do allow the backoff term to be greater than one on rare occasions. In such cases, the backoff could be folded into probability, a limited version of the memory-saving optimization presented in this work.

Stupid backoff (Brants et al., 2007) is implemented by storing a count with each n-gram in memory. This has the advantage that there is only one value per n-gram instead of probability and backoff. However, computing a score on the fly from counts is only feasible with simple smoothing methods. This work shows that general backoff-smoothed language models (Katz, 1987) can be stored using a single value per n-gram. The values are scores so minimal computation is performed at query time.

General related work on compressing language models is presented in Sections 2.3 and 4.3.

6.3 Contributions

6.3.1 Better Rest Costs

As mentioned in the introduction, the first few words of a string are typically scored using lower-order entries from an N-gram language model. However, Kneser-Ney smoothing (Kneser and Ney, 1995) conditions lower-order probabilities on backing off. This exploits the backoff algorithm, which bases its probability on the longest possible match. If an N-gram was seen in the training data, the model will match it fully and use the smoothed count. Otherwise, the full N-gram was not seen in the training data and the model resorts to a shorter n-gram match. Probability of this shorter match is based on how often the n-gram is seen in different contexts (the adjusted count), not how often it is seen in general (the raw count). Thus, probabilities associated with shorter n-grams are not representative of cases where context is short simply because additional context is indeterminate at the time of scoring. For words elided by left state minimization from Section 5.3, the backoff assumption is correct because it has been proven that the language model will back off. For words that remain in left state, it is unknown if the model will back off.

The idea is to use a language model of the same order to produce a rest cost. Specifically, there are N language models, one of each order from 1 to N. The models are trained on the same corpus with the same smoothing parameters to the extent that they apply. These are then compiled into one data structure where each n-gram record has three values:

1. Probability $p_n$ from the n-gram language model if n < N

2. Probability $p_N$ from the N-gram language model

3. Backoff $b$ from the N-gram language model if n < N

For full-length N-grams, the two probabilities are equal and backoff is one, so only one value is stored. A small issue arises if the language models are pruned, in which case the n-gram model may not have a probability for every n-gram present in the N-gram model. These probabilities are constructed by querying the n-gram model in the normal way (Chapter 4). The idea is that $p_n$ is the average conditional probability that will be encountered once additional context becomes known.

Formalizing the above, let $w_1^{|w|}$ be a string whose left state minimizes to $w_1^m$. The baseline score estimate

is
$$p_{\text{baseline}}(w_1^{|w|}) = \left( \prod_{n=1}^{m} p_N(w_n \mid w_1^{n-1}) \right) \cdot \left( \prod_{n=m+1}^{N-1} p_N(w_n \mid w_1^{n-1}) \right) \cdot \left( \prod_{n=N}^{|w|} p_N(w_n \mid w_{n-N+1}^{n-1}) \right) \qquad (6.1)$$
while the improved estimate is

$$p_{\text{improved}}(w_1^{|w|}) = \left( \prod_{n=1}^{m} p_n(w_n \mid w_1^{n-1}) \right) \cdot \left( \prod_{n=m+1}^{N-1} p_N(w_n \mid w_1^{n-1}) \right) \cdot \left( \prod_{n=N}^{|w|} p_N(w_n \mid w_{n-N+1}^{n-1}) \right) \qquad (6.2)$$

The difference between these equations is that $p_n$ is used for words in the left state, i.e. $1 \le n \le m$. It is not necessary to store backoffs for $p_n$ because $m$ was chosen by left state minimization such that all queried n-grams appear in the model. More involved estimates that interpolate the upper bound, the lower bound, $p_n$, and $p_N$ did not improve results in any consistent way. The interpolation weights were tuned on cube pruning logs. This modification to the language model improves rest costs (and therefore quality or CPU time) at the expense of using more memory to store $p_n$. The next section does the opposite: make rest costs worse to reduce storage size.
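To make the storage concrete, the following sketch shows one possible record layout for this scheme; the struct names and the plain float encoding are assumptions for illustration, not the representation used in Chapter 4.

```cpp
#include <cstdint>

// n-grams with 1 <= n < N: the lower-order rest probability is stored
// alongside the usual probability and backoff (all in log10 space).
struct MiddleEntry {
  float rest_log_prob;  // log10 p_n(w_n | w_1^{n-1}), used while context is incomplete
  float log_prob;       // log10 p_N(w_n | w_1^{n-1})
  float log_backoff;    // log10 b(w_1^n)
};

// Full-length N-grams: backoff is one and the two probabilities coincide,
// so a single value suffices.
struct LongestEntry {
  float log_prob;
};
```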

6.3.2 Saving Memory

Several smoothing strategies, including Kneser-Ney, are based on the backoff algorithm (Katz, 1987) shown in Algorithm 5. This is typically implemented by storing probability $p$ and backoff $b$ with each entry, as was done in Chapter 4. Given a query $w_1^{|w|}$, the backoff algorithm bases probability $p(w_{|w|} \mid w_1^{|w|-1})$ on as much context as possible. Formally, it finds the minimum $f$ so that $w_f^{|w|}$ is in the model then uses $p(w_{|w|} \mid w_f^{|w|-1})$ as the basis for probability. Backoff penalties $b$ are charged because a longer match was not found, forming the product
$$p(w_{|w|} \mid w_1^{|w|-1}) = p(w_{|w|} \mid w_f^{|w|-1}) \prod_{j=1}^{f-1} b(w_j^{|w|-1}) \qquad (6.3)$$
Notably, the backoff penalties $\{b(w_j^{|w|-1})\}_{j=1}^{|w|-1}$ are independent of $w_{|w|}$, though which backoff penalties are charged depends on $f$.

To save memory, this contribution accounts for backoff in a different way, defining $q$
$$q(w_{|w|} \mid w_1^{|w|-1}) = \frac{p(w_{|w|} \mid w_f^{|w|-1}) \prod_{j=f}^{|w|} b(w_j^{|w|})}{\prod_{j=f}^{|w|-1} b(w_j^{|w|-1})}$$
where again $w_f^{|w|}$ is the longest matching entry in the model. The idea is that $q$ is a term in the telescoping series that scores a string, shown in equation (6.1) or (6.2). The numerator pessimistically charges all backoff penalties, as if the next word $w_{|w|+1}$ will only match a unigram. When $w_{|w|+1}$ is scored, the denominator of $q(w_{|w|+1} \mid w_1^{|w|})$ cancels out backoff terms that were wrongly charged. Once these terms are canceled, all that is left is $p$, the correct backoff penalties, and terms on the edge of the series.

Proposition 4. The terms of $q$ telescope. Formally, let $w_1^{|w|}$ be a string and $f$ take the minimum value so that $w_f^{|w|}$ is in the model. Then
$$q(w_1^{|w|}) = p(w_1^{|w|}) \prod_{j=f}^{|w|} b(w_j^{|w|})$$

backoff ← 1
for f ← 1 to |w| do
    if $w_f^{|w|}$ is in the model then
        return $p(w_{|w|} \mid w_f^{|w|-1})$ · backoff
    else
        if $w_f^{|w|-1}$ is in the model then
            backoff ← backoff · $b(w_f^{|w|-1})$
        end
    end
end

Algorithm 5: The baseline backoff algorithm to compute $p(w_{|w|} \mid w_1^{|w|-1})$. It always terminates with a probability because even unknown words are treated as a unigram.

for f ← 1 to |w| do
    if $w_f^{|w|}$ is in the model then
        return $q(w_{|w|} \mid w_f^{|w|-1})$
    end
end

Algorithm 6: The run-time pessimistic backoff algorithm to find $q(w_{|w|} \mid w_1^{|w|-1})$. It assumes that $q$ has been computed at model building time.

Proof. By induction on $|w|$. When $|w| = 1$, $f = 1$ since the word $w_1$ is either in the vocabulary or mapped to the unknown word token and treated like a unigram.

$$q(w_1) = \frac{p(w_1) \prod_{j=1}^{1} b(w_j^1)}{\prod_{j=1}^{0} b(w_j^0)} = p(w_1)\, b(w_1)$$
For $|w| > 1$,

$$q(w_1^{|w|}) = q(w_1^{|w|-1})\, q(w_{|w|} \mid w_1^{|w|-1}) = \frac{q(w_1^{|w|-1})\, p(w_{|w|} \mid w_f^{|w|-1}) \prod_{j=f}^{|w|} b(w_j^{|w|})}{\prod_{j=f}^{|w|-1} b(w_j^{|w|-1})}$$

where $f$ has the lowest value such that $w_f^{|w|}$ is in the model. Applying the inductive hypothesis to expand $q(w_1^{|w|-1})$ yields
$$\frac{\left( p(w_1^{|w|-1}) \prod_{j=g}^{|w|-1} b(w_j^{|w|-1}) \right) p(w_{|w|} \mid w_f^{|w|-1}) \prod_{j=f}^{|w|} b(w_j^{|w|})}{\prod_{j=f}^{|w|-1} b(w_j^{|w|-1})}$$
where $g$ has the lowest value such that $w_g^{|w|-1}$ is in the model. The backoff terms cancel to yield

$$p(w_1^{|w|-1}) \left( \prod_{j=g}^{f-1} b(w_j^{|w|-1}) \right) p(w_{|w|} \mid w_f^{|w|-1}) \prod_{j=f}^{|w|} b(w_j^{|w|})$$

By construction of $g$, $w_j^{|w|-1}$ is not in the model for all $j < g$. Hence, $b(w_j^{|w|-1}) = 1$ implicitly for all $j < g$. Multiplying by 1,
$$p(w_1^{|w|-1}) \left( \prod_{j=1}^{f-1} b(w_j^{|w|-1}) \right) p(w_{|w|} \mid w_f^{|w|-1}) \prod_{j=f}^{|w|} b(w_j^{|w|})$$
Recognizing the backoff equation (6.3) to simplify,

$$p(w_1^{|w|-1})\, p(w_{|w|} \mid w_1^{|w|-1}) \prod_{j=f}^{|w|} b(w_j^{|w|})$$

Finally, the conditional probability folds as desired

$$q(w_1^{|w|}) = p(w_1^{|w|}) \prod_{j=f}^{|w|} b(w_j^{|w|})$$

Entries ending in the end-of-sentence token </s> have backoff one, so it follows from Proposition 4 that sentence-level scores are unchanged:
$$q(\text{<s>}\ w_1^{|w|}\ \text{</s>}) = p(\text{<s>}\ w_1^{|w|}\ \text{</s>})$$
Proposition 4 characterizes $q$ as a pessimistic rest cost for strings that scores sentences in exactly the same way as the baseline using $p$ and $b$. To save memory, the value $q$ is stored in lieu of $p$ and $b$. Compared with the baseline, this halves the number of values from two to one float per n-gram, except N-grams that already have one value. The impact of this reduction is substantial, as seen in Section 6.4.3. Run-time scoring is also simplified as shown in Algorithm 6 since the language model locates the longest match $w_f^{|w|}$ then returns the value $q(w_{|w|} \mid w_1^{|w|-1}) = q(w_{|w|} \mid w_f^{|w|-1})$ without any calculation or additional lookup. Baseline language models either retrieve backoff values with additional lookups (Stolcke, 2002; Federico et al., 2008) or carry backoff with state (Section 5.4.3); $q$ effectively moves this step to preprocessing. The disadvantage is that $q$ is not a proper probability and it produces worse rest costs than does the baseline.

In Moses (Koehn et al., 2007) and Joshua (Li et al., 2009), the language model is consulted to score phrases or lexical items in grammar rules as the translation model is loaded. This can be done offline in an embarrassingly parallel fashion where memory is not as tight (because the translation model can be streamed). Experiments therefore use an ordinary language model for this initial pass instead of the compressed one. The impact is that translation model pruning (i.e. selecting the top 20 target-side rules by score) and sorting (to prioritize rules for cube pruning) is done with a normal language model. The compressed language model is only used during cube pruning. The cdec (Dyer et al., 2010) decoder does not score rules with the language model as they are loaded.
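The collapse of probability and backoff into $q$ can be performed entirely offline. The sketch below computes $q$ for every entry of a toy map-based model directly from the definition above; the names and the map representation are assumptions for illustration, not the preprocessing code used in this work.

```cpp
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

typedef std::vector<int> Ngram;
// Input model: n-gram -> (log10 probability, log10 backoff).
typedef std::map<Ngram, std::pair<double, double> > PBModel;

// Sum of log10 backoffs over all suffixes of `context` present in the model;
// absent n-grams implicitly have backoff 1 (log10 backoff 0).
static double SuffixBackoffSum(const PBModel &model, const Ngram &context) {
  double sum = 0.0;
  for (std::size_t j = 0; j < context.size(); ++j) {
    PBModel::const_iterator it = model.find(Ngram(context.begin() + j, context.end()));
    if (it != model.end()) sum += it->second.second;
  }
  return sum;
}

// q(w_n | w_1^{n-1}) = p(w_n | w_1^{n-1}) * prod_j b(w_j^n) / prod_j b(w_j^{n-1}),
// evaluated in log space for each entry w_1^n of the model.
std::map<Ngram, double> CollapseToQ(const PBModel &model) {
  std::map<Ngram, double> q;
  for (PBModel::const_iterator it = model.begin(); it != model.end(); ++it) {
    const Ngram &ngram = it->first;
    Ngram context(ngram.begin(), ngram.end() - 1);
    q[ngram] = it->second.first + SuffixBackoffSum(model, ngram)
                                - SuffixBackoffSum(model, context);
  }
  return q;
}
```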

6.3.3 Combined Scheme

The two language model modifications can be trivially combined by using lower-order probabilities on the left of a string and by charging all backoff penalties on the right of a string. The net result is a language model that uses the same memory as the baseline but has slightly better rest costs.

                    Lower                   Baseline
n    Mean      Bias    MSE    Var      Bias    MSE    Var
1   −3.21      .10     .84    .83      −.12    .87    .86
2   −2.27      .04     .18    .17      −.14    .23    .24
3   −1.80      .02     .07    .07      −.09    .10    .09
4   −1.29      .01     .04    .04      −.10    .09    .08

Table 6.1: Bias (mean error), mean squared error, and variance (of the error) for the lower-order rest cost and the baseline. Error is the estimated log10 probability minus the final log10 probability. Also shown is the average final log10 probability. Statistics were computed separately for the first word of a fragment (n = 1), the second word (n = 2), etc. The lower-order estimates are better across the board, reducing error in cube pruning. Statistics were only collected for words in left state.

6.4 Experiments

The test systems are the German-English systems described in Section 5.6.1, one hierarchical and one with target syntax. Language models with orders 1 through 5 were estimated using SRILM3 (Stolcke, 2002) with interpolated modified Kneser-Ney smoothing (Kneser and Ney, 1995; Chen and Goodman, 1998) and the default pruning settings. In all scenarios, the primary language model has order 5; the other models were used for rest costs. Feature weights were trained with MERT (Och, 2003) on the baseline using a pop limit of 1000 and 100-best output. Since final feature values are unchanged, no retuning was performed. Measurements were collected by running the decoder on the official 3003-sentence test set. These experiments were run on a machine with 64 GB of RAM and 32 cores; this hardware is the same as the 64 GB machine in Chapter 3. Chapters 4 and 5 used different hardware.

6.4.1 Rest Costs as Prediction

Scoring the first few words of a string is a prediction task. The goal is to predict what the probability will be when more context becomes known. Performance on this task was measured by running the decoder on the hierarchical system with a pop limit of 1000. Every time more context became known, the decoder logged4 the prediction error (estimated log probability minus updated log probability) for both lower-order rest costs and the baseline. Table 6.1 shows the results. Cube pruning uses relative scores, so bias matters less, though positive bias will favor rules with more arity. Variance matters the most because lower variance means cube pruning's relative rankings are more accurate. Lower-order rest costs are better across the board in terms of absolute bias, mean squared error, and variance. This suggests that lower-order rest costs will make cube pruning more accurate.

6.4.2 Time-Accuracy Trade-Offs

The cube pruning beam size is a trade-off between search accuracy and CPU time. The rest costs improve (or degrade) that trade-off. Search accuracy is measured by the average model score of single-best translations. Model scores are scale-invariant and include a large constant factor; higher is better. Overall performance is also measured with uncased BLEU (Papineni et al., 2002). CPU time is the sum of user and system time used by Moses divided by the number of sentences (3003). Timing includes time to load, though files were forced into the disk cache in advance. Results are shown in Figures 6.2 and 6.1.

3 This work was performed before Chapter 3.
4 Logging was only enabled for this experiment.

        Baseline               Lower Order            Pessimistic            Combined
Pop     CPU    Model    BLEU   CPU    Model    BLEU   CPU    Model    BLEU   CPU    Model    BLEU
2       2.96  −101.85  21.19   2.44  −101.80  21.63   2.71  −101.90  20.85   3.05  −101.84  21.37
10      2.80  −101.60  21.90   2.42  −101.58  22.20   2.95  −101.63  21.74   2.69  −101.60  21.98
50      3.02  −101.47  22.18   3.11  −101.46  22.34   3.46  −101.48  22.08   2.67  −101.47  22.14
690    10.83  −101.39  22.28  11.45  −101.39  22.25  10.88  −101.40  22.25  11.19  −101.39  22.23
900    13.41  −101.39  22.27  14.00  −101.38  22.24  13.38  −101.39  22.25  14.09  −101.39  22.22
1000   14.50  −101.39  22.27  15.17  −101.38  22.25  15.09  −101.39  22.26  15.23  −101.39  22.23
1350   18.52  −101.38  22.27  19.16  −101.38  22.23  18.46  −101.39  22.25  18.61  −101.38  22.23
5000   59.67  −101.38  22.24  61.41  −101.38  22.22  59.76  −101.38  22.27  61.38  −101.38  22.22


Figure 6.1: Hierarchical system performance. Select beam sizes are shown in the table.

        Baseline               Lower Order            Pessimistic            Combined
Pop     CPU    Model    BLEU   CPU    Model    BLEU   CPU    Model    BLEU   CPU    Model    BLEU
2       3.29  −105.56  20.45   3.68  −105.44  20.79   3.74  −105.62  20.01   3.18  −105.49  20.43
10      5.21  −104.74  21.13   5.50  −104.72  21.26   5.43  −104.77  20.85   5.67  −104.75  21.10
50     23.30  −104.31  21.36  23.51  −104.24  21.38  23.68  −104.33  21.25  24.29  −104.22  21.34
500    54.61  −104.25  21.33  55.92  −104.15  21.38  54.23  −104.26  21.31  55.74  −104.15  21.40
700    64.08  −104.25  21.34  87.02  −104.14  21.42  68.74  −104.25  21.29  78.84  −104.15  21.41


Figure 6.2: Target-syntax performance. Select beam sizes are shown in the table.

Lower-order rest costs perform better in both systems, reaching plateau model scores and BLEU with less CPU time. In the hierarchical system, peak BLEU 22.34 is achieved under the lower-order condition with pop limits 50 and 200, while other scenarios are still climbing to the plateau. With a pop limit of 1000, the baseline's average model score is −101.3867. Better average model scores are obtained from the

lower-order model with pop limit 690 using 79% of baseline CPU, the combined model with pop limit 900 using 97% CPU, and the pessimistic model with pop limit 1350 using 127% CPU.

The gain is much larger for target syntax, where a pop limit of 50 outperforms the baseline with pop limit 700. CPU time per sentence is reduced to 23.5 seconds from 64.0 seconds, a 63.3% reduction. The combined setting, using the same memory as the baseline, shows a similar 62.1% reduction in CPU time. This difference appears to be due to improved grammar rule scoring that impacts pruning and sorting. In the target syntax model, the grammar is not saturated (i.e. less pruning will still improve scores) but pruning is nonetheless used for tractability reasons. The lower-order rest costs are particularly useful for grammar pruning because lexical items are typically less than five words long (and frequently only one word).

Pessimistic compression does worsen search, requiring 27% more CPU in the hierarchical system to achieve the same quality. This is worthwhile to fit large-scale language models in memory, especially if the alternative is a remote language model. Further experiments with improved rest costs appear in Chapter 7.

6.4.3 Memory Usage

Rest costs add a value (for lower-order probabilities) or remove a value (pessimistic compression) for each n-gram except those of highest order (n = N). The combined condition adds one value and removes another, so it uses the same memory as the baseline. The memory footprint of adding or removing a value depends on the number of such n-grams, the underlying data structure, and the extent of quantization. Values are reported for the test language model which has 135 million n-grams for n < 5 and 56 million 5-grams. Actual memory usage is reported for the data structures in Chapter 4. Theoretical values are also reported for minimal perfect hashing from ShefLM (Guthrie and Hepple, 2010). Minimal perfect hashing assumes the Compress, Hash and Displace algorithm (Belazzougui et al., 2008) with 8-bit signatures and 8-bit quantization. Table 6.2 shows the results. Storage size of the smallest model is reduced by 26%, bringing higher-quality smoothed models in line with stupid backoff models that also store one value per n-gram.

Structure                     Baseline   Change    %
Probing                        4,072      517     13%
Trie                           2,647      506     19%
8-bit quantized trie           1,236      140     11%
8-bit minimal perfect hash       540      140     26%

Table 6.2: Size in megabytes of the test language model, excluding operating system overhead. Change is the cost of adding an additional value to store lower-order probabilities. Equivalently, it is the savings from pessimistic compression.

6.5 Summary

Two techniques have been presented that trade between accuracy and memory usage. Efficiently storing lower-order probabilities and using them as rest costs improves both cube pruning (21% CPU reduction in a hierarchical system) and grammar pruning (net 63% CPU time reduction with target syntax) at the expense of 13–26% more RAM for the language model. This grammar pruning improvement is surprising both in its impact relative to changing the pop limit and in its simplicity of implementation, since it can be done offline. Compressing the language model to halve the number of values per n-gram (except N-grams) results in a 13–26% reduction in RAM (26% for the smallest model), costing 27% more

78 CPU and leaving overall sentence scores unchanged. This compression technique is likely to have more general application outside of machine translation, especially where only sentence-level scores are required. The next chapter presents a new search algorithm. Just as with cube pruning, improved rest costs are optional. However, the gains from improved rest costs are even larger, as seen in the next chapter.

Chapter 7

Searching Hypergraphs1

It is equally bad when one speeds on the guest unwilling to go, and when he holds back one who is hastening. Rather one should befriend the guest who is there, but speed him when he wishes. –Homer, The Odyssey

Searching a hypergraph for high-scoring hypotheses is a common subproblem in many statistical natural language processing systems (Section 2.4.2). When the score includes an N-gram language model, search is computationally difficult because the language model examines surface n-grams without regard to the hypergraph’s structure. Cube pruning (Chiang, 2007) is the most popular approach to this problem and is used in previous chapters. This chapter introduces a new search algorithm designed to improve upon cube pruning. Experiments show that the new algorithm makes machine translation decoding 1.15 to 6.60 times as fast as with cube pruning for beam sizes 20 and above.

7.1 Introduction


Figure 7.1: Cube pruning (Chiang, 2007) has no notion that hypotheses end with "a" or begin with "countries", so it generates many disfluent hypotheses containing "a countries".

Competing hypotheses often have words in common because they translate the same source span, transcribe the same speech, or recognize the same characters. Hypothesis boundary words matter to the language model because these can participate in N-grams when the hypothesis grows by concatenation, a notion formalized in Chapter 5 as state. When hypotheses have exactly the same state, they can be recombined and efficiently handled by algorithms such as cube pruning (Chiang, 2007). However, many hypothesis states share some, but not all, words. Cube pruning makes no effort to exploit these common words. An example of the problem is shown in Figure 7.1. The language model is likely to score hypotheses similarly if their

1A prior version of this work was published as Heafield et al. (2013a).

outermost words are the same. This chapter contributes a new search algorithm that exploits these observations by recursively grouping hypotheses according to common prefixes and suffixes. An example is shown in Figure 7.2.


Figure 7.2: Hypotheses are grouped by common prefixes and suffixes.

The search algorithm is based on beam search (Lowerre, 1976; Chiang, 2007). Introduced in Sec- tion 2.4.2, beam search visits hypergraph vertices in bottom-up (topological) order. Within a vertex, it selects up to k competing hypotheses (i.e. translations of the same source words with the same grammar la- bel). These can later be used by other vertices to build larger hypotheses. The parameter k is a time-accuracy trade-off: larger k increases both CPU time (because more options are considered when the vertex is used as an antecedent) and accuracy (because more hypotheses are considered). Ordinary beam search (Section 2.4.2) generates all possible hypotheses for a vertex then keeps only the top k by score. Cube pruning is based on beam search but uses a priority queue to select k hypotheses to generate instead of generating all of them. This work contributes another method, also based on a priority queue, that generates k hypotheses. The difference is that the priority queue contains groups of hypotheses and these groups are refined in a best-first fashion. Eventually, some groups are so refined that they contain a single hypothesis. The algorithm stops when it has extracted k groups that each contain a single hypothesis.

7.2 Related Work

7.2.1 Alternatives to Bottom-Up Search

Left-to-right decoding (Watanabe et al., 2006; Huang and Mi, 2010) builds output strings from left to right. Therefore, every hypothesis is bound to the beginning of sentence and has empty left language model state (Chapter 5). Recombination happens more often because only right language model state matters. The disadvantage is that hypergraph vertices are visited multiple times in different language model contexts, whereas this work visits each hypergraph vertex exactly once. In principle, this work is compatible with left-to-right decoding; because left state is empty, hypotheses will naturally only be grouped by common suffixes in their right state.

Cube growing (Huang and Chiang, 2007) generates hypotheses on top-down demand. However, cube growing showed little improvement in Moses (Xu and Koehn, 2012) and is not supported by the current version. Jane (Vilar et al., 2010) does implement cube growing, but is slower than Moses (Section 2.4.3) and their cube growing implementation is missing optimizations, including Chapter 5, so the comparison would be unfair. Cube pruning keeps buffers of hypotheses in each vertex and these buffers could be grouped by applying this work.

7.2.2 Baseline: Cube Pruning

Cube pruning (Chiang, 2007) is a fast approximate search algorithm and the primary baseline. It chooses k hypotheses by popping them off the top of a priority queue. Initially, the queue is populated with hypotheses made from the best (highest-scoring) parts. These parts are an edge and a hypothesis from each vertex

81 referenced by the edge (also known as tails). When a hypothesis is popped, several next-best alternatives are pushed. These alternatives substitute the next-best edge or a next-best hypothesis from one of the vertices. This work follows a similar pattern of popping one queue entry then pushing multiple entries. However, queue entries in this work are a group of hypotheses while cube pruning’s entries are a single hypothesis. Hypotheses are usually fully scored before being placed in the priority queue. As a consequence, cube pruning calls the language model more than k times, because some pushed hypotheses are never popped. An alternative is to score hypotheses after they are popped from the queue, so that the language model is called k times. In this alternative, hypotheses in the queue are prioritized by their additive score. The additive score is the edge’s score plus the score of each component hypothesis (non-terminal), ignoring the non-additive aspect of the language model. Additive cube pruning is implemented as an option in Moses (Hoang et al., 2009) and appears as a baseline whenever Moses is used. At a high level, this work is a compromise between both options: hypothesis groups start with additive scores and incrementally transition to full scores as words are revealed.

Figure 7.3: In normal cube pruning, multiple neighbors propose the same alternative hypothesis. This example shows a rule with two non-terminals, each of which has two possible values. The top-left option combines the best non-terminals. The bottom-left and top-right options substitute a single-best alternative. The bottom-right option is a next-best alternative to both the top-right and bottom-left hypotheses.

Gesmundo and Henderson (2010) noted that cube pruning produces duplicate queue entries. An example is shown in Figure 7.3: there are multiple ways to reach the same hypothesis. Normal cube pruning addresses this issue with a hash table that drops duplicate hypotheses before they are pushed onto the queue. Gesmundo and Henderson (2010) presented two ways to modify the algorithm to prevent duplicate queue entries, eliminating the need for a hash table. They implemented both versions in cdec (Dyer et al., 2010) and their work appears as baselines when cdec is used. Hopkins and Langmead (2009) characterized cube pruning as A* search (Hart et al., 1968) with an in- admissible heuristic. Their analysis showed deep and unbalanced search trees. This work can be interpreted as a partial rebalancing of the search trees.

7.2.3 Exact Algorithms A number of exact search algorithms have been developed (Rush and Collins, 2011; Iglesias et al., 2011; Aziz et al., 2013). None of these tractably scales to the size of hypergraphs and language models used in many modern machine translation systems (Bojar et al., 2013b). Rush and Collins (2011) compile the hypergraph and language model into an integer linear program. They then optimize by taking the dual and solving by Lagrangian relaxation. If the dual has an optimal integer solution, as it does in most cases, then a single-best translation is recovered. However, the work only dealt with language models up to order three. Almost all systems in the 2013 Workshop on Machine Translation (Bojar et al., 2013b) used models of order four or five, where the search problem is exponen- tially harder. One could exactly decode with a language model of order three, but doing so simply moves approximation from search to the objective function. Iglesias et al. (2011) represent the search space as a recursive transition network and the language model as a weighted finite state transducer. Using standard finite state algorithms, they intersect the two automatons

then exactly search for the highest-scoring paths. However, the intersected automaton is too large. The authors suggested removing low probability entries from the language model, but this form of pruning negatively impacts translation quality (Moore and Quirk, 2009; Chelba et al., 2010). Their work bears some similarity to the present work in that partially overlapping state will be collapsed and efficiently handled together. The key difference is that their work performs full intersection before grouping hypotheses together with scores. The key advantage of this work is that groups have a score which can be used for pruning before the group is expanded, enabling pruning without first constructing the intersected automaton.

Published after this work, Aziz et al. (2013) improve upon Iglesias et al. (2011) by iteratively refining the language model. Initially, the language model contains only unigrams. These unigrams are weighted by upper bounds from a full-order language model (Carter et al., 2012), effectively relaxing the model. They intersect the relaxed model with the search space, and extract a single-best hypothesis similar to Iglesias et al. (2011). In this hypothesis, they identify the most violated constraint, namely the word whose probability bound is furthest from its true conditional probability in the sentence. The language model is then refined by replacing that word’s unigram probability with bigram probabilities for every possible context. The process repeats until a sentence is extracted whose probability matches the upper bounds. Subsequent iterations can refine a bigram to trigrams, etc. The problem is that their method is very expensive, in one case taking seven hours to translate a seven-word sentence.

In contrast to Aziz et al. (2013) and their prior work Carter et al. (2012), the present work is approximate and uses average-case probabilities. If upper bounds are used with this work, then the algorithm is still approximate due to beam search and, empirically, the time-accuracy trade-off worsens. Another difference is that the present work locally refines probabilities instead of updating them in the entire hypergraph. This work is also primarily concerned with one vertex at a time while their approach operates globally. The open question is whether global reasoning can be done cheaply.

7.2.4 Coarse-to-Fine

Coarse-to-fine (Zhang and Gildea, 2008; Petrov et al., 2008) performs multiple pruning passes, each time with more detail. Search is a subroutine of coarse-to-fine and this work is inside search, so the two are compatible. There are several forms of coarse-to-fine search; the closest to this work increases the language model order each iteration. The key difference is that the present work operates inside search, so it can handle hypotheses at different levels of refinement and use scores to choose where to further refine hypotheses. Coarse-to-fine decoding cannot do this because models are refined in lock step. As Petrov et al. (2008) point out, coarse-to-fine decoding works best with word classes, suggesting that this work might also be improved by using word classes as intermediate steps in refinement.

7.3 The New Algorithm

The primary idea is to group hypotheses with similar language model state. These groups are entries in a priority queue. When a group reaches the top of the queue, the algorithm checks whether the group contains a single hypothesis. If so, the group is removed from the queue and the hypothesis is output. Otherwise, the group is removed from the queue and split into two smaller groups, both of which are pushed onto the queue. The process repeats until k hypotheses have been extracted or the priority queue is empty. The following sections formalize what these groups are (partial state), that the groups have a recursive structure (state tree), how groups are split (bread crumbs), using groups with hypergraph edges (partial edge), prioritizing search (scoring), and best-first search (priority queue). All of these are components of an algorithm that generates k hypotheses within a hypergraph vertex. Beam search calls this algorithm for each hypergraph vertex in bottom-up order. The algorithm is inductive:

each vertex assumes that vertices below have already run the algorithm and extracted k hypotheses.

7.3.1 Partial State

Recalling Chapter 5 and Table 5.1 in particular, the state of string “the few nations that have diplomatic relations with North Korea” might be denoted (the few a  ` Korea) where “the few” is left state and “Korea” is right state. The algorithm is based on partial state. Partial state is simply state with more inner words elided. For example, (the  Korea) is a partial state for (the few a  ` Korea). Empty state is denoted using the customary symbol for the empty string, ε. For example, (  ) is the empty partial state. Terminators a, `, ¡, and £ can be elided just like words.

The terminators serve two purposes. First, they distinguish whether state is short because state minimization applied, denoted a and `, or because the string has too few words, denoted ¡ and £. These distinctions impact scoring as explained in Chapter 5. Second, states can have different lengths and the terminators act like alternative words. For example, the hypothetical string “the few countries that have relations with periwinkle Korea” might have state (the few a  ` Korea) because the language model knows that “periwinkle Korea” will never extend to the right. With the same language model, the string “the few countries that have relations with North Korea” might have state (the few a  ` North Korea). The partial state (the few  Korea) reasons over both strings, which it can do because it lacks a terminator. When this partial state is refined, “`” acts as an alternative to “North”.
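As a purely illustrative aside, partial state can be represented with a structure like the following minimal Python sketch. The field names and the elide helper are introduced here for illustration and are not the thesis implementation; terminator symbols are reduced to flags.

```python
from collections import namedtuple

# Toy representation of a state: left-state words, right-state words, and
# terminator flags. A partial state elides inner words and leaves the
# terminators unrevealed (None).
State = namedtuple("State", "left left_term right right_term")

def elide(state, left_keep, right_keep):
    # Keep only the outermost left_keep / right_keep words.
    right = state.right[len(state.right) - right_keep:] if right_keep else ()
    return State(state.left[:left_keep], None, right, None)

full = State(left=("the", "few"), left_term=True, right=("Korea",), right_term=True)
print(elide(full, 1, 1))  # roughly (the ... Korea): reasons over any state it abstracts
print(elide(full, 0, 0))  # the empty partial state
```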

7.3.2 State Tree

States (the few a  ` Korea) and (the a  ` Korea) have words in common, so the partial state (the  Korea) can be used to reason over both of them. Transitively, (  ) reasons over all partial states. Formally, partial states can be placed into a hierarchy of equivalence classes.

The state tree is a hierarchy of partial states. The root of the tree is the empty partial state (  ) that reasons over all hypotheses. From the root, the tree branches by the first word of state, the last word, the second word, the penultimate word, and so on. If left or right state is exhausted, then branching continues using the remaining state. The branching order prioritizes the outermost words because these can be used to update the language model probability. The decision to start with left state is arbitrary; this is done to create a single branching order. An example tree is shown in Figure 7.4. When beam search is complete, every vertex of the hypergraph has a state tree containing up to k hypotheses. This leads to some unfortunate overlap of terminology; to disambiguate, the terms vertex and edge refer to the hypergraph while the term node refers to the state tree.

As an optimization, nodes of the state tree reveal prefixes and suffixes that all of their descendants agree upon. For example, in Figure 7.4, all hypotheses that begin with “the” also end with “Korea”. The node (the  ) recognizes that this is the case and updates itself to become (the  Korea). As a result, no node has a single child and branching happens only when there is a decision to be made. The original tree is shown in Figure 7.4 and the optimized version is shown in Figure 7.5. The optimization applies only if all descendants agree; because one hypothesis ends with “DPRK”, the root node could not reveal that (the  Korea) and (a a  Korea) both end in “Korea”.

The state tree is built lazily as shown in Figure 7.6. Initially, the root node simply has an unstructured array of hypotheses. When its children are first needed, the root node groups the hypotheses, constructs a child node for each group, and hands each hypothesis over to the appropriate child. In turn, each of these children initially holds an array of hypotheses and will only construct its children when necessary. This saves CPU time because low-scoring nodes may never construct their children.
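A lazy state tree along these lines can be sketched as follows. This is an illustrative simplification, not the released implementation: hypotheses are bare (score, left words, right words) tuples, terminators are omitted, and the prefix/suffix revealing optimization is left out.

```python
# Minimal sketch of a lazily built state tree.
class Node:
    def __init__(self, hypotheses, depth=0):
        self.hypotheses = sorted(hypotheses, reverse=True)  # best score first
        self.depth = depth        # number of boundary words already revealed
        self._children = None     # built lazily on first access

    @property
    def score(self):
        return self.hypotheses[0][0]  # maximum over descendants

    def children(self):
        if self._children is None:
            groups = {}
            for hyp in self.hypotheses:
                groups.setdefault(self._branch_key(hyp), []).append(hyp)
            self._children = sorted((Node(g, self.depth + 1) for g in groups.values()),
                                    key=lambda n: n.score, reverse=True)
        return self._children

    def _branch_key(self, hyp):
        _, left, right = hyp
        # Branching order: first word, last word, second word, penultimate
        # word, ...; an exhausted side simply stops contributing words.
        order = []
        for i in range(max(len(left), len(right))):
            if i < len(left):
                order.append(("L", left[i]))
            if i < len(right):
                order.append(("R", right[len(right) - 1 - i]))
        return order[self.depth] if self.depth < len(order) else None

hyps = [(-1.0, ("the", "few"), ("Korea",)), (-1.2, ("the",), ("Korea",)),
        (-2.0, ("some",), ("DPRK",)), (-1.6, ("a",), ("in", "Korea")),
        (-1.7, ("a",), ("Korea",))]
for child in Node(hyps).children():   # children of the root, grouped by first word
    print(child.score, child.hypotheses)
```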


Figure 7.4: A state tree containing five states: (the few a  ` Korea), (the a  ` Korea), (some a  ` DPRK), (a a  ` in Korea), and (a a  ` Korea). Nodes of the tree are partial states. The branching order is the first word, the last word, the second word, and so on. If the left or right state is exhausted, then branching continues with the remaining state. For purposes of branching, termination symbols a and ` act like normal words. This ensures that short states (¡ and £) follow a different branch after state has been revealed.


Figure 7.5: The optimized version of Figure 7.4. Nodes immediately reveal the longest shared prefix and suffix among hypotheses below them.


Figure 7.6: The state tree in Figure 7.5 is built lazily as children are needed. This is more efficient because some children may never be built.

Each node has a score. For leaves, this score is copied from the underlying hypothesis (or best hypothesis if some other feature prevented recombination). The score of an internal node is the maximum score of its children. As an example, the root node’s score is the same as the highest-scoring hypothesis in the tree. Children are sorted by score.

7.3.3 Bread Crumbs

The state tree is explored in a best-first manner. Specifically, when the algorithm visits a node, it considers that node’s best child. The best child reveals more words, so the score may go up or down when the language model is consulted. Therefore, simply following best children may lead to a poor hypothesis. Some backtracking mechanism is required; this functionality is implemented with bread crumbs.

A bread crumb is a note to return to a node and consider the other children. Thus, when a node is visited, it is split into two parts: the best remaining child and a bread crumb. The bread crumb encodes the node that was visited and how many children have already been considered. Figure 7.7 shows an example.


Figure 7.7: Visiting the root node partitions the tree into best child (the  Korea)[0+] and bread crumb (  )[1+]. The original data structure (Figure 7.5) remains intact for use elsewhere.

More formally, each node has an array of children sorted by score, so it suffices for the bread crumb to keep an index in this array. An index of zero denotes that no child has been visited. Continuing the example from Figure 7.5, (  )[0+] denotes the root partial state with children starting at index 0 (i.e. all of them). Visiting (  )[0+] yields best child (the  Korea)[0+] and bread crumb (  )[1+] as shown in Figure 7.7. Later, the search algorithm may return to (  )[1+], yielding best child (some a  ` DPRK)[0+] and bread crumb (  )[2+]. If there is no remaining sibling, there is no bread crumb and only the remaining child is produced.

The index restricts the children to those with that index or above. More formally, the hypotheses below a node are those of its children

\[ \mathrm{hypotheses}(\mathrm{node}) = \bigsqcup_{i=0}^{|\mathrm{node}|-1} \mathrm{hypotheses}(\mathrm{node}[i]) \]

where \(\bigsqcup\) takes the union of disjoint sets and node[i] is the ith child. In a bread crumb annotated with index c, only descendants of the remaining children are considered

\[ \mathrm{hypotheses}(\mathrm{node}[c{+}]) = \bigsqcup_{i=c}^{|\mathrm{node}|-1} \mathrm{hypotheses}(\mathrm{node}[i]) \]

It follows that the set of descendant hypotheses is partitioned into two disjoint sets

\[ \mathrm{hypotheses}(\mathrm{node}[c{+}]) = \mathrm{hypotheses}(\mathrm{node}[c]) \sqcup \mathrm{hypotheses}(\mathrm{node}[(c{+}1){+}]) \]

Therefore, exactly the same hypotheses are considered when a node is split; they have simply been partitioned into two sets. Continuing the example of Figure 7.5, the root node (  ) has hypotheses

hypotheses((  )) = {(the few a  ` Korea), (the a  ` Korea), (some a  ` DPRK), (a a  ` in Korea), (a a  ` Korea)}

These are partitioned into

hypotheses((the  Korea)) = {(the few a  ` Korea), (the a  ` Korea)} and hypotheses((  )[1+]) = {(some a  ` DPRK), (a a  ` in Korea), (a a  ` Korea)} as illustrated in Figure 7.7.
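The partition identity can be checked directly on toy data. The following standalone Python sketch uses invented hypothesis sets; it is meant only to illustrate that the best child and the bread crumb together cover exactly the original hypotheses.

```python
# Toy check of the partition identity: visiting node[c+] yields the best
# remaining child node[c] and the bread crumb node[(c+1)+].
def hypotheses(children, c=0):
    # Union of the (disjoint) hypothesis sets of children c, c+1, ...
    out = set()
    for child in children[c:]:
        out |= child
    return out

children = [                      # already sorted by score, best child first
    {"the few | Korea", "the | Korea"},
    {"some | DPRK"},
    {"a | in Korea", "a | Korea"},
]

c = 0
best_child = children[c]
crumb = hypotheses(children, c + 1)
assert hypotheses(children, c) == best_child | crumb   # same hypotheses...
assert not (best_child & crumb)                        # ...partitioned disjointly
print(sorted(best_child), sorted(crumb))
```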

7.3.4 Partial Edge

As mentioned in the introduction, the short-term goal of the algorithm is to select high-scoring hypotheses for a given hypergraph vertex. The hypergraph vertex has a number of outgoing (downward) hyperedges, all of which can be applied to form hypotheses. Each hypergraph edge is a string comprised of words and references to vertices (in parsing, terminals and non-terminals). A hypergraph edge is converted to a partial edge by replacing each vertex reference with the root node from that vertex. For example, the hypergraph edge “is v .” referencing vertex v becomes partial edge “is (  )[0+] .”. In general, there can be references to multiple vertices, such as “u one of v” referencing vertices u and v. Each reference is replaced by the root node from the respective vertex. For machine translation, it is worth noting that only the target-side rule in target language order matters; all necessary source information was encoded by the parser when it built the hypergraph.

Partial edges contain bread crumbs, each of which represents a set of hypotheses. A partial edge efficiently reasons over all possible combinations of these hypotheses. Partitioning a partial edge divides that set into two. Specifically, a heuristic picks one of the bread crumbs, splits it in two, and substitutes these values into the partial edge. Continuing the example from Figure 7.5, “is (  )[0+] .” partitions into “is (the  Korea)[0+] .” and “is (  )[1+] .”

Formally, partitioning proceeds as follows. A heuristic chooses one of the bread crumbs to partition, skipping any that contain a single hypothesis. If all of the bread crumbs contain a single hypothesis, then the partial edge is a complete hypothesis and it is placed into the beam. Otherwise, the heuristic picks the bread crumb with the fewest words revealed. As a tie breaker, it chooses the leftmost bread crumb. The chosen bread crumb is split, yielding the best child and another bread crumb that encodes the alternatives as described in the previous section. Two copies of the partial edge are made, one with the best child and one with the alternative.
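The choice of which bread crumb to partition can be sketched as a small selection function. The representation below (dictionaries with revealed and hypotheses fields mixed with plain words) is invented for illustration and is not the decoder’s data structure.

```python
# Pick the bread crumb to partition: skip single-hypothesis crumbs, prefer the
# fewest revealed words, break ties by taking the leftmost crumb.
def choose_crumb(partial_edge):
    candidates = [(i, item) for i, item in enumerate(partial_edge)
                  if isinstance(item, dict) and item["hypotheses"] > 1]
    if not candidates:
        return None  # every crumb holds one hypothesis: a complete hypothesis
    return min(candidates, key=lambda x: (x[1]["revealed"], x[0]))[0]

# "is (  )[0+] ." with one unexpanded bread crumb, then a two-crumb rule.
print(choose_crumb(["is", {"revealed": 0, "hypotheses": 5}, "."]))            # 1
print(choose_crumb([{"revealed": 2, "hypotheses": 3}, "one", "of",
                    {"revealed": 1, "hypotheses": 4}]))                       # 3
print(choose_crumb(["is", {"revealed": 4, "hypotheses": 1}, "."]))            # None
```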

7.3.5 Scoring

Every partial edge has a score that determines its search priority. This section explains where these scores come from. The central idea is that language model scores are estimated whenever context is indeterminate. As words are revealed, these estimated scores are updated to account for known language model context.

Hypergraph edges have a score that includes additive features, such as length and log probabilities from the translation model, and an estimated log language model probability for any words that appear in the rule. Words in an edge usually have insufficient context to be completely scored by the language model, because they are at the start of the edge or are preceded by a non-terminal. Chapter 6 described several ways in which these words can be scored. This work supports all of the methods described in Chapter 6, including

the baseline. For notational convenience, the function r abstractly refers to some method of estimating language model scores. This function can be the lower-order rest costs of Chapter 6, the baseline (r = p), the pessimistic value q from Section 6.1.2, upper bounds (Huang et al., 2012; Carter et al., 2012), or any other estimate. At a higher level, the function is used to score edges; for example, edge “is v .” incorporates estimate log r(is)r(.) into its score. The same applies to hypotheses: (the few a  ` Korea) includes estimate

log r(the)r(few | the) because the words in left state are those with insufficient context.

The score of a bread crumb is the maximum score of its descendants as defined in Section 7.3.3. For example, the bread crumb (  )[1+] has a lower score than (  )[0+] because the best child (the  Korea)[0+] and its descendants no longer contribute to the maximum. The score of partial edge “is (  )[0+] .” is the sum of scores from its parts: edge “is v .” and bread crumb (  )[0+]. In general, there may be zero or multiple bread crumbs in a partial edge; their scores are all summed. The edge’s score includes estimated log probability log r(is)r(.) as explained earlier. The bread crumb’s score comes from its highest-scoring descendant (the few a  ` Korea) and therefore includes estimate log r(the)r(few | the).

Estimates are updated as words are revealed. Continuing the example, “is (  )[0+] .” has best child “is (the  Korea)[0+] .” In this best child, the estimate r(.) is updated to r(. | Korea). Similarly, r(the) is replaced with r(the | is). Updates examine only words that have been revealed: r(few | the) remains unrevised.

Updates are computed efficiently by using pointers as described in Chapter 5. To summarize, the language model computes the ratio r(the | is)/r(the) in a single call. The term is efficient to retrieve because the search algorithm retained a pointer to the entry for “the”. Multiplying by the term (adding in log space) effectively replaces r(the) with r(the | is) as desired in the previous paragraph.
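Numerically, the update amounts to adding a log ratio, as in the following sketch with made-up probabilities.

```python
import math

# A score that already includes log r(the) is corrected by adding
# log r(the | is) - log r(the), which replaces the estimate in one step.
r_the = 0.01            # context-free estimate used when "the" was first scored
r_the_given_is = 0.04   # revised estimate once the left context "is" is revealed
other_terms = -7.0      # the rest of the partial edge's (log) score

score_with_estimate = other_terms + math.log10(r_the)
correction = math.log10(r_the_given_is) - math.log10(r_the)
updated = score_with_estimate + correction

assert math.isclose(updated, other_terms + math.log10(r_the_given_is))
print(score_with_estimate, updated)
```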

7.3.6 Priority Queue

The algorithm searches in best-first order. This is controlled by a priority queue containing partial edges. An example is shown in Table 7.1.

Partial Edge                            Score
is a (countries that  Korea .)[0+]      −3.2
(  )[1+] the (  )[0+]                   −4.5
are a country                           −4.7
is a (  )[1+]                           −6.2

Table 7.1: Example partial edges in a priority queue. There can be multiple ways to derive the same vertex, including partial edges with different numbers of bread crumbs.

The queue is populated by converting all outgoing hypergraph edges into partial edges and pushing them onto the queue. After this initialization, the algorithm loops. Each iteration begins by popping the top-scoring partial edge off the queue. If all of the bread crumbs contain a single hypothesis, then the partial

edge is converted to a hypothesis and placed in the beam. Otherwise, the partial edge is partitioned as described in Section 7.3.3. The two resulting partial edges are pushed onto the queue. Looping continues with the next iteration until the queue is empty or the beam is full. After the loop terminates, the beam is given to the root node of the state tree; other nodes will be built lazily as described in Section 7.3.2.
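The loop can be sketched as follows. The Range class is a toy stand-in for partial edges over grouped hypotheses; only the queue discipline (pop the best entry, output it if complete, otherwise partition it and push both halves) reflects the algorithm described above.

```python
import heapq

class Range:
    """Toy partial edge: a contiguous range of scores standing in for a group."""
    def __init__(self, scores):
        self.scores = sorted(scores, reverse=True)
    @property
    def score(self): return self.scores[0]
    @property
    def complete(self): return len(self.scores) == 1
    def partition(self):
        return Range(self.scores[:1]), Range(self.scores[1:])

def best_first(initial_partial_edges, k):
    # Initialize the queue with one entry per outgoing hypergraph edge.
    queue = [(-pe.score, i, pe) for i, pe in enumerate(initial_partial_edges)]
    heapq.heapify(queue)
    counter = len(queue)   # tie-breaker so heap entries stay comparable
    beam = []
    while queue and len(beam) < k:
        _, _, pe = heapq.heappop(queue)
        if pe.complete:
            beam.append(pe.score)
            continue
        for half in pe.partition():
            counter += 1
            heapq.heappush(queue, (-half.score, counter, half))
    return beam

print(best_first([Range([-3.2, -4.5, -6.2]), Range([-4.7])], k=3))
```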

7.3.7 Overall Algorithm

The preceding sections described how k hypotheses are selected within a hypergraph vertex. Overall, the algorithm visits hypergraph vertices in bottom-up order, just like beam search and cube pruning (Chiang, 2007). The algorithm runs in each vertex, making use of state trees in vertices below. The top of the tree contains full hypotheses.

If a K-best list is desired, then recombination (when language model states are completely equal) and extraction work the same way as with cube pruning (Chiang, 2007). Specifically, hypotheses remember up to K lower-scoring paths that recombined. At the end, these hypotheses can be unpacked exactly in a greedy manner because recombination ensures that all scores are additive.

7.4 Experiments

7.4.1 Scenarios

Experiments cover four language pairs, three grammar formalisms, two decoder implementations, and two ways to estimate probability given incomplete context. Due to the size of the results, details appear in Appendix A; this section presents high-level summaries.

Moses experiments use the five machine translation systems described in Section 5.6.1. This chapter also experiments with cdec (Dyer et al., 2010) using two systems: a port of the German–English hierarchical system from Moses (with comparable model scores) and a French–English translation system. The French–English system, built and described by Ammar et al. (2013), uses syntactic labels on both source and target sides of the synchronous context free grammar. This complements the existing set of systems that have very simple labels (X and S) on both sides (hierarchical) or on the source side (target syntax). The French–English system was submitted by Ammar et al. (2013) to the 2013 Workshop on Machine Translation (Bojar et al., 2013a) as CMU-TREE-TO-TREE and uses constrained data provided by the workshop. The language model has order N=4. Experiments are run on the official 3000-sentence test set.

Experiments, including baselines, use the fast linear probing data structure from Chapter 4 and all state optimizations from Chapter 5. Chapter 6 contributed different ways to score words that have incomplete context. This chapter experiments with the common-practice baseline and the lower-order (improved) rest costs. For brevity, the use of lower-order rest costs from Chapter 6 is denoted by “Rest” in plot legends and table captions.

7.4.2 Measurement

CPU time is the sum of user and system time as reported by the kernel. Loading time, defined as the time taken by the decoder on empty input, is subtracted. The French–English system uses sentence-filtered grammars; the time to load these grammars is also counted as part of the CPU time reported. The constant cost of parsing is included in CPU time. Time is averaged over all sentences in the test set.

All experiments were run on otherwise-idle, identical machines with 64 GB RAM and 32 cores, the same machines used in Chapter 3 for toolkit comparison and in all of Chapter 6. No experiment came close to running out of memory. In all cases, files were copied to local disk then forced into the operating system (Linux) disk cache before starting the clock.
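For concreteness, the timing convention reduces to the following arithmetic; the numbers below are hypothetical, not measurements from this thesis.

```python
# CPU time is user + system time, the decoder's loading time (measured on
# empty input) is subtracted, and the result is averaged over the test set.
def average_cpu_seconds(user, system, load_user, load_system, num_sentences):
    return ((user + system) - (load_user + load_system)) / num_sentences

# e.g. values as reported by the kernel for a 3000-sentence test set
print(average_cpu_seconds(user=1530.0, system=42.0,
                          load_user=55.0, load_system=12.0,
                          num_sentences=3000))
```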

Decoders were run single-threaded2 and compiled for speed as suggested in their documentation. Moses (Hoang et al., 2009) used tcmalloc (Ghemawat, 2011), which increases the speed of memory allocation. The in-memory rule table is used with Moses because it is faster than the on-disk option. Benchmarks are based on Moses revision 26d3940 (with an updated version of this work) and cdec (Dyer et al., 2010) revision 8dc383a.

Search accuracy is measured by average model score; higher is better. Only relative comparisons are meaningful because model scores have arbitrary scale and include constant factors. Beam sizes start at 5 and generally rise until a time limit determined by running ordinary cube pruning with beam size 1000. However, the French–English baseline is very slow, so baseline cube pruning was run with beam sizes up to 300.

The goal of this chapter is not to increase translation quality, but to attain the same quality faster. Nonetheless, it is important to verify that translation quality is not somehow degraded by this work. Translation quality is measured approximately with uncased BLEU (Papineni et al., 2002) and METEOR 1.4 (Denkowski and Lavie, 2011) tuned for HTER (Snover et al., 2006). The relationship between model score and automatic metric scores is generally noisy and differences are generally small. The Chinese–English system has four references while the other systems have one reference.

Feature weights for all systems were tuned toward BLEU using baseline cube pruning. They are unchanged from the original translation systems used in evaluations. This ensures that scores are comparable and eliminates a potential source of noise.

7.4.3 Comparison with Cube Pruning

Cube pruning is the primary baseline. Section 7.2.2 explained that there are multiple variants of cube pruning. Moses implements both normal cube pruning and additive cube pruning (the difference being that the latter delays calling the language model until after a hypothesis is popped from the priority queue). Both are shown as baselines when Moses is run. Gesmundo and Henderson (2010) have implemented their two algorithms in cdec. These appear in the results as “Gesmundo 1” and “Gesmundo 2”; the algorithm numbers correspond to their paper.

Time-Accuracy

The primary result is that, with the exception of beam sizes below 20 for target-syntax systems, this work improves upon the time-accuracy trade-off presented by cube pruning. Figure 7.8 shows the trade-offs for the hierarchical German–English system in Moses. Detailed results for all seven scenarios appear in Appendix A, specifically Figures A.1 through A.7 and Tables A.1 through A.7. This work, indicated by blue squares, is closer to the top left corner (higher accuracy and lower CPU time) of each plot.

Table 7.2 summarizes the results by reporting speed ratios: baseline CPU time divided by this work’s CPU time. Comparisons are made by taking every baseline data point and finding a data point from this work with equal or better accuracy, resulting in a range of speed ratios. The values were then minimized over baselines, thereby reflecting the strongest baseline. For example, cube pruning on Chinese–English with a beam size of 1000 and baseline rest costs attained a model score of −82.332 in 2.41 seconds per sentence. With rest costs and a beam size of 150, this work attained the slightly better model score of −82.330 in 0.32 second per sentence. The speed ratio is 2.41/0.32 = 7.5, but the last column of Table 7.2 reports a smaller value because additive cube pruning is the stronger baseline in this case.

2Improvement is also seen in multi-threaded Moses, but cdec only supports one thread. Running single-threaded also reduces noise in timing measurements.
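The comparison procedure can be sketched as a small function over (CPU time, model score) points; the data below are toy values, not results from the appendix.

```python
# For each baseline point, find this work's fastest point with equal or better
# model score and report the ratio of CPU times. The reported tables then take
# the minimum over all baselines so the range reflects the strongest baseline.
def speed_ratios(baseline_points, this_work_points):
    # Each point is (cpu_seconds_per_sentence, model_score); higher score is better.
    ratios = []
    for base_time, base_score in baseline_points:
        candidates = [t for t, s in this_work_points if s >= base_score]
        if candidates:
            ratios.append(base_time / min(candidates))
    return min(ratios), max(ratios)

cube = [(0.5, -82.50), (2.41, -82.332)]
this_work = [(0.10, -82.48), (0.32, -82.330)]
print(speed_ratios(cube, this_work))   # e.g. (5.0, 7.53...)
```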

[Plot: average model score versus CPU seconds/sentence for Rest + This work, Rest + Cube pruning, and Rest + Additive cube pruning.]

Figure 7.8: Time-accuracy trade-off for the hierarchical German–English system in Moses.

                              Baseline Rest Costs       Improved Rest Costs       Rest+This vs Baseline
System               Decoder  All Beams   Beam ≥ 20     All Beams   Beam ≥ 20     All Beams   Beam ≥ 20
hierarchical zh–en   Moses    2.15–2.77   2.15–2.77     2.12–3.41   2.12–3.41     1.97–6.60   2.12–6.60
hierarchical en–de   Moses    1.79–2.74   1.79–2.74     1.99–2.94   1.99–2.83     2.24–3.34   2.24–3.34
hierarchical de–en   Moses    2.01–2.97   2.01–2.86     2.35–3.04   2.35–3.04     2.50–4.68   2.50–4.68
hierarchical de–en   cdec     1.60–2.04   1.60–2.04     1.62–2.53   1.62–2.53     1.47–3.90   1.69–3.90
target syntax en–de  Moses    0.49–1.71   1.19–1.71     0.60–1.88   1.27–1.88     0.88–2.13   1.60–2.13
target syntax de–en  Moses    0.77–1.88   1.15–1.75     0.86–2.08   1.17–2.08     1.15–4.17   1.87–4.17
tree-to-tree fr–en   cdec     1.43–2.94   1.46–2.94     1.41–4.15   1.57–4.15     1.47–4.50   1.62–4.50

Table 7.2: The primary result: speed ratios relative to the best cube pruning variant for each system. Comparisons are shown for all algorithms with baseline rest costs, all algorithms with lower-order rest costs from Chapter 6, and the net gain in speed by stacking lower-order rest costs from Chapter 6 with this chapter. The second column in each scenario excludes baseline data points with beam sizes below 20.

                              Baseline Rest Costs       Improved Rest Costs       Rest+This vs Baseline
System               Decoder  All Beams   Beam ≥ 20     All Beams   Beam ≥ 20     All Beams   Beam ≥ 20
hierarchical de–en   cdec     2.04–3.08   2.04–2.77     2.23–3.34   2.23–3.16     2.19–5.18   2.55–5.18
tree-to-tree fr–en   cdec     2.78–4.04   2.78–4.03     3.55–6.37   3.61–6.37     4.55–8.23   4.55–8.23

Table 7.3: Speed ratios similar to Table 7.2 except that parsing time has been subtracted before computing the ratios.

Generally, the largest improvement is seen on hierarchical systems while smaller (and, for beam sizes below 20, negative) improvement is seen on target syntax systems. This is an instance of a more general phenomenon: when there are many hypotheses in a hypergraph vertex, this work performs well because there are more opportunities to group hypotheses together. When there are few hypotheses, the root node branches into several distinct hypotheses and the algorithm reduces to additive cube pruning. In Moses (Hoang et al., 2009), hypergraph vertices covering the same span share the same beam limit, even if there are multiple labels. Thus, in the presence of multiple labels, each hypergraph vertex generally contains fewer hypotheses. Moreover, there may be popular and unpopular labels, so even large beam sizes exhibit a mixture of small and large numbers of hypotheses in vertices. In cdec (Dyer et al., 2010), each hypergraph vertex is allowed the full beam size, explaining why the tree-to-tree system (which has even more labels) is less impacted by this issue.

Rest costs from Chapter 6 improve the performance of both this work and the baselines. The impact on this work is generally larger, as indicated by larger ratios in the “Improved Rest Costs” columns of Table 7.2.

The hierarchical German–English system was decoded with both Moses and cdec (details appear in Figures A.3 and A.4). Speed ratios are generally lower with cdec. The reason is that, as noted in Section 2.4.3, cdec is slower at parsing, so ratios are smoothed towards one. Moreover, cdec’s cube pruning implementation is slightly more accurate for low beam sizes, perhaps because it always tries the top-scoring target-side rule for each source-side rule.

Because cdec parses before searching, modifying the decoder to skip search is straightforward. Running the modified decoder produces no useful output, but timing the decoder indicates how long parsing takes. Parsing time is 0.198 second per sentence in the hierarchical German–English system and 4.685 seconds per sentence in the French–English system (including sentence-filtered grammar loading). Table 7.3 shows speed ratios after these times are subtracted. Because Moses searches on the fly during parsing and search can influence parsing (by not producing a hypothesis with a non-terminal that exists in the parse), it is harder to separate the costs, and no such value is reported for Moses.

Performance by Beam Size

While this work generally has better performance in terms of the time-accuracy trade-off, one might wonder whether these results are due to improved accuracy or increased speed. Results by beam size for the hierarchical German–English system are shown in Figures 7.9 and 7.10; detailed results appear in Figures A.8 through A.14 and Tables A.1 through A.7.

As noted in the previous section, this work is generally less accurate for small beam sizes. However, it quickly takes the lead in accuracy for all hierarchical systems, such as the German–English system shown in Figure 7.9. For syntactic systems, this work is less accurate, though improved rest costs narrow the gap (e.g. Figure A.14 in the appendix). One possible explanation is that grouping biases the search algorithm towards finding similar hypotheses instead of diverse hypotheses. Similar hypotheses have only a few grammar labels, so there are fewer chances that the grammar will find a hypothesis it likes.

This work is faster across the board, both in terms of constant and variable costs. An example is shown in Figure 7.10. Speed ranges from 1.19 times as fast as Moses additive cube pruning in hierarchical Chinese–English (Figure A.8) with beam size 750 to 3.90 times as fast as cdec’s Gesmundo and Henderson (2010) Algorithm 1 on the tree-to-tree French–English system (Figure A.14) with beam size 300. As seen in the previous section, these speed improvements more than compensate for reduced accuracy on syntactic systems.

[Plot: average model score versus beam size for Rest + This work, Rest + Cube pruning, and Rest + Additive cube pruning.]

Figure 7.9: Model score as a function of beam size for the hierarchical German–English system in Moses.

Translation Quality

This work makes translation faster but neither improves nor degrades translation quality. An example trade-off between CPU time and quality, as approximated by BLEU, is shown in Figure 7.11. Full results appear in Figures A.15 through A.21 in the appendix. Very low beam sizes lead to low quality scores. Beyond this point, the relationship is noisy. Systems were tuned towards BLEU, so the noise in those plots indicates model error. A special case is fortuitous search error, wherein search errors lead to better translation quality. Fortuitous search errors appear in Figure 7.11 and other results as a small spike in BLEU score followed by a decline. Both model error and fortuitous search error are well-known phenomena (Koehn, 2010). Nonetheless, improving search accuracy is still worthwhile both to reach the peak faster and to enable future work on improving translation models.

7.4.4 Comparison with Coarse-to-Fine

Coarse-to-fine decoding (Zhang and Gildea, 2008; Petrov et al., 2008) refers to a wide variety of techniques that perform multiple decoding passes. Each pass refines the model and prunes the search space for the next pass. This section compares with the most related version of coarse-to-fine decoding, where the language model order increases. The algorithm is implemented by cdec (Dyer et al., 2010) but not Moses (Hoang et al., 2009).

There are a number of parameters to choose: the orders of language models to use, the amount of pruning to perform in each pass, and the beam size for cube pruning. Rather than test a limited number of parameters with a large number of systems, several options are tested with one system: hierarchical German–English. Baseline rest costs are used, though coarse-to-fine decoding is provided with e.g. an actual bigram language

model for a bigram pass instead of simply using lower-order entries from the 5–gram language model. Each pass of coarse-to-fine decoding performs cube pruning.3 Experiments use the cube pruning variant that performed best on this task, Algorithm 1 from Gesmundo and Henderson (2010). Each pass except the last performs hypergraph pruning with cdec’s density pruning option. This option preserves a fixed (user-specified) number of hypotheses in each hypergraph node. Counting is performed after language model recombination. Language model recombination is based on Chapter 5 for the language model order used in the decoding pass.

[Plot: CPU seconds/sentence versus beam size for Rest + This work, Rest + Cube pruning, and Rest + Additive cube pruning.]
Figure 7.10: CPU time as a function of beam size for the hierarchical German–English system in Moses.

Zhang and Gildea (2008) suggest an initial pass with a bigram language model. Figure 7.12 shows results with a bigram pass followed by a full 5–gram pass and density pruning settings 100, 150, and 300. According to the speed ratio defined in Section 7.4.3, this work is 1.41–2.80 times as fast as coarse-to-fine decoding with an initial bigram pass. Table 7.2 reported a speed ratio of 1.60–1.98 relative to cube pruning baselines. The time-accuracy trade-off improves for low beam sizes, but degrades for higher beam sizes, where bigram pruning has removed useful hypotheses. The density pruning setting could be increased further in an effort to increase accuracy for higher beam sizes, but doing so slows the decoder down. For example, with beam size 500 and density pruning 300, coarse-to-fine decoding is actually slower than baseline cube pruning.

Given that the bigram pass prunes good hypotheses or increases CPU time, it makes less sense to use it as the first in a series of refinements. Higher accuracy is attained with a trigram pass followed by a 5–gram pass (and no bigram pass). Figure 7.12 shows the results. The differences are relatively small compared to noise. Nonetheless, there is a pattern: density pruning to 50 works best for small beam sizes but loses to the baseline on larger beam sizes, 100 works best in the middle, and 300 wins on larger beam sizes. Computing the speed ratios defined in Section 7.4.3, this work is 1.40–1.77 times as fast as coarse-to-fine decoding with an initial trigram pass, compared to a ratio of 1.60–1.98 relative to cube pruning.

3In theory, this work could be used instead of cube pruning, but the implementation currently does not support it.

[Plot: uncased BLEU versus CPU seconds/sentence for Rest + This work, Rest + Cube pruning, and Rest + Additive cube pruning.]

Figure 7.11: Trade-off between CPU time and BLEU score in the hierarchical German–English system with Moses. These results include lower-order rest costs from Chapter 6.

7.4.5 Revealing Only Left or Right Words

The state trees alternate between revealing words on the left and on the right of a partial state. In some cases, words on one side of state do not matter, so revealing them introduces spurious ambiguity. Specifically, if the target-side grammar rule begins with a non-terminal, then the left state of that non-terminal does not matter for purposes of selecting the top k beam entries. This case can be optimized with a state tree that reveals only words from right state. Similarly, non-terminals at the end of a grammar rule can be optimized with a state tree that reveals only words from left state.

Figure 7.13 shows results for the German–English hierarchical and target syntax systems, both with improved rest costs. There is essentially no change in performance. When a spurious word is revealed, the score of the partial edge does not change, so the partial edge will still be at the top of the priority queue and visited in the next iteration. Thus, this optimization mostly impacts speed by reducing roundtrips to the priority queue and handling more hypotheses in the same group. However, that speedup is counterbalanced by the cost of maintaining three separate state trees: one for left state only, one for right state only, and one that alternates.

One might expect to have seen improvement from glue rule applications S → SX by only revealing right state of S and left state of X. However, S has empty left state because glued hypotheses are built from left to right, so the only difference is that X does not reveal right state.


[Plots: average model score versus CPU seconds/sentence for This work, Gesmundo 1, and Gesmundo 1 + Coarse-to-fine with a bigram pass (top) or trigram pass (bottom) at several density settings.]

Figure 7.12: Coarse-to-fine decoding in cdec on hierarchical German–English with bigram and 5–gram passes (top) or trigram and 5–gram passes (bottom).

[Plots: average model score versus CPU seconds/sentence; series: Rest + Alternate Only and Rest + Left/Right; hierarchical system (top), target syntax system (bottom).]

Figure 7.13: Special handling of left and right state (Section 7.4.5) for non-terminals at the beginning and end of a rule had little impact on performance of Moses on the German–English hierarchical (top) and target syntax systems (bottom).

7.4.6 Memory

Peak virtual memory usage was measured before each process terminated. In the hierarchical German–English system with a beam size of 1000, this work uses 160 MB more RAM in Moses and 298 MB less RAM in cdec compared to ordinary cube pruning. The differences are generally small (especially with lower beam sizes) and minor relative to the 12–13 GB total size of the process, most of which is the phrase table and language model.

7.5 Summary

This chapter described a new search algorithm that achieves equivalent accuracy 1.15–6.60 times as fast as several cube pruning variants, including coarse-to-fine decoding, for beam sizes of at least 20. Improvement has been demonstrated across six language pairs and three grammar formalisms. The algorithm is based on grouping similar language model feature states together and dynamically expanding these groups. In doing so, it exploits the language model’s ability to estimate with incomplete information. The algorithm builds on efficient querying from Chapter 4, state handling from Chapter 5, and (optionally) the improved rest costs from Chapter 6.

Chapter 8

Conclusions and Future Work

If you talk to a man in a language he understands, that goes to his head. If you talk to him in his language, that goes to his heart. –Nelson Mandela

In making a speech one must study three points: first, the means of producing persuasion; second, the language; third the proper arrangement of the various parts of the speech. –Aristotle

This thesis contributes efficient algorithms for N-gram language modeling. Performance gains from these algorithms are useful to exploit large corpora, speed experimentation, and move research from supercomputers to production (Numen, 2012; FEAT Limited, 2012). Contributions cover every step where language models are used in a statistical system: estimating from text, querying for probabilities, exploiting independence assumptions, scoring with incomplete information, and searching for high-scoring hypotheses.

General-purpose language modeling benefits from disk-based streaming algorithms that estimate unpruned modified interpolated Kneser-Ney language models. Because the user determines how much RAM to use, dramatically larger language models can be built on a single machine. Moreover, one might expect disk-based algorithms to be slow but, surprisingly, they are faster in terms of wall time than all toolkits tested.

Many applications depend on fast language model queries. Local RAM is the fastest, but available memory can be tight. This work contributes two data structures, along with optional variants, that improve the trade-off between query speed and memory consumption. This is accomplished with custom hash tables and interpolation search. Further compression can be obtained by collapsing probability and backoff into a single value, while provably maintaining sentence-level probabilities.

Log language model probability is a common feature in many statistical systems. In addition to efficient language model representation, these systems can benefit from encoding independence assumptions that increase recombination, carrying useful information from one query to the next to increase speed, better rest costs in the presence of incomplete information that increase speed and/or search accuracy, and a new approximate search algorithm. The new search algorithm exploits the language model’s tendency to treat hypotheses similarly if they share common prefixes or suffixes.

Syntactic machine translation was the primary motivation for much of the work on decoding. Nonetheless, most of the resulting work, including the new search algorithm, still applies to lattice decoding problems that arise in other applications. Numen (2012) have applied contributions from every chapter, except Chapter 6 (Rest Costs), to lattice decoding for optical character recognition. Their language model could also be compressed as described in Chapter 6, but is currently not large enough to require such compression. Others have applied Chapter 4 (Queries) to efficiently query language models in automatic speech recognition systems (Kim et al., 2012; Si et al., 2013).

Code, dubbed KenLM (most chapters) and lazy (search), is available under the LGPL1 and already in use by the community.

8.1 Future Work

8.1.1 Estimating Language Models

Many have requested that the disk-based language model estimation algorithm be extended to prune n-grams with low adjusted counts. Support for pruning is not as trivial as removing entries. The substring property of Section 5.3.3 should be maintained in a scalable way rather than deleting and later hallucinating approximate probabilities for some entries. The join performed in Section 3.3.5 currently assumes that every n-gram is the context of an (n + 1)-gram unless it ends with the unknown word or the end of sentence marker; this property does not hold in the presence of pruning.

Interpolating language models trained on separate data is a common practice.2 Currently, researchers are dependent on SRILM (Stolcke, 2002) to produce interpolated language models, effectively enforcing the same size limitations as before. The same streaming and sorting infrastructure could be used to statically3 interpolate language models.

Kneser-Ney smoothing (Kneser and Ney, 1995) is not well-defined if there is no singleton. This can happen for part of speech language models, where smoothing is nonetheless useful for higher orders. Other smoothing algorithms such as Witten-Bell (Witten and Bell, 1991), Good-Turing (Good, 1953), and Jelinek-Mercer (Jelinek and Mercer, 1980) might be scalably implemented using the same sorting and streaming infrastructure.

The present work uses threads for different tasks, such as block sorting. This is less effective than using threads to process pieces of the data in parallel. Brants et al. (2007) have shown how the set of n-grams can be partitioned for parallel processing in some cases. Extending this work to partition data across threads should dramatically decrease wall time. More generally, a cluster version of the streaming and sorting framework would generalize MapReduce (Dean and Ghemawat, 2004) and is reminiscent of Dryad (Isard et al., 2007).

8.1.2 Language Model Data Structures

Applications such as first- or single-pass speech recognition expect the language model to recommend continuations given some history. This differs from machine translation and lattice decoding, where the words to query are given by the search space. Neither the large hash table nor the reverse trie is laid out in a way that makes recommendations efficient. Other data structures, such as a forward trie implemented by IRSTLM (Federico et al., 2008), are more efficient at recommending continuations but less efficient at performing queries.

Minimal perfect hashing (Talbot and Brants, 2008; Guthrie and Hepple, 2010; Belazzougui et al., 2008) is effective at lossily compressing language models. Prior work has largely applied minimal perfect hashing to store models smoothed with stupid backoff (Brants et al., 2007), in part because estimating Kneser-Ney models was too expensive before Chapter 3. Language models in the Katz (1987) family, including interpolated modified Kneser-Ney (Kneser and Ney, 1995; Chen and Goodman, 1998), could be stored in the same memory by applying Chapter 6 to collapse probability and backoff into a single value then quantizing

1http://kheafield.com/code/
2This practice is orthogonal to interpolating orders within the same language model, which is supported by the present implementation.
3Static interpolation builds a single model with interpolated probabilities. The alternative is dynamic interpolation, in which all models are loaded into RAM, as implemented by IRSTLM (Federico et al., 2008).

(Whittaker and Raj, 2001; Federico and Bertoldi, 2006). Doing so should substantially reduce the memory size of the 643 GB language model used in Section 3.6.4.

8.1.3 Rest Costs

Currently, the improved rest costs from Chapter 6 are stored as an additional floating-point value alongside probability and backoff. Since the rest cost is close to the conditional probability, the rest cost could be encoded by quantizing the difference between the rest cost and the conditional probability. The number of bits used to quantize should trade between search accuracy (more bits) and memory consumption (fewer bits), effectively creating a continuum between the two options. Sentence-level probabilities would not change because only rest costs are quantized.

Carter et al. (2012) defined an upper bound on language model probability by taking the maximum over all possible histories. However, the only requirement for upper bounds is that they multiply to at least the probability of the sentence. By applying the same telescoping series idea that made it possible to collapse probability and backoff into a single value, it is possible to shift some responsibility from $\mathrm{upper}(w_n \mid w_1^{n-1})$ to $\mathrm{upper}(w_{n-1} \mid w_1^{n-2})$. Thus, histories would become partly responsible for the upper bound on probability of the next word. It may be possible to tune this responsibility shifting in a way that optimizes exact search algorithms such as Aziz et al. (2013).
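One way to make the telescoping shift concrete is the following sketch, with notation introduced here only for illustration: u is a per-word upper bound and f is an arbitrary positive reweighting function fixed to 1 at sentence boundaries; neither symbol is taken from the thesis.

```latex
% Rescale each per-word bound by a telescoping factor:
%   u'(w_n | w_1^{n-1}) = u(w_n | w_1^{n-1}) f(w_1^n) / f(w_1^{n-1}).
% Over a whole sentence the factors cancel pairwise,
\[
\prod_{n=1}^{N} u'(w_n \mid w_1^{n-1})
  = \frac{f(w_1^{N})}{f(\epsilon)} \prod_{n=1}^{N} u(w_n \mid w_1^{n-1}),
\]
% so with f(\epsilon) = f(w_1^N) = 1 the sentence-level product, and hence the
% guarantee that it bounds sentence probability, is unchanged while per-word
% responsibility shifts between a word and its history.
```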

8.1.4 Search Algorithms

Beam search is not ideal: the same number of hypotheses are extracted in each vertex without regard to their diversity or usefulness to subsequent vertices. There has been some effort on varying beam sizes (Xu et al., 2013), but doing so actually made search less accurate.

Much more generally, revealing words and visiting vertices could be interwoven. As they do in the present work, parents would decide which boundary words from children are worth exploring and which are not, but the children would fill these requests lazily instead of basing them on k hypotheses. Moreover, the children need not generate complete hypotheses to fill the requests, but rather recursively request just enough boundary words from their children in order to do so. In other words, the algorithm uses a mixture of recursive top-down demand and recursive bottom-up response to search for high-scoring hypotheses. Open questions lie in the areas of where to prioritize refinement, whether parents are aggressively or lazily informed of updates to dynamic state trees in their children, and in memory-efficient ways to keep track of parts of the space that have already been explored.

In some sense, the proposed algorithm is a mixture of this work and the exact decoding algorithm of Aziz et al. (2013). The key differences from Aziz et al. (2013) are that parsing is performed once, partially evaluated hypotheses are dynamically updated instead of discarding and reconstructing them, updates are performed locally instead of every place where an n-gram appears, and the score need not be an upper bound on language model probability (so search can be exact or approximate).

Class-based language models have been reported to work with coarse-to-fine decoding (Petrov et al., 2008). The search algorithm could be extended to reveal the class of a boundary word before it reveals the actual word, thereby increasing opportunities to group hypotheses. Moreover, the search algorithm implementation could be extended to make it a compatible subroutine of coarse-to-fine decoding, just as cube pruning is currently used as a subroutine.

Bibliography

James M. Abello and Jeffrey Scott Vitter, editors. 1999. External memory algorithms. American Mathematical Society, Boston, MA, USA.
Alfred V. Aho and Jeffrey D. Ullman. 1969. Syntax directed translations and the pushdown assembler. Journal of Computer and System Sciences, 3:37–56, February.
Alexander Allauzen, Nicolas Pécheux, Quoc Khanh Do, Marco Dinarelli, Thomas Lavergne, Aurélien Max, Hai-Son Le, and François Yvon. 2013. LIMSI @ WMT13. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 62–69, Sofia, Bulgaria, August. Association for Computational Linguistics.
Alexander Allauzen. 2013. Personal communication, August.
Waleed Ammar, Victor Chahuneau, Michael Denkowski, Greg Hanneman, Wang Ling, Austin Matthews, Kenton Murray, Nicola Segall, Yulia Tsvetkov, Alon Lavie, and Chris Dyer. 2013. The CMU machine translation systems at WMT 2013: Syntax, synthetic translation options, and pseudo-references. In Proceedings of the ACL 2013 Eighth Workshop on Statistical Machine Translation, pages 70–77, Sofia, Bulgaria, August.
Austin Appleby. 2012. SMHasher & MurmurHash. https://code.google.com/p/smhasher/.
Raja Appuswamy, Christos Gkantsidis, Dushyanth Narayanan, Orion Hodson, and Antony Rowstron. 2013. Nobody ever got fired for buying a cluster. Technical Report MSR-TR-2013-2, Microsoft Research.
Wilker Aziz, Marc Dymetman, and Sriram Venkatapathy. 2013. Investigations in exact inference for hierarchical translation. In Proceedings of the ACL 2013 Eighth Workshop on Statistical Machine Translation, Sofia, Bulgaria, August.
Lalit R. Bahl, Peter F. Brown, Peter V. de Souza, and Robert L. Mercer. 1989. A tree-based statistical language model for natural language speech recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, 37(7):1001–1008.
Yehoshua Bar-Hillel, Micha Perles, and Eli Shamir. 1964. On Formal Properties of Simple Phrase Structure Grammars. Hebrew University Students’ Press.
Djamal Belazzougui, Fabiano C. Botelho, and Martin Dietzfelbinger. 2008. Hash, displace, and compress. In Proceedings of the 35th international colloquium on Automata, Languages and Programming (ICALP ’08), pages 385–396.
Nicola Bertoldi. 2013. Personal communication.
Ergun Biçici. 2013. Feature decay algorithms for fast deployment of accurate statistical machine translation systems. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 78–84, Sofia, Bulgaria, August. Association for Computational Linguistics.
Karel Bílek and Daniel Zeman. 2013. CUni multilingual matrix in the WMT 2013 shared task. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 85–91, Sofia, Bulgaria, August. Association for Computational Linguistics.
Alexandra Birch, Phil Blunsom, and Miles Osborne. 2009. A quantitative analysis of reordering phenomena. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 197–205, Athens, Greece, March. Association for Computational Linguistics.
Dina Bitton and David J DeWitt. 1983. Duplicate record elimination in large data files. ACM Transactions on Database Systems (TODS), 8(2):255–265.
Burton Bloom. 1970. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422–426, July.

Ondřej Bojar, Christian Buck, Chris Callison-Burch, Christian Federmann, Barry Haddow, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2013a. Findings of the 2013 Workshop on Statistical Machine Translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 1–44, Sofia, Bulgaria, August. Association for Computational Linguistics.
Ondřej Bojar, Christian Buck, Chris Callison-Burch, Barry Haddow, Philipp Koehn, Christof Monz, Matt Post, Herve Saint-Amand, Radu Soricut, and Lucia Specia, editors. 2013b. Proceedings of the Eighth Workshop on Statistical Machine Translation. Association for Computational Linguistics, Sofia, Bulgaria, August.
Ondřej Bojar, Rudolf Rosa, and Aleš Tamchyna. 2013c. Chimera – three heads for English-to-Czech translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 92–98, Sofia, Bulgaria, August. Association for Computational Linguistics.
Ondřej Bojar. 2013. Personal communication, July.
Alexey Borisov, Jacob Dlougach, and Irina Galinskaya. 2013. Yandex school of data analysis machine translation systems for WMT13. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 99–103, Sofia, Bulgaria, August. Association for Computational Linguistics.
Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. 2007. Large language models in machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Language Learning, pages 858–867, June.
Leo Breiman, Jerome Friedman, Charles J. Stone, and Richard A. Olshen. 1984. Classification and Regression Trees. Chapman and Hall/CRC.
Jamie Callan, Mark Hoy, Changkuk Yoo, and Le Zhao. 2009. The ClueWeb09 dataset. http://lemurproject.org/clueweb09/.
Chris Callison-Burch, Philipp Koehn, Christof Monz, and Omar Zaidan. 2011. Findings of the 2011 workshop on statistical machine translation. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 22–64, Edinburgh, Scotland, July. Association for Computational Linguistics.
Chris Callison-Burch, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2012. Findings of the 2012 workshop on statistical machine translation. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 10–51, Montréal, Canada, June. Association for Computational Linguistics.
Simon Carter, Marc Dymetman, and Guillaume Bouchard. 2012. Exact sampling and decoding in high-order hidden Markov models. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1125–1134, Jeju Island, Korea, July.
Jean-Cédric Chappelier, Martin Rajman, Ramón Aragüés, Antoine Rozenknop, et al. 1999. Lattice parsing for speech recognition. In Proc. of 6ème conférence sur le Traitement Automatique du Langage Naturel (TALN 99), pages 95–104.
Ciprian Chelba and Johan Schalkwyk. 2013. Empirical Exploration of Language Modeling for the google.com Query Stream as Applied to Mobile Voice Search, pages 197–229. Springer, New York.
Ciprian Chelba, Thorsten Brants, Will Neveitt, and Peng Xu. 2010. Study on interaction between entropy pruning and Kneser-Ney smoothing. In Proceedings of Interspeech, pages 2242–2245.
Stanley Chen and Joshua Goodman. 1998. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Harvard University, August.
Siyuan Chen, Dharitri Misra, and George R. Thoma. 2010. Efficient automatic OCR word validation using word partial format derivation and language model. In Document Recognition and Retrieval XVII: Proceedings of the SPIE, volume 7534. Colin Cherry and George Foster. 2012. Batch tuning strategies for statistical machine translation. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 427–436. Association for Computational Linguistics. David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 263–270, Ann Arbor, Michigan, June. David Chiang. 2007. Hierarchical phrase-based translation. Computational Linguistics, 33:201–228, June. Eunah Cho, Thanh-Le Ha, Mohammed Mediani, Jan Niehues, Teresa Herrmann, Isabel Slawik, and Alex Waibel. 2013. The Karlsruhe Institute of Technology translation systems for the WMT 2013. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 104–108, Sofia, Bulgaria, August. Association for Computational Linguistics.

103 Miranda Chong and Lucia Specia. 2012. Linguistic and statistical traits characterising plagiarism. In Proceedings of the 24th International Conference on Computational Linguistics, pages 195–204, Mumbai, India. Jonathan Chow. 2012. Personal communication, May. Tagyoung Chung and Michel Galley. 2012. Direct error rate minimization for statistical machine translation. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 468–479, Montréal, Canada, June. Association for Computational Linguistics. Kenneth Church, Ted Hart, and Jianfeng Gao. 2007. Compressing trigram language models with Golomb coding. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computa- tional Natural Language Learning (EMNLP-CoNLL), pages 199–207, Prague, Czech Republic, June. Association for Computational Linguistics. Philip Clarkson and Ronald Rosenfeld. 1997. Statistical language modeling using the CMU-Cambridge toolkit. In Proceedings of Eurospeech. Michael Collins. 1999. Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania. Gordon V Cormack, Mark D Smucker, and Charles LA Clarke. 2011. Efficient and effective spam filtering and re-ranking for large web datasets. Information retrieval, 14(5):441–465. Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: Simplified data processing on large clusters. In OSDI’04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, USA, 12. Roman Dementiev, Lutz Kettner, and Peter Sanders. 2008. STXXL: standard template library for XXL data sets. Software: Practice and Experience, 38(6):589–637. Kris Demuynck, Dirk Van Compernolle, and Patrick Wambacq. 2002. Doing away with the Viterbi approximation. In Proceedings of the 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 1. IEEE. Michael Denkowski and Alon Lavie. 2011. Meteor 1.3: Automatic metric for reliable optimization and evaluation of machine translation systems. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 85–91, Edinburgh, Scotland, July. Association for Computational Linguistics. Michael Denkowski. 2013. Machine translation for human translators. Ph.D. Thesis Proposal, Carnegie Mellon University, May. Richard O. Duda and Peter E. Hart. 1973. Pattern Classification and Scene Analysis. Wiley, New York, NY. Ilknur Durgar El-Kahlout and Co¸skunMermer. 2013. TÜbItak-b˙ Ilgem˙ german-english machine translation systems for w13. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 109–113, Sofia, Bulgaria, August. Association for Computational Linguistics. Nadir Durrani, Alexander Fraser, and Helmut Schmid. 2013a. Model with minimal translation units, but decode with phrases. In Proceedings of NAACL-HLT, pages 1–11. Nadir Durrani, Alexander Fraser, Helmut Schmid, Hassan Sajjad, and Richárd Farkas. 2013b. Munich-Edinburgh- Stuttgart submissions of OSM systems at WMT13. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 122–127, Sofia, Bulgaria, August. Association for Computational Linguistics. Nadir Durrani, Barry Haddow, Kenneth Heafield, and Philipp Koehn. 2013c. Edinburgh’s machine translation systems for European language pairs. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 114–121, Sofia, Bulgaria, August. Association for Computational Linguistics. 
Nadir Durrani, Barry Haddow, Kenneth Heafield, and Philipp Koehn. 2013d. Edinburgh’s machine translation sys- tems for European language pairs. In Proceedings of the ACL 2013 Eighth Workshop on Statistical Machine Translation, Sofia, Bulgaria, August. Nadir Durrani. 2013. Personal communication, July. Chris Dyer, Adam Lopez, Juri Ganitkevitch, Johnathan Weese, Ferhan Ture, Phil Blunsom, Hendra Setiawan, Vladimir Eidelman, and Philip Resnik. 2010. cdec: A decoder, alignment, and learning framework for finite-state and context-free translation models. In Proceedings of the ACL 2010 System Demonstrations, ACLDemos ’10, pages 7–12. Chris Dyer. 2013. Personal communication, June.

104 Vladimir Eidelman, Ke Wu, Ferhan Ture, Philip Resnik, and Jimmy Lin. 2013. Towards efficient large-scale feature- rich statistical machine translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 128–133, Sofia, Bulgaria, August. Association for Computational Linguistics. FEAT Limited. 2012. Voicetra+. http://voicetra-plus.jp/voicetra.html, December. Marcello Federico and Nicola Bertoldi. 2006. How many bits are needed to store probabilities for phrase-based translation? In Proceedings of the Workshop on Statistical Machine Translation, pages 94–101, New York City, June. Marcello Federico, Nicola Bertoldi, and Mauro Cettolo. 2008. IRSTLM: an open source toolkit for handling large scale language models. In Proceedings of Interspeech, Brisbane, Australia. Christian Federmann. 2007. Very large language models for machine translation. Master’s thesis, Saarland University. Lluis Formiga, Carlos A. Henríquez Q., Adolfo Hernández, José B. Mariño, Enric Monte, and José A. R. Fonollosa. 2012. The TALP-UPC phrase-based translation systems for wmt12: Morphology simplification and domain adaptation. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 275–282, Montréal, Canada, June. Association for Computational Linguistics. Lluís Formiga, Marta R. Costa-jussà, José B. Mariño, José A. R. Fonollosa, Alberto Barrón-Cedeño, and Lluis Mar- quez. 2013. The TALP-UPC phrase-based translation systems for WMT13: System combination with morphol- ogy generation, domain adaptation and corpus filtering. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 134–140, Sofia, Bulgaria, August. Association for Computational Linguistics. Michel Galley and Christopher D. Manning. 2008. A simple and effective hierarchical phrase reordering model. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’08, pages 848–856, Honolulu, Hawaii. Association for Computational Linguistics. Petra Galušcáková,ˇ Martin Popel, and Ondrejˇ Bojar. 2013. PhraseFix: Statistical post-editing of TectoMT. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 141–147, Sofia, Bulgaria, August. Association for Computational Linguistics. Jianfeng Gao and Jonathan H. Clark. 2013. Personal communication, April. Ulrich Germann, Eric Joanis, and Samuel Larkin. 2009. Tightly packed tries: How to fit large models into memory, and make them load fast, too. In Proceedings of the NAACL HLT Workshop on Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 31–39, Boulder, Colorado. Ulrich Germann. 2011. Speed of tightly packed tries. Personal communication, May. Andrea Gesmundo and James Henderson. 2010. Faster cube pruning. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT), pages 267–274. Sanjay Ghemawat. 2011. TCMalloc: Thread-caching malloc, July. https://gperftools.googlecode. com/svn/trunk/doc/tcmalloc.html. Irving Good. 1953. The population frequencies of species and the estimation of population parameters. Biometrika, 40, December. Spence Green, Daniel Cer, Kevin Reschke, Rob Voigt, John Bauer, Sida Wang, Natalia Silveira, Julia Neidert, and Christopher D. Manning. 2013. Feature-rich phrase-based translation: Stanford University’s submission to the WMT 2013 translation task. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 148–153, Sofia, Bulgaria, August. 
Association for Computational Linguistics. David Guthrie and Mark Hepple. 2010. Storing the web in memory: Space efficient language models with constant time retrieval. In Proceedings of EMNLP 2010, Los Angeles, CA. Greg Hanneman and Alon Lavie. 2013. Improving syntax-augmented machine translation by coarsening the label set. In Proceedings of NAACL-HLT 2013, Atlanta, Georgia. Association for Computational Linguistics. Peter Hart, Nils Nilsson, and Bertram Raphael. 1968. A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics, 4(2):100–107, July. Eva Hasler, Peter Bell, Arnab Ghoshal, Barry Haddow, Philipp Koehn, Fergus McInnes, Steve Renals, and Pawel Swietojanski. 2012. The UEDIN systems for the IWSLT 2012 evaluation. In Proceedings of IWSLT 2012, Hong Kong, December. Kenneth Heafield and Alon Lavie. 2010. Combining machine translation output with open source: The Carnegie Mellon multi-engine machine translation scheme. The Prague Bulletin of Mathematical Linguistics, 93:27–36.

105 Kenneth Heafield, Hieu Hoang, Philipp Koehn, Tetsuo Kiso, and Marcello Federico. 2011. Left language model state for syntactic machine translation. In Proceedings of the International Workshop on Spoken Language Translation, San Francisco, CA, USA, December. Kenneth Heafield, Philipp Koehn, and Alon Lavie. 2012. Language model rest costs and space-efficient storage. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computa- tional Natural Language Learning, Jeju Island, Korea. Kenneth Heafield, Philipp Koehn, and Alon Lavie. 2013a. Grouping language model boundary words to speed k- best extraction from hypergraphs. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Atlanta, Georgia, USA, June. Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. 2013b. Scalable modified Kneser-Ney language model estimation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Sofia, Bulgaria, August. Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, Edinburgh, UK, July. Association for Computational Linguistics. Hieu Hoang, Philipp Koehn, and Adam Lopez. 2009. A unified framework for phrase-based, hierarchical, and syntax-based statistical machine translation. In Proceedings of the International Workshop on Spoken Language Translation, pages 152–159, Tokyo, Japan. Hieu Hoang. 2013. Personal communication, July. Mark Hopkins and Greg Langmead. 2009. Cube pruning as heuristic search. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 62–71, Singapore, August. Mark Hopkins and Jonathan May. 2011. Tuning as ranking. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1352—-1362, Edinburgh, Scotland, July. Bo-June Hsu and James Glass. 2008. Iterative language model estimation: Efficient data structure & algorithms. In Proceedings of Interspeech, Brisbane, Australia. Liang Huang and David Chiang. 2007. Forest rescoring: Faster decoding with integrated language models. In Pro- ceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Prague, Czech Republic. Liang Huang and Haitao Mi. 2010. Efficient incremental decoding for tree-to-string translation. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 273–283, Cambridge, MA, October. Songfang Huang and Steve Renals. 2010. Power law discounting for n-gram language models. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’10), pages 5178–5181, Dallas, Texas, USA, March. Zhiheng Huang, Yi Chang, Bo Long, Jean-Francois Crespo, Anlei Dong, Sathiya Keerthi, and Su-Lin Wu. 2012. Iterative Viterbi A* algorithm for k-best sequential decoding. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1125– 1134, Jeju Island, Korea, July. Stéphane Huet, Elena Manishina, and Fabrice Lefèvre. 2013. Factored machine translation systems for Russian- English. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 154–157, Sofia, Bul- garia, August. Association for Computational Linguistics. 
Gonzalo Iglesias, Cyril Allauzen, William Byrne, Adrià de Gispert, and Michael Riley. 2011. Hierarchical phrase-based translation representations. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1373–1383, Edinburgh, Scotland, UK, July. Association for Computational Linguistics. Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. 2007. Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Operating Systems Review, 41(3):59–72. Daniel James. 2005. Boost.Unordered. http://www.boost.org/doc/libs/release/doc/html/unordered.html. Frederick Jelinek and Robert L. Mercer. 1980. Interpolated estimation of Markov source parameters from sparse data. In Proceedings of the Workshop on Pattern Recognition in Practice, pages 381–397, May. Frederick Jelinek, Robert L. Mercer, L.R. Bahl, and J.K. Baker. 1977. Perplexity–a measure of the difficulty of speech recognition tasks. The Journal of the Acoustical Society of America, 62.

106 Frederick Jelinek. 1997. Statistical Methods for Speech Recognition. MIT Press, Cambridge, Massachusetts. Slava Katz. 1987. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-35(3):400–401, March. Mark D. Kernighan, Kenneth W. Church, and William A. Gale. 1990. A spelling correction program based on a noisy channel model. In Proceedings of the 13th conference on computational linguistics, pages 205–210, Helsinki, Finland, August. Jungsuk Kim, Jike Chong, and Ian Lane. 2012. Efficient on-the-fly hypothesis rescoring in a hybrid GPU/CPU-based large vocabulary continuous speech recognition engine. In Proceedings of the 13th Annual Conference of the International Speech Communication Association, Portland, Oregon, September. Dan Klein and Christopher D. Manning. 2001. Parsing and hypergraphs. In Proceedings of the Seventh International Workshop on Parsing Technologies, Beijing, China, October. Reinhard Kneser and Hermann Ney. 1995. Improved backing-off for m-gram language modeling. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 181–184. Kevin Knight. 1999. Decoding complexity in word-replacement translation models. Computational Linguistics, 25:607–615. Donald E. Knuth. 1963. Notes on “open” addressing. Unpublished memorandum, July. Philipp Koehn and Barry Haddow. 2012. Towards effective use of training data in statistical machine translation. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 317–321, Montréal, Canada, June. Association for Computational Linguistics. Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of HLT-NAACL 2003, pages 48–54, Edmonton, Canada. Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrejˇ Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Annual Meeting of the Associa- tion for Computational Linguistics (ACL), Prague, Czech Republic, June. Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of MT Summit. Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press. Philipp Koehn. 2011. Personal communication, September. Okan Kolak, William Byrne, and Philip Resnik. 2003. A generative probabilistic OCR model for NLP applications. In Proceedings of HLT-NAACL 2003, Edmonton, Canada. Shankar Kumar and William Byrne. 2004. Minimum Bayes-risk decoding for statistical machine translation. In Proceedings of the Human Language Technology conference / North American chapter of the Association for Computational Linguistics annual meeting, Boston, MA. William Lewis and Sauleh Eetemadi. 2013. Dramatically reducing training data size through vocabulary saturation. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 281–291, Sofia, Bulgaria, August. Association for Computational Linguistics. Zhifei Li and Sanjeev Khudanpur. 2008. A scalable decoder for parsing-based machine translation with equivalent language model state maintenance. In Proceedings of the Second ACL Workshop on Syntax and Structure in Statistical Translation (SSST-2), pages 10–18, Columbus, Ohio, June. 
Zhifei Li, Chris Callison-Burch, Chris Dyer, Sanjeev Khudanpur, Lane Schwartz, Wren Thornton, Jonathan Weese, and Omar Zaidan. 2009. Joshua: An open source toolkit for parsing-based machine translation. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 135–139, Athens, Greece, March. Association for Computational Linguistics. Bruce T. Lowerre. 1976. The Harpy speech recognition system. Ph.D. thesis, Carnegie Mellon University, Pittsburgh, PA, USA. Evgeny Matusov and Gregor Leusch. 2013. Omnifluent English-to-French and Russian-to-English systems for the 2013 Workshop on Statistical Machine Translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 158–163, Sofia, Bulgaria, August. Association for Computational Linguistics. Eric Mays, Fred Damerau, and Robert Mercer. 1991. Context based spelling correction. Information Processing & Management, 27(5):517–522.

107 Antonio Valerio Miceli Barone and Giuseppe Attardi. 2013. Pre-reordering for machine translation using transition- based walks on dependency parse trees. In Proceedings of the Eighth Workshop on Statistical Machine Transla- tion, pages 164–169, Sofia, Bulgaria, August. Association for Computational Linguistics. Antonio Valerio Miceli Barone. 2013. Personal communication, August. Tomas Mikolov, Martin Karafiát, Lukáš Burget, Jan Cernocky, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Proceedings of Interspeech, pages 1045–1048. Behrang Mohit, Rebecca Hwa, and Alon Lavie. 2010. Using variable decoding weight for language model in statis- tical machine translation. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas (AMTA-10). Robert Moore and William Lewis. 2010. Intelligent selection of language model training data. In Proceedings of the ACL 2010 Conference Short Papers, pages 220–224, Uppsala, Sweden, July. Association for Computational Linguistics. Robert C. Moore and Chris Quirk. 2009. Less is more: Significance-based n-gram selection for smaller, better language models. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 746–755, August. Robert Moore, Douglas Appelt, John Dowding, J. Mark Gawron, and Douglas Moran. 1995. Combining linguistic and statistical knowledge sources in natural-language processing for ATIS. In Procedings of the ARPA Spoken Language Technology Workshop. Maria Nadejde, Philip Williams, and Philipp Koehn. 2013. Edinburgh’s syntax-based machine translation systems. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 170–176, Sofia, Bulgaria, August. Association for Computational Linguistics. Patrick Nguyen, Jianfeng Gao, and Milind Mahajan. 2007. MSRLM: a scalable language modeling toolkit. Technical Report MSR-TR-2007-144, Microsoft Research. Jan Niehues. 2013. Personal communication, August. Numen. 2012. Project to improve text capture. http://www.numen.fr/en/innovation-rd/ project-improve-text-capture. Franz Josef Och and Hermann Ney. 2002. Discriminative training and maximum entropy models for statistical machine translation. In Proceedings of 40th Annual Meeting of the Association for Computational Linguistics, pages 295–302, Philadelphia, Pennsylvania, USA, July. Association for Computational Linguistics. Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In ACL ’03: Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, pages 160–167, Morristown, NJ, USA. Association for Computational Linguistics. Tsuyoshi Okita, Qun Liu, and Josef van Genabith. 2013. Shallow semantically-informed PBSMT and HPBSMT. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 177–184, Sofia, Bulgaria, August. Association for Computational Linguistics. Miles Osborne. 2013. Personal communication, July. Kishore A. Papineni, Salim Roukos, and R. T. Ward. 1997. Feature-based language understanding. In Proceed- ings of the European Conference on Speech Communication and Technology, pages 1435–1438, Rhodes, Greece, September. Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evalution of machine translation. In Proceedings 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, PA, July. Robert Parker, David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 
2009. English gigaword fourth edition. LDC2009T13. Robert Parker, David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2011. English gigaword fifth edition, June. LDC2011T07. Adam Pauls and Dan Klein. 2011. Faster and smaller n-gram language models. In Proceedings of ACL, Portland, Oregon. Adam Pauls. 2011. Personal communication, May.

108 Adam Pauls. 2013. Class MakeKneserNeyArpaFromText, June. https://berkeleylm. googlecode.com/svn-history/r529/trunk/doc/edu/berkeley/nlp/lm/io/ MakeKneserNeyArpaFromText.html. Stephan Peitz, Saab Mansour, Matthias Huck, Markus Freitag, Hermann Ney, Eunah Cho, Teresa Herrmann, Mo- hammed Mediani, Jan Niehues, Alex Waibel, Alexander Allauzen, Quoc Khanh Do, Bianka Buschbeck, and Tonio Wandmacher. 2013a. Joint WMT 2013 submission of the QUAERO project. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 185–192, Sofia, Bulgaria, August. Association for Computa- tional Linguistics. Stephan Peitz, Saab Mansour, Jan-Thorsten Peter, Christoph Schmidt, Joern Wuebker, Matthias Huck, Markus Freitag, and Hermann Ney. 2013b. The RWTH Aachen machine translation system for WMT 2013. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 193–199, Sofia, Bulgaria, August. Association for Computational Linguistics. Stephan Peitz, Jörn Wübker, and Matthias Huck. 2013c. Personal communication. Yehoshua Perl, Alon Itai, and Haim Avni. 1978. Interpolation search—a log log N search. Commun. ACM, 21:550– 553, July. Slav Petrov, Aria Haghighi, and Dan Klein. 2008. Coarse-to-fine syntactic machine translation using language pro- jections. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 108–116, Honolulu, HI, USA, October. Juan Pino, Aurelien Waite, Tong Xiao, Adrià de Gispert, Federico Flego, and William Byrne. 2013. The University of Cambridge Russian-English system at WMT13. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 200–205, Sofia, Bulgaria, August. Association for Computational Linguistics. Jim Pitman and Marc Yor. 1997. The two-parameter poisson-dirichlet distribution derived from a stable subordinator. The Annals of Probability, 25(2):855–900. Jay M. Ponte and W. Bruce Croft. 1998. A language modeling approach to information retrieval. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pages 275–281, New York, NY, USA. ACM. Matt Post, Juri Ganitkevitch, Luke Orland, Jonathan Weese, Yuan Cao, and Chris Callison-Burch. 2013. Joshua 5.0: Sparser, better, faster, server. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 206–212, Sofia, Bulgaria, August. Association for Computational Linguistics. Matthew John Post. 2010. Syntax-based Language Models for Statistical Machine Translation. Ph.D. thesis, Univer- sity of Rochester. Matt Post. 2013. Personal communication, July. Bhiksha Raj and Ed Whittaker. 2003. Lossless compression of language model structure and word identifiers. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, pages 388–391. Ronald Rosenfeld. 2000. Two decades of statistical language modeling: Where do we go from here? Proceedings of the IEEE, 88(8):1270–1278. Raphael Rubino, Antonio Toral, Santiago Cortés Vaíllo, Jun Xie, Xiaofeng Wu, Stephen Doherty, and Qun Liu. 2013. The CNGL-DCU-Prompsit translation systems for WMT13. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 213–218, Sofia, Bulgaria, August. Association for Computational Linguistics. Alexander Rush and Michael Collins. 2011. Exact decoding of syntactic translation models through Lagrangian relaxation. 
In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pages 72–82, Portland, Oregon, USA, June. Hassan Sajjad, Svetlana Smekalova, Nadir Durrani, Alexander Fraser, and Helmut Schmid. 2013. QCRI-MES sub- mission at WMT13: Using transliteration mining to improve statistical machine translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 219–224, Sofia, Bulgaria, August. Association for Computational Linguistics. Lane Schwartz. 2012. An Incremental Syntactic Language Model for Statistical Phrase-based Machine Translation. Ph.D. thesis, University of Minnesota, Minneapolis, Minnesota, February. Lane Schwartz. 2013. Personal communication, February.

109 Holger Schwenk, Anthony Rousseau, and Mohammed Attik. 2012. Large, pruned or continuous space language models on a gpu for statistical machine translation. In Proceedings of the NAACL-HLT workshop on the Future of Language Modeling for HLT, pages 11–19. Claude E Shannon. 1951. Prediction and entropy of printed english. Bell system technical journal, 30(1):50–64. Yujing Si, Zhen Zhang, Qingqing Zhang, Jielin Pan, and Yonghong Yan. 2013. Discriminative language model with part-of-speech for mandarin large vocabulary continuous speech recognition system. In Proceedings of the 2nd International Conference on Computer Science and Electronics Engineering (ICCSEE 2013), pages 970–973, Paris, France, March. Atlantis Press. Khalil Sima’an. 1996. Computational complexity of probabilistic disambiguation by means of tree-grammars. In Proceedings of the 16th conference on Computational Linguistics, volume 2, pages 1175–1180. Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas (AMTA-2006), pages 223–231, Cambridge, MA, August. Andreas Stolcke. 2002. SRILM - an extensible language modeling toolkit. In Proceedings of the Seventh Interna- tional Conference on Spoken Language Processing, pages 901–904. Sara Stymne, Christian Hardmeier, Jörg Tiedemann, and Joakim Nivre. 2013. Tunable distortion limits and corpus cleaning for SMT. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 225–231, Sofia, Bulgaria, August. Association for Computational Linguistics. David Talbot and Thorsten Brants. 2008. Randomized language models via perfect hash functions. In Proceedings of ACL-08: HLT, pages 505–513. David Talbot and Miles Osborne. 2007. Randomised language modelling for statistical machine translation. In Proceedings of ACL, pages 512–519, Prague, Czech Republic. Christoph Tillmann. 2004. A unigram orientation model for statistical machine translation. In Proceedings of HLT- NAACL 2004: Short Papers, pages 101–104, Boston, Massachusetts. Xiang Tong and David A. Evans. 1996. A statistical approach to automatic OCR error correction in context. In Proceedings of the Fourth Workshop on Very Large Corpora, pages 88–100, Copenhagen, Denmark, April. Darren Erik Vengroff. 1994. A transparent parallel I/O environment. In DAGS Symposium on Parallel Computation, pages 117–134. David Vilar, Daniel Stein, Matthias Huck, and Hermann Ney. 2010. Jane: Open source hierarchical translation, extended with reordering and lexicon models. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 262–270, Uppsala, Sweden, July. Andrew Viterbi. 1967. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. Information Theory, IEEE Transactions on, 13(2):260 –269, April. Taro Watanabe, Hajime Tsukada, and Hideki Isozaki. 2006. Left-to-right target generation for hierarchical phrase- based translation. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pages 777—-784, Sydney, Australia, July. Marion Weller, Max Kisselew, Svetlana Smekalova, Alexander Fraser, Helmut Schmid, Nadir Durrani, Hassan Sajjad, and Richárd Farkas. 2013. Munich-Edinburgh-Stuttgart submissions at WMT13: Morphological and syntactic processing for SMT. 
In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 232–239, Sofia, Bulgaria, August. Association for Computational Linguistics. Edward Whittaker and Bhiksha Raj. 2001. Quantization-based language model compression. In Proceedings of Eurospeech, pages 33–36, Aalborg, Denmark, December. Ian Witten and Timothy Bell. 1991. The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory, 37(4):1085–1094. Deyi Xiong, Min Zhang, and Haizhou Li. 2011. Enhancing language models in statistical machine translation with backward n-grams and mutual information triggers. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pages 1288–1297, Portland, Oregon, June. Wenduan Xu and Philipp Koehn. 2012. Extending hiero decoding in Moses with cube growing. The Prague Bulletin of Mathematical Linguistics, 98:133–142.

110 Wenduan Xu, Yue Zhang, Philip Williams, and Philipp Koehn. 2013. Learning to prune: Context-sensitive pruning for syntactic MT. In Proceedings of The 51st Annual Meeting of the Association for Computational Linguistics, August. Kenji Yamada and Kevin Knight. 2001. A syntax-based statistical translation model. In Proceedings of the 39th Annual Meeting on Association for Computational Linguistics, pages 523–530. Association for Computational Linguistics. Mikhail Zaslavskiy, Marc Dymetman, and Nicola Cancedda. 2009. Phrase-based statistical machine translation as a traveling salesman problem. In Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP, pages 333–341, Suntec, Sinagapore, August. Daniel Zeman. 2012. Data issues of the multilingual translation matrix. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 395–400, Montréal, Canada, June. Association for Computational Linguis- tics. Richard Zens and Hermann Ney. 2008. Improvements in dynamic programming beam search for phrase-based statisti- cal machine translation. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT), Honolulu, Hawaii, October. Hao Zhang and Daniel Gildea. 2008. Efficient multi-pass decoding for synchronous context free grammars. In Proceedings of ACL-08: HLT, pages 209–217, Columbus, Ohio. Andreas Zollmann and Ashish Venugopal. 2006. Syntax augmented machine translation via chart parsing. In Pro- ceedings on the Workshop on Statistical Machine Translation, pages 138–141, New York City, June. Association for Computational Linguistics.

Appendix A

Search Experiments

This appendix contains detailed experimental results for the improved search algorithm described in Chapter 7. As explained in Section 7.4.1, there are five Moses (Hoang et al., 2009) systems, one tree-to-tree cdec (Dyer et al., 2010) system, and a port of the hierarchical German–English Moses system to cdec, for a total of seven scenarios. Each scenario was run with baseline rest costs and improved (lower-order) rest costs from Chapter 6. Baseline rest costs appear on the left half of the page with darker colors and hollow plot symbols. Improved rest costs appear on the right half of the page with lighter colors, filled plot symbols, and the word “Rest” in the plot legends and table headings.

In most cases, the top-left corner of each plot is best; for example, the new search algorithm generally achieves higher accuracy in less CPU time. The exception is plots that report CPU time as a function of beam size, where the bottom-right corner is best; the new search algorithm consistently uses less CPU time. As noted in Section 7.4.3, it is possible to measure parsing cost in cdec. The parsing cost appears as a black line whenever cdec is used (Figures A.4, A.7, A.11, A.14, A.18, and A.21).

Search accuracy shows a small jump in the German–English system with target syntax (Figure A.6) around 15 seconds. This is due to translating “durchzogenen” as “criss-crossed” instead of passing it through, which incurs a severe penalty (-100). The only rule capable of doing so translates “X durchzogenen” as “criss-crossed PP”; a direct translation rule was not extracted due to reordering. An appropriate prepositional phrase (PP) was pruned at smaller beam sizes and only became available at larger beam sizes.
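The plots that follow report average model score against CPU seconds per sentence (or against beam size). As an illustration only, the sketch below shows one way such trade-off points could be tabulated from per-sentence decoder measurements; it assumes a hypothetical log format (one line per sentence containing beam size, CPU seconds, and model score) and is not the tooling actually used for these experiments.

    # Minimal sketch (assumed log format, not the thesis's scripts): aggregate
    # per-sentence measurements into one (CPU/sentence, model score) point per beam.
    from collections import defaultdict

    def trade_off_points(log_lines):
        """Return {beam: (mean CPU seconds/sentence, mean model score)}."""
        totals = defaultdict(lambda: [0.0, 0.0, 0])  # beam -> [cpu sum, score sum, count]
        for line in log_lines:
            beam, cpu, score = line.split()
            entry = totals[int(beam)]
            entry[0] += float(cpu)
            entry[1] += float(score)
            entry[2] += 1
        return {beam: (cpu_sum / count, score_sum / count)
                for beam, (cpu_sum, score_sum, count) in totals.items()}

    # Two sentences decoded at beam 5:
    print(trade_off_points(["5 0.07 -82.9", "5 0.08 -83.0"]))
    # {5: (0.075, -82.95)} up to floating-point rounding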

[Plot: average model score versus CPU seconds/sentence; baseline rest costs on the left, improved rest costs on the right; curves for this work, cube pruning, and additive cube pruning.]

Figure A.1: Time-accuracy trade-off for Moses on hierarchical Chinese–English.

[Plot: average model score versus CPU seconds/sentence; baseline rest costs on the left, improved rest costs on the right; curves for this work, cube pruning, and additive cube pruning.]

Figure A.2: Time-accuracy trade-off for Moses on hierarchical English–German.

[Plot: average model score versus CPU seconds/sentence; baseline rest costs on the left, improved rest costs on the right; curves for this work, cube pruning, and additive cube pruning.]

Figure A.3: Time-accuracy trade-off for Moses on hierarchical German–English.

[Plot: average model score versus CPU seconds/sentence; baseline rest costs on the left, improved rest costs on the right; curves for this work, Gesmundo 1, Gesmundo 2, and cube pruning.]

Figure A.4: Time-accuracy trade-off for cdec on hierarchical German–English. The black line indicates parsing cost.

[Plot: average model score versus CPU seconds/sentence; baseline rest costs on the left, improved rest costs on the right; curves for this work, cube pruning, and additive cube pruning.]

Figure A.5: Time-accuracy trade-off for Moses on English–German with target syntax.

[Plot: average model score versus CPU seconds/sentence; baseline rest costs on the left, improved rest costs on the right; curves for this work, cube pruning, and additive cube pruning.]

Figure A.6: Time-accuracy trade-off for Moses on German–English with target syntax. As noted at the end of Section 7.4.3, the model score jumps when a way is found to translate “durchzogenen” as “criss-crossed”.

[Plot: average model score versus CPU seconds/sentence; baseline rest costs on the left, improved rest costs on the right; curves for this work, Gesmundo 1, Gesmundo 2, and cube pruning.]

Figure A.7: Time-accuracy trade-off for cdec on tree-to-tree French–English. Due to computational costs, cube pruning was run with beam sizes up to 300. The black line indicates parsing cost.

[Plots: average model score and CPU seconds/sentence as functions of beam size; baseline rest costs on the left, improved rest costs on the right; curves for this work, cube pruning, and additive cube pruning.]

Figure A.8: Performance of Moses on hierarchical Chinese–English translation by beam size.

[Plots: average model score and CPU seconds/sentence as functions of beam size; baseline rest costs on the left, improved rest costs on the right; curves for this work, cube pruning, and additive cube pruning.]

Figure A.9: Performance of Moses on hierarchical English–German translation by beam size.

[Plots: average model score and CPU seconds/sentence as functions of beam size; baseline rest costs on the left, improved rest costs on the right; curves for this work, cube pruning, and additive cube pruning.]

Figure A.10: Performance of Moses on hierarchical German–English translation by beam size.

[Plots: average model score and CPU seconds/sentence as functions of beam size; baseline rest costs on the left, improved rest costs on the right; curves for this work, Gesmundo 1, Gesmundo 2, and cube pruning.]

Figure A.11: Performance of cdec on hierarchical German–English translation by beam size. The black line indicates parsing cost.

[Plots: average model score and CPU seconds/sentence as functions of beam size; baseline rest costs on the left, improved rest costs on the right; curves for this work, cube pruning, and additive cube pruning.]

Figure A.12: Performance of Moses on English–German with target syntax translation by beam size.

[Plots: average model score and CPU seconds/sentence as functions of beam size; baseline rest costs on the left, improved rest costs on the right; curves for this work, cube pruning, and additive cube pruning.]

Figure A.13: Performance of Moses on German–English with target syntax translation by beam size.

[Plots: average model score and CPU seconds/sentence as functions of beam size; baseline rest costs on the left, improved rest costs on the right; curves for this work, Gesmundo 1, Gesmundo 2, and cube pruning.]

Figure A.14: Performance of cdec on tree-to-tree French–English translation by beam size. Due to computational costs, cube pruning was run with beam sizes up to 300. The black line indicates parsing cost.

[Plots: uncased BLEU and METEOR 1.4 as functions of CPU seconds/sentence; baseline rest costs on the left, improved rest costs on the right; curves for this work, cube pruning, and additive cube pruning.]

Figure A.15: Time-quality trade-offs for Moses on hierarchical Chinese–English. The test set has four references.

[Plots: uncased BLEU and METEOR 1.4 as functions of CPU seconds/sentence; baseline rest costs on the left, improved rest costs on the right; curves for this work, cube pruning, and additive cube pruning.]

Figure A.16: Time-quality trade-offs for Moses on hierarchical English–German.

[Plots: uncased BLEU and METEOR 1.4 as functions of CPU seconds/sentence; baseline rest costs on the left, improved rest costs on the right; curves for this work, cube pruning, and additive cube pruning.]

Figure A.17: Time-quality trade-offs for Moses on hierarchical German–English.

[Plots: uncased BLEU and METEOR 1.4 as functions of CPU seconds/sentence; baseline rest costs on the left, improved rest costs on the right; curves for this work, Gesmundo 1, Gesmundo 2, and cube pruning.]

Figure A.18: Time-quality trade-offs for cdec on hierarchical German–English. The black line indicates parsing cost.

[Plots: uncased BLEU and METEOR 1.4 as functions of CPU seconds/sentence; baseline rest costs on the left, improved rest costs on the right; curves for this work, cube pruning, and additive cube pruning.]

Figure A.19: Time-quality trade-offs for Moses on English–German with target syntax.

[Plots: uncased BLEU and METEOR 1.4 as functions of CPU seconds/sentence; baseline rest costs on the left, improved rest costs on the right; curves for this work, cube pruning, and additive cube pruning.]

Figure A.20: Time-quality trade-offs for Moses on German–English with target syntax.

[Plots: uncased BLEU and METEOR 1.4 as functions of CPU seconds/sentence; baseline rest costs on the left, improved rest costs on the right; curves for this work, Gesmundo 1, Gesmundo 2, and cube pruning.]

Figure A.21: Time-quality trade-offs for cdec on tree-to-tree French–English. Due to computational costs, cube pruning was run with beam sizes up to 300. The black line indicates parsing cost.

Beam   This work (CPU Model BLEU)   Cube pruning (CPU Model BLEU)   Additive cube (CPU Model BLEU)
5   0.073 −82.953 27.34   0.204 −82.939 27.93   0.156 −83.253 26.39
10   0.076 −82.780 28.07   0.222 −82.803 28.65   0.181 −83.008 27.68
15   0.114 −82.696 28.55   0.239 −82.735 28.97   0.195 −82.892 27.95
20   0.092 −82.646 28.69   0.266 −82.694 29.07   0.205 −82.823 28.30
25   0.098 −82.612 29.07   0.262 −82.660 29.22   0.184 −82.768 28.65
50   0.125 −82.521 29.32   0.324 −82.569 29.65   0.237 −82.632 29.24
75   0.158 −82.475 29.43   0.391 −82.520 29.86   0.314 −82.575 29.36
100   0.179 −82.447 29.61   0.431 −82.487 30.12   0.341 −82.530 29.72
150   0.244 −82.414 29.57   0.532 −82.450 29.98   0.387 −82.481 29.72
200   0.292 −82.395 29.71   0.648 −82.428 30.04   0.498 −82.455 29.76
300   0.400 −82.368 29.77   0.885 −82.400 29.95   0.657 −82.420 29.94
400   0.526 −82.353 29.76   1.085 −82.380 29.91   0.780 −82.396 29.82
500   0.627 −82.342 29.74   1.288 −82.366 30.05   0.941 −82.379 29.89
750   0.847 −82.327 29.89   1.860 −82.346 30.12   1.298 −82.356 30.00
1000   1.146 −82.317 29.87   2.407 −82.332 30.13   1.684 −82.341 29.95

Beam   Rest+This work (CPU Model BLEU)   Rest+Cube pruning (CPU Model BLEU)   Rest+Additive cube (CPU Model BLEU)
5   0.087 −82.750 28.66   0.218 −82.896 28.26   0.182 −83.186 27.25
10   0.092 −82.604 29.09   0.234 −82.773 29.00   0.190 −82.939 28.01
20   0.116 −82.503 29.59   0.257 −82.665 29.24   0.211 −82.732 28.71
25   0.122 −82.474 29.56   0.281 −82.635 29.31   0.214 −82.686 29.01
50   0.224 −82.404 29.66   0.336 −82.544 29.80   0.284 −82.557 29.24
75   0.214 −82.372 29.77   0.397 −82.500 29.99   0.320 −82.498 29.56
100   0.242 −82.354 29.74   0.463 −82.470 29.88   0.352 −82.465 29.60
150   0.322 −82.330 29.90   0.574 −82.433 29.87   0.447 −82.421 29.69
200   0.387 −82.316 29.95   0.695 −82.408 29.96   0.515 −82.397 29.75
300   0.545 −82.297 29.96   0.916 −82.377 29.92   0.670 −82.365 29.83
400   0.680 −82.289 29.93   1.154 −82.359 29.85   0.864 −82.345 29.78
500   0.818 −82.283 29.95   1.401 −82.346 29.89   1.013 −82.333 29.89
750   1.189 −82.272 29.95   1.978 −82.326 29.81   1.409 −82.314 29.84
1000   1.552 −82.265 30.00   2.569 −82.312 29.95   1.858 −82.300 29.94

Table A.1: Numerical results for Moses on hierarchical Chinese–English at select beam sizes. The test set has four references.
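Comparisons of the form “as fast at the same accuracy” can be read off tables like this one by asking how much CPU time a baseline needs to reach the model score that the new algorithm attains at some operating point. The sketch below illustrates that idea with linear interpolation along a baseline curve, using three rows copied from Table A.1 (baseline rest costs, beams 5, 10, and 15); it is an illustrative calculation under the assumption that model score increases monotonically along each curve, not necessarily how the speedups reported elsewhere in this thesis were computed.

    # Illustrative sketch: interpolate a baseline's time-accuracy curve to find the
    # CPU time at which it matches a target average model score.
    def cpu_at_score(curve, target_score):
        """curve: (cpu, score) points with score increasing along the curve.
        Return the interpolated CPU seconds/sentence at which the curve reaches
        target_score, or None if it never does."""
        prev_cpu = prev_score = None
        for cpu, score in curve:
            if score >= target_score:
                if prev_cpu is None:
                    return cpu  # target reached at or before the first point
                fraction = (target_score - prev_score) / (score - prev_score)
                return prev_cpu + fraction * (cpu - prev_cpu)
            prev_cpu, prev_score = cpu, score
        return None

    # Rows from Table A.1, baseline rest costs, beams 5, 10, and 15:
    this_work = [(0.073, -82.953), (0.076, -82.780), (0.114, -82.696)]
    cube_pruning = [(0.204, -82.939), (0.222, -82.803), (0.239, -82.735)]
    target = -82.780  # average model score reached by this work at 0.076 s
    print(cpu_at_score(cube_pruning, target))  # about 0.228 s, roughly 3 times slower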

Beam   This work (CPU Model BLEU)   Cube pruning (CPU Model BLEU)   Additive cube (CPU Model BLEU)
5   0.101 −79.563 14.93   0.306 −79.481 15.18   0.251 −79.721 14.76
10   0.104 −79.434 15.17   0.299 −79.397 15.30   0.276 −79.541 15.07
15   0.113 −79.382 15.24   0.334 −79.357 15.34   0.284 −79.463 15.18
20   0.119 −79.352 15.29   0.324 −79.332 15.36   0.253 −79.416 15.21
25   0.112 −79.331 15.30   0.356 −79.314 15.38   0.309 −79.385 15.28
50   0.139 −79.275 15.31   0.413 −79.273 15.40   0.337 −79.314 15.28
75   0.163 −79.253 15.30   0.480 −79.253 15.39   0.383 −79.281 15.25
100   0.189 −79.239 15.27   0.504 −79.241 15.39   0.417 −79.262 15.25
150   0.224 −79.223 15.24   0.633 −79.226 15.32   0.489 −79.240 15.24
200   0.272 −79.214 15.26   0.708 −79.217 15.32   0.521 −79.228 15.24
300   0.366 −79.203 15.26   0.956 −79.207 15.26   0.694 −79.214 15.26
400   0.446 −79.198 15.24   1.184 −79.201 15.27   0.840 −79.205 15.27
500   0.531 −79.193 15.24   1.372 −79.197 15.27   0.991 −79.201 15.26
750   0.737 −79.187 15.24   1.956 −79.191 15.28   1.316 −79.193 15.26
1000   0.945 −79.184 15.26   2.525 −79.187 15.26   1.671 −79.188 15.28

Beam   Rest+This work (CPU Model BLEU)   Rest+Cube pruning (CPU Model BLEU)   Rest+Additive cube (CPU Model BLEU)
5   0.097 −79.465 14.96   0.269 −79.454 15.13   0.256 −79.662 14.71
10   0.113 −79.366 15.17   0.335 −79.375 15.35   0.283 −79.495 15.02
15   0.109 −79.324 15.26   0.330 −79.341 15.30   0.261 −79.422 15.19
20   0.119 −79.300 15.30   0.366 −79.320 15.36   0.311 −79.380 15.20
25   0.132 −79.284 15.29   0.350 −79.302 15.31   0.285 −79.353 15.23
50   0.151 −79.244 15.29   0.404 −79.263 15.37   0.335 −79.289 15.28
75   0.182 −79.228 15.29   0.431 −79.244 15.41   0.393 −79.261 15.30
100   0.208 −79.217 15.32   0.497 −79.233 15.39   0.424 −79.246 15.31
150   0.257 −79.206 15.24   0.655 −79.220 15.34   0.471 −79.228 15.27
200   0.314 −79.199 15.18   0.726 −79.211 15.34   0.550 −79.217 15.24
300   0.394 −79.192 15.19   1.011 −79.203 15.31   0.707 −79.205 15.22
400   0.500 −79.188 15.23   1.195 −79.198 15.28   0.785 −79.198 15.22
500   0.623 −79.186 15.23   1.442 −79.194 15.24   1.028 −79.194 15.20
750   0.868 −79.182 15.22   2.017 −79.189 15.28   1.412 −79.188 15.21
1000   1.131 −79.180 15.21   2.546 −79.186 15.24   1.760 −79.185 15.21

Table A.2: Numerical results for Moses on hierarchical English–German at select beam sizes.

Beam   This work (CPU Model BLEU)   Cube pruning (CPU Model BLEU)   Additive cube (CPU Model BLEU)
5   0.075 −101.799 21.26   0.267 −101.692 21.64   0.224 −101.982 20.93
10   0.085 −101.663 21.68   0.252 −101.601 21.85   0.234 −101.778 21.39
15   0.082 −101.598 21.82   0.268 −101.560 22.03   0.215 −101.686 21.71
20   0.094 −101.563 21.92   0.283 −101.534 22.05   0.231 −101.635 21.85
25   0.104 −101.537 22.07   0.291 −101.516 22.07   0.234 −101.600 21.91
50   0.123 −101.477 22.23   0.375 −101.470 22.17   0.249 −101.517 22.00
75   0.147 −101.450 22.19   0.400 −101.448 22.20   0.311 −101.481 22.09
100   0.163 −101.435 22.15   0.487 −101.435 22.23   0.339 −101.461 22.08
150   0.209 −101.419 22.19   0.595 −101.420 22.23   0.445 −101.437 22.13
200   0.257 −101.410 22.17   0.711 −101.411 22.23   0.493 −101.425 22.16
300   0.323 −101.399 22.22   0.938 −101.401 22.19   0.575 −101.411 22.20
400   0.439 −101.393 22.25   1.171 −101.395 22.18   0.821 −101.403 22.23
500   0.521 −101.389 22.25   1.376 −101.391 22.17   0.947 −101.397 22.23
750   0.724 −101.383 22.22   1.838 −101.385 22.18   1.281 −101.390 22.22
1000   0.958 −101.380 22.20   2.426 −101.381 22.16   1.683 −101.386 22.19
1250   1.164 −101.378 22.21   2.922 −101.379 22.17   1.999 −101.382 22.16
1500   1.387 −101.376 22.20   3.554 −101.378 22.19   2.429 −101.380 22.16

Beam   Rest+This work (CPU Model BLEU)   Rest+Cube pruning (CPU Model BLEU)   Rest+Additive cube (CPU Model BLEU)
5   0.083 −101.678 21.76   0.239 −101.656 21.97   0.205 −101.893 21.48
10   0.082 −101.569 22.09   0.255 −101.575 22.20   0.229 −101.710 21.80
15   0.091 −101.524 22.23   0.251 −101.537 22.25   0.236 −101.629 21.89
20   0.098 −101.496 22.25   0.285 −101.515 22.25   0.249 −101.583 22.03
25   0.112 −101.479 22.21   0.298 −101.499 22.22   0.238 −101.554 22.06
50   0.136 −101.435 22.29   0.370 −101.457 22.29   0.283 −101.483 22.12
75   0.163 −101.416 22.23   0.406 −101.437 22.25   0.338 −101.455 22.17
100   0.193 −101.406 22.27   0.498 −101.425 22.26   0.378 −101.439 22.18
150   0.249 −101.394 22.26   0.616 −101.412 22.31   0.425 −101.419 22.20
200   0.300 −101.388 22.23   0.669 −101.404 22.31   0.492 −101.408 22.23
300   0.427 −101.381 22.24   0.914 −101.395 22.28   0.687 −101.396 22.24
400   0.536 −101.378 22.22   1.133 −101.389 22.21   0.793 −101.390 22.24
500   0.639 −101.376 22.21   1.398 −101.385 22.22   0.936 −101.386 22.21
750   0.887 −101.373 22.21   1.991 −101.378 22.19   1.352 −101.380 22.19
1000   1.169 −101.370 22.18   2.519 −101.376 22.19   1.779 −101.377 22.19
1250   1.446 −101.370 22.20   3.168 −101.374 22.18   2.081 −101.375 22.18
1500   1.730 −101.369 22.20   3.727 −101.373 22.18   2.620 −101.373 22.21

Table A.3: Numerical results for Moses on hierarchical German–English at select beam sizes.

Beam   This work (CPU Model BLEU)   Cube pruning (CPU Model BLEU)   Gesmundo 1 (CPU Model BLEU)   Gesmundo 2 (CPU Model BLEU)
5   0.307 −101.799 21.26   0.547 −101.637 21.82   0.515 −101.639 21.82   0.488 −101.640 21.81
10   0.286 −101.662 21.68   0.556 −101.574 21.95   0.498 −101.576 21.91   0.509 −101.577 21.91
15   0.292 −101.598 21.82   0.588 −101.537 21.97   0.508 −101.539 21.94   0.528 −101.541 21.94
20   0.299 −101.563 21.91   0.566 −101.518 21.99   0.530 −101.519 22.00   0.514 −101.521 22.00
25   0.308 −101.537 22.07   0.614 −101.501 22.08   0.537 −101.502 22.08   0.556 −101.504 22.09
50   0.326 −101.476 22.22   0.654 −101.462 22.13   0.589 −101.463 22.15   0.598 −101.464 22.15
75   0.358 −101.450 22.19   0.707 −101.443 22.20   0.648 −101.444 22.18   0.633 −101.445 22.21
100   0.375 −101.435 22.15   0.773 −101.432 22.22   0.767 −101.433 22.21   0.707 −101.433 22.19
150   0.434 −101.419 22.19   0.859 −101.418 22.23   0.814 −101.419 22.23   0.796 −101.419 22.21
200   0.465 −101.410 22.17   0.962 −101.409 22.20   0.862 −101.411 22.22   0.908 −101.411 22.19
300   0.573 −101.399 22.22   1.241 −101.400 22.23   1.094 −101.401 22.23   1.132 −101.402 22.18
400   0.696 −101.393 22.25   1.419 −101.395 22.21   1.291 −101.396 22.22   1.359 −101.395 22.16
500   0.768 −101.389 22.25   1.698 −101.391 22.21   1.498 −101.391 22.22   1.541 −101.391 22.18
750   1.026 −101.383 22.22   2.266 −101.385 22.17   2.030 −101.385 22.21   2.092 −101.385 22.19
1000   1.308 −101.380 22.20   2.873 −101.381 22.20   2.518 −101.382 22.21   2.695 −101.382 22.18
1250   1.588 −101.377 22.21   3.560 −101.379 22.16   3.028 −101.379 22.18   3.335 −101.379 22.18

Beam   Rest+This work (CPU Model BLEU)   Rest+Cube pruning (CPU Model BLEU)   Rest+Gesmundo 1 (CPU Model BLEU)   Rest+Gesmundo 2 (CPU Model BLEU)
5   0.301 −101.678 21.76   0.581 −101.598 22.13   0.559 −101.600 22.13   0.524 −101.601 22.13
10   0.309 −101.569 22.09   0.590 −101.541 22.25   0.606 −101.543 22.24   0.567 −101.545 22.23
15   0.310 −101.524 22.23   0.588 −101.513 22.26   0.549 −101.515 22.29   0.555 −101.517 22.28
20   0.305 −101.496 22.25   0.593 −101.496 22.26   0.535 −101.498 22.26   0.543 −101.499 22.23
25   0.342 −101.479 22.21   0.603 −101.483 22.20   0.608 −101.485 22.22   0.578 −101.486 22.22
50   0.345 −101.435 22.29   0.652 −101.447 22.24   0.600 −101.448 22.24   0.598 −101.449 22.23
75   0.398 −101.416 22.23   0.771 −101.431 22.22   0.643 −101.432 22.18   0.716 −101.432 22.18
100   0.400 −101.406 22.27   0.779 −101.419 22.21   0.737 −101.421 22.17   0.697 −101.422 22.21
150   0.479 −101.394 22.26   0.937 −101.408 22.29   0.790 −101.409 22.25   0.839 −101.408 22.25
200   0.564 −101.388 22.23   0.999 −101.400 22.28   0.905 −101.401 22.22   0.916 −101.401 22.22
300   0.645 −101.381 22.24   1.290 −101.392 22.26   1.086 −101.393 22.21   1.130 −101.392 22.19
400   0.786 −101.378 22.22   1.511 −101.387 22.21   1.394 −101.388 22.18   1.404 −101.387 22.16
500   0.892 −101.376 22.21   1.719 −101.383 22.19   1.524 −101.384 22.17   1.635 −101.383 22.17
750   1.243 −101.372 22.20   2.413 −101.377 22.17   1.993 −101.378 22.15   2.135 −101.377 22.17
1000   1.543 −101.370 22.18   2.950 −101.375 22.17   2.541 −101.375 22.17   2.701 −101.374 22.17

Table A.4: Numerical results for cdec on hierarchical German–English at select beam sizes.

Beam   This work (CPU Model BLEU)   Cube pruning (CPU Model BLEU)   Additive cube (CPU Model BLEU)
5   0.568 −13.662 14.48   1.331 −13.425 15.09   1.031 −13.818 14.18
10   1.301 −13.524 14.64   3.000 −13.343 15.18   2.186 −13.629 14.60
15   1.979 −13.454 14.80   5.145 −13.304 15.21   3.671 −13.546 14.67
20   2.745 −13.417 14.92   6.801 −13.281 15.20   4.694 −13.492 14.84
25   3.068 −13.389 14.96   7.928 −13.267 15.19   5.568 −13.454 14.85
50   4.740 −13.318 14.98   9.772 −13.228 15.24   6.926 −13.359 15.11
75   4.800 −13.286 15.09   11.875 −13.209 15.28   8.192 −13.314 15.09
100   5.736 −13.264 15.06   12.946 −13.198 15.18   9.184 −13.288 15.10
200   6.804 −13.224 15.06   14.804 −13.172 15.19   11.215 −13.238 15.05
300   7.177 −13.204 15.13   16.679 −13.161 15.17   11.429 −13.215 15.07
400   8.293 −13.191 15.07   17.101 −13.154 15.16   12.522 −13.201 15.07
500   8.076 −13.182 15.13   17.260 −13.149 15.14   12.494 −13.190 15.15
750   8.671 −13.169 15.14   20.917 −13.142 15.10   13.598 −13.174 15.16
1000   9.810 −13.160 15.09   20.206 −13.137 15.10   14.652 −13.164 15.15

              Rest+This work            Rest+Cube pruning         Rest+Additive cube
Beam     CPU    Model   BLEU       CPU    Model   BLEU       CPU    Model   BLEU
   5   0.533  −13.538  14.72     1.292  −13.382  15.08     0.977  −13.741  14.39
  10   1.515  −13.414  14.81     3.137  −13.303  15.16     2.399  −13.560  14.70
  15   2.170  −13.360  14.88     5.240  −13.269  15.19     3.542  −13.477  14.75
  20   3.153  −13.329  14.88     6.718  −13.249  15.21     4.652  −13.426  14.83
  25   3.573  −13.307  14.88     8.158  −13.236  15.21     5.840  −13.393  14.84
  75   5.309  −13.227  15.05    12.784  −13.185  15.19     8.547  −13.270  15.07
 100   6.088  −13.209  15.08    13.880  −13.175  15.18     8.720  −13.248  15.01
 150   7.173  −13.190  15.08    13.679  −13.162  15.16    10.206  −13.220  15.07
 200   7.079  −13.179  15.03    15.638  −13.154  15.14    10.006  −13.204  15.02
 300   7.942  −13.164  15.07    16.773  −13.145  15.13    11.850  −13.185  15.11
 400   8.174  −13.155  15.07    17.088  −13.139  15.13    12.137  −13.174  15.11
 500   8.639  −13.149  15.05    18.698  −13.135  15.13    12.327  −13.166  15.09
 750   9.836  −13.139  15.06    19.663  −13.129  15.11    13.846  −13.152  15.11
1000  10.250  −13.133  15.06    19.814  −13.124  15.11    14.062  −13.143  15.12

Table A.5: Numerical results for Moses on English–German with target syntax at select beam sizes.

              This work                  Cube pruning               Additive cube
Beam     CPU     Model   BLEU       CPU     Model   BLEU       CPU     Model   BLEU
   5   0.393  −105.376  20.24     0.740  −105.124  20.89     0.784  −105.542  19.84
  10   1.130  −104.963  20.62     2.035  −104.743  21.11     1.862  −105.032  20.46
  15   1.957  −104.767  20.82     3.814  −104.611  21.18     3.292  −104.812  20.67
  20   2.589  −104.495  20.97     4.704  −104.426  21.24     4.238  −104.561  20.76
  25   2.782  −104.467  21.07     6.455  −104.381  21.29     4.975  −104.521  20.87
  50   3.932  −104.397  21.23     8.864  −104.313  21.36     7.246  −104.428  21.10
  75   5.332  −104.365  21.27    11.348  −104.297  21.34     8.638  −104.386  21.19
 100   5.748  −104.344  21.29    12.822  −104.287  21.33     9.448  −104.361  21.28
 150   7.105  −104.320  21.27    13.151  −104.274  21.31    10.331  −104.331  21.28
 200   7.320  −104.308  21.28    15.943  −104.268  21.25    11.727  −104.314  21.33
 300   8.619  −104.292  21.28    16.270  −104.259  21.32    11.941  −104.297  21.35
 400   8.604  −104.283  21.35    16.791  −104.254  21.33    13.140  −104.286  21.37
 500  10.138  −104.276  21.35    18.870  −104.251  21.32    13.656  −104.278  21.35
 750  10.454  −104.265  21.38    20.952  −104.244  21.36    15.692  −104.267  21.33
1000  11.965  −104.260  21.37    22.495  −104.241  21.39    16.809  −104.260  21.33

              Rest+This work             Rest+Cube pruning          Rest+Additive cube
Beam     CPU     Model   BLEU       CPU     Model   BLEU       CPU     Model   BLEU
   5   0.442  −105.093  20.92     0.802  −104.962  21.07     0.676  −105.306  20.49
  10   1.150  −104.817  21.06     2.213  −104.721  21.26     1.829  −104.958  20.83
  15   1.774  −104.668  21.16     3.735  −104.627  21.30     3.049  −104.779  21.00
  20   2.516  −104.375  21.22     5.846  −104.316  21.33     4.204  −104.468  21.08
  25   3.101  −104.354  21.20     6.027  −104.304  21.30     5.162  −104.434  21.10
  50   3.988  −104.239  21.39     9.026  −104.239  21.38     6.997  −104.287  21.23
  75   5.206  −104.215  21.39    10.954  −104.193  21.38     8.303  −104.252  21.34
 100   5.562  −104.201  21.39    12.436  −104.183  21.40     9.377  −104.232  21.35
 150   7.349  −104.182  21.43    13.672  −104.172  21.37    10.700  −104.207  21.42
 200   7.597  −104.174  21.44    15.140  −104.165  21.35    12.160  −104.194  21.41
 300   9.301  −104.163  21.46    16.490  −104.157  21.35    13.113  −104.178  21.38
 400  10.081  −104.156  21.47    17.838  −104.152  21.37    13.353  −104.169  21.42
 500  10.065  −104.152  21.47    18.585  −104.149  21.38    14.751  −104.163  21.41
 750  11.112  −104.146  21.46    22.057  −104.143  21.43    15.454  −104.153  21.46
1000  11.838  −104.143  21.44    23.252  −104.140  21.42    16.940  −104.149  21.46

Table A.6: Numerical results for Moses on German–English with target syntax at select beam sizes.

              This work                  Cube pruning                Gesmundo 1                  Gesmundo 2
Beam     CPU     Model   BLEU       CPU     Model   BLEU       CPU     Model   BLEU       CPU     Model   BLEU
  10   5.681  −228.482  26.73     8.829  −223.668  27.05     8.476  −223.740  27.04     8.351  −223.827  27.02
  15   5.613  −226.395  26.97     9.165  −222.773  27.03     8.969  −222.834  27.03     8.932  −222.923  27.01
  20   5.665  −225.169  27.10    10.540  −222.298  27.05     9.082  −222.370  27.05     9.533  −222.454  27.00
  50   6.052  −222.419  27.21    14.636  −220.907  27.05    12.412  −220.968  27.04    12.655  −221.015  27.05
  75   6.225  −221.538  27.14    16.795  −220.445  27.07    15.611  −220.484  27.08    14.912  −220.543  27.12
 100   6.924  −221.040  27.14    18.644  −220.149  27.05    17.560  −220.195  27.06    17.524  −220.240  27.09
 150   7.585  −220.479  27.18    25.498  −219.798  27.07    22.341  −219.839  27.11    22.575  −219.865  27.11
 200   8.382  −220.117  27.16    32.301  −219.579  27.04    27.410  −219.636  27.09    27.525  −219.691  27.11
 300   9.784  −219.733  27.10    41.168  −219.321  27.04    38.151  −219.347  27.06    38.882  −219.375  27.07

              Rest+This work             Rest+Cube pruning           Rest+Gesmundo 1             Rest+Gesmundo 2
Beam     CPU     Model   BLEU       CPU     Model   BLEU       CPU     Model   BLEU       CPU     Model   BLEU
  10   5.369  −225.344  26.96     9.321  −222.868  26.97     8.543  −222.946  26.96     8.459  −222.977  26.97
  15   5.806  −223.864  26.99     9.587  −222.111  26.88     9.064  −222.170  26.91     9.075  −222.208  26.93
  20   5.970  −222.965  27.00    10.126  −221.652  26.90     9.496  −221.709  26.99     9.647  −221.777  26.99
  50   6.031  −220.923  27.03    13.364  −220.516  26.95    12.518  −220.550  26.99    12.998  −220.637  26.96
  75   6.399  −220.351  27.06    16.032  −220.072  26.98    14.762  −220.115  26.97    15.086  −220.186  26.98
 100   7.064  −220.011  27.05    19.142  −219.860  26.98    17.466  −219.901  26.98    17.288  −219.936  26.99
 150   7.855  −219.602  26.99    25.337  −219.552  26.93    22.820  −219.581  26.97    22.536  −219.626  26.99
 200   8.738  −219.388  26.97    31.346  −219.394  26.95    27.879  −219.421  26.98    27.822  −219.448  26.97
 300  10.360  −219.161  26.98    41.994  −219.163  26.97    38.306  −219.191  26.96    38.545  −219.200  26.96

Table A.7: Numerical results for cdec on tree-to-tree French–English at select beam sizes. Due to computational costs, cube pruning was run with beam sizes up to 300.
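As an illustrative aid for reading these tables (not part of the thesis tooling), the short Python sketch below compares the CPU columns of two decoding configurations from Table A.7 at shared beam sizes. The dictionaries copy three rows verbatim from the table; the names, the dictionary layout, and the assumption that the CPU columns are directly comparable across configurations are mine.

# Minimal sketch: compare the CPU column of two decoders from Table A.7
# at shared beam sizes. Each entry maps beam -> (CPU, model score, BLEU);
# the three rows below are copied from the table, everything else is
# assumed for the example.
THIS_WORK = {
    10:  (5.681, -228.482, 26.73),
    100: (6.924, -221.040, 27.14),
    300: (9.784, -219.733, 27.10),
}
CUBE_PRUNING = {
    10:  (8.829, -223.668, 27.05),
    100: (18.644, -220.149, 27.05),
    300: (41.168, -219.321, 27.04),
}

def cpu_ratios(baseline, contrast):
    """Yield (beam, CPU of contrast / CPU of baseline) for shared beam sizes."""
    for beam in sorted(set(baseline) & set(contrast)):
        yield beam, contrast[beam][0] / baseline[beam][0]

if __name__ == "__main__":
    for beam, ratio in cpu_ratios(THIS_WORK, CUBE_PRUNING):
        print(f"beam {beam:4d}: cube pruning uses {ratio:.2f}x the CPU of this work")

On these three rows the ratio grows with beam size, from about 1.55 at beam 10 to about 4.21 at beam 300.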
