N-Gram Counts and Language Models from the Common Crawl

N-gram Counts and Language Models from the Common Crawl Christian Bucky, Kenneth Heafieldz, Bas van Ooyen∗ yUniversity of Edinburgh, Edinburgh, Scotland zStanford University, Stanford, CA, USA ∗ Owlin BV, Utrecht, Netherlands [email protected], heafi[email protected], [email protected] Abstract We contribute 5-gram counts and language models trained on the Common Crawl corpus, a collection over 9 billion web pages. This release improves upon the Google n-gram counts in two key ways: the inclusion of low-count entries and deduplication to reduce boilerplate. By preserving singletons, we were able to use Kneser-Ney smoothing to build large language models. This paper describes how the corpus was processed with emphasis on the problems that arise in working with data at this scale. Our unpruned Kneser-Ney English 5-gram language model, built on 975 billion deduplicated tokens, contains over 500 billion unique n-grams. We show gains of 0.5–1.4 BLEU by using large language models to translate into various languages. Keywords: web corpora, language models, multilingual 1. Introduction 2. Data Preparation The Common Crawl2 is a publicly available crawl of the The sheer amount of data in multiple languages makes web- web. We use the 2012, early 2013, and “winter” 2013 scale corpora attractive for many natural language process- crawls, consisting of 3:8 billion, 2 billion, and 2:3 billion ing tasks. Of particular importance is language modeling, pages, respectively. Because both 2013 crawls are simi- where web-scale language models have been shown to im- lar in terms of seed addresses and distribution of top-level prove machine translation and automatic speech recogni- domains in this work we only distinguish 2012 and 2013 tion performance (Brants et al., 2007; Chelba and Schalk- crawls. wyk, 2013; Guthrie and Hepple, 2010). In this work, we The data is made available both as raw HTML and as text contribute n-gram counts and language models trained on only files. The latter collection consists of all HTML and the Common Crawl corpus.1 RSS files from which all tags were stripped. The HTML Google has released n-gram counts (Brants and Franz, comes in the original encoding, while the text has been con- 2006) trained on one trillion tokens of text. However, they verted to UTF-8, albeit with the occasional invalid charac- pruned any n-grams that appeard less than 40 times. More- ter. over, all words that appeared less than 200 times were re- Using the HTML files has the advantage of being able to placed with the unknown word. Both forms of pruning exploit the document structure to select paragraphs and to make the counts unsuitable for estimating a language model tell boilerplate from actual content. However, parsing such with the popular and successful Kneser-Ney smoothing al- large amounts of HTML is non-trivial and requires many gorithm, which requires unpruned counts even if the final normalization steps. model is to be pruned. In this work we focus on processing the text only files The second issue with the publicly available Google n- which we downloaded and processed locally on a small gram counts (Brants and Franz, 2006) is that the training cluster. The advantages of structured text do not outweigh data was not deduplicated, so boilerplate such as copyright the extra computing power needed to process them. notices has unreasonably high counts (Lin et al., 2010). Google has shared a deduplicated version (Bergsma et al., 2.1. Language Detection 2010) in limited contexts (Lin et al., 2010), but it was never The first step in our pipeline is splitting the data by lan- publicly released (Lin, 2013). Our training data was dedu- guage. We explored the option of automatically detecting plicated before counting n-grams. the main language for every page but found that mixed- Microsoft provides a web service (Wang et al., 2010) that language content is quite common. By using the Compact can be queried for language model probabilities. The ser- Language Detector 2 (CLD2)3 we are able to partition every vice is currently limited to the English language whereas document into monolingual spans. CLD2 is able to detect we provide models for many languages. Moreover, an ini- 175 languages and fast enough to process the entire corpus tial experiment on reranking English machine translation within a week. output led to so many queries that the service went down Table 1 shows the relative contribution of the most common several times, despite client-side caching. Using the Mi- languages in the separated data. At this stage of process- crosoft service during machine translation decoding would ing we have no meaningful notion of token or line counts entail far more queries and require lower latency. and therefore report the size of the extracted files. As 2http://commoncrawl.org/ 1http://statmt.org/ngrams 3https://code.google.com/p/cld2/ Relative occurrence % Size Count (M) Line Language 2012 2013 both both 1374:44 Add to English 54:79 79:53 67:05 23:62 TiB 816:33 Share German 4:53 1:23 2:89 1:02 TiB 711:68 Unblock User Spanish 3:91 1:68 2:80 986:86 GiB 68:31 Sign in or sign up now! French 4:01 1:14 2:59 912:16 GiB 61:26 Log in Japanese 3:11 0:14 1:64 577:14 GiB 54:77 Privacy Policy Russian 2:93 0:09 1:53 537:36 GiB 45:18 April 2010 Polish 1:81 0:08 0:95 334:31 GiB 34:35 Load more suggestions Italian 1:40 0:44 0:92 325:58 GiB 19:84 Buy It Now j Add to watch list Portuguese 1:32 0:48 0:90 316:87 GiB 16:64 Powered by WordPress.com Chinese 1:45 0:04 0:75 264:91 GiB Dutch 0:95 0:22 0:59 207:90 GiB Table 2: Selection of very common lines in the English por- 12:23 12:57 12:40 4:37 other TiB tion of the data. We only keep one instance. The counts are given as million lines. Table 1: Results of language detection on raw text, showing the 11 most common languages. Language Lines (B) Tokens (B) Bytes English 59:13 975:63 5:14 TiB German 3:87 51:93 317:46 GiB expected English is by far the predominant language fol- Spanish 3:50 62:21 337:16 GiB lowed by European languages. Due to the large size of the French 3:04 49:31 273:96 GiB overall corpus (35:23 TiB), even a small percentage con- Russian 1:79 21:41 220:62 GiB stitutes a corpus of useful size. For example, only 0:14% Czech 0:47 5:79 34:67 GiB of the data were classified as Finnish, yet yielding a corpus of 47:73 GiB. In total we found 73 languages with at least 1 GiB of uncompressed text each and 42 with at least Table 3: Data statistics after preprocessing 10 GiB. 2.2. Deduplication 2.3. Normalization Since the Common Crawl contains web pages, many frag- In addition to deduplicating, we restricted the data to print- ments are not content but artifacts of automatic page gen- able Unicode characters, replaced all e-mail addresses with eration, such as copyright notices. In order to reduce the the same address, stripped out remaining HTML, and split amount of boilerplate, we remove duplicate lines prior to sentences using the Europarl splitter (Koehn, 2005). We sentence splitting. While a selection of very common lines distribute the text after this stage for those who wish to use in Table 2 suggests that mostly irrelevant data is removed, their own tokenizer. we do risk deduplicating content that should appear repeat- Before building language models, we normalized punctu- edly. ation using the script provided by the Workshop on Sta- Storing all of the lines in memory would take too much tistical Machine Translation (Bojar et al., 2013), tokenized RAM. Instead, we take a 64-bit hash of each line using using the Moses tokenizer (Koehn et al., 2007), and applied MurmurHash4. If the hash was not seen before we keep the Moses truecaser. The truecaser was trained on data from the line and add it to an in-memory hash-table. However, the 2014 Workshop on Statistical Machine Translation for for common languages, such as English, it is not feasible to each language, including Europarl (Koehn, 2005), United keep all hashes in memory. We therefore shard the data us- Nations parallel data, the Giga Fr-En corpus, parallel data ing a different hash, so that all identical lines end up in the mined from CommonCrawl (Smith et al., 2013), the news same shard. We then deduplicate each shard individiually. commentary corpus, LDC Gigaword corpora (Parker et al., While hash collisions can lead to lines being incorrectly 2011), and the news crawls provided by the evaluation. Our identified as duplicates we found that on a 10 GiB sample release includes a frozen version of the scripts and truecas- of the corpus no such errors were made. ing model used to preprocess the data. The deduplication step removes about 80% of the En- Table 3 gives some statistics of the data after all preprocess- glish data which is in line with the reductions reported by ing steps have been performed. Bergsma et al. (2010). Comparing Table 1 and Table 3 we find that this rate is lower for other languages, e.g. about 3. Language Model Estimation 2=3 for Spanish and German. We speculate that this is due After preprocessing, we are left with several large mono- to mixed language content on websites, where boilerplate lingual corpora.

Load more