Dirt Cheap Web-Scale Parallel Text from the Common Crawl

Jason R. Smith1,2   Herve Saint-Amand3   Magdalena Plamada4
Philipp Koehn3   Chris Callison-Burch1,2,5*   Adam Lopez1,2

1 Department of Computer Science, Johns Hopkins University
2 Human Language Technology Center of Excellence, Johns Hopkins University
3 School of Informatics, University of Edinburgh
4 Institute of Computational Linguistics, University of Zurich
5 Computer and Information Science Department, University of Pennsylvania

* This research was conducted while Chris Callison-Burch was at Johns Hopkins University.

Abstract

Parallel text is the fuel that drives modern machine translation systems. The Web is a comprehensive source of preexisting parallel text, but crawling the entire web is impossible for all but the largest companies. We bring web-scale parallel text to the masses by mining the Common Crawl, a public Web crawl hosted on Amazon's Elastic Cloud. Starting from nothing more than a set of common two-letter language codes, our open-source extension of the STRAND algorithm mined 32 terabytes of the crawl in just under a day, at a cost of about $500. Our large-scale experiment uncovers large amounts of parallel text in dozens of language pairs across a variety of domains and genres, some previously unavailable in curated datasets. Even with minimal cleaning and filtering, the resulting data boosts translation performance across the board for five different language pairs in the news domain, and on open domain test sets we see improvements of up to 5 BLEU. We make our code and data available for other researchers seeking to mine this rich new data resource.1

1 github.com/jrs026/CommonCrawlMiner

1 Introduction

A key bottleneck in porting statistical machine translation (SMT) technology to new languages and domains is the lack of readily available parallel corpora beyond curated datasets. For a handful of language pairs, large amounts of parallel data are readily available, numbering in the hundreds of millions of words for Chinese-English and Arabic-English, and in the tens of millions of words for many European languages (Koehn, 2005). In each case, much of this data consists of government and news text. However, for most language pairs and domains there is little to no curated parallel data available. Hence discovery of parallel data is an important first step for translation between most of the world's languages.

The Web is an important source of parallel text. Many websites are available in multiple languages, and unlike other potential sources, such as multilingual news feeds (Munteanu and Marcu, 2005) or Wikipedia (Smith et al., 2010), it is common to find document pairs that are direct translations of one another. This natural parallelism simplifies the mining task, since few resources or existing corpora are needed at the outset to bootstrap the extraction process.

Parallel text mining from the Web was originally explored by individuals or small groups of academic researchers using search engines (Nie et al., 1999; Chen and Nie, 2000; Resnik, 1999; Resnik and Smith, 2003). However, anything more sophisticated generally requires direct access to the web-crawled documents themselves, along with the computing power to process them. For most researchers, this is prohibitively expensive. As a consequence, web-mined parallel text has become the exclusive purview of large companies with the computational resources to crawl, store, and process the entire Web.
To put web-mined parallel text back in the hands of individual researchers, we mine parallel text from the Common Crawl, a regularly updated 81-terabyte snapshot of the public internet hosted on Amazon's Elastic Cloud (EC2) service.2 Using the Common Crawl completely removes the bottleneck of web crawling, and makes it possible to run algorithms on a substantial portion of the web at very low cost. Starting from nothing other than a set of language codes, our extension of the STRAND algorithm (Resnik and Smith, 2003) identifies potentially parallel documents using cues from URLs and document content (§2). We conduct an extensive empirical exploration of the web-mined data, demonstrating coverage in a wide variety of languages and domains (§3). Even without extensive pre-processing, the data improves translation performance on strong baseline news translation systems in five different language pairs (§4). On general domain and speech translation tasks, where test conditions substantially differ from standard government and news training text, web-mined training data improves performance substantially, resulting in improvements of up to 1.5 BLEU on standard test sets, and 5 BLEU on test sets outside of the news domain.

2 commoncrawl.org

2 Mining the Common Crawl

The Common Crawl corpus is hosted on Amazon's Simple Storage Service (S3). It can be downloaded to a local cluster, but the transfer cost is prohibitive at roughly 10 cents per gigabyte, making the total over $8000 for the full dataset.3 However, it is unnecessary to obtain a copy of the data, since it can be accessed freely from Amazon's Elastic Compute Cloud (EC2) or Elastic MapReduce (EMR) services. In our pipeline, we perform the first step of identifying candidate document pairs using Amazon EMR, download the resulting document pairs, and perform the remaining steps on our local cluster. We chose EMR because our candidate matching strategy fit naturally into the Map-Reduce framework (Dean and Ghemawat, 2004).

3 http://aws.amazon.com/s3/pricing/

Our system is based on the STRAND algorithm (Resnik and Smith, 2003):

1. Candidate Pair Selection: Retrieve candidate document pairs from the CommonCrawl corpus.

2. Structural Filtering:
   (a) Convert the HTML of each document into a sequence of start tags, end tags, and text chunks.
   (b) Align the linearized HTML of candidate document pairs.
   (c) Decide whether to accept or reject each pair based on features of the alignment.

3. Segmentation: For each text chunk, perform sentence and word segmentation.

4. Sentence Alignment: For each aligned pair of text chunks, perform the sentence alignment method of Gale and Church (1993).

5. Sentence Filtering: Remove sentences that appear to be boilerplate.

Candidate Pair Selection

We adopt a strategy similar to that of Resnik and Smith (2003) for finding candidate parallel documents, adapted to the parallel architecture of Map-Reduce.

The mapper operates on each website entry in the CommonCrawl data. It scans the URL string for some indicator of its language. Specifically, we check for:

1. Two/three-letter language codes (ISO-639).

2. Language names in English and in the language of origin.

If either is present in a URL and surrounded by non-alphanumeric characters, the URL is identified as a potential match, and the mapper outputs a key-value pair in which the key is the original URL with the matching string replaced by *, and the value is the original URL, language name, and full HTML of the page. For example, if we encounter the URL www.website.com/fr/, we output the following.

• Key: www.website.com/*/
• Value: www.website.com/fr/, French, (full website entry)

The reducer then receives all websites mapped to the same "language independent" URL. If two or more websites are associated with the same key, the reducer will output all associated values, as long as they are not in the same language, as determined by the language identifier in the URL.

This URL-based matching is a simple and inexpensive solution to the problem of finding candidate document pairs. The mapper will discard most documents, and neither the mapper nor the reducer does anything with the HTML of the documents aside from reading and writing it. This approach is very simple and likely misses many good potential candidates, but it requires no information other than a set of language codes, and runs in time roughly linear in the size of the dataset.
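To make this step concrete, the sketch below reimplements the URL matching in Python. It is an illustrative reconstruction, not the released CommonCrawlMiner code: the language table is cut down to three entries, the website entries are hypothetical, and a real run would stream key-value pairs through Hadoop rather than collect them in a dictionary.

    import re
    from collections import defaultdict

    # Illustrative subset of the language table; the real pipeline also
    # checks full ISO-639 code lists and language names, both in English
    # and in the language of origin.
    LANGUAGES = {"fr": "French", "de": "German", "es": "Spanish"}

    # A code counts only when surrounded by non-alphanumeric characters,
    # so the "fr" in "www.website.com/fr/" matches but "frog.com" does not.
    CODE_RE = re.compile(
        r"(?<![a-zA-Z0-9])(" + "|".join(LANGUAGES) + r")(?![a-zA-Z0-9])")

    def mapper(url, html):
        """Emit (language-independent URL, (original URL, language, HTML))."""
        match = CODE_RE.search(url)
        if match:  # most documents are discarded at this point
            key = url[:match.start()] + "*" + url[match.end():]
            yield key, (url, LANGUAGES[match.group(1)], html)

    def reducer(key, values):
        """Keep groups that share a key but span more than one language."""
        languages = {language for _, language, _ in values}
        if len(values) >= 2 and len(languages) >= 2:
            for value in values:
                yield value

    # Simulated shuffle phase over two hypothetical website entries.
    groups = defaultdict(list)
    for url, html in [("www.website.com/fr/", "<html>...</html>"),
                      ("www.website.com/de/", "<html>...</html>")]:
        for key, value in mapper(url, html):
            groups[key].append(value)
    for key, values in groups.items():
        for url, language, _ in reducer(key, values):
            print(key, url, language)

Because the matching looks only at the URL string, the page HTML is merely copied through, consistent with the roughly linear running time noted above.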
Structural Filtering

A major component of the STRAND system is the alignment of HTML documents. This alignment is used to determine which document pairs are actually parallel and, if they are, to align pairs of text blocks within the documents.

The first step of structural filtering is to linearize the HTML. This means converting its DOM tree into a sequence of start tags, end tags, and chunks of text. Some tags (those usually found within text, such as "font" and "a") are ignored during this step. Next, the tag/chunk sequences are aligned using dynamic programming (a code sketch of these two steps appears at the end of this section).

Sentence Filtering

Since we do not perform any boilerplate removal in earlier steps, many sentence pairs produced by the pipeline contain menu items or other bits of text that are not useful to an SMT system. We avoid performing any complex boilerplate removal and only remove segment pairs where the source and target text are identical, or where the source or target segment appears more than once in the extracted corpus (a companion sketch of this filter also appears at the end of this section).

3 Analysis of the Common Crawl Data

We ran our algorithm on the 2009-2010 version of the crawl, consisting of 32.3 terabytes of data. Since the full dataset is hosted on EC2, the only costs to us were the CPU time charged by Amazon, which came to a total of about $400, and the data storage/transfer costs for our output, which came to roughly $100. For practical reasons we split the run into seven subsets, on which the full algorithm was run independently.
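Returning to the Structural Filtering step above, the sketch below shows one way to realize linearization and structural alignment using only Python's standard library. It is an approximation of the idea rather than the actual system: difflib.SequenceMatcher stands in for STRAND's dynamic-programming alignment, and its single match ratio stands in for the richer set of alignment features used to accept or reject a pair.

    from difflib import SequenceMatcher
    from html.parser import HTMLParser

    # Tags usually found within text are ignored during linearization.
    IGNORED_TAGS = {"font", "a", "b", "i", "em", "span"}

    class Linearizer(HTMLParser):
        """Flatten a DOM tree into start tags, end tags, and text chunks."""
        def __init__(self):
            super().__init__()
            self.sequence = []

        def handle_starttag(self, tag, attrs):
            if tag not in IGNORED_TAGS:
                self.sequence.append(("START", tag))

        def handle_endtag(self, tag):
            if tag not in IGNORED_TAGS:
                self.sequence.append(("END", tag))

        def handle_data(self, data):
            if data.strip():
                self.sequence.append(("TEXT", data.strip()))

    def linearize(html):
        parser = Linearizer()
        parser.feed(html)
        return parser.sequence

    def structurally_align(html_a, html_b):
        """Align two tag/chunk sequences and pair up their text chunks."""
        seq_a, seq_b = linearize(html_a), linearize(html_b)

        # Text chunks compare as a generic TEXT token, so the alignment
        # depends on document structure rather than on wording.
        def tokens(seq):
            return [kind if kind == "TEXT" else kind + ":" + value
                    for kind, value in seq]

        matcher = SequenceMatcher(None, tokens(seq_a), tokens(seq_b),
                                  autojunk=False)
        pairs = []
        for block in matcher.get_matching_blocks():
            for i in range(block.size):
                a, b = seq_a[block.a + i], seq_b[block.b + i]
                if a[0] == "TEXT":
                    pairs.append((a[1], b[1]))
        # The match ratio serves as a crude accept/reject score here.
        return matcher.ratio(), pairs

    score, chunk_pairs = structurally_align(
        "<html><body><p>Hello world</p></body></html>",
        "<html><body><p>Bonjour le monde</p></body></html>")
    print(score, chunk_pairs)  # 1.0 [('Hello world', 'Bonjour le monde')]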
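The Sentence Filtering step above likewise reduces to a few lines once sentence pairs have been extracted. This sketch assumes the aligned segments fit in memory as a list of (source, target) pairs, a simplification of a pipeline that operates at web scale.

    from collections import Counter

    def filter_segment_pairs(pairs):
        """Drop pairs whose two sides are identical, and pairs in which
        either side appears more than once in the extracted corpus."""
        source_counts = Counter(source for source, _ in pairs)
        target_counts = Counter(target for _, target in pairs)
        return [(source, target) for source, target in pairs
                if source != target
                and source_counts[source] == 1
                and target_counts[target] == 1]

    # "Home" appears twice (likely menu boilerplate) and is removed, as
    # is the untranslated "Contact" pair; the ordinary pair survives.
    corpus = [("Home", "Accueil"), ("Home", "Maison"),
              ("Contact", "Contact"),
              ("The cat sleeps.", "Le chat dort.")]
    print(filter_segment_pairs(corpus))
    # [('The cat sleeps.', 'Le chat dort.')]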