Dirt Cheap Web-Scale Parallel Text from the Common Crawl Jason R

Total Page:16

File Type:pdf, Size:1020Kb

Dirt Cheap Web-Scale Parallel Text from the Common Crawl Jason R Dirt Cheap Web-Scale Parallel Text from the Common Crawl Jason R. Smith1,2 Herve Saint-Amand3 Magdalena Plamada4 [email protected] [email protected] [email protected] Philipp Koehn3 Chris Callison-Burch1,2,5 Adam Lopez1,2 [email protected] [email protected][email protected] 1Department of Computer Science, Johns Hopkins University 2Human Language Technology Center of Excellence, Johns Hopkins University 3School of Informatics, University of Edinburgh 4Institute of Computational Linguistics, University of Zurich 5Computer and Information Science Department, University of Pennsylvania Abstract are readily available, ordering in the hundreds of millions of words for Chinese-English and Arabic- Parallel text is the fuel that drives modern English, and in tens of millions of words for many machine translation systems. The Web is a European languages (Koehn, 2005). In each case, comprehensive source of preexisting par- much of this data consists of government and news allel text, but crawling the entire web is text. However, for most language pairs and do- impossible for all but the largest compa- mains there is little to no curated parallel data nies. We bring web-scale parallel text to available. Hence discovery of parallel data is an the masses by mining the Common Crawl, important first step for translation between most a public Web crawl hosted on Amazon’s of the world’s languages. Elastic Cloud. Starting from nothing more The Web is an important source of parallel than a set of common two-letter language text. Many websites are available in multiple codes, our open-source extension of the languages, and unlike other potential sources— STRAND algorithm mined 32 terabytes of such as multilingual news feeds (Munteanu and the crawl in just under a day, at a cost of Marcu, 2005) or Wikipedia (Smith et al., 2010)— about $500. Our large-scale experiment it is common to find document pairs that are di- uncovers large amounts of parallel text in rect translations of one another. This natural par- dozens of language pairs across a variety allelism simplifies the mining task, since few re- of domains and genres, some previously sources or existing corpora are needed at the outset unavailable in curated datasets. Even with to bootstrap the extraction process. minimal cleaning and filtering, the result- ing data boosts translation performance Parallel text mining from the Web was origi- across the board for five different language nally explored by individuals or small groups of pairs in the news domain, and on open do- academic researchers using search engines (Nie main test sets we see improvements of up et al., 1999; Chen and Nie, 2000; Resnik, 1999; to 5 BLEU. We make our code and data Resnik and Smith, 2003). However, anything available for other researchers seeking to more sophisticated generally requires direct access mine this rich new data resource.1 to web-crawled documents themselves along with the computing power to process them. For most 1 Introduction researchers, this is prohibitively expensive. As a consequence, web-mined parallel text has become A key bottleneck in porting statistical machine the exclusive purview of large companies with the translation (SMT) technology to new languages computational resources to crawl, store, and pro- and domains is the lack of readily available paral- cess the entire Web. lel corpora beyond curated datasets. For a handful of language pairs, large amounts of parallel data To put web-mined parallel text back in the hands of individual researchers, we mine parallel ⇤ This research was conducted while Chris Callison- Burch was at Johns Hopkins University. text from the Common Crawl, a regularly updated 1github.com/jrs026/CommonCrawlMiner 81-terabyte snapshot of the public internet hosted on Amazon’s Elastic Cloud (EC2) service.2 Us- into a sequence of start tags, end tags, ing the Common Crawl completely removes the and text chunks. bottleneck of web crawling, and makes it possi- (b) Align the linearized HTML of candidate ble to run algorithms on a substantial portion of document pairs. the web at very low cost. Starting from nothing (c) Decide whether to accept or reject each other than a set of language codes, our extension pair based on features of the alignment. of the STRAND algorithm (Resnik and Smith, 2003) identifies potentially parallel documents us- 3. Segmentation: For each text chunk, perform ing cues from URLs and document content ( 2). sentence and word segmentation. § We conduct an extensive empirical exploration of 4. Sentence Alignment: For each aligned pair of the web-mined data, demonstrating coverage in text chunks, perform the sentence alignment a wide variety of languages and domains ( 3). § method of Gale and Church (1993). Even without extensive pre-processing, the data improves translation performance on strong base- 5. Sentence Filtering: Remove sentences that line news translation systems in five different lan- appear to be boilerplate. guage pairs ( 4). On general domain and speech § translation tasks where test conditions substan- Candidate Pair Selection We adopt a strategy tially differ from standard government and news similar to that of Resnik and Smith (2003) for find- training text, web-mined training data improves ing candidate parallel documents, adapted to the performance substantially, resulting in improve- parallel architecture of Map-Reduce. ments of up to 1.5 BLEU on standard test sets, and The mapper operates on each website entry in 5 BLEU on test sets outside of the news domain. the CommonCrawl data. It scans the URL string for some indicator of its language. Specifically, 2 Mining the Common Crawl we check for: The Common Crawl corpus is hosted on Ama- 1. Two/three letter language codes (ISO-639). zon’s Simple Storage Service (S3). It can be downloaded to a local cluster, but the transfer cost 2. Language names in English and in the lan- is prohibitive at roughly 10 cents per gigabyte, guage of origin. making the total over $8000 for the full dataset.3 If either is present in a URL and surrounded by However, it is unnecessary to obtain a copy of the non-alphanumeric characters, the URL is identi- data since it can be accessed freely from Amazon’s fied as a potential match and the mapper outputs Elastic Compute Cloud (EC2) or Elastic MapRe- a key value pair in which the key is the original duce (EMR) services. In our pipeline, we per- URL with the matching string replaced by *, and form the first step of identifying candidate docu- the value is the original URL, language name, and ment pairs using Amazon EMR, download the re- full HTML of the page. For example, if we en- sulting document pairs, and perform the remain- counter the URL www.website.com/fr/, we ing steps on our local cluster. We chose EMR be- output the following. cause our candidate matching strategy fit naturally into the Map-Reduce framework (Dean and Ghe- Key: www.website.com/*/ mawat, 2004). • Value: www.website.com/fr/, French, Our system is based on the STRAND algorithm • (Resnik and Smith, 2003): (full website entry) The reducer then receives all websites mapped 1. Candidate pair selection: Retrieve candidate to the same “language independent” URL. If two document pairs from the CommonCrawl cor- or more websites are associated with the same key, pus. the reducer will output all associated values, as 2. Structural Filtering: long as they are not in the same language, as de- termined by the language identifier in the URL. (a) Convert the HTML of each document This URL-based matching is a simple and in- 2commoncrawl.org expensive solution to the problem of finding can- 3http://aws.amazon.com/s3/pricing/ didate document pairs. The mapper will discard most, and neither the mapper nor the reducer do Sentence Filtering Since we do not perform any anything with the HTML of the documents aside boilerplate removal in earlier steps, there are many from reading and writing them. This approach is sentence pairs produced by the pipeline which very simple and likely misses many good potential contain menu items or other bits of text which are candidates, but has the advantage that it requires not useful to an SMT system. We avoid perform- no information other than a set of language codes, ing any complex boilerplate removal and only re- and runs in time roughly linear in the size of the move segment pairs where either the source and dataset. target text are identical, or where the source or target segments appear more than once in the ex- Structural Filtering A major component of the tracted corpus. STRAND system is the alignment of HTML docu- ments. This alignment is used to determine which 3 Analysis of the Common Crawl Data document pairs are actually parallel, and if they are, to align pairs of text blocks within the docu- We ran our algorithm on the 2009-2010 version ments. of the crawl, consisting of 32.3 terabytes of data. Since the full dataset is hosted on EC2, the only The first step of structural filtering is to lin- cost to us is CPU time charged by Amazon, which earize the HTML. This means converting its DOM came to a total of about $400, and data stor- tree into a sequence of start tags, end tags, and age/transfer costs for our output, which came to chunks of text. Some tags (those usually found roughly $100. For practical reasons we split the within text, such as “font” and “a”) are ignored run into seven subsets, on which the full algo- during this step. Next, the tag/chunk sequences rithm was run independently. This is different are aligned using dynamic programming.
Recommended publications
  • Machine-Translation Inspired Reordering As Preprocessing for Cross-Lingual Sentiment Analysis
    Master in Cognitive Science and Language Master’s Thesis September 2018 Machine-translation inspired reordering as preprocessing for cross-lingual sentiment analysis Alejandro Ramírez Atrio Supervisor: Toni Badia Abstract In this thesis we study the effect of word reordering as preprocessing for Cross-Lingual Sentiment Analysis. We try different reorderings in two target languages (Spanish and Catalan) so that their word order more closely resembles the one from our source language (English). Our original expectation was that a Long Short Term Memory classifier trained on English data with bilingual word embeddings would internalize English word order, resulting in poor performance when tested on a target language with different word order. We hypothesized that the more the word order of any of our target languages resembles the one of our source language, the better the overall performance of our sentiment classifier would be when analyzing the target language. We tested five sets of transformation rules for our Part of Speech reorderings of Spanish and Catalan, extracted mainly from two sources: two papers by Crego and Mariño (2006a and 2006b) and our own empirical analysis of two corpora: CoStEP and Tatoeba. The results suggest that the bilingual word embeddings that we are training our Long Short Term Memory model with do not improve any English word order learning by part of the model when used cross-lingually. There is no improvement when reordering the Spanish and Catalan texts so that their word order more closely resembles English, and no significant drop in result score even when applying a random reordering to them making them almost unintelligible, neither when classifying between 2 options (positive-negative) nor between 4 (strongly positive, positive, negative, strongly negative).
    [Show full text]
  • Champollion: a Robust Parallel Text Sentence Aligner
    Champollion: A Robust Parallel Text Sentence Aligner Xiaoyi Ma Linguistic Data Consortium 3600 Market St. Suite 810 Philadelphia, PA 19104 [email protected] Abstract This paper describes Champollion, a lexicon-based sentence aligner designed for robust alignment of potential noisy parallel text. Champollion increases the robustness of the alignment by assigning greater weights to less frequent translated words. Experiments on a manually aligned Chinese – English parallel corpus show that Champollion achieves high precision and recall on noisy data. Champollion can be easily ported to new language pairs. It’s freely available to the public. framework to find the maximum likelihood alignment of 1. Introduction sentences. Parallel text is a very valuable resource for a number The length based approach works remarkably well on of natural language processing tasks, including machine language pairs with high length correlation, such as translation (Brown et al. 1993; Vogel and Tribble 2002; French and English. Its performance degrades quickly, Yamada and Knight 2001;), cross language information however, when the length correlation breaks down, such retrieval, and word disambiguation. as in the case of Chinese and English. Parallel text provides the maximum utility when it is Even with language pairs with high length correlation, sentence aligned. The sentence alignment process maps the Gale-Church algorithm may fail at regions that contain sentences in the source text to their translation. The labor many sentences with similar length. A number of intensive and time consuming nature of manual sentence algorithms, such as (Wu 1994), try to overcome the alignment makes large parallel text corpus development weaknesses of length based approaches by utilizing lexical difficult.
    [Show full text]
  • Student Research Workshop Associated with RANLP 2011, Pages 1–8, Hissar, Bulgaria, 13 September 2011
    RANLPStud 2011 Proceedings of the Student Research Workshop associated with The 8th International Conference on Recent Advances in Natural Language Processing (RANLP 2011) 13 September, 2011 Hissar, Bulgaria STUDENT RESEARCH WORKSHOP ASSOCIATED WITH THE INTERNATIONAL CONFERENCE RECENT ADVANCES IN NATURAL LANGUAGE PROCESSING’2011 PROCEEDINGS Hissar, Bulgaria 13 September 2011 ISBN 978-954-452-016-8 Designed and Printed by INCOMA Ltd. Shoumen, BULGARIA ii Preface The Recent Advances in Natural Language Processing (RANLP) conference, already in its eight year and ranked among the most influential NLP conferences, has always been a meeting venue for scientists coming from all over the world. Since 2009, we decided to give arena to the younger and less experienced members of the NLP community to share their results with an international audience. For this reason, further to the first successful and highly competitive Student Research Workshop associated with the conference RANLP 2009, we are pleased to announce the second edition of the workshop which is held during the main RANLP 2011 conference days on 13 September 2011. The aim of the workshop is to provide an excellent opportunity for students at all levels (Bachelor, Master, and Ph.D.) to present their work in progress or completed projects to an international research audience and receive feedback from senior researchers. We have received 31 high quality submissions, among which 6 papers have been accepted as regular oral papers, and 18 as posters. Each submission has been reviewed by
    [Show full text]
  • The Web As a Parallel Corpus
    The Web as a Parallel Corpus Philip Resnik∗ Noah A. Smith† University of Maryland Johns Hopkins University Parallel corpora have become an essential resource for work in multilingual natural language processing. In this article, we report on our work using the STRAND system for mining parallel text on the World Wide Web, first reviewing the original algorithm and results and then presenting a set of significant enhancements. These enhancements include the use of supervised learning based on structural features of documents to improve classification performance, a new content- based measure of translational equivalence, and adaptation of the system to take advantage of the Internet Archive for mining parallel text from the Web on a large scale. Finally, the value of these techniques is demonstrated in the construction of a significant parallel corpus for a low-density language pair. 1. Introduction Parallel corpora—bodies of text in parallel translation, also known as bitexts—have taken on an important role in machine translation and multilingual natural language processing. They represent resources for automatic lexical acquisition (e.g., Gale and Church 1991; Melamed 1997), they provide indispensable training data for statistical translation models (e.g., Brown et al. 1990; Melamed 2000; Och and Ney 2002), and they can provide the connection between vocabularies in cross-language information retrieval (e.g., Davis and Dunning 1995; Landauer and Littman 1990; see also Oard 1997). More recently, researchers at Johns Hopkins University and the
    [Show full text]
  • Resourcing Machine Translation with Parallel Treebanks John Tinsley
    View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by DCU Online Research Access Service Resourcing Machine Translation with Parallel Treebanks John Tinsley A dissertation submitted in fulfilment of the requirements for the award of Doctor of Philosophy (Ph.D.) to the Dublin City University School of Computing Supervisor: Prof. Andy Way December 2009 I hereby certify that this material, which I now submit for assessment on the programme of study leading to the award of Ph.D. is entirely my own work, that I have exercised reasonable care to ensure that the work is original, and does not to the best of my knowledge breach any law of copyright, and has not been taken from the work of others save and to the extent that such work has been cited and acknowledged within the text of my work. Signed: (Candidate) ID No.: Date: Contents Abstract vii Acknowledgements viii List of Figures ix List of Tables x 1 Introduction 1 2 Background and the Current State-of-the-Art 7 2.1 ParallelTreebanks ............................ 7 2.1.1 Sub-sentential Alignment . 9 2.1.2 Automatic Approaches to Tree Alignment . 12 2.2 Phrase-Based Statistical Machine Translation . ...... 14 2.2.1 WordAlignment ......................... 17 2.2.2 Phrase Extraction and Translation Models . 18 2.2.3 ScoringandtheLog-LinearModel . 22 2.2.4 LanguageModelling . 25 2.2.5 Decoding ............................. 27 2.3 Syntax-Based Machine Translation . 29 2.3.1 StatisticalTransfer-BasedMT . 30 2.3.2 Data-OrientedTranslation . 33 2.3.3 OtherApproaches ........................ 35 2.4 MTEvaluation.............................
    [Show full text]
  • Towards Controlled Counterfactual Generation for Text
    The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21) Generate Your Counterfactuals: Towards Controlled Counterfactual Generation for Text Nishtha Madaan, Inkit Padhi, Naveen Panwar, Diptikalyan Saha 1IBM Research AI fnishthamadaan, naveen.panwar, [email protected], [email protected] Abstract diversity will ensure high coverage of the input space de- fined by the goal. In this paper, we aim to generate such Machine Learning has seen tremendous growth recently, counterfactual text samples which are also effective in find- which has led to a larger adoption of ML systems for ed- ing test-failures (i.e. label flips for NLP classifiers). ucational assessments, credit risk, healthcare, employment, criminal justice, to name a few. Trustworthiness of ML and Recent years have seen a tremendous increase in the work NLP systems is a crucial aspect and requires guarantee that on fairness testing (Ma, Wang, and Liu 2020; Holstein et al. the decisions they make are fair and robust. Aligned with 2019) which are capable of generating a large number of this, we propose a framework GYC, to generate a set of coun- test-cases that capture the model’s ability to misclassify or terfactual text samples, which are crucial for testing these remove unwanted bias around specific protected attributes ML systems. Our main contributions include a) We introduce (Huang et al. 2019), (Garg et al. 2019). This is not only lim- GYC, a framework to generate counterfactual samples such ited to fairness but the community has seen great interest that the generation is plausible, diverse, goal-oriented and ef- in building robust models susceptible to adversarial changes fective, b) We generate counterfactual samples, that can direct (Goodfellow, Shlens, and Szegedy 2014; Michel et al.
    [Show full text]
  • Natural Language Processing Security- and Defense-Related Lessons Learned
    July 2021 Perspective EXPERT INSIGHTS ON A TIMELY POLICY ISSUE PETER SCHIRMER, AMBER JAYCOCKS, SEAN MANN, WILLIAM MARCELLINO, LUKE J. MATTHEWS, JOHN DAVID PARSONS, DAVID SCHULKER Natural Language Processing Security- and Defense-Related Lessons Learned his Perspective offers a collection of lessons learned from RAND Corporation projects that employed natural language processing (NLP) tools and methods. It is written as a reference document for the practitioner Tand is not intended to be a primer on concepts, algorithms, or applications, nor does it purport to be a systematic inventory of all lessons relevant to NLP or data analytics. It is based on a convenience sample of NLP practitioners who spend or spent a majority of their time at RAND working on projects related to national defense, national intelligence, international security, or homeland security; thus, the lessons learned are drawn largely from projects in these areas. Although few of the lessons are applicable exclusively to the U.S. Department of Defense (DoD) and its NLP tasks, many may prove particularly salient for DoD, because its terminology is very domain-specific and full of jargon, much of its data are classified or sensitive, its computing environment is more restricted, and its information systems are gen- erally not designed to support large-scale analysis. This Perspective addresses each C O R P O R A T I O N of these issues and many more. The presentation prioritizes • identifying studies conducting health service readability over literary grace. research and primary care research that were sup- We use NLP as an umbrella term for the range of tools ported by federal agencies.
    [Show full text]
  • A User Interface-Level Integration Method for Multiple Automatic Speech Translation Systems
    A User Interface-Level Integration Method for Multiple Automatic Speech Translation Systems Seiya Osada1, Kiyoshi Yamabana1, Ken Hanazawa1, Akitoshi Okumura1 1 Media and Information Research Laboratories NEC Corporation 1753, Shimonumabe, Nakahara-Ku, Kawasaki, Kanagawa 211-8666, Japan {s-osada@cd, k-yamabana@ct, k-hanazawa@cq, a-okumura@bx}.jp.nec.com Abstract. We propose a new method to integrate multiple speech translation systems based on user interface-level integration. Users can select the result of free-sentence speech translation or that of registered sentence translation without being conscious of the configuration of the automatic speech translation system. We implemented this method on a portable device. Keywords: speech translation, user interface, machine translation, registered sentence retrieval, speech recognition 1 Introduction There have been many researches on speech-to-speech translation systems, such as NEC speech translation system[1], ATR-MATRIX[2] and Verbmobil[3]. These speech-to-speech translation systems include at least three components: speech recognition, machine translation, and speech synthesis. However, in practice, each component does not always output the correct result for various inputs. In actual use of a speech-to-speech translation system with a display, the speaker using the system can examine the result of speech recognition on the display. Accordingly, when the recognition result is inappropriate, the speaker can correct errors by speaking again to the system. Similarly, when the result of speech synthesis is not correct, the listener using the system can examine the source sentence of speech synthesis on the display. On the other hand, the feature of machine translation is different from that of speech recognition or speech synthesis, because neither the speaker nor the listener using the system can confirm the result of machine translation.
    [Show full text]
  • ALW2), Pages 1–10 Brussels, Belgium, October 31, 2018
    EMNLP 2018 Second Workshop on Abusive Language Online Proceedings of the Workshop, co-located with EMNLP 2018 October 31, 2018 Brussels, Belgium Sponsors Primary Sponsor Platinum Sponsors Gold Sponsors Silver Sponsors Bronze Sponsors ii c 2018 The Association for Computational Linguistics Order copies of this and other ACL proceedings from: Association for Computational Linguistics (ACL) 209 N. Eighth Street Stroudsburg, PA 18360 USA Tel: +1-570-476-8006 Fax: +1-570-476-0860 [email protected] ISBN 978-1-948087-68-1 iii Introduction Interaction amongst users on social networking platforms can enable constructive and insightful conversations and civic participation; however, on many sites that encourage user interaction, verbal abuse has become commonplace. Abusive behavior such as cyberbullying, hate speech, and scapegoating can poison the social climates within online communities. The last few years have seen a surge in such abusive online behavior, leaving governments, social media platforms, and individuals struggling to deal with the consequences. As a field that works directly with computational analysis of language, the NLP community is uniquely positioned to address the difficult problem of abusive language online; encouraging collaborative and innovate work in this area is the goal of this workshop. The first year of the workshop saw 14 papers presented in a day-long program including interdisciplinary panels and active discussion. In this second edition, we have aimed to build on the success of the first year, maintaining a focus on computationally detecting abusive language and encouraging interdisciplinary work. Reflecting the growing research focus on this topic, the number of submissions received more than doubled from 22 in last year’s edition of the workshop to 48 this year.
    [Show full text]
  • Lines: an English-Swedish Parallel Treebank
    LinES: An English-Swedish Parallel Treebank Lars Ahrenberg NLPLab, Human-Centered Systems Department of Computer and Information Science Link¨opings universitet [email protected] Abstract • We can investigate the distribution of differ- ent kinds of shifts in different sub-corpora and This paper presents an English-Swedish Par- characterize the translation strategy used in allel Treebank, LinES, that is currently un- terms of these distributions. der development. LinES is intended as a resource for the study of variation in trans- In this paper the focus is mainly on the second as- lation of common syntactic constructions pect, i.e., on identifying translation correspondences from English to Swedish. For this rea- of various kinds and presenting them to the user. son, annotation in LinES is syntactically ori- When two segments correspond under translation ented, multi-level, complete and manually but differ in structure or meaning, we talk of a trans- reviewed according to guidelines. Another lation shift (Catford, 1965). Translation shifts are aim of LinES is to support queries made in common in translation even for languages that are terms of types of translation shifts. closely related and may occur for various reasons. This paper has its focus on structural shifts, i.e., on 1 Introduction changes in syntactic properties and relations. Translation shifts have been studied mainly by The empirical turn in computational linguistics has translation scholars but is also of relevance to ma- spurred the development of ever new types of basic chine translation, as the occurrence of translation linguistic resources. Treebanks are now regarded as shifts is what makes translation difficult.
    [Show full text]
  • Using Morphemes from Agglutinative Languages Like Quechua and Finnish to Aid in Low-Resource Translation
    Using Morphemes from Agglutinative Languages like Quechua and Finnish to Aid in Low-Resource Translation John E. Ortega [email protected] Dept. de Llenguatges i Sistemes Informatics, Universitat d’Alacant, E-03071, Alacant, Spain Krishnan Pillaipakkamnatt [email protected] Department of Computer Science, Hofstra University, Hempstead, NY 11549, USA Abstract Quechua is a low-resource language spoken by nearly 9 million persons in South America (Hintz and Hintz, 2017). Yet, in recent times there are few published accounts of successful adaptations of machine translation systems for low-resource languages like Quechua. In some cases, machine translations from Quechua to Spanish are inadequate due to error in alignment. We attempt to improve previous alignment techniques by aligning two languages that are simi- lar due to agglutination: Quechua and Finnish. Our novel technique allows us to add rules that improve alignment for the prediction algorithm used in common machine translation systems. 1 Introduction The NP-complete problem of translating natural languages as they are spoken by humans to machine readable text is a complex problem; yet, is partially solvable due to the accuracy of machine language translations when compared to human translations (Kleinberg and Tardos, 2005). Statistical machine translation (SMT) systems such as Moses 1, require that an algo- rithm be combined with enough parallel corpora, text from distinct languages that can be com- pared sentence by sentence, to build phrase translation tables from language models. For many European languages, the translation task of bringing words together in a sequential sentence- by-sentence format for modeling, known as word alignment, is not hard due to the abundance of parallel corpora in large data sets such as Europarl 2.
    [Show full text]
  • A Massively Parallel Corpus: the Bible in 100 Languages
    Lang Resources & Evaluation DOI 10.1007/s10579-014-9287-y ORIGINAL PAPER A massively parallel corpus: the Bible in 100 languages Christos Christodouloupoulos • Mark Steedman Ó The Author(s) 2014. This article is published with open access at Springerlink.com Abstract We describe the creation of a massively parallel corpus based on 100 translations of the Bible. We discuss some of the difficulties in acquiring and processing the raw material as well as the potential of the Bible as a corpus for natural language processing. Finally we present a statistical analysis of the corpora collected and a detailed comparison between the English translation and other English corpora. Keywords Parallel corpus Á Multilingual corpus Á Comparative corpus linguistics 1 Introduction Parallel corpora are a valuable resource for linguistic research and natural language processing (NLP) applications. One of the main uses of the latter kind is as training material for statistical machine translation (SMT), where large amounts of aligned data are standardly used to learn word alignment models between the lexica of two languages (for example, in the Giza?? system of Och and Ney 2003). Another interesting use of parallel corpora in NLP is projected learning of linguistic structure. In this approach, supervised data from a resource-rich language is used to guide the unsupervised learning algorithm in a target language. Although there are some techniques that do not require parallel texts (e.g. Cohen et al. 2011), the most successful models use sentence-aligned corpora (Yarowsky and Ngai 2001; Das and Petrov 2011). C. Christodouloupoulos (&) Department of Computer Science, UIUC, 201 N.
    [Show full text]