TransLiTex: A Parallel Corpus of Translated Literary Texts
Amel Fraisse, Quoc-Tan Tran, Ronald Jenn, Patrick Paroubek, Shelley Fishkin

To cite this version: Amel Fraisse, Quoc-Tan Tran, Ronald Jenn, Patrick Paroubek, Shelley Fishkin. TransLiTex: A Parallel Corpus of Translated Literary Texts. Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Beijing Advanced Innovation Center for Language Resources, May 2018, Miyazaki, Japan. hal-01827884

HAL Id: hal-01827884
https://hal.archives-ouvertes.fr/hal-01827884
Submitted on 2 Jul 2018

TransLiTex: A Parallel Corpus of Translated Literary Texts

Amel Fraisse¹, Quoc-Tan Tran¹, Ronald Jenn¹, Patrick Paroubek², Shelley Fisher Fishkin³
¹University of Lille (France), ²LIMSI-CNRS (France), ³Stanford University (USA)
{amel.fraisse, ronald.jenn}@univ-lille3.fr, [email protected], [email protected], sfi[email protected]

Abstract

In this paper, we present our ongoing research work to create a massively parallel corpus of translated literary texts which is useful for applications in computational linguistics, translation studies and cross-linguistic corpus studies. Using a crowdsourcing approach, we identified and collected 29 translations of Mark Twain’s Adventures of Huckleberry Finn published in 23 languages including less-resourced languages.
We report on the current status of the corpus, with 5 chapter-aligned translations (English-Dutch, two English-Hungarian, English-Polish and English-Russian). We evaluated the correctness of chapter alignment by computing the percentage of common words between the English version and the translated ones. Results show high percentages that vary between 43% and 64%, supporting the correctness of the chapter alignment.

Keywords: parallel corpus, comparable corpus, translated literary texts

1. Introduction

Parallel corpora are a valuable resource for linguistic research and natural language processing (NLP) applications. Such corpora are often used for testing new tools and methods in Statistical Machine Translation (SMT), where large amounts of aligned data are often used to learn word alignment models between two languages (Och and Ney, 2003). The most widely used parallel corpora in computational approaches are the Canadian Hansards (Roukos et al., 1995), which are bilingual (English and French), the United Nations Parallel Corpus (6 languages) (Ziemski et al., 2016), and the European Parliament proceedings (21 languages) (Koehn, 2005). These resources belong to the legal and political sphere.

Another source of parallel corpora that has recently attracted attention is religious texts such as the Bible. This line of research, which entailed the compilation of many parallel texts, has broken new ground and allowed computational linguistics to handle vast corpora. Cysouw and Walchli (2007) introduced the notion of ‘massively parallel corpora’ for texts that have translations into a great number of languages (100+). Although there are not many such texts, those that are available offer an incredibly rich source for computational linguistics researchers. That is why our project taps into relatively unexplored sources for massively parallel corpora: translated literary texts.

A growing number of those texts are now available in electronic form on the internet and they are indexed by public online catalogues such as Wikisource¹, Archive.org², Project Gutenberg³, etc. In this paper, we report on our ongoing research work to compile such a massively parallel literary corpus. This paper is structured as follows. The next section gives an overview of related work on the construction of parallel corpora. Section 3 describes the Mark Twain translation corpus. Section 4 presents our method for data collection. Section 5 outlines the corpus alignment. Section 6 describes the corpus evaluation and discusses the results. Section 7 mentions the expected use of the corpus and, in the last section, we conclude and underline future work.

¹ https://wikisource.org
² https://archive.org
³ https://www.gutenberg.org

2. Related work

Computational linguistics researchers have been exploring different sources for building parallel and comparable corpora. Resnik and Smith (2003) used the Web as parallel text to construct a significant parallel corpus for a low-density language pair.

In accordance with the fast growth of Wikipedia, many works have been published in recent years on its use and exploitation for the construction of parallel corpora (Tomas et al., 2008; Tufiş et al., 2013; Labaka et al., 2016). Other research works used Twitter as a comparable corpus to build multilingual linguistic resources (Fraisse and Paroubek, 2014; Vicente et al., 2016).

There have also been research works which show the potential of the Bible as a source to compile massively parallel corpora (Resnik et al., 1999). Mayer and Cysouw (2014), based on freely available resources, created a Bible corpus with over 900 translations in more than 830 language varieties. Christodouloupoulos and Steedman (2015) built a massively parallel corpus based on 100 translations of the Bible, emphasizing difficulties in acquiring and processing the raw material.

There are also parallel corpora related to translated literary works (e.g. Harry Potter, Le Petit Prince, Master i Margarita) or translations from the web, mostly available for a set of closely related languages (Mayer and Cysouw, 2014; Cysouw and Walchli, 2007). Most of these texts, however, cannot be regarded as massively parallel, are not freely available online, and mainly concern well-endowed, widely known languages.

3. Mark Twain corpus

Mark Twain’s books are some of the most well-travelled texts on the planet. As the UNESCO Index Translationum shows, the American writer is ranked 15th among the 50 most translated authors worldwide. His works have been translated into almost every language in which books are printed (Rodney, 1982). The novel Adventures of Huckleberry Finn (Twain, 1885) is one of the most commonly translated of his books. Rodney (1982) identified 375 translations as of 1976. As UNESCO’s Index Translationum⁴ suggests, hundreds of additional translations have been published in the four decades since Rodney completed his survey. But these two sources are both significantly out of date and incomplete. (For example, UNESCO’s Index Translationum lists 15 translations of the novel in Chinese, but Lai-Henderson (2015) documented 90 Chinese translations.) The scores of languages into which the book has been translated include Afrikaans, Albanian, Arabic, Assamese, Bengali, Bulgarian, Burmese, Catalan, Chinese, Chuvash, Czech, Danish, Dutch, Estonian, Farsi, Finnish, French, German, Georgian, Greek, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Kazakh, Korean, Kirghiz, Latvian, Lithuanian, Macedonian, Malay, Malayalam, Marathi, Norwegian, Oriya, Polish, Portuguese, Romanian, Russian, Serbo-Croatian, Sinhalese, Slovak, Slovenian, Spanish, Swedish, Tamil, Tatar, Telugu, Thai, Turkish, Ukrainian, and Uzbek. In many of these languages, there have been multiple translations over time, reflecting different moments in history, and different ideological perspectives on the part of the translators or publishers, as well as different attitudes towards the US, towards childhood, towards minorities and minority dialects, towards race and racism, etc. Of all the existing translations of Mark [...]

[...] parametrization of the experiment was as follows: as we are looking for translations from all over the world, we did not limit the geographic location of the contributors. Each task consisted of a set of 9 questions (i.e. units in the CrowdFlower terminology), and completing the task earned $0.25 (instead of the $0.15 recommended by CrowdFlower). The task that workers had to complete was in fact complex. First, we asked people to use search engines or online catalogs to look for existing translations in their native language. Then we asked them if they could find the translator’s name, the first year of publication, the publishing house, the URL of the cover, the bibliographic record, and available public digital versions.

Because of the complexity of the task, the crowdsourcing approach did not look like the best option. We assumed that the cultural background of crowdsourcing workers would not allow them to complete the task efficiently, but it turned out that they managed to provide us with valuable and reliable information. One week after launching the job on CrowdFlower, we had received 710 judgements covering 31 different languages. At the top came Spanish (163 responses), Arabic (76), Malay and Indonesian (47), German (46), French (45), Greek (43) and Turkish (39). After data cleaning, we had collected 29 translations in 23 languages, in different formats (html, text, pdf, epub).

4.2. Full-text acquisition

Before collecting the full-text, we checked the reliability of [...]
