A Hebrew-Greek-Finnish Parallel Bible Corpus with Cross-Lingual Morpheme Alignment
Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 4229–4236
Marseille, 11–16 May 2020
© European Language Resources Association (ELRA), licensed under CC-BY-NC

HELFI: a Hebrew-Greek-Finnish Parallel Bible Corpus with Cross-Lingual Morpheme Alignment

Anssi Yli-Jyrä¹,³, Josi Purhonen¹,², Matti Liljeqvist³, Arto Antturi³, Pekka Nieminen³, Kari M. Räntilä³, Valtter Luoto³
¹Department of Digital Humanities, University of Helsinki, 00014, Finland
[email protected]
²Faculty of Theology, University of Helsinki
³Raamatun Tietokirja, Aika[media] Oy

Abstract

Twenty-five years ago, morphologically aligned Hebrew-Finnish and Greek-Finnish bitexts (texts accompanied by a translation) were constructed manually in order to create an analytical concordance (Luoto et al., 1997) for a Finnish Bible translation. The creators of the bitexts recently secured the publisher's permission to release their fine-grained alignment, but the alignment was still dependent on proprietary, third-party resources such as a copyrighted text edition and proprietary morphological analyses of the source texts. In this paper, we describe a nontrivial editorial process starting from the creation of the original single-purpose database and ending with its reconstruction using only freely available text editions and annotations. This process produced an openly available dataset that contains (i) the source texts and their translations, (ii) the morphological analyses, and (iii) the cross-lingual morpheme alignments.

Keywords: parallel texts, word alignment, the Bible, Finnish translation, alignment guidelines, free annotation

1. Introduction

Parallel editions of Bible translations have existed for 1,800 years, but there is now a steadily growing interest in attaching a fine-grained alignment to the numerous translations of the Bible and other parallel texts. The interest brings together linguistics, theology, translation studies and language engineering.

Fine-grained manual text alignment produces valuable linked data sets and gold standards for automatic alignment. Tools for producing such manual annotations range from coarse word group aligners (Melamed, 1998b) to ones that align words with confidence levels (Lambert et al., 2005b) and to ones that link various types of subword units (Yli-Jyrä, 1993; Yli-Jyrä, 1995). The obtained gold datasets support the evaluation of alignment algorithms (Mihalcea and Pedersen, 2003; Lambert et al., 2005a; Cysouw et al., 2007) and the development of methods for automated corpus linguistics (Szymanski, 2012).

The aim of this work is to produce an openly shareable, fine-grained alignment for parallel Bibles. Such openly available language resources catalyse research, sharing, linking and tool development.

1.1. The Aim

Scientific Bible editions and copyrighted translations are seldom freely shareable. Moreover, manually validated annotations and fine-grained alignments are expensive to produce and thus not typically freely available for research and product development.

The research problem of this paper is to overcome the obstacles to producing an openly shareable, manually produced fine-grained alignment for a parallel Bible.

1.2. The Approach

The paper describes a recently completed process to produce a fine-grained and open morpheme alignment between the most important 20th century Finnish Bible translation and the most relevant source texts. The whole process covers the years 1991–2020 and consists of two active phases:

(1) Product Development. The first phase (1991–1997) added a fine-grained alignment on top of a Bible bitext that contained proprietary components because we had no freely available morphology for the Hebrew source text to rely on. The resulting value-added database was then used to produce, in a few months, a commercial handbook product: a comprehensive analytical concordance of the Finnish Bible translation, Iso Raamatun Sanahakemisto (IRS) (Luoto et al., 1997). The concordance was editorially envisioned by Valtter Luoto as an addition to the publisher's long-running series of Bible handbooks (1967–). On the basis of the work-made-for-hire rule, the publisher obtained the copyright to the alignment and had the authentic right to decide whether to share the copyrighted alignment with other parties.

(2) Resource Sharing. The second phase (2018–2020) began when the publisher decided to release the in-house dataset (the alignment) under the CC-BY 4.0 licence. This opening would have been useless without further work on the database. Any remaining proprietary components in the database have now been substituted with open resources by the first two authors. The work resulted in an open resource that contains two bitexts with a fine-grained bilingual alignment. The quality of the whole is comparable to that of the proprietary database.

1.3. The Methodology

In essence, our high-level methodology is a replicable tactic to cope with the delayed availability of open resources. It breaks the production of open annotation resources into the following steps:

1. Scaffold: Set up the proprietary text resources as a scaffold corpus (SC).
2. Addition: Add valuable annotations to the SC and keep its copyright manageable.
3. Switchover: Negotiate an open licence for the copyrighted annotations and replace the proprietary components of the SC with open resources.

The Scaffold and Addition goals were addressed first, during the Product Development phase. Section 3 describes the SC and Section 4 outlines the bilingual annotations. The Switchover goal has been addressed recently, during the Resource Sharing phase. In Section 5, we describe how the linked database was detached from proprietary resources and how we implemented its switchover to open resources. Finally, Sections 6 and 7 summarise and evaluate the results, while Section 8 concludes the paper. In the following section, we describe the background of fine-grained Bible alignment.

2. Fine-Grained Bible Alignment

Fine-grained Bible alignment is defined as a specialised task where the purpose is to indicate how the shared cross-lingual meaning of the Bible is encoded and aligned between the source and the target texts.

2.1. The Bible as a Parallel Text

Parallel texts have existed for a long time, as demonstrated by the Mesha Stele/the Book of Kings (ca. 840 BCE), the Behistun Inscription (ca. 500 BCE) and the Rosetta Stone (196 BCE).

The Greek New Testament (GNT), with more than 2,250 target languages, is the most widely translated text in the world. Many of its target languages have several translations. Besides the languages into which the entire GNT has been translated, some parts of it have been translated into an additional 1,800 languages. The Hebrew Bible (HB, Tanakh), also known as the Old Testament, has been translated into more than 700 languages, with the first translation (Septuagint) dating back to 200 BCE.

The Bible usually refers to the concatenation of the HB and the GNT. It has influential translations such as the Vulgate (in Latin, 405 CE) and the King James Bible (1611 CE). Along with other popular texts like Le Petit Prince (360 languages), it is used as a massively parallel corpus to research languages (Christodouloupoulos and Steedman, 2015; Jawaid and Zeman, 2010; Tiedemann, 2012; Xia and Yarowsky,

Until recently, the main application of word alignment (Brown et al., 1993; Tiedemann, 2011) has been statistical machine translation. Its other applications include lexicon extraction, back-translation, cross-lingual transfer learning, and lexical annotation of the translation. Linguistically fine-grained alignment uses lemmas and linguistic tags to align texts in greater detail (Yli-Jyrä, 1993; Yli-Jyrä, 1995; Li et al., 2014). The increased linguistic precision is potentially welcomed in computer-assisted language learning (Nerbonne, 2000), automatic interlinear glossing (Bow et al., 2003; Samardžić et al., 2015), language typology (Cysouw and Wälchli, 2007), and contrastive translation studies (Doval and Nieto, 2019).

2.3. Aligned Bibles

The earliest known parallel Bible is that of Origen of Alexandria, who compiled the Hexapla (245 CE), a six-column edition of the Hebrew Bible (HB) and its earliest Greek translations. Hundreds of verse-aligned Bible translations are now accessible via Bible.com¹ (1,200 translations), STEP Bible² (453 translations), BibleGateway³ (224 translations), and BibleWebApp⁴.

The division into verses was developed for the HB and the GNT by the mid-16th century. The division is relatively stable (±1 verse offset) across translations and it helps to split long sentences (e.g. the sentence in Eph. 1:3–14).

Strong's Exhaustive Concordance (Strong, 1890) of an English Bible translation (King James Version) encodes a special kind of manual word alignment. This concordance indicates, for each content word occurrence, the lemma of the corresponding source word with a 4-digit reference number. This number is linked to Strong's Hebrew-Aramaic and Greek lexicons. There are also other concordances with similar source correspondences (Young, 1879; Åberg, 1982) and several interlinear and reverse interlinear editions of the Bible, all involving an implicit cross-lingual word alignment.

3. Scaffold Corpus

We began by preparing the necessary text resources as an SC, without necessarily intending to share it. This contained the Finnish translation, the Hebrew and Greek source texts, and the morphological analyses of all texts in the corpus.