Non-Literal Text Reuse in Historical Texts: An Approach to Identify Reuse Transformations and its Application to Bible Reuse

Maria Moritz1, Andreas Wiederhold2, Barbara Pavlek3, Yuri Bizzoni4, and Marco Buchler¨ 1 1Institute of Computer Science, University of Gottingen¨ 2Institute for Educational Science, University of Gottingen¨ 3Minds and Traditions Research Group, Max Planck Institute for the Science of Human History, Jena 4Department of Philosophy, Linguistics, Theory of Science, University of Gothenburg

Abstract between text excerpts referring to the same source. Specifically, detecting copies of the same historical Text reuse refers to citing, copying or allud- text that have diverged over time (manuscript studies, ing text excerpts from a text resource to a a.k.a., Stemma Codicum) is an important task. new context. While detecting reuse in con- temporary languages is well supported—given Although much work exists in the field of nat- extensive research, techniques, and corpora— ural language processing (NLP), many new chal- automatically detecting historical text reuse is lenges arise when processing historical text. The much more difficult. Corpora of historical lan- most important challenges are the absence of support- guages are less documented and often encom- ing tools and methods, including an agreement on pass various genres, linguistic varieties, and topics. In fact, historical text reuse detection is a common orthography, standardization of variants, much less understood and empirical studies are and a wide range of clean, digitized text (Piotrowski, necessary to enable and improve its automation. 2012; Geyken and Gloning, 2014; Zitouni, 2014). We present a linguistic analysis of text reuse in Typical statistical approaches from the field of NLP two ancient data sets. We contribute an auto- are difficult to apply to historically transferred texts, mated approach to analyze how an original text since these often cover a large timespan and, thus, was transformed into its reuse, taking linguis- comprise many different writing styles, text variants tic resources into account to understand how they help characterizing the transformation. It or even reuse styles (Buchler,¨ 2013). Our long-term is complemented by a manual analysis of a goal is to conceive robust text reuse detection tech- subset of the reuse. Our results show the limi- niques for historical texts. To this end, we need to tations of approaches focusing on literal reuse improve the quantitative empirical understanding of detection. Yet, linguistic resources can effec- such reuse accompanied by qualitative empirical stud- tively support understanding the non-literal text ies. However, only few such works exist. reuse transformation process. Our results sup- port practitioners and researchers working on We study less- and non-literal text reuse of Bible understanding and detecting historical reuse. verses in Ancient Greek and Latin texts. Our focus is on understanding how the reuse instances are trans- formed from the original verses. We identify opera- 1 Introduction tions that characterize how words are changed—e.g., The computational detection of historical text reuse— synonymized, capitalized or part-of-speech (PoS) in- including citations, quotations or allusions —can formation changed. Since our approach uses external be applied in many respects. It can help tracing linguistic resources, including Ancient Greek Word- down historical content (a.k.a., lines of transmis- Net (AGWN) (Bizzoni et al., 2014; Minozzi, 2009) sion), which is essential to the field of textual crit- and various lemma lists, we also show how such icism (Buchler¨ et al., 2012). In the context of mas- resources can help detecting reuse and where the lim- sive digitization projects, it can identify relationships itations are. We complement the automated approach

1849

Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1849–1859, Austin, Texas, November 1-5, 2016. c 2016 Association for Computational Linguistics with a qualitative manual analysis. We contribute: eration and detection, typically used to train statis- tical models (Zhao et al., 2009; Madnani and Dorr, an automated approach to characterize how text • 2010). However, such corpora hardly exist for his- is transformed between reuse and original, torical languages or are copyrighted, such as the an application of the approach to two text • TLG digital library (Pantelia, 2014). Especially in datasets where reuse was manually identified, the field of modern reuse investigation, aligned cor- empirical data based on the automated approach, • pora are often used, providing a rich source of para- complemented by a manual identification. phrasal sentence pairs in one, sometimes multiple Our resulting datasets1 with rich information about languages. One of such is the Microsoft Research the reuse transformation (e.g., PoS and morphology Paraphrase Corpus (MSRP), which contains 5801 changes, and words becoming synonyms or hyper- manually evaluated, paraphrasal sentence pairs in En- onyms, among others) can be used as a benchmark for glish (Dolan and Brockett, 2005). Ganitkevitch et future reuse detection and classification approaches. al. (2013) present a paraphrase database with over 200 million English paraphrase pairs and 196 million 2 Related Work Spanish paraphrases. Each paraphrase pair comes with measures, such as a paraphrase probability score. We first discuss why existing reuse detection ap- In ancient literature, efforts are made to collect Bib- proaches are not applicable to historical texts, and lical reuse. One of such is the collection of Ancinet then present works trying to address this problem. Greek and Latin quotations based on the the Vetus Historical Text Reuse and Plagiarism Detection. Latina series and the Novum Testamentum Graecum Buchler¨ (2013) combines state-of-the-art NLP tech- Editio Critica Maior (Houghton, 2013a; Houghton, niques to address reuse detection scenarios for histor- 2013b). It contains more than 150,000 Latin citations ical texts, ranging from near copies to text excerpts and about 87,000 Ancient Greek Bible references. with a minimum overlap. He uses the commonly used Historical Text Processing in General. Efforts method fingerprinting, which selects n-grams from to automatically process ancient texts are made an upfront pre-segmentized corpus. While his ap- around the Perseus Digital Library project (Crane, proach can discover historical and modern text reuse 1985), among others. For example, Bamman (2008) language-independently, it requires a minimum text presents the discovery of textual allusions in a col- similarity—typically at least two common features. lection of Classical poetry, using measures such as Recognizing modified reuse is difficult in general. token similarity, n-grams or syntactic similarity. This Alzahrani et al. (2012) study plagiarism detection allows finding at least the most similar candidates techniques: n-gram-, syntax-, and semantics-based within a closed library. Some works have focused on approaches. As soon as reused text is slightly modi- text reuse in Biblical Greek text. Lee (2007) investi- fied (e.g., words changed) most systems fail. Barron-´ gate reuse among the Gospels of the , Cedeno˜ et al. (2013) conduct experiments on para- aimed at aligning similar sentences. Using source phrasing, observing that complex paraphrasing along alternation patterns, among others, the approach uses with a high paraphrasing density challenges plagia- cosine similarity, source verse proximity, and source rism detection, and that lexical substitution is the verse order. Focusing on high recall, the detection of most frequent technique for plagiarizing. The Ara- Homeric quotations in Athenaeus’ Deipnosophistai’ PlagDet (Bensalem et al., 2015) initiative focuses was investigated by Buchler¨ et al. (2012), searching on the evaluation of plagiarism detection methods for distinctive words within reuse. for Arabic texts. Eight methods were submitted and While the approaches above rely on string or fea- turned out to work with a high accuracy on exter- ture similarity, Bamman (2011b) attempts to process nal plagiarism detection but did not achieve usable the semantic space using word-sense disambigua- results for intrinsic plagiarism detection. tion (Patwardhan et al., 2003; Agirre and Edmonds, Corpora. Huge parallel corpora of modern lan- 2007). Using a bilingual sense inventory and training guages are used in fields such as paraphrase gen- set, they classify up to 72 % of word senses correctly. 1https://bitbucket.org/mariamoritz/emnlp Utilizing Linguistic Resources. Word nets support

1850 identifying word relationships. Jing (1998) investi- mapping words contain the same cognate). Our oper- gates issues that come with using WordNet (Miller ations are based on a one-word-replacement to better et al., 1990) for language generation. Among others, quantify the results. Third (RQ2.1), we develop an these comprise issues arising from the adaption of a algorithm that identifies operations by first looking general lexicon to a specific domain. These were en- for morphological changes between a word from the countered by using a domain corpus and an ontology reuse and its corresponding candidate from the Bible to prune WordNet to a certain domain. verse and, in case of no success, by seeking for a In our work, we are interested in using linguistic semantic relation. We apply it to our two datasets resources (word nets and lemma lists) together with and investigate the relationships of affected words PoS information to model the transformation process and the literal share. We quantify occurrences of op- of reuse, specifically on an ancient language text to erations and calculate two measures suplem (lemma find limitations when applied to non-literal text reuse. support) and supAGWN (AGWN support) to assess the resources’ coverage for our approach. Fourth 3 Methodology (RQ2.2), we manually analyze a smaller sample of our reuse datasets, using further operations, to under- Our study addresses two main research questions: stand the full richness of the reuse. RQ1. What is the extent of non-literal reuse in our datasets? This analysis provides a baseline for the 3.2 Datasets following characterizations of the non-literal reuse. We use the following two text sources, both reusing RQ2. How is the non-literally reused text modified content from Bible verses. As a ground truth of the in our datasets? We study kinds and frequencies of reuse, we use manually annotated versions of both, semantic, lexical, and morphological changes. We provided to us by Mellerin (2014) and the Biblindex develop an automated approach to identify the reuse project (Mellerin, 2016; Vinzent et al., 2013). transformation, and complement it with a manual, Our first dataset comes from the primary source qualitative analysis. We formulate two sub-questions: text of “Salvation for the Rich” from the An- RQ2.1. How can linguistic resources support the cient Greek writer Clement of Alexandria (Clement´ discovery of non-literal reuse? We conjecture that d’Alexandrie, 2011), a well-known author in Bibli- non-literal reuse is difficult to capture automatically cal literature (Cosaert, 2008). The Biblindex team (especially due to domain- or author-specific words), annotated 128 text passages as Bible reuse instances, but that taking linguistic resources into account helps. adding a footnote with Bible verse pointers to each. We analyze the coverage of words in lemma lists and We select a total of 95 out of these 128, following a synset database, and investigate how useful they four criteria: (i) reuse should not consist of an exact are for understanding the reuse transformations. literal copy of a Bible verse (skipping six instances), RQ2.2. What are the limitations of an automated (ii) reuse should be recognizable by our expert (skip- classification approach relying on linguistic re- ping ten instances), (iii) the reference frame should sources? Our manual analysis investigates the reuse be within five Bible verses (comparable with sen- in its full richness, to understand the limitations of tences) to avoid too much noise in our data to ensure the automated approach and identify further charac- a comparable length to the original Bible verse (skip- teristics of the reuse in our datasets. ping nine instances), and (iv) reuse instances should not exceed a length of 40 tokens (1–2 sentences), 3.1 Study Design again to cut the long tail and avoid too much noise Our study comprises the following main steps. First (skipping eight instances). Sometimes one reuse in- (RQ1), we identify and characterize the literal and stance pointed to different Bible verses or one text non-literal overlap in reuse instances. Second (to- passage contained more than one reuse instance, thus, wards RQ2), we define operations reflecting literal we come up with 199 verse-reuse-pairs. The excerpts reuse, replacements (inspired by semantic relation- point to a total of 15 Bible books. ships, such as synonyms and hyperonyms, supported Our second dataset are extracts from a total of by AGWN), and morphological changes (e.g., when 14 volumes of twelve works and two work collec-

1851 Jer si occultabitur vir in absconditis et ego non videbo eum dicit 23 24 Dominus numquid non caelum et terram ego impleo ait Dominus Algorithm 1: Reuse classification algorithm (Can anyone hide himself in secret places that I will not see him? Said the lord. Do not I fill heaven and earth? Said the Lord) /* Executed for each reuse instance and its corresponding literal et terram ego impleo (and I fill the earth) Bible verse. morph(x) returns the part-of-speech Mk Ἤρξατο λέγειν ὁ Πέτρος αὐτῷ, Ἰδοὺ ἡμεῖς ἀφήκαμεν πάντα καὶ and/or case of x. repl case and repl pos are masked to 10 30 ἠκολουθήκαμέν σοι. (Peter began to say to him: See, we left repl morph for clarity reasons. checkm(x,y) returns everything and followed you.) NOPmorph(morph(x),morph(y)) if morph(x) equals morph(y) literal ἡμεῖς ἀϕήκαμεν πάντα καὶ ἠκολουθήσαμέν σοι (we left everything and repl morph(morph(x),morph(y)) otherwise. */ and followed you) input :L set of word-lemma pairs obtained from the lemma resources ← Prv impius cum in profundum venerit peccatorum contemnit sed input :S set of synsets from AGWN; each synset contains an id and a parent id ← 18 3 sequitur eum ignominia et obprobrium (When the wicked man is input :T list of words of reuse instance (containing part-of-speech information) ← come into the depth of sins, also contempt comes but ignominy and input :B list of words of Bible verse (containing part-of-speech information) ← reproach follow him) output :OP list of sets containing up to 3 parameterized operations more Impius , cum venerit in profundum malorum , contemnit (When s1, s2 any← two synsets S. literal the wicked man is come into the depth of evil) tmp op← temporary variable∈ which presents the absence of a relation but not of a 1Cor νυνὶ δὲ μένει πίστις , ἐλπίς , ἀγάπη , τὰ τρία ταῦτα μείζων δὲ lemma. ← 13 13 τούτων ἡ ἀγάπη (And now remain faith, hope, love, these three; but for t in T do the greatest of those is love.) for b in B do less πίστει καὶ ἐλπίδι καὶ ἀγάπῃ (faith, and hope, and love - in dative if t=b then literal case) OP OP (NOP (t, b), checkm(morph(t), morph(b))) ← ∪ less ἀγάπην , πίστιν , ἐλπίδα (love, faith, hope - in accusative case) break literal else if lowerCase(t) = b then less μένει δὲ τὰ τρία ταῦτα , πίστις , ἐλπίς , ἀγάπη · μείζων δὲ ἐν OP OP (lower(t, b), checkm(morph(t), morph(b))) literal ← ∪ τούτοις ἡ ἀγάπη (and remain these three, faith, hope, love; but the break greatest among them is love) else if lowerCase(b) = t then Mt ὁ ἀγαθὸς ἄνθρωπος ἐκ τοῦ ἀγαθοῦ θησαυροῦ ἐκβάλλει ἀγαθά , καὶ OP OP (upper(t, b), checkm(morph(t), morph(b))) ← ∪ 12 35 ὁ πονηρὸς ἄνθρωπος ἐκ τοῦ πονηροῦ θησαυροῦ ἐκβάλλει πονηρά . break (A good man out of good storage brings out good things , and an else if t L and b L then ∈ ∈ evil man out of the evil storage brings evil things .) /* lemma found for original (b) and reuse word (t) */ non- Ψυχῆς , τὰ δὲ ἐκτός , κἂν μὲν ἡ ψυχὴ χρῆται καλῶς , καλὰ καὶ if lemma(t) = lemma(b) then literal ταῦτα δοκεῖ , ἐὰν δὲ πονηρῶς , πονηρά , ὁ κελεύων ἀπαλλοτριοῦν OP OP (lem(t, b), checkm(morph(t), morph(b))) ← ∪ τὰ ὑπάρχοντα ([are whitin the] soul, and some are out, and if the break soul uses them good, those things are also thought of as good, but if else if t s1 and b s2 and s1 S and s2 S then ∈ ∈ ∈ ∈ [they are used as] bad, [they are thought of as] bad; he who if s1 = s2 then commands the renouncement of possessions) /* t is synonym of b */ OP OP (repl syn(t, b)) break ← ∪ Figure 1: Examples of reuse else if id(s1) = parent id(s2) then /* t is hyperonym of b */ OP OP (repl hypo(t, b)) break ← ∪ tions from the Latin writer Bernard of Clairvaux. We else if parent id(s1) = id(s2) then /* t is hyperonym of b */ OP OP (hyper(t, b)) break again use Biblindex’ extracted Bible reuse, which of- ← ∪ else if parent id(s1) = parent id(s2) then fers over 1,100 reuse instances in alphabetical order. /* synset of t and synset of b both have the same synset as parent */ OP OP (repl cohypo(t, b)) break We follow the same selection criteria as for Greek ← ∪ and—starting top-down and dropping only two—we else tmp op (no rel found(t, b)) obtain 162 Bible-verse reuse-pairs, which is similar ← to the number of Greek reuse instances. Specifi- end if tmp op then cally, since those reuse instances come from several OP OP tmp op ← ∪ else different primary source works, they point to a to- OP OP (lemma missing(t)) tal of 31 Bible books. We use the Bible editions ← ∪ end from Biblindex, specifically, the data based on Septu- return OP agint (Rahlfs, 1935b), Greek New testament (Aland and Aland, 1966), and Biblia sacra juxta vulgatam versionem (Weber R., 1969 1994 2007). hope, and love, the key words in the original verse. Fig.1 shows reuse examples, illustrating the wide 3.3 PoS Tagging range of literalness in our data, comprising literal (all tokens overlap), less literal (important tokens over- The automated and the manual approach also take lap), and non-literal (no content word tokens overlap) PoS information into account to understand the reuse. For example, Clement’s reuse ranges from in- reuse transformation. Following the Greek morphol- troducing the overall topic by citing multiple verses, ogy tagging system of Perseus (Bamman and Crane, 2011a), which maps PoS and case information to sin- to supporting his argumentation. Specifically, Mk 2 10 30 is a fully literal reuse from a passage that dis- gle characters , we manually PoS-tag the 199 reuse cusses the problem of rich men in heaven. Clement instances of Ancient Greek and the 162 of Latin, as uses this episode as a main point in his essay. Later well as the original Bible verses. Since Latin and he refers to 1Cor 13 13, he again refers to how hard Ancinet Greek PoS-taggers lack available implemen- it would be for rich men to enter heaven, explaining tations, appropriate trained models or simply accu- that salvation is independent of “external things,” but 2http://nlp.perseus.tufts.edu/syntax/ depends on the “virtue of the soul,” mentioning faith, treebank/agdt/1.7/docs/README.txt

1852 lemma coverage1 AGWN coverage2 total3 (LXX), a translation of the Old Testament (Rahlfs, 4 corpus lem. syn. hyper. hypo. co-hypo. 1935a) from the Center for Computer Analysis of Greek Bible4 3238 1906 1422 1185 1422 4776 Texts at UPenn. We acknowledge code-page cor- Clement5 739 326 231 175 231 2189 rections by M. Munson. SBLGNT&LXX provide 4

CLTK Latin Bible 2473 1241 905 863 905 2618 Bernard5 1219 643 471 455 471 1335 59,510 word-lemma-pairs. Greek Bible4 752 103 58 67 58 4776 We use AGWN (Bizzoni et al., 2014), which also Clement5 455 54 24 33 24 2189 contains Latin WordNet (Minozzi, 2009), to identify Latin Bible4 2473 1365 1057 1023 1057 2618

Biblindex synsets (sets of synonyms) as well as hyperonyms, Bernard5 1219 701 531 520 531 1335 hyponyms, and co-hyponyms. From the wordnets’ Greek Bible4 4718 3385 2616 2092 2616 4776 Clement5 1297 824 582 421 582 2189 98,950 synsets 33,910 synsets contain Ancient Greek Latin Bible4,6 n/a n/a n/a n/a n/a 2618

& LXX and 27,126 synsets contain Latin words. SBLGNT Bernard5,6 n/a n/a n/a n/a n/a 1335 Coverage. Table1 shows the coverage of each re- Greek Bible4 4723 3449 2684 2156 2684 4776 Clement5 1548 899 653 495 653 2189 source for our datasets. In the lower part of it we Latin Bible4 2473 1378 1057 1023 1057 2618 merge all lemma resources into one set of word- 5 combined Bernard 1219 706 531 520 531 1335 lemma pairs. The table shows that CLTK covers 1 number of tokens found by lemma resource the Bible data better than the Hellenistic Greek as 2 number of lemmatized tokens covered by AGWN 3 number of tokens in original and reuse used in Clement of Alexandria, an author from 2nd 4 original 5 reuse 6 no support for Latin century AD, writing in an archaic style with Biblical Table 1: Coverage of tokens by language resources vocabulary, while also being influenced by Classi- cal Greek. We also check the coverage of lemmata stemming from the same source (Biblindex) as our racy (Crane, 1991; vor der Bruck¨ et al., 2015), we reuse. To increase the coverage for Greek, we consult perform this step manually to assure high accuracy. SBLGNT&LXX, which in fact increases it. To not We also assign cases for the classes noun, article, ad- miss important information, we integrate all of the jective, and pronoun. We introduce b to represent the resources’ data into our approach. For every lemma Latin ablative case, which does not exist in Greek. of a word we check the semantic relations in AGWN. 3.4 Automated Approach We experimented with different ways of looking up lemmas and found that lower-casing all Latin tokens Our approach is to model the transformation process improved the success. For Greek, it had the opposite in terms of parameterized operations applied to the effect, which indicates that the Greek text contains words in the reuse instance in order to obtain the more entities that are not available in lowercase in original words. These operations use linguistic re- the lemma lists, so we did not change in that case.5 sources, such as lemma lists of classical Greek and Operations and Classification. We define replace- Biblical Koine, and a synset database. For each trans- ment operations using words and PoS as parame- formation, we create the set of operations necessary ters, to transform a reuse instance back into the to transform the reuse instance to its original. Bible verse it originates from. Table2 lists the op- Linguistic Resources. We investigate the following erations for the computational approach. We in- lemma lists to look up lemmatized forms of words—a troduce the operations NOPmorph, repl pos, and prerequisite for looking up synsets: Classical Lan- repl case for words having the same cognate, and guage Tool Kit (CLTK) (Johnson et al., 2014 2016) lemma missing(reuse word) when a word is not provides Ancient Greek and Latin lemma lists for 953,907 Greek and 270,228 Latin words. Biblin- 4CATSS LXX is prepared by the Thesaurus Linguae Grae- dex’ Lemma Lists contain entries for 65,537 Biblical cae project directed by T. Brunner at UC Irvine, with further verification and adaptation by CATSS towards conformity with Greek and 315,021 Latin words. SBLGNT&LXX the individual Gottingen¨ editions which appeared since 1935. refers to the Greek New Testament of the Society of LXXM is morphologically analyzed text of CATSS LXX pre- Biblical Literature (SBLGNT)3 and the Septuaginta pared by CATSS led by R. Kraft (Philadelphia team) 5Often, the decision on whether to represent a word in upper 3Logos Bible Software, Sbl new testament, 2014 http: or lower case letters is made by the editor, thus, our decision is //sblgnt.com/about/ affected by the edition we use for our research.

1853 operation description example NOP(reuse word, orig word) Original and reuse word are equal. NOP(maledictus,maledictus) upper(reuse word, orig word) Word is lowercase in reuse and uppercase in original. upper(kai,Kai) - in Greek lower(reuse word, orig word) Word is uppercase in reuse and lowercase in original. lower(Gloriam,gloriam) lem(reuse word, orig word) Lemmatization leads to equality of reuse and original. lem(penetrat,penetrabit) repl syn(reuse word, orig word) Reuse word replaced with a synonym to match original word. repl syn(magnificavit,glorificavit) repl hyper(reuse word, orig word) Word in bible verse is a hyperonym of the reused word. hyper(cupit,habens) repl hypo(reuse word, orig word) Word in bible verse is a hyponym of the reused word. hypo(dederit,tollet) repl co-hypo(reuse word, orig word) Reused word and original have the same hyperonym. repl co-hypo(magnificavit,fecit) NOPmorph(reuse tags, orig tags) Case or PoS did not change between reused and original word. NOPmorph(na,na) repl pos(reuse tag, orig tag) Reuse and original contain the same cognate, but PoS changed. repl pos(n,a) repl case(reuse tag, orig tag) Reuse and original have the same cognate, but the case changed repl case(g,d) - cases genitive, dative lemma missing(reuse word, orig word) Lemma unknown for reuse or original word lemma missing(tentari, inlectus) no rel found(reuse wword, orig word) Relation for reuse or original word not found in AGWN no rel found(gloria,arguitur) Table 2: Operation list for the automated approach known to any of our lemma resources as well as of replacement operations: those from the upper part no rel found(reuse word, orig word) when the rela- of Table2 (without upper and lower), and instead of tionship between a reuse word and each potential only using repl case when a cognate stays the same, word from the original is not covered by AGWN. we refine it and assign all changing morphological Algorithm1 shows our approach to classify the categories from Perseus’ tag set for any “relativeness” reuse transformation by identifying the operations. between two words (e.g., repl case a g). For each reuse token, we identify the first applicable operation matching the foremost Bible verse word (it- 4 Results erating the verse) in the following order: exact word We now present the results for our research questions match (NOP: no operation), case changed to upper in Sec. 4.1–4.3, which are summarized and further or lower. Thereafter, we look up the lemma and re- interpreted in Sec. 4.4. turn lem if the lemma of the reused word matches the lemma of the original. For these four, we also 4.1 Literal Share of the Reuse (RQ1) check the morphology, in addition returning whether We obtain a first understanding of the reuse by look- the original has the same PoS and case (NOPmorph) ing at the percentage of overlapping words between or whether PoS changed (repl pos), case changed reuse instance and original Bible verse. We measure (repl case), or both. So up to three operations can be the longest common substring based on word tokens. returned per word. Finally, we check for synonyms Fig.2 shows the distributions, distinguishing between (repl syn), hyperonyms (hyper), hyponyms (hypo), a lemmatized and non-lemmatized word comparison. and co-hyponyms (repl co-hypo), but do not check While lemmatizing words before comparison has morphology. If a Bible verse word is used as a match, only a small impact, we observe differences between it is not used again for any other word from the reuse. the datasets. In our Latin dataset, the overlap is sig- nificantly higher than in the Greek dataset Sec. 3.2. 3.5 Qualitative Approach 25 % (upper quartile) of Bernard’s reuse instances have 50 % or more tokens overlap with their original, To obtain a deeper understanding of the limitations of which is only the case for less than 25 % in Clement’s linguistic resources for our purpose, two graduate stu- Greek data. Still, large overlaps of up to 75 % (top dents (one Latinist, one Classical Archeologist) man- ually analyze 100 Greek and 60 Latin reuse instances 1 with their expert knowledge, using an extended set 0.5 of operations. It comprises ins(word) (insert a word) 0 and del(word) (delete a word)—two operations we ig- non-lem. lem. non-lem. lem. nore in the automated approach where we focuse on Figure 2: Ratios ofleft: literal Greek, right:overlaps Latin between reuse the coverage of the resources. It also has a richer set instance and original (left: Greek, right: Latin)

1854 1 0.5 4 0 2 unclass. literal nonlit. unclass. literal nonlit. 0 1 0.5 left: Greek, right: Latin NOPupper lem syn Figure 3: Ratios of unclassified, literal, and non-liter- lower hyperhypo poscase rel 0 repl co-hypo no repl repl missing log(# functions used) repl repl al words in reuse instances (left: Greek, right: Latin) NOPmorph reuse length repl lem whisker) in our Greek and up to around 90 % in our functions Latin dataset exist—so a small fraction of the reuse (a) Greek contains literal parts Sec. 3.2. For a more precise understanding of the literalness, we group operations into literal (NOP, upper, lower, 4 2 lem), non-literal (repl syn, repl hyper, repl hypo, 0 1 0.5 repl co-hypo), and unclassified (no rel found and NOPupper lem syn lower hyperhypo poscase rel 0 repl co-hypo no repl repl missing lemma missing). Within each reuse instance, we cal- log(# functions used) repl repl repl NOPmorph reuse length culate their relative occurrence using the results of lem the automated approach (explained shortly). Fig.3 functions shows the distribution of these relative occurrences (b) Latin for all reuse instances. It confirms Fig.2 by show- Figure 4: Occurrence of operations in reuse in- ing a higher rate of literalness for Latin compared to stances. X-axis: operations; Y-axis: relative position Greek. In summary, it also shows that the Latin reuse within reuse instances. Z-axis: natural logarithm of can be better classified by our approach, which takes number of operations. Values are smoothed by spline the lemma lists and AGWN into account. interpolation. The order of operations is arbitrary.

4.2 Automated Approach (RQ2.1) length without a particular trend in both datasets. We Table3 shows the total number of operations identi- only encounter a frequent use of upper at the first fied for the transformation from reuse instances to position in Latin, which means that Bernard often the Greek and Latin originals. For 987 (45 %) out of starts his Biblical references with literal Bible words. 2189 words in the Greek instances and for 893 (67 %) After having checked the overall coverage of the out of 1335 words in the Latin instances, we were linguistic resources for all tokens (cf. Sec. 3.4), able to identify at least one operation, which already we now specifically investigate to what extent the indicates to what extent the resources are helpful. resources support identifying the reuse transforma- Fig.4 visualizes the distribution of the frequencies tion for the non-literal reuse using our approach. (y-axis) of each operation (x-axis) together with the We introduce the measures suplem and supAGWN distribution of the operations’ positions in the reuse to calculate how often looking up a lemma or instances (z-axis). The latter is calculated as the rela- subsequently a synset element was successful. tive position p [0..1] of an operation with respect to This is easy based on our operations. Let Occ(o) ∈ the length of the reuse instance. It indicates that most be the number of occurrences of an operation operation types are distributed over the whole reuse o, obtained from Table3. The operations that successfully looked up a lemma (before consulting AGWN) are lem success= lem, syn, repl hyper, NOP upper lower lem syn hyper hypo co-hypo { repl hypo, repl co-hypo, no rel found . Now recall Occ. Greek 337 6 0 356 153 20 14 101 } Occ. Latin 587 0 44 102 60 14 28 68 that lem missing represents the case when a reuse NOPmorph repl pos repl case no rel found lem missing token was not found in the lemma resources. Then o lem success Occ(o) ∈ Occ. Greek 420 49 258 563 639 suplem = o lem success lem missing . We obtain Occ. Latin 617 46 75 347 85 Occ(o)P∈ ∪{ } a suplem ofP 0.65 for the Greek reuse and 0.88 for the Table 3: Absolute numbers of operations identified Latin reuse. Similarly, the operations that success- automatically fully looked up from AGWN are agwn success= syn, { 1855 operation Greek Latin operation Greek Latin operation Greek Latin operation Greek Latin repl syn 78 (40.6%) 91 (40.4%) repl gender 6 (3.1%) 1 (0.4%) repl case a b 0 6 repl case g a 5 2 repl ant 1 (0.5%) 0 repl mood 11 (5.7%) 12 (5.3%) repl case a n 9 4 repl case g n 4 2 repl hyper 3 (1.6%) 0 repl number 17 (8.9%) 17 (7.6%) repl case b a 0 10 repl case n a 7 5 repl hypo 11 (5.7%) 0 repl person 5 (2.6%) 14 (6.2%) repl case d a 0 2 repl case n d 3 0 lem 1 (0.5%) 2 (0.9%) repl pos 18 (9.4%) 33 (14.7%) repl case d g 3 0 repl case v g 0 2 repl co-hypo 0 1 (0.4%) repl tense 3 (1.6%) 9 (4.0%) repl case d n 5 0 repl case 38 (19.8%) 36 (16%) repl voice 0 8 (3.6%) Table 5: Numbers of case replacements Table 4: Numbers of replacement operations identi- fied for the manual reuse transformation. with its antonym6; once, a synonym also changes its PoS. Four times, more than one morphological cate- repl hyper, repl hypo, repl co-hypo , with gory changes, twice an auxiliary is deleted, and five } no rel found representing a failed lookup. Then: times inserted. We find one writing variance (lem), o agwn success Occ(o) ∈ and three times a synonym is replaced by a multi- supAGWN = o agwn success no rel found . We Occ(o)P∈ ∪{ } word expression. In the Latin data, in 16 cases a obtain sup of 0.34 for Greek and 0.33 for Latin. AGWNP synonym is replaced and morphological information These values can be interpreted as follows. The changed. Seven times, more than one morphological lemma resources for genre- and time-specific text parameter changes for the same cognate. Eight times, work well for less-literal reuse, but the resources for an auxiliary is inserted or deleted, and twice, a writ- semantic relationships (synset databases) show a lack ing variance is encountered. A synonym is replaced of support and need further development. by more than one word five times. In one case, a reuse is too paraphrasal for any word to match se- 4.3 Qualitative Approach (RQ2.2) mantic relationships (e.g., judged calmly—Bernard We manually identify the transformation operations vs. fake friend - Sal 12 18). for 60 reuse instances of the Ancient Greek data and for 100 of the Latin data. Here, NOPs cover 9.3 %, 4.4 Summary and Discussion insertions 49.8 %, and deletions cover 30.5 % in the RQ1. The reuse is significantly non-literal and only Greek data. NOPs cover 26.1 %, insertions 49.7 %, lemmatizing words does not help discovering it. Our and deletions 11.9 % in the Latin data. results show that reuse in two substantial histor- Table4 shows the ratios of the various repl oper- ical texts requires techniques beyond simple pre- ations based on the remaining 10.4 % and 12.2 %. processing (e.g., stemming or lemmatizing), which Similar to the automated approach, we observe a explains why plagiarism-detection systems fail when strong use of synonyms and other semantic-level op- paraphrases are used (Alzahrani et al., 2012). Bible erations, and also a certain portion of switching mor- verses are often used to justify an author’s claim, so phological categories, which indicates para-phrasal only relevant parts of the Bible verse are reused. In reuse. In the Greek data, PoS changes cover about the reuse the Bible verse is modified to better fit the 9%, out of which a participle became a verb (7 times) syntactical and semantic context of an author’s new and vice-versa (5 times). In our Latin data, PoS text, as shown in Tables4 and5. changes represent 15% of replacements: often a pro- RQ2.1. The results from our automated approach noun changed to a noun (6 times) and a participle are encouraging, showing the feasibility of extending became a verb (12 times). Case changes are shown reuse-detection techniques with linguistic resources. in Table5. Significantly often, an ablative became Yet, it is not clear which precision and recall could an accusative, because often changing prepositions be achieved and how existing techniques need to be expect different cases, or an accusative was replaced adapted and calibrated. This investigation is beyond by an ablative or nominative, because para-phrasal the scope of this study and subject to our future work. expression changed. The linguistic resources support the automated We encounter exceptions that prevent applying the 6Translation: “the God, the good (one)” (Clement) vs. “none operations. In the Greek data, one word is replaced is good but the God” (Bible).

1856 1 characteristics with respect to the literal reuse, we 0.5 create Fig.5. It shows the overlap of the whole 1128 0 instances of Bernard’s extracted reuse, which when non-lem. lem. compared to Fig.2 (right) supports the representa- Figure 5: Ratios of literal overlapsLatin in the whole Latin tiveness of our sample. Last, we can only derive dataset operation replacements when a word token was cov- approach, but only for about one third of the lookups. ered by the lemma sources, contained in AGWN, and The manually identified exceptions show that finding when there actually exists a relation between two a connection between original verse and reuse can be words. Also, our authors’ vocabulary can differ in difficult when there is only a vague semantic one. terms of domain knowledge, personal idiolect, and RQ2.2. Our results show that the automated ap- age of the Biblical vocabulary. proach cannot capture the richness of the manual 6 Conclusion approach. Especially from the exceptions, it is clear that less-literal reuse does not only need information We presented a study of historical—and mostly non- from a word’s semantic environment, but also that literal—text reuse. We automatically and manually it needs to be identified by looser relations, such as characterize the reuse and identify to what extent ex- co-hyponyms, multi-to-multi-word associations or isting linguistic resources are able to cover non-literal implicit meanings, which can be hidden in structural text reuse. Our results show the potential as well as or more broader expert knowledge. the necessity to develop robust techniques and to ex- tend linguistic resources for analyzing and detecting 5 Threats to Validity such reuse. Our results can help to enhance para- phrase generation to model automatic ways on how External Validity. We enhance the external valid- small text portions can be rephrased. Considering the ity of our work by focusing on Bible verses—one effects of syntactic rearrangement of reuse can also of the oldest, most conveyed, and cited sources of support such efforts. A smarter automated approach Ancient Greek, offering a vast amount of primary for deriving an original text excerpt would be learn- source text and also coming with a long history of ing so-called edit scripts (Kehrer, 2014; Chawathe scholars studying it. Clement of Alexandria is known et al., 1996), which more precisely identify opera- for his retelling of biblical excerpts (Clemens, 1905 tions an author performed on a text to transform it 1909; Freppel, 1865), providing an interesting base into another version. Whether learning edit scripts for reuse investigation. The french abbot Bernard on such intricate transformations is possible is an of Clairvaux (Smith, 2010) is equally known for his open question and valuable future research. Finally, influence to the Cistercian order and his work in analyzing further languages and data sets helps to . Furthermore, the chosen lemma re- further complete our findings. sources are the most extensive ones existing for An- cient Greek and Latin. We chose the AGWN, since it Acknowledgments is freely available, offering one of the largest synset We thank Laurence Mellerin for providing the database for Ancient Greek and Latin. datasets we used, and for valuable advice on its con- Internal Validity. A threat is that our ground truth tent. Our work is funded by the German Federal Min- has mistakes, as the PoS tagging was done by one istry of Education and Research (grant 01UG1509). author only and relied on a manual post-correction. The selection criteria in Sec. 3.2 were chosen to en- References sure quality and comparability. Extreme outliers in the length of the reuse instance or source (multiple [Agirre and Edmonds2007] Eneko Agirre and Philip Ed- monds. 2007. Word Sense Disambiguation - Algo- Bible verses) are cut-off. For Greek, 33 are cut-off, rithms and Applications. Springer Netherlands. as opposed to Latin, where our sample is significantly [Aland and Aland1966] Kurt Aland and , smaller than the whole population that we have. To editors. 1966. The Greek New Testament. Deutsche automatically check whether the sample has similar Bibelgesellschaft-United Bible Societies, 27 edition.

1857 [Alzahrani et al.2012] Salha M. Alzahrani, Naomie Salim, [Clemens1905 1909] Titus Flavius Clemens. 1905-1909. and Ajith Abraham. 2012. Understanding plagiarism Werke (in greek). In Otto Stahlin,¨ editor, Die Griechis- linguistic patterns, textual features, and detection meth- chen Christlichen Schrifteller, , v. 12, 15, 27. ods. Trans. Sys. Man Cyber Part C, 42(2):133–149. Leipzig. [Bamman and Crane2008] David Bamman and Gregory [Clement´ d’Alexandrie2011] Clement´ d’Alexandrie, 2011. Crane. 2008. The logic and discovery of textual allu- Quel riche peut-etreˆ sauve´. editions´ du Cerf, Paris, sion. In LaTeCH (Language Technology for Cultural sources chrtiennes 537 edition. Heritage Data), Marrakech Morocco. LREC. [Cosaert2008] Carl P. Cosaert. 2008. The Text of the [Bamman and Crane2011a] David Bamman and Gregory Gospels in Clement of Alexandria. New Testament Crane. 2011a. The ancient greek and latin depen- in the Greek Fathers. Society of Biblical Literature. dency treebanks. In Caroline Sporleder, Antal van den [Crane1985] Gregory Crane. 1985. Perseus digital Bosch, & Kalliopi Zervanou (Eds) Language technol- library. http://www.perseus.tufts.edu/ ogy for cultural heritage: Selected papers from the hopper/. LaTeCH Workshop Series, pages 79–98, Berlin, Ger- [Crane1991] Gregory Crane. 1991. Generating and pars- many. Springer-Verlag. ing classical greek. Literary and Linguistic Computing, [Bamman and Crane2011b] David Bamman and Gregory 6(4):243–245. Crane. 2011b. Measuring historical word sense varia- [Dolan and Brockett2005] Bill Dolan and Chris Brockett. tion. In Proceedings of the 11th ACM/IEEE-CS Joint 2005. Automatically constructing a corpus of senten- Conference on Digital libraries (JCDL 2011), pages tial paraphrases. In Third International Workshop on 1–10. ACM Digital Library. Paraphrasing (IWP2005). Asia Federation of Natural [Barron-Cede´ no˜ et al.2013] Alberto Barron-Cede´ no,˜ Language Processing. Marta Vila, M.Antonia` Mart´ı, and Paolo Rosso. [Freppel1865] Charles-Emile Freppel. 1865. Clement 2013. Plagiarism meets paraphrasing: Insights for d’Alexandrie. the next generation in automatic plagiarism detection. [Ganitkevitch et al.2013] Juri Ganitkevitch, Benjamin Van Computational Linguistic, 39(4):917–947. Durme, and Chris Callison-Burch. 2013. PPDB: The [Bensalem et al.2015] Imene Bensalem, Imene Boukhalfa, paraphrase database. In Proceedings of NAACL-HLT, Paolo Rosso, Lahsen Abouenour, Kareem Darwish, pages 758–764, Atlanta, Georgia. Association for Com- and Salim Chikhi. 2015. Overview of the AraPlagDet putational Linguistics. PAN@FIRE2015 Shared Task on Arabic Plagiarism [Geyken and Gloning2014] Alexander Geyken and Detection. In FIRE 2015 Working Notes Papers, 4-6 Thomas Gloning. 2014. A living text archive December, Gandhinagar, India, December. of 15th-19th century german: Corpus strategies, [Bizzoni et al.2014] Yuri Bizzoni, Federico Boschetti, technology, organization. In Corpus Linguistics and Harry Diakoff, Riccardo Del Gratta, Monica Mona- Interdisciplinary Perspectives on Language - CLIP. chini, and Gregory Crane. 2014. The making of an- Narr Tbingen. cient greek wordnet. In Proceedings of the Ninth In- [Houghton2013a] H.A.G. Houghton. 2013a. Patristic ev- ternational Conference on Language Resources and idence in the new edition of the vetus latina iohannes. Evaluation (LREC’14), Reykjavik, Iceland. European In L. Mellerin and H.A.G. Houghton, editors, Bibli- Language Resources Association (ELRA). cal Quotations in Patristic Texts (Studia Patristica 54), [Buchler¨ et al.2012] Marco Buchler,¨ Gregory Crane, pages 69–85. Peeters, Leuven. Maria Moritz, and Alison Babeu. 2012. Increasing [Houghton2013b] H.A.G. Houghton. 2013b. The use of recall for text re-use in historical documents to support the latin fathers for new testament . In research in the humanities. In Theory and Practice of B.D. Ehrman and M.W. Holmes, editors, The Text of the Digital Libraries, Lecture Notes in Computer Science, New Testament in Contemporary Research. Essays on volume 7489, pages 95–100. Springer, Berlin Heidel- the Status Quaestionis second edition. NTTSD., pages berg. 375–405. Brill, Leiden. [Buchler2013]¨ Marco Buchler.¨ 2013. Informationstech- [Jing1998] Hongyan Jing. 1998. Usage of wordnet in nische Aspekte des Historical Text Re-use (English: natural language generation. In Proceedings of the Computational Aspects of Historical Text Re-use. Ph.D. Workshop on Usage of WordNet in Natural Language thesis, , . Processing Systems (COLING-ACL’98). Columbia Uni- [Chawathe et al.1996] Sudarshan S Chawathe, Anand Ra- versity Academic Commons. jaraman, Hector Garcia-Molina, and Jennifer Widom. [Johnson et al.2014 2016] Kyle P. Johnson, Patrick J. 1996. Change detection in hierarchically structured Burns, Luke Hollis, Mart´ın Pozzi, Amit Shilo, Stephen information. In ACM SIGMOD Record, volume 25, Margheim, Gitter Badger, and Eamonn Bell. 2014– pages 493–504. ACM. 2016. Cltk: The classical language toolkit. https:

1858 //github.com/cltk/cltk. DOI 10.5281/zen- [Rahlfs1935b] Alfred Rahlfs, editor. 1935b. Septuaginta, odo.44555 v0.1.32. id est Vetus Testamentum Graece juxta LXX interpretes. [Kehrer2014] Timo Kehrer. 2014. Generierung konsisten- Rahlfs. 2 vol., 1950. zerhaltender editierskripte im kontext der modellver- [Smith2010] William Smith. 2010. Catholic Church Mile- sionierung. In Wilhelm Hasselbring and Nils Chris- stones: People and Events That Shaped the Institutional tian Ehmke, editors, Software Engineering 2014, Church. Indianapolis: Left Coast. Fachtagung des GI-Fachbereichs Softwaretechnik, 25. [Vinzent et al.2013] M. Vinzent, L. Mellerin, and H.A.G. Februar - 28. Februar 2014, Kiel, Deutschland, volume Houghton, editors. 2013. Biblical Quotations in Patris- 227 of LNI, pages 57–58. GI. tic Texts (Studia Patristica 54). Theory and Applica- [Lee2007] John Lee. 2007. A computational model of tions of Natural Language Processing. Peeters, Leuven. text reuse in ancient literary texts. In Proceedings of [vor der Bruck¨ et al.2015] Tim vor der Bruck,¨ Steffen the 45th Annual Meeting of the Association of Com- Eger, and Alexander Mehler. 2015. Lexicon-assisted putational Linguistics, Prague, Czech Republic, pages tagging and lemmatization in latin: A comparison of 472–479. Association for Computational Linguistics. six taggers and two lemmatization models. In Pro- [Madnani and Dorr2010] Nitin Madnani and Bonnie J. ceedings of the 9th SIGHUM Workshop on Language Dorr. 2010. Generating phrasal and sentential para- Technology for Cultural Heritage, Social Sciences, and phrases: A survey of data-driven methods. Comput. Humanities (LaTeCH), pages 105–113, Beijing, China, Linguist., 36(3):341–387, September. July. Association for Computational Linguistics. [Mellerin2014] Laurence Mellerin. 2014. New ways of [Weber R.1969 1994 2007] Gribomont J. Weber R., Fis- searching with biblindex, the online index of biblical cher B., editor. 1969, 1994, 2007. Biblia sacra juxta quotations in early christian literature. In Claire Cli- vulgatam versionem. Deutsche Bibelgesellschaft. vaz, Andrew Gregory, and David Hamidovic, editors, [Zhao et al.2009] Shiqi Zhao, Xiang Lan, Ting Liu, and Digital Humanities in Biblical, Early Jewish and Early Sheng Li. 2009. Application-driven statistical para- Christian Studies, chapter 11, pages 175–192. Brill, phrase generation. In Proceedings of the Joint Con- Leiden. ference of the 47th Annual Meeting of the ACL and [Mellerin2016] Laurence Mellerin. 2016. Biblindex. the 4th International Joint Conference on Natural Lan- http://www.biblindex.mom.fr/. guage Processing of the AFNLP: Volume 2 - Volume [Miller et al.1990] George A. Miller, Richard Beckwith, 2, ACL ’09, pages 834–842, Stroudsburg, PA, USA. Christiane Fellbaum, Derek Gross, and Katherine J. Association for Computational Linguistics. Miller. 1990. Introduction to wordnet: An on-line [Zitouni2014] Imed Zitouni. 2014. Natural language pro- lexical database. International Journal of Lexicography cessing of semitic languages. (special issue), 3(4):235–312. [Minozzi2009] Stefano Minozzi, 2009. Innsbrucker Beitrge zur Sprachwissenschaft, volume 137, chapter The Latin WordNet Project, pages 707–716. Institut fr Sprachen und Literaturen der Universitt Innsbruck, Innsbruck. [Pantelia2014] Maria Pantelia. 2014. Thesaurus linguae graecae. http://stephanus.tlg.uci.edu/ index.php. [Patwardhan et al.2003] Siddharth Patwardhan, Satanjeev Banerjee, and Ted Pedersen, 2003. Computational Linguistics and Intelligent Text Processing: 4th Inter- national Conference (CICLing), chapter Using Mea- sures of Semantic Relatedness for Word Sense Disam- biguation, pages 241–257. Springer Berlin Heidelberg, Berlin, Heidelberg. [Piotrowski2012] Michael Piotrowski. 2012. Natural Lan- guage Processing for Historical Texts (Synthesis Lec- tures on Human Language Technologies). Morgan & Claypool Publishers. [Rahlfs1935a] Alfred Rahlfs, editor. 1935a. Septuaginta. Wurttembergische¨ Bibelanstalt, 9 edition. 1971.

1859