Non-Literal Text Reuse in Historical Texts: An Approach to Identify Reuse Transformations and Its Application to Bible Reuse
Maria Moritz1, Andreas Wiederhold2, Barbara Pavlek3, Yuri Bizzoni4, and Marco Büchler1

1 Institute of Computer Science, University of Göttingen
2 Institute for Educational Science, University of Göttingen
3 Minds and Traditions Research Group, Max Planck Institute for the Science of Human History, Jena
4 Department of Philosophy, Linguistics, Theory of Science, University of Gothenburg

Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1849–1859, Austin, Texas, November 1-5, 2016. © 2016 Association for Computational Linguistics

Abstract

Text reuse refers to citing, copying, or alluding to text excerpts from a text resource in a new context. While detecting reuse in contemporary languages is well supported—given extensive research, techniques, and corpora—automatically detecting historical text reuse is much more difficult. Corpora of historical languages are less documented and often encompass various genres, linguistic varieties, and topics. In fact, historical text reuse detection is much less understood, and empirical studies are necessary to enable and improve its automation. We present a linguistic analysis of text reuse in two ancient data sets. We contribute an automated approach to analyze how an original text was transformed into its reuse, taking linguistic resources into account to understand how they help characterize the transformation. It is complemented by a manual analysis of a subset of the reuse. Our results show the limitations of approaches focusing on literal reuse detection. Yet, linguistic resources can effectively support understanding the non-literal text reuse transformation process. Our results support practitioners and researchers working on understanding and detecting historical reuse.

1 Introduction

The computational detection of historical text reuse—including citations, quotations, or allusions—can be applied in many respects. It can help tracing down historical content (a.k.a. lines of transmission), which is essential to the field of textual criticism (Büchler et al., 2012). In the context of massive digitization projects, it can identify relationships between text excerpts referring to the same source. Specifically, detecting copies of the same historical text that have diverged over time (manuscript studies, a.k.a. Stemma Codicum) is an important task.

Although much work exists in the field of natural language processing (NLP), many new challenges arise when processing historical text. The most important challenges are the absence of supporting tools and methods, including an agreement on a common orthography, standardization of variants, and a wide range of clean, digitized text (Piotrowski, 2012; Geyken and Gloning, 2014; Zitouni, 2014). Typical statistical approaches from the field of NLP are difficult to apply to historically transmitted texts, since these often cover a large timespan and, thus, comprise many different writing styles, text variants, or even reuse styles (Büchler, 2013). Our long-term goal is to conceive robust text reuse detection techniques for historical texts. To this end, we need to improve the quantitative empirical understanding of such reuse, accompanied by qualitative empirical studies. However, only few such works exist.

We study less- and non-literal text reuse of Bible verses in Ancient Greek and Latin texts. Our focus is on understanding how the reuse instances are transformed from the original verses. We identify operations that characterize how words are changed—e.g., synonymized, capitalized, or part-of-speech (PoS) information changed. Since our approach uses external linguistic resources, including Ancient Greek WordNet (AGWN) (Bizzoni et al., 2014; Minozzi, 2009) and various lemma lists, we also show how such resources can help detecting reuse and where the limitations are. We complement the automated approach with a qualitative manual analysis. We contribute:

• an automated approach to characterize how text is transformed between reuse and original,
• an application of the approach to two text datasets where reuse was manually identified,
• empirical data based on the automated approach, complemented by a manual identification.

Our resulting datasets1 with rich information about the reuse transformation (e.g., PoS and morphology changes, and words becoming synonyms or hypernyms, among others) can be used as a benchmark for future reuse detection and classification approaches.

1 https://bitbucket.org/mariamoritz/emnlp

2 Related Work

We first discuss why existing reuse detection approaches are not applicable to historical texts, and then present works trying to address this problem.

Historical Text Reuse and Plagiarism Detection. Büchler (2013) combines state-of-the-art NLP techniques to address reuse detection scenarios for historical texts, ranging from near copies to text excerpts with a minimum overlap. He uses the commonly used method of fingerprinting, which selects n-grams from an upfront pre-segmented corpus. While his approach can discover historical and modern text reuse language-independently, it requires a minimum text similarity—typically at least two common features. Recognizing modified reuse is difficult in general. Alzahrani et al. (2012) study plagiarism detection techniques: n-gram-, syntax-, and semantics-based approaches. As soon as reused text is slightly modified (e.g., words changed), most systems fail. Barrón-Cedeño et al. (2013) conduct experiments on paraphrasing, observing that complex paraphrasing along with a high paraphrasing density challenges plagiarism detection, and that lexical substitution is the most frequent technique for plagiarizing. The AraPlagDet (Bensalem et al., 2015) initiative focuses on the evaluation of plagiarism detection methods for Arabic texts. Eight methods were submitted; they turned out to work with high accuracy on external plagiarism detection but did not achieve usable results for intrinsic plagiarism detection.

Corpora. Huge parallel corpora of modern languages are used in fields such as paraphrase generation and detection, typically to train statistical models (Zhao et al., 2009; Madnani and Dorr, 2010). However, such corpora hardly exist for historical languages, or they are copyrighted, such as the TLG digital library (Pantelia, 2014). Especially in the field of modern reuse investigation, aligned corpora are often used, providing a rich source of paraphrasal sentence pairs in one, sometimes multiple, languages. One such corpus is the Microsoft Research Paraphrase Corpus (MSRP), which contains 5,801 manually evaluated paraphrasal sentence pairs in English (Dolan and Brockett, 2005). Ganitkevitch et al. (2013) present a paraphrase database with over 200 million English paraphrase pairs and 196 million Spanish paraphrases. Each paraphrase pair comes with measures such as a paraphrase probability score. In ancient literature, efforts are made to collect Biblical reuse. One such effort is the collection of Ancient Greek and Latin quotations based on the Vetus Latina series and the Novum Testamentum Graecum Editio Critica Maior (Houghton, 2013a; Houghton, 2013b). It contains more than 150,000 Latin citations and about 87,000 Ancient Greek Bible references.

Historical Text Processing in General. Efforts to automatically process ancient texts are made around the Perseus Digital Library project (Crane, 1985), among others. For example, Bamman (2008) presents the discovery of textual allusions in a collection of Classical poetry, using measures such as token similarity, n-grams, or syntactic similarity. This allows finding at least the most similar candidates within a closed library. Some works have focused on text reuse in Biblical Greek text. Lee (2007) investigates reuse among the Gospels of the New Testament, aimed at aligning similar sentences. Using source alternation patterns, among others, the approach uses cosine similarity, source verse proximity, and source verse order. Focusing on high recall, the detection of Homeric quotations in Athenaeus' Deipnosophistai was investigated by Büchler et al. (2012), searching for distinctive words within reuse.

While the approaches above rely on string or feature similarity, Bamman (2011b) attempts to process the semantic space using word-sense disambiguation (Patwardhan et al., 2003; Agirre and Edmonds, 2007). Using a bilingual sense inventory and training set, they classify up to 72% of word senses correctly.

Utilizing Linguistic Resources. Word nets support identifying word relationships. Jing (1998) investigates issues that come with using WordNet (Miller et al., 1990) for language generation. Among others, these comprise issues arising from the adaptation of a general lexicon to a specific domain. These were encountered by using a domain corpus and an ontology to prune WordNet to a certain domain. In our work, we are interested in using linguistic resources (word nets and lemma lists) together

…mapping words contain the same cognate). Our operations are based on a one-word replacement to better quantify the results. Third (RQ2.1), we develop an algorithm that identifies operations by first looking for morphological changes between a word from the reuse and its corresponding candidate from the Bible verse and, in case of no success, by seeking a semantic relation. We apply it to our two datasets