Extracting Two Thousand Years of Latin from a Million Book Library
Total Page:16
File Type:pdf, Size:1020Kb
Extracting Two Thousand Years of Latin from a Million Book Library David Bamman The Perseus Project, Tufts University and David Smith Department of Computer Science, University of Massachusetts-Amherst With the rise of large open digitization projects such as the Internet Archive and Google Books, we are witnessing an explosive growth in the number of source texts becoming available to re- searchers in historical languages. The Internet Archive alone contains over 27,014 texts catalogued as Latin, including classical prose and poetry written under the Roman Empire, ecclesiastical trea- tises from the Middle Ages, and dissertations from 19th-century Germany written { in Latin { on the philosophy of Hegel. At one billion words, this collection eclipses the extant corpus of Classical Latin by several orders of magnitude. In addition, the much larger collection of books in English, German, French, and other languages already scanned contains unknown numbers of translations for many Latin books, or parts of books. The sheer scale of this collection offers a broad vista of new research questions, and we focus here on both the opportunities and challenges of computing over such a large space of heterogeneous texts. The works in this massive collection do not constitute a finely curated (or much less balanced) corpus of Latin; it is, instead, simply all the Latin that can be extracted, and in its reach of twenty-one centuries (from ca. 200 BCE to 1922 CE) arguably spans the greatest historical distance of any major textual collection today. While we might hope that the size and historical reach of this collection can eventually offer insight into grand questions such as the evolution of a language over both time and space, we must contend as well with the noise inherent in a corpus that has been assembled with minimal human intervention. Categories and Subject Descriptors: H.3.7 [Information Systems: Information Storage and Retrieval]: Digital Libraries 1. INTRODUCTION In June 2010, Google released over 500 high-quality scans of major Greek and Latin works, curated with the help of Gregory Crane and Alison Babeu at the Perseus Project.1 This collection included a carefully selected group of texts { largely authors from the Classical canon { drawn from the much deeper recesses of Google Books, which at the time had digitized a total of ca. 12 million works [21]. While this carefully selected set of texts with high-quality metadata stands on its own as a classical example of a curated corpus, those darker and more chaotic depths have the promise to yield up a far greater and potentially more valuable set of data. The Internet Archive contains a smaller set of digitized works (ca. 2 million), but all of them are publicly available for download, and 27,014 of these works have been catalogued as Latin from a range of authors, genres, and eras 1http://www.google.com/googlebooks/ancient-greek-and-latin.html ACM Journal Name, Vol. V, No. N, Month 20YY, Pages 1{0??. 2 · Bamman and Smith { the Classical Latin works of Vergil and Cicero, medieval religious authors such as Augustine and Thomas Aquinas, and later scientific writings by the likes of Newton, Copernicus and Kepler. These 27,014 works contain approximately one billion words of Latin, far more than the extant corpus of Classical Latin up to ca. 200 CE (around 10 million words2) and larger still than the largest existing Latin collection (J. Ramminger's Neulateiniche Wortliste [23] at 300 million words), which includes works up to 1700 CE. These 27,014 works also span a total of twenty-one centuries, capturing not only the written native Latin of a Roman elite but also its use as a second language of writers for the two millennia that follow. As others have pointed out, however, problems plague these massive collections in their use for scholarly research, not only in the quality of the image scans and the resulting OCR but also in the metadata itself that describes the texts [18]. While a certain degree of error is to be expected among such large projects focusing on optimizing the usefulness of the average case [14], the first research question we must ask is what tasks such a huge collection is best suited for in the face of such noise. While not a curated or balanced collection in the same vein as the Brown Corpus [13] or a homogenous collection of canonical Classical authors, such a large textual collection has the potential to give us the ability to begin investigating the complex mechanics that govern Latin usage over those two thousand years { a period that involves not only significant linguistic change and a shift in use as a language of native speakers to second language learners, but two thousand years of historically trending ideas as well. 2. PROCESSING To begin looking into these questions, we downloaded all of the 27,014 texts in the Internet Archive that had been classified by a librarian as being written in Latin. Here at once we are confronted with the metadata problems described elsewhere. While many of the works catalogued as Latin are in fact written in that language, 15.7% of a 9,203-work subset that we looked at (see below) were incorrectly marked as such, from Dutch botanical treatises to works in Ancient Greek with a Latin title. Additionally, since almost all of these digitized works are modern editions of historical texts, many contain large sections of text in modern languages such as English, German, French and Italian { not only a preface or introduction at the beginning of the work, but occasionally notes scattered throughout as well. In order to solve both problems { that of incorrect or incomplete metadata and of mixed works { we trained an n-gram language classifier [26] on the complete texts of Wikipedia in 24 different languages and the Ancient Greek and Latin collection of the Perseus Digital Library [6; 8; 7] and used that trained model to create a language fingerprint for each text, allowing us to pinpoint exactly where the Latin showed up in each volume.3 210 million words is an upper bound based on the archive of the Thesaurus Linguae Latinae (TLL), which houses a lexicographical slip for each occurrence of a word in texts from the Classical period [27]. The Packard Humanities Institute CD ROM of Latin, which is fairly comprehensive through 200CE and contains some later materials, holds c. 7.5 million words. 3In the future, we would like to apply this same language identification to the entire collection of books { not simply those manually classified as Latin { in order to detect the Latin in works ACM Journal Name, Vol. V, No. N, Month 20YY. Extracting Two Thousand Years of Latin from a Million Book Library · 3 2.1 Enhancing the metadata In addition to the major languages of the work, the library records that attend each text contain a wealth of other contextual information, including author, title, publisher, subject classifications, and the date of publication. One of our goals in this project, however, is to be able to track the spread of linguistic features within a language and ideas across languages over the two millennia that Latin was used as a lingua franca across Europe. While much of this research operates on the textual data itself, the ability to chart such movement in both space and time requires information on the place and date of a work's composition. The library records available to us, in contrast, report the place and date of publication for a specific edition { which, for historical texts, is often far removed from the time and place of original composition. For establishing the differences in usage between the Latin of Vergil's Aeneid and that of Jean Calvin's Institutio Christianae Religionis, it is far more important for us to know that the former was composed ca. 19 BCE and the latter in 1536 CE than the date of any later printed editions. 15000 10000 count 5000 0 0 500 1000 1500 2000 year Fig. 1. 27,014 works catalogued as Latin in the Internet Archive, charted by date of publication. Figure 1 charts the dates of publication for the 27,014 Latin works in this collec- tion { all of these printed editions are of course dated post-Gutenberg, with the vast where the language has either been misclassified (as a false negative) or the Latin appears too infrequently to be counted as a major language within it. ACM Journal Name, Vol. V, No. N, Month 20YY. 4 · Bamman and Smith majority of them coming from the 19th century (i.e., after the Industrial Revolution but still in the public domain). While this data is not without merit { it may be useful for researching questions on the changing publishing dynamics for Latin editions { what we are interested in is knowing when each of these works was originally composed. To address this, we are supplementing the existing metadata by researching the date of composition for each work. While the text of some authors (like Vergil and Calvin) have more-or- less established composition dates, others (such as those of more obscure medieval authors) do not. The task here { undertaken by three student researchers4 { is to delimit the smallest time window possible given the state of current research on each author (e.g., if not a single year such as 1536, then a window of 1530-1540 etc.). At publication time, approximately 34% of the catalogue (9,203 works) has been completed { 2,818 works have been excluded as not being in Latin or not containing a significant amount of natural language (e.g., lists of manuscripts) while 6,385 have been dated.