Cuneiform Digital Library Initiative White Paper for the Global Philology Project
Émilie Pagé-Perron, University of Toronto, CDLI directors’ designated representative

Keywords: Digital Assyriology, Open Data, Machine Learning, Universal Access, Semantic Web

Abstract: The Cuneiform Digital Library Initiative (CDLI) is a two-decades-strong repository of curated data from cuneiform inscribed objects, in the form of images, metadata, and text (transliterations and translations). Its objectives are the exhaustive compilation and preservation of these cuneiform sources in digital form, and the promotion of access to them. It is the largest endeavor of this nature in the field of Assyriology. As a response to the Global Philology project, this white paper addresses three major points: the current state of digital Assyriology, the projects underway at the CDLI for the advancement of research in the field, and finally, our vision for the near future of the discipline in the present technological landscape.

Natural Language Processing (NLP) for the Sumerian and Akkadian languages is in its early stages compared with the toolsets available for modern and even classical languages. The processes currently employed are rule-based and rely on human input. With this methodology, it is difficult to process large corpora, since human intervention is required to produce an acceptable final product. The main obstacle to further development lies in the peculiarities of the orthography, syntax, and morphology of the cuneiform languages, which make them harder to parse than the Greek and Latin texts of the Perseus project. An additional challenge lies in the disparity of encodings and standards for digital transcriptions, and in the uneven quality of such transcriptions across projects, but even more in their limited accessibility.

The CDLI is currently working on projects with two main objectives in mind: (1) developing techniques to harness large corpora of data (such as the 21st-century BC corpus of 70,000 transliterated texts) and (2) widening access to cuneiform sources for existing and prospective audiences. In our opinion, the best way to tackle cuneiform “big data” is to integrate machine learning (ML) components into our methodologies. We are currently researching this avenue in two projects: first, the preparation of images of cuneiform sources for future Optical Character Recognition (OCR) work, where our segmentation algorithms recognize lines of text and individual cuneiform signs (a minimal sketch of such a segmentation step follows below); second, the integration of ML solutions for the automated translation of, and information retrieval in, cuneiform texts. Social and economic historians in particular have been effectively shut out from access to the content of Babylonian administrative documents, which, unlike literary or religious texts, have very rarely been translated by experts, even though they make up the overwhelming majority of cuneiform texts in the aggregate. Computer science and computational linguistics research groups are increasingly requesting cuneiform datasets to run machine learning experiments. As a result, some algorithms geared expressly to the study of cuneiform languages are already available, and algorithms stemming from research on other languages can complement these methods. The time is ripe to attempt machine translation and larger-scale information retrieval on select datasets. To this end, we have developed an overarching approach to formulate, test and evaluate new, context-aware methodologies for information retrieval and machine translation.
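To illustrate the kind of image-segmentation step described above, the following Python sketch uses a projection-profile heuristic for line detection and connected components for sign candidates. It is an illustrative stand-in under our own assumptions (function name, thresholds, and method choices), not the CDLI’s actual segmentation algorithm.

```python
# Illustrative sketch: line and sign-candidate segmentation for a tablet image.
# Not the CDLI's production code; thresholds and heuristics are assumptions.
import cv2
import numpy as np

def segment_tablet(image_path, min_line_height=10, min_sign_area=30):
    """Return (line_spans, sign_boxes) for a photographed or scanned tablet face."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        raise FileNotFoundError(image_path)

    # Binarize so that wedge impressions become foreground (white) pixels.
    binary = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY_INV, 35, 10)

    # Horizontal projection profile: rows with much "ink" belong to text lines.
    profile = binary.sum(axis=1)
    threshold = 0.1 * profile.max()
    in_line, start, line_spans = False, 0, []
    for y, value in enumerate(profile):
        if value > threshold and not in_line:
            in_line, start = True, y
        elif value <= threshold and in_line:
            in_line = False
            if y - start >= min_line_height:
                line_spans.append((start, y))
    if in_line and len(profile) - start >= min_line_height:
        line_spans.append((start, len(profile)))

    # Within each line, connected components approximate sign candidates.
    sign_boxes = []
    for top, bottom in line_spans:
        strip = binary[top:bottom, :]
        n, _, stats, _ = cv2.connectedComponentsWithStats(strip, connectivity=8)
        for i in range(1, n):  # label 0 is the background
            x, y, w, h, area = stats[i]
            if area >= min_sign_area:
                sign_boxes.append((x, top + y, w, h))
    return line_spans, sign_boxes
```

In practice, curvature of the tablet surface, damage, and photographic shadows make real segmentation considerably harder than this flat-image sketch suggests, which is why ML components are attractive for this task.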
These projects tie into the objective of facilitating access to the discipline’s primary sources by machines and humans alike, through the principles of universal design, linked open data (LOD), and knowledge exchange (as opposed to mere dissemination) with current and potential audiences. We will be utilizing linguistic LOD with generated annotations, following the prime objective of enabling researchers to reuse our data (a simplified sketch of such word-level annotations follows below). This will make it possible to include Sumerian and Akkadian in meta-research across languages using linguistic LOD (Chiarcos et al. 2012; <http://linguistic-lod.org/>) and will set an example for other languages to follow. We will also release all new data and algorithms into the public domain. An updated interface for the CDLI is currently under preparation to ease access to these data for researchers, curators and students. The information to be released includes tens of thousands of text translations that would take scholars alone decades to complete. Contextual data, a byproduct of the translation and information retrieval processes, can be visualized, for instance using network analysis or dependency visualization, and will also be linked to enhance search and navigation of the corpus.

It is our hope that the future of Assyriology will see, as a first step, the establishment of shared standards across projects to facilitate interoperability at the lowest level. This would permit, among other things, the integration of shared NLP tools geared to cuneiform transliterations with a software package such as the Classical Language Toolkit. We also wish for the wider adoption of an ethical approach to open data, in which projects demonstrate care for data and audiences in the way they prepare and share their methods and the results of their work.
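As a simplified sketch of what publishing word-level annotations as linguistic LOD could look like, the Python example below emits RDF triples with rdflib. The base URI, the example token, and the exact property choices (here drawn from the NIF and OLiA vocabularies commonly used in linguistic LOD) are illustrative assumptions, not the CDLI’s published vocabulary or data model.

```python
# Simplified sketch: one annotated token of a transliterated line as RDF triples.
# Namespaces, property names and the example token are illustrative assumptions.
from rdflib import Graph, Literal, Namespace, RDF, URIRef

CDLI = Namespace("https://cdli.example.org/text/")  # hypothetical base URI
NIF = Namespace("http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#")
OLIA = Namespace("http://purl.org/olia/olia.owl#")

def annotate_word(graph, text_id, line_no, index, form, lemma, pos):
    """Add lemma and part-of-speech triples for a single transliterated token."""
    word = URIRef(CDLI[f"{text_id}/l{line_no}/w{index}"])
    graph.add((word, RDF.type, NIF.Word))
    graph.add((word, NIF.anchorOf, Literal(form)))   # surface form as transliterated
    graph.add((word, NIF.lemma, Literal(lemma)))     # dictionary form / gloss
    graph.add((word, NIF.oliaLink, OLIA[pos]))       # link to a shared POS ontology
    return word

g = Graph()
g.bind("nif", NIF)
# Hypothetical token from an Ur III administrative text (illustrative values only).
annotate_word(g, "P123456", 1, 1, "szu", "szu[hand]", "Noun")
print(g.serialize(format="turtle"))
```

Exposing annotations through shared vocabularies of this kind is what makes cross-language reuse possible: a researcher querying for all nouns across several linguistic LOD corpora can pick up Sumerian and Akkadian data without knowing anything about cuneiform-specific encodings.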