UNIVERSITÀ DI PISA
Scuola di Dottorato in Ingegneria “Leonardo da Vinci”
Corso di Dottorato di Ricerca in Ingegneria dell’Informazione

Ph.D. Thesis

Designing a Library of Components for Textual Scholarship

Candidate:

Angelo Mario Del Grosso

Supervisors:

Prof. Francesco Marcelloni

Dr. Andrea Bozzi

Dr. Federico Boschetti

Dr. Emiliano Giovannetti

2015

SSD ING-INF/05


Sommario

This work addresses and describes topics related to the application of new technologies, computational methodologies and software design aimed at the development of innovative tools for the Digital Humanities (DH), a field of study characterized by strong interdisciplinarity and by continuous evolution. In particular, this contribution defines specific requirements for the domain of Literary Computing and for the field of Digital Textual Scholarship. Consequently, the main processing context concerns documents written in Latin, Greek and Arabic, as well as texts in modern languages dealing with historical and philological topics. The research activity focuses on the design of a modular library (TSLib) able to operate on sources of high cultural value in order to edit, process, compare, analyze, visualize and search them. The thesis is organized into five chapters. Chapter 1 summarizes the context of the application domain and provides an overview of the objectives and benefits of the research. Chapter 2 illustrates important related works and similar initiatives, together with a brief overview of the most significant results obtained in the DH field. Chapter 3 carefully retraces and motivates the design process that was developed: it begins by describing the technical principles adopted and shows how they are applied to the domain of interest, and it continues by defining the requirements, the architecture and the model of the proposed method. Aspects concerning design patterns and the design of the Application Programming Interfaces (APIs) are thus highlighted and discussed. The final part of the work (chapter 4) illustrates the results obtained in concrete research projects which, on the one hand, contributed to the design of the library and, on the other hand, were able to exploit its developments. Several topics are therefore discussed: (a) text acquisition and encoding, (b) alignment and management of textual variants, and (c) multi-level annotations. The thesis concludes with some reflections and considerations, also indicating possible directions for future investigation (chapter 5).


Abstract

The present work is the result of the research activity carried out during my PhD studies. This thesis addresses the application of new technologies, computer science methodologies, and software design principles in the interdisciplinary and evolving field of DH, elsewhere known as Humanities Computing. In particular, this contribution highlights the specific needs entailed by collaborative literary computing and by digital textual scholarship. The source context concentrates especially on documents written in Latin, Greek and Arabic, or on documents in modern languages concerning historical and philological topics. Specifically, the research activity focuses on the design of a modular library (TSLib) dealing with scholarly sources as regards their editing, processing, comparison, analysis, visualization and searching. The thesis explores the aforementioned topics across five chapters. Chapter 1 outlines the context of digital textual scholarship and gives a summary of the objectives and benefits of this research. Chapter 2 illustrates related works and similar initiatives, along with noteworthy projects and outcomes in the area of Digital Humanities. Chapter 3 thoroughly describes and motivates the design process implemented. The process starts by describing well-known engineering principles and shows how they are applied to the digital textual domain and to computational scholarly needs. It then continues by introducing the requirements, architecture and models of the proposed method. Design issues with regard to patterns and APIs are highlighted. The final part of this work (chapter 4) illustrates concrete results deriving from a number of research projects that, on the one hand, have contributed to the design of the library and, on the other hand, have based their work on it. Several topics are discussed: (a) acquisition and text encoding, (b) alignment and variant annotation, and (c) multi-level annotation. In the conclusion, a few reflections and considerations are presented, together with suggestions for further studies (chapter 5).


(after Thomas A. Sudkamp)


Contents

1 Introduction
   1.1 Overview
   1.2 Motivation, goals, and challenges
   1.3 The benefits of a library of components

2 Background
   2.1 Preliminary remarks
   2.2 Initiatives for textual scholarship
      2.2.1 Community environments
      2.2.2 Research infrastructures
      2.2.3 Philological projects
   2.3 Textual scholarship tools
      2.3.1 TUSTEP - Tübingen System of Text Processing
      2.3.2 LaTeX / Mauro-TeX
      2.3.3 JuxtaCommons
      2.3.4 CollateX
      2.3.5 Text::TEI::Collate
      2.3.6 eXist-db
   2.4 Data and metadata encodings
      2.4.1 Data
      2.4.2 Metadata
   2.5 Related open source software libraries
      2.5.1 Text processing
      2.5.2 Language processing
      2.5.3 Text architecture


   2.6 Suitable information technologies
      2.6.1 Document, text and character encodings
      2.6.2 String manipulation, text alignment, and vector space model
      2.6.3 Image processing
      2.6.4 Linked Open Data methods and technologies
      2.6.5 Machine learning approaches
      2.6.6 Software engineering principles and processes

3 Methods
   3.1 Introduction
   3.2 Requirements and use cases
   3.3 System architecture
   3.4 Designing the data model
   3.5 API design and Design Patterns
      3.5.1 API design
      3.5.2 Design Patterns
      3.5.3 Reusability and extensibility
   3.6 Developing technologies

4 Case Studies
   4.1 Source acquisition and text encoding
      4.1.1 Text acquisition
      4.1.2 Character and text encoding
   4.2 Indexing
   4.3 Alignment
   4.4 Variant reading and multi-level analysis
      4.4.1 Variant reading annotations
      4.4.2 Multi-level analysis

5 Conclusion and Perspectives

6 Acronyms

7 Acknowledgements

Bibliography

List of Figures

3.1 Component services example
3.2 Example of UML Use case diagram within the TSLib
3.3 Mockup example within the TSLib analysis process
3.4 Example of UML conceptual class diagram within the TSLib
3.5 Modern manuscript example
3.6 TSLib core Use Cases
3.7 Textual Scholarship environment within a component schema
3.8 Layers view of the TSLib
3.9 The core components of the Textual Scholarship Library
3.10 Open Philology architecture - Courtesy of Bridget Almas
3.11 Class model defining the tradition of the textual documents
3.12 Conceptual Class Model representing textual source materials
3.13 Parallel bilingual texts with comments and selection
3.14 Lemmatization example for API Component design
3.15 Convenience layered API for TSLib
3.16 API/SPI Delegation pattern
3.17 Factory Pattern
3.18
3.19
3.20
3.21
3.22
3.23
3.24
3.25


3.26 Parser Handler Pattern
3.27 Request Response Pattern
3.28 TSLib Extension Capability

4.1 Example of a manuscript written in Greek language
4.2 Example of a word processing electronic file
4.3 Example of Greek glyphs - Courtesy of prof. Bruce Robertson
4.4 OCR experiments conducted on the HLRS Environment
4.5 Parallel OCR experiments evaluation
4.6 Image processing parameters - Courtesy of F. Boschetti
4.7 Greekness score evaluation
4.8 Data Model versus External Data Format
4.9 The GUI of the text-image framework
4.10 XML schema implementing the TSLib data model
4.11 Output of the aligner on Odyssey French translations
4.12 Greek against Arabic alignment with transpositions
4.13 The alignment component. UML class diagram
4.14 The variant readings annotation example
4.15 TSLib module for text analysis and segmentation
4.16 The multi-layer annotation component
4.17 The TSLib lemmatization module
4.18 Example of linguistic annotation to a sentence
4.19 The TSLib Human correction module

Chapter 1

Introduction

This chapter is divided into two main parts. The first part provides a brief historical overview of the phases that textual scholarship has gone through. The second part presents the relevance, motivations, objectives and challenges that have characterized this research work. The main focus is placed on the advantages of designing a library of components for textual scholarship.

1.1 Overview

The digital revolution and the widespread availability of documents through the World Wide Web have increasingly changed the ways of studying and understanding textual resources and cultural heritage documents [1, 2]. New technologies offer a means to encode, process, and link relevant textual phenomena. This offers the opportunity to deal with an amount of data and a degree of automation that would be impossible with traditional methodologies. Moreover, this process has played an important role in expanding access to textual resources for scholarly studies and for digital scholarly editing¹ [6, 7]. As a consequence, computer science has extended the scope of its research to matters related to the field of the humanities [8, 9]. Following this, computer science and computer engineering have also started to deal with interdisciplinary issues, trying to solve literary problems in an effective and efficient way [10, 11]. As claimed by I. Lancashire [12]:


Literary and linguistic computing in the humanities analyzes texts to give insight into their meaning, their authorship, and their textual tradition. These insights are derived from computational text analysis, which employs concordancers to find verbal patterns. These can sometimes help uniquely identify the author’s idiolect, illuminate the chronology, narrative structure, semantics, and sources of a text, and create a hypothetical stemma representing the descent of the text from earlier versions. Norbert Wiener’s cybernetics, a theory of messages that applies equally to communications by human beings and machines, underpins the computational study of author, text, readers, and the history of textual transmission.

¹ Digital scholarly editing covers the whole editorial, scientific and critical process that leads to the publication of an electronic resource [3, 4, 5].

At the time of writing, the field of DH² is going through a crucial methodological debate. The discussion addresses both its definition and its main objectives [13, 14, 15, 16, 17, 18]. For the purposes of this thesis, a brief overview of the historical milestones that have characterized it is necessary. In particular, a clarification is needed regarding digital textual scholarship³ and computational philology⁴, as this improves the understanding of the context in which this study has been carried out. The first project commonly acknowledged to belong to the field of Digital Humanities dates back to 1946, when, with the help of IBM, Roberto Busa began to create a digital corpus of eleven million words based on the works of Thomas Aquinas (the Index Thomisticus) [20, 21].

For a long time, the principal subjects of investigation were the process of data acquisition and the long-term preservation of documents in trusted repositories [22]. Consequently, digital and computational humanists depended heavily upon the digitization of primary sources, and digitizing source images was the first step for any initiative of this kind. Conversely, few studies were conducted on digital editions and almost none on their publication [23].

In the beginning, digitization was carried out by typing the texts, which were then corrected individually by hand. At that time, scholars in the humanities had limited technological experience with word processors. In addition, the digitization workflow mixed automatic processing and human adjustment, with consequent issues related to scalability. Scholars’ habits derived directly from paper libraries, and they extensively imitated printed forms [24].

In order to increase the amount of digital textual collections, community-driven projects were started in the early 1970s, for instance the Gutenberg Project. Their main aim was to create large digital archives containing the works of a number of authors. However, the quality of these early digitization initiatives was not adequate for scholars’ needs. Two main issues are worth mentioning: 1) the impossibility of comparing the image of the original printed edition with the digital text derived from it; 2) the absence of any para-textual data, such as the preface, the introduction or the footnotes, as well as of data for editorial purposes (i.e. the availability of critical apparatus information stored in a database [25]).

Later on, digitization campaigns were also extended to historical documents, and particularly to literary corpora of Greek and Latin texts. At that time, digital resources used CD-ROM storage technology to preserve and disseminate the content of the archives. However, communication between humanists and computer scientists was uncommon, except for a few pioneers⁵ with a wider vision who mediated among the members of these communities. For instance, some of the best-known works explored computer programs for the creation of machine-readable representations of literary resources [26, 27]. The statistical methods applied helped to identify the genealogical relationships among several codices [28, 29] and to analyze the results of the collation phase [30].

² The Manifesto for the Digital Humanities is available at the following URL: http://tcp.hypotheses.org/411
³ As pointed out in [7, 19], digital textual scholarship deals with studying, by means of a digital environment, each and every aspect of texts in order to understand their sources and history. It encompasses any form of investigation of original documents, whether manuscript or print, to establish a work’s composition, revision, editing, printing, production, or distribution.
⁴ Computational philology refers to a set of activities related to the development of programs devoted to encouraging the interaction between the editor and the machine [3]. The expression computational philology was used for the first time in 1968 in a computer science document about computational linguistics; see S. Kuno and A. G. Oettinger, “Computational Linguistics in a Ph.D. Computer Science Program”, Comm. ACM 11, 12 (1968), in particular p. 835.
⁵ Richard Goulet, Peter Robinson, Andrea Bozzi, and Gian Piero Zarri, among others, are researchers who first investigated the use of the computer in philology.

Afterwards, encoding standards were designed that allowed interoperability between different textual representations. At the character level, one of the first achievements was the reduction of the encodings for glyphs not directly mapped in the ASCII table; in this respect, the BetaCode scheme [31] was one of the most widely adopted solutions, especially for ancient languages. At the document level, public Document Type Definitions (DTDs) made the long-term preservation of individual documents and collections easier. From that moment on, scholars started using digital methods for textual editing. Nevertheless, they continued printing documents, due to the impossibility of publishing them in digital form. In the same period, editorial programs emerged, such as TUSTEP [32] and many others based on the TeX family [33, 34].

In the 1990s, other initiatives came forward which aimed at storing document images alongside full texts. Applications met at least two requirements: firstly, they linked textual chunks to the corresponding Regions of Interest (ROI), automatically or semi-automatically; secondly, they provided users with the page image up front, together with full-text search capabilities and annotation tools. In this way, users were helped in solving problems of interpretation caused by difficult readings due to damage to the physical support [35]. At the same time, digital libraries and philological environments began to include links to the primary sources of each piece of information. In such a way, readers were able to compare editorial choices across multiple sources, enabling editorial discussions [22].

In this framework, digital scholars worked on marking up documents made up of two fundamental parts: a) the textual content, and b) the metadata. However, computer authoring systems supported only the encoding phase and did not allow any data processing. Although this approach replaced printed books with new digital editions, it did not foster a real change in the scholarly paradigm [24]. Moreover, software applications aimed at solving the needs of specific projects, often disregarding reusability and extensibility [36, 10, 37]. In light of this, scholars have been working together with ICT companies in order to experiment with and develop new computer-aided methods. Among other things, tools were able to manage quotations and variant readings, to link to pre-existing bibliographic documents, to activate services such as linguistic analysis, or to mark locations on a map [4, 25, 38].

The dissemination of electronic editions exploited infrastructures and applications such as Cocoon [39] and Anastasia [40]. As a consequence, companies with the necessary technological know-how often made business by selling the implemented platforms.

More recently, digital repositories have been massively managing cultural heritage materials, both by using multiple OCR engines and by processing linguistic data through NLP tools and text alignment procedures. This technological improvement is mainly due to three aspects: (a) the wide diffusion of the Internet (particularly the World Wide Web), (b) the development and adoption of machine learning algorithms, and (c) the successful outcomes achieved in the automatic detection of genetic variations. At this stage, digital platforms offer a new space where scholars can sign up in order to proofread the content of the archive and to open collaborative intellectual comments [41]. Collaborative philology has begun to take hold: users, by means of a suitable cyber-infrastructure, evaluate automatically processed data which has been statistically classified. This new approach has led to a paradigm shift in a threefold way:

1. Systems based on stochastic models and statistical algorithms started being employed by crowd-sourcing communities. Up to that time, automatic systems had made use of procedures based on a set of rules which could handle, in an asymptotic manner, all possible cases of input data.

2. Experiments of collaborative textual criticism have fostered theoretical considerations on the digital representation of the text (decentralized, real-time community contributions, dynamic and fluid representation) [3, 42].

3. Digital libraries have become capable of scanning their contents and searching for new data such as secondary sources and new editions of primary materials.

Finally, since the 2000s, researchers have increasingly gained expertise in both the computational and the textual domains. They have put forward new issues, mainly seeking methods that are as open and shared as possible. For instance, scholars call for open data access and source code availability. On the one hand, they can develop tools and implement experiments based on reusable components; on the other hand, they can collaborate directly with the community on bug fixing and on extending functionality. In fact, experts have an important part to play both in scholarly editing and in software development.

1.2 Motivation, goals, and challenges

The objective of this work is to discuss modeling issues and prototype applications in designing a modular software library for the textual scholarship domain. It tries to highlight and address the technological gap existing between the approaches of digital scholars and those used by software engineers and computer scientists.

Nowadays, new technologies and digital methodologies are common in many areas of research, such as the natural sciences, computational linguistics or bioinformatics. Unfortunately, they are not yet as mature in the literary computing field, and particularly in computational philology. Nevertheless, research in this domain is constantly shifting classical practices towards various forms of digital representation and processing of textual resources and of their transmission as a historical process. As a matter of fact, the digital environment allows textual researchers to implement methods, approaches and tools in order to:

• browse large archives of electronic texts and digital images;

• promote the discovery of textual, para-textual, extra-textual, and inter-textual phenomena (e.g. parallel passages, citations, annotations, etc.);

• detect errors and variant readings where texts are transmitted by multiple witnesses;

• support the editing of a critical apparatus and, consequently, the editing of a critical edition;

• produce indexes and concordances both from texts and from scholars’ comments;

• improve the ability to make correct editorial choices;

• work on a text collaboratively, with multiple users, even those coming from different areas of specialization (philologists, linguists, historians, philosophers, etc.).

Generally, this kind of study deals with the following complementary points: a) producing accurate and detailed digital resources (digital philology); b) developing sound, efficient and flexible software to process those resources (computational philology); and c) integrating data and services in a virtual research infrastructure (ePhilology).

a) Digital philology focuses on the acquisition and the creation of digital resources, for instance improving OCR accuracy for classical and less-served languages, or establishing standards for shared annotations.


b) Computational philology, instead, concerns data analysis and data manipulation: for example, the development of procedures for computing word frequencies, the comparison of two or more texts, automated lemmatization, or the linguistic analysis of texts.

c) Finally, e-philology aims at building research infrastructures in order to handle collaboration among scholars (e.g. social annotations or social editions), human-machine interaction (e.g. by using search engines and visualization techniques), and machine-to-machine cooperation (e.g. by software agents, which use chains of web services or linked open data).

The above points actually provide the environment for the development of tools that aid literary studies in the digital age. In turn, literary studies involve four fundamental aspects as regards the digital representation of textual sources:

• multiple devices convey the textual content of the source (e.g. the text of a manuscript and the image of a page);

• the tradition of the text manifests multiple variant readings (e.g. the word disagreements found in the same textual context within different document-witnesses);

• the textual content encompasses multiple layers of analysis at different levels of granularity (e.g. the syntactic analysis of a sentence or a metrical analysis of a poetic verse);

• the levels of analysis provide multiple interpretations (e.g. the disagreements concerning morphological interpretations based on the same word).

The aforementioned accounts show that the challenge of this research concerns the modeling of data so that it can serve as evidence for literary studies in digital ecosystems. In fact, in the last few decades, much work in textual scholarship has leveraged the idea that the design and development of digital tools could be carried out by reusing and customizing the outcomes obtained in other computational sciences (e.g. natural language processing, bioinformatics, and others). However, the tradition of scholarly studies needs digital models and computational approaches which rely on appropriate data structures, algorithms and functionalities. For example, computational linguistics analyzes a single text flow associated with single linguistic analyses (e.g. syntactic and semantic analyses), whereas computational philology must deal with multiple versions of the same text (due to variants in the manuscripts or conjectural emendations provided by scholars) and with multiple interpretations at each level of analysis (due to the disagreement of authoritative scholars recorded in several commentaries across the centuries). Consequently, textual scholarship models extend those commonly adopted in linguistic software design. In particular, the basic model takes into account four core properties:

1. the version of the textual data;

2. the analysis and interpretation of the information;

3. the granularity of the analysis;

4. the location of the textual data.
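To make these four core properties concrete, the following minimal sketch shows one possible way of representing them in Java (the language in which most of the libraries surveyed in chapter 2 are implemented). All type and field names are hypothetical illustrations, not the actual TSLib model discussed in chapter 3.

```java
// Hypothetical sketch of a data model capturing the four core properties above;
// type and field names are illustrative, not the TSLib API (Java 16+ records).
import java.util.List;

/** Properties 1 and 4: which version of the text, and where in it. */
record TextLocation(String documentId, String versionId, int start, int end) { }

/** Property 2: one scholarly interpretation attached to a located span. */
record Interpretation(String scholar, String value) { }

/** Property 3: an analysis layer at a given granularity, possibly with competing interpretations. */
record Analysis(String layer,                       // e.g. "morphology", "metrics"
                String granularity,                 // e.g. "token", "sentence", "verse"
                TextLocation location,
                List<Interpretation> interpretations) { }

class FourPropertyModelDemo {
    public static void main(String[] args) {
        TextLocation loc = new TextLocation("odyssey-book-1", "witness-A", 120, 127);
        Analysis morph = new Analysis("morphology", "token", loc,
                List.of(new Interpretation("scholar-1", "noun, genitive singular"),
                        new Interpretation("scholar-2", "noun, dative singular")));
        System.out.println(morph);
    }
}
```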

In this background, a suitable textual scholarship application, which deeply exploits the possibilities of digital technology, allows, within a fully integrated virtual environment, the interconnection of primary sources (such as images or diplomatic/interpretative transcriptions of a manuscript), shared access to secondary materials (such as commentaries, monographs, etc.), and the availability of scholarly tools (such as collection modules, concordances, etc.). In this way, a digital edition provides scholars with a powerful tool which allows effective and efficient document dissemination among literary communities.

This work, therefore, is an insight into the design of software components focusing on literary activities. These activities cover, for example, the alignment of complex textual objects (e.g. the alignment of variant readings according to the edit distance of the inflected forms and to their semantic similarity), the management of levels of analysis (e.g. morphological or lexical analysis using different methods such as statistical algorithms or rule-based approaches), and the editing and retrieval of multiple, concurrent annotations. Another important functionality is the linkage of textual resources to multimedia sources, for example linking the image of a manuscript page with its textual content. Three main building blocks are involved in developing a framework able to deal with such wide-ranging needs:

1. Acquisition of resources by Optical Character Recognition (OCR), or information extraction and document transformation from semi-structured resources to structured ones (ETL).


2. Document and text processing, content analysis, data indexing and information retrieval.

3. Development of collaborative environments based on well-defined components which provide interfaces for developers (API), for service providers (SPI), and for users (GUI).
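Purely as an illustration, the three building blocks listed above could be exposed to client code through component interfaces along the following lines; the interface and method names are hypothetical and are not taken from the TSLib API described later in this thesis.

```java
// Hypothetical component boundaries for the three building blocks above;
// all names are illustrative, not the TSLib API.
import java.awt.image.BufferedImage;
import java.util.List;

/** Block 1: resource acquisition (OCR, information extraction, ETL-style transformation). */
interface AcquisitionService {
    String recognize(BufferedImage pageImage);         // OCR of a scanned page
    String toStructured(String semiStructuredSource);  // transformation towards a structured form
}

/** Block 2: document and text processing, content analysis, indexing and retrieval. */
interface ProcessingService {
    List<String> tokenize(String text);
    void index(String documentId, String text);
    List<String> search(String query);                 // returns matching document identifiers
}

/** Block 3: contract of a collaborative environment offered to developers (API). */
interface CollaborationService {
    void annotate(String documentId, int start, int end, String note, String userId);
    List<String> annotationsFor(String documentId);
}
```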

Eventually, the research outcomes relate to the design of components, modules and plug-ins for a collaborative object-oriented library. The system establishes a suitable tool, on the one hand, for analyzing manuscripts and printed documents and, on the other hand, for producing new critical editions. For this purpose, this research work has investigated how to design a worthwhile API for large-scale and long-term scholarly projects.

The method adopted for analyzing, designing and implementing the artifacts uses the following approaches: a custom agile use-case-driven process, the object-oriented paradigm, relevant design principles, and Unified Modeling Language (UML) diagrams. In doing so, research studies can draw up different views of the domain under study, focusing on both the static and the dynamic aspects of the system. The UML graphical notation is important as it helps in designing a model at a high level of abstraction; in this way, the model is platform and technology independent. Thanks to the abstract modeling and the modular design, the concepts within this research have already been used in a number of national and international projects devoted to managing parallel multimedia resources, both text and image.

At present, DH engages a large and heterogeneous community, and a rich set of methodologies, approaches, procedures, tools, and applications has been developed within it. Unfortunately, the outcomes are frequently isolated and kept in unshared black boxes. Furthermore, design features such as interoperability, reusability and extensibility have hardly been taken into account. This is mainly due to the fact that a great number of DH researchers follow private policies in developing new models for their computational studies [43]. Indeed, current software systems for academic studies generally depend on single initiatives and language requirements; thereby, each language requires a different system for language analysis. On the contrary, computational projects involving cultural heritage materials cannot be addressed through specific needs or languages, and even less can they rely on rigid and monolithic applications [44, 45, 7, 46]. As a result, one of the great objectives of DH is to design shared methods, efficient algorithms and reusable libraries which can effectively meet and solve a wide variety of textual requirements [47].

Hence, a crucial and expected accomplishment is to empower the community of textual studies through an integrated service-driven framework tailored to their specific needs. It is therefore necessary that software engineers, computer scientists and computational humanists combine their efforts in order to develop flexible, reusable and reliable technologies for the domain of the social sciences and humanities (SSH). In other words, the main objective lies in designing and implementing a framework able to assemble software packages (components) to be exploited by computational inquiries in the field of the humanities. Afterwards, software artifacts have to be integrated into a comprehensive technology platform which enables computational scholars to build new tools or to refactor existing ones. This ongoing process provides scholars with the necessary knowledge for the creation of complex and reliable software, thus supporting the DH field. Consequently, a modular library with well-designed Application Programming Interfaces is now deemed strategic for conducting research in the area of computational philology. Several large-scale research efforts have recently been released to the community with regard to these issues (see chapter 2). This kind of infrastructure, on the one hand, provides classical scholars with a digital environment for studying textual documents; on the other hand, it serves as a platform on which computational scholars can develop and share their own software applications.

Nowadays, applications for textual scholarship are becoming increasingly complex. Therefore, it is necessary to apply the best engineering practices in order to build them properly and make them reliable. According to the aforementioned framework, the aim of this research is to model and implement flexible and reusable software components tailored to the requirements of computational and collaborative textual scholarship. For example, such components must allow their users to detect the resource language, choose a suitable morphological engine, perform analyses at different granularities, manage text repositories, and construct training sets for further statistical analysis. In order to fulfill these goals, three points need to be particularly stressed:

1. Software engineering principles;

2. Literary computing applications;


3. Linguistic technology methods.

The first point involves: (a) adopting a process which considers Agile principles, user requirements and domain modeling; (b) fostering the object-oriented paradigm; (c) following the component-based model; (d) considering well-known design patterns; (e) putting into effect practical API design techniques; (f) refactoring existing software; (g) establishing integration policies; (h) favoring open-source, community-based development; (i) producing test cases and unit tests for the artifacts.

The second point involves: (a) implementing efficient methods for content acquisition, such as optical character recognition; (b) developing reliable document processing algorithms and scholarly editing systems, such as software that handles variant readings and quotations; (c) formalizing shared data-format schemes and common ontologies, such as XML mark-up vocabularies and formal domain conceptualizations.

Finally, the third pillar encompasses, among others: (a) implementing Natural Language Processing (NLP) tasks such as lemmatization, parsing, classification, fuzzy matching, searching, etc.; (b) enhancing the accuracy of current algorithms, such as improving recall and precision in classification or clustering processes; (c) improving the scalability of current approaches, such as parallelizing tasks for massive digitization, or developing language-independent strategies; (d) researching new strategies, algorithms and data structures focusing on complex matters; (e) providing useful visualizations, such as an immediate view of the content of a document alongside its linguistic analysis, or a suitable view of a textual phenomenon together with the related image area.
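As a concrete illustration of one of the NLP tasks just mentioned, fuzzy matching of variant readings is typically based on an edit distance between inflected forms. The following sketch implements the standard Levenshtein distance by dynamic programming; it is textbook code, not an excerpt from the TSLib.

```java
// Minimal Levenshtein edit distance, as used for fuzzy matching of variant readings
// (e.g. comparing two inflected forms drawn from different witnesses). Standard
// dynamic-programming formulation; not TSLib code.
class EditDistance {

    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;   // i deletions
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;   // j insertions
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,   // deletion
                                            d[i][j - 1] + 1),  // insertion
                                   d[i - 1][j - 1] + cost);    // substitution
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // Two hypothetical variant readings of the same word: distance 1.
        System.out.println(levenshtein("ἄνδρα", "ἄνδρας"));
    }
}
```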

Generally, design principles help to exploit the full power of object-oriented programming. Additionally, where applied, they make it possible to write efficient, effective, flexible, extensible, scalable and maintainable software. This results in more focused and manageable software modules and source code. As a consequence, developers can concentrate on the actual logical needs of the application or on extending the software already available. It is worth noting that, without an effective engineering approach within DH applications, the cost of software development would be higher than necessary, and the cost of maintaining and using the tools would be unsustainable.

Sharing computational tools and technologies through common frameworks is a rather new theme for literary scientists. Indeed, up to now collaborations and data exchanges have been defined in terms of data formats or data encodings, or at most by exposing web services. However, at the moment digital scholars are ready to develop software in distributed teams, and classical philologists are increasingly studying in virtual environments. It is therefore necessary to create a new generation of Citizen Scientists [45]. In conclusion, scholars, working together with computational researchers, could make textual heritage sources more accessible and intelligible. Moreover, research in Digital Humanities demands new technologies in order to face the following challenges.

Firstly, researchers should try to establish a process for creating and sharing data and tools for the target field. Consequently, research can address the design and development of the desired applications, applying scalable methods with which to manage the digital sources available for analysis. Applications can then focus on providing better Optical Character Recognition (OCR) to improve digital document processing. Moreover, it is fundamental to provide alignment and collation mechanisms for textual readings by exploiting parallel and distributed environments.

Secondly, researchers should manage the growing collection of encoded data, such as morpho-syntactic annotations, named entity data, and information about interrelated corpora. Therefore, scholarly tools have to support a diverse range of annotations, such as translations, studies of textual transmission, quotations, prosopography, and linguistic information (semantic and lexical analyses), thereby managing the circulation of ideas across time, space, language and culture. In particular, digital scholarly applications need to support facilities for scholarly editing and for reconstructing texts [48]. Specifically, such a feature involves not only the management of variant readings but also the document processing pertaining to stemmatology (i.e. by phylogenetic methods) [49]. Obviously, scholars require tools that enhance the ability of researchers to work with historical sources.

Thirdly, another important area of study involves the implementation of search engines which take textual variation into account. This presumes that repositories hold multiple editions, as well as tools that manage different versions of the same text. These search engines need to collate multiple editions against each other, identifying quotations of primary sources and providing differences. Moreover, they should give access to repositories of conjectures, to bibliographical documents, to high-resolution images, to commentaries and to all other information that scholars may need.

Finally, one of the objectives of the common work among textual scholars, software engineers and computer scientists is the setting up of a virtual textual research environment able to:

• implement methods for crowd-sourcing and citizen science support;

• provide integrated access to digital research resources, to tools and services across disciplines and user communities;

• process structured and qualitative data in virtual workspaces;

• handle textual contents in several languages and localize the environment into multiple languages by translating the core vocabularies;

• promote digital practices for research, education and innovation, sharing material for further computational processing;

• allow young scholars to interact with primary sources and to browse the vocabulary, morphology and syntax of the texts they are reading;

• preserve continuity for textual information and textual tradition;

• define semantics and ontologies concerning metadata, as well as guarantee the best levels of abstraction to process data in order to ensure interoperability;

• promote effective collaboration among researchers and simplify access to data discovery and its use.

To sum up, the textual scholarship field has been moving towards computational research, despite a considerable delay compared to other scientific disciplines. Indeed, in the digital era it is necessary to implement tools that allow scholars to handle fluid digital editions dynamically [42]. This follows from the fact that each textual tradition generates multiple variant readings, levels of annotation and interpretations. Consequently, computational applications are fundamental instruments that support the editorial process. It is worth noting that, although new technologies improve the work of scholars, only philologists have the authority to establish the text (e.g. by making the editorial choices). In view of what is now possible, this thesis argues that the primary tools of textual scholarship should become software components able to process digital editions in a collaborative way.


1.3 The benefits of a library of components

As pointed out by [11, 19, 13, 50, 7], in the domain of literary computing many existing software models and their implementations are not modular, not customizable and do not scale. Consequently, they do not encourage reusability and flexibility of the artifacts. Nowadays, software applications which need distributed and interrelated functionalities integrate outcomes and capabilities by means of data exchange and web services. Although effective and well tested, this approach does not encourage full model sharing, community-based development, and service reliability. For this reason there is a recognized need [51] for shared tools for digital scholarly editing as well as for literary material processing. This research attempts to offer a solution to fill this gap. This study starts from the assumption that a software library provides opportunities for (a) designing suitable use cases, (b) defining domain entities and abstract data types, (c) modeling functionalities and behavior, (d) organizing data, and (e) exporting services. Therefore, computing applications hold the potential to reuse components and to access or manipulate data in an efficient and shared way. This process is paramount to addressing requirements in the field of DH and it provides scholars with the appropriate services to create complex and trustworthy software. A well-designed library should bear the following general features:

• flexibility and modular application architecture;

• platform independence;

• customizability to the end user needs;

• ability to work online as well as offline;

• simplified distribution to the end user;

• simplified updating for the clients;

• consistency that avoids unexpected outcomes for users;

• ability to evolve while maintaining backward compatibility;

• collaboration among members of the community;

• cooperation among services;


• stable contract that can be used for communication purposes.

Specifically, a software library offers entry points on top of which the user can assemble related parts of the application. Furthermore, component-based libraries provide strategies for customizing applications by means of extensible modules and plug-ins; as an outcome, the resulting systems seem to have been created as a whole. In addition, a modular library is composed of smaller, separate units, which are well isolated and uniquely identified. These units export simplified interfaces and describe the environment they need in order to work properly. This allows the core components of the library and the final applications to assemble the modules as required. Suitable Application Programming Interfaces identify services and data types of long-term value and make them accessible to the target community with more stability. Finally, components can be integrated into existing projects and adapted without further modifications to their source code, which also avoids low-level programming interventions. In such a way, a fully functional software package can be created that addresses the needs of the domain community. This work begins by defining a proof-of-concept library for literary studies and several API principles that provide stable access to the library functionality. The API guarantees a set of component services able to handle textual resources for historical documents in the domain of literary computing. Through this process, the library provides functions that support scholarly work, such as aligning complex textual objects, expanding levels of analysis, editing and retrieving multiple and concurrent annotations, and processing manuscript page images.

Each component is interoperable with the other components and, at the same time, it abstracts itself from the underlying technology and implementation. The component abstraction is made possible by designing standard building blocks, shared workflows and well-documented interfaces with coherent APIs. Application Programming Interfaces provide abstractions for problems and specify how clients should interact with the software components that implement a solution to those problems. In essence, APIs define reusable building blocks that allow modular pieces of functionality to be incorporated into end-user applications as loosely coupled components. Designing proper APIs is strategic mainly for two reasons: 1) the long-term availability of the library functionality, and 2) the community-driven development of the tools. Moreover, proper design results in reusable software artifacts that reflect the target domain. Although the aforementioned approach is well known to researchers in science and technology, it is not yet common practice in humanities computing.

Users benefit from a library because it has to clearly distinguish between the specification and the implementation of Abstract Data Types (ADTs) [52]. Moreover, scholars can benefit from the Textual Scholarship Library (TSLib) because it is being designed and tested by adopting use-case scenarios gathered from concrete research projects. By adopting API principles, libraries export services through a logical interface and hide the details underpinning the internal implementation. Therefore, a textual scholarship API can offer a high level of abstraction for scholarly needs and promote the sharing of the same functionality while allowing, at the same time, multiple service providers. To conclude, a modular and abstract library would have a twofold result: in the first place, it would allow developers to reuse the general model and its core components; secondly, developers would be able to add functionality and address new requirements by plugging extensions into the system [46]. Therefore, Application Programming Interfaces offer a means of access to ad-hoc implementations tailored to specific services (such as linguistic morphological analysis) through a specified engine.
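To make the API/SPI separation mentioned above more tangible, the sketch below shows a client-facing morphological analysis API behind which different engines can be plugged; all names are hypothetical and do not correspond to the actual TSLib interfaces presented in chapter 3.

```java
// Hypothetical illustration of the API/SPI separation: clients program against a
// stable API, while concrete engines are plugged in behind it. Names are
// illustrative, not the TSLib interfaces.
import java.util.List;

/** Client-facing API: the stable contract. */
interface MorphologicalAnalyzer {
    List<String> analyze(String inflectedForm);
}

/** Service Provider Interface: what a concrete engine must supply. */
interface MorphologicalEngine {
    boolean supports(String languageCode);
    List<String> analyses(String inflectedForm);
}

/** Facade that delegates to a specific engine chosen by language. */
class Morphology implements MorphologicalAnalyzer {
    private final MorphologicalEngine engine;

    Morphology(List<MorphologicalEngine> providers, String languageCode) {
        // Pick the first registered engine that declares support for the requested language.
        this.engine = providers.stream()
                .filter(p -> p.supports(languageCode))
                .findFirst()
                .orElseThrow(() -> new IllegalArgumentException("No engine for " + languageCode));
    }

    @Override
    public List<String> analyze(String inflectedForm) {
        return engine.analyses(inflectedForm);
    }
}
```

In a real deployment, providers of this kind could be discovered through java.util.ServiceLoader, the standard JDK mechanism for plug-in registration.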

Chapter 2

Background

Due to the wide range of themes covered by the Digital Humanities and to its interdisciplinary nature, the field is in constant evolution. Nevertheless, this chapter seeks to take a snapshot of the state of the art in the application of computational studies to textual scholarship. Moreover, it gives an overview of the context in which this research is placed and of the theories and technologies supporting it. Furthermore, the chapter introduces the most outstanding software libraries, nearly all implemented in Java, used for research purposes.

2.1 Preliminary remarks

Engineers, computer scientists, philologists and linguists have given impulse to a wide range of research activities and to many initiatives in the area of Digital Humanities (DH). They have worked together for more than thirty years to implement automatic systems applied to the human sciences. Hence, experiments, new approaches, theories, techniques and applications have evolved in relation to the handling and automatic management of texts and languages. Collaborations, conventions, and international and interdisciplinary associations have arisen with the aim of guiding the development of this research.

Text editing is one of the activities that has benefited most from the spread of computers. For example, typos can be kept under control thanks to spell-checkers, and documents can be saved in multiple copies and easily sent via e-mail or, more recently, stored and shared in the cloud [53].


Particularly in the case of scholarly editing, the availability of digital documents, either in the form of digital texts or of scanned manuscripts, can considerably simplify philologists’ work. One example is the possibility of accessing resources by searching for a particular word of interest. The importance of this area of research is also confirmed by the many initiatives and tools in the fields of literary and linguistic computing and of natural language processing for cultural heritage [54, 55].

Moreover, the possibility for a document to be edited by a variety of people has gained increasing interest. A related practice, so-called Collaborative Editing, allows scholarly communities to produce social works, giving members the possibility of contributing to establishing the texts [1]. One of the key features of collaborative editing is its asynchronicity, which allows a dramatic reduction of working time. Collaborative editing constitutes a particularly useful instrument in scholarly editing, as it envisions the Web as a common workspace. The rapid increase of wiki projects and the wide use of Google Documents (GDocs) are a result of this vision. A wiki is a website characterized by (mostly textual) contents that multiple users can edit in a collaborative way, using a simple browser. It also provides a system that tracks the changes made to documents. However, this platform presents some limitations. Firstly, it supports neither advanced search engines nor versioning of dependencies; consequently, the references reported in each document version are always linked to the most up-to-date documents. The GDocs environment, on the other hand, is the most used collaborative editing tool for general-purpose documents. Differently from wikis, GDocs is a real-time collaborative editing framework, allowing users to edit and to comment on documents simultaneously. Both technologies represent a model for developing systems for digital scholarly editing. As witnessed by the aforementioned tools, current trends in digital and computational philology are oriented towards developing collaborative environments [56, 44, 41]. Research centers are developing technologies and methods in order to provide the international community with suitable tools to make, share and reuse textual scholarship products. This would also widen scholars’ access to gathering and editing material [1, 36]. Several international congresses have recently addressed this topic [57, 58, 59, 60]. Issues related to licenses play a crucial role in this context¹.

¹ This work does not take licenses into consideration. However, as regards data, Creative Commons licenses are the means to deal with these topics; as far as software is concerned, the Open Source Initiative (OSI) and the Free Software Foundation (FSF) are the organizations which manage such matters.


2.2 Initiatives for textual scholarship

In recent years, relevant projects reflecting the objectives and trends discussed so far have been conducted on textual scholarship. Most of these projects suggest solutions for implementing digital libraries for general purposes, or mono-thematic archives. Normally, they focus on Web access to repositories of both texts and images, available through interconnected frameworks [44]. This section outlines the initiatives that are most representative in the context of this thesis, since they address the field of computational and digital philology. The projects are grouped into three classes, namely: (1) Community environments, (2) Research infrastructures, and (3) Philological projects. Within the first group Interedition, Bamboo, TextGrid, and Open Philology are discussed; the second group encompasses DARIAH, CLARIN, and Europeana; finally, the third group briefly introduces projects with a marked philological perspective, i.e., the Perseus project, the Van Gogh digital edition, Muruca/Pundit, SAWS, DINAH, NINES, and the Textual Communities platform.

2.2.1 Community environments

Interedition

The Interedition initiative², funded by the European Cooperation in the field of Scientific and Technical Research (COST action), aims at establishing a research community for scholarly editing. The project is also supported by the European Society for Textual Scholarship (ESTS) and is devoted to promoting both the interoperability of digital editing tools and a shared methodology in the field of textual scholarship. Interedition fosters the development and availability of web services to make interoperability among various tools and scholarly resources possible. The solution proposed by Interedition for textual heritage management is twofold: (a) the creation of a sustainable digital edition model, and (b) the development of a sustainable technical framework and infrastructure. To reach these objectives, the initiative encourages projects dealing with textual scholarship to develop tools that are as open as possible, and it promotes communication amongst scholars through innovative and shared frameworks. The Interedition architecture provides a number of cloud web services (called microservices), which support specific tasks in scholarly editing. The microservices are designed as reusable building blocks of digital scholarly tools. The well-known CollateX (see section 2.3.4), which inherits from Peter Robinson’s Collate tool [61], has been developed within this framework. Apart from providing and encouraging the development of tools, several projects involving the creation of digital editions, archives and corpora operate under the Interedition umbrella. The majority of them regard individual authors and works, as well as specific historical periods.

² http://www.interedition.eu/

Bamboo

Bamboo³, funded by the Andrew W. Mellon Foundation, is a cyber-infrastructure initiative. The project aims to enhance research in the arts and humanities thanks to shared infrastructures and services. As with Interedition, Bamboo’s objective is to assist each step of the scholarly workflow, from acquiring and accessing texts to analyzing and editing them, by implementing web services within a collaborative environment. The challenge of this initiative has been to advance arts and humanities research through the development of shared technology services. Among the technical requirements, flexibility and scalability are the fundamental architectural features to be addressed. Unfortunately, as outlined in [62], the infrastructure development has experienced various challenges: “changing scholars’ traditional practices is not effortless”. Nevertheless, even if the project was not able to realize its main objectives, it established an important channel of communication among scholars, librarians, computer scientists, and information engineers [63].

TextGrid

TextGrid⁴ is a Java-based Web environment funded by the German Federal Ministry of Education and Research. TextGrid, as suggested by its name, is based on a grid of computers and represents a virtual research infrastructure for philologists, linguists, musicologists and art specialists. It promotes access to, and exchange of, data in the field of DH. This German framework provides integrated tools for analyzing texts and gives computer support for digital editing purposes [64].

The system architecture encompasses several interoperable layers. Additionally, it leverages a long-term repository embedded in a fully meshed infrastructure connected to the Internet. Services and access tools compose the Middleware layer, which, in turn, is used by other modules and software components positioned at a higher level of the system architecture. These latter exploit the Middleware layer to communicate both with the Grid and with the user interface (GUI). The system provides many tools and services, among others: Editor XML, Text-Image-Link Editor, Dictionary Search Tool, Lemmatizer, Tokenizer, Semantic Search Tool, Metadata Editor, Upload Tool, Streaming Editor, Sort Tool, Classical Philology Gloss Editor, LEXUS and COSMAS, Collationer and Dictionary Link Editor. The development platform relies on the Java programming language and on engineering principles for software development, as well as on text processing tools, text mining and natural language processing.

³ http://www.projectbamboo.org/
⁴ https://www.textgrid.de/

Open Philology Project

The Open Philology Project⁵ is an initiative within the Humboldt Chair of Digital Humanities at the University of Leipzig. The objective of this project is to enable Greco-Roman culture “to realize the fullest possible role in intellectual life” by exploiting the opportunities offered by digital technology. The initiative covers three complementary functions: 1. producing open philological data, 2. educating a wide audience about historical languages, and 3. integrating open philological data from many sources. The project focuses on the philological and linguistic perspectives of sources. In this sense, Open Philology attempts to convert as many manuscripts and printed works as possible into digital resources. Furthermore, it provides tools for annotating, comparing, connecting, interpreting, proving or rejecting hypotheses, finding evidence, and studying critical apparatuses and commentaries. The Open Philology Project builds upon and supports the Perseus Digital Library. Indeed, it contributes to expanding open collections and services, while reaching an increasingly global audience. The production of data includes four modules: 1. a Workflow module, enabling a digital representation of a written source enriched by linguistic and named-entity annotation; 2. a Distributed review module, assessing and representing the reliability of data; 3. a Repository module, preserving data based on the Collections, Indexes, Texts, Extensions (CITE) / Canonical Text Services (CTS) architecture (see section 2.4.2); 4. an e-Portfolio module, aggregating and distributing users’ contributions.

5http://www.dh.uni-leipzig.de/wo/open-philology-project/


2.2.2 Research infrastructures

Digital Research Infrastructure for the Arts and Humanities

The Digital Research Infrastructure for the Arts and Humanities (DARIAH) project6 aims to develop a European infrastructure and a data platform able to provide and integrate services for the digital arts and humanities [65]. The DARIAH data backbone attempts to be as open and decentralized as possible. This configuration supports federation mechanisms that can be applied efficiently in research-oriented applications across diverse data sources. The principal efforts focus on data analysis, data visualization, and task management. In addition, the infrastructure allows any suitable source to be added, while granting the possibility to select resources for a specific application environment; as a result, different sets of data sources can coexist. As far as new application development is concerned, it requires more effort than simply querying an existing database, since these applications operate on the federated DARIAH backbone. Thanks to this infrastructure, scholars have a digital means to enhance data quality, to obtain long-term preservation and interoperability, and to foster data sharing. In order to manage the aforementioned capabilities, DARIAH has established four programs: (1) e-infrastructure; (2) research and education; (3) scholarly management; (4) advocacy, impact, and outreach.

Common Language Resources and Technology Infrastructure

Common Language Resources and Technology Infrastructure (CLARIN) is an international project7 which aims at creating a stable and extensible research infrastructure for language resources and technologies. In particular, the tools and resources address the humanities and social sciences research communities [66]. CLARIN defines important technical aspects such as integration, interoperability, stability, persistency, accessibility, and extendibility. It also specifies a set of procedures that need to be adopted in order to realize these aspects. CLARIN attempts to implement an infrastructure able to assist researchers in digitizing resources and in collecting and annotating large corpora, dictionaries, or language descriptions. The project endeavors to provide wide, safe and easy access to language resources. The CLARIN backbone leverages secure grid technologies which allow the federation of institutions. In this context, the access policy uses Single Sign On (SSO), so that users log in to the federation only once [67]. In addition, CLARIN adopts a number of International Organization for Standardization (ISO) standards describing language resources and tools for the sake of interoperability (an overview of the data model is given in section 2.4.2).

6https://www.dariah.eu/
7http://clarin.eu/

Europeana

Europeana8, co-funded by the European Union, is a platform that enables information and knowledge exchange in the domain of Digital Humanities. The framework serves as a point of access to digitized cultural heritage objects. These resources belong to different cultural fields and are provided by various institutions across Europe [68]. Europeana thus aggregates the results of many funded European digitization projects, which can be accessed through it. The project aims to make rich data and functionality available to textual scholars by means of both Application Programming Interfaces and the Linked Open Data paradigm (see sections 2.6.4 and 2.4.2). The Europeana portal itself can be seen as one of the outcomes built on top of this programmatic access to its resources. Europeana pursues two main objectives: 1. facilitating connections among related data and making them easily accessible through common Web technologies, and 2. enabling everyone to access, reuse, enrich and share data [69, 70].

2.2.3 Philological projects

Perseus project

The Perseus project9, hosted at Tufts University, is one of the leading initiatives in providing scholars with a suitable cyber-infrastructure. In order to build this environment, the infrastructure includes two tools: Philologist/Perseids - powered by Son of Suda (SoSol) - and Alpheios. These Web applications allow scholars to perform version-controlled editing of texts and their linguistic analyses. The Perseus project is the largest archive concerning the Graeco-Roman world and classical Greek and Latin texts. It provides data and tools for linguistic analysis, such as treebanks, and annotated entities in an Open Annotation Collaboration (OAC)-compliant Resource Description Framework (RDF) format. The digital archive also provides, among other resources, lexical databases. The Perseus archive aims to provide systematic repository access to every authoritative Greek and Latin author, encoded in the eXtensible Markup Language (XML), in particular following the guidelines of the Text Encoding Initiative (TEI) (see section 2.4.1). To do so, this digital library uses the CTS/FRBR data model (see section 2.4.2) to represent different primary sources, scholarly editions and translations of particular works.

8http://www.europeana.eu/
9http://www.perseus.tufts.edu/

Sharing Ancient Wisdom

Sharing Ancient Wisdom (SAWS) is a project10 funded within the Humanities in the European Research Area (HERA) programme. It aims at establishing a research workflow for editing and publishing TEI/XML-based digital editions of ancient manuscripts [71]. As an initiative, the SAWS project is similar to the works illustrated in chapter 4. It includes a linking software module and a semantic enhancement analysis of its target resources. Moreover, the SAWS methodology encompasses TEI markup, CTS Uniform Resource Name (URN) identifiers, semantic relationships among named entities, and parallel analysis of Greek and Arabic literary texts.

Muruca/Pundit

Muruca/Pundit11 is one of the most widely used integrated platforms for studying, annotating, searching, and browsing digital literary materials. Several noteworthy projects leverage this framework, among others NietzscheSource [38]12, WittgensteinSource13, and the more recent BurckhardtSource14, Daphnet15, and Digitised Manuscripts to Europeana (DM2E)16. The environment is extremely customizable and assists scholars with knowledge creation and knowledge reuse [72]. Muruca/Pundit leverages the Open Annotation Data Model (OA) specification and the Europeana Data Model (EDM) definition (see section 2.4.2). The system exploits Semantic Web (SW) technologies such as RDF and HyperText Transfer Protocol (HTTP) Application Programming Interfaces (APIs) for linking and consuming data. The central module of the platform is an annotation server that manages persistence and users.

10http://www.ancientwisdoms.ac.uk/
11http://www.muruca.org/ - http://thepund.it/
12http://www.nietzschesource.org/
13http://www.wittgensteinsource.org/
14http://www.burckhardtsource.org/
15http://www.daphnet.org/
16http://dm2e.eu/

Van Gogh Letters

Van Gogh Letters17 is a collaborative project between the Van Gogh Museum and the Huygens Institute of the Royal Netherlands Academy of Arts and Sciences [73]. The initiative consists of a web platform which supports an advanced browsing process. Scholars are able to interact with a dynamic edition which includes diplomatic transcriptions, facsimiles, notes and artworks. The sources are marked up by using the TEI scheme (see section 2.4.1).

DINAH

DINAH is a Web-based philological platform aimed at the construction of multi-structured documents [4]. The system provides scholars with a suitable multi-perspective annotation environment and with a mechanism for managing overlapping hierarchies. This tool allows scholars to annotate multi-structured textual phenomena in a dynamic way, a strategy that fosters an efficient management of complex documents. Within the DINAH platform, scholars are free to create new structured vocabularies; this means that the available markup elements are not fixed and static.

NINES

Networked Infrastructure for Nineteenth-century Electronic Scholarship (NINES)18 is a scholarly project whose objective is to create a coordinated network of peer-reviewed content in order to obtain a comprehensive framework for scholars. The project is developing a number of valuable tools aimed at processing electronic resources and at promoting digital initiatives for textual scholarship [74]. Among these, the Collex tool and JuxtaCommons (see section 2.3.3) are worth mentioning, as they assist scholars in the interpretive study of texts. Finally, the platform allows users to teach and carry out research in an online environment.

17http://www.vangoghletters.org/
18http://www.nines.org/


Textual Communities

Textual Communities19 is a collaborative editing environment based on social media software. It has been designed to handle different cultural perspectives on the studied materials, namely the text, the document, and the work perspective. The project was born within the Interedition initiative (see section 2.2.1). One of the critical aspects that the Textual Communities system handles is the double representation of digital sources as text marked up on a document and as text structured into communicative acts. To this end, a formal scheme called DET, developed for describing documents, texts and works, identifies every part of each textual entity, in every document, text and work, and defines how textual elements relate to each other. It is worth noting that, in the system model, both the document and the text are completely hierarchical: each successive object within a sequence of textual objects must be contained within the preceding object. For example, a line is part of a page, which is part of a document. The Textual Communities environment relies on the document-oriented MongoDB system for persistence and on JSON objects for data exchange and serialization. The system provides functionality for comparing different transcription revisions and for collating the text across all the documents.
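A minimal Java sketch of the strict containment rule just described may help to visualize the model; the type and field names are hypothetical (they do not reproduce the actual DET schema), and Java records are used only for brevity.

// Hypothetical illustration of strict containment: every textual object
// is wholly contained in the object that precedes it in the hierarchy
// (a line belongs to a page, a page belongs to a document).
import java.util.List;

public class ContainmentSketch {

    record Line(int number, String transcription) {}

    record Page(String folio, List<Line> lines) {}

    record Document(String siglum, List<Page> pages) {}

    public static void main(String[] args) {
        Line l1 = new Line(1, "In principio creavit Deus");
        Page p1 = new Page("1r", List.of(l1));
        Document d = new Document("MS-A", List.of(p1));
        // Navigation always follows the containment chain: document -> page -> line.
        System.out.println(d.siglum() + " / " + d.pages().get(0).folio()
                + " / line " + d.pages().get(0).lines().get(0).number());
    }
}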

2.3 Textual scholarship tools

Word processors were among the first computer programs adopted for textual editing: they can easily be used to produce complex formatted documents. However, they are not suitable for specific and complex activities such as the construction of critical editions. Several applications and tools have been developed over the years to handle digital documents from a philological perspective or for publishing and delivering purposes [75]. These efforts have achieved significant results and benefits for both classical and computational scholars. A list of the most renowned tools and applications follows. It covers excellent products, admirable experiments on critical editorial problems, and sophisticated solutions devised to produce electronic editions. Nevertheless, at the time of writing, not all communities of scholars consider the results to be worth the effort required by the encoding operations [3].

19http://www.textualcommunities.usask.ca/


2.3.1 TUSTEP - Tübingen System of Text Processing

The Tübingen System of Text Processing (TUSTEP) is one of the most used and discussed environments for textual scholarship20. It has a long history, dating back to the late ’60s, when Wilhelm Ott at the University of Tübingen conceived this tool [32]. TUSTEP exploits the Fortran and C programming languages and is characterized by a complex framework21. It operates on a modular basis and can assist different phases of textual scholarship workflows. This tool supports scholars in making critical editions, starting from a semi-automatic collation phase. In the end, TUSTEP transforms the collation outcomes into a critical apparatus and publishes the edition both in printed and electronic versions. As mentioned, the collation algorithm does not provide a completely automatic procedure; in fact, users need to run a separate application in order to fix the more problematic readings. Recently, a new tool, the Tübingen XML-based scripting language for scholarly text data processing (TXSTEP), has been launched. It is an XML-based front-end built on top of the TUSTEP architecture. The main objective of TXSTEP is to provide an “up-to-date established syntax using a more friendly interface in terms of XML editing” [76]. TXSTEP is a sort of XML-based API towards the classical functionality provided by the TUSTEP framework. To conclude, the TUSTEP/TXSTEP environment offers useful services, like indexing features and pattern matching functionality. Nevertheless, the system presents several limitations, such as a steep learning curve and a general difficulty of use.

2.3.2 LaTeX / Mauro-TeX

MauroTeX22 is a textual scholarship tool developed from the well-known LaTeX typesetting system. It can be described as a markup environment designed to cope with the whole philological process [77] (i.e. starting from the transcription of the witnesses and ending with the publication of the critical edition). The project aims at publishing the critical edition of Maurolico’s works both in a traditional printed and in an electronic version. MauroTeX allows scholars to obtain the electronic version of the work through a HyperText Markup Language (HTML) or a Portable Document Format (PDF) output. In addition to providing the LaTeX-style markup, MauroTeX also offers a more suitable XML encoding style [78]. Concerning the LaTeX style, the basic idea of MauroTeX is to identify variant readings by means of a specific macro (\VV{}). This macro can convey an indefinite number of fields nested into the main brackets; each field represents a variant reading and can be used to build the stemma codicum. In addition, further shorthands are available in order to handle the critical apparatus with the standard philological arguments, such as interlinear additions (\INTERL). Finally, the language provides a set of macros that cover basic textual phenomena, such as omissions, conjectures, integrations, expunctions, trivial variant readings, pages, titles, and so on. The macros rely on other LaTeX packages specifically developed for critical editions, such as ledmac/edmac, ednotes or bigfoot23.

20http://www.tustep.uni-tuebingen.de/tustep eng.html/
21TUSTEP requires training in order to be used, and relies on outdated standards.
22http://www.maurolico.unipi.it/mtex/mtex.htm/

2.3.3 JuxtaCommons

JuxtaCommons24 is an open-source, web-based tool aimed at handling collation and producing critical apparatuses. At the time of writing, it is actively developed using the Java programming language and appears to be the most advanced framework among similar initiatives. The Juxta environment, developed within the NINES initiative (see section 2.2.3) by the University of Virginia, allows users to upload different types of input. It provides capabilities for collating and visualizing digital sources by means of browsers. In addition, the tool provides several user-friendly interactions and functionalities. The collation phase relies on an adaptation of Heckel’s algorithm [79] as well as on additional tools which have been developed within the Interedition project (see section 2.2.1). The system produces XML-TEI collation output in the parallel segmentation mode. Several visualization diagrams, such as histograms and maps, are provided. It is worth noting that the framework exposes a Web services API to exchange data and to share functionality.

2.3.4 CollateX

CollateX25 is a philological tool developed in the Java programming language [80] which aims to work effectively with textual variant readings. In addition, CollateX is an ongoing effort which actively collaborates with the JuxtaCommons project (see section 2.3.3) within the Interedition initiative (see section 2.2.1).

23For further detail please visit https://ctan.org/pkg/
24http://juxtacommons.org/
25http://collatex.net/


Consequently, it offers a valuable framework to perform collation work. The tool allows users to switch among at least three interchangeable collation algorithms, all based on progressive alignment techniques (see section 2.6.2):

• the Dekker algorithm [80], which aligns an arbitrary number of texts, detecting transpositions and performing local alignment;

• Needleman-Wunsch [81], an optimal algorithm for global alignment (a minimal sketch of this classic technique is given after this list);

• MEDITE, an experimental implementation of the Bourdaillet and Ganascia algorithm [82].
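As a purely illustrative aid (this sketch is not taken from CollateX, whose implementation differs), the Needleman-Wunsch technique mentioned in the list can be realized as a simple dynamic-programming procedure over two token sequences, with unit scores for matches, mismatches and gaps:

import java.util.Arrays;

// Minimal Needleman-Wunsch global alignment of two token sequences.
// Scores: match = +1, mismatch = -1, gap = -1. Illustrative only.
public class GlobalAlignment {

    static String[] align(String[] a, String[] b) {
        int[][] score = new int[a.length + 1][b.length + 1];
        for (int i = 0; i <= a.length; i++) score[i][0] = -i;
        for (int j = 0; j <= b.length; j++) score[0][j] = -j;
        for (int i = 1; i <= a.length; i++) {
            for (int j = 1; j <= b.length; j++) {
                int diag = score[i - 1][j - 1] + (a[i - 1].equals(b[j - 1]) ? 1 : -1);
                score[i][j] = Math.max(diag, Math.max(score[i - 1][j] - 1, score[i][j - 1] - 1));
            }
        }
        // Trace back from the bottom-right cell to recover the two aligned rows.
        StringBuilder rowA = new StringBuilder(), rowB = new StringBuilder();
        int i = a.length, j = b.length;
        while (i > 0 || j > 0) {
            if (i > 0 && j > 0
                    && score[i][j] == score[i - 1][j - 1] + (a[i - 1].equals(b[j - 1]) ? 1 : -1)) {
                rowA.insert(0, a[--i] + " ");   // tokens aligned with each other
                rowB.insert(0, b[--j] + " ");
            } else if (i > 0 && score[i][j] == score[i - 1][j] - 1) {
                rowA.insert(0, a[--i] + " ");   // gap in the second witness
                rowB.insert(0, "--- ");
            } else {
                rowA.insert(0, "--- ");         // gap in the first witness
                rowB.insert(0, b[--j] + " ");
            }
        }
        return new String[] { rowA.toString().trim(), rowB.toString().trim() };
    }

    public static void main(String[] args) {
        String[] w1 = "sing o goddess the anger".split(" ");
        String[] w2 = "sing goddess the wrath".split(" ");
        System.out.println(Arrays.toString(align(w1, w2)));
    }
}

Progressive alignment, as adopted by the tools above, generalizes this pairwise step by repeatedly aligning each new witness against the result of the previous alignments.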

The collation results can be exported in various data formats, such as JavaScript Object Notation (JSON), XML-TEI, GraphML, GraphViz, and Scalable Vector Graphics (SVG). In addition, multiple web services have been developed under the umbrella of Interedition (see section 2.2.1). A noteworthy aspect is that the data structure used by the application is the “variant graph” (see section 2.4.1). The design of CollateX envisages a core library, which makes it possible to embed the package into external applications.

2.3.5 Text::TEI::Collate

Text::TEI::Collate26 is a promising, open-source tool which the University of Oxford is developing for scholarly text collation. The program, implemented in the Perl programming language, is highly modular and has a well-designed internal structure. The outcomes of the process can be provided in several formats, such as plain strings, JSON, or XML-TEI in parallel segmentation format. The design of this tool promotes the implementation of useful APIs; indeed, Text::TEI::Collate seems to be conceived as a coherent library for textual collation. Since the tool is still under development at the time of writing, the algorithm remains basic and does not handle transposition.

26http://search.cpan.org/~aurum/Text-TEI-Collate-2.1/lib/Text/TEI/Collate.pm/


2.3.6 eXist-db

eXist-db27 is a native XML database (DB) system implemented in Java and freely available as open source [83]. The system is modular and widely used within the digital humanities community, since its fundamental unit of storage is the XML document. As a matter of fact, it improves performance for expensive eXtensible Stylesheet Language Transformations (XSLT) and XPath queries, which are typically applied to documents encoded by digital scholars (see section 2.4.1). eXist creates a Document Object Model (DOM) tree from the XML document and provides indexing mechanisms with regard to the structure of elements and attributes, string search and full-text search (see section 2.5.1). In addition, the indexing scheme identifies relationships among nodes in the document (elements, attributes, text, and comments). The framework provides a RESTful end-point, through which developers and researchers can easily access and modify remote documents.
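As a rough illustration of that RESTful access, the snippet below sends an XPath query to a local eXist instance. The base URL, the collection path and the _query parameter follow eXist’s documented REST interface as recalled here and should be treated as assumptions to verify against the installed version.

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical query against a local eXist-db REST endpoint.
public class ExistRestQuery {
    public static void main(String[] args) throws Exception {
        String xpath = "//title";  // any XPath/XQuery expression
        String url = "http://localhost:8080/exist/rest/db/mycollection/edition.xml"
                + "?_query=" + URLEncoder.encode(xpath, StandardCharsets.UTF_8);
        HttpResponse<String> response = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create(url)).GET().build(),
                HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());  // matching XML fragments, typically wrapped in a result element
    }
}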

2.4 Data and metadata encodings

Many technologies exist for representing and processing digital documents and texts. In order to develop a computational environment for textual scholarship, it is fundamental to evaluate at least three essential points:

• standardization of data and metadata formats for literary and philological studies;

• modelling of domain ontologies for the semantic description of the cultural objects;

• services for data exchange among working groups, using common protocols and a shared identification of textual units.

These three elements contribute to the interpretation and description of the different features contained in textual materials. Digital texts and documents are the result of different levels of encoding, ranging from characters up to semantics and pragmatics.

27http://exist-db.org/


In the first place, this section will touch upon the Unique, Universal, and Uniform character enCoding (UNICODE) standard. It will then discuss the hierarchical model theorized by Renear, Mylonas and Durand [84]. While several markup languages for digital documents exist (PDF and HTML are examples of general-purpose formats for encoding texts), the most common language adopted for encoding cultural heritage documents is XML. Generally, the XML vocabulary suitable for literary computing makes use of the guidelines drawn up by the Text Encoding Initiative (see section 2.4.1). However, the hierarchical structure of cultural heritage documents faces issues such as overlapping hierarchies and multiple versions [85]. The variant graph attempts to solve such matters [86, 87]. On the one hand, the identification of textual units is formalized by the CITE/CTS architecture, which associates a URN with every textual entity of any specific edition. On the other hand, the conceptual model developed by the Functional Requirements for Bibliographic Records (FRBR) allows modelers to move beyond printed book versions, by tracking the logical units within and across traditional works. Various institutions have proposed models and solutions concerning ontology and domain conceptualization; for example, Europeana and CLARIN have formalized their data models, as witnessed by the EDM and the CLARIN Metadata Infrastructure (CMDI), respectively. In conclusion, Web services facilitate data exchange among working groups distributed world-wide, using common protocols such as OAI-PMH and OAI-ORE. With this rationale, a survey of data and metadata approaches is proposed below. It covers standards of description which are directly influenced by the nature of textual objects.

2.4.1 Data

UNICODE

Text refers to a stream of characters [88] which are represented according to some numeric mapping. Consequently, a single binary code represents a specific character, which in turn can be identified with a related symbol. Characters within written resources are symbols denoting letters, punctuation, or diacritical marks. In this framework, two levels of information have to be considered: a) the abstract symbol and b) its specific shape (glyph). As a result, the encoding is mainly designed for information exchange rather than for appearance purposes.


Nowadays, since classical and historical texts cannot be well encoded by the ASCII standard, cultural texts have to rely on the UNICODE character set, as it covers nearly all writing systems [89]. The most widely adopted UNICODE encoding forms are the fixed-width two-byte encoding (UCS-2) and the variable-length UTF-8.

UNICODE, which has been aligned with the Universal Character Set (UCS) standard (ISO/IEC 10646) [90], makes it easy to add new scripts and guarantees language independence. It defines a code-space of 1,114,112 potential code points divided into 17 planes, each consisting of 65,536 code points. For example, the Basic Multilingual Plane (BMP) is Plane 0, ranging from U+0000 to U+FFFF (code points in the BMP use just four hexadecimal digits, U+nnnn). The BMP is an important plane as it supports the inclusion of older standard character sets (for example, the basic American Standard Code for Information Interchange (ASCII) codes correspond directly to the first UNICODE code points).

It also provides characters for most literary texts. It follows that ranges of code points have been reserved for ancient writing systems, “already known or still to be discovered” [91]. UNICODE code points are handled by using ad hoc encoding maps (Unicode Transformation Format (UTF)). Different kinds of UTFs exist: some use bit configurations of fixed length, others of variable length. For instance, UTF-8 consists of sequences ranging from one to four bytes (up to six in the original specification), whereas UTF-32, also known as UCS-4, consists of fixed-length 4-byte sequences. It is worth noting that while UTF-8 is platform-independent, UTF-16 and UTF-32 are platform-dependent formats as regards byte ordering; they are also incompatible with the ASCII table. For these reasons, 8-bit encodings should be preferred for representing text, even if the platform is based on different formats. Moreover, UTF-8 adopts a segment-based management: the binary code for frequent characters uses few bytes, while dedicated lead bits signal that a character representation is longer. Notably, UTF-8 allows string matching in a text file to be performed with a byte-wise comparison (a valuable property that enables faster algorithms). UTF-8 represents ASCII values (0-127) using a single byte with the leftmost bit set to 0, which entirely corresponds to its ASCII counterpart. To conclude, typical software for UNICODE text processing, especially in Java, first calls a function to translate a UTF-8 byte stream into a buffer of 16-bit characters, upon which all remaining text processing is done.
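The following self-contained Java fragment illustrates the points above on a string mixing ASCII and Greek letters: the code points are listed in the usual U+XXXX notation, and the UTF-8 serialization shows that ASCII characters occupy a single byte while Greek characters need more.

import java.nio.charset.StandardCharsets;

// Code points versus UTF-8 byte sequences for a string mixing ASCII and Greek.
public class UnicodeDemo {
    public static void main(String[] args) {
        String s = "Iliad: μῆνιν";
        // Each code point, printed as U+XXXX.
        s.codePoints().forEach(cp -> System.out.printf("U+%04X ", cp));
        System.out.println();
        // UTF-8 serialization: ASCII letters take one byte, Greek letters take more.
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        System.out.println("code points: " + s.codePointCount(0, s.length())
                + ", UTF-8 bytes: " + utf8.length);
    }
}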


Ordered Hierarchy of Content Objects

According to [84], text is an Ordered Hierarchy of Content Objects (OHCO). The text is ordered because it encompasses linear sequences. At the same time, objects within a text are hierarchical because entities like paragraphs, sentences, and words exist inside one another. This textual model structures documents in a hierarchical form and allows fast processing by computers thanks to the inherent tree representation. However, such a representation shows some limits, since it admits only a single hierarchical perspective. Consequently, the authors of OHCO extended the basic model by introducing two enhancements: OHCO-2 and OHCO-3. According to the latter, texts present several separate (analytical) perspectives, each of which is organized hierarchically. In this case, perspectives can contain overlapping structures (for example, page divisions can overlap sentence divisions). OHCO is a controversial model, which has generated theoretical discussions on the interpretation and the digital representation of a document [92, 50]. For instance, in [42] it is argued that an OHCO structure “is not a model of the text, but a possible model for its expression”. Thus, the same document conforms to several overlapping structures, each of which is strictly hierarchical. Nevertheless, textual structures are usually more complex than a single hierarchy. In fact, hierarchies are suitable only for certain kinds of texts and documents, for example printed works, but they do not fit texts like notes well (see chapters 3 and 4). Moreover, forcing a single hierarchy while encoding historical texts can lead to erroneous outcomes: for example, variant readings and transpositions are textual phenomena that overlap any fixed hierarchy. The variant graph [87] (see section 2.4.1) is an attempt to solve this problem.

Text Encoding Initiative

The digital representation of documents concerns the encoding of various textual phenomena; for instance, a text encompasses sentence divisions or line numberings. However, common markup techniques that handle the information and visual appeal of data do not meet all scholars’ needs. In particular, scholars require tools that highlight and analyze the textual phenomena relevant to their work [93].


Against this background, the TEI28 pursues the standardization of markup schemes for literary and philological studies. The TEI provides XML schemas and guidelines for text, extra-text, and para-text encoding with bibliographical, linguistic and philological meta-information. More than other XML vocabularies, TEI markup schemes meet scholars’ need to encode texts that can be reused as a starting point for further inquiries. The main benefit of XML, and especially of the TEI guidelines [94], lies in its simplicity, flexibility, readability and customizability. Moreover, this technology guarantees a formal approach for validating the data that has been marked up. XML thus provides a standard way to define a set of tags (a vocabulary) for specific purposes. Moreover, the multiplicity of technologies that gravitate around XML makes it possible to transform (XSLT), publish (XSL-FO), query and address (XQuery, XPath, XPointer), formalize (DTD, XSD, etc.), render (eXtensible HyperText Markup Language (XHTML), SVG) and process (XInclude, XForms, XML Events) structured documents. It is worth noting that digital texts, along with complex annotated data, have strongly increased in number and quality also thanks to the role of XML technologies and services. The TEI guidelines have become the standard encoding system for applications within literary computing projects. Moreover, a large community uses these schemes to encode texts and to develop software that operates on the raw and annotated documents. Texts marked up in XML-TEI look like HTML documents; in fact, the two languages have similar tags, as they are both based on another markup language known as the Standard Generalized Markup Language (SGML). The most important difference is that HTML is conceived to annotate Web documents, while XML describes structured or semi-structured data [95]. In fact, XML provides mechanisms to define the markup elements and how to use them in a flexible and formal way. These mechanisms are the Document Type Definition (DTD) and the XML Schema Definition (XSD); in particular, they define the elements available for encoding the document and the typed data those elements may contain. The TEI provides elements for encoding both data and metadata by means of two blocks: header and text. The TEI header contains four major sections: (1) file description, (2) encoding description, (3) text profile and (4) revision history [95].

28http://www.tei-c.org/

• The file description contains bibliographical information and the title of the electronic resource. Moreover, it records data about the authors of the work and the place and time of publication. It also contains the source description, which stores data regarding the sources of the electronic document.

• The encoding description indicates the relationship of the electronic document to its sources, such as the selection of texts, text normalization, and levels of encoding.

• The text profile contains information about the languages used and the citation scheme (e.g. chapters instead of pages, or sentences instead of line numbers). It also reports information about the text production.

• Finally, the revision history lists every change made to the electronic text in order to support future editing work.

Furthermore, TEI organizes the text into separate parts: front matter, text body, and back matter. The scheme consists of about four hundred tags. The core features are available for all types of texts, while extended tags are available for specific types of works, such as prose, poetry, critical editions, and others. An important advantage of adopting TEI is that it helps projects preserve, share and reuse the encoded resources; thus, it ensures interoperability with regard to both texts and software. However, the use of TEI is not sufficient to solve all issues related to encoding complex documents, such as multilingual and multi-witness materials, non-standard typography or embedded figures. Moreover, overlapping hierarchies are difficult to translate into a hierarchical design without breaking the model itself. Nevertheless, the advantages of XML lie in scholarly analysis, in long-term preservation and in data exchange across different computing environments.
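As a small illustration of how such encoded material can be consumed programmatically, the fragment below reads a TEI file with the standard Java DOM API and prints every title element it finds (typically declared in the titleStmt of the teiHeader). The file name is a placeholder, while the namespace URI is the one prescribed by the TEI guidelines.

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import java.io.File;

// Extract <title> elements from a TEI-encoded document.
public class TeiHeaderReader {
    private static final String TEI_NS = "http://www.tei-c.org/ns/1.0";

    public static void main(String[] args) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);                       // TEI elements live in the TEI namespace
        Document doc = factory.newDocumentBuilder().parse(new File("edition.xml"));
        NodeList titles = doc.getElementsByTagNameNS(TEI_NS, "title");
        for (int i = 0; i < titles.getLength(); i++) {
            System.out.println(titles.item(i).getTextContent());
        }
    }
}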

Variant Graph

Section 2.4.1 has underlined the challenges experienced when handling elements which denote different views of a document, e.g. sentences versus lines in a well-formed XML structure.


The reason for these issues is that XML represents tree structures, whereas textual phenomena often require graph structures [86]. This matter is better known as the overlapping hierarchies problem [96]. Indeed, XML-based vocabularies cannot easily represent overlapping structures by means of tags; in fact, tags can contain only text or other markup elements in a nested and well-formed way [97]. Overlapping hierarchies are common in cultural heritage text structures. Two types of overlapping can be distinguished: 1. overlapping structures due to the textual/logical versus documentary/material representation, such as sentences and lines in a paragraph, and 2. textual differences in multi-version and multi-witness documents (these deal with insertions, deletions, alternatives and transpositions). These limits have generally been ignored by textual scholarship tools. In a well-known paper [87], Schmidt and Colomb propose a solution to this problem that has been widely accepted. They argue that directed graphs may represent both overlap and variation, by replacing the conventional (linear or hierarchical) structure of documents. Additionally, they claim that such graphs can be represented in two isomorphic forms: 1. as a variant graph or 2. as a list of key-value pairs. A way to achieve this is the Multi-Version Document (MVD) system [98]. The MVD model handles all the versions of a work, whether they result from a process of correction, or from copying the original text into several variant versions, or from a combination of the two. Overall, interventions on a text can be summarized in four atomic operations: insertion, deletion, substitution, and transposition. When reading the list form of the graph, fragments which do not belong to the desired version are simply ignored. The structure and contents of this graph are automatically generated by means of the nmerge program [36], which is able to detect transpositions and to merge the text that is shared among the existing versions. The variant graph provides a valuable solution to the overlapping problem in cultural heritage texts. This solution attempts to create a new model of text, based on a more general structure for these tangled documents. An MVD can be represented as a directed graph with one start node and one end node. It can also be serialized as a list of key-value pairs, each describing a fragment of text and the set of versions it belongs to. This multi-version approach scales well: as the number of versions increases, the number of fragments grows, but their size decreases and the size of their version sets increases.
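A minimal Java sketch of the list form just described (the names are illustrative and do not reproduce the nmerge data format) shows how a single version is read by skipping the fragments that do not belong to it:

import java.util.List;
import java.util.Set;

// List form of a multi-version document: each fragment of text carries
// the set of versions (witnesses) it belongs to. Reading one version
// simply skips the fragments that do not mention it.
public class MvdSketch {

    record Fragment(String text, Set<String> versions) {}

    static String readVersion(List<Fragment> mvd, String version) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : mvd) {
            if (f.versions().contains(version)) {
                sb.append(f.text());
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Fragment> mvd = List.of(
                new Fragment("the quick ", Set.of("A", "B")),
                new Fragment("brown ", Set.of("A")),   // witness A reads "brown"
                new Fragment("red ", Set.of("B")),     // witness B reads "red" instead
                new Fragment("fox", Set.of("A", "B")));
        System.out.println(readVersion(mvd, "A"));  // the quick brown fox
        System.out.println(readVersion(mvd, "B"));  // the quick red fox
    }
}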


2.4.2 Metadata

Dublin Core

The Dublin Core Metadata Initiative (DCMI)29 maintains a standard metadata scheme aimed at mapping catalogue information from repositories to systems that aggregate digital objects; for instance, Europeana (see section 2.2.2) has adopted this scheme. The objective of the Dublin Core (DC) is to define a small set of elements and rules that can be used for describing digital resources. The DC scheme is simple and concise, and it is oriented to the description of Web-based documents. Nowadays, the set of DC core elements consists of fifteen items: Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, and Rights. However, DC has also been used with other types of schemes and in applications demanding more complexity. Tensions have arisen between the minimalist view, which emphasizes the need to keep the element set as simple as possible, and the structuralist view, which argues for finer semantic distinctions. These discussions have resulted in the definition of qualified and unqualified DC elements. Qualifiers can be used to enhance the detail of an element; for example, the Date element can use a qualifier to state that the date format follows the ISO 8601 standard for representing date and time. All DC elements are optional and repeatable, and they may be presented in any order. The DC description recommends the use of controlled vocabularies for some fields (e.g. the Subject field), but this is not mandatory. However, special working groups are discussing a number of authoritative lists for certain elements, such as the resource Type field. Moreover, the DCMI encourages the adoption of domain-specific rules for special domains such as education and government. Because of its simplicity, the DC elements are largely used in DH; indeed, many scholarly projects adopt the DC scheme either for cataloguing or for collecting data. Nowadays, the DCMI has gone beyond the maintenance of the DC metadata set: it promotes the widespread adoption of interoperable standards and develops metadata vocabularies for discovery systems.

29http://dublincore.org/
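A tiny, purely illustrative record using the element names listed above (with the Date value expressed as an ISO 8601 year, as a qualifier would prescribe) can be modelled as a simple element-to-values map; the sample values are invented.

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// An unqualified Dublin Core description as an element -> values map.
// All elements are optional and repeatable, hence the list-valued map.
public class DublinCoreRecord {
    public static void main(String[] args) {
        Map<String, List<String>> dcRecord = new LinkedHashMap<>();
        dcRecord.put("Title", List.of("Commedia"));
        dcRecord.put("Creator", List.of("Dante Alighieri"));
        dcRecord.put("Language", List.of("ita"));
        dcRecord.put("Date", List.of("1321"));     // ISO 8601 year
        dcRecord.put("Type", List.of("Text"));
        dcRecord.forEach((element, values) -> System.out.println(element + ": " + values));
    }
}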


Europeana Data Model

The EDM, which has replaced the old Europeana Semantic Elements (ESE) model, is a popular conceptual data model (i.e. an ontology [99]). It has been developed within the Europeana project (see section 2.2.2) for the interoperability among cultural heritage repositories. The integration and refinement of heterogeneous knowledge bases leverage EDM, since it is designed as a model for collecting, connecting and enriching the descriptions provided by Europeana data providers. In addition, EDM integrates well-known standards and well-established ontologies, such as RDF, OAI Object Reuse and Exchange (see section 2.4.2), Dublin Core (see section 2.4.2), and others. Consequently, the model is fully compatible with the SW paradigm (see section 2.6.4), and it is able to interoperate among objects and their digital representations, as well as among objects and their metadata. Furthermore, EDM allows a semantic layer to be built on top of the aggregated objects. EDM aims at gathering data for services like Europeana and provides a top-level ontology devoted to abstracting the underlying models. In this way, the expressiveness and flexibility of the various standards are preserved, while achieving consistency among different formats. As described in [100], EDM provides five fundamental semantic relationships: 1) classification, 2) part decomposition of anything, 3) similarity, 4) aboutness, and 5) history of an item. In this perspective, EDM is a worthy initiative for the generalization of metadata properties. In conclusion, the value of this model is that it steers cultural heritage communities towards an open, linked, and semantically structured environment compliant with the SW paradigm. This approach largely impacts new projects and new data models, specifically in the DH community.

Functional Requirements for Bibliographic Records

FRBR30 is an entity-relationship conceptual model developed by the International Federation of Library Associations (IFLA)31. It allows scholars to define the minimum requirements of the bibliographical description of a source. FRBR also defines the purposes of the record and how it is structured independently of cataloguing rules.

30http://www.ifla.org/frbr-rg/
31http://www.ifla.org/


The model is largely adopted because of the potential it holds to analyze and document numerous aspects of a work hierarchy at different levels of granularity [101]. As a matter of fact, it defines four high-level classes (a minimal object-model sketch is given at the end of this subsection):

• the work as a distinct intellectual or artistic creation,

• the expression as the intellectual or artistic realization of a work,

• the manifestation as the physical embodiment of an expression of a work,

• the item as a single exemplar of a manifestation.

The remaining classes are grouped in two further sets:

• the group which contains the person and entity classes;

• the group which contains the concept, object, event, and place classes.

In addition, FRBR fosters connections among typed data, as well as the connection between the intellectual responsibility and the resource concepts. It is suitable for the management of multiple-version documents, and therefore it supports the registration of critical editions. Finally, the model provides a fair management of transactions, which generally requires:

• precise identification of the digital objects;

• the parties involved and their role;

• the action to be carried out on the requested objects such as read, listen, see, save, transfer;

• the parameters which limit it, such as place, time, amount, repeatability;

• the action of whoever uses the service in exchange for acquired rights.

In sum, FRBR represents the type of metadata required by scholars, as it handles the intellectual work of the text together with the physical object.
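The four high-level classes can be visualized as a chain of containment. The following Java sketch is purely illustrative: class and field names are not part of the IFLA specification, and the sample values are invented.

import java.util.List;

// Work -> Expression -> Manifestation -> Item, as nested containment.
public class FrbrSketch {

    record Item(String shelfmark) {}                                  // a single exemplar
    record Manifestation(String publisher, int year, List<Item> items) {}
    record Expression(String language, List<Manifestation> manifestations) {}
    record Work(String title, List<Expression> expressions) {}

    public static void main(String[] args) {
        Work iliad = new Work("Iliad", List.of(
                new Expression("grc", List.of(
                        new Manifestation("Example Press", 1920, List.of(
                                new Item("Library copy 1")))))));
        System.out.println(iliad.title() + " -> "
                + iliad.expressions().get(0).language() + " -> "
                + iliad.expressions().get(0).manifestations().get(0).year());
    }
}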

CLARIN Metadata Infrastructure

The CMDI32, implemented within the CLARIN research infrastructure (see section 2.2.2), aims to provide shared access to language resources and tools.

32http://www.clarin.eu/content/component-metadata/


This makes CMDI a flexible means for describing, searching and locating relevant data across repositories with different metadata policies. Therefore, this model allows scholars to describe language resources by means of various shared components; in this sense, a metadata profile can be adapted and reused for other similar objects. The infrastructure is articulated on three levels of abstraction: 1) data format, 2) data, and 3) service. Furthermore, CMDI organizes its structure by exploiting standards for several aspects, such as the concept and relation registries (ISOcat and RELcat), which are described by means of the Data Category Interchange Format (DCIF) [102]. The registry for components and profiles is formally defined by using the CMDI Component Specification Language (CCSL) and exported through a REST-based API. Finally, such profiles are also available as XML schemas and can be used for validating the metadata instances. The latter are distributed to CMDI service providers by means of shared protocols such as OAI-PMH (see section 2.4.2).

Open Archives protocols

This paragraph briefly illustrates the management of data interchange through the standard protocols within the Open Archival Information System (OAIS)33. Textual scholarship tools have to take into account the issue of interoperability among repositories. This effort includes the implementation of mechanisms and standards for content description and transmission. OAIS is an ISO reference model aimed at providing a framework for digital repositories; it supports metadata ingestion, curation, preservation, transformation and harvesting services. Besides OAIS, OAI-PMH and OAI-ORE are protocols that aim to establish standards-based interfaces for downloading data. Specifically, the Open Archive Initiative (OAI) Protocol for Metadata Harvesting (PMH) addresses metadata harvesting, while OAI Object Reuse and Exchange (ORE) concerns the exchange of information about digital objects. In particular, a process based on the OAI-PMH protocol can search by using several criteria and filters; in this way, it allows engines to retrieve only contents related to a specific data or metadata format. OAI-ORE, instead, is a specification which defines a model for the identification, description and transmission of digital objects (called aggregated resources) [103]. One of the most useful aspects of OAI-ORE is that a digital object is identified by an HTTP Uniform Resource Identifier (URI), and that digital object can be anything.

33http://www.openarchives.org/


The Aggregation, which describes aggregated resources, has a hierarchical and abstract structure, so that the protocol needs mapping mechanisms and serialization formats in order to obtain concrete object representations.
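By way of illustration, a harvesting client is essentially an HTTP client issuing requests with the standard OAI-PMH verbs; in the sketch below the repository base URL is a placeholder, while the verb and parameter names come from the protocol specification.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Issue a standard OAI-PMH ListRecords request and print the XML response.
public class OaiPmhHarvest {
    public static void main(String[] args) throws Exception {
        String endpoint = "http://repository.example.org/oai";   // placeholder base URL
        String url = endpoint + "?verb=ListRecords&metadataPrefix=oai_dc";
        HttpResponse<String> response = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create(url)).GET().build(),
                HttpResponse.BodyHandlers.ofString());
        // The response contains <record> elements and, for large sets, a resumptionToken.
        System.out.println(response.body());
    }
}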

Canonical Text Services

The CITE Architecture34 is a specification envisioned to cite “any word in any version of any text” [104]. The architecture was developed by the Center for Hellenic Studies within the Homer Multitext project, but it has been applied to other projects dealing with digital editions. It refers to Collections, Indexes and Texts and provides a protocol for citing and linking digital texts and digital objects. The core of the architecture is represented by two kinds of URNs: (1) the Canonical Text Service URN and (2) the Collection URN. Digital resources at different granularities are identified thanks to these two URNs. In addition, resources can be associated with each other by means of the CITE Index. The latter component organizes data in an RDF manner (i.e. triples), shaping a graph which consists of a subject entity, an object entity and a verb (i.e. a predicate) [105]. Moreover, texts are linked to the related Regions of Interest (ROIs) upon which they appear. Thus, the CITE Architecture links together the segments of a digital edition and the documents to their facsimiles [106]. The CTS, in turn, is a data model and a specification which defines remote services for identifying sources and for retrieving fragments of texts by canonical reference [107]. Compared to the FRBR data model [56], CTS URNs allow scholars to cite each word in any version of a text. This policy derives from the traditional citation schemes (e.g. chapter/verse identification); scholars can thus explicitly cite every version, and every word in every version, of a work. The CTS protocol defines the structure of the URNs according to the following pattern of informative fields:

"urn":"cts":[namespace]: [textgroup].[work].[edition/translation/version]: [passage]:[subreference]

While the string “urn:cts” defines the protocol, the other fields have the following meanings:

34http://www.homermultitext.org/hmt-docs/specifications/cts/specification.html/


• the namespace is the identifier for the higher-level domain for collection types;

• the textgroup is the “traditional, convenient groupings of texts such as authors or collections”;

• work means “a distinct intellectual or artistic creation”, i.e. a work that is represented in the text;

• edition/translation/version is the identifier of a specific edition of the work;

• passage represents the hierarchical reference in the selected work;

• the subreference points to a string of characters; it uniquely identifies an instance of a character sequence in the selected passage.

The following example from the CTS documentation provides an illustration:

• The text group Homer has the URN urn:cts:greekLit:tlg0012;

• the work Iliad has the URN urn:cts:greekLit:tlg0012.tlg001;

• the edition of an English translation of the Iliad has the URN urn:cts:greekLit:tlg0012.tlg001.mth-01;

• the passage in line 10 of book 1 of the selected edition of the Iliad would be urn:cts:greekLit:tlg0012.tlg001.mth-01:1.10;

• the subreference to the first instance of the sequence of characters Achilles is urn:cts:greekLit:tlg0012.tlg001.mth-01@Achilles[1].
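Following the pattern given above, a naive decomposition of such an identifier (purely illustrative, not a full implementation of the specification) can be written in a few lines:

// Naive decomposition of a CTS URN into the fields described above.
public class CtsUrnParser {
    public static void main(String[] args) {
        String urn = "urn:cts:greekLit:tlg0012.tlg001.mth-01:1.10";
        String[] colonParts = urn.split(":");
        String namespace = colonParts[2];                 // greekLit
        String[] workParts = colonParts[3].split("\\.");  // textgroup.work.version
        String passage = colonParts.length > 4 ? colonParts[4] : "";
        // An optional subreference (e.g. @Achilles[1]) would be appended with '@' and is not handled here.
        System.out.println("namespace = " + namespace);
        System.out.println("textgroup = " + workParts[0]);
        System.out.println("work      = " + (workParts.length > 1 ? workParts[1] : "-"));
        System.out.println("version   = " + (workParts.length > 2 ? workParts[2] : "-"));
        System.out.println("passage   = " + passage);
    }
}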

Open Annotation Data Model

OA35 is now a W3C specification which allows scholars to annotate distributed resources. It merges two existing initiatives, namely the OAC model and the Annotation Ontology. Annotation activities are a cornerstone process within textual scholarship [108, 44, 5, 109] because of the fundamental role they play in scholarly work. Indeed, comments and annotations encompass different levels of information at different granularities (see chapter 3).

35http://www.openannotation.org/spec/core/


This activity includes orthographical issues, pragmatic analysis, exegetical and hermeneutical comments, and granularities ranging from the sub-character to the meta-collection level, and it spans text-based to multimedia-based documents. The OAC provides an ontology allowing scholars to produce annotations by using standard Linked Open Data (LOD) technologies, such as RDF triples. The ontology defines three core nodes: a target resource (oac:Target), a body content (oac:Body) and the annotation itself (oac:Annotation). The Annotation node is identified by a URI and describes all the annotation properties, such as the creation date of the annotation. Body nodes can be composed of any type of data which relates to another resource. The Target node, also identified by a URI, is the resource to which the annotation Body node relates. The Body node can contain natural language comments by defining two properties: “cnt:chars” and “cnt:characterEncoding”. Finally, the model also provides date-time properties (oac:when).
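The three core nodes translate directly into RDF statements; the sketch below builds them with Apache Jena and prints the result as Turtle. The namespace URI assumed for the oac prefix, the hasBody/hasTarget property names, and the example resource URLs are assumptions of this illustration rather than verbatim excerpts of the specification.

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.RDF;

// Build a minimal annotation with the three OAC core nodes and print it as Turtle.
public class OacAnnotationSketch {
    // Assumed namespace URI for the "oac" prefix used in the text.
    private static final String OAC = "http://www.openannotation.org/ns/";

    public static void main(String[] args) {
        Model m = ModelFactory.createDefaultModel();
        m.setNsPrefix("oac", OAC);

        Resource body = m.createResource("http://example.org/notes/1")
                .addProperty(RDF.type, m.createResource(OAC + "Body"));
        Resource target = m.createResource("http://example.org/editions/iliad#1.10")
                .addProperty(RDF.type, m.createResource(OAC + "Target"));
        m.createResource("http://example.org/annotations/1")
                .addProperty(RDF.type, m.createResource(OAC + "Annotation"))
                .addProperty(m.createProperty(OAC, "hasBody"), body)
                .addProperty(m.createProperty(OAC, "hasTarget"), target);

        m.write(System.out, "TURTLE");
    }
}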

2.5 Related open source software libraries

Modern applications for textual scholarship make use of effective, open source libraries provided by various institutions. Open source projects are usually preferred, as they offer a wide range of artifacts suitable for supporting scholars’ work. Since well-designed libraries export suitable APIs, computational scholars also make use of them, more or less consciously, when writing software. Consequently, the section below introduces some libraries dealing with text and language processing. These initiatives share some important design-related features with the textual scholarship library described in this thesis; for this reason, they are frequently used in digital scholarly projects (see chapter 4). However, as already stated, the philological domain presents a wider class of problems that can be only partially faced by means of methods, resources, and tools developed within the computational linguistics field. Therefore, it is necessary to keep in mind that the libraries outlined in this section meet the needs of textual scholars only with respect to some processing aspects. A universe of digital documents conveys textual content through different, and often non-standard, file formats; as explained in this section, the Tika library attempts to act as a “digital babel fish” for such formats [110]. After a first phase in which digital content is gathered, indexing has to be handled.


In this regard, Lucene supports scholars with a suitable search engine [111]. Annotations can be created automatically by means of GATE and UIMA, two of the most widely acknowledged architectures for text analysis and annotation [112]. It is also worth mentioning StanfordNLP, OpenNLP and LingPipe, Natural Language Processing libraries developed in Java and able to process data from large document collections [112]. Despite their contribution to the field of textual scholarship, these libraries are not exhaustive. In addition to the aforementioned tools, there are other open source Java projects that are very useful when building textual scholarship applications. These include, among others, Weka [113], Mahout [114], and Carrot2 [115], which provide machine learning and data mining capabilities.

2.5.1 Text processing

Tika

Tika36 is an open-source library developed in Java under the Apache Software Foundation umbrella. It deals with extracting text and digital objects from a variety of digital documents and file formats, for instance PDF or document (DOC) files. In addition, Tika provides the necessary services for managing digital information content. As illustrated in chapter 4, Tika positively contributes to textual scholarship applications, due to the wide use of various file formats in the DH field. This library provides a special component, the parser module, which manages documents in specific file formats. As discussed in chapter 3, textual scholarship needs an abstraction of the document format by means of an API, through which scholars can manage the low-level data representation. Tika provides this kind of abstraction by means of suitable parsers, after recognizing the document media type. Hence, this library is able to extract textual content and metadata from the supplied files. Its architecture has influenced to a large extent the design of the Textual Scholarship Library (TSLib). In fact, Tika relies on a Facade component as a single access point to all of its components: 1) the parser framework, 2) the MIME detection mechanism, and 3) the language detection. Moreover, this tool achieves extensibility by adopting repositories, so that new parsers can be added to the system, as well as new MIME types and processing mechanisms.

36http://tika.apache.org/


Finally, both graphical and application interfaces allow users to interact with this library and enable its inclusion into other applications.
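A minimal use of the facade just mentioned might look as follows; the file name is a placeholder, and the entry point shown is the one exposed by recent Tika releases, although details may vary across versions.

import java.io.File;
import org.apache.tika.Tika;

// Detect the media type of a file and extract its plain-text content
// through the Tika facade, regardless of the underlying format.
public class TikaExtract {
    public static void main(String[] args) throws Exception {
        File source = new File("scanned-edition.pdf");   // placeholder path
        Tika tika = new Tika();
        System.out.println("media type: " + tika.detect(source));
        String text = tika.parseToString(source);
        System.out.println(text.substring(0, Math.min(200, text.length())));
    }
}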

Lucene

Lucene37 is an open-source library widely used by the community of DH developers for document indexing purposes. It is recognized as one of the most efficient and flexible solutions for full-text searching. Lucene is able to index textual resources regardless of their nature (e.g. databases, unstructured files, formatted files, etc.). The available services allow scholars to create catalogs, to build archives, and to conduct advanced search operations on digitized texts. This software library is composed of three basic elements:

• the “index”, represented by a structure referring to all the documents;

• the “document”, which is an internal structured representation of the textual source;

• the “field”, an element of the aforementioned document consisting of a name-content pair; the field manages the token stream to be indexed.

Users have the possibility to perform complex search operations through a complete boolean syntax, which includes wildcards, range searches, and fuzzy operations. In addition, Lucene offers efficient storage and ranking functionalities. The content of a text can be decomposed into fields and stored in a term-document matrix by means of the library APIs. Columns in the term-document matrix represent all the fields of a content instance, while rows represent all content instances in the index. The internal index structure is defined by the developer, who declares the fields in a scheme and identifies those to be indexed. In addition, the developer can set up a ranking function for each indexed field. The library provides services for various statistical measures, such as frequency or tf-idf.

37http://lucene.apache.org/

i i

The library has been ported to several other programming languages, including Ruby, Perl, and Python.
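The index/document/field structure described above translates into a few API calls; the sketch below indexes one document and queries its full-text field. Class names are taken from recent Lucene releases and may differ across versions.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

// Index a single "document" with two "fields" and search the full-text field.
public class LuceneSketch {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory index = new ByteBuffersDirectory();            // in-memory index

        try (IndexWriter writer = new IndexWriter(index, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new StringField("urn", "urn:cts:greekLit:tlg0012.tlg001", Field.Store.YES));
            doc.add(new TextField("text", "Sing, O goddess, the anger of Achilles", Field.Store.YES));
            writer.addDocument(doc);
        }

        try (DirectoryReader reader = DirectoryReader.open(index)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs hits = searcher.search(new QueryParser("text", analyzer).parse("anger"), 10);
            System.out.println("matching documents: " + hits.scoreDocs.length);
        }
    }
}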

2.5.2 Language processing

OpenNLP

The OpenNLP38 library encompasses several statistical NLP tools, including a sentence boundary detector, a tokenizer, a Part of Speech (POS) tagger, a phrase chunker, a sentence parser, a name finder and a coreference resolver. The tools are based on maximum entropy Machine Learning (ML) models (see section 2.6.5) and are written in Java. The OpenNLP tools can be used individually or as plugins within other Java frameworks. OpenNLP maintains a suite of tools for many common Natural Language Processing (NLP) tasks such as part-of-speech tagging, parsing, and named-entity recognition. Two of the most used components of this library are the Chunker and the Named Entity Recognition (NER) module, both of which use a maximum entropy model to recognize patterns. The basic approach in chunking is to exploit the POS tagger annotations in order to identify simple word sequences. The OpenNLP NER recognizes names of different types of entities, such as people, locations, organizations, dates, and others. In addition to the default types, for which models are available for the major languages, it is also possible to create models from scratch. This is a valuable functionality as regards historical and poorly served languages (see chapter 4). OpenNLP is designed so that its analysis tools can work in a pipeline, feeding the output of one tool into the next.
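Such a pipeline can be assembled in a few lines; in the sketch below the model file names are those conventionally distributed with the project’s pre-trained English models and should be treated as assumptions of the example.

import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

// Tokenize a sentence and tag it with part-of-speech labels,
// feeding the output of one OpenNLP tool into the next.
public class OpenNlpPipeline {
    public static void main(String[] args) throws Exception {
        try (InputStream tokModelIn = new FileInputStream("en-token.bin");      // assumed model files
             InputStream posModelIn = new FileInputStream("en-pos-maxent.bin")) {
            TokenizerME tokenizer = new TokenizerME(new TokenizerModel(tokModelIn));
            POSTaggerME tagger = new POSTaggerME(new POSModel(posModelIn));

            String[] tokens = tokenizer.tokenize("Sing, O goddess, the anger of Achilles.");
            String[] tags = tagger.tag(tokens);
            for (int i = 0; i < tokens.length; i++) {
                System.out.println(tokens[i] + "/" + tags[i]);
            }
        }
    }
}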

StanfordNLP

The Stanford NLP Group39 has also developed a set of statistical NLP tools, including a POS tagger, a syntactic parser and a named entity recognizer [116]. Like the OpenNLP library, StanfordNLP leverages the Java programming language and is mainly based on statistical algorithms. StanfordNLP and OpenNLP can be distinguished by their approaches: the OpenNLP tools are designed to be used as plugins in other frameworks, whereas the Stanford library is designed primarily to be used as an autonomous tool.

38https://opennlp.apache.org/
39http://nlp.stanford.edu/


where the accuracy of maximum entropy models can be improved. Two of the most used components of the StanfordNLP library are the POS tagger and the syntactic parser, although the NER is also widely used. The Stanford POS tagger produces output in a simple plain-text format using the Penn Treebank tagset [117]; it is based on the work carried out in [118]. The Stanford parser is a set of alternative parsing algorithms and statistical models, developed with the purpose of comparing and evaluating different techniques. Syntactic parsing also exploits rule-based grammars and grammar-based linguistic theories. Its goal is to capture the complexity of a sentence concerning its linguistic features (heads and dependents, subjects and complements, modifiers, agreement, etc.). The parsing task can be performed with unlexicalized probabilistic context-free grammars as well as with lexicalized grammars. Finally, the Stanford named entity recognizer is based on conditional random field models [119].
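As a minimal illustration, the Stanford POS tagger can be used standalone as follows; the model file path is an assumption and stands for one of the pre-trained English models distributed with the tagger.

```java
import edu.stanford.nlp.tagger.maxent.MaxentTagger;

public class StanfordTaggerSketch {
    public static void main(String[] args) throws Exception {
        // Load a pre-trained model (path is illustrative) and tag a sentence
        MaxentTagger tagger = new MaxentTagger("models/english-left3words-distsim.tagger");
        String tagged = tagger.tagString("The scribe copied the manuscript in Pisa.");
        System.out.println(tagged);   // e.g. The_DT scribe_NN copied_VBD ... (Penn Treebank tags)
    }
}
```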

LingPipe

LingPipe40 is one of the most used linguistic toolkits aimed at building natural language processing applications. Although the library is a commercial tool, it is freely licensed for research use. The LingPipe API is tailored to abstract over low-level implementation details, enabling components such as tokenizers, feature extractors, or classifiers to be used in a plug-and-play manner. The exported API includes functions to build classifiers, extractors, and other search applications.

2.5.3 Text architecture

GATE

General Architecture for Text Engineering (GATE)41 is a Java open-source and modular framework for text mining, named entity recognition, information extraction, parsing, ontology handling, semantic annotation, sentiment analysis, and evaluation tasks [120]. GATE provides cutting-edge functionality, which includes processing multiple languages and large collections of unstructured text. One of the aims of this toolkit is to provide services that support the user in organizing tasks related to annotation. Additionally, GATE provides ML systems and supports the integration of

40http://www.alias-i.com/
41https://gate.ac.uk/


various NLP tools as well as ontology tools. This library exposes effective APIs for extracting and processing sentences, entities, and tokens from text. The system architecture includes two basic sets of resources: 1) language resources and 2) processing resources. Firstly, a Language Resource can be a single text (Document) or a collection of texts (Corpus). GATE separates the structured document from its markup using standoff annotations. Secondly, a Processing Resource is a specific processing component, such as a tokenizer or a NER. As a result, a GATE application is a collection of processing resources organized into a suitable pipeline. Each application can be named and saved separately. Annotations can also be created with user-defined pattern-matching rules (with a regular-expression-like syntax), called JAPE rules. This allows rapid development of new components without the need for programming skills. The system also provides a wide range of name lists (gazetteers) for named entity lookup. In addition, it includes a built-in information extraction system called ANNIE. GATE is a general-purpose system, thus it is not specifically designed for any particular genre of text or language. Processing components, such as a NER or a POS tagger, can be added to the processing workflow thanks to the software plugin mechanism.
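A minimal sketch of GATE used as an embedded Java library is shown below; it only creates a Language Resource (a Document wrapped in a Corpus) and prints its default standoff annotation set. Running ANNIE or custom JAPE rules would additionally require loading the corresponding plugins and assembling a processing pipeline.

```java
import gate.Corpus;
import gate.Document;
import gate.Factory;
import gate.Gate;

public class GateSketch {
    public static void main(String[] args) throws Exception {
        Gate.init();  // initialise the GATE framework (may require the gate.home property to be set)
        // A Language Resource: a single Document wrapped in a Corpus
        Document doc = Factory.newDocument("Alcuin of York taught at the court of Charlemagne.");
        Corpus corpus = Factory.newCorpus("demo-corpus");
        corpus.add(doc);
        // Standoff annotations are kept apart from the document text itself
        System.out.println(doc.getAnnotations());
    }
}
```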

UIMA

The Unstructured Information Management Architecture (UIMA)42 is an open-source platform for creating, integrating and deploying solutions to manage unstructured textual information. It provides multi-modal analysis and search components. The Apache Software Foundation is in charge of developing the software in Java, whereas the Organization for the Advancement of Structured Information Standards (OASIS) is responsible for defining its standards and strategy. The UIMA architecture is designed to be modular and flexible. Each specific component implements interfaces defined by the framework and provides metadata by means of XML descriptor files. Standards, interoperability and scalability are the main priorities of the UIMA architecture. The overall annotation structure is called Common Analysis Structure (CAS) [121], whereas the text to be annotated is called the Subject of Analysis (SofA). XML Metadata Interchange (XMI) is the standard format adopted by UIMA for interoperability. Annotations in the CAS structure use standard namespace prefixes and a dedicated attribute of the SofA element; this attribute conveys the principal text of the analysis.

42https://uima.apache.org/


Components in the platform can be extended and customized in order to perform textual annotations. Another important part of the UIMA architecture is the type system, which keeps track of the content and type of annotations, again for the sake of interoperability. In fact, by checking the types of annotations, it is possible to verify that the output type of a component is appropriate as input for the next component. A sequence of annotators, put together to accomplish a complex task, is managed by creating an aggregate structure. Finally, annotation tools are available as plugins; they include a tokenizer, a part-of-speech tagger, a regular-expression annotator, and a dictionary annotator. As previously mentioned, these tools can be combined to build complex applications.
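The sketch below shows the typical embedding pattern, under the assumption of a hypothetical XML descriptor ("MyAnnotator.xml") declaring an annotator and its type system: the document text becomes the SofA of a CAS, and the analysis engine writes its annotations into that CAS.

```java
import org.apache.uima.UIMAFramework;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.ResourceSpecifier;
import org.apache.uima.util.XMLInputSource;

public class UimaSketch {
    public static void main(String[] args) throws Exception {
        // Parse the (hypothetical) component descriptor and produce the analysis engine
        XMLInputSource in = new XMLInputSource("MyAnnotator.xml");
        ResourceSpecifier specifier = UIMAFramework.getXMLParser().parseResourceSpecifier(in);
        AnalysisEngine engine = UIMAFramework.produceAnalysisEngine(specifier);

        // The CAS holds the Subject of Analysis (SofA) plus the annotations added by the engine
        JCas jcas = engine.newJCas();
        jcas.setDocumentText("Arma virumque cano, Troiae qui primus ab oris");
        engine.process(jcas);

        // Number of feature structures written into the annotation index by the annotator
        System.out.println(jcas.getAnnotationIndex().size());
    }
}
```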

2.6 Suitable information technologies

Technologies have informed research in the area of DH in general, and research in the field of textual scholarship in particular. They can be summarized as follows:

• Document, text and character encodings;

• Algorithms for string manipulation and for text alignment, and vector space models;

• Image processing;

• Machine Learning approaches for text processing;

• Methods and technologies developed in the Linked Open Data (LOD) and SW fields;

• Software engineering principles and processes.

2.6.1 Document, text and character encodings

Methods and technologies concerning the non-trivial and debated issues of encoding have been discussed in section 2.4.1. Cultural heritage documents need at least three levels of encoding: 1) character, 2) document, 3) text. Firstly, the character level deals with the electronic representation of characters, such as the codes and the binary digits which map the symbols of a script. Depending on the adopted encoding policies, characters have


different binary representations and different rendering glyphs. The Unicode standard (see section 2.4.1) is the leading specification for such an encoding. Secondly, the structure and the information which a document conveys need to be made explicit, formalized and marked up. This means that both physical and logical document data have to be encoded: on the one hand, a text is composed of chapters, pages, sentences, etc.; on the other hand, it conveys information about its material support, different variant readings, etc. In this scenario, XML in general, and the Text Encoding Initiative guidelines in particular (see section 2.4.1), provide a valuable solution for document and text encoding [122].
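At the character level, for instance, the same Greek word can be stored either with precomposed characters or with base letters plus combining diacritics. A minimal sketch of Unicode normalization with the standard Java API is shown below; the sample string is illustrative.

```java
import java.text.Normalizer;

public class NormalizationSketch {
    public static void main(String[] args) {
        String precomposed = "\u1F10\u03BD";   // "ἐν": precomposed epsilon-with-psili + nu
        String decomposed = Normalizer.normalize(precomposed, Normalizer.Form.NFD);
        String recomposed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);

        System.out.println(precomposed.length());            // 2 code units
        System.out.println(decomposed.length());             // 3: base letter, combining mark, nu
        System.out.println(precomposed.equals(recomposed));  // true after round-tripping
    }
}
```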

2.6.2 String manipulation, text alignment, and vector space model

A text can be represented as a sequence of characters which, in computer science, is called a “string” [123, 124]. Hence, functionalities such as comparing, aligning, sorting, matching, storing, retrieving, searching, sub-stringing, linking, finding patterns, and computing frequencies are all activities concerning algorithms operating on strings. Computer science, computational linguistics, computational biology (bioinformatics) and textual scholarship largely benefit from string manipulation techniques [125, 126, 98]. Textual scholarship, in particular, makes use of string algorithms for processing its primary sources, as it deals with unstructured data in the form of text. One of the most typical needs in textual scholarship is to find strings inside other strings, namely finding all positions of a pattern in a text (substring matching). Efficient algorithms available to perform this procedure [127, 128, 129] include pattern preprocessing, as in the Boyer-Moore algorithm; finite automata methods; and approaches like the Knuth-Morris-Pratt algorithm. Matching algorithms often use suffix trees as their main data structure, since they provide an efficient representation of all the suffixes of a text. In addition, suffix trees are used in the longest common substring problem and in the identification of all the overlaps which occur in a given set of strings [88]. The so-called suffix arrays, introduced by Manber and Myers [130], represent all the suffixes of a given string in lexicographical order; they can be constructed in linear time. Matching problems constitute only one of the computational issues that may arise from scholarly text processing. Another challenging aspect is, for instance,


the traceability of the mutation events that make an original text evolve into different versions (the multiple sequence alignment problem [131, 132]). As pointed out by Desmond Schmidt and Peter Robinson in various papers [98, 87, 86, 61, 6], the comparison among multiple sequences to figure out which parts are related and how the sequences evolved from one another is remarkably similar to the text-critical problem of collation (i.e. the multiple sequence alignment problem). Indeed, the bio-genetic problem is really analogous to the comparison of the witnesses of a work to bring to light their mutual relationships [133].

One of the most effective algorithms adopted for the alignment of sequences is the Needleman-Wunsch algorithm [81]. It is a dynamic programming solution for global alignment that has been used to align similar sequences of characters at different granularities. In this sense, this algorithm is suitable for character-by-character alignment as well as for word-by-word alignment. Two other alignment types are worth mentioning: 1) local alignment, such as the Smith-Waterman algorithm [134], which tries to identify similar blocks in strings that could be widely different; and 2) progressive alignment [135], which is used to construct phylogenetic trees as it can record the stemmatic relationships among sequences. The alignment algorithms may be used for collation purposes: comparing a number of witnesses of the same text, discovering their relationships and, finally, yielding the differences as a critical apparatus [75, 136, 44, 126, 5]. Thus, the outcomes of the collation process can be used to construct a stemma codicum or to visualize a phylogenetic tree [133, 137, 138]. The representation of multiple versions of the same text is an active area of research. The use of dynamic programming and of data structures such as directed graphs, suffix trees, n-grams and tries is still under investigation by major research groups [80].
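A minimal sketch of the Needleman-Wunsch scoring scheme is given below; it computes only the global alignment score of two character sequences (a traceback step would be needed to recover the alignment itself), with illustrative match/mismatch/gap weights.

```java
public class GlobalAlignmentSketch {

    /** Needleman-Wunsch dynamic programming: score of the best global alignment of a and b. */
    static int score(String a, String b, int match, int mismatch, int gap) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) dp[i][0] = dp[i - 1][0] + gap;  // leading gaps in b
        for (int j = 1; j <= b.length(); j++) dp[0][j] = dp[0][j - 1] + gap;  // leading gaps in a
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int diag = dp[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? match : mismatch);
                dp[i][j] = Math.max(diag, Math.max(dp[i - 1][j] + gap, dp[i][j - 1] + gap));
            }
        }
        return dp[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // Two variant spellings of the same word, aligned character by character
        System.out.println(score("uirumque", "virumque", 1, -1, -1));
    }
}
```

The same dynamic programming table works word by word if the two inputs are tokenized into word sequences instead of character sequences.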

Character sequence alignment is often based on the notion of fuzzy matching [139]. This means that similarity measures between textual chunks are a fundamental tool in the textual scholarship environment. The Vector Space Model (VSM) [140] is one of the most effective means for studying similarity among texts [141]. The model can be applied following a term-document or a word-context approach, thus resulting in different classes of applications. Based on what has been discussed, it is possible to consider the VSM as an algebraic model that maps the terms of a document into an n-dimensional linear space. The VSM has several applications, such as search/indexing or some aspects of natural language semantics [142, 143] (along with the distributional hypothesis, which states that words occurring in


similar contexts tend to have similar meanings). Another famous application which makes use of the VSM is text clustering. This is an unsupervised technique able to group texts according to their similarity: similar texts remain in the same group (cluster), while dissimilar texts are placed in different groups [144]. In conclusion, some popular geometric measures of vector distance and similarity coefficients, useful in designing the textual scholarship library, are listed below [145, 146].

• Edit distance (Levenshtein distance),

• Euclidean distance,

• Manhattan distance,

• Hellinger distance,

• Bhattacharyya distance,

• Kullback-Leibler divergence,

• Dice coefficient,

• Jaccard coefficient,

• Jensen-Shannon divergence.

Determining the most appropriate similarity measure depends on the specific task. Relevant factors range from the sparsity of the statistics to the frequency distribution of the elements, as well as the smoothing method applied to the matrix. For further details on these measures, please refer to the extensive literature.
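As a minimal sketch, two of the measures listed above, the Levenshtein (edit) distance and the Jaccard coefficient, can be implemented as follows; the sample inputs are illustrative.

```java
import java.util.HashSet;
import java.util.Set;

public class SimilaritySketch {

    /** Levenshtein distance: minimum number of insertions, deletions and substitutions. */
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        return d[a.length()][b.length()];
    }

    /** Jaccard coefficient: |A ∩ B| / |A ∪ B| over two token sets. */
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 1.0 : (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("lectio", "lectione"));               // 2
        System.out.println(jaccard(Set.of("arma", "virumque", "cano"),
                                   Set.of("arma", "uirumque", "cano")));     // 0.5
    }
}
```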

2.6.3 Image processing

Image processing heralds significant applications in DH studies, due to the primary role facsimile reproductions and image enhancement play in the production of modern electronic editions. Indeed, the digital images of documents, mainly concerning valuable manuscripts (see chapter 3), give scholars a means to gather primary information conveyed by the sources. Actually, the transcription of a document is just a part of the textual scholarship process. Consequently, tools for image processing, for instance related to brightness adjustment, magnification, deskewing, dewarping, despeckling, layout analysis, segmentation, content detection,


etc., are effective in clarifying difficult readings. Thus, image processing makes it possible to enhance the legibility of texts even on strongly damaged documents [91, 147, 148]. Early initiatives, such as the Beowulf project [149, 150], as well as more recent ones, such as the Cultural Heritage Imaging organization (CHI) [151], have achieved valuable outcomes in image enhancement. One of them, for instance, involves the ability of scholars to detect how many copyists wrote a manuscript, thanks to pattern-recognition techniques [152]. In this manner, digital images help scholars to improve the readability of the textual passages and to reconstruct controversial readings by showing clearer words. Image processing for documents encompasses mathematical functions which implement suitable convolution filters. However, this topic has been given limited attention in the context of this study (see chapter 4). Further information with regard to image processing can be found in the supplied literature.
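A minimal sketch of such a convolution filter, using only the standard Java imaging API, is given below; the kernel is a generic sharpening filter and the file names are placeholders.

```java
import java.awt.image.BufferedImage;
import java.awt.image.ConvolveOp;
import java.awt.image.Kernel;
import java.io.File;
import javax.imageio.ImageIO;

public class SharpenSketch {
    public static void main(String[] args) throws Exception {
        BufferedImage source = ImageIO.read(new File("folio.png"));   // placeholder path
        float[] sharpen = {
                 0f, -1f,  0f,
                -1f,  5f, -1f,
                 0f, -1f,  0f
        };
        // Convolve each pixel with the 3x3 kernel; edge pixels are copied unchanged
        ConvolveOp op = new ConvolveOp(new Kernel(3, 3, sharpen), ConvolveOp.EDGE_NO_OP, null);
        BufferedImage enhanced = op.filter(source, null);
        ImageIO.write(enhanced, "png", new File("folio-sharpened.png"));
    }
}
```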

2.6.4 Linked Open Data methods and technologies

Cultural heritage documents are objects carrying meaningful concepts. These concepts can be used for searching and browsing, as well as to foster further inquiries in the field of DH. As a consequence, LOD principles are largely adopted within textual scholarship applications. Linked data, especially when open and semantically characterized, have gained importance for literary computing. Indeed, recent technologies offer a new level of flexibility, interoperability, and relationships for digital repositories [153]. Generally, relational database systems prove too rigid for integrating data sets in the field of the humanities, which are often heterogeneous and redundant. Consequently, collections of documents with annotations, comments and metadata have to leverage data formats that can easily be integrated into other frameworks. For instance, RDF is the standard technology for implementing shared and linked data on top of the open web platform. By doing this, it is possible to better integrate heterogeneous collections, cross-linguistic repositories, and multi-cultural archives [45]. In addition, LOD methods enable shared vocabularies and annotation mechanisms, overcoming interoperability issues. The starting step concerns the publishing of metadata as linked data. This consists of encoding information by using the RDF format together with resource URIs. Further, linked data sets need a triple store in order to be freely exposed. LOD applications come together with two other elements with regard to initiatives in the DH: a) ontology definition and b) NER / Term Extraction (TE).


The first one is in charge of formally structuring the knowledge present in a specific domain of interest. The second one is in charge of automatically extracting and recognizing, on the one hand, named entities, such as places, dates, people, organizations, etc., and, on the other hand, the terminology of the domain. NER is an NLP task which provides textual scholars with useful applications. For example, a recent Italian research project about the First World War [154] well illustrates such technology. In addition, by combining NER with SW methodologies, DH researchers find great advantages and enhancements in this field [58, 155]. Basically, entities within the LOD framework have to follow four principles [156]:

• Using URIs as names for things;

• Using HTTP URIs for the sake of accessibility;

• Using standard technologies such as RDF or SPARQL for encoding information;

• Providing links to other URIs, so that further resources can be discovered.

Several initiatives provide entity repositories and APIs available within the LOD cloud, which export semantically characterized descriptions. Among others, it is worth mentioning DBpedia and VIAF. These APIs take a text fragment as input and then link the extracted entities to the LOD cloud [157]. For instance, the DBpedia Spotlight tool is able to annotate mentions of DBpedia entries and to link unstructured information sources to the LOD cloud by means of DBpedia URIs. The LOD framework is actually an open research topic in the context of cultural textual studies. Indeed, there are several conceptual discussions concerning it. For example, the question whether a URI identifies an entity or a document that talks about that entity is still debated. This discussion is critical for DH applications, as briefly noted for the CTS notation in section 2.4.2. Actually, different descriptions of the same reality can lead to ambiguous outcomes, though mechanisms for entity refining and reconciliation have already been developed. For instance, it is possible to handle cross-referencing URIs with OWL properties such as owl:sameAs. Despite justified skepticism, LOD technology gives actual benefits to the textual scholarship framework in the SW perspective. In fact, it provides semantically enriched digital objects with well-known vocabularies and thesauri [72].
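A minimal sketch of this reconciliation mechanism, built with Apache Jena, is shown below; the local entity URI is hypothetical and the owl:sameAs target is given only as an example of an LOD cloud identifier.

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.OWL;
import org.apache.jena.vocabulary.RDFS;

public class LinkedDataSketch {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        // A local entity reconciled with an external LOD identifier via owl:sameAs
        Resource author = model.createResource("http://example.org/entity/nietzsche")
                .addProperty(RDFS.label, "Friedrich Nietzsche")
                .addProperty(OWL.sameAs,
                        model.createResource("http://dbpedia.org/resource/Friedrich_Nietzsche"));
        model.write(System.out, "TURTLE");   // serialize the triples as Turtle
    }
}
```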


2.6.5 Machine learning approaches

ML techniques are widely used and studied in computer science in order to solve classification tasks. Several areas of study close to literary computing, such as computational linguistics, text mining and document processing, take advantage of these kinds of statistical algorithms. In this regard, in the last decade, experiments and research adopting ML algorithms have spread rapidly across applications for textual scholarship. The applications range from Optical Character Recognition (OCR) (see section 4.1.1), to lemmatization (see section 4.4) [158, 159], to indexing [160, 161, 162]. As illustrated in section 2.5, several libraries provide APIs offering efficient implementations of learning strategies. ML techniques are grouped into two main categories: 1. supervised approaches, and 2. unsupervised approaches. The unsupervised ML approaches deal with unlabeled data, as in clustering applications. These kinds of techniques have been briefly introduced in section 2.6.2. The supervised ML systems encompass, in turn, two components: 1. the learner, and 2. the classifier. The learner builds the classifier (i.e. the statistical model) from a training set of examples, which establishes both the input of the classifier and the corresponding output. The resulting classifier is then used to assign classes to the input instances, where the values of the features are known but the value of the class is unknown. For example, the sentence detection task generally classifies punctuation into two different classes: (a) sentence boundaries, (b) other. Hence, appropriate training sets strive to contain a sufficient quantity of features which can model the domain data, in order to give actual discriminative properties [163]. Therefore, the learning step generally involves a systematic and manual annotation which aims to produce a training set as close to the properties of the input raw data as possible. Many different methods have been developed as classification algorithms. These include Naive Bayes, Bayesian networks, Decision Trees, K Nearest Neighbour, Maximum Entropy, Support Vector Machines (SVMs), Hidden Markov Models (HMMs), and Artificial Neural Networks (ANNs), just to cite the most famous ones43. These allow several applications in literary computing to be modeled as classification problems. For instance, ANNs are the most commonly used models to develop Optical Character Recognition engines [91, 152], while Natural Language Processing tasks exploit nearly all the ML techniques.

43For further details about Machine Learning techniques suitable for textual scholarship issues, please refer to the extensive literature.


These kinds of techniques present several advantages, among which is the possibility of recognizing unknown words. However, they also present disadvantages, as in the case of false positives arising when words are too similar to those in the supplied training set. Among the methods introduced above, HMMs have been investigated (see chapter 4) in order to develop a POS tagger for the Latin lemmatizer [164] by using the HunPos [165] tool. In this case, the learning phase has exploited the Latin treebanks [166], which provide manually annotated linguistic data for syntax, morphology and lemmatization. Essentially, HMMs are probabilistic and discrete dynamic models which derive from n-order Markov chains [167]. They include two layers of states: (1) a hidden layer and (2) a visible layer. The stochastic process concerns a sequential transition between states evaluated by a set of conditional probabilities; this sequential shifting is described by the so-called “transition probabilities”. The innovation of the HMM resides in the random output governed by the “emission probabilities” (observable states). In this way the sequence of outputs exhibits information about the sequence of unobservable states. Furthermore, the Viterbi algorithm [168] is used to approximate the most likely state sequence (likelihood) that produces the observed input data sequences (a minimal sketch of this decoding step is given after the list below). In conclusion, textual scholarship uses machine learning techniques in order to discover new information by detecting patterns and trends. This allows textual scholarship to benefit also from research in text mining, which aims to generate new information holding predictive value [169]. Among the applications for text analysis and text mining that textual scholarship tools can use, there are:

• term disambiguation;

• text classification and clustering;

• event extraction;

• relationship extraction;

• term extraction;

• term normalization;

• text reuse.
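As announced above, a minimal Viterbi sketch follows; it works on a toy HMM given as log-probability tables (start, transition, emission) and is not the HunPos implementation, only an illustration of the decoding step.

```java
public class ViterbiSketch {

    /**
     * Most likely hidden-state sequence for an observation sequence.
     * start[s], trans[p][s] and emit[s][o] are log-probabilities.
     */
    static int[] viterbi(int[] obs, double[] start, double[][] trans, double[][] emit) {
        int n = obs.length, states = start.length;
        double[][] v = new double[n][states];   // best log-probability ending in each state
        int[][] back = new int[n][states];      // backpointers for the traceback
        for (int s = 0; s < states; s++) v[0][s] = start[s] + emit[s][obs[0]];
        for (int t = 1; t < n; t++) {
            for (int s = 0; s < states; s++) {
                int best = 0;
                for (int p = 1; p < states; p++)
                    if (v[t - 1][p] + trans[p][s] > v[t - 1][best] + trans[best][s]) best = p;
                back[t][s] = best;
                v[t][s] = v[t - 1][best] + trans[best][s] + emit[s][obs[t]];
            }
        }
        int[] path = new int[n];
        for (int s = 1; s < states; s++) if (v[n - 1][s] > v[n - 1][path[n - 1]]) path[n - 1] = s;
        for (int t = n - 1; t > 0; t--) path[t - 1] = back[t][path[t]];
        return path;
    }

    public static void main(String[] args) {
        // Toy model: 2 hidden states (e.g. NOUN, VERB) and 2 observation symbols
        double[] start = {Math.log(0.6), Math.log(0.4)};
        double[][] trans = {{Math.log(0.3), Math.log(0.7)}, {Math.log(0.8), Math.log(0.2)}};
        double[][] emit = {{Math.log(0.9), Math.log(0.1)}, {Math.log(0.2), Math.log(0.8)}};
        for (int state : viterbi(new int[]{0, 1, 0}, start, trans, emit)) System.out.print(state + " ");
    }
}
```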


2.6.6 Software engineering principles and processes

Software engineering is a discipline of engineering science that studies the nature of software and the approaches and methodologies of large-scale software development [170]. Moreover, it addresses the theories and laws behind software behavior and software engineering practices [171]. The design of the textual scholarship library is based on practices and principles of software engineering. Computer science and philology, on the one hand, provide textual scholarship with its theoretical foundations; software engineering, on the other hand, focuses on analyzing, designing and developing the software for the TSLib in a controlled and scientific way [172]. Chapter 3 discusses the choices made and the results obtained, taking into account software principles, design techniques and processes.


Chapter 3

Methods

This chapter illustrates the design process and the technological methodologies that underpin this research.

3.1 Introduction

The object-oriented paradigm has been the mainstream approach in software development for the last decade, but its use among the community of textual scholarship has been rather modest [41]. Indeed, few initiatives have been devoted to the requirement analysis of scholarly work as well as to the creation of domain models [173]. As a matter of fact, the object-oriented paradigm facilitates the process of conceptual modeling in two ways: (a) obtaining a better understanding of the specific problems, and (b) keeping the design independent from implementation technologies. Furthermore, conceptual modeling makes use of abstraction, and the object-oriented paradigm provides direct support for promoting modularity and component-based approaches. Against this scenario, this research presents a number of software engineering techniques and principles that enable the development of textual scholarship tools. The long-term objective is to implement the Textual Scholarship Library (TSLib) by means of well-designed components, open and distributed throughout the community. This significantly improves the reusability of software applications and, consequently, the general outcomes of research on textual scholarship [174, 175]. In other words, TSLib components constitute the


fundamental building blocks of the general architecture. Unfortunately, the definition of component seems not to be universally accepted. In the context of this work, a component meets the Object Management Group definition:

A component is a modular, deployable, and replaceable part of a system that encapsulates implementation and exposes a set of interfaces [176, 177].

Figure 3.1: Component services example

Each TSLib component is a piece of software devoted to a clear functionality. In turn, each component can be assembled with other pieces of code to provide a more complex functionality (Fig. 3.1). Hence, components share the fundamental principles of the object-oriented paradigm. Encapsulation, modularity, and unique identities are all basic object-oriented principles that are also common to TSLib components. Nonetheless, components in the library are deeply characterized by the possibility of being reused in a number of different contexts in a flexible1 way [178]. The above mentioned definitions let several important properties [179] emerge for the design of the textual scholarship library. Noticeably, the first property is composability.

1Two measures characterize flexibility: coupling and cohesion. The first measures the strength of interconnection among software components, that is, the degree to which each component depends on the others. The second measures the functional congruity of a single software component. Software should tend to be loosely coupled with high cohesion. This allows components to be used, understood, and maintained independently from each other.


Indeed, TSLib is constituted by several components which, in turn, can be recursively divided into sub-components (see section 3.3). The second property is reusability. It concerns how components are developed by a component provider. Reusability defines how existing components may be integrated into new applications. Thirdly, the modularity and encapsulation properties refer to the component as a collection of services that are related by means of common behavior and data. Finally, the interaction property entails information hiding and access mechanisms to the internal procedures of the modules. In this context, interfaces are the contract for activating the exposed services and for receiving the correct type of outcomes. Designing software for textual scholarship, which strives to be based on interrelated components, requires a suitable method of modeling as well as a development process [180]. Nowadays, Agile approaches [181, 182] seem to be more effective than the traditional ones (such as waterfall, V-model, Spiral, etc.). Therefore, communities and developers of Digital Humanities (DH) applications substantially choose Extreme Programming [172] to prototype their algorithms and to release their applications. The method adopted for designing the TSLib, by contrast, has relied on Agile principles, but also on more classical engineering techniques. This strategy provides a better framework for designing an abstract model and, although simplified and partial, for obtaining a clear view of the target domain. Theories and processes borrowed from computer science, e.g. Natural Language Processing (NLP) and bio-informatics, have been successfully employed in computational philology (a few examples are morphological, syntactic and semantic tools, parsing algorithms, and alignment strategies). On the contrary, shared models and conceptualizations, based on a formalization of the philological domain, have not been sufficiently developed [37]. These matters have required a compromise between a top-down and a bottom-up design approach [183].

• The top-down approach implies that the main use-cases with related user scenarios are drawn up; afterwards, it implies that the high-level object behavior is formalized. The designed library appears as a single component at the first step of analysis. Moreover, its architecture defines a high-level organization as a collection of interacting components where their APIs define the expected behavior (deductive process, divide and conquer).

• The bottom-up approach, instead, implies that requirements for specific


projects and pre-existing pieces of software lead the design process. In these circumstances, existing software is generalized and refactored iteratively. This means that the whole system is put together from modules that have been developed for similar contexts, through an activity of composition (inductive process, composition).

TSLib design exploits the graphical notation provided by the Unified Modeling Language (UML)2. Thanks to the UML, software engineers can graphically develop software modules that both humanists and computer scientists can understand. Thus, the software process of the textual scholarship library encompasses all the development phases, i.e. requirements, design, and implementation. The library design process involves a number of key concepts, which can be summarized as follows:

• the open-source as a collaborative and community-driven strategy;

• modular and component-based architecture for the sake of extensibility and reusability;

• the use-case driven approach for the definition of the scholar needs;

• the object-oriented UML modeling of the domain entities for maintaining technology independence;

• suitable design patterns for solving recurring problems;

• the Application Programming Interfaces as a means of (a) communication, (b) functional entry points, and (c) implementation hiding;

• effective and efficient algorithms for document, text and language processing;

• standard schemes and shared resources for data-integration, data-exchange, data-interchange, and long-term preservation;

• Java technology along with the JavaServer Faces Technology (JSF) framework as the main technologies for implementing the library prototype3.

2The UML is a language under the umbrella of the OMG, which is one of the largest consortia of partners from industry and research organizations in computer engineering fields. Since it is a graphical notation and it is not related to a specific method of development, the UML can be used for specifying and visualizing any software artifact throughout the entire software development process.
3The library could also be implemented adopting the Python programming language.


As previously mentioned, the design of the TSLib follows a use-case driven process. Therefore, the Iconix object model [184] and the Agile approaches [185] have been customized and adopted for the aim of this work. The design of TSLib addresses two relevant principles:

• The TSLib must respond to the actual scholar requirements.

• The Agile environment supports interdisciplinary teams in terms of communication and, consequently, in terms of implementation efficiency.

The outlined process fosters verification and validation throughout the development (testing and refactoring). In addition, this approach provides all the necessary software documents that characterize the TSLib design and evolution. For instance, the UML notation encourages collaborative activities pertaining to the analysis and to the design of the artifacts [174]. In this way, diagrams and documentation foster community interaction, producing improvements and gathering new requirements. The structure and behavior of the library is kept as abstract as possible, so that it can be better refined, extended and implemented in further work. Indeed, platform-independent models allow the TSLib to leverage the most appropriate implementation technology available and enable the system to keep up to date with technology evolution [186, 187]. The Agile-ICONIX method applies only a limited number of concepts, as it strictly follows the principle of separation of concerns throughout an entire project. Additionally, it promotes component refactoring and reuse, and follows typical object technology principles. These are the principles of (1) modularity, (2) encapsulation, (3) information hiding, (4) unified functions and data, (5) unique identities, and (6) incremental development. Once started, the software process is responsible for ensuring that the behavior of TSLib is coherent, and that the components provide the desired functionality concerning architectural design and specification. Therefore, the development of the textual scholarship library begins with the process of gathering information from the target domain. This phase aims at understanding the scholarly context and the functionality that the TSLib has to export, along with the input and interaction mechanisms from and to the artifact. In addition, the resources and the constraints of the target domain state the operations to perform for pre-processing and post-processing activities. The UML use-case diagrams suit this initial survey. Indeed, the use-case diagram outlines


important information that represents the scholar's perspective during the user-system interaction [188, 189]. Fig. 3.2 illustrates an example of how to draw up a use-case diagram. As shown, it specifies high-level interactions between the system and the users (actors). The actor browses a work, compares the text with the related scanned image, and studies relevant comments and secondary sources. Besides use-case diagrams, computer engineers and scholars can communicate better if they draw mock-up pictures. Actually, mock-ups seem to be friendlier for users with little technical skill. Fig. 3.3 shows the mock-up image derived from the previous use-case examples. Finally, appropriate objects and their constitutive relations (mainly the “depends”, “has a”, and “is a” relations) model important aspects of the scholarly domain. For instance, Fig. 3.4 illustrates a segment of the design related to comments and their properties (versioning, text selection, etc.).

Figure 3.2: Example of UML Use case diagram within the TSLib


Figure 3.3: Mockup example within the TSLib analysis process

Figure 3.4: Example of UML conceptual class diagram within the TSLib


3.2 Requirements and use cases

The library has the aim of providing all the necessary functionalities for textual scholars, while safeguarding simplicity [190, 191]. As claimed in [51], domain analysis is not a widespread practice within humanities computing. Conversely, in order to understand the services that a textual scholarship library has to provide, it is necessary to perform requirements gathering and use-case definition. As a main principle, the model should offer a coherent solution to real problems. Therefore, it should be adequately formulated considering the actual needs of the domain. Accordingly, the analysis of the domain that the library aims to model is the starting point of the TSLib design. This emerges from scholars' requirements and is illustrated through a collection of use-case diagrams. The definition of the architecture of a complex system for textual scholarship should firstly take into consideration the specific features of a textual object, be it ancient, modern or contemporary. Contrary to what is commonly maintained, a text is a multilevel, multi-perspective and dynamic entity [136, 44, 50, 37, 42]. Even when appearing in a paper form, for instance, it cannot be considered the final outcome of a stable and finished process. In particular, as far as ancient works now readable in printed editions are concerned, the following factors must be taken into consideration:

• variety of copies created at different times, related to the same text (for example, medieval copies executed on handwritten documents);

• presence of variants and/or errors that these copies record, whose causes are often attributed to copyists (ignorance, distraction, randomness, etc.);

• chemical-physical features of the material on which the text is written (paper, parchment, cloth, stone, ceramic, etc.);

• characteristics of the language in which the text was written;

• presence of illustrations and/or pictures that enrich the information present in the text, as in the case of technical or scientific works;

• presence of translations necessary for the comprehension and interpretation of the text;


• possibility of recording annotations, commentaries, or parallel passages needed to correlate the content of a text with similar content in contemporary or previous texts;

• possibility of entering bibliographical information related to the text or to any of its parts.

Although these features pertain especially to ancient works, they also apply to more recent, if not contemporary, works. The latter case is studied specifically by so-called “genetic criticism”, which analyzes the formation process that a text, often of cultural value, undergoes before being sent to the press. Every preliminary version (called “avantesto” by the scholars of this field of literary studies) shows the exceptional mobility and instability of works which have become classics. A demonstration of this phenomenon is present in the archives that keep manuscripts of modern and contemporary authors: many, if not all, of these present numerous authorial interventions. The corrections, the additions, and the erasures are important elements of study for analyzing the stylistic, psychological, and sociological aspects that have influenced the poetics of the author and therefore of his works. These aspects of mobility and variability, which can be called “textual drift” [138], are also attributable to digital texts (following the example taken from molecular biology, which has defined “genetic drift” as the “random fluctuations in the numbers of gene variants in a population” [133]). Digital texts, too, vary, evolve and change in time thanks to the massive input of the social component of the web. As an example of what has been said above, it is appropriate to show a passage taken from F. Nietzsche's notebook containing some personal notes (Fig. 3.5). The features listed above have a direct impact on the system architecture design and on the components to be developed for a model suitable for textual scholarship. It is evident that a computational environment, possibly collaborative, capable of assisting a philologist in studying a document of this type must possess at least:

• a browsing module of the digital document;

• a module for transcribing the text contained in the original or facsimile source;

• a module for visualizing notes according to the sequence of the author's interventions, and the cuts that he has performed on this part of his notebook;


• a module for the linguistic analysis of the text (morphological analysis, lemmatization) and the creation of an alphabetical index of the single occurrences and/or lemmas;

• a module for the bibliographical information, for instance the information concerning the studies made on Nietzsche’s text;

• a module that permits the entry of annotations by the scholars that want to intervene on this text;

• a module for the reproduction of the modern edition of the text in printed format, reporting possible notes, translations and indexes.

Figure 3.5: Modern manuscript example

This list does not claim to be complete, but it is useful for understanding the diverse range of challenges that need to be faced when developing a scientific application for textual scholarship. To date, many research centers have dealt with these issues separately, not considering that the results obtained, although valid, were actually partial. This partial outcome has probably restrained technological innovation in the field of philological studies. Little application of innovative methods and tools is registered among experts in the field of the humanities, mainly because the computational methods do not respond exhaustively to the requirements of their studies. As highlighted in chapters 1 and 2 of this study, only limited projects for the production of single results (a system for the production of indexes, one for the display of digital images as well as transcriptions and mark-up systems, etc.) have been carried out so far.


On the contrary, this study adopts a holistic point of view on the activities accomplished in the scientific (philological) study of a text. Despite the difficulty in reaching comprehensiveness, the TSLib can be extended incrementally according to individual user needs, to the type of philological discipline (epigraphy, papyrology, philology of printed texts, etc.) and to the technological components it employs. The analysis phase tries to clarify the functionalities that the library has to expose. Frequently, in the field of digital scholarship, users ask for web applications that simplify document access and offer a collaborative work environment. This scholarly environment helps to compare sources, create relations among objects, add notes and comments, edit critical apparatuses and share textual materials with the community. The requirements and the needs of the users of the TSLib derive from the user stories [192] collected within the Open Philology Project (see chapter 2). Tab. 3.1 shows a fragment of this notable work. Afterwards, use-cases have been described by means of proper UML diagrams. As a result, use-case modeling has identified the core functionalities, relevant actors and roles with regard to the fundamental aspects of the tool. Fig. 3.6 illustrates the principal use-case diagram of the library. It shows four types of actors and the ways in which they interact with the TSLib. Two actors, the general user and the domain expert, exploit the capabilities of the components, either generally or specifically. On their part, the other users, namely the developer and the domain developer, can configure and extend the library. In particular, the general user visualizes and interacts with data, and also performs basic and advanced searches for scholarship inquiries. Correspondingly, the domain expert specializes the general user (i.e. he has the same functionalities as the general user) and can also manage the primary sources. For instance, he is entitled to upload the digitization of the text or perform data acquisition, and to process the resources as regards linguistic annotation, proofreading, etc. Furthermore, the domain expert can establish both internal and external relations among entities - such as connecting named entities with the Linked Open Data (LOD) cloud (see chapters 2 and 4) - and perform data Create, Read, Update, Delete (CRUD) operations. Finally, the expert user handles the information and metadata regarding the document facsimiles and forms of the document such as stones, rolls, papyri, etc. Fig. 3.6 also shows that the developer specializes the general user. He configures the


1 Manage Content
1.1 As a content manager, I want to add a primary source text to the repository so that it is accessible by users for reading, curation, annotation and research.
1.2 As a content manager, I want to add a derivative work to a repository so that it is accessible by users for reading, curation, annotation and research.
1.3 ...

2 Research, Learn, Produce
2.1 As a user I want to view a text and related named entity annotations to explore real world people, places and objects associated with the text.
2.2 ...

3 Search
3.1 As a user I would like to find scholia on a work to aid my understanding of the work.
3.2 As a user I would like to find references to a place entity within or across works to aid in my study of that place.
3.3 ...

4 Consume
4.1 As a user I would like to read a primary source text.
4.2 As a user I would like to read a translation of a primary source text.
4.3 ...

5 Curate
5.1 As a user I would like to create a new transcription of a primary source text.
5.2 ...

6 Analyze
6.1 As a user I would like to get a list of all vocabulary used in a text or set of texts and sort it by various criteria (frequency, sequence of occurrence, n-grams, etc.).
6.2 ...

......

Table 3.1: Example of user-stories extracted from Open Philology analysis work


library by editing the property files, by choosing the engines and by adding extensions. In conclusion, the domain expert developer specializes both the domain expert and the developer user. Therefore, he is in charge of developing extensions to the TSLib, offering his contribution as a service provider.

Figure 3.6: TSLib core Use Cases


Consequently, fundamental services emerge from the requirement gathering phase. They can be summarized as follows:

• Providing a textual scholarship workflow which starts with raw textual material. This can be a scan of a book or a picture of a manuscript. The workflow involves a collection of services available by means of Application Programming Interfaces (APIs) (local services by importing components, or remote ones by calling Web services) [193]. The functionalities range from optical character recognition to text encoding, from named entity recognition to linguistic annotation;

• Providing tools for documenting (a) the textual choices adopted by the scholars, (b) the textual alternatives, such as variant readings and collections of authoritative conjectures, (c) the motivation for the textual choices, (d) the list of references on which the scholar's work is based, (e) the bibliography of editions and secondary literature, and (f) the commentaries of the works under investigation;

• Providing a publication system that supports the reading of the text in a multi-version manner [87, 6, 44];

• Providing facilities for scholars' contributions. On the one hand, digital scholars can extend or reuse the software components; on the other hand, they can add or review textual data [175];

• Providing graphical interfaces for (1) web-based applications, (2) mobile devices, and (3) desktop environments.

3.3 System architecture

In the digital era, the transmission and scholarly study of cultural heritage resources require technologies able to respond to literary and philological issues. The research activities in which these kinds of systems need to be developed can be grouped into three blocks, which derive directly from the analysis phase (Fig. 3.7):

Acquisition and digitization of resources. This module provides functionalities such as Optical Character Recognition (OCR) of printed editions and


Figure 3.7: Textual Scholarship environment within a component schema

tools for document transformation from semi-structured resources to structured ones. This at least must take into account the systematic import of manuscript images and the XML-TEI encoding of non-standard and proprietary formats (e.g. RTF, DOC, etc.) [194].

Content analysis and processing. This module is in charge of textual processing and document analysis, as well as term indexing, information retrieval, information extraction and knowledge representation using


ontologies. This corresponds to the multilayered indexing of the texts (linguistic, lexical, semantic and philological), combined with data processing and with the study of intertextuality, in order to link different editions and entities to internal and external datasets.

Editing. This component deals with the design of the Graphical User Interface (GUI). It handles resources in a distributed and collaborative environment (mostly Web-based). This allows scholars to collaborate in editing new editions or in proofreading automatic textual analyses. This means that the environment allows scholars to explore all the indexed texts and their different levels of analysis. Furthermore, it should provide access to a centralized repository containing: i) the translated editions and critical studies, ii) all the entities of interest belonging to different semantic classes cited inside the texts (people, places, events, etc.), and iii) comparative multilingual dictionaries. This module should also consider versioning issues.

Figure 3.8: Layers view of the TSLib

In other words, the library for textual scholarship is


designed to be modular and extensible, and it is organized into different components upon different architectural levels (Fig. 3.8). UML diagrams describe the component model and architecture. Each component emphasizes abstract interfaces, which are both provided and required. In such a way, the provided interface defines the services that a component of the TSLib exports to its clients. Actually, interfaces may be seen as the entry points for controlling the component. Conversely, the required interfaces are functionalities that components expect to find in the external context in order to complete their own implementation. TSLib is a collection of a number of subparts (rather than a monolithic entity) specifically designed to be embedded in a number of different contexts. Fig. 3.9 is a graphical representation of the library core components.

Figure 3.9: The core components of the Textual Scholarship Library


It shows the entry point and the subsystems composing the library. It represents a high-level view of the library and it is useful for understanding the general principles and aims the artifact deals with [195]. As the architectural image shows, the library focuses on provided and required interfaces. The components that provide interfaces act as servers, while the components that require interfaces act as clients. Thus, the TSLib interfaces define the library contracts and services. The required and provided library interfaces are public and they can be visualized as exported Application Programming Interfaces. These interfaces show the capabilities of the components and provide the methods needed by client applications to access the library. The core of the ongoing library is structured into several components which describe the top-level structure and the general organization of the library (a minimal sketch of the provided/required interface idea follows the list below):

• Textual content management;

• Support/Facsimile management;

• Editing management;

• Layers management (analysis);

• Relations management (linked data);

• Indexing and search management;

• Presentation management (GUI).
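As announced above, a minimal Java sketch of the provided/required interface idea follows; all names (LayerAnalyzer, TextRepository, Tokenizer) are hypothetical and do not correspond to the actual TSLib API.

```java
import java.util.List;

/** Provided interface: the service a layer component exports to its clients. */
interface LayerAnalyzer {
    List<String> annotate(String citationUrn);
}

/** Required interfaces: services the component expects from its surrounding context. */
interface TextRepository {
    String getText(String citationUrn);
}

interface Tokenizer {
    List<String> tokenize(String text);
}

/** The component realizes the provided interface and declares its requirements explicitly. */
class SimpleLayerAnalyzer implements LayerAnalyzer {
    private final TextRepository repository;   // required interface
    private final Tokenizer tokenizer;         // required interface

    SimpleLayerAnalyzer(TextRepository repository, Tokenizer tokenizer) {
        this.repository = repository;
        this.tokenizer = tokenizer;
    }

    @Override
    public List<String> annotate(String citationUrn) {
        // Retrieve the text through the required repository and tokenize it
        return tokenizer.tokenize(repository.getText(citationUrn));
    }

    public static void main(String[] args) {
        TextRepository repo = urn -> "arma virumque cano";      // stub repository
        Tokenizer tok = text -> List.of(text.split("\\s+"));    // whitespace tokenizer
        LayerAnalyzer analyzer = new SimpleLayerAnalyzer(repo, tok);
        System.out.println(analyzer.annotate("urn:cts:latinLit:phi0690.phi003:1.1"));
    }
}
```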

The Textual content component handles the digital representation of the primary sources in an object-oriented framework. Section 3.4 widely illustrates the conceptual design of the textual entities, which efficiently meet the scholars' needs. The model focuses on efficient access to different textual phenomena at different levels of granularity (e.g. page, line, word). The inherent complexity of the text structure is hidden from the conceptual data model and it can be implemented by modules that realize and extend the abstract classes. The Support/Facsimile component deals with information related to the physical device [196]. These components manage the multidimensional models (e.g. 3D models, or a set of images of manuscript scans) of any relevant information related to a specific communication process of written text. It gives a complementary view of the digital representation of the textual document, by focusing on writing

3.3. SYSTEM ARCHITECTURE 77

and context information that is not covered by computational linguistic processes. This is necessary for scientific interpretation and for understanding any text [197]. The Editing component manages the creation, the reading, the updating and the deletion of the data handled by the library. Moreover, the component has to preserve the integrity of the interconnected information, tracking multiple versions of the document, etc. Several types of textual objects are affected by editing, such as texts with variant readings, automated analyses, manual annotations and data entries for scholarly comments (annotation and linguistic enrichment are discussed in section 4.4). Modules related to document and text analyses fall under the Layer compo- nent. These modules involve, for example, algorithms for lemmatization, pos- tagging, metrical analysis, named entities recognition, etc. These kinds of textual processes are pluggable extensions and can be customized both by service providers and clients by means of a specialized library module (see section 3.5.3). In general, the layer component concerns resources for their processing and annotation. The nature of these annotations leads to call the component “layer component”. This means that documents convey data as different informative levels starting from the text itself (level zero) [198]. Other layers, built upon the text, can be phonetic, morphological, syntactical, etc. It is worth noting that the textual process has to be independent from the language as much as possible. This can be obtained by using factories and delegation patterns as illustrated in section 3.5.2. The Relation component has been developed in order to handle the overall relationships among the entities. The relationships involve the digital entities through an identification schema (e.g. Resource Description Framework (RDF) within the LOD paradigm, see chapter2 section 2.6.4 and 2.4.2). The linking is done at different levels of granularity and among different types of objects by means of a suitable hierarchical data model (see section 3.4). For instance, textual entities can be linked to other textual entities and a character can be linked to the related image box in its multimedia model. The Search component creates and manages the data structures necessary to efficiently access stored resources (a practical case study is illustrated in Fig. 4.2). Search components, devoted to the information retrieval, combine the data indexed in the persistence unit and exploit a large number of query techniques for accessing databases (xquery, sql, sparql, etc.). The Presentation component takes into account the data structures that rep-

i i

represent content combined with multiple levels of analysis. Graphical interfaces, necessary for the interaction between the user and the system, must satisfy the scholars' specific needs (user experience). In such a way, scholars avoid the frustration of using graphical interfaces designed for other domains. The above introduced core components can also be adopted by international study groups for the development of an advanced textual environment (illustrated in Fig. 3.10), which aims at implementing a virtual framework for digital scholarly editing and for educational purposes [199]. The component interaction is guaranteed via ad-hoc mechanisms provided by specific manager modules [200, 201]. These mechanisms follow the service-provider framework [202] and allow multiple service providers to implement the same service. This method makes the implementation available to clients, while hiding the information related to the implementation itself. This gives great flexibility in choosing the actual module of the selected service. In such a way, the library can return object instances without making their class public. Hiding implementation classes leads to a very compact API and to interface-based frameworks where interfaces provide ADTs thanks to static factory methods (see paragraph 3.5.2).
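A minimal sketch of such a service-provider mechanism with a static factory is shown below; the names (Lemmatizer, Lemmatizers) are hypothetical and only illustrate the registration/lookup pattern described above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** The service interface exposed to clients; concrete classes stay hidden. */
interface Lemmatizer {
    String lemmatize(String form);
}

/** Non-instantiable registry combining provider registration and a static factory. */
final class Lemmatizers {
    private static final Map<String, Lemmatizer> PROVIDERS = new ConcurrentHashMap<>();

    private Lemmatizers() { }

    /** Providers (e.g. a Latin or a Greek module) register an implementation under a name. */
    public static void register(String language, Lemmatizer provider) {
        PROVIDERS.put(language, provider);
    }

    /** Static factory: clients obtain the service without knowing the implementing class. */
    public static Lemmatizer getInstance(String language) {
        Lemmatizer provider = PROVIDERS.get(language);
        if (provider == null) {
            throw new IllegalArgumentException("No lemmatizer registered for: " + language);
        }
        return provider;
    }

    public static void main(String[] args) {
        // Toy rule registered as a provider; a real module would register a full implementation
        register("la", form -> form.endsWith("que") ? form.substring(0, form.length() - 3) : form);
        System.out.println(getInstance("la").lemmatize("virumque"));   // prints "virum"
    }
}
```

A hypothetical Latin module would call Lemmatizers.register("la", ...) at start-up, while client code only ever sees the Lemmatizer interface returned by the static factory.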

3.4 Designing the data model

The complexity characterizing the textual scholarship domain becomes more challenging due to the insufficient formal representation of sources from a philological perspective [50, 7, 11, 203]. As evidence, several research efforts are currently studying cultural data models (see chapter 2), but almost none of them embeds the vision that texts are transmitted by multiple supports, with multiple variant readings, and with multiple interpretations [204]. The object model discussed here shows hypothetical Abstract Data Types for the domain under investigation. These Abstract Data Types (ADTs) have been conceived to be easy to reuse and extensible for future improvements. Indeed, a data model that represents the conceptual object-oriented design of the artifact is an essential means for the development of a suitable application. The model relies on several object-oriented features such as encapsulation and composition. Moreover, the design tries to offer systematic solutions to recurring scholarly activities. This configuration should serve as a guideline for the implementation of an elegant and stable library for textual scholarship in an object-oriented


Figure 3.10: Open Philology architecture - Courtesy of Bridget Almas


programming style. One of the benefits of this approach is the ability to model the domain in terms of interconnected Abstract Data Types. In fact, ADTs are domain entities representing the knowledge of the problem, and they convey values and behavior that scholarly experts can understand and manipulate. This observation provides both conceptual and technical advantages [205].

Concerning conceptual advantages, documents and textual entities can be designed in terms of abstract objects through a platform-independent model. For example, a linguistic annotation or a comment are potential objects in the problem space of the library. As pointed out by [186], identifying core objects as data types in a given problem space and determining the object responsibilities and how they relate to each other is a preferable paradigm compared to designing in terms of actions [206].

Concerning technical advantages, object-oriented approaches provide solutions as discussed in section 3.5.1. However, it is worth remembering that object modeling can lead to deep inheritance hierarchies, which can break Liskov's substitution principle [207, 208]. Consequently, the design of the TSLib data model attempts to use inheritance only when appropriate, giving prominence to composition, which is often a more suitable solution [202, 209].

In the light of this, the key concept in modeling the literary domain involves how to design the information conveyed by each textual phenomenon in a flexible way. As mentioned above, such textual data encompass (a) a multiplicity of variant readings, (b) different layers of analysis, (c) different interpretations, and (d) different levels of granularity. The model needs to reflect the conceptual requirements of the domain, rather than the actual software details. As new requirements, functionalities and feedback emerge, the model can change over time. For this reason, the software development process has adopted the Iconix-Agile process [185]. As discussed throughout chapter 2, cultural heritage documents have complex content with multiple views and overlapping structures. This puts the accent on the need to further explore methods for their digital representation. Moreover, scholars need to identify the variants and manipulations to which manuscripts have been subjected. This allows them to discern trustworthy readings and accept or reject the hypotheses of previous editors. In addition, scholars need to examine the commentaries, articles and monographs concerning specific parts of the studied text. Therefore, the extension in breadth of the digital collections needs to be integrated with the extension in depth, according


to the paradigms of the new generation of digital libraries [24, 210]. For these reasons, the design of the model should be based on four distinct concepts: 1) the textual structure; 2) the semantics of the structure; 3) the style of the document; and 4) the behavior of the entities. The points listed above ensure modularity, scalability and flexibility. The conceptual model of the library is shown through two class diagrams in Fig. 3.11 and Fig. 3.12.

Figure 3.11: Class model defining the tradition of the textual documents

The entities define values and operations with respect to the requirements provided by the scholars. This means that the ongoing formalization of the domain tracks the model constraints and guarantees the consistency of the data properties. By doing so, the specification also determines the pre- and post-conditions of component behavior and their inter-module interactions. The cornerstone of the model is the primary source representation. Documents which represent related sources have been managed in parallel through an abstract generalization (Fig. 3.13). Annotating, commenting, and analyzing sources are some of the main activities dealt with by scholars. The two diagrams in Fig. 3.11 and Fig. 3.12 represent the central entities that the library handles in order to aid textual scholars in the comprehension of ancient, modern or contemporary documents. The diagram in Fig. 3.11 describes the part of the data model which takes into account the tradition of the text, that is, the collection of witnesses of the work under study. The diagram also considers the reconstruction of the text



and the selection according to the editor, such as the established text in the critical edition.

Figure 3.12: Conceptual Class Model representing textual source materials

Figure 3.13: Parallel bilingual texts with comments

The UML diagram formalizes the textual tradition, taking into account potential


witnesses (text transcription and/or facsimile images) and variant readings. Variant readings ("Reading" in the diagram) have specific properties ("Property") which describe the nature of the variants themselves. Due to the multi-faceted nature of texts in the philological domain, the action of choosing a word acquires particular relevance. In fact, even in the case of documents (witnesses) that transmit the same text, the possibility that errors have occurred throughout the history of the text cannot be excluded.

The diagram in Fig. 3.12 describes how the model represents the source documents. Sources are designed as complex and hierarchical objects ("SourceComponent", "SourceComposite", "SourceLeaf") that can evolve in space and time ("SourceType"), creating variant readings ("Content") on top of which various kinds of annotations are set ("Annotation" and "AnnotationType"). Documents and textual entities are interconnected and structured objects, either of the same or of a different nature. These data structures and their interconnections might be expressed as linked data and made available for further analysis (see section 2.4 and section 4.4.2).

Fig. 3.12 shows the source class diagram. It describes the abstract entities that embody the multiple views and the hierarchy characterizing the structure of a resource from a philological perspective. Therefore, the abstract model has to consider the modalities and multiplicity of the resources. These can be realized through the introduction of a four-dimensional vector of properties (a minimal illustrative sketch follows the list below):

• the component “v” represents the version. It yields the information accessed by a version of the resource;

• the component “g” represents the granularity of the object. It conveys the hierarchical, structural representation of the resource (see Ordered Hierarchy of Content Objects (OHCO), Functional Requirements for Bibliographic Records (FRBR), Canonical Text Services (CTS));

• the component “p” describes the position of the textual object. This means for example whether it is a first edition, or the second page, or the third paragraph, or the fourth sentence, or the fifth word, or the sixth character;

• the component “l” introduces the layer of analysis. It is responsible for representing different levels of interpretations, such as morphological tagging, syntactic parsing, semantic annotations, metrical analysis, etc.
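As a minimal sketch of how such a four-dimensional coordinate could be encoded in Java, the value object below follows the letters used above; the class name, field names and types are assumptions made only for illustration, not the library's actual model.

// Hypothetical value object for the <v, g, p, l> coordinate described above.
public final class TextCoordinate {
    private final String version;      // "v": version of the resource
    private final String granularity;  // "g": structural level (work, page, line, word, ...)
    private final int    position;     // "p": position within the chosen granularity
    private final String layer;        // "l": layer of analysis (morphology, syntax, ...)

    public TextCoordinate(String version, String granularity, int position, String layer) {
        this.version = version;
        this.granularity = granularity;
        this.position = position;
        this.layer = layer;
    }

    public String version()     { return version; }
    public String granularity() { return granularity; }
    public int    position()    { return position; }
    public String layer()       { return layer; }
}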



The above diagrams shape the principal concepts that the library has to manage. These are: a) the hierarchical structure of the sources, expressed in the diagram by the composite pattern (see section 3.5.2); b) the multiple views of the content, expressed by different roles and different typologies (typed relationship pattern); c) the external annotations upon the sources. The aforementioned framework models the basic relationships among the objects by following the nature of the target domain. This is outlined as:

The source data type is specified by two sub-type entities. The composite relationship makes it possible to design a hierarchy between the various levels of the document and the textual elements. This solution is suitable to express levels of granularity and non-linear structures. For example, the textual content of a document can be organized in pages, lines, words, characters, etc. This representation is hierarchical and each element can be seen either as a new composite element or as a leaf element.

The source entity comprises various nested data types generalized by a common abstract class. This relationship shapes the multiplicity of views of the textual content. For example, a textual block can have a corresponding digital representation (usually an image). The type of data content defines the aspects of a document related to metatext, paratext, and extratext, by means of a typed relationship pattern (ContentType and ContentRole)4.

The annotation concept draws on a stand-off mechanism. It combines multiple levels of analyses and comments in order to enrich the information corresponding to the target source. Generally, annotations can be seen as metadata, that is to say, information that provides contextual data about an object in the collection (see chapter 2). Metadata can also be viewed as primary data sources that are related to or derived from other primary data. Each annotation is endowed with a specific type. This technique provides extensibility and flexibility. For example, an automatic linguistic analysis could add lemmata to each word, while scholars could add exegetic comments to some textual paragraphs. References to annotated entities (as CTS Uniform Resource Names (URNs) do) allow the handling of related sources both for human and for software agents. Consequently, this facilitates the

4 Further information on text, paratext, extratext and intertext can be found in the literature, for instance [211, 25, 3].


alignment of different text versions with the related annotations. As discussed in section 3.5.2, the observer pattern implements change-listener mechanisms that track the various updates made to the content. These mechanisms manage the notification of annotations related to source objects even when a version is removed or modified at any level of granularity. Finally, the Open Annotation Data Model (OA) and CTS URNs help in managing and encoding annotations that point to the source textual content. The digital representation of textual content with the addition of annotations fosters a hierarchical graph representation [87], instead of a tree representation.
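To make the stand-off mechanism more tangible, the sketch below shows a hypothetical annotation object that points to its target purely through a CTS URN string rather than embedding markup in the text; the class and field names are illustrative assumptions, not the library's actual API.

// Hypothetical stand-off annotation: it references the annotated passage
// through a CTS URN instead of modifying the source text itself.
public final class StandoffAnnotation {
    private final String targetUrn;   // e.g. "urn:cts:greekLit:tlg0012.tlg001:1.1"
    private final String layer;       // e.g. "lemma", "comment", "metrical"
    private final String body;        // the annotation content

    public StandoffAnnotation(String targetUrn, String layer, String body) {
        this.targetUrn = targetUrn;
        this.layer = layer;
        this.body = body;
    }

    public String getTargetUrn() { return targetUrn; }
    public String getLayer()     { return layer; }
    public String getBody()      { return body; }
}

Because the annotation carries only a reference, the same target passage can accumulate any number of independent layers without the underlying source being touched.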

In summary, the data model described above provides the core entities regarding the digital representation of textual documents from a scholarly perspective. It allows the management of hierarchical views of the content as a flexible relationship among texts, images and other textual types such as "paratext" and "extratext". In addition, it also automatically handles all annotated contexts. As discussed in section 3.5, design patterns [212, 213] are the means to organize modeling efforts into classes, methods, and objects. These objects store data and provide methods for accessing and modifying their own internal state.

3.5 API design and Design Patterns

The TSLib bases its design both on API techniques and on design patterns. As already pointed out, the novelty of this research lies in the application of these techniques to the domain of literary computing. Hereafter, examples regarding the design principles and techniques that cover diverse computational themes are introduced. The diagrams illustrate design choices for performing textual services. These can include the identification of a word as a noun or a verb, as well as metrical analysis services identifying rhythmic structures in a text. They can also include lemmatization capabilities, which identify the basic word-form representing the entire word paradigm. This choice is justified by the fact that the library combines experience, methods and tools developed by the Institute for Computational Linguistics (ILC), which invests in computational solutions both in the fields of linguistics and philology [214].



3.5.1 API design

This research has investigated well-known and fundamental strategies and principles to achieve abstract designs and effective API development. As already stated, the relevance of this topic stands in its potential to design and release a reusable library. As is commonly acknowledged, the purpose of an API is to separate client applications from the providers that implement the services. It follows that API users do not know the details of the artifact, as they only have access to public information. Furthermore, as the providers' work does not take clients into account, the former can change implementations without breaking the interfaces. Thereby, APIs enable the TSLib to define and develop components separately for the various features that make up the library. Based on this, it is possible to reuse and extend the artifacts across a range of applications and at different times. In addition, this approach allows classes of problems to be solved, rather than individual problems, keeping clients and providers separated. Indeed, if the client is only acquainted with a clear specification (contract) between the API and the implementation (e.g. the behavior), providers have to comply with the general contract, but they are also entitled to change the underlying algorithms [209]. Eventually, the objective is to allow the components to evolve without forcing clients to change their code. In this scenario, designing a modular library for literary computing implies, among other things, a deep understanding of general API features, such as abstraction, code reuse, the open-closed principle, and loosely coupled components. These factors contribute to the implementation of an appropriate and efficient tool, tailored to computational scholars' needs. It follows that API design can affect the behavior, the soundness, and the difficulty of using software components [195]. As a matter of fact, when an API changes, all clients and implementations have to change. Therefore, the discussion about the Application Programming Interface cannot be limited to the naming of basic object classes, or public methods exported to the users (in other words, an API does not end at the signature of a class or method [202]). Such a design, instead, is much more challenging since APIs encompass more complex issues:

• specifying information that scholars need to know in order to use the library;

• encouraging scholars to reuse modules while requiring only limited knowledge of the API;

• avoiding the exposure of all implementation information to users, in order to make the exploration of the library intuitive (Shallow Understanding);

• assembling individual building blocks in order to generate the whole appli- cation;

• evolving compatibility for anyone writing applications based on the library;

• being as small as possible but not any smaller;

• helping providers to change the code and therefore to implement different algorithms for more efficient and more accurate problem solving;

• specifying the set of protocols, files, environmental variables and their formats that each component of the library must read or write;

• ensuring that the behavior of a component remains unchanged.

At this stage, the computational scholars' attention turns to the programming code that they need to write in order to accomplish their tasks. In fact, APIs are user interfaces for computational scholars just as GUIs are user interfaces for classic scholars. In such a way, APIs allow programmers to ignore the actual data representation and the details of the implementation algorithms. In this way, the API establishes an abstraction over the functionalities of each component and hides unnecessary internal complexity. According to this view, an Application Programming Interface is a contract [215] between the programmer and the implementation providers, as well as between the API designer and the developers of the specific functionalities. In accordance with this, the challenge resides in designing clear and unambiguous contracts and functionalities for the library. In order to clarify this concept, an example of a Java code snippet is given below. The client of the library instantiates the object which represents a manuscript and associates the facsimile images with each folio. Afterwards, it asks for the textual content of the document, specifying the data format. In the end, the client writes the data into a stream, either for storing the outcomes or for communication purposes through a socket. The client is unaware of the implementation mechanism and, consequently, the procedure appears as natural and intuitive as possible. In fact, the objective of a well-defined API is to provide a logical interface to the exported functionality of a component while hiding any implementation details [216, 217]. In the code lines below, error checking and exception handling have been omitted in order to make the example clearer.

i i

i i 88 CHAPTER 3. METHODS

Source codexM = Source.newInstance();
codexM.setMetaData(properties.get("nfolia"));
codexM.setFacsimile(properties.get("inputSource"));
String jsonText = codexM.getText(Granularity.FOLIO, Format.JSON);
write(jsonText);

In the light of what has just been argued, firstly, any tool for scholarly studies should provide an abstraction suited to the needs of historical and cultural documents. Secondly, it should specify the guidelines according to which scholars interact with the components of the library. In turn, each component should provide solutions to the basic issues of the domain, such as annotation of textual phenomena, correlation between text and images, highlighting of formulae and linguistic patterns in different languages, alignment of different versions of the same text, semantic searches based on shared ontologies, scholarly editing, textual criticism, and variant reading management. Essentially, APIs define reusable bricks that allow modular pieces of functionality to be imported into end-user applications. For example, Fig. 3.14 describes a typical text analysis use-case where a client application handles lemmatization and concordances. The application depends directly on several modules by means of exported APIs. As mentioned before, a primary goal of this research is to examine the fundamental characteristics that compose a sound tool in the field of literary computing. This implies that the library meets some design qualities [190, 195] which are often disregarded by software within the Digital Humanities, but which are desirable when implementing software that is meant to be reused. The most important qualities adopted for the TSLib can be listed as follows:

• Information Hiding [205, 218],

• Comprehensibility [191, 219],

• Consistency [202],

• Discoverability [219, 220],

• Difficulty to misuse [221, 222, 187],

• Performance [223, 224],


Figure 3.14: Lemmatization example for API Component design

• Stability (Backward Compatibility) [202, 225, 226].

Information Hiding produces reasonable advantages in terms of (a) performance (i.e. it enables caching techniques, lazy evaluation, etc.), (b) resilience (i.e. it allows for validation, notification and synchronization mechanisms and allows the environment to maintain invariant relationships and verify pre- and post-conditions), (c) evolvability (i.e. improving implementations by adding new functionalities


through plug-ins). Design patterns contextualized to the target domain (see section 3.5.2) guarantee a loosely coupled modular system and a separation of concerns.

Comprehensibility means that computational scholars must understand the core model of the library, its key objects and functionalities. The principle to be followed is called least astonishment (aka least surprise). Hence, the API is to be understood as a meta-language that involves three actors: (1) the designer of the library, (2) the providers who write implementations against the API, and finally (3) the clients who write custom applications. These actors communicate thanks to the availability of the API.

Consistency deals with mechanisms that handle similar features in similar ways. It is important that these kinds of mechanisms follow the same policies across the components. For instance, if the policy of passing arguments to methods were performed via special Data Transfer Objects (DTO) together with String-typed values instead of enum-typed values, these policies should be coherent throughout the entire library. The following is an example of code involving text analysis:

basic method:
ResourceAnalyzed analyze(SourceDTO source, String action)

alternative method:
ResourceAnalyzed analyze(SourceDTO source, AnalysisAction action)

calls:
SourceDTO source = FactoryDTO.instanceSourceDTO();
source.setContent(content);
source.setParameters(parameters);

basic call:
ResourceAnalyzed object = analyze(source, "morpho");

alternative call:
ResourceAnalyzed object = analyze(source, AnalysisAction.morphological);


Discoverability deals with organizing entry-point classes as intuitively as possible. It is in charge of creating a single place that can serve as a starting point for discovering the component services. For instance, the well-known NetBeans integrated development environment (IDE) platform5 organizes lookup mechanisms to manage services and discover registered functionalities [52, 227]. In this context it is important to provide examples of API usage aimed at accomplishing specific tasks.

Difficulty to misuse involves other library design features, such as being minimal yet complete, clear, simple, intuitive, easy to memorize, and so on. The rule which guides the design is that the API ought to be for computational scholars what the GUI is for classical scholars. It follows that one of the most decisive points at issue is naming. A few valuable guidelines are: (1) avoiding abbreviations, (2) guiding users in supplying method parameters, (3) avoiding long lists of parameters of the same type, and (4) adhering to naming conventions. The example below shows these rules applied to the design of the class that represents the morphological code in the library.

poor signature:
public MorphoCode produceMorpho(
    String pos, String per, String num, String tense, String mood,
    String voice, String gen, String cas, String deg);

difficult call:
MorphoCode mc = MorphoCode.produceMorpho("a", "-", "s", "-", "-", "-", "m", "a", "-");

better signature:
public MorphologicalCode newInstance(
    PartOfSpeech pos, Person p, Number n, Tense t, Mood m,
    Voice v, Gender g, Case c, Degree d);

easy call:

5 http://www.platform.netbeans.org



MorphologicalCode mc = MorphologicalCode.newInstance(
    PartOfSpeech.ADJECTIVE, Person.NAN, Number.SINGULAR,
    Tense.NAN, Mood.NAN, Voice.NAN,
    Gender.MASCULINE, Case.ACCUSATIVE, Degree.NAN);

Performance A well-designed library usually results in good performance. Generally, loosely coupled components enable effective performance tuning. Indeed, once a system is complete and profiling has tracked which modules suffer in performance, those parts of the system can be improved without affecting other modules.

Stability mainly concerns backward compatibility. The library has to evolve and enhance its features without breaking old applications. This means that the library has to improve without affecting client code stability. The APIs, as pointed out in [202], manifest multiple levels of backward compatibility: (a) Source Compatibility, (b) Binary Compatibility, and (c) Functional Compatibility:

• Source Compatibility deals with the ability to compile client applications against new library releases. The challenge involves adding both methods and classes.

• Binary Compatibility deals with the ability to link new releases of the library against client applications without recompiling them. Binary compatibility fosters the use of public methods instead of static or public fields, in order to manage compatibility through dynamic binding.

• Functional Compatibility deals with component-based development. This means making the modules behave the right way across library releases. This kind of compatibility relates to specifications: the more complete and


comprehensible the specifications are, the more functional compatibility is expected. The risk is that users rely on an unexpected functionality.

It is evident that the aim of the library APIs is to hide any implementation detail and to foster modular programming. By doing so, the evolution of the library does not affect existing client applications. This means that any internal detail and changing element must be kept hidden from the client of the API. In addition, the library benefits from generic programming capabilities [228, 229]. In this regard, Fig. 3.15 shows the aligner component, which puts this aptitude into effect. Generics allow clients to write custom applications in terms of generic types. In the above-mentioned example, scholars specialize the generic definition of the aligner entity by instantiating it with specific types. This is instrumental for the aligner to handle sequences of objects of any type. APIs which follow the aforementioned principles require clients of the library to write only a few lines of code to perform basic tasks. Meanwhile, the software package allows clients to control the processes they want to accomplish. High-level convenience components [205] provide a well-known solution to wrap core APIs and, consequently, to give functionality on top of the basic one. This solution, at the same time, guarantees great flexibility of usage for complex tasks and the least possible effort for simple activities. These convenience wrappers do not depend on internal methods or symbols of the library. In fact, they are fully isolated from the core API and rely only on the public interfaces of the core components. The textual scholarship library should prevent its clients from breaking encapsulation. Generally, to achieve this, the exported objects are to be declared final and immutable [209, 202]. When appropriately designed, a core API yields applications that are less prone to errors and more secure. For example, access to any mutable component has to be exclusive. At the same time, as claimed in [209], objects which have fields referring to mutable data should not provide references to those mutable fields, either in the initialization phase or when returning the objects. Hence, it is important that internal classes, interfaces, and members are not part of the API. This practice is largely used in API design and is known as the functional approach, because a service does not modify any operands but returns data only by building new structures. In this way, immutable objects can be shared freely. Furthermore, it is easier to maintain the invariants of a complex



Figure 3.15: Convenience layered API for TSLib

component if the entities it handles do not change. However, in order to keep performance high, some immutable classes have one or more non-final fields in which they can cache the results of expensive processes. In conclusion, the API approach entails three main objectives. First, it enables the production of clear and correct client code. Second, it fosters the development of prototypical implementations against the core functionality. Third, it promotes the implementation of end-user applications to validate the design decisions.
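As an illustration of the generic aligner entity mentioned above, the following minimal Java sketch shows a type-parameterized interface; the names Aligner and AlignedPair are assumptions made for this example and do not reproduce the library's actual signatures.

import java.util.List;

// Hypothetical generic aligner: the same contract serves characters,
// words, verses, or any other unit the scholar wants to align.
public interface Aligner<T> {
    List<AlignedPair<T>> align(List<T> first, List<T> second);
}

// Simple immutable pair of aligned elements (null marks a gap).
final class AlignedPair<T> {
    private final T left;
    private final T right;

    AlignedPair(T left, T right) {
        this.left = left;
        this.right = right;
    }

    T left()  { return left; }
    T right() { return right; }
}

A client could then instantiate an Aligner<String> for word-level alignment or an Aligner<Character> for character-level alignment, without any change to the component's API.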


3.5.2 Design Patterns

The design of the TSLib makes use of common design patterns [212, 213, 206, 230]. Hence, this work has investigated software design and experimented with various types of patterns, techniques and principles popular in software engineering. Indeed, one of the aims of this work is to apply design patterns that represent general reusable solutions to recurring problems within the domain of literary studies. The main patterns and techniques effectively adopted in the design of the library are explained below.

API/SPI delegation The technique that allows the library modules to clearly decouple client modules from provider modules is the API/Service Provider Interface (SPI) delegation technique. The SPI should be completely separated from client calls [52]. Fig. 3.16 shows how this technique is able to hide internal details behind public exported methods, that is, by moving the data representation and the current algorithms to a customizable and private implementation class. This solution allows the decoupling of APIs from their provider modules (SPI). This pattern has many useful advantages. The separation of public interfaces from implementation details also means that the client API can evolve independently of the internal interfaces, which are only relevant to those who implement the services. As a matter of fact, users interact with objects that are stable and final (in the example, the Lemmatizer class), while the providers' perspective keeps flexibility in the implementation of the required interfaces (in the example, the Impl interface). In such a way, the library has a reliable mechanism for extending its functionality by improving the implementation and by adding new interfaces in further releases. When new features are added, the exported objects remain unchanged thanks to a particular delegation technique (i.e., via factory methods, see paragraph 3.5.2). Delegation is a powerful strategy that allows service modules to cooperate with the clients. The API/SPI pattern fosters binary compatibility: the reference to the implementation interface does not change in the subsequent versions of the module. Furthermore, the introduced mechanism allows performance tuning based on system properties and resource availability, since it is possible to lazily allocate the internal class or to decide which type of object to allocate.
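A minimal sketch of the API/SPI delegation just described, loosely modeled on the Lemmatizer example of Fig. 3.16; the class names and the registration mechanism are assumptions made purely for illustration.

// Exported, stable API class: clients only ever see Lemmatizer.
public final class Lemmatizer {

    // SPI: only providers implement this interface; it can evolve separately.
    public interface Impl {
        String lemmatize(String wordForm);
    }

    private static Impl provider;            // registered by a provider module

    public static void registerImpl(Impl impl) {
        provider = impl;
    }

    public static Lemmatizer newInstance() {
        return new Lemmatizer();              // the provider class never leaks out
    }

    public String lemmatize(String wordForm) {
        if (provider == null) {
            throw new IllegalStateException("No lemmatizer provider registered");
        }
        return provider.lemmatize(wordForm);  // delegation to the SPI
    }
}

The exported class stays final and stable across releases, while providers remain free to replace the Impl object behind it.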



Figure 3.16: API/SPI Delegation pattern

Factory methods Fig. 3.16 also shows the factory method pattern. This is a widely used and well-known creational pattern which provides a flexible way to instantiate an object. Factory methods are frequent in the basic abstract data types across the design of the TSLib. The main advantage of factories is that they allow clients of the library to create objects without having to specify the actual type. For example, in Fig. 3.17, which shows the class diagram describing the sequence alignment module of the library, the factory objects (AlignerFactory and SimEvaluatorFactory) create the


actual entities to perform the alignment task.

Figure 3.17: Factory Pattern

The client of the module uses the factories without any awareness of what the actual instance objects are. The client, in fact, just invokes interface operations implemented by all sub-typed classes. As the design of the aligner module demonstrates, it is possible to achieve more flexibility during construction procedures by adopting factory strategies. This moves object binding to run time, rather than compile time, as required by conventional constructors. This behavior is largely adopted in designing the library, since public classes can create different objects based on user input (the getAligner method in the example is supplied with a map of features). Furthermore, actual objects can be created based on configuration files or context properties available at run time. In this way, scholars can ignore the internals and the specific types of the different aligners.
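The sketch below gives a minimal, self-contained example of such a static factory; the SequenceAligner interface, the feature key "algorithm" and the stub implementation class are all hypothetical names introduced for illustration, not the actual AlignerFactory of the library.

import java.util.List;
import java.util.Map;

// Hypothetical service interface for sequence alignment.
interface SequenceAligner {
    List<String[]> align(List<String> first, List<String> second);
}

// Static factory: the concrete aligner class stays hidden from clients.
public final class AlignerFactory {

    public static SequenceAligner getAligner(Map<String, String> features) {
        String algorithm = features.getOrDefault("algorithm", "needleman-wunsch");
        if ("needleman-wunsch".equals(algorithm)) {
            return new NeedlemanWunschAligner();   // package-private implementation
        }
        throw new IllegalArgumentException("Unknown aligner: " + algorithm);
    }

    private AlignerFactory() { }                   // static factory only, no instances
}

// Stub shown only to keep the example self-contained.
final class NeedlemanWunschAligner implements SequenceAligner {
    @Override
    public List<String[]> align(List<String> first, List<String> second) {
        throw new UnsupportedOperationException("alignment logic omitted");
    }
}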

Data driven Fig. 3.14 shows another technique, known as the data-driven approach. Through this solution the Lemmatizer class can perform different operations by supplying the delegated object with different parameter data. By doing so, the public methods guide the external client with methods that should be as clear as possible. Internal mechanisms, instead, can provide more generic routines that accept method parameters with a named-command idiom. It follows that the implementing modules can perform different operations without requiring the software to be recompiled. Moreover, the modules benefit from improved backward compatibility because the operations, the commands and the arguments are encapsulated in the data-driven model.

Singleton Fig. 3.18 illustrates the Singleton pattern. The component that manages the TSLib modules is designed to handle the load/unload functionality for


the set of active components of the library. In particular, the component exploits the Singleton pattern as well as the Lookup technique, following the principles expressed in [205, 202]. This pattern supports the creation of a single, global instance of the Manager object. Consequently, concerning the execution context, the library counts on one module which is in charge of managing the remaining modules.
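A minimal Singleton sketch in Java follows; the class name ModuleManager and its load/unload methods are hypothetical, used only to illustrate the mechanism.

// Hypothetical module manager exposed as a singleton.
public final class ModuleManager {

    private static final ModuleManager INSTANCE = new ModuleManager();

    private ModuleManager() { }              // no external instantiation

    public static ModuleManager getInstance() {
        return INSTANCE;
    }

    public void load(String moduleName)   { /* load and register the module   */ }
    public void unload(String moduleName) { /* release the module's resources */ }
}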

Figure 3.18: Singleton Pattern

Proxy, Adapter and Facade The TSLib takes advantage of existing pieces of software following the principle of code reuse. Structural patterns like Proxy, Adapter, and Facade allow the design of mechanisms for wrapping components on top of other APIs. Fig. 3.19 shows the TSLibPosTagger on top of the OpenNLP PoSTagger. The linguistic analysis of the library benefits from robust and efficient third-party libraries. However, this pattern keeps the consistency of the textual scholarship library, allowing its providers to export their own method signatures. Fig. 3.20 shows the Adapter pattern. Similar to the Proxy pattern, it provides a one-to-one mapping of new functions to pre-existing operations, but the interface and the APIs are different. This pattern is useful for exposing suitable APIs from existing legacy modules. The figure illustrates how the TSLib API is able to maintain its consistency from the client perspective, while internal procedures reuse pre-existing software modules.
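The following minimal sketch illustrates the wrapping idea; it does not reproduce the real OpenNLP API, and the ThirdPartyTagger interface, with a tag(String[]) method, is an assumption standing in for whatever engine is actually wrapped.

import java.util.Arrays;
import java.util.List;

// Stand-in for the wrapped engine: signature assumed for illustration only.
interface ThirdPartyTagger {
    String[] tag(String[] tokens);
}

// Hypothetical adapter/proxy exposing a TSLib-style signature while
// delegating the actual tagging to the third-party component.
public final class TSLibPosTagger {

    private final ThirdPartyTagger delegate;

    public TSLibPosTagger(ThirdPartyTagger delegate) {
        this.delegate = delegate;
    }

    // TSLib-style method: list in, list out, library-specific naming.
    public List<String> tagTokens(List<String> tokens) {
        String[] raw = delegate.tag(tokens.toArray(new String[0]));
        return Arrays.asList(raw);
    }
}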


Figure 3.19: Proxy Pattern

Figure 3.20: Adapter Pattern

As argued in the Introduction, this kind of pattern is especially useful when the design adopts a trade-off between a top-down and a bottom-up approach. Fig. 3.20 shows the metrical analysis Adapter6. Nevertheless, the textual library takes advantage of the metrical module as one of the layers of textual analysis without exposing the original interface and the data types.

6 The original software was developed within the Musisque Deoque and Memorata Poetis projects [231], but its APIs and Abstract Data Types are not compatible with the design of the library.



Figure 3.21: Facade Pattern

High-level components and library functionality exploit abstract access points for managing large collections of sub-modules. In order to achieve this ability, the Facade pattern has been adopted within the core component design. This pattern allows the artifacts to be split into simpler parts (top-down approach) and consequently it fosters stable APIs on top of a variety of entities. Fig. 3.21 illustrates these concepts with respect to tokenization, lemmatization, metrical analysis and indexing. The Facade pattern is often adopted in synergy with other patterns: the figure shows the Factory, the API/SPI, the Proxy and the Adapter patterns working together.


To sum up, the TSLib makes full use of structural patterns as they present many benefits, ranging from performance to extension capability.

Composite pattern As highlighted by the OHCO model (see chapter 2), textual documents have hierarchical structures. In the context of the object-oriented paradigm, the Composite pattern shapes part-whole hierarchies in a tree-like manner. Fig. 3.22 shows the solution adopted for representing the structure of the document. Any primary resource in the TSLib model is a SourceComponent object, which can be a final node (SourceLeaf entity) or an aggregation node (SourceComposite entity). The figure also shows the typed relationship pattern, which provides flexibility to the SourceComponent: the SourceType indicates the nature of the node. This pattern allows the management of any kind of source.
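A minimal sketch of this composite structure is given below; the method names and the use of a plain String for the source type are illustrative assumptions, not the actual TSLib signatures.

import java.util.ArrayList;
import java.util.List;

// Common abstraction for every node of the document hierarchy.
abstract class SourceComponent {
    private final String type;                    // plays the role of SourceType

    protected SourceComponent(String type) { this.type = type; }

    public String getType() { return type; }

    public abstract String getTextContent();
}

// Terminal node: it directly carries textual content.
final class SourceLeaf extends SourceComponent {
    private final String content;

    SourceLeaf(String type, String content) {
        super(type);
        this.content = content;
    }

    @Override
    public String getTextContent() { return content; }
}

// Aggregation node: its content is the concatenation of its children.
final class SourceComposite extends SourceComponent {
    private final List<SourceComponent> children = new ArrayList<>();

    SourceComposite(String type) { super(type); }

    public void add(SourceComponent child) { children.add(child); }

    @Override
    public String getTextContent() {
        StringBuilder sb = new StringBuilder();
        for (SourceComponent child : children) {
            sb.append(child.getTextContent());
        }
        return sb.toString();
    }
}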

Figure 3.22: Composite Pattern

Observer, Strategy, and Visitor Classical design patterns include behavioral techniques such as Observer, Strategy, and Visitor: the TSLib has adopted these patterns to reduce strong dependencies among the modules of the library. For instance, the Observer pattern allows entities to communicate by means of a single point of anchoring, which provides the exchange and the delivery of notifications. In particular, as illustrated in sections 3.1 and 3.4, the digital representation of cultural heritage documents needs synchronization and update mechanisms [232]. Indeed, document components encompass an interdependent collection of formal annotations and interpretations that have to be kept updated and synchronized [233, 234]. For example, Fig. 3.23 shows how annotations concerning textual data can be notified and updated if the underlying text changes. In this case the subject of the pattern is the text of the document and the observers of the pattern are the


annotations. The entities representing the text provide subscription and notification mechanisms. Furthermore, the observers, namely the annotations, comments, and external data, implement an update method in order to re-synchronize their state. In this way, whenever the representation of the document changes or any internal data of the document is modified, a notification of the change is sent to the affected entities.
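A minimal observer sketch follows, with the text as subject and the annotations as observers; the type names are hypothetical and the bodies are intentionally trivial.

import java.util.ArrayList;
import java.util.List;

// Observer contract: annotations subscribe to a text object and are
// notified whenever its content changes.
interface TextObserver {
    void textChanged(String newContent);
}

// Subject: the text of the document.
final class ObservableText {
    private final List<TextObserver> observers = new ArrayList<>();
    private String content;

    void subscribe(TextObserver observer) { observers.add(observer); }

    void setContent(String newContent) {
        this.content = newContent;
        for (TextObserver o : observers) {
            o.textChanged(newContent);        // push the update to every observer
        }
    }
}

// Observer: an annotation that must stay synchronized with the text.
final class Annotation implements TextObserver {
    @Override
    public void textChanged(String newContent) {
        // re-anchor the annotation, mark it as stale, etc.
    }
}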

Figure 3.23: Observer Pattern

The hierarchical nature of the representation of the document, as described in Fig. 3.22, encourages the use of a pattern for traversing the data model in a flexible and customizable way. The Visitor pattern provides a mechanism for extending the functionality of the library. Fig. 3.24 shows how a client of the data model can traverse the document tree in order to write out its textual content. The mechanism supplies the actual visitor by means of the client operation. The custom object is then used by the entities of the hierarchy without knowing the behavior of the visitor. In conclusion, the Visitor pattern provides a procedure for operating flexibly on the library data model. The Strategy pattern, instead, allows the definition of a family of algorithms. These algorithms can vary freely and independently from the clients that use them, according to customizable policies. Fig. 3.25 shows how this pattern has been applied in the design of the library. The figure illustrates the objects that implement a portion of the text analysis component. The TSLib provides mechanisms for substituting engines and for adapting analysis strategies. Indeed, TSLib exposes functionalities for linguistic analysis by means of language-independent entities that dynamically instantiate


Figure 3.24: Visitor Pattern

the appropriate tools. The analysis objects provide methods that take the content as input and return its analysis as output. The external entities can interact only with the TSLib public component, which creates and uses a special object called AnalysisContext. The latter object makes the right association among the language, the linguistic analysis and the engine to be used. For this reason, this pattern represents an important aspect in the design of loosely coupled



APIs.

Figure 3.25: Strategy Pattern
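The sketch below illustrates the Strategy idea with an AnalysisContext-like object that selects the analysis algorithm at run time; the interface name, the use of a plain String key for the language, and the String-based result type are assumptions made for this example.

import java.util.HashMap;
import java.util.Map;

// Strategy contract: one analysis algorithm.
interface AnalysisStrategy {
    String analyze(String content);
}

// Context object: it associates a language with the engine to be used.
final class AnalysisContext {
    private final Map<String, AnalysisStrategy> strategies = new HashMap<>();

    void register(String language, AnalysisStrategy strategy) {
        strategies.put(language, strategy);
    }

    String analyze(String language, String content) {
        AnalysisStrategy strategy = strategies.get(language);
        if (strategy == null) {
            throw new IllegalArgumentException("No strategy registered for: " + language);
        }
        return strategy.analyze(content);     // delegate to the selected engine
    }
}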

Parser Handler and Response Replay patterns Fig. 3.26 and Fig. 3.27 show two further techniques employed in the design of the library. These are the Parser/Handler pattern and the Response/Replay pattern.

Figure 3.26: Parser Handler Pattern

The first follows the design of the Simple API for XML (SAX) technology. It allows the library to parse several formats of input files and then process them with the most suitable handler. Therefore, the data format


Figure 3.27: Request Response Pattern

adopted to serialize or persist the textual information does not affect the representation of the textual entities within the library. Input objects, among others, can be in Text Encoding Initiative (TEI), JavaScript Object Notation (JSON), Comma Separated Values (CSV), or even document (DOC) or Portable Document Format (PDF) formats (see chapter 4). The parser reads data from an input source in order to produce an object representation of a specified type. By attaching


a handler to a parser, the handler will receive a stream of objects produced by the parser. In other words, a parser is created for the format in which the input file is supplied, and then a suitable processing tool is created and attached to the parser. The second pattern (Fig. 3.27), instead, is a technique which allows the library to extend its capabilities and to evolve more easily without breaking backward compatibility. The pattern solves the issue of how to enhance data exchange with additional parameters.
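A minimal Parser/Handler contract in the SAX spirit could look as follows; both interface names are hypothetical, and a TEI, JSON or CSV parser would each implement SourceParser while sharing the same handler.

import java.io.IOException;
import java.io.Reader;

// Handler: receives the stream of units produced by the parser.
interface TextUnitHandler {
    void handle(String textUnit);
}

// Parser: reads an input source and pushes each produced unit to the handler.
interface SourceParser {
    void setHandler(TextUnitHandler handler);
    void parse(Reader input) throws IOException;
}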

MVC and DAO patterns Data has to be stored somewhere for persistence purposes, but the library must be agnostic with regard to persistence technologies and persistence data formats. Consequently, an ad hoc module deals with storing data in some kind of database. It is important to note that, whereas the entities of the library model represent the actual information stored, the model objects interact with an abstract persistence layer instead of the real database system. This can be achieved by including a Data Access Object (DAO) component in charge of handling communication from and to the database. Consequently, the communication between the database and the library is transparent to the core components of the library. Moreover, the DAO component provides all the necessary functionalities to read, store, and modify data in the database. This functionality is known as CRUD (Create, Read, Update, and Delete). In conclusion, the Textual Scholarship Library also benefits from the advantages of the Model View Controller (MVC) pattern, which provides important features, such as full control over the presentation, the model, and the storage of the library itself. This separation among appearance (view), data (model) and action (controller) is widely used in object-oriented applications. Hence, the overall architecture of the library (graphical user interface combined with the core library API) ensures separation and decoupling among: (a) the representation of the internal data status, (b) rendering, (c) system interaction, (d) user scenarios, and (e) content management. The MVC pattern, therefore, plays a central role in designing the library components.
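A hedged sketch of a generic CRUD-style DAO contract is shown below; the interface name and type parameters are assumptions, and the concrete implementation (for eXist-db, a relational database, a triple store, etc.) would live behind it.

import java.util.List;
import java.util.Optional;

// Hypothetical generic DAO contract: the core library talks only to this
// interface and remains agnostic about the underlying store.
public interface Dao<T, ID> {
    void create(T entity);
    Optional<T> read(ID id);
    void update(T entity);
    void delete(ID id);
    List<T> findAll();
}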

3.5.3 Reusability and extensibility

The TSLib capabilities may need to evolve over time as scholars meet new requirements. Therefore, it is crucial to design and implement mechanisms that guarantee component reusability and extensibility. This can be done with the use


of plug-and-play mechanisms [235]. Indeed, plug-ins are nowadays a common way to achieve flexibility in complex systems. Fig. 3.28 shows the mechanism used to implement new components.

Figure 3.28: TSLib Extension Capability

The extension mechanism is based on a factory component which is in charge of handling and organizing the different implementations of a specific service. In particular, the figure shows an example taken from the StemmaEvaluator


component, where, given a set of related sources as input, the component returns a hypothetical phylogenetic tree of the text. As argued in chapter 2, the algorithms for this kind of operation are still being defined by literary computing researchers. Therefore, it is important for the TSLib to have the flexibility to add and to change the implementation of this functionality. In conclusion, the mechanism handles object binding at run time, choosing the actual type of the interface. For instance, the actual object implementing the StemmaEvaluator interface can be known only at run time, when the factory class instantiates the object by using a dynamic repository and advanced programming mechanisms such as reflection. As a matter of fact, the factory object has a registration mechanism to plug and play new implementation components.
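A minimal sketch of such a registration-plus-reflection mechanism follows; the interface, its String-based signature and the factory class are hypothetical names, shown only to make the plug-and-play idea concrete.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical plug-in service: given a set of witnesses, it returns
// some serialized representation of the reconstructed stemma.
interface StemmaEvaluator {
    String evaluate(List<String> witnesses);
}

final class StemmaEvaluatorFactory {

    private static final Map<String, String> REGISTRY = new HashMap<>();

    // Called by plug-in modules when they are loaded.
    static void register(String name, String implementationClassName) {
        REGISTRY.put(name, implementationClassName);
    }

    // The concrete class is resolved only at run time, via reflection.
    static StemmaEvaluator newInstance(String name) throws Exception {
        String className = REGISTRY.get(name);
        return (StemmaEvaluator) Class.forName(className)
                                      .getDeclaredConstructor()
                                      .newInstance();
    }

    private StemmaEvaluatorFactory() { }
}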

3.6 Developing technologies

The prototype of the library makes use of the Java programming language. The fundamental artifacts are organized in several Java packages which map the core modules. These packages form a coherent library of data structures and algorithms specifically designed for literary purposes. The prototype has been adopted in a few real applications in the context of funded research projects. Besides this library, other technologies for the development of complex textual processing systems have been taken into account. Such technologies mainly involve the Java Enterprise Edition, NLP technologies, and XML-related libraries such as jDOM. The overall system, which is a web-based application, has been developed following the JavaServer Faces framework (JSF 2) and the Model View Controller (MVC) pattern (see section 3.5.2). TEI-compliant encoded documents are stored in eXist-db (an XML-oriented database) and the platform is synchronized with it for data management. Therefore, the view tier is the Web, the business logic tier is made up of objects defined by the library entities, and, finally, the data/integration tier is achieved by the native XML database. The integrated system is a collaborative multi-layered application and it handles the presentation logic by making use of two complementary Java enterprise technologies: a) Facelets and b) PrimeFaces. The first one is a component-oriented technology for Web templating; its benefits are efficient code writing and effective software reuse. The second one is a rich and friendly Ajax tag library that allows the development of a flexible graphical user interface (GUI).

Chapter 4

Case Studies

This chapter provides an insight into some issues concerning textual scholarship. A range of effective and practical fieldwork activities guided the design and the methodological work explained in the previous chapters (see chapter 3) [214]. Consequently, as argued in other sections, the development of the Textual Scholarship Library (TSLib) follows a double approach based on bottom-up and top-down methods. Modeling scholarly tools by means of these two converging directions allows designers to generalize, extend and refactor the overall architecture as new requirements and common issues emerge. Additionally, when necessary, Agile methods and Application Programming Interface (API) design pave the way for accommodating new needs. The general architecture introduced in chapter 3, which encompasses content acquisition, text processing, and data exploitation, has been tested on a number of case studies:

• Source acquisition and text encoding;

• Textual indexing;

• Text alignment;

• Variant reading annotations and multi-level analysis.

It is worth noting that text processing has been performed by integrating open source tools such as Lucene, Tika, Tesseract and eXist-db into the TSLib experiments (some of them have been introduced in chapter 2). Well-known measures, such as the Euclidean and cosine distances, and algorithms, such as Needleman-Wunsch, were adopted throughout the case studies.




4.1 Source acquisition and text encoding

The fundamental objective for scholars who study textual materials with computational methods is to create digital representations of these resources in a formal and suitable format (i.e. machine-readable and machine-actionable). The digital representation of documents offers the possibility of using computing to manage and analyze textual data. For example, scholars can perform queries against a corpus as well as extract statistical information from it. Thus, literary computing applications start from the availability of (a) digital images of primary sources, such as a collection of manuscript scans, or (b) the electronic transcriptions of the original documents. Sources can also be 3D models of an object, a MIDI or WAVE digital document, a music sheet, etc. With regard to the scope of this work, a few examples of such sources could be:

• the image scan of a critical edition;

• the image of a manuscript of a modern author;

• the image scan of a manuscript or codex (Fig. 4.1);

• an unstructured (or badly formatted) electronic document (Fig. 4.2).

The sources which the software deals with are complex textual objects. Generally, these objects use non-Latin alphabets and present a large number of phenomena to be annotated at different levels of analysis. Examples of this complexity are:

• multiple author’s interventions;

• state of the sources;

• glyphs, typesetting and graphics recognition;

• unusual document structure and layout analysis;

• handwritten character recognition;

• big data for meta-data extraction;

• non-optimal scans of document images.

From the above-mentioned issues it follows that the digital acquisition of the original sources and the effective encoding of digital texts are often the main obstacles to be overcome in textual scholarship.


Figure 4.1: Example of a manuscript written in Greek language



Figure 4.2: Example of a word processing electronic file

4.1.1 Text acquisition

As explained in the previous chapters, the automatic reading of printed text can be done by Optical Character Recognition (OCR), which converts digital images i i i i


into machine-readable text. The extraction of information, on the other hand, can be performed through document manipulation tools, which can be run on unstructured data in order to obtain well-formed and structured resources (see section 2.6). Standard procedures developed to perform OCR on general-purpose collections of books yield poor outcomes for critical editions [236, 237, 238]. In fact, this kind of resource contains critical aspects such as polytonic Greek and multi-lingual critical apparatuses. Thus, specific OCR procedures need to be developed and then applied to the scanned books. Thereafter, the corrected digital text must be remapped onto the original page images. OCR systems applied to printed editions that contain texts in Greek, Latin or Arabic require sophisticated algorithms and methodologies, both in the pre-processing and post-processing phases, such as the alignment of multiple OCR outputs for improving the accuracy [22, 239]. As discussed above, text recognition and information extraction from critical editions is not a trivial operation. Besides having glyphs with complex patterns (Fig. 4.3), the page layout is usually divided into several text flows with different graphical conventions (text, apparatus, notes). Due to the large amount of computation on large quantities of data, the production and manipulation of these resources require high performance computing environments and process parallelization on a grid of supercomputers [240].

Figure 4.3: Example of Greek glyphs - Courtesy of prof. Bruce Robertson

The challenge of the case study that has been introduced here consists in the


acquisition of printed critical editions in historical languages such as Greek and Latin, but also Arabic. The OCR engines used during the textual scholarship experiments were open source packages, namely Gamera, Tesseract and Ocropus. As pointed out in [239, 238], during the experiments the recognition of a page could require about two minutes of processing per CPU core. Hence, it is natural to parallelize the process of digitization. Other experiments, moreover, have shown that it is possible to significantly improve the accuracy of the results by applying alignment techniques to the recognized texts [22]. In this framework, parallelization has a double benefit. Indeed, it allows one to: (a) decrease the time required for the massive acquisition of texts (from the order of years to the order of months); (b) implement strategies to increase the accuracy. In turn, this can be done by:

• choosing the optimal parameters for image enhancement in pre-processing,

• choosing different training-sets (classifiers) for OCR-engines in processing,

• aligning and correcting results in post-processing through linguistic tools such as spell-checkers.

This case study has exploited the HLRS Environment (Fig. 4.4).

Figure 4.4: OCR experiments conducted on the HLRS Environment

Each page, assigned to a single core, requires on average 2 minutes for optical character recognition. Each page is processed by three OCR engines with different classifiers, namely Gamera, Tesseract and Ocropus (= 15 minutes). 10 additional


minutes for page preprocessing must be taken into account, in order to tune the image parameters and select the best training sets. Post-processing and text analysis require a further 5 minutes for each page. Thus, 30 minutes per page are required for the entire process. The historical corpus consists of 5000 books of 500 pages each on average. The experiments have been performed on the Hermit installation, based on CRAY XE6 technology and consisting of 3552 nodes. Hermit provides 1.045 Pflops of peak performance and has two sockets per node with 16 cores. The processor in each node is a Dual Socket AMD Interlagos at 2.3 GHz. The interconnection between the nodes is provided by the Cray Gemini High Speed Network with HyperTransport HT3, and it can achieve a rate of 102.4 GB/s. The compute nodes for running parallel jobs are only available through the Portable Batch System (PBS) and by means of the Application Level Placement Scheduler (ALPS). The workspace file system has a capacity of 2.7 PB and an IO bandwidth of 150 GB/s (Lustre parallel file system). The operating system is the Cray Linux Environment (CLE), which is based on SUSE Linux Enterprise Server (SLES). Finally, the Message Passing Interface (MPI) has been used for the parallelization, and the Cray Compiler has been used to build the executable binaries. Thanks to PBS and ALPS, jobs are submitted to run on the grid. Leptonica has been used in pre-processing, in order to perform orientation fixing, line segmentation, content selection, resolution adjustment, dewarping and despeckling. These operations improve the image readability for character recognition, whereas Tesseract has been used to perform the OCR. An evaluation of the text after OCR has been performed: the numbers of recognized words (scored highest), pseudo-words (i.e. well-formed syllabic sequences), and sequences of random symbols have been weighted following formula 4.1.

G_{score} = \sum_{j=1}^{totchar} \frac{k_j \cdot nchar_j}{totchar} \qquad (4.1)
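As a purely illustrative sketch (the class names and the weight values are hypothetical and do not reproduce the ones adopted in the experiments), the weighted score of formula 4.1 could be computed in Java as follows:

    import java.util.ArrayList;
    import java.util.List;

    public class GreeknessScoreSketch {

        enum TokenClass { RECOGNIZED_WORD, PSEUDO_WORD, RANDOM_SEQUENCE }

        static class ScoredToken {
            final String surface;
            final TokenClass type;
            ScoredToken(String surface, TokenClass type) {
                this.surface = surface;
                this.type = type;
            }
        }

        // Weight k_j assigned to every character of a token of the given class
        // (hypothetical values: recognized words score highest, noise scores zero).
        static double weightOf(TokenClass type) {
            switch (type) {
                case RECOGNIZED_WORD: return 1.0;
                case PSEUDO_WORD:     return 0.5;
                default:              return 0.0;
            }
        }

        // G_score = sum_j (k_j * nchar_j) / totchar, cf. formula 4.1
        static double greeknessScore(List<ScoredToken> tokens) {
            int totChar = 0;
            double weighted = 0.0;
            for (ScoredToken t : tokens) {
                totChar += t.surface.length();
                weighted += weightOf(t.type) * t.surface.length();
            }
            return totChar == 0 ? 0.0 : weighted / totChar;
        }

        public static void main(String[] args) {
            List<ScoredToken> page = new ArrayList<ScoredToken>();
            page.add(new ScoredToken("μῆνιν", TokenClass.RECOGNIZED_WORD));
            page.add(new ScoredToken("αειδε", TokenClass.PSEUDO_WORD));
            page.add(new ScoredToken("q#x", TokenClass.RANDOM_SEQUENCE));
            System.out.println(greeknessScore(page));
        }
    }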

The developed software module performs dispatching with a Master-Slave topology on the supercomputer grid. The right-hand part of Fig. 4.5 shows the time required to apply OCR to 511 different images with the best parameters. The parallel process involved 2 nodes (7073 sec) and 8 nodes (1005 sec). Time lapses have also been measured with 32 (231 sec), 128 (61 sec), 256 (38 sec) and 512 (21 sec) nodes. Since the Slaves communicate only with the Master, the time required decreases roughly in inverse proportion to the number of nodes used.

Figure 4.5: Parallel OCR experiments evaluation

The recognition results are available both in plain text and in HTML-based Markup for OCR (hOCR) format. The latter includes elements and properties describing the bounding-box coordinates of the recognized words in HyperText Markup Language (HTML) tags. Two experiments have been performed.

The first experiment concerns the tuning of parameters on the nodes of the grid. This has been done in order to identify the best combination to improve the accuracy of the recognition. The second experiment concerns the application of OCR with the best parameters to sample pages by a divide et impera strategy. In turn, the experiments address two tasks: (a) improvement of the OCR accuracy; (b) reduction of the time needed to perform the recognition. As announced, the accuracy of OCR engines applied to polytonic Greek can be improved in three phases: in pre-processing, by image adjustment; during recognition, by a suitable selection of training sets; and in post-processing, by the alignment of the output of different OCR engines. Finally, parallelization can speed up all the aforementioned phases. Indeed, the parallelization of the OCR processes on a supercomputer grid reduces the time for computation and promotes the improvement of accuracy by pursuing multiple strategies. The work focuses on improving accuracy by handling and manipulating two parameters of the image binarization process. Two piped functions on a selected page perform contrast normalization of the background and binarization using the Sauvola algorithm [241]. The corpus contains text both in Latin and polytonic Greek character sets. Accordingly, the OCR experiment on the corpus encompasses two different tasks, in order to measure the time saved thanks to parallelization: (a) the first task ran on a number of nodes smaller than the total number of pages; (b) the second task ran on a number of nodes corresponding to the total number of pages. Fig. 4.6 illustrates how the pre-processing of a page with different parameters is reflected in the expected accuracy. 64 nodes in parallel perform OCR on the same selected page, in order to tune the parameters according to the following rules:

ID_{norm} = \frac{rank_i}{8} \qquad (4.2)

ID_{bin} = rank_i \bmod 8 \qquad (4.3)

In this way, a vector of different parameters has been created and each node, based on its process identification rank_i, determines a combination of parameters:

\langle ID_{norm}, ID_{bin} \rangle \qquad (4.4)
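The following minimal sketch (MPI initialization and image processing omitted; the names are illustrative) shows how each of the 64 parallel processes could derive its own parameter pair from its rank, according to formulas 4.2 and 4.3:

    // Minimal sketch: maps an MPI-style process rank to a pair of image-processing
    // parameter indices, following formulas 4.2 and 4.3 (64 ranks -> 8x8 combinations).
    public class ParameterGridSketch {

        static int[] parametersFor(int rank) {
            int idNorm = rank / 8;   // formula 4.2: ID_norm = rank_i / 8 (integer division)
            int idBin  = rank % 8;   // formula 4.3: ID_bin  = rank_i mod 8
            return new int[] { idNorm, idBin };
        }

        public static void main(String[] args) {
            for (int rank = 0; rank < 64; rank++) {
                int[] p = parametersFor(rank);
                System.out.println("rank " + rank + " -> <IDnorm=" + p[0] + ", IDbin=" + p[1] + ">");
            }
        }
    }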


Figure 4.6: Image processing parameters - Courtesy of F. Boschetti

Fig. 4.7 illustrates how to assess the accuracy of large-scale OCR applied to Ancient Greek texts by estimating whether the Greek words are correctly recognized. The diagram shows a weighted way to evaluate the Greek text recognition. The score is important to compare different OCR outputs obtained with different approaches on the same page images. In the current work, the score is relevant only to rank the OCR performances; its absolute value is largely underestimated, due to the presence of Latin words, which are wrongly counted as errors.

Figure 4.7: Greekness score evaluation

4.1.2 Character and text encoding

Designing applications for literary computing and writing software to process historical texts pose challenges from the point of view of character and text encoding [242]. As introduced in section 2.4.1, this encompasses at least two types of problems:

• Character encoding;

• Document encoding.

This section presents a case study carried out within a PRIN project, in which specific principles have been applied in order to handle both legacy and Text Encoding Initiative (TEI) eXtensible Markup Language (XML) documents (Fig. 4.8), as described in chapter 3. The TSLib component for content management aims at making the documents processable within a collaborative environment. This component is in charge of analyzing texts represented by formal schemes, through shared and standardized markup languages (see chapter 2).


Figure 4.8: Data Model versus External Data Format

Fig. 4.2 shows a page from one of the electronic documents processed by the library, in particular page 29 of the Théorie des sonantes edited by Marchese. The picture, which reproduces f. 40r of the manuscript BGE Ms. fr. 3955/1, makes it possible to appreciate the transcription work done by this scholar. The data recovery component must handle the following aspects: a) the erroneous interpretation of certain text characters; b) the publisher's notes in the text; c) the use of titles to indicate the number and the beginning of the folio; d) the use of stylistic conventions such as italics, bold, underline, type of font, etc.; e) the footnotes with the publisher's contributions to the reconstruction of the history of the manuscript. The component in Fig. 4.8 is functional to processing the resource and extracting the information contained in the document. The latter needs to be processed through the assignment of the binary codes indicated by the Unicode standard coding system. Problems occur when non-standardized systems are used with the aim of graphically simulating these symbols. In these cases, a systematic correction is necessary in order to retrieve the original intention of the editor (Marchese) and, consequently, of the author (Saussure). In general, all the characters that do not have any correspondence in this coding table are problematic.


The data acquisition module recognizes such problems and uses techniques based on regular expressions, statistical functions and heuristic processes. In the pre-processing phase the transcoding operation is performed and the document is encoded in a well-formed manner. The text is initially read character by character; this procedure tries to figure out whether the associated binary code is valid or whether it needs to be verified. At the moment this operation follows rule-based procedures; nevertheless, the component can be extended by means of statistical mechanisms (see chapter 2). The transcoding module of the library lists all the codepoints considered invalid and associates each of them with a correct substitute codepoint thanks to heuristics and semi-automatic procedures. The characters associated with an invalid coding are compared against appropriate conversion tables in order to produce the correct substitutions [243]. The character substitution module can be extended through more complex systems based on spell-checkers or on statistics of n-gram co-occurrences.
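A minimal sketch of such a codepoint substitution step is given below; the conversion-table entry is a placeholder and does not reproduce the actual mappings used in the project:

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative sketch: invalid or legacy codepoints are looked up in a
    // conversion table and replaced with their Unicode counterparts.
    public class TranscoderSketch {

        private final Map<Integer, Integer> conversionTable = new HashMap<Integer, Integer>();

        public TranscoderSketch() {
            // hypothetical example: a legacy private-use codepoint mapped to a Greek letter
            conversionTable.put(0xE001, 0x1F00);
        }

        public String transcode(String legacyText) {
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < legacyText.length(); ) {
                int cp = legacyText.codePointAt(i);
                Integer replacement = conversionTable.get(cp);
                out.appendCodePoint(replacement != null ? replacement : cp);
                i += Character.charCount(cp);
            }
            return out.toString();
        }
    }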

Subsequently, the original file in legacy format triggers a second software component that aims to isolate stylistic and structural information. As an example, in Fig. 4.2 it is possible to identify: (1) the number of the folio (f. 40r); (2) the content of the sheet (all the text extracted between two successive sheets); (3) the footnotes (the text of the footnote and the position of the related word); (4) all the document styles. Two software libraries have been employed in the development of the component described above: the first facilitates the access to and the reading of a document (TIKA, see section 2.5.1), while the other provides the programmer with adequate APIs to generate and manipulate XML data (jDOM). At the end of this process each textual entity of the collection (Source class, see chapter 3) is mapped onto the library data model (see section 3.4). The library provides a series of methods in order to serialize the digital representation of text documents into a TEI-XML compliant file [194] (Fig. 4.8).

Objects that represent the whole document, or interrelated documents, are initialized through the parsing of the original document and the creation of a new data structure. The latter decouples the orthogonal information conveyed by the XML elements: a) textual structure, b) semantics, c) style, and d) behavior. Noticeably, the new data structure can result from the transformations (eXtensible Stylesheet Language Transformations (XSLT) on the Document Object Model (DOM), or Simple API for XML (SAX) event-driven transformations) managed during the parsing process. In turn, as explained in chapter 3, each textual entity is composed of a version, a granularity, an interpretation and a position in the document. Therefore, the parsing process concerns the following aspects:

1. Textual structure. A document originally structured paragraph by paragraph for literary analysis can easily be restructured page by page to pass through layout analysis or to be compared with the original page image.

2. Semantics. At the semantic level, both attributes (such as @type) and tag names are processed in the same way and linked to the related DOM node.

3. Style. The style is managed by separate components, which point at the textual positions affected by stylistic features. For instance, the information extracted from the @style attribute is used to instantiate the Java objects devoted to managing the rendering information.

4. Behavior. Behaviors are handled by objects that process textual resources according to the current state of the data structure and the rules to manage such a state. For example, a hyphenator performs its tasks according to the language of the textual data (e.g. the hyphenation rules for the Italian language).

The textual scholarship library, unlike most current initiatives, which focus on the transformation of one XML document structure into another by means of XSLT, instantiates a collection of objects which map the representation of the supplied documents (see chapter 3). The aforementioned case study considers only a small subset of TEI elements as basic taglib (see section 4.4). Fig. 4.9 illustrates the composition of widgets rendered on the client through rich standard web technologies (HTML5, CSS3, jQuery, D3.js). The Graphical User Interface (GUI) allows scholars to browse both texts and images. The links in Fig. 4.9 refer to annotations made by the author of the manuscript, to which the editor refers in the critical apparatus. The acquisition component reads the input document and dynamically generates the XML schema (see section 3.5.2).

Figure 4.9: The GUI of the text-image framework

Factories (see section 3.5.2) instantiate the actual objects implementing or extending the interfaces (or abstract classes) that handle the correct textual phenomena. For instance, the application client that uses the library invokes the building method of a builder class. The resulting document object is a concretization of an abstract class representing the current structure of the input resource, as illustrated in the following Java statement:

    Source teiDocument = BuilderFactory.buildDocument(
            new File("features.properties"),
            new File(prop.get("sourceDocument")));

The builder object needs two input files: a) a property file containing the suitable configuration for the instantiation of the concrete objects; b) the XML-TEI file to parse. The state of the internal representation of the document is based on the composite pattern and complies with the data model introduced in chapter 3, section 3.4. In this way, each single element is handled by the Source class, which is the component entity of the pattern and represents each node of the hierarchical structure. Moreover, the information conveyed by the TEI file is distributed among the appropriate Java objects that handle the four levels described above. The leaves of the hierarchical structure are instances of the Text class. The methods in such a structure offer the possibility to manipulate the content and the structure of the resources. As a result, the actual implementation of the library content component exposes methods that parse the XML file and create Java objects. The resources are stored and maintained in a native XML database management system (i.e. eXist-db), while the APIs and services provided by Lucene have been used for indexing the textual data. As mentioned in chapter 3, different collections of texts can provide or ignore some extratextual information (such as line or page numbers), or they can arrange texts in different ways (e.g. lines can be grouped or not inside ... elements of the schema, etc.). For this reason, the XSD schema is generated a posteriori from the actual representation of the texts. By studying the schemes, XSLT transformations are created in order to deal only with the relevant information and with canonical formats processed by the suitable components. Finally, in spite of the data structure being an object-oriented representation of the entities in the real domain of the digital document, the storage paradigm is customizable through the adoption of integration mechanisms and data access solutions [200]. This case study shows that standard file formats used to encode and exchange textual data have to be handled and abstracted by Application Programming Interfaces. In this way TSLib components allow scholars to read, write, and transform file formats. In addition, the APIs make it possible to change the format of a file safely.
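To make the composite structure described above more concrete, the following simplified sketch shows a Source-like component whose leaves hold the character data; the class and method names are illustrative and do not reproduce the actual TSLib API:

    import java.util.ArrayList;
    import java.util.List;

    // Simplified illustration of the composite structure described above:
    // a Source-like node plays the component role, a Text-like node the leaf role.
    class SourceSketch {
        private final List<SourceSketch> children = new ArrayList<SourceSketch>();

        void add(SourceSketch child) { children.add(child); }

        // Concatenates the textual content of the whole subtree.
        String content() {
            StringBuilder sb = new StringBuilder();
            for (SourceSketch child : children) {
                sb.append(child.content());
            }
            return sb.toString();
        }
    }

    class TextSketch extends SourceSketch {
        private final String value;
        TextSketch(String value) { this.value = value; }
        @Override String content() { return value; }
    }

A document node would then aggregate section and paragraph nodes, whose leaves are text instances holding the character data.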


4.2 Indexing

Once acquired, the correctly encoded and stored textual resources are available for further analysis and processing. The library for textual scholarship deals with several important matters (see chapter 3, section 3.2). Among them, indexing is one of the fundamental functionalities for literary computing needs [244]. Textual scholarship applications require indexing features for different kinds of textual phenomena, such as domain terminology or lemmatization. It follows that the search component of the library must handle indexes (and concordances) for each piece of information that scholars deem significant to study. To this purpose, for instance, multi-language index components have been developed [243]. An index is generally an auxiliary structure able to ensure efficient access to information following an external request [245]. Therefore, the philological library must handle automatic and dynamic indexing of the textual data. In order to do so, the library includes a component that deals with organizing and retrieving relevant textual phenomena from document collections. At the same time, textual scholarship applications must provide accurate results in a short time, as pursued by information retrieval specialists through the study of new methods and techniques. In order to achieve that, literary computing needs software modules able to create indexes for storing and manipulating data, so as to effectively support scholarly activities [246]. Through this process, indexes allow scholars to identify and retrieve the textual contexts containing the terms they are searching for. Moreover, index components allow the content of a text to be grouped in lists which are usually ordered alphabetically by word-forms, word-lemmas, word-frequencies, and so on. This method is one of the cornerstone tools of textual inquiry and it is known as "index and concordances" [246]. Noticeably, the textual scholarship library has two distinct entities for indexing, as the following list points out:

• back-end index: it represents the internal “data structure” of the stored information (it means how raw data are organized inside the computer per- sistence unit);

• front-end index: it indicates a uniform and ordered list of key-terms (which may be the chapter headings, paragraphs, word-forms of text or the lemmas, extracted through text processing or linguistic analysis, etc).


Currently, the component adapts the Lucene library (see chapter 2, section 2.5) thanks to specific Adapter classes (see chapter 3, section 3.5.2). The indexing component, developed in the context of the aforementioned case study, allows scholars to identify and locate relevant parallel contexts with custom granularity and inter-linking features (see chapter 3, section 3.4). This means that the data model manages different objects at different levels of granularity, often resulting in overlapping structures (see chapter 2, section 2.6.1). In general, the component deals with four kinds of typed sources (see chapter 3, section 3.4): tokens, lines, sentences, and textual fragments. Fig. 4.10 shows the XML schema implemented for the marshalling process. The latter follows the TSLib data model according to the composite pattern introduced in chapter 3, sections 3.4 and 3.5.2. The model allows data to be represented at different levels of granularity and also to be linked together. The indexing process developed within the case study aims at building a systematic index which allows search operations to be performed. Currently, the process implements several piped activities: 1. tokenization, 2. filtering/normalization, and 3. segmentation. It is possible to see these tasks as a workflow where the data input derives from the previous job and the data output is formatted and structured so that it can be used by further processes [247].
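As a minimal sketch of what such an adapter might do with plain Lucene calls (a Lucene 5+ API is assumed; the field names are illustrative and only mirror the index attributes discussed below), consider the following fragment:

    import java.nio.file.Paths;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    // Minimal sketch of a Lucene-backed indexing step (field names are illustrative).
    public class LuceneAdapterSketch {

        public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.open(Paths.get("tslib-index"));
            IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
            IndexWriter writer = new IndexWriter(dir, config);
            try {
                Document doc = new Document();
                // identifier of the fragment the indexed text belongs to (stored, not tokenized)
                doc.add(new StringField("fragmentId", "frag-001", Field.Store.YES));
                // full-text content, analyzed and searchable
                doc.add(new TextField("content", "arma virumque cano", Field.Store.YES));
                writer.addDocument(doc);
            } finally {
                writer.close();
            }
        }
    }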

Tokenization A token is a sequence of alphanumeric codes detected in the electronic text. It is an independent and uniform elaboration unit. Tokenization, therefore, is the activity devoted to the identification of tokens from a text stream. In most Western languages a token corresponds to a graphic form preceded and followed by a blank space or a punctuation character. Tokenization is a source of problems for many textual scholarship phenomena, such as the correct handling of punctuation (e.g. in the presence of abbreviations and acronyms), as well as when multi-words need to be taken into account. In languages such as Swahili or German and in Arabic, which present agglutinating features, a whole sentence can be expressed by a single sequence of characters. In these cases a tokenization algorithm has to include a segmentation phase in order to recognize the basic units.
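A minimal, locale-aware tokenization sketch based on the standard java.text.BreakIterator is shown below; it covers the simple Western-language case, while agglutinative or unsegmented scripts would still require the dedicated segmentation phase mentioned above:

    import java.text.BreakIterator;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Locale;

    // Minimal sketch of a locale-aware tokenizer for the simple Western-language case.
    public class TokenizerSketch {

        public static List<String> tokenize(String text, Locale locale) {
            List<String> tokens = new ArrayList<String>();
            BreakIterator words = BreakIterator.getWordInstance(locale);
            words.setText(text);
            int start = words.first();
            for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
                String candidate = text.substring(start, end).trim();
                // keep only candidates containing at least one letter or digit
                if (!candidate.isEmpty() && candidate.codePoints().anyMatch(Character::isLetterOrDigit)) {
                    tokens.add(candidate);
                }
            }
            return tokens;
        }

        public static void main(String[] args) {
            System.out.println(tokenize("Arma virumque cano, Troiae qui primus ab oris.", Locale.ITALIAN));
        }
    }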

Filtering/normalization Information is cleaned up and enhanced through the filtering/normalization phase: each token is analyzed and submitted for further processing. The actions usually performed to accomplish this task are (i) management of stop words, (ii) identification of the stem, (iii) use of thesauri and spell-checking, and (iv) association of a score to the words (weighting).

Figure 4.10: XML schema implementing the TSLib data model

Scholars often know what to look for, but they ignore how to describe it formally [236]. Hence, index components provide several techniques, such as stemming and thesauri, in order to face the aforementioned issue. Both the stemming techniques, which reduce textual units to common sequences of characters, and the use of thesauri, which allow search engines to replace similar terms with canonical ones [248], tend to increase the amount of information conveyed by the original token. In doing so, textual resources grow in relevance and can also be retrieved by queries containing similar keywords. In addition, a token derived from a word spelled incorrectly may undergo correction before being stored in the index (only under certain conditions, since errors are themselves significant information for the philologists).

However, the last action is justified by the user's unawareness of possible orthographic mistakes inside the text (errors present in the original text or errors resulting from the digital acquisition). All that considered, it may be necessary to assign a score to evaluate the relevance of textual resources. The textual scholarship library, as introduced in chapter 2, section 2.6, provides a "weight/score" assignment to words for partial-matching functionality as well as boolean retrieval features. Partial-matching functionality concerns methodologies based on statistical and similarity operations, while boolean functionality refers to exact-matching algorithms and deals with techniques and methodologies where a search result is expressed by boolean operations (i.e. OR, AND, NOT). The words in a text have different importance, and the library component for indexing identifies the differences among words. That means that the term frequency inside a document or within an entire collection can be evaluated and adopted as a first weighting criterion. Along with full-text indexing, the library has developed functionality in order to enrich raw textual data with extra information (see section 4.4). Typical evaluation parameters used for indexing are search speed, exhaustiveness, specificity, precision and recall. For further details about these topics, please refer to the wide literature on the subject (e.g. [146, 249]).

Segmentation Text resources have been segmented into uniform fragments in order to create parallel textual units ruled by consistency [44]. In particular, segmentation and the successive connection among parallel textual segments are based on multiple aspects, such as semantic and/or linguistic observations, on which scholars operate thanks to the annotations recorded in the system (see section 4.4). The granularity of the division is such as to obtain benefits for the automatic analysis of resources as well as for individual scholarly analysis. Each chunk has a universal identification number; furthermore, related textual chunks also have a universal identifier, as formula 4.5 shows and as illustrated in chapter 3, section 3.4. The latter identifier determines the relation between textual chunks.

Id_{pair_n} = f(\langle id_{chunk_i}, id_{chunk_j} \rangle) \qquad (4.5)

Unique identifiers assigned to each textual entity, along with the division of a text into homogeneous segments, facilitate the indexing process.

This is because textual elements have a well-defined indexing unit and significant contexts are available in the retrieved result sets. The list below shows the definition of the back-end index attributes, which are populated by the indexing component of the library; a minimal sketch of a corresponding record follows the list. The ConLL1 data format, which is a common data representation largely adopted in the natural language processing field, has inspired the schema of the aforementioned index. The index has a number of attributes that reflect the following points:

• the Token field represents the processing unit extracted from the text, on which all the subsequent steps of the analysis are grounded;

• the Fragment field consists of the identification number (ID) of the fragment to which the token belongs;

• the Offset field expresses the position of the token within the fragment. Thanks to this data it is possible to take into account the proximity relations;

• the Lang field refers to the language or alphabet of the token;

• the Status field defines additional information: for example, the token could be part of a polyrematic (multiword) expression or express agglutinative phenomena. Terms with graphic variants or tokens belonging to a hyphenated term can be managed through the Status field;

• the Normalization field provides the possibility of transforming and/or harmonizing the tokens, e.g. yielding the word in uppercase form (as in Greek) or eliminating certain signs;

• The Extension field is left free for any future customization, such as the management of the variant readings;

• The fields POS, Lemma, and Root specify the morpho-syntactic information derived from the linguistic analysis.
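The minimal sketch announced above renders the record in plain Java; the field set follows the list, while the types are assumptions made for illustration:

    // Plain-Java rendering of the back-end index record described above.
    public class IndexRecordSketch {
        String token;          // processing unit extracted from the text
        String fragmentId;     // ID of the fragment the token belongs to
        int offset;            // position of the token within the fragment
        String lang;           // language or alphabet of the token
        String status;         // multiword, agglutination, graphic variant, hyphenation, ...
        String normalization;  // normalized/harmonized form of the token
        String extension;      // free slot for future customization (e.g. variant readings)
        String pos;            // part of speech
        String lemma;          // lemma
        String root;           // root (morpho-syntactic information)
    }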

A rich index makes it possible to apply techniques for better document retrieval as an accurate result of the query. Consequently, the indexing process instantiates data structures such as the inverted index [249]: for each significant phenomenon, this consists of a set of records containing the wanted word and a sequence of pointers directed towards the information related to it.

1 For further details, please visit http://ilk.uvt.nl/conll/

These kinds of structures are constituted by a list of two-dimensional vectors <k, d>, where k stands for the key-term and d stands for a list of references to the text. The index can store statistical parameters, such as the term frequency in a single document or in the whole collection/corpus. One of the relevant outcomes of this case study is the possibility of performing combined advanced searches, i.e. finding textual contexts within the scope of one language while applying restrictions on the parallel one. This special approach provides a method for studying complementarity or links among the texts. In conclusion, scholars are able to perform queries based on a rich index which includes several data, such as: a) the linguistic difference between one term and another; b) the position in which textual chunks appear; c) the frequency with which a term appears in the document; d) the status of the search term.

4.3 Alignment

As mentioned in chapter 2, computational philology requires procedures able to align different kinds of textual entities at different levels of granularity, for instance when comparing the witnesses of a text in order to discover their differences and support the editing of a critical apparatus [6, 35, 250]. Moreover, these kinds of entities can differ both in their digital structure and in their inherent nature. For example, as explained in [239], scholars often need to align texts at character granularity in order to evaluate the accuracy of OCR tools (i.e. the alignment of the automatic outcomes with the ground truth); furthermore, applications like CollateX [80] (see section 2.3.4) need to align different texts at word granularity in order to compare variant readings (Fig. 4.11), and cross-lingual studies align texts in different languages in order to investigate different aspects of a literary tradition (Fig. 4.12). From a computer science point of view, the alignment process uses data structures such as n-grams, tries, and suffix trees to speed up the computation and to reduce space usage (see chapter 3). In addition, fuzzy matching techniques, as well as machine translation methods, are common tools adopted in textual alignment applications [45, 251]. The present case study aims at aligning different documents in different languages.

Figure 4.11: Output of the aligner on Odyssey French translations

Figure 4.12: Greek against Arabic alignment with transpositions

The module attempts to segment related texts by means of textual processing, such as pattern matching by regular expressions, and Natural Language Processing methods for information extraction, such as proper name recognition. The process tries to discover hapax words or low-frequency terms in order to use them as inter-textual anchors for parallelization purposes. Thereafter, aligned textual chunks can be linked and manually reviewed. Furthermore, fine-grained comparisons can be performed by global alignment algorithms. Indeed, the aligner module implements the Needleman-Wunsch algorithm [81]. This is a dynamic programming solution which aligns every character of the compared sequences; the algorithm uses two 2-D matrices, one to perform the alignment and the other for similarity evaluation.
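A minimal sketch of Needleman-Wunsch global alignment at character granularity is reported below; the scores (match +1, mismatch -1, gap -1) are illustrative and do not reproduce the ones used by the TSLib aligner:

    // Minimal sketch: score matrix plus a traceback step producing a gapped alignment.
    public class NeedlemanWunschSketch {

        static final int MATCH = 1, MISMATCH = -1, GAP = -1;

        public static String[] align(String a, String b) {
            int n = a.length(), m = b.length();
            int[][] score = new int[n + 1][m + 1];
            for (int i = 1; i <= n; i++) score[i][0] = i * GAP;
            for (int j = 1; j <= m; j++) score[0][j] = j * GAP;
            for (int i = 1; i <= n; i++) {
                for (int j = 1; j <= m; j++) {
                    int diag = score[i - 1][j - 1]
                            + (a.charAt(i - 1) == b.charAt(j - 1) ? MATCH : MISMATCH);
                    int up = score[i - 1][j] + GAP;
                    int left = score[i][j - 1] + GAP;
                    score[i][j] = Math.max(diag, Math.max(up, left));
                }
            }
            // traceback from the bottom-right corner
            StringBuilder alignedA = new StringBuilder(), alignedB = new StringBuilder();
            int i = n, j = m;
            while (i > 0 || j > 0) {
                if (i > 0 && j > 0 && score[i][j] == score[i - 1][j - 1]
                        + (a.charAt(i - 1) == b.charAt(j - 1) ? MATCH : MISMATCH)) {
                    alignedA.append(a.charAt(i - 1)); alignedB.append(b.charAt(j - 1));
                    i--; j--;
                } else if (i > 0 && score[i][j] == score[i - 1][j] + GAP) {
                    alignedA.append(a.charAt(i - 1)); alignedB.append('-');
                    i--;
                } else {
                    alignedA.append('-'); alignedB.append(b.charAt(j - 1));
                    j--;
                }
            }
            return new String[] { alignedA.reverse().toString(), alignedB.reverse().toString() };
        }

        public static void main(String[] args) {
            String[] result = align("menin aeide", "manin aide");
            System.out.println(result[0]);
            System.out.println(result[1]);
        }
    }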


Due to its wide applicability, the design of a software component dealing with the alignment process emphasizes the general principles underpinning the development of a general library of components for textual scholarship. Two aspects characterize such a design: 1) considering and putting into effect general and proven solutions to recurring design problems in a specific application context [212, 206], and 2) taking into account Application Programming Interfaces as the means to separate services from the implementation details. In this way the aligner component, firstly developed as a specific module for OCR alignment and later generalized [252], strives to be flexible and reusable and keeps its internal structures as loosely coupled as possible. Consequently, the design of the aligner component exploits not only basic object-oriented mechanisms, such as abstraction and polymorphism, but also advanced principles and techniques, such as object cooperation, interface programming, separation of responsibilities and resilience to changes. Fig. 4.13 shows the Unified Modeling Language (UML) class diagram of the aligner component.

Figure 4.13: The alignment component. UML class diagram

The component design attempts to achieve a high degree of extensibility with a low degree of modification (Open-Closed principle [208]). In particular, the aligner adopts two kinds of design patterns: a) the Factory pattern and b) the Strategy pattern. The Factory pattern, implemented by the AlignerFactory class and by the SimEvaluatorFactory class, provides components to separate behavior from concrete implementations. Indeed, clients of the library have to work only with the


Abstract Data Types, which fall outside the implementation mechanisms. Therefore, clients rely only on well-defined APIs which are provided and exported by the component services. In these circumstances the user of the aligner only knows the interfaces of the three external classes, namely the Aligner, the input data type, and the output data type (the ObjectList and Alignment classes, respectively). The second design pattern adopted for designing the aligner is the Strategy pattern. It allows a family of algorithms to be defined, free to vary according to customizable policies. Actually, the alignment process (ObjectListAligner) can use a number of algorithms beyond the Needleman-Wunsch one and a number of mechanisms to evaluate the similarity between sequences of general entities (TreeSimEvaluator or UpperCaseSimEvaluator). For this reason, the Strategy pattern represents an important aspect in the design of the aligner APIs.
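The following fragment illustrates how the two patterns cooperate in an aligner API of this kind; the interfaces and signatures are simplified and do not reproduce the actual TSLib classes:

    import java.util.ArrayList;
    import java.util.List;

    // Strategy: pluggable similarity policy.
    interface SimilarityEvaluator<T> {
        double similarity(T left, T right);
    }

    interface GenericAligner<T> {
        List<List<T>> align(List<T> left, List<T> right); // rows of aligned entities (null = gap)
    }

    // One concrete strategy: case-insensitive equality between word tokens.
    class CaseInsensitiveEvaluator implements SimilarityEvaluator<String> {
        public double similarity(String left, String right) {
            return left.equalsIgnoreCase(right) ? 1.0 : 0.0;
        }
    }

    // A deliberately naive aligner that only pairs positions one by one;
    // a real implementation would run Needleman-Wunsch over the two lists.
    class PositionalAligner<T> implements GenericAligner<T> {
        private final SimilarityEvaluator<T> evaluator;
        PositionalAligner(SimilarityEvaluator<T> evaluator) { this.evaluator = evaluator; }

        public List<List<T>> align(List<T> left, List<T> right) {
            List<List<T>> rows = new ArrayList<List<T>>();
            int size = Math.max(left.size(), right.size());
            for (int i = 0; i < size; i++) {
                List<T> row = new ArrayList<T>();
                row.add(i < left.size() ? left.get(i) : null);
                row.add(i < right.size() ? right.get(i) : null);
                rows.add(row); // evaluator.similarity(...) would drive gap decisions here
            }
            return rows;
        }
    }

    // Factory: clients obtain an aligner through the interface only,
    // without depending on the concrete implementation class.
    class AlignerFactorySketch {
        static <T> GenericAligner<T> newAligner(SimilarityEvaluator<T> evaluator) {
            return new PositionalAligner<T>(evaluator);
        }
    }

A client would then obtain an aligner from the factory and pass it the similarity strategy best suited to the entities being compared.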

4.4 Variant reading and multi-level analysis

Once data have been encoded, proofread and indexed, content enrichment is a further task. As mentioned in chapter 3, the TSLib data model handles the content enrichment of textual phenomena by means of annotations. This technique separates the structure of the text from its content and analyses, based on a layered view of the document (philological, linguistic, metrical, stylistic, etc.). The stand-off markup approach and the Canonical Text Services (CTS) notation have been used to manage the data yielded by the document analysis (see chapter 2, section 2.4.2). The annotation component is useful to perform collaborative annotations and to arrange canonical links between selected text, such as named entities, and authoritative resources on the web infrastructure. Annotations can be associated with chunks of text (e.g. a sentence and its translation, Fig. 3.12 in chapter 3) or with single, independent chunks of text (e.g. a single word of the original text). The annotation framework has also been adapted and extended for specific scientific purposes [44], as illustrated in section 4.4.1. Finally, GUIs, infographics concepts, and multi-dimensional scaling techniques allow scholars to compare, understand and choose different variant readings (source versions) of a text through computational methods [253, 254]. The multi-level annotation component has been developed within different research projects hosted at the Institute for Computational Linguistics (ILC) of the Consiglio Nazionale delle Ricerche (CNR).
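A minimal sketch of a stand-off annotation record is given below; the field names and the layer vocabulary are illustrative, and the CTS URN is only an example of the notation:

    // Minimal sketch of a stand-off annotation: the annotation never touches the
    // encoded text, it only points to it through a CTS URN plus a character range.
    public class StandoffAnnotationSketch {
        String targetUrn;   // e.g. "urn:cts:greekLit:tlg0012.tlg001:1.1" (CTS passage)
        int start;          // character offset of the annotated span within the passage
        int end;
        String layer;       // e.g. "philological", "linguistic", "metrical", "stylistic"
        String value;       // the annotation body (a variant reading, a POS tag, a comment, ...)
        String author;      // the scholar who produced the annotation
    }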


4.4.1 Variant reading annotations

Chapter 3, section 3.4, highlights the core entities of the library. In particular, the object model takes into account the collection of the manuscripts (witnesses / tradition) in order to reconstruct the text (critical edition). However, the present case study, dealing with variant reading annotations, does not solely imply the use of some kind of diff-like algorithm across the source transcriptions [87]. On the contrary, the designed module is an interactive, computer-assisted system developed in order to address scholars' needs within the editorial activity. This means that scholars have to study every word, one by one, and they have mechanisms to annotate the differences found. Fig. 4.14 shows the variant readings annotation GUI dealing with: a) gathering the images of each page; b) transcribing the best acknowledged manuscript; c) comparing all the different versions word by word, in order to create a complete and accurate record of the differences among the versions. The annotated variant readings can be consumed by processing systems for further analysis. The computational tool can significantly facilitate the management, usability, production, and research within textual scholarship. Indeed, the method described in this work stems from the philological workstation project carried out at ILC [35]. The user scenario requires that scholars select what they consider to be the best witness. Afterwards, the library handles links between each word of the image and the related transcription word. These connections provide great benefit for checking the correct transcription of words which are difficult to read [25]. This approach supports scholars in recording textual phenomena in order to compare the text conveyed by the different witnesses of the tradition. Therefore, the component can implement any algorithm (see chapter 2, section 2.3.4) to formulate outcomes according to the types established by the scholars. The context introduced above prefers machine-scholar interaction to fully automatic textual processing. For its part, the target community believes that only scholars are familiar with the manuscripts and therefore that they are the only ones entitled to identify the features of different readings on the basis of which it is possible to classify variants. Later on, the variant reading module of the library leverages the features that scholars have recorded to compute similarity indexes among all the variant readings (see chapter 2 for similarity measures). As a consequence, classes of variants describing the textual tradition provide relations of similarity weights among documents by processing the annotated text fragments [255].

Figure 4.14: The variant readings annotation example

Moreover, the data supplied to the system can be processed and represented in some graphical form [125, 254]. For instance, this case study has investigated multi-dimensional scaling techniques for performing data processing aimed at graphical visualization. This method allows the text of each manuscript to be reconstructed on the basis of the recorded variant readings. In addition, all the other sources in the source collection can be automatically and dynamically obtained thanks to the (positive) apparatus, which stores all the possible textual variations. The aforementioned process can also be used in the case of a unique document, as shown in Fig. 4.9. Textual variation concerns the changes that the author of the handwritten document made. For this reason both the software component and the scholars need to consider these changes as variant readings [256]. This means that the changes can be managed as readings referred to different sources (see chapter 3, in particular section 3.2). Thanks to the facsimile, it is sufficient to perform basic image processing, like zooming, to obtain high-quality readings.

Furthermore, adding functionality for optical filters sensitive to the infrared and ultraviolet bands makes it possible to achieve better results [151, 147, 152].

4.4.2 Multi-level analysis

Digital textual scholarship starts by processing machine-actionable raw text or images of the primary sources. The annotation phase can then begin, in order to structure the logical blocks of the document as chapters, paragraphs and so on. Another important tool for scholars' needs is linguistic processing [257]. Computational fields like Natural Language Processing (NLP) are the best means to perform recognition, extraction and formal annotation of these kinds of information [258]. Accordingly, texts have distinct levels of analysis at different units of granularity. For instance, documents involve an orthographic level (e.g. character encodings, tokenization, sentence detection), a phonological level, such as sound and metrics, and a morphological level (word inflection and Part of Speech (POS) tagging); the ordering of words encompasses the syntactic level (parsers), whereas the meaning of individual words involves the semantic level (Named Entity Recognition (NER), word meaning disambiguation). Finally, the use of words in a particular context involves the pragmatic level, while dialogues among people refer to the discourse level (co-reference resolution). Against this background, as illustrated in the earlier chapters, the different levels of analysis can be seen as distinct layers [259, 204], ranging from orthography, which takes into account the smallest units, to pragmatics and discourse on the highest layer. It follows that linguistic information added to the content of a document simplifies its understanding and, consequently, enables a better reading of the text. In addition, document analysis allows not only a human agent, but also machines, to process textual content on different informative levels and at various granularities in order to manage the knowledge contained in documents. Common document analyses involve automatic linguistic metadata labeling, for instance enriching transcriptions with part-of-speech or morphological information. NLP technologies include parsers and annotated corpora, automatic term extraction, information retrieval tools, and methods for automatically generating relationships to related entities (Linked Open Data (LOD)).

For example, a lemmatized and morphologically annotated document points out the lemma and the morphological features of all its words; this allows scholars to search not only the single forms in the text, but also all the occurrences of a given lemma. The analysis process has two main objectives, aiming at increasing the digital significance of the sources: (1) building a systematic index enabling search operations to be performed quickly (see section 4.2); (2) associating with each linguistic unit morpho-syntactic and semantic information and other remarks. Specific requests (queries) can be sent to the system, exploiting the information added to the text, and the relevant results can be obtained in a reasonable time. Below is an example of an advanced query that users might submit:

term_Aj NEAR 3 lem_Ak AND term_Bj NEAR 1 verb_Bk

This means that users can search for all the pairs of inter-textual excerpts containing a given term attested in a text A next to a given lemma at a distance of three words (NEAR 3) from the first term (the textual chunks which scholars are looking for), and a precise term in a text B (term_Bj) followed by a specific verb (verb_Bk). This case study requires the definition of a workflow which includes two preliminary steps: transcribing the documents written in historical languages and translating them into Italian and English. Transcriptions and translations are encoded by using consistent TEI-XML markup. Hence, as Fig. 4.15 illustrates, the multi-level analysis process encompasses several phases: 1) sentence detection; 2) tokenization; 3) linguistic annotation, i.e. morphological analysis as well as lemmatization; 4) lexical annotation; 5) semantic annotation. Starting from the assumption that a text has to be linguistically enriched, the TSLib module starts by dividing it into sentences. The points where a sentence starts and ends can be previously marked up in the structured document or can be automatically detected by means of suitable techniques such as Machine Learning or regular pattern recognition. After that, each sentence is divided into words (tokens). The task of identifying the tokens is called "tokenization". Sentences and tokens are the input units for further analysis, like part-of-speech tagging. Sentence detection, tokenization, linguistic annotation and named entity recognition are all well-known NLP tasks. The lexical and semantic annotations exploit ontology schemes inspired by already existing ontologies or conceptualization schemes, as explained in chapter 2. It ensues that such tools need to support both document markup and information extraction. The textual scholarship library attempts to meet the aforementioned requirements.

Figure 4.15: TSLib module for text analysis and segmentation

Indeed, the library encompasses a sentence splitter and a tokenizer, both of which assign an identifier to each source element (see chapter 3, specifically section 3.4) of the digitized texts. In particular, the CTS-compliant structure (see chapter 2) guarantees a global protocol and a citation schema by adopting unique Uniform Resource Name (URN) identifiers. It is worth noting that CTS specifies a protocol for citation purposes; consequently, the policies for data persistence are decoupled from the serialization schema adopted. This allows the module to build unique identifiers for each word in the documents, together with the citation scheme. Fig. 4.16 shows the multi-layer annotation component with the CTS-URN notation. The TEI-XML encoded texts are broken into structural sections with divisions into blocks, paragraphs, sentences and lines in order to construct the levels of the URN. The module implements the canonical citation scheme by using the hierarchical structure of the document. Indeed, the structure of the digital sources exploits the basic TEI XML elements to provide a CTS-compliant notation (see chapter 2). This means that the markup encodes chapters, sections, sentences and abbreviations in well-defined elements in order to construct the URN identifier.
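The following minimal sketch shows how a CTS URN for a single token could be assembled from the hierarchical position of the token in the document; the namespace, the work identifiers and the sub-reference follow the Perseus-style naming convention only as an example:

    // Minimal sketch: builds a CTS URN with a sub-reference pointing at one token.
    public class CtsUrnBuilderSketch {

        static String tokenUrn(String namespace, String textGroup, String work, String version,
                               String passage, String tokenSurface, int occurrence) {
            // urn:cts:<namespace>:<textgroup>.<work>.<version>:<passage>@<token>[<occurrence>]
            return "urn:cts:" + namespace + ":" + textGroup + "." + work + "." + version
                    + ":" + passage + "@" + tokenSurface + "[" + occurrence + "]";
        }

        public static void main(String[] args) {
            // hypothetical identifiers used purely for illustration
            System.out.println(tokenUrn("greekLit", "tlg0012", "tlg001", "perseus-grc1", "1.1", "μῆνιν", 1));
        }
    }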

Figure 4.16: The multi-layer annotation component

Two well-known tools for producing a lemmatization and a morphological analysis of ancient texts are the Morpheus [260] and CHLT-LEMLAT [158] software systems. On the one hand, the Morpheus analyzer has been implemented within the Perseus project and supports Greek and Latin. On the other hand, CHLT-LEMLAT has been developed at the ILC-CNR and supports Latin texts.

The ongoing multi-level analysis component applies machine learning to estimate the correct morphological analyses for a given word in a given sentence. In case scholars disagree with the machine outcomes, they can review the analysis for proofreading. This software provides both the most probable morphological analysis (following the trigram Hidden Markov Model (HMM) implemented in the HunPos tool) and all the possible morphological analyses of an input word. The main advantages of automatic annotation are its time-saving nature and its scalability. Indeed, automatic methods produce annotations on a statistical basis, whose systematic errors are easier to detect and correct than manual mistakes. The multi-level analysis case study aims at developing a component for the TSLib, according to the data model illustrated in chapter 3. It deals with the extraction and organization of textual information which is implicitly present in the text. In particular, the component allows both scholars and machines to process the content of primary sources thoroughly, in order to better understand their textual data. This means that the component is able to systematically manage textual entities and to establish relationships among them. In addition, the component exposes textual concepts by consulting domain ontologies and semantically structured lexicons. Finally, significant entities are linked to a Named Entity Repository and exported to the cloud following the LOD principles [164]. As Fig. 4.17 shows, the components of the linguistic module provide functionality to automatically divide the content of the document into sentences and tokens, both of which have their own unique CTS identifier. In particular, the TSLib lemmatization module involves several components: the source parser of electronic texts, adapting the Tika API; the sentence splitter, through XML processing and the CTS notation; the word tokenization, by means of a rule-based process with CTS Uniform Resource Identifiers (URIs); the HMM POS tagger, exploiting the HunPos tool; the lemmatizer, accessing large lexicons of word forms; the TreeBank handler, to train the statistical classifiers; and the Proofreader component, allowing scholars to correct the erroneous outcomes. The tokenization process and its outcomes follow the same rules introduced in the indexing section (see section 4.2). Thereafter, the actual TSLib component for linguistic analysis includes an HMM POS tagging step and a look-up process for lemmatization. This means that TSLib is able to process the content and annotate each token based on the abstract data model illustrated in chapter 3. It follows that each processed token includes:

a) the CTS URN (version); b) the token detection (granularity); c) the character sequence start and end (position); d) the grammatical category (morphological layer); e) the morphological features (morpho-syntactic layer); and f) the lemma (lemmatization layer).

Figure 4.17: The TSLib lemmatization module

The lemmatization component of the TSLib has been implemented as a supervised statistical classification problem, with a lexicon for monitoring the word-form and the annotated lemma. For this reason, the adopted training set is a human-annotated resource. Indeed, the basic idea behind automatic classification is that it can produce annotations similar to those of human annotators, provided that the text to be annotated is sufficiently similar to the texts used for training the component (see chapter 2). The nature of language is ambiguous; therefore, a token processed without contextual information has different possible interpretations. Thus, Machine Learning techniques for NLP always provide contextual information: in the case of the prototype exposed here, the HMM comes with n-grams of context to better tune the probability of the tag sequence related to the supplied tokenized sentence.

\arg\max_{t_1 \ldots t_T} \; P(t_{T+1} \mid t_T) \prod_{i=1}^{T} P(t_i \mid t_{i-1}, t_{i-2}) \, P(w_i \mid t_{i-1}, t_i)
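As a purely illustrative sketch, the fragment below scores one candidate tag sequence under the trigram model above; the probability tables are placeholders, no smoothing is applied, and a real tagger such as HunPos maximizes this quantity by searching over the possible tag sequences:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative sketch: scores a candidate tag sequence under the trigram model above.
    public class TrigramHmmScoreSketch {

        // P(t_i | t_{i-1}, t_{i-2}) and P(w_i | t_{i-1}, t_i), keyed by concatenated context
        final Map<String, Double> tagTransition = new HashMap<String, Double>();
        final Map<String, Double> wordEmission = new HashMap<String, Double>();

        double score(List<String> words, List<String> tags) {
            double p = 1.0;
            for (int i = 0; i < words.size(); i++) {
                String t2 = i >= 2 ? tags.get(i - 2) : "<s>";
                String t1 = i >= 1 ? tags.get(i - 1) : "<s>";
                p *= prob(tagTransition, t2 + " " + t1 + " " + tags.get(i));
                p *= prob(wordEmission, t1 + " " + tags.get(i) + " " + words.get(i));
            }
            return p; // the end-of-sentence factor P(t_{T+1} | t_T) is omitted for brevity
        }

        private static double prob(Map<String, Double> table, String key) {
            Double value = table.get(key);
            return value != null ? value : 1e-6; // placeholder back-off value
        }
    }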

Hence, the morphological step aims at tagging the part of speech of each token through a statistical algorithm based on the HMM Machine Learning technique.

At the end of this phase, each token has a morphological label which determines its most likely morphological interpretation. Fig. 4.18 shows the format of the morphological code, which is derived from the morphological label used in the TreeBank of the Perseus project.

Figure 4.18: Example of linguistic annotation of a sentence

Additional elements of a linguistic nature are combined with each processing unit (token) and made available so that scholars can use this new information. The data enrichment process produces structured information stored in a look-up table where each token has a list of associated values. The system also manages the links to the documents in the collection. This linguistic annotation module exploits the HunPos tool, one of the most efficient and most widely used POS taggers for languages with complex morphology [165, 261]. The TSLib training set for this case study uses the Latin TreeBank, which has been manually annotated by the Perseus Project [262, 166]. The Treebank, which is a collection of sentences with morphological and syntactic level annotation, consists of 53,143 tokens and represents the syntactic trees of the sentences therein. Moreover, the training phase involves 3,474 sentences and 438 different tags. Consequently, Treebanks are worthy resources adopted by computational tasks as a basis for developing automatic annotation methods. Thereafter, lemmatization can benefit from POS tagging. In fact, the grammatical category of speech, featuring nouns, verbs or adjectives, is used to disambiguate the annotation. Token streams do not always have a single correct analysis, but different labels can be possible. Indeed, the system allows scholars to provide feedback on tagged and lemmatized words.

For example, scholars or students can annotate whether a grammatical category in a selected context is correct, whether a token is a correct word with the correct lemma and the right feature labels, or whether a certain lemma corresponds to a lexical definition in the domain ontology rather than being a false positive.

Figure 4.19: The TSLib Human correction module

Fig. 4.19 shows the GUI for the review process. TSLib performs statistical annotation; consequently, mistakes can occur and they have to be corrected by hand (using a suitable proofreading GUI). The combination of high-speed automatic annotation and high-quality human corrections is the current solution for multi-layer annotations in textual scholarship. The process is divided into four steps: 1) the module detects the sentence from the object representation of the document, which is the context for the HMM morphological analyzer; 2) the analyzer component generates its analysis for each word-form of the sentence (the accuracy of the automated disambiguation stands at 60% at the time of writing); 3) the lemmatizer evaluates the dictionary entries (and the most frequent one in the collection is kept if the word is lexically ambiguous); 4) scholars or students evaluate the lemma matched to each word-form and the morphological analysis. Indeed, each word sense must be appropriate to a given word in a given sentence (context).


Chapter 5

Conclusion and Perspectives

The research presented here has outlined a number of design aspects and concrete case studies concerning the Digital Humanities field. This thesis has mainly highlighted the computational issues and methodological principles involved in the design of a modular library for digital textual scholarship (called TSLib throughout the work). Such a work embraces different primary issues, including source acquisition, text encoding, multi-level text analysis, annotations, collaborative commenting, and the production of critical editions. Therefore, the aim of TSLib is to support scholars by handling primary and secondary sources as well as guaranteeing their long-term maintenance. As illustrated in the previous chapters, the attention paid to the fast-moving Digital Humanities field has been increasing across different research areas. In particular: a) computer scientists focus on developing or improving algorithms for document and text processing, such as phylogenetic problems, multi-version issues, OCR accuracy, indexing, etc.; b) computer engineers work on designing and implementing effective and reusable tools and virtual environments dealing with scholarly requirements, such as API design, the standardization of communication protocols, and integration mechanisms; and finally c) traditional and digital scholars leverage computational outcomes and electronic documents in order to improve their understanding of the work under investigation. The research has shown that, at the time of writing, shared methodological frameworks and flexible tools are still an open issue for this particular field. Actually, this thesis has highlighted that the current textual scholarship applications


are too often project-oriented and therefore lack generalization and abstraction. This means that many initiatives do not adequately consider reusability and evolutionary perspectives, both in terms of software artifacts and of data sets. Consequently, actual textual scholarship outcomes mainly solve limited problems or meet restricted domain requirements, such as markup practices, publishing processes, or collation issues. In addition, the computational methods and tools exploited in literary computing generally follow other disciplines, such as computational linguistics or computational biology. Indeed, in those scientific areas, theoretical models, computational methods, and community best practices are now consolidated and well-established. This work has also argued that, contrary to the usual approaches, the textual scholarship community has to consider well-known engineering strategies in its computational research for effectively enhancing the area of literary studies in the digital age. As a matter of fact, new Digital Humanities initiatives have begun to discuss how to set up effective processes for the suitable design and development of tools for literary studies. In other words, one of the aims of this thesis has been to encourage a community-driven research activity for the specification and the implementation of reusable textual scholarship components. The design of such an artifact has to encompass several points, including a community-oriented perspective, a use-case process, UML modeling, and API standardization. Furthermore, the work has discussed the basic requirements of the target domain which the TSLib deals with. These requirements describe texts that can be written on multiple supports, conveying multiple versions of the same work, which can have multiple hierarchies and interpretations at multiple levels of granularity. The results of this analysis have led to the representation of textual units as composite entities defined by an array of four properties: <version, granularity, interpretation, position>. Since application programming interfaces handle the data and behavior of the digital representation of the sources, software artifacts in literary computing have to be designed in terms of components. In this way, each component provides specialized services by means of standard and common interfaces (APIs). In turn, inter-component communication has to be designed on community-based APIs, so that they can totally decouple the interface from the internal details (information hiding). In doing so, the actual implementation of the components is totally managed by the service providers.

Afterwards, only the API specification has to remain open and community-driven. Practical case studies, derived from the work conducted in the context of funded national and international projects (mainly ERC and PRIN projects), have pointed out that the combination of a bottom-up approach (starting from specific needs) and a top-down approach (starting from abstract models) promotes the enhancement of the design solutions. Moreover, refactoring strategies, as defined in Agile processes, help in obtaining improvements and in meeting new requirements. Thus, this research is an attempt towards a methodological approach for designing and developing literary tools led by an active and dynamic international community. This can be done by adhering to and improving the framework described throughout the work. The component devoted to managing the services of the library allows providers and clients to plug in new modules, keeping the TSLib open for extension but closed for modification. Hence, the fundamental artifacts, which are conceived as components, have been organized (at present) in Java packages mapping the core modules. Up to now, seven components constitute the core of the library. These encompass the requirements gathered from the textual scholarship community and involve the management of: 1) the textual sources, 2) the facsimile representation, 3) the editing phase, 4) the analysis at different levels of processing, 5) the relationship and linkage mechanisms, 6) the indexing features, and finally 7) the visualization and rendering of the data. Therefore, the goal of this research is to implement all these components by means of coherent APIs which will be released in the form of a software library. In fact, the effort is devoted to achieving an evolvable specification of the application programming interfaces along with the description of a convenient data structure. The data model serves as the basis for concrete objects which instantiate abstract data types that meet the API specification. Indeed, differently from current habits in the Digital Humanities, the TSLib tries to abstract its data model from file formats by designing suitable ADTs and APIs. By doing so, clients can use the TSLib and exploit all the functions of the core API. However, at the time of writing, the specification of the domain that concerns digital humanities scholars is only partially accomplished. This is mainly due to the incomplete formalization of the discipline and, consequently, to the fragmentary gathering of its requirements.


Acknowledging the importance of open-source policies in modern research, the ongoing work is publicly available to the community. Moreover, the TSLib promotes the publication of scholars' work as Linked Open Data (LOD), fostering the circulation of knowledge about cultural heritage documents and texts. Accordingly, the TSLib data model is LOD compliant: its URIs are stable and properly dereferenceable through standard citation protocols.

Summing up, the library of components described so far attempts to fill a gap in the research on new technologies applied to literary studies. Although the implementation of the library is still a prototype, several projects have already exploited its model. As the case studies chapter has discussed, the TSLib model already guarantees broad and simultaneous access to the information pertaining to manuscripts and other forms of textual and linguistic sources.

Nevertheless, the work is still far from mature and complete, and consequently it has to be improved in many ways. For instance, further work should include the production of the necessary documentation, since documentation provides deeper insight into how a component may be used in typical contexts. In terms of future research and development, it is worth emphasizing that an editing component for scholars could be plugged into the system: this would support the creation of different editions of the same text and improve the comprehension of its history. In addition, the TSLib encourages digital humanists to think in terms of services and abstractions. Two levels of accessibility have to be pursued: 1) textual scholarship tools need to be accessible to scholars, and 2) the design process and the implementation methodologies of those tools need to be shared.

In conclusion, despite a number of pioneering research efforts, digital scholarly applications are relatively undeveloped compared to other textual software, such as that used in natural language processing, text mining, and bioinformatics. However, digital textual scholarship is a computational field in rapid evolution. Hence, textual tools devoted to solving scholars' needs (whether or not those scholars have technological skills) must provide ad hoc capabilities developed within a modular library for textual scholarship. Unfortunately, time constraints have limited a deeper analysis of the target domain and the full implementation of the design methodologies illustrated in the previous chapters. The wish is to have the opportunity to continue the work presented in this thesis.

Chapter 6

Acronyms

ASCII American Standard Code for Information Interchange
API Application Programming Interface
ADT Abstract Data Type
ANN Artificial Neural Network
ALPS Application Level Placement Scheduler
BMP Basic Multilingual Plane
CLARIN Common Language Resources and Technology Infrastructure
CTS Canonical Text Services
CMDI CLARIN Metadata Infrastructure
CRUD Create, Read, Update, Delete
CITE Collections, Indexes, Texts, Extensions architecture
CAS Common Analysis Structure
CNR Consiglio Nazionale delle Ricerche
CLE Cray Linux Environment
CSV Comma Separated Values
CCSL CMDI Component Specification Language
CHI Cultural Heritage Imaging Organization
DARIAH Digital Research Infrastructure for the Arts and Humanities
DH Digital Humanities
DTD Document Type Definition
DCMI Dublin Core Metadata Initiative
DC Dublin Core
DAO Data Access Object
DOM Document Object Model
DM2E Digitised Manuscripts to Europeana
DB Data Base
DOC document
DCIF Data Category Interchange Format
EDM Europeana Data Model
ESE Europeana Semantic Elements
XSLT eXtensible Stylesheet Language Transformations
ESTS European Society for Textual Scholarship
FRBR Functional Requirements for Bibliographic Records
FSF Free Software Foundation
GDocs Google Documents
GUI Graphical User Interface
GATE General Architecture for Text Engineering
HMM Hidden Markov Model
hOCR HTML-based Markup for OCR
HTML HyperText Markup Language
HTTP HyperText Transfer Protocol
HERA Humanities in the European Research Area
IEC International Electrotechnical Commission
ISO International Organization for Standardization
ILC Institute for Computational Linguistics
IFLA International Federation of Library Associations
IDE Integrated Development Environment
JSF JavaServer Faces Technology
JSON JavaScript Object Notation
LOD Linked Open Data
ML Machine Learning
MPI Message Passing Interface
MVC Model View Controller
MVD Multi-Version Document
NLP Natural Language Processing
NER Named Entity Recognition
OHCO Ordered Hierarchy of Content Objects
OCR Optical Character Recognition
OAI Open Archive Initiative
OA Open Annotation Data Model
ORE Object Reuse and Exchange
OAIS Open Archival Information System
OAC Open Annotation Collaboration
OASIS Organization for the Advancement of Structured Information Standards
OWL Web Ontology Language
OSI Open Source Initiative
PMH Protocol for Metadata Harvesting
POS Part of Speech
PBS Portable Batch System
PDF Portable Document Format
RDF Resource Description Framework
ROI Region of Interest
SAWS Sharing Ancient Wisdoms
SPI Service Provider Interface
SVM Support Vector Machine
SofA Subject of Analysis
SLES SUSE Linux Enterprise Server
SAX Simple API for XML
SSO Single Sign On
SVG Scalable Vector Graphics
SGML Standard Generalized Markup Language
SPARQL Simple Protocol and RDF Query Language
SW Semantic Web
TUSTEP Tübingen System of Text Processing
TXSTEP Tübingen XML-based scripting language for scholarly text data processing
TSLib Textual Scholarship Library
TEI Text Encoding Initiative
TE Term Extraction
UCS Universal Character Set
UTF Unicode Transformation Format
UNICODE Unique, Universal, and Uniform character enCoding
UML Unified Modeling Language
URN Uniform Resource Name
URI Uniform Resource Identifier
UIMA Unstructured Information Management Architecture
VSM Vector Space Model
XML eXtensible Markup Language
XMI XML Metadata Interchange
XSL-FO Extensible Stylesheet Language Formatting Objects
XSD XML Schema Definition
XHTML eXtensible HyperText Markup Language

Chapter 7

Acknowledgements

This work is the result of a close collaboration among many people coming from different backgrounds and institutions. First of all, I am thankful to the Institute for Computational Linguistics, and in particular to Dr. Andrea Bozzi, for his continuous encouragement and support. I am also grateful to Dr. Simonetta Montemagni, current director of the Institute, for her kindness. Moreover, I express my sincere gratitude to my supervisor, Prof. Francesco Marcelloni, for his insightful comments and help. My biggest thanks go to my supervisors, Dr. Emiliano Giovannetti and Dr. Federico Boschetti: I am touched by the way they have supported me and taken care of my work. Furthermore, I am grateful to Mariarosaria Finelli for her extraordinary closeness and for her spiritual and material support. I am thankful to my whole family: parents, sister, nephew, uncles, cousins and grandparents (to those who are here and to those who cannot be here). My work would not have been possible without the support of many, many friends and co-workers. Thanks a lot to Ouafae Nahli (second mum, she has been extraordinary), Simone Marchi (first brother), Riccardo Del Gratta (LaTeX guru), Marion Lamé (bibliographic superhero), Davide Albanesi (room 13 chief), Andrea Bellandi (room 13 mind), Giulia Benotto (room 13 patience), Marianne Reboul (great supporter), Rosanna Martucci (English ghost writer), Genoveffa Maselli (first English reviewer), Francesca Rispoli (second English reviewer), Felice Dell'Orletta (NLP Italian boss and supreme football player), Francesca Murano


(Saussure expert and a dear friend), Achille Felicetti (expert and great guy), Luca Pesini (XML guru! and a real Indo-Europeanist), and Anas Fahad Khan (shrewd logician and friend). I also thank Paolo Pegoraro, Daniela and Louay, and don Angelo Colacrai for their help and encouragement. The results presented in this thesis have also been made possible by the financial support of several European and Italian projects, among them: 1) the ERC Ideas project, Advanced Grant 249431 "Greek into Arabic", supported by the European Research Council and led by Cristina D'Ancona; 2) the PRIN2008 project "Per un'edizione digitale dei manoscritti di Ferdinand de Saussure", led by Daniele Gambarara and Maria Pia Marchese; 3) the PRIN2010/11 project "Memoria poetica e poesia della memoria", led by Paolo Mastandrea; 4) the PRIN project "Sistema di Filologia Computazionale per la gestione di immagini e testi digitali, l'indicizzazione e la produzione di apparati critici", led by Andrea Bozzi. Last but not least, a great thought to Maria, mother of God and our mother: she always accompanies me! Finally, I would like to thank the people who have been close to me during this research and whom I forgot to mention. My apologies for that, but my head has not been working well lately!

Afterword

The following text, originally written in Italian and given here in English translation, comes directly from the pen of Father Busa. One of my best friends inadvertently sent me this valuable piece of evidence. It seems that computer engineers and philologists have a long tradition in common.

From the Introduction to Fondamenti di informatica linguistica - Art. 2: characteristics of the course

0006. Regarding the dialogue between philologists and computer scientists, here are a few observations dictated by experience. Even among linguists and philologists, conversation is sometimes made difficult by the linguistic terminologies themselves, too often divergent and fluid; but the dialogue between philologists and computer scientists proves even more problematic. The mental training of the mathematician, of the engineer, of the economist is radically different from that of the philologist, and consequently their terminologies and the focus of their intellectual interests are just as different. Among practitioners of the exact sciences the method is predominantly deductive, the logic Cartesian and geometric: "so much in, so much out". A computer engineer will be inclined to extend to words the univocal homogeneity of numbers: in a text, for example, he will measure the frequencies of words all together, as if those words all had the same weight, almost a bag of beans: a metaphor suggested to me by their division into two cotyledons, signifier and signified... It does not cross his mind that the types of semanticity, that is, those which affect precisely this relationship between sign and concept, are so many and so diverse as to match the diversities found in Mendelejeff's periodic table of the elements... for example, prepositions, personal or deictic pronouns, verbs, and common nouns of objects are words more different from one another than metals, gases, and rare earths are... Moreover, a computer scientist, in other settings, usually prepares programs whose many repeated executions will amortize the cost of programming. In principle, in computational linguistics research the opposite happens: each program is used only once, for an operation whose output must immediately become the input of another, different program. When instead the philologist asks to run the same program again, it will often be because he wants some instruction in it changed: something that can be irritating to the computer scientist. The same program within the same job will probably be re-run unchanged only after the due corrections to its input have been made. Other jobs, even when identical in procedure, will most of the time require variations in the format of the input data, at which those programmers who do not have "generalized" programs or "external tables" at hand for quick and easy adaptations will balk. And again, many of our operations are batch operations, that is, "batches of loaves", many and consecutive: tens of thousands and hundreds of thousands of entries to be processed one after another. On computers that must serve many terminals interactively at the same time, our batches create impatience... Likewise when we keep the high-speed printer busy with thousands and thousands of lines... In mathematical and accounting computations, input and output are rather small, while the processing can be complex and very long. In computational linguistics the opposite is true: few and simple operations on enormous inputs and with enormous outputs. In Venice, in 1975-1976, I had to sort on four cascading fields a file of 6 million records of 350 bytes, one of 2.5 million records of 450 bytes, and two others of 600,000 and 1,200,000 records of the same length... Can you imagine it? And computational linguistics would wish to find itself in such circumstances often...


0007. Part of the "practical tone" of my course is training in that patient and tireless perseverance which the use of computers requires: the student must steel himself against the frequent unforeseen events of the computer that jams, of the operator who stays at home, of the bug, that is, an input or program error - I am speaking of the rare kind - that does not "pop out" until thousands of innocent lines have been processed without a hitch. Computational linguistics must be faced like an obstacle race: it delivers its prize at the end; once everything has been finished, and finished well, one realizes that a great service has been rendered: electronic texts and lexicons remain valid and available to everyone forever, and the effort spent on them remains useful to everyone and for always. The bulk of the course is therefore practical, along a single line of development, which is however the fundamental, necessary, and initial one, by the very nature of things. Faced with the journalistic myths of the electronic brain, with the psychedelic glitter of big words, with the exotic charm of technical acronyms such as RAM, ROM, LAN (random access memory, read only memory, local area network) on the one hand, and on the other with the accelerated evolution of technologies, by which those of the day before yesterday are already obsolete, while the "best" and most "up-to-date", even if one manages to grasp them, slip and escape from one's hands at once, I say: "Begin to actually do a little computing yourself, personally, today; in any case begin with little, because every life is born small. Continue with the quiet simplicity of one who knows that step by step one arrives. When you have reached, as a first objective, the lexicological mapping of a text, you will be able to speak with full knowledge of the facts and take off, rising toward every other new computing space."

Bibliography

[1] R. Siemens, M. Timney, C. Leitch, C. Koolen, A. Garnett et al., “Toward modeling the social edition: An approach to understanding the electronic scholarly edition in the context of new and emerging social media,” Literary and Linguistic Computing, vol. 27, no. 4, pp. 445–461, 2012.

[2] J. McGann, “From text to work: Digital tools and the emergence of the social text,” Variants: The Journal of the European Society for Textual Scholarship, vol. 4, pp. 225–240, 2005.

[3] A. Bozzi, “Edizione elettronica e filologia computazionale,” in Fondamenti di critica testuale, ser. Manuali, A. Stussi, Ed. Bologna: Il Mulino, 2006, ch. 9, pp. 207–232.

[4] P.-E. Portier and S. Calabretto, “DINAH, a Philological Platform for the Construction of Multi-structured Documents,” in Proceedings of the 14th European Conference on Research and Advanced Technology for Digital Libraries (ECDL), Glasgow, UK. Berlin, Heidelberg: Springer-Verlag, September 2010, pp. 364–375. [Online]. Available: http://dl.acm.org/citation.cfm?id=1887759.1887808

[5] A. Bozzi, “Computer-assisted scholarly editing of manuscript sources,” in New publication cultures in the humanities: exploring the paradigm shift, P. Davidhazi, Ed. Amsterdam: Amsterdam University Press, 2014, pp. 99–115. [Online]. Available: http://www.oapen.org/record/515678

[6] P. Robinson, “Towards a scholarly editing system for the next decades,” in Sanskrit Computational Linguistics, ser. Lecture Notes in Computer Science, G. Huet, A. Kulkarni, and P. Scharf, Eds. Springer

161

i i

i i 162 BIBLIOGRAPHY

Berlin Heidelberg, 2009, vol. 5402, pp. 346–357. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-00155-0 18

[7] ——, “Towards a theory of digital editions,” Variants, no. 10, pp. 105–131, 2013.

[8] F. Tomasi and F. Vitali, “Collaborative Annotations in Shared Environments: Metadata, Vocabularies and Techniques in the Digital Humanities (DH-CASE 2013),” in Proceedings of the 2013 ACM Symposium on Document Engineering (DocEng), Florence, Italy. New York, NY, USA: ACM, September 2013, pp. 283–284. [Online]. Available: http://doi.acm.org/10.1145/2494266.2494323

[9] P. Schmitz, L. Pearce, and Q. Dombrowski, “DH-CASE II: Collaborative Annotations in Shared Environments: Metadata, Tools and Techniques in the Digital Humanities,” in Proceedings of the 2014 ACM Symposium on Document Engineering (DocEng), Fort Collins, Colorado, USA. New York, NY, USA: ACM, September 2014, pp. 211–212. [Online]. Available: http://doi.acm.org/10.1145/2644866.2644898

[10] M. Terras and G. Crane, Eds., Changing the Center of Gravity: Transform- ing Classical Studies through Cyberinfrastructure, ser. Digital Technologies and the Ancient World. Piscataway: Gorgias Press, March 2010, vol. 4.

[11] W. McCarty, Humanities Computing. Palgrave Macmillan, 2005.

[12] I. Lancashire, “Computers in the linguistic humanities: Overview,” in Encyclopedia of Language and Linguistics, 2nd ed., K. Brown, Ed. Oxford: Elsevier, 2006, pp. 793–809. [Online]. Available: http: //www.sciencedirect.com/science/article/pii/B0080448542009809

[13] M. Terras, J. Nyhan, and E. Vanhoutte, Defining Digital Humanities: A Reader. Farnham Surrey: Ashgate, 2013.

[14] M. K. Gold, Ed., Debates in the digital humanities. Minneapolis, MN: University of Minnesota Press, 2012.

[15] D. M. Berry, Ed., Understanding Digital Humanities. New York: Palgrave Macmillan, 2012. i i i i

BIBLIOGRAPHY 163

[16] P. Arthur and K. Bode, Eds., Advancing Digital Humanities: Research, Methods, Theories. Basingstoke, Hampshire: Palgrave Macmillan, 2014.

[17] A. Burdick, J. Drucker, P. Lunenfeld, T. Presner, and J. Schnapp, Eds., Digital Humanities. Cambridge, MA, USA: MIT press, 2012.

[18] T. Bartscherer and R. Coover, Switching Codes: Thinking Through Digital Technology in the Humanities and the Arts. Chicago: University of Chicago Press, 2011.

[19] P. L. Shillingsburg, From Gutenberg to Google: Electronic Representations of Literary Texts. New York, NY, USA: Cambridge University Press, 2006.

[20] R. Busa and I. B. M. Corporation, Sancti Thomae Aquinatis hymnorum ritualium varia specimina concordantiarum ...: a first example of word in- dex automatically compiled and printed by IBM punched card machines, ser. Archivum Philosophicum Aloisianum. Milano: Fratelli Bocca, Editori, 1951.

[21] R. Busa, “The annals of humanities computing: The index Thomisticus,” Computers and the Humanities, vol. 14, no. 2, pp. 83–90, 1980. [Online]. Available: http://dx.doi.org/10.1007/BF02403798

[22] G. Stewart, G. Crane, and A. Babeu, “A new generation of textual corpora: Mining corpora from very large collections,” in Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL), Vancouver, BC, Canada. New York, NY, USA: ACM, June 2007, pp. 356–365. [Online]. Available: http://doi.acm.org/10.1145/1255175.1255247

[23] P. Robinson, “The history of scholarly digital editions, plc,” Papers of the Bibliographical Society of Canada / Cahiers de la Soci´et´ebibliographique du Canada, vol. 51, no. 1, pp. 83–104, March 2013.

[24] G. Crane, D. Bamman, L. Cerrato, A. Jones, D. Mimno, A. Packel, D. Sculley, and G. Weaver, “Beyond Digital Incunabula: Modeling the next generation of Digital Libraries,” in Proceedings of the 10th European Conference on Research and Advanced Technology for Digital Libraries (ECDL), Alicante, Spain. Berlin, Heidelberg: Springer-Verlag, September 2006, pp. 353–366. [Online]. Available: http://dx.doi.org/10.1007/11863878 30

i i

i i 164 BIBLIOGRAPHY

[25] A. Bozzi, “Electronicpublishing and computational philology,” Linguistica Computazionale, vol. 24-25, no. A, pp. 63–86, 2006.

[26] V. A. Dearing, “Machine-assisted textual criticism,” Computers and the Hu- manities, vol. 4, no. 2, pp. 149–154, November 1969.

[27] G. R. Petty and W. M. Gibson, Project Occult: the ordered computer colla- tion of unprepared literary text. New York, NY, USA: New York University Press, 1970.

[28] G. P. Zarri, “Some experiments on automated textual criticism,” Bulletin Association for Literary and Linguistic Computing Stockport, vol. 5, no. 3, pp. 266–290, 1977.

[29] G. Zarri, “Algorithms, Stemmata Codicum, and the Theories of Dom H. Quentin,” The Computer and Literary Studies, pp. 225–237, 1973.

[30] J. Froger, La critique des textes et son automatisation. Paris: Dunod, 1968, vol. 7.

[31] G. P. Verbrugghe, “Transliteration or transcription of Greek,” The Classical World, vol. 92, no. 6, pp. 499–511, July - August 1999.

[32] W. Ott, “Strategies and tools for textual scholarship: the T¨ubingen system of text processing programs (TUSTEP),” Literary and Linguistic Computing, vol. 15, no. 1, pp. 93–108, 2000. [Online]. Available: http://llc.oxfordjournals.org/content/15/1/93.abstract

[33] C. Thiele, “Tex and the Humanities,” TUGboat, vol. 17, no. 4, pp. 388 – 393, 1996.

[34] C. Beccari, “The teubner LATEX package: Typesetting classical Greek philology,” TUGboat, vol. 23, no. 3-4, pp. 276–282, 2002.

[35] A. Bozzi, “Digital documents and computational philology: the Digital Philology System (DiPhiloS),” Informatica e Scienze Umane. Mezzo Sec- olo di Studi e Ricerche, pp. 175–201, 2003.

[36] D. Schmidt, “Towards an interoperable digital scholarly edition,” Journal of the Text Encoding Initiative, no. 7, November 2014. [Online]. Available: http://jtei.revues.org/979 i i i i

BIBLIOGRAPHY 165

[37] F. Boschetti, A. M. Del Grosso, A. F. Khan, M. Lam´e,and O. Nahli, “A top- down approach to the design of components for the philological domain,” in Book of abstract of Digital Humanities Conference (DH), Lausanne, Switzer- land. Alliance of Digital Humanities Organisations, July 2014, pp. 109–111.

[38] P. D’Iorio, “Nietzsche on new paths: The hypernietzsche project and open scholarship on the web,” in Friedrich Nietzsche, Edizioni e interpretazioni, M. C. Fornari, Ed. Pisa: ETS, 2007, pp. 475–496.

[39] M. Langham and C. Ziegeler, Cocoon: Building XML Applications. Indi- anapolis: Pearson Education, 2002.

[40] (2003) Analytical System Tools and SGML/XML Integration Applications. Scholarly Digital Editions: Nottingham. [Online]. Available: http: //sd-editions.com/anastasia/index.html

[41] G. Crane, B. Seales, and M. Terras, “Cyberinfrastructure for classical philology,” Digital Humanities Quarterly, vol. 3, no. 1, 2009. [Online]. Available: http://www.digitalhumanities.org/dhq/vol/3/1/000023/000023. html

[42] D. Buzzetti, “Digital representation and the text model,” New Literary History, vol. 33, no. 1, pp. 61–88, 2002. [Online]. Available: http://muse.jhu.edu/journals/new literary history/v033/33.1buzzetti.html

[43] F. Gibbs and T. Owens, “Building Better Digital Humanities Tools: Toward broader audiences and user-centered designs,” Digital Humanities Quarterly, vol. 6, no. 2, 2012. [Online]. Available: http://www.digitalhumanities.org/ dhq/vol/6/2/000136/000136.html

[44] A. Bozzi, “G2A: A Web application to study, annotate and scholarly edit ancient texts and their aligned translations,” Studia graeco-arabica, vol. 3, pp. 159–171, 2013.

[45] G. Crane, B. Almas, A. Babeu, L. Cerrato, M. Harrington, D. Bamman, and H. Diakoff, “Student researchers, citizen scholars and the trillion word library,” in Proceedings of the 12th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL), Washington, DC, USA. New York, NY, USA: ACM, June 2012, pp. 213–222. [Online]. Available: http: //doi.acm.org/10.1145/2232817.2232857

i i

i i 166 BIBLIOGRAPHY

[46] A. M. Del Grosso and O. Nahli, “Towards a flexible open-source software library for multi-layered scholarly textual studies: An Arabic case study dealing with semi-automatic language processing,” in Proceedings of 3rd IEEE International Colloquium, Information Science and Technology (CIST), Tetouan, Marocco. Washington, DC, USA: IEEE, October 2014, pp. 285–290. [Online]. Available: http://dx.doi.org/10.1109/CIST.2014. 7016633

[47] L. Burnard, “The Evolution of the Text Encoding Initiative: From Research Project to Research Infrastructure,” Journal of the Text Encoding Initiative, no. 5, 2013. [Online]. Available: http://jtei.revues.org/811

[48] B. Bordalejo, “The texts we see and the works we imagine: The shift of focus of textual scholarship in the digitale age,” in Ecdotica, G. M. Anselmi, E. Pasquini, and F. Rico, Eds. Roma: Carocci, 2013, vol. 10, pp. 64–75.

[49] G. Barabucci, A. Di Iorio, and F. Vitali, “Stemma codicum: analisi e gen- erazione semi-automatica,” Quaderni DigiLab, vol. 3, no. 1, pp. 129–145, 2014.

[50] E. Pierazzo, Digital Scholarly Editing : Theories, Models and Methods. Farnham Surrey: Ashgate, 2015.

[51] A. Teehan and J. G. Keating, “Appropriate use case modeling for humanities documents,” Literary and Linguistic Computing, vol. 25, no. 4, pp. 381–391, 2010. [Online]. Available: http://llc.oxfordjournals.org/ content/25/4/381.abstract

[52] T. Boudreau, J. Tulach, and R. Unger, “Decoupled Design: Building Applications on the NetBeansTM Platform,” in Companion to the 21st ACM SIGPLAN Symposium on Object-oriented Programming Systems, Languages, and Applications (OOPSLA), Portland, Oregon, USA. New York, NY, USA: ACM, October 2006, pp. 631–631. [Online]. Available: http://doi.acm.org/10.1145/1176617.1176644

[53] P. M. Mell and T. Grance, “Sp 800-145. The NIST Definition of ,” National Institute of Standards & Technology, Gaithersburg, MD, United States, Tech. Rep., 2011. i i i i

BIBLIOGRAPHY 167

[54] R. Siemens and S. Schreibman, Eds., A Companion to Digital Literary Stud- ies, ser. Blackwell Companions to Literature and Culture. Wiley Publishing, 2008.

[55] C. D. Manning, “Natural language processing for the Digital Humanities,” in Workshop of Digital Humanities, Stanford, June 2011. [Online]. Available: http://nlp.stanford.edu/∼manning/courses/DigitalHumanities/

[56] A. Babeu, “Rome wasn’t digitized in a day: Building a cyberinfrastructure for digital classicists,” Council on Library and Information Resources, Tech. Rep., August 2011. [Online]. Available: http://www.clir.org/pubs/ abstract/pub150abst.html

[57] M. Agosti and F. Tomasi, Eds., Collaborative Research Practices and Shared Infrastructures for Humanities Computing, Proceedings of revised papers of the 2nd Annual Conference of the Associazione per l’Informatica Umanistica e la Cultura Digitale (AIUCD), Padova, 11-12 December 2013. Padova: CLEUP, 2014.

[58] M. Terras, Ed., Book of Abstracts of Digital Humanities Conference. Joint International Conference of the Alliance of Digital Humanities Organiza- tions, Lausanne, 7-21 July, 2014.

[59] S. Simske and S. R¨onnau,Eds., Proceedings of the ACM Symposium on Doc- ument Engineering (DocEng), Fort Collins, Colorado, USA, 16-19 Septem- ber. New York, NY, USA: ACM, 2014.

[60] A. Antonacopoulos and K. U. Schulz, Eds., Proceedings of the 1st Interna- tional Conference on Digital Access to Textual Cultural Heritage (DATeCH), Madrid, 19-20 May. New York, NY, USA: ACM, 2014.

[61] P. M. Robinson, “Collate: A program for interactive collation of large textual traditions,” Research in humanities computing, vol. 3, pp. 32–45, 1994.

[62] D. Ribes and K. Baker, “Modes of social science engagement in community infrastructure design,” in Communities and Technologies 2007, C. Steinfield, B. Pentland, M. Ackerman, and N. Contractor, Eds. Springer London, 2007, pp. 107–130. [Online]. Available: http: //dx.doi.org/10.1007/978-1-84628-905-7 6

i i

i i 168 BIBLIOGRAPHY

[63] Q. Dombrowski, “What ever happened to project bamboo?” Literary and Linguistic Computing, vol. 29, no. 3, pp. 326–339, 2014. [Online]. Available: http://llc.oxfordjournals.org/content/29/3/326.abstract

[64] M. Hedges, H. Neuroth, K. M. Smith, T. Blanke, L. Romary, M. K¨uster, and M. Illingworth, “TextGrid, TEXTvre, and DARIAH: Sustainability of Infrastructures for Textual Scholarship,” Journal of the Text Encoding Initiative, no. 5, June 2013. [Online]. Available: http://jtei.revues.org/774

[65] T. Blanke, M. Bryant, M. Hedges, A. Aschenbrenner, and M. Priddy, “Preparing DARIAH,” in Proceedings of 7th International Conference on E-Science (IEEE), Stockholm, Sweden, December 2011, pp. 158–165.

[66] T. V´aradi,S. Krauwer, P. Wittenburg, M. Wynne, and K. Koskenniemi, “Clarin: Common language resources and technology infrastructure,” in Proceedings of Language Resources and Evaluation Conference (LREC), Marrakech, Morocco, N. Calzolari, K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis, and D. Tapias, Eds. European Language Resources Association (ELRA), May 2008, pp. 1244–1248.

[67] R. Del Gratta, “Language resource infrastructure(s),” Ph.D. dissertation, Scuola di Dottorato in Ingegneria Leonardo da Vinci, Universit`adi Pisa, 2011.

[68] C. Concordia, S. Gradmann, and S. Siebinga, “Not just another portal, not just another digital library: A portrait of Europeana as an application program interface,” IFLA Journal, vol. 36, no. 1, pp. 61–69, 2010. [Online]. Available: http://ifl.sagepub.com/content/36/1/61.abstract

[69] V. Casarosa, C. Meghini, and S. Gardasevic, “Improving online access to archival data,” in Digital Libraries and Archives, ser. Communications in Computer and Information Science, M. Agosti, F. Esposito, S. Ferilli, and N. Ferro, Eds. Springer Berlin Heidelberg, 2013, vol. 354, pp. 153–162. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-35834-0 16

[70] C. Meghini, N. Aloia, and C. Concordia, “Designing and implementing a user-generated content service for Europeana,” in Proceedings of Information Technologies for Performing Arts, Media Access and i i i i

BIBLIOGRAPHY 169

Entertainment (ECHAP), Firenze. Firenze University Press, May 2012. [Online]. Available: http://dx.doi.org/10.1400/187218

[71] A. Jordanous, K. F. Lawrence, M. Hedges, and C. Tupman, “Exploring manuscripts: Sharing ancient wisdoms across the semantic web,” in Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics (WIMS), Craiova, Romania. New York, NY, USA: ACM, June 2012, pp. 44:1–44:12. [Online]. Available: http: //doi.acm.org/10.1145/2254129.2254184

[72] M. Grassi, C. Morbidoni, M. Nucci, S. Fonda, and F. Piazza, “Pundit: aug- menting web contents with semantics,” Literary and linguistic computing, vol. 28, no. 4, pp. 640–659, 2013.

[73] A. Ciula, “The New Edition of the Letters of Vincent Van Gogh on the Web,” DHQ: Digital Humanities Quarterly, vol. 4, no. 2, 2010. [Online]. Available: http://digitalhumanities.org/dhq/vol/4/2/000088/000088.html

[74] B. Nowviskie and J. McGann, “NINES: a federated model for integrating digital scholarship,” Networked Infrastructure for Nineteenth-Century Electronic Scholarship, White paper, 2005. [Online]. Available: http: //www.nines.org/about/wp-content/uploads/2011/12/9swhitepaper.pdf

[75] W. Ott, “Digital publishing: tools and products,” Poiesis and Praxis, vol. 5, no. 2, pp. 81–112, 2008. [Online]. Available: http: //dx.doi.org/10.1007/s10202-007-0039-6

[76] W. Ott and T. Ott, “Critical editing with TXSTEP,” in Book of Abstracts of the Digital Humanities conference, Lausanne, Switzerland, M. Terras, Ed. Alliance of Digital Humanities Organisations, July 2014, pp. 509–513.

[77] P. Mascellani and P. D. Napoletani, “MauroTeX - A language for electronic critical editions,” in Proceedings of International Cultural Heritage Infor- matics Meeting (ICHIM), Milan, Italy, D. Bearman and F. Garzotto, Eds., vol. 2. Pittsburgh, PA, USA: Archives and Museum Informatics, September 2001, pp. 223–241.

[78] M. Dominici and P. D. Napolitani, “Edizione con LaTeX delle opere di Francesco Maurolico,” ArsTEXnica, pp. 75–82, October 2006.

i i

i i 170 BIBLIOGRAPHY

[79] P. Heckel, “A technique for isolating differences between files,” Commu- nications of the ACM, vol. 21, no. 4, pp. 264–268, April 1978. [Online]. Available: http://doi.acm.org/10.1145/359460.359467

[80] R. Haentjens Dekker, D. van Hulle, G. Middell, V. Neyt, and J. van Zundert, “Computer-supported collation of modern manuscripts: CollateX and the Beckett digital manuscript project,” Literary and Linguistic Computing, 2014. [Online]. Available: http://dx.doi.org/10.1093/llc/fqu007

[81] S. B. Needleman and C. D. Wunsch, “A general method applicable to the search for similarities in the amino acid sequence of two proteins,” Journal of Molecular Biology, vol. 48, no. 3, pp. 443–453, 1970. [Online]. Available: http://www.sciencedirect.com/science/article/pii/0022283670900574

[82] J. Bourdaillet and J.-G. Ganascia, “Practical block sequence alignment with moves,” in Proceedings of 1st International on Language and Automata The- ory and Applications (LATA), Tarragona, Spain, March-April 2007, pp. 199– 210.

[83] W. Meier, “eXist: An open source native XML database,” in Web, Web-Services, and Database Systems, ser. Lecture Notes in Computer Science, A. Chaudhri, M. Jeckle, E. Rahm, and R. Unland, Eds. Springer Berlin Heidelberg, 2003, vol. 2593, pp. 169–183. [Online]. Available: http://dx.doi.org/10.1007/3540365605 13

[84] A. H. Renear, E. Mylonas, and D. Durand, “Refining our notion of what text really is: The problem of overlapping hierarchies,” Research in Humanities Computing, vol. 4, pp. 263–280, 1996.

[85] A. Di Iorio, S. Peroni, and F. Vitali, “Towards markup support for full GODDAGs and beyond: the EARMARK approach,” in Proceedings of balisage: The markup conference, Montr´eal,Canada, ser. Balisage Series on Markup Technologies, vol. 3, August 2009. [Online]. Available: http://www. balisage.net/Proceedings/vol3/html/Peroni01/BalisageVol3-Peroni01.html

[86] D. Schmidt, “The inadequacy of embedded markup for cultural heritage texts,” Literary and Linguistic Computing, vol. 25, no. 3, pp. 337–356, 2010. [Online]. Available: http://llc.oxfordjournals.org/content/25/3/337. abstract i i i i

BIBLIOGRAPHY 171

[87] D. Schmidt and R. Colomb, “A data structure for representing multi-version texts online,” International Journal of Human-Computer Studies, vol. 67, no. 6, pp. 497–514, June 2009. [Online]. Available: http://dx.doi.org/10.1016/j.ijhcs.2009.02.001

[88] M. Crochemore, C. Hancart, and T. Lecroq, Algorithms on Strings. New York, NY, USA: Cambridge University Press, 2007.

[89] J. D. Allen, D. Anderson, J. Becker, R. Cook, M. Davis, P. Edberg, M. Ev- erson, A. Freytag, L. Iancu, R. Ishida, J. H. Jenkins, K. Lunde, R. Mc- Gowan, L. Moore, E. Muller, A. Phillips, R. Pournader, M. Suignard, and K. Whistler, Eds., The Unicode Standard, Version 7.0. Boston MA, USA: Addison Wesley Longman Publishing Co., Inc., October 2014, vol. 7.0.

[90] J. M. Aliprand, “Unicode and ISO/IEC 10646: An overview,” McCallum and Ertel [ME94], pp. 87–102, 1993.

[91] S. Ferilli, Automatic Digital Document Processing and Management: Prob- lems, Algorithms and Techniques. Berlin Heidelberg: Springer, 2011.

[92] L. Burnard, K. O’Brien O’Keeffe, and J. Unsworth, “Editors’ introduc- tion,” in Electronic textual editing, L. Burnard, K. O’Brien O’Keeffe, and J. Unsworth, Eds. New York, NY, USA: Modern Language Association of America, 2006, pp. 11–12.

[93] F. Tomasi, “L’edizione digitale e la rappresentazione della conoscenza. Un esempio: Vespasiano da Bisticci e le sue lettere,” Ecdotica, vol. 9, pp. 264– 286, 2012.

[94] L. Burnard. (2014) TEI P5: Guidelines for Electronic Text Encoding and Interchange. Version 2.7.0. [Online]. Available: http://www.tei-c.org/ Vault/P5/2.7.0/doc/tei-p5-doc/en/html/index.html

[95] J. A. Rydberg-Cox, Digital Libraries and the Challenges of Digital Human- ities, R. Rikowski, Ed. Oxford: Chandos Publishing, 2006.

[96] S. Peroni, F. Poggi, and F. Vitali, “Overlapproaches in documents: a definitive classification (in OWL, 2!),” in Proceedings of Balisage: The Markup Conference, Washington, DC, ser. Balisage Series on Markup Tech- nologies, vol. 13, August 2014. [Online]. Available: http://www.balisage. net/Proceedings/vol13/html/Peroni01/BalisageVol13-Peroni01.html

i i

i i 172 BIBLIOGRAPHY

[97] A. Vohra and D. Vohra, Pro XML Development with Java Technology (Pro). Berkely, CA, USA: Apress, 2006.

[98] D. Schmidt, “Merging multi-version texts: a general solution to the overlap problem,” in Proceeding of Balisage: The Markup Conference, Montr´eal, Canada, ser. Belisage Series on Markup Technologies, vol. 3, August 2009, pp. 497–514. [Online]. Available: http://www.balisage.net/Proceedings/ vol3/html/Schmidt01/BalisageVol3-Schmidt01.html

[99] N. Guarino, D. Oberle, and S. Staab, “What is an ontology?” in Handbook on Ontologies, ser. International Handbooks on Information Systems, S. Staab and R. Studer, Eds. Springer Berlin Heidelberg, 2009, pp. 1–17. [Online]. Available: http://dx.doi.org/10.1007/978-3-540-92673-3 0

[100] M. Doerr, S. Gradmann, S. Hennicke, A. Isaac, C. Meghini, and H. van de Sompel, “The Europeana Data Model (EDM),” in World Library and In- formation Congress: 76th IFLA General Conference and Assembly, Gothen- burg, Sweden, August 2010, pp. 10–12.

[101] A. Gerber and J. Hunter, “A compound object authoring and publishing tool for literary scholars based on the IFLA-FRBR,” International Journal of Digital Curation, vol. 4, no. 2, pp. 28–42, 2009.

[102] D. Broeder, M. Kemps-Snijders, D. van Uytvanck, M. Windhouwer, P. With- ers, P. Wittenburg, and C. Zinn, “A data category registry- and component- based metadata framework,” in Proceedings of the 7th conference on Inter- national Language Resources and Evaluation (LREC), Malta, N. Calzolari, B. Maegaard, J. Mariani, J. Odjik, K. Choukri, S. Piperidis et al., Eds. European Language Resources Association (ELRA), May 2010, pp. 43–47.

[103] A. Maslov, J. Creel, A. Mikeal, S. Phillips, J. Leggett, and M. McFarland, “Adding OAI–ORE support to repository platforms,” Journal of Digital Information, vol. 11, no. 1, 2010. [Online]. Available: https://journals.tdl.org/jodi/index.php/jodi/article/view/749

[104] N. Smith, “Citation in classical studies,” Digital Humanities Quarterly, vol. 3, no. 1, pp. 1–10, 2009.

[105] J. Tiepmar, C. Teichmann, G. Heyer, M. Berti, and G. Crane, “A new imple- mentation for Canonical Text Services,” in Proceedings of the 8th Workshop i i i i

BIBLIOGRAPHY 173

on Language Technology for Cultural Heritage, Social Sciences, and Human- ities (LaTeCH), Gothenburg, Sweden, K. Zervanou and C. Vertan, Eds., Eu- ropean Chapter Association for Computational Linguistics (EACL). New York, NY, USA: ACL, April 2014, pp. 1–8.

[106] D. N. Smith and C. W. Blackwell, “Four URLs, Limitless Apps: separation of concerns in the Homer Multitext architecture,” in Donum natalicium digitaliter confectum Gregorio Nagy septuagenario a discipulis collegis familiaribus oblatum (A Virtual Birthday Gift Presented to Gregory Nagy on Turning Seventy by his Students, Colleagues, and Friends) ´edition, L. Muellner, Ed. The Center of Hellenic Studies of Harvard University, 2012. [Online]. Available: http://chs.harvard.edu/CHS/article/display/4846

[107] G. Crane, B. Almas, A. Babeu, L. Cerrato, A. Krohn, F. Baumgart, M. Berti, G. Franzini, and S. Stoyanova, “Cataloging for a Billion Word Library of Greek and Latin,” in Proceedings of the 1st International Conference on Digital Access to Textual Cultural Heritage (DATeCH), Madrid, Spain. New York, NY, USA: ACM, May 2014, pp. 83–88. [Online]. Available: http://doi.acm.org/10.1145/2595188.2595190

[108] J. Bradley and P. Vetch, “Supporting annotation as a scholarly tool – Experiences from the online Chopin Variorum Edition,” Literary and Linguistic Computing, vol. 22, no. 2, pp. 225–241, 2007. [Online]. Available: http://llc.oxfordjournals.org/content/22/2/225.abstract

[109] M. Agosti and N. Ferro, “A formal model of annotations of digital content,” ACM Transactions on Information Systems (TOIS), vol. 26, no. 1, November 2007. [Online]. Available: http://doi.acm.org/10.1145/1292591.1292594

[110] C. Mattmann and J. Zitting, Tika in Action. Greenwich, CT, USA: Man- ning Publications Co., 2011.

[111] M. McCandless, E. Hatcher, and O. Gospodneti´c, Lucene in action; 2nd ed. Stamford Conn: Manning Pub, 2010.

[112] G. Wilcock, Introduction to Linguistic Annotation and Text Analytics, 1st ed. Morgan & Claypool Publishers, 2009.

[113] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, “The WEKA data mining software: An update,” SIGKDD

i i

i i 174 BIBLIOGRAPHY

Explorations Newsletter, vol. 11, no. 1, pp. 10–18, November 2009. [Online]. Available: http://doi.acm.org/10.1145/1656274.1656278

[114] S. Owen, R. Anil, T. Dunning, and E. Friedman, Mahout in Action. Green- wich, CT, USA: Manning Publications Co., 2011.

[115] C. Carpineto, S. Osi´nski, G. Romano, and D. Weiss, “A survey of web clustering engines,” ACM Computing Surveys, vol. 41, no. 3, pp. 17:1–17:38, July 2009. [Online]. Available: http://doi.acm.org/10.1145/ 1541880.1541884

[116] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. Mc- Closky, “The Stanford CoreNLP natural language processing toolkit,” in Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Baltimore, Maryland USA. Strouds- burg, PA, USA: Association for Computational Linguistics, June 2014, pp. 55–60.

[117] A. Taylor, M. Marcus, and B. Santorini, “The Penn Treebank: An overview,” in Treebanks, ser. Text, Speech and Language Technology, A. Abeill´e,Ed. Springer Netherlands, 2003, vol. 20, pp. 5–22. [Online]. Available: http://dx.doi.org/10.1007/978-94-010-0201-1 1

[118] K. Toutanova, D. Klein, C. D. Manning, and Y. Singer, “Feature- rich part-of-speech tagging with a cyclic dependency network,” in Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (NAACL), Edmonton, Canada, vol. 1. Stroudsburg, PA, USA: ACL, May-June 2003, pp. 173–180. [Online]. Available: http: //dx.doi.org/10.3115/1073445.1073478

[119] J. R. Finkel, T. Grenager, and C. Manning, “Incorporating non-local information into information extraction systems by Gibbs sampling,” in Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics (ACL), Ann Arbor, Michigan. Stroudsburg, PA, USA: Association for Computational Linguistics, June 2005, pp. 363–370. [Online]. Available: http://dx.doi.org/10.3115/1219840.1219885 i i i i

BIBLIOGRAPHY 175

[120] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, N. Aswani, I. Roberts, G. Gorrell, A. Funk, A. Roberts, D. Damljanovic et al., “Devel- oping language processing components with GATE version 7 (a user guide),” The University of Sheffield, Department of Computer Science, Sheffield, Tech. Rep., November 2012.

[121] T. G¨otzand O. Suhre, “Design and implementation of the UIMA common analysis system,” IBM System Journal, vol. 43, no. 3, pp. 476–489, July 2004. [Online]. Available: http://dx.doi.org/10.1147/sj.433.0476

[122] E. Pierazzo, “A rationale of digital documentary editions,” Literary and Linguistic Computing, vol. 26, no. 4, pp. 463–477, 2011. [Online]. Available: http://llc.oxfordjournals.org/content/26/4/463.abstract

[123] M. Crochemore and W. Rytter, Jewels of stringology. Singapore, River Edge, NJ: World Scientific, 2003.

[124] D. Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. New York, NY, USA: Cambridge University Press, 1997.

[125] J.-B. Camps and F. Cafiero, “Genealogical variant locations and simpli- fied stemma: a test case,” in Analysis of Ancient and Medieval Texts and Manuscripts: Digital Approaches, C. M. T. L Andrews, Ed. Leuven: Bre- pols Publishers, 2012, ch. 3, pp. 71–97.

[126] F. Boschetti, “Alignment of variant readings for linkage of multiple annota- tions,” in In Proceedings of Electronic Corpora of Ancient Languages 2007, Prague, Czech Republic, P. Zemanek, Ed., July 2008, pp. 11–24.

[127] D. Sankoff and J. B. Kruskal, Time Warps, String Edits, and Macro- molecules: The Theory and Practice of Sequence Comparison. Center for the Study of Language and Information, December 1999.

[128] J. Holub, “Editorial: Stringology algorithms,” Discrete Applied Mathe- matics, vol. 163, no. 3, pp. 237–238, January 2014. [Online]. Available: http://dx.doi.org/10.1016/j.dam.2013.11.009

[129] H.-J. Boeckenhauer and D. Bongartz, Algorithmic Aspects of Bioinformatics, ser. Natural Computing. Springer Berlin Heidelberg, 2007. [Online]. Available: http://dx.doi.org/10.1007/978-3-540-71913-7 1

i i

i i 176 BIBLIOGRAPHY

[130] U. Manber and G. Myers, “Suffix arrays: A new method for on-line string searches,” SIAM Journal Computing, vol. 22, no. 5, pp. 935–948, October 1993. [Online]. Available: http://dx.doi.org/10.1137/0222058

[131] P. Husemann and J. Stoye, “Phylogenetic comparative assembly,” in Proceedings of the 9th International Conference on Algorithms in Bioinformatics (WABI), Philadelphia, PA, USA. Berlin, Heidelberg: Springer-Verlag, September 2009, pp. 145–156. [Online]. Available: http://dl.acm.org/citation.cfm?id=1812906.1812919

[132] R. C. Edgar and S. Batzoglou, “Multiple sequence alignment,” Current Opinion in Structural Biology, vol. 16, no. 3, pp. 368–373, 2006. [Online]. Available: http://www.sciencedirect.com/science/article/ pii/S0959440X06000704

[133] M. Spencer and C. Howe, “Collating texts using progressive multiple alignment,” Computers and the Humanities, vol. 38, no. 3, pp. 253–270, 2004. [Online]. Available: http://dx.doi.org/10.1007/s10579-004-8682-1

[134] T. F. Smith and M. S. Waterman, “Identification of common molecular subsequences,” Journal of Molecular Biology, vol. 147, no. 1, pp. 195–197, 1981. [Online]. Available: http://www.sciencedirect.com/science/article/ pii/0022283681900875

[135] D.-F. Feng and R. F. Doolittle, “Progressive sequence alignment as a prerequisite to correct phylogenetic trees,” Journal of Molecular Evolution, vol. 25, no. 4, pp. 351–360, 1987. [Online]. Available: http://dx.doi.org/10.1007/BF02603120

[136] P. Robinson, “The concept of the work in the digital age,” in Ecdotica, G. M. Anselmi, E. Pasquini, and F. Rico, Eds. Roma: Carocci, 2013, vol. 10, pp. 13–41.

[137] P. V. Baret, P. Robinson, and C. Mac´e, “Testing methods on an artificially created textual tradition,” in The evolution of texts: confronting stemmatological and genetical methods, C. Mac´e,P. Baret, A. Bozzi, and L. Cignoni, Eds., vol. 24-25, Proceedings of the international workshop held in Louvain-la-Neuve on September 1-2, 2004. Pisa, Roma: Istituti i i i i

BIBLIOGRAPHY 177

editoriali e poligrafici internazionali, 2006, pp. 255–283. [Online]. Available: http://digital.casalini.it/an/2212910

[138] M. C. Passarotti, “Towards textual drift modelling in computational philol- ogy,” in Linguistica Computazionale, C. Mac´e, P. Baret, A. Bozzi, and L. Cignoni, Eds., vol. 24-25, Proceedings of the international workshop held in Louvain-la-Neuve on September 1-2, 2004. Pisa-Rome: Istituti Editoriali e Poligrafici Internazionali, 2006, pp. 63–86.

[139] H. Bast and M. Celikik, “Efficient fuzzy search in large text collections,” ACM Transactions on Information Systems, vol. 31, no. 2, pp. 10:1– 10:59, May 2013. [Online]. Available: http://doi.acm.org/10.1145/2457465. 2457470

[140] G. Salton, A. Wong, and C. S. Yang, “A vector space model for automatic indexing,” Communications of the ACM, vol. 18, no. 11, pp. 613–620, November 1975. [Online]. Available: http: //doi.acm.org/10.1145/361219.361220

[141] F. P. Miller, A. F. Vandome, and J. McBrewster, Levenshtein Distance. Alpha Press, 2009.

[142] Z. S. Harris, “Distributional structure,” Word, 1954.

[143] A. Lenci and G. Benotto, “Identifying Hypernyms in Distributional Semantic Spaces,” in Proceedings of the 1st Joint Conference on Lexical and Computational Semantics: Proceedings of the Main Conference and the Shared Task: Proceedings of the 6th International Workshop on Semantic Evaluation (SemEval), Montr´eal,Canada, vol. 1,2. Stroudsburg, PA, USA: Association for Computational Linguistics, June 2012, pp. 75–79. [Online]. Available: http://dl.acm.org/citation.cfm?id=2387636.2387650

[144] R. Greenlaw and S. Kantabutra, “Survey of clustering: Algorithms and applications,” International Journal of Information Retrieval Research, vol. 3, no. 2, pp. 1–29, April 2013. [Online]. Available: http: //dx.doi.org/10.4018/ijirr.2013040101

[145] D. Jurafsky and J. H. Martin, Speech and Language Processing, 2nd ed. Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 2009.

i i

i i 178 BIBLIOGRAPHY

[146] C. D. Manning and H. Sch¨utze, Foundations of Statistical Natural Language Processing, 1st ed. Cambridge, MA, USA: The MIT Press, June 1999.

[147] A. Tonazzini, E. Salerno, V. Palleschi, G. Bianco, and F. De Filippo, “Ex- tracting information from multimodal images of documents and artworks,” in Proceedings of 6th International Congress on Science and Technology for the Safeguard of Cultural Heritage in the Mediterranean Basin, Athens, Greece, vol. 3. Rome: Valmar, October 2013, pp. 196–204.

[148] A. Bozzi, M. M. Morales, and M. Rufino, “Imago et umbra programma di digitalizzazione per l’archivio storico della Pontificia Universit`aGregoriana: criteri, metodi e strumenti,” Digitalia, vol. 5, no. 2, pp. 79–99, 2010.

[149] A. Prescot, “The electronic Beowulf and digital restoration,” Literary and linguistic computing, vol. 12, no. 3, pp. 186–195, 1997.

[150] K. S. Kiernan, “Digital image processing and the Beowulf manuscript,” Literary and Linguistic Computing, vol. 6, no. 1, pp. 20–27, 1991. [Online]. Available: http://llc.oxfordjournals.org/content/6/1/20.abstract

[151] R. Scopigno, “Visual media for cultural heritage: an opportunity for as- sessing, finding limitations and enhancing technologies,” in Proceedings of 18th Central European Seminar on Computer Graphics, Smolenice, Slovakia, M. I. Michael Wimmer, Jiri Hladuvka, Ed., vol. 1. Vienna University of Technology, May 2014, pp. 5–6.

[152] S. Marinai and H. Fujisawa, Machine Learning in Document Analysis and Recognition, 1st ed. Berlin: Springer Publishing Company, Incorporated, 2010.

[153] F. Ciotti, M. Daquino, and F. Tomasi, “Text encoding initiative semantic modeling. A conceptual workflow proposal,” in Proceedings of 11th Italian Research Conference on Digital Libraries (IRCDL), Bolzano, Italy, January 2015. [Online]. Available: http://ircdl2015.unibz.it/papers/paper-10.pdf

[154] F. Boschetti, A. Cimino, F. Dell’Orletta, G. E. Lebani, L. Passaro, P. Pic- chi, G. Venturi, S. Montemagni, and A. Lenci, “Computational analysis of historical documents: An application to italian war bulletins in world war I and II,” in Proceedings of the LREC Workshop on Language Resources and i i i i

BIBLIOGRAPHY 179

Technologies for Processing and Linking Historical Documents and Archives. Deploying Linked Open Data in Cultural Heritage, (LRT4HDA), Reykjavik, May 2014, pp. 70–79.

[155] F. Ciotti and A. Ciula, Eds., The Linked TEI: Text Encoding in the Web. TEI Conference, DIGILAB, Rome, Italy 2-5 October, 2013.

[156] T. Heath and C. Bizer, Linked Data: Evolving the Web into a Global Data Space, ser. Synthesis Lectures on the Semantic Web: Theory and Technology. San Rafael, California, USA: Morgan & Claypool, 2011, vol. 1, no. 1. [Online]. Available: http://dx.doi.org/10.2200/ S00334ED1V01Y201102WBE001

[157] A. E. C. Basave, G. Rizzo, A. Varga, M. Rowe, M. Stankovic, and A. S. Dadzie, “Making sense of Microposts (#Microposts2014) named entity ex- traction and linking challenge,” in Proceedings of the 4th Workshop on Mak- ing Sense of Microposts co-located with the 23rd International World Wide Web Conference, Seoul, Korea, April 2014, pp. 54–60.

[158] M. Passarotti, “LEMLAT. Uno strumento per la lemmatizzazione morfolog- ica automatica del latino,” Journal of Latin Linguistics, vol. 9, no. 3, pp. 107–128, 2007.

[159] B. McGillivray, Methods in Latin Computational Linguistics. Leiden: Brill, 2013.

[160] F. Sebastiani, “Text categorization,” in Text Mining and its Applications, A. Zanasi, Ed. WIT Press, 2005. [Online]. Available: http://www.isti.cnr. it/People/F.Sebastiani/Publications/TM05.pdf

[161] P. Ferragina, R. Gonz´alez,G. Navarro, and R. Venturini, “Compressed text indexes: From theory to practice,” Journal of Experimental Algorithmics (JEA), vol. 13, pp. 12:1.12–12:1.31, February 2009. [Online]. Available: http://doi.acm.org/10.1145/1412228.1455268

[162] M. Ciaramita, G. Attardi, F. Dell’Orletta, and M. Surdeanu, “DeSRL: A linear-time semantic role labeling system,” in Proceedings of the 12th Conference on Computational Natural Language Learning (CoNLL), Manchester, United Kingdom. Stroudsburg, PA, USA: Association for

i i

i i 180 BIBLIOGRAPHY

Computational Linguistics, August 2008, pp. 258–262. [Online]. Available: http://dl.acm.org/citation.cfm?id=1596324.1596371

[163] I. H. Witten, E. Frank, and M. A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, 3rd ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2011.

[164] M. Abrate, A. M. Del Grosso, E. Giovannetti, A. L. Duca, D. Luzzi, L. Mancini, A. Marchetti, I. Pedretti, and S. Piccini, “Sharing cultural her- itage: the Clavius on the Web project,” in Proceedings of the 9th interna- tional conference on Language Resources and Evaluation (LREC), Reykjavik, N. Calzolari, K. Choukri, T. Declerck, H. Loftsson, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, and S. Piperidis, Eds. European Language Resources Association (ELRA), May 2014, pp. 627–634.

[165] P. Hal´acsy, A. Kornai, and C. Oravecz, “HunPos: An open source trigram tagger,” in Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics on Interactive Poster and Demonstration Sessions (ACL), Prague, Czech Republic. Stroudsburg, PA, USA: ACL, June 2007, pp. 209–212. [Online]. Available: http: //dl.acm.org/citation.cfm?id=1557769.1557830

[166] D. Bamman and G. Crane, “The ancient Greek and Latin dependency Treebanks,” in Language Technology for Cultural Heritage, ser. Theory and Applications of Natural Language Processing, C. Sporleder, A. van den Bosch, and K. Zervanou, Eds. Springer Berlin Heidelberg, 2011, pp. 79–98. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-20227-8 5

[167] H. Beigi, “Hidden Markov Modeling (HMM),” in Fundamentals of Speaker Recognition. Springer US, 2011, pp. 411–463. [Online]. Available: http://dx.doi.org/10.1007/978-0-387-77592-0 13

[168] S. Benedetto and E. Biglieri, Viterbi algorithm, ser. Information Technology: Transmission, Processing, and Storage. Springer US, 2002, ch. F, pp. 807–815. [Online]. Available: http://dx.doi.org/10.1007/0-306-46961-8 20

[169] F. de Jong, “NLP and the humanities: The revival of an old liaison,” in Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Athens, Greece. Stroudsburg, PA, USA: ACL, March-April 2009, pp. 10–15. [Online]. Available: http://dl.acm.org/citation.cfm?id=1609067.1609168

[170] E. Foster, “Introduction to software engineering,” in Software Engineering. Apress, 2014, pp. 3–20. [Online]. Available: http://dx.doi.org/10.1007/978-1-4842-0847-2_1

[171] M. Fowler, “The state of design,” IEEE Software, vol. 22, no. 6, pp. 12–13, 16, November 2005. [Online]. Available: http://dx.doi.org/10.1109/MS.2005.166

[172] R. S. Pressman, Software engineering: a practitioner’s approach. Palgrave Macmillan, 2005.

[173] W. McCarty, “Signs of times present and future,” Humanist Discussion Group, vol. 22, no. 218, 2008.

[174] A. M. Del Grosso and F. Boschetti, “Collaborative multimedia platform for computational philology, CoPhi architecture,” in Proceedings of the 5th International Conference on Advances in Multimedia (MMEDIA), Venice, Italy, P. Davis, Ed., International Academy, Research, and Industry Association. IARIA, April 2013, pp. 46–51.

[175] F. Boschetti, A. M. Del Grosso, and M. Lamé, “Data sets and software components: Adjustment and reuse,” in Proceedings of the Papyrus and the Hypertext. Athenaeus in the Scholarly Kitchen, Paris, France. Université Paris-Ouest and ANHIMA, May 2012, selected paper.

[176] Object Management Group, Unified Modeling Language Specification, 2000.

[177] G. Booch, Software Components with Ada: Structures, Tools, and Subsystems, 1st ed., ser. The Benjamin/Cummings Series in Ada and Software Engineering. Redwood City, CA, USA: Benjamin-Cummings Publishing Co., Inc., April 1987.

[178] G. T. Heineman and W. T. Councill, Eds., Component-based Software Engineering: Putting the Pieces Together. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 2001.

[179] C. Szyperski, Component Software: Beyond Object-Oriented Programming, 2nd ed. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 2002.

[180] A. Bozzi and A. M. Del Grosso, “Progettazione, sviluppo e gestione di una infrastruttura filologico-computazionale per la produzione, interrogazione e pubblicazione sul web di documenti digitali,” in Percorsi migranti: uomini, diritti, lavoro, linguaggi, G. C. Bruno, I. Caruso, M. Sanna, and I. Vellecco, Eds. Milano: Mc Graw Hill, 2011, pp. 339–369.

[181] S. Ashmore and K. Runyan, Introduction to Agile Methods. Upper Saddle River, NJ: Addison-Wesley Professional, Pearson Education, 2014.

[182] H. Beyer, User-centered Agile Methods, ser. Synthesis Lectures on Human-Centered Informatics. San Rafael, California: Morgan & Claypool, 2010.

[183] J. Singer, T. Lethbridge, N. Vinson, and N. Anquetil, “An examination of software engineering work practices,” in Proceedings of the 1st Decade High Impact Papers of Centers for Advanced Studies Conference (CASCON), Toronto, Ontario, Canada. Riverton, NJ, USA: IBM Corporation, November 2010, pp. 174–188. [Online]. Available: http://dx.doi.org/10.1145/1925805.1925815

[184] D. Rosenberg and M. Stephens, Use Case Driven Object Modeling with UML: Theory and Practice, 2nd ed. Berkeley, CA, USA: Apress, 2013.

[185] M. Collins-Cope, D. Rosenberg, and M. Stephens, Agile Development with ICONIX Process: People, Process, and Pragmatism. Berkeley, CA, USA: Apress, 2005.

[186] G. Booch, J. Conallen, B. J. Young, K. A. Houston, R. A. Maksimchuk, and M. W. Engle, Object-Oriented Analysis and Design with Applications, 3rd ed. Addison-Wesley, April 2007.

[187] S. Meyers, “The most important design guideline? [user interface],” IEEE Software, vol. 21, no. 4, pp. 14–16, July 2004. [Online]. Available: http://dx.doi.org/10.1109/MS.2004.29

[188] M. Fowler, UML Distilled: A Brief Guide to the Standard Object Modeling Language, 3rd ed. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 2003.

[189] J. Rumbaugh, I. Jacobson, and G. Booch, Unified Modeling Language Reference Manual, 2nd ed. Addison-Wesley Professional, 2010.

[190] J. Bloch, “How to design a good API and why it matters,” in Companion to the 21st ACM SIGPLAN Symposium on Object-oriented Programming Systems, Languages, and Applications (OOPSLA), Portland, Oregon, USA. New York, NY, USA: ACM, October 2006, pp. 506–507. [Online]. Available: http://doi.acm.org/10.1145/1176617.1176622

[191] J. Blanchette, The Little Manual of API Design, Trolltech, a Nokia company, June 2008.

[192] M. Cohn, User Stories Applied: For Agile Software Development. Redwood City, CA, USA: Addison-Wesley Longman Publishing Co., Inc., 2004.

[193] F. Boschetti, R. Del Gratta, A. M. Del Grosso, M. Monachini, and O. Nahli, “Collaborative philology on the way to Web services: the case of CoPhiWordnet,” in Proceedings of the 2nd International Workshop on Worldwide Language Service Infrastructure (WLSI), Kyoto, Japan, Y. Murakami and D. Lin, Eds., January 2015, selected paper.

[194] F. Boschetti, A. Bozzi, and A. M. Del Grosso, “Library of components for the computational philological domain dealing with TEI markup guidelines: CoPhiLib,” in Book of Abstracts of the TEI Conference and Members Meeting. The Linked TEI: Text Encoding in the Web, Rome, Italy, F. Ciotti and A. Ciula, Eds. Rome: Universitalia, October 2013, pp. 160–162.

[195] M. Henning, “API design matters,” Communications of the ACM, vol. 52, no. 5, pp. 46–56, May 2009. [Online]. Available: http://doi.acm.org/10.1145/1506409.1506424

[196] M. Lamé, V. Valchera, and F. Boschetti, “Epigrafia digitale. Paradigmi di rappresentazione per il trattamento digitale delle epigrafi,” Epigraphica: periodico internazionale di epigrafia, vol. 74, no. 1-2, pp. 331–338, 2012.

[197] L. Benedetti, F. Boschetti, A. M. Del Grosso, and M. Lamé, “La matière épigraphique dans un espace numérique: l’importance du support archéologique,” in Proceedings of Journées d’Informatique et Archéologie de Paris (JIAP), Paris, France. Université de Paris 1 Panthéon-Sorbonne (UFR 03), CNRS UMR 7041 ArScAn, June 2012, selected paper.

[198] F. Soler, J. C. Torres, A. J. León, and M. V. Luzón, “Design of cultural heritage information systems based on information layers,” Journal on Computing and Cultural Heritage (JOCCH), vol. 6, no. 4, pp. 15:1–15:17, December 2013. [Online]. Available: http://doi.acm.org/10.1145/2532630.2532631

[199] B. Almas and M.-C. Beaulieu, “Developing a new integrated editing platform for source documents in classics,” Literary and Linguistic Computing, vol. 28, no. 4, pp. 493–503, 2013. [Online]. Available: http://llc.oxfordjournals.org/content/28/4/493.abstract

[200] G. Hohpe and B. Woolf, Enterprise Integration Patterns: Designing, Building, and Deploying Messaging Solutions. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 2003.

[201] R. Daigneau, Service Design Patterns: Fundamental Design Solutions for SOAP/WSDL and RESTful Web Services, 1st ed. Upper Saddle River, NJ: Addison-Wesley Professional, 2011.

[202] J. Tulach, Practical API Design: Confessions of a Java Framework Architect, 1st ed. Berkeley, CA, USA: Apress, 2008.

[203] D. Buzzetti, “Digital editions and text processing,” in Text Editing, Print and the Digital World, ser. Digital Research in the Arts and Humanities, M. Deegan and K. Sutherland, Eds. Farnham, Surrey: Ashgate, 2009, pp. 45–62.

[204] F. Boschetti, “A corpus-based approach to philological issues,” Ph.D. dissertation, University of Trento, Trento, March 2010.

[205] M. Reddy, API Design for C++, 1st ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2011.

[206] F. Buschmann, K. Henney, and D. C. Schmidt, Pattern-Oriented Software Architecture, On Patterns and Pattern Languages, ser. Pattern-Oriented Software Architecture. Hoboken: John Wiley & Sons, 2007.

[207] B. Liskov, “Keynote address - data abstraction and hierarchy,” SIGPLAN Notices, Special issue: OOPSLA, vol. 23, no. 5, pp. 17–34, January 1987. [Online]. Available: http://doi.acm.org/10.1145/62139.62141

[208] R. C. Martin, “Design principles and design patterns,” Object Mentor, no. 1, pp. 1–34, 2000.

[209] J. Bloch, Effective Java, 2nd ed., ser. Java Series. Upper Saddle River, NJ, USA: Prentice Hall PTR, 2008.

[210] F. Boschetti, “Methods to extend Greek and Latin corpora with variants and conjectures: Mapping critical apparatuses onto reference text,” in Proceedings of the Corpus Linguistics Conference (CL), Birmingham, UK, M. Davies, P. Rayson, S. Hunston, and P. Danielsson, Eds. University of Birmingham, July 2007, pp. 150:1–150:11.

[211] G. Genette, Paratexts: Thresholds of Interpretation, ser. Literature, Culture, Theory. Cambridge: Cambridge University Press, 1997.

[212] E. Gamma, R. Helm, R. Johnson, and J. Vlissides, Design Patterns: Elements of Reusable Object-oriented Software. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 1995.

[213] S. Fraser, E. Gamma, R. Helm, and R. Johnson, “Design patterns: Beginnings and futures,” in Companion to the 21st ACM SIGPLAN Symposium on Object-oriented Programming Systems, Languages, and Applications (OOPSLA), Portland, Oregon, USA. New York, NY, USA: ACM, October 2006, p. 934. [Online]. Available: http://doi.acm.org/10.1145/1176617.1176748

[214] S. Montemagni, “DH@ILC,” in Collaborative Research Practices and Shared Infrastructures for Humanities Computing. Proceedings of revised papers of the 2nd Annual Conference of the Associazione per l’Informatica Umanistica e la Cultura Digitale (AIUCD), Padova, M. Agosti and F. Tomasi, Eds. CLEUP, December 2013, pp. 101–114.

[215] K. Arnold, “Programmers are people, too,” Queue, vol. 3, no. 5, pp. 54–59, June 2005. [Online]. Available: http://doi.acm.org/10.1145/1071713.1071731

[216] D. L. Parnas, P. C. Clements, and D. M. Weiss, “The modular structure of complex systems,” IEEE Transactions on Software Engineering, vol. 11, no. 3, pp. 259–266, March 1985. [Online]. Available: http://dx.doi.org/10.1109/TSE.1985.232209

[217] K. R. M. Leino and G. Nelson, “Data abstraction and information hiding,” ACM Transactions on Programming Languages and Systems, vol. 24, no. 5, pp. 491–553, September 2002. [Online]. Available: http://doi.acm.org/10.1145/570886.570888

[218] D. L. Parnas, “On the criteria to be used in decomposing systems into modules,” Communications of the ACM, vol. 15, no. 12, pp. 1053–1058, December 1972. [Online]. Available: http://doi.acm.org/10.1145/361598.361623

[219] M. P. Robillard, “What makes APIs hard to learn? Answers from developers,” IEEE Software, vol. 26, no. 6, pp. 27–34, November 2009. [Online]. Available: http://dx.doi.org/10.1109/MS.2009.193

[220] J. Stylos, A. Faulring, Z. Yang, and B. Myers, “Improving API documentation using API usage information,” in Proceedings of the IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC), Corvallis, Oregon, September 2009, pp. 119–126.

[221] T. Grill, O. Polacek, and M. Tscheligi, “Methods towards API usability: A structural analysis of usability problem categories,” in Proceedings of the 4th International Conference on Human-Centered Software Engineering, Toulouse, France. Berlin, Heidelberg: Springer-Verlag, October 2012, pp. 164–180. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-34347-6_10

[222] B. A. Myers, S. Y. Jeong, Y. Xie, J. Beaton, J. Stylos, R. Ehret, J. Karstens, A. Efeoglu, and D. K. Busse, “Studying the documentation of an API for enterprise service-oriented architecture,” Journal of Organizational and End User Computing, vol. 22, no. 1, pp. 23–51, January 2010. [Online]. Available: http://dx.doi.org/10.4018/joeuc.2010101903

[223] M. A. Khan, S. Muhammad, and T. Muhammad, “Identifying performance issues based on method invocation patterns of an API,” in Proceedings of the 18th International Conference on Evaluation and Assessment in Software Engineering (EASE), London, England, United Kingdom. New York, NY, USA: ACM, May 2014, pp. 51:1–51:6. [Online]. Available: http://doi.acm.org/10.1145/2601248.2601277

[224] C. De Roover, R. Lämmel, and E. Pek, “Multi-dimensional exploration of API usage,” in Proceedings of the 21st IEEE International Conference on Program Comprehension (ICPC), San Francisco, CA, USA. Washington, DC, USA: IEEE, May 2013, pp. 152–161.

[225] D. Dig and R. Johnson, “The role of refactorings in API evolution,” in Proceedings of the 21st IEEE International Conference on Software Maintenance (ICSM), Budapest, Hungary. Washington, DC, USA: IEEE, September 2005, pp. 389–398.

[226] G. Rooney, “Preserving backward compatibility,” FactSet Research Systems, Open Paper on O’Reilly OnLamp.com, 2005. [Online]. Available: http://www.onlamp.com/pub/a/onlamp/2005/02/17/backwardscompatibility.html

[227] H. Boeck, The Definitive Guide to NetBeans™ Platform, 1st ed. Apress, May 2009.

[228] M. Naftalin and P. Wadler, Java Generics and Collections. O’Reilly Media, Inc., 2006.

[229] C. Parnin, C. Bird, and E. Murphy-Hill, “Adoption and use of Java generics,” Empirical Software Engineering, vol. 18, no. 6, pp. 1047–1089, 2013. [Online]. Available: http://dx.doi.org/10.1007/s10664-012-9236-6

[230] F. Buschmann, R. Meunier, H. Rohnert, P. Sommerlad, and M. Stal, Pattern-Oriented Software Architecture: A System of Patterns, ser. Pattern-Oriented Software Architecture. Wiley India Pvt. Limited, 2008, vol. 1.

[231] P. Mastandrea and L. Tessarolo, “Da Musisque Deoque a Memorata Poetis. Le vie della ricerca intertestuale,” in Collaborative Research Practices and Shared Infrastructures for Humanities Computing. Proceedings of revised papers of the 2nd Annual Conference of the Associazione per l’Informatica Umanistica e la Cultura Digitale (AIUCD), Padova, M. Agosti and F. Tomasi, Eds. CLEUP, December 2013, pp. 69–80.

[232] A. Di Iorio, M. Schirinzi, F. Vitali, and C. Marchetti, “A natural and multi-layered approach to detect changes in tree-based textual documents,” in Enterprise Information Systems, ser. Lecture Notes in Business Information Processing, J. Filipe and J. Cordeiro, Eds. Springer Berlin Heidelberg, 2009, vol. 24, pp. 90–101. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-01347-8_8

[233] G. Barabucci, A. Di Iorio, S. Peroni, F. Poggi, and F. Vitali, “Annotations with EARMARK in practice: a fairy tale,” in Proceedings of the 1st International Workshop on Collaborative Annotations in Shared Environment: metadata, vocabularies and techniques in the Digital Humanities (DH-CASE), Florence, Italy. ACM, September 2013, pp. 11:1–11:8. [Online]. Available: http://doi.acm.org/10.1145/2517978.2517990

[234] S. Peroni, F. Poggi, and F. Vitali, “Tracking changes through EARMARK: a theoretical perspective and an implementation,” in Proceedings of the 1st International Workshop on Document Changes: Modeling, Detection, Storage and Visualization (DChanges), Florence, Italy. ACM DocEng, September 2013.

[235] F. Meyerer and O. Hummel, “Towards plug-and-play for component-based software systems,” in Proceedings of the 19th International Doctoral Symposium on Components and Architecture (WCOP), Marcq-en-Bareul, France. New York, NY, USA: ACM, June-July 2014, pp. 25–30. [Online]. Available: http://doi.acm.org/10.1145/2601328.2601334

[236] M. Piotrowski, Natural language processing for historical texts, ser. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers, 2012, vol. 5, no. 2.

[237] N. White, “Training tesseract for ancient Greek OCR,” Eutypon, vol. 28–29, pp. 1–12, October 2012.

[238] B. Robertson, C. Dalitz, and F. Schmitt, “Automated page layout simplification of Patrologia graeca,” in Proceedings of the 1st International Conference on Digital Access to Textual Cultural Heritage (DATeCH ’14), Madrid, Spain. New York, NY, USA: ACM, May 2014, pp. 167–172. [Online]. Available: http://doi.acm.org/10.1145/2595188.2595213

[239] F. Boschetti, M. Romanello, A. Babeu, D. Bamman, and G. Crane, “Improving OCR accuracy for classical critical editions,” in Research and Advanced Technology for Digital Libraries, ser. Lecture Notes in Computer Science, M. Agosti, J. Borbinha, S. Kapidakis, C. Papatheodorou, and G. Tsakonas, Eds. Springer Berlin Heidelberg, 2009, vol. 5714, pp. 156–167. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-04346-8_17

[240] A. M. Del Grosso and F. Boschetti, “Parallel OCR for ancient Greek critical editions,” in Science and Supercomputing in Europe, Research Highlights. HPC-Europa2, 2012.

[241] J. He, Q. D. M. Do, A. C. Downton, and J. H. Kim, “A comparison of binarization methods for historical archive documents,” in Proceedings of 8th International Conference on Document Analysis and Recognition (ICDAR), Seoul, South Korea, vol. 1. Washington, DC, USA: IEEE Computer Society, August-September 2005, pp. 538–542. [Online]. Available: http://dx.doi.org/10.1109/ICDAR.2005.3

[242] A. M. Del Grosso, S. Marchi, F. Murano, and L. Pesini, “A collaborative tool for philological research: experiments on Ferdinand de Saussure’s manuscripts,” in Proceedings of revised papers of the 2nd Annual Conference of the Associazione per l’Informatica Umanistica e la Cultura Digitale. Collaborative Research Practices and Shared Infrastructures for Humanities Computing (AIUCD), Padova, M. Agosti and F. Tomasi, Eds. Padova: CLEUP, December 2013, pp. 163–175.

[243] A. M. Del Grosso and S. Marchi, “Una applicazione web per la filologia computazionale: un esperimento su alcuni scritti autografi di Ferdinand de Saussure,” in Guida per una edizione digitale dei manoscritti di Ferdinand de Saussure, D. Gambarara and M. P. Marchese, Eds. Edizioni dell’Orso, 2013, pp. 131–157.

[244] A. M. Del Grosso, “Indexing techniques and variant readings management,” Studia graeco-arabica, vol. 3, pp. 209–230, 2013.

[245] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. New York, NY, USA: Cambridge University Press, 2008.

[246] M. Manca, L. Spinazzè, P. Mastandrea, L. Tessarolo, and F. Boschetti, “Musisque deoque: Text retrieval on critical editions,” in Proceedings of the Workshop on Annotation of Corpora for Research in the Humanities, Heidelberg, Germany, F. Mambrini, M. Passarotti, and C. Sporleder, Eds., vol. 26, no. 2. JLCL, 2012, pp. 127–138.

[247] L. Pesini, A. M. Del Grosso, and A. Bozzi, “F. de Saussure e la linguistica romanza. Un’applicazione web per l’edizione elettronica dei manoscritti,” in Proceedings of the 27th Congrès international de linguistique et de philologie romanes. Section 16: Projets en cours ; ressources et outils nouveaux (CILPR), Nancy, France, A. Lemaréchal, P. Koch, and P. Swiggers, Eds., Laboratoire ATILF (CNRS and Université de Lorraine) et la Société de linguistique romane. ATILF, July 2013.

[248] Y. Bizzoni, F. Boschetti, R. Del Gratta, H. Diakoff, M. Monachini, and G. Crane, “The making of ancient Greek WordNet,” in Proceedings of the 9th Language Resources and Evaluation Conference (LREC), Reykjavik, Iceland, N. Calzolari, K. Choukri, T. Declerck, H. Loftsson, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, and S. Piperidis, Eds. European Language Resources Association (ELRA), May 2014, pp. 1140–1147.

[249] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval: The Concepts and Technology behind Search, 2nd ed. Harlow: Addison-Wesley Professional, 2011.

[250] E. Pierazzo, “Digital documentary editions and the others,” Scholarly Editing, vol. 35, 2014. [Online]. Available: http://www.scholarlyediting.org/2014/essays/essay.pierazzo.html

[251] F. Boschetti, “La localizzazione del Perseus Project in lingua italiana,” Quaderni DigiLab, vol. 3, no. 1, pp. 221–234, 2014.

[252] ——, “Acquisizione e creazione di risorse plurilingui per gli studi di filologia classica in ambienti collaborativi,” in Collaborative Research Practices and Shared Infrastructures for Humanities Computing. Proceedings of revised papers of the 2nd Annual Conference of the Associazione per l’Informatica Umanistica e la Cultura Digitale (AIUCD), Padova, M. Agosti and F. Tomasi, Eds. CLEUP, December 2013, pp. 55–68.

[253] A. Bellandi, A. Bellusci, A. Cappelli, and E. Giovannetti, “Graphic visualization in literary text interpretation,” in Proceedings of the 18th International Conference on Information Visualisation (IV), University of Paris Descartes, Paris, France, July 2014, pp. 392–397.

[254] D. Fabbri, “Sistema knowledge based interattivo per la gestione visuale dell’apparato critico di un testo,” Master’s thesis, Università degli studi di Siena, Facoltà di Ingegneria, Siena, 1998-1999.

[255] A. Bozzi, M. S. Corradini, and B. Tellez, “The EUMME project: towards a new philological workstation,” in Proceedings of the 9th International Conference on Electronic Publishing (ICCC), Leuven, Belgium, June 2005, pp. 139–144.

[256] P. Canettieri, G. Santini, M. Rovetta, and V. Loreto, “Philology and information theory: Towards an integrated approach,” in The Evolution of Texts: Confronting Stemmatological and Genetical Methods, C. Macé, P. Baret, A. Bozzi, and L. Cignoni, Eds., vol. 24–25, Proceedings of the international workshop held in Louvain-la-Neuve on September 1-2, 2004. Pisa-Roma: Istituti editoriali e poligrafici internazionali, June 2006, pp. 109–126. [Online]. Available: http://digital.casalini.it/an/2212902

[257] M. Abrate, A. M. Del Grosso, E. Giovannetti, A. Lo Duca, A. Marchetti, L. Mancini, I. Pedretti, and S. Piccini, “Il progetto Clavius on the Web: tecnologie linguistico-semantiche al servizio del patrimonio documentale e degli archivi storici,” in Book of Abstracts del 3° convegno dell’Associazione Italiana Informatica Umanistica e Cultura Digitale. La metodologia della ricerca umanistica nell’ecosistema digitale (AIUCD), Bologna, F. Rossi and F. Tomasi, Eds., September 2014.

[258] A. Bellandi, A. Bellusci, E. Carniani, and E. Giovannetti, “Content elicitation: Towards a new paradigm for the analysis and interpretation of texts,” in Proceedings of the 13th IASTED International Conference on Software Engineering, Innsbruck, Austria, February 2014.

[259] F. Mambrini, M. Passarotti, and C. Sporleder, Eds., Annotation of Corpora for Research in the Humanities, Proceedings of the ACRH Workshop, Journal for Language Technology and Computational Linguistics, Heidelberg, 5 January. Heidelberg: JLCL, 2011.

[260] G. Crane, “Generating and parsing classical Greek,” Literary and Linguistic Computing, vol. 6, no. 4, pp. 243–245, 1991. [Online]. Available: http://llc.oxfordjournals.org/content/6/4/243.abstract

[261] C. Oravecz and P. Dienes, “Efficient stochastic part-of-speech tagging for Hungarian,” in Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC), Las Palmas, Canary Islands, Spain. European Language Resources Association (ELRA), May 2002, pp. 710–717.

[262] D. Bamman, M. Passarotti, R. Busa, and G. Crane, “The annotation guidelines of the Latin dependency Treebank and Index Thomisticus Treebank: the treatment of some specific syntactic constructions in Latin,” in Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC), Marrakech, Morocco, N. Calzolari, K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis, and D. Tapias, Eds. European Language Resources Association (ELRA), May 2008, pp. 71–76.