
Journal of the Text Encoding Initiative

Issue 8 | December 2014 - December 2015 Selected Papers from the 2013 TEI Conference

Arianna Ciula and Fabio Ciotti (dir.)

Electronic version URL: http://journals.openedition.org/jtei/1025 DOI: 10.4000/jtei.1025 ISSN: 2162-5603

Publisher TEI Consortium

Electronic reference Arianna Ciula and Fabio Ciotti (dir.), Journal of the Text Encoding Initiative, Issue 8 | December 2014 - December 2015, « Selected Papers from the 2013 TEI Conference » [Online], Online since 28 December 2014, connection on 23 September 2020. URL : http://journals.openedition.org/jtei/1025 ; DOI : https://doi.org/10.4000/jtei.1025


TEI Consortium 2014 (Creative Commons Attribution-NoDerivs 3.0 Unported License)

TABLE OF CONTENTS

Editorial Introduction to Issue 8 of the Journal of the Text Encoding Initiative Arianna Ciula and Fabio Ciotti

TEI in Relation to Other Semantic and Modeling Formalisms

TEI, Ontologies, Linked Open Data: Geolat and Beyond Fabio Ciotti, Maurizio Lana and Francesca Tomasi

The Linked Fragment: TEI and the Encoding of Text Reuses of Lost Authors Monica Berti, Bridget Almas, David Dubin, Greta Franzini, Simona Stoyanova and Gregory R. Crane

TEI in LMNL: Implications for Modeling Wendell Piez

Ontologies, Data Modeling, and TEI Øyvind Eide

TEI to Express Domain Specific Text and Data Models

Texts and Documents: New Challenges for TEI Interchange and Lessons from the Shelley- Godwin Archive Trevor Muñoz and Raffaele Viglianti

Manus OnLine and the Text Encoding Initiative Schema Giliola Barbero and Francesca Trasselli

The DTA “Base Format”: A TEI Subset for the Compilation of a Large Reference Corpus of Printed Text from Multiple Sources Susanne Haaf, Alexander Geyken and Frank Wiegand

ReMetCa: A Proposal for Integrating RDBMS and TEI-Verse Elena González-Blanco and José Luis Rodríguez

Modeling Frequency Data: Methodological Considerations on the Relationship between Dictionaries and Corpora Karlheinz Mörth, Laurent Romary, Gerhard Budin and Daniel Schopper

TEI Processing: Workflows and Tools

From Entity Description to Semantic Analysis: The Case of Theodor Fontane’s Notebooks Martin de la Iglesia and Mathias Göbel

Edition Visualization Technology: A Simple Tool to Visualize TEI-based Digital Editions Roberto Rosselli Del Turco, Giancarlo Buomprisco, Chiara Di Pietro, Julia Kenny, Raffaele Masotti and Jacopo Pugliese

TEI4LdoD: Textual Encoding and Social Editing in Web 2.0 Environments António Rito Silva and Manuel Portela


Bridging the Gap: Greater Usability for TEI encoding Stefan Dumont and Martin Fechner

TeiCoPhiLib: A Library of Components for the Domain of Collaborative Philology Federico Boschetti and Angelo Mario Del Grosso

“Reports of My Death Are Greatly Exaggerated”: Findings from the TEI in Libraries Survey Michelle Dalmau and Kevin Hawkins


Editorial Introduction to Issue 8 of the Journal of the Text Encoding Initiative

Arianna Ciula and Fabio Ciotti

1 We are pleased to introduce the eighth issue of the Journal of the Text Encoding Initiative featuring selected peer-reviewed papers from the 2013 TEI Conference and Members' Meeting, which was held at Università della Sapienza in Rome, 2–5 October.

2 Like all TEI conferences, this was an international gathering and, compared to other conferences, enjoyed an exceptional and diverse range of participants. The collaboration between the TEI Board of Directors and the Associazione Italiana di Informatica Umanistica e Cultura Digitale (AIUCD) to ensure an Italian host for this event also aimed at marking the long-standing involvement of the Italian community in TEI. Indeed, Italian scholars have made considerable contributions to the development of TEI and Digital Humanities as a whole in terms of theory, technologies, and organisation. We must also mention the name of the late Antonio Zampolli, who contributed enormously to the conception and implementation of the TEI initiative and who hosted the first Members' Meeting of the Consortium in Pisa in 2001. Since the 1990s scholars at Università della Sapienza in Rome have played an equally important role in raising awareness of the importance of TEI encoding for formal textual representation. In particular, thanks to the research group led by the late Giuseppe Gigliozzi, research and teaching activities related to TEI have spread throughout Italy.

3 The theme of the conference—“The Linked TEI: Text Encoding in the Web”—focused on the role of text encoding and TEI in a networked cultural environment. This focus encouraged reflections on the semantics of the TEI conceptual model, on the relationships among TEI and other data models and encoding schemas, on the role of TEI within a framework of interconnected digital resources as well as on the diverse ways users can access and take advantage of TEI-encoded resources. The title hints obviously at a very topical theme in the digital realm: the emergence and diffusion of the Linked Data paradigm and of a Participatory Web. TEI has had a crucial—and widely recognised—role in encouraging and facilitating the creation of vast quantities of textual and linguistic resources. It has been one of the major enabling technologies in the Digital Humanities. However, the dominant paradigm in the creation of digital resources, especially in the academic domain, has been that of the individual archive, the single monolithic or boutique project, perfectum in itself. To continue in its role of stimulus for innovation in the Digital Humanities, TEI has to be able to embrace fully—albeit critically—the new paradigm. The sharing and the interconnection of data on the Web as well as the emergence of semantically-enriched data are interesting aspects of technological innovation which will bring about new developments. The vision around “Linked TEI” also encompasses issues of multilingualism and multiculturalism: to be connected means to recognise and collaborate with different traditions and languages.

4 Contributions to the conference call for papers responded very well to the challenge with a rich range of topics and perspectives represented in the programme: from reflections on semantic models and textuality, to data modelling and analysis, from rethinking research infrastructures and developing participatory approaches, to establishing bi-directional linking between dictionaries and corpora. Among the many papers, posters, and panels accepted and presented at the conference, a large number were submitted for consideration and peer review for this issue of the Journal of the Text Encoding Initiative, resulting in the largest issue of the journal published to date.

5 The set of articles published here follows three main threads of interest around the TEI: 1. TEI in relation to other semantic and modeling formalisms; 2. TEI as an expression of domain-specific text and data models; 3. TEI processing workflows and tools.

6 The first topic is opened by Ciotti, Lana, and Tomasi’s article. In line with the main theme of the conference, this paper describes the architecture of the geolat project, a digital library of Latin texts enriched with annotations related to places and persons. Based on different ontologies and languages, these annotations can be formally analysed and shared as Linked Open Data. Also within the context of linked environments, Berti et al. introduce the Leipzig Open Fragmentary Texts Series project, the goals of which are the creation of linked and annotated digital editions of classical fragmentary texts and the development of the Perseids Platform, a collaborative environment for crowdsourced annotations. Moving from practical use-cases to the theoretical consideration of diverse modeling frameworks, Eide’s and Piez’s papers reveal the constraints of the tools we use as well as the different views they cast on the material under study. In particular, Piez presents the non-hierarchical modelling language LMNL (the Layered Markup and Annotation Language) and how it can be combined with the TEI-XML formalism to enhance its representational capability. Eide describes different methods of establishing meaning in TEI documents via linking mechanisms between TEI and external ontologies and how these different methods determine the semantic openness and usability of TEI documents.

7 The second cluster collects a set of articles related to the adaptation of the TEI encoding scheme to specific complex data and document types. Using the Shelley-Godwin Archive (S-GA) as a case study, Muñoz and Viglianti discuss the challenges of producing TEI-encoded data and an accompanying reading environment able to support both document-focused and text-focused approaches. Barbero and Trasselli describe the data model developed by the Istituto Centrale per il Catalogo Unico (ICCU) to export manuscript descriptions from Manus OnLine (MOL), the Italian national catalogue of manuscripts, to TEI documents to promote consistency in manuscript descriptions as well as data interchange. Haaf, Geyken, and Wiegand give an overview of the DTA “Base Format” (DTABf), a strict subset of the TEI P5 tag set developed by the Deutsches Textarchiv (DTA) project to represent text types in historical corpora generated out of multiple printed sources. González-Blanco and Rodríguez discuss the use of the TEI Verse module to tag metrical and poetic structures in the Repertorio Digital de Métrica Medieval Castellana (ReMetCa) project, where TEI-XML is integrated into a Relational Database Management System (RDBMS). Last but not least, Mörth et al. propose a customization of the TEI dictionary module to respond to the new requirements emerging from an environment of increasingly intertwined language resources, and in particular to the need to include statistical data obtained from language resources in lexicographic workflows.

8 The contributions within the third set of papers describe various approaches and tools to create, manage, and disseminate TEI-encoded resources in a networked environment. De la Iglesia and Göbel’s work uses XML and related Web technologies to process semantic entities and produce data-driven visualizations of the forthcoming edition of Theodor Fontane’s notebooks as well as comparisons with other edition projects using TEI. Rosselli Del Turco et al. describe EVT (Edition Visualization Technology), a TEI-based scholarly publishing tool under development since about 2010. Initially designed as a specific solution for the Digital Vercelli Book project, EVT has evolved into a flexible tool to create Web-based digital editions incorporating transcription files encoded in TEI XML and digital facsimiles of the source texts. Portela and Rito Silva’s article presents the rationale and the technical approaches adopted for the digital edition of Fernando Pessoa’s Livro do Desassossego (LdoD). Starting from the intrinsic complexity of this unfinished work, this project offers a digital model of all its authorial and editorial manifestations and aims to create a dynamic platform open to user interaction and collaboration. Dumont and Fechner’s paper illustrates the user-friendly and bottom-up design principles that have guided the development of the ediarum TEI editing and publishing platform. Similarly, Boschetti and Del Grosso describe the design and development of TeiCoPhiLib, a library of Java software components devoted to editing, processing, and visualising TEI documents in the domain of philological studies. This tool is particularly suited to fostering collaborative philological work. This cluster is closed by Dalmau and Hawkins’s article, the focus of which is not on specific tools or platforms, but rather on the impact and adoption of TEI and text encoding practices in the library community.

9 We conclude this introduction with a few words about how this issue fits within the current developments of TEI and of the Digital Humanities community more generally. Every now and then it is healthy to ask where the TEI is going and why we care about it. With the growing emergence of Digital Humanities curricula and research positions, the establishment of digital workflows and resources in the cultural heritage sector, and the ongoing transition from a print to digital scholarly communication cycle, it is easy to get caught in the vortex. With the flurry of new activity in Digital Humanities today, has the time and place for the TEI come and gone? A step back to think of where we are—and therefore where we are going—implies a reconsideration of the historical focus of the TEI: texts, and in particular the modeling of texts for representation and processing purposes. It is this historical focus together with a periodical shifting of its limits and limitations that ensures the continuing relevance of the TEI: the meanings of texts change; the term “text” itself is perceived and applied in a wide sense; what we want to do with texts changes. The heart of our research endeavours continues to be the slippery confines of our cultural productions, to be seen, analysed, reflected upon, deconstructed, formalized, processed, remediated, and re-interpreted.

10 The key to ensuring that the TEI will continue its valuable contributions to scholarship and culture more generally lies therefore also in a slight but crucial tilt to its rhetoric: the TEI is not only about delivering a standard but rather, and perhaps more significantly, about the intellectual activity of creating and developing that standard in partnership with the diverse communities of researchers, archivists, librarians, and other professionals of the cultural heritage sector, software developers, infrastructure providers, artists, citizens. A linked TEI looks less and less like a jigsaw where all pieces are cut to fit together, and more like an intertwining of hands.

11 Enjoy the reading.

AUTHORS

ARIANNA CIULA Arianna Ciula is Research Facilitator at University of Roehampton (London, UK), where she supports research and enterprise strategies and actively contributes to the research profile and networks in particular of the Department of Humanities. She is the secretary of EADH, member of the ADHO Steering Committee and member of CLARIN Scientific Advisory Board. She served in the TEI Technical Council as well as TEI Board.

FABIO CIOTTI Fabio Ciotti is Assistant Professor at the University of Roma Tor Vergata, where he teaches Digital Literary Studies and Theory of Literature. He is President of the Associazione per l’Informatica Umanistica e la Cultura Digitale (AIUCD, the Italian digital humanities association) and member of the EADH Executive Board. He has been scientific consultant for text encoding and technological infrastructures in several digital libraries and archives projects, and has served in the TEI Technical Council.


TEI in Relation to Other Semantic and Modeling Formalisms


TEI, Ontologies, Linked Open Data: Geolat and Beyond

Fabio Ciotti, Maurizio Lana and Francesca Tomasi

AUTHOR'S NOTE

This research is supported by Compagnia di San Paolo Foundation. Fabio Ciotti wrote sections 3.1, 3.2, and 4; Maurizio Lana wrote sections 1, 2, and 5; Francesca Tomasi wrote sections 1, 3.3, 3.4, and 3.5; the authors, however, share with each other and with others (Diego Magro, Silvio Pieroni, and Fabio Vitali) the whole scientific responsibility for the project.

1. Introduction

1 The dialogue that TEI started with other semantic models has two aims: data and document interchange as well as the improvement of the editors’ ability to formally declare hermeneutical positions. The TEI schemas include most of the elements and attributes (and now classes as well) to provide interpretations, while other non-TEI schemas, such as the EAC-CPF (Encoded Archival Context—Corporate Bodies, Persons and Families) metadata element set, may be employed to enhance or augment the TEI model. On the one hand, additional schemas could contribute to perfecting the scope of some TEI elements; on the other hand, other semantic models, such as ontologies, could improve the effectiveness of interpretations.

2 Speaking about models means reflecting on the relationships between TEI and ontologies, but also moving towards new emerging semantic paradigms. People working with texts today have some interesting opportunities that the editorial domain is encouraging, such as: describing or enriching texts through TEI encoding; formalizing their knowledge of textual content(s) by using one or more ontologies; offering open access to their work in order to guarantee dissemination; and making their data interoperable by means of linked open data.


3 Moreover, we could say that these options exist because digital libraries are becoming increasingly widespread. The availability of collections has stimulated, as a natural result, the growth of new “movements.” Actually, without digital text collections, these opportunities would be mostly theoretical; however, we should also consider the resulting question that arises: What can we do with digital texts that cannot be done with non-digital texts? Each of the opportunities mentioned above entails interesting implications:
1. TEI encoding is sufficiently flexible to allow text to incorporate annotation using many different approaches, including structural, syntactic, and linguistic. Many diverse document and content types may be described using TEI.
2. Ontologies allow the formal description of specific knowledge domains.
3. Open access tries to break existing barriers to access and fosters collaboration and knowledge sharing.
4. The linked open data (LOD) mechanism allows data and projects which were not originally intended to work together to interconnect by implementing a common method for describing and publishing data.

4 All the features described above are fundamental to the aims of Geolat—Geography for Latin Literature, a global project for Latin literature annotation which makes use of TEI encoding and other ontologies. The present research therefore intends to address the issues related to TEI collections management from a semantic perspective by presenting a specific case study.

5 The paper is organized as follows: first we introduce the project in order to explain the steps needed to complete it (section 2); then we describe the ontological modeling and the TEI encoding in Geolat, by focusing on both places and persons (section 3). We then discuss the annotation data model (section 4). In conclusion, we introduce some future directions (section 5).

2. The Geolat Project

6 Starting from the need to enrich the TEI encoding with an in-depth formalization of the process steps, the Geolat project aims to propose an extensible model of a digital library in the literary domain: Geolat will semantically annotate texts using many different ontologies, describing places, persons, and textual objects. Additionally, it will publish its content as open-access LOD. The project is led by an interdisciplinary group of scholars, including historians, geographers, classicists, language philosophers, science philosophers, librarians, digital humanities researchers, computer scientists, archeologists, and comparative literature scholars. The main idea is that if places and personal names are formally annotated in a given text collection, those texts can be studied through the knowledge the texts themselves include. It will then be possible, for instance, to answer the question “who was where?”—the places where a given person was at a given time, or the persons who, at a given time, were in a given place— and show the answers not simply with the usual text tools (concordance-like) but also through maps (for places) or graphs (for persons).


7 From a conceptual standpoint, four simple steps are needed to complete the project:
1. the building of a global digital Latin library from the Archaic period up to the fifth century CE, joining different existing available corpora:
   ◦ classical Latin texts from Packard Humanities Institute Classical Latin Texts (PHI) CD-ROM which are now freely accessible, according to Italian and EU laws on intellectual property rights;
   ◦ late Latin texts, from DigilibLT (the Digital Library of Late-Antique Latin Texts);1
   ◦ the Latin Grammarians in the collection prepared by Nino Marinone as the basis for Grammaticus: An Index to the Latin Grammar Texts;2
   ◦ possibly, the juridical texts from Bibliotheca Iuris Antiqui (BIA);3
   ◦ possibly, the Latin poetry from Musisque Deoque: Un archivio digitale di poesia Latina (MQDQ);4
2. the building of an ontology. This step will involve a connection between different models starting from some specific issues—describing places and integrating descriptions of people in EAC-CPF (plus the related ontologies illustrating the textual objects)—and from the consideration that roles and functions change in relation to the context. Further, we need to consider that existing geographical ontologies are not (completely) suitable for describing the classical world because, for example, some kinds of ancient places have no immediate equivalent in contemporary conceptualizations (take for instance the notion of castrum in Latin culture); a city may be founded by a deity or by a human; or you may be dealing with nomadic populations whose “places” are impossible to describe similarly to those of non-nomadic populations;
3. annotation of every personal and geographical name in the texts using a geographical ontology, which means:
   ◦ assigning IRIs/URIs to textual objects like books, chapters, paragraphs, words;
   ◦ associating Pleiades5 URIs with geographical names;
   ◦ associating VIAF6 IDs with personal names for the authority control;
   ◦ integrating a geographical ontology with specific concepts for the classical world and classical texts;
   ◦ integrating the EAC-CPF model and ontology for managing roles and functions of mentioned people.
   The documents are marked up with very light TEI/XML encoding that describes document structures and philological phenomena. The textual segments (nouns and nominal phrases) referring to a person or geographical entity are explicitly encoded through specific TEI elements (see section 3). Only instances of individual persons or geographical instances are encoded (while general or abstract geographical concepts are not). The encoding process for places and names is implemented in two phases: first a Named Entity Recognition (NER) system extracts references from primary sources (Isaksen et al. 2012); then these references are edited by human experts.
4. the use of annotated texts in order to facilitate specific research questions related to geography or named persons. This procedure will allow, for example,
   ◦ general users to start reading texts through a geographic interface: for example, a map where they can choose an area, or a person interface through which they can choose from a typology of activities and roles;
   ◦ scholars to discover concepts or information hidden by the standard textual interface usually adopted to access the documents;
   ◦ mobile and augmented reality devices which display passages describing the physical place where you are.


3. Ontological Modeling and TEI Encoding in Geolat

8 The logical architecture of the Geolat project is a complex one, requiring many levels of ontological modeling and textual encoding. Some general methodological principles are:
• maintaining the distinction among different levels of abstraction and adopting the most efficient formalism for each level, trying to avoid unneeded complexity;
• minimizing the amount of semantic information directly expressed at the inline markup level in favor of stand-off markup. In this way, it is possible to favor readability, portability, and maintenance of primary resources;
• adopting Semantic Web languages to express semantic information (geographic and prosopographical data) and rigorously defining the annotations’ intended semantics with the use of formal ontologies—which allows us to exploit the reasoning capabilities of inference engines and semantic stores;
• allowing the gradual extension and modification of geographical descriptions over time;
• facilitating interoperability with other repositories and sets of geographic data expressed as linked data.

9 In the following sections, the problems related to the description of places (sections 3.1 and 3.2) and persons (sections 3.3, 3.4, and 3.5) will be explained.

3.1 Distinction between Semantic Geographical Data and Geographical Ontology

10 The distinction between semantic geographical data and geographical ontology is analogous to the distinction between assertion box (ABox) and terminological box (TBox) which is made in description logic. Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. It is this construct for which Structure Dynamics generally reserves the term ‘ontology.’ The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (or individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts. Both the TBox and ABox are consistent with set-theoretic principles. (Bergman 2009)

11 According to the aforementioned distinction, the geographical semantic data (GSD) in Geolat consists of a set of RDF statements defining the individual geographic features. Each individual entity is:
1. identified by an IRI;
2. associated with one or more membership classes;
3. optionally associated with one or more persons;
4. identified by a set of properties:
   ◦ geographic coordinates in GPS format;
   ◦ placename(s) with chronological information about its/their use over time;
   ◦ itineraries (such as pilgrimage or military expedition) of which the place is a part;
   ◦ historical, geographical, cultural annotations (such as etymology; typology of settlement: city, castrum; reason for being mentioned: battlefield);
   ◦ links to IRIs/URIs in other data sets such as Pleiades, Geonames, and DBpedia.
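By way of illustration, a single GSD record might look like the following Turtle sketch. The geolat: class and property names (geolat:River, geolat:hasCoordinates, geolat:attestedName), the namespace IRI, the coordinate values, and the Pleiades identifier are all hypothetical placeholders; only the entity IRI pattern geolat:geoData/Tiber echoes the examples given later in this paper.

@prefix geolat: <http://example.org/geolat/> .   # placeholder namespace
@prefix owl:    <http://www.w3.org/2002/07/owl#> .

<http://example.org/geolat/geoData/Tiber>
    a geolat:River ;                                        # membership class (illustrative)
    geolat:hasCoordinates "41.73,12.23" ;                   # GPS-format coordinates (placeholder values)
    geolat:attestedName "Albula", "Tiberis" ;               # placename(s) used over time
    owl:sameAs <http://pleiades.stoa.org/places/000000> ,   # placeholder Pleiades URI
               <http://dbpedia.org/resource/Tiber> .        # link to another data set (DBpedia)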

12 The geographic ontology (GO) contains the formal representation of general concepts and relationships. The formalism we have adopted for ontological modeling is OWL 2 RL. OWL 2 RL enables the implementation of polynomial time reasoning algorithms using rule-extended database technologies operating directly on RDF triples; it is particularly suitable for applications where relatively lightweight ontologies are used to organize large numbers of individuals and where it is useful or necessary to operate directly on data in the form of RDF triples. (W3C OWL Working Group 2012, sec. 2.4)

13 Geographic ontology provides concepts such as settlement, city, person, physical location, sea, and river, along with their properties; specifies their taxonomy; and defines relations among the “occurrences of concepts,” such as the located in relation, which associates the name of a settlement with its physical place, or the founded by relation, which relates a city to its founders.
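As a rough sketch of what such definitions could look like in OWL (Turtle syntax), the fragment below declares a small taxonomy and the two relations just mentioned; all class and property names here are illustrative assumptions, not the project's published ontology.

@prefix :     <http://example.org/geolat/ontology/> .   # placeholder namespace
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

:Settlement       a owl:Class .
:City             a owl:Class ; rdfs:subClassOf :Settlement .   # part of the taxonomy
:PhysicalLocation a owl:Class .
:Person           a owl:Class .

:locatedIn a owl:ObjectProperty ;    # associates a settlement with its physical place
    rdfs:domain :Settlement ;
    rdfs:range  :PhysicalLocation .

:foundedBy a owl:ObjectProperty ;    # relates a city to its founders (human or deity)
    rdfs:domain :City ;
    rdfs:range  :Person .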

14 The design of this ontology must address various theoretical problems, such as the status of fictional places; the need to draw a distinction between purely fictional places and places whose past existence is acknowledged by the textual tradition but which do not have any certain location nor material vestiges (for instance the city of Alba Longa, according to Titus Livius founded by Ascanio, son of Aeneas); the possibility for certain entities to have distinct ontological properties according to different textual traditions (one city, for instance, can have various foundation stories, with different actors—be they real or fictional).

15 Of course this requires that we state explicitly in the ontology the textual context where each property value is valid.

3.2 Connection between Geographical Data and TEI Markup Vocabulary

16 TEI offers a rich vocabulary for the annotation of textual segments that include references to places and geographical entities (TEI Consortium 2015, ch. 13, “Names, Dates, People, and Places”). For our annotation goals, the two most important TEI elements are:
1. <placeName>, which identifies a noun or phrase referring to a geopolitical or administrative entity;
2. <geogName>, which identifies a noun or phrase referring to a physical place.

17 Their key attributes are:
1. @xml:id, which uniquely identifies the encoded element; the purpose is to allow inverse references from geographical RDF assertions to the text passages that mention those entities;
2. @ref, which includes the URI that identifies the geographical entity referred to by the text passage in the GSD.


18 See this example from Titus Livius, Ab urbe condita libri, 1,3:

pax ita conuenerat ut Etruscis Latinisque fluuius Albula, quem nunc Tiberim uocant, finis esset.
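A sketch of how this passage might be marked up with the elements described above is given below; the @xml:id values mirror those used in the RDF statements quoted later in this section, while the @ref values are placeholders for the entity's IRI in the GSD.

<!-- hypothetical encoding; @ref values are illustrative placeholders -->
<p>pax ita conuenerat ut Etruscis Latinisque fluuius
  <geogName xml:id="tiber01" ref="http://example.org/geolat/geoData/Tiber">Albula</geogName>,
  quem nunc <geogName xml:id="tiber02" ref="http://example.org/geolat/geoData/Tiber">Tiberim</geogName>
  uocant, finis esset.</p>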

19 As shown above, the TEI attribute @ref allows us to explicitly connect every geographical expression in the text to its RDF description. However, it is also necessary to express the reverse connection from the RDF description to the text. The rationale for this requirement is twofold: from the practical point of view, it is easier to link the query (and inference) results to the relevant text portions; from the methodological point of view, we are sure that the semantic data set is self-consistent and potentially autonomous from the text collection.

20 The simplest method to obtain this result consists in assigning a “textual instance” property to the URI that identifies the entity. The value of this textual instance property is an XPointer expression identifying the corresponding element in the XML/TEI file. This structure is illustrated in the RDF statement below:
1. subject: geographical entity URI
2. predicate: textualInstance assigns a textual location as an instance of the entity
3. object: XPointer reference to the element that contains the geographical expression

21 For example:7

@prefix geolat

geolat:geoData/Tiber
    geolat:textualInstance geolat:abUrbeConditam.xml#xpath(//geogName[@xml:id='tiber01']) ;
    geolat:textualInstance geolat:abUrbeConditam.xml#xpath(//geogName[@xml:id='tiber02']) .

3.3 A Model for Managing Persons

22 Among the most significant changes in TEI P5 is the addition of the Biographical and Prosopographical Data features (TEI Consortium 2015, ch. 13.3, “Biographical and Prosopographical Data”). These features provide elements and attributes for describing individuals and their relationships. In 2006 a “Personography” workgroup (the TEI Personography Task Force)8 was established to investigate how other existing XML schemas and TEI customizations handle data about people. The result was Wedervang-Jensen and Driscoll’s Report on XML Mark-up of Biographical and Prosopographical Data (2006).9

23 Names of people are identified in TEI with the <persName> element. This element, like <placeName> and <geogName>, supports an @xml:id attribute for unique identification and a @ref attribute to link the name to an external description or URI. At the URI level, many features, as described below, could be used in order to enrich biographical and prosopographical description.
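For instance, a mention of a person could be encoded as in the sketch below; the @xml:id is an arbitrary illustrative value, while the @ref value reuses the VIAF permalink given for Lorenzo de' Medici in section 3.5.

<!-- illustrative encoding of a personal name -->
<persName xml:id="ldm01" ref="http://viaf.org/viaf/54169908">Lorenzo de' Medici</persName>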

24 The standard bibliographic approach to describing people actually consists in the identification of the unique individuals and in the attribution of an invariant set of features. However, we should never overlook the strong existing connections between people and the textual context. As a result of these connections, roles and functions, intended as features of individuals, may change depending on the context, that is, on the source attesting the individual. It is therefore possible to state that:
1. some features are not only static over time, but also theoretically constant regardless of the context (for example, birth, death, personal name);
2. other features vary depending on date and/or place (for example, age, affiliation, education, event, state);
3. roles (for example, author, actor, editor, speaker) and functions (for example, administrative, organizational, educational) are elements that identify people depending on the context.

25 Thus we can say that a person is a “complex entity,” because she/he is connected with different typologies of phenomena: some of these are unchangeable, while others depend on a time period, a place, or a context.

26 In TEI, the <person> element may be associated with different roles or functions. Consider, for example, the digital edition of a literary text. We may use the TEI <person> element to encode information about all the individuals who created and contributed to the digital edition: the author of the analog source, the editor of the printed version, or all the individuals quoted in the text. The concept of person expands the boundaries: although individuals are related to the source, they are also entities with a role enabling a single person to connect either with different resources (that is, documents), or with several other persons (for example, for the sharing of the same role). A three-level relationship therefore arises: among individuals, between a person and a document in which she/he is mentioned, and between a person and other resources. This “three-level relation” model is a concept adopted by a shared standard in the archival domain, used in order to manage authority records: the EAC-CPF Schema (see section 3.4). This approach was chosen in part because it forces a reply to the following questions:
• Why is a person related to another?
• What is written in a document about a person?
• What connection is possible to establish between a person and other resources regarding the same person?

3.4 TEI and EAC-CPF

27 For the annotation of individuals and groups, TEI may be used in conjunction with the EAC-CPF schema, developed in order to formalize the ISAAR (CPF) standard (International Standard Archival Authority Record for Corporate Bodies, Persons and Families),10 which today is also available as an ontology (Mazzini and Ricci 2011). EAC-CPF contributes to the representation of individuals, emphasizing the importance of both context and relationships. The editorial approach to annotation described here borrows from the domain of archival studies. Archival science espouses a principle of separation between the description of records (documents) and the description of people (corporate bodies, persons, and families), and emphasizes context (Pitti 2004). The same approach could be implemented in TEI, when the final purpose is to expose data sets of both TEI XML documents and related personographic data to be used by the Web community.

28 The EAC-CPF schema suggests useful ways to extend the concept of person in TEI. EAC-CPF is based on the concept of an “entity,” a corporate body, person, or family that manages relationships among other entities and between one entity and a resource linked at some level. Each relationship could be described, dated, and classified. In addition to elements related to “relation” (<cpfRelation> and <resourceRelation>), EAC-CPF defines a <function> element that “provides information about a function, activity, role, or purpose performed or manifested by the entity being described” on a specific date. The <functionRelation> element illustrates a “function related to the described entity.… it includes a @functionRelationType attribute that could support a controlled list of type values” (Encoded Archival Context Working Group 2014).

29 A new model of an “authority record”—a complex structure able to document the context in which the identity is attested—could be introduced: the authority not only is generated by the controlled form of the name, and the related parallel forms, but is also the result of relationships generated by the context (that is, the specific document in which the entity is mentioned) to determine a concept.

30 According to the RDF model, it is possible to assert that an identified entity (URI) manages relationships (predicate) with different objects; these objects could be:
1. another entity (URI): another person, place (URI), date (URI), or event (URI);
2. a contextual resource (URI): the document in which the entity is mentioned;
3. an external resource (URI): another object (a document, an image, a video, an audio record, and so on).
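A schematic Turtle rendering of these three kinds of statements might look like the sketch below; the ex: predicates and all object IRIs other than the VIAF permalink (taken from section 3.5) are hypothetical placeholders, not terms of the TEI or EAC-CPF vocabularies.

@prefix ex: <http://example.org/terms/> .   # placeholder vocabulary

<http://viaf.org/viaf/54169908>                                   # the identified entity
    ex:relatedTo   <http://example.org/persons/piero-de-medici> ; # 1. another entity
    ex:mentionedIn <http://example.org/docs/document1> ;          # 2. the contextual resource (document)
    ex:describedBy <http://example.org/images/portrait-ldm.jpg> . # 3. an external resource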

31 This procedure could be applied to describe annotations and other contributions to the digital edition, for instance, by identifying a contributor of an annotation who, on a specific date, performed a specific activity (the principle of “provenance” of an assertion, that is, authorship attribution activity). The responsibility statement (the TEI <respStmt> element) could be intended as a role. Each person is associated with a responsibility statement able to identify the role that the entity covered in that document, associating persons with documents. The same person may fulfill the same responsibility in other editions. In this way relationships are extended to other documents. Moreover, other individuals who share the same responsibilities may be linked.

32 This process could be declared and exposed as an RDF data set with URIs for the univocal identification and TEI/EAC for classes and predicates in order to build a collection of authorities related to persons that covered either a role or a function in a certain time period and context. By declaring connections as relationships, through the EAC-CPF model, we could develop a knowledge base of people, with a role or function originated by the context.


33 The aforementioned model has already been applied to the description of real individuals who manage activities in relation to documents (Pasin and Bradley 2013). The same approach could be expanded to prosopographical description, with an adequate extension of the function sets to include activities of interest for historiographical research (for instance, military mission, diplomatic activity, arranged marriage, conspiracy). People mentioned, described, or in general cited in sources also assume different roles in different contexts—context being determined by the document, as an historical source, in which a person appears. More specifically, people are related to dates, places, and events, enriching the expressivity of the description.

3.5 People and Places in a “Perspective Function”

34 Place, in particular, could be an interesting key for managing functions. In the Geolat project we are able to assert that people have different roles with respect to places. This means that the context (that is, the document) determines the existing relationship between a person and a place. In addition, the role a person plays in a document also changes the kind of relationship she/he has with the place. A place could be intended not only as a city that a person was-born-in or died-in but also as a place that a person went-to for a specific event or to do-something. A taxonomy of predicates will be developed in order to define the possible relationships between a person and a place.

35 An example of a person description (pseudocode):

Lorenzo de’ Medici xml:id=LdM
Access_key: Medici, Lorenzo de’
Dbpedia URI: http://it.dbpedia.org/page/Lorenzo_de%27_Medici
VIAF permalink: http://viaf.org/viaf/54169908
Father-of: Piero de’ Medici (URI)
Born-in: Firenze (URI)
Died-in: Firenze (URI)
Born-when: 1449 (xsd:integer)
Died-when: 1492 (xsd:integer)
Visited-what: Roma (URI) –function: diplomatic mission
Visited-what: Venezia (URI) –function: military mission
Visited-when: Roma (URI) 1466
Visited-when: Venezia (URI) 1465
Iconography: http://www.google.com/search?hl=en&q=lorenzo+de%27+medici&um=1&ie=UTF-8&tbm=isch&source=og&sa=N&tab=wi&ei=_1vcUOeAJqeI4ASikoCYBQ&biw=1146&bih=709&sei=AVzcUNnjIOXV4gTVloHQCQ
Biography: http://it.wikipedia.org/wiki/Lorenzo_de'_Medici
Author-of: http://it.wikisource.org/wiki/Canti_carnascialeschi_%28Lorenzo_de%27_Medici%29
Attested-in: http://it.wikiquote.org/wiki/Maria
Provenance: http://www.unibo.it/docenti/francesca.tomasi

36 By creating connections between the archival and the philological domains, digital editions open new perspectives on the cultural heritage domain, establishing connections between heterogeneous objects and “creating efficiencies in the re-use of metadata across repositories, and through open linked data resources” (Larson and Janakiraman 2011, 4). Linked data describing persons performing specific roles could be considerably improved by employing analytic description concerning people’s functions, using the context as interpretative key: “the description of personal roles and of the statuses of documents needs to vary in time and according to changing contexts … such roles and statuses need to be handled formally by ontological models” (Peroni, Shotton, and Vitali 2012, 9).

4. A Meta-ontology for the Annotation: Open Annotation Data Model

37 The annotation layer that we are connecting to the texts describes semantic features of the textual content. We need, however, to record some meta-properties of the annotations themselves:
• responsibility metadata;
• typological categorization;
• scope or function;
• provenance metadata (responsibilities, certainty level, time).

38 For this reason we have opted to add an intermediate layer to express the relationship between the annotation and the text, based on the Open Annotation data model (OA), which is a plausible model for meeting these requirements: An annotation is considered to be a set of connected resources, typically including a body and target, and conveys that the body is related to the target. The exact nature of this relationship changes according to the intention of the annotation, but most frequently conveys that the body is somehow ‘about’ the target. Other possible relationships include that the body is an identifier for the target, provides a representation of the target, or classifies the target in some way. This perspective results in a basic model with three parts, depicted below.…

Figure 1. Annotation, Body and Target.


(Sanderson, Ciccarese, and Van de Sompel 2013)

39 In a nutshell, we can define OA as an RDF vocabulary (formally expressed in OWL 2), which allows the expression of the relationship between an annotation and its object and a metadata set for this relationship. The core of OA is the oa:Annotation class with the two relationships oa:hasBody (that expresses the relationships between an annotation and its content) and oa:hasTarget (that expresses the relationships with the object of the annotation).

40 Both the annotation body and target are expressed by URI; other properties of the oa:Annotation class are oa:annotatedBy (to state responsibility) and oa:annotatedAt (to specify the date of the annotation). Using this model, the definition of an annotation in Geolat would be as follows (in Turtle syntax):

a oa:Annotation ;
    oa:hasBody geolat:geoData/Tiber ;
    oa:hasTarget geolat:abUrbeConditam.xml#xpath(//geogName[@xml:id='tiber02']) ;
    oa:annotatedBy ;
    oa:annotatedAt "2013-09-28T12:00:00Z" .

41 The attribution of a semantic type to this annotation can be achieved with a twofold approach:
• extension of the OA ontology through the introduction of a set of subclasses of the oa:Annotation class, in order to formalize this typology;
• use of the oa:motivatedBy property, whose values express the rationale of a given annotation and are members of the oa:Motivation class, which is extensible according to the users’ needs.
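Applied to the annotation shown above, the two options might be combined as in the following sketch. The property oa:motivatedBy and the motivation oa:identifying belong to the OA vocabulary, while geolat:GeographicAnnotation, the annotation URI, and the geolat: namespace are hypothetical placeholders introduced only for illustration.

@prefix oa:     <http://www.w3.org/ns/oa#> .
@prefix geolat: <http://example.org/geolat/> .   # placeholder namespace

<http://example.org/geolat/annotations/anno1>              # placeholder annotation URI
    a oa:Annotation , geolat:GeographicAnnotation ;        # hypothetical subclass of oa:Annotation
    oa:hasBody <http://example.org/geolat/geoData/Tiber> ; # the GSD entity from the example above
    oa:motivatedBy oa:identifying .                        # standard OA motivation value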

42 The adoption of the OA framework has the advantage of making the annotation metadescriptions homogeneous with the complex semantic Geolat architecture, and at the same time using an established standard that assures interoperability with other projects and future software platform change.

5. Closing Remarks and Future Developments

43 In Geolat everything is planned and built through free, open-source software. There are several reasons underlying this choice, including the project’s need to share its software tools with other researchers. Practical benefits of our adoption of open-source tools include the absence of annual license fees, which can be a real burden for long- term humanities research projects.

44 The entire content is made available under CC licenses. Despite a strong pressure towards CC-BY (which requires only citing the author[s]), Geolat chose a CC-BY-NC-SA license which seems more appropriate to stress the value—including from an economic/commercial point of view—of the research outcomes.

45 Semantic annotation of a digital library of textual materials potentially never ends; therefore, a crowdsourcing approach should be adopted. Annotating the entire corpus of Latin literature is undoubtedly a demanding task: as a consequence, a core group of people working over many years is needed. However, being open to crowdsourcing means that the members of the larger research community of Latin scholars can significantly contribute to this effort.

46 Linked open data (LOD) is the most efficient framework to allow semantic interoperability of complex data sets on the Web. In the field of classical studies, several important resources are currently available which publish their data as LOD, including Pleiades,11 a gazetteer for classical antiquity, and Pelagios,12 a collaborative initiative to share and link geographic references in ancient cultural artifacts and documents. Notwithstanding its value and usefulness, Pleiades lacks references to textual passages with place names, which it could gather from Geolat. Geolat could retrieve from Pleiades selected geographical data (such as GPS coordinates, the ancient names of places, and names in modern languages where available). This mutual sharing of controlled and authoritative data is our primary motivation for using LOD in Geolat. This approach contributes to a growing community of scholars sharing data and tools.

47 The Geolat model is not language-dependent, and can be adopted for any collection in any language. This is not a purely quantitative approach (the model can be used for n different languages): it means that through the model (provided that the same model is adopted for annotating), different texts, from different literatures, can interact and can be read in their intertextual relations: any person or place, mentioned in different texts from different literatures, can be tracked. The aim is to finely analyze and interpret similarities and differences across texts, languages, and literatures.

BIBLIOGRAPHY

Bergman, Mike. 2009. “The Fundamental Importance of Keeping an ABox and TBox Split.” Ontology Best Practices for Data-driven Applications, part 2. May 17. http://www.mkbergman.com/489/ontology-best-practices-for-data-driven-applications-part-2/.

Encoded Archival Context Working Group. 2014. Encoded Archival Context—Corporate Bodies, Persons, and Families (EAC-CPF) Tag Library. Edition 2014. http://eac.staatsbibliothek-berlin.de/fileadmin/user_upload/schema/cpfTagLibrary..

Isaksen, Leif, Elton Barker, Eric C. Kansa, and Kate Byrne. 2012. “GAP: A NeoGeo Approach to Classical Resources.” Leonardo 45 (1): 82–83. doi:10.1162/LEON_a_00343.

Larson, Ray R., and Krishna Janakiraman. 2011. “Connecting Archival Collections: The Social Networks and Archival Context Project.” In Research and Advanced Technology for Digital Libraries: International Conference on Theory and Practice of Digital Libraries (TPDL 2011), edited by Stefan Gradmann, Francesca Borri, Carlo Meghini, and Heiko Schuldt, 3–14. Lecture Notes in Computer Science 6966. Heidelberg, Germany: Springer. doi:10.1007/978-3-642-24469-8_3. http://link.springer.com/chapter/10.1007/978-3-642-24469-8_3.

Mazzini, Silvia, and Francesca Ricci. 2011. “EAC-CPF Ontology and Linked Archival Data.” In Proceedings of the 1st International Workshop on Semantic Digital Archives, edited by Livia Predoiu, Steffen Hennicke, Andreas Nürnberger, Annett Mitschick, and Seamus Ross, 72–81. N.p.: CEUR Workshop Proceedings. http://ceur-ws.org/Vol-801/.

Pasin, Michele, and John Bradley. 2013. “Factoid-Based Prosopography and Computer Ontologies: Towards an Integrated Approach.” Digital Scholarship in the Humanities 30 (1): 86–97. doi:10.1093/llc/fqt037.

Peroni, Silvio, David Shotton, and Fabio Vitali. 2012. “Scholarly Publishing and Linked Data: Describing Roles, Statuses, Temporal and Contextual Extents.” In Proceedings of the 8th International Conference on Semantic Systems, 9–16. New York: ACM. doi:10.1145/2362499.2362502.

Pitti, Daniel V. 2004. “Creator Description: Encoded Archival Context.” In Authority Control in Organizing and Accessing Information: Definition and International Experience, edited by Arlene G. Taylor and Barbara B. Tillett, with the assistance of Murtha Baca and Mauro Guerrini, 201–26. Binghamton, N.Y.: Haworth Information Press.

Sanderson, Robert, Paolo Ciccarese, and Herbert Van de Sompel, eds. 2013. “Open Annotation Data Model.” Community Draft, February 8. World Wide Web Consortium (W3C). http://www.openannotation.org/spec/core/20130208/index.html.

TEI Consortium. 2015. TEI P5: Guidelines for Electronic Text Encoding and Interchange. Version 2.8.0. Last updated April 6. N.p.: TEI Consortium. http://www.tei-c.org/Vault/P5/2.8.0/doc/tei-p5-doc/en/html/.

W3C OWL Working Group. 2012. OWL 2 Document Overview, 2nd ed. W3C Recommendation, December 11. http://www.w3.org/TR/owl2-overview/#profiles.

Wedervang-Jensen, Eva, and Matthew Driscoll. 2006. Report on XML Mark-up of Biographical and Prosopographical Data. February 16. http://www.tei-c.org/activities/workgroups/pers/persw02.xml.

Wittern, Christian, Arianna Ciula, and Conal Tuohy. 2009. “The making of TEI P5.” Literary and Linguistic Computing 24 (3): 281–296. doi:10.1093/llc/fqp017.

NOTES

1. http://www.digiliblt.unipmn.it/.
2. Edited by Valeria Lomanto and Nino Marinone (Hildesheim: Olms-Weidmann, 1990).
3. http://bia.lex.unict.it/.
4. http://www.mqdq.it/mqdq/.
5. http://pleiades.stoa.org/.
6. Virtual International Authority File (https://viaf.org/).
7. This example, and all the following code snippets of RDF statements, are expressed in Turtle syntax. Cf. RDF 1.1 Turtle: Terse RDF Triple Language, W3C Recommendation, 25 February 2014, http://www.w3.org/TR/2014/REC-turtle-20140225/.
8. http://www.tei-c.org/Activities/Workgroups/PERS/index.xml.
9. The actual changes introduced in the TEI Schema to implement the proposals in that document are described in Wittern, Ciula, and Tuohy (2009).
10. http://www.ica.org/10203/standards/isaar-cpf-international-standard-archival-authority-record-for-corporate-bodies-persons-and-families-2nd-edition.html.
11. http://pleiades.stoa.org/.
12. http://pelagios.org/.

ABSTRACTS

This paper presents the rationales and the logical architecture of Geolat — Geography for Latin Literature, a project for the enrichment of Latin literature which makes use of a complex mix of TEI markup, Semantic Web technologies and formal ontologies. The purpose of Geolat is the annotation of the geographical and personal references in a corpus of Latin TEI encoded texts. These annotations are linked to a set of ontologies that give them formal semantics, and can finally be exposed as linked open data, in order to improve the documents’ interoperability with other existing LOD and to enhance information retrieval possibilities. The paper is organized as follows: first we introduce the project in order to explain the steps needed to complete it (section 2); then we describe the ontological modeling and the TEI encoding in Geolat, by focusing on both places and persons (section 3). We then discuss the annotation data model (section 4). In conclusion, we introduce some future directions (section 5).

INDEX

Keywords: geotagging, personography, ontology, Semantic Web, Linked Open Data, digital annotation

AUTHORS

FABIO CIOTTI Fabio Ciotti is Assistant Professor at the University of Roma Tor Vergata, where he teaches Digital Literary Studies and Theory of Literature. He is President of the Associazione per l’Informatica Umanistica e la Cultura Digitale (AIUCD, the Italian digital humanities association) and member of the EADH Executive Board. He has been scientific consultant for text encoding and technological infrastructures in several digital libraries and archives projects, and has served in the TEI Technical Council.

MAURIZIO LANA Maurizio Lana is an assistant professor at Università del Piemonte Orientale where he teaches Library & Information Science; he also coordinates the geolat-geography for Latin literature project and, with R. Tabacco, the digital library of late-Latin texts digilibLT.

FRANCESCA TOMASI Francesca Tomasi is assistant professor at the University of Bologna, where she teaches archives and computer science and digital humanities. Her research is mostly focused on ontologies and semantic modeling in the cultural heritage domain. She is in particular editor of the digital collection Vespasiano da Bisticci, Letters. She is also President of the School of Humanities Library at the University of Bologna.


The Linked Fragment: TEI and the Encoding of Text Reuses of Lost Authors

Monica Berti, Bridget Almas, David Dubin, Greta Franzini, Simona Stoyanova and Gregory R. Crane

1. Introduction

1 This paper presents the research work of the Humboldt Chair of Digital Humanities at the University of Leipzig to establish a new Leipzig Open Fragmentary Texts Series (LOFTS).1 The project is developed in conjunction with the Perseus Digital Library at Tufts University and the Harvard University Center for Hellenic Studies. LOFTS produces open editions of ancient classical works that survive through quotations and text reuses in later texts (i.e., textual fragments). These fragments include many formats that range from quotations to allusions and translations, which are only a shadowy image of the original depending on their distance from a literal citation (for a definition of fragmentary texts, see Berti 2012 and Berti 2013). Generations of scholars have looked for information about lost authors and have created huge print collections of fragmentary works that cover almost every literary genre in prose and verse, allowing for the rediscovery of authors otherwise lost and forgotten (Most 1997). As far as Greek literature from the eighth century BCE to the third century CE is concerned, we know that the works of about sixty percent of known authors are preserved only in textual fragments, thus showing the importance of this form of evidence (Berti et al. 2009).

2 Print editions of fragmentary works are indexed collections of excerpts extracted from their contexts and from the textual data about those contexts. Editions of fragmentary works are fundamentally hypertexts and the goal of this project is to produce a dynamic infrastructure for a full representation of the complex relationships between sources, quotations, and annotations about them. In a digital edition, fragments are linked directly to the source text from which they are drawn and can be accurately


aligned to multiple editions (Berti et al. 2009). Accordingly, digital fragments are contextualized annotations about reused authors and works. As new versions of, or scholarship on, the source texts emerge in a standard, machine-actionable form, these new findings are automatically linked to the digital fragments. Such a representation allows us to go beyond the limits of print collections of fragmentary authors, where quotations and text reuses are snippets of text completely isolated from their original context. Fragmentary texts also include quotations and reuses of surviving sources (e.g., Herodotus quoted by Athenaeus). In these cases it is possible to compare the original text with its reuses and, therefore, to produce different textual alignments that help scholars to infer the quotation habits of ancient authors.

3 The first step in rethinking the significance of text reuses of lost works is to represent them inside their preserving context. This entails selecting the string of words that belongs to the portion of text classifiable as reuse and encoding all those elements that signal the presence of the text reuse itself (i.e., named entities such as the onomastics of reused authors, titles of reused works and descriptions of their content, verba dicendi, syntax, etc.). The second step is to encode all the information pertaining to other sources that reuse the same original text with different words or a different syntax (witnesses), or that deal with the same topic as the text reuse (parallel texts), as well as alternative editions and translations of both the source and the derived texts (Almas and Berti 2013).
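As a purely illustrative sketch (the element choices and the @ana value below are assumptions, not the project's prescribed markup), the signals described above might be tagged along these lines in TEI:

<!-- Hedged sketch only: reused author's name, title of the reused work,
     verbum dicendi, and the reused wording itself. -->
<p>
  <persName>Ἴστρος</persName> δ᾽ ἐν τοῖς <title>Ἀττικοῖς</title>
  οὐδ᾽ ἐξάγεσθαί <w ana="#verbum-dicendi">φησι</w>
  <quote>τῆς Ἀττικῆς τὰς ἀπ᾽ αὐτῶν γινομένας ἰσχάδας ...</quote>
</p>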

4 LOFTS has two main objectives: (1) to digitize paper editions of fragmentary works and link them to source texts, and (2) to produce born-digital editions of fragmentary works. In order to achieve these goals, LOFTS is implementing a Fragmentary Texts Editor within the Perseids Platform2 (Almas and Berti 2013) and is producing a digital edition of the Fragmenta Historicorum Graecorum edited by Karl Müller in the nineteenth century (Müller 1878–85; DFHG Project).3 Alongside the DFHG project, LOFTS is working on two other subprojects: (1) the Digital Athenaeus, whose goal is to produce a digital edition of the Deipnosophists, with a complete survey of quotations of both lost and surviving sources preserved by Athenaeus of Naucratis;4 and (2) the Digital Marmor Parium, which seeks to create a digital edition of the so-called Parian Marble, a Hellenistic chronicle on marble from the Greek island of Paros5. The latter is the work of a fragmentary author whose name is lost and the project serves as a test case to build a model that works with both fragmentary texts and physical fragments of ancient works such as inscriptions and papyri (Berti and Stoyanova 2014).

5 The present paper is divided into two parts. The first part (section 2) describes the implementation of a fragmentary texts editor within Perseids, the editorial platform developed by the Perseus Project for collaborative annotation of classical source documents. Perseids facilitates the annotation of text reuses, the production of syntactic annotations, the alignment of multiple texts, and the production of digital commentaries on fragmentary works. This section also elaborates on the complex nature of fragmentary texts and how different technologies need to be combined to reach our goals.

6 The second part (section 3) describes the Digital Fragmenta Historicorum Graecorum, a synergetic effort which is currently encoding print collections of fragmentary texts and creating a directory of fragmentary authors which feeds into the Perseus catalog.


2. Perseids Platform

7 The Perseids Platform supports collaborative editing, annotation, and publication of born-digital editions of source documents in the classics.6 Perseids is not one single application but an integrated environment built from a loose coupling of heterogeneous tools and services from a variety of sources. The development of the Perseids platform was inspired and motivated by the work of several pre-existing projects: the Tufts Miscellany Collection at Tufts University,7 the Homer Multitext Project at Harvard’s Center for Hellenic Studies,8 and the Papyri.info project9 (Almas and Beaulieu 2013). The Son of SUDA Online (SoSOL) application sits at the core of the Perseids platform. SoSOL is a Ruby on Rails10 application, originally developed by the Papyri.info project, that serves as the front end for a Git11 repository of documents, metadata, and annotations. It includes a workflow engine that enables documents and data of different types to pass through flexible review and approval processes. The SoSOL application includes user interfaces for editing XML documents, metadata, and annotations. While it does not include a full-featured XML editor, it supports alternative text-based input of XML markup, and can enforce XML schema validation rules on the documents being edited.

8 A key goal behind the initial development of the platform was to enable original undergraduate research in the field of Classics.12 The workflows related to the encoding of text reuses and lost authors represent core use cases for the current phase of work on the platform (Almas and Berti 2013).13 In developing features of the Perseids Platform to support these workflows, we are focusing first and foremost on the data. We expect that techniques for visually representing digital editions will change rapidly with technology. So, while our work includes prototype representations of digital editions suitable for publication on the web, our first priority is to enable scholars to create data about the authors, texts and related commentaries, annotations, links, and translations in a way that encourages and facilitates their preservation and reuse. We have identified the following core requirements to meet this goal:
• The ability to represent the texts themselves, links between them, and annotations and commentaries on them, in semantically and structurally meaningful ways that adhere to well-accepted and documented standard formats.
• Stable and resolvable identifiers for all relevant data points, including:
  ◦ the lost authors and their works
  ◦ the authors and extant texts that preserve quotations and text reuses of the lost works
  ◦ different editions and translations of the lost and extant texts
  ◦ named entities (e.g., persons, places, and events) mentioned within the texts
  ◦ commentaries and annotations on the texts, from ancient times through the present
• The ability to group any of the data points into collections representing different contextual views of the data.
• The ability to accurately represent provenance information for data and workflows.

2.1 Data Formats

2.1.1 Texts

9 We use TEI encoding to represent the source texts and textual fragments preserved within them.14 TEI provides the markup syntax and vocabulary needed to produce XML


that enables citable passages of text to be unambiguously identified and linked to within their preserving context, a key requirement of the representation of text reuses as discussed above. For example, the following excerpt from Athenaeus’ Deipnosophists 3.6 uses the TEI <div> element to demarcate the book and chapter:

<div type="textpart" subtype="book" n="3">
  ...
  <div type="textpart" subtype="chapter" n="6">
    ...Ἴστρος δ᾽ ἐν τοῖς Ἀττικοῖς οὐδ᾽ ἐξάγεσθαί φησι τῆς Ἀττικῆς τὰς ἀπ᾽ αὐτῶν γινομένας ἰσχάδας, ἵνα μόνοι ἀπολαύοιεν οἱ κατοικοῦντες: καὶ ἐπεὶ πολλοὶ ἐνεφανίζοντο διακλέπτοντες, οἱ τούτους μηνύοντες τοῖς δικασταῖς ἐκλήθησαν τότε πρῶτον συκοφάνται. ...
  </div>
  ...
</div>

2.1.2 Identifiers

10 The CITE Architecture developed by the Homer Multitext Project15 provides us with a standard identifier syntax for texts, passages, and related objects and with APIs for services which can retrieve objects identified via these protocols. CITE defines Canonical Text Services (CTS) URNs for creating semantically meaningful unique identifiers for texts and passages within texts. CITE also defines an alternate URN syntax for data objects that are not citable text notes, such as images, annotations, and the lost texts themselves. As URNs, these identifiers are not web-resolvable on their own. By combining them with a URL prefix and deploying CTS and CITE services to serve the identified resources at those addresses, we have resolvable, stable identifiers for our texts, data objects, and annotations (Smith and Blackwell 2012). One of our key motivations for using CTS URNs is that they give us a robust means of targeting annotations at specific substrings of text within a canonical work. The pointers to the text are specific to the location of these strings within their canonical citation structure, and not to the XML markup of any particular digital edition of the text. Further, the URN syntax degrades gracefully to allow us to reference either the notional work or a very specific edition of that work. Our goal in using this syntax, together with RDF and stand-off markup techniques as discussed below, is to enable the assertions we are making to stand on their own as data, independent of the encoding techniques used to digitize the text.

11 For example, the following set of identifiers might be used to represent a reuse of a lost work of Istros at book 3, chapter 6 in Athenaeus’ Deipnosophists (Almas and Berti 2013):

urn:cts:greekLit:tlg0008.tlg001.perseus-grc1:3.6@Ἴστρος%5B1%5D-συκοφάνται%5B1%5D

12 This is a CTS URN for a subset of passage 3.6 in the perseus-grc1 edition of the work identified by tlg001 in the group of texts associated with Athenaeus, identified by tlg0008. The URN further specifies a string of text in that passage, starting at the


first instance of the word Ἴστρος and ending at the first instance of the word συκοφάνται.

urn:cite:perseus:lci.2 is a CITE URN identifier for the instance of lost text being reused. This URN identifies an object whose name 2 is unique within the Perseus Collection of Lost Content Items (lci). Every item in this collection points to a specific text reuse of a lost author as it is represented in a modern edition (see below for further discussion of CITE Collections).

13 These URNs represent distinct technology-independent identifiers for the two cited objects, and by prefixing them with the http://data.perseus.org URL prefix (representing the web address at which they can be resolved) we create stable web identifiers for them, making them compatible with linked data best practices:16
http://data.perseus.org/citations/urn:cts:greekLit:tlg0008.tlg001.perseus-grc1:3.6%40Ἴστρος%5B1%5D-συκοφάνται%5B1%5D17
http://data.perseus.org/collections/urn:cite:perseus:lci.2

14 The RDF data model also gives us more precision than XML with respect to targets of annotations and subjects of assertions. A human reader can usually tell from context whether an assertion concerns the wording of a sentence, the writing of a sentence as an event, the author of a sentence, or another scholar’s assertion about the sentence. XML has no standardized semantics, and interpretations of relationships among elements and attributes are often underspecified in XML vocabularies (Renear et al. 2002). In RDF, subject, predicate, and object roles in a statement are explicit at the data structure level. Distinctions among domain entities (such as author vs. authorship event, or morpheme vs. sentence) are encoded as a class identity for the resource.

2.1.3 Annotations

15 The term “Annotations” covers a potentially wide variety of data types. We can have simple annotations in the form of typed links between data points, such as: a textual fragment and a proposed author of that fragment; detailed textual commentaries making an assertion about a text; a complex morphosyntactic analysis of a section of text; an alignment between editions or translations of a text. The Open Annotation (OA) data model “specifies an interoperable framework for creating associations between related resources, annotations, using a methodology that conforms to the Architecture of the World Wide Web.”18 The OA model enables us to serialize every annotation in its most simple form, as a link between one or more target items being annotated, and one or more bodies representing the contents of the annotation. OA also gives us a standard vocabulary for categorizing the motivation for the annotations. URIs are used to specify both the target and the body of the annotation. We use the OA data model both as the primary representation of an annotation, in cases where the annotations are created by linking two identifiers (such as a link between a passage in a text and an identifier for a named entity or event), and also as a serialization method for more complex annotations, where the annotation process involves the creation of complex documents as the annotation bodies which we can then reference by their URI identifiers. In the


latter case, we use a variety of standard formats for the actual annotation bodies, including:
• The Perseus Ancient Greek and Latin Treebank Schema19 for morphosyntactic analyses.
• The Alpheios Translation Alignment Schema20 for text alignments.
• Markdown syntax21 for short textual commentaries.
• TEI XML for primary and secondary source texts.

16 The annotation representing the assertion (discussed above) that text at Athen., Deipn. 3.6 describes a reuse of a lost work of Istros identified by urn:cite:perseus:lci.2, serialized in OA using the JSON-LD22 format, might be formalized as follows (Almas and Berti 2013):

{ "@context": "http://www.w3.org/ns/oa-context-20130208.json", "@id": "http://perseids.org/annotations/urn:cite:perseus:annsimp.2.1", "@type": "oa:Annotation", "annotatedAt": "2013-03-05T07:57:00", "annotatedBy": { "@id": "http://data.perseus.org/sosol/users/Monica Berti", "@type": "foaf:Person", "name": "Monica Berti" }, "hasBody": "http://data.perseus.org/collections/urn:cite:perseus:lci.2", "hasTarget": "http://data.perseus.org/citations/ urn:cts:greekLit:tlg0008.tlg001.perseus-grc1:3.6%40Ἴστρος%5B1%5D- συκοφάνται%5B1%5D", "oa:motivatedBy":"oa:linking" }

17 Figure 1 shows a possible visual representation of the same annotation:

Figure 1. Visual representation of an OA annotation of a text reuse in Athen., Deipn. 3.6.


2.2 Collections

18 We need to be able to organize text reuses into various types of collections of data, including: those represented in a given traditional print edition (which may comprise reuses from one or many authors); all text reuses attributed to a specific author; all text reuses quoted by a specific author; all text reuses referencing a specific topic; and all text reuses attributed to a specific time period. CITE Collection Services define a protocol for identifying and retrieving digital representations of objects identified by CITE URNs. CITE URNs uniquely identify objects in a set of objects of a similar kind: for example, images, manuscripts, and entries in commentaries. (CITE stands for Collections, Indices, Texts, and Extensions.) Examples of CITE collections used to support the encoding of text reuses for our project include the abstract lost text entities themselves, digital images of manuscripts of the extant source texts that quote those lost texts, commentaries on instances of text reuse, and linguistic annotations of the quoted text.

2.3 Provenance

19 Scholars produce data through a variety of activities, observations, and other events. Some data is created, other data discovered. The provenance of a digital object is an account of its origin and change over time. Documenting and preserving data provenance in a structured, machine-readable format enables us to more precisely track and document shared resources, ultimately improving data quality and encouraging further sharing. Specifically, we intend our retrieval, editorial annotation, and communication tools to create records of the research transactions in which they are used. These records are expressed in RDF vocabularies that are based on abstract provenance models.

20 According to Groth et al. (2006), two key principles for provenance data are that (1) actors must only record propositions that they know to be true, through statements of what they observe; and (2) each statement of provenance must be attributable to a particular actor.

21 In our use case, we need to be able to (1) reference ancient data that can be identified but that did not literally come into existence as the result of any modern computational interaction (and which may in fact no longer be extant in any preserved source); and (2) identify the role a data item, such as an ancient scholarly assertion, plays as the vehicle for the modern scholarly claims. A third requirement, which results from the second, is that we need to be able to represent the assertions of the ancient scholars, on which our modern assertions depend, in a format that can be included computationally in a common data set with the modern claims.

22 Although our usual workflow discourse describes domain entities coming into existence and undergoing genuine change, that understanding poses problems for tracking identity and other semantic relationships. For example, if a text is enriched by adding TEI markup, what exactly is preserved in the derivation of the newer version from the older? On what basis do we identify the same original work in each reuse despite the superficial differences between them? And how do we represent assertions concerning works that have no physical exemplars? Our solution to dilemmas like


these is to employ complementary data provenance models: one with a transformational view and another with a semiotic view of the same research events.

23 The W3 Consortium’s PROV is a specification for expressing provenance records with descriptions of the entities and activities involved in producing and delivering or otherwise influencing a given digital object.23 A PROV activity is an event through which entities come into existence and/or change to become new entities. Activities are dynamic aspects of the world, such as actions and processes. For example, a PROV account of the OA annotation shown in the previous section would document its coming into existence at a particular time through a researcher’s use of annotation software.

24 In contrast to PROV, the Systematic Assertion Model (SAM) treats events in scholarship as linguistic acts (Wickett et al. 2012). SAM is concerned not with mutable data structures that change from one form to another, but with unchanging abstract objects that come to stand in new contingent relationships during a project. A SAM account of the OA annotation, for example, would be concerned with its role as a vehicle for the scholar’s claim of a proposition’s truth (that is, the proposition that Athen., Deipn. 3.6, quotes LCI 2). A SAM description links an annotation expression to the act of annotation, the claim advanced at that time and place, the annotator’s justification for the claim, and to interpretive contexts necessary for understanding the annotation. In the present example these contexts would include the OA vocabulary and the JSON-LD serialization syntax.

25 SAM’s formal account of research data and its content supplies contextual information that PROV lacks, but PROV can record functional relationships that SAM hides: because SAM’s abstract symbol structures and propositions are immutable, their roles in a project are incompletely specified without the PROV view of their use as inputs and outputs to activities.

26 In the scenario discussed in this paper, a scholar wants to annotate what she thinks is a reuse of a lost text attributed to Istros in an extant source text by Plutarch (Sol. 24.1), where Istros is not named.24 The scholar, using the Perseids annotation tools to produce a concrete representation of this assertion, indicates the text that represents the reuse. An example of an RDF triple documenting this action—the selection of the string starting with the first instance of ἐξαγωγὴ to the first instance of συκοφαντεῖν according to the SAM data model—is:

sam:indicates [ xsd:string "ἐξαγωγὴ1,συκοφαντεῖν1" ];
event:agent ;
event:time [ xsd:dateTime "2013-05-20T09:01:00" ];
a sam:Indication.

27 Although we could represent this action as an activity involving the scholar’s physical interaction with the system, we would lose the significance of the linguistic force behind the scholar’s identification of the text.

28 In reaction to her selection of text, the system computes a URI identifier for that text. This event can be represented as a SAM computation, but in order to represent the fact that the URI came into being as a result of a user-system interaction, it is also


appropriate to describe this as a PROV activity that was informed by the scholar’s previous action of selecting a string of text (Almas et al. 2013):

sam:indicates ;
event:agent ;
event:time [ xsd:dateTime "2013-05-20T09:01:01" ];
a sam:Computation, prov:Activity;
prov:wasInformedBy .

2.4 Dissemination and Presentation

29 As discussed above, our primary focus thus far has been on capturing the data about the authors, texts and related commentaries, annotations, links, and translations in a way that is accurate and also encourages and facilitates its preservation and reuse. Visual representation of the data is one type of reuse, and the data format selections have been made with the need to support dissemination for online presentation in mind. The JSON-LD syntax recommended by OA allows us to easily build a dynamic display interface in JavaScript which navigates the JSON-LD data object and retrieves the datasets identified as the targets and bodies of the annotations at their addressable URIs, as served by supporting CTS and CITE services. Our prototype interface25 provides a demonstration of one possible approach to a digital representation of text reuse data. Similarly, although we hope eventually to be able to represent the rich provenance metadata discussed above in the visual representation of our digital editions, we are above all concerned with capturing the data in a way that ensures they can be preserved and serve as the basis for further research.

3. The Digital Fragmenta Historicorum Graecorum (DFHG) Project

30 As mentioned before, one of the goals of LOFTS is to digitize paper editions of fragmentary works and link them to the source texts from which the fragments have been excerpted. Such a work aims both at preserving a philological heritage and at providing a methodological foundation for implementing a new generation of born-digital editions of fragmentary texts. Accordingly, LOFTS is encoding the Fragmenta Historicorum Graecorum (FHG), which is one of the most important collections of textual fragments of Classical authors.

3.1 The FHG Texts

31 The five volumes of the FHG edited by Karl Müller (1878–85) comprise the first large collection of fragments of the Greek historians ever produced. The work collects excerpts from textual sources pertaining to more than six hundred fragmentary authors. Excluding the first volume, these authors are chronologically distributed and range from the sixth century BCE to the seventh century CE. Under each author the


fragments are numbered sequentially and arranged according to works and book numbers (when such information is available), and every fragment is translated into Latin. The first volume also includes the text of the inscription of the Marmor Parium with a Latin translation, a chronological table, and a commentary, and the Greek text of the Rosetta Stone with a French literal translation, as well as a critical, historical, and archaeological commentary. The fifth volume includes fragments of the writings of Greek and Syriac historians preserved in Armenian sources.

32 The basic structure of the FHG includes introductions to the volumes, with discussions of the content, citations of sources, and references to bibliographical records. Short introductions, with biographical information and related textual evidence, sometimes precede the fragments of individual authors. The Greek fragments are arranged in two columns, and the Latin translations are printed at the bottom of the page below a horizontal line. The FHG does not include a critical apparatus for the Greek text.

3.2 The DFHG Project

33 LOFTS is producing the first digital edition of the FHG as a means of providing an open, linked, machine-actionable text for the study and advancement of Greek fragmentary literature.26 The text is encoded in accordance with the latest TEI EpiDoc standards.27 The decision to adopt the EpiDoc subset was dictated by two factors: (1) EpiDoc markup lends itself well to ancient texts, be these papyri, inscriptions, manuscripts, or editions, allowing for various levels of detail; (2) encoding in EpiDoc ensures compatibility with existing papyrological and epigraphic resources, with the EAGLE-Europeana network,28 and with the forthcoming EpiDoc-compliant Perseus 5.0 databank.

34 The Leipzig Humboldt Chair and Perseus Digital Library partnership entails a synergetic effort not only to enrich the Perseus Catalog29 but also to provide new testing material for the continuous development of the Perseids Platform, where the first EpiDoc version of the FHG will be placed for further annotation. This undertaking is proving to be invaluable not only for the growth of partner resources but also for the development of the EpiDoc schema itself.30

35 The project team is also working on the text of the so-called Marmor Parium (IG 12.5.444), a fragmentary inscription from the island of Paros that preserves a Hellenistic chronicle from the reign of Cecrops (1581/1580 BCE) to the archonship of Euctemon (299/298 BCE).31 The epigraphical text was edited in the first volume of the FHG because of its historiographical value (Müller 1878–85, 1:533–90). The author of the text is unknown, but the content reflects his choices and consists of a list of historical events based mainly on Athenian history. In this respect, this evidence is a perfect example of a fragmentary author, whose work is preserved not through quotations in later texts, but in a fragmented original form. The text of the inscription is also being encoded in EpiDoc. An important part of the project is the identification of named entities mentioned in the inscription (such as names of kings and magistrates, personal names, and place names). The Pleiades gazetteer has been referenced for the place names.32 The identification of individuals will make use of and feed into the Standards for Networking Ancient Prosopographies project (SNAP).33 The team is also producing a visualization of the chronology preserved by the Marmor Parium with the open source tool TimelineJS, which allows the comparison of the text not only with other ancient


chronologies but also with different chronological interpretations of the content of the inscription made by modern scholars (Berti and Stoyanova 2014).34
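By way of illustration, a place name in the inscription might be linked to its gazetteer entry roughly as follows; the element choice and the Pleiades identifier shown here are assumptions made for the sake of the example, not the project's actual data.

<!-- Illustrative sketch: a place name pointing to an assumed Pleiades entry. -->
<placeName ref="http://pleiades.stoa.org/places/579885">Ἀθῆναι</placeName>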

3.3 The EpiDoc Guidelines and Encoding Process

36 The digital text of the FHG was obtained by feeding the volume scans available at the Internet Archive35 to an Optical Character Recognition (OCR) engine, which transformed the printed text into digital form. The error-laden Greek output was corrected semi-automatically and stored in text files. Any remaining errors are manually rectified during the encoding process.

37 LOFTS has drawn up an EpiDoc template that is compatible with the Perseus Digital Library corpus and applicable to all types of texts produced by the University of Leipzig Open Greek and Latin (OGL) Project, which is producing a comprehensive open collection of Classical sources.36 The template covers basic layout and citation structures, with room for alteration and improvement, and is integrated with encoding guidelines, produced not only to document the LOFTS editorial choices but also to help external individuals contribute to the encoding process. The project production line consists of two stages: (1) the encoding process is implemented by LOFTS and tackles layout, citation elements, and philological issues; (2) the initial editorial endeavor and the guidelines are made available for crowdsourcing, where contributors are encouraged to further tag the EpiDoc files with information regarding places, personal names, and any other relevant entities.

3.3.1 Stage One: First Level of Encoding by LOFTS

38 The editorial team of LOFTS creates one XML file per fragmentary author. Every file contains a <teiHeader> with specific information about the author, volume, or book in question. As for the <body>, Müller’s layout is not kept, since the Greek text is separated from its Latin translation. The structure within the <body> reflects the structure of each volume, using <div> elements with different @subtype values whenever needed. Müller’s Latin translations of the fragments are encoded in a separate XML file to facilitate text alignment.
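A minimal sketch of what one such author file might look like is given below; the header content, the @type and @subtype values, and the nesting are assumptions made for illustration rather than the project's actual template.

<!-- Illustrative sketch of a DFHG author file; names and values are assumed. -->
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Ion of Chios, fragments (FHG vol. 2)</title></titleStmt>
      <publicationStmt><p>Digital Fragmenta Historicorum Graecorum (DFHG)</p></publicationStmt>
      <sourceDesc><p>Müller, Fragmenta Historicorum Graecorum, vol. 2</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div type="textpart" subtype="author" n="Ion">
        <div type="textpart" subtype="fragment" n="6">
          <!-- Greek text of fragment 6; the Latin translation lives in a separate file -->
        </div>
      </div>
    </body>
  </text>
</TEI>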

39 For example, fragment 6 of Ion of Chios will be encoded as follows (figures 2, 3, and 4):


Figure 2. Page image from FHG 2:48: Greek text of fragment 6 (outlined in red) and Latin translation (outlined in yellow).

Figure 3. Image from the XML file for the Greek text.

Figure 4. Image from the XML file for the Latin translation.

40 While retaining the same number (6), the Greek text and the Latin translation are each encoded in their own container element, the latter in the separate translation file. The fragment is broken down into a <bibl>, which contains the source (Plutarch) and work, and a <quote>, containing the Greek text. Whenever there is a note, a <note> element is placed within the same container, and any information that does not strictly pertain to the fragment is likewise encoded in a <note>. This initial editorial stage also involves replacing special characters, such as the æ diphthong, with their character entities (see the Latin translation, where æ is represented by its entity) in order to avoid font and potential display issues and confusion between graphically similar but semantically different characters (e.g., capital Latin c and Greek capital lunate sigma). Numbers, including ordinals, are also tagged. Footnotes are given arabic numerals (Müller uses asterisks) for clarity.
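The following minimal sketch illustrates the structure just described for the Greek side of a fragment; the wrapping element, its attributes, and the placeholder contents are assumptions and do not reproduce the project's files.

<!-- Illustrative sketch of one stage-one fragment; element and attribute choices are assumed. -->
<div type="textpart" subtype="fragment" n="6">
  <bibl>Plutarch, <title>[title of the quoting work]</title></bibl>
  <quote xml:lang="grc">[Greek text of the fragment]</quote>
  <note>[editorial matter that does not strictly pertain to the fragment]</note>
</div>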


41 An XSL transformation37 is used to help the encoder better visualize the marked-up text and links the main body to the footnotes. All references to other works (often abbreviated by Müller) are tagged as <bibl> and will eventually point to a master bibliography file. Work titles appearing in the Greek and Latin texts are wrapped in a <title> element. Where we have both the Greek and the Latin titles of a work, the two titles are enclosed in another <title> tag:

<title><title xml:lang="grc">Περὶ τελετῶν</title> sive <title xml:lang="la">De mysteriis</title></title>

42 The FHG does not include a critical apparatus, but sometimes Müller signals variant readings in the text, and these are tagged as <rdg> elements inside an <app> within the main text. Personal names are simply tagged as <persName>. Names often carry additional information, such as patronymics and epithets or clues about an individual’s profession. Encoding such specificity is not a stage one priority but a stage two expectation.
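A compact sketch of these two conventions, with invented Greek words used purely for illustration, might look like this:

<!-- Hedged sketch only: a variant reading signalled by Müller and a plain personal name. -->
<quote xml:lang="grc">
  <persName>Ἴων</persName> ...
  <app><rdg>ἔφη</rdg><rdg>φησί</rdg></app> ...
</quote>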

43 Another problematic feature when working with editions of fragmentary works is how to encode specific editorial decisions concerning the attribution of a quotation or a text reuse to an author whose original works are lost. Müller sometimes adds different marks (parentheses, square brackets, or question marks) to the fragment number in order to signal uncertainty about the attribution of that fragment.38 In these cases the uncertainty is encoded using the @ana attribute, with one of the following values: "#dubia", "#incerta", and "#anonyma". These values are defined in the header. The reason for this choice is that, when the project was started, not all relevant elements had the @cert attribute. Even if they had, Müller’s reasons for marking something as uncertain vary a great deal, and he is also inconsistent in his use of these sigla. The use of the <certainty> element also proved problematic because of this inconsistency, since it was at times difficult to determine whether the uncertainty concerned the content of the citation, the attribution to an author, or the extent of the quotation. It was therefore decided to map Müller’s three sigla roughly onto what seemed the most frequent and probable uncertainty categories across the collection and, rather than scattering @cert or <certainty> across different elements, to use three different values of the @ana attribute on a single element.
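One way such a convention could be realized is shown below purely as a hypothetical sketch: the element carrying @ana and the header element in which the values are declared are assumptions, not the project's documented choices.

<!-- Hypothetical sketch; the declaring element and the element carrying @ana are assumed. -->
<encodingDesc>
  <classDecl>
    <taxonomy xml:id="attribution">
      <category xml:id="dubia"><catDesc>attribution marked as doubtful by Müller</catDesc></category>
      <category xml:id="incerta"><catDesc>attribution marked as uncertain by Müller</catDesc></category>
      <category xml:id="anonyma"><catDesc>fragment transmitted anonymously</catDesc></category>
    </taxonomy>
  </classDecl>
</encodingDesc>
...
<div type="textpart" subtype="fragment" n="..." ana="#dubia">
  ...
</div>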

3.3.2 Stage Two: Crowdsourced Annotation

44 Once complete, these basic XML files are deposited in the Perseids Platform for further annotation by third parties. While still in testing mode, this second stage encourages annotators to tag any additional information, including—but not limited to—the following (a brief sketch of such tagging follows the list):
• Names (person).39
• Names (place).40
• Bibliographic references: expansion of bibliographic references and linking to bibliography file.
• Numbers.
• Titles within Greek and Latin text.


• Stage 2 encoding of the Latin translations (after initial encoding at stage one).
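As a sketch of such stage-two enrichment, using the example name from note 39 (the component elements and the @ref target are illustrative assumptions, not the project's documented practice), an annotator might mark the parts of a personal name and link a place name to a gazetteer entry:

<!-- Hypothetical stage-two tagging; element choices and the gazetteer reference are assumed. -->
<persName>
  <name type="author">Κάδμος</name>
  <name type="patronymic">Ἀρχελάου</name>
  <name type="ethnic">Μιλήσιος</name>
</persName>
<placeName ref="http://www.geonames.org/...">Μίλητος</placeName>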

45 The editorial board provides the final review for each file. EpiDoc-encoded DFHG files are being progressively added to the DFHG GitHub repository41 for everyone to download, improve, and share in accordance with our CC BY-SA 4.0 International License.42

4. Conclusions

46 The Leipzig Open Fragmentary Texts Series (LOFTS) aims not only at producing an open series of Greek and Latin fragmentary authors, but also at building a model for representing quotations and text reuses of lost works in a digital environment. In order to achieve these goals, different technologies are being used and implemented beyond the TEI XML standard. The final outcome is the production of dynamic excerpts from source texts. In the case of digitized print editions of fragmentary works, this means linking the fragments directly to the source text and annotating their metadata within it. In the case of born-digital editions of fragmentary texts, fragments become multilayered annotations of information concerning fragmentary authors and reuses of their lost works. These dynamic excerpts contribute to the production of a real “multitext,” where each version of the same text embodies a different step in its transmission and a reconstruction of philological conjectures (on the concept of multitext see Blackwell and Crane 2009). The ultimate goals are the creation of open, linked, machine-actionable texts for the study of textual reuses of Classical works and the development of a collaborative environment for crowdsourced annotations that involve both scholars and students.

BIBLIOGRAPHY

Almas, Bridget, and Marie-Claire Beaulieu. 2013. “Developing a New Integrated Editing Platform for Source Documents in Classics.” Literary & Linguistic Computing 28(4): 493–503. doi:10.1093/llc/fqt046.

Almas, Bridget, and Monica Berti. 2013. “Perseids Collaborative Platform for Annotating Text Re-uses of Fragmentary Authors.” In DH-Case 2013. Proceedings of the 1st International Workshop on Collaborative Annotations in Shared Environments: Metadata, Vocabularies and Techniques in the Digital Humanities, edited by Francesca Tomasi and Fabio Vitali, article no. 7. New York, NY: ACM Digital Library. doi:10.1145/2517978.2517986.

Almas, Bridget, Monica Berti, Sayeed Choudhury, David Dubin, Megan Senseney, and Karen M. Wickett. 2013. “Representing Humanities Research Data Using Complementary Provenance Models.” Poster presented at Building Global Partnerships—RDA Second Plenary Meeting, Washington, DC, September 16–18, 2013.

Berti, Monica. 2012. “Citazioni e dinamiche testuali. L’intertestualità e la storiografia greca frammentaria.” In Tradizione e Trasmissione degli Storici Greci Frammentari II. Atti del Terzo Workshop


Internazionale. Roma, 24–26 febbraio 2011, edited by Virgilio Costa, 439–58. Tivoli: Edizioni Tored. http://www.monicaberti.com/wp-content/uploads/2014/08/CitazioniDinamicheTestuali..

———. 2013. “Collecting Quotations by Topic: Degrees of Preservation and Transtextual Relations among Genres.” Ancient Society 43: 269–88. http://poj.peeters-leuven.be/content.php?url=article&id=2992614&journal_code=AS. doi:10.2143/AS.43.0.2992614.

Berti, Monica, and Simona Stoyanova. 2014. “Digital Marmor Parium. For a Digital Edition of a Greek Chronicle.” In Information Technologies for Epigraphy and Cultural Heritage. Proceedings of the First EAGLE International Conference, edited by Silvia Orlandi, Raffaella Santucci, Vittore Casarosa, and Pietro Maria Liuzzo, 319–24. Roma: Sapienza Università Editrice. http://www.eagle-network.eu/wp-content/uploads/2015/01/Paris-Conference-Proceedings.pdf.

Berti, Monica, Matteo Romanello, Alison Babeu, and Gregory R. Crane. 2009. “Collecting Fragmentary Authors in a Digital Library.” In JCDL ’09: Proceedings of the 9th ACM/IEEE-CS Joint Conference on Digital Libraries, 259–62. New York, NY: ACM Digital Library. doi:10.1145/1555400.1555442.

Blackwell, Christopher, and Gregory R. Crane. 2009. “Conclusion: Cyberinfrastructure, the Scaife Digital Library and Classics in a Digital Age.” Digital Humanities Quarterly 3(1). http://www.digitalhumanities.org/dhq/vol/3/1/000035/000035.html.

Groth, Paul, Simon Miles, and Steve Munroe. 2006. “Principles of High Quality Documentation for Provenance: A Philosophical Discussion.” Lecture Notes in Computer Science 4145: 278–86. doi:10.1007/11890850_28.

Most, Glenn W., ed. 1997. Collecting Fragments. Fragmente sammeln. Göttingen: Vandenhoeck & Ruprecht.

Müller, Karl, ed. 1878–85. Fragmenta Historicorum Graecorum. 5 vols. Paris: A. F. Didot.

Renear, Allen, David Dubin, and C. M. Sperberg-McQueen. 2002. “Towards a Semantics for XML Markup.” DocEng ’02: Proceedings of the 2002 ACM Symposium on Document Engineering, 119–26. New York, NY: ACM Digital Library. doi:10.1145/585058.585081.

Smith, D. Neel, and Christopher W. Blackwell. 2012. “Four URLs, Limitless Apps: Separation of Concerns in the Homer Multitext Architecture.” In Donum natalicium digitaliter confectum Gregorio Nagy septuagenario a discipulis collegis familiaribus oblatum: A Virtual Birthday Gift Presented to Gregory Nagy on Turning Seventy by His Students, Colleagues, and Friends. Cambridge, MA: Center for Hellenic Studies, Harvard University. http://chs.harvard.edu/wa/pageR?tn=ArticleWrapper&bdc=12&mn=4846.

Wickett, Karen M., Andrea Thomer, Simone Sacchi, Karen S. Baker, and David Dubin. 2012. “What Dataset Descriptions Actually Describe: Using the Systematic Assertion Model to Connect Theory and Practice.” Poster presented at the 2012 ASIS&T Research Data Access and Preservation Summit, New Orleans, March 22–23. http://hdl.handle.net/2142/30470.

NOTES

1. http://www.dh.uni-leipzig.de/wo/projects/open-greek-and-latin-project/the-leipzig-open-fragmentary-texts-series-lofts/.
2. For a prototype interface, see http://perseids.org/sites/berti_demo/.
3. On this project, see below, section 3.


4. Digital Athenaeus, Universität Leipzig, http://www.dh.uni-leipzig.de/wo/projects/open-greek-and-latin-project/digital-athenaeus/.
5. Digital Marmor Parium, Universität Leipzig, http://www.dh.uni-leipzig.de/wo/projects/open-greek-and-latin-project/digital-marmor-parium/.
6. http://perseids.org/.
7. http://www.library.tufts.edu/tisch/ematlocalstorage/miscellany_collection/home.html.
8. http://www.homermultitext.org/.
9. http://www.papyri.info/.
10. http://rubyonrails.org/.
11. http://git-scm.com/.
12. This work was supported by grants from Tufts University, the National Endowment for the Humanities (grant HD-51548-12), and the Institute of Museum and Library Services. Funding from the Mellon Foundation is now allowing us to expand the platform to support classroom-based collaboration on digital editions, beginning-to-end scholarly workflows, and the development of dynamic syllabi using the resources managed by the platform.
13. For the prototype functionality used by students of the Digital Philology course at the University of Leipzig in the fall of 2013, see http://sites.tufts.edu/perseids/workflows/fragmentary-author-workflows/fragmentary-text-prototype-fall-2013/.
14. The texts will adhere to the EpiDoc subset of the TEI standard, as discussed further below.
15. Documentation: http://www.homermultitext.org/hmt-doc/cite/index.html.
16. “Perseus Stable URIs,” Perseus Digital Library Updates, http://sites.tufts.edu/perseusupdates/beta-features/perseus-stable-uris/.
17. At the time of this writing, complete implementation of the CTS standard for resolution of passage subreferences at the data.perseus.org address is still pending.
18. Open Annotation Data Model, February 8, 2013, http://www.openannotation.org/spec/core/20130208/index.html.
19. http://nlp.perseus.tufts.edu/syntax/treebank/agdt/1.7/data/treebank-1.5.xsd.
20. https://svn.code.sf.net/p/alpheios/code/xml_ctl_files/schemas/trunk/aligned-text.xsd.
21. John Gruber, “Markdown: Syntax,” http://daringfireball.net/projects/markdown/syntax.
22. JSON for Linking Data, http://json-ld.org/.
23. PROV-DM: The PROV Data Model, W3C Recommendation, April 30, 2013, http://www.w3.org/TR/prov-dm/.
24. To substantiate her argument, the scholar must also identify corroborating material, including instances of Istros’ text in other primary sources—in this example that of Athenaeus, Deipn. 3.6, who does name Istros as the source.
25. http://perseids.org/sites/berti_demo/index.html (source code at https://github.com/PerseusDL/lci-demo).


26. Digital Fragmenta Historicorum Graecorum (DFHG) Project, Universität Leipzig, http://www.dh.uni-leipzig.de/wo/projects/open-greek-and-latin-project/digital-fragmenta-historicorum-graecorum-dfhg-project/.
27. EpiDoc: Epigraphic Documents in TEI XML, http://sourceforge.net/p/epidoc/wiki/Home/.
28. http://www.eagle-network.eu/.
29. http://catalog.perseus.org/.
30. There is little room or intention in this article to elaborate further on the topic. Let it suffice to say that this endeavor has contributed to the new release of the schema (published in February 2014), which now allows, among other additions, the following: (1) the @rend attribute can now have "stacked" as a value to encode stacked symbols or characters (such features are often found in manuscripts); (2) the element is no longer omitted. All this is explained in version 8.20 of the EpiDoc Guidelines (Tom Elliott, Gabriel Bodard, Elli Mylonas, Simona Stoyanova, Charlotte Tupman, Scott Vanderbilt, et al., EpiDoc Guidelines: Ancient Documents in TEI XML, December 4, 2014, http://www.stoa.org/epidoc/gl/latest/).
31. http://www.dh.uni-leipzig.de/wo/projects/open-greek-and-latin-project/digital-marmor-parium/.
32. http://pleiades.stoa.org/.
33. http://snapdrgn.net/.
34. http://timeline.knightlab.com/.
35. https://archive.org/details/fragmentahistori01mueluoft.
36. http://www.dh.uni-leipzig.de/wo/projects/open-greek-and-latin-project/.
37. http://www.w3.org/TR/xslt.
38. See, for example, Müller 1878–85, 1:1, frr. 5 and 7; 1:56, fr. 83; 2:29, fr. 2.
39. For example, Κάδμος Ἀρχελάου Μιλήσιος, where we have the name of the author (Κάδμος), the name of the father (Ἀρχελάου), and the ethnicon (Μιλήσιος). Further details on, and matching of, name components will be part of a later, spin-off prosopography project.
40. GeoNames geographical database, http://www.geonames.org/.
41. https://github.com/OpenGreekAndLatin/dfhg-dev/wiki.
42. https://creativecommons.org/licenses/by-sa/4.0/.

ABSTRACTS

This paper presents a joint project of the Humboldt Chair of Digital Humanities at the University of Leipzig, the Perseus Digital Library at Tufts University, and the Harvard Center for Hellenic Studies to produce a new open series of Greek and Latin fragmentary authors. Such authors are lost, and their works are preserved only thanks to quotations and text reuses in later texts. The project is undertaking two tasks: (1) the digitization of paper editions of fragmentary works with


links to the source texts from which the fragments have been extracted; (2) the production of born-digital editions of fragmentary works. The ultimate goals are the creation of open, linked, machine-actionable texts for the study and advancement of the field of Classical textual fragmentary heritage and the development of a collaborative environment for crowdsourced annotations. These goals are being achieved by implementing the Perseids Platform and by encoding the Fragmenta Historicorum Graecorum, one of the most important and comprehensive collections of fragmentary authors.

INDEX

Keywords: fragmentary text, digital edition, Classics, CTS, SAM, PROV, EpiDoc

AUTHORS

MONICA BERTI Monica Berti is Assistant Professor at the Humboldt Chair of Digital Humanities at the University of Leipzig, where she teaches courses on digital philology. Since 2008 she has been working with the Perseus Digital Library at Tufts University, where she has been visiting professor in the Department of Classics. Her current research work is focused on representing quotations and text reuses of ancient lost works.

BRIDGET ALMAS Bridget Almas is the lead software developer for the Perseus Digital Library at Tufts University. She has been working in software development since 1994, and for Perseus since 2010. She also currently serves as a member of the Technical Advisory Board of the Research Data Alliance.

DAVID DUBIN David Dubin is a research associate professor at the University of Illinois Graduate School of Library and Information Science. His research areas are the foundations of information representation and description, and issues of expression and encoding in documents and digital information resources.

GRETA FRANZINI Greta Franzini completed her Classics BA and Digital Humanities MA degrees at King’s College London. Greta is currently doing a PhD at the UCL Centre for Digital Humanities, where she conducts research in the fields of Classical Philology, Manuscript Studies and Literary Criticism. She is also currently working as a Researcher at the Göttingen Centre for Digital Humanities. Previously, she worked as a Research Associate at the Humboldt Chair of Digital Humanities at the University of Leipzig.

SIMONA STOYANOVA Simona Stoyanova is a classicist who specializes in epigraphy, and a digital humanist. She works as a research associate for the Open Philology Project at the University of Leipzig. She is also a PhD candidate at King’s College London. Her research is focused on the Greek and Latin epigraphic traditions in the mixed-language population of the province of Thrace, with particular interest in palaeographical issues and their possible investigation through the DigiPal framework.


GREGORY R. CRANE Gregory R. Crane is the editor-in-chief of the Perseus Project at Tufts University, where he is professor of Classics and holds the Winnick Family Chair of Technology and Entrepreneurship. He is also Alexander von Humboldt Professor of Digital Humanities at the University of Leipzig.


TEI in LMNL: Implications for Modeling

Wendell Piez

1. TEI in LMNL

1 LMNL, the Layered Markup and Annotation Language, describes a data model and syntax for an experimental prototype in document description, modeling, and markup. Properly a metalanguage, LMNL has no native semantics beyond the idea of annotated ranges over texts. Proposed over ten years ago (Tennison and Piez 2002) and developed intermittently since then, LMNL is not so much an alternative to XML as a foil for it, substituting the information contained in XML hierarchies with “a more rudimentary data model that would capture the information we needed, while not itself trying to assert containment or sibling relations” (Tennison and Piez 2002). For years we have had only partial or prototype implementations of LMNL-based processing systems (see Czmiel 2004, Caton 2005, and Piez 2012), or mockups of what LMNL would be like (Piez 2004)—tantalizingly capable even while fragile and short-lived—and we have had more and more LMNL-like solutions (both within and outside XML) to the problems of representation addressed by LMNL (including multiple concurrent hierarchies and “arbitrary overlap,” that is, the overlap between ranges of the same type such as texts marked for indexing for annotation). However, it has become more and more apparent that what Jeni Tennison and I dreamed up then was, in a certain sense, “obvious,” and (as such) one of those things that will continually be rediscovered. As if it were a territory only now being mapped, LMNL was always there, and indeed visited, before Jeni and I discussed it. One might say that the LMNL we proposed is nothing more than a particular, peculiar way of generalizing some solutions that are increasingly all around us every day. Nonetheless, with a LMNL processor in hand, it turns out one can start doing some interesting things, as it does indeed move the boundaries of what is possible. Both the analysis and the representation of literary texts (just to take the examples with which I have worked) are enabled in new ways by being able to overlap freely in the model. We find LMNL


gives us a way of approaching modeling and markup that sheds an oblique light on systems (such as XML-based systems or the HTML-based web) with which we are more familiar. Indeed this is, perhaps, what has stimulated such interest as it has attracted, despite its evident weaknesses and immaturity.

2 In one important respect, LMNL is similar to XML. While XML projects a model of a text as a hierarchy of elements whose properties (names, contents, organization) are regular enough for higher-level application semantics to be described for a particular process (and/or a generalized set of processes), XML itself imposes no “native semantics”: it is left to an application designer to define element types, permissible relations among element types, and their behaviors and “meaning” in application. This is done in XML by designing and deploying a controlled vocabulary—a tag set—along with rules for its use, which together describe an XML document type along with an abstract document model. Of course, TEI is an example of this. The semantics of TEI element types are defined not by XML but by TEI; TEI-conformant applications can and do take advantage of the categories and relations described by TEI (not just those that might be inferred directly from the XML structure) in doing useful things with TEI data.

3 Similarly, LMNL does not preclude any naming or any expression of constraints, instead leaving the assignment of range names and semantics to its user; and this (as it is in XML) may be a significant burden. TEI gives us a good place to start whenever we have to consider naming and typing problems. (While largely unappreciated, this is one aspect of TEI that is invaluable.) Then, too, if we could pull TEI data easily into LMNL-based systems, we would be able to take advantage of the prior investment in TEI and perhaps save ourselves steps (at least in areas where the TEI markup remains serviceable).

4 This leads to the intriguing idea that we might build on TEI semantics when working with data in LMNL. The tagging could be TEI XML, or it could be a LMNL rendition of the same structures, but the data model would be LMNL and could represent any overlapping ranges in the data directly, rather than implicitly by means of stand-off or pointer mechanisms, as has to be done in XML.

5 A small example (the first quatrain of a sonnet of Petrarch, in TEI XML) appears below. Two hierarchies are encoded: the poetic structure (prioritized in the XML structure) and the linguistic (sentences and phrases), with the components of the latter linked using the TEI @part attribute. (Of course this is a vast simplification of what one might actually wish to do.)

<lg type="quatrain">
  <l><seg type="s" part="I"><seg type="phr" part="N">L’arbor gentil che forte amai molt’anni,</seg></seg></l>
  <l><seg type="s" part="M"><seg type="phr" part="I">mentre i bei rami non m’ebber a sdegno</seg></seg></l>
  <l><seg type="s" part="M"><seg type="phr" part="M">fiorir faceva il mio debile ingegno</seg></seg></l>
  <l><seg type="s" part="F"><seg type="phr" part="F">e la sua ombra,</seg> <seg type="phr" part="N">et crescer negli affanni.</seg></seg></l>
</lg>


6 Transposed into LMNL “naïvely” using a generic transformation:

Example 1. LMNL representation of a quatrain (see Piez 2012 for a detailed explanation of LMNL syntax).

[lg [type}quatrain{]}
[l}[seg [type}s{][part}I{]}[seg [type}phr{][part}N{]}L’arbor gentil che forte amai molt’anni,{seg]{seg]{l]
[l}[seg [type}s{][part}M{]}[seg [type}phr{][part}I{]}mentre i bei rami non m’ebber a sdegno{seg]{seg]{l]
[l}[seg [type}s{][part}M{]}[seg [type}phr{][part}M{]}fiorir faceva il mio debile ingegno{seg]{seg]{l]
[l}[seg [type}s{][part}F{]} [seg [type}phr{][part}F{]}e la sua ombra, {seg][seg [type}phr{][part}N{]}et crescer negli affanni.{seg]{seg]{l]
{lg]

7 Or, using a transformation amended to recognize the semantics of TEI @part:

Example 2. LMNL representation of a quatrain including @part attribute information.

[lg [type}quatrain{]}
[l}[seg [type}s{]}[seg [type}phr{]}L’arbor gentil che forte amai molt’anni,{seg]{l]
[l}[seg [type}phr{]}mentre i bei rami non m’ebber a sdegno{l]
[l}fiorir faceva il mio debile ingegno{l]
[l}e la sua ombra,{seg] [seg [type}phr{]}et crescer negli affanni.{seg] {seg]{l]
{lg]

8 In this case, ranges mapped from elements marked as @part="I" get start tags but no end tags, since they are taken not to end their ranges. Similarly, elements marked as @part="F" get no start tags, while those marked as @part="M" get no tags at all.

9 Note that even this transformation is not lossy, since a reverse transformation that induced an element hierarchy from the ranges would be able to mark elements that had to be segmented with their appropriate @part attributes. (To be sure, a processor that wrote such segmented elements would have to respect the intended semantics of TEI @part in doing so—"I" for the first segment in the seg range, "F" for the last, and "M" for all the others. Yet since the sequencing is determinable from the data, the process can be automated.)

10 Finally, one is tempted to promote @type values here into range names. This is no longer TEI, but nonetheless corresponds semantically to the TEI we began with:


Example 3. LMNL representation including information from TEI @type.

[quatrain}
[l}[s}[phr}L’arbor gentil che forte amai molt’anni,{phr]{l]
[l}[phr}mentre i bei rami non m’ebber a sdegno{l]
[l}fiorir faceva il mio debile ingegno{l]
[l}e la sua ombra,{phr] [phr}et crescer negli affanni.{phr]{s]{l]
{quatrain]

11 All of these LMNL instances are ready for processing by a LMNL processor.

12 This capability of manipulating the data directly raises many questions for modeling documentary data using markup, as outlined below.

2. Implications for Modeling

13 The foregoing examples show how LMNL’s ability to represent overlapping ranges opens modeling opportunities. But there is more to it than this.

14 LMNL also supports structured annotations on ranges. In a way analogous to XML attributes attached to XML ranges, LMNL annotations can be associated with LMNL ranges. However, they also offer several features that XML attributes do not:
1. LMNL annotations may be marked up. Thus, they may be structured in the same way as documents may be structured (in other words, LMNL annotations are isomorphic to documents).
2. The order of annotations assigned to a range is preserved in the model.
3. Annotations may have the same name as other annotations on the same range. And (like ranges) they may be anonymous.
4. Since ranges may also be anonymous, all the properties assigned to a range may be represented in its annotations.
5. Annotations may be annotated. Accordingly, a range may have a “tree” of annotations, each annotation further annotated. This also makes LMNL annotations composable.1

15 In LMNL syntax, annotations are shown using the same syntax as ranges; the only differences are that they are placed in tags (start or end tags for marked ranges, or in tags representing empty ranges or atoms) and that they may not, unlike ranges, overlap with one another. (Every annotation is discrete even though its contents may include overlapping ranges.) Finally, in order to reduce tagging overhead, end tags for simple annotations (those with no internal markup) may be represented using anonymous end-tag notation, as {].

16 So this is a (single) legal LMNL start tag: [sonnet [author [name}Francesco Petrarca{] [born [year}1304{][place}Arezzo{]] [deceased [year}1374{][place}Arquà{]]]} The range tagged as sonnet (whose start tag is shown) has an annotation named author with three annotations of its own: name, born, and deceased. The annotations born and deceased each have a year and place annotation. So we have four levels of annotations annotating annotations; for example, sonnet/author/born/place here is Arezzo. Thus LMNL annotations present an XML-like capability of structure and hierarchy, presenting navigable, structured data sets in much the way XML does—even while ranges can overlap within a single layer (whether in the document itself, or in an annotation’s value or “text layer”).

17 Below is the same code, except using indentation to indicate where annotations have annotations. The key is to understand that [born [year}1304{]] represents an empty annotation (which uses the same tagging syntax as an empty range), itself with a single annotation (named “year,” with a value of “1304”). The tagging syntax is recursive—range markup is used inside tags to indicate annotations on the ranges indicated by those tags. In this example, {] serves as an abbreviation for {year] (an end-tag for the “year” annotation).

Example 4. A range in LMNL, with embedded annotations.

[sonnet
    [author
        [name}Francesco Petrarca{]
        [born
            [year}1304{]
            [place}Arezzo{]]
        [deceased
            [year}1374{]
            [place}Arquà{]]]}

18 Overlapping ranges and structured annotations together give LMNL capabilities for modeling very different from those of XML. Just as a transposition from XML syntax to LMNL syntax can expose whatever overlap is indicated by TEI XML directly (by showing start and end tags only where the entire range starts and ends, not every place an XML element boundary appears), it can also promote XML elements into LMNL annotation structures instead of LMNL ranges. Nothing need be flattened: because LMNL annotations can support structure and markup, annotations are first-class objects. For example, it might be considered natural to present a TEI header as an annotation (with its own annotations for <fileDesc>, <profileDesc>, etc.) in the start tag of a TEI instance. In such a LMNL document, the header would not be considered part of the content of the document, but would be considered an annotation of it—an arrangement which, arguably, better reflects its status as metadata.
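To make the suggestion concrete, the following is a minimal sketch (not drawn from any of the datasets described below; the title and the selection of header parts are invented for illustration) of a LMNL document whose teiHeader rides as an annotation on the TEI range rather than appearing in its content:

[TEI [teiHeader [fileDesc [titleStmt [title}A quatrain from Petrarca{]]]]}
[lg [type}quatrain{]}[l}L’arbor gentil che forte amai molt’anni,{l] ... {lg]
{TEI]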

19 Modeling in LMNL is thus more flexible and adaptable than modeling in XML. If subordinate information such as alternative readings, analytical descriptions, or component-level metadata can be represented in annotations, the main text itself can maintain its integrity as a distinct artifact—unlike in XML, in which a transcription’s editorial apparatus, as soon as it gets too complicated to represent in attributes (i.e., name/value pairs), must be (if it is to be stored in the same document) promoted into element contents within the same conceptual “tree” as the transcription itself. Similarly, when a text must be organized into multiple concurrent hierarchies (such as grammatical, rhetorical, narratological, and prosodic hierarchies, or a hierarchy representing the physical organization of text in volumes and pages), these can simply be tagged in LMNL and processed appropriately according to rules that respect their semantics.

20 This can be demonstrated in a LMNL representation of a Shakespeare play (converted, like the sonnet example given above, from TEI source data, in this case from the Folger Digital Editions of Shakespeare). Here, the dramatic and dialogic form is represented directly along with the verse structure (where the plays are in verse). Tagged in LMNL, a line of verse is always a line, and does not have to be split between two speeches when it spans across them:

Example 5. An extract from Act 1, Scene 2 of The Tempest.

[sp [speaker}ARIEL{speaker]} [line [n}1.2.231{]}To every article.{line] [line [n}1.2.232{]}I boarded the King’s ship; now on the beak,{line]

[...]

[line [n}1.2.242{]}Yea, his dread trident shake. {sp] [sp [speaker}PROSPERO{speaker]}My brave spirit!{line] [line [n}1.2.244{]}Who was so firm, so constant, that this coil{line] [line [n}1.2.245{]}Would not infect his reason? {sp]

21 Interestingly, it becomes possible to represent settings, stage directions, and metadata (such as act and scene headers) in the play in the form of annotations, thus “hiding” them away from the base text, while making them readily available should we want them. Since annotations can be annotated, they provide a convenient way of presenting data (such as complex metadata) that truly is hierarchical, while markup in the text flow can freely describe whatever the researcher deems to be important, irrespective of any structural relations (or lack thereof) among features of interest.

3. Current Capabilities, Potentials, and Work to be Done

22 As shown above, LMNL markup is not difficult to generate programmatically from TEI data; moreover, where TEI markup can be taken to indicate range relations rather than structure (either by marking with milestones or by element alignment), a little logic can translate it into LMNL start/end tag pairs, thus representing the range directly (as in the third quatrain example above). Otherwise, any hierarchies that happen to be present in the XML will persist, albeit expressed as sets of “enclosing” and “enclosed” ranges (and no longer “parents” and “children”).

23 In order to process LMNL syntax inputs, however, we need a parser and some sort of processing stack. Since 2011 I have been prototyping such a LMNL framework in the form of Luminescent (see Piez 2012). Luminescent is implemented as an XSLT pipeline capable of “compiling” LMNL markup into a database-like representation, expressed as XML, which presents all the same information as stand-off pointers into a sequence of segments of the text. This same technique is used internally by several systems dealing with overlap in XML (see in particular CATMA2 and XStandoff3), though details differ. As it is XML-based, this format can in turn be queried, processed, and transformed using XML technologies including XPath and XSLT to generate outputs in XML, HTML, SVG, or plain text for display or any other application.
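As a rough illustration of the general idea (a schematic only: the element names are invented here, and this is not Luminescent’s actual serialization), such a compiled form keeps the text as a sequence of segments and records each range as a stand-off pointer over them:

<compiled>
  <!-- the text, cut into the smallest spans bounded by any start or end tag -->
  <segments>
    <seg xml:id="s1">e la sua ombra,</seg>
    <seg xml:id="s2"> et crescer negli affanni.</seg>
  </segments>
  <!-- ranges expressed as stand-off pointers into that segment sequence -->
  <ranges>
    <range name="l" from="s1" to="s2"/>
    <range name="phr" from="s1" to="s1"/>
    <range name="phr" from="s2" to="s2"/>
  </ranges>
</compiled>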

24 One such target format of LMNL processing is a hierarchical “view” of a LMNL document—an XML hierarchy interpolated on request, promoting any range types to (nesting) elements. Given a set of LMNL ranges, in other words, it is possible to stipulate that you want a hierarchy, promoting whichever ranges you wish into first-class element structures in an XML document—which becomes the so-called “sacred” hierarchy as described by Marinelli, Vitali, and Zacchiroli (2008), while “profane” hierarchies remain implicit.4 This might be a convenient form to have on the way to a presentational format such as HTML. Then, for another purpose, you can return to the LMNL source (or its compiled XML representation) and request a different hierarchy. XML, in other words, remains available and useful for transitional formats between LMNL and the results one ultimately wishes to produce.
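For instance, from the ranges of the quatrain in example 3 above one might request either a metrical view or a syntactic view, each a perfectly ordinary XML hierarchy (a sketch only; neither is privileged in the LMNL source):

<!-- metrical view: line ranges promoted to elements -->
<quatrain>
  <l>L’arbor gentil che forte amai molt’anni,</l>
  <l>mentre i bei rami non m’ebber a sdegno</l>
  <l>fiorir faceva il mio debile ingegno</l>
  <l>e la sua ombra, et crescer negli affanni.</l>
</quatrain>

<!-- syntactic view: the sentence and its phrases promoted instead -->
<quatrain>
  <s>
    <phr>L’arbor gentil che forte amai molt’anni,</phr>
    <phr>mentre i bei rami non m’ebber a sdegno fiorir faceva il mio debile ingegno e la sua ombra,</phr>
    <phr>et crescer negli affanni.</phr>
  </s>
</quatrain>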

25 Some of the outputs offered by Luminescent are on display in the supplementary materials provided with this article.

26 However intriguing some of these may be, in many cases the TEI practitioner may ask: what is the point of working so hard not to use XML? Or, more pointedly: if the data is TEI going in, and the first thing that happens to the LMNL is to convert it back into XML, then why not skip the LMNL syntax? Both of these are fair questions. I would in fact agree that everything being done with LMNL here might actually be more straightforwardly and easily done by processing straight from TEI into the various target formats. After all, we actually have ways of dealing with overlap now in XSLT 2.0, as I have reported (Piez 2011).
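For instance (a minimal, hypothetical sketch, not the code reported in Piez 2011), XSLT 2.0 grouping can already induce a page hierarchy from pb milestones inside a TEI div:

<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0">
  <!-- regroup the children of a div into one page element per pb milestone;
       any content before the first pb falls into a page without an @n -->
  <xsl:template match="tei:div">
    <div>
      <xsl:for-each-group select="node()" group-starting-with="tei:pb">
        <page n="{current-group()[1]/self::tei:pb/@n}">
          <xsl:copy-of select="current-group()[not(self::tei:pb)]"/>
        </page>
      </xsl:for-each-group>
    </div>
  </xsl:template>
</xsl:stylesheet>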

27 However, it can be argued that efficiency and ease of use are not actually among the primary goals of practitioners of TEI, who have had various research agendas. The intent here is not to demonstrate anything as superior or inferior, either in general or for particular purposes. LMNL may be useful, or it may merely be interesting for the light it casts. To be sure, it supports, in a generalized way, certain operations that are difficult or onerous to engineer in XML natively (and have to be done case by case), such as hierarchical transposition (for example, converting a document representing pages, such as a TEI Genetic Editions instance, into a “logical” view of the same document), or filtering and extraction of ranges of text associated with structures not directly represented in the primary hierarchy of an XML document, such as narrative or dialogic structures. It is not that these operations cannot be done in TEI—they can be and have been done (typically using advanced features in XSLT or XQuery). However, when it comes to a certain kind of work with markup (especially an application of markup one might call exploratory as well as descriptive [Piez 2001]), there is something liberating about LMNL’s simplicity and flexibility. In particular, when multiple concurrent hierarchies (MCH) are of special interest, to be unburdened of the chore of assigning and maintaining pointers or references (something without which XML simply cannot approach these problems) comes as a considerable relief: even in a plain text editor, one can simply tag, focusing on the tasks of tagging. Then too, in LMNL one can simply start tagging, both without worrying about the startup development that inevitably goes with an XML application of this sort,5 and with confidence that certain very generic processes (including processes that can convert LMNL ranges into XML element hierarchies) can be applied to LMNL at any time without any commitment at all to specialized application code for any particular application semantic. This ease of use has an important consequence: essentially, LMNL allows the text encoder to delay commitments to one or another hierarchy as primary until later in the process. This makes it an appealing “counter case” for the use of text-based markup as well as for the tree-shaped model an XML document must normally take.

28 It should not need to be stressed that LMNL (as a technology, not simply a conception) has a long way to go to be dependable; the gain in expressiveness when tagged structures can overlap, and annotations can be annotated, comes at a price. At this point the greatest difficulty is obviously the fact that the processing stack is entirely custom-built and “bespoke,” using XML technologies to implement both the parsing of LMNL syntax and processing the range-based data model it was designed to represent. Yet even if this were not an impediment for users other than myself, there are nonetheless important caveats. Without anything strictly analogous to schemas in XML, LMNL is unsuitable for many of the tasks for which XML is so good, such as the preparation of documents for publishing—a task for which validation to a schema, more or less explicit, is essential. Yet even this contrast is instructive, as much of what makes LMNL so interesting is how it provokes thinking about such questions as what a schema language would look like that could account for overlap and MCH (see Sperberg-McQueen 2006 and Tennison 2007), as well as prompting experiments such as an XQuery function library that supports querying a LMNL data model (including overlaps), something I have more recently been doing in Luminescent.

29 Yet the considerable difference between a mature, robust set of standards-based technologies and a small research project (however ambitious in its conception) may also shed light on the greatest strengths of TEI and XML even while it prompts us to think about what is interesting about LMNL, especially in the context of TEI. The primary goal of text encoding in the humanities should not be to conform to standards, although that may be a necessary means to important ends. Rather, we encode texts and represent them digitally in order to present, examine, study, and reflect on the rich heritage of knowledge and expression presented to us in our cultural legacy. TEI markup does this, and the practicability of getting the information out of a TEI XML file and into something quite different shows us how much we do not have to make up, from scratch, even while we grab arbitrary data sets from the Internet. LMNL is both possible and interesting largely because it already has XML technologies, including TEI, to do so much of the heavy lifting. Yet, while it has been evident from early days that texts in the humanities pose special problems for digital processing—because they are characterized by complex structures including concurrent hierarchies (see, especially, Sperberg-McQueen 1991; Renear, Mylonas, and Durand 1993; and Huitfeldt 1994–95)— we have also been hampered by the fact that the best technologies available have been optimized for one structure at a time. Experimenting with a markup and modeling technology that does not force the false choice between one structure and another, we discover not only new capabilities for the digital processing of humanities texts— simultaneously building on TEI and extending it to new uses—but also something about the way our tools condition our view of our texts.

4. Supporting Materials

30 This article includes a number of small demonstrations of LMNL-tagged documents and of processing outputs from Luminescent. These outputs vary in quality (and in some cases, large file sizes may stress available system resources), but should be self-explanatory. Any of the HTML or SVG files may be opened in a browser, although for best results, an SVG viewer which supports zoom to arbitrary levels, such as Apache Batik (available alone or as the SVG Viewer in the oXygen XML editor), should be used. LMNL text files can be viewed in any browser or text editor.
• Sonnet examples: Several sonnets, including the Petrarch example above, are given in LMNL, along with HTML/SVG-formatted display versions, with diagrams showing their structure, including overlap between verse and grammatical form.
• Longer poems: Percy Shelley’s “Julian and Maddalo” and William Butler Yeats’s “Easter 1916” are provided with similar treatments.
• Folger Digital Texts, converted into LMNL: These editions of several of Shakespeare’s plays are converted from TEI inputs and show TEI semantic usage in LMNL data, with some structural adjustments. Also included are SVG “maps” of the plays, showing their structure by means of an illustration of the extent of acts, scenes, and speeches.
• A novel: Mary Shelley’s epistolary novel Frankenstein is marked up in LMNL syntax. Again, for the most part TEI element and attribute names and semantics are used. Three concurrent hierarchies are encoded: pages in the source edition; narrative and dialogic structures; and the logical structure of letters, chapters, and paragraphs.

31 LMNL is documented at http://www.lmnl-markup.org/; Luminescent is available on Github, at https://github.com/wendellpiez/Luminescent.

Figure 1. Mary Shelley’s Frankenstein.

32 Depicted in figure 1 is a diagram generated programmatically from a data instance, marked up in LMNL, representing the disposition of ranges within the nineteenth-century novel. The size and placement of “bubbles” indicate the extent of ranges within the text, as identified by markup. At http://wendellpiez.com/FrankensteinTransformed, a more highly developed and perhaps intelligible version is available.


BIBLIOGRAPHY

Caton, Paul. 2005. “LMNL Matters?” Paper presented at Extreme Markup Languages 2005, Montréal, Quebec, August 1–5. Abstract: http://conferences.idealliance.org/extreme/html/2005/Caton01/EML2005Caton01.html.

Czmiel, Alexander. 2004. “XML for Overlapping Structures (XfOS) Using a Non XML Data Model.” Paper presented at ALLC/ACH 2004, Gothenburg, Sweden, June 11–16.

Huitfeldt, Claus. 1994–95. “Multi-Dimensional Texts in a One-Dimensional Medium.” In “Humanities Computing in Norway,” special issue, Computers and the Humanities 28 (4/5): 235–41. doi:10.1007/BF01830270.

Marinelli, Paolo, Fabio Vitali, and Stefano Zacchiroli. 2008. “Towards the Unification of Formats for Overlapping Markup.” New Review of Hypermedia and Multimedia 14 (1). https://hal.archives-ouvertes.fr/hal-00340578. doi:10.1080/13614560802316145.

Piez, Wendell. 2001. “Beyond the ‘Descriptive vs. Procedural’ Distinction.” Markup Languages: Theory & Practice 3 (2). http://www.wendellpiez.com/resources/publications/beyonddistinction.pdf.

———. 2004. “Half-steps toward LMNL.” In Proceedings of Extreme Markup Languages 2004. http://conferences.idealliance.org/extreme/html/2004/Piez01/EML2004Piez01.html.

———. 2011. “TEI Overlap Demonstration: Processing TEI P5, with Overlap, Using XSLT 2.0 in the Client.” For the Interedition 2011 Boot Camp, TEI Members Meeting 2011, Würzburg, Lower Franconia. Last updated February 2014. http://piez.org/wendell/projects/Interedition2011/.

———. 2012. “Luminescent: Parsing LMNL by XSLT Upconversion.” In Proceedings of Balisage: The Markup Conference 2012. Balisage Series on Markup Technologies, vol. 8. doi:10.4242/BalisageVol8.Piez01.

Renear, Allen, Elli Mylonas, and David Durand. 1993. “Refining Our Notion of What Text Really Is: The Problem of Overlapping Hierarchies.” Final version, January 6. http://www.stg.brown.edu/resources/stg/monographs/ohco.html.

Sperberg-McQueen, C. M. 1991. “Text in the Electronic Age: Textual Study and Text Encoding, with Examples from Medieval Texts.” Literary and Linguistic Computing 6 (1): 34–46. doi:10.1093/llc/6.1.34.

———. 2006. “Rabbit/Duck Grammars: A Validation Method for Overlapping Structures.” In Proceedings of Extreme Markup Languages 2006. http://www.idealliance.org/papers/extreme/proceedings/html/2006/SperbergMcQueen01/EML2006SperbergMcQueen01.html.

Tennison, Jeni, and Wendell Piez. 2002. “The Layered Markup and Annotation Language (LMNL).” Paper presented at Extreme Markup Languages 2002, Montréal, Quebec, August 4–9. Abstract: http://xml.coverpages.org/LMNL-Abstract.html.

Tennison, Jeni. 2007. “Creole: Validating Overlapping Markup.” Paper presented at XTech 2007, Paris, France, May 15–18. http://www.princexml.com/howcome/2007/xtech/papers/output/0077-30.pdf.

TEI Consortium. 2014. “Non-hierarchical Structures” (chapter 20). In TEI P5: Guidelines for Electronic Text Encoding and Interchange. Version 2.6.0. Last updated January 20. N.p.: TEI Consortium. http://www.tei-c.org/Vault/P5/2.6.0/doc/tei-p5-doc/en/html/NH.html.


NOTES

1. Referring specifically to “composability” in the technical sense as described at https://en.wikipedia.org/wiki/Composability.
2. CATMA: Computer Aided Textual Markup and Analysis. See http://www.catma.de/.
3. http://www.xstandoff.net/.
4. Naturally, in mapping from LMNL to XML, one has to stipulate particular sets of ranges that are known to fall into hierarchies, to map to the XML hierarchy, lest the processor break elements into segments at points of overlap; fortunately another feature of LMNL is analytic processing that can report which range types overlap with which, and where.
5. Such as the CSS/ apparatus one typically wants, in a structured XML editor, as an aid to editing the very pointers that one has no need for in LMNL.

ABSTRACTS

LMNL (the Layered Markup and Annotation Language) is a small, if ambitious, research project in text encoding. Unlike XML, LMNL markup does not represent document components or constituents (as “elements”) within a structure, but rather simply labels ranges of text within the document for identification and processing. Avoiding choices between structures, any and all structures can be elucidated; this reveals not only new capabilities for the digital processing of humanities texts—simultaneously building on TEI and extending it to new uses—but also something about the way our tools condition our view of our objects of study. This paper presents LMNL and suggests some of its strengths in the context of TEI.

INDEX

Keywords: text encoding, visualization, markup, overlap, hierarchy

AUTHOR

WENDELL PIEZ
Wendell Piez is an independent consultant based in Rockville, Maryland, specializing in XML and XSLT.


Ontologies, Data Modeling, and TEI

Øyvind Eide

1. Introduction

1 In philosophy, ontology has for at least 2,500 years denoted the study of being. Computer science ontologies, usually in the plural, are different from the philosophical concept of ontology. Computer science ontologies refer to shared conceptualizations expressed in formal languages (Gruber 2009) and have been a topic of study for some thirty years, initially connected to the artificial intelligence community. They have not been of much importance in digital humanities until the last ten to fifteen years, but are now gaining momentum as the semantic web develops.

2 In this paper I will discuss ontologies in the context of the Text Encoding Initiative and based on the computer science tradition.1 However, although computer science ontologies are different from philosophical ontology, the two are not totally disconnected (Zúñiga 2001), and therefore some remarks will be made on relations with philosophy as well. The focus will be on how meaning can be established in computer-based modeling. Three broad areas will be described. One is the establishment of meaning through working with and interpreting documents seen as sources for scholarly work, be they primary or secondary. Another area is the establishment of meaning through the development of models, including ontologies. A third area of meaning production with particular reference to TEI happens through linking between TEI documents and external ontologies. How such linking can be done and what it may imply for the meaning and usability of TEI documents is an important focus of this article.

3 TEI represents a shared conceptualization of what exists in the domains relevant to text encoding. It can be expressed in formal models, but it is questionable whether TEI can be seen as an ontology in the computer science sense. According to the classification by Guarino, Oberle, and Staab (2009, 12–13), XML schemas are typically not expressive enough for the formality we need for ontologies. However, the level of language formality forms a continuum, and it is difficult to draw a strict line where the formal starts. So for instance, some parts of the TEI scheme, such as the systems used to encode dictionaries, bibliographies, and representations of persons, places, and events, may be closer to an ontology than other, less formalized parts of the standard (Ore and Eide 2009).

2. P4 to P52

4 In TEI P4, as in previous versions of TEI, names of places and people, as well as other strings referring to places, people, and organizations, could be encoded using elements such as <name> and <rs>. These elements were used to encode names and other referring strings as textual features. P4 included no construct3 to encode information about real world or fictitious entities that the strings in the text referred to:

It should be noted however that no provision is made by the present tag set for the representation of the abstract structures, or virtual objects to which names or dates may be said to refer. In simple terms, where the core tag set allows one to represent a name, this additional tag set allows one to represent a personal name, but neither provides for the direct representation of a person. (TEI Consortium 2001, ch. 20)

5 In P5 this was changed. Real or fictitious entities pointed to by referring strings in the text were now included in the standard so that in addition to the referring strings, which could still be encoded as in P4, the external entities could be represented as well:

This module also provides elements for the representation of information about the person, place, or organization to which a given name is understood to refer and to represent the name itself, independently of its application. In simple terms, where the core module allows one simply to represent that a given piece of text is a name, this module allows one further to represent a personal name, to represent the person being named, and to represent the canonical name being used. A similar range is provided for names of places and organizations. The main intended applications for this module are in biographical, historical, or geographical data systems such as gazetteers and biographical databases, where these are to be integrated with encoded texts. (TEI Consortium 2013, ch. 13)

6 This means that while P4 represented only textual occurrences, P5 goes beyond that limitation. In P5, names can be connected to a representation of the named object, which is typically placed in the header of the TEI document (Wittern 2009). The inclusion of the new elements provided one standard way to create such representations, which was needed by many text encoders. We will see later how this header information represents one way among others with which to link the referring strings as they are encoded in the text to external ontologies. As other methods do exist, the inclusion of the new elements in P5 was not strictly necessary for interoperability, but did facilitate a standard way of creating the interconnections.
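A minimal sketch of this arrangement may make it concrete (identifiers and details are invented for illustration, and the mandatory parts of the header are elided): a referring string in the text points to a <person> in the header that represents the named entity itself.

<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <!-- ... -->
    <profileDesc>
      <particDesc>
        <listPerson>
          <person xml:id="petrarca">
            <persName>Francesco Petrarca</persName>
            <birth when="1304"><placeName>Arezzo</placeName></birth>
          </person>
        </listPerson>
      </particDesc>
    </profileDesc>
  </teiHeader>
  <text>
    <body>
      <p>... <persName ref="#petrarca">Petrarca</persName> ...</p>
    </body>
  </text>
</TEI>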

3. Two Ways of Modeling

7 There are no ontologies without models — an ontology, after all, represents a model of a world or of a certain corner of it. The discussion in the paper will focus on active engagement with models, that is, on how meaning is generated and anchored when ontologies and other models are developed and used. For TEI specifically, the development of the standard was based on ontological assumptions regarding relevant textual sources (e.g., their structure and components) in the philosophical sense. Further development, such as creating the new elements described in the previous section, is based on similar considerations. Using TEI for encoding texts also involves the study of the source material and how it can be related to the standard as a whole, which has some similarities with the development of the standard itself. Here, however, I will focus on the differences between these two types of work.

8 I will distinguish between two different, although overlapping, ways of modeling. First, one may use already existing models for data integration. An example of this is the task of integrating data from several different libraries and archives in order to create a common data warehouse in which the detailed classifications from each of the databases are preserved. In the process, one will want to use a common ontology for the cultural heritage sector, for instance, FRBRoo (Bekiari 2013). One must develop a thorough understanding of the sources, be they TEI encoded texts or other forms, as well as of the target ontology—one will develop new knowledge.

9 The task is intellectually demanding and those engaged in it will learn new things along the way. Still, the formal specification of the corner of the world they are working towards is already defined in the standard. Only in a limited number of cases will there be a need to develop extensions to the model. Once the job is done, inferences made using the ontology-based data warehouse can be used to understand the sources and what they document even better. Yet, despite all the new knowledge acquired, the process is still mostly restricted to the use of what is already there.

10 The second way of working with models is to create an ontology or another formal model through studying a domain of interest. In this case, a group of people will analyze what exists in the domain and how one can establish classes which are related to each other. This may, for instance, be in order to understand works of fiction, as in the development of the OntoMedia ontology, 4 which is used to describe the semantic content of media expressions. It can also be based on long traditions of collection management in analog as well as digital form, as in the development of CIDOC-CRM (Crofts et al. 2011) in the museum community. Although one will often study data from existing information systems as part of the process, the main goals of such studies are not mappings in themselves, but rather to understand and learn from previous modeling exercises in the area of interest.

11 The historical and current development of TEI as a standard can be seen in this context. The domain of TEI has no clear borders, but the focus is on textual material in the humanities and cultural history. In order to develop a model of this specific corner of the world, one has to analyze what exists and how the classes of things are related to each other. This is a process in which domain specialists and people trained in the creation of data models must work together, as the history of TEI demonstrates.5

12 When applying either of the two methods of modeling, knowledge is gained through the process as well as in the study and use of the end products; one can learn from modeling as well as from models, from the process of creating an ontology as well as from the use of already existing ones. Getting to know a description of a modeling standard rarely gives the same level of understanding as actively engaging with the model as a creator or a user. Reading the TEI Guidelines is a good way of getting an overview of the standard, but it is hard to understand it at a deeper level without using it in practical work, and among the best TEI experts are those who have taken part in creating the standard.


13 There is no clear line between the two approaches to modeling, and they often use similar methods in practice. They both have products as the end goal, and new knowledge is created in the process. Some of this new knowledge is expressed in the end products. For example, working to understand better what is important for a concept such as “person” in the domain under study will result in new knowledge. This knowledge will be shared by the parties involved in the modeling exercise and may be expressed in the end product. However, there is a stronger pressure towards expressing such new knowledge clearly when a data standard is created than when a mapping is created. In the latter case the boxes are already made and one has to find the right fit between one’s data and those boxes, typically assuming that the ones who made them worked thoroughly on the matter. This is useful in many cases, but one is less inclined to question what is there and runs the risk of not seeing what is outside the standard (Zafrin 2007, 66). In the case of creating a standard, one has to build the boxes, establishing the entities to be used, the relationships between them, and how to name and describe the entities, the relationships, and the general principles behind the standard. Everything is open to question, and in the end a new world is built.

14 These two methods are presented as distinct categories for the sake of analysis. They are, however, prototypical, and there is no strict line between them. Middle positions can be found when one takes an existing ontology and extends it, or uses terms taken from multiple ontologies in combination. Such border cases notwithstanding, the distinction between using and creating models is clear in most cases.

15 In addition to the distinction between using models and creating models, we also have a distinction between modeling for production and modeling for understanding. Modeling for understanding is here seen as creating models with the sole or main purpose of learning new things through the modeling activity. The model in itself is not a main goal of the work. In modeling for production a version of the modeled entity is the main goal of the work. So even if the use of a model for production (for instance using TEI to encode a dictionary) is a research activity which gives new understanding, the main purpose is still the end product. The same was the case when the dictionary module of the TEI Guidelines itself was created. It gave new understanding, but the main purpose was to create the end product—the dictionary module. A well-known example of modeling for production is the project Henrik Ibsen’s writings, where modeling and TEI encoding were used to create a series of printed volumes.6 An example of the use of modeling for understanding would be to analyze a poem through iterative rounds of TEI encoding, finding and investigating new problems along the way. In this process, one could develop an ontology based on one’s growing understanding of the poem.7 In table 1 these four types of modeling are presented as a table which expresses a model of modeling. I must repeat here that the distinctions in the table are not clear-cut, but rather an analytical tool to explore the complex and many-faceted activity of modeling.

Table 1. A model of modeling.

Why modeling?              | Using a model                               | Creating a model
Modeling for production    | Using TEI to encode and publish a document  | Creating a module in the TEI Guidelines
Modeling for understanding | Using TEI encoding as a research method     | Creating an ontology as part of a research effort

16 A well-established distinction in modeling theory in digital humanities and beyond is the one between “models for” and “models of.” McCarty, following Geertz, distinguishes between “a denotative ‘model of,’ such as a grammar describing the features of a language, and an exemplary ‘model for,’ such as an architectural plan” (McCarty 2005, 24).8 This analytical distinction is not clear cut, and it “also reaches its vanishing point in the convergent purposes of modeling; the model of exists to tell us what we do not know, the model for to give us what we do not yet have. Models realize” (ibid.). Although not clear-cut, the distinction still has some consequences of relevance for table 1 above. Using a model for production and creating a model for understanding both have a tendency towards creating models of, whereas creating a model for production and using a model for understanding both have a tendency towards creating models for.

17 To return to the examples used above: a dictionary is encoded in TEI as a model of a print dictionary. In creating it, one makes a “simple” (but usually labor-intensive) mapping. Through this process one might learn something new through making explicit certain components of the dictionary which were not apparent in the print version. In creating an ontology of a certain interpretive reading of a poem, one also creates a model of, but one of a different kind. The distance between the object being modeled and the model is much larger; one can learn more from the process.

18 Creating a TEI module as a tool for encoding dictionaries is modeling for in that it is not representing a specific document but provides a toolkit one can apply to multiple texts.9 In the case of using a model for understanding, one is aiming at discovering something new, so one is enabling a model for. Both of these are used as models for, but the former is more practically oriented, creating a tool for encoding, while the latter is more discovery oriented, using a model to elicit new knowledge.

19 As the examples show, the tendencies towards a match with the modeling of/modeling for distinction are weak and will depend on context. This may indicate that the distinction between models for and models of does not match the distinctions in table 1 very well, and that they operate at another level. A closer study of their relationship is beyond the scope of this article, but what this unclear relationship does show is the limitations of any classification such as the one proposed in table 1 and for that matter of any model.

20 Another distinction, established recently by Jannidis and Flanders, is the one between altruistic and egoistic modelers. This is a distinction which fits my classification quite well. They describe models created by altruistic modelers as the ones that “serve as an interchange format for some types of users and user communities where data is typically being created and modeled with someone else’s needs in mind” (Jannidis and Flanders 2013, 238), which corresponds nicely to my row labeled “modeling for production.” They describe models created by egoistic modelers as having the function “to express specific research ideas in cases where data is being created to support the creator’s own research needs” (ibid.), which is similar to my row labeled “modeling for understanding.”


21 According to Jannidis and Flanders, these two ways of modeling lead to very different modeling practices. If this is true—and I believe it is—it may indicate that the distinction just discussed is more important than the distinction between creating and using models. This points towards the following hypothesis: while the distinction between making or using a model is significant as to how deeply one engages with the material and how well one learns the model, the distinction between modeling for production and modeling for understanding has more significant consequences for how one engages with the model: the broad scope with integration of different views on the one hand, and the narrow focus of specific theoretical assumptions and research interests on the other.

22 I find it necessary here to be explicit about what was implicit in the previous paragraphs: the model I present in table 1 is tentative and should be developed further. This is an area for future research connected not only to the development and use of TEI, but also to better understanding of modeling as a digital humanities practice and how it relates to similar practices in cultural heritage (Ciula and Eide 2014). This must go hand in hand with theoretical work on modeling in digital humanities, as called for by Jannidis and Flanders (2013, 237).

4. Interconnections

23 When a TEI encoding of a text is established, it is the result of a process which includes modeling, even if the involved parties may not be explicitly aware of it. In this section I will discuss links between TEI and external ontologies used in the cultural heritage sector, using CIDOC-CRM as my main example. This will serve two purposes. First, it will give a better understanding of how the semantics of TEI documents are established and work. Are there differences between the ways one creates links between TEI and external ontologies which have consequences for the way the semantics of elements are established and used, comparable to the differences in types of modeling discussed in the previous section? Second, it is not only individual projects or specific editions created in TEI that have a need for integrating TEI documents with external ontologies. The need to integrate resources is shared by the cultural heritage and humanities community at large, where it has become a necessary addition to modeling. While integration with external standards is useful both in modeling for production and for understanding, such integration takes many different forms, as we will see.


Figure 1. Interconnected cultural heritage, showing three examples of digital cultural heritage information: 3D models of artifacts, ontologies, and TEI-encoded texts. The picture to the left is taken from the Arrigo showcase (Havemann 2009).

24 Whether the dream of a semantic web for cultural heritage, as expressed in a simplified way in figure 1, is ever to be established will depend on better understanding of how the integration between TEI and other standards such as CIDOC-CRM works. In this larger picture, the current paper is located as the encircled arrow shown in figure 1: in the link between texts and ontologies. Studying ways of linking them together can create practical basic understanding on which to build a theoretical framework. Such links are created when we use models, but the mechanisms for linking must be established when the models are created in the first place. This may be obvious in the creation of standards such as TEI or CIDOC-CRM, but it is also needed in ad hoc models created in research projects. In the latter case the need may not be equally obvious, but the necessity of providing links from a research argument to the sources on which it is based is crucial in the development of more open and reproducible research. Such linking is dependent on semantically well-defined points of entry in TEI for external systems to point to, and similarly on points of exit to be used to link TEI to other resources.10

25 How does this look from the other side? An ontology may be used to make statements of contradictory facts at different levels. This will often be related to different interpretations of source material. One example is various paper-based sources for classifications of objects in a museum. What in 1880 was seen as a pair of Old Norse skis could in 1980 be seen as a pair of Sami skis. The two classifications can be recorded in different protocols or index cards and may be based on different interpretations of the objects as well as of written sources about those objects. Although one of them will be chosen as the official classification presented by the museum, both should still be recorded in the museum information system in order to track the intellectual history of the field of study.

26 While an ontology is a model of a domain, an individual mapping to an ontology will be based on specific sources. Links from ontologies to their sources are needed in order to ensure scholarly reproducibility. It is not enough that the references exist. It is also important to make the links easily accessible for human readers and computer algorithms alike. Thus, the links must be as precise as possible, point to the source at the highest possible level of detail, and be formalized in a standardized way. This is to a large extent a matter of practical implementation, but an important one.

Figure 2. Integration seen from TEI. EDD RDBS is a local cultural heritage database system at the University of Oslo.

27 In the situation visualized in figure 2, TEI is the formalism in focus, and the rest of the world, meaning other formalisms and database systems, are seen as children that one needs to link to, based on a linking model which is coherent and usable from the TEI side. In short, it shows how the world of information integration may appear from the perspective of TEI. We will now examine how interconnections can be made in such an interlinked system. Where is the link to the external model to be found in the markup of the text? How is it formalized? It will be shown how, still from the perspective of TEI, we can provide mechanisms for well-defined links back to the encoded texts. We will also examine to what degree this will work differently for links from and to models for production as opposed to models for understanding.

28 Based on practical experience and a close study of chapter 13 of TEI P5, I have established a number of interconnection strategies for elements such as referring strings and names. These are all similar in the sense that they are using references to external targets, in line with normal linked data methods. The links back to the TEI documents are provided through @xml:id attributes which, together with stable identifiers for the individual documents, also work as linked data targets. Any RDF triplet can be encoded in TEI using the <relation> element. Thus, the RDF namespace is not strictly needed to make such statements.
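For instance, a statement roughly equivalent to an RDF triple might be expressed as follows (a sketch only; the identifiers and the property URI are invented for illustration):

<listRelation>
  <relation active="#petrarca"
            ref="http://example.org/ontology/bornIn"
            passive="#arezzo"/>
</listRelation>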


Figure 3. Method A: Elements in the body of the TEI document point to an external ontology. Reverse: Links from the external ontology to elements in the body of the TEI document.

Figure 4. Method B: Elements in the body of the TEI document point to elements expressed in another namespace in the header; links from the header elements point to an external ontology. Reverse: Links from the external ontology to elements in another namespace in the header of the TEI document.


Figure 5. Method C: Elements in the body of the TEI document point to TEI elements in the header; links from the header elements point to an external ontology. Reverse: links from the external ontology to TEI elements in the header of the TEI document.

Table 2. The six ways to link between TEI and external ontologies.

                             | Links from TEI                                                                                                                                               | Links to TEI
TEI in body (method A)       | Elements in the body of the TEI document point to an external ontology.                                                                                     | Links from external ontology to elements in the body of the TEI document.
Non-TEI in header (method B) | Elements in the body of the TEI document point to elements in another namespace in the header. Links from the header elements point to an external ontology. | Links from external ontology point to elements in another namespace in the header of the TEI document.
TEI in header (method C)     | Elements in the body of the TEI document point to TEI elements in the header. Links from the header elements point to an external ontology.                  | Links from ontology point to TEI elements in the header of the TEI document.

29 As figures 3, 4, and 5 indicate,11 there are six different ways identified for links between TEI and external ontologies, shown together in table 2; three types of links from TEI to external formalisms and three types of return links.12 They are all based on a common methodology and on the same basic linked-data concepts. However, the differences we have identified have certain implications for how the links are used as part of modeling and how integration works in practice. The natural center of gravity in the linked system will reside in different places. These differences have significant theoretical consequences.

30 Method C uses TEI elements. Thus, there is one and only one element available to represent a person, namely the <person> element. The same goes for other elements such as <place> and <org>. This means that the use of this method ensures that anyone aiming to harvest or link to such elements in TEI headers13 will have one set of elements with well-defined semantics to connect to.
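Schematically (a sketch only; the external URI is invented), a referring string in the body points to the header, and the header element in turn carries the outbound link to the external ontology:

<!-- in the body -->
<persName ref="#petrarca">Francesco Petrarca</persName>

<!-- in the header -->
<person xml:id="petrarca">
  <persName>Francesco Petrarca</persName>
  <idno type="URI">http://example.org/resource/Petrarca</idno>
</person>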

31 The same can to a certain degree be said of method A: as no information in the TEI header is used to represent the entities that the referring strings refer to, the linking from the external system must be done directly to the <rs> element and to the various elements used for names in the body of the TEI document. Thus, the elements are standardized and have well-defined semantics. The problem with this method, compared to method C, is the potentially high number of link targets in the TEI document because links must be made to every occurrence of a reference in the text, rather than to every external entity referred to by one or more referring strings. Links must be established to each particular reference rather than to each particular referred object.

32 Method B is different from the other two in that it just states that the representation in the TEI header is expressed in an external namespace. This can be any namespace in which entities such as persons, places, or events are modeled. A number of potential namespaces and different elements exist, each with potentially different meanings. Examples include the CIDOC-CRM as it is expressed in the Erlangen ecrm namespace, 14 FOAF,15 and TEI. Semantic interoperability must therefore be ensured on a case-by-case basis. For instance, CIDOC-CRM includes only real-world persons in the definition of a person, whereas other models, such as TEI and FOAF, include fictitious and mythical persons. In FOAF all persons are agents, which is a restriction not found in TEI. Thus, all three have something they call “person” but the extensions16 are not identical.

33 The semantic openness of the latter method puts different demands on models for production than on models for understanding. In a model for production, the openness should as far as possible be restricted through the establishment of well-defined comparisons between the meanings, at least stating the differences. In a model for understanding, on the other hand, this semantic openness may be used as an important part of the modeling research, making clearer to the researcher any differences detected in attempts to formalize the source material.

34 This semantic openness is complex, however. When a group of TEI documents is encoded, method C will ensure the same semantic coverage of the elements for all documents. The openness is in the link to the external ontology. In method A, no semantic commitment is made in the TEI documents. Method B injects the semantic openness into the TEI header, so to speak. This may be desirable in certain kinds of modeling for understanding. It is a question of where to establish the semantic openness: in the TEI header (B), in the links between TEI and other formalisms (C), or in the other formalisms (A). In modeling for production the choice should be informed by the question of where in the system at large semantic mapping makes the most sense. In modeling for understanding it should be informed by the choice of the core area of the experimental modeling, if such a core exists. Which solution is the most elegant will depend on the aims of the encoder—simply speaking, on whether TEI or another formalism is in focus. The world will indeed look different depending on which formalism is in focus.


Figure 6. Integration seen from CIDOC-CRM. DC is Dublin Core, EDD RDBS is a local cultural heritage database system at the University of Oslo.

35 How would this look if we saw it from the perspective of CIDOC-CRM instead, as in figure 6? A TEI document, or a fragment thereof, can be modeled as a CIDOC-CRM conceptual object. This can be used to explicitly document elements of the CIDOC-CRM model, as we see in figure 7. In this example, an event is documented in CIDOC-CRM and the link is made directly to a TEI element without using any of the TEI semantic elements.

Figure 7. A CIDOC-CRM model with a link to a specific element in a TEI document as an E84 Information Carrier entity.

36 Another solution would be to use a referring string typed as representing an event: <rs type="event">. But seen from an external formalism, this does not really matter. In many cases it is even good to be able to link to a document without changing its markup, so if references to events are not marked up in the text, linking to another element, even if it lacks the relevant semantic baggage, may be a good choice. Using mechanisms such as XPointer, one can even link to details within the text of an encoded element.

37 Another significant point is that seen from the perspective of an external formalism, linking to a representation of the event itself in the TEI header (an <event> element) may not be a good solution, because the relevant textual documentation may be one specific string, not any string in the TEI document referring to the event. The link goes to a particular use of a referring string (an <rs> in the body of the TEI document) rather than to a representation of the entity referred to, and it does so on purpose. This may, for example, be necessary in order to identify an occurrence of a name in the dedication of a poem or in the list of witnesses at the end of a legal act.

38 So in this interconnected world everyone can be at the center of their own universe, and where you stand with respect to which linking method is best depends on where you sit; that is, which linking method you prefer depends on the specificities of your task at hand and your relationship to other projects or standards. I see two important ways to cope with this potentially complex situation from the perspective of the TEI community. One is to establish canonical examples of linking out from and into TEI documents. The use of the element as a tool for specifying semantic similarity is an important part of this, making it easier to establish common structures for linking between TEI and other formalisms, leading to further interoperability. The other is to make sure that TEI documents have sockets for linking to. While XPointer can be used for advanced references to any TEI document, an easier and more robust method would be to have @xml:id attributes available on all elements of published TEI documents, along with stable identifiers for the documents themselves. This gives well-defined linking targets, in line with best practice in museum documentation.17 These practices may be more crucial in modeling for production than in modeling for understanding. However, keeping the semantic room open, as is often a goal in modeling for understanding, should not lead to ill-defined linking mechanisms.

5. How Wide Do You Want Your Straitjacket?

39 Standards for digital humanities and cultural heritage information will continue to be developed, and even if we develop common concepts for, e.g., people and places, those concepts will still be developed in the contexts of those separate standards. A person referred to in a work of fiction is significantly different from a person referred to in a museum documentation system, and the only way of coping with such differences is to establish mappings. Mappings do not necessarily have to be done at a document-by-document level, but there are limits to how general this level can be. While each separate TEI document does not have to be mapped to CIDOC-CRM, one general mapping for all TEI documents will not work either. The solution lies somewhere in between.

40 Such mapping can be used to close or at least reduce the semantic openness discussed above. While such openness is generally accepted as inherent in humanities and cultural heritage data, we see a clear difference between modeling for production and modeling for understanding. In modeling for production, one would typically want to clarify the relationships as far as possible, trying to reduce openness in order to make more coherent models. In modeling for understanding, such openness may be an important element of the research: far from being seen as a problem to be solved, it may be central to the experimental modeling. Computer science tends to be solution-oriented in its approach, which is in line with what Mahr (2009) calls the leading question of computer science.18 Thus, modeling for production is closer to traditional computer science, and modeling for understanding closer to traditional research in the humanities.


41 TEI is a system commonly used for text encoding, where documents are directly created in TEI XML. This is, however, not its only possible use. TEI is also used as an export format for data produced in other systems, such as databases, which can be serialized as TEI XML documents. Dictionaries are a typical example. Encoding can be a step on the way towards typesetting the dictionary, but also works well as a solution to semantic problems connected to long-term preservation. Whether TEI was the production format or not, other systems will need access to TEI-encoded data, such as TEI header data for importing into library catalogues. The data may be semantic information from the header, like the types discussed in this article, or even full TEI documents. This is not limited to the present: if TEI is used as a long-term preservation format, then it will be necessary to import TEI-encoded data into other formalisms in the future, possibly including future non-XML versions of TEI. TEI P5 will not be the encoding system of choice in 2067.

42 TEI is a formalism, and like any formalism, it limits what can be said. The scope of TEI is wide, but not limitless. One can claim, and I have even done so quite recently (Eide 2014), that TEI, being based on XML, can become too much of a straitjacket. Sometimes even oral language feels too restrictive for the communication one needs; for example, “I could not speak, I just had to give her a hug.” No formalism works in all situations. Dubin, Senseney, and Jett (2013) point out that this important truth is often hidden in the standards themselves: “As a result of this trade-off one sees two complementary costs in standards adoption: that of relinquishing full control over stipulations to a broader community vs. the complexity necessary for generality of scope and flexibility of application. But the language of information standards doesn’t lay out this complementarity to their audience.”

43 Different formalisms, however, work at their best in particular applications and for specific purposes. The difference between modeling for production and modeling for understanding is such a difference. We saw above that Jannidis and Flanders’s distinction between altruistic and egoistic modeling fits the distinction between modeling for production and modeling for understanding. Their claim that the differences between altruistic and egoistic modeling lead to different modeling practices could be followed by a claim that they will also lead to different linking practices. Such differences are indicated in this article, but a deeper study using real life examples is clearly an important area for further research.

44 The products of any kind of intellectual and creative work, be they written or not, include the potential for future communication. If the work is scholarly, future integration with other resources should also be considered. Even when I sit privately in my study using modeling to better understand the text I am working on, I have an aim of communication, even if it may seem to be in a very distant future. Whether I model for production or for understanding, I need to be able to communicate not only my results, but also the source material behind them. In such a situation, the lack of a formalism can also be a hindrance to communication. So not only can overly strict formalisms prevent communication; overly loose formalisms can have a similar effect. We need our straitjackets to be just tight enough.


BIBLIOGRAPHY

Bekiari, Chryssoula, Martin Doerr, Patrick Le Bœuf, and Pat Riva, eds. 2013. FRBR: Object-oriented Definition and Mapping from FRBRER, FRAD and FRSAD (version 2.0). [Heraklion]: International Working Group on FRBR and CIDOC CRM Harmonisation. Accessed February 22, 2015. http://www.cidoc-crm.org/docs/frbr_oo//frbr_docs/FRBRoo_V2.0_draft_2013May.pdf.

Burnard, Lou. 2013. “The Evolution of the Text Encoding Initiative: From Research Project to Research Infrastructure.” Journal of the Text Encoding Initiative 5. http://jtei.revues.org/811; doi: 10.4000/jtei.811.

Ciula, Arianna, Paul Spence, and José Miguel Vieira. 2008. “Expressing Complex Associations in Medieval Historical Documents: The Henry III Fine Rolls Project.” Literary and Linguistic Computing 23, no. 3: 311–25.

Ciula, Arianna, and Øyvind Eide. 2014. “Reflections on Cultural Heritage and Digital Humanities: Modelling in Practice and Theory.” In DATeCH ’14: Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage (May 19–20, Madrid, Spain), 35–41. New York: ACM. doi: 10.1145/2595188.2595207.

Crofts, Nick, Martin Doerr, Tony Gill, Stephen Stead, and Matthew Stiff, eds. 2011. Definition of the CIDOC Conceptual Reference Model. Version 5.0.4. [Heraklion]: ICOM/CIDOC CRM Special Interest Group. http://www.cidoc-crm.org/docs/cidoc_crm_version_5.0.4.pdf.

Dubin, David, Megan Senseney, and Jacob Jett. 2013. “What It Is vs. How We Shall: Complementary Agendas for Data Models and Architectures.” In Proceedings of Balisage: The Markup Conference 2013. Balisage Series on Markup Technologies 10. doi:10.4242/BalisageVol10.Dubin01.

Eide, Øyvind, and Christian-Emil Ore. 2007. “Mapping from TEI to CIDOC-CRM: Will the New TEI Elements Make Any Difference?” Paper presented at TEI@20: 20 Years of Supporting the Digital Humanities. The 20th Anniversary Text Encoding Initiative Consortium Members’ Meeting, University of Maryland, College Park, November 1–3. http://www.tei-c.org/Vault/MembersMeetings/2007/.

Eide, Øyvind. 2007. “The Perspective of the Text Encoding Initiative.” In Ontology-Driven Interoperability for Cultural Heritage Objects: Working Notes. DELOS–MultiMatch Workshop, Tirrenia, Italy, 15 February 2007, edited by Vittore Casarosa and Carol Peters, 29–31. N.p.: DELOS/MultiMatch. http://www.delos.info/files/pdf/DELOS%20Multimatch%202007/papersdelostirrenia.pdf.

———. 2014. “Sequence, Tree and Graph at the Tip of Your Java Classes.” In “Digital Humanities 2014: Conference Abstracts,” Lausanne. http://dharchive.org/paper/DH2014/Paper-639.xml.

Gruber, Thomas Robert. 2009. “Ontology.” In Encyclopedia of Database Systems, edited by Ling Liu and M. Tamer Özsu, 1963–65. Boston, MA: Springer US.

Guarino, Nicola, Daniel Oberle, and Steffen Staab. 2009. “What Is an Ontology?” In Handbook on Ontologies, edited by Steffen Staab and Rudi Studer, 1–17. Berlin: Springer.

Havemann, Sven, Volker Settgast, René Berndt, Øyvind Eide, and Dieter W. Fellner. 2009. “The Arrigo Showcase Reloaded—Towards a Sustainable Link between 3D and Semantics.” ACM Journal on Computing and Cultural Heritage 2, no. 1. doi:10.1145/1551676.1551680.


Jannidis, Fotis, and Julia Flanders. 2013. “A Concept of Data Modeling for the Humanities.” In Digital Humanities 2013: Conference Abstracts, 237–39. Lincoln: Center for Digital Research in the Humanities. Available at http://dh2013.unl.edu/abstracts/ab-313.html.

Mahr, Bernd. 2009. “Information Science and the Logic of Models.” Software & Systems Modeling 8, no. 3: 365–83. doi:10.1007/s10270-009-0119-2.

McCarty, Willard. 2005. Humanities Computing. Basingstoke: Palgrave Macmillan.

Ore, Christian-Emil, and Øyvind Eide. 2009. “TEI and Cultural Heritage Ontologies: Exchange of Information?” Literary & Linguistic Computing 24, no. 2: 161–72. doi:10.1093/llc/fqp010.

TEI Consortium. 2001. TEI P4: Guidelines for Electronic Text Encoding and Interchange: XML-Compatible Edition, edited by C. M. Sperberg-McQueen and Lou Burnard. N.p.: TEI Consortium. http://www.tei-c.org/release/doc/tei-p4-doc/html/.

———. 2013. TEI P5: Guidelines for Electronic Text Encoding and Interchange. Version 2.5.0. Last updated July 26. N.p.: TEI Consortium. http://www.tei-c.org/Vault/P5/2.5.0/doc/tei-p5-doc/en/html/.

Wittern, Christian, Arianna Ciula, and Conal Tuohy. 2009. “The Making of TEI P5.” Literary and Linguistic Computing 24, no. 3: 281–96. doi:10.1093/llc/fqp017.

Zafrin, Vika. 2007. “RolandHT, a Hypertext and Corpus Study.” PhD diss., Brown University. http://rolandht.org/.

Zúñiga, Gloria L. 2001. “Ontology: Its Transformation from Philosophy to Information Systems.” In FOIS ’01: Proceedings of the International Conference on Formal Ontology in Information Systems—Volume 2001, edited by Nicola Guarino, Barry Smith, and Christopher Welty, 187–97. Ogunquit, Maine: ACM. doi:10.1145/505168.505187.

NOTES

1. This paper springs out of work in the TEI Ontologies SIG since 2004. It is also based on long discussions with colleagues at the Unit for Digital Documentation at the University of Oslo and beyond. Two anonymous reviewers of this article and the editors of the journal contributed significantly to its final form. However, the interpretation and formulation, and responsibility for its errors and omissions, is mine alone.
2. This section is based on Eide and Ore (2007) and Eide (2007).
3. The guidelines pointed to “Simple Analytic Mechanisms” and “Feature Structures” as “[a]ppropriate mechanisms for the encoding of such interpretative gestures” (TEI Consortium 2001, ch. 20) but these mechanisms were never in common use, and P4 did not include any standard way to represent specific external entities such as places or persons, just mechanisms that could be used to model them.
4. K. Faith Lawrence, Michael Jewell, and Mischa Tuffield, “The OntoMedia Model,” accessed January 29, 2014, http://www.contextus.net/ontomedia/model.
5. Burnard (2013). In addition to separate persons with separate skill sets working together we also find several individuals who are acting as both modeling experts and domain experts at the same time.


6. This was not the only output of the project, and other results were closer to modeling for understanding. See the webpage for details, accessed February 22, 2015, http://ibsen.uio.no/OmUtgaven.xhtml.
7. Such model development will happen through iterative cycles, and the connections to hermeneutics are interesting. They are, however, beyond the scope of this paper.
8. This is in line with the distinction between a data model as an interpretation of a domain (a representational agenda) and as a plan of action (a cohortative agenda) (Dubin, Senseney, and Jett 2013).
9. A model can always be seen both as a model of and a model for. In this specific example the TEI module can also be seen as a model of an abstract structure, that is the class of all dictionaries. Model of and model for represent different aspects of modeling and the “choice of perspective on the model will of course determine what will be brought into the foreground” (Mahr 2009, 372).
10. This is not at all a new understanding. Some earlier examples include Ciula, Spence, and Vieira (2008) and Havemann et al. (2009), as well as the harvesting of TEI data into a CIDOC-CRM–based common system in the CLAROS project, http://www.clarosnet.org/ (accessed January 29, 2014).
11. These three figures were first presented in a paper at the seminar The Message of the Old Book in the New Environment, L’Institut Finlandais en France, Paris, March 18–19, 2011, in the context of the Paris Book Fair. The colors of the dots indicate different classes in the ontology.
12. The TEI header and body in these examples do not have to be in the same document, but the situation with a header in a different document would not change the essence of the argument.
13. These elements do not need to be in the TEI header; they can, for instance, be used in authority lists in the body of a separate TEI document. The argument about the semantics remains the same, however.
14. “The Erlangen CRM / OWL,” accessed February 22, 2015, http://erlangen-crm.org.
15. “FOAF Vocabulary Specification 0.99. Namespace Document 14 January 2014 - Paddington Edition,” accessed February 22, 2015, http://xmlns.com/foaf/spec/.
16. The extension of a word is what it refers to, that is a set of things in any real or non-real world that are labelled by that word.
17. See CIDOC’s “Statement on Linked Data identifiers for museum objects,” accessed May 22, 2014, http://network.icom.museum/fileadmin/user_upload/minisites/cidoc/PDF/StatementOnLinkedDataIdentifiersForMuseumObjects.pdf.
18. Note that Mahr (2009) is a translation of a German article using the term Informatik, which in Germany denotes what in English is usually called computer science. In the English version of the article, however, the term was translated to information science.


ABSTRACTS

This paper discusses the relationships between TEI and ontologies from the perspective of computer-based modeling, understood here as a way to establish meaning. The distinctions between creation and use of models as well as between modeling for production and modeling for understanding are presented and compared with other categorizations or models of modeling. One method of establishing meaning in TEI documents is achieved via linking mechanisms between TEI and external ontologies. How such linking can be done and what it may imply for the semantic openness and usability of TEI documents is the practical focus of this article.

INDEX

Keywords: ontologies, conceptual modeling, interchange, CIDOC-CRM, linked data

AUTHOR

ØYVIND EIDE
Øyvind Eide holds a PhD in Digital Humanities from King’s College London (2013) and is one of the two founding conveners of the TEI Ontologies SIG. His research interests are focused on the modeling of cultural heritage information, especially as a tool for critical engagement with the relationships between texts and maps as media of communication. He is a lecturer and research associate at the Chair of Digital Humanities, University of Passau, Germany.


TEI to Express Domain Specific Text and Data Models


Texts and Documents: New Challenges for TEI Interchange and Lessons from the Shelley-Godwin Archive

Trevor Muñoz and Raffaele Viglianti

AUTHOR'S NOTE

This paper was originally presented at the 2013 TEI Members’ Meeting, October 2–5, 2013, in Rome, Italy. We would like to thank the audience members from that original presentation for their engaging and probing questions. We would also like to thank Neil Fraistat, who collaborated on the development of the original presentation. Thanks also to Jennifer Guiliano, who commented on drafts of this piece. Finally, the ideas developed here owe everything to the team of people who have participated in the Shelley-Godwin Archive project. The Shelley-Godwin Archive is the ever-growing, collaborative product of many hands across a variety of institutions and disciplines. Please see http://shelleygodwinarchive.org/about for a full list of our collaborators. We thank them profusely.

1. Background of the Shelley-Godwin Archive

1 The Shelley-Godwin Archive is a project involving the Maryland Institute for Technology in the Humanities (MITH) and the Bodleian, British, Huntington, Houghton, and New York Public Libraries that will eventually contain the works and all known manuscripts of Mary Wollstonecraft, William Godwin, Percy Bysshe Shelley, and Mary Wollstonecraft Shelley. The S-GA project began in 2011 and completed its first phase of work in 2015. In October 2013, the project released a beta version of its online reading environment containing high-resolution images and accompanying TEI-encoded transcriptions of the three surviving manuscript notebooks containing Mary Shelley’s drafts of Frankenstein, or, The Modern Prometheus.

2 The development of the S-GA from its original conception to current plans for second and future phases of work reflects wider shifts in emphasis within the digital humanities: from construction and online presentation of expert-curated digital collections (Palmer 2004) to experiments with more participatory forms of scholarship (Burdick et al. 2012). The central claims of the original grant proposal for S-GA (submitted in 2010)1 reach back to rhetoric from earlier digital humanities and digital library development that centered on the possibilities for technology to enable virtual reunification of dispersed collections. Such motivations were more common to the framing of digital projects ten years older than S-GA—dating from the early- to mid-2000s (Deegan and Tanner 2002; for a fuller discussion see Punzalan 2013). Driven by the original vision of virtual reunification (largely without editorial intervention), the proposal for the project’s first phase centered on imaging of manuscript materials from the partner libraries: “With the digitization of these items, particularly P. B. Shelley’s notebooks, the Archive will make an invaluable contribution to Romantic scholarship—bringing together an entire group of widely scattered rare sources” (New York Public Library 2010).

3 Yet, the original plan for the S-GA also showed hallmarks of a field in transition—from one in which simply getting materials online was a central goal to one increasingly focused on allowing users to interact with online materials in additional ways, a trend loosely captured under the banner of “Web 2.0” (O’Reilly 2005). The catalogue of functionalities proposed for S-GA (all based on a prior collaborative project between several of the partners, The Shakespeare Quartos Archive) speaks to a nebulous and evolving conception of how those outside the project team might interact with this kind of digital scholarship. Users would have, the proposal declared, “capacity to collate texts, the ability to overlay images of the original printed editions, compare images side by side, search full-text, and tag text with user annotations” [sic] (New York Public Library 2010). By the time work commenced in late 2011, the encoding of materials from the archive in TEI had been promoted to a more significant component of the project alongside digitization, while other potential features were deferred.

4 The decision to invest in TEI encoding of the S-GA materials along with a reading environment designed to take advantage of features of these textual data has allowed the project to remain current with developments in the digital humanities and scholarly editing not envisioned in the original proposal. Through work on the text encoding schema and reading environment, the S-GA project has refined its conception of how this type of scholarship can create new knowledge. Rather than a grab-bag of “Web 2.0” functionalities, engagement with text encoding, an interpretive method created by and specific to the digital humanities, forms the backbone of the S-GA project’s work with these important materials.

2. Motivations for the S-GA Encoding Scheme

5 The start of text encoding work on the S-GA coincided with the addition of the new “document-focused” elements to the TEI in the release of P5 version 2.0.1. These additions were the product of recent efforts by a subgroup of the TEI Manuscript Special Interest Group and have resulted in a considerable expansion of “Representation of Primary Sources” (chapter 11 of the TEI Guidelines). In the ontology that the new elements are intended to help express, digital text is encoded by describing the relationship of written traces to their physical carriers (TEI Consortium 2011). This encoding approach switches focus from text as communicative act or linguistic content to text as sign on some physical support; for the purposes of this paper this approach will be referred to as “document-focused” encoding. That is, encoders may formalize their understanding of how the text takes form on a surface; they will, for example, identify zones that group text topographically and the lines of text within such zones. Encoders may moreover track an author’s actions on the page, identify textual revisions, movements, and deletions, and assert a temporal order for such actions. Describing how a text has been inscribed on a surface is an essential preoccupation of scholars who practice genetic textual criticism. The document-focused encoding strategy is closely but not exclusively identified with the interpretive goals of genetic criticism throughout this discussion. Divergences from a strict genetic editing approach will be described below. The document-focused approach contrasts with an ontology of text in which encoders describe, through combinations of elements and attributes, what a certain portion of text “is.” By surrounding characters with a <p> element, for example, an encoder formally asserts that those characters form a paragraph. This approach can quickly become more and more complex, depending on both the text and the encoder’s research agenda. Herein this approach will be referred to as “text-focused” encoding (cf. Rehbein and Gabler 2013).

6 Given that the majority of materials in the Shelley-Godwin Archive consist of autograph manuscripts, the editorial team quickly adopted several of the new elements proposed by the manuscript working group, such as <surface>, <zone>, and <line> (with their greatly restricted content models), into its TEI customization. This document-focused approach has served the project well. It yields an encoding scheme that targets features of greatest interest to the scholarly editors who make up the initial user community (the principal investigator and collaborators). Also, focus on documents permits rigorous description of often complicated sets of additions, deletions, and emendations. In the case of the manuscript notebooks of Frankenstein, the encoding scheme for S-GA identifies each page as a <surface> containing one or more <zone>s.2 Most pages have a main body of writing as well as a wide left margin, where Shelley’s husband Percy wrote annotations and revisions to the developing text. These separate writing areas are encoded as <zone> elements containing writing organized into <line> elements. On top of this basic model, information about authorial hands (who wrote what) is encoded as well as revisions—including deletions, additions, substitutions, and transposed and retraced text. The reading environment created from this encoding is used to publish a semi-diplomatic transcription of the text alongside a facsimile image.
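A compressed sketch of what such an encoding can look like (the identifiers, zone types, hand reference, and transcribed words below are invented for illustration and are not taken from the S-GA data):

<surface xml:id="notebook-a-page-07">
  <!-- main writing area of the page -->
  <zone type="main">
    <line>the creature opened its eyes and</line>
    <line><retrace>l</retrace>ooked upon
      <del rend="strikethrough">me</del> <add hand="#pbs" place="superlinear">its maker</add></line>
  </zone>
  <!-- wide left margin used for annotations and revisions in a second hand -->
  <zone type="left_margin">
    <line><add hand="#pbs">ungainly</add></line>
  </zone>
</surface>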


Figure 1. A screenshot of the primary reading interface of the Shelley-Godwin Archive.

7 As the discussion above suggests, the S-GA encoding scheme borrows concepts and terminology from genetic editing where the motivations align, namely in the representation of the mise-en-page of writing on the manuscripts. However, the S-GA does not seek to produce genetic editions of the works represented in the archive. Digital genetic editions usually emphasize the temporal sequence of authorial revision and make a stronger distinction between what is on the page and what is interpreted from the page. For example, the Digitale Faustedition project—which directly contributed to the expansion of the chapter of the TEI Guidelines on representation of primary sources—made this distinction by using two different encoding models for the same documents. The models correspond to the concepts of “record” and “interpretation” (Befund and Deutung) first introduced in 1971 by Hans Zeller (cited in Brüning, Henzel, and Pravida 2013). The “record” (Befund) consists of information about a primary source, of which the editor can make a detailed diplomatic transcription as the record of what is “found” on the source.3 For example, some text that has been struck through may only be encoded as being struck through (i.e., not deleted). The “interpretation” (Deutung) records an editor’s understanding of a writing act on the page; for example, some struck-through text is interpreted as a deletion. The Digitale Faustedition project sees this interpretation as belonging to a linguistic domain; therefore, it conflates the marking of “deletion” with a more traditional text-focused encoding that includes linguistic and literary structures such as paragraphs and verses.

8 The S-GA, not being a genetic edition, conflates in one encoding model the transcriptional and editorial work more closely related to the document (roughly corresponding to Befund and Deutung). This conflation is not an accidental failure to conform to one or another ontology of text. The task of developing an encoding scheme to match the goals of the S-GA project pushed the editorial team to consciously borrow tools from both interpretive communities. That is, rather than seeking to produce an established edition or publication, the S-GA project has chosen to embrace the ambiguous nature of this type of digital humanities work (Price 2009) in pursuit of the goal of constructing the S-GA as a work site wherein the encoded text, along with its tailor-made reading environment, “operationalizes” certain aspects of editorial theory. The encoding scheme of the S-GA, drawing on a document-focused approach, is a formal model for operationalizing literary knowledge about how texts are constructed from written documents (Moretti 2013). Franco Moretti defines “operationalizing” as “the process whereby concepts are transformed into a series of operations … . Operationalizing means building a bridge from concepts to measurement, and then to the world” (103–4). In Moretti’s case measurement involves quantification of features of literary texts, but measurement need not imply only quantification. In the case of S-GA, the encoding scheme is a formal model by which information about the mise-en-page of the various writing traces helps develop greater knowledge about the literary work Frankenstein.

9 To achieve this goal, the editorial team needed to be able to produce two distinct representations of the S-GA materials so as to provide rigorous, semi-diplomatic transcriptions of the fragile manuscripts for those with an interest in the compositional practices of a significant group of British Romantic authors, and also to make available clear “reading texts” for those who are primarily interested in the final state of each manuscript. Thus, what was needed was not only the powerful new formalization of document-focused encoding but also a mechanism to enable movement back and forth between document-focused and text-focused models of the materials in the Archive. The development of the document-focused encoding scheme has been described above. The work of automating the production of usable “reading texts” encoded in text-focused TEI markup from data that is modeled according to a document-focused approach proved much more challenging. Conversion and interchange between these two models poses challenges for encoding practice and workflow, for data provenance and maintainability, and for reading environment and presentation.

3. Encoding Workflow Challenges

10 The conflict between representing multiple hierarchies of content objects and the affordances of XML is well known, and the TEI Guidelines as well as the professional literature of the text encoding community discuss several possible solutions (TEI Consortium 2011; Renear, Mylonas, and Durand 1996; Roland 2003; Piez 2015). One of these solutions is to designate a primary hierarchy and to represent additional hierarchies with empty milestone elements that can be used by some processing software to construct an alternate representation of the textual object. The approach taken by the S-GA team to produce both document-focused and text-focused TEI data is a version of the milestone-based approach. The document-focused elements form the principal hierarchy while milestone elements are supplied to support automatic conversion to text-focused markup (which will contain familiar text-focused elements such as <p>).

11 This solution places increased burden on document encoders to maintain “correctness,” thus potentially lowering data consistency and quality. For instance, empty element milestones representing the beginning and ending of textual features have no formal linkages as part of the document-focused document tree. Encoders must supply identifiers and pointers to indicate these linkages. Ensuring that these identifiers and pointers pair correctly must be accomplished with some mechanism other than the RELAX NG validation that checks conformance to the rules specified in the TEI schema. In S-GA this is partly addressed by a number of Schematron rules added to our TEI schema customization. These further checks, however, add an additional step within the processing workflow that must be balanced against the need for a simple and efficient encoding workflow. As noted above, managing multiple hierarchies through the use of milestones is not new. The experience of the S-GA team suggests that the new possibilities available through the increased expressiveness of the TEI Guidelines, which include the additional document-focused elements, also increase the scope for projects to produce data that reflect two divergent ontologies, and thus to encounter the difficulties involved in the “workarounds” for multiple hierarchies more frequently.
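A rule of this kind might, for instance, check that a span-closing pointer resolves to an identifier in the same file (a minimal sketch under assumed encoding conventions, not the project's actual Schematron):

<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <sch:ns prefix="tei" uri="http://www.tei-c.org/ns/1.0"/>
  <sch:pattern>
    <!-- Every @spanTo on a milestone must point to an xml:id that exists in the document -->
    <sch:rule context="tei:milestone[@spanTo]">
      <sch:assert test="substring-after(@spanTo, '#') = //@xml:id">
        The spanTo value of this milestone does not match any xml:id in the document.
      </sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>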

4. Maintainability and Provenance Challenges

12 In addition to posing challenges for maintaining workflows with good quality control while producing data, use of the milestone strategy for multiple hierarchies (ontologies) decreases the reusability of the textual data produced. The project relies on an automatic process to convert the document-focused encoding into a text-focused one. This process consists of a set of XSLT transformations authored by Wendell Piez, who served as a consultant to the S-GA project in 2013. These transformations are structured as a pipeline—progressively remodeling the document-focused TEI data into a more familiar text-focused TEI.4 Some of the stages involved in this process include, for example, identifying chapter boundaries that span across multiple <surface>s (which for convenience are maintained in separate files), and then combining the content of these surfaces into a single <div>. While some transformations can be handled heuristically, others require a “hint” for the processor. To support this automated conversion, the S-GA team needed to go beyond purpose-built milestone elements and, in effect, semantically overload the general-purpose <milestone> element using attributes. The value of an attribute on <milestone> indicates which text-focused element is intended to appear in a particular location:5


[Encoding example from the S-GA data: the transcribed line reads “when I was about fourteen years old we were at our house near”, in which a lowercase “w” has been rewritten as “W” and “twelve” replaced by “fourteen”.]
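The general shape of the convention can be sketched as follows (the attribute name and value and the details of the transcriptional markup are assumptions for illustration rather than the project's actual encoding):

<!-- Document-focused transcription carrying a "hint" for the conversion pipeline:
     the milestone's attribute names the text-focused element (here a paragraph)
     that should begin at this point in the generated reading text. -->
<zone type="main">
  <line>
    <milestone unit="tei:p"/>
    <retrace>W</retrace>hen I was about
    <del rend="strikethrough">twelve</del> <add place="superlinear">fourteen</add>
    years old we were at our house near
  </line>
</zone>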

13 This solution is explained in the project’s documentation, and the convention used would be (one hopes) evident after cursory examination of the data. Nonetheless, the desire to make available two models of the text forced the S-GA team to add markup to the project’s canonical document-focused data. This makes the encoding more specific to the S-GA project and less easily consumable by future users with different goals.

14 To avoid the conceptual and technical challenges involved in automating the transformation between text-focused and document-focused representations, the two sets of data could each have been created by hand (rather than automatically generated) and maintained separately. Indeed, this is the approach followed by the Digitale Faustedition project, where a distinction between what the project calls “documentary” and “textual” transcription was considered necessary not only as a reaction to encoding problems, but also as a practical application of theoretical distinctions between documentary record and editorial interpretation (Brüning, Henzel, and Pravida 2013). The Faustedition project team, however, still encountered technical challenges when trying to correlate and align these two transcriptions automatically. Use of collation and natural language processing tools helped with this problem, but eventually more manual intervention was needed.

15 The S-GA team felt that maintaining two data sets representing different aspects of the textual objects would have led to serious data consistency, provenance, and curation problems. As the example of the Faustedition project shows, separate representations must be kept in sync with project-specific workflows developed for this purpose. In the case of S-GA, documentary transcription is the main focus; the greatly increased cost and time involved in also maintaining a textual transcription would have reduced the size of the corpus that could be encoded and thus the amount of materials from the archive that could be made fully available under the initial phase of the project. These exigencies prompted the project’s attempts to automate the generation of text-focused TEI data from the core document-focused data that the project editors were creating.

5. Presentation Challenges

16 The display and presentation of document-focused encoding is another technical challenge introduced by the new TEI elements. Rendering a diplomatic transcription is more easily achievable in a coordinate-based system; the S-GA project, therefore, adopted SharedCanvas, a data model developed by Stanford University and a coalition of partners, which allows editors (and potentially future users) to construct views out of linked data annotations. Such annotations, expressed in the Open Annotation vocabulary, relate images, text, and other resources to an abstract “canvas.” S-GA is developing and deploying a viewer for SharedCanvas that uses HTML5 technologies to display document-focused TEI elements that are mapped as annotations to a SharedCanvas manifest, a Linked Open data graph that ties all the SharedCanvas components together. The encoding scheme does not record position coordinates for every zone and line, but the positions of main zones to be painted on a canvas are automatically inferred, and base HTML display rules govern the rendering of text within these zones.

17 SharedCanvas not only provides S-GA with a framework to publish TEI transcriptions, but also enables the Archive to move further toward a sustainable participatory infrastructure. The SharedCanvas model allows for further layers of annotations to be added dynamically to a manifest; the S-GA already makes use of this for appending search result highlights to the linked data graph and for displaying these to the user. Eventually, the project aims to use the same mechanism to enable user comments and annotations. The engagement of students and other scholars will be driven by the possibility of creating annotations in the Open Annotation format, so that any SharedCanvas viewer will be able to render them. It remains a matter for the future development of the project to understand whether annotations can be added dynamically to the source TEI—especially those pertaining to transcription and editorial statements—or whether these secondary annotations, created after the main encoding of documents is complete, should always be managed separately from the source TEI data.

6. The Need for Interchange Between Document-Focused and Text-Focused Models

18 There is intellectual power and utility in both document-focused and text-focused approaches to creating digital texts; the scholarly community has gained by the increased expressiveness of maintaining two ontologies within the standard governed by the TEI community. There are also intellectual and practical reasons why it is undesirable to maintain data reflecting these two models as separate or severable representations. The experience of the S-GA editorial team lends support to Peter Robinson’s claim that “document, text and work exist in a continuum, and [that] the questions of intention, agency, authority, and meaning exert pressure at every level of reading” (2013, 114). The ability to move along this continuum is an affordance that a digital text should support because it is in the alternation between these two ontological models that editors enact the construction of their particular form of humanistic knowledge.

19 The motility inherent in this model of digital text projects creates significant pressure to address the problem of “interchange” between and among textual data modeled according to different schemes. Syd Bauman has provided a valuable operational definition of interchange in the context of text encoding. Following Bauman’s argument, “interchangeable” data is that for which some human intervention (changing data to suit a new system or modifying a system to process new data) is required but for which this intervention can be accomplished without direct human communication with the data originator—because the data is in some way standard or documented (Bauman 2011). Indeed what the S-GA team has pursued is a strategy for staying document-focused in terms of data creation while preserving the ability to produce and share text-focused encoded data by specifying the appropriate semantics within the more widely used text-focused ontology of digital text and developing data-transformation pipelines that use those semantics as a guide.

20 For Bauman it seems that interchange (“blind interchange” as he delineates it) represents a best compromise between interoperability (full equality of semantics across different text models) and the liberty or expressiveness that motivates scholarly text encoding in the first place. This form of interchange is made possible by “a lot of adherence to standards” and extensive documentation of deviation from those standards (see also Flanders 2009). This argument is deployed against skeptics of the value of markup or advocates of other approaches to curation of digital textual data. The case of the two ontologies of text now co-existing within the TEI Guidelines is somewhat different—both are part of the TEI standard. The need for interchange between data modeled according to these different approaches is real and urgent for a project such as the Shelley-Godwin Archive, which will increasingly depend on the ability to flip back and forth between different representations of the data most relevant to different communities seeking to use the Archive as a site for their own knowledge-making.

21 The debates around whether and how TEI is a usable standard for interchanging scholarly information about texts are by now very old. The dilemma faced by the S-GA in attempting to create specialized document-focused data but also to generate and share text-focused data—all of it “TEI data”—suggests a complication of and a possible extension to Bauman’s conclusions about interchange. Bauman’s argument is still couched in terms of polarity even as it suggests a “common-sense” relaxation of intensity toward the whole question of TEI’s suitability as a standard, a kind of lowering of expectations from interoperability to interchange. Geoffrey C. Bowker and Susan Leigh Star suggest that “standardization has been one of the common solutions” to the problem of “how objects can inhabit multiple contexts at once, and have both local and shared meaning” but that the vocabulary of standardization is insufficient “to characterize the heterogeneity and the processual nature of information ecologies” (Bowker and Star 1999, 293). The more nuanced conception that Bowker and Star develop of the issues and interactions involved in these types of problems could be useful to the TEI community.

22 Following earlier scholars in the domain of social studies of science, Bowker and Star describe a process of balancing “local constraints, received standardized applications, and the re-representation of information” (1999, 292). This could easily be describing the process of developing a TEI project. When these arrangements become “ongoing stable relationship[s] between different social worlds and … shared objects are built across community boundaries,” Bowker and Star refer to the results as “boundary objects” (1999, 292). Thus Bauman’s description of “interchange” around the standard of the TEI above, and Star’s assertion that “boundary objects are a sort of arrangement that allow different groups to work together without consensus” (Star 2010, 602), seem well aligned. In this discussion, the TEI encoding standard is the boundary object: a “set of work arrangements that are at once material and processual … resid[ing] between social worlds (or communities of practice) where [this object] is ill structured” (604). Star observes that multiple groups take advantage of the “interpretive flexibility” of boundary objects and customize them for local purposes. In this sense, the S-GA’s use of the TEI’s formal mechanisms to produce a custom schema and the workflow-driven introduction of project-specific markup and markup conventions (overloading milestones) discussed earlier are hallmarks of work with boundary objects.

23 Yet, according to Star, a less-studied dynamic in the use of boundary objects is the way “groups that are cooperating without consensus tack back-and-forth between [local and shared] forms of the object” (605). This is where the experience of S-GA becomes particularly relevant. By seeking to maintain at the level of the data model the kind of flexibility that Peter Robinson sought to achieve through processing and presentation (Robinson 2009), the encoding scheme that S-GA has developed for itself and the automatic conversion processes that operate on it enact the tacking-back-and-forth that Star describes. The implications for the wider TEI community reside in the linkage Star articulates between boundary objects and the development—and, it would seem to follow, maintenance—of infrastructures and standards.

24 According to Star and her collaborators, infrastructures and standards are a “scal[ing] up” from the back-and-forth use of boundary objects (2010, 605). In this sense, the introduction of a document-focused ontology within the TEI alongside the more common text-focused approach is an opportunity as well as a challenge. Increased commitment to one or the other ontological model of text increases the difficulty that other interpretive communities will face in adapting the digital text to their local meanings and practices. The process of developing the S-GA exposed challenges related to workflow and quality control, maintainability, and presentation. Yet, the concept of boundary objects and their extension into infrastructures and standards provides a framework for articulating the value of constructing a digital text object that spans current boundaries in editorial theory and practice. Digital texts that span the interpretive communities of different schools within textual editing and literary scholarship, by applying pressure to notions of interchange, promote circulation within the system of the standard that contributes to its greater health.


BIBLIOGRAPHY

Bauman, Syd. 2011. “Interchange vs. Interoperability.” In Balisage: The Markup Conference 2011 Proceedings. Montréal, QC. doi:10.4242/BalisageVol7.Bauman01.

Bowker, Geoffrey C., and Susan Leigh Star. 1999. Sorting Things Out: Classification and Its Consequences. Cambridge, MA: MIT Press. http://search.ebscohost.com/login.aspx?direct=true&scope=site&db=nlebk&db=nlabk&AN=13186.

Brüning, Gerrit, Katrin Henzel, and Dietmar Pravida. 2013. “Multiple Encoding in Genetic Editions: The Case of Faust.” Journal of the Text Encoding Initiative 4. http://jtei.revues.org/697. doi: 10.4000/jtei.697.

Burdick, Anne, Johanna Drucker, Peter Lunenfeld, Todd Presner, and Jeffrey Schnapp. 2012. “The Social Life of the Digital Humanities.” In Digital_Humanities, 73–98. Cambridge, MA: MIT Press.

Deegan, Marilyn, and Simon Tanner. 2002. Digital Futures: Strategies for the Information Age. New York: Neal-Schuman Publishers.

Flanders, Julia. 2009. “Dissent and Collaboration.” Paper presented at Digital Humanities 2009, College Park, MD, June 22–25.

Moretti, Franco. 2013. “‘Operationalizing’: Or, the Function of Measurement in Literary Theory.” New Left Review 84: 103–19.

New York Public Library. 2010. “Application to the National Endowment for the Humanities (NEH) Humanities Collections and Reference Resources: Shelley-Godwin Archive.” https://securegrants.neh.gov/publicquery/main.aspx?f=1&gn=PW-50939-11.

O’Reilly, Tim. 2005. “What Is Web 2.0: Design Patterns and Business Models for the Next Generation of Software.” O’Reilly Media. September 5. https://web.archive.org/web/20140204114855/http://oreilly.com/web2/archive/what-is-web-20.html.

Palmer, Carole L. 2004. “Thematic Research Collections.” In A Companion to Digital Humanities, edited by Susan Schreibman, Ray Siemens, and John Unsworth, chap. 24. Oxford: Blackwell. http://www.digitalhumanities.org/companion/.

Pierazzo, Elena. 2011. “A Rationale of Digital Documentary Editions.” Literary and Linguistic Computing 26 (3): 463–77. doi:10.1093/llc/fqr033.

Piez, Wendell. 2015. “TEI in LMNL: Implications for Modeling.” Journal of the Text Encoding Initiative 8. https://jtei.revues.org/1337. doi:10.4000/jtei.1337.

Price, Kenneth M. 2009. “Edition, Project, Database, Archive, Thematic Research Collection: What’s in a Name?” Digital Humanities Quarterly 3 (3). http://www.digitalhumanities.org/dhq/vol/3/3/000053/000053.html.

Punzalan, Ricardo L. 2013. “Virtual Reunification: Bits and Pieces Gathered Together to Represent the Whole.” PhD diss., University of Michigan. http://hdl.handle.net/2027.42/97878.

Rehbein, Malte, and Hans Walter Gabler. 2013. “On Reading Environments for Genetic Editions.” Scholarly and Research Communication 4 (3). http://src-online.ca/index.php/src/article/view/123.

Renear, Allen H., Elli Mylonas, and David Durand. 1996. “Refining Our Notion of What Text Really Is: The Problem of Overlapping Hierarchies.” Final version, January 6. In Research in Humanities Computing: Selected Papers from the ALLC/ACH Conference. Oxford; New York: Clarendon Press; Oxford University Press. http://hdl.handle.net/2142/9407.

Robinson, Peter. 2009. “What Text Really Is Not, and Why Editors Have to Learn to Swim.” Literary and Linguistic Computing 24 (1): 41–52. doi:10.1093/llc/fqn030.

———. 2013. “Towards a Theory of Digital Editions.” Variants 10: 105–27.

Roland, Perry. 2003. “Design Patterns in XML Music Representation.” In ISMIR Conference Proceedings 2003. http://jhir.library.jhu.edu/handle/1774.2/50.

Star, Susan Leigh. 2010. “This Is Not a Boundary Object: Reflections on the Origin of a Concept.” Science, Technology & Human Values 35 (5): 601–17. doi:10.1177/0162243910377624.

TEI Consortium. 2011. TEI P5: Guidelines for Electronic Text Encoding and Interchange. Version 2.0.1. Last updated December 22. N.p.: TEI Consortium. http://www.tei-c.org/Vault/P5/2.0.1/doc/tei-p5-doc/en/html/.

NOTES

1. The authors joined the S-GA Project in 2011 and 2013 respectively.
2. As the TEI Guidelines note, an opening in a codex may be represented within a single <surface> element. However, since the original items in this case have been disbound for conservation treatment, <surface> always refers here to a single recto or verso.
3. Some interpretation is always involved in determining what is important among the facts that can be “found” in the document. If anything, dealing with Befund in a digital context poses new complications, because digital diplomatic transcriptions are less bound by the limits of paper-based reproduction. Elena Pierazzo (2011, 463) has pointed out that “[f]or print the choice of which features to include in [a] transcription is limited largely by the limits of the publishing technology. In contrast, the digital medium has proved to be much more permissive and so editors need new scholarly guidelines to establish ‘where to stop.’” The challenges in establishing clear expectations for a digital diplomatic transcription reveal how the Befund is very much subject to editorial interpretation: editors still need to choose what to record from among what is evident from the source.
4. The automated transformation workflow was originally managed using XProc to compose the various stylesheets but was converted to an Apache Cocoon block for ease of maintenance by project staff at MITH.
5. This solution is loosely analogous to some of the ways custom data attributes are intended to be used under the HTML5 Specification: http://www.w3.org/html/wg/drafts/html/CR/dom.html#embedding-custom-non-visible-data-with-the-data-*-attributes.


ABSTRACTS

The introduction in 2011 of additional “document-focused” (as opposed to “text-focused”) elements represents a significant additional commitment to modeling two distinct ontologies for textual data within the standard governed by the Text Encoding Initiative (TEI) Guidelines. A brief review of projects using the new elements suggests that scholars generally treat the “document-focused” and “text-focused” models as distinct and even severable—the tools of separate interpretive communities within literary studies. This paper will describe challenges encountered by members of the development and editorial teams of the Shelley-Godwin Archive (S-GA) in attempting to produce TEI-encoded data (as well as an accompanying reading environment) that supports both document-focused and text-focused approaches through automated conversion. Based on the experience of the S-GA teams, the increase in expressiveness achieved through the addition of document-focused elements to the TEI standard also raises the stakes for “interchange” between and among data modeled according to these parallel approaches.

INDEX

Keywords: digital archive, diplomatic transcript, schema design, manuscript encoding, linked data, interchange, Romanticism

AUTHORS

TREVOR MUÑOZ
Trevor Muñoz is assistant dean for digital humanities research at the University of Maryland Libraries and an associate director of the Maryland Institute for Technology in the Humanities (MITH). He works on developing digital research projects and services at the intersection of digital humanities centers and libraries. Trevor’s research interests include electronic publishing and the curation and preservation of digital humanities research data.

RAFFAELE VIGLIANTI
Raffaele Viglianti is a research programmer at the Maryland Institute for Technology in the Humanities (MITH). He is involved in front-end development and TEI-based digital editorial projects. Raffaele holds active roles in both the TEI and MEI communities and his research on digital editing spans both TEI [textual] and MEI [musical] practices.


Manus OnLine and the Text Encoding Initiative Schema

Giliola Barbero and Francesca Trasselli

1. Manus OnLine and ICCU Activity

1 The Activity Area for Manuscript Bibliography, Research and Census (Area di attività per la bibliografia, la documentazione e il censimento dei manoscritti) is a branch of the Central Institute of Cataloguing (ICCU) belonging to the Italian Ministry of Heritage and Culture.1 The goal of this office is to carry out a census of, and to catalogue, manuscripts held by Italian institutions: medieval manuscripts, early modern and modern manuscripts, literary and private papers of the nineteenth and twentieth centuries, and original handwritten letters.

2 The main achievements of this office are:
• Manus OnLine, the Italian national catalogue of manuscripts,2 and
• BibMan, the national bibliography of manuscripts.3

3 Manus OnLine (MOL) is a project managing a relational database accessible through a web-based application. When it went online in 2007, it was the first national web-based application devoted to manuscript cataloguing. Italian librarians and researchers can produce manuscript descriptions and publish digital images through MOL using personal computers connected through the Internet to the ICCU main server. Within MOL, manuscript descriptions can be linked to images stored in any digital repository. Moreover, cataloguers share the same authority list of names (Marcuccio 2010; Bagnato, Barbero, and Menna 2009).

4 Since all the manuscript descriptions are stored directly in the same central repository, as soon as a description is marked as “published” by the cataloguer, it becomes immediately accessible and searchable through the OPAC.

5 At the time of writing (January 2014), MOL contains about 141,000 manuscript descriptions from 270 different libraries, and 470 cataloguers are involved in the project. The catalogue will be continuously updated and extended. As the Italian manuscript collections preserve the majority of European manuscripts, MOL contains material of outstanding research importance for all periods and disciplines. Greek manuscripts are catalogued as well, for example those belonging to the Biblioteca Trivulziana of Milan, the Biblioteca Riccardiana in Florence, the Biblioteca Angelica, and the Biblioteca Nazionale Centrale (National Central Library) in Rome. Having been created and developed by the Ministry of Heritage and Culture, MOL is free: cataloguers need only a username and a password to begin their work.

2. MOL and the TEI Schema: the Same Logical Model

6 MOL and the TEI schema share a logical model which is typical of European scientific manuscript cataloguing practices (Petrucci 2001; Rehbein, Sahle, and Schassan 2009; Fischer, Fritze, and Vogeler 2010). Data considered in paleographical and codicological environments, and as a consequence in manuscript cataloguing, are
• the manuscript identifier;
• the physical description;
• history;
• the contents (texts);
• the bibliography; and
• names of people and corporations who have any responsibility for the manuscript curation and for texts, and names of places where manuscripts were written.

7 In European manuscript cataloguing practice, the manuscript identifier—composed of the name of the place and library where the manuscript is preserved and by an official shelfmark—is necessary to define every single manuscript. This identifier is linked to a single physical description when the manuscript consists of one unit, or to multiple physical descriptions when the manuscript is composite. Each area devoted to the physical description is linked to one or more texts, because several texts can be copied in the same manuscript or part of a manuscript. Names can be linked both to the physical description (to represent, e.g., owners or copyists) and to texts (to represent, e.g., authors or translators). This traditional conceptual data model inspired the organization of data storage in the MOL database in which the identifier and the physical descriptions are considered as two separate entities while texts and names are considered different types of entities. The relationship between the identifier and the physical descriptions in MOL is one-to-n (one-to-many); the relationship between physical descriptions and texts is also one-to-n. The relationship between names (which populate the MOL authority file) and physical descriptions and between names and texts is n-to-n.

8 The TEI schema has the same logical structure as MOL. In fact, following the TEI P5 Guidelines (TEI Consortium 2013: 10.2 The Manuscript Description Element), within an <msDesc> element
• the manuscript identifier is marked with <msIdentifier>, which is mandatory in any description;
• the physical description is marked with <physDesc>—optional;
• history is marked with <history>—optional;
• the contents (texts) are marked with <msContents>—optional;
• the bibliography is marked with <listBibl> within <additional>—optional; and
• names can be inserted at different levels within several other elements.
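In outline, a description following this mapping has a shape like the following (the settlement, repository, and shelfmark are illustrative, and element content is abbreviated):

<msDesc xmlns="http://www.tei-c.org/ns/1.0">
  <msIdentifier>
    <settlement>Roma</settlement>
    <repository>Biblioteca Angelica</repository>
    <idno>Ang. 123</idno>  <!-- hypothetical shelfmark -->
  </msIdentifier>
  <msContents>
    <msItem>
      <author><!-- linked to the shared authority list of names --></author>
      <title><!-- ... --></title>
    </msItem>
  </msContents>
  <physDesc><!-- material, extent, layout, script, decoration ... --></physDesc>
  <history><!-- origin, provenance, acquisition --></history>
  <additional>
    <listBibl><bibl><!-- ... --></bibl></listBibl>
  </additional>
</msDesc>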


9 Moreover, at deep levels the TEI schema contains an almost complete set of tags devoted to manuscript description and cataloguing (Barbero and Smaldone 2000; Burnard 2001; Milanese 2007). This is why on several occasions and with different aims the ICCU adopted the TEI schema to exchange data between different electronic repositories.

10 In the past, between 2004 and 2007, when Manus was still a standalone application, the ICCU was able to export TEI XML files from local Manus databases, which were not accessible through the web, and import them into the online public catalogue (Barbero 2004). In 2003, Manus was adopted by the European project Rinascimento Virtuale,4 devoted to the digital publication of Greek palimpsests; on that occasion, the TEI schema was adopted to share manuscript descriptions of palimpsests among all the European partners (Magrini, Pasini, and Arduini 2005). Later, starting in 2007, the ICCU planned and performed harvesting between the CERL Portal, a meta search interface developed by the Consortium of European Research Libraries, and MOL: the ICCU exposes MOL contents on the web, providing OAI-PMH records with light TEI metadata inside, and CERL harvests them.5

11 At present, the TEI schema is used in MOL with another very important aim. The ICCU wants to supply data to the institutions which produced them. In fact libraries and research organizations also need to use their own data outside MOL. For example, they need complete data to manage their own online public access catalogues (OPACs) or to populate their local digital libraries.6 Since institutions produce data on their own, they have the right to access all the data they produce; that is why the ICCU needs a suitably rich schema, not only a partial set of metadata. Bibliographical standards are not adequate for structuring manuscript descriptions in their entirety, because they have a quite different logical structure (Daines and Nimer 2012; Barbero 2013).

12 In 2012, an export application was developed to export catalogues of manuscript collections from MOL as TEI XML files; moreover, users can download each description from the OPAC in TEI format:


Figure 1. The OPAC of Manus OnLine with the download button.

3. TEI XML Documents Exported from MOL

13 The aim of this paper is to discuss how the ICCU exports the contents of MOL fields into TEI elements to create valid XML files. An example of a TEI XML file exported from MOL is shown in Appendix 1, while a complete list of MOL fields and the corresponding TEI elements follows in Appendix 2.

14 According to the TEI P5 Guidelines (TEI Consortium 2013: 2. The TEI Header), at the beginning of each XML file a <teiHeader> is automatically produced, where a title (Catalogo di manoscritti), the name of the publisher (ICCU), the date of the creation of the XML file, and the name of the person, institution, or project which is considered responsible for the intellectual content of the file are inserted. In the <body> of the TEI XML document, a <listBibl> element is nested, in which each manuscript description is encoded within an <msDesc> element. For each <msDesc> an @xml:id attribute is created, whose value corresponds to the CNMD,7 the unique code number identifying each description published by the Italian National Census of Manuscripts.
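Schematically, an exported file therefore has roughly the following outline. The header children shown are the standard TEI components of <fileDesc> and serve only to indicate where the pieces of information just mentioned would typically sit; the @xml:id value is a placeholder for a real CNMD number:

<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Catalogo di manoscritti</title>
        <respStmt>
          <resp><!-- responsibility for the intellectual content --></resp>
          <name><!-- person, institution, or project --></name>
        </respStmt>
      </titleStmt>
      <publicationStmt>
        <publisher>ICCU</publisher>
        <date><!-- date of creation of the XML file --></date>
      </publicationStmt>
      <sourceDesc>
        <p><!-- description of the source --></p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <listBibl>
        <msDesc xml:id="CNMD0000000000"><!-- one manuscript description --></msDesc>
      </listBibl>
    </body>
  </text>
</TEI>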

15 The first-child-level elements of <msDesc> defined by the TEI schema (<msIdentifier>, <msContents>, <physDesc>, <history>, and <additional>) appear whenever the MOL database contains the corresponding information. If a manuscript is defined as composite and its parts are described separately in MOL, then the TEI XML document contains as many <msPart> elements as there are codicological sections.

16 Within these elements, many other TEI elements are used to create a deeply structured XML manuscript description. In several cases, during the design of the export application, it was clear which TEI elements had to be used in encoding MOL original information. For example, there is no doubt that the name of the library holding a manuscript must be encoded through a <repository> element within <msIdentifier>; the field containing the history of the manuscript, which can be very long in MOL, can be encoded in a <p> element within the element <history>; the author’s name is inserted in the tag <author> within an <msItem>.

17 Several specific, highly formalized data have also been encoded as TEI attribute values. For example, the international code identifying libraries, composed of two letters representing a city and four numbers, is encoded as the value of the canonical @key attribute within <repository>, because this code can associate each library with the records of the national database Anagrafe delle biblioteche italiane:8

Biblioteca comunale Laudense
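Schematically, and assuming the code sits on <repository> as suggested above, such an encoding takes the following form; the key value is a placeholder, not the library's actual code:

<msIdentifier>
  <!-- @key: two letters for the city plus four digits, as recorded in the Anagrafe delle biblioteche italiane -->
  <repository key="XX0000">Biblioteca comunale Laudense</repository>
</msIdentifier>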

18 The code identifying a single record inside the MOL authority file of names (composed of CNMN and a number) is encoded as the value of the @key attribute within :

Alighieri, Dante
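Assuming, for illustration, that the name in question is recorded as a text's author (both the element choice and the numeric part of the key are assumptions), the encoding has this shape:

<!-- the key combines the CNMN prefix with a record number from the MOL authority file -->
<author key="CNMN0000000">Alighieri, Dante</author>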

19 The global @n attribute has been used in <msPart> and <msItem> to encode the numbers of the codicological units and of the texts. These are numbers which are automatically recorded in MOL to order codicological parts and texts in the same sequence as they appear within the original codex:

... Divina Commedia. Inferno ... ... Divina Commedia. Purgatorio ... ... Divina Commedia. Paradiso ...
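For the Divina Commedia example, the numbering would look roughly as follows (a sketch: the titles are those listed above, the @n values follow from their sequence, and all other content of the <msItem> elements is omitted):

<msItem n="1">
  <title>Divina Commedia. Inferno</title>
</msItem>
<msItem n="2">
  <title>Divina Commedia. Purgatorio</title>
</msItem>
<msItem n="3">
  <title>Divina Commedia. Paradiso</title>
</msItem>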

20 The @type attribute has been used on <title>, and four values have been defined to express whether the title is “attested” in the manuscript by the copyist, “added” by a hand different from the main copyist’s hand, “elaborated” by the cataloguer, or “identified” as a published work:

<title type="elaborato">Divina Commedia. Paradiso
Seguita la terça Comedia di Dante chiamata Paradiso et capitolo primo ...

21 The @type attribute is also used on <incipit> and <explicit> to distinguish which part of the text begins or ends with those words: the dedication letter, the preface, the first poem of an anthology, the main text, and any other significant part of the text. The ICCU also uses the @defective attribute on the elements <incipit> and <explicit> to specify whether the incipit belongs to an acephalous text or whether the text is lacking its end:

Apuleius Plato ad cives suos Ex pluribus paucas vires herbarum Herba nomine plantago a Graecis dicitur
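A schematic illustration of this usage (the structure and the @type values are invented for the sketch and do not reproduce the ICCU's actual value list):

<msItem>
  <!-- incipit of the preface -->
  <incipit type="preface">...</incipit>
  <!-- incipit of the main text, which is acephalous -->
  <incipit type="text" defective="true">...</incipit>
  <!-- explicit of the main text -->
  <explicit type="text">...</explicit>
</msItem>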

22 In some other cases, the XML encoding of information stored in MOL needs to be discussed, because different solutions can be adopted in the export process to create valid TEI documents. Astonishingly, this is the case in the encoding of three types of codicological data which are very common in manuscript descriptions: material, number of folios, and dimensions of the manuscripts.

23 In fact there is no consensus on the values of the @material attribute that can be used in the element <supportDesc>. Even if papyrus, parchment, and paper can absolutely be considered the most common materials, together with mixed-material manuscripts, different solutions have been chosen to express these data through the @material attribute. The TEI P5 Guidelines (Appendix C) suggest "paper", "parch", and "mixed" as values of this attribute, but ultimately the ICCU decided to follow the important European cataloguing projects Manuscriptorium9 and e-Codices10 in using "chart" (paper), "perg" (parchment), and "mixed".

24 In describing the number of folios, cataloguers usually need to encode the quantity of guard leaves bound at the beginning of the volume, of folios which constitute the volume proper, and of guard leaves at the end of the volume. These are three simple numbers, but the TEI P5 Guidelines (10.7.1.2 Extent) propose different solutions, and different projects have adopted different encoding methods, which, of course, will not help them in sharing data.

25 The project Manuscriptorium, which describes itself as a “European digital library of written cultural heritage,” uses Arabic numerals directly within the <extent> element; if the codex has guard leaves, their number is distinguished through a + (plus sign): 2+30+2. As all three numerals (separated by plus signs) are expressed only if a manuscript has guard leaves at both the beginning and the end, and manuscripts sometimes have guard leaves only at the beginning or only at the end, an automatic procedure cannot determine the semantic value of the numerals expressed.


26 e-Codices, the virtual manuscript library of Switzerland, encodes the quantity of folios as non-structured information within a element contained in the element ; at the same time a structured version of the same information is given as the value of the @n attribute: 114 Bll. + 2 Vor- und 1 Nachsatzbl. As @n is a global attribute which should number the element, not express a number contained within the element, this solution is probably not to be recommended.

27 As the ICCU needs to preserve the structure of the MOL data model as far as possible, and also the semantic distinction between initial and final guard leaves, the following solution has been adopted in the export process:

2 30 2
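One shape consistent with this solution, with three separately labelled counts inside <extent>, is sketched below; the @type values are illustrative assumptions rather than the exact markup of the MOL export:

<extent>
  <measure type="guardLeavesBefore">2</measure>
  <measure type="leaves">30</measure>
  <measure type="guardLeavesAfter">2</measure>
</extent>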

28 As with material and number of folios, the size of leaves can also be marked with different elements following the TEI P5 Guidelines (10.7.1.2 Extent): with the elements , , and , or with the element . Manuscriptorium adopts the first solution:

... 21 cm15 cm

while e-Codices uses an empty element and, again, an @n attribute: . As in MOL, several measurements can be recorded and each of them must be related to a folio number from which size was assessed; in the export application the following solution has been adopted:

234 x 123 c. 1
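Again as a sketch only (the wrapping elements are assumptions; the point is that each measurement is kept together with the folio from which it was taken):

<extent>
  <dimensions type="leaf">
    <height>234</height>
    <width>123</width>
  </dimensions>
  <locus>c. 1</locus>
</extent>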

29 Other problems to be discussed are due to MOL contents and data structure. First, each cataloguing area, for example those devoted to material, date, number of folios, and many others, contains a generic field called “Note,” where cataloguers can register annotations in prose about that area. The ICCU decided not to encode all the MOL “Note” fields in XML with the same <note> element without introducing any distinction, and tried instead to distinguish the original MOL field represented in each element. For this reason each <note> uses the @n attribute to number the element and the @type attribute to encode the subject of the original “Note” field. For example, annotations about the first area of MOL, devoted to the identification of the manuscript, receive a different @type value from annotations about the second area, devoted to the form of the manuscript.

30 The <note> element with @n and @type attributes is also used to encode those contents which have an appropriate field in MOL but do not have an appropriate tag in the TEI schema. For example, in MOL there is a specific area devoted to the description of palimpsest folios, which is not considered in the TEI P5 Guidelines; in the TEI XML document, this information is encoded through a <note> element with "palimpsest" as the value of @type. A similar solution has been adopted by the ICCU to encode data about many other subjects: description of missing folios, fragments of manuscripts, folios derived from printed books, pricking, ruling,11 ruled frame, lines, columns,12 and ink. However, even if these data can be encoded in a TEI XML document by creating specific combinations of a generic <note> element with specific attribute values, at least for the data that are commonly recorded in manuscript cataloguing practice (for example pricking, ruling, and ink), the ICCU would recommend the introduction in the TEI schema of new specific elements.
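For the palimpsest case, such a note therefore takes the following general shape (the @n value and the content are placeholders):

<note n="1" type="palimpsest"><!-- prose annotation from the corresponding MOL field --></note>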

31 Other codicological data pertaining to manuscript decoration, music notation, and binding, which are analytically recorded in several MOL fields and are expressed through a specialized lexicon, are exported in TEI XML files within a generic element. As with our use of the element, this encoding also uses @n and @type attributes to mark data derived from different MOL fields and to express in this way the specific semantic area of each . For example, a cataloguer can point out the presence of neumes in a manuscript by putting a flag (yes in the database) in the MOL field “Neumatic notation”; such information is exported in the TEI XML document as neumatica. The presence of clasps can be recorded by putting a flag in the MOL field “Clasps,” which causes an element fermagli to be created in the TEI XML document.

4. Letters

32 The Italian standard for manuscripts adopted by MOL also includes original letters. The information about original letters in MOL consists of the following fields related to the physical description:
• type of letter (letter, postcard, picture postcard, business card…),
• presence of envelope,
• presence of typescript part,
• presence of a printed letterhead,
• presence of original signature,
• presence of manuscript notes.

33 Such data should be encoded in a <physDesc>, but at present there are no TEI elements which could be used for this purpose.

34 The information about the text of original letters recorded in MOL consists of
• folios,
• name of sender,
• name of addressee,
• stage of the text (draft, original, copy),
• place,
• date,
• summary.

35 Such data should be encoded through an <msItem> within <msContents>. The number of folios can be marked with ; the sender can be encoded as an and the addressee as a name with a specific ; but at present the ICCU does not know which elements could be used for other fields. Close collaboration among different projects and the TEI Consortium would be essential in order to develop an encoding system for original letters compatible with TEI encoding; for example, a new set of tags to be used in the header of correspondence editions could be studied in collaboration by the TEI Manuscripts SIG13 and the TEI Correspondence SIG.14

36 Discussing the solutions found by the ICCU with the TEI Consortium could help not only to verify the encoding developed by the ICCU, but also to spread the use of this standard to meet the needs of European manuscript cataloguing.

BIBLIOGRAPHY

Bagnato, Gian Paolo, Giliola Barbero, and Massimo Menna. 2009. “ManusOnLine: un’applicazione web per il patrimonio manoscritto.” In Convegno AICA 2009. http://manus.iccu.sbn.it/AICA2009_Menna.pdf.

Barbero, Giliola, and Stefania Smaldone. 2000. “Il linguaggio SGML/XML e la descrizione di manoscritti.” Bollettino AIB 40(2) (June): 159–79.

Barbero, Giliola. 2004. “Archivi elettronici dei manoscritti delle biblioteche d’Italia.” Biblioteche Oggi 22(2) (March): 90–91. Accessed January 1, 2014. http://www.bibliotecheoggi.it/2004/20040209001.pdf.

———. 2013. “Manoscritti e standard.” DigItalia 8(2): 43–65. http://digitalia.sbn.it/article/view/824.


Burnard, Lou, ed. 2001. “Reference Manual for the MASTER Document Type Definition.” Discussion Draft. Revised Jan. 6, 2001. http://www.tei-c.org/About/Archive_new/Master/Reference/oldindex.html.

Daines, J. Gordon, and Cory L. Nimer. 2012. “U. S. Descriptive Standards for Archives, Historical Manuscripts, and Rare Books.” In IFLA World Library and Information Congress. 78th IFLA General Conference and Assembly, 11–17 August 2012, Helsinki, Finland. Accessed January 1, 2014. http://conference.ifla.org/past/2012/212-danies-en.pdf.

Fischer, Franz, Christiane Fritze, and Georg Vogeler, eds. 2010. Kodikologie und Paläographie im digitalen Zeitalter 2 (Codicology and Palaeography in the Digital Age 2). Norderstedt: Books on Demand.

Magrini, Sabina, Cesare Pasini, and Franca Arduini. 2005. “L’Italia e Rinascimento virtuale.” Biblioteche Oggi 23(4) (May): 23–33. http://www.bibliotecheoggi.it/2005/20050402301.pdf.

Marcuccio, Roberto. 2010. “Catalogare e fare ricerca con Manus Online.” Biblioteche Oggi (July–August): 33–49. http://manus.iccu.sbn.it/BibliotecheOggi_Marcuccio2010.pdf.

Milanese, Guido. 2007. “XML e lo studio dei manoscritti. Linee di orientamento e un esempio di applicazione.” In From Manuscript to Digital Text: Problems of Interpretation and Markup. Proceedings of the Colloquium (Bologna, June 12, 2003), edited by Francesco Citti and Tommaso Del Vecchio, 71–91. Rome: Herder.

Petrucci, Armando. 2001. La descrizione del manoscritto: Storia, problemi, modelli. Rome: Carocci.

Rehbein, Malte, Patrick Sahle, and Torsten Schassan, eds. 2009. Kodikologie und Paläographie im digitalen Zeitalter (Codicology and Palaeography in the Digital Age). Norderstedt: Books on Demand.

TEI Consortium. 2013. TEI P5: Guidelines for Electronic Text Encoding and Interchange. Version 2.5.0. Last updated July 26. N.p.: TEI Consortium. http://www.tei-c.org/Vault/P5/2.5.0/doc/tei-p5-doc/en/html/.

APPENDIXES

Appendix 1. A TEI XML File Exported from Manus OnLine

Milano_Archivio_storico_civico_e_Biblioteca_Trivulziana_Trivulziano_Triv_80.xml

Appendix 2. Manus Online / mapping

Mapping.xls

NOTES

1. ICCU’s full name is “Istituto centrale per il catalogo unico delle biblioteche italiane e per le informazioni bibliografiche,” http://www.iccu.sbn.it/opencms/opencms/it/.
2. MOL, http://manus.iccu.sbn.it/.
3. BibMan, http://bibman.iccu.sbn.it/.


4. http://palin.iccu.sbn.it/
5. http://www.cerl.org/resources/cerl_portal
6. http://www.manoscrittilombardia.it/ is an example of an OPAC edited by Regione Lombardia, whose contents derive from MOL.
7. Censimento Nazionale dei Manoscritti—Descrizione.
8. http://anagrafe.iccu.sbn.it/opencms/opencms/
9. http://www.manuscriptorium.com/
10. http://www.e-codices.unifr.ch/
11. According to the TEI schema, pricking and ruling can be encoded together with other information. As the MOL project needs to encode pricking, ruling, lines, columns, and justification separately, specific notes were inserted.
12. According to the TEI schema, lines and columns can be described through numbers (data.count) using the attributes @columns, @ruledLines, and @writtenLines of <layout>; the MOL project cannot use such markup because these data can also be represented in MOL through prose, not only through numbers.
13. http://www.tei-c.org/Activities/SIG/Manuscript/index.xml
14. http://www.tei-c.org/Activities/SIG/Correspondence/

ABSTRACTS

The Italian Central Institute of Cataloguing (ICCU) has developed a web application to export manuscript descriptions from Manus OnLine (MOL), the Italian national catalogue of manuscripts, into TEI XML documents. As of June 2013, single descriptions can be downloaded from the MOL Online Public Access Catalogue (OPAC) as TEI XML documents by any user. The aim of the first tool is to supply manuscript descriptions to the institutions which produced them using MOL, because libraries and research organizations need to use their own data outside the MOL project. The aim of the second service – exporting manuscript catalogues of entire collections as TEI XML files – is to promote exchange of data with other national and international projects. For these purposes, the ICCU needs a rich schema, not only a partial set of metadata, in order to be able to export complete structured manuscript descriptions. As bibliographical standards (MARC, Unimarc) cannot represent the entire structure of MOL descriptions adequately, the ICCU adopted the TEI schema as an interchange standard. This paper describes solutions adopted by the ICCU in encoding MOL contents within the TEI <msDesc> element and the problems encountered in performing this work. The encoding of information about material, number of folios, and physical dimensions of manuscripts is discussed, as well as a list of data pertaining to original letter cataloguing.


INDEX

Keywords: manuscripts, original handwritten letters, manuscript cataloguing, bibliographical standards, interchange, letter cataloguing

AUTHORS

GILIOLA BARBERO Giliola Barbero, Ph.D. in Italian literature (humanistic philology), has been a lecturer in Library Sciences and IT for cultural heritage at the Università Cattolica, Milan, since 2001, and is the technical coordinator of Manus OnLine.

FRANCESCA TRASSELLI Francesca Trasselli is head of the Activity Area for Manuscript Bibliography, Research and Census of the Central Institute of Cataloguing and a specialist in an archival approach to manuscripts. She is the author of Manoscritti della Biblioteca Sessoriana di Roma: segnature, inventari, cataloghi (Casamari: Edizione Casamari, 2011).


The DTA “Base Format”: A TEI Subset for the Compilation of a Large Reference Corpus of Printed Text from Multiple Sources

Susanne Haaf, Alexander Geyken and Frank Wiegand

1. Introduction

1 Until recently the creation of large historical reference corpora was, from the point of view of its encoding, a rather project-specific activity. Although reference corpora were built from texts of various origins, the texts had to be converted into a tailor-made format. For example, corpora like the well-known British National Corpus1 and the DWDS core corpus (Geyken 2007) are both annotated on the basis of the Guidelines of the Text Encoding Initiative (most recent release: P5; see TEI Consortium 2014). However, the encoding in these cases was typically carried out specifically for the creation of these corpora—that is, it was a unidirectional process. The interchange, and more importantly the interoperability, of corpora with other corpora played only a minor role.

2 In recent years the picture has dramatically changed. With the availability of more and more digitized texts in TEI P5 format and the advent of many academic and non- academic corpus-building projects, the task of creating reference corpora has shifted from a project-specific task to a more general task, requiring joint efforts by many stakeholders. In this new situation, individual projects creating corpora must ensure that their resulting annotation scheme is valid against a TEI P5 schema, and also enables easy reuse of both their metadata and text data in other project contexts. In other words, corpus projects are required to provide interoperable TEI data, so that the resulting corpora compiled from different sources can be made exploitable by common methods and tools. Interoperability issues affect different aspects, from metadata exchange through the extraction and analysis of document components up to (at least

for historical texts) the creation of a uniform stylesheet in order to present all corpus texts in a similar way.2

3 A main prerequisite for interoperability between corpora is the homogeneity of text structure. Thus, because of the intentionally high flexibility of the TEI tag set,3 it is no longer sufficient to base the annotation on the TEI Guidelines in their entirety. Rather, the TEI tag set has to be narrowed down to subsets that are both extensive, considering the structural phenomena they must document, and unambiguous about how similar phenomena may be encoded. Given these requirements, the desirability of creating some common agreed-upon TEI formats for certain editing purposes—and sharing these formats with the community in order to achieve homogeneously tagged TEI texts across project borders—is evident and has already been attempted several times (see, e.g., Pytlik Zillig 2009; Unsworth 2011).

4 Interoperability problems with TEI-encoded documents can occur on several levels: the exchange of metadata, the consistent extraction of document components, and the creation of a uniform stylesheet. It is well known that the flexibility of the TEI may lead to significant structural variation in TEI-conformant headers. This forces computational methods to deal with an enormous number of different cases for semantically consistent information extraction. Within the transcription itself, similar problems may occur with the extraction of structural phenomena in a text collection. For example, letters or quotations can be annotated differently across different document collections. Therefore it can be very difficult or even impossible to formulate a query that retrieves “all letters” across all document collections without knowing all the solutions adopted by all the individual collections. For complex queries the problem becomes even harder. Obviously, a standardized encoding across collections would be very helpful in solving this problem.

5 Last but not least, structural variation poses difficulties, which should not be underestimated, for the creation of a satisfactory rendering stylesheet that maps similar structural phenomena to identical styles. As a consequence, at present, document collections encoded in TEI can be exchanged only by accepting the loss of interoperability on one or several of the above-mentioned levels.

6 Existing TEI P5 subsets either were provided by the TEI Consortium, such as TEI Lite (Burnard and Sperberg-McQueen 2012) and TEI Tite (Trolard 2011), or originate from text digitization and curation projects, for example the Best Practices for TEI in Libraries (TEI SIG on Libraries 2011), TEI Analytics (Unsworth 2011; Pytlik Zillig 2009), I5 (Lüngen and Sperberg-McQueen 2012), or TextGrid’s Baseline Encoding for Text Data in TEI P5 (TextGrid 2007–2009).

7 A shared goal of these formats is to allow for the encoding of basic structural properties such as the markup of chapters and paragraphs. These formats agree on many of the most common structural features (e.g., a paragraph is marked up using the element <p>). However, each of these formats is designed for particular audiences and goals, and hence deals with the range of TEI possibilities for the markup of similar structures according to its own needs.

8 In this article we describe the DTA “Base Format” (DTABf), a strict subset of the TEI P5 tag set. Its purpose is to provide a balance between expressiveness and precision as well as an interoperable annotation scheme for a large variety of text types in historical corpora of printed text from multiple sources. The DTABf shares properties and tagging solutions with the TEI P5 subsets mentioned above, but differs from those formats in

several aspects which we described comprehensively in Geyken, Haaf, and Wiegand 2012. For instance, unlike TEI Lite and “Best Practices for TEI in Libraries,” the DTABf uses controlled vocabularies for the specification of attribute values, and unlike TEI Tite, the DTABf forms a strict subset of the TEI tag set (i.e., no extensions of or content changes to the TEI P5 tag set are made) and offers a detailed specification for metadata recording within the TEI header.

9 The DTABf was created in a bottom-up process alongside the corpus compilation of the DTA project, thus covering most of the predictable structural phenomena of historical texts. It was applied to 15 text corpora from scholarly digitization projects. In addition, there is a powerful infrastructure implemented by the DTA project in order to process, correct, annotate, and publish texts in DTABf. These principles have helped establish the DTABf as a common, shared TEI format for the annotation of historical German texts.

10 The remainder of this paper starts with a short presentation of the project background, the Deutsches Textarchiv (DTA), which will show that a common base format was required to integrate text collections from 15 external corpus projects (section 2). In section 3, we describe the DTABf in more detail. Section 4 focuses on the dissemination of the DTABf; we explain how comprehensive documentation, continuous training courses, and the development of customized software tools interact to create a user community for the DTABf. In section 5, we present examples of good practice that illustrate how different external corpora can be converted into the DTABf, making such corpora interoperable in a wider context—for example, as part of the text corpora provided by the large European infrastructure project CLARIN.4 Section 6 discusses how new structural phenomena encountered in new texts are handled within the DTABf by adding new properties to the DTABf. We conclude with a short summary and some ideas about future prospects for the DTABf.

2. Project Background

11 The goal of the project Deutsches Textarchiv (DTA)5 is to create a reference corpus of the historical New High German language (1600–1900) which is balanced with regard to date of creation, text type, and thematic scope, and therefore constitutes the basis of a reference corpus for the New High German language. The DTA corpora contain printed historical works from different genres and text types (fictional texts in prose, poems, dramas, scientific texts of numerous different disciplines, and functional literature such as cookbooks, handbooks, sermons, or travel books). The DTA acquires texts in two complementary ways: via digitization of new texts for the DTA core corpus of around 1,500 historical works (approximately 150 million tokens) and via the curation of existing historical text collections, digitized in other project contexts, which are integrated into the DTA infrastructure (currently 120 million tokens from 15 projects).6

12 Even though text digitization for the entire core corpus is carried out by the DTA project itself and therefore can be based on similar guidelines, the DTA core corpus texts are heterogeneous due to structural differences between text types, variation in printing habits over time and in different regions, and peculiarities of different printing and publishing houses. External texts from different sources being integrated into the DTA infrastructure introduce additional heterogeneity in terms of project- specific markup conventions and different primary formats.


13 In order to cope with this heterogeneous text collection, the DTABf was developed as an annotation scheme that allows for collective processing by software tools including metadata harvesting, retrieval of complex document structures, and presentation of the text data in various formats, including HTML, Text, and ePUB. The DTABf is completely based on the TEI P5 tag set, reducing and further constraining the stock tag set, but not extending it in any way. The goal of the DTABf is to provide solutions for the tagging of all structural phenomena occurring in historical printed texts down to a certain annotation depth while remaining consistent and unambiguous to ensure consistent markup for the DTA corpora as a whole. Thus the DTABf is the backbone of the DTA and guarantees that all DTABf texts are interoperable within the DTA context as well as for re-use in other projects. The DTABf is a “living” TEI format. It is carefully adjusted when new texts containing new structural phenomena are integrated into the DTA corpora.

3. Description of the DTA “Base Format” (DTABf)

3.1 History and Scope

14 The DTABf emerged from the TEI format used for the annotation of the DWDS corpus, a balanced corpus for twentieth-century German (Geyken 2007). With the beginning of the DTA project, the DWDS format was adapted to the requirements of the encoding of historical texts. The DTABf was applied continuously to all texts digitized during the first phase of the DTA project (2007–2010, approximately 700 texts dating from 1780 to 1900). In this period it was successively adapted to new phenomena which occurred in the respective texts and which had not previously been covered by the DTABf. With the beginning of DTA phase 2 (2010–2014, approximately 600 texts dating from 1600 to 1780), the DTABf was extensively revised on the basis of the annotated historical data resulting from phase 1: the treatment of structural phenomena was reconsidered and consistent solutions were determined. In the course of these efforts the formal description of the DTABf as given in a corresponding ODD7 was further substantiated and the DTA annotation guidelines were compiled.

15 The revised version of the DTABf has been successfully applied not only to the phase 2 texts of the DTA core corpus, but also to texts originating from external project contexts which were curated by the DTA in the course of the module DTAE (DTA- Erweiterungen, i.e., DTA Extensions) as well as the curation project of WG 1 in CLARIN- D.8 Continuous adjustments of the DTABf remain necessary in order to account for new phenomena, but we take great care to preserve consistency and avoid ambiguity. In addition, since the DTABf is now based upon a large amount of text, we are mostly able to avoid changes which disturb backward compatibility.9

16 As a result, the DTABf forms a TEI customization which is based on a large corpus of historical texts and thus offers solutions for most structural phenomena encountered within historical printed texts.

3.2 Design and Components

17 The DTABf tag set not only offers tagging solutions for text structuring but also provides a specification for the description of metadata in the TEI header. As of May

2014, the DTABf consists of 50 TEI header elements and 75 text elements accompanied by limited element- or class-specific attributes and, where applicable, attribute values. See Appendix 1 and Appendix 2 for more information on the distribution of DTABf elements within the DTA corpus.

3.2.1 Recording of Metadata

18 The DTABf TEI header is designed to cover extensive metadata information.10 First, DTABf-conformant metadata records contain bibliographic information about (1) the digital document as published by the DTA, (2) all instances which preceded the current digital edition together with (3) the persons or organizations responsible for those instances, and (4) all licenses relevant for the digital object at hand. Second, descriptions of the physical text source upon which the current edition is based are required, including information about its constitution as well as its physical location (institution, repository, and shelfmark). Finally, general information is given about the content and design of the document (e.g., language, typeface, document type) and the DTA sub-corpus it belongs to.

3.2.2 Formal and Semantic Structuring of Text

19 The tag set for text encoding contains tagging solutions for formal as well as semantic text structures.11 The former include page breaks, lists, tables, and figures, as well as physical layout information like forme work and different types of highlighting. The latter include chapters or text sections with titles, paragraphs, notes, opening or closing text parts, special text types such as poems, letters, and indices, and inline phenomena such as proper nouns or citations. Furthermore, documented editorial interventions are possible (e.g., the correction of printing errors, the expansion of abbreviations, normalizations, and editorial comments).

3.2.3 Linguistic Tagging

20 Linguistic information—tokenization, lemmatization of historical forms, Part-of-Speech (POS) tagging—is acquired automatically by various tools and applied to the DTA texts via the standoff method.12 We decided not to adopt an inline encoding for linguistic annotations for two reasons. First, the integration of token-based linguistic analyses in the texts leads to an enormous increase in the number of tags, which hinders manual editing of the transcriptions. The second reason is that postprocessing TEI texts―including postprocessing of linguistic annotations―often requires a conversion of the TEI text into a version of the text in which the reading order has been re- established (serialization). We provide such a solution for texts encoded in the DTABf schema (DTA-Tokwrap) and we prefer to provide users with linguistic annotations for our texts by converting the TEI texts into the Text Corpus Format (TCF),13 the standoff format used within the CLARIN project.

3.2.4 Components of the DTABf

21 The DTABf now consists of five components:
1. an ODD file14 specifying constraints on TEI elements, attributes, and attribute values
2. a RelaxNG schema15 generated from the ODD
3. a set of Schematron rules complementing the schema
4. comprehensive documentation16 explaining the treatment of structuring requirements
5. the TEI text instances provided by the DTA17

3.3 Ensuring Consistency: The DTABf ODD, Schema, and Schematron

22 In order to ensure homogeneous annotation over the entire DTA corpus, the tag set has to remain unambiguous; that is, for each phenomenon there should be only one possible method of encoding.

23 Thus, we made use of the possibilities of the ODD source format to restrict annotations down to the attribute value level. First, from all possible TEI modules only a subset needed for our purposes was chosen. Second, from each of the included TEI modules, only a subset of available elements needed to encode the DTA corpus texts was selected. Likewise, attribute classes or single attributes within certain classes were eliminated from the schema if they turned out to be unnecessary for our purposes. And finally, if applicable, each attribute (at the class or element level) was provided with a fixed selection of permitted values. In cases where value lists could not be restricted (e.g., @n on <lg> containing the number of a stanza), we set fixed data types for the respective attribute values wherever possible. There are only a few cases remaining where the restriction of fixed attribute values would not be reasonable (e.g., @quantity on <gap> specifies the amount of text left out in the transcription for whatever reason and thus can be filled with any numeral). With the restrictions provided by the DTABf schema the flexibility of the TEI P5 tag set is reduced in favor of unambiguous though still fully TEI P5 compliant solutions.
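As an illustration of this mechanism, an ODD fragment of the following kind closes the value list of the @place attribute to the four values listed in section 3.5.1 below; this is a sketch of the customization pattern at the class level, not an excerpt from the actual DTABf ODD, which may express the restriction differently (e.g., per element):

<classSpec ident="att.placement" type="atts" mode="change">
  <attList>
    <attDef ident="place" mode="change">
      <!-- closed list: only these values are permitted -->
      <valList type="closed" mode="replace">
        <valItem ident="foot"/>
        <valItem ident="end"/>
        <valItem ident="left"/>
        <valItem ident="right"/>
      </valList>
    </attDef>
  </attList>
</classSpec>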

24 As stated above, the DTABf provides not only a vocabulary for text annotation but also a specification for the TEI header in order to allow for consistent metadata recording. Although those two vocabularies—the <teiHeader> tag set and the <text> tag set—are mutually exclusive within the DTABf to a large extent, the underlying TEI P5 schema allows for quite a number of the elements to be valid in both the <teiHeader> and the <text> areas.

Example 1: Tagging of Notes and Remarks

The <note> element may have different attribute-value pairs depending on where it is used: within the <text> area, notes may be marginal notes (@place="left" or @place="right"), footnotes (@place="foot"), endnotes (@place="end"), or editorial remarks of the person working on the digital edition of the text. Within the <teiHeader> of a document, however, other kinds of notes are relevant, e.g., remarks about responsibilities for certain instances of the digital document, about the digital document as a whole, or about the constitution of its physical source.

Example 2: DTABf elements within


There is quite a significant number of DTABf elements which the DTABf would not allow in the <text> area but which are allowed within <text> according to the TEI Guidelines.

25 With the ODD vocabulary in itself we cannot change these constraints for elements while remaining fully TEI-compliant. Therefore, to solve this problem, until recently we provided two separate schemas: one representing the DTABf in its entirety, the other excluding the DTABf metadata tagset.18 The latter formed an interim schema to facilitate DTABf-compliant text annotation irrespective of the recording and structuring of metadata, whereas the former constitutes the final schema as it is applied to the complete finalized DTABf documents. But the method of maintaining two separate schemas was time-consuming and thus only provisional. It has now been replaced by a set of Schematron rules which are defined on top of the DTABf schema and which specify contextual constraints.19

Example 3: Schematron rules for <teiHeader>-only elements

This Schematron rule, which deals with elements which could be used in the <teiHeader> or <text> area but should, according to the DTABf, be restricted to the <teiHeader>, is:

[E0001] Element "" not allowed anywhere within element "text".
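In outline, such a rule can be expressed as follows; this is a sketch only: <biblFull> stands in for whichever header-only element the actual DTABf rule targets, and the tei prefix is assumed to be bound to the TEI namespace at the schema level:

<sch:pattern xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <sch:rule context="tei:text">
    <!-- report the element anywhere inside the text area -->
    <sch:report test=".//tei:biblFull">
      [E0001] Element "biblFull" not allowed anywhere within element "text".
    </sch:report>
  </sch:rule>
</sch:pattern>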

26 In addition, Schematron is used for further quality assurance, such as for checking the content of certain elements or for specifying constraints on attribute values which cannot be expressed with conventional content models.20

Example 4: Schematron rules for @facs values

For example, the DTABf constraints for the @facs attribute of the <pb> element cannot be expressed within ODD using regular datatypes, but can be described by Schematron rules: the value of @facs is a string starting with "#f", followed by a four-digit number; the @facs value of the first <pb> element in a document should be "#f0001"; the following @facs values should increase successively by 1. The corresponding Schematron rules are:


[E0015] Value of @facs within first "pb" incorrect; expected value: #f0001.

and

[E0014] Value of @facs within "pb" incorrect; @facs-values of "pb"- elements have to increase by 1 continually starting with #f0001.
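Rules implementing these constraints could be sketched as follows; this illustrates the logic rather than reproducing the DTA's own rule set, and it assumes an XPath 2.0 query binding and a tei prefix bound to the TEI namespace:

<sch:pattern xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <sch:rule context="tei:pb">
    <!-- "#f" followed by exactly four digits -->
    <sch:assert test="matches(@facs, '^#f[0-9]{4}$')">
      Value of @facs within "pb" must be "#f" followed by a four-digit number.
    </sch:assert>
    <!-- the n-th pb in the document must carry the value #f000n, starting with #f0001 -->
    <sch:assert test="number(substring(@facs, 3)) = count(preceding::tei:pb) + 1">
      [E0014] Value of @facs within "pb" incorrect; @facs-values of "pb"-elements
      have to increase by 1 continually starting with #f0001.
    </sch:assert>
  </sch:rule>
</sch:pattern>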

27 The more the DTABf grows, the more cautious we have to be with adding material (elements, attributes, or attribute values) to the tag set, considering that certain tagging solutions might be too flexible and/or overlap in their scope. Both cases might lead to misinterpretations and thus to inconsistencies in the application of the DTABf. Whenever changes to the DTABf become necessary, we have to carefully consider the DTABf tag set as a whole in order to ensure its consistency.

3.4 Annotation Levels

28 With the growth of the DTABf it gets increasingly difficult and time-consuming to apply the whole range of possible DTABf annotations to each DTA corpus text. Therefore, there must be a way to communicate the degree of conformity of a given TEI text with the other texts in the DTA corpus. For such scenarios, the TEI Guidelines propose to divide the set of elements used within a certain TEI format into different levels according to the necessity of their usage.21 We introduced several levels of annotation, each containing a set of elements which have to be used consistently where applicable in order to achieve conformity with the respective level. The first three annotation levels are based on one another.

29 Level 1 (required) covers the minimal amount of text structuring which is required to be applied to a text in order to achieve DTABf conformity.

30 Level 2 (recommended) contains additional elements which are not required but still recommended by the DTABf. All DTA core corpus texts are consistently annotated up to level 2; hence all other level 2 texts are entirely interoperable with the DTA core corpus.


31 Level 3 (optional) defines optional elements which are part of the DTABf though not consistently applied to the DTA core corpus.

32 All elements contained in the first three annotation levels, together with their respective attributes and values, constitute the DTABf as a TEI P5 subset. Therefore, interoperability is completely ensured down to the optional level.

33 Level 4 (proscribed) contains TEI P5 elements which are proscribed by the DTABf because they represent text structures for which a different TEI P5 solution was chosen for the DTABf.22

34 We are currently working on a way to document the DTABf level of a text within its TEI header. As a consequence, the annotation level can be used as a criterion for dividing the DTA corpus into subsets of variable annotation depth. The corpus query tool DDC23 allows for element-based queries of the DTA corpus. Future plans include the combination of those conditions so that queries within DTABf-conformant sub-corpora, which are created based on level as well as on the elements represented by them, are possible via the DTA website. Similar search options are to be offered by CLARIN’s Federated Content Search.24

3.5 Transformation of DTABf Texts into Other Formats

3.5.1 Presentation of DTABf Texts

35 The high granularity of the DTABf tag set and its coherent application to the DTA text corpora enable us to create stylesheets for the transformation of the TEI/XML documents into many other formats, including HTML, plain text with basic structural information, or ePUB. These stylesheets in turn are able to deal with the whole range of tagging scenarios the DTABf allows. In fact, the possibility of uniform presentations of all DTABf texts was one specific goal for the design of the DTABf, and is continually considered when making adjustments to it.

36 In most cases semantic annotations can be represented on the presentation level as long as they are unambiguous, consistent, and well-documented. For instance, division titles, list items, speakers, and stage directions in a drama can without any difficulty be presented in such a way that users of the respective reading versions are able to recognize these textual and structural phenomena at a glance (e.g., in our case titles are presented as bold text of larger size; stage directions are printed in italics; list items are outdented; tables are rendered as tables with dividing rules between rows and cells). Also, the specifications of the DTABf down to the attribute value level support the presentation of contents in underspecified elements. For example, the values "foot", "end", "left", and "right" within the @place attribute of the <note> element define the position of a note in the source text. In addition, since the presentation of DTA texts is meant to come as close to the original layout as possible, the DTABf also includes tagging solutions for pure layout information (such as centered text, italics, changes of fonts, and boldface), which additionally support the presentation.

37 However, sometimes the semantics of structures interfere with the presentation, which should ideally approximate the layout of the source text. An example of this semantic interference is found in the tagging of title pages. The TEI definition of the <titlePage> element is quite limited in that it restricts the usage of this element to complete pages.25 There are cases, though, that are not met by this definition where the element would still be reasonable. For example, usually the heading on the first page of a newspaper edition contains bibliographic information about the edition, but does not span an entire page. We therefore considered refraining from using the <titlePage> element in newspapers and instead using other, analogous elements for the tagging of title information in newspapers. This solution would be fully TEI-conformant. However, it would lead to different encodings of semantically similar structures (in both cases we want to encode title information of a document as given in the source text; the layout is only of secondary interest here). Furthermore, the presentation of information on title pages within the DTA corpora is based on the possibility of defining the <titlePage> element as a block element and hence of homogeneously rendering all text occurring within it as centered, etc. So, if we left out <titlePage> in newspapers, we would lose the ability to present title information in this text type (newspapers) similarly to title information in all other text types of the DTA corpus. We therefore decided to use <titlePage> for newspapers as well, but introduced the attribute-value pair @type="heading" to the DTABf to differentiate the area of title information in newspapers from title pages in the narrower sense.
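Schematically, the heading area of a newspaper issue is then captured along the following lines (the child elements shown are ordinary TEI title-page components and serve only as placeholders):

<titlePage type="heading">
  <docTitle>
    <titlePart><!-- newspaper title as printed --></titlePart>
  </docTitle>
  <docDate><!-- date of the issue --></docDate>
</titlePage>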

38 The fixed vocabulary of the DTABf permits several coherent presentations of DTABf texts to users who can set the parameters of their preferred text presentation themselves.26

39 We did not make use of the TEI standard stylesheets for two reasons. The first reason is pragmatic. The TEI stylesheet library27 is a very large project on its own, consisting of complex and deeply structured XSL files which try to cover most of the TEI tag set (even though some elements are completely ignored at least by the TEI to HTML conversion stylesheets). In addition, the standard stylesheets are too generic for an adequate visual representation of historical texts in the DTA context, where we attempt to achieve a presentation as close as possible to the original print sources. We distinguish some elements with regard to their presentation depending on the context of their appearance. As an example, the <p> element describes a paragraph in prose text, but within <sp> (speech in a performance text) the same element denotes a speaker’s utterance (which may not be represented as a common paragraph in the original printing). Our own stylesheet library is complemented by an extensive test suite. The stylesheets are available at https://github.com/haoess/dta-tools.

3.5.2 Conversion into other Common Formats

40 The DTABf TEI header can be converted automatically into other common metadata formats. Currently, the DTA provides a Dublin Core and a Component Metadata Infrastructure (CMDI)28 version of all metadata records which can be harvested via the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH).29 In this way, DTA metadata can be reused by other platforms and Online Public Access Catalogues (OPACs). In addition, the CMDI metadata records for the DTA corpus texts can be interpreted by the Virtual Language Observatory (VLO) of CLARIN 30 and thus become visible together with other language resources within the CLARIN infrastructure.

41 The plain text versions of all DTA documents together with their corresponding linguistic annotations are provided as TCF files. The Text Corpus Format (TCF) 31 is a

standoff format for linguistic annotation and forms the input format for CLARIN’s WebLicht32 infrastructure. Thus, it becomes possible to further analyze DTA texts using linguistic tools provided within WebLicht.

4. Dissemination of the DTABf

4.1 Documentation

42 All decisions concerning the DTABf are described in comprehensive documentation which explains the treatment of common structuring requirements as well as special cases and illustrates each phenomenon with examples from the corpus.33 The DTABf documentation is thematically subdivided into different sections (formal/semantic annotation, metadata annotation, special encoding of certain text types such as journals and newspapers). Not only does it explain formal customizations of the TEI tag set as realized within the ODD and schema as well as the Schematron rule set, but it also specifies transcription guidelines as well as rules which could not be formalized without changing the content models of the TEI tag set and thus going beyond the DTABf schema.

43 The prose documentation is supplemented by tables containing all DTABf elements, attributes, and attribute values for text encoding on the one hand34 and for metadata annotation on the other.35

44 The DTABf documentation provides a description of the work completed so far, but more importantly it serves as a guideline for users working with the DTABf for the TEI-compliant structuring of historical texts.

4.2 Training and Tools for Users

45 The existence of comprehensive documentation is a necessary prerequisite for the usability of the DTABf and thus for its acceptance by a larger user community. In addition, the DTA offers workshops and tutorials where users learn to apply the DTABf in a consistent way.

46 Furthermore, work on text editions according to the DTABf is supported by the DTA oXygen Framework DTAoX, a framework for the Author mode of the oXygen XML Editor. DTAoX provides an ad hoc, WYSIWYG-like visualization of DTABf-compliant annotations in tagged text passages, of the annotation levels they belong to, and of conflicts with the DTABf. Additionally, to support DTABf-compliant text annotation, elements can easily be inserted in the text using the DTAoX toolbar, which groups elements according to semantic categories and annotation levels.36


Figure 1. DTAoX oXygen XML Editor framework, with toolbar for DTABf elements and color scheme for different annotation levels (green: level 1; blue: level 2; purple: level 3).

47 All DTABf-compliant texts added to the DTA corpus are integrated into the quality assurance platform DTAQ37 where their transcriptions and annotations may be proofread.

5. The DTABf at Different Stages of Digitization

5.1 Digitization by Way of Transcription – DTABf-born Texts

48 Homogeneity within a corpus does not only concern its tagging but also crucially depends on the way transcriptions are realized. Therefore, the DTABf is complemented by extensive transcription guidelines which can be referred to and further specified within the DTABf-compliant TEI header.38 These guidelines are intended for transcriptions which are as close to the source text as possible, the guiding idea being that the historical language and writing should be preserved. Normalizations, if at all needed, can often be conducted on the presentation level rather than within the primary transcriptions where they might adulterate the results of research on historical spelling, printing habits, historical abbreviations, and the like. Wherever possible, special characters are transcribed as such, using the corresponding Unicode code points. Generally, editorial interventions are marked, and the conditions of the source text are carefully documented as well.


5.2 Interchange of TEI Documents

49 The history of the DTABf coincides with the history of documents which are exchanged between collaborating projects and the DTA.

50 An example of such a collaboration may be found in the project Johann Friedrich Blumenbach—online.39 This project is working on a digital edition of Blumenbach’s printed works in German and Latin as well as his handwritten texts. All texts are prepared in a TEI P5 format.

51 The DTA integrates the digitized full texts of German monographs and selected journal articles of Blumenbach into its platforms. This process is not unidirectional but involves several processing steps in the course of which the digital documents are exchanged between and enriched by the two projects.

52 First, the texts are transcribed by the Blumenbach project and annotated according to the Blumenbach project’s TEI P5 format. The resulting TEI P5-compliant documents are automatically converted into the DTABf, analyzed by the linguistic tools of the DTA (including tokenization, lemmatization, POS tagging, normalization of historical spellings), and integrated into DTAQ where they can be proofread and corrected. At this stage, the texts are already made available to the interested community; that is, they can be read online, downloaded in different formats (such as HTML or plain text versions both derived from the DTABf-XML or the TCF version containing the corresponding linguistic information), and queried within the DTA.

53 After the completion of the first correction phase within DTAQ, the DTABf-compliant texts are further processed by the DTA’s tools for automatic Named-Entity Recognition (NER). The documents are then manually corrected by the Blumenbach project in light of the NER results, and the corrected texts are again returned to the DTA where they are re-integrated into DTAQ, published on the DTA website, and integrated into the CLARIN-D infrastructure.

54 The TEI P5 formats used by both projects (DTABf and Blumenbach’s TEI P5 format) agree semantically, so documents can be converted automatically between the two formats in the course of the exchanges described above.

55 This workflow is a good example of interoperability and interchange of TEI documents, while it also shows that manual efforts remain necessary to create TEI formats compatible with one another.

6. Life History of the DTABf

56 Although the text corpus upon which the DTABf is built is already very large, new structural phenomena may still be encountered, mainly because of individual printing habits, especially in earlier works. In addition, the structuring of external texts integrated into the DTA corpus may differ from the DTABf formally (if varying TEI solutions were chosen for similar phenomena) or semantically (if tagging not provided by the DTABf was applied). Therefore, the DTABf continually comes under scrutiny. The main challenges are twofold: (1) to decide whether adaptations to the format are unavoidable in order to meet new requirements, and (2) to ensure that any such adaptations do not lead to inconsistencies in the structural markup within the corpus.


6.1 Phenomena within the Scope of the DTABf

6.1.1 Treating New and Known Phenomena with Established Solutions

57 By this point, the DTABf has achieved such comprehensive coverage of historical texts that substantial changes are quite unusual. Novelties do not necessarily lead to changes in the DTABf. Often the DTABf already offers solutions which may also be applied to newly encountered phenomena.

58 For example, the DTABf provides solutions for annotating quotations and corresponding bibliographic references as well as for concatenating discontinuous text passages. The example below shows discontinuous citations where the quotation is presented inline, whereas the bibliographic citation occurs beforehand within a marginal note. Though it was new to us, we were able to handle this scenario without any extensions to the existing DTABf.

Figure 2. Discontinuous Parts of Quotations.

Lucas Bacmeister, Leichpredigt ... Tessen Von Parsow (Rostock, 1614), facsimile 18.40
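As an illustration of the kind of encoding involved (our own sketch, not a transcription of the figure above, and not necessarily the DTABf's documented solution), the bibliographic reference standing in a marginal note can be tied to the inline quotation with standard TEI linking attributes; the element choices, attribute values, and sample text below are assumptions.

  <!-- Sketch under the assumptions stated above: the reference in the marginal note
       is linked to the quotation via corresp; passages that are themselves split into
       discontinuous parts can additionally be chained with the prev and next attributes. -->
  <note place="right">
    <bibl xml:id="b01" corresp="#q01">Psal. 90, v. 1.</bibl>
  </note>
  <p>… wie es heißt:
    <quote xml:id="q01" corresp="#b01">HErr, du biſt unſere Zuflucht</quote> …</p>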

6.1.2 Converting New Solutions into Established Solutions

59 TEI tagging solutions which were applied to external texts might be compliant with the DTABf in their thematic scope, though encoded using different TEI vocabulary. In such cases the original tagging can easily be replaced with the DTABf solution.


Figure 3. Tagging of Text Loss.
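By way of illustration (our own sketch, not a reproduction of the figure above): an external project might record lost text with a prose @extent value, which can be rewritten using the attribute inventory listed for the DTABf in appendix 1 (@reason, @unit, @quantity). The concrete element and values are assumptions chosen for the example.

  <!-- assumed encoding of text loss in an external project: -->
  <gap reason="lost" extent="three characters"/>
  <!-- the same phenomenon rewritten in the DTABf style assumed here: -->
  <gap reason="lost" unit="chars" quantity="3"/>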

6.2 Changes to the DTABf

60 However, there are cases where new requirements cannot be handled within the DTABf and changes to the format become necessary. Possible scenarios are the following: 1. New texts may contain structures which are new to the DTABf, for instance, within new document types such as newspapers or manuscripts.

Example 5: Newspapers

The DTA core corpus consists of texts from various disciplines, text types, and genres to allow for insights into the New High German language as it was used in different contexts and discourses at different points in its history. However, one important text type—which because of its enormous extent has not been included in the DTA corpora—is the newspaper. We are currently extending the DTA corpus by adding historical newspapers in the course of different project partnerships (see Haaf and Schulz 2014). Newspaper texts from external projects are converted into the DTABf. We adjusted the DTABf to allow for the tagging of structures which are significant for, or even limited to, the text type "newspaper". Most importantly, we added division types for text passages which are typical of newspapers, such as political news (@type="jPoliticalNews"), weather reports (@type="jWeatherReports"), financial news (@type="jFinancialNews"), the feuilleton (@type="jFeuilleton"), and articles within those named categories (@type="jArticle"); see the encoding sketch after this list. The specifics of the DTABf for newspapers are covered by separate documentation.41

2. Structuring applied within texts from external projects (e.g., editorial comments) might not yet have an equivalent within the DTABf. Therefore, new components (here, a new attribute-value pair @type="editorial") are introduced.

Figure 4. Editorial Comments.

3. The documentation might lack precise examples, leading to uncertainties about the applied text structuring.


Figure 5. List Items or Paragraphs?

Engelbert Kaempfer, Geschichte und Beschreibung von Japan, ed. Christian Wilhelm von Dohm, vol. 1 (Lemgo, 1777), pp. 69–70.42

4. New TEI P5 releases may introduce changes to tei_all which affect the DTABf.

Figure 6. @unit vs. @type within (Release 2.3.0).
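The following sketch shows how the newspaper-specific division types named in scenario 1 and the editorial note type from scenario 2 might be combined; the nesting and the placeholder content are our assumptions, not an excerpt from the DTABf newspaper documentation.

  <!-- Sketch under the assumptions stated above. -->
  <div type="jFeuilleton">
    <div type="jArticle">
      <head>…</head>
      <p>…</p>
    </div>
  </div>
  <div type="jWeatherReports">
    <div type="jArticle">
      <p>…</p>
    </div>
  </div>
  <!-- an editorial comment taken over from an external project (scenario 2): -->
  <note type="editorial">…</note>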

61 Changes to the DTABf are carried out only if they are consistent with the existing tag set and do not introduce ambiguities to the format. The changes mainly concern attributes or values and only rarely TEI elements or modules.

7. Conclusion and Further Prospects

62 In this paper we described the DTABf as a “living” TEI format for the annotation of historical written texts for the creation of large reference corpora. The DTA corpus base is still growing either through digitization carried out by the DTA team or through the addition of text collections originating from external cooperating projects. Therefore the DTABf is not static but is constantly checked and adjusted to new structural phenomena.

63 Future work will involve the adaptation of the DTABf to manuscripts. Currently, the DTA corpora almost exclusively contain printed works; only a couple of manuscripts have been integrated so far for evaluation purposes.43 However, some important text types usually exist in handwritten form rather than as printed documents (e.g., letters, diaries, and financial records). In order to improve the balance of the DTA corpus, it would therefore be interesting to integrate manuscripts from some of these widespread text types into our collection. Our tests with manuscripts showed that most of the structural phenomena which occur in manuscripts are similar to those in printed texts and hence can already be treated within the DTABf. However, there are some additional characteristics of handwritten texts which might be usefully encoded (e.g., ad hoc additions, deletions, or insertions of the writer, or the change of hands within one document). These additional phenomena will necessitate adaptations to the DTABf. Furthermore, they are likely to affect the presentation of the TEI transcriptions—we might not be able to imitate the manuscript facsimile on the presentation level to the same extent as we do for printed text sources. In addition, some aspects of the metadata needed for manuscripts differ from the metadata needed for print documents. The adaptations to the DTABf and its corresponding tools and services necessary for the integration of manuscripts into the DTA will be performed in a document-based way through the integration of further manuscripts into the DTA corpus.

64 Other adaptations of the DTABf will be necessary in order to integrate texts obtained via Optical Character Recognition (OCR). OCR software only recognizes basic (text) zones, which eventually have to be mapped to semantically meaningful structures. Semi-automatic subsequent structuring of OCR texts is possible, but becomes increasingly complex and error-prone with greater sophistication, detail, and granularity in the target markup. Therefore, based on experiences with automatic post-structuring of OCR texts in the course of the DFG-funded project Die Grenzboten,44 we are planning to create an additional basic structuring level for OCR texts within the DTABf, onto which semi-automatic text structuring can be implemented, and upon which further manual post-structuring can be performed.

65 As the best practice format for the structural annotation of historical printed texts within CLARIN-D,45 the DTABf has been brought in line with the CLARIN infrastructure. In this context, the Berlin-Brandenburg Academy of Sciences and Humanities (BBAW), as one of the CLARIN-D service centers,46 provides tools and services which allow for the integration of DTABf texts into the CLARIN infrastructure where they are preserved and made available for the long term. In addition, routines for the conversion of DTABf texts and metadata into the CLARIN formats TCF and CMDI have been developed, so that re-use of DTABf texts within the CLARIN infrastructure is possible.

BIBLIOGRAPHY

Burnard, Lou, and C. M. Sperberg-McQueen. 2012. “TEI Lite: Encoding for Interchange: An Introduction to the TEI.” Final revised edition for TEI P5, August. Accessed February 7 2014. http://www.tei-c.org/Guidelines/Customization/Lite/.

Geyken, Alexander. 2007. “The DWDS Corpus: A Reference Corpus for the German Language of the 20th Century.” In Idioms and Collocations: Corpus-Based Linguistic and Lexicographic Studies, edited by Christiane Fellbaum, 23–41. London: Continuum.


Geyken, Alexander, Susanne Haaf, and Frank Wiegand. 2012. “The DTA ‘base format’: A TEI Subset for the Compilation of Interoperable Corpora.” In 11th Conference on Natural Language Processing (KONVENS): Empirical Methods in Natural Language Processing. Proceedings of the Conference, edited by Jeremy Jancsary, 383–91. Schriftenreihe der Österreichischen Gesellschaft für Artificial Intelligence 5. Wien: ÖGAI. Accessed February 7 2014. http://www.oegai.at/konvens2012/proceedings/57_geyken12w/.

Haaf, Susanne, and Matthias Schulz. 2014. “Historical Newspapers & Journals for the DTA.” In Language Resources and Technologies for Processing and Linking Historical Documents and Archives – Deploying Linked Open Data in Cultural Heritage – LRT4HDA. Proceedings of the workshop, held at the Ninth International Conference on Language Resources and Evaluation (LREC’14), May 26–31, 2014, Reykjavik (Iceland), 50–54. Accessed February 7 2014. http://www.lrec-conf.org/proceedings/lrec2014/workshops/LREC2014Workshop-LRT4HDA%20Proceedings.pdf#page=57.

Jurish, Bryan. 2012. “Finite-state Canonicalization Techniques for Historical German.” PhD diss., Universität Potsdam. Accessed February 7 2014. http://opus.kobv.de/ubp/volltexte/2012/5578/.

Jurish, Bryan, and Kay-Michael Würzner. 2013. “Word and Sentence Tokenization with Hidden Markov Models.” Journal for Language Technology and Computational Linguistics 28, no. 2: 61–83. Accessed February 7 2014. http://www.jlcl.org/2013_Heft2/3Jurish.pdf.

Jurish, Bryan, Christian Thomas, and Frank Wiegand. 2014. “Querying the Deutsches Textarchiv.” CEUR Workshop Proceedings 1131: 25–30. Accessed February 7 2014. http://ceur-ws.org/Vol-1131/mindthegap14_7.pdf.

Lüngen, Harald, and C. M. Sperberg-McQueen. 2012. “A TEI P5 Document Grammar for the IDS Text Model.” Journal of the Text Encoding Initiative 3. Accessed February 7 2014. http://jtei.revues.org/508. doi:10.4000/jtei.508.

TEI Consortium. 2014. TEI P5: Guidelines for Electronic Text Encoding and Interchange. Version 2.6.0. Last updated January 20. Accessed February 7 2014. http://www.tei-c.org/Vault/P5/2.6.0/doc/tei-p5-doc/en/html/.

TEI SIG on Libraries. 2011. “Best Practices for TEI in Libraries: A TEI Project.” Edited by Kevin Hawkins, Michelle Dalmau, and Syd Bauman. Version 3.0. October. Accessed February 7 2014. http://www.tei-c.org/SIG/Libraries/teiinlibraries/main-driver.html.

TextGrid. 2007–2009. “TextGrid’s Baseline Encoding for Text Data in TEI P5.” Accessed February 7 2014. http://www.textgrid.de/fileadmin/TextGrid/reports/baseline-all-en.pdf.

Thomas, Christian, and Frank Wiegand. 2012. “Making Great Work Even Better: Appraisal and Digital Curation of Widely Dispersed Electronic Textual Resources (c. 15th–19th cent.) in CLARIN-D.” Full Paper for the Historical Corpora 2012 International Conference, Goethe University, Frankfurt, Germany, December 6–9. Accessed February 7 2014. http://edoc.bbaw.de/volltexte/2012/2308/.

Trolard, Perry. 2011. “TEI Tite: A Recommendation for Off-site Text Encoding.” Version 1.1. September. Accessed February 7 2014. http://www.tei-c.org/release/doc/tei-p5-exemplars/html/tei_tite.doc.html.

Unsworth, John. 2011. “Computational Work with Very Large Text Collections: Interoperability, Sustainability, and the TEI.” Journal of the Text Encoding Initiative 1 (June). Accessed February 7 2014. http://jtei.revues.org/215. doi:10.4000/jtei.215.


Pytlik Zillig, Brian. 2009. “TEI Analytics: Converting Documents into a TEI Format for Cross-collection Text Analysis.” Literary and Linguistic Computing 24, no. 2: 187–92. doi:10.1093/llc/fqp005.

APPENDIXES

Appendix 1. DTABf Elements of Level 1 within <text>, with Frequencies, Attributes, and Values

Element Frequency* Attributes Values

1,225,251

811,758 @n "[verse number (in the textsource)]"

@n "[page number]" 558,763 @facs "[link to facsimile]"

456,195

@n "[note sign/number]"

368,227 @place "left", "right", "foot", "end"

@type "editorial"

247,030

@n "[depth of text structure]"

223,885 "abbreviations", "act", "advertisement", "appendix", "bibliography", "chapter" @type "lexiconEntry" "poem", "postface", "preface", "recipe", "scene"

@n "[(for stanzas:) number of stanza]" 117,431 @type "poem"

@cols "[number of columns occupied]"

65,179 @rows "[number of rows occupied]"

@role "label"

@n "[column number]" 63,521 @type "start", "end"

46,989

@type "notatedMusic"

31,731 @facs "[URL pointing to an image of the tagged figure]"

@notation "MathML", "TeX" 30,172 @facs "[pointer to a graphic of the transcribed formula]"

21,749

21,251 @reason "fm", "illegible", "insignificant", "lost"


@unit "chars", "words", "lines", "pages"

@quantity "[amount of text missing]"

19,292

6,426

5,324 @type "copyright", "dedication", "desc", "main", "price", "series", "sub", "volume"

2,621

2,418 @type "halftitle", "heading", "main", "series"

2,110

2,073

1,054

* Frequency of occurrence within the entire DTA corpus (> 2,100 works) in May 2014.

Appendix 2. DTABf Elements of Level 2 within <text>, with Frequencies, Attributes, and Values

Element Frequency* Attributes Values

15,728,726 @n "[printed line number]"

4,188,984 @rendition "[the medium for highlighting]"

@place "bottom", "top" 554,541 @type "catch", "header", "pageNum", "sig"

@unit "section" 77,873 @rendition "#hr", "#hrBlue", "#hrRed"

"[speaker’s id (as assigned to 61,057 @who the role)]"

60,970

@dim "horizontal", "vertical"

@unit "[amount of space concerned]" 46,401 "[unit measuring the amount of @quantity space concerned]"

20,470

11,062

10,384 @type "translation"

9,847

5,470

4,978

4,503


3,930

2,374

2,127

2,105

1,963

1,944

1,903

1,799

1,038

993

745

514

340

318

238

144

111

97

64

52

36

12

* Frequency of occurrence within the entire DTA corpus (> 2,100 works) in May 2014.

NOTES

1. BNC, http://www.natcorp.ox.ac.uk/. 2. For a discussion of this issue see Geyken, Haaf, and Wiegand 2012, 383ff. 3. See TEI Consortium 2014, 23.3, “Using the TEI: Personalization and Customization”, http://www.tei-c.org/Vault/P5/2.6.0/doc/tei-p5-doc/en/html/USE.html#MD: “[T]he TEI scheme supports a variety of different approaches to solving similar problems, and also defines a much richer set of elements than is likely to be necessary in any given project.… For these reasons, it is almost impossible to use the TEI scheme without customizing or personalizing it in some way.” 4. CLARIN: Common Language Resources and Technology Infrastructure, https:// www.clarin.eu/.


5. The DTA (http://www.deutschestextarchiv.de) at the Berlin-Brandenburg Academy of Sciences and Humanities has been funded by the Deutsche Forschungsgemeinschaft (German Research Foundation, DFG) since 2007. 6. See DTA-Erweiterungen (DTA Extensions), http://www.deutschestextarchiv.de/dtae. 7. ODD: “One document does it all”: see TEI Consortium 2014, “Customization: Getting Started with P5 ODDs,” http://www.tei-c.org/Guidelines/Customization/odds.xml; TEI Consortium 2014, 23.3, “Personalization and Customization”, http://www.tei-c.org/ Vault/P5/2.6.0/doc/tei-p5-doc/en/html/USE.html#MD. 8. “Integration und Aufwertung historischer Textressourcen des 15.–19. Jahrhunderts in einer nachhaltigen CLARIN-D-Infrastruktur: Kurationsprojekt 1 der Facharbeitsgruppe 1 Deutsche Philologie,” http://www.deutschestextarchiv.de/doku/ clarin_kupro_index; see also Thomas and Wiegand 2012. 9. See section 6.2 below for more information about extensions and modifications to the DTABf. 10. See the documentation for the DTABf-compliant recording of metadata, “Dokumente zum DTA-Basisformat: DTA-Basisformat—Header,” http:// www.deutschestextarchiv.de/doku/basisformat_header. 11. See the DTABf documentation for the formal and semantic text annotation: “Dokumente zum DTA-Basisformat: Formale Erschließung des Volltextes,” http:// www.deutschestextarchiv.de/doku/basisformat_texterschliessung_formal; “Dokumente zum DTA-Basisformat: Inhaltliche Erschließung des Volltextes,” http:// www.deutschestextarchiv.de/doku/basisformat_texterschliessung_inhaltlich. 12. DTA-Tokwrap: “Software im Deutschen Textarchiv: 6. Tokenizer für komplexe Texte (DTA-Tokwrap),” http://www.deutschestextarchiv.de/doku/software#dtatw; Jurish and Würzner 2013. CAB: “Software im Deutschen Textarchiv: 8. Linguistische Analyse historischer Texte (CAB),” http://www.deutschestextarchiv.de/doku/ software#cab; Jurish 2012; DDC: “Software im Deutschen Textarchiv: 7. Indizierung und linguistische Suche (DDC),” http://www.deutschestextarchiv.de/doku/software#ddc; Jurish, Thomas, and Wiegand 2014. 13. A “Tool Chaining Format used by WebLicht,” http://www.clarin.eu/category/ glossary/tcf. 14. See http://www.deutschestextarchiv.de/basisformat.odd. 15. See http://www.deutschestextarchiv.de/basisformat.rng. 16. See http://www.deutschestextarchiv.de/doku/basisformat. 17. All DTA texts as well as the DTA corpora as a whole are available for download and are freely accessible on the DTA website, http://www.deutschestextarchiv.de. 18. The full DTABf RNG is available for download at http:// www.deutschestextarchiv.de/basisformat.rng, the reduced schema is accessible under: http://www.deutschestextarchiv.de/basisformat_ohne_header.rng. 19. See http://www.deutschestextarchiv.de/basisformat.sch. 20. However, it is possible to embed Schematron rules in the ODD file by using a different namespace. 21. See TEI Consortium 2014, 15.5, “Recommendations for the Encoding of Large Corpora”,: http://www.tei-c.org/Vault/P5/2.6.0/doc/tei-p5-doc/en/html/ CC.html#CCREC.


22. Ibid. For a description of the DTABf levels, see http://www.deutschestextarchiv.de/ doku/basisformat_table?lang=en. 23. DDC: Dialing DWDS Concordancer; see “Software im Deutschen Textarchiv: 7. Indizierung und linguistische Suche (DDC),” http://www.deutschestextarchiv.de/doku/ software#ddc. 24. http://www.clarin.eu/content/federated-content-search. 25. See TEI Consortium 2014, element specification of : “(title page) contains the title page of a text, appearing within the front or back matter,” http:// www.tei-c.org/Vault/P5/2.6.0/doc/tei-p5-doc/en/html/ref-titlePage.html. 26. See the customizable HTML presentation which can be accessed from the starting page of each book on the DTA website. 27. TEI XSL Stylesheets, https://github.com/TEIC/Stylesheets/. 28. See CLARIN, “Component Metadata,” http://www.clarin.eu/content/component- metadata. For the CMDI profile of the DTA, see http://catalog.clarin.eu/ds/ ComponentRegistry/?item=clarin.eu:cr1:p_1345180279115; for a list of differences between the DTABf TEI header and DTA’s CMDI profile, see http:// www.deutschestextarchiv.de/doku/cmdi. 29. See “DTA—APIs,” http://www.deutschestextarchiv.de/api. 30. See http://catalog.clarin.eu/vlo/?theme=CLARIN-D. 31. See WebLicht, “The TCF Format,” http://weblicht.sfs.uni-tuebingen.de/ weblichtwiki/index.php/The_TCF_Format. 32. See the WebLicht main page, http://weblicht.sfs.uni-tuebingen.de/weblichtwiki/ index.php/Main_Page. 33. See “DTA-Basisformat—Einführung,” http://www.deutschestextarchiv.de/doku/ basisformat. 34. English version: “DTA ‘base format’—overview (elements within text),” http:// www.deutschestextarchiv.de/doku/basisformat_table?lang=en; German version: “DTA- Basisformat—Überblick (Elemente innerhalb von text),” http:// www.deutschestextarchiv.de/doku/basisformat_table?lang=de. 35. English version: “DTA ‘base format’—overview (elements within teiHeader),” http://www.deutschestextarchiv.de/doku/basisformat_table_header?lang=en; German version: “DTA-Basisformat—Überblick (Elemente innerhalb von teiHeader),” http:// www.deutschestextarchiv.de/doku/basisformat_table_header?lang=de. 36. DTAoX is available for download at http://www.deutschestextarchiv.de/doku/ software#dtaox. 37. Deutsches Textarchiv—Qualitätssicherung; see “Kollaborative Qualitätssicherung im Deutschen Textarchiv,” http://www.deutschestextarchiv.de/dtaq/about. 38. For the documentation of the DTA transcription guidelines, see “DTA-Richtlinien zur Texterfassung,” http://www.deutschestextarchiv.de/doku/richtlinien. 39. http://www.blumenbach-online.de/. 40. Deutsches Textarchiv Qualitätssicherung, http://www.deutschestextarchiv.de/ dtaq/book/view/bacmeister_predigt_1614?p=18. 41. “DTA-Basisformat—Auszeichnung von Zeitungen,” http:// www.deutschestextarchiv.de/doku/basisformat_zeitungen.


42. Deutsches Textarchiv, http://www.deutschestextarchiv.de/ kaempfer_japan01_1777/157, http://www.deutschestextarchiv.de/ kaempfer_japan01_1777/158. 43. See, for example, Georg Gustav Erbkam, Tagebuch meiner egyptischen Reise, Teil 1, Ägypten, 1842–43, in Deutsches Textarchiv, http://www.deutschestextarchiv.de/ erbkam_tagebuch01_1842; Theresia Lindnerin, Koch Buch zum Gebrauch der Wohlgebohrenen Frau (um 1780), in Deutsches Textarchiv Qualitätssicherung, http:// www.deutschestextarchiv.de/dtaq/book/show/lindnerin_kochbuch_1780. 44. “Die Grenzboten—Digitalisierung, Erschließung und Volltexterkennung einer der herausragenden deutschen Zeitschriften des 19. und 20. Jahrhunderts,” joint project of the Staats- und Universitätsbibliothek Bremen and the Berlin-Brandenburgische Akademie der Wissenschaften (BBAW), http://gepris.dfg.de/gepris/projekt/196492153. 45. See the CLARIN-D User Guide, version 1.0.1 (December 19), part II, ch. 6, subsection “Text Corpora,” http://www.clarin-d.de/en/language-resources/userguide.html. 46. “CLARIN-D Service Centres,” http://www.clarin-d.de/en/clarin-d-centres.html.

ABSTRACTS

In this article we describe the DTA “Base Format” (DTABf), a strict subset of the TEI P5 tag set. The purpose of the DTABf is to provide a balance between expressiveness and precision as well as an interoperable annotation scheme for a large variety of text types of historical corpora of printed text from multiple sources. The DTABf has been developed on the basis of a large amount of historical text data in the core corpus of the project Deutsches Textarchiv (DTA) and text collections from 15 cooperating projects with a current total of 210 million tokens. The DTABf is a “living” TEI format which is continuously adjusted when new text candidates for the DTA containing new structural phenomena are encountered. We also focus on other aspects of the DTABf including consistency, interoperability with other TEI dialects, HTML and other presentations of the TEI texts, and conversion into other formats, as well as linguistic analysis. We include some examples of best practices to illustrate how external corpora can be losslessly converted into the DTABf, thus enabling third parties to use the DTABf in their specific projects. The DTABf is comprehensively documented, and several software tools are available for working with it, making it a widely used format for the encoding of historical printed German text.

INDEX

Keywords: TEI customization, historical corpora, corpus annotation, interoperability, interchange, schema design, standardization


AUTHORS

SUSANNE HAAF Susanne Haaf works as a research assistant for the German Text Archive (DTA) and CLARIN-D at the Berlin-Brandenburg Academy of Sciences and Humanities (BBAW).

ALEXANDER GEYKEN Alexander Geyken works at the Berlin-Brandenburg Academy of Sciences and Humanities (BBAW), and is head of the project groups of the Digital Dictionary of German Language (DWDS, a long-term BBAW-project) and the German Text Archive (DTA).

FRANK WIEGAND Frank Wiegand works as a research assistant/software developer for the German Text Archive (DTA) and the Digital Dictionary of German Language (DWDS) at the Berlin-Brandenburg Academy of Sciences and Humanities (BBAW).


ReMetCa: A Proposal for Integrating RDBMS and TEI-Verse

Elena González-Blanco and José Luis Rodríguez

1. The ReMetCa Project: Definition and Scope

1 ReMetCa1 is a computer-based metrical repertoire of Medieval Castilian poetry. It gathers all poetic witnesses, from the very beginnings of Spanish lyrics at the end of the twelfth century through the rich and varied poetic manifestations of the fifteenth and sixteenth centuries, the “Cancioneros.” When complete, it will include over 10,000 texts and offer a systematic metrical analysis of each poem. ReMetCa is the first digital tool to analyze Medieval Spanish poetry. It enables users to carry out complex searches in its corpus, following the models of other digital resources in the Romance lyrical traditions, such as the Galician-Portuguese, Catalan, Italian, or Provençal ones. ReMetCa is not merely a metrical repertoire; it combines metrical schemes with text analysis as well as data sheets covering the main philological aspects that characterize the poems. One of its most important strengths is that it is a born-digital project designed to be interoperable with other existing poetry databases and digital repertoires, as conceptualized within the Megarep2 project, in which Spanish researchers are already working in collaboration with the Hungarian team led by Levente Seláf (González-Blanco and Seláf 2014). The Spanish repertoire is conceived as an essential tool to complete the digital European poetic puzzle, enabling users to conduct powerful searches in many fields at the same time.

2. The Role of ReMetCa in the European History of Poetic Repertoires

2 ReMetCa belongs to the latest generation in a long tradition of metrical repertoires, which can be divided into three significant periods based on the different technologies used for their construction. The first stage, in which repertoires were published as printed books, started at the end of the nineteenth century.3 The second period started after the Second World War, when repertoires became computer-assisted.4 Technological advance has made it possible to create a third generation of repertoires, available online and searchable through the web, in which research time has been considerably reduced by richer search capabilities and the accessibility of online interfaces, especially compared with the complexity of printed indexes and lists of metrical schemes. The first online digital poetic repertoire was the RPHA: Répertoire de la Poésie hongroise ancienne jusqu’à 1600 (Horváth 1991–); Galician researchers created MedDB: Base de datos da Lírica profana galego-portuguesa (Brea et al. 1994–); Italian researchers digitized BEdT: Bibliografia Elettronica dei Trovatori (Asperti et al.); the Nouveau Naetebus (Seláf), the Oxford Cantigas de Santa María Database (Parkinson), the Analecta Hymnica Digitalia (Rauner), and the Dutch Song Database (Grijp) all appeared later.5 Most of these repertoires are built on relational databases; some of them are open source, but the majority are built on proprietary software. Some of them were previously published as books and later as CDs, but now almost every project has developed an online version. The technical description of these resources is not the topic of this paper, but it will be studied in depth in a later stage of the project, as further possibilities for interoperability are analyzed.

3. The State of the Art in Spain

3 Compared with the European tradition described above, the Spanish panorama looks weak, as we do not have a published poetic repertoire which gathers the metrical patterns of Medieval Castilian poetry (except for the book by Gómez-Bravo [1999], restricted to “Cancionero” poetry, which only covers the fifteenth century, but not the poems written during the thirteenth and fourteenth centuries), and, until now, there was no digital resource available. However, researchers are nowadays more conscious of the importance of metrical studies for the analysis and understanding of Medieval Spanish poetry, as has been recently shown.6 On the other hand, metrical studies have flourished thanks to the creation of specialized journals, such as Rhythmica. Revista española de métrica comparada,7 available online, and Ars Metrica,8 born as an electronic journal whose scientific committee is composed of researchers from universities and centers in different countries.

4 At this point, ReMetCa aims to fill the Spanish gap in this complex puzzle of digital European poetic repertoires. In its future stages, however, the objective of the project is to design a system—based on the TEI model, but integrating linked data technologies—which allows interoperability, information exchange, and combined searches across the different repertoires and databases.

4. The Conceptual Schema of ReMetCa

5 The description of ReMetCa’s structure starts with the definition of a conceptual model based on the domain of medieval metrics. Its global structure is synthesized into a relational database, which will be exportable in the future to full TEI in order to facilitate data interchange and interoperability with other projects and systems. For the moment, TEI is restricted to tagging metrical and poetic phenomena by using the TEI Verse module in only one of the database fields. Although there are many possible ways to structure such a database, our proposed model takes into account the elements we need to describe and analyze our field of study. For the graphical representation we use a light version of UML (Unified Modeling Language). Entity names and their attributes are shown as boxes, and these are connected by lines which represent the existing relationships among them, together with numbers to show cardinality.

Figure 1. Conceptual model.

5. The Logical Schema: Modeling the Database E-R Model

6 The conceptual model of entities, attributes, and relationships is transformed into an Entity-Relationship model after applying the transformation rules that convert it into a database.


Figure 2. Logical Model.

7 The relationships established among entities can be “one-to-many” or “many-to-many.” The first category includes the relationship between obraCompleta (literary work) and refHismetca (classification system). As one refHismetca may appear in many different literary works, the solution given in the E-R model is to define a table for refHismetca with at least two attributes: one to identify the refHismetca, and the other to name it. For the entity obraCompleta (literary work), our database has a field with a foreign key which points to the id field of the genre table [refHismetca]. The second type of relationship, many-to-many, requires more work: it is necessary to create a third table for each such relationship. This happens, for example, with the relationship between topics and poems, whose third table consists of as many tuples as there are topics assigned to each poem. Every tuple has at least two attributes, identifying the poema (poem) and the tema (topic). A similar situation arises with the entity bibliografía (bibliography), whose third table is called referencias.

8 The structure described above covers the majority of entities in our relational database model. Some problems arise when entering entities with a complex textual description, such as the terms used to describe parts of the literary work. Poema (poem), estrofa (stanza), and verso (line) are the components that define the hierarchical structure of each literary work analyzed. Applying the previously described E-R model would drive us into a complex model of relationships among those components which are very difficult to represent in a database. These relationships are created through composition, which means that the entity poema is composed of one or more entities estrofa, and the entity estrofa consists of one or more entities verso. The relations of multiplicity among the components vary from one work to another, depending on the number of stanzas and lines in each entity. For example, a sonnet shows a multiplicity of 1 for each stanza contained by it (i.e., each stanza is part of only one sonnet), and the stanza has a multiplicity of 4 in relation to the sonnet (the sonnet contains four stanzas). The problem of representing this composition in an E-R model is that the representation is data-centered, and it does not work for poem, line, and stanza, as these components need to be analyzed as textual items. It is necessary to insert metadata into these textual items to show their compositional relation and to add other relevant information to them. The E-R model is inappropriate for this purpose due to its center-based structure, with the entities of poem, line, and stanza in the middle of its referential domain of study.

9 The difficulties described above remain in each stage of the project and reappear in the programming phase and when designing the different web components, both in the administrative part of the project (for aspects related to the creation, modification, and deletion of registers) and in its querying module. Also, the structure of each of the three components mentioned (poem, stanza, and line) differs significantly depending on the literary work, as there are different kinds of poems, stanzas, and lines. For example, some of the poems are monostrophic, whereas others are composed of several different stanzas. Lines may be rhymed or unrhymed, may be inserted into a strophic structure or not, may form part of a series of compositions, or may constitute a literary composition themselves. From a technical point of view, all of them could be considered “semi-structured data.”

6. The Addition of XML Tagging: The Verse Module of TEI P5

10 This hierarchical and semi-structured pattern is quite ungainly for an E-R database model, which is not flexible enough to show hierarchical structures. For this reason, using an XML-based markup language, such as TEI, is the perfect solution to show all the complex relations and properties. As the ReMetCa model is built on a database which works very well in terms of structure to describe the main fields of our projects, TEI is just used for tagging entities and relationships that cannot be represented in the database structure because of the complexity and diversity of the contents of the texts included, in terms of metrical definitions and properties. We have selected only the tags of the Verse module of the TEI P5 Guidelines, as this module works perfectly to show the hierarchy and properties needed, which may vary depending on the type of poem. Here is an example:


Example 1. Auto de la huida a Egipto, escena III, vv. 36–44.

Guía al hijo y a la madre, guía al viejo pecador, que se parte sin temor a donde manda Dios padre; estos dos versos son de vuelta y pues al niño bendito y a nosotros tú sacaste, Ángel, tú que me mandaste de Judea ir a Egipto, guíanos con el chiquito.

11 We have designed a TEI customization based on the TEI Verse module and represented as an XML Schema (.xsd) which includes the following tags and attributes:
• The stanza, marked with the <lg> tag, is our highest tagging level and bears the following attributes:
◦ @type describes the function of the stanza in the poem and may have one of three values, "estrofa", "estribillo", or "cabeza", depending on its function: a normal stanza, a chorus, or the head position of the poem (which is usually repeated with variations).
◦ @subtype contains the name of the stanza form (e.g., "soneto", "cuadernavia", "tirada", "romance", "zéjel"), provided it has a standardized name in the poetic tradition. The classification of the terms is gathered in a controlled vocabulary registered in the guidelines established by ReMetCa. If there is no known conventional name (which happens in many cases), the attribute may not appear.
◦ @n contains the number of the stanza as per the print edition on which our digital editions are based. Since the transcriptions of the texts we use are not complete, it is necessary to number the lines in order to make references to other editions.
◦ @met contains the metrical structure, based on the number of syllables in each line and on the number of lines, separated by commas. A stanza of four lines of octosyllables would be encoded as met="8,8,8,8".
◦ @rhyme contains the rhyming structure of each stanza, represented by letters, as for example in rhyme="abba", to show that the first and fourth lines have the same rhymes and so do the second and third.
◦ @asonancia is a new attribute which does not exist in the TEI Verse module and has been added to our schema. Although our philosophy is to respect and follow the TEI Guidelines (TEI Consortium 2013), we have felt obliged to introduce three new attributes. This one indicates the two possible values of the rhyming typology: "asonante" (which means that only vowel sounds are repeated at the end of two rhyming lines) and "consonante" (which means that every sound after the stressed syllable is repeated). The second added attribute is @unisonancia, which takes the values of either "unisonante" or "singular" and shows whether the same rhyme scheme is repeated in different stanzas (unisonante: abba, abba) or not (singular: abba, cddc). The third new attribute is @isometrismo, which takes the values of either "isométrico" or "heterométrico" and indicates whether all the descendant stanzas have the same number of syllables ("isométrico") or not ("heterométrico").
• The line element, <l>, a child of <lg>, is a mixed-content element which contains the child element <rhyme> to mark the rhyming string of the line (the part pronounced after the stressed vowel), and one attribute, @n, which marks the line number, following the same principles as those indicated for the same attribute described at <lg>.
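To illustrate how these attributes come together, here is a minimal sketch of a stanza encoded with this customization, built from the first four lines of Example 1 above; the stanza number, the delimitation of the rhyme strings, and the classification values chosen are our assumptions for illustration, not the project’s published encoding of this poem.

  <!-- Sketch under the assumptions stated above. -->
  <lg type="estrofa" n="1"
      met="8,8,8,8" rhyme="abba"
      asonancia="consonante" unisonancia="singular" isometrismo="isométrico">
    <l n="36">Guía al hijo y a la m<rhyme>adre</rhyme>,</l>
    <l n="37">guía al viejo pecad<rhyme>or</rhyme>,</l>
    <l n="38">que se parte sin tem<rhyme>or</rhyme></l>
    <l n="39">a donde manda Dios p<rhyme>adre</rhyme>;</l>
  </lg>
  <!-- The exact extent of the string wrapped in <rhyme> follows the project's
       guidelines; here it is delimited only approximately. -->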

7. Possibilities for Extending our TEI Schema

12 As the ReMetCa project was initially built as a relational model, the only TEI component that has been added to its data model is the Verse module. However, in the future there will be more TEI markup additions to reflect the need for gathering more information on the different manuscripts witnessing each poem. Our team is analyzing the possibility of replacing the manuscript table with the element <msDesc>. The reasons for this decision derive from the complexity of describing many different witnesses, each of them with complex relations to many collections, libraries, owners, and copyists. Following the E-R model would force us to multiply the number of tables in order to be able to reflect this complexity; using a model based on XML would therefore ease our work by offering both exhaustiveness and flexibility. This advantage is counterbalanced, however, by the scale of changes that would be required in order to introduce XQuery functionality into the system, so each decision to change in this way has to be taken gradually and carefully.

8. The Physical Model

13 Once the logical model was established, the next task was to implement the selected schema in a database management system, what is known as the “physical model” (Silberschatz 2002, 40). As far as the selection of the system is concerned, native XML databases such as eXist or BaseX were discarded, since we wanted to preserve the original relational structure of our logical system, and also because we considered it worthwhile to use an RDBMS, seeing that most of the online repertoires are built on similar systems. The problem lay in choosing a combined system to integrate both types of information: E-R relational structures and text marked up with XML. We compared two of the most popular E-R database systems: MySQL and Oracle 9i XML. Both of them offered the possibility of adding an XMLType data column, thus making it possible to introduce hierarchical structures inside the relational model via XML nesting. The combination of both systems offers great advantages, such as the possibility of making combined queries with the SQL and XPath languages.9

14 This first trial model of ReMetCa, combining the MySQL RDBMS and TEI XML fields, is already working online and can be visited at http://www.remetca.uned.es. To access the database, it is necessary to log in to “Área de trabajo” and then into “base de datos,” using a username and password that can be requested from the same website. Eventually the search screen will be freely available online without a password, but for now we are still populating the database and designing the search engine.

15 Oracle databases contain fields defined as XMLType, thanks to the XDB component. This component is usually installed by default, but its status can be checked by using the following SQL statement:

SQL> select comp_name, status from dba_registry where comp_name='Oracle XML Database';

COMP_NAME
----------------------------------------
STATUS
-----------
Oracle XML Database
VALID

16 In the case of ReMetCa, this XMLType field needs to appear only once, in the table Poema, which contains the <lg> elements and their content, as described above. Oracle SQL Developer also allows updates to the contents with its record view.


Figure 3. Oracle SQL Developer.

17 Oracle stores XSD schemas into its system, so the XML introduced is validated. To test this property, we tried introducing an XSD schema for this element, adding attributes and constraints, and as figure 4 shows, it was perfectly validated. The system also accepts XSLT stylesheets. Oracle registers stylesheets and can apply them to any XMLType field.

Figure 4. Schema validation error.

18 Finally, it is important to create an index for XML elements to test the performance of the database.


Figure 5. Index of verses.

9. Conclusion

19 As described above, the ReMetCa project is under constant development. As our textual corpus grows, the initial schema has to be adjusted to its needs. We are working to expand the technical model and to make it capable of describing the complex panorama of Medieval Spanish poetry. During the next stages of the project, more TEI modules and elements will be added in order to complete some of the sections that are not yet satisfactorily described by the E-R model in the database, and also to make it possible to export all the database contents to full TEI. A further stage will consist of working on a linked-data model to allow interoperability among the different European repertoires based on a common ontology.

20 Although we have adopted Oracle because of its power to work with XML technologies and TEI, open-source systems such as MySQL and PostgreSQL also offer enough functionality to implement a mixed system like the one proposed in this paper. Indeed, it is likely that both systems will include similar tools for working with XMLType fields in the future, as XML becomes more and more widely used.

BIBLIOGRAPHY

Bibliography on Metrical Studies

Beltrán, Vicenç. 2007. “Bibliografía sobre Poesía medieval y Cancioneros.” In Biblioteca Virtual Joan Lluís Vives. Accessed February 5, 2014. http://www.lluisvives.com/FichaObra.html?Ref=25467&portal=1.

Domínguez Caparrós, José. 1985. Diccionario de métrica española. Madrid: Paraninfo.

———. 1988. Contribución a la bibliografía de los últimos treinta años sobre la métrica española. Madrid: C.S.I.C.

Duffell, Martin J. 2007. Syllable and Accent: Studies on Medieval Hispanic Metrics. Papers of the Medieval Hispanic Research Seminar 56. London: Department of Hispanic Studies, Queen Mary and Westfield College, University of London.

García Calvo, Agustín. 2006. Tratado de Rítmica y Prosodia y de Métrica y Versificación. Zamora: Lucina.

Gómez Redondo, Fernando. 2001. Artes Poéticas Medievales. Madrid: Laberinto.


González-Blanco García, Elena. 2010. La cuaderna vía española en su marco panrománico. Madrid: FUE.

González-Blanco García, Elena and Levente Seláf. 2014. “Megarep: A Comprehensive Research Tool in Medieval and Renaissance Poetic and Metrical Repertories.” In Humanitats a la xarxa: món medieval / Humanities on the Web: The Medieval World, edited by Lourdes Soriano, Helena Rovira, Marion Coderch, Glòria Sabaté, and Xavier Espluga, 321–32. Oxford: Peter Lang.

Micó, José María. 2009. Bibliografía para una historia de las formas poéticas en España. Alicante: Biblioteca Virtual Miguel de Cervantes. Accessed February 5, 2014. http://www.cervantesvirtual.com/nd/ark:/59851/bmc1g127.

Silberschatz, Abraham, et al. 2002. Fundamentos de bases de datos. Madrid: McGraw-Hill.

TEI Consortium. 2013. TEI P5: Guidelines for Electronic Text Encoding and Interchange. Version 2.5.0. Last updated July 26. N.p.: TEI Consortium. http://www.tei-c.org/Vault/P5/2.5.0/doc/tei-p5-doc/ en/html/.

Metrical Repertoires Published in Print

Antonelli, Roberto. 1984. Repertorio metrico della scuola poetica siciliana. Palermo: Centro di Studi Filologici e Linguistici Siciliani.

Betti, Maria Pia. 2005. Repertorio Metrico delle Cantigas de Santa Maria di Alfonso X di Castiglia. Ospedaletto (Pisa): Pacini.

Brunner, Horst, Burghart Wachinger, and Eva Klesatschke, eds. 1986–2007. Repertorium der Sangsprüche und Meisterlieder des 12. bis 18. Jahrhunderts. Tübingen: Niemeyer.

Frank, István. 1953–57. Répertoire métrique de la poésie des troubadours. Paris: H. Champion.

Gómez-Bravo, Ana María. 1999. Repertorio métrico de la poesía cancioneril del siglo XV. Alcalá de Henares: Universidad.

Gorni, Guglielmo. 2008. Repertorio metrico della canzone italiana dalle origini al Cinquecento (REMCI). Florence: Franco Cesati.

Mölk, Ulrich, and Friedrich Wolfzettel. 1972. Répertoire métrique de la poésie lyrique française des origines à 1350. Munich: W. Fink Verlag.

Naetebus, Gotthold. 1891. Die Nicht-Lyrischen Strophenformen des Altfranzösischen. Ein Verzeichnis Zusammengestellt und Erläutert. Leipzig: S. Hirzel.

Pagnotta, Linda. 1995. Repertorio metrico della ballata italiana: secoli XIII e XIV. Milan: Ricciardi.

Parramon i Blasco, Jordi. 1992. Repertori mètric de la poesia catalana medieval. Barcelona: Curial, Abadia de Montserrat.

Pillet, Alfred, and Henry Carstens. 1933. Bibliographie der Troubadours. Halle: M. Niemeyer.

Raynaud, Gaston. 1884. Bibliographie des chansonniers français des xiiie [treizième] et xive [quatorzième] siècles: comprenant la description de tous les manuscrits, la table des chansons classées par ordre alphabétique de rimes et la liste des trouvères. 2 vols. Paris: Vieweg.

Solimena, Adriana. 1980. Repertorio metrico dello Stil novo. Rome: Presso la Società.

———. 2000. Repertorio metrico dei poeti siculo-toscani. Palermo: Centro di studi filologici e linguistici siciliani.


Tavani, Giuseppe. 1967. Repertorio metrico della lirica galego-portoghese. Rome: Edizioni dell’Ateneo.

Touber, Anton H. 1975. Deutsche Strophenformen des Mittelalters. Stuttgart: Metzler.

Zenari, Massimo. 1999. Repertorio metrico dei “Rerum vulgarium fragmenta” di Francesco Petrarca. Padova: Antenore.

List of Digital Repertoires and Databases

Asperti, Stefano, Fabio Zinelli, et al. BEdT, Bibliografia Elettronica dei Trovatori. Sapienza Università di Roma, Dipartimento Studi greco-latini, italiani, scenico-musicali. Accessed February 5, 2014. http://www.bedt.it/.

Brea, Mercedes, et al. 1994–. MedDB: Base de datos da Lírica Profana Galego-Portuguesa. Centro Ramón Piñeiro para a Investigación en Humanidades. Accessed February 5, 2014. http://www.cirp.es/bdo/med/meddb.html.

González-Blanco, Elena, et al. 2013–. ReMetCa: Repertorio métrico digital de la poesía medieval castellana [A Digital Repertory of Medieval Spanish Metrics]. Last updated July 22, 2015. http://www.remetca.uned.es.

Grijp, Louis, et al. Dutch Song Database. Meertens Institute, Royal Netherlands Academy of Arts and Sciences (KNAW). Accessed February 5, 2014. http://www.liederenbank.nl/index.php?lan=en.

Horváth, Iván, et al. 1991–. Répertoire de la poésie hongroise ancienne. Eötvös Loránd University (ELTE). Accessed February 5, 2014. http://rpha.elte.hu/.

Parkinson, Stephen, et al. The Oxford Cantigas de Santa Maria Database. Oxford University. Accessed February 5, 2014. http://csm.mml.ox.ac.uk/.

Rauner, Erwin. Analecta Hymnica Medii Aevi Digitalia. Erwin Rauner Verlag. Accessed February 5, 2014. http://webserver.erwin-rauner.de/crophius/Analecta_conspectus.htm.

Seláf, Levente. Le Nouveau Naetebus: Poèmes strophiques non-lyriques en français des origines jusqu’à 1400. Accessed February 5, 2014. http://nouveaunaetebus.elte.hu/.

NOTES

1. http://www.remetca.uned.es/.
2. http://www.remetca.uned.es/index.php?option=com_content&view=article&id=76:megarep-a-comprehensive-research-tool-in-medieval-and-renaissance-poetic-and-metrical-repertoires&catid=80&Itemid=514&lang=es.
3. With the works of Gaston Raynaud (1884), Gotthold Naetebus (1891), and Pillet and Carstens (1933), among others.
4. With the classic work of Frank on Provençal troubadours’ poetry (Frank 1953–57); it has continued with the editions of printed metrical repertoires, in Old French lyrics by Mölk and Wolfzettel (1972), in Italian by Solimena (1980), Antonelli (1984), Pagnotta (1995), Zenari (1999), Solimena (2000), and Gorni (2008); in the Hispanic Peninsula by Tavani (1967), Parramon i Blasco (1992), Gómez-Bravo (1999), and Betti (2005); in German by Touber (1975) and Brunner, Wachinger, and Klesatschke (1986–2007).


5. For the URLs and other details, see “Repertoires and Digital Databases” in the bibliography below.
6. Some examples are the publications of Domínguez Caparrós (1985 and 1988), Micó (2009), Gómez Redondo (2001), García Calvo (2006), Duffell (2007), Beltrán (2007), or González-Blanco (2010).
7. http://e-spacio.uned.es/revistasuned/index.php/rhythmica.
8. http://www.arsmetrica.eu/.
9. The disadvantage of MySQL compared to Oracle is that MySQL does not offer validation for XML fields, as it considers them just as text fields, and there are no added XSD schemas or any kind of transformational stylesheets, like XSLT. TEI code has to be composed and validated outside the database by using XML editors (we use Oxygen), since the MySQL RDBMS exploitation of XML technologies is limited to XPath and can only be recovered through the extractValue function of MySQL, which shows important limitations. Oracle Database 9i and 10g, however, are much more powerful in that sense, as they combine their E-R nature with XML technologies, resulting in a much more advanced tool to work with both systems, hierarchical and relational, at the same time.

ABSTRACTS

This paper describes the technical structure of the project ReMetCa (Repertorio Digital de Métrica Medieval Castellana), the first online repertoire of Medieval Spanish metrics and poetry. ReMetCa is based on the combination of traditional metrical and poetic studies (rhythm and rhyme patterns) with digital humanities technology, TEI-XML integrated into a Relational Database Management System (RDBMS) through an XMLType field, thus opening up the possibility of launching simultaneous searches and queries by using a searchable, user-friendly interface. The use of the TEI Verse module to tag metrical and poetic structures is especially important for two reasons: first, because it lets us tag different kinds of poems with a variable metadata structure, making it possible to express a very high level of detail for poetry description that cannot be registered in a conventional database; and second, as it enables extensibility and adds more information to enrich the conceptual model, which is under constant development.

INDEX

Keywords: poetry, medieval metrics, repertoire, RDBMS, MySQL, verse, stanza


AUTHORS

ELENA GONZÁLEZ-BLANCO Elena González-Blanco teaches in the Spanish Literature Department at Universidad Nacional de Educación a Distancia (Open University) of Spain in Madrid. Her main research and teaching areas are comparative medieval literature, metrics and poetry, and digital humanities. She is the Director of LINHD, the Digital Humanities Center at UNED, leads the project ReMetCa, and is also the coordinator of the linked-data Unedata project at UNED. She is a member of the Executive Committee of EADH, member of the first executive committee of GO::DH, member of the Clarin Scientific Advisory Board, member of Centernet Executive Committee, and the secretary of the Spanish association for digital humanities HDH.

JOSÉ LUIS RODRÍGUEZ José Luis Rodríguez is a systems librarian at the Royal Library (Real Biblioteca) in Madrid. He participated in projects devoted to building the manuscript catalogs of the Salamanca University Library and the Royal Library. He has designed and developed several web applications to manage historical bookbinding collections, old inventories of books, exlibris, digital libraries, and other historical bibliographic materials. He is an expert in Integrated Library Systems and he is currently working with the Koha Open Source ILS. His publications include several articles on the history of Galician language, book history, and digital humanities. He is a member of the ReMetCa Project.


Modeling Frequency Data: Methodological Considerations on the Relationship between Dictionaries and Corpora

Karlheinz Mörth, Laurent Romary, Gerhard Budin and Daniel Schopper

1. Introduction

1 Academic dictionary writing is making greater and greater use of the TEI Guidelines’ dictionary module. As increasing numbers of TEI dictionaries become available, there is an ever more palpable need to work towards greater interoperability between dictionary-writing systems and the other language resources on which dictionaries and dictionary tools depend. In a world of exponentially increasing information, the borders between different types of digital language resources require increased attention. Two particularly important instances of such language resources are digital text corpora and dictionaries, both of which play an important part in the TEI community.

2 The research described in this paper is based on work accomplished in a group of linguistically focused projects that—among other activities—also work on glossaries and dictionaries intended to be usable both by human readers and by particular NLP applications. The main questions to be addressed are:
• How can we define the relationship between a dictionary and other language resources such as digital corpora, irrespective of whether they are used in the production of the dictionary or to enrich existing lexicographic data?
• How can this best be documented using the TEI Guidelines?

3 The paper comprises two parts: in the first, the authors give a concise overview of the scholarly background of the projects involved and their goals. The second part touches on encoding issues in the related dictionary production. We will focus particularly on the modeling of an encoding scheme for statistical information on lexicographic data gleaned from digital corpora.

2. The Dictionaries and Projects Involved

4 The projects in which the dictionaries and related technologies have been developed are tightly interlinked: they are all joint endeavors of the Austrian Academy of Sciences and the University of Vienna, and all conduct research in the field of variational Arabic linguistics. It is important to note that Arabic is characterized by a complex polyglossic situation, with Modern Standard Arabic (MSA) on one side of the spectrum and spoken vernaculars on the other side. Linguists and lexicographers may be confronted with three or even four varieties being used by the same speakers in one and the same linguistic biotope in some regions.

5 The first project to be mentioned is the Vienna Corpus of Arabic Varieties (VICAV; see figure 1), which was started two years ago with a low budget, and was intended as an attempt at setting up a comprehensive research environment for scholars pursuing comparative interests in the study of Arabic dialects. The evolving VICAV platform aims at pooling linguistic research data, including various language resources such as language profiles, dictionaries, glossaries, corpora, and bibliographies. One of the main objectives of the project is the creation of a number of dictionaries of Arabic varieties that are primarily intended for comparative purposes.

Figure 1. A screenshot of a prototype of the VICAV website in 2012.

6 The second project to be mentioned here is Linguistic Dynamics in the Greater Tunis Area: A Corpus-based Approach (TUNICO). This project is financed by a grant from the Austrian Science Fund and aims at the exploration of the hitherto poorly documented contemporary Arabic of the Tunisian capital, which is linguistically and demographically a highly dynamic region. A particular feature of the project is the importance of the dictionary–corpus interface, which will allow the researcher to navigate from the corpus to the dictionary and vice versa. The TUNICO project is producing two digital language resources: a corpus of spoken youth language and a diachronic dictionary of Tunisian Arabic. The project started in August 2013 and will run for three years.

7 The third project has grown out of a master’s thesis and deals with the lexicographic analysis of the Egyptian vernacular Arabic Wikipedia (Siam 2013). Siam extracted the two hundred most frequent words from Wikipedia Masri,1 which given the scarcity of available tools proved to be quite a challenge. The idea of incorporating statistical information gathered in this project was the initial incentive to start thinking about how to encode such information in accordance with the TEI Guidelines.

8 Four of the dictionaries created in the above-mentioned projects (namely Egyptian, Damascene, Moroccan, and Tunisian) can be correlated with digital corpora that already exist. This does not imply that the existing data have been compiled on the basis of these corpora, none of which are very large. Nonetheless, they are so far the only available digital text collections that can be used to underpin this dictionary-building process with empirical methods. Egyptian is the most widely used Arabic dialect. There is plenty of material on the Internet: of particular interest is the great amount of data that can be found in social media sites and on personal web pages. However, most of this data is of a very hybrid nature and intermixed with MSA. Therefore, it is difficult to use for dialectological research. The only resource we could easily avail ourselves of so far is the Egyptian Wikipedia, which has been made accessible as a corpus as part of another ACDH project working on the conversion of Wikipedias into TEI. This work is particularly interesting for under-resourced languages without other digital texts suitable for linguistic research.

9 VICAV contains a small corpus of samples of Damascene Arabic, which was compiled by Carmen Berlinches during her seven-year stay in Damascus. In addition, there exists the Graz Corpus of Moroccan Arabic, compiled as part of a project funded by the Austrian Science Fund,2 and the above-mentioned TUNICO corpus which is currently being compiled.3

10 Some of these data have already been used to enhance the existing dictionaries, in particular the Egyptian and the Moroccan ones. Many of the words, word forms, and example sentences contained in the corpora have been integrated into the dictionaries. The idea of integrating frequency data grew out of the question as to which lemmas were more important than others.

Table 1. Arabic language corpora.

Source lang.   Target lang.   Corpus            Size (entries)
Egyptian       en, de         Wikipedia Masri   ~3,000
Damascus       en, de, sp     CB-Corpus         ~3,000
Tunis          en, de, (fr)   TUNICO            ~4,000
Rabat          en, de         (GCMA)            ~500

11 More dictionaries are under preparation. The VICAV database also contains data on Sudanese Arabic, Maltese, Modern Standard Arabic, and the Shawi dialects. One overarching goal of all these endeavors is the creation of a comparative dictionary with an integrated research environment that allows access to all of these data.


2.1 A Trimmed Dictionary Schema

12 Using the TEI dictionary module to encode digitized print dictionaries has become a fairly common standard procedure in digital humanities. Our paper will not reprise the discussion of TEI vs. LMF vs. LEXml vs. Lift vs. RDF vs. other standards;4 we assume that the TEI dictionary module is sufficiently well developed to cope with all requirements of our projects. The basic schema used has already been tested in several projects for various languages and will furnish the foundation for the intended customizations.

13 Created to serve as sources for comparative research, all of the above-mentioned dictionaries have to fulfill a series of requirements: Technically, they have to be processable by various tools, most importantly by several web services on which the dictionary tools build. They have to be compatible with one another and the tools used in their creation. Therefore, they have to be encoded following one single schema in order to allow electronic tools to work on them in tandem and to allow users to execute meaningful queries across all of the dictionaries. This goal has so far been achieved by applying a narrowly defined schema that imposes a number of specific constraints, which were meant to serve as a mechanism to enhance interoperability. In all design issues we have strived for a high degree of compliance with LMF. The main methods of imposing such constraints are reducing alternate constructs and matching with constructs of LMF (ISO 2008).5

14 Since all of these data are “born digital,” it is comparatively easy to ensure the structural uniformity of the dictionaries. Basically, our dictionaries are conceptualized as a specific type of text and are therefore encoded with <text> elements. Each dictionary starts with a <teiHeader> which contains the metadata of the dictionary. The <body> of the VICAV dictionaries contains two <div> elements: one typed "entries", holding all entries of the dictionary, and another typed "examples", which is populated with a series of <cit>/<quote> constructs containing example sentences.6 Treating these examples as independent units allows dictionary writers to reuse the same sentence in various parts of a dictionary. The schema uses the <sense> element, does not make use of the <hom> element, and does not allow <superEntry>, <entryFree>, or <dictScrap>.7

15 Thus, our TEI dictionaries basically look like this:


<TEI xmlns="http://www.tei-c.org/ns/1.0">
   <teiHeader>
      ...
   </teiHeader>
   <text>
      <body>
         <div type="entries">
            <entry>...</entry>
            <entry>...</entry>
         </div>
         <div type="examples">
            <cit><quote>...</quote></cit>
            <cit><quote>...</quote></cit>
         </div>
      </body>
   </text>
</TEI>

16 A typical, slightly simplified <entry> taken from the Egyptian dictionary is shown below:

<entry>
   <form type="lemma">
      <orth>kitāb</orth>
      <orth xml:lang="ar-arz-x-cairo-arabic">كتاب</orth>
   </form>
   <gramGrp>
      <gram type="pos">noun</gram>
      <gram type="root">ktb</gram>
   </gramGrp>
   <form type="plural">
      <orth>kutub</orth>
      <orth xml:lang="ar-arz-x-cairo-arabic">كتب</orth>
   </form>
   <sense>
      <cit type="translation" xml:lang="en">
         <quote>book</quote>
      </cit>
      <cit type="translation" xml:lang="de">
         <quote>Buch</quote>
      </cit>
   </sense>
</entry>

17 To minimize processing overhead, hierarchical nesting is avoided whenever possible; even inflected forms of the lemma are treated like other variant word forms and are encoded as <form> elements on the same hierarchical level as the lemma itself.


18 The above structure is the one actually implemented at the moment. All VICAV dictionaries and most other dictionaries being produced at the ACDH follow this basic structure.8

2.2 Frequency Data

19 Lexicostatistical data and methods are used in many fields of modern linguistics, lexicography being only one of them. Modern dictionary production relies on corpora, and statistics play an important role in lexicographers’ decisions, for instance when selecting lemmas to be included in dictionaries or selecting senses to be incorporated into dictionary entries. However, lexicostatistical data are not only of interest for the lexicographer; they might also be useful to the users of lexicographic resources, especially digital lexicographic resources. The question as to how to make such information available takes us to the issue of how to encode it.

20 Reflecting on the dictionary–corpus interface and on the issue of how to bind corpus-based statistical data into the lexicographic editing workflow, two prototypical approaches are conceivable: (1) either statistical information is statically embedded in the dictionary entries or (2) a dictionary interface provides functionalities to access services capable of providing the required data.

21 A group of people working on methodologies to implement functionalities of the second type is the Federated Content Search (FCS) working group, an initiative of the CLARIN-ERIC infrastructure which strives to enhance search capabilities in locally distributed data stores (Stehouwer et al. 2012). FCS is intended to work with heterogeneous data, and dictionaries are only one type of language resource to be taken into consideration. In view of the growing prevalence of more dynamic digital environments, the second of the above-mentioned approaches is more appealing. In practice, the digital workbench will require both options. This is particularly true given that corpora change and grow over time. Resolving polysemy and grouping instances into senses remain tasks that cannot be achieved automatically—yet for the sake of verifiability these should be as accountable as possible. The prevalence of digital workflows in lexicographic editing in combination with the availability of large-scale data storage at reasonable cost provide the technological prerequisites to envision systems that keep track of such editorial processes and make the lexicographers’ decisions more transparent and verifiable.

3. Documenting Consulted Corpora

22 If traceability is to be one of the fundamental benefits of a digital lexicographic workflow, documenting the provenance of the language data which editorial decisions rely upon becomes a basic requirement. Among the various possible elements to accommodate such metadata in the current TEI Schema, the dictionary’s <sourceDesc> element with <bibl> elements (or their finer-grained variants <biblStruct> or <biblFull>) might seem an obvious fit.


<sourceDesc>
   <bibl>Österreichisches Wörterbuch. 42. Auflage. Wien: ÖBV 2012</bibl>
   <bibl>amc – Austrian media Corpus. Austrian Centre for Digital Humanities. Austrian Academy of Sciences. http://www.oeaw.ac.at/acdh/amc</bibl>
</sourceDesc>

23 This solution, however, is far from ideal, at least in cases where the digital dictionary stems from a printed original, as their different relations with respect to the encoded text would be indiscernible in the markup. Adding @type attributes to both <bibl> elements would not help either, as that would merely classify the bibliographic records, not the function of the entities they describe.9 Thus, the distinction between the two kinds of “sources” should be made on a more general level. The following construct would be formally valid:

<sourceDesc>
   <listBibl>
      <bibl>Österreichisches Wörterbuch. 42. Auflage. Wien: ÖBV 2012</bibl>
   </listBibl>
   <listBibl>
      <bibl>amc – Austrian media Corpus. Austrian Centre for Digital Humanities. Austrian Academy of Sciences. http://www.oeaw.ac.at/acdh/amc</bibl>
   </listBibl>
</sourceDesc>

24 Grouping those two kinds of bibliographic references in separate listBibl elements with appropriate @type attributes—for instance "printedSource" and "consultedCorpora"—solves the issue of ambiguity at least on the surface. There remain concerns, however. First of all, relying on attribute values on parallel constructs to encode such a fundamental difference is not as expressive as one might wish—especially in the case of a much-used attribute like @type. More importantly, does this kind of markup (or any other of the approaches mentioned above) sufficiently denote the specific type of relationship between the digital dictionary and a corpus, which obviously cannot be reduced to one of a “source” and its “derivation?” This conceptual problem pertains to nearly all possible solutions we could think of.

25 One solution would be embedding this kind of metadata into the <editorialDecl> element, which “provides details of editorial principles and practices applied during the encoding of a text,”10 or even into <samplingDecl>, where “the rationale and methods used in selecting texts, or parts of text, for inclusion in the resource”11 are documented. Although this seems more accurate with respect to the role of language resources in dictionary writing, the definition of their common ancestor <encodingDesc> (“documents the relationship between an electronic text and the source or sources from which it was derived”)12 explicitly refers to a relation of dependency of one on the other, which seems inappropriate in our case.


26 Surprisingly, it is the Critical Apparatus module which provides an appropriate solution for our problem. Originating in the tradition of textual criticism, this module defines the phrase-level <app> element to embed various versions of a passage in-line, optionally declaring one as the preferred reading. The resulting TEI document is a compound object that does not have one single “non-electronic” counterpart, but documents a multitude of fragments from various resources. Likewise, a dictionary, which is closely intertwined with data from language resources, can be conceptualized as the abstraction of this instance data. In order to express the specific nature of its “sources,” the textcrit module has to define a new child to <sourceDesc>, the <listWit> element.

27 Following this example, we propose the introduction of a dedicated list element to hold a list of any language resources from which the dictionary in the document’s <text> draws its statistical information. This list consists of one or more resource elements, which provide relevant metadata about each language resource and include pointers to its content.
• The resource element describes a language resource of any kind (including, but not limited to, text corpora) that has been used as source material in the creation of a dictionary.
• The list element (language resource list) contains a list of language resources of any kind that have been used as source material in the creation of a dictionary.

28 By choosing a deliberately broad term as the new element’s name, we try to keep the range of possible language resource types as open as possible without confining their possible function to that of a statistical source.

<sourceDesc>
   <bibl>Österreichisches Wörterbuch ...</bibl>
   ...
</sourceDesc>

29 Making the new list a child element of <sourceDesc> tries to address both kinds of corpus–dictionary relations we have come across in our projects: it takes into account cases where a “born-digital” dictionary draws most of its material from language resources—making them an important, but still intermediate, source—while still drawing a clear line between the source of a dictionary’s <text> and the source of the statistical data the <text> has been enriched with.
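A minimal sketch of this arrangement is given below. The element names <listResource> and <resource> are hypothetical stand-ins for the proposed list and item elements, and the xml:id values are invented for illustration:

<sourceDesc>
   <bibl>Österreichisches Wörterbuch. 42. Auflage. Wien: ÖBV 2012</bibl>
   <!-- hypothetical list of consulted language resources -->
   <listResource>
      <resource xml:id="amc">amc – Austrian Media Corpus</resource>
      <resource xml:id="wikiMasri">Wikipedia Masri</resource>
   </listResource>
</sourceDesc>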

30 The purpose of the resource element is twofold: first of all, it enables a user to locate and access the language resource personally. Secondly, it provides a basic description of the resource through a series of appropriate properties. Although this information is likely to be part of the metadata held alongside the corpus data itself, it seems reasonable to keep a summary of it with the dictionary, especially as corpora may become inaccessible or evolve over time. The TEI Guidelines already provide suitable components for the new element’s content model, with the <extent>/<measure> construct being a natural candidate for the representation of corpus size.

Austrian Media Corpus
Austrian Centre for Digital Humanities, Austrian Academy of Sciences
Sonnenfelsgasse 19/8, 1010 Vienna, Austria
1987–2012 | 2013 | 2014-01-30
Contains articles from Austrian newspapers and magazines between 1987 and May 2012.
PoS tagging with “Tree Tagger” and “RFTagger”

31 Modeling the new element on an existing TEI construct helps us address some specific limitations of the language resources we currently have to make do with: it lets us distinguish between various kinds of data via the @type and @subtype attributes, while the @status attribute indicates whether the data in question are expected to be subject to change or not. Since we consider it important to document the essential properties of a corpus (status, size, date of its content) in a consistent manner, we decided to narrow down its possible components significantly and create a specially tailored version.
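By way of illustration only, a tailored description of a single corpus along these lines might look as follows; <resource> remains a hypothetical element name, the attribute values and the @type labels on the dates are assumptions, and the child elements are ordinary TEI components:

<resource xml:id="amc" type="corpus" status="stable">
   <title>Austrian Media Corpus</title>
   <publisher>Austrian Centre for Digital Humanities, Austrian Academy of Sciences</publisher>
   <address>
      <addrLine>Sonnenfelsgasse 19/8, 1010 Vienna, Austria</addrLine>
   </address>
   <date type="content">1987–2012</date>
   <date type="publication">2013</date>
   <date type="access">2014-01-30</date>
   <extent>
      <measure unit="tokens">...</measure>
   </extent>
   <note>Contains articles from Austrian newspapers and magazines between 1987 and May 2012.</note>
   <note>PoS tagging with “Tree Tagger” and “RFTagger”</note>
</resource>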

32 Many other aspects of language resources may be desirable to record—especially considering questions of mid- and long-term preservation. Since data used in lexicographic production are most likely to be distributed over a wide range of organizations and locations, accessibility cannot be assumed in all cases. In order for a user to be able to assess the relevance of any language resource in the context of the final dictionary, a multitude of parameters has to be taken into consideration. For instance, it would be important to identify inherent biases in the sociological, geographic, or diachronic sampling of language data or technical limitations in its markup. The German LAUDATIO Project has developed a comprehensive, <teiHeader>-based corpus description specification which is directly aimed at this purpose.13 In light of such issues it seems advisable to allow the inclusion of the <teiHeader> of a corpus (or possibly any descriptive metadata) inside our new element, to document its state at the time of the dictionary’s publication.

4. Documenting Corpus Queries

33 So far, we have defined a way to describe the language resources we want to refer to in the dictionary’s <teiHeader>. This leaves the following questions to be addressed: which information is to be encoded when documenting our corpus queries? Which parts of the dictionary entries do we need to attach this information to? And, finally, how can we establish the linkage between our description of the corpus instance data, the dictionary, and the corpus metadata held in the <teiHeader>?

4.1 The Tenets of Frequency Information

34 What is needed is a definitive system to register quantifications of particular items represented in dictionary entries. This of course raises the question as to which parts of a dictionary entry can be considered relevant. First to come to mind, of course, are headwords. But there are many other constituents of dictionary entries that might be furnished with frequency data: inflected word forms, collocations, multiword units, and particular senses are relevant items in this respect.

35 The data model should not only provide elements to encode frequencies within elements describing the above-listed constituents of entries, but also allow indication of the source from which the data were gleaned and how the statistical information was created. The basic constituents of our model should contain these items:
• Value (number of occurrences of the particular item)
• Rank
• Provenance (source from which the data is taken)
• Retrieval method (how the statistical information was created)
• Query type
• Evaluation mode

36 Ideally, persistent identifiers should be used to identify not only the corpora but also the services involved in creating the statistical data.

4.2 Spoilt for Choice

37 In our attempts to design a viable solution to our encoding problems, we went through three stages: (1) we tried to make use of some TEI elements with very flexible semantics and to provide them with @type attributes; (2) we tried to apply TEI feature structures; and (3) we started to work on a new customization.


4.2.1 Catch-all Elements

38 As is well known, there are some TEI elements which can be used for almost anything by furnishing them with @type attributes; <note>, <seg>, and a few similar catch-all elements readily lend themselves to purposes such as ours through their very general semantics. Early attempts of ours to model frequency data made use of such general-purpose elements, resulting in constructs such as those in the example below:

<entry>
   <form type="lemma">
      <orth>mašʕal</orth>
      <orth xml:lang="ar-arz-x-cairo-arabic">مشعل</orth>
      <!-- frequency data encoded with general-purpose elements: 6, 2456 -->
   </form>
   <gramGrp>
      <gram type="pos">noun</gram>
      <gram type="root">šʕl</gram>
   </gramGrp>
   <form type="plural">
      <orth>mašāʕil</orth>
      <orth xml:lang="ar-arz-x-cairo-arabic">مشاعل</orth>
      <!-- frequency data: 2, 23765 -->
   </form>
</entry>

39 In this example, the statistical information refers to the <form> elements. This attempt soon seemed unsatisfactory: although sufficiently versatile in its content, such a general-purpose element is intended only to be placed in a handful of specific “wrapper” elements of the dictionary module but is disallowed in a lot of other contexts where frequency information is potentially relevant. In particular, this would have excluded the possibility to embed frequency data in elements containing grammatical information (<gram> and its syntactic-sugar equivalents <pos>, <gen>, <number>, etc.) as well as in <orth> and <pron>. The other constructs mentioned above proved equally problematic: the definition of <seg> (“represents any segmentation of text below the ‘chunk’ level”)14 hardly permits arbitrary data structures like the ones we needed; <note>, on the other hand, was too likely to semantically interfere with editorial notes (footnotes, marginal notes) in a retro-digitized dictionary; and another such catch-all element would have forced us to use bulky pointing mechanisms to express the relationship between the various parts of an entry and the attached frequency information, since it is only allowed as a sibling of <entry>.

4.2.2 Feature Structures

40 In a second approach, we used feature structures, a very versatile, sufficiently well-explored tool for formalizing all kinds of linguistic phenomena. One of the advantages of the <fs> element is that it can be placed inside most elements used to encode dictionaries.

<entry>
   <form type="lemma">
      <orth>mašʕal</orth>
      <orth xml:lang="ar-arz-x-cairo-arabic">مشعل</orth>
      <fs>...</fs>
   </form>
   <gramGrp>
      <gram type="pos">noun</gram>
      <gram type="root">šʕl</gram>
   </gramGrp>
   <form type="plural">
      <orth>mašāʕil</orth>
      <orth xml:lang="ar-arz-x-cairo-arabic">مشاعل</orth>
      <fs>...</fs>
   </form>
</entry>
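A sketch of what such a feature structure could contain is shown below; the feature names, the @type value, and the placement inside <form> are assumptions for illustration, with the figures taken from the earlier example:

<form type="lemma">
   <orth>mašʕal</orth>
   <orth xml:lang="ar-arz-x-cairo-arabic">مشعل</orth>
   <fs type="corpusFrequency">
      <f name="frequency"><numeric value="6"/></f>
      <f name="rank"><numeric value="2456"/></f>
      <f name="corpus"><string>Wikipedia Masri</string></f>
   </fs>
</form>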

41 Feature structures are impressively simple to use and have a lot of advantages when it comes to modeling abstract structures and their relations. However, they are not very human-readable, nor are frequencies a “feature” of the entry’s components in the strict sense of the word.

4.2.3 Attempting Customization

42 All these “conservative” attempts adopted existing elements and resulted in solutions which appeared to be far from perfect, especially since the general-purpose elements involved are void of relevant semantics. This—in the end—made us think of alternatives: customizing our dictionary schema and adding a set of objects (attributes and elements) to describe frequencies in context. With a wide range of different application scenarios in mind, we attempted to design something like a statistical crystal that could also be reused beyond our particular projects.

43 We gave the root element that carries frequencies a name glossed as “statistical information.” We chose a generic name rather than making use of narrower terms such as “corpusFrequency,” as the data we wanted to use might come from language resources other than text corpora, such as word lists, other dictionaries, or databases aggregating statistical data from external sources. The chosen term appeared to be semantically correct and would allow us to keep options open to other scenarios.
• The new element (statistical information) contains statistical information about instances of any component of a dictionary entry in one or more language resources.

44 The next step was to find a way to indicate where the statistical information came from. Intuitively, one might expect a <source> element. However, <source> already exists in the Manuscript Description module (and can only be used as a child of a <recordHist> element). As we considered it good practice to avoid denominational ambiguities, we eschewed using “source” in the namespace of the customization, but introduced a differently named element which, through membership of the att.canonical class, inherits the attributes @key and @ref, with the latter providing the mechanism to point to a language-resource description in the <teiHeader>.

<entry>
   <form type="lemma">
      <orth>mašʕal</orth>
      <orth xml:lang="ar-arz-x-cairo-arabic">مشعل</orth>
      ...
   </form>
   ...
</entry>
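An entry-level block following this proposal might look as sketched below; <statInfo> and <resource> are hypothetical names for the proposed statistics container and its pointer element, and the @ref target assumes a resource description with a matching xml:id in the <teiHeader>:

<form type="lemma">
   <orth>mašʕal</orth>
   <orth xml:lang="ar-arz-x-cairo-arabic">مشعل</orth>
   <statInfo>
      <!-- @ref points to the language resource described in the teiHeader -->
      <resource ref="#wikiMasri"/>
      <measure type="frequency" quantity="6"/>
      <measure type="rank" quantity="2456"/>
   </statInfo>
</form>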

45 In the field of our research, we are still far away from reliable reference corpora in the proper sense of the word. At the moment, we are instead in a situation where we have to integrate anything available for lack of anything better. For comparative purposes it might therefore be important to have a list with several resource references to give users a more complete picture of the available data beyond the resource proper.

46 The statistical information itself would remain in the TEI namespace. This is exactly the same construct which we have already proposed above.

47 A key issue here is the access mode. A dedicated element has been proposed to accommodate information regarding the query and possible modifications to the result set. This information is dealt with in two child elements: one holding the query itself and one documenting any postprocessing of the results.

lemma="go" manual

48 The query element contains the query string. It should also have a @type attribute indicating the applied query language. In the example above, CQP (Corpus Query Processor) refers to the query language of the IMS15 Corpus Workbench. The postprocessing element can be filled with either "none", which implies that the data was retrieved automatically, or "manual", which should be applied when some kind of postprocessing has been done.
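A sketch of the corresponding markup, using <query> and <postEditing> as hypothetical names for the two child elements described above:

<statInfo>
   <query type="CQP">lemma="go"</query>
   <postEditing>manual</postEditing>
</statInfo>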

49 For purposes of reproducibility it may be desirable to document the various steps that produced the final set of records with finer granularity. In this case, the single postprocessing value could be replaced by a list element containing a series of numbered editing steps, each holding one query expression.

word=".*lein" Filtering out named entities [pos="NE"] Filtering out adjectives and adverbs [pos="ADJ|ADJD|ADV"]

50 The construct above indicates that the results of the original query ('word=".*lein"') were first reduced by excluding any named entities (step number 1), and the resulting subset narrowed down by removing any adjectives and adverbs (step number 2). The frequency data in a subsequent element would relate to the final set of records produced by these steps.
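A possible encoding of such a stepwise refinement is sketched below, again with hypothetical element names (<editList>, <edit>, <query>) standing in for the proposed constructs; the query strings are those from the example above:

<query type="CQP">word=".*lein"</query>
<editList>
   <edit n="1">
      <!-- filtering out named entities -->
      <query type="CQP">[pos="NE"]</query>
   </edit>
   <edit n="2">
      <!-- filtering out adjectives and adverbs -->
      <query type="CQP">[pos="ADJ|ADJD|ADV"]</query>
   </edit>
</editList>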

51 Of course, this kind of detail is not attainable without the help of specialized software. Aiming toward tighter integration of language resources with dictionaries, we imagine a next generation of dictionary editors providing facilities to query language resources and keep track of user-driven modifications of the results, and offering functions to embed them in the markup. Until implementations have reached this level of integration, we have to rely on a combination of components to support this functionality. An example of this is the commercial product Sketch Engine,16 which includes a web interface for querying language corpora. In particular, it offers the ability to download the resulting frequency data (as well as concordances) in its own, vendor-specific, XML format. This format also contains the sequence of query expressions that leads to the final result set. Thus, a simple XSL stylesheet can be used to transform this into the format we have proposed above.

amc word,[word=".*lein"] 0 0 1 [pos="NE"]0 0 1 [pos="ADJ|ADJD| ADV"] word ... fräulein 23515 ...

52 A sufficiently descriptive markup for query results, however, is quite verbose by nature. Depending on the context this practice can have undesirable consequences; for instance, the XML data may become hard to read or, more seriously, whitespace issues may be introduced. If, for example, a dictionary writer states that an entry’s headword can occur in two orthographic forms and wants to underpin this assertion with frequency data from a corpus, he or she would have to use a construct like the following, leaving it technically unclear where the headword ends and the statistical metadata begins.


eintepschen [word=".*tepsch.*"] [word="Step.*"] excluding composites with “Step” eindepschen [word=".*depsch.*"] none

53 In order to avoid ambiguities, we propose an alternative encoding style: instead of specifying the relation between the instance data and the lexicographic description by nesting the former inside the latter, we suggest using the linking module’s @corresp attribute to point from the dictionary to a separate element which can be kept in the dictionary’s back matter or in another document instance. Although this might add some processing overhead to an application, it improves maintainability and readability significantly by separating the two layers of information.


eintepschen eindepschen ...
... [word=".*tepsch.*"] [word="Step.*"] excluding composites with “Step” [word=".*depsch.*"] none
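A sketch of this two-part arrangement is given below; the element names, identifiers, and the use of a back-matter <div> are assumptions for illustration, while the headwords and query strings are those of the example above:

<form type="lemma">
   <orth corresp="#freq_eintepschen">eintepschen</orth>
   <orth corresp="#freq_eindepschen">eindepschen</orth>
</form>
<!-- kept in the dictionary's back matter or in a separate document -->
<div type="statInfo">
   <statInfo xml:id="freq_eintepschen">
      <query type="CQP">[word=".*tepsch.*"]</query>
      <query type="CQP">[word="Step.*"]</query>
      <note>excluding composites with “Step”</note>
   </statInfo>
   <statInfo xml:id="freq_eindepschen">
      <query type="CQP">[word=".*depsch.*"]</query>
      <postEditing>none</postEditing>
   </statInfo>
</div>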

54 To take all of this further, we will first create more real-world data with the customized schema and test them in applications currently under development. In addition, it will be necessary to keep discussing the customized dictionary schema with the community. A final goal (the realization of which is admittedly not yet near at hand) is the integration of a viable solution into the TEI Guidelines.

5. Conclusions

55 A major interest that has accompanied our experiments is the clearly discernible phenomenon of blurring boundaries between digital language resources. Data available in one resource can be integrated into others; creating new resources from pre-existing ones has become much more feasible. We strongly believe that the permeability between language resources will also change the way we look at corpora and dictionaries. In the digital world the two grow closer and closer. Not only do they depend on each other (dictionaries need corpora to be compiled, while corpus tools need dictionaries for annotation); users will increasingly want to use them together, ideally in the same interfaces.

56 Some of the problems described in this paper have not been dealt with so far because readily and freely accessible language resources are not as abundant as one might assume. However, funding agencies increasingly insist on open access not only to research results but also to research data. It is therefore to be hoped that the situation with respect to openly accessible lexicographic data as well as to electronic corpora will improve in the years to come. Solutions for integrating these data and/or accessing various resources simultaneously will become even more important. Thinking about how particular language resources can interact and working on appropriate interfaces is an indispensable prerequisite for more linked (open) data and service-based architectures. The more such data become available, the more important it will become for the TEI to provide viable solutions for dealing with them.

BIBLIOGRAPHY

Budin, Gerhard, Stefan Majewski, and Karlheinz Mörth. 2012. “Creating Lexical Resources in TEI P5: A Schema for Multi-purpose Digital Dictionaries.” Journal of the Text Encoding Initiative 3. http://jtei.revues.org/522. doi:10.4000/jtei.522.

ISO (International Organization for Standardization). 2008. Language Resource Management — Lexical Markup Framework (LMF). ISO 24613:2008. Geneva: ISO.

Romary, Laurent. 2013. “TEI and LMF Crosswalks.” In Digital Humanities: Wissenschaft vom Verstehen (forthcoming), edited by Stefan Gradmann and Felix Sasaki. Humboldt Universität zu Berlin. http://hal.inria.fr/hal-00762664.

Siam, Omar. 2013. “Ein digitales Wörterbuch der 200 häufigsten Wörter der Wikipedia in ägyptischer Umgangssprache.” MPhil thesis, University of Vienna. http://othes.univie.ac.at/26036.

Stehouwer, Herman, Matej Durco, Eric Auer, and Daan Broeder. 2012. “Federated Search: Towards a Common Search Infrastructure.” In Proceedings of LREC 2012: 8th International Conference on Language Resources and Evaluation (LREC 2012), edited by Nicoletta Calzolari et al., 3255–59. European Language Resources Association. http://hdl.handle.net/11858/00-001M-0000-000F-9E1D-8.

TEI Consortium. 2014. TEI P5: Guidelines for Electronic Text Encoding and Interchange. Version 2.6.0. Last updated January 20. http://www.tei-c.org/Vault/P5/2.6.0/doc/tei-p5-doc/en/html/.


NOTES

1. Masri Wikipedia Introduction in English.
2. Arabic in the Middle Atlas Mountains (Morocco), University of Graz, FWF Project Number P 21722-G20. See http://sprachausbau.uni-graz.at/de/forschen/sprachen-in-nordafrika-marokko/beschreibung/.
3. The first campaign was undertaken in 2013. The two investigators recorded some 22 hours of audio material, which have been transcribed and are being analyzed.
4. For a concise overview of these formats compare section 3 “Data Formats” in Budin, Majewski, and Mörth 2012 (http://jtei.revues.org/522#tocto1n3).
5. See Budin, Majewski, and Mörth 2012 and Romary 2013b.
6. While entries and example sentences were originally located directly inside the <body>, we have now settled for a clearer distinction between the two. Since most of our dictionary data is created and maintained using the Viennese Lexicographic Editor and held in a relational database (described in Budin, Majewski, and Mörth 2012), format changes like this can be applied transparently on all of our dictionaries, making it easy to ensure structural homogeneity.
7. For more detail see Budin, Majewski, and Mörth 2012.
8. The example above exhibits some features which might seem idiosyncratic at first glance, namely the usage of the @ana attribute and the values in the @xml:lang attributes. The fragment identifier in @ana refers to a feature library elsewhere in the document, providing a concise notation for morphosyntactic annotations. In the example above, it defines the inflected <form> to bear the features noun and plural. The composition of the @xml:lang attributes is an extension to the BCP 47 standard tags, which proved necessary in order to provide a higher degree of locational granularity (see Budin, Majewski, and Mörth 2012).
9. One might wish to have a @role attribute at hand to express this differentiation, yet given its already fairly vague semantics that does not seem advisable either. While @role is predominantly used in the realm of names and named entities, it is also defined in att.tableDecoration, where it indicates whether a cell holds actual data or just a label. Defining it locally on <bibl> would only overload it with another, highly specialized sense and seems a makeshift strategy.
10. TEI Consortium 2014: Appendix C Elements, http://www.tei-c.org/Vault/P5/2.6.0/doc/tei-p5-doc/en/html/ref-editorialDecl.html.
11. TEI Consortium 2014: Appendix C Elements, http://www.tei-c.org/Vault/P5/2.6.0/doc/tei-p5-doc/en/html/ref-samplingDecl.html.
12. TEI Consortium 2014: Appendix C Elements, http://www.tei-c.org/Vault/P5/2.6.0/doc/tei-p5-doc/en/html/ref-encodingDesc.html.
13. http://www.laudatio-repository.org/repository/files/documentation/corpus/teiODD_LAUDATIODocumentation_S6.zip.
14. TEI Consortium 2014: Appendix C Elements, http://www.tei-c.org/Vault/P5/2.6.0/doc/tei-p5-doc/de/html/ref-seg.html.
15. Institut für Maschinelle Sprachverarbeitung, Stuttgart.
16. https://www.sketchengine.co.uk.


ABSTRACTS

Academic dictionary writing is making greater and greater use of the TEI Guidelines’ dictionary module. And as increasing numbers of TEI dictionaries become available, there is an ever more palpable need to work towards greater interoperability among dictionary writing systems and other language resources that are needed by dictionaries and dictionary tools. In particular this holds true for the crucial role that statistical data obtained from language resources play in lexicographic workflow—a role that also has to be reflected in the model of the data produced in these workflows. Presenting a range of current projects, the authors address two main questions in this area: How can the relationship between a dictionary and other language resources be conceptualized, irrespective of whether they are used in the production of the dictionary or to enrich existing lexicographic data? And how can this be documented using the TEI Guidelines? Discussing a variety of options, this paper proposes a customization of the TEI dictionary module that tries to respond to the emerging requirements in an environment of increasingly intertwined language resources.

INDEX

Keywords: lexicography, language resources, digital corpora, statistics

AUTHORS

KARLHEINZ MÖRTH Karlheinz Mörth is currently director of the Austrian Centre for Digital Humanities of the Austrian Academy of Sciences, senior researcher and project leader, lecturer at the University of Vienna, head of the DARIAH National Coordinator Committee and vice-chair of the CLARIN Standards Committee. With a broad background in cultural, literary, and linguistic studies, he has worked on a number of scholarly digital projects. He has contributed to the design and creation of a number of digital language resources, taking responsibility for text encoding and software development. His current research activities focus on eLexicography, text technology for linguistic research and standards for digital language resources.

LAURENT ROMARY Laurent Romary is Directeur de Recherche for INRIA (France) and a guest scientist at Humboldt University (Berlin, Germany). He carries out research on the modeling of semi-structured documents, with a specific emphasis on texts and linguistic resources. He received a PhD in computational linguistics in 1989 and his Habilitation in 1999. He launched and directed the Langue et Dialogue team at Loria (Nancy, France) and participated in several national and international projects related to the representation and dissemination of language resources and to human–machine interaction, coordinating the MLIS/DHYDRO, IST/MIAMM, and eContent/Lirics projects. He has been the editor of ISO standard 16642 (TMF – Terminological Markup Framework) and is the chairman of ISO committee TC 37/SC 4 on Language Resource Management, as well as member (2001–2007), then chair (2008–2011), of the TEI Council. In recent years, he has led the Scientific Information directorate at CNRS (2005–2006) and established the Max-Planck Digital Library (Sept. 2006–Dec. 2008). He is currently director of DARIAH-EU.

GERHARD BUDIN Gerhard Budin is a full professor of terminology studies and translation technologies at the Centre of Translation Studies at the University of Vienna, former director of the Institute for Corpus Linguistics and Text Technology of the Austrian Academy of Sciences, member (kM) of the Austrian Academy of Sciences, and holder of the UNESCO Chair for Multilingual, Transcultural Communication in the Digital Age. He also serves as vice president of the International Institute for Terminology Research and chair of a technical subcommittee in the International Standards Organization (ISO) focusing on terminology and language resources (ISO/TC 37/SC 2 2001–2009, SC 1 2009–present). His main research interests are language technologies, corpus linguistics, and knowledge engineering, E-Learning technologies and collaborative work systems, distributed digital research environments, terminology studies, ontology engineering, cognitive systems, cross-cultural knowledge communication and knowledge organization, philosophy of science, and information science.

DANIEL SCHOPPER Daniel Schopper is a junior scientist at the Austrian Centre for Digital Humanities of the Austrian Academy of Sciences and head of its working group Data, Resources and Standards. Coming from a background of German philology and literature studies, he became involved with the digital humanities over the course of various edition projects, ranging from seventeenth-century drama to database-driven analysis of autobiographic writing, and is currently expanding his field of activity into linguistics-related areas. His research interests include interdisciplinary applications and methodologies between language corpora and literature studies, user–text interfaces, and web-based research environments.


TEI Processing: Workflows and Tools


From Entity Description to Semantic Analysis: The Case of Theodor Fontane’s Notebooks

Martin de la Iglesia and Mathias Göbel

1. Project Overview

1 The German writer Theodor Fontane (1819–1898) produced many novels, poems, essays, and other types of texts, including the notebooks which we are currently editing (Radecke 2010, Radecke 2013). Fontane’s notebooks constitute one of the last bodies of his manuscript materials that is still not entirely published; only a few excerpts have been published, and these do not comply with standards of textual criticism and editing (Radecke 2010). From 1859 until the end of the 1880s, Fontane filled notebooks with a mass of miscellaneous materials, such as diary entries, letter drafts, plans for poetic writing, prose sketches and drafts, lecture notes, drafts for theater and arts reviews, book excerpts, and also notes and sketches accumulated on his journeys. A total of 67 of these notebooks are known to exist, most of them measuring approximately 10 × 17 cm and containing between 64 and 180 leaves. Some pages have been left blank, so we are dealing with less than 10,000 pages with writing on them. Fontane’s handwriting is often hard to decipher, which means that we need to put a lot of effort into transcribing it. Once published, though, this transcription will be a considerable benefit to the reader.

2 We plan to release our edition in 2017 when our project is scheduled to end. The project is titled Genetic-Critical and Annotated Hybrid-Edition of Theodor Fontane’s Notebooks Based on a Virtual Research Environment.1 The term hybrid edition indicates that it is going to be published both online, in open access form, and as a printed book by Walter de Gruyter. The term Virtual Research Environment in the title of the project refers to TextGrid,2 which we use in combination with the oXygen XML Editor3 to produce our TEI-encoded transcription as well as to store both the digital images and the XML files permanently. The project is funded by the German Research Foundation (DFG) and carried out by the Theodor Fontane Research Centre at Göttingen University, in cooperation with Göttingen State and University Library. The Berlin State Library is the owner of the notebooks and an associated partner of the project. Our core project team consists of three philologists at the University, including the project manager and official editor of the edition, Gabriele Radecke, and, at the university library, an IT and a metadata specialist. The workflow has been set up as follows: the philologists survey the material, determine the editorial concept and principles, transcribe the notebooks, and encode them in TEI according to a TEI P5 subset scheme, based on the elements for genetic editions, specified by the metadata specialist. This TEI code is then XSL-transformed into HTML and integrated into the project’s website by the IT specialist. This transformation can be invoked from within the TextGrid environment at any time. Thus, the HTML file not only is used for publication on the website, but also supports the editors in transcribing and encoding the text of the notebooks by providing an on-the-fly visualization of their work which is often easier to check for errors than the XML text. (A detailed description of this workflow can be found in Radecke, Göbel, and Söring 2013.)

3 All of the examples in this paper use data from one of the 67 notebooks (shelf mark C07). C07 contains some of the notes taken during or shortly after Fontane’s second journey to the German regions of Thuringia and Franconia in the summer of 1873. Fontane visited several different places there, including the battlefields of the Napoleonic Wars of the early nineteenth century, and sites related to the sixteenth-century Protestant reformer Martin Luther. Some notes in C07 refer to a book about Thuringia that Fontane planned to write, but apparently never did (Wüsten 1973).

2. Encoding References to Entities in Theodor Fontane’s Notebooks

4 Obviously, a notebook about Thuringia and its history is bound to contain many references to entities, primarily places and persons, and also dates. In order for these references to be analyzed, they first had to be verified by the philologists and then tagged as references to entities in our TEI code. Despite the progress in automatic Named Entity Recognition technology (Wettlaufer and Thotempudi 2013), it turned out to be less useful for our purposes as the dataset we are dealing with is relatively small and we aim for high precision in identifying references to entities, including indirect references such as pronouns. Therefore we are identifying and tagging all references manually. For notebook C07 we have already completed this task, using the TEI <rs> element (referencing string) to point out words in the notebooks which refer to entities.

Example 1. Example of <rs>.

5. Luther tritt als Mönch in das Auguſtinerkloſter
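A sketch of how such a line is tagged is given below; the wrapping <line> element, the @type values, and the pointer targets are assumptions for illustration, while the edition's actual identifiers will differ:

<line>5. <rs type="person" ref="#Luther">Luther</rs> tritt als Mönch in das <rs type="place" ref="#Augustinerkloster">Auguſtinerkloſter</rs></line>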


5 Reference attributes (@ref) point to nodes located elsewhere in the TEI dataset. It should be noted that the organization of the TEI dataset and the location of the entity notes therein is of no importance to the reference linking mechanism described here. At the current stage of our project, the TEI code for each of the 67 notebooks is stored in its own TEI document, with the entity indexes stored within each TEI header. However, combining the 67 TEI documents into one large document with the entity information stored in a single header, or placing all entity data in a separate TEI document altogether, would be just as feasible. In the entity data nodes, additional information is provided on the entities, such as hyperlinks to authority file records, or a classification into person or place.

Example 2. Example of <person> and <place>.

118575449 2929670 164587492
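The corresponding index entries in the header might look roughly as follows; the wrapping elements, the @type values on <idno>, and the assignment of the three identifiers shown above to particular entities and authority files are assumptions for illustration:

<listPerson>
   <person xml:id="Luther">
      <persName>Martin Luther</persName>
      <idno type="GND">118575449</idno>
   </person>
</listPerson>
<listPlace>
   <place xml:id="Augustinerkloster">
      <placeName>Augustinerkloster</placeName>
      <idno type="geonames">2929670</idno>
      <idno type="OSM">164587492</idno>
   </place>
</listPlace>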

6 We will show further entity types in a later example. In the case of the authority files, the pointers from the TEI code can be either simple URIs (or just authority file identifiers stored in <idno> elements from which URIs can be easily built), or qualified hyperlinks (e.g., in dedicated linking elements) using a vocabulary such as CIDOC CRM (Le Boeuf et al. 2013) in order to express relations which may support a Linked Data structure. The processing of these links to authority records as described in this article works in either case, as long as valid URIs are provided. The philologists validate the entries, and if there is an error within the hyperlinked authority data we will store corrected values or extended information in our TEI dataset. Chronological references are encoded in a simpler way, using the <date> element and a direct normalization within attributes of the att.datable class.

7 Our method of encoding references to entities is fairly similar to that employed in other TEI editions. For instance, in the edition of William Godwin’s Diary (Myers, O’Shaughnessy, and Philp 2010), the more specific elements like <persName> and <placeName> are used instead of the <rs> element. In the Godwin edition, a @ref attribute points to an HTML website, which however is in most cases based on a TEI document consisting of a <person> element. Within this element, the normalized name of the person and biographical information are provided. Many other TEI projects use entity referencing mechanisms like this, even though the elements, attributes, and attribute values may vary. In many cases, tools for entity analysis and visualization can be applied to different data sources with minimal adaptation effort, so no standardization of the respective TEI source code is required. A problem we did encounter is that some TEI projects do not provide direct access to their XML files, which makes them harder to process automatically.

3. Semantic Analysis

8 Many TEI edition projects already encode references to entities, but semantic analysis entails far more than just knowing which entities have been mentioned by an author. Semantic analysis as we understand it is a methodological approach that builds upon such tagged entities. Applications query these entity data, aggregate them, possibly enrich them with or link them to external data, perform calculations on them (e.g., sorting algorithms), and generate different kinds of output. This output can be in the shape of text, images, moving images, or potentially any other medium, and can be either simple and concise or complex and rich, but may provide new insights into the data present in the TEI-encoded material. This kind of analysis shifts the focus from the written signifiers to the actual signified entities, and tries to make claims about the mentioned persons, places, dates, works, events and other entities, and their interrelations. This in turn allows conclusions about the TEI-encoded work in which these are referenced, as we show in the examples below.

9 Such semantic analysis applications are typically external to the TEI code, but need not be external to the TEI-encoded edition as a whole.4 For instance, this kind of application might be offered as part of the edition website or might be run as a standalone tool by a third party, querying remote TEI data. These possibilities are discussed in the final section of this article.

10 The semantic analysis methods described here should not be confused with Linked Data processing (Berners-Lee 2006), which relies on explicitly qualified relations between entities (typically in the form of RDF triples). Such data can be derived from the TEI data discussed here, but no Linked Data as such is present in the datasets used in our project.

3.1 Timeline

11 As our first example of semantic exploration, we would like to take a look at the persons mentioned in Fontane’s notebook C07, and ask: who were those people, or more specifically: were they Fontane’s contemporaries or, from Fontane’s point of view, historical figures? In the latter case, in which period did they live? Answering these questions will give us an idea of the different historical strata treated in this notebook. Our tool for this purpose will be the Timeline Widget5 from the SIMILE (Semantic Interoperability of Metadata and Information in unLike Environments) collection of open source data visualization tools, originally developed at MIT. A SIMILE Timeline consists of events, associated with either a point in time or a duration, which are plotted on a chronological, in this case horizontal, axis. If we want to visualize the persons mentioned in the notebook on such a timeline, and use their lifespans as durations, where do we get their birth and death dates? As mentioned earlier, our references to entities are linked to authority records. In the case of persons, we found the Integrated Authority File (German “Gemeinsame Normdatei,” GND)6 by the German National Library to be the most useful and complete data source for our purpose. We have assigned a GND identifier to 37 of the 47 person entities referenced in the notebook C07. The remaining 10 entities are groups of people rather than individuals, such as “the German emperors” or “the Thuringian landgraves.” Although the GND does provide records for some of these entities, they do not contain much useful information for further investigation.

12 The SIMILE Timeline widget consists of an HTML document that uses JavaScript to process an XML file written in a simple XML markup language specific to SIMILE.

Example 3. Example of SIMILE XML.
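The SIMILE event source is a simple XML file with one event element per timeline entry. A sketch along the lines of the widget's commonly documented format, with values chosen for illustration, might look like this:

<data>
   <event start="Nov 10 1483 00:00:00 GMT" end="Feb 18 1546 00:00:00 GMT"
          isDuration="true" title="Luther, Martin">
      Martin Luther (1483–1546)
   </event>
</data>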

13 We have written an XSLT stylesheet in order to produce the HTML document and at the same time generate the required XML data from the TEI code of our notebook. During the transformation, this stylesheet picks up the GND identifiers of all person entities in the TEI code and looks up the corresponding GND record online at the German National Library. It then fetches the date of birth and date of death of the persons from the GND RDF/XML record, as well as their normalized names, and uses these names as labels in the timeline. With only 100 lines of code, this XSL document is quite short and simple, but still checks for missing birth or death dates in the GND record to substitute a calculated estimate for the missing value, as we will show below. Person entities for which neither a precise birth date nor death date can be found in the GND are not included in the timeline at all, which is the case for one of the 37 persons referenced in the notebook (the Thuringian king Hermannfrit, whose birth and death dates are provided in a non-standard form as the text string “ca. ?–531,”7 which is not recognized by the stylesheet used).

Figure 1. Complete SIMILE timeline for notebook C07.8

14 The resulting timeline (figure 1) consists of two bands, the lower one aggregating the upper one on a larger scale. In the upper band, light blue bars indicate that the date of birth (that is, the beginning of the bar) is an estimate, and only the death date (the end of the bar) could be fetched from the GND record. Our timeline begins in the fifth century CE, when Thuringia was under Frankish rule and the Frankish king Merowech defended Thuringia against the attacks of Attila the Hun. In the Middle Ages we see the long line of Thuringian landgraves (most of them called either Ludwig or Friedrich) who ruled Thuringia until the year 1440 when it became part of Saxony. The timeline ends in the early nineteenth century with Napoleon and other military commanders who fought in the battles of Jena and Auerstedt during the Napoleonic Wars. This timeline enables us to tell which historic periods are covered in the notebook, if we assume that the mentioned persons are a suitable indicator for that. We can also see that, at least in this notebook, Fontane did not mention any of his contemporaries.

15 It can be very interesting to compare the timeline of one of Fontane’s notebooks to a timeline created from a different TEI data source. For this purpose we again turn to the edition of William Godwin’s Diary, as it resembles Fontane’s notebooks in both the time of creation and the nature of the material. Of course, a diary is a different medium than a notebook, but both contain a sufficient number of references to person entities that we can display on a timeline. To narrow the material down, we selected a single year’s worth of diary entries, for the year 1835, which is the last complete year within the scope of Godwin’s diaries. The XSLT code to create the timeline has to be adjusted to the Godwin TEI code, though only slightly. The main difference is that birth and death dates are already contained within the elements, and do not need to be retrieved from elsewhere. Again, a filter was applied for missing birth and death dates. Overall, this XSL document is even shorter and simpler at only 75 lines.

16 The resulting timeline (figure 2) shows us many people who were all either alive in 1835 or recently deceased, such as John Curran. (Godwin refers in his diary to the removal of Curran’s body from London to Ireland at the time of writing.)

Figure 2. SIMILE timeline for William Godwin’s diary entries of the year 1835.


17 The comparison of the two timelines shows a marked difference: while Fontane exclusively mentions historical figures in his notebook C07, Godwin’s diary entries of 1835 are concerned with the present, as can be expected from a diary. The advantage of this method is that it may give an overview of some aspects of the content of large amounts of textual data in a short time, without being prone to the bias of human annotators and indexers.

3.2 Geospatial-Temporal Data Aggregation

18 The referencing of places within our encoded documents is realized in the same way as the referencing of persons: a reference element in the text is connected via @ref to a corresponding node elsewhere in the dataset. Sixty-eight different places are mentioned in notebook C07.

19 We used the authority files GeoNames9 and OpenStreetMap (OSM),10 which provide the required data, and we manually selected identifiers from these databases and added them to the TEI dataset, similarly to the procedure for the personal data described above. Again, the use of automatic systems to identify the historical places would not be feasible, as they are sometimes referred to in the notebooks by uncommon phrases which require human interpretation to match them to corresponding modern-day identifiers. Another XSLT script is used to transform the TEI dataset to a Keyhole Markup Language (KML) file, the typical input format for geospatial visualization tools. The XSLT resolves the IDs and retrieves the coordinates from the respective database. For some entries OSM is able to deliver polygons instead of single geographic coordinates, which are then included in the resulting file: for example, for All Saints’ Church in Wittenberg or the Saint Augustine Monastery at Erfurt. Again the transformation is very simple, as KML, the common format for spatial information, is also expressed as XML. This approach tries to pair the place names with the date references that appear in their neighborhood. The geospatial-temporal visualization tool of our choice is the DARIAH Geo-Browser,11 which offers a timeline with a map interface together with different features for selecting data.

20 A notable feature is the use of historical maps to provide better context. The selection of historical maps in the Geo-Browser is still limited, but it is also possible to load one’s own overlays. Furthermore, the data are presented in tabular form with a search function. In addition to the mandatory data (at least one place name with latitude and longitude), HTML code can be inserted in the KML file. A useful method is to include backlinks to the digital edition or, more specifically, to the page where the selected place can be found in the manuscript. These hyperlinks will appear directly in the geographical information system. We integrated an embedded version of this tool into our website, which is built on an eXist database.12 An XQuery script executes the transformation, stores the KML file in the database, and generates the required embed element, whose attribute value contains the parameters to control the Geo-Browser: a URL-encoded string which points to the previously generated KML file within our database is passed to the tool.
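As an illustration, a single entry in the generated KML file might look roughly like this (place, date, coordinates, and the backlink URL are invented; the markup actually produced by our scripts may differ):

<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark>
      <name>Erfurt</name>
      <!-- temporal value taken from a nearby date reference in the notebook -->
      <TimeStamp><when>1521</when></TimeStamp>
      <!-- HTML in the description carries the backlink to the manuscript page -->
      <description><![CDATA[<a href="http://fontane-nb.dariah.eu/edition/C07/06r">C07, 6r</a>]]></description>
      <Point><coordinates>11.0328,50.9787</coordinates></Point>
    </Placemark>
  </Document>
</kml>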

21 To find dates corresponding to the named places, we selected an interval of up to eight preceding or following date references, starting from the matching place. If the nearest preceding and the nearest following date are at an equal distance, we prefer the following one. The idea behind this matching criterion is that the proximity of place and time references in the notebooks suggests a semantic link.
Ideally, they describe where and when one single event took place. The chosen interval of up to eight steps was determined by trial and error and yields the highest number of meaningful matches. In the future, users should be able to specify this interval themselves, as well as whether the preceding or the following date is preferred when both are at the same distance from an entity. In the case of notebook C07, 24 items appear in the Geo-Browser’s table. The Geo-Browser displays only 17 of these 24 items on the map, because the others refer to more abstract terms like the Holy Roman Empire (“HRR”) or Thuringia, for which the authority files we use are not able to provide the required data, especially not for the desired time.

22 The dates returned by this algorithm are also used to select a background map provided by the Geo-Browser. The arithmetic mean of the temporal data in this example is 1451.25 CE, and 11 out of 17 dates with a spatial reference specify the early sixteenth century. A suitable background map with the borders of 1492 can be selected via a URL parameter (&currentStatus=mapChanged=Historical+Map+of+1492). Another possible solution is to use the information about the notebook’s date of creation, which can be found on the book cover in most cases. In that case the background map of 1880 is the best selection for notebook C07.

Figure 3. DARIAH-DE-Geo-Browser’s timeline with the data from notebook C07.

23 Place names corresponding to a selection in the timeline are displayed in a table below, where backlinks to the edition can be placed as well as any additional information. The table also includes the terms and values we are not able to show on the map, like the name “HRR” described above.

3.3 Network Analysis

24 The examples above investigate the persons, dates, and places appearing in the text. As mentioned, groups of people have been excluded from our investigation, and there are even more kinds of entities that can be identified in the notebooks: historical events, organizations, and works. To bring them into context with the others in order to generate an aggregated view of all named entities, a network graph can be built. Network visualization is not new to TEI data. Bingenheimer, Hung, and Wiles (2011) introduce “a way of visualizing social networks extracted from a TEI-encoded corpus” (p. 271) consisting of biographic data. The interface is realized with a proprietary plug-in built upon the Prefuse13 software library. One of our goals is to implement the aggregations within the digital edition, and for this we would like to use web technologies only. D3.js (Data-Driven Documents), a JavaScript library created by Mike Bostock, provides a framework for different visualizations. The list of examples14 is a good starting point and gives an introduction to the tool’s functionality. The following implementations are reproductions of the Force-Directed Graph15 and the Hierarchical Edge Bundling16 examples. Again we use an XSL transformation to extract the data from our source.

25 The transformation script matches all entities and generates the required documents. The first document is the HTML file, which contains the needed JavaScript and a
reference to the external D3.js library. The second is a JSON file, which contains one object per entity and one associated array per object that includes a list of connected entities. The tree-like structure of XML allows the transformation of any document to a network graph by selecting elements that share the same ancestor. The only requirement is that the XML input consists of at least two elements. For example, the <surface> element (which is used in the Fontane TEI data to encode a notebook page) is a common ancestor of the entities. This produces a network of co-occurrences; the (undirected) edges mean that the connected nodes are both descendants of a <surface>, that is, both appear on one page, if the condition is defined to match only those <surface> elements that are direct children of <sourceDoc>, because more than one surface may be part of a single page, for example where there are glued-in newspaper articles.
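A sketch of how such page-based co-occurrence could be extracted; the output is shown as simple XML edges rather than JSON for brevity, and the assumption that entity references carry a @ref attribute below <surface> mirrors the description above rather than the exact markup of the edition:

<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0">
  <!-- one group of co-occurring entities per top-level surface, i.e., per notebook page -->
  <xsl:template match="tei:sourceDoc">
    <network>
      <xsl:for-each select="tei:surface">
        <!-- every pair of distinct @ref values on this page becomes an (undirected) edge;
             each pair is emitted twice here, once per direction, for simplicity -->
        <xsl:variable name="refs" select="distinct-values(.//@ref)"/>
        <xsl:for-each select="$refs">
          <xsl:variable name="this" select="."/>
          <xsl:for-each select="$refs[. ne $this]">
            <edge from="{$this}" to="{.}"/>
          </xsl:for-each>
        </xsl:for-each>
      </xsl:for-each>
    </network>
  </xsl:template>
</xsl:stylesheet>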

26 As in the Geo-Browser example above, we assume that the proximity of two references to entities suggests a semantic connection. Naturally, such a connection may also exist between two entity occurrences separated by a page break. Therefore, better criteria for connectedness could be proposed, such as co-occurrence within one sentence, but this requires linguistic markup which is not part of our notebook edition. The notebook page as a unit of semantic coherence is still a relatively meaningful criterion, and the most feasible one due to our usage of the <surface> element.

Figure 4. Force-Directed Graph of entities for notebook C0717

27 The Force-Directed Graph algorithm creates a network starting from randomized positions of the nodes and applies a weight to each node and a strength to each link. Based on these values, the network is rendered. The result for C07 (figure 4) is a reliable network that shows Thuringia in a central position with the most edges; it is the most frequently occurring entity and it is mentioned on several pages together with the
connected ones. The other places with many links are Erfurt, Weimar, and Kapellendorf. Erfurt and Weimar were the cultural centers of Thuringia; Kapellendorf is a village where the last of the battles of Jena and Auerstedt took place during the War of the Fourth Coalition of the Napoleonic Wars. There is a part of the network where ethnic groups (green) appear together, which represents a page of notebook C07 (6 recto) on which Fontane describes the Thuringians in opposition to their neighbors, the Franks, Cherusci, Saxons, and others. This page’s headline is “Thüringens Geschichte” (History of Thuringia), which is also the topic of the following pages. The benefit of the network is that a major topic can be identified at a single glance.

28 The output of this D3.js application is an SVG graphic which can be further transformed. <title> elements are used to store the node names, which modern browsers should display on mouseover. To get a better overview of the entities in the notebook, the node names should actually be inserted as visible labels, but since there is not much space available in the Force-Directed Graph, a different design might be a better choice. The Hierarchical Edge Bundling example (figure 5) provides a circular layout with the nodes in alphabetical order on each level of hierarchy. Again, the hierarchy is based on the <surface> element, whose attribute @n determines the level and is expressed as the first part of the object name within the JSON file. In our case this attribute contains the leaf number with the letter "r" for a recto and "v" for a verso side; the attribute is also transformed into an HTML id, so we can go back from a single entity to the leaf of its first occurrence by generating a hyperlink with the help of JavaScript. If this part is left out, the objects will be sorted in alphabetical order and the network will contain more edges to link those entities that co-occur on one page. Applying the hierarchy allows these edges to be deleted, because the categorization lets the nodes of one page appear together and a bigger gap between the categories marks the border. This is one of the rare cases in which adding more information to a visualization simplifies and refines the output at the same time.
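A sketch of the corresponding JSON objects, assuming the name/imports field names of the original D3.js example (the entries themselves are invented for illustration):

[
  { "name": "06r.Thüringer", "imports": ["06r.Franken", "06r.Sachsen"] },
  { "name": "06r.Franken",   "imports": ["06r.Thüringer"] },
  { "name": "17r.Luther",    "imports": ["17r.Cranach"] }
]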

29 The result is an interactive graphic in which the appropriate edges are highlighted when the cursor is placed over a node. If one selects the node “Luther,” all links and nodes that appear together with a reference to Martin Luther on a page will be highlighted. When one does this, the items within the first and third clusters change their color to red. Thus we get the same information as in the Force-Directed Graph: the topic of Martin Luther is confined to the first part of notebook C07, while Thuringia is the central topic of the rest of the document.

30 Both networks show an outlier. The person entity of Lucas Cranach, a German Renaissance painter and a good friend of Luther and his wife, is located outside of the network. He appears as the only entity on one page. Furthermore, his name is the only inscription on this page at all, and it is followed by three blank pages. One possible way to integrate this entity into the network might be the use of another apportionment of the text, as the criterion that two entities must be referenced on the same page may be regarded as artificial. Only minor changes would have to be made to the XSLT code to use other divisions of text, such as chapters. As notebooks are not typically organized in chapters, we can use paragraphs (encoded with <p> here) instead. Instead of one specific element, a distinctive number of blank surfaces in series can also be interpreted as a marker between two sections. Based on this information, we can build a different network. This
is experimental and aimed at defining the best clusters according to the purpose of our examination.

Figure 5. Hierarchical Edge Bundling (mouse cursor placed over the node “Luther”)18

31 The network graphs above take advantage of the precisely marked-up text and focus on all named entities. We seem to take these entities out of their contexts, but actually we make their contexts more clearly visible, as we group them together wherever Fontane has written about them on the same page of his notebook. The networks still represent every surface, and thereby the notebook as a whole, but the visualizations pay attention only to the entities. It would not be difficult to apply these scripts to any other element to aggregate or summarize given information, with the option to browse to the respective lines of text where the entity occurs. As our dataset grows, we might be able to categorize the notebooks or to distinguish literary manuscripts from other notes. At some point we will be able to apply methods from network theory, for example to measure the centrality of nodes and compare the values from different networks for a single entity. This could provide insights into the structure of the analysed texts, for example regarding the identification of topics.

4. Conclusion

32 All the example applications we have presented in this article build upon existing tools and services. The additional effort required to make them work with the TEI data from the forthcoming edition of Theodor Fontane’s notebooks (and from the edition of William Godwin’s Diary) was minimal. The necessary scripting (mainly XSLT) and code customization were easily carried out in addition to our regular work within the Fontane edition project. These efforts were facilitated by a spirit of openness shared by
all parties involved: both the D3.js library and the SIMILE Timeline widget are open-source software released under a BSD license; the data sources GND, GeoNames, and OpenStreetMap have permissive licenses—Creative Commons Zero (CC0), Creative Commons Attribution (CC BY), and Open Data Commons Open Database License (ODbL), respectively; and the data from William Godwin’s Diary is released under a Creative Commons Attribution-NonCommercial license (CC BY-NC) and, just as importantly, is offered directly in TEI/XML.

33 The benefits of such an approach are undeniable: it enables researchers who are concerned with the meaning within textual material to explore questions that would otherwise be difficult and/or tedious to tackle, thus fulfilling the promise of the digital humanities. Therefore, we wonder why this kind of semantic exploration is not applied more often within the TEI community (or the digital humanities community as a whole). The lack of similar work is particularly regrettable because the true power of this approach would become apparent if there were more results to compare, so that a scholarly dialogue might ensue. Likewise, the availability of more datasets would increase the validity of the analyses performed on them, while more available tools would increase the possible research problems that could be examined through semantic investigation. Furthermore, this approach is a possible answer to the call for markup analysis that Fotis Jannidis started at a panel discussion during the DH2012 conference (Bauman et al. 2012). The work of scholars in the field of literature is more and more digital, but plain text, rather than the richly encoded TEI document, is still the most common format for text analysis.

34 The underlying question here is: who is responsible for carrying out the required work to develop, maintain, and customize tools for semantic exploration? It is naïve to believe that ready-to-use applications would emerge from “the Web” or “the community” on their own. Rather, all practitioners in the field ought to ask themselves what their contribution to this chain of development, use, sharing, and re-use could be. Instead of waiting for someone else to develop a generic tool that fits all purposes, anyone can make a small effort towards such a development by providing interchangeable data, using common standards, and building on pre-existing work. And with the philosophy of open source and sharing ideas in mind, we provide all our scripts as well as interactive examples at http://fontane-nb.dariah.eu/tei-conf/.

BIBLIOGRAPHY

Bauman, Syd, David Hoover, Karina Van Dalen-Oskam, and Wendell Piez. 2012. “Text Analysis Meets Text Encoding.” Panel discussion at Digital Humanities 2012 conference, Hamburg, Germany, July 16–22.

Berners-Lee, Tim. 2006. “Linked Data.” http://www.w3.org/DesignIssues/LinkedData.html.


Bingenheimer, Marcus, Jen-Jou Hung, and Simon Wiles. 2011. “Social Network Visualization from TEI Data.” Literary and Linguistic Computing 26, no. 3: 271–78. http://llc.oxfordjournals.org/content/26/3/271.full.pdf+html. doi:10.1093/llc/fqr020.

Le Boeuf, Patrick, Martin Doerr, Christian Emil Ore, and Stephen Stead. 2013. Definition of the CIDOC Conceptual Reference Model, version 5.1.2 (October). http://www.cidoc-crm.org/docs/cidoc_crm_version_5.1.2.pdf.

Myers, Victoria, David O’Shaughnessy, and Mark Philp, eds. 2010. The Diary of William Godwin. Oxford: Oxford Digital Library. http://godwindiary.bodleian.ox.ac.uk/

Radecke, Gabriele. 2010. “Theodor Fontanes Notizbücher: Überlegungen zu einer überlieferungsadäquaten Edition.” In Materialität in der Editionswissenschaft (= Beihefte zu editio, vol. 32), edited by Martin Schubert, 95–106. Berlin: De Gruyter.

———. 2013. “Notizbuch-Editionen: Zum philologischen Konzept der Genetisch-kritischen und kommentierten Hybrid-Edition von Theodor Fontanes Notizbüchern.” editio 27: 149–72. doi:10.1515/editio-2013-010.

Radecke, Gabriele, Mathias Göbel, and Sibylle Söring. 2013. “Theodor Fontanes Notizbücher: Genetisch-kritische und kommentierte Hybrid-Edition, erstellt mit der Virtuellen Forschungsumgebung TextGrid.” In Evolution der Informationsinfrastruktur: Kooperation zwischen Bibliothek und Wissenschaft, edited by Heike Neuroth, Norbert Lossau, and Andrea Rapp, 85–105. Glückstadt: Hülsbusch Verlag. http://webdoc.sub.gwdg.de/univerlag/2013/Neuroth_Festschrift.pdf.

Wettlaufer, Jörg, and Sree Ganesh Thotempudi. 2013. “Named Entity Recognition in Historical Texts from the Natural History Domain.” Poster presented at Mehr Personen – Mehr Daten – Mehr Repositorien, Tagung des Personendatenrepositoriums der BBAW, Berlin, Germany, March 4–6. http://www.gcdh.de/files/2013/6429/9184/Wettlaufer_Thotempudi_2013_NER_final.pdf.

Wüsten, Sonja, ed. 1973. Reisen in Thüringen: Notiz- und Tagebuchaufzeichnungen aus den Jahren 1867 und 1873. By Theodor Fontane. Potsdam: Theodor-Fontane-Archiv der Dt. Staatsbibl.

NOTES

1. Theodor Fontane–Arbeitsstelle, Genetisch-kritische und kommentierte Hybrid-Edition von Theodor Fontanes Notizbüchern basierend auf einer Virtuellen Forschungsumgebung, http://fontane-notizbuecher.de/
2. TextGrid: A Virtual Research Environment for the Humanities, http://textgrid.de/
3. Syncro Soft, http://www.oxygenxml.com/
4. A typical example can be found in the edition of William Godwin’s Diary (Myers, O’Shaughnessy, and Philp 2010), in which plots of diary appearances of mentioned people are provided within HTML websites that make use of TEI data without being part of a TEI document themselves.
5. Timeline: Web Widget for Visualizing Temporal Data, http://www.simile-widgets.org/timeline/.
6. Deutsche Nationalbibliothek, http://www.dnb.de/EN/gnd.
7. Deutsche Nationalbibliothek, GND integrated authority file, http://d-nb.info/gnd/119444720


8. Scrollable version at http://fontane-nb.dariah.eu/tei-conf/simile/.
9. GeoNames geographical database, http://www.geonames.org/.
10. http://www.openstreetmap.org.
11. DARIAH-DE Konsortium, http://geobrowser.de.dariah.eu/embed/?.
12. http://exist-db.org/.
13. UC Berkeley Visualization Lab, http://prefuse.org/.
14. https://github.com/mbostock/d3/wiki/Gallery.
15. http://bl.ocks.org/mbostock/4062045.
16. http://bl.ocks.org/mbostock/7607999.
17. Interactive version at http://fontane-nb.dariah.eu/tei-conf/net/.
18. Interactive version at http://fontane-nb.dariah.eu/tei-conf/heb/.

ABSTRACTS

Within the last few decades, TEI has become a major instrument for philologists in the digital age, particularly since a set of mechanisms has recently been incorporated which facilitates the encoding of genetic editions. Editors use the XML syntax to preserve old books and manuscripts, in both quantity and quality, and to publish many more of them online, mostly under free licenses. Scholars all over the world are now able to use huge datasets for further research. There are now many digital editions available, but only a few tools to analyze them. This article explores how web technologies (XML and related technologies as well as JavaScript) can be used to enrich the forthcoming edition of Theodor Fontane’s notebooks with data-driven visualizations of named entities and how at the same time applications can be built on these visualizations which are reusable for other edition projects in the TEI world. Because of the density and historical scope of references to named entities and the variety of entity types, Fontane’s notebooks lend themselves to advanced methods of semantic analysis.

INDEX

Keywords: applications, authority files, visualization, scholarly edition, Fontane, Theodor, notebooks

AUTHORS

MARTIN DE LA IGLESIA Martin de la Iglesia works for the project Genetic-critical and annotated hybrid-edition of Theodor Fontane’s notebooks based on a virtual research environment as a librarian in the Metadata and Data Conversion group at Göttingen State and University Library. He earned a Magister Artium degree
in art history and library science from Humboldt University Berlin in 2007 and has been working as a metadata specialist since then.

MATHIAS GÖBEL Mathias Göbel has been a research associate in the project Genetic-critical and annotated hybrid-edition of Theodor Fontane’s notebooks based on a virtual research environment since 2012. As an IT specialist he is developing the forthcoming website based on a native XML database and the data created within the TextGrid environment. In 2012 he received a Diploma degree in economics, with a minor in German Language and Literature, from Göttingen University.


Edition Visualization Technology: A Simple Tool to Visualize TEI-based Digital Editions

Roberto Rosselli Del Turco, Giancarlo Buomprisco, Chiara Di Pietro, Julia Kenny, Raffaele Masotti and Jacopo Pugliese

AUTHOR'S NOTE

For the purposes of the Italian academy, R. Rosselli Del Turco is responsible for sections 1, 2.1, 3.-3.1, 3.5, 4; G. Buomprisco is responsible for section 3.4; C. Di Pietro is responsible for section 3.3; J. Kenny is responsible for section 2.3; R. Masotti is responsible for sections 2.2 and 2.4; J. Pugliese is responsible for section 3.2. R. Rosselli Del Turco also planned and revised the whole article.

1. The Need for a Digital Facsimile/Edition Browser

1.1 The Digital Vercelli Book Project

1 When the transcription of the Vercelli Book manuscript (Codex CXVII, Archivio e Biblioteca Capitolare di Vercelli) passed the 50% landmark, researchers working on the project1 started to think about the best way to visualize the edition. Thanks to the openness and support of the Biblioteca Capitolare, it was decided to abandon the original plan of a CD/DVD publication, largely inspired by digital editions such as the Electronic Beowulf,2 in favor of a web-based publication. While this decision was critical in that it allowed us to select the most supported and widely-used medium, we soon discovered that it did not make choices any simpler. On the one hand, the XSLT stylesheets provided by TEI are great for HTML rendering, but do not include support for image-related features (such as the text-image linking available thanks to the P5 version of the TEI schema) and tools (including zoom in/out, magnifying lens, and hot
spots) that represent a significant part of a digital facsimile and/or diplomatic edition; other features, such as an XML search engine, would in any case have to be integrated separately. On the other hand, there are powerful frameworks based on CMS3 and other web technologies4 which looked far too complex and expensive for our project’s purposes, particularly when considering future maintenance needs. Other solutions, such as the EPPT software5 developed by K. Kiernan or the Elwood viewer6 created by G. Lyman, either were not yet ready or were unsuitable for other reasons (proprietary software, user interface issues, specific hardware and/or software requirements).

1.2 Standard vs. Fragmentation

2 We concluded that, while the TEI schemas and Guidelines are a solid foundation for philological work, this excellent standard is matched by an astounding diversity of publishing tools, particularly when it comes to digital editions, especially editions that include images of manuscripts. As a consequence, a single scholar, or a small group of researchers, can easily encode an image-based edition in TEI XML, but will have to look for further support and resources to publish it, which is a serious obstacle to greater acceptance of digital philology methods and techniques. While continuing to inquire about a good, community-tested solution, and in spite of being aware of the dangers coming from the so-called NIH (Not Invented Here) syndrome, we started looking into a project-specific solution that could fulfill our needs.

1.3 First Experiments

3 At first, however, EVT was more an experimental research project for students at the Informatica Umanistica course of the University of Pisa7 than a real attempt to solve the digital edition viewer problem. We aimed at investigating some user interface–related aspects of such a viewer, in particular certain usability problems that are often underestimated by similar projects (Rosselli Del Turco 2011), and at encouraging the use of standards to ensure the maximum longevity for the edition; we also considered releasing the source code as free software right from the start, since we firmly support open source and open access within the academic community. The first prototypes implemented on the basis of our concepts were promising, but eventually we reached a dead end: in spite of all our good intentions the User Interface (UI) looked cluttered; the software sported several secondary features, such as a rich text editing widget, but lacked critical ones (text-image linking, for instance), and was not fully stable, possibly as a result of too many widgets of different origins being used at the same time. On the architectural side, data was loaded straight into the web code, which had to be done by hand, with no options for configuration at all. In conclusion: not bad for a first attempt, but falling far short of our original goals.


2. The Current EVT Version

2.1 EVT v. 2.0: Rebooting the Project

4 To get out of the impasse we decided to completely reboot the project, removing secondary features and giving priority to fundamental ones. We also found a solution for the data-loading problem: instead of finding a way to load the data into the software we decided to build the edition around the data itself. Making the TEI XML files the starting point means that the editor can focus on his work, marking up the transcription text, with very little configuration needed to create the edition. This approach also allowed us to quickly test XML files belonging to other edition projects, to check if EVT could go beyond being a project-specific tool. The inspiration for these changes came from work done in similar projects developed within the TEI community, namely TEI Boilerplate,8 John A. Walsh’s collection of XSLT stylesheets,9 and Solenne Coutagne’s work for the “Berliner Intellektuelle 1800–1830” project.10 Through this approach, we achieved two important results: first, usage of EVT is quite simple—the user applies an XSLT stylesheet to their already marked-up file(s), and when the processing is finished they are presented with a web-ready edition; second, the web edition that is produced is based on a client-only architecture and does not require any additional kind of server software, which means that it can be simply copied on a web server to be used at once, or even on a cloud storage service (provided that it is accessible by the general public).

5 To ensure that it will be working on all the most recent web browsers, and for as long as possible on the World Wide Web itself, EVT is built on open and standard web technologies such as HTML, CSS, and JavaScript. Specific features, such as the magnifying lens, are entrusted to jQuery plug-ins, again chosen from the best-supported open-source ones to reduce the risk of future incompatibilities. The general architecture of the software, in any case, is modular, so that any component which may cause trouble or turn out to be not completely up to the task can be replaced easily.

2.2 How it Works

6 Our ideal goal was to have a simple, very user-friendly drop-in tool, requiring little work and/or knowledge of anything beyond XML from the editor. To reach this goal, EVT is based on a modular structure where a single stylesheet (evt_builder.xsl) starts a chain of XSLT 2.0 transformations calling in turn all the other modules. The latter belong to two general categories: those devoted to building the HTML site, and the XML processing ones, which extract the edition text of each folio, delimited by the TEI page break (<pb/>) elements, and format it according to the edition level. All XSLT modules live inside the builder_pack folder, in order to have a clean and well-organized directory hierarchy.


Figure 1. The EVT builder_pack directory structure.

7 Therefore, assuming the available formatting stylesheets meet your project’s criteria, to create a digital edition you only have to follow three simple steps:
• copy the edition data into the data/input_data folder: there are different sub-folders for text (TEI XML documents) and images (these have to follow a specific naming convention);
• optionally you can modify transformation options by editing evt_builder-conf.xsl, to specify for example the number of edition levels or presence of images;
• you can then apply the evt_builder.xsl stylesheet to your TEI XML document using the Oxygen XML editor or another XSLT 2–compliant engine.


Figure 2. The EVT data directory structure.

8 When the XSLT processing is finished, the starting point for the edition is the index.html file in the root directory, and all the HTML pages resulting from the transformations will be stored in the output_data folder. You can delete everything in this latter folder (and the index.html file), modify the configuration options, and start again, and everything will be re-created in the assigned places.

2.3 The XSLT stylesheets

9 The transformation chain has two main purposes: generate the HTML files containing the edition and create the home page which will dynamically recall the other HTML files.

10 The EVT builder’s transformation system is composed of a modular collection of XSLT 2.0 stylesheets: these modules are designed to permit scholars to freely add their own stylesheets and to manage the different desired levels of the edition without influencing other parts of the system, for instance the generation of the home page.

11 The transformation is performed applying a specific XSLT stylesheet (evt_builder.xsl) which includes links to all the other stylesheets that are part of the transformation chain and that will be applied to the TEI XML document containing the transcription.
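As a sketch, such a driver stylesheet simply pulls in the configuration and the other modules; the module file names below are illustrative, not EVT's actual ones:

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- configuration: number and names of edition levels, image options, ... -->
  <xsl:include href="evt_builder-conf.xsl"/>
  <!-- modules that build the HTML site structure -->
  <xsl:include href="modules/structure/site_structure.xsl"/>
  <!-- modules that transform the TEI elements for each edition level -->
  <xsl:include href="modules/elements/dipl_elements.xsl"/>
  <xsl:include href="modules/elements/interp_elements.xsl"/>
</xsl:stylesheet>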

12 EVT can be used to create image-based editions with different edition levels starting from a single encoded text. The text of the transcription must be divided into smaller parts to recreate the physical structure of the manuscript. Therefore, it is essential that paginated XML documents are marked up using the TEI page break element (<pb/>) at the start of each new page or folio side, so that the transformation system will be able to
recognize and handle everything that stands between one <pb/> element and the next as the content of a single page.

13 The system is designed to generate an arbitrary number of edition levels: as a consequence, the user is required to indicate how many (and which) output levels they intend to create by modifying the corresponding parameter in the configuration file.

14 The default configuration defines two edition levels, named Diplomatic and Interpretative.

15 The user can change the number of output levels simply by adding or removing an element in the evt_builder-conf.xsl file, and they can also personalize each level’s name by changing the content of the corresponding element. For example, if they wish to generate a critical level, they are required to add a Critical entry to the edition_array variable.11
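A minimal sketch of such a configuration, assuming that the edition_array variable holds one child element per level (the element name used here is hypothetical, not EVT's actual markup):

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- two default edition levels; adding a third child element would add a third level -->
  <xsl:variable name="edition_array">
    <edition>Diplomatic</edition>
    <edition>Interpretative</edition>
  </xsl:variable>
</xsl:stylesheet>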

16 Once the XML file is ready and the parameters are set, the EVT builder’s transformation system uses a collection of stylesheets to divide the XML file containing the text of the transcription into smaller portions, each one corresponding to the content of a folio, recto or verso, of the manuscript. For each of these text fragments it creates as many output files as requested by the file settings.

17 In order to create the contents of these files, the templates required to handle each edition level are selected by using the value of their @mode attribute. For example, by associating the transformation for the diplomatic edition with the dipl mode, the correct content of the pages of this edition is obtained by applying the templates that have "dipl" as the value of @mode. This requires that each one of the pages selected by the system is processed by an <xsl:apply-templates> instruction with the appropriate mode before its content is inserted into the diplomatic output file.

18 Using XSLT modes it is possible to separate the rules for the different transformations of a TEI element and to recall other XSLT stylesheets in order to manage the transformations or send different parts of a document to different parts of the transformation chain. This permits extracting different texts for different edition levels (diplomatic, diplomatic-interpretative) from the same XML file, and saving them in the HTML site structure, which is available as a separate XSLT module.

19 The use of modes also allows users to separate the template rules for the different transformations of a TEI element and to place them in different XSLT files or in different parts of a single stylesheet. Two templates that match the same TEI element but carry different @mode values can therefore be placed in different stylesheets, or in different parts of the same stylesheet, without generating conflicts.
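A sketch of two such templates, here for a TEI <choice> holding an abbreviation and its expansion (the element chosen, the interp mode name, and the rendering are illustrative, not EVT's actual rules):

<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0">
  <!-- diplomatic level: keep the abbreviation as written in the manuscript -->
  <xsl:template match="tei:choice[tei:abbr]" mode="dipl">
    <span class="dipl-abbr"><xsl:apply-templates select="tei:abbr" mode="dipl"/></span>
  </xsl:template>
  <!-- interpretative level: show the expanded form instead -->
  <xsl:template match="tei:choice[tei:abbr]" mode="interp">
    <span class="interp-expan"><xsl:apply-templates select="tei:expan" mode="interp"/></span>
  </xsl:template>
</xsl:stylesheet>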

20 If the TEI elements that are processed are wrapped in HTML elements whose @class attribute values are based on the scheme [edition_level]-[TEI_element’s_name],12 it is then possible to retain the semantic information contained in the markup and, if necessary, to associate the element with the corresponding class in the CSS rules so as to specify the visualization and highlighting of the item.

21 As already stated above, the editors will be able to freely add their own stylesheets in order to personally manage the different levels of the desired edition. What is required from users is to:
• personalize the edition generation parameter as shown above;
• copy their own XSLT files containing the template rules to generate the desired edition levels in the directory that contains the stylesheets used for TEI element transformation (builder_pack/modules/elements);
• include the new stylesheets in the file used to start the transformation chain (builder_pack/evt_builder.xsl);
• associate a mode value to the new edition level transformation;
• add a @mode attribute with the corresponding value to all the template rules used for that transformation.

22 For the time being, this kind of customization has to be done by hand-editing the configuration files, but in a future version of EVT we plan to add a more user-friendly way to configure the system.

2.4 Features

23 At present, EVT can be used to create image-based editions with two possible edition levels: diplomatic and diplomatic-interpretative; this means that a transcription encoded using elements belonging to the appropriate TEI module13 should already be compatible with EVT, or require only minor changes to be made compatible. The Vercelli Book transcription schema is based on the standard TEI schema, with no custom elements or attributes added: our tests with similarly encoded texts showed a high degree of compatibility. A critical edition level is currently being researched and it will be added in the future.

24 When the website produced by EVT is loaded in a browser, the viewer will be presented with the manuscript image on the left side, and the corresponding text on the right: this is the default view, but on the main toolbar at the top right corner of the browser window there are icons to access all the available views:
• “Image-Text” view: as mentioned above, this is the default view showing a manuscript folio image and the corresponding text in one or more edition levels;
• “Text-Text” view: conceived to compare different edition levels, which can be chosen by means of the drop-down menu in the text frame toolbar;
• “Bookreader”: this view expands the image frame to show double-side images of the manuscript (that is, showing the verso side of a folio and the following one’s recto side). At the
moment it is necessary that these images be double-sided; in a future version of EVT it will be possible to place single folio images side-by-side and to browse them in a synchronized way.

Figure 3. The bookreader view.

25 It is interesting to notice how the “bookreader” view is automatically populated with content during the transformation phase, with no action required by the editor. The only necessary requirement at the encoding level, in fact, is that the editor should encode folio numbers by means of the <pb/> element, including the letters r and v to mark recto and verso pages, respectively. EVT will take care of automatically associating each folio with the images copied into the input_data/images folder using a “verso-recto” naming scheme (for example: 104v-105r.png). It is of course possible that in some cases the transformation process is unable to return the correct result: this is why we decided to save this kind of information in an XML file, structure.xml, separate and independent from the HTML interface; this file will be updated automatically every time the transformation process is started and can be customized by the editor.

26 Although the different views access different kinds of content, such as single side and double side images, the navigation algorithms used by EVT allow the user to move from one view to another without losing the current browsing position.

27 All content is shown inside HTML frames designed to be as flexible as possible. No matter what view one is currently in, one can expand the desired frame to focus on its specific content, temporarily hiding the other components of the user interface. It is furthermore possible to collapse the frame toolbars to increase the space devoted to content visualization; it is important to notice, however, that we recommend using EVT in full-screen mode to see images and text at the maximum possible screen resolution.
The collapse and restore actions are triggered by icons embedded in the interface, but one can also press the Esc key to instantly return to the default layout.

28 On the image side, several tools are available to improve analysis of manuscript images. The zoom feature is always active, so that the user can zoom in or out at any time by means of the mouse scroll wheel or using the appropriate slider in the bottom toolbar. More tools are available on the top toolbar:
• “Magnifier”: a magnifying lens to explore the manuscript images in greater detail; the lens shows an area of a high-resolution version of the same folio image, which means a better view compared to the standard zoom feature.
• “HotSpot”: if activated, areas of the images for which there are specific notes and/or details will be highlighted. Clicking on them will call up a pop-up window with the related information. This feature is quite recent and still needs to be refined; the pop-up window may be replaced by a better solution.
• “TextLink”: when this feature is active, lines in the manuscript are highlighted and linked to the corresponding lines of the edition text, and vice versa.
• “Thumbnails”: will show miniature images for all the available manuscript digitized folios.

29 The image-text feature is inspired by Martin Holmes’s Image Markup Tool14 software and was implemented in XSLT and CSS; all the other features are achieved by using jQuery plug-ins.

30 In the text frame toolbar you can see three drop-down menus, which are useful for choosing texts, specific folios, and edition levels, and an icon that triggers the search functionality. Again, the editor can modify the structure.xml file to change the text-related information, or even to exclude some of the images from the navigation system.

2.5 A First Use Case

31 On December 24, 2013, after extensive testing and bug fixing work, the EVT team published a beta version of the Digital Vercelli Book edition,15 soliciting feedback from all interested parties. Shortly afterwards, the version of the EVT software we used, improved by more bug fixes and small enhancements, was made available for the academic community on the project’s SourceForge site.16


Figure 4. The Digital Vercelli Book edition based on EVT v. 0.1.48. Image-text linking is active.

3. Future Developments

32 EVT development will continue during 2014 to fix bugs and to improve the current set of features, but there are also several important features that will be added or that we are currently considering for inclusion in EVT. Some of the planned features will require fundamental changes to the software architecture to be implemented effectively: this is probably the case for the Digital Lightbox (see section 3.4), which requires a client-server architecture (section 3.5), instead of the current client-only model, to perform some of the existing and planned actions. The currently developed search engine (section 3.2) may prove too limited when dealing with larger text collections, but thanks to the new architecture it will be possible to consider new, more powerful solutions.

3.1 New Layout

33 One important aspect that has been introduced in the current version of EVT is a completely revised layout: the current user interface includes all the features which were deemed necessary for the Digital Vercelli Book beta, but it also is ready to accept the new features planned for the short and medium terms. Note that nontrivial changes to the general appearance and layout of the resulting web edition will be necessary, and this is especially the case for the XML search engine and for the critical edition support. Fortunately the basic framework is flexible enough to be easily expanded by means of new views or a redesign of the current ones.

3.2 Search Engine

34 The EVT search engine is already working and being tested in a separate development branch of the software; merging into the main branch is expected as soon as the user
interface is finalized. It was implemented with the goal of keeping it simple and usable for both academics and the general public.

35 To achieve this goal we began by studying various solutions that could be used as a basis for our efforts. In the first phases of this study we looked at the principal XML databases, such as BaseX and eXist, and we found a solution by envisioning EVT as a distributed application using a client-server architecture. For this test we selected the eXist17 open-source XML database, and in a relatively short time we created, sometimes by trial and error, a prototype that queried the database for keywords and highlighted them in context.

36 While this model was a step in the right direction and partially operational, we also felt that it was not sufficiently user-friendly, which is a critical goal of the entire project. In fact, forcing the editor to install and configure specific server software is a cumbersome requirement. Moreover, thanks to its client-only architecture, up to this point EVT could work as a desktop application or an off-line web application that could be accessed anywhere, and possibly distributed in optical formats (CD or DVD). Forcing the prerequisites of an Internet connection and of dependency on a server-based XML database would have undermined our original goal. Going the database route was therefore no longer an option for a client-only EVT, and we immediately felt the need to go back to our original architecture to meet this requirement. This sudden turnaround marked another chapter in the research process and brought us to the current implementation of EVT Search.

37 However, this new vision also had obvious limitations and issues. An XML database would have provided us with crucial functionality typical of every information retrieval system: an indexer, a powerful and flexible server-side query language (XQuery), and some useful built-in functions for semantic and linguistic analysis (such as keyword in context and concordances). It became clear that our implementation should take these missing functionalities into account and consider alternative technologies.

38 To simplify the planning process, we focused on the bare functionalities that would be expected by the user. Essentially, we found that at least two of them were needed in order to make a functional search engine: free-text search and keyword highlighting. To implement them we looked at existing search engines and plug-ins programmed in the most popular client-side web language: JavaScript. In the end, our search produced two answers: Tipue Search and DOM manipulation.

3.2.1 Tipue Search

39 Tipue search18 is a jQuery plug-in search engine released under the MIT license and aimed at indexing and searching large collections of web pages. It can function both offline and online, and it does not necessarily require a web server or a server-side programming/query language (such as SQL, PHP, or Python) in order to work. While technically a plug-in, its architecture is quite interesting and versatile: Tipue uses a combination of client-side JavaScript for the actual bulk of the work, and JSON (or JavaScript object literal) for storing the content. By accessing the data structure, this engine is able to search for a relevant term and bring back the matches.

40 Tipue Search operates in three modes:
• in Static mode, Tipue Search operates without a web server by accessing the contents stored in a specific file (tipuedrop_content.js); these contents are presented in JSON format;
• in Live mode, Tipue Search operates with a web server by indexing the web pages included in a specific file (tipuesearch_set.js);
• in JSON mode, Tipue Search operates with a web server by using AJAX to load JSON data stored in specific files (as defined by the user).

41 This plug-in suited our needs very well, but had to be modified slightly in order to accommodate the requirements of the entire project. Before using Tipue to handle the search we needed to generate the data structure that was going to be used by the engine to perform the queries. We explored some existing XSL stylesheets aimed at TEI to JSON transformation, but we found them too complex for the task at hand. So we modified our own stylesheets to produce the desired output.

42 This output consists of two JSON files:
• diplomatic.json contains the text of the diplomatic edition of the Vercelli Book;
• facsimile.json contains the text of the facsimile edition of the Vercelli Book.

43 These files are produced by including two templates in the overall flow of XSLT transformations that extract crucial data from the TEI documents and format them with JSON syntax. The procedure complements well the entire logic of automatic self-generation that characterizes EVT.

44 After we managed to extract the correct data structure, we began to include the search functionality in EVT. By using the logic behind Tipue’s JSON mode, we implemented a trigger (in the form of a select element) that loaded the desired JSON data structure to handle the search (diplomatic or facsimile, as mentioned above) and a form that managed the query strings and launched the search function. Additionally, we decided to provide the user with a simple virtual keyboard composed of essential keys related to the Anglo-Saxon alphabet used in the Vercelli Book.

45 The performance of Tipue Search was deemed acceptable and our tests showed that even large collections of data did not pose any particular problem.

Figure 5. Experimental search interface.


3.2.2 Keyword Highlighting through DOM Manipulation

46 The solution to keyword highlighting was found while examining the many plug-ins that deal with this very problem. All these plug-ins use JavaScript and DOM manipulation in order to wrap the HTML text nodes that match the query in a specific tag (a span or a user-defined tag) with a CSS class to manage the style of the highlighting. While this implementation is very simple and self-explanatory, making use of simple recursive functions on the relevant HTML nodes, it proved very difficult to apply to the textual contents handled by EVT.

47 HTML text within EVT is represented as a combination of text nodes and <span> elements. These spans are used to define the characteristics of the currently selected edition. They contain both philological information about the inner workings of the text and information about its visual representation. Very often the text is composed of spans that handle different versions of words (such as the sub-elements of the TEI <choice> element) or highlight an area of a word (based on the TEI <hi> element, for example).

48 This type of markup would not have constituted a problem if it had always wrapped complete words, since the plug-ins could recursively explore its content and search for a matching term. In certain portions of the text, however, some letters are separated by a span from the rest of the word (for example, “Hello” may be encoded as a bare “H” followed by a span containing “ello”). Since the plug-ins worked on the level of text nodes, a word split by spans would be seen as two different text nodes (“H” and “ello”) and could not be matched by the query, making the standard traversal algorithm of the plug-ins impossible to use without further substantial modifications.

49 To solve this problem we recreated the traversal algorithm entirely, making it more intelligent and context-sensitive. In order to do so, we explored the text to be highlighted by using a map that kept track of all the spans and text nodes and their starting and ending positions in this flow. At the end of the process, we simply inserted the span elements (or the user-defined elements) at the positions marked by the map.
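A compact sketch of this idea (not EVT's actual code): the text nodes under a container are mapped to their offsets in the concatenated text, and the first occurrence of the query is then wrapped node by node; extending it to all occurrences requires re-mapping after each insertion.

// Sketch: highlight the first occurrence of `query` inside `container`,
// even when the matched word is split across several text nodes.
function highlightFirst(container, query) {
  var nodes = [], text = "";
  (function collect(n) {                         // build the offset map
    if (n.nodeType === 3) { nodes.push({ node: n, start: text.length }); text += n.nodeValue; }
    else for (var i = 0; i < n.childNodes.length; i++) collect(n.childNodes[i]);
  })(container);

  var from = text.toLowerCase().indexOf(query.toLowerCase());
  if (from === -1) return;
  var to = from + query.length;

  nodes.forEach(function (entry) {               // wrap the slice of every overlapping node
    var s = Math.max(from, entry.start),
        e = Math.min(to, entry.start + entry.node.nodeValue.length);
    if (s < e) {
      var range = document.createRange();
      range.setStart(entry.node, s - entry.start);
      range.setEnd(entry.node, e - entry.start);
      var mark = document.createElement("span");
      mark.className = "highlight";
      range.surroundContents(mark);
    }
  });
}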

Figure 6. Highlighting of search results in the normal view.

50 We found the performance of this solution acceptable, and our tests showed that even dynamic highlighting (performed as the user inputs text) posed no problem for this implementation.

3.3 Embedded Transcription and Critical Edition Support

3.3.1 Embedded Transcription

51 In some contexts, such as encoding of epigraphical texts, the text transcribed is conceived more as a help for a better understanding of what is written on the object’s
surface than as a stand-alone edition. This kind of transcription can be encoded by means of the embedded transcription method as described in section 11.2.2 of the TEI Guidelines,19 in which words and other written traces are not supplied in a <text> element in parallel with a <facsimile>, but are encoded as subcomponents of the elements representing the physical surfaces inside <sourceDoc>. The text transcribed, therefore, is not separated from the information about the image: it is placed inside a <zone> element, which defines a two-dimensional area within a <surface>, and is transcribed using one or more <line> elements.
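A minimal sketch of this encoding model as defined by the Guidelines (coordinates and text are illustrative):

<sourceDoc xmlns="http://www.tei-c.org/ns/1.0">
  <surface ulx="0" uly="0" lrx="200" lry="300">
    <graphic url="fol-1r.jpg"/>
    <!-- the zone defines a two-dimensional area on the surface -->
    <zone ulx="20" uly="25" lrx="180" lry="60">
      <line>first line of writing within this zone</line>
      <line>second line of writing</line>
    </zone>
  </surface>
</sourceDoc>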

52 Originally EVT could not handle this particular encoding method, since the XSLT stylesheets could only process TEI XML documents encoded according to the traditional transcription method. Since we think that this is a concrete need in many cases of study (mainly epigraphical inscriptions, but also manuscripts, at least in some specific cases), we recently added a new feature that will allow EVT to handle texts encoded according to the embedded transcription method. This work was possible due to a small grant awarded by EADH.20

3.3.2 Support for Critical Edition

53 One important feature whose development will start at some point this year is the support for critical editions, since at the present moment EVT allows dealing only with diplomatic and interpretative ones. We aim not only to offer full support for the TEI Critical Apparatus module, but also to find an innovative layout that can take advantage of the digital medium and its dynamic properties to go beyond the traditional, static, printed page: “The layers of footnotes, the multiplicity of textual views, the opportunities for dramatic visualization interweaving the many with each other and offering different modes of viewing the one within the many—all this proclaims ‘I am a hypertext: invent a dynamic device to show me.’ The computer is exactly this dynamic device” (Robinson 2005, § 12).

54 A digital edition can, of course, maintain the traditional layout, possibly moving the apparatus from the bottom of the page to a more convenient position, but could and should also explore different ways of organizing and displaying the connection between edition text and the critical apparatus: even if an electronic screen is just as limited by its two-dimensional form factor as a printed book, there are several ways to overcome this limitation, which we intend to research and test until we find an effective solution.

55 Furthermore, a real added value granted by the digital medium is the opportunity to transcribe completely every witness and to visualize the different readings of each one in the wider context of the text of the witness itself, and not only as a single, isolated variant inside the critical apparatus. In fact, thanks to hypertextual links, it is possible to connect every single entry to the precise fragment of text where it belongs, allowing a deeper study and an easier comparison of the main text and the variant texts. A tool based on this approach would really take advantage of the dynamic and hypertextual nature of a web-based edition.

56 Some of the problems posed by this approach concern the user interface and how it should be designed in order to be usable and useful: how to conceive and where to place the graphical widgets holding the critical apparatus, how to integrate these UI elements in EVT, how to contextualize the variants and navigate through the witnesses' texts, and more. There are other problems too, for instance scalability (how to deal with very large textual traditions counting tens or even hundreds of witnesses?) or the handling of texts produced by collation software, which depends strictly on the current TEI Critical Apparatus module. Considering that a subgroup of the TEI's Manuscript Special Interest Group is devoted to significantly improving this module, we can only hope that at least some of these problems will be addressed in a future version.

3.4 Digital Lightbox

57 Developed first at the University of Pisa, and then at King's College London as part of the DigiPal21 project, the Digital Lightbox22 is a web-based visualization framework which aims to support historians, paleographers, art historians, and others in analyzing and studying digital reproductions of cultural heritage objects. The research methodology that inspired the development of this tool is to study paleographic elements in a qualitative way, supporting scholars' interpretations as much as possible, and therefore to reject automatic methods such as pattern recognition and clustering, which are supposed to return quantitative and objective results. Although ongoing projects making use of these computational methods are very promising, the results that may be obtained at this time are still significantly less precise (with regard to specific image features, at least) than those produced through human interpretation.

58 Initially developed exclusively for paleographic research, the Digital Lightbox may be used with any type of image because it includes a set of general graphic tools. Indeed, the application allows a detailed and powerful analysis of one or more images, arranged in up to two available workspaces, providing tools for manipulation, management, comparison, and transformation of images. Development is continually tested by paleographers at King's College London working on the DigiPal project, who use the web application to support the analysis and gathering of samples of paleographic elements.

59 The software offers a rich set of tools: besides basic functions such as resizing, rotation, and dragging, it is possible to use a set of filters—such as opacity, brightness, color inversion, grayscale effect, and contrast—which, used in combination, are able to significantly improve the quality of digital images or specific areas of them. For example, one of the possible applications of the filters is the enhancement of images of burned or otherwise damaged manuscripts, which are obviously much more difficult to read and transcribe. Images can also be compared using an automatic tool which allows the user to merge two images, returning a third one made up of the differences between them. A typical application of the combined use of these tools is the disambiguation of different forms of the same character (allographs, according to the DigiPal terminology).


Figure 7. Comparison and disambiguation of allographs in the Digital Lightbox.

60 Collaboration is a very important characteristic of the Digital Lightbox: what makes this tool stand apart from the image-editing applications available is the possibility of creating and sharing the work done using the software framework. First, users can create collections of images and then export them to the local disk as an XML file; this feature serves not only as a way to save the work, but also as a way to share specific collections with other users. Moreover, it is possible to export (and, consequently, to import) working sessions, or, in other words, the current status of the work being done using the application: in fact, all the images, letters, and notes present on the workspace will be saved when the user leaves and restored when they log in again. These features have been specifically created to encourage sharing and to make collaborative work easier and more effective. Thanks to a new HTML5 feature, images can be imported from the local disk into the application without any server-side function.

61 Digital Lightbox has been developed using some of the latest web technologies available, such as HTML5, CSS3, the front-end framework Bootstrap,23 and the JavaScript (ECMAScript 6) programming language, in combination with the jQuery library.24 The code architecture has been designed to be modular and easily extensible by other developers or third parties: indeed, it has been released as open source software on GitHub,25 and is freely available to be downloaded, edited, and tinkered with.

62 The Digital Lightbox represents a perfect complement to the EVT project: a graphic-oriented tool to explore, visualize, and analyze digital images of manuscripts. While EVT provides a rich and usable interface to browse and study manuscript texts together with the corresponding images, the tools offered by the Digital Lightbox allow users to identify, gather, and analyze visual details which can be found within the images, and which are important for inquiries relating, for instance, to the style of the handwriting, decorations on manuscript folia, or page layout.

63 An effort to adapt and integrate the Digital Lightbox into EVT is already underway, making it available as a separate, image-centered view, but there is a major hurdle to overcome: some of the DL features are only possible within a client-server architecture.


Since EVT or, more precisely, a separate version of EVT will migrate to this architecture, at some point in the future it will be possible to integrate a full version of the DL. Plans for the current, client-only version envision implementing all those features that do not depend on server software: even if this means giving up interesting features such as collaborative work and annotation, we believe that even a subset of the available tools will be an invaluable help for manuscript image analysis. Furthermore, as noted above, thanks to HTML5 and CSS3 it will become more and more feasible to implement features in a client-only mode.

3.5 New Architecture

64 In September 2013 we met with researchers from the Clavius on the Web project26 to discuss a possible use of EVT to visualize the documents that they are collecting and encoding; the main goal of the project is to produce a web-based edition of all the correspondence of this important sixteenth–seventeenth-century mathematician.27 The integration of EVT with another web framework used in the project, the eXist XML database, will require a very important change in how the software works: as mentioned above, everything from XSLT processing to browsing of the resulting website has been done on the client side, but the integration with eXist will require a move to the more complex client-server architecture. A version of EVT based on this architecture would present several advantages, not only the integration of a powerful XML database, but also the implementation of a full version of the Digital Lightbox. We will try to make the move as painless as possible and to preserve the basic simplicity and flexibility that has been a major feature of EVT so far. The client-only version will not be abandoned, though for quite some time there will be parallel development, with features trickling from one version to the other and the client-only one being preserved as a subset of the more powerful one.

4. Conclusion

65 Started as an experimental, project-based learning initiative to explore issues related to the publishing of TEI-encoded digital editions, this software has grown to the point of being a potentially very useful tool for the TEI community: since it requires little configuration, and no knowledge of programming languages or web frameworks except for what is needed to apply an XSLT stylesheet, it represents a user-friendly method for producing image-based digital editions. Moreover, its client-only architecture makes it very easy to test the edition-building process (one has only to delete the output folders and start anew) and publish preliminary versions on the web (a shared folder on any cloud-based service such as Dropbox is all that is needed).

66 While EVT has been under development for 3–4 years, it was thanks to the work and focus required by the Digital Vercelli Book release at the end of 2013 that we now have a solid foundation on which to build new features and refine the existing ones. Some of the future expansions also pose important research questions: this is the case with the critical edition support, which touches a sensitive area of the very recent digital philology discipline.28 The collaborative work features of the Digital Lightbox are also critical to the way modern scholars interact and share their research findings. Finally, designing a user interface capable of hosting all the new features, while remaining effective and user-friendly, will itself be very challenging.

BIBLIOGRAPHY

Gabler, Hans Walter. 2010. “Theorizing the Digital Scholarly Edition.” Literature Compass 7(2): 43–56. doi:10.1111/j.1741-4113.2009.00675.x.

Kiernan, Kevin S., ed. 2013. The Electronic Beowulf. 3rd edition. London: British Library. CD-ROM.

Robinson, Peter. 2005. “Current Issues in Making Digital Editions of Medieval Texts—Or, Do Electronic Scholarly Editions Have a Future?” Digital Medievalist 1.1. http://www.digitalmedievalist.org/journal/1.1/robinson/.

———. 2013. “Towards a Theory of Digital Editions.” Variants: The Journal of the European Society for Textual Scholarship 10: 105–31.

Rosselli Del Turco, Roberto. 2011. “After the Editing is Done: Designing a Graphic User Interface for Digital Editions.” Digital Medievalist 7. http://www.digitalmedievalist.org/journal/7/rosselliDelTurco/.

———. Forthcoming. “The Battle We Forgot to Fight: Should We Make a Case for Digital Editions?” In Digital Scholarly Editing: Theory, Practice and Future Perspectives, edited by Matthew Driscoll and Elena Pierazzo. Cambridge: Open Book Publishers.

Siemens, Ray, Meagan Timney, Cara Leitch, Corina Koolen, and Alex Garnett. 2012. “Toward Modeling the Social Edition: An Approach to Understanding the Electronic Scholarly Edition in the Context of New and Emerging Social Media.” Literary and Linguistic Computing 27(4): 445–61. doi:10.1093/llc/fqs013.

TEI Consortium. 2014. TEI P5: Guidelines for Electronic Text Encoding and Interchange. Version 2.7.0. Last updated on September 16, 2014. N.p.: TEI Consortium. http://www.tei-c.org/Vault/P5/2.7.0/doc/tei-p5-doc/en/html/index.html.

APPENDIXES

Appendix 1. The EVT Team

Project lead:

Roberto Rosselli Del Turco

UI design, HTML programming:

Roberto Rosselli Del Turco, Julia Kenny, and Raffaele Masotti


XSLT framework and graphic plug-ins:

Julia Kenny and Raffaele Masotti

XML search engine:

Jacopo Pugliese

Critical edition and embedded transcription support:

Chiara Di Pietro

Digital Lightbox:

Giancarlo Buomprisco

EVT website: http://sourceforge.net/projects/evt-project/

NOTES

1. The Digital Vercelli Book project, http://vbd.humnet.unipi.it/.
2. Support site for the Electronic Beowulf, http://ebeowulf.uky.edu/.
3. The Omeka framework (http://omeka.org/) supports publishing TEI documents; see also Drupal (https://drupal.org/) and TEICHI (http://www.teichi.org/).
4. Such as the eXist XML database, http://exist-db.org/.
5. Edition Production and Presentation Technology, http://sourceforge.net/projects/eppt/.
6. Elwood Viewer, http://www3.iath.virginia.edu/seenet/elwoodinfo.htm.
7. BA course, http://infouma.di.unipi.it/laurea/index.asp.
8. TEI Boilerplate, http://dcl.slis.indiana.edu/teibp/.
9. tei2html, https://github.com/jawalsh/tei2html.
10. Digitale Edition Briefe und Texte aus dem intellektuellen Berlin um 1800, https://sites.google.com/site/annebaillot/digitale-edition.
11. The available edition levels are described in the software documentation. Adding another edition level requires providing the corresponding stylesheet.
12. For example, to process the <abbr> element for the diplomatic edition: dipl-abbr.
13. See chapter 11, “Representation of Primary Sources,” in the TEI Guidelines.
14. The UVic Image Markup Tool Project, http://www.tapor.uvic.ca/~mholmes/image_markup/.
15. Full announcement on the project blog, http://vbd.humnet.unipi.it/?p=2047. The beta edition is directly accessible at http://vbd.humnet.unipi.it/beta/.
16. Edition Visualization Technology: Digital edition visualization software, http://sourceforge.net/projects/evt-project/.
17. eXist-db, http://exist-db.org/.
18. Tipue Search, http://www.tipue.com/search/.


19. See “Embedded Transcription,” http://www.tei-c.org/Vault/P5/2.6.0/doc/tei-p5-doc/en/html/PH.html#PHZLAB.
20. See EADH Small Grant: Call for Proposals, http://eadh.org/support/eadh-small-grants-call-proposals.
21. DigiPal: Digital Resource and Database of Palaeography, Manuscript Studies and Diplomatic, http://www.digipal.eu/.
22. A beta version is available at http://lightbox-dev.dighum.kcl.ac.uk.
23. Bootstrap, http://getbootstrap.com/.
24. http://jquery.com/.
25. Digital Lightbox, https://github.com/Gbuomprisco/Digital-Lightbox/.
26. See http://claviusontheweb.it/web/index.php. A preliminary test using a previous version of EVT is available at http://claviusontheweb.it/EVT/.
27. Currently preserved at the Archivio della Pontificia Università Gregoriana.
28. Digital philology makes use of ICT methods and tools, such as text encoding, in the context of textual criticism and philological study of documents to produce digital editions of texts. While many of the first examples of such editions were well received (see for instance Kiernan 2013; also see Siemens 2012 for an example of the new theoretical possibilities allowed by the digital medium), serious doubts concerning not only their durability and maintainability, but also their methodological effectiveness, have been raised by some scholars. The debate is still ongoing; see Gabler 2010, Robinson 2005 and 2013, and Rosselli Del Turco, forthcoming.

ABSTRACTS

The TEI schemas and guidelines have made it possible for many scholars and researchers to encode texts of any kind for (almost) all kinds of purposes, but this excellent standard is matched by an astounding diversity of publishing tools, which is particularly true when it comes to image-based digital editions. The different needs of scholars, coupled with the constant search for an effective price/result ratio and the local availability of technical skills, have led to a remarkable fragmentation: publishing solutions range from simple HTML pages produced using the TEI stylesheets (or the TEI Boilerplate software) to very complex frameworks based on CMS and SQL search engines. Researchers of the Digital Vercelli Book project started looking into a simple, user-friendly solution and eventually decided to build their own: EVT (Edition Visualization Technology) has been under development since about 2010 and has turned into a flexible tool that can be used to create a web-based digital edition starting from transcription files encoded in TEI XML. This paper describes why this tool was created, how it works, and developments planned for the future.


INDEX

Keywords: manuscript transcription, diplomatic edition, critical edition, digital edition, XSLT, web publication, digital lightbox

AUTHORS

ROBERTO ROSSELLI DEL TURCO Roberto Rosselli Del Turco is an assistant professor at the University of Torino, where he teaches Germanic philology. He is also an assistant professor of humanities computing at the University of Pisa. He directs the Digital Vercelli Book project and is co-director of the Visionary Cross Project.

GIANCARLO BUOMPRISCO Giancarlo Buomprisco is a graduate student at the University of Pisa currently collaborating with the DigiPal project at King’s College London.

CHIARA DI PIETRO Chiara Di Pietro is a graduate student at the University of Pisa, working toward a MA degree in Informatica Umanistica.

JULIA KENNY Julia Kenny is a graduate student at the University of Pisa, working toward a MA degree in Informatica Umanistica.

RAFFAELE MASOTTI Raffaele Masotti is a graduate student at the University of Pisa, working toward a MA degree in Informatica Umanistica.

JACOPO PUGLIESE Jacopo Pugliese is a graduate student at the University of Pisa, working toward a MA degree in Informatica Umanistica.


TEI4LdoD: Textual Encoding and Social Editing in Web 2.0 Environments

António Rito Silva and Manuel Portela

AUTHOR'S NOTE

We would like to thank Timothy Thompson for his contributions to the TEI template for LdoD. This work was supported by Portuguese national funds through FCT – Fundação para a Ciência e a Tecnologia, under projects PTDC/CLE-LLI/118713/2010 and UID/CEC/50021/2013.

1. Introduction

1 Fernando Pessoa’s Book of Disquiet ( Livro do Desassossego, abbreviated LdoD) is an unfinished book project. Pessoa wrote more than five hundred texts meant for this work between 1913 and 1935, the year of his death. The first edition of LdoD was not published until 1982, and another three major versions have been published since then (1990, 1998, 2010), most of which have been further revised. As it exists today, LdoD may be characterized as (1) a set of autograph (manuscript and typescript) fragments, (2) mostly unpublished at the time of Pessoa’s death, which have been (3) transcribed, selected, and organized into four different editions, implying (4) various interpretations of what constitutes this book according to the extant print editions. Editions show four major types of variation: in readings of particular passages, in selection of fragments, in their ordering, and also in heteronym attribution. 1

2 The goal of the LdoD Archive2 is twofold: on the one hand, we want to provide a “standard” archive where experts can study and compare LdoD’s authorial witnesses to their different editions, and each one of these to each other; on the other hand, we


want to design a virtual archive that allows both experts and non-experts to experiment with the production of different editions of LdoD, and also with the writing of their own fragments based on LdoD’s original fragments. Therefore, this latter goal, which is built on top of the genetic and critical archival goal, extends a scholarly understanding of LdoD as both authorial project and editorial construct to a new perspective of LdoD as an exploratory environment for individual and/or community editing and writing based on authorial witnesses and editorial versions.

3 The rationale for allowing experts and non-experts to experiment with re-editing and rewriting this work originates in social editing theories and in the increasing appeal of the Book of Disquiet for a general and global audience.3 We want to explore the collaborative dimension of the Web as a reading and writing space in the context of a digital archive in ways that enhance its scholarly, pedagogical, ludic, and expressive uses. Through this double strategy, we also expect to build a tool for investigating the relation between writing processes and material and conceptual notions of the book. By allowing users to perform various roles within the LdoD Archive, we suggest that the one-way flow of text from the expert edition to the novice-remixed digital edition can be one way of handling questions of authority concerning the work's textual form. We argue that it is possible to create a textual environment where professional and general readers can interact, and that such an environment extends theories about the social nature of textual production to current social media technologies.

4 The novelty of this project is built on the possibility of dynamically interacting with the repository. Therefore, we intend to provide end users, whether TEI experts or non-experts, with a web interface through which they can interact with the LdoD repository to create their own interpretations of Pessoa's work. To achieve the project's goal we designed and implemented a collaborative archive where experts and non-experts can interact, while preserving the relative autonomy of their interventions. The archive enables the coexistence of three different dimensions of LdoD: genetic, social, and virtual. This coexistence is supported by a Web 2.0 environment that integrates an object-oriented database with a file repository for TEI-encoded files. Web 2.0 interactions are done through the object-oriented database, while experts continue to use their TEI XML editor of choice—which stores TEI files in the file repository—to interact with the environment.

5 In this article we focus on two aspects: (1) how to encode LdoD, and (2) how to support a dynamic environment. For (1) we will show how the existing witnesses, both authorial and editorial, can be encoded in TEI markup and how the proposed encoding structure will support dynamic interactions with the book, such as, for instance, the creation of further editions.4 Regarding (2), we will propose a software architecture that supports the traditional process of TEI encoding (Vanhoutte 2006; TEI Consortium 2012; Barney 2012; Earhart 2012) with the possibility of dynamically extending the interpretations of LdoD on top of a TEI representation.

6 In section 2 we briefly refer to related work. Section 3 presents the aims and rationale of the project. Section 4 states the problem, and in section 5 we present our solution strategy. Sections 6 and 7 describe our approach as far as, respectively, the TEI encoding and the software architecture are concerned.


2. Related Work

7 Our proposal explores current approaches to editing in electronic environments and attempts to integrate them with TEI conceptual and processing models.

8 In a recent article on the theory of digital editions, Peter Robinson shows how printed editions have traditionally focused on the relationship of text to work, while digital editions have been more focused on the relationship of text to document. He suggests that “a scholarly edition must, so far as it can, illuminate both aspects of the text, both text-as-work and text-as-document” (Robinson 2013, 123), and he calls attention to the presence of reading in any production of text through acts of transcription. The LdoD Archive embodies a similar understanding of the nature of textual semiosis as a process involving self and object in a continuous and co-dependent process of meaning production through acts of reading. Editing Pessoa’s centrifugal and reticular body of unpublished work—a work that is constantly relocating its centre in a network of ever-expanding writing projects—is an especially acute experience of the productive function of reading in activating the material and interpretative fields that allow us to move back and forth from document to text to book to work. To the extent that each text of each edition is contextualizable in an archive of authorial and editorial witnesses, it is the very process of construction of text from document and edition from text that the genetic and social dimensions of the LdoD Archive place in evidence. Our conceptual model reframes established digital archive practices by using the dynamic and social affordances of electronic space itself for simulating literary processes.

9 The object representation of transcriptions is related to the work on data structure for representing multi-version objects (Schmidt and Colomb 2009) because we also parse the TEI files into a customized data structure, in our case a persistent object model that is manipulated in memory to avoid round trips to the databases. We emphasize the need identified by Schlitz and Bodine (2009) to have a clear separation between content and presentation in order to simplify and empower presentation tools. With regard to a Web 2.0 for digital humanities, we are indebted to proposals on cooperative annotations by Tummarello, Morbidoni, and Pierazzo (2005) and the advantages and vision of Web 2.0 and collaboration discussed in Benel and Lejeune (2009), Fraistat and Jones (2009), and Siemens et al. (2010). More recent research work raises the need to have several views of the encoding (Brüning, Henzel, and Pravida 2013). In our approach, different views are also helpful for interoperability (interchange) and to simplify the implementation of user interfaces. The work of Wittern (2013) stresses the need to allow dynamic edition of texts and management of versions, while Muñoz, Viglianti, and Fraistat (2013) highlight the challenges of participatory and distributed encoding in the context of the recently launched Shelley-Godwin Archive.

10 One of the differences from other existing TEI projects is that in the LdoD project TEI is envisioned more as an interchange mechanism (Bauman 2011), through the export functionalities, than as the standard for a fully interoperable service-oriented architecture. As a consequence, we can experiment with technologies that are not usually used in TEI environments. To do this we had to implement TEI import and export functionalities to allow the manipulation of the TEI-encoded information in an object-oriented database. Additionally, the export functionality allows backup in a safe repository in a human-readable format. On the other hand, there is information in the object-oriented database that is specific to a Web 2.0 environment, such as the information required to support access control policies, which is not explicitly encoded using TEI because it depends on the dynamic aspects of the Web 2.0 tools.

3. LdoD Project

11 The main goal of the project is to create a dynamic literary archive: the LdoD Archive—Collaborative Digital Archive of the Book of Disquiet. This archive will aggregate digital facsimiles of all documentary materials of the Livro do Desassossego and loosely topographic textual transcriptions of these materials. Our topographic transcription represents four types of spatial marks in the autographs: line breaks (using the <lb/> element), spacing between paragraphs (using the <space/> element), dividing rulers (using the <milestone/> element), and revision sites (using the @place attribute on <add>). The main goal of the topographic transcription is to facilitate the side-by-side reading of the facsimile and its transcription. The archive will also include textual representations of the four major Portuguese editions published between 1982 and 2013, respectively by Jacinto do Prado Coelho (1982), Teresa Sobral Cunha ([1990–91] 2008), Richard Zenith ([1998] 2012), and Jerónimo Pizarro ([2010] 2013). At this level the archive combines a genetic edition with a social text edition of LdoD, showing this work both as a network of multiple authorial intentions and as a conjectural construction of its successive editors. Readers will engage with the genetic edition when they see Pessoa’s acts of revision in each fragment as well as his different partial plans for the Book of Disquiet. Future annotations will also mark intertextual relations of fragments to Pessoa’s readings, including references to passages, authors, and marginalia in his own personal library. Readers will engage with the social text edition in two ways: first, by seeing actual historical instances of the Book of Disquiet as it has been edited in the four versions selected for inclusion in the archive, and in our own XML-encoded topographic transcription; and second, by creating new virtual instances of the Book of Disquiet understood as a particular selection and organization of fragments. Tools for textual analysis, collation, and annotation will represent the dynamics of the acts of writing and editing in the production and reproduction of LdoD.
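Purely as an illustration (the element choices follow the description above, and the attribute values and wording are invented rather than taken from the archive's files), such spatial marks could be transcribed along these lines:

<p>opening sentence of a paragraph<lb/>
continuing on the next physical line
<add place="above">a revision written above the line</add></p>
<space dim="vertical" unit="lines" quantity="2"/>
<milestone unit="ruler"/>
<p>paragraph following the dividing ruler</p>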

12 Although the extension and detail of the critical apparatus vary considerably across Coelho, Cunha, Zenith, and Pizarro, all of those editions make explicit the interpretative criteria used for selecting and ordering the pieces of text. Our digital representation of the dynamics of textual and bibliographic variation depends on both the XML encoding of variation sites (deletions, additions, substitutions, etc.) and metatextual information concerning authorial and editorial witnesses (such as date, order, or heteronym). While TEI markup may be considered a particular kind of critical apparatus in its own right, it is through visualization tools that users will be able to critically engage with the dynamics of variation in authorial and editorial witnesses. They will be able to examine not only the differences between autograph sources or between manuscripts and print editions, but also the differences among the various print editions.
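For instance, a typical variation site in an autograph can be encoded with the standard TEI transcriptional elements; the wording below is an invented placeholder:

<subst>
  <del rend="overstrike">first version of the phrase</del>
  <add place="above">revised version of the phrase</add>
</subst>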

13 Besides using TEI XML encoding and programming to recreate the history of the authorial and editorial dynamics, the LdoD Archive also explores the simulative potential of the digital medium as a space for virtualizing the Book of Disquiet in ways that will enable users to experiment with the processes of editing and writing in relation to this work. Expert and non-expert users will be able to collaborate by making


their own editions and by adding further textual fragments to the archive. Interventions can take two forms: selecting, ordering, and annotating fragments as part of a user-defined virtual edition; or selecting fragments or parts of fragments as textual seeds for creating variations and extensions based on LdoD as part of a user-defined virtual writing process. This interactive feature of the archive is further enhanced by search and navigation functions that will allow a strong integration between the initial closed set of scholarly materials and the open virtual editing and writing additions.

14 The participatory affordance of the digital medium has two major aspects: an environment for collaboration and social interaction, on the one hand, and the possibility of marking material changes at the level of code, on the other. Material changes can be marked up in the XML encoding, but also as new data and metadata generated by the users’ interaction with the archive, which are stored in the database. We believe that these features of networked computational media can be used to redesign the digital archive as a dynamic environment for different kinds of practice, not limited to research and teaching. A scholarly remediation of LdoD according to tested principles of electronic philology would be part of a larger interactive environment where reading and writing practices around the Book of Disquiet could be socialized within the digital medium itself.5 Aggregation of genetic and editorial witnesses according to criteria defined by readers could also occur within a virtual space that allows users to make critical annotations and write variations based on the fragments. Thus knowledge of textual form and textual transmission would be complemented by experiments with writing processes and bibliographic structures.

4. Problem: The LdoD Archive Model

15 In this section we offer a synthesis of the four functions and three dimensions of our model for a virtual LdoD, as described in Portela and Silva (2014). This description has been diagrammed in figure 1. Our virtual model for LdoD establishes a framework of interactions distributed across four functions: reader-function, editor-function, book-function, and author-function. These functions contain a model of the ecosystem of literary inscriptions. With regard to facsimile surrogates and textual transcriptions, seen here as digital objects available for interaction with expert and non-expert users, we may say that the archive contains three related dimensions: a genetic dimension that allows users to create a narrative of authorial composition; a social dimension that allows users to create a narrative of scholarly editing and textual reception; and a virtual dimension that enables users to explore the possibilities of both writing and editing while interacting with the Book of Disquiet.

16 We should clarify here that “social dimension” is used in this article to refer to the socialization of texts embodied in particular textual and bibliographic forms in the historical archive—a notion derived from social editing theories (McGann 2006). The “virtual dimension” refers to the collaborative additions and interactive transformations of the archive in a web environment. Thus the “virtual dimension” should be understood as a particular expression of the “social dimension” in this constructed electronic reading, editing, and writing space, which can be freely used by either Pessoa experts or non-experts. The final relevance of each virtual edition will result from its popularity—how often it is accessed and used in the construction of


other virtual editions. “Social editing” in the title of this article subsumes both processes of textual socialization, that is, our digital representation of the historical socialization of Pessoa’s LdoD in a set of expert editions (“social dimension”), and our design of a digital platform for furthering textual collaborations (“virtual dimension”). The social/virtual distinction is pertinent from both a design and a theoretical perspective, since we need to distinguish the static and dynamic layers of the archive. On the other hand, we also want to emphasize the social nature of editing processes through a series of built-in affordances in our digital archive.

Figure 1. A functional model of the LdoD Archive.

17 The reader-function supports a contextualized reading of the LdoD fragments. It enables readers to visualize and compare fragments and variations according to several authorial and editorial witnesses. The book-function enables the construction and reconstruction of LdoD based on textual and metatextual information in the archive. The output of these flexible remakings of the LdoD can be either a given form that already exists in the work’s archive, such as the Coelho 1982 or the Pizarro 2010 edition, or a virtual reorganization of the book. The editor-function, in its turn, is meant to provide interpretations of LdoD as a book project based on a pre-existing (but variable and unstable) corpus of documents. This function allows users to produce textual sequences and textual aggregations according to user-defined criteria. It also enables users to add annotations and tags to particular passages. Finally, the author-function enables the extension of LdoD with new texts based on original fragments from LdoD. Thus, a reader of LdoD is able to write a new text derived from fragments of LdoD, becoming an author in the context of a virtual edition. These newly authored fragments can result both from textual remixes of the archive’s content, and from textual additions of user-created content.


18 We can conclude that what is needed is (1) a scholarly archive where experts can study and compare LdoD’s authorial witnesses with their critical editions; and (2) a virtual archive that allows experts and non-experts to experiment with the production of different editions of LdoD, by means of (2a) editing (aggregating, sequencing, annotating, tagging) and (2b) writing (extensions and variations based on Pessoa’s texts).

19 Given the above set of goals, the LdoD Archive has to accommodate scholarly standards and requirements concerning digital archives, for instance the use of TEI as a specification to encode literary texts, and the virtual communities and social software features to support the social edition of LdoD by both other experts and non-experts. This aspect implies additional constraints on how to coordinate the interactions within the archive. Therefore, there is the need for a dynamic archive where users can edit their own versions of LdoD, and write extensions of the original fragments, while the archive’s initial set of experts’ interpretations and analyses of LdoD are kept “unchanged” and clearly separated from the socialized virtual editions and writings (Silva and Portela 2013). This constraint will actually shape the formulation of the solution strategy.

5. Solution Strategy: A Collaborative Archive Framework

20 To address the separation of the experts’ interpretations from the socialized virtual editions and writings we establish the following principles:
• Definition of Expert and Virtual (Social) Editions,6 to explicitly separate the experts’ interpretations, represented in expert editions, from the socializations of the book, represented as virtual editions;
• Coexistence and Separation of Editions, to enable users to use the experts’ editions to build their virtual editions without interfering with expert interpretations. This is achieved by imposing a set of restrictions:
  ◦ Virtual editions are built on top of other expert and virtual editions, and thus use their interpretations;
  ◦ Expert editions do not refer to virtual editions, in order to keep these experts’ interpretations separate from the virtual editions’ extensions;
  ◦ By default only the expert editions are presented, so as to preserve an “official” experts’ archive, which means that users have to explicitly access the virtual editions. This explicit access makes them aware of the existence, and separation, of experts’ and socialized virtual interpretations.

21 These principles are illustrated in figure 2. In the center we show how the archive repository is built using three layers (the three aforementioned dimensions), and their recursive construction. The original set of expert editions (that is, our own topographic transcription and the editions by Coelho, Cunha, Zenith, and Pizarro) instantiates the genetic and social dimensions, while virtual editions are built on them and on other virtual editions. As we can see, although the virtual level of the archive is separate from its scholarly level, the virtual dimension is built on the genetic and the social dimensions. For instance, when reading takes place, it is necessary to show what kind of edition is being read: the project’s topographic transcription of an authorial document (genetic view), one of the four expert editorial versions that are pre-encoded in the archive (social view), or one of an open set of virtual editions (virtual view). Each textual unit of any virtual edition, in its turn, will contain the stemma of its sources, including authorial witnesses, critical editions, and other virtual editions.

Figure 2. Dynamic and static aspects in the LdoD Archive.

22 Figure 2 also enriches the representation of the four functions in figure 1 by describing to what extent they support static and dynamic manipulations of the archive repository. In contrast with most digital literary archives, which are built using XSLT technology to transform TEI representations into HTML for visualization, and where only the reader- and book-functions are dynamic, in the LdoD Archive we allow users to create their own editorial interpretations of the LdoD and to extend LdoD's written fragments using a web interface. Therefore, users can dynamically change the archive repository through its virtual dimension. However, we still continue to support the traditional scholarly work on the genetic and social dimensions, where the project editors do static offline TEI encoding of the LdoD using their XML editor of choice.

6. TEI Encoding

23 Because of the dynamic requirements of the LdoD project, and the strategy to create a collaborative archive as described above, the TEI encoding needs to address the three dimensions. Actually, while the description of the TEI encoding of the genetic and social dimensions is driven by how TEI can be used to express the authorial witnesses and their expert interpretations through the concept of edition, the encoding of the virtual dimension focuses on how the TEI encoding chosen for the genetic and social dimensions can be used to support the creation of a non-predefined number of virtual editions.


24 In Portela and Silva (2014) we describe a UML (Unified Modeling Language) model for a virtual LdoD that supports the project requirements. In this section we show how this model is encoded in TEI.

6.1 Genetic and Social Dimensions

Figure 3. Model of the genetic and social interpretations of LdoD.

25 Figure 3 shows a UML model of the genetic and social interpretations of LdoD. The root concept, which has a single instance, is LdoD, and it contains a set of Fragment instances and several interpretations (FragInter instances) for each fragment. Interpretation is the name we give in the model to refer to a textual instance of a fragment according to one of the sources, authorial or editorial, thus stressing transcription as an interpretive act. These interpretations are encoded in TEI using the <app> element with a <rdg> element for each different transcription. There are two types of interpretations, authorial and editorial. Authorial interpretations, instances of SourceInter, refer to their Source instance, which can be a PrintedSource, Typescript, or Manuscript. On the other hand, editorial interpretations, ExpertEditionInter instances, are contained, and ordered, in the context of expert editions (ExpertEdition instances). Currently, the project is encoding four ExpertEdition instances. Finally, an interpretation contains several TextPortions that represent the use of TEI elements in the fragments’ transcriptions. TextPortions abstract the elements that constitute the transcription, with the objective of capturing their variations.

26 To encode this model in TEI we use the <teiCorpus> element to contain all the fragments and to aggregate in its header the entities that are common to, and can be reused in, each of the fragments. The encoding below shows a part of the TEI corpus header where the different editions are encoded inside a <listBibl> element as a list of <bibl> elements. The TEI encoding of Pessoa’s heteronyms is done within the <profileDesc> and <particDesc> elements, as a <listPerson> element with the @type attribute set to the "heteronyms" value containing one <person> element per heteronym. In the example, the four experts’ editions and the two heteronyms assigned by experts to the LdoD are declared.


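A reduced sketch of such a corpus header is given here; all identifiers and labels are invented for illustration, and the archive's actual values and conventions may differ:

<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>LdoD Archive (sketch)</title></titleStmt>
      <publicationStmt><p>...</p></publicationStmt>
      <sourceDesc>
        <!-- the expert editions, one <bibl> per edition -->
        <listBibl xml:id="ED.CRIT">
          <bibl xml:id="ED.CRIT.C">Coelho edition</bibl>
          <bibl xml:id="ED.CRIT.SC">Sobral Cunha edition</bibl>
          <bibl xml:id="ED.CRIT.Z">Zenith edition</bibl>
          <bibl xml:id="ED.CRIT.P">Pizarro edition</bibl>
        </listBibl>
      </sourceDesc>
    </fileDesc>
    <profileDesc>
      <particDesc>
        <!-- the heteronyms assigned by the experts to LdoD -->
        <listPerson type="heteronyms">
          <person xml:id="HT.BS"><persName>Bernardo Soares</persName></person>
          <person xml:id="HT.VG"><persName>Vicente Guedes</persName></person>
        </listPerson>
      </particDesc>
    </profileDesc>
  </teiHeader>
  <!-- one <TEI> element per fragment follows here -->
</teiCorpus>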

27 For each fragment we use a <TEI> element inside the global LdoD <teiCorpus> element. This element contains a <teiHeader> element where the different interpretations are declared; several <facsimile> elements to declare the fragment’s facsimiles; and a <text> element which contains the interpretations’ transcriptions, represented in figure 3 by TextPortion instances. The encoding example below presents part of the structure of the header for a fragment selected for the four expert editions that has two manuscripts and one printed source.
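By way of illustration only, a reduced sketch of such a fragment might look as follows; identifiers, source descriptions, and readings are invented placeholders rather than the project's actual encoding:

<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:id="Fr001">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Fragment (sketch)</title></titleStmt>
      <publicationStmt><p>...</p></publicationStmt>
      <sourceDesc>
        <!-- material sources of the fragment -->
        <msDesc xml:id="Fr001.SRC.MS.Fr001a">
          <msIdentifier>
            <settlement>Lisbon</settlement>
            <repository>holding institution (placeholder)</repository>
            <idno>call number of the autograph (placeholder)</idno>
          </msIdentifier>
        </msDesc>
        <bibl xml:id="Fr001.SRC.ED.X">printed source of the fragment (placeholder)</bibl>
        <!-- witnesses: authorial (SourceInter) and editorial (ExpertEditionInter) -->
        <listWit>
          <witness xml:id="Fr001.WIT.MS.Fr001a"><ref target="#Fr001.SRC.MS.Fr001a"/></witness>
          <witness xml:id="Fr001.WIT.ED.CRIT.C"><ref target="#ED.CRIT.C"/></witness>
          <witness xml:id="Fr001.WIT.ED.CRIT.Z"><ref target="#ED.CRIT.Z"/></witness>
        </listWit>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <facsimile>
    <graphic url="facsimile-of-the-autograph.jpg"/>
  </facsimile>
  <text>
    <body>
      <p>common text of the fragment
        <app>
          <rdg wit="#Fr001.WIT.MS.Fr001a">reading transcribed from the autograph</rdg>
          <rdg wit="#Fr001.WIT.ED.CRIT.C">reading as printed in one expert edition</rdg>
          <rdg wit="#Fr001.WIT.ED.CRIT.Z">reading as printed in another expert edition</rdg>
        </app>
        and more common text</p>
    </body>
  </text>
</TEI>

In this layout each <rdg> carries the text of the corresponding witness at that variation site, so a single fragment file can serve all authorial and editorial interpretations at once.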



28 The Source concept is implemented within the <sourceDesc> element, and the distinction between Manuscript, Typescript, and PrintedSource is indicated by convention through the structure of the @xml:id attribute value. Note that Manuscript and Typescript are implemented within the <msDesc> element, while PrintedSource is used in the <bibl> element. On the other hand, the FragInter concept is implemented by the <witness> element, and the distinction between editorial (ExpertEditionInter) and authorial (SourceInter) witnesses is also indicated by convention through the structure of the @xml:id attribute value. A <ref> element is used to associate the witness with its source, for SourceInter witnesses, or with its edition, for ExpertEditionInter witnesses. The editions are declared in the corpus header. The editorial contextual information of the fragment (metatextual information) is encoded within the <witDetail> element. Finally, the TextPortion instances depicted in figure 3 are represented within the <body> element and refer to their respective FragInter through the @wit attribute of <rdg> elements, which contain the witness identifier declared in the fragment header.7 As mentioned above, an apparatus is used to distinguish the different readings of the fragments. The <app> element is useful in the context of the editions to represent variations in their readings of the source documents.

29 This approach allows us to associate interpretation metadata in the context of each witness. Users will be able to compare digital facsimile representations of authorial documents (and topographic transcriptions of those documents) to editorial transcriptions. The latter can also be compared against each other in order to highlight their interpretations of the source.

30 When a TEI-encoded file for an LdoD fragment is uploaded to the system, it is parsed, and if it does not contain any errors, a new Fragment instance is created and associated with a new instance of FragInter for each different transcription of the text. In addition to the verification of the TEI syntax, the parser does a semantic verification of the encoded fragment. For instance, it verifies the existence of the entities referenced by @xml:id attributes. This supports the encoding work because it allows the early detection of errors.

6.2 Virtual Dimension


Figure 4. Model of the virtual interpretations of LdoD.

31 Figure 4 describes how the model for the genetic and social interpretations is enriched to support virtual interpretations. Three new concepts are introduced: VirtualEdition, which represents virtual editions; VirtualEditionInter, which represents the interpretations of fragments in the context of virtual editions; and VirtualHeteronym, which represents the new heteronyms that end users may create.

32 These three new concepts implement the principle of the separation of expert and virtual editions. To support the coexistence of expert and virtual editions, we define an optional association between a VirtualEdition and an Edition and a mandatory association between a VirtualEditionInter and a FragInter. The former association means that all the interpretations of the virtual edition use the interpretations of the source edition. Therefore, this association is optional at the edition level because we allow the creation of a virtual edition that uses interpretations of fragments from different source editions. The association between a VirtualEditionInter and a FragInter is mandatory because virtual interpretations of fragments should be based on an existing interpretation—either another virtual interpretation or an authorial or editorial interpretation. Ultimately, the transcriptions in a virtual interpretation are identical to either an authorial or an editorial transcription. Note that a virtual interpretation comprises one transcription, either authorial or editorial, and a set of tags and annotations on it.

33 The semantics of this relationship is defined by the @usePolicy attribute, which can take two values: "import" and "inherit". When the "import" value is used, the virtual interpretation of the fragment is built on top of the source interpretation and can change it. Any further change made to the source interpretation does not affect the virtual interpretation. The source interpretation is copied, at virtual interpretation creation time, into the virtual interpretation context and can be freely changed by the user. When the "inherit" policy is used, the source interpretation is extended in the virtual interpretation and cannot be changed. For example, if the source interpretation is changed by the source community, then the changes are propagated to its virtual interpretations. These policies only apply to the tags and annotations associated with the virtual edition, because, currently, the archive does not allow any changes to the authorial and editorial transcriptions.

34 The TEI encoding of the genetic and social interpretations is extended to support the virtual interpretations. Therefore, the virtual editions are declared in the corpus header as a new <listBibl> element containing a <bibl> element for each new virtual edition. However, the distinction between the set of critical editions and the set of virtual editions is made by convention through the @xml:id attribute value. Virtual heteronyms are similarly declared in the corpus header. As regards virtual interpretations, they are encoded in the fragment header using <witness> elements and a convention based on the @xml:id attribute value to distinguish them from the other editorial interpretations, namely the experts’ interpretations. Additionally, the editorial contextual information (metatextual information) of the virtual fragment contains, among other information, a reference to the source interpretation and the fragment order in the virtual edition.
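A possible rendering, purely for illustration (the identifiers, and especially the placement of the @usePolicy attribute, are assumptions rather than the archive's documented conventions):

<!-- corpus header: virtual editions declared alongside the expert editions -->
<listBibl xml:id="ED.VIRT">
  <bibl xml:id="ED.VIRT.MyEdition">a user-created virtual edition (placeholder)</bibl>
</listBibl>

<!-- fragment header: a virtual interpretation pointing to its source interpretation -->
<witness xml:id="Fr001.WIT.ED.VIRT.MyEdition">
  <ref target="#Fr001.WIT.ED.CRIT.Z" usePolicy="inherit"/>
</witness>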

35 Following the proposed TEI encoding we are able to implement the solution principles for the LdoD project identified in section 4. However, some of the distinctions associated with the separation principle, or the references between virtual interpretations and their source interpretations, are set by convention and require automatic tools to be aware of them. On the other hand, TEI does not support the encoding of some information that is required in a Web 2.0 application, namely access control information. For instance, we cannot express in TEI whether a virtual edition is private or public, and which specific users may access the virtual edition. We intend, as one of the final results of the project, to define a TEI customization that accommodates all the identified aspects that cannot be expressed using the TEI core.

36 There are some other open issues. The main problem is related to the dynamic evolution of the archive in terms of Web 2.0 requirements: how can TEI code be changed as a result of users’ interactions with the archive? Note that the traditional approach to encoding in TEI is done statically, through tools like oXygen. However, we want to support the evolution of LdoD as a continuously reeditable and rewritable book. This means that it is necessary to enable the dynamic addition of new virtual editions and heteronyms in the Corpus and the addition of new fragments that extend the original ones. Additionally, users can define their own interpretation of any of LdoD’s fragments, for instance by using tags, which results in the generation of new editions of the book through the execution of a categorization algorithm. This open issue is addressed by the software architecture we propose in the next section.

7. Architecture

37 Most digital scholarly archives are static. By static we mean that the construction of the archive is separated from its use. The former is done using TEI and XML editors, and the latter is supported by XSLT transformations. This software architectural approach is not feasible if we want to provide Web 2.0 functionality to the archive. Since we do not want to disregard the existing practice of encoding in TEI by expert users, the architecture needs to support the traditional encoding in TEI by the experts while


enabling dynamic user interactions with the platform, as highlighted in the static and dynamic aspects of figure 2.

38 Figure 5 presents a component-and-connector view of the LdoD Archive software architecture using the repository and the client-server architectural styles. Components are depicted using UML component instances; connectors are represented by linking UML component ports. The port roles are labelled according to the architectural style that the connector implements: that is, the roles between the :Browser and :LdoD Application Server components denote the client-server architectural style.

Figure 5. LdoD Archive Software Architecture.

39 Traditional TEI encoding, where scholars use an XML editor of choice statically, is represented in figure 5 by the interactions between components :TEI Editor Tool and :LdoD TEI File Repository. However, instead of supporting the presentation through XSLT transformation of files in the repository, we provide an importer component (:LdoD TEI File Importer) that loads the TEI-encoded files into an object-oriented database (:LdoD Object-Oriented Repository). This database contains the object model described in Portela and Silva (2014). For the support of Web 2.0 interactions, the :LdoD Application Server component provides a web client-server interface that users, experts and non-experts alike, use to interact with the object-oriented repository.8 Therefore, the changes occur on the object-oriented representation of the TEI-encoded files. Finally, in order to preserve the TEI interchange qualities of the archive, the :LdoD TEI File Export component allows the regeneration of TEI-encoded files from the object-oriented repository. Note that this component allows the selection of which parts of the repository to generate and the strategy of the encoding. For instance, it is possible to choose only a subset of the interpretations, and decide between different methods for linking critical apparatus to the text. Since the dynamic write interactions can only occur in the context of a virtual edition, it is possible to export the original TEI-encoded files as they were imported, except for their formatting, which is not stored when files are uploaded and parsed.

40 When comparing the software architecture in figure 5 with figure 2, we see that the static aspects of the author and editor functions occur through the :TEI Editor Tool component and all the other dynamic aspects occur through the :Browser component.


41 The key point of the proposed architecture is the use of an object domain model to represent the LdoD archive. Using this approach we first transform the TEI-encoded LdoD into the object model, and then allow the visualization and editing of this object model through a web user interface. Additionally, TEI files can be regenerated from the object model. This approach has several advantages: (1) the archive's experts continue using editor tools like oXygen to do their work; (2) users (experts and non-experts) can create their virtual editions and fragment extensions through the web user interface; (3) the object model preserves a semantically consistent LdoD archive by checking the consistency of users' operations; (4) interoperability (interchange) can be supported by exporting the regenerated TEI files; (5) it is possible to customize the generation of TEI files.

42 This architecture is implemented by a full-fledged prototype that the LdoD encoders use on a daily basis to see the result of their static encoding. Additionally, we are starting to plan experiments to study how users interact with the repository using the dynamic Web 2.0 features. The prototype is implemented in the Java programming language, using the Spring MVC framework9 for server-side interaction, Bootstrap10 and jQuery11 for client-side interactions, and the Fénix Framework12 to support a transactional and persistent domain model.

8. Conclusion

43 The specific correlation of static and dynamic goals in the LdoD Digital Archive means that our emphasis falls on open changes that feed back into the archive. The TEI encoding and software design implications of this project make us address both the conceptual aspects of TEI schemas for modeling texts and documents, and the processing problems posed by user-oriented virtualization of Pessoa’s writing and bibliographic imagination.

44 In this article we have presented how to encode LdoD editions and fragments in TEI,13 as well as the software architecture of a Web 2.0 environment into which the encoded fragments are fed. These two approaches create an environment where experts and non-experts can collaborate in four roles: reader, book, editor, and author. The environment separates the different contributions while allowing certain levels of information-sharing that enhance collaboration and the continuous enrichment of the repository.

45 In this phase we have already implemented a full-fledged prototype that is used on a daily basis by the TEI encoders. Currently we are extending the architecture and prototype with more affordances for the editor and author functions. Meanwhile we intend to start experiments with end users to assess the dynamic aspects of the environment and study how to foster the creation of communities around a virtual LdoD.


BIBLIOGRAPHY

Barney, Brett. 2012. “Digital Editing with the TEI Yesterday, Today, and Tomorrow.” Textual Cultures 7(1): 29–41.

Bauman, Syd. 2011. “Interchange vs. Interoperability.” In Proceedings of Balisage: The Markup Conference 2011. Balisage Series on Markup Technologies, vol. 7. doi:10.4242/BalisageVol7.Bauman01.

Benel, Aurelien, and Christophe Lejeune. 2009. “Humanities 2.0: Documents, Interpretation and Intersubjectivity in the Digital Age.” International Journal of Web Based Communities 5(4): 562–76. doi:10.1504/ijwbc.2009.028090.

Brüning, Gerrit, Katrin Henzel, and Dietmar Pravida. 2013. “Multiple Encoding in Genetic Editions: The Case of ‘Faust’.” Journal of the Text Encoding Initiative 4 (March). http://jtei.revues.org/697. doi:10.4000/jtei.697.

Coelho, Jacinto do Prado, ed. 1982. Fernando Pessoa, Livro do Desassossego. 2 vols. Lisboa: Ática.

Cunha, Teresa Sobral, ed. 1990–91. Fernando Pessoa, Livro do Desassossego. 2 vols. Lisboa: Relógio d’Água.

———. 2008. Fernando Pessoa, Livro do Desassossego. Lisboa: Relógio d’Água.

———. 2013. Fernando Pessoa, Livro do Desassossego. Lisboa: Relógio d’Água.

Earhart, Amy E. 2012. “The Digital Edition and the Digital Humanities.” Textual Cultures 7(1): 18–28.

Fraistat, Neil, and Steven E. Jones. 2009. “Editing Environments: The Architecture of Electronic Texts.” Literary and Linguistic Computing 24(1): 9–18. doi:10.1093/llc/fqn032.

McGann, Jerome. 2006. “From Text to Work: Digital Tools and the Emergence of the Social Text.” Text 16: 49–62.

Muñoz, Trevor, Raffaele Viglianti, and Neil Fraistat. 2013. “Texts and Documents: New Challenges for TEI Interchange and the Possibilities for Participatory Archives.” In The Linked TEI: Text Encoding in the Web. Book of Abstracts. Abstracts of the TEI Conference and Members Meeting 2013, edited by Fabio Ciotti and Arianna Ciula, 91–96. Rome: DIGILAB Sapienza University and TEI Consortium. http://digilab2.let.uniroma1.it/teiconf2013/wp-content/uploads/2013/09/book-abstracts.pdf.

Pizarro, Jerónimo, ed. 2010. Fernando Pessoa, Livro do Desasocego. 2 vols. Lisboa: Imprensa Nacional-Casa da Moeda.

———. 2013. Fernando Pessoa, Livro do Desassossego. Lisboa: Tinta-da-China.

Portela, Manuel, and António Rito Silva. 2014. “A Model for a Virtual LdoD.” Digital Scholarship in the Humanities March 2014. doi:10.1093/llc/fqu004.

Robinson, Peter. 2013. “Towards a Theory of Digital Editions.” Variants 10: 105–31.

Schlitz, Stephanie A., and Garrick S. Bodine. 2009. “The TEIViewer: Facilitating the Transition from XML to Web Display.” Literary and Linguistic Computing 24(3): 339–46. doi:10.1093/llc/fqp022.

Schmidt, Desmond, and Robert Colomb. 2009. “A Data Structure for Representing Multi-version Texts Online.” International Journal of Human-Computer Studies 67(6): 497–514. doi:10.1016/j.ijhcs.2009.02.001.


Siemens, Ray, Mike Elkink, Alastair McColl, Karin Armstrong, James Dixon, Angelsea Saby, Brett D. Hirsch, Cara Leitch, et al. 2010. “Underpinnings of the Social Edition? A Narrative, 2004–9, for the Renaissance English Knowledgebase (REKn) and Professional Reading Environment (PReE) Projects.” Online Humanities Scholarship: The Shape of Things to Come. Proceedings of the Mellon Foundation Online Humanities Conference at the University of Virginia, March 26–28, 2010, edited by Jerome McGann, Andrew M. Stauffer, Dana Wheeles, and Michael Pickard, 401–60. Houston, TX: Rice University Press.

Silva, António Rito, and Manuel Portela. 2013. “Social Edition 4 The Book of Disquiet: The Disquiet of Experts with Common Users.” In ECSCW 2013 Adjunct Proceedings: The 13th European Conference on Computer-Supported Cooperative Work, edited by Matthias Korn, Tommaso Colombino, and Myriam Lewkowicz, 45–50. Aarhus, Denmark: Department of Computer Science, Aarhus University. http://cs.au.dk/~mkorn/ECSCW_2013_Adjunct_Proceedings-web.pdf.

TEI Consortium. 2012. TEI P5: Guidelines for Electronic Text Encoding and Interchange. Version 2.5.0. Last updated October 25. N.p.: TEI Consortium. http://www.tei-c.org/Vault/P5/2.2.0/doc/tei-p5-doc/en/html/.

Tummarello, Giovanni, Christian Morbidoni, and Elena Pierazzo. 2005. “Toward Textual Encoding Based on RDF.” In From Author to Reader: Challenges for the Digital Content Chain: Proceedings of the 9th ICCC International Conference on Electronic Publishing, 57–63. Leuven-Heverlee, Belgium: Peeters Publishing Leuven. http://elpub.scix.net/data/works/att/206elpub2005.content.pdf.

Vanhoutte, Edward. 2006. “Prose Fiction and Modern Manuscripts: Limitations and Possibilities of Text-Encoding for Electronic Editions.” In Electronic Textual Editing, edited by Lou Burnard, Katherine O’Brien O’Keeffe, and John Unsworth, 161–80. New York: Modern Language Association of America.

Wittern, Christian. 2013. “Beyond TEI: Returning the Text to the Reader.” Journal of the Text Encoding Initiative 4 (March). http://jtei.revues.org/691. doi:10.4000/jtei.691.

Zenith, Richard, ed. [1998] 2012. Fernando Pessoa, Livro do Desassossego. Lisboa: Assírio & Alvim.

NOTES

1. The concept “heteronym” was developed by Fernando Pessoa himself – heteronyms are fictional authors who have a specific writing style and a unique psychology. Many of Pessoa’s works were written under a heteronym. In the case of the Book of Disquiet, the work was assigned by Pessoa to the heteronym Vicente Guedes during the first stage of writing, and to Bernardo Soares during the later stage. One editor has assigned this work to both Vicente Guedes and Bernardo Soares; two editors have assigned it to Bernardo Soares; and one editor has assigned it to Fernando Pessoa.

2. “No Problem Has a Solution: A Digital Archive of the Book of Disquiet,” a research project of the Centre for Portuguese Literature at the University of Coimbra, Portugal, is funded by FCT (Foundation for Science and Technology). Principal investigator: Manuel Portela. Reference: PTDC/CLE-LLI/118713/2010. Co-funded by FEDER (European Regional Development Fund), through Axis 1 of the Operational Competitiveness Program (POFC) of the National Strategic Framework (QREN). COMPETE: FCOMP-01-0124-FEDER-019715.

3. During the last three decades, the Livro do Desassossego has been translated several times, and new translations continue to appear in Spanish, French, English, Italian, German, Dutch, Swedish, Polish, and several other languages. LdoD is now regarded by many critics as one of the major achievements of Fernando Pessoa and one of the major works of European modernism. In Portugal, two new revised editions (Sobral Cunha 2013; Pizarro 2013) were published in 2013, and excerpts from LdoD were recently selected for inclusion in the secondary school syllabus. Pessoa’s heteronyms, such as Álvaro de Campos, Alberto Caeiro, Ricardo Reis, and Bernardo Soares, are among the most quoted authors in Brazilian and Portuguese blogs. A feature film based on this unfinished book, Filme do Desassossego, directed by João Botelho, was released to critical acclaim in 2010.

4. Our XML encoding of what we refer to as “authorial” and “editorial” witnesses is in itself a new edition of LdoD. The distinction that we make in this article between “authorial” and “editorial” serves the rhetorical purpose of showing two types of relation between our XML encoding and its source texts: in the case of “authorial” witnesses we are referring to our topographic transcription of autograph documents, which will also be accessible as digital image facsimiles; in the case of “editorial” witnesses we are referring to our textual transcription of LdoD texts as they have been transcribed and organized in four major editions published between 1982 and 2013. It can be argued that our XML topographic transcription of the “authorial” witnesses is of course a fifth editorial witness of the LdoD, since it cannot be considered an authorial source. It must always be a particular reading and representation of that source, even if we intend to give readers a less mediated text by placing it in the context of the document facsimiles. It should, however, be noted that, at this level, we are not proposing a specific selection or organization for the fragments other than the semi-arbitrary shelf marks of the National Library. In fact, the fragments included for representation in the archive equal the total sum of the fragments included in those four editions. Our topographic transcription could be described as a new implicit edition by the research team focused on representing revision layers and describing material details of the source documents. An explicit edition by the research team, with its own selection and organization rationale, will take place only at the virtual level of the archive.

5. A second stage of development of this project (2016–2018) will add a corpus of reception texts (reviews and essays) that will be fully integrated with the Book of Disquiet fragments. This new stage will also develop software tools—including text editors, textual generators and other e-lit tools—for enabling writing interactions with the archive’s textual database. Six applications for mobile media, called “Machines of Disquiet,” are currently being tested.

6. Although we refer to expert and virtual editions, the latter may also be constructed by Pessoa experts. “Expert editions” refers to the four printed editions published prior to the existence of the LdoD Archive, and also to the project’s topographic transcription.

7. Although we are not, at this stage of the project, creating a critical apparatus of variants for these four major editions, the possibility remains open for such a representation in the future, if resources permit. Except for Prado Coelho, whose 1982 edition was not reedited (he died soon after publication), the other three editors have never reissued LdoD in exactly the same form. Every new publication becomes an occasion for revision. There are now more than fifteen editions that may be considered variations of those three sets. Our copyright agreement with the LdoD editors is that we encode in the archive the most recent version at the time of work.


8. The interactions with the repository are done on top of the object model described in figures 3 and 4. It is possible to generate TEI-encoded files from the repository. On the other hand, the system allows tagging and annotation of transcriptions but does not allow them to be changed.

9. Spring MVC framework: http://spring.io/.

10. Bootstrap: http://getbootstrap.com/.

11. JQuery: http://jquery.com/.

12. Fénix Framework: http://fenix-framework.github.io/.

13. A complete example of an encoded fragment is included in appendix 1.

ABSTRACTS

In this article we describe how textual encoding is used in our current project of constructing a digital archive of Fernando Pessoa’s Livro do Desassossego [LdoD]. Our model for virtualizing the authorial and editorial forms of this unfinished work aims to create a dynamic archive that is open to user interaction and collaboration. A brief introduction to the theoretical rationale of the archive is followed by a description of our technical solution for TEI XML encoding that is responsive to dynamic changes over time. With our software architecture proposal for processing TEI markup, the Collaborative Digital Archive of the Book of Disquiet will be able to instantiate the cooperative and social editing functionalities of Web 2.0 environments.

INDEX

Keywords: Fernando Pessoa, Book of Disquiet, digital archive, Web 2.0, collaborative archive

AUTHORS

ANTÓNIO RITO SILVA António Rito Silva is associate professor in the Department of Computer Science and Engineering, University of Lisbon, and researcher at INESC-ID. He has worked in the field of collaborative systems and social software, particularly in the domain of business process tools that blend the roles of producer and consumer. In this project he intends to apply some of these techniques for connecting the roles of producer and reader of the literary work.

MANUEL PORTELA Manuel Portela is assistant professor with Habilitation in the Department of Languages, Literatures and Cultures, University of Coimbra, Portugal, where he directs the Doctoral Program in Advanced Studies in the Materialities of Literature. He is also a researcher at the Centre for Portuguese Literature at the University of Coimbra, and the principal investigator of the project No Problem Has a Solution: A Digital Archive of the Book of Disquiet (2012–15). His latest book is Scripting Reading Motions: The Codex and the Computer as Self-Reflexive Machines (MIT Press, 2013).


Bridging the Gap: Greater Usability for TEI encoding

Stefan Dumont and Martin Fechner

1. The Situation

1.1 A Gap Between User and Possibility

1 For a reader of jTEI it might be self-evident that in this day and age a historical-critical edition should be created with established technologies (e.g., XML). The benefits are well known: texts encoded in TEI-XML can be used to produce both Web and print publications and at the same time ensure long-term accessibility and reusability. Furthermore, both XML and TEI are well-established technologies—the TEI Guidelines having been in use for over 25 years. Nevertheless, in Germany as well as internationally, edition projects continue to use programs such as Microsoft Word to edit their texts. These texts are used to create print editions and may be digitized in a simple form such as PDF.

2 There is thus a gap between the available, useful technologies and their actual adoption. In many cases the cause is economic. Academic institutions and publishing houses are reluctant to alter their “tried-and-true” workflows, in part because any changes cost time and money. Moreover, the know-how needed to encode transcriptions correctly in TEI is often not available. An important additional factor is that despite the many advantages of a Web publication, the printed edition—especially for veteran edition projects—remains the benchmark for what serious results should look like. Thus a new technology must justify itself not only economically, but academically as well (see Sahle 2013, 109–10).

3 Another cause of this gap is the dearth of programs to help researchers who lack the technical knowledge to transcribe and annotate their texts in TEI. While other prevalent text-editing programs offer an intuitive user interface, this is not the case for most XML editors.1 A factor contributing to the lack of such a user interface could be the high complexity of a TEI-encoded edition.


1.2 The Bridge: Greater Usability

4 The solution that is repeatedly suggested—that researchers should learn TEI-XML or even learn how to program—does not seem viable, and in any case simply shifts the problem onto the user.2 A lack of knowledge of how to encode a manuscript in TEI should not hinder the use of TEI. On the contrary, an easy way of adopting TEI should be made available. The center of editorial work should remain the evaluation and examination of manuscripts.3

5 When one considers the developments in graphical user interfaces over the last few years, one observes that programs are offering simpler and more intuitive interfaces, even though the tasks and functions behind them are sometimes more complex. It is not without reason that there are now entire fields that concern themselves with the “ergonomics” of software.

6 In our opinion, the Digital Humanities community should not avoid such developments but instead should take advantage of them. A simple and intuitive user interface would help reduce the fear and frustration that often meets new technologies (Nielsen 1993, 24–25).4 User-friendly programs create a positive user experience and would increase interest in using TEI for the transcription and annotation of manuscripts.

7 A fundamental part of user-friendly software is providing the user with visual feedback for an executed action (Nielsen 1993, 134). For example, after the appropriate markup for a crossed-out piece of text is entered, this text can then be displayed as crossed out. Although this type of feedback might seem trivial, another type certainly is not: print output. As mentioned above, the printed edition is still the main goal for most scholarly edition projects, even when it is supplemented with a Web publication (Sahle 2013, 111–112). When the formatting process begins only after the encoding is completed, the researcher can check their work only in an abstract manner. When the publication and creation of the edition are not separated, but happen simultaneously in one work environment, the researcher has the benefit of being able to see, with one click, his or her markup displayed properly in the desired medium. This not only improves the supervision of the work in progress, but also increases trust in and satisfaction with the technology—in this case the TEI encoding of manuscripts.

2. The Pilot Project

8 TELOTA,5 the digital humanities initiative at the Berlin-Brandenburg Academy of Sciences and Humanities (BBAW),6 had to confront these issues as they were given the task to develop a digital work environment for the edition project Schleiermacher in Berlin 1808–1834.7 This environment was intended to assist the researchers in transcribing, encoding, and annotating the letters, lectures, and daily calendar of the Protestant theologian Friedrich Schleiermacher. In addition to creating a user interface for transcribing and editing, the environment had to be able to produce both a Web and a print edition from the TEI-XML data. Ideally this would happen automatically, so that the editors would be able to see their results immediately.


9 To develop such a solution from scratch requires significant resources. To reduce the time and effort of development it is preferable to build on existing software. After evaluating the available software, the team chose the following programs:
• eXistdb8 for the database in which the TEI files would be stored. A decisive factor was the ability to retrieve and run XQuery and XSLT scripts from within eXistdb and not just externally.
• Oxygen XML Author9 as the user interface for entering and editing the centrally stored TEI files. Decisive were its extensive functions, which allow visual editing of XML by persons without technical knowledge.
• ConTeXt10 for the formatting language to generate an automatic print-ready proof from the XML files. Among its benefits were its use of the long-established formatting language TeX, the possibility of processing XML directly, and the option to add one’s own functions with the help of the programming language Lua.

10 Although eXistdb and ConTeXt are open-source software, this is not the case for Oxygen XML Author. A completely open-source solution would have been preferable, but there is no software currently available that offers such extensive functions for end users.

11 Because of the solution’s modular organization, the software components can be replaced if necessary (for example when a tool is no longer supported by the community or a company). Essentially every piece of software can and will become outdated at some point; what is important is that the data is in a format that can still be reopened and reused.

Figure 1. Workflow in the digital work environment.

12 The three selected software components were combined, and scripts and additional functions were written. The goal was to create a software solution that was not just generally suitable for the research project, but customized to increase the user-friendliness of the work environment despite the complexity behind a critical edition. Besides the creation of multiple TEI P5–compatible schemata, the functions and toolbar of the Oxygen XML Author were also configured and supplemented with additional custom functions. XQuery and XSLT scripts were written for the Web publication, and its presentation was designed in HTML and CSS. The automatic generation of the print edition was programmed with the formatting language ConTeXt.

2.1 A Central Database Enables Collaborative Work

13 The digital work environment uses the open-source XML database eXistdb as its central repository for the TEI documents. The database is installed on a server and available online. This gives all the researchers on the project access to the same data collection, allowing them to work collaboratively. The contents of the database are directly available and searchable for users as a collection of files in Oxygen XML Author. Collaborative access is made possible through the WebDAV protocol, which locks files that are in use to avoid conflicts and unintended overwrites.

2.2 A User-Friendly Environment with Oxygen XML Author

14 Oxygen XML Author was chosen to create the user interface with which the editors transcribe and edit in the work environment.11 While using this program, the editors usually do not see the code, but instead work in a user-friendly “author view” that is styled with CSS. Several sub-views were created for the project within this author view, so that editors can choose the one best suited to their current phase of work. In the various views, only certain TEI markup is shown or highlighted in the transcription. For example, when an editor wants to examine only the indexed passages of a text, he or she can choose the view that highlights only those elements.

Figure 2. Toolbar for the Schleiermacher edition with functions for inserting TEI markup.

15 Above all, however, the end user can enter markup into the manuscript or into the TEI header with one click in the toolbar. In this way, text phenomena like struck-out text can be documented, editorial comments can be added, and the creation date of a manuscript can be noted. For every type of editorial markup there is a special button that enters the appropriate valid XML. If needed, not only single elements but complex XML fragments are inserted. Furthermore, the buttons for the metadata insert the proper elements in the correct place in the TEI header. If necessary, the editor is asked via an extra dialog box for the value of an attribute. The editor is thus not concerned with the name of the actual TEI element, whether it needs an attribute, or whether in this context there are structural variations to be considered. The entire text can thus be quickly and simply marked up with TEI-conformant XML. The researcher can also, if desired, view the XML code at any time for more precise editing or simply to follow the process more closely.

2.3 Central Indexes

16 Besides the transcription of the text, the database can also include central indexes (for persons, places, works, etc.) in the form of XML files. These can be edited and linked to from the texts. If, for example, one wishes to mark up a person’s name, one selects the word in the text to be indexed, and a dropdown list appears with all the names from the person index, from which the editor can choose the appropriate one. Selecting the name causes both the correct element and the correct ID of the person (as an attribute) to be entered. The index with the actual references to the correct text passages is then automatically created for both the print and Web publications from the marked-up TEI document. This index function is created through a specially programmed Java operation for Oxygen XML Author and with various XQuery scripts in the database. When programming the Java operation, we used parameters rather than constants in the source code. Thus, the parameter values can be conveniently set in Oxygen XML Author and changed if necessary.
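A schematic sketch of such a parameterized operation is given below. The interfaces are hypothetical stand-ins for the editor framework (the production code is written against Oxygen XML Author’s Java API), and the default element and attribute names are illustrative assumptions.

import java.util.Map;

public class InsertPersonReferenceOperation {

    /** Hypothetical, minimal editor abstraction: selection access and fragment insertion. */
    public interface EditorAccess {
        String getSelectedText();
        void surroundSelectionWithFragment(String xmlFragment);
    }

    // Configured in the framework definition rather than hard-coded in the source.
    private final Map<String, String> parameters;

    public InsertPersonReferenceOperation(Map<String, String> parameters) {
        this.parameters = parameters;
    }

    public void doOperation(EditorAccess editor, String selectedPersonId) {
        // Element and attribute names are read from parameters so that the same
        // operation can be reused by projects with different schemata.
        String element = parameters.getOrDefault("element", "persName");
        String attribute = parameters.getOrDefault("attribute", "key");
        String fragment = "<" + element + " " + attribute + "=\"" + selectedPersonId + "\">"
                + editor.getSelectedText() + "</" + element + ">";
        editor.surroundSelectionWithFragment(fragment);
    }
}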

Figure 3. Customized Oxygen XML Author with the index function’s dialogue box opened.

2.4 Website

17 In addition to the work environment in Oxygen XML Author, we also created a website for the project. On this website the researchers can easily browse through or search the current data collection through a live connection with the database. Access to the website can be either limited to the project team or open to the public.


Figure 4. Website for the Schleiermacher edition.

18 From a technical viewpoint, the website consists primarily of XQuery scripts, whose request results are presented as HTML5, through an XSL transformation. The scripts and transformations are requested from eXist through a REST interface and processed with the eXistdb internal parser—a great advantage of this database. The website is thus generated completely server-side.
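For illustration, such a server-side query can be triggered from any HTTP client; the following Java sketch sends a request to eXistdb’s REST interface and prints the returned markup. The server URL, collection path, and query are assumptions for the example, not the project’s actual resources.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class ExistRestClient {
    public static void main(String[] args) throws Exception {
        // Hypothetical collection; eXistdb executes the _query parameter against it.
        String base = "http://localhost:8080/exist/rest/db/letters";
        String xquery = "count(collection('/db/letters')//*:TEI)";
        String url = base + "?_query=" + URLEncoder.encode(xquery, StandardCharsets.UTF_8.name())
                + "&_wrap=no";

        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // raw result returned by the server
            }
        }
    }
}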

2.5 Print Edition

19 A further output option is the print edition, implemented with the help of ConTeXt, which automatically generates a PDF (at any point in the workflow) from a TEI document. With the correct configuration, the format and presentation of the PDF can meet the research project’s exact needs for printed editions. Each TEI element is given a specific formatting command through a configuration file. In this way the different apparatuses can, for example, appear as footnotes that refer to the main text with the help of line numbers and lemmata. The print edition can also provide the correct index for each transcription and resolve any cross-references between manuscripts.


Figure 5. Print edition for the Schleiermacher edition generated via ConTeXt.

20 The print edition helps the researchers to check their TEI-encoded transcriptions; additionally, it obviates the need for a separate typesetting process.

2.6 Feedback

21 The researchers of Schleiermacher in Berlin 1808–1834 had a two-hour training session introducing them to the work environment. They were also given an extensive handbook, a quick start guide, and a “cheat sheet” with an overview of all the available functions.12

22 After working with the environment for two months, the researchers were asked to share their opinions. The feedback was overwhelmingly positive, and they confirmed that the environment eased their work on the edition and saved them time. The ability to check their work immediately on the website or in the print edition was seen as a great advantage. The researchers were also thankful that they were not expected to work directly in the XML code, but instead could mark up the texts with TEI-conformant XML in a simple graphical user interface. At the same time, some researchers had familiarized themselves with the TEI XML behind the interface and did sometimes edit directly in the XML, especially in the beginning when some functions were not yet available.

3. From the Bottom Up

23 In the end the pilot run was so successful that other projects within the Berlin-Brandenburg Academy of Sciences and Humanities expressed interest. In the spring of 2013 the work environment was implemented for the Academy’s project Commentaria in Aristotelem Graeca et Byzantina (CAGB)13 and in the summer of 2013 for Regesta Imperii—Friedrich III.14 This was done using the same technical concept, but with customized functions for text entry and for the publication (Web and print) in each project, because the TEI schemata were in both cases very different from the pilot project. For example, CAGB is not an edition of manuscript transcriptions but of manuscript descriptions. In contrast, the time and effort for implementation would be minimal if one were using a schema from a project for which the work environment is already implemented: the amount of work does not depend on the software or technology, but on the complexity and diversity of TEI-encoded editions.

24 With every new implementation, the work environment developed further and was supplemented with new functions. This applied not just to the Java operations but also to the XQuery scripts, which are responsible for various fundamental tasks such as making the index file available for the Java operations. For these frequently used XQuery and XSLT scripts and for further configurations, an eXistdb application15 was developed in 2013 that allows for a simple and fast installation into the database.16 New functions are then, when suitable, integrated into the existing work environment, so that its operability continually improves after launch.

25 After we published a report about ediarum,17 it caught the attention of other research institutions. During this time TELOTA introduced and explained the environment to interested institutions in various presentations. As a result, for example, the Academy of Sciences and Literature Mainz adopted the concept of ediarum and built its own digital work environment based on it. Other interested persons can read about the concept in a series of blog entries that are written as tutorials.18 In addition, the custom Java functions were made available by TELOTA on GitHub for the TEI community.19 There the functions can be downloaded and used directly in Oxygen Frameworks or, if necessary, changed in accordance with the GNU LGPL (Lesser General Public License).

26 At the end of 2013, TELOTA began implementing ediarum for the historical-critical edition of Jeremias Gotthelf.20 In this collaboration with Bern University, new functions are being developed and existing ones improved. For instance, we plan to build a function that allows end users to insert overlapping markup (using, e.g., elements with @spanTo) in Oxygen XML Author.

4. Summary and Prospects

27 The bottom-up development of ediarum with its strong focus on user-friendliness brings with it certain implications.

28 It has required a lot of work during the initial implementation and customization of the environment, especially to generate the website and print edition. TELOTA is now considering simplifying the creation of the Web and print editions, so that less time is needed for programming. The goal, however, will not be to develop a complete plug-and-play solution. That would be impossible, considering the complexity and diversity of critical editions. The aim for a future version of ediarum will rather be to create a solution for encoding editions in TEI that involves the lowest possible amount of programming time and effort while being customized to the specific needs of a project.


29 The concept of ediarum’s development, with its tailored solution, has various advantages:
• Time and effort for programming are spared through the consistent use of existing software and the elimination of extraneous functionality not required by a particular project. In this way a functional software solution is reached more quickly and practically.
• The work environment is customized to the exact needs of the project. Due to the user-friendly interface, researchers enjoy operating it and use it in their daily work.
• The automatic generation of the Web and print edition from the XML file allows researchers to immediately check their TEI encoding. This eases the publication process, leaving more time for the actual editorial work.

30 Thus, through addressing the needs of a specific project, we found a solution that can be used by many: ediarum helps bridge the gap between hesitant users and the many possibilities and advantages of TEI encoding.

BIBLIOGRAPHY

Burghart, Marjorie, and Malte Rehbein. 2012. “The Present and Future of the TEI Community for Manuscript Encoding.” Journal of the Text Encoding Initiative 2 (February). http://jtei.revues.org/372. doi:10.4000/jtei.372.

Nielsen, Jakob. 1993. Usability Engineering. Boston: Academic Press.

Pape, Sebastian, Christof Schöch, and Lutz Wegner. 2012. “TEICHI and the Tools Paradox.” Journal of the Text Encoding Initiative 2 (February). http://jtei.revues.org/432. doi:10.4000/jtei.432.

Rockwell, Geoffrey, Susan Brown, James Chartrand, and Susan Hesemeier. 2012. “CWRC-Writer: An In-Browser XML Editor.” In Digital Humanities 2012: Conference Abstracts, edited by Jan Christoph Meister, 508–11. Hamburg: Hamburg University Press. http://www.dh2012.uni-hamburg.de/wp-content/uploads/2012/07/HamburgUP_dh2012_BoA.pdf.

Sahle, Patrick. 2013. Digitale Editionsformen. Zum Umgang mit der Überlieferung unter den Bedingungen des Medienwandels. Teil 2: Befunde, Theorie und Methodik. Schriften des Instituts für Dokumentologie und Editorik, 8. Norderstedt: BoD. http://kups.ub.uni-koeln.de/id/eprint/5352.

Vertan, Cristina, and Stefanie Reimers. 2012. “A TEI-based Application for Editing Manuscript Descriptions.” Journal of the Text Encoding Initiative 2 (February). http://jtei.revues.org/392. doi:10.4000/jtei.392.

NOTES

1. In the TEI community one can find multiple tools. However, almost all of them support only the publication of TEI documents and not their creation (see, e.g., Pape, Schöch, and Wegner 2012). Tools that aim to support the actual transcription and editing of texts are less common. One such tool is from the Teuchos project, which provides a basic user interface for certain parts of the manuscript description (Vertan and Reimers 2012). In this case the data is not edited directly in XML but saved in a text format. In addition, it is not possible to enter inline markup. A more functional solution is the CWRC-Writer, which still requires the user to have a good knowledge of TEI (Rockwell et al. 2012 or http://cwrctc.artsrn.ualberta.ca/).

2. This point calls for a more thorough discussion that does not fit within the parameters of this article. It can nevertheless be determined that the lack of user-friendly tools is an obstacle for new TEI users (Burghart and Rehbein 2012, § 32).

3. Of course the creation of a TEI-conformant XML schema is also an editorial process and is thus done in collaboration with the editors, when not done by them directly.

4. Sahle (2013, 111) notes that editors are often afraid of the technical challenges of digital editions.

5. http://www.bbaw.de/en/telota.

6. http://www.bbaw.de/en.

7. http://www.bbaw.de/en/research/schleiermacherII.

8. http://exist-db.org/.

9. http://www.oxygenxml.com/xml_author.html.

10. http://wiki.contextgarden.net/Main_Page.

11. To this end we used “Oxygen Frameworks”: that is to say, we developed additional frameworks within the context offered by the program. Oxygen XML delivers a standard framework for TEI documents that provides a toolbar with a few basic TEI elements. However, because of the large number of TEI elements and attributes, the development of an all-inclusive TEI Framework would not be practical. The editor would be overwhelmed with such a large number of superfluous functions.

12. http://www.bbaw.de/en/telota/software/ediarum.

13. http://www.bbaw.de/en/research/cagb.

14. http://www.bbaw.de/forschung/regestaimperii and http://www.regesta-imperii.de/unternehmen/abteilungen/xiii-friedrich-iii.html.

15. For further information about apps in eXistdb, see “Getting Started with Web Application Development,” http://exist-db.org/exist/apps/doc/development-starter.xml.

16. This eXistdb app will soon be available online.

17. Stefan Dumont and Martin Fechner, “Digitale Arbeitsumgebung für das Editionsvorhaben ‘Schleiermacher in Berlin 1808–1834’,” digiversity: Webmagazin für Informationstechnologie in den Geisteswissenschaften (October 11, 2012), http://digiversity.net/2012/digitale-arbeitsumgebung-fur-das-editionsvorhaben-schleiermacher-in-berlin-1808-1834/.

18. Stefan Dumont, “Tutorial: Wie baue ich ein eigenes Framework für Oxygen XML?,” digiversity: Webmagazin für Informationstechnologie in den Geisteswissenschaften (October 30, 2013), http://digiversity.net/2013/tutorial-wie-baue-ich-ein-oxygen-xml-framework/, and Stefan Dumont, “Tutorial: Indexfunktionen für Oxygen XML Frameworks,” digiversity: Webmagazin für Informationstechnologie in den Geisteswissenschaften (December 16, 2013), http://digiversity.net/2013/tutorial-indexfunktionen-fuer-oxygen-xml-frameworks/.

19. https://github.com/telota/ediarum/tree/master/oxygen_java; for documentation, see https://github.com/telota/ediarum/wiki/Documentation-ediarum.jar-(english).

20. Jeremias Gotthelf: Historisch-kritische Gesamtausgabe (HKG), Forschungsstelle Jeremias Gotthelf, Universität Bern, http://www.gotthelf.unibe.ch/content/index_ger.html.

ABSTRACTS

This article shows how, with a focus on user-friendliness and a bottom-up approach, the digital work environment ediarum was successfully developed. With ediarum, researchers can comfortably encode and edit in TEI, as well as publish their results in an online or print edition. This solution, developed by the TELOTA initiative of the Berlin-Brandenburg Academy of Sciences and Humanities, is based on three software components: eXistdb, Oxygen XML Author, and ConTeXt. These components are combined, supplemented with additional functions, and tailored to fit a project’s needs. After a pilot run, ediarum has been implemented for multiple internal and external research projects. The experience that was gained and our self-developed program components are available to the Digital Humanities community.

AUTHORS

STEFAN DUMONT Stefan Dumont is a research associate in the TELOTA working group for digital humanities at the Berlin-Brandenburg Academy of Sciences and Humanities.

MARTIN FECHNER Martin Fechner is a research associate in the TELOTA working group for digital humanities at the Berlin-Brandenburg Academy of Sciences and Humanities.


TeiCoPhiLib: A Library of Components for the Domain of Collaborative Philology

Federico Boschetti and Angelo Mario Del Grosso

1. Introduction

1 The TeiCoPhiLib library is a collection of components currently implemented in Java (JSR 270), which parses documents encoded according to a basic subset of TEI tags defined in an ODD file1 and creates an object-oriented data structure in order to handle the textual content and its processing. The overall architecture is based on the well known Model-View-Controller (MVC) pattern, which separates the representation of data from the rendering and management of the content for the sake of flexibility and reusability. TeiCoPhiLib maps the structured document onto an aggregation of objects. The library enables the visualization through a web browser by instantiating a collection of widgets rendered on the client through standard web technologies. Specifically, the server-side environment jointly processes data and visualization templates,2 and generates HTML pages rendered on the client. Special components are devoted to monitoring the behavior and interactions among the objects generated from the input TEI documents.

2 In distributed and collaborative environments, the maintenance of links and relations among editable XML documents is a challenging task (Di Iorio et al. 2009; Peroni, Poggi, and Vitali 2013; Schmidt and Colomb 2009). Indeed, this kind of environment must preserve the referential integrity among interrelated textual units and the consistency among interlinked contents (Schmidt 2014; Barabucci et al. 2013; Ronnau, Philipp, and Borgho 2009). It is worth noting that both texts and related annotations may change asynchronously. Accordingly, the problems with maintenance have an increasing relevance in social editing (Siemens et al. 2012) and in general in collaborative philology. This emerging field concerns the social activity of scholars focused on shared philological tasks (such as scholarly editing and collaborative annotation) through a cyberinfrastructure (Terras and Crane 2010; Crane, Seales, and Terras 2009). In order to face this challenge, our approach exploits software engineering techniques illustrated in section 3.3, which explains the TeiCoPhiLib design patterns.

3 For this reason, the design of TeiCoPhiLib widely leverages the stand-off approaches provided by the TEI Guidelines, that is, both the reference to plain text offsets and the reference to nodes denoted by the @xml:id unique identifiers (TEI Consortium 2015; Wittern, Ciula, and Conal 2009; Pierazzo 2015). Consequently, the design of the components aims at the separation of concerns through four distinct layers: (1) textual structure; (2) semantics; (3) style; and (4) behavior, in order to ensure modularity, scalability, and flexibility.

2. Background

4 The main benefit of XML, and especially of the TEI Guidelines (TEI Consortium 2015), resides in simplicity, flexibility, readability, and customizability, with the assurance of a formal approach for validating the marked-up data. Consequently, XML provides a standard way to define a set of tags (vocabulary) for specific purposes. Moreover, the cluster of technologies associated with XML3 allows us to process, query and publish structured documents.

5 Several frameworks and initiatives have been developed over the years for handling XML, achieving great results and benefits for both scholars and developers. Among others, the open-source general-purpose framework Cocoon4 and the native XML database eXist-db5 deserve to be mentioned. Specifically for TEI-annotated documents, TUSTEP,6 TEIBoilerplate,7 TXM,8 and TAPAS9 are prominent projects.

6 For all of these initiatives, the transformation from an XML document structure to another format by XSLT can be considered the focal point.

3. Method

3.1 Flexibility and Reusability

7 A document-oriented approach can be complemented by an application/API-oriented approach for the development of textual analysis tools. We are adopting a top-down design integrated with bottom-up processes (Del Grosso and Boschetti 2013), which allows us to generalize, extend, and refactor the overall architecture as new requirements and common issues emerge from use cases under development. The design is top-down because we are defining both the general abstract framework and the mechanisms that allow us to implement new functionalities according to emerging needs. On the other hand, our library also adopts a bottom-up approach because we apply refactoring strategies to adapt existing components implemented in our previous projects to the general framework, extending the framework.

8 The library of components is designed by exploiting object-oriented methods and processes such as analysis of requirements, definition of the domain entities, separation of concerns, information hiding, and software reusability and extensibility (Fowler 1996). Extensive use of design patterns (i.e., recurring solutions to common problems within a given context [Gamma et al. 1995]) facilitates the achievement of these goals.

9 Agile software development10 and use case–driven modeling (Rosenberg and Stephens 2007) ensure the progressive enhancement of old functionalities and the development of new ones. The main principles of agile software development that we adopt are: (1) individuals and communication are more important than processes and tools; (2) documentation and design must be accessible to everybody all the time; (3) software development starts as soon as possible; (4) changes and refactoring are part of the design and the development process; (5) all lab team members participate in all presentations; (6) software is organized in short releases and divided into short iterations; (7) results are validated by domain expert collaborations and test-driven development (both unit tests and acceptance tests). The continuous integration and release are supported by open source Integrated Development Environments (IDEs) like Eclipse or NetBeans and by a software configuration management tool such as SVN or Git for versioning and revision control.

10 The aforementioned paradigm is applied in the TeiCoPhiLib library by (1) the implementation of a flexible importing and normalization module in the pre-processing phase, which ensures a coherent abstraction model of the resources; (2) the definition of the functional specification by designing the objects and by declaring the application interfaces; (3) the export of the information encapsulated in the objects into different data formats, in order to enable data integration and data exchange.

3.2 Separation of Concerns

11 First of all, objects that represent the whole document or interrelated documents are initialized by parsing the original TEI document(s) and by creating a new data structure, which decouples the orthogonal information conveyed by the XML elements: (1) textual structure, (2) semantics, (3) style, and (4) behavior. It is important to point out that the new data structure is the result of transformations (by XSLT DOM transformations or SAX event-driven transformations) managed during the parsing process. Thus, the current implementation of the TeiCoPhiLib exposes methods that parse the XML file and create Java objects. The resources are stored and maintained in a native XML database management system (i.e., eXist-db). The APIs and services provided by Lucene, a software library developed and hosted by the Apache Foundation, have been used for indexing the textual data.
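As an illustration of this indexing step, the following sketch adds one text chunk to a Lucene index. It assumes a recent Lucene release; the index directory, field names, and sample content are illustrative assumptions rather than the library’s actual code.

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class TextChunkIndexer {
    public static void main(String[] args) throws Exception {
        try (FSDirectory dir = FSDirectory.open(Paths.get("lucene-index"));
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            // Identifier of the text chunk (e.g., an @xml:id from the TEI source).
            doc.add(new StringField("id", "chunk-001", Field.Store.YES));
            // Plain-text content stripped of markup, as produced by the parsing layer.
            doc.add(new TextField("text", "Io nacqui veneziano ai 18 ottobre del 1775 ...", Field.Store.YES));
            writer.addDocument(doc);
        }
    }
}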

12 For instance, the information conveyed by the following TEI snippet is distributed among the appropriate Java objects that handle the four levels described above:


[...]

Io nacqui veneziano ai 18 ottobre del 1775, giorno dell’evangelista san Luca; e morrò per la grazia di Dio italiano quando lo vorrà quella Provvidenza che governa misteriosamente il mondo.

[...][...]

13 The parsing process concerns the following aspects:
1. Textual structure. The same document originally structured paragraph by paragraph for literary analysis can easily be restructured page by page for layout analysis and for comparison with the original page image. Semantics, style, and behavior are represented by objects separated from (but linked to) the nodes of the DOM tree.
2. Semantics. At the semantic level, both attributes (such as @type) and tag names are processed in the same way and linked to the related DOM node.
3. Style. The style is managed by separate renderers, which point to textual positions affected by stylistic features. For instance, the information extracted from the @style attribute is used to instantiate the Java objects devoted to managing the rendering information.
4. Behavior. Behaviors are handled by objects that process textual resources according to the current state of the data structure and the rules to manage such a state. For example, a hyphenator performs its tasks according to the language of the textual data (encoded in the original TEI file, e.g., by @xml:lang="it") and the related hyphenation rules (such as the hyphenation rules for the Italian language, managed by the hyphenator bundles).
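A minimal sketch of how these four concerns could be kept in separate objects that point back to the DOM structure is given below; all class names and fields are illustrative assumptions and do not reproduce the TeiCoPhiLib API.

import java.util.ArrayList;
import java.util.List;
import org.w3c.dom.Node;

public class LayeredDocumentSketch {

    /** Semantic label attached to a node (e.g., the value of a @type attribute). */
    public static class SemanticAnnotation {
        public final Node target; public final String category;
        public SemanticAnnotation(Node target, String category) { this.target = target; this.category = category; }
    }

    /** Rendering information derived, for instance, from a @style attribute. */
    public static class StyleRule {
        public final Node target; public final String css;
        public StyleRule(Node target, String css) { this.target = target; this.css = css; }
    }

    /** Behavior bound to a node, e.g., a hyphenator selected via @xml:lang. */
    public static class Behavior {
        public final Node target; public final String language;
        public Behavior(Node target, String language) { this.target = target; this.language = language; }
    }

    private final Node structure; // the (possibly restructured) DOM tree
    private final List<SemanticAnnotation> semantics = new ArrayList<>();
    private final List<StyleRule> styles = new ArrayList<>();
    private final List<Behavior> behaviors = new ArrayList<>();

    public LayeredDocumentSketch(Node structure) { this.structure = structure; }

    public void addSemantics(SemanticAnnotation a) { semantics.add(a); }
    public void addStyle(StyleRule r) { styles.add(r); }
    public void addBehavior(Behavior b) { behaviors.add(b); }
}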

14 The object-oriented representation of the document allows data to be processed dynamically, taking account of its physical and logical structure, in an attempt to overcome the multiple hierarchies issue. This means that the data model of the library has a decoupled and abstract structure which can be serialized in any available format, including all standard TEI-compliant approaches. Consequently, the TeiCoPhiLib document entity keeps structural and logical information in an independent but related aggregate of objects (figure 3). Furthermore, output modules or visitors (figure 2) can traverse and serialize the object representation of the document to a file. In particular, the modules for TEI files define the marshalling process, that is, the operations that produce the encoded XML files. In this way the system provides the actual representation of the document.

3.3 Design Patterns

15 The overall architecture of the library is based on several design patterns, according to the object-oriented paradigm (Gamma et al. 1995; Buschmann, Henney, and Schmidt 2007).


16 Design patterns were introduced in software engineering in order to provide a common solution for a recurring problem in a specific context. A design pattern defines the instantiation policies, the structure, or the behavior for an aggregate of objects that cooperate to provide a complex but recurrent functionality, such as the creation of polymorphic entities, the management of decoupled modules, or the selection at run time of the most suitable algorithm for the current task. The general idea of object-oriented patterns is to encapsulate functionality and data inside an efficient and flexible collection of classes. The current implementation of the prototype exploits the Java programming language technologies.
1. The Model-View-Controller (MVC) pattern (Burbeck 1992) determines the architecture of the library by separating the internal representation of the data from the rendering and the behavioral purposes.
2. The Factory pattern allows designers to enhance the object creation procedure by means of special classes (i.e., the factory classes) that guarantee abstract coupling among system modules (i.e., the object relationships do not reference implementing classes). As a matter of fact, the foregoing design makes it possible to programmatically reference abstract objects independently of the run-time instances which actually perform the task. This design pattern lets applications change the implementation of a class in a flexible way. In our case, for instance, a document object has textual content maintained in a specific DOM data structure transparent to the user. The client agent is able to manipulate and process the document independently of its internal DOM representation. The algorithms and processing keep the state of the data structure coherent by updating the DOM representation in a transparent way.
3. The Builder pattern is used to initialize and populate the document data structure. Together with the Factory pattern, it hides the real type of the objects from the user agents and maintains the state consistency of the interconnected information. In this way, in the library initialization process, it is possible to create different data structures for different aims, in a way that is completely transparent to the user agents. For instance, a Builder oriented to layout analysis can restructure the information parsed from the TEI input document.
4. The Composite pattern is the core of the data structure (figure 3). The document object is defined as an aggregation of hierarchical entities with the same data type. The hierarchy maps either the DOM structure of the original XML-TEI document or the structure of one of its transformations based on an XSLT input parameter. Thanks to this pattern, an efficient object-oriented structure, sketched through the UML class diagram on the right of figure 3, represents the whole/part relationships among the objects in the data structures.
5. The Strategy pattern implements different operations in different ways based on the object type or on specified parameters. For example, the building process uses different strategies, which are driven by specific features given through property files. In the previous example, the original TEI page milestone is represented by an element node in the DOM internal structure; conversely, the original TEI element for paragraph can be represented by a milestone. Furthermore, the Strategy pattern is useful for rendering the same data in multiple views in different contexts or processing the same data with different algorithms.
6. The Observer pattern provides a mechanism for handling dependencies among interrelated objects. This ensures that when a change occurs, the overall state is synchronized and updated. For example, if an edit operation deletes some values in the document, all related structures are notified and updated accordingly. The library organizes the entities derived from the original document information through the stand-off approach. In this way the document structure is separate from its semantics, style, and behavior.


Figure 1. Class diagram of the Observer pattern designed for the TeiCoPhiLib.

Several modules of the library need synchronized data. In particular, annotations and comments need to be notified if an event occurs that changes the state of the referenced object. This problem is well known in the context of graphical user interfaces, where embodied components communicate by message exchange according to a standard protocol. The Observer pattern, which is a behavioral technique, solves the aforementioned problem in an elegant and easy way. TeiCoPhiLib exploits the Observer pattern to manage communication among the object-oriented representation of the document and the encapsulated data that point to it. Figure 1 shows the UML class diagram of the mechanism designed for the library. The pattern involves two interfaces: (6a) Subject and (6b) Observer. These two entities provide the flexibility to implement a decoupled notification mechanism. The Subject provides a registration procedure for the Observer object, and the Observer object provides a standard method allowing the Subject to notify it. TeiCoPhiLib defines objects that can change, such as the Document data type, and objects that need to be notified, such as the Annotation or the Comment data types. Consequently, the Document class implements the Subject interface, whereas the Annotation and Comment classes implement the Observer interface. The following simplified Java snippet illustrates this concept programmatically.

Observer annotation = new Annotation();
Observer comment = new Comment();

Subject teiDocument = new Document();
teiDocument.subscribe(ObserverType.ANNOTATION, annotation);
teiDocument.subscribe(ObserverType.COMMENT, comment);

// ... some processing that modifies the teiDocument object;
// accordingly, the observers have to be notified:
teiDocument.notify(ObserverType.ANNOTATION);

7. The Visitor pattern allows user agents to gather data stored in different object data fields and enables the reconstruction of a consistent and homogeneous view. The stand-off mechanism implies a distributed and interrelated structure where information is maintained across various objects. For example, during the exporting phase the object-oriented representation of the document is processed in order to produce a valid TEI XML document according to a schema selected by the user.


Figure 2. Class diagram of the Visitor pattern designed for the TeiCoPhiLib.

The hierarchical nature of the document representation facilitates traversal of the data structure in a flexible and customizable way. The Visitor pattern provides a mechanism to extend the functionality of the TeiCoPhiLib by allowing components to perform a client-supplied operation on each node of the document hierarchy. Figure 2 shows how a client of the data model can traverse the document tree in order to write its textual content. The mechanism supplies the actual visitor instance by means of a custom operation. The current implementation of the Visitor interface is then used by the entities of the hierarchy, which are unaware of the actual behavior of the visitor instance. In this way, the Visitor pattern provides a procedure for operating on the representation of the document by making a point of extensibility available to clients (Martin 2000).
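The following fragment is a minimal sketch of this traversal mechanism. It is not the TeiCoPhiLib API: the node classes (ElementNode, TextNode) and the TextWriterVisitor are illustrative names invented for the example, which simply shows how a client-supplied visitor can collect the textual content of a hierarchical document representation.

import java.util.ArrayList;
import java.util.List;

// Visitor interface: one visit method per kind of node in the hierarchy.
interface DocumentVisitor {
    void visitElement(ElementNode node);
    void visitText(TextNode node);
}

abstract class DocumentNode {
    abstract void accept(DocumentVisitor visitor);
}

class ElementNode extends DocumentNode {
    final String name;
    final List<DocumentNode> children = new ArrayList<>();
    ElementNode(String name) { this.name = name; }
    void accept(DocumentVisitor visitor) {
        visitor.visitElement(this);
        for (DocumentNode child : children) {
            child.accept(visitor);            // recursive traversal of the hierarchy
        }
    }
}

class TextNode extends DocumentNode {
    final String content;
    TextNode(String content) { this.content = content; }
    void accept(DocumentVisitor visitor) {
        visitor.visitText(this);
    }
}

// A client-supplied operation: write out the plain textual content of the document.
class TextWriterVisitor implements DocumentVisitor {
    final StringBuilder out = new StringBuilder();
    public void visitElement(ElementNode node) {
        // structural nodes contribute no text themselves
    }
    public void visitText(TextNode node) {
        out.append(node.content);
    }
}

A client would build the tree, instantiate a TextWriterVisitor, and call accept() on the root node; the collected text is then read from the visitor, without the node classes knowing anything about the operation performed.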

17 An example should clarify the aforementioned architecture. The client application that uses the TeiCoPhiLib APIs invokes the building method of the abstract Builder class. The resulting document object is a concretization of an abstract class representing the current structure of the TEI-encoded resource, as illustrated in the following Java statement:

Document teiDocument = AbstractBuilderFactory.buildDocument(
    new File("features.properties"), new File("teiDocument.xml"));

18 The Builder object needs two input files: (1) a property file containing the suitable configuration for the instantiation of the concrete objects, and (2) the TEI XML file to parse. The internal representation of the document is based on composite components. In this way each single element is handled by the TEIComposite object shown on the right-hand side of figure 3, which represents each node of the hierarchical DOM structure. Leaves are instances of the StrippedTextChunk class (figure 3). Methods in such a structure make it possible to manipulate the content and the structure of the resources.


Figure 3. Parsed TEI document (left), mapped into a Composite pattern structure (right).

19 Finally, the data structure is an object-oriented representation of the entities in the real domain of the digital document, but the storage platform and paradigm could actually be relational, hierarchical, semi-structured (XML), or network/graph–structured (Hohpe and Woolf 2004).

20 The diagram on the left side of figure 3 shows the structure of a TEI document DOM representation. Objects, in this case, exactly map the tags of the TEI-encoded file format. The UML class diagram on the right side illustrates the composite-components design. This diagram maps the same information and structure, but using a flexible and recursive design: each TEIComposite object reflects a TEI element, and each child is recursively either a TEIComposite or a textual element.
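A minimal sketch of this composite structure, and of the recursive mapping from the parsed DOM, is given below. TEIComposite and StrippedTextChunk are the class names used in the article, but the fields, constructors, and the fromDom() helper shown here are assumptions made for illustration; they are not the actual TeiCoPhiLib implementation.

import java.util.ArrayList;
import java.util.List;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

abstract class TEIComponent { }

class StrippedTextChunk extends TEIComponent {   // leaves hold the bare textual content
    final String text;
    StrippedTextChunk(String text) { this.text = text; }
}

class TEIComposite extends TEIComponent {
    final String teiElementName;                 // e.g., "div", "sp", "l"
    final List<TEIComponent> children = new ArrayList<>();
    TEIComposite(String teiElementName) { this.teiElementName = teiElementName; }

    // Recursively mirrors the DOM of the parsed TEI document: each element node
    // becomes a TEIComposite, each text node becomes a StrippedTextChunk leaf.
    static TEIComposite fromDom(Node element) {
        TEIComposite composite = new TEIComposite(element.getNodeName());
        NodeList childNodes = element.getChildNodes();
        for (int i = 0; i < childNodes.getLength(); i++) {
            Node child = childNodes.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                composite.children.add(fromDom(child));
            } else if (child.getNodeType() == Node.TEXT_NODE) {
                composite.children.add(new StrippedTextChunk(child.getNodeValue()));
            }
        }
        return composite;
    }
}

The recursion mirrors the whole/part relationships of figure 3: an operation applied to the root can be propagated transparently to the entire document.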

4. Case Studies

21 The case studies illustrated below have been implemented with the components already developed for our library.

4.1 Euporia: Visualization, Editing, and Annotation of Parallel Texts for Didactic Purposes

22 Euporia is a project aimed at visualizing, editing, and annotating bilingual texts displayed in parallel. The original digital resources are stored and maintained in authoritative digital libraries available online, such as the Biblioteca Italiana and the Perseus Digital Library, or they are downloaded from social proofreading websites, such as WikiSource, and subsequently processed and marked up in TEI. Some examples of Greek and Latin texts potentially alignable or actually aligned with their Italian translations are shown in table 1.


23 As mentioned in section 3, different subcollections of texts that must be aligned may provide or omit some extratextual information (such as line numbers or page numbers), and they may organize texts in different ways (for instance, lines may or may not be grouped inside enclosing elements). For this reason, the XSD schema (which is expected to be a subset of the general TEI schemas) is generated a posteriori from the actual text subcollections. This approach can be considered complementary to the TEI Roma approach, a kind of reverse engineering, which also allows us to generate the ODD file. By studying the schemas, XSLT transformations are created so that the appropriate Aligner processes only the relevant information, in canonical formats. Currently only the SpeechAligner for dramatic texts has been implemented: correspondences between the Italian translation and the original Greek text are automatically injected through the @corresp attribute (see row B in table 1), and misalignments must be corrected manually.

Table 1. Parallel texts.

A) Perseus Digital Library / Digital Library of Biblioteca Italiana: the opening of Aeneid Book II in the Latin original (“Conticuere omnes, intentique ora tenebant. / Inde toro pater Aeneas sic orsus ab alto: / Infandum, regina, iubes renovare dolorem …”) aligned with its Italian verse translation (“Tutti ammutiro e s’affisaro. Enea / Da l’alto seggio incominciava. Infando, / O Regina, è il dolor cui tu m’imponi / Ch’io rinnovelli …”).

B) Perseus Digital Library / Euporia Resources: a speech of Helen (“Ἑλένη: τί δ’, ὦ ταλαίπωρ’ — ὅστις ὤν μ’ ἀπεστράφης / καὶ ταῖς ἐκείνης συμφοραῖς ἐμὲ στυγεῖς;”) aligned with its Italian translation (“ELENA: Perché, qual che tu sia, misero, gli occhi / torci da me, pei falli altrui m’aborri?”), the two speeches being linked through @corresp (e.g., corresp="#sp_grc_0002").
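As an illustration of the kind of alignment step described in paragraph 23, the following sketch pairs the speeches of the original and of the translation in document order and injects @corresp pointers into the translated speeches using the standard W3C DOM API, assuming the speeches are encoded as TEI <sp> elements. It is only an assumption of how such a step might look: the class name NaiveSpeechAligner is invented for the example, and the actual SpeechAligner works on XSLT-normalized input and still requires manual correction of misalignments.

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class NaiveSpeechAligner {

    private static final String TEI_NS = "http://www.tei-c.org/ns/1.0";
    private static final String XML_NS = "http://www.w3.org/XML/1998/namespace";

    // Pairs <sp> elements by position and points each translated speech back
    // to the xml:id of the corresponding original speech via @corresp.
    static void align(Document original, Document translation) {
        NodeList originalSpeeches = original.getElementsByTagNameNS(TEI_NS, "sp");
        NodeList translatedSpeeches = translation.getElementsByTagNameNS(TEI_NS, "sp");
        int pairs = Math.min(originalSpeeches.getLength(), translatedSpeeches.getLength());
        for (int i = 0; i < pairs; i++) {
            Element originalSp = (Element) originalSpeeches.item(i);
            Element translatedSp = (Element) translatedSpeeches.item(i);
            String id = originalSp.getAttributeNS(XML_NS, "id");
            if (!id.isEmpty()) {
                translatedSp.setAttribute("corresp", "#" + id);
            }
        }
    }
}

In the real workflow such a positional pairing can only be a first approximation, which is why misalignments must then be corrected by hand.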


24 Parallel texts are visualized and managed through EuporiaWebApp (figure 4), a server-side Java web application, compliant with the JSR 314 (JavaServer Faces 2.0) specification, intended for educational purposes. Students, the end users of Euporia, are allowed to query the texts, both jointly and independently, through multilingual or monolingual keywords.

Figure 4. Euporia.

25 Annotations can also be associated with linked chunks of text (such as a sentence and its translation: see figure 5) or with single, independent chunks of text (such as a single word of the original text). Annotations and their pointers to the text are stored as stand-off markup. Currently, the adopted stand-off linking mechanism uses only @xml:id, but the use of XPointer is under evaluation.

Figure 5. Annotation in Euporia.


4.2 Aporia: Adapting the Parallel Text Framework to Specific Scientific Requirements

26 Aporia is the enhanced version of Euporia, intended for research purposes. Accordingly, the Parallel Text framework has been adapted and extended to meet specific scientific requirements (Bozzi 2013). An experimental case study has been performed on Theodor Mommsen’s edition of the Res Gestae Divi Augusti (1883), in order to examine the complementarity of attested fragments of text in the Latin inscription of the RGDA and the related attested fragments in the Greek translation (Lamé 2012). Attested and conjectural parts from Mommsen’s critical edition have been marked. As shown in figure 6, the feature “status” (attested / partially attested / conjectural) is not only visualized in different colors (a CSS stylesheet is enough for this task), but also available in query masks to filter the results of a query (for instance, “find only attested or partially attested words”) and in tables of results.

27 Unlimited stand-off layers of analysis can be added (such as morphological, syntactic, and semantic analysis), at different levels of granularity (for example, at the level of words for morphological analysis and at the level of sentences for semantic analysis). The layers of annotation are added through the web application and stored in the XML database. Moreover, they can be manually encoded and visualized through the web application.

Figure 6. Aporia.

4.3 Saussure Project: Supporting Genetic Criticism

28 The Saussure Project exploits the flexibility of Aporia in order to adapt the system to the study of Saussurean autographs, making the author’s variants searchable and creating multilingual indexes of the ancient terms studied by the linguist.

29 Instead of showing linked texts in parallel, the system shows the image of the manuscript and the related transcription (figure 7).


Figure 7. Saussure Project.

5. Conclusion

30 The TeiCoPhiLib is a work in progress focused on the creation of a library of software components aimed at managing a limited subset of TEI tags used in the domain of collaborative philology. Because of the increasing complexity of annotations and the multiple usages of the same texts in collaborative environments, stand-off annotation and dense mark-up make it challenging to keep annotated documents readable and manageable. While annotation focuses on types of texts (such as poetic, dramatic, and with or without critical apparatus), software development focuses on abstraction of data structures and behaviors related to those texts (such as searching them in parallel, filtering by morphological features, and comparing text and image).

31 Reusable software components promote the management of stand-off annotation at any level (such as editing, searching, or visualizing), improving the experience of the annotation and use of TEI documents.

32 The document parsing in the current Java implementation takes place on the server side, where the Java virtual machine runs within the web application environment.

33 The marshalling and unmarshalling process handles the serialization of the object representation of the TEI document, in order to store and retrieve data on the filesystem or in native XML databases, such as eXist-db.

34 Performance measurement tools such as JMeter will help to optimize the performance of the library components.

35 Software currently under development will be available on GitHub at https://github.com/CoPhi/cophilib.


BIBLIOGRAPHY

Barabucci, Gioele, Angelo Di Iorio, Silvio Peroni, Francesco Poggi, and Fabio Vitali. 2013. “Annotations with EARMARK in Practice: A Fairy Tale.” In DH-CASE ’13: Proceedings of the 1st International Workshop on Collaborative Annotations in Shared Environment: Metadata, Vocabularies and Techniques in the Digital Humanities, article no. 11. New York: ACM. doi:10.1145/2517978.2517990.

Bozzi, Andrea. 2013. “G2A: A Web Application to Study, Annotate and Scholarly Edit Ancient Texts and Their Aligned Translations.” Studia graeco-arabica 3:159–71. http://www.greekintoarabic.eu/uploads/media/BOZZI_SGA_3-2013.pdf.

Burbeck, Steve. 1992. “Applications Programming in Smalltalk-80TM: How to Use Model-View-Controller (MVC).” Last modified March 4, 1997. http://www.math.sfedu.ru/smalltalk/gui/mvc.pdf.

Buschmann, Frank, Kevlin Henney, and Douglas C. Schmidt. 2007. Pattern Oriented Software Architecture. Vol. 5, On Patterns and Pattern Languages. Wiley Software Patterns Series. Chichester, England: John Wiley & Sons.

Crane, Gregory, Bridget Almas, Alison Babeu, Lisa Cerrato, Anna Krohn, Frederik Baumgart, Monica Berti, Greta Franzini, and Simona Stoyanova. 2014. “Cataloging for a Billion Word Library of Greek and Latin.” In DATeCH ’14: Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage, 83–88. New York: ACM. doi:10.1145/2595188.2595190.

Crane, Gregory, Brent Seales, and Melissa Terras. 2009. “Cyberinfrastructure for Classical Philology.” Digital Humanities Quarterly 3 (1). http://www.digitalhumanities.org/dhq/vol/003/1/000023/000023.html.

Del Grosso, Angelo Mario, and Federico Boschetti. 2013. “Collaborative Multimedia Platform for Computational Philology: CoPhi Architecture.” In MMEDIA 2013, The Fifth International Conferences on Advances in Multimedia [Proceedings], edited by Philip Davies and David Newell, 46–51. N.p.: IARIA. http://www.thinkmind.org/index.php?view=article&articleid=mmedia_2013_3_10_40059.

Di Iorio, Angelo, Michele Schirinzi, Fabio Vitali, and Carlo Marchetti. 2009. “A Natural and Multi-layered Approach to Detect Changes in Tree-based Textual Documents.” In Enterprise Information Systems: Proceedings of 11th International Conference on Enterprise Information Systems (ICEIS 2009), edited by Joaquim Filipe and José Cordeiro, 90–101. Lecture Notes in Business Information Processing 24. Berlin: Springer. doi:10.1007/978-3-642-01347-8_8.

Fowler, Martin. 1996. Analysis Patterns: Reusable Object Models. Menlo Park, CA: Addison Wesley.

Gamma, Erich, Richard Helm, Ralph Johnson, and John Vlissides. 1995. Design Patterns: Elements of Reusable Object-Oriented Software. Reading, MA: Addison-Wesley.

Hohpe, Gregor, and Bobby Woolf. 2004. Enterprise Integration Patterns: Designing, Building, and Deploying Messaging Solutions. Boston: Addison-Wesley.

Lamé, Marion. 2012. “Epigraphie en réseau: réflexions sur les potentialités d’innovations dans la représentation numérique d’inscriptions complexes.” Ph.D. thesis, University of Bologna – University of Aix-Marseille. http://www.theses.fr/2012AIXM3130.

Martin, Robert C. 2000. “Design Principles and Design Patterns.” Gurnee, IL: Object Mentor. http://www.objectmentor.com/resources/articles/Principles_and_Patterns.pdf.


Peroni, Silvio, Francesco Poggi, and Fabio Vitali. 2013. “Tracking Changes through EARMARK: A Theoretical Perspective and an Implementation.” In DChanges 2013: Proceedings of the International Workshop on Document Changes: Modeling, Detection, Storage and Visualization, edited by Gioele Barabucci, Uwe M. Borghoff, Angelo Di Iorio, and Sonja Maier. Aachen, Germany: CEUR Workshop Proceedings. http://ceur-ws.org/Vol-1008/paper6.pdf.

Pierazzo, Elena. 2015. Digital Scholarly Editing: Theories, Models and Methods. Farnham, Surrey: Ashgate.

Ronnau, Sebastian, Geraint Philipp, and Uwe M. Borghoff. 2009. “Efficient Change Control of XML Documents.” In DocEng 2009: Proceedings of the 9th ACM Symposium on Document Engineering, 3–12. New York: ACM. doi:10.1145/1600193.1600197.

Rosenberg, Doug, and Matt Stephens. 2007. Use Case Driven Object Modeling with UML: Theory and Practice. Berkeley, CA: Apress.

Schmidt, Desmond. 2014. “Towards an Interoperable Digital Scholarly Edition.” Journal of the Text Encoding Initiative 7 (November). http://jtei.revues.org/979.

Schmidt, Desmond, and Robert Colomb. 2009. “A Data Structure for Representing Multi-version Texts Online.” International Journal of Human-Computer Studies 67 (6): 497–514. doi:10.1016/j.ijhcs.2009.02.001.

Siemens, Ray, Meagan Timney, Cara Leitch, Corina Koolen, and Alex Garnett. 2012. “Toward Modeling the Social Edition: An Approach to Understanding the Electronic Scholarly Edition in the Context of New and Emerging Social Media.” Literary and Linguistic Computing 27 (4): 445–61. doi:10.1093/llc/fqs013.

TEI Consortium. 2015. TEI P5: Guidelines for Electronic Text Encoding and Interchange. Version 2.8.0. Last updated April 6. N.p.: TEI Consortium. http://www.tei-c.org/Vault/P5/2.8.0/doc/tei-p5-doc/en/html/.

Terras, Melissa, and Gregory Crane, eds. 2010. Changing the Center of Gravity: Transforming Classical Studies through Cyberinfrastructure. Digital Technologies and the Ancient World 4. Piscataway, NJ: Gorgias Press.

Wittern, Christian, Arianna Ciula, and Conal Tuohy. 2009. “The Making of TEI P5.” Literary and Linguistic Computing 24 (3): 281–96. doi:10.1093/llc/fqp017.

NOTES

1. The ODD files currently available can be downloaded from GitHub: https://github.com/CoPhi. The TEI schema we intend eventually to adopt conforms to the EpiDoc vocabulary, following the policy of the Perseus Catalog (Crane et al. 2014).
2. Facelets XML templates are used under the Java Server Faces 2.0 specification.
3. http://www.w3.org/standards/xml/.
4. http://cocoon.apache.org/.
5. http://exist-db.org/.
6. http://www.tustep.uni-tuebingen.de/tustep_eng.html.
7. http://dcl.ils.indiana.edu/.
8. http://sourceforge.net/projects/txm/.


9. http://tapasproject.org/.
10. See the Agile Manifesto (http://www.agilemanifesto.org/) for a detailed explanation.

ABSTRACTS

In this article we illustrate a work in progress related to the design of a library of software components devoted to editing, processing, and visualizing TEI-annotated documents in the domain of philological studies, in particular in the subdomain of collaborative philology, which concerns the social activity of scholars focused on shared philological tasks. We discuss the technologies related to XML markup languages and the processing of marked-up documents. We describe the method used to design and implement the TeiCoPhiLib, outlining the design patterns as well as discussing general benefits of the overall architecture. Finally, we present case studies in which some components of our library currently implemented in Java have been used.

INDEX

Keywords: APIs, design patterns, library of components, collaborative philology

AUTHORS

FEDERICO BOSCHETTI
Federico Boschetti holds a PhD in classical philology and brain and cognitive sciences—language, interaction and computation. He is a researcher at the A. Zampolli Institute for Computational Linguistics of the Italian National Research Council (ILC-CNR) in Pisa. His area of interest is computational and collaborative philology.

ANGELO MARIO DEL GROSSO
Angelo Mario Del Grosso holds a PhD in information engineering from the Engineering Ph.D. School “Leonardo da Vinci” of the University of Pisa. He is a research fellow at the A. Zampolli Institute for Computational Linguistics of the Italian National Research Council (ILC-CNR) in Pisa. His area of interest is the analysis, design, and development of object-oriented and computational methodologies applied to the textual scholarship domain.


“Reports of My Death Are Greatly Exaggerated”: Findings from the TEI in Libraries Survey

Michelle Dalmau and Kevin Hawkins

1. Introduction

1 In the early days of the TEI Guidelines, academic libraries extended their access and preservation mandates to include electronic text (Engle 1998; Friedland 1997; Giesecke, McNeil, and Minks 2000; Nellhaus 2001). At the turn of the twenty-first century, momentum for text encoding grew in libraries as a result of the maturation of pioneering digital-library programs and XML-based web-publishing tools and systems (Bradley 2004). Libraries were not only providing “access to original source material, contextualization, and commentaries, but they also provide[d] a set of additional resources and service[s]” equally rooted in robust technical infrastructure and noble “ethical traditions” that have critically shaped humanities pedagogy and research (Besser 2004).

2 In 2002, Suzana Sukovic posited that libraries’ changing roles would and could positively impact publishing and academic research by making use of standards such as the TEI Guidelines as well as traditional library expertise, namely in cataloging departments because of their specialized knowledge in authority control, subject analysis, and bibliographic description (Sukovic 2002). Soon thereafter, in 2004, Google announced the scanning of books in major academic libraries to be included in Google Books (Google 2012), and in 2008 many of these libraries formed HathiTrust to provide access to facsimile page images created through mass digitization efforts (Wilkin 2011). For many, the momentum behind mass digitization called into question the role for libraries in text encoding that Sukovic had advocated. In 2011, with the formation of the HathiTrust Research Center and IMLS funding of TAPAS (TEI Archiving, Publishing, and Access Service, http://www.tapasproject.org/), we see that both large- and small-


scale textual analysis are equally viable and worthy pursuits for digital research inquiry in which libraries are heavily vested (Jockers and Flanders 2013).

3 More recently, we are witnessing a call for greater and more formal involvement of libraries in digital humanities endeavors and partnerships (Vandegrift 2012; Muñoz 2012) in which the resurgence of TEI in libraries is becoming apparent (Green 2013; Milewicz 2012; Tomasek 2011; Dalmau and Courtney 2011). How has advocating for such wide-ranging library objectives—from digital access and preservation to digital literacy and scholarship, from supporting “non-consumptive research” (“nonexpressive use”) to supporting research practices rooted in the markup itself—informed the evolution or devolution of text-encoding projects in libraries?

4 Inspired by the papers, presentations, and discussions that resulted from the theme of the 2009 Conference and Members’ Meeting of the TEI Consortium, “Text Encoding in the Era of Mass Digitization,” the launch of the AccessTEI program in 2010, and the release of Best Practices for TEI in Libraries in 2011 (Hawkins, Dalmau, and Bauman 2011), we surveyed employees of libraries between November 2012 and January 2013 to learn more about text-encoding practices and gauge current attitudes toward text encoding in libraries. We hypothesized that as library services evolve to promote varied modes of scholarly communication and accompanying services, and as digital library initiatives become more widespread and increasingly decentralized, text encoding is undertaken less often in libraries, especially at smaller institutions, and is seeing decreased support even at larger institutions. We also wanted to investigate the nature of library-led or -partnered electronic text projects, including whether there is an increase or decrease in local mass digitization or scholarly encoding initiatives.

2. Method

5 We developed a survey using SurveyMonkey with a combination of yes/no, multiple-choice, ranking and rating, and free-response questions. In an effort to collect longitudinal data that we could leverage in our own study, we referenced and modeled a subset of questions after a survey circulated in 2008,1 which informed what we now know as AccessTEI,2 a TEI Consortium member benefit providing a volume discount for digitization and text encoding.

6 We decided to target communities of practice as opposed to individuals, intending to lower the probability of bias that might have occurred with an otherwise judgmental sample of responses. However, we encouraged responses from multiple staff members from the same institution to ensure a more holistic view of text-encoding practices across libraries. In turn, we generated institutional biases that we did not attempt to normalize since the data was collected in an anonymous fashion.

7 We formally announced the survey as part of the poster sessions for the 2012 Digital Library Federation (DLF) Forum and the 2012 Conference and Members’ Meeting of the TEI Consortium, which occurred within weeks of each other.3 Once the survey was unveiled at the DLF Forum on November 4, 2012, the survey was announced via digital library and digital humanities mailing lists (including TEI-L, DLF-ANNOUNCE, DIGLIB, and XML4LIB) and social media channels like Twitter and Facebook.

8 Depending on how they answered certain questions, respondents encountered one of four paths with 11, 17, 28, or 30 questions to complete. The only respondent


requirement was that she or he worked in a library in some capacity. Not all questions were answered; we have estimated a completion rate of 60% that takes into account the various forks in the survey.

9 The survey comprised four major sections:
• Study Information
• Determination of Eligibility
• Background Information About the Institution (type of library, size, attitudes)
• Text Encoding Practices
  ◦ Standards Used
  ◦ Collaborations/Partnerships
  ◦ Types of Text Encoding Projects

10 The survey closed on January 31, 2013, with 112 valid responses that provided the foundation for our analysis. The survey questions and the data collected are available at https://github.com/mdalmau/tei_libraries.

2.1 Data Preparation

11 Mishaps occurred with the data collection using SurveyMonkey because of a combination of researcher error and glitches with the survey tool. Of the original 138 respondents, 26 answered “no” to the question “Do you work in a library?,” but despite their not meeting the sole criterion for taking the survey, the system somehow allowed them to continue. These responses were disqualified. In addition, responses to a subset of questions (for 10 respondents) were disqualified (and marked as “invalid”) based on other errors uncovered in SurveyMonkey’s logic for skipping questions.4 The Indiana Statistical Consulting Center provided close consultation to determine which responses to disqualify and to normalize the data for statistical processing, which included content analysis and coding of the qualitative responses.

12 We coded responses of both the quantitative and qualitative questions. After questions were keyed (Q1, Q2, etc.) for statistical processing, values for all ranking questions, Likert scale questions (with responses ranging from “almost always” to “never”), and yes/no questions were normalized. Six qualitative questions (Q4, Q9, Q16, Q25, Q118, and Q119) were coded following a three-step process: (1) each of us coded the responses separately, (2) we combined our respective codings to generate a single scheme, and (3) together we reassigned codes based on the single scheme.

13 The spreadsheet containing the coded data, which is also available on GitHub (https://github.com/mdalmau/tei_libraries), contains multiple tabs, including
• Q_KEY, containing the mapping of the prose questions to an identifier scheme for statistical processing
• Data with the 112 valid responses, including normalized values
• Likert_Key, reflecting the normalized values assigned to all Likert scale questions
• Content analysis of the 6 qualitative questions, including original and coded responses: Q4, Q9, Q16, Q25, Q118, Q119 (each in a separate tab).


3. Results

14 The following summary and discussions of the results are presented as a “snapshot” in time based on analysis of data collected in the survey. The lack of pre-existing data measuring text-encoding activities in libraries made it difficult to make assertions about changes over time. Still, the results provide valuable information about text-encoding activities and attitudes in libraries that can be used in future studies.

3.1 Profile of Survey Respondents

15 Of the 112 respondents, we determined from IP addresses that
• 55 are clearly affiliated with an institution; 41 of these are unique institutions
• 57 are unidentifiable because they used off-site internet connections (via ISPs).

16 As table 1 indicates, most respondents are affiliated with North American academic libraries. This finding is not surprising given that publicity was mostly on North American and UK lists and given the relatively long history of American academic library support and adoption of the TEI Guidelines, starting with the TEI and XML in Digital Libraries Workshop sponsored by the Digital Library Federation (DLF) in 1998 (Hawkins, Dalmau, and Bauman 2011). In 1999, the DLF published the first version of what was known as the “TEI in Libraries Guidelines” (Digital Library Federation 1999), and in 2011, version 3, now known as Best Practices for TEI in Libraries (Hawkins, Dalmau, and Bauman 2011), was released, with contributions mostly by American librarians.

Table 1. Responses to demographic questions pertaining to the respondent’s institutional affiliation (n=112).

Where is your institution located?
  Asia: 2 | Europe: 11 | North America: 84 | Unknown: 15 | No Response: 0

Indicate the type of library for which you work.
  Academic Library: 92 | National Library: 2 | Public Library: 5 | Research Library: 10 | Special Library: 3 | No Response: 0

What is the size of your academic institution based on student enrollment or number of patrons served?
  Up to 5,000: 8 | 5,000–10,000: 16 | 10,000–25,000: 27 | 25,000–40,000: 22 | Over 40,000: 14 | No Response: 25

17 Respondents were asked to identify their departmental affiliations, and to list departments with which they partner on text-encoding projects. Responses were coded (see figure 1) according to twelve main areas of work or departments (such as cataloging or technology), but not weighted with respect to respondents providing multiple departmental affiliations (9 of 112 responses). Not surprisingly, the departments reporting the most text-encoding work include Technology, Digital Scholarship, Cataloging, Special Collections, and Archives. Of the 58 respondents who indicated


units with which they partner, most had partnered with at least 3 other departments elsewhere in the library, revealing a concentration of partnerships in departments like Technology, Digitization, and Cataloging. While we cannot claim that text-encoding work has become “decentralized” in libraries based on our data alone, we certainly see a spread of text-encoding work across various library departments (see figure 1).

Figure 1. This pie chart shows respondents’ reported departmental affiliations, coded according to twelve main areas of work or departments, with “General Library” for responses such as “main” and “general.”

3.2 TEI Consortium Affiliations

18 As mentioned earlier, we were primarily interested in the individual’s experience with text-encoding practices in libraries, but we also asked respondents to identify whether their institution was affiliated with the TEI Consortium. To present a more accurate picture, we attempted, only in this instance, to control for multiple responses per institution (figure 2). This chart gives four data points for each response:
• total responses
• total institutions
• total unique institutions
• total respondents who accessed the survey via an ISP.

19 The responses were analyzed based on whether or not the respondent said his or her institution is a member of the TEI Consortium, in addition to unsure and blank responses. For those who answered “yes” to the TEI Consortium membership question, we can see that 18 of the 39 respondents are affiliated with an identifiable institution and 9 of those (half), after de-duplication, are unique institutions. For those who answered “no” to the question, we can see that 23 of the 43 respondents are affiliated with an identifiable institution and 17 of those are unique institutions. In sum, 50% of respondents that claimed their institutions are members of the TEI Consortium were identified as being from unique institutions, and 73% that claimed their institutions were not affiliated were identified as being from unique institutions. In keeping with


Lynne Siemens’ report, “Understanding the TEI-C Community: A Study in Breadth and Depth, Toward Membership and Recruitment,” presented at the TEI Consortium’s Members Business Meeting in 2008,5 it is not surprising that most respondents are not from institutions that are members of the TEI Consortium. The report, which led to the publication of “The Apex of Hipster XML GeekDOM” (Siemens et al. 2011), covered the findings of a study led by Lynne Siemens and her team to survey the at-large TEI community’s response to a viral video featuring the TEI encoding of Bob Dylan’s “Subterranean Homesick Blues” in order to understand current TEI membership and potential members.

Figure 2. This graph shows TEI Consortium membership status as reported by respondents, with an attempt to de-duplicate institutional affiliations as indicated by the “Total Unique Institutions” data point.


Figure 3. This graph shows the number of TEI-C member institutions from 2005 to 2013 (with the exception of 2012) coded by type: libraries, non-libraries, combination of library and non-library units, and unsure. The 2012 data is incomplete due to a change in the TEI-C membership tracking system.

20 We attempted to compare the TEI Consortium membership data we collected with the TEI’s historical membership records (for 2005 to 2011 and 2013). We coded the institutions as one of the following: libraries, non-libraries, a combination (as represented by partnering units like an academic or technology department and a library), or unsure (see figure 3).

21 We see a fairly consistent membership base consisting of an average of 18 member libraries between 2005 and 2010. If we correlate membership data with the start and rise of mass digitization like Google Books (2004) and HathiTrust (2008), we do not see an obvious impact of these initiatives on library membership. The decline in membership we do see in 2011 and 2013 is not unique to library members and could very well be an effect of the 2008/2009 global recession, which led to significant reductions in library budgets (Valade-DeMelo 2009; Bailey 2009; Nicholas et al. 2010).

22 Still, simply counting member institutions does not reflect the varying level of financial support that they offer to the TEI through different classes of membership. While it is often said that libraries provide the majority of financial support for the TEI Consortium, it turns out that library members contribute an average of 45% of the TEI-C’s revenue (Hawkins 2014); not quite half, but indeed a significant collective contribution.

3.3 Text-Encoding Practices and Partnerships in Libraries

23 Libraries support text encoding across a wide spectrum of discrete tasks and work practices associated with starting and completing a text-encoding project, from consulting and training to actual markup and web publishing (see figure 4). Such activities are carried out in partnership with various other constituencies inside and


outside the library (see figure 5). As we have seen thus far, it is not surprising that the greatest number of partnerships is among library staff and departments, but we see an equally high number of partnerships with faculty and information technology (IT) staff. Such library-faculty partnerships could indicate a trend toward more advanced or scholarly text-encoding support. How tasks align with partnerships is not surprising: for example, we see IT staff featuring prominently in web-publishing tasks, and librarians featuring prominently in establishing text-encoding workflows and engaging in markup directly. Whatever the nature of the relationship, what is of particular interest is the great number of faculty partnering with libraries on text-encoding projects.

Figure 4. This graph shows ways in which respondents reported that they support text-encoding activities in their respective units.

Figure 5. This graph shows ways in which respondents reported partnering with other constituencies on text encoding.


24 Respondents were asked to rank eight types of projects or kinds of collections commonly encountered in libraries in terms of how often they work with such collections, from most common to least common. As is evident in figure 6, based on the data reflected in table 2, the top three most common types of projects or collections for which text encoding features prominently are rare books and manuscripts, archival materials, and faculty or librarian digital research projects. It appears that text encoding is reserved for special collections and unique content, not the most commonly used materials.

Figure 6. This graph shows the frequency of the three most common responses to the question “Rank the nature of your text encoding projects (1 is most common, 8 is least common)”: Rare Books & Manuscripts, Archival Materials, and Faculty or Librarian Digital Research Projects.

Table 2. This table shows all responses to the question “Rank the nature of your text encoding projects (1 is most common, 8 is least common).”

Each row lists the number of respondents assigning ranks 1 (most common) through 8 (least common), followed by the number of N/A or blank responses.

Archival Materials: 12, 19, 10, 7, 6, 3, 1, 2 | N/A or blanks: 60
Faculty or Graduate Student Digital Teaching Projects: 2, 5, 5, 5, 6, 15, 6, 7 | N/A or blanks: 61
Faculty or Librarian Digital Research Projects: 11, 13, 5, 5, 13, 3, 3, 0 | N/A or blanks: 59
Library General Collections: 4, 3, 4, 11, 14, 5, 4, 2 | N/A or blanks: 65
Other Library Special Collections: 3, 6, 17, 21, 2, 3, 3, 1 | N/A or blanks: 56
Rare Books & Manuscripts: 27, 16, 9, 2, 3, 0, 2, 1 | N/A or blanks: 52
University Collections: 1, 0, 5, 1, 7, 9, 16, 1 | N/A or blanks: 72
University Press: 7, 1, 0, 0, 0, 6, 6, 1 | N/A or blanks: 70


25 We asked respondents to describe the level of text encoding with which they most often engage, describing these levels abstractly rather than as numbers as in the Best Practices for TEI in Libraries:
• Basic reformatting of text for bibliographic and keyword search (Level 1)
• Mid-level structural encoding for full-text display and basic functionality like linking table of contents, notes, etc. (Levels 2 and 3)
• Richer encoding for content analyses like name tagging, rhyme schemes, etc. (Level 4)
• Scholarly encoding projects (Level 5)

Figure 7. This graph shows the frequency with which respondents reported conducting different types of encoding.

26 According to figure 7, we can see activity across all levels of text encoding with an emphasis on mid-level structural encoding. We also asked respondents to indicate the number of text-encoding projects with which they are involved, from none to more than 30, with most people working on 1–5, 6–10, or more than 30 projects. We correlated the number of projects with encoding levels, and assumed that those involved with fewer projects are encoding at higher levels and vice versa. Instead, we noticed a wide range of activity across all levels of encoding regardless of the number of text-encoding projects. However, as we look more closely at the correlation between levels of encoding and types of materials most commonly encoded in libraries (figure 8), we see peaks in mid-level structural encoding (level 3), richer encoding for content analysis (level 4), and scholarly encoding (level 5).


Figure 8. This graph shows the frequency of different types of encoding for two types of material reported as the most commonly encoded.

3.4 Text-Encoding Interests and Attitudes

27 We presented respondents with a mixture of quantitative and qualitative questions with respect to text-encoding interests and attitudes across their library. We correlated responses to both sets of questions to ensure reliability of the responses.

Table 3. This chart shows responses (n=112) to survey questions in which respondents rated their library’s administrative support for text-encoding projects and general level of interest in text-encoding projects across the library as a whole.

How would you rate your library’s administrative support for text encoding projects?
  Extremely Supportive: 5 | Very Supportive: 21 | Moderately Supportive: 37 | Slightly Supportive: 23 | Not at all Supportive: 13 | Not Applicable: 6 | No Response: 7

How would you rate the level of interest in text encoding by members of your library as a whole?
  Extremely Interested: 1 | Very Interested: 10 | Moderately Interested: 36 | Slightly Interested: 39 | Not at all Interested: 16 | Not Applicable: 4 | No Response: 6


Figure 9. This graph shows a cross-tabulation of reported administrative support for text encoding and reported general interest across the respondent’s library in text encoding.

28 As seen in table 3 and figure 9, administrative support and general interest in text encoding across the libraries are closely related, as they are respectively situated in the moderately-to-slightly-interested and moderately-and-slightly-supportive responses of the Likert scale.

29 At face value, this occupation of the middle ground might seem like a safe, even rational, place for an institution given this time of transition for academic libraries as they begin to define themselves more clearly in the age of digital scholarship. However, the sentiments on the fringes of the Likert scale are problematic. We see little to no correlation between an extremely and very supportive library administration and an extremely and very interested library staff. And the “not interested” camp is threatening to tip the moderate scale.

30 We then compared the quantitative responses to qualitative responses we collected. A little over half of the respondents (approximately 63 of 112) answered the question: “In a few sentences, could you describe how you see the state of and attitudes toward text encoding in your library today?” We completed two levels of coding for the qualitative responses to this question: we assigned thematic categories to the responses (following the three-step process identified above in the “Data Preparation” section), and then we tagged the categories as either positive, negative, or neutral. For this analysis, we did not disqualify those who only provided responses to the quantitative questions, though we disclose the number of “no responses.” Because of this discrepancy, the quantitative responses are marginally inflated, but they do not seem to detract from or bias the qualitative responses in any way, as is made clear by their strong correlation.

31 Those in the neutral camp (35%) align well enough with the slightly-to-moderately-interested/supportive camp as seen in figure 9. The negative responses dominate at 44%, which illustrates a perceived threat to text encoding in libraries (see figure 11), leaving 21% positive responses.

32 Figure 10 reveals the categories coded as positive and their distribution among respondents. The granular coding makes it impossible to generalize these sentiments more broadly, but the number of people who reported “expected uptake” and “general


interest” in text-encoding projects is heartening. Responses indicating that the survival of text encoding in their library is a result of individual initiative are more concerning since this implies an overall lack of institutional support. Though the numbers are not as high, interest among catalogers and the training opportunities around text encoding correlate with trends we are seeing in figures 1 and 4.

Figure 10. Of responses (n=63) to the question “In a few sentences, could you describe how you see the state of and attitudes toward text encoding in your library today?” this graph shows responses with portions coded as positive (n=25) after two levels of coding: (1) themes were identified and then (2) themes were tagged as positive, negative, or neutral.

33 The findings for the categories coded as negative are not especially surprising (figure 11). Libraries have been struggling with the resource intensity of text encoding, from doing markup to publishing the encoded texts online, for years. The various types of opposition to text encoding reported require further exploration. While we have not correlated the responses coded as various types of opposition with responses indicating that text encoding is resource-intensive, we suspect a tight relationship.


Figure 11. Of responses (n=63) to the question “In a few sentences, could you describe how you see the state of and attitudes toward text encoding in your library today?” this graph shows responses with portions coded as negative (n=52) after two levels of coding: (1) themes were identified and then (2) themes were tagged as positive, negative, or neutral.

34 The neutral camp yielded a medley of codings (figure 12), including expressions of apathy, mixed feelings about whether text encoding is a viable endeavor for libraries, and uncertainty about the benefits of encoding. A few codings, however, were used for ambiguous responses that could easily manifest as positive or negative depending on the argument made. However, two themes emerged which can be seen as somewhat at odds: libraries engage selectively in text encoding, yet they prioritize basic access to text collections. Specifically, while we know that text encoding is more often used for special collections and for scholarly projects than for general collections, libraries are under pressure to provide online access to their general collections as well, forgoing encoding for simple digitization using facsimile page images and keyword searching.


Figure 12. Of responses (n=63) to the question “In a few sentences, could you describe how you see the state of and attitudes toward text encoding in your library today?” this graph shows responses with portions coded as neutral (n=42) after two levels of coding: (1) themes were identified and then (2) themes were tagged as positive, negative or neutral.

4. Discussion

35 We have uncovered several areas that deserve additional investigation and consideration. As we move forward, the “TEI and libraries” community would benefit from
• gaining a more global perspective and understanding of text encoding in libraries
• proposing TEI Consortium member benefits for libraries of all sizes with a special emphasis on cohesive, centralized, and certified training opportunities offered by the Consortium
• verifying what appears to be a concerted effort by libraries to use text encoding for special collections, and determining to what extent that correlates with the peaks we observed in structural encoding (level 3), richer encoding for content analysis (level 4), and scholarly encoding (level 5). In understanding the nature of these collections and scenarios in which text encoding is deemed important for discovery of these collections, we would be better positioned to provide fine-tuned, relevant training, guidelines, and overall support for libraries.
• exploring ways in which text encoding is resource intensive, focusing both on easing the publishing process for libraries and on libraries facilitating ways in which scholars can self-publish. These options might include: better promotion of Best Practices for TEI in Libraries, which now contains schemas for encoding at levels 1 through 4; understanding how libraries can benefit from and contribute to the TAPAS project; and supporting TEI Simple (formerly “TEI-Nudge,” see Mueller 2013). These three initiatives imply a strong role libraries can take, with the TEI Consortium’s help, in fostering TEI-aware publishing systems.

36 The limitations of the survey and the lack of longitudinal data temper any conclusions that could be drawn from the survey results. Conveyed herein is at most a snapshot of TEI in libraries today, but a snapshot with great promise. This study dovetails with


more recent research conducted by Harriett Green (2012, 2013) that aimed to identify concrete ways in which libraries can foster and support text encoding for library and scholarly research projects. Though we have yet to consult these and other related data sources systematically, we have released our own data set for others to use.

37 In retrospect, we consider this survey to be a preliminary data-gathering instrument. The findings as summarized above debunk our wholesale hypothesis that text-encoding practices have significantly declined in libraries. However, the data we have gathered alone are not robust enough to make more specific claims about the state of text encoding in libraries. We are more acutely aware of this precarious “middle zone” of neither giving up on nor fully embracing text encoding that libraries are occupying and will focus our investigations on uncovering and understanding the nuances of being in the middle as a way to further refine this study.

BIBLIOGRAPHY

Bailey, Charles W., Jr. 2009. “Seven ARL Libraries Face Major Planned or Potential Budget Cuts.” DigitalKoans (blog), April 28. http://digital-scholarship.org/digitalkoans/2009/04/28/seven-arl-libraries-face-major-planned-or-potential-budget-cuts/.

Besser, Howard. 2004. “The Past, Present, and Future of Digital Libraries.” In A Companion to Digital Humanities, edited by Susan Schreibman, Ray Siemens, and John Unsworth. Oxford: Blackwell. http://www.digitalhumanities.org/companion/.

Bradley, John. 2004. “Text Tools.” In A Companion to Digital Humanities, edited by Susan Schreibman, Ray Siemens, and John Unsworth. Oxford: Blackwell. http://www.digitalhumanities.org/companion/.

Dalmau, Michelle, and Angela Courtney. 2011. “The Victorian Women Writers Project Resurrected: A Case Study in Sustainability.” Paper presented at Digital Humanities 2011: Big Tent Humanities, Palo Alto, California, June 19–22.

Digital Library Federation. 1999. TEI Text Encoding in Libraries: Guidelines for Best Encoding Practices, version 1.0. http://old.diglib.org/standards/tei-old.htm.

Engle, Michael. 1998. “The Social Position of Electronic Text Centers.” Library Hi Tech 16(3/4): 15–20, 42. http://dx.doi.org/10.1108/07378839810304522.

Friedland, LeeEllen. 1997. “Do Digital Libraries Need the TEI? A View from the Trenches.” Paper presented at TEI10: The Text Encoding Initiative Tenth Anniversary User Conference, Providence, Rhode Island, November 14–16. http://www.stg.brown.edu/conferences/tei10/tei10.papers/friedland.html.

Giesecke, Joan, Beth McNeil, and Gina L. B. Minks. 2000. “Electronic Text Centers: Creating Research Collections on a Limited Budget, The Nebraska Experience.” Journal of Library Administration 31(2): 77–92. http://digitalcommons.unl.edu/libraryscience/63/.

Google. 2012. “Google Books History.” Last modified December 21. http://www.google.com/googlebooks/about/history.html.


Green, Harriett. 2012. “Library Support for the TEI: Tutorials, Teaching, and Tools.” Paper presented at TEI and the C(r|l)o(w|u)d: 2012 Annual Conference and Members’ Meeting of the TEI Consortium, College Station, Texas, November 8–10.

———. 2013. “TEI and Libraries: New Avenues for Digital Literacy?” dh+lib: Where Digital Humanities and Librarianship Meet, January 22. http://acrl.ala.org/dh/2013/01/22/tei-and-libraries-new-avenues-for-digital-literacy/.

Hawkins, Kevin. 2014. “On the Relative Weight of Libraries among Members of the TEI Consortium.” Ultra Slavonic (blog), January 12. http://www.ultraslavonic.info/blog/?p=46.

Hawkins, Kevin, Michelle Dalmau, and Syd Bauman. 2011. Best Practices for TEI in Libraries, version 3.0. http://purl.oclc.org/NET/teiinlibraries.

Jockers, Matthew L., and Julia Flanders. 2013. “A Matter of Scale.” Keynote presentation at Boston-Area Days of DH 2013, Boston, MA, March 18. http://digitalcommons.unl.edu/englishfacpubs/106/.

Milewicz, Liz. 2012. “Why TEI? Text > Data Thursday.” Duke University Libraries News, Events, and Exhibits (blog), September 26. http://blogs.library.duke.edu/blog/2012/09/26/why-tei-text-data-thursday/.

Mueller, Martin. 2013. “TEI-Nudge or Libraries and the TEI.” Blog of the Center for Scholarly Communication & Digital Curation, October 1. http://sites.northwestern.edu/cscdc/?p=872.

Muñoz, Trevor. 2012. “Digital Humanities in the Library Isn’t a Service.” Notebook (blog), August 19. http://trevormunoz.com/notebook/2012/08/19/doing-dh-in-the-library.html.

Nellhaus, Tobin. 2001. “XML, TEI, and Digital Libraries in the Humanities.” portal: Libraries and the Academy 1(3): 257–77. http://muse.jhu.edu/journals/portal_libraries_and_the_academy/v001/1.3nellhaus.html. doi:10.1353/pla.2001.0047.

Nicholas, David, Ian Rowlands, Michael Jubb, and Hamid R. Jamali. 2010. “The Impact of the Economic Downturn on Libraries: With Special Reference to University Libraries.” The Journal of Academic Librarianship 36(5): 376–82. doi:10.1016/j.acalib.2010.06.001.

Siemens, Lynne et al. 2011. “The Apex of Hipster XML GeekDOM.” Journal of the Text Encoding Initiative (1). http://jtei.revues.org/210. doi: 10.4000/jtei.210.

Sukovic, Suzana. 2002. “Beyond the Scriptorium: The Role of the Library in Text Encoding.” D-Lib Magazine 8(1). http://www.dlib.org/dlib/january02/sukovic/01sukovic.html.

Tomasek, Kathryn. 2011. “Digital Humanities, Libraries, and Scholarly Communication.” Doing History Digitally (blog), November 2. http://kathryntomasek.wordpress.com/2011/11/02/digital-humanities-libraries-and-scholarly-communication/.

Valade-DeMelo, Lisa. 2009. Impact of the Economic Recession on University Library and IT Services. London: Joint Information Systems Committee. http://www.webarchive.org.uk/wayback/archive/20140614203644/http://www.jisc.ac.uk/publications/research/2009/libsitimpacts.aspx.

Vandegrift, Micah. 2012. “What is Digital Humanities and What’s It Doing in the Library?” In the Library with the Lead Pipe (June 27). http://www.inthelibrarywiththeleadpipe.org/2012/dhandthelib/.

Wilkin, John. 2011. “HathiTrust’s Past, Present, and Future.” Remarks presented at the HathiTrust Constitutional Convention, Washington, DC, October 8. http://www.hathitrust.org/blogs/perspectives-from-hathitrust/hathitrust039s-past-present-and-future.


NOTES

1. Daniel Paul O’Donnell to TEI-L mailing list, June 19, 2008, http://listserv.brown.edu/archives/cgi-bin/wa?A2=TEI-L;MCZ7mw;200806191855060600.
2. “AccessTEI: A Digitization Benefit for Members of the Text Encoding Initiative,” last updated June 14, 2010, http://www.tei-c.org/AccessTEI/.
3. “Do You TEI? A Survey of Text Encoding Practices in Libraries” (poster presented at the 2012 Digital Library Federation Forum, Denver, CO, November 4), http://www.slideshare.net/kshawkin/20121104-14988094.
4. For more information about the data set, and access to the raw data and survey questions, visit https://github.com/mdalmau/tei_libraries.
5. “‘Understanding the TEI-C Community’ Study,” TEI Members Business Meeting, November 7, 2008: Minutes, last updated November 7, 2008, http://www.tei-c.org/Membership/Meetings/2008/mm48.xml.

ABSTRACTS

In the early days of the TEI Guidelines, academic libraries extended their access and preservation mandates to include electronic text, providing expertise in authority control, subject analysis, and bibliographic description. But the advent of mass digitization efforts involving simple scanning of pages and OCR called into question such a role for libraries in text encoding. This paper presents the results of a survey targeting library employees to learn more about text-encoding practices and to gauge current attitudes toward text encoding.

INDEX

Keywords: libraries, digital libraries, mass digitization, text encoding practices

AUTHORS

MICHELLE DALMAU
Michelle Dalmau, Head of Digital Collections Services at the Indiana University Libraries, manages digital projects and services for the Libraries and affiliated cultural heritage organizations, leads e-text initiatives at the IU Libraries, and partners with faculty on scholarly text-encoding projects like Chymistry of Isaac Newton and Swinburne Project. She co-convened the TEI Libraries SIG from 2007 to 2013 and co-edited Best Practices for TEI in Libraries. She currently leads a praxis-based professional development initiative, “Research Now: Cross-Training for Digital Scholarship,” for the IU Libraries’ Scholars’ Commons. Her undergraduate background is in English and Art History; she holds MLS and MIS degrees from Indiana University.


KEVIN HAWKINS
Kevin S. Hawkins is director of library publishing for the University of North Texas Libraries. Previously he was director of publishing operations for Michigan Publishing, the hub of scholarly publishing at the University of Michigan Library, which includes the University of Michigan Press and other brands and services. Kevin has also worked as visiting metadata manager for the Digital Humanities Observatory, a project of the Royal Irish Academy. His involvement with the TEI includes co-editing the 2011 revision to Best Practices for TEI in Libraries and serving for two terms on the Technical Council.
