Journal of the Text Encoding Initiative

Issue 5 | June 2013: TEI Infrastructures

Tobias Blanke and Laurent Romary (dir.)

Electronic version URL: http://journals.openedition.org/jtei/772 DOI: 10.4000/jtei.772 ISSN: 2162-5603

Publisher TEI Consortium

Electronic reference: Tobias Blanke and Laurent Romary (dir.), Journal of the Text Encoding Initiative, Issue 5 | June 2013, « TEI Infrastructures » [Online], online since 19 April 2013, connection on 04 March 2020. URL: http://journals.openedition.org/jtei/772; DOI: https://doi.org/10.4000/jtei.772

This text was automatically generated on 4 March 2020.

TEI Consortium 2011 (Creative Commons Attribution-NoDerivs 3.0 Unported License)

TABLE OF CONTENTS

Editorial Introduction to the Fifth Issue Tobias Blanke and Laurent Romary

PhiloLogic4: An Abstract TEI Query System Timothy Allen, Clovis Gladstone and Richard Whaling

TAPAS: Building a TEI Publishing and Repository Service Julia Flanders and Scott Hamlin

Islandora and TEI: Current and Emerging Applications/Approaches Kirsta Stapelfeldt and Donald Moses

TEI and Project Bamboo Quinn Dombrowski and Seth Denbo

TextGrid, TEXTvre, and DARIAH: Sustainability of Infrastructures for Textual Scholarship Mark Hedges, Heike Neuroth, Kathleen M. Smith, Tobias Blanke, Laurent Romary, Marc Küster and Malcolm Illingworth

The Evolution of the Text Encoding Initiative: From Research Project to Research Infrastructure Lou Burnard


Editorial Introduction to the Fifth Issue

Tobias Blanke and Laurent Romary

1 In recent years, European governments and funders, universities and academic societies have increasingly discovered the digital humanities as a new and exciting field that promises new discoveries in humanities research. The funded projects are, however, often ad hoc experiments that stand in isolation from other national and international work. What is lacking is an infrastructure to leverage these pioneering projects into systematic investigations, with methods and technical environments that can be taken up by others. The editors of this special issue are both directors of the Digital Research Infrastructure for the Arts and Humanities (DARIAH), which aims to set up a virtual bridge between the many digital arts and humanities projects across Europe. DARIAH is being developed with the understanding that infrastructure need not imply a seemingly generic and neutral set of technologies, but should be based upon communities.

2 DARIAH can learn from the successes and failures of such communities. The Text Encoding Initiative (TEI) has been one of the successes. It has emerged as the de facto standard for online critical scholarly editions as well as a format that promotes interoperability and exchange. Furthermore, the TEI as a community brings people together in a concentrated dialogue about particular problems and challenges they face and projects they can build together, such as an infrastructure to support TEI scholarship, comprising workflows, support for collaborative encoding, and distributed annotation. The TEI is therefore the perfect point of orientation for DARIAH as it develops European communities with a long-term outlook, built not just around scholarly editions but around people, data, and tools. The papers in this special issue very much reflect this community spirit and demonstrate how new approaches and services can emerge through collaboration.

3 The first paper, by Timothy Allen, Clovis Gladstone, and Richard Whaling ("PhiloLogic4: An Abstract TEI Query System"), presents the method used by PhiloLogic, a reference search engine for encoded textual documents, to adapt to the great variety of syntax and semantics found in TEI documents. Based upon an internal abstract model for representing structural levels within XML documents, it provides an easy-to-customize environment for users to extend the scope and complexity of document handling.

4 A more TEI-centric approach is provided by the TAPAS project, presented by Julia Flanders and Scott Hamlin ("TAPAS: Building a TEI Publishing and Repository Service"). The idea is to offer a generic repository service that is based on a core number of generic publishing scenarios identified within the TEI user community. From an infrastructural point of view, TAPAS wants to meet the needs of small projects that often do not have the means to deploy their own repository environment.

5 An additional level of complexity is offered in the next paper, "Islandora and TEI: Current and Emerging Applications/Approaches," where Kirsta Stapelfeldt and Donald Moses show how a Drupal- and Fedora-based digital object management system can be adapted to deal with TEI documents and how a whole workflow, from OCR to complex inline annotations, can be created within a coherent architecture. The authors reflect on one of the key issues in processing TEI documents: reconciling precise encoding with the specifics of OCR or NLP tools (in this case, GATE).

6 Stepping back from an analysis of particular software, Quinn Dombrowski and Seth Denbo ("TEI and Project Bamboo") analyze the impact that the widely deployed TEI Guidelines and the corresponding user community have had on the development of Project Bamboo. Conceived as a digital infrastructure project for the humanities, Bamboo addressed various forms of services, from technical components to generic guidelines for data curation. By reviewing several experiments and activities carried out within the initial phase of the Bamboo project, the authors demonstrate how the TEI Guidelines have been seminal in understanding interoperability issues and will therefore continue to serve as an essential basis for future text-based infrastructural initiatives in the humanities.

7 In the next paper, Mark Hedges, together with several colleagues representing the TextGrid, TEXTvre, and DARIAH initiatives, discusses how infrastructures for textual scholarship can be made sustainable and how the TEI, itself seen as an infrastructure, can contribute to this idea. TextGrid was based on the TEI Guidelines from the start, as was TEXTvre after it; from the beginning, neither followed the model of generic repository infrastructures, emphasizing instead the community characteristics of dedicated text-based virtual research environments. As examples of how to work together on tools for editing, annotation, and more, these projects have inspired the European DARIAH infrastructure to pay particular attention to how the TEI can offer services to the wider community of digitally based research in the humanities.

8 TEI as an infrastructure is indeed the best way to summarize the final paper of this special issue ("The Evolution of the Text Encoding Initiative: From Research Project to Research Infrastructure"), in which Lou Burnard provides a general overview of the historical, editorial, and technical developments of the TEI Guidelines over the last 25 years. Burnard shows in particular how the early customization principles included in the TEI Guidelines, followed by the invention of the ODD specification language, helped popularize data modeling in support of one's research objectives in the humanities.


AUTHORS

TOBIAS BLANKE

Tobias Blanke is the director of the MA in Digital Asset and Media Management. His academic background is in philosophy and computer science. Tobias Blanke has played a leading role on several multi-disciplinary research projects, mainly concerned with digital libraries and archives, as well as infrastructures for research, particularly in the arts and humanities. Recently completed research includes work on open-source optical character recognition; open linked data and semantically enriched, intelligent digital content; scholarly primitives; and document mining and information extraction for research. Tobias Blanke also continues earlier research on new logic-based approaches to information science. He is on several international committees. Most notably, he is one of the transition directors of DARIAH. He also leads the joint research work for EHRI, a pan-European consortium to build a European Holocaust Research Infrastructure.

LAURENT ROMARY

Laurent Romary is Directeur de Recherche at INRIA (France) and a guest scientist at Humboldt University (Berlin, Germany). He carries out research on the modeling of semi-structured documents, with a specific emphasis on texts and linguistic resources. He received a PhD in computational linguistics in 1989 and his Habilitation in 1999. He launched and directed the Langue et Dialogue team at Loria (Nancy, France) and participated in several national and international projects related to the representation and dissemination of language resources and to man-machine interaction, coordinating the MLIS/DHYDRO, IST/MIAMM, and eContent/Lirics projects. He has been the editor of ISO standard 16642 (TMF, Terminological Markup Framework) and is the chairman of ISO committee TC 37/SC 4 on Language Resource Management, as well as a member (2001–2007) and then chair (2008–2011) of the TEI Council. In recent years, he led the Scientific Information directorate at CNRS (2005–2006) and established the Max Planck Digital Library (Sept. 2006–Dec. 2008). He currently contributes to the establishment and coordination of the DARIAH infrastructure in Europe as a transitional director.


PhiloLogic4: An Abstract TEI Query System

Timothy Allen, Clovis Gladstone and Richard Whaling

1. Introduction

1 A common problem in TEI-based software development is that the semantic intricacies of the XML encoding demand that the engineering team develop a custom software stack, often from scratch. While this has been a successful approach for a number of projects (to name just one example, the Perseus Digital Library has had a long run of successes with it), the duplication of effort and the high cost of entry are a detriment to the TEI community as a whole. Further, the entire approach can break down when a heterogeneous corpus is aggregated from several smaller source projects. The Text Creation Partnership has not, in fact, succeeded in creating rigorously interoperable texts (Unsworth and Mueller 2009; Pytlik Zillig 2009), and although the Monk Project made great strides in developing tools to normalize TEI (Unsworth and Mueller 2009; Pytlik Zillig 2009), we would observe that in the general case, the diversity of possible and extant TEI encodings demands more flexible tools. In that spirit, we will describe the design of version 4 of the PhiloLogic corpus query engine, which has a number of architectural features that address these and other obstacles.

2 Prior versions of PhiloLogic, developed in the early 2000s (Olsen 2002) and released under the Affero GPL open-source license, attempted to use extremely detailed procedural programming techniques to analyze and "guess" the best way to index a file, without a priori knowledge, by mapping a variety of common TEI XML and SGML features to an abstract data model. This worked in many common cases, but in practice, the size and complexity of the program prohibited substantial modifications or customizations. In this paper, we will describe how we have standardized PhiloLogic's abstract data model, brought it into compliance with the TEI P5 Guidelines, and leveraged contemporary XPath tools for ease of use and configurability, all while maintaining PhiloLogic's performance and scalability characteristics. We will also demonstrate the end-user benefits of querying a data model designed to match the TEI specification, and the advantages in precision, expressivity, and ease of use over standard search engines such as Apache Solr. Finally, we will show that the generality of this architecture has substantial benefits for software reuse, and allows for powerful TEI applications to be adapted to new corpora with a minimum of custom programming. By providing an end-to-end TEI processing library, we allow developers to use the same abstract data model at all stages in the application and permit formatting software to work on a wide variety of TEI constructs and schemata. This enables developers to leverage their knowledge of their own corpora and the TEI Guidelines rather than delving into the minutiae of web frameworks and relational databases. This pragmatic benefit will lead us to discuss the more general and theoretical implications of abstraction as a TEI processing technique.

2. Data Model

3 In order to index TEI documents encoded in different ways, PhiloLogic maps such documents to an abstract data model, which is based on the set of "basic model classes" listed in section 1.3.2 of the TEI Guidelines (TEI Consortium 2013). These "basic model classes" subdivide the set of all elements found in the Guidelines, and they form the basis of PhiloLogic's type system, along with a few additions:

1. Document: jointly describes the contents of the TEI header and the <text> element itself.
2. Div-like: includes the component and division model classes, or other large container units.
3. Paragraph/Chunk: typically includes <p>, <sp>, or similar elements.
4. Sentence/Phrase: by default, sentence objects are generated by the tokenizer.
5. Words: first-class objects.
6. Parallel objects: spans of text between milestone elements forming a "parallel hierarchy" of pages, lines, or other components.

4 Of these object classes, Div-like and Sentence/Phrase are potentially recursive, to arbitrary depth (as in the TEI Guidelines), while Document, Chunk, Word, and Parallel are non-recursive. The major point of difference from the XML/XQuery data model, which treats only elements as first-class objects (W3C [World Wide Web Consortium] 2010), is that PhiloLogic treats words and sentences as first-class objects, defined either by explicit TEI markup or by user-defined tokenization patterns. This allows it to support the storage of arbitrary attributes on any object in the hierarchy, such as for common bibliographic metadata, complex morphosyntactic or analytic metadata at the word level, or most anything in between. We believe that this model is a useful compromise between the expressivity permitted by the TEI P5 Guidelines and the far stricter schema used by relational databases, and have found that it is applicable, without modification, to a large majority of extant corpora.
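To make the shape of this hierarchy concrete, here is a minimal sketch in Python of how such objects might be represented; the class name PhiloObject, its fields, and the sample values are illustrative assumptions for this sketch, not PhiloLogic4's actual internal structures.

from dataclasses import dataclass, field
from typing import Optional

# The six object types of the abstract model; "page" stands in for the parallel objects.
OBJECT_TYPES = ("doc", "div", "para", "sent", "word", "page")

@dataclass
class PhiloObject:
    type: str                                        # one of OBJECT_TYPES
    parent: Optional["PhiloObject"] = None
    attributes: dict = field(default_factory=dict)   # arbitrary metadata: author, pos, lemma, ...
    children: list = field(default_factory=list)

    def add_child(self, obj_type: str, **attributes) -> "PhiloObject":
        child = PhiloObject(obj_type, parent=self, attributes=attributes)
        self.children.append(child)
        return child

# A document containing a (recursive) div, a paragraph, and a word object:
doc = PhiloObject("doc", attributes={"author": "Sophocles", "title": "Antigone"})
div = doc.add_child("div", head="Prologue", n="1")
para = div.add_child("para", who="Antigone")
para.add_child("word", lemma="λέγω", pos="verb")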

5 In the current state of practice, more and more projects are moving their metadata out of TEI markup and into such formats as standoff XML files (ISO [International Organization for Standardization] 2012), SQL tables (Dik and Whaling 2008), and RESTful web services (Smith 2009). As a result, merely modeling the TEI itself can be insufficient to capture the rich range of data that modern projects are accumulating. To support this encoding style, PhiloLogic4 does not restrict attribute names, classes, or values to the categories defined in the Guidelines, giving developers the ability to customize and extend the data model at will, albeit at the expense of interoperability. Ideally, all TEI constructs that contain useful metadata could be mapped to unambiguous RDF predicates, much as basic bibliographic data can be mapped to Dublin Core, and indeed we support properly namespaced object attributes for just such a use. However, in the absence of such a semantic model for the TEI Guidelines, such a practice is probably counterproductive. As we will discuss in the following section, progress in standardizing external annotation and metadata formats should greatly ameliorate this problem well before a comprehensive semantic map of the Guidelines could be completed.
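As an illustration of the kind of namespaced mapping described here, a document-level metadata record could be exposed as Dublin Core predicates roughly as in the sketch below; the field names and the mapping table are assumptions made for the example, not part of PhiloLogic4.

# Hypothetical mapping of document-level metadata fields to Dublin Core predicates.
DC = "http://purl.org/dc/elements/1.1/"

FIELD_TO_PREDICATE = {
    "author": DC + "creator",
    "title":  DC + "title",
    "date":   DC + "date",
}

def as_rdf_triples(object_uri, attributes):
    """Yield (subject, predicate, object) triples for fields with a known predicate."""
    for name, value in attributes.items():
        predicate = FIELD_TO_PREDICATE.get(name)
        if predicate:
            yield (object_uri, predicate, value)

for triple in as_rdf_triples("urn:example:doc1", {"author": "Sophocles", "title": "Antigone"}):
    print(triple)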

3. Parser Design

6 For PhiloLogic's XML-parsing components, we had a number of interrelated design goals. We wanted to support the widest possible range of valid TEI XML documents, and we wanted to use a stream-based parsing algorithm in order to support extremely large files. Most importantly, we wanted to use a declarative syntax for specifying all behaviors of the parser.

7 PhiloLogic3 used a 3,000-line Perl SGML parser, with over a hundred regular expressions to cover the most common TEI Lite constructs; while it was extremely efficient, in practice it proved difficult to modify for other TEI variants, and its behavior was often opaque. In the last ten years, the related technologies of XPath, XQuery, and XSLT have become the tools of choice for XML processing and transformation. Although these tools are outstanding in terms of clarity and precision, they have substantial and fundamental performance disadvantages, particularly with large, heterogeneous corpora, in which the complexity of a query expression grows in proportion to the size of the corpus, and the requisite processing time thus grows exponentially, as described by Gottlob, Koch, and Pichler (2005).

8 To avoid this, we constructed a streaming, linear-time XPath evaluator capable of handling a useful subset of the full XPath specification. We control the behavior of this parser by specifying two different sets of XPath expressions: object XPaths (example 1), which map XPath expressions onto the six object types described in the previous section, and metadata XPaths (example 2), which further describe how to capture metadata attributes for each kind of object. The behavior of the parsing algorithm is simple. It walks through the tree in document order, and for each element, it first checks all the object XPaths to determine if it needs to emit a new object into the index; then it checks the current element for all metadata paths relative to all ancestor objects, and extracts content or attribute values as necessary to store in the index as a named attribute. In the case of multiple metadata matches, the attribute is only assigned to the innermost matching object.


XPaths = { "." => "doc", ".//front" => "div", ".//back" => "div", ".//div" => "div", ".//div1" => "div", ".//div2" => "div", ".//div3" => "div", ".//p" => "para", ".//sp" => "para", ".//stage" => "para" ".//s" => "sent" ".//w" => "word" ".//pb" => "page" }

Example 1. A set of object XPaths, expressed as a mapping of XPath expressions onto object types

Metadata_Xpaths = {
    ("doc",  "./teiHeader/fileDesc/titleStmt/author") => "author",
    ("doc",  "./teiHeader/fileDesc/titleStmt/title")  => "title",
    ("doc",  "./teiHeader/sourceDesc//date")          => "date",
    ("doc",  ".@xml:id")                              => "id",
    ("div",  "./head")                                => "head",
    ("div",  "./head//*")                             => "head",
    ("div",  ".@n")                                   => "n",
    ("div",  ".@xml:id")                              => "id",
    ("para", "./speaker")                             => "who",
    ("para", "./head")                                => "head",
    ("page", ".@n")                                   => "n",
    ("sent", ".@xml:id")                              => "id",
    ("word", ".@xml:id")                              => "id",
    ("word", ".@pos")                                 => "pos",
    ("word", ".@lemma")                               => "lemma"
}

Example 2. A set of metadata XPaths, expressed as an (object-type, XPath) => field-name mapping
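The following sketch shows one way the two declarative maps above can drive a single streaming pass, in roughly the manner described in the text. It is our illustration only: it is written against lxml's iterparse rather than PhiloLogic4's own lexer, and the XPath expressions are reduced to plain tag names for brevity.

# Simplified sketch of the parse loop; all names here are illustrative assumptions.
from lxml import etree

OBJECT_TAGS = {"TEI": "doc", "div": "div", "p": "para", "sp": "para", "s": "sent", "w": "word"}
OWN_ATTRS = {("div", "n"): "n", ("word", "lemma"): "lemma", ("word", "pos"): "pos"}
CHILD_TAGS = {("div", "head"): "head", ("para", "speaker"): "who"}

def parse(path):
    """Walk the document in order, emitting one (type, metadata) record per object."""
    stack, records = [], []
    for event, elem in etree.iterparse(path, events=("start", "end")):
        tag = etree.QName(elem).localname
        if event == "start" and tag in OBJECT_TAGS:
            obj = {"type": OBJECT_TAGS[tag], "meta": {}}
            for name, value in elem.attrib.items():      # the object's own attributes
                field = OWN_ATTRS.get((obj["type"], etree.QName(name).localname))
                if field:
                    obj["meta"][field] = value
            stack.append(obj)
        elif event == "end":
            # content-valued metadata goes to the innermost matching open object
            for obj in reversed(stack):
                field = CHILD_TAGS.get((obj["type"], tag))
                if field:
                    obj["meta"].setdefault(field, (elem.text or "").strip())
                    break
            if tag in OBJECT_TAGS and stack:
                records.append(stack.pop())
    return records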

9 Figure 1 shows the results of our performance tests on a disparate corpus of 1,841 TEI XML files ranging in size from a few kilobytes to five megabytes. The average parse speed over the whole collection is 16.422 microseconds per byte. In the bottom quartile, containing the 461 smallest files, the average parse speed was 17.399 microseconds per byte, whereas in the top quartile it was 15.261 microseconds per byte; we believe the slight disparity is due to memory-management and garbage-collection runtime behaviors, rather than a better-than-linear performance curve. Although these measurements confirm our claim of linear performance characteristics, the overall speed (60 seconds to parse a five-megabyte file) is by no means fast. This is to be expected from an ad hoc implementation in a scripting language. However, for our purposes, aiming at universal applicability to any TEI XML file, a predictably slow implementation of a linear-time algorithm was preferable to a fast implementation of a potentially exponential one.

Figure 1. Performance Graph

10 A final design goal for the parser was a minimal, easily extensible, and maintainable design, in contrast to the monolithic Perl script in PhiloLogic3. At the time of writing, the parser module of PhiloLogic4 occupies just 241 lines of thoroughly-commented Python, a compactness achieved by separating out the XML/SGML lexing and internal object models into separate classes, as well as by instituting a "pipeline"-style functional API for all other operations. Thus not only the basic merging, sorting, and filtering but also more complex corpus-specific operations are encapsulated in a list of manipulations of the parser's output, and these operations maintain a common file format for all intermediate stages prior to final index compression.
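A minimal sketch of the "pipeline"-style functional interface described here; the stage names and the record format are illustrative assumptions rather than the module's real API. Each stage consumes and yields records, so stages can be added, removed, or reordered without touching the parser.

# Hypothetical post-parser pipeline in the functional style described above.
from functools import reduce

def normalize_case(records):
    for rec in records:
        if rec["type"] == "word":
            rec["meta"]["norm"] = rec["meta"].get("form", "").lower()
        yield rec

def drop_empty(records):
    for rec in records:
        if rec["type"] != "word" or rec["meta"].get("form"):
            yield rec

def run_pipeline(records, stages):
    """Apply each stage to the output of the previous one."""
    return reduce(lambda recs, stage: stage(recs), stages, records)

parsed = [
    {"type": "word", "meta": {"form": "Whan"}},
    {"type": "word", "meta": {"form": ""}},
]
for rec in run_pipeline(parsed, [drop_empty, normalize_case]):
    print(rec)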

11 One of the great strengths of this architecture is that it facilitates integration of external resources. For example, while PhiloLogic can extract syntactic tokens if already encoded in the text (using the metadata XPaths as described above), most documents lack this tagging and therefore need to be tokenized using a lexical database that resolves certain problematic phenomena, such as compound words in German or contractions in pre-modern English, which may be stored in a stand-off XML file or stored as a URI-accessible resource. In this case, the post-parser filtering stage is the ideal place to perform these lookups. By separating these corpus- and language-specific procedures from the main parsing code, we can allow the widest possible set of languages and texts to be handled at the highest level of accuracy, without requiring a rewrite of the parser itself. Although we currently don't include any such modules in the standard PhiloLogic distribution, we are greatly encouraged by the emergence of complementary standards such as the W3C Open Annotation Data Model (W3C 2013), ISO MAF (ISO 2012), and LMF (ISO 2008), among others, and look forward to providing comprehensive support for annotation features in a future release.
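As a concrete, purely hypothetical example of such a corpus-specific lookup, the filter below splits compound words against a small lexicon; both the lexicon and the splitting rule are assumptions made only for illustration.

# Illustrative post-parser filter: resolve compound words against an external lexicon.
# The lexicon is a toy in-memory dict standing in for a stand-off file or URI-accessible service.
COMPOUND_LEXICON = {"dampfschiff": ["dampf", "schiff"]}

def split_compounds(records):
    """Replace each compound word record with one record per component."""
    for rec in records:
        parts = COMPOUND_LEXICON.get(rec["meta"].get("norm", "")) if rec["type"] == "word" else None
        if not parts:
            yield rec
            continue
        for part in parts:
            yield {"type": "word", "meta": dict(rec["meta"], norm=part, compound_of=rec["meta"]["norm"])}

# A stage like this slots into the post-parser pipeline sketched above, e.g.:
# run_pipeline(parsed, [drop_empty, normalize_case, split_compounds])
for rec in split_compounds([{"type": "word", "meta": {"norm": "dampfschiff"}}]):
    print(rec)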


4. Query Engine

12 The exact mechanisms for querying the kind of hierarchical index structure described above are well beyond the scope of this article, although the technical problems are for the most part solved (see Witten, Moffat, and Bell 1999 and Grün 2010 for details). However, the user interface for specification of such a query remains a research question. A system like ANNIS (Rosenfeld 2010; Zeldes et al. 2009), designed for treebank queries, provides users with an extremely powerful query language for specifying intricate tree structures, assisted by a graphical query-editing tool; however, it can be very difficult to learn. Likewise, form-oriented user interfaces for complex TEI search engines tend to require the user to have substantial a priori knowledge of either the markup structure of the corpus, or of the internal data model of the search engine. In particular, we found that most users didn't use PhiloLogic3's three-level object model (ARTFL Project 2010) but instead only performed full-text or bibliographic queries due to the complexity of the search form.

13 We addressed this problem in PhiloLogic4's query engine by leveraging the same declarative markup used in the parsing procedure. The query engine can use the metadata-XPath map to infer a series of nested queries from the user input, which takes the form of a "flattened" set of key–value pairs, as depicted in examples 3a and 3b. For consistency with the parser behavior, an ambiguous or global attribute, such as @xml:id, is assigned to the innermost object type matching the query.

SELECT x FROM objects
WHERE author="Sophocles"
  AND who='Oedipus'
  AND pos='verb';

Example 3a. A flat query, before transformation, expressed in SQL-like pseudocode.

SELECT x FROM words
WHERE pos='verb'
  AND x is_contained_by_one_of(
    SELECT y FROM paragraphs
    WHERE who='Oedipus'
      AND y is_contained_by_one_of(
        SELECT z FROM documents
        WHERE author="Sophocles"
      )
  );

Example 3b. The same query, transformed by inference from examples 1 and 2.
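A sketch of the inference step that examples 3a and 3b illustrate: the flat key–value input is grouped by the object level that defines each field (per the metadata map of example 2), and the levels are then nested from the outermost inward. The table and output format below are our own illustrative assumptions, not PhiloLogic4's internal representation.

# Hypothetical translation of a flat query into nested per-level constraints.
# FIELD_LEVELS mirrors example 2; LEVEL_ORDER runs from outermost to innermost.
LEVEL_ORDER = ["doc", "div", "para", "sent", "word"]
FIELD_LEVELS = {"author": "doc", "title": "doc", "head": "div", "who": "para",
                "pos": "word", "lemma": "word"}

def nest_query(flat_query):
    """Group flat key-value pairs into one constraint set per object level,
    ordered outermost-first, so each level filters the results of the previous one."""
    per_level = {}
    for field, value in flat_query.items():
        level = FIELD_LEVELS[field]      # an ambiguous field would take the innermost match
        per_level.setdefault(level, {})[field] = value
    return [(level, per_level[level]) for level in LEVEL_ORDER if level in per_level]

print(nest_query({"author": "Sophocles", "who": "Oedipus", "pos": "verb"}))
# [('doc', {'author': 'Sophocles'}), ('para', {'who': 'Oedipus'}), ('word', {'pos': 'verb'})]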


14 This consistent object model has benefits for user interface design as well, in that it greatly facilitates the construction of a faceted browsing environment. Usually, faceted browsing is applied to commercial products or bibliographic searches, where a single set of objects is gradually reduced by the application of various filters or constraints. Although this approach has substantial usability benefits over the traditional "advanced search form" design (Hearst et al. 2002), it has proven difficult to apply to the hierarchical data structures typically used in corpus query engines: adding and removing constraints from a complex hierarchical query expression is a non-trivial operation and difficult to perform consistently based on user input. PhiloLogic4 addresses this difficulty by presenting a single set of parameters for user manipulation. The one-to-one mapping of user interactions to query parameters allowed us to construct a faceted interface in only a few hundred lines of code. In the accompanying figures 2, 3, and 4, we can see how a user can select, group, and combine metadata filters at a variety of levels (header metadata, speaker names in an <sp> tag, and word-level metadata in a <w> tag) with a single, intuitive interface.

Figure 2. Faceted browsing interface with search results grouped by speaking character.


Figure 3. Faceted browsing interface showing time-series display.

Figure 4. Faceted browsing interface showing morpho-syntactic analysis.

5. Client API

15 An unexpected outcome of our initial experiment with this "flat" query model was that it complicated the development of client software. Since any query could, in theory, return objects at any level of the document, one could easily write code that, for example, inadvertently looked for bibliographic metadata in other object types, and thus caused errors or undefined results. We had to develop a solution to handle such disparities at the programming level. Our initial attempts to develop result-formatting software used heavy introspection to infer the object type at run-time and render the results accordingly. Eventually, we devised a more elegant approach, in which a query returned not only a list of objects but also a "wrapper" object containing the union of each object with all of its ancestors.

16 Thus, with this latter approach, if we queried r.title, a document-level attribute title on a full-text query result r, we wouldn't cause a type error; instead, our wrapper would implicitly scan up the hierarchy for the innermost ancestor in which the attribute is defined. This behavior demands some additional mechanism for specifying objects more granularly: in our notation, r.div2.n specifies the attribute n of the second-outermost div, and r.word2.lemma, the attribute lemma of the second word in a phrase search.1

17 We have found that, in practice, this kind of programming interface greatly improves the ease of developing TEI processing software. By writing formatting code against an abstract metadata model, rather than the original XML, client software can be reused with minor modifications; any changes that must be made are limited to precisely the domain of object types and attributes that the database specialist built at the parsing stage, and thus do not require specialized knowledge of XML transformation libraries or of the small variations in encoding that are endemic in large corpora. For example, much of the formatting code that we used for ARTFL-Frantext, a 3,500-volume collection of texts from the 12th to 20th centuries, could be re-used for Diderot and d'Alembert's Encyclopédie with only the smallest changes, by moving the "author" and "title" metadata fields from the document object type down to the division-level objects. An additional benefit is that this flexibility extends to non-XML sources as well, allowing the integration of data from spreadsheets, MARC records, relational databases, or other sources as desired.
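A minimal sketch of the "wrapper" behavior just described, with attribute access falling through to the innermost ancestor that defines the attribute; the class and the sample metadata are illustrative assumptions, not the actual client API.

# Illustrative result wrapper: attribute lookup scans up the ancestor chain.
class ResultWrapper:
    def __init__(self, chain):
        # chain runs innermost-first, e.g. [word, paragraph, document],
        # each element being a dict of that object's metadata
        self._chain = chain

    def __getattr__(self, name):
        for obj in self._chain:
            if name in obj:
                return obj[name]
        raise AttributeError(name)

r = ResultWrapper([
    {"lemma": "λέγω", "pos": "verb"},                        # word
    {"who": "Oedipus"},                                       # paragraph (an <sp>)
    {"title": "Oedipus Tyrannus", "author": "Sophocles"},     # document
])
print(r.pos, r.who, r.title)   # word-level, paragraph-level, and document-level lookups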

6. Conclusion

18 The design of a modern corpus query engine goes beyond the architectural considerations described in this paper: scaling and distribution behaviors, especially, are critical in the current environment of mass digital corpora. However, a conceptually sound data model is a necessary prerequisite for this sort of engineering. Furthermore, while the success of any open-source software project depends as much on the community that develops around it as on any intrinsic factors, we feel that the innovations we have described can help to expose the expressive power of TEI markup to a much broader and more general audience than has previously had effective access to it. We invite feedback from the broadest spectrum of the TEI community as we continue to develop our software, not just for the improvement of any one piece of software but also to better explore and explain the operational and semantic models that underlie the markup we work with daily.


BIBLIOGRAPHY

ARTFL Project. 2010. "Developing Forms - PhiloLogic3." Accessed March 28, 2013. https://sites.google.com/site/philologic3/wiki/development-forms.

Dik, Helma, and Richard Whaling. 2008. "Bootstrapping Classical Greek Morphology." Paper presented at Digital Humanities, Oulu, Finland, June 25–29, 2008.

Gottlob, Georg, Christoph Koch, and Reinhard Pichler. 2005. "Efficient Algorithms for Processing XPath Queries." ACM Transactions on Database Systems 30(2): 444–91. doi:10.1145/1071610.1071614.

Grün, Christian. 2010. "Storing and Querying Large XML Instances." PhD diss., University of Konstanz.

Hearst, Marti, Ame Elliott, Jennifer English, Rashmi Sinha, Kirsten Swearingen, and Ka-Ping Yee. 2002. "Finding the Flow in Web Site Search." Communications of the ACM 45(9): 42–49. doi: 10.1145/567498.567525.

ISO (International Organization for Standardization). 2012. "Language Resource Management — Morpho-syntactic Annotation Framework (MAF)." ISO 24611:2012. Geneva: ISO.

ISO (International Organization for Standardization). 2008. "Language Resource Management — Lexical Markup Framework (LMF)." ISO 24613:2008. Geneva: ISO.

Olsen, Mark. 2002. "Rich Textual Metadata: Implementation and Theory." Paper presented at the Joint International Conference of the Association for Literary and Linguistic Computing and the Association for Computers and the Humanities, Tübingen, Germany, July 24–28, 2002.

Pytlik Zillig, Brian L. 2009. "TEI Analytics: Converting Documents into a TEI Format for Cross-collection Text Analysis." Literary and Linguistic Computing 24: 187–92.

Rosenfeld, Viktor. 2010. "An Implementation of the Annis 2 Query Language." Technical report, Humboldt-Universität zu Berlin.

Smith, Neel. 2009. "Citation in Classical Studies." Digital Humanities Quarterly 3(1). http://www.digitalhumanities.org/dhq/vol/3/1/000028/000028.html.

TEI Consortium. 2013. "TEI P5: Guidelines for Electronic Text Encoding and Interchange." Edited by Lou Burnard and Syd Bauman. Version 2.3.0. Last updated January 17, 2013. http://www.tei-c.org/release/doc/tei-p5-doc/en/html/index.html.

Unsworth, John, and Martin Mueller. 2009. "Monk Project Final Report." Technical report. Accessed March 28, 2013. http://www.monkproject.org/MONKProjectFinalReport.pdf.

Witten, Ian H., Alistair Moffat, and Timothy C. Bell. 1999. Managing Gigabytes: Compressing and Indexing Documents and Images. 2nd ed. San Francisco: Morgan Kaufmann.

W3C (World Wide Web Consortium). 2013. "Open Annotation Data Model." Community Draft, 08 February 2013. Edited by Robert Sanderson, Paolo Ciccarese, and Herbert Van de Sompel. http://www.openannotation.org/spec/core/.

W3C (World Wide Web Consortium). 2010. "XQuery 1.0 and XPath 2.0 Data Model (XDM)." 2nd ed. W3C Recommendation 14 December 2010. Edited by Anders Berglund, Mary Fernández, Ashok Malhotra, Jonathan Marsh, Marton Nagy, and Norman Walsh. http://www.w3.org/TR/xpath-datamodel/.


Zeldes, Amir, Julia Ritz, Anke Lüdeling, and Christian Chiarcos. 2009. "Annis: A Search Tool for Multi-layer Annotated Corpora." Proceedings of the Corpus Linguistics Conference, University of Liverpool, UK, July 20–23, 2009, article no. 358. http://ucrel.lancs.ac.uk/publications/cl2009/.

NOTES

1. Since div objects are potentially recursive, whereas words are not but are often related to one another in a phrase, we define distance in different ways for the various object types, hence the semantic difference between div2 and word2. We may revise and expand this syntax in the future for more clarity.

ABSTRACTS

A common problem for TEI software development is that projects develop their own custom software stack to address the semantic intricacies present in a deeply-encoded TEI corpus. This article describes the design of version 4 of the PhiloLogic corpus query engine, which is designed to handle heterogeneous TEI encoding through its redesigned abstract data model. We show that such an architecture has substantial benefits for software reuse, allowing for powerful TEI applications to be adapted to new corpora with a minimum of custom programming, and we discuss the more general and theoretical implications of abstraction as a TEI processing technique.

INDEX

Keywords: corpus query, indexing, search, user interface, parsing, web frameworks

AUTHORS

TIMOTHY ALLEN

Timothy Allen is a project manager at the ARTFL Project, University of Chicago.

CLOVIS GLADSTONE

Clovis Gladstone is a PhD candidate in French literature at the University of Chicago whose dissertation work focuses on 18th-century intellectual history. Since 2008 he has been working for the ARTFL Project, where he has explored new methods of retrieving relevant data from large databases and participated in the development of the online Dictionnaire Vivant de la Langue Française (DVLF).

RICHARD WHALING

Richard Whaling is an Emerging Technologies Developer at the University of Chicago IT Services and the ARTFL Project.


TAPAS: Building a TEI Publishing and Repository Service

Julia Flanders and Scott Hamlin

1. Introduction

1 The question every TEI workshop instructor dreads: "So, how do I publish my TEI data?" This question and the setting that triggers it are central to the genesis of TAPAS, the TEI Archiving, Publishing, and Access Service (http://www.tapasproject.org). Following a TEI workshop at Wheaton College, participants reflected on the fact that learning and creating TEI data seemed to lie well within their grasp, but publishing that data seemed to depend on institutional resources (server space, programmer time, XML publishing expertise) that they could not count on. The central question arising from that meeting came to be the mission statement for TAPAS: how can creators of TEI data at small institutions publish and store their data, and get access to technical support for their projects? It is primarily to serve this set of needs, and this community of users, that TAPAS exists. TAPAS is currently under development by a team that includes a full-time web applications programmer, several repository experts, a TEI schema developer, a user interface design and development group, and a growing team of scholarly collaborators whose TEI projects and data form the testing ground for the service. More information is available at http://www.tapasproject.org. We plan an initial launch date early in 2014, and our goal in this article is to describe the goals and philosophy of the service and the general approach being taken to its design. It should be noted at the outset that all details given here are provisional and subject to change as the project develops.

2 Several key factors have shaped the way TAPAS is developing in response to the needs articulated at its inception. One of the most important is the question of scale: increasingly, digital publication calls for interfaces that support visualization, text analysis, and other forms of reading that situate the individual text within a much larger ecology of meaning. However, such systems typically lie beyond the capacity of individual data producers, for reasons of technical ability, funding, and also the quantity of data they have to work with. TAPAS seeks to respond to this issue by offering its users access to a shared publication infrastructure. This infrastructure can provide large-scale services that operate at the TAPAS level, thereby situating the individual scholar's data within a larger ecology of TEI data and also providing a set of tools for publication, analysis, and visualization that would otherwise be out of scope for an individual to develop and maintain. However, TAPAS and its users will face challenges in designing a system that will work gracefully both at this level of scale and also at the level of the individual text or project.

3 A second crucial question for TAPAS to address is that of data vulnerability. Ironically, despite the claim that the digital medium puts authors in a position of greater self-reliance and independence (by removing dependence on commercial publishing infrastructure), in fact, for humanities scholars, the digital medium requires greater and longer-term dependence on institutional infrastructure (or a commercial equivalent) to ensure the ongoing viability of scholarly data and electronic publications. Unlike books, digital data and publications cease to exist very quickly if not actively maintained, so scholars cannot simply publish their work and move on to another project. TAPAS seeks to respond to this problem by offering a place where data can be curated and published without ongoing involvement by the contributor. However, there are significant challenges here as well: most importantly, how to frame TAPAS's responsibilities so that they are feasible, funded, and clear. An open-ended commitment to unlimited curation of complex TEI data is neither feasible nor fundable, but in order to meet the needs of TAPAS contributors we do need to establish a bounded time period during which TAPAS will take responsibility for contributed data, and a business model that will provide support for TAPAS during that period.

4 The third substantial challenge that TAPAS is grappling with has to do with the customizability of the TEI Guidelines, and with the variability of data produced even under "unmodified" TEI. Although in discussions within the TEI community this variability is often treated as a disadvantage to be overcome and if possible eliminated, for TAPAS it poses a very interesting set of research questions about the nature and purpose of the actual variability exhibited by TEI data from small scholarly projects. Because these projects often use markup to engage in a detailed analysis of a single text (in many cases chosen precisely because of its idiosyncrasies or structural difficulty), the ability to use the TEI Guidelines in flexible and perhaps unprecedented ways is an important aspect of these projects' scholarship. Customizations (particularly those that establish new controlled vocabularies, create new elements, or modify the structure of existing elements) may constitute an important contribution to our understanding of textual structure or of a particular text's ways of making meaning. TAPAS thus has an interest in supporting these forms of scholarship. At the same time, because of its data curation goals, TAPAS also has an interest in reducing casual or unintentional variation.
The project responds to this complex set of desiderata by studying the customizations and encoding of contributors' projects, by providing a target schema that reflects good practice, and by encouraging convergence in areas where data variance does not reflect principled disagreement. At this early stage in the project's development, studying the problem is the first priority.

5 This project takes place within a landscape already well populated with large-scale infrastructural projects (Hedges 2009), such as TextGrid (http://www.textgrid.de/; TextGrid 2010), DARIAH (http://www.dariah.eu/), CLARIN (http://www.clarin.eu/external/), and the Canadian Writers Research Collaboratory (CWRC, http://www.cwrc.ca/). Projects of this kind must all confront a central set of strategic concerns and design challenges, including questions about how much uniformity to impose upon the data, how to accommodate variation, how to create interoperability layers and tools that can operate meaningfully across multiple data sets (Blanke et al. 2011), and how to manage issues of sustainability (of both the data and the service itself). TAPAS is distinctive within this landscape because of its focus on a single form of data (TEI-encoded research materials) and also because of its initial emphasis on serving an underserved constituency (scholars at smaller or underresourced institutions) rather than on providing an infrastructure that can operate comprehensively. However, there is clearly a great deal that TAPAS can learn from these projects, and there are potentially significant connections to be made between projects as well.

2. Background and Genesis

6 The development of TAPAS has proceeded in several stages. Following the original conversation described above, representatives from several small liberal-arts colleges including Wheaton College, Willamette University, Hamilton College, Vassar College, Mount Holyoke College, and the University of Puget Sound, and later joined by Brown University and the University of , sought and received a one-year planning grant from the Institute of Museum and Library Services (IMLS). This grant, which began in September 2009 and ended in December 2010, funded a series of four planning meetings at which the group developed the basic outlines of the TAPAS service and a rough specification for its activities. At this stage, the key features of the service included a repository in which TEI data could be stored on behalf of contributors, and which would also expose TAPAS data to third parties via an API; a public-facing interface that would permit readers to search, browse, and read texts from TAPAS collections; and an access-restricted interface for the upload and management of contributed data (including the configuration of publication and retention options). At this stage the planning group also articulated the principle that TAPAS data would be published on an open-access basis, without charges to the reader, and that the cost to contributors should be as low as possible (consistent with sustainability) with some form of free upload options available to support short-term experimentation. This work served as the basis for two further grant proposals: one, also funded by the IMLS, for a two-year National Leadership Grant (2011–13) to implement the essential infrastructure of the service; and another, funded by the National Endowment for the Humanities (NEH), for an 18-month Digital Humanities Start-Up Grant (2011–12) to fund the design and development of the user interface. Both of these development activities are now underway as of the time of writing.

7 In the development process the project's understanding of its scope and audience has undergone some shifts which are worth examining. First, although researchers at small liberal-arts colleges are still a crucial constituency, we are increasingly aware that the challenges they face (lack of access to XML expertise, programmer time, and repository services) are not limited to these institutions. Repository services at large institutions may still not provide the kinds of TEI-sensitive publishing opportunities that TAPAS seeks to offer, and researchers may still have difficulty getting support for a complex TEI project if they are the only TEI user at their institution, however well-resourced. Furthermore, the value of TAPAS may not be simply remedial: even for projects with publication sites of their own, contributing their data to TAPAS may offer other kinds of benefits, such as the value of putting their data into a shared ecology where it can be examined in relation to that of other projects or accessed via the TAPAS API. Another set of considerations has to do with TAPAS's ambitions towards sustainability and longevity, and the nature of the commitments the service is able to make to its contributors. It is clear that at this stage of its development TAPAS cannot make firm long-term commitments, let alone open-ended ones, but the question of how long TAPAS can preserve contributed data also entails further research concerning the needs of TAPAS contributors. It may be that long-term preservation (however that is to be defined) is less important than short- or medium-term publication and project support.

8 The initial launch of the TAPAS service is tentatively planned for the end of the IMLS implementation grant, in 2014; additional funding is already being sought for further development work, since the service at that point will be quite basic. The most important aspect of the service's future development, however, lies in the development of a TAPAS community, both individual and institutional. TAPAS was conceived from the start as a community-supported endeavor and as a way of focusing existing forms of mutual assistance. In this respect TAPAS anticipates working closely with the TEI Consortium itself. By providing ways for digital humanists (who may not immediately identify themselves as part of the TEI community) to engage with others at the same level of expertise and with the same kinds of support needs, TAPAS seeks to strengthen the existing connections between the TEI Consortium and other parts of the digital humanities community. At the institutional level as well, TAPAS has from the outset drawn on a collaborative network of colleges and universities and will seek to broaden that base of participation. In particular, while the TAPAS repository is currently hosted at Brown University, other aspects of the service can be hosted at other institutions, and even the repository hosting is envisioned as mobile over time.

3. Services

9 When completed, TAPAS will offer several different kinds of services and points of access to several different kinds of users. TAPAS contributors—creators of TEI data—will be able to upload TEI files (and perhaps other kinds of related data such as page images) to TAPAS, using either a "sandbox" interface which will require very little metadata and minimal authentication, or a fuller ingestion interface aimed at those who want to create durable projects within TAPAS. The sandbox interface is currently envisioned as a space for experimentation and short-term usage, through which users could quickly and easily see their TEI files displayed on the Web. In a teaching context such as a workshop or humanities course, the sandbox could offer an easy, immediate way for novice TEI users to see the results of their encoding work. Access to the fuller set of TAPAS publication and data curation services described below will probably require some form of membership, probably with a fee, but these details have not yet been fixed.

10 Contributors will also be able to publish their TEI data through TAPAS, creating projects which may in turn include one or more collections of TEI documents. TAPAS will offer a standard publication interface at the outset, which as the service develops will become increasingly configurable so that projects can adapt the interface to the specific needs of their data. Expert users will be able to develop and use their own XSLT and CSS, while novice users will be able to take advantage of the standard appearance and its simple configuration options to create simple publications quickly.

11 As part of the data upload and management process, it will be important for contributors to be able to inspect, validate, and troubleshoot their data. Because of the likely diversity of data contributed to TAPAS, there may be some TAPAS tools (for instance, visualization options that require the presence of specific elements) that will only be relevant to a small subset of TAPAS data. In addition, some novice users may not be aware of inconsistencies or invalidities in their data, and will need a user-friendly way of discovering and remedying these problems before their data can function within the TAPAS publication system. The larger question framing these features is that of the constraints TAPAS would impose on the data contributed. While TAPAS does not have a strong interest in imposing such constraints for their own sake, it does have a functional interest in constraint that operates at two levels. The first level has to do with the feasibility of the most essential TAPAS functions, such as the basic TAPAS publication interface and the corpus-wide search and discovery features. At this level, TAPAS has an interest in imposing some minimum requirements for metadata, and also in ensuring that the data contributed is some form of TEI. Data not meeting these basic requirements might still be accepted (and this is a question TAPAS still needs to answer), but the contributors might be advised that certain basic functions of the service may not work as expected. These first-level requirements will probably be expressed as a kind of "gateway schema" that imposes only minimal constraint on the data. The second level has to do with the more ambitious and specialized tools that TAPAS can offer to projects whose data meets a higher standard of consistency, explicitness, and detailed encoding. At this second level, contributors have an interest in meeting some optimum requirements for the encoding of specific features, if they want to take advantage of the more powerful handling of those features. For example, projects wanting to have dated entry–based materials (such as diaries or letters) navigable via a timeline would need to encode the dates of the entries in a specific way. These higher-level requirements would be expressed in a "target schema" that contributors could use in creating their data, or to which they could convert existing data. Such requirements (like the basic requirements described above) might be optional for users who were prepared to create their own publication stylesheets from scratch, or who were only interested in TAPAS as a repository service.

12 TAPAS will also offer a set of supporting services to contributors, supplying the technical expertise that is often difficult for small projects to procure for themselves, particularly in small increments.
The services currently under consideration that seem likely to offer the greatest value to TAPAS contributors are assistance with data conversion (for instance, to assist in migrating data from non-XML formats or from older versions of TEI, or to help fix problems of inconsistency), technical consulting, TEI workshops, and assistance with project management. For instance, TAPAS could assist contributors in developing encoder training materials and local documentation, or in designing schema customizations for use in data capture. Revenue from these services would help fund TAPAS's core activities.


13 TAPAS readers (members of the public with an interest in TEI data or in the content of specific TAPAS projects) will be able to access and work with TAPAS data in many different ways. TAPAS as a corpus of projects will have a corpus-level interface through which those new to TAPAS can explore, find projects and texts of interest, analyse the TEI encoding used, and perform other corpus-level activities. Individual projects will also offer their own views of their data, through their published collections, and a reader interested in a specific project might visit it directly and use its collection interface to search, browse, and read the materials it contains. TAPAS also plans to offer an API to the TAPAS data, to support third-party use of TAPAS data.

14 The final dimension of the TAPAS service is the preservation and curation of TEI data, for which the TAPAS development team is still considering various different models. It has been clear since the early planning process that long-term preservation of data is of great concern to many of those who constitute TAPAS's potential body of users, so the demand for such a service is clear. However, several questions need to be addressed before the terms of such a service could take a firm shape. First, how sustainable is TAPAS? The services it plans to offer are clearly of value, but can its potential users afford to pay the real costs of providing such services? Or are there other ways of funding these services? Second, what kind of long-term preservation do contributors really need? Do they need a trustworthy place to store their data for a clearly defined interim period, or do they need a more open-ended guarantee that their data will remain usable and accessible? Can TAPAS count on users to remain active participants in the management of their data, or should we assume that some data will become the sole responsibility of TAPAS? And if the latter, is that a responsibility we can realistically take on? Third, what provision does TAPAS need to make for migrating data into future versions of TEI, and how might such migration be funded?

15 Clearly the TEI community, and the TEI Consortium itself, have a significant role to play in the shaping of TAPAS. Although the service has a primary responsibility towards users in its target constituency (individuals and small projects who lack access to supporting resources), there may be advantages to all in planning for much broader use. Some TAPAS services, such as training and technical support, have been clearly needed for some time, and TAPAS may be a useful way of organizing and supporting them. For others, such as long-term data curation, it may require the input of the entire community and a further development phase to make an appropriate plan. TAPAS welcomes your input and support.

16 TAPAS is made possible in part by a grant from the U.S. Institute of Museum and Library Services, and in part by a grant from the National Endowment for the Humanities. Any views, findings, conclusions, or recommendations expressed in this article do not necessarily represent those of the National Endowment for the Humanities.


BIBLIOGRAPHY

Blanke, Tobias, Michael Bryant, Mark Hedges, Andreas Aschenbrenner, and Michael Priddy. 2011. "Preparing DARIAH." In Proceedings: 2011 Seventh IEEE International Conference on eScience, 158–65. Los Alamitos, CA; Washington, DC: IEEE. doi:10.1109/eScience.2011.30.

Hedges, Mark. 2009. "Grid-enabling Humanities Datasets." Digital Humanities Quarterly 3(4). http://www.digitalhumanities.org/dhq/vol/3/4/000078/000078.html.

TextGrid. 2010. "Roadmap Integration Grid/Repository." TextGrid, December 2010. http://www.textgrid.de/fileadmin/berichte-2/report-1-2-1..

ABSTRACTS

This project report details the goals and development of the TEI Archiving, Publishing, and Access Service (TAPAS). The project's mission centers on providing repository and publication services to individuals and small projects working with TEI data: a constituency that typically lacks access to institutional resources such as server space, XML expertise, and programming time. In developing this service, TAPAS addresses several important challenges: the creation of a publication ecology that operates gracefully at the level of the individual project and the TAPAS corpus; the problem of the vulnerability of TEI data in cases where projects cease their activity; and the variability and complexity of TEI data. When completed, the TAPAS service will enable its contributors to upload, manage, and publish their TEI data, and it will offer a publication interface through which both individual project collections and the TAPAS collection as a whole can be read, searched, and explored. It will also provide a range of related services such as technical consulting, data curation and conversion work, TEI workshops, tutorials, and community forums. This article discusses the philosophy and rationale behind the project's current development efforts, and examines some of the challenges the project faces.

INDEX

Keywords: infrastructure, publishing tools, repositories, customization

AUTHORS

JULIA FLANDERS

Julia Flanders is the Director of the Brown University Women Writers Project (http://www.wwp.brown.edu) and a member of the Center for Digital Scholarship in the Brown University Library. She has served as president and vice president of the Association for Computers and the Humanities, and as vice-chair and chair of the Text Encoding Initiative Consortium. She is the editor-in-chief of Digital Humanities Quarterly and also co-directs the TAPAS project. Her research focuses on text encoding, scholarly communication, and digital editing.


SCOTT HAMLIN

Scott Hamlin is the Director of Research and Instruction in Wheaton College's Library and Information Services (Norton, MA). At Wheaton, he leads a group of academic technologists and research librarians responsible for supporting information fluency and effective teaching and learning experiences through the use of information resources and technology. He has worked for many years with multiple institutions to promote the use of TEI at small undergraduate liberal arts colleges and is currently a project director for the TAPAS Project (http://tapasproject.org).


Islandora and TEI: Current and Emerging Applications/Approaches

Kirsta Stapelfeldt and Donald Moses

1. About Islandora

1 Developed at the Robertson Library, University of Prince Edward Island (UPEI), Islandora (Wilcox 2012) is a robust open-source digital asset management system that can be used anywhere that collaboration and digital data stewardship, for the short and long term, are critical. Islandora integrates the Drupal content management system (Drupal 2012) and the Fedora digital repository with a host of functions that:
• provide a public-facing website while at the same time providing a secure private workspace for authenticated members of the group
• enable users to easily customize and brand their website
• provide a set of collaboration and communication tools (e.g., document sharing and editing, data collection, blogs, RSS feeds, tagging, and commenting)
• enable authorized users to manage (create, edit, or delete) any type of digital content in a secure repository
• provide a forms-based interface for users to describe their digital content using the standards of their discipline (for example, describing herbarium samples using the Darwin Core XML schema)
• provide users with discovery tools (search and browse) to display and analyze their content
• provide custom transformations or views of content
• provide best-practice digital preservation services

2 At UPEI, Islandora is used across the entire enterprise—in learning, research, and administration. In 2010, the Atlantic Canada Opportunities Agency funded a $2.4 million project to help establish the open-source community surrounding Islandora, build the codebase, and release a number of tools ("solution packs") that would provide key functions for different knowledge domains or data management tasks. This grant also helped launch discoverygarden inc., a services company for the software, in 2010. Since 2006, operational funding at UPEI has been dedicated to the


management of the codebase. In addition to the work of UPEI staff, all software developed by discoverygarden inc. is released to the public; at the time of writing, the Islandora users' and developers' Google Groups combined have over 500 members, and the project's GitHub account has over 40 repositories.

3 The Islandora website contains additional information about Islandora, its community, software updates, and documentation. From the site, visitors can experience logging into and administering an Islandora 6 or Islandora 7 installation.

2. System Details

4 Islandora represents a modular approach to system development. At its core are two open-source projects: the Drupal content management system and the Fedora Commons Repository software, which together create a robust digital asset management system that can be used to meet the short- and long-term requirements of digital data stewardship in a collaborative context.

Figure 1. The Islandora architecture

5 The Drupal content management framework provides a highly customizable interface for collaborative content authoring and publication. It is widely used in the digital humanities community: for example, in TEICHI (Pape, Schöch, and Wegner 2012), Saint Patrick's Confessio,1 and in tutorials such as Drupal for Humanists.2 There are over 20,000 modules available for various versions of Drupal that extend the functionality of a Drupal site; the Islandora project maintains a core Islandora module, and Islandora solution pack modules for both Drupal 6 and Drupal 7. These Islandora modules allow a user to administer an Islandora site in a way that feels much like administering a Drupal site. By utilizing the Islandora modules, implementers can use Drupal to discover and manage assets in a Fedora Commons repository. A Fedora Commons


repository provides complex data modeling and a datastore better suited to knowledge sharing and long-term data preservation than Drupal. Fedora provides not only web-based services and APIs but also support for metadata sharing and RDF (via a Mulgara triplestore). Apache Solr provides the ability to search and discover Fedora assets in a way that respects Fedora's security layer, as well as providing current best-practice features such as faceted search results. To this base stack a number of additional applications have been added or integrated, including the Tesseract and ABBYY OCR engines and the Djatoka JPEG 2000 Image Server. Overall, Islandora integrates a number of robust open-source software projects.

6 Additional software can be integrated into an Islandora site via solution packs, which provide features such as content models (which prescribe the nature of a piece of content and its relationship to other pieces of content), ingest workflow and file processing (via modular microservices), metadata management, search profiles, and custom viewers. Other solution packs include utility modules that provide tools for repository managers and other administrative staff: for example, Islandora's batch ingest or workflow solution packs. Islandora also supports discovery strategies other than searching and browsing. One example is a visual chemical search that allows the user to draw a molecule (or part of one), with the system returning the molecule expressed as a text-based formula, a 2-D image, or a rotatable 3-D object.

3. Modeling Book-Style Content in Islandora

7 To further illustrate the Islandora framework, particularly in the context of digital humanities research, and to introduce the integration of TEI tools into Islandora for IslandLives, the following section discusses the way that content is understood, stored, searched, and displayed in Islandora. Each data asset in Fedora is referred to as an object, and contains two or more datastreams (or files) that are relevant to an object. A "content model object" defines the nature of a data asset object and its relationship to other data asset objects. The resulting atomic approach (which heavily leverages XML and XSLT) allows data structures to be more transparent, reusable, and transferable to other application environments, which is why this model is considered appropriate for the long-term stewardship of digital data.

8 Let us illustrate using the example of a book as it is modeled by the "Book Solution Pack"—not because TEI is only useful when describing a book, but because the book model illustrates a more complex data modeling procedure. A book encountered on a shelf is experienced as a single, self-contained object; however, the same book, when digitized and placed in the Islandora repository, is modeled as numerous objects. The "book object" is defined by the "content model object," and it contains descriptive metadata (the bibliographic record), a thumbnail image considered representative of the book, and RDF statements linking the book to a particular digital collection. The bulk of the book's content—such as OCR'd text, TEI-encoded text, image/text annotations, and high-resolution images of the page—is stored with page objects. These page objects are related to the book object via RDF data, which declare the sequence of the page in the book. This relationship is, again, specified by the content model object for a book and for a page. Pages might belong to multiple books, and books to multiple collections. These RDF statements are written to a datastream known in Fedora as


RELS-EXT (represented in fig. 2), which illustrates the relationships written between discrete repository objects to represent the concept of a book. Additional information can be found in the Islandora documentation (Wilcox 2012).

Figure 2. All digital objects in Islandora contain multiple datastreams defined by content models. One of these datastreams (RELS-EXT) stores RDF statements that describe the relationship between a given object and others in the repository. This object-oriented data modeling is flexible, non-hierarchical, and lends itself to atomistic (or highly granular) data models, and, as a result, to complex ontologies.
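To make the relationships just described concrete, the following is a minimal sketch of the RELS-EXT datastream a page object might carry. The object identifiers are invented, and the predicate names and Islandora ontology namespace are assumptions for illustration rather than a verbatim copy of the project's data.

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:islandora="http://islandora.ca/ontology/relsext#"
         xmlns:fedora-model="info:fedora/fedora-system:def/model#">
  <!-- hypothetical page object "islandlives:book1-page3" -->
  <rdf:Description rdf:about="info:fedora/islandlives:book1-page3">
    <!-- the page belongs to a particular book object... -->
    <islandora:isPageOf rdf:resource="info:fedora/islandlives:book1"/>
    <!-- ...at a given position in the page sequence -->
    <islandora:isSequenceNumber>3</islandora:isSequenceNumber>
    <!-- the page also asserts which content model it adheres to -->
    <fedora-model:hasModel rdf:resource="info:fedora/islandora:pageCModel"/>
  </rdf:Description>
</rdf:RDF>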

4. Modeling and Storing TEI in Islandora

9 Drupal's book module can be modified to store TEI data. Defining a TEI datastream within an Islandora content model object (which defines the existence of TEI in the data asset objects that comprise the digital book) is a fairly straightforward procedure: the content model object is modified to define a TEI datastream, which can then be created, read, updated, and deleted by any user with appropriate permissions. From the perspective of data modeling, the challenge is to reconcile the fragmentation of the book into separate pages with the common need to produce and edit single TEI documents which represent an entire book. In other words, if a page's OCR'd text (and TEI-encoded text) are stored in fragments, how can users access a single continuous TEI file, or upload a continuous TEI file that corresponds to multiple pages?
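As a rough illustration of what "defining a TEI datastream" in a content model can amount to, the fragment below sketches a DS-COMPOSITE-MODEL entry declaring an XML-valued datastream with the ID TEI. It follows Fedora 3 conventions but is a hedged example, not an excerpt from the IslandLives content model.

<dsCompositeModel xmlns="info:fedora/fedora-system:def/dsCompositeModel#">
  <!-- declares that member objects may carry a datastream with the ID "TEI" -->
  <dsTypeModel ID="TEI">
    <form MIME="application/xml"/>
  </dsTypeModel>
</dsCompositeModel>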

10 The IslandLives project chose to resolve this challenge by duplicating TEI datastreams at the page and book levels. On the one hand, this was a pragmatic decision having to do with system resources—presenting a 500-page TEI document for editing and saving is an expensive computing procedure—as well as with the problem of how to present relevant page-object images alongside the corresponding TEI data for OCR and for encoding correction. On the other hand, this was also a data modeling decision. The Islandora model is designed to keep pages as discrete packages of data (to "atomize" content), and there was the desire to maintain this approach. Apache Solr and the Islandora Solr client allow for selective indexing of datastreams and elements, so there was little concern with contaminating the search index with duplicate data or providing confusing results for users.


11 In order to provide an overview of the data model in IslandLives, figures 3 and 4 illustrate both the book-level ("Annotated TEI") and page-level ("TEI Page Fragment") datastreams. What these screenshots do not reflect is that the TEI header defined in the book object was duplicated into the page-level TEI fragments.

Figure 3. Datastreams at the book level: A TEI datastream representing all book pages is stored at the book level in the IslandLives project.

Figure 4. Datastreams at the page level: TEI page fragments are stored at the page level. This stream is used for editing, display, and page-level search resolution.

12 Both the image viewer and the TEI editor are associated with content at the page level, and the book-level TEI document was generated once by a script in order to support those who wish to download or view the complete TEI stream or for use with further transformations like creating an EPUB file from the TEI. This data model poses obvious synchronization problems since any additional work at the page level will not be reflected in the book-level TEI document. In addition, this model does not address the common use case where a project has produced a number of page-level image files and a single continuous TEI document. In the IslandLives project, neither of these issues is particularly pressing. However, both are roadblocks to generalizing for multiple projects.

13 A potential solution to address synchronization between page- and book-level TEI is to regenerate the book-level TEI document whenever page-level TEI files are edited and saved. A script, to be run by an administrator or run in a nightly cron job, would concatenate all the pages related to a book into a consolidated TEI document and replace the TEI document stored at the top level of the book. Having a corrected book-level TEI document would add a great deal of value, allowing the project to expose its


book-level TEI documents. In this case, a page's OCR'd text (and TEI-encoded text) would be stored in fragments, and a user would be able to access a single continuous TEI file, created on demand. However, this is not how IslandLives currently works.

14 The second issue is one of ingest. Uploading a continuous TEI file that corresponds to multiple pages at the time of book object creation remains a problem, since a book comprises multiple page-level objects. However, it is common to encounter projects that have produced multiple page-level images and a single, continuous TEI document. A script could be created to divide such a TEI document for a single repository, but a generic tool that allows TEI documents to be uploaded and aligned with digitized assets may run into trouble if different objects take different approaches to encoding page breaks.

15 A potential solution to the challenge of addressing page breaks is to use the sequencing information stored with page objects and, on upload, dynamically update the facs attribute of the pb elements to associate them with the appropriate image URLs in the repository. This approach would assume that a TEI digital object (adhering to a TEI content model in Islandora) is associated with an image digital object (adhering to Islandora's image content model). The resulting fragments would have to be adequately handled by the concatenation script so that the complete TEI document would be reusable outside the repository—another common requirement for TEI systems. The IslandLives approach evolved to address a specific set of requirements, and this evolution is described in the following section.
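A concatenation script of the kind imagined in paragraph 13 could be written in many ways. The XSLT 2.0 sketch below assumes that the page-level fragments have been exported to a directory as individual TEI P5 files whose names encode the page sequence, and that the stylesheet is invoked by its named template (for example with Saxon's -it:main). It illustrates the approach only; it is not the IslandLives code.

<!-- consolidate-pages.xsl: a hypothetical sketch, not the IslandLives script -->
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0"
    xmlns="http://www.tei-c.org/ns/1.0"
    exclude-result-prefixes="tei">

  <!-- page-level TEI fragments exported to a directory; file names are assumed to encode the sequence -->
  <xsl:param name="pages" select="collection('pages/?select=*.xml')"/>

  <xsl:template name="main">
    <TEI>
      <!-- the header is duplicated into every page fragment, so reuse the first one -->
      <xsl:copy-of select="$pages[1]/tei:TEI/tei:teiHeader"/>
      <text>
        <body>
          <xsl:for-each select="$pages">
            <xsl:sort select="document-uri(.)"/>
            <!-- copy each page's content in reading order -->
            <xsl:copy-of select="tei:TEI/tei:text/tei:body/node()"/>
          </xsl:for-each>
        </body>
      </text>
    </TEI>
  </xsl:template>
</xsl:stylesheet>

Run against exported fragments, the consolidated result could then replace the TEI datastream stored at the top level of the book object, as the paragraph above envisages.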

5. The IslandLives Project: Automating TEI Encoding

16 The IslandLives project began in 2008, with the Robertson Library collecting permissions to digitize and digitally publish a number of the Prince Edward Island local histories in its collection, forming the core of the IslandLives collection. IslandLives is one of many projects illustrating the Robertson Library's commitment to preserving and sharing unique material relating to Prince Edward Island with students, educators, researchers, and others interested in Island culture and heritage.

17 In IslandLives, the TEI encoding was not created by hand but through an automated series of processes developed by the IslandLives team. The books were scanned and then run through ABBYY FineReader OCR software. ABBYY can produce a number of output formats; ABBYY FineReader XML, the output of the IslandLives digitization process, includes both the OCR text and structural information about the document, including the location of images, lines, and the sizes of pages and paragraphs. For long-term preservation, ABBYY's proprietary format had to be transformed into a recognizable, open standard, and TEI was the clear choice.

18 The IslandLives team used an XSLT stylesheet to transform the ABBYY FineReader XML tags into simple TEI structural tags. Most of the OCR coordinates in the ABBYY XML are lost in the transformation to TEI, so the current IslandLives code, and earlier versions of the book solution pack in Islandora 6, are unable to support highlighting of hits in page images. Current versions of the book solution pack, which use Tesseract, preserve OCR coordinates as part of the hOCR (Breuel 2010) output. Development of an XSLT stylesheet to transform hOCR to TEI is currently underway. One challenge facing the project is how to edit and update text while preserving page coordinates: currently,


the OCR coordinates cannot be corrected in Islandora. A small sample of the ABBYY FineReader XML input, and the resulting TEI output, is provided in example 1.

ABBYY XML (before)

Setting my pen to write "Pioneer Days & Shanty Ways" gave me a special reason for choosing the title for the book. Having a pioneer fisherman father, who told us many stories of the sea, I felt would be an interesting place to start. Having a farm girl as a mother, whose stories were passed down from her grandfather's knee, was the opportunity I needed to fill the book with many treasured memories of family life back then.

Final TEI (after)


Setting my pen to write "Pioneer Days & Shanty Ways" gave me a special reason for choosing the title for the book. Having a pioneer fisherman father, who told us many stories of the sea, I felt would be an interesting place to start. Having a farm girl as a mother, whose stories were passed down from her grandfather's knee, was the opportunity I needed to fill the book with many treasured memories of family life back then.

Example 1. During the automated IslandLives process, ABBYY XML (above) becomes simple TEI encoding (below).
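The markup of example 1 does not survive in this transcription, so the following can only suggest the shape of the "simple TEI structural tags" described above: a pb milestone pointing at the page image and a p element around the paragraph. The element choices and the facs value are assumptions, not the project's actual output.

<!-- Hypothetical rendering only; not taken from the IslandLives stylesheets -->
<pb n="7" facs="islandlives:book1-page7"/>
<p>Setting my pen to write "Pioneer Days &amp; Shanty Ways" gave me a special reason for
   choosing the title for the book. Having a pioneer fisherman father, who told us many
   stories of the sea, I felt would be an interesting place to start. [...]</p>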

19 Beyond just OCR'ing of the page images, the IslandLives team decided to attempt an automated semantic encoding of place names, personal names, dates, and organizations mentioned in the texts to allow for richer connections to be made across the multiple histories. Tools from the General Architecture for Text Engineering (GATE) project (Cunningham et al. 2002) were used with specially created controlled vocabularies of names for this process. The sources of information for these gazetteers varied: for example, given and family names were captured from the Prince Edward Island Baptismal Index,3 organizational names were derived from corporate subject headings in the library's catalog, and place names came from the Canadian Geographical Names Data Base.4 Place names are stored in the project's controlled vocabulary alongside their CGNDB database keys, allowing for data associated with the place name (such as the latitude and longitude coordinates) to be retrieved dynamically from the external data source. With sources for semantic and structural information in place, the only remaining piece was the TEI header. All of the books in IslandLives exist in the Library's online catalog, which allows for the export of bibliographic records in MARCXML format. Team members created a crosswalk from MARCXML fields to TEI header elements in order to create headers.
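In outline, such a crosswalk might look like the following XSLT sketch. The MARC field and subfield choices and the target TEI elements are illustrative assumptions; the project's actual MARC2teiHeader stylesheet (see section 6) is not reproduced here.

<!-- Hypothetical MARCXML-to-teiHeader crosswalk; not the IslandLives MARC2teiHeader_UPEIv2.xsl -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:marc="http://www.loc.gov/MARC21/slim"
    xmlns="http://www.tei-c.org/ns/1.0">
  <xsl:template match="/">
    <!-- handle either a bare record or a record inside a collection wrapper -->
    <xsl:apply-templates select="(//marc:record)[1]"/>
  </xsl:template>
  <xsl:template match="marc:record">
    <teiHeader>
      <fileDesc>
        <titleStmt>
          <!-- 245 $a: title proper; 100 $a: main entry, personal name -->
          <title><xsl:value-of select="marc:datafield[@tag='245']/marc:subfield[@code='a']"/></title>
          <author><xsl:value-of select="marc:datafield[@tag='100']/marc:subfield[@code='a']"/></author>
        </titleStmt>
        <publicationStmt>
          <!-- 260 $b and $c: publisher and date of publication -->
          <publisher><xsl:value-of select="marc:datafield[@tag='260']/marc:subfield[@code='b']"/></publisher>
          <date><xsl:value-of select="marc:datafield[@tag='260']/marc:subfield[@code='c']"/></date>
        </publicationStmt>
        <sourceDesc>
          <p>TEI header generated from a MARCXML record exported from the library catalogue.</p>
        </sourceDesc>
      </fileDesc>
    </teiHeader>
  </xsl:template>
</xsl:stylesheet>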

6. The Encoding Tool Chain

20 In April 2009, a member of the IslandLives team wrote scripts that automated the processes described above, converting ABBYY XML to valid TEI XML and generating pages as HTML. During this conversion process, semantic tags are turned into links that, when clicked, launch a relevant repository search (via a Fedora disseminator) of all works in the collection that contain the same tag (see fig. 5).


Figure 5. Site visitor view of TEI: When a datastream is viewed by a user, the encoding appears as HTML links that launch Solr searches of other works in the collection containing the same TEI tag.

21 The automation that underpins this display takes the following steps:
1. A script retrieves the appropriate MARCXML record from the library catalogue, and the output is transformed into a TEI header using an XSLT stylesheet (MARC2teiHeader_UPEIv2.xsl) and stored in a temporary file for use during the process.
2. A source ABBYY XML-encoded document is transformed using an XSLT stylesheet (ABBYY-to-TEIv2.xsl) into a TEI document (structural TEI only), and the header is integrated into the resulting TEI file.
3. The TEI file is then run through GATE, which encodes people, organizations, dates, and places using its own specific encoding format; for example, the place name "Rocky Point" is wrapped in GATE's own tags.
4. The resulting XML document, with a mix of TEI and GATE tagging, is run through another XSLT stylesheet (GATE-to-TEIv2.xsl) that turns GATE tags into appropriate TEI tags, so that the same fragment becomes a TEI-encoded "Rocky Point".
5. As a final step, an XSLT stylesheet (TEI-Split-Pages.xsl) is used to parse the TEI document and extract individual pages, each of which includes the header of the parent document. Each of these page-level TEI fragments is stored as a datastream with its related page object.
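Because the inline markup in steps 3 and 4 does not survive in this transcription, the fragments below only suggest the kind of before-and-after that the GATE-to-TEIv2.xsl step produces. Both the GATE-style element name and the TEI attribute are assumptions; the project stores CGNDB keys alongside place names, so a key of some form is plausible, but the exact encoding is not reproduced here.

<!-- Hypothetical only: element and attribute names are not taken from the IslandLives tool chain -->
<!-- after GATE named-entity tagging -->
<Location>Rocky Point</Location>
<!-- after GATE-to-TEIv2.xsl -->
<placeName key="CGNDB:example-key">Rocky Point</placeName>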

22 Three additional files are created during the process to assist with proofing:
1. A KML file that provides a map of all the place names and a snippet of related text mentioned in a work (created by TEI-to-KML.xsl).
2. An extract of all the terms encoded with the persName element (created by ExtractPersname.xsl).
3. An HTML representation of the work (created by TEI-to-HTML.xsl).


23 The XSLT stylesheets referenced in the preceding steps are available from the Robertson Library's GitHub repository.5 A high-level view of the transformation process is offered in figure 6.

Figure 6. The IslandLives automated workflow takes the output of digitization and produces Islandora objects, including valid TEI data about the contents of book pages.

7. Automation Results and the TEI Editor

24 The IslandLives project team found that the structural TEI markup created by the automation process was much more accurate than the semantic markup. The automated semantic markup was more reliable when tagging place names than personal or organization names. While a certain amount of tuning may produce acceptable results for a semantically similar data set, the efficiency of automated semantic markup is questionable. Where encoding is focused on a very specific goal (such as tagging place names) and on semantically similar data, there is some success; however, the more complex the scholarly work and the desired semantic encoding, the less likely it is that an automated approach of the kind employed by IslandLives would be satisfying.

25 In the IslandLives project, the errors in the automatically encoded XML needed to be corrected. To address this requirement, a group of fourth-year computer science students at the University of Prince Edward Island was approached to create an XML editor for the inserted semantic tags, which would provide administrative users with an interface to view the TEI encoding, add and modify tags, validate the document against a custom TEI schema, and update the relevant datastreams in the underlying repository. There was also a requirement that the editor work alongside Islandora's original book viewer. After the students' initial work was complete, a developer was contracted to finish the editor. The editor is still in use for the IslandLives project.

26 To access the TEI editor, a user logs into the site as an administrator and navigates to the relevant book record (see fig. 7). The "Edit" link provides access to the TEI editor.


Figure 7. Administrator's view of a book record.

27 When administrative users select the "Edit" link, they are presented with a viewer for book pages, with the OCR text on the right-hand side. Semantic encoding is presented using a light-colored background (see fig. 8), the colors of which are configurable.

Figure 8. Beside each page, the editor presents the text of the page, with the TEI tags produced through automation highlighted.

28 The discovery and editing of an erroneous tag is depicted in figure 9.


Figure 9. A user discovers and edits a tag that has been improperly encoded in the tool chain.

29 For users needing to modify attributes and their values, an attribute editor is provided (depicted in fig. 10). For more advanced users, the TEI encoding can be edited manually (depicted in fig. 11).

Figure 10. Additional flexibility is also provided via an attribute editor, which allows each tag to be enriched by an administrative user.


Figure 11. Users familiar with TEI can also view and edit the encoding manually. TEI is validated before it is saved.

30 The TEI editor is based on JavaScript and was originally inspired by Drupal's integration of CKEditor. When the Islandora TEI Editor code (which resides in a Drupal module) is installed, it defines access rules for the editor and provides information about how a page is rendered in the editor. The code retrieves a TEI document from the repository (specified by the page's persistent identifier), permits editing, and validates the edited TEI XML against the custom TEI schema before saving the document back to Fedora. The validation occurs within the TEI Editor module against a RELAX NG file that was generated using Roma (a tool for working with TEI customizations) and stored locally as part of the module's files. A document that does not pass validation produces a dialog box to inform the user that their edits will not be saved to Fedora.
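For readers unfamiliar with the format being validated against, the fragment below sketches what a Roma-generated RELAX NG (XML syntax) definition for a single element can look like; it is a generic illustration, not an excerpt from the IslandLives custom schema.

<!-- Hypothetical RELAX NG fragment; the actual schema is generated with Roma and stored with the module -->
<define name="tei_persName" xmlns="http://relaxng.org/ns/structure/1.0">
  <element name="persName" ns="http://www.tei-c.org/ns/1.0">
    <!-- an optional pointer attribute, typical for linking entities to authority records -->
    <optional>
      <attribute name="key"/>
    </optional>
    <text/>
  </element>
</define>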

31 In this way, the editor is designed to combine WYSIWYG elements with a few editing options for more advanced TEI users who may prefer to tag more directly. The WYSIWYG code was designed to be extended: the TEI schema is defined (in JSON) separately from the code for the editor, allowing new TEI elements and attributes to be added easily (see example 2).

// Format of TEI elements that are defined in the TEI Editor
this.persname = {
    name: 'persName',
    bgColor: '#d5e5f1',
    attributes: [att.global, att.datable, att.editLike, att.personal, att.typed],
};

Example 2. Format of TEI elements defined in the TEI editor.


32 If tags and elements were to be added to this JSON file and the associated CSS file, they would appear immediately in the XML editing tools and be functional. The user would also need to ensure that the element was defined and able to be validated within their custom TEI schema.

33 The code has not been developed beyond what was required for the project, although messages posted to the Islandora list in April 2012 suggested that the community may be interested in reviving the code base and bringing it into line with current versions of Islandora. The code is available at Islandora's GitHub repository.6 At the same time as the community is showing a renewed interest in this project, other projects using Islandora are emerging to address more complex scholarly requirements for TEI.

8. Current TEI-related Initiatives in Islandora

34 Editing Modernism in Canada (EMiC) is a $3.8 million project funded by a SSHRC Strategic Knowledge Cluster / Strategic Network Grant, and is designed to run through 2015 (Pelham 2010). The project aims to create, disseminate, and foster public discourse about the literature of Canadian modernism (early to mid-twentieth century), develop a suite of tools for the production of digital texts, and provide training to, and develop relationships among, digital humanists. As part of the mandate to forge partnerships, EMiC has also partnered with the Canadian Writing and Research Collaboratory (CWRC), the Modernist Versions Project, and Mukurtu. As such, the project represents a number of key forces affecting tool and project development in the digital humanities. Based largely on an exploration of the IslandLives project, EMiC contracted with discoverygarden inc. to develop key tools for use in Islandora, including a web-based TEI editing interface that would enable users to create, edit, and validate TEI documents against a RELAX NG schema, and provide some WYSIWYG or even WYSIWYM ("what you see is what you mean") functions in addition to straight editing of raw XML. The resulting TEI will be stored, versioned, and managed from the underlying Fedora Commons repository.

35 The project is currently integrating successive versions of the CWRC-Writer tool to meet EMiC's TEI requirements. CWRC-Writer is being actively developed by the Canadian Writing and Research Collaboratory (CWRC). The tool, currently in version 0.3, extends the JavaScript-based TinyMCE editor using jQuery. Broadly speaking, the goal of CWRC-Writer's development is to facilitate the production of enriched documents by scholars with limited knowledge of RDF and XML markup. At the same time, the application is designed to be useful to those who prefer to work in code. Committed to open standards and information sharing, the project's goals include:
• Close-to-WYSIWYG editing and enrichment of scholarly texts with meaningful visual representations of markup
• Ability to add named entity annotations to texts
• Ability to combine TEI markup for the text and stand-off RDF for named entities
• Ability to export using "weavers" that recombine the plain text, the TEI, and the RDF into different forms (including an embedded TEI-compliant XML)
• Documented code to allow editorial projects to incorporate CWRC-Writer into their environments. (Rockwell et al. 2012)


36 Additional information about CWRC-Writer, including information on how to test-drive a current demo, is available at the project website.7 Experimental code for integrating CWRC-Writer into Islandora is available on GitHub but is changing rapidly, as CWRC-Writer itself is incomplete.

37 Islandora's integration with the SharedCanvas (Sanderson et al. 2011) image annotation tool is more complete, and will be released as a separate "image annotation" solution pack for Islandora 7 in 2013. Beta versions of the module are available in the Islandora repository. SharedCanvas is being developed in a partnership that includes Los Alamos National Laboratory, the Open Annotation Collaboration, and the Mellon Foundation. The SharedCanvas application facilitates complex, layered image annotation and complies with the Open Annotation Collaboration's standard for the production of open annotations (Sanderson and Van de Sompel 2011).

38 Since the IslandLives project began, a number of web-based XML editing tools have emerged that may facilitate the development of a completely integrated web-based XML editor. A December/January thread on TEI-L under the subject "web-based XML editors" included over a dozen suggestions for possible open-source tools that would allow users to validate against a RELAX NG schema and provide some WYSIWYG or WYSIWYM functions in addition to simply editing raw XML. As the IslandLives project continues to evolve, and new tools are built by EMiC, CWRC, and others, the Islandora project and UPEI hope that community input will improve the quality of generic tools available in Islandora for TEI scholars. That said, the Islandora framework is designed so that most web-based applications can be integrated, and users can develop the custom workflows and tools required to support specific TEI projects.

BIBLIOGRAPHY

Breuel, Thomas, ed. 2010. "The hOCR Embedded OCR Workflow and Output Format." Last modified March 2010. https://docs.google.com/document/preview?id=1QQnIQtvdAC_8n92-LhwPcjtAUFwBlzE8EWnKAxlgVf0.

Cunningham, Hamish, Diane Maynard, Kalina Bontcheva, and Valentin Tablan. 2002. "GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications." Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02). Philadelphia, July 2002. http://eprints.aktors.org/90/.

Pape, Sebastian, Christof Schöch, and Lutz Wegner. 2012. "TEICHI and the Tools Paradox: Developing a Publishing Framework for Digital Editions." Journal of the Text Encoding Initiative 2 (Feb.). http://jtei.revues.org/432.

Pelham, Amanda. 2010. "Speaking Volumes." Dal News [Dalhousie University], October 29. http://www.dal.ca/news/2010/10/29/volumes.html.

Rockwell, Geoffrey, Susan Brown, James Chartrand, and Susan Hesemeier. 2012. "CWRC-Writer: An In-Browser XML Editor." Poster presented at Digital Humanities 2012 (Hamburg, Germany, July 16–20). http://www.dh2012.uni-hamburg.de/conference/programme/abstracts/cwrc-writer-an-in-browser-xml-editor/.

Sanderson, Robert, Benjamin Albritton, Rafael Schwemmer, and Herbert Van de Sompel. 2011. "SharedCanvas: A Collaborative Model for Medieval Manuscript Layout Dissemination." Paper presented at 11th ACM/IEEE Joint Conference on Digital Libraries (Ottawa, Ontario, June 13–17, 2011). http://www.arxiv.org/abs/1104.2925.

Sanderson, Robert, and Herbert Van de Sompel, eds. 2011. "Open Annotation: Beta Data Model Guide." Last modified August 10, 2012. http://www.openannotation.org/spec/beta/.

Wilcox, David. 2012. "Creating Custom Content Models." Last modified June 25, 2012. https://wiki.duraspace.org/display/ISLANDORA6122/Creating+Custom+Content+Models.

Resources (Sites Mentioned)

ABBYY Products & Services: http://finereader.ABBYY.com/

Apache Solr project: http://lucene.apache.org/solr/

Canadian Writing and Research Collaboratory: http://www.cwrc.ca/en/

Drupal: http://drupal.org/

Editing Modernism in Canada Project: http://editingmodernism.ca/

Fedora Commons: http://fedora-commons.org/

Islandora: http://www.islandora.ca/

IslandLives: http://www.islandlives.ca/

Modernist Commons (The EMiC project): http://modernistcommons.ca/

TEI-L archive: http://listserv.brown.edu/tei-l.html

NOTES

1. "Saint Patrick's Confessio Hypertext Stack Project," accessed December 12, 2012, http://www.confessio.ie/. 2. Quinn Dombrowski and Elijah Meeks, Drupal for Humanists, http:// drupal.forhumanists.org/. 3. Prince Edward Island Baptismal Index, Public Archives and Records Office, accessed December 12, 2012, http://www.gov.pe.ca/archives/baptismal/. 4. "Geographical Names of Canada," Natural Resources Canada, last modified March 1, 2011, http://www.nrcan.gc.ca/earth-sciences/products-services/mapping-product/ topographic-maps/5776. 5. https://github.com/roblib/TEI 6. https://github.com/Islandora/islandora_tei_editor 7. https://sites.google.com/site/cwrcwriterhelp/


ABSTRACTS

Islandora is an open-source software framework developed since 2006 by the University of Prince Edward Island's Robertson Library. The Islandora framework is designed to ease the management of security and workflow for digital assets, and to help implementers create custom interfaces for display, search, and discovery. Turnkey options are provided via tools and modules ("solution packs") designed to support the work of a particular knowledge domain (such as chemistry), a particular content type (such as a digitized newspaper), or a particular task (such as TEI encoding). While it does not yet have native support for TEI, Islandora provides a promising basis on which digital humanities scholars could manage the creation, editing, validation, display, and comparison of TEI-encoded text. UPEI's IslandLives project, with its forthcoming solution pack, provides insight into how an Islandora version 6 installation can support OCR text extraction, automatic structural/semantic encoding of text, and web-based TEI editing and display functions for site administrators. This article introduces the Islandora framework and its suitability for TEI, describes the IslandLives approach in detail, and briefly discusses recent work and future directions for TEI work in Islandora. The authors hope that interested readers may help contribute to the expansion of TEI-related services and features available to be used with Islandora.

INDEX

Keywords: Islandora, annotation, RDF, CWRCWriter, SharedCanvas, Drupal, EMiC

AUTHORS

KIRSTA STAPELFELDT

Kirsta Stapelfeldt, MA, MLIS, is the manager of the Islandora project at the University of Prince Edward Island's Robertson Library, and has served as a subject matter expert for discoverygarden inc. on Editing Modernism in Canada and other projects.

DONALD MOSES

Donald Moses, MLIS, is the digital initiatives and systems librarian at UPEI's Robertson Library, and manages the IslandLives project as part of his role as manager of the Island Archives project, an umbrella project containing numerous digital history initiatives undertaken at UPEI.


TEI and Project Bamboo

Quinn Dombrowski and Seth Denbo

AUTHOR'S NOTE

In December 2012, two months after this article was submitted to the Journal of the Text Encoding Initiative, the Mellon Foundation requested that Project Bamboo draw to a close rather than embark on a second phase of technical development. At the point of its conclusion, Bamboo's infrastructure was not in use by any community of scholars, leaving unanswered questions about the consequences of not specifically leveraging TEI markup in the Bamboo Book Model. The "legacy" of Project Bamboo, as captured through code, documentation, and various notes and documents, will persist via links on the projectbamboo.org website, with the goal of informing future cyberinfrastructure efforts.

1. Introduction

1 Project Bamboo, a cyberinfrastructure initiative supported by the Andrew W. Mellon Foundation, takes as its core mission the enhancement of arts and humanities research through the development of shared technology services. Rather than developing new tools for curating or analyzing data, Project Bamboo aims to provide core infrastructure services including identity and access management, collection interoperability, and scholarly data management. The longer-term goal is for many organizations and projects to leverage those services so as to direct their own resources towards innovative tool or collection development. In addition, Bamboo seeks to model a paradigm for tool integration that focuses on tools as discrete services (such as a morphology annotation service and a geoparser service, instead of a web-based environment that does morphological annotation and geoparsing) that can be applied to texts, individually or in combination with other services, to enable complex curatorial and analytical workflows. This paper covers points of intersection between Project Bamboo and TEI over the course of Bamboo's development, including the


participation of the TEI community in the Bamboo Planning Project, a demonstrator project that aimed to develop services that could process TEI texts from a variety of sources while producing a uniform output, and how TEI-encoded collections have influenced Bamboo's work on collection interoperability and the analysis of large corpora.

2 During its planning phase, Project Bamboo brought together over 600 scholars, librarians, and IT professionals from 115 institutions, through a series of workshops that focused on identifying common and unique scholarly practices in the humanities; the technological and social challenges that impede the uptake of digital research methodologies; and how a multi-institutional, interdisciplinary, and interorganizational community and cyberinfrastructure development project can break down those barriers. Throughout Bamboo's "community design" process, numerous scholar and librarian participants emphasized the importance of facilitating the interchange of TEI-encoded documents, but also the inherent challenges that have become evident through previous attempts at aggregating and processing documents from different sources. These opportunities and challenges have influenced decisions ranging from Bamboo's approach to collection interoperability to the choice of tools to partner with when determining the essential components of a tool set for corpus analysis.

2. Bamboo Planning Project

3 The Bamboo Planning Project was centered on a series of eight workshops held over the course of two years (2008–2010). During this period, workshop participants had the opportunity to propose "demonstrator projects"—scholar/technologist partnerships that would test some of the project's assumptions about the feasibility of different approaches to tool and collection interoperability; how to design, implement, and scale scholarly services; and where scholars' technical roadblocks actually lie.

2.1 Workshops

4 The participants in the Bamboo Planning Project workshops shared a common desire for improving the recognition, support, and opportunities for using digital tools and methodologies in humanities research and pedagogy, but represented a wide range of disciplines and technical backgrounds. While some of the faculty participants had only recently been introduced to TEI, numerous active members of the TEI community were influential in the discussions about the role TEI could play in relation to Bamboo, not only as a set of encoding guidelines, but as a community that shares the same core values. As Martin Mueller noted: There is a global and tightly knit community there, and a question that arises in the encoding of a French medieval manuscript may find its answer in a practice developed in the encoding of Buddhist manuscripts in Kyoto. To my mind, the TEI community is a remarkable example of a scholarly group that has global breadth, temporal depth, and is united in a common purpose to use technology to help with philological problems of long standing. (Project Bamboo 2009a)


5 John Unsworth warned against the temptation in creating workflows for scholarly analysis of texts to rely on tools that disregard the metadata provided by TEI-encoded texts:
Do we need structural markup? Does it matter [for] paragraphs? Paragraphs, sentences, verses, lines, are meaningful units of composition -- meaningful units of analysis... This requires structural markup. [It is] also useful if you're going to ask statistical questions [to] differentiate between core intellectual content and … table of contents, running headers, index, etc. Those words could throw off your stat results in ways that would obscure information. What about [the] word level? [In] tag cloud visualizations of word frequency in a novel, [the] most frequent words are names of characters. So [one has to] ignore proper names to get at the next level of what's going on in the book, but to ignore them you have to identify them -- tag [them] as proper names. (Project Bamboo 2009b)

6 Project co-director David Greenbaum envisioned Bamboo running a platform that could provide large groups of scholars around the world with more reliable access to scholarly services that can perform particular kinds of analysis on TEI texts from diverse fields:
If you have TEI encoded set of scenarios, you want to take data and derive social relationships out of it. If there's other people who want to do that, it'd be great for you to run something as a service — send data in, get [a] visualization out. We want to run something like that as a service, but that needs to have reliability we don't have right now. [The Bamboo Shared Services Platform] allows us to build up that infrastructure —exposing services in a more robust way. (Project Bamboo 2009b)

7 The sustainability and scalability issues that would be addressed by a shared services platform do place limits on the potential impact of tools developed to process TEI texts. However, one Bamboo demonstrator project shed light on the difficulty of developing tools that would be sufficiently usable across a broad range of texts to even encounter problems with scalability.

2.2 TEI Bibliography Demonstrator

8 In fall 2008, a team at the University of Chicago including Bamboo program staff member Quinn Dombrowski undertook a demonstrator project proposed by Kent Hooper and Rick Peterson for developing an "XSLT web service engine to transform XML-marked up bibliographic entries into HTML."
This demonstrator project will be used to create a web service that scholars could use to easily input TEI-tagged bibliographical information which has been validated against the TEI data type description (DTD) and transform it using XSLT into a series of HTML (and, if possible, PDF and other transforms) files. These files would be the primary listing (alpha, by author) and sub-listings of the bibliographical data. . . . A sub-listing transform example would be: "Provide a list of all articles, alpha by title of periodical in which they appear, then by chronological order of publication." (Project Bamboo 2008)

9 In theory, the perceived simplicity and standardization of bibliographic markup practices in TEI—in contrast to the more diverse and complex patterns possible for marking up literary texts—would make this an attractive case study for demonstrating the value of shared scholarly services. The Bamboo program staff assumed that in three weeks, a team with Java programming, XSLT, and design skills could develop "an agnostic service implementation that might serve a very specific scholar's need without sacrificing either the agnostic or the specific."1 In the proposed timeline, one


week would be spent finding existing generic XSLT stylesheets for processing TEI bibliographies. The following two weeks would be dedicated to developing a service for processing XSLT, and to developing a web UI where scholars could either upload their TEI and XSLT stylesheets, or choose a stylesheet provided by the system (such as a generic stylesheet for processing bibliographies).

10 In practice, it took a year to meet the scholarly requirements of the project. The environmental scan failed to yield generic XSLT stylesheets for TEI bibliographies that were compatible with the specific markup conventions used in Hooper's document (for instance, the @type values used to mark pages, volume, and issue). Instead, Dombrowski developed a basic XSLT stylesheet that transformed Hooper's TEI into a recognizably formatted bibliography. This revealed markup inconsistencies within the TEI-encoded data (such as the ordering of authors' first and last names), which Hooper then revised. The revised TEI, in turn, brought to light legitimate variation in the data that the XSLT stylesheets did not correctly account for. In this way, both the data and the stylesheets underwent iterative development for a number of months.

11 During the summer of 2009, as Project Bamboo demonstrator project work was winding down, Hooper hired Jacob Jett, a graduate student at the University of Illinois School of Library and Information Science, to take the lead in further work. One of the most significant issues related to collation for alphabetical listings. The conventions used by scholars of German literature when alphabetically sorting names transliterated from a variety of languages do not align perfectly with any official standard that can be accessed as part of the XSLT processing. In an alphabetically organized bibliography, an item that is not sorted in accordance with the user's expectations may as well not exist, making near-perfect solutions unacceptable. After multiple attempts to solve the problem in the code layer of the project (as described in exchanges on the saxon-help email list),2 and because of constraints on time and funding, Hooper added an @xml:id attribute to each entry in the TEI, indicating the correct sort order. While this approach would be deeply problematic for an expanding bibliography, it was sufficient for making Hooper's bibliography accessible to scholars in its current state.

12 By 2010, the TEI bibliography demonstrator had successfully transformed Hooper's TEI bibliography into HTML web pages that his colleagues could easily consult online. The amount of custom XSLT work needed to achieve even a minimally useful output from Hooper's data made the program staff reconsider how easy it would be to develop agnostic service implementations that directly serve scholars' specific needs, at least in relation to processing TEI.
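The workaround described in paragraph 11 amounts to sorting on a key supplied in the data rather than on the rendered names. The XSLT sketch below illustrates that idea; the element names (listBibl, bibl), the P5 namespace, and the use of @xml:id as a sort key are assumptions for illustration and do not reproduce the demonstrator's stylesheets.

<!-- Illustrative sketch only; not the TEI bibliography demonstrator's actual XSLT -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0">
  <xsl:template match="tei:listBibl">
    <ul>
      <xsl:for-each select="tei:bibl">
        <!-- sort on an explicit key (e.g. xml:id values such as "entry-0042") supplied by the editor,
             since language-based collation did not match the field's sorting conventions -->
        <xsl:sort select="@xml:id" data-type="text" order="ascending"/>
        <li><xsl:apply-templates/></li>
      </xsl:for-each>
    </ul>
  </xsl:template>
</xsl:stylesheet>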

3. Bamboo Implementation Project, Phase 1

13 The Bamboo implementation project that followed the planning project focused on exposing collections of texts and services for managing and curating textual material within "Work Spaces" (research environments), while developing the identity and access management (IAM) infrastructure that could mediate access to restricted resources. The implementation project also involved a planning process that would lead to the development of a second application, "Corpora Space" (Project Bamboo 2011a). Corpora Space would enable professional humanists and citizen scholars


(including undergraduate students) to work on dispersed digital corpora using an integrated set of sophisticated curatorial, analytic, and visualization tools.

14 Bamboo's work on collection interoperability, with the goal of enabling collections from disparate sources to function similarly within a Work or Corpora Space, focused on defining standard methods for making digital content available to web services. This would be achieved by identifying existing protocols, practices, and ontologies, and extending them where needed (Project Bamboo 2010).

3.1 Collection Interoperability

15 For its first phase of technical development, structural and semantic level interoperability was considered out of scope, as explained in the project's proposal to the Mellon Foundation:
TEI, for example, addresses structure of textual documents in a manner that enables some algorithmic operation across textual collections or corpora. Yet TEI's diverse variants, which have in many cases evolved to address concerns of real import to adopting scholars, gives rise to much frustration among textual researchers in the arts and humanities—and more so when complicated by uneven mapping of semantic meaning to structural markup. Interoperability hampered by differences in point-of-view is even less tractable (is the Royal Pavillion in Brighton, England situated on Le Terrain Crétacé or in the County of Sussex?). (Project Bamboo 2010a)

16 Instead, the Bamboo collections interoperability work centered on providing Work Space and Corpora Space users with access to a predictable set of data, regardless of the source collection, as defined by the Bamboo Book Model (Project Bamboo 2011b). This model requires:
• [A] source repository id and an identifier for the items within that repository.
• Metadata to supply the required item-level fields: title, creator, and date.
• Full text of the book . . . TEI, HTML, or OCR text files. (Project Bamboo 2011b)
and optionally includes:
• Additional item-level metadata: publisher, publication date(s)
• An identifier for another book if this is a version. If this is a volume in a series, an identifier for that series.
• Structural information about the book, divisions (chapters, acts, etc)
• Scanned images of individual pages (Project Bamboo 2011b)

3.2 Corpora Space

17 The first phase of prototyping work for Bamboo's Corpora Space initiative aimed to make content from HathiTrust, the Text Creation Partnership (TCP)'s transcriptions of Early English Books Online (EEBO) and Eighteenth Century Collections Online (ECCO) texts, and the Perseus Digital Library available for curation (such as OCR correction and identifying different texts that are part of the same work) and analysis. In addition, Corpora Space would leverage the metadata provided through text curation processes (such as annotations to indicate linguistic and literary traits) for the purposes of making novel scholarly claims—both about the text itself, and about the scholarly workflows documented by the provenance metadata associated with any curatorial act.


18 As two of the three collections (TCP and Perseus) use TEI encoding for their content, compatibility with TEI texts was a key trait when selecting candidate tools. These included the Abbot processing scripts, which convert XML-like text collections into TEI-A;3 Juxta, which collates multiple witnesses to a single work and includes specific support for a subset of the TEI elements;4 and TypeWright,5 an OCR correction environment that produces lightly marked-up TEI (Project Bamboo 2011c).

19 The Corpora Space workshops, which brought together tool developers and representatives of the above-mentioned collections, included discussions of scholarly workflows that could make use of the TEI markup provided by TCP and Perseus. However, to accommodate time and resource constraints, the prototype development work removed the TEI markup provided by TCP and Perseus to bring those texts into alignment with HathiTrust content, rather than attempting to mark up the HathiTrust texts:
The functionality that we had decided to implement at the workshop was designed to operate on very simple representations of the texts from our three collections. Because these collections used two very different formats—TEI-A in the case of TCP and Perseus and a simple page-based plain text format for Hathi—we had a choice between attempting to add structure to the Hathi texts, in order to bring them closer to TCP and Perseus, or to remove structure from the TCP and Perseus texts. While preparing for the workshop we ran a series of experiments on the Hathi texts that suggested that the latter would be a more practical approach, and during the first sessions of the workshop we decided to use a JSON format that would include some basic metadata about each document as well as two simple representations of its content: a plain-text version for use in analysis, and an HTML version for display in the drill-down view. (Project Bamboo 2011d)

4. Future Directions

20 The planning and prototyping work for Corpora Space brought to light the need for scholarly data management services that can associate provenance metadata with every curatorial action that modifies a corpus, both automated (such as scripts to correct common OCR errors) and manual (such as manual corrections by student assistants, or annotations that form a scholarly argument as part of the development of a digital critical edition). These services make the history of a text more transparent to scholars who can then assess whether a text is in a "good enough" state to be used for meaningful analysis. Bamboo's IAM services will play a key supporting role in the development of these scholarly data management services, ensuring that individuals and tools alike receive proper attribution for their curation work.

21 In addition to extending its core infrastructure to support scholarly data management, Bamboo plans to build out its Collection Interoperability services to enable scholars to submit their corpus annotations back to the repository from which the text originated. At the discretion of the source repository, those annotations could be incorporated into the publicly available edition, increasing the quality and/or the analytical context of the digital texts available to all. The necessity of providing TEI-encoded data back to source repositories that use TEI encoding will provide Project Bamboo with new opportunities to grapple with the challenges of developing a general-purpose system for enriching TEI-encoded documents that minimizes conflicts with existing markup conventions within those documents.


5. Conclusions

22 Over the course of its development, Project Bamboo's work towards a vision of interoperability for the collections and tools that scholars in different disciplines value has provided many opportunities for testing the practical limits of different approaches to analyzing, augmenting, and processing TEI-encoded texts. As John Unsworth stated in regard to TEI, "[i]nterchange has been part of the goal from the beginning" (Project Bamboo 2009b), and Bamboo has served as a laboratory for grappling with issues that arise when attempting different levels of integration between TEI-encoded texts from different collections. The intellectual work that has gone into enriching texts using TEI in ways that open new avenues of scholarly inquiry is too significant to dismiss in the name of easier interoperability through plain text files. Project Bamboo looks forward to the continued participation of the TEI community in our efforts to define and implement approaches to tool and collection interoperability that are both scalable and capable of facilitating new forms of scholarly inquiry.

BIBLIOGRAPHY

Project Bamboo. 2008. "XSLT Original Proposal." Project Bamboo Wiki. Last modified November 10. http://quinndombrowski.com/projects/project-bamboo/wiki/xslt-original-proposal.

———. 2009a. "Analyzing Scholarly Narratives." Project Bamboo Wiki. Last modified March 27. http://quinndombrowski.com/projects/project-bamboo/wiki/analyzing-scholarly-narratives.

———. 2009b. "W3 - Perspectives." Project Bamboo Wiki. Last modified January 23. http://quinndombrowski.com/projects/project-bamboo/wiki/w3-perspectives.

———. 2009c. "W4 - Action Plans." Project Bamboo Wiki. Last modified April 20. http://quinndombrowski.com/projects/project-bamboo/wiki/w4-action-plans.

———. 2009d. "NYX." Project Bamboo Wiki. Last modified May 13. http://quinndombrowski.com/projects/project-bamboo/wiki/nyx.

———. 2010. "Collections Interoperability - BTP Proposal - 6 July 2010." Project Bamboo Wiki. Last modified July 14. http://quinndombrowski.com/projects/project-bamboo/wiki/collections-interoperability-btp-proposal-6-july-2010.

———. 2011a. "Corpora Space - BTP Proposal - 6 July 2010." Project Bamboo Wiki. Last modified March 1. http://quinndombrowski.com/projects/project-bamboo/wiki/corpora-space-btp-proposal-6-july-2010.

———. 2011b. "Book Model (Draft)." Project Bamboo Wiki. Last modified June 1. http://quinndombrowski.com/projects/project-bamboo/wiki/book-model-draft.

———. 2011c. "TypeWright." Project Bamboo Wiki. Last modified December 19. http://quinndombrowski.com/projects/project-bamboo/wiki/typewright.


———. 2011d. "Corpora Camp Lessons Learned Report." Last modified May 17. http://quinndombrowski.com/projects/project-bamboo/wiki/corpora-camp-lessons-learned-report.

NOTES

1. Kaylea Champion, e-mail message to Bamboo Program Staff mailing list, October 29, 2008. See also Project Bamboo 2009d for anticipated timeline.
2. Jacob Jett, "Problem with xsl:sort and German," email to saxon-[email protected] dated August 10, 2009, with replies dated August 10–17, 2009, http://old.nabble.com/Problem-with-xsl:sort-and-German-td24904567.html.
3. See http://monkproject.org/docs/abbot.html.
4. See http://www.juxtasoftware.org.
5. See http://www.18thconnect.org/typewright/documents.

ABSTRACTS

Project Bamboo, a cyberinfrastructure initiative supported by the Andrew W. Mellon Foundation, takes as its core mission the enhancement of arts and humanities research through the development of shared technology services. Rather than developing new tools for curating or analyzing data, Project Bamboo aims to provide core infrastructure services including identity and access management, collection interoperability, and scholarly data management. The longer-term goal is for many organizations and projects to leverage those services so as to direct their own resources towards innovative tool or collection development. In addition, Bamboo seeks to model a paradigm for tool integration that focuses on tools as discrete services (such as a morphology annotation service and a geoparser service, instead of a web-based environment that does morphological annotation and geoparsing) that can be applied to texts, individually or in combination with other services, to enable complex curatorial and analytical workflows. This paper addresses points of intersection between Project Bamboo and TEI over the course of Bamboo's development, including the role of TEI in Bamboo's ongoing development work. The paper highlights the significant contributions of the TEI community to the early development of the project through active participation in the Bamboo Planning Project. The paper also addresses the influence of TEI on the Bamboo Technology Project's collection interoperability and corpus curation/analysis initiatives, as well as its role in current (as of October 2012) development work.

INDEX

Keywords: cyberinfrastructure, digital infrastructures, Project Bamboo, interoperability, corpus annotation, scholarly editing, tools


AUTHORS

QUINN DOMBROWSKI

Quinn Dombrowski is the community liaison for Project Bamboo at UC Berkeley, and has been deeply involved with the project since 2008. She holds a BA/MA in Slavic Languages and Literatures from the University of Chicago and an MS in Library and Information Science from the University of Illinois. She has worked as the lead developer for numerous digital humanities projects, including DHCommons, Bamboo DiRT, and Bulgarian Dialectology as a Living Tradition.

SETH DENBO

Seth Denbo is the project coordinator for Project Bamboo at Maryland Institute for Technology in the Humanities. He holds a PhD in history from the University of Warwick in the United Kingdom and is a cultural historian of eighteenth-century England. He has worked on projects in digital history and been Research Associate at King's College London where he was involved in strategic planning for a major European digital research infrastructure. He is also a convenor of a seminar in digital history at the Institute for Historical Research in London.


TextGrid, TEXTvre, and DARIAH: Sustainability of Infrastructures for Textual Scholarship

Mark Hedges, Heike Neuroth, Kathleen M. Smith, Tobias Blanke, Laurent Romary, Marc Küster and Malcolm Illingworth

1. Introduction

1 In recent years, a variety of initiatives have been funded with the aim of producing software tools or environments of a type variously known as virtual research environments, research infrastructures, or cyberinfrastructures. These initiatives vary in their scale, specialization, scope, and level of funding. One issue that they face in common, however, is that of sustainability: how can the continued—and useful—existence of a system or tool be guaranteed, or at least facilitated, once a project's funding has been spent? In this paper, we examine how such sustainability has been enabled, in the particular case of infrastructures for textual scholarship, in the context of three international projects: TextGrid,1 TEXTvre,2 and DARIAH.3 Firstly, we will address the inter-project collaboration and cross-fertilization between TextGrid and TEXTvre, including architectural decisions and shared data infrastructures, and investigate how the projects benefited from the exchange. We will then discuss how this existing collaboration can be taken forward by the loosely-coupled and distributed framework being developed by the DARIAH community, and how it can serve as a model for the sort of collaborations that DARIAH plans to enable.

2 TextGrid is a longstanding German initiative to establish a virtual research environment for several text-based humanities disciplines. It has been running for many years and its current aims are to achieve a high level of sustainability of its work and to collaborate with international partners to enhance its offerings. The Digilib4 project is one example of a scholarly editing environment within TextGrid. Digilib is noteworthy for its strong focus on an existing community, and it has led to various attempts within other communities to adopt and replicate its methodologies and technologies.

3 One of TextGrid's international collaborations has been with the UK-based and JISC-funded TEXTvre project, which aimed to adopt and institutionalize TextGrid's solutions for the well-established community for digital textual scholarship in the Department of Digital Humanities at King's College London. To this end, TEXTvre spoke to King's researchers engaged in these activities, analyzed their responses, and enhanced the TextGrid services to suit their needs, for example by integrating better support for automated text analysis.

4 The Digital Research Infrastructure for the Arts and Humanities (DARIAH) is building a virtual bridge between different humanities and arts resources across Europe. Initially funded under the ESFRI programme,5 DARIAH combines various national infrastructures—from the Dutch national support infrastructures for e-humanities to the German e-humanities infrastructure TextGrid. It also aims to help other EU countries establish their own arts and humanities e-infrastructures, and to achieve new modes of collaboration between computing and humanities based on existing communities of practice such as the one that has emerged around TextGrid. However, there is no generic way to use a research infrastructure; it needs to be oriented and customized towards the needs of emerging communities. Services in such an infrastructure are useful only if they are built around such communities of practice.

5 This paper is structured as follows. In section 2 we describe the TextGrid and TEXTvre projects and examine how they relate to each other. Section 3 discusses their collaboration on tools and services, and section 4 their deeper collaboration on data infrastructure, before we examine in section 5 how these could be further supported within the context of DARIAH. Finally, in section 6 we address ongoing collaborations on sustainability and on tools and services.

2. Background: TextGrid and TEXTvre

6 The TextGrid research group, a consortium of ten research institutions in Germany, has developed a virtual research environment (VRE) for researchers in the arts and humanities. It provides services and tools for the digital analysis of text data and supports the curation of research data using grid technologies. Research partners—including libraries and data centers as well as universities and research institutions—are collaborating in a community-driven process that is funded by the German Ministry for Education and Research (BMBF). As part of the German Grid Initiative D-Grid,6 TextGrid maintains a common Grid Resource Centre in Göttingen (GoeGrid)7 together with grid projects in physics, medicine, astronomy, and climate research.

7 An essential aspect of the TextGrid infrastructure is that all its technical developments have been accompanied by the definition of precise recommendations, most of them based on the Text Encoding Initiative (TEI) Guidelines. These Guidelines, which for example establish precise constraints on stand-off annotation mechanisms or on the basic representation of dictionary entries, have contributed to the propagation of the idea of an infrastructure as comprising both technical and intellectual services. In particular, it provides a range of tools for supporting the production of digital editions based on TEI.


8 Currently, TextGrid consists not just of its original communities—textual philology and linguistics—but also many others, for example art history, classical philology, and musicology, all of which have extensive interests in textual editing and annotation. TextGrid was conceptualized as an infrastructure that enabled new communities, together with their tools, to be easily integrated and also to take part in the development process. TextGrid has tried to put these communities first, and thus has only gradually and carefully introduced changes to the technologies and methodologies that it uses in its VRE.

9 As detailed by Neuroth, Lohmeier, and Smith (2011), the TextGrid VRE consists of two main components: the TextGrid Lab(oratory), which serves as the entry point for users, and the TextGrid Rep(ository), which provides access to a shared data archive. The TextGridLab provides common functionalities in a sustainable environment in order to increase reuse of data, tools and services, and the TextGridRep enables researchers to publish and share their data in a way that supports long-term availability, interoperability, and reusability.

10 TEXTvre built on the success of TextGrid and worked towards an institutionally-focused VRE that addressed the requirements of a wider community of practice of textual scholars and could be integrated with institutional infrastructure. Its approach to this involved engagement with researchers based in and collaborating with staff in the Department of Digital Humanities at King's, exploring their research practices and integrating user tools to support and enhance these practices.

11 The Department of Digital Humanities (DDH; formerly known as the Centre for Computing in the Humanities) is an academic department in the School of Arts and Humanities of King's College London and the home of many multidisciplinary research projects in the digital humanities (including many text-based ones) carried out in collaboration with a range of humanities departments, both at King's and elsewhere. TEXTvre's engagement with humanities researchers was mediated through this department and focused on research projects in which DDH was a collaborator.

12 One of the main aims of the TEXTvre project was the analysis of existing research practices to identify processes that are typical of an institutional environment for digital humanities. It was expected that there would be significant differences at DDH in comparison with a national infrastructure such as TextGrid. Three case studies, which were considered to be typical of the critical edition projects carried out within the department, were selected:
1. Early English Laws8 produced new editions and translations, both online and in print, of all English legal codes, edicts, and treatises produced up to Magna Carta in 1215. The project operated in an extremely heterogeneous document environment, and the researchers were distributed across the UK. It also had very specific descriptive, administrative, and technical metadata requirements that weren't then accommodated in TextGrid.
2. The Gascon Rolls Project9 addressed texts that form a fundamental primary source for the study of the administrative, political, and economic history of the French region of Aquitaine under English rule. The source documents are somewhat inaccessible in the UK's National Archives, so the project developed an online edition with summary translation, high-quality images, and detailed indexes. It worked within a well-developed tradition of editing historical records (at least printed editions) but required the creation and support of an ontological layer.
3. The Inscriptions of Roman Tripolitania10 produced an online edition of the inscriptions, mainly in Latin, from the Roman province of Tripolitania (in modern Libya). There is an established print-based tradition for the editing and publication of inscriptions, and a reasonably well-established extension of this to the digital domain, EpiDoc,11 has grown out of this tradition, in part as the result of work by the IRT team.

13 Based on an analysis of these projects, TEXTvre identified the kinds of services and tools that could be reused from the TextGrid environment, as well as the gaps in provision.

3. Working Together on Tools

14 As discussed above, TextGrid is built on a distributed architecture and offers its users access to an ecosystem of different services and content; these tools will be the focus of this section. TextGridLab, TextGrid's Eclipse-based user interface, integrates a set of interactive tools such as the XML Editor, the Text-Image Link Editor (German "Text-Bild-Link-Editor", TBLE) and the Text-Text Link Editor (discussed in section 6). The XML Editor is one of the core functionalities of TextGridLab, allowing users to switch easily between a more technical view with tags and attributes and a structural view that is oriented towards standard text-editing applications. The Unicode Character Table enables the user to search, copy, and insert symbols from the Unicode character set. The Text-Image Link Editor supports the XML Editor by linking text sequences with image sections in order to create files that contain text elements and topographic descriptions.

15 TextGrid offers various utility tools that support the XML annotation process. The Dictionary Search Tool, for instance, provides an integrated search interface to a number of different dictionaries within the TextGrid VRE. The Wörterbuchnetz dictionary network12 at the University of Trier has been integrated into the interface for this purpose. The CollateX13 tool is another utility tool for comparing two or more files encoded in XML (including TEI) and annotating any differences. 16 The Text Publisher Web tool, on the other hand, can be used to present project results and publications on a project website. The tool is part of a TextGrid toolkit for web publishing. For more specialized publishing tasks, a module called XML Print14 is integrated into TextGrid to allow printing of scholarly texts with complex typesetting layout requirements based on XML data, with a particular view towards the needs of critical editions and dictionaries. Finally, the Bibliography Tool can import bibliographical data from existing data inventories and be used to capture, process, and administer bibliographies. Users can also export bibliographical data into certain standard formats (for example, TEI and MODS). 17 With the addition of new user communities, TextGrid has added new tools and customized existing ones. The Note Editor illustrates MEI15-encoded scores and displays them in a simplified format. Unique features of the Note Editor, such as the visualization of variants, are located in the editorial field. Another new tool is the Gloss Editor, which facilitates the description of specific text structures in gloss comments, the synoptical presentation of different gloss comments regarding the same text, and the synoptical presentation of gloss comments and related digital manuscripts.


18 TEXTvre wanted to go beyond these tools and services and interface with tools relevant to the local environment. These included alternatives to standard TextGrid tools, such as the commercial XML editor oXygen, and services for linguistic analysis based on the open source framework GATE (Cunningham, Maynard, and Bontcheva 2011; Cunningham et al. 2002).

19 oXygen is the XML editor of preference for DDH projects, so the TEXTvre team considered accommodating this to be an important candidate for TEXTvre. The Eclipse plugin, which supports all of oXygen's basic XML editing functionality,16 was integrated into TextGridLab, allowing users to create TEI files with oXygen instead of the TextGrid editor.

20 TEXTvre also integrated completely new tools and services. The Solr17 plugin, for instance, allows a user to index one or more documents that exist in the repository, search over the index, define a Solr schema for specific searching requirements, and browse using facets. There was some concern that these features duplicate existing TextGrid search functionality and may lead to confusion; however, it was felt useful to have a client-side, customizable faceted browsing tool for project-specific search.

21 Some work was carried out on integrating GATE services with the TextGridLab client platform to provide more advanced text analysis functionalities than those available in TextGrid. GATE is an information extraction and semantic annotation framework developed at the University of Sheffield, and this aspect of the work was intended as a demonstrator of the potential for integrating such services. Specifically, TEXTvre implemented named-entity recognition (NER) web services for German and English. These services take as their input an XML document (typically TEI XML) and return the same document with additional XML tagging for the named entities. The services contain slightly modified versions of GATE's German NER pipeline and the ANNIE information extraction components.18

22 In addition to creating tools and services, TEXTvre and TextGrid have also engaged in a fundamental collaboration on the underlying data infrastructure of both projects, which we describe next.
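Before turning to the data infrastructure, a brief sketch may make the NER services just described more concrete: they accept an XML (typically TEI) document and return the same document with additional entity markup. The endpoint URL below is invented for illustration and is not the actual TEXTvre service address.

```python
import requests

# Hypothetical endpoint for a TEXTvre-style named-entity recognition service
# (the real services wrapped GATE's German NER pipeline and ANNIE components).
NER_SERVICE_URL = "https://example.org/textvre/ner/english"

with open("edition.xml", "rb") as infile:
    tei_document = infile.read()

# Send the TEI XML and receive the same document back with added entity tagging.
response = requests.post(
    NER_SERVICE_URL,
    data=tei_document,
    headers={"Content-Type": "application/xml"},
)
response.raise_for_status()

with open("edition-with-entities.xml", "wb") as outfile:
    outfile.write(response.content)
```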

4. Data Infrastructure Collaborations

23 The second main component of TextGrid is the TextGridRep, which provides access to a repository infrastructure for long-term preservation of data based on grid technology. Researchers can decide how and with whom their data will be shared by using sophisticated rights management services. Findings and research data can be published directly from the TextGridLab to the TextGridRep via a publishing process that guides researchers in preparing the data for long-term accessibility.

24 The TextGridRep middleware consists of various components for handling files in the data grid, for rights management, for metadata management in an XML database, and for relations in an RDF triple store. TextGrid is not a closed system but rather an open platform that enables scholars to adapt the environment to their needs. Owing to its modular structure (Küster et al. 2007), it is easy to integrate external web services (such as the dictionary network "Wörterbuchnetz" referenced above), and the layered architecture also allows access to the services via graphical user interfaces other than TextGridLab.


25 It did not meet the requirements of the researchers at King's College London simply to reuse the German Grid infrastructure. King's had already started to develop its own repository infrastructure based on the Fedora repository technologies (Lagoze et al. 2006), and it was decided to integrate the VRE with this instead. Integrating TextGrid with Fedora essentially meant implementing management of digital objects in a Fedora object repository instead of using the German D-Grid (or similar grid-based resources). The TEXTvre team wanted to use Fedora to hold complete representations of the objects—not just to use it as a data store—so as to be able to use Fedora as the basis for various modes of publication and reuse of the content. This also requires storing inter-object relationships in Fedora, thus implementing the TextGrid object model by means of the Fedora Digital Object Model and the underlying Content Model Architecture (Lagoze et al. 2006). Within TextGrid, object management is carried out by the TG-CRUD service,19 which is modular and has a defined storage interface for implementing access to an object repository. For the TEXTvre implementation, the team developed a new component implementing this interface and allowing objects to be managed in Fedora instead of D-Grid, making use of the Fedora REST API instead of writing to D-Grid using JavaGAT,20 the Java-based grid interface. This in itself was relatively straightforward, and is illustrated in figure 1.

Figure 1. TextGrid and TEXTvre data infrastructure
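As a rough illustration of the storage-interface approach just described, the sketch below shows the general pattern: a small storage abstraction with a Fedora-backed implementation. The class and method names are invented, the actual TEXTvre component was written in Java against the Fedora REST API, and the URL paths follow Fedora 3.x conventions but should be treated as illustrative.

```python
from abc import ABC, abstractmethod
import requests

class ObjectStore(ABC):
    """Sketch of a pluggable storage interface, in the spirit of TG-CRUD's
    defined storage layer (names here are illustrative, not TextGrid's API)."""

    @abstractmethod
    def put(self, object_id: str, content: bytes) -> None: ...

    @abstractmethod
    def get(self, object_id: str) -> bytes: ...

class FedoraStore(ObjectStore):
    """Stores object content as a Fedora datastream via the REST API,
    replacing a grid-based back end such as D-Grid accessed through JavaGAT."""

    def __init__(self, base_url: str):
        self.base_url = base_url.rstrip("/")

    def put(self, object_id: str, content: bytes) -> None:
        # Create or update a managed datastream holding the object's content.
        url = f"{self.base_url}/objects/{object_id}/datastreams/CONTENT"
        r = requests.post(url, params={"controlGroup": "M"}, data=content)
        r.raise_for_status()

    def get(self, object_id: str) -> bytes:
        # Retrieve the stored content for an object.
        url = f"{self.base_url}/objects/{object_id}/datastreams/CONTENT/content"
        r = requests.get(url)
        r.raise_for_status()
        return r.content
```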

26 However, replicating TextGrid relationships in Fedora proved to be difficult because of the various ways in which relationships and object versioning are managed within TextGrid, and so a simple mapping of the RDF in the TextGrid triple store to the RDF in Fedora did not work. Specifically, in TextGrid, the relationships are stored in a Sesame RDF database,21 and the objects themselves are stored elsewhere, whereas in Fedora, relationships are stored in a reserved datastream (called RELS-EXT) of the source object, where the target object must already exist within Fedora.

27 An additional complication is that edition projects as a whole are not actually modelled as objects in TextGrid (i.e., they are not managed by TG-CRUD); instead, they are LDAP constructs referenced in TextGrid relationships. As there are no objects in TextGrid corresponding to projects, an attempt to store the relationship between an object and a project in Fedora fails as the target object does not exist.

28 Other issues arose from TextGrid's approach to versioning. In TextGrid, different versions of an object can be created, and each version is maintained as an independent object and can be edited and changed at any time; however, no change management is available on the individual objects. TextGrid maintains the concept of a current version of an object, and the "current" version number is kept as a field in the object's metadata in the eXist database. For example, an initial version of an object might have an identifier textgrid:45679.0, indicating version 0. Creating a new version would result in a new object with identifier textgrid:45679.1, and so on. In relationships between objects, TextGrid can involve both specific versions of an object (e.g. textgrid:45679.2) and "unversioned" objects (e.g. textgrid:45679), where the most recent version of an object is assumed. Fedora, on the other hand, has an entirely incompatible concept of versioning, in which object versions are implemented using a timestamp that captures the state of the object at a particular point in time. Fedora's own versioning mechanism therefore cannot be used to implement TextGrid versioning. To address these issues, a subclass of the TextGrid component providing relationship storage was implemented for TEXTvre, with the following processing:
• Relationships are received as an input stream of triples in N3 format,22 which are processed into individual triples using the Jena library.23
• As projects do not exist as objects in TextGrid, an explicit check is made for "inProject" relationships, and a dummy project object is created in Fedora (if it does not already exist).

29 Each time that a relationship is added to Fedora, the component checks that both of the objects in the relationship already exist in Fedora. If either or both of the objects do not exist, placeholder objects are created and the relationship is stored referencing these. A user searching via Fedora can easily see whether an object referenced in a relationship is versioned or is an unversioned placeholder. This part of the project led to additional funding from the Software Sustainability Institute (SSI),24 an EPSRC-funded25 body that helps to increase the sustainability of open-source research software. This allowed TEXTvre to streamline the process of installing and configuring an instance of the software, develop a virtual machine image that allows a TEXTvre installation to be run "out of the box", and increase modularization of the source code. The work was carried out in collaboration with EPCC26 at the University of Edinburgh, as well as with the TextGrid team at Göttingen and DAASI.27
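A rough outline of this relationship-handling step is sketched below in Python. The actual TEXTvre component was a Java subclass built on the Jena library; rdflib is used here purely for illustration, and the `fedora` client with its `exists`, `create_placeholder`, and `add_relation` methods is a hypothetical stand-in for the Fedora-facing code.

```python
from rdflib import Graph

def store_relationships(n3_triples: str, fedora):
    """Sketch only: parse incoming relationships (an N3 stream, as in TextGrid)
    and make sure both ends of each relationship exist in Fedora before the
    relationship is recorded. `fedora` stands in for a repository client."""
    graph = Graph()
    graph.parse(data=n3_triples, format="n3")

    for subject, predicate, obj in graph:
        # Projects are LDAP constructs in TextGrid, not TG-CRUD objects, so an
        # "inProject" relationship needs a dummy project object on the Fedora side.
        if str(predicate).endswith("inProject") and not fedora.exists(str(obj)):
            fedora.create_placeholder(str(obj), kind="project")

        # For other relationships, create placeholder objects for any missing
        # end so the triple can still be stored and resolved later.
        for node in (subject, obj):
            if not fedora.exists(str(node)):
                fedora.create_placeholder(str(node), kind="placeholder")

        fedora.add_relation(str(subject), str(predicate), str(obj))
```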

5. Sustainability: The DARIAH Initiative

30 As is evident from these discussions, large research infrastructures such as TextGrid do not consist of a single solution and a single toolkit of services, but are essentially a combination of smaller, often standalone, services that a user can combine at will. This is particularly the case in digital environments for the arts and humanities. DARIAH recognizes this, and thus a key feature of its provisions is a virtual "social marketplace", a framework that offers projects such as TextGrid and TEXTvre a forum for exchanging such tools and services, and that supports advanced collaboration across diverse networks and specialized service providers. As detailed by Blanke et al. (2011), such a marketplace has three main pillars:
1. Open APIs exposing reusable services
2. Composition and aggregation facilities allowing researchers to build on these services
3. Promotion of applications that are based on these services, for a variety of use cases

31 The rules and procedures implemented by DARIAH for its marketplace determine how and where participants can make their tools and services available. It is an open environment where services can be exchanged, but of course in any such environment trust is a major issue, and some kind of compliance framework is required to ensure the quality of the exposed services. Within the DARIAH community, a service can simply be documented without certification, in which case users cannot assume that it is reliable, or else it can be fully certified as compliant by the DARIAH organization (Blanke et al. 2011). Compliance with DARIAH guidelines ensures that data and services can be contributed by decentralized parties while at the same time ensuring that they are both accessible and trustable across the DARIAH ecosystem. Thus, while DARIAH is not concerned intrinsically with TEI, it provides a framework for making available and sustaining the TEI-related functionality provided by other initiatives such as TextGrid.

32 In order to scale tools and services from projects such as TEXTvre and TextGrid to a sustainable European collaboration, DARIAH has developed a flexible organizational model that enables pan-European collaboration. DARIAH has set up a number of Virtual Competency Centres (VCCs), each of which is based on topics that, based on deep experience with digital arts and humanities projects, are part of the challenges any such project faces:
• an e-Infrastructure VCC, which is concerned with the technical platform
• a VCC for liaising with research and education communities, thus ensuring that research activities are directly embedded in the infrastructure
• a VCC for scholarly data, which supports the long-term availability of data
• a VCC that works on the impact of the tools and resources that DARIAH incorporates

33 DARIAH provides these VCCs as platforms for collaboration. The DARIAH partners between them have extensive experience with meeting the kind of challenges that both TextGrid and TEXTvre have faced; however, the flow of expertise is not one-way, and DARIAH can learn from the collaboration experiences of these projects. Customizing one's own storage solutions is often a very involved task that needs careful planning and organization. DARIAH anticipates that many digital humanities research groups and centers will have to comply with their own institutional requirements, while for many countries there will be other dedicated national services for supporting the work of data publishing and preservation. Therefore, a common DARIAH data layer will not consist of a single data infrastructure but rather will comprise a set of services, either for reusing one of several existing solutions or for providing a new solution, and thus will build on the expertise of the DARIAH partners. The experience of the collaboration between TEXTvre and TextGrid can be seen as exemplary in this context.

34 While variety is to be expected at this level, the diversity of local and national data services will be brought together into a common European data layer through a toolkit of lightweight services that allows, for example, cross-referencing and discovery across various data repositories. The existing DARIAH Persistent Identification (PID) service is a good example of such a lightweight solution (Blanke et al. 2011). DARIAH recognized early on that the use of PIDs within the new infrastructure is imperative; when a researcher cites an article or dataset, whether digitally or in hardcopy, they need to be assured that the citation itself will always lead to the original resource cited. The creation of relationships between resources, for example between a researcher and the

articles they have published, requires a similarly persistent mechanism. Many DARIAH members already provide their own PID solutions; this is the case for TextGrid and TEXTvre, for example. The DARIAH PID service has been deliberately designed to accommodate these local solutions, scaling them to a European level, and allowing DARIAH PIDs to be used in the heterogeneous environment of existing archives and VREs, for example in the Fedora-based repository used by TEXTvre.

6. TextGrid–TEXTvre Collaboration: Next Steps

6.1 Sustainability

35 TextGrid started its third funding period, which will last for three additional years, in June 2012, and the main focus of this period is on sustainability, specifically how to ensure that TextGrid will be able to provide services and tools after the project has come to an end. The second major concern of this phase is to address the scalability and performance issues of the TextGrid middleware: increasing numbers of research groups are joining TextGrid, leading in turn to increasing numbers of concurrent users, so a stable, sustainable, and trusted infrastructure has therefore become paramount.

36 TextGrid's sustainability issues are not just technical; a key aspect is the financial and organizational sustainability of running the system after funding runs out in 2015. While "big science" disciplines, such as astronomy, climate research, or particle physics, have well-developed approaches to setting up the financial and political structures required for such infrastructures, for example a legal entity with a reliable budget, until now there has been no such model to follow in the arts and humanities communities. The roadmap developed for the sustainability of TextGrid, which has been produced at the request of the German Ministry of Research and Education, will serve as a blueprint for the sustainability of other humanities research infrastructures in Germany. Moreover, TextGrid is working closely with DARIAH to determine the extent to which DARIAH services, such as the PID service, the authorization/authentication/identification (AAI) infrastructure, and storage solutions for data curation can be leveraged to support this sustainability.

37 TextGrid is likely to be sustained through a hybrid financial model, with some of the budget provided by the central government, supplemented by funding from German state governments that have already set up their own digital humanities initiatives. Other stakeholders could also provide support to VREs such as TextGrid by providing central services in kind, while researchers who plan to use TextGrid could request additional money in their funding proposals. The main federal funding agencies in Germany, the BMBF (Bundesministerium für Bildung und Forschung) and the DFG (Deutsche Forschungsgemeinschaft), have agreed to support this idea, which has operated in the sciences for many years. TextGrid is thus entering its operational phase. After the first phase, which focused on conceptual issues, and the second phase, which saw most of the development, TextGrid is now targeting its own long-term future, focusing on sustainability, trust, and financial viability. This future also includes education and training, such as the development of professional teaching materials, special lectures within BA and MA curricula, and expert workshops on innovative topics in collaboration with national and international research and developer communities.


6.2 New Services: Embedding Intertextual Links

38 Extensibility in the face of new requirements is an important attribute of any infrastructure, and both TextGrid and TEXTvre have given rise to follow-on projects that addressed specific research questions. One such set of requirements for additional functionality arose from the needs of the SAWS (Sharing Ancient WisdomS) project28 (Jordanous et al. 2012), which addressed the digital publication of medieval wisdom literatures in Greek and Arabic. These texts, also termed gnomologia, are anthologies of "wise sayings", encapsulating moral and social guidance or philosophical ideas. Gnomologia also formed a key route via which ideas were disseminated over a large area, and across different cultures, often over the course of many centuries.

39 While the publication of such collections using TEI was one objective of the SAWS project, a distinguishing characteristic of these texts—which both makes them of particular interest to scholars and raises particular challenges for existing models of digital publication—is that the sayings or excerpts from which they are constituted are borrowed, modified, and recombined in different ways as new manuscripts are written on the basis of older ones. A key requirement of a scholarly editing and publication infrastructure for such texts is that it should support identification, representation, and exploration of the relationships that occur between excerpts, both within texts and across multiple texts. For example, one anthology may repurpose excerpts from a number of existing anthologies, changing them in various ways; sayings may be translated, and often changed in subtle ways in the process; and many of them can be traced back to known "source texts", often classical texts or the Bible.

40 Consequently, as well as producing digital editions of the manuscripts, SAWS aimed to build up a corpus of links between textual excerpts. As part of the markup process, scholars need to identify excerpts and relationships and mark them up within the TEI document. Traditionally such information would form part of the human-readable apparatus criticus; instead, SAWS is producing a network of interrelated excerpts formalized using an RDF-based model, in addition to the more standard TEI approach. As such relationships between texts and text fragments become more important to the work of scholars, tools are needed for editing and managing them in a persistent manner (Hedges et al. 2012).

41 These requirements are being addressed by the Text-Text Link Editor (TTLE), a tool developed at the TextGrid partner Fachhochschule Worms that handles links between arbitrary fragments of XML documents. TTLE can be integrated as a plugin into TextGridLab, thus providing textual scholars with a consistent user experience and workflow for both standard TEI editing and working with intertextual links. TTLE's persistence layer uses the linking functionality already provided by TEI; although there are other standards available to describe text linking—for example, the Open Annotation Core Data Model29—using TEI allows TTLE to make use of the services already available in the TextGrid repository.

42 The mechanism TTLE uses to mark up fragments depends on the status of the document. For writeable documents, it simply inserts specific tags, whereas for write-protected documents it either uses character offsets or allows an identifier to be assigned to any text enclosed in an XML tag. In either case, the information is stored in TEI format, either in the document itself or in a separate file. The current version allows users to add comments to a link, and subsequent releases will offer a way of specifying link types for easy sorting, searching, and filtering.
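Since TTLE's persistence layer relies on TEI's own linking mechanisms, a link between two excerpts can be pictured roughly as follows. The snippet builds a TEI linkGrp/link pair with lxml; the identifiers, the type value, and the exact way TTLE organizes its output are illustrative assumptions, not the tool's documented format.

```python
from lxml import etree

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Sketch: record a relationship between two excerpts (identified by xml:id
# anchors, one of them in another file) as a TEI <link>. TTLE's actual output
# may differ in detail, e.g. whether links live in the document or in a separate file.
link_grp = etree.Element(f"{{{TEI_NS}}}linkGrp", nsmap={None: TEI_NS})
link = etree.SubElement(link_grp, f"{{{TEI_NS}}}link")
link.set("target", "#saying-12 gnomologium-b.xml#saying-3")  # hypothetical excerpt anchors
link.set("type", "variantOf")  # a hypothetical link type; typed links are planned for later releases

print(etree.tostring(link_grp, pretty_print=True).decode())
```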

7. Conclusions

43 This paper has presented the collaborations between the TextGrid and TEXTvre projects in Germany and the UK, and examined how DARIAH might operate to facilitate their collaborations. We have shown how TextGrid emerged from the specific needs of a German community and was then adopted and enhanced for the needs of the scholarly community at King's College London. These two virtual research environments are not monolithic infrastructures; rather, each is a collection of flexible services and tools that can operate independently from one another, and many separate services are needed to serve the growing community of user groups across different disciplines.

44 When TEXTvre tried to adopt the TextGrid environment, they found that even fundamental tools such as the TextGrid XML editor might require reconfiguration, as users prefer to continue to use the environments with which they are familiar. Another major component of the TEXTvre work was the integration of automated annotation, which is continued today by projects such as Pundit and Korbo,30 and increasing numbers of tools can be expected in the future. DARIAH has learned from the collaboration experiences of TextGrid and TEXTvre, and to facilitate such communities of tools and services, it is setting up a DARIAH "social marketplace" where they can be sustained, shared, and exchanged.

45 The collaboration between TextGrid and TEXTvre also addressed the creation of a common data infrastructure. Even in this area, where common services might seem easier to achieve than in user-facing tools and services, it was evident that requirements could vary significantly. Moreover, the seemingly straightforward task of integrating a standard digital repository technology into the storage infrastructure of TextGrid has run into difficulties on several levels. Collaboration thus cannot just mean reuse of complex software applications but rather needs to function as a genuine exchange of tools, expertise, and experience so that customized solutions that work in different national and institutional environments can be developed and taken forward. Therefore, as far as data infrastructure is concerned, DARIAH has rejected the idea of providing a single data layer, and will instead focus on supporting a range of more diverse services, which are difficult to realize locally as they require a more sustainable environment than project-based funding can provide.

46 TEXTvre has reached the end of its funding, while TextGrid has achieved a third round of funding. During this period, TextGrid is establishing a national working group on the coordination of sustainability and information infrastructure in the humanities (Koordination Geisteswissenschaften Verstetigung/Informationsinfrastruktur [KGW]) with the support of the BMBF. TEXTvre has been taken forward in several new projects, among them the SAWS project and the TTLE service discussed above. While both TextGrid and TEXTvre continue to develop, each has already achieved real (institutional) change in the communities they support.


BIBLIOGRAPHY

Blanke, Tobias, Michael Bryant, Mark Hedges, Andreas Aschenbrenner, and Michael Priddy. 2011. "Preparing DARIAH." In Proceedings: 2011 Seventh IEEE International Conference on eScience, 158–65. Los Alamitos, CA; Washington, DC: IEEE. doi:10.1109/eScience.2011.30.

Constantopoulos, Panos, Costis Dallas, Peter Doorn, Dimitris Gavrilis, A. Groß, and Georgios Stylianou. 2008. "Preparing DARIAH." In Digital Heritage: Proceedings of the 14th International Conference on Virtual Systems and MultiMedia, Limassol, Cyprus: Short Papers, 164–66. Budapest: Archaeolingua. http://pubman.mpdl.mpg.de/pubman/item/escidoc:66841:4/component/escidoc:66842/Preparing_DARIAH.pdf.

Cunningham, Hamish, Diana Maynard, and Kalina Bontcheva. 2011. Text Processing with GATE. Sheffield, UK: University of Sheffield Department of Computer Science.

Cunningham, Hamish, Diana Maynard, Kalina Bontcheva, and Valentin Tablan. 2002. "GATE: An Architecture for Development of Robust HLT Applications." In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), 168–75. Stroudsburg, PA: Association for Computational Linguistics. doi:10.3115/1073083.1073112.

Hedges, Mark, Anna Jordanous, Stuart Dunn, Charlotte Roueché, Marc W. Kuster, Thomas Selig, Michael Bittorf, and Waldemar Artes. 2012. "New Models for Collaborative Textual Scholarship." In 2012 6th IEEE International Conference on Digital Ecosystems and Technologies (DEST). Piscataway, NJ: IEEE. doi:10.1109/DEST.2012.6227933.

Jordanous, Anna, K. Faith Lawrence, Mark Hedges, and Charlotte Tupman. 2012. "Exploring Manuscripts: Sharing Ancient Wisdoms across the Semantic Web." In Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics, article no. 44. New York: ACM. doi:10.1145/2254129.2254184.

Küster, Marc Wilhelm, Christoph Ludwig, and Andreas Aschenbrenner. 2007. "TextGrid as a Digital Ecosystem." In 2007 Inaugural IEEE International Conference on Digital Ecosystems and Technologies (IEEE DEST 2007), 506–11. Piscataway, NJ: IEEE. doi:10.1109/DEST.2007.372029.

Lagoze, Carl, Sandy Payette, Edwin Shin, and Chris Wilper. 2006. "Fedora: An Architecture for Complex Objects and Their Relationships." International Journal on Digital Libraries 6(2): 124–38. doi: 10.1007/s00799-005-0130-3.

Neuroth, Heike, Felix Lohmeier, and Kathleen Marie Smith. 2011. "TextGrid – Virtual Research Environment for the Humanities." International Journal of Digital Curation 6(2): 222–31. doi:10.2218/ijdc.v6i2.198.

NOTES

1. http://www.textgrid.de/en.html
2. http://textvre.cerch.kcl.ac.uk/
3. http://www.dariah.eu/
4. http://digilib.berlios.de/
5. http://cordis.europa.eu/esfri/
6. http://www.d-grid.de/index.php?id=1&L=1
7. http://www.goegrid.de/
8. http://www.earlyenglishlaws.ac.uk/
9. http://www.gasconrolls.org/
10. http://irt.kcl.ac.uk/irt2009/
11. http://epidoc.sourceforge.net/
12. http://www.woerterbuchnetz.de
13. http://collatex.sourceforge.net
14. http://www.fh-worms.de/index.php?id=4616&L=1
15. http://music-encoding.org/
16. Note that it is not possible to apply an XSLT transformation to an XML file being edited with the plugin. The TextGrid team investigated some analogous issues and suggested that it might be possible to remedy them by using virtual workspace folders, which were introduced in a more recent release of the Eclipse plugin. However, this was not high priority for TEXTvre since the King's users were concerned only with TEI file creation.
17. http://lucene.apache.org/solr/
18. http://gate.ac.uk/ie/annie.html
19. https://dev2.dariah.eu/wiki/display/TextGrid/TG-crud
20. http://www.cs.vu.nl/ibis/javagat.html
21. http://www.openrdf.org/
22. http://www.w3.org/TeamSubmission/n3/
23. http://jena.apache.org/
24. http://software.ac.uk/
25. http://www.epsrc.ac.uk/
26. http://www.epcc.ed.ac.uk/
27. https://daasi.de/
28. Funded by HERA (Humanities in the European Research Area).
29. http://www.openannotation.org/spec/core/
30. http://dm2e.eu/first-release-of-pundit-and-korbo-ready-for-testing/

ABSTRACTS

A variety of initiatives for developing virtual research environments, research infrastructures, and cyberinfrastructures have been funded in recent years. The systems produced vary considerably, but they all face the issue of sustainability, namely how to ensure the continued existence of a resource once the project that created it has finished. This paper addresses the sustainability issues faced by the TextGrid and TEXTvre virtual research environments for textual scholarship, examining the inter-project collaboration and cross-fertilization that took place, and investigating how the projects benefited from this exchange. It also examines how their sustainability is being facilitated by the more general-purpose DARIAH infrastructure, and conversely how their existing collaboration can serve as a model for future collaborations within the DARIAH community.

INDEX

Keywords: VRE, ESFRI, tools, services, data architecture, sustainability, research infrastructure

AUTHORS

MARK HEDGES

Mark Hedges is director of the Centre for e-Research and senior lecturer in the Department of Digital Humanities at King's College London, teaching on a variety of modules in the MA programme in Digital Asset and Media Management. His original academic background was in mathematics and philosophy, and he gained a PhD in mathematics at University College London before starting a 17-year career in the software industry, working on large-scale development projects for industrial and commercial clients. He began his career at King's as technical director of the Arts and Humanities Data Service.

HEIKE NEUROTH

Heike Neuroth holds a PhD in Geology and since 1997 she has been working at the Göttingen State and University Library (SUB) in Germany, where she heads the Research and Development Department (RDD). She has been involved in building virtual research environments and research infrastructures for many years. She lectures in information science at the Applied University of Cologne as well as in digital humanities at the University of Würzburg. Additionally, she has been an e-humanities consultant at the Max Planck Digital Library (MPDL) since February 2008.

KATHLEEN M. SMITH

Kathleen M. Smith is a research fellow in the Research and Development Department of the Göttingen State and University Library. She holds an MLS and a PhD in Germanic Literatures, and her research focuses on early modern information networks. Kathleen is currently working on the CENDARI project, which is developing a research infrastructure for the fields of medieval and World War I history.

TOBIAS BLANKE

Tobias Blanke is a senior lecturer in the Centre for e-Research, Department of Digital Humanities, at King's College London. He is a director of DARIAH, a European research infrastructure for arts, humanities, and cultural heritage data, and leads the joint research work for EHRI, a pan-European consortium to build a European Holocaust Research Infrastructure. He holds PhDs in philosophy and in computer science.

LAURENT ROMARY

Laurent Romary is directeur de recherche (director of research) at INRIA, Rocquencourt, France, and guest scientist at Humboldt University in Berlin, Germany. He carries out research on the

modeling of semi-structured documents, with a specific emphasis on texts and linguistic resources. He is the chairman of ISO committee TC 37/SC 4 on Language Resource Management, and has been active as member (2001–2007), then chair (2008–2011), of the TEI Council. He currently contributes to the establishment and coordination of the European DARIAH infrastructure.

MARC KÜSTER

Marc Küster is professor for web services and XML technologies at Fachhochschule Worms, Germany. He has a background in physics, literary studies, and history, and his current research interests include the application of information technology in the humanities, Web services, and digital ecosystems. He has served as keynote speaker, program committee member, workshop chair, technical program chair, and general co-chair for various international conferences, and is active in standardization in the fields of cultural diversity, eBusiness, and eGovernment. Since 2008, he has been on leave for a project at the Publication Office of the EU (central data and metadata management).

MALCOLM ILLINGWORTH

Malcolm Illingworth is an applications consultant at EPCC, one of the foremost supercomputer centers in the world, established in 1990 as a focus for the University of Edinburgh's work in high-performance computing. He works on facilitating the effective exploitation of advanced computing methods to solve real-world problems in academia, industry, and commerce.


The Evolution of the Text Encoding Initiative: From Research Project to Research Infrastructure

Lou Burnard

1. In the Beginning

1 The Text Encoding Initiative was born into quite a different world from that of today. In 1987, there was no such thing as the World Wide Web, and construction of the tunnel beneath the English Channel had only just begun. A major political power called the Union of Soviet Socialist Republics still existed, while in the UK, Margaret Thatcher's government had just been reelected for a third time, and in the US the Senate rejected for the first (and so far only) time a presidential nomination to the Supreme Court. In academic life, it was still (just about) possible to finance an undergraduate degree on the basis of government grants. A typical "home computer" cost about 1,500 pounds in the UK, had an Intel 80286 processor and up to 640 Kb of memory, with maybe up to 50 Mb of storage on its internal hard disk, and probably ran some version of Microsoft's ubiquitous MS-DOS, unless of course it was a Macintosh. New machines were beginning to appear on the market, some of them with nearly enough memory and processing power to run Microsoft's new Windows, or IBM's optimistically named "OS/2," also launched in this year. And meanwhile in another part of the forest Steve Jobs was busy imagining the NeXT computer, which would run something like Unix, but with a windowing interface. However, any serious computing would still be done on your departmental minicomputer (perhaps a VAX or a PDP) or your institutional "mainframe," as the massive energy-hungry arrays of transistors and magnetic storage systems sold by such companies as IBM, Univac, Burroughs, ICL, or Control Data were known.

2 At the same time, much of the work done on those massive machines looks quite familiar today. The process of digitization of the office environment had already begun in some scientific disciplines with software such as TeX, Scribe, or troff becoming dominant in the production and dissemination of research articles and documentation. The tide of personal computers able (allegedly) to close the gap between the writer and the publisher would soon engulf us as surely as Microsoft Word would replace the seemingly unstoppable Word Perfect 4.2 (released in 1986). Such neologisms as "desktop publishing," "expert systems," and "digital resources" began to appear in serious academic journals, as well as dominating the discourse of fledgling online communities such as the Humanist discussion list launched in 1987. The Internet already existed, as did many theories about how it might be used "hypertextually," though the World Wide Web was still barely an idea. Both in the research community and, increasingly, beyond it, the goals of corpus linguistics and artificial intelligence alike had established a need to work on large-scale digitized textual resources just as the technologies to support such work were beginning to appear. Launching a major revision of the Oxford English Dictionary, Oxford University Press for the first time proposed to do so using computational methods. Text-based disciplines of all kinds were beginning to imagine the possibilities offered by computationally tractable corpora of source texts created by such projects as the Thesaurus Linguae Graecae, the Trésor de la Langue Française, or the Dictionary of Old English. And, given the rapidity with which one storage technology was already replacing another, new discourses concerning the best methods of guaranteeing long-term access to newly created digital resources, and about the necessities for open access and platform-independent formats alike, were beginning to be heard.

3 Symptomatic of that discourse were two key meetings held in 1987: one, in April, organized by a group of practicing historians in Europe, debated the possibility of establishing some kind of consensual standardization for the encoding of primary historical source data in computer-readable form. A paper prepared for that workshop by Manfred Thaller, and subsequently published in the influential journal Historical Social Science, together with a collected proceedings volume edited by the French historian Jean-Philippe Genet (see, among others, Thaller 1986; Genet 1988), may have strengthened the case put forward to the US National Endowment for the Humanities for the funding of a larger international workshop on the issue; or perhaps they just spurred on the organizers of that workshop. But in either case, the majority of the European institutions represented at one workshop reappeared at the other, along with many other scholars from North America and elsewhere in the world. The TEI's foundational event1 was held at Vassar College, Poughkeepsie, New York, immediately after the joint annual conference in Toronto of the Association for Literary and Linguistic Computing and the Association for Computers and the Humanities, twin venerable precursors of today's Alliance of Digital Humanities Organizations.

4 However, the purpose of this paper is not simply to chronicle the history or foundational myth of the TEI, but rather to try to answer a more interesting question.
If the TEI is so very old—preceding as it does the Web, the DVD, and widespread use of technologies such as the portable phone, cable television, and Microsoft Word—and given that computer related technologies are hardly renowned for their longevity, how is it that the TEI is still with us, and still occupying a significant position in the world of research, after nearly thirty years?


2. Transformations of The Text Encoding Initiative

5 As first conceived, the Text Encoding Initiative was an international research project intended to define a kind of digital demotic, as indicated by its alternative expansion Text Encoding for Interchange. In a world dominated by mutually incompatible formats, each computer manufacturer could impose its own conventions for the structuring and representation of textual data. This was a world in which some computers worked in EBCDIC and others in ASCII, where even the number of bits in a byte could vary between 6, 8, and 16, and where, as a consequence, there were serious technical obstacles even to the simple transfer of data files from one machine to another, to say nothing of the difficulties posed by mutually incompatible and proprietary file formats. Nevertheless the TEI announced that it would facilitate the creation, exchange, and integration of textual data in machine-readable form, for all kinds of texts, in every human language, from every historical or social context. In the world before the Web, these were ambitious goals. But this quixotic ambition was further compounded by the TEI's declared goal of making its proposals accessible both to the novice, looking for guidance on well-established best practice, and to the expert, seeking to establish new practice in response to new research goals. It was this latter objective which, in retrospect, gives the TEI its distinctive nature, and which largely distinguishes it from other standardization efforts, both those now forgotten, and those currently entrenched or in formation.

6 In its earliest stages, the TEI was a creature of a time of transition, when the notion of "humanities computing" was just beginning to invent itself as a form of interdiscipline, a space within which information specialists, computer scientists, and traditional humanists might meet, if only to trade secret handshakes and suspicious glances; it was a period in which the notion of the disciplinarity (or otherwise) of that interchange was not yet more important than its simple existence. The founding parents of the TEI were, from the standpoint of traditional academic life, a very mixed bunch of people, who had for the most part forgone the safety of traditional career paths for the excitement of transgressing disciplinary boundaries. Many of those who met that snowy weekend in Poughkeepsie came from research teams or institutions on the fringes of traditional scholarship, owing allegiances to computer science, linguistics, philology, lexicography, or literary studies in many languages, but not centrally placed within any of those disciplines. What united them was an expertise in the creation and management of digital text, a vision of its future importance to the traditional disciplines, and a concern that lack of standardization and commercial pressure might prevent the realization of that vision.

7 The outcome of the conference was a set of eminently practical recommendations as to how an extensible set of guidelines consistent with the goal of a universal text-encoding scheme might be achieved. These "Poughkeepsie Principles," with commentary focusing on implementation, have long been available from the TEI's website2 and we do not discuss them further here, though they repay reading for anyone curious about the theory underlying the shape and content of the TEI scheme. Instead we propose to focus more on the organizational and managerial issues which have determined its evolution.

8 It is instructive to compare the organizational structure of the TEI during its first decade of existence with that which has come into being during its second decade.
Initially conceived as a research project answerable to a self-selecting group of experts, the TEI of the 1990s manifested a typically centralized structure, in which all the work is done by a small number of people under the control of a small executive authority (see fig. 1). The TEI of the present decade has a more distributed structure in which the task of maintaining and developing the Guidelines is the responsibility of many different people drawn from a loosely defined community (see fig. 2).

Figure 1. TEI organizational structure, 1991


Figure 2. TEI organizational structure, 2012

9 That is not, of course, to imply that the original TEI editors worked in isolation from the many scholarly communities that their work was intended to benefit. On the contrary, the organization of work required the editors to spend much of their time extracting proposals and requirements from a very wide range of experts, which could subsequently be organized into a coherent whole for appraisal and ratification by further groups representing those communities. In all, nearly 200 experts from both sides of the Atlantic gave their time and energy to codifying their own practice in such a way as to facilitate its integration with that of others.

10 During the first funding cycle (1988–90), the TEI's work was carried out by four appointed committees. The Text Documentation Committee, populated by documentalists and librarians, produced recommendations that eventually became the TEI Header. The Metalanguage and Syntax Committee, populated by computer scientists, made recommendations on the formal metalanguage in which the TEI outcomes should be formulated. The other two committees divided the universe of possible objects for text markup between them, along rather loosely defined conceptual lines: one committee, Text Representation, was supposed to identify the "significant particularities" of written texts, specifically those which needed to be made explicit in a marked-up document; the other, Text Analysis and Interpretation, was supposed to focus on the encoding of "added value" brought to a text by an analytic tool, such as a parser or (indeed) a careful reader. This opposition between analysis and representation typifies much contemporary debate about the proper function of markup, and indeed the process of reading itself.

11 The Text Documentation Committee may be credited with the idea, still useful today, that the TEI Header is intended to act as a primary source of information for a digital resource, in the rather specialist sense that the librarian community uses that phrase.
The title page of a printed book may not correspond exactly with a catalogue entry for that book, but cannot be ignored during the production of one. Given that most digital resources were going to be produced by non-cataloguers, the committee argued that the TEI Header should make it possible for the non-expert to record whatever information they deemed useful in such a way as to simplify, but not to replace, the task of the subsequent cataloguer. This idea perhaps underlies much that professional cataloguers find frustrating in the TEI Header today.

12 The choice of SGML3 as a vehicle for expression of the TEI's recommendations was a fairly obvious one for the Metalanguage Committee. This committee also established the principle that the use of this ISO standard was conditional, and that the TEI itself should as far as possible be independent of any particular metalanguage syntax; the wisdom of this decision and the consequent system architecture became apparent when, several years later, the task of re-expressing the TEI in a new metalanguage called XML was taken on.

13 The Text Representation Committee began by consolidating current encoding practice across a number of related disciplines to provide the essentials of the TEI scheme: the basic components of textual objects, viewed as hierarchically organized containers, in which identifiable, possibly internally structured, components such as bibliographic references, highlighted phrases, or proper names float in a kind of "soup" of plain text (a minimal sketch of this model is given below, after paragraph 14). To these, the committee added recommendations for markup of textual variance and hypertextual links, amongst other topics reflecting its diverse membership. A recurrent theme in its discussions was the desire to encode an idealized version of the text itself, independently of its realization in a particular source, while at the same time wanting to preserve what was significant in that source. This debate found expression in a number of published scientific articles4; it also probably underlies the TEI's notorious tendency to provide both cake and the eating of it (or, as they say in France, both the butter and the money for the butter). Nevertheless, as critics were quick to point out, whenever it became hard or impossible to give equal time to both the presentational or visual and the analytic or structural properties of a text, this committee, and therefore the TEI, consistently came down on the side of the latter. Many of those at Poughkeepsie felt that contemporary digital publishing systems, with their emphasis on "camera-ready" copy, were a distraction from the true business of scholarship, as had been cogently argued in a highly influential article published in the Communications of the ACM shortly before the Poughkeepsie Conference.5

14 The Analysis and Interpretation Committee began with the rather ambitious goal of providing ways of encoding in a normative way the full range of linguistic analysis. In an early working paper, Langendoen (1990) notes that amongst many other topics, it would need to deal with:
• Underspecification, uncertainty, multiple hierarchies
• Phonology and prosody
• Morphology and word-level tagging
• Higher-level syntactic analysis
• Structural ambiguity
• Anaphora and deixis
• Idioms
• Figures of speech (not for this cycle)
• etc. etc.
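To make concrete the model of hierarchical containers and floating phrase-level elements described in paragraph 13, the following fragment is a purely illustrative sketch: the element and attribute names are standard TEI, but the content and attribute values are invented for the example.

  <div type="chapter">
    <head>Chapter 1</head>
    <p>In <date when="1987">1987</date> a planning meeting was convened at
      <name type="place">Poughkeepsie</name>; see <bibl>Burnard 1988</bibl> for a
      contemporary report.</p>
  </div>

Here the containers (<div>, <p>) carry the structure of the text, while phrase-level elements such as <date>, <name>, and <bibl> simply mark identifiable components wherever they occur in the running prose.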


15 If this seems a touch ambitious, it should not be forgotten that this was the period during which Europe's Language Engineering industries were born, and that part of the motive behind the EU's funding of the TEI was precisely to formulate recommendations about specific linguistic categories for the use of that industry.

16 The linguists represented on the A&I Committee did not, however, belong to a school in which such concepts as "noun" or "verb" were unproblematic: instead they formulated a more abstract system, a kind of metamodel of linguistic annotation based on feature structure theory,6 which could be used to represent any kind of linguistic (or indeed non-linguistic) annotation. This model eventually achieved wide acceptance in the language engineering field and is, so far, the only part of the TEI to have been formally adopted as an ISO standard,7 but its power and generality of application were not immediately appreciated by the computational linguist who just wanted to build a simple parser. To meet such needs, the committee also proposed a range of generic segmentation and synchronization mechanisms.

17 Inevitably, the encoding needs and mechanisms identified by these two committees frequently coincided in all but name. Despite the vigorous application of Ockham's razor, the TEI often found itself proposing a range of different ways of encoding linguistically motivated segmentation and of recording analytic judgments, with more or less internal structure, reflecting the diversity of the user community. This process continued further with the appointment, during the second funding cycle, of a number of specialist working groups, charged with testing the TEI proposals within specific research areas and identifying any gaps. As with the larger committees of the first phase, the members of these specialist working groups were nominated by experts in the field, and their meetings were funded on the understanding that they would produce detailed proposals for integration into the Guidelines, the first version of which—a modest 275-page volume known as P1—had been distributed in print form in 1990. In the event, some of these working groups were content to assess the applicability of that draft to their own area of expertise, or to make general comments about its usability, but in many cases this work resulted in significant expansion of what was already proposed, and in a few cases substantial, entirely new material was developed, extending the scope and coverage of the Guidelines considerably. One working group, for example, produced proposals for the encoding of transcribed speech (which had not featured at all in P1), while another produced a complex tagset for the markup of existing print dictionaries. Meanwhile, the TEI editors revised the draft chapter by chapter on the basis of the feedback received. Rather than delay publication until the whole work was complete, the new version (P2) was distributed in fascicle form during 1992, as each new chapter took shape.

18 The chairs of all the working groups and other significant contributors were invited to sign off on the editorial process at a four-day Technical Review Meeting, held at Eynsham Hall near Oxford in May of 1993. After further revisions consequent on that review, the first complete version of the TEI Guidelines, known as P3, appeared at the international SGML conference held in Montreux in May 1994. TEI P3 took the form of two substantial green volumes, making up a 1300-page reference manual documenting and defining some 600 SGML elements which could be combined and modified in a variety of ways to create specific SGML document type definitions (DTDs) for particular purposes. In 1994 the first edition of TEI Lite also appeared.
This was a simplified subset of the whole scheme, which was originally written to accompany the first of many
training workshops held in Chicago in December of the same year. This Chicago "metaworkshop" was conceived as a training session for trainers, and did much to establish how the TEI is generally introduced to a novice audience.

19 In the five years that followed, the proposals of the TEI established themselves, without benefit of grant funding or centralized management, as a necessary part of the intellectual infrastructure. In the US, several influential university libraries began to deliver online versions of set texts and to host the textual databases needed for research. Training in using the TEI for basic encoding appeared as a module on specialist library courses; research projects in their grant submissions would routinely add the phrase "we will follow the TEI recommendations" (often followed by "but . . ."); a new generation of research assistants found it necessary to understand the difference between one TEI element and another, or why the Guidelines provided an element for one concept but none for another. A detailed history of the way in which this happened is hard to write, so rapidly did the TEI become a part of the humanities computing ecosystem at this period. To some extent we may speculate that there was no serious alternative for people who found inadequate the purely presentational focus and lack of formality of the HTML of the time, or that in a world where the technical infrastructure was rapidly changing, something so clearly grounded in traditional scholarship, and so clearly underpinned by theoretical rigor, offered at least the promise of a long-term return on the serious costs of investing in the digital future.

20 In his preface to a collection of working papers about the TEI published in 1995, Charles Goldfarb, inventor of the SGML standard, remarked with typical prescience, "The vaunted 'information superhighway' would hardly be worth travelling if the landscape were dominated by industrial parks, office buildings, and shopping malls. Thanks to the Text Encoding Initiative, there will be museums, libraries, theaters, and universities as well."8 The TEI Recommendations were endorsed by the US National Endowment for the Humanities, the UK's Arts and Humanities Research Board, the Modern Language Association, the European Union's Expert Advisory Group for Language Engineering Standards, and many other agencies that funded or promoted digital library and electronic text projects. They had become part of the intellectual infrastructure of the day.

21 For the purposes of this article, however, the key point to note is that by the end of the twentieth century, although everywhere cited as if it had some kind of formal existence, the TEI was no longer under the active care or management of anyone. Many of those most responsible for its original development had moved on to other challenges, most notably the principal editor Michael Sperberg-McQueen, who in 1996 had been appointed co-editor of the World Wide Web Consortium's emerging XML standard. At a conference organized to celebrate the TEI's tenth anniversary at Brown University in 1997, there was cake and an a cappella choir, but no consensus as to the organizational way forward. Did the TEI need any kind of formal management? Was anyone interested in maintaining it into the future, or was it a great idea to be fixed in stone, until consensus emerged that its time had now passed?

22 Ideas about an appropriate structure for continued maintenance and development of the TEI had been discussed within the original Steering Committee of the project over two or more years without practical result; the need for some kind of scientific council able to make decisions about the future technical development of the Guidelines had also been foreseen, and indeed a body not unlike the TEI's current Technical Council
had held two meetings. As a result of their work, a new and final version of P3 appeared online, with a small number of egregious errors corrected, and the addition of one new element. It was already clear, however, that a very much more extensive program of work was needed to bring P3 up to date. The difficulty lay in setting up a financial and organizational infrastructure within which that extensive program of work might be undertaken.

23 The task was eventually accomplished after what might be caricatured as a kind of "management buyout." At a meeting held at King's College London in January 1999, representatives of the three learned societies in whose name the TEI Guidelines had originally been published, and representatives of four key academic institutions at which the TEI was widely used and promoted, met and agreed to transfer ownership and management of the Guidelines to a new entity: a non-profit international membership organization called the TEI Consortium, the constitution and structure of which were laid out in a detailed proposal submitted to the meeting by representatives from the University of Virginia and the University of Bergen (see TEI 1999).

24 The goal of the TEI Consortium would be to establish a permanent home for the TEI as a democratically constituted, academically and economically independent, self-sustaining, nonprofit organization. It was clear that the TEI Guidelines had a major role to play in the application of new XML-based standards that were driving the development of text-processing software, search engines, web browsers, and indeed the Web in general. To re-express the TEI's SGML schema fragments as XML was not particularly difficult, but the TEI also needed to adapt to a technical environment completely transformed by the rise of the Web. The first act of the Consortium was to publish a new version of the Guidelines, known as TEI P4. This was a largely unmodified version of TEI P3, except for its re-expression using XML syntax rather than SGML. At the same time, the new Consortium declared its intention of embarking on a major revision of the entire Guidelines which would bring them up to date, extending and improving their coverage, and also radically transforming the way in which they would be maintained and developed in the future.

25 In a presentation given at the first annual meeting of the members of the new TEI Consortium, held in Pisa in November 2001, Michael Sperberg-McQueen (2001) itemized the things that in his view the "old" TEI had done right. These included some interesting technical aspects: for example, the TEI interchange format, a subset of SGML that corresponded very closely with what subsequently became XML; the TEI extended pointer syntax, which approximated to what became XPointer and XPath; and the fact that DTDs are not written but generated, using an additional layer of abstraction which integrates the formal and the informal (documentary) aspects of the architecture—the ODD system. But Sperberg-McQueen also underlined the importance of the work invested "to earn community buy-in," in his phrase, thus re-asserting the essentially community-driven aspects of the project.

26 At the same meeting, Lou Burnard and Syd Bauman listed a dozen or more possible areas in which the Guidelines needed enhancement.
These included topics evidently in need of update because the world had moved on, such as character encoding following the definition of Unicode, and also areas where other research communities had already been active, independently of (but perhaps inspired by) the TEI, such as the detailed proposals for the encoding of manuscript descriptions arising out of the European MASTER project. But for the most part they presented an ambitious shopping
list of items in which they felt work was needed, with no promise that the work would actually be done without support from the user community.

27 Beyond simply adding yet more elements and more recommendations, it was clear that the TEI's internal architecture needed substantial revision. The move to XML meant that many new technologies were now available, and many assumptions about document processing which were entirely valid in 1994 needed to be rethought for the new digital world, of which the TEI wished to become a good citizen. In his report to the members in 2005, Christian Wittern, as chair of the Council, likened the process to the restoration of a venerable piece of architecture: "When I grew up in the small southwest German town of Tübingen, conservation and re-creation of the medieval town centre was a major project there. This resulted in many a half-timbered house being completely redone from within without actually taking down the structure, but resulting in a totally new interior . . . . Something similar is happening at the moment with P5 . . ."9

28 During the TEI's first two development phases, working groups had communicated extensively by e-mail, circulating and discussing drafts at a leisurely pace punctuated by expensive if productive face-to-face meetings funded by the TEI. The working environment for this new phase was rather different: for most working groups, and most notably the TEI Council itself, telephone conferencing became the norm, with only very occasional face-to-face meetings funded by the TEI. Nevertheless, when the first version of TEI P5 was finally published in 2007, much of its new intellectual content was the result of working groups10 independent of the TEI Consortium. The new chapters on manuscript description, on the description of named entities, on the documentation of characters and glyphs outside Unicode, and on the evolution of the ODD system were all created in the same way: a group of enthusiastic and largely self-selected experts worked up and tested proposals, which were then consolidated within the new Guidelines for ratification by the TEI Council. For the most part, the work of such groups was funded by specific grants, by institutional benevolence, or by individuals working on their own time. As of TEI P5, the TEI consciously transformed itself into a classic open-source project, not only making all of its inner workings openly accessible to the public at large (or at least those parts of the public unintimidated by the Sourceforge user interface) but also becoming reliant on community input both to define and to execute all of its future development.

29 For example, at release 1.2.0 of the TEI P5 Guidelines, a substantial set of proposals was introduced to support encoders wishing to represent the visual appearance or physical makeup of source documents, to align portions of a digitized page image with exact transcriptions of the text on it, to assign parts of a transcription to documented stages in the evolution of a text, and to provide a number of other facilities typifying the discipline of documentary editing, or édition génétique. These extensions did not come about because the TEI Consortium identified a need for them, applied for a grant, chartered a working group, and managed the process.
On the contrary, they came into being because a number of existing TEI users, dissatisfied with the current state of the TEI, identified a need, obtained funding for a wide-ranging consultative exercise, formulated proposals, and then worked with the TEI Council to ensure their eventual inclusion into the Guidelines. This is not an isolated example, but typifies the way in which the TEI has now become better able to evolve in response to the changing priorities of the research community.
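As a purely illustrative sketch of the kind of encoding these proposals make possible (the image file name and coordinate values are invented), a page of a source document can be described as a set of zones on a page image, each carrying its own exact transcription:

  <sourceDoc>
    <surface xml:id="page1">
      <!-- hypothetical digital image of the source page -->
      <graphic url="page1.jpg"/>
      <!-- a rectangular region of the image, with the text it contains -->
      <zone ulx="10" uly="20" lrx="410" lry="60">
        <line>first line of the page, transcribed exactly as written</line>
      </zone>
    </surface>
  </sourceDoc>

Elements such as <sourceDoc>, <surface>, <zone>, and <line> belong to the documentary-editing provisions referred to here; a real project would typically add many more zones and link them to documented stages of revision.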


30 That plasticity has also been facilitated by a number of technical developments, in particular in the evolution of the current TEI processing architecture. Even in its original SGML manifestation, the TEI was conceived as a modular system, which could be customized to suit the needs of particular research communities, enabling them to define, for example, a schema containing all and only the elements of importance to them, to add new elements, or to modify existing ones. However, to carry out such modifications successfully required a degree of technical knowledge significantly beyond that of the average digital humanist, even with the availability of a web interface (the so-called Pizza Chef) designed to simplify the task. With the shift to XML, the creation of such tools became both simpler and more essential. With the advent of TEI P5 in particular, users of the TEI are increasingly expected to want to engage with the full sophistication of the TEI architecture rather than to rely on precompiled one-size-fits-all subsets such as TEI Lite. This in turn has led to the development of more sophisticated schema generation tools such as Roma, which greatly reduce the complexity of developing a customization (a minimal sketch of such a customization is given at the end of this section).

31 The TEI Guidelines are, of course, written and maintained as a large TEI-conformant document, from which schemas in various formal languages (RELAX NG, DTD, and W3C Schema) and documentation in various formats, ranging currently from PDF to EPUB, are all automatically generated using a suite of XSLT stylesheets developed and maintained by Sebastian Rahtz. Although developed primarily for the TEI's own internal needs, this suite of XSLT stylesheets is sufficiently generic in design that it has come to be regarded as an integral part of the TEI—even though the TEI was originally designed as an abstraction independent of any particular processing model. The free availability of this suite of stylesheets, and its necessary maintenance in tandem with the Guidelines themselves, has greatly facilitated the development of TEI-aware systems by a new generation of web developers and programmers. To take two particular examples: oXygen, a popular commercial XML development tool, uses them to offer on-the-fly visualization of any TEI document; and the OxGarage web application packages them to provide basic conversions (in both directions) between many different formats, including several popular word-processing formats and TEI XML.

32 Each new version of TEI P5 (since 2007, there has been a new release of P5 twice a year) is the consequence of an ongoing revision process, in which outstanding errors or feature requests submitted by the general public are considered and acted upon by the Council. As with many other open-source projects, the number of different people active at any one time is relatively small, but the essential point is that the TEI itself is not the only agency managing the review or proposing extensions. A TEI orthodoxy undoubtedly exists, but it does not control the evolution of the Guidelines, and is constantly under critical review by the user community, via mechanisms such as the TEI discussion list, the Sourceforge Trackers, and debate on the TEI Council list, all of which are publicly accessible. A quick glance at the Sourceforge site shows that this activity is increasing: between 2005 and 2010, visits to the site averaged a million page impressions per year, while between 2010 and 2013 the figure is nearer four million a year.
Looking at feature requests and bug reports alone, in 2011 the Council reviewed a total of 170 tickets; in 2012 the number increased to 245.11 The annual meeting of the TEI membership, initially a legal obligation for the TEI Consortium, has in recent years been transformed into an open TEI Conference, showcasing current debate about the use of the TEI within the digital humanities research community, much of which
eventually finds its way to a larger audience via journals such as the Journal of the Text Encoding Initiative.
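The customization mechanism described in paragraph 30 can be made concrete with a minimal, purely illustrative sketch of the kind of ODD specification that a tool such as Roma produces; the schema identifier here is invented, while the module names are the standard TEI ones.

  <schemaSpec ident="tei_minimal" start="TEI">
    <!-- select only the modules this hypothetical project needs -->
    <moduleRef key="tei"/>
    <moduleRef key="header"/>
    <moduleRef key="core"/>
    <moduleRef key="textstructure"/>
  </schemaSpec>

From a specification of this kind, the stylesheets mentioned in paragraph 31 can generate RELAX NG, DTD, or W3C Schema output, together with project-specific documentation.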

3. TEI Today

33 There are, it has been noted, two kinds of unsuccessful standardization effort. One kind fails because it is based on a theory which is not yet sufficiently mature for widespread use. The other kind fails because it addresses a community which has not yet achieved consensus and within which any proposal therefore seems alien to at least some of its intended beneficiaries. Today's TEI is engineered to avoid both kinds of problem. An immature theory—for example a new element definition—can be incorporated into the TEI model without perturbing the rest: that is one of many advantages inherent in the TEI architecture. Hence, it is easy for ideas developed for the benefit of one project to be made visible to a wider community, tested, evaluated, and, if appropriate, eventually included in the standard. The notion of TEI conformance now forming part of the TEI incorporates precisely this model: conformance is defined as expressing clearly which parts of a model are to be understood in the TEI sense, and which parts are not. This does not preclude the addition of new features or the modification of old ones (mutations in the evolutionary sense); it requires only that they be clearly identified as such.
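To illustrate the point about conformance, here is a purely illustrative sketch of how a non-TEI element might be added within a customization such as the one shown at the end of the previous section: the element name and namespace URI are invented, and placing the element in its own namespace is what identifies it clearly as an addition rather than as part of the TEI scheme.

  <elementSpec ident="sigil" ns="http://example.org/ns/myproject" mode="add">
    <classes>
      <!-- membership in a TEI model class lets the new element appear
           wherever comparable phrase-level elements are allowed -->
      <memberOf key="model.pPart.data"/>
    </classes>
    <content>
      <!-- the new element simply contains text -->
      <textNode/>
    </content>
  </elementSpec>

Documents using such an element remain identifiable as TEI in the sense described above: everything in the TEI namespace keeps its TEI meaning, while the addition is unambiguously flagged as local.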

34 As to the problems of diversity within the target audience, the TEI has always sought to be a truly international enterprise. In its first phase, this desire was manifested by the careful geographical balance of working group membership; more recently it has been shown in the support for multilinguality built into the Guidelines themselves. More significantly perhaps, the Guidelines today are hospitable to other standards. Where an existing XML vocabulary such as SVG or MathML exists, the TEI schema generally prefers to incorporate it (within its own namespace) rather than to reinvent it. And, at the conceptual level, there is scope for reexpressing the semantics of any TEI element set using other conceptual formalisms (for example, the CIDOC Conceptual Reference Model has been used to provide a mapping for almost all TEI elements).

35 For longevity, however, a standards initiative must also address the need for financial and administrative support. The TEI has increasingly adopted an open-source–style business model, in which its users are the key agents providing that support. The frontiers of the TEI user community are, however, both ill-defined and large. Within the community, to date, a sufficiently large number of institutions have proved willing to provide small amounts of regular funding to keep the infrastructure itself in existence. Opinions differ notably on either side of the Atlantic as to whether that regular funding should be provided by individuals or by collectivities, and consequently how its continuation should best be ensured or its functions promoted; ensuring that continuity is the function of the TEI's Board of Directors, who are elected by the members of the Consortium, with responsibility to them and also to the (currently non-voting) subscribers.

36 Fortunately, at least as far as concerns the maintenance and development of its core technical deliverables, the TEI organizational model is increasingly independent of funding. Final editorial decisions as to what does or does not change in the Guidelines are taken by an elected body of experts, the TEI Council, but this is carried out in response to reports of bugs and suggestions for change which may come from
individuals, from special interest groups, or from working groups specially chartered by the TEI or others, and which may have members even beyond the TEI user community. Although there are individuals who sit on both the TEI Board and the TEI Council to ensure good communication between the two, it is clear that the TEI needs both administrative and technical expertise, and that there is an advantage in providing distinct forums for the two.

37 Arguably, with this change the TEI ceased being an academic research project and became something different. To talk of it as a "community-based research project" (to use the officially sanctioned phrase) conveys something of that difference but also obscures a part of it. Research projects come to an end: they have a goal, which may or may not be achieved, and a fixed term. The TEI, however, has transformed itself into an open-ended activity, which will continue to be usable for as long as it is felt to be useful. Its strengths are enhanced by its community of users; its weaknesses can be corrected by them. In a strictly Darwinian sense, the TEI has evolved by fostering and profiting from mutations that are considered beneficial to the user community, while ignoring those which are not. Because it is felt to be useful to so many, and because its products underpin so much of current research activity, it seems not unreasonable to regard it as constituting in itself a research infrastructure. If nothing else, its organizational model seems highly appropriate for all such infrastructural institutions or initiatives.

BIBLIOGRAPHY

Burnard, Lou D. 1988. "Report on Workshop on Text Encoding Guidelines." Literary and Linguistic Computing 3(2): 131–33. doi: 10.1093/llc/3.2.131.

Coombs, James H., Allen H. Renear, and Steven J. DeRose. 1987. "Markup Systems and The Future of Scholarly Text Processing." Communications of the ACM 30(11): 933–47. doi:10.1145/32206.32209.

DeRose, Steven J., David G. Durand, Elli Mylonas, and Allen H. Renear. 1990. "What is Text, Really?" Journal of Computing in Higher Education 1(2): 3–26. doi:10.1007/BF02941632.

Genet, Jean-Philippe, ed. 1988. Standardisation et échange des bases de données historiques: actes de la troisième table ronde internationale, Paris, 15–16 mai 1987. Paris: Editions du CNRS.

Ide, Nancy, and Jean Véronis, eds. 1995. Text Encoding Initiative: Background and Context. Dordrecht: Kluwer.

Langendoen, Terry. 1990. Plan of Action for Remaining Work of A and I Committee. http://www.tei-c.org/Vault/AI/air03.gml.

Langendoen, D. Terence, and Gary F. Simons. 1995. "Rationale for the TEI Recommendations for Feature-Structure Markup." In Ide and Véronis 1995, 191–210.

Sperberg-McQueen, C. M. 2001. "The TEI is Dead: Long Live the TEI." Paper presented at the TEI Members Meeting, Pisa, Italy, November 2001. http://www.tei-c.org/Membership/Meetings/2001/tei1116alt.html.


TEI (Text Encoding Initiative). 1988. Design Principles for Text Encoding Guidelines. TEI ED P1. Last revised January 9, 1990. http://www.tei-c.org/Vault/ED/edp01.htm.

———. 1999. An Agreement to Establish a Consortium for the Maintenance of the Text Encoding Initiative. http://www.tei-c.org/About/consortium.html.

———. 2005. TCR03: Report of the TEI Council to the Members Meeting. http://www.tei-c.org/Activities/Council/Reports/tcr03.xml.

Thaller, Manfred. 1986. "A Draft Proposal for a Standard for the Coding of Machine Readable Sources." Historical Social Research/Historische Sozialforschung 40: 3–46; repr. in Modelling Historical Data, ed. Daniel Greenstein, Halbgraue Reihe zur historischen Fachinformatik A11 (St. Katharinen: Max-Planck-Institut für Geschichte in Kommission bei Scripta Mercaturae Verlag, 1991), 19–64.

Wittern, Christian, Arianna Ciula, and Conal Tuohy. 2009. "The making of TEI P5." Literary and Linguistic Computing 24(3): 281–96. doi:10.1093/llc/fqp017.

NOTES

1. See Burnard 1988 for a contemporary report on the event.
2. See TEI 1988.
3. Many resources are available providing information about SGML (Standard Generalized Markup Language) and its successor language XML (Extensible Markup Language); Wikipedia is as good a starting point as any.
4. For example DeRose et al. 1990.
5. See Coombs, Renear, and DeRose 1987.
6. A good introduction to the TEI's use of this formalism is provided by Langendoen and Simons (1995).
7. The two chapters of P5 which define feature structure representation and feature system definition are copublished by ISO as ISO 24610-1:2006 and ISO 24610-2:2011 respectively.
8. Ide and Véronis 1995.
9. See TEI 2005.
10. See Wittern, Ciula, and Tuohy 2009 for a summary of the work undertaken.
11. These and other statistics on Sourceforge usage are available at https://sourceforge.net/projects/tei/stats/.

ABSTRACTS

It is twenty-five years since the Text Encoding Initiative was first launched as a research project following an international conference funded by the US National Endowment for the Humanities. This article describes some key stages in its subsequent evolution from research project into research infrastructure. The TEI's changing nature, we suggest, is partly a consequence of its close and highly responsive relation with an active user community, which may also explain both its longevity and its effectiveness as a part of the digital humanities research infrastructure.


INDEX

Keywords: humanities computing, history, infrastructures, text encoding

AUTHOR

LOU BURNARD

Lou Burnard has been closely associated with the TEI since its inception, initially as European editor, and more recently as an elected member of the TEI Board. He also plays an active part on the TEI Council. Since retirement from Oxford University Computing Services, where he was responsible for the development of the Oxford Text Archive, the British National Corpus, and several other key projects in the Digital Humanities, he has been working as a private consultant, most recently with the French infrastructural organization TGE Adonis.
