Journal of the Text Encoding Initiative

Issue 1 | June 2011: Selected Papers from the 2008 and 2009 TEI Conferences

Kevin Hawkins, Malte Rehbein and Syd Bauman (dir.)

Electronic version
URL: http://journals.openedition.org/jtei/125
DOI: 10.4000/jtei.125
ISSN: 2162-5603

Publisher: TEI Consortium

Electronic reference
Kevin Hawkins, Malte Rehbein and Syd Bauman (dir.), Journal of the Text Encoding Initiative, Issue 1 | June 2011, "Selected Papers from the 2008 and 2009 TEI Conferences" [Online], online since 01 June 2011, connection on 22 May 2020. URL: http://journals.openedition.org/jtei/125; DOI: https://doi.org/10.4000/jtei.125


TEI Consortium 2011 (Creative Commons Attribution-NoDerivs 3.0 Unported License)

TABLE OF CONTENTS

Editorial Introduction to the First Issue
Susan Schreibman

Guest Editors’ Note
Malte Rehbein and Kevin Hawkins

Computational Work with Very Large Text Collections: Interoperability, Sustainability, and the TEI
John Unsworth

Knowledge Representation and Digital Scholarly Editions in Theory and Practice
Tanya Clement

A TEI-based Approach to Standardising Spoken Language Transcription
Thomas Schmidt

‘The Apex of Hipster XML GeekDOM’: TEI-encoded Dylan and Understanding the Scope of an Evolving Community of Practice
Lynne Siemens, Ray Siemens, Hefeng (Eddie) Wen, Cara Leitch, Dot Porter, Liam Sherriff, Karin Armstrong and Melanie Chernyk


Editorial Introduction to the First Issue

Susan Schreibman

1 On behalf of the Board of the Text Encoding Initiative and my co-editors, Markus Flatscher and Kevin Hawkins, I am delighted to announce the publication of the inaugural issue of the Journal of the Text Encoding Initiative.

2 This Journal has been nearly three years in the making. It was a natural outgrowth of the expansion of the yearly members’ meeting into a conference format (beginning in 2007 with the University of Maryland meeting) that regularly attracts over 100 participants. The TEI Board felt that a dedicated journal would be the ideal vehicle to build on the success of the conference as well as to capture the diverse scholarly interests of an ever more vibrant TEI user community. The following year, at the London meeting, a committee (consisting of myself, Gabriel Bodard, Lou Burnard, Julianne Nyhan, and Laurent Romary) was formed to explore the best way to achieve this goal. At the 2009 meeting in Ann Arbor, a full proposal was presented to the Board. It was adopted unanimously. Thus the Journal of the Text Encoding Initiative was born.

3 The committee decided that the journal should be published as an open-access online journal with TEI as the underlying data format. Moreover, the committee felt that we should, if at all possible, avoid developing a new publication system. After investigating several platforms, Revues.org, with its TEI-native publishing platform, was recommended to host the journal. It was also decided that we would endeavour to publish two issues a year: the autumn issue consisting of a selection of articles arising from the previous conference and the spring issue focusing on a topic of relevance to the TEI community.

4 This inaugural issue consists of papers given at the London and Ann Arbor conferences. An introduction by two of the guest editors, Malte Rehbein and Kevin Hawkins, demonstrates just how wide-ranging and diverse the interests of the community have become; these articles truly represent the broad tent that is TEI scholarship today.

5 This inaugural issue owes much to many people. Thanks are due first to the TEI Board, where the idea originated, for supporting it so wholeheartedly, and particularly to Dan O’Donnell, previous chair of the TEI Consortium, for his unfailing support. Thanks are also due to the committee that drew up the parameters of the Journal as well as to Revues.org for agreeing to host the Journal and for providing much aid-in-kind in its production. The editors of the Journal of the Text Encoding Initiative have benefitted immensely from the French team’s experience and expertise as well as their enthusiasm for the TEI.

6 Most of all, thanks are due to my co-editors, Markus Flatscher (Technical Editor) and Kevin Hawkins (Managing Editor), who have so generously given their time and expertise to make this project happen. No detail, large or small, has been beyond their notice. Their professionalism, attention to detail, and good humour have made seeing this issue into “print” a real pleasure. I also thank the guest editors of this first issue, who admirably and with good humour suffered through our teething process as we put our workflows in place while at the same time going into production.

7 At the 2008 London meeting the TEI celebrated its 21st birthday (a traditional rite of passage in the UK and Ireland). Another rite of passage for the TEI community is this Journal, marking one of the many milestones in the TEI becoming not only “a mature organization,” as Council Chair Laurent Romary would say, but a flourishing academic and intellectual community.

8 Susan Schreibman
Editor-in-Chief
Trinity College Dublin

AUTHOR

SUSAN SCHREIBMAN
Trinity College Dublin


Guest Editors’ Note

Malte Rehbein and Kevin Hawkins

1 With this inaugural issue of the Journal of the Text Encoding Initiative, we are happy to present selected papers from the TEI Conference and Members’ Meetings held in 2008 and 2009.

2 In 2007, the TEI Consortium expanded its members’ meetings to a full conference format. The 2008 and 2009 conferences each drew approximately 100 participants from around the world, and the great variety of topics presented and discussed reflected the broad range of the TEI community. While a single issue of a scholarly journal can document only a selection of the papers, posters, and demonstrations from these conferences, we believe that this selection illustrates the broad community that the TEI now represents.

3 In his contribution “Computational Work with Very Large Text Collections: Interoperability, Sustainability, and the TEI,” John Unsworth, one of the keynote speakers for the 2009 conference, directly addresses that year’s theme: text encoding in the era of mass digitization. He discusses how the “I” of “TEI” stands for both “Initiative” and “Interchange” yet argues that we need to move towards “Interoperability” as well. Analyzing large-scale digitization enterprises, Unsworth sums up with a plea for greater engagement of the TEI in the development of the semantic web.

4 While Unsworth is interested in a role for TEI in large text collections, Tanya Clement’s article approaches the TEI from the opposite perspective: that of a scholarly edition of a small-scale corpus of texts. In “Knowledge Representation and Digital Scholarly Editions in Theory and Practice,” she discusses the textual features and variations of a modern manuscript using selected poems by the Baroness Elsa von Freytag-Loringhoven, a German-born Dadaist artist and poet, as a case study. Clement’s article argues for new frameworks of knowledge representation and scholarly editing to theorize the way TEI encoding and the Guidelines are used.

5 Thomas Schmidt’s article can be seen as a bridge between Unsworth’s and Clement’s approaches. “A TEI-based Approach to Standardising Spoken Language Transcription” discusses both interoperability and scholarly practice in using the TEI Guidelines to formulate a standard for the transcription of spoken language corpora. Schmidt’s “route to standardization” combines conformance to existing principles and conventions on the one hand with interoperable encoding based on the Guidelines on the other. Schmidt also adds a third dimension to his argument and sums up with a discussion of tool development and transformation workflows.

6 One might argue that for such an endeavor to succeed, a community needs to agree on shared standards, approaches, and tools to facilitate interoperability. “‘The Apex of Hipster XML GeekDOM’: TEI-Encoded Dylan and Understanding the Scope of an Evolving Community of Practice,” co-authored by Lynne Siemens, Ray Siemens, and Hefeng Wen, describes the Text Encoding Initiative as it is at its core: a community of practice. The viral marketing experiment described in their article not only gives insight into the diversity of TEI practitioners and practice, but also illustrates the TEI’s engagement and potential engagement with new communities.

7 Enjoy!

AUTHORS

MALTE REHBEIN
Julius-Maximilians-Universität Würzburg, Germany

KEVIN HAWKINS
University of Michigan, Ann Arbor


Computational Work with Very Large Text Collections: Interoperability, Sustainability, and the TEI

John Unsworth

1 The “I” in TEI sometimes stands for interchange, but it never stands for interoperability. Interchange is the activity of reciprocating or exchanging, especially with respect to information (according to WordNet), or, if you prefer the Oxford English Dictionary, it is “the act of exchanging reciprocally; giving and receiving with reciprocity.” It’s an old word, its existence attested as early as 1548. Interoperability is a much newer word with what appears to be military provenance, dating back only to 1969, meaning “able to operate in conjunction.” The difference is worth dwelling on for a moment since it’s important to the discussion here: for the interchange of encoded text you need an agreed-upon interchange format to which and from which various encoding schemes are capable of translating their normal output. Interoperability, on the other hand, implies that you can take the normal output from one system and run it, as is, in a different system—or, to put it another way, the difference between an interchange format and an interoperable format would be that various systems actually operate directly on the interoperable format, while an interchange format is just a way-station between two other formats, each of which is required by different target systems. Even if there’s a single interoperable format, then, it has to be a common or baseline representation that is technically valid and intellectually acceptable in multiple systems. The conditions for interoperability would be some combination of flexibility and shared purpose in the systems, strictness in encoding, and consistency in practice. The TEI has a role to perform at each position in this combination, but it hasn’t always embraced these roles with respect to interoperability.

2 In the P4 Guidelines, the word “interoperability” only appears twice, once in Volume 1 of the print edition in connection with , and once in Volume 2, in connection with Z39.50 (Bath Profile). On the other hand, interchange has been a core goal of the TEI from the earliest meetings at Vassar College in 1988, where the effort to produce the TEI Guidelines began. The first principle emerging from those meetings is that

1. The guidelines are intended to provide a standard format for data interchange in humanities research. (TEI 1988)

3 In fact, TEI is an acronym with two possible expansions: it can stand for the “Text Encoding Initiative,” when it refers to the activity of producing and maintaining the Guidelines, but in the title of those Guidelines, it stands for “Text Encoding and Interchange.” Interchange is the subject of an entire chapter in the TEI Guidelines, as well—Chapter 30 (P4), “Rules for Interchange,” the headnote to which says:

This chapter discusses issues related almost exclusively to the use of SGML-encoded TEI documents in interchange. XML-encoded TEI documents may be safely interchanged without formality over current networks, largely without concern for any of the issues discussed here. This chapter has not therefore been revised, and will probably be withdrawn or substantially modified at the next release. (p. 647)

4 This would seem to indicate that, at least in the universe of TEI, XML has solved the problem of interchange. One significant way in which it has done so is to require Unicode for character representation. In the pre-Unicode era in which Chapter 30 was first written, character encoding was the major concern in the area of interchange especially when the interchange might take place over a network:

Current network standards allow—indeed, require—gateway nodes to translate material passing through the gateway from one coded character set into another, when the networks joined by the gateway use different coded character sets. Since there is no universally satisfactory translation among all coded character sets in common use, the transmission character set will normally be the subset which is satisfactorily translated by the gateways encountered in transit between the sender and the receiver of the data. (p. 647)

5 TEI tackled this level of the problem by developing writing system declarations and entity references—strategies later adopted by HTML.

6 Beyond the character-encoding level of the problem, interchange advice in TEI P4 and earlier consisted mostly of recommendations to expand minimized tags and supply omitted tags. Since tag minimization and tag omission are not allowed in XML, and since Unicode is required, this chapter’s advice on encoding and formatting of marked-up documents is now unnecessary. And by the same token, these features of XML take us (in theory) a step closer to being able to achieve some functional level of interoperability across text collections, at least for particular well-defined purposes. If this is true, it will be important when one wants to work at scale with documents produced by different projects, publishers, or . However, those who have tried to move from interchange to interoperability have quickly discovered that it’s an extremely difficult step to take successfully.

7 In a part of the MONK project (http://www.monkproject.org) called Abbot, we did take this step successfully, and we learned some things in the process. First and foremost, we learned that even within a single project, there may be significant deviations from the norms of tagging and transcription established for that project: this ranges from apparently unmotivated variations in the application of attribute values to apparently random behavior in transcribing and encoding documentary features like line-end hyphens. For the fullest discussion of the challenges met and overcome by Abbot, see Brian L. Pytlik Zillig’s essay “TEI Analytics: Converting Documents into a TEI Format for Cross-Collection Text Analysis” in Literary and Linguistic Computing (2009). TEI-A (for “TEI Analytics”) is a TEI customization developed for the MONK project,1 and it is deliberately strict and stripped down. TEI-A is related to TEI Tite (Trolard 2009), a customization developed for use with keyboarding vendors. Both are intended to allow minimal variation and require minimal interpretation. As Brian notes in his LLC essay:

If one were setting out to create a new literary text corpus for the purpose of undertaking text analysis work, the most sensible approach might be to begin with one of TEI’s pre-fabricated tagsets (TEI Corpus, perhaps). In the case of the MONK project, however, we are beginning with collections that have already been tagged using diverse versions of TEI with local extensions. TEI-A is therefore designed to exploit common denominators in these texts while at the same time adding new markup for data structures useful in common analytical tasks (e.g. part-of-speech tags, lemmatizations, word tokens, and sentence markers). The goal is to create a P5-compliant format that is designed not for rendering but for analytical operations such as data mining, principal component analysis, word frequency study, and n-gram analysis. (188-189)

8 Brian goes on to talk about the “schema harvesting” technique that is embodied in Abbot, consisting of a meta-stylesheet which is used to analyze the input text and identify TEI-A elements that are either similar or identical to the elements in the input text; the result of this analysis is a second stylesheet, automatically generated by the first, that contains XSL templates for converting the input documents into TEI-A format. Files that fail validation after running through this second stylesheet are set aside for further (human) analysis, after which stylesheet logic might be extended and the process re-run or (in rare cases) files might be edited by hand. Brian writes:

All processes are initiated by the Abbot program in the following sequence:
1. Use the MonkMetaStylesheet.xsl stylesheet to read the TEI-A schema
2. Generate the XMLtoMonkXML.xsl stylesheet, as a result of the prior task
3. Convert the input collection to TEI-A
4. Parse the converted files against the MONK schema and log any errors
5. Move invalid files to a quarantine folder

These steps are expressed in a sequence of Unix shell scripts, and all source files are retained in the processing sequence so that the process can be tuned, adjusted, and re-run as needed without data loss. (191)
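The sequence is straightforward to restate in outline. The sketch below re-expresses those five steps in Python with lxml, purely for illustration: the original Abbot is a set of Unix shell scripts, the directory layout and the assumption that the MONK schema is RELAX NG are hypothetical, and only MonkMetaStylesheet.xsl and XMLtoMonkXML.xsl are names taken from the passage above.

    import shutil
    from pathlib import Path

    from lxml import etree

    def run_abbot(schema_file, input_dir, out_dir, quarantine_dir):
        # Steps 1-2: the meta-stylesheet reads the TEI-A schema and generates
        # the conversion stylesheet (XMLtoMonkXML.xsl in Abbot proper).
        meta = etree.XSLT(etree.parse("MonkMetaStylesheet.xsl"))
        convert = etree.XSLT(meta(etree.parse(schema_file)))

        # Step 4's validator: the MONK schema, assumed here to be RELAX NG.
        validate = etree.RelaxNG(etree.parse(schema_file))

        for source in sorted(Path(input_dir).glob("*.xml")):
            # Step 3: convert one input document to TEI-A.
            result = convert(etree.parse(str(source)))
            target = Path(out_dir) / source.name
            target.write_bytes(etree.tostring(result, encoding="utf-8"))

            # Steps 4-5: log errors and quarantine invalid files for human
            # analysis; sources are kept so the run can be tuned and re-run.
            if not validate(result):
                print("invalid:", source.name, validate.error_log.last_error)
                shutil.move(str(target), str(Path(quarantine_dir) / source.name))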

9 Getting the world to adopt TEI-A probably isn’t the answer to interoperability problems, though. As general as it is, TEI-A has a purpose in mind other than interoperability, namely analysis. A better choice might be TEI Tite, which has its purpose comfortably behind it, as soon as its texts come into existence. But it would be easy to get from one to the other. TEI Tite was developed (by Perry Trolard) as a sort of union-set of encoding practices in large libraries (Michigan, Virginia, Indiana) that contract out for substantial amounts of text-encoding. It focuses on non-controversial structural aspects of the text, and on establishing a high-quality transcription of that text.

10 Abbot, for its part, seeks to deduce similarities in the encoding practices of those entities that contributed text to the MONK project, namely ProQuest’s Early English Books Online and Eighteenth-Century Collections Online, the University of North Carolina at Chapel Hill Libraries’ Documenting the American South, the Indiana University Digital Library Program’s Wright American Fiction, ProQuest’s Nineteenth-Century Fiction, the University of Virginia Library’s Early American Fiction, and Martin Mueller’s Shakespeare texts. The input formats here varied quite a bit, but they included both SGML and XML with both entity references and Unicode for character encoding. As Brian notes:


Local text collections vary not because maintainers are unaware of the importance of standards or interoperability but because particular local circumstances sometimes demand customization. The nature of the texts themselves may necessitate a custom solution, or something about the storage, delivery, or requirements for display may favor particular tags or particular structures. Local environments also require particular conventions (even within the TEI header). (188)

11 Or as I put it, in a talk at the NEH back in 2007:

Once you start to aggregate these resources and combine them in a new context and for a new purpose, you find out, in practical terms, what it means to say that their creators really only envisioned them being processed in their original context—for example, the texts don't carry within themselves a public URL, or any form of public identifier that would allow me to return a user to the public version of that text. They often don't have a proper Doctype declaration that would identify the DTD or schema according to which they are marked up, and if they do, it usually doesn't point to a publicly accessible version of that DTD or schema. Things like entity references may be unresolvable, given only the text and not the system in which it is usually processed. The list goes on: in short, it's as though the data has suddenly found itself in Union Station in its pajamas: it is not properly dressed for its new environment. So, there's some benefit to the library, and to the long-term survivability and usefulness of their collections, or publishers' collections, to have them used in new ways, in research. (Unsworth 2007)

12 In interchange scenarios, as long as you can get from schema A to schema B by some agreed-upon intermediate step, it doesn’t matter that the source texts from the two environments are incompatible in their markup. In an interoperability scenario like MONK, you are trying to bring texts from a number of different sources into a kind of lowest-common-denominator format that can then actually be used in processing.

13 In fact, though, in the MONK project the TEI-A format isn’t the last stop: it’s a stage in a process with more specific goals than interoperability. The TEI-A produced by Abbot is subsequently processed through Morphadorner,2 which tokenizes, marks sentence boundaries, extracts named entities, and provides trainable part-of-speech tagging. The result of that process is fed to another program, called Prior,3 which feeds the texts into a MySQL database—the final representation and the one that is queried for statistical information about the texts. However, we keep the TEI-A and TEI-A “morphadorned” states of the text as well, and in MONK we call on the former to provide a reading text for the user of the system at various points in the analysis process.4
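Schematically, then, the chain looks like the sketch below. Abbot, Morphadorner, and Prior are the real tools named above, but these Python stand-ins are hypothetical placeholders for programs that run outside Python; the point is only the order of the stages and which states are retained.

    def abbot_to_tei_a(source_xml: str) -> str:
        """Stand-in for Abbot: convert source markup to TEI-A."""
        return source_xml

    def morphadorn(tei_a: str) -> str:
        """Stand-in for Morphadorner: add tokens, sentence boundaries,
        named entities, and part-of-speech tags."""
        return tei_a

    def prior_load(adorned: str) -> None:
        """Stand-in for Prior: feed the texts into the MySQL datastore."""

    def monk_pipeline(source_xml: str) -> tuple[str, str]:
        tei_a = abbot_to_tei_a(source_xml)   # the interoperable plateau
        adorned = morphadorn(tei_a)          # analytically enriched TEI-A
        prior_load(adorned)                  # final, queryable representation
        # Both intermediate states are kept: MONK serves a reading text
        # from the TEI-A at various points in the analysis process.
        return tei_a, adorned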

14 I think, actually, that this is what interoperability looks like, or will look like in the future: it’s a state or a stage in the processing of data, and not necessarily (perhaps not often) the final state or stage. To attain it, you have to supervise the process, mindful of the need to produce an opportunity for interoperability. If libraries and scholarly projects that require the keyboarding or OCR of texts could use a common format (like TEI Tite) as the target of that stage of the process, and if that could be saved and made available for other purposes, it would allow other projects and processes to pick up those texts and either process them in that state or process them from a predictable source format into some more heavily tagged format that supports a more specific purpose. Interoperability, I’m suggesting, is a plateau and a publication, and it’s a matter of influencing the workflow for what you and others do so that it passes through that plateau and undertakes that publication. I’m not suggesting that TEI-A is necessarily the spec to use here—more likely, it would be something like TEI Tite, meant as a spec for vendors and now stipulated as the output format for TEI members who wish to take advantage of the AccessTEI member benefit (a discount on keyboarding services offered by Apex CoVantage).5 No doubt, in most cases this output will receive further processing for particular purposes and for the local environment, but if TEI members, libraries, and publishers using specifications similar to TEI Tite could learn to think about the Tite output as having a purpose of its own, namely interoperability, that would go a long way toward solving the kinds of problems that we encountered in MONK and that are certain to be encountered by anyone else who tries to make texts from different sources work (and play) together.

15 Interoperability is not just a matter of text format, though: it’s also very much a matter of license conditions. In the MONK project our final act was to present MONK to the public in two instances. The first instance6 is available to all users: it includes about 50 million words of American literary text from North Carolina, Indiana, and Virginia, plus the Shakespeare texts. The second instance7 is available only to users with login privileges at a CIC Institution:8 it provides access to a corpus of 150 million words that includes licensed material from ProQuest and Cengage. Login is negotiated through InCommon, which is an Internet2 implementation of the Shibboleth authentication protocol that has been set up at each CIC institution. All of those universities license the ProQuest materials, so permission for this re-presentation of their materials was not hard to get; however, only about half of them licensed the Cengage materials, so special permission was required from Cengage to allow them all uniform access to a single instance of MONK. Thankfully, that permission was provided; otherwise, it would have been a good deal more complicated to sort out who was allowed access to what.

16 This solution to the problem of heterogeneous access to licensed material is not scalable, obviously: there isn’t time for each new research project to negotiate access in the way that we did, and there’s no guarantee that other publishers would agree, as these did. In this connection, “scale” is represented by the Google Books project, which aims to digitize all printed books. As of October 2009, Google would admit to having scanned 10,000,000 books (Brin 2009), but Google estimates that there are about thirteen times that many books out there (Taycher 2010), so they’re far from done. The scalable solution might come out of the Google Books Settlement agreement, if a settlement is ever finalized.

17 The proposed agreement (Case No. 05 CV 8136-DC 2009), which has preliminary approval from the courts, calls for Google to set up two research centers in which public domain and copyrighted works would be available for computational research, on the condition that the use of copyrighted material is “non-consumptive” (Case No. 05 CV 8136-DC 2009, section 7.2.d). Non-consumptive research is defined in the settlement as:

…research in which computational analysis is performed on one or more Books, but not research in which a researcher reads or displays substantial portions of a Book to understand the intellectual content presented within the Book. Categories of Non-Consumptive Research include:
(a) Image Analysis and Text Extraction—Computational analysis of the Digitized image artifact to either improve the image (e.g., de-skewing) or extracting textual or structural information from the image (e.g., OCR).
(b) Textual Analysis and Information Extraction—Automated techniques designed to extract information to understand or develop relationships among or within Books or, more generally, in the body of literature contained within the Research Corpus. This category includes tasks such as concordance development, collocation extraction, citation extraction, automated classification, entity extraction, and natural language processing.
(c) Linguistic Analysis—Research that performs linguistic analysis over the Research Corpus to understand language, linguistic use, semantics and syntax as they evolve over time and across different genres or other classifications of Books.
(d) Automated Translation—Research on techniques for translating works from one language to another.
(e) Indexing and Search—Research on different techniques for indexing and search of textual content. (Case No. 05 CV 8136-DC 2009, section 1.93)

18 The uses defined in (b) and (c) would cover all of what we did in MONK, and everything I can envision as falling under the general heading of text-mining. However, the notion that you can, for example, do supervised learning in text-mining without reading or displaying substantial portions of the book or understanding its intellectual content is more than a little implausible, and the whole idea of non-consumptive research, should it survive, will need to be refined in light of actual research and research use-cases. In any case, the settlement has not been finalized and the judge under whom it was negotiated has been promoted to a higher bench, so the whole thing may start over, or the suit may be withdrawn.

19 Even if that happens, though, HathiTrust is considering proposals for a research center that would leverage their shared digital repository, which was set up by many of the libraries that participate in the Google Books project (Hagedorn, York, and Levine 2009). I am involved in a HathiTrust proposal, submitted jointly by Scott Poole at the University of Illinois and Beth Plale at Indiana University, that is under consideration by the HathiTrust Executive Committee as of this writing. At this time, the HathiTrust includes 7.1 million books, about 24% (or about 1.7 million) of which are in the public domain (HathiTrust 2010). By comparison, MONK included about 1500 titles, so even the public-domain content of the HathiTrust component of the Google Books collection is over 1,000 times the size of MONK. That counts as scale.

20 Working with only that portion of the potential research corpus, you could still seriously pursue the research goals spelled out in the HathiTrust RFP:

• aggregation/distillation – “raw texts or abstracts covering particular topics or types of materials are reduced to subsets or of interest that can be used by one or multiple researchers”
• development of tools for research – for “textual analysis, entity extraction, aggregation of data, and the representation and analysis of results”
• collaboration – the Center must offer the ability to share processes, results, and communication with individuals and groups in a secure manner
• miscellaneous additional needs and concerns of researchers, e.g.:
  ◦ “The ability to include additional data.”
  ◦ “The ability to have access to both raw and pre-processed texts” (HathiTrust 2010, 7–10)

21 The scale and complexity envisioned here will raise challenges for sustainability. One possible strategy for sustainability in this case would be to connect the maintenance of a research corpus, institutionally, to the maintenance of rights information. Another proposal in the Google Books Settlement that may survive even if the settlement agreement does not is the establishment of a non-profit clearinghouse for settling claims against money earned by the use of orphan works—those works that are in copyright, but for which a copyright holder cannot be located. A conservative estimate puts the number of orphan works in the Google Books collection at about 580,000 (Cairns 2009),9 but some estimate the number in the millions. If the rights clearinghouse and the research host site were connected, the activity of the first might contribute to the sustainability of the second. Even if that subsidy were prohibited or constrained (as it would be, under the proposed settlement), the two activities obviously need to be conducted with awareness of one another, so that it’s clear what rights conditions apply to what works. And even if there’s no cross-subsidy, a research center could support itself with a combination of budgeted funds in research proposals that use the resource, plus institutional support.

22 These are the bits of an emerging cyberinfrastructure for disciplines that work with text. Characteristically, they include standards, strategies, organizations (like scholarly societies), institutional structures (like libraries and perhaps publishers, as well as research and its funders), and commercial players (including at least software developers and publishers, in this case). These characteristic bits also include moments of production, transmission, storage, representation, and analysis. And because cyberinfrastructure is also a social structure, it is a process. The TEI has a leading role to play at several points in that process, including of course as a standard, but also as a standards organization that interacts with institutional structures and commercial players. TEI competes—whether it wants to or not—in intellectual and institutional ways with various other disciplines and institutional commitments.

23 In general, one area of competition is in the academic recognition of computational research into ontologies. As more and more material has been digitized, people have begun to work toward what Tim Berners-Lee and others call the “semantic web” (2001). The Semantic Web Conference is a high-profile academic event, but it is also a very large and fairly commercial event, and semantic web topics are discussed not only in AI and other CS contexts, but also as the foundation of business activities. Semantics, in this case, depends on ontology, and ontology is therefore “one of the pillars of the semantic web.”10 The Text Encoding Initiative has been doing the ontology of literary and linguistic texts since 1987. TEI has an Ontology SIG, in fact, that it should probably fund to represent TEI in semantic web contexts. TEI may have been here first, but it is coming from behind in terms of institutional recognition or functional centrality in semantic web contexts, possibly for the same reason that we seemed late to arrive at the Hypertext Ball when it was first thrown, by the World Wide Web. Neither the semantic web nor the web itself is a pure and well thought-out system, and they’re both over-commercialized already. But the TEI has a lot to offer both—and in fact, has offered it to the Web, the point of continuity being Michael Sperberg-McQueen, former North American Editor of the TEI, and his work for the World Wide Web Consortium on the XML standard.

24 We need to make a similarly important contribution, perhaps with more recognition, in the development of the semantic web, or at least in developing what is understood by that term. Doing this may help the TEI to track and participate in proposals for the research use of our expanding corpus of digital cultural heritage material in the form of text. By participating, we can assert the needs and the ontological views of a diverse humanities user community, and we can do that with more historical perspective and more authority than any other organization I can think of. If the TEI were to participate in such proposals, we could help to ensure that the emergent research environment is TEI-friendly, something that will serve the interests of the humanities research community. Through this participation in research proposals and in the research center, we can also contribute to the sustainability and the interoperability of a research corpus. And if TEI is part of doing that, the TEI will also be sustainable, and participation in the TEI will be increased.

25 Finally, although we will certainly need research efforts like Abbot in order to move toward interoperability in the very large corpora of the near future, we need organizations like the TEI itself even more, and we need the TEI to have a vision and a strategy for asserting its role in the semantic web—by engaging early and often with emerging text-research centers and collections, and by promoting the potential interoperability of the materials produced through its AccessTEI service.

BIBLIOGRAPHY

Berners-Lee, Tim, James Hendler, and Ora Lassila. 2001. “The Semantic Web.” Scientific American, May 2001. http://www.scientificamerican.com/article.cfm?id=the-semantic-web.

Brin, Sergey. 2009. “A Tale of 10,000,000 Books.” The Official Google Blog, October 9. http://googleblog.blogspot.com/2009/10/tale-of-10000000-books.html.

Cairns, Michael. 2009. “580,388 Orphan Works – Give or Take.” Personanondata, September 9. http://personanondata.blogspot.com/2009/09/580388-orphan-works-give-or-take.html.

Case No. 05 CV 8136-DC: Amended Settlement Agreement. 2009. http://www.googlebooksettlement.com/agreement.html.

Hagedorn, Kat, Jeremy York, and Melissa Levine. 2009. “Call for Proposal to Develop a HathiTrust Research Center.” HathiTrust. http://www.hathitrust.org/documents/hathitrust-research-center-rfp.

HathiTrust. 2010. “Welcome to the Shared Digital Future.” HathiTrust. http://www.hathitrust.org/about.

Ontology. 2010. Semantic Web wiki. http://semanticweb.org/wiki/Ontology.

Taycher, Leonid. 2010. “Books of the World: Stand Up and Be Counted! All 129,864,880 of You.” Inside Google Books, August 5. http://booksearch.blogspot.com/2010/08/books-of-world-stand-up-and-be-counted.html.

Text Encoding Initiative. 1988. “Design Principles for Text Encoding Guidelines.” http://www.tei-c.org/Vault/ED/edp01.htm.

Trolard, Perry. 2009. “TEI Tite—A Recommendation for Off-Site Text Encoding.” Version 1.0. Text Encoding Initiative Consortium. http://www.tei-c.org/release/doc/tei-p5-exemplars/html/tei_tite.doc.html.

Unsworth, John. 2007. “Digital Humanities Centers as Cyberinfrastructure.” http://www3.isrl.illinois.edu/~unsworth/dhcs.html.


Pytlik Zillig, Brian L. 2009. “TEI Analytics: Converting Documents into a TEI Format for Cross-Collection Text Analysis.” Literary and Linguistic Computing 24, no. 2: 187–192. doi:10.1093/llc/fqp005.

NOTES

1. The TEI-A schema can be retrieved at http://www.monkproject.org/downloads/texts/schemata.gz and documentation is available online at http://segonku.unl.edu/teianalytics/TEIAnalytics.html.
2. See http://morphadorner.northwestern.edu/.
3. See http://monkproject.org/docs/monk-datastore-doc/doc-files/prior.html.
4. With respect to the need to read, see the discussion below, on the subject of non-consumptive research.
5. For more information, see http://www.apexcovantage.com/content-solutions/accessTEI-digitization.asp.
6. See http://monkpublic.library.illinois.edu/monkmiddleware/public/index.html.
7. See https://monk.library.illinois.edu/secure/mainMenu.html.
8. For a list of CIC institutions, see http://www.cic.net/home/AboutCIC/CICUniversities.aspx.
9. See http://personanondata.blogspot.com/2009/09/580388-orphan-works-give-or-take.html.
10. In the Semantic Web wiki entry on Ontology (Ontology 2010), we learn that there is no universally accepted definition of ontology, raising the specter of recursion.

ABSTRACTS

This essay will address the challenges and possibilities presented to the Text Encoding Initiative, particularly in the area of interoperability, by the very large text collections (on the order of millions of volumes) being made available for computational work in environments where the texts can be reprocessed into new representations, in order to be manipulated with analytical tools. It will also consider TEI’s potential role in the design of these environments, these representations, and these tools. The argument of the piece is that interoperability is a process as well as a state, that it requires mechanisms that would sustain it, and that TEI is one of those mechanisms.

INDEX

Keywords: interchange, interoperability, text-mining

AUTHOR

JOHN UNSWORTH
Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign

John Unsworth is Dean and Professor at the University of Illinois’ Graduate School of Library and Information Science (GSLIS) and Director of the Illinois Informatics Institute. He organized, incorporated, and chaired the Text Encoding Initiative Consortium, co-chaired the Modern Language Association’s Committee on Scholarly Editions, and served as President of the Association for Computers and the Humanities and later as chair of the steering committee for the Alliance of Digital Humanities Organizations. For the previous ten years, from 1993 to 2003, he served as the first Director of the Institute for Advanced Technology in the Humanities (IATH), and a faculty member in the English Department, at the University of Virginia. For his work at IATH, he received the 2005 Richard W. Lyman Award from the National Humanities Center. He chaired the national commission that produced Our Cultural Commonwealth, the 2006 report on Cyberinfrastructure for Humanities and Social Science, on behalf of the American Council of Learned Societies. He has also published widely on the topic of electronic scholarship, as well as co-directing one of nine national partnerships in the Library of Congress’s National Digital Information Infrastructure and Preservation Program, and securing grants from the National Endowment for the Humanities, the National Science Foundation, the Getty Grant Program, IBM, Sun, the Andrew W. Mellon Foundation, and others.


Knowledge Representation and Digital Scholarly Editions in Theory and Practice

Tanya Clement

1. Introducing In Transition: Selected Poems by the Baroness Elsa von Freytag-Loringhoven

1 In Transition: Selected Poems by the Baroness Elsa von Freytag-Loringhoven is a publicly available scholarly edition of twelve unpublished poems written by Freytag-Loringhoven between 1923 and 1927. Alongside extensive annotations and a critical introduction, this edition provides access to a textual performance of her creative work in a digital environment. This interaction is made possible by the Text Encoding Initiative’s (TEI) P5 Guidelines for critical apparatus, including parallel segmentation and location-referenced encoding. The encoded text is rendered into an interactive web interface using XSLT, CSS, and JavaScript available through the Versioning Machine (VM).1 In this discussion, I show that a digital edition like In Transition is formed as much by the underlying theory of text as it is by its content and the particular application or form it takes. This discussion employs the language of knowledge representation in computation (through terms like domain, ontology, and logic) in order to situate this scholarly edition within two existing frameworks: theories of knowledge representation in computation and theories of scholarly textual editing.

2. Knowledge Representation and Digital Scholarly Editions in Theory

2 John F. Sowa writes in his seminal book on computational foundations that theories of knowledge representation are particularly useful “for anyone whose job is to analyze knowledge about the real world and map it to a computable form” (Sowa 2000, xi). Sowa’s suggested approach to designing systems for digital knowledge representation is not dissimilar to the principles set forth in the Modern Language Association’s (MLA) “Guidelines for Editors of Scholarly Editions” (2007). The MLA Guidelines recommend that an editor “choose what to attend to, what to represent, and how to represent it” according to “the editor’s theory of text” or “a consistent principle that helps in making those decisions” (MLA 2007). An analogy can be made between these guidelines and Sowa’s assertion about the application of knowledge representation: “Knowledge representation,” he writes, “is the application of logic and ontology to the task of constructing computable models for some domain” (xii). Sowa’s concept of logic or “pure form” maps to the MLA guidelines’ consideration for how a text is represented in an edition; his use of ontology or “the content that is expressed in that form” maps to the MLA guidelines’ concern with what is attended to or represented in an edition; and Sowa’s consideration for the domain maps to the MLA guidelines’ notion of an edition’s underlying theory of text (Sowa 2000, xiii). Further, the MLA guidelines consider a scholarly edition “a reliable text” by measuring its “accuracy, adequacy, appropriateness, consistency, and explicitness” against what editors define as the edition’s form, content, and theory of text (MLA 2007). Similarly, Sowa notes that knowledge representation is unproductive if the logic and ontology which shape its application in a certain domain are unclear: “without logic,” Sowa writes, “knowledge representation is vague, with no criteria for determining whether statements are redundant or contradictory,” and “without ontology, the terms and symbols are ill-defined, confused, and confusing” (xii). Knowledge representation is the work of all editors. Moreover, the work that scholarly editors undertake in a digital environment must take into account not only traditional textual scholarship but also theories in computation. It is thus useful to theorize the extent to which the production of knowledge in a digital edition is unique to this environment.

2.1. The Domain and Theory of In Transition: Textual Performance

3 In Transition reflects a theory of text I am calling textual performance. Textual performance theory is based on John Bryant’s notion of fluid text theory, in which social text theory is combined with the geneticist notion that a literary work is “equivalent to the processes of genesis that create it” (Bryant 2002, 71). What is productive about this theory for this discussion is the notion that a textual event is a “flow of energy” rather than a product or a “conceptual thing or actual set of things or even discrete events” (Bryant 2002, 61). Accordingly, a text in performance comprises multiple versions in manuscript and print, various notes and letters and comments of contemporaries or current readers, plus the element of performance, which entails time, space, and a collaborative audience. We can perceive these elements working together in the meaning-making event of a text if we consider a literary work to be a “phenomenon . . . best conceived not as a produced work (oeuvre) but as work itself (travaille), the power of people and culture to create a text” (Bryant 2002, 61). As well, considering the literary work as a phenomenon situated in space and time corresponds to the Baroness’s notion of “lifeart,” which reflects a concept of art that was germane to the Dadaist movement, one even Ezra Pound adopted as “an act of art” instead of “a work of art” (Gammel 2002, 14). In other words, as a Dadaist, the “act” of art was intricately tied with one’s ability to provoke a response from fellow Dadaists and the bourgeois culture, which were the targets of most Dada performances. Because provocation was at the root of Dadaist art, the context in which Dada art is performed and the fact of a live, collaborative audience are essential to the art. Likewise, this concept of the “flow of energy” within fluid text theory is a useful way of thinking about how meaning is being produced when a reader interacts with an electronic edition of the Baroness’s poetry.

4 The Baroness’s particular perspective on creating art coheres with this sense of flow and the nature of creation in terms of historical time and place. First, the Baroness believed that for the artist, “art” is conceived in a wave of imagination that comes before its logic or form and that the medium then serves as a catalyst or a signpost within the creative act. In a letter to Djuna Barnes the Baroness refers to the overwhelming nature of being an artist and the productive and enabling forces of the logic or form of poetry. She writes to Barnes that her “rambling” way of “analytical speculation by emotional facts” is an “endless way—until now only to be mastered by rhythmical [sic] and symbolical force of poetry” in which “the logic is already the motive of the very start—and is contained in it and is the thing itself” (UMD 2.144).2 In another letter the Baroness notes, “I am all wave—first—arrangement—ability—comes later” since “the possibility of the structure grows your wings to ‘create’” (UMD 2.45). In other words, various poetic expressions may start from the same wave, but each medium’s particular structure lends itself to a unique performance of that expression. This method is apparent in other poems by the Baroness such as “Orgasmic Toast,” “Statements on Circumstanced Me” (also called “Purgatory Lilt” and “Hell’s Wisdom”), and “Christ—Don Quixote—St. George,” which have multiple versions written as prose in paragraphs and other versions structured into more traditional stanza-and-line formats.

5 Using different styles, genres, and forms was part of the Baroness’s creative process. She writes in a note on a version of “Purgatory Lilt” she has included in a letter to Barnes that “This is not a poem but an essay—statement. Maybe—it were better not to print it in this cut form—perpendicular but in usual sentence line—horizontal?” (UMD 2.226-227) Hans Richter calls this process of revision more dream-like than fancy: “What is important is the poem-work, the way in which the latent content of the poem undergoes transformation according to concealed mechanisms,” transformations “that work the way dream-work strategies operate—through condensation, displacement, and the submission of the whole of the text to secondary revision” (1965, 80). For these reasons, the Baroness’s manuscripts often do not correspond to a sequence that manifests the teleological evolution of a poem. In some cases, the extant manuscripts show little evidence of a clear, creative evolutionary path within a text. Indeed, the Baroness’s manuscripts often manifest experiments on a theme, making one version’s relationship to another an example of alternative choices rather than a system of rough drafts leading to final versions. Richard Poirier claims that this is a modernist technique: “[m]odernist writers, to put it too simply, keep on with the writing of a text because in reading what they are writing they find only the provocation to alternatives” (1992, 113). A reading environment where the reader can experiment based on textual provocations reflects these Dadaist and modernist textual practices.

6 One aspect of textual performance theory I am exploring within In Transition concerns the social text network. The social text network these twelve texts always and already represent presupposes the notion of a constant circulation of networked social text systems. A social text network is entered much like one enters McGann’s “editorial horizon”: the entrance point is “the words that lie immediately before a reader on some page [which] provide one with the merest glimpse of that complex world we call a literary work and the meaning it produces” (Textual Condition 12). The network represented by In Transition is based primarily on issues of reception, materiality, and theme which engage and reflect the social nature of the text in the 1920s and now. This is to say two things: (1) that the concept of the network is not new with digital scholarly editions; and (2) that these networks in a digital edition foreground the situated 1920s history of these texts as well as the real-time, situated electronic reading environment.

7 Social networks are not new. Indeed, the notion of the network is used both by Bruno Latour and by Jay David Bolter and Richard Grusin to ameliorate the polarities that exist in the current discourse between nature and technology and between “old” and “new” technologies. Notions of the “network” help to diminish the polarities within the overriding discourse. In We Have Never Been Modern (1993), Bruno Latour explores the notion that the hybridization of nature and culture in this age of new technologies has necessitated discourses of purification and denial; these discourses, he argues, seek to create an age of digital “revolution” that diminishes what has always been a cyborgian culture (48). “When we see them as networks,” Latour writes, “Western innovations remain recognizable and important, but they no longer suffice as the stuff of saga, a vast saga of radical rupture, fatal destiny, irreversible good or bad fortune” (1993, 48). Bolter and Grusin explore our current, perceived digital utopia as the result of the “double-logic” of “remediation” (the “repurposing” of old technologies) in which “our culture wants both to multiply its media and to erase all traces of mediation” (Bolter and Grusin 1998, 5). In Transition is a remediation of social text networks, but it is also the enactment of new social text networks that are in constant circulation or “flow.” The real-time audience participation required within the In Transition interface foregrounds the extent to which these social text networks underlie all textual performances or events.

2.2. The Ontology and the Content: Social Text Networks

8 This scenario, in which the making of meaning is a performance that relies on a constant state of shifting social networks, corresponds to the edition’s central theme of transition. These twelve texts are included as expressions created during a time of transition in the Baroness’s life between 1923 and 1927, when she moved from New York to Berlin and finally to Paris, but the edition also serves to represent a moment of transition in the culture of little magazines and the technologies of conversation during this time period. This is a period which sees the little magazine change shape from a venue that engages more popular responses and conversations about literature and art—such as the one represented by the inclusion of the Baroness’s poetry in The Little Review—to a venue which begins to address an audience more attuned to and engaged with literature and poetry as high art. Alan Golding associates the “point that modernism becomes Modernism” with the moment that the Baroness left New York to return to Germany in 1923, a point that ends both a highly experimental phase of modernist writing and one in which conversation and dialog flowed freely (Golding 76).

9 The social text networks represented by In Transition comprise three primary relationships within this context. The first relationship is based on the reception environment at transition magazine,3 where the editors at first accepted and then rejected the Baroness’s poems for their audience in the late nineteen-twenties.4 For instance, during the period between 1927 and 1929, three of the twelve poems included within In Transition (“Café Du Dome,” “Xray,” and “Ostentatious”) were published in transition while five of the other poems—“Ancestry,” “Christ—Don Quixote—St. George” (a subsection of “Contradictory Speculations”), “Cosmic Arithmetic,” “Sermon On Life’s Beggar Truth,” and “A Dozen Cocktails Please”—were under consideration by the transition editors and ultimately rejected for future issues.5 Cary Nelson argues that this time period is one in which “a revolution in poetry seemed naturally to entail a commitment to social change [. . .] all the arts were in ferment and aesthetic innovations were politically inflected” (230). Much of this fermentation, innovation, and commitment to change was generated by the relationships between writers and editors. Indeed, the conversation at the root of modernism extended to the offices of the little magazines where writers read each other’s work and discussed it both in person and in print. These eight poems share a relationship tied to the particular social text network engaged by the transition editors in the 1920s.

10 A second relationship represented by the textual network within this edition includes the material space that some of these poems share, a relationship that in some cases overlaps with the ties just mentioned. For instance, in some cases, draft versions of certain poems appear on the verso or in the margins of the manuscripts for draft versions of other poems. Versions of “Café Du Dome,” “Ancestry,” and “Sermon” appear on versions of “Ostentatious” while versions of “Orchard Farming,” “Sermon,” “Christ—Don Quixote—St. George,” and “Ostentatious” appear on versions of “Xray.” The material nature of these relationships is useful for considering the role that materiality plays in situating these poems in a particular time and place, both historically and in the present. That is, a reader could assume that two poems were produced in close succession because they share a manuscript leaf, but it is also true that the Baroness was quite poor and could have reused these sheets multiple times over a long span of time for economic reasons. Further, it is difficult to say if the proximity of one poem influenced how the Baroness wrote another. At the same time, in the current iteration of In Transition in which images of the manuscripts are used, the reader is exposed to multiple poetic events each time she opens a manuscript leaf that shows multiple poems. As a result, these material relationships play a role in both the text’s perceived material history and the materiality of its current performance.

11 The third interconnected relationship embodied by the content within this edition is one that is determined by thematic ties between poems written during this time period. The remaining three poems—“Purgatory Lilt/ Statements by Circumstanced Me,” “Orgasmic Toast,” and “Matter Level Perspective”—have thematic ties with a variety of the aforementioned texts. For instance, the interplay among historical, personal, scientific, and creative forces in “Hell’s Wisdom” points to themes inspired by the Baroness’s fellow Dadaists, but it is difficult to decipher the abstract logic that the arithmetic in a poem like “Hell’s Wisdom” represents unless one also reads “Cosmic Arithmetic.” The other poems share thematic ties as well, such as images of “radiance” in “Orgasmic Toast,” “Sermon on Life’s Beggar Truth,” “Purgatory Lilt,” and “Xray” or mathematical formulas in “Orgasmic Toast,” “Purgatory Lilt,” and “Cosmic Arithmetic.” More of these relationships are explored in the extensive introduction and annotations to the edition.

12 Reception, thematic, and materiality networks are also reflected in the relationships between words and forms of punctuation across different versions of the poems. For instance, in the poem “Sermon on Life’s Beggar Truth,” words are underlined in one version and then not emphasized at all; dashes and colons are deleted and replaced with periods or spaces or exclamation points (and vice versa); and all of these relationships occur in an order that seems to contradict a linear evolution of text. For instance, Figure 1 shows the relationship between the words “Menacing” and “Behold,” which function as “heading” words for two prose stanzas. These words change in similar ways across multiple versions but not in a similar sequence. In versions one and two, “Menacing” and “Behold” remain consistent, underlined with a colon. In versions three through six, “Menacing” is not underlined but is separated from the following prose group by a space. In versions five and six it has a colon, while in versions three and four it has an exclamation point. “Behold” is always on its own line, but the colon is deleted and replaced by an exclamation point in version five, while versions three, four, and six maintain the colon, and so on. The progression shows a network of relationships that hint at multiple performances or instantiations of the poems instead of a teleological process towards an end result. In contrast, there are other social text networks between versions that are linear. The poem “Xray,” for example, which was published in transition (October 1927), has nine extant versions that show changes that we can map to the reception and materiality relationships between nodes. For example, the first three lines of the first stanza of the published version read:

Nature causes brass to oxidize
People to congest–
By dull-radiopenetrated soil . . .

13 In the first version in the interface, the first line is “Nature causes brass to oxidize,” which changes to “Nature intends brass to oxidize” in version six. The second line in the first version is “Nature causes people to amass,” which becomes, in version six, “Nature intends people [sic] to amass”; this line evolves in version two to “Nature causes people to congest” and eventually becomes, in the published text, a truncated clause: “People to congest—.” While the evolution of these lines is relatively easy to follow, the third line becomes something that seems entirely different if one merely looks at the last version in comparison to the first: “Because of latent ideal of brilliancy” becomes “By dull-radiopenetrated soil” (see Figure 2). The Baroness’s compulsive desire to create multiple versions of each work is reflected in the ontology or content across which particular words, punctuation marks, and symbols move and change.

Figure 1: The words “Menacing” and “Behold” compared across versions of “Sermon on Life’s Beggar Truth” in the Versioning Machine

Figure 2: “Xray,” versions one, eight, and the published 1927 text, in the Versioning Machine

14 In short, all twelve poems participate, by and through multiple and varied relationships based on reception, materiality, and theme, in the textual network that was circulating between 1923 and 1927. In Transition stages a textual performance that sets these social text networks into play.

2.3. Logic and Form: the TEI in the Versioning Machine

15 Encoding a transcription of a printed or manuscript text is a method for creating a computable model of a text that can be instantiated or implemented with programs for a variety of applications such as search and retrieval, linguistic analysis, or visualizations. This modularity facilitates the various stagings within a given textual performance. For instance, the TEI-encoded documents of which In Transition is comprised include logical and ontological metadata that can describe both the physical and the semantic nature of the manuscript. Currently, the TEI schema is the most productive standard available for creating a scholarly edition of the Baroness’s poetry because it is able to express the dynamic network of relationships that exist when multiple versions of a poem are performing at once. Created primarily for use with linguistic and literary documents, the standard has a robust schema for considering manuscript texts in multiple versions, making it suitable for the particular textual ontology on which a scholarly edition based on these kinds of texts depends. In particular, methods corresponding to the “Critical Apparatus” guidelines called “parallel segmentation” and “location-referenced” allow an editor to designate and thus visualize networks among linguistic codes (words, phrases, lines, paragraphs, etc.) and bibliographic codes (page images, page breaks, column breaks, and milestones) that correspond across various versions. In terms of In Transition, the TEI parallel segmentation encoding facilitates the reader’s ability to compare the social text networks of a poem like “Xray” or “Sermon on Life’s Beggar Truth” described above. In particular, In Transition uses the open platform application called the Versioning Machine (VM),6 which renders the TEI XML (shown in Figure 3) into a dynamic HTML page using XSLT, CSS, and JavaScript (shown in Figure 1 and Figure 2). Figure 1 and Figure 2 are examples from In Transition in which lines from various versions of “Xray” and “Sermon on Life’s Beggar Truth” are being compared. With the VM styles, these comparisons can be enacted by readers dynamically in a browser window in two primary ways: (1) the scholar can open and rearrange version panels as needed, and (2) the scholar can choose which networks to highlight by selecting lines of interest.
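To make the parallel segmentation mechanism concrete, the fragment below is a minimal sketch of this kind of encoding, using readings of “Xray” quoted later in this article; the witness sigla and the exact markup are illustrative assumptions rather than the edition’s actual code:

<l>
  <app>
    <!-- one reading per witness for the same line of verse -->
    <rdg wit="#va1">Nature causes brass to oxidize</rdg>
    <rdg wit="#va6">Nature intends brass to oxidize</rdg>
    <rdg wit="#pub1927">Nature causes brass to oxidize</rdg>
  </app>
</l>

In parallel segmentation, the apparatus sits directly in the running text, so the points of correspondence across versions are given by the <app> boundaries themselves; an XSLT stylesheet such as the Versioning Machine’s can then lay the readings out side by side.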

Figure 3: An excerpt of “Xray” in TEI P5 encoded XML, versions one through eight and the published 1927 text

16 Determining which TEI elements present which social text networks is the work of knowledge representation. It is setting the stage for a textual performance. Critical, editorial choices that ensure textual modularity are involved in every aspect of the text’s transformation from a transcript to a fully encoded TEI XML document to a text presented in an application such as the Versioning Machine. These choices include deciding how to sequence the versions, choosing the lines that correspond across versions, and assessing the HTML rendering of such choices. The underlying TEI XML of an edition such as In Transition (Figure 3) includes data within a structured logic that computer systems need to facilitate the scholar’s ability to manage and manipulate various networks of relationships that comprise the bibliographic and linguistic codes of a text. For instance, in Figure 3, the logic represented by the “nested” structure indicates a particular relationship between the parent apparatus (<app>) element and the reading (<rdg>) elements “nested” within it (the children) that allows the editor to indicate and compare corresponding parts of the text across versions. In this manner, the <rdg> elements that appear between the opening and closing <app> tags indicate which of the nine versions or witnesses are associated with a particular aspect of the apparatus. The witnesses are indicated in the encoding by the identifiers va1, va2, va3, etc., with the published version labeled as “pub1927.” In this case, the apparatus with xml:id “a6” is being used to compare versions of the third line associated with each witness. In addition, the @loc attribute (also “a6”), which links together readings from different apparatus elements, indicates that the element with xml:id “a6” is associated with the element with xml:id “a5.” Consequently, the extra lines that appear in witness va8 above the third line (area “A” in Figure 3) are associated with this line of text across the versions. This “link” is visualized in Figure 3, in which lines are highlighted according to the <app> element. In the interface, the reader can click on any line to automatically highlight associated words, phrases, and lines across readings based on two criteria: the presence of these readings within the same <app> element or the association of the same @loc attribute on different <app> elements. The editor can use these structures to group or organize both unique versions and changes across versions, and the interface of In Transition allows the reader to see and construct different stories about the underlying networks of the text.
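The structure walked through above can be sketched schematically as follows; the identifiers echo those named in the paragraph, while the bracketed reading content is a hypothetical placeholder (the two quoted readings are taken from the versions of “Xray” discussed elsewhere in this article):

<!-- two apparatus entries tied to the same location by a shared @loc value -->
<app xml:id="a5" loc="a6">
  <!-- extra draft lines appearing only in witness va8, above the third line (area "A") -->
  <rdg wit="#va8">[additional draft lines]</rdg>
</app>
<app xml:id="a6" loc="a6">
  <rdg wit="#va1">Because of latent ideal of brilliancy</rdg>
  <rdg wit="#pub1927">By dull-radiopenetrated soil</rdg>
</app>

Because both <app> elements carry the same @loc value, the interface can treat their readings as belonging to one location and highlight them together, even though they are encoded as separate apparatus entries.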

17 In considering the form of a digital scholarly edition, it is necessary to interrogate how the digital environment instantiates or stages the application of the underlying editorial philosophy. For instance, as a computable model, Willard McCarty calls encoded text “reductive and fixed” since it cannot detail “the massive amount and complexity of detail for a microscopic phenomenon across 12000 lines of text” (McCarty 2005, 58). An encoded text also cannot, according to Jerome McGann, capture the n-dimensional aspect of the “autopoetic” field of transactions, connections, and resonances. McGann notes that “[a]ll this phenomena exhibit quantum behavior. We distinguish a structure of relational segmentation in all texts, but in autopoetic forms we observe as well that the segments and their relations cannot be read as self-identical. They mutate into different symmetries and asymmetries” (McGann 2002, 298). On the other hand, in an essay titled “Electronic Textual Editing: When not to use the TEI,” John Lavagnino discusses the advantages of using the TEI Guidelines for a scholarly edition. For a scholarly edition in which “the creation of new writing” such as scholarly apparatus is just as essential as the transcription of the original text, Lavagnino quite simply argues, “the TEI is applicable to your texts” (Lavagnino 2006, 334). The difference between these two perspectives is remarkable. The former is summarily reductive in considering the varied applications for encoding while the latter seems unduly expansive in theoretical terms. Certainly, as one reviewer of this article noted, there is a lot of information in the notes and introduction of In Transition that appears in natural language and is not essentially reliant on the “computable model” for “enactment.” These notes represent static language about biographical and literary significance that describes a certain historical context. Yet, I am arguing that there are dialogic modes of knowledge representation enacted with this edition by both “natural” and “encoded” language, and the premise underlying McCarty’s, McGann’s, and Lavagnino’s claims speaks to the reason for using the TEI to engage it: these critics are essentially saying that determining the standard or model for encoding a text depends on how the scholar defines the digital textual event in which it will be enacted (i.e., for what domain).

18 In theorizing how and why we use TEI encoding, it is useful to consider Sowa’s observation that knowledge representation corresponds to “the application of logic and ontology to the task of constructing computable models for some domain” (Sowa 2000, xii). McCarty’s sense of the limitations of encoding is premised on his argument that the encoded text does not represent a productive computable model since the ontology created in an encoded text does not accurately represent the original object, nor is it structured in such a manner as to record what it is not able to represent. Essentially, McCarty’s concern is to build a better system of representation based on what could be learned from a given model within that system. Likewise, McGann’s perspective comes from his desire to represent the multidimensional “autopoietic field” of a textual event for observation and study. Lavagnino, on the other hand, defines the function of an encoded text in terms of editorial scholarship. As scholarly editors, he argues, “we are engaged in analyzing texts and creating new representations of them, not in creating indistinguishable replicas” (Lavagnino 2006, 338). Similarly, In Transition is a digital textual environment which is not intended to replicate history but is intended to elicit more questions than answers about social text networks through play, discovery, and inquiry. These performances are scripted by the editor—by my ability to mark and annotate aspects of the text that foreground certain networks and generate a particular narrative. These textual events, however, are also motivated by an underlying theory of textual performance which requires a real-time, live audience to “handle” the digital texts and images, to move them around and, by doing so, to set new autopoietic fields in motion.

3. Knowledge Representation and Digital Scholarly Editions in Practice

19 Applying the logic of the electronic edition (the form) and the ontology (the content) of these twelve networked texts to a computable model that represents textual performance (the domain) is not a simple task—but perhaps this difficulty is appropriate in this context. Richard Poirier writes that modernist “texts are mimetic in that they simulate simultaneously the reading/writing activity;” thus, “[t]he meaning resides in the performance of writing and reading, of reading in the act of writing” (Poirier 1992, 113). For this reason, he continues, modernist texts enact “a mode of experience, a way of reading, a way of being with great difficulty conscious of structures, techniques, codes and stylizations” (Poirier 1992, 114). For instance, the Baroness believed that punctuation (what she calls “interpunction”) should be as varied and expressive as words. This sentiment is reflected in a note to Barnes in which she invents the “scorn mark” and the “joy mark”:

. . . why does no scorn-mark mark of contempt—exist? I often miss it! see? that is one of thing’s [sic] I will invent. . . to invent happiness—joy mark! Not only exclamation mark. Djuna—as I just see now—our interpunction—system is puny! One should be able to express almost as much in interpunction as words [. . .] in this new strange thing—to express absolute in it! As I did in sounds—like music! Wordnotes! (UMD 2.44)

20 Here, the Baroness acknowledges that her ontology includes the system of words and symbols from which she could draw and that these objects belong to a system or network of relationships that must reflect how we read but also how we write poetry. In Transition seeks to set this “performance of writing and reading” into play by engaging the reader in some of the same “difficult” textual conditions the Baroness encountered in creating her poetry, such as the play between elements of ontology (content) and logic (form) and the temporal nature of the writing experience in real-time.

21 Based on the theory of textual performance, In Transition illustrates through practice that versions are a matter of perspective and situation just as they are a matter of textual difference. For instance, two versions of a poem titled “He” and “Firstling” appear on the same manuscript page. Next to the versions, the Baroness writes a note to Djuna Barnes saying “These two poems are the same. I leave it to you if you will print them both?” (UMD 4.54) Other versions of the poems that appear in the extant manuscripts are German versions. On yet another version, the Baroness writes to Barnes about combining “Firstling” and “He” but this time “Firstling” is in German: “What is interesting about the 2 together,” she writes,

is their vast difference of emotion—time knowledge—pain. That is why they should be printed together. For they are 1 + 2 the same poem—person sentiment life stretch between one—divided—assembled—dissembled. The German one is young—naïv [sic]—ingenous [sic]—the English one ripe—experienced bitter. The German one is deep woe of child—in whoms [sic] very violence thus naïve expressed—lingers balm of recovery sensible.—The English one—as is superfluous to point out—is grim sophisticated. (UMD 4.58-59)

22 The Baroness reiterates her idea that the poems are versions of the same poem though they have different titles, are written in different languages, and were written in different countries. The details the Baroness emphasizes, however, are differences made by time and experience. In fact, what she is describing is not only her experience in writing the poems at different times in her life, but what would eventually be the readers’ experiences in reading this poem at a time later than they were written. Textual performance necessitates similar experiences with temporal uncertainties or instabilities. For instance, in “Prose Fiction and Modern Manuscripts: Limitations and Possibilities of Text Encoding for Electronic Editions,” Edward Vanhoutte’s main contention is that a genetic textual edition can only be partially accomplished by the TEI standard. He cites “time and overlapping hierarchies” as the most problematic aspects of his attempt to encode modern manuscript material since “the structural unit of a modern manuscript is not the paragraph, page, or chapter but the temporal unit of writing” (Vanhoutte 2006, 172). Clearly, he is not alone in contending that the TEI logic (the nesting elements) and its ontology (the aspects and behaviors of the text of which the elements are comprised) remain insufficient for representing modern textual events.7 On the other hand, perhaps it is not productive to assume that the TEI schema should be held culpable for the representation of every aspect of a textual performance. In “Psychoanalytic Reading and the Avant-texte,” Jean Bellemin-Noël cites “chance” as the salient element within the textual event that mollifies the need to reproduce what could be called the text’s originary temporality in the genetic edition. “Since the writing process is itself a production governed by uncertainty and chance,” Bellemin-Noël writes, “we absolutely must substitute spatial metaphors for temporal images to avoid reintroducing the idea of teleology” (Bellemin-Noël 2004, 31). In other words, instead of attempting to reproduce temporality in the scholarly edition (an attempt that presupposes a teleological textual event), the goals of an edition with concerns about versions might be better served by engaging the element of uncertainty and chance that the temporal nature of textual events inevitably produces.

23 The facility to engage an element of chance, especially as it is engendered by space, is enhanced by a dynamic and manipulable interface to the textual event. Visualizations facilitated by a combination of text and image work well to produce a space that functions as a signifier for temporal uncertainty. For instance, in version three of “Xray,” certain lines (“Suns [sic] radioinfused soil,” “Radio’s soil secret,” “Radio’s sun message,” and “Radio’s sunimpregnated soil”) may be understood as alternative readings for the same point in a line of text because of their spatial arrangement (all radiating around the word “soil”) on the manuscript page (see Figure 4). Or, since the text appears between the second and third line of text, the word cluster could be a kind of brainstorming cluster that may or may not have helped the writer develop the final phrase “Dumb radiopenetrated soil” that appears, for the first time in any version, on the line beneath the clustered constellation. Ultimately, uncertainty and chance are enacted by the spatial arrangement of the words on the page since it is impossible to ascertain which words were written first; consequently, our inability to decipher the exact chain of events is emphasized.

24 Finally, our access to this level of uncertainty is enacted by the combination of text and image that the VM facilitates. Within the TEI, the editor is able to express alternative readings for a given textual moment by using the reading-group element (<rdgGrp>) within a “parent” reading (<rdg>) element in order to group additional “children” readings (for an example, see the element with xml:id “a5” in Figure 3, Area “A”). At the same time, TEI XML must be written in a linear form, first one reading, then another, which prescribes an order on text that is essentially unordered.8 For instance, in Figure 5, a <rdgGrp> element is rendered by the presence of a dotted line under the phrase “Suns [sic] radioinfused”. This line indicates that a mouseover will reveal alternative readings; yet, on the mouseover, the alternative readings are ordered, vertically, in the same order that the XML prescribes: first “Radio’s soil secret” then “sun message” then “penetr sunimpregnated”. This linear orientation is prescribed both by the XML and the resulting HTML (of which the VM interface is constructed), giving the impression that there is an order to the phrases that is not necessarily evident on the manuscript page. On the other hand, it is this discrepancy that lends a powerful element of uncertainty to the textual performance of “Xray” in the VM. That is, because of the encoding, a dotted line is rendered that indicates alternate readings for the phrase “Suns radioinfused soil” (see Figure 5). By mousing over the dotted line, the above-mentioned alternative readings appear in a “floating box” that indicates to the reader that the variants included in the box are alternative choices for this spot in the text. In addition, in this example, “soil secret” is also underlined with a dotted line indicating that alternative choices for this sub-reading are “sun message” and “sun impregnated.” This is where the image enters into this performance. For instance, in Figure 5 and Figure 6, the encoded poem supports a logic of text according to linguistic codes that are associated across words and phrases. The image (shown in the bottom right corner of Figure 6) facilitates a logic of text that points to bibliographic codes associated with the material layout of the manuscript page. The dialogic that is played as these different textual messages are visualized through the encoded text and the manuscript images generates the element of temporal uncertainty that Bellemin-Noël mentions and that textual performance requires. In theory, “playing” the encoded text and image together opens a space for uncertainty, for conversation, and for situated, alternative readings that, in practice, become texts in performance.
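In standard P5 terms, a grouped set of alternatives of the kind just described can be sketched as follows, using the word cluster from version three; this illustrates the pattern rather than reproducing the edition’s own markup, which nests the group inside a parent <rdg> as described above:

<app>
  <rdgGrp>
    <!-- alternatives radiating around "soil" on the same manuscript page; no sequence is implied -->
    <rdg wit="#va3">Suns radioinfused soil</rdg>
    <rdg wit="#va3">Radio's soil secret</rdg>
    <rdg wit="#va3">Radio's sun message</rdg>
    <rdg wit="#va3">Radio's sunimpregnated soil</rdg>
  </rdgGrp>
</app>

Grouping the readings signals that they belong together as unordered alternatives, but the serialized XML must still list them one after another, which is precisely the linearity discussed in this paragraph.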

Figure 4: Manuscript excerpt from “Xray” version three in the Versioning Machine

Figure 5: Excerpt from “Xray,” three versions in the Versioning Machine

Figure 6: “Xray,” three versions in the Versioning Machine

4. Conclusion

25 The knowledge represented and produced in creating and reading In Transition is provocative since it encourages critical inquiry concerning how a digital scholarly edition represents knowledge differently than a print edition; it raises questions about the role social text networks may have played in how the Baroness’s poetry is and was presented and received; and it requires that we interrogate whether In Transition presents the Baroness in the trajectory of history or provides for a location in which we can read her work in the now, in an n-dimensional autopoetic field that is situated squarely in the present moment of the reader’s open (browser) window. At best, with this work we imagine what is possible in creating a singularly digital text environment that requires the reader to ask, how does this environment work? How is it constructed? What new and traditional modes of textuality are at play and at risk here? The above discussion has sought to make transparent how the edition’s ontology and logic are in dialog with the domain of textual performance. At best, the multiple versions of these twelve poems related through social text networks, the manifestation of these relationships in the TEI encoding, and the VM environment which allows users to set these relationships into play provide for a situated reading environment in which a particular instantiation of text is never the same from one moment to the next. The edition enacts the element of real-time, live-body, evocative performance that informed how the Baroness and her contemporaries engaged with her poetry within the social text networks of modernist magazines and the Dadaist art scene of the 1920s. At best, that work remains ongoing.

BIBLIOGRAPHY

Bellemin-Noël, Jean. 2004. “Psychoanalytic Reading and the Avant-texte.” In Genetic Criticism: Texts and Avant-Textes, edited by Jed Deppman, Daniel Ferrer, and Michael Groden, 28-35. Philadelphia: University of Pennsylvania Press.

Bolter, J. David, and Richard A. Grusin. 1999. Remediation: Understanding New Media. Cambridge, Mass: MIT Press.

Bryant, John. 2002. The Fluid Text: A Theory of Revision and Editing for Book and Screen. Ann Arbor: University of Michigan Press.

Freytag-Loringhoven, Elsa. 2008. In Transition: Selected Poems by The Baroness Elsa von Freytag-Loringhoven, edited by Tanya Clement. University of Maryland, College Park Libraries. http://www.lib.umd.edu/digital/transition/.

———. 1928. “Selections from the Letters of Elsa Baroness Von Freytag-Loringhoven,” edited by Djuna Barnes. transition, 11 (February): 19-30.

———. 1921. “Thee I Call ‘Hamlet of Wedding Ring’: Criticism of William Carlos William’s [sic] ‘Kora in Hell’ and Why . . . (pt. 2).” Little Review 8 (Autumn): 108-111.

———. 1927. “Xray.” transition, 7 (October): 135.

Gammel, Irene. 2002. Baroness Elsa: Gender, Dada, and Everyday Modernity: A Cultural Biography. Cambridge, Mass: MIT Press.

Golding, Alan. 2007. “The Dial, The Little Review, and the Dialogics of Modernism.” In Little Magazines and Modernism: New Approaches, edited by Suzanne W Churchill and Adam McKible, 67-81. Aldershot, England: Ashgate Pub.

Hockey, Susan M. 2000. Electronic Texts in the Humanities: Principles and Practice. Oxford: Oxford University Press.

Huitfeldt, Claus. 2007. “Scholarly Text Processing and Future Markup Systems.” Accessed October 10, 2010. http://computerphilologie.uni-muenchen.de/jg03/huitfeldt.html.

Latour, Bruno. 1993. We Have Never Been Modern. Cambridge, Mass: Harvard University Press.

Lavagnino, John. 2006. “Electronic Textual Editing: When not to use TEI.” In Electronic Textual Editing, ed. Lou Burnard, Katherine O’Brien O’Keeffe, and John Unsworth, 334-338. New York: Modern Language Association of America.

McCarty, Willard. 2005. Humanities Computing. New York: Palgrave Macmillan.

McGann, Jerome. 2004. “Marking Texts of Many Dimensions.” In A Companion to Digital Humanities, edited by Ray Siemens, John Unsworth, and Susan Schreibman, 198-217. Blackwell Companions to Literature and Culture. Oxford: Blackwell Publishing. Accessed October 10, 2010. http://www.digitalhumanities.org/companion/.

———. 2002. “Visible and Invisible Books: Hermetic Images in N-Dimensional Space.” Literary and Linguistic Computing: Journal of the Association for Literary and Linguistic Computing 17 (April): 61-75.

———. 1991. The Textual Condition. Princeton, N.J: Princeton University Press.

Modern Language Association. “Guidelines for Editors of Scholarly Editions.” 25 September 2007, accessed February 21, 2011, http://www.mla.org/cse_guidelines.

Nelson, Cary. 1989. Repression and Recovery: Modern American Poetry and the Politics of Cultural Memory, 1910-1945. Madison, Wis: University of Wisconsin Press.

Papers of Elsa von Freytag-Loringhoven, Special Collections, University of Maryland Libraries. Digital edition available attached to the finding aid for the papers, accessed February 21, 2011, http://hdl.handle.net/1903.1/1501.

Poirier, Richard. 1992. “The Difficulties of Modernism and the Modernism of Difficulty.” In Critical Essays on American Modernism, edited by Michael Hoffman and Patrick D. Murphy, 104-114. New York: G.K. Hall & Company.

Renear, Allen, Elli Mylonas, and David Durand. 1996. “Refining our Notion of What Text Really Is: The Problem of Overlapping Hierarchies.” In Research in Humanities Computing, edited by Nancy Ide and Susan Hockey. Oxford University Press, accessed February 21, 2011, http://hdl.handle.net/2142/9407.

Richter, Hans. 1965. Dada: Art and Anti-Art. New York: McGraw-Hill.

Smith, Martha Nell. 1992. Rowing in Eden: Rereading Emily Dickinson. 1st ed. Austin: University of Texas Press.

Sowa, John F. 2000. Knowledge Representation: Logical, Philosophical, and Computational Foundations. Pacific Grove, CA: Brooks Cole Publishing Co.

Vanhoutte, Edward. 2006. “Prose Fiction & Modern Manuscripts.” In Electronic Textual Editing, edited by Lou Burnard, Katherine O’Brien O’Keeffe, and John Unsworth, 161-180. New York: Modern Language Association of America.

NOTES

1. More information about the Versioning Machine is at http://www.v-machine.org/. The iteration used for this project is based on VM version 4.0 with some modifications I implemented. These modifications are described at http://www.lib.umd.edu/digital/transition/vmchanges.jsp.

2. This number represents a reel and frame number from the microfilm of the Papers of Elsa von Freytag-Loringhoven, Special Collections, University of Maryland Libraries. All subsequent references are noted as UMD.

3. Between 1927 and 1929, transition was edited by Eugene and Maria Jolas, Eliot Paul (until 1928), and Harry Crosby (until 1929).

4. Reception here is considered as part of a “triangular intertextuality” or only as one aspect of the “influences of biography, reception, and textual reproduction” (Smith 1992, 2).

5. This information is indicated in two letters between the Baroness and Maria Jolas at transition now housed at the University of Maryland Libraries. The letter from the Baroness asks the editors to include a dedication in “A Dozen Cocktails Please” to “Mary R.S.” and to change a line in “Sermon on Life’s Beggar Truth.” While Jolas’s return letter, dated October 12, 1927, does not mention “Sermon,” she does note that they “are keeping for future use” the poems that the Baroness sent in with “Contradictory Speculations,” namely “Ancestry,” “Cosmic Arithmetic,” “A Dozen Cocktails Please” and “Chill.” “Chill” is not included in this edition because there are two poems by the Baroness titled “Chill,” either of which could have been the one sent to transition (UMD 2.905).

6. More information about the Versioning Machine is at http://www.v-machine.org/. The iteration used for this project is based on VM version 4.0 with some modifications I implemented. These modifications are described at http://www.lib.umd.edu/digital/transition/vmchanges.jsp.

7. Of course, there are many discussions about the limitations of the TEI standard. For example, in his desire to create an electronic edition that expresses the time and space dimension a cache of multiple versions necessarily engages, Edward Vanhoutte discovers that speech elements serve his editorial principles since he considers his project to be a recording of the “author” having a conversation with the biographical writer (Vanhoutte 2006, 175-176). Other discussions include Renear et al. 1996; Hockey 2000, specifically pp. 24-28; and Huitfeldt 2007.

8. As pointed out by one reviewer of this article, an extension can be added to the TEI Guidelines “to specify whether or not the order in the encoding of variants is significant or not; there’s also the need for a customized interface that can [convey] this to the reader.”

ABSTRACTS

In Transition: Selected Poems by the Baroness Elsa von Freytag-Loringhoven is a publicly available scholarly edition of twelve unpublished poems written by Freytag-Loringhoven between 1923 and 1927. This edition provides access to a textual performance of her creative work in a digital environment. It is encoded using the Text Encoding Initiative’s (TEI) P5 Guidelines for critical apparatuses including parallel segmentation and location-referenced encoding. The encoded text is rendered into an interactive web interface using XSLT, CSS, and JavaScript available through the Versioning Machine (http://www.v-machine.org/). One aspect of textual performance theory I am exploring within In Transition concerns the social text network. The social text network these twelve texts always and already represent presupposes the notion of a constant circulation of networked social text systems. The network represented by In Transition is based primarily on issues of reception, materiality, and themes which engage and reflect the social nature of the text in the 1920s and now. This is to say two things: (1) that the concept of the network is not new with digital scholarly editions; and (2) that these networks in a digital edition foreground the situated 1920s history of these texts as well as the real-time, situated electronic reading environment. The argument of a digital edition like In Transition is formed as much by the underlying theory of text as it is by its content and the particular application or form it takes. This discussion employs the language of knowledge representation in computation (through terms like domain, ontology, and logic) in order to situate this scholarly edition within two existing frameworks: theories of knowledge representation in computation and theories of scholarly textual editing.

INDEX

Keywords: digital editions, interface development, knowledge representation, scholarly editing, text encoding, versioning

AUTHOR

TANYA CLEMENT
[email protected]
Associate Director, Digital Cultures and Creativity, University of Maryland, College Park
Research Associate, Maryland Institute for Technology in the Humanities, University of Maryland, College Park
Tanya Clement is the Associate Director of Digital Cultures and Creativity, an undergraduate honors program at the University of Maryland, College Park. She has an English PhD from UMD and an MFA in fiction from the University of Virginia. She is also a Research Associate at the Maryland Institute for Technology in the Humanities (MITH). She has published chapters and articles on digital scholarly editing, text mining, visualizations, and literary modernism, is the Associate Editor of the Versioning Machine and the editor of In Transition: Selected Poems by the Baroness Elsa von Freytag-Loringhoven.

A TEI-based Approach to Standardising Spoken Language Transcription

Thomas Schmidt

AUTHOR'S NOTE

An earlier version of this paper, co-authored by Andreas Witt, was presented as “Transcription tools, transcription conventions and the TEI guidelines for transcriptions of speech” at the 2008 TEI Members Meeting in London. I am grateful to Peter M. Fischer and two anonymous reviewers for very helpful suggestions for improvement.

1. Introduction

1 Spoken language transcription is an important component of many types of humanities research. Among its central areas of application are linguistic disciplines like conversation and discourse analysis, dialectology and sociolinguistics, and phonetics and phonology. The methods and techniques employed for transcribing spoken language are at least as diverse as these areas of application. Different transcription conventions have been developed for different languages, research interests, and methodological traditions, and they are put into practice using a variety of computer tools, each of which comes with its own data model and formats. Consequently, there is, to date, no widely dominant method, let alone a real standard, for doing spoken language transcription. However, with the advent of digital research infrastructures, in which corpora from different sources can be combined and processed together, the need for such a standard becomes more and more obvious. Consider, for example, the following scenario: A researcher is interested in doing a cross-linguistic comparison of means of expressing modality. He is going to base his study on transcribed spoken language data from different sources. Table 1 summarises these sources.

Table 1: File formats and transcription conventions for different spoken language corpora

Corpus (Language) [URL] | File format | Transcription convention
SBCSAE (American English) [http://projects.ldc.upenn.edu/SBCSAE/] | SBCSAE text format | DT1 (DuBois et al. 1993)
BNC spoken (British English) [http://www.natcorp.ox.ac.uk/] | BNC XML (TEI variant 1) | BNC Guidelines (Crowdy 1995)
CallFriend (American English) [http://talkbank.org/] | CHAT text format | CA-CHAT (MacWhinney 2000)
METU Spoken Turkish Corpus (Turkish) [http://std.metu.edu.tr/en] | EXMARaLDA (XML format) | HIAT (Rehbein et al. 2004)
Corpus Gesproken Nederlands (CGN, Dutch) [http://lands.let.kun.nl/cgn/ehome.htm] | Praat text format | CGN conventions (Goedertier et al. 2000)
Forschungs- und Lehrkorpus Gesprochenes Deutsch (FOLK, German) [http://agd.ids-mannheim.de/html/folk.shtml] | FOLKER (XML format) | cGAT (Selting et al. 2009)
Corpus de Langues Parlées en Interaction (CLAPI, French) [http://clapi.univ-lyon2.fr/] | CLAPI XML (TEI variant 2) | ICOR (Groupe Icor 2007)
Swedish Spoken Language Corpus (Swedish) [http://www.ling.gu.se/projekt/old_tal/SLcorpus.html] | Göteborg text format | GTS (Nivre et al. 1999)

2 Undoubtedly, the corpora have a lot in common as far as their designs, research backgrounds, and envisaged uses are concerned. Still, as the table illustrates, not a single one of them is compatible with any of the others, neither in terms of digital file formats nor transcription conventions used. In order to carry out his study, the researcher will thus have to familiarise himself with eight different file formats, eight different transcription conventions and, if he is not able or willing to do a lot of data conversion, eight different techniques or tools for querying the different corpora. Obviously, the world of spoken language corpora1 is a fragmented one. The aim of this paper is to explore whether an approach based on the Guidelines of the TEI can help to overcome some of this fragmentation. In order for such an effort to be successful—that is, to really reduce the variation—I think that it is necessary to take the following factors into account:
• Since spoken language transcription is a very time-consuming process, it is crucial for transcribers to have their work supported by adequate computer tools. Any standardisation effort should therefore be compatible with the more widely used tool formats. This compatibility should manifest itself in something that can be used in practice, such as a conversion tool for exchanging data between a tool and the standard.
• The reason for variation among transcription conventions and tool formats can be pure idiosyncrasy, but it can also be motivated by real differences in research interests or theoretical approaches. A standardisation effort should carefully distinguish between these two types of variation and suggest unifications only for the former type.
• Not least because the line between the two types of variation cannot always be easily drawn, any standardisation effort should leave room for negotiations between the stakeholders (that is, authors and users of transcription conventions, and developers and users of transcription tools) involved. This paper therefore does not intend to ultimately define a standard but rather to identify and order relevant input to it and, on that basis, suggest a general approach to standardisation, the details of which are left to discussion.

3 Following these basic assumptions, the paper is structured as follows: Sections 2 and 3 look at two fundamentally different, but interrelated, things to standardise. Section 2 is concerned with the macro structure of transcriptions—that is, temporal information and information about classes of transcription and annotation entities (for example, verbal and non-verbal)—as defined in tool formats and data models. Section 3 is concerned with the micro structure of transcriptions—that is, names for, representations of, and relations between linguistic transcription entities like words, pauses, and semi-lexical entities. This is what a transcription convention usually defines. Both sections conclude with a suggestion of how to standardise commonalities between the different inputs with the help of the TEI. Section 4 then discusses some aspects of application—that is, ways of using the proposed standard format in practice.

2. Macro Structure and Tool Formats

4 Transcription tools support the user in connecting textual descriptions to selected parts of an audio or video recording. I will call the way in which such individual descriptions are organised into a single document the macro structure of a transcription. Transcription macro structures, and, consequently, the file formats used by the tools, usually remain on a relatively abstract, theory-neutral level. They are concerned with abstract categories for data organisation and with the temporal order of textual descriptions and their assignment to speakers, among other things, but they usually do not define any concrete entities derived from a theory of what should be transcribed (such as words and pauses). This latter task is delegated to transcription conventions (see the following section).2

2.1. Data Models: Commonalities and Differences

5 Disregarding word processors (like MS Word) and simple combinations of text editors and media players (like F4)3, the following seven tools are among the most commonly used for spoken language transcription:4
• ANVIL (Kipp 2001), a tool originally developed for studies of multimodal behaviour
• CLAN/CHAT (MacWhinney 2000), the tool and data format belonging to the CHILDES database, originally developed for transcription and coding of child language data
• ELAN (Wittenburg et al. 2006), a multi-purpose tool used, among other things, for documentation of endangered languages and sign-language transcription
• EXMARaLDA Partitur-Editor (Schmidt and Wörner 2009), a multi-purpose tool with a background in pragmatic discourse analysis, dialectology, and multilingualism research
• FOLKER (Schmidt and Schütte 2010), a transcription editor originally developed for the FOLK corpus for conversation analysis
• Praat (Boersma and Weenink 2010), software for doing phonetics by computer
• Transcriber (Barras et al. 2000), an editor originally developed for transcription of broadcast news

6 Although there are numerous differences in design and implementation of the tools, and although each tool reads and writes its own individual file format, their data models can all be understood as variants of the same base model. The basic entity of that data model is a time-aligned annotation—that is, a triple consisting of a start point, an end point, and a field containing the actual transcription or annotation.5 Further structure is added by partitioning the set of basic entities into a number of tiers and assigning tiers to a speaker and/or to a type. As Schmidt et al. (2009) have shown, this simple structure can be viewed as a common denominator of all tools, and it can be used to establish a basic interoperability between them.
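As a rough illustration, that common denominator can be sketched in generic XML; the element and attribute names here are deliberately neutral inventions and belong to no particular tool’s format:

<!-- a tier: time-aligned annotation triples (start, end, content) assigned to a speaker and a type -->
<tier speaker="DS" type="transcription">
  <annotation start="0.0" end="1.2">Okay.</annotation>
  <annotation start="1.2" end="3.5">Très bien, très bien.</annotation>
</tier>
<tier speaker="FB" type="transcription">
  <annotation start="2.4" end="5.8">Alors ça dépend ((cough)) un petit peu.</annotation>
</tier>

Everything a given tool adds on top of this (an explicit timeline, tier hierarchies, structured annotation values) can then be seen as a refinement of these basic triples.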

7 Beyond the common denominator, the tool models also differ in several details:
• Implicit vs. explicit timeline: In some models (like ANVIL and Praat), start and end points of the basic entities point directly to a time point in the recording. In other models (like EXMARaLDA and ELAN), they point to an external timeline—an ordered set of time points, which, in turn, can (but need not) have timestamps pointing into the recording.
• Speaker assignment of tiers: Some models (like EXMARaLDA and ELAN) allow (and sometimes require) tiers to be explicitly assigned to a speaker entity. Other models (like ANVIL and Praat), although they allow tiers to be characterised by a name and other features, do not have an explicit concept for speakers.
• Simple and structured annotations: In some models (like ANVIL and ELAN), the basic entities can have an internal structure, while in others (like EXMARaLDA and Praat), they always consist of simple text strings.
• Single layer and multi-layer: Some models (like FOLKER and Transcriber) provide a single tier for each speaker in which all annotation for that speaker has to be integrated. Other models allow multiple tiers for each speaker onto which annotations of different kinds (such as verbal vs. non-verbal or segmental vs. supra-segmental) can be distributed. In most models of the latter type, tier categories and semantics can be freely defined on the basis of a few abstract tier types (as in ANVIL, ELAN, EXMARaLDA, but see next point), whereas CLAN/CHAT predefines an extensive set of tier categories and a semantics for them.
• Tier types and dependencies: All multi-layer tools provide a system for classifying tiers according to their structure and semantics. The tier types can be associated with certain structural constraints on annotations within the respective tier or in relation to annotations in another tier. This often results in a tier hierarchy where one tier is regarded as primary and other tiers as subordinate to (or dependent on) the primary tier. No two tools use the same system of tier types, but there are some obvious commonalities and interrelations between the systems.

8 Schmidt et al. (2009) conclude that, “given that the diversity in tool formats is to a great part motivated by the different specializations of the respective tools”, a full assimilation of the different data models is neither theoretically desirable nor practically possible. However, the similarities between the data models clearly outweigh the differences. I would therefore like to argue that, at least for the purposes of this paper, it will be sufficient to declare one of the formats as a typical exponent of a class containing all the others, and use this typical exponent as the basis for a transformation to TEI. The fact that EXMARaLDA has conversion filters for importing the formats of all the other tools shows that this assumption is not only true in theory, but can also be put to use in practice. In what follows, I will therefore use EXMARaLDA’s data model as a representative of all the other tools.

2.2. EXMARaLDA’s Data Model and Format

9 Concerning the above parameters, EXMARaLDA’s data model has an explicit timeline, allows speaker assignment of tiers, uses only simple annotations, allows multi-layer annotations, and distinguishes three tier types which I will illustrate with the help of the following example. Figure 1 shows a transcription as displayed by the EXMARaLDA Partitur-Editor.

Figure 1: Example transcription as displayed in the EXMARaLDA Partitur-Editor with a waveform representation of the recording (top) and a musical score representation of the transcription (bottom). Annotations (white fields in the musical score) are assigned to tiers (“rows” of the score) and intervals of the timeline (“columns” of the score). The tiers are labelled with abbreviations for the corresponding speakers (“DS” and “FB”) and with a category (“sup”, “v”, etc.).

10 The transcription consists of twelve annotation triples, organised into seven tiers, each of which is attributed to one of two distinct speakers (DS and FB), one of five distinct (freely definable) categories (sup, v, en, nv and pho) and one of three (predefined) tier types. Note that the same mechanism—assigning identical start and end points to the respective annotations—is used to represent both temporal simultaneity (as in the speaker overlap between “très bien” and “Alors ça”) and semantic equivalence (as between the orthographic transcription “un petit peu” and its phonetic counterpart “[ɛ̃tipø:]”). Figure 2 gives a schematic representation of the underlying data model.

Figure 2: Schematic representation of the EXMARaLDA data model

11 Tiers of type T(RANSCRIPTION) contain the primary information—that is, the transcription of words uttered by the respective speaker alongside descriptions of non-phonological phenomena (such as coughing and pauses) which are alternative (rather than simultaneous) to the actual speech. Tiers of type A(NNOTATION) contain information which is dependent on the primary tiers. For instance, the tiers of category en contain English translations of the speakers’ French utterances, whereas the tier of category sup contains annotations which describe suprasegmental features of transcribed words. Finally, in tiers of type D(ESCRIPTION), secondary information, which is independent of the transcribed words etc., can be entered. In the example, the tier of category nv contains an annotation for a non-verbal action by speaker DS. The data model has the following simple constraints with respect to tier types:

1. Tiers of type T and A must be attributed to a speaker (if a tier of type A and a tier of type T are attributed to the same speaker, the latter is the parent tier of the former).

2. There has to be exactly one tier of type T for each speaker, but there can be any number of tiers of type A and D.

3. For each annotation in a tier of type A, there must be an annotation or a connected sequence of annotations in the parent tier with the same start and end point.

12 As illustrated in figure 3, EXMARaLDA represents this data model in an XML file which hierarchically organises individual annotations (<event> elements) into tiers (<tier> elements). All other structural relations, in particular the assignment of annotations to points in the timeline and the assignment of tiers to speakers, are not expressed in the document hierarchy, but with the help of pointers to @id attributes.

Figure 3: XML representation of an EXMARaLDA transcription (simplified)

<basic-transcription>
  <head>
    <speakertable>
      <speaker id="DS"/>
      <speaker id="FB"/>
    </speakertable>
  </head>
  <basic-body>
    <common-timeline>
      <tli id="T0"/> <tli id="T1"/> <tli id="T2"/> <tli id="T3"/>
      <tli id="T4"/> <tli id="T5"/> <tli id="T6"/> <tli id="T7"/>
    </common-timeline>
    <tier speaker="DS" category="sup" type="a">
      <event start="T1" end="T3">faster</event>
    </tier>
    <tier speaker="DS" category="v" type="t">
      <event start="T0" end="T1">Okay.</event>
      <event start="T1" end="T3">Très bien, très bien.</event>
      <event start="T5" end="T6">Ah oui ?</event>
    </tier>
    <tier speaker="DS" category="en" type="a">
      <event start="T0" end="T1">Okay.</event>
      <event start="T1" end="T3">Very good, very good.</event>
    </tier>
    <tier speaker="DS" category="nv" type="d">
      <event start="T3" end="T5">right hand raised</event>
    </tier>
    <tier speaker="FB" category="v" type="t">
      <event start="T2" end="T4">Alors ça dépend</event>
      <event start="T4" end="T5">((cough))</event>
      <event start="T5" end="T7">un petit peu.</event>
    </tier>
    <tier speaker="FB" category="en" type="a">
      <event start="T2" end="T7">That depends, then, a little bit</event>
    </tier>
    <tier speaker="FB" category="pho" type="a">
      <event start="T5" end="T7">[ɛ̃tipø:]</event>
    </tier>
  </basic-body>
</basic-transcription>

13 While this has proven a practically adequate representation of the data model for the purposes of the EXMARaLDA editor (and similar XML formats are used, for instance, by ANVIL and ELAN), it has some obvious drawbacks from the point of view of XML-based data modelling and processing:
• The document order of individual annotations does not match the order in which the corresponding phenomena occur in the transcribed discourse.
• Likewise, elements having a close semantic relationship, like the orthographic and phonetic transcriptions in the last two tiers, are not necessarily close to one another in the document.
• The dependency between annotations in tiers of type T and tiers of type A is not explicitly represented in the document structure.
• Since the division of annotations is motivated by the temporal structure of the discourse, the boundaries of individual annotation elements may cut through linguistic entities. This is the case, for example, for the utterance “Alors ça dépend ((cough)) un petit peu.”, which is distributed across three <event> elements in order to enable the representation of different simultaneity relations in the discourse.

14 One resulting disadvantage is that certain XML techniques (like XPath queries) can become inefficient for such documents because the techniques are optimised for processing tree structures, whereas the principal structure of the document is not represented in the document tree. Another disadvantage is that the (manual) insertion of additional markup, such as with the help of a standard XML editor, becomes difficult because the elements of the document do not behave as in a “normal” (i.e. written) text. As a basis for a transformation to a TEI-conformant form, this kind of document organisation is thus not ideal. A first question on the way to a TEI-based standardisation therefore is whether an equivalent XML representation of the data model can be found which does not suffer from the same drawbacks.
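To give a sense of the problem, a query against the representation in figure 3 that asks for the English translation aligned with the verbal event “Très bien, très bien.” cannot follow the document tree; it has to join on pointer values, roughly like this (an XPath sketch assuming the element and attribute names of the EXMARaLDA format shown above):

//tier[@speaker='DS'][@category='en']/event
    [@start = //tier[@speaker='DS'][@category='v']
               /event[. = 'Très bien, très bien.']/@start]

Every such query re-resolves timeline pointers instead of exploiting element containment, which is why processing techniques optimised for tree structures gain little traction here.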

2.3. A TEI Representation of EXMARaLDA’s Data Model

15 My suggestion is to derive such an equivalent representation on the basis of the concept of a segment chain. With respect to the EXMARaLDA data model, a segment chain can be defined as any maximally long, temporally connected sequence of annotations in a tier of type T. The above example contains three such segment chains, marked with grey boxes in figure 4.

Figure 4: Combining annotations into segment chains

16 These segment chains—which loosely correspond to an entity often called a turn or a speaker contribution—have three important structural properties:
• They are implicitly contained in the data model and can be automatically derived from it.
• They re-combine the character data of linguistic entities (words and utterances) from tiers of type T, which were separated in the data model due to temporal considerations (temporal overlap of annotations), into a superordinate entity.
• Since annotations in tiers of type A will, by definition, not cross the boundaries of such segment chains, each such annotation can be assigned to exactly one segment chain.

17 Subsuming all annotations in tiers of type A under “their” segment chain and ordering segment chains by their start points, a document can thus be constructed whose document order is globally analogous to the actual sequence of events in the transcribed discourse, whose elements locally behave like normal written text, and in which dependent annotations are grouped together with the annotations they depend on.

18 Chapters 3 (Elements Available in All TEI Documents), 4 (Default Text Structure), 8 (Transcriptions of Speech), 16 (Linking, Segmentation, and Alignment) and 17 (Simple Analytic Mechanisms) of the P5 Guidelines furnish all the elements necessary to represent such a document in TEI. More specifically, the following elements can be used:
• <person> inside a <particDesc> to define speakers
• <when> inside a <timeline> to define the timeline
• <div> to group segment chains and corresponding annotations
• <u> to represent the actual segment chains6 with a @who attribute assigning this element (and its siblings) to a speaker
• <anchor> inside <u> with @synch attributes pointing to <when> elements to represent the internal temporal structure of a segment chain
• <spanGrp> to group annotations of the same type (i.e. coming from the same tier)
• <span> inside <spanGrp> with @from and @to attributes to represent dependent annotations and their position in the timeline
• <incident> to represent the remaining annotations coming from tiers of type D

19 Figure 5 shows a TEI-conformant document which uses these elements and is equivalent to the document in figure 3.7

Figure 5: TEI representation equivalent to the representation in figure 3 (simplified, see Appendix for the full version)

<particDesc>
  <person xml:id="DS"><persName>DS</persName></person>
  <person xml:id="FB"><persName>FB</persName></person>
</particDesc>
<timeline>
  <when xml:id="T0"/> <when xml:id="T1"/> <when xml:id="T2"/> <when xml:id="T3"/>
  <when xml:id="T4"/> <when xml:id="T5"/> <when xml:id="T6"/> <when xml:id="T7"/>
</timeline>
<body>
  <div>
    <u who="#DS">
      <anchor synch="#T0"/>Okay. <anchor synch="#T1"/>Très bien, très bien.<anchor synch="#T3"/>
    </u>
    <spanGrp type="sup">
      <span from="#T1" to="#T3">faster</span>
    </spanGrp>
    <spanGrp type="en">
      <span from="#T0" to="#T1">Okay.</span>
      <span from="#T1" to="#T3">Very good, very good.</span>
    </spanGrp>
  </div>
  <div>
    <u who="#FB">
      <anchor synch="#T2"/>Alors ça dépend <anchor synch="#T4"/>((cough)) <anchor synch="#T5"/>un petit peu.<anchor synch="#T7"/>
    </u>
    <spanGrp type="en">
      <span from="#T2" to="#T7">That depends, then, a little bit</span>
    </spanGrp>
    <spanGrp type="pho">
      <span from="#T5" to="#T7">[ɛ̃tipø:]</span>
    </spanGrp>
  </div>
  <incident who="#DS" start="#T3" end="#T5">
    <desc>right hand raised</desc>
  </incident>
  <div>
    <u who="#DS"><anchor synch="#T5"/>Ah oui ?<anchor synch="#T6"/></u>
  </div>
</body>

3. Micro Structure and Transcription Conventions

20 If, as described above, the macro structure of a transcription is concerned with the way textual elements are organised and put into relation with one another in a transcription document, the micro structure of a transcription can be said to specify the form and semantics of the textual elements themselves. Whereas macro structure is defined by tool developers and represented in file format specifications, micro structure is defined by transcribing linguists and represented in transcription conventions. There are numerous, if not countless, such conventions, most of which are specific to a single corpus or project and have never been published for a larger audience. Among those that have been published in some form or other are the following:
• HIAT (Halbinterpretative Arbeitstranskriptionen: Ehlich and Rehbein 1976; Rehbein et al. 2004), a system widely used in the functional pragmatics research community
• GAT (Gesprächsanalytisches Transkriptionssystem: Selting et al. 2009), a system widely used in the German conversation analysis research community for the transcription of German, and cGAT (Schmidt and Schütte 2010), an adaptation of GAT used for transcription in the FOLK corpus
• CHAT (Codes for the Human Analysis of Transcripts: MacWhinney 2000), a system widely used in the child language research community, and CA-CHAT, an adaptation of CHAT to CA (conversation analysis; Sacks et al. 1978) for use in conversation analysis
• DT1 (Discourse Transcription: DuBois et al. 1993), a system used for the transcription of the Santa Barbara Corpus of Spoken American English
• Convention ICOR (Groupe ICOR 2007), a system used for the French CLAPI database
• GTS (Göteborg Transcription Standard: Nivre et al. 1999), a system used for the Spoken Swedish Corpus at Göteborg University

21 As will be detailed in the following subsections, these conventions have a lot in common. Although some of them claim to be “unified systems” (GAT) or even “standards” (GTS), they exist more or less independently of one another. In contrast to the situation with tool formats, there have been few attempts to establish


“interoperability” between transcription conventions; real standardisation efforts have, to my knowledge, not been undertaken at all. The present paper is not the place to carry out the full comparative analysis of the systems that such a standardisation effort would require. Instead, I will restrict myself to discussing some commonalities and differences by means of examples, working under the assumption that the same method can be transferred to other aspects of the systems. Schmidt (2005a) carries out a more comprehensive and detailed analysis of two of the systems mentioned here (HIAT and GAT).

3.1. Commonalities and Differences

22 Perhaps the most fundamental commonality among the conventions is that they all take standard written orthography as their point of departure when motivating and explaining their rules for representing spoken language in the written medium. An important consequence of this is that the entity “word” is present in all the conventions with more or less the same meaning, namely that of a word as defined by standard orthography. Two other basic entities shared by all the conventions are unfilled pauses and audible non-speech events like breathing, laughing or coughing. Furthermore, all of the conventions specify ways to represent uncertainty in transcription (sometimes with the possibility of providing alternatives to an uncertain part) and to represent incomprehensible passages. I will call these five elements the basic building blocks of transcription conventions.

23 Another class of entities to be found in most systems consists of prosodic characterisations of words or parts thereof. This class can comprise phenomena like (emphatic) stress or lengthening of syllables. Finally, most systems define entities which summarise words and other basic building blocks into larger units analogous (but explicitly not identical) to the sentence in written language.

24 Taking these commonalities as a starting point, I will illustrate some important differences between the conventions using the set of examples in figure 6 in which a fictitious stretch of speech is transcribed according to five different transcription systems.8

Figure 6: Transcriptions of the same stretch of speech according to five different conventions

HIAT: ((coughs)) You must/ you (should) let • it be. ((laughs)) Pleease!

GAT: ((coughs)) you must- you (should/could) let (-) it be; ((laughs)) plea:se-

CHAT: &=coughs you must... you should let # it be. &=laughs please!

DT1: (COUGH) you must-- you let .. it be. @@ please?

cGAT: ((coughs)) you must you (should/could) let (-) it be ((laughs)) please


25 Obviously, some variation is due only to symbolic differences among the conventions. Thus, HIAT, GAT and cGAT describe non-verbal incidents (“coughs”) in double parentheses, whereas CHAT marks such descriptions with the prefixed symbols &= and DT1 chooses capital letters between single parentheses and, additionally, has special predefined symbols for certain such incidents (laughing is represented by the symbols @@). Similarly, each system has its own symbol(s) for representing a short, unmeasured pause: the bullet • in HIAT, the symbols (-) in GAT and cGAT, the hash sign # in CHAT, and two full stops (periods) in DT1.

26 The conventions also vary in what phenomena are represented in the transcription. Thus, the lengthening of the vowel in the word “please” is indicated in HIAT through a reduplication of the vowel symbol and in GAT through the insertion of a colon (this being another case of symbolic variation), but it is not represented at all in the other three systems. Similarly, transcriber uncertainty with respect to a given word can be marked in HIAT, GAT, cGAT and DT1 (through single parentheses in the first three and through a pair of angle brackets in the latter), but only GAT and cGAT also provide the possibility of specifying one or more alternative transcriptions for an uncertain word (added inside the parentheses after a slash).

27 While the symbolic and other variation discussed so far remains on the level of basic building blocks, a last type of variation is more complex and concerns the way basic transcription units are organised into larger structures. This type of variation is visible in the punctuation symbols used in figure 6, specifically:
• HIAT divides the stretch of speech into two entities called utterances. Utterances are pragmatic units of speech, identified and classified according to function-based criteria, most importantly their mood. The first utterance is terminated by a full stop (period), indicating that it is in declarative mood, while the second is terminated by an exclamation point, marking its mood as exclamative. A third punctuation symbol—the forward slash after the word “must”—indicates a self-repair but does not act as an utterance terminator. Note that, in contrast to all other systems, HIAT capitalises words at the beginning of utterances.
• GAT divides the same stretch of speech into three entities called intonation phrases. Intonation phrases are prosodic units of speech, identified and classified according to form-based criteria, most importantly their intonation contour. The first and third intonation phrases are terminated by a hyphen, indicating a level final pitch movement. The second intonation phrase is terminated by a semicolon, which stands for a falling final pitch movement.
• CHAT proceeds similarly to HIAT, but has three utterances instead of two. The first is terminated by an ellipsis symbol (three dots), marking it as an interrupted utterance. The other two are marked by a full stop (period) and an exclamation point, making them declarative and emphatic, respectively.
• The corresponding entities in DT1 are called intonation units. The first is terminated by two hyphens (an interrupted intonation unit), the second by a full stop (period) (a terminative intonation unit), and the third by a question mark (an “appeal”).
• cGAT, finally, does not group basic building blocks into larger entities at all.

28 If the information codified in transcription conventions is to be standardised, these different kinds of variation between the systems must be taken into account. Ideally, a standard should ensure that pure symbolic variation is harmonised by mapping different surface forms onto a single standard form, and that all other variation is expressed in a manner that conserves the original diversity while still making it possible to process transcriptions from different sources on a common basis.
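At the level of pure symbolic variation, such a harmonisation can be as simple as a lookup table. The following sketch illustrates this for the five conventions' short-pause symbols discussed above; the TEI rendering chosen here (<pause type="short"/>) is an assumption for illustration, not a prescribed form.

    # Each convention's surface symbol for a short, unmeasured pause
    # (cf. the discussion of figure 6) maps onto one standard form.
    SHORT_PAUSE_SYMBOLS = {
        "HIAT": "•",
        "GAT":  "(-)",
        "cGAT": "(-)",
        "CHAT": "#",
        "DT1":  "..",
    }

    def to_tei(convention, symbol):
        if symbol == SHORT_PAUSE_SYMBOLS[convention]:
            return '<pause type="short"/>'   # assumed standard rendering
        raise ValueError("not a short-pause symbol in " + convention)

    print(to_tei("HIAT", "•"))  # -> <pause type="short"/>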

29 I think that the TEI Guidelines furnish all the necessary elements for such a standardisation; at least the following elements from chapters 3 (Elements Available in All TEI Documents), 4 (Default Text Structure), 8 (Transcriptions of Speech) and 17 (Simple Analytic Mechanisms) will be necessary to adequately represent transcriptions according to any of the above conventions:
• <w> and <c> to mark up individual words and punctuation characters (unless the semantics of a punctuation character is already represented through another mechanism in the markup), possibly with an attribute @type to characterise a word as a repaired form, as an assimilated form, etc., or to note that a character represents a lengthened phoneme
• <pause> with a @dur attribute and <incident> with a <desc> child to represent pauses and non-speech events
• <unclear> elements, possibly with a superordinate <choice> element, to represent uncertain transcriptions and alternatives
• <seg> elements with a @function attribute to provide the general name for such units in the respective conventions (such as utterance vs. intonation unit) and a @type attribute to capture the specific characterisation of that unit (such as declarative vs. interrupted)

30 Using these elements, the <u> elements in the example from figure 5 (which follows the HIAT convention) could be marked up as shown in figure 7.

Figure 7: TEI marked up version (according to HIAT) of the transcription from figure 5 (simplified)


31 If the same stretch of speech is transcribed according to different conventions, the resulting TEI markup will be the same with respect to elements like <w>, <pause>, and <incident>, where there is only symbolic variation, but it can differ with respect to elements like <seg>, where there is a “real” difference between the systems. Figure 8 shows possible markup for three of the examples from figure 6.

Figure 8: TEI markup for examples from figure 6 (simplified)

(Panels: CHAT, DT1, cGAT; the TEI XML of the figure is not reproduced here.)

4. Application of this Standard Format

32 Having defined a proposal for a TEI-based standard, I will now turn to the question of how to use it in practice. Most importantly, this means thinking of ways in which transcribers can efficiently produce standard-conformant transcriptions. Ideally, they will continue to be able to use the tools they are familiar with and to focus on the transcription task itself rather than on issues related to XML and TEI encoding.

33 These requirements are relatively easy to meet as far as the macro structure of transcriptions is concerned: the format illustrated in figure 5 is isomorphic to EXMARaLDA’s tool format. This format, in turn, is compatible to a large extent with all the other tool formats mentioned in Section 2 because of the import and export routines built into EXMARaLDA and several other tools. By virtue of transitivity, making all tools compatible with the format in figure 5 is therefore simply a matter of defining a one-to-one mapping between one tool format and the TEI format. In order to ensure maximal portability, this mapping should be accomplished with an XML-only approach using XSLT stylesheets. XSL stylesheets which transform an EXMARaLDA transcription into an equivalent TEI representation and vice versa have been made available on the EXMARaLDA website at http://www.exmaralda.org/tei.html. The stylesheets have also been integrated into the EXMARaLDA editor, where the transformations can be carried out using the tool’s import and export functions.


For formats from other tools, either a direct mapping could be defined in an analogous manner, or EXMARaLDA could be used as an intermediary representation.
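For readers who wish to script the conversion outside the EXMARaLDA editor, applying such a stylesheet is a few lines in most XSLT processors. The following sketch uses Python's lxml; the file names are placeholders, not the actual names of the stylesheets distributed on the EXMARaLDA website.

    from lxml import etree

    # Load the (placeholder) EXMARaLDA-to-TEI stylesheet and apply it
    # to an EXMARaLDA basic transcription file.
    transform = etree.XSLT(etree.parse("exmaralda2tei.xsl"))
    tei = transform(etree.parse("interview.exb"))

    # str() serialises the result according to the stylesheet's xsl:output.
    with open("interview.tei.xml", "w", encoding="utf-8") as out:
        out.write(str(tei))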

34 The requirements are harder to meet for the micro structure of transcriptions. Most commonly used tools (FOLKER being an exception) do not provide a way of directly representing micro structure in their file formats. While the markup expressing the micro structure could be added manually in a generic XML editor after a tool’s format has been converted to the TEI representation of figure 5, this procedure would be rather inefficient since it requires a second tedious manual processing step after the actual transcription has been completed. A more efficient way is to automatically derive the micro structure markup from the regularities formulated inside the transcription conventions. This is possible if we interpret some of the symbols defined by a convention as an implicit (and non-standardised) markup and formulate an algorithm—a parser—to transform this implicit markup into explicit, TEI-conformant XML markup. Figure 9 exemplifies this process for the HIAT example from figures 5 and 7.

Figure 9: Parsing for micro-structure

1. Unparsed <u>: Alors ça dépend ((cough)) un petit peu.

2. Character data of unparsed <u>: Alors ça dépend ((cough)) un petit peu.

3. Parsing: transforming implicit to explicit markup: Alors␣ça␣dépend␣((cough))␣un␣petit␣peu.␣9

4. Reinserting anchors

35 The implicit markup in this case consists of spaces indicating word boundaries, double parentheses indicating non-phonological descriptions, and the full stop (period) indicating and qualifying an utterance boundary. Of course, in order for the parsing algorithm to work reliably, the symbols interpreted as implicit markup must have been rigidly and unambiguously defined in the respective convention. Luckily, all conventions claim to ensure this unambiguity in their choice of transcription symbols.10 The parsing algorithm can then, in principle, be implemented in any technology and does not need to take any prescribed form as long as it produces correct output (a well-formed, TEI-compliant XML fragment) for correct input (a string following the rules of a given transcription system).11 EXMARaLDA has built-in parsing algorithms for HIAT, GAT, cGAT and CHAT which are implemented as finite-state transducers in Java, showing that a very simple parsing technique can be sufficient to deal with several of the transcription conventions mentioned above.
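To make the idea concrete, here is a deliberately naive sketch of such a parser for the three kinds of HIAT implicit markup just discussed (spaces, double parentheses, and utterance-final punctuation). It is a toy illustration, not the finite-state transducers actually shipped with EXMARaLDA, and the attribute values are assumptions for illustration.

    import re
    import xml.etree.ElementTree as ET

    MOODS = {".": "declarative", "!": "exclamative"}  # HIAT utterance-final symbols

    def parse_hiat_utterance(text):
        """Toy parser for one HIAT utterance: spaces delimit words (<w>),
        ((...)) marks non-speech incidents (<incident>/<desc>), and the
        final punctuation symbol is mapped onto the @type of a <seg>."""
        text = text.strip()
        mood = MOODS.get(text[-1], "unknown")
        seg = ET.Element("seg", function="utterance", type=mood)
        for token in re.findall(r"\(\(.*?\)\)|\S+", text.rstrip(".!")):
            if token.startswith("(("):
                incident = ET.SubElement(seg, "incident")
                ET.SubElement(incident, "desc").text = token.strip("()")
            else:
                ET.SubElement(seg, "w").text = token
        return seg

    print(ET.tostring(parse_hiat_utterance("Alors ça dépend ((cough)) un petit peu."),
                      encoding="unicode"))
    # -> <seg function="utterance" type="declarative"><w>Alors</w>...
    #    <incident><desc>cough</desc></incident>...<w>peu</w></seg>

A full implementation would also have to cover repair markers, pauses, uncertain passages, and the other building blocks discussed in section 3.1.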

36 Transforming a tool format to a corresponding TEI format in which both macro and micro structure are represented is thus a two-step process. First, a generic TEI document is produced in which only the macro structure is represented. Second, a parsing algorithm is applied which adds markup for the micro structure. Figure 10 gives a schematic illustration of the transformation workflow.12


Figure 10: Transformation workflow from tool format to parsed TEI document

37 In order to make this transformation workflow available to users in a maximally accessible way, we have written a Java droplet which takes as input any CHAT, ELAN, EXMARaLDA, FOLKER or Transcriber transcription file and transforms it to a TEI file using a set of parameters—the parsing algorithm to be used among them—specified by the user. Figure 11 shows a screenshot of that application, which will be made freely available as a part of the EXMARaLDA tool package.

Figure 11: Screenshot of TEI Drop


5. Summary, Conclusion and Outlook

38 In this paper, I have formulated a proposal for standardising spoken language transcription with the help of the TEI Guidelines. The proposal consists of two principal components. First, a TEI-conformant format is defined that is structurally equivalent to the formats written by several widely used transcription tools and which represents the macro structure of the transcription in a form well suited for standard XML processing. Second, implicit markup contained in the character data of such documents is transformed into explicit, TEI-conformant markup using a parsing algorithm that embodies the formal regularities of a transcription convention. The resulting document then represents both the macro and the micro structure of the transcription in a TEI-compliant way. A droplet application enables users to carry out the transformation from tool format to TEI format, and the parsing of the TEI format according to a specific transcription convention, in a user-friendly way.

39 The route to standardisation formulated here can be viewed as a synthesis of work in three areas related to spoken language transcription: tool development, TEI encoding, and transcription conventions. All three can be said to have as one of their goals unification or harmonisation of similar practices, but each of them foregrounds a different aspect in that goal.

40 Tool developers usually aim at defining data models and formats which are both general and flexible enough to be used for different data types and different research interests while at the same time specific enough to allow for efficient processing of the data. As the present paper has shown, the solutions they have developed to meet these requirements are sufficiently interoperable to become the first ingredient of the standardisation effort.

41 The goal of the TEI is to provide a common tag set for the representation of texts in digital form where spoken language transcriptions are simply viewed as “texts of a special kind”. Again, the present paper has shown that the existing solutions—as formulated in the P5 version of the Guidelines—are comprehensive and detailed enough to adequately represent commonalities and differences between transcription formats and conventions. They can thus become the second ingredient of the standardisation effort.

42 The situation is a little less clear for the third ingredient, the transcription conventions. Here, the present paper has shown—as a proof of concept at least—that existing conventions are sufficiently systematic to become the basis for a parsing algorithm. However, the formalisations required to derive such an algorithm are usually not explicitly defined inside the conventions but have to be inferred from a potentially error-prone interpretation of an informal text. Likewise, the distinction drawn here between symbolic and other variation among transcription conventions, though arguably very important for standardisation, is not a topic that the conventions themselves deal with at any great length. It seems, therefore, that in this area the idea of formal standardisation has not yet gained as much ground as in the area of tools and the TEI. If the approach suggested here is to become the basis of a full-grown standard, most of the remaining work will probably lie in standardising transcription conventions.

43 If we assume that such a full-grown standard can be agreed upon eventually, the task of the example researcher from the introduction will become considerably easier. He will be dealing with only a single format, which rests on a well-defined and well-documented basis, namely the TEI Guidelines. Inside that format, pure symbolic variation between different transcription conventions will be levelled out, and “genuine” theory-motivated variation will be retained in a manner which facilitates common processing of data from different sources. Moreover, new data in the same form will easily be produced because transcribers will continue to use established technology and established conventions for their task. Last but not least, the fact that the proposed standard for spoken language transcription draws from the same set of TEI elements as many other actual or proposed standards in the field of written language, such as the Corpus Encoding Standard (CES, see http://www.xces.org/), opens up the potential for common processing of spoken and written data.

BIBLIOGRAPHY

Barras, C., E. Geoffrois, Z. Wu, and M. Liberman. 2001. “Transcriber: Development and Use of a Tool for Assisting Speech Corpora Production.” Speech Communication 33: 5–22.

Bird, S., and M. Liberman. 2001. “A Formal Framework for Linguistic Annotation.” Speech Communication 33: 23–60.

Boersma, P., and D. Weenink. 2010. “Praat: Doing Phonetics by Computer.” Version 5.1.05. http://www.praat.org/.

Crowdy, S. 1995. “The BNC spoken corpus.” In Spoken English on computer: transcription, mark-up and application, ed. G. Leech, G. Myers and J. Thomas, 224–235. Harlow: Longman.

DuBois, J., S. Schuetze-Coburn, S. Cumming, and D. Paolino. 1993. “Outline of Discourse Transcription.” In Talking Data: Transcription and Coding in Discourse Research, ed. J. Edwards and M. Lampert, 45–89. Hillsdale, NJ: Erlbaum.

Ehlich, K. and J. Rehbein. 1976. “Halbinterpretative Arbeitstranskriptionen (HIAT).” Linguistische Berichte 45: 21–41.

Ehlich, K. 1993. “HIAT: A Transcription System for Discourse Data.” In Talking Data: Transcription and Coding in Discourse Research, ed. J. Edwards and M. Lampert, 123–148. Hillsdale, NJ: Erlbaum.

Goedertier, W., S. Goddijn, and J.-P. Martens. 2000. “Orthographic Transcription of the Spoken Dutch Corpus.” In Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2000), 909-914. http://lands.let.kun.nl/old/cgn.old/2000_01.pdf.

Groupe ICOR. 2007. Convention ICOR. Lyon: Université de Lyon. http://icar.univ-lyon2.fr/documents/ICAR_Conventions_ICOR_2007.doc.

Kipp, M. 2001. “Anvil: A Generic Annotation Tool for Multimodal Dialogue.” In Proceedings of the 7th European Conference on Speech Communication and Technology (Eurospeech), 1367–1370. http://www.dfki.de/~kipp/public_archive/kipp2001-eurospeech.pdf.

MacWhinney, B. 2000. The CHILDES Project: Tools for Analyzing Talk. Mahwah, NJ: Erlbaum. http://childes.psy.cmu.edu/manuals/chat.pdf.

Journal of the Text Encoding Initiative, Issue 1 | June 2011 56

Nivre, J., J. Allwood, L. Grönqvist, E. Ahlsén, M. Gunnarsson, J. Hagman, S. Larsson, and S. Sofkova. 1999. Göteborg Transcription Standard. Version 6.3. Department of Linguistics, Göteborg University. http://www.ling.gu.se/projekt/tal/doc/transcription_standard.html.

Rehbein, J., T. Schmidt, B. Meyer, F. Watzke, and A. Herkenrath. 2004. “Handbuch für das computergestützte Transkribieren nach HIAT.” Working Papers in Multilingualism 56: 1–78. http://www.exmaralda.org/files/azm_56.pdf.

Sacks, H., E. Schegloff, and G. Jefferson. 1978. “A Simplest Systematics for the Organization of Turn Taking for Conversation.” In Studies in the Organization of Conversational Interaction, ed. J. Schenkein, 7–56. New York: Academic Press.

Selting, M., P. Auer, D. Barth-Weingarten, J. Bergmann, P. Bergmann, K. Birkner, E. Couper-Kuhlen, A. Deppermann, P. Gilles, S. Günthner, M. Hartung, F. Kern, C. Mertzlufft, C. Meyer, M. Morek, F. Oberzaucher, J. Peters, U. Quasthoff, W. Schütte, A. Stukenbrock, and S. Uhmann. 2009. “Gesprächsanalytisches Transkriptionssystem 2 (GAT 2).” Gesprächsforschung 10: 353–402. http://www.gespraechsforschung-ozs.de/heft2009/px-gat2.pdf.

Schmidt, T. 2005a. Computergestützte Transkription: Modellierung und Visualisierung gesprochener Sprache mit texttechnologischen Mitteln. Frankfurt: Peter Lang.

Schmidt, T. 2005b. “Time-based Data Models and the Text Encoding Initiative's Guidelines for Transcription of Speech.” Working Papers in Multilingualism 62: 1–32. http://www.exmaralda.org/files/SFB_AzM62.pdf.

Schmidt, T., S. Duncan, O. Ehmer, J. Hoyt, M. Kipp, M. Magnusson, T. Rose, and H. Sloetjes. 2009. “An Exchange Format for Multimodal Annotations.” In Multimodal Corpora, ed. M. Kipp, J.-C. Martin, P. Paggio, and D. Heylen, 207–221. Berlin: Springer.

Schmidt, T. and W. Schütte. 2010. “FOLKER: An Annotation Tool for Efficient Transcription of Natural, Multi-party Interaction.” In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010), 2091–2096. http://www.exmaralda.org/files/LREC_Folker.pdf.

Schmidt, T. and K. Wörner. 2009. “EXMARaLDA: Creating, analysing and sharing spoken language corpora for pragmatic research.” Pragmatics 19: 565–582.

Wittenburg, P., H. Brugman, A. Russel, A. Klassmann, and H. Sloetjes. 2006. “ELAN: A Professional Framework for Multimodality Research.” In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC 2006). http://www.lat-mpi.eu/papers/papers-2006/elan-paper-final.pdf.

APPENDIXES

Appendix: Full example of (unparsed) TEI transcription


  </titleStmt>
  <publicationStmt>
    <p/>
  </publicationStmt>
  <sourceDesc>
    <recordingStmt>
      <!-- the recording to which the transcription refers -->
      <!-- it was necessary to introduce an attribute @url here -->
      <!-- so that the actual digital file could be referenced -->
      <recording type="audio" url="./PaulMcCartney.wav"/>
    </recordingStmt>
  </sourceDesc>
</fileDesc>
<profileDesc>
  <particDesc>
    <person xml:id="SPK0" sex="1">
      <persName>
        <abbr>DS</abbr>
      </persName>
    </person>
    <person xml:id="SPK1" sex="0">
      <persName>
        <abbr>FB</abbr>
      </persName>
    </person>
  </particDesc>
</profileDesc>
<revisionDesc>
  <change when="2011-01-19T13:41:42.515+01:00">Created by XSL transformation from an EXMARaLDA basic transcription</change>
</revisionDesc>
</teiHeader>

<text>
  <!-- timeline with timepoints used as anchors inside the transcription -->
  <!-- the absolute times are offsets into the recording specified above -->
  <timeline unit="s" origin="#T1">
    <when xml:id="T1" absolute="00:00:00"/>
    <when xml:id="T2" absolute="00:00:00.4"/>
    <when xml:id="T3" absolute="00:00:00.9"/>
    <when xml:id="T4" absolute="00:00:01.4"/>
    <when xml:id="T6" absolute="00:00:02"/>
    <when xml:id="T5" absolute="00:00:02.3"/>
    <when xml:id="T7" absolute="00:00:02.6"/>
    <when xml:id="T0" absolute="00:02:56.96"/>
  </timeline>
  <body>
    <!-- the first segment chain -->
    <div>
      <!-- the transcribed text from the primary tier -->
      <u who="#SPK0">
        <anchor synch="#T1"/>Okay.
        <anchor synch="#T2"/>Très bien,
        <anchor synch="#T3"/>très bien.
        <anchor synch="#T4"/>
      </u>
      <!-- additional annotations from a sup (=suprasegmentals) tier -->
      <spanGrp type="sup">
        <span from="#T2" to="#T4">faster</span>
      </spanGrp>
      <!-- additional annotations from an en (=English translation) tier -->
      <spanGrp type="en">
        <span from="#T1" to="#T2">Okay. </span>
        <span from="#T2" to="#T4">Very good, very good.</span>
      </spanGrp>
    </div>
    <!-- the second segment chain -->
    <div>
      <u who="#SPK1">
        <anchor synch="#T3"/>Alors ça
        <anchor synch="#T4"/>dépend ((cough))
        <anchor synch="#T6"/>un petit peu.
        <anchor synch="#T5"/>
      </u>
      <spanGrp type="en">
        <span from="#T3" to="#T5">That depends, then, a little bit</span>
      </spanGrp>
      <spanGrp type="pho">
        <span from="#T6" to="#T5">[ɛ̃tipø:]</span>
      </spanGrp>
    </div>
    <!-- an incident from a nv (=nonverbal) tier describing nonverbal behaviour -->
    <incident who="#SPK0" type="nv" start="#T3" end="#T6">
      <desc>right hand raised</desc>
    </incident>
    <!-- the third segment chain -->
    <div>
      <u who="#SPK0">
        <anchor synch="#T5"/>Ah oui?
        <anchor synch="#T7"/>
      </u>
    </div>
  </body>
</text>
</TEI>

NOTES

1. And the examples in Table 1 are still homogeneous insofar as they are all orthographically (rather than phonetically) transcribed corpora of spontaneous (rather than read or prompted), multi-party (rather than monological) speech. This type of corpus is typically used in conversation analysis and related fields. If we add to the picture spoken language corpora used in speech technology or phonetics and phonology, even more variation in transcription techniques will have to be taken into account. It is doubtful, however, whether a standardisation across such a diverse spectrum of practices is feasible at all. This paper therefore concentrates on the type of spoken language corpora exemplified in Table 1.

2. In a way, CHAT is an exception to this because it is the name both of the data format used by the CLAN tool and of a transcription convention. However, the CHAT format and the CHAT convention can be clearly separated conceptually. Thus, it is possible to use the CHAT format with a different transcription convention and to use the CHAT convention with a different format.

3. It is by no means uncommon to use such tools for transcription. However, the resulting data are more or less unstructured texts, and this lack of explicit structure makes them ill-suited for a standardisation effort.

4. Further tools belonging to the same family are the TASX annotator, tools from the AG toolkit, and WinPitch.

5. The data models can therefore all be understood as special types of annotation graphs as defined by Bird and Liberman (2001).

6. Note that the definition given in the TEI Guidelines for the <u> element – “a stretch of speech usually preceded and followed by silence or by a change of speaker” – is compatible with the way it is used here to represent a segment chain. The name “utterance”, however, may not be an ideal choice for this element since some transcription conventions use the same name to denote a much more specific entity of speech (see next section).

7. There are of course many possible alternative representations which also conform to the TEI Guidelines. However, as Schmidt (2005b) and others have repeatedly argued, processing of the data is much facilitated by selecting one option out of the many and disallowing all others. For example, the document in figure 5 might just as well connect a <u> to the timeline by giving it a @start and an @end attribute. The representation chosen here is not in any way superior or inferior to that alternative, but it is still important to minimise variation by explicitly declaring one alternative as the preferred one.

8. The examples use a selection of the conventions’ rules only. Proficient users of the respective conventions may disagree on some details of what is transcribed here and how it is transcribed, and the example is certainly not a realistic one. Remember, though, that the aim here is to exemplify some differences between the systems, not to fully and precisely describe them.

9. Implicit markup is printed in bold face here. The symbol ␣ represents a space.

10. E.g. MacWhinney (2000) for CHAT: “Codes, words, and symbols must be used in a consistent manner across transcripts. Ideally, each code should always have a unique meaning independent of the presence of other codes or the particular transcript in which it is located.”

11. Since the algorithm relies on the regularities defined in the transcription conventions, any incorrect input (a string violating the convention) should lead to a parsing error, indicating the non-validity of the input string with respect to the convention. In the tool described below, such parsing errors will be signalled to the user, and an unparsed TEI version will be produced as output.

12. Solid lines stand for existing conversion routes; dashed lines indicate additional possible conversion routes.

ABSTRACTS

This paper formulates a proposal for standardising spoken language transcription, as practised in conversation analysis, sociolinguistics, dialectology and related fields, with the help of the TEI Guidelines.
Two areas relevant to standardisation are identified and discussed: first, the macro structure of transcriptions, as embodied in the data models and file formats of transcription tools such as ELAN, Praat or EXMARaLDA; second, the micro structure of transcriptions, as embodied in transcription conventions such as CA, HIAT or GAT. A two-step process is described in which the macro structure is first represented in a generic TEI format based on elements defined in the P5 version of the Guidelines. In the second step, the character data in this representation is parsed according to the regularities of a transcription convention, resulting in a more fine-grained TEI markup which is also based on P5. It is argued that this two-step process can, on the one hand, map idiosyncratic differences in tool formats and transcription conventions onto a unified representation. On the other hand, differences motivated by different theoretical decisions can be retained in a manner which still allows common processing of data from different sources. In order to make the standard usable in practice, a conversion tool—TEI Drop—is presented which uses XSL transformations to carry out the conversion between different tool formats (CHAT, ELAN, EXMARaLDA, FOLKER and Transcriber) and the TEI representation of transcription macro structure (and vice versa), and which also provides methods for parsing the micro structure of transcriptions according to two different transcription conventions (HIAT and cGAT). Using this tool, transcribers can continue to work with software they are familiar with while still producing TEI-conformant transcription files. The paper concludes with a discussion of the work needed to establish the proposed standard. It is argued that both tool formats and the TEI Guidelines are in a sufficiently mature state to serve as a basis for standardisation. Most work consequently remains in analysing and standardising differences between transcription conventions.

INDEX

Keywords: digital infrastructures, spoken language, standardization, transcription

AUTHOR

THOMAS SCHMIDT
thomas.schmidt@uni-hamburg.de
Research Centre on Multilingualism, University of Hamburg
Thomas Schmidt holds a PhD in German linguistics and text technology from the University of Dortmund. He is currently a Principal Investigator at the Research Centre on Multilingualism at the University of Hamburg. His research interests include corpus technology, spoken language corpora, and computational lexicography.

‘The Apex of Hipster XML GeekDOM’
TEI-encoded Dylan and Understanding the Scope of an Evolving Community of Practice

Lynne Siemens, Ray Siemens, Hefeng (Eddie) Wen, Cara Leitch, Dot Porter, Liam Sherriff, Karin Armstrong and Melanie Chernyk

AUTHOR'S NOTE

Primary authors: Lynne Siemens, Ray Siemens and Hefeng (Eddie) Wen. The authors wish to express their sincere gratitude to Dot Porter for the text encoding, Liam Sherriff for the video widget, Cara Leitch for the transcription, Karin Armstrong for the project management, and Melanie Chernyk for her writing contribution to this paper.

1. Introduction

1 Since its creation, the Text Encoding Initiative (TEI) has existed because of and to serve its community of practice.
The TEI began in 1987 as an effort to “develop, maintain, and promulgate hardware- and software-independent methods for encoding humanities data in electronic form” (Text Encoding Initiative 2010). The need for these standards was articulated by computing humanists across academic and geographical borders who were frustrated by the unshareable, unsustainable commercial tools that were available to them. The prevalence of proprietary formats and platform-dependent tools made it difficult to distribute or reuse information, and therefore almost impossible for computing humanists to pool their data and collaborate on similar projects. In a way, the founders of the TEI were already a community of practice1 who were as yet without a unified practice.

2 Composed of educational institutions and academic libraries, the TEI Consortium (TEI-C) was created in 2001 as a way to formalize the existing relationships. The Consortium is mandated to maintain, develop, and promote the TEI, but also to “foster a broad-based user community with sustained involvement in the future development and widespread use of the TEI Guidelines” (Text Encoding Initiative 2010). It has continued to do this and, during the development of the most recent version of the TEI Guidelines (P5), put out a public call for feature requests in an effort to build the TEI that practitioners wanted. However, the rapid growth in the use of TEI means that it may become increasingly difficult to communicate with and coordinate the entire community, which has grown to include members who are not formally associated with the Consortium. These new community members are practitioners who have learned about the TEI through other sources and are using it for their projects without the benefits of support and community that membership might afford them.2 A key challenge for the Consortium is identifying those individuals so that they might be brought more formally into the Consortium.

3 As a result, the Consortium discussed various ways in which it might meet this challenge while also building community in an entertaining manner. In 2007, at the TEI-C annual meeting, several members were discussing popular computational and social representations of TEI. A Google search for TEI returned a South Korean pop singer as the most likely result. This led to discussion about linking TEI and music in a way that could generate excitement about the TEI while also illustrating its usefulness. Those initial ideas grew into a series of discussions with members of the TEI Board about where recruitment efforts might be best placed. The goal would be to learn more about those who were practising TEI but were not yet members, and how the institutions where these practitioners are based could be brought into the community more formally. A viral marketing experiment was designed to provide insight into the current status of the community of practice and into how practitioners engage with the TEI.

4 This paper will begin by discussing the concept of viral marketing and its usefulness for the TEI. It will then describe the viral marketing experiment and the methodology behind it. Finally, it will report and discuss the findings of the experiment and make recommendations for future TEI recruitment and community-building efforts.
2. Viral Marketing and the Community of Practice

5 Viral marketing was a good choice for the type of experiment that was discussed, for many reasons. A relatively new phenomenon, viral marketing is cost-effective, entertaining, and takes advantage of the existence of online communities. This made it ideal for the purposes of the TEI experiment.

6 Viral marketing has been defined as “network-enhanced word of mouth” (Jurvetson and Draper 1997) or even as “word of mouse” (Skrob 2005). The emphasis in both of these concise definitions is on person-to-person communication of a message and on the online setting of the message. Rather than using mass media such as television, radio, or print to advertise, viral marketing takes advantage of free online media settings such as YouTube, Facebook, and weblogs. To become “viral,” the video, image, or game should inspire users to pass the message along to their peers in the interest of sharing the experience. At the same time, they share information about the product or service behind that message.

7 The use of viral marketing is cost-effective not only because websites like YouTube and weblogs are free to use, but also because the message is transmitted by the target audience, not by paid marketing executives or media outlets. In this way, the target audiences, rather than the organization, identify those who would be most interested in the message. Viral marketing takes advantage of the ease with which a video on YouTube can be shared. With the click of a button, users can recommend the video to friends via any number of online social networks. The more entertaining and engaging the experience is, the more likely users are to pass along the link. One recent and well-known example of a successful viral marketing campaign is the Old Spice social media video response campaign. Old Spice’s spokesperson, Isaiah Mustafa, spent two days in a bathroom set as his “The Man Your Man Could Smell Like” character, recording more than 180 video responses to comments from social network users (Reiss 2010). The viral campaign resulted in many millions of views and “the biggest gain in market share of any body wash over the four weeks ending June 13” (Van Buskirk 2010).

8 This type of marketing is an ideal choice for the TEI marketing experiment because of the goals of that experiment, which are (as previously stated) to identify and unify the community surrounding TEI. Viral marketing motivates viewers to pass the message along to others; this means that the TEI does not have to guess at who might be in this community. In this way, community-oriented marketing sustains itself while providing an accurate picture of the distribution of that community. Viral marketing can also help the TEI build that community by giving practitioners an opportunity to share an experience that requires an understanding of the TEI to be fully enjoyed and appreciated. This shared experience creates a feeling of community and of pride in the special knowledge shared, and a sense of being linked to other members of that community (Muniz and O'Guinn 2001).

3. Steps to Understanding TEI through the Viral Marketing Experiment

9 Once viral marketing was determined to be the appropriate strategy for this experiment, the next step was to design it. First, the viral message was prepared.
Before releasing the viral message, benchmark data was collected; this provided a baseline usage pattern for the TEI website that could be used to measure the effect of the viral marketing experiment. Once the viral message was released, new data would be collected and compared against the benchmark data.

3.1. Creation and Launch of Viral Marketing Widget

10 Given the need for an entertaining and interesting application of TEI, a video widget was determined to be the best medium. Because music had come up in the initial conversation about the experiment, the team’s focus turned to applications of TEI to music and image. The music video for Bob Dylan’s “Subterranean Homesick Blues” was an ideal candidate. The visual imagery of the video consists of Bob Dylan standing in front of the camera holding a series of cue cards. Some of the cards contain song lyrics and others contain other text. These cards are themselves a kind of markup of the song. While the song plays, Dylan discards cue cards one at a time. The visual text provides clarification or comment on the song lyrics. TEI is a useful tool for encoding this video because it can juxtapose the song lyrics, the text on the cue cards, and any other visual elements in the video. Moreover, TEI allows the encoder to designate points of uncertainty in the text, which can be used to mark unclear words in the song lyrics, or places where the cue cards and lyrics are not identical. The encoding also provides contextual information for pop culture references within the lyrics. All of this information can be included in the TEI encoding.

11 The video widget was created by a team of people, many of them part of the TEI community of practice themselves. Cara Leitch (song transcription), Dorothy Porter (transcription encoding), Liam Sherriff (video creation), and Karin Armstrong (website creation, video posting) combined their various skills and knowledge to create the video. None of these people is a marketing professional; they are graduate students and academics in the field of digital humanities. Thus, the TEI viral marketing widget was created by members of the TEI community itself, for other members of the community to enjoy. The key to viral marketing is that the experience be entertaining; otherwise, users will be unwilling to pass the message along to their peers. This team was interested in and engaged with the project, which is a good indication that their peers, the target audience, were also likely to be interested and engaged. The final widget, as included below, shows the music video with the TEI encoding superimposed on the video image, scrolling by in time with the music. The final frame of the video provides the URLs for the widget home page and the TEI-C website.

12 (The media file cannot be displayed here; refer to the online document at http://journals.openedition.org/jtei/210.)

13 After the video widget had been created, it was posted to YouTube (Siemens et al. 2008). News of the video was circulated via the TEI mailing list and other humanities mailing lists. Released at midnight GMT on October 1, 2008, the video overtook the number one position on the YouTube Canada website before the sun rose on the west coast, and held that position for one hour.
In order to gauge the interest in the video, the interest in the TEI-C that the video inspired, and the discussion about the video by interested individuals, Google Analytics was used to track traffic to the TEI-C website and to understand website users and their habits. Further analysis of this data would provide insight into the TEI community of practice and identify previously unknown members of the community as well as potential new members. The goals of the experiment, as briefly described above, were to attract new visitors to the TEI-C website, to increase awareness of the TEI itself (and to distinguish it from the Korean pop singer of the same name), and to understand and unify this community.

3.2. Collection of Benchmark and Experiment Data

14 The release of the viral marketing widget accomplished one of the project goals, which was to share an entertaining experience with the rest of the TEI community. However, the goal of understanding the various groups who visited the TEI website via the widget would require more in-depth analysis. In order to gather the data required for such analysis, Google Analytics was embedded in a representative sample of web pages on the TEI site (http://www.tei-c.org). The traffic on those pages was then monitored as a way of studying the browsing patterns of those interested in TEI. To collect and analyze this data, University of Victoria MBA student Eddie Wen joined the team.3

15 The analysis of benchmark and experiment data provided a picture of the community of practice, which included TEI-C members and TEI practitioners whose institutions were not members, as well as related communities of practice (such as other digital humanists and XML users). Figure 2 shows the relationship between these groups. TEI-C members are the core of that group. Moving outward, the next circle includes the larger group of non-TEI members and related communities of practice. Finally, the largest group is labelled “New Visitors.” This group includes users who were motivated by the video widget to visit the TEI-C site, but who are not yet part of the inner circle.

Figure 2: Target Groups

16 The experiment was approached in two stages. The first was the “benchmark” stage, which took place between June 1 and September 30, 2008. Using Google Analytics to track traffic to several pages on the TEI website, the team was able to gather information about the users’ geographic location, the time spent on the TEI site, and the number of pages accessed during each visit. Next, a Google search allowed the team to identify the home institution of the visitor; they could then determine whether a given institution was a TEI member. If a user was not accessing the TEI website from an educational institution or academic library, their personal information was not included in the study. Those institutions known to be TEI members could then be excluded from the list of total site visitors to generate a list of non-member visiting institutions. Any institution that had contributed more than ten visits to the TEI-C site and had visited more than one page during each visit was considered to be an active user, and thus within the community of practice.
These groups—the active TEI members and the larger community—exhibit certain browsing habits that function as a baseline by which to evaluate new visitors in the next stage of the experiment.4

17 The benchmark data showed that 52 of the 82 TEI member institutions were making active, regular use of the TEI website. At this point, one of the limitations of the experiment can be seen. Many of the pages that were embedded with Google Analytics were portions of the TEI Guidelines, a comprehensive reference guide for the TEI. However, the TEI Guidelines are not solely available on the website. The TEI-C offers many options for using the Guidelines, such as ordering a print-on-demand copy, downloading a PDF version, or saving a copy of the HTML files locally.5 The 30 TEI members who were not actively using the TEI website may have downloaded the TEI Guidelines to their local computers. This would mean that those members would not be accessing the pages as often (or remaining on the site for as long a period of time), because they were not using the site as a primary reference point. This does not mean that the 30 members are not participating in the formal relationship with TEI, but rather that their involvement in the community could not be accurately measured in this experiment.

18 After subtracting the 82 member institutions from the list of visiting institutions during the benchmark period, the results showed 5972 non-TEI-member visitors. Of that group, only 124 were institutionally-identified visitors who exhibited the same browsing patterns as the more active TEI members. However, because the experiment was only designed to include institutionally-identified site visitors, any members of educational institutions or academic libraries who accessed the TEI site from off-campus would not be included in the study. This is a further limitation of the experiment, but the data gathered still provides a useful starting point by which to evaluate the TEI website’s audience.

19 The second stage of data collection was the “new data” phase, which took place from October 1 to 17, 2008. Beginning with the release of the widget on October 1, new visits to the TEI-C website were tracked. During this phase, the researchers were interested primarily in those users who had not visited the TEI-C site during the benchmark period, and especially those visitors who were motivated by the viral video widget to investigate the TEI-C site. Again, Google searches aided identification of the home institution of each visitor. The institutions listed in either of the groups defined during the benchmark stage were subtracted from the list of total new visitors to the site during the new data stage. This resulted in a list of 709 “widget-inspired” visitors. The browsing patterns (number of pages visited, length of stay on the website) of those new visitors were then compared to the data collected during the benchmark stage, thus identifying those who exhibited the same browsing patterns as current TEI-C members. The final list of new visitors who satisfied all of those requirements contained only four institutions. However, as mentioned above, there may have been many more visitors who simply did not access the TEI website while on campus. This experiment necessarily excludes those visitors.
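By way of illustration, the active-user criterion described above can be expressed in a few lines of analysis code. The following sketch uses invented column names and toy data; it is not the team's actual analysis script.

    import pandas as pd

    # One row per visit; the column names are hypothetical.
    visits = pd.DataFrame({
        "institution":  ["U1"] * 12 + ["U2"] * 3,
        "pages_viewed": [3] * 12 + [1, 1, 1],
    })
    per_inst = visits.groupby("institution").agg(
        n_visits=("pages_viewed", "size"),
        pages_per_visit=("pages_viewed", "mean"),
    )
    # "Active": more than ten visits, more than one page per visit.
    active = per_inst[(per_inst["n_visits"] > 10) & (per_inst["pages_per_visit"] > 1)]
    print(list(active.index))  # -> ['U1']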
20 The data collected during the benchmark and new data stages was then analyzed to understand the qualities and habits of the three groups in question as well as their geographical distribution.

4. Discussion of Findings

21 Drawing upon the data, the scope and size of the TEI community of practice can be estimated and potential new members of this community identified. This picture can be explored from the perspective of visitor demographics, web traffic analysis, and blog comments.

22 As seen in Figure 3, the first analysis of the data shows that this community and those just outside it are quite large. Initial analysis suggests that the community of practice might number around 5000, with 4000 being more active within the community. However, this picture can be refined further, as discussed below.

Figure 3: Total Identifiable Groups

4.1. Visitor Demographics

23 The results of the viral marketing experiment show that visits to the site came from 153 countries, and that 50 of those countries make up 98% of total visits. The community is geographically very diverse and active.

24 Besides the TEI-C members, the larger community of practice can be identified by analyzing browsing patterns. Of those visitors who are not TEI-C members, 102 were considered to be the most active. The average number of site visits per user within this group was 29; they spent an average of four minutes on the site (compared to a site average of 59 seconds); and they visited an average of 3.6 pages per visit (compared to a site average of 1.68). These visitors were visiting more often, staying longer, and accessing more pages during each visit than the average site visitor. These statistics demarcate a clear group of institutions which might benefit from more formal involvement in the organization. Figure 4 provides a comparison of the number of members of each group who exhibit similar browsing patterns.

Figure 4: Groups Showing Similar Browsing Patterns

25 Drawing on Google Analytics data, the understanding of the larger TEI community can be further developed with an analysis of the geographical spread of visitors to the TEI website. Figure 5 shows the geographical distribution of visits by TEI-C members during the benchmark period.

Figure 5: Visits from TEI Members (June 1 to September 30)

26 Not surprisingly, given the member profile within the Consortium, the majority of visits (58%) came from the United States. A further 15% came from the United Kingdom, and 8% were from Germany. When these numbers are compared to benchmark-period visits from those who were not TEI-C members (as seen in fig. 6), the result is that the majority of visitors are still from the United States (38%), but China, Hong Kong, and Taiwan each account for 5% of visitors, and 31% of visits came from “other” countries. This distribution confirms that those interested in TEI are much more widely distributed (geographically) than those who are currently members of the TEI-C.

Figure 6: Visits from Non-TEI-C Members (June 1 to September 30)

27 The final map (in fig. 7) shows the geographical distribution of widget-inspired visitors between 1 and 17 October. Again, the United States leads the visits with 44%.
However, among the remaining visits, several countries appear that did not register in previous results: India, Russia, and Israel are examples. This unexpected result shows that there was worldwide interest in the video widget and that it was able to attract people to the TEI website who otherwise might not have come across it. The results show that many non-member visitors were from non-English-speaking countries, but also that non-English-speaking members of TEI are less active on the website than English-speaking ones.

Figure 7: New Visits (October 1 to 17)

28 By analyzing the various data, a clearer picture of TEI’s community of practice comes into focus. As can be seen from the illustrations, there is a larger community beyond the TEI-C members, and these visitors share similar characteristics with TEI members. Likely involved in many activities similar to those of TEI-C members, they are probably using XML, encoding documents, and digitizing material—all of which can be made easier or more useful through the TEI. A more formal relationship with the TEI-C would thus be of most value to this group. The third and smallest group indicates the effectiveness of the viral video widget in reaching users who might be outside those adjacent communities of practice but still demonstrate an interest in TEI and were inspired by the widget. These users represent the broadening of the TEI’s appeal, and may also be the target of future efforts to encourage more awareness and even potentially more formal membership in the TEI Consortium.

4.2. Website Traffic Analysis

29 A further analysis of website traffic can provide insight into usage over time and highlight significant events on the site. Figure 8 shows visits to the TEI website through the benchmark period and viral marketing experiment. Three spikes in usage, at points A, B, and C, are of note.

Figure 8: Site Usage from May 1 to October 17

30 Traffic to the TEI-C website during the benchmark period is somewhat steady until August. The first small spike, at point A, occurs in late August, when faculty and staff began returning to educational institutions and academic libraries for the fall session; it likely reflects an increase in traffic as they return from holidays or simply start to access the TEI site from campus instead of from home. The spike at point B coincides with the launch of the Chinese TEI-encoded letters project, which appears to have created more traffic than the viral marketing experiment (point C). (Unfortunately, the subsequent drop represents the TEI website’s server crash, caused by the widget-inspired traffic.) After spike C, the traffic quickly returns to steady levels, although they are slightly higher than the benchmark levels.

4.3. Blogs and Comments from the Larger Community of Practice and Beyond

31 While the benchmark period was designed to define the immediate TEI community of practice, the viral marketing experiment was focused on identifying those who were just beyond this group and would be interested in TEI and the video widget.
4.2. Website Traffic Analysis

29 A further analysis of website traffic can provide insight into usage over time and highlight significant events on the site. Figure 8 shows visits to the TEI website through the benchmark period and the viral marketing experiment. Three notable spikes in usage can be identified, at points A, B, and C.

Figure 8: Site Usage from May 1 to October 17

30 Traffic to the TEI-C website during the benchmark period is fairly steady until August. The first small spike, at point A, occurs in late August, when faculty and staff return to educational institutions and academic libraries for the fall session; it likely reflects an increase in traffic as users return from holidays or simply start to access the TEI site from campus instead of from home. The spike at point B coincides with the launch of the Chinese TEI-encoded letters project, which appears to have created more traffic than the viral marketing experiment (point C). (Unfortunately, the subsequent drop represents the TEI website’s server crash, caused by the widget-inspired traffic.) After spike C, traffic quickly returns to steady levels, although these are slightly higher than the benchmark levels.
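For readers who wish to reproduce this kind of analysis, the sketch below shows one way spikes like A, B, and C might be flagged programmatically, by comparing each day's visits against a trailing baseline. The window length and threshold multiplier are assumptions, not values from the paper.

    # A sketch of flagging spikes such as A, B, and C in Figure 8.
    from statistics import mean

    def flag_spikes(daily_visits, window=14, multiplier=2.0):
        """Return the indices of days whose visit count exceeds
        `multiplier` times the mean of the preceding `window` days."""
        spikes = []
        for day in range(window, len(daily_visits)):
            baseline = mean(daily_visits[day - window:day])
            if daily_visits[day] > multiplier * baseline:
                spikes.append(day)
        return spikes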
4.3. Blogs and Comments from the Larger Community of Practice and Beyond

31 While the benchmark period was designed to define the immediate TEI community of practice, the viral marketing experiment focused on identifying those just beyond this group who would be interested in TEI and the video widget. Initially, the experiment was designed to collect data from Google Analytics; however, as the video widget went viral, an opportunity arose to analyze the blogs and online discussion about the experiment.

32 Nine websites or blogs announced the widget and hosted discussion, generating around 30 comments about the widget, TEI, XML, and “geekdom.” In fact, YouTube itself generated much less interest than was originally expected; most of the widget-inspired increase in TEI website traffic on October 1 and 2 came via a post on MetaFilter.com (MetaFilter 2008). MetaFilter traffic accounted for 37% of new visits on October 1 and 2; more new visitors came from MetaFilter than from any other source (including the Google search engine).

33 Overall, the comments were in support of TEI and its associated relationships. Because these individuals were experienced in XML, though not necessarily in TEI, they had an appreciation for the video widget and the text encoding. As richardaholden commented, “This video of Bob Dylan’s Subterranean Homesick Blues overlaid with XML markup might be the geekiest thing I’ve ever seen. It’s the work of a body called the Text Encoding Initiative, whose mission is to “develop and maintain Guidelines for the...” (Technorati 2008). Further, jack_mo wrote, “Cheers Knappster, you just made me chuckle slightly while reading an XML file. I never thought that would happen” (MetaFilter 2008). Similarly, Ronnie Brown praised the video by saying that it was the “Geekiest video ever. No exceptions. Language Log picks up the University of Victoria’s Electronic Textual Cultures Laboratory Subterranean Homesick Blues project. Bob Dylan, XML and language dissection. That’s what I’m talking about!” (Technorati 2008). Finally, Gnomic suggested the next project with “Now can they do Springsteen's Blinded By the Light?” (MetaFilter 2008).

34 In the spirit of debate, several commenters critiqued the encoding method. For example, inn asked, “Am I missing something or did they only mark up selected words and phrases, not the full lyrics? Isn't the ability to have your annotations inline in the text part of the idea with TEI?” (MetaFilter 2008). Also, jccummings suggested that “It is interesting, I would have assumed that you'd use the 'spoken text' transcription module and code the lyrics as utterances (u) with the displayed boards as (writing)” (Siemens et al. 2008).
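For readers unfamiliar with the elements jccummings mentions, the following sketch illustrates roughly what that alternative encoding might look like, built here with Python's standard XML library. The lyric text, the speaker ID, and the nesting of the writing element inside the utterance are illustrative assumptions; this is a sketch of the suggestion, not the project's actual markup.

    # A hypothetical rendering of the encoding jccummings describes:
    # a sung line as a TEI <u> (utterance), with the on-screen cue card
    # as <writing> (text revealed during a spoken interaction).
    import xml.etree.ElementTree as ET

    TEI_NS = "http://www.tei-c.org/ns/1.0"
    ET.register_namespace("", TEI_NS)

    def tei(tag):
        """Qualify a tag name with the TEI namespace."""
        return "{%s}%s" % (TEI_NS, tag)

    # One sung line as an utterance attributed to the performer ...
    u = ET.Element(tei("u"), {"who": "#performer"})
    u.text = "look out kid"
    # ... with the cue card shown on screen at that moment.
    writing = ET.SubElement(u, tei("writing"))
    writing.text = "LOOK OUT"

    print(ET.tostring(u, encoding="unicode"))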
35 Of course, given that this video widget moved beyond the immediate TEI community of practice, some individuals who accessed the widget were previously unaware of TEI and asked questions about it. Perhaps theoGoh summed it up most succinctly with “For the benefit of the uninitiated, what's the point of doing this?” (Siemens et al. 2008). On the other hand, alasdair showed some knowledge of the field, though not of TEI, with “Hey, interesting. Not heard of the TEI before. Should I think of it as a competitor to the Dublin Core set of elements, or can/should you use Dublin Core elements in TEI documents (without doing Namespace stuff, I mean)?” (MetaFilter 2008).

36 It is clear from the last group of comments that the widget certainly succeeded in its attempt to unite a community of practice (and geekiness). The comments and TEI website traffic both show that the general audience of YouTube was less likely to appreciate, understand, and follow up on the TEI widget than the MetaFilter audience. MetaFilter is a weblog, but instead of relying on a small number of contributors, it allows anyone to contribute; members generally provide links to interesting articles they have encountered elsewhere on the internet, so browsing MetaFilter is somewhat like browsing the “best of” the internet, according to its supporters. Because of this “filtering” of content, it is not unreasonable to assume that MetaFilter has an audience more engaged with and focused on topics like TEI than YouTube has.

5. Conclusions and Recommendations

37 The viral marketing experiment ultimately succeeded in its goal of more clearly defining communities of practice and suggesting strategies to formalize the TEI community.

38 Primarily, the viral video was a success because it did indeed go viral. The video intrigued people, both within the TEI community of practice and outside that community. In fact, the widget attracted individuals and inspired comment and conversation in ways not anticipated. Indeed, perhaps the most interesting results were the comments and validation from the larger community itself and beyond. These comments clearly convey a sense of the shared community experience of the video and the text encoding which underpinned it.

39 Second, the experiment provided enough data to begin defining TEI’s community of practice. There is now a clearer indication of who is using TEI and who therefore might benefit from a more formal relationship through TEI-C membership. The data collected also indicates that this community is large, in both numbers and geographical spread.

40 Finally, the website tracking data can illustrate much about common usage patterns and browsing habits, as well as indicating the types of events that draw people to the TEI website for information. The spike in traffic caused by the announcement of the Chinese TEI-encoded letters project, for example, indicates that widely publicizing new TEI-encoded projects might in fact be the best method of marketing the usefulness of the TEI. As the TEI-C evaluates mechanisms to promote itself, it might consider that TEI projects themselves appear to be the best tool for creating awareness.

41 From these conclusions, several recommendations can be made on ways to enlarge and formalize the TEI community of practice.

42 Given the geographical spread beyond English-speaking countries, the TEI-C might consider providing several non-English language options for the website. As discussed above, the Google Analytics results show that many non-member visitors were from non-English-speaking countries, and also that non-English-speaking members of TEI are less active on the website than English-speaking ones. More language options would allow non-English-speaking visitors to engage more fully with the website and, by extension, the TEI community of practice.

43 The TEI could also make a concentrated effort to spread announcements about events, workshops, and new project releases through sites like MetaFilter, which tend to reach a more specific audience than a site like YouTube, as seen in the comments discussed above. On a related note, the TEI could seek out newly attracted visitors who already have knowledge of XML and computer programming languages for future viral experiments; those users seemed to be the most engaged in discussion in the widget comments.

44 The viral experiment was primarily a means to identify and understand the interest surrounding and commitment to the TEI. As a way to continue this community-building, and to put TEI at the centre of the community, the TEI should work on strengthening its brand image by embedding its logo in all emails, newsletters, and related projects. Furthermore, an online social community focused on TEI could become a place where users convene to learn about upcoming events and project announcements, as well as to network, discuss technical issues, and get advice. For future viral experiments, this network could be used to launch a competition for the next encoded music video, film clip, or other media; engaging the community in the early stages would strengthen the community of practice even further. The TEI might also consider more formally tracking those projects which have used TEI and evaluating the impact of the announcement of these projects on TEI website traffic. At the same time, the Consortium might encourage these TEI-encoded projects to highlight more prominently the fact that they used TEI.

45 Finally, the TEI (as it has done with the development of the P5 Guidelines) should continue to return to its members for recommendations on how to improve the TEI Guidelines, website, brand, and community-building efforts. This could be accomplished by sending out surveys to current members, or by organizing focus groups of members and practitioners from different countries and backgrounds and with different TEI-related goals.

46 In conclusion, this viral marketing experiment was created to generate an understanding of the TEI community of practice and those located just beyond it. From the analysis of the benchmark and viral marketing experiment data, the TEI-C now has a picture of the members of the community and potential ways to formalize this community. This community is diverse, widespread, and quite large. Further, the comments on YouTube and blogs make it clear that individuals located outside the TEI community are interested in the TEI and its potential. Many of the comments promote the sense that the video can function as a shared community experience. The TEI-C’s goal of sustaining a “broad-based user community” has certainly been met, and the TEI continues to grow.
BIBLIOGRAPHY

Holden, R. 2008. “Subterranean Homesick Blues.” Richard Holden: Science, words, etc. http://richardholden.net/?tag=bobdylan.

Jurvetson, S., and T. Draper. 1997. “Viral Marketing: Viral Marketing Phenomenon Explained.” DFJ Network News. http://www.dfj.com/news/article_26.shtml.

Lave, J., and E. Wenger. 1991. Situated Learning: Legitimate Peripheral Participation. Cambridge: Cambridge University Press.

McCarty, W. 2005. Humanities Computing. London: Palgrave.

MetaFilter. 2008. “Look out kids, you’re gonna get hits.” http://www.metafilter.com/75307/Look-out-kids-youre-gonna-get-hits.

Muniz, Jr., A. M., and T. C. O’Guinn. 2001. “Brand Community.” Journal of Consumer Research 27: 412–32.

Reiss, C. 2010. “Now look here, now learn from this …: Businesses can learn from the Old Spice Man viral marketing campaign.” Entrepreneur.com on msnbc.com. http://www.msnbc.msn.com/id/38282026.

Siemens, R., D. Porter, L. Sherriff, C. Leitch, and K. Armstrong. 2008. “TEI Encoding of Dylan’s Subterranean Blues” (video). http://www.youtube.com/watch?v=4sHYDfITjHY (no longer available on youtube.com).

Skrob, J. R. 2005. “Open Source and Viral Marketing: The viral marketing concept as a model for open source software to reach the critical mass for global brand awareness based on the example of TYPO3.” Vienna, Austria.

Technorati. 2008. “TEI Encoding of Dylan’s Subterranean Hom…” http://technorati.com/videos/youtube.com%2Fwatch%3Fv%3D4sHYDfITjHY.

Text Encoding Initiative. 2010. “TEI: History.” Accessed October 29. http://www.tei-c.org/About/history.xml.

Van Buskirk, E. 2010. “The Power of Social Media, Part 2.” Epicenter (blog), Wired.com, July 26, 2010. http://www.wired.com/epicenter/2010/07/the-power-of-social-media-part-ii/.

Wen, H. 2008. “Research and Development of Viral Marketing and Brand Community Strategies for the Text Encoding Initiative (TEI).” Unpublished manuscript.

NOTES

1. The term “community of practice” originates in cognitive anthropology, popularized by Lave and Wenger, who used it to designate a group of people who share an occupation, craft, or any set of practices; midwives are an example (Lave and Wenger 1991). The term has since been adapted for use in many other fields. In this paper, we use the term to indicate a group of people who share the same specialized knowledge and for whom that knowledge is central to their occupation. Specifically, we are discussing academic practitioners of TEI, but also adjacent communities of practice, such as users of other methods of text encoding or digitization.

2. While membership primarily supports the maintenance of the TEI and brings the member formally into the community, there are more tangible benefits as well. Members have access to further training and consultation, the use of specialized digitization tools, and discounts on encoding and digitization software. Details about these and other benefits can be found at the TEI website: <http://www.tei-c.org/Membership/benefits.xml>.

3. Wen subsequently wrote a paper, titled “Research and Development of Viral Marketing and Brand Community Strategies for the Text Encoding Initiative (TEI),” about this experiment. His work has been an invaluable source for this paper.

4. For a full description of the methodology used in this experiment, see pages 45–57 of Wen 2008.

5. These options and more are available from the following website: <http://www.tei-c.org/Guidelines/P5/get_p5.xml>.
ABSTRACTS

If the notion of the methodological commons is as centrally located as we believe it to be in any visualization accurately depicting the intellectual structure of the digital humanities and digital literary studies (McCarty 2005, 119), then so, too, must be the community itself, whose members provide that which populates the commons. As an interdiscipline, humanities computing has always understood its methodologies well; indeed, the digital humanities (of which digital literary studies is a part) have, more generally, made a virtue of the way in which they render explicit and tangible the theoretical models that govern the representative and analytical endeavour of their fields via computational application. So, too, have those in the field understood and documented its formal structures and institutional manifestations, a chief example being the Text Encoding Initiative itself. Less explicitly rendered and less formally documented, though intuited by its chief practitioners and builders, is the exact nature of the community itself: its depth and breadth, its own centre and, perhaps more important in a field whose embrace of interdisciplinarity is far from self-serving, its periphery and those of its aspects which promise to become central. This article presents work carried out in conjunction with the Text Encoding Initiative Consortium, a foundation of many digital literary studies projects, work that seeks to document the full nature of its community, from the institutional and research project groups that comprise the formal consortium at the centre to those who appear on the other side of the easily permeable periphery that separates them from the centre: largely individual practitioners in areas hitherto not closely identified with the digital humanities but clearly sharing methods and tools, thus suggesting their place in the same communities of practice, as they are members of the same methodological commons. The methodological approach is drawn from marketing and organizational behavior, as manifested in social networking and in the study of viral marketing campaigns conducted in online environments. The method for this work centred on a viral marketing experiment designed to showcase the TEI and novel ways that it can be used to encode different kinds of text. At the heart of the experiment was a Bob Dylan song and its associated video, which incorporated text; encoded text was overlaid on the video, which was then posted to YouTube and a blog with links to the TEI website, and the resulting traffic patterns were analyzed.

INDEX

Keywords: community of practice, viral marketing
AUTHORS

LYNNE SIEMENS
siemensl@uvic.ca
School of Public Administration, University of Victoria, Canada
Lynne Siemens is in the School of Public Administration at the University of Victoria. Her research interests include academic team development, organizational behavior, small business, and entrepreneurship.

RAY SIEMENS
siemens@uvic.ca
Department of English, University of Victoria, Canada
Ray Siemens is the Canada Research Chair in Humanities Computing and Professor of English at the University of Victoria, with a cross appointment in Computer Science. He is the editor of several Renaissance texts and the founding editor of the electronic scholarly journal Early Modern Literary Studies; he has written numerous articles on the connections between computational methods and literary studies and is the co-editor of several humanities computing books, such as Blackwell's Companion to Digital Humanities and Companion to Digital Literary Studies. Ray serves as Director of the Implementing New Knowledge Environments project and the Digital Humanities Summer Institute, and is Chair of the Steering Committee for the Alliance of Digital Humanities Organisations; he has also been President (English) of the Society for Digital Humanities/Société pour l'étude des médias interactifs (SDH/SEMI), as well as Chair of the Modern Language Association's Committee on Information Technology and the MLA Discussion Group on Computers in Language and Literature.

HEFENG (EDDIE) WEN
hefengw@uvic.ca
Faculty of Business, University of Victoria, Canada
Eddie Wen is a graduate of the University of Victoria's Master of Business Administration program, who undertook the work with the Text Encoding Initiative Board that is reported on in this article. His fuller report, which also covers issues related to viral marketing, is listed in the bibliography.

CARA LEITCH
cmleitch@uvic.ca
Electronic Textual Cultures Lab, University of Victoria, Canada

DOT PORTER
dot.porter@gmail.com
Digital Library Content & Services, Indiana University

LIAM SHERRIFF
Electronic Textual Cultures Lab, University of Victoria, Canada

KARIN ARMSTRONG
Electronic Textual Cultures Lab, University of Victoria, Canada

MELANIE CHERNYK
uvic.etcl@gmail.com
Electronic Textual Cultures Lab, University of Victoria, Canada