12/10/2016

Why use XML-TEI markup?

Alejandro Bia
Universidad Miguel Hernández, Spain

Some questions we asked ourselves…

• Why use markup at all?
• Why choose XML, and TEI?
• Why is XML better than SGML?
• Why is TEI better than other markup vocabularies?

Why use markup at all?

Markup:
• is any means of making explicit an interpretation of a text
• works both for computers and for humans (text encoders)
• is useful for data and metadata

The alternatives are:
• Plain text (like Project Gutenberg)
• Facsimile formats: TIFF / JPEG / PDF / DjVu
• Marked-up text + facsimiles → more functionality

Three types of markup:
• Punctuational
• Presentational or procedural
• Descriptive or structural → adds semantic value to the text

When are facsimiles better than text?

• Digital facsimiles are used to reproduce newspapers with historical value (e.g. “Izquierda Republicana”).
• They are also used to accurately reproduce old printings (e.g. “La Celestina”).
• But the most interesting use is to reproduce manuscripts (e.g. “El Divino Cazador”).

And, for everything else, there’s… TEXT: plain text, marked-up text, formatted text…

Example images from the MCDL website.

Digital facsimiles with text transcriptions

Transcription notation according to the Madison Guidelines - Hispanic Seminar of Medieval Studies - HSMS

Markup is based on the OHCO Model

OHCO:
• Ordered Hierarchy of Content Objects
• A book is a good example of what can be represented in this way.
• Each Content Object is marked at its beginning and end with a tag.
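As a rough illustration of the OHCO idea, the sketch below parses a toy book and prints its content objects as an indented outline. The element names (`book`, `chapter`, `title`, `p`) and the `outline` helper are made up for this example and are not TEI; Python's standard `xml.etree.ElementTree` module is used only as a convenient parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy document: element names are illustrative, not TEI.
BOOK = """
<book>
  <chapter n="1">
    <title>First chapter</title>
    <p>Opening paragraph.</p>
    <p>Second paragraph.</p>
  </chapter>
  <chapter n="2">
    <title>Second chapter</title>
    <p>Closing paragraph.</p>
  </chapter>
</book>
"""


def outline(element, depth=0):
    """Return one indented line per content object, in document order."""
    lines = ["  " * depth + element.tag]
    for child in element:  # children are visited in their original order
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(BOOK)
tree_lines = outline(root)
print("\n".join(tree_lines))
```

Each object sits inside exactly one parent and siblings keep their order, which is what "ordered hierarchy" amounts to in practice.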

(c) Alejandro Bia 1 12/10/2016

Limitations of the OHCO Model

• OHCO is good in most cases (Pareto law - 80/20)
• Except, for instance… when we need overlapping markup (forbidden by design)

XML syntaxes for overlap, such as in TEI or in LMNL*, adopt five different techniques:
• milestones
• fragmentation
• flattening
• multiple documents
• standoff

* LMNL: The Layered Markup and Annotation Language

Markup Languages and Vocabularies

[Diagram relating MARKUP LANGUAGES (in black) and MARKUP VOCABULARIES (in red). Labels: TEI-SGML, TEI-XML, DocBook, SGML, XML, XHTML, HTML, with the markers "strict" and "loose".]
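The overlap problem and the milestone workaround mentioned among the techniques above can be sketched as follows. The fragments are invented for illustration; the element names (`lg`, `l`, `s`, `lb`) are borrowed loosely from TEI, and the `is_well_formed` helper is a hypothetical name, not part of any library.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: a sentence <s> that runs across two verse lines <l>.
# The two hierarchies overlap, so the tags cannot nest properly.
OVERLAPPING = "<lg><l><s>One sentence starts here</l><l>and ends here.</s></l></lg>"

# Milestone technique: keep <s> as the real hierarchy and mark each line
# start with an empty <lb/> element, which cannot overlap anything.
MILESTONES = (
    '<lg><s><lb n="1"/>One sentence starts here '
    '<lb n="2"/>and ends here.</s></lg>'
)


def is_well_formed(xml_text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


overlap_ok = is_well_formed(OVERLAPPING)
milestone_ok = is_well_formed(MILESTONES)

# The line structure is still recoverable from the milestones.
line_starts = [lb.get("n") for lb in ET.fromstring(MILESTONES).iter("lb")]
print(overlap_ok, milestone_ok, line_starts)
```

The trade-off is that the lines are no longer elements with content, so software has to rebuild them from the milestone positions.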

The rules of XML: compared to SGML

XML imposes strict rules:
• All opened tags must be closed.
• All empty tags should use the / to indicate that the tag is empty.
• All attribute values within an XML document must be quoted.
• The names of Elements and Attributes used in an XML document must be an exact match of how they are declared in the DTD (this includes letter-case).

XML imposes more constraints than SGML. This makes it easier to use and to process.

XML and digital preservation concerns…

Digital preservation is focused on:
• Preventing data loss, mainly due to deterioration of the media
• Preventing obsolescence of the data formats

XML is plain text + tags:
• Tags are text as well
• … so XML is just text
• … the oldest, simplest, most widely used computer file format

However, we still have to be aware of changes in image formats:
• TIFF, JPEG, PNG…

Many file formats rely on XML nowadays:
• MS .docx, .pptx, .xlsx
• EPUB, SVG
• Java JAR files (ZIP archives that often bundle XML descriptors)
• And a long etc. …
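The strict rules of XML listed earlier can be observed with any conforming parser. The sketch below feeds a few invented fragments to Python's standard `xml.etree.ElementTree`; the sample strings and the `is_well_formed` helper are assumptions made for this example. Note that ElementTree does not validate against a DTD, so the last case only shows that tag names are case-sensitive, not full DTD conformance.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments, one per rule.
SAMPLES = {
    "well-formed": '<p rend="italic">text<lb/></p>',
    "unclosed tag": "<p>text",
    "empty tag without /": "<p>text<lb></p>",
    "unquoted attribute": "<p rend=italic>text</p>",
    "letter-case mismatch": "<p>text</P>",
}


def is_well_formed(xml_text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


results = {label: is_well_formed(xml) for label, xml in SAMPLES.items()}
for label, ok in results.items():
    print(f"{label}: {'accepted' if ok else 'rejected'}")
```

An SGML parser working with a suitable DTD could accept several of the rejected forms, which is the sense in which XML is the stricter of the two.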

Single source, multiple outputs, automatic transformation...

Avoid duplication of information → Principle of UNIQUE SOURCE

[Diagram: UNIQUE SOURCE DOCUMENT (XML-TEI), the source to preserve → automatic transformation → disposable subproducts: HTML, PDF, eBook formats, Searches / Concordances, etc.]

Likewise… with images: single source, multiple outputs, and automatic transformation

Avoid duplication of information → Principle of UNIQUE SOURCE

[Diagram: HIGH RESOLUTION IMAGE (TIFF), the source to preserve → automatic transformation → disposable subproducts: THUMBNAILS, SMALL JPEG, MEDIUM JPEG, LARGE JPEG.]
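On the text side, the unique-source principle can be sketched in a few lines: one XML source, two derived outputs that can be thrown away and regenerated at will. In practice this job is usually done with XSLT; plain Python is used here only to keep the example self-contained. The source fragment, the element names, and the functions `to_plain_text` and `to_html` are all invented for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical single source (TEI-like element names, no namespace).
SOURCE = """<text>
  <head>Sample title</head>
  <p>First paragraph of the <hi>transcription</hi>.</p>
  <p>Second paragraph.</p>
</text>"""


def block_text(block):
    """Flatten a block to its text, dropping inline markup such as <hi>."""
    return "".join(block.itertext()).strip()


def to_plain_text(root):
    """Derived output 1: plain text, one blank line between blocks."""
    return "\n\n".join(block_text(block) for block in root)


def to_html(root):
    """Derived output 2: minimal HTML, <head> becomes <h1>, the rest <p>."""
    lines = []
    for block in root:
        tag = "h1" if block.tag == "head" else "p"
        lines.append(f"<{tag}>{escape(block_text(block))}</{tag}>")
    return "\n".join(lines)


root = ET.fromstring(SOURCE)
plain = to_plain_text(root)
html_out = to_html(root)
print(plain)
print(html_out)
```

Only `SOURCE` needs to be preserved and corrected; both outputs are regenerated after every change.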


XML as a family of technologies and tools

XML is not just a markup language but a family of standards and related tools that interoperate:
– XML
– Namespaces
– XPath
– XSL or XSLT
– DTDs and Schemas
– XQuery
– XLink and XPointer
– and many powerful software tools…

So, why use XML?

DH Research potential
• Precise searches
• Automatic concordances
• Comparative analysis (parallel alignment)
• Linguistic research
• Statistical research

Multipurpose
• can be used for different research purposes
• can be rendered in several formats
• adequate for Web publishing
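To make "precise searches" and "automatic concordances" a little more concrete, here is a minimal sketch that combines namespaces with the limited XPath subset supported by Python's `xml.etree.ElementTree`. The fragment is invented and is well-formed but not a complete TEI document (it has no header); the `concordance` function is a hypothetical helper, and real projects would more likely use full XPath or XQuery.

```python
import xml.etree.ElementTree as ET

# The TEI namespace, bound to a prefix of our choosing for the queries.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented fragment for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <lg type="stanza">
        <l n="1">The hunter walks at dawn</l>
        <l n="2">and the <name type="place">forest</name> wakes</l>
      </lg>
      <p>The hunter rests.</p>
    </body>
  </text>
</TEI>"""

root = ET.fromstring(SAMPLE)

# Precise search: verse lines only, prose paragraphs are ignored.
verse_lines = ["".join(l.itertext()) for l in root.findall(".//tei:l", TEI_NS)]

# Search restricted by attribute value.
places = [n.text for n in root.findall(".//tei:name[@type='place']", TEI_NS)]


def concordance(tree_root, word):
    """List (line number, line text) for verse lines containing the word."""
    hits = []
    for line in tree_root.iterfind(".//tei:l", TEI_NS):
        text = "".join(line.itertext())
        if word.lower() in text.lower().split():
            hits.append((line.get("n"), text))
    return hits


hunter_hits = concordance(root, "hunter")
print(verse_lines, places, hunter_hits)
```

A plain-text search for "hunter" would also match the prose paragraph; the markup is what lets the query stay inside verse.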


So, why use XML?

Ease of preservation
• It is text
• Easily converted to other formats
• Easily upgradeable

Drawbacks:
• encoding is more expensive
• requires skilled personnel
• overlapping markup is not allowed

And, why use TEI?

It’s the most complete markup vocabulary
• More than 500 elements
• Specialized tagsets
• Customizable
• A de-facto standard for scholarship

Plus
• Community collaboration (TEI list, conference, SIGs)
• Document interchange
• Tool sharing
• Support
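For readers who have not seen TEI, the sketch below shows what a very small TEI document might look like and how its element usage can be counted, a first step towards the statistical uses mentioned earlier. The skeleton (a `teiHeader` with `fileDesc`, followed by `text`) reflects a common understanding of the minimum the TEI Guidelines expect, but it has not been validated against a TEI schema here; the content strings and the `local_name` helper are invented for the example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Assumed minimal skeleton; not validated against a TEI schema.
MINIMAL_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample title</title></titleStmt>
      <publicationStmt><p>Unpublished sample.</p></publicationStmt>
      <sourceDesc><p>Born-digital sample.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>Sample paragraph.</p>
    </body>
  </text>
</TEI>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.split("}", 1)[-1]


root = ET.fromstring(MINIMAL_TEI)

# Count how often each element is used in the document.
element_counts = Counter(local_name(el.tag) for el in root.iter())
print(element_counts.most_common(3))
```

The same counting approach scales to a whole corpus, for example to check which parts of the vocabulary a project actually uses before customizing it.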
