Why Use XML-TEI Markup? • Why Use Markup at All?
12/10/2016

Why use XML-TEI markup?
Alejandro Bia
Universidad Miguel Hernández
Spain

Some questions we asked ourselves…
• Why use markup at all?
• Why choose XML, and TEI?
• Why is XML better than SGML?
• Why is TEI better than other markup vocabularies?

Why use markup at all?
Markup:
• is any means of making explicit an interpretation of a text.
• both for computers and for humans (text encoders)
• useful for data and metadata
The alternatives are:
• Plain text (like the Gutenberg project)
• Facsimile formats: TIFF / JPEG / PDF / DjVu
• Marked-up text + facsimiles: more functionality
Three types of markup:
• Punctuational
• Presentational or procedural
• Descriptive or structural (adds semantic value to the text)

When are facsimiles better than text?
Digital facsimiles are used to reproduce newspapers with historical value (e.g. “Izquierda Republicana”).
Or also to accurately reproduce old printings (e.g. “La Celestina”).
But the most interesting use is to reproduce manuscripts (e.g. “El Divino Cazador”).
And, for everything else, there’s… TEXT: plain text, marked-up text, formatted text…
Example images from the MCDL website.

Digital facsimiles with text transcriptions

Markup is based on the OHCO Model
OHCO:
• Ordered Hierarchy of Content Objects
• A book is a good example of what can be represented in this way.
• Each Content Object is marked at its beginning and end with a tag.
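As a rough illustration of the OHCO idea, the sketch below parses a tiny, made-up book with Python's standard `xml.etree.ElementTree` module and lists its content objects as an ordered hierarchy. The sample document, the element names (`book`, `chapter`, `title`, `p`) and the helper names `outline` and `is_well_formed` are assumptions chosen for illustration; they are not TEI elements or code from the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: every content object opens and closes with a tag.
SAMPLE = """<book>
  <title>A Sample Book</title>
  <chapter n="1">
    <title>First chapter</title>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </chapter>
  <chapter n="2">
    <title>Second chapter</title>
    <p>Only paragraph.</p>
  </chapter>
</book>"""


def outline(element, depth=0):
    """Return (depth, tag) pairs in document order (a pre-order walk)."""
    rows = [(depth, element.tag)]
    for child in element:  # children are iterated in document order
        rows.extend(outline(child, depth + 1))
    return rows


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    # Print the hierarchy, indenting each content object by its depth.
    for depth, tag in outline(root):
        print("  " * depth + tag)
    # A content object whose end tag is missing is rejected by the parser.
    print(is_well_formed("<book><title>Unclosed</book>"))
```

The walk shows both halves of the model: the order comes from iterating children in document order, and the hierarchy comes from the nesting of start and end tags.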
Transcription notation according to the Madison Guidelines - Hispanic Seminar of Medieval Studies (HSMS)

(c) Alejandro Bia

Limitations of the OHCO Model
• OHCO is good in most cases (Pareto law - 80/20)
• Except, for instance… when we need overlapping markup (forbidden by design)
XML syntaxes for overlap, such as in TEI or in LMNL*, adopt five different techniques:
• milestones
• fragmentation
• flattened
• multiple document
• standoff
* LMNL: The Layered Markup and Annotation Language

Markup Languages and Vocabularies
[Diagram: MARKUP LANGUAGES (in black) and MARKUP VOCABULARIES (in red). Labels: SGML, XML, TEI-SGML, TEI-XML, DocBook, XHTML, HTML (strict, loose)]

The rules of XML: compared to SGML
The XML markup language imposes strict rules:
• All opened tags must be closed.
• All empty tags should use the / (as in <tag/>) to indicate that the tag is empty.
• All attribute values within an XML document must be quoted.
• The names of Elements and Attributes used in an XML document must be an exact match of how they are declared in the DTD (this includes letter-case).
XML imposes more constraints than SGML.
This makes it easier to use and to process.

XML and digital preservation concerns…
Digital preservation is focused on:
• Preventing data loss, mainly due to deterioration of the media
• Preventing obsolescence of the data formats
XML is plain text + tags:
• Tags are text as well: <tag also="text">
• … so XML is just text
• … the oldest, simplest, most widely used computer file format
However, we still have to be aware of changes in image formats:
• TIFF, JPEG, PNG…
Many file formats rely on XML nowadays:
• MS .docx .pptx .xlsx
• EPUB, SVG
• Java JAR files
• And a long etc. …

Single source, multiple outputs, automatic transformation...
Avoid duplication of information
Principle of UNIQUE SOURCE
[Diagram: UNIQUE SOURCE DOCUMENT (XML-TEI), the source to preserve → automatic transformation → disposable subproducts: HTML, PDF, eBook formats, Searches / Concordances, Etc.]

Likewise… with images: single source, multiple outputs, and automatic transformation
Avoid duplication of information
PRINCIPLE OF UNIQUE SOURCE
[Diagram: HIGH RESOLUTION IMAGE (TIFF), the source to preserve → automatic transformation → disposable subproducts: THUMBNAILS, SMALL JPEG, MEDIUM JPEG, LARGE JPEG]

XML as a family of technologies and tools
XML is not just a markup language but a family of standards and related tools that interoperate:
– XML
– Namespaces
– XPath
– XSL or XSLT
– DTDs and Schemas
– XQuery
– XLink and XPointer
– and many powerful software tools…

So, why use XML?
DH Research potential
• Precise searches
• Automatic concordances
• Comparative analysis (parallel alignment)
• Linguistic research
• Statistical research
Multipurpose
• can be used for different research purposes
• can be rendered in several formats
• adequate for Web publishing

So, why use XML?
Ease of preservation
• It is text
• Easily converted to other formats
• Easily upgradeable
Drawbacks:
• encoding is more expensive
• requires skilled personnel
• overlapping markup is not allowed

And, why use TEI?
It’s the most complete markup vocabulary
• More than 500 elements
• Specialized tagsets
• Customizable
• A de-facto standard for scholarship
Plus
• Community collaboration (TEI list, conference, SIGs)
• Documents interchange
• Tool sharing
• Support
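The single-source principle and the "precise searches" point can be sketched together: one marked-up source, from which a search result and two disposable outputs are derived automatically. This is a minimal sketch using only Python's standard library in place of XSLT or XQuery. The sample document, the TEI-like element names used without the TEI namespace (`text`, `body`, `div`, `head`, `p`, `name`) and the function names are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical, TEI-like source (no namespace, simplified for the sketch).
SOURCE = """<text>
  <body>
    <div n="1">
      <head>Act I</head>
      <p>Enter <name type="person">Calisto</name> and <name type="person">Melibea</name>.</p>
    </div>
    <div n="2">
      <head>Act II</head>
      <p><name type="person">Calisto</name> speaks in <name type="place">Salamanca</name>.</p>
    </div>
  </body>
</text>"""


def flatten(element):
    """Join the text of an element and its descendants, normalizing spaces."""
    return " ".join("".join(element.itertext()).split())


def to_plain_text(root):
    """Disposable output 1: plain text, one line per heading or paragraph."""
    return "\n".join(flatten(e) for e in root.iter() if e.tag in ("head", "p"))


def to_html(root):
    """Disposable output 2: a minimal HTML fragment, one section per div."""
    parts = []
    for div in root.iterfind("./body/div"):
        parts.append("<section>")
        parts.append("<h2>" + escape(div.findtext("head", default="")) + "</h2>")
        for p in div.iterfind("p"):
            parts.append("<p>" + escape(flatten(p)) + "</p>")
        parts.append("</section>")
    return "\n".join(parts)


def find_names(root, name_type):
    """Precise search: contents of <name> elements with a given type."""
    # ElementTree supports a limited XPath subset, including [@attr='value'].
    return [n.text for n in root.iterfind(".//name[@type='%s']" % name_type)]


if __name__ == "__main__":
    root = ET.fromstring(SOURCE)
    print(find_names(root, "person"))
    print(to_plain_text(root))
    print(to_html(root))
```

Because the names are tagged by type, the search returns only persons or only places, which a plain-text search could not distinguish. Both outputs can be regenerated from the source at any time, so only the marked-up source needs to be preserved.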