Markup Overlap: a Review and a Horse

Extreme Markup Languages 2004® Montréal, Québec August 2-6, 2004 Markup Overlap: A Review and a Horse Steven DeRose Bible Technologies Group Abstract "Overlap" is the common term for cases where some markup structures do not nest neatly into others, such as when a quotation starts in the middle of one paragraph and ends in the middle of the next. OSIS [Duru03], a standard XML schema for Biblical and related materials, has to deal with extreme amounts of overlap. The simplest case involves book/chapter/verse and book/story/ paragraph hierarchies that pervasively diverge; but many types of overlap are more complicated than this. The basic options for dealing with overlap in the context of SGML [ISO 8879] or XML [Bray98] are described in the TEI Guidelines [TEI]. I summarize these with their strengths and weaknesses. Previous proposals for expressing overlap, or at least kinds of overlap, don't work well enough for the severe and frequent cases found in OSIS. Thus, I present a variation on TEI milestone markup that has several advantages. This is now the normative way of encoding non-hierarchical structures in OSIS documents. Markup Overlap: A Review and a Horse Table of Contents Introduction.......................................................................................................................................1 The problem types.............................................................................................................................1 Evaluation criteria.............................................................................................................................2 Some available solution types ..........................................................................................................3 SGML CONCUR .......................................................................................................................3 Segmentation...............................................................................................................................4 MECS [Sper98]...........................................................................................................................4 JITTs...........................................................................................................................................4 Joins............................................................................................................................................5 Standoff markup..........................................................................................................................5 TEI-style Milestones...................................................................................................................6 Differing type of milestones..............................................................................................................7 Catching Hell'n tagging ....................................................................................................................7 CLIX and LMNL..............................................................................................................................8 Creating CLIX from LMNL........................................................................................................9 On ordering identical ranges.......................................................................................................9 Ranges co-terminous at only one end........................................................................................11 Range ordering with and without co-located endpoints............................................................11 Canonical ordering in CLIX......................................................................................................12 Creating better XML from CLIX..............................................................................................13 Implementations..............................................................................................................................13 Summary.........................................................................................................................................13 Acknowledgements.........................................................................................................................14 Bibliography....................................................................................................................................14 The Author......................................................................................................................................15 Markup Overlap: A Review and a Horse Markup Overlap: A Review and a Horse Steven DeRose § Introduction "Overlap" describes cases where some markup structures do not nest neatly into others, such as when a quotation starts in the middle of one paragraph and ends in the middle of the next. OSIS [Duru03], a standard XML schema for Biblical and related materials, has to deal with extreme amounts of overlap. The simplest is book/chapter/verse and book/story/paragraph hierarchies that pervasively diverge; but many types of overlap are more complicated than this. The basic options for dealing with overlap in the context of SGML [ISO 8879] or XML [Bray98] are described in the TEI Guidelines [TEI]. I summarize these with their strengths and weaknesses. Previous proposals for expressing overlap, or at least kinds of overlap, don't work well enough for the severe and frequent cases found in OSIS. Thus, I present a variation on TEI milestone markup that has several advantages, though it is not a panacea. This is now the normative way of encoding non-hierarchical structures in OSIS documents. § The problem types The simplest type of non-hierarchical structure is the union of multiple structures, each of which is hierarchical. In this case, if you discard all the markup for all but one of the structures, the one structure remaining is a non-problematic hierarchy. For example, the Biblical book/chapter/verse and book/story/ paragraph/sentence hierarchies are essentially independent below the level of book (which must be identical across the hierarchies). A particularly nasty case of such overlap arises when one kind of component crosses many others, as sometimes happens with quotes. The example below shows a typical case from the Biblical book of Jeremiah, and one way to encode it by segmenting the non-hierarchical quotes into parts that are hierarchical, then co-indexing the parts of each logical quote by assigning them the same value for the "splitID" attribute. First, consider the passage without markup. The parenthesized numbers indicate where each numbered verse begins; indentation indicates nested quotations. (1) Moreover the word of the LORD came to me, saying, (2) Go and cry in the hearing of Jerusalem, saying, Thus says the LORD: I remember you, The kindness of your youth, The love of your betrothal, When you went after Me in the wilderness, In a land not sown. (3) Israel [was] holiness to the LORD, The firstfruits of His increase. All that devour him will offend; Disaster will come upon them, says the LORD. (4).... The difficulty is that verse 3 begins in the middle of a third-level nested quote. In a hierarchy, ending verse 2 forces ending the three quotes as well. Then as verse 3 begins all those quotes must be re- opened. Further, it is important to encode somehow that the three quotes opened at the start of verse 3 are not just any three quotes, but exactly the same ones that were closed to end verse 2. Put another way, these quotes should not logically have closed at the verse boundary at all -- if they close it is because of syntax limitations. Below is the marked-up equivalent, where distinct physical parts of each single logical quote are co-labeled using values of the splitID attribute: <div> <p> <verse osisID="Jer.2.1">Moreover the word Extreme Markup Languages 2004® page 1 DeRose of the LORD came to me, saying,</verse> <verse osisID="Jer.2.2"> <q splitID="Q-Jer.2.2-A">Go and cry in the hearing of Jerusalem, saying, <q splitID="Q-Jer.2.2-B">Thus says the LORD: <q splitID="Q-Jer.2.2-C">I remember you, The kindness of your youth, The love of your betrothal, When you went after Me in the wilderness, In a land not sown. </q> </q> </q> </verse> <verse osisID="Jer.2.3"> <q splitID="Q-Jer.2.2-A"> <q splitID="Q-Jer.2.2-B"> <q splitID="Q-Jer.2.2-C">Israel [was] holiness to the LORD, The firstfruits of His increase. All that devour him will offend; Disaster will come upon them,</q>  says the LORD.</q>  </q>  </verse> </p> ... This sort of case is commonly advanced to show that the "Ordered Hierarchy of Content Objects" model does not suffice [see Dura96, Rene95]. It does show that, but it remains simple enough that theories and systems of multiple simple hierarchies remain viable. Somewhat more complicated are cases where components of what is arguably the same hierarchy can remain open beyond the end of the component in which they began, but instances of the same component type can never overlap each other. This may not seem to be a natural category because it depends on the details of particular schema; but it is a common class: even quotations operate this way. Quotes generally do not partially overlap other quotes, although they may overlap many other structures and may be arbitrarily nested. This model is very close to that of MECS [Sper98] and permits some

Markup Overlap: a Review and a Horse

THE TEXT ENCODING INITIATIVE Edward

Towards a Content-Based Billing Model: the Synergy Between Access Control and Billing

3.4 the Extensible Markup Language (XML)

A Personal Research Agent for Semantic Knowledge Management of Scientiﬁc Literature

Never the Same Stream: Netomat, Xlink, and Metaphors of Web Documents Colin Post Patrick Golden Ryan Shaw

Streaming-Based Processing of Secured XML Documents

Semantic Web Technologies and Legal Scholarly Publishing Law, Governance and Technology Series

Proceedings of Balisage: the Markup Conference 2012

Towards the Unification of Formats for Overlapping Markup

Standoff Markup Peter L

The MLCD Overlap Corpus (MOC)

Enhancing the Security of Applications Using XML-Based Technologies