Infrastructure-Agnostic Hypertext
Jakob Voß
[email protected]
Verbundzentrale des GBV (VZG)
arXiv:1907.00259v1 [cs.DL] 29 Jun 2019

ABSTRACT

This paper presents a novel and formal interpretation of the original vision of hypertext: infrastructure-agnostic hypertext is independent from specific standards such as data formats and network protocols. Its model is illustrated with examples and references to existing technologies that allow for implementation and integration in current information infrastructures such as the Internet.

CCS CONCEPTS

• Information systems → Hypertext languages; Document representation; • Human-centered computing → Hypertext / hypermedia.

KEYWORDS

Xanadu, hypermedia, transclusion, documents

INTRODUCTION

The original vision of hypertext as proposed by Ted Nelson [14, 22] still waits to be realized. His influence is visible through people who, inspired by his works, shaped the computer world of today, last but not least the Web [20].¹ Nelson’s core idea, a network of visibly connected documents called Xanadu, goes beyond the Web in several aspects. In particular it promises non-breaking links and it uses links to build documents (with versions, quotations, overlay markup...) instead of using documents to build links [18]. The concept of hypertext or, more generally, ‘hypermedia’ has also been used differently from Nelson, both in the literary community (that focused on simple links), and in the hypertext research community (that focused on tools) [30].

This paper tries to get back to the original vision of hypertext by specification of a formal model that puts transcludeable documents at its heart. Apart from Nelson’s works [14–17, 19, 22] this paper draws from research on data format analysis [11, 28], content-based identifiers [12, 26], existing transclusion technologies for the Web [1, 2, 6, 25] and hypertext systems beyond link-based models [3].

Limited by the state of document processing tools and by submission guidelines, this paper is not a demo of hypertext.² On a closer look however there are traces of transclusion links that have been processed into this paper. See figure 2 for an overview and https://github.com/jakobib/hypertext2019 for details and sources.

¹ Tim Berners-Lee references Nelson both in the proposal that led to the Web [4, 15] and at the early W3C homepage http://www.w3.org/Xanadu.html (1992/93) [17].
² See [5] for a good example of what an actual demo paper might look like.

OUTLINE

The architecture of an infrastructure-agnostic hypertext system consists of four basic elements and their relations:

1) documents include all finite, digital objects
2) document identifiers reference individual documents
3) content locators reference segments within documents
4) edit lists combine parts of existing documents into new ones

Instances of each element can further be grouped by data formats. The elements and their relations are each described in the following sections after a formal definition.

Formal model

A hypertext system is a tuple ⟨D, I, C, E, S, R, U, T, A⟩ where:

D is a set of documents
I is a set of document identifiers with I ⊂ D
C is a set of content locators with C ⊂ D
E is a set of edit lists with E ⊂ D
S is a set of document segments with S ⊂ C × D

Document sets can each be grouped into (possibly overlapping) data formats. The hypertext system further consists of:

R is a retrieval function with R : I → D
U is a segments usage function with U : E → P(S)
T is a transclusion function with T : S → D
A is a hypertext assemble function with A : E → D

A practical hypertext system needs executable implementations of the functions R, U, T, A and a method to tell whether a given combination ⟨c, d⟩ ∈ C × D is part of S to allow its use with T.

Documents

A document is a finite sequence of bytes. This definition roughly equates with the definition of data as documents [8, 29]. The notion of hypertext used in this paper therefore subsumes all kinds of hyperdata (datasets that transclude other datasets). Documents are static by content [24] but they may be processed dynamically. Documents are grouped by non-disjoint data formats such as UTF-8, CSV, SVG, PDF and many more.

Document identifiers

A document identifier is a relatively short document that refers to another document. Identifiers have properties depending on the particular identifier system they belong to [29, pp. 59-71]. Identifiers in infrastructure-agnostic hypertext must first be unambiguous (an identifier must reference only one document), persistent (the reference must not change over time), and actionable (hypertext systems should provide methods to retrieve documents via the retrieval function R). Properties that should be fulfilled at least to some degree include uniqueness (a document should not be referenced by too many identifiers), performance (identifiers should be easy to compute and to validate), and distributedness (identifiers should not require a central institution). The actual choice of an identifier system depends on the weighting of specific requirements. A promising choice is the application of content-based identifiers as proposed at least once in the context of hypertext systems [12].

Content locators

A content locator is a document that can be used to select parts of another document via transclusion. Nelson refers to these locators as “reference pointers” [19], exemplified with spans of bytes or characters in a document. Content locators depend on data formats and document models. For instance the locator languages XPath, XPointer, and XQuery act on XML documents, which can be serialized in different forms (therefore it makes no sense to locate parts of an XML document with positions of bytes). Other locator languages apply to tabular data (SQL, RFC 7111), to graphs (SPARQL, GraphQL), or to two-dimensional images (IIIF), to name a few. Whether and how parts of a document can be selected with a content locator language depends on which data format the document is interpreted in. For instance an SVG image file can be processed at least as an image, as an XML document, or as a Unicode string, each with its own methods of locating document segments. Content locators can be extended to all executable programs that reproducibly process some documents into other documents. This generalization can be useful to track data processing pipelines as hyperdata, as discussed for executable papers and reproducible research. Restriction of content locators to less powerful query languages might make sense from a security point of view.

Edit Lists

An edit list is a document that describes how to construct a new document from parts of other documents. Edit lists are known as Edit Decision Lists in Xanadu and the idea was borrowed from film making [15]. Simplified forms of edit lists are implemented in version control systems and in collaborative tools such as wikis and real-time editing. Hypertext edit lists go beyond this one-dimensional case by supporting multiple source documents and more flexible methods of document processing in addition to basic operations such as insert, delete, and replace. The actual processing steps tracked by an edit list depend on the data formats of the transcluded documents. Just like content locators, edit lists could be extended to arbitrary executable programs that implement the hypertext assemble function A for some subset of edit lists.

Data formats

An often neglected fundamental property of digital documents is their grounding in data formats. A data format is a set of documents that share a common data model, also known as their document model, and a common serialization (see fig. 1). Models define elements of a document in terms of sets, strings, tuples, graphs or similar structures. These structures are mathematically rigorous in theory [24] but more based on descriptive patterns in practice [28]. The meaning of these elements (for instance “words”, “sentences”, and “paragraphs” in a document model) is based on ideas that we at least assume to be consistent among different people.

[Figure 1: levels of data modeling (idea, model, format, serialization, bytes such as 01101...)]

Data modeling, the act of mapping between ideas, models, and formats, is an unsolved problem because ideas can be expressed in many models and models can be interpreted in many ways [11]. Data models can further be expressed in multiple formats although these formats should be fully convertible between each other, at least in theory.⁴ Formats can further be serialized in multiple forms, which for their part are based on other data models. For instance the RDF model can be serialized in RDF/XML format which is based on the XML model. XML can be serialized in XML syntax which is based on the Unicode model, and Unicode can be serialized in UTF-8. At the end of these chains of abstraction eventually all documents can be serialized as a sequence of bytes. Serializations are seldom simple mappings as most serialization formats allow insignificant variances such as additional whitespace. To check whether a data object conforms to a serialization, formats are often described with a formal grammar that can also give insights about the format’s model.⁵ Infrastructure-agnostic hypertext does not impose any limits on possible data formats and their models.

EXAMPLE

The following example may illustrate the formal model and its elements. Let ⟨D, I, C, E, S, R, U, T, A⟩ be a hypertext system with D the set of printable ASCII character strings and some documents:

d1 = ‘My name is Alice’
d2 = ‘Alice’
c1 = ‘char=11,15’
d3 = ‘Hello, !’
c2 = ‘char=7’
d4 = ‘Hello, Alice!’

If c1 and c2 are read in content locator syntax as defined by RFC 5147 so that T(⟨c1, d1⟩) = d2 then d4 can be constructed by
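As a rough illustration of the transclusion function T on character locators, the Alice example can be sketched in Python. This is a minimal sketch under stated assumptions, not the paper's implementation: documents are plain Python strings, only the `char=` scheme of RFC 5147 is handled, and the names `parse_char_locator`, `transclude` and `insert_at` are hypothetical. The insertion step stands in for a single edit list operation and is likewise an assumption. On my reading of RFC 5147, positions count the gaps between characters starting at 0, so the five characters of ‘Alice’ in d1 lie between positions 11 and 16; the sketch therefore passes `char=11,16`.

```python
import re
from typing import Optional, Tuple


def parse_char_locator(locator: str) -> Tuple[int, Optional[int]]:
    """Parse 'char=N' (a position) or 'char=N,M' (a range) into (start, end)."""
    match = re.fullmatch(r"char=(\d+)(?:,(\d+))?", locator)
    if match is None:
        raise ValueError(f"unsupported locator: {locator!r}")
    start = int(match.group(1))
    end = int(match.group(2)) if match.group(2) is not None else None
    return start, end


def transclude(locator: str, document: str) -> str:
    """Sketch of T: return the segment of `document` selected by `locator`."""
    start, end = parse_char_locator(locator)
    if end is None:
        end = start  # a bare position selects an empty segment
    if not 0 <= start <= end <= len(document):
        # the pair (locator, document) is not a valid segment, i.e. not in S
        raise ValueError("locator does not fit the document")
    return document[start:end]


def insert_at(locator: str, target: str, segment: str) -> str:
    """Hypothetical assemble step: put `segment` into `target` at `locator`."""
    start, end = parse_char_locator(locator)
    if end is None:
        end = start
    if not 0 <= start <= end <= len(target):
        raise ValueError("locator does not fit the document")
    return target[:start] + segment + target[end:]


d1 = "My name is Alice"
d3 = "Hello, !"
d2 = transclude("char=11,16", d1)  # assumed range, see the note above
d4 = insert_at("char=7", d3, d2)
print(d4)  # Hello, Alice!
```

Keeping the locator parser separate from the two operations reflects the point that the same locator document can be used both to select a segment and to mark where a segment is placed.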