The TNTBASE System and Validation of XML Documents

Vyacheslav Zholudev
Jacobs University of Bremen
D-28759, Bremen, Germany
[email protected]

Abstract

TNTBASE is an open-source versioned XML database obtained by integrating Berkeley DB XML into the Subversion Server. The system is intended as a basis for collaborative editing and sharing of XML-based documents. It integrates versioning and fragment access needed for fine-granular document content management.

Well-formedness of electronic documents plays a major role in contemporary document workflows, and applications have to provide domain-specific validation mechanisms for the documents they work with. On the other hand, XML is coming of age as a basis for document formats, and even though many schema-based validation formats and tools are available for XML, domain-specific validation needs remain largely unmet.

In this paper we present the TNTBASE system in general and its validation support for XML documents.

1 Introduction

With the rapid growth of computers and Internet resources, communication between humans has become much more efficient. The number of electronic documents and the speed of communication are growing rapidly. We see the development of a deep web (web content stored in databases) from which the surface Web (what we see in our browsers) is generated. With the merging of XML fragment access techniques (most notably URIs [BLFM98] and XPath [CD99; BBC+07]) and database techniques, and with the ongoing development of XML-based document formats, we are seeing the beginnings of a deep web of XML documents, where surface documents are assembled, aggregated, validated and mashed up from background information in XML databases by techniques like XQuery [XQu07], and document (fragment) collections are managed by XQuery Update [XQU08].

The Web is constantly changing: it has been estimated that 20% of the surface Web changes daily and 30% monthly [CGM00; FMNW03]. While archiving services like the Wayback Machine try to get a grip on this for the surface level, we really need an infrastructure for managing and validating changes in the XML-based deep web.

Unfortunately, support for this has been very limited. Version control systems like CVS and Subversion [SVN08], which have transformed collaboration workflows in software engineering, are deeply text-based (with respect to diff/patch/merge) and do not integrate well with XML databases and validators for the different schema languages for XML, like RelaxNG [Rel] or XML Schema [W3C06]. Some relational databases address temporal aspects [DDL02], but this does not seem to have counterparts in the XML database or XQuery world. Wikis provide simple versioning functionalities, but these are largely hand-crafted into each system's (relational) database design.

In this paper we briefly describe the TNTBASE system, an open-source versioned XML database obtained by integrating Berkeley DB XML [Ber09b] into the Subversion Server [SVN08]. The system is intended as an enabling technology that provides a basis for future XML-based document management systems that support collaborative editing and sharing by integrating versioning and fragment access, both of which are needed for fine-granular document content management. We also discuss our vision of how the validation of XML documents should be done in a way that conforms to the Subversion philosophy, and how it fits into the TNTBASE system.

The TNTBASE system is developed in the context of the OMDoc project (Open Mathematical Documents [OMD; Koh06]), an XML-based representation format for the structure of mathematical knowledge and communication. Correspondingly, the development requirements for TNTBASE come from OMDoc-based applications and their storage needs. We are experimenting with a math search engine [KŞ06], the collaborative community-based reader panta rhei [pan], the semantic wiki SWiM [Lan08], the learning system for mathematics ActiveMath [Act08], and VeriFun [Ver08], a system for the verification of statements about programs.

In the next section we summarize what the TNTBASE system is and what it does. We then cover the validation mechanisms offered by TNTBASE (see Section 3) and explain some design decisions. Section 4 details ideas for future work regarding validation of OMDoc documents, and Section 5 concludes the paper.

2 The TNTBASE System

2.1 Overview

In this section we provide an overview of TNTBASE to allow a better understanding of Section 3. Details can be found in [ZK09].

A slightly simplified view of the TNTBASE architecture is presented in Figure 1. The core of TNTBASE is the XSVN library developed by the author. The main difference between XSVN and the SVN server it replaces is that the former stores the youngest revisions of XML files, and also other revisions (on user request), in Berkeley DB XML (DB XML) instead of Berkeley DB [Ber09a]. This allows us to employ the DB XML API for querying and modifying XML files via XQuery and XQuery Update. The substitution also helps us keep XML files well-formed and, optionally, conformant to an XML Schema.

Figure 1: TNTBase architecture

In TNTBASE, XSVN is managed by Apache's mod_dav_svn module or accessed locally, on the same machine, by DB XML ACCESSOR (a Java library which provides high-level access to DB XML on top of its API). Apache's mod_dav_svn module exposes an HTTP interface exactly as it does in SVN. A TNTBASE user can therefore work with a TNTBASE repository in exactly the same way as with a normal SVN repository via the HTTP protocol, including Apache's SVN authentication via authz and groups files. In other words, any SVN client is able to communicate with TNTBASE. Non-XML content can be managed in TNTBASE as well, but only via XSVN's HTTP interface.

The DB XML ACCESSOR module can work directly with XML content in an XSVN repository by using the DB XML API. All information needed for XML-specific tasks is stored in a DB XML container, using additional documents or metadata fields of documents. SVNKITADAPTER (a Java library which employs SVNKit [SVN07]) comes into play when revision information needs to be accessed, and acts as a mediator between an XSVN repository and DB XML ACCESSOR. In turn, when DB XML ACCESSOR needs to create a new revision in an XSVN repository, it also uses SVNKITADAPTER functionality.

The DB XML ACCESSOR realizes a number of useful features, but is able to access an XSVN repository only locally. To expose all its functionality to the world, TNTBASE provides a RESTful interface, see [ZK09; TNT09b]. We use the Jersey [Jer09] library to implement the RESTful interface in TNTBASE. Jersey is a reference implementation of JAX-RS (JSR 311), the Java API for RESTful Web Services [JSR09], and has simplified our implementation considerably.

TNTBASE provides a test web form that allows users to play with a subset of the TNTBASE functionality. An XML-content browser is also available online; it shows the TNTBASE file system content, including virtual files. Unified authentication for all interfaces is a subject for future work (footnote 1). Currently readers can try out an online TNTBASE test instance (see [TNT09a]); for additional information refer to [TNT09b].

2.2 XSVN, an XML-enabled Repository

Since XSVN is the core of TNTBASE and will be referred to in Section 3, we cover it here in detail. The architecture of XSVN, and thus of TNTBASE, is motivated by the following observation: both the SVN server and the DB XML library are based on Berkeley DB (BDB) [Ber09a]. The SVN server uses it to store repository information (footnote 2), and DB XML uses it for storing raw bytes of XML and for supporting consistency, recoverability and transactions. Moreover, transactions can be shared between BDB and DB XML. Let us look at the situation in more detail.

Figure 2: XSVN repository

The SVN BDB-based file system uses multiple tables to store different kinds of repository information, such as information about locks, revisions, transactions, files, and directories. The two important tables here are representations and strings. The strings table stores only raw bytes, and one entry of this table can be any of these:

1. a file's contents, or a delta (a difference between two versions of the same entity (directory entry lists, files, property lists) in a special format) that reconstructs file contents

2. a directory entry list in a special format called skel, or a delta that reconstructs a directory entry list skel

3. a property list skel, or a delta that reconstructs a property list skel

From looking at a strings entry alone there is no way to tell what kind of data it represents; the SVN server uses the representations table for this. Its entries are links that address entries in the strings table, together with information about what kind of strings entry is referenced and, if it is a delta, what it is a delta against. Note that the SVN server stores only the youngest revision (called the head revision) explicitly in the strings table. Other revisions of any entity (a file, a directory or a property list) are recomputed by recursively applying inverse deltas from the head revision.

To extend SVN to XSVN (an XML-enabled repository), we have added the DB XML library to SVN and added a new type of entry in the representations table that points to the last version of that document in the DB XML container

Footnote 1: See ticket https://trac.mathweb.org/tntbase/ticket/3
Footnote 2: In fact SVN can also use a file-system based storage back end (SVN FS), but this does not affect TNTBASE.
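
The storage scheme described in Section 2.2 (a strings table of raw bytes, a representations table that says how to interpret each entry, and a head revision stored in full with older revisions reachable through inverse deltas) can be illustrated with a small in-memory sketch. It is not the SVN or XSVN implementation. The class and function names, the "fulltext" and "delta" labels, and the prefix/suffix delta encoding are all assumptions made for illustration; SVN uses its own delta format and Berkeley DB tables.

```python
from dataclasses import dataclass
from typing import Dict, Optional


def make_delta(source: bytes, target: bytes) -> bytes:
    """Encode how to turn `source` into `target` (toy format, an assumption)."""
    limit = min(len(source), len(target))
    prefix = 0
    while prefix < limit and source[prefix] == target[prefix]:
        prefix += 1
    suffix = 0
    # The common suffix must not overlap the common prefix.
    while suffix < limit - prefix and source[-1 - suffix] == target[-1 - suffix]:
        suffix += 1
    middle = target[prefix:len(target) - suffix]
    return b"%d|%d|" % (prefix, suffix) + middle


def apply_delta(source: bytes, delta: bytes) -> bytes:
    """Rebuild the target from `source` and a delta made by make_delta."""
    prefix_raw, suffix_raw, middle = delta.split(b"|", 2)
    prefix, suffix = int(prefix_raw), int(suffix_raw)
    return source[:prefix] + middle + source[len(source) - suffix:]


@dataclass
class Representation:
    kind: str                   # "fulltext" or "delta" (hypothetical labels)
    string_key: str             # key of the raw bytes in the strings table
    base: Optional[int] = None  # for a delta: the revision it applies to


class ToyRepository:
    """Hypothetical stand-in for the strings and representations tables."""

    def __init__(self) -> None:
        self.strings: Dict[str, bytes] = {}
        self.representations: Dict[int, Representation] = {}
        self.head = 0  # no revisions yet

    def commit(self, content: bytes) -> int:
        new_rev = self.head + 1
        if self.head:
            # Demote the old head: keep only an inverse delta (new -> old).
            old_content = self.read(self.head)
            old_key = self.representations[self.head].string_key
            self.strings[old_key] = make_delta(content, old_content)
            self.representations[self.head] = Representation(
                "delta", old_key, base=new_rev)
        key = f"s{new_rev}"
        self.strings[key] = content  # only the head is stored in full
        self.representations[new_rev] = Representation("fulltext", key)
        self.head = new_rev
        return new_rev

    def read(self, rev: int) -> bytes:
        rep = self.representations[rev]
        raw = self.strings[rep.string_key]
        if rep.kind == "fulltext":
            return raw
        # Recursively rebuild the base revision, then apply the inverse delta.
        return apply_delta(self.read(rep.base), raw)


repo = ToyRepository()
repo.commit(b"<doc><p>one</p></doc>")
repo.commit(b"<doc><p>two</p></doc>")
first = repo.read(1)  # rebuilt from the head via one inverse delta
```

Reading revision 1 here walks the representations entries back from the head, which mirrors the recursive reconstruction described above. The sketch covers only the plain SVN scheme, not the additional entry type that XSVN introduces.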
