Phd and Mphil Thesis Classes
Total Page:16
File Type:pdf, Size:1020Kb
Enhancing XML Preservation and Workflows Viacheslav Zholudev School of Engineering and Science Jacobs University Bremen A thesis submitted for the degree of Doctor of Philosophy in Computer Science Dissertation Committee: Prof. Dr. Michael Kohlhase (supervisor) Prof. Dr. Peter Baumann Prof. Dr. Dieter Hutter Defended: 2012, July 17th To my parents... Declaration I herewith declare that I have produced this paper without the prohibited assistance of third parties and without making use of aids other than those specified; notions taken over directly or indirectly from other sources have been identified as such. This thesis has not previously been presented in identical or similar form to any other German or foreign examination board. Viacheslav Zholudev Nothing is as practical as a good theory. Kurt Lewin Abstract With the proliferation of computers and networked resources the amount of data avail- able on personal computers and on the Web is growing exponentially. Handling of data becomes more complex as the scale expands. Moreover, data collections change in place over time, so that one of the important challenges consists in supporting a life-cycle of data. In particular, relevant parts of information have to be preserved and made accessible for future retrieval. XML is a wide-spread language for encoding documents that is both machine- readable and human-readable. It plays a unique role on the Web and in other areas of digital life: publishers, libraries, warehouses, technical writers, to name a few, { all make extensive use of XML for encoding books, articles, product information, documen- tation, interchanging and transformation of data, and the extraction and aggregation of relevant pieces of information. Contemporary web services are stepping away from traditional relational data models and are moving towards semi-structured data repre- sentations and utilization of NoSQL, and, in particular, XML databases for persisting data. A considerable technological stack has been built around XML, making it even stronger as a format. The development of XML technologies specifications is a never-ending process and emerging implementations of them push the progress further. However, there are still many open problems and challenges to be addressed in the XML domain. This thesis selects a number of the unsolved ones and suggests solutions; concretely (i) Versioning support for XML, (ii) XML databases views as counterparts of relational database views, (iii) Support for XML document templating. Solving the XML versioning problem for a subset of use cases enables an XML persistence layer that tracks the history of changes and provides XML database func- tionality like querying or indexing of data. The XML database views concepts allows to abstract away from the notion of XML files and think in terms of customizable and editable abstract XML entities with origins in some XML documents. Finally, support for XML document templating facilitates separation of responsibilities while authoring XML documents. Being able to use diverse expertise of developers in different phases of document creation, in turn, optimizes the whole authoring workflow. Highlighting the practical value of this work, all the concepts described in this the- sis are implemented and integrated into the TNTBase system (http://tntbase.org/). Over the last 4 years TNTBase has been constantly utilized in a number of mainly research projects maturing by receiving feedback from its users. The main target of the TNTBase project was providing a versioned repository for the Open Mathematical Documents (OMDoc) with a strong focus on XML-related functionality. Since the v OMDoc format combines data-like aspects (axioms, theorems, examples, etc.) with document-like aspects (sections, paragraphs, etc.), the number of applications was di- verse, and therefore all the approaches have been generalized to make it possible to adapt TNTBase to other domains and XML languages. As a result, TNTBase has been deployed in multiple real-life scenarios where it has been used on a daily basis. vi Acknowledgements A lot of people have helped me to accomplish results described in this thesis. First of all I would like to thank my supervisors: my \doctor father" Michael Kohlhase has helped me in all aspects of doing a PhD research starting from elabo- rating my vision and ending with correcting my English writing. Peter Baumann and Dieter Hutter have considerably helped me with looking at my research from differ- ent perspectives and questioning research contributions that provoked more thoughts and deeper understanding of the subject. All my colleagues from the KWARC group made a considerable contribution to the development of the TNTBase system. Andrea Kohlhase, Christine Muller¨ , Normen Muller¨ , Christoph Lange, Florian Rabe, Catalin David, Deyan Ginev, Dimitar Misev, Fulya Horozal, Constantin Jucovschi, Bogdan Mat- ican shared their feedback with me and came up with great ideas for TNTBase case studies that have significantly shaped up the system and allowed me to understand what is still lacking in the XML domain. I want to thank other people with whom I had a chance to collaborate: Nikita Zhiltsov, Sebastian Graf, Alexander Holupirek, Christian Grun¨ . Immanuel Normann, a former KWARC group member and John Batemann from University of Bremen provided an opportunity to test TNTBase in the completely new environment: in the ontology field. Mihai Cirlanaru provided some TNTBase logos and pictures, whom I would like to thank as well. Erik Hennum from MarkLogic Inc. initiated the development of the XSLT/XQuery tag libraries technology (Chapter 5) by contacting me after my talk on virtual documents at the Balisage 2010 conference. The work on tag libraries has been done in collaboration with him and used in this thesis with his permission. Additionally, colleagues from DFKI, Serge Autexier, Dominik Dietrich and Dieter Hutter allowed me to extend the number of TNTBase case studies by experimenting with management of change for OMDoc documents. Considerable amount of my work involved dealing with Berkeley DB XML, and the following people from Oracle helped me a lot by sharing the expertise in the XML and XML-databases field: George Feinberg, John Snelson and Lauren Foutz. Additionally, I want to thank Andrea Kohlhase, Michael Kohlhase, Raul Cajias, Patrice Lopez, Faye Stringer, Christoph Lange, Otilia Dreglea, Niall Kelly for proof-reading my thesis. Nikolay Melnikov, Christoph Lange and Catalin David substantially sup- ported me in maintaining servers used for hosting TNTBase installations. substan- tially I received financial suport from Jacobs University Bremen, DFKI Bremen and the KWARC group, in particular, whom I would like to thank too. Apart from the \scientific” list above I would like to express my gratitude to people who gave me inspiration and made the PhD path more pleasant: Otilia Dreglea, Nikolay Melnikov, Michael Ludwig, Lech Rzedzicki and the Bremen ice-hockey team \Bremer Brigade". i ii Contents I Introduction 1 1 Introduction 3 1.1 Visions with XML: A Real World Use Case . .4 1.2 Open Challenges in XML . .7 1.3 Contribution of the Thesis . .9 1.4 Structure of this Thesis . .9 II A Versioned XML Database 11 2 An XML-enabled Repository 13 2.1 Introduction . 13 2.2 Fundamentals . 14 2.2.1 Basic Concepts . 14 2.2.2 VCS Concepts . 16 2.2.3 Operations on a Repository . 23 2.2.4 Relation between FS-trees and Repositories during a Repository Lifetime . 31 2.2.5 Data Store Repository . 34 2.3 XML-enabled Repository (XeR) . 34 2.3.1 XPath, XQuery Data Model . 34 2.3.2 Building an XML-enabled Repository . 35 2.4 Towards An XML-Enabled Repository . 36 2.4.1 Methodology and General Architecture . 36 2.4.2 A Data Store Function . 37 2.4.3 A Key-Mapping Function for Embedding X into V ........ 38 2.5 The XeR Implementation { xSVN . 39 2.5.1 What is an XML File for xSVN? . 40 2.5.2 Integration . 40 2.6 Conclusion . 43 3 Towards a Versioned XML Database 45 3.1 Introduction . 45 3.1.1 Goals . 46 iii CONTENTS 3.2 Versioned XML Database Fundamentals . 47 3.2.1 XML-enabled Repository with Metadata . 50 3.2.2 FS-Aware Versioned XML Database . 51 3.2.3 A File System for X ......................... 54 3.2.4 Summary . 58 3.3 A Versioned XML-Database Toolkit . 58 3.3.1 An XQuery Library . 59 3.3.2 xAccessor and VcsAccessor . 59 3.3.3 A VXD Interface . 62 3.4 Open Problems and Directions to Solve Them . 62 3.5 Related Work . 63 3.5.1 A Time Machine for XML . 63 3.5.2 xDB Versioning . 63 3.5.3 DeepFS . 63 3.6 Implementation { TNTBase ........................ 64 3.6.1 Consistency of TNTBase FS.................... 65 3.7 Conclusion . 66 III XML Workflows 67 4 Scripting XML Documents 71 4.1 Introduction . 71 4.1.1 Database Views . 72 4.2 Virtual Document Fundamentals . 74 4.2.1 Defining Virtual Documents . 75 4.2.2 Virtual Documents by Example . 77 4.2.3 VDoc Specifications and Operational Semantics . 79 4.2.4 VDocs as XML Database FS Entities . 83 4.2.5 VDocs as XML Database Views . 85 4.2.6 Editing Virtual Documents . 86 4.3 Integrating with an XML Database . 93 4.3.1 Embedding Node Origins . 94 4.3.2 Extending the VXD Toolkit . 94 4.3.3 Versioned Snapshots and Materializing VDocs . 95 4.3.4 VDocs and Versioning . 96 4.4 Related Work . 97 4.4.1 Drupal . 97 4.4.2 Materialized XML Views . 97 4.5 Implementation in TNTBase ........................ 98 4.6 VDoc Use Cases . 99 4.6.1 Automated Exam Generation . 99 4.6.2 Multiple Versions of Documents . 100 4.6.3 Managing Document Collections . 101 iv CONTENTS 4.6.4 Refactoring Ontologies . 102 4.6.5 Dependency Graph of Documents .