Efficient Change Management of XML Documents

Dissertation submitted in fulfillment of the requirements for the academic degree of Doctor of Natural Sciences (Dr. rer. nat.)

Submitted by Dipl.-Inf. Sebastian Rönnau in August 2010

Date of the oral examination: 9 December 2010
Chair of the committee: Prof. Dr. Peter Hertling
1st reviewer: Prof. Dr. Uwe M. Borghoff
2nd reviewer: Prof. Dr. Michael Koch
1st examiner: Prof. Dr. Ulrike Lechner
2nd examiner: Prof. Klaus Buchenrieder, Ph.D.

Universität der Bundeswehr München Fakultät für Informatik

Abstract

XML-based documents play a major role in modern information architectures and their corresponding work-flows. In this context, the ability to identify and represent differences between two versions of a document is essential. A second important aspect is the merging of document versions, which becomes crucial in parallel editing processes.

Many different approaches exist that meet these challenges. Most rely on operational transformation or document annotation. In both approaches, the operations leading to changes are tracked, which requires corresponding editing applications. In the context of software development, however, a state-based approach is common. Here, document versions are compared and merged using external tools, called diff and patch. This allows users to edit documents freely without being tied to special tools. Approaches exist that are able to compare XML documents. A corresponding merge capability is still not available.

In this thesis, I present a comprehensive framework that allows for comparing and merging XML documents using a state-based approach. Its design is based on an analysis of XML documents and their modification patterns. The heart of the framework is a context-oriented delta model. I present a diff algorithm that appears to be highly efficient in terms of speed and delta quality. The patch algorithm is able to merge document versions efficiently and reliably. The efficiency and the reliability of my approach are verified using a competitive test scenario.

Acknowledgements

Many people have supported me in my work. First of all, I have to thank my supervisor, Uwe M. Borghoff, for giving me the opportunity to do my research. He allowed me to follow my own research interests, within broad but precise constraints. He has always trusted in me and my work and encouraged me to take on the responsibility for the organization of the ACM DocEng 2009. I also have to thank my second advisor, Michael Koch. He provided me with valuable comments, helping me to sharpen this thesis.

Almost all of the time, I really did enjoy being a member of the Institute of Software Technology. Apart from the excellent research conditions, the staff played a major role. Florian Brieler, Nico Krebs, Sonja Maier, Steffen Mazanek, Mark Minas, Peter Rödig, Arne Seifert, Thomas Triebsees, and Daniel Volk helped me elaborate my ideas in countless discussions. Susanna Defendi, Harald Hagel, and Michael Minkus allowed me to discuss topics not related to the domain of computer science. It was a pleasure and an honor for me to work with my colleagues. Some of them even became friends.

Other people have contributed to my thesis. First of all, Jan Scheffczyk helped me write my first paper, opening the way to an academic career for me. Many students supported my research. Christian Pauli wrote the first prototype of XCC Patch. Geraint Philipp, a highly skillful programmer, completely re-designed XCC Patch and programmed XCC Diff. Maik Teupel designed a user interface and wrote conversion tools. Arthur Müller developed further applications of the XCC framework on the file system level. Manfred Pohlemann supported my work by developing a test framework.

Finally, I have to thank my family. My wife Silvia and my children Luca and Lilith have been very patient with me, accepting that I often came home late or lost in thought. My youngest girl, Lene, was born just a few hours after the submission of my final draft. Thanks for her patience, too.

Contents

Abstract
Contents
List of Figures
List of Tables
List of Acronyms

1 Introduction
  1.1 Motivation
  1.2 Document Change Control
  1.3 Goal and Approach
  1.4 Outline

I Concepts and Notations

2 XML Documents
  2.1 What is a Document?
  2.2 XML in General
  2.3 Document Formats
  2.4 Tree Properties
  2.5 Modification Patterns
  2.6 Conclusions

3 Differencing Strings and Trees
  3.1 String Editing
  3.2 Tree Editing
  3.3 Differencing XML Documents
  3.4 Conclusions

4 Document Merging
  4.1 Parallel Editing of Documents
  4.2 State-Based Change Control
  4.3 Operational Transformation
  4.4 Annotation of Documents
  4.5 Other Approaches
  4.6 Conclusions

II A Context-Sensitive Approach to XML Change Control

5 The XCC Framework
  5.1 Basic Idea
  5.2 Architecture

6 A Context-Oriented Delta Model
  6.1 Definitions
  6.2 Edit Operations
  6.3 Dependencies
  6.4 Set-Based Delta vs. Edit Script
  6.5 Representing Context in XML
  6.6 Inversion, Aggregation, and Decomposition
  6.7 Conclusions

7 An Efficient Differencing Algorithm
  7.1 Basic Idea and Outline
  7.2 Finding the Common Content
  7.3 Identifying Structure-Preserving Changes
  7.4 Catching Structure-Affecting Changes
  7.5 Complexity Analysis
  7.6 Conclusions

8 A Merge-Capable Patch Algorithm
  8.1 Basic Idea
  8.2 Linear Patching
  8.3 Context-Aware Patching
  8.4 Conflicts
  8.5 Best-Effort Merging
  8.6 The Algorithm at a Glance
  8.7 Complexity Analysis
  8.8 Conclusions

9 Evaluation
  9.1 Experimental Setup
  9.2 Test Scenarios
  9.3 Runtime
  9.4 Merge Quality
  9.5 Conclusions

III Conclusions

10 A Comparative Evaluation of XML Differencing Tools
  10.1 Experimental Setup
  10.2 Test Scenarios
  10.3 Differencing
  10.4 Delta Analysis
  10.5 Conclusions

11 Conclusions and Future Work
  11.1 Summary
  11.2 Scientific Contribution
  11.3 Applications
  11.4 Future Work

Bibliography

A Resources for Document Evaluation
  A.1 Selection Criteria
  A.2 Web Documents
  A.3 Office Documents

B XML Document Hashing
  B.1 Normalization
  B.2 XML Hashing
  B.3 Related Work

C XCC Delta Specification

D A Running Example
  D.1 Setting
  D.2 Differencing
  D.3 Merging
  D.4 Conclusions

List of Figures

1.1 Loosely coupled ad-hoc collaboration
1.2 Non-conflicting merge, showing the impact of altered paths

2.1 Normal XML representation vs. pretty printing
2.2 ODF file structure
2.3 Node distribution across the XML trees
2.4 Node repetition ratio across the XML trees
2.5 Simple markup change
2.6 Change impact on the XML representation
2.7 Example of Web page evolution

3.1 Tree editing example
3.2 Example of different tree traversal algorithms
3.3 Representation of a string as tree
3.4 Delta generated by Microsoft XML Diff
3.5 Delta generated by diffxml
3.6 Delta generated by jXyDiff
3.7 Delta generated by faxma

4.1 Branching of documents
4.2 Example for a wrong merge
4.3 Architecture of Diff3
4.4 Example of context-based merging
4.5 Change tracking vs. state-based differencing

5.1 Architecture of the XCC framework
5.2 XCC GUI

6.1 Interfering changes
6.2 Context fingerprint

7.1 XML trees during the first algorithm step
7.2 Detecting re-allocated leaves
7.3 Detecting an insert operation
7.4 Detecting a leaf update

8.1 A context fingerprint for delete operations
8.2 Neighborhood generation
8.3 Match weighting function

9.1 XCC Diff runtime
9.2 XCC Diff runtime with leaf update detection
9.3 XCC Patch runtime
9.4 XCC Patch runtime with merging
9.5 Merge quality

10.1 Runtime of the diff tools
10.2 Memory consumption of the diff tools
10.3 Number of operations in the deltas
10.4 Size of the deltas

11.1 A GUI for interactive merging

C.1 An example delta

D.1 Version A of the running example
D.2 Versions A1 and A2 of the running example
D.3 XML tree of A1 with edit operations
D.4 XML tree of A1 after the first algorithm step
D.5 Delta δA→A1
D.6 Delta δA→A2
D.7 Merging the conflicting change
D.8 Merging the insert operation
D.9 Merged XML tree
D.10 Merged document

List of Tables

2.1 Size of the analyzed documents
2.2 Depth of the XML trees
2.3 Node distribution at the bottom of the XML trees

3.1 Example LCS matrix

6.1 Delta inversion rules

8.1 Partial matching values

10.1 List of compared diff tools
10.2 On-line resources of the compared tools
10.3 Results from the basic test run
10.4 Distribution of operation types across the deltas

A.1 List of the analyzed Web pages
A.2 List of the analyzed text documents
A.3 List of the analyzed spreadsheets

B.1 Node prefixes

D.1 List of matching leaves
D.2 List of possibly deleted leaves
D.3 List of possibly inserted leaves

List of Acronyms

API Application Programming Interface

CMS Content Management System

CPU Central Processing Unit

CSCW Computer-Supported Cooperative Work

CVS Concurrent Versions System

DOM Document Object Model

DTD Document Type Definition

DUL Delta Update Language

EBNF Extended Backus-Naur Form

ECMA European Computer Manufacturers Association

FMES FastMatch EditScript

GNU GNU’s not Unix!

GUI Graphical User Interface

HTML Hypertext Markup Language

IEC International Electrotechnical Commission

ISO International Organization for Standardization

JAR Java Archive

JDK Java Development Kit

JRE Java Runtime Environment

LCS Longest Common Subsequence

MathML Mathematical Markup Language


OASIS Organization for the Advancement of Structured Information Standards

ODF OpenDocument Format

OOXML Office Open XML

PI Processing Instruction

RCS Revision Control System

SGML Standard Generalized Markup Language

SVG Scalable Vector Graphics

SVN Subversion

USB Universal Serial Bus

W3C World Wide Web Consortium

WYSIWYG What you see is what you get

XCC XML Change Control

XHTML Extensible Hypertext Markup Language

XOM XML Object Model

XSLT XML Stylesheet Language Transformations

XML Extensible Markup Language

1 Introduction

This thesis deals with the efficient and reliable change control of XML documents, which form the backbone of today's business processes. First, I explain the motivation of my work, which shall support every-day ad-hoc collaboration without tying the user to a strict work-flow or editing tool. Afterwards, I discuss the term document change control. Finally, I briefly sketch my approach, followed by the outline of this thesis.

1.1 Motivation

Nowadays, electronic documents are a major information exchange carrier. The creation and editing of texts, spreadsheets, and presentations is an every-day task. Usually, these documents are edited incrementally. A first draft is written, revisions are made, and so on. Especially in the context of office work, documents are often edited collaboratively. Large documents consist of different parts, contributed by different people from different organizational entities. Even small documents are usually proof-read internally for quality control.

In the past, these processes have mostly been serial ones. The document was either printed and therefore tied to paper, or bound to its physical representation on a data carrier like a USB pen-drive. Since ubiquitous network access has become the standard, the collaboration process has become more and more a parallel one. The document is either stored on a widely available network space or distributed via e-mail.

Parallel editing of a document leads to the important question of document integrity. In case two users modify a document in parallel, the concurrent (write) access may lead to severe problems. Either the last write access tampers with the document format, leading to an unreadable document, or the changes performed by the first access are discarded, overwritten by the second version. To prevent this unacceptable drawback, several approaches exist. First, locking can be used to prevent parallel write access. If a user opens the document for writing, no other user is granted write access until the first user has written his version [Borghoff and Schlichter 2000]. Another approach is copy-modify-merge [Tichy 1982]. Here, each user works on a local copy and commits his changes to a centralized repository. The consolidation of the modified versions is called merge. Alternatively, the document could be modified using a collaborative editor [Koch 1997]. Here, all users work on virtually the same workplace. Each change is displayed to all users as soon as possible.

All these approaches are common methods in the domain of computer-supported cooperative work (CSCW). However, most of the software used in offices does not support these methods. Only file-based locking is provided. Especially for large documents, this approach is not very suitable, as it makes parallel editing of documents impossible.

Here, more fine-grained locking mechanisms would be helpful [Borghoff and Teege 1993b]. Nevertheless, locking is only a solution if all participants have access to the same repository. In case documents are exchanged via e-mail, however, locking does not provide any advantage. The changes have to be merged anyway. Collaborative editors have a similar drawback by requiring a permanent connection of the participants. Although editors exist that allow for asynchronous collaboration, all participants have to use the same editor.

On the other hand, different applications may be used to display and/or edit a document, due to the limited capabilities of the displaying device (e.g. a netbook or Blackberry), or due to limited user experience. As an example of the latter case, complex calculations are performed within a spreadsheet application by an expert user. Other users may only see the final table displayed within a text. Different applications are needed to handle these highly complex and user-specific issues. Copy-modify-merge is a well-established approach in the domain of software source code. However, corresponding merging applications for office documents are not mature yet.

Tool support for the described collaboration model is still in its infancy. Therefore, documents or parts of them are usually exchanged completely. Document versions are merged manually, which is both time-consuming and error-prone. In the context of office documents, the "change tracking" feature of the corresponding applications has become a de-facto standard for displaying the changes made. Here, each change is tracked on a character level and in-lined within the document. However, these changes are hard to understand [Neuwirth et al. 1992]. Especially if different versions ought to be merged, the user interface is confusing. Additionally, the tracked changes contain a lot of meta-data that offers a deep insight into the editing process. In many cases, this should be hidden from the other participants. Finally, change tracking may even miss some changes. In some cases, format changes are not tracked and cannot be merged for that reason [Rönnau and Borghoff 2009].

One of the foundations of the success of the Internet was the availability of the Hypertext Markup Language (HTML). This language is used to describe different kinds of documents, trying to separate the content from the markup information. HTML, however, is closely tied to text documents; content and markup separation is not straightforward. Therefore, XML has been designed as a meta-language to allow for a broader class of documents. Today, XML has emerged as lingua franca in many domains. A huge variety of XML dialects is used to encode documents. Agreed-upon XML-based document standards exist for texts, spreadsheets, presentations, and many more kinds of documents. Therefore, I focus on the issues of XML documents, without restricting myself to a specific dialect or document format.

In this thesis, I address the issue of collaborative editing of XML documents through arbitrary exchange of document versions. The motivation is to enable users to compare, reconstruct, and merge document versions efficiently and reliably.


Figure 1.1: In a loosely coupled collaborative environment, changes are exchanged arbitrarily, leading to many merges.

1.2 Document Change Control

Figure 1.1 shows an example of a loosely coupled ad-hoc collaboration group working on document A. Alice initiates the editing by distributing the first version to Bob and Carol. She continues working on her draft, while Bob starts to perform his changes. Alice sends her changes to Carol, requesting her comments. After that, Bob sends his changes to Carol. Carol merges both changes into A1,2. After Carol's approval, Alice sends her changes (indicated by δA→A1) to Bob. Now, Bob has to merge Alice's changes into his version, leading to A2,1. If the changes are merged manually, the probability that all participants have the same version in the end is quite low. Document change control tools shall ensure a deterministic merge result. In case of non-conflicting changes, the merged documents must be identical across all copies, independent of the merge order.

As already pointed out, users shall not be tied to a specific application to edit their documents. This raises the question of how users can exchange their changes. In a naive approach, the changes could be described in another document which is attached to the modified document. Although this is a substantial overhead and error-prone, it is a common practice in office collaboration¹. Automatic detection and encoding of the changes is more efficient (in terms of labor time) and effective. Additionally, the description of the changes cannot be forgotten by the authors anymore.

A diff tool compares two document versions (aka states) and stores the changes in a delta. The differencing tool is also called diff; the delta is also referred to as edit script². The application of a delta to a file to create the new version is called patch. If a delta is applied to a different version of the document than the one it has been computed for, a merge is performed. Diff, patch, and merge lay the basis of state-based document change control. In this context, the encoding of the changes within a delta is essential. As the delta is the information carrier between the diff and the patch tool, only the information contained within the delta can be used to patch and merge document versions.

¹ Here, I have to distinguish between the description of the changes themselves and their intention. Obviously, comments on the changes as additional notes may be important and helpful.
² The terms delta and edit script are not fully equivalent. I will discuss the differences in Section 6.4.


Figure 1.2: Absolute paths may lead to unwanted merge results. Performing the delta marked with a dashed line would affect the wrong node, leading to a wrong result.

Merging documents by applying deltas from other versions can be ambiguous. Changes may affect the structure of the document, thus affecting the addressing of the subsequent parts of the document. Figure 1.2 exemplifies this problem using a schedule. In version A1, a time is modified; in version A2, a new entry is inserted. If delta δA→A1 encodes the change as "change column 1, row 3 into 10:15", applying δA→A1 to A2 would address the wrong cell. Therefore, the delta should contain additional information to ensure a reliable merge.

As a side note, the mentioned techniques only cover the technical aspects of document change control. Organizational and psychological aspects play an important role, too. However, they are not part of my research and are not discussed further. Nevertheless, I point out that diff and patch have enjoyed great popularity in the domain of software source control for nearly 35 years [Rochkind 1975; MacKenzie et al. 2002].
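Returning to the schedule example of Figure 1.2, the following Python sketch replays the problem with a deliberately simplified model: the document is just a list of rows, and the row index plays the role of the absolute path. The names and the context-search strategy are illustrative assumptions only, not the delta model developed later in this thesis.

```python
# Version A and the two independently edited versions of the schedule.
A  = ["09:00 Keynote", "09:30 Coffee Break", "10:00 Session 1"]
A2 = ["08:30 Welcome", "09:00 Keynote", "09:30 Coffee Break", "10:00 Session 1"]

# Delta A -> A1: "change the row at index 2 from '10:00 ...' to '10:15 ...'".
delta_A_A1 = {"index": 2, "old": "10:00 Session 1", "new": "10:15 Session 1"}

def patch_absolute(doc, op):
    """Apply the operation blindly at its absolute index."""
    doc = list(doc)
    doc[op["index"]] = op["new"]
    return doc

def patch_with_context(doc, op):
    """Search a small neighborhood around the expected index for the old value."""
    for offset in (0, -1, 1, -2, 2):
        i = op["index"] + offset
        if 0 <= i < len(doc) and doc[i] == op["old"]:
            doc = list(doc)
            doc[i] = op["new"]
            return doc
    raise ValueError("no matching context found: report a conflict")

print(patch_absolute(A2, delta_A_A1))      # wrong: the coffee-break row is overwritten
print(patch_with_context(A2, delta_A_A1))  # correct: the session row at index 3 is updated
```

The blind application silently produces a wrong schedule, whereas the context check either finds the intended row or reports a conflict; this is the intuition behind the context-oriented deltas developed in this thesis.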

1.3 Goal and Approach

In this thesis, I develop a framework for differencing, patching, and merging XML documents. The heart of the framework is a novel delta model that stores the syntactic context of an edit operation. This syntactic context is used as an additional source of information during merging. This way, a reliable merge can be ensured.

The use of the syntactic context for merging is not new. Davison [1990] has proposed this approach for the domain of line-based documents. Today, this context-oriented delta format is still the state-of-the-art in source code control [MacKenzie et al. 2002].

However, the mapping of the one-dimensional line-based view onto two-dimensional trees as used in XML is not self-evident. A first approach has been presented by Mouat [2002], but has not reached maturity³. My approach is the first comprehensive context-oriented delta model. Using the syntactic context, the patch procedure is enabled to reliably merge document versions, including conflict detection.

Complexity is another important aspect of change control. The comparison of XML trees is a difficult task. Evaluations, e.g. by Rönnau et al. [2005], have shown that even for quite simple documents, existing XML diff tools may take more than a minute to compare two documents. Apparently, this is not acceptable in every-day use. Therefore, I also focus on the question of efficient differencing and patching. Efficient XML diff algorithms have been presented before, e.g., by Cobéna [2003] and Lindholm et al. [2006]. However, none of these approaches is able to merge document versions.

In this thesis, I present an efficient differencing algorithm especially designed for XML documents. It is based on a computation of the longest common subsequence of leaf nodes, combined with a bottom-up comparison of the non-leaf nodes. As the content of the documents lies within the leaves, my algorithm focuses on content changes before investigating markup and structure changes. The corresponding patch algorithm is efficient and merge-capable. It uses the syntactic context for reliably identifying the correct path to an edit operation. The efficiency is gained by restricting the search space to the neighborhood of the expected path.
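As a rough illustration of the first step only (not the actual XCC Diff implementation), the following sketch computes a longest common subsequence over the leaf sequences of two document versions; leaves outside the LCS are the candidates for delete, insert, or update operations. Plain strings stand in for the normalized node values.

```python
def lcs_pairs(old_leaves, new_leaves):
    """Dynamic-programming LCS over two leaf sequences (values assumed normalized)."""
    n, m = len(old_leaves), len(new_leaves)
    # L[i][j] holds the LCS length of the suffixes starting at i and j.
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if old_leaves[i] == new_leaves[j]:
                L[i][j] = L[i + 1][j + 1] + 1
            else:
                L[i][j] = max(L[i + 1][j], L[i][j + 1])
    # Backtrack to the matching index pairs.
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if old_leaves[i] == new_leaves[j]:
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif L[i + 1][j] >= L[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs

old = ["Carmen 1,11", "aetas:", "carpe diem", "quam minimum credula postero"]
new = ["Carmen 1,11", "aetas:", "carpe diem!", "quam minimum credula postero"]
matched = lcs_pairs(old, new)
deleted  = [i for i in range(len(old)) if i not in {a for a, _ in matched}]
inserted = [j for j in range(len(new)) if j not in {b for _, b in matched}]
print(matched, deleted, inserted)   # the third leaf changed: a delete/insert (or update) candidate
```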

1.4 Outline

This thesis is divided into three parts. Part I deals with the basics of document change control. Here, I define the problem statement and present the related work in this research domain. In Part II, I present my approach to change control of XML documents. Part III concludes the thesis.

The preliminaries of this thesis are discussed in Chapter 2. Here, I define the term document as well as the XML model used. In the following, I present different XML document formats. Using an empiric analysis, I investigate the properties of common XML documents, as well as the modification patterns which arise during the editing process. In Chapter 3, I present the basics of document differencing, including the common solutions for string and tree differencing. Existing approaches to XML differencing are presented last. Chapter 4 deals with the problem of document merging. The most common approaches are presented, and I discuss their applicability to the domain of XML documents.

In Chapter 5, I sketch the architecture of my change control framework, including basic implementation details. In Chapter 6, I present my XML context-oriented delta model that builds the heart of the framework. An efficient differencing algorithm is presented in Chapter 7.

³ I will discuss this approach in Section 6.5.

I present a corresponding merge-capable patch algorithm in Chapter 8. Both algorithms have been implemented and are evaluated in terms of efficiency and reliability in Chapter 9.

Finally, I compare the proposed differencing algorithm and the resulting deltas in an evaluation against the state-of-the-art approaches in this domain. After this evaluation in Chapter 10, I give a summary of the thesis in Chapter 11. Here, I also present different applications of my framework and give an outlook on future research.

Appendix A lists the documents that have been analyzed in the evaluation of the XML document properties. In Appendix B, I present the XML hashing model used in my framework. Appendix C defines the XML representation of the context-oriented deltas. Finally, Appendix D shows a running example of a whole document editing process using the proposed framework.

Part I

Concepts and Notations


2 XML Documents

The Extensible Markup Language (XML) has emerged as lingua franca in many domains, including the Web and office documents. Its simplicity and clear design, together with powerful and easy-to-use tools, have formed the basis of the success of XML. Many different XML dialects exist. Each of them has its own characteristics and design goals. However, there are many similarities.

This chapter contains the preliminaries in the context of XML documents. It starts with a brief discussion of the term "document". It is followed by an overview of XML in general, introducing some basic notations. Afterwards, two major XML document formats, ODF and XHTML, are presented. I explain the background of these formats first, before performing an empiric analysis of real-world documents. The main goal of this evaluation is to gain knowledge about the properties of XML document trees. Finally, I explore the modification patterns in the course of the evolution of the documents. These findings aim to lay the basis for the design of an efficient differencing algorithm later on.

2.1 What is a Document?

The question of which properties an artifact must have to be considered a "document" has been extensively discussed in the first half of the 20th century, mostly in the context of bibliography. Documents were commonly seen as physical artifacts, created with an intention. For an overview of this discussion, please refer to Buckland [1997]. Beginning with the 1980s, the applicability of the term "document" to digital artifacts was discussed. In the context of my work, I define the term document as follows:

Definition 2.1 A document is the digital representation of knowledge. There must exist an unambiguous way of converting it into a physical representation (e.g. by printing). A document consists of content and markup and must have an inner ordering (i.e., the reading direction). A document may evolve over time, but must have a definite state at any time.

Some implications arise from this definition. First, it is arguable whether a graphic may be seen as a document, as a graphic does not provide a common inner ordering. Some graphics, e.g. diagrams, may have a reading direction, but this is a domain-specific one. Therefore, graphics are not part of my research. A second implication affects the field of active documents. Recently, approaches have been presented that add application logic to documents, e.g. by Boyer et al. [2008]. In these active documents, the current state of the document is not determinable at any time.

Thus, the document property defined earlier is not satisfied. Therefore, active documents are not part of my research either. Even though I do not aim to do research on graphics and active documents, some of my findings may be applicable to these domains. In Section 11.3.4, I will present research directions that could ensure the applicability of my approach to a broader definition of the term "document".

2.2 XML in General

XML is a universal metalanguage for a wide variety of applications. It is an application-independent metalanguage with a clear and precise definition that allows for defining human-legible and reasonably clear documents [Bray et al. 2008]. One of the main goals during the design of XML was to separate the content from the markup of a document. This was mainly motivated by the experience gained with the prevailing binary document formats that could only be interpreted using specific tools. Therefore, Borghoff et al. [2006] recommend using XML to ensure the long-term readability of documents. An XML document is a tree that can be represented either by a text file or a corresponding object model like the Document Object Model (DOM) [Hors et al. 2004]. Usually, the text representation is used for serialization; the object model is mostly used for processing the tree within an application.

In this section, I briefly present my notation regarding XML documents first, before describing the node types of XML. Aspects of parsing the text representation into an object model are discussed afterwards, as well as validity constraints of documents. I present an elementary addressing scheme for XML nodes that is used throughout this thesis. Finally, I differentiate the ordered tree model of XML documents from the unordered model used in dataset representations.

2.2.1 Notations

From a theoretical point of view, an XML tree is an undirected acyclic graph. Each node may have an arbitrary number of children, where each child has only one parent node. The node without a parent is called root; all nodes without children are leaves. All parent nodes on the path up to the root are the ancestors of a node. All child nodes on the possible paths to the leaves are called descendants. The descendants of a node n build a subtree, with n being their root. Nodes with the same parent node are called siblings.

The depth of a node corresponds to the length of the path from the root to the node. The root node has a depth of 0. The height of a node is the length of the longest path from the node to the leaves. A leaf has a height of 0; the height of the root node corresponds to the highest depth. A top-down level of the tree is defined as all nodes with the same depth. Additionally, the bottom-up level comprises all nodes with the same height¹. The top-down level and the bottom-up level can be equal, which is unlikely in complex documents.

¹ In literature, the term level is mostly used according to my definition of the top-down level.
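To make these notions concrete, here is a minimal sketch with a hypothetical Node class (not the XCC data model) that computes the depth of a node, the height of a node, and the bottom-up levels of a small tree.

```python
class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def depth_of(node, root):
    """Length of the path from the root down to the node (the root has depth 0)."""
    def search(current, d):
        if current is node:
            return d
        for child in current.children:
            found = search(child, d + 1)
            if found is not None:
                return found
        return None
    return search(root, 0)

def height_of(node):
    """Length of the longest path from the node down to a leaf (leaves have height 0)."""
    if not node.children:
        return 0
    return 1 + max(height_of(child) for child in node.children)

def bottom_up_levels(root):
    """Group all node labels by their height, i.e. by bottom-up level."""
    levels = {}
    def visit(node):
        levels.setdefault(height_of(node), []).append(node.label)
        for child in node.children:
            visit(child)
    visit(root)
    return levels

text = Node("text")
root = Node("doc", [Node("p", [text]), Node("p")])

print(depth_of(text, root))    # 2
print(height_of(root))         # 2
print(bottom_up_levels(root))  # {2: ['doc'], 1: ['p'], 0: ['text', 'p']}
```

Note that the childless p element ends up on bottom-up level 0 together with the text node; the same effect is observed for real documents in Section 2.4.3.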


Definition 2.2 The basic set of all XML documents shall be denoted as 𝔸. An XML document is denoted as A ∈ 𝔸. Different versions of A are denoted as A1, ..., An, with n ∈ ℕ.

In this context, I define a document version as follows:

Definition 2.3 A document A1 is a version of another document A if and only if it has been transformed from this document or another version of that document using any editing action.

Two versions of one document essentially share common content and markup as far as possible. However, this is not a prerequisite.

2.2.2 Node Types

XML distinguishes different types of nodes. A node can either be an element node, a text node, a processing instruction, or a comment [Bray et al. 2008]. Comments may be omitted during parsing of the document and do not affect the content or markup of the document. Therefore, this node type is mostly neglected. Processing instructions (PI) are intended to pass non-XML-coded commands to the corresponding application. They are a relic from the preceding SGML format that remained for compatibility reasons. Goldfarb [1990] stressed that even for SGML, processing instructions are a bad design and should be used sparingly. They are often neglected, too. I follow this decision for my XML model.

Text nodes (also known as CDATA nodes) store the content of the XML document. They are not parsed but directly passed to the application. In this context, the term "text node" is somewhat misleading. Basically, any kind of data, including binary representations, can be stored within a text node. As I focus on document representations where the content is mostly stored as text, I use the term text node throughout this thesis. Deriving from their definition as non-parsed entities, text nodes are always leaf nodes.

Element nodes contain the markup of the document. Basically, all non-leaf nodes within the XML tree are element nodes. An element node consists of its label and an arbitrary number of attributes. Attributes are name-value assignments that are stored as an unordered list attached to the element. An element node can be encoded in different ways. Due to different character encodings, name-space resolutions, or attribute orderings, the syntactic appearance of one and the same node may differ. To resolve these ambiguities, a node has to be normalized.

Definition 2.4 The node value is the normalized representation of the node label, including a normalized representation of the attribute list.

A normalized node label means a representation where character encodings and name-space prefixes are used in a uniform way. In a normalized list of attributes, all entries are ordered unambiguously. A normalization scheme is presented in Appendix B. It is based on Canonical XML by Boyer [2001] and DOMHash by Maruyama et al. [2000]. Using the node value, I define the equivalence of nodes as follows:

Definition 2.5 For two nodes n1, n2 ∈ A: n1 = n2 is true if and only if value(n1) = value(n2).

That means that two nodes are equal if their normalized node values are equal.
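A minimal sketch of how such a normalized node value and the resulting equivalence test could look, assuming a simplified scheme (label plus attributes sorted by name); the real normalization of Appendix B additionally handles character encodings and name-space prefixes.

```python
def node_value(label, attributes):
    """Simplified normalization: the label plus the attribute list sorted by name."""
    normalized_attributes = ";".join(f"{name}={value}"
                                     for name, value in sorted(attributes.items()))
    return f"{label}|{normalized_attributes}"

def nodes_equal(first, second):
    """Definition 2.5: two nodes are equal iff their normalized values are equal."""
    return node_value(*first) == node_value(*second)

# The same element with its attributes serialized in a different order:
n1 = ("span", {"style": "bold", "lang": "la"})
n2 = ("span", {"lang": "la", "style": "bold"})
print(nodes_equal(n1, n2))   # True
```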


2.2.3 Parsing XML

XML documents are usually exchanged in their text-based representation. Basically, this is a simple text file encoded in Unicode, a text encoding format designed for handling all kinds of characters [Davis and Collins 1990]. By parsing this text representation, the corresponding object model is built. There exist common object models like the Document Object Model (DOM) [Hors et al. 2004], as well as application-specific object models. However, most application-specific object models are only a further abstraction of DOM, like the ODF Toolkit² for office documents. The reason for this is the availability of well-documented and reliable frameworks for DOM handling.

The text representation is a sequence of different nodes, representing the tree structure by nested delimiters, called tags. The text representation is parsed from left to right, corresponding to a preorder traversal of the tree.

Definition 2.6 The term document order denotes the order in which nodes are encountered, one after another, as the document that contains them is parsed [DeRose and Clark 1999].

Definition 2.7 For two nodes n1,n2 ∈ A, n1 < n2 is true, if and only if n1 comes before n2 in document order.

These definitions ensure an explicit ordering on the XML tree, thus allowing for addressing nodes with respect to their position in the tree. Within the text representation, element tags are delimited by angle brackets. Subtrees are delimited by an opening and a closing element tag. Elements without children can be represented as a single tag with a special notation (a slash at the end). Figure 2.1 shows an example document whose text representation is parsed into a tree representation. It contains text and element nodes, the latter with attributes. During parsing, the tree is constructed from the text representation from top to bottom, according to the document order. The first line contains the prologue of the text representation, specifying the version of the XML specification used.

Basically, all nodes are written subsequently within the text representation, without separation of the nodes. White-space like blanks and line breaks is only used within text nodes. Obviously, this representation is not very human-readable. To illustrate the tree structure, the text representation is often structured in a way that resembles the tree representation in Figure 2.1, using line breaks and indentations. This process is called pretty-printing, resulting in an undesired side effect: the white-space used for pretty-printing is parsed as (empty) text node and added to the tree representation. It is basically possible to ignore this white-space, but the decision whether the white-space is significant to the document content is not straightforward [Murata et al. 2005]. Therefore, I expect the text representation not to be pretty-printed.

² http://odftoolkit.openoffice.org/
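The side effect is easy to reproduce with a stock DOM parser, as in the following sketch: the pretty-printed serialization yields additional whitespace-only text nodes that the compact one does not contain.

```python
from xml.dom import minidom

compact = "<doc><p>carpe diem</p></doc>"
pretty = "<doc>\n  <p>carpe diem</p>\n</doc>"

def child_node_names(xml_text):
    root = minidom.parseString(xml_text).documentElement
    return [child.nodeName for child in root.childNodes]

print(child_node_names(compact))  # ['p']
print(child_node_names(pretty))   # ['#text', 'p', '#text'] -- the indentation became text nodes
```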


(Figure 2.1 shows a small example document, a doc element with styled p and span children containing the text nodes "Carmen 1,11", "aetas:", "carpe diem", and "quam minimum credula postero", both as an XML tree and in its serialized text representation.)

Figure 2.1: The XML tree can be constructed by parsing the corresponding text representation. The line breaks in the text representation have been added for a better readability.

2.2.4 Well-Formedness and Validity

An XML document is required to be well-formed. That means that the document fulfills the requirements specified by Bray et al. [2008]. Among others, one important requirement is that the document must conform to the properties of a tree. This includes the rule that overlapping tags must not occur. Although this seems trivial, overlapping tags are commonly used in the domain of Web documents coded in HTML, as many browsers display them despite their non-well-formedness. XML parsers must not process a non-well-formed document.

XML is intended to act as a universal meta-language for any kind of data representation. To ensure the interpretability of an XML document, a grammar can be defined. A document conforming to this grammar is called valid. An application may only interpret valid documents. The validation is performed using tree automata [Neven 2002]. There exists a large number of grammar languages for XML. Their major distinction lies in their expressiveness. In this context, expressiveness is mostly influenced by the capability of defining recursions and context-sensitive production rules. The most common grammar languages are (in order of expressiveness) DTD, XML Schema, and Relax NG.

Sadly, an increased expressiveness results in a higher complexity of validation [Murata et al. 2005]. However, Martens et al. [2006] have shown that complex expressions are rarely used in commonly used grammars.

As the word "extensible" in XML may suggest, an XML document is not restricted to one comprehensive grammar. Different types of (sub-)documents may be used within one document. For example, an office document may consist of a text with drawings included. The drawings can be parsed independently from the text document and vice versa. To enable the parser to distinguish between different sub-documents, name-spaces are used. An XML name-space allows for a non-ambiguous assignment of a node to its respective grammar [Bray et al. 2006].
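Returning to well-formedness: a standard XML parser rejects overlapping tags outright, as the following small sketch shows.

```python
import xml.etree.ElementTree as ET

well_formed = "<p><b><i>carpe diem</i></b></p>"
overlapping = "<p><b><i>carpe diem</b></i></p>"   # <i> and <b> overlap: not well-formed

ET.fromstring(well_formed)          # parses without complaint
try:
    ET.fromstring(overlapping)
except ET.ParseError as error:
    print("rejected:", error)       # mismatched tag
```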

2.2.5 Addressing Nodes

A major issue when editing XML documents is the ability to address single nodes. XPath has been defined as a unified language for node addressing [DeRose and Clark 1999]. XPath allows for arbitrarily walking through the tree. That includes absolute addresses that start from the root, but also relative addresses, starting from any node within the tree. Using relative addresses, single nodes and node sets can be selected, mostly using the parent-children relationship. To address single nodes using an absolute address, I introduce a simplified notation.

Definition 2.8 A path to a node denotes the position of a node from the perspective of the root element and is denoted as follows: the root node is denoted with a slash (/). Its children are indexed from left to right recursively across the levels of the tree. The children index n ∈ N0 starts at 0, with an increment of 1. Each level is separated by a slash.

The path /0/2/2/0 in my notation would refer to /*[1]/*[3]/*[3]/*[1] in XPath notation³. XPath has been designed to address not only single nodes but complete node sets. Comprehensive queries can be defined, which in turn require a high complexity upon evaluation [Gottlob et al. 2003]. Simple queries require a noticeable evaluation overhead, too. Additionally, the database-like view of nodes as elements of a set is not suitable for the document domain. Therefore, the proposed addressing scheme is more suitable for addressing single nodes efficiently.

³ XPath starts the node ordering at 1. According to most conventions in computer science, I start the ordering at 0, especially to avoid confusion with other non-XPath addressing schemes.
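The mapping between the two notations is mechanical, as the following sketch illustrates (a hypothetical helper, not part of the XCC tools).

```python
def simple_path_to_xpath(path):
    """Translate the 0-based slash notation (e.g. /0/2/2/0) into the
    equivalent XPath expression with 1-based positional predicates."""
    if path == "/":
        return "/"
    steps = (int(step) for step in path.strip("/").split("/"))
    return "".join(f"/*[{index + 1}]" for index in steps)

print(simple_path_to_xpath("/0/2/2/0"))   # /*[1]/*[3]/*[3]/*[1]
```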

2.2.6 Node Ordering

From its original intention, XML is an ordered tree, where the ordering of the children is significant. This ordered tree model reflects the properties of documents, where the ordering of the content is significant, too. As I focus on XML documents, I base my research on this model.


On the other hand, XML emerged not only as a document description language, but also as a data exchange format. Especially in the domain of databases, data representation models commonly rely on a set property of the data. In turn, the ordering of elements within the XML tree is neglected. This unordered tree model is adequate for data representation, yet unusable for the encoding of documents. One might argue that hypertexts define their reading direction using links. However, the corresponding linked parts have their own reading direction that must be preserved. Furthermore, the use of an unordered tree model leads into a complexity trap when comparing documents, as I will discuss in Section 3.2.3. The unordered tree model is often used if the XML data is reconciled from a database representation, e.g. by Abiteboul et al. [2006]. However, ordered XML documents may also be stored in (basically unordered) databases, as shown by Tatarinov et al. [2002].

2.3 Document Formats

Different types of documents can be represented by corresponding XML formats. In this section, I describe two major formats, XHTML in the domain of Web documents and ODF in the domain of office documents. I explain their importance in their respective domain as well as their technical properties. Other formats are mentioned at the end of this section.

2.3.1 XHTML

The Extensible Hypertext Markup Language (XHTML) originates from HTML, the Hypertext Markup Language that builds the standard for Web documents, developed by the World Wide Web Consortium (W3C) [Pemberton 2002]. XHTML provides a vocabulary equivalent to HTML, but adopts XML's strict rules on well-formedness. Furthermore, standard tools for XML like parsers, validators, and XPath processors can be used. However, XHTML and HTML are not fully equivalent. XHTML allows for including foreign XML dialects for special purposes like Scalable Vector Graphics (SVG) for graphics representation or the Mathematical Markup Language (MathML) for formulas. For this reason, HTML can be easily converted to XHTML without loss of information, but the opposite way may imply a lossy conversion [Carey 2008]. An XHTML document is basically a single XML file. Embedded pictures are not part of the document and are not considered further for this reason.

2.3.2 ODF

Web documents have been developed in an open process by the W3C. A common standard for all participants in the Web is a major goal of the W3C. In the domain of office documents, however, file formats have been developed by the corresponding software manufacturers and tied to the office application, like Lotus 1-2-3, Microsoft Word, or Word Perfect. These proprietary file formats used a binary encoding, and public documentation was not available.

This mostly helped the larger software manufacturers to defend their market position, as competing applications were not able to edit the widely deployed office documents flawlessly [Weir 2009].

To overcome this tight bonding of application and file format, an XML-based office document format has been developed that should build the basis for any kind of office application [Eisenberg 2004]. Originally developed by the Open Office project, this format has been refined in an open process, resulting in the Open Document Format (ODF) standard that is currently maintained by the Organization for the Advancement of Structured Information Standards (OASIS). ODF version 1.0 has been standardized by the International Organization for Standardization (ISO) as ISO/IEC 26300:2006. Currently, ODF version 1.2 is under active development.

ODF's popularity has increased significantly in recent years. Especially governmental authorities adopt ODF as standard document format. The ODF Alliance [2008] gives an annual overview of the world-wide adoption of ODF. There are several reasons for fostering the use of ODF. On the one hand, agreed-upon standards allow for interoperability and facilitate long-term accessibility of documents [Borghoff et al. 2006]. On the other hand, DeNardis and Tam [2007] have emphasized the importance of open standards for enabling free and democratic markets. In this context, "open" means free access, as well as independence from patent claims.

An ODF document consists of different XML files stored within a JAR archive [Brauer et al. 2007]. A JAR archive is basically a ZIP archive requiring a defined directory structure which ensures that certain meta-data is contained within the archive. Using the archive has two reasons: First, the archive compresses the contained data to save storage space. Second, resources not native to the document format, like bitmap pictures, can be embedded within the archive and are thus tied to the document [Eisenberg 2004]. However, in this thesis, only the XML files are considered. The XML files contain information regarding the content and the style-sheets used, as well as meta-data. The most important and by far largest file is the content.xml, containing all content and most of the markup. Usually, an ODF document can still be interpreted if the style-sheets and the meta-data are not present, as shown by Rönnau [2004]. Therefore, I consider the content.xml to be the core document.

Office suites cover a wide range of applications. They include a word processor, a spreadsheet application, a presentation engine, and many more. One important aspect of office documents is that different document types can be nested within each other. For instance, a text may contain a table that has been edited in the context of the spreadsheet application. A presentation may contain text elements and drawings. In ODF, this nesting of documents is represented by storing the sub-documents in separate sub-directories in the archive that in turn contain their own content.xml. Figure 2.2 shows the structure of an example ODF document. It contains a sub-document, as well as pictures. For a detailed description of the different files within an ODF document, please refer to Eisenberg [2004] and Brauer et al. [2007].

Office applications can not only create texts and spreadsheets. Slides or drawings may be created as well. Even simple databases are usually part of office applications.


example.ods (ODF JAR archive)
  mimetype
  content.xml (the core of the ODF document; this file is indispensable)
  styles.xml
  settings.xml
  meta.xml
  META-INF/manifest.xml
  Configurations2/ (application-specific configurations, like the visibility of toolbars)
  Object_1/content.xml, styles.xml, settings.xml (each sub-document has its own content and style definitions)
  ObjectReplacements/Object_1 (a replacement providing a placeholder for a faster layout)
  Pictures/ (bitmaps stored using unique identifiers)
  Thumbnails/thumbnail.png (a thumbnail of the title page for document preview)

Figure 2.2: An example ODF document structure. It contains a sub-document, as well as pictures.

To the best of my knowledge, all databases of office applications rely on the relational model by Codd [1970]. They are implemented using the transaction model by Gray [1978]. The key concepts of data sets and transaction systems prevent these databases from being seen as XML documents. For ODF, the database engine and the storage layer are not part of the format specification.

Drawings do not provide an inner ordering like texts or spreadsheets. For two non-overlapping lines, the order of their appearance within the XML representation is irrelevant. Therefore, two documents with identical appearance might differ in terms of their representation. For this reason, a drawing does not meet the definition of a document presented earlier. The same holds for presentations. However, a presentation might contain texts and tables as sub-documents, which can in turn be treated as XML documents.


2.3.3 Competing Formats

There is no major opponent to XHTML in the domain of Web documents. In 2009, the W3C announced the discontinuation of the XHTML development⁴. However, it was clarified that this only affects the development of an independent format⁵. The upcoming HTML5 standard contains a specification for encoding Web documents in XML [Hickson and Hyatt 2010]. Therefore, all findings presented in this thesis are applicable to HTML5, too.

In the domain of office documents, Microsoft made great efforts to keep its dominant position. Under the pressure of governmental requirements, the binary office document format was disclosed in 2008. In parallel, Microsoft switched to an XML encoding of their format, too, called Office Open XML (often abbreviated as OpenXML or OOXML). OpenXML has been standardized by the European Computer Manufacturers Association (ECMA) [Paoli et al. 2006]. An ISO standardization was achieved as ISO/IEC 29500:2008, even though the standardization process was accompanied by some irregularities⁶. The sheer size of the specification, comprising over 7,000 pages, and various references to undisclosed binary data representations are the main reasons for objecting to OpenXML. These are also the reasons why I decided not to consider OpenXML documents separately. However, the basic structure of OpenXML documents strongly resembles the ODF structure, including the usage of a ZIP archive for document storage. Additionally, tools exist that are able to convert these two document formats into each other. Therefore, I assume my findings on ODF to be applicable to OpenXML as well.

2.4 Tree Properties

The specifications do not show what the XML tree of a "typical" document will look like. However, having a deep knowledge of the properties of the tree may help to design an appropriate algorithm for comparing XML documents. To obtain real-life results, I analyze office documents from public repositories and Web pages. The focus of this analysis lies on two aspects. First, I investigate the appearance of the tree, i.e., how the nodes are distributed within the tree. Second, I evaluate the similarity of the nodes.

⁴ http://www.w3.org/News/2009#entry-6601
⁵ http://www.w3.org/2009/06/xhtml-faq.html
⁶ For example, in the Norwegian sub-committee, the overall vote on OpenXML was turned into "accept", despite 80% of the committee members voting "reject" (http://blogs.freecode.no/isene/2008/03/31/norwegian-committee-chairman-to-iso-count-the-vote-as-no). Members of the German sub-committee could only decide between "accept" and "abstain" (http://www.heise.de/newsticker/meldung/DIN-sagt-Ja-zur-ISO-Standardisierung-von-OOXML-193470.).


                         Spreadsheet      Text    Web Page
Size in KByte   min             < 1         2           8
                max          16,444       773         233
                average         297        48          89
Nodes           min              10         7         295
                max         238,868    18,826       4,685
                average       4,147       739       1,905

Table 2.1: 177 documents are analyzed, covering a wide range of applications and sizes.

2.4.1 Test Data

For the evaluation, three kinds of document types are taken into account: spreadsheets, text documents, and Web pages. The documents cover a wide variety of application scenarios. For the spreadsheets, 33 documents are evaluated, from a simple ratings overview up to a complex financial model. Concerning the text documents, 16 different documents, ranging from a one-page Curriculum Vitae up to a 125-page software documentation, are part of the evaluation. Especially the more complex office documents contain several sub-documents that are inspected independently. By this, not only 49, but 160 documents are evaluated. To evaluate the properties of Web pages, 18 documents are analyzed. They cover major auction sites, news sites, governmental sites, and blogs. In total, 178 documents undergo the analysis, emerging from 67 base documents. Table 2.1 gives an overview of the size of the documents. The resources of the documents, including a short description of them, are listed in Appendix A.

Apparently, far more office documents than Web documents are analyzed. This decision is motivated by the following considerations: Web pages are somewhat limited in terms of the amount of content. This limitation does not derive from technical but from usability constraints. In the domain of Web page design, it is common knowledge that pages should fit onto the screen of the user [Krug 2000]. Nielsen and Pernice [2009] have reinforced this statement using eye-tracking methods. They have shown that the part of a Web page that does not fit onto the screen without scrolling is seldom read by users. Additionally, a comprehensive analysis of Web pages by Fetterly et al. [2003] has revealed that most of the pages have almost the same size.

2.4.2 Node Distribution

The appearance of the tree is mostly influenced by the following dimensions: the breadth and the depth of the tree. For document formats, the reading direction of the document is expressed by the breadth of the tree. The depth is mostly used for representing the complexity of the markup, which leads to flat yet wide trees. Admittedly, these are theoretical assumptions deriving from the specifications of the document formats. The specifications neither give any hint on the minimum or maximum depth of documents nor on their actual appearance.


              Spreadsheet      Text    Web Page
Depth   min             6         4          10
        max            13        15          23
        average         9         8          17

Table 2.2: XHTML trees are generally higher than ODF trees, as ODF uses sub-documents to model complex document structures.

                               Spreadsheet      Text    Web Page
Node           level 0 (leaves)        55%       56%         56%
proportions    level 1                 23%       26%         22%
               total                   78%       82%         78%

Table 2.3: XML document trees have by far most of their nodes in the lowest levels.

The analysis reveals that even complex documents are rather flat. Table 2.2 shows the depth of the documents, ordered by their type. Even for a 16 MByte spreadsheet document, the maximum tree depth does not exceed 13 levels. In this case, however, I recall the ODF approach of modeling complex documents using sub-documents. All of the 238,868 nodes are in fact stored within the 13 levels, but some complex structures like figures with captions are independent sub-documents again. In turn, these sub-documents usually have very small trees. For these reasons, XHTML trees are generally higher than ODF trees.

Apparently, the trees are not very high, taking into account the number of nodes. As already mentioned at the beginning of this chapter, the content of the document is basically stored within the leaves, whereas the non-leaf nodes represent the markup of the document. Following this proposition, one could expect most of the nodes to be leaves. The document analysis supports this proposition partially. Table 2.3 shows the proportion of the nodes from a bottom-up perspective. On average, over 50% of the nodes are leaf nodes throughout all document types. Interestingly, the second-lowest level in the tree contains nearly one quarter of all nodes. Altogether, the two deepest levels cover around 80% of all nodes.

As the tree depth varies, comparing more than the deepest levels is not appropriate. To establish comparability despite different tree depths, I mapped all levels onto a relative scale, with 0% meaning the leaves, and 100% the root. Figure 2.3 illustrates the node proportions using this relative scale in a comparative way. Even though they represent different kinds of documents, the ODF dialects for texts and spreadsheets show a similar node distribution. Web documents narrow even more strongly towards the root.

The analysis reveals that the trees are flat and wide. However, the decrease of the node proportion against the depth is not as high as one could expect, given that the content is mostly stored in the leaf nodes. Nearly one quarter of all nodes lies in the second-lowest level, which means that markup information is used extensively.


[Plot: node proportion (y-axis) over the bottom-up level in the XML tree, from leaves to root (x-axis), for spreadsheets, text documents, and Web pages.]

Figure 2.3: XML document trees are flat and wide. Especially Web documents narrow strongly towards the root.

2.4.3 Repeating Nodes

Assuming that non-leaf nodes contain the markup, one could expect non-leaf nodes to be mostly identical. This results from the consideration that different markups should be used sparingly in a good document layout [Krug 2000]. To turn the argument on its head, leaves should contain only few identical nodes, as a text should contain only few repetitions [Strunk Jr. and White 1979]. To investigate this assumption, I introduce the repetition ratio of nodes, where values denotes the number of distinct values among all nodes on one bottom-up level:

repetition_ratio(nodes, values) := 1 − values / nodes          (2.1)

The repetition ratio denotes the likelihood for a given node i that a node j with i ≠ j and height(i) = height(j) has the same node value. For many different node values, the repetition ratio tends to 0. If the node values are mostly equal, the repetition ratio approaches 1. In the special case that there is only one node per level, the repetition ratio is 0, which is a likely case near the root node.
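As an illustration, Equation 2.1 can be evaluated per bottom-up level with a few lines of Python. This is a sketch under my own assumptions (the node "value" here is the tag together with its attributes, and the level is the height above the deepest leaf); it is not the tooling used for the analysis.

from collections import defaultdict

def repetition_ratios(root):
    """Evaluate Equation 2.1 for every bottom-up level of an XML tree:
    1 - (number of distinct node values) / (number of nodes) per level."""
    values_per_level = defaultdict(list)

    def collect(node):
        children = list(node)
        level = 0 if not children else 1 + max(collect(c) for c in children)
        # The node "value" used here is the tag together with its attributes;
        # text content could be included analogously.
        value = (node.tag, tuple(sorted(node.attrib.items())))
        values_per_level[level].append(value)
        return level

    collect(root)
    return {level: 1 - len(set(vals)) / len(vals)
            for level, vals in values_per_level.items()}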


[Plot: repetition ratio (y-axis) over the bottom-up level in the XML tree, from leaves to root (x-axis), for text documents, spreadsheets, and Web pages.]

Figure 2.4: Identical markup identifiers lead to a high repetition ratio especially at the lower levels of the tree. The repetition ratio becomes very low towards the root due to the narrowness of the tree.

It is interesting that the analysis does not confirm the initial assumption. Figure 2.4 shows the repetition ratio on a relative scale. Among the leaf nodes, a repetition ratio of 36% for texts and spreadsheets, and 58% for Web pages, is reached. A closer inspection reveals that many leaf nodes are in fact markup elements. They act as delimiters for formatting options, where the text nodes to format are not child nodes but siblings. Generally, the repetition ratio decreases towards the root node, which is mostly influenced by the narrowness of the trees near the root. The analysis shows a wavelike decrease of the repetition ratio towards the root for all document types. Again, both ODF dialects for texts and spreadsheets show a similar behavior. Repeating nodes are frequent within the tree and at the leaves as well.

2.5 Modification Patterns

XML documents are usually not edited on the tree level. Applications provide user-friendly graphical interfaces to the underlying document tree. For Web sites, content management systems (CMS) assemble a Web site from a given repository. In this section, I analyze the effects of application-level changes on the underlying XML tree.


Figure 2.5: Two text snippets are highlighted using a different markup. These changes are not detected by the “compare documents” function of Open Office.

2.5.1 Changes to Office Documents

Office documents are presumably edited using the corresponding office application. The office application tries to ensure a WYSIWYG (what you see is what you get) representation of the document. In this context, the document is displayed as it would be printed, not as it is stored in its XML representation. The navigation within the application is also tied to the optical representation. This WYSIWYG approach appears to be very user-friendly, especially for the less experienced user. It has been the de-facto standard for office applications since the 1990s [Myers 1998]. Following Meyrowitz and van Dam [1982], I call the screen representation and the interaction capabilities the user model of the document. On the other hand, I call the XML representation the document model of the document. The user model and the document model may diverge. For example, a user can switch to the next line of a text displayed on his screen using the arrow-down button, even if the corresponding paragraph contains no new-lines in the document model. The divergence of the user model and the document model can have a more serious impact, indeed. Figure 2.5 shows two versions of a simple text document with two paragraphs in the user model. Some parts of the text are highlighted using italics and bold font face.

If both versions of the document are compared using the built-in “compare documents” function of Open Office, the documents are reported to be equivalent, as the text did not change and the structure remained the same. Figure 2.6 shows the effects of these simple changes on the corresponding XML tree, i.e. the document model. Here, three changes emerge. First, the previously unknown font styles for italics and bold font face have to be defined using two adjacent subtrees. The style change of the whole paragraph is represented by an update of the attribute value. The highlighting of two single words within a paragraph, however, leads to a more complex change. The text node containing the paragraph is split up into three parts – the part before the style change, the highlighted words, and the remaining part of the paragraph. The previous example shows that slight changes within the user model of a document may have a rather large impact on the XML representation. Boundaries within the two models do not correlate. An implicit page break is a “natural” boundary of a text that is basically not represented in the document model⁷. Vice versa, boundaries set by the tree representation may not be visible within the user model. For spreadsheets, the user model and the document model are slightly closer. Each cell in the user model is basically represented by a node within the document model. Most edit operations act cell-oriented, which does not have the same structure-changing impact as in text documents. Interestingly, all cells within the matrix spanned by the first and the outermost filled cell are stored in the document model, no matter whether they are filled or not. Resulting from the example above and similar tests, some assumptions can be made concerning the modification pattern of office documents:

Assumption 2.1 A leaf node may be replaced by a subtree.

Assumption 2.2 Adjacent subtrees may be inserted or deleted.

Assumption 2.3 Structure-preserving changes (attribute changes) are a frequent operation.

The second-last assumption mostly affects text documents, but holds for spreadsheets as well. These assumptions are a refinement of my earlier work [Rönnau 2004]. I am not aware of any other investigation of the modification patterns in the XML representation of office documents.

2.5.2 Evolution of Web Pages

The evolution of Web pages has been studied before, e.g. by Fetterly et al. [2003]. However, the main interest of these studies was to obtain statistical data on the update frequency and update quantity for optimizing search engines [Ntoulas et al. 2004]. The quality of the changes has been mostly neglected.

⁷ To be precise, the office application may add a tag to the document, denoting an implicit page break at that point. It is mostly used for faster layout. However, the application does not have to respect this implicit page break.

[Tree diagrams: the ODF document trees before and after the change, with the affected nodes marked as insert, update, and replace.]

Figure 2.6: The simple markup changes shown in Figure 2.5 lead to three changes in the underlying XML tree. Two of these changes affect the tree structure significantly.

Recently, Adar et al. [2009] have performed a qualitative analysis of Web page evolution. As a result, the structure of Web pages appears to be seldom changed. The content is visually organized in blocks that can be swapped or moved. Within these blocks, only minor content changes occur. An example illustrates these findings. Figure 2.7 shows the evolution of the Yahoo! portal site within one hour in three steps. On the top left, a list of articles is presented, including a teaser for one of these articles. As one can see, the list of articles is continuously re-arranged to attract as many readers as possible. On the bottom left, a box contains the newsworthy headlines. In this box, minor changes are performed on the headlines. New articles are basically inserted at the top, pushing the lowest headline out. The stock information is updated periodically without affecting the structure. On the right side, advertisements are placed. They are usually taken from a repository of possible advertisements, depending on the advertising contract. The advertisement blocks change often, but the advertisements themselves are seldom changed. There is no “universal” way of representing blocks in XHTML. On some pages, the blocks are represented using adjacent subtrees, other pages use a single subtree. The reason for these differences lies in the fact that many different applications are used to create Web documents. A wide variety of Web page editors and content management systems exist which all use different serialization engines to map the user model onto the document model. Additionally, some parts of Web documents (e.g. advertisements) are provided by third parties, leading to an inconsistent representation of blocks. Nevertheless, modification patterns for Web documents can be formulated:

Assumption 2.4 Structure-preserving changes (text changes) are a frequent operation.

Assumption 2.5 Subtrees are often re-arranged.

Additionally, the insertion and deletion of adjacent subtrees appears to be common, which is already covered by Assumption 2.2.

2.5.3 Modification Amount and Distribution

Another important aspect of document evolution is the question of the modification amount, i.e. how many changes arise between two document versions. This question cannot be answered in general, as the ways of changing a document are numerous. For example, using “search and replace” when editing a document will most likely result in a large amount of changes covering the whole document. Nevertheless, assumptions on the typical modification amount can be made. In an analysis of several major Web sites with frequent changes, Adar et al. [2009] have shown that after two hours, 99.3% of all nodes remained unchanged. After one week, 91.7% of all nodes were still the same. This means that the modification amount is relatively small compared to the document size, even for frequent changes. A data-mining analysis of the change patterns in software repositories by Zimmermann et al. [2006] has shown that updated HTML and text files contained only few changes as well.


Figure 2.7: The evolution of the Yahoo! portal within one hour shows some typical properties of the evolution of Web documents. The pages are organized in blocks that mostly contain re-arrangements and small updates. The structure is seldom changed.

I assume these results to be applicable to the domain of office documents, too. This also confirms previously published assumptions on the modification properties of XML documents, which had been formulated without empirical backing by Lee et al. [2004].

Assumption 2.6 A new version of a document likely has only few changes compared to the size of the document.

Besides the amount of changes, the distribution of the changes throughout the document is an interesting question. Hill et al. [1992] have developed a notion to display the distribution of changes and reading operations as well, called edit wear and read wear. They have shown that, in general, only few parts of a document are read and changed at once. Large contiguous parts of the document remain unchanged and even unread.

Assumption 2.7 Changes on a document affect only small parts of the document in general.

2.6 Conclusions

XML is a powerful meta-language suitable for many tasks. Certainly, the simplicity of its design does not conform to the human perception of documents. Applications map the human-centric user model onto the well-specified document model. In this chapter, I have analyzed different XML document formats, namely ODF for office documents and XHTML for Web pages. In the course of the analysis, I have studied the properties of “typical” XML documents, gained from public repositories and Web pages. As a main result, the XML document trees appear to be flat and very wide. Additionally, many identical node labels exist. Subsequently, I have studied common modification patterns. Simple changes in the user model may cause significant changes in the XML document model. The most frequent operations are subtree-oriented insert and delete operations, as well as updates of single nodes. During the analysis, I have elaborated seven assumptions on the modification patterns of documents. These assumptions help to design an appropriate delta model for representing changes between document versions. The knowledge of the tree properties helps to formulate heuristics for efficient differencing algorithms.

3 Differencing Strings and Trees

This chapter gives an overview of the basics of document differencing, with a special focus on their application to the XML domain. One of the first approaches to document differencing was the comparison of strings, with a focus on the longest common subsequence, which I describe first. Later on, algorithms for comparing other data structures have been developed. Since XML documents are a special case of trees, I describe the tree-to-tree editing problem afterwards. It forms the basis of most XML differencing approaches. The most common of them are presented at the end of this chapter.

3.1 String Editing

Strings can be regarded as sequences of characters. Therefore, theoretical considerations concerning sequence comparison can be applied to the domain of string editing. The most common approach to comparing two strings is the computation of the longest common subsequence, which I introduce first. I briefly describe other notions like the Levenshtein distance afterwards. In the following, I discuss the complexity bounds, before giving a brief overview of the state-of-the-art in text differencing.

3.1.1 The Longest Common Subsequence

The longest common subsequence (LCS) of two strings has been defined by Wagner and Fischer [1974] as the maximum-length sequence of characters included in both strings. The common approach to finding the LCS is to create a matrix, using the two strings to compare as spanning vectors. The cells of the matrix contain the possible prefix combinations of the LCS at the given position. Using this longest common subsequence, a sequence of edit operations can be created that transforms one string into the other one. This sequence is also called an edit script. The available edit operations are insert and delete. Most approaches try to find a minimum edit script that contains as few edit operations as possible. To show the computation of the LCS, the strings S1 := BDAFDFCC and S2 := CADABFC shall be compared exemplarily. Table 3.1 shows the corresponding matrix. Each cell contains the LCS for the prefixes indicated by the indices of the cell. It is computed by taking the prefix LCS of the entries in the adjacent left and upper cells into account. Afterwards, the corresponding characters at the given position are compared. In case of a match, the character is appended to the longest prefix LCS. The cell in the lower right corner indicates one LCS of both strings.


        0 B    1 D    2 A    3 F    4 D      5 F      6 C        7 C
0 C     ∅      ∅      ∅      ∅      ∅        ∅        C          C
1 A     ∅      ∅      A      A      A        A        A,C        A,C
2 D     ∅      D      A,D    A,D    AD       AD       AD         AD
3 A     ∅      D      DA     DA     AD,DA    AD,DA    AD,DA      AD,DA
4 B     B      B,D    DA     DA     AD,DA    AD,DA    AD,DA      AD,DA
5 F     B      B,D    DA     DAF    DAF      ADF,DAF  ADF,DAF    ADF,DAF
6 C     B      B,D    DA     DAF    DAF      ADF,DAF  ADFC,DAFC  ADFC,DAFC

Table 3.1: The LCS can be computed using a matrix and is not necessarily unique. The bold entries highlight one possible path through the LCS matrix.

As shown in the example, the LCS may not be unique. ADFC and DAFC are both a valid LCS, which could be created on different paths through the matrix. A possible path is highlighted using a bold font face. For this path, the resulting edit script transforming S1 into S2 would be as follows: delete(0 B), delete(4 D), delete(5 F), delete(7 C), insert(0 C), insert(1 A), insert(4 B). Note that the unnecessary characters are removed first, thus transforming S1 into the LCS. Afterwards, the LCS is transformed into S2 by inserting the new characters. The edit operations must not be swapped, as each operation affects the numbering of the subsequent indices. To transform S2 into S1, the edit script must be inverted. Each insert has to be replaced by a delete operation and vice versa. Afterwards, the delete operations have to be moved to the beginning of the edit script, without breaking their respective order. The edit script is intended to reduce the first sequence to the LCS, and to reconstruct the second sequence from the LCS. Deriving from this definition, an update operation, meaning the substitution of a single element, is not possible. The same holds for a move operation that moves one element to another offset within the sequence. To allow for these kinds of operations, the resulting edit script has to be parsed again to find matching pairs of insert and delete operations [Tichy 1984]. Update operations are easy to find, as the position within the sequence is equal. Move operations are only useful in short sequences or if the sequence does not have many replicating entries. In case of many replications, an unambiguous matching of moved nodes becomes impossible and the resulting edit script becomes less meaningful, as described by Chawathe et al. [1996].
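The following Python sketch mirrors this procedure: it fills a prefix-length matrix, backtracks to recover one (not necessarily unique) LCS, and derives an edit script in the delete-then-insert order described above. The function names are mine and the sketch is illustrative only, not the formulation by Wagner and Fischer.

def lcs(s1, s2):
    """Compute one longest common subsequence via dynamic programming."""
    n, m = len(s1), len(s2)
    # length[i][j] holds the LCS length of the prefixes s1[:i] and s2[:j].
    length = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s1[i - 1] == s2[j - 1]:
                length[i][j] = length[i - 1][j - 1] + 1
            else:
                length[i][j] = max(length[i - 1][j], length[i][j - 1])
    # Backtrack to recover one (not necessarily unique) LCS.
    result, i, j = [], n, m
    while i > 0 and j > 0:
        if s1[i - 1] == s2[j - 1]:
            result.append(s1[i - 1]); i -= 1; j -= 1
        elif length[i - 1][j] >= length[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(result))

def edit_script(s1, s2):
    """Deletions (reducing s1 to the LCS) first, then insertions (building s2)."""
    common = lcs(s1, s2)
    script, k = [], 0
    for i, c in enumerate(s1):          # delete everything not part of the LCS
        if k < len(common) and c == common[k]:
            k += 1
        else:
            script.append(("delete", i, c))
    k = 0
    for j, c in enumerate(s2):          # insert everything missing from the LCS
        if k < len(common) and c == common[k]:
            k += 1
        else:
            script.append(("insert", j, c))
    return script

# For the strings of Table 3.1, edit_script("BDAFDFCC", "CADABFC")
# yields four delete and three insert operations.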

3.1.2 Other Sequence Comparison Approaches

One important property of the LCS formulation is the ability to construct an edit script. Earlier work focused on the question to which degree two sequences are identical. For example, Levenshtein [1966] has defined a metric for the similarity of binary sequences. The Levenshtein distance determines the length of the minimum edit script that contains not only the insertion and deletion of bits, but also the substitution of a single bit.

The formulation of the string-to-string correction problem is based on prior work on spelling correction by Morgan [1970], which is also based on insert, delete, and update operations. Also in the context of biological sequences, the problem of finding the longest common subsequence while allowing for inserted or deleted elements has been investigated by Sankoff [1972]. However, these approaches focused only on the computation of the length of the LCS, not on the creation of the edit script. According to Ishida et al. [2005], recent research focuses on the question of efficiently computing the LCS from the end of the sequences to the beginning. The main applications for this are log-file analysis and suffix comparison of biological sequences. Also, the incremental comparison of growing sequences has been studied by Landau et al. [1998], which is highly useful for the analysis of streams. Other approaches, e.g. by Cormode and Muthukrishnan [2007], increase the computational speed by allowing non-optimal results. None of these approaches relates well to the problem statement of comparing document versions, where exact knowledge about the changes themselves is needed, including the corresponding edit script. Therefore, I will not consider these approaches any further.

3.1.3 Complexity

Basically, the LCS computation can be solved in quadratic time using a dynamic programming approach. The lower bound is O(n²), with n denoting the length of the sequences. The lower bound relates to the comparison of two equal sequences. One significant optimization can be achieved by restraining the input alphabet. A limited input alphabet of size σ allows for a lower bound of O(σn) [Bergroth et al. 2000]. Especially for large documents, this is an important benefit in terms of efficiency, particularly for strings. In the domain of XML, however, this optimization loses much of its benefit. Here, the input alphabet is limited, too, but can be arbitrarily enlarged. Assuming a node-centric granularity, the input alphabet relates to the amount of different nodes within a document to compare. Section 2.4 has shown that the amount of different nodes may be significant. Therefore, this optimization is not suitable for the domain of XML documents. Besides the time complexity, the space complexity is an important issue when using LCS algorithms. Most approaches compare two strings of sizes n and m using a matrix as shown in the example above. This leads to an O(nm) space complexity. Using clever heuristics, the search space within the matrix can be restrained, thus avoiding the computation of the complete matrix. The main idea is to drop search paths within the matrix that do not promise a meaningful result. As shown in the previous example, most of the entries within the matrix are duplicates that do not offer any new information. Many of the entries would never be reached by a search algorithm for the LCS and may be skipped during computation. Most heuristics still ensure the generation of a minimum edit script. To illustrate the need for a sub-quadratic space complexity, consider the following simple example: An ODF document can easily reach 10,000 nodes. Comparing all nodes using a complete matrix requires 100,000,000 entries. Even if assuming only 4 Bytes per matrix entry, this results in a memory consumption of over 381 MByte just for the LCS matrix. As the example in Table 3.1 shows, most entries within the matrix contain several LCS candidates.

Therefore, the actual memory consumption of the matrix would be far higher. Some algorithms are able to compute the LCS in linear space, which is an important achievement when applying the LCS problem to XML documents. One of the best studied algorithms with a good general applicability has been presented by Myers [1986]. It is primarily intended for use-cases where the sequences to compare are similar to a high degree. The algorithm runs in O(ND) time, where N = n + m denotes the combined size of the compared documents, and D denotes the size of the minimum edit script. Apparently, it is well suited for documents with N ≫ D, which relates to a high similarity of the sequences. Its space complexity evolves linearly with O(N). The algorithm does not make any assumptions on the size of the input alphabet, which ensures the applicability to the domain of XML documents.

3.1.4 Differencing Line-Based Documents

In the early days of the computer age, input and output were performed using punch-cards. Later on, cathode-ray tubes (CRTs) were used for displaying files. These CRTs did not offer the graphics capability that is common nowadays. They were only able to display characters line-by-line [Irons and Djorup 1972]. This line-based orientation is suitable for the domain of document editing, as texts are usually displayed line-by-line, too. The graphical representation of documents was closely tied to the capabilities of typewriters. For these reasons, the first text editing applications stored their files as they were displayed on the screen. This line-based orientation was also standardized as the American standard code for information interchange (ASCII) [Gorn et al. 1963]. This format and its successors are still a popular form for storing text-based documents. To compare text documents, the sequences of lines are compared using LCS algorithms. This application is called diff and has been introduced by Hunt and McIlroy [1976]. Each differing line leads to a corresponding insertion and deletion of a line. In this approach, lines are considered as atomic entities, which are not inspected further. Additionally, lines are only represented as hash values for the sake of efficiency, as proposed by Miller and Myers [1985]. If each line were analyzed individually using the LCS, the runtime of the algorithms would be unusable. Another reason for comparing lines using their hash values is the intended application. Version control systems for source code were the first to use a diff tool to reveal changes during the development of software [Tichy 1985]. As lines in source code are rather short, it is quite easy for the human reader to gather the differences between two lines without computer assistance. For lines containing contiguous text, however, finding the differences manually appears to be an error-prone task [Neuwirth et al. 1992]. The result of a diff run is called a delta. Wall [1985] has presented a tool, called patch, which is able to apply a delta to the former version of a document. With the usage of diff, patch, and deltas, a repository can be kept up to date efficiently, as only changes are exchanged. Full document versions are seldom transmitted.
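To make the idea of line-based differencing with hashed lines concrete, the following Python sketch represents each line by a hash and derives insert and delete operations from an alignment of the two hash sequences. It is only an illustration under my own assumptions: difflib's matcher is not the Hunt-McIlroy algorithm, and the hashing scheme is arbitrary.

import difflib
import hashlib

def line_diff(old_lines, new_lines):
    """Compare two documents line by line, representing each line by a hash."""
    def digest(line):
        return hashlib.md5(line.encode("utf-8")).hexdigest()

    old_hashes = [digest(line) for line in old_lines]
    new_hashes = [digest(line) for line in new_lines]

    # SequenceMatcher aligns the two hash sequences; deletions and insertions
    # of lines are derived from the non-matching regions.
    matcher = difflib.SequenceMatcher(a=old_hashes, b=new_hashes, autojunk=False)
    script = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            script.extend(("delete", i, old_lines[i]) for i in range(i1, i2))
        if tag in ("insert", "replace"):
            script.extend(("insert", j, new_lines[j]) for j in range(j1, j2))
    return script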


3.2 Tree Editing

Early work on tree editing has been highly influenced by a graph-theoretical view of the tree. With the rise of tree-oriented document languages like XML, other interpretations of the tree property have become popular. The most important distinction between these views is the definition of the edit operations. In general, graph-oriented approaches have a node-centric view of the change granularity. Document-centric approaches tend to have a subtree-oriented view. Here, I discuss the graph-oriented aspects of tree editing, as they also form the basis for a subtree-oriented view. First, I introduce the tree-to-tree editing problem and the tree edit distance, which directly relates to it. For the sake of completeness, I mention other approaches briefly, before discussing the complexity bounds. Finally, I present first approaches to differencing tree-based documents.

3.2.1 The Tree-to-Tree Editing Problem

Finding and describing the changes between two ordered, labeled trees has been introduced as the tree-to-tree editing problem by Selkow [1977]. It is an application of the string editing problem to the domain of trees. From the very first definition, tree editing included not only insert and delete operations, but also an update operation that does not affect the structure of the tree. Only leaf nodes can be inserted or deleted. If the inner structure of the tree has to be changed, the corresponding subtree has to be deleted node by node from the leaves up to the designated position, and re-inserted in reversed order node by node. Obviously, Selkow’s definition of insert and delete operations requires a large amount of operations to represent changes to the tree structure. To overcome this, Tai [1979] has introduced a different notion of tree edit operations that allows for inserting and deleting nodes arbitrarily within the tree. If a node being the root of a subtree is deleted, its subtree is placed in the position of the deleted node, which means that all nodes in the subtree are pulled one tree level towards the root. Vice versa, an inserted node pushes an already existing node forest down one level, becoming the new root of that subtree. According to Bille [2005], this behavior has become the most common node-based edit model of trees. Nevertheless, Selkow’s view on leaf-level editing of trees has prevailed in the document-oriented algorithms of Chawathe et al. [1996]. Figure 3.1 shows examples of the edit operations using Tai’s edit model. In contrast to this, Selkow’s edit model would result in three edit operations for the insertion of node i, that is, delete(/1 e), insert(/1 i), insert(/1/0 e). The deletion of node f would require five operations instead of two, namely delete(/2/0 g), delete(/2/1 h), delete(/2 f), insert(/2 g), insert(/3 h). Apparently, Tai’s model creates significantly shorter edit scripts. Creating a minimum edit script is a challenging task in the domain of tree editing. Whereas the costs of edit operations are usually neglected in the domain of string editing, some tree-editing approaches use a cost model to take the complexity of operations into account. Each operation is assigned a cost value. The cost of an edit script is the sum of the costs of all edit operations of this script.


[Tree diagrams: (a) Original, (b) Insertion of node i, (c) Deletion of node f, (d) Update of node b to j.]

Figure 3.1: Inserting a node pushes an existing subtree towards the bottom. Deleting a node lets the descendants rise a level within the tree. Updates are structure-preserving.
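To make Tai's node-level semantics concrete, the following Python sketch models a tree as nested nodes and implements the three operations as depicted in Figure 3.1. The Node class and the function names are my own; this is an illustration of the edit model, not an algorithm from the literature.

class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

def delete_node(parent, index):
    """Tai's delete: the children of the removed node take its place,
    i.e. they are pulled one level towards the root."""
    removed = parent.children[index]
    parent.children[index:index + 1] = removed.children
    return removed

def insert_node(parent, index, count, label):
    """Tai's insert: the new node adopts `count` existing siblings starting
    at `index`, pushing that node forest down one level."""
    new_node = Node(label, parent.children[index:index + count])
    parent.children[index:index + count] = [new_node]
    return new_node

def update_node(node, label):
    """Structure-preserving update of a single node label."""
    node.label = label

# The tree of Figure 3.1(a): a(b(c, d), e, f(g, h))
root = Node("a", [Node("b", [Node("c"), Node("d")]),
                  Node("e"),
                  Node("f", [Node("g"), Node("h")])])
insert_node(root, 1, 1, "i")   # as in Figure 3.1(b): i becomes the parent of e
delete_node(root, 2)           # applied next: f is removed, g and h move up one level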

[Tree diagrams: (a) Preorder traversal: a b c d e f g h; (b) Postorder traversal: c d b e g h f a.]

Figure 3.2: Tree traversal algorithms align the tree onto a sequence. This example shows two possible traversals of the tree displayed in Figure 3.1(a).

The edit distance between two trees is the minimum cost over all edit scripts. This view has been introduced by Tai [1979]. The introduction of operation-specific costs is motivated by the consideration that the structure-preserving update operation is less costly in most implementations of trees. Usually, a fixed-cost model is used, assigning each edit operation type a static cost value. The solution of the tree-to-tree editing problem is not as straightforward as in the sequence comparison domain. Due to their two-dimensional structure, trees do not offer a “natural” linear ordering. Tree traversal algorithms may map the tree structure onto a sequence, hence breaking the parent-child relationship. There are two major traversal algorithms that are commonly used in tree editing algorithms. The preorder traversal starts at the root of the tree and visits the subtrees starting at the left-most child, resulting in a top-down pass of the tree. The postorder traversal is a bottom-up pass of the tree, starting at the left-most leaf. Both traversal types are shown exemplarily in Figure 3.2. The traversal algorithm used has a significant influence on the resulting edit script.

As there usually exist different edit scripts with equal edit cost, different tree traversal algorithms lead to completely different results, as shown by Barnard et al. [1995].
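The two traversals are easy to express recursively. The following sketch, using a minimal nested-tuple tree representation of my own, reproduces the sequences shown in Figure 3.2.

def preorder(tree):
    """Top-down pass: visit the node first, then its subtrees left to right."""
    label, children = tree
    yield label
    for child in children:
        yield from preorder(child)

def postorder(tree):
    """Bottom-up pass: visit all subtrees left to right, then the node."""
    label, children = tree
    for child in children:
        yield from postorder(child)
    yield label

# The tree of Figure 3.1(a), written as (label, [children]):
tree = ("a", [("b", [("c", []), ("d", [])]),
              ("e", []),
              ("f", [("g", []), ("h", [])])])
assert list(preorder(tree)) == list("abcdefgh")
assert list(postorder(tree)) == list("cdbeghfa")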

3.2.2 Other Approaches

Jiang et al. [1995] have introduced tree alignment as an alternative to the tree editing problem described before. The main idea is to first transform the structure of the first tree into that of the other one using insert and delete operations, regardless of the node labels. Afterwards, the node labels are adjusted using update operations. While the tree edit distance computes the largest common subtree, the tree alignment is used to find the smallest common supertree. A major drawback of this approach is the large increase in edit operations in some scenarios compared to the tree edit distance. However, even the fastest known tree alignment algorithm by Jansson and Lingas [2001] does not provide a gain in efficiency. Therefore, I do not follow this approach. The tree edit distance is an important factor in the natural language processing domain [Mehdad 2009]. There, the closeness of two trees is most important. Therefore, some research focuses on efficiently computing the tree edit distance, without being able to create an edit script efficiently. Additionally, approximations of the tree edit distance have become popular. Bernard et al. [2008] have introduced a probabilistic cost model for efficient approximation. As the work in this domain neglects the generation of an edit script, I will not follow these approaches either. All the approaches mentioned up to now consider every node individually. In contrast to this, Valiente [2001] has defined the bottom-up distance of two trees. From his point of view, a node is only considered to be unchanged if none of its descendants has been changed. Using this definition, if one leaf node has been changed, all its ancestors up to the root node are considered to be changed, too. Apparently, this leads to a large increase in edit operations. Although the bottom-up distance is efficient to compute, I do not consider this edit model any further. As each small change results in comprehensive edit scripts, this approach is inappropriate for handling documents, for reasons of meaningfulness and space efficiency of the result. The node-centric definition of insert and delete operations is not suitable for every domain. Apart from document editing and natural language processing, trees are widely used to represent RNA structures in the domain of bioinformatics. The applicability of tree editing to RNA sequence comparison has been shown by Shapiro [1988]. As RNA has a very limited alphabet, the edit scripts may include operations that do not reflect the nature of the change. Allali and Sagot [2004] have introduced more complex edit operations that allow for a more meaningful change representation in the domain of RNA comparison. As this is clearly not the domain of document editing, I do not follow this approach. However, the need for more domain-specific edit operations arises in the document editing domain, too. I will discuss this issue in Section 3.2.4.


3.2.3 Complexity

The tree-to-tree correction problem is usually solved using a dynamic programming approach. For two trees T1 and T2, a basic algorithm computes a minimum edit script in O(|T1|² × |T2|²), where |T1| denotes the number of nodes of T1 [Bille 2005]. The basic algorithm can be improved by reducing the search space, still ensuring a minimum edit script. Zhang and Shasha [1989] have presented an efficient algorithm that is referred to as the best-known general solution. It runs in O(|T1| × |T2| × min{depth(T1), leaves(T1)} × min{depth(T2), leaves(T2)}). This lowers the complexity bound especially for flat or thin and deep trees. Chen [2001] has presented an algorithm running in O(|T1| × |T2| + min{leaves(T1)² × |T2| + leaves(T1)^2.5 × leaves(T2), leaves(T2)² × |T1| + leaves(T2)^2.5 × leaves(T1)}) time. This algorithm performs better for thin and deep trees, but is inferior to the approach by Zhang and Shasha in the general case. Refinements of Zhang and Shasha’s algorithm have been presented by Klein [1998] and Touzet [2005]. Klein’s algorithm has a better worst-case complexity, but does not perform as well for flat or thin and deep trees. Therefore, it is not suitable for XML documents. Touzet’s algorithm requires the maximum tree edit distance to be known in advance. Her algorithm is only more efficient if this maximum edit distance is noticeably lower than the amount of leaves or the depth of the trees to compare. However, it is unlikely to learn about the maximum degree of changes between document versions without a prior similarity analysis, whose cost must be taken into account as well. All known solutions that compute a minimum edit script still have a super-quadratic time complexity, which is far less efficient than in the sequence comparison domain. To achieve better complexity bounds, the goal of an edit script conforming to the tree edit distance is dropped. Nevertheless, the edit script should be close to the tree edit distance. Zhang [1996a] has introduced the constrained edit distance, where disjoint subtrees are not mapped onto each other, thus reducing the complexity to O(|T1| × |T2|). Chawathe et al. [1996] have introduced the weighted edit distance, where the size of subtrees is taken into account when computing the edit distance. Using some assumptions¹ on the characteristics of the input trees, the time complexity bounds can be lowered to O((leaves(T1) + leaves(T2)) × e + e²), with e denoting the weighted edit distance. This is obviously far more efficient than all approaches mentioned previously. Instead of relaxing the tree edit distance property, Zhang [1996b] has proposed to solve the tree-to-tree editing problem more quickly by using parallel algorithms. Since multi-core processors have become standard on today’s computers, this is a promising idea. However, the separation of the trees to compare into different parts restricts the computed edit script to leaf-oriented insert and delete operations according to Selkow’s edit model. The time complexity of the proposed algorithm yields O((log(leaves(T1)) + log(leaves(T2))) × log(|T1|) × log(|T2|)). Additionally, the processor complexity of O((|T1| × |T2|) / log(min{|T1|, |T2|})) has to be taken into account. Zhang has related his algorithm to the parallel sequence editing algorithm of Apostolico et al. [1990].

¹ I will discuss the mentioned tree characteristics in more detail in Section 3.3.3.


[Tree diagram: a root node ε whose children /0 to /7 are the characters B D A F D F C C.]

Figure 3.3: Tree-to-tree editing is a generalization of string editing. Strings can be represented by a tree with a height of two.

According to Apostolico et al., even the best parallel algorithm is inferior to sequential algorithms in terms of complexity. As the distribution overhead in parallel algorithms is significant and the change model is inappropriate, I do not follow this approach. As already mentioned, I only consider ordered trees. Besides domain-specific reasons for that decision, Zhang et al. [1995] have proven the NP-hardness of the tree-to-tree correction problem for unordered trees in general. Later, Akutsu et al. [2008] have shown that the largest common subtree of two unordered trees with bounded height can be computed in polynomial time. Obviously, these complexity bounds do not allow for computing the edit script in acceptable time. Also in the domain of tree editing, the space complexity is an important yet often forgotten aspect. As with string editing approaches, a matrix is usually used to compute the edit script, thus yielding a quadratic complexity. Wang and Zhang [2005] have presented an optimized algorithm with O(log(|T1|) × |T2|) space complexity. To the best of my knowledge, this is the only published improvement in this domain. To be able to compare large trees, Chawathe [1999] has proposed an algorithm using external memory, i.e. hard disks.

3.2.4 Differencing Tree-Based Documents

From its intention, the tree-to-tree editing problem has been seen as a generalization of the string editing problem [Selkow 1977]. In this view, the characters of the strings are the leaves of a tree with a height of two and are therefore not directly connected to each other. Figure 3.3 shows an example string. This view has led to Selkow’s definition of leaf-based insert and delete operations. Later, the string-oriented view of trees has been relaxed. Most of the work on tree editing in the 1980s and 1990s has been influenced by graph-theoretical aspects, mostly reflected by Tai’s interpretation of inner-tree insert and delete operations. With the rise of highly structured tree-oriented document formats like SGML, the tree editing problem has been revisited from a document editing perspective. Barnard et al. [1995] have analyzed different tree editing algorithms as well as different modification scenarios of SGML documents. As a result, the insertion and deletion of whole subtrees appeared to be a common task in these scenarios, which are not discrete operations in the traditional tree editing algorithms.

To solve this drawback, Barnard et al. [1995] have developed an extension to the algorithm by Zhang and Shasha [1989], thus allowing for subtree insertion and deletion. Nevertheless, this algorithm still relies on the tree edit distance, which does not account for subtree-oriented changes. Therefore, it is unclear how often subtree changes are actually detected. Unfortunately, no evaluation examining the use of subtree changes has been performed. LaTeX code can be regarded as a tree-oriented document, too. An efficient algorithm for the comparison of trees (with LaTeX documents in mind) has been presented by Chawathe et al. [1996]. They have adapted the leaf-based edit operations of Selkow [1977]. According to their empirical evaluation, the tree-oriented view appears to be much more suitable for this kind of documents than the line-based view of the conventional diff tool.

3.3 Differencing XML Documents

With the increasing spread of XML data formats, the tree-to-tree correction problem has been revisited. Some tools have been presented that apply existing tree differencing algorithms to XML documents. Other researchers have questioned the existing change models, creating new ones. In this section, I present four XML differencing tools that implement tree differencing algorithms. After explaining the selection criteria, I briefly present the tools, including their underlying algorithms and output formats. For the sake of completeness, several non-competitive approaches are mentioned at the end of the section, before the importance of the change representation model is discussed.

3.3.1 Selection Criteria and Comparison

A major distinction between XML differencing tools is the algorithm used. All selected tools use different algorithms. As a precondition, the differencing algorithm had to be published to allow for a meaningful investigation. Additionally, an implementation had to be available, at least as a research prototype. For all XML differencing tools, the basic idea is explained. Additionally, the time complexity of the algorithm used is presented. As most of the algorithms have no published statement on their space complexity, it is only mentioned if known. Another important aspect of a tool is the representation of changes. In the last section, different change models have been presented, e.g. the node-oriented editing by Tai [1979] or the leaf-oriented model by Selkow [1977]. The question of how these edit operations are represented within a delta has not been discussed yet. In this context, the addressing of an edit operation plays a major role. Each of the presented tools uses a different change model. I do not intend to present the change model of all tools in detail. However, a glance at the generated output quickly shows significant differences. As I relate XML differencing tools to the domain of XML document evolution, the text documents presented in Figure 2.6 (see Section 2.5.1) are used as test case. I repeat the properties of this test case.

A simple document containing two paragraphs is changed by applying a bold font face and italics. As a result, the underlying XML tree is changed not only at the two parts where the formatting changed. Additionally, new style definitions are added to the document. From a human perspective, 3 changes have been performed on the trees, if “replace” is considered as one operation. Otherwise, 4 operations are needed. As the definition of the tree edit distance does not allow for subtree-oriented operations, the edit distance between both trees amounts to 8. All presented XML tools have to compare these two versions.

3.3.2 Microsoft XML Diff (Zhang-Shasha)

Earlier, the algorithm by Zhang and Shasha [1989] was referred to as the best known general-purpose algorithm for ordered trees. It solves the tree-to-tree editing problem in its formulation by Tai [1979], thus allowing for inserting and deleting nodes arbitrarily throughout the tree. The tree edit distance is computed using a matrix. The basic idea of this algorithm is to restrain the search path through the matrix, thus computing only the needed parts. This way, the algorithm computes an optimal solution in terms of the edit distance with time complexity² O(|A1| × |A2| × min{depth(A1), leaves(A1)} × min{depth(A2), leaves(A2)}). As XML documents appear to be very flat compared to their size, the time complexity converges to a quadratic complexity. The space complexity of the algorithm is quadratic with O(|A1| × |A2|). Microsoft has developed an XML differencing tool and claims that it implements the Zhang-Shasha algorithm. It is part of the .NET framework and is implemented using C#. The source code is not disclosed. Additionally, the documentation is rather short, which makes it difficult to retrace the behavior of the implementation. Strictly speaking, this tool does not meet the selection criteria presented before. I decided to discuss this tool despite these deficiencies for different reasons. First, it is the only implementation of the Zhang-Shasha algorithm for XML documents that has left the stage of a prototype. Second, the delta model is different from the other approaches and should be presented, too. Third, this tool is widely used, which makes it the state-of-the-art on Windows systems. Therefore, it should be considered in the comparative evaluation in Chapter 10. A curiosity of this implementation is the change model used. Whereas the algorithm by Zhang and Shasha has been designed to support insert and delete operations only on the node level, the Microsoft XML diff also detects subtree insertions and deletions. Additionally, a move operation is supported, which is not known in the original definition of the algorithm either. Following a black-box interpretation, I assume this XML diff tool to be a refinement of the SGML differencing algorithm by Barnard et al. [1995]. The output for the test case is shown in Figure 3.4. As an interesting property, a hash value of the original document version is stored. This is done to prevent the delta from being applied to another document, which would possibly create erroneous results. The idea of the delta is to traverse the target document.

² Here, I use the notation A for a document to distinguish between the tree differencing algorithms and their applications to the XML differencing domain.


xd:xmldiff srcDocHash="375842907363673063"
xd:node match="2"
xd:node match="3"
xd:add style:style style:name="P1" style:text-properties style="italic" style:style style:name="T1" style:text-properties weight="bold"
xd:node match="4"
xd:node match="1"
xd:node match="2"
xd:change match="@text:style-name" P1
xd:node match="3"
xd:change match="1" aetas:
xd:add text:span text:style-name="T1" carpe diem quam minimum credula postero

Figure 3.4: Microsoft XML Diff uses a tree-walking relative addressing of nodes.

Only parts that contain changes are accessed, indicated by instructions of the form xd:node match="2". If the correct position for a change is reached, the change itself is applied, e.g. by xd:add. In total, this delta contains 4 edit operations, which corresponds to the human interpretation of the changes to the tree. Note that these edit operations do not correspond to the original definition of the algorithm, which ensures a minimum edit script in terms of the minimum edit distance and would thus result in 8 edit operations.

3.3.3 Diffxml (Chawathe)

Chawathe et al. [1995] have revisited the tree-to-tree editing problem with hierarchically structured data in mind. As a result, they favored Selkow’s leaf-oriented edit model over Tai’s edit model allowing for arbitrary insert and delete operations. Additionally, they have introduced a move operation for subtrees. A corresponding differencing algorithm has been presented later, called FastMatch EditScript (FMES) [Chawathe et al. 1996]. Its initial intention was to find differences between LaTeX document versions. The basic idea is to compute the longest common subsequence of leaves to detect insert and delete operations. FMES works in five phases. In the first phase, structure-preserving update operations are identified. In a second phase, re-arranged subtrees are identified, which are represented using move operations. In the next three phases, the tree is transformed into its target appearance using insert, move, and delete operations. The XML tree is basically parsed bottom-up. FMES does not guarantee an optimal solution in terms of the tree edit distance.

This allows for reducing the complexity bounds significantly. However, a near-optimal solution shall be ensured in the general case. FMES has a time complexity of O((leaves(A1) + leaves(A2)) × e + e²), with e denoting the weighted edit distance, i.e. the length of the computed edit script. The ratio e/d, where d denotes the tree edit distance, is bounded by log n for the general case. The general case only holds under the assumption that there exist only few leaves with identical labels [Chawathe et al. 1995]. In Section 2.4.3, however, I have shown that typical XML documents contain many repeating leaves. A comparative evaluation in Section 10.4 will show the significant influence on the size of the edit script. The space complexity of FMES has not been investigated by the authors. Based on a cursory review of the algorithm, I estimate that it scales linearly with the input size. Three XML-aware implementations of FMES are available. The first one has been presented by Logilab, a French software manufacturer. The implementation uses the Python programming language, and is part of many Linux distributions under the name xmldiff. The second implementation has been developed by Mouat [2002], called diffxml. It uses Java and has been updated recently. Hottinger and Meyer [2005] have reviewed the xmldiff implementation, stating that the implementation does not fully conform to the FMES algorithm. Additionally, this implementation is not competitive in terms of speed, as shown in an empirical scenario by Rönnau et al. [2005]. Hottinger and Meyer [2005] have also presented their own implementation of FMES. However, this implementation is still in a prototype state and not usable due to several small bugs. Therefore, only diffxml will be considered further on. In diffxml, attributes are regarded as independent nodes, too. Each edit operation is addressed using an absolute path in XPath notation. Figure 3.5 shows the resulting delta for the test case. It contains 18 edit operations, which is far above the minimum edit script of 8 operations. The reason is that the insertion of a single node is represented by several insert operations, one for the node label, and one for each attribute. The delta model of diffxml is well-defined, yet cumbersome. Additionally, the XPath notation used for the addressing of nodes is hard to read and inefficient in terms of space.

3.3.4 XyDiff (Cobena)

Marian et al. [2001] have tried to track the changes of XML documents on the Web in the context of a data warehouse. They have formulated an edit model that allows for inserting, deleting, and moving subtrees. Single nodes can be addressed by update operations. This edit model results from a deductive analysis of a given XML tree. Marian et al. have decided to completely discard the node-orientation for insert and delete operations that is used in the formulation of the tree-to-tree editing problem. Based on this edit model, Cobéna et al. [2002] have presented a new differencing algorithm for XML documents, called XyDiff. This algorithm lowers the complexity bound significantly again. Normally, XyDiff runs in linear time and space with O(|A1| + |A2|). The worst-case time complexity is O(|A1| + |A2| × log(|A1| + |A2|)). This increase in efficiency comes at the expense of the minimality of the resulting edit script. The basic idea of XyDiff is the so-called bottom-up, lazy down (BULD) propagation.


delta insert childno="1" name="style:style" nodetype="1" \ parent="/node()[1]/node()[3]"/ insert name="style:name" nodetype="2" \ parent="/node()[1]/node()[3]/node()[1]" P1 insert childno="2" name="style:style" nodetype="1" \ parent="/node()[1]/node()[3]" insert name="style:style" nodetype="2" \ parent="/node()[1]/node()[3]/node()[2]" T1 insert childno="1" name="style:text-properties" nodetype="1" \ parent="/node()[1]/node()[3]/node()[1]"/ insert name="style" nodetype="2" \ parent="/node()[1]/node()[3]/node()[1]/node()[1]" italic insert childno="1" name="style:text-properties" nodetype="1" \ parent="/node()[1]/node()[3]/node()[2]"/ insert name="weight" nodetype="2" \ parent="/node()[1]/node()[3]/node()[2]/node()[1]" bold insert childno="2" name="text:p" nodetype="1" \ parent="/node()[1]/node()[4]/node()[1]"/ insert name="text:style-name" nodetype="2" \ parent="/node()[1]/node()[4]/node()[1]/node()[2]" P1 move childno="1" length="12" new_charpos="1" old_charpos="1" \ parent="/node()[1]/node()[4]/node()[1]/node()[2]" node="/node()[1]/node()[4]/node()[1]/node()[3]/node()[1]" insert childno="1" nodetype="3" \ parent="/node()[1]/node()[4]/node()[1]/node()[3]" aetas insert charpos="8" childno="2" name="text:span" nodetype="1" \ parent="/node()[1]/node()[4]/node()[1]/node()[3]" insert name="text:style-name" nodetype="2" \ parent="/node()[1]/node()[4]/node()[1]/node()[3]/node()[2]" T1 insert childno="3" nodetype="3" \ parent="/node()[1]/node()[4]/node()[1]/node()[3]" quam minimum credula postero insert childno="1" nodetype="3" \ parent="/node()[1]/node()[4]/node()[1]/node()[3]/node()[2]" carpe diem delete charpos="1" length="46" \ node="/node()[1]/node()[4]/node()[1]/node()[4]/node()[1]"" delete node="/node()[1]/node()[4]/node()[1]/node()[4]"

Figure 3.5: Diffxml relies on a leaf-oriented change model with attribute granularity, resulting in a large amount of edit operations.


delta
Deleted pos="0:1:3:0:2:0" aetas: carpe diem quam minimum credula postero
Inserted pos="0:1:2:0" style:style style:name="P1" style:text-properties style="italic"
Inserted pos="0:1:2:1" style:style style:name="T1" style:text-properties weight="bold"
Inserted pos="0:1:3:0:2:0" aetas:
Inserted pos="0:1:3:0:2:1" text:span text:style-name="T1" carpe diem
Inserted pos="0:1:3:0:2:2" quam minimum credula postero
AttributeUpdated nv="P1" name="text:style-name" ov="Standard" pos="0:1:3:0:1"

Figure 3.6: JXyDiff uses absolute paths. Insertions and deletions target subtrees. Attributes can be updated directly. Old values are stored in the delta to reconstruct former versions of the document.

XyDiff traverses the tree bottom-up, propagating matching nodes towards the root. Nodes are weighted using the amount of descendants. In a subsequent top-down pass, the node weight is used to quickly decide whether a modified subtree will be deleted with its original appearance and re-inserted with its target appearance, or whether a set of more fine-granular operations is used, which means inserts on lower levels or updates. Generally speaking, the “brute-force” method of deleting and inserting large subtrees is preferred. XyDiff has originally been implemented using C on Linux. Unfortunately, this implementation is no longer being maintained. As a result, it can only be used on highly modified Linux systems with outdated libraries and compiler versions. A Java implementation, called jXyDiff, has been developed later. However, this implementation lacks a corresponding patch tool to apply a computed delta to an XML document. The output format of (j)XyDiff is straightforward and intuitive; the delta is shown in Figure 3.6. Each operation uses an absolute path. Delete and update operations store the old value of the changed parts, which allows for reconstructing the former version of a document using the delta. In total, seven edit operations are used to represent the changes.
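The bottom-up matching idea can be sketched in a few lines: every subtree receives a hash over its content and a weight counting its descendants, and identical subtrees of both versions can then be matched, heaviest first. This is only a rough illustration of the principle under my own assumptions (hash construction, greedy matching), not XyDiff's actual implementation.

import xml.etree.ElementTree as ET
import hashlib

def index_subtrees(root):
    """Map a subtree hash to (weight, node) pairs, computed bottom-up.
    The hash covers the tag, attributes, text, and the hashes of all children."""
    index = {}

    def visit(node):
        child_digests, weight = [], 1
        for child in node:
            d, w = visit(child)
            child_digests.append(d)
            weight += w
        material = repr((node.tag, sorted(node.attrib.items()),
                         (node.text or "").strip(), child_digests))
        digest = hashlib.sha1(material.encode("utf-8")).hexdigest()
        index.setdefault(digest, []).append((weight, node))
        return digest, weight

    visit(root)
    return index

def match_heaviest(old_root, new_root):
    """Greedily pair identical subtrees of both versions, heaviest first,
    as a small-scale illustration of bottom-up match propagation."""
    old_index = index_subtrees(old_root)
    new_index = index_subtrees(new_root)
    matches = []
    for digest, candidates in sorted(old_index.items(),
                                     key=lambda item: -max(w for w, _ in item[1])):
        if digest in new_index:
            matches.append((candidates[0][1], new_index[digest][0][1]))
    return matches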

3.3.5 Faxma (Lindholm)

All previously mentioned approaches traverse an XML document using its tree representation. Another approach, called faxma, has been presented by Lindholm et al.[2006]. They applied the idea of greedy matching of sequences to the domain of XML. Originally, this approach


diff:diff op="insert"
ref:node id="/0"
diff:copy run="2" src="./0"
ref:node id="./2"
style:style style:name="P1"
style:text-properties style="italic"
style:style style:name="T1"
style:text-properties weight="bold"
ref:node id="./3"
ref:node id="./0"
diff:copy run="1" src="./0"
text:p text:style-name="P1"
diff:copy run="1" src="./1/0"
ref:node id="./2"
aetas:
text:span text:style-name="T1"
carpe diem
quam minimum credula postero

Figure 3.7: The output of faxma is a script, containing references to the original document and inserted nodes.

has been used in the domain of binary synchronization algorithms by Tridgell[1999]. A sliding window is used to compare two input sequences using highly efficient rolling-hash algorithms that have been introduced by Rabin[1981]. An XML document is transformed into a sequence using the XAS format presented by Kangasharju and Lindholm[2005]. In this XAS format, nodes are transformed in document order into tokens of a sequence stream. This tokenization prevents the greedy matcher from splitting the XML document into unparsable fragments. In the general case, faxma runs in linear time and space with O(|A1| + |A2|). Faxma has been implemented in Java.

An interesting property of faxma is the change model used. Due to its linear parsing of the document, an edit script in its conventional form is not created. Instead, the output consists of the inserted parts and references to the unchanged parts of the original document. Figure 3.7 shows the delta for the test case. The commands ref:node and diff:copy reference a single node or several subtrees, respectively. This change representation is efficient in terms of space, but hard for humans to read. To some extent, it resembles the Microsoft XML Diff format.
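The sequence matching rests on a rolling hash that can be shifted over the token stream in constant time per step. The following Java sketch shows the principle only; the constants, the window handling, and the class layout are illustrative and are not taken from the faxma implementation.

// Minimal sketch of a polynomial rolling hash over a token sequence, in the
// spirit of Rabin's fingerprints.
class RollingHash {
    private static final long BASE = 257, MOD = 1_000_000_007L;
    private final long[] tokens;   // tokenized document, values assumed in [0, MOD)
    private final int window;
    private long hash, power = 1;  // power = BASE^(window-1) after construction
    private int start = 0;

    RollingHash(long[] tokens, int window) {
        this.tokens = tokens;
        this.window = window;
        for (int i = 0; i < window; i++) {
            hash = (hash * BASE + tokens[i]) % MOD;
            if (i > 0) power = (power * BASE) % MOD;
        }
    }

    long current() { return hash; }

    // Slide the window one token to the right in constant time.
    long roll() {
        long outgoing = (tokens[start] * power) % MOD;
        hash = ((hash - outgoing + MOD) * BASE + tokens[start + window]) % MOD;
        start++;
        return hash;
    }

    public static void main(String[] args) {
        long[] tokens = {3, 1, 4, 1, 5, 9, 2, 6};   // stand-ins for serialized XML tokens
        RollingHash rh = new RollingHash(tokens, 3);
        System.out.println(rh.current());           // hash of tokens 0..2
        System.out.println(rh.roll());              // hash of tokens 1..3
    }
}

In faxma, the tokens correspond to the XAS events of the serialized document; matching window hashes identify candidate runs of unchanged content that can be referenced instead of being copied into the delta.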

3.3.6 Other Approaches

One important distinguishing factor for the selection of XML differencing algorithms is the availability of an implementation to verify the claimed achievements. Xu et al.[2002] have claimed to have developed an XML differencing algorithm supporting the ordered and unordered tree model running in linear time. Unfortunately, I was not able to get the implemented

system. As the algorithm description is rather imprecise, a new implementation cannot be derived from that description.

Some approaches only support selected properties of XML documents. For example, the approach by Lee et al.[2004] neglects attributes, which lowers its applicability significantly. Other approaches, as presented by Al-Ekram et al.[2005] or Iorio et al.[2009], do not provide a better time complexity than previous approaches.

Some approaches make domain-specific assumptions on the XML format. BioDIFF, for example, has been designed for the annotation of biological data [Song et al. 2007]. In the domain of bioinformatics, Hedeler and Paton[2008] have shown the shortcomings of general-purpose differencing algorithms.

Differencing XML documents has also been investigated from a graph-theoretical perspective. A corresponding tool, SSDDiff, has been presented by Schubert et al.[2005]. However, the graph-theoretical approach is far less competitive in terms of efficiency, leading to unusable runtimes even for rather small documents.

An overview of XML differencing approaches for ordered trees has been given by Peters[2005]. XML differencing algorithms have also been developed for unordered documents. The most prominent solution has been presented by Wang et al.[2003].

3.3.7 Discussion

The resulting deltas for the test case reveal fundamental differences between the four XML differencing tools. The number of edit operations ranges from 4 to 18, which is a significant deviation, mostly resulting from the granularity of the change model.

Besides the change granularity, the addressing of operations is handled differently. The deltas of Microsoft XML Diff and faxma basically describe a walk through the source document, creating the target document by in-lining edit operations. This model demands that all edit operations of a delta are applied in sequence. An edit operation cannot be interpreted without taking the rest of the delta into account.

In my opinion, the subtree-oriented representation of changes is far more intuitive for the domain of XML documents, as it reflects the nature of the changes better than a node-oriented granularity. As the content is stored in the leaves, and non-leaf nodes are used for markup, the subtree-oriented view offers a contiguous representation of the changes relating to an edit operation in terms of the user model.

When taking the readability by humans into account, the absolute addressing scheme of XyDiff and diffxml is far more appropriate than the in-line representation of Microsoft XML Diff and faxma. Especially the dense representation of XyDiff allows the user to estimate the content of the changes at a glance, which is not possible for the fine-grained diffxml delta format. In my opinion, this is an important aspect concerning the user acceptance of a differencing tool.


3.4 Conclusions

Finding and representing changes between document versions has been an extensively studied problem since the 1970s. First approaches have only considered the string representation, leading to the definition of the longest common subsequence problem. In this chapter, I have presented the LCS, as it is still widely used. For example, the well-known diff tool relies on a common LCS solution. The computation of the LCS is solvable in nearly linear time for similar documents.

Trees can be seen as a generalization of strings. The tree-to-tree editing problem maps the LCS problem onto the domain of trees. However, the complexity of the comparison of trees is significantly higher than for strings. Even the best known solutions run in super-quadratic time and space. To deal with this challenge, recent approaches relax the need for optimal solutions, using best-effort approaches to find reasonably small edit scripts, thus yielding a linear complexity in the average case. In this chapter, I have presented different change representation models in the tree editing domain, together with prominent solutions for the efficient computation of the edit script for trees.

Differencing XML documents is a natural application of the tree-to-tree editing problem. Several tools have been developed. Some of them implement known tree-correction algorithms. Others take the properties of XML documents into account. In this chapter, I have presented the four best-known algorithms for XML differencing, together with their implementations. They all use a different representation of the changes. Several approaches introduce subtree-oriented operations that have not been known in the original formulation of the tree-to-tree editing problem. Using a simple test case, I have shown that the change model and the representation of changes are a major distinguishing feature between XML differencing tools. However, this aspect has mostly been disregarded in the literature up to now. The presented tools will act as competitors in the evaluation of my algorithm in Chapter 10.

4 Document Merging

Parallel editing of documents is an every-day task. Comparing document versions allows for reconstructing the evolution of a document by extracting the changes. However, the document versions can only be reconstructed linearly. Parallel editing raises the need to merge two document versions that have been edited independently. Merging document versions is a challenging task, as evolving paths and conflicting updates have to be taken into account.

In this chapter, I give an overview of the most common approaches to document merging, including the state-based model on which I base my work. I start by giving a short introduction to parallel editing processes, including the resulting challenges. In the following sections, three major strategies for document merging are presented and discussed: state-based change control, operational transformation, and the annotation of documents. Other approaches are briefly discussed at the end of the chapter.

4.1 Parallel Editing of Documents

Documents are often edited by different persons. In a paper-based office, a physical document is usually tied to a file. Only one person can perform any action on this document at a given time. Electronic documents, however, may have multiple physical representations (either electronic or printed), and may be distributed arbitrarily. First approaches to support collaborative editing of documents mapped this serial access model onto electronic documents, e.g. by Neuwirth et al.[1990]. At this time, parallel editing of a file was already state-of-the-art in version control systems for software source code [Tichy 1982]. Later, the parallel editing model has been applied to the domain of document editing, too. When editing documents in parallel, merging different document versions including the detection of conflicts becomes the main challenge.

4.1.1 Version Control Systems

Up to now, the source code of software projects is mainly organized in files. These files are usually stored within a repository, which is managed by a version control system [Pilato 2004]. The version control system handles concurrent reading and writing accesses to the repository and ensures that each change is tracked. Any prior version of a file shall be reconstructible.

Bennett and Rajlich[2000] have stressed that software products usually have parallel lines of evolution. Even if a new version of the software is available, former versions still receive support, resulting in bug fixes and other changes. Version control systems model this parallel


[Figure 4.1 diagram: on the TRUNK, A evolves to A1 and further to A1,2; on a BRANCH, A evolves to A2 in parallel over time t.]

Figure 4.1: In a structured editing process, a document may evolve in parallel in different branches.

evolution by the use of branches. One file may have different valid versions in parallel, one in each branch. Figure 4.1 shows a file A, which is branched and independently edited, resulting in the versions A1 and A2. In terms of the evolution of the file, A is the nearest common ancestor of A1 and A2. The main branch is usually referred to as trunk. It is generally possible to reunite branches. This requires the merge of A1 and A2, resulting in A1,2. The same holds if a delta should be applied to different branches in parallel, e.g. for bug fixing. I will discuss the issues of merging in Section 4.1.3. For a more detailed description of source code evolution models, please refer to Conradi and Westfechtel[1998].

Branching has to be actively initiated by a user and is used for an intended parallel development. Another important aspect of version control systems is to handle the concurrent (writing) access to the files in the repository. A common approach is to lock a file that is opened for editing, thus preventing other users from performing a write operation on that file [Borghoff and Schlichter 2000]. Especially for large files, more optimistic collaboration models are used, though. Each user is allowed to change a local copy of the document. When committing his changes to the repository, his document version is merged with the changes of other users that emerged in the meantime1. This optimistic model relies on the assumption that users usually change different parts of the document. Delisle and Schwartz[1986] have supported this assumption with an empirical analysis.

4.1.2 Collaborative Editing of Documents

Version control systems have a fixed document access model. Each access on a file is logged, and changes can be tagged with version numbers. This behavior reflects the precise collaboration models used in the domain of software development, where each minor change may have a great impact. Developers frequently commit their changes to the repository to be able to revert single changes quickly and in isolation, which has been shown in an empirical analysis of repository commits by Zimmermann et al.[2006]. Additionally, the source code of software is highly structured. Entities like objects, procedures etc. have unique identifiers.

For documents, these restrictions do not hold. Documents have an implicit structure. An explicit structure may be defined, e.g. on the level of sections and subsections [Borghoff and Teege 1993a]. However, tracking the evolution of this structure has to be done explicitly. An

1One could consider the local copy to be an implicit branching. Indeed, this terminology has recently become popular in distributed version control systems.

implicit tracking is not possible outside a closed system, as a paragraph does not offer a natural identifier. Additionally, even simple changes on the user level may cause far more changes in the XML representation of a document, as already shown in Section 2.5.

There exist numerous models for the collaborative editing of documents, e.g. by Flower and Hayes[1981]. Koch[1997] has performed a thorough analysis of the different aspects of collaboration and their support by computer-based tools. In my work, I focus on the technical aspects of the merging of document versions. The social aspects of collaborative systems are mostly disregarded. Nevertheless, a social perspective may also help to promote the technical development. Performing an analysis of the social aspects of collaboration, Mitchell et al.[1995] have identified a key requirement for collaborative systems: a user wants to see all changes performed on the document at a glance to be aware of its current state and the prior modification process as well.

4.1.3 Merging

The reconciliation of a new document version out of two (or more) prior versions of the document is called merge. Basically, syntactic and semantic merges have to be distinguished [Shen and Sun 2002]. Syntactic merges rely only on the written word, without regard to its semantic meaning. A semantic merge takes the semantic meaning of the documents to merge into account. This is a far more complex task than a syntactic merge. For highly structured documents in formal languages like source code, semantic merge approaches exist. They mostly rely on an analysis of the code tree; a survey of these approaches has been performed by Mens[2002]. For less-structured documents using natural language, no corresponding approach is known to me. Natural language processing is a complex task which is only applicable to a restricted class of problems [Toland 2000], suffering from a low accuracy [Lease 2007]. It appears to be unlikely that usable semantic merge approaches for documents in natural language will be developed within the next decade. Additionally, Shen and Sun[2002] have stressed that a semantically consistent merge can only be achieved by having a proper definition of the authors’ roles. According to Koch[1997], a common awareness of the goals of the work is a key prerequisite for meaningful collaborative editing (and merging). As I focus on efficiency and general applicability, I will only consider syntactic merging.

Adams et al.[1986] have presented a first merging application for text files. They distinguish overlapping and non-overlapping changes. Non-overlapping changes do not affect the same lines. These changes can be merged automatically in their approach. Khanna et al.[2007] have shown that even non-overlapping changes may interfere with each other due to identical parts, leading to unwanted merge results. I will discuss that issue in Section 4.2.1.

A main challenge in merging documents is the addressing of an operation. In a serial collaboration model, a change addresses the position using a path. A line may be identified by a line number, an XML node by an addressing scheme as presented in Section 2.2.5. Two persons performing actions on a document create two new versions from their nearest common ancestor. In this context, the amount of changes is irrelevant. The changes are addressed based on the nearest common ancestor. If the second person tries to apply her changes, the


(a) Original version. (b) Version A1: the subtree rooted at /0 has been deleted. (c) Version A2: node /1/0 has been inserted. (d) Version A1,2: the node is inserted at the wrong place.

Figure 4.2: Interim changes affect the addressing of other operations to merge. If no measures are taken, operations may be performed at the wrong place in the document.

addresses of the operations might be affected by the prior changes of the first person. As an example, Figure 4.2 shows an incorrectly merged tree. Here, the negative offset resulting from a delete operation leads to the insertion of a node into the wrong subtree. Handling these offsets resulting from prior changes is a major challenge in document merging.
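The effect can be reproduced in a few lines of code. The following Java sketch is my own toy illustration of the scenario of Figure 4.2, using a minimal tree class and a naive, index-based application of the two edits; it is not part of any of the tools discussed here.

import java.util.ArrayList;
import java.util.List;

// Toy illustration of the offset problem from Figure 4.2.
class OffsetDemo {
    static class Node {
        String label;
        List<Node> children = new ArrayList<>();
        Node(String label) { this.label = label; }
        Node add(Node c) { children.add(c); return this; }
    }

    public static void main(String[] args) {
        // Original tree: a(b(c,d), e, f(g,h))
        Node a = new Node("a");
        Node b = new Node("b"); b.add(new Node("c")).add(new Node("d"));
        Node e = new Node("e");
        Node f = new Node("f"); f.add(new Node("g")).add(new Node("h"));
        a.add(b).add(e).add(f);

        // Edit from version A1, addressed against the ancestor: delete child /0.
        a.children.remove(0);

        // Edit from version A2, also addressed against the ancestor:
        // insert node i as first child of /1. In the ancestor, /1 was e,
        // but after the delete above, /1 now points to f -- the insert
        // lands in the wrong subtree.
        a.children.get(1).children.add(0, new Node("i"));

        System.out.println("child /1 is now: " + a.children.get(1).label); // prints "f"
    }
}

Because the delete shifts all following siblings one position to the left, the insert addressed at /1/0 ends up below f instead of below e.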

4.1.4 Conflict Handling

Up to now, only non-overlapping changes have been considered. If two users try to change the same part of a document, a conflict arises [Lippe and van Oosterom 1992]. In general, conflicts cannot be resolved automatically. The most common approaches are to favor one of the versions [Shen and Sun 2002], to reject a change until it does not conflict any more [MacKenzie et al. 2002], or to prompt the user for a decision [Ignat and Norrie 2006].

The previous definition of a conflict relies on a linear representation of a document. However, most documents are structured. Therefore, Borghoff and Teege[1993b] have extended this view. Every change on a higher level of the document marks the subordinate parts as changed, too. This view is also applicable to the domain of trees. A change of a node may be interpreted as a change to all of its descendants. I will discuss this issue more thoroughly in Section 8.4.

Detecting conflicts is a crucial capability needed for the merging of documents. In this

context, conflicts are also defined on a syntactic level. Semantic conflicts arising from contradictory content of the document are not considered. Approaches exist that require the definition of consistency rules for highly formalized documents, e.g. by Scheffczyk[2004]. Using his approach, a merge could be rejected in case of the violation of a consistency rule. However, formal consistency rules are not applicable to documents in general. Proof-reading by the user will most likely remain the ultimate instance for conflict detection and resolution on the semantic level.
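The extended, hierarchical notion of a conflict can be checked with a simple path test: an operation on a node also touches all of its descendants, so two operations interfere if one addressed path is an ancestor of (or equal to) the other. The following Java sketch is only my own illustration of this test; the path syntax and method names are made up for the example.

// Toy conflict check following the extended view of Borghoff and Teege:
// an operation on a node also "touches" all of its descendants.
// Paths are child-index sequences like "/1/0/2".
class ConflictCheck {
    static boolean touchesSameRegion(String pathA, String pathB) {
        // A conflict is assumed if one edited path is an ancestor of
        // (or equal to) the other, i.e. one path is a prefix of the other.
        return (pathA + "/").startsWith(pathB + "/")
            || (pathB + "/").startsWith(pathA + "/");
    }

    public static void main(String[] args) {
        System.out.println(touchesSameRegion("/1", "/1/0/2")); // true: /1 covers /1/0/2
        System.out.println(touchesSameRegion("/1/0", "/2/0")); // false: disjoint subtrees
    }
}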

4.2 State-Based Change Control

In state-based change control systems, changes between document versions are identified by a differencing tool, as described in the last chapter. Documents can be merged using a three-way differencing tool, which compares the document versions with their nearest common ancestor. Alternatively, deltas can be enriched with context information that allows for identifying edit operations by their syntactic context.

4.2.1 Three-Way Differencing

When documents are updated independently, they are usually based on a common ancestor version. A three-way differencing tool compares two document versions with respect to their nearest common ancestor. The basic idea is to be able to determine how two operations in both document versions relate to each other. The first three-way differencing tool for line-based documents, called diff3, has been developed by Smith[1988]. Khanna et al.[2007] have performed a thorough analysis of diff3.

The basic approach of diff3 is shown in Figure 4.3. First, the longest common subsequence of the original version with each of the two modified documents is computed. The two documents are merged using these two sequences afterwards. The nearest common ancestor allows for determining whether a non-matching node indicates a conflict or a change in only one of the versions. Obviously, three-way differencing is more complex than a simple comparison of two documents.

Even the availability of the nearest common ancestor does not ensure an unambiguous decision whether a node conflicts. This issue already arises in the context of line-based differencing, where Khanna et al.[2007] have revealed some unwanted behavior of diff3. Conflicts may arise if an element occurs multiple times within a document. At this point, it should be remembered that in common documents, many nodes are identical (see Section 2.4). However, a sufficient distance between two edited parts of the document lowers the probability of merge errors significantly.
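The decision logic of a three-way merge can be outlined in a few lines. The following Java sketch is a grossly simplified, line-granular illustration: it assumes that some earlier diff step has already mapped each changed line of the ancestor to its replacement (null standing for a deletion), and it ignores insertions and chunk handling entirely; it is not the diff3 algorithm itself.

import java.util.*;

// Grossly simplified three-way merge at line granularity.
class ThreeWayMergeSketch {
    static List<String> merge(List<String> base,
                              Map<Integer, String> editsA1,
                              Map<Integer, String> editsA2) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < base.size(); i++) {
            boolean inA1 = editsA1.containsKey(i);
            boolean inA2 = editsA2.containsKey(i);
            if (inA1 && inA2) {
                // Both sides touched the same base line: conflict.
                result.add("<<<<<<< " + editsA1.get(i) + " ||| " + editsA2.get(i) + " >>>>>>>");
            } else if (inA1) {
                if (editsA1.get(i) != null) result.add(editsA1.get(i));
            } else if (inA2) {
                if (editsA2.get(i) != null) result.add(editsA2.get(i));
            } else {
                result.add(base.get(i)); // unchanged in both versions
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> base = Arrays.asList("alpha", "beta", "gamma");
        Map<Integer, String> a1 = Map.of(0, "ALPHA");   // A1 changed line 0
        Map<Integer, String> a2 = Map.of(2, "GAMMA");   // A2 changed line 2
        System.out.println(merge(base, a1, a2));        // [ALPHA, beta, GAMMA]
    }
}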

4.2.2 Context-Aware Patching

Three-way differencing has two major drawbacks. First, the high complexity becomes noticeable, especially for large documents. Second, the nearest common ancestor has to be known


[Figure 4.3 diagram: the ancestor A is compared with A1 and A2 via LCS(A,A1) and LCS(A,A2); the resulting diff(A,A1) and diff(A,A2) are merged into A1,2.]

Figure 4.3: Diff3 computes the longest common subsequence between the nearest common ancestor A and the two modified versions first. Afterwards, the merged document is computed.

and to be available. Both may be severe issues, especially in loosely coupled collaboration environments. For example, if documents are exchanged via e-mail, a reliable version tracking is not available. In distributed version control systems, not all versions of a document may be available locally either.

To overcome these drawbacks, Davison[1990] has introduced a new delta format for line-based documents, containing context information. Figure 4.4 shows an example. By default, the three lines before and after the line affected by an operation are added to the delta, as shown in Figure 4.4(c), where a blank line has been inserted. Afterwards, this delta is applied to a version of the document where an additional line has been inserted at the beginning (see Figure 4.4(b)). Using the context, the patch tool is able to find the correct insertion point of the blank line, despite the offset of one line caused by the new headline. This context diff is also known as the unified diff format. Today, a slightly modified version of this format is used in common diff and patch tools [MacKenzie et al. 2002].
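The relocation of an edit operation by means of its context can be sketched as a simple search: starting at the position recorded in the delta, the patch scans outwards until the stored context lines match the target document. The following Java fragment is my own minimal illustration of this idea, not the matching strategy of GNU patch.

import java.util.List;

// Sketch of how a context-aware patch can relocate an edit operation.
class ContextMatcher {
    static int findAnchor(List<String> doc, List<String> context, int expectedPos) {
        for (int offset = 0; offset < doc.size(); offset++) {
            for (int sign : new int[] {1, -1}) {
                int pos = expectedPos + sign * offset;
                if (matchesAt(doc, context, pos)) return pos;
            }
        }
        return -1; // context not found: the operation cannot be applied
    }

    private static boolean matchesAt(List<String> doc, List<String> context, int pos) {
        if (pos < 0 || pos + context.size() > doc.size()) return false;
        for (int i = 0; i < context.size(); i++) {
            if (!doc.get(pos + i).equals(context.get(i))) return false;
        }
        return true;
    }
}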

4.2.3 XML Support

Merging approaches for XML documents relying on a state-based model are very rare. An XML-aware three-way differencing algorithm has been presented by Lindholm[2004]. The complexity of the merging procedure is reasonable with O(n log(n)). However, the time for comparing the trees twice has to be considered, too, which is super-quadratic (see Section 3.2.3). Additionally, this approach is not able to handle conflicting updates, as the detection of conflicts appears to be far more intricate in the domain of trees [Horwitz et al. 1989].

The definition of syntactic context is straightforward in a line-based document model. However, there is no intuitive mapping onto a tree-based document model. Mouat[2002] has presented an XML delta format, called Delta Update Language (DUL), that allows for adding


(a) Original and (b) Updated: Horace's Carmen 1,11; the updated version (b) additionally carries the headline "Quintus Horatius Flaccus" as its first line.
(c) Context delta with meta-data:
--- orig.txt 2010-03-14 21:14:14
+++ upd.txt 2010-03-15 11:10:02
@@ -3,6 +3,7 @@
 finem di dederint, Leuconoe, nec Babylonios
 temptaris numeros. Ut melius quidquid erit pati!
 Seu pluris hiemes seu tribuit Iuppiter ultimam,
+
 quae nunc oppositis debilitat pumicibus mare
 Tyrrhenum, sapias, vina liques et spatio brevi
 spem longam reseces. Dum loquimur, fugerit invida
(d) Merged: the updated version with the blank line inserted at the correct position.

Figure 4.4: Adding syntactic context to a delta allows for reliably finding the correct address of an edit operation. In this example, the new headline in the merged document version causes an offset of one line.

syntactic context to edit operations. In DUL, an edit operation may contain the siblings of the node addressed by the edit operation, as well as the parent and children nodes. However, the DUL specification has some weak points, which I will discuss in Section 6.5.4. Additionally, no corresponding patch application is known to me.

In general, most of the work on state-based XML versioning focuses on the aspect of efficient change detection. The subsequent merge of documents has mostly been disregarded in the literature. None of the XML differencing tools presented in Section 3.3 is able to merge XML documents.

4.2.4 Discussion

Three-way differencing and context-aware patching are commonly used for merging line-based documents. Established version control systems like CVS or Subversion rely on these state-based approaches [Pilato 2004]. However, they cannot ensure a correct merge result. Major research efforts have been made to develop merge approaches that try to offer a

better reliability or accuracy [Mens 2002]. Nevertheless, none of these approaches has managed to supersede the conventional approach. From my personal perspective, the reason for the prevailing success of the classic diff and patch tools is their general applicability, in combination with their ease-of-use. The deltas are human-readable, and the merge of document versions is nearly self-explaining. Additionally, these tools do not bind the user to a special application or edit model. They have been adopted in a large variety of tools on a higher level.

Up to now, no diff and patch toolkit offers this broad applicability and merge support for XML documents. First approaches exist that apply the well-proven diff3 and context-aware patch tools to the domain of XML. However, these approaches are still in their infancy. A general-purpose XML diff, patch, and merge toolkit would close that gap.

4.3 Operational Transformation

Operational Transformation (OT) has been introduced by Ellis and Gibbs[1989] in the context of a real-time group editor. Each participant in the collaboration environment has his or her own working context with a local copy of the document, called a site. Each operation performed on one site is transferred to the other sites, where the operation is applied within the corresponding context.

4.3.1 Basic Idea

In the original definition, a character-based granularity has been defined. Any character change (i.e. a keystroke) results in an insert or delete operation. The basic idea is to transform each operation in a way matching its current context. This transformation mainly targets the adaptation of the addressing of a change to handle interim changes caused by the local user or other sites. The transformation step is the real heart of any OT algorithm.

A major goal of OT approaches is to ensure that all copies of the document are in the same state after a certain timespan for transmission and reconciliation. This is hard to achieve, especially over high-latency connections. Due to network constraints, operations may not be received in the order they have been performed, which requires sophisticated transformers [Nichols et al. 1995].

The reconstruction of former versions of a document (i.e. an “undo” operation) is often impossible in OT approaches. Oster et al.[2006a] have introduced an annotation method that marks deleted items as removed. This way, changes can be reverted, which is a major improvement. As a drawback, the document size increases linearly with the number of changes, which can become space-consuming.
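The core of such a transformation can be stated in a few lines for the character-based case. The following Java sketch shows the textbook inclusion transformation of a local insert position against a concurrent remote insert or delete; it illustrates the general idea only and is not the transformer of any particular OT system.

// Textbook inclusion transformation for character-based operations:
// adjust the position of a local insert against a concurrent remote
// operation that has already been applied.
class OtTransform {
    // Adjusted position of an insert at 'pos' after a remote insert at 'remotePos'.
    static int transformAgainstInsert(int pos, int remotePos) {
        return (remotePos <= pos) ? pos + 1 : pos;
    }

    // Adjusted position of an insert at 'pos' after a remote delete at 'remotePos'.
    static int transformAgainstDelete(int pos, int remotePos) {
        return (remotePos < pos) ? pos - 1 : pos;
    }
}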

4.3.2 XML Support

The operational transformation approach has been mapped from the domain of line-based texts to trees by Davis et al.[2002]. Molli et al.[2002] have presented an XML-aware group editor,

called SAMS. In SAMS, the character-centric granularity of the text-based domain is mapped onto a node-based granularity. Attribute changes are considered as independent operations. A similar approach has been presented by Ignat and Norrie[2003].

Recently, Google has presented a new collaboration environment, called Wave2. Google Wave relies on an operational transformation approach to XML documents, described by Wang and Mah[2009], which has been highly influenced by the work of Nichols et al.[1995]. Instead of using a tree-based edit model, Google Wave relies on an XML representation by sequences, similar to the XAS serialization format by Kangasharju and Lindholm[2005], which is used in the faxma XML differencing algorithm (see Section 3.3.5). Node labels are considered to be atomic, whereas text nodes are treated on a character-granularity level.

4.3.3 Merging

The operational transformation approach is basically intended to allow users to edit a document in parallel. In this context, conflict handling is an important yet difficult issue. In general, only delete and update operations can cause conflicts, as the same item is addressed by two or more operations [Shen and Sun 2002]. Insert operations do not lead to conflicts, as they do not affect items addressed by other operations. However, insert operations in the same syntactic context may lead to meaningless results, e.g. if a sentence is edited by two persons independently. Here, an awareness of the editing operations of other persons may be appropriate [Schlichter et al. 1998]. Additionally, a paragraph edited by a user can be locked to prevent changes by other users [Borghoff and Teege 1993b].

Basically, the reconciliation of document versions is a transparent process that is not designed for user interaction. Therefore, Shen and Sun[2002] have decided to generally favor the local version in case of conflicts. Sun and Sosič[1999] have proposed a fine-grained locking scheme for real-time editors to prevent conflicts from arising. Another approach has been followed by Ignat and Norrie[2006]. They display conflicting changes in a separate window, requesting the user to vote for one of the possible versions.

Parallel editing by different users raises the question of scheduling the different operations, especially in low-bandwidth and high-latency environments, or with many parallel users. This way, a semantically correct merge of changes shall be ensured. Preguiça et al.[2003] have presented an approach that respects integrity constraints. However, finding an optimal solution appears to be NP-hard. Therefore, only a heuristic best-effort approach is used.

4.3.4 Complexity

The transformation of the operations may be a complex task. Ignat and Norrie[2008] have shown that the transformation runs in quadratic time O(n²), where n denotes the number

2Due to its limited success, Google has already discontinued this service. Nevertheless, Google Wave proved the maturity of OT in the XML domain even in large environments. The code was released as open source and is actively developed by a vibrant community.

of operations to perform. This is reasonably fast for collaborative editors where changes are exchanged continuously at short intervals. However, operational transformation can be used for asynchronous collaboration as well. There, the number of operations to apply increases significantly, leading to a noticeable increase of the runtime.

To achieve a faster reconciliation in asynchronous collaboration systems, Ignat and Norrie[2008] have introduced a distributed operation history instead of a centralized operation list. The distribution follows the tree structure of the XML document. This way, the different operation lists can be processed in their local context, which is more efficient than processing the whole operation list at once. A second benefit of distributed operation histories has been described by Papadopoulou et al.[2006]. Due to the locally available operation list, a common awareness of the changes can be established efficiently.

In the classic approach, all sites of a collaborative editor are equal and exchange their editing information on a peer-to-peer basis. This approach scales badly in environments with many sites [Oster et al. 2006b]. Much redundant information is exchanged, and the number of conflicts may increase dramatically. To overcome this issue, a common server can be defined which hosts the master document. Especially in case of conflicts, this master document represents the valid state of the whole system. The server also integrates different changes into one, which reduces the overhead during transmission. As an example, Google Wave uses this server-centric operational transformation approach [Wang and Mah 2009].

4.3.5 Discussion

Implementing the operational transformation approach in the domain of group editors has some benefits. Usually, the user input is directly caught and mapped onto the corresponding operations for reconciliation. This way, there is no need to perform complex diff runs, as would be needed in a state-based approach. Even though the overhead caused by the transformation of the operations and the high amount of network traffic is significant, editors using operational transformation are quite fast. This results from the fact that the operations are exchanged at short intervals, which distributes the complexity over time to an acceptable degree. However, this only holds for synchronous editors. In asynchronous editors, the drawbacks of the overhead become severe.

Besides the network availability constraints, there are other reasons for favoring an asynchronous approach. Privacy and security are becoming more and more important aspects of collaborative work. Although group awareness is an important benefit for supporting collaboration and co-operation, private workspaces are essential [Olson et al. 1990]. Especially extensive changes (like the re-organization of chapters) are usually tested in a private workspace before being presented to other authors. To meet these concerns, Ignat et al.[2008] have introduced so-called ghost operations that are not displayed to the other users, but applied to the document. The user acceptance of ghost operations has not been investigated, however. Additionally, this is a kind of internal branching which is hard to reconcile in case of a later merge.


4.4 Annotation of Documents

In a document-centric view of collaboration, all editing information should be part of the document itself. Different approaches have been presented that use annotations of the XML document to represent version information. Most approaches are highly inspired by the domain of temporal queries in databases [Wang et al. 2006]. The basic idea is to represent different states of the document within the document itself.

4.4.1 Basic Idea

In general, single nodes are annotated, thus applying the tree edit model by Tai[1979]. Usually, attributes are used to annotate the node. As a simple example, the node comment could be annotated this way: comment created="March, 2, 2010". The annotation contains the operation applied and the date of that operation. This scheme is straightforward for insert and delete operations, yet difficult to apply for update operations. To be able to revert an update operation, the former value of a node can be stored as an attribute. For the example above, an update could be represented as newcomment update="March, 3, 2010", ov="comment". For subsequent updates on the same node, this model has to be extended. Chawathe et al.[1999] have presented a method that allows for multiple updates of a single node, where all former values are preserved in attribute values. Respecting attribute changes requires an even more complicated representation of the changes [Wang and Zaniolo 2008].

The example above uses a chronological date to order the changes. In this time-based annotation model, each annotation contains temporal information concerning the validity of the performed operation. The granularity of the timestamps should conform to the edit pattern used. For a frequently updated document, a more fine-grained resolution than dates may be appropriate. A time-based annotation model for XML documents has been presented, e.g., by Amagasa et al.[2000].

Another approach is to label each operation of a new version with a unique identifier. This way, several edit operations can be part of a new version. The version identifier is usually a number representing the evolution step. This version-based approach to annotation requires the user (or the application) to actively define a new version. For example, this approach is used in the domain of normative documents [Grandi et al. 2005].

An extension of the annotation model is to provide not only the initial date/version of a change, but to define a validity period for a modified node, too. This validity period is defined by its start and end. For a still valid node, a virtual date (e.g. “December, 31, 9999”) or a code word (e.g. “now”) can be used [Amagasa et al. 2000].
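On a DOM representation, such annotations boil down to setting attributes instead of physically changing the tree. The following Java sketch uses the standard org.w3c.dom API; the attribute names follow the examples above, while the helper methods themselves are my own illustration and not part of any of the cited systems.

import org.w3c.dom.Element;

// Sketch of the annotation idea on a DOM element: instead of physically
// removing or overwriting content, the change is recorded in attributes.
class Annotator {
    // Mark an element as deleted at a given date instead of removing it.
    static void markDeleted(Element node, String date) {
        node.setAttribute("deleted", date);
    }

    // Record an update: keep the old value so that the former version
    // can be reconstructed from the annotated document.
    static void recordUpdate(Element node, String oldValue, String date) {
        node.setAttribute("update", date);
        node.setAttribute("ov", oldValue);
    }
}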

4.4.2 Querying States

The annotations should not be visible to the end-user. In general, the user is given the state of the document at a given time or version. This is usually performed by a middleware [Grandi et al. 2005]. Two approaches exist to extract a given state of the document. The first is to

use a specific query language with temporal operators, which is inspired by temporal database queries. A solution has been presented by Nørvåg[2002]. In the second approach, the XML document is transformed into the requested version using a transformation language like XSLT [Grandi and Mandreoli 2000].

4.4.3 Change Tracking in Documents

One of the most prominent applications of the annotation approach is the change tracking feature in office applications. During the editing of a document, all changes are recorded and stored within the document.

Office applications may operate on an internal document model, which is not necessarily equal to the document model given by the XML document format used for serializing the internal representation. The change tracking is performed on this internal representation, which does not have a tree structure like XML. This has some implications. First, the operation types may not conform to the tree structure. For example, ODF supports the following operation types: insert, delete, and format-change [Brauer et al. 2007]. The last one is not a tree-related operation. Second, changes are tracked with the granularity of characters. Because office document formats usually store a whole paragraph within one text node, measures have to be taken to map this onto the XML domain, which is less fine-grained. Third, edits are allowed to span over parts of adjacent subtrees, without breaking the original structure, which does not conform to the tree structure of XML, where each node has at most one parent node.

Any tracked change is registered in a change list at the beginning of the document. This list records the change type, the time when it was performed, and by whom. Furthermore, a reference to the change itself is included via an ID attribute. In case of a delete operation, the deleted part is stored within the change list, too. The tracked change itself is delimited by two nodes, called change-start and change-end, identified by their unique ID attribute. By introducing these leaf nodes, changes spanning over adjacent subtrees can be covered. The part in between these delimiting nodes has been inserted or formatted differently, respectively. As the deleted part is stored in the change list, a delete operation is only delimited by a single empty node, which would act as the insertion position in case of an inversion of the delete. The reason for storing deletions in the change list is straightforward: this way, prior versions and the corresponding metadata can be removed easily without parsing the whole document.

As update operations are not supported, each character update results in an insert and a delete operation. This reflects the presentation within the office application, where deleted parts are struck out and inserted parts are underlined. However, this presentation makes it difficult for humans to capture the changes [Neuwirth et al. 1992]. Additionally, the high granularity on the character level complicates a later merging of changes, as a single character only has a low information value.

Basically, the tracked changes should be invertible to recover former versions. However, only inserts and deletes are invertible. In case of format changes, the former format is not stored within the change list, hence the change cannot be reverted. There are two reasons for


[Figure 4.5 listing: on the left, the two document versions A (“Hello, world!”) and A1 (“Hello, World!”) are shown as separate ODF content trees (document-content, styles, body, text, p). On the right, a single document with tracked changes is shown: a tracked-changes list holds changed-region entries (format-change, deletion, insertion) with change-info elements naming the creator (Sebastian Roennau) and the date (2010-05-19T20:53:00), and the affected text parts are delimited by change-start and change-end nodes referencing these entries by ID.]

Figure 4.5: Change tracking (right) has a significant overhead compared to single files (left).

this behavior. First, format descriptions are stored in a separate style list at the beginning of the document. Changes on this style list are not recorded by the “track changes” feature. Second, a format change may be applied at a higher hierarchy level, e.g., paragraph level. Still, only the text parts affected by this change would be marked by the format-change delimiters.

Figure 4.5 shows an example. A document A has been changed from “Hello, world!” to “Hello, World!”, which includes a content change by capitalizing one letter and a markup change due to an italic font face. The format change has been applied using the style-name attribute of the corresponding paragraph node p. The tracked changes do not capture this; they only mark the parts that have been affected by this change, resulting in two change-regions before and after the character update.

4.4.4 Merging Annotated Documents

Including all editing information within the document itself is basically contradictory to merging. If the document is regarded as an object, it cannot be in two places (i.e. editing processes) at a time. Nevertheless, merging is an important task for annotated documents, too. Rosado et al.[2007] have proposed to include the graph of the whole document evolution within the document, including branches. This allows for representing parallel editing processes within an annotated document. However, no solution for merging these parallel versions is presented. To solve the merging issue, Fontaine[2002] has combined the annotation approach with an XML differencing engine. This way, a reliable merge can be performed. On the other hand, this approach is not able to handle parallel branches. A similar approach is taken by office applications. A code review of OpenOffice shows that the content (i.e., the text) is merged on the application level by differencing the two documents to merge.

4.4.5 Discussion

The annotation of documents solves a problem common to all XML differencing approaches. As the changes are in-lined within the document, the addressing does not matter. Changes in other regions of the document do not affect the correctness of the actual change, which is a major benefit compared to other approaches relying on path descriptions for addressing.

On the other hand, all editing information is tied to the document. This has two significant drawbacks. First, the space overhead might become considerable, especially for frequently changed documents. One might argue that other approaches have to store all this information, too. But for annotated documents, all the changes have to be transferred every time the document is exchanged. A separation of the current version and the history of a document is not possible. Second, the annotation might cause severe security and privacy issues. Especially tracked changes in office applications collect metadata about the author and the time the change was performed. The unwanted disclosure of internal workflows may have unpleasant side-effects.

Except for office documents, the annotation is not part of the document grammar. An annotated document will most likely not conform to that grammar, as the annotations are not part

of it. To verify the validity of a document, a single version has to be extracted. Otherwise, the corresponding grammar must be adapted to the annotation performed, as proposed by Bertino et al.[2002]. However, this is time-consuming and error-prone, and therefore not a viable way in my opinion.

An interesting question arises when inspecting how changes are discovered before the document can be annotated. Annotation approaches rely either on a state-based diff run [Fontaine 2002] or on an operation-based approach, e.g. the change tracking of office documents. Especially the first one requires additional tools to track the changes and map them onto the annotation model. In conclusion, the annotation approach is not very suitable for document merging. All approaches offering merge support rely on some kind of differencing run to determine the operations to merge.

4.5 Other Approaches

Research on the structured evolution of documents has been mainly driven by the needs of software engineering. Version control systems like RCS, CVS, or Subversion have been mainly designed to support the software development process [Pilato 2004]. In the evolution of software, branching and merging are every-day tasks, with some special characteristics. For example, if the name of a procedure is changed, all invocations of that procedure have to be updated. Most of the research on document merging focuses on the needs of source code, which is definitely out of the scope of my thesis. Mens[2002] has presented a concise overview of the research on software merging, which is a good starting point for further reading.

Other research on document merging is highly inspired by the domain of relational databases. Gropengießer et al.[2009] have presented a novel transaction scheme that applies the transaction model of Gray[1978] to XML. This data-centric approach is applicable to ordered trees, but relies on an unordered XML model by design. Therefore, this approach is not well suited to the domain of XML documents.

4.6 Conclusions

Merging documents is a complex yet important task. The main challenges in merging are the addressing of changes and the handling of conflicts. Any change within a document affects the paths of all subsequent parts of the document. Other edit operations that address these parts become invalid due to the resulting offset. Conflicts can arise on a syntactic and semantic level. Only syntactic conflicts can be handled by common software tools. The automatic resolution of conflicts is error-prone; usually, the user is prompted for a decision on ambiguous merges. Semantic conflicts have to be resolved manually by proof-reading the document. Several tools provide a merge capability for documents. Most of them try to merge using a state-based approach, operational transformation, or by annotating the document with version information. In this chapter, I have presented these approaches, including their assets and

drawbacks.

Operational transformation is mostly suitable for on-line collaborative editors. It is designed to merge only a few changes at a time, at frequent intervals. To avoid a complexity trap, a server-side transformation is appropriate for many concurrent users. Operational transformation is also able to support disconnected, i.e. asynchronous, operations on the document. However, complexity becomes an issue in this case. Additionally, operational transformation binds the user to a specific application, which does not match the intended use case. Koch and Koch[2000] have stressed the importance of a flexible composition of different tools as a major prerequisite for the user acceptance of collaborative systems.

In the annotation approach, all changes are in-lined into the document. This is a major benefit, as the problem of addressing the changes is avoided. A conventional differencing tool is used to merge annotated documents. The annotation approach has some drawbacks, as the whole version history is carried with each transmission of the document. First, this is not very efficient in terms of space, especially for frequently updated documents. Second, privacy and security questions arise from it. The annotation is performed using an on-line change tracking similar to a collaborative editor, or by comparing document states using a diff tool.

Despite several sophisticated merge approaches (especially for software source code), the state-based syntactic merge of documents is still the most common method for merging. In a state-based approach, two document versions are compared with respect to their nearest common ancestor. Another approach is to enrich the edit operations of a delta with the syntactic context of each operation. Both approaches allow for a simple yet reliable merge of document versions. However, XML-aware applications using this model are not available yet.

The presented approaches are not mutually exclusive. Rönnau and Borghoff[2009] have shown how to map the annotations of tracked changes onto a state-based versioning model. Fraser[2009] has developed a collaborative editor using state-based differencing and merging tools. All approaches have their benefits. But in terms of a wide applicability, the state-based approach has a definitive advantage in my opinion. Therefore, I base my work on this model.

Part II

A Context-Sensitive Approach to XML Change Control


5 The XCC Framework

Change control of documents comprises the aspects of document differencing, patching, and merging. The last chapters have shown that the change control support for XML documents is not mature yet. Existing approaches either concentrate only on specific aspects of change control (e.g. differencing) or do not offer the efficiency and reliability that are required for every-day use. In this thesis, I present a first comprehensive approach to state-based change control of XML documents.

In this chapter, I present a framework that enables users and applications to control the evolution of XML documents using a context-oriented approach. I start by describing the basic idea of the context orientation. Afterwards, I sketch the architecture of the whole framework. Finally, I describe some aspects of the implementation of the framework. The different components of the framework are described in the following chapters.

5.1 Basic Idea

The design of a change control architecture depends on the intended use case. Here, I describe the collaboration model which I have in mind during the design of my framework. Based on this model, I decide to use a context-oriented merge approach, a decision I justify afterwards. Finally, I briefly describe the representation of the syntactic context in my approach.

5.1.1 Intended Collaboration Model

Collaboration across organizational borders usually implies the absence of a common collaborative system. Each participant may have his or her own collaborative system, with the possibility to grant the other participants certain access rights. However, an overarching collaborative system with a corresponding common concept for roles and privileges will most likely not exist. If the system of one participant is opened towards the other users, all other participants would have to abandon their own systems. Additionally, security issues and intra-organizational policies may prohibit granting foreign participants comprehensive access rights. These issues have led to the accepted practice of exchanging documents via e-mail or USB pen-drive. Apparently, this is a state-based collaboration model.

The exchange of document versions via e-mail is easy, but the version tracking is cumbersome. In contrast to closed collaborative systems, where the server knows about each access to a document, the highly distributed collaboration via e-mail cannot be tracked automatically. Therefore, organizational solutions are established by the participants. For example, the

changes have to be marked, or written up separately as a change wish list. The user who merges the document versions has to perform the merge manually, which is both time-consuming and error-prone.

5.1.2 Scope

My goal is to support this ad-hoc collaboration by a framework for merging document versions. Additionally, the framework shall be applicable to closed collaboration systems as well. Among all merging approaches presented in Chapter 4, state-based merging and the annotation of document versions are the only ones working outside a closed environment.

The annotation approach needs all changes to be tied to the document, which is inefficient in the context of the collaboration model as described before. Therefore, this approach is excluded as a solution. Among the state-based approaches, three-way differencing is a reliable solution for merging. However, it requires the nearest common ancestor of two document versions to be available. This cannot be ensured in the described scenario. First, it might be physically unavailable. Second, it might be unknown which version actually is the nearest common ancestor. Among the possible solutions, conventional differencing remains. Two-way differencing and patching have not been applicable in merge scenarios in their initial design. The merge capability has been achieved by storing the syntactic context of an edit operation, introduced by Davison[1990].

The freedom of the collaboration model is one of the reasons for the prevailing success of the diffutils1. The other one is the simplicity of their usage, together with the reliability of the merge result. The scope of my thesis is to develop a framework for XML differencing, patching, and merging that provides the same ease of use and reliability as the diffutils do.

5.1.3 Mapping the Line-Based World onto XML

The basic idea of my approach is to map the context orientation used in the line-based diffutils onto XML documents. In this model by Davison[1990], for each operation, the surrounding lines are stored within the delta. This allows the subsequent patch tool to perform a reliable merge taking this syntactic context into account (see Section 4.2.2). The main difficulty of representing context in XML is the two-dimensional structure of the tree. Where the line-based domain offers a “natural” representation of the context, a straightforward solution for XML does not exist. A first approach to mapping the syntactic context onto the domain of XML has been performed by Mouat[2002]. However, this model is partially inconsistent, which I will discuss in Section 6.5.4.

In my approach, the syntactic context of an edit operation is represented using a so-called context fingerprint. This fingerprint contains the hash values of the adjacent nodes in document

1The GNU implementation of the line-based diff and patch tool is called diffutils, too. I use this term to refer to the framework of differencing and patch tools, including its programming interface. If not indicated otherwise, this implies no restriction to the GNU implementation.

order. Due to the hashing, the fingerprints remain quite small. The document order is used to cover a wide range of XML document types. This is especially motivated by the document analysis which I have performed in Chapter 2. I will discuss this issue in more detail in Section 6.5.4, too. During a subsequent merge, the fingerprint is used to find the correct path of an edit operation. The merge is performed by the patch tool and can be controlled by several parameters. For example, sub-optimal results may be accepted in case that parts of the context do not match. Additionally, a conflict detection prevents the deletion or update of previously edited parts of the document. Syntactic context is commonly used for merging in different domains. As an example, it is a common approach in software source control [Mens 2002] or model merging [Lai 2009].

5.2 Architecture

The architecture of the framework is closely coupled to the provided capabilities and aims to provide a simple yet powerful access to the components of the framework. The framework itself is called XCC framework. It consists of a delta model, a differencing algorithm, and a patch algorithm with merge capabilities. These components are explained in the following pages. Afterwards, I will describe some implementation details, and briefly mention the tools provided by the framework. Finally, I will describe the features of the application interface.

5.2.1 Components

XCC Delta is the delta model that builds the heart of the XCC framework. In a two-way approach, no additional information apart from the delta can be used to decide upon an edit operation to merge. Therefore, a delta must contain all information concerning the changes performed. The expressiveness of the delta model directly relates to the (merge) capabilities of the patch algorithm. Generally speaking, the delta model is the common language of a change control framework. The differencing algorithm (XCC Diff) and the merge-capable patch algorithm (XCC Patch) are built around the delta model, providing the services of the framework. XCC Diff compares two document versions and represents the changes in a delta. The design of the differencing algorithm is deduced from the document properties and the expected modification patterns that have been elaborated in Section 2.4 and Section 2.5, respectively. The main focus of XCC Diff is to compute the changes efficiently in terms of time and space. Additionally, the computed deltas shall be comparably small and human-understandable. The latter requirement is difficult to evaluate. On the other hand, the comparison of the state-of-the-art in XML document differencing has shown that the number of edit operations may differ significantly (see Section 3.3). In my approach, I favor a smaller quantity of more comprehensive edit operations.


[Figure 5.1: layer diagram — Applications on top, XCC Diff and XCC Patch in the middle, XCC Delta at the bottom.]

Figure 5.1: The delta model builds the heart of the XCC framework. The diff and the patch algorithm provide the corresponding functionality. Applications may also directly access the delta layer.

XCC Patch applies the edit operations contained in a delta to an existing document. In a linear evolution model, the delta is only applied to the file the delta has been computed for, thus creating the expected document version. In a merge scenario, however, the delta is applied to another version of the document. Here, the edit operations may not be applicable. The behavior of XCC Patch can be controlled via different parameters. They control the degree of the acceptable merge quality, as well as the behavior in case of conflicts. Figure 5.1 shows the architecture of the XCC framework at a glance. The diff and the patch tool exchange information using the delta access layer. In this context, the diff tool creates deltas, which can be processed by the patch tool. Applications can interact with both tools as well, or with the delta access layer itself. The latter enables applications to create conforming XCC deltas themselves, or to merge an existing delta in their respective application context.
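The layering can be illustrated with a short Java sketch. All names (XccDiff, XccPatch, Delta, EditOperation) are hypothetical and only mirror the architecture described above; they are not the actual classes of the framework.

import org.w3c.dom.Document;
import java.util.Set;

// Illustrative sketch of the XCC layering: the delta model is the common language,
// the diff produces deltas, the patch consumes them.
interface EditOperation { /* type, path, former value, new value, context fingerprint */ }

interface Delta {
    Set<EditOperation> operations();   // set semantics: the order of operations is irrelevant
}

interface XccDiff {
    Delta compare(Document original, Document modified);
}

interface XccPatch {
    // returns the patched document; in a merge, some operations may be rejected
    Document apply(Document target, Delta delta);
}

Applications may either use the diff and patch interfaces or work directly on the delta layer, as described above.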

5.2.2 Implementation

The XCC framework has been implemented in an evolutionary process by different programmers. The first approach has been performed by Pauli [2008]. Here, only the patching aspects have been implemented using Java and the XOM object model2. This implementation has laid the basis of the evaluation of the merge quality [Rönnau et al. 2008]. Philipp [2008] has

2http://www.xom.nu

performed a re-engineering of the existing implementation to achieve a better runtime. Using a new implementation architecture based on Java and DOM, he was able to achieve an increase of efficiency by an order of magnitude. Due to its advantages, I will briefly describe this implementation in the following. Nodes are compared using their node value, which is the hash value of the normalized representation of the node (see Section 2.2.2). Nodes may be accessed several times during differencing and patching. To avoid a re-computation of the hash value at each access, the implementation uses a wrapper class around the DOM. The hash value is computed during the first access to the node. Afterwards, it remains in main memory, attached to the node. Therefore, a later access to the node does not increase the runtime. In case that the node is changed, the hash value is re-computed accordingly. One might ask why the nodes are not hashed during parsing at start-up. Two reasons motivate the presented approach. First, the differencing and patching algorithms may be initialized either by giving a path to a file, or by handing over a DOM tree. In the latter case, the document has already been parsed. A second parse run would need additional time, which directly leads to the second reason. During patching, only the parts of the document that are targeted by an edit operation need to be hashed. A comprehensive hashing does not need to be performed, and would only lead to a higher runtime. The implementation of the differencing algorithm has been performed by Philipp [2009]. The architecture of the implementation is closely related to the architecture of the framework itself and contains the differencing and the patching engine, as well as an XML access layer. The XML access layer relates to the XCC Delta layer in the framework architecture. It provides all features of node addressing, hashing, and fingerprint generation. The whole XCC framework is packed within one JAR archive. This way, an installation of different libraries is not necessary. As the basic libraries for XML handling and the DOM representation are part of the standard installation of the Java Runtime Environment, the XCC JAR is self-contained and works out of the box.
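The lazy hashing described above can be pictured as a thin cache around DOM nodes that computes a node's hash on first access and drops it when the node changes. The following sketch is illustrative only; the class name is an assumption, and the hash of the normalized node representation is reduced to a simple stand-in.

import org.w3c.dom.Node;
import java.util.IdentityHashMap;
import java.util.Map;

// Sketch of lazy, cached node hashing (hypothetical names, simplified hash function).
final class NodeHashCache {
    private final Map<Node, Long> cache = new IdentityHashMap<>();

    long hashOf(Node node) {
        // compute the hash on first access only; later accesses are cache hits
        return cache.computeIfAbsent(node, this::computeHash);
    }

    void invalidate(Node node) {
        // called when a node is changed, so the hash is re-computed on the next access
        cache.remove(node);
    }

    private long computeHash(Node node) {
        // stand-in for hashing the normalized node representation (see Section 2.2.2)
        String normalized = node.getNodeName() + "|" + String.valueOf(node.getTextContent());
        return normalized.hashCode();
    }
}

This also reflects the second argument above: during patching, only nodes actually touched by an edit operation ever enter the cache.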

5.2.3 Tools

The XCC framework provides a command-line tool, as well as a graphical user interface for an easy drag-and-drop access to the differencing and patching algorithms. The command-line tool runs in two modes, which are selected via a shell parameter. It can act either as diff or as patch tool. In both modes, the tool offers a variety of different options. I will not describe these options in detail at this point. They are described in the corresponding chapters later on. The command-line tool provides a user interface familiar to users of the conventional diffutils. Outside the domain of software development, users are not used to the command-line shell. Therefore, the XCC framework additionally provides a simple graphical interface that allows for differencing and patching documents using a drag-and-drop user interface. This interface offers the same options as the command-line tool. The interface for differencing and patching is the same. It provides three spots – two for XML documents and one for a delta. A user may drop files on two of them.


[Figure 5.2: screenshot of the graphical user interface — drop areas for the documents and for the delta, a panel displaying the selected options and the results of a diff or patch run, and a log of the differencing and patch runs.]

Figure 5.2: XCC provides a graphical user interface that allows for differencing and merging documents using drag-and-drop.

If two documents are given, the corresponding delta is computed. If in turn a delta and a document are given, a patch is performed. Figure 5.2 shows a screenshot of the GUI after a patch run. Here, a merge has been performed. Not all edit operations of the delta have been applied. I will discuss the meaning of such a merge result in Section 8.5 and Section 9.4.

5.2.4 Application Interface

Change control of single documents on a small scale can be ensured by the proposed tools. On a large scale, however, more sophisticated applications are needed to ensure an efficient and user-friendly change control. This may incorporate version control systems, auto-versioning repositories, and many more. To allow those applications to use the XCC framework, an application programming interface (API) is provided. The framework provides the following functionality (in alphabetical order):

• Addressing of nodes using the simplified addressing scheme defined in Section 2.2.5

• Computing of context fingerprints for single nodes and operations

• Differencing document versions


• Hashing of single nodes and complete subtrees

• Merging of documents with the possibility to treat each edit operation individually.

• Patching of documents

The applications may pass an XML document to the framework either as a path to a file or as an already parsed object model. The latter possibility is important to directly modify a document within the application context, without being forced to serialize the document to its file representation. Different applications already use the XCC framework. I will present them in Section 11.3.
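The two ways of handing a document to the framework might be used as in the following sketch. The facade and its method names are hypothetical illustrations of the description above, not the framework's real API; only the DOM parsing uses the standard Java library.

import org.w3c.dom.Document;
import javax.xml.parsers.DocumentBuilderFactory;
import java.io.File;

// Hypothetical facade illustrating the two input variants (file path vs. parsed DOM).
interface XccFacade {
    Object diff(String originalPath, String modifiedPath);   // variant 1: paths to files
    Object diff(Document original, Document modified);       // variant 2: already parsed DOM trees
}

class ApiSketch {
    static Document parse(String path) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new File(path));
    }

    static void use(XccFacade xcc) throws Exception {
        // applications with an in-memory DOM avoid serializing the document to a file
        Object deltaFromFiles = xcc.diff("report-v1.xml", "report-v2.xml");
        Object deltaFromDom   = xcc.diff(parse("report-v1.xml"), parse("report-v2.xml"));
    }
}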


6 A Context-Oriented Delta Model

A key prerequisite to a delta-based change management of documents is a consistent delta model. The most important aspect of a delta model is the definition of the change representation between document versions, which must be able to reconstruct subsequent document versions. Prior work on XML versioning mainly focused on the question of computing the delta. The question of the delta model itself, mainly in terms of merge-ability and reliability, has mostly been disregarded so far. In this chapter, I present a context-oriented delta model for XML documents. It is based on the findings gathered from the analysis of XML documents performed in Section 2.4. The delta model shall cover a wide range of applications, not being bound to a specific XML dialect. The delta is defined to carry enough information to allow for reliably merging document versions, without being verbose. Human-readability of the output is a second design principle. The user should be able to understand the meaning of an edit operation without tool support. This is especially important in loosely coupled collaborative environments where the availability of tools cannot be guaranteed at all times. The remainder of this chapter is structured as follows: First, I give a formal definition of diff, patch, and deltas. Afterwards, I describe the edit operations used in my delta model and motivate them. I introduce different dependencies that can arise between edit operations, followed by a discussion of the implications deriving from my definition of the delta. After that, I introduce my notion of syntactic context. Finally, the aggregation and inversion of deltas are shown.

6.1 Definitions

Up to now, I have used a non-formal description of differencing, merging, and deltas. At this point, I introduce a more rigorous formal model. From this model, criteria for the correctness of differencing and merging tools arise.

6.1.1 Equality of Documents First, the equality of document versions has to be defined. Loosely speaking, two documents are equal if their structure is the same and all corresponding nodes have the same node value.

A = A₁, if and only if
(∀n ∈ A : (∃n₁ ∈ A₁ : n = n₁ ∧ path(n) = path(n₁))) ∧
(∀n ∈ A₁ : (∃n₁ ∈ A : n = n₁ ∧ path(n) = path(n₁)))   (6.1)


As this definition relies on the definition of equality for node values (see Section 2.2.2), the equality of documents is decided upon the normalized representation of the nodes. This ensures robustness against different encoding formats, name-space representations, and attribute orderings.

6.1.2 Diff and Patch

Two versions of a document are compared by a diff function, which computes the differences and creates a delta δ containing the edit operations to construct one document version from the other one. The basic set of all edit operations is denoted as ∆, with δ ⊂ ∆; the set of all XML documents is denoted as 𝒜. The signature of the diff function is as follows:

diff : 𝒜 × 𝒜 → ∆   (6.2)

In the following, the term delta is used synonymously for a δ describing the differences between two document versions. The counterpart function of diff is called patch. It constructs a document version out of the other one by applying the edit operations of a delta:

patch : 𝒜 × ∆ → 𝒜   (6.3)

Thus, given a correct diff and patch algorithm and a document A ∈ 𝒜, the following equations apply:

diff(A, A) = ∅   (6.4)

patch(A, ∅) = A   (6.5)

Obviously, a diff must not detect a change between two equal documents. Additionally, patching a document with an empty delta must not affect the document. Conversely, for two differing documents, the delta must contain all information needed to describe the differences. Therefore, for two document versions A, A₁ ∈ 𝒜 with A ≠ A₁, the following equations apply:

diff(A, A₁) = δ₁, with δ₁ ≠ ∅   (6.6)

patch(A, δ₁) = A₁   (6.7)

This ensures that patch is indeed the counterpart function of diff.
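Equations (6.4)–(6.7) can be read as properties that any concrete diff/patch pair must satisfy. The following Java sketch states them as assertions; Doc and Op are placeholders, the diff and patch arguments stand for hypothetical implementations under test, and equals() stands in for the document equality of equation (6.1).

import java.util.Set;
import java.util.function.BiFunction;

// Property sketch for equations (6.4)-(6.7); illustrative only.
class DiffPatchProperties<Doc, Op> {
    void check(BiFunction<Doc, Doc, Set<Op>> diff,
               BiFunction<Doc, Set<Op>, Doc> patch,
               Doc a, Doc a1) {
        assert diff.apply(a, a).isEmpty();               // (6.4) equal versions yield an empty delta
        assert patch.apply(a, Set.<Op>of()).equals(a);   // (6.5) an empty delta leaves the document unchanged
        Set<Op> delta = diff.apply(a, a1);
        assert a.equals(a1) || !delta.isEmpty();         // (6.6) differing versions yield a non-empty delta
        assert patch.apply(a, delta).equals(a1);         // (6.7) patch reconstructs the modified version
    }
}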

6.1.3 Edit Operations

A non-empty delta consists of edit operations, conforming to the set property. Thus, the ordering of the operations is irrelevant. An operation must occur only once within a delta. This definition differs significantly from the definition of edit scripts, even if the terms delta and edit script have been used almost synonymously up to now. I will discuss the impact resulting from the different definitions and my motivation for introducing a novel definition in Section 6.4. Formally, an edit operation is denoted by a tuple:

op ∈ ∆ : (type, path, v, v′)   (6.8)


The first item refers to the type of the edit operation, which can be insert, delete, or update. The properties of the different operation types will be described in Section 6.2, including the handling of moves. The path refers to the point where the edit operation shall be applied. I will discuss the addressing scheme in Section 6.4. The entry v contains the former value replaced by the edit operation; the new value is represented by v′. The representation of the values depends on the operation type and will be described in Section 6.2. By storing both v and v′, the edit operations are complete in terms of Marian et al. [2001], thus allowing for reverting changes.
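The tuple of equation (6.8) maps naturally onto a small value class. The sketch below is illustrative only; in particular, paths and values are simplified to strings, and the context fingerprint introduced later in this chapter is omitted.

// Sketch of the edit operation tuple (type, path, v, v'); names and types are assumptions.
enum OpType { INSERT, DELETE, UPDATE }

record EditOp(OpType type,
              String path,      // address in the original document, e.g. "/0/1"
              String oldValue,  // v: the former value (empty for inserts)
              String newValue)  // v': the new value (empty for deletes)
{
    static EditOp update(String path, String before, String after) {
        return new EditOp(OpType.UPDATE, path, before, after);
    }
}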

6.1.4 Tree Sequences

A tree sequence is a list of adjacent subtrees rooted on the same hierarchy level. All root nodes of a tree sequence share the same parent node. The first root node is called head, the remaining nodes tail. The head of a single node is the node itself, the tail is empty. A tree sequence can be regarded as an ordered forest in terms of graph theory.

6.1.5 Atomicity

Each edit operation is atomic. An edit operation is applied completely or rejected. In this context, a complete application means that the new value v′ is imposed on the document to patch. A partial application of an operation, i.e., imposing only a subset of v′, must not occur. Note that this definition does not imply a statement on the handling of v. I will discuss this issue in the context of conflicting edit operations in Section 6.3.3.

6.1.6 Uniqueness

Each edit operation within the same delta will be performed at most once. An edit operation must address exactly one path; it must not address multiple paths. This definition distinguishes the delta model from document transformation models like XSLT [Clark 1999] or XUpdate [Laux and Martin 2000]. In these models, an operation may be applicable to multiple nodes.

6.1.7 Symmetry

Each delta is invertible. This requires that for each edit operation op, a counterpart operation op⁻¹ exists.

∀op ∈ δ : (∃op⁻¹ ∈ ∆ : patch(A, δ ∪ {op⁻¹}) = patch(A, δ \ {op}))   (6.9)

An inverted delta allows for the reconstruction of a former document version. The inversion of a delta requires symmetrical edit operations.


6.2 Edit Operations

Edit operations reflect the expressiveness of a delta model. In this section, I describe the granularity used in my delta model, before describing the different edit operation types. Finally, I discuss possible other operation types and granularity models.

6.2.1 Granularity of Changes

My delta model supports inserts, deletes, and moves with a granularity of subtrees or tree sequences, respectively. Updates address single nodes anywhere within the tree. The definition of the single operation types emerges directly from the assumptions on the modification patterns of documents which have been elaborated in Section 2.5. I will repeat the five assumptions that affect the granularity of changes:

Assumption 2.1: A leaf node may be replaced by a subtree.

Assumption 2.2: Adjacent subtrees may be inserted or deleted.

Assumption 2.3: Structure-preserving changes (attribute changes) are a frequent operation.

Assumption 2.4: Structure-preserving changes (text changes) are a frequent operation.

Assumption 2.5: Subtrees are often re-arranged.

The delta model meets the first assumption by allowing for the insertion and deletion of a subtree, where a leaf node is regarded as a subtree consisting only of its root element. The second assumption is met by the ability to express the insertion and deletion of tree sequences. Update operations represent both attribute and text changes, thus meeting the third and the fourth assumption. The re-arrangement of subtrees mentioned in the last assumption is expressed by the move operation. In general, my delta model resembles the delta model by Cobéna [2003]. However, his delta model is not able to address tree sequences.

6.2.2 Insert

An insert operation addresses a subtree or a tree sequence. The path of an insert operation points to the path where the root of the inserted subtree will be placed after patching. An already existing node at this address and all of its following siblings will be shifted to the right. A subtree designated to be inserted as the last element of a subtree also refers to the path where it will be placed, even if this path does not yet exist in the original document. In an insert operation, v is empty, whereas v′ contains the subtree or tree sequence to insert. The counterpart operation of insert is delete.


6.2.3 Delete

The delete operation is the counterpart operation of an insert. It addresses a subtree or a tree sequence. It points to the path where the root of the subtree to delete is placed before patching. In a delete operation, v contains the subtree or tree sequence to delete, whereas v′ remains empty. As a delete operation may address single subtrees as well as tree sequences, the question arises how a patch procedure knows which nodes are to be deleted. They can be determined by taking v into account: as v contains the part of the document to be deleted, the nodes to delete directly emerge from it.

6.2.4 Move Moving subtrees or tree sequences is more difficult than inserting or deleting them. This results from the fact that this operation points to two paths: the first for the source of the move, the second for the target. To avoid an extended notation, I represent a move operation by linking a delete operation with a congruent insert operation.

Definition 6.1 A delete operation op₁ and an insert operation op₂ are congruent, if and only if v_{op₁} = v′_{op₂}.

This definition ensures that the targeted subtrees or tree sequences are identical. Both operations receive an attribute containing a common unique identifier1. As a move is atomic, too, the patch application has to ensure that the linked edit operations are either both applied or both rejected. The counterpart operation of a move is a move again, where both operations have been inverted.
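The pairing of a delete and a congruent insert can be sketched as follows. All names are illustrative; only the use of the system time in nanoseconds as identifier follows the footnote below.

// Sketch: a move is modelled as a delete and an insert that are congruent
// (Definition 6.1: v of the delete equals v' of the insert) and share one identifier.
record LinkedOp(String type, String path, String oldValue, String newValue, long moveId) { }

class MoveSketch {
    static LinkedOp[] move(String fromPath, String toPath, String subtree) {
        long id = System.nanoTime();  // footnote 1: system time in nanoseconds as identifier
        LinkedOp delete = new LinkedOp("delete", fromPath, subtree, "", id);
        LinkedOp insert = new LinkedOp("insert", toPath, "", subtree, id);
        return new LinkedOp[] { delete, insert };
    }

    // Definition 6.1: the deleted content equals the inserted content
    static boolean congruent(LinkedOp delete, LinkedOp insert) {
        return delete.oldValue().equals(insert.newValue());
    }
}

A patch procedure would look up both halves via the shared identifier and apply or reject them together, reflecting the atomicity of the move.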

6.2.5 Update

In contrast to the other edit operations, an update operation may address a single node arbitrarily within a tree. The path points directly to the addressed node. The former value of the node is stored in v, the new value in v′. The counterpart operation of update is an update again. The main intention of an update is to represent attribute changes in element nodes, or content changes in text nodes. An update is a structure-preserving edit operation.

6.2.6 Discussion

Insert, delete, and update are the basic operations in tree editing. As a re-arrangement of subtrees is likely, I have introduced the move operation, as, e.g., Chawathe et al. [1995] do. However, other operation types would be possible as well.

1In the current implementation, the uniqueness is ensured by using the system time in nanoseconds. I assume the probability of collisions to be negligible.


Barnard et al. [1995] have defined a swapping operation, exchanging two subtrees with each other without changing them. This is a special case of re-arranging subtrees that can be expressed with a move operation, too. Chawathe and Garcia-Molina [1997] have introduced the copy operation that expresses the multiplication of a subtree across the document. As their delta model demands symmetrical edit operations, too, they have introduced a counterpart operation to copy, called glue. A glue represents the collapsing of two identical subtrees into one. The motivation for the copy operation may be apparent, but the glue operation does not reflect a “natural” editing action in a document. Additionally, copy and glue demand a more complicated notation, which in my opinion hinders the human-readability of these operations. For these reasons, my model does not support copy and glue operations. Several new edit operations have been introduced by Iorio et al. [2009] that aim to reflect the natural editing process on documents. They have defined a split operation that breaks a text node into two nodes. The counterpart operation is join. Once again, the readability of these operations is questionable. They have also defined an upgrade and a downgrade operation. These operations correspond to the node-centric insert and delete operations defined by Tai [1979]. The arguments for using a subtree-oriented granularity of insert, delete, and move are apparent. However, there are also reasons for supporting the node-centric granularity by Tai [1979]. For example, a paragraph may be turned into a subparagraph by adding a parent node within the tree. This would correspond to a node insert [Tai 1979] or a downgrade operation [Iorio et al. 2009]. In my model, this change would be represented by a move or by a pair of an insert and a delete operation, which is obviously less meaningful. On the other hand, this example does not justify switching to a node-centric granularity, which would lead to far more edit operations within a delta, precluding human-readability. The Microsoft XML Diff tool presented in Section 3.3.2 uses a hybrid model that allows for subtree-oriented insert and delete operations, as well as for node-centric operations. This way, the benefits of both approaches are combined. However, a node-centric and a subtree-oriented operation cannot be represented within one operation type in my delta model. This results from the merge capability that is not given in the Microsoft XML Diff tool, as the following example shows. Given a single node to delete within the tree that has child nodes, the patch algorithm cannot decide whether the children have been there before, or whether the delete conflicts (see Section 6.3.3). To avoid this hazard, my edit model would need additional insert and delete operations that are only defined on single nodes. As this would complicate the edit model significantly, I decided not to support node-centric insert and delete operations.

6.3 Dependencies

Edit operations may change the structure of the tree, thus affecting the paths to existing nodes. In this case, the nodes and the corresponding edit operation interfere. A delta may contain an arbitrary number of edit operations. However, they must not address the same nodes, in order to avoid overlapping edit operations. Finally, conflicts can arise during patching. In the following,

these notions are defined. The section ends with an example showing the implications of the definitions.

6.3.1 Interfering Edit Operations Insert, delete, and move operations change the structure of the document. An edit operation performed on a previous part of the document may affect the paths to all subsequent nodes.

Definition 6.2 An edit operation op ∈ δ interferes with a node i ∈ A, if and only if path_op < path(i) and the application of op affects the path to i.

An edit operation may only interfere with nodes located in a subtree that shares a common parent node. Update operations are structure-preserving. Therefore, they cannot interfere with other nodes.

Definition 6.3 For an edit operation op ∈ δ that interferes with a node i ∈ A, offset(op,i) denotes the difference that, if added to the path to i, would result in the right path.

Note that the offset can also become negative in case of deletions.
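Definition 6.3 can be illustrated with a small helper that shifts the last step of one of the simple child-index paths used in the examples of this chapter. The sketch is illustrative; it is not the addressing code of the framework.

// Sketch: apply an offset to the last step of a slash-separated child-index path
// such as "/0/1" (Definition 6.3); negative offsets occur for deletions.
class PathOffsetSketch {
    static String shiftLastStep(String path, int offset) {
        int cut = path.lastIndexOf('/');
        int index = Integer.parseInt(path.substring(cut + 1));
        return path.substring(0, cut + 1) + (index + offset);
    }
}

For the insert at /0/1 in Figure 6.1(b) below, the interfered node d is afterwards found at shiftLastStep("/0/1", +1) = "/0/2".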

6.3.2 Non-Overlapping Edit Operations Edit operations in a delta must not address the same nodes. They must describe non-overlapping parts of the documents.

Definition 6.4 Two edit operations op₁, op₂ ∈ δ with op₁ ≠ op₂ are non-overlapping, if and only if v_{op₁} ∩ v_{op₂} = ∅ ∧ v′_{op₁} ∩ v_{op₂} = ∅ ∧ v_{op₁} ∩ v′_{op₂} = ∅.

A delta must not contain overlapping changes:

Definition 6.5 ∀op ∈ δ : (∀op₁ ∈ δ \ {op} : op and op₁ are non-overlapping).

These definitions ensure that the order of the edit operations within a delta is irrelevant, which follows directly from the set property of the delta. Non-overlapping edit operations are therefore commutative and can be applied in any order.

6.3.3 Conflicts

A conflict arises if, for an edit operation op ∈ δ, v_op does not match the corresponding pattern found in the document to patch. As v is empty for insert operations, an insert cannot conflict. Conflicts strongly resemble overlapping changes. However, conflicts arise from changes performed outside the same delta. As conflicts only occur during patching, I will discuss the identification and the handling of conflicts in the context of the patch algorithm in Section 8.4.


[Figure 6.1: four tree diagrams — (a) Original, (b) Node i is inserted at /0/1, (c) Node h is updated to j, (d) Subtree /2 is deleted.]

Figure 6.1: The insertion of i interferes with node d, but not with the other subsequent nodes, as they do not share the same parent node. The deletion in 6.1(d) overlaps with the update of 6.1(c), which is not allowed in my delta model.

6.3.4 Examples

Figure 6.1 shows an example of an interfering edit operation as well as of overlapping edit operations. A document tree 6.1(a) is first changed by inserting node i at path /0/1 in 6.1(b). This insert interferes with node d located at /0/1, which is pushed to the path /0/2. As no other subsequent node shares the same parent node, this is the only interfering node for this edit operation. In a second step, node h at /2/1 is updated to j in 6.1(c). The third edit operation in 6.1(d) deletes the subtree rooted at /2. This edit operation would overlap with the previous update operation. Therefore, the edit operations in 6.1(c) and 6.1(d) cannot be represented within one delta. If both edit operations were represented in different deltas and applied subsequently, this would result in a conflict.

6.4 Set-Based Delta vs. Edit Script

Usually, the output of a diff algorithm is stored as an edit script. This edit script is used to reconstruct a document version sequentially. Before discrete patch tools were used, edit scripts were applied by the editing application itself. For example, the classical editor ed uses a command language for processing text that has been used for the application of edit scripts [MacKenzie et al. 2002].


Within an edit script, all operations are addressed assuming that the previous operations have already been applied. This design is appropriate when one can ensure that the edit script is only applied to the document version it was computed for. However, I expect my delta to be applied on a document version different from the version the delta was created upon. This has two major implications. First, the paths of the edit operations may point to wrong nodes, resulting from offsets caused by foreign changes, which corresponds to interfering edit operations. Second, some edit operations of the delta may not be applicable, in case the correct path cannot be found by the patch algorithm. In both cases, the relative offsets between the edit operations may be broken, requiring a re-computation of the paths of the edit operations. In my approach, all edit operations are addressed with respect to the original version of the document upon which the delta was created. This design is directly derived from the set property of the delta. This way, rejected edit operations from the delta cannot affect the addressing of subsequent edit operations. The set-based delta has a second major advantage that appears during interactive merging. The user may decide upon ambiguous edit operations in an arbitrary order. This enables the user to navigate through the whole document to estimate the impact of an applied edit operation. I will present a corresponding interactive editor in Section 11.3.1.

6.5 Representing Context in XML

Line-based text documents can be regarded as a one-dimensional entity, i.e., a sequence of lines. The definition of the syntactic context of an edit operation on this kind of documents by Davison [1990] is straightforward: take the surrounding lines and store them within the delta. This definition was made with source code in mind, where code lines are usually limited to 80 characters2. Mapping the syntactic context onto XML documents raises two questions. First, how to represent the context within a tree, as it is a two-dimensional entity. Second, how to deal with XML nodes that can have an arbitrary length. In my approach, I decided to use the document order for the representation of the context, together with the node value to normalize the node lengths (and encodings). This results in the definition of the context fingerprint, which shows a well-defined behavior at the borders of the document, where no context nodes exist. Afterwards, I present two example fingerprints. I discuss my definition in relation to the context notion of Mouat [2002] at the end of this section.

6.5.1 The Context Fingerprint

The context of an edit operation is stored in a so-called fingerprint, which is a sequence of the values of all nodes within a given radius r (r ≥ 0) around the edit operation in document order. Before defining the fingerprint, I introduce A_fingerprint, which denotes the document A where

2In fact, this limitation is not a technical one, but most coding guidelines limit the line length.

every node of a delete operation is removed, except its head:

A_fingerprint ≝ A \ tail(v)   (6.10)

This definition ensures that the fingerprint does not refer to the parts to remove within a delete operation. The fingerprint of a node i is a sequence, ordered by the distance relating to i:

fingerprint_i[r] ≝ {value(j)}, where j ∈ A_fingerprint ∧ 0 < |dist(i, j)| ≤ r   (6.11)

The element of the fingerprint with distance d relating to i is denoted as fingerprint_i[d]. The value of fingerprint_i[0] is referred to as the anchor. The value of the anchor depends on the type of the edit operation:

fingerprint_i[0] ≝ value(head(v′)) for insert operations, and value(head(v)) otherwise.   (6.12)

For a delete operation, the anchor contains the head of the subtree or tree sequence to delete. The anchor of an update operation is the node value before the update. During a patch, the anchor can be used to detect conflicting updates as well as for finding the right offset for a delete operation. For insert operations, however, no former value v is given. Using the head of v′ avoids an empty anchor for insert operations and allows for a simpler inversion of these operations (see Section 6.6.1). The fingerprint has a size of n = 2r + 1, with the anchor node in the middle of the sequence. According to the definition of the fingerprint as a sequence, hash values may occur multiple times. This is important to represent repeating node values, which are common in XML documents. The fingerprint basically does not contain any information about the path of the edit operation within the document tree. Neither does the type of the edit operation emerge from the fingerprint. Additionally, a fingerprint can be computed not only for edit operations: basically, each node of an XML tree has a fingerprint. In this case, however, A_fingerprint does not need to be computed – using A is sufficient.

6.5.2 Fingerprinting Document Borders

In case there are less than the required number of nodes available to fill the fingerprint, a special “null value” is used to represent the document borders. Appendix B describes the computation of the null value.

6.5.3 Examples Two examples shall illustrate the computation of the context fingerprint. Figure 6.2 shows a simple tree.


[Figure 6.2: two tree diagrams — (a) Node i shall be inserted at /1 with its tail, (b) The subtree at /2 shall be deleted.]

Figure 6.2: The anchor of an insert operation contains the head of the tree sequence to insert. For a delete operation, the subtree is not represented in the fingerprint. At the end of the document, a null value is used to fill the fingerprint.

First, a node sequence shall be inserted at path /1 in 6.2(a). The anchor is highlighted in gray and contains the head i of the tree sequence3. The context nodes are taken from the tree in document order, thus leading to the fingerprint c,d,i,e,f. This fingerprint has a radius of 2 and a size of 5. Note that the fingerprint itself does not show whether i is inserted as a single node or whether it has a subtree. This can only be gathered from v′ of the corresponding edit operation. The delete operation shown in 6.2(b) exemplifies the use of A_fingerprint and the behavior at the document borders. The subtree rooted at /2 shall be deleted. Therefore, its descendants have to be removed to compute A_fingerprint. The root of the subtree is stored as anchor of the fingerprint. As the document reaches its borders, null nodes have to be used to fill the fingerprint. On a later application of the delta, they would indicate that the edit operation shall be applied at the end of the document. The resulting fingerprint would be d,e,f,null,null.
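The computation just illustrated can be sketched in a few lines: collect the node values of A_fingerprint in document order, take the radius-r window around the anchor, and pad with the null value at the document borders. All names are illustrative, and the string "null" merely stands in for the computed null value of Appendix B.

import java.util.ArrayList;
import java.util.List;

// Sketch: build a context fingerprint of radius r around position anchorPos in a
// document-order list of node values; borders are padded with a null value.
class FingerprintSketch {
    static final String NULL_VALUE = "null";  // stand-in for the null value of Appendix B

    static List<String> fingerprint(List<String> documentOrderValues, int anchorPos, int r) {
        List<String> fp = new ArrayList<>(2 * r + 1);
        for (int d = -r; d <= r; d++) {
            int pos = anchorPos + d;
            if (pos < 0 || pos >= documentOrderValues.size()) {
                fp.add(NULL_VALUE);                    // beyond the document border
            } else {
                fp.add(documentOrderValues.get(pos));  // value of the node at distance d
            }
        }
        return fp;   // size 2r + 1, the anchor in the middle
    }
}

For the delete of Figure 6.2(b), A_fingerprint in document order is a,b,c,d,e,f with the anchor f at position 5; fingerprint(List.of("a","b","c","d","e","f"), 5, 2) yields d,e,f,null,null, as in the example above.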

6.5.4 Document-Order vs. Parent-Child Relationship

A first attempt to define a context-aware delta format for XML has already been made. Mouat [2002] has presented a delta format, called Delta Update Language (DUL). In DUL, an edit operation may contain the siblings of the node addressed by the edit operation, as well as the parent and children nodes. However, DUL raises some questions. First, the context shall be symmetric, i.e., the same number of nodes should be stored in either direction. My context fingerprints are required to be symmetric, too. However, the handling of the document borders is not defined in Mouat’s delta model. Additionally, a subtree basically contains far fewer nodes, but only nodes with the same parent node are considered to be

3To be precise, the anchor contains the value of i. This notation is used for a better readability throughout all examples on fingerprints.

siblings. The DUL specification does not cover the case where not enough siblings are present to fill the context. The same problem arises for leaf nodes. A tree is far larger in the direction of the leaves than towards the root. The question of how symmetry should be ensured in a parent-child relationship, especially for edit operations on leaf nodes, is not covered by Mouat’s delta model. A second issue affects the inclusion of child nodes for the deletion of subtrees. The DUL specification does not cover this special case, which has led to the definition of A_fingerprint in my delta model. Therefore, it seems that a delete operation considers the nodes to delete to be part of the operation context, which is obviously not meaningful. Admittedly, the mentioned weaknesses of the definition could be refined, thus creating a more precise and comprehensive specification of DUL. Nevertheless, I consider my approach to be more appropriate for the domain of XML documents, which is motivated by a simple example. Consider a document where the text of a paragraph is changed. Every paragraph will likely have a similar parent node, as I assume paragraphs to be typeset in a similar way. In that case, the parent-child relationship does not offer any helpful information for merging. Only the content of the surrounding paragraphs (covered by the fingerprint) provides a meaningful means of distinction.

6.6 Inversion, Aggregation, and Decomposition

The inversion of deltas is an important aspect of document change control. Shen and Sun [2002] have stressed the importance of inverted edit operations to be able to revert changes. In case that an edit operation cannot be applied, it might be appropriate to revert the other changes, too, even if they have already been applied. First, I show the inversion of single edit operations, before discussing the inversion of deltas as a whole. The section ends with a discussion on delta aggregation and decomposition. Note that I use a non-formal notation as a thorough formalization would require a significant overhead. A more formal notation has been proposed by Vion-Dury[2010].

6.6.1 Inverting Edit Operations Inverting single edit operations is rather simple. Here, the advantage of defining both the former value as well as the new value of a node becomes apparent. This allows for defining the inversion of edit operations according to Table 6.1. A move operation is inverted by inverting the corresponding insert and delete operation.

6.6.2 Inverting Deltas The inversion of whole deltas for version reconstruction is slightly more complicated. In this context, interfering edit operations have to be taken into account. The inverted delta is applied to a document version where the non-inverted edit operations have already been applied. The


Original edit operation                  Inverted edit operation
(insert, path, ∅, v′, fingerprint)       (delete, path, v′, ∅, fingerprint)
(delete, path, v, ∅, fingerprint)        (insert, path, ∅, v, fingerprint)
(update, path, v, v′, fingerprint)       (update, path, v′, v, fingerprint), where fingerprint[0] := value(v′)

Table 6.1: Delta inversion is basically done by swapping parameters. Update operations require modifying the fingerprint for inversion. A move is inverted by inverting the corresponding insert and delete operation.

The application of these edit operations may have interfered with nodes targeted by the inverted edit operations. Therefore, the path stored within an inverted edit operation may point to a wrong node. To avoid this hazard, all paths have to be recomputed, taking the offsets caused by the non-inverted edit operations into account.
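Table 6.1 translates into a straightforward inversion routine: swap v and v′, flip insert and delete, and for updates set the anchor of the fingerprint to the value of v′. The following sketch is illustrative; types and names are simplified assumptions.

import java.util.ArrayList;
import java.util.List;

// Sketch of edit-operation inversion following Table 6.1 (illustrative types and names).
class InversionSketch {
    enum Type { INSERT, DELETE, UPDATE }
    record Op(Type type, String path, String v, String vPrime, List<String> fingerprint) { }

    static Op invert(Op op) {
        switch (op.type()) {
            case INSERT:  // (insert, path, ∅, v', fp) -> (delete, path, v', ∅, fp)
                return new Op(Type.DELETE, op.path(), op.vPrime(), "", op.fingerprint());
            case DELETE:  // (delete, path, v, ∅, fp) -> (insert, path, ∅, v, fp)
                return new Op(Type.INSERT, op.path(), "", op.v(), op.fingerprint());
            default:      // update: swap v and v', the anchor becomes the value of v'
                List<String> fp = new ArrayList<>(op.fingerprint());
                fp.set(fp.size() / 2, op.vPrime());   // the anchor sits in the middle of the fingerprint
                return new Op(Type.UPDATE, op.path(), op.vPrime(), op.v(), fp);
        }
    }
}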

6.6.3 Aggregation and Decomposition of Deltas

Deltas that describe subsequent changes may be aggregated into a larger one. Here, two constraints must be considered. First, all edit operations of a delta are addressed relative to the document version it has been computed for. Consider two deltas δ₁, δ₂ that describe the evolution of a document A across the version A₁ into A₂ in the following way: A →(δ₁) A₁ →(δ₂) A₂. The paths of all edit operations in δ₂ have to be adapted with respect to A to build the aggregated delta δ₁,₂. Once again, this is done by respecting the offsets resulting from the application of δ₁. The second constraint affects non-overlapping changes. Basically, a node can be targeted by different edit operations from different deltas, thus leading to overlapping changes. To recover a single delta with non-overlapping changes, two overlapping edit operations have to be merged into one edit operation. Decomposing a delta means splitting a delta up into one or more deltas. A delta δ₁,₂ with more than one edit operation, describing A →(δ₁,₂) A₂, can be decomposed into two deltas δ₁ and δ₂ such that A →(δ₁) A₁ →(δ₂) A₂ holds. The decomposition can be regarded as a sort of counterpart operation to the aggregation. It is not an exact counterpart, however, as merged edit operations are not split up again. During the decomposition, interfering changes are an issue, too. A delta could be decomposed in a way that divides it into one delta with a single edit operation and a second delta containing the remaining edit operations. On the one hand, this computation is not that complex. On the other hand, a completely decomposed delta results in single deltas containing one edit operation each. This decomposition is nearly equivalent to the operational transformation approach presented in Section 4.3.


6.7 Conclusions

The ability to precisely describe changes between document versions is a key prerequisite to a reliable change control of documents. In this chapter, I have presented a concise state-based delta model for XML documents. It allows for describing insert, delete, move, and update operations within a delta. It is completely invertible, which enables the user to revert changes. Additionally, deltas can be aggregated and decomposed. With an aggregated delta, comprehensive changes can be consolidated into a more compact representation. Decomposition allows for a fine-grained depiction of the document evolution, resembling the edit model used in the operational transformation approach of collaborative editors. An important aspect of my delta model is the enrichment of edit operations with their syntactic context. I have introduced the context fingerprint as an efficient representation of the surrounding nodes around an edit operation. It enables a patch procedure to merge document versions using deltas. I present a corresponding algorithm in Chapter 8. The computation of a context delta using a differencing algorithm is described in the following chapter.

7 An Efficient Differencing Algorithm

A differencing algorithm computes a representation of the changes between two document versions. In this chapter, I present XCC Diff, an XML differencing algorithm based on the delta model presented in the last chapter. The focus of the algorithm design lies on a competitive time complexity, as well as on human-understandable deltas. In general, an understandable change representation is favored over a small delta. The basic idea of the algorithm is presented in the following section, including the outline of this chapter.

7.1 Basic Idea and Outline

The goal of the diff algorithm is to compute an intuitive representation of the changes between two document versions that should be human-understandable. I do not provide a metric for intuitiveness and human-readability, but a simple rule applies: in general, one larger edit operation is favored over several small edit operations in a cluster. This proposition is based on the findings of Neuwirth et al. [1992], who performed a user study. They have stressed that clustered changes are difficult to understand by humans. In general, I distinguish between structure-preserving changes, i.e., update operations, and structure-affecting changes, like insert operations. Structure-preserving changes are easier to identify, as they affect only single nodes. Structure-affecting changes address whole subtrees, which makes them harder to delimit by a diff algorithm. Therefore, the algorithm tries to reduce the complexity by working off the structure-preserving changes first. The algorithm works in three steps:

1. Find the common content by calculating the longest common subsequence of leaf nodes.
2. Identify the structure-preserving changes in the common parts.
3. Construct the structure-affecting changes for the differing parts.

During the diff run, matching nodes from both documents are identified and added to a list. Edit operations are created for the non-matched node pairs. At the end of the diff algorithm, each node is either part of the match list or the delta. The key idea of the algorithm is straightforward: the longest common subsequence of all leaf nodes is computed. The corresponding parent nodes are compared during a bottom-up pass using a dynamic programming approach. The leaf-orientation of the algorithm is chosen to catch the common content of the document versions. At this point, one should recall the assumption on the amount of modifications, which has been elaborated in Section 2.5.3:


Assumption 2.6: A new version of a document will likely have only a few changes compared to the size of the document.

Deriving from this assumption, I expect the length of the LCS to be close to the number of leaves. For these unchanged parts of the document, the corresponding ancestors have to be checked for updates, which are easier to identify, as they are structure-preserving changes. After the handling of the common parts of the documents, the structure-affecting edit operations have to be caught. Basically, this targets insert, delete, and move operations, which are created upon the leaf nodes that are not part of the common subsequence. In this step, the challenge is to identify non-overlapping edit operations without missing changes. Here, I call to mind the assumption on the distribution of the modifications within the document tree:

Assumption 2.7: Changes on a document affect only small parts of the document in general.

Deriving from this assumption, the algorithm tries to glue adjacent edit operations together into one edit operation containing a tree sequence. In the following, I describe these steps in more detail. In the end, I analyze the time and space complexity of the proposed algorithm.

7.2 Finding the Common Content

The content of an XML document is stored within the leaves of the tree. The diff algorithm has a content-centric view towards the changes in a document. Therefore, the algorithm starts by computing the longest common subsequence of the leaves of the two trees to compare.

7.2.1 Longest Common Subsequence of Leaf Nodes

In the first step, the node values of the leaves are transformed into one sequence for each tree, ordered by their respective document order. Basically, this step of the diff algorithm shall identify the unchanged parts of the document. Certainly, a comparison of leaf nodes does not allow for estimating the changes performed on higher levels of the tree. In this context, I recall the differentiation of structure-preserving and structure-affecting changes. Structure-preserving changes (represented by update operations) cannot be estimated by comparing the leaf nodes. However, leaf nodes allow for catching most structure-affecting changes by taking their depth into account. Therefore, the LCS respects the depth of the leaves to compare.

Definition 7.1 Two leaf nodes l₁ ∈ A₁, l₂ ∈ A₂ match, if and only if value(l₁) = value(l₂) ∧ depth(l₁) = depth(l₂).

Two leaf nodes with equal node value but differing depth do not match. This meets the case where a node is inserted into or deleted from the ancestors of the leaf, which would require an insert and a delete operation to represent the change using the delta model presented in the previous chapter. All matched node pairs are added to the match list.


[Figure 7.1: two tree diagrams — (a) Original, (b) Updated.]

Figure 7.1: Node h matches, although it is part of another subtree. Node e does not match, as its depth has changed. The resulting leaf LCS is c,d,g,h.

7.2.2 Reallocated Leaves

Not all structure-affecting changes are caught by the definition above. A leaf node that has been reallocated to another subtree on the same level will be considered matching, even if this is apparently not meaningful. This step of the diff algorithm strictly focuses on the leaf nodes. A misinterpreted match of leaf nodes will be corrected in a later step of the algorithm, presented in Section 7.3.3.

7.2.3 The Algorithm

Algorithm 1 shows a simplified version of this step of the differencing algorithm. An implementation of this algorithm should rely on an existing algorithm for the computation of the longest common subsequence, e.g., on the algorithm presented by Myers [1986]. A common LCS algorithm would compare two nodes only upon their value, thus neglecting their depth. Therefore, it is important to add the matching criterion to the LCS algorithm.

Algorithm 1 Calculate the longest common subsequence of leaf nodes.
for all (l₁, l₂), l₁ ∈ leaves(A₁), l₂ ∈ leaves(A₂) : l₁ matches l₂ do
  add (l₁, l₂) to matchList
end for
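To make the depth-aware matching concrete, the following Java sketch plugs the criterion of Definition 7.1 into a plain dynamic-programming LCS over the two leaf sequences. It is illustrative only: node values are plain strings instead of hashes, and a production implementation would rather use an efficient LCS algorithm such as the one by Myers [1986], as noted above.

import java.util.List;

// Sketch: depth-aware leaf matching (Definition 7.1) inside a textbook LCS.
class LeafLcsSketch {
    record Leaf(String value, int depth) { }

    static boolean matches(Leaf l1, Leaf l2) {
        // an equal value alone is not enough; the depth must be equal, too
        return l1.value().equals(l2.value()) && l1.depth() == l2.depth();
    }

    static int lcsLength(List<Leaf> a, List<Leaf> b) {
        int[][] len = new int[a.size() + 1][b.size() + 1];
        for (int i = 1; i <= a.size(); i++) {
            for (int j = 1; j <= b.size(); j++) {
                len[i][j] = matches(a.get(i - 1), b.get(j - 1))
                        ? len[i - 1][j - 1] + 1
                        : Math.max(len[i - 1][j], len[i][j - 1]);
            }
        }
        return len[a.size()][b.size()];
    }
}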

7.2.4 Example Figure 7.1 exemplifies the computation of the LCS of leaf nodes. The tree in 7.1(a) is changed by inserting a new leaf node i at path /1. Leaf node e is pushed a level down by inserting a new parent node j. Leaf node h is reallocated from its parent f to the newly inserted node k. Computing the LCS of the leaves between the trees shown in 7.1(a) and 7.1(b) would result in the sequence c,d,g,h. The newly inserted node i is obviously not part of the LCS. Even

if e keeps its relative position with respect to the other nodes, it is not part of the LCS, as its depth changed due to the insertion of j. Despite having another parent node, h is part of the LCS, as its value and depth did not change.

7.3 Identifying Structure-Preserving Changes

Updates are a structure-preserving edit operation. They may occur on parent elements of matched leaves. This step of the diff algorithm identifies updated parent nodes and detects reallocated leaves that must be removed from the list of leaf LCS. A dynamic programming approach is used to avoid unnecessary computation steps.

7.3.1 Update Operations

This step of the algorithm is performed for all elements of the leaf LCS. Nodes that are not part of the LCS have been inserted or deleted by definition. Therefore, they are part of a structure-affecting change that is not covered by an update operation. Thus, updates on leaf nodes are handled by a later step of the algorithm (see Section 7.4.2). To identify update operations, the tree is traversed bottom-up from each matched leaf of the LCS up to the root. This traversal is performed in parallel for both trees compared. The traversal is performed as long as the corresponding nodes match.

Definition 7.2 Two non-leaf nodes n₁ ∈ A₁, n₂ ∈ A₂ match, if and only if value(n₁) = value(n₂) ∧ ∃l₁ ∈ descendants(n₁) : (∃l₂ ∈ descendants(n₂) : (l₁ matches l₂)) ∧ depth(n₁) = depth(n₂).

In case the nodes themselves do not match but their descendants do, a corresponding update operation is added to the delta: (update, path(n₁), value(n₁), value(n₂)).

7.3.2 Optimizing the Tree Traversal

A tree significantly narrows towards the root. The greater the height of a node, the higher the probability that it is an ancestor of many different leaves. In a naive approach, all ancestors of all matched leaves would be compared. This approach has two severe drawbacks. First, many unnecessary computation steps would be performed, significantly affecting the runtime. Second, an update would be identified multiple times if the node has more than one leaf descendant. To avoid this, for each identified update operation, the delta would have to be checked to determine whether the update is already part of it. To avoid these drawbacks, the algorithm tags a node as soon as it is handled. For the first element of the leaf LCS, the tree is traversed up to the root. For the following elements, the tree is traversed until the first already visited node is reached. This dynamic programming approach ensures that each node is only handled once, thus significantly reducing the time complexity. Additionally, an update cannot be identified twice.
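The dynamic programming aspect essentially amounts to a visited set that terminates the bottom-up walk as soon as an already handled ancestor is reached. A minimal, DOM-based sketch with illustrative names follows; the comparison with the counterpart node is omitted.

import java.util.HashSet;
import java.util.Set;
import org.w3c.dom.Node;

// Sketch: bottom-up traversal from a matched leaf that stops at the first ancestor
// already handled by an earlier traversal.
class VisitedWalkSketch {
    private final Set<Node> visited = new HashSet<>();

    void walkUp(Node matchedLeaf) {
        for (Node a = matchedLeaf.getParentNode(); a != null; a = a.getParentNode()) {
            if (!visited.add(a)) {
                break;   // this ancestor (and thus every higher one) has been handled before
            }
            // here, a would be compared with its counterpart in the other tree,
            // emitting an update operation if only the node values differ
        }
    }
}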


7.3.3 Detecting Reallocated Leaves

As already described in Section 7.2.2, a pair of leaf nodes may be considered matching, even if they are parts of different subtrees. This case seldom occurred during the tests of the diff algorithm, but must be covered in order to guarantee a correct result. The reallocated leaves are detected during the tree traversal that is performed to identify structure-preserving changes. The detection is based on the following consideration: The bottom-up traversal will lead to an already matched node pair. A reallocated node, however, will have other siblings in the same subtree, and a different path towards the root. Therefore, an inner node will already be marked as visited, whereas its counterpart in the other tree has not been visited yet. As a result, the reallocated nodes are removed from the match list. They will be handled during the creation of structure-affecting changes in Section 7.4.

7.3.4 The Algorithm

The code of this step is shown in Algorithm 2. Each node pair is either appended to the match list or added to the delta as update operation. Reallocated leaves are removed from the match list. This way, they can be targeted by insert or delete operations in the next step of the algorithm.

Algorithm 2 Structure-preserving changes on the ancestors of the already matched parts are identified by the second step of the algorithm.
for all (l₁, l₂) ∈ matchList do
  for all a₁ ∈ ancestors(l₁), a₂ ∈ ancestors(l₂) do
    if (a₁, a₂) has not been visited then
      if a₁ does not match a₂ then
        add (update, path(a₁), value(a₁), value(a₂)) to δ_{A₁→A₂}
      else
        add (a₁, a₂) to matchList
      end if
      mark (a₁, a₂) as visited
    else if (a₁, a₂) have been visited then
      break
    else if (a₁, x₂) or (x₁, a₂), with x₁ ∈ A₁, x₁ ≠ a₁, x₂ ∈ A₂, x₂ ≠ a₂, have been visited then
      remove (l₁, l₂) from matchList
      break
    end if
  end for
end for


[Figure 7.2: two tree diagrams — (a) Original, (b) Updated.]

Figure 7.2: Computing the structure-preserving changes. The leaf LCS is c,d,g,h. Node /0 is detected as update and h as reallocated.

7.3.5 Example

The identification of structure-preserving changes is shown using the documents in Figure 7.2. The non-leaf node b is changed to i, the leaf e to j, and the leaf h is reallocated to the newly inserted node k. The first step of the diff algorithm computes c,d,g,h as longest common subsequence of leaf nodes. The subsequent bottom-up pass is started at /0/0, because c is part of the match list. As no node has been visited yet, all ancestors up to the root are visited. For the node at /0, the node values do not match. Therefore, a corresponding update operation is created and added to the delta. For node d, the bottom-up pass stops already at /0, as it has already been visited. The same procedure is performed for node g. Node h, however, has been reallocated. The bottom-up pass starts at /2/1 in the original document and at /3/0 in the modified document. In the original document, f is the first ancestor which is accessed. It is marked as visited from the previous bottom-up traversal for g. In the modified document, however, k is accessed, which has not been marked as the match of f. Therefore, h is identified as being reallocated, and removed from the match list created by the LCS run on the leaves.

7.4 Catching Structure-Affecting Changes

After having identified the common parts of the documents, including structure-preserving changes, the non-matched parts of the document have to be considered. These are basically the inserted or deleted parts of the document. As they have a subtree-based granularity, the corresponding nodes have to be joined into a subtree. However, an insert or delete on the leaf level may also represent an update operation, which has to be identified, too. Adjacent subtrees targeted by edit operations are glued together into one edit operation containing the whole tree sequence. Congruent insert and delete operations are combined into a move operation.


Figure 7.3: Computing structure-affecting changes on the trees (a) Original and (b) Updated. The leaf LCS is c, d, e, the update on /0 has been identified in a previous step. Node j is detected as inserted. The subtree rooted at /2 is represented as a delete and an insert operation.

7.4.1 Constructing Insert and Delete Operations

Basically, any non-matched node is part of an insert or delete operation. A node which is only present in the modified document is part of an insert operation, and any non-matched node in the original document is part of a delete operation. The construction of insert and delete operations starts at the leaves, too. Any leaf that is not part of the longest common subsequence of leaves is considered to be either inserted or deleted. This step of the algorithm starts at the first non-matched leaf. A bottom-up traversal is performed until a node is reached that has already been visited by the previous step of the algorithm. Clearly, this pass can only be performed on one of the trees, as the nodes to insert or delete are not part of the other tree. An already visited node indicates that at least one of its children has already been matched. Therefore, the structure-affecting edit operation is only created for the subtree rooted below the already matched node. If a subtree is only present in the modified document, an insert operation is added to the delta: (insert, path, ∅, subtree(path)). Otherwise, a delete operation is created: (delete, path, subtree(path), ∅). After the creation of an edit operation, all nodes from the targeted subtree have to be added to the match list. This ensures that an edit operation is not created multiple times for each leaf node that is part of the affected subtree. Figure 7.3 shows an example of the construction of insert and delete operations. After computing the LCS of the leaves and the detection of the structure-preserving changes, the algorithm starts at leaf j, which has been inserted, as it is only part of the updated document. Its first ancestor is i, which has already been visited (and handled as an update operation) by the previous step of the algorithm. Therefore, an insert operation is defined and added to the delta. The next non-matched leaf node is at position /2/0. The following bottom-up pass stops at the root, as the non-changed node f has not been visited before. The subtree rooted at f is added to the delta as an insert operation and as a delete operation, respectively.
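The following sketch illustrates the construction of insert operations under these rules; delete operations are handled symmetrically on the original tree. All types and helper fields are hypothetical simplifications, not the actual XCC classes.

import java.util.*;

// Sketch of constructing insert operations from non-matched leaves (hypothetical types).
final class InsertConstruction {

    static final class Node {
        final String path;
        final Node parent;
        Node(String path, Node parent) { this.path = path; this.parent = parent; }
    }

    static boolean coveredBy(Node node, Set<Node> handledRoots) {
        for (Node n = node; n != null; n = n.parent) {
            if (handledRoots.contains(n)) return true;   // already part of a created operation
        }
        return false;
    }

    static void buildInserts(List<Node> unmatchedLeaves,  // leaves of A2 not in the leaf LCS
                             Set<Node> visited,           // nodes matched in the previous step
                             List<String> delta) {
        Set<Node> handledRoots = new HashSet<>();
        for (Node leaf : unmatchedLeaves) {
            if (coveredBy(leaf, handledRoots)) continue;  // avoid duplicate operations per subtree
            Node subtreeRoot = leaf;
            // Climb until the parent has already been visited, i.e. has a matched counterpart.
            while (subtreeRoot.parent != null && !visited.contains(subtreeRoot.parent)) {
                subtreeRoot = subtreeRoot.parent;
            }
            delta.add("insert subtree at " + subtreeRoot.path);
            handledRoots.add(subtreeRoot);
        }
    }
}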


Figure 7.4: Example trees (a) Original and (b) Updated. The algorithm detects leaf update operations for the nodes at /0/1, /2/0, and /2/1. The update at /0 has been handled by the previous step.

7.4.2 Updates on Leaf Nodes

In the current state of the algorithm, each change on a leaf would be represented by an insert and a delete operation, as non-matched leaves are not part of the leaf LCS. At this point, I recall one assumption on the modification pattern (see Section 2.5.2):

Assumption 2.4: Structure-preserving changes (text changes) are a frequent operation.

Apparently, the representation by insert and delete operations does not reflect the nature of the change. Even though it is understandable for a human reader, a representation by update operations would be more intuitive. To achieve this, the context of insert and delete operations on the leaf level is taken into account. This step is performed in case the parent node of the corresponding leaf has already been matched or is part of the update list, respectively. The algorithm then checks whether the previous and the next already matched siblings are identical. In that case, an update operation is created on the leaf. Text nodes are often the only leaf of their parent node. Therefore, an update operation is also created in case the parent node is identical and no already matched sibling exists which could be analyzed. Figure 7.4 shows an example of the update identification on leaves. During the bottom-up pass for node /0/1, the algorithm checks the parent and the sibling node. The parent node has changed, but is already part of the update list resulting from the previous step of the algorithm. The sibling is identical. Therefore, an update operation is created. Updates are also created for the nodes at /2/0 and /2/1. Here, the parent node is identical and j has no already matched sibling which could be analyzed. When k is inspected, its sibling has already been matched.
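A sketch of the decision rule described above, reduced to the sibling comparison; the parameters are hypothetical placeholders for the matched-sibling lookups of the real implementation.

import java.util.Objects;

// Sketch of leaf-update detection based on already matched siblings (hypothetical representation).
final class LeafUpdateDetection {

    // Decides whether an unmatched leaf pair should be reported as an update.
    static boolean isLeafUpdate(String prevMatchedSibling1, String nextMatchedSibling1,
                                String prevMatchedSibling2, String nextMatchedSibling2,
                                boolean parentsCorrespond) {
        if (!parentsCorrespond) {
            return false;                       // parents neither matched nor updated together
        }
        // Only child: no matched sibling exists on either side, the parent decides alone.
        boolean noSiblings = prevMatchedSibling1 == null && nextMatchedSibling1 == null
                          && prevMatchedSibling2 == null && nextMatchedSibling2 == null;
        if (noSiblings) {
            return true;
        }
        // Otherwise both surrounding matched siblings must be identical.
        return Objects.equals(prevMatchedSibling1, prevMatchedSibling2)
            && Objects.equals(nextMatchedSibling1, nextMatchedSibling2);
    }
}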

7.4.3 Gluing Adjacent Operations

The delta model requires changes to be non-overlapping. This property is met by the differencing algorithm as presented. However, the resulting delta may contain edit operations that target adjacent nodes. In my opinion, this is less meaningful than having one operation containing the adjacent changes within a tree sequence. Therefore, adjacent edit operations should be merged into one by gluing the adjacent subtrees together. As soon as an already matched node is reached during the bottom-up pass, the algorithm checks whether the preceding sibling is affected by an insert or delete operation. In that case, the current operation is attached to the preceding one. The reason for considering only the previous sibling is simple. As the leaves are traversed in document order, the following sibling cannot have been handled by the algorithm yet, unless it has already been parsed in the step covering structure-preserving changes. In that case, however, the edit operations cannot be merged, as only edit operations of equal type can be glued together.
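A possible sketch of the gluing step, assuming a hypothetical map from subtree root paths to the edit operations already created for them:

import java.util.*;

// Sketch of gluing an edit operation onto an adjacent one of the same type (hypothetical types).
final class GlueAdjacent {

    static final class EditOp {
        final String type;                               // "insert" or "delete"
        final List<String> subtrees = new ArrayList<>();
        EditOp(String type, String firstSubtree) { this.type = type; subtrees.add(firstSubtree); }
    }

    // Either appends the new subtree to the operation rooted at the preceding sibling,
    // or creates a fresh operation (which the caller then registers in the map).
    static EditOp addOrGlue(Map<String, EditOp> opByRootPath,
                            String precedingSiblingPath,
                            String type, String subtree) {
        EditOp previous = opByRootPath.get(precedingSiblingPath);
        if (previous != null && previous.type.equals(type)) {
            previous.subtrees.add(subtree);              // glue: one op covers the tree sequence
            return previous;
        }
        return new EditOp(type, subtree);                // no adjacent op of equal type exists
    }
}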

7.4.4 Move Operations

Move operations consist of a delete operation and a congruent insert operation. Both edit operations are created independently. However, the algorithm keeps a hash map containing the hash values of the subtrees targeted by the already created edit operations. During the creation of a structure-affecting edit operation, a lookup in the hash map shows whether a congruent edit operation already exists. In that case, both edit operations are linked with an ID attribute. As the value of the ID attribute, the system time in nanoseconds is chosen. This time stamp is fast to compute and ensures a unique identification of the corresponding operations.
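The following sketch illustrates this lookup; the EditOp type and the hash field are assumptions, and System.nanoTime() mirrors the time-stamp idea described above.

import java.util.HashMap;
import java.util.Map;

// Sketch of linking congruent insert/delete operations into a move (hypothetical types).
final class MoveLinking {

    static final class EditOp {
        final boolean isInsert;
        final long subtreeHash;                  // recursively computed hash of the subtree
        Long moveId;                             // shared ID once linked as a move
        EditOp(boolean isInsert, long subtreeHash) {
            this.isInsert = isInsert;
            this.subtreeHash = subtreeHash;
        }
    }

    private final Map<Long, EditOp> byHash = new HashMap<>();

    // Registers an operation; if a congruent counterpart exists, both are linked.
    void register(EditOp op) {
        EditOp counterpart = byHash.get(op.subtreeHash);
        if (counterpart != null && counterpart.isInsert != op.isInsert) {
            long id = System.nanoTime();         // time stamp used as a (sufficiently) unique ID
            op.moveId = id;
            counterpart.moveId = id;
        } else {
            byHash.put(op.subtreeHash, op);
        }
    }
}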

7.4.5 The Algorithm

The third step of the differencing algorithm starts by identifying the insert operations. Each leaf that is part of the modified document but has no matching counterpart in the original document is considered as inserted. Before starting the bottom-up pass, the algorithm checks whether a leaf update has been performed. The corresponding code is shown in Algorithm 3. After handling the insert and update operations, the delete operations are identified. The code is shown in Algorithm 4. The code is nearly the same as in the bottom-up pass for insert operations. Basically, the identification of leaf updates could be performed during the handling of insert or delete operations as well. However, it must be performed during the run which starts first. Otherwise, the possible counterparts of updated nodes would already have been covered by the edit operations of the first run. Finally, the algorithm handles the move operations, as shown in Algorithm 5. Here, the hash table lookup is explicitly named. In general, hash tables are widely used in the implementation of the algorithm by Philipp [2009]. They are mostly used to efficiently compare nodes.

7.5 Complexity Analysis

In this section, I analyze the cost of comparing two document versions A1 and A2. This covers the time complexity as well as the space complexity, which are presented after elaborating the complexity of the three steps of the algorithm separately.


Algorithm 3 The third step of the algorithm starts by identifying update operations on leaves. Additionally, insert operations are identified.

for all l2 ∈ leaves(A2) : l2 ∉ matchList do
  {identification of updates on leaves}
  if (x1, parent(l2)) ∈ matchList or parent(l2) is part of an update operation on x1 then
    for all l1 ∈ children(x1) : (l1, x2) ∉ matchList and l1 is not part of an update operation do
      if previousMatchedSibling(l1) = previousMatchedSibling(l2) and nextMatchedSibling(l1) = nextMatchedSibling(l2) then
        add (update, path(l1), value(l1), value(l2)) to δA1→A2
        break
      end if
    end for
    if no update operation has been created then
      add (insert, path(l2), ∅, l2) to δA1→A2
    end if
  else
    {bottom-up pass including gluing}
    for all a2 ∈ ancestors(l2) do
      if (x1, parent(a2)) has already been visited then
        if previousSibling is root of an insert operation then
          append subtree(a2) to insert operation
        else
          add (insert, path(a2), ∅, subtree(a2)) to δA1→A2
        end if
      end if
    end for
  end if
end for

7.5.1 Computing the LCS

The computation of the longest common subsequence of leaf nodes is by far the most complex step of the algorithm, as I expect over 50% of the nodes to be leaves (see Section 2.4.2). The LCS algorithm has to be robust against large input alphabets. In the context of the differencing algorithm, the nodes are compared using their value. Therefore, the size of the input alphabet is bounded by the amount of possible hash values, which is a significant parameter. The current implementation of the differencing algorithm relies on the LCS algorithm presented by Myers [1986]. It is known for its broad applicability and computes a minimum edit script [Bergroth et al. 2000].


Algorithm 4 Delete operations are identified during a bottom-up pass.

for all l1 ∈ leaves(A1) : l1 ∉ matchList do
  for all a1 ∈ l1 ∪ ancestors(l1) do
    if (parent(a1), x2) has already been visited then
      if previousSibling is root of a delete operation then
        append subtree(a1) to delete operation
      else
        add (delete, path(a1), subtree(a1), ∅) to δA1→A2
      end if
    end if
  end for
end for

Algorithm 5 The third step of the algorithm ends by linking moved parts of the document using an ID attribute.

for all i ∈ δA1→A2 : i is an insert operation do
  if ∃d ∈ δA1→A2 : d is a delete operation and hash(i) = hash(d) then
    add move flag to i, d, with ID = systemNanoTime
  end if
end for

One of the main advantages of this LCS algorithm is its lack of assumptions about the sequences to be compared. The LCS algorithm runs in O(ND) time. Here, N denotes the total number of elements to compare, and D represents the number of edit operations. Apparently, this algorithm is optimized for a use case where N ≫ D. At this point, the assumption on the modification amount should be recalled, elaborated in Section 2.5.3:

Assumption 2.6: A new version of a document will likely have only few changes compared to the size of the document.

Therefore, the LCS algorithm is suitable for the intended use case, as it runs in almost linear time for similar documents. In a previous version of my algorithm, common prefixes and suffixes were explicitly removed to restrain the complexity of the LCS run. This idea was inspired by the thoughts of Fraser [2009]. However, the LCS algorithm by Myers [1986] already contains a prefix and suffix detection. Therefore, this step is omitted in the algorithm presented here. Experimental results reinforced this decision, showing an increase of runtime when performing an explicit prefix and suffix detection in the context of the diff algorithm itself. Mapping the complexity of the LCS algorithm to the diff algorithm yields a time complexity of O((|leaves(A1)| + |leaves(A2)|) × D). In this context, D denotes the size of the minimum edit script between both leaf sequences.


7.5.2 Identification of Structure-Preserving Changes

The second step of the algorithm is performed only for the leaves that have already been matched by the LCS run in the previous step. Here, each ancestor node is visited once due to the dynamic programming approach. Therefore, the bottom-up traversal for the detection of update operations runs in linear time, namely O(|ancestors(LCS)|).

7.5.3 Identification of Structure-Affecting Changes

Constructing the insert and delete operations requires a bottom-up traversal for each of the document versions. This step of the algorithm is performed only on the leaves that are not part of the leaf LCS computed in the first step. Here, each node is visited once, too. However, the leaves must also be traversed. Basically, this results in a linear time complexity, too. Each node that is not part of the longest common subsequence is represented in the minimum edit script which can be computed by the LCS algorithm. In the first step of the differencing algorithm, D was used to denote the size of the minimum edit script. Therefore, the amount of non-matching leaves corresponds to D. The resulting time complexity can be denoted as O(D + |ancestors(non-LCS)|). One might argue that the search for leaf updates increases the complexity of this step of the algorithm. The leaf update detection is only started if the parent node has already been visited, which means that a sibling of the leaf has already been matched. The document analysis has shown that on bottom-up level 1, the quantity of nodes reaches half of the quantity of leaves. From this, I deduce that most leaf nodes do not have many siblings. Therefore, the search for already matched siblings is less complex than the rest of the algorithm by an order of magnitude. Thus, the leaf update detection can be neglected in the complexity class. I will underline these considerations empirically in Section 9.3.1.

7.5.4 Overall Time Complexity

Adding the complexity bounds of all steps of the differencing algorithm yields O((|leaves(A1)| + |leaves(A2)|) × D + |ancestors(LCS)| + D + |ancestors(non-LCS)|). Both terms covering the bottom-up passes can easily be subsumed as |ancestors(A1)| + |ancestors(A2)|. Here, I use the notation A, indicating the sum of both trees A1, A2 to compare. This leads to the simplified expression O(|leaves(A)| × D + |ancestors(A)| + D).

7.5.5 Overall Space Complexity

The space complexity of the differencing algorithm is linear; I claim an upper bound of O(A). I justify this by the following considerations: the chosen LCS algorithm runs in linear space, yielding O(|leaves(A)|) [Myers 1986]. During the bottom-up pass, additional space is required to compare all nodes, thus leading to O(A).


Additionally, each node is stored either in the match list or in the delta. Therefore, O(A) has to be added to the complexity consideration again. However, all three steps are in the same complexity class that can be subsumed under O(A).

7.6 Conclusions

In this chapter, I have presented a novel XML differencing algorithm. The algorithm design is directly derived from the analysis of XML documents, which has been performed in Chapter 2. As the content of the documents is stored in the leaves, which represent over 50% of all nodes, the algorithm starts by computing the longest common subsequence of the leaves. Changes on higher levels of the tree are identified using a bottom-up approach. The changes between the document versions are represented using the delta model presented in Chapter 6. By relying on a well-established LCS algorithm and using dynamic programming techniques, the proposed algorithm achieves a highly competitive time and space complexity. The time complexity is O(|leaves(A)| × D + |ancestors(A)| + D), with D denoting the amount of changes. Therefore, the algorithm is especially suitable for documents with A ≫ D, which meets the expected average case. The space complexity is linear with O(A). This is an important aspect, as XML documents can become quite large. In conclusion, the proposed algorithm promises a highly efficient way of comparing XML files, using a change representation model especially suitable for XML documents.


8 A Merge-Capable Patch Algorithm

In Chapter 6, I have introduced a context-aware delta model. There, some aspects of merging document versions have already been discussed. This chapter deals with the details of delta application during the patching of a document. The hardest challenge is to reliably merge document versions. This requires the patch tool to be flexible enough to accept non-optimal results while still being strict enough to reject ambiguous results. Obviously, this is a borderline task. In this chapter, I present XCC Patch, a patching algorithm that is able to merge document versions using a context delta. The basic idea of the algorithm is sketched first. Afterwards, I present the aspects that conform to conventional XML patch tools. In the following, I explain the use of the context fingerprint for a reliable identification of the correct path of an edit operation, before presenting the conflict handling abilities of my algorithm. A weighting method for partially matching fingerprints allows for non-optimal yet reliable merge results. Finally, I present the algorithm at a glance, followed by a complexity analysis.

8.1 Basic Idea

Conventional XML patch tools patch a document exactly as described in the delta. This means that all edit operations are applied entirely at their respective paths in order of their appearance. Of course, this is the standard case in a sequential document evolution model. However, parallel editing of documents cannot be represented by this procedure. Due to the lack of a suitable delta model, other XML patch algorithms are not able to merge document versions using a delta. To allow for the merging of document versions, I adapt the basic idea from conventional patch tools for line-based documents to the domain of XML documents. In the line-based domain, the patch algorithm starts at the given path of an edit operation and uses the syntactic context stored in the delta to verify whether the context in the document to patch is still the same [Davison 1990]. If not, the patch algorithm tries other paths nearby to search for the correct context. Besides the context, the delta also contains the former value of a line to change. This former value is used to verify whether the line has been edited in the meantime, thus leading to a conflict. My patch algorithm uses the context fingerprint defined in Section 6.5.1 to search for the correct path of an edit operation and to detect conflicts as well. In contrast to one-dimensional line-based documents, XML trees are two-dimensional. Therefore, I will introduce a corresponding tree traversal algorithm to search for the correct path. During the merging of line-based documents, non-optimal results may be accepted.

The user may decide how many lines of the context have to match at minimum before an edit operation is applied [MacKenzie et al. 2002]. In this approach, each line of the context is given equal weight. This behavior was inspired by the needs of software source control. Source code is highly structured. The syntactic appearance, e.g., the indentation, is used to represent the structure. Most lines contain only little code; blank lines are widely used. In XML documents, however, this syntactic structure is not given. The leaves contain the content, which in turn should not contain structuring information, as the layout is represented by the tree structure [Goldfarb 1990]. The context fingerprint contains the surrounding nodes in document order, which means that it may contain leaves and inner nodes as well. Especially if changes are clustered, the fingerprints of different edit operations may overlap. If all entries of the fingerprint were weighted equally, the probability of accepting an edit operation would decrease significantly in that case, even if a human reader would accept it. To handle this issue, I introduce a distance-aware weighting function for the entries of the context fingerprint. An entry near the anchor, i.e., near the edit operation itself, is given a higher weight than an entry far away. This allows for a reliable merge without losing broad applicability.

8.2 Linear Patching

Patch algorithms for XML are seldom described in detail. This derives from the fact that they are only able to apply a delta to the document it has been computed for. If a delta is applied to a modified version of the document, all interfering edit operations will fail (see Section 6.3.1). If the path still exists, the patch will most probably apply the edit operation at the wrong path. If the path has become invalid, an error will be thrown. To prevent these failures, Microsoft XML Diff (presented in Section 3.3.2) stores a hash signature of the whole document to patch within the delta. It is used to verify whether the delta is applied to the correct document version. If not, the patch aborts with an error message. XCC Patch is able to act like a conventional XML patch tool, too. This is an important aspect in the sequential reconstruction of document versions, for example in the context of version control systems [Tichy 1985]. Each edit operation is applied as described in the delta. To be precise, the algorithm does not actually perform the edit operation at the time of processing. Instead, only a reference to the corresponding path is stored, and the edit operation is added to a list. Two reasons exist for this behavior. First, the delta is a set, so edit operations may be in arbitrary order. All paths are described with respect to the original version of the document. If an edit operation were applied immediately, other interfering edit operations of the delta would point to wrong paths, resulting in errors. This directly leads to the second reason for applying all edit operations at once. Ko and Lee [2006] have stressed the high cost of the re-labeling of XML paths. Each structure-affecting change would require the following nodes to be re-labeled in terms of their path. This re-labeling is avoided by applying all edit operations at the end of the algorithm, as the stored references are used to access the targets of the edit operations.
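A minimal sketch of this two-phase strategy, assuming a hypothetical EditOp interface; the actual XCC Patch implementation is more involved.

import java.util.ArrayList;
import java.util.List;
import org.w3c.dom.Node;

// Sketch of deferring edit operations to a single apply phase (hypothetical op type).
final class DeferredPatch {

    interface EditOp {
        Node resolveTarget();                     // resolves the path against the original tree
        void applyAt(Node target);                // performs the actual tree modification
    }

    // First resolve all targets against the unmodified tree, then apply in one pass.
    static void patch(List<EditOp> delta) {
        List<Node> targets = new ArrayList<>();
        List<EditOp> accepted = new ArrayList<>();
        for (EditOp op : delta) {
            targets.add(op.resolveTarget());      // paths are still valid: nothing changed yet
            accepted.add(op);
        }
        for (int i = 0; i < accepted.size(); i++) {
            accepted.get(i).applyAt(targets.get(i)); // node references survive structural changes
        }
    }
}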


8.3 Context-Aware Patching

In case a delta is applied to a modified document version, some of the edit operations in the delta will likely point to wrong paths of the document due to interfering changes resulting from previous edit processes. The patch algorithm uses the context fingerprint to verify whether the given path of an edit operation still points to the correct position within the XML tree. If not, the neighborhood of the given path is searched for the correct position. At the end of this section, I discuss the suitability of the defined neighborhood for the given use case.

8.3.1 Matching the Context Fingerprint

Each edit operation within a delta contains a context fingerprint, as introduced in Section 6.5.1. It contains the hash values of the nodes that surrounded the part targeted by the edit operation in the document version for which the delta has been computed. Before applying an edit operation, the patch algorithm verifies whether the context fingerprint matches the syntactic context found at the given path in the document to patch. This is verified by computing the fingerprint at the given path. Each entry of the fingerprint has to match the corresponding node of the document to patch. For delete and update operations, each entry of the fingerprint has a corresponding node. For insert operations, however, the anchor points to the root of the subtree to insert, which has no counterpart node in the document to patch. Therefore, I define the environment I, containing the distances of the counterpart nodes depending on the edit operation type:

I def= [−r, .., −1, 1, .., r] for insert operations; [−r, .., r] otherwise. (8.1)

Definition 8.1 A fingerprint f ∈ δ matches a path p in document A, if and only if ∀n ∈ I : f[n] = fingerprintp[n].

Before computing the matching of a fingerprint, the type of the edit operation has to be taken into account. At this point, recall the definition of Afingerprint in Equation 6.10, which ensures that the tail of a subtree or tree sequence to delete is not contained in the context fingerprint (see Section 6.5.1). This way, the fingerprint does not refer to itself. During patching, this behavior has to be taken into account. Figure 8.1 shows an example. The context fingerprint for the deletion of the subtree rooted at f would be d, e, f, null, null. Computing the fingerprint around node f without the knowledge that its children should be deleted would in turn result in the fingerprint d, e, f, g, h. Apparently, the patch algorithm would reject the application of that edit operation, as two of the entries do not match. Therefore, for any delete operation, the patch algorithm has to neglect the nodes that correspond to the nodes targeted by the edit operation. During this step of the algorithm, neither the content of the nodes nor their structure is compared. Non-matching nodes would lead to a conflict, which will be handled in Section 8.4.
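A sketch of the check from Definition 8.1, assuming the fingerprint is represented as an array of hash values indexed from −r to r (shifted by +r); this array layout is an assumption for illustration only.

// Sketch of the full-fingerprint match from Definition 8.1 (hypothetical representation).
final class FingerprintMatch {

    // stored:   hash values f[-r..r] from the delta, index shifted by +r
    // observed: hash values computed around the candidate path, same layout
    // isInsert: insert operations have no counterpart for the anchor (index 0)
    static boolean matches(long[] stored, long[] observed, boolean isInsert) {
        int r = stored.length / 2;
        for (int n = -r; n <= r; n++) {
            if (isInsert && n == 0) {
                continue;                        // environment I skips the anchor for inserts
            }
            if (stored[n + r] != observed[n + r]) {
                return false;                    // every entry of I has to match
            }
        }
        return true;
    }
}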


Figure 8.1: The context fingerprint of a delete operation does not refer to the deleted parts. This has to be taken into account during patching by computing the matching fingerprint on the document to patch according to the delete pattern defined in the edit operation.

Instead, the patch algorithm looks up the number of subsequent subtrees contained in the delete operation and hides the same number of subtrees, starting at the targeted path, from the fingerprint generation. This is a simple and efficient way of respecting the properties of delete operations. Other operation types do not require such special handling.

8.3.2 Neighborhood Search

Assuming that the delta is applied to a modified version of the document, prior changes (not known to the patch algorithm) may interfere with the edit operations to apply, resulting in invalid paths. Nevertheless, a reliable merge is still possible. Here, I recall an assumption on the change distribution across a new version of a document, elaborated in Section 2.5.3:

Assumption 2.7: Changes on a document affect only small parts of the document in general.

Additionally, assuming that the authors are aware of their responsibilities and roles, the edit actions will be separated locally from each other [Shen and Sun 2002]. From these assumptions it follows that the context of an edit operation has only moved slightly with respect to the path stored in the edit operation. To find the correct application context of an edit operation, I define a neighborhood around the stored path, containing candidate nodes that are considered to be potentially the correct path.

Definition 8.2 The neighborhood of a path p is a sequence of 2ρ + 1 nodes on the same top-down level around p (ρ in each direction), ordered by their distance to p. The node at position 0 is the node at p. ρ is the radius of the neighborhood.

In case the context fingerprint does not match at the stored path, the matching for all nodes of the neighborhood is computed, too. The matching is computed in order of the distance


Figure 8.2: The neighborhood with radius 3 around node /1/0.

between the candidate node and the path of the edit operation. The reason for this ordering is simple. The patch algorithm has no knowledge of whether the prior edit operations have been deletions or insertions. Therefore, it has to walk in both directions. Using this ordering, small changes resulting in small offsets are handled before large changes.

Figure 8.2 shows the neighborhood of node /1/0 with a radius of 3. The fingerprints are computed in the following order: f, d, h, c, k.
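A sketch of how such an ordering can be produced, assuming the nodes of the top-down level are available as a list in document order; the main method reproduces the ordering of Figure 8.2.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of enumerating neighborhood candidates ordered by distance (hypothetical layout).
final class NeighborhoodOrder {

    // level:  all nodes of the top-down level, in document order
    // index:  position of the node addressed by the edit operation
    // radius: neighborhood radius rho
    static <T> List<T> candidates(List<T> level, int index, int radius) {
        List<T> ordered = new ArrayList<>();
        ordered.add(level.get(index));                            // distance 0: the stored path
        for (int d = 1; d <= radius; d++) {
            if (index - d >= 0) ordered.add(level.get(index - d));           // small offsets are
            if (index + d < level.size()) ordered.add(level.get(index + d)); // tried before large ones
        }
        return ordered;
    }

    public static void main(String[] args) {
        // Leaves of Figure 8.2 on the level of /1/0, in document order; /1/0 is node "f".
        List<String> level = Arrays.asList("c", "d", "f", "h", "k");
        System.out.println(candidates(level, 2, 3));              // prints [f, d, h, c, k]
    }
}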

It is possible that an edit operation no longer has a matching context. This mostly occurs if parts of the context (or the anchor node itself) have been modified previously. In that case, the patch algorithm has to reject the edit operation or to perform a best-effort approach (see Section 8.5). The neighborhood has been defined to prevent the patch algorithm from performing an exhaustive search across all nodes of the document, searching for a non-existent context. Additionally, Khanna et al. [2007] have shown that spurious changes may occur if the merge algorithm finds a matching context far away from its original path. As the document analysis in Section 2.4.3 has revealed that the ratio of repeating nodes is relatively high across all levels of the XML tree, I cannot neglect the possibility of a second matching context elsewhere in the document. This is an argument for a restricted search space, too. In case of a small neighborhood, I assume that the probability of identical fingerprints can be neglected.


8.3.3 Discussion

The neighborhood of a node contains only nodes of the same top-down level. This is in fact a noticeable restriction. It is motivated by the following consideration: the top-down level of a node also reflects its position in the XML grammar used for validation. A node on top-down level 7 can be accessed using 7 resolution steps in the corresponding grammar. It cannot be ensured that it can also be accessed by 6 or 8 steps. Examples exist where this is possible. However, I do not assume these examples to be the standard case. One might argue that the context fingerprint would prevent the patch algorithm from applying the edit operation at a wrong path. Although this is correct, I cannot justify the overhead of computing the unnecessary matchings for finding only few possible positive matches. Additionally, a much higher neighborhood radius would be required to get the same width within the XML tree as with the proposed definition. Apart from that, domain-specific neighborhood generation rules are conceivable. For example, within a spreadsheet, only cells in the same column could be taken into account. As those domain-specific rules would prevent my approach from being generally applicable, I do not follow this approach. In case the patch algorithm finds the correct path of an edit operation, the resulting offset could be stored. This offset could be added to all edit operations within the delta that interfere with the matched operation. Basically, this is a promising idea. Nevertheless, I do not follow it, as the delta is basically a set. Implementing this feature would require ordering all edit operations with respect to their interferences before starting a patch run. I estimate the gain of efficiency during the patch run to be lower than the loss of efficiency due to the prior ordering of edit operations, which must be performed to ensure the correct ordering, even if the edit operations are already ordered.

8.4 Conflicts

An edit operation might target a part of the document that has been edited before. If the nodes stored in the edit operation do not match the pattern found in the document to patch, a conflict arises. A conflict also arises if two edit operations from different deltas overlap.

8.4.1 Definition

Formally, an edit operation conflicts if one of the nodes of v does not match. Here, one should remember that v contains the previous value of the part affected by an edit operation.

Definition 8.3 A node n ∈ v of op ∈ δ matches a path p ∈ A, if and only if value(n) = value(p + path(n)).

The matching is defined with respect to the targeted path of the edit operation by adding the offset of the path within v.


Definition 8.4 An edit operation op ∈ δ conflicts for a path p ∈ A, if and only if ∃n ∈ v of op : n does not match p.

For insert operations, v is empty, as an insert does not affect the value of an existing node. Therefore, only delete and update operations can cause conflicts.

8.4.2 Conflict Handling

For update operations, a conflict can be derived directly from the matching of the anchor, which has already been performed in the previous step. For a delete operation, a matching anchor indicates that the head of the subtree or tree sequence is conflict-free. The remaining parts of the delete operation are compared using the recursively computed hash over the subtrees. In case the hash values match, the operation is conflict-free, too. Two possibilities exist to handle a conflict automatically: applying or rejecting the edit operation. In terms of version control systems or operational transformation, applying the edit operation would correspond to a client-side preference, rejecting it to a server-side preference [Shen and Sun 2002]. In the current implementation, the user may decide which version to keep by setting a flag before the patch run.
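A minimal sketch of the conflict check for delete operations under these rules; the boolean and hash parameters are assumptions standing in for the anchor comparison and the recursive subtree hashes.

// Sketch of the conflict check for delete operations (hypothetical representation).
final class ConflictCheck {

    // anchorMatches:     result of the fingerprint anchor comparison
    // storedSubtreeHash: recursive hash of the subtree as recorded in the delta
    // foundSubtreeHash:  recursive hash of the subtree found in the document to patch
    static boolean deleteConflicts(boolean anchorMatches,
                                   long storedSubtreeHash,
                                   long foundSubtreeHash) {
        if (!anchorMatches) {
            return true;                 // the head of the subtree has been edited before
        }
        // A single recursive hash comparison covers the remaining nodes of the subtree.
        return storedSubtreeHash != foundSubtreeHash;
    }
}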

8.4.3 Discussion

Another possibility of conflict handling is to reject the whole delta as long as any conflict arises [MacKenzie et al. 2002]. This forces the user to resolve a conflict manually, preventing an accidental corruption of the document. My algorithm does not support that behavior for two reasons. First, this behavior is mostly intended for software source control, where a delta is assumed to comprise comprehensive changes to one functionality. There, any rejected edit operation will likely lead to a malfunction of the code in case the other edit operations of the delta have been applied. In the domain of document editing, however, it is common that only parts of an editing step are applied to the document, e.g., when two authors correct the same misspelling. A second reason is derived from the set property of the delta. Patching algorithms for edit script-based deltas use a sequential addressing of edit operations. Any rejected edit operation would displace the following edit operations by the missing path offset. Due to this, a complete application or rejection of a delta is preferred in these approaches. As my delta is set-based, all edit operations but the conflicting ones can be applied, whereas the conflicting edit operations can be inspected independently and applied one after another afterwards. During the merging of document versions, the user could be prompted whether a conflicting edit operation shall be applied or not [Ignat and Norrie 2006]. This interactive decision is generally favorable over automatic rule-based solutions. However, this requires the user to assist each merge run. I will present an interactive editor for XML documents in Section 11.3.1, which allows for a detailed manual resolution of conflicts. According to my definition, an insert operation cannot conflict. This directly follows from the fact that an insert does not overwrite any existing node. Nevertheless, an insert operation may not be meaningful, e.g., because it has been applied before.

In these cases, however, the context will most likely be different, thus resulting in a rejection of the insert. I assume the probability of an erroneous insert operation with identical context to be of negligible magnitude.

8.5 Best-Effort Merging

An edit operation is rejected if one of the entries does not match at the given path, even though a user would probably have decided to apply the edit operation, as all other entries did match. To allow for a sub-optimal merge, I introduce partial matchings. In these partial matchings, a node near the anchor is given a higher importance using a distance-aware weighting function that estimates the match quality. Whether an edit operation is applied or not is decided based on a user-defined threshold for the match quality.

8.5.1 Partial Matching

The context fingerprint offers a possibility to identify the correct context of an edit operation. To increase the reliability of the merge result, a user could try to increase the fingerprint radius to store a larger fingerprint. Although this is possible, another problem arises. The larger the fingerprint radius, the higher the probability of erroneously rejected edit operations due to other changes in the surroundings of the context. A partial matching of a fingerprint denotes that there exist matching entries as well as non-matching ones.

Definition 8.5 A partial matching of a fingerprint f ∈ δ on path p in document A is given, if and only if ∃n ∈ [−r..r] : f[n] = fingerprintp[n] ∧ ∃n ∈ [−r..r] : f[n] ≠ fingerprintp[n].

8.5.2 Distance-Aware Weighting

The quality of a partial matching could be estimated by dividing the amount of matching nodes by the total amount of nodes. This would reflect the behavior of the line-based differencing tools for source code, where each line is considered to be equally important [MacKenzie et al. 2002]. In the domain of source code, blank lines are commonly used to structure the code. Assuming that an edit operation contains a new code block, it is likely that the operation is delimited by blank lines. Due to their quantity, blank lines do not offer a high reliability for judging the correct context of an operation. Therefore, the more remote lines are assigned the same priority as the lines near the designated path of the operation. One of the advantages of XML is the separation of content and markup. Therefore, “structuring content” like blank lines is mostly obsolete. One might argue that in text documents, blank lines are used to structure the text, too. First, this is a kind of poor document layout. Second, in the XML representation of office documents, a blank line is stored using at least two nodes, that is, the paragraph node and a special “empty text” node [Brauer et al. 2007]. Therefore, I do not consider the structuring content as notable.


Especially for large context fingerprints, the probability of a non-matching entry at the borders of the fingerprint becomes significant. In my opinion, a non-matching entry near the anchor of an edit operation will more likely indicate an ambiguous merge than a non-matching entry at a higher distance. I will exemplify my considerations. In a text document, each text node contains a paragraph or another rather large portion of text. Each paragraph has a corresponding parent node, denoting the type and the markup of the text. An update on a paragraph with a fingerprint radius of 6 would cover the surrounding three paragraphs including their parent nodes. I assume a text change three paragraphs away to be less important than one on the previous paragraph. Therefore, I introduce a function that weights the matchings of a fingerprint entry depending on its distance to the anchor. Here, I introduce a more formal refinement of the matching of entries of a fingerprint f with respect to a path p (see Definition 8.1 in Section 8.3.1):

match(f[n], p) def= 1 if f[n] = fingerprintp[n], and 0 otherwise. (8.2)

Once again, the type of an edit operation plays an important role. For insert operations, the anchor of the fingerprint points to the subtree to insert, which has no counterpart in the document to patch. Therefore, the environment I is used again, which has been defined in Equation 8.1. Based on the match function, the quality of a partial matching of f on p can be computed using the weighting function matchQuality(δ, A) → [0; 1]:

matchQuality(f, p) def= ( ∑i∈I match(f[i], p) / 2^|i| ) / ( ∑i∈I 1 / 2^|i| ) (8.3)

The numerator sums the weighted matches, whereas the denominator normalizes the result to a value between 0 and 1. A quality of 0 indicates no matches, 1 means a complete match. Figure 8.3 shows the appearance of the un-normalized weighting function, which has a bell shape. The anchor node is given the highest importance, directly followed by the adjacent nodes¹. Table 8.1 shows the match quality for an update operation where one of the entries in a fingerprint of radius 3 does not match.
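The weighting can be sketched as follows; the array-based fingerprint representation is an assumption, and the main method reproduces the values of Table 8.1.

// Sketch of the distance-aware match quality from Equation 8.3 (hypothetical representation).
final class MatchQuality {

    // matches:  match(f[i], p) for i = -r..r, index shifted by +r (1 = match, 0 = mismatch)
    // isInsert: for inserts the anchor (i = 0) is excluded from the environment I
    static double matchQuality(int[] matches, boolean isInsert) {
        int r = matches.length / 2;
        double weightedMatches = 0.0, totalWeight = 0.0;
        for (int i = -r; i <= r; i++) {
            if (isInsert && i == 0) continue;               // environment I for insert operations
            double weight = 1.0 / Math.pow(2, Math.abs(i)); // 2^-|i|: near entries weigh more
            weightedMatches += weight * matches[i + r];
            totalWeight += weight;
        }
        return weightedMatches / totalWeight;               // normalized to [0, 1]
    }

    public static void main(String[] args) {
        // Reproduces Table 8.1: update operation, r = 3, single mismatch at distance i.
        for (int miss = 0; miss <= 3; miss++) {
            int[] m = {1, 1, 1, 1, 1, 1, 1};
            m[miss + 3] = 0;
            System.out.printf("i = %d -> %.3f%n", miss, matchQuality(m, false));
        }
    }
}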

8.5.3 Merging

If a fingerprint does not match completely at its given path, the neighborhood search starts. At this step, the match quality of each candidate is already computed. If a complete matching does not exist for any of the candidate nodes, their match qualities are compared. The algorithm applies the edit operation on the candidate with the highest match quality.

¹ For insert operations, the anchor is not considered. Here, the adjacent nodes have the most influence on the weight.


Figure 8.3: The weighting function (weight over distance to anchor) is bell-shaped, strongly favoring the nodes next to the anchor.

i    matchQuality if match(f[i], p) = 0
0    0.636
1    0.818
2    0.909
3    0.955

Table 8.1: The match quality for a fingerprint of an update operation with r = 3 for one non-matching node with distance i to the anchor.

To avoid erroneous merges, the user has to define a threshold value which is used as a lower bound for accepting a computed match quality. This threshold allows for easily defining to which degree the user accepts ambiguities in the merge result. A threshold of 1 means that only complete matchings are accepted; a threshold of 0 accepts any edit operation, even if it has no matching entry at all.

8.5.4 Discussion

The proposed weighting function shows a strong digression depending on the distance to the anchor. A fingerprint radius larger than 4 would not influence the partial matching significantly. For example, a Gaussian error distribution curve with corresponding parameters could be used for a smoother decrease, or a linear digression. My motivation is to foster the use of small fingerprints. First, a fingerprint means a storage overhead in the delta. Especially for update operations, the fingerprint can easily be larger than the rest of the edit operation. In the line-based domain, 2 to 3 lines are common context radii, too [MacKenzie et al. 2002]. Second, a larger fingerprint increases the probability of partial matchings. A user might intend to use a large fingerprint to get a higher reliability of the merge result. However, the merge run would likely lead to a partial matching resulting from changes far away, making the user feel insecure about the merge result. The proposed matching function is a binary one. A fingerprint entry matches or it does not. Even small changes in surrounding nodes lead to non-matching entries, probably forcing the patch algorithm to reject an edit operation. This is mostly important for attribute changes or minor changes in text nodes. Here, the matching function could be extended to express the similarity between the fingerprint entry and the node. One might argue that the fingerprint contains only a hash value that completely changes even for small changes. This behavior is mostly driven by the needs of cryptography. Recently, Stein [2005] has presented a new class of hashing functions that is able to express the similarity of texts. These functions could be used to allow for an even more meaningful and fine-grained merge. I will discuss this issue in Section 11.4.3.

8.6 The Algorithm at a Glance

The proposed algorithm can be controlled via different parameters. The most important are the neighborhood radius ρ and the threshold for partial matchings. Additionally, the patch tool can run in different modes. In the first mode, it acts like a conventional patch tool, which I called linear patching earlier. In addition, this mode also supports conflict detection, which can be disabled by the user. The code is shown in Algorithm 6. Here, the context of a node is not respected. Applying a delta to a modified version of the document will most likely lead to rejected edit operations.

Algorithm 6 The algorithm for linear patching with conflict detection.

for all op ∈ δ do
  if op does not conflict then
    add op to accepted operations
  end if
end for

The second mode respects the context of an edit operation, as well as its neighborhood. Algorithm 7 shows the code. Here, only complete matchings of the fingerprint are accepted. Conflict detection is switched off. This mode mostly corresponds to the behavior of line-based patch tools.


Algorithm 7 The complete merge-capable algorithm with neighborhood search.

for all op ∈ δ do
  if matchQuality(fingerprintop, pathop) = 1.0 then
    add op to accepted operations
  else
    for all n ∈ [−ρ.. − 1; 1..ρ] do
      if matchQuality(fingerprintop, neighbor(pathop, n)) = 1.0 then
        add op to accepted operations
        break
      end if
    end for
  end if
end for

Finally, the third mode allows for best-effort merging, including conflict detection. This is the default mode of the patch tool, shown in Algorithm 8. Even though it is designed for merging, the algorithm first checks whether the operation can be applied as it is. This design decision was motivated by the thought that users will most probably use the tool without switching to the linear patching mode, even if they are applying the delta to the correct document version. The proposed design ensures that the performance in this case is not significantly slower than in the first mode of the tool.

Algorithm 8 The complete algorithm in best-effort mode with conflict detection.

for all op ∈ δ do
  if matchQuality(fingerprintop, pathop) = 1.0 ∧ op does not conflict then
    add op to accepted operations
  else
    for all n ∈ [−ρ.. − 1; 1..ρ] do
      if matchQuality(fingerprintop, neighbor(pathop, n)) > threshold ∧ op does not conflict then
        add op to candidates
      end if
    end for
    add candidate with highest quality to accepted operations
  end if
end for

In the proposed code, all accepted edit operations are added to a list of accepted operations. As described earlier in Section 8.2, this means that a reference to the corresponding path of the edit operation is stored. The edit operations are applied altogether at the end of the patch run. This is done to prevent erroneous merge results due to previous edit operations with overlapping fingerprints and to avoid the re-computation of paths.


8.7 Complexity Analysis

The proposed patch algorithm can operate in different modes, which have different complexities. Apparently, linear patching is the simplest, whereas the best-effort merge is the most complex. In this section, I present an analysis of the time and space complexity of the algorithm.

8.7.1 Time Complexity

The algorithm for linear patching is very simple, yielding O(A × δ), where δ denotes the amount of edit operations within the delta. Conflict detection would add more complexity; I will discuss that issue later. Respecting the context fingerprint during patching adds more complexity, too, depending on the fingerprint radius. The resulting time complexity yields O(A × δ × r). In general, the assumption A ≫ r holds, except for very small documents. As r remains constant, it can be neglected, thus achieving O(A × δ). Comparing the fingerprint is really simple. One might argue whether the computation of the hash values has to be respected in the complexity analysis, too. The current implementation uses a wrapper class, ensuring that each hash value is computed only once (see Section 5.2.2). Therefore, I do not consider the hash computation separately. The proposed design is especially helpful for the neighborhood search, where each node is accessed multiple times. The neighborhood search is time consuming. Depending on its radius, the worst-case time complexity increases to O(A × δ × ρ). The complexity for the general case will be better, as the algorithm stops if the correct node is found. As this would not lower the complexity class significantly, I do not distinguish between the different cases. The best-effort merge does not consume more time, as the algorithm structure is the same. Throughout all edit operations, ρ remains constant. Therefore, ρ can be regarded as a constant factor, which leads, again, to the complexity class of O(A × δ). For the conflict detection, it is not sufficient to use the context fingerprint. Each node targeted by a delete operation has to be inspected. If the conflict detection is switched on, the previously named complexity classes have to be extended by O(nodes(opdelete)). However, this matters only for very large delete operations, where A ≫ nodes(opdelete) is not valid. Therefore, the overhead for the conflict detection does not have to be considered in the average case.

8.7.2 Space Complexity

The algorithm loads the complete document to patch into main memory. Additionally, the delta is loaded. This leads to a space complexity of O(A + δ). One might argue that the edit operations are stored in the interim, thus leading to a higher space complexity. However, the list only contains references to the correct paths of the operations, which makes it comparably small. Additionally, the list is covered by the second parameter of the complexity class, as the amount of edit operations is the upper bound of that list.


8.8 Conclusions

Applying a delta to a different document version is a challenging task. Paths of edit operations may point to wrong addresses due to prior changes. The context of an edit operation might have been altered. Even the nodes addressed by the edit operation might have changed, leading to a conflict. A patch algorithm has to respect all of these issues to offer a reliable merge result. In this chapter, I have presented a merge-capable patch algorithm for XML documents. It supports three operating modes. First, a delta can be applied exactly as described in the delta, offering no merge support. This mode is fast and intended for sequentially edited documents. It reflects the behavior of most other XML patching approaches. In the second mode, offsets due to prior edit operations are respected. Around the path of an edit operation, the context fingerprint is used to find the correct position for that change. The third mode also accepts sub-optimal matchings. Here, a weighting function is used to estimate the quality of a partially matching context fingerprint. The user decides to which degree a non-optimal merge result is acceptable. Across all three modes, conflicts can be detected. The user has to decide whether conflicting parts of the document are meant to be overwritten, or if the previous version has to be kept unchanged. This decision is made for all conflicting edit operations within a patch run. I present a graphical user interface allowing for interactive decisions on conflicts in Section 11.3.1.

9 Evaluation

Theoretical considerations are the basis of any new algorithm. Even if these considerations are based on empirical findings, it should be verified whether they hold in practice. To do so, I evaluate the proposed XCC framework in this chapter. First, I describe the experimental setup and the test scenarios used. Afterwards, I analyze the runtime of the diff and the patch algorithm. Finally, I verify the merge capabilities of the patch algorithm. Here, the focus lies on the reliability of the merge result.

9.1 Experimental Setup

The experiments are carried out on a commercial off-the-shelf machine, which is briefly described first. Afterwards, I present the test methodology for measuring the runtime. Finally, I show a notation for measuring the merge quality of documents.

9.1.1 Test Environment

All tests are performed on a machine equipped with an Intel Core Duo T2600 processor with 2.16 GHz and 2 GB RAM. The operating system is Linux, kernel 2.6.32 (32bit). The Java environment is OpenJDK 1.6.0_18.

9.1.2 Methodology

The measured runtime is the time elapsed between the start of the differencing or patching algorithm and the finished write of the delta or patched document. In this context, the start of the algorithm actually refers to the execution of the first command within the corresponding tool. This methodology has some implications. First, the parsing of the documents is measured, too. This is important, as the parsing has to be performed at every run of the algorithm. Additionally, the XCC framework does some initialization during start-up that would not be tracked otherwise. Second, the overhead for loading the Java Runtime Environment (JRE) is not part of the measured runtime. The reason is that this overhead cannot be omitted or affected by XCC. Additionally, the load time highly depends on previously loaded JRE instances that might still be in the cache of the operating system. Document parsing takes a significant part of the overall runtime. The runtime mainly depends on the document size, as each byte has to be parsed. In a first test, I ordered the documents with respect to their node amount. This led to misleading test results.

The total size appears to be a better indicator for the complexity of a document than the quantity of nodes. In each scenario, each step is computed three times. The displayed runtime shows the average of these three runs. This shall lower the influence of the operating system and the hardware. For example, the cache of the CPU or the hard disk may contain parts of the documents to compare or the program code. A cache hit decreases the load time by several milliseconds. On the other hand, the scheduler of the operating system may prioritize other processes. All tests have been performed in the single user mode of Linux to prevent other processes from disturbing the evaluation. Nevertheless, some processes (e.g., syslog) are still present and have a higher priority by design, which could affect the runtime.

9.1.3 Measuring the Merge Quality

As already explained in Section 2.2.2, there may exist different ways of syntactically representing a semantically equivalent document. Additionally, the user model of a document may not reflect the actual document model in the file representation (see Section 2.5.1). To estimate whether an edit operation is correctly merged, I rely on the user model. The main motivation for this decision derives from usability considerations. A syntactically correct merge that is misconstrued by the user is unacceptable. Users would not trust such a tool. Apparently, a non-trusted merge tool is useless. For each edit operation to merge, I have defined an expected behavior. For non-conflicting operations, this is the correct path to apply it on. If the edit operation conflicts, I expect the patch algorithm not to apply the operation. Each edit operation of an applied delta can be categorized as follows:

• positives are edit operations that have been applied to the document.

• false positives are positives which have been applied to a path where they were not supposed to be applied. This includes conflicting inserts and updates.

• negatives are edit operations that have been rejected due to a match quality below the threshold value.

• false negatives are negatives which have been rejected, even though a matching path would have been available in terms of the user model.

Apparently, false positives are the most undesirable result. In this case, an edit operation has been applied although it was not supposed to be. This is a severe drawback, as no error is reported, and the user would assume a successful merge. To avoid this, stricter merge rules are preferred over accepting a false positive. However, this implies that acceptable edit operations may be rejected, leading to false negatives.


9.2 Test Scenarios

The evaluation follows two goals. First, the time complexity shall be verified empirically, which requires a synthetic test suite that ensures a constant modification pattern across all test cases. This test suite is used to verify the runtime of the diff and the patch algorithm. The runtime of the patch algorithm is tested for patching and for merging, which are treated as different scenarios. The second goal of the evaluation is to estimate the quality and the reliability of the merge results. Here, an everyday scenario is used to obtain realistic results.

9.2.1 Differencing

The evaluation of the differencing algorithm follows two goals. First, the runtime of the algorithm shall be evaluated depending on the document size. Second, the impact of some design decisions of the differencing algorithm shall be evaluated. To evaluate the performance, similar documents with different sizes are taken. In this context, I use the term “similarity” for documents with a similar tree structure. The similarity shall lower the influence of the tree structure on the runtime of the differencing tool. Here, ODF spreadsheets are especially suitable, as each cell is stored within the document, thus ensuring a common structure of rows and columns, no matter which values the cells contain. The evaluation is performed on 21 spreadsheets, ranging from 12 to 377 KByte (106 to 3823 nodes). The same documents have been used for the document analysis in Section 2.4. They are listed in Appendix A. On these documents, changes are performed. Three types of changes are considered, namely the insertion of a subtree, the update of a leaf node, and the update of a non-leaf node. I will discuss these changes later on. Only one change of each type is performed at any time. This way, the quantity of edit operations is isolated from the size of the document, because the latter is an important factor of the complexity class of the algorithm. The influence of the change quantity will be evaluated in Section 10.3. Updates are performed on the leaf level and on the non-leaf level as well, as they are handled by different steps of the algorithm. The insertion consists of a subtree with a height of 2, containing a simple paragraph. This text can be inserted arbitrarily within the cells without breaking the validity of the document.

9.2.2 Patching

The patch scenario relies on the deltas gained from the diff scenario. The deltas describing the leaf updates are applied to their respective original document. This way, no conflict or ambiguity can arise. Here, the linear patching is compared with the default behavior of the algorithm.


9.2.3 Merging

I use two different scenarios for evaluating the merge capability. The first scenario relies on the previous one. The leaf updates are applied to the documents that already contain the insert operation. In the document order, all leaf updates occur after the insert operation, causing an interfering edit operation. With this scenario, I intend to evaluate the additional overhead of the merge case compared to normal patching. This includes the influence of the neighborhood radius on the runtime.

The focus of the second scenario lies on the merge quality and reliability. In this scenario, different office text documents are taken and edited independently. The changes are grouped close together to force a non-optimal merge, and to reflect a natural editing process as described in Section 2.5.3. The merge is performed on 6 documents with a size between 2 and 100 KByte. Deltas are applied to a changed document, forcing a merge. The deltas consist of 1 to 10 edit operations, resulting in 60 edit operations in total. The test scenario contains 3 conflicts. Most edit operations interfere with each other to force the tool to perform a neighborhood search. During the tests, the neighborhood radius remains constant at 6, which is the default radius.

9.3 Runtime

The runtime of the algorithm is one of the most important usability aspects. Especially if a document is handled often by the XCC framework (e.g., in an auto-versioning file-system), runtime becomes a crucial factor. The runtime of the differencing tool is examined first. Afterwards, time for patching is evaluated. Finally, I evaluate the influence of the neighborhood radius in the merge case, before discussing the results.

9.3.1 Differencing

As the proposed test scenario has a constant quantity of edit operations throughout all document sizes, the runtime of the differencing tool should scale linearly with respect to the document size. This follows directly from the claimed complexity class of O(|leaves(A)| × D + |ancestors(A)| + D). With D = 1, the complexity should yield O(A). Figure 9.1 shows the runtime for the different types of edit operations. As expected, the runtime increases linearly. The line is fitted with a maximum asymptotic standard error of 5.873%. The gradient is approximately 3, the offset about 200 ms. This is the time needed for initialization, including the parsing of the document. The runtime for the detection of the leaf update is the shortest. This is not surprising, as the leaves are investigated first. The insertion of the subtree takes slightly more time in most cases, as more nodes are affected by this change, and the subtree has to be constructed for the edit operation. It is not possible to give a prediction of the runtime needed by the different edit operation types. Other factors, such as the path of the edit operation within the tree or the size of a structure-affecting change, are also important, which prevents a generally applicable rule. Nevertheless, the evaluation supports the claimed complexity class.


[Plot: Performance of XCC Diff — Time in ms over Size in KByte; series: Insert, Node Update, Leaf Update, each with a fitted line.]

Figure 9.1: The runtime of the differencing tool increases linearly with the document size, conforming to the claimed complexity class.

[Plot: Performance of XCC Diff (Leaf Updates) — Time in ms over Size in KByte; series: Default, No Leaf Update Detection, No Leaf Update Detection + No Move Detection.]

Figure 9.2: The detection of leaf updates does not affect the runtime significantly. In some cases, it even saves runtime.

The ability to detect updates on the leaf level is an important aspect of my algorithm. However, this requires additional computation during the identification of structure-affecting changes (see Section 7.4.2). In the complexity analysis, I argued that this additional time can be neglected (see Section 7.5.3), which ought to be verified by the evaluation. To do this, the test scenario has been solved by a version of the algorithm that omits the detection of leaf updates. Figure 9.2 shows the (non-fitted) results. Interestingly, they are not straightforward. In most cases, the runtime without leaf update detection is slightly shorter than in the default version of the algorithm. For the larger documents, the simpler version of the algorithm even runs slower. This derives from the fact that the leaf updates are represented as a pair of an insert and a delete operation. Because of this, the algorithm later verifies whether these operations form a move operation. Figure 9.2 also shows the runtime of the algorithm in case the leaf update detection and the move detection are both disabled. Here, the tool performs slightly faster. However, it is still slightly slower than in the default configuration. Obviously, it takes more time to create two structure-affecting changes, including their context fingerprints, than to create one update operation. This example shows that the detection of leaf updates does not affect the complexity class of the algorithm.

9.3.2 Patching

Patching a document is even faster than differencing. Figure 9.3 shows the results. The lines are fitted with a maximum asymptotic standard error of 5.519%. The gradient is approximately 1.2, and the offset is approximately 130 ms. The initialization is faster than for the differencing tool, as only the document to patch and the delta have to be parsed. The runtime increase is significantly lower, as the patch algorithm may jump directly to the path where the edit operation has to be applied. In its default settings, the tool performs a best-effort merge with a neighborhood radius of 6 and a threshold of 0.7. However, the neighborhood is not searched, as the algorithm finds a perfect match at the expected path. Therefore, linear patching is only slightly faster. Nevertheless, it has an advantage, as the neighborhood is not generated and the match quality is never computed.

9.3.3 Merging

Merging is almost as fast as patching. The results are shown in Figure 9.4. The maximum asymptotic standard error is 5.623%. As with the patch scenario, the gradient is approximately 1.2, the offset 130 ms. The conflict detection is not listed explicitly, as it only affects delete operations (see Section 8.7.1). For update operations, possible conflicts are already detected by the anchor of the context fingerprint. Interestingly, the neighborhood radius does not seem to noticeably affect the runtime of the tool. The runtime for a neighborhood radius of 6 (default) and of 30 is almost identical.
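The following sketch illustrates the best-effort matching idea behind these observations under simplifying assumptions: positions are modeled as plain integers and the context-fingerprint comparison is abstracted into a quality callback, so this is not the actual XCC Patch implementation.

```java
import java.util.function.IntToDoubleFunction;

// Minimal sketch: starting from the expected position, candidate positions
// within the neighborhood radius are probed and the best match above the
// threshold wins. In XCC Patch, "quality" would be the weighted comparison
// of the context fingerprint against the document at that position.
final class BestEffortMatch {

    // Returns the chosen position, or -1 if the operation has to be rejected.
    static int findPosition(int expected, int radius, double threshold,
                            IntToDoubleFunction quality) {
        int bestPos = -1;
        double bestQuality = -1.0;
        for (int offset = 0; offset <= radius; offset++) {
            for (int pos : new int[] { expected + offset, expected - offset }) {
                double q = quality.applyAsDouble(pos);
                if (q == 1.0) {
                    return pos;              // perfect match, no further search needed
                }
                if (q > bestQuality) {       // remember the best partial match so far
                    bestQuality = q;
                    bestPos = pos;
                }
            }
        }
        return bestQuality >= threshold ? bestPos : -1;  // reject below the threshold
    }
}
```

If the fingerprint matches perfectly at the expected position, the loop terminates immediately, which is why the merge case is nearly as fast as plain patching.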


[Plot: Performance of XCC Patch — Time in ms over Size in KByte; series: Default and Linear Patching, each with a fitted line.]

Figure 9.3: The runtime of the patch tool increases linearly with the document size, too. Linear patching is only slightly faster.

[Plot: Performance of XCC Patch (Merge Scenario) — Time in ms over Size in KByte; series: Default, Linear Patching, Neighborhood Radius of 30, each with a fitted line.]

Figure 9.4: Merging is as fast as patching. Even a neighborhood radius of 30 does not affect the runtime noticeably.


9.3.4 Discussion

Each test run has been performed three times. The runtime of the single runs varies, with a deviation of nearly 20% in some cases. This leads to some spurious results, where the less complex linear patching runs slower than merging. This seems to follow from the large caches in hard drives and CPUs, which offer a significant speed gain in case of a cache hit. One could disable the caches to get more comparable results. However, the results would then not reflect the behavior in a usual environment, where the caches are activated. Using the proposed methodology, the maximum asymptotic standard error is 5.873%, which is an acceptable value. Therefore, I decided to keep the caches enabled in order to obtain realistic runtime values.

The runtime of the diff algorithm increases linearly, which conforms to the claimed complexity class. In general, an increase of 100 KByte in document size leads to an increase of 300 ms in runtime. When assuming the size of the documents to be far larger than the amount of changes, the complexity class can be approximated by O(A). According to the analysis of the change patterns in Section 2.5.3, this assumption meets the expected standard case. Patching and merging are even faster than differencing, as only one document and the delta have to be parsed. An increase of 100 KByte in document size yields a runtime increase of 120 ms. In general, merging is only slightly slower than linear patching. These results underline the suitability of the proposed approach for every-day use.

Interestingly, the runtime is not noticeably affected by the neighborhood radius. Here, I recall the complexity class of the patch algorithm, which is O(A × δ). The fingerprint radius r and the neighborhood radius ρ are apparently far smaller than the document size. Therefore, they are neglected in the claimed complexity class. The evaluation supports this assumption.

In this evaluation, the runtime is only investigated with respect to the document size. Prior evaluations by Rönnau [2004] and Lindholm et al. [2006] have shown that the document size is a major challenge of document differencing. The quantity of changes is the other important dimension that affects the runtime of the algorithm. I will investigate the behavior of XCC Diff in a comparative evaluation with other differencing tools in Section 10.3.
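The reported gradients and offsets correspond to an ordinary least-squares fit of runtime over document size; asymptotic standard errors are what fitting tools such as gnuplot report for such fits. The following sketch, with artificial sample values chosen only to reproduce the reported trend of roughly 3 ms per KByte plus a 200 ms offset, shows how such a fit can be computed.

```java
// Minimal sketch: ordinary least-squares fit time = gradient * size + offset
// over measured (size in KByte, time in ms) pairs. Illustrative only; the
// numbers reported in the text stem from the actual measurements.
final class LinearFit {

    static double[] fit(double[] sizeKb, double[] timeMs) {
        int n = sizeKb.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += sizeKb[i];
            sy += timeMs[i];
            sxx += sizeKb[i] * sizeKb[i];
            sxy += sizeKb[i] * timeMs[i];
        }
        double gradient = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        double offset = (sy - gradient * sx) / n;
        return new double[] { gradient, offset };
    }

    public static void main(String[] args) {
        // Artificial sample points roughly matching the reported trend.
        double[] g = fit(new double[] { 12, 377 }, new double[] { 236, 1331 });
        System.out.printf("gradient = %.2f ms/KByte, offset = %.0f ms%n", g[0], g[1]);
    }
}
```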

9.4 Merge Quality

The behavior of the patch algorithm during merging is deterministic, but can be influenced by different parameters, namely the neighborhood radius, the conflict detection, and the threshold value. Among those parameters, the threshold value is the most important. It defines to which degree a non-optimal matching of an edit operation is acceptable. The main focus of this part of the evaluation is to determine the impact of the threshold value on the merge quality. The conflict detection is discussed afterwards. Finally, I discuss some issues of false negatives with respect to the user model of a document, as well as the neighborhood radius.


[Plot: Merge Quality — Percentage over Threshold (0.5 to 1); series: Negatives, Positives, False Positives.]

Figure 9.5: The merge quality mainly depends on the threshold value. A low threshold leads to false positives, a high threshold yields a high amount of negatives.

9.4.1 Influence of the Threshold Value

Using a low threshold, an edit operation will be applied even if its syntactic context does not match completely. A threshold of 0 denotes that the patch algorithm completely disregards the syntactic context. A threshold of 1 requires all nodes within the context to match. Especially in cases where changes are clustered, the probability of a non-matching node in the context of an edit operation becomes significant. Figure 9.5 shows the merge quality depending on the threshold value. Threshold values lower than 0.5 are not displayed, as the merge quality is almost constant in this range. A low threshold leads to false positives, which are unacceptable. Increasing the threshold value leads to fewer accepted operations, but decreases the probability of false positives. Between a threshold of 0.65 and 0.7, the number of negatives increases significantly. In parallel, the number of false positives drops to 0. This is caused by the properties of the weighting function used in the patch algorithm. A non-matching anchor leads to a match quality of 0.636. The same match quality results if both adjacent nodes of the context fingerprint do not match. Both are major indicators for a false positive. A threshold higher than 0.7 has no advantage in the test scenario. It leads to a high number of negatives, which are all false negatives. Nevertheless, other scenarios may require a higher threshold to achieve a high merge quality.
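As an illustration of how such a weighting behaves, the following sketch computes a match quality as the weighted share of matching fingerprint positions. The weights are placeholders and not the actual weighting function of XCC, which assigns the anchor enough weight that a non-matching anchor alone caps the quality at 0.636.

```java
// Illustrative only: match quality as the weighted share of matching positions
// in a context fingerprint. The anchor gets the largest weight, neighbors get
// less the farther they are from the anchor. The concrete weights of XCC's
// weighting function differ from the placeholder values used here.
final class MatchQuality {

    static double quality(boolean[] matches, double[] weights) {
        double total = 0.0, matched = 0.0;
        for (int i = 0; i < matches.length; i++) {
            total += weights[i];
            if (matches[i]) {
                matched += weights[i];
            }
        }
        return total == 0.0 ? 0.0 : matched / total;
    }

    public static void main(String[] args) {
        // Fingerprint with radius 2: [left-2, left-1, anchor, right+1, right+2].
        double[] weights = { 1, 2, 4, 2, 1 };                     // placeholder weights
        boolean[] anchorMismatch = { true, true, false, true, true };
        System.out.println(quality(anchorMismatch, weights));    // 0.6 with these weights
    }
}
```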


9.4.2 Conflict Detection

Most of the negatives are false negatives. Only 3 of the edit operations must not be applied (i.e., “true” negatives), as they are conflicting. Therefore, false negatives are not indicated separately. All three conflicting operations are already rejected using a threshold of 0.5, without using the explicit conflict detection. The conflict detection is only needed for delete operations that target subtrees or tree sequences. As conflicts are well-defined and the conflict detection is straightforward, it is not evaluated separately.

9.4.3 Discussion

Two reasons exist for false negatives. First, the neighborhood radius may be too small to find the correct context of an edit operation. Second, a change may be acceptable in the user model, despite being unacceptable in the document model. For example, equivalent markup information may be defined using different identifiers in the XML representation [Rönnau and Borghoff 2009]. The displayed document appears identical within the office application, but in the underlying XML document, all markup information differs. This kind of mismatch between the user model and the document model cannot be handled by my approach, as it would require application-specific rule-sets. In turn, a higher neighborhood radius can be set using a command-line parameter. Even if the influence of the neighborhood radius on the runtime is small, the default radius is kept rather small. This is done to prevent erroneous matches deriving from the high number of repeating nodes throughout the document. This is a known hazard in the line-based domain [Khanna et al. 2007]. Although this case seldom occurs, I decided to reduce the probability of erroneous matches by restraining the default neighborhood radius (see Section 8.3.2).

9.5 Conclusions

Speed is a crucial prerequisite for user acceptance. This is especially important as merging is known to be rather complex and time-consuming [Mens 2002]. In this chapter, I evaluated the runtime of the tools of the XCC framework. The results are very promising. For document differencing, no run lasts longer than 1.5 s on a standard laptop, which is a fairly good value. Patching and merging are even faster. Here, the tool takes less than 0.8 s to merge a document. The runtime of both diff and patch grows linearly, as the size of the document is far larger than the amount of changes. Interestingly, merging a document is nearly as fast as patching. This is an important achievement that ensures a high user acceptance, as merging offers a gain in features without a significant penalty in terms of runtime. All tests have been performed using a synthetic test scenario to ensure a high accuracy.

A second important aspect of merging is the reliability of the merge result. Here, every-day documents are used as the test environment to ensure that the proposed solution works outside synthetic conditions. The results are impressive. The context fingerprint appears to be a

reliable source of information to search for the correct path of an edit operation. The main parameter to control the merge procedure is the threshold value. It defines to what degree a partial matching is acceptable. The evaluation shows that a threshold value of 0.7 ensures a high reliability without rejecting too many edit operations. Using this threshold, no false positive occurs. This is an important aspect, as an erroneously applied edit operation is highly undesirable. The design decisions of all tools within the XCC framework have been based on the assumptions about the properties of XML documents and their change patterns, which have been elaborated in Chapter 2. The evaluation has shown that the design holds well in practice, offering a fast and reliable framework for XML change control.


Part III

Conclusions


10 A Comparative Evaluation of XML Differencing Tools

Several differencing tools for XML documents exist. They use different algorithms and delta models, which have been presented in Section 3.3. In this chapter, I evaluate XCC Diff in comparison with those tools that are the best known in terms of their time complexity class. As no other merge-capable patch algorithm is available, a comparative evaluation cannot be performed for XCC Patch. I start by briefly presenting the test setup and the competitors. The evaluation itself is performed using basic tests and two different scenarios, which are presented afterwards. The evaluation covers the runtime and the memory consumption. Finally, the generated deltas are analyzed.

10.1 Experimental Setup

Five tools are compared in total, shown in an overview first. The test environment and the test methodologies are presented afterwards.

10.1.1 Tools

In Section 3.3, I have presented several XML differencing approaches. One of the selection criteria was the availability of an implementation of the algorithm. These implementations are taken as competitors for XCC Diff. Table 10.1 gives an overview of their algorithmic properties, Table 10.2 shows their on-line resources. Except for Microsoft XML Diff, which is implemented in C#, all tools are written in Java and are therefore platform-independent. As Microsoft XML Diff only runs on Windows systems, the evaluation has been performed on that operating system. This is also the reason for using the Java-based implementation of XyDiff, called jXyDiff. The original XyDiff implementation was written in C for Linux. Faxma, jXyDiff, and Microsoft XML Diff claim a complexity bound that depends only on the document size. For XCC Diff and diffxml, the amount of changes has an influence on the time complexity, too. The space complexity is not listed in the overview, as it has not been published for all approaches. For Microsoft XML Diff it is quadratic, whereas it is linear for XCC Diff.

Table 10.1: XCC Diff and four other XML diff tools are compared in the evaluation. They differ in terms of their complexity class and the delta model used. (Columns: Tool, Time Complexity, Delta Model, Insert and Delete, Move Supported, Minimum Edit Script, Invertible; tools: XCC Diff, diffxml [Chawathe et al. 1996], faxma [Lindholm et al. 2006], jXyDiff [Cobéna et al. 2002], Microsoft XML Diff [Zhang and Shasha 1989].)


Tool                 Availability
XCC Diff             https://launchpad.net/xcc
diffxml              http://diffxml.sourceforge.net
faxma                http://code.google.com/p/fc-xmldiff
jXyDiff              http://potiron.loria.fr/projects/jxydiff
Microsoft XML Diff   http://msdn.microsoft.com/en-us/xml/bb190622.aspx

Table 10.2: All tools are available on-line.

10.1.2 Test Environment

All test runs have been performed on a machine equipped with an Intel Core 2 Quad Q9650 3.00 GHz processor, 4 GB RAM, running Windows Vista Business SP1 (64 bit) and Sun Java JDK 1.6.0_10 (32 bit). Interestingly, the runtime of the Java-based tools increases when using a 64 bit Java Runtime Environment. Apparently, the XML processor of this Java version is not optimized for handling XML documents efficiently. I did not investigate this issue further, as the 32 bit environment does not constrain the capabilities of the tools.

10.1.3 Methodology

The runtime is measured by a Java-based application that starts the corresponding differencing tool using the shell interface. The system time is stored directly before and after the run of the tool, with a granularity of nanoseconds. The difference is taken as the runtime of the tool. Using this methodology with a shell interface, each tool is loaded including its runtime environment and libraries. Especially as most of the tools are Java-based, this ensures that the whole JRE is loaded again1. Otherwise, the measured runtime would be significantly lower, not reflecting the total time a user needs when invoking the differencing tool for the first time. The memory consumption is measured by a third-party tool that measures the total memory consumption of an application. On the one hand, this includes the overhead that derives from loading the whole runtime environment including all libraries. Especially if a similar application is loaded in parallel, the bare memory footprint of the differencing tool could be estimated to be smaller. On the other hand, however, this methodology reflects the case that a user only starts the differencing tool if needed, without having further XML editing tools running. I consider this to be more suitable for the average case.

1This methodology somewhat contradicts the evaluation setting in Section 9.1.2. In this case, however, the runtime of the whole tool must be measured to guarantee a comparative result, as not all tools rely on Java. In case of Microsoft XML Diff, the .NET framework is loaded upon startup, which affects the overall runtime, too.
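The following minimal sketch illustrates the timing methodology described above. The concrete command line is hypothetical; the point is that each tool is started through the shell, so its runtime environment is loaded for every run, and the wall-clock time is taken directly before and after with nanosecond granularity.

```java
import java.io.File;
import java.util.concurrent.TimeUnit;

// Sketch of the measurement harness: start the tool as an external process
// and measure the elapsed wall-clock time of the complete run.
final class RuntimeMeasurement {

    static long measureMillis(String... command) throws Exception {
        long start = System.nanoTime();
        Process p = new ProcessBuilder(command)
                .redirectOutput(new File("delta.xml"))   // capture the tool's output
                .redirectErrorStream(true)
                .start();
        p.waitFor();
        long end = System.nanoTime();
        return TimeUnit.NANOSECONDS.toMillis(end - start);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical invocation; the real tools each use their own shell interface.
        long ms = measureMillis("java", "-jar", "xccdiff.jar", "old.xml", "new.xml");
        System.out.println("runtime: " + ms + " ms");
    }
}
```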


10.2 Test Scenarios

All tools have to solve different tests. First, basic tests are performed to verify the correct function of the tools. Afterwards, a test suite based on a synthetic document evolution is presented. The last test suite uses snapshots of a Web page to reflect a more natural document evolution.

10.2.1 Basic Tests

Some reasons exist to normalize XML documents before comparing them. As already described in Section 2.2.2, two semantically equivalent documents may differ syntactically, as there exist different possibilities of representing one and the same node. First, the attributes of a node may appear in arbitrary order. The attribute ordering test verifies whether a re-arrangement of the same attributes is detected as an edit operation. Of course, this should not happen. Second, different input encodings can be used. Although the XML specification requires all documents to be encoded in Unicode, different Unicode encodings may represent the same value. In the Unicode test, a differencing tool should not detect an edit operation for a different input encoding (UTF-8 and UTF-16) of the same character. These tests mostly verify the completeness of the implementation, independent of the algorithm used. Nevertheless, they are important to estimate the usability of these tools for every-day work.
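As a minimal illustration of why a re-arrangement of attributes must not be reported as a change, the following sketch (with made-up element and attribute names) parses two such variants and compares them with DOM's order-insensitive node equality.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// The two documents below only differ in the order of their attributes and are
// therefore semantically equivalent; a differencing tool should report no edit
// operation. DOM's isEqualNode() compares attributes as a set, independent of
// their order, and thus treats the two elements as equal.
final class AttributeOrderingTest {

    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document a = parse("<cell row=\"1\" column=\"2\">value</cell>");
        Document b = parse("<cell column=\"2\" row=\"1\">value</cell>");
        System.out.println("equivalent: "
                + a.getDocumentElement().isEqualNode(b.getDocumentElement()));
    }
}
```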

10.2.2 Synthetic Document Evolution

The amount of changes has a significant influence on the runtime of a differencing algorithm. Apparently, a larger delta yields a higher runtime. However, only some of the analyzed algorithms respect the amount of changes within their complexity bounds. To verify the claimed complexity class, a test suite is needed in which the amount of changes increases constantly without significantly affecting the size of the document. A fixed document size (i.e., the quantity of nodes) prevents an influence of the size on the runtime. This is especially important, as the time complexity of all differencing algorithms mainly depends on the document size.

Spreadsheets in ODF have an important property in this context. Each cell within the table spanned by the first and the right-most outer-most filled cell is represented as a node in the XML representation, no matter whether that cell contains any value or not [Brauer et al. 2007]. Pohlemann [2009] has created a test suite on a complex ODF spreadsheet of 375 KByte, consisting of around 5700 nodes. The different versions cover a range from 5 to 100 changes, in steps of 5 changes. In this context, a change is seen as a change in the user model, i.e., by means of the office application. Basically, the user model and the document model may diverge. In this test suite, less complicated changes were performed that do not affect the document model heavily (see Section 2.5.1 for a discussion of the user model).

In this test scenario, precise knowledge of the performed changes exists. They were performed manually with the aim of a linear increase in the change amount. This allows for a suitable evaluation of the influence of the change amount on the runtime of the differencing algorithm. However, this synthetic test suite does not reflect the “natural” evolution of a document.


Tool                 Attribute Ordering Test   Unicode Test
XCC Diff             pass                      pass
diffxml              pass                      pass
faxma                pass                      pass
jXyDiff              fail                      abort
Microsoft XML Diff   pass                      fail

Table 10.3: The basic tests are not solved by all tools. JXyDiff aborts when parsing documents that are not encoded in UTF-8.

10.2.3 Archiving a News Site on the Web

Most Web pages evolve over time. Especially news sites and blogs are updated frequently. However, they do not evolve in a linear way [Adar et al. 2009]. Re-arrangements of headlines and advertisements occur constantly, thus leading to move operations (see Section 2.5.2). In this context, one cannot estimate the scale of the changes between two versions. Nevertheless, tracking the evolution of a Web page is an important task [Masanès 2006]. In this scenario, I archive the changes of a Web page during one day. 17 snapshots of the Wired news site2 are taken between 7am and 10pm3. One snapshot consists of around 2400 nodes and 108 KByte. In this test scenario, I cannot state a meaningful ordering of the test runs by difficulty. Instead, the different snapshots are ordered by the time at which they have been taken.

10.3 Differencing

All tools have to solve the basic tests first. The runtime of the different approaches is presented afterwards, followed by the analysis of the memory consumption.

10.3.1 Basic Tests

Most of the tools solve the attribute ordering test flawlessly, as shown in Table 10.3. Only jXyDiff recognizes a difference between the documents with modified attribute ordering. This is obviously an implementation error, as the original C implementation handles this test correctly. The same holds for the Unicode test, where jXyDiff aborts the run, as it is only able to handle XML documents encoded using UTF-8.

2http://www.wired.com
3The site does not provide significant change of content outside these hours.

Microsoft XML Diff interprets two identical documents with different encodings as changed, which is not the expected behavior.

10.3.2 Runtime Comparison

Figure 10.1 shows the runtime of the compared tools. Figure 10.1(a) and Figure 10.1(b) use different scales on the y-axis to ensure the readability of the results. Whereas XCC Diff, faxma, and jXyDiff show a comparable behavior throughout both scenarios, diffxml and Microsoft XML Diff act differently. As the ODF scenario is performed on a larger document, the runtime is obviously higher than in the XHTML scenario for all tools. The ODF document has more than twice the number of nodes compared to the Web page. Nevertheless, the increase factor is lower for most tools. For diffxml, however, the runtime becomes unacceptably high. Each differencing run takes more than 35,000 ms, which is 40 times the runtime of the XHTML scenario. This is especially interesting, as the time complexity class of O(|leaves(A)| × e + e²) suggests that the quantity of changes has a higher influence than the document size. Chawathe et al. [1996] have stated that their algorithm may misbehave in cases where a node value occurs frequently within a tree level. As shown in Section 2.4.3, this pattern is typical for XML documents. In Section 10.4, I will discuss this issue in more detail.

Even though it is less complex, Microsoft XML Diff shows a strange behavior in the XHTML scenario. The runtime increases significantly in some test cases. As the documents in this scenario are not ordered by the complexity of the changes but by their change date, one could assume that large changes have been performed in the affected documents. In that case, however, I would expect the competing tools to act likewise, which they do not. Microsoft XML Diff relies on the algorithm by Zhang and Shasha [1989], for which special cases are not known. As the implementation is closed-source, a further investigation cannot be performed. Using a black-box approach, I would guess that the search for the minimum edit script is very time-consuming in these cases. It is arguable whether the proposed implementation meets the claimed complexity class of O(|T1| × |T2| × min{depth(T1), leaves(T1)} × min{depth(T2), leaves(T2)}). Here, the amount of changes is not assumed to have a significant influence on the time complexity. However, the evaluation shows that for both scenarios, the runtime increases noticeably depending on the amount of changes.

The other differencing tools show a nearly linear development of the runtime, depending only slightly on the change quantity. Even for XCC Diff, whose complexity class depends on the edit distance, the influence of the change quantity is only minimal for the tested range. This can be derived from the fact that the quantity of nodes is far larger than the amount of changes. Interestingly, jXyDiff is very fast for the first test case, where the original document is compared with itself, thus representing an unchanged document. Here, the greedy bottom-up approach of XyDiff (see Section 3.3.4) offers a fast heuristic to detect unchanged documents.


[Plots: Performance of XML Diff — (a) ODF Scenario: Time in ms over Changes in Document; (b) XHTML Scenario: Time in ms over Snapshot No.; series: XCC Diff, diffxml, faxma, jXyDiff, Microsoft XML Diff.]

Figure 10.1: The tools differ significantly in terms of their runtime (note the different scales). Most interestingly, some tools show a different behavior for the different scenarios.


10.3.3 Memory Consumption

As far as it has been published, the tested algorithms have a linear space complexity. Only for Microsoft XML Diff, a quadratic space complexity depending on the size of the document is known. Figure 10.2 shows the memory consumption of the tested tools. The memory consumption of Microsoft XML Diff reflects the runtime shown in the last section. Once again, the reason for this behavior cannot be further investigated, as the tool is closed source. All other tools show a nearly constant memory consumption. A slight increase for a higher amount of changes is acceptable, as the delta to construct has to be represented in memory, too (which is usually not part of the complexity analysis). Especially diffxml seems to have a suitable heuristic to detect equal documents. In this case, the memory consumption is significantly lower than in the other test cases. The memory consumption of all tools is reasonable, allowing for every-day use even with complex documents.

10.4 Delta Analysis

Most evaluations focus on the mere speed of an algorithm. For the usability of an approach, the quality of the resulting deltas is an important aspect as well. Here, I will analyze the quantity of edit operations that are needed to reconstruct a document version. Afterwards, the distribution of the different operation types is investigated. Finally, the total size of the deltas will be evaluated.

10.4.1 Edit Operations

The algorithm by Zhang and Shasha [1989], which forms the basis of Microsoft XML Diff, is the only one that claims to compute a minimum edit script. All other algorithms can only ensure a best-effort approximation. The algorithm of Chawathe et al. [1996], implemented by diffxml, claims to create near-optimal results in the general case. Figure 10.3 shows the quantity of edit operations that are computed to represent the changes between two document versions. To ensure a better readability, I have used different scales. Diffxml produces by far the most edit operations in the ODF scenario and is second worst in the XHTML scenario. At first sight, this contradicts the claim to create near-optimal results. Looking more closely, two reasons exist for this behavior. First, the algorithm relies on the conventional formulation of the tree-to-tree editing problem by Selkow [1977], which has been discussed in Section 3.2.1. Here, each node change is represented by edit operations on the leaf level. Apparently, this leads to far more edit operations than for the subtree-oriented delta model, which is used by the other approaches. Second, the algorithm is known to misbehave if many equivalent nodes exist on the same level of the tree. The algorithm computes the LCS of the nodes on each level. However, the LCS is not unique (see Section 3.1.1); for instance, the sequences ABA and BAB share the two different longest common subsequences AB and BA. As a result, the matchings on the different tree levels may not correlate. By this, edit operations are created


[Plots: Memory Usage — (a) ODF Scenario: Memory Footprint in MByte over Changes in Document; (b) XHTML Scenario: Memory Footprint in MByte over Snapshot No.; series: XCC Diff, diffxml, faxma, jXyDiff, Microsoft XML Diff.]

Figure 10.2: The memory consumption is nearly linear for most tools. Only Microsoft XML Diff shows a significant increase in memory usage for larger deltas.

to transform one level to match its parent level, although another LCS would have led to a near-optimal matching.

Basically, Microsoft XML Diff is the only tool to create minimum edit scripts. Nevertheless, XCC Diff needs fewer edit operations to describe the changes between two document versions throughout all test runs. The reason for this is simple. The delta model of XCC Diff allows for addressing whole subtrees and tree sequences by one edit operation. The algorithm by Zhang and Shasha [1989] relies on the delta model by Tai [1979], where only node-based edit operations are allowed. Basically, this would lead to a far higher number of edit operations within a delta. On the other hand, Microsoft XML Diff uses a modified delta model, which has been presented in Section 3.3.2. In this model, edit operations that span a subtree are glued together into one operation, which makes the delta model similar to mine. This gluing seems to work well in the ODF scenario, but not in the XHTML scenario. Nevertheless, the number of edit operations should not be higher than in the deltas generated by diffxml. This tool also uses a node-based granularity, but with non-optimal results. Therefore, Microsoft XML Diff should create deltas with the same number of edit operations, or even fewer. The evaluation shows that the quantity of edit operations is significantly higher in the XHTML scenario. The amount of edit operations directly correlates with the runtime of the diff tool. As Microsoft XML Diff is closed-source, a further investigation is not possible. Apparently, this is an implementation error.

Faxma is not part of this evaluation. This is caused by its delta model. The delta created by faxma consists of a script that contains references to the original document, as well as the inserted parts (see Section 3.3.5). As deletions are not represented and moves (i.e., references to other parts of the document) are hard to distinguish from regular references, I decided not to investigate this tool in this evaluation, as the results would not be comparable.

10.4.2 Operation Types

The total number of edit operations already shows a significant difference between the tools. However, the quantity does not reveal how the different algorithms represent the changes between the document versions. Table 10.4 shows the distribution of the different operation types within a delta. Here, major differences between the tools become apparent.

In the ODF scenario, XCC Diff uses insert, delete, and update operations nearly equally. No move is detected. Within the test documents, no move has been performed on the user model either. Although this does not guarantee that no move occurs on the document model, the strong use of the move operation by diffxml and jXyDiff is conspicuous. I will discuss this issue later. For the XHTML scenario, XCC Diff uses more update operations, including some moves. This behavior reflects the assumptions on the evolution of Web documents stated in Section 2.5.2. In general, the pattern of the used operation types is similar for both scenarios.

Microsoft XML Diff shows a completely different usage pattern for both scenarios. In the ODF scenario, the update operation is heavily used. This derives from the fact that the structure of ODF’s lay-outing nodes is highly similar. Additionally, there exist far more delete operations than insert operations.


[Plots: Number of Operations in Delta — (a) ODF Scenario: Edit Operations over Changes in Document; (b) XHTML Scenario: Edit Operations over Snapshot No.; series: XCC Diff, diffxml, jXyDiff, Microsoft XML Diff.]

Figure 10.3: Only Microsoft XML Diff ensures a minimum edit script. Nevertheless, XCC Diff creates deltas with fewer edit operations due to the addressing of subtrees and tree sequences. Note the different scales.


                            insert    delete    update    move
XCC Diff            ODF     34.51%    34.43%    31.06%     0.00%
                    XHTML   20.45%    31.25%    41.11%     7.19%
diffxml             ODF      2.68%     3.46%     6.49%    87.37%
                    XHTML   19.52%    20.04%     5.54%    54.90%
jXyDiff             ODF      5.85%     5.92%    18.77%    69.46%
                    XHTML   26.41%    27.72%    20.88%    24.99%
Microsoft XML Diff  ODF      7.88%    20.68%    66.85%     4.59%
                    XHTML   38.33%    38.38%     2.36%    20.93%

Table 10.4: The mixture of the operation types in the deltas differs both between the algorithms and between the scenarios.

This is particularly interesting, as in most cases the quantity of insert and delete operations is nearly identical. This is straightforward, as insert and delete operations are used to represent more complex changes that affect different nodes within a subtree. In this case, however, the gluing heuristic of Microsoft XML Diff works better for delete operations than for insert operations. In the XHTML scenario, however, the quantity of insert and delete operations is nearly identical, with a high usage of the move operation. It is the test run where the implementation error occurs; the deltas deconstruct the old document and reconstruct the new one nearly completely.

The test result for diffxml reflects the reason for the misbehavior of the algorithm. The high use of the move operation derives from the fact that diffxml tries to reconstruct the (wrongly matched) document structure. Nevertheless, the move operation is also heavily used in the XHTML scenario. This shows that the worst case of the algorithm design meets the average case in the domain of XML documents.

JXyDiff uses many move operations in the ODF scenario, too. This algorithm is also disturbed by the high amount of equal nodes. Its greedy nature leads to matching nodes in different subtrees, although no real move occurred, but an update. Additionally, each move leads to further moves, thus starting a chain reaction. For XHTML, this pattern is not that frequent. Nevertheless, nearly one quarter of the operations are moves.

10.4.3 Delta Size

One reason for storing the deltas instead of whole document versions is to gain qualitative information on the changes performed. The other reason is to save storage space. However, each delta model adds data to the bare edit operation for addressing and for structuring. Additionally, the last sections have shown that the amount of edit operations used to represent the changes between document versions can become large.


[Plots: Size of Delta — (a) ODF Scenario: Size in KByte over Changes in Document; (b) XHTML Scenario: Size in KByte over Snapshot No.; series: XCC Diff, diffxml, faxma, jXyDiff, Microsoft XML Diff, Document Size.]

Figure 10.4: Some tools create deltas that are larger than the compared documents.


Figure 10.4 shows the total size of the generated deltas. The size of the original document is displayed, too. For diffxml and jXyDiff, the size of the delta exceeds the size of the compared document already for minor changes. Apparently, these approaches are less suitable in terms of storage efficiency. In general, the delta size correlates with the amount of edit operations that has been analyzed earlier. Although XCC Diff basically has the largest overhead, due to the context fingerprints and its invertibility, the resulting deltas are comparably small. Especially in the ODF scenario, even large changes lead to deltas that are far smaller than the original document. The XHTML scenario leads to more comprehensive edit operations, thus requiring more storage space with respect to the original document size.

The deltas of faxma have not been analyzed in the last sections, as the delta model is not comparable. The delta size, however, can be compared. The deltas created by faxma are comparably small. Especially in the XHTML scenario, they are the smallest among all tested tools. Here, the ability to address large unchanged subtrees efficiently is beneficial. In the ODF scenario, the subtrees are basically smaller and less deep. Here, faxma loses its advantage, leading to deltas nearly twice the size of those created by XCC Diff and Microsoft XML Diff.

10.5 Conclusions

Two aspects are important when using a differencing tool. The speed of the algorithm should allow its frequent use. Additionally, the resulting deltas should be reasonably small, reflecting the nature of the changes performed. In this chapter, I have analyzed the performance of XCC Diff in comparison with the best-known tools that represent the state-of-the-art in XML differencing.

The evaluation has shown that XCC Diff is one of the fastest implementations available. Together with faxma, it shows a nearly constant runtime across a high amount of changes. Even for a 375 KByte office document, the runtime does not exceed 0.5 seconds. The other tools show a significantly lower performance, together with a strange behavior in some cases. Although this might derive from a sub-optimal implementation, it significantly lowers the usability of these tools.

For the resulting deltas, the approaches differ significantly, too. Again, XCC Diff shows a very good and constant performance. The quantity of edit operations needed to describe the changes between two document versions is low, which allows for investigating a delta manually. Here, XCC Diff and jXyDiff are the only tools without outliers. Together with faxma, the size of the deltas created by XCC Diff is significantly lower than for the other approaches.

In conclusion, the implementations of XCC Diff and faxma appear to be the most mature among the tested tools. XCC Diff appears to be the best general-purpose algorithm for XML documents. Faxma shows an impressive performance as well. However, its delta model is harder for human readers to interpret and does not allow for merging documents.

11 Conclusions and Future Work

This chapter summarizes the results of this thesis, followed by an overview of the main scientific contribution. Afterwards, I present different applications already using the XCC framework, as well as research directions for future work.

11.1 Summary

In the introduction in Chapter 1, I have presented my motivation for this work. I have described the ad-hoc collaboration model that can be supported by the framework developed in this thesis.

In Chapter 2, I have presented some basic notations regarding documents in general and XML documents in particular. As an important aspect, XML documents need to have an inner node ordering. Although this conforms to the XML specification, some database-related XML formats intentionally rely on an unordered tree model. Deriving from that definition, most picture formats are not covered by my approach. In the rest of the chapter, I have given an analysis of XML documents to gain knowledge of their properties and of the change patterns occurring in usual editing processes. The results of this analysis have directed the design of the XCC framework later on.

The comparison of texts and trees is an important and extensively studied domain of computer science. In Chapter 3, I have presented the problem statement as well as the common solutions for text and tree differencing. Texts and trees have different properties, thus requiring different solutions. XML documents, however, share the tree property with “natural” trees, as well as the reading direction with texts. Therefore, XML documents require special approaches to compare two document versions. I have given an overview of the state-of-the-art in XML differencing algorithms and tools.

Parallel editing of documents is an every-day task in collaborative environments. Changes performed in parallel have to be merged subsequently to ensure a consistent document. In Chapter 4, I have presented the common approaches to document merging. All of the existing approaches have their assets and drawbacks. I have shown that the existing approaches for XML merging are still not mature, especially for the merging of document states, which is important in ad-hoc collaboration environments without a central repository.

Document change control covers the differencing, patching, and merging of documents. In Chapter 5, I have presented the basic idea of my approach, as well as the architecture of the XCC framework that covers the mentioned aspects of document change control. Important aspects of the implementation have also been highlighted there.


To ensure a reliable merge capability, I have decided to rely on the syntactic context as an indicator for the correctness of the applicability of an edit operation. In Chapter 6, I have presented my delta model that forms the heart of the XCC framework. I have introduced the so-called context fingerprint that stores a normalized representation of the syntactic context.

One important challenge in document change control is the efficiency of the differencing algorithm. In Chapter 7, I have presented an efficient differencing algorithm for XML documents, called XCC Diff. The algorithm computes the changes between document versions based on the change model defined in the previous chapter. The complexity class of the algorithm is competitive. For similar documents, a near-linear time complexity is achieved. The space complexity is linear in any case.

Reliable patching is another important challenge. In Chapter 8, I have presented a merge-capable patch algorithm based on the delta model presented before. Using the context fingerprints, it is able to identify the correct position of an edit operation, thus allowing for a reliable merge. In case of a non-optimal match, the algorithm performs a best-effort approach to find the correct position. The user may decide to which degree an ambiguous merge result is acceptable. Additionally, a conflict detection prevents unintended deletions or updates. The complexity of the patch algorithm is even better than that of the differencing algorithm.

The XCC framework consists of the delta model, with the differencing and patching algorithm on top. In Chapter 9, I have evaluated the runtime of these algorithms as well as the quality of the merge. The results have confirmed the claimed complexity classes and have proven the reliability of the proposed merge algorithm. No other framework has the same broad applicability as XCC. Nevertheless, efficient XML differencing tools already exist. In Chapter 10, I have performed an analysis of the runtime and the resulting deltas of the XML differencing algorithms presented in Chapter 3 in comparison to XCC Diff. XCC Diff appears to be one of the best tools available, showing no failure or spurious result (which also holds for only one of the competitors). The deltas are small, describing the actual changes accurately.

11.2 Scientific Contribution

To the best of my knowledge, this thesis is the first comprehensive approach to XML document change control. The scientific contribution of my thesis covers three main aspects. First, I have analyzed the properties of XML documents and their modification patterns. No other analysis of the properties of XML documents is known to me. Second, I have defined a comprehensive delta model. As a main contribution, I have developed a notion of the syntactic context of an edit operation. Together with the corresponding patch algorithm, it is the first delta model known to me that allows for the reliable merge of XML documents. Third, I have developed an XML differencing algorithm that is highly efficient in terms of time and space. Besides its efficiency, it creates deltas that reflect well the actual editing actions performed on the document. To the best of my knowledge, it is the best algorithm for XML differencing that relies on a complete delta model.


11.3 Applications

The XCC framework enables other programs to transparently compare and merge document versions. In this section, I present different applications of the XCC framework. I start by presenting a graphical user interface for interactive document merging. Afterwards, an auto-versioning file-system is presented that is able to automatically store modified document versions. Both tools have already been implemented. Later on, I present three further applications of the XCC framework which have not been implemented yet. Finally, I briefly discuss the applicability of my work on graphic formats.

11.3.1 Graphical Merge Support

XCC Patch allows for controlling the merge process by setting the parameters for the threshold value, the conflict detection, and the neighborhood radius. These parameters are applied to all edit operations within a delta. Defining an all-purpose threshold value is nearly impossible. Setting it high prevents erroneous merges. On the other hand, acceptable edit operations are rejected due to a low match quality. Especially for ambiguous edit operations, an interactive decision by the user is preferable to rule-based solutions.

Based on the XCC framework, Teupel [2009] has developed a graphical editor. It allows for applying deltas interactively. The basic idea is to show the original document version and the modified document side-by-side. Applied edit operations are shown directly within the XML tree; a list shows all edit operations, including the rejected ones. Figure 11.1 shows a screen-shot of the application. The original document and the updated document are shown side-by-side. Edit operations are highlighted to intuitively show their implications. In the right part of the application, a list of all edit operations of the delta is shown. Selecting one of the operations focuses the document tree on it.

The user can define a threshold for partial matchings, too. All operations with a higher match quality are accepted by default and displayed in green. Operations with a lower match quality are indicated in red. Accepted operations can be hidden to allow the user to concentrate on the ambiguous operations. For each operation, the user can decide whether it should be applied or not. Additionally, the path of the operation can be changed by dragging it to the designated path within the tree. Nodes can be edited manually, e.g., for the resolution of semantic conflicts. This is a clear improvement over conventional patching, as the document can be edited during the merge.

During the edit process, the original document remains unchanged. By this, a user may freely edit the document without having to fear that the editing actions affect the original document. Any change can be reverted using an “undo” button. Whenever the user has edited the document, the merge result does not conform to the delta that has been originally applied. Naturally, this is the dedicated usage of the application. Nevertheless, the user may want to store the generated merge result. The editor allows for computing the delta between the original and the modified version after editing it. By this, the resulting delta contains all decisions that have been taken by the user, including all manual changes performed.


Figure 11.1: The merge GUI displays the original and the updated version side-by-side. Insertions, deletions, moves, and updates are highlighted in different colors. The right list gives detailed information on the operations of the delta.

This feature is especially helpful in the context of version control systems, where only non-conflicting deltas may be committed to the repository. As a second use-case, the user can distribute the merge result to other users who are working on the same document.

11.3.2 Auto-Versioning File-System

In state-based change control systems, the presence of a new state is usually announced actively by invoking the differencing tool or by checking in the new version to the repository. This does not reflect the usual process of editing a document, where the document is loaded, edited, and saved. Users who are not familiar with the usage of change control systems would easily forget to announce the new document version. If check-ins occur only seldom, the deltas become very large and do not reflect the single editing steps, which makes it difficult to revert changes to an interim version. Additionally, large deltas are difficult to merge. In case the most recent local copy has not been checked in and the local machine has a failure, all changes are lost.

Auto-versioning file-systems try to combine the advantages of file-systems with change control systems. The basic idea is to embed the change control features within the file-system structure in a way that each write access invokes the generation of a new document version. Müller et al. [2010] have presented a general-purpose auto-versioning file-system which is able to handle XML documents using the XCC framework. It works on top of existing low-level file-systems on Linux systems, and handles different files depending on their MIME type [Freed and Borenstein 1996].

The auto-versioning file-system acts like a conventional one. A file and directory structure is presented. Files and documents can be created, modified, moved, and deleted as one is used to. By default, only the current state of the files is shown. Internally, the current version

is stored, as well as a delta containing the differences to the previous version. In the end, each version can be reconstructed by subsequently applying these deltas. Basically, the user has no direct access to these deltas. Instead, a tool provides an interface for querying former versions. They can be retrieved either by defining a version step (e.g., the previous version), or by defining a date of validity. Especially for older document versions, the definition of a validity period is more appropriate than defining a version step.

Legal constraints are an important issue in document management for processes in enterprises. Auto-versioning file-systems allow for achieving regulatory compliance of document repositories [Peterson and Burns 2005]. Additionally, the auto-versioning feature can be combined with an archiving solution. This is especially useful for long-term archiving, for which XML documents are qualified by design [Borghoff et al. 2006]. Additionally, an auto-versioning file-system prevents the user from accidentally overwriting or deleting documents. The implicit versioning ensures a fine granularity that is also helpful for merging.

11.3.3 Other Applications

The XCC framework can be used for numerous other use cases, too. For example, version control systems could be enhanced to handle XML documents adequately. In previous work, I have presented a corresponding API to link XML differencing approaches and version control systems for the change control of office documents [Rönnau 2004]. Here, not only the basic XML document, but all XML documents within the complete office document are compared using an XML diff tool. The resulting delta contains all deltas from the diff runs. Additionally, the embedded bitmaps are compared, too. If they change, they are added to the delta. The Office Versioning API stores a complete delta and thereby ensures a comprehensive change control for office documents. A prototype has shown the applicability using XyDiff (see Section 3.3.4) and the Darcs version control system1 [Rönnau et al. 2005]. This approach is easily extensible towards the XCC framework, which also provides a merge capability.

Most office applications offer a “track changes” feature that allows for storing the evolution of a document. This is technically implemented by annotating the document, as described in Section 4.4.3. This feature is popular among end-users, as it offers a simple way of manual document merging. However, especially if a document is edited often or by more than two users, the change representation becomes hard to read. Additionally, the tracked changes store lots of meta-data which is probably not intended to be accessible by every participant. Finally, the change tracking is inefficient in terms of storage space. Teupel[2008] has developed a tool to map the tracked changes onto the XCC delta model. The tracked changes are extracted from the office document and converted into an XCC delta. This way, better privacy through reduced meta-data and a more reliable merge through XCC Patch are achieved. Rönnau and Borghoff[2009] have shown the flexibility and efficiency of this approach, which combines the advantages of change tracking with the context-aware merging of XCC.

1http://darcs.net, presented by Roundy[2005].


In Section 4.3, I have presented the operational transformation approach as the state of the art for collaborative editors. Fraser[2009] has presented a collaborative editor that relies on a state-based differencing and merging approach. The differencing is performed frequently, which keeps the deltas small and thus lowers the probability of ambiguous merges. It has been shown that the editor is fast enough for on-line collaboration with several users. Using the XCC framework, a collaborative editor for XML documents could be developed.

Recently, libraries have become aware that the Web has to be archived, too [Masanès 2006]. This is especially difficult, as Web pages evolve constantly over time. Storing complete versions of documents is far too inefficient in terms of space. Cobéna[2003] has proposed to use an XML differencing tool to compare the versions of a Web page and to store only the delta. This approach is efficient in terms of space. Additionally, the use of an XML diff allows for a qualified description of the evolution of a Web page. The XCC framework could easily be used to track the evolution of Web pages, too. This way, a long-term preservation of Web content could be achieved efficiently.

11.3.4 Applicability to Graphics

In Section 2.1, I have restricted the term document to artifacts with an inner ordering, corresponding to a reading direction. This decision was based on the consideration that identical-looking graphics can be represented differently. Consider the following example, based on SVG, an XML-based vector format, and sketched below: two non-touching lines can be stored in either order, providing the same information. From this, the use of XCC to track changes appears to be less meaningful. However, Thao and Munson[2010] have shown that common graphics applications store the elements of an SVG in a persistent order. This means that for two unchanged elements, their respective order within the SVG does not change if the SVG has been edited by the same application. Based on this observation, XCC can be used to track the evolution of graphics, which broadens the scope of the XCC framework significantly.
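
To make the ordering issue concrete, consider the following sketch of two SVG fragments (the coordinates are chosen arbitrarily for illustration). Both describe the same two non-touching lines, yet an ordered tree comparison would report a difference:

  <svg xmlns="http://www.w3.org/2000/svg">
    <line x1="0" y1="0" x2="10" y2="0"/>
    <line x1="0" y1="5" x2="10" y2="5"/>
  </svg>

  <svg xmlns="http://www.w3.org/2000/svg">
    <line x1="0" y1="5" x2="10" y2="5"/>
    <line x1="0" y1="0" x2="10" y2="0"/>
  </svg>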

11.4 Future Work

The XCC framework is a mostly self-contained system that enables full change control over XML documents. The previous section has shown different applications of this approach. Nevertheless, future research may improve the proposed system. In the following, different research directions are briefly outlined.

11.4.1 Validating Merge

In the proposed solution, the correct position of an edit operation is identified solely by its syntactic context. A possible violation of the document validity is not checked. During the evaluation, no invalid document occurred. However, it is basically possible to create an invalid document in a merge scenario. This is a kind of worst case in terms of user acceptance, as

the merged document cannot be edited by the corresponding application to revise the merge error. To prevent this case, the existing patch tool offers an option to validate a document after a merge. In case of an invalid merge, the patch process is not performed. This approach prevents invalid documents. However, a more fine-grained solution would be helpful, in which the impact on the validity is checked for each edit operation.

As a re-validation of the complete document is highly cost-intensive, an incremental validation would be appropriate. In that approach, only the modified parts are re-validated. Papakonstantinou and Vianu[2002] have presented a corresponding re-validation algorithm. However, for complex grammars, the space complexity becomes a severe issue, as all possible states of the validating tree automaton have to be held in memory. Barbosa et al.[2006] have presented an improved re-validation algorithm and have shown its efficiency in complex real-world scenarios. Including this approach in XCC Patch would lead to an even better reliability of the merge, without increasing the complexity too much.
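
The coarse-grained safeguard described above can be sketched as follows. The schema handling via lxml and the apply_delta placeholder are assumptions for illustration only; they do not reflect the actual XCC Patch interface:

  from lxml import etree

  def patch_with_validation(document_path, schema_path, apply_delta):
      """Apply a delta only if the merged result still validates (a sketch;
      apply_delta stands in for the actual patch step)."""
      schema = etree.XMLSchema(etree.parse(schema_path))
      merged = apply_delta(etree.parse(document_path))  # hypothetical patch step
      if not schema.validate(merged):
          raise ValueError("merge rejected: result violates the document grammar")
      return merged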

11.4.2 Three-Way Merging

Previously, I have argued that a merge-capable patching offers the simplest way of merging with a broad applicability. One of the drawbacks of three-way merging algorithms is the need to have the nearest common ancestor version available. If this version is available, however, a three-way merge potentially offers a better merge result, as each changed node can be compared with respect to the original version. The proposed differencing algorithm could be extended to support three-way differencing, too. Here, three documents have to be traversed in parallel instead of two. Khanna et al. [2007] have analyzed the behavior of the line-based diff3 tool. This could act as a blueprint for a corresponding XCC Diff3 (the basic decision rule is sketched below). Additionally, the experience gained by Lindholm[2004] during the design of his three-way differencing algorithm for XML documents should be considered.
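
For reference, the decision rule that a three-way merge applies to a single node value, given the nearest common ancestor (base) and the two changed versions, can be sketched as follows; this is only an illustration of the principle, not the diff3 implementation:

  def merge_value(base, mine, theirs):
      """Classic three-way rule for one value; raises on a true conflict."""
      if mine == theirs:   # both sides agree, possibly after identical changes
          return mine
      if mine == base:     # only the other side changed
          return theirs
      if theirs == base:   # only this side changed
          return mine
      raise ValueError(f"conflict: {mine!r} vs. {theirs!r} (base: {base!r})")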

11.4.3 Near-Similarity Hashing

For the sake of efficiency, nodes are compared using their value, represented by a hash of their normalized content. Hash functions have mostly been developed in the domain of cryptography, where even similar inputs lead to completely different hash values. By this, the hash value does not allow for drawing any conclusion on the content of a node. Therefore, even a slightly changed node cannot be matched as syntactic context of an edit operation. The use of hash values is a main design decision of my approach. Here, fuzzy hash functions that reflect the similarity of two nodes could be used instead. Such functions have been introduced by Stein[2005]. This way, minor changes of a node, e.g., a corrected typographic error or an attribute change, could be tolerated, which would allow the patch algorithm to match an edit operation even in a slightly modified syntactic context. As a result, even more edit operations could be matched at a higher reliability, as the threshold could be set to higher values. Nachbar[1988] has shown that accepting a certain inaccuracy leads to significantly smaller and more understandable deltas.
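
As a toy illustration of the idea (this is not Stein's fuzzy-fingerprint scheme, merely a simple similarity-preserving sketch with arbitrary parameters), a node's text could be reduced to a small set of token buckets whose overlap indicates similarity:

  import hashlib

  def fuzzy_fingerprint(text, buckets=8):
      """Map each word token to one of a few buckets; similar texts share most
      buckets, whereas a cryptographic hash of the whole string differs completely."""
      marks = set()
      for token in text.lower().split():
          marks.add(hashlib.md5(token.encode("utf-8")).digest()[0] % buckets)
      return frozenset(marks)

  def similarity(a, b):
      fa, fb = fuzzy_fingerprint(a), fuzzy_fingerprint(b)
      return len(fa & fb) / max(len(fa | fb), 1)

  # A corrected typo changes the fingerprint only marginally.
  print(similarity("DocEng notification is late", "DocEng notifcation is late"))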


Bibliography

Abiteboul, S., Segoufin, L., and Vianu, V. (2006). Representing and querying XML with incomplete information. ACM Transactions on Database Systems, 31(1):208–254.

Adams, E., Gramlich, W., Muchnick, S. S., and Tirfing, S. (1986). Sunpro: engineering a practical program development environment. In Proceedings of the 1986 International Workshop on Advanced Programming Environments, pages 86–96, London, UK. Springer.

Adar, E., Teevan, J., Dumais, S. T., and Elsas, J. L. (2009). The web changes everything: understanding the dynamics of web content. In WSDM’09: Proceedings of the 2nd ACM International Conference on Web Search and Data Mining, pages 282–291, New York, NY, USA. ACM.

Akutsu, T., Fukagawa, D., and Takasu, A. (2008). Improved approximation of the largest common subtree of two unordered trees of bounded height. Information Processing Letters, 109(2):165–170.

Al-Ekram, R., Adma, A., and Baysal, O. (2005). diffx: an algorithm to detect changes in multi-version XML documents. In CASCON’05: Proceedings of the 2005 Conference of the Centre for Advanced Studies on Collaborative Research, pages 1–11. IBM Press.

Allali, J. and Sagot, M.-F. (2004). Novel tree edit operations for RNA secondary structure comparison. In Jonassen, I. and Kim, J., editors, WABI, volume 3240 of Lecture Notes in Computer Science, pages 412–425, Berlin/Heidelberg, Germany. Springer.

Amagasa, T., Yoshikawa, M., and Uemura, S. (2000). A data model for temporal XML documents. In DEXA’00: Proceedings of the 11th International Conference on Database and Expert Systems Applications, pages 334–344, London, UK. Springer.

Apostolico, A., Atallah, M. J., Larmore, L. L., and McFaddin, S. (1990). Efficient parallel algorithms for string editing and related problems. SIAM Journal on Computing, 19(5):968– 988.

Barbosa, D., Leighton, G., and Smith, A. (2006). Efficient incremental validation of XML documents after composite updates. In Amer-Yahia, S., Bellahsene, Z., Hunt, E., Unland, R., and Yu, J. X., editors, XSym, volume 4156 of Lecture Notes in Computer Science, pages 107–121, Berlin/Heidelberg, Germany. Springer.


Barnard, D. T., Clarke, G., and Duncan, N. (1995). Tree-to-tree correction for document trees. Technical Report 95-372, Queen’s University Kingston, Canada.

Bennett, K. H. and Rajlich, V. T. (2000). Software maintenance and evolution: a roadmap. In ICSE’00: Proceedings of the Conference on The Future of Software Engineering, pages 73–87, New York, NY, USA. ACM.

Bergroth, L., Hakonen, H., and Raita, T. (2000). A survey of longest common subsequence algorithms. In SPIRE’00: Proceedings of the 7th International Symposium on String Processing Information Retrieval, page 39, Washington, DC, USA. IEEE Computer Society.

Bernard, M., Boyer, L., Habrard, A., and Sebban, M. (2008). Learning probabilistic models of tree edit distance. Pattern Recognition, 41(8):2611–2629.

Bertino, E., Guerrini, G., Mesiti, M., and Tosetto, L. (2002). Evolving a set of DTDs according to a dynamic set of XML documents. In EDBT’02: Proceedings of the Workshops XMLDM, MDDE, and YRWS on XML-Based Data Management and Multimedia Engineering-Revised Papers, pages 45–66, London, UK. Springer.

Bille, P. (2005). A survey on tree edit distance and related problems. Theoretical Computer Science, 337(1-3):217–239.

Borghoff, U. M., Rödig, P., Scheffczyk, J., and Schmitz, L. (2006). Long-Term Preservation of Digital Documents: Principles and Practices. Springer, Secaucus, NJ, USA.

Borghoff, U. M. and Schlichter, J. H. (2000). Computer-Supported Cooperative Work: Introduction to Distributed Applications. Springer, Secaucus, NJ, USA.

Borghoff, U. M. and Teege, G. (1993a). Application of collaborative editing to software- engineering projects. ACM SIGSOFT, 18:56–64.

Borghoff, U. M. and Teege, G. (1993b). Structure management in the collaborative multimedia editing system iris. In MMM ’93: Proceedings of the 1st International Conference on Multi-Media Modeling, pages 159–173, Singapore. Singapore, New Jersey, London, Hong Kong: World Scientific.

Boyer, J. (2001). Canonical XML version 1.0. RFC 3076.

Boyer, J. M., Dunn, E., Kraft, M., Liu, J. S., Shah, M. R., Su, H. F., and Tiwari, S. (2008). An office document mashup for document-centric business processes. In DocEng’08: Proceeding of the 8th ACM symposium on Document engineering, pages 100–101, New York, NY, USA. ACM.

Brauer, M., Weir, R., and McRae, M. (2007). OpenDocument v1.1 specification.


Bray, T., Hollander, D., Layman, A., and Tobin, R. (2006). Namespaces in XML 1.0 (second edition). World Wide Web Consortium, Recommendation REC-xml-names-20060816.

Bray, T., Paoli, J., Sperberg-McQueen, C. M., Maler, E., and Yergeau, F. (2008). Extensible markup language (XML) 1.0 (fifth edition). World Wide Web Consortium, Recommendation REC-xml-20081126.

Buckland, M. K. (1997). What is a “document”? Journal of the American Society for Information Science, 48(9):804–809.

Carey, P. (2008). New Perspectives on HTML and XHTML 5th Edition, Comprehensive. Course Technology Press, Boston, MA, United States.

Chawathe, S., Rajaraman, A., Garcia-Molina, H., and Widom, J. (1995). Change detection in hierarchically structured information. Technical Report 1995-46, Stanford Infolab.

Chawathe, S. S. (1999). Comparing hierarchical data in external memory. In VLDB ’99: Proceedings of the 25th International Conference on Very Large Data Bases, pages 90– 101, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Chawathe, S. S., Abiteboul, S., and Widom, J. (1999). Managing historical semistructured data. Theory and Practice of Object Systems, 5(3):143–162.

Chawathe, S. S. and Garcia-Molina, H. (1997). Meaningful change detection in structured data. SIGMOD Records, 26(2):26–37.

Chawathe, S. S., Rajaraman, A., Garcia-Molina, H., and Widom, J. (1996). Change detection in hierarchically structured information. In SIGMOD ’96: Proceedings of the 1996 ACM SIGMOD International Conference on the Management of Data, pages 493–504, New York, NY, USA. ACM.

Chen, W. (2001). New algorithm for ordered tree-to-tree correction problem. Journal of Algorithms, 40(2):135 – 158.

Clark, J. (1999). XSL transformations (XSLT) version 1.0. http://www.w3.org/TR/xslt.

Cobéna, G. (2003). Change management of semi-structured data on the web. PhD thesis, École Polytechnique, INRIA Rocquencourt.

Cobéna, G., Abiteboul, S., and Marian, A. (2002). Detecting Changes in XML Documents. In ICDE’02: Proceedings of the 18th International Conference on Data Engineering, pages 41–52. IEEE Computer Society.

Codd, E. F. (1970). A relational model of data for large shared data banks. Communications of the ACM, 13(6):377–387.


Conradi, R. and Westfechtel, B. (1998). Version models for software configuration management. ACM Computing Surveys, 30(2):232–282.

Cormode, G. and Muthukrishnan, S. (2007). The string edit distance matching problem with moves. ACM Transactions on Algorithms, 3(1):2.

Davis, A. H., Sun, C., and Lu, J. (2002). Generalizing operational transformation to the standard general markup language. In CSCW’02: Proceedings of the 2002 ACM Conference on Computer Supported Cooperative Work, pages 58–67, New York, NY, USA. ACM.

Davis, M. and Collins, L. (1990). Unicode. In Proceedings of the 1990 International Conference on Systems, Man, and Cybernetics, pages 499–504.

Davison, W. (1990). Unified context diff tools. Volume 14, Issue 70 of comp.sources.misc.

Delisle, N. and Schwartz, M. (1986). Contexts: a partitioning concept for hypertext. In CSCW’86: Proceedings of the 1986 ACM Conference on Computer Supported Cooperative Work, pages 147–152, New York, NY, USA. ACM.

DeNardis, L. and Tam, E. (2007). Open documents and democracy. http://odfalliance.org/resources/Yale_Open_Documents_and_Democracy.pdf.

DeRose, S. and Clark, J. (1999). XML path language (XPath) version 1.0. http://www.w3.org/TR/1999/REC-xpath-19991116.

Eastlake, D., Reagle, J., Solo, D., Hirsch, F., and Roessler, T. (2008). XML signature syntax and processing (second edition). http://www.w3.org/TR/xmldsig-core/.

Eisenberg, J. D. (2004). OpenOffice.org XML Essentials - Using OpenOffice.org’s XML Data Format. Technical report, SUN.

Ellis, C. A. and Gibbs, S. J. (1989). Concurrency control in groupware systems. SIGMOD Records, 18(2):399–407.

Fetterly, D., Manasse, M., Najork, M., and Wiener, J. (2003). A large-scale study of the evolution of web pages. In WWW’03: Proceedings of the 12th International Conference on the World Wide Web, pages 669–678, New York, NY, USA. ACM.

Flower, L. and Hayes, J. R. (1981). A cognitive process theory of writing. College Composition and Communication, 32(4):365–387.

Fontaine, R. L. (2002). Merging XML files: a new approach providing intelligent merge of XML data sets. In Proceedings of the XML Europe 2002.

Fraser, N. (2009). Differential synchronization. In DocEng’09: Proceedings of the 9th ACM Symposium on Document Engineering, pages 13–20, New York, NY, USA. ACM.


Freed, N. and Borenstein, N. (1996). Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types. RFC 2046.

Goldfarb, C. F. (1990). The SGML handbook. Oxford University Press, Inc., New York, NY, USA.

Gorn, S., Bemer, R. W., and Green, J. (1963). American standard code for information interchange. Communications of the ACM, 6(8):422–426.

Gottlob, G., Koch, C., and Pichler, R. (2003). The complexity of XPath query evaluation. In PODS’03: Proceedings of the 22nd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 179–190, New York, NY, USA. ACM Press.

Grandi, F. and Mandreoli, F. (2000). The valid web: An XML/XSL infrastructure for temporal management of web documents. In ADVIS’00: Proceedings of the 1st International Conference on Advances in Information Systems, pages 294–303, London, UK. Springer.

Grandi, F., Mandreoli, F., and Tiberio, P. (2005). Temporal modelling and management of normative documents in XML format. Data & Knowledge Engineering, 54(3):327–354.

Gray, J. (1978). Notes on data base operating systems. In Operating Systems, An Advanced Course, pages 393–481, London, UK. Springer.

Gropengießer, F., Hose, K., and Sattler, K.-U. (2009). An extended transaction model for cooperative authoring of XML data. Computer Science - Research and Development, 24(1-2):85–100.

Hedeler, C. and Paton, N. W. (2008). A comparative evaluation of XML difference algorithms with genomic data. In Ludäscher, B. and Mamoulis, N., editors, SSDBM, pages 258–275, Berlin/Heidelberg, Germany. Springer.

Hickson, I. and Hyatt, D. (2010). HTML5. http://www.w3.org/TR/html5/.

Hill, W. C., Hollan, J. D., Wroblewski, D., and McCandless, T. (1992). Edit wear and read wear. In CHI’92: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 3–9, New York, NY, USA. ACM.

Hors, A. L., Hégaret, P. L., Wood, L., Nicol, G., Robie, J., Champion, M., and Byrne, S. (2004). Document object model (DOM) level 3 core specification. W3C Recommendation.

Horwitz, S., Prins, J., and Reps, T. (1989). Integrating noninterfering versions of programs. ACM Transactions on Programming Languages and Systems, 11(3):345–387.

Hottinger, D. and Meyer, F. (2005). XML-diff-algorithmen. Semesterarbeit, ETH Zürich.

Hunt, J. W. and McIlroy, M. D. (1976). An algorithm for differential file comparison. Technical Report CSTR 41, Bell Laboratories, Murray Hill, NJ.


Ignat, C.-L. and Norrie, M. C. (2003). Customizable collaborative editor relying on treeopt algorithm. In ECSCW’03: Proceedings of the 8th European Conference on Computer Supported Cooperative Work, pages 315–334, Norwell, MA, USA. Kluwer Academic Publishers.

Ignat, C.-L. and Norrie, M. C. (2006). Flexible definition and resolution of conflicts through multi-level editing. In CollaborateCom’06: Proceedings of the International Conference on Collaborative Computing: Networking, Applications and Worksharing, pages 1 –10.

Ignat, C.-L. and Norrie, M. C. (2008). Multi-level editing of hierarchical documents. Comput. Supported Coop. Work, 17(5-6):423–468.

Ignat, C.-L., Papadopoulou, S., Oster, G., and Norrie, M. C. (2008). Providing awareness in multi-synchronous collaboration without compromising privacy. In CSCW’08: Proceedings of the 2008 ACM Conference on Computer Supported Cooperative Work, pages 659–668, New York, NY, USA. ACM.

Iorio, A. D., Schirinzi, M., Vitali, F., and Marchetti, C. (2009). A natural and multi-layered approach to detect changes in tree-based textual documents. In Filipe, J. and Cordeiro, J., editors, ICEIS, volume 24 of Lecture Notes in Business Information Processing, pages 90–101, Berlin/Heidelberg, Germany. Springer.

Irons, E. T. and Djorup, F. M. (1972). A CRT editing system. Communications of the ACM, 15(1):16–20.

Ishida, Y., Inenaga, S., Shinohara, A., and Takeda, M. (2005). Fully incremental LCS computation. In Liskiewicz, M. and Reischuk, R., editors, FCT, volume 3623 of Lecture Notes in Computer Science, pages 563–574, Berlin/Heidelberg, Germany. Springer.

Jansson, J. and Lingas, A. (2001). A fast algorithm for optimal alignment between similar ordered trees. In CPM’01: Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching, pages 232–240, London, UK. Springer.

Jiang, T., Wang, L., and Zhang, K. (1995). Alignment of trees – an alternative to tree edit. Theoretical Computer Science, 143(1):137 – 148.

Kangasharju, J. and Lindholm, T. (2005). A sequence-based type-aware interface for XML processing. In EuroIMSA’05: Proceedings of the Conference on Internet and Multimedia Systems and Applications, pages 83–88.

Khanna, S., Kunal, K., and Pierce, B. C. (2007). A formal investigation of diff3. In Arvind and Prasad, editors, Foundations of Software Technology and Theoretical Computer Science (FSTTCS).


Klein, P. N. (1998). Computing the edit-distance between unrooted ordered trees. In ESA ’98: Proceedings of the 6th Annual European Symposium on Algorithms, pages 91–102, London, UK. Springer.

Ko, H.-K. and Lee, S. (2006). An efficient scheme to completely avoid re-labeling in XML updates. In WISE’06: Proceedings of the 7th International Conference on Web Information Systems Engineering, pages 259–264.

Koch, M. (1997). Unterstützung kooperativer Dokumentenbearbeitung in Weitverkehrsnetzen. PhD thesis, Technische Universität München.

Koch, M. and Koch, J. (2000). Application of frameworks in groupware – the iris group editor environment. ACM Computing Surveys, page 28.

Krug, S. (2000). Don’t Make Me Think!: A Common Sense Approach to Web Usability. Que Corp., Indianapolis, IN, USA.

Lai, W. (2009). Relationship-Based Change Propagation: A Case Study. Master’s thesis, University of Toronto.

Landau, G. M., Myers, E. W., and Schmidt, J. P. (1998). Incremental string comparison. SIAM Journal on Computing, 27(2):557–582.

Laux, A. and Martin, L. (2000). XUpdate. http://xmldb-org.sourceforge.net/xupdate/xupdate-wd.html.

Lease, M. (2007). Natural language processing for information retrieval: the time is ripe (again). In PIKM’07: Proceedings of the ACM 1st Ph.D. workshop in CIKM, pages 1–8, New York, NY, USA. ACM.

Lee, K.-H., Choy, Y.-C., and Cho, S.-B. (2004). An efficient algorithm to compute differences between structured documents. IEEE Transactions on Knowledge and Data Engineering, 16(8):965–979.

Levenshtein, V. (1966). Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Soviet Physics Doklady, 10:707–710.

Lindholm, T. (2004). A three-way merge for XML documents. In DocEng’04: Proceedings of the 4th ACM Symposium on Document Engineering, pages 1–10, New York, NY, USA. ACM.

Lindholm, T., Kangasharju, J., and Tarkoma, S. (2006). Fast and simple XML tree differencing by sequence alignment. In DocEng’06: Proceedings of the 6th ACM Symposium on Document Engineering, pages 75–84, New York, NY, USA. ACM.


Lippe, E. and van Oosterom, N. (1992). Operation-based merging. SIGSOFT Software Engineering Notes, 17(5):78–87.

MacKenzie, D., Eggert, P., and Stallman, R. (2002). Comparing and merging files. http://www.gnu.org/software/diffutils/manual/.

Marian, A., Abiteboul, S., Cobéna, G., and Mignet, L. (2001). Change-centric management of versions in an XML warehouse. In Journal on Very Large Databases, pages 581–590.

Martens, W., Neven, F., Schwentick, T., and Bex, G. J. (2006). Expressiveness and complexity of XML schema. ACM Transactions on Database Systems, 31(3):770–813.

Maruyama, H., Tamura, K., and Uramoto, N. (2000). Digest values for DOM (DOMHASH). RFC 2803.

Masanès, J. (2006). Web Archiving. Springer, Secaucus, NJ, USA.

Mehdad, Y. (2009). Automatic cost estimation for tree edit distance using particle swarm optimization. In ACL-IJCNLP’09: Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 289–292, Morristown, NJ, USA. Association for Computational Linguistics.

Mens, T. (2002). A state-of-the-art survey on software merging. IEEE Transactions on Software Engineering, 28(5):449–462.

Meyrowitz, N. and van Dam, A. (1982). Interactive editing systems: Part I. ACM Computing Surveys, 14(3):321–352.

Miller, W. and Myers, E. W. (1985). A file comparison program. Software: Practice and Experience, 15(11):1025–1040.

Mitchell, A., Posner, I., and Baecker, R. (1995). Learning to write together using groupware. In CHI’95: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 288–295, New York, NY, USA. ACM Press/Addison-Wesley Publishing Co.

Molli, P., Skaf-Molli, H., Oster, G., and Jourdain, S. (2002). Sams: synchronous, asynchronous, multi-synchronous environments. In Proceedings of the 7th International Conference on Computer Supported Cooperative Work in Design, pages 80–84.

Morgan, H. L. (1970). Spelling correction in systems programs. Communications of the ACM, 13(2):90–94.

Mouat, A. (2002). XML Diff and Patch Utilities. Master’s thesis, Heriot-Watt University, Edinburgh.

Müller, A., Rönnau, S., and Borghoff, U. M. (2010). A file-type sensitive, auto-versioning file system. In Proceedings of the 10th ACM symposium on Document engineering, DocEng ’10, pages 271–274, New York, NY, USA. ACM.


Murata, M., Lee, D., Mani, M., and Kawaguchi, K. (2005). Taxonomy of XML schema languages using formal language theory. ACM Transactions on Internet Technology, 5(4):660–704.

Myers, B. A. (1998). A brief history of human-computer interaction technology. ACM Interactions, 5(2):44–54.

Myers, E. W. (1986). An O(ND) difference algorithm and its variations. Algorithmica, 1:251– 266.

Nachbar, D. (1988). SPIFF — A program for making controlled approximate comparisons of files. In USENIX Association, pages 73–84.

Neuwirth, C. M., Chandhok, R., Kaufer, D. S., Erion, P., Morris, J., and Miller, D. (1992). Flexible diff-ing in a collaborative writing system. In CSCW’92: Proceedings of the 1992 ACM Conference on Computer Supported Cooperative Work, pages 147–154, New York, NY, USA. ACM Press.

Neuwirth, C. M., Kaufer, D. S., Chandhok, R., and Morris, J. H. (1990). Issues in the design of computer support for co-authoring and commenting. In CSCW’90: Proceedings of the 1990 ACM Conference on Computer Supported Cooperative Work, pages 183–195, New York, NY, USA. ACM.

Neven, F. (2002). Automata theory for XML researchers. SIGMOD Records, 31(3):39–46.

Nichols, D. A., Curtis, P., Dixon, M., and Lamping, J. (1995). High-latency, low-bandwidth windowing in the jupiter collaboration system. In UIST’95: Proceedings of the 8th Annual ACM Symposium on User Interface and Software Technology, pages 111–120, New York, NY, USA. ACM.

Nielsen, J. and Pernice, K. (2009). Eyetracking Web Usability. New Riders Press.

Nørvåg, K. (2002). Temporal query operators in XML databases. In SAC’02: Proceedings of the 2002 ACM Symposium on Applied Computing, pages 402–406, New York, NY, USA. ACM.

Ntoulas, A., Cho, J., and Olston, C. (2004). What’s new on the web?: the evolution of the web from a search engine perspective. In WWW’04: Proceedings of the 13th International Conference on the World Wide Web, pages 1–12, New York, NY, USA. ACM.

Olson, J. S., Olson, G. M., Mack, L. A., and Wellner, P. (1990). Concurrent editing: the group’s interface. In INTERACT’90, Proceedings of the IFIP TC13 3rd International Conference on Human-Computer Interaction, pages 835–840.


Oster, G., Molli, P., Urso, P., and Imine, A. (2006a). Tombstone transformation functions for ensuring consistency in collaborative editing systems. In Proceedings of the International Conference on Collaborative Computing: Networking, Applications and Worksharing, page 38, Los Alamitos, CA, USA. IEEE Computer Society.

Oster, G., Urso, P., Molli, P., and Imine, A. (2006b). Data consistency for p2p collaborative editing. In CSCW’06: Proceedings of the 20th Anniversary Conference on Computer Supported Cooperative Work, pages 259–268, New York, NY, USA. ACM.

Paoli, J., Valet-Harper, I., Farquhar, A., and Sebestyen, I. (2006). ECMA-376 Office Open XML File Formats.

Papadopoulou, S., Ignat, C., Oster, G., and Norrie, M. (2006). Increasing awareness in collaborative authoring through edit profiling. In Proceedings of the International Conference on Collaborative Computing: Networking, Applications and Worksharing, page 35, Los Alamitos, CA, USA. IEEE Computer Society.

Papakonstantinou, Y. and Vianu, V. (2002). Incremental validation of XML documents. In ICDT’03: Proceedings of the 9th International Conference on Database Theory, pages 47– 63, London, UK. Springer.

Pauli, C. (2008). Konzeption und Anwendung von XML-Deltas mit Kontextinformationen. Diplomarbeit, Universität der Bundeswehr, München.

Pearson, P. K. (1990). Fast hashing of variable-length text strings. Communications of the ACM, 33(6):677–680.

Pemberton, S. (2002). XHTMLTM 1.0 the extensible hypertext markup language (second edition). http://www.w3.org/TR/2002/REC-xhtml1-20020801.

Peters, L. (2005). Change detection in XML trees: a survey. In Proceedings of the 3rd Twente Student Conference on IT, Enschede.

Peterson, Z. and Burns, R. (2005). Ext3cow: A time-shifting file system for regulatory compliance. ACM Transactions on Storage, 1(2):190–212.

Philipp, G. (2008). Implementierung eines Fingerprint-gestützten performanten XML-Patch-Tools. Studienarbeit, Universität der Bundeswehr, München.

Philipp, G. (2009). Ein effizientes Diff für XML-Dokumente. Diplomarbeit, Universität der Bundeswehr, München.

Pilato, M. (2004). Version Control With Subversion. O’Reilly & Associates, Inc., Sebastopol, CA, USA.


Pohlemann, M. (2009). Eine automatisierte Testumgebung für XML-Versionierungssysteme. Studienarbeit, Universität der Bundeswehr, München.

Preguiça, N. M., Shapiro, M., and Matheson, C. (2003). Semantics-based reconciliation for collaborative and mobile environments. In DOA/CoopIS/ODBASE’03: Proceedings of the Confederated International Conferences DOA, CoopIS and ODBASE, pages 38–55.

Rabin, M. O. (1981). Fingerprinting by random polynomials. Technical Report TR-CSE-03-01, Center for Research in Computing Technology, Harvard University.

Rivest, R. (1992). The MD5 Message-Digest Algorithm. RFC 1321.

Rochkind, M. J. (1975). The source code control system. IEEE Transactions on Software Engineering, 1(4):364–370.

Rönnau, S. (2004). Versionsverwaltung von XML-Dokumenten am Beispiel von OpenOffice. Diplomarbeit, Universität der Bundeswehr München.

Rönnau, S. and Borghoff, U. M. (2009). Versioning XML-based office documents. Multimedia Tools and Applications, 43(3):253–274.

Rönnau, S., Pauli, C., and Borghoff, U. M. (2008). Merging changes in XML documents using reliable context fingerprints. In DocEng’08: Proceeding of the 8th ACM Symposium on Document Engineering, pages 52–61, New York, NY, USA. ACM.

Rönnau, S., Scheffczyk, J., and Borghoff, U. M. (2005). Towards XML version control of office documents. In DocEng’05: Proceedings of the 5th ACM Symposium on Document Engineering, pages 10–19, New York, NY, USA. ACM.

Rosado, L. A., Márquez, A. P., and Gil, J. M. (2007). Managing branch versioning in versioned/temporal XML documents. In Barbosa, D., Bonifati, A., Bellahsene, Z., Hunt, E., and Unland, R., editors, XSym, volume 4704 of Lecture Notes in Computer Science, pages 107–121, Berlin/Heidelberg, Germany. Springer.

Roundy, D. (2005). Darcs: distributed version management in haskell. In Haskell’05: Proceedings of the 2005 ACM SIGPLAN Workshop on Haskell, pages 1–4, New York, NY, USA. ACM.

Sankoff, D. (1972). Matching sequences under deletion-insertion constraints. Proceedings of the National Academy of Sciences of the United States of America, 69(1):4–6.

Scheffczyk, J. (2004). Consistent Document Engineering. PhD thesis, Universität der Bundeswehr München.

Schlichter, J., Koch, M., and Bürger, M. (1998). Workspace awareness for distributed teams. In Coordination Technology for Collaborative Applications, number 1364 in Lecture Notes in Computer Science (LNCS), pages 199–219, Berlin/Heidelberg, Germany. Springer.


Schubert, E., Schaffert, S., and Bry, F. (2005). Structure-preserving difference search for XML documents. In Proceedings of the Conference on Extreme Markup Languages.

Selkow, S. M. (1977). The tree-to-tree editing problem. Information Processing Letters, 6(6):184–186.

Shapiro, B. A. (1988). An algorithm for comparing multiple RNA secondary structures. Computer Applications in the Biosciences, 4(3):387–393.

Shen, H. and Sun, C. (2002). Flexible merging for asynchronous collaborative systems. In DOA/CoopIS/ODBASE’02: Proceedings of the Confederated International Conferences DOA, CoopIS and ODBASE, pages 304–321, London, UK. Springer.

Smith, R. (1988). GNU diff3.

Song, Y., Bhowmick, S. S., and Dewey, C. F. (2007). BioDIFF: An effective fast change detection algorithm for biological annotations. In Ramamohanarao, K., Krishna, P. R., Mohania, M. K., and Nantajeewarawat, E., editors, DASFAA, volume 4443 of Lecture Notes in Computer Science, pages 275–287, Berlin/Heidelberg, Germany. Springer.

Stein, B. (2005). Fuzzy-fingerprints for text-based information retrieval. In I-KNOW’05: Proceedings of the 5th International Conference on Knowledge Management, pages 572– 579. Journal of Universal Computer Science.

Strunk Jr., W. and White, E. B. (1979). The Elements of Style. Macmillan, New York, NY, 3rd ed. edition.

Sun, C. and Sosič, R. (1999). Optimal locking integrated with operational transformation in distributed real-time group editors. In PODC ’99: Proceedings of the 18th Annual ACM Symposium on Principles of Distributed Computing, pages 43–52, New York, NY, USA. ACM.

Tai, K.-C. (1979). The tree-to-tree correction problem. Journal of the ACM, 26(3):422–433.

Tatarinov, I., Viglas, S. D., Beyer, K., Shanmugasundaram, J., Shekita, E., and Zhang, C. (2002). Storing and querying ordered XML using a relational database system. In SIGMOD’02: Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, pages 204–215, New York, NY, USA. ACM.

Teupel, M. (2008). Konzeption und Entwicklung von Konvertern für Fingerprint-basierte XML-Deltas. Studienarbeit, Universität der Bundeswehr, München.

Teupel, M. (2009). Eine interaktive Merge-GUI für XML-Dokumente. Diplomarbeit, Universität der Bundeswehr, München.


Thao, C. and Munson, E. V. (2010). Using versioned tree data structure, change detection and node identity for three-way xml merging. In Proceedings of the 10th ACM symposium on Document engineering, DocEng ’10, pages 77–86, New York, NY, USA. ACM.

The ODF Alliance (2008). ODF Annual Report 2008.

The Unicode Consortium (1991). The Unicode Standard: Worldwide Character Encoding. Version 1.0. Volumes 1 and 2. Addison-Wesley Professional, Reading, MA, USA.

Tichy, W. F. (1982). Design, implementation, and evaluation of a revision control system. In ICSE ’82: Proceedings of the 6th International Conference on Software Engineering, pages 58–67, Los Alamitos, CA, USA. IEEE Computer Society Press.

Tichy, W. F. (1984). The string-to-string correction problem with block moves. ACM Transactions on Computer Systems, 2(4):309–321.

Tichy, W. F. (1985). RCS—a system for version control. Software: Practice and Experience, 15(7):637–654.

Toland, T. S. (2000). An information retrieval system to manage program maintenance reports in a data processing shop. In ACM-SE’38: Proceedings of the 38th Annual Southeast Regional Conference, pages 81–87, New York, NY, USA. ACM.

Touzet, H. (2005). A linear tree edit distance algorithm for similar ordered trees. In CPM’05: Proceedings of the 16th Annual Symposium on Combinatorial Pattern Matching, pages 334–345, Berlin/Heidelberg, Germany. Springer.

Tridgell, A. (1999). Efficient Algorithms for Sorting and Synchronization. PhD thesis, The Australian National University.

Valiente, G. (2001). An efficient bottom-up distance between trees. In SPIRE’01: Proceedings of the 8th International Symposium on String Processing and Information Retrieval, pages 212–219.

Vion-Dury, J.-Y. (2010). Diffing, patching and merging xml documents: toward a generic calculus of editing deltas. In Proceedings of the 10th ACM symposium on Document engineering, DocEng ’10, pages 191–194, New York, NY, USA. ACM.

Wagner, R. A. and Fischer, M. J. (1974). The string-to-string correction problem. Journal of the ACM, 21(1):168–173.

Wall, L. (1985). patch version 1.3. posted to mod.sources on May, 24.

Wang, D. and Mah, A. (2009). Google wave operational transformation. http://www.waveprotocol.org/whitepapers/operational-transform.


Wang, F. and Zaniolo, C. (2008). Temporal queries and version management in XML-based document archives. Data & Knowledge Engineering, 65(2):304–324.

Wang, F., Zhou, X., and Zaniolo, C. (2006). Bridging relational database history and the web: the xml approach. In WIDM’06: Proceedings of the 8th Annual ACM International Workshop on Web Information and Data Management, pages 3–10, New York, NY, USA. ACM.

Wang, L. and Zhang, K. (2005). Space efficient algorithms for ordered tree comparison. In Deng, X. and Du, D.-Z., editors, ISAAC, volume 3827 of Lecture Notes in Computer Science, pages 380–391, Berlin/Heidelberg, Germany. Springer.

Wang, Y., DeWitt, D., and Cai, J. (2003). X-Diff: A fast change detection algorithm for XML documents. In ICDE’03: Proceedings of the International Conference on Data Engineering.

Weir, R. (2009). OpenDocument Format: The standard for office documents. IEEE Internet Computing, 13:83–87.

Xu, H., Wu, Q., Wang, H., Yang, G., and Jia, Y. (2002). KF-Diff+: Highly efficient change detection algorithm for XML documents. In DOA/CoopIS/ODBASE’02: Proceedings of the Confederated International Conferences DOA, CoopIS and ODBASE, pages 1273–1286, London, UK. Springer.

Zhang, K. (1996a). A constrained edit distance between unordered labeled trees. Algorithmica, 15(3):205–222.

Zhang, K. (1996b). Efficient parallel algorithms for tree editing problems. In CPM ’96: Proceedings of the 7th Annual Symposium on Combinatorial Pattern Matching, pages 361– 372, London, UK. Springer.

Zhang, K. and Shasha, D. (1989). Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6):1245–1262.

Zhang, K., Wang, J. T.-L., and Shasha, D. (1995). On the editing distance between undirected acyclic graphs and related problems. In CPM’95: Proceedings of the 6th Annual Symposium on Combinatorial Pattern Matching, pages 395–407.

Zimmermann, T., Kim, S., Zeller, A., and Whitehead, Jr., E. J. (2006). Mining version archives for co-changed lines. In MSR’06: Proceedings of the 2006 International Workshop on Mining Software Repositories, pages 72–75, New York, NY, USA. ACM.

A Resources for Document Evaluation

This chapter lists the documents used in the document analysis in Chapter 2, including their on-line sources.

A.1 Selection Criteria

The document analysis shall provide a deeper insight into common Web and office documents to allow for the design of a suitable change control architecture. Therefore, the analyzed documents must cover a wide range of real-world scenarios. In my opinion, synthetically created documents are far less suitable than publicly available documents that are commonly used. Therefore, I have selected office documents from public repositories and Web documents from major Web sites.

A.2 Web Documents

The selected Web documents cover four categories:

• Auction sites
• News sites
• Governmental sites
• Blogs

The reason for selecting these categories is as follows. Auction sites are very dense, as they try to display as much information as possible on the screen to attract a broad audience. News sites may be far larger, spanning multiple screen sizes. Here, I also included stock news sites, which are characterized by a high degree of structuring, e.g., by large tables. Governmental sites are usually highly structured and do not contain advertising. Blogs usually contain more text and fewer items. All of these document types are frequently used on the Web. Most of these documents have originally been HTML documents. I have converted them to XHTML using the tool tidy1. Table A.1 lists the analyzed documents.

1http://tidy.sourceforge.net


Site                      URL                           Type
eBay                      www.ebay.com                  auction site
Taobao                    www.taobao.com                auction site
My Hammer                 www.my-hammer.de              auction site
Bridge2B                  www.bridge2b.com              auction site
The New York Times        www.nytimes.com               news site
The Wall Street Journal   online.wsj.com                news site
The Economist             www.economist.com             news site
Wired                     www.wired.com                 news site
Heise Verlag              www.heise.de                  news site
MaxBlue                   www.maxblue.de                stock news
OnVista                   www.onvista.de                stock news
The CIA                   www.cia.gov                   governmental site
The Pentagon              pentagon.afis.osd.mil         governmental site
Die Bundesregierung       www.bundesregierung.de        governmental site
Die Bundeswehr            www.bundeswehr.de             governmental site
Lenovo Blogs              www.lenovoblogs.com           corporate blog
The Apple Blog            www.theappleblog.com          corporate blog
Ich Werde Ein Berliner    www.ichwerdeeinberliner.com   private blog

Table A.1: 18 Web pages of different types have been analyzed.

A.3 Office Documents

Office documents may have arbitrary content and structure. Whereas Web documents are practically limited in content due to the screen size, office documents may reach several hundreds of pages. This requires analyzing more documents to get an overview of different applications. Nevertheless, my evaluation cannot give a comprehensive analysis of all possible office documents.

As I focus my research on texts and spreadsheets, I have taken several documents from public repositories. The text documents are taken from the OpenOffice.org documentation repository at documentation.openoffice.org. In this repository, two types of documents are available: example documents that show the possibilities of OpenOffice (which cover 1-10 pages), and the software documentation. The latter documents are rather large and contain up to 125 pages. Table A.2 lists the analyzed documents.

The analyzed spreadsheets are financial calculations. They range from a simple debt rate calculation with only several cells up to comprehensive financial models for business companies with 13 different sheets. All spreadsheets are freely available at www.exinfm.com/free_spreadsheets.html. Table A.3 gives an overview of the analyzed documents.

Office documents may be nested within each other. For example, a table within a text document would be represented as a (short) sub-document which could also stand alone.


Title                                                   Size in KByte   Subdocuments
Code optimization in the Word Processor                 52.6            0
Creating Large Documents with OpenOffice.org Writer     377.2           0
Curriculum Vitae                                        38.1            0
German Letter (DIN-Brief) Template                      10.4            0
How to Create and Maintain a Table of Contents          113.1           0
How to Create Columns                                   57.5            0
How-to dynamically link an image in a Base Form         13.6            0
How-to Get to Grips with OpenOffice.org Draw            731.5           46
How to Modify the Context Menu in OfficeBean            63.5            0
How to Work with Sections                               151.7           0
How to Work with Templates                              113.0           0
OpenOffice API Training Guide                           755.2           0
Personal Letter Cover                                   9.8             0
Personal References                                     13.0            0
Personal Resume                                         28.0            0
Sample Text Article                                     5.7             0

Table A.2: 16 text documents have been analyzed.

In the analysis, each sub-document is treated as an independent document, as it has its own root node. This way, not only 67, but 178 documents have been analyzed.


Title                                      Size in KByte   Subdocuments
Amortization Table                         716.6           0
Automation Justification                   633.6           8
Capital Budgeting Template                 47.2            0
Car Lease Analysis                         70.9            0
Cash Flow Analysis                         159.0           0
Cash Flow Business Valuation               123.9           0
Cash Flow Matrix                           232.5           0
Cash Flow Matrix 2                         120.7           0
Cash Gap Days Analysis                     130.9           0
CFROI Valuation Model Audit                552.2           0
Cost Estimating Template                   567.4           0
Decision Making & Analysis Framework       598.4           2
Debt Calculation                           45.0            0
Economic Value Added (EVA)                 452.0           0
Entrepreneur Financial Model               16444.2         0
Equity Analysis                            125.3           0
EVA Tree Model                             189.8           1
FCFE Stable Growth Model                   116.2           1
Financial Analysis & Forecasting           1742.3          34
Financial Projections Model                1870.6          0
Implied Risk Premium Calculator            36.2            0
Inflation and the Real Rate of Interest    58.9            0
IT Risk Assesment Template                 206.8           0
Merger & LBO Valuation                     279.5           0
Option Trading Workbook                    730.1           11
Project Management Toolkit                 1808.8          7
Risk Register                              377.3           2
Synthetic Rating Estimation                61.8            0
Synergy Valuation Worksheet                84.3            0
Valuation Model Assesment                  83.9            0
WACC Calculation                           59.6            0
What-If Calculation                        178.9           1
What-If Model                              893.7           0

Table A.3: 33 spreadsheets covering a large range of sizes have been analyzed.

B XML Document Hashing

The XCC framework uses hash values to compare XML nodes and subtrees. This chapter describes the normalization methods used, together with the hashing algorithms.

B.1 Normalization

XML provides the ability to encode one and the same semantic meaning in different syntactic ways. Therefore, two nodes that ought to be compared must be normalized first. The normalization covers the following aspects: namespace resolution, attribute ordering, character encoding, and white-space normalization. I will briefly describe each of these aspects, including the solution used. My approach mostly conforms to the CanonicalXML specification by Boyer[2001].

B.1.1 Namespace Resolution

XML allows for using elements from different XML grammars within one document. Namespaces, as specified by Bray et al.[2006], are used to enable the XML processor to distinguish between these different grammars. Each node has to be assigned to one namespace. References can be used to avoid syntactic overhead, as shown in the following example:
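
A minimal sketch of such a document (the element names and namespace URIs are only illustrative here) could look like this:

  <office:document xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
                   xmlns:forms="http://www.w3.org/2002/xforms">
    <forms:model/>
  </office:document>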

Here, the document contains two namespaces, one for office content (based on ODF) and one for forms (based on XForms). For both namespaces, a reference (named office and forms) is defined at the root node. The root itself is part of the ODF grammar, indicated by the leading office:. The inner node, representing an empty form, is part of XForms. However, the use of a reference is not mandatory. The following document would be fully equivalent:
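
Sketched in the same illustrative way, an equivalent serialization could declare the namespace of the inner node directly at that node instead of using the reference defined at the root:

  <office:document xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0">
    <model xmlns="http://www.w3.org/2002/xforms"/>
  </office:document>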


In this example, the inner node has been described using its qualified name, which means that the reference at the beginning has been resolved. In my approach, each node is hashed upon its qualified name to avoid ambiguities.

B.1.2 Attributes

Attributes of an element node may be listed in arbitrary order. For example, two nodes that differ only in the order of their attributes, such as <node a="1" b="2"/> and <node b="2" a="1"/>, are fully equivalent. In my approach, all attributes are alphabetically ordered before hashing. Additionally, elements may contain attributes from different namespaces. According to the last section, these attributes are hashed upon their qualified name, too.

B.1.3 Encoding

XML documents are required to be encoded in Unicode, an encoding standard that aims to support all existing characters in the world [The Unicode Consortium 1991]. However, different Unicode encodings exist. In the Western hemisphere, UTF-8 is the most common; it is designed for the efficient encoding of Roman characters. In the Eastern hemisphere, UTF-16 is more common; it also supports Chinese and Japanese characters efficiently, but is in turn less efficient for the Roman alphabet, which is far smaller than the Eastern alphabets. Most characters have a different representation in UTF-8 and UTF-16. If two strings in different encodings are compared, one of them has to be converted into the encoding of the other. In my approach, all strings are represented in UTF-16, as this encoding is more efficient for world-wide use.

B.1.4 White-Space Normalization

In XML, white-space characters like blanks and newlines are treated as essential information. However, white-space is often used to structure the XML tree for better readability, also called pretty printing. According to the XML standard, a pretty-printed document is not equivalent to its original representation. Nevertheless, my approach allows for removing any leading and trailing white-space in text nodes, thus normalizing pretty-printed documents. This feature is not mandatory and is disabled by default.


Node Type                 Byte Signature
Element                   0001
Attribute                 0002
Text Node                 0003
Processing Instruction    0007

Table B.1: Different node types are identified by adding a leading byte signature.

B.2 XML Hashing

After transferring the nodes into a normalized representation they can be hashed. My approach supports the hashing of single nodes as well as whole subtrees and tree sequences. First, I describe how nodes are assembled before hashing, followed by the definition of the recursive hashing. After that, I present the hashing algorithms used.

B.2.1 Node Assembly

I have already defined how single nodes and attributes are normalized. If a node is to be hashed, its different parts have to be assembled into one byte array so that the hash algorithm can compute the hash value. XML knows different node types, and two nodes with an identical name but a different node type differ semantically. This has to be respected by the hash function. Therefore, a leading type signature is added for each node. Table B.1 shows the signatures used in my approach; they conform to the signatures used by Maruyama et al.[2000]. After marking the node type, the node is assembled into one byte array, starting with the qualified node name, followed by the attributes in normalized order.

B.2.2 Recursive Hashing

My approach allows for computing hash values over entire subtrees and tree sequences. The hashing is performed recursively. First, each leaf is hashed. After that, these hash values are appended in their respective order to the parent node, separated by the byte 00. The complete byte array is hashed again. This way, the whole subtree can be hashed recursively. For a tree sequence, the hash value is computed over all root elements of the respective subtrees (which are already recursively hashed).
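
A minimal sketch of the assembly and the recursive hashing is given below. The node representation as nested dictionaries and the use of MD5 here are assumptions for illustration; XCC's actual data structures differ:

  import hashlib

  SIG_ELEMENT, SIG_ATTRIBUTE, SIG_TEXT = b"\x00\x01", b"\x00\x02", b"\x00\x03"  # cf. Table B.1
  SEPARATOR = b"\x00"

  def assemble(node):
      """Assemble a normalized element node into one byte array: type signature,
      qualified name, and the attributes in alphabetical order."""
      parts = [SIG_ELEMENT, node["qname"].encode("utf-16-be")]
      for name in sorted(node.get("attributes", {})):
          value = node["attributes"][name]
          parts += [SIG_ATTRIBUTE, f'{name}="{value}"'.encode("utf-16-be")]
      return b"".join(parts)

  def subtree_hash(node):
      """Hash leaves first, then each parent over its own bytes plus the child
      hashes in document order, separated by a 00 byte."""
      if "text" in node:  # text leaf
          return hashlib.md5(SIG_TEXT + node["text"].encode("utf-16-be")).digest()
      data = assemble(node)
      for child in node.get("children", []):
          data += SEPARATOR + subtree_hash(child)
      return hashlib.md5(data).digest()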

B.2.3 Hash Algorithms

Basically, any hash algorithm can be used to compute a hash value. The current implementation of the XCC framework allows for two different algorithms: MD5 and FNV.


MD5 [Rivest 1992] is widely established as a fast and reliable hash algorithm with efficient implementations on most platforms. It is a cryptographic hash function that computes hash values of 128 bit. Cryptographic hash functions are especially designed for a high orthogonality of similar values to prevent attacks on the encryption. In my use case, it suffices that changes are detected by the hash algorithm; a high orthogonality is not necessary. Therefore, the current implementation uses the FNV hash algorithm by default1. This algorithm does not meet cryptographic requirements, but is highly efficient to compute. It resembles the hash algorithm presented by Pearson[1990] and is used in many scenarios where large data sets have to be hashed efficiently2. The current implementation relies on 32bit hash values.
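
For illustration, a 32-bit FNV variant can be written in a few lines; shown here is FNV-1a, while the exact variant and byte sequence used by XCC are implementation details not repeated in this sketch:

  def fnv1a_32(data: bytes) -> int:
      """32-bit FNV-1a: start from the offset basis, XOR in each byte,
      then multiply by the FNV prime, keeping 32 bits."""
      h = 0x811C9DC5
      for byte in data:
          h = ((h ^ byte) * 0x01000193) & 0xFFFFFFFF
      return h

  print(hex(fnv1a_32("pending".encode("utf-16-be"))))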

B.3 Related Work

Normalization has already been discussed earlier, leading to the definition of CanonicalXML by Boyer[2001]. My approach is an extended version of it, as it additionally allows for omitting trailing and leading white-space. Hashing of XML has been discussed before as well, leading to the definitions of DOMHash by Maruyama et al.[2000] and XML-Signature by Eastlake et al.[2008]. DOMHash assigns a hash value to each XML element, where the hashes are computed recursively over the subtree below that element. Hence, a change in one node affects all nodes on the path up to the root element. A single node within the tree cannot be hashed, which does not meet my requirements. Furthermore, DOMHash does not allow for addressing text nodes directly: if a text node shall be hashed, its parent node must be addressed, too. XML-Signature was designed to create a signature for a whole document or for single parts of it. This approach has some major drawbacks for my scenario, too. First, each signature holds a substantial header. Second, the header must contain a path expression defining the signed part of the document, or it must contain that part itself. The latter is obviously useless when using hash values for space-saving reasons. The former is not very useful either, because in my use case, a fingerprint should be matched with its counterpart, which has probably moved in terms of an absolute path.

1Fowler / Noll / Vo (FNV) Hash, http://www.isthe.com/chongo/tech/comp/fnv/index.html 2http://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function

C XCC Delta Specification

This chapter defines the file format of the context-oriented deltas used by the XCC framework. The rules are defined in natural language.

1. The root node is called delta.

2. The delta may contain an arbitrary number of operations, named insert, delete, or update.

3. Each operation carries the following attributes:
   a) path contains the path to the edit operation.
   b) radius denotes the radius of the fingerprint as a natural number.
   c) digester refers to the hashing algorithm used; its value can be fnv or md5.
   d) Insert and delete operations may have the additional attribute move that contains the ID of the corresponding move operation.

4. Each operation consists of the following elements:
   a) The fingerprint contains a text node consisting of the hash values of the fingerprint in hexadecimal notation, separated by semicolons.
   b) The subtree or tree sequence to delete, or the node to update, is stored as a child of oldvalue. In case of an insert operation, this element may be omitted.
   c) The subtree or tree sequence to insert, or the root of the new value of the updated node, is stored as a child of newvalue. In case of a delete operation, this element may be omitted.

Figure C.1 shows an example delta that is computed in the example in Appendix D. This definition is comprehensive, yet not very formal. One might expect a formal definition, e.g., in EBNF or XML Schema. However, this is not very useful for the following reason: the content of the edit operations is not known in advance. This content would have to be validated against the grammar of the compared document. Even if this grammar were known in advance, these parts could not be validated, as they consist of only a fraction of the document. For insert and delete operations, a corresponding production rule could be found. As update operations only store the updated node, its descendants are not part of the delta, which would in turn break the validation. For these reasons, even if I provided a more formal definition, it could not be used for validation. Therefore, I omit this step.


<delta>
  <insert digester="fnv" path="/2" radius="4">
    <fingerprint>c387e295;7a514983;662e3aec;bd3e2f21;48ebf6de;d7ac76f0;0e284785;9a31afad;09ec381d</fingerprint>
    <newvalue>
      <p style="bold">We got two papers accepted</p>
    </newvalue>
  </insert>
  <delete digester="fnv" path="/6" radius="4">
    <fingerprint>662e3aec;1334ff2d;662e3aec;e1388821;c6948277;e1d08bf3;e1d08bf3;e1d08bf3;e1d08bf3</fingerprint>
    <oldvalue>
      <p>DocEng papers are submitted</p>
    </oldvalue>
  </delete>
  <update digester="fnv" path="/2/0/0" radius="4">
    <fingerprint>662e3aec;bd3e2f21;d7ac76f0;0e284785;9a31afad;09ec381d;03ee618f;ea603659;c015f5c8</fingerprint>
    <oldvalue>
      <column orientation="left" title="Conference"/>
    </oldvalue>
    <newvalue>
      <column orientation="center" title="Conference"/>
    </newvalue>
  </update>
  <update digester="fnv" path="/2/1/3/0" radius="4">
    <fingerprint>a9b8034c;746ebc4a;a600cea3;746ebc4a;40894c87;c015f5c8;746ebc4a;47b30b11;746ebc4a</fingerprint>
    <oldvalue>pending</oldvalue>
    <newvalue>accepted</newvalue>
  </update>
  <update digester="fnv" path="/2/2/3/0" radius="4" shortened="true">
    <fingerprint>834f6e47;746ebc4a;226afc6a;746ebc4a;40894c87;c015f5c8;746ebc4a;eb17031c;746ebc4a</fingerprint>
    <oldvalue>pending</oldvalue>
    <newvalue>accepted</newvalue>
  </update>
  <update digester="fnv" path="/4/0" radius="4">
    <fingerprint>d8f56d3b;e98a5cfe;e864cfac;662e3aec;1334ff2d;662e3aec;e1388821;662e3aec;ea2cb0ac</fingerprint>
    <oldvalue>DocEng notification is late.</oldvalue>
    <newvalue>DocEng reviews are there. Revise until July, 8.</newvalue>
  </update>
</delta>

Figure C.1: An example delta containing insert, delete, and update operations.

D A Running Example

In this chapter, I show the procedure of differencing, patching, and merging on an example document. The main goal is to give a deeper understanding of the algorithms used in the XCC framework.

D.1 Setting

The example document contains a list of research activities and is shown in Figure D.1. This document A is changed independently into two versions, namely A1 and A2. Both versions are displayed in Figure D.2. In the user model, all changes are non-interfering. Figure D.3 shows the corresponding XML tree, including all changes performed in A1 and A2. The presented tree has been highly simplified to exemplify the essentials of the XCC algorithms; it does not conform to a common XML document format.

In the following, I will exemplify the computation of the deltas δA→A1 and δA→A2, as well as the merging by applying δA→A2 to A1.
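For reference, the following sketch reconstructs a possible XML serialization of A from Figures D.1 and D.3 and shows how the child-index paths used in the deltas address its nodes. The markup is my own reconstruction and an assumption; note that ElementTree does not model text children as separate nodes, so text paths such as /4/0 cannot be resolved this way.

```python
# Reconstruction of the example document A (an assumption based on
# Figures D.1 and D.3), plus resolution of the child-index paths used in
# the deltas.
import xml.etree.ElementTree as ET

A = ET.fromstring("""
<document>
  <h1>Research in 2010</h1>
  <p>This document describes our research in 2010</p>
  <table>
    <columns>
      <column orientation="left" title="Conference"/>
      <column orientation="left" title="Paper Type"/>
      <column orientation="left" title="Title"/>
      <column orientation="center" title="Status"/>
    </columns>
    <row>
      <column>DocEng 2010</column> <column>Demo</column>
      <column>Document Archival via Auto-Versioning File-Systems</column>
      <column>pending</column>
    </row>
    <row>
      <column>DocEng 2010</column> <column>Short Paper</column>
      <column>A File-Type Sensitive, Auto-Versioning File System</column>
      <column>pending</column>
    </row>
    <row>
      <column>Springer CSRD</column> <column>Journal</column>
      <column>XCC: Change Control of XML Documents</column>
      <column>under revision</column>
    </row>
  </table>
  <h2>Comments</h2>
  <p>DocEng notification is late</p>
  <p>CSRD reviews are there</p>
  <p>DocEng papers submitted in time</p>
</document>
""")

def resolve(root, path):
    """Follow a child-index path such as /2/0/0 down the element tree."""
    node = root
    for index in path.strip("/").split("/"):
        node = list(node)[int(index)]
    return node

print(resolve(A, "/2/0/0").attrib)   # column targeted by the conflicting update
print(resolve(A, "/2/3/1").text)     # 'Journal', target of A2's style update
```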

Research in 2010

This document describes our research in 2010

Conference      Paper Type    Title                                                 Status
DocEng 2010     Demo          Document Archival via Auto-Versioning File-Systems    pending
DocEng 2010     Short Paper   A File-Type Sensitive, Auto-Versioning File System    pending
Springer CSRD   Journal       XCC: Change Control of XML Documents                  under revision

Comments

DocEng notification is late
CSRD reviews are there
DocEng papers submitted in time

Figure D.1: The user representation of the example document.


Research in 2010

This document describes our research in 2010
We got two papers accepted

Conference      Paper Type    Title                                                 Status
DocEng 2010     Demo          Document Archival via Auto-Versioning File-Systems    accepted
DocEng 2010     Short Paper   A File-Type Sensitive, Auto-Versioning File System    accepted
Springer CSRD   Journal       XCC: Change Control of XML Documents                  under revision

Comments

DocEng reviews are there. Revise until July, 8.
CSRD reviews are there

(a) A1 adds a new line at the top, changes the orientation of the first column, changes the status of two papers, and updates the comments.

Research in 2010

This document describes our research in 2010

Venue           Paper Type    Title                                                 Status
DocEng 2010     Demo          Document Archival via Auto-Versioning File-Systems    pending
DocEng 2010     Short Paper   A File-Type Sensitive, Auto-Versioning File System    pending
Springer CSRD   Journal       XCC: Change Control of XML Documents                  submitted

Comments

Submitted revised CSRD paper
DocEng notification is late
CSRD reviews are there
DocEng papers submitted in time

(b) A2 changes the title of the first column, changes style and status for one paper, and adds a line to the comments.

Figure D.2: The document is edited independently, leading to two versions, A1 and A2.

D.2 Differencing

The changes between A and A1 comprise the insertion of a subtree, leaf updates, and the deletion of a subtree. In the first step, the LCS of all leaves is computed, resulting in the matchlist shown in Table D.1. The leaves that are only part of A indicate a deletion (see Table D.2). Table D.3 shows the leaves that are only part of A1, indicating an insertion. After computing the LCS, the parent nodes of all matched node pairs are compared for



Figure D.3: The XML representation of the original document, showing the operations per- formed in both versions.


Matched Leaves
Research in 2010
This document describes our research in 2010
column orientation="left" title="Paper Type"
column orientation="left" title="Title"
column orientation="center" title="Status"
DocEng 2010
Demo
Document Archival via Auto-Versioning File-Systems
DocEng 2010
Short Paper
A File-Type Sensitive, Auto-Versioning File System
Springer CSRD
Journal
XCC: Change Control of XML Documents
under revision
Comments
CSRD reviews are there

Table D.1: Leaves that exist in both versions.

Leaves only in A
column orientation="left" title="Conference"
pending
pending
DocEng notification is late
DocEng papers submitted in time

Table D.2: Leaves that are possibly deleted.

Leaves only in A1
We got two papers accepted
column orientation="center" title="Conference"
accepted
accepted
DocEng reviews are there. Revise until July, 8.

Table D.3: Leaves that are possibly inserted.

updates or changes in the parent-child relationship. Here, no change occurred. All parent nodes are marked visited, as shown in Figure D.4. This finishes the second step of the diff algorithm. In the third step, structure-affecting changes are identified. One of the key issues of this step is the identification of leaf updates. Basically, leaf updates are detected by comparing the siblings of the non-matched nodes. In this example, however, only the leaf column orientation="center" title="Conference" has siblings that could be compared. The other non-matching leaves are text nodes, which often have no siblings. In that case, the algorithm compares the parent nodes. If they match, a leaf update is assumed. In the given example, all parent nodes match, leading to the creation of leaf updates for three leaves. The first unmatched leaf of A1 (“We got two papers accepted”), however, has no counterpart in A. This also holds for its parent node. Therefore, an insert operation rooted at the parent node is created. The same procedure is performed for the unmatched leaf of A (“DocEng papers submitted in time”), which is classified as a deletion.
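The following sketch illustrates the first and the third step on generic trees. It is a simplified illustration rather than the actual XCC Diff code: the Node class is an assumption, difflib's SequenceMatcher stands in for the LCS computation, and only the parent-based part of the leaf-update heuristic is shown.

```python
# Sketch of leaf matching (step 1) and the leaf-update heuristic (step 3).
from difflib import SequenceMatcher

class Node:
    def __init__(self, value, parent=None):
        self.value, self.parent, self.children = value, parent, []
        if parent is not None:
            parent.children.append(self)

def leaves(root):
    """All leaves of the tree in document order."""
    if not root.children:
        return [root]
    return [leaf for child in root.children for leaf in leaves(child)]

def match_leaves(a_root, b_root):
    """Step 1: match the leaves of both versions via an LCS over their values
    (SequenceMatcher is used here as a stand-in for the LCS computation)."""
    a_leaves, b_leaves = leaves(a_root), leaves(b_root)
    lcs = SequenceMatcher(None, [n.value for n in a_leaves],
                                [n.value for n in b_leaves])
    matching = {}
    for i, j, size in lcs.get_matching_blocks():
        for k in range(size):
            matching[a_leaves[i + k]] = b_leaves[j + k]
    return matching, a_leaves, b_leaves

def classify_unmatched(a_leaf, b_unmatched, matching):
    """Step 3 heuristic: an unmatched leaf whose parent corresponds to the
    parent of an unmatched leaf of the other version is a leaf update;
    otherwise, a delete (or insert) rooted at the parent is created."""
    parent_match = {a.parent: b.parent for a, b in matching.items()}
    for b_leaf in b_unmatched:
        if parent_match.get(a_leaf.parent) is b_leaf.parent:
            return ("update", a_leaf, b_leaf)
    return ("delete", a_leaf, None)
```

Applied to A and A1, this heuristic classifies the three unmatched text leaves of A as updates and the last comment paragraph as a deletion, as described above.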

The complete delta δA→A1, using FNV hashes and a fingerprint radius of 4, is shown in Figure D.5. Here, the hash values of the delete operation illustrate some important aspects of the delta model. First, the deletion occurs at the border of the document. Therefore, the fingerprint is filled with a special null node, which has the hash value e1d08bf3. Second, all fields right of the anchor (which is the middle field) refer to the null node, although the anchor refers to path /6, which means that one node follows in document order (its child node). Here, all elements addressed by the delete operation have been removed, so that the root of the subtree to delete appears to be the last node of the document. This is done to prevent the fingerprint from referring to the nodes to be deleted, which would match anyway in a non-conflicting merge scenario, thus lowering the information value of the fingerprint, especially for larger subtrees to delete.
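A minimal sketch of this fingerprint construction is given below. It assumes that the document is available as a flat list of node hash values in document order; NULL_HASH stands for the hash of the special null node, and the skip parameter models the exclusion of the nodes to be deleted. This is an illustration of the principle, not the actual XCC code.

```python
# Sketch of fingerprint construction with null-node padding.
# Assumption: the document is a flat list of node hashes in document order;
# NULL_HASH stands for the hash of the special null node (e1d08bf3 in the
# FNV-based example of Figure D.5).
NULL_HASH = "e1d08bf3"

def fingerprint(hashes, anchor, radius, skip=0):
    """Return the 2*radius+1 hash values around position `anchor`.
    For delete operations, `skip` descendants of the anchor are excluded,
    so that the fingerprint does not refer to the nodes to be deleted."""
    fields = []
    for offset in range(-radius, radius + 1):
        pos = anchor + offset
        if offset > 0:
            pos += skip              # jump over the subtree to delete
        fields.append(hashes[pos] if 0 <= pos < len(hashes) else NULL_HASH)
    return fields

# For the delete at path /6, the anchor is the last subtree of the document
# and its single descendant is skipped, so all fields right of the anchor
# are filled with NULL_HASH, as in Figure D.5.
```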

The computation of δA→A2 is nearly the same, with one difference: in this scenario, a non-leaf update occurs at path /2/3/1, where the attribute style with the value italic is inserted. Figure D.6 shows the complete delta.

D.3 Merging

After computing both deltas, δA→A2 shall be applied to A1 to merge both document versions. When comparing the paths of the edit operations in both deltas, one can easily detect a conflict on node /2/0/0. In the following, I exemplify the behavior of my patch procedure when merging this update operation.

The patch algorithm identifies path /2/0/0 in δA→A2 and tries to compute the fingerprint around this node in the document to patch. Due to the insertion of a subtree at the beginning of A1, the path /2/0/0 does not exist anymore. Therefore, the patch algorithm has to compute the neighborhood around that path. Recall that in the current setting, only nodes on the same top-down level are considered for the neighborhood. Basically, the neighborhood is computed in both directions of the document order.



Figure D.4: After the first bottom-up traversal, most nodes are already visited (showing the tree of A1, visited nodes in gray).


<delta>
  <insert digester="fnv" path="/2" radius="4">
    <fingerprint>c387e295;7a514983;662e3aec;bd3e2f21;48ebf6de;d7ac76f0;0e284785;9a31afad;09ec381d</fingerprint>
    <newvalue>
      <p style="bold">We got two papers accepted</p>
    </newvalue>
  </insert>
  <delete digester="fnv" path="/6" radius="4">
    <fingerprint>662e3aec;1334ff2d;662e3aec;e1388821;c6948277;e1d08bf3;e1d08bf3;e1d08bf3;e1d08bf3</fingerprint>
    <oldvalue>
      <p>DocEng papers are submitted</p>
    </oldvalue>
  </delete>
  <update digester="fnv" path="/2/0/0" radius="4">
    <fingerprint>662e3aec;bd3e2f21;d7ac76f0;0e284785;9a31afad;09ec381d;03ee618f;ea603659;c015f5c8</fingerprint>
    <oldvalue>
      <column orientation="left" title="Conference"/>
    </oldvalue>
    <newvalue>
      <column orientation="center" title="Conference"/>
    </newvalue>
  </update>
  <update digester="fnv" path="/2/1/3/0" radius="4">
    <fingerprint>a9b8034c;746ebc4a;a600cea3;746ebc4a;40894c87;c015f5c8;746ebc4a;47b30b11;746ebc4a</fingerprint>
    <oldvalue>pending</oldvalue>
    <newvalue>accepted</newvalue>
  </update>
  <update digester="fnv" path="/2/2/3/0" radius="4" shortened="true">
    <fingerprint>834f6e47;746ebc4a;226afc6a;746ebc4a;40894c87;c015f5c8;746ebc4a;eb17031c;746ebc4a</fingerprint>
    <oldvalue>pending</oldvalue>
    <newvalue>accepted</newvalue>
  </update>
  <update digester="fnv" path="/4/0" radius="4">
    <fingerprint>d8f56d3b;e98a5cfe;e864cfac;662e3aec;1334ff2d;662e3aec;e1388821;662e3aec;ea2cb0ac</fingerprint>
    <oldvalue>DocEng notification is late.</oldvalue>
    <newvalue>DocEng reviews are there. Revise until July, 8.</newvalue>
  </update>
</delta>

Figure D.5: The complete delta δA→A1.


<delta>
  <insert digester="fnv" path="/4" radius="4">
    <fingerprint>746ebc4a;d8f56d3b;e98a5cfe;e864cfac;1f7f4b76;662e3aec;1334ff2d;662e3aec;e1388821</fingerprint>
    <newvalue>
      <p>Submitted revised CSRD paper</p>
    </newvalue>
  </insert>
  <update digester="fnv" path="/2/3/1" radius="4">
    <fingerprint>40894c87;c015f5c8;746ebc4a;eb17031c;746ebc4a;b0c694e7;746ebc4a;387a9b01;746ebc4a</fingerprint>
    <oldvalue>
      <column/>
    </oldvalue>
    <newvalue>
      <column style="italic"/>
    </newvalue>
  </update>
  <update digester="fnv" path="/2/0/0" radius="4">
    <fingerprint>662e3aec;bd3e2f21;d7ac76f0;0e284785;9a31afad;09ec381d;03ee618f;ea603659;c015f5c8</fingerprint>
    <oldvalue>
      <column orientation="left" title="Conference"/>
    </oldvalue>
    <newvalue>
      <column orientation="left" title="Venue"/>
    </newvalue>
  </update>
  <update digester="fnv" path="/2/3/3/0" radius="4">
    <fingerprint>b0c694e7;746ebc4a;387a9b01;746ebc4a;d8f56d3b;e98a5cfe;e864cfac;662e3aec;1334ff2d</fingerprint>
    <oldvalue>under revision</oldvalue>
    <newvalue>submitted</newvalue>
  </update>
</delta>

Figure D.6: The complete delta δA→A2.

In the present example, however, no node with the same depth exists before the stored path. Therefore, the neighborhood can only consist of nodes coming after it. Only one node of the neighborhood provides a partial match for the given fingerprint, as shown in Figure D.7. The patch algorithm expects this path to be the correct one. However, the anchor node does not match, due to the conflict. Therefore, the edit operation is not applied to the document.

Insert operations cannot rely on their anchor for determining the correct insert position. Here, only the surrounding nodes stored in the delta can be used. Figure D.8 shows how the patch procedure deals with the insert operation of δA→A2. Once again, the path points to the wrong address, caused by the insertion at the top of the document. Therefore, the patch procedure creates the neighborhood around path /4. As can be seen, the neighborhood covers nearly the whole document, as only nodes of the same depth are considered as possible insert paths. The best match is found at /5, which is in fact the correct path. However, it does not match completely, as the subsequent paragraph has been updated.
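The neighborhood search and the anchor check described above can be sketched as follows. Again, this is a simplified illustration under the assumption that the document is available as flat lists of node hashes and node depths in document order, and that each operation carries its kind, path, and fingerprint; it is not the actual XCC Patch code.

```python
# Sketch of the neighborhood search during patching.
NULL_HASH = "e1d08bf3"   # hash of the null node, as in the earlier sketch

def match_quality(stored_fp, hashes, pos, radius):
    """Fraction of fingerprint fields that match around position pos."""
    hits = 0
    for offset in range(-radius, radius + 1):
        p = pos + offset
        doc_hash = hashes[p] if 0 <= p < len(hashes) else NULL_HASH
        hits += stored_fp[offset + radius] == doc_hash
    return hits / (2 * radius + 1)

def find_target(op, hashes, depths, radius):
    """Search the neighborhood of op.path for the best-matching position.
    Only nodes on the same depth as the stored path are candidates."""
    depth = op.path.count("/")
    candidates = [i for i, d in enumerate(depths) if d == depth]
    if not candidates:
        return None
    best = max(candidates,
               key=lambda i: match_quality(op.fingerprint, hashes, i, radius))
    # Updates and deletes additionally require the anchor (the middle field)
    # to match; otherwise the operation is in conflict and is not applied.
    if op.kind != "insert" and op.fingerprint[radius] != hashes[best]:
        return None
    return best
```

In the running example, the best partial match for the conflicting update is found but rejected because its anchor differs; for the insert at /4, the anchor check is skipped and the best-matching position /5 is chosen, as described above.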

After the complete application of δA→A2 to A1, the merged document is shown in Figure D.9.



Figure D.7: Merging the conflicting change.



Figure D.8: Merging the insert operation.


Figure D.10 shows the corresponding user representation. As can be seen, all changes except the conflicting one have been applied.

D.4 Conclusions

In this chapter, I have shown a running example of the XCC framework. The computation of the deltas has been exemplified, as well as the subsequent merging. In this example, one edit operation conflicts, which is detected by XCC Patch. This conflict could be resolved manually, or by overwriting the conflicting nodes. In case of conflicts, however, merging is not symmetric.

In the given example, merging A2 with δA→A1 would lead to a different merge result with respect to the conflicting node.



Figure D.9: The merged document in XML representation.


Research in 2010

This document describes our research in 2010
We got two papers accepted

Conference      Paper Type    Title                                                 Status
DocEng 2010     Demo          Document Archival via Auto-Versioning File-Systems    accepted
DocEng 2010     Short Paper   A File-Type Sensitive, Auto-Versioning File System    accepted
Springer CSRD   Journal       XCC: Change Control of XML Documents                  submitted

Comments

Submitted revised CSRD paper
DocEng reviews are there. Revise until July, 8.
CSRD reviews are there

Figure D.10: The merged document in its user model.
