Storing and Querying Large XML Instances

Christian Grün

Dissertation submitted for the academic degree of Doctor of Natural Sciences (Dr. rer. nat.)

Department of Computer and Information Science
Section of Mathematics and Natural Sciences
Universität Konstanz

Referees: Prof. Dr. Marc H. Scholl, Prof. Dr. Marcel Waldvogel
Date of the oral examination: 22 December 2010

Abstract

After its introduction in 1998, XML quickly emerged as the de facto exchange format for textual data. Only ten years later, the amount of information processed day by day, locally and globally, has virtually exploded, and no end is in sight. Correspondingly, many XML documents and collections have become much too large to be retrieved in their raw form, and this is where database technology comes into play.

This thesis describes the design of a full-fledged XML storage and query architecture, which represents the core of the open source database system BaseX. In contrast to numerous other works on XML processing, which focus either on theoretical aspects or on practical implementation details, we have tried to bring the two worlds together: well-established and novel concepts from database technology and compiler construction are consolidated into a powerful and extensible software architecture that is intended both to withstand the demands of complex real-life applications and to comply with all the intricacies of the W3C Recommendations.

In the Storage chapter, existing tree encodings are explored, which allow XML documents to be mapped to a database. The Pre/Dist/Size triple is chosen as the most suitable encoding and further optimized by merging all XML node properties into a single tuple, compacting redundant information, and inlining attributes and numeric values. The address ranges of numerous large-scale, real-life XML instances are analyzed to find an optimal tradeoff between maximum document size and minimum database size.
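As a rough illustration of the Pre/Dist/Size triple named above, the following Python sketch maps the element nodes of a small document to tuples. It assumes the usual reading of the three values, which the abstract itself does not spell out: pre is the position of a node in document order, dist is the distance between the pre values of a node and its parent, and size is the number of nodes in the subtree including the node itself. The function names `encode`, `parent` and `descendants` are hypothetical, text and attribute nodes are ignored, and none of the further optimizations (inlining, compaction) are modeled; this is not the BaseX implementation.

```python
import xml.etree.ElementTree as ET


def encode(xml_text):
    """Map the element nodes of an XML string to (pre, dist, size, name) rows.

    Assumed definitions (not spelled out in the abstract):
      pre  - position of the node in document order, starting at 0
      dist - pre value of the node minus the pre value of its parent
             (0 for the root, an arbitrary convention of this sketch)
      size - number of nodes in the subtree rooted at the node, itself included
    """
    table = []

    def visit(node, parent_pre):
        pre = len(table)
        dist = 0 if parent_pre is None else pre - parent_pre
        table.append([pre, dist, 1, node.tag])  # size is patched after the children
        for child in node:
            visit(child, pre)
        # every row appended since this node was visited belongs to its subtree
        table[pre][2] = len(table) - pre

    visit(ET.fromstring(xml_text), None)
    return [tuple(row) for row in table]


def parent(rows, pre):
    """Pre value of the parent node, or None for the root."""
    return None if pre == 0 else pre - rows[pre][1]


def descendants(rows, pre):
    """Pre values of all descendants: a contiguous range right after the node."""
    return list(range(pre + 1, pre + rows[pre][2]))


rows = encode("<a><b><c/></b><d/></a>")
for row in rows:
    print(row)
```

Under these assumptions, the parent and descendant relationships reduce to simple arithmetic on a single row, which hints at why a flat tuple table can stand in for the tree.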
The process of building a database is described in detail, including the import of tree data other than XML and the creation of main memory database instances. As one of the distinguishing features, the resulting storage is enriched by lightweight structural, value and full-text indexes, which speed up query processing by orders of magnitude.

The Querying chapter opens with a survey of state-of-the-art XML query languages. We give some insight into the design of an XQuery processor and then focus on the optimization of queries. Besides classical concepts, such as constant folding or static typing, many optimizations are specific to XML: location paths are rewritten to access fewer XML nodes, and FLWOR expressions are reorganized to reduce the algorithmic complexity. A unique feature of our query processor is the dynamic rewriting of location paths to take advantage of available index structures. Next, we examine the evaluation of queries and propose an adaptive approach that benefits from both the iterative and the atomic processing paradigm. Based on the evaluation of location paths, it is illustrated how databases are accessed by the query processor. The concluding summary gives an overview of the optimizations that have been applied to the most important XQuery expressions.

In the Performance chapter, we demonstrate the efficiency and scalability of the resulting database system BaseX. The storage and query capabilities are tested and compared with other database systems and query processors. The benchmark results show that the proposed architecture, and the interplay between its storage and query components, exhibits some qualities that are, to the best of our knowledge, unique and unprecedented among comparable products.

Zusammenfassung (German Abstract)

After XML saw the light of day in 1998, it very quickly developed into the de facto standard for the exchange of textual data.
Only ten years later, the volumes of information that are processed daily, locally and globally, have exploded, and no end to this development is in sight. As a consequence, many XML documents and collections have long since become too large to be queried in their raw form, and this is where database technology comes into play.

This dissertation describes the design of a mature XML storage and query architecture, which also forms the core of the open source database system BaseX. In contrast to numerous other publications on XML, which are devoted either to theoretical aspects or to practical implementation details, this work attempts to bring the two worlds together: well-known and novel concepts from database technology and compiler construction form the basis of a powerful and open software architecture that is meant both to withstand the demands of complex, real-world applications and to observe and comply with the subtleties of the W3C Recommendations.

In the Storage chapter, existing tree encodings are examined that make it possible to store XML documents in databases. The Pre/Dist/Size triple is selected as the most suitable encoding and optimized further: all properties of an XML node are mapped to a single tuple, redundant information is compacted, and attributes and numeric values are inlined, i.e., stored directly within the tuples. The address ranges of numerous large, real-world XML instances are analyzed to find an optimal compromise between maximum document size and minimum database size. The creation of new database instances is presented in detail; main-memory-oriented databases and hierarchical data types other than XML are considered as well. A special feature of the storage architecture discussed is its extension by lightweight structure-, content- and full-text-based index structures, which can speed up queries by several orders of magnitude.

The Querying chapter begins with an overview of the relevant XML query languages and describes the structure of an XQuery processor. The optimization of queries then takes center stage. Classical techniques such as constant folding or static typing are complemented by XML-specific optimizations: document paths are rewritten to reduce the number of addressed XML nodes, and FLWOR expressions are reorganized to lower the algorithmic cost. A unique feature of the presented query processor is the flexible rewriting of document paths for index-based queries. Next, the evaluation of queries is examined, and an adaptive approach is presented that combines the advantages of iterative and atomic query processing. The evaluation of document paths is used to illustrate how the database is accessed. The concluding overview summarizes the optimization steps that have been applied to the most important XQuery expressions.

The efficiency and scalability of the database system BaseX is the focus of the Performance chapter. The storage and query architectures are analyzed separately and compared with other database systems and query processors. The results are intended to demonstrate that the presented architecture and the interplay between the storage and query components have certain qualities that are, to the best of our knowledge, unique among comparable products.
Acknowledgments

Most certainly, this thesis would not have been completed without the continuous help, support and inspiration of several people, whom I am pleased to mention in the following.

First of all, I owe my deepest gratitude to my supervisor Marc H. Scholl, who has given me all the time and freedom I could possibly have asked for to develop and pursue my own ideas, a privilege that I know many postgraduates can only dream of. At the same time, Marc has always had time for discussions, and I learned a lot from both his guidance and his vast expertise. Whenever I had doubts whether I was on the right path (or any path at all), it was Marc who backed me and encouraged me to go on.

Next, I would like to thank Marcel Waldvogel and his disy Group. The exchange between his group and ours consisted of numerous fruitful debates and joint publications and, as I believe, brought the work of all of us forward more quickly. Another thank you is directed to Harald Reiterer, who was the first in Konstanz to get me enthusiastic about scientific work. The cooperation between his HCI Group and ours continues to the present day.

It was my colleague Alexander Holupirek with whom I shared the most prolific ideas during the last years, and some more drinks in the evenings. He gave me regular feedback on my flights of fancy (or figments), and many of the contributions presented in this work are due to his invaluable inspiration. I am also indebted to Marc Kramis, whose visionary approach has taught me to remain open to new ideas, and Sebastian Graf, who has triggered our most recent cooperation with the disy Group.

The collaboration with all the students working in my project was one of the most fulfilling experiences, and I learnt a lot about what it means to lead a project, and how productive real teamwork can be.
In particular, I’d like to say thank you to Volker Wildi, Tim Petrowski, Sebastian Gath, Bastian Lemke, Lukas Kircher, Andreas Weiler, Jörg Hauser, Michael Seiferle, Sebastian Faller,
