Evaluation of XPath Queries against XML Streams
Dan Olteanu

Dissertation for the degree of Doctor of Natural Sciences (Doktor der Naturwissenschaften) at the Faculty of Mathematics, Computer Science and Statistics of the Ludwig-Maximilians-Universität München

submitted by Dan Olteanu

Munich, December 2004

First reviewer: François Bry
Second reviewer: Dan Suciu (University of Washington)
Date of oral examination: 11 February 2005

To my wife Flori

Abstract

XML is nowadays the de facto standard for electronic data interchange on the Web. Available XML data ranges from small Web pages to ever-growing repositories of, e.g., biological and astronomical data, and even to rapidly changing and possibly unbounded streams, as used in Web data integration and publish-subscribe systems.

Given the ubiquity of XML data, the basic task of XML querying is gaining great theoretical and practical importance. Recent years have witnessed efforts by practitioners and theoreticians alike towards defining an appropriate XML query language. At the core of this common effort lies a navigational approach to locating information in XML data, embodied in a simple and practical query language called XPath [46].

This work brings together the two aforementioned "worlds", i.e., XPath query evaluation and XML data streams, and shows both the theoretical and the practical relevance of this fusion. Traditional database management systems cannot cover this setting, because they are not designed for rapid and continuous loading of individual data items and do not directly support the continuous queries that are typical of stream applications [17].

The first central contribution of this work is the definition and theoretical investigation of three term rewriting systems that rewrite queries with reverse predicates, like parent or ancestor, into equivalent forward queries, i.e., queries without reverse predicates.
Our rewriting approach is vital to the evaluation of queries with reverse predicates against unbounded XML streams, because neither storing past fragments of the stream nor traversing the stream several times, as the evaluation of reverse predicates would require, is affordable. Beyond their declared main purpose of providing equivalences between queries with reverse predicates and forward queries, the applications of our rewriting systems shed light on other properties of the query language, like the expressivity of some of its fragments, query minimization, or even the complexity of query evaluation. For example, using these systems, one can rewrite any graph query into an equivalent forward forest query.

The second main contribution is a streamed and progressive evaluation strategy for forward queries against XML streams. The evaluation is specified using compositions of so-called stream processing functions, and is implemented using networks of deterministic pushdown transducers. The complexity of this evaluation strategy is polynomial in both the query and the data sizes for forward forest queries and even for a large fragment of graph queries.

The third central contribution consists of two real monitoring applications that directly use the results of this work: the monitoring of processes running on UNIX computers, and a system that graphically presents real-time traffic and travel information broadcast within ubiquitous radio signals.

Summary (Zusammenfassung)

Nowadays XML is the de facto standard for data interchange on the Web. The available XML data ranges from small Web pages to ever-growing collections of, for example, biological or astronomical data, and even to possibly unbounded data streams with rapidly arriving data, as used in publish-subscribe systems.
Driven by the wide adoption of XML data, query processing over XML data is gaining increasing theoretical and practical importance. In recent years, initiatives from both industry and research have aimed at defining an appropriate XML query language. The core result of these initiatives is the identification of a navigational approach to locating information in XML data, embodied in the user-oriented query language XPath.

This work brings together the two worlds mentioned above, XPath query processing and XML streams, and shows both the practical and the theoretical relevance of this combination.

The first main contribution of this work is the definition and theoretical investigation of three term rewriting systems that rewrite queries with so-called "reverse" predicates, such as parent or ancestor, into equivalent queries that contain no such predicates. Our approach is essential for the evaluation of queries with "reverse" predicates against unbounded XML streams, since neither the storage of already processed stream fragments nor multiple passes over the XML stream are required. Beyond this main goal, the applications of our rewriting systems shed new light on other properties of the query language, such as the expressive power of some fragments, the minimization of queries, and even the complexity of query evaluation. For example, using these rewriting systems, one can rewrite arbitrary graph queries into equivalent forest queries without "reverse" predicates.

The second main contribution is a stream-based, progressive evaluation strategy for forest queries without "reverse" predicates against XML streams.
The evaluation is specified by the composition of so-called stream processing functions and implemented using networks of deterministic pushdown automata. The complexity of this evaluation strategy is polynomial in both the size of the query and the size of the data for forest queries without "reverse" predicates, and even for many graph queries.

The final main contribution consists of two practically usable monitoring systems built directly on the results of this work: the monitoring of processes running on a UNIX system, and a system that monitors traffic information from radio signals in real time and presents it graphically.

Acknowledgments

During the last three years, many people have contributed directly or indirectly to the development of this dissertation. I would like to express my gratitude to them.

First of all, I am deeply indebted to my advisor François Bry for his continuing trust and support during the evolution of this thesis. Further, I am grateful to Dan Suciu, whose work on XML query processing constantly influenced my research directions. This thesis and its author further benefitted from long and very useful discussions with two of my best supporters, Tim Furche and Holger Meuss. Without their active commitment, this dissertation would not have been possible.

I thank the students whose theses I co-supervised for their interest in my work and for bringing new relevant ideas to the surface: Fatih Coskun, Serap Durmaz, Tim Furche, Tobias Kiesling, Sebastian Schaffert, Dominik Schwald, and Markus Spannagel. I also thank the members of our teaching and research group for creating a stimulating environment at the office and a pleasant stay in Munich: among others, Slim Abdennadher, Sacha Berger, Tim Geisler, Martin Josko, Michael Kraus, Ellen Lilge, Bernhard Lorenz, Hans Jürgen Ohlbach, Paula Pătrânjan, Stephanie Spranger, and Felix Weigel.
I especially want to mention Norbert Eisinger for his always competent advice on various subjects ranging from easy ones, like confluence of rewriting systems, to complex ones, like teaching computer science topics.

Last, but definitely not least, I thank my wife, Flori, for her love and uninterrupted support, my parents and my brother for enduring the physical distance that separated us for such a long time, and all my friends for the weekends we spent together doing no research.

Contents

1 Introduction
  1.1 Data Streams: Use, Concepts, and Research Issues
  1.2 Thesis Contributions and Overview
2 Preliminaries
  2.1 XML Essentials
  2.2 Example Scenarios
3 LGQ (Logic Graph Query): An Abstraction of XPath
  3.1 Data Model
  3.2 Syntax
  3.3 Semantics
  3.4 Digraph Representations
  3.5 Path, Tree, DAG, Graph Formulas and Queries
  3.6 Forward Formulas and their Specializations
  3.7 Measures for Formulas
  3.8 LGQ versus XPath
    3.8.1 XPath
    3.8.2 Conciseness of LGQ over XPath
    3.8.3 XPath=LGQ Forests
4 Source-to-source Query Transformation: From LGQ to Forward LGQ
  4.1 Problem Description
  4.2 A Taste of Term Rewriting Systems
  4.3 Rewrite Rules preserving LGQ Equivalence
    4.3.1 Rules adding single-join DAG-Structure
    4.3.2 Rules preserving Tree-Structure
    4.3.3 Rules removing DAG-Structure
    4.3.4 Rules for LGQ Normalization
    4.3.5 Rules for LGQ Simplification
  4.4 Three Approaches to Rewrite LGQ to Forward LGQ Forests
    4.4.1 Rewriting Examples
    4.4.2 Soundness and Completeness
    4.4.3 Termination
    4.4.4 Confluence
  4.5 Complexity Analysis
  4.6 Related Work
5 Evaluation of Forward LGQ Forest Queries against XML Streams
  5.1 Problem Description
  5.2 Specification
    5.2.1 Stream Messages
    5.2.2 Stream Processing Functions
    5.2.3 From LGQ to Stream

