Evaluation of XPath Queries on XML Streams with Networks of Early Nested Word Automata
Tom Sebastian

To cite this version: Tom Sebastian. Evaluation of XPath Queries on XML Streams with Networks of Early Nested Word Automata. Databases [cs.DB]. Université Lille 1, 2016. English. tel-01342511

HAL Id: tel-01342511
https://hal.inria.fr/tel-01342511
Submitted on 6 Jul 2016

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Université Lille 1 – Sciences et Technologies
Innovimax Sarl
Institut National de Recherche en Informatique et en Automatique
Centre de Recherche en Informatique, Signal et Automatique de Lille

Thesis presented in first version in view of obtaining the degree of Doctor of the Université de Lille 1, specializing in Computer Science, by Tom Sebastian

Evaluation of XPath Queries on XML Streams with Networks of Early Nested Word Automata

Thesis defended on 17/06/2016 before a jury composed of:
Carlo Zaniolo, University of California, Reviewer
Anca Muscholl, Université Bordeaux, Reviewer
Kim Nguyen, Université Paris-Sud, Examiner
Remi Gilleron, Université Lille 3, President of the Jury
Joachim Niehren, INRIA, Supervisor

Abstract

The eXtensible Markup Language (Xml) provides a format for representing data trees, which is standardized by the W3C and widely used today for exchanging information between all kinds of computer programs. The main application areas of Xml are document processing (DocBook), Web information representation (Html), and NoSql databases with nested relations.
The main tasks in Xml processing are querying and transforming data trees. For solving these tasks, the W3C has developed the languages Xslt, XQuery, and XProc. As a common core, they rely on XPath for querying data trees for nodes, strings, functions, or sequences thereof. The challenge that we tackle in this thesis is how to answer XPath queries on Xml streams with low latency, full coverage, high time efficiency, and low memory costs. This requires overcoming the following difficulties:

1. Earliest query answering is computationally intractable, so that one cannot hope to always reach optimal latency and memory consumption.
2. No previous streaming algorithm was able to deal with all navigational XPath queries while also supporting comparisons of data values, aggregates, and higher-order functions.
3. In contrast to in-memory XPath engines, where document projection with respect to a given XPath query is highly relevant for time efficiency, it was unclear how to project away irrelevant parts of Xml streams.

In this thesis we first propose to approximate earliest query answering for navigational XPath queries by compilation to early nested word automata. It turns out that this yields quasi-optimal results in most practical cases, leading to almost optimal latency and memory consumption. Second, we contribute a formal semantics of XPath 3.0. It is obtained by mapping XPath to the new query language λXP that we introduce. This language has the advantage of being based on first principles, so that it can be considered as a core language for XPath processing. We then show how to compile λXP queries to networks of early nested word automata, and develop streaming algorithms for the latter. Thereby we obtain a streaming algorithm that indeed covers all of XPath 3.0. Third, we develop an algorithm for projecting Xml streams with respect to the query defined by an early nested word automaton.
Thereby we are able to make our streaming algorithms highly time efficient. We have implemented all our algorithms with the objective of obtaining an industrially applicable streaming tool, and tested them on the usual benchmarks. It turns out that our algorithms outperform all previous approaches in time efficiency, coverage, and latency.

Acknowledgments

First of all, I want to express my gratitude to my doctoral advisor Joachim Niehren. You have introduced me to many different topics that we covered in my thesis and I appreciate all that I could learn. It has always been a pleasure to work with you! To me you have been more than just a great teacher and supervisor. Thank you for having been on my side with the various problems that we lived through during my thesis. Thank you for caring so much about me. It was a great time. I will miss it!

I want to thank Carlo Zaniolo and Anca Muscholl for their great examination and review of this thesis. I would like to thank Kim Nguyen and Remi Gilleron, who took part in the jury of my defense, and Iovka Benova and Sławek Staworko for proofreading parts of my thesis. Thanks also to Mohamed Zergaoui, who always had trust in me and who financed this thesis.

I was gifted with a very pleasant work environment with handsome colleagues. I have been very fortunate with Denis, who implemented a great deal of the software tool that we have developed during my PhD. He was always helpful and I could enjoy many interesting discussions with him on code and algorithms. In addition, I especially appreciated the presence of Iovka, Antoine, Pavel, and Momar.

I dedicate huge thanks to my family. My lovely parents especially have supported me in so many ways throughout my thesis. I cannot thank you enough! Also I could not have wished for a better brother. Thanks also, Ron, for equipping me with the newest technology. “Meine Mama” :)

I am so grateful for my best friends.
Eric: thank you so much for being such a good friend for so many years, thanks for putting so much effort into our friendship, for the sports, the competitiveness, and for your kindness. Frank: “hey hey hey” :) thank you also for being such a great friend for all this time, especially for being so charismatic; you have the talent to put a smile onto anybody’s face :) Olaaaa: I “love” you :) I am so happy to have met you. You are the best! Bisousss. Sławek: I enjoy every time we meet. Thanks for your marvelous hospitality and all the times that we spent together. Eglantine, my little flower, thank you for being so sweet and for having come into my life and heart. “Eggy Spaghetti P***t” ;)

Last but not least, I want to acknowledge the great Snooker times with Mathias, the evenings with Gwenael and Anne-Sophie, the cooking by Guillaume and the Risotto of Michele, the dish washing with Wassim, and finally all the great times that I shared with my flat mates Manon, Sandra, Vincent, and Alexandra.

Contents

1 Introduction
  1.1 Xml Processing based on XPath
    1.1.1 Schema Validation
    1.1.2 Query-based Transformation
    1.1.3 XPath
    1.1.4 Regular Extension of XPath
    1.1.5 In Memory Evaluation
    1.1.6 Streaming Evaluation
  1.2 Open Problems
    1.2.1 Low Coverage
    1.2.2 Low Latency
    1.2.3 Low Time Efficiency
    1.2.4 Lack of Formal Semantics
  1.3 Contributions
    1.3.1 Formal Semantics of XPath 3.0 in λXP
    1.3.2 Early Query Answering by Early Nwas
    1.3.3 Networks of Early Nwas
    1.3.4 Projection for Nwas
    1.3.5 Implementation and Experiments
  1.4 Organization
  1.5 Publications
2 Preliminaries
  2.1 Data Trees
    2.1.1 Definition
    2.1.2 Logical Structure
    2.1.3 Finite State Abstractions
    2.1.4 Json Trees
  2.2 Nested Words
    2.2.1 Definition
    2.2.2 Linearization of Data Trees
  2.3 Xml Data Trees
    2.3.1 Xml Data Model
    2.3.2 Xml Data Model Specifics
    2.3.3 Mapping to Nested Words and Back to Xml Documents
  2.4 Nested Word Automata
    2.4.1 Definition
    2.4.2 Runs of Nested Words
    2.4.3 Runs on Marked Trees
    2.4.4 Determinization of NWAs
  2.5 Recursive Functions
    2.5.1 Partial Orders and Fixed Points
    2.5.2 Lambda and Letrec Notation
Part I: A Formal Semantics of XPath 3.0
3 Navigational XPath 3.0
  3.1 Overview
  3.2 Grammar
    3.2.1 Ebnf
    3.2.2 Expressions
    3.2.3 Parse Tree
  3.3 Path Expressions
    3.3.1 Forward and Backward
    3.3.2 Label Tests
    3.3.3 Kind Tests
    3.3.4 Relative, Absolute, and Abbreviated Expressions
