Abstract High Performance Xpath Evaluation

ABSTRACT Title of dissertation: HIGH PERFORMANCE XPATH EVALUATION IN XML STREAMS Feng Peng, Doctor of Philosophy, 2006 Dissertation directed by: Professor Sudarshan S. Chawathe Department of Computer Science This thesis presents methods for efficiently evaluating structural queries over tree-structured data streams. A data stream usually consists of a sequence of items that arrive in an order determined by the source. An application that uses such data cannot revisit an earlier item in the stream unless it buffers the item itself. Naive buffering methods are not practical due to the high throughput and indefinite length of data streams. Compared with the flat, relational-like data model for data streams that has received recent attention, processing a tree-structured XML data stream poses additional challenges, since a data item cannot, in general, be interpreted without taking structural information into account. In this thesis, we focus on the evaluation of XPath queries on streaming XML. As a W3C standard, XPath has become a core XML technology not only as a standalone query language but also as the foundation of XQuery and XSLT. Features such as subqueries and reverse axes make XPath a powerful query language but they also complicate XPath query processing. We present our work on XSQ, a streaming XPath query engine. Our methods are based on a novel segment-based evaluation scheme. XSQ uses very little memory and is able to process unbounded and unsegmented streaming data because it does not build a DOM tree in memory. It also provides high throughput by only processing the relevant portions of the data and low response time by returning results as early as possible. XSQ is the first streaming system to support complex XPath features such as multiple predicates, closure axes, aggregations, reverse axes, and subqueries. We also describe our work on XPaSS, an XPath-based publish-subscribe system that simultaneously evaluates a large number of XPath queries over XML streams. Unlike other similar systems that filter pre-segmented documents as results, XPaSS returns only the precisely delineated data specified by a user query. It uses a segment-sharing scheme instead of prefix- and suffix-sharing that are commonly used. In our experiments, XPaSS supports up to one million XPath subscriptions using a modest PC-class server, with a throughput comparable to that of the simpler filtering systems. HIGH PERFORMANCE XPATH EVALUATION IN XML STREAMS by Feng Peng Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2006 Advisory Committee: Professor Sudarshan S. Chawathe, Chair/Advisor Professor Samrat Bhattacharjee Professor Samir Khuller Professor Nick Roussopoulos Professor Martin Dresner c Copyright by Feng Peng 2006 DEDICATION I would like to dedicate the dissertation to my family: my father, Zailiang Peng, my mother, Chunjin Liu, and my wife, Xin Lei. Without their love and support, there is no way I could have complished this. ii ACKNOWLEDGMENTS I would like to thank my advisor, Dr. Chawathe. It is really fortunate for me to have the opportunity to work with Dr. Chawathe. Looking back at the last four years, I realize how much I have learned from him. I would like to thank Dr. Samir and Dr. Bhattacharjee, for their continuous consideration and help during my study in Maryland. I would like to thank Dr. Rossopoulos, who taught my first database course here. It definitely triggered my interests in database research and led my way into the area. Last but not least, I would like to thank Dr. Dresner for taking the time to meet me and agree to be in my committee. iii TABLE OF CONTENTS List of Figures vii 1 Introduction 1 1.1 Streaming XML . 1 1.2 Streaming XPath Evaluation . 4 1.2.1 Join- and Navigation-based XPath Evaluation . 7 1.2.2 Streaming Evaluation Scenarios . 8 1.3 Challenges . 12 1.3.1 Querying Instead of Filtering . 12 1.3.2 Recursive Subqueries . 14 1.3.3 Reverse Axes . 15 1.3.4 Multiple Query Evaluation . 17 1.3.5 More Challenges . 18 1.4 Contributions . 18 2 Related Work 22 2.1 XPath filtering . 22 2.2 Streaming XML Queries . 25 2.3 XML Querying and Transformation . 28 2.4 Theoretical Researches on XPath . 30 2.5 Data Streams Management Systems . 32 2.6 Streaming Algorithms . 35 3 Preliminaries 38 3.1 Data Model for XML . 38 3.2 Data Model for XML Streams . 39 3.3 XPath . 41 3.3.1 XPath Semantics . 42 3.3.2 Predicates in XPath . 44 3.3.3 XPath Examples . 45 4 XPath Queries with Closures, Predicates, and Aggregations 47 4.1 Introduction . 47 4.2 Compiling XPath Queries . 53 4.2.1 HPDT . 53 4.2.2 Templates for BPDT . 59 4.2.3 Building HPDTs from XPath Queries . 62 4.2.4 Aggregations . 66 4.3 Runtime Engine . 67 4.3.1 Matching Records . 68 4.3.2 Buffer Operations . 72 4.3.3 Correctness . 75 4.3.4 Implementation . 78 iv 4.3.5 Complexity . 81 4.4 Experimental Evaluation . 83 4.4.1 Experimental Setup . 84 4.4.2 Throughput . 87 4.4.3 Latency . 90 4.4.4 Memory Usage . 95 4.4.5 Characterizing the XPath Processors . 98 4.4.6 Characterizing XSQ-F . 101 5 Segment-based Streaming XPath Evaluation 108 5.1 Segments . 108 5.2 XPath Query Tree . 111 5.3 Matchings . 114 6 XPath Query with Subqueries 117 6.1 Introduction . 117 6.2 Motivating Examples . 119 6.3 Evaluation Algorithm . 121 6.3.1 A Main-memory Method . 122 6.3.2 Streaming Evaluation Algorithm . 127 6.3.3 A Threaded Stack . 133 6.3.4 A Two-Flag marking Scheme . 135 6.4 A Comprehensive Example . 139 6.5 Performance Evaluation . 143 6.5.1 Experimental Setup . 143 6.5.2 Evaluating Subqueries . 148 6.5.3 Simple Queries on Main-memory Datasets . 153 6.5.4 Simple Queries on Large Datasets . 157 6.5.5 Processing Boolean Operators . 158 6.5.6 Output Latency . 162 7 XPath Queries with Reverse Axes 164 7.1 Introduction . 164 7.2 Single-step Normal Form . 167 7.3 Streaming Evaluation Algorithm . 170 7.3.1 Dependency Between Elements . 170 7.3.2 Hierarchical Index . 172 7.3.3 Create HIndex Dynamically . 176 7.4 Performance Evaluation . 178 7.4.1 Experiment Setup . 178 7.4.2 Simple Queries . 181 7.4.3 Complex Queries . 184 7.4.4 Number of Reverse Axes . 186 v 8 An XPath Subscription Server 190 8.1 Introduction . 190 8.2 Segment-based Evaluation . 196 8.2.1 Partial Information of Matching . 197 8.2.2 Data Structures . 198 8.2.3 Streaming Evaluation . 199 8.3 Segment-based Grouping . 206 8.3.1 Compile Time . 206 8.3.2 Runtime . 207 8.3.3 Complexities . 213 8.3.4 A Running Example . 214 8.3.5 Implementation . 215 8.4 Performance Evaluation . 218 8.4.1 Setup . 218 8.4.2 Scalability . 221 8.4.3 Varying Dataset . 224 8.4.4 Varying Query Features . 225 8.5 Conclusion . 231 9 Future Work 233 9.1 Schema-based Runtime Optimization . 233 9.2 Streaming XQuery Evaluation . 235 10 Conclusion 238 Bibliography 241 vi LIST OF FIGURES 1.1 Example XML data . 3 1.2 Sequence of SAX events . 3 1.3 The DOM tree for the data in Figure 1.1 . 5 1.4 An example of an RSS2.0 feed . 9 1.5 The DAG pattern specified by an XPath query . 13 1.6 Screenshot of XSQ displaying a HPDT . 19 3.1 Sample XML Stream . 39 3.2 The DOM tree for the data in Figure 3.1 . 40 4.1 Input Fragment 2 . 48 4.2 EBNF for an XPath Subset . 49 4.3 A Sample HPDT . 56 4.4 Template BPDT for: /n[@a = v] . ..

Load more