Efficient Xml Stream Processing and Searching
Total Page:16
File Type:pdf, Size:1020Kb
THE FLORIDA STATE UNIVERSITY COLLEGE OF ARTS AND SCIENCES EFFICIENT XML STREAM PROCESSING AND SEARCHING By WEI ZHANG A Dissertation submitted to the Department of Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy Degree Awarded: Spring Semester, 2012 Wei Zhang defended this dissertation on March 22, 2012. The members of the supervisory committee were: Robert A. van Engelen Professor Directing Dissertation Erlebacher Gordon University Representative Xiuwen Liu Committee Member Xin Yuan Committee Member Zhenhai Duan Committee Member The Graduate School has verified and approved the above-named committee members, and certifies that the dissertation has been approved in accordance with the university requirements. ii To Xiaobo, my love, whose wholehearted support helped me complete this dissertation. iii ACKNOWLEDGMENTS This dissertation would not have been completed without the helps of several people. I would like to express my gratitude to my advisor, Dr. Robert A. van Engelen, for his support, patience, and encouragement throughout my graduate studies. He has been instrumental in the successful completion of this work. He provided an environment that nurtured my growth during the early stages of the research process, while giving sufficient freedom and patience during the later stages of my dissertation research. His technical and editorial advice was essential to the completion of this dissertation and has taught me innumerable lessons and insights on the workings of academic research in general. I would give my thanks to my committee members, Dr. Xin Yuan, Dr. Zhenhai Duan, Dr. Xiuwen Liu and Dr. Gordon Erlebacher for reading my prospectus and previous drafts of this dissertation and providing many valuable comments that improved the presentation and contents of this dissertation. Special thanks would also be given to Dr. David Whalley, my previous comittee member, for his valuable advices on my research and publications, comments and text edits on my prospectus and dissertation drfats. Last, I would like to thank my wife Xiaobo for her understanding and for her love during the past few years. Her support, encouragement, and dedication to our children was in the end what made this dissertation possible. My parents, Binfan and fanglan, receive my deepest gratitude and love for their dedication and the years of support during my graduate studies. iv TABLE OF CONTENTS List of Tables...................................... viii List of Figures...................................... ix List of Algorithms.................................... xii List of Listings..................................... xiii Abstract......................................... xiv 1 Introduction1 1.1 A Motivating Example............................5 1.2 Thesis Statement...............................9 1.3 Contributions................................. 10 1.4 Organizations................................. 11 2 Preliminaries 12 2.1 Extensible Markup language (XML)..................... 12 2.1.1 Syntax of XML............................ 12 2.1.2 Handling White Space in XML Documents............. 13 2.1.3 XML Namespaces.......................... 13 2.2 XML Schema................................. 15 2.3 XML Parsing and Validation......................... 16 2.4 XPath Querying Languages.......................... 17 2.5 Performance Challenges of XML Parsing, Validation and Searching.... 18 3 Related Work 22 3.1 Parsing Optimizations............................. 22 3.1.1 Schema-specific Parsing....................... 22 3.1.2 Parallel XML Parsing......................... 25 3.1.3 Differential Parsing.......................... 26 3.1.4 XML Hardware Acceleration.................... 27 3.1.5 Summary............................... 28 3.2 Validation Optimizations........................... 28 3.2.1 Automaton Based Approach..................... 29 3.2.2 Push-Down Automaton Based Approach.............. 30 3.2.3 Integrated Approach......................... 31 v 3.2.4 Incremental Validation........................ 31 3.2.5 Push-Down Automaton Based Approach.............. 32 3.2.6 Logic Based Approach........................ 32 3.2.7 Query Based Approach........................ 32 3.2.8 Schema Based Differential Re-validation.............. 33 3.2.9 Summary............................... 33 3.3 Deserialization Optimizations........................ 34 3.3.1 Static-dynamic Approach....................... 34 3.3.2 Differential Deserialization...................... 34 3.3.3 Summary............................... 37 3.4 XML Searching Optimizations........................ 38 3.4.1 Automata Based Approach...................... 38 3.4.2 Push-down Automaton Approach.................. 40 3.4.3 Stack Based Approach........................ 40 3.4.4 Directed Acyclic Graph Based Approach.............. 40 3.4.5 Parallel XML Query Optimizations................. 41 3.4.6 Schema Based Query Optimizations................. 43 3.4.7 Logic-based Query Optimizations.................. 45 3.4.8 Incremental Query Evaluation.................... 45 3.4.9 XML Query Techniques in Native XML Databases......... 46 3.4.10 Other Query Optimization Approaches............... 47 3.4.11 Summary............................... 48 3.5 Binary XML Encoding Attempts....................... 48 4 Mapping XML Schema Content Models to Augmented Context Free Gram- mars 52 4.1 Permutation Phrase Grammars........................ 52 4.2 Multi-Occurrence Phrases.......................... 54 4.3 XML Schema Content Models........................ 54 4.4 Mapping Schema Components to Augmented Grammar........... 56 4.4.1 Terms and Notations......................... 56 4.4.2 Mapping Rules............................ 57 5 Table-Driven XML Parsing and Validation 69 5.1 Overview of TDX............................... 69 5.2 Constructing Parsing Table.......................... 71 5.2.1 Preserving the LL(1) Property.................... 71 5.2.2 FIRST and FOLLOW sets construction for LL(1) grammar produc- tions.................................. 72 5.2.3 Constructing augmented LL(1) parsing table............ 73 5.3 Parsing Engine................................ 73 5.3.1 Parsing Permutation phrases..................... 74 5.3.2 Parsing Multi-occurrence Phrases.................. 75 vi 5.3.3 Table-Driven Parsing and Validation................. 77 5.4 Scanning and Tokenization.......................... 80 5.5 Pipelined Scanning, Tokenization, and Parsing and Validation........ 80 6 Table-Driven XML Stream Searching 83 6.1 Introduction.................................. 83 6.2 Overview of Table-Driven XPath Query Processor.............. 86 6.3 Table-Driven XML Stream Query Evaluation................ 87 6.3.1 Query Evaluation Engine....................... 92 6.3.2 Query Evaluation Algorithm..................... 93 6.4 Construction of a Query Table........................ 97 6.4.1 LL(1) Grammar Graph........................ 98 6.4.2 Query Table Construction...................... 98 7 Performance Evaluation 103 7.1 Experimental Setup.............................. 103 7.1.1 Parsers Compared.......................... 104 7.1.2 XPath Query Processors Compared................. 105 7.2 Performance Evaluation of TDX Parser................... 105 7.2.1 Overall Performance......................... 106 7.2.2 Compared to Schema-specific Validating Parsers.......... 109 7.2.3 Compared to Non-Schema-Specific Validating Parsers....... 113 7.2.4 Compared to Non-validating Parsers................. 115 7.2.5 Performance impact of Validating Unordered XML Elements and Attributes............................... 119 7.2.6 Scalability Impact of Number of Unordered Elements for TDX... 123 7.2.7 Overhead Incurred by Dynamic TDX................ 123 7.2.8 Performance Improvements by Pipelining.............. 126 7.3 Performance of TDX Query Processor.................... 128 7.3.1 Performance of Queries with Simple Step Queries......... 128 7.3.2 Performance of Queries with Wildcards............... 131 7.3.3 Performance of Queries with Backward Axes............ 132 7.3.4 Performance of Equivalent Queries................. 133 7.3.5 Performance of Multiple Queries................... 136 7.3.6 Scalability.............................. 136 7.4 Discussion................................... 138 8 Conclusions and Future Works 139 A List of XML Schemas Used in Performance Experiments 141 Bibliography...................................... 149 Biographical Sketch................................... 165 vii LIST OF TABLES 1.1 Parsing table for the grammar in Figure 1.3. The number inside the parenthesis represents the integral token for that tag.....................8 4.1 Mapping rules translating top-level schema components............ 58 4.2 Mapping rules translating element declarations................. 59 4.3 Mapping rules translating complex type definitions............... 61 4.4 Mapping rules translating complex type definitions (continued)........ 62 4.5 Mapping rules translating non-normative simple type components....... 63 4.6 Mapping rules translating occurrence constraints................ 64 4.7 Mapping rules translating model group definitions............... 64 4.8 Mapping rules translating model group components.............. 65 4.9 Mapping rules translating attribute uses..................... 66 4.10 Mapping rules translating attribute group definitions.............. 66 5.1 Schema fragment generating non-LL(1) grammar............... 72 6.1 LL(1) grammar that is generated from the XML schema A.1 (Upper case letters denote nonterminals;