Efficient Query Processing for Data Integration
Total Page:16
File Type:pdf, Size:1020Kb
Efficient Query Processing for Data Integration Zachary G. Ives A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy University of Washington 2002 Program Authorized to Offer Degree: Department of Computer Science and Engineering University of Washington Graduate School This is to certify that I have examined this copy of a doctoral dissertation by Zachary G. Ives and have found that it is complete and satisfactory in all respects, and that any and all revisions required by the final examining committee have been made. Chair of Supervisory Committee: Alon Halevy Reading Committee: Alon Halevy (chair) Daniel Weld Dan Suciu Date: In presenting this dissertation in partial fulfillment of the requirements for the Doc- toral degree at the University of Washington, I agree that the Library shall make its copies freely available for inspection. I further agree that extensive copying of this dissertation is allowable only for scholarly purposes, consistent with “fair use” as prescribed in the U.S. Copyright Law. Requests for copying or reproduction of this dissertation may be referred to Bell and Howell Information and Learning, 300 North Zeeb Road, Ann Arbor, MI 48106-1346, or to the author. Signature Date University of Washington Abstract Efficient Query Processing for Data Integration by Zachary G. Ives Chair of Supervisory Committee: Professor Alon Halevy Computer Science and Engineering A major problem today is that important data is scattered throughout dozens of sepa- rately evolved data sources, in a form that makes the “big picture” difficult to obtain. Data integration presents a unified virtual view of all data within a domain, allowing the user to pose queries across the complete integrated schema. This dissertation addresses the performance needs of real-world business and sci- entific applications. Standard database techniques for answering queries are inappro- priate for data integration, where data sources are autonomous, they generally lack mechanisms for sharing of statistical information about their content, and the envi- ronment is shared with other users and subject to unpredictable change. My thesis proposes the use of pipelined and adaptive techniques for processing data integration queries, and I present a unified architecture for adaptive query processing, including novel algorithms and an experimental evaluation. An operator called x-scan extracts the relevant content from an XML source as streams across the network, which enables more work to be done in parallel. Next, the query is answered using algorithms (such as an extended version of the pipelined hash join) whose work is adaptively scheduled, varying to accommodate the relative data arrival rates of the sources. Finally, the sys- tem can adapt the ordering of the various operations (the query plan), either at points where the data is being saved to disk or in mid-execution, using a novel technique called convergent query processing. I show that these techniques provide significant benefits in processing data integration queries. TABLE OF CONTENTS List of Figures iii List of Tables v Chapter 1: Introduction 1 1.1 The Motivations for Data Integration . 2 1.2 Query Processing for Data Integration . 5 1.3 Outline of Dissertation . 10 Chapter 2: Background: Data Integration and XML 12 2.1 Data Integration System Architecture . 12 2.2 The XML Format and Data Model . 16 2.3 Querying XML Data . 20 Chapter 3: Query Processing for Data Integration 25 3.1 Position in the Space of Adaptive Query Processing . 26 3.2 Adaptive Query Processing for Data Integration . 28 3.3 The Tukwila Data Integration System: An Adaptive Query Processor . 33 Chapter 4: An Architecture for Pipelining XML Streams 36 4.1 Previous Approaches to XML Processing . 40 4.2 The Tukwila XML Architecture . 42 4.3 Streaming XML Input Operators . 50 4.4 Tukwila XML Query Operators . 58 4.5 Supporting Graph-Structured Data in Tukwila . 61 4.6 Experimental Results . 66 4.7 Conclusions . 81 i Chapter 5: Execution Support for Adaptivity 83 5.1 An Adaptive Execution Architecture . 87 5.2 Adaptive Query Operators . 94 5.3 Experiments . 101 5.4 Conclusions . 107 Chapter 6: Adaptive Optimization of Queries 109 6.1 Convergent Query Processing . 113 6.2 Operators for Phased Execution . 119 6.3 Implementation within Tukwila . 122 6.4 Experiments . 127 6.5 Conclusion . 135 Chapter 7: Tukwila Applications and Extensions 137 7.1 Data Management for Ubiquitous Computing . 137 7.2 Peer Data Management . 139 7.3 Integration for Medicine: GeneSeek . 142 7.4 Summary . 143 Chapter 8: Related Work 145 8.1 Data Integration (Chapter 2) . 145 8.2 XML Processing (Chapter 4) . 148 8.3 Adaptive Query Processing (Chapters 5, 6) . 152 Chapter 9: Conclusions and Future Directions 158 9.1 Future Work in Adaptive Query Processing . 159 9.2 Envisioning a Universal Data Management Interface . 161 Bibliography 166 ii LIST OF FIGURES 1.1 Data warehousing and integration compared . 4 2.1 Data integration architecture diagram . 13 2.2 Sample XML document . 17 2.3 XML-QL graph representation of example data . 18 2.4 XQuery representation of example data . 19 2.5 Example XQuery over sample data . 21 2.6 Results of example query . 22 3.1 Architecture diagram of Tukwila . 33 4.1 Tukwila XML query processing architecture . 44 4.2 Example XQuery to demonstrate query plan . 46 4.3 Example Tukwila query plan . 47 4.4 Encoding of a tree within a tuple . 48 4.5 Basic operation of x-scan operator . 52 4.6 The web-join operator generalizes the dependent join . 57 4.7 X-scan components for processing graph-structured data . 63 4.8 Experimental evaluation of different XML processors . 69 4.9 Wide-area performance of query processors . 71 4.10 Comparison of data set sizes and running times . 73 4.11 Scale-up of x-scan for simple and complex paths . 74 4.12 Experimental comparison of XML vs. JDBC as a transport . 77 4.13 Comparison of nest and join operations . 78 4.14 Scale-up results for x-scan over graph and tree-based data . 80 4.15 Query processing times with bounded memory . 80 5.1 Example of query re-optimization . 86 5.2 Example of collector policy rules . 95 iii 5.3 Performance benefits of pipelined hash join . 103 5.4 Comparison of Symmetric Flush vs. Left Flush overflow resolution . 105 5.5 Interleaved planning and execution produces overall benefits . 106 6.1 Example of phased query execution for 3-way join . 115 6.2 Architecture of the Tukwila convergent query processor . 123 6.3 Experimental results over 100Mbps LAN . 129 6.4 Experimental results over slow network . 132 6.5 Performance of approach under limited memory . 133 7.1 Example of schema mediation in a PDMS . 140 7.2 Piazza system architecture . 141 iv LIST OF TABLES 4.1 Physical query operators and algorithms in Tukwila . 59 4.2 Experimental data sets . 67 4.3 Systems compared in Section 4.6.1. 68 4.4 List of pattern-matching queries . 68 4.5 List of queries used in XML processing experiments . 75 6.1 Data sources for experiments . 128 8.1 Comparison of adaptive query processing techniques . 157 v ACKNOWLEDGMENTS Portions of this dissertation have previously been published in SIGMOD [IFF+99] and in the IEEE Data Engineering Bulletin [IHW01, ILW+00]. However, these portions have been significantly revised and extended within this dissertation. Additionally, portions of Chapter 4 have been submitted concurrently to the VLDB Journal. It's amazing to look back and see how things have changed over the past few years, how others have influenced me. I've been blessed with a brilliant and incredibly generous group of collaborators and advisors, who've given me great ideas and shaped my research ideas and my thesis work. Special thanks to my advisors, Alon Halevy and Dan Weld, who let me be creative but kept me on track, and with whom I've had countless stimulating discussions that led to new ideas. Even more importantly, they taught me choose worthwhile problems and aim high. I have come to understand just how critical this is in the research world. Thanks also to Steve Gribble, Hank Levy, and Dan Suciu for showing me a broader perspective on my research topics — it really helps to see a problem from an outsider's perspective in many cases — and for many fruitful discussions and arguments. And I am greatly appreciative of the amount of work they put into preparing me for the interview circuit, despite the fact that I did not have a formal advisee relationship with them. I'm also grateful to Igor Tatarinov, Jayant Madhavan, Maya Rodrig, and Ste- fan Sariou for their contributions to the various projects in which I have partic- ipated these past few years. I greatly enjoyed their ideas and their enthusiasm, and I learned a great deal from working with them. Many of these people have also been invaluable sources of comments on my papers. A special thank-you is warranted for Rachel Pottinger, who has been a fellow “databaser” and constant source of encouragement since the beginning — and vi to Steve Wolfman as well, who, while not a database person, has been a good friend and source of feedback. They have been perhaps the best exemplars of why this department has a stellar reputation as a place to work. I'd also like to express my appreciation for those who contributed sugges- tions and feedback to my work, even if they weren't officially affiliated with it: Corin Anderson, Phil Bernstein, Luc Bouganim, Neal Cardwell, Andy Collins, AnHai Doan, Daniela Florescu, Dennis Lee, Hartmut Liefke, David Maier, Ioana Manolescu, Oren Zamir, and the anonymous reviewers of my conference and journal submissions.