Evaluation of Top-k Queries over Structured and Semi-structured Data
Amélie Marian
Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Graduate School of Arts and Sciences
COLUMBIA UNIVERSITY
2005

© 2005 Amélie Marian
All Rights Reserved

ABSTRACT
Evaluation of Top-k Queries over Structured and Semi-structured Data
Amélie Marian
Traditionally, queries over structured (e.g., relational) and semi-structured (e.g., XML) data identify the exact matches for the queries. This exact-match query model is not appropriate for many database applications and scenarios where queries are inherently fuzzy —often expressing user preferences and not hard Boolean constraints— and are best answered with a ranked, or “top-k,” list of the best matching objects. The top-k query model is widely used in web search engines and information retrieval systems over (relatively unstructured) text data. This thesis addresses fundamental issues in defining and efficiently processing top-k queries for a variety of scenarios, presenting different query processing challenges. In all these scenarios, our query processing algorithms attempt to focus on the objects that are most likely to be among the top-k matches for a given query, and discard —as early as possible— objects that are guaranteed not to qualify for the top-k answer, thus minimizing query processing time.

One important top-k query scenario that we study is web applications where the data objects are only available through remote, autonomous web sources. During query processing, these sources have to be queried repeatedly for a potentially large set of candidate objects. Processing top-k queries efficiently in such a scenario is challenging, as web sources exhibit diverse probing costs and access interfaces, as well as constraints on the degree of concurrency that they support. By considering the peculiarities of the sources and potentially designing object-specific query execution plans, our adaptive algorithms efficiently prune non-top-k answers and produce significantly more efficient query executions than previously existing algorithms, which select “global” query execution plans and do not fully take advantage of source-access parallelism.
Another important scenario that we study is XML integration applications where XML data originates in heterogeneous sources, and therefore may not share the same schema. In this scenario, exact query matches are too rigid, so XML query answers are ranked based on their “similarity” to the queries, in terms of both content and structure. Processing top-k queries efficiently in such a scenario is challenging, as the number of candidate answers increases dramatically with the query size. (XML path queries are, in effect, joins.) By pruning irrelevant data fragments as early as possible, our algorithms minimize the number of candidate answers considered during query evaluation.

As another contribution of this thesis, we extend our query processing algorithms to handle natural variations of the basic top-k query model. Specifically, we develop algorithms for queries that, in addition to fuzzy conditions, include some hard Boolean constraints (e.g., to allow the users to specify a more complex set of preferences). We also study extensions of our algorithms to handle scenarios where individual objects can be combined through join operations. Finally, while our algorithms return the exact k best matches to a query, we may sometimes be interested in trading some quality in the top-k answers in exchange for faster query execution times. We develop extensions of our algorithms for this approximate top-k query model; our approximate algorithms exploit various tradeoffs between query execution time and answer quality.

In summary, this thesis studies the general problem of processing top-k queries over structured and semi-structured data. These queries are natural and abound in web applications. We present efficient top-k query processing algorithms that return, rather than a possibly large set of objects, only those objects that are closest to the query specification.
Our algorithms efficiently prune the set of objects considered during query processing, reducing the amount of information that must be examined to find valuable data.

Contents
1 Introduction 1
2 Processing Top-k Queries over Structured and Semi-structured Data 4
   2.1 Query Model ...... 6
   2.2 Top-k Query Processing ...... 10
       2.2.1 Discarding Useless Objects ...... 10
       2.2.2 The Upper Property ...... 12
3 Sequential Top-k Query Processing Strategies over Web-Accessible Structured Data 14
   3.1 Data Model ...... 16
   3.2 An Existing Top-k Strategy ...... 18
       3.2.1 The TA Algorithm ...... 18
       3.2.2 Optimizations over TA ...... 20
   3.3 The Sequential Upper Algorithm ...... 22
       3.3.1 Selecting the Best Source ...... 22
       3.3.2 Cost Analysis ...... 26
           3.3.2.1 Counting Sorted Accesses ...... 26
           3.3.2.2 Instance Optimality ...... 28
   3.4 Experimental Results ...... 28
       3.4.1 Implementation ...... 29
           3.4.1.1 Techniques ...... 29
           3.4.1.2 Supporting Data Structures ...... 31
           3.4.1.3 Local Sources ...... 31
           3.4.1.4 Real Web-Accessible Sources ...... 33
           3.4.1.5 Evaluation Metrics and Other Experimental Settings ...... 35
       3.4.2 Experiments over Local Data ...... 36
           3.4.2.1 Probing Time ...... 36
           3.4.2.2 Local Processing Time ...... 40
           3.4.2.3 Using Data Distribution Statistics ...... 42
       3.4.3 Comparison with MPro ...... 43
       3.4.4 Experiments over Real Web-Accessible Sources ...... 47
   3.5 Conclusions ...... 48
4 Parallel Top-k Query Processing Strategies over Web-Accessible Structured Data 50
   4.1 Parallel Data Model ...... 51
   4.2 A Simple Parallelization Scheme ...... 52
   4.3 The Parallel pUpper Algorithm ...... 53
       4.3.1 Relying on the Upper Property ...... 53
       4.3.2 Taking Source Congestion into Account ...... 54
       4.3.3 Avoiding Redundant Computation ...... 55
   4.4 Experimental Results ...... 58
       4.4.1 Implementation ...... 59
           4.4.1.1 Techniques ...... 59
           4.4.1.2 Parallelism ...... 60
       4.4.2 Experiments over Local Data ...... 61
           4.4.2.1 Probing Time and Parallel Efficiency ...... 61
           4.4.2.2 Using Data Distribution Statistics ...... 64
       4.4.3 Comparison with Simple Parallelization Schemes ...... 65
       4.4.4 Experiments over Real Web-Accessible Sources ...... 66
   4.5 Conclusions ...... 68
5 Top-k Query Processing Strategies over Semi-structured Data 69
   5.1 Background ...... 71
       5.1.1 XML and Semi-structured Data ...... 72
       5.1.2 XML Relaxation ...... 72
   5.2 XML Data Model ...... 75
   5.3 The Whirlpool System ...... 80
       5.3.1 Architecture ...... 81
       5.3.2 Prioritization Strategies ...... 87
       5.3.3 Routing Strategies ...... 88
       5.3.4 Parallelism ...... 89
   5.4 Experimental Results ...... 89
       5.4.1 Implementation ...... 90
           5.4.1.1 Techniques ...... 90
           5.4.1.2 Data and Queries ...... 91
           5.4.1.3 Evaluation Parameters ...... 92
           5.4.1.4 Evaluation Metrics ...... 93
       5.4.2 Experiments ...... 93
           5.4.2.1 Comparison of Adaptive Routing Strategies ...... 94
           5.4.2.2 Adaptive vs. Static Routing Strategies ...... 96
           5.4.2.3 Cost of Adaptivity ...... 97
           5.4.2.4 Effect of Parallelism ...... 98
           5.4.2.5 Varying Evaluation Parameters ...... 99
           5.4.2.6 Scalability ...... 102
   5.5 Conclusions ...... 103
6 Extensions to the Top-k Query Model 104
   6.1 Top-k Query Processing Strategies over Web Sources ...... 106
       6.1.1 Filtering Conditions ...... 106
           6.1.1.1 Sequential Algorithms ...... 107
           6.1.1.2 Parallel Algorithms ...... 109
           6.1.1.3 Experimental Results ...... 109
       6.1.2 Joins ...... 113
           6.1.2.1 Sequential Algorithms ...... 115
           6.1.2.2 Parallel Algorithms ...... 116
           6.1.2.3 Experimental Results ...... 119
   6.2 Approximate Evaluation of Top-k Queries ...... 124
       6.2.1 Approximation Model and Metrics ...... 125
       6.2.2 User-Defined Approximation ...... 125
       6.2.3 Online Approximation ...... 126
       6.2.4 Experimental Results ...... 127
           6.2.4.1 Implementation ...... 127
           6.2.4.2 User-Defined Approximation ...... 129
           6.2.4.3 Online Approximation ...... 131
           6.2.4.4 Visualization Interface ...... 135
   6.3 Conclusions ...... 137
7 Related Work 139
   7.1 Top-k Query Evaluation ...... 139
   7.2 Approximate Query Processing ...... 142
   7.3 Adaptive Query Plans ...... 143
   7.4 XML Query Processing ...... 143
   7.5 Information Retrieval ...... 144
   7.6 Integrating Databases and Information Retrieval ...... 145
8 Conclusions and Future Work 146
   8.1 Conclusions ...... 146
   8.2 Future Work ...... 149
       8.2.1 Multi-Goal Top-k Query Optimization ...... 149
       8.2.2 Multi-Query Optimization ...... 150
       8.2.3 Scoring Functions ...... 150
Bibliography 151
List of Figures
2.1 A heterogeneous XML data collection about books. ...... 5
2.2 Star schema representation of the restaurant recommendation example. ...... 9
2.3 Snapshot of a top-3 query execution. ...... 13
3.1 Algorithm TAz...... 19
3.2 Algorithm TAz-EP. ...... 21
3.3 Algorithm Upper. ...... 23
3.4 Performance of the different strategies for the default setting of the experiment parameters, and for alternate attribute-value distributions. ...... 37
3.5 Performance of the different strategies for the default setting of the experiment parameters, as a function of the number of objects requested k. ...... 38
3.6 Performance of the different strategies for the Uniform data set, as a function of the number of sources. ...... 38
3.7 Performance of the different strategies for the Uniform data set, as a function of the number of SR-Sources. ...... 39
3.8 Performance of the different strategies for the Uniform data set, as a function of the cardinality of Objects. ...... 40
3.9 The local processing time for Upper, MPro-EP, and TAz-EP, as a function of the number of objects...... 41
3.10 The total processing time for Upper, MPro-EP, and TAz-EP, as a function of the time unit f...... 41
3.11 The performance of Upper improves when the expected scores are known in advance. ...... 44
3.12 Performance of Upper-Sample, Upper, MPro-EP, and MPro, when sampling is available and for different data sets. ...... 45
3.13 Total processing time for Upper and MPro, as a function of the time unit f, for the real-life Cover data set. ...... 45
3.14 Performance of Upper-H, Upper-Sample, Upper, MPro-EP, and MPro for different expected score distributions. ...... 47
3.15 Experimental results for the real web-accessible data sets relevant to our New York City restaurant scenario. ...... 48
4.1 Function SelectBestSubset. ...... 53
4.2 An execution step of pUpper. ...... 56
4.3 Algorithm pUpper. ...... 57
4.4 Function GenerateQueues. ...... 58
4.5 Effect of the attribute score distribution on performance. ...... 61
4.6 Effect of the number of objects requested k on performance. ...... 62
4.7 Effect of the number of source objects |Objects| on performance. ...... 63
4.8 Effect of the number of parallel accesses per source pR(Di) on performance. ...... 63
4.9 Performance of pTA, pUpper, and PP-MPro-Constraints over different attribute value distributions (one SR-Source). ...... 64
4.10 Effect of the number of objects requested k (a) and the number of accesses per source pR(Di) (b) on the performance of pTA, pUpper, and Upper over real web sources. ...... 66
5.1 XML queries on the heterogeneous book collection. ...... 70
5.2 A heterogeneous XML book collection. ...... 73
5.3 Relaxed XML queries. ...... 74
5.4 The Whirlpool architecture for the top-k query of Figure 5.1(ii). ...... 81
5.5 Function generateServerPredicates. ...... 84
5.6 Algorithm Whirlpool. ...... 86
5.7 Performance of Whirlpool-S and Whirlpool-M, for various adaptive routing strategies. ...... 94
5.8 Performance of LockStep-NoPrun, LockStep, Whirlpool-S and Whirlpool-M, for static and adaptive routing strategies (linear scale). ...... 95
5.9 Number of server operations for LockStep, Whirlpool-S and Whirlpool-M, for static and adaptive routing strategies (linear scale). ...... 95
5.10 Ratio of the query execution time of the different techniques over LockStep-NoPrun’s best query execution time, for different join operation cost values. ...... 97
5.11 Ratio of Whirlpool-M’s query execution time over Whirlpool-S’s query execution time. ...... 98
5.12 Performance of Whirlpool-S and Whirlpool-M, as a function of k and the query size (logarithmic scale). ...... 100
5.13 Performance of Whirlpool-S and Whirlpool-M, as a function of the document and query sizes (logarithmic scale, k=15). ...... 101
6.1 Performance of the sequential strategies for the default setting of the experiment parameters, and for alternate attribute-value distributions. ...... 110
6.2 Performance of the sequential strategies for the default setting of the experiment parameters, as a function of the number of filtering attributes. ...... 110
6.3 Performance of the parallel strategies for the default setting of the experiment parameters, and for alternate attribute-value distributions. ...... 112
6.4 Performance of the parallel strategies for the default setting of the experiment parameters, as a function of the number of filtering attributes. ...... 112
6.5 Constellation schema representation of the restaurant recommendation example. ...... 113
6.6 Adaptation of the Upper algorithm for the join scenario. ...... 117
6.7 Adaptation of the TAz-EP algorithm for the join scenario...... 118
6.8 Adaptation of the SelectBestSubset function for the join scenario. ...... 119
6.9 Performance of the sequential strategies for the default setting of the experiment parameters, and for alternate attribute-value distributions. ...... 120
6.10 Performance of the sequential strategies for the default setting of the experiment parameters, as a function of the number of query objects (centralized schema). ...... 121
6.11 Performance of the sequential strategies for the default setting of the experiment parameters, as a function of the number of query objects (chained schema). ...... 121
6.12 Performance of the sequential strategies for the default setting of the experiment parameters, as a function of the join selectivity. ...... 122
6.13 Performance of the parallel strategies for the default setting of the experiment parameters, and for alternate attribute-value distributions. ...... 123
6.14 Performance of the parallel strategies for the default setting of the experiment parameters, as a function of the number of query objects (centralized schema). ...... 124
6.15 Performance of the sequential strategies for the θ-approximation. ...... 129
6.16 Performance of the parallel strategies for the θ-approximation. ...... 130
6.17 Answer precision for the θ-approximation. ...... 130
6.18 Answer precision of the sequential strategies for the online approximation as a function of time spent in probes. ...... 132
6.19 Answer precision of the parallel strategies for the online approximation as a function of time spent in probes. ...... 132
6.20 Distance to solution of the sequential strategies for the online approximation as a function of time spent in probes. ...... 133
6.21 Distance to solution of the parallel strategies for the online approximation as a function of time spent in probes. ...... 133
6.22 Number of candidates considered by the sequential strategies for the online approximation as a function of time spent in probes. ...... 134
6.23 Number of candidates considered by the parallel strategies for the online approximation as a function of time spent in probes. ...... 135
6.24 Visualization interface screenshot. ...... 136
List of Tables
3.1 “Dimensions” to characterize sequential query processing algorithms. ...... 30
3.2 Default parameter values for experiments over local data. ...... 32
3.3 Real web-accessible sources used in the experimental evaluation. ...... 33
5.1 A comparison of the extension of the tf.idf function to XML documents with the original tf.idf function. ...... 77
5.2 Evaluation parameters, with default values noted in boldface. ...... 92
5.3 Percentage of objects created by Whirlpool-M, as a function of the maximum possible number of objects, for different query and document sizes. ...... 102
Acknowledgments
First, I would like to thank my advisor Luis Gravano for his patience and guidance. He taught me a great deal about research and writing, and was always available for discussion. His thoughtful and painstaking comments on every aspect of my writing style and research methodology have tremendously helped me improve my work. He has been an amazing advisor and I am grateful I had the chance to work with him.

I learned how exciting and fun research could be from Serge Abiteboul at I.N.R.I.A. Serge encouraged me to pursue a Ph.D. in the United States, and I am forever grateful for that great advice. Divesh Srivastava has been a wonderful mentor at AT&T, and helped me tremendously through my job search earlier this year. During my internship at AT&T, I had the pleasure to work with several outstanding researchers: Sihem Amer-Yahia, Nick Koudas (Chapter 5 is joint work with Sihem, Nick and Divesh), David Toman, and Yannis Kotidis. I truly enjoyed our long brainstorming sessions.

In addition to my thesis work, I have had the pleasure to collaborate on research projects with wonderful people. My first experience with research was in the VERSO team at I.N.R.I.A., where I had the chance to interact with researchers from all around the world. I had great fun collaborating with Jérôme Siméon from I.B.M. Research (at the time at Lucent). Finally, I worked with Surajit Chaudhuri from Microsoft Research, who has given me great feedback on my work.

The members of the Columbia Database group gave me invaluable comments on my presentation skills, and maybe more importantly were always available to discuss research
and non-research issues. In particular, Ken Ross has always taken the time to give me advice on my work and my career. His suggestions have always been most helpful. Mihalis Yannakakis was kind enough to serve on my Ph.D. committee and to provide useful comments on my work. Panos Ipeirotis patiently answered (and still does) my never-ending questions on all possible aspects of academic life and administrative details. Over the years, Eugene Agichtein, Nico Bruno, John Cieslewicz, Wisam Dakka, Alpa Jain, Julia Stoyanovich and Jingren Zhou have been wonderful people with whom to share ideas and tips. (Chapters 3 and 4 are joint work with Nico and Luis.) Other students of the 7th floor in the CEPSR building have helped me keep my sanity: Pablo Duboue, Noémie Elhadad, Elena Filatova, Smaranda Muresan, Michel Galley, and Ani Nenkova. I will miss seeing them all every day.

My friends both in New York and in France have always been a great source of support, and always were patient with me when I went MIA around deadlines. I am thankful I can count on them. My brother, Alexandre, has always been there for me, and is a great source of comfort. My parents have been a great inspiration in my life. Finally, and most importantly, my husband Cyril has encouraged me all throughout my Ph.D., and was there to support me when I was discouraged or just plain tired. I could not have done it without him.
Chapter 1
Introduction
A large amount of structured and semi-structured information is available through the Internet, either through interfaces to web-accessible databases (e.g., MapQuest)1 or exchanged between applications (e.g., XML messages in web services). This wealth of information makes it difficult for users to identify relevant data for their (often relatively fuzzy) information needs. This thesis focuses on query processing techniques to efficiently identify the data that is most relevant to user queries, saving users from having to sort through a large amount of information to find valuable data.

Traditionally, query processing techniques over structured (e.g., relational) and semi-structured (e.g., XML) data identify the exact matches for the queries. This exact-match query model is not appropriate for many database applications and scenarios where queries are inherently fuzzy —often expressing user preferences and not hard Boolean constraints— and are best answered with a ranked list of the “best” objects for the queries. A query processing strategy for such a query then needs to identify k objects with the highest score for the query, according to some scoring function. This “top-k” query model is widely used in web search engines and information retrieval systems over (relatively unstructured) text data. This thesis addresses fundamental issues in defining and efficiently processing top-k queries for a variety of scenarios, presenting different query processing challenges.
1 http://www.mapquest.com
Specifically, the main contributions of this thesis are as follows. In Chapter 2, we present our top-k query model and general top-k query processing framework. In all the web scenarios we study, our query processing algorithms attempt to focus on the objects that are most likely to be among the top-k matches for a given query, and discard —as early as possible— objects that are guaranteed not to qualify for the top-k answer, thus minimizing query processing time.

In Chapters 3 and 4, we study a web application scenario where the data object attributes are available only via remote web sources. Processing a top-k query in such a scenario involves accessing a variety of autonomous, heterogeneous sources. During query processing, these sources have to be queried repeatedly for a potentially large set of candidate objects. For example, if we want to return the top-k restaurant recommendations for a specific user, we might consider the distance between the candidate restaurants and the user. We could retrieve the distance information by repeatedly querying, say, a web site such as MapQuest with the user address and the candidate restaurant addresses. Processing top-k queries efficiently in such a scenario is challenging, as web sources exhibit diverse probing costs and access interfaces, as well as constraints on the degree of concurrency that they support. In Chapter 3, we present Upper, a sequential top-k query processing algorithm for this web source scenario. By considering the peculiarities of the sources and potentially designing object-specific query execution plans, Upper efficiently prunes non-top-k answers and produces significantly more efficient query executions than previously existing algorithms, which select “global” query execution plans.
In Chapter 4, we present pUpper, a parallelization of Upper that takes full advantage of the intrinsic parallel nature of the web and accesses several web sources simultaneously, possibly sending several concurrent requests to each individual source as well. Like Upper, pUpper considers object-specific query execution plans, and can thus consider intra-query source congestion when scheduling source accesses.

In Chapter 5, we study an XML integration application scenario where XML data originates in heterogeneous sources, and therefore may not share the same schema. In this scenario, exact query matches are too rigid, so XML query answers are ranked based on their “similarity” to the queries, in terms of both content and structure. Processing top-k
queries efficiently in such a scenario is challenging, as the number of candidate answers increases dramatically with the query size. (XML path queries are, in effect, joins.) We present Whirlpool, a family of algorithms for processing top-k queries over XML data. By pruning irrelevant data fragments as early as possible, Whirlpool minimizes the number of candidate answers considered during query evaluation.

In Chapter 6, we extend our query processing algorithms to handle natural variations of the basic top-k query model. Specifically, we develop algorithms for queries that, in addition to fuzzy conditions, include some hard Boolean constraints (e.g., to allow the users to specify a more complex set of preferences). We also study extensions of our algorithms to handle scenarios where individual objects can be combined through join operations. Finally, while our algorithms return the exact k best matches to a query, we may sometimes be interested in trading some quality in the top-k answers in exchange for faster query execution times. We develop extensions of our algorithms for this approximate top-k query model; our approximate algorithms exploit various tradeoffs between query execution time and answer quality.

Finally, in Chapter 7 we discuss related work, while in Chapter 8 we present conclusions and directions for future research.
Chapter 2
Processing Top-k Queries over Structured and Semi-structured Data
Traditionally, query processing techniques over structured (e.g., relational) and semi-structured (e.g., XML) data identify the exact matches for the queries. This exact-match query model is not appropriate for many database applications and scenarios where queries are inherently fuzzy —often expressing user preferences and not hard Boolean constraints— and are best answered with a ranked list of the “best” objects for the queries. A top-k query in this context is then simply an assignment of target values to the attributes of the query. In turn, a top-k query processing strategy for such a query then needs to identify k objects with the highest score for the query, according to some scoring function. This top-k query model is widely used in web search engines and information retrieval (IR) systems over text data. This thesis addresses fundamental challenges in defining and efficiently processing top-k queries for a variety of structured and semi-structured data scenarios that are common in web applications. The following two examples illustrate the two important top-k query scenarios on which we focus in this thesis:
Example 1: Consider a relation with information about restaurants in the New York City
area. Each tuple (or object) in this relation has a number of attributes, including Address, Rating, and Price, which indicate, respectively, the restaurant’s location, the overall food rating for the restaurant, as determined by a restaurant review website and represented by a grade between 1 and 30, and the average price for a dinner. A user who lives at 2590 Broadway and is interested in spending around $25 for a top-quality restaurant might then ask a top-3 query with attributes “Address”=“2590 Broadway”, “Price”=$25, and “Rating”=30. Expecting exact matches for this query is not appropriate: no restaurant is usually awarded the top rating score of 30, and it is unlikely that a restaurant matching all query attributes will be at the exact address specified in the query. The result to this query should then be a list of the three restaurants that match the user’s specification the closest, for some definition of proximity.
[Figure 2.1 omitted: three XML data fragments about books, labeled (a), (b), and (c).]
Figure 2.1: A heterogeneous XML data collection about books.
Example 2: Consider the heterogeneous XML data collection in Figure 2.1, with information about books. This collection is derived from various sources that do not share the same schema. A query for the top-3 “book” elements with children nodes “title”=“Great Expectations”, “author”=“Dickens”, and “edition”=“paperback” (each child node represents an attribute of the book object) will not result in any exact match from the example XML collection. However, intuitively all three data fragments (a), (b), and (c) are reasonable answers to such a query, and should be returned as approximate query answers. The result to this query should then be a list of the three books that match the query structure the closest, for some definition of proximity to the query.
As the previous examples suggest, an answer to a top-k query is not an unordered set of objects that matches the query exactly, but rather an ordered set of objects, where the ordering is based on how closely each object matches the given query. Furthermore, the answer to a top-k query does not include all objects that match the query to some degree, but rather only the best k such objects. In this chapter, we define our top-k query model in detail in Section 2.1, and then discuss the general issue of efficiently processing top-k queries in Section 2.2.
2.1 Query Model
Consider a collection C of objects with attributes A1, . . . , An, plus perhaps some other attributes not mentioned in our queries. A top-k query over collection C simply specifies target values for each attribute Ai. Therefore, a top-k query is an assignment of values {A1 = q1, . . . , An = qn} to the attributes of interest.
Example 1 (cont.): Consider our restaurant example. Our top-3 query in this example assigns a target value to all three restaurant attributes: “2590 Broadway” for Address, $25 for Price, and 30 for Rating.
Example 2 (cont.): Consider our XML book collection example. Our top-3 query in this example assigns a target value that represents the structural relationship required by the query for the attribute. For all three attributes, “title”, “author”, and “edition”, this target value is the “child” relationship (i.e., the target value is a structural child relationship between the “book” node and each of the “title”, “author” and “edition” nodes).
In some scenarios, the target values of a query are always explicitly specified in the query. For instance, the XML query in Example 2 specifies a target relationship for each attribute. In some other scenarios, the target values of a query can be implicit, and some attributes might always have the same “default” target value in every query. For example, it is reasonable to assume that the Rating attribute in Example 1 might always have an associated query value of 30. (It is unclear why a user would insist on a lesser-quality restaurant, given the target price specification.)
In our query model, the answer to a top-k query q = {A1 = q1, . . . , An = qn} over a collection of objects C and for a scoring function is a list of the k objects in the collection with the highest score for the query. The score that each object t in C receives for q is generally a function of a score for each individual attribute Ai of t, ScoreAi(qi, ti), where qi is the target value of attribute Ai in the query and ti is the value of object t for Ai. Typically, the scoring function ScoreAi that is associated with each attribute Ai is application-dependent, as the following examples illustrate.
Example 1 (cont.): For a restaurant object r, we can define the scoring function for the Address attribute so that it is inversely proportional to the distance (say, in miles) between the query and object addresses. Similarly, the scoring function for the Price attribute might be a function of the difference between the target price and the object’s price, perhaps penalizing restaurants that exceed the target price more than restaurants that are below it. The scoring function for the Rating attribute might simply be based on the object’s value for this attribute.
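To make Example 1 concrete, one possible set of per-attribute scoring functions could be written as follows. The exact forms are application-dependent, and these particular formulas (the 1/(1 + d) shape and the penalty factor) are illustrative assumptions rather than the thesis's definitions; each function returns a score in [0, 1], with 1 denoting a perfect match:

```python
def score_address(distance_miles):
    """Score inversely proportional to the distance (in miles) between the
    query address and the restaurant address. (Assumed form: 1 / (1 + d).)"""
    return 1.0 / (1.0 + distance_miles)

def score_price(target, actual, over_penalty=2.0):
    """Score based on the difference between the target and actual price,
    penalizing restaurants that exceed the target more than those below it.
    The penalty factor of 2 for overshooting is an assumption."""
    diff = actual - target
    penalty = over_penalty * diff if diff > 0 else -diff
    return 1.0 / (1.0 + penalty / target)

def score_rating(rating, max_rating=30):
    """Score based directly on the restaurant's rating (1-to-30 scale)."""
    return rating / max_rating
```

Under these assumptions, a restaurant one mile away, priced at $30 against a $25 target, and rated 27 would receive attribute scores of 0.5, roughly 0.71, and 0.9, respectively.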
Example 2 (cont.): For a book object b, we can define individual attribute scoring func- tions so that they are determined by the structural relationship between the “book” node of object b and the query attributes. For instance, the scoring function for “title”=“Great Expectations” might be inversely proportional to the distance (in XML nodes) between the “title” element and object b’s “book” element in the XML data tree, with a perfect score of 1 if “title” is a child of “book”, and a score of 0 if no “title” elements are present in the data tree rooted at object b’s “book” node. Similar scoring functions can be used for scoring the “author” and “edition” attributes.
We make the simplifying assumption that the scoring function for each individual at- tribute returns scores between 0 and 1, with 1 denoting a perfect match. (Handling other score ranges is straightforward.) To combine these individual attribute scores into a final
score for each object, each attribute Ai has an associated weight wi indicating its relative importance in the query. Then, the final score for object t is defined as a weighted sum of
the individual scores:1

Score(q, t) = ScoreComb(s1, . . . , sn) = Σ_{i=1}^{n} wi · si
where si = ScoreAi(qi, ti). The result of a top-k query is a ranked list of the k objects with highest Score value, where we break ties arbitrarily. The algorithms presented in this thesis apply to a broad range of top-k query scenarios, as long as the underlying scoring functions are monotonic: Score(q, t) ≥ Score(q, t′) for every query q and pair of objects t, t′ such that ScoreAi(qi, ti) ≥ ScoreAi(qi, t′i), i = 1, . . . , n. It is easy to see that our weighted-sum scoring function fits this requirement. In principle, an answer to a top-k query can either consist of the k objects that best match the query along with their scores for the query, or just consist of the k objects without their associated scores. In the first part of this thesis, namely in Chapters 3, 4, and 5, we only consider techniques that return the top-k objects along with their scores, therefore returning an ordered list of the k objects with the highest scores for the query. This choice is consistent with most existing work on top-k query processing [BCG02, CK97, CK98, CH02, CG96, CGM04, FLN01, IAE03]. Returning an unordered set of the k best matches to the query as soon as they can be identified may help save query processing time, because the final score of an object that is guaranteed to be among the top-k objects might not need to be fully computed during query processing. This approach has been explored by Fagin et al. in the NRA algorithm [FLN01]. To further speed up query processing, we may allow for some approximation in the query answer. Some approximation techniques have been suggested in [CH02, FLN01]. An approximate answer to a top-k query consists of k objects that are good answers to the query but that may not be the best k objects, along with some guarantees on the loss of quality of the approximate top-k answer with respect to the exact top-k query answer. We propose some approximation adaptations of our techniques in Chapter 6.
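The weighted-sum scoring function above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are ours, and, following the worked examples later in this chapter, the weights are normalized by their sum so that combined scores stay in [0, 1].

```python
def combined_score(weights, scores):
    """Weighted sum of per-attribute scores (each in [0, 1]), normalized
    by the total weight so the combined score is also in [0, 1]."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

def brute_force_top_k(objects, weights, k):
    """Naive top-k evaluation: fully score every object and keep the k best.

    `objects` maps object ids to their per-attribute score lists;
    ties are broken arbitrarily (here, by sort stability)."""
    ranked = sorted(objects,
                    key=lambda t: combined_score(weights, objects[t]),
                    reverse=True)
    return ranked[:k]
```

Monotonicity holds by construction: raising any single attribute score can only raise the combined score, never lower it.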
In Chapters 3 and 4, we focus on a simple data model where the data can be represented as a single relational table. In this model, all attributes are associated with a single object.
1Our model and associated algorithms can be adapted to handle other scoring functions, which we believe are less meaningful than weighted sums for the applications that we consider.
[Figure: a star schema with the Restaurant object at the center, linked to the Rating, Address, and Price attributes.]
Figure 2.2: Star schema representation of the restaurant recommendation example.
This data schema can then be represented as a “star” schema [RG00], as shown in Figure 2.2. A property of such a simple model is that the number of candidate answers is equal to the number of objects in the data collection. In contrast, a more complex query model involving joins on multiple objects may sometimes result in a larger number of candidate answers. Previous work has studied such join scenarios [NCS+01, IAE03] when sorted indexes on individual attribute scores are available. We focus on data scenarios involving joins in Chapters 5 and 6. A naive brute-force top-k query processing strategy would consist of computing the score for the query for every object, to identify and return the k objects with the best scores. For instance, to answer the top-k query of Example 1, we would have to access every known restaurant and establish its scores for the three query attributes. Similarly, for the top-3 query of Example 2, we would have to consider every “book” node in the collection and check whether it has “title”, “author”, and “edition” descendants. For large collections of objects, it is easy to see that this brute-force evaluation could be prohibitively expensive. Fortunately, the top-k query model provides the opportunity for efficient query processing, as only the best k objects need to be returned. Objects that are not part of the top-k answer, therefore, might not need to be processed, as we will see. The challenge faced by top-k query processing techniques is then to identify the top-k objects efficiently, to limit the amount of processing done on non-top-k objects. In the next section, we discuss some key observations that can be used by top-k query processing techniques to quickly identify the best k objects, hence resulting in fast query executions.
2.2 Top-k Query Processing
As discussed above, a naive top-k query processing strategy would be to fully evaluate (i.e., compute all attribute scores of) every object to identify and return k objects with highest scores. Such a strategy is unnecessarily expensive for top-k queries, as it does not take advantage of the fact that only the k best objects are part of the query answer, and the remaining objects do not need to be processed. An efficient top-k query processing strategy must then focus on discarding useless objects as early as possible during query processing by exploiting known object score information, as we will show in Section 2.2.1. To achieve this, we can take advantage of a key property of object scores that we introduce in Section 2.2.2. This property serves as the basis of the top-k query processing algorithms that we present in this thesis.
2.2.1 Discarding Useless Objects
Objects that are not in the answer to a top-k query do not need to be evaluated to answer the query, as long as they can somehow be safely discarded during query execution. In contrast, top-k objects need to be fully processed, since their scores for the query are returned as part of the query answer. An object can be discarded safely when the algorithm can determine, with certainty, that the object cannot be part of the top-k answer. To make such a determination, our algorithms use the following object score information. At a given point in time during the evaluation of a top-k query q, we might have partial score information for an object, after having obtained some of the object’s attribute scores, but not others:
• U(t), the score upper bound for an object t, is the maximum score that t might reach for q, consistent with the information already available for t. U(t) is then the score
that t would get for q if t had the maximum possible score for every attribute Ai not
yet accessed for t. In addition, we define Uunseen as the score upper bound of any object not yet discovered.
• L(t), the score lower bound for an object t, is the minimum score that t might reach for q, consistent with the information already available for t. L(t) is then the score
that t would get for q if t had the minimum score of 0 for every attribute Ai not yet accessed for t.
• E(t), the expected score of an object t, is the score that t would get for q if t had the
“expected” score for every attribute Ai not yet accessed for t. In the absence of further information, the expected score for Ai is assumed to be 0.5. Several techniques can be used for estimating score distributions (e.g., sampling); we will address this issue in Sections 3.4.2.3 and 4.4.2.2.
Example 1 (cont.): Consider once again our restaurant example. Assume that the weights of the attributes in the scoring function are as follows: 2 for “Distance”, 1 for “Rating”, and 1 for “Price”. A restaurant object r for which we know that ScoreDistance(q, r) = 0.2 for a query q, but for which ScoreRating(q, r) and ScorePrice(q, r) are unknown, will have a score upper bound U(r) = (2 · 0.2 + 1 · 1 + 1 · 1)/4 = 0.6, a score lower bound L(r) = (2 · 0.2 + 1 · 0 + 1 · 0)/4 = 0.1, and an expected score E(r) = (2 · 0.2 + 1 · 0.5 + 1 · 0.5)/4 = 0.35 (assuming no information on score distributions is known).
Example 2 (cont.): Consider once again our XML collection example. Assume that the weights of the attributes in the scoring function are all equal to 1. A “book” object b for which we know that Scoretitle(q, b) = 0.6 for a query q, but for which Scoreauthor(q, b) and Scoreedition(q, b) are unknown, will have a score upper bound U(b) = (0.6 + 1 + 1)/3 = 0.866, a score lower bound L(b) = (0.6 + 0 + 0)/3 = 0.2, and an expected score E(b) = (0.6 + 0.5 + 0.5)/3 = 0.533 (assuming no information on score distributions is known).
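The U(t), L(t), and E(t) values in the two examples above can be reproduced by filling in the missing attribute scores with 1, 0, and 0.5, respectively. The helper below is a sketch (its name and signature are ours), with weights again normalized by their sum:

```python
def score_bounds(weights, known):
    """Score upper bound, lower bound, and expected score for a partially
    probed object. `known` maps attribute index -> already-retrieved score;
    every missing attribute is assumed to score 1 (for U), 0 (for L), or
    0.5 (for E, absent any score-distribution information)."""
    total = sum(weights)
    def fill(default):
        return sum(w * known.get(i, default)
                   for i, w in enumerate(weights)) / total
    return fill(1.0), fill(0.0), fill(0.5)  # U(t), L(t), E(t)
```

With weights (2, 1, 1) and only the first attribute known at 0.2, this yields (0.6, 0.1, 0.35), matching Example 1.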
Using this information on score bounds, we can define the following property:
Property 1: Consider a top-k query q and suppose that, at some point in time, we have retrieved and partially evaluated a set T of objects for the query. Assume further that the score upper bound U(t) for an object t is strictly lower than the score lower bound L(ti) for k different objects t1, . . . , tk ∈ T. Then t is guaranteed not to be one of the top-k objects for q.
Example 1 (cont.): Consider our restaurant example, in which we are interested in the
top-3 restaurants for query q. Consider the three restaurants r1, r2, and r3. Restaurant r1
has a (known) final score of 0.99 (i.e., U(r1) = L(r1) = E(r1) = 0.99), restaurant r2 has
a (known) final score of 0.8 (i.e., U(r2) = L(r2) = E(r2) = 0.8), and restaurant r3 has a score upper bound of 1, a score lower bound of 0.75, and an expected score of 0.875 (i.e.,
U(r3) = 1, L(r3) = 0.75, and E(r3) = 0.875). Then, a restaurant r with a score upper
bound U(r) = 0.6 is guaranteed not to be in the query result, as all three restaurants r1, r2,
and r3 are guaranteed to have higher scores than r.
Example 2 (cont.): Consider our XML collection example in which we are interested in the top-3 books for query q. Consider the three books b1, b2, and b3. Book b1 has a
(known) final score of 0.99 (i.e., U(b1) = L(b1) = E(b1) = 0.99), book b2 has a (known) final
score of 0.8 (i.e., U(b2) = L(b2) = E(b2) = 0.8), and book b3 has a score upper bound of 1,
a score lower bound of 0.66, and an expected score of 0.836 (i.e., U(b3) = 1, L(b3) = 0.66,
and E(b3) = 0.836). Then, a book b with a score upper bound U(b) = 0.866 cannot be safely
discarded, as its final score may be greater than the final score of book b3.
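Property 1 translates into a one-line test (a sketch; the function name is ours): an object can be dropped as soon as at least k other objects have score lower bounds strictly above its score upper bound.

```python
def can_discard(upper_bound, other_lower_bounds, k):
    """Property 1: object t is safely discarded when U(t) is strictly lower
    than the score lower bound L(ti) of at least k other objects."""
    return sum(1 for lb in other_lower_bounds if lb > upper_bound) >= k
```

With the numbers above, the restaurant with U(r) = 0.6 is discarded (all of 0.99, 0.8, and 0.75 exceed it), while the book with U(b) = 0.866 is not, since only b1’s lower bound of 0.99 exceeds 0.866.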
Our query processing algorithms then attempt to focus on the objects that are most likely to be among the top-k matches for a given query, and to discard —as early as possible— objects that are guaranteed not to qualify for the top-k answer, using the above property to minimize query processing time.
2.2.2 The Upper Property
As mentioned in the previous section, top-k query processing techniques can prune part of the query execution by discarding partially evaluated objects that are not going to be part of the top-k solution. An efficient top-k query processing algorithm should then carefully choose which object to process at any given time, to avoid doing unnecessary work. More specifically, as we will see, our top-k query processing strategies will exploit the following property to make their choices [BGM02, MBG04]:
Property 2: Consider a top-k query q. Suppose that at some point in time a top-k query processing strategy has collected some partial score information for some objects. Consider an object t whose score upper bound U(t) is strictly higher than that of every other object
[Figure: objects sorted by expected score (marked ×); each object’s possible score range is drawn as a segment, and U marks the object with the highest score upper bound relative to the current top-k.]
Figure 2.3: Snapshot of a top-3 query execution.
(i.e., U(t) > U(t′) ∀t′ ≠ t), and such that t has not been completely evaluated. Then, at least one attribute access will have to be done on t before the answer to q is reached:
• If t is one of the actual top-k objects, then we need to access all of its attributes to return its final score for q.
• If t is not one of the actual top-k objects, its score upper bound U(t) is higher than the score of any of the top-k objects. Hence t requires further evaluation so that U(t) decreases before a final answer can be established.
This property is illustrated in Figure 2.3 for a top-3 query. In this figure, the possible range of scores for each object is represented by a segment, and objects are sorted by their expected score. From Property 2, the object with the highest score upper bound, noted U in the figure, will have to be further evaluated before a solution is reached: either U is one of the top-3 objects for the query and its final score needs to be returned, or its score upper bound will have to be lowered through further evaluation so that we can safely discard the object. This property serves as the basis of the top-k query processing algorithms presented in this thesis. In the next chapters, we present top-k query processing strategies for different structured and semi-structured data scenarios. While our scenarios vary in their data models, all of our algorithms use the above properties to make dynamic choices during query execution to produce efficient query running times.
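Property 2 suggests the probe-scheduling rule used by our algorithms: among the objects that are not yet completely evaluated, pick the one with the highest score upper bound. A minimal sketch (names and the candidate representation are ours):

```python
def next_object_to_probe(candidates):
    """Property 2: the not-fully-probed object with the highest score upper
    bound must receive at least one more attribute access before the top-k
    answer can be finalized. `candidates` maps object id -> (U(t), done),
    where `done` is True once the object is completely evaluated."""
    pending = {t: u for t, (u, done) in candidates.items() if not done}
    return max(pending, key=pending.get) if pending else None
```

If every candidate is fully evaluated, no further probe is mandated by the property and the sketch returns None.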
Chapter 3
Sequential Top-k Query Processing Strategies over Web-Accessible Structured Data
In Chapter 2, we introduced our top-k query model, and presented object score properties that we can exploit to produce efficient top-k query executions. In this chapter, we focus on an important web application scenario and define efficient top-k query processing algorithms for this scenario. In our web application scenario, data objects are only available through remote, autonomous web sources, exhibiting a variety of access interfaces and constraints as illustrated in the example below.
Example 1 (cont.): Consider our restaurant example from Chapter 2. Each restaurant attribute in this example might be available only through remote calls to external web sources: the Rating attribute might be available through the Zagat-Review web site1, which, given an individual restaurant name, returns its food rating as a number between 1 and 30 (“random access”). This site might also return a list of all restaurants ordered by their food rating (“sorted access”). Similarly, the Price attribute might be available through the New York
1http://www.zagat.com
Times’s NYT-Review web site2. Finally, the scoring associated with the Address attribute might be handled by the MapQuest web site, which returns the distance (in miles) between the restaurant and the user addresses.
During query processing, the remote web sources have to be queried (or probed) repeatedly for a potentially large set of candidate objects. In our restaurant example, a possible query processing strategy is to start with the Zagat-Review source, which supports sorted access, to identify a set of candidate restaurants to explore further. This source returns a list of restaurants in decreasing order of food rating. To compute the final score for each restaurant and identify the top-10 matches for our query, we then obtain the proximity between each restaurant and the user-specified address by querying MapQuest, and check the average dinner price for each restaurant individually at the NYT-Review source. Hence, we interact with three autonomous sources and repeatedly query them for a potentially large set of candidate restaurants. Processing top-k queries efficiently in such a scenario is challenging, as web sources exhibit diverse probing costs and access interfaces. By considering the peculiarities of the sources and potentially designing object-specific query execution plans, we design adaptive algorithms, based on the properties from Section 2.2, that efficiently prune useless objects and produce significantly more efficient query executions than previously existing algorithms, which select “global” query execution plans. In this chapter, we make the following contributions:
• A data model that captures web source interfaces and probing costs.
• Some natural improvements to an existing top-k query processing strategy, TA [FLN01], to decrease its query processing time.
• An efficient sequential top-k query processing algorithm that interleaves sorted and random accesses during query processing and schedules random accesses at a fine-granularity per-object level.
2http://www.nytoday.com
• A thorough, extensive experimental evaluation of the new algorithms using real and local data sets, and for a wide range of query parameters.
The rest of this chapter is organized as follows. First, we present our data model in Section 3.1. Then, in Section 3.2, we discuss and improve on an existing top-k query processing strategy. In Section 3.3, we present an efficient sequential top-k query processing technique. In Section 3.4, we report on an extensive experimental evaluation of our strategy. Finally, we conclude this chapter in Section 3.5. This chapter is based on work that has been published in [BGM02, MBG04].
3.1 Data Model
In our web application scenario, data is accessed through probes to web sources, which exhibit a variety of interfaces and access costs. In this section we refine the data and query model of Chapter 2 and instantiate it to our web scenario. In this scenario, the object attributes are handled and provided by autonomous sources accessible over the web with a variety of interfaces. For instance, the Price attribute in Example 1 is provided by the NYT-Review web site and can be accessed only by querying this site’s web interface3. We distinguish between three types of sources based on their access interface:
Definition 1: [Source Types] Consider an attribute Ai and a top-k query q. Assume
further that Ai is handled by a source S. We say that S is an S-Source if we can obtain
from S a list of objects sorted in descending order of ScoreAi by (repeated) invocation of a
getNext(S, q) probe interface. Alternatively, assume that Ai is handled by a source R that only returns scoring information when prompted about individual objects. In this case, we say that R is an R-Source. R provides random access on Ai through a getScore(R, q, t) probe interface, where t is a set of attribute values that identify an object in question. (As a small variation, sometimes an R-Source will return the actual attribute Ai value for an
3Of course, in some cases we might be able to download all this remote information and cache it locally with the query processor. However, this will not be possible for legal or technical reasons for some other sources, or might lead to highly inaccurate or outdated information.
object, rather than its associated score.) Finally, we say that a source that provides both sorted and random access is an SR-Source.
Example 1 (cont.): In our restaurant example, attribute Rating is associated with the Zagat-Review web site. This site provides both a list of restaurants sorted by their rating (sorted access), and the rating of a specific restaurant given its name (random access). Hence, Zagat-Review is an SR-Source. In contrast, Address is handled by the MapQuest web site, which returns the distance between the restaurant address and the user-specified address. Hence, MapQuest is an R-Source.
To define top-k query processing strategies over the three source types above, we need to consider the cost that accessing such sources entails:
Definition 2: [Access Costs] Consider a source R that provides a random-access inter- face, and a top-k query. We refer to the average time that it takes R to return the score for a given object as tR(R). (tR stands for “random-access time.”) Similarly, consider a source S that provides a sorted-access interface. We refer to the average time that it takes S to return the top object for the query for the associated attribute as tS(S). (tS stands for “sorted-access time.”) We make the simplifying assumption that successive invocations of the getNext interface also take time tS(S) on average.
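The probe interfaces of Definitions 1 and 2 can be sketched as small Python classes. This is only an illustration: the class and attribute names are ours, the query argument is fixed and omitted, and each source's data is modeled as an in-memory dict from object id to score.

```python
class SRSource:
    """A source supporting both sorted access (getNext) and random access
    (getScore), with average access times tS and tR (Definition 2)."""
    def __init__(self, scores, tS=1.0, tR=1.0):
        self._by_score = sorted(scores.items(), key=lambda kv: -kv[1])
        self._scores = scores
        self._pos = 0
        self.tS, self.tR = tS, tR

    def get_next(self):
        """Sorted access: next (object, score) pair in descending score
        order, or None once the source is exhausted."""
        if self._pos == len(self._by_score):
            return None
        pair = self._by_score[self._pos]
        self._pos += 1
        return pair

    def get_score(self, obj, default=0.0):
        """Random access: score of a given object; a default score is
        returned for objects the source does not know about."""
        return self._scores.get(obj, default)

class RSource(SRSource):
    """A random-access-only source: sorted access is not available."""
    def get_next(self):
        raise NotImplementedError("R-Sources support random access only")
```

An S-Source would, symmetrically, expose only get_next; as discussed below, at least one source with sorted access is needed to discover new objects.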
We make a number of assumptions in our presentation. The top-k evaluation strategies that we consider do not allow for “wild guesses” [FLN01]: an object must be “discovered” under sorted access before it can be probed using random access. Therefore, we need to have at least one source with sorted access capabilities to discover new objects. We consider nsr SR-Sources D1, . . ., Dnsr (nsr ≥ 1) and nr R-Sources Dnsr+1, . . ., Dn (nr ≥ 0), where
n = nsr + nr is the total number of sources. A scenario with several S-Sources (with no random-access interface) is problematic: to return the top-k objects for a query together with their scores, as required by our query model (Chapter 2), we might have to access all objects in some of the S-Sources to retrieve the corresponding attribute scores for the top-k objects. This can be extremely expensive in practice. Fagin et al. [FLN01] presented the NRA algorithm to deal with multiple S-Sources; however, NRA only identifies the top-k objects and does not compute their final scores.
We refer to the set of all objects available through the sources as the Objects set. Addi- tionally, we assume that all sources D1, . . . , Dn “know about” all objects in Objects. In other words, given a query q and an object t ∈ Objects, we can obtain the score corresponding to q and t for attribute Ai, for all i = 1, . . . , n. Of course, this is a simplifying assumption that is likely not to hold in practice, where each source might be autonomous and not coordinated in any way with the other sources. For instance, in our running example the NYT-Review site might not have reviewed a specific restaurant, and hence it will not be able to return a score for the Price attribute for such a restaurant. In this case, we simply use a default value for t’s score for the missing attributes. In this chapter, we focus on sequential top-k query processing strategies. In a sequential setting, during query processing, we can have at most one outstanding (random- or sorted- access) probe at any given time. When a probe completes, a sequential strategy chooses either to perform sorted access on a source to potentially obtain unseen objects, or to pick an already seen object, together with a source for which the object has not been probed, and perform a random-access probe on the source to get the corresponding score for the object.
3.2 An Existing Top-k Strategy
We now review and extend an existing algorithm to process top-k queries over sources that provide sorted and random access interfaces. Specifically, in Section 3.2.1 we discuss Fagin et al.’s TA algorithm [FLN01], and then propose improvements over this algorithm in Section 3.2.2.
3.2.1 The TA Algorithm
Fagin et al. [FLN01] presented the TA algorithm for processing top-k queries over SR-Sources. We adapted this algorithm in [BGM02] and introduced the TA-Adapt algorithm, which handles one SR-Source and any number of R-Sources. Fagin et al. [FLN03] generalized TA-Adapt to handle any number of SR-Sources and R-Sources. Their resulting algorithm,
TAz, is summarized in Figure 3.1.
Algorithm TAz (Input: top-k query q)
(01) Initialize Uunseen = 1. (Uunseen is an upper bound on the score of any object not yet retrieved.) (02) Repeat
(03) For each SR-Source Di (1 ≤ i ≤ nsr ):
(04) Get the best unretrieved object t for attribute Ai from Di: t ← getNext(Di, q).
(05) Update Uunseen = ScoreComb(sℓ(1), . . . , sℓ(nsr), 1, . . . , 1), with nr trailing 1’s,
where sℓ(j) is the last score seen under sorted access in Dj . (Initially, sℓ(j) = 1.)
(06) For each source Dj (1 ≤ j ≤ n):
(07) If t’s score for attribute Aj is unknown:
(08) Retrieve t’s score for attribute Aj , sj , via a random probe to Dj :
sj ← getScore(Dj , q, t). (09) Calculate t’s final score for q. (10) If t’s score is one of the top-k scores seen so far, keep object t along with its score.
(11) Until we have seen at least k objects and Uunseen is no larger than the scores of the current k top objects. (12) Return the top-k objects along with their score.
Figure 3.1: Algorithm TAz.
At any point in time, TAz keeps track of Uunseen, the highest possible score an object
that has not yet been seen by the algorithm can have. TAz proceeds in the following way: for each SR-Source, the algorithm retrieves the next “best” object via sorted access (Step 4), probes all unknown attribute scores for this object via random access (Steps 6–8) and
computes the object’s final score (Step 9). At any given point in time, TAz keeps track of the k known objects with the highest scores. As soon as no unretrieved object can have a score higher than the scores of the current top-k objects, the solution is reached (Step 11) and
the top-k objects are returned (Step 12). The original version of TAz assumes bounded buffers [FLN03] to minimize space requirements and discards information on objects whose final scores are too low to be top-k. This may lead to redundant random accesses when such objects are retrieved again from a different SR-Source. To avoid redundant accesses, a simple solution —which we use in our implementation— is to keep all object information until the algorithm returns, which requires space that is linear in the number of objects retrieved.
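A compact, self-contained sketch of TAz over in-memory score tables follows. This is not the thesis implementation: we model each source as a dict from object to score, normalize weights by their sum as in the running examples, and, per the discussion above, keep all object information to avoid redundant random accesses.

```python
def ta_z(sr_sources, r_sources, weights, k):
    """Sketch of algorithm TAz. `sr_sources` are dicts (object -> score)
    probed under sorted and random access; `r_sources` are dicts probed
    under random access only. `weights` lists the SR-Source weights first,
    then the R-Source weights."""
    total = sum(weights)
    by_score = [sorted(s.items(), key=lambda kv: -kv[1]) for s in sr_sources]
    pos = [0] * len(sr_sources)
    last = [1.0] * len(sr_sources)   # last scores seen under sorted access
    seen = {}                        # object -> final score
    while True:
        for i, lst in enumerate(by_score):   # one sorted access per SR-Source
            if pos[i] == len(lst):
                continue
            t, s = lst[pos[i]]
            pos[i] += 1
            last[i] = s
            if t not in seen:                # probe all remaining attributes
                scores = [src.get(t, 0.0) for src in sr_sources] + \
                         [src.get(t, 0.0) for src in r_sources]
                seen[t] = sum(w * x for w, x in zip(weights, scores)) / total
        # upper bound on the score of any not-yet-retrieved object (Step 5)
        u_unseen = sum(w * s for w, s in
                       zip(weights, last + [1.0] * len(r_sources))) / total
        top = sorted(seen.items(), key=lambda kv: -kv[1])[:k]
        exhausted = all(pos[i] == len(by_score[i]) for i in range(len(pos)))
        if (len(top) == k and u_unseen <= top[-1][1]) or exhausted:
            return top
```

On a small example the stopping condition fires before the sorted list is exhausted: once Uunseen drops to or below the current k-th score, the remaining objects need not be probed.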
3.2.2 Optimizations over TA
Fagin et al. [FLN03] showed that TA and TAz are “instance optimal” with respect to the family of top-k query processing algorithms that do not make wild guesses (see Sec- tion 3.3.2.2). Specifically, the TA and TAz execution times are within a constant factor of the execution times of any such top-k algorithm. However, it is possible to improve on TA and TAz by saving object probes. In [BGM02], we presented two optimizations over TA that can be applied over TAz. The first optimization (TA-Opt in [BGM02]) saves random access probes when an object is guaranteed not to be part of the top-k answer (i.e., when its score upper bound is lower than the scores of the current top-k objects).
This optimization is done by adding a shortcut test condition after Step 6 of TAz. The second optimization (TA-EP in [BGM02]) exploits results on expensive-predicate query optimization [HS93, KMPS94]. Research in this area has studied how to process selection queries of the form p1 ∧ . . . ∧ pn, where each predicate pi can be expensive to calculate. The key idea is to order the evaluation of predicates to minimize the expected execution time. The evaluation order is determined by the Rank of each predicate pi, defined as
Algorithm TAz-EP (Input: top-k query q)
(01) Initialize Uunseen = 1. (Uunseen is an upper bound on the score of any object not yet retrieved.) (02) Repeat
(03) For each SR-Source Di (1 ≤ i ≤ nsr ):
(04) Get the best unretrieved object t for attribute Ai from Di: t ← getNext(Di, q).
(05) Update Uunseen = ScoreComb(sℓ(1), . . . , sℓ(nsr), 1, . . . , 1), with nr trailing 1’s,
where sℓ(j) is the last score seen under sorted access in Dj . (Initially, sℓ(j) = 1.)
(06) For each source Dj (1 ≤ j ≤ n) in decreasing order of Rank(Dj ) : (07) If U(t) is less than or equal to the score of k objects, skip to (11).
(08) If t’s score for attribute Aj is unknown:
(09) Retrieve t’s score for attribute Aj , sj , via a random probe to Dj :
sj ← getScore(Dj , q, t). (10) Calculate t’s final score for q. (11) If we probed t completely and t’s score is one of the top-k scores, keep object t along with its score.
(12) Until we have seen at least k objects and Uunseen is no larger than the scores of the current k top objects. (13) Return the top-k objects along with their score.
Figure 3.2: Algorithm TAz-EP.
Rank(pi) = (1 − selectivity(pi)) / cost-per-object(pi), where selectivity(pi) is the fraction of the objects that are estimated to satisfy pi, and cost-per-object(pi) is the average time to evaluate pi over an object. We can adapt this idea to our framework as follows.
Let w1, . . . , wn be the weights of the sources D1, . . . , Dn in the scoring function ScoreComb.
If e(Ai) is the expected score of a randomly picked object for source Ri, the expected
decrease of U(t) after probing source Ri for object t is δi = wi · (1 − e(Ai)). We sort
the sources in decreasing order of their Rank, where Rank for a source Ri is defined as Rank(Ri) = δi / tR(Ri). Thus, we favor fast sources that might have a large impact on the final score of an object; these sources are likely to substantially change the value of U(t) fast.
We combine these two optimizations to define the TAz-EP algorithm (Figure 3.2). The first optimization appears in Steps 7 and 11. The second optimization appears in Step 6.
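The Rank-based probe ordering used by the second optimization can be sketched as follows (a sketch; the function names and tuple layout are ours, with e(Ai) the expected attribute score of a randomly picked object and tR the source's random-access time):

```python
def source_rank(weight, expected_score, t_random):
    """Rank(Ri) = delta_i / tR(Ri), where delta_i = wi * (1 - e(Ai)) is the
    expected decrease of U(t) after probing source Ri for object t."""
    return weight * (1.0 - expected_score) / t_random

def probe_order(sources):
    """Return source names sorted by decreasing Rank; each source is a
    tuple (name, weight, expected_score, t_random)."""
    return [name for name, w, e, tr in
            sorted(sources, key=lambda s: -source_rank(*s[1:]))]
```

High-weight, low-expected-score, fast sources come first, since they are the most likely to lower U(t) quickly per unit of probing time.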
3.3 The Sequential Upper Algorithm
We now present a top-k query processing strategy that we call Upper, variants of which we introduced in [BGM02] and [MBG04]. Our original formulation of Upper was for a restricted scenario of only one SR-Source and any number of R-Sources. In [MBG04], we relaxed this restriction to allow for any number of SR-Sources and R-Sources. Unlike TAz, which completely probes each object immediately after the object is identified, Upper allows for more flexible probe schedules in which sorted and random accesses can be interleaved even when some objects have only been partially probed. When a probe completes, Upper decides whether to perform a sorted-access probe on a source to get new objects, or to perform the “most promising” random-access probe on the “most promising” object that has already been retrieved via sorted access. The Upper algorithm is detailed in Figure 3.3. Exploiting Property 2 from Section 2.2.2, Upper chooses to probe the object with the highest score upper bound, since this object will have to be probed at least once before a top-k solution can be reached. If the score upper bound of unretrieved objects is higher than the highest score upper bound of the retrieved objects, Upper chooses to retrieve a new object via sorted access. In this case, Upper has to choose which SR-Source to access. This can be decided in several ways. A simple approach that works well in practice is to use a round-robin algorithm (Step 6).
3.3.1 Selecting the Best Source
After Upper picks an object to probe, the choice of source to probe for the object (Step 14) is handled by the SelectBestSource function, and is influenced by a number of factors: the cost of the random access probes, the weights of the corresponding attributes in the scoring function (or the ranking function itself if we consider a scoring function different than weighted sum), and the expected attribute scores.
The SelectBestSource function chooses the best source with which to probe object tH next. (Object tH is picked in Step 3.) This choice should depend on whether tH is one of the top-k objects or not. To define this function, we would then need to know the k-th highest actual score scorek among all objects in Objects. Of course, Upper does not know the actual
Algorithm Upper (Input: top-k query q)
(01) Initialize Uunseen = 1, Candidates = ∅, and returned = 0. (02) While (returned < k)
(03) If Candidates ≠ ∅, pick tH ∈ Candidates such that U(tH ) = max_{t∈Candidates} U(t).
(04) Else tH is undefined.
(05) If tH is undefined or U(tH ) < Uunseen (unseen objects might have larger scores than all candidates):
(06) Use a round-robin policy to choose the next SR-Source Di (1 ≤ i ≤ nsr) to access via a sorted access.
(07) Get the best unretrieved object t from Di: t ← getNext(Di, q).
(08) Update Uunseen = ScoreComb(sℓ(1), . . . , sℓ(nsr), 1, . . . , 1), with nr trailing 1’s,
where sℓ(j) is the last score seen under sorted access in Dj . (Initially, sℓ(j) = 1.)
(09) If t ∉ Candidates: Insert t in Candidates.
(10) Else If tH is completely probed (tH is one of the top-k objects):
(11) Return tH with its score; remove tH from Candidates. (12) returned = returned + 1. (13) Else:
(14) Di ← SelectBestSource(tH ,Candidates).
(15) Retrieve tH ’s score for attribute Ai, si, via a random probe to Di:
si ← getScore(Di, q, tH ).
Figure 3.3: Algorithm Upper. CHAPTER 3. SEQUENTIAL TOP-K QUERY PROCESSING STRATEGIES 24
object scores a priori, so it relies on expected scores to make its choices and estimates the value scorek (i.e., the k-th top score) using score′k, the k-th largest expected object score. (We define score′k = 0 if we have retrieved fewer than k objects.) We considered several implementations of the SelectBestSource function [GMB02], such as a greedy approach, or considering the best subset of sources for object tH that is expected to decrease U(tH) below score′k (this implementation of SelectBestSource was presented in [BGM02]). Our experimental evaluation [GMB02] shows that using the "non-redundant sources" approach that we discuss below for SelectBestSource results in the best performance, so we only focus on this version of the function in the remainder of this chapter, for conciseness.
Our implementation of SelectBestSource picks the next source to probe for object tH by
first deciding whether tH is likely to be one of the top-k objects or not:
• Case 1: E(tH) < score′k. In this case, tH is not expected to be one of the top-k objects. To decide what source to probe next for tH, we favor sources that can have a high "impact" (i.e., that can sufficiently reduce the score of tH so that we can discard tH) while being efficient (i.e., with a relatively low value for tR). More specifically, ∆ = U(tH) − score′k is the amount by which we need to decrease U(tH) to "prove" that tH is not one of the top-k answers. In other words, it does not really matter how large the decrease of U(tH) is beyond ∆ when choosing the best probe for tH. Note that it is always the case that ∆ ≥ 0: from the choice of tH, it follows that U(tH) ≥ score′k. To see why, suppose that U(tH) < score′k. Then U(tH) < E(t) ≤ U(t) for k objects t, from the definition of score′k. But U(tH) is highest among the objects in Candidates, which would imply that the k objects t such that U(t) > U(tH) had already been removed from Candidates and output as top-k objects. And this is not possible since the final query result has not been reached (returned < k; see Step 2). Also, the expected decrease of U(tH) after probing source Ri is given by δi = wi · (1 − e(Ai)), where wi is the weight of attribute Ai in the query (Chapter 2) and e(Ai) is the expected score for attribute Ai. Then, the ratio:

    Rank(Ri) = Min{∆, δi} / tR(Ri)

is a good indicator of the "efficiency" of source Ri: a large value of this ratio indicates that we can reduce the value of U(tH) by a sufficiently large amount (i.e., Min{∆, δi}) relative to the time that the associated probe requires (i.e., tR(Ri)).4
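The ratio can be computed directly from the quantities defined above; the following one-liner is our illustration, with argument names of our choosing:

```python
def rank(delta, w_i, e_i, t_r_i):
    """Rank(Ri) = Min{delta, delta_i} / tR(Ri), where delta_i = w_i * (1 - e(A_i))
    is the expected decrease of U(tH) from a random probe to source Ri."""
    return min(delta, w_i * (1 - e_i)) / t_r_i
```

For instance, with ∆ = 0.4, expected attribute scores of 0.5, weights 0.1 and 0.9, and access times 1 and 10, the two ranks come out to 0.05 and 0.04.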
Interestingly, while choosing the source with the highest rank value is efficient, it sometimes results in provably sub-optimal choices, as illustrated in the following ex- ample.
Example 2: Consider an object t and two R-Sources R1 and R2, with access times tR(R1) = 1 and tR(R2) = 10, and query weights w1 = 0.1 and w2 = 0.9. Assume that score′k = 0.5 and U(t) = 0.9, so the amount by which we need to decrease U(t) to "prove" that t is not one of the top answers is expected to be ∆ = 0.9 − 0.5 = 0.4. If we assume that e(A1) = e(A2) = 0.5, we would choose source R1 (with rank Min{0.4, 0.05}/1 = 0.05) over source R2 (with rank Min{0.4, 0.45}/10 = 0.04). However, we know that we will need to eventually lower U(t) below score′k = 0.5, and that R1 can only decrease U(t) by 0.1 to 0.8, since w1 = 0.1. Therefore, in subsequent iterations, source R2 would need to be probed anyway. In contrast, if we start with source R2, we might decrease U(t) below score′k = 0.5 with only one probe, thus avoiding a probe to source R1 for t.
The previous example shows that, for a particular object t, a source Ri can be "redundant" independently of its rank Min{∆, δi}/tR(Ri). Therefore, such a source should not be probed for t before the "non-redundant" sources. The set of redundant sources for an object is not static, but rather depends on the execution state of the algorithm. (In the example above, if score′k = 0.89, there are no redundant sources for object t.) To identify the subset of non-redundant available sources for object tH, we let ∆ = U(tH) − score′k as above and let R = {R1, . . . , Rm} be the set of sources not yet probed for tH. If ∆ = 0, all sources are considered not to be redundant. Otherwise, if ∆ > 0 we say that source Ri is redundant for object tH at a given step of the probing process if ∀Y ⊆ R − {Ri}: if wi + Σj:Rj∈Y wj ≥ ∆ then Σj:Rj∈Y wj ≥ ∆ (i.e., for every possible choice of sources {Ri} ∪ Y that can decrease U(tH) to score′k or lower, Y by itself can also do it). By negating the predicate above, replacing the implication with the equivalent disjunction, and manipulating the resulting predicate, we obtain the following test to identify non-redundant sources: Ri is non-redundant if and only if ∃Y ⊆ R − {Ri}: ∆ − wi ≤ Σj:Rj∈Y wj < ∆. It is not difficult to prove that, for any possible assignment of values to wi and ∆ > 0, there is always at least one available non-redundant source. Therefore, after identifying the subset of non-redundant sources, our SelectBestSource function returns the non-redundant source for object tH with the maximum rank Min{∆, δi}/tR(Ri) if ∆ > 0. If ∆ = 0, all sources have the same Rank value, and we pick the source with the fastest random-access time for the query.

4SelectBestSource might need to be modified to handle scoring functions other than the weighted-sum function. In particular, for functions where the final object scores cannot be in general approximated or usefully bounded unless all input values are known (e.g., as is the case for the min function), a per-object scheduling strategy is not necessary. In such cases, the probe history of an object does not impact source choice and so, the SelectBestSource function should make decisions at a higher level of granularity (e.g., by ordering sources based on source access time).
• Case 2: E(tH) ≥ score′k. In this case, tH is expected to be one of the top-k objects, and so we will need to probe tH completely. Therefore, all sources for which tH has not been probed are non-redundant, and SelectBestSource returns the not-yet-probed source with the highest δi/tR(Ri) ratio.
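Both cases can be sketched together in Python. This is our illustrative reading of the function, not the thesis code; the brute-force subset enumeration in the non-redundancy test is exponential in the number of unprobed sources, which is tolerable here only because that number is small (sources are identified by integer ids, an assumption of the sketch):

```python
from itertools import combinations

def is_redundant(i, unprobed, w, delta):
    """Ri is redundant unless some Y subset of R - {Ri} satisfies
    delta - w[i] <= sum(w over Y) < delta."""
    others = [j for j in unprobed if j != i]
    for r in range(len(others) + 1):
        for Y in combinations(others, r):
            if delta - w[i] <= sum(w[j] for j in Y) < delta:
                return False
    return True

def select_best_source(unprobed, w, e, t_r, U_tH, E_tH, score_k):
    delta = U_tH - score_k                        # decrease needed to discard tH
    d = {i: w[i] * (1 - e[i]) for i in unprobed}  # expected decrease per source
    if E_tH >= score_k:                           # Case 2: probe tH completely anyway
        return max(unprobed, key=lambda i: d[i] / t_r[i])
    if delta <= 0:                                # all Ranks equal: pick fastest source
        return min(unprobed, key=lambda i: t_r[i])
    nonred = [i for i in unprobed if not is_redundant(i, unprobed, w, delta)]
    return max(nonred, key=lambda i: min(delta, d[i]) / t_r[i])  # Case 1
```

On Example 2's numbers (Case 1, with E(tH) assumed below score′k), R1 is flagged as redundant and R2 is returned despite its lower Rank.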
In summary, when a probe completes, Upper can either (a) perform a sorted-access probe on a source if the unseen objects have the highest score upper bound (Steps 5–9), or (b) select both an object and a source to probe next (Steps 13–15), guided in both cases by Property 2. In addition, Upper can return results as they are produced, rather than having to wait for all top-k results to be known before producing the final answer (Steps 10–12).
3.3.2 Cost Analysis
We now discuss the efficiency of the various algorithms. Specifically, Section 3.3.2.1 analyzes the number of sorted accesses that each algorithm requires, and Section 3.3.2.2 discusses the optimality of Upper.
3.3.2.1 Counting Sorted Accesses
Interestingly, Upper and TAz behave in an identical manner with respect to sorted accesses:
Lemma 1: Consider a top-k query q over multiple SR-Sources and R-Sources. Then, Upper
and all variations of TAz perform the same number of sorted accesses when processing q.
Proof 1: We note that the choice of sorted-access sources in both TAz and Upper follows
the same fixed round-robin strategy, which is independent of the input (see Step 3 for TAz
in Figure 3.1 and Step 6 for Upper in Figure 3.3). Therefore, after Upper or TAz perform
some equal number of sorted accesses, the value of Uunseen is the same for both algorithms.
Consider the execution of both TAz and Upper after both algorithms have retrieved the same
set Retrieved of objects, with |Retrieved| ≥ k. (Naturally, TAz and Upper need to retrieve at least k objects via sorted access to output the top-k solution.)
• If TAz decides to retrieve a new object after processing the objects in Retrieved, then
it holds that Uunseen > Score(q, m), where m is the object in Retrieved with the k-th largest score. Suppose that the execution of Upper finishes without retrieving any new object beyond those in Retrieved, and let m′ be the k-th object output as Upper's result for q. Since m′ was also retrieved by TAz, and because of the choice of m, it holds that Score(q, m) = Score(q, m′). Then Score(q, m′) < Uunseen and hence Upper could never have output this object as part of the query result (see Step 5 in Figure 3.3), contradicting the choice of m′. Therefore Upper also needs to retrieve a new object,
just as TAz does.
• If Upper decides to retrieve a new object after processing the objects in Retrieved, then it holds that Upper output fewer than k objects from Retrieved as part of the query
result, and that U(t) < Uunseen for each object t ∈ Retrieved not yet output (see Step 5 in Figure 3.3). Then, since Score(q, t) ≤ U(t) for each object t, it follows that
Score(q, m) < Uunseen, where m is the object in Retrieved with the k-th largest actual
score for q. Therefore, from Step 11 in Figure 3.1 it follows that TAz also needs to retrieve a new object, just as Upper does.
∎
Interestingly, since TAz performs all random accesses for the objects considered, Upper
never performs more random accesses than TAz does.
3.3.2.2 Instance Optimality
As presented in [FLN03], TAz is “instance optimal,” where the definition of “instance optimality” —slightly adapted from [FLN03] to match our terminology— is:
Definition 3 [Instance Optimality] Let A be a class of algorithms and D be a class of source instances. An algorithm B ∈ A is instance optimal over A and D if there are constants c and c′ such that for every A ∈ A and D ∈ D we have that cost(B, D) ≤ c · cost(A, D) + c′, where cost(a, D) is, in our context, the combined sorted- and random-access time required by algorithm a over the sources in D.
An interesting observation is that the number of random accesses in TAz is an upper
bound on the number of random accesses in TAz-EP: TAz-EP is an optimization over TAz
aimed at reducing the number of random accesses. The shortcuts used in TAz-EP are only
used to discard objects sooner than in TAz and do not affect the number of sorted accesses performed by the algorithm. Also, as explained in the previous section, Upper performs no
more sorted or random accesses than TAz does. Hence, the TAz “instance optimality” also
applies to the TAz-EP and Upper algorithms. Therefore, the experimental section of the
chapter (Section 3.4), in which we compare the TAz and Upper algorithms, will evaluate the algorithms with real-world and local data to measure their “absolute” efficiency (they are all “instance optimal”).
3.4 Experimental Results
We performed an extensive experimental evaluation of Upper. In this section, we first discuss our implementation choices and evaluation settings (Section 3.4.1), then report results over real and synthetic data sets (Section 3.4.2), compare Upper with MPro [CH02], a competing algorithm that does not make per-object scheduling choices (Section 3.4.3), and present results over real web sources (Section 3.4.4).
3.4.1 Implementation
In this section, we first present the query processing techniques used in our experimental evaluation (Section 3.4.1.1) and discuss data structures that we use to implement these query processing strategies (Section 3.4.1.2). We also define the local (Section 3.4.1.3) and real data sets (Section 3.4.1.4) that we use for the experimental evaluation of the various techniques, as well as the prototype that we implemented for our experiments over real web-accessible sources (Section 3.4.4). Finally, we discuss the metrics and other settings that we use in our experimental evaluation (Section 3.4.1.5).
3.4.1.1 Techniques
We compare the performance of Upper (Section 3.3) with that of TAz-EP (Section 3.2.1). In addition, we consider MPro, an algorithm presented by Chang and Hwang [CH02] to optimize the execution of expensive predicates for top-k queries, rather than for our web-source scenario. MPro is more general than our techniques in that it targets a wider range of scenarios: local expensive predicates, external expensive predicates, arbitrary monotonic scoring functions, and joins. Their "probes" are typically not as expensive as our web-source accesses, hence the need for faster probe scheduling. Unlike our Upper technique (Section 3.3), MPro defines a fixed schedule of accesses to R-Sources during an initial object-sampling step, and thus selects which object to probe next during query execution but avoids source selection on a per-object basis. Upper is a technique in which source probes are scheduled at a fine object-level granularity, and where probes on different objects can be interleaved (see Table 3.1). In contrast,
TAz-EP is a technique in which source probes are scheduled at a coarse query-level granularity, and where each object is fully processed before probes on a different object can proceed. MPro is an example of a technique with interleaving of probes on different objects and with query-level probe scheduling. (We evaluate MPro experimentally in Section 3.4.2.3, where we consider a scenario in which object sampling —as required by MPro— is possible. We also defer the discussion of the Upper-Sample technique outlined in Table 3.1 until that section.) MPro-EP is an instantiation of the MPro algorithm with a different source-order criterion. Specifically, MPro-EP departs from the original MPro algorithm in that it does
                                                       Per-Query Scheduling   Per-Object Scheduling
                                                       of Probes              of Probes
No Interleaving of Probes across Objects               TAz-EP                 TAz-SelectBestSource
Interleaving of Probes     No Sampling Available       MPro-EP                Upper
across Objects             Sampling Available          MPro                   Upper-Sample

Table 3.1: "Dimensions" to characterize sequential query processing algorithms.
not rely on object sampling and orders sources by their Rank values as defined in Section 3.2.2. Note that MPro-EP can also be regarded as a modification of Upper for which the SelectBestSource function always considers each source's object-independent Rank value as defined in Section 3.2.2 when deciding what source to pick for a given object. The "dimensions" outlined in Table 3.1 suggest an additional technique. This technique, denoted as TAz-SelectBestSource in Table 3.1, is similar to TAz-EP in that it does not interleave probes on multiple objects. However, the schedule of probes on each object is not fixed, but rather is influenced by the returned probe scores and determined dynamically using Upper's SelectBestSource function. For conciseness, we do not report experimental
figures for this technique, since it results in only minor time savings over the simpler TAz-EP
algorithm. Similarly, we do not consider variations of TAz-EP and TAz-SelectBestSource that exploit sampling-derived information.
By comparing MPro-EP and TAz-EP, our experiments help quantify the saving in probing time that is due to the interleaving of object probes. By comparing MPro-EP and Upper, our experiments help understand the impact of the relatively expensive per-object scheduling on query processing efficiency.
3.4.1.2 Supporting Data Structures
Our algorithms keep track of the objects retrieved and their partial score information in a hash table indexed by the object ids. For each object, we record the attribute scores returned by the different sources (a special value is used when information about a particular source is not yet available). For efficiency, we also incrementally maintain the score upper bounds of each object. Finally, depending on the algorithm, each object is augmented with a small number of pointers that help us efficiently maintain the rank of each object in different ordered lists (see [GMB02]). During the execution of the algorithms of Section 3.4.1.1, each object can be part of multiple sorted lists. As an example, Upper (Section 3.3) needs to keep track of the object with the largest score upper bound (Step 3 in the algorithm in Figure 3.3). The SelectBestSource function also needs to identify the object with the k-th highest expected score. We implement each sorted list using heap-based priority queues, which provide constant-time access to the first ranked element, and logarithmic-time insertions and deletions. We additionally modified these standard priority queues to extract in constant time the k-th ranked object in the list, still with logarithmic-time insertions and deletions.
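One way to obtain constant-time access to the k-th ranked element while keeping logarithmic insertions is a two-heap arrangement. The following sketch is our simplification of that idea (insertion only, whereas the thesis structure also supports deletions):

```python
import heapq

class KthTracker:
    """Tracks the k-th largest score seen so far: a min-heap `top` holds the
    current k largest scores (its root is the k-th largest), and a negated
    max-heap `rest` holds everything else."""
    def __init__(self, k):
        self.k, self.top, self.rest = k, [], []

    def insert(self, score):
        if len(self.top) < self.k:
            heapq.heappush(self.top, score)
        elif score > self.top[0]:
            # new score enters the top k; demote the previous k-th largest
            heapq.heappush(self.rest, -heapq.heapreplace(self.top, score))
        else:
            heapq.heappush(self.rest, -score)

    def kth(self):
        """k-th largest score, or 0.0 while fewer than k scores are present."""
        return self.top[0] if len(self.top) == self.k else 0.0
```

Such a structure could, for instance, maintain the k-th largest expected object score that SelectBestSource consults after every probe.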
3.4.1.3 Local Sources
We generate a number of local SR-Sources and R-Sources for our experiments. The attribute values for each object are generated using one of the following distributions:
Uniform: Attributes are independent of each other and attribute values are uniformly distributed (default setting).
Gaussian: Attributes are independent of each other and attribute values are generated from five overlapping multidimensional Gaussian bells [PFTV93].
Zipfian: Attributes are independent of each other and attribute values are generated from a Zipf function with 1,000 distinct values and Zipfian parameter z = 1. The 1,000 distinct attribute values are generated randomly in the [0,1] range, and the i-th most frequent attribute value appears f(i) = |Objects| / (i^z · Σ_{j=1}^{1,000} 1/j^z) times.

Correlated: We divide sources into two groups and generate attribute values so that values from sources within the same group are correlated. In each group, the attribute values for a "base" source are generated using a uniform distribution. The attribute values for the other sources in a group are picked for an object from a short interval around the object's attribute value in the "base" source. Our default Correlated data set consists of two groups of three sources each.

k    nsr   nr   |Objects|   tR        tS         Data Sets
50   3     3    10,000      [1, 10]   [0.1, 1]   Uniform

Table 3.2: Default parameter values for experiments over local data.
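The Zipfian generator can be sketched as follows (our illustration; which distinct value receives which frequency rank is arbitrary here, and rounding makes the total only approximately |Objects|):

```python
import random

def zipf_attribute_values(n_objects, n_values=1000, z=1.0, seed=42):
    """Generate attribute values in [0,1]: n_values distinct random values, where
    the i-th most frequent value appears about n_objects / (i^z * sum_j 1/j^z) times."""
    rng = random.Random(seed)
    values = [rng.random() for _ in range(n_values)]      # distinct values in [0,1]
    norm = sum(1.0 / (j ** z) for j in range(1, n_values + 1))
    counts = [round(n_objects / ((i ** z) * norm)) for i in range(1, n_values + 1)]
    data = [v for v, c in zip(values, counts) for _ in range(c)]
    rng.shuffle(data)
    return data
```

With the default parameters (10,000 objects, z = 1), the most frequent value appears over a thousand times while the rarest distinct values appear only once.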
Mixed: Attributes are independent of each other. Sources are divided into three groups, and the attribute values within each group are generated using the Uniform, Gaussian, and Zipfian distributions, respectively.
Cover: To validate our techniques on real data distributions, we performed experiments over the Cover data set, a six-dimensional projection of the CovType data set [HBM98], used for predicting forest cover types from cartographic variables. The data contains information about various wilderness areas. Specifically, we consider six attributes: elevation (in meters), aspect (in degrees azimuth), slope (in degrees), horizontal distance to hydrology (in meters), vertical distance to hydrology (in meters), and horizontal distance to roadways (in meters). We extracted a database of 10,000 objects from the CovType data set.

For simplicity, we will refer to these sources as the "local" sources, to indicate that these are locally available sources under our control, as opposed to the real web sources described next. For our experiments, we vary the number of SR-Sources nsr, the number of R-Sources nr, the number of objects available through sorted access |Objects|, the random-access time tR(Di) for each source Di (a random value between 1 and 10), and the sorted-access time tS(Di) for each source Di (a random value between 0.1 and 1). Table 3.2 lists the default value for each parameter. Unless we specify otherwise, we use this default setting.
Source                     Attribute(s)                      Input
Verizon Yellow Pages (S)   Distance                          type of cuisine, user address
Subway Navigator (R)       SubwayTime                        restaurant address, user address
MapQuest (R)               DrivingTime                       restaurant address, user address
AltaVista (R)              Popularity                        free text with restaurant name and address
Zagat Review (R)           ZFood, ZService, ZDecor, ZPrice   restaurant name
NYT Review (R)             TRating, TPrice                   restaurant name

Table 3.3: Real web-accessible sources used in the experimental evaluation.
3.4.1.4 Real Web-Accessible Sources
In addition to experiments over the “local” data sets above, we evaluated the algorithms over real, autonomous web sources. For this, we implemented a prototype of the algorithms to answer top-k queries about New York City restaurants. Our prototype is written in C++ and Python. Users input a starting address and their desired type of cuisine (if any), together with importance weights for the following R-Source attributes: SubwayTime (handled by the SubwayNavigator site5), DrivingTime (handled by the MapQuest site), Popularity (handled by the AltaVista search engine6; see below), ZFood, ZService, ZDecor, and ZPrice (handled by the Zagat-Review web site), and TRating and TPrice (provided by the New York Times’s NYT-Review web site). The Verizon Yellow Pages listing7, which for sorted access returns restaurants of the user-specified type sorted by shortest distance from a given address, is the only SR-Source. Table 3.3 summarizes these sources and their interfaces. The Popularity attribute requires further explanation. We approximate the “popularity” of a restaurant with the number of web pages that mention the restaurant, as reported by the AltaVista search engine. (The idea of using web search engines as a “popularity oracle”
5http://www.subwaynavigator.com
6http://www.altavista.com
7http://www.superpages.com
has been used before in the WSQ/DSQ system [GW00].) Consider, for example, restaurant “Tavern on the Green,” which is one of the most popular restaurants in the United States. As of the writing of this thesis, a query on AltaVista on “Tavern on the Green” AND “New York” returns 82,100 hits. In contrast, the corresponding query for a much less popular restaurant in New York City, “Caffe Taci” AND “New York,” returns only 470 hits. Of course, the reported number of hits might inaccurately capture the actual number of pages that talk about the restaurants in question, due to both false positives and false negatives. Also, in rare cases web presence might not reflect actual “popularity.” However, anecdotal observations indicate that search engines work well as coarse popularity oracles. Attributes Distance, SubwayTime, DrivingTime, ZFood, ZService, ZDecor, and TRating have “default” target values in the queries (e.g., a DrivingTime of 0 and a ZFood rating of 30). The target value for Popularity is arbitrarily set to 100 hits, while the ZPrice and TPrice target values are set to the least expensive value in the scale. In the default setting, the weights of all six sources are equal. Naturally, the real sources above do not fit our model of Section 3.1 perfectly. For example, some of these sources return scores for multiple attributes simultaneously (e.g., as is the case for the Zagat-Review site). Also, as we mentioned before, information on a restaurant might be missing in some sources (e.g., a restaurant might not have an entry at the Zagat-Review site). In such a case, our system will give a default (expected) score of 0.5 to the score of the corresponding attribute. In a real web environment, source access times are usually not fixed and depend on several parameters such as network traffic or server load. 
Using a fixed approximation of the source response time (such as an average of past response times) may result in degraded performance, since our algorithms use these times to choose what probe to do next. To develop accurate adaptive estimates for the tR times, we adapt techniques for estimating the round trip time of network packets. Specifically, TCP implementations use a "smoothed" round trip time estimate (SRTT) to predict future round trip times, computed as follows:
SRTT i+1 = (α × SRTT i) + ((1 − α) × si)

where SRTT i+1 is the new estimate of the round trip time, SRTT i is the current estimate
of the round trip time, si is the time taken by the last round trip sample, and α is a constant between 0 and 1 that controls the sensitivity of the SRTT to changes. For good performance, Mills [Mil83] recommends using two values for α: α = 15/16, when the last sample time is lower than the estimate time (SRTT i), and α = 3/4, when the last sample time is higher than the estimate. This makes the estimate more responsive to increases in the source response time than to decreases. Our prototype keeps track of the response time of probes to each R-Source Ri and adjusts the average access time for Ri, tR(Ri), using the SRTT estimates above. Since the sorted accesses to the SR-Sources Si are decided
independently of their sorted-access times, we do not adjust tS(Si).
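The asymmetric update can be written compactly; this sketch is ours, with Mills' two recommended α values as defaults:

```python
def update_srtt(srtt, sample, alpha_up=3/4, alpha_down=15/16):
    """Smoothed round-trip-time update: SRTT <- alpha*SRTT + (1-alpha)*sample,
    using the smaller (more responsive) alpha when the response time increases."""
    alpha = alpha_up if sample > srtt else alpha_down
    return alpha * srtt + (1 - alpha) * sample
```

For example, a probe taking 2.0 time units against an estimate of 1.0 raises the estimate to 1.25, while a probe taking 0 only lowers it to 0.9375, reflecting the deliberate asymmetry.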
3.4.1.5 Evaluation Metrics and Other Experimental Settings
To understand the relative performance of the various top-k processing techniques over local sources, we time the two main components of the algorithms:
• tprobes is the time spent accessing the remote sources, in "units" of time. (In Section 3.4.2.2, we report results for different values —in msecs.— of this time unit.)
• tlocal is the time spent locally scheduling remote source accesses, in seconds.
While source access and local scheduling happen in parallel, it is revealing to analyze the tprobes and tlocal times associated with the query processing techniques separately, since the techniques that we consider differ significantly in the amount of local processing time that they require. For the experiments over the real-web sources, we report the total query execution time:
• ttotal is the total time spent executing a top-k query, in seconds, including both remote source access and scheduling.
We also report the number of random probes issued by each technique:8
• |probes| is the total number of random probes issued during a top-k query execution.
8The number of sorted accesses is the same for all presented techniques (Section 3.3.2.1).
For the local sources, unless we note otherwise we generate 100 queries randomly, with attribute weights randomly picked in the [1,10] range. We report the average values of the
metrics for different settings of nsr, nr, |Objects|, and k for different attribute distributions. We conducted experiments on 1.4 GHz, 2 GB RAM machines running Red Hat Linux 7.1. For the real web sources, we defined queries that ask for top French, Italian, and Japanese restaurants in Manhattan, for users located at different addresses. Attribute weights are arbitrarily picked from the [1,10] range for each query. We report the average ttotal value for different queries. We conducted experiments on a 550 MHz, 758 MB RAM machine running Red Hat Linux 7.1.
3.4.2 Experiments over Local Data
We now report results for the sequential techniques over the local data sets presented in Section 3.4.1.3. We first report on the performance of the techniques in terms of probing time in Section 3.4.2.1. Then, in Section 3.4.2.2, we compare the local processing time needed by the different techniques. Finally, in Section 3.4.2.3, we study the effect of data score distribution knowledge on the techniques.
3.4.2.1 Probing Time
In this section, we report on the probing time of the sequential techniques for a range of query and local data set parameters.
Effect of the Attribute Value Distribution: Figure 3.4 reports results for the default setting (Table 3.2), for various attribute value distributions. In all cases, Upper substantially
outperforms TAz-EP. The performance of MPro-EP is just slightly worse than that of Upper,
which suggests that the gain in probing time of Upper over TAz-EP mostly results from interleaving probes on objects. Interestingly, while Upper has faster overall probing times than MPro-EP, MPro-EP results in slightly fewer random accesses (e.g., for the Uniform data set, Upper performed on average 11,342 random accesses and MPro-EP performed on average 11,045 random accesses). For the Cover data set, which consists of real-world data, the results are similar to those for the other data sets.
Effect of the Number of Objects Requested k: Figure 3.5 reports results for the
[Figure: bar chart of tprobes (0 to 100,000 units) for Upper, MPro-EP, and TAz-EP over the Uniform, Gaussian, Zipfian, Correlated, Mixed, and Cover data sets]
Figure 3.4: Performance of the different strategies for the default setting of the experiment parameters, and for alternate attribute-value distributions.
default setting (Table 3.2) as a function of k. As k increases, the time needed by each algorithm to return the top-k objects increases as well, since all techniques need to retrieve and process more objects. Once again, the Upper strategy consistently outperforms TAz-EP, with MPro-EP as a close second.
Effect of the Number of Sources n: Figure 3.6 reports results for the default setting, as a function of the total number of sources n (half the sources are SR-Sources, half are R-Sources). Not surprisingly, the tprobes time needed by all the algorithms increases with the number of available sources. When we consider a single SR-Source and a single
R-Source, tprobes is the same for all algorithms. However, when more sources are available, the differences between the techniques become more pronounced, with Upper and MPro-EP consistently resulting in the best performance.
Effect of the Number of SR-Sources nsr: Figure 3.7 reports results for the default setting, as a function of the number of SR-Sources nsr (out of a total of six sources). The performance of TAz-EP remains almost constant when we vary the number of SR-Sources. In contrast, the performance of Upper and MPro-EP improves when the number of SR-Sources
[Figure: tprobes for Upper, MPro-EP, and TAz-EP, for k from 0 to 120]
Figure 3.5: Performance of the different strategies for the default setting of the experiment parameters, as a function of the number of objects requested k.
[Figure: tprobes for Upper, MPro-EP, and TAz-EP, as a function of the number of sources n]
Figure 3.6: Performance of the different strategies for the Uniform data set, as a function of the number of sources.
is high, as more information on the top objects is obtained from sorted accesses, which are cheaper than random accesses. The information gained from these extra sorted accesses allows these algorithms to identify high-score objects (objects with high scores for all the SR-Source attributes) sooner and therefore to return the top-k objects faster. Upper is slightly better than MPro-EP, with savings in probing time that remain close to constant
for all values of nsr.
[Figure: tprobes for Upper, MPro-EP, and TAz-EP, as a function of nsr]
Figure 3.7: Performance of the different strategies for the Uniform data set, as a function of the number of SR-Sources.
Effect of the Cardinality of the Objects Set: Figure 3.8 studies the impact of the number of objects available. As the number of objects increases, the performance of each algorithm drops since more objects have to be evaluated before a solution is returned. The tprobes time needed by each algorithm is approximately linear in |Objects|. MPro-EP is faster and scales better than TAz-EP since MPro-EP only considers objects that need to be probed before the top-k answer is reached and therefore does not waste resources on useless probes. Upper's reduction in probing time over MPro-EP increases with the number of objects, suggesting that per-object source scheduling becomes more efficient as the number of objects increases.
[Figure: tprobes for Upper, MPro-EP, and TAz-EP, for |Objects| up to 100,000]
Figure 3.8: Performance of the different strategies for the Uniform data set, as a function of the cardinality of Objects.
3.4.2.2 Local Processing Time
In the previous section, we showed that Upper and MPro-EP result in substantially fewer random probes than TAz-EP. However, probe interleaving requires expensive computation, as object score information needs to be kept and sorted. In addition, Upper requires more expensive probe scheduling than TAz-EP and MPro-EP do, so we now turn to studying the effect of this local computation on overall performance. Interestingly, we show experimentally that Upper results in considerably faster executions than TAz-EP, considering both probing time and local execution time. Our experiments also show that Upper results in slightly faster overall query execution times than MPro-EP.
Figure 3.9 shows the tlocal time for Upper, MPro-EP, and TAz-EP for the default setting
of the experiments in Table 3.2, and for varying numbers of objects. Not surprisingly, TAz-EP is locally more efficient than Upper and MPro-EP. The additional local processing needed by Upper and MPro-EP is spent maintaining object queues. (Both techniques need access to the object with the largest score upper bound at different points in time.) In turn, Upper is more expensive than MPro-EP because of two factors: (1) Upper schedules probes at the
[Figure: tlocal (in seconds) for Upper, MPro-EP, and TAz-EP, for |Objects| up to 100,000]
Figure 3.9: The local processing time for Upper, MPro-EP, and TAz-EP, as a function of the number of objects.
[Figure: results for Upper, MPro-EP, and TAz-EP, normalized with respect to TAz-EP = 100]