List Intersection for Web Search: Algorithms, Cost Models, and Optimizations

List Intersection for Web Search: Algorithms, Cost Models, and Optimizations ∗ Sunghwan Kim Taesung Lee Seung-won Hwang POSTECH IBM Research AI Yonsei University [email protected] [email protected] [email protected] Sameh Elnikety Microsoft Research [email protected] ABSTRACT a. top-down analysis for i = 1 to |L1| This paper studies the optimization of list intersection, es- for j = 1 to |L2| if pecially in the context of the matching phase of search en- ... gines. Given a user query, we intersect the postings lists else Frontend Backend ... corresponding to the query keywords to generate the list of x = array[y] documents matching all keywords. Since the speed of list in- Alg #1 tersection depends the algorithm, hardware, and list lengths ... and their correlations, none the existing intersection algorithms outperforms the others in every scenario. Therefore, b. translation c. probabilistic analysis we develop a cost-based approach in which we identify a α1 ...β1 ... γ1 ... Oα1 ...Oβ1 ...Oγ1 ... search space, spanning existing algorithms and their combi- $est. = * nations. We propose a cost model to estimate the cost of the cost parameters event counts algorithms with their combinations, and use the cost model d. cost model to search for the lowest-cost algorithm. The resulting plan is usually a combination of 2-way algorithms, outperforming Case #1 Case #2 conventional 2-way and k-way algorithms. The proposed ap- Alg #1 0.2 ms 1.2 ms proach is more general than designing a specific algorithm, Alg #2 0.05 ms 0.1 ms as the cost models can be adapted to different hardware. ... We validate the cost model experimentally on two differ- Alg #5 0.03 ms 0.5 ms ent CPUs, and show that the cost model closely estimates Suggestion Alg #5 Alg #2 the actual cost. Using both real and synthetic datasets, we show that the proposed cost-based optimizer outperforms e. cost optimizer the state-of-the-art alternatives. Figure 1: Overview of cost-based optimizer. We build a PVLDB Reference Format: cost model with following two steps: (1) translate the cost Sunghwan Kim, Taesung Lee, Seung-won Hwang, Sameh Elnikety. factors of the top-down cycle accounting method into cost List Intersection for Web Search: Algorithms, Cost Models, and parameters, and (2) translate an intersection process into Optimizations. PVLDB, 12(1): 1-13, 2018. event counts through probabilistic analysis. The cost-based DOI: https://doi.org/10.14778/3275536.3275537 optimizer searches for the optimal plan. Keywords list intersection, cost-based query optimization, statistical 1. INTRODUCTION analysis, top-down analysis List intersection is a fundamental operation that is widely used in data processing and serving platforms, such as web ∗corresponding author search engines and databases. In web search engines, a multi- word query goes through a \matching phase", which is es- sentially an intersection of sorted lists corresponding to the This work is licensed under the Creative Commons Attribution- query keywords. Each list (also called inverted or postings NonCommercial-NoDerivatives 4.0 International License. To view a copy list) contains the IDs of web documents matching the key- of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For word [11]. For example, a query with (e.g., \VLDB 2019 any use beyond those covered by this license, obtain permission by emailing Research Papers") is dispatched to the servers managing [email protected]. Copyright is held by the owner/author(s). Publication rights the web corpus inverted index, and each server intersects the licensed to the VLDB Endowment. corresponding postings lists and returns its intersection out- Proceedings of the VLDB Endowment, Vol. 12, No. 1 ISSN 2150-8097. come as a result. Due to this significance, many algorithms DOI: https://doi.org/10.14778/3275536.3275537 have been proposed to execute this operation, but no list 1 intersection algorithm is cost-optimal in all scenarios. For Algorithm 1 2-Gallop example, merge-based algorithms are effective when the lists Input: L1;L2: Two lists to intersect are of similar lengths and are accessed sequentially. Mean- Output: R = L1 \ L2: intersection of two lists while, search-based algorithms perform better when one list 1: R = φ 2: c1 = i = 0; c2 = j = 0 is significantly shorter than others, as items in the shortest 3: for i = 0 to jL1j − 1 do list work as pivots to skip accessing on non-matching items 4: δ = GallopSearch(L2; j; L1[i]) in long lists. 5: j = j + δ 6: if L1[i] = L2[j] then A cost-based approach aims at having an accurate cost 7: R = R [ fL1[i]g;L1[i] = φ model, so to generate the optimal query plan for the given 8: return R scenario, as commonly used in the database systems [9, 20]. However, the existing list intersection cost models estimate Algorithm 2 2-Merge the cost using a number of comparisons [2, 4, 5], which does Input: L ;L : Two lists to intersect not reflect the execution time on modern processors that 1 2 Output: R = L1 \ L2: intersection of two lists support out-of-order execution, predictive speculation, and 1: c1 = i = 0; c2 = j = 0 hardware prefetching [29]. For this reason, we develop a cost 2: u = L1[i]; v = L2[j] 3: while i < jL1j and j < jL2j do model that uses several hardware parameters rather than 4: if u < v then // < inventing new specialized intersection algorithms. 5: i + +; u = L1[i] // SB< Considering the dependency on hardware parameters, a 6: else if u > v then // > 7: j + +; v = L2[j] // SB> desirable cost model should estimate the cost of a given al- 8: else // = gorithm with a set of raw parameters, rather than with a 9: // SB= single scalar parameter, such as the number of comparisons. 10: R = R [ fug We use profiling to measure the hardware parameters and 11: i + +; j + + 12: u = L1[i]; v = L2[j] develop cost models for several intersection algorithms. We 13: return R build on related efforts that use hardware parameters, such as cycle accounting (CA) [29] which targets analyzing the Algorithm 3 2-SIMD bottleneck for given CPU architecture and algorithm. We use the hardware parameters to estimate the cost of execut- Input: L1;L2: Two lists to intersect Output: R = L1 \ L2: intersection of two lists ing the intersection algorithm rather than identifying bottle- 1: i = 0(= c1); j = 0(= c2) neck. We compute the total execution cost using three sets 2: u = L1[i]; v = L2[j] of hardware parameters reflecting (1) instruction execution, 3: while i < jL1j and j < jL2j do 4: d1 = (L1[i];L1[i + 1]; ··· ;L1[i + ∆ − 1]) (2) branch misprediction, and (3) memory references. 5: d2 = (L2[j];L2[j + 1]; ··· ;L2[j + ∆ − 1]) We address several challenges to develop a principled cost- 6: R = R [ (d1 \ d2) based approach for list intersection. The first challenge is 7: i = i + ∆ · (L1[i + ∆ − 1] ≤ L2[j + ∆ − 1]) 8: j = j + ∆ · (L1[i + ∆ − 1] ≥ L2[j + ∆ − 1]) hardware cost parameterization, by overcoming taxon- 9: return R omy mismatch. The state-of-the-art top-down method (Fig- ure 1a) for CA divides architecture into two major parts, backend and frontend, and analyzes the cost with respect to • We build a cost-based optimizer that suggests the opti- three types of overhead|backend, frontend, and between. mal intersection plan for intersecting two or more lists. This does not map well with the causes we want to identify, • We analytically show that the proposed cost models are for which we propose to translate the cost factors of top- precise in a wide range of scenarios. down method into the unit cost parameters (Figure 1b). • We empirically demonstrate the proposed cost models The second is probabilistic counting. How many ele- and the cost-based optimizer perform well in a wide ments in a list are accessed (or skipped) is specific to the in- range of scenarios. tersection algorithm and the data distributions. We propose a probabilistic model (Figure 1c), translating cost factors into the an expected total cost (Figure 1d). Last challenge is search space reduction, to reduce 2. LIST INTERSECTION ALGORITHMS search overhead without compromising the optimality of the This section describes existing representative 2-way and k- algorithm found. In addition, we show that considering 2- way intersection algorithms, as preliminaries to model their way algorithms is often sufficient, which reduces the search behaviors in our cost models. overhead of the proposed approach. This finding promotes nonblocking pipelined implementation scenarios when some 2.1 2-way algorithms of the required information on input lists are not available 2-way algorithms accept two lists L1 and L2 as inputs, at query time. and return its intersection L1 \ L2. We demonstrate that the proposed model performs well in two CPUs on both synthetic and real-world datasets, and 2.1.1 2-Gallop show that our cost-based optimizer (Figure 1e) outperforms In our application scenario of search, input lists are or- the existing list intersection approaches. dered, which enables a search algorithm, skipping some ele- Our main contributions are the following: ments, instead of incrementing the cursor one step at a time. 2-Gallop (Algorithm 1) is a representative 2-way search- • We parameterize intersection scenarios to reflect the ar- based algorithm that uses each element of the shorter list chitecture and input characteristics, and build cost mod- L1 as a pivot and searches for the match in another list L2. els for major intersection algorithms. 2 In particular, for a search, [6] implements an exponential Algorithm 4 General k-way intersection j search of looking for the first exponent, j, where the 2 -th Input: L T value in L2 is greater than the search key.

Load more